First, I created a CSV file from data0 as shown:
data0 = ["car", 0.82, 0.0026, 0.914, 0.59]
test_df = pd.DataFrame([data0])
test_df.to_csv("testfile1.csv")
The contents of "testfile1.csv" look like this:
0 1 2 3 4
0 car 0.82 0.0026 0.914 0.59
I want to append new data (data1 = ["bus", 0.9, 0.123, 12.907, 42] and data2 = ["van", 0.23, 0.41, .031, 0.894, 6.16, 4.104]) to the old CSV file so that each list appears as a new row, aligned exactly under the previous rows, as shown below:
0 1 2 3 4 5 6
0 car 0.82 0.0026 0.914 0.590 NaN NaN
1 bus 0.90 0.1230 12.907 42.000 NaN NaN
2 van 0.23 0.4100 0.031 0.894 6.16 4.104
I tried the program below, and other similar methods using .append() or .to_csv() with mode="a":
test_df = pd.read_csv("testfile1.csv", index_col=0)
test_df = test_df.append([data1], ignore_index=True)
test_df = test_df.append([data2], ignore_index=True)
test_df.to_csv("testfile1.csv")
However, every time the new data is appended as new columns instead of under the previous columns:
0 1 2 3 4 ... 0 1 2 3 4
0 NaN NaN NaN NaN NaN ... car 0.82 0.0026 0.914 0.59
1 bus 0.90 0.123 12.907 42.000 ... NaN NaN NaN NaN NaN
2 van 0.23 0.410 0.031 0.894 ... NaN NaN NaN NaN NaN
My project requires reading the existing CSV file, appending to it, and saving it back to the same file. I even tried casting the lists to pandas.Series.
You probably want to turn each list into a pandas Series first:
data1 = ['bus', 0.90, 0.1230, 12.907, 42.000]
row1 = pd.Series(data1)
data2 = ["van", 0.23, 0.41, .031, 0.894, 6.16, 4.104]
row2 = pd.Series(data2)
Then append them:
test_df = test_df.append(row1, ignore_index=True)
test_df = test_df.append(row2, ignore_index=True)
Output:
0 1 2 3 4 5 6
0 car 0.82 0.0026 0.914 0.590 NaN NaN
1 bus 0.90 0.1230 12.907 42.000 NaN NaN
2 van 0.23 0.4100 0.031 0.894 6.16 4.104
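Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on current versions the same result can be written with pd.concat. A minimal sketch, assuming the testfile1.csv written above and the same data1/data2 lists:

import pandas as pd

data1 = ["bus", 0.9, 0.123, 12.907, 42]
data2 = ["van", 0.23, 0.41, .031, 0.894, 6.16, 4.104]

test_df = pd.read_csv("testfile1.csv", index_col=0)
# read_csv loads the header "0".."4" as strings; switch back to integer labels
# so they match the default columns of the new single-row frames below
test_df.columns = range(test_df.shape[1])

# one single-row frame per list; concat unions the columns and fills gaps with NaN
new_rows = [pd.DataFrame([data1]), pd.DataFrame([data2])]
test_df = pd.concat([test_df, *new_rows], ignore_index=True)
test_df.to_csv("testfile1.csv")

The mismatch between the string column labels coming back from read_csv and the default integer labels of the appended data is also what produced the duplicated columns in the question.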
I have an input file of wavelengths and absorbances from a spectrometer. In this file, each new recording is simply added as the last two columns of the dataframe. The columns are needed to specify the wavelength at which a specific absorbance (= data) was measured.
Wavelength1  Data1   Wavelength2  Data2   Wavelength3  Data3   ...
        800  0.1             798  0.02          798.5  0.6     ...
        799  0.15            797  0.03          798.0  0.2     ...
        798  0.133           796  0.2           797.5  0.4     ...
        797  0.14            795  0.052         797.0  0.34    ...
        ...  ...             ...  ...            ...   ...     ...
I would like to have a dataframe that makes my analysis a bit easier. Something like this:
Wavelength1  Data1   Wavelength2  Data2   Wavelength3  Data3   ...
        800  0.1             NaN  NaN           798.5  0.6     ...
        799  0.15            NaN  NaN           798.0  0.2     ...
        NaN  NaN             NaN  NaN           798.5  0.6     ...
        798  0.133           798  0.02          798.0  0.2     ...
        NaN  NaN             NaN  NaN           797.5  0.4     ...
        797  0.14            797  0.03          797.0  0.34    ...
        ...  ...             ...  ...            ...   ...     ...
With my quite basic Python skill set, I know that I could probably store each wavelength-data pair as a list of tuples and make some complicated sorting magic happen. But ever since trying to learn more about the pandas module, I have been wondering whether I can tackle this problem with more ease. However, while I have found the pandas shift function, I have not found a way of making it conditional, nor of shifting and sorting each column individually.
This is a classic wide-to-long reshape.
Your sample data has two readings for wavelength 796 in the second set of data. This effectively means duplicates, which I dealt with by putting a subreading level in place.
Finally, transform back to wide, where the values line up on Wavelength, with provision for the duplicates.
Clearly, the steps to convert back to wide and to flatten the columns are optional, depending on how you want to run your analysis.
df = pd.DataFrame({'Wavelength1': [800, 799, 798, 797],
                   'Data1': [0.1, 0.15, 0.133, 0.14],
                   'Wavelength2': [798, 797, 796, 796],
                   'Data2': [0.02, 0.03, 0.2, 0.052],
                   'Wavelength3': [798.5, 798.0, 797.5, 797.0],
                   'Data3': [0.6, 0.2, 0.4, 0.34]})
# wide to long
df2 = (
    pd.wide_to_long(df.reset_index(), ["Wavelength", "Data"], i="index", j="reading")
    .droplevel(0)
    .reset_index()
    .set_index(["Wavelength", "reading"])
)
long data
Data
Wavelength reading
800.0 1 0.100
799.0 1 0.150
798.0 1 0.133
797.0 1 0.140
798.0 2 0.020
797.0 2 0.030
796.0 2 0.200
2 0.052
798.5 3 0.600
798.0 3 0.200
797.5 3 0.400
797.0 3 0.340
back to wide with wavelength lined up
# long back to wide, dealing with the duplicate "Wavelength" values
df2 = df2.set_index(
    pd.Series(df2.groupby(level=[0, 1]).cumcount().values, name="subreading"),
    append=True,
).unstack("reading")
# flatten the columns..
df2.columns = ["".join(map(str, c)) for c in df2.columns]
final output
Data1 Data2 Data3
Wavelength subreading
796.0 0 NaN 0.200 NaN
1 NaN 0.052 NaN
797.0 0 0.140 0.030 0.34
797.5 0 NaN NaN 0.40
798.0 0 0.133 0.020 0.20
798.5 0 NaN NaN 0.60
799.0 0 0.150 NaN NaN
800.0 0 0.100 NaN NaN
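For comparison, here is an alternative sketch of the same alignment that skips wide_to_long: split off each (Wavelength, Data) pair, index it by Wavelength plus a cumcount subreading so duplicate wavelengths keep separate rows, and let pd.concat do the outer alignment. It reuses the df built above; the names parts, part and wide are my own.

import pandas as pd

parts = []
for i in (1, 2, 3):
    # one (Wavelength, DataN) pair per reading, with a common "Wavelength" label
    part = (df[[f"Wavelength{i}", f"Data{i}"]]
            .rename(columns={f"Wavelength{i}": "Wavelength"}))
    # number repeated wavelengths within one reading so duplicates stay distinct
    part["subreading"] = part.groupby("Wavelength").cumcount()
    parts.append(part.set_index(["Wavelength", "subreading"]))

# outer-align the three readings on (Wavelength, subreading)
wide = pd.concat(parts, axis=1).sort_index()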
I have a data frame, for example:
df = pd.DataFrame([(np.nan, .32), (.01, np.nan), (np.nan, np.nan), (.21, .18)],
                  columns=['A', 'B'])
A B
0 NaN 0.32
1 0.01 NaN
2 NaN NaN
3 0.21 0.18
And I want to subtract column B from A
df['diff'] = df['A'] - df['B']
A B diff
0 NaN 0.32 NaN
1 0.01 NaN NaN
2 NaN NaN NaN
3 0.21 0.18 0.03
The difference returns NaN if one of the columns is NaN. To overcome this, I use fillna:
df['diff'] = df['A'].fillna(0) - df['B'].fillna(0)
A B diff
0 NaN 0.32 -0.32
1 0.01 NaN 0.01
2 NaN NaN 0.00
3 0.21 0.18 0.03
This prevents NaN from appearing in the diff column, but for index 2 the result comes out as 0, while I want the difference to be NaN since both columns A and B are NaN.
Is there a way to explicitly tell pandas to output NaN if both columns are NaN?
Use Series.sub with the fill_value=0 parameter:
df['diff'] = df['A'].sub(df['B'], fill_value=0)
print (df)
A B diff
0 NaN 0.32 -0.32
1 0.01 NaN 0.01
2 NaN NaN NaN
3 0.21 0.18 0.03
If you need to replace the remaining NaNs with 0, add Series.fillna:
df['diff'] = df['A'].sub(df['B'], fill_value=0).fillna(0)
print (df)
A B diff
0 NaN 0.32 -0.32
1 0.01 NaN 0.01
2 NaN NaN 0.00
3 0.21 0.18 0.03
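If you prefer to keep the fillna(0) subtraction from the question, an alternative sketch is to mask out the rows where both inputs are missing explicitly (same example df as above; the name both_nan is mine):

import numpy as np
import pandas as pd

df = pd.DataFrame([(np.nan, .32), (.01, np.nan), (np.nan, np.nan), (.21, .18)],
                  columns=['A', 'B'])

# subtract with 0 standing in for single-sided NaNs,
# then blank out the rows where both inputs were NaN
both_nan = df['A'].isna() & df['B'].isna()
df['diff'] = (df['A'].fillna(0) - df['B'].fillna(0)).mask(both_nan)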
I have a 5k x 2 column dataframe called "both".
I want to create a new 5k x 1 DataFrame or column (doesn't matter) by replacing any NaN value in one column with the value of the adjacent column.
ex:
Gains Loss
0 NaN NaN
1 NaN -0.17
2 NaN -0.13
3 NaN -0.75
4 NaN -0.17
5 NaN -0.99
6 1.06 NaN
7 NaN -1.29
8 NaN -0.42
9 0.14 NaN
So, for example, I need to replace the NaNs in the first column in rows 1 through 5 with the values from the same rows in the second column, to get a new df of the following form:
Change
0 NaN
1 -0.17
2 -0.13
3 -0.75
4 -0.17
5 -0.99
6 1.06
How do I tell Python to do this?
You may fill the NaN values with zeroes and then simply add your columns:
both["Change"] = both["Gains"].fillna(0) + both["Loss"].fillna(0)
Then — if you need it — you may return the resulting zeroes back to NaNs:
both["Change"].replace(0, np.nan, inplace=True)
The result:
Gains Loss Change
0 NaN NaN NaN
1 NaN -0.17 -0.17
2 NaN -0.13 -0.13
3 NaN -0.75 -0.75
4 NaN -0.17 -0.17
5 NaN -0.99 -0.99
6 1.06 NaN 1.06
7 NaN -1.29 -1.29
8 NaN -0.42 -0.42
9 0.14 NaN 0.14
Finally, if you want to get rid of your original columns, you may drop them:
both.drop(columns=["Gains", "Loss"], inplace=True)
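A shorter route, if it fits your case, is Series.combine_first, which takes Gains where it is present and Loss otherwise, so rows where both are missing stay NaN without the 0 round-trip. A small sketch on a copy of part of the example data:

import numpy as np
import pandas as pd

both = pd.DataFrame({"Gains": [np.nan, np.nan, 1.06, np.nan, 0.14],
                     "Loss":  [np.nan, -0.17, np.nan, -1.29, np.nan]})

# Gains where available, otherwise Loss; NaN only where both are missing
both["Change"] = both["Gains"].combine_first(both["Loss"])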
There are many ways to achieve this. One is using the loc property:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Price1': [np.nan, np.nan, np.nan, np.nan,
                              np.nan, np.nan, 1.06, np.nan, np.nan],
                   'Price2': [np.nan, -0.17, -0.13, -0.75, -0.17,
                              -0.99, np.nan, -1.29, -0.42]})
df.loc[df['Price1'].isnull(), 'Price1'] = df['Price2']
df = df.loc[:6,'Price1']
print(df)
Output:
Price1
0 NaN
1 -0.17
2 -0.13
3 -0.75
4 -0.17
5 -0.99
6 1.06
You can see more complex recipes in the Cookbook
IIUC, we can filter for null values and just sum the columns to make your new dataframe.
cols = ['Gains','Loss']
s = df.isnull().cumsum(axis=1).eq(len(df.columns)).any(axis=1)
# use df[cols].isnull() if you only want to check the price columns for nulls
df['prices'] = df[cols].loc[~s].sum(axis=1)
df = df.drop(cols,axis=1)
print(df)
prices
0 NaN
1 -0.17
2 -0.13
3 -0.75
4 -0.17
5 -0.99
6 1.06
7 -1.29
8 -0.42
I have a daily time series dataframe with nine columns. Each column represents a measurement from a different method. I want to calculate the daily mean only when there are more than two measurements; otherwise I want to assign NaN. How can I do that with a pandas dataframe?
suppose my df looks like:
0 1 2 3 4 5 6 7 8
2000-02-25 NaN 0.22 0.54 NaN NaN NaN NaN NaN NaN
2000-02-26 0.57 NaN 0.91 0.21 NaN 0.22 NaN 0.51 NaN
2000-02-27 0.10 0.14 0.09 NaN 0.17 NaN 0.05 NaN NaN
2000-02-28 NaN NaN NaN NaN NaN NaN NaN NaN 0.14
2000-02-29 0.82 NaN 0.75 NaN NaN NaN 0.14 NaN NaN
and I'm expecting mean values like:
0
2000-02-25 NaN
2000-02-26 0.48
2000-02-27 0.11
2000-02-28 NaN
2000-02-29 0.57
Use where to set NaN values by a condition built with DataFrame.count, which counts the non-NaN values per row, compared with Series.gt (>):
s = df.where(df.count(axis=1).gt(2)).mean(axis=1)
#alternative solution with changed order
#s = df.mean(axis=1).where(df.count(axis=1).gt(2))
print (s)
2000-02-25 NaN
2000-02-26 0.484
2000-02-27 0.110
2000-02-28 NaN
2000-02-29 0.570
dtype: float64
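For completeness, a self-contained sketch of the same idea, rebuilding the example frame from the question (dates and values copied from it):

import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[np.nan, 0.22, 0.54] + [np.nan] * 6,
     [0.57, np.nan, 0.91, 0.21, np.nan, 0.22, np.nan, 0.51, np.nan],
     [0.10, 0.14, 0.09, np.nan, 0.17, np.nan, 0.05, np.nan, np.nan],
     [np.nan] * 8 + [0.14],
     [0.82, np.nan, 0.75, np.nan, np.nan, np.nan, 0.14, np.nan, np.nan]],
    index=pd.to_datetime(["2000-02-25", "2000-02-26", "2000-02-27",
                          "2000-02-28", "2000-02-29"]))

# rows with two or fewer measurements become NaN, the rest get the row mean
s = df.mean(axis=1).where(df.count(axis=1).gt(2))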
I have a pandas dataframe that looks as follows:
In [23]: dataframe.head()
Out[23]:
column_id 1 10 11 12 13 14 15 16 17 18 ... 46 47 48 49 5 50 \
row_id ...
1 NaN NaN 1 1 1 1 1 1 1 1 ... 1 1 NaN 1 NaN NaN
10 1 1 1 1 1 1 1 1 1 NaN ... 1 1 1 NaN 1 NaN
100 1 1 NaN 1 1 1 1 1 NaN 1 ... NaN NaN 1 1 1 NaN
11 NaN 1 1 1 1 1 1 1 1 NaN ... NaN 1 1 1 1 1
12 1 1 1 NaN 1 1 1 1 NaN 1 ... 1 NaN 1 1 NaN 1
The thing is, I'm currently using the Pearson correlation to calculate similarity between rows, and given the nature of the data, the standard deviation is sometimes zero (all values are 1 or NaN), so the Pearson correlation returns this:
In [24]: dataframe.transpose().corr().head()
Out[24]:
row_id 1 10 100 11 12 13 14 15 16 17 ... 90 91 92 93 94 95 \
row_id ...
1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN
10 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN
100 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN
11 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN
12 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN
Is there any other way of computing correlations that avoids this? Maybe an easy way to calculate the Euclidean distance between rows with just one method, as there is for the Pearson correlation?
The key question here is what distance metric to use.
Let's say this is your data.
>>> import pandas as pd
>>> import numpy as np
>>> data = pd.DataFrame(np.random.rand(100, 50))
>>> data[data > 0.2] = 1
>>> data[data <= 0.2] = np.nan
>>> data.head()
0 1 2 3 4 5 6 7 8 9 ... 40 41 42 43 44 45 46 47 \
0 1 1 1 NaN 1 NaN NaN 1 1 1 ... 1 1 NaN 1 NaN 1 1 1
1 1 1 1 NaN 1 1 1 1 1 1 ... NaN 1 1 NaN NaN 1 1 1
2 1 1 1 1 1 1 1 1 1 1 ... 1 NaN 1 1 1 1 1 NaN
3 1 NaN 1 NaN 1 NaN 1 NaN 1 1 ... 1 1 1 1 NaN 1 1 1
4 1 1 1 1 1 1 1 1 NaN 1 ... NaN 1 1 1 1 1 1 1
What is the % difference?
You can compute a distance metric as percentage of values that are different between each column. The result shows the % difference between any 2 columns.
>>> zero_data = data.fillna(0)
>>> distance = lambda column1, column2: (column1 - column2).abs().sum() / len(column1)
>>> result = zero_data.apply(lambda col1: zero_data.apply(lambda col2: distance(col1, col2)))
>>> result.head()
0 1 2 3 4 5 6 7 8 9 ... 40 \
0 0.00 0.36 0.33 0.37 0.32 0.41 0.35 0.33 0.39 0.33 ... 0.37
1 0.36 0.00 0.37 0.29 0.30 0.37 0.33 0.37 0.33 0.31 ... 0.35
2 0.33 0.37 0.00 0.36 0.29 0.38 0.40 0.34 0.30 0.28 ... 0.28
3 0.37 0.29 0.36 0.00 0.29 0.30 0.34 0.26 0.32 0.36 ... 0.36
4 0.32 0.30 0.29 0.29 0.00 0.31 0.35 0.29 0.29 0.25 ... 0.27
What is the correlation coefficient?
Here, we use the Pearson correlation coefficient. This is a perfectly valid metric. Specifically, it translates to the phi coefficient in case of binary data.
>>> import scipy.stats
>>> zero_data = data.fillna(0)
>>> distance = lambda column1, column2: scipy.stats.pearsonr(column1, column2)[0]
>>> result = zero_data.apply(lambda col1: zero_data.apply(lambda col2: distance(col1, col2)))
>>> result.head()
0 1 2 3 4 5 6 \
0 1.000000 0.013158 0.026262 -0.059786 -0.024293 -0.078056 0.054074
1 0.013158 1.000000 -0.093109 0.170159 0.043187 0.027425 0.108148
2 0.026262 -0.093109 1.000000 -0.124540 -0.048485 -0.064881 -0.161887
3 -0.059786 0.170159 -0.124540 1.000000 0.004245 0.184153 0.042524
4 -0.024293 0.043187 -0.048485 0.004245 1.000000 0.079196 -0.099834
Incidentally, this is the same result that you would get with the Spearman R coefficient as well.
What is the Euclidean distance?
>>> zero_data = data.fillna(0)
>>> distance = lambda column1, column2: np.linalg.norm(column1 - column2)
>>> result = zero_data.apply(lambda col1: zero_data.apply(lambda col2: distance(col1, col2)))
>>> result.head()
0 1 2 3 4 5 6 \
0 0.000000 6.000000 5.744563 6.082763 5.656854 6.403124 5.916080
1 6.000000 0.000000 6.082763 5.385165 5.477226 6.082763 5.744563
2 5.744563 6.082763 0.000000 6.000000 5.385165 6.164414 6.324555
3 6.082763 5.385165 6.000000 0.000000 5.385165 5.477226 5.830952
4 5.656854 5.477226 5.385165 5.385165 0.000000 5.567764 5.916080
By now, you'd have a sense of the pattern. Create a distance method. Then apply it pairwise to every column using
data.apply(lambda col1: data.apply(lambda col2: method(col1, col2)))
If your distance method relies on the presence of zeroes instead of nans, convert to zeroes using .fillna(0).
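If the nested apply becomes slow on larger frames, the same column-pairwise Euclidean matrix can be computed in one call with scipy's pdist/squareform. A hedged sketch, assuming the zero-filled data frame from above; the names dist and result are mine:

import pandas as pd
from scipy.spatial.distance import pdist, squareform

zero_data = data.fillna(0)
# pdist works on rows, so transpose to get distances between the columns
dist = squareform(pdist(zero_data.T, metric="euclidean"))
result = pd.DataFrame(dist, index=zero_data.columns, columns=zero_data.columns)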
A proposal to improve the excellent answer from @s-anand for Euclidean distance:
instead of
zero_data = data.fillna(0)
distance = lambda column1, column2: np.linalg.norm(column1 - column2)
we can apply fillna to fill only the missing data, thus:
distance = lambda column1, column2: np.linalg.norm((column1 - column2).fillna(0))
This way, the distance on missing dimensions will not be counted.
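Plugged into the same pairwise pattern, that would look roughly like this (a sketch that reuses the data frame from the earlier answer):

import numpy as np

distance = lambda column1, column2: np.linalg.norm((column1 - column2).fillna(0))
result = data.apply(lambda col1: data.apply(lambda col2: distance(col1, col2)))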
This is my numpy-only version of @S Anand's fantastic answer, which I put together in order to help myself understand his explanation better.
Happy to share it with a short, reproducible example:
# Preliminaries
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
# Get iris dataset into a DataFrame
from sklearn.datasets import load_iris
iris = load_iris()
iris_df = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                       columns=iris['feature_names'] + ['target'])
Let's try scipy.stats.pearsonr first.
Executing:
from scipy.stats import pearsonr

distance = lambda column1, column2: pearsonr(column1, column2)[0]
rslt = iris_df.apply(lambda col1: iris_df.apply(lambda col2: distance(col1, col2)))
pd.options.display.float_format = '{:,.2f}'.format
rslt
returns the pairwise correlation matrix as a 5x5 DataFrame (not reproduced here; it holds the same values as the array below), and the numpy-only equivalent:
rslt_np = np.apply_along_axis(lambda col1: np.apply_along_axis(lambda col2: pearsonr(col1, col2)[0],
                                                               axis=0, arr=iris_df),
                              axis=0, arr=iris_df)
float_formatter = lambda x: "%.2f" % x
np.set_printoptions(formatter={'float_kind':float_formatter})
rslt_np
returns:
array([[1.00, -0.12, 0.87, 0.82, 0.78],
[-0.12, 1.00, -0.43, -0.37, -0.43],
[0.87, -0.43, 1.00, 0.96, 0.95],
[0.82, -0.37, 0.96, 1.00, 0.96],
[0.78, -0.43, 0.95, 0.96, 1.00]])
As a second example let's try the distance correlation from the dcor library.
Executing:
import dcor
dist_corr = lambda column1, column2: dcor.distance_correlation(column1, column2)
rslt = iris_df.apply(lambda col1: iris_df.apply(lambda col2: dist_corr(col1, col2)))
pd.options.display.float_format = '{:,.2f}'.format
rslt
returns the distance-correlation matrix as a 5x5 DataFrame (not reproduced here; it holds the same values as the array below), while the numpy-only version:
rslt_np = np.apply_along_axis(lambda col1: np.apply_along_axis(lambda col2: dcor.distance_correlation(col1, col2),
                                                               axis=0, arr=iris_df),
                              axis=0, arr=iris_df)
float_formatter = lambda x: "%.2f" % x
np.set_printoptions(formatter={'float_kind':float_formatter})
rslt_np
returns:
array([[1.00, 0.31, 0.86, 0.83, 0.78],
[0.31, 1.00, 0.54, 0.51, 0.51],
[0.86, 0.54, 1.00, 0.97, 0.95],
[0.83, 0.51, 0.97, 1.00, 0.95],
[0.78, 0.51, 0.95, 0.95, 1.00]])
I compared three variants from the other answers here for speed, using a trial 1000x25 matrix (which leads to a resulting 1000x1000 distance matrix).
dcor library
Time: 0.03s
https://dcor.readthedocs.io/en/latest/functions/dcor.distances.pairwise_distances.html
import dcor
result = dcor.distances.pairwise_distances(data)
scipy.distance
Time: 0.05s
https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance_matrix.html
from scipy.spatial import distance_matrix
result = distance_matrix(data, data)
using a lambda function with numpy or pandas
Time: 180s / 90s
import numpy as np # variant A (180s)
import pandas as pd # variant B (90s)
distance = lambda x, y: np.sqrt(np.sum((x - y) ** 2)) # variant A
distance = lambda x, y: pd.np.linalg.norm(x - y) # variant B
result = data.apply(lambda x: data.apply(lambda y: distance(x, y), axis=1), axis=1)
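For reference, a rough sketch of how one of these timings can be reproduced; the matrix is random and the numbers above depend on the machine, so treat them as indicative only:

import time

import numpy as np
import pandas as pd
from scipy.spatial import distance_matrix

data = pd.DataFrame(np.random.rand(1000, 25))

# time the scipy variant; swap in any of the other approaches the same way
start = time.perf_counter()
result = distance_matrix(data, data)
print(f"scipy distance_matrix: {time.perf_counter() - start:.3f}s")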