Pairwise Euclidean distance with pandas ignoring NaNs - python

I start with a dictionary, which is the way my data was already formatted:
import pandas as pd
dict2 = {'A': {'a':1.0, 'b':2.0, 'd':4.0}, 'B':{'a':2.0, 'c':2.0, 'd':5.0},
'C':{'b':1.0,'c':2.0, 'd':4.0}}
I then convert it to a pandas dataframe:
df = pd.DataFrame(dict2)
print(df)
A B C
a 1.0 2.0 NaN
b 2.0 NaN 1.0
c NaN 2.0 2.0
d 4.0 5.0 4.0
Of course, I can get the difference one at a time by doing this:
df['A'] - df['B']
Out[643]:
a -1.0
b NaN
c NaN
d -1.0
dtype: float64
I figured out how to loop through and calculate A-A, A-B, A-C:
for column in df:
print(df['A'] - df[column])
a 0.0
b 0.0
c NaN
d 0.0
Name: A, dtype: float64
a -1.0
b NaN
c NaN
d -1.0
dtype: float64
a NaN
b 1.0
c NaN
d 0.0
dtype: float64
What I would like to do is iterate through the columns so as to calculate |A-B|, |A-C|, and |B-C| and store the results in another dictionary.
I want to do this so as to calculate the Euclidean distance between all combinations of columns later on. If there is an easier way to do this I would like to see it as well. Thank you.

You can use numpy broadcasting to compute vectorised Euclidean distance (L2-norm), ignoring NaNs using np.nansum.
i = df.values.T
j = np.nansum((i - i[:, None]) ** 2, axis=2) ** .5
If you want a DataFrame representing a distance matrix, here's what that would look like:
df = (lambda v, c: pd.DataFrame(v, c, c))(j, df.columns)
df
A B C
A 0.000000 1.414214 1.0
B 1.414214 0.000000 1.0
C 1.000000 1.000000 0.0
df[i, j] represents the distance between the ith and jth column in the original DataFrame.

The code below iterates through columns to calculate the difference.
# Import libraries
import pandas as pd
import numpy as np
# Create dataframe
df = pd.DataFrame({'A': {'a':1.0, 'b':2.0, 'd':4.0}, 'B':{'a':2.0, 'c':2.0, 'd':5.0},'C':{'b':1.0,'c':2.0, 'd':4.0}})
df2 = pd.DataFrame()
# Calculate difference
clist = df.columns
for i in range (0,len(clist)-1):
for j in range (1,len(clist)):
if (clist[i] != clist[j]):
var = clist[i] + '-' + clist[j]
df[var] = abs(df[clist[i]] - df[clist[j]]) # optional
df2[var] = abs(df[clist[i]] - df[clist[j]]) # optional
Output in same dataframe
df.head()
Output in a new dataframe
df2.head()

Related

Replace column in Pandas dataframe with the mean of that column

I have a dataframe:
df = pd.DataFrame([[1, 2], [1, 3], [4, 6]], columns=['A', 'B'])
A B
0 1 2
1 1 3
2 4 6
I want to return a dataframe of the same size containing the mean of each column:
A B
0 2 3.666
1 2 3.666
2 2 3.666
Is there a simple way of doing this?
You can only provide one single line at DataFrame creation time:
pd.DataFrame(data = [df.mean()], index = df.index)
It gives:
A B
0 2.0 3.666667
1 2.0 3.666667
2 2.0 3.666667
Here's one with assign:
df.assign(**df.mean())
A B
0 2.0 3.666667
1 2.0 3.666667
2 2.0 3.666667
Details
The mean is easily obtained with DataFrame.mean:
df.mean()
tenor_yrs 14.292857
rates 2.622000
dtype: float64
From the above Series, we can use dictionary unpacking to replace the existing columns with the resulting values. Note that we can unpack the Series into a dictionary using **:
{**df.mean()}
# {'tenor_yrs': 14.292857142857143, 'rates': 2.622}
Given that the way assign adds new columns is as df.assign(a_given_column=a_value, another_column=some_other_value), the unpacking makes the dictionary keys be the function's arguments. And since the original dataframe's index is respected, df.assign(**df.mean()) will replace the dataframe`s values with the means.
Recreate the DataFrame. Send the mean Series to a dict, then the index defines the number of rows.
pd.DataFrame(df.mean().to_dict(), index=df.index)
# A B
#0 2.0 3.666667
#1 2.0 3.666667
#2 2.0 3.666667
Same concept, but creating the full array first saves a decent amount of time.
pd.DataFrame(np.broadcast_to(df.mean(), df.shape),
index=df.index,
columns=df.columns)
Here are some timings. Of course this will depend slightly on the number of columns but you can see there are pretty large differences when you provide the entire array to begin with
import perfplot
import pandas as pd
import numpy as np
perfplot.show(
setup=lambda N: pd.DataFrame(np.random.randint(1,100, (N, 5)),
columns=[str(x) for x in range(5)]),
kernels=[
lambda df: pd.DataFrame(np.broadcast_to(df.mean(), df.shape), index=df.index, columns=df.columns),
lambda df: df.assign(**df.mean()),
lambda df: pd.DataFrame(df.mean().to_dict(), index=df.index)
],
labels=['numpy broadcast', 'assign', 'dict'],
n_range=[2 ** k for k in range(1, 22)],
equality_check=np.allclose,
xlabel="Len(df)"
)

create correlation-matrix-like data frame in pandas

I have a df with correlation values for A and B
df = pd.DataFrame({'x':['A','A','B','B'],'y':['A','B','A','B'],'c':[1,0.5,0.5,1]})
I'm trying to create a correlation-matrix-like data frame from df of the kind DataFrame.corr would give me.
I tried
corr = df.pivot_table(columns='y',index='x')
y A B
x
A 1.0 0.5
B 0.5 1.0
but I don't know how to get rid of the multi-index.
You just need specifying values to get rid of multiindex
corr = df.pivot_table(columns='y',index='x', values='c')
Out[41]:
y A B
x
A 1.0 0.5
B 0.5 1.0
If you also want to get rid of axis name, chain rename_axis
corr = (df.pivot_table(columns='y',index='x', values='c')
.rename_axis(index=None, columns=None))
Out[43]:
A B
A 1.0 0.5
B 0.5 1.0

How to create a matrix that is the sum of multiple matrices using pandas dataframe?

I have multiple data frames that I saved in a concatenated list like below. Each df represents a matrix.
my_df = pd.concat([df1, df2, df3, .....])
How do I sum all these dfs (matrices) into one df (matrix)?
I found a discussion here, but it only answers how to add two data frames, by using code like below.
df_x.add(df_y, fill_value=0)
Should I use the code above in a loop, or is there a more concise way?
I tried to do print(my_df.sum()) but got a very confusing result (it's suddenly turned into a one row instead of two-dimensional matrix).
Thank you.
I believe need functools.reduce if each DataFrame in list have same index and columns values:
np.random.seed(2018)
df1 = pd.DataFrame(np.random.choice([1,np.nan,2], size=(3,3)), columns=list('abc'))
df2 = pd.DataFrame(np.random.choice([1,np.nan,3], size=(3,3)), columns=list('abc'))
df3 = pd.DataFrame(np.random.choice([1,np.nan,4], size=(3,3)), columns=list('abc'))
print (df1)
a b c
0 2.0 2.0 2.0
1 NaN NaN 1.0
2 1.0 2.0 NaN
print (df2)
a b c
0 NaN NaN 1.0
1 3.0 3.0 3.0
2 NaN 1.0 3.0
print (df3)
a b c
0 4.0 NaN NaN
1 4.0 1.0 1.0
2 4.0 NaN 1.0
from functools import reduce
my_df = [df1,df2, df3]
df = reduce(lambda x, y: x.add(y, fill_value=0), my_df)
print (df)
a b c
0 6.0 2.0 3.0
1 7.0 4.0 5.0
2 5.0 3.0 4.0
I believe the idiomatic solution to this is to preserve the information about different DataFrames with the help of the keys parameter and then use sum on the innermost level:
dfs = [df1, df2, df3]
my_df = pd.concat(dfs, keys=['df{}'.format(i+1) for i in range(len(dfs))])
my_df.sum(level=1)
which yields
a b c
0 6.0 2.0 3.0
1 7.0 4.0 5.0
2 5.0 3.0 4.0
with jezrael's sample DataFrames.
One method is to use sum with a list of arrays. The output here will be an array rather than a dataframe.
This assumes you need to replace np.nan with 0:
res = sum([x.fillna(0).values for x in [df1, df2, df3]])
Alternatively, you can use numpy directly in a couple of different ways:
res_np1 = np.add.reduce([x.fillna(0).values for x in [df1, df2, df3]])
res_np2 = np.nansum([x.values for x in [df1, df2, df3]], axis=0)
numpy.nansum assumes np.nan equals zero for summing purposes.

Calculate difference between first valid value and last valid value in DataFrame per row?

I'm trying to find the difference between the first valid value and the last valid value in a DataFrame per row.
I have a working code with a for loop and looking for something faster.
Here's an example of what I'm doing currently:
import pandas as pd
import numpy as np
df = pd.DataFrame(
np.arange(16).astype(np.float).reshape(4, 4),
columns=['a', 'b', 'c', 'd'])
# Fill some NaN
df.loc[0, ['a', 'd']] = np.nan
df.loc[1, ['c', 'd']] = np.nan
df.loc[2, 'b'] = np.nan
df.loc[3, :] = np.nan
print(df)
# a b c d
# 0 NaN 1.0 2.0 NaN
# 1 4.0 5.0 NaN NaN
# 2 8.0 NaN 10.0 11.0
# 3 NaN NaN NaN NaN
diffs = pd.Series(index=df.index)
for i in df.index:
row = df.loc[i]
min_i = row.first_valid_index()
max_i = row.last_valid_index()
if min_i is None or min_i == max_i: # 0 or 1 valid values
continue
diffs[i] = df.loc[i, max_i] - df.loc[i, min_i]
df['diff'] = diffs
print(df)
# a b c d diff
# 0 NaN 1.0 2.0 NaN 1.0
# 1 4.0 5.0 NaN NaN 1.0
# 2 8.0 NaN 10.0 11.0 3.0
# 3 NaN NaN NaN NaN NaN
One way would be to back and forward fill the missing values, and then just compare the first and last rows.
df2 = df.fillna(method='ffill', axis=1).fillna(method='bfill', axis=1)
df['diff'] = df2.ix[:, -1] - df2.ix[:, 0]
If you want to do it in one line, without creating a new dataframe:
df['diff'] = df.fillna(method='ffill', axis=1).fillna(method='bfill', axis=1).apply(lambda r: r.d - r.a, axis=1)
Pandas making your life easy, one method (first_valid_values()) at a time. Note that you'll have to delete any rows that have all NaN values (no point in having these anyways):
For first valid values:
a= [df.ix[x,i] for x,i in enumerate(df.apply(lambda row: row.first_valid_index(), axis=1))]
For last valid values:
b = [df.ix[x,i] for x,i in enumerate(df.apply(lambda row: row[::-1].first_valid_index(), axis=1))]
Subtract to get final result:
a-b

Return list of indices/index where a min/max value occurs in a pandas dataframe

I'd like to search a pandas DataFrame for minimum values. I need the min in the entire dataframe (across all values) analogous to df.min().min(). However I also need the know the index of the location(s) where this value occurs.
I've tried a number of different approaches:
df.where(df == (df.min().min())),
df.where(df == df.min().min()).notnull()(source) and
val_mask = df == df.min().min(); df[val_mask] (source).
These return a dataframe of NaNs on non-min/boolean values but I can't figure out a way to get the (row, col) of these locations.
Is there a more elegant way of searching a dataframe for a min/max and returning a list containing all of the locations of the occurrence(s)?
import pandas as pd
keys = ['x', 'y', 'z']
vals = [[1,2,-1], [3,5,1], [4,2,3]]
data = dict(zip(keys,vals))
df = pd.DataFrame(data)
list_of_lowest = []
for column_name, column in df.iteritems():
if len(df[column == df.min().min()]) > 0:
print(column_name, column.where(column ==df.min().min()).dropna())
list_of_lowest.append([column_name, column.where(column ==df.min().min()).dropna()])
list_of_lowest
output: [['x', 2 -1.0
Name: x, dtype: float64]]
Based on your revised update:
In [209]:
keys = ['x', 'y', 'z']
vals = [[1,2,-1], [3,5,-1], [4,2,3]]
data = dict(zip(keys,vals))
df = pd.DataFrame(data)
df
Out[209]:
x y z
0 1 3 4
1 2 5 2
2 -1 -1 3
Then the following would work:
In [211]:
df[df==df.min().min()].dropna(axis=1, thresh=1).dropna()
Out[211]:
x y
2 -1.0 -1.0
So this uses the boolean mask on the df:
In [212]:
df[df==df.min().min()]
Out[212]:
x y z
0 NaN NaN NaN
1 NaN NaN NaN
2 -1.0 -1.0 NaN
and we call dropna with param thresh=1 this drops columns that don't have at least 1 non-NaN value:
In [213]:
df[df==df.min().min()].dropna(axis=1, thresh=1)
Out[213]:
x y
0 NaN NaN
1 NaN NaN
2 -1.0 -1.0
Probably safer to call again with thresh=1:
In [214]:
df[df==df.min().min()].dropna(axis=1, thresh=1).dropna(thresh=1)
Out[214]:
x y
2 -1.0 -1.0

Categories