create correlation-matrix-like data frame in pandas - python

I have a df with correlation values for A and B
df = pd.DataFrame({'x':['A','A','B','B'],'y':['A','B','A','B'],'c':[1,0.5,0.5,1]})
I'm trying to create a correlation-matrix-like data frame from df of the kind DataFrame.corr would give me.
I tried
corr = df.pivot_table(columns='y',index='x')
y A B
x
A 1.0 0.5
B 0.5 1.0
but I don't know how to get rid of the multi-index.

You just need to specify values to get rid of the MultiIndex:
corr = df.pivot_table(columns='y',index='x', values='c')
Out[41]:
y A B
x
A 1.0 0.5
B 0.5 1.0
If you also want to get rid of the axis names, chain rename_axis:
corr = (df.pivot_table(columns='y',index='x', values='c')
.rename_axis(index=None, columns=None))
Out[43]:
A B
A 1.0 0.5
B 0.5 1.0
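For reference, the steps above can be put together in one self-contained sketch. The DataFrame.pivot variant is an alternative not mentioned in the answer, and it assumes every (x, y) pair occurs exactly once, so no aggregation is needed.

```python
import pandas as pd

# Long-format correlation values, as in the question
df = pd.DataFrame({'x': ['A', 'A', 'B', 'B'],
                   'y': ['A', 'B', 'A', 'B'],
                   'c': [1, 0.5, 0.5, 1]})

# Passing values='c' keeps the columns as a flat Index instead of a MultiIndex
corr = (df.pivot_table(index='x', columns='y', values='c')
          .rename_axis(index=None, columns=None))

# Alternative (assumption: no duplicate (x, y) pairs, so nothing to aggregate)
corr_alt = (df.pivot(index='x', columns='y', values='c')
              .rename_axis(index=None, columns=None))

print(corr)
```

pivot_table averages duplicate (x, y) pairs by default, whereas pivot raises an error on them, which may be preferable if duplicates would indicate bad input.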


Subtract two consecutive rows and save in first row using pandas

I have a pandas dataframe of format with 226 columns :
**W X Y Z.....**
a b c d.....
e f g h.....
I want to subtract each column from the one to its left, in the following way:
**W X Y Z.....**
a (b-c) (c-d) (d-nextvalue).....
e (f-g) (g-h) (h-nextvalue).....
How do I go about doing this? I am a rookie in Python, thanks in advance.
Use DataFrame.diff and, if necessary, first convert the first column to the index with DataFrame.set_index:
df = pd.DataFrame({
'W':list('abc'),
'X':[10,5,4],
'Y':[7,8,9],
'Z':[1,1,0],
'E':[5,3,6],
})
df = df.set_index('W').diff(-1, axis=1)
print (df)
X Y Z E
W
a 3.0 6.0 -4.0 NaN
b -3.0 7.0 -2.0 NaN
c -5.0 9.0 -6.0 NaN
To create 'W' as the index you can do,
df.set_index('W', inplace=True)
Further, you may try the following:
for i in range(len(df.columns) - 1):
    df.iloc[:, i] = df.iloc[:, i] - df.iloc[:, i+1]
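As a rough check, the loop should agree with diff(-1, axis=1) on every column except the last, which diff fills with NaN and the loop leaves unchanged. A minimal sketch, assuming the same sample data as in the first answer (the names `vectorised` and `looped` are only illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'W': list('abc'),
    'X': [10, 5, 4],
    'Y': [7, 8, 9],
    'Z': [1, 1, 0],
    'E': [5, 3, 6],
}).set_index('W')

# Vectorised: each column minus the column to its right
vectorised = df.diff(-1, axis=1)

# Explicit loop on a copy; column i+1 is still unmodified when column i is updated
looped = df.copy()
for i in range(len(looped.columns) - 1):
    looped.iloc[:, i] = looped.iloc[:, i] - looped.iloc[:, i + 1]

print(vectorised)
print(looped)
```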

Pandas divide multiple multiindex columns

I've got a dataframe that looks like this:
and I'd like to divide the x columns by the y columns, but at the moment I get the following result:
Full example:
import pandas as pd
# create example dataframe
data = {'x': [2, 4, 6], 'y': [1, 2, 3]}
df = pd.DataFrame(data)
df = pd.concat([df, df*10], axis=1, keys=['apple', 'orange'])
# slice just x and y columns
x = df.loc[:, (slice(None), 'x')]
y = df.loc[:, (slice(None), 'y')]
# divide (this doesn't work)
result = x / y
Ideally I'd like to add the result back as a separate column:
Is there an elegant way to do this?
Your solution works if both frames are given the same second-level label with rename:
new = x.rename(columns={'x':'x/y'}) / y.rename(columns={'y':'x/y'})
print (new)
apple orange
x/y x/y
0 2.0 2.0
1 2.0 2.0
2 2.0 2.0
Alternatively, use DataFrame.xs. By default it drops the selected level, so the division works directly (x and y then have the same columns). The second level then has to be recreated with MultiIndex.from_product:
x = df.xs('x', axis=1, level=1)
y = df.xs('y', axis=1, level=1)
new = x / y
new.columns = pd.MultiIndex.from_product([new.columns, ['x/y']])
print (new)
apple orange
x/y x/y
0 2.0 2.0
1 2.0 2.0
2 2.0 2.0
And then use concat with DataFrame.sort_index and DataFrame.reindex:
df = pd.concat([df, new], axis=1).sort_index(axis=1).reindex(['x','x/y','y'], axis=1, level=1)
print (df)
apple orange
x x/y y x x/y y
0 2 2.0 1 20 2.0 10
1 4 2.0 2 40 2.0 20
2 6 2.0 3 60 2.0 30
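When there are only a few top-level groups, a plain loop over the first column level is another option. This is only a sketch built on the example frame from the question; it is not taken from the answer above, and the loop variable `group` is an illustrative name.

```python
import pandas as pd

data = {'x': [2, 4, 6], 'y': [1, 2, 3]}
df = pd.DataFrame(data)
df = pd.concat([df, df * 10], axis=1, keys=['apple', 'orange'])

# Add one ratio column per top-level group (apple, orange)
for group in df.columns.get_level_values(0).unique():
    df[(group, 'x/y')] = df[(group, 'x')] / df[(group, 'y')]

# New columns are appended at the end, so sort to regroup them
df = df.sort_index(axis=1)
print(df)
```

Here sorting alone gives the x, x/y, y order because the labels happen to sort that way; with other labels a reindex on level 1 would still be needed.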

Pairwise Euclidean distance with pandas ignoring NaNs

I start with a dictionary, which is the way my data was already formatted:
import pandas as pd
dict2 = {'A': {'a':1.0, 'b':2.0, 'd':4.0}, 'B':{'a':2.0, 'c':2.0, 'd':5.0},
'C':{'b':1.0,'c':2.0, 'd':4.0}}
I then convert it to a pandas dataframe:
df = pd.DataFrame(dict2)
print(df)
A B C
a 1.0 2.0 NaN
b 2.0 NaN 1.0
c NaN 2.0 2.0
d 4.0 5.0 4.0
Of course, I can get the difference one at a time by doing this:
df['A'] - df['B']
Out[643]:
a -1.0
b NaN
c NaN
d -1.0
dtype: float64
I figured out how to loop through and calculate A-A, A-B, A-C:
for column in df:
print(df['A'] - df[column])
a 0.0
b 0.0
c NaN
d 0.0
Name: A, dtype: float64
a -1.0
b NaN
c NaN
d -1.0
dtype: float64
a NaN
b 1.0
c NaN
d 0.0
dtype: float64
What I would like to do is iterate through the columns so as to calculate |A-B|, |A-C|, and |B-C| and store the results in another dictionary.
I want to do this so as to calculate the Euclidean distance between all combinations of columns later on. If there is an easier way to do this I would like to see it as well. Thank you.
You can use numpy broadcasting to compute vectorised Euclidean distance (L2-norm), ignoring NaNs using np.nansum.
import numpy as np
i = df.values.T
j = np.nansum((i - i[:, None]) ** 2, axis=2) ** .5
If you want a DataFrame representing a distance matrix, here's what that would look like:
df = (lambda v, c: pd.DataFrame(v, c, c))(j, df.columns)
df
A B C
A 0.000000 1.414214 1.0
B 1.414214 0.000000 1.0
C 1.000000 1.000000 0.0
df.iloc[i, j] holds the distance between the ith and jth columns of the original DataFrame.
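To make the broadcasting step easier to follow, here is a self-contained sketch with the intermediate shapes spelled out. The names `cols`, `diff`, `dist` and `dist_df` are illustrative and not part of the answer above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': {'a': 1.0, 'b': 2.0, 'd': 4.0},
                   'B': {'a': 2.0, 'c': 2.0, 'd': 5.0},
                   'C': {'b': 1.0, 'c': 2.0, 'd': 4.0}})

cols = df.to_numpy().T        # shape (3, 4): one row per original column
diff = cols - cols[:, None]   # shape (3, 3, 4): all pairwise differences
# A NaN difference (value missing in either column) counts as 0 in the sum
dist = np.sqrt(np.nansum(diff ** 2, axis=2))

dist_df = pd.DataFrame(dist, index=df.columns, columns=df.columns)
print(dist_df)
```

Because missing entries are simply skipped, two columns that overlap on few rows can look closer than two columns that overlap on many, so the distances are not directly comparable unless the overlap is similar.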
The code below iterates through columns to calculate the difference.
# Import libraries
import pandas as pd
import numpy as np
# Create dataframe
df = pd.DataFrame({'A': {'a':1.0, 'b':2.0, 'd':4.0}, 'B':{'a':2.0, 'c':2.0, 'd':5.0},'C':{'b':1.0,'c':2.0, 'd':4.0}})
df2 = pd.DataFrame()
# Calculate difference
clist = df.columns
for i in range(0, len(clist) - 1):
    for j in range(1, len(clist)):
        if clist[i] != clist[j]:
            var = clist[i] + '-' + clist[j]
            df[var] = abs(df[clist[i]] - df[clist[j]])   # optional
            df2[var] = abs(df[clist[i]] - df[clist[j]])  # optional
Output in same dataframe
df.head()
Output in a new dataframe
df2.head()
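Since the question asks for the results in a dictionary, the same pairwise loop can also be sketched with itertools.combinations, which yields each unordered pair of columns once. The dictionary name `abs_diffs` and the 'A-B' style keys are assumptions chosen to mirror the column names used above.

```python
from itertools import combinations

import pandas as pd

df = pd.DataFrame({'A': {'a': 1.0, 'b': 2.0, 'd': 4.0},
                   'B': {'a': 2.0, 'c': 2.0, 'd': 5.0},
                   'C': {'b': 1.0, 'c': 2.0, 'd': 4.0}})

# One entry per unordered pair of columns: (A, B), (A, C), (B, C)
abs_diffs = {f'{a}-{b}': (df[a] - df[b]).abs()
             for a, b in combinations(df.columns, 2)}

for name, series in abs_diffs.items():
    print(name)
    print(series)
```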

How to create a matrix that is the sum of multiple matrices using pandas dataframe?

I have multiple data frames that I saved in a concatenated list like below. Each df represents a matrix.
my_df = pd.concat([df1, df2, df3, .....])
How do I sum all these dfs (matrices) into one df (matrix)?
I found a discussion here, but it only answers how to add two data frames, by using code like below.
df_x.add(df_y, fill_value=0)
Should I use the code above in a loop, or is there a more concise way?
I tried to do print(my_df.sum()) but got a very confusing result (it's suddenly turned into a one row instead of two-dimensional matrix).
Thank you.
I believe you need functools.reduce, provided every DataFrame in the list has the same index and column values:
import numpy as np
import pandas as pd
np.random.seed(2018)
df1 = pd.DataFrame(np.random.choice([1,np.nan,2], size=(3,3)), columns=list('abc'))
df2 = pd.DataFrame(np.random.choice([1,np.nan,3], size=(3,3)), columns=list('abc'))
df3 = pd.DataFrame(np.random.choice([1,np.nan,4], size=(3,3)), columns=list('abc'))
print (df1)
a b c
0 2.0 2.0 2.0
1 NaN NaN 1.0
2 1.0 2.0 NaN
print (df2)
a b c
0 NaN NaN 1.0
1 3.0 3.0 3.0
2 NaN 1.0 3.0
print (df3)
a b c
0 4.0 NaN NaN
1 4.0 1.0 1.0
2 4.0 NaN 1.0
from functools import reduce
my_df = [df1,df2, df3]
df = reduce(lambda x, y: x.add(y, fill_value=0), my_df)
print (df)
a b c
0 6.0 2.0 3.0
1 7.0 4.0 5.0
2 5.0 3.0 4.0
I believe the idiomatic solution to this is to preserve the information about different DataFrames with the help of the keys parameter and then use sum on the innermost level:
dfs = [df1, df2, df3]
my_df = pd.concat(dfs, keys=['df{}'.format(i+1) for i in range(len(dfs))])
my_df.groupby(level=1).sum()  # on pandas < 2.0: my_df.sum(level=1)
which yields
a b c
0 6.0 2.0 3.0
1 7.0 4.0 5.0
2 5.0 3.0 4.0
with jezrael's sample DataFrames.
One method is to use sum with a list of arrays. The output here will be an array rather than a dataframe.
This assumes you need to replace np.nan with 0:
res = sum([x.fillna(0).values for x in [df1, df2, df3]])
Alternatively, you can use numpy directly in a couple of different ways:
res_np1 = np.add.reduce([x.fillna(0).values for x in [df1, df2, df3]])
res_np2 = np.nansum([x.values for x in [df1, df2, df3]], axis=0)
numpy.nansum assumes np.nan equals zero for summing purposes.
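The approaches above can be compared side by side on a small deterministic example. This is a sketch with made-up sample frames, not the random data used earlier; the variable names are illustrative.

```python
from functools import reduce

import numpy as np
import pandas as pd

# Hypothetical sample data with a few missing cells
df1 = pd.DataFrame({'a': [1.0, np.nan], 'b': [2.0, 3.0]})
df2 = pd.DataFrame({'a': [10.0, 20.0], 'b': [np.nan, 30.0]})
df3 = pd.DataFrame({'a': [100.0, np.nan], 'b': [200.0, 300.0]})
dfs = [df1, df2, df3]

# Pairwise add, treating a value missing on one side as 0
via_reduce = reduce(lambda x, y: x.add(y, fill_value=0), dfs)

# Stack the frames and sum the rows that share an index label
via_groupby = pd.concat(dfs).groupby(level=0).sum()

# Plain NumPy returns an array, so wrap it to get the labels back
via_numpy = pd.DataFrame(np.nansum([d.to_numpy() for d in dfs], axis=0),
                         index=df1.index, columns=df1.columns)

print(via_reduce)
```

One difference to keep in mind: if a cell is NaN in every frame, add with fill_value leaves it as NaN, while groupby().sum() and np.nansum return 0 for it.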

Pandas DataFrame from MultiIndex and NumPy structured array (recarray)

First I create a two-level MultiIndex:
import numpy as np
import pandas as pd
ind = pd.MultiIndex.from_product([('X','Y'), ('a','b')])
I can use it like this:
pd.DataFrame(np.zeros((3,4)), columns=ind)
Which gives:
X Y
a b a b
0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
But now I'm trying to do this:
dtype = [('Xa','f8'), ('Xb','i4'), ('Ya','f8'), ('Yb','i4')]
pd.DataFrame(np.zeros(3, dtype), columns=ind)
But that gives:
Empty DataFrame
Columns: [(X, a), (X, b), (Y, a), (Y, b)]
Index: []
I expected something like the previous result, with three rows.
Perhaps more generally, what I want to do is to generate a Pandas DataFrame with MultiIndex columns where the columns have distinct types (as in the example, a is float but b is int).
This looks like a bug, and is worth reporting as an issue on GitHub.
A workaround is to set the columns manually after construction:
In [11]: df1 = pd.DataFrame(np.zeros(3, dtype))
In [12]: df1.columns = ind
In [13]: df1
Out[13]:
X Y
a b a b
0 0.0 0 0.0 0
1 0.0 0 0.0 0
2 0.0 0 0.0 0
pd.DataFrame(np.zeros(3, dtype), columns=ind)
Empty DataFrame
Columns: [(X, a), (X, b), (Y, a), (Y, b)]
Index: []
is just showing the textual representation of the dataframe output.
Columns: [(X, a), (X, b), (Y, a), (Y, b)]
is then just the text representation of the index.
If you instead do:
df = pd.DataFrame(np.zeros(3, dtype), columns=ind)
print(type(df.columns))
<class 'pandas.indexes.multi.MultiIndex'>
You see it is indeed a pd.MultiIndex
With that out of the way, what I don't understand is why specifying the columns in the DataFrame constructor removes the values.
A workaround is this:
df = pd.DataFrame(np.zeros(3, dtype))
df.columns = ind
print(df)
X Y
a b a b
0 0.0 0 0.0 0
1 0.0 0 0.0 0
2 0.0 0 0.0 0
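To confirm that the workaround also keeps the distinct per-column types the question asks about, the dtypes can be inspected after the columns are swapped. A minimal sketch, using the same `ind` and `dtype` as in the question:

```python
import numpy as np
import pandas as pd

ind = pd.MultiIndex.from_product([('X', 'Y'), ('a', 'b')])
dtype = [('Xa', 'f8'), ('Xb', 'i4'), ('Ya', 'f8'), ('Yb', 'i4')]

# Build from the structured array first; columns are named after its fields
df = pd.DataFrame(np.zeros(3, dtype))
# Then swap in the MultiIndex (its length must match the number of fields)
df.columns = ind

print(df.dtypes)
```

The assignment is purely positional, so the order of the fields in the structured array has to match the order of the tuples in the MultiIndex.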
