Correlation matrix for two Pandas dataframes [duplicate] - python

This question already has answers here:
Calculate correlation between all columns of a DataFrame and all columns of another DataFrame?
(4 answers)
Closed 6 years ago.
Say I have two dataframes:
df1 df2
A B C D
1 3 -2 7
2 4 0 10
I need to create a correlation matrix which consists of columns from two dataframes.
corrmat_df
C D
A 1 *
B * 1
Here * stands for a correlation value.
I can do it elementwise in a nested loop, but maybe there is a more Pythonic way?
Thanks.

Simply combine the dataframes and use .corr():
result = pd.concat([df1, df2], axis=1).corr()
# A B C D
#A 1.0 1.0 1.0 1.0
#B 1.0 1.0 1.0 1.0
#C 1.0 1.0 1.0 1.0
#D 1.0 1.0 1.0 1.0
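The result contains all of the wanted correlations (and some unwanted ones). To select only the df1-versus-df2 block:
# .ix was removed in pandas 1.0, so select rows and columns with .loc
result.loc[['A','B'], ['C','D']]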
# C D
#A 1.0 1.0
#B 1.0 1.0
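With only two rows, as in the question, every correlation is exactly 1 or -1, which makes the result hard to read. The following is a minimal sketch of the same concat-then-select idea on slightly longer, made-up data (the values are illustrative assumptions, not from the question):

```python
import pandas as pd

# Illustrative data only; four rows so the correlations are not all +/-1.
df1 = pd.DataFrame({"A": [1, 2, 3, 4], "B": [3, 4, 6, 5]})
df2 = pd.DataFrame({"C": [-2, 0, 1, 5], "D": [7, 10, 8, 12]})

# Full correlation matrix over the columns of both frames.
full = pd.concat([df1, df2], axis=1).corr()

# Keep only rows from df1 and columns from df2.
corrmat_df = full.loc[df1.columns, df2.columns]
print(corrmat_df)
```

Selecting with df1.columns and df2.columns avoids hard-coding the labels, but it assumes the two frames share the same index and have no column names in common.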

Related

Interpolate NaN values over a DataFrame as a ring

I need to interpolate the NaN values over a Dataframe but I want that interpolation to get the first values of the DataFrame in case the NaN value is the last value. Here is an example:
import pandas as pd
import numpy as np
df = pd.DataFrame.from_dict({"a": [1,2,3], "b":[1,2,np.nan]})
So the DataFrame is:
a b
0 1 1.0
1 2 2.0
2 3 NaN
But when I interpolate the nan values like:
df.interpolate(method="linear", inplace=True)
I got:
a b
0 1 1.0
1 2 2.0
2 3 2.0
The interpolation doesn't wrap around to the first value. My desired output would be to fill in the value 1.5, as a circular interpolation would.
One possible solution is to append the first row, interpolate, and then remove the appended row:
# DataFrame.append was removed in pandas 2.0, so use pd.concat
df = pd.concat([df, df.iloc[[0]]]).interpolate(method="linear").iloc[:-1]
print (df)
a b
0 1 1.0
1 2 2.0
2 3 1.5
EDIT:
More general solution:
df = pd.DataFrame.from_dict({"a": [1,2,3,4], "b":[np.nan,1,2,np.nan]})
df = pd.concat([df] * 3).interpolate(method="linear").iloc[len(df):-len(df)]
print (df)
a b
0 1 1.333333
1 2 1.000000
2 3 2.000000
3 4 1.666667
Or, to wrap around using only the last and first non-missing values:
df = pd.DataFrame.from_dict({"a": [1,2,3,4], "b":[np.nan,1,2,np.nan]})
df1 = df.ffill().iloc[[-1]]
df2 = df.bfill().iloc[[0]]
df = pd.concat([df1, df, df2]).interpolate(method="linear").iloc[1:-1]
print (df)
a b
0 1 1.5
1 2 1.0
2 3 2.0
3 4 1.5
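The tiling approach above can be wrapped in a small helper. This is only a sketch: the name interpolate_ring is hypothetical, and it assumes all columns are numeric and each column has at least one non-missing value.

```python
import numpy as np
import pandas as pd

def interpolate_ring(df):
    """Linearly interpolate NaNs, treating the rows as a closed ring."""
    n = len(df)
    # Stack three copies so the middle copy has neighbours on both sides.
    tiled = pd.concat([df] * 3, ignore_index=True)
    filled = tiled.interpolate(method="linear")
    # Keep only the middle copy and restore the original index.
    middle = filled.iloc[n:2 * n].copy()
    middle.index = df.index
    return middle

df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [np.nan, 1, 2, np.nan]})
out = interpolate_ring(df)
print(out)
```

Tiling triples the memory used, so for large frames the variant that pads with a single row on each side may be preferable.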

Pandas combining dataframes based on column value

I am trying to turn multiple dataframes into a single one based on the values in the first column, but not every dataframe has the same values in the first column. Take this example:
df1:
A 4
B 6
C 8
df2:
A 7
B 4
F 3
full_df:
A 4 7
B 6 4
C 8
F 3
How do I do this using python and pandas?
You can use pandas merge with outer join
df1.merge(df2,on =['first_column'],how='outer')
You can use pd.concat, remembering to align indices:
res = pd.concat([df1.set_index(0), df2.set_index(0)], axis=1)
print(res)
1 1
A 4.0 7.0
B 6.0 4.0
C 8.0 NaN
F NaN 3.0
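As a self-contained sketch of the outer merge, assuming the frames have named columns (the names key and value are made up for illustration, since the question does not give any):

```python
import pandas as pd

# Hypothetical column names; the question shows unnamed columns.
df1 = pd.DataFrame({"key": ["A", "B", "C"], "value": [4, 6, 8]})
df2 = pd.DataFrame({"key": ["A", "B", "F"], "value": [7, 4, 3]})

# An outer join keeps keys that appear in either frame; suffixes
# disambiguate the two overlapping "value" columns.
full_df = (
    df1.merge(df2, on="key", how="outer", suffixes=("_df1", "_df2"))
    .sort_values("key")
    .reset_index(drop=True)
)
print(full_df)
```

Keys missing from one frame get NaN in that frame's column, which also turns the integer columns into floats.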

Pandas: how to join two dataframes combinatorially [duplicate]

This question already has answers here:
cartesian product in pandas
(13 answers)
Closed 4 years ago.
I have two dataframes that I would like to combine combinatorial-wise (i.e. combinatorially join each row from one df to each row of another df). I can do this by merging on 'key's but my solution is clearly cumbersome. I'm looking for a more straightforward, even pythonesque way of handling this operation. Any suggestions?
MWE:
fred = pd.DataFrame({'A':[1., 4.],'B':[2., 5.], 'C':[3., 6.]})
print(fred)
A B C
0 1.0 2.0 3.0
1 4.0 5.0 6.0
jim = pd.DataFrame({'one':['a', 'c'],'two':['b', 'd']})
print(jim)
one two
0 a b
1 c d
fred['key'] = [1,2]
jim1 = jim.copy()
jim1['key'] = 1
jim2 = jim.copy()
jim2['key'] = 2
jim3 = pd.concat([jim1, jim2])
jack = pd.merge(fred, jim3, on='key').drop(['key'], axis=1)
print(jack)
A B C one two
0 1.0 2.0 3.0 a b
1 1.0 2.0 3.0 c d
2 4.0 5.0 6.0 a b
3 4.0 5.0 6.0 c d
You can join every row of fred with every row of jim by merging on a key column which is equal to the same value (say, 1) for every row:
In [16]: pd.merge(fred.assign(key=1), jim.assign(key=1), on='key').drop('key', axis=1)
Out[16]:
A B C one two
0 1.0 2.0 3.0 a b
1 1.0 2.0 3.0 c d
2 4.0 5.0 6.0 a b
3 4.0 5.0 6.0 c d
Are you looking for the cartesian product of the two dataframes, like a cross join?
It is answered here.
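In newer pandas versions the dummy key is not needed. The sketch below assumes pandas 1.2 or later, where merge accepts how="cross":

```python
import pandas as pd

fred = pd.DataFrame({"A": [1., 4.], "B": [2., 5.], "C": [3., 6.]})
jim = pd.DataFrame({"one": ["a", "c"], "two": ["b", "d"]})

# how="cross" pairs every row of fred with every row of jim
# (requires pandas >= 1.2; no key column is involved).
jack = fred.merge(jim, how="cross")
print(jack)
```

The result has len(fred) * len(jim) rows, so this grows quickly for large inputs.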

Merge dataframes without duplicating rows in python pandas [duplicate]

This question already has answers here:
Pandas left join on duplicate keys but without increasing the number of columns
(2 answers)
Closed 4 years ago.
I'd like to combine two dataframes using their similar column 'A':
>>> df1
A B
0 I 1
1 I 2
2 II 3
>>> df2
A C
0 I 4
1 II 5
2 III 6
To do so I tried using:
merged = pd.merge(df1, df2, on='A', how='outer')
Which returned:
>>> merged
A B C
0 I 1.0 4
1 I 2.0 4
2 II 3.0 5
3 III NaN 6
However, since df2 only contained one value for A == 'I', I do not want this value to be duplicated in the merged dataframe. Instead I would like the following output:
>>> merged
A B C
0 I 1.0 4
1 I 2.0 NaN
2 II 3.0 5
3 III NaN 6
What is the best way to do this? I am new to python and still slightly confused with all the join/merge/concatenate/append operations.
Create a helper column g with cumcount, which numbers the rows within each value of A, and let it act as a second merge key:
df1['g']=df1.groupby('A').cumcount()
df2['g']=df2.groupby('A').cumcount()
df1.merge(df2,how='outer').drop(columns='g')
Out[62]:
A B C
0 I 1.0 4.0
1 I 2.0 NaN
2 II 3.0 5.0
3 III NaN 6.0
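The same idea can be written without modifying df1 and df2, and with the merge keys spelled out. This is a sketch that rebuilds the question's frames so it runs on its own:

```python
import pandas as pd

df1 = pd.DataFrame({"A": ["I", "I", "II"], "B": [1, 2, 3]})
df2 = pd.DataFrame({"A": ["I", "II", "III"], "C": [4, 5, 6]})

# cumcount gives 0 for the first row of each A value, 1 for the second, ...
left = df1.assign(g=df1.groupby("A").cumcount())
right = df2.assign(g=df2.groupby("A").cumcount())

# Matching on (A, g) pairs the n-th "I" in df1 with the n-th "I" in df2,
# so a df2 value is used at most once.
merged = left.merge(right, on=["A", "g"], how="outer").drop(columns="g")
print(merged)
```

Because the second "I" row in df1 has g equal to 1 and df2 has no such row, its C value stays NaN instead of repeating 4.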

Python - divide data frame by a list of numbers, zero included

I have a data frame with 10 columns. I want to divide each column with a different number. How to divide the data frame by the list of numbers? Also there are zeros in the list, and if divided by zero I want the numbers in that column to be 1. How to do this?
Thanks
Given the dataframe df and the list lst as a NumPy array:
df = pd.DataFrame(np.random.rand(10, 10))
lst = np.array([1, 0, 1, 0, 1, 0, 1, 0, 1, 0])
Then we can use a mask to filter. By using a mask, we can use boolean slicing to get at just the columns that have corresponding zero values in lst. We can also easily access the non zeros with ~m and slice.
m = lst == 0
# assign the number 1 to all columns where there is a zero in lst
df.loc[:, m] = 1.0
# divide the remaining columns by their non-zero entries in lst
# (writing through df.values is not guaranteed to modify df)
df.loc[:, ~m] = df.loc[:, ~m] / lst[~m]
print(df)
0 1 2 3 4 5
0 0.195316 1.0 0.988503 1.0 0.981752 1.0
1 0.136812 1.0 0.887689 1.0 0.346385 1.0
2 0.927454 1.0 0.733464 1.0 0.773818 1.0
3 0.782234 1.0 0.363441 1.0 0.295135 1.0
4 0.751046 1.0 0.442886 1.0 0.700396 1.0
5 0.028402 1.0 0.724199 1.0 0.047674 1.0
6 0.680154 1.0 0.974464 1.0 0.717932 1.0
7 0.636310 1.0 0.191252 1.0 0.777813 1.0
8 0.766330 1.0 0.975292 1.0 0.224856 1.0
9 0.335766 1.0 0.093384 1.0 0.547195 1.0
You can use div and then set to 1 the columns whose divisor in L is 0:
df = pd.DataFrame({'A':[1,2,3],
'B':[4,5,6],
'C':[7,8,9],
'D':[1,3,5],
'E':[5,3,6],
'F':[7,4,3]})
print (df)
A B C D E F
0 1 4 7 1 5 7
1 2 5 8 3 3 4
2 3 6 9 5 6 3
L = [0,1,2,3,0,3]
s = pd.Series(L, index=df.columns)
df1 = df.div(s)
df1[s.index[s == 0]] = 1
print (df1)
A B C D E F
0 1.0 4.0 3.5 0.333333 1.0 2.333333
1 1.0 5.0 4.0 1.000000 1.0 1.333333
2 1.0 6.0 4.5 1.666667 1.0 1.000000
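The div approach passes through inf (or NaN) values before they are overwritten. A variant that avoids the division by zero altogether could look like the sketch below; the helper name divide_columns and its fill parameter are hypothetical:

```python
import pandas as pd

def divide_columns(df, divisors, fill=1.0):
    """Divide each column by its divisor; columns with divisor 0 become `fill`."""
    s = pd.Series(divisors, index=df.columns, dtype=float)
    zero = (s == 0).to_numpy()
    # Temporarily divide the zero-divisor columns by 1 to avoid inf/NaN.
    out = df.div(s.mask(s == 0, 1.0))
    # Overwrite those columns with the fill value.
    out.loc[:, zero] = fill
    return out

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9],
                   "D": [1, 3, 5], "E": [5, 3, 6], "F": [7, 4, 3]})
df1 = divide_columns(df, [0, 1, 2, 3, 0, 3])
print(df1)
```

It assumes the list of divisors has one entry per column, in column order.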
