Pandas: how to join two dataframes combinatorially [duplicate] - python

This question already has answers here:
cartesian product in pandas
(13 answers)
Closed 4 years ago.
I have two dataframes that I would like to combine combinatorially (i.e. join each row of one df to each row of the other df). I can do this by merging on 'key' columns, but my solution is clearly cumbersome. I'm looking for a more straightforward, more pythonic way of handling this operation. Any suggestions?
MWE:
fred = pd.DataFrame({'A':[1., 4.],'B':[2., 5.], 'C':[3., 6.]})
print(fred)
A B C
0 1.0 2.0 3.0
1 4.0 5.0 6.0
jim = pd.DataFrame({'one':['a', 'c'],'two':['b', 'd']})
print(jim)
one two
0 a b
1 c d
fred['key'] = [1,2]
jim1 = jim.copy()
jim1['key'] = 1
jim2 = jim.copy()
jim2['key'] = 2
jim3 = jim1.append(jim2)
jack = pd.merge(fred, jim3, on='key').drop(['key'], axis=1)
print(jack)
A B C one two
0 1.0 2.0 3.0 a b
1 1.0 2.0 3.0 c d
2 4.0 5.0 6.0 a b
3 4.0 5.0 6.0 c d

You can join every row of fred with every row of jim by merging on a key column which is equal to the same value (say, 1) for every row:
In [16]: pd.merge(fred.assign(key=1), jim.assign(key=1), on='key').drop('key', axis=1)
Out[16]:
A B C one two
0 1.0 2.0 3.0 a b
1 1.0 2.0 3.0 c d
2 4.0 5.0 6.0 a b
3 4.0 5.0 6.0 c d
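On pandas 1.2 and newer you can skip the dummy key entirely: merge supports how='cross' for exactly this. A minimal sketch, assuming a recent pandas version:
import pandas as pd

fred = pd.DataFrame({'A': [1., 4.], 'B': [2., 5.], 'C': [3., 6.]})
jim = pd.DataFrame({'one': ['a', 'c'], 'two': ['b', 'd']})

# cross join: pairs every row of fred with every row of jim (pandas >= 1.2)
jack = fred.merge(jim, how='cross')
print(jack)
#      A    B    C one two
# 0  1.0  2.0  3.0   a   b
# 1  1.0  2.0  3.0   c   d
# 2  4.0  5.0  6.0   a   b
# 3  4.0  5.0  6.0   c   d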

Are you looking for the cartesian product of the two dataframes, like a cross join?
It is answered in the duplicate linked above (cartesian product in pandas).

Related

Python: How can I extend a DataFrame with multiple fields calculated from a column

I have a dataframe which looks like:
A B
0 2.0 'C=4;D=5;'
1 2.0 'C=4;D=5;'
2 2.0 'C=4;D=5;'
I can parse the string in column B, let's say using a function named parse_col(), into a dict that looks like:
{'C': 4, 'D': 5}
How can I add the 2 extra columns to the dataframe so it would look like this:
A B C D
0 2.0 'C=4;D=5;' 4 5
1 2.0 'C=4;D=5;' 4 5
2 2.0 'C=4;D=5;' 4 5
I can take only the specific column, parse it and add the result back, but that's clearly not the best way.
I also tried a variation of the example in the pandas apply documentation, but I didn't manage to make it work on only a specific column.
We can use Series.str.extractall and then chain it with unstack to pivot the rows to columns:
df[['C', 'D']] = df['B'].str.extractall(r'(\d+)').unstack()
A B C D
0 2.0 'C=4;D=5;' 4 5
1 2.0 'C=4;D=5;' 4 5
2 2.0 'C=4;D=5;' 4 5
You can use df.eval and functools.reduce; this way the column names are read directly from the extracted strings:
>>> from functools import reduce
>>> # each extracted string like 'C=4' is an assignment expression, and
>>> # df.eval('C=4') returns a copy of the frame with that column added
>>> reduce(
...     lambda x, y: x.eval(y),
...     df.B.str.extractall(r'([A-Za-z]=\d+)').unstack().xs(0),
...     df,
... )
A B C D
0 2.0 'C=4;D=5;' 4 5
1 2.0 'C=4;D=5;' 4 5
2 2.0 'C=4;D=5;' 4 5
You can use named capture groups to extract the column name and the value associated with it. Then reshape and join it back.
df1 = (df['B'].str.extractall(r'(?P<col>[A-Za-z]+)=(?P<val>\d+);')
              .reset_index(1, drop=True)
              .pivot(columns='col', values='val'))
pd.concat([df, df1], axis=1)
A B C D
0 2.0 C=4;D=5; 4 5
1 2.0 C=4;D=5; 4 5
2 2.0 C=4;D=5; 4 5
One added benefit of this method is that it's a bit safer if column 'B' can encode an arbitrary number of columns you need to assign. More importantly, the extraction of Column=Number pairs will be correct even if the values are unordered in column 'B'. Here's an extended example:
print(df)
A B
0 2.0 C=4;D=5;
1 2.0 C=4;D=5;
2 2.0 C=4;D=5;
3 2.0 D=5;E=7;C=12;
4 2.0 D=1;C=4;
df1 = (df['B'].str.extractall(r'(?P<col>[A-Za-z]+)=(?P<val>\d+);')
              .reset_index(1, drop=True)
              .pivot(columns='col', values='val'))
pd.concat([df, df1], axis=1)
# A B C D E
#0 2.0 C=4;D=5; 4 5 NaN
#1 2.0 C=4;D=5; 4 5 NaN
#2 2.0 C=4;D=5; 4 5 NaN
#3 2.0 D=5;E=7;C=12; 12 5 7
#4 2.0 D=1;C=4; 4 1 NaN
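If, as the question suggests, you already have a parse_col() that returns a dict per row, you can also expand those dicts straight into columns. A minimal sketch (parse_col here is a hypothetical stand-in, assuming each field appears at most once per row):
import pandas as pd

def parse_col(s):
    # hypothetical parser: 'C=4;D=5;' -> {'C': 4, 'D': 5}
    return {k: int(v) for k, v in
            (pair.split('=') for pair in s.strip(';').split(';'))}

df = pd.DataFrame({'A': [2.0, 2.0, 2.0], 'B': ['C=4;D=5;'] * 3})
# apply the parser, turn the dicts into columns, and join back on the index
df = df.join(pd.DataFrame(df['B'].apply(parse_col).tolist(), index=df.index))
print(df)
#      A         B  C  D
# 0  2.0  C=4;D=5;  4  5
# 1  2.0  C=4;D=5;  4  5
# 2  2.0  C=4;D=5;  4  5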

Custom expanding function with raw=False

Consider the following dataframe:
df = pd.DataFrame({
    'a': np.arange(1, 5),
    'b': np.arange(1, 5) * 2,
    'c': np.arange(1, 5) * 3
})
a b c
0 1 2 3
1 2 4 6
2 3 6 9
3 4 8 12
I want to calculate the cumulative sum for each row across the columns:
def expanding_func(s):
    return s.sum()
df.expanding(1, axis=1).apply(expanding_func, raw=True)
# As expected:
a b c
0 1.0 3.0 6.0
1 2.0 6.0 12.0
2 3.0 9.0 18.0
3 4.0 12.0 24.0
However, if I set raw=False, expanding_func no longer works:
df.expanding(1, axis=1).apply(expanding_func, raw=False)
ValueError: Length of passed values is 3, index implies 4
The documentation says expanding_func
Must produce a single value from an ndarray input if raw=True or a single value from a Series if raw=False.
And that is exactly what I was doing. Why did expanding_func fail when raw=False?
Note: this is only a contrived example. I want to know how to write a custom rolling function, not how to calculate the cumulative sum across columns.
It seems this is a bug with pandas.
If you do:
df.iloc[:3].expanding(1, axis=1).apply(expanding_func, raw=False)
It actually works. It seems that when the values are passed as a Series, pandas checks the length of the result against the number of rows of the dataframe for some reason (it should compare it against the number of columns of the df).
A workaround is to transpose the df, apply your function, and transpose back, which seems to work. The bug only seems to show up when axis is set to 1.
df.T.expanding(1, axis=0).apply(expanding_func, raw=False).T
a b c
0 1.0 3.0 6.0
1 2.0 6.0 12.0
2 3.0 9.0 18.0
3 4.0 12.0 24.0
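If the double transpose feels fragile, another workaround is to build the expanding windows by hand and apply the custom function to each prefix of every row. This is a sketch, not pandas API; expanding_apply_rowwise is a name introduced here:
import numpy as np
import pandas as pd

def expanding_apply_rowwise(df, func):
    # for each row, call func on every expanding prefix of that row
    return df.apply(
        lambda row: pd.Series(
            [func(row.iloc[:i + 1]) for i in range(len(row))],
            index=row.index,
        ),
        axis=1,
    )

df = pd.DataFrame({'a': np.arange(1, 5),
                   'b': np.arange(1, 5) * 2,
                   'c': np.arange(1, 5) * 3})
print(expanding_apply_rowwise(df, lambda s: s.sum()))
#    a   b   c
# 0  1   3   6
# 1  2   6  12
# 2  3   9  18
# 3  4  12  24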
You don't need to set raw to False or True at all; just do it the simple way:
df.expanding(0, axis=1).apply(expanding_func)
a b c
0 1.0 3.0 6.0
1 2.0 6.0 12.0
2 3.0 9.0 18.0
3 4.0 12.0 24.0

groupby on subset of a multi index

I have a dataframe (df) with a multi index consisting of 3 indexes, 'A', 'B', and 'C' say, and I have a column called Quantity containing floats.
What I would like to do is perform a groupby on 'A' and 'B', summing the values in Quantity. How would I do this? The standard way of working does not apply because pandas does not recognize the indexes as columns, and if I use something like
df.groupby(level=0).sum()
it seems I can only select a single level. How would one go about this?
You can specify multiple levels like:
df.groupby(level=[0, 1]).sum()
#alternative
df.groupby(level=['A','B']).sum()
Or pass parameter level to sum:
df.sum(level=[0, 1])
#alternative
df.sum(level=['A','B'])
Sample:
df = pd.DataFrame({'A': [1, 1, 2, 2, 3],
                   'B': [3] * 5,
                   'C': [3, 4, 5, 4, 5],
                   'Quantity': [1.0, 3, 4, 5, 6]}).set_index(['A', 'B', 'C'])
print (df)
Quantity
A B C
1 3 3 1.0
4 3.0
2 3 5 4.0
4 5.0
3 3 5 6.0
df1 = df.groupby(level=[0, 1]).sum()
print (df1)
Quantity
A B
1 3 4.0
2 3 9.0
3 3 6.0
df1 = df.groupby(level=['A','B']).sum()
print (df1)
Quantity
A B
1 3 4.0
2 3 9.0
3 3 6.0
df1 = df.sum(level=[0, 1])
print (df1)
Quantity
A B
1 3 4.0
2 3 9.0
3 3 6.0
df1 = df.sum(level=['A','B'])
print (df1)
Quantity
A B
1 3 4.0
2 3 9.0
3 3 6.0
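A caveat for newer pandas, to the best of my knowledge: the level parameter of sum() was deprecated in pandas 1.3 and removed in 2.0, so on recent versions only the groupby form still works:
# pandas >= 2.0: df.sum(level=[0, 1]) raises a TypeError; use groupby instead
df1 = df.groupby(level=['A', 'B']).sum()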

Merge dataframes without duplicating rows in python pandas [duplicate]

This question already has answers here:
Pandas left join on duplicate keys but without increasing the number of columns
(2 answers)
Closed 4 years ago.
I'd like to combine two dataframes using their similar column 'A':
>>> df1
A B
0 I 1
1 I 2
2 II 3
>>> df2
A C
0 I 4
1 II 5
2 III 6
To do so I tried using:
merged = pd.merge(df1, df2, on='A', how='outer')
Which returned:
>>> merged
A B C
0 I 1.0 4
1 I 2.0 4
2 II 3.0 5
3 III NaN 6
However, since df2 only contained one value for A == 'I', I do not want this value to be duplicated in the merged dataframe. Instead I would like the following output:
>>> merged
A B C
0 I 1.0 4
1 I 2.0 NaN
2 II 3.0 5
3 III NaN 6
What is the best way to do this? I am new to python and still slightly confused with all the join/merge/concatenate/append operations.
Let us create a new helper variable g using cumcount:
df1['g'] = df1.groupby('A').cumcount()
df2['g'] = df2.groupby('A').cumcount()
df1.merge(df2, how='outer').drop('g', axis=1)
Out[62]:
A B C
0 I 1.0 4.0
1 I 2.0 NaN
2 II 3.0 5.0
3 III NaN 6.0
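For intuition, this is what cumcount adds: an occurrence number within each 'A' group, so the merge effectively matches on ('A', g) pairs and each df2 row is consumed at most once:
print(df1)
#      A  B  g
# 0    I  1  0
# 1    I  2  1
# 2   II  3  0
print(df2)
#      A  C  g
# 0    I  4  0
# 1   II  5  0
# 2  III  6  0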

Correlation matrix for two Pandas dataframes [duplicate]

This question already has answers here:
Calculate correlation between all columns of a DataFrame and all columns of another DataFrame?
(4 answers)
Closed 6 years ago.
Say I have two dataframes:
df1        df2
 A  B       C   D
 1  3      -2   7
 2  4       0  10
I need to create a correlation matrix which consists of columns from two dataframes.
corrmat_df
   C  D
A  1  *
B  *  1
(* stands for correlation)
I can do it elementwise in a nested loop, but maybe there is a more pythonic way?
Thanks.
Simply combine the dataframes and use .corr():
result = pd.concat([df1, df2], axis=1).corr()
# A B C D
#A 1.0 1.0 1.0 1.0
#B 1.0 1.0 1.0 1.0
#C 1.0 1.0 1.0 1.0
#D 1.0 1.0 1.0 1.0
The result contains all wanted (and also some unwanted) correlations. E.g.:
result[['C','D']].loc[['A','B']]
# C D
#A 1.0 1.0
#B 1.0 1.0
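If you only want the cross block without computing the within-frame correlations, you can also correlate the columns pairwise with Series.corr. A minimal sketch, using the example's column names:
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'C': [-2, 0], 'D': [7, 10]})

# correlate every column of df1 with every column of df2
corrmat_df = pd.DataFrame({c2: {c1: df1[c1].corr(df2[c2])
                                for c1 in df1.columns}
                           for c2 in df2.columns})
print(corrmat_df)
#      C    D
# A  1.0  1.0
# B  1.0  1.0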
