I have 4 dataframes:
df1
a b
1 0 3
2 1 4
df2
a b
1 0 5
2 0 6
3 1 7
df3
a b
1 0 2
2 1 6
3 1 5
...
Within groups of 'a', I want to merge all 4 dataframes on 'a' and keep all the values by putting them into additional columns. The merge of df1 and df2 should look like:
a b1 b2
1 0 3 5
2 0 3 6
3 1 4 7
Merge of df1, df2, df3:
a b1 b2 b3
1 0 3 5 2
2 0 3 6 2
3 1 4 7 6
4 1 4 7 5
I tried:
df1.assign(dummy=1).merge(df2.assign(dummy=1), on='dummy', how='outer').drop('dummy', axis=1)
but this ignores the groups and 'a' disappears.
This is not a Cartesian product, but a simple merge across multiple dataframes.
Try this:
In [846]: df1.merge(df2, on='a').merge(df3, on='a').rename(columns={'b_x':'b1', 'b_y':'b2', 'b':'b3'})
Out[846]:
a b1 b2 b3
0 0 3 5 2
1 0 3 6 2
2 1 4 7 6
3 1 4 7 5
Or, if the number of dataframes can grow, you can do this:
In [851]: from functools import reduce
In [852]: reduce(lambda x,y: pd.merge(x,y, on='a'), [df1, df2, df3])
Out[852]:
a b_x b_y b
0 0 3 5 2
1 0 3 6 2
2 1 4 7 6
3 1 4 7 5
Modify the b column name before merging, then use reduce to expand to an arbitrary number of dataframes.
from functools import reduce
dfs = [df.rename(columns={'b':f'b{num+1}'}) for num, df in enumerate([df1, df2, df3])]
reduce(lambda x,y: pd.merge(x,y), dfs)
Note that by default, pd.merge joins on the shared columns, which here is just 'a'.
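Putting both ideas together, here is a self-contained sketch you can paste and run (the sample values are taken from the tables in the question, reading the leading column there as the row index, which doesn't affect the merge):
from functools import reduce
import pandas as pd

# Sample frames from the question.
df1 = pd.DataFrame({'a': [0, 1], 'b': [3, 4]})
df2 = pd.DataFrame({'a': [0, 0, 1], 'b': [5, 6, 7]})
df3 = pd.DataFrame({'a': [0, 1, 1], 'b': [2, 6, 5]})

# Give each 'b' column its own name so nothing gets suffixed or overwritten.
dfs = [df.rename(columns={'b': f'b{i + 1}'}) for i, df in enumerate([df1, df2, df3])]

# Fold the list with pd.merge on 'a'; extra dataframes can simply be appended to the list.
result = reduce(lambda left, right: pd.merge(left, right, on='a'), dfs)
print (result)
a b1 b2 b3
0 0 3 5 2
1 0 3 6 2
2 1 4 7 6
3 1 4 7 5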
Use pd.DataFrame.join:
First, set the index of each dataframe to 'a'; you can use a list comprehension to do this in place:
[i.set_index('a', inplace=True) for i in [df1, df2, df3]]
Next, use join:
df1.join([df2, df3])
Output ('a' is now the index):
b_x b_y b
a
0 3 5 2
0 3 6 2
1 4 7 6
1 4 7 5
From the dataframe
import pandas as pd
df1 = pd.DataFrame({'A':[1,1,1,1,2,2,2,2],'B':[1,2,3,4,5,6,7,8]})
print(df1)
A B
0 1 1
1 1 2
2 1 3
3 1 4
4 2 5
5 2 6
6 2 7
7 2 8
I want to pop 2 rows where 'A' == 2, preferably in a single statement like
df2 = df1.somepopfunction(...)
to generate the following result:
print(df1)
A B
0 1 1
1 1 2
2 1 3
3 1 4
4 2 7
5 2 8
print(df2)
A B
0 2 5
1 2 6
The pandas pop function sounds promising, but it only pops complete columns.
What statement can replace the pseudocode
df2 = df1.somepopfunction(...)
to generate the desired results?
A pop function for removing rows does not exist in pandas; you need to filter first and then drop the filtered rows from df1:
df2 = df1[df1.A.eq(2)].head(2)
print (df2)
A B
4 2 5
5 2 6
df1 = df1.drop(df2.index)
print (df1)
A B
0 1 1
1 1 2
2 1 3
3 1 4
6 2 7
7 2 8
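If you want the one-statement feel of the pseudocode, you can wrap the filter-then-drop pattern above in a small helper; pop_rows is just a hypothetical name here, not a pandas method:
import pandas as pd

def pop_rows(df, mask, n=None):
    # Return the rows selected by mask (optionally only the first n) and drop them from df in place.
    popped = df[mask] if n is None else df[mask].head(n)
    df.drop(popped.index, inplace=True)
    return popped

df1 = pd.DataFrame({'A': [1, 1, 1, 1, 2, 2, 2, 2],
                    'B': [1, 2, 3, 4, 5, 6, 7, 8]})
df2 = pop_rows(df1, df1.A.eq(2), n=2)   # df2 gets the first two A == 2 rows; df1 keeps the rest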
If I have a pandas data frame like this:
Col A Col B Col C
1 4 3
1 4 5
2 3 7
2 4 6
1 6 6
1 6 4
When values in Column B repeat (are consecutive), I want to keep the row with the minimum value in Column C, so that I get a pandas data frame like this:
Col A Col B Col C
1 4 3
2 3 7
2 4 6
1 6 4
It's okay if values in Column B repeat; they just can't be consecutive.
IIUC, sort_values + drop_duplicates:
Yourdf = df.sort_values(['Col C']).drop_duplicates(['Col A', 'Col B']).sort_index()
Col A Col B Col C
0 1 4 3
2 2 3 7
3 2 4 6
5 1 6 4
All the other answers seem to overlook that values in Column B only count as repeats when they are consecutive, so here's my approach:
# A new block starts every time 'Col B' changes from the previous row.
B_blocks = df['Col B'].ne(df['Col B'].shift()).cumsum()
# Within each consecutive block, find the index of the row with the smallest 'Col C'.
min_idx = df.groupby(B_blocks)['Col C'].idxmin()
df.loc[min_idx]
Output:
Col A Col B Col C
0 1 4 3
2 2 3 7
3 2 4 6
5 1 6 4
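For reference, here is the same block-id approach as a runnable sketch, using the sample data from the question:
import pandas as pd

df = pd.DataFrame({'Col A': [1, 1, 2, 2, 1, 1],
                   'Col B': [4, 4, 3, 4, 6, 6],
                   'Col C': [3, 5, 7, 6, 6, 4]})

B_blocks = df['Col B'].ne(df['Col B'].shift()).cumsum()   # block id per consecutive run of 'Col B'
min_idx = df.groupby(B_blocks)['Col C'].idxmin()          # row with the smallest 'Col C' in each block
print (df.loc[min_idx])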
You can also use DataFrame.sort_values + GroupBy.first:
g = df['Col B'].ne(df['Col B'].shift()).cumsum()
new_df = df.sort_values('Col C').groupby(g).first().reset_index(drop=True)
print(new_df)
Col A Col B Col C
0 1 4 3
1 2 3 7
2 2 4 6
3 1 6 4
I have two data frames that share a column but hold different values; some of the values are common to both and some are not. I want to compare the shared column and keep only the rows with common values.
df1 :
A B C
1 1 1
2 4 6
3 7 9
4 9 0
6 0 1
df2 :
A D E
1 5 7
5 6 9
2 3 5
7 6 8
3 7 0
This is what I am expecting after comparison
df2 :
A D E
1 5 7
2 3 5
3 7 0
You can use pd.Index.intersection() to find the matching columns, do an inner merge, and finally reindex() to keep df2.columns:
match=df2.columns.intersection(df1.columns).tolist() #finds matching cols in both df
df2.merge(df1,on=match).reindex(df2.columns,axis=1) #merge and reindex to df2.columns
A D E
0 1 5 7
1 2 3 5
2 3 7 0
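If 'A' is the only shared column you care about, a simpler sketch with Series.isin gives the same result for this data (an alternative to the merge above, not a replacement when several columns overlap):
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3, 4, 6], 'B': [1, 4, 7, 9, 0], 'C': [1, 6, 9, 0, 1]})
df2 = pd.DataFrame({'A': [1, 5, 2, 7, 3], 'D': [5, 6, 3, 6, 7], 'E': [7, 9, 5, 8, 0]})

# Keep only the rows of df2 whose 'A' value also occurs in df1.
out = df2[df2['A'].isin(df1['A'])].reset_index(drop=True)
print (out)
A D E
0 1 5 7
1 2 3 5
2 3 7 0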
I have two dataframes df1 and df2
df1
A B
0 4 2
1 3 3
2 1 2
df2
B AB C
0 4 8 3
1 3 9 2
2 1 2 4
I would like to join them so that only the columns df1 does not already have are added:
df3
A B AB C
0 4 2 8 3
1 3 3 9 2
2 1 2 2 4
Use Index.isin with inverse mask or Index.difference:
df22 = df2.loc[:, ~df2.columns.isin(df1.columns)]
df = df1.join(df22)
Or:
df22 = df2[df2.columns.difference(df1.columns)]
df = df1.join(df22)
print (df)
A B AB C
0 4 2 8 3
1 3 3 9 2
2 1 2 2 4
You can also use the merge functions as an alternate solution; this works here because df1['A'] and df2['B'] happen to contain the same key values:
df3 = pd.merge(df1, df2, left_on='A', right_on='B', how='left', suffixes=('', '_')).drop('B_', axis=1)
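A concat-based sketch also works, assuming the two frames are aligned on the same row index as in the sample data:
import pandas as pd

df1 = pd.DataFrame({'A': [4, 3, 1], 'B': [2, 3, 2]})
df2 = pd.DataFrame({'B': [4, 3, 1], 'AB': [8, 9, 2], 'C': [3, 2, 4]})

# Take only the df2 columns that df1 does not already have, then align on the index.
extra = df2[df2.columns.difference(df1.columns)]
df3 = pd.concat([df1, extra], axis=1)
print (df3)
A B AB C
0 4 2 8 3
1 3 3 9 2
2 1 2 2 4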
I have a dataframe
df
A B
0 1 4
1 2 5
2 3 6
For further processing, it would be more convenient to have the df restructured as follows:
letters numbers
0 A 1
1 A 2
2 A 3
3 B 4
4 B 5
5 B 6
How can I achieve that?
Use unstack with reset_index:
# Stack the columns into one Series (index: column label, then old row index),
# drop the old row index level, and turn the column labels into a regular column.
df = df.unstack().reset_index(level=1, drop=True).reset_index()
df.columns = ['letters', 'numbers']
print (df)
letters numbers
0 A 1
1 A 2
2 A 3
3 B 4
4 B 5
5 B 6
Or numpy.concatenate + numpy.repeat + DataFrame (concatenate the transposed values so they stay grouped by column):
import numpy as np

a = np.concatenate(df.values.T)            # [1, 2, 3, 4, 5, 6], column A first, then column B
b = np.repeat(df.columns, len(df.index))   # ['A', 'A', 'A', 'B', 'B', 'B']
df = pd.DataFrame({'letters': b, 'numbers': a})
print (df)
letters numbers
0 A 1
1 A 2
2 A 3
3 B 4
4 B 5
5 B 6
Probably simplest to melt:
In [36]: pd.melt(df, var_name="letters", value_name="numbers")
Out[36]:
letters numbers
0 A 1
1 A 2
2 A 3
3 B 4
4 B 5
5 B 6
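In recent pandas versions the same reshape is also available as a DataFrame method, which reads a little more naturally in a chain:
df.melt(var_name="letters", value_name="numbers")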