I have 4 dataframes:
df1
a b
1 0 3
2 1 4
df2
a b
1 0 5
2 0 6
3 1 7
df3
a b
1 0 2
2 1 6
3 1 5
...
Within groups of 'a', I want to merge all 4 dataframes on 'a' and keep all the values by putting them into additional columns. The merge of df1 and df2 should look like:
a b1 b2
1 0 3 5
2 0 3 6
3 1 4 7
Merge of df1, df2, df3:
a b1 b2 b3
1 0 3 5 2
2 0 3 6 2
3 1 4 7 6
4 1 4 7 5
I tried:
df1.assign(dummy=1).merge(df2.assign(dummy=1), on='dummy', how='outer').drop('dummy', axis=1)
but this ignores the groups and 'a' disappears.
This is not a Cartesian product, but a simple merge across multiple dataframes.
Try this:
In [846]: df1.merge(df2, on='a').merge(df3, on='a').rename(columns={'b_x':'b1', 'b_y':'b2', 'b':'b3'})
Out[846]:
a b1 b2 b3
0 0 3 5 2
1 0 3 6 2
2 1 4 7 6
3 1 4 7 5
Or, if the number of dataframes can grow, you can do this:
In [851]: from functools import reduce
In [852]: reduce(lambda x,y: pd.merge(x,y, on='a'), [df1, df2, df3])
Out[852]:
a b_x b_y b
0 0 3 5 2
1 0 3 6 2
2 1 4 7 6
3 1 4 7 5
Modify the b column name before merging, then use reduce to expand to an arbitrary number of dataframes.
from functools import reduce
dfs = [df.rename(columns={'b':f'b{num+1}'}) for num, df in enumerate([df1, df2, df3])]
reduce(lambda x,y: pd.merge(x,y), dfs)
Note that by default, pd.merge joins on the shared columns, which here is just 'a'.
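Putting both ideas together, here is a self-contained sketch you can paste and run (the sample values are taken from the tables in the question, reading the leading column there as the row index, which doesn't affect the merge):
from functools import reduce
import pandas as pd

# Sample frames from the question.
df1 = pd.DataFrame({'a': [0, 1], 'b': [3, 4]})
df2 = pd.DataFrame({'a': [0, 0, 1], 'b': [5, 6, 7]})
df3 = pd.DataFrame({'a': [0, 1, 1], 'b': [2, 6, 5]})

# Give each 'b' column its own name so nothing gets suffixed or overwritten.
dfs = [df.rename(columns={'b': f'b{i + 1}'}) for i, df in enumerate([df1, df2, df3])]

# Fold the list with pd.merge on 'a'; extra dataframes can simply be appended to the list.
result = reduce(lambda left, right: pd.merge(left, right, on='a'), dfs)
print (result)
a b1 b2 b3
0 0 3 5 2
1 0 3 6 2
2 1 4 7 6
3 1 4 7 5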
Use pd.DataFrame.join:
First, set the index of each dataframe to 'a'; you can use a list comprehension to do this in place:
[i.set_index('a', inplace=True) for i in [df1, df2, df3]]
Next, use join:
df1.join([df2, df3])
Output ('a' is now the index):
b_x b_y b
a
0 3 5 2
0 3 6 2
1 4 7 6
1 4 7 5
From the dataframe
import pandas as pd
df1 = pd.DataFrame({'A':[1,1,1,1,2,2,2,2],'B':[1,2,3,4,5,6,7,8]})
print(df1)
A B
0 1 1
1 1 2
2 1 3
3 1 4
4 2 5
5 2 6
6 2 7
7 2 8
I want to pop 2 rows where 'A' == 2, preferably in a single statement like
df2 = df1.somepopfunction(...)
to generate the following result:
print(df1)
A B
0 1 1
1 1 2
2 1 3
3 1 4
4 2 7
5 2 8
print(df2)
A B
0 2 5
1 2 6
The pandas pop function sounds promising, but it only pops complete columns.
What statement can replace the pseudocode
df2 = df1.somepopfunction(...)
to generate the desired results?
A pop function for removing rows does not exist in pandas; you need to filter first and then drop the filtered rows from df1:
df2 = df1[df1.A.eq(2)].head(2)
print (df2)
A B
4 2 5
5 2 6
df1 = df1.drop(df2.index)
print (df1)
A B
0 1 1
1 1 2
2 1 3
3 1 4
6 2 7
7 2 8
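If you want the one-statement feel of the pseudocode, you can wrap the filter-then-drop pattern above in a small helper; pop_rows is just a hypothetical name here, not a pandas method:
import pandas as pd

def pop_rows(df, mask, n=None):
    # Return the rows selected by mask (optionally only the first n) and drop them from df in place.
    popped = df[mask] if n is None else df[mask].head(n)
    df.drop(popped.index, inplace=True)
    return popped

df1 = pd.DataFrame({'A': [1, 1, 1, 1, 2, 2, 2, 2],
                    'B': [1, 2, 3, 4, 5, 6, 7, 8]})
df2 = pop_rows(df1, df1.A.eq(2), n=2)   # df2 gets the first two A == 2 rows; df1 keeps the rest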
If I have a pandas data frame like this:
Col A Col B Col C
1 4 3
1 4 5
2 3 7
2 4 6
1 6 6
1 6 4
When values in Column B repeat (are consecutive), I want to keep the row with the minimum value in Column C, so that I get a pandas data frame like this:
Col A Col B Col C
1 4 3
2 3 7
2 4 6
1 6 4
It's okay if values in Column B repeat; they just can't be consecutive.
IIUC, sort_values + drop_duplicates:
Yourdf = df.sort_values(['Col C']).drop_duplicates(['Col A', 'Col B']).sort_index()
Col A Col B Col C
0 1 4 3
2 2 3 7
3 2 4 6
5 1 6 4
All the other answers seem to overlook that values in Column B only count as repeats when they are consecutive, so here's my approach:
# A new block starts every time 'Col B' changes from the previous row.
B_blocks = df['Col B'].ne(df['Col B'].shift()).cumsum()
# Within each consecutive block, find the index of the row with the smallest 'Col C'.
min_idx = df.groupby(B_blocks)['Col C'].idxmin()
df.loc[min_idx]
Output:
Col A Col B Col C
0 1 4 3
2 2 3 7
3 2 4 6
5 1 6 4
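For reference, here is the same block-id approach as a runnable sketch, using the sample data from the question:
import pandas as pd

df = pd.DataFrame({'Col A': [1, 1, 2, 2, 1, 1],
                   'Col B': [4, 4, 3, 4, 6, 6],
                   'Col C': [3, 5, 7, 6, 6, 4]})

B_blocks = df['Col B'].ne(df['Col B'].shift()).cumsum()   # block id per consecutive run of 'Col B'
min_idx = df.groupby(B_blocks)['Col C'].idxmin()          # row with the smallest 'Col C' in each block
print (df.loc[min_idx])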
You can also use DataFrame.sort_values + GroupBy.first:
g = df['Col B'].ne(df['Col B'].shift()).cumsum()
new_df = df.sort_values('Col C').groupby(g).first().reset_index(drop=True)
print(new_df)
Col A Col B Col C
0 1 4 3
1 2 3 7
2 2 4 6
3 1 6 4
I have two data frames that share a column but hold different values; some of the values are common to both and some are not. I want to compare the shared column and keep only the rows with common values.
df1 :
A B C
1 1 1
2 4 6
3 7 9
4 9 0
6 0 1
df2 :
A D E
1 5 7
5 6 9
2 3 5
7 6 8
3 7 0
This is what I am expecting after comparison
df2 :
A D E
1 5 7
2 3 5
3 7 0
You can use pd.Index.intersection() to find the matching columns, do an inner merge, and finally reindex() to keep df2.columns:
match=df2.columns.intersection(df1.columns).tolist() #finds matching cols in both df
df2.merge(df1,on=match).reindex(df2.columns,axis=1) #merge and reindex to df2.columns
A D E
0 1 5 7
1 2 3 5
2 3 7 0
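If 'A' is the only shared column you care about, a simpler sketch with Series.isin gives the same result for this data (an alternative to the merge above, not a replacement when several columns overlap):
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3, 4, 6], 'B': [1, 4, 7, 9, 0], 'C': [1, 6, 9, 0, 1]})
df2 = pd.DataFrame({'A': [1, 5, 2, 7, 3], 'D': [5, 6, 3, 6, 7], 'E': [7, 9, 5, 8, 0]})

# Keep only the rows of df2 whose 'A' value also occurs in df1.
out = df2[df2['A'].isin(df1['A'])].reset_index(drop=True)
print (out)
A D E
0 1 5 7
1 2 3 5
2 3 7 0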
I have two dataframes df1 and df2
df1
A B
0 4 2
1 3 3
2 1 2
df2
B AB C
0 4 8 3
1 3 9 2
2 1 2 4
I would like to join them so that only the columns df1 does not already have are added:
df3
A B AB C
0 4 2 8 3
1 3 3 9 2
2 1 2 2 4
Use Index.isin with inverse mask or Index.difference:
df22 = df2.loc[:, ~df2.columns.isin(df1.columns)]
df = df1.join(df22)
Or:
df22 = df2[df2.columns.difference(df1.columns)]
df = df1.join(df22)
print (df)
A B AB C
0 4 2 8 3
1 3 3 9 2
2 1 2 2 4
You can also use the merge functions as an alternate solution; this works here because df1['A'] and df2['B'] happen to contain the same key values:
df3 = pd.merge(df1, df2, left_on='A', right_on='B', how='left', suffixes=('', '_')).drop('B_', axis=1)
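A concat-based sketch also works, assuming the two frames are aligned on the same row index as in the sample data:
import pandas as pd

df1 = pd.DataFrame({'A': [4, 3, 1], 'B': [2, 3, 2]})
df2 = pd.DataFrame({'B': [4, 3, 1], 'AB': [8, 9, 2], 'C': [3, 2, 4]})

# Take only the df2 columns that df1 does not already have, then align on the index.
extra = df2[df2.columns.difference(df1.columns)]
df3 = pd.concat([df1, extra], axis=1)
print (df3)
A B AB C
0 4 2 8 3
1 3 3 9 2
2 1 2 2 4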
I have a dataframe
df
A B
0 1 4
1 2 5
2 3 6
For further processing, it would be more convenient to have the df restructured as follows:
letters numbers
0 A 1
1 A 2
2 A 3
3 B 4
4 B 5
5 B 6
How can I achieve that?
Use unstack with reset_index:
# Stack the columns into one Series (index: column label, then old row index),
# drop the old row index level, and turn the column labels into a regular column.
df = df.unstack().reset_index(level=1, drop=True).reset_index()
df.columns = ['letters', 'numbers']
print (df)
letters numbers
0 A 1
1 A 2
2 A 3
3 B 4
4 B 5
5 B 6
Or numpy.concatenate + numpy.repeat + DataFrame (concatenate the transposed values so they stay grouped by column):
import numpy as np

a = np.concatenate(df.values.T)            # [1, 2, 3, 4, 5, 6], column A first, then column B
b = np.repeat(df.columns, len(df.index))   # ['A', 'A', 'A', 'B', 'B', 'B']
df = pd.DataFrame({'letters': b, 'numbers': a})
print (df)
letters numbers
0 A 1
1 A 2
2 A 3
3 B 4
4 B 5
5 B 6
Probably simplest to melt:
In [36]: pd.melt(df, var_name="letters", value_name="numbers")
Out[36]:
letters numbers
0 A 1
1 A 2
2 A 3
3 B 4
4 B 5
5 B 6
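In recent pandas versions the same reshape is also available as a DataFrame method, which reads a little more naturally in a chain:
df.melt(var_name="letters", value_name="numbers")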