My question is closely related to Pandas Merge - How to avoid duplicating columns, but not identical to it.
I want to concatenate the columns that differ across three dataframes. The dataframes share a column id and several identical columns. For example:
df1
id place name qty unit A
1 NY Tom 2 10 a
2 TK Ron 3 15 a
3 Lon Don 5 90 a
4 Hk Sam 4 49 a
df2
id place name qty unit B
1 NY Tom 2 10 b
2 TK Ron 3 15 b
3 Lon Don 5 90 b
4 Hk Sam 4 49 b
df3
id place name qty unit C D
1 NY Tom 2 10 c d
2 TK Ron 3 15 c d
3 Lon Don 5 90 c d
4 Hk Sam 4 49 c d
Result:
id place name qty unit A B C D
1 NY Tom 2 10 a b c d
2 TK Ron 3 15 a b c d
3 Lon Don 5 90 a b c d
4 Hk Sam 4 49 a b c d
The columns place, name, qty, and unit will always be part of the three dataframes; the names of the differing columns can vary (A, B, C, D in my example). The three dataframes have the same number of rows.
I have tried:
cols_to_use = df2.columns.difference(df1.columns)
dfNew = pd.merge(df1, df2[cols_to_use], left_index=True, right_index=True, how='outer')
The problem is that I get more rows than expected and columns renamed in the resulting dataframe (when using concat).
Using reduce from functools
from functools import reduce
reduce(lambda left,right: pd.merge(left,right), [df1,df2,df3])
Out[725]:
id place name qty unit A B C D
0 1 NY Tom 2 10 a b c d
1 2 TK Ron 3 15 a b c d
2 3 Lon Don 5 90 a b c d
3 4 Hk Sam 4 49 a b c d
You can use a nested merge:
merge_on = ['id','place','name','qty','unit']
df1.merge(df2, on = merge_on).merge(df3, on = merge_on)
id place name qty unit A B C D
0 1 NY Tom 2 10 a b c d
1 2 TK Ron 3 15 a b c d
2 3 Lon Don 5 90 a b c d
3 4 Hk Sam 4 49 a b c d
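Since the names of the shared columns can vary, the merge_on list can also be derived from the frames themselves instead of hard-coded. A minimal sketch with toy two-row frames (the data here is illustrative, not the original):

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2], 'place': ['NY', 'TK'], 'A': ['a', 'a']})
df2 = pd.DataFrame({'id': [1, 2], 'place': ['NY', 'TK'], 'B': ['b', 'b']})
df3 = pd.DataFrame({'id': [1, 2], 'place': ['NY', 'TK'], 'C': ['c', 'c'], 'D': ['d', 'd']})

# Derive the join keys from whatever columns all three frames share,
# instead of hard-coding the list
merge_on = df1.columns.intersection(df2.columns).intersection(df3.columns).tolist()
result = df1.merge(df2, on=merge_on).merge(df3, on=merge_on)
```

This keeps the nested-merge approach working even when the common columns change.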
Using concat with groupby and first:
pd.concat([df1, df2, df3], axis=1).groupby(level=0, axis=1).first()
A B C D id name place qty unit
0 a b c d 1 Tom NY 2 10
1 a b c d 2 Ron TK 3 15
2 a b c d 3 Don Lon 5 90
3 a b c d 4 Sam Hk 4 49
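Note that groupby(..., axis=1) is deprecated in recent pandas. A sketch of an equivalent that transposes instead, using small toy frames for illustration:

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2], 'A': ['a', 'a']})
df2 = pd.DataFrame({'id': [1, 2], 'B': ['b', 'b']})

combined = pd.concat([df1, df2], axis=1)
# Transpose so duplicate column labels become index labels,
# keep the first value per label, then transpose back
result = combined.T.groupby(level=0).first().T
```

One caveat: the double transpose upcasts mixed columns to object dtype, so it trades some efficiency for compatibility.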
You can extract only those columns from df2 (and df3 similarly) which are not already present in df1. Then just use pd.concat to concatenate the data frames:
cols = [c for c in df2.columns if c not in df1.columns]
df = pd.concat([df1, df2[cols]], axis=1)
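The same idea extends to all three frames with a small loop, assuming the frames are row-aligned (toy data here for illustration):

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2], 'name': ['Tom', 'Ron'], 'A': ['a', 'a']})
df2 = pd.DataFrame({'id': [1, 2], 'name': ['Tom', 'Ron'], 'B': ['b', 'b']})
df3 = pd.DataFrame({'id': [1, 2], 'name': ['Tom', 'Ron'], 'C': ['c', 'c'], 'D': ['d', 'd']})

result = df1
for other in (df2, df3):
    # Keep only the columns not already present in the accumulated result
    new_cols = [c for c in other.columns if c not in result.columns]
    result = pd.concat([result, other[new_cols]], axis=1)
```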
Related
What is a pythonic way of counting the level 1 size per level 0 group in a MultiIndex and creating a new column (named counts)? I can achieve this in the following way, but would like to understand whether there are simpler approaches:
Code
import pandas as pd

df = pd.DataFrame({'STNAME': ['AL'] * 3 + ['MI'] * 4,
                   'CTYNAME': list('abcdefg'),
                   'COL': range(7)}).set_index(['STNAME', 'CTYNAME'])
print(df)
COL
STNAME CTYNAME
AL a 0
b 1
c 2
MI d 3
e 4
f 5
g 6
df1 = df.groupby(level=0).size().reset_index(name='count')
counts = df.merge(df1,left_on="STNAME",right_on="STNAME")["count"].values
df["counts"] = counts
This is the desired output:
COL counts
STNAME CTYNAME
AL a 0 3
b 1 3
c 2 3
MI d 3 4
e 4 4
f 5 4
g 6 4
You can use groupby.transform with size here instead of merging:
output = df.assign(counts=df.groupby(level=0)['COL'].transform('size'))
print(output)
COL counts
STNAME CTYNAME
AL a 0 3
b 1 3
c 2 3
MI d 3 4
e 4 4
f 5 4
g 6 4
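Another option, sketched here with the frame re-created so the example is self-contained, maps the per-group sizes back through the index instead of using transform:

```python
import pandas as pd

df = pd.DataFrame({'STNAME': ['AL'] * 3 + ['MI'] * 4,
                   'CTYNAME': list('abcdefg'),
                   'COL': range(7)}).set_index(['STNAME', 'CTYNAME'])

# Compute the size once per group, then map each row's level-0 label to it
sizes = df.groupby(level=0).size()
df['counts'] = df.index.get_level_values(0).map(sizes)
```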
Let's suppose I have a dataframe:
import numpy as np
import pandas as pd

a = [['A', np.nan, 2, 'x|x|x|y'], ['B', 'a|b', 56, 'b|c'],
     ['C', 'c|e|e', 65, 'f|g'], ['D', 'h', 98, 'j'],
     ['E', 'g', 98, 'k|h'], ['F', 'a|a|a|a|a|b', 98, np.nan],
     ['G', 'w', 98, 'p'], ['H', 's', 98, 't|u']]
df1 = pd.DataFrame(a, columns=['1', '2', '3', '4'])
df1
1 2 3 4
0 A NaN 2 x|x|x|y
1 B a|b 56 b|c
2 C c|e|e 65 f|g
3 D h 98 j
4 E g 98 k|h
5 F a|a|a|a|a|b 98 NaN
6 G w 98 p
7 H s 98 t|u
and another dataframe:
a = [['x'],['b'],['h'],['v']]
df2 = pd.DataFrame(a, columns=['1'])
df2
1
0 x
1 b
2 h
3 v
I want to compare column 1 of df2 with columns 2 and 4 of df1 (splitting them on "|"). If a value matches in column 2, column 4, or both, I want to extract those rows of df1 into another dataframe, with an added column holding the df2 value that matched.
For example, the result would look something like this:
1 2 3 4 5
0 A NaN 2 x|x|x|y x
1 B a|b 56 b|c b
2 F a|a|a|a|a|b 98 NaN b
3 D h 98 j h
4 E g 98 k|h h
The solution joins the values of both columns into a Series with DataFrame.agg, splits them with Series.str.split, filters the values with DataFrame.where and DataFrame.isin, joins the matches back together without NaNs, and finally filters out rows with empty strings:
df11 = df1[['2', '4']].fillna('').agg('|'.join, axis=1).str.split('|', expand=True)
df1['5'] = (df11.where(df11.isin(df2['1'].tolist()))
                .apply(lambda x: ','.join(set(x.dropna())), axis=1))
df1 = df1[df1['5'].ne('')]
print(df1)
1 2 3 4 5
0 A NaN 2 x|x|x|y x
1 B a|b 56 b|c b
3 D h 98 j h
4 E g 98 k|h h
5 F a|a|a|a|a|b 98 NaN b
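An alternative sketch: split both columns, explode to one token per row, filter with isin, and collapse back per original row. The data is re-created here so the example is self-contained:

```python
import numpy as np
import pandas as pd

a = [['A', np.nan, 2, 'x|x|x|y'], ['B', 'a|b', 56, 'b|c'],
     ['C', 'c|e|e', 65, 'f|g'], ['D', 'h', 98, 'j'],
     ['E', 'g', 98, 'k|h'], ['F', 'a|a|a|a|a|b', 98, np.nan],
     ['G', 'w', 98, 'p'], ['H', 's', 98, 't|u']]
df1 = pd.DataFrame(a, columns=['1', '2', '3', '4'])
df2 = pd.DataFrame([['x'], ['b'], ['h'], ['v']], columns=['1'])

# One token per row: join the two columns, split on '|', explode the lists
tokens = (df1['2'].fillna('') + '|' + df1['4'].fillna('')).str.split('|').explode()
# Keep only tokens present in df2, then collapse back to one string per row
matched = tokens[tokens.isin(df2['1'])]
df1['5'] = matched.groupby(level=0).agg(lambda s: ','.join(sorted(set(s))))
result = df1.dropna(subset=['5'])
```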
I want to group by a dataframe based on multiple columns. For example to make this:
Country Type_1 Type_2 Type_3 Type_4 Type_5
China A B C D E
Spain A A R B C
Italy B A B R R
Into this:
Country Type Count
China A 1
B 1
C 1
D 1
E 1
Spain A 2
R 1
B 1
C 1
Italy B 2
A 1
R 2
I tried to vertically concatenate the columns Type_1 to Type_5, apply reset_index(), and then count. However, I don't know how to group by country afterwards. Any ideas?
Thanks!
Use melt, then groupby with size:
s = df.melt('Country').groupby(['Country','value']).size()
Out[326]:
Country value
China A 1
B 1
C 1
D 1
E 1
Italy A 1
B 2
R 2
Spain A 2
B 1
C 1
R 1
dtype: int64
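To get the exact Country/Type/Count table from the question, the size() Series can be reset back into columns. A sketch, with the sample frame rebuilt inline:

```python
import pandas as pd

df = pd.DataFrame({'Country': ['China', 'Spain', 'Italy'],
                   'Type_1': ['A', 'A', 'B'], 'Type_2': ['B', 'A', 'A'],
                   'Type_3': ['C', 'R', 'B'], 'Type_4': ['D', 'B', 'R'],
                   'Type_5': ['E', 'C', 'R']})

# Name the melted value column 'Type' so the final table matches the question
out = (df.melt('Country', value_name='Type')
         .groupby(['Country', 'Type'])
         .size()
         .reset_index(name='Count'))
```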
I would like to know if there's a way in Python to place columns from different dataframes with the same names (or related names) adjacent to each other.
I know there's the option to use JOIN, but I would like to write a function from scratch that achieves the same.
Example:
Let's assume 2 dataframes df1 and df2
df1 is
id A B
50 1 5
60 2 6
70 3 7
80 4 8
df2 is
id A_1 B_1
50 a b
60 c d
70 e f
80 g h
Expected Output: A new dataframe, say df3, looking like this
id A A_1 B B_1
50 1 a 5 b
60 2 c 6 d
70 3 e 7 f
80 4 g 8 h
You can use sorted() on the column names, like:
m=pd.concat([df1.set_index('id'),df2.set_index('id')],axis=1)
m[(sorted(m.columns))].reset_index()
id A A_1 B B_1
0 50 1 a 5 b
1 60 2 c 6 d
2 70 3 e 7 f
3 80 4 g 8 h
First, join the two dataframes on id (setting it as the index avoids a column-overlap error on id):
df3 = df1.set_index('id').join(df2.set_index('id'), how='inner')
Then sort the column index so related names end up adjacent:
df3 = df3.sort_index(axis=1).reset_index()
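Since the question asks for a function written from scratch, here is one possible sketch. The name interleave is hypothetical, and the approach relies on related columns sharing a prefix so that lexicographic sorting places them together:

```python
import pandas as pd

df1 = pd.DataFrame({'id': [50, 60, 70, 80], 'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8]})
df2 = pd.DataFrame({'id': [50, 60, 70, 80],
                    'A_1': list('aceg'), 'B_1': list('bdfh')})

def interleave(left, right, key='id'):
    """Align two frames on `key` and order columns so related names sit together."""
    merged = pd.concat([left.set_index(key), right.set_index(key)], axis=1)
    # Lexicographic sort groups 'A' with 'A_1' and 'B' with 'B_1'
    return merged[sorted(merged.columns)].reset_index()

df3 = interleave(df1, df2)
```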