I want to group a dataframe based on multiple columns, for example to turn this:
Country Type_1 Type_2 Type_3 Type_4 Type_5
China A B C D E
Spain A A R B C
Italy B A B R R
Into this:
Country Type Count
China A 1
B 1
C 1
D 1
E 1
Spain A 2
R 1
B 1
C 1
Italy B 2
A 1
R 2
I tried to vertically concatenate the columns Type_1 to Type_5, apply reset_index(), and then count. However, I don't know how to group vertically by country. Any ideas?
Thanks
Use melt, then groupby with size:
# melt Type_1..Type_5 into a single 'value' column, then count each (Country, value) pair
s = df.melt('Country').groupby(['Country','value']).size()
Out[326]:
Country value
China A 1
B 1
C 1
D 1
E 1
Italy A 1
B 2
R 2
Spain A 2
B 1
C 1
R 1
dtype: int64
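If you want the exact three-column layout from the question (Country, Type, Count), you can name the count while resetting the index. A minimal sketch, assuming the same df as above:
out = (df.melt('Country', value_name='Type')
         .groupby(['Country', 'Type'])
         .size()
         .reset_index(name='Count'))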
What is a Pythonic way of counting the level 1 size per level 0 in a MultiIndex and creating a new column (named counts)? I can achieve this in the following way, but would like to understand any simpler approaches:
Code
import pandas as pd

df = pd.DataFrame({'STNAME': ['AL'] * 3 + ['MI'] * 4,
                   'CTYNAME': list('abcdefg'),
                   'COL': range(7)}).set_index(['STNAME', 'CTYNAME'])
print(df)
COL
STNAME CTYNAME
AL a 0
b 1
c 2
MI d 3
e 4
f 5
g 6
df1 = df.groupby(level=0).size().reset_index(name='count')
counts = df.merge(df1,left_on="STNAME",right_on="STNAME")["count"].values
df["counts"] = counts
This is the desired output:
COL counts
STNAME CTYNAME
AL a 0 3
b 1 3
c 2 3
MI d 3 4
e 4 4
f 5 4
g 6 4
You can use groupby.transform with size here instead of merging:
output = df.assign(Counts=df.groupby(level=0)['COL'].transform('size'))
print(output)
COL Counts
STNAME CTYNAME
AL a 0 3
b 1 3
c 2 3
MI d 3 4
e 4 4
f 5 4
g 6 4
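The key difference from the merge approach: transform('size') returns a Series aligned to df's original index, with each row carrying the size of its own group, so it can be assigned directly without building an intermediate frame and merging it back.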
My question is closely related to Pandas Merge - How to avoid duplicating columns, but not identical to it.
I want to concatenate the columns that are different in three dataframes. The dataframes have an id column, and some columns that are identical. For example:
df1
id place name qty unit A
1 NY Tom 2 10 a
2 TK Ron 3 15 a
3 Lon Don 5 90 a
4 Hk Sam 4 49 a
df2
id place name qty unit B
1 NY Tom 2 10 b
2 TK Ron 3 15 b
3 Lon Don 5 90 b
4 Hk Sam 4 49 b
df3
id place name qty unit C D
1 NY Tom 2 10 c d
2 TK Ron 3 15 c d
3 Lon Don 5 90 c d
4 Hk Sam 4 49 c d
Result:
id place name qty unit A B C D
1 NY Tom 2 10 a b c d
2 TK Ron 3 15 a b c d
3 Lon Don 5 90 a b c d
4 Hk Sam 4 49 a b c d
The columns place, name, qty, and unit will always be part of the three dataframes; the names of the columns that differ could vary (A, B, C, D in my example). The three dataframes have the same number of rows.
I have tried:
cols_to_use = df2.columns.difference(df1.columns)
dfNew = pd.merge(df1, df2[cols_to_use], left_index=True, right_index=True, how='outer')
The problem is that I get more rows than expected, and renamed columns, in the resulting dataframe (when using concat).
Using reduce from functools:
from functools import reduce
reduce(lambda left,right: pd.merge(left,right), [df1,df2,df3])
Out[725]:
id place name qty unit A B C D
0 1 NY Tom 2 10 a b c d
1 2 TK Ron 3 15 a b c d
2 3 Lon Don 5 90 a b c d
3 4 Hk Sam 4 49 a b c d
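Note that pd.merge with no on argument joins on all columns the two frames have in common (here id, place, name, qty, and unit), which is why the shared columns are neither duplicated nor suffixed in the result.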
You can use a nested merge:
merge_on = ['id','place','name','qty','unit']
df1.merge(df2, on=merge_on).merge(df3, on=merge_on)
id place name qty unit A B C D
0 1 NY Tom 2 10 a b c d
1 2 TK Ron 3 15 a b c d
2 3 Lon Don 5 90 a b c d
3 4 Hk Sam 4 49 a b c d
Using concat with groupby and first:
pd.concat([df1, df2, df3], axis=1).groupby(level=0, axis=1).first()
A B C D id name place qty unit
0 a b c d 1 Tom NY 2 10
1 a b c d 2 Ron TK 3 15
2 a b c d 3 Don Lon 5 90
3 a b c d 4 Sam Hk 4 49
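Note that grouping over axis=1 returns the columns in sorted order, and groupby with axis=1 is deprecated in recent pandas. A sketch of an equivalent that preserves the original column order, assuming the same df1, df2, df3: concatenate, then keep only the first occurrence of each duplicated column label:
combined = pd.concat([df1, df2, df3], axis=1)
# columns.duplicated() flags repeated labels; ~ keeps the first occurrence of each
result = combined.loc[:, ~combined.columns.duplicated()]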
You can extract only those columns from df2 (and df3 similarly) which are not already present in df1. Then just use pd.concat to concatenate the data frames:
cols = [c for c in df2.columns if c not in df1.columns]
df = pd.concat([df1, df2[cols]], axis=1)
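A sketch generalizing this answer to any number of frames with functools.reduce, under the same assumption that the shared columns hold identical values (the helper name concat_new_cols is mine):
from functools import reduce

def concat_new_cols(left, right):
    # append only the columns of `right` not already present in `left`
    new = [c for c in right.columns if c not in left.columns]
    return pd.concat([left, right[new]], axis=1)

result = reduce(concat_new_cols, [df1, df2, df3])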
My original dataframe looks like this (only the first rows):
categories id products
0 A 1 a
1 B 1 a
2 C 1 a
3 A 1 b
4 B 1 b
5 A 2 c
6 B 2 c
I aggregated it with the following code:
df2 = df.groupby('id').products.nunique().reset_index().merge(
    pd.crosstab(df.id, df.categories).reset_index())
The resulting dataframe is then the following; I also added an outlier from my DF:
id products A B C
0 1 2 2 2 1
1 2 1 1 1 0
2 3 50 1 1 30
Now I am trying to remove the outliers in my new DF:
# remove outliers
del df2['id']
df2 = df2.loc[df2['products'] <= 20, [str(i) for i in df2.columns]]
What I then get is:
products A B C
0 2 NaN NaN NaN
1 1 NaN NaN NaN
It removes the outliers, but why do I get only NaNs now in the category columns?
df2 = df2.loc[df2['products'] <= 20]
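The all-NaN columns most likely come from the [str(i) for i in df2.columns] relabeling: if any of the column labels produced by crosstab are not plain strings, str(i) yields labels that don't exist in df2, and older pandas versions silently returned all-NaN columns for missing labels instead of raising a KeyError. Filtering only the rows, as above, sidesteps the problem.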
I have the following wide df1:
Area geotype type ...
1 a 2 ...
1 a 1 ...
2 b 4 ...
4 b 8 ...
And the following two-column df2:
Area geotype
1 London
4 Cambridge
And I want the following:
Area geotype type ...
1 London 2 ...
1 London 1 ...
2 b 4 ...
4 Cambridge 8 ...
So I need to match based on the non-unique Area column, and then only if there is a match, replace the set values in the geotype column.
Apologies if this is a duplicate, I did actually search hard for a solution to this.
Use update + map:
df1.geotype.update(df1.Area.map(df2.set_index('Area').geotype))
Area geotype type
0 1 London 2
1 1 London 1
2 2 b 4
3 4 Cambridge 8
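Series.update modifies df1.geotype in place and only overwrites positions where the passed Series is non-NaN, so rows with no match in df2 (here Area 2) keep their original value.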
I think you can use map with a Series created by set_index, and then fill NaN values by combine_first or fillna:
df1.geotype = df1.ID.map(df2.set_index('ID')['geotype']).combine_first(df1.geotype)
#df1.geotype = df1.ID.map(df2.set_index('ID')['geotype']).fillna(df1.geotype)
print (df1)
ID geotype type
0 1 London 2
1 2 a 1
2 3 b 4
3 4 Cambridge 8e
Another solution with mask and numpy.in1d:
df1.geotype = df1.geotype.mask(np.in1d(df1.ID, df2.ID),
df1.ID.map(df2.set_index('ID')['geotype']))
print (df1)
ID geotype type
0 1 London 2
1 2 a 1
2 3 b 4
3 4 Cambridge 8e
EDIT (per comment):
The problem is non-unique ID values in df2, e.g.:
df2 = pd.DataFrame({'ID': [1, 1, 4], 'geotype': ['London', 'Paris', 'Cambridge']})
print (df2)
ID geotype
0 1 London
1 1 Paris
2 4 Cambridge
So map cannot choose the right value and raises an error.
The solution is to remove the duplicates with drop_duplicates, which by default keeps the first value:
df2 = df2.drop_duplicates('ID')
print (df2)
ID geotype
0 1 London
2 4 Cambridge
Or, if you need to keep the last value:
df2 = df2.drop_duplicates('ID', keep='last')
print (df2)
ID geotype
1 1 Paris
2 4 Cambridge
If you cannot remove the duplicates, another solution is an outer merge, but it produces duplicated rows wherever ID is duplicated in df2:
df1 = pd.merge(df1, df2, on='ID', how='outer', suffixes=('_',''))
df1.geotype = df1.geotype.combine_first(df1.geotype_)
df1 = df1.drop('geotype_', axis=1)
print (df1)
ID type geotype
0 1 2 London
1 1 2 Paris
2 2 1 a
3 3 4 b
4 4 8e Cambridge
Alternative solution:
In [78]: df1.loc[df1.ID.isin(df2.ID), 'geotype'] = df1.ID.map(df2.set_index('ID').geotype)
In [79]: df1
Out[79]:
ID geotype type
0 1 London 2
1 2 a 1
2 3 b 4
3 4 Cambridge 8
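Here the mapped Series on the right-hand side is computed for every row, but index alignment in .loc ensures that only the rows where df1.ID appears in df2.ID are actually overwritten.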
UPDATE: this answers the updated question - if you have duplicates in the Area column of the df2 DF:
In [152]: df1.loc[df1.Area.isin(df2.Area), 'geotype'] = df1.Area.map(df2.set_index('Area').geotype)
...
skipped
...
InvalidIndexError: Reindexing only valid with uniquely valued Index objects
Get rid of the duplicates:
In [153]: df1.loc[df1.Area.isin(df2.Area), 'geotype'] = df1.Area.map(df2.drop_duplicates(subset='Area').set_index('Area').geotype)
In [154]: df1
Out[154]:
Area geotype type
0 1 London 2
1 1 London 1
2 2 b 4
3 4 Cambridge 8
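drop_duplicates(subset='Area') leaves one row per Area, so set_index('Area') produces a unique index and map no longer raises the InvalidIndexError.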