I want to get the affinities between countries based on shared products.
I have a df like this:
  cntr     prod
0   fr   cheese
1  ger   potato
2   it   cheese
3   it   tomato
4   fr     wine
5   it     wine
6  ger  cabbage
7   fr  cabbage
I was trying to get a co-occurrence matrix of shared product counts, which would tell me the country affinities, like this:
     fr  ger  it
fr        1   2
ger   1       0
it    2   0
My first attempt was a cross groupby, trying to add a third dimension so as to get:
fr   fr
     ger  1
     it   2
ger  fr   1
     ger
     it   0
it   fr   2
     ger  0
     it
This is what I tried, but it is failing to add the second layer.
Any suggestion?
I believe you need merge for a cross join (a self-join on prod), then crosstab, and if necessary set the diagonal to NaN with numpy.fill_diagonal:
import numpy as np
import pandas as pd

# self-join on product, then count the country pairs sharing a product
df = pd.merge(df, df, on='prod')
df = pd.crosstab(df['cntr_x'], df['cntr_y']).astype(float)
np.fill_diagonal(df.values, np.nan)
print (df)
cntr_y fr ger it
cntr_x
fr NaN 1.0 2.0
ger 1.0 NaN 0.0
it 2.0 0.0 NaN
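As an aside, the same affinity matrix can also be built from a dot product of the country-by-product indicator matrix; a minimal sketch, assuming the same df with cntr and prod columns:
import numpy as np
import pandas as pd

# indicator matrix: rows are countries, columns are products
ct = pd.crosstab(df['cntr'], df['prod'])
# the dot product counts how many products each pair of countries shares
aff = ct.dot(ct.T).astype(float)
np.fill_diagonal(aff.values, np.nan)
print (aff)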
Below is the input data
df1
        A         B          C    D    E    F  G
    Messi   Forward  Argentina    1  NaN    5  6
  Ronaldo  Defender   Portugal  NaN    4  NaN  3
    Messi  Midfield  Argentina  NaN    5  NaN  6
  Ronaldo   Forward   Portugal    3  NaN    2  3
   Mbappe   Forward     France    1    3    2  5
Below is the intended output
df
        A                 B          C  D  E  F  G
    Messi  Forward,Midfield  Argentina  1  5  5  6
  Ronaldo  Forward,Defender   Portugal  3  4  2  3
   Mbappe           Forward     France  1  3  2  5
My try:
df.groupby(['A','C'])['B'].agg(','.join).reset_index()
df.fillna(method='ffill')
Do we have a better way to do this?
You can get the first non-missing value per group for all columns except A and C, and aggregate B by join:
d = dict.fromkeys(df.columns.difference(['A','C']), 'first')
d['B'] = ','.join
df1 = df.groupby(['A','C'], sort=False, as_index=False).agg(d)
print (df1)
A C B D E F G
0 Messi Argentina Forward,Midfield 1.0 5.0 5.0 6
1 Ronaldo Portugal Defender,Forward 3.0 4.0 2.0 3
2 Mbappe France Forward 1.0 3.0 2.0 5
If you prefer integer dtypes instead of floats, add convert_dtypes:
df1 = df.groupby(['A','C'], sort=False, as_index=False).agg(d).convert_dtypes()
print (df1)
A C B D E F G
0 Messi Argentina Forward,Midfield 1 5 5 6
1 Ronaldo Portugal Defender,Forward 3 4 2 3
2 Mbappe France Forward 1 3 2 5
For a generic method without manually defining the columns, you can use the column types to decide whether to aggregate with ', '.join or 'first':
from pandas.api.types import is_string_dtype
out = (df.groupby(['A', 'C'], as_index=False)
         .agg({c: ', '.join if is_string_dtype(df[c]) else 'first' for c in df})
       )
Output:
A B C D E F G
0 Mbappe Forward France 1.0 3.0 2.0 5
1 Messi, Messi Forward, Midfield Argentina, Argentina 1.0 5.0 5.0 6
2 Ronaldo, Ronaldo Defender, Forward Portugal, Portugal 3.0 4.0 2.0 3
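Note that the dict comprehension above also aggregates the grouping keys A and C, which is why they appear joined with themselves in the output. A sketch of the same idea that leaves the keys out of the aggregation map, assuming that is the desired output:
from pandas.api.types import is_string_dtype

# build the aggregation map only for the non-key columns
agg_map = {c: ', '.join if is_string_dtype(df[c]) else 'first'
           for c in df.columns.difference(['A', 'C'])}
out = df.groupby(['A', 'C'], as_index=False).agg(agg_map)
print (out)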
I have a df as shown below.
df:
Country Player
Arg Messi
Bra Neymar
Arg NaN
Arg Messi
Arg Aguero
Arg Messi
Bra Ronaldo
Spain Xavi
Spain NaN
Spain NaN
Bra Rivaldo
Spain Iniesta
Bra NaN
Spain Xavi
Where NaN stands for information not available.
From the above df, I would like to perform multiple groupby counts as shown below.
Expected output:
Country Player Counts Percentage_of_country
Arg NaN 1 20
Arg Messi 3 60
Arg Aguero 1 20
Bra Neymar 1 25
Bra NaN 1 25
Bra Ronaldo 1 25
Bra Rivaldo 1 25
Spain NaN 2 40
Spain Xavi 2 40
Spain Iniesta 1 20
I tried the code below:
df2 = df.groupby(['Country', 'Player']).size().reset_index(name='counts')
df2['prcntg'] = df2['counts']/df2.groupby('Country')['counts'].transform('sum')
df2
Another way to do it, producing all results in a single groupby, is as follows:
Define a helper function to calculate the percentage, using dropna=False to keep the NaN values:
f = lambda x: x.size / df.groupby('Country', dropna=False).size()[x.iloc[0]] * 100
The first size returns the counts within each ['Country', 'Player'] group, while the second size, grouped by Country only, returns the counts of the enclosing country group.
Then, make use of the named aggregation of DataFrameGroupBy.aggregate():
(df.groupby(['Country', 'Player'], dropna=False)
   .agg(counts=('Player', 'size'),
        prcntg=('Country', f))
)
Result:
counts prcntg
Country Player
Arg Aguero 1 20.0
Messi 3 60.0
NaN 1 20.0
Bra Neymar 1 25.0
Rivaldo 1 25.0
Ronaldo 1 25.0
NaN 1 25.0
Spain Iniesta 1 20.0
Xavi 2 40.0
NaN 2 40.0
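If you want flat columns like the expected output rather than a MultiIndex, a reset_index at the end is enough; a minimal sketch under the same setup:
result = (df.groupby(['Country', 'Player'], dropna=False)
            .agg(counts=('Player', 'size'),
                 prcntg=('Country', f))
            .reset_index())
print (result)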
Edit
If you get the error TypeError: groupby() got an unexpected keyword argument 'dropna', your Pandas version is probably older than 1.1.0. The dropna parameter, which lets you keep the NaN counts, is supported since that version, so consider upgrading Pandas for a richer feature set.
If you cannot upgrade at the moment, a workaround is to replace NaN in the Player column with some other text, e.g. the string '_NaN' or some other special word, before grouping. You can restore the values after grouping if you need to. Sample code below:
import numpy as np
df['Player'] = df['Player'].fillna('_NaN') # Set `NaN` values to string `_NaN`
# Main processing with all results produced in a single `groupby`:
f = lambda x: x.size / df.groupby('Country').size()[x.iloc[0]] * 100
df_out = (df.groupby(['Country', 'Player'], as_index=False)
            .agg(counts=('Player', 'size'),
                 prcntg=('Country', f))
          )
df_out['Player'] = df_out['Player'].replace('_NaN', np.nan) # restore `NaN` values
Result:
print(df_out)
Country Player counts prcntg
0 Arg Aguero 1 20.0
1 Arg Messi 3 60.0
2 Arg NaN 1 20.0
3 Bra Neymar 1 25.0
4 Bra Rivaldo 1 25.0
5 Bra Ronaldo 1 25.0
6 Bra NaN 1 25.0
7 Spain Iniesta 1 20.0
8 Spain Xavi 2 40.0
9 Spain NaN 2 40.0
First group the dataframe by Country and Player, then call size for the count, and call to_frame, passing the column name, to create a dataframe out of it. You also need to pass dropna=False since you want to include NaN.
After that, you can group the count by level=0, then call transform to get the sum per level, and divide the count by this value. You can call reset_index at the end if needed.
count=df.groupby(['Country', 'Player'], dropna=False).size().to_frame('Counts')
count['Percentage_of_country']=100*count/count.groupby(level=0).transform('sum')
OUTPUT:
Counts Percentage_of_country
Country Player
Arg Aguero 1 20.0
Messi 3 60.0
NaN 1 20.0
Bra Neymar 1 25.0
Rivaldo 1 25.0
Ronaldo 1 25.0
NaN 1 25.0
Spain Iniesta 1 20.0
Xavi 2 40.0
NaN 2 40.0
The dropna parameter was introduced in pandas version 1.1.0, so if you are using an older version, you can first replace the NaN values with something else, then revert back to NaN after performing the required operation.
df['Player'] = df['Player'].fillna('#!Missing!#')  # replace NaN with '#!Missing!#'
count=df.groupby(['Country', 'Player']).size().to_frame('Counts')
count['Percentage_of_country']=100*count/count.groupby(level=0).transform('sum')
count.reset_index(inplace=True)
count['Player'] = count['Player'].replace({'#!Missing!#':float('nan')})
Just curious about the behavior of 'where' and why you would use it over 'loc'.
If I create a dataframe:
df = pd.DataFrame({'ID': [1,2,3,4,5,6,7,8,9,10],
                   'Run Distance': [234,35,77,787,243,5435,775,123,355,123],
                   'Goals': [12,23,56,7,8,0,4,2,1,34],
                   'Gender': ['m','m','m','f','f','m','f','m','f','m']})
And then apply the 'where' function:
df2 = df.where(df['Goals']>10)
I get the following, which keeps the rows where Goals > 10 but leaves everything else as NaN:
Gender Goals ID Run Distance
0 m 12.0 1.0 234.0
1 m 23.0 2.0 35.0
2 m 56.0 3.0 77.0
3 NaN NaN NaN NaN
4 NaN NaN NaN NaN
5 NaN NaN NaN NaN
6 NaN NaN NaN NaN
7 NaN NaN NaN NaN
8 NaN NaN NaN NaN
9 m 34.0 10.0 123.0
If however I use the 'loc' function:
df2 = df.loc[df['Goals']>10]
It returns the dataframe subsetted without the NaN values:
Gender Goals ID Run Distance
0 m 12 1 234
1 m 23 2 35
2 m 56 3 77
9 m 34 10 123
So essentially I am curious why you would use 'where' over 'loc/iloc' and why it returns NaN values?
Think of loc as a filter - give me only the parts of the df that conform to a condition.
where originally comes from numpy. It runs over an array and checks if each element fits a condition. So it gives you back the entire array, with a result or NaN. A nice feature of where is that you can also get back something different, e.g. df2 = df.where(df['Goals']>10, other='0'), to replace values that don't meet the condition with 0.
ID Run Distance Goals Gender
0 1 234 12 m
1 2 35 23 m
2 3 77 56 m
3 0 0 0 0
4 0 0 0 0
5 0 0 0 0
6 0 0 0 0
7 0 0 0 0
8 0 0 0 0
9 10 123 34 m
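The numpy origin is easy to see directly; a minimal sketch, assuming the same df, that builds the Goals column with the non-matching values replaced by 0:
import numpy as np

# numpy.where keeps the full length of the input: where the condition holds
# it takes the first value, otherwise the second
goals_or_zero = np.where(df['Goals'] > 10, df['Goals'], 0)
print(goals_or_zero)  # length-10 array, zeros where Goals <= 10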
Also, while where is only for conditional filtering, loc is the standard way of selecting in Pandas, along with iloc. loc uses row and column names, while iloc uses their index number. So with loc you could choose to return, say, df.loc[0:1, ['Gender', 'Goals']]:
Gender Goals
0 m 12
1 m 23
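For comparison, the positional counterpart with iloc; a minimal sketch, assuming the column order ID, Run Distance, Goals, Gender shown above:
# iloc selects by integer position: rows 0-1, columns 3 (Gender) and 2 (Goals)
df2 = df.iloc[0:2, [3, 2]]
print(df2)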
If you check the docs for DataFrame.where, it replaces values where the condition is not met - by default with NaN, but it is possible to specify a value:
df2 = df.where(df['Goals']>10)
print (df2)
ID Run Distance Goals Gender
0 1.0 234.0 12.0 m
1 2.0 35.0 23.0 m
2 3.0 77.0 56.0 m
3 NaN NaN NaN NaN
4 NaN NaN NaN NaN
5 NaN NaN NaN NaN
6 NaN NaN NaN NaN
7 NaN NaN NaN NaN
8 NaN NaN NaN NaN
9 10.0 123.0 34.0 m
df2 = df.where(df['Goals']>10, 100)
print (df2)
ID Run Distance Goals Gender
0 1 234 12 m
1 2 35 23 m
2 3 77 56 m
3 100 100 100 100
4 100 100 100 100
5 100 100 100 100
6 100 100 100 100
7 100 100 100 100
8 100 100 100 100
9 10 123 34 m
Another syntax, called boolean indexing, filters rows - it keeps only the rows that match the condition.
df2 = df.loc[df['Goals']>10]
#alternative
df2 = df[df['Goals']>10]
print (df2)
ID Run Distance Goals Gender
0 1 234 12 m
1 2 35 23 m
2 3 77 56 m
9 10 123 34 m
With loc it is also possible to filter rows by a condition and select columns by name(s):
s = df.loc[df['Goals']>10, 'ID']
print (s)
0 1
1 2
2 3
9 10
Name: ID, dtype: int64
df2 = df.loc[df['Goals']>10, ['ID','Gender']]
print (df2)
ID Gender
0 1 m
1 2 m
2 3 m
9 10 m
loc retrieves only the rows that match the condition.
where returns the whole dataframe, replacing the rows that don't match the condition (NaN by default).
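In short, the two differ in shape; a minimal sketch, assuming the df from the question:
kept = df.where(df['Goals'] > 10)    # same shape, NaN where the condition fails
subset = df.loc[df['Goals'] > 10]    # only the matching rows
print(kept.shape)    # (10, 4)
print(subset.shape)  # (4, 4)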
I have a master dataset like this:
master = pd.DataFrame({'Channel': ['1','1','1','1','1'],
                       'Country': ['India','Singapore','Japan','United Kingdom','Austria'],
                       'Product': ['X','6','7','X','X']})
and a user table like this:
user = pd.DataFrame({'User': ['101','101','102','102','102','103','103','103','103','103'],
                     'Country': ['India','Brazil','India','Brazil','Japan','All','Austria','Japan','Singapore','United Kingdom'],
                     'count': ['2','1','3','2','1','1','1','1','1','1']})
I want to left join the master table with the user table for each user, like below:
merge_101 = pd.merge(master,user[(user.User=='101')],how='left',on=['Country'])
merge_102 = pd.merge(master,user[(user.User=='102')],how='left',on=['Country'])
merge_103 = pd.merge(master,user[(user.User=='103')],how='left',on=['Country'])
merge_all = pd.concat([merge_101, merge_102,merge_103], ignore_index=True)
How do I iterate over each user here? I am first filtering the dataset, creating another dataset per user, and appending them all at the end.
Is there a better way to do this task, like a for loop or some kind of join?
Thanks
IIUC, you need:
pd.concat([pd.merge(master, user[user.User == x], how='left', on=['Country'])
           for x in user['User'].unique()], ignore_index=True)
Output:
Channel Country Product User count
0 1 India X 101 2
1 1 Singapore 6 NaN NaN
2 1 Japan 7 NaN NaN
3 1 United Kingdom X NaN NaN
4 1 Austria X NaN NaN
5 1 India X 102 3
6 1 Singapore 6 NaN NaN
7 1 Japan 7 102 1
8 1 United Kingdom X NaN NaN
9 1 Austria X NaN NaN
10 1 India X NaN NaN
11 1 Singapore 6 103 1
12 1 Japan 7 103 1
13 1 United Kingdom X 103 1
14 1 Austria X 103 1
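If you prefer to avoid the Python-level loop entirely, a sketch of an alternative, assuming pandas >= 1.2 for how='cross' (the row order will differ from the concat version):
# cross join master with the unique users, then attach the per-user counts
users = user[['User']].drop_duplicates()
expanded = master.merge(users, how='cross')
merge_all = expanded.merge(user, how='left', on=['User', 'Country'])
print(merge_all)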
I have seen that in R, imputation of categorical data is done straightforwardly by packages like DMwR and Caret, and there are algorithm options like KNN or CentralImputation. But I do not see any libraries in Python doing the same. FancyImpute performs well on numeric data.
Is there a way to do imputation of Null values in python for categorical data?
Edit: Added the top few rows of the data set.
>>> data_set.head()
1stFlrSF 2ndFlrSF 3SsnPorch Alley BedroomAbvGr BldgType BsmtCond \
0 856 854 0 NaN 3 1Fam TA
1 1262 0 0 NaN 3 1Fam TA
2 920 866 0 NaN 3 1Fam TA
3 961 756 0 NaN 3 1Fam Gd
4 1145 1053 0 NaN 4 1Fam TA
BsmtExposure BsmtFinSF1 BsmtFinSF2 ... SaleType ScreenPorch Street \
0 No 706.0 0.0 ... WD 0 Pave
1 Gd 978.0 0.0 ... WD 0 Pave
2 Mn 486.0 0.0 ... WD 0 Pave
3 No 216.0 0.0 ... WD 0 Pave
4 Av 655.0 0.0 ... WD 0 Pave
TotRmsAbvGrd TotalBsmtSF Utilities WoodDeckSF YearBuilt YearRemodAdd \
0 8 856.0 AllPub 0 2003 2003
1 6 1262.0 AllPub 298 1976 1976
2 6 920.0 AllPub 0 2001 2002
3 7 756.0 AllPub 0 1915 1970
4 9 1145.0 AllPub 192 2000 2000
YrSold
0 2008
1 2007
2 2008
3 2006
4 2008
[5 rows x 81 columns]
There are a few ways to deal with missing values. As I understand it, you want to fill NaN according to a specific rule. Pandas fillna can be used. The code below is an example of how to fill a categorical column's NaN with its most frequent value.
# fill with the column's most frequent value
df['Alley'].fillna(value=df['Alley'].value_counts().index[0], inplace=True)
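To apply the same idea to every categorical column at once, a minimal sketch, assuming data_set is the frame shown above and every object column has at least one non-missing value:
# fill each object (categorical) column with its most frequent value
for col in data_set.select_dtypes(include='object'):
    data_set[col] = data_set[col].fillna(data_set[col].value_counts().index[0])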
Also, sklearn.preprocessing.Imputer might be helpful.
For more information about pandas fillna, see pandas.DataFrame.fillna.
Hope this will work.