Doing Crosstable in Pandas like in Qlik?

I have a dataframe:
df1=pd.DataFrame({
'ID':[101,102],
'Name':['Axel','Bob'],
'US':['GrA','GrC'],
'Europe':['GrB','GrD'],
'AsiaPac':['GrZ','GrF']
})
Which I want to change to this:
df2=pd.DataFrame({
'ID':[101,101,101,102,102,102],
'Name':['Axel','Axel','Axel','Bob','Bob','Bob'],
'Region':['US','Europe','AsiaPac','US','Europe','AsiaPac'],
'Group':['GrA','GrB','GrZ','GrC','GrD','GrF']
})
How do I do it? There is a crosstab function in pandas but it doesn't do this. In Qlik I would simply do
Crosstable(Region,Group,2)
LOAD
ID,
Name,
US,
Europe,
AsiaPac
And I would go from df1 to df2. How can I do this in python (pandas or otherwise)?

This is essentially reshaping your data from a wide format to a long format, as it's known in R parlance. In pandas, you can do this with pd.melt:
pd.melt(df1, id_vars=['ID', 'Name'], var_name='Region', value_name='Group')
# ID Name Region Group
# 0 101 Axel AsiaPac GrZ
# 1 102 Bob AsiaPac GrF
# 2 101 Axel Europe GrB
# 3 102 Bob Europe GrD
# 4 101 Axel US GrA
# 5 102 Bob US GrC
If you need the rows sorted on ID and Group, as in your example output, you can chain .sort_values() onto the expression:
pd.melt(df1, id_vars=['ID', 'Name'], var_name='Region', value_name='Group').sort_values(['ID', 'Group'])
# ID Name Region Group
# 4 101 Axel US GrA
# 2 101 Axel Europe GrB
# 0 101 Axel AsiaPac GrZ
# 5 102 Bob US GrC
# 3 102 Bob Europe GrD
# 1 102 Bob AsiaPac GrF
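For completeness, the reverse direction (long back to wide) is a pivot. A minimal sketch of the round trip, assuming pandas >= 1.1 (for the list-valued index in pivot):

```python
import pandas as pd

# Wide frame from the question
df1 = pd.DataFrame({
    'ID': [101, 102],
    'Name': ['Axel', 'Bob'],
    'US': ['GrA', 'GrC'],
    'Europe': ['GrB', 'GrD'],
    'AsiaPac': ['GrZ', 'GrF'],
})

# Wide -> long with melt
long_df = pd.melt(df1, id_vars=['ID', 'Name'], var_name='Region', value_name='Group')

# Long -> wide again with pivot; rename_axis drops the leftover 'Region' columns name
wide_df = (long_df.pivot(index=['ID', 'Name'], columns='Region', values='Group')
                  .reset_index()
                  .rename_axis(columns=None))
```

So melt and pivot are inverses here, up to column order.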

You can try:
1st
stack()
df1.set_index(['ID','Name']).stack().reset_index().rename(columns={'level_2':'Region',0:'Group'})
Out[890]:
ID Name Region Group
0 101 Axel AsiaPac GrZ
1 101 Axel Europe GrB
2 101 Axel US GrA
3 102 Bob AsiaPac GrF
4 102 Bob Europe GrD
5 102 Bob US GrC
2nd
pd.wide_to_long, even if it is overkill. :)
df1=df1.rename(columns={'AsiaPac':'Group_AsiaPac','Europe':'Group_Europe','US':'Group_US'})
pd.wide_to_long(df1, ['Group'], i=['ID','Name'], j='Region', sep='_', suffix='.+').reset_index()
Out[918]:
ID Name Region Group
0 101 Axel AsiaPac GrZ
1 101 Axel Europe GrB
2 101 Axel US GrA
3 102 Bob AsiaPac GrF
4 102 Bob Europe GrD
5 102 Bob US GrC

Related

Filter pandas Data Frame Based on other Dataframe Column Values

df1:
Id Country Product
1 india cotton
2 germany shoes
3 algeria bags
df2:
id Country Product Qty Sales
1 India cotton 25 635
2 India cotton 65 335
3 India cotton 96 455
4 India cotton 78 255
5 germany shoes 25 635
6 germany shoes 65 458
7 germany shoes 96 455
8 germany shoes 69 255
9 algeria bags 25 635
10 algeria bags 89 788
11 algeria bags 96 455
12 algeria bags 78 165
I need to filter df2 based on the Country and Product columns from df1 and create new DataFrames.
For example, df1 has 3 unique Country/Product combinations, so there would be 3 DataFrames.
Output:
df_India_Cotton :
id Country Product Qty Sales
1 India cotton 25 635
2 India cotton 65 335
3 India cotton 96 455
4 India cotton 78 255
df_germany_shoes:
id Country Product Qty Sales
1 germany shoes 25 635
2 germany shoes 65 458
3 germany shoes 96 455
4 germany shoes 69 255
df_algeria_bags:
id Country Product Qty Sales
1 algeria bags 25 635
2 algeria bags 89 788
3 algeria bags 96 455
4 algeria bags 78 165
I can also filter out these dataframes with basic subsetting in pandas:
df2[(df2.Country=='India') & (df2.Product=='cotton')]
That would solve the problem, but there could be many unique combinations of Country and Product in my df1.
You can create a dictionary and save all dataframes in it.
Check the code below:
d = {}
for i in range(len(df1)):
    name = df1.Country.iloc[i] + '_' + df1.Product.iloc[i]
    d[name] = df2[(df2.Country == df1.Country.iloc[i]) & (df2.Product == df1.Product.iloc[i])]
And you can call each dataframe by its values like below:
d['India_cotton'] will give:
id Country Product Qty Sales
1 India cotton 25 635
2 India cotton 65 335
3 India cotton 96 455
4 India cotton 78 255
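The per-row loop above can also be written as a single groupby pass over df2, keeping only the groups whose key pair appears in df1. A sketch on small, hypothetical case-consistent versions of the question's frames (the question's own data mixes 'india' and 'India'):

```python
import pandas as pd

# Hypothetical, case-consistent stand-ins for the question's df1/df2
df1 = pd.DataFrame({'Country': ['india', 'germany'],
                    'Product': ['cotton', 'shoes']})
df2 = pd.DataFrame({'Country': ['india', 'india', 'germany', 'algeria'],
                    'Product': ['cotton', 'cotton', 'shoes', 'bags'],
                    'Qty': [25, 65, 25, 25]})

# One groupby pass instead of one boolean filter per row of df1
wanted = set(zip(df1.Country, df1.Product))
d = {f'{c}_{p}': g for (c, p), g in df2.groupby(['Country', 'Product'])
     if (c, p) in wanted}
```

Each value in d is the sub-frame for one Country/Product pair present in df1.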
Try creating two groupby's. Use the first to select from the second:
import pandas as pd
selector_df = pd.DataFrame(data={
    'Country': 'india germany algeria'.split(),
    'Product': 'cotton shoes bags'.split()
})
details_df = pd.DataFrame(data={
    'Country': 'india india india india germany germany germany germany algeria algeria algeria algeria'.split(),
    'Product': 'cotton cotton cotton cotton shoes shoes shoes shoes bags bags bags bags'.split(),
    'qty': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
})
selectorgroups = selector_df.groupby(by=['Country', 'Product'])
datagroups = details_df.groupby(by=['Country', 'Product'])
for tag, group in selectorgroups:
    print(tag)
    try:
        print(datagroups.get_group(tag))
    except KeyError:
        print('tag does not exist in datagroup')
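If you only need df2 restricted to the key pairs that occur in df1 (rather than one DataFrame per pair), an inner merge on the two key columns does the filtering in one step. A sketch, reusing the selector/details shape from the answer above:

```python
import pandas as pd

selector_df = pd.DataFrame({
    'Country': 'india germany'.split(),
    'Product': 'cotton shoes'.split(),
})
details_df = pd.DataFrame({
    'Country': 'india india germany algeria'.split(),
    'Product': 'cotton cotton shoes bags'.split(),
    'qty': [1, 2, 3, 4],
})

# Inner merge keeps only rows whose (Country, Product) pair occurs in selector_df
filtered = details_df.merge(selector_df, on=['Country', 'Product'], how='inner')
```

From there, filtered.groupby(['Country', 'Product']) recovers the per-pair frames if needed.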

Subtract/Add existing values if contents of one dataframe is present in another using pandas

Here are 2 dataframes
df1:
Index Number Name Amount
0 123 John 31
1 124 Alle 33
2 312 Amy 33
3 314 Holly 35
df2:
Index Number Name Amount
0 312 Amy 13
1 124 Alle 35
2 317 Jack 53
The resulting dataframe should look like this
result_df:
Index Number Name Amount Curr_amount
0 123 John 31 31
1 124 Alle 33 68
2 312 Amy 33 46
3 314 Holly 35 35
4 317 Jack 53
I have tried pandas isin, but it only returns a boolean saying whether a Number is present. Is there a way to do this efficiently?
Use merge with an outer join and then Series.add (or Series.sub if necessary):
df = df1.merge(df2, on=['Number','Name'], how='outer', suffixes=('','_curr'))
df['Amount_curr'] = df['Amount_curr'].add(df['Amount'], fill_value=0)
print (df)
Number Name Amount Amount_curr
0 123 John 31.0 31.0
1 124 Alle 33.0 68.0
2 312 Amy 33.0 46.0
3 314 Holly 35.0 35.0
4 317 Jack NaN 53.0
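If only the combined totals are needed (without keeping the original Amount column alongside), concatenating both frames and summing per key is a shorter alternative. A sketch on the question's data; note the output shape differs from the answer above:

```python
import pandas as pd

df1 = pd.DataFrame({'Number': [123, 124, 312, 314],
                    'Name': ['John', 'Alle', 'Amy', 'Holly'],
                    'Amount': [31, 33, 33, 35]})
df2 = pd.DataFrame({'Number': [312, 124, 317],
                    'Name': ['Amy', 'Alle', 'Jack'],
                    'Amount': [13, 35, 53]})

# Stack both frames, then sum Amount per (Number, Name)
totals = (pd.concat([df1, df2])
            .groupby(['Number', 'Name'], as_index=False)['Amount']
            .sum())
```

Rows present in only one frame keep their single Amount; shared rows get the sum.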

custom sort in python pandas dataframe needs better approach

I have a dataframe like this:
user = pd.DataFrame({'User':['101','101','101','102','102','101','101','102','102','102'],'Country':['India','Japan','India','Brazil','Japan','UK','Austria','Japan','Singapore','UK']})
I want to apply a custom sort on Country, and Japan needs to be at the top for both users.
I have done this, but it does not give my expected output:
user.sort_values(['User','Country'], ascending=[True, False], inplace=True)
My expected output:
expected_output = pd.DataFrame({'User':['101','101','101','101','101','102','102','102','102','102'],'Country':['Japan','India','India','UK','Austria','Japan','Japan','Brazil','Singapore','UK']})
I tried casting the column as a category, passing the categories with Japan at the top. Is there any other approach? I don't want to pass the whole country list every time; I just want to specify user 101 - Japan or user 102 - UK and have the remaining rows follow.
Thanks
Create a new helper key to sort by, using map:
user.assign(New=user.Country.map({'Japan':1}).fillna(0)).sort_values(['User','New'], ascending=[True, False]).drop('New', axis=1)
Out[80]:
Country User
1 Japan 101
0 India 101
2 India 101
5 UK 101
6 Austria 101
4 Japan 102
7 Japan 102
3 Brazil 102
8 Singapore 102
9 UK 102
Update based on the comment:
mapdf=pd.DataFrame({'Country':['Japan','UK'],'User':['101','102'],'New':[1,1]})
user.merge(mapdf,how='left').fillna(0).sort_values(['User','New'], ascending=[True, False]).drop('New', axis=1)
Out[106]:
Country User
1 Japan 101
0 India 101
2 India 101
5 UK 101
6 Austria 101
9 UK 102
3 Brazil 102
4 Japan 102
7 Japan 102
8 Singapore 102
Use boolean indexing with pd.concat (DataFrame.append was removed in pandas 2.0), then a stable sort by column User:
user = pd.concat([user[user['Country'] == 'Japan'],
                  user[user['Country'] != 'Japan']]).sort_values('User', kind='stable')
Alternative solution:
user = pd.concat([user.query('Country == "Japan"'),
                  user.query('Country != "Japan"')]).sort_values('User', kind='stable')
print (user)
User Country
1 101 Japan
0 101 India
2 101 India
5 101 UK
6 101 Austria
4 102 Japan
7 102 Japan
3 102 Brazil
8 102 Singapore
9 102 UK
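On newer pandas (>= 1.1), the helper-column trick can be folded into sort_values itself via its key parameter, which is applied once per sort column (s.name tells you which column is being transformed). A sketch pinning Japan first for every user:

```python
import pandas as pd

user = pd.DataFrame({'User': ['101', '101', '101', '102', '102', '101', '101', '102', '102', '102'],
                     'Country': ['India', 'Japan', 'India', 'Brazil', 'Japan', 'UK', 'Austria', 'Japan', 'Singapore', 'UK']})

# For the Country column, map to a boolean so 'Japan' (False) sorts first;
# leave the User column untouched. Multi-column sort_values is stable,
# so the remaining rows keep their original relative order.
result = user.sort_values(
    ['User', 'Country'],
    key=lambda s: s.ne('Japan') if s.name == 'Country' else s,
)
```

This avoids creating and dropping a throwaway helper column.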

check same rows and create new column conditionally in pandas

I have a dataframe like this:
df = pd.DataFrame({'User':['101','101','101','102','102','101','101','102','102','102'],'Country':['India','Japan','India','Brazil','Japan','UK','Austria','Japan','Singapore','UK'],
'Name':['RN','TN','AP','AP','TN','TN','TS','RN','TN','AP']})
If the User and Country are the same, I want to combine the Name column values in another column, like below.
You could use:
df['Name_E'] = df.groupby(['User', 'Country']).Name.transform(lambda x: str.join(', ', x))
groupby with transform
df['all_names'] = df.groupby(['Country', 'User']).Name.transform(lambda x: ','.join(set(x)))
Country Name User all_names
0 India RN 101 AP,RN
1 Japan TN 101 TN
2 India AP 101 AP,RN
3 Brazil AP 102 AP
4 Japan TN 102 TN,RN
5 UK TN 101 TN
6 Austria TS 101 TS
7 Japan RN 102 TN,RN
8 Singapore TN 102 TN
9 UK AP 102 AP
You need:
res = df.merge(df.groupby(['User', 'Country'])['Name'].unique().reset_index().rename(columns={'Name':'Name_E'}), on=['Country', 'User'])
res['Name_E'] = res['Name_E'].apply(lambda x: ",".join(x))
Output:
User Country Name Name_E
0 101 India RN RN,AP
1 101 India AP RN,AP
2 101 Japan TN TN
3 102 Brazil AP AP
4 102 Japan TN TN,RN
5 102 Japan RN TN,RN
6 101 UK TN TN
7 101 Austria TS TS
8 102 Singapore TN TN
9 102 UK AP AP

Remove rows from pandas DataFrame if multiple columns contain the same data, but interchanged

I have a pandas DataFrame with pairs of names in 'name_x' and 'name_y' columns and an associated id:
id name_x name_y
0 104 molly james
1 104 james molly
2 104 sarah adam
3 236 molly adam
4 388 adam sarah
5 388 johnny pete
6 104 adam sarah
7 236 adam james
8 236 pete johnny
I would like to remove 'duplicate' rows where the id numbers are the same and both names have appeared together in either name column.
eg.
Such that the row with index 1 is removed because the pair of names 'molly' and 'james' have already appeared with id 104. Similarly the row with index 6 is removed as the pair of names 'adam' and 'sarah' have already appeared with id 104 so that the DataFrame looks like this:
id name_x name_y
0 104 molly james
1 104 sarah adam
2 236 molly adam
3 388 adam sarah
4 388 johnny pete
5 236 adam james
6 236 pete johnny
(The ordering of the names does not matter)
I would then like to create another DataFrame that shows, for each pair of names, how many times they appear with different ids, together with those ids, e.g.:
count ids name_x name_y
0 1 104 molly james
1 2 [104, 388] sarah adam
2 1 236 molly adam
3 2 [388, 236] johnny pete
4 1 236 adam james
I am new to programming/python/pandas and have yet to find an answer for this! Thanks!
You can use:
first, sort the two name columns row-wise with numpy.sort
then groupby, converting the ids to sets and then to lists
get the length of each list with .str.len()
last, if necessary, use mask with .str[0] to extract a scalar from one-item lists
import numpy as np
df[['name_x','name_y']] = np.sort(df[['name_x','name_y']], axis=1)
df = df.groupby(['name_x','name_y'])['id'].apply(lambda x: list(set(x))).reset_index(name='ids')
df['count'] = df['ids'].str.len()
print (df)
name_x name_y ids count
0 adam james [236] 1
1 adam molly [236] 1
2 adam sarah [104, 388] 2
3 james molly [104] 1
4 johnny pete [388, 236] 2
df['ids'] = df['ids'].mask(df['count'] == 1, df['ids'].str[0])
print (df)
name_x name_y ids count
0 adam james 236 1
1 adam molly 236 1
2 adam sarah [104, 388] 2
3 james molly 104 1
4 johnny pete [388, 236] 2
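The question's first step (dropping rows whose id and unordered name pair already appeared) can also be done on its own: sort the name columns row-wise so the pair order no longer matters, then call drop_duplicates. A sketch on the question's data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [104, 104, 104, 236, 388, 388, 104, 236, 236],
                   'name_x': ['molly', 'james', 'sarah', 'molly', 'adam', 'johnny', 'adam', 'adam', 'pete'],
                   'name_y': ['james', 'molly', 'adam', 'adam', 'sarah', 'pete', 'sarah', 'james', 'johnny']})

# Sort each name pair alphabetically so (molly, james) == (james, molly)
df[['name_x', 'name_y']] = np.sort(df[['name_x', 'name_y']], axis=1)

# Keep only the first occurrence of each (id, pair)
deduped = df.drop_duplicates(subset=['id', 'name_x', 'name_y']).reset_index(drop=True)
```

This removes the rows at indices 1 and 6 from the example, leaving 7 rows; the counting step from the answer above can then run on deduped.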
