Custom sort in Python pandas DataFrame needs a better approach

I have a dataframe like this:
user = pd.DataFrame({'User': ['101','101','101','102','102','101','101','102','102','102'],
                     'Country': ['India','Japan','India','Brazil','Japan','UK','Austria','Japan','Singapore','UK']})
I want to apply a custom sort on Country so that Japan comes to the top for both users.
I have tried this, but it does not give my expected output:
user.sort_values(['User','Country'], ascending=[True, False], inplace=True)
My expected output:
expected_output = pd.DataFrame({'User': ['101','101','101','101','101','102','102','102','102','102'],
                                'Country': ['Japan','India','India','UK','Austria','Japan','Japan','Brazil','Singapore','UK']})
I tried casting the column as a category, passing the categories with Japan at the top. Is there any other approach? I don't want to pass the whole country list every time; I just want to specify user 101 - Japan or user 102 - UK, and the remaining rows should keep their order.
Thanks
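For reference, the categorical attempt described above might look like this (a sketch; note it requires enumerating every country, which is exactly what the asker wants to avoid):
# Ordered categorical with Japan listed first; every other country must
# also be enumerated for the sort to work.
order = ['Japan', 'India', 'UK', 'Austria', 'Brazil', 'Singapore']
user['Country'] = pd.Categorical(user['Country'], categories=order, ordered=True)
user = user.sort_values(['User', 'Country'], kind='stable')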

Create a new helper key to sort by, using map:
(user.assign(New=user.Country.map({'Japan': 1}).fillna(0))
     .sort_values(['User', 'New'], ascending=[True, False])
     .drop(columns='New'))
Out[80]:
Country User
1 Japan 101
0 India 101
2 India 101
5 UK 101
6 Austria 101
4 Japan 102
7 Japan 102
3 Brazil 102
8 Singapore 102
9 UK 102
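As a side note, on pandas 1.1+ the key argument of sort_values can express the same idea without a helper column (a sketch, not part of the original answer):
# For the Country column, sort by 'is not Japan' so Japan rows (False)
# come first within each user; a stable sort preserves the original order
# among the remaining rows. The User column is sorted as-is.
user.sort_values(['User', 'Country'],
                 key=lambda s: s.ne('Japan') if s.name == 'Country' else s,
                 kind='stable')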
Update based on comment:
mapdf = pd.DataFrame({'Country': ['Japan','UK'], 'User': ['101','102'], 'New': [1,1]})
(user.merge(mapdf, how='left').fillna(0)
     .sort_values(['User', 'New'], ascending=[True, False])
     .drop(columns='New'))
Out[106]:
Country User
1 Japan 101
0 India 101
2 India 101
5 UK 101
6 Austria 101
9 UK 102
3 Brazil 102
4 Japan 102
7 Japan 102
8 Singapore 102
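If the per-user priorities come as a simple mapping, the helper table can be built from a dict instead of being written out by hand (a sketch based on the update above):
# Hypothetical priority mapping: the one country per user that should
# float to the top; everything else keeps its original order.
priority = {'101': 'Japan', '102': 'UK'}
mapdf = pd.DataFrame({'User': list(priority),
                      'Country': list(priority.values()),
                      'New': [1] * len(priority)})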

Use boolean indexing with append, then sort by column User:
user = (user[user['Country'] == 'Japan']
            .append(user[user['Country'] != 'Japan'])
            .sort_values('User'))
Alternative solution:
user = (user.query('Country == "Japan"')
            .append(user.query('Country != "Japan"'))
            .sort_values('User'))
print (user)
  User    Country
1  101      Japan
0  101      India
2  101      India
5  101         UK
6  101    Austria
4  102      Japan
7  102      Japan
3  102     Brazil
8  102  Singapore
9  102         UK
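Note that DataFrame.append was removed in pandas 2.0; the same idea with pd.concat might look like this (a sketch):
# pd.concat replaces the removed DataFrame.append; a stable sort keeps
# the Japan-first ordering intact within each user.
user = (pd.concat([user[user['Country'] == 'Japan'],
                   user[user['Country'] != 'Japan']])
          .sort_values('User', kind='stable'))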

Related

Filter pandas Data Frame Based on other Dataframe Column Values

df1:
Id Country Product
1 india cotton
2 germany shoes
3 algeria bags
df2:
id Country Product Qty Sales
1 India cotton 25 635
2 India cotton 65 335
3 India cotton 96 455
4 India cotton 78 255
5 germany shoes 25 635
6 germany shoes 65 458
7 germany shoes 96 455
8 germany shoes 69 255
9 algeria bags 25 635
10 algeria bags 89 788
11 algeria bags 96 455
12 algeria bags 78 165
I need to filter df2 based on the Country and Product columns from df1 and create a new dataframe for each combination.
For example, there are 3 unique Country/Product combinations in df1, so the number of dataframes would be 3.
Output:
df_India_Cotton :
id Country Product Qty Sales
1 India cotton 25 635
2 India cotton 65 335
3 India cotton 96 455
4 India cotton 78 255
df_germany_shoes:
id Country Product Qty Sales
1 germany shoes 25 635
2 germany shoes 65 458
3 germany shoes 96 455
4 germany shoes 69 255
df_algeria_bags:
id Country Product Qty Sales
1 algeria bags 25 635
2 algeria bags 89 788
3 algeria bags 96 455
4 algeria bags 78 165
I can also filter out these dataframes with basic subsetting in pandas:
df2[(df2.Country=='India') & (df2.Product=='cotton')]
It would solve this problem, but there could be many unique combinations of Country and Product in my df1.
You can create a dictionary and save all dataframes in it.
Check the code below:
d = {}
for i in range(len(df1)):
    name = df1.Country.iloc[i] + '_' + df1.Product.iloc[i]
    d[name] = df2[(df2.Country == df1.Country.iloc[i]) & (df2.Product == df1.Product.iloc[i])]
And you can retrieve each dataframe by its key (the keys take their casing from df1, so Country and Product values must match between the two frames):
d['India_cotton'] will give:
id Country Product Qty Sales
1 India cotton 25 635
2 India cotton 65 335
3 India cotton 96 455
4 India cotton 78 255
Try creating two groupby's. Use the first to select from the second:
import pandas as pd

selector_df = pd.DataFrame(data={
    'Country': 'india germany algeria'.split(),
    'Product': 'cotton shoes bags'.split()
})
details_df = pd.DataFrame(data={
    'Country': 'india india india india germany germany germany germany algeria algeria algeria algeria'.split(),
    'Product': 'cotton cotton cotton cotton shoes shoes shoes shoes bags bags bags bags'.split(),
    'qty': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
})

selectorgroups = selector_df.groupby(by=['Country', 'Product'])
datagroups = details_df.groupby(by=['Country', 'Product'])
for tag, group in selectorgroups:
    print(tag)
    try:
        # get_group raises KeyError when the selector pair has no data rows
        print(datagroups.get_group(tag))
    except KeyError:
        print('tag does not exist in datagroup')
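An inner merge can also do the pair filtering in one pass, assuming the Country/Product casing is consistent between the two frames (a sketch, not from the answers above):
# Keep only rows of df2 whose (Country, Product) pair appears in df1,
# then split the result into one frame per pair.
filtered = df2.merge(df1[['Country', 'Product']].drop_duplicates(),
                     on=['Country', 'Product'], how='inner')
frames = {key: grp for key, grp in filtered.groupby(['Country', 'Product'])}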

Pandas merge produces no data

I have 2 dataframes:
df1:
branch sessions users
toledo 10 3
new york 14 2
boston 102 43
seattle 9 7
df2:
branch guests
toledo 10
new york 14
boston 102
seattle 9
The result I'm looking for merges the "guests" column from df2 to df1 like this:
df1:
branch sessions users guests
toledo 10 3 10
new york 14 2 14
boston 102 43 102
seattle 9 7 9
I've tried concat, join and merge with no luck.
With merge:
guest_sessions = df2[['branch','guests']].copy()
pd.merge(df1, guest_sessions, left_index=True, right_index=True)
I get this:
branch_x sessions users guests_x branch_y guests_y
What am I doing wrong?
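A likely fix (a sketch): merge on the shared branch column instead of the index, so pandas aligns rows by key and only the guests column is added. Merging on the index while both frames carry overlapping column names is what produced the _x/_y suffixes above.
# Align rows by the 'branch' key; a left merge keeps every row of df1.
df1 = df1.merge(df2[['branch', 'guests']], on='branch', how='left')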

New dataframe from grouping together two columns

I have a dataset that looks like the following.
Region_Name Date Average
London 1990Q1 105
London 1990Q1 118
... ... ...
London 2018Q1 157
I converted the date into quarters and wish to create a new dataframe with the matching quarters and region names grouped together, with the mean average.
What is the best way to accomplish such a task?
I have been looking at the groupby function but keep getting a traceback. For example:
new_df = df.groupby(['Resion_Name','Date']).mean()
(Note: 'Resion_Name' misspells 'Region_Name', which would itself raise a KeyError.)
dict3={'Region_Name': ['London','Newyork','London','Newyork','London','London','Newyork','Newyork','Newyork','Newyork','London'],
'Date' : ['1990Q1','1990Q1','1990Q2','1990Q2','1991Q1','1991Q1','1991Q2','1992Q2','1993Q1','1993Q1','1994Q1'],
'Average': [34,56,45,67,23,89,12,45,67,34,67]}
df3=pd.DataFrame(dict3)
Now my df3 is as follows:
Region_Name Date Average
0 London 1990Q1 34
1 Newyork 1990Q1 56
2 London 1990Q2 45
3 Newyork 1990Q2 67
4 London 1991Q1 23
5 London 1991Q1 89
6 Newyork 1991Q2 12
7 Newyork 1992Q2 45
8 Newyork 1993Q1 67
9 Newyork 1993Q1 34
10 London 1994Q1 67
The code looks as follows:
new_df = df3.groupby(['Region_Name','Date'])
new1=new_df['Average'].transform('mean')
Result of dataframe new1:
print(new1)
0 34.0
1 56.0
2 45.0
3 67.0
4 56.0
5 56.0
6 12.0
7 45.0
8 50.5
9 50.5
10 67.0
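If a collapsed frame with one row per Region_Name/Date pair is wanted, rather than the row-aligned column that transform produces, mean plus as_index=False is the usual pattern (a sketch):
# One row per (Region_Name, Date) pair, holding the mean of Average.
new_df = df3.groupby(['Region_Name', 'Date'], as_index=False)['Average'].mean()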

Python: remove rows with max value in each group

I have a pandas data frame df like this.
In [1]: df
Out[1]:
country count
0 Japan 78
1 Japan 80
2 USA 45
3 France 34
4 France 90
5 UK 45
6 UK 34
7 China 32
8 China 87
9 Russia 20
10 Russia 67
I want to remove rows with the maximum value in each group. So the result should look like:
country count
0 Japan 78
3 France 34
6 UK 34
7 China 32
9 Russia 20
My first attempt:
idx = df.groupby(['country'], sort=False).max()['count'].index
df_new = df.drop(list(idx))
My second attempt:
idx = df.groupby(['country'])['count'].transform(max).index
df_new = df.drop(list(idx))
But it didn't work. Any ideas?
groupby / transform('max')
You can first calculate a series of maximums by group, then filter out instances where count is equal to that series. Note this will also remove duplicate maximums.
g = df.groupby(['country'])['count'].transform('max')
df = df[~(df['count'] == g)]
The series g represents maximums for each row by group. Where this equals df['count'] (by index), you have a row where you have the maximum for your group. You then use ~ for the negative condition.
print(df.groupby(['country'])['count'].transform('max'))
0 80
1 80
2 45
3 90
4 90
5 45
6 45
7 87
8 87
9     67
10    67
Name: count, dtype: int64
sort + drop
Alternatively, you can sort and drop the final occurrence:
res = df.sort_values('count')
res = res.drop(res.groupby('country').tail(1).index)
print(res)
country count
9 Russia 20
7 China 32
3 France 34
6 UK 34
0 Japan 78
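A third option (a sketch): drop by the index of each group's maximum. Note that idxmax returns a single label per group, so with duplicate maxima only one of them is dropped, unlike the transform approach above:
# Remove the row holding each country's maximum count.
res = df.drop(df.groupby('country')['count'].idxmax())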

check same rows and create new column conditionally in pandas

I have a dataframe like this:
df = pd.DataFrame({'User':['101','101','101','102','102','101','101','102','102','102'],'Country':['India','Japan','India','Brazil','Japan','UK','Austria','Japan','Singapore','UK'],
'Name':['RN','TN','AP','AP','TN','TN','TS','RN','TN','AP']})
If the User and Country are the same, I want to combine the Name column values into another column, like below.
You could use:
df['Name_E'] = df.groupby(['User', 'Country']).Name.transform(lambda x: ', '.join(x))
groupby with transform
df['all_names'] = df.groupby(['Country', 'User']).Name.transform(lambda x: ','.join(set(x)))
Country Name User all_names
0 India RN 101 AP,RN
1 Japan TN 101 TN
2 India AP 101 AP,RN
3 Brazil AP 102 AP
4 Japan TN 102 TN,RN
5 UK TN 101 TN
6 Austria TS 101 TS
7 Japan RN 102 TN,RN
8 Singapore TN 102 TN
9 UK AP 102 AP
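One caveat: set() has no guaranteed order, so the joined string can differ between runs. Sorting the names first makes the result deterministic (a sketch):
# sorted() fixes the order of the de-duplicated names before joining.
df['all_names'] = df.groupby(['Country', 'User']).Name.transform(lambda x: ','.join(sorted(set(x))))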
You need:
res = df.merge(df.groupby(['User', 'Country'])['Name'].unique().reset_index()
                 .rename(columns={'Name': 'Name_E'}),
               on=['Country', 'User'])
res['Name_E'] = res['Name_E'].apply(lambda x: ",".join(x))
Output:
User Country Name Name_E
0 101 India RN RN,AP
1 101 India AP RN,AP
2 101 Japan TN TN
3 102 Brazil AP AP
4 102 Japan TN TN,RN
5 102 Japan RN TN,RN
6 101 UK TN TN
7 101 Austria TS TS
8 102 Singapore TN TN
9 102 UK AP AP
