I have a BIG dataframe with millions of rows & many columns and need to do GROUPBY AND COUNT OF VALUES OF DIFFERENT COLUMNS .
Need help with efficient coding for the problem with minimal lines of code and a code which runs very fast.
I'm giving a simpler example below about my problem.
Below is my input CSV.
UID,CONTINENT,AGE_GROUP,APPROVAL_STATUS
user1,ASIA,26-30,YES
user10,ASIA,26-30,NO
user11,ASIA,36-40,YES
user12,EUROPE,21-25,NO
user13,AMERICA,31-35,not_confirmed
user14,ASIA,26-30,YES
user15,EUROPE,41-45,not_confirmed
user16,AMERICA,21-25,NO
user17,ASIA,26-30,YES
user18,EUROPE,41-45,NO
user19,AMERICA,31-35,YES
user2,AMERICA,31-35,NO
user20,ASIA,46-50,NO
user21,EUROPE,18-20,not_confirmed
user22,ASIA,26-30,not_confirmed
user23,ASIA,36-40,YES
user24,AMERICA,26-30,YES
user25,EUROPE,36-40,NO
user26,EUROPE,Above 50,NO
user27,ASIA,46-50,YES
user28,AMERICA,31-35,NO
user29,AMERICA,Above 50,not_confirmed
user3,ASIA,36-40,YES
user30,EUROPE,41-45,YES
user4,EUROPE,41-45,NO
user5,ASIA,26-30,not_confirmed
user6,ASIA,46-50,not_confirmed
user7,ASIA,26-30,YES
user8,AMERICA,18-20,YES
user9,EUROPE,31-35,NO
I Expect the output to be as below.
Output should show
CONTINENT column as the main groupby column
UNIQUE values of AGE_GROUP and APPROVAL_STATUS columns as separate column name. And also, it should display the count of UNIQUE values of AGE_GROUP and APPROVAL_STATUS columns for each CONTINENT under respective output columns.
Output:-
CONTINENT,18-20,21-25,26-30,31-35,36-40,41-45,46-50,Above 50,NO,YES,not_confirmed,USER_COUNT
AMERICA,1,1,1,4,0,0,0,1,3,3,2,8
ASIA,0,0,7,0,3,0,3,0,2,8,3,13
EUROPE,1,1,0,1,1,4,0,1,6,1,2,9
Below is how I'm achieving it currently, but this is NOT en efficient way.
Need help with efficient coding for the problem with minimal lines of code and a code which runs very fast.
I've also sen that this could be achieved by using pivit table with pandas. But not too sure about it.
in_file = "/Users/user1/groupby.csv"
out_file = "/Users/user1/groupby1.csv"
df= pd.read_csv(in_file)
print(df)
df1 = df.groupby(['CONTINENT', 'AGE_GROUP']).size().unstack(fill_value=0).reset_index()
df1 = df1.sort_values(["CONTINENT"], axis=0, ascending=True)
print(df1)
df2 = df.groupby(['CONTINENT', 'APPROVAL_STATUS']).size().unstack(fill_value=0).reset_index()
df2 = df2.sort_values(["CONTINENT"], axis=0, ascending=True)
print(df2)
df3 = df.groupby("CONTINENT").count().reset_index()
df3 = df3[df3.columns[0:2]]
df3.columns = ["CONTINENT", "USER_COUNT"]
df3 = df3.sort_values(["CONTINENT"], axis=0, ascending=True)
df3.reset_index(drop=True, inplace=True)
# df3.to_csv(out_file, index=False)
print(df3)
df2.drop('CONTINENT', axis=1, inplace=True)
df3.drop('CONTINENT', axis=1, inplace=True)
df_final = pd.concat([df1, df2, df3], axis=1)
print(df_final)
df_final.to_csv(out_file, index=False)
Easy solution
Let us use crosstabs to calculate frequency tables then concat the tables along columns axis:
s1 = pd.crosstab(df['CONTINENT'], df['AGE_GROUP'])
s2 = pd.crosstab(df['CONTINENT'], df['APPROVAL_STATUS'])
pd.concat([s1, s2, s2.sum(1).rename('USER_COUNT')], axis=1)
18-20 21-25 26-30 31-35 36-40 41-45 46-50 Above 50 NO YES not_confirmed USER_COUNT
CONTINENT
AMERICA 1 1 1 4 0 0 0 1 3 3 2 8
ASIA 0 0 7 0 3 0 3 0 2 8 3 13
EUROPE 1 1 0 1 1 4 0 1 6 1 2 9
I have 3 datasets
All the same shape
CustomerNumber, Name, Status
A customer can appear on 1, 2 or all 3.
Each dataset is a list of gold/silver/bronze.
example data:
Dataframe 1:
100,James,Gold
Dataframe 2:
100,James,Silver
101,Paul,Silver
Dataframe 3:
100,James,Bronze
101,Paul,Bronze
102,Fred,Bronze
Expected output/aggregated list:
100,James,Gold
101,Paul,Silver
102,Fred,Bronze
So a customer that is captured in all 3, I want to keep Status as gold.
Have been playing with join and merge and just can’t get it right.
Use concat with convert column to ordered categorical, so get priorites if sorting values by multiple columns and last remove duplicates by DataFrame.drop_duplicates:
print (df1)
print (df2)
print (df3)
a b c
0 100 James Gold
a b c
0 100 James Silver
1 101 Paul Silver
a b c
0 101 Paul Bronze
1 102 Fred Bronze
df = pd.concat([df1, df2, df3], ignore_index=True)
df['c'] = pd.Categorical(df['c'], ordered=True, categories=['Gold','Silver','Bronze'])
df = df.sort_values(['a','b','c']).drop_duplicates(['a','b'])
print (df)
a b c
0 100 James Gold
2 101 Paul Silver
4 102 Fred Bronze
I have original df where I have column "average", where is average value counted for country . Now I have new_df, where I want to add these df average values based on country.
df
id country value average
1 USA 3 2
2 UK 5 5
3 France 2 2
4 USA 1 2
new df
country average
USA 2
Italy Nan
I had a solution that worked but there is a problem, when there is in new_df a country for which I have not count the average yet. In that case I want to fill just nan.
Can you please recommend me any solution?
Thanks
If need add average column to df2 use DataFrame.merge with DataFrame.drop_duplicates:
df2.merge(df1.drop_duplicates('country')[['country','average']], on='country', how='left')
If need aggregate mean:
df2.join(df1.groupby('country')['average'].mean(), on='country')
I have a big df with tarifs for aviation lines. you can specify data for concrete route, for example by airport of origination, airport of destination, aicraft, month.
Plain example of df:
data = {'orig':['A','A','A','B','B','B'],
'dest':['C','C','C','D','D','D'],
'currency':['RUB','USD','RUB','USD','RUB','USD'],
'tarif':[100,10,120,20,150,30]}
df=pd.DataFrame(data)
df
orig dest currency tarif
0 A C RUB 100
1 A C USD 10
2 A C RUB 120
3 B D USD 20
4 B D RUB 150
5 B D USD 30
I have df2, that contains aviation plan for concrete company. There you may find the same info, like month, orig,dest,aircraft
Plain example of df2:
data2={'orig':['A','B'],
'dest':['C','D']}
df2=pd.DataFrame(data2)
df2
orig dest
0 A C
1 B D
Task:for each row in df2, summurize tarif using conditions.
What I expect:
orig dest RUB USD
0 A C 220 10
1 B D 150 50
Thanks.
Hmmm
df = df.groupby(["orig", "dest", "currency"]).agg(sum).unstack()
df.columns = ['_'.join(col).strip() for col in df.columns.values]
df
gives me
tarif_RUB tarif_USD
orig dest
A C 220 10
B D 150 50
Which is your desired result but I haven't looked at df2 yet so I am afraid you have to describe better/extend your example so I have to do something with df2.