Is there a way to optimize pandas apply function during groupby?
I have a dataframe df as below:
Cus_id Card Nation Gender Age Code Amount yearmonth
111 1 India M Adult 543 100 201601
111 1 India M Adult 543 100 201601
111 1 India M Adult 543 150 201602
111 1 India M Adult 612 100 201602
111 1 India M Adult 715 200 201603
222 2 India M Adult 715 200 201601
222 2 India M Adult 543 100 201604
222 2 India M Adult 543 100 201603
333 3 India M Adult 543 100 201601
333 3 India M Adult 543 100 201601
333 4 India M Adult 543 150 201602
333 4 India M Adult 612 100 201607
Now, I want two dataframes as below:
df_1:
Card Code Total_Amount Avg_Amount
1 543 350 175
2 543 200 100
3 543 200 200
4 543 150 150
1 612 100 100
4 612 100 100
1 715 200 200
2 715 200 200
Logic for df_1:
1. Total_Amount: for each unique Card and Code, the sum of Amount (e.g. Card: 1, Code: 543 gives 350).
2. Avg_Amount: Total_Amount divided by the number of unique yearmonth values for that Card and Code (e.g. Total_Amount = 350, number of unique yearmonths = 2, so 350/2 = 175).
df_2:
Code Avg_Amount
543 156.25
612 100
715 200
Logic for df_2:
1. Avg_Amount: the mean of Avg_Amount for each Code in df_1 (e.g. for Code 543 the sum of Avg_Amount is 175+100+200+150 = 625; divided by the number of rows, 4, that gives 625/4 = 156.25).
Code to create the dataframe df:
import pandas as pd

df = pd.DataFrame({'Cus_id': (111,111,111,111,111,222,222,222,333,333,333,333),
                   'Card': (1,1,1,1,1,2,2,2,3,3,4,4),
                   'Nation': ('India',)*12,
                   'Gender': ('M',)*12,
                   'Age': ('Adult',)*12,
                   'Code': (543,543,543,612,715,715,543,543,543,543,543,612),
                   'Amount': (100,100,150,100,200,200,100,100,100,100,150,100),
                   'yearmonth': (201601,201601,201602,201602,201603,201601,
                                 201604,201603,201601,201601,201602,201607)})
Code to get the required df_1 and df_2:
df1 = df.groupby(['Card','Code'])[['yearmonth','Amount']].apply(
    lambda x: [sum(x.Amount), sum(x.Amount)/len(set(x.yearmonth))]).apply(
    pd.Series).reset_index()
df1.columns = ['Card','Code','Total_Amount','Avg_Amount']

df2 = df1.groupby('Code')['Avg_Amount'].apply(lambda x: sum(x)/len(x)).reset_index(
    name='Avg_Amount')
Though the code works fine, my dataset is huge and this takes a long time; I suspect the apply calls are the bottleneck. Is there a more optimized way to write this?
For DataFrame 1 you can do this:
# one aggregation per (Card, Code): total Amount and count of distinct yearmonths
tmp = df.groupby(['Card', 'Code'], as_index=False) \
        .agg({'Amount': 'sum', 'yearmonth': pd.Series.nunique})
# average = total amount divided by the number of distinct yearmonths
df1 = tmp.assign(Avg_Amount=tmp.Amount / tmp.yearmonth) \
         .drop(columns=['yearmonth'])
Card Code Amount Avg_Amount
0 1 543 350 175.0
1 1 612 100 100.0
2 1 715 200 200.0
3 2 543 200 100.0
4 2 715 200 200.0
5 3 543 200 200.0
6 4 543 150 150.0
7 4 612 100 100.0
For DataFrame 2 you can do this:
df1.groupby('Code', as_index=False) \
.agg({'Avg_Amount': 'mean'})
Code Avg_Amount
0 543 156.25
1 612 100.00
2 715 200.00
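As a side note (this is my own sketch, assuming pandas >= 0.25; it is not required for the answer above): named aggregation expresses the same thing in one step and names the output columns directly:

import pandas as pd

# same df as in the question; named aggregation requires pandas >= 0.25
df1 = (df.groupby(['Card', 'Code'], as_index=False)
         .agg(Total_Amount=('Amount', 'sum'),
              n_months=('yearmonth', 'nunique')))
df1['Avg_Amount'] = df1['Total_Amount'] / df1['n_months']
df1 = df1.drop(columns=['n_months'])

df2 = df1.groupby('Code', as_index=False).agg(Avg_Amount=('Avg_Amount', 'mean'))

Both variants avoid the row-wise apply, which is where the original code spends its time.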
Related
Rounding off to the nearest 50's pandas dataframe
I have a pandas dataframe, shown below:

ID   Price
100  1040.0
101  1025.0
102   750.0
103   891.0
104   924.0

Expected output shown below:

ID   Price   Price_new
100  1040.0       1050
101  1025.0       1050
102   750.0        750
103   891.0        900
104   924.0        900

This is what I have done, but it's not what I want. I want to round off to the nearest fifty in such a way that 1025 rounds up to 1050.

df['Price_new'] = (df['Price'] / 50).round().astype(int) * 50
This is due to the issue: round with python 3.

s = (df['Price'] % 50)
df['new'] = df['Price'] + np.where(s >= 25, 50 - s, -s)
df
Out[33]:
    ID  Price   new
0  100   1040  1050
1  101   1025  1050
2  102    750   750
3  103    891   900
4  104    924   900
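For context (my addition, easy to verify in any Python 3 shell): Python 3's round uses round-half-to-even ("banker's rounding"), which is exactly why plain rounding misses 1025:

round(20.5)            # 20, not 21: halves round to the nearest even integer
round(21.5)            # 22
round(1025 / 50) * 50  # 1000, not the expected 1050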
Follow my suggestion:

import pandas as pd

dt = pd.DataFrame({'ID': [100, 101, 102, 103, 104],
                   'Price': [1040, 1025, 750, 891, 924]})

# VERSION 1
dt['Price_new'] = round((dt['Price'] + 1) / 50).astype(int) * 50

# VERSION 2
dt['Price_new_v2'] = dt['Price'] - dt['Price'].map(lambda x: x % 50) \
    + dt['Price'].map(lambda x: round(((x % 50) + 1) / 50)) * 50

   ID  Price  Price_new  Price_new_v2
0  100   1040       1050          1050
1  101   1025       1050          1050
2  102    750        750           750
3  103    891        900           900
4  104    924        900           900

Just add 1 inside the rounding and you will get the correct answer. There is another way to do it as well, but in my opinion the first version is more understandable than the second, even though the second uses the modulo operator.
How do I use pandas groupby() to show the value of 2 things per one column?
So I have been trying to use pandas to create a DataFrame that reports the number of graduates working at jobs that do require college degrees ('college_jobs') and at jobs that do not require college degrees ('non_college_jobs'). Note: the name of the dataframe I am dealing with is recent_grads.

I tried the following code:

df1 = recent_grads.groupby(['major_category']).college_jobs.non_college_jobs.sum()

or

df1 = recent_grads.groupby(['major_category']).recent_grads['college_jobs','non_college_jobs'].sum()

or

df1 = recent_grads.groupby(['major_category']).recent_grads['college_jobs'],['non_college_jobs'].sum()

None of them worked! What am I supposed to do? Can somebody give me a simple explanation of this? I have been reading through the pandas documentation and did not find the explanation I wanted.

Here is the head of the dataframe:

   rank  major_code                                       major major_category  \
0     1        2419                       PETROLEUM ENGINEERING    Engineering
1     2        2416              MINING AND MINERAL ENGINEERING    Engineering
2     3        2415                   METALLURGICAL ENGINEERING    Engineering
3     4        2417  NAVAL ARCHITECTURE AND MARINE ENGINEERING    Engineering
4     5        2405                        CHEMICAL ENGINEERING    Engineering

   total  sample_size    men  women  sharewomen  employed  ...  \
0   2339           36   2057    282    0.120564      1976  ...
1    756            7    679     77    0.101852       640  ...
2    856            3    725    131    0.153037       648  ...
3   1258           16   1123    135    0.107313       758  ...
4  32260          289  21239  11021    0.341631     25694  ...

   part_time  full_time_year_round  unemployed  unemployment_rate  median  \
0        270                  1207          37           0.018381  110000
1        170                   388          85           0.117241   75000
2        133                   340          16           0.024096   73000
3        150                   692          40           0.050125   70000
4       5180                 16697        1672           0.061098   65000

   p25th   p75th  college_jobs  non_college_jobs  low_wage_jobs
0  95000  125000          1534               364            193
1  55000   90000           350               257             50
2  50000  105000           456               176              0
3  43000   80000           529               102              0
4  50000   75000         18314              4440            972

[5 rows x 21 columns]
You could filter the initial DataFrame down to the columns you're interested in and then perform the groupby and summation:

recent_grads[['major_category', 'college_jobs', 'non_college_jobs']].groupby('major_category').sum()

Conversely, if you skip the column filter and call .sum() on recent_grads.groupby('major_category') directly, it will be applied to every numeric column.
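A minimal, self-contained sketch of the same pattern (the data here is invented purely for illustration; only the column names mirror the question):

import pandas as pd

# hypothetical stand-in for recent_grads, just to show the pattern
recent_grads = pd.DataFrame({
    'major_category':   ['Engineering', 'Engineering', 'Business'],
    'college_jobs':     [1534, 350, 800],
    'non_college_jobs': [364, 257, 1200],
})

result = (recent_grads[['major_category', 'college_jobs', 'non_college_jobs']]
          .groupby('major_category')
          .sum())
print(result)
#                 college_jobs  non_college_jobs
# major_category
# Business                 800              1200
# Engineering             1884               621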
How do I select columns while having a few conditions in pandas
I've got a dataframe:

   1990  1991  1992  ....  2015  2016  2017
0     9    40   300         100   200   554
1     9    70   700        3300   200   554
2     5    70   900         100   200   554
3     8    80   900         176   200   554
4     7    50   200         250   280   145
5     9    30   900         100   207   554
6     2    80   700         180   200   554
7     2    80   400         100   200   554
8     5    80   300         100   200   554
9     7    70   800         100   200   554

How do I select the columns with names < 2000 and > 2005? I tried the code below, but it failed:

1. df[(df.loc[:, :2000]) & (df.loc[:, 2005:])]
2. df[(df < 2000) & (df > 2005)]
Compare the column names:

print (df)
   1999  2002  2003  2005  2006  2017
0     9    40   300   100   200   554
1     9    70   700  3300   200   554
2     5    70   900   100   200   554
3     8    80   900   176   200   554
4     7    50   200   250   280   145
5     9    30   900   100   207   554
6     2    80   700   180   200   554
7     2    80   400   100   200   554
8     5    80   300   100   200   554
9     7    70   800   100   200   554

df = df.loc[:, (df.columns < 2000) | (df.columns > 2005)]
print (df)
   1999  2006  2017
0     9   200   554
1     9   200   554
2     5   200   554
3     8   200   554
4     7   280   145
5     9   207   554
6     2   200   554
7     2   200   554
8     5   200   554
9     7   200   554
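One caveat (my note, not part of the answer above): this comparison assumes the column labels are already numeric. If they were read in as strings, e.g. from a CSV, cast them first, or the comparison against an int will raise a TypeError:

import pandas as pd

# hypothetical frame whose year columns came in as strings
df = pd.DataFrame([[9, 40, 300], [5, 70, 900]], columns=['1999', '2002', '2017'])

df.columns = df.columns.astype(int)  # '1999' < 2000 would fail; 1999 < 2000 works
print(df.loc[:, (df.columns < 2000) | (df.columns > 2005)])
#    1999  2017
# 0     9   300
# 1     5   900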
Pivoting data with date as a row in Python
I have data that I've left in a format that will allow me to pivot on dates. It looks like:

Region                    0           1           2     3
Date             2005-01-01  2005-02-01  2005-03-01  ....
East South Central      400         500         600
Pacific                 100         200         150
.
.
Mountain                500         600         450

I need to pivot this table so it looks like:

   Date        Region              value
1  2005-01-01  East South Central    400
2  2005-02-01  East South Central    500
3  2005-03-01  East South Central    600
.
4  2005-01-01  Pacific               100
5  2005-02-01  Pacific               200
6  2005-03-01  Pacific               150
.

Since both Date and Region are under one another, I'm not sure how to melt or pivot around these strings to get my desired output. How can I go about this?
I think this is the solution you are looking for. Shown by example.

import pandas as pd
import numpy as np

N = 100
regions = list('abcdef')
df = pd.DataFrame([[i for i in range(N)],
                   ['2016-{}'.format(i) for i in range(N)],
                   list(np.random.randint(0, 500, N)),
                   list(np.random.randint(0, 500, N)),
                   list(np.random.randint(0, 500, N)),
                   list(np.random.randint(0, 500, N))])
df.index = ['Region', 'Date', 'a', 'b', 'c', 'd']
print(df)

This gives

             0       1       2       3       4       5       6       7  \
Region       0       1       2       3       4       5       6       7
Date    2016-0  2016-1  2016-2  2016-3  2016-4  2016-5  2016-6  2016-7
a           96     432     181      64      87     355     339     314
b          360      23     162      98     450      78     114     109
c          143     375     420     493     321     277     208     317
d          371     144     207     108     163      67     465     130

And the solution to pivot this into the form you want is

df.transpose().melt(id_vars=['Date'], value_vars=['a', 'b', 'c', 'd'])

which gives

        Date variable value
0     2016-0        a    96
1     2016-1        a   432
2     2016-2        a   181
3     2016-3        a    64
4     2016-4        a    87
5     2016-5        a   355
6     2016-6        a   339
7     2016-7        a   314
8     2016-8        a   111
9     2016-9        a   121
10   2016-10        a   124
11   2016-11        a   383
12   2016-12        a   424
13   2016-13        a   453
...
393  2016-93        d   176
394  2016-94        d   277
395  2016-95        d   256
396  2016-96        d   174
397  2016-97        d   349
398  2016-98        d   414
399  2016-99        d   132
Pandas pivot table: columns order and subtotals
I'm using Pandas 0.19. Considering the following data frame:

     admin0  admin1 admin2 windspeed  population
FID
0    cntry1  state1  city1    60km/h         700
1    cntry1  state1  city1    90km/h         210
2    cntry1  state1  city2    60km/h         100
3    cntry1  state2  city3    60km/h          70
4    cntry1  state2  city4    60km/h         180
5    cntry1  state2  city4    90km/h         370
6    cntry2  state3  city5    60km/h         890
7    cntry2  state3  city6    60km/h         120
8    cntry2  state3  city6    90km/h         420
9    cntry2  state3  city6   120km/h         360
10   cntry2  state4  city7    60km/h         740

How can I create a table like this one?

                      population
                      60km/h  90km/h  120km/h
admin0 admin1 admin2
cntry1 state1 city1      700     210        0
cntry1 state1 city2      100       0        0
cntry1 state2 city3       70       0        0
cntry1 state2 city4      180     370        0
cntry2 state3 city5      890       0        0
cntry2 state3 city6      120     420      360
cntry2 state4 city7      740       0        0

I have tried with the following pivot table:

table = pd.pivot_table(df, index=["admin0","admin1","admin2"],
                       columns=["windspeed"], values=["population"], fill_value=0)

In general it works great, but unfortunately I am not able to sort the new columns in the right order: the 120km/h column appears before the ones for 60km/h and 90km/h. How can I specify the order of the new columns?

Moreover, as a second step I need to add subtotals both for admin0 and admin1. Ideally, the table I need should be like this:

                      population
                      60km/h  90km/h  120km/h
admin0 admin1 admin2
cntry1 state1 city1      700     210        0
cntry1 state1 city2      100       0        0
SUM    state1            800     210        0
cntry1 state2 city3       70       0        0
cntry1 state2 city4      180     370        0
SUM    state2            250     370        0
SUM    cntry1           1050     580        0
cntry2 state3 city5      890       0        0
cntry2 state3 city6      120     420      360
SUM    state3           1010     420      360
cntry2 state4 city7      740       0        0
SUM    state4            740       0        0
SUM    cntry2           1750     420      360
SUM    ALL              2800    1000      360
You can do it using the reindex() method and custom sorting:

In [26]: table
Out[26]:
                     population
windspeed               120km/h 60km/h 90km/h
admin0 admin1 admin2
cntry1 state1 city1           0    700    210
              city2           0    100      0
       state2 city3           0     70      0
              city4           0    180    370
cntry2 state3 city5           0    890      0
              city6         360    120    420
       state4 city7           0    740      0

In [27]: cols = sorted(table.columns.tolist(),
    ...:               key=lambda x: int(x[1].replace('km/h','')))

In [28]: cols
Out[28]: [('population', '60km/h'), ('population', '90km/h'), ('population', '120km/h')]

In [29]: table = table.reindex(columns=cols)

In [30]: table
Out[30]:
                     population
windspeed                60km/h 90km/h 120km/h
admin0 admin1 admin2
cntry1 state1 city1         700    210       0
              city2         100      0       0
       state2 city3          70      0       0
              city4         180    370       0
cntry2 state3 city5         890      0       0
              city6         120    420     360
       state4 city7         740      0       0
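A variant worth sketching (my own suggestion, not part of the answer above): declaring windspeed as an ordered categorical before pivoting should make the columns come out in the desired order without a post-hoc reindex, since groupby/pivot results follow category order:

speed_order = ['60km/h', '90km/h', '120km/h']
df['windspeed'] = pd.Categorical(df['windspeed'], categories=speed_order, ordered=True)
table = pd.pivot_table(df, index=["admin0", "admin1", "admin2"],
                       columns=["windspeed"], values=["population"], fill_value=0)
# columns now appear as 60km/h, 90km/h, 120km/h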
Solution with subtotals and MultiIndex.from_arrays: build the subtotal frames with groupby, then concat all the DataFrames together, sort_index, and add the overall sum:

#replace km/h and convert to int
df.windspeed = df.windspeed.str.replace('km/h','').astype(int)
print (df)
    FID  admin0  admin1 admin2  windspeed  population
0     0  cntry1  state1  city1         60         700
1     1  cntry1  state1  city1         90         210
2     2  cntry1  state1  city2         60         100
3     3  cntry1  state2  city3         60          70
4     4  cntry1  state2  city4         60         180
5     5  cntry1  state2  city4         90         370
6     6  cntry2  state3  city5         60         890
7     7  cntry2  state3  city6         60         120
8     8  cntry2  state3  city6         90         420
9     9  cntry2  state3  city6        120         360
10   10  cntry2  state4  city7         60         740

#pivoting
table = pd.pivot_table(df, index=["admin0","admin1","admin2"],
                       columns=["windspeed"], values=["population"], fill_value=0)
print (table)
                     population
windspeed                    60   90  120
admin0 admin1 admin2
cntry1 state1 city1         700  210    0
              city2         100    0    0
       state2 city3          70    0    0
              city4         180  370    0
cntry2 state3 city5         890    0    0
              city6         120  420  360
       state4 city7         740    0    0

#groupby and create sum dataframe by levels 0,1
df1 = table.groupby(level=[0,1]).sum()
df1.index = pd.MultiIndex.from_arrays([df1.index.get_level_values(0),
                                       df1.index.get_level_values(1) + '_sum',
                                       len(df1.index) * ['']])
print (df1)
                  population
windspeed                 60   90  120
admin0
cntry1 state1_sum        800  210    0
       state2_sum        250  370    0
cntry2 state3_sum       1010  420  360
       state4_sum        740    0    0

df2 = table.groupby(level=0).sum()
df2.index = pd.MultiIndex.from_arrays([df2.index.values + '_sum',
                                       len(df2.index) * [''],
                                       len(df2.index) * ['']])
print (df2)
            population
windspeed           60   90  120
cntry1_sum        1050  580    0
cntry2_sum        1750  420  360

#concat all dataframes together, sort index
df = pd.concat([table, df1, df2]).sort_index(level=[0])

#add km/h to second level in columns
df.columns = pd.MultiIndex.from_arrays([df.columns.get_level_values(0),
                                        df.columns.get_level_values(1).astype(str) + 'km/h'])

#add all sum
df.loc[('All_sum','','')] = table.sum().values
print (df)
                     population
                         60km/h 90km/h 120km/h
admin0 admin1 admin2
cntry1 state1 city1         700    210       0
              city2         100      0       0
       state1_sum           800    210       0
       state2 city3          70      0       0
              city4         180    370       0
       state2_sum           250    370       0
cntry1_sum                 1050    580       0
cntry2 state3 city5         890      0       0
              city6         120    420     360
       state3_sum          1010    420     360
       state4 city7         740      0       0
       state4_sum           740      0       0
cntry2_sum                 1750    420     360
All_sum                    2800   1000     360

EDIT by comment:

def f(x):
    print (x)
    if (len(x) > 1):
        return x.sum()

df1 = table.groupby(level=[0,1]).apply(f).dropna(how='all')
df1.index = pd.MultiIndex.from_arrays([df1.index.get_level_values(0),
                                       df1.index.get_level_values(1) + '_sum',
                                       len(df1.index) * ['']])
print (df1)
                  population
windspeed                  60     90    120
admin0
cntry1 state1_sum       800.0  210.0    0.0
       state2_sum       250.0  370.0    0.0
cntry2 state3_sum      1010.0  420.0  360.0
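As a footnote (my own aside, not claimed by the answer above): if only the grand total is needed rather than the per-level subtotals, pivot_table can append it directly via margins:

table = pd.pivot_table(df, index=["admin0", "admin1", "admin2"],
                       columns=["windspeed"], values=["population"],
                       aggfunc='sum', fill_value=0,
                       margins=True, margins_name='All_sum')
# adds an 'All_sum' row (and column) of grand totals; the per-state and
# per-country subtotals still require the manual concat approach above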