Is there a way to optimize pandas apply function during groupby? - python

I have a dataframe df as below (the construction code further down uses these column names):
Cus_id Card Nation Gender Age Code Amount yearmonth
111 1 India M Adult 543 100 201601
111 1 India M Adult 543 100 201601
111 1 India M Adult 543 150 201602
111 1 India M Adult 612 100 201602
111 1 India M Adult 715 200 201603
222 2 India M Adult 715 200 201601
222 2 India M Adult 543 100 201604
222 2 India M Adult 543 100 201603
333 3 India M Adult 543 100 201601
333 3 India M Adult 543 100 201601
333 4 India M Adult 543 150 201602
333 4 India M Adult 612 100 201607
Now, I want two dataframes as below:
df_1 :
card Code Total_Amount Avg_Amount
1 543 350 175
2 543 200 100
3 543 200 200
4 543 150 150
1 612 100 100
4 612 100 100
1 715 200 200
2 715 200 200
Logic for df_1:
1. Total_Amount: for each unique Card and Code, the sum of Amount (e.g. Card 1, Code 543 gives 350).
2. Avg_Amount: the Total_Amount divided by the number of unique yearmonth values for that Card and Code (e.g. Total_Amount = 350, number of unique yearmonths = 2, so 175).
df_2 :
Code Avg_Amount
543 156.25
612 100
715 200
Logic for df_2:
1. Avg_Amount: the mean of Avg_Amount per Code in df_1 (e.g. for Code 543 the sum of Avg_Amount is 175 + 100 + 200 + 150 = 625; divide it by the 4 rows, so 625 / 4 = 156.25).
Code to create the dataframe df:
import pandas as pd

df = pd.DataFrame({'Cus_id': (111,111,111,111,111,222,222,222,333,333,333,333),
                   'Card': (1,1,1,1,1,2,2,2,3,3,4,4),
                   'Nation': ('India','India','India','India','India','India','India','India','India','India','India','India'),
                   'Gender': ('M','M','M','M','M','M','M','M','M','M','M','M'),
                   'Age': ('Adult','Adult','Adult','Adult','Adult','Adult','Adult','Adult','Adult','Adult','Adult','Adult'),
                   'Code': (543,543,543,612,715,715,543,543,543,543,543,612),
                   'Amount': (100,100,150,100,200,200,100,100,100,100,150,100),
                   'yearmonth': (201601,201601,201602,201602,201603,201601,201604,201603,201601,201601,201602,201607)})
Code to get the required df_1 and df_2:
df1 = df.groupby(['Card', 'Code'])['yearmonth', 'Amount'].apply(
    lambda x: [sum(x.Amount), sum(x.Amount) / len(set(x.yearmonth))]).apply(
    pd.Series).reset_index()
df1.columns = ['Card', 'Code', 'Total_Amount', 'Avg_Amount']
df2 = df1.groupby('Code')['Avg_Amount'].apply(lambda x: sum(x) / len(x)).reset_index(
    name='Avg_Amount')
Though the code works fine, my dataset is huge, so it takes a long time. I suspect the apply function is the bottleneck. Is there a more optimized way to write this?

For DataFrame 1 you can do this:
tmp = df.groupby(['Card', 'Code'], as_index=False) \
        .agg({'Amount': 'sum', 'yearmonth': pd.Series.nunique})
df1 = tmp.assign(Avg_Amount=tmp.Amount / tmp.yearmonth) \
         .drop(columns=['yearmonth'])
Card Code Amount Avg_Amount
0 1 543 350 175.0
1 1 612 100 100.0
2 1 715 200 200.0
3 2 543 200 100.0
4 2 715 200 200.0
5 3 543 200 200.0
6 4 543 150 150.0
7 4 612 100 100.0
For DataFrame 2 you can do this:
df1.groupby('Code', as_index=False) \
   .agg({'Avg_Amount': 'mean'})
Code Avg_Amount
0 543 156.25
1 612 100.00
2 715 200.00
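On pandas 0.25 or newer, named aggregation can express the same two steps a bit more compactly; a minimal sketch, assuming the column names from the question (n_months is a helper name introduced here):
import pandas as pd

# One pass computes the total and the number of distinct months per group.
df1 = (df.groupby(['Card', 'Code'], as_index=False)
         .agg(Total_Amount=('Amount', 'sum'),
              n_months=('yearmonth', 'nunique')))
df1['Avg_Amount'] = df1['Total_Amount'] / df1['n_months']
df1 = df1.drop(columns='n_months')

# df_2 is then just the per-Code mean of Avg_Amount.
df2 = df1.groupby('Code', as_index=False)['Avg_Amount'].mean()
Either way, the vectorized agg avoids the row-wise lambda in apply, which is where the original code spends its time.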

Related

Rounding off to the nearest 50 in a pandas dataframe

I have a pandas dataframe, shown below:
ID Price
100 1040.0
101 1025.0
102 750.0
103 891.0
104 924.0
Expected output shown below
ID Price Price_new
100 1040.0 1050
101 1025.0 1050
102 750.0 750
103 891.0 900
104 924.0 900
This is what I have done, but it's not what I want. I want to round off to the nearest fifty in such a way that 1025 rounds up to 1050.
df['Price_new'] = (df['Price'] / 50).round().astype(int) * 50
This is due to Python 3's round, which rounds ties to the nearest even integer (banker's rounding), so (1025 / 50).round() gives 20 instead of 21.
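A quick demonstration of that behaviour (my illustration, not part of the original answer):
print(round(20.5))  # ties go to the even neighbour -> 20
print(round(21.5))  # -> 22
Instead, shift each price by its signed distance to the nearest multiple of 50: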
import numpy as np

s = df['Price'] % 50                                     # remainder above the lower multiple of 50
df['new'] = df['Price'] + np.where(s >= 25, 50 - s, -s)  # push up or down accordingly
df
Out[33]:
ID Price new
0 100 1040 1050
1 101 1025 1050
2 102 750 750
3 103 891 900
4 104 924 900
Follow my suggestion:
import pandas as pd

dt = pd.DataFrame({'ID': [100, 101, 102, 103, 104],
                   'Price': [1040, 1025, 750, 891, 924]})
# VERSION 1
dt['Price_new'] = round((dt['Price'] + 1) / 50).astype(int) * 50
# VERSION 2
dt['Price_new_v2'] = dt['Price'] - dt['Price'].map(lambda x: x % 50) + \
                     dt['Price'].map(lambda x: round(((x % 50) + 1) / 50)) * 50
ID Price Price_new Price_new_v2
0 100 1040 1050 1050
1 101 1025 1050 1050
2 102 750 750 750
3 103 891 900 900
4 104 924 900 900
Just add 1 inside the rounding and you will find your correct answer. There is another way to do it as well, but in my opinion the first version is more understandable than the second, even though the second uses the modulo operator.
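A further sketch (my addition, not from the answers above): rounding half up to the nearest 50 can also be done with plain integer arithmetic, sidestepping round() entirely:
# add half the step (25), floor-divide by the step, scale back up
df['Price_new'] = ((df['Price'] + 25) // 50 * 50).astype(int)
For 1025 this gives (1050 // 50) * 50 = 1050, as required.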

How do I use pandas groupby() to show the value of 2 things per one column?

So I have been trying to use pandas to create a DataFrame that reports the number of graduates working at jobs that require college degrees ('college_jobs') and at jobs that do not ('non_college_jobs').
note: the name of the dataframe I am dealing with is recent_grads
I tried the following code:
df1 = recent_grads.groupby(['major_category']).college_jobs.non_college_jobs.sum()
or
df1 = recent_grads.groupby(['major_category']).recent_grads['college_jobs','non_college_jobs'].sum()
or
df1 = recent_grads.groupby(['major_category']).recent_grads['college_jobs'],['non_college_jobs'].sum()
None of them worked! What am I supposed to do? Can somebody give me a simple explanation of this? I have been trying to read through the pandas documentation and did not find the explanation I wanted.
here is the head of the dataframe:
rank major_code major major_category \
0 1 2419 PETROLEUM ENGINEERING Engineering
1 2 2416 MINING AND MINERAL ENGINEERING Engineering
2 3 2415 METALLURGICAL ENGINEERING Engineering
3 4 2417 NAVAL ARCHITECTURE AND MARINE ENGINEERING Engineering
4 5 2405 CHEMICAL ENGINEERING Engineering
total sample_size men women sharewomen employed ... \
0 2339 36 2057 282 0.120564 1976 ...
1 756 7 679 77 0.101852 640 ...
2 856 3 725 131 0.153037 648 ...
3 1258 16 1123 135 0.107313 758 ...
4 32260 289 21239 11021 0.341631 25694 ...
part_time full_time_year_round unemployed unemployment_rate median \
0 270 1207 37 0.018381 110000
1 170 388 85 0.117241 75000
2 133 340 16 0.024096 73000
3 150 692 40 0.050125 70000
4 5180 16697 1672 0.061098 65000
p25th p75th college_jobs non_college_jobs low_wage_jobs
0 95000 125000 1534 364 193
1 55000 90000 350 257 50
2 50000 105000 456 176 0
3 43000 80000 529 102 0
4 50000 75000 18314 4440 972
[5 rows x 21 columns]
You could filter the initial DataFrame by the columns you're interested in and then perform the groupby and summation as below:
recent_grads[['major_category', 'college_jobs', 'non_college_jobs']].groupby('major_category').sum()
Alternatively, if you don't filter the columns first and call .sum() directly on recent_grads.groupby('major_category'), the sum is applied to every numeric column.
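A shorter equivalent (same assumptions about the column names) selects the two columns on the grouped object instead:
df1 = recent_grads.groupby('major_category')[['college_jobs', 'non_college_jobs']].sum()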

How do I select columns while having few conditions in pandas

I've got a dataframe:
1990 1991 1992 .... 2015 2016 2017
0 9 40 300 100 200 554
1 9 70 700 3300 200 554
2 5 70 900 100 200 554
3 8 80 900 176 200 554
4 7 50 200 250 280 145
5 9 30 900 100 207 554
6 2 80 700 180 200 554
7 2 80 400 100 200 554
8 5 80 300 100 200 554
9 7 70 800 100 200 554
How do I select the columns whose year is < 2000 or > 2005?
I tried the code below, but it failed:
1. df[(df.loc[:, :2000]) & (df.loc[:, 2005:])]
2. df[(df <2000) & (df>2005)]
Compare the column names:
print (df)
1999 2002 2003 2005 2006 2017
0 9 40 300 100 200 554
1 9 70 700 3300 200 554
2 5 70 900 100 200 554
3 8 80 900 176 200 554
4 7 50 200 250 280 145
5 9 30 900 100 207 554
6 2 80 700 180 200 554
7 2 80 400 100 200 554
8 5 80 300 100 200 554
9 7 70 800 100 200 554
df = df.loc[:, (df.columns <2000) | (df.columns>2005)]
print (df)
1999 2006 2017
0 9 200 554
1 9 200 554
2 5 200 554
3 8 200 554
4 7 280 145
5 9 207 554
6 2 200 554
7 2 200 554
8 5 200 554
9 7 200 554
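One caveat worth noting: the comparison works here because the column labels are integers. If they were read in as strings (as often happens with read_csv), cast them first; a sketch assuming four-digit year labels:
cols = df.columns.astype(int)
df = df.loc[:, (cols < 2000) | (cols > 2005)]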

Pivoting data with date as a row in Python

I have data that I've left in a format that should allow me to pivot on dates; it looks like:
Region 0 1 2 3
Date 2005-01-01 2005-02-01 2005-03-01 ....
East South Central 400 500 600
Pacific 100 200 150
.
.
Mountain 500 600 450
I need to pivot this table so it looks like:
0 Date Region value
1 2005-01-01 East South Central 400
2 2005-02-01 East South Central 500
3 2005-03-01 East South Central 600
.
.
4 2005-03-01 Pacific 100
4 2005-03-01 Pacific 200
4 2005-03-01 Pacific 150
.
.
Since both Date and Region are under one another I'm not sure how to melt or pivot around these strings so that I can get my desired output.
How can I go about this?
I think this is the solution you are looking for, shown by example.
import pandas as pd
import numpy as np
N=100
regions = list('abcdef')
df = pd.DataFrame([[i for i in range(N)],
                   ['2016-{}'.format(i) for i in range(N)],
                   list(np.random.randint(0, 500, N)),
                   list(np.random.randint(0, 500, N)),
                   list(np.random.randint(0, 500, N)),
                   list(np.random.randint(0, 500, N))])
df.index = ['Region', 'Date', 'a', 'b', 'c', 'd']
print(df)
This gives
0 1 2 3 4 5 6 7 \
Region 0 1 2 3 4 5 6 7
Date 2016-0 2016-1 2016-2 2016-3 2016-4 2016-5 2016-6 2016-7
a 96 432 181 64 87 355 339 314
b 360 23 162 98 450 78 114 109
c 143 375 420 493 321 277 208 317
d 371 144 207 108 163 67 465 130
And the solution to pivot this into the form you want is
df.transpose().melt(id_vars=['Date'], value_vars=['a', 'b', 'c', 'd'])
which gives
Date variable value
0 2016-0 a 96
1 2016-1 a 432
2 2016-2 a 181
3 2016-3 a 64
4 2016-4 a 87
5 2016-5 a 355
6 2016-6 a 339
7 2016-7 a 314
8 2016-8 a 111
9 2016-9 a 121
10 2016-10 a 124
11 2016-11 a 383
12 2016-12 a 424
13 2016-13 a 453
...
393 2016-93 d 176
394 2016-94 d 277
395 2016-95 d 256
396 2016-96 d 174
397 2016-97 d 349
398 2016-98 d 414
399 2016-99 d 132
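Mapped back to the original layout, the same idea reads as follows (a sketch; it assumes the raw frame really has one 'Date' row plus one row per region, as in the question, so that transposing makes each region a column):
tidy = (df.transpose()
          .melt(id_vars=['Date'], var_name='Region', value_name='value'))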

Pandas pivot table: columns order and subtotals

I'm using Pandas 0.19.
Considering the following data frame:
FID admin0 admin1 admin2 windspeed population
0 cntry1 state1 city1 60km/h 700
1 cntry1 state1 city1 90km/h 210
2 cntry1 state1 city2 60km/h 100
3 cntry1 state2 city3 60km/h 70
4 cntry1 state2 city4 60km/h 180
5 cntry1 state2 city4 90km/h 370
6 cntry2 state3 city5 60km/h 890
7 cntry2 state3 city6 60km/h 120
8 cntry2 state3 city6 90km/h 420
9 cntry2 state3 city6 120km/h 360
10 cntry2 state4 city7 60km/h 740
How can I create a table like this one?
population
60km/h 90km/h 120km/h
admin0 admin1 admin2
cntry1 state1 city1 700 210 0
cntry1 state1 city2 100 0 0
cntry1 state2 city3 70 0 0
cntry1 state2 city4 180 370 0
cntry2 state3 city5 890 0 0
cntry2 state3 city6 120 420 360
cntry2 state4 city7 740 0 0
I have tried with the following pivot table:
table = pd.pivot_table(df,index=["admin0","admin1","admin2"], columns=["windspeed"], values=["population"],fill_value=0)
In general it works great, but unfortunately I am not able to sort the new columns in the right order: the 120km/h column appears before the ones for 60km/h and 90km/h. How can I specify the order of the new columns?
Moreover, as a second step I need to add subtotals both for admin0 and admin1. Ideally, the table I need should be like this:
population
60km/h 90km/h 120km/h
admin0 admin1 admin2
cntry1 state1 city1 700 210 0
cntry1 state1 city2 100 0 0
SUM state1 800 210 0
cntry1 state2 city3 70 0 0
cntry1 state2 city4 180 370 0
SUM state2 250 370 0
SUM cntry1 1050 580 0
cntry2 state3 city5 890 0 0
cntry2 state3 city6 120 420 360
SUM state3 1010 420 360
cntry2 state4 city7 740 0 0
SUM state4 740 0 0
SUM cntry2 1750 420 360
SUM ALL 2800 1000 360
You can do it using the reindex() method and custom sorting:
In [26]: table
Out[26]:
population
windspeed 120km/h 60km/h 90km/h
admin0 admin1 admin2
cntry1 state1 city1 0 700 210
city2 0 100 0
state2 city3 0 70 0
city4 0 180 370
cntry2 state3 city5 0 890 0
city6 360 120 420
state4 city7 0 740 0
In [27]: cols = sorted(table.columns.tolist(), key=lambda x: int(x[1].replace('km/h','')))
In [28]: cols
Out[28]: [('population', '60km/h'), ('population', '90km/h'), ('population', '120km/h')]
In [29]: table = table.reindex(columns=cols)
In [30]: table
Out[30]:
population
windspeed 60km/h 90km/h 120km/h
admin0 admin1 admin2
cntry1 state1 city1 700 210 0
city2 100 0 0
state2 city3 70 0 0
city4 180 370 0
cntry2 state3 city5 890 0 0
city6 120 420 360
state4 city7 740 0 0
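An alternative sketch (my addition, not part of the answer above): declare windspeed as an ordered categorical before pivoting, which should make the pivoted columns come out in the declared order without a reindex:
order = ['60km/h', '90km/h', '120km/h']
df['windspeed'] = pd.Categorical(df['windspeed'], categories=order, ordered=True)
table = pd.pivot_table(df,
                       index=['admin0', 'admin1', 'admin2'],
                       columns=['windspeed'],
                       values=['population'],
                       fill_value=0)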
Solution with subtotals, built with MultiIndex.from_arrays. At the end, concat all the DataFrames together, sort the index, and append the grand total:
#replace km/h and convert to int
df.windspeed = df.windspeed.str.replace('km/h','').astype(int)
print (df)
FID admin0 admin1 admin2 windspeed population
0 0 cntry1 state1 city1 60 700
1 1 cntry1 state1 city1 90 210
2 2 cntry1 state1 city2 60 100
3 3 cntry1 state2 city3 60 70
4 4 cntry1 state2 city4 60 180
5 5 cntry1 state2 city4 90 370
6 6 cntry2 state3 city5 60 890
7 7 cntry2 state3 city6 60 120
8 8 cntry2 state3 city6 90 420
9 9 cntry2 state3 city6 120 360
10 10 cntry2 state4 city7 60 740
#pivoting
table = pd.pivot_table(df,
                       index=["admin0", "admin1", "admin2"],
                       columns=["windspeed"],
                       values=["population"],
                       fill_value=0)
print (table)
population
windspeed 60 90 120
admin0 admin1 admin2
cntry1 state1 city1 700 210 0
city2 100 0 0
state2 city3 70 0 0
city4 180 370 0
cntry2 state3 city5 890 0 0
city6 120 420 360
state4 city7 740 0 0
#groupby and create sum dataframe by levels 0,1
df1 = table.groupby(level=[0,1]).sum()
df1.index = pd.MultiIndex.from_arrays([df1.index.get_level_values(0),
                                       df1.index.get_level_values(1) + '_sum',
                                       len(df1.index) * ['']])
print (df1)
population
windspeed 60 90 120
admin0
cntry1 state1_sum 800 210 0
state2_sum 250 370 0
cntry2 state3_sum 1010 420 360
state4_sum 740 0 0
df2 = table.groupby(level=0).sum()
df2.index = pd.MultiIndex.from_arrays([df2.index.values + '_sum',
                                       len(df2.index) * [''],
                                       len(df2.index) * ['']])
print (df2)
population
windspeed 60 90 120
cntry1_sum 1050 580 0
cntry2_sum 1750 420 360
#concat all dataframes together, sort index
df = pd.concat([table, df1, df2]).sort_index(level=[0])
#add km/h to second level in columns
df.columns = pd.MultiIndex.from_arrays([df.columns.get_level_values(0),
                                        df.columns.get_level_values(1).astype(str) + 'km/h'])
#add all sum
df.loc[('All_sum','','')] = table.sum().values
print (df)
population
60km/h 90km/h 120km/h
admin0 admin1 admin2
cntry1 state1 city1 700 210 0
city2 100 0 0
state1_sum 800 210 0
state2 city3 70 0 0
city4 180 370 0
state2_sum 250 370 0
cntry1_sum 1050 580 0
cntry2 state3 city5 890 0 0
city6 120 420 360
state3_sum 1010 420 360
state4 city7 740 0 0
state4_sum 740 0 0
cntry2_sum 1750 420 360
All_sum 2800 1000 360
EDIT, per the comment: emit a subtotal only for groups with more than one row:
def f(x):
    # return a subtotal only for groups with more than one row;
    # single-row groups come back as NaN and are dropped below
    if len(x) > 1:
        return x.sum()

df1 = table.groupby(level=[0, 1]).apply(f).dropna(how='all')
df1.index = pd.MultiIndex.from_arrays([df1.index.get_level_values(0),
                                       df1.index.get_level_values(1) + '_sum',
                                       len(df1.index) * ['']])
print (df1)
population
windspeed 60 90 120
admin0
cntry1 state1_sum 800.0 210.0 0.0
state2_sum 250.0 370.0 0.0
cntry2 state3_sum 1010.0 420.0 360.0
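If only the grand total is needed (not the per-level subtotals), pivot_table can append it directly; a sketch against the original string-valued windspeed column, using the built-in margins:
total = pd.pivot_table(df,
                       index=['admin0', 'admin1', 'admin2'],
                       columns='windspeed',
                       values='population',
                       aggfunc='sum',
                       fill_value=0,
                       margins=True,
                       margins_name='All_sum')
Note that margins=True also appends an All_sum column of row totals, which you can drop if unwanted.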
