How to retrieve the columns of a DataFrame within a loop in Python?

I have the output below. I wrote the code for this inside a while loop. Here, when I enter 3, it creates 3 different DataFrames with different values.
Enter the Number to iter:3
Paid CDF Ultimate Reserves
OP1 3901463.0 NaN NaN NaN
OP2 5339085.0 1.000000 5339085.000 0.0
OP3 4909315.0 1.085767 5331516.090 422201.1
OP4 4588268.0 1.136096 5212272.448 624004.4
OP5 3873311.0 1.238680 4799032.329 925721.3
OP6 3691712.0 1.350145 4983811.200 1292099.2
OP7 3483130.0 1.602168 5579974.260 2096844.3
OP8 2864498.0 2.334476 6685738.332 3821240.3
OP9 1363294.0 4.237940 5777639.972 4414346.0
OP10 344014.0 15.204053 5230388.856 4886374.9
Total NaN NaN NaN 18482831.5
Paid CDF Ultimate Reserves
OP1 3901463.0 NaN NaN NaN
OP2 5339085.0 1.000000 5339085.000 0.0
OP3 4909315.0 1.090000 5351153.350 441838.4
OP4 4588268.0 1.137559 5221448.984 633181.0
OP5 3873311.0 1.231368 4768045.841 894734.8
OP6 3691712.0 1.331933 4917360.384 1225648.4
OP7 3483130.0 1.563703 5447615.320 1964485.3
OP8 2864498.0 2.318600 6642770.862 3778272.9
OP9 1363294.0 4.234960 5773550.090 4410256.1
OP10 344014.0 16.958969 5834133.426 5490119.4
Total NaN NaN NaN 18838536.3
Paid CDF Ultimate Reserves
OP1 3901463.0 NaN NaN NaN
OP2 5339085.0 1.000000 5339085.000 0.0
OP3 4909315.0 1.072698 5267694.995 358380.0
OP4 4588268.0 1.130229 5184742.840 596474.8
OP5 3873311.0 1.208164 4678959.688 805648.7
OP6 3691712.0 1.267187 4677399.104 985687.1
OP7 3483130.0 1.497767 5217728.740 1734598.7
OP8 2864498.0 2.229342 6384966.042 3520468.0
OP9 1363294.0 4.219405 5751737.386 4388443.4
OP10 344014.0 16.036065 5516608.504 5172594.5
Total NaN NaN NaN 17562295.2
Using the Reserves column above, I have to generate the DataFrame below, with rows Simulation1, Simulation2, and so on, up to the number of Reserves columns generated from the user input.
OP1 OP2 OP3 OP4 OP5 OP6 OP7 OP8 OP9 OP10 Total
Simulation1 NaN 0.0 422201.1 624004.4 925721.3 1292099.2 2096844.3 3821240.3 4414346.0 4886374.9 18482831.5
Simulation2 NaN 0.0 441838.4 633181.0 894734.8 1225648.4 1964485.3 3778272.9 4410256.1 5490119.4 18838536.3
Simulation3 NaN 0.0 358380.0 596474.8 805648.7 985687.1 1734598.7 3520468.0 4388443.4 5172594.5 17562295.2
I have the code below:
itno=int(input("Enter the Number to iter:"))
iter_count=0
while (iter_count < itno):
    randomdf = scaledDF.copy()
    choices = randomdf.values[~pd.isnull(randomdf.values)]
    randomdf = randomdf.applymap(lambda x: np.random.choice(choices) if not pd.isnull(x) else x)
    #print(randomdf,"\n\n")
    cumdf = CumulativePaidTriangledf.iloc[:, :-1][:-2].copy()
    ldfdf = LDFTriangledf.iloc[:, :-1].copy()
    ResampledDF = randomdf.copy()
    for colname4, col4 in ResampledDF.iteritems():
        ResampledDF[colname4] = (ResampledDF[colname4] * (Variencedf[colname4][-1]/(cumdf[colname4]**0.5)))+ldfdf[colname4][-1]
    #print(ResampledDF,"\n\n")
    #SUMPRODUCT:
    sumPro = ResampledDF.copy()
    #cumdf = cumdf.iloc[:, :-1]
    for colname5,col5 in sumPro.iteritems():
        sumPro[colname5] = (sumPro[colname5].round(2))*cumdf[colname5]
    sumPro = sumPro.append(pd.Series(sumPro.sum(), name='SUMPRODUCT'))
    #print(sumPro)
    #SUM(OFFSET):
    sumOff = cumdf.apply(lambda x: x.iloc[:cumdf.index.get_loc(x.last_valid_index())].sum())
    #print(sumOff)
    #Weighted avg:
    Weighted_avg = sumPro.loc['SUMPRODUCT']/sumOff
    #print(Weighted_avg)
    ResampledDF = ResampledDF.append(pd.Series(Weighted_avg, name='Weighted Avg'))
    #print(ResampledDF,"\n\n")
    '''for colname6,col6 in ResampledDF.iteritems():
        ResampledDF[colname6] = ResampledDF[colname6].replace({'0':np.nan, 0:np.nan})
    print(ResampledDF)'''
    ResampledDF.loc['Weighted Avg'] = ResampledDF.loc['Weighted Avg'].replace(0, 1)
    c = ResampledDF.iloc[-1][::-1].cumprod()[::-1]
    ResampledDF = ResampledDF.append(pd.Series(c,name='CDF'))
    #print("\n\n",ResampledDF)
    #Getting Calculation of ultimates:
    s = CumulativePaidTriangledf.iloc[:, :][:-2].copy()
    ultiCalc = pd.DataFrame()
    ultiCalc['Paid']= s['Total']
    ultiCalc['CDF'] = np.flip(ResampledDF.loc['CDF'].values)
    ultiCalc['Ultimate'] = ultiCalc['Paid']*(ultiCalc['CDF']).round(3)
    ultiCalc['Reserves'] = (ultiCalc['Ultimate'].round(1))-ultiCalc['Paid']
    ultiCalc.loc['Total'] = pd.Series(ultiCalc['Reserves'].sum(), index = ['Reserves']).round(2)
    print("\n\n",ultiCalc)
    iter_count+=1
    #Getting Simulations:
    simulationDf = pd.DataFrame(columns=['OP1','OP2','OP3','OP4','OP5','OP6','OP7','OP8','OP9','OP10','Total'])
    simulationDf.loc['Simulation'] = ultiCalc['Reserves']
    print("\n\n",simulationDf)
Current Output:
Simulation1 NaN
Simulation2 0.0
Simulation3 353470.7
Simulation4 559768.7
Simulation5 859875.0
Simulation6 1162889.3
Simulation7 1828643.2
Simulation8 3958736.2
Simulation9 4464787.9
Simulation10 5224196.6
Simulation11 18412367.6
Simulation12 NaN
Simulation13 0.0
Simulation14 402563.8
Simulation15 669887.1
Simulation16 883114.9
Simulation17 1185039.6
Simulation18 1859991.4
Simulation19 3511874.5
Simulation20 3875844.8
Simulation21 4481126.4
Simulation22 16869442.5

Use a list comprehension to loop over the list of DataFrames and select the Reserves column from each, join them together with the DataFrame constructor, and finally set the index if necessary:
dfs = [df1, df2, df3]
df = pd.DataFrame([x['Reserves'] for x in dfs]).reset_index(drop=True)
df.index = 'Simulation' + (df.index + 1).astype(str)
If the DataFrames are generated in some loop, like this pseudocode:
#create list outside loop
dfs = []

iter_count=0
while (iter_count < itno):
    randomdf = scaledDF.copy()
    choices = randomdf.values[~pd.isnull(randomdf.values)]
    randomdf = randomdf.applymap(lambda x: np.random.choice(choices) if not pd.isnull(x) else x)
    #print(randomdf,"\n\n")
    ....
    ....
    #Getting Calculation of ultimates:
    s = CumulativePaidTriangledf.iloc[:, :][:-2].copy()
    ultiCalc = pd.DataFrame()
    ultiCalc['Paid']= s['Total']
    ultiCalc['CDF'] = np.flip(ResampledDF.loc['CDF'].values)
    ultiCalc['Ultimate'] = ultiCalc['Paid']*(ultiCalc['CDF']).round(3)
    ultiCalc['Reserves'] = (ultiCalc['Ultimate'].round(1))-ultiCalc['Paid']
    ultiCalc.loc['Total'] = pd.Series(ultiCalc['Reserves'].sum(), index = ['Reserves']).round(2)
    print("\n\n",ultiCalc)
    iter_count+=1
    #append in loop
    dfs.append(ultiCalc['Reserves'])
And then join them together outside the loop:
df = pd.DataFrame([x for x in dfs]).reset_index(drop=True)
df.index = 'Simulation' + (df.index + 1).astype(str)
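For reference, here is a minimal, self-contained sketch of this pattern, using small made-up Reserves frames (df1, df2, df3 are hypothetical stand-ins for the generated ultiCalc DataFrames):
import numpy as np
import pandas as pd

idx = ['OP1', 'OP2', 'OP3', 'Total']
df1 = pd.DataFrame({'Reserves': [np.nan, 0.0, 422201.1, 18482831.5]}, index=idx)
df2 = pd.DataFrame({'Reserves': [np.nan, 0.0, 441838.4, 18838536.3]}, index=idx)
df3 = pd.DataFrame({'Reserves': [np.nan, 0.0, 358380.0, 17562295.2]}, index=idx)
dfs = [df1, df2, df3]

# each Reserves Series becomes one row; its index (OP1..Total) becomes the columns
df = pd.DataFrame([x['Reserves'] for x in dfs]).reset_index(drop=True)
df.index = 'Simulation' + (df.index + 1).astype(str)
print(df)
#              OP1  OP2       OP3       Total
# Simulation1  NaN  0.0  422201.1  18482831.5
# Simulation2  NaN  0.0  441838.4  18838536.3
# Simulation3  NaN  0.0  358380.0  17562295.2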

Related

How to iterate through columns of the dataframe?

I want to go through all the columns of the DataFrame so that I get particular data from each column; using that data I have to calculate values for another DataFrame.
Here is what I have:
DP1 DP2 DP3 DP4 DP5 DP6 DP7 DP8 DP9 DP10 Total
OP1 357848.0 1124788.0 1735330.0 2218270.0 2745596.0 3319994.0 3466336.0 3606286.0 3833515.0 3901463.0 3901463.0
OP2 352118.0 1236139.0 2170033.0 3353322.0 3799067.0 4120063.0 4647867.0 4914039.0 5339085.0 NaN 5339085.0
OP3 290507.0 1292306.0 2218525.0 3235179.0 3985995.0 4132918.0 4628910.0 4909315.0 NaN NaN 4909315.0
OP4 310608.0 1418858.0 2195047.0 3757447.0 4029929.0 4381982.0 4588268.0 NaN NaN NaN 4588268.0
OP5 443160.0 1136350.0 2128333.0 2897821.0 3402672.0 3873311.0 NaN NaN NaN NaN 3873311.0
OP6 396132.0 1333217.0 2180715.0 2985752.0 3691712.0 NaN NaN NaN NaN NaN 3691712.0
OP7 440832.0 1288463.0 2419861.0 3483130.0 NaN NaN NaN NaN NaN NaN 3483130.0
OP8 359480.0 1421128.0 2864498.0 NaN NaN NaN NaN NaN NaN NaN 2864498.0
OP9 376686.0 1363294.0 NaN NaN NaN NaN NaN NaN NaN NaN 1363294.0
OP10 344014.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN 344014.0
Total 3671385.0 11614543.0 17912342.0 21930921.0 21654971.0 19828268.0 17331381.0 13429640.0 9172600.0 3901463.0 34358090.0
Latest Observation 344014.0 1363294.0 2864498.0 3483130.0 3691712.0 3873311.0 4588268.0 4909315.0 5339085.0 3901463.0 NaN
From this table I would like to calculate this formula: for column DP1, take the next column's (DP2) Total and divide it by (DP1's Total minus DP1's Latest Observation). We have to calculate this for all the columns and save the result in another DataFrame.
We need a row like this:
Weighted Average 3.491 1.747 1.457 1.174 1.104 1.086 1.054 1.077 1.018
This is the code we tried (it only handles one column):
LDFTriangledf['Weighted Average'] =CumulativePaidTriangledf.loc['Total','DP2']/(CumulativePaidTriangledf.loc['Total','DP1'] - CumulativePaidTriangledf.loc['Latest Observation','DP1'])
You can remove the column names from .loc and just shift(-1, axis=1) to get the next column's Total. This lets you apply the formula to all columns in a single operation:
CumulativePaidTriangledf.shift(-1, axis=1).loc['Total'] / (CumulativePaidTriangledf.loc['Total'] - CumulativePaidTriangledf.loc['Latest Observation'])
# DP1 3.490607
# DP2 1.747333
# DP3 1.457413
# DP4 1.173852
# DP5 1.103824
# DP6 1.086269
# DP7 1.053874
# DP8 1.076555
# DP9 1.017725
# DP10 inf
# Total NaN
# dtype: float64
Here is a breakdown of what the three components are doing (values shown left to right for DP1 through DP10 and Total):

A: .shift(-1, axis=1).loc['Total'] -- We are shifting the whole Total row to the left, so every column now has the next column's Total value.
   1.161454e+07  1.791234e+07  2.193092e+07  2.165497e+07  1.982827e+07  1.733138e+07  1.342964e+07  9.172600e+06  3.901463e+06  34358090.0  NaN

B: .loc['Total'] -- This is the normal Total row.
   3.671385e+06  1.161454e+07  1.791234e+07  2.193092e+07  2.165497e+07  1.982827e+07  1.733138e+07  1.342964e+07  9.172600e+06  3901463.0  34358090.0

C: .loc['Latest Observation'] -- This is the normal Latest Observation row.
   3.440140e+05  1.363294e+06  2.864498e+06  3.483130e+06  3.691712e+06  3.873311e+06  4.588268e+06  4.909315e+06  5.339085e+06  3901463.0  NaN

A / (B - C) -- This is what the code above does. It takes the shifted Total row (A) and divides it by the difference of the current Total row (B) and the current Latest Observation row (C).
   3.490607  1.747333  1.457413  1.173852  1.103824  1.086269  1.053874  1.076555  1.017725  inf  NaN
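To see the same shift-based trick in isolation, here is a minimal runnable sketch on a small made-up triangle (not the OP's data):
import numpy as np
import pandas as pd

# hypothetical cumulative triangle with Total and Latest Observation rows
tri = pd.DataFrame(
    {'DP1': [100.0, 90.0, 80.0, 270.0, 80.0],
     'DP2': [150.0, 140.0, np.nan, 290.0, 140.0],
     'DP3': [170.0, np.nan, np.nan, 170.0, 170.0]},
    index=['OP1', 'OP2', 'OP3', 'Total', 'Latest Observation'])

# next column's Total divided by (this column's Total minus its Latest Observation)
wavg = tri.shift(-1, axis=1).loc['Total'] / (tri.loc['Total'] - tri.loc['Latest Observation'])
print(wavg)
# DP1 ~ 1.5263 (290 / 190), DP2 ~ 1.1333 (170 / 150), DP3 NaN (no next column)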

How to increment Dataframe row name value?

I have this dataframe:
OP1 OP2 OP3 OP4 OP5 OP6 OP7 OP8 OP9 OP10 Total
Simulation1 NaN 0.0 471294.2 692828.5 1107766.9 1580052.7 2452123.5 4374088.4 4545222.2 4764249.9 19987626.3
OP1 OP2 OP3 OP4 OP5 OP6 OP7 OP8 OP9 OP10 Total
Simulation1 NaN 0.0 333833.4 573533.5 948961.2 1343783.2 2354595.9 4061858.2 4348907.9 4769410.1 18734883.4
OP1 OP2 OP3 OP4 OP5 OP6 OP7 OP8 OP9 OP10 Total
Simulation1 NaN 0.0 441838.4 660710.6 976074.4 1391775.4 2002799.8 3921497.8 3708159.7 3852268.8 16955124.9
I need the row names to be like this:
OP1 OP2 OP3 OP4 OP5 OP6 OP7 OP8 OP9 OP10 Total
Simulation1 NaN 0.0 471294.2 692828.5 1107766.9 1580052.7 2452123.5 4374088.4 4545222.2 4764249.9 19987626.3
Simulation2 NaN 0.0 333833.4 573533.5 948961.2 1343783.2 2354595.9 4061858.2 4348907.9 4769410.1 18734883.4
Simulation3 NaN 0.0 441838.4 660710.6 976074.4 1391775.4 2002799.8 3921497.8 3708159.7 3852268.8 16955124.9
.
.
.
So on
Here I have to increment the row name: Simulation1, Simulation2, and so on.
I have this code:
simulationDf = pd.DataFrame(columns=['OP1','OP2','OP3','OP4','OP5','OP6','OP7','OP8','OP9','OP10','Total'])
simulationDf.loc['Simulation1'] = ultiCalc['Reserves']
It seems that is your index (in the DataFrame you posted in the second code block), so you can use the .index attribute with a list comprehension:
df.index = ['Simulation' + str(x) for x in range(1, len(df) + 1)]
Now if you print df you will get your desired output:
OP1 OP2 OP3 OP4 OP5 OP6 OP7 OP8 OP9 OP10 Total
Simulation1 NaN 0.0 471294.2 692828.5 1107766.9 1580052.7 2452123.5 4374088.4 4545222.2 4764249.9 19987626.3
Simulation2 NaN 0.0 333833.4 573533.5 948961.2 1343783.2 2354595.9 4061858.2 4348907.9 4769410.1 18734883.4
Simulation3 NaN 0.0 441838.4 660710.6 976074.4 1391775.4 2002799.8 3921497.8 3708159.7 3852268.8 16955124.9
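As a quick runnable sketch (with placeholder data standing in for the concatenated simulation results), here is the same rename, plus an equivalent f-string form:
import numpy as np
import pandas as pd

# placeholder frame: three simulation rows with a default integer index
df = pd.DataFrame({'OP1': [np.nan, np.nan, np.nan],
                   'OP2': [0.0, 0.0, 0.0],
                   'Total': [19987626.3, 18734883.4, 16955124.9]})

df.index = ['Simulation' + str(x) for x in range(1, len(df) + 1)]
# equivalently: df.index = [f'Simulation{x}' for x in range(1, len(df) + 1)]
print(df)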

Error of using Zip function is returning missing result

I want to scrape some data from a website, so I wrote code to create a list that contains all the records. Then I want to extract some elements from all the records to create a DataFrame.
However, some information in the DataFrame is missing. The full data list has information from 2012 to 2019, but the DataFrame only has 2018 and 2019. I tried different ways to resolve the problem. Finally, I found that if I do not use the zip function, the problem does not occur. May I know why, and if I do not use the zip function, what solution can I use?
import requests
import pandas as pd

records = []
tickers = ['AAL']
url_metrics = 'https://stockrow.com/api/companies/{}/financials.json?ticker={}&dimension=A&section=Growth'
indicators_url = 'https://stockrow.com/api/indicators.json'

# scrape all data and append to a list - all_records
for s in tickers:
    indicators = {i['id']: i for i in requests.get(indicators_url).json()}
    all_records = []
    for d in requests.get(url_metrics.format(s,s)).json():
        d['id'] = indicators[d['id']]['name']
        all_records.append(d)

    gross_profit_growth = next(d for d in all_records if 'Gross Profit Growth' in d['id'])
    operating_income_growth = next(d for d in all_records if 'Operating Income Growth' in d['id'])
    net_income_growth = next(d for d in all_records if 'Net Income Growth' in d['id'])
    diluted_eps_growth = next(d for d in all_records if 'EPS Growth (diluted)' in d['id'])
    operating_cash_flow_growth = next(d for d in all_records if 'Operating Cash Flow Growth' in d['id'])

    # extract values from all_records and create the dataframe
    for (k1, v1), (_, v2), (_, v3), (_, v4), (_, v5) in zip(gross_profit_growth.items(), operating_income_growth.items(), net_income_growth.items(), diluted_eps_growth.items(), operating_cash_flow_growth.items()):
        if k1 in ('id'):
            continue
        records.append({
            'symbol': s,
            'date': k1,
            'gross_profit_growth%': v1,
            'operating_income_growth%': v2,
            'net_income_growth%': v3,
            'diluted_eps_growth%': v4,
            'operating_cash_flow_growth%': v5
        })

df = pd.DataFrame(records)
df.head(50)
The result is incorrect. It only has 2018 and 2019 data. It should have data from 2012 to 2019.
symbol date gross_profit_growth% operating_income_growth% net_income_growth% diluted_eps_growth% operating_cash_flow_growth%
0 AAL 2019-12-31 0.0405 -0.1539 -0.0112 0.2508 0.0798
1 AAL 2018-12-31 -0.0876 -0.2463 0.0 -0.2231 -0.2553
My expected result:
symbol date gross_profit_growth% operating_income_growth% net_income_growth% diluted_eps_growth% operating_cash_flow_growth%
0 AAL 31/12/2019 0.0405 0.154 0.1941 0.2508 0.0798
1 AAL 31/12/2018 -0.0876 -0.3723 0.1014 -0.2231 -0.2553
2 AAL 31/12/2017 -0.0165 -0.1638 -0.5039 -0.1892 -0.2728
3 AAL 31/12/2016 -0.079 -0.1844 -0.6604 -0.5655 0.044
4 AAL 31/12/2015 0.1983 0.4601 1.6405 1.8168 1.0289
5 AAL 31/12/2014 0.7305 2.0372 2.5714 1.2308 3.563
6 AAL 31/12/2013 0.3575 8.4527 0.0224 nan -0.4747
7 AAL 31/12/2012 0.1688 1.1427 0.052 nan 0.7295
8 AAL 31/12/2011 0.0588 -4.3669 -3.2017 nan -0.4013
9 AAL 31/12/2010 0.3413 1.3068 0.6792 nan 0.3344
import requests
import pandas as pd

records = []
tickers = ['A', 'AAL', 'AAPL']
url_metrics = 'https://stockrow.com/api/companies/{}/financials.json?ticker={}&dimension=A&section=Growth'
indicators_url = 'https://stockrow.com/api/indicators.json'

for s in tickers:
    print('Getting data for ticker: {}'.format(s))
    indicators = {i['id']: i for i in requests.get(indicators_url).json()}
    all_records = []
    for d in requests.get(url_metrics.format(s,s)).json():
        d['id'] = indicators[d['id']]['name']
        all_records.append(d)

    gross_profit_growth = next(d for d in all_records if 'Gross Profit Growth' == d['id'])
    operating_income_growth = next(d for d in all_records if 'Operating Income Growth' == d['id'])
    net_income_growth = next(d for d in all_records if 'Net Income Growth' == d['id'])
    eps_growth_diluted = next(d for d in all_records if 'EPS Growth (diluted)' == d['id'])
    operating_cash_flow_growth = next(d for d in all_records if 'Operating Cash Flow Growth' == d['id'])

    del gross_profit_growth['id']
    del operating_income_growth['id']
    del net_income_growth['id']
    del eps_growth_diluted['id']
    del operating_cash_flow_growth['id']

    d1 = pd.DataFrame({'date': gross_profit_growth.keys(), 'gross_profit_growth%': gross_profit_growth.values()}).set_index('date')
    d2 = pd.DataFrame({'date': operating_income_growth.keys(), 'operating_income_growth%': operating_income_growth.values()}).set_index('date')
    d3 = pd.DataFrame({'date': net_income_growth.keys(), 'net_income_growth%': net_income_growth.values()}).set_index('date')
    d4 = pd.DataFrame({'date': eps_growth_diluted.keys(), 'diluted_eps_growth%': eps_growth_diluted.values()}).set_index('date')
    d5 = pd.DataFrame({'date': operating_cash_flow_growth.keys(), 'operating_cash_flow_growth%': operating_cash_flow_growth.values()}).set_index('date')

    d = pd.concat([d1, d2, d3, d4, d5], axis=1)
    d['symbol'] = s
    records.append(d)

df = pd.concat(records)
print(df)
Prints:
gross_profit_growth% operating_income_growth% net_income_growth% diluted_eps_growth% operating_cash_flow_growth% symbol
2019-10-31 0.0466 0.0409 2.3892 2.4742 -0.0607 A
2018-10-31 0.1171 0.1202 -0.538 -0.5381 0.2227 A
2017-10-31 0.0919 0.3122 0.4805 0.5 0.1211 A
2016-10-31 0.0764 0.1782 0.1521 0.1765 0.5488 A
2015-10-31 0.0329 0.2458 -0.2696 -0.1905 -0.2996 A
2014-10-31 0.0362 0.0855 -0.252 -0.3 -0.3655 A
2013-10-31 -0.4709 -0.655 -0.3634 -0.3578 -0.0619 A
2012-10-31 0.0213 0.0448 0.1393 0.1474 -0.0254 A
2011-10-31 0.2044 0.8922 0.4795 0.6102 0.7549 A
2019-12-31 0.0405 0.154 0.1941 0.2508 0.0798 AAL
2018-12-31 -0.0876 -0.3723 0.1014 -0.2231 -0.2553 AAL
2017-12-31 -0.0165 -0.1638 -0.5039 -0.1892 -0.2728 AAL
2016-12-31 -0.079 -0.1844 -0.6604 -0.5655 0.044 AAL
2015-12-31 0.1983 0.4601 1.6405 1.8168 1.0289 AAL
2014-12-31 0.7305 2.0372 2.5714 1.2308 3.563 AAL
2013-12-31 0.3575 8.4527 0.0224 NaN -0.4747 AAL
2012-12-31 0.1688 1.1427 0.052 NaN 0.7295 AAL
2011-12-31 0.0588 -4.3669 -3.2017 NaN -0.4013 AAL
2010-12-31 0.3413 1.3068 0.6792 NaN 0.3344 AAL
2020-09-30 0.0667 0.0369 0.039 NaN 0.1626 AAPL
2019-09-30 -0.0338 -0.0983 -0.0718 -0.0017 -0.1039 AAPL
2018-09-30 0.1548 0.1557 0.2312 0.2932 0.2057 AAPL
2017-09-30 0.0466 0.022 0.0583 0.1083 -0.0303 AAPL
2016-09-30 -0.1 -0.1573 -0.1443 -0.0987 -0.185 AAPL
2015-09-30 0.3273 0.3567 0.3514 0.4295 0.3609 AAPL
2014-09-30 0.0969 0.0715 0.0668 0.1358 0.1127 AAPL
2013-09-30 -0.0635 -0.113 -0.1125 -0.0996 0.0553 AAPL
2012-09-30 0.567 0.6348 0.6099 0.595 0.3551 AAPL
2011-09-30 0.706 0.8379 0.8499 0.827 1.0182 AAPL
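Why zip drops rows: zip stops at the shortest of the iterables it is given, so when one of the selected indicator dicts has fewer entries than the others (here, likely because the original `in` substring test matched a different, shorter indicator series), every longer dict is truncated along with it. A minimal illustration with hypothetical data, with itertools.zip_longest shown as the padding alternative:
from itertools import zip_longest

a = {'id': 'A', '2019': 1, '2018': 2, '2017': 3}
b = {'id': 'B', '2019': 10, '2018': 20}  # shorter dict

print(list(zip(a.items(), b.items())))
# 3 pairs only -- zip stops at the shortest iterable, so '2017' is silently dropped

print(list(zip_longest(a.items(), b.items(), fillvalue=('missing', None))))
# 4 pairs -- the missing entry is padded instead of dropped
The answer above avoids the issue entirely by building one date-indexed DataFrame per indicator and letting pd.concat align them on the date index.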

How to groupby, cut, transpose then merge result of one pandas Dataframe using vectorisation

Here is an example of the data we want to process:
df_size = 1000000
df_random = pd.DataFrame({'boat_id': np.random.choice(range(300), df_size),
                          'X': np.random.random_integers(0, 1000, df_size),
                          'target_Y': np.random.random_integers(0, 10, df_size)})
X boat_id target_Y
0 482 275 6
1 705 245 4
2 328 102 6
3 631 227 6
4 234 236 8
...
I want to obtain an output like this :
X0 X1 X2 X3 X4 X5 X6 X7 X8 X9 target_Y boat_id
40055 684.0 692.0 950.0 572.0 442.0 850.0 75.0 140.0 382.0 576.0 0.0 1
40056 178.0 949.0 490.0 777.0 335.0 559.0 397.0 729.0 701.0 44.0 4.0 1
40057 21.0 818.0 341.0 577.0 612.0 57.0 303.0 183.0 519.0 357.0 0.0 1
40058 501.0 1000.0 999.0 532.0 765.0 913.0 964.0 922.0 772.0 534.0 1.0 2
40059 305.0 906.0 724.0 996.0 237.0 197.0 414.0 171.0 369.0 299.0 8.0 2
40060 408.0 796.0 815.0 638.0 691.0 598.0 913.0 579.0 650.0 955.0 2.0 3
40061 298.0 512.0 247.0 824.0 764.0 414.0 71.0 440.0 135.0 707.0 9.0 4
40062 535.0 687.0 945.0 859.0 718.0 580.0 427.0 284.0 122.0 777.0 2.0 4
40063 352.0 115.0 228.0 69.0 497.0 387.0 552.0 473.0 574.0 759.0 3.0 4
40064 179.0 870.0 862.0 186.0 25.0 125.0 925.0 310.0 335.0 739.0 7.0 4
...
I wrote the following code, but it is way too slow.
It groups by boat_id, cuts with enumerate, transposes, then merges the results into one pandas DataFrame.
start_time = time.time()
N = 10
col_names = map(lambda x: 'X'+str(x), range(N))
compil = pd.DataFrame(columns = col_names)
i = 0
# I group by boat ID
for boat_id, df_boat in df_random.groupby('boat_id'):
    # then I cut every 50 line
    for (line_number, (index, row)) in enumerate(df_boat.iterrows()):
        if line_number % 5 == 0:
            compil_new_line_X = list(df_boat.iloc[line_number-N:line_number, :]["X"])
            # filter to avoid issues at the start and end of the columns
            if len(compil_new_line_X) == N:
                compil.loc[i, col_names] = compil_new_line_X
                compil.loc[i, 'target_Y'] = row['target_Y']
                compil.loc[i, 'boat_id'] = row['boat_id']
                i += 1
print("Total %s seconds" % (time.time() - start_time))
Total 232.947000027 seconds
My questions are:
How can I do something every "x number of lines" and then merge the results?
Does a way exist to vectorize that kind of operation?
Here is a solution that improves the calculation time by 35%.
It uses a groupby on 'boat_id', then groupby.apply to divide the groups into small chunks.
A final apply then creates the new line. We can probably still improve it.
df_size = 1000000
df_random = pd.DataFrame({'boat_id': np.random.choice(range(300), df_size),
                          'X': np.random.random_integers(0, 1000, df_size),
                          'target_Y': np.random.random_integers(0, 10, df_size)})

start_time = time.time()
len_of_chunks = 10
col_names = map(lambda x: 'X'+str(x), range(N)) + ['boat_id', 'target_Y']

def prepare_data(group):
    # this function creates the new line we will put in 'compil'
    info_we_want_to_keep = ['boat_id', 'target_Y']
    info_and_target = group.tail(1)[info_we_want_to_keep].values
    k = group["X"]
    return np.hstack([k.values, info_and_target[0]])  # this creates the new line we will put in 'compil'

# we group by ID (boat)
# we divide in chunks of len "len_of_chunks"
# we apply prepare_data to each chunk
groups = df_random.groupby('boat_id').apply(lambda x: x.groupby(np.arange(len(x)) // len_of_chunks).apply(prepare_data))

# we reset the index
# we take the '0' column containing the valuable info
# we put the info in a new 'compil' dataframe
# we drop incomplete lines (generated by chunks < len_of_chunks)
compil = pd.DataFrame(groups.reset_index()[0].values.tolist(), columns=col_names).dropna()
print("Total %s seconds" % (time.time() - start_time))
Total 153.781999826 seconds
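The key step above is the inner groupby on np.arange(len(x)) // len_of_chunks, which labels every run of len_of_chunks consecutive rows with the same key. A minimal sketch on a tiny made-up frame (chunk size 3):
import numpy as np
import pandas as pd

chunk = 3
small = pd.DataFrame({'boat_id': [1]*7 + [2]*4,
                      'X': range(11),
                      'target_Y': range(11)})

# within each boat, rows get chunk labels 0,0,0,1,1,1,2,... so every `chunk` rows form a group
out = small.groupby('boat_id').apply(
    lambda g: g.groupby(np.arange(len(g)) // chunk)['X'].apply(list))
print(out)
# boat 1 -> [0, 1, 2], [3, 4, 5], [6]; boat 2 -> [7, 8, 9], [10]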

Pythonic / Panda Way to Create Function to Groupby

I am fairly new to programming & am looking for a more pythonic way to implement some code. Here is dummy data:
df = pd.DataFrame({
    'Category': np.random.choice(['Group A', 'Group B'], 10000),
    'Sub-Category': np.random.choice(['X', 'Y', 'Z'], 10000),
    'Sub-Category-2': np.random.choice(['G', 'F', 'I'], 10000),
    'Product': np.random.choice(['Product 1', 'Product 2', 'Product 3'], 10000),
    'Units_Sold': np.random.randint(1, 100, size=(10000)),
    'Dollars_Sold': np.random.randint(100, 1000, size=10000),
    'Customer': np.random.choice(pd.util.testing.rands_array(10, 25, dtype='str'), 10000),
    'Date': np.random.choice(pd.date_range('1/1/2016', '12/31/2018', freq='D'), 10000)})
I have lots of transactional data like that that I perform various Groupby's on. My current solution is to make a master groupby like this:
master = df.groupby(['Customer','Category','Sub-Category','Product',pd.Grouper(key='Date',freq='A')])['Units_Sold'].sum()\
.unstack()
From there, I perform various groupbys using .groupby(level=) function to aggregate the information in the way I'm looking for. I usually make a summary at each level. In addition, I create sub-totals at each level using some variation of the below code.
y = master.groupby(level=[0,1,2]).sum()
y.index = pd.MultiIndex.from_arrays([
    y.index.get_level_values(0),
    y.index.get_level_values(1),
    y.index.get_level_values(2) + ' Total',
    len(y.index)*['']
])

y1 = master.groupby(level=[0,1]).sum()
y1.index = pd.MultiIndex.from_arrays([
    y1.index.get_level_values(0),
    y1.index.get_level_values(1) + ' Total',
    len(y1.index)*[''],
    len(y1.index)*['']
])

y2 = master.groupby(level=[0]).sum()
y2.index = pd.MultiIndex.from_arrays([
    y2.index.get_level_values(0) + ' Total',
    len(y2.index)*[''],
    len(y2.index)*[''],
    len(y2.index)*['']
])

pd.concat([master, y, y1, y2]).sort_index()\
    .assign(Diff = lambda x: x.iloc[:,-1] - x.iloc[:,-2])\
    .assign(Diff_Perc = lambda x: (x.iloc[:,-2] / x.iloc[:,-3]) - 1)\
    .dropna(how='all')
This is just an example - I may perform the same exercise, but perform the groupby in a different order. For example - next I may want to group by 'Category', 'Product', then 'Customer', so I'd have to do:
master.groupby(level=[1,3,0]).sum()
Then I will have to repeat the whole exercise for sub-totals like above. I also frequently change the time period - could be year-ending a specific month, could be year to date, could be by quarter, etc.
From what I've learned so far in programming (which is minimal, clearly!), you should look to write a function any time you repeat code. Obviously I am repeating code over & over again in this example.
Is there a way to construct a function where you can provide the levels to Groupby, along with the time frame, all while creating a function for sub-totaling each level as well?
Thanks in advance for any guidance on this. It is very much appreciated.
For a DRYer solution, consider generalizing your current method into a defined function that filters the original data frame by a date range and runs the aggregations, receiving the groupby levels and date range (the latter being optional) as parameters:
Method
def multiple_agg(mylevels, start_date='2016-01-01', end_date='2018-12-31'):
    filter_df = df[df['Date'].between(start_date, end_date)]
    master = (filter_df.groupby(['Customer', 'Category', 'Sub-Category', 'Product',
                                 pd.Grouper(key='Date', freq='A')])['Units_Sold']
                       .sum()
                       .unstack()
             )

    y = master.groupby(level=mylevels[:-1]).sum()
    y.index = pd.MultiIndex.from_arrays([
        y.index.get_level_values(0),
        y.index.get_level_values(1),
        y.index.get_level_values(2) + ' Total',
        len(y.index)*['']
    ])

    y1 = master.groupby(level=mylevels[0:2]).sum()
    y1.index = pd.MultiIndex.from_arrays([
        y1.index.get_level_values(0),
        y1.index.get_level_values(1) + ' Total',
        len(y1.index)*[''],
        len(y1.index)*['']
    ])

    y2 = master.groupby(level=mylevels[0]).sum()
    y2.index = pd.MultiIndex.from_arrays([
        y2.index.get_level_values(0) + ' Total',
        len(y2.index)*[''],
        len(y2.index)*[''],
        len(y2.index)*['']
    ])

    final_df = (pd.concat([master, y, y1, y2])
                  .sort_index()
                  .assign(Diff = lambda x: x.iloc[:,-1] - x.iloc[:,-2])
                  .assign(Diff_Perc = lambda x: (x.iloc[:,-2] / x.iloc[:,-3]) - 1)
                  .dropna(how='all')
                  .reorder_levels(mylevels)
               )
    return final_df
Aggregation Runs (of different levels and date ranges)
agg_df1 = multiple_agg([0,1,2,3])
agg_df2 = multiple_agg([1,3,0,2], '2016-01-01', '2017-12-31')
agg_df3 = multiple_agg([2,3,1,0], start_date='2017-01-01', end_date='2018-12-31')
Testing (final_df being the OP's pd.concat() output)
# EQUALITY TESTING OF FIRST 10 ROWS
print(final_df.head(10).eq(agg_df1.head(10)))
# Date 2016-12-31 00:00:00 2017-12-31 00:00:00 2018-12-31 00:00:00 Diff Diff_Perc
# Customer Category Sub-Category Product
# 45mhn4PU1O Group A X Product 1 True True True True True
# Product 2 True True True True True
# Product 3 True True True True True
# X Total True True True True True
# Y Product 1 True True True True True
# Product 2 True True True True True
# Product 3 True True True True True
# Y Total True True True True True
# Z Product 1 True True True True True
# Product 2 True True True True True
I think you can do it using sum with the level parameter:
master = df.groupby(['Customer','Category','Sub-Category','Product',pd.Grouper(key='Date',freq='A')])['Units_Sold'].sum()\
.unstack()
s1 = master.sum(level=[0,1,2]).assign(Product='Total').set_index('Product',append=True)
s2 = master.sum(level=[0,1])
# Wanted to use assign method but because of the hyphen in the column name you can't.
# Also use the Z in front for sorting purposes
s2['Sub-Category'] = 'ZTotal'
s2['Product'] = ''
s2 = s2.set_index(['Sub-Category','Product'], append=True)
s3 = master.sum(level=[0])
s3['Category'] = 'Total'
s3['Sub-Category'] = ''
s3['Product'] = ''
s3 = s3.set_index(['Category','Sub-Category','Product'], append=True)
master_new = pd.concat([master,s1,s2,s3]).sort_index()
master_new
Output:
Date 2016-12-31 2017-12-31 2018-12-31
Customer Category Sub-Category Product
30XWmt1jm0 Group A X Product 1 651.0 341.0 453.0
Product 2 267.0 445.0 117.0
Product 3 186.0 280.0 352.0
Total 1104.0 1066.0 922.0
Y Product 1 426.0 417.0 670.0
Product 2 362.0 210.0 380.0
Product 3 232.0 290.0 430.0
Total 1020.0 917.0 1480.0
Z Product 1 196.0 212.0 703.0
Product 2 277.0 340.0 579.0
Product 3 416.0 392.0 259.0
Total 889.0 944.0 1541.0
ZTotal 3013.0 2927.0 3943.0
Group B X Product 1 356.0 230.0 407.0
Product 2 402.0 370.0 590.0
Product 3 262.0 381.0 377.0
Total 1020.0 981.0 1374.0
Y Product 1 575.0 314.0 643.0
Product 2 557.0 375.0 411.0
Product 3 344.0 246.0 280.0
Total 1476.0 935.0 1334.0
Z Product 1 278.0 152.0 392.0
Product 2 149.0 596.0 303.0
Product 3 234.0 505.0 521.0
Total 661.0 1253.0 1216.0
ZTotal 3157.0 3169.0 3924.0
Total 6170.0 6096.0 7867.0
3U2anYOD6o Group A X Product 1 214.0 443.0 195.0
Product 2 170.0 220.0 423.0
Product 3 111.0 469.0 369.0
... ... ... ...
somc22Y2Hi Group B Z Total 906.0 1063.0 680.0
ZTotal 3070.0 3751.0 2736.0
Total 6435.0 7187.0 6474.0
zRZq6MSKuS Group A X Product 1 421.0 182.0 387.0
Product 2 359.0 287.0 331.0
Product 3 232.0 394.0 279.0
Total 1012.0 863.0 997.0
Y Product 1 245.0 366.0 111.0
Product 2 377.0 148.0 239.0
Product 3 372.0 219.0 310.0
Total 994.0 733.0 660.0
Z Product 1 280.0 363.0 354.0
Product 2 384.0 604.0 178.0
Product 3 219.0 462.0 366.0
Total 883.0 1429.0 898.0
ZTotal 2889.0 3025.0 2555.0
Group B X Product 1 466.0 413.0 187.0
Product 2 502.0 370.0 368.0
Product 3 745.0 480.0 318.0
Total 1713.0 1263.0 873.0
Y Product 1 218.0 226.0 385.0
Product 2 123.0 382.0 570.0
Product 3 173.0 572.0 327.0
Total 514.0 1180.0 1282.0
Z Product 1 480.0 317.0 604.0
Product 2 256.0 215.0 572.0
Product 3 463.0 50.0 349.0
Total 1199.0 582.0 1525.0
ZTotal 3426.0 3025.0 3680.0
Total 6315.0 6050.0 6235.0
[675 rows x 3 columns]
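A side note on this second answer: in newer pandas versions the level argument of DataFrame.sum is deprecated, so the sub-total lines are better written with an explicit groupby, which gives the same result. A sketch of the first sub-total, assuming the master frame built above:
# equivalent to master.sum(level=[0, 1, 2]); the level= keyword of sum is deprecated in newer pandas
s1 = (master.groupby(level=[0, 1, 2]).sum()
            .assign(Product='Total')
            .set_index('Product', append=True))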
