How to find ChangeCol1/ChangeCol2 and %ChangeCol1/%ChangeCol2 of DF - python
I have data that looks like this.
Year Quarter Quantity Price TotalRevenue
0 2000 1 23 142 3266
1 2000 2 23 144 3312
2 2000 3 23 147 3381
3 2000 4 23 151 3473
4 2001 1 22 160 3520
5 2001 2 22 183 4026
6 2001 3 22 186 4092
7 2001 4 22 186 4092
8 2002 1 21 212 4452
9 2002 2 19 232 4408
10 2002 3 19 223 4237
I'm trying to figure out how to get the 'MarginalRevenue', where:
MR = (∆TR/∆Q)
MarginalRevenue = (Change in TotalRevenue) / (Change in Quantity)
I found: df.pct_change()
But that seems to get the percentage change for everything.
Also, I'm trying to figure out how to get something related:
ElasticityPrice = (%ΔQuantity/%ΔPrice)
Do you mean something like this?
df['MarginalRevenue'] = df['TotalRevenue'].pct_change() / df['Quantity'].pct_change()
or
df['MarginalRevenue'] = df['TotalRevenue'].diff() / df['Quantity'].diff()
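For the price elasticity you can apply the same pct_change pattern; a minimal sketch, assuming the column names from your frame:

# %ΔQuantity / %ΔPrice, computed row over row (the first row will be NaN)
df['ElasticityPrice'] = df['Quantity'].pct_change() / df['Price'].pct_change()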
Drop certain rows based on quantity of rows with specific values
I am new to data science and am working on a project to analyze sports statistics. I have a dataset of hockey statistics for a group of players over multiple seasons. Players have anywhere from 1 to 12 rows representing their season statistics over however many seasons they've played. Example:

Player Season Pos GP G A P +/- PIM P/GP ... PPG PPP SHG SHP OTG GWG S S% TOI/GP FOW%
0 Nathan MacKinnon 2022 1 65 32 56 88 22 42 1.35 ... 7 27 0 0 1 5 299 10.7 21.07 45.4
1 Nathan MacKinnon 2021 1 48 20 45 65 22 37 1.35 ... 8 25 0 0 0 2 206 9.7 20.37 48.5
2 Nathan MacKinnon 2020 1 69 35 58 93 13 12 1.35 ... 12 31 0 0 2 4 318 11.0 21.22 43.1
3 Nathan MacKinnon 2019 1 82 41 58 99 20 34 1.21 ... 12 37 0 0 1 6 365 11.2 22.08 43.7
4 Nathan MacKinnon 2018 1 74 39 58 97 11 55 1.31 ... 12 32 0 1 3 12 284 13.7 19.90 41.9
5 Nathan MacKinnon 2017 1 82 16 37 53 -14 16 0.65 ... 2 14 2 2 2 4 251 6.4 19.95 50.6
6 Nathan MacKinnon 2016 1 72 21 31 52 -4 20 0.72 ... 7 16 0 1 0 6 245 8.6 18.87 48.4
7 Nathan MacKinnon 2015 1 64 14 24 38 -7 34 0.59 ... 3 7 0 0 0 2 192 7.3 17.05 47.0
8 Nathan MacKinnon 2014 1 82 24 39 63 20 26 0.77 ... 8 17 0 0 0 5 241 10.0 17.35 42.9
9 J.T. Compher 2022 2 70 18 15 33 6 25 0.47 ... 4 6 1 1 0 0 102 17.7 16.32 51.4
10 J.T. Compher 2021 2 48 10 8 18 10 19 0.38 ... 1 2 0 0 0 2 47 21.3 14.22 45.9
11 J.T. Compher 2020 2 67 11 20 31 9 18 0.46 ... 1 5 0 3 1 3 106 10.4 16.75 47.7
12 J.T. Compher 2019 2 66 16 16 32 -8 31 0.48 ... 4 9 3 3 0 3 118 13.6 17.48 49.2
13 J.T. Compher 2018 2 69 13 10 23 -29 20 0.33 ... 4 7 2 2 2 3 131 9.9 16.00 45.1
14 J.T. Compher 2017 2 21 3 2 5 0 4 0.24 ... 1 1 0 0 0 1 30 10.0 14.93 47.6
15 Darren Helm 2022 1 68 7 8 15 -5 14 0.22 ... 0 0 1 2 0 1 93 7.5 10.55 44.2
16 Darren Helm 2021 1 47 3 5 8 -3 10 0.17 ... 0 0 0 0 0 0 83 3.6 14.68 66.7
17 Darren Helm 2020 1 68 9 7 16 -6 37 0.24 ... 0 0 1 2 0 0 102 8.8 13.73 53.6
18 Darren Helm 2019 1 61 7 10 17 -11 20 0.28 ... 0 0 1 4 0 0 107 6.5 14.57 44.4
19 Darren Helm 2018 1 75 13 18 31 3 39 0.41 ... 0 0 2 4 0 0 141 9.2 15.57 44.1

[sample of my dataset][1]
[1]: https://i.stack.imgur.com/7CsUd.png

If any player has played more than 6 seasons, I want to drop the row corresponding to Season 2021. This is because COVID drastically shortened that season and it is causing issues as I work with averages. As you can see from the screenshot, Nathan MacKinnon has played 9 seasons. Across those 9 seasons, except for 2021, he plays in no fewer than 64 games; due to the shortened 2021 season, he only got 48 games. Removing Season 2021 results in an Average Games Played of 73.75, while keeping Season 2021 in the data drops the Average Games Played to 70.89. While not drastic, it compounds into the other metrics as well.

I have been trying this for a little while now, but as I mentioned, I am new to this world and am struggling to figure out how to accomplish it. I don't want to completely drop ALL rows for 2021 across all players, though, as some players only have 1-5 years' worth of data; for those players I need to use as much data as I can, and removing 1 row from a player with only 2 seasons would also negatively skew averages. I would really appreciate some assistance from anyone more experienced than me!
This can be accomplished by using groupby and apply. For example:

edited_players = (
    players
    .groupby("Player")
    .apply(lambda subset: subset if len(subset) <= 6 else subset.query("Season != 2021"))
)

Round brackets for formatting purposes. The combination of groupby and apply basically feeds a grouped subset of your dataframe to a function. So, first all the rows of Nathan MacKinnon will be used, then rows for J.T. Compher, then Darren Helm rows, etc. The function used is an anonymous/lambda function which operates under the following logic: "if the dataframe subset that I receive has 6 or fewer rows, I'll return the subset unedited. Otherwise, I will filter out rows within that subset which have the value 2021 in the Season column".
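An equivalent, apply-free sketch (assuming the same Player and Season column names) builds a boolean mask from a per-player row count and keeps a row unless its player has more than 6 rows and the row is the 2021 season:

# count how many rows (seasons) each player has
season_counts = players.groupby("Player")["Season"].transform("size")
# keep everything except 2021 rows belonging to players with more than 6 seasons
keep = (season_counts <= 6) | (players["Season"] != 2021)
edited_players = players[keep]

This avoids apply entirely, which usually scales better on larger frames.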
How to join/merge and sum columns with the same name
How can I merge and sum the columns with the same name? So the output should be 1 column named Canada as a result of the sum of the 4 columns named Canada.

Country/Region Brazil Canada Canada Canada Canada
Week 1 0 3 0 0 0
Week 2 0 17 0 0 0
Week 3 0 21 0 0 0
Week 4 0 21 0 0 0
Week 5 0 23 0 0 0
Week 6 0 80 0 5 0
Week 7 0 194 0 20 0
Week 8 12 702 3 199 20
Week 9 182 2679 16 2395 260
Week 10 737 8711 80 17928 892
Week 11 1674 25497 153 48195 1597
Week 12 2923 46392 175 85563 2003
Week 13 4516 76095 182 122431 2180
Week 14 6002 105386 183 163539 2431
Week 15 6751 127713 189 210409 2995
Week 16 7081 147716 189 258188 3845
From its current state, this should give the outcome you're looking for:

df = df.set_index('Country/Region')  # optional
df.groupby(df.columns, axis=1).sum()  # Stolen from Scott Boston as it's a superior method.

Output:

                Brazil  Canada
Country/Region
Week 1               0       3
Week 2               0      17
Week 3               0      21
Week 4               0      21
Week 5               0      23
Week 6               0      85
Week 7               0     214
Week 8              12     924
Week 9             182    5350
Week 10            737   27611
Week 11           1674   75442
Week 12           2923  134133
Week 13           4516  200888
Week 14           6002  271539
Week 15           6751  341306
Week 16           7081  409938

I found your dataset interesting; here's how I would clean it up from step 1:

import numpy as np
import pandas as pd

df = pd.read_csv('file.csv')
df = df.set_index(['Province/State', 'Country/Region', 'Lat', 'Long']).stack().reset_index()
df.columns = ['Province/State', 'Country/Region', 'Lat', 'Long', 'date', 'value']
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date')
df = df.pivot_table(index=df.index, columns='Country/Region', values='value', aggfunc=np.sum)
print(df)

Output:

Country/Region  Afghanistan  Albania  Algeria  Andorra  Angola  ...  West Bank and Gaza  Western Sahara  Yemen  Zambia  Zimbabwe
date
2020-01-22                0        0        0        0       0  ...                   0               0      0       0         0
2020-01-23                0        0        0        0       0  ...                   0               0      0       0         0
2020-01-24                0        0        0        0       0  ...                   0               0      0       0         0
2020-01-25                0        0        0        0       0  ...                   0               0      0       0         0
2020-01-26                0        0        0        0       0  ...                   0               0      0       0         0
...                     ...      ...      ...      ...     ...  ...                 ...             ...    ...     ...       ...
2020-07-30            36542     5197    29831      922    1109  ...               11548              10   1726    5555      3092
2020-07-31            36675     5276    30394      925    1148  ...               11837              10   1728    5963      3169
2020-08-01            36710     5396    30950      925    1164  ...               12160              10   1730    6228      3659
2020-08-02            36710     5519    31465      925    1199  ...               12297              10   1734    6347      3921
2020-08-03            36747     5620    31972      937    1280  ...               12541              10   1734    6580      4075

If you now want to do weekly aggregations, it's as simple as:

print(df.resample('w').sum())

Output:

Country/Region  Afghanistan  Albania  Algeria  Andorra  Angola  ...  West Bank and Gaza  Western Sahara  Yemen  Zambia  Zimbabwe
date
2020-01-26                0        0        0        0       0  ...                   0               0      0       0         0
2020-02-02                0        0        0        0       0  ...                   0               0      0       0         0
2020-02-09                0        0        0        0       0  ...                   0               0      0       0         0
2020-02-16                0        0        0        0       0  ...                   0               0      0       0         0
2020-02-23                0        0        0        0       0  ...                   0               0      0       0         0
2020-03-01                7        0        6        0       0  ...                   0               0      0       0         0
2020-03-08               10        0       85        7       0  ...                  43               0      0       0         0
2020-03-15               57      160      195        7       0  ...                 209               0      0       0         0
2020-03-22              175      464      705      409       5  ...                 309               0      0      11         7
2020-03-29              632     1142     2537     1618      29  ...                 559               0      0     113        31
2020-04-05             1783     2000     6875     2970      62  ...                1178               4      0     262        59
2020-04-12             3401     2864    11629     4057     128  ...                1847              30      3     279        84
2020-04-19             5838     3603    16062     4764     143  ...                2081              42      7     356       154
2020-04-26             8918     4606    21211     5087     174  ...                2353              42      7     541       200
2020-05-03            15149     5391    27943     5214     208  ...                2432              42     41     738       244
2020-05-10            25286     5871    36315     5265     274  ...                2607              42    203    1260       241
2020-05-17            39634     6321    45122     5317     327  ...                2632              42    632    3894       274
2020-05-24            61342     6798    54185     5332     402  ...                2869              45   1321    5991       354
2020-05-31            91885     7517    62849     5344     536  ...                3073              63   1932    7125       894
2020-06-07           126442     8378    68842     5868     609  ...                3221              63   3060    7623      1694
2020-06-14           159822     9689    74147     5967     827  ...                3396              63   4236    8836      2335
2020-06-21           191378    12463    79737     5981    1142  ...                4466              63   6322    9905      3089
2020-06-28           210487    15349    87615     5985    1522  ...               10242              70   7360   10512      3813
2020-07-05           224560    18707   102918     5985    2186  ...               21897              70   8450   11322      4426
2020-07-12           237087    22399   124588     5985    2940  ...               36949              70   9489   13002      6200
2020-07-19           245264    26845   149611     6098    4279  ...               52323              70  10855   16350      9058
2020-07-26           250970    31255   178605     6237    5919  ...               68154              70  11571   26749     14933
2020-08-02           255739    36370   208457     6429    7648  ...               80685              70  12023   38896     22241
2020-08-09            36747     5620    31972      937    1280  ...               12541              10   1734    6580      4075
Try:

import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randint(0, 100, (20, 5)), columns=[*'ZAABC'])
df.groupby(df.columns, axis=1, sort=False).sum()

Output:

     Z    A   B   C
0   44  111  67  67
1    9  104  36  87
2   70  176  12  58
3   65  126  46  88
4   81   62  77  72
5    9  100  69  79
6   47  146  99  88
7   49   48  19  14
8   39   97   9  57
9   32  105  23  35
10  75   83  34   0
11   0   89   5  38
12  17   83  42  58
13  31   66  41  57
14  35   57  82  91
15   0  113  53  12
16  42  159  68   6
17  68   50  76  52
18  78   35  99  58
19  23   92  85  48
You can try a transpose and groupby, e.g. something similar to the below:

df_T = df.transpose()
df_T.groupby(df_T.index).sum()['Canada']
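A round-trip variant of the same idea (a sketch, assuming the week labels are already the index) keeps every country and restores the original orientation by transposing back:

# group the transposed frame by its index (the original column names), sum, and flip back
summed = df.T.groupby(level=0).sum().T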
Here's a way to do it:

df.columns = [(col + str(i)) if col.startswith('Canada') else col for i, col in enumerate(df.columns)]
df = df.assign(Canada=df.filter(like='Canada').sum(axis=1)).drop(columns=[x for x in df.columns if x.startswith('Canada') and x != 'Canada'])

First we rename the columns starting with Canada by appending their integer position, which ensures they are no longer duplicates. Then we use sum() to add across columns like Canada, put the result in a new column named Canada, and drop the columns that were originally named Canada.

Full test code is:

import pandas as pd

df = pd.DataFrame(
    columns=[x.strip() for x in 'Brazil Canada Canada Canada Canada'.split()],
    index=['Week ' + str(i) for i in range(1, 17)],
    data=[[i] * 5 for i in range(1, 17)])
df.columns.names = ['Country/Region']
print(df)

df.columns = [(col + str(i)) if col.startswith('Canada') else col for i, col in enumerate(df.columns)]
df = df.assign(Canada=df.filter(like='Canada').sum(axis=1)).drop(columns=[x for x in df.columns if x.startswith('Canada') and x != 'Canada'])
print(df)

Output:

Country/Region  Brazil  Canada  Canada  Canada  Canada
Week 1               1       1       1       1       1
Week 2               2       2       2       2       2
Week 3               3       3       3       3       3
Week 4               4       4       4       4       4
Week 5               5       5       5       5       5
Week 6               6       6       6       6       6
Week 7               7       7       7       7       7
Week 8               8       8       8       8       8
Week 9               9       9       9       9       9
Week 10             10      10      10      10      10
Week 11             11      11      11      11      11
Week 12             12      12      12      12      12
Week 13             13      13      13      13      13
Week 14             14      14      14      14      14
Week 15             15      15      15      15      15
Week 16             16      16      16      16      16

         Brazil  Canada
Week 1        1       4
Week 2        2       8
Week 3        3      12
Week 4        4      16
Week 5        5      20
Week 6        6      24
Week 7        7      28
Week 8        8      32
Week 9        9      36
Week 10      10      40
Week 11      11      44
Week 12      12      48
Week 13      13      52
Week 14      14      56
Week 15      15      60
Week 16      16      64
How to create multiple triangles based on given number of simulations?
Below is my code:

triangle = cl.load_sample('genins')

# Use bootstrap sampler to get resampled triangles
bootstrapdataframe = cl.BootstrapODPSample(n_sims=4, random_state=42).fit(triangle).resampled_triangles_

# converting to dataframe
resampledtriangledf = bootstrapdataframe.to_frame()
print(resampledtriangledf)

In the above code I set n_sims (the number of simulations) to 4, so it generates the dataframe below:

0 2001 12 254,926
0 2001 24 535,877
0 2001 36 1,355,613
0 2001 48 2,034,557
0 2001 60 2,311,789
0 2001 72 2,539,807
0 2001 84 2,724,773
0 2001 96 3,187,095
0 2001 108 3,498,646
0 2001 120 3,586,037
0 2002 12 542,369
0 2002 24 1,016,927
0 2002 36 2,201,329
0 2002 48 2,923,381
0 2002 60 3,711,305
0 2002 72 3,914,829
0 2002 84 4,385,757
0 2002 96 4,596,072
0 2002 108 5,047,861
0 2003 12 235,361
0 2003 24 960,355
0 2003 36 1,661,972
0 2003 48 2,643,370
0 2003 60 3,372,684
0 2003 72 3,642,605
0 2003 84 4,160,583
0 2003 96 4,480,332
0 2004 12 764,553
0 2004 24 1,703,557
0 2004 36 2,498,418
0 2004 48 3,198,358
0 2004 60 3,524,562
0 2004 72 3,884,971
0 2004 84 4,268,241
0 2005 12 381,670
0 2005 24 1,124,054
0 2005 36 2,026,434
0 2005 48 2,863,902
0 2005 60 3,039,322
0 2005 72 3,288,253
0 2006 12 320,332
0 2006 24 1,022,323
0 2006 36 1,830,842
0 2006 48 2,676,710
0 2006 60 3,375,172
0 2007 12 330,361
0 2007 24 1,463,348
0 2007 36 2,771,839
0 2007 48 4,003,745
0 2008 12 282,143
0 2008 24 1,782,267
0 2008 36 2,898,699
0 2009 12 362,726
0 2009 24 1,277,750
0 2010 12 321,247
1 2001 12 219,021
1 2001 24 755,975
1 2001 36 1,360,298
1 2001 48 2,062,947
1 2001 60 2,356,983
1 2001 72 2,781,187
1 2001 84 2,987,837
1 2001 96 3,118,952
1 2001 108 3,307,522
1 2001 120 3,455,107
1 2002 12 302,932
1 2002 24 1,022,459
1 2002 36 1,634,938
1 2002 48 2,538,708
1 2002 60 3,005,695
1 2002 72 3,274,719
1 2002 84 3,356,499
1 2002 96 3,595,361
1 2002 108 4,100,065
1 2003 12 489,934
1 2003 24 1,233,438
1 2003 36 2,471,849
1 2003 48 3,672,629
1 2003 60 4,157,489
1 2003 72 4,498,470
1 2003 84 4,587,579
1 2003 96 4,816,232
1 2004 12 518,680
1 2004 24 1,209,705
1 2004 36 2,019,757
1 2004 48 2,997,820
1 2004 60 3,630,442
1 2004 72 3,881,093
1 2004 84 4,080,322
1 2005 12 453,963
1 2005 24 1,458,504
1 2005 36 2,036,506
1 2005 48 2,846,464
1 2005 60 3,280,124
1 2005 72 3,544,597
1 2006 12 369,755
1 2006 24 1,209,117
1 2006 36 1,973,136
1 2006 48 3,034,294
1 2006 60 3,537,784
1 2007 12 477,788
1 2007 24 1,524,537
1 2007 36 2,170,391
1 2007 48 3,355,093
1 2008 12 250,690
1 2008 24 1,546,986
1 2008 36 2,996,737
1 2009 12 271,270
1 2009 24 1,446,353
1 2010 12 510,114
2 2001 12 170,866
2 2001 24 797,338
2 2001 36 1,663,610
2 2001 48 2,293,697
2 2001 60 2,607,067
2 2001 72 2,979,479
2 2001 84 3,127,308
2 2001 96 3,285,338
2 2001 108 3,574,272
2 2001 120 3,630,610
2 2002 12 259,060
2 2002 24 1,011,092
2 2002 36 1,851,504
2 2002 48 2,705,313
2 2002 60 3,195,774
2 2002 72 3,766,008
2 2002 84 3,944,417
2 2002 96 4,234,043
2 2002 108 4,763,664
2 2003 12 239,981
2 2003 24 983,484
2 2003 36 1,929,785
2 2003 48 2,497,929
2 2003 60 2,972,887
2 2003 72 3,313,868
2 2003 84 3,727,432
2 2003 96 4,024,122
2 2004 12 77,522
2 2004 24 729,401
2 2004 36 1,473,914
2 2004 48 2,376,313
2 2004 60 2,999,197
2 2004 72 3,372,020
2 2004 84 3,887,883
2 2005 12 321,598
2 2005 24 1,132,502
2 2005 36 1,710,504
2 2005 48 2,438,620
2 2005 60 2,801,957
2 2005 72 3,182,466
2 2006 12 255,407
2 2006 24 1,275,141
2 2006 36 2,083,421
2 2006 48 3,144,579
2 2006 60 3,891,772
2 2007 12 338,120
2 2007 24 1,275,697
2 2007 36 2,238,715
2 2007 48 3,615,323
2 2008 12 310,214
2 2008 24 1,237,156
2 2008 36 2,563,326
2 2009 12 271,093
2 2009 24 1,523,131
2 2010 12 430,591
3 2001 12 330,887
3 2001 24 831,193
3 2001 36 1,601,374
3 2001 48 2,188,879
3 2001 60 2,662,773
3 2001 72 3,086,976
3 2001 84 3,332,247
3 2001 96 3,317,279
3 2001 108 3,576,659
3 2001 120 3,613,563
3 2002 12 358,263
3 2002 24 1,139,259
3 2002 36 2,236,375
3 2002 48 3,163,464
3 2002 60 3,715,130
3 2002 72 4,295,638
3 2002 84 4,502,105
3 2002 96 4,769,139
3 2002 108 5,323,304
3 2003 12 489,934
3 2003 24 1,570,352
3 2003 36 3,123,215
3 2003 48 4,189,299
3 2003 60 4,819,070
3 2003 72 5,306,689
3 2003 84 5,560,371
3 2003 96 5,827,003
3 2004 12 419,727
3 2004 24 1,308,884
3 2004 36 2,118,936
3 2004 48 2,906,732
3 2004 60 3,561,577
3 2004 72 3,934,400
3 2004 84 4,010,511
3 2005 12 389,217
3 2005 24 1,173,226
3 2005 36 1,794,216
3 2005 48 2,528,910
3 2005 60 3,474,035
3 2005 72 3,908,999
3 2006 12 291,940
3 2006 24 1,136,674
3 2006 36 1,915,614
3 2006 48 2,693,930
3 2006 60 3,375,601
3 2007 12 506,055
3 2007 24 1,684,660
3 2007 36 2,678,739
3 2007 48 3,545,156
3 2008 12 282,143
3 2008 24 1,536,490
3 2008 36 2,458,789
3 2009 12 271,093
3 2009 24 1,199,897
3 2010 12 266,359

Using the above dataframe I have to create 4 triangles based on the Total column. For example:

Row Labels 12 24 36 48 60 72 84 96 108 120 Grand Total
2001 254,926 535,877 1,355,613 2,034,557 2,311,789 2,539,807 2,724,773 3,187,095 3,498,646 3,586,037 22,029,119
2002 542,369 1,016,927 2,201,329 2,923,381 3,711,305 3,914,829 4,385,757 4,596,072 5,047,861 28,339,832
2003 235,361 960,355 1,661,972 2,643,370 3,372,684 3,642,605 4,160,583 4,480,332 21,157,261
2004 764,553 1,703,557 2,498,418 3,198,358 3,524,562 3,884,971 4,268,241 19,842,659
2005 381,670 1,124,054 2,026,434 2,863,902 3,039,322 3,288,253 12,723,635
2006 320,332 1,022,323 1,830,842 2,676,710 3,375,172 9,225,377
2007 330,361 1,463,348 2,771,839 4,003,745 8,569,294
2008 282,143 1,782,267 2,898,699 4,963,110
2009 362,726 1,277,750 1,640,475
2010 321,247 321,247
Grand Total 3,795,687 10,886,456 17,245,147 20,344,022 19,334,833 17,270,466 15,539,355 12,263,499 8,546,507 3,586,037 128,812,009

Like this I need 4 triangles (4 being the number of simulations) from the first dataframe. If the user passes n_sims=900, then it creates 900 sets of totals and from those we have to create 900 triangles. Above I displayed only the triangle for simulation 0, but I need the triangles for simulations 1, 2 and 3 as well.
Try:

df['sample_size'] = pd.to_numeric(df['sample_size'].str.replace(',', ''))
df.pivot_table('sample_size', 'year', 'no', aggfunc='first')\
  .pipe(lambda x: pd.concat([x, x.sum().to_frame('Grand Total').T]))

Output:

no 12 24 36 48 60 72 84 96 108 120
2001 254926.0 535877.0 1355613.0 2034557.0 2311789.0 2539807.0 2724773.0 3187095.0 3498646.0 3586037.0
2002 542369.0 1016927.0 2201329.0 2923381.0 3711305.0 3914829.0 4385757.0 4596072.0 5047861.0 NaN
2003 235361.0 960355.0 1661972.0 2643370.0 3372684.0 3642605.0 4160583.0 4480332.0 NaN NaN
2004 764553.0 1703557.0 2498418.0 3198358.0 3524562.0 3884971.0 4268241.0 NaN NaN NaN
2005 381670.0 1124054.0 2026434.0 2863902.0 3039322.0 3288253.0 NaN NaN NaN NaN
2006 320332.0 1022323.0 1830842.0 2676710.0 3375172.0 NaN NaN NaN NaN NaN
2007 330361.0 1463348.0 2771839.0 4003745.0 NaN NaN NaN NaN NaN NaN
2008 282143.0 1782267.0 2898699.0 NaN NaN NaN NaN NaN NaN NaN
2009 362726.0 1277750.0 NaN NaN NaN NaN NaN NaN NaN NaN
2010 321247.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
Grand Total 3795688.0 10886458.0 17245146.0 20344023.0 19334834.0 17270465.0 15539354.0 12263499.0 8546507.0 3586037.0
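If you need one triangle per simulation rather than a single pivot, a minimal sketch is to loop over a groupby on the simulation column and pivot each group. The column names here ('sim', 'year', 'no', 'sample_size') are assumptions following the answer above; adjust them to whatever your to_frame() output actually contains.

import pandas as pd

triangles = {}
for sim, grp in df.groupby('sim'):
    # one development triangle per simulation
    tri = grp.pivot_table('sample_size', 'year', 'no', aggfunc='first')
    tri['Grand Total'] = tri.sum(axis=1)                          # row totals
    tri = pd.concat([tri, tri.sum().to_frame('Grand Total').T])   # column totals
    triangles[sim] = tri

# triangles[0], triangles[1], ... then hold one triangle per simulation,
# so n_sims=900 would simply give 900 entries in the dict.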
Transform each group in a DataFrame
I have the following DataFrame:

id x y timestamp sensorTime
1 32 30 1031 2002
1 4 105 1035 2005
1 8 110 1050 2006
2 18 10 1500 3600
2 40 20 1550 3610
2 80 10 1450 3620
....

import pandas as pd
import numpy as np

df = pd.DataFrame(np.array([[1, 1, 1, 2, 2, 2],
                            [32, 4, 8, 18, 40, 80],
                            [30, 105, 110, 10, 20, 10],
                            [1031, 1035, 1050, 1500, 1550, 1450],
                            [2002, 2005, 2006, 3600, 3610, 3620]])).T
df.columns = ['id', 'x', 'y', 'timestamp', 'sensorTime']

For each group grouped by id, I would like to add the differences of the sensorTime to the first value of timestamp. Something like the following:

start = df.iloc[0]['timestamp']
df['sensorTime'] -= df.iloc[0]['sensorTime']
df['sensorTime'] += start

But I would like to do this for each id group separately. The resulting DataFrame should be:

id x y timestamp sensorTime
1 32 30 1031 1031
1 4 105 1035 1034
1 8 110 1050 1035
2 18 10 1500 1500
2 40 20 1550 1510
2 80 10 1450 1520
....

How can this operation be done per group?
df

   id   x    y  timestamp  sensorTime
0   1  32   30       1031        2002
1   1   4  105       1035        2005
2   1   8  110       1050        2006
3   2  18   10       1500        3600
4   2  40   20       1550        3610
5   2  80   10       1450        3620

You can group by id and then pass both timestamp and sensorTime. Then you can use diff to get the difference of sensorTime. The first value would be NaN, and you can replace it with the first value of timestamp of that group. Then you can simply do cumsum to get the desired output.

def func(x):
    diff = x['sensorTime'].diff()
    diff.iloc[0] = x['timestamp'].iloc[0]
    return (diff.cumsum().to_frame())

df['sensorTime'] = df.groupby('id')[['timestamp', 'sensorTime']].apply(func)

df

   id   x    y  timestamp  sensorTime
0   1  32   30       1031      1031.0
1   1   4  105       1035      1034.0
2   1   8  110       1050      1035.0
3   2  18   10       1500      1500.0
4   2  40   20       1550      1510.0
5   2  80   10       1450      1520.0
You could run a groupby twice: first, to get the difference in sensorTime; the second time, to do the cumulative sum:

box = df.groupby("id").sensorTime.transform("diff")
df.assign(
    new_sensorTime=np.where(box.isna(), df.timestamp, box),
    new=lambda x: x.groupby("id")["new_sensorTime"].cumsum(),
).drop(columns="new_sensorTime")

   id   x    y  timestamp  sensorTime     new
0   1  32   30       1031        2002  1031.0
1   1   4  105       1035        2005  1034.0
2   1   8  110       1050        2006  1035.0
3   2  18   10       1500        3600  1500.0
4   2  40   20       1550        3610  1510.0
5   2  80   10       1450        3620  1520.0
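Another equivalent sketch, assuming the rows within each id group are already in time order: subtract each group's first sensorTime and add its first timestamp using transform, which avoids the diff/cumsum round trip entirely:

# shift each group's sensorTime so it starts at that group's first timestamp
first_sensor = df.groupby("id")["sensorTime"].transform("first")
first_ts = df.groupby("id")["timestamp"].transform("first")
df["sensorTime"] = df["sensorTime"] - first_sensor + first_ts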
Python pandas groupby with cumsum and percentage
Given the following dataframe df:

    app  platform                                  uuid  minutes
0     1         0  a696ccf9-22cb-428b-adee-95c9a97a4581       67
1     2         0  8e17a2eb-f0ee-49ae-b8c2-c9f9926aa56d        1
2     2         1  40AD6CD1-4A7B-48DD-8815-1829C093A95C       13
3     1         0  26c1022a-7a8e-42a2-b7cc-bea6bffa7a6f        2
4     2         0  34271596-eebb-4423-b890-dc3761ed37ca        8
5     3         1  C57D0F52-B565-4322-85D2-C2798F7CA6FF       16
6     2         0      245501ec2e39cb782bab1fb02d7813b7        1
7     3         1  DE6E4714-5A3C-4C80-BD81-EAACB2364DF0       30
8     3         0  f88eb774-fdf3-4d1d-a91d-0b4ab95cf36e       10
9     2         0  9c08c860-7a6d-4810-a5c3-f3af2a3fcf66      470
10    3         1      19fdaedfd0dbdaf6a7a6b49619f11a19        3
11    3         1  AAF1CFF7-4564-4C79-B2D8-F0AAF9C9971B       58
12    2         0  4eb1024b-c293-42a4-95a2-31b20c3b524b       24
13    3         1  8E0B0BE3-8553-4F38-9837-6C907E01F84C        7
14    3         1  E8B2849C-F050-4DCD-B311-5D57015466AE      465
15    2         0  ec7fedb6-b118-424a-babe-b8ffad579685      266
16    1         0  7e302dcb-ceaf-406c-a9e5-66933d921064      184
17    2         0      f786528ded200c9f553dd3a5e9e9bb2d       10
18    3         1  1E291633-AF27-4DFB-8DA4-4A5B63F175CF       13
19    2         0  953a525c-97e0-4c2f-90e0-dfebde3ec20d     2408

I'll group it:

y = df.groupby(['app','platform','uuid']).sum().reset_index().sort(['app','platform','minutes'], ascending=[1,1,0]).set_index(['app','platform','uuid'])

                                                   minutes
app platform uuid
1   0        7e302dcb-ceaf-406c-a9e5-66933d921064      184
             a696ccf9-22cb-428b-adee-95c9a97a4581       67
             26c1022a-7a8e-42a2-b7cc-bea6bffa7a6f        2
2   0        953a525c-97e0-4c2f-90e0-dfebde3ec20d     2408
             9c08c860-7a6d-4810-a5c3-f3af2a3fcf66      470
             ec7fedb6-b118-424a-babe-b8ffad579685      266
             4eb1024b-c293-42a4-95a2-31b20c3b524b       24
             f786528ded200c9f553dd3a5e9e9bb2d           10
             34271596-eebb-4423-b890-dc3761ed37ca        8
             245501ec2e39cb782bab1fb02d7813b7            1
             8e17a2eb-f0ee-49ae-b8c2-c9f9926aa56d        1
    1        40AD6CD1-4A7B-48DD-8815-1829C093A95C       13
3   0        f88eb774-fdf3-4d1d-a91d-0b4ab95cf36e       10
    1        E8B2849C-F050-4DCD-B311-5D57015466AE      465
             AAF1CFF7-4564-4C79-B2D8-F0AAF9C9971B       58
             DE6E4714-5A3C-4C80-BD81-EAACB2364DF0       30
             C57D0F52-B565-4322-85D2-C2798F7CA6FF       16
             1E291633-AF27-4DFB-8DA4-4A5B63F175CF       13
             8E0B0BE3-8553-4F38-9837-6C907E01F84C        7
             19fdaedfd0dbdaf6a7a6b49619f11a19            3

So that I get its minutes per uuid in descending order. Now, I will sum the cumulative minutes per app/platform/uuid:

y.groupby(level=[0,1]).cumsum()

app platform uuid
1   0        7e302dcb-ceaf-406c-a9e5-66933d921064      184
             a696ccf9-22cb-428b-adee-95c9a97a4581      251
             26c1022a-7a8e-42a2-b7cc-bea6bffa7a6f      253
2   0        953a525c-97e0-4c2f-90e0-dfebde3ec20d     2408
             9c08c860-7a6d-4810-a5c3-f3af2a3fcf66     2878
             ec7fedb6-b118-424a-babe-b8ffad579685     3144
             4eb1024b-c293-42a4-95a2-31b20c3b524b     3168
             f786528ded200c9f553dd3a5e9e9bb2d         3178
             34271596-eebb-4423-b890-dc3761ed37ca     3186
             245501ec2e39cb782bab1fb02d7813b7         3187
             8e17a2eb-f0ee-49ae-b8c2-c9f9926aa56d     3188
    1        40AD6CD1-4A7B-48DD-8815-1829C093A95C       13
3   0        f88eb774-fdf3-4d1d-a91d-0b4ab95cf36e       10
    1        E8B2849C-F050-4DCD-B311-5D57015466AE      465
             AAF1CFF7-4564-4C79-B2D8-F0AAF9C9971B      523
             DE6E4714-5A3C-4C80-BD81-EAACB2364DF0      553
             C57D0F52-B565-4322-85D2-C2798F7CA6FF      569
             1E291633-AF27-4DFB-8DA4-4A5B63F175CF      582
             8E0B0BE3-8553-4F38-9837-6C907E01F84C      589
             19fdaedfd0dbdaf6a7a6b49619f11a19          592

My question is: how can I get the percentage against the total cumulative sum, per group, i.e., something like this:

app platform uuid
1   0        7e302dcb-ceaf-406c-a9e5-66933d921064      184    0.26
             a696ccf9-22cb-428b-adee-95c9a97a4581      251    0.36
             26c1022a-7a8e-42a2-b7cc-bea6bffa7a6f      253    0.36
... ... ...
It's not clear how you came up with 0.26, 0.36 in your desired output, but assuming those are just dummy numbers, to get a running % of total for each group, you could do this:

y['cumsum'] = y.groupby(level=[0,1]).cumsum()
y['running_pct'] = y.groupby(level=[0,1])['cumsum'].transform(lambda x: x / x.iloc[-1])

Should give the right output.

In [398]: y['running_pct'].head()
Out[398]:
app  platform  uuid
1    0         7e302dcb-ceaf-406c-a9e5-66933d921064    0.727273
               a696ccf9-22cb-428b-adee-95c9a97a4581    0.992095
               26c1022a-7a8e-42a2-b7cc-bea6bffa7a6f    1.000000
2    0         953a525c-97e0-4c2f-90e0-dfebde3ec20d    0.755332
               9c08c860-7a6d-4810-a5c3-f3af2a3fcf66    0.902760
Name: running_pct, dtype: float64

EDIT: Per the comments, if you're looking to wring out a little more performance, this will be faster as of version 0.14.1:

y['cumsum'] = y.groupby(level=[0,1])['minutes'].transform('cumsum')
y['running_pct'] = y['cumsum'] / y.groupby(level=[0,1])['minutes'].transform('sum')

And as @Jeff notes, in 0.15.0 this may be faster yet:

y['running_pct'] = y['cumsum'] / y.groupby(level=[0,1])['cumsum'].transform('last')
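As a quick sanity check on the first group (app 1, platform 0): the sorted minutes are 184, 67 and 2, so the cumulative sums are 184, 251 and 253, and the running percentages are 184/253 ≈ 0.727, 251/253 ≈ 0.992 and 253/253 = 1.0, which matches the head() output above.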