Transform each group in a DataFrame - python

I have the following DataFrame:
id x y timestamp sensorTime
1 32 30 1031 2002
1 4 105 1035 2005
1 8 110 1050 2006
2 18 10 1500 3600
2 40 20 1550 3610
2 80 10 1450 3620
....
import pandas as pd
import numpy as np
df = pd.DataFrame(np.array([[1,1,1,2,2,2], [32,4,8,18,40,80], [30,105,110,10,20,10], [1031,1035,1050,1500,1550,1450], [2002, 2005, 2006, 3600, 3610, 3620]])).T
df.columns = ['id', 'x', 'y', 'timestamp', 'sensorTime']
For each group grouped by id I would like to add the differences of the sensorTime to the first value of timestamp. Something like the following:
start = df.iloc[0]['timestamp']
df['sensorTime'] -= df.iloc[0]['sensorTime']
df['sensorTime'] += start
But I would like to do this for each id group separately.
The resulting DataFrame should be:
id x y timestamp sensorTime
1 32 30 1031 1031
1 4 105 1035 1034
1 8 110 1050 1035
2 18 10 1500 1500
2 40 20 1550 1510
2 80 10 1450 1520
....
How can this operation be done per group?

df
id x y timestamp sensorTime
0 1 32 30 1031 2002
1 1 4 105 1035 2005
2 1 8 110 1050 2006
3 2 18 10 1500 3600
4 2 40 20 1550 3610
5 2 80 10 1450 3620
You can group by id and then pass both timestamp and sensorTime. Then you can use diff to get the difference of sensorTime. The first value would be NaN and you can replace it with the first value of timestamp of that group. Then you can simply do cumsum to get the desired output.
def func(x):
    diff = x['sensorTime'].diff()
    diff.iloc[0] = x['timestamp'].iloc[0]
    return diff.cumsum().to_frame()
# group_keys=False keeps the original row index so the result aligns with df
df['sensorTime'] = df.groupby('id', group_keys=False)[['timestamp', 'sensorTime']].apply(func)
df
id x y timestamp sensorTime
0 1 32 30 1031 1031.0
1 1 4 105 1035 1034.0
2 1 8 110 1050 1035.0
3 2 18 10 1500 1500.0
4 2 40 20 1550 1510.0
5 2 80 10 1450 1520.0

You could run a groupby twice, first, to get the difference in sensorTime, the second time to do the cumulative sum:
box = df.groupby("id").sensorTime.transform("diff")
df.assign(
    new_sensorTime=np.where(box.isna(), df.timestamp, box),
    new=lambda x: x.groupby("id")["new_sensorTime"].cumsum(),
).drop(columns="new_sensorTime")
id x y timestamp sensorTime new
0 1 32 30 1031 2002 1031.0
1 1 4 105 1035 2005 1034.0
2 1 8 110 1050 2006 1035.0
3 2 18 10 1500 3600 1500.0
4 2 40 20 1550 3610 1510.0
5 2 80 10 1450 3620 1520.0
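Another option, sketched here rather than taken from the answers above, is to apply the asker's own formula per group with transform("first"), which avoids both apply and the cumulative sum. The DataFrame is rebuilt from the question's sample values; the names grouped, first_sensor and first_stamp are only illustrative.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 2],
    "x": [32, 4, 8, 18, 40, 80],
    "y": [30, 105, 110, 10, 20, 10],
    "timestamp": [1031, 1035, 1050, 1500, 1550, 1450],
    "sensorTime": [2002, 2005, 2006, 3600, 3610, 3620],
})

grouped = df.groupby("id")
# transform("first") broadcasts each group's first value to every row of that group
first_sensor = grouped["sensorTime"].transform("first")
first_stamp = grouped["timestamp"].transform("first")

# Shift sensorTime so each group starts at its own first timestamp
df["sensorTime"] = df["sensorTime"] - first_sensor + first_stamp
print(df)
```

Because everything stays integer arithmetic, the column should keep an integer dtype instead of turning into floats.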

Related

How to join/merge and sum columns with the same name

How can I merge and sum the columns with the same name?
So the output should be 1 Column named Canada as a result of the sum of the 4 columns named Canada.
Country/Region Brazil Canada Canada Canada Canada
Week 1 0 3 0 0 0
Week 2 0 17 0 0 0
Week 3 0 21 0 0 0
Week 4 0 21 0 0 0
Week 5 0 23 0 0 0
Week 6 0 80 0 5 0
Week 7 0 194 0 20 0
Week 8 12 702 3 199 20
Week 9 182 2679 16 2395 260
Week 10 737 8711 80 17928 892
Week 11 1674 25497 153 48195 1597
Week 12 2923 46392 175 85563 2003
Week 13 4516 76095 182 122431 2180
Week 14 6002 105386 183 163539 2431
Week 15 6751 127713 189 210409 2995
Week 16 7081 147716 189 258188 3845
From its current state, this should give the outcome you're looking for:
df = df.set_index('Country/Region') # optional
df.groupby(df.columns, axis=1).sum() # Stolen from Scott Boston as it's a superior method.
Output:
index Brazil Canada
Country/Region
Week 1 0 3
Week 2 0 17
Week 3 0 21
Week 4 0 21
Week 5 0 23
Week 6 0 85
Week 7 0 214
Week 8 12 924
Week 9 182 5350
Week 10 737 27611
Week 11 1674 75442
Week 12 2923 134133
Week 13 4516 200888
Week 14 6002 271539
Week 15 6751 341306
Week 16 7081 409938
I found your dataset interesting, here's how I would clean it up from step 1:
df = pd.read_csv('file.csv')
df = df.set_index(['Province/State', 'Country/Region', 'Lat', 'Long']).stack().reset_index()
df.columns = ['Province/State', 'Country/Region', 'Lat', 'Long', 'date', 'value']
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date')
df = df.pivot_table(index=df.index, columns='Country/Region', values='value', aggfunc=np.sum)
print(df)
Output:
Country/Region Afghanistan Albania Algeria Andorra Angola ... West Bank and Gaza Western Sahara Yemen Zambia Zimbabwe
date ...
2020-01-22 0 0 0 0 0 ... 0 0 0 0 0
2020-01-23 0 0 0 0 0 ... 0 0 0 0 0
2020-01-24 0 0 0 0 0 ... 0 0 0 0 0
2020-01-25 0 0 0 0 0 ... 0 0 0 0 0
2020-01-26 0 0 0 0 0 ... 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ...
2020-07-30 36542 5197 29831 922 1109 ... 11548 10 1726 5555 3092
2020-07-31 36675 5276 30394 925 1148 ... 11837 10 1728 5963 3169
2020-08-01 36710 5396 30950 925 1164 ... 12160 10 1730 6228 3659
2020-08-02 36710 5519 31465 925 1199 ... 12297 10 1734 6347 3921
2020-08-03 36747 5620 31972 937 1280 ... 12541 10 1734 6580 4075
If you now want to do weekly aggregations, it's as simple as:
print(df.resample('W').sum())
Output:
Country/Region Afghanistan Albania Algeria Andorra Angola ... West Bank and Gaza Western Sahara Yemen Zambia Zimbabwe
date ...
2020-01-26 0 0 0 0 0 ... 0 0 0 0 0
2020-02-02 0 0 0 0 0 ... 0 0 0 0 0
2020-02-09 0 0 0 0 0 ... 0 0 0 0 0
2020-02-16 0 0 0 0 0 ... 0 0 0 0 0
2020-02-23 0 0 0 0 0 ... 0 0 0 0 0
2020-03-01 7 0 6 0 0 ... 0 0 0 0 0
2020-03-08 10 0 85 7 0 ... 43 0 0 0 0
2020-03-15 57 160 195 7 0 ... 209 0 0 0 0
2020-03-22 175 464 705 409 5 ... 309 0 0 11 7
2020-03-29 632 1142 2537 1618 29 ... 559 0 0 113 31
2020-04-05 1783 2000 6875 2970 62 ... 1178 4 0 262 59
2020-04-12 3401 2864 11629 4057 128 ... 1847 30 3 279 84
2020-04-19 5838 3603 16062 4764 143 ... 2081 42 7 356 154
2020-04-26 8918 4606 21211 5087 174 ... 2353 42 7 541 200
2020-05-03 15149 5391 27943 5214 208 ... 2432 42 41 738 244
2020-05-10 25286 5871 36315 5265 274 ... 2607 42 203 1260 241
2020-05-17 39634 6321 45122 5317 327 ... 2632 42 632 3894 274
2020-05-24 61342 6798 54185 5332 402 ... 2869 45 1321 5991 354
2020-05-31 91885 7517 62849 5344 536 ... 3073 63 1932 7125 894
2020-06-07 126442 8378 68842 5868 609 ... 3221 63 3060 7623 1694
2020-06-14 159822 9689 74147 5967 827 ... 3396 63 4236 8836 2335
2020-06-21 191378 12463 79737 5981 1142 ... 4466 63 6322 9905 3089
2020-06-28 210487 15349 87615 5985 1522 ... 10242 70 7360 10512 3813
2020-07-05 224560 18707 102918 5985 2186 ... 21897 70 8450 11322 4426
2020-07-12 237087 22399 124588 5985 2940 ... 36949 70 9489 13002 6200
2020-07-19 245264 26845 149611 6098 4279 ... 52323 70 10855 16350 9058
2020-07-26 250970 31255 178605 6237 5919 ... 68154 70 11571 26749 14933
2020-08-02 255739 36370 208457 6429 7648 ... 80685 70 12023 38896 22241
2020-08-09 36747 5620 31972 937 1280 ... 12541 10 1734 6580 4075
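To see how the weekly bins are formed, here is a small sketch on made-up data (a single hypothetical Canada column of ones, not the real dataset). It assumes the default weekly rule, where each bin ends on a Sunday.

```python
import pandas as pd

# Hypothetical daily data: 14 days starting on Monday 2020-01-20, value 1 per day
dates = pd.date_range("2020-01-20", periods=14, freq="D")
daily = pd.DataFrame({"Canada": [1] * 14}, index=dates)

# "W" is weekly with weeks ending on Sunday, so each bin is labelled by its Sunday
weekly = daily.resample("W").sum()
print(weekly)
```

The two resulting rows are labelled 2020-01-26 and 2020-02-02, which is the same labelling seen in the output above.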
Try:
np.random.seed(0)
df = pd.DataFrame(np.random.randint(0,100,(20,5)), columns=[*'ZAABC'])
df.groupby(df.columns, axis=1, sort=False).sum()
Output:
Z A B C
0 44 111 67 67
1 9 104 36 87
2 70 176 12 58
3 65 126 46 88
4 81 62 77 72
5 9 100 69 79
6 47 146 99 88
7 49 48 19 14
8 39 97 9 57
9 32 105 23 35
10 75 83 34 0
11 0 89 5 38
12 17 83 42 58
13 31 66 41 57
14 35 57 82 91
15 0 113 53 12
16 42 159 68 6
17 68 50 76 52
18 78 35 99 58
19 23 92 85 48
You can try a transpose and groupby, e.g. something similar to the below.
df_T = df.transpose()
df_T.groupby(df_T.index).sum().loc['Canada']
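A minimal sketch of the transpose-and-group idea on three rows taken from the question, assuming Country/Region has already been set as the index. Grouping the transposed frame by its index level avoids the axis=1 argument, which newer pandas versions deprecate.

```python
import pandas as pd

# Three weeks copied from the question; the column labels are duplicated on purpose
df = pd.DataFrame(
    [[0, 3, 0, 0, 0], [0, 80, 0, 5, 0], [12, 702, 3, 199, 20]],
    index=["Week 1", "Week 6", "Week 8"],
    columns=["Brazil", "Canada", "Canada", "Canada", "Canada"],
)

# Transpose so the duplicated labels become the index, sum rows that share a label,
# then transpose back so weeks are rows again
summed = df.T.groupby(level=0).sum().T
print(summed)
```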
Here's a way to do it:
df.columns = [(col + str(i)) if col.startswith('Canada') else col for i, col in enumerate(df.columns)]
df = df.assign(Canada=df.filter(like='Canada').sum(axis=1)).drop(columns=[x for x in df.columns if x.startswith('Canada') and x != 'Canada'])
First we rename the columns starting with Canada by appending their integer position, which ensures they are no longer duplicates.
Then we use sum() to add across columns like Canada, put the result in a new column named Canada, and drop the columns that were originally named Canada.
Full test code is:
import pandas as pd
df = pd.DataFrame(
    columns=[x.strip() for x in 'Brazil Canada Canada Canada Canada'.split()],
    index=['Week ' + str(i) for i in range(1, 17)],
    data=[[i] * 5 for i in range(1, 17)])
df.columns.names=['Country/Region']
print(df)
df.columns = [(col + str(i)) if col.startswith('Canada') else col for i, col in enumerate(df.columns)]
df = df.assign(Canada=df.filter(like='Canada').sum(axis=1)).drop(columns=[x for x in df.columns if x.startswith('Canada') and x != 'Canada'])
print(df)
Output:
Country/Region Brazil Canada Canada Canada Canada
Week 1 1 1 1 1 1
Week 2 2 2 2 2 2
Week 3 3 3 3 3 3
Week 4 4 4 4 4 4
Week 5 5 5 5 5 5
Week 6 6 6 6 6 6
Week 7 7 7 7 7 7
Week 8 8 8 8 8 8
Week 9 9 9 9 9 9
Week 10 10 10 10 10 10
Week 11 11 11 11 11 11
Week 12 12 12 12 12 12
Week 13 13 13 13 13 13
Week 14 14 14 14 14 14
Week 15 15 15 15 15 15
Week 16 16 16 16 16 16
Brazil Canada
Week 1 1 4
Week 2 2 8
Week 3 3 12
Week 4 4 16
Week 5 5 20
Week 6 6 24
Week 7 7 28
Week 8 8 32
Week 9 9 36
Week 10 10 40
Week 11 11 44
Week 12 12 48
Week 13 13 52
Week 14 14 56
Week 15 15 60
Week 16 16 64

Find Max of successive Similar Values

I have a Dataframe like:
timestamp Order Price Quantity
0 2019-10-09 09:15:42 0 27850.00 2040
1 2019-10-09 09:15:42 0 27850.00 1980
2 2019-10-09 09:15:53 0 27860.85 1800
3 2019-10-09 09:16:54 0 27860.85 2340
4 2019-10-09 09:18:48 0 27860.85 1500
5 2019-10-09 09:21:08 0 27979.00 1840
6 2019-10-09 09:21:08 0 27979.00 2020
7 2019-10-09 09:21:12 0 27850.00 1800
8 2019-10-09 09:21:15 0 27850.00 1580
9 2019-10-09 09:21:21 35 28000.00 1840
10 2019-10-09 09:21:23 34 28000.00 1800
11 2019-10-09 09:28:17 0 28035.00 2020
12 2019-10-09 09:28:18 0 28035.00 1960
13 2019-10-09 09:28:18 0 28035.00 1920
14 2019-10-09 09:28:24 0 28035.00 1940
15 2019-10-09 09:28:24 0 28035.00 1960
16 2019-10-09 09:28:25 0 28000.00 2140
17 2019-10-09 09:28:25 0 28000.00 2020
18 2019-10-09 09:28:26 0 28000.00 2120
When successive Price values are the same, I want to return only the row with the maximum Quantity value.
My result DataFrame should look like:
timestamp Order Price Quantity
0 2019-10-09 09:15:42 0 27850.00 2040
3 2019-10-09 09:16:54 0 27860.85 2340
6 2019-10-09 09:21:08 0 27979.00 2020
7 2019-10-09 09:21:12 0 27850.00 1800
9 2019-10-09 09:21:21 35 28000.00 1840
11 2019-10-09 09:28:17 0 28035.00 2020
16 2019-10-09 09:28:25 0 28000.00 2140
PS: In the result table, the Price value 27850.00 appears again in row 7 and should be treated as an independent run. The same applies to 28000.00.
First create a price_group column to identify consecutive rows with the same price (as in this answer).
price_group = (df.Price != df.Price.shift()).cumsum()
Then group the rows by this column and find the rows with max quantity for each group (as in these answers).
result = df.loc[df.Quantity.groupby(price_group).idxmax()]
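Here is a runnable sketch of those two steps on the first nine rows of the question's data (only the Price and Quantity columns are reproduced):

```python
import pandas as pd

# First nine rows of the question's Price and Quantity columns
df = pd.DataFrame({
    "Price": [27850.00, 27850.00, 27860.85, 27860.85, 27860.85,
              27979.00, 27979.00, 27850.00, 27850.00],
    "Quantity": [2040, 1980, 1800, 2340, 1500, 1840, 2020, 1800, 1580],
})

# A new run starts wherever the price differs from the previous row;
# the cumulative sum of those change flags labels each run
price_group = (df["Price"] != df["Price"].shift()).cumsum()

# idxmax returns the index label of the largest Quantity in each run
result = df.loc[df.groupby(price_group)["Quantity"].idxmax()]
print(result)
```

Rows 0, 3, 6 and 7 are kept, so the second run of 27850.00 is treated separately from the first, as the question asks.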
Something like this:
from itertools import groupby
x = [[list(n) for m, n in groupby(df['Price'])]][0]
y = [(ind,val) for ind,val in enumerate(x)]
z = [i[0] for i in y for j in i[1]]
df['label'] = z
# it gives you df like this
# Unnamed: 0 Unnamed: 1 timestamp Order Price Quantity label
# 0 0 09.10.2019 9:15:42 0 27850.00 2040 0
# 1 1 09.10.2019 9:15:42 0 27850.00 1980 0
# 2 2 09.10.2019 9:15:53 0 27860.85 1800 1
# 3 3 09.10.2019 9:16:54 0 27860.85 2340 1
# 4 4 09.10.2019 9:18:48 0 27860.85 1500 1
# 5 5 09.10.2019 9:21:08 0 27979.00 1840 2
# 6 6 09.10.2019 9:21:08 0 27979.00 2020 2
# 7 7 09.10.2019 9:21:12 0 27850.00 1800 3
# 8 8 09.10.2019 9:21:15 0 27850.00 1580 3
# 9 9 09.10.2019 9:21:21 35 28000.00 1840 4
# 10 10 09.10.2019 9:21:23 34 28000.00 1800 4
# 11 11 09.10.2019 9:28:17 0 28035.00 2020 5
# 12 12 09.10.2019 9:28:18 0 28035.00 1960 5
# 13 13 09.10.2019 9:28:18 0 28035.00 1920 5
# 14 14 09.10.2019 9:28:24 0 28035.00 1940 5
# 15 15 09.10.2019 9:28:24 0 28035.00 1960 5
# 16 16 09.10.2019 9:28:25 0 28000.00 2140 6
# 17 17 09.10.2019 9:28:25 0 28000.00 2020 6
# 18 18 09.10.2019 9:28:26 0 28000.00 2120 6
# then you can use groupby; note that .max() takes the maximum of each column independently within a label
df.groupby('label').max()
Out[27]:
Unnamed: 0 Unnamed: 1 timestamp Order Price Quantity
label
0 1 09.10.2019 9:15:42 0 27850.00 2040
1 4 09.10.2019 9:18:48 0 27860.85 2340
2 6 09.10.2019 9:21:08 0 27979.00 2020
3 8 09.10.2019 9:21:15 0 27850.00 1800
4 10 09.10.2019 9:21:23 35 28000.00 1840
5 15 09.10.2019 9:28:24 0 28035.00 2020
6 18 09.10.2019 9:28:26 0 28000.00 2140
This is not the slimmest solution, but I think it makes it more obvious what is happening. I'm sure it can be trimmed down to more concise code.
import pandas as pd
# Generating a similar df
df = pd.DataFrame({'Order'   :[1,2,3,4,5,6,7],
                   'Price'   :[27850.00,27850.00,27860.85,27860.85,27860.85,27979.00,27979.00],
                   'Quantity':[2040, 1980, 1800, 2340 ,1500, 1840, 2020 ]
                  })
print(df)
print("--------------")
# Get the unique values from the Price column
# This tells us which values we want to select the highest value from
values = df["Price"].unique()
# Loop through the values, selecting the rows which match each value, one at a time
for value in values:
    # df["Price"] == value selects all the rows where Price equals this one value
    # For example, this gives us 3 rows where Price == 27860.85
    # .max() takes the largest value of each column; since Price is the same in all these rows,
    # the Quantity entry is the largest Quantity for that price
    # The result is a Series with one entry per column, e.g.
    # Price 27860.85
    # Quantity 2340.00
    # ["Quantity"] then selects only the Quantity value and assigns it to highest
    highest = df[df["Price"] == value].max()["Quantity"]
    print(value, "...", highest)
    # You can, during this loop, build a new dict object to create a new df if desired
Or, more succinctly...
# Create a new list in one line
highest = [df[df["Price"] == value].max()["Quantity"] for value in df["Price"].unique()]
# Add as columns to new df
df1 = pd.DataFrame({
    'Price'   :df["Price"].unique(),
    'Quantity':highest
})
print(df1)
Use the same idea to grab the appropriate value from other columns for each unique Price, and add them to the new df1.

How to find ChangeCol1/ChangeCol2 and %ChangeCol1/%ChangeCol2 of DF

I have data that looks like this.
Year Quarter Quantity Price TotalRevenue
0 2000 1 23 142 3266
1 2000 2 23 144 3312
2 2000 3 23 147 3381
3 2000 4 23 151 3473
4 2001 1 22 160 3520
5 2001 2 22 183 4026
6 2001 3 22 186 4092
7 2001 4 22 186 4092
8 2002 1 21 212 4452
9 2002 2 19 232 4408
10 2002 3 19 223 4237
I'm trying to figure out how to get the 'MarginalRevenue', where:
MR = (∆TR/∆Q)
MarginalRevenue = (Change in TotalRevenue) / (Change in Quantity)
I found: df.pct_change()
But that seems to get the percentage change for everything.
Also, I'm trying to figure out how to get something related:
ElasticityPrice = (%ΔQuantity/%ΔPrice)
Do you mean something like this?
df['MarginalRevenue'] = df['TotalRevenue'].pct_change() / df['Quantity'].pct_change()
or
df['MarginalRevenue'] = df['TotalRevenue'].diff() / df['Quantity'].diff()
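As a rough worked example, assuming the definitions given in the question (MarginalRevenue from absolute changes, ElasticityPrice from percentage changes), here are both columns computed on four rows picked from the question's data where Quantity actually changes:

```python
import pandas as pd

# Four rows from the question (2000 Q1, 2001 Q1, 2002 Q1, 2002 Q2)
df = pd.DataFrame({
    "Quantity": [23, 22, 21, 19],
    "Price": [142, 160, 212, 232],
})
df["TotalRevenue"] = df["Quantity"] * df["Price"]

# MR = change in TotalRevenue / change in Quantity (absolute differences)
df["MarginalRevenue"] = df["TotalRevenue"].diff() / df["Quantity"].diff()

# Elasticity = % change in Quantity / % change in Price
df["ElasticityPrice"] = df["Quantity"].pct_change() / df["Price"].pct_change()
print(df)
```

The first row of both columns is NaN because there is no previous row to compare with. On the full data, rows where Quantity does not change have a zero denominator for MarginalRevenue, so expect inf or NaN there.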

Summing up previous 10 rows of a dataframe

I'm wondering how to sum up 10 rows of a data frame from any point.
I tried using rolling(10,window =1).sum(), but the very first row should sum up the 10 rows below it. I ran into a similar issue with cumsum().
So if my data frame is just the A column, I'd like it to output B.
A B
0 10 550
1 20 650
2 30 750
3 40 850
4 50 950
5 60 1050
6 70 1150
7 80 1250
8 90 1350
9 100 1450
10 110 etc
11 120 etc
12 130 etc
13 140
14 150
15 160
16 170
17 180
18 190
It would be similar to doing this operation in Excel and copying it down.
You can reverse your series before using pd.Series.rolling, and then reverse the result:
df['B'] = df['A'][::-1].rolling(10, min_periods=0).sum()[::-1]
print(df)
A B
0 10 550.0
1 20 650.0
2 30 750.0
3 40 850.0
4 50 950.0
5 60 1050.0
6 70 1150.0
7 80 1250.0
8 90 1350.0
9 100 1450.0
10 110 1350.0
11 120 1240.0
12 130 1120.0
13 140 990.0
14 150 850.0
15 160 700.0
16 170 540.0
17 180 370.0
18 190 190.0
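If reversing twice feels awkward, a forward-looking window is another way to sketch this. It assumes a pandas version that ships pandas.api.indexers.FixedForwardWindowIndexer; the variable names are only illustrative.

```python
import pandas as pd
from pandas.api.indexers import FixedForwardWindowIndexer

# Column A from the question: 10, 20, ..., 190
df = pd.DataFrame({"A": range(10, 200, 10)})

# Each window covers the current row plus the next nine rows
indexer = FixedForwardWindowIndexer(window_size=10)

# min_periods=1 lets the shorter windows at the end of the frame still produce a sum
df["B"] = df["A"].rolling(window=indexer, min_periods=1).sum()
print(df)
```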

Python pandas groupby with cumsum and percentage

Given the following dataframe df:
app platform uuid minutes
0 1 0 a696ccf9-22cb-428b-adee-95c9a97a4581 67
1 2 0 8e17a2eb-f0ee-49ae-b8c2-c9f9926aa56d 1
2 2 1 40AD6CD1-4A7B-48DD-8815-1829C093A95C 13
3 1 0 26c1022a-7a8e-42a2-b7cc-bea6bffa7a6f 2
4 2 0 34271596-eebb-4423-b890-dc3761ed37ca 8
5 3 1 C57D0F52-B565-4322-85D2-C2798F7CA6FF 16
6 2 0 245501ec2e39cb782bab1fb02d7813b7 1
7 3 1 DE6E4714-5A3C-4C80-BD81-EAACB2364DF0 30
8 3 0 f88eb774-fdf3-4d1d-a91d-0b4ab95cf36e 10
9 2 0 9c08c860-7a6d-4810-a5c3-f3af2a3fcf66 470
10 3 1 19fdaedfd0dbdaf6a7a6b49619f11a19 3
11 3 1 AAF1CFF7-4564-4C79-B2D8-F0AAF9C9971B 58
12 2 0 4eb1024b-c293-42a4-95a2-31b20c3b524b 24
13 3 1 8E0B0BE3-8553-4F38-9837-6C907E01F84C 7
14 3 1 E8B2849C-F050-4DCD-B311-5D57015466AE 465
15 2 0 ec7fedb6-b118-424a-babe-b8ffad579685 266
16 1 0 7e302dcb-ceaf-406c-a9e5-66933d921064 184
17 2 0 f786528ded200c9f553dd3a5e9e9bb2d 10
18 3 1 1E291633-AF27-4DFB-8DA4-4A5B63F175CF 13
19 2 0 953a525c-97e0-4c2f-90e0-dfebde3ec20d 2408
I'll group it:
y = df.groupby(['app','platform','uuid']).sum().reset_index().sort_values(['app','platform','minutes'], ascending=[True, True, False]).set_index(['app','platform','uuid'])
minutes
app platform uuid
1 0 7e302dcb-ceaf-406c-a9e5-66933d921064 184
a696ccf9-22cb-428b-adee-95c9a97a4581 67
26c1022a-7a8e-42a2-b7cc-bea6bffa7a6f 2
2 0 953a525c-97e0-4c2f-90e0-dfebde3ec20d 2408
9c08c860-7a6d-4810-a5c3-f3af2a3fcf66 470
ec7fedb6-b118-424a-babe-b8ffad579685 266
4eb1024b-c293-42a4-95a2-31b20c3b524b 24
f786528ded200c9f553dd3a5e9e9bb2d 10
34271596-eebb-4423-b890-dc3761ed37ca 8
245501ec2e39cb782bab1fb02d7813b7 1
8e17a2eb-f0ee-49ae-b8c2-c9f9926aa56d 1
1 40AD6CD1-4A7B-48DD-8815-1829C093A95C 13
3 0 f88eb774-fdf3-4d1d-a91d-0b4ab95cf36e 10
1 E8B2849C-F050-4DCD-B311-5D57015466AE 465
AAF1CFF7-4564-4C79-B2D8-F0AAF9C9971B 58
DE6E4714-5A3C-4C80-BD81-EAACB2364DF0 30
C57D0F52-B565-4322-85D2-C2798F7CA6FF 16
1E291633-AF27-4DFB-8DA4-4A5B63F175CF 13
8E0B0BE3-8553-4F38-9837-6C907E01F84C 7
19fdaedfd0dbdaf6a7a6b49619f11a19 3
That gives me the minutes per uuid in descending order.
Now, I will sum the cumulative minutes per app/platform/uuid:
y.groupby(level=[0,1]).cumsum()
app platform uuid
1 0 7e302dcb-ceaf-406c-a9e5-66933d921064 184
a696ccf9-22cb-428b-adee-95c9a97a4581 251
26c1022a-7a8e-42a2-b7cc-bea6bffa7a6f 253
2 0 953a525c-97e0-4c2f-90e0-dfebde3ec20d 2408
9c08c860-7a6d-4810-a5c3-f3af2a3fcf66 2878
ec7fedb6-b118-424a-babe-b8ffad579685 3144
4eb1024b-c293-42a4-95a2-31b20c3b524b 3168
f786528ded200c9f553dd3a5e9e9bb2d 3178
34271596-eebb-4423-b890-dc3761ed37ca 3186
245501ec2e39cb782bab1fb02d7813b7 3187
8e17a2eb-f0ee-49ae-b8c2-c9f9926aa56d 3188
1 40AD6CD1-4A7B-48DD-8815-1829C093A95C 13
3 0 f88eb774-fdf3-4d1d-a91d-0b4ab95cf36e 10
1 E8B2849C-F050-4DCD-B311-5D57015466AE 465
AAF1CFF7-4564-4C79-B2D8-F0AAF9C9971B 523
DE6E4714-5A3C-4C80-BD81-EAACB2364DF0 553
C57D0F52-B565-4322-85D2-C2798F7CA6FF 569
1E291633-AF27-4DFB-8DA4-4A5B63F175CF 582
8E0B0BE3-8553-4F38-9837-6C907E01F84C 589
19fdaedfd0dbdaf6a7a6b49619f11a19 592
My question is: how can I get the percentage against the total cumulative sum, per group, i.e. something like this:
app platform uuid
1 0 7e302dcb-ceaf-406c-a9e5-66933d921064 184 0.26
a696ccf9-22cb-428b-adee-95c9a97a4581 251 0.36
26c1022a-7a8e-42a2-b7cc-bea6bffa7a6f 253 0.36
...
...
...
It's not clear how you came up with 0.26 and 0.36 in your desired output, but assuming those are just dummy numbers, you could get a running % of total for each group like this:
y['cumsum'] = y.groupby(level=[0,1]).cumsum()
y['running_pct'] = y.groupby(level=[0,1])['cumsum'].transform(lambda x: x / x.iloc[-1])
Should give the right output.
In [398]: y['running_pct'].head()
Out[398]:
app platform uuid
1 0 7e302dcb-ceaf-406c-a9e5-66933d921064 0.727273
a696ccf9-22cb-428b-adee-95c9a97a4581 0.992095
26c1022a-7a8e-42a2-b7cc-bea6bffa7a6f 1.000000
2 0 953a525c-97e0-4c2f-90e0-dfebde3ec20d 0.755332
9c08c860-7a6d-4810-a5c3-f3af2a3fcf66 0.902760
Name: running_pct, dtype: float64
EDIT:
Per the comments, if you're looking to wring out a little more performance, this will be faster as of version 0.14.1
y['cumsum'] = y.groupby(level=[0,1])['minutes'].transform('cumsum')
y['running_pct'] = y['cumsum'] / y.groupby(level=[0,1])['minutes'].transform('sum')
And as @Jeff notes, in 0.15.0 this may be faster yet.
y['running_pct'] = y['cumsum'] / y.groupby(level=[0,1])['minutes'].transform('last')
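For a self-contained check, here is the same running-percentage idea on a small flat frame, grouping by columns instead of index levels. The five rows are a subset of the question's data, so the totals for app 2 are smaller than in the full example.

```python
import pandas as pd

# A few rows from the question, already sorted by minutes (descending) within each group
df = pd.DataFrame({
    "app": [1, 1, 1, 2, 2],
    "platform": [0, 0, 0, 0, 0],
    "minutes": [184, 67, 2, 2408, 470],
})

grouped = df.groupby(["app", "platform"])["minutes"]

# Running total within each app/platform group
df["cumsum"] = grouped.cumsum()

# Divide by the group total, broadcast back to every row with transform("sum")
df["running_pct"] = df["cumsum"] / grouped.transform("sum")
print(df)
```

The last row of each group comes out as 1.0, and the first values for app 1 match the 0.727273 and 0.992095 shown above.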
