I have a Dataframe like:
timestamp Order Price Quantity
0 2019-10-09 09:15:42 0 27850.00 2040
1 2019-10-09 09:15:42 0 27850.00 1980
2 2019-10-09 09:15:53 0 27860.85 1800
3 2019-10-09 09:16:54 0 27860.85 2340
4 2019-10-09 09:18:48 0 27860.85 1500
5 2019-10-09 09:21:08 0 27979.00 1840
6 2019-10-09 09:21:08 0 27979.00 2020
7 2019-10-09 09:21:12 0 27850.00 1800
8 2019-10-09 09:21:15 0 27850.00 1580
9 2019-10-09 09:21:21 35 28000.00 1840
10 2019-10-09 09:21:23 34 28000.00 1800
11 2019-10-09 09:28:17 0 28035.00 2020
12 2019-10-09 09:28:18 0 28035.00 1960
13 2019-10-09 09:28:18 0 28035.00 1920
14 2019-10-09 09:28:24 0 28035.00 1940
15 2019-10-09 09:28:24 0 28035.00 1960
16 2019-10-09 09:28:25 0 28000.00 2140
17 2019-10-09 09:28:25 0 28000.00 2020
18 2019-10-09 09:28:26 0 28000.00 2120
When successive Price values are the same, I want to return the row with the maximum Quantity value from each such run.
The result DataFrame should look like:
timestamp Order Price Quantity
0 2019-10-09 09:15:42 0 27850.00 2040
3 2019-10-09 09:16:54 0 27860.85 2340
6 2019-10-09 09:21:08 0 27979.00 2020
7 2019-10-09 09:21:12 0 27850.00 1800
9 2019-10-09 09:21:21 35 28000.00 1840
11 2019-10-09 09:28:17 0 28035.00 2020
16 2019-10-09 09:28:25 0 28000.00 2140
PS: In the result table, the Price value 27850.00 appears again at row 7 and is treated as an independent run. The same applies to 28000.00.
First create a price_group column to identify consecutive rows with the same price (as in this answer).
price_group = (df.Price != df.Price.shift()).cumsum()
Then group the rows by this column and find the rows with max quantity for each group (as in these answers).
result = df.loc[df.Quantity.groupby(price_group).idxmax()]
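As a minimal, self-contained sketch of the two steps above: the small frame below copies only the Price and Quantity columns from the question, and the rest follows the answer as written.

```python
import pandas as pd

# Price and Quantity columns copied from the question (rows 0..18).
df = pd.DataFrame({
    "Price": [27850.00, 27850.00, 27860.85, 27860.85, 27860.85,
              27979.00, 27979.00, 27850.00, 27850.00, 28000.00,
              28000.00, 28035.00, 28035.00, 28035.00, 28035.00,
              28035.00, 28000.00, 28000.00, 28000.00],
    "Quantity": [2040, 1980, 1800, 2340, 1500, 1840, 2020, 1800, 1580,
                 1840, 1800, 2020, 1960, 1920, 1940, 1960, 2140, 2020, 2120],
})

# A new group starts whenever Price differs from the previous row,
# so a price that reappears later gets its own group id.
price_group = (df.Price != df.Price.shift()).cumsum()

# idxmax gives, per group, the index label of the row with the largest Quantity.
result = df.loc[df.Quantity.groupby(price_group).idxmax()]
print(result)
```

If two rows in a run tie for the largest Quantity, idxmax keeps the first of them.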
Something like this:
from itertools import groupby
x = [[list(n) for m, n in groupby(df['Price'])]][0]
y = [(ind,val) for ind,val in enumerate(x)]
z = [i[0] for i in y for j in i[1]]
df['label'] = z
# it gives you df like this
# Unnamed: 0 Unnamed: 1 timestamp Order Price Quantity label
# 0 0 09.10.2019 9:15:42 0 27850.00 2040 0
# 1 1 09.10.2019 9:15:42 0 27850.00 1980 0
# 2 2 09.10.2019 9:15:53 0 27860.85 1800 1
# 3 3 09.10.2019 9:16:54 0 27860.85 2340 1
# 4 4 09.10.2019 9:18:48 0 27860.85 1500 1
# 5 5 09.10.2019 9:21:08 0 27979.00 1840 2
# 6 6 09.10.2019 9:21:08 0 27979.00 2020 2
# 7 7 09.10.2019 9:21:12 0 27850.00 1800 3
# 8 8 09.10.2019 9:21:15 0 27850.00 1580 3
# 9 9 09.10.2019 9:21:21 35 28000.00 1840 4
# 10 10 09.10.2019 9:21:23 34 28000.00 1800 4
# 11 11 09.10.2019 9:28:17 0 28035.00 2020 5
# 12 12 09.10.2019 9:28:18 0 28035.00 1960 5
# 13 13 09.10.2019 9:28:18 0 28035.00 1920 5
# 14 14 09.10.2019 9:28:24 0 28035.00 1940 5
# 15 15 09.10.2019 9:28:24 0 28035.00 1960 5
# 16 16 09.10.2019 9:28:25 0 28000.00 2140 6
# 17 17 09.10.2019 9:28:25 0 28000.00 2020 6
# 18 18 09.10.2019 9:28:26 0 28000.00 2120 6
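# then you are able to use groupby; note that .max() takes the column-wise maximum
# per label, so the values in one output row can come from different original rows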
df.groupby('label').max()
Out[27]:
Unnamed: 0 Unnamed: 1 timestamp Order Price Quantity
label
0 1 09.10.2019 9:15:42 0 27850.00 2040
1 4 09.10.2019 9:18:48 0 27860.85 2340
2 6 09.10.2019 9:21:08 0 27979.00 2020
3 8 09.10.2019 9:21:15 0 27850.00 1800
4 10 09.10.2019 9:21:23 35 28000.00 1840
5 15 09.10.2019 9:28:24 0 28035.00 2020
6 18 09.10.2019 9:28:26 0 28000.00 2140
This is not the slimmest solution, but I think it makes it more obvious what is happening. I'm sure it can be trimmed down to more concise code.
import pandas as pd
# Generating a similar df
df = pd.DataFrame({'Order' :[1,2,3,4,5,6,7],
'Price' :[27850.00,27850.00,27860.85,27860.85,27860.85,27979.00,27979.00],
'Quantity':[2040, 1980, 1800, 2340 ,1500, 1840, 2020 ]
})
print(df)
print("--------------")
# Get the unique values from the Price column
# This tells us which values we want to select the highest value from
values = df["Price"].unique()
# Loop through the values, selecting the rows which match each value, one at a time
for value in values:
    # df["Price"] == value selects all the rows where Price equals ONE of the values
    # For example, it gives us 3 rows where Price == 27860.85
    # .max() then takes the maximum of each column over those rows; Price is the same
    # in all of them, so the Quantity entry is the largest Quantity for that price.
    # The result is a Series with one entry per column, e.g.
    # Order          5.00
    # Price      27860.85
    # Quantity    2340.00
    # ["Quantity"] then selects only the Quantity value and assigns it to highest
    highest = df[df["Price"] == value].max()["Quantity"]
    print(value, "...", highest)
    # You can, during this loop, build a new dict object to create a new df if desired
Or, more succinctly...
# Create a new list in one line
highest = [ df[df["Price"] == value].max()["Quantity"] for value in df["Price"].unique()]
# Add as columns to new df
df1 = pd.DataFrame({
'Price' :df["Price"].unique(),
'Quantity':highest
})
print(df1)
Use the same idea to grab the appropriate value from other columns for each unique Price, and add them to the new df1
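One possible way to pull the other columns along, sketched here as an assumption rather than as part of the original answer, is to look up the whole row that holds the largest Quantity for each price. The name best_rows is made up for illustration.

```python
import pandas as pd

# Same sample frame as in the answer above.
df = pd.DataFrame({
    "Order": [1, 2, 3, 4, 5, 6, 7],
    "Price": [27850.00, 27850.00, 27860.85, 27860.85, 27860.85, 27979.00, 27979.00],
    "Quantity": [2040, 1980, 1800, 2340, 1500, 1840, 2020],
})

# Index label of the row with the largest Quantity for each distinct Price.
best_rows = df.groupby("Price")["Quantity"].idxmax()

# Selecting those labels keeps every column of the winning rows together.
df1 = df.loc[best_rows].reset_index(drop=True)
print(df1)
```

Like the loop above, this groups all rows with the same Price together regardless of where they appear in the frame.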
How can I merge and sum the columns with the same name?
So the output should be 1 Column named Canada as a result of the sum of the 4 columns named Canada.
Country/Region Brazil Canada Canada Canada Canada
Week 1 0 3 0 0 0
Week 2 0 17 0 0 0
Week 3 0 21 0 0 0
Week 4 0 21 0 0 0
Week 5 0 23 0 0 0
Week 6 0 80 0 5 0
Week 7 0 194 0 20 0
Week 8 12 702 3 199 20
Week 9 182 2679 16 2395 260
Week 10 737 8711 80 17928 892
Week 11 1674 25497 153 48195 1597
Week 12 2923 46392 175 85563 2003
Week 13 4516 76095 182 122431 2180
Week 14 6002 105386 183 163539 2431
Week 15 6751 127713 189 210409 2995
Week 16 7081 147716 189 258188 3845
From its current state, this should give the outcome you're looking for:
df = df.set_index('Country/Region') # optional
df.groupby(df.columns, axis=1).sum() # Stolen from Scott Boston as it's a superior method.
Output:
index Brazil Canada
Country/Region
Week 1 0 3
Week 2 0 17
Week 3 0 21
Week 4 0 21
Week 5 0 23
Week 6 0 85
Week 7 0 214
Week 8 12 924
Week 9 182 5350
Week 10 737 27611
Week 11 1674 75442
Week 12 2923 134133
Week 13 4516 200888
Week 14 6002 271539
Week 15 6751 341306
Week 16 7081 409938
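Recent pandas releases (2.1 and later, as far as I know) warn that axis=1 in groupby is deprecated. A sketch of the same column merge without it, using three rows of the question's data, is to transpose, group on the index, and transpose back. The name summed is made up for illustration.

```python
import pandas as pd

# Weeks 6-8 from the question, with the duplicated 'Canada' column names.
df = pd.DataFrame(
    [[0, 80, 0, 5, 0], [0, 194, 0, 20, 0], [12, 702, 3, 199, 20]],
    index=["Week 6", "Week 7", "Week 8"],
    columns=["Brazil", "Canada", "Canada", "Canada", "Canada"],
)

# After transposing, the duplicate names sit in the index, so grouping on
# index level 0 sums the rows that share a name; .T restores the layout.
summed = df.T.groupby(level=0).sum().T
print(summed)
```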
I found your dataset interesting, here's how I would clean it up from step 1:
df = pd.read_csv('file.csv')
df = df.set_index(['Province/State', 'Country/Region', 'Lat', 'Long']).stack().reset_index()
df.columns = ['Province/State', 'Country/Region', 'Lat', 'Long', 'date', 'value']
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date')
df = df.pivot_table(index=df.index, columns='Country/Region', values='value', aggfunc='sum')
print(df)
Output:
Country/Region Afghanistan Albania Algeria Andorra Angola ... West Bank and Gaza Western Sahara Yemen Zambia Zimbabwe
date ...
2020-01-22 0 0 0 0 0 ... 0 0 0 0 0
2020-01-23 0 0 0 0 0 ... 0 0 0 0 0
2020-01-24 0 0 0 0 0 ... 0 0 0 0 0
2020-01-25 0 0 0 0 0 ... 0 0 0 0 0
2020-01-26 0 0 0 0 0 ... 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ...
2020-07-30 36542 5197 29831 922 1109 ... 11548 10 1726 5555 3092
2020-07-31 36675 5276 30394 925 1148 ... 11837 10 1728 5963 3169
2020-08-01 36710 5396 30950 925 1164 ... 12160 10 1730 6228 3659
2020-08-02 36710 5519 31465 925 1199 ... 12297 10 1734 6347 3921
2020-08-03 36747 5620 31972 937 1280 ... 12541 10 1734 6580 4075
If you now want to do weekly aggregations, it's as simple as:
print(df.resample('W').sum())
Output:
Country/Region Afghanistan Albania Algeria Andorra Angola ... West Bank and Gaza Western Sahara Yemen Zambia Zimbabwe
date ...
2020-01-26 0 0 0 0 0 ... 0 0 0 0 0
2020-02-02 0 0 0 0 0 ... 0 0 0 0 0
2020-02-09 0 0 0 0 0 ... 0 0 0 0 0
2020-02-16 0 0 0 0 0 ... 0 0 0 0 0
2020-02-23 0 0 0 0 0 ... 0 0 0 0 0
2020-03-01 7 0 6 0 0 ... 0 0 0 0 0
2020-03-08 10 0 85 7 0 ... 43 0 0 0 0
2020-03-15 57 160 195 7 0 ... 209 0 0 0 0
2020-03-22 175 464 705 409 5 ... 309 0 0 11 7
2020-03-29 632 1142 2537 1618 29 ... 559 0 0 113 31
2020-04-05 1783 2000 6875 2970 62 ... 1178 4 0 262 59
2020-04-12 3401 2864 11629 4057 128 ... 1847 30 3 279 84
2020-04-19 5838 3603 16062 4764 143 ... 2081 42 7 356 154
2020-04-26 8918 4606 21211 5087 174 ... 2353 42 7 541 200
2020-05-03 15149 5391 27943 5214 208 ... 2432 42 41 738 244
2020-05-10 25286 5871 36315 5265 274 ... 2607 42 203 1260 241
2020-05-17 39634 6321 45122 5317 327 ... 2632 42 632 3894 274
2020-05-24 61342 6798 54185 5332 402 ... 2869 45 1321 5991 354
2020-05-31 91885 7517 62849 5344 536 ... 3073 63 1932 7125 894
2020-06-07 126442 8378 68842 5868 609 ... 3221 63 3060 7623 1694
2020-06-14 159822 9689 74147 5967 827 ... 3396 63 4236 8836 2335
2020-06-21 191378 12463 79737 5981 1142 ... 4466 63 6322 9905 3089
2020-06-28 210487 15349 87615 5985 1522 ... 10242 70 7360 10512 3813
2020-07-05 224560 18707 102918 5985 2186 ... 21897 70 8450 11322 4426
2020-07-12 237087 22399 124588 5985 2940 ... 36949 70 9489 13002 6200
2020-07-19 245264 26845 149611 6098 4279 ... 52323 70 10855 16350 9058
2020-07-26 250970 31255 178605 6237 5919 ... 68154 70 11571 26749 14933
2020-08-02 255739 36370 208457 6429 7648 ... 80685 70 12023 38896 22241
2020-08-09 36747 5620 31972 937 1280 ... 12541 10 1734 6580 4075
Try:
np.random.seed(0)
df = pd.DataFrame(np.random.randint(0,100,(20,5)), columns=[*'ZAABC'])
df.groupby(df.columns, axis=1, sort=False).sum()
Output:
Z A B C
0 44 111 67 67
1 9 104 36 87
2 70 176 12 58
3 65 126 46 88
4 81 62 77 72
5 9 100 69 79
6 47 146 99 88
7 49 48 19 14
8 39 97 9 57
9 32 105 23 35
10 75 83 34 0
11 0 89 5 38
12 17 83 42 58
13 31 66 41 57
14 35 57 82 91
15 0 113 53 12
16 42 159 68 6
17 68 50 76 52
18 78 35 99 58
19 23 92 85 48
You can try a transpose and groupby, e.g. something similar to the below.
df_T = df.transpose()
df_T.groupby(df_T.index).sum()['Canada']
Here's a way to do it:
df.columns = [(col + str(i)) if col.startswith('Canada') else col for i, col in enumerate(df.columns)]
df = df.assign(Canada=df.filter(like='Canada').sum(axis=1)).drop(columns=[x for x in df.columns if x.startswith('Canada') and x != 'Canada'])
First we rename the columns starting with Canada by appending their integer position, which ensures they are no longer duplicates.
Then we use sum() to add across columns like Canada, put the result in a new column named Canada, and drop the columns that were originally named Canada.
Full test code is:
import pandas as pd
df = pd.DataFrame(
columns=[x.strip() for x in 'Brazil Canada Canada Canada Canada'.split()],
index=['Week ' + str(i) for i in range(1, 17)],
data=[[i] * 5 for i in range(1, 17)])
df.columns.names=['Country/Region']
print(df)
df.columns = [(col + str(i)) if col.startswith('Canada') else col for i, col in enumerate(df.columns)]
df = df.assign(Canada=df.filter(like='Canada').sum(axis=1)).drop(columns=[x for x in df.columns if x.startswith('Canada') and x != 'Canada'])
print(df)
Output:
Country/Region Brazil Canada Canada Canada Canada
Week 1 1 1 1 1 1
Week 2 2 2 2 2 2
Week 3 3 3 3 3 3
Week 4 4 4 4 4 4
Week 5 5 5 5 5 5
Week 6 6 6 6 6 6
Week 7 7 7 7 7 7
Week 8 8 8 8 8 8
Week 9 9 9 9 9 9
Week 10 10 10 10 10 10
Week 11 11 11 11 11 11
Week 12 12 12 12 12 12
Week 13 13 13 13 13 13
Week 14 14 14 14 14 14
Week 15 15 15 15 15 15
Week 16 16 16 16 16 16
Brazil Canada
Week 1 1 4
Week 2 2 8
Week 3 3 12
Week 4 4 16
Week 5 5 20
Week 6 6 24
Week 7 7 28
Week 8 8 32
Week 9 9 36
Week 10 10 40
Week 11 11 44
Week 12 12 48
Week 13 13 52
Week 14 14 56
Week 15 15 60
Week 16 16 64
I have a set of stock information with datetime as the index. The stock market is only open on weekdays, so all my rows are weekdays, which is fine. I would like to determine whether a row is the start or the end of a week, which might NOT always fall on Monday/Friday because of holidays. A better idea is to check whether there is a row entry for the next/previous day in the DataFrame (since my data is guaranteed to exist only for workdays), but I don't know how to calculate this. Here is an example of my data:
date day_of_week day_of_month day_of_year month_of_year
5/1/2017 0 1 121 5
5/2/2017 1 2 122 5
5/3/2017 2 3 123 5
5/4/2017 3 4 124 5
5/8/2017 0 8 128 5
5/9/2017 1 9 129 5
5/10/2017 2 10 130 5
5/11/2017 3 11 131 5
5/12/2017 4 12 132 5
5/15/2017 0 15 135 5
5/16/2017 1 16 136 5
5/17/2017 2 17 137 5
5/18/2017 3 18 138 5
5/19/2017 4 19 139 5
5/23/2017 1 23 143 5
5/24/2017 2 24 144 5
5/25/2017 3 25 145 5
5/26/2017 4 26 146 5
5/30/2017 1 30 150 5
Here is my current code
# Date fields
def DateFields(df_input):
    dates = df_input.index.to_series()
    df_input['day_of_week'] = dates.dt.dayofweek
    df_input['day_of_month'] = dates.dt.day
    df_input['day_of_year'] = dates.dt.dayofyear
    df_input['month_of_year'] = dates.dt.month
    df_input['isWeekStart'] = "No" #<--- Need help here
    df_input['isWeekEnd'] = "No" #<--- Need help here
    df_input['date'] = dates.dt.strftime('%Y-%m-%d')
    return df_input
How can I calculate if a row is beginning of week and end of week?
Example of what I am looking for:
date day_of_week day_of_month day_of_year month_of_year isWeekStart isWeekEnd
5/1/2017 0 1 121 5 1 0
5/2/2017 1 2 122 5 0 0
5/3/2017 2 3 123 5 0 0
5/4/2017 3 4 124 5 0 1 # short week, Thursday is last work day
5/8/2017 0 8 128 5 1 0
5/9/2017 1 9 129 5 0 0
5/10/2017 2 10 130 5 0 0
5/11/2017 3 11 131 5 0 0
5/12/2017 4 12 132 5 0 1
5/15/2017 0 15 135 5 1 0
5/16/2017 1 16 136 5 0 0
5/17/2017 2 17 137 5 0 0
5/18/2017 3 18 138 5 0 0
5/19/2017 4 19 139 5 0 1
5/23/2017 1 23 143 5 1 0 # short week, Tuesday is first work day
5/24/2017 2 24 144 5 0 0
5/25/2017 3 25 145 5 0 0
5/26/2017 4 26 146 5 0 1
5/30/2017 1 30 150 5 1 0
EDIT: I forgot that some holidays fall during the middle of week, in this situation, it would be good if it can treat these as a separate "week" with before and after marked accordingly. Although if it's not smart enough to figure this out, just getting the long weekend would be a good start.
Here's an idea with BusinessDay:
prev_working_day = df['date'] - pd.tseries.offsets.BusinessDay(1)
df['isFirstWeekDay'] = (df['date'].dt.isocalendar().week !=
prev_working_day.dt.isocalendar().week)
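And similarly for the last business day. Note that BusinessDay only skips weekends and knows nothing about holidays; a holiday-aware offset needs CustomBusinessDay with a holiday calendar. Check out this post for a specific calendar.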
Output:
date day_of_week day_of_month day_of_year month_of_year isFirstWeekDay
0 2017-05-01 0 1 121 5 True
1 2017-05-02 1 2 122 5 False
2 2017-05-03 2 3 123 5 False
3 2017-05-04 3 4 124 5 False
4 2017-05-08 0 8 128 5 True
5 2017-05-09 1 9 129 5 False
6 2017-05-10 2 10 130 5 False
7 2017-05-11 3 11 131 5 False
8 2017-05-12 4 12 132 5 False
9 2017-05-15 0 15 135 5 True
10 2017-05-16 1 16 136 5 False
11 2017-05-17 2 17 137 5 False
12 2017-05-18 3 18 138 5 False
13 2017-05-19 4 19 139 5 False
14 2017-05-23 1 23 143 5 False
15 2017-05-24 2 24 144 5 False
16 2017-05-25 3 25 145 5 False
17 2017-05-26 4 26 146 5 False
18 2017-05-30 1 30 150 5 False
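The output above leaves 2017-05-23 and 2017-05-30 unmarked. A different technique, sketched here under the assumption that every row is a trading day and the rows are sorted by date, compares each row's ISO (year, week) with that of its neighbouring rows in the data itself, so no holiday calendar is needed. The name week_key is made up for illustration.

```python
import pandas as pd

# Dates copied from the question.
df = pd.DataFrame({"date": pd.to_datetime([
    "2017-05-01", "2017-05-02", "2017-05-03", "2017-05-04",
    "2017-05-08", "2017-05-09", "2017-05-10", "2017-05-11", "2017-05-12",
    "2017-05-15", "2017-05-16", "2017-05-17", "2017-05-18", "2017-05-19",
    "2017-05-23", "2017-05-24", "2017-05-25", "2017-05-26",
    "2017-05-30",
])})

# One integer per ISO week, e.g. 201718 for week 18 of 2017.
iso = df["date"].dt.isocalendar()
week_key = iso["year"].astype(int) * 100 + iso["week"].astype(int)

# A row starts a week if the previous row is in another week (or missing),
# and ends a week if the next row is in another week (or missing).
df["isWeekStart"] = (week_key != week_key.shift()).astype(int)
df["isWeekEnd"] = (week_key != week_key.shift(-1)).astype(int)
print(df)
```

This marks short weeks at either end, but it does not split a week around a mid-week holiday.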
Here's an approach using weekly groupby.
df['date'] = pd.to_datetime(df['date'])
business_days = df.assign(date_copy = df['date']).groupby(pd.Grouper(key='date_copy', freq='W'))['date'].apply(list).to_frame()
business_days['isWeekStart'] = business_days['date'].apply(lambda x: [1 if i == min(x) else 0 for i in x])
business_days['isWeekEnd'] = business_days['date'].apply(lambda x: [1 if i == max(x) else 0 for i in x])
business_days = business_days.apply(pd.Series.explode)
pd.merge(df, business_days, left_on='date', right_on='date')
output:
date day_of_week day_of_month day_of_year month_of_year isWeekStart isWeekEnd
0 2017-05-01 0 1 121 5 1 0
1 2017-05-02 1 2 122 5 0 0
2 2017-05-03 2 3 123 5 0 0
3 2017-05-04 3 4 124 5 0 1
4 2017-05-08 0 8 128 5 1 0
5 2017-05-09 1 9 129 5 0 0
6 2017-05-10 2 10 130 5 0 0
7 2017-05-11 3 11 131 5 0 0
8 2017-05-12 4 12 132 5 0 1
9 2017-05-15 0 15 135 5 1 0
10 2017-05-16 1 16 136 5 0 0
11 2017-05-17 2 17 137 5 0 0
12 2017-05-18 3 18 138 5 0 0
13 2017-05-19 4 19 139 5 0 1
14 2017-05-23 1 23 143 5 1 0
15 2017-05-24 2 24 144 5 0 0
16 2017-05-25 3 25 145 5 0 0
17 2017-05-26 4 26 146 5 0 1
18 2017-05-30 1 30 150 5 1 1
Note that 2017-05-30 is marked as both WeekStart and WeekEnd because it is the only date of that week.
I have the following DataFrame:
id x y timestamp sensorTime
1 32 30 1031 2002
1 4 105 1035 2005
1 8 110 1050 2006
2 18 10 1500 3600
2 40 20 1550 3610
2 80 10 1450 3620
....
import pandas as pd
import numpy as np
df = pd.DataFrame(np.array([[1,1,1,2,2,2], [32,4,8,18,40,80], [30,105,110,10,20,10], [1031,1035,1050,1500,1550,1450], [2002, 2005, 2006, 3600, 3610, 3620]])).T
df.columns = ['id', 'x', 'y', 'timestamp', 'sensorTime']
For each group grouped by id I would like to add the differences of the sensorTime to the first value of timestamp. Something like the following:
start = df.iloc[0]['timestamp']
df['sensorTime'] -= df.iloc[0]['sensorTime']
df['sensorTime'] += start
But I would like to do this for each id group separately.
The resulting DataFrame should be:
id x y timestamp sensorTime
1 32 30 1031 1031
1 4 105 1035 1034
1 8 110 1050 1035
2 18 10 1500 1500
2 40 20 1550 1510
2 80 10 1450 1520
....
How can this operation be done per group?
df
id x y timestamp sensorTime
0 1 32 30 1031 2002
1 1 4 105 1035 2005
2 1 8 110 1050 2006
3 2 18 10 1500 3600
4 2 40 20 1550 3610
5 2 80 10 1450 3620
You can group by id and then pass both timestamp and sensorTime. Then you can use diff to get the difference of sensorTime. The first value would be NaN and you can replace it with the first value of timestamp of that group. Then you can simply do cumsum to get the desired output.
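def func(x):
    diff = x['sensorTime'].diff()
    diff.iloc[0] = x['timestamp'].iloc[0]
    return (diff.cumsum().to_frame())
# group_keys=False keeps the original row index so the result aligns with df (matters on pandas >= 2.0)
df['sensorTime'] = df.groupby('id', group_keys=False)[['timestamp', 'sensorTime']].apply(func)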
df
id x y timestamp sensorTime
0 1 32 30 1031 1031.0
1 1 4 105 1035 1034.0
2 1 8 110 1050 1035.0
3 2 18 10 1500 1500.0
4 2 40 20 1550 1510.0
5 2 80 10 1450 1520.0
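The same result can also be reached without apply. This is a sketch of an alternative, not the method of the answer above: subtract each group's first sensorTime and add each group's first timestamp, both broadcast back to the rows with transform('first').

```python
import pandas as pd

# Data copied from the question.
df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 2],
    "x": [32, 4, 8, 18, 40, 80],
    "y": [30, 105, 110, 10, 20, 10],
    "timestamp": [1031, 1035, 1050, 1500, 1550, 1450],
    "sensorTime": [2002, 2005, 2006, 3600, 3610, 3620],
})

g = df.groupby("id")
# transform('first') repeats each group's first value on every row of that group.
first_sensor = g["sensorTime"].transform("first")
first_stamp = g["timestamp"].transform("first")

# Offset from the group's first sensorTime, re-based on the group's first timestamp.
df["sensorTime"] = df["sensorTime"] - first_sensor + first_stamp
print(df)
```

Because no NaN is introduced along the way, the column keeps its integer dtype.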
You could run a groupby twice, first, to get the difference in sensorTime, the second time to do the cumulative sum:
box = df.groupby("id").sensorTime.transform("diff")
df.assign(
new_sensorTime=np.where(box.isna(), df.timestamp, box),
new=lambda x: x.groupby("id")["new_sensorTime"].cumsum(),
).drop(columns="new_sensorTime")
id x y timestamp sensorTime new
0 1 32 30 1031 2002 1031.0
1 1 4 105 1035 2005 1034.0
2 1 8 110 1050 2006 1035.0
3 2 18 10 1500 3600 1500.0
4 2 40 20 1550 3610 1510.0
5 2 80 10 1450 3620 1520.0
I have a dataframe df that looks like:
A B
0 0 4140
1 0.142857 1071
2 0 1196
3 0.090909 2110
4 0.083333 1926
5 0.166667 1388
6 0 3081
7 0 1149
8 0 1600
9 0.058824 1873
10 0 3960
: : :
19 0 4315
20 0 2007
21 0.086957 3323
22 0.166667 1084
23 0.5 2703
24 0 1214
25 0 1955
26 0 6750
27 0 3240
28 0 1437
29 0 1701
I am trying to use the following line of code to produce a new column that divides A by B if A is greater than 0 (else populates it with 0) and then multiplies the result by 90:
df['new_column'] = np.where(df['A'] = 0, 0.0, df['A'].divide(df['B']))*90.0
However I get the error for the line:
result[:] = [tuple(x) for x in values]
TypeError: 'int' object is not iterable
The desired output for new_column is:
A B new_column
0 0 4140 0
1 0.142857 1071 0.01200479
2 0 1196 0
3 0.090909 2110 0.003877635
4 0.083333 1926 0.003894065
5 0.166667 1388 0.010806938
6 0 3081 0
7 0 1149 0
8 0 1600 0
9 0.058824 1873 0.002826567
10 0 3960 0
: : : :
19 0 4315 0
20 0 2007 0
21 0.086957 3323 0.00235514
22 0.166667 1084 0.013837666
23 0.5 2703 0.016648169
24 0 1214 0
25 0 1955 0
26 0 6750 0
27 0 3240 0
28 0 1437 0
29 0 1701 0
Note that numpy.where works as:
numpy.where(condition[, x, y])
Therefore:
import numpy as np
import pandas as pd
df = pd.DataFrame({'A':[0,0.142857,0,0.090909],
'B':[4140,1071,1196,2110]})
df['new_column'] = np.where(df['A'] > 0, df['A']*90/df['B'], 0)
output
A B new_column
0 0.000000 4140 0.000000
1 0.142857 1071 0.012005
2 0.000000 1196 0.000000
3 0.090909 2110 0.003878
Given the following dataframe df:
app platform uuid minutes
0 1 0 a696ccf9-22cb-428b-adee-95c9a97a4581 67
1 2 0 8e17a2eb-f0ee-49ae-b8c2-c9f9926aa56d 1
2 2 1 40AD6CD1-4A7B-48DD-8815-1829C093A95C 13
3 1 0 26c1022a-7a8e-42a2-b7cc-bea6bffa7a6f 2
4 2 0 34271596-eebb-4423-b890-dc3761ed37ca 8
5 3 1 C57D0F52-B565-4322-85D2-C2798F7CA6FF 16
6 2 0 245501ec2e39cb782bab1fb02d7813b7 1
7 3 1 DE6E4714-5A3C-4C80-BD81-EAACB2364DF0 30
8 3 0 f88eb774-fdf3-4d1d-a91d-0b4ab95cf36e 10
9 2 0 9c08c860-7a6d-4810-a5c3-f3af2a3fcf66 470
10 3 1 19fdaedfd0dbdaf6a7a6b49619f11a19 3
11 3 1 AAF1CFF7-4564-4C79-B2D8-F0AAF9C9971B 58
12 2 0 4eb1024b-c293-42a4-95a2-31b20c3b524b 24
13 3 1 8E0B0BE3-8553-4F38-9837-6C907E01F84C 7
14 3 1 E8B2849C-F050-4DCD-B311-5D57015466AE 465
15 2 0 ec7fedb6-b118-424a-babe-b8ffad579685 266
16 1 0 7e302dcb-ceaf-406c-a9e5-66933d921064 184
17 2 0 f786528ded200c9f553dd3a5e9e9bb2d 10
18 3 1 1E291633-AF27-4DFB-8DA4-4A5B63F175CF 13
19 2 0 953a525c-97e0-4c2f-90e0-dfebde3ec20d 2408
I'll group it:
y=df.groupby(['app','platform','uuid']).sum().reset_index().sort_values(['app','platform','minutes'],ascending=[True,True,False]).set_index(['app','platform','uuid'])
minutes
app platform uuid
1 0 7e302dcb-ceaf-406c-a9e5-66933d921064 184
a696ccf9-22cb-428b-adee-95c9a97a4581 67
26c1022a-7a8e-42a2-b7cc-bea6bffa7a6f 2
2 0 953a525c-97e0-4c2f-90e0-dfebde3ec20d 2408
9c08c860-7a6d-4810-a5c3-f3af2a3fcf66 470
ec7fedb6-b118-424a-babe-b8ffad579685 266
4eb1024b-c293-42a4-95a2-31b20c3b524b 24
f786528ded200c9f553dd3a5e9e9bb2d 10
34271596-eebb-4423-b890-dc3761ed37ca 8
245501ec2e39cb782bab1fb02d7813b7 1
8e17a2eb-f0ee-49ae-b8c2-c9f9926aa56d 1
1 40AD6CD1-4A7B-48DD-8815-1829C093A95C 13
3 0 f88eb774-fdf3-4d1d-a91d-0b4ab95cf36e 10
1 E8B2849C-F050-4DCD-B311-5D57015466AE 465
AAF1CFF7-4564-4C79-B2D8-F0AAF9C9971B 58
DE6E4714-5A3C-4C80-BD81-EAACB2364DF0 30
C57D0F52-B565-4322-85D2-C2798F7CA6FF 16
1E291633-AF27-4DFB-8DA4-4A5B63F175CF 13
8E0B0BE3-8553-4F38-9837-6C907E01F84C 7
19fdaedfd0dbdaf6a7a6b49619f11a19 3
This gives me the minutes per uuid in descending order.
Now, I will sum the cumulative minutes per app/platform/uuid:
y.groupby(level=[0,1]).cumsum()
app platform uuid
1 0 7e302dcb-ceaf-406c-a9e5-66933d921064 184
a696ccf9-22cb-428b-adee-95c9a97a4581 251
26c1022a-7a8e-42a2-b7cc-bea6bffa7a6f 253
2 0 953a525c-97e0-4c2f-90e0-dfebde3ec20d 2408
9c08c860-7a6d-4810-a5c3-f3af2a3fcf66 2878
ec7fedb6-b118-424a-babe-b8ffad579685 3144
4eb1024b-c293-42a4-95a2-31b20c3b524b 3168
f786528ded200c9f553dd3a5e9e9bb2d 3178
34271596-eebb-4423-b890-dc3761ed37ca 3186
245501ec2e39cb782bab1fb02d7813b7 3187
8e17a2eb-f0ee-49ae-b8c2-c9f9926aa56d 3188
1 40AD6CD1-4A7B-48DD-8815-1829C093A95C 13
3 0 f88eb774-fdf3-4d1d-a91d-0b4ab95cf36e 10
1 E8B2849C-F050-4DCD-B311-5D57015466AE 465
AAF1CFF7-4564-4C79-B2D8-F0AAF9C9971B 523
DE6E4714-5A3C-4C80-BD81-EAACB2364DF0 553
C57D0F52-B565-4322-85D2-C2798F7CA6FF 569
1E291633-AF27-4DFB-8DA4-4A5B63F175CF 582
8E0B0BE3-8553-4F38-9837-6C907E01F84C 589
19fdaedfd0dbdaf6a7a6b49619f11a19 592
My question is: how can I get the percentage against the total cumulative sum, per group, i.e. something like this:
app platform uuid
1 0 7e302dcb-ceaf-406c-a9e5-66933d921064 184 0.26
a696ccf9-22cb-428b-adee-95c9a97a4581 251 0.36
26c1022a-7a8e-42a2-b7cc-bea6bffa7a6f 253 0.36
...
...
...
It's not clear how you came up with 0.26, 0.36 in your desired output, but assuming those are just dummy numbers, to get a running % of total for each group you could do this:
y['cumsum'] = y.groupby(level=[0,1]).cumsum()
y['running_pct'] = y.groupby(level=[0,1])['cumsum'].transform(lambda x: x / x.iloc[-1])
Should give the right output.
In [398]: y['running_pct'].head()
Out[398]:
app platform uuid
1 0 7e302dcb-ceaf-406c-a9e5-66933d921064 0.727273
a696ccf9-22cb-428b-adee-95c9a97a4581 0.992095
26c1022a-7a8e-42a2-b7cc-bea6bffa7a6f 1.000000
2 0 953a525c-97e0-4c2f-90e0-dfebde3ec20d 0.755332
9c08c860-7a6d-4810-a5c3-f3af2a3fcf66 0.902760
Name: running_pct, dtype: float64
EDIT:
Per the comments, if you're looking to wring out a little more performance, this will be faster as of version 0.14.1
y['cumsum'] = y.groupby(level=[0,1])['minutes'].transform('cumsum')
y['running_pct'] = y['cumsum'] / y.groupby(level=[0,1])['minutes'].transform('sum')
And as @Jeff notes, in 0.15.0 this may be faster yet.
y['running_pct'] = y['cumsum'] / y.groupby(level=[0,1])['cumsum'].transform('last')
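For reference, a self-contained sketch of the whole pipeline on a handful of rows. The minutes are taken from the question, but the short uuid strings are placeholders, and sort_values is used because the old DataFrame.sort method is no longer available in current pandas.

```python
import pandas as pd

# A few rows from the question; the uuid values are shortened placeholders.
df = pd.DataFrame({
    "app": [1, 1, 1, 2, 2],
    "platform": [0, 0, 0, 0, 0],
    "uuid": ["a696", "26c1", "7e30", "9c08", "953a"],
    "minutes": [67, 2, 184, 470, 2408],
})

# Total minutes per uuid, sorted in descending order inside each app/platform.
y = (df.groupby(["app", "platform", "uuid"]).sum()
       .reset_index()
       .sort_values(["app", "platform", "minutes"], ascending=[True, True, False])
       .set_index(["app", "platform", "uuid"]))

# Running total per app/platform, divided by that group's overall total.
cumsum = y.groupby(level=[0, 1])["minutes"].cumsum()
total = y.groupby(level=[0, 1])["minutes"].transform("sum")
y["cumsum"] = cumsum
y["running_pct"] = cumsum / total
print(y)
```

The last row of every group ends at 1.0, which is a quick sanity check on the grouping levels.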