I have a Dataset like this
Date Runner Group distance [km]
2021-01-01 Joe 1 7
2021-01-02 Jack 1 6
2021-01-03 Jess 1 9
2021-01-01 Paul 2 11
2021-01-02 Peter 2 12
2021-01-02 Sara 3 15
2021-01-03 Sarah 3 10
and I want to calculate the cumulative sum for each group of runners.
Date Runner Group distance [km] cum sum [km]
2021-01-01 Joe 1 7 7
2021-01-02 Jack 1 6 13
2021-01-03 Jess 1 9 22
2021-01-01 Paul 2 11 11
2021-01-02 Peter 2 12 23
2021-01-02 Sara 3 15 15
2021-01-03 Sarah 3 10 25
Unfortunately, I have no idea how to do this and I didn't find the answer anywhere else. Could someone give me a hint?
import pandas as pd
import numpy as np
df = pd.DataFrame([['2021-01-01', 'Joe', 1, 7],
                   ['2021-01-02', 'Jack', 1, 6],
                   ['2021-01-03', 'Jess', 1, 9],
                   ['2021-01-01', 'Paul', 2, 11],
                   ['2021-01-02', 'Peter', 2, 12],
                   ['2021-01-02', 'Sara', 3, 15],
                   ['2021-01-03', 'Sarah', 3, 10]],
                  columns=['Date', 'Runner', 'Group', 'distance [km]'])
Try groupby cumsum:
>>> df['cum sum [km]'] = df.groupby('Group')['distance [km]'].cumsum()
>>> df
Date Runner Group distance [km] cum sum [km]
0 2021-01-01 Joe 1 7 7
1 2021-01-02 Jack 1 6 13
2 2021-01-03 Jess 1 9 22
3 2021-01-01 Paul 2 11 11
4 2021-01-02 Peter 2 12 23
5 2021-01-02 Sara 3 15 15
6 2021-01-03 Sarah 3 10 25
>>>
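If the rows within each group are not already in chronological order, sorting first keeps the running total correct; a minimal sketch, reusing the same df:

# sort within each group by date first, then take the running total
df = df.sort_values(['Group', 'Date'])
df['cum sum [km]'] = df.groupby('Group')['distance [km]'].cumsum()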
I'm trying to write new code where a dataframe gets filtered/edited to obtain "stints" for each individual. Using the dataframe below as an example, I'm basically trying to get each person's start/end dates for a given location. Usually I can get started on my own, but I'm stumped as to how to approach this, so if anyone has ideas I would greatly appreciate it.
    Person Location       Date
0      Tom        A   1/1/2021
1      Tom        A   1/2/2021
2      Tom        A   1/3/2021
3      Tom        B   1/4/2021
4      Tom        B   1/5/2021
5      Tom        B   1/6/2021
6      Tom        A   1/7/2021
7      Tom        A   1/8/2021
8      Tom        A   1/9/2021
9      Tom        C  1/10/2021
10     Tom        C  1/11/2021
11     Tom        A  1/12/2021
12     Tom        A  1/13/2021
13     Tom        B  1/14/2021
14     Tom        B  1/15/2021
15    Mark        A   1/1/2021
16    Mark        A   1/2/2021
17    Mark        B   1/3/2021
18    Mark        B   1/4/2021
19    Mark        A   1/5/2021
20    Mark        A   1/6/2021
21    Mark        C   1/7/2021
22    Mark        C   1/8/2021
23    Mark        C   1/9/2021
24    Mark        C  1/10/2021
25    Mark        A  1/11/2021
26    Mark        A  1/12/2021
27    Mark        B  1/13/2021
28    Mark        B  1/14/2021
29    Mark        B  1/15/2021
Expected outcome:
   Person Location  StintNum Start_Date   End Date
0     Tom        A         1   1/1/2021   1/3/2021
1     Tom        B         2   1/4/2021   1/6/2021
2     Tom        A         3   1/7/2021   1/9/2021
3     Tom        C         4  1/10/2021  1/11/2021
4     Tom        A         5  1/12/2021  1/13/2021
5     Tom        B         6  1/14/2021  1/15/2021
6    Mark        A         1   1/1/2021   1/2/2021
7    Mark        B         2   1/3/2021   1/4/2021
8    Mark        A         3   1/5/2021   1/6/2021
9    Mark        C         4   1/7/2021  1/10/2021
10   Mark        A         5  1/11/2021  1/12/2021
11   Mark        B         6  1/13/2021  1/15/2021
IMO, a clean way is to use groupby+agg; this makes it easy to set custom aggregators and is faster than apply:
df['Date'] = pd.to_datetime(df['Date'])
group = df['Location'].ne(df['Location'].shift()).cumsum()
df2 = (
    df.groupby(['Person', group], as_index=False)
      .agg(Location=('Location', 'first'),
           # the line below is a dummy aggregation to set a column placeholder;
           # uncomment it if you want the columns in order
           # StintNum=('Location', lambda x: float('NaN')),
           Start_Date=('Date', 'min'),
           End_Date=('Date', 'max'),
      )
)
df2['StintNum'] = df2.groupby('Person').cumcount().add(1)
Output:
Person Location StintNum Start_Date End_Date
0 Mark A 1 2021-01-01 2021-01-02
1 Mark B 2 2021-01-03 2021-01-04
2 Mark A 3 2021-01-05 2021-01-06
3 Mark C 4 2021-01-07 2021-01-10
4 Mark A 5 2021-01-11 2021-01-12
5 Mark B 6 2021-01-13 2021-01-15
6 Tom A 1 2021-01-01 2021-01-03
7 Tom B 2 2021-01-04 2021-01-06
8 Tom A 3 2021-01-07 2021-01-09
9 Tom C 4 2021-01-10 2021-01-11
10 Tom A 5 2021-01-12 2021-01-13
11 Tom B 6 2021-01-14 2021-01-15
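If you prefer not to rely on the commented-out placeholder, the columns can simply be reordered after StintNum is added; a small sketch, assuming df2 from above:

# put StintNum between Location and the date columns
df2 = df2[['Person', 'Location', 'StintNum', 'Start_Date', 'End_Date']]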
Try this:
df['Date'] = pd.to_datetime(df['Date'])
new_df = (
    df.groupby([df['Person'],
                df['Location'].ne(df['Location'].shift(1)).cumsum()],
               sort=False)
      .apply(lambda x: pd.Series([x['Date'].min(), x['Date'].max()],
                                 index=['Start_Date', 'End_Date']))
      .reset_index()
)
new_df['StintNum'] = new_df.groupby('Person').cumcount().add(1)
Output:
>>> new_df
Person Location Start_Date End_Date StintNum
0 Tom 1 2021-01-01 2021-01-03 1
1 Tom 2 2021-01-04 2021-01-06 2
2 Tom 3 2021-01-07 2021-01-09 3
3 Tom 4 2021-01-10 2021-01-11 4
4 Tom 5 2021-01-12 2021-01-13 5
5 Tom 6 2021-01-14 2021-01-15 6
6 Mark 7 2021-01-01 2021-01-02 1
7 Mark 8 2021-01-03 2021-01-04 2
8 Mark 9 2021-01-05 2021-01-06 3
9 Mark 10 2021-01-07 2021-01-10 4
10 Mark 11 2021-01-11 2021-01-12 5
11 Mark 12 2021-01-13 2021-01-15 6
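Note that the Location column here holds the run numbers produced by the cumsum grouper rather than the original labels. If you want the actual locations back, one option is to map each run number to its first label; a small sketch (loc_of_run is just an illustrative name):

# map each run number back to its actual location label
loc_of_run = df.groupby(df['Location'].ne(df['Location'].shift(1)).cumsum())['Location'].first()
new_df['Location'] = new_df['Location'].map(loc_of_run)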
I'd like to convert a DataFrame which has this format:
df = pd.DataFrame({"Date": ['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04'],
"A1": [1, 2, 2, 2],
"A2": [9, 2, 2, 3],
"A3": [1, 3, 2, 9],
"B1": [1, 8, 2, 3],
"B2": [3, 8, 9, 3],
"B3": [2, 4, 5, 5]})
      Date  A1  A2  A3  B1  B2  B3
2021-01-01   1   9   1   1   3   2
2021-01-02   2   2   3   8   8   4
2021-01-03   2   2   2   2   9   5
2021-01-04   2   3   9   3   3   5
What I want is to create a table that just has the letters in the rows.
My idea is the following:
Add 2 dummy rows after every row with a date
Copy the values from (X2) and (X3) into those dummy rows for the same date
Delete the columns (X2) and (X3)
Transpose the whole table
The target format looks like this:
Date  2021-01-01 (1)  2021-01-01 (2)  2021-01-02 (3)  2021-01-02 (4)  2021-01-02 (5)  2021-01-02 (6)  2021-01-03 (7)  2021-01-03 (8)  2021-01-03 (9)
A                  1               9               1               2               3               8               2               2               2
B                  1               3               2               8               8               4               2               9               5
I couldn't get it to work; I'll try to post the code later on.
Is there any cleaner, faster way to do so?
Thank you for any help!
Use melt to get the long format, then construct the corresponding date label for each category:
df = pd.melt(df, id_vars='Date') # in each row: 2021-01-01 | A1 | 1
df['idx'] = df['variable'].str[:-1] # A, B, ...
df['Date'] = df['Date'].astype(str) + ' (' + df['variable'].str[-1] + ')'
df = df[['Date', 'idx', 'value']].pivot(values='value', index='idx', columns='Date')
Set df.index.name = None if you don't want the index name shown:
Date 2021-01-01 (1) 2021-01-01 (2) 2021-01-01 (3) 2021-01-02 (1) 2021-01-02 (2) 2021-01-02 (3) 2021-01-03 (1) 2021-01-03 (2) 2021-01-03 (3) 2021-01-04 (1) 2021-01-04 (2) 2021-01-04 (3)
A 1 9 1 2 2 3 2 2 2 2 3 9
B 1 3 2 8 8 4 2 9 5 3 3 5
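Note that df['variable'].str[:-1] and str[-1] assume a single trailing digit in the column names. If the numeric suffix could be longer (a hypothetical column such as "A12"), a regex extraction is safer; a small sketch of that variant, replacing the two middle assignments above:

# hypothetical variant: split "A12" into the letter part and the numeric suffix
parts = df['variable'].str.extract(r'([A-Za-z]+)(\d+)')
df['idx'] = parts[0]
df['Date'] = df['Date'].astype(str) + ' (' + parts[1] + ')'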
I have a DataFrame and I want to calculate the mean and the variance for each row for each person. Moreover, there is a column Date, and the chronological order must be respected when calculating the mean and the variance; the dataframe is already sorted by date. The dates are just the number of days after the earliest date. The mean for a person's earliest date is simply the value in the column Points, and the variance should be NaN or 0. Then, for the second date, the mean should be the mean of the values in the column Points for this date and the previous one. Here is my code to generate the dataframe:
import pandas as pd
import numpy as np
data=[["Al",0, 12],["Bob",2, 10],["Carl",5, 12],["Al",5, 5],["Bob",9, 2]
,["Al",22, 4],["Bob",22, 16],["Carl",33, 2],["Al",45, 7],["Bob",68, 4]
,["Al",72, 11],["Bob",79, 5]]
df= pd.DataFrame(data, columns=["Name", "Date", "Points"])
print(df)
Name Date Points
0 Al 0 12
1 Bob 2 10
2 Carl 5 12
3 Al 5 5
4 Bob 9 2
5 Al 22 4
6 Bob 22 16
7 Carl 33 2
8 Al 45 7
9 Bob 68 4
10 Al 72 11
11 Bob 79 5
Here is my code to obtain the mean and the variance:
df['Mean'] = df.apply(
    lambda x: df[(df.Name == x.Name) & (df.Date < x.Date)].Points.mean(),
    axis=1)
df['Variance'] = df.apply(
    lambda x: df[(df.Name == x.Name) & (df.Date < x.Date)].Points.var(),
    axis=1)
However, the mean is shifted by one row and the variance by two rows. The dataframe obtained when sorted by Name and Date is:
Name Date Points Mean Variance
0 Al 0 12 NaN NaN
3 Al 5 5 12.000000 NaN
5 Al 22 4 8.50000 24.500000
8 Al 45 7 7.000000 19.000000
10 Al 72 11 7.000000 12.666667
1 Bob 2 10 NaN NaN
4 Bob 9 2 10.000000 NaN
6 Bob 22 16 6.000000 32.000000
9 Bob 68 4 9.333333 49.333333
11 Bob 79 5 8.000000 40.000000
2 Carl 5 12 NaN NaN
7 Carl 33 2 12.000000 NaN
Instead, the dataframe should be as below:
Name Date Points Mean Variance
0 Al 0 12 12 NaN
3 Al 5 5 8.5 24.5
5 Al 22 4 7 19
8 Al 45 7 7 12.67
10 Al 72 11 7.8 ...
1 Bob 2 10 10 NaN
4 Bob 9 2 6 32
6 Bob 22 16 9.33 49.33
9 Bob 68 4 8 40
11 Bob 79 5 7.4 ...
2 Carl 5 12 12 NaN
7 Carl 33 2 7 50
What should I change?
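For what it's worth, the strict inequality df.Date < x.Date excludes the current row, which is what shifts the results; changing it to <= in both lambdas would give the expected output. A more direct route is an expanding window per person; a minimal sketch, assuming df is already sorted by Date as stated:

# expanding statistics per person, including the current row
df['Mean'] = df.groupby('Name')['Points'].transform(lambda s: s.expanding().mean())
df['Variance'] = df.groupby('Name')['Points'].transform(lambda s: s.expanding().var())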
I have a pandas MultiIndex with two levels, a date and a gender column. It looks like this:
Division North South West East
Date Gender
2016-05-16 19:00:00 F 0 2 3 3
M 12 15 12 12
2016-05-16 20:00:00 F 12 9 11 11
M 10 13 8 9
2016-05-16 21:00:00 F 9 4 7 1
M 5 1 12 10
Now if I want to find the average values for each hour, I know I can do something like:
df.groupby(df.index.hour).mean()
but this does not seem to work when you have a MultiIndex. I found that I could reach the Date index like:
df.groupby(df.index.get_level_values('Date').hour).mean()
which sort of averages over the 24 hours in a day, but I lose track of the Gender index...
So my question is: how can I find the average hourly values for each Division by Gender?
I think you can add a level of the MultiIndex; this needs pandas 0.20.1+:
df1 = df.groupby([df.index.get_level_values('Date').hour,'Gender']).mean()
print (df1)
North South West East
Date Gender
19 F 0 2 3 3
M 12 15 12 12
20 F 12 9 11 11
M 10 13 8 9
21 F 9 4 7 1
M 5 1 12 10
Another solution:
df1 = df.groupby([df.index.get_level_values('Date').hour,
                  df.index.get_level_values('Gender')]).mean()
print (df1)
North South West East
Date Gender
19 F 0 2 3 3
M 12 15 12 12
20 F 12 9 11 11
M 10 13 8 9
21 F 9 4 7 1
M 5 1 12 10
Or simply create columns from the MultiIndex:
df = df.reset_index()
df1 = df.groupby([df['Date'].dt.hour, 'Gender']).mean()
print (df1)
North South West East
Date Gender
19 F 0 2 3 3
M 12 15 12 12
20 F 12 9 11 11
M 10 13 8 9
21 F 9 4 7 1
M 5 1 12 10
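If it is easier to compare genders side by side, the grouped result can be reshaped so Gender becomes a column level; a small sketch using df1 from above:

# one row per hour, genders spread across the columns
print(df1.unstack('Gender'))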
I have some pricing data that looks like this:
import pandas as pd
df = pd.DataFrame([['A', '1', '2015-02-01', 20.00, 20.00, 5],
                   ['A', '1', '2015-02-06', 16.00, 20.00, 8],
                   ['A', '1', '2015-02-14', 14.00, 20.00, 34],
                   ['A', '1', '2015-03-20', 20.00, 20.00, 5],
                   ['A', '1', '2015-03-25', 15.00, 20.00, 15],
                   ['A', '2', '2015-02-01', 75.99, 100.00, 22],
                   ['A', '2', '2015-02-23', 100.00, 100.00, 30],
                   ['A', '2', '2015-03-25', 65.00, 100.00, 64],
                   ['B', '3', '2015-04-01', 45.00, 45.00, 15],
                   ['B', '3', '2015-04-16', 40.00, 45.00, 2],
                   ['B', '3', '2015-04-18', 45.00, 45.00, 30],
                   ['B', '4', '2015-07-25', 5.00, 10.00, 55]],
                  columns=['dept', 'sku', 'date', 'price', 'orig_price', 'days_at_price'])
print(df)
dept sku date price orig_price days_at_price
0 A 1 2015-02-01 20.00 20.00 5
1 A 1 2015-02-06 16.00 20.00 8
2 A 1 2015-02-14 14.00 20.00 34
3 A 1 2015-03-20 20.00 20.00 5
4 A 1 2015-03-25 15.00 20.00 15
5 A 2 2015-02-01 75.99 100.00 22
6 A 2 2015-02-23 100.00 100.00 30
7 A 2 2015-03-25 65.00 100.00 64
8 B 3 2015-04-01 45.00 45.00 15
9 B 3 2015-04-16 40.00 45.00 2
10 B 3 2015-04-18 45.00 45.00 30
11 B 4 2015-07-25 5.00 10.00 55
I want to describe the pricing cycles, which can be defined as the period when a sku goes from original price to promotional price (or multiple promotional prices) and returns to original. A cycle must start with the original price. It is okay to include cycles which never change in price, as well as those that are reduced and never return. But an initial price that is less than orig_price would not be counted as a cycle. For the above df, the result I am looking for is:
dept sku cycle orig_price_days promo_days
0 A 1 1 5 42
1 A 1 2 5 15
2 A 2 1 30 64
3 B 3 1 15 2
4 B 3 2 30 0
I played around with groupby and sum, but can't quite figure out how to define a cycle and total the rows accordingly. Any help would be greatly appreciated.
I got very close to producing the desired end result...
import numpy as np

# add a column to track whether price is above/below/equal to orig
df.loc[:, 'reg'] = np.sign(df.price - df.orig_price)

# remove row where first known price for sku is promo
df_gp = df.groupby(['dept', 'sku'])
df = df[~((df_gp.cumcount() == 0) & (df.reg == -1))]

# enumerate all the individual pricing cycles
df.loc[:, 'cycle'] = (df.reg == 0).cumsum()

# group/aggregate to get days at orig vs. promo pricing
cycles = df.groupby(['dept', 'sku', 'cycle'])['days_at_price'].agg(
    reg_days=lambda x: x[:1].sum(),
    promo_days=lambda x: x[1:].sum())
print(cycles.reset_index())
dept sku cycle reg_days promo_days
0 A 1 1 5 42
1 A 1 2 5 15
2 A 2 3 30 64
3 B 3 4 15 2
4 B 3 5 30 0
The only part that I couldn't quite crack was how to restart the cycle number for each sku before the groupby.
Try using loc instead of groupby - you want chunks of skus over time periods, not aggregated groups. A for-loop, used in moderation, can also help here and won't be particularly un-pandas like. (At least if, like me, you consider looping over unique array slices to be fine.)
from datetime import timedelta

df['date'] = pd.to_datetime(df['date'])   # make sure the dates are real datetimes
df['cycle'] = -1                          # create a column for the cycle
skus = df.sku.unique()                    # get unique skus for iteration

for sku in skus:
    # Get the start date for each cycle for this sku.
    # NOTE that we define cycles as beginning when the price equals the
    # original price. This avoids the mentioned issue that a cycle should
    # not start if the initial price is less than the original.
    cycle_start_dates = df.loc[(df.sku == sku) &
                               (df.price == df.orig_price),
                               'date'].tolist()
    # append a terminal date
    cycle_start_dates.append(df.date.max() + timedelta(1))
    # Assign the cycle values
    for i in range(len(cycle_start_dates) - 1):
        df.loc[(df.sku == sku) &
               (cycle_start_dates[i] <= df.date) &
               (df.date < cycle_start_dates[i + 1]), 'cycle'] = i + 1
This should give you a column with all of the cycles for each sku:
dept sku date price orig_price days_at_price cycle
0 A 1 2015-02-01 20.00 20.0 5 1
1 A 1 2015-02-06 16.00 20.0 8 1
2 A 1 2015-02-14 14.00 20.0 34 1
3 A 1 2015-03-20 20.00 20.0 5 2
4 A 1 2015-03-25 15.00 20.0 15 2
5 A 2 2015-02-01 75.99 100.0 22 1
6 A 2 2015-02-23 100.00 100.0 30 1
7 A 2 2015-03-25 65.00 100.0 64 2
8 B 3 2015-04-01 45.00 45.0 15 2
9 B 3 2015-04-16 40.00 45.0 2 2
10 B 3 2015-04-18 45.00 45.0 30 2
11 B 4 2015-07-25 5.00 10.0 55 2
Once you have the cycle column, aggregation becomes relatively straightforward. This multiple aggregation:
(df.groupby(['dept', 'sku', 'cycle'])['days_at_price']
   .agg(promo_days=lambda x: x[1:].sum(),
        orig_price_days=lambda x: x[:1].sum())
   .reset_index())
will give you the desired result:
dept sku cycle promo_days orig_price_days
0 A 1 1 42 5
1 A 1 2 15 5
2 A 2 -1 0 22
3 A 2 1 64 30
4 B 3 1 2 15
5 B 3 2 0 30
6 B 4 -1 0 55
Note that this has additional cycle values of -1 for the pre-cycle rows, where pricing started below the original price.
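If those rows are not wanted in the summary (as in the expected output), they can be dropped before aggregating; a small sketch reusing the cycle column from above (summary is just an illustrative name):

# keep only rows that belong to a real cycle (cycle >= 1)
summary = (df[df['cycle'] > 0]
           .groupby(['dept', 'sku', 'cycle'])['days_at_price']
           .agg(orig_price_days=lambda x: x[:1].sum(),
                promo_days=lambda x: x[1:].sum())
           .reset_index())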