Cumulative sum (pandas) - python

Apologies if this has been asked already.
I am trying to create a yearly cumulative sum for all order-points within a certain customer account, and am struggling.
Essentially, I want to create `YearlyTotal` below:
Customer Year Date Order PointsPerOrder YearlyTotal
123456 2016 11/2/16 A939 1 20
123456 2016 3/13/16 A102 19 19
789089 2016 7/15/16 A123 7 7
I've tried:
df['YEARLYTOTAL'] = df.groupby(by=['Customer','Year'])['PointsPerOrder'].cumsum()
But this produces YearlyTotal in the wrong order (i.e., the YearlyTotal of A939 is 1 instead of 20).
Not sure if this matters, but Customer is a string (the database has leading zeroes -- don't get me started). Adding sort_values(by=['Customer','Year','Date'], ascending=True) at the front also produces an error.
Help?
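For reference, a minimal reconstruction of the sample frame (dtypes assumed from the description; Customer is kept as a string):
import pandas as pd
df = pd.DataFrame({
    'Customer': ['123456', '123456', '789089'],  # strings, may carry leading zeroes
    'Year': [2016, 2016, 2016],
    'Date': ['11/2/16', '3/13/16', '7/15/16'],   # still plain strings at this point
    'Order': ['A939', 'A102', 'A123'],
    'PointsPerOrder': [1, 19, 7],
})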

Use [::-1] to reverse the DataFrame:
df['YEARLYTOTAL'] = df[::-1].groupby(by=['Customer','Year'])['PointsPerOrder'].cumsum()
print (df)
Customer Year Date Order PointsPerOrder YearlyTotal YEARLYTOTAL
0 123456 2016 11/2/16 A939 1 20 20
1 123456 2016 3/13/16 A102 19 19 19
2 789089 2016 7/15/16 A123 7 7 7
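This works because cumsum returns a Series aligned on the DataFrame's index, so assigning it back puts each value on its original row. It does assume the rows are already stored newest-first within each Customer/Year group, as in the sample.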

First make sure Date is a datetime column:
In [35]: df.Date = pd.to_datetime(df.Date)
Now we can do:
In [36]: df['YearlyTotal'] = df.sort_values('Date').groupby(['Customer','Year'])['PointsPerOrder'].cumsum()
In [37]: df
Out[37]:
Customer Year Date Order PointsPerOrder YearlyTotal
0 123456 2016 2016-11-02 A939 1 20
1 123456 2016 2016-03-13 A102 19 19
2 789089 2016 2016-07-15 A123 7 7
PS: this solution does NOT depend on the order of the records...
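Note that the to_datetime conversion is what makes the sort correct: sorting the raw date strings would be lexicographic, so for example '11/2/16' would sort before '3/13/16', putting November ahead of March.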

Related

How to calculate Cumulative Average Revenue? Python

I want to create a graph that will display the cumulative average revenue for each 'Year Onboarded' (first customer transaction) over a period of time. But I am making mistakes when grouping the information I need.
Toy Data:
dataset = {'ClientId': [1,2,3,1,2,3,1,2,3,1,2,3,4,4,4,4,4,4,4],
           'Year Onboarded': [2018,2019,2020,2018,2019,2020,2018,2019,2020,2018,2019,2020,2016,2016,2016,2016,2016,2016,2016],
           'Year': [2019,2019,2020,2019,2019,2020,2018,2020,2020,2020,2019,2020,2016,2017,2018,2019,2020,2017,2018],
           'Revenue': [100,50,25,30,40,50,60,100,20,40,100,20,5,5,8,4,10,20,8]}
df = pd.DataFrame(data=dataset)
Explanation: Customers have a designated 'Year Onboarded' and they make a transaction every 'Year' mentioned.
Then I calculate the years that have elapsed since the clients onboarded in order to make my graph visually more appealing.
df['Yearsdiff'] = df['Year']-df['Year Onboarded']
To calculate the Cumulative Average Revenue I tried the following methods:
First try:
df = df.join(df.groupby(['Year']).expanding().agg({'Revenue': 'mean'})
               .reset_index(level=0, drop=True)
               .add_suffix('_roll'))
df.groupby(['Year Onboarded', 'Year']).last().drop(columns=['Revenue'])
The output starts to be cumulative but the last row isn't cumulative anymore (not sure why).
Second Try:
df.groupby(['Year Onboarded','Year']).agg('mean') \
.groupby(level=[1]) \
.agg({'Revenue':np.cumsum})
But it doesn't work properly, I tried other ways as well but didn't achieve good results.
To visualize the cumulative average revenue I simply use sns.lineplot.
My goal is to get a graph similar to the one below, but for that I first need to group my data correctly.
[Expected output plot]
The years shown on the graph represent the 'Year Onboarded', not the 'Year'.
Can someone help me calculate a Cumulative Average Revenue that works in order to plot a graph similar to the one above? Thank you
Also, the data provided in the toy dataset will surely not give something similar to the example plot, but the idea should be there.
This is how I would do it. Since the toy data are not the same as your real data, some changes will probably be needed, but all in all:
import seaborn as sns
df1 = df.copy()
df1['Yearsdiff'] = df1['Year'] - df1['Year Onboarded']
# Find the average revenue per Year Onboarded
df1['Revenue'] = df1.groupby(['Year Onboarded'])['Revenue'].transform('mean')
# Calculate the cumulative sum of Revenue (now the average per Year Onboarded)
# per Yearsdiff, because this will be our X-axis in the plot
df1['Revenue'] = df1.groupby(['Yearsdiff'])['Revenue'].transform('cumsum')
# Finally plot the data, using the column 'Year' as hue to account for the different years
sns.lineplot(x=df1['Yearsdiff'], y=df1['Revenue'], hue=df1['Year'])
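Note the use of transform in both steps: it returns a result aligned on the original index, so the per-group mean and the running sum broadcast back onto every row without any merge, which is what lets the Revenue column be overwritten in place.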
You can create a rolling mean like this:
df['rolling_mean'] = df.groupby(['Year Onboarded'])['Revenue'].apply(lambda x: x.rolling(10, 1).mean())
df
# ClientId Year Onboarded Year Revenue rolling_mean
# 0 1 2018 2019 100 100.000000
# 1 2 2019 2019 50 50.000000
# 2 3 2020 2020 25 25.000000
# 3 1 2018 2019 30 65.000000
# 4 2 2019 2019 40 45.000000
# 5 3 2020 2020 50 37.500000
# 6 1 2018 2018 60 63.333333
# 7 2 2019 2020 100 63.333333
# 8 3 2020 2020 20 31.666667
# 9 1 2018 2020 40 57.500000
# 10 2 2019 2019 100 72.500000
# 11 3 2020 2020 20 28.750000
# 12 4 2016 2016 5 5.000000
# 13 4 2016 2017 5 5.000000
# 14 4 2016 2018 8 6.000000
# 15 4 2016 2019 4 5.500000
# 16 4 2016 2020 10 6.400000
# 17 4 2016 2017 20 8.666667
# 18 4 2016 2018 8 8.571429
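Note that rolling(10, 1) means a window of 10 observations with min_periods=1, so for groups of at most 10 rows it behaves like an expanding (cumulative) mean. If you want that behaviour regardless of group size, a minimal sketch using expanding instead:
df['rolling_mean'] = df.groupby(['Year Onboarded'])['Revenue'].transform(lambda x: x.expanding().mean())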

How to remove leading '0' from my column? Python

I am trying to remove the leading '0' from my data.
My dataframe looks like this
Id Year Month Day
1 2019 01 15
2 2019 03 30
3 2019 10 20
4 2019 11 18
Note: 'Year','Month','Day' columns data types are object
I get the 'Year','Month','Day' columns by extracting it from a date.
I want to remove the '0' at the beginning of each month.
Desired Output:
Id Year Month Day
1 2019 1 15
2 2019 3 30
3 2019 10 20
4 2019 11 18
What I tried to do so far:
df['Month'].str.lstrip('0')
But it did not work.
Any solution? Thank you!
You could use the re package and apply a regex:
import re
# Create sample data
d = pd.DataFrame(data={"Month":["01","02","03","10","11"]})
d["Month" = d["Month"].apply(lambda x: re.sub(r"^0+", "", x))
Result:
0 1
1 2
2 3
3 10
4 11
Name: Month, dtype: object
If you are 100% sure that the Month column will always contain numbers, then you could simply do:
d["Month"] = d["Month"].astype(int)

How to calculate average and most frequent values per group?

I have the following df:
df =
year intensity category
2015 22 1
2015 21 1
2015 23 2
2016 25 2
2017 20 1
2017 21 1
2017 20 3
I need to group by year and calculate an average intensity and a most frequent category(per year).
I know that it's possible to calculate most frequent category as follows:
df.groupby('year')['category'].agg(lambda x: x.value_counts().index[0])
I also know how to calculate average intensity:
df = df.groupby(["year"]).agg({'intensity':'mean'}).reset_index()
But I don't know how to put everything together without join operation.
Use agg with a dictionary to define how to aggregate each column.
df.groupby('year', as_index=False)[['category', 'intensity']]\
.agg({'category': lambda x: pd.Series.mode(x)[0], 'intensity':'mean'})
Output:
year category intensity
0 2015 1 22.000000
1 2016 2 25.000000
2 2017 1 20.333333
Or you can still use a lambda function:
df.groupby('year', as_index=False)[['category','intensity']]\
.agg({'category': lambda x: x.value_counts().index[0],'intensity':'mean'})
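In pandas 0.25+ you can also spell this with named aggregation, which avoids selecting the columns first; a minimal sketch of the same idea:
df.groupby('year', as_index=False).agg(
    category=('category', lambda x: x.mode()[0]),
    intensity=('intensity', 'mean'))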

Python: Pandas dataframe re-arrange rows based on last three digits of Integer in Column

I have the following dataframe:
YearMonth Total Cost
2015009 $11,209,041
2015010 $20,581,043
2015011 $37,079,415
2015012 $36,831,335
2016008 $57,428,630
2016009 $66,754,405
2016010 $45,021,707
2016011 $34,783,970
2016012 $66,215,044
YearMonth is an int64 column. A value in YearMonth such as 2015009 stands for September 2015. I want to re-order the rows so that if the last 3 digits are the same, then I want the rows to appear right on top of each other sorted by year.
Below is my desired output:
YearMonth Total Cost
2015009 $11,209,041
2016009 $66,754,405
2015010 $20,581,043
2016010 $45,021,707
2015011 $37,079,415
2016011 $34,783,970
2015012 $36,831,335
2016012 $66,215,044
2016008 $57,428,630
I have scoured Google to try to find out how to do this, but to no avail.
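# parse values like 2015009: %Y consumes the 4-digit year, the literal '0' matches the padding zero, and %m the 2-digit month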
df['YearMonth'] = pd.to_datetime(df['YearMonth'],format = '%Y0%m')
df['Year'] = df['YearMonth'].dt.year
df['Month'] = df['YearMonth'].dt.month
df.sort_values(['Month','Year'])
YearMonth Total Cost Year Month
8 2016-08-01 $57,428,630 2016 8
0 2015-09-01 $11,209,041 2015 9
1 2016-09-01 $66,754,405 2016 9
2 2015-10-01 $20,581,043 2015 10
3 2016-10-01 $45,021,707 2016 10
4 2015-11-01 $37,079,415 2015 11
5 2016-11-01 $34,783,970 2016 11
6 2015-12-01 $36,831,335 2015 12
7 2016-12-01 $66,215,044 2016 12
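If you want the YearMonth column back in its original integer layout afterwards, a sketch (assuming the same zero-padded format):
df['YearMonth'] = df['YearMonth'].dt.strftime('%Y0%m').astype(int)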
This is one way of doing it. There may be a quicker way with fewer steps that doesn't involve converting YearMonth to datetime, but if you have a date, it makes more sense to use that.
Another way of doing this is to cast your int column to string and use the string accessor with indexing:
df.assign(sortkey=df.YearMonth.astype(str).str[-3:])\
.sort_values('sortkey')\
.drop('sortkey', axis=1)
Output:
YearMonth Total Cost
4 2016008 $57,428,630
0 2015009 $11,209,041
5 2016009 $66,754,405
1 2015010 $20,581,043
6 2016010 $45,021,707
2 2015011 $37,079,415
7 2016011 $34,783,970
3 2015012 $36,831,335
8 2016012 $66,215,044
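An arithmetic variant of the same idea that skips the string conversion, assuming the last three digits always encode the month:
df.assign(sortkey=df.YearMonth % 1000)\
  .sort_values(['sortkey', 'YearMonth'])\
  .drop('sortkey', axis=1)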

Time series: Mean per hour per day per Id number

I am a somewhat beginner programmer learning python (+pandas) and hope I can explain this well enough. I have a large time series pandas dataframe of over 3 million rows and initially 12 columns, spanning a number of years. It covers people taking a ticket from different locations, denoted by Id numbers (350 of them). Each row is one instance (one ticket taken).
I have searched many similar questions, like counting records per hour per day and getting an average per hour over several years. However, I run into trouble when including the 'Id' variable.
I'm looking to get the mean value of people taking a ticket for each hour, for each day of the week (mon-fri) and per station.
I have the following, setting datetime to index:
Id Start_date Count Day_name_no
149 2011-12-31 21:30:00 1 5
150 2011-12-31 20:51:00 1 0
259 2011-12-31 20:48:00 1 1
3015 2011-12-31 19:38:00 1 4
28 2011-12-31 19:37:00 1 4
Using groupby and Start_date.index.hour, I can't seem to include the 'Id'.
My alternative approach is to split the hour out of the date and have the following:
Id Count Day_name_no Trip_hour
149 1 2 5
150 1 4 10
153 1 2 15
1867 1 4 11
2387 1 2 7
I then get the count first with:
Count_Item = TestFreq.groupby([TestFreq['Id'], TestFreq['Day_name_no'], TestFreq['Trip_hour']]).count().reset_index()
Id Day_name_no Trip_hour Count
1 0 7 24
1 0 8 48
1 0 9 31
1 0 10 28
1 0 11 26
1 0 12 25
Then use groupby and mean:
Mean_Count = Count_Item.groupby([Count_Item['Id'], Count_Item['Day_name_no'], Count_Item['Trip_hour']]).mean().reset_index()
However, this does not give the desired result as the mean values are incorrect.
I hope I have explained this issue in a clear way. I am looking for the mean per hour, per day, per Id, as I plan to do clustering to separate my dataset into groups before applying a predictive model to these groups.
Any help would be appreciated, and if possible an explanation of what I am doing wrong, either in my code or in my approach.
Thanks in advance.
I have edited this to try to make it a little clearer. Writing a question with a lack of sleep is probably not advisable.
A toy dataset that I start with:
Date Id Dow Hour Count
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
19/12/2014 1234 0 9 1
19/12/2014 1234 0 9 1
19/12/2014 1234 0 9 1
26/12/2014 1234 0 10 1
27/12/2014 1234 1 11 1
27/12/2014 1234 1 11 1
27/12/2014 1234 1 11 1
27/12/2014 1234 1 11 1
04/01/2015 1234 1 11 1
I now realise I would have to use the date first and get something like:
Date Id Dow Hour Count
12/12/2014 1234 0 9 5
19/12/2014 1234 0 9 3
26/12/2014 1234 0 10 1
27/12/2014 1234 1 11 4
04/01/2015 1234 1 11 1
And then calculate the mean per Id, per Dow, per hour. And want to get this:
Id Dow Hour Mean
1234 0 9 4
1234 0 10 1
1234 1 11 2.5
I hope this makes it a bit clearer. My real dataset spans 3 years, has 3 million rows, and contains 350 Id numbers.
Your question is not very clear, but I hope this helps:
df.reset_index(inplace=True)
# helper columns with date, hour and dow
df['date'] = df['Start_date'].dt.date
df['hour'] = df['Start_date'].dt.hour
df['dow'] = df['Start_date'].dt.dayofweek
# sum of counts for all combinations
df = df.groupby(['Id', 'date', 'dow', 'hour']).sum()
# take the mean over all dates
df = df.reset_index().groupby(['Id', 'dow', 'hour']).mean()
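Checked against the toy data above, this gives the expected result: for Id 1234, dow 0, hour 9 the per-date sums are 5 and 3, so the mean is 4.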
Alternatively, you can use the groupby function on the 'Id' column and then use the resample function with how='sum'.
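Note that resample's how= argument is deprecated in newer pandas; a sketch of the modern spelling, assuming Start_date is the DatetimeIndex:
df.groupby('Id').resample('H')['Count'].sum()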
