I have a dataset like this:
Customer ID    Date         Profit
1              4/13/2018    10.00
1              4/26/2018    13.27
1              10/23/2018   15.00
2              1/1/2017     7.39
2              7/5/2017     9.99
2              7/7/2017     10.01
3              5/4/2019     30.30
I'd like to group by customer and sum Profit for every 6 months, starting at each user's first transaction.
The output ideally should look like this:
Customer ID    Date         Profit
1              4/13/2018    23.27
1              10/13/2018   15.00
2              1/1/2017     7.39
2              7/1/2017     20.00
3              5/4/2019     30.30
The closest I've been able to get on this problem is by using:
df.groupby(['Customer ID',pd.Grouper(key='Date', freq='6M', closed='left')])['Profit'].sum().reset_index()
But that doesn't seem to start summing on a user's first transaction day.
If changing the dates is not possible (e.g. customer 2's second period showing 7/1/2017 rather than 7/5/2017), then at least summing the profit so that it's based on each user's own 6-month purchase journey would be extremely helpful. Thank you!
I can get you bins starting on the first of the month until you find a better solution.
df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%Y")
df = (
df
.set_index("Date")
.groupby(["Customer ID"])
.Profit
.resample("6MS")
.sum()
.reset_index(name="Profit")
)
print(df)
Customer ID Date Profit
0 1 2018-04-01 23.27
1 1 2018-10-01 15.00
2 2 2017-01-01 7.39
3 2 2017-07-01 20.00
4 3 2019-05-01 30.30
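If the bins really need to start on each user's first transaction day rather than on the first of the month, a day-based frequency with origin="start" may get closer. This is a minimal sketch, assuming a recent pandas (>= 1.1, where resample's origin argument exists and is honored inside a grouped resample) and treating 6 months as roughly 180 days:

import pandas as pd

df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%Y")

# ~6-month (180-day) bins, each customer's first transaction as the anchor
out = (
    df.set_index("Date")
      .groupby("Customer ID")["Profit"]
      .resample("180D", origin="start")
      .sum()
      .reset_index()
)
print(out)

Each row is labeled with its bin start, i.e. the customer's first transaction date and every 180 days after it. Since 180 days only approximates six calendar months, customer 1's second bin starts on 2018-10-10 rather than 10/13/2018.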
I'm still learning Python and would like to ask for your help with the following problem:
I have a CSV file with daily data and I'm looking for a solution to sum it per calendar week. The mock-up data below has rows stretched over 2 weeks (week 14, the current week, and week 13, the past week). I need a way to group the rows per calendar week, recognize which year they belong to, and calculate the weekly sum and weekly average. In the example input there are only two different IDs; in the actual data file I expect many more.
input.csv
id date activeMembers
1 2020-03-30 10
2 2020-03-30 1
1 2020-03-29 5
2 2020-03-29 6
1 2020-03-28 0
2 2020-03-28 15
1 2020-03-27 32
2 2020-03-27 10
1 2020-03-26 9
2 2020-03-26 3
1 2020-03-25 0
2 2020-03-25 0
1 2020-03-24 0
2 2020-03-24 65
1 2020-03-23 22
2 2020-03-23 12
...
desired output.csv
id week WeeklyActiveMembersSum WeeklyAverageActiveMembers
1 202014 10 1.4
2 202014 1 0.1
1 202013 68 9.7
2 202013 111 15.9
My goal is to:
import pandas as pd
df = pd.read_csv('path/to/my/input.csv')
Here I'd need to group by the 'id' and 'date' columns (per calendar week; not sure if this is possible), create a 'week' column with the week number, then sum the 'activeMembers' values for that week and save them as a 'WeeklyActiveMembersSum' column in my output file, and finally calculate 'WeeklyAverageActiveMembers' for that week. I was experimenting with groupby and isin but no luck so far... would I have to go with something similar to this:
df.groupby('id', as_index=False).agg({'date': 'max',
                                      'activeMembers': 'sum'})
and finally save all as output.csv:
df.to_csv('path/to/my/output.csv', index=False)
Thanks in advance!
It seems I'm getting a different week setting than you do:
# convert the date column to datetime type
df['date'] = pd.to_datetime(df['date'])

# group by id and a "YYYYWW" key; %W counts weeks starting on Monday,
# with days before the first Monday of the year falling in week 00
(df.groupby(['id', df.date.dt.strftime('%Y%W')], sort=False)
   .activeMembers.agg([('Sum', 'sum'), ('Average', 'mean')])
   .add_prefix('activeMembers')
   .reset_index()
)
Output:
id date activeMembersSum activeMembersAverage
0 1 202013 10 10.000000
1 2 202013 1 1.000000
2 1 202012 68 9.714286
3 2 202012 111 15.857143
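The difference is that strftime's %W numbering is not the ISO week the question uses (202013/202014). If the ISO numbers are what's wanted, a sketch like the following might work; it assumes pandas >= 1.1, where Series.dt.isocalendar() is available, and reuses the asker's column names:

import pandas as pd

df = pd.read_csv('path/to/my/input.csv')
df['date'] = pd.to_datetime(df['date'])

# build a "YYYYWW" key from the ISO calendar year and week number
iso = df['date'].dt.isocalendar()
week = (iso['year'].astype(str) + iso['week'].astype(str).str.zfill(2)).rename('week')

out = (df.groupby(['id', week], sort=False)['activeMembers']
         .agg(WeeklyActiveMembersSum='sum', WeeklyAverageActiveMembers='mean')
         .reset_index())
out.to_csv('path/to/my/output.csv', index=False)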
I need to find the total monthly cumulative number of orders. I have 2 columns, orderDate and OrderId. I can't use a list to find the cumulative numbers since the data is so large, and the result should be in year_month format along with the cumulative order total for each month.
orderDate OrderId
2011-11-18 06:41:16 23
2011-11-18 04:41:16 2
2011-12-18 06:41:16 69
2012-03-12 07:32:15 235
2012-03-12 08:32:15 234
2012-03-12 09:32:15 235
2012-05-12 07:32:15 233
desired Result
Date CumulativeOrder
2011-11 2
2011-12 3
2012-03 6
2012-05 7
I have imported my Excel file into PyCharm and used pandas to read it.
I have tried to split the datetime column into year and month and then group, but I'm not getting the correct result.
df1 = df1[['OrderId','orderDate']]
df1['year'] = pd.DatetimeIndex(df1['orderDate']).year
df1['month'] = pd.DatetimeIndex(df1['orderDate']).month
df1.groupby(['year','month']).sum().groupby('year','month').cumsum()
print (df1)
Convert the column to datetimes, then to monthly periods with to_period, add a counter column with numpy.arange, and finally remove duplicates, keeping the last row per Date, with DataFrame.drop_duplicates:
import numpy as np
import pandas as pd

df1['orderDate'] = pd.to_datetime(df1['orderDate'])
df1['Date'] = df1['orderDate'].dt.to_period('m')
# use if the datetimes are not sorted
#df1 = df1.sort_values('Date')
# running counter over all orders
df1['CumulativeOrder'] = np.arange(1, len(df1) + 1)
print (df1)
orderDate OrderId Date CumulativeOrder
0 2011-11-18 06:41:16 23 2011-11 1
1 2011-11-18 04:41:16 2 2011-11 2
2 2011-12-18 06:41:16 69 2011-12 3
3 2012-03-12 07:32:15 235 2012-03 4
df2 = df1.drop_duplicates('Date', keep='last')[['Date','CumulativeOrder']]
print (df2)
Date CumulativeOrder
1 2011-11 2
2 2011-12 3
3 2012-03 4
Another solution:
# count orders per month, then take a running total across the months
df2 = (df1.groupby(df1['orderDate'].dt.to_period('m')).size()
          .cumsum()
          .rename_axis('Date')
          .reset_index(name='CumulativeOrder'))
print (df2)
Date CumulativeOrder
0 2011-11 2
1 2011-12 3
2 2012-03 6
3 2012-05 7
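The Date column in this result is a monthly Period. If a plain year_month string is preferred, as the question mentions, it can be cast before use; a small optional step (the CSV path below is just an illustration, not from the question):

# cast the monthly Period to a "YYYY-MM" string, e.g. before saving
df2['Date'] = df2['Date'].astype(str)
df2.to_csv('cumulative_orders.csv', index=False)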
I have the following dataframe:
YearMonth Total Cost
2015009 $11,209,041
2015010 $20,581,043
2015011 $37,079,415
2015012 $36,831,335
2016008 $57,428,630
2016009 $66,754,405
2016010 $45,021,707
2016011 $34,783,970
2016012 $66,215,044
YearMonth is an int64 column. A value such as 2015009 stands for September 2015. I want to re-order the rows so that rows whose last 3 digits (the month) are the same appear right on top of each other, sorted by year.
Below is my desired output:
YearMonth Total Cost
2015009 $11,209,041
2016009 $66,754,405
2015010 $20,581,043
2016010 $45,021,707
2015011 $37,079,415
2016011 $34,783,970
2015012 $36,831,335
2016012 $66,215,044
2016008 $57,428,630
I have scoured Google to try to find out how to do this, but to no avail.
# parse the 7-digit integer as a date: %Y, a literal 0, then %m
df['YearMonth'] = pd.to_datetime(df['YearMonth'], format = '%Y0%m')
df['Year'] = df['YearMonth'].dt.year
df['Month'] = df['YearMonth'].dt.month
# sort by month first, then by year within each month
df.sort_values(['Month','Year'])
YearMonth Total Year Month
8 2016-08-01 $57,428,630 2016 8
0 2015-09-01 $11,209,041 2015 9
1 2016-09-01 $66,754,405 2016 9
2 2015-10-01 $20,581,043 2015 10
3 2016-10-01 $45,021,707 2016 10
4 2015-11-01 $37,079,415 2015 11
5 2016-11-01 $34,783,970 2016 11
6 2015-12-01 $36,831,335 2015 12
7 2016-12-01 $66,215,044 2016 12
One way of doing it. There may be a quicker way with fewer steps that doesn't involve converting YearMonth to datetime, but if you have a date, it makes more sense to use that type.
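If the helper columns shouldn't stay in the final frame and the original 7-digit integers are wanted back, a clean-up step along these lines could follow (an optional addition, not part of the answer above):

# sort, restore the original integer YearMonth, and drop the helper columns
out = (
    df.sort_values(['Month', 'Year'])
      .assign(YearMonth=lambda d: d['YearMonth'].dt.strftime('%Y0%m').astype(int))
      .drop(columns=['Year', 'Month'])
)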
One way of doing this is to cast your int column to string and use the string accessor with indexing.
df.assign(sortkey=df.YearMonth.astype(str).str[-3:])\
.sort_values('sortkey')\
.drop('sortkey', axis=1)
Output:
YearMonth Total Cost
4 2016008 $57,428,630
0 2015009 $11,209,041
5 2016009 $66,754,405
1 2015010 $20,581,043
6 2016010 $45,021,707
2 2015011 $37,079,415
7 2016011 $34,783,970
3 2015012 $36,831,335
8 2016012 $66,215,044
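One caveat on the approach above, mentioned as a precaution rather than something visible in this output: sort_values is not stable by default, so rows that tie on the three-digit month key are not guaranteed to keep their year order. Adding YearMonth itself as a secondary key makes that order explicit:

df.assign(sortkey=df.YearMonth.astype(str).str[-3:])\
  .sort_values(['sortkey', 'YearMonth'])\
  .drop('sortkey', axis=1)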
I have the following pandas DataFrame dt:
auftragskennung sku artikel_bezeichnung summen_netto system_created
0 14 200182 Product 1 -16.64 2015-05-12 19:55:16
1 14 730293 Product 2 -4.16 2015-05-12 19:55:16
2 3 720933 Product 3 0.00 2014-03-25 12:12:44
3 3 192042 Product 4 19.95 2014-03-25 12:12:45
4 3 423902 Product 5 23.88 2014-03-25 12:12:45
I then execute this command to get the best-selling products, grouped by sku:
topseller = dt.groupby("sku").agg({"summen_netto": "sum"}).sort_values("summen_netto", ascending=False)
Which returns something like:
summen_netto
sku
730293 55622.24
720933 35603.99
192042 27698.99
423902 26726.28
734630 25730.21
740353 22798.14
This is what I want, but how can I now access the sku column? topseller["sku"] does not work. It always gives me a KeyError.
I would like to be able to do this:
topseller["sku"]["730293"]
Which would then return 55622.24
The sku is now the index, so you need to use loc to perform label-based selection:
In [7]:
topseller.loc[730293]
Out[7]:
summen_netto -4.16
Name: 730293, dtype: float64
You can confirm this here:
In [8]:
topseller.index
Out[8]:
Int64Index([423902, 192042, 720933, 730293, 200182], dtype='int64', name='sku')
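To get the single number the question asked for, or to bring sku back as an ordinary column, either of the following should work (a short sketch on the same topseller frame; note the labels are integers, so 730293 rather than "730293"):

# scalar value for a single sku (row label 730293, column summen_netto)
topseller.loc[730293, "summen_netto"]

# or turn the index back into a regular column and filter on it
topseller_reset = topseller.reset_index()
topseller_reset.loc[topseller_reset["sku"] == 730293, "summen_netto"]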