I have a dataset like this:
Customer ID  Date        Profit
1            4/13/2018   10.00
1            4/26/2018   13.27
1            10/23/2018  15.00
2            1/1/2017    7.39
2            7/5/2017    9.99
2            7/7/2017    10.01
3            5/4/2019    30.30
I'd like to group by and sum profit for every 6 months, starting at each user's first transaction.
The output ideally should look like this:
Customer ID  Date        Profit
1            4/13/2018   23.27
1            10/13/2018  15.00
2            1/1/2017    7.39
2            7/1/2017    20.00
3            5/4/2019    30.30
The closest I seem to have gotten on this problem is by using:
df.groupby(['Customer ID',pd.Grouper(key='Date', freq='6M', closed='left')])['Profit'].sum().reset_index()
But that doesn't seem to sum starting on a user's first transaction day.
If changing the dates is not possible (e.g. customer 2's date being 7/1/2017 and not 7/5/2017), then at least summing the profit so that it's based on each user's own 6-month purchase journey would be extremely helpful. Thank you!
I can get you periods starting on the first of the month until you find a more exact solution.
df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%Y")
df = (
df
.set_index("Date")
.groupby(["Customer ID"])
.Profit
.resample("6MS")
.sum()
.reset_index(name="Profit")
)
print(df)
Customer ID Date Profit
0 1 2018-04-01 23.27
1 1 2018-10-01 15.00
2 2 2017-01-01 7.39
3 2 2017-07-01 20.00
4 3 2019-05-01 30.30
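If the 6-month windows really need to start at each customer's first transaction rather than at a month start, here is a minimal sketch of one way to do it. It approximates a 6-month window as a 183-day block counted from each customer's first purchase, so the period-start dates it produces are only approximate (e.g. customer 2's second period starts on 7/3/2017 rather than 7/1/2017):

import pandas as pd

df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%Y")

# first transaction date per customer, broadcast back onto every row
first = df.groupby("Customer ID")["Date"].transform("min")

# which 183-day block (roughly 6 months) each transaction falls into,
# counted from that customer's first purchase
block = (df["Date"] - first).dt.days // 183

out = (
    df.assign(PeriodStart=first + pd.to_timedelta(block * 183, unit="D"))
      .groupby(["Customer ID", "PeriodStart"], as_index=False)["Profit"]
      .sum()
)
print(out)

On the sample data this reproduces the desired per-customer grouping (23.27 and 15.00 for customer 1, 7.39 and 20.00 for customer 2, 30.30 for customer 3).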
I'm still learning python and would like to ask your help with the following problem:
I have a CSV file with daily data and I'm looking for a solution to sum it per calendar week. So for the mockup data below I have rows stretching over 2 weeks (week 14, the current week, and week 13, the past week). Now I need to find a way to group rows per calendar week, recognize what year they belong to, and calculate the week sum and week average. In the example input file there are only two different IDs; however, in the actual data file I expect many more.
input.csv
id date activeMembers
1 2020-03-30 10
2 2020-03-30 1
1 2020-03-29 5
2 2020-03-29 6
1 2020-03-28 0
2 2020-03-28 15
1 2020-03-27 32
2 2020-03-27 10
1 2020-03-26 9
2 2020-03-26 3
1 2020-03-25 0
2 2020-03-25 0
1 2020-03-24 0
2 2020-03-24 65
1 2020-03-23 22
2 2020-03-23 12
...
desired output.csv
id week WeeklyActiveMembersSum WeeklyAverageActiveMembers
1 202014 10 1.4
2 202014 1 0.1
1 202013 68 9.7
2 202013 111 15.9
My goal is to:
import pandas as pd
df = pd.read_csv('path/to/my/input.csv')
Here I'd need to group by the 'id' and 'date' columns (per calendar week - not sure if this is possible) and create a 'week' column with the week number, then sum the 'activeMembers' values for that week and save them as a 'WeeklyActiveMembersSum' column in my output file, and finally calculate 'WeeklyAverageActiveMembers' for that week. I was experimenting with groupby and isin parameters, but no luck so far... would I have to go with something similar to this:
df.groupby('id', as_index=False).agg({'date': 'max',
                                      'activeMembers': 'sum'})
and finally save all as output.csv:
df.to_csv('path/to/my/output.csv', index=False)
Thanks in advance!
It seems I'm getting a different week setting than you do:
# convert the 'date' column to datetime type
df['date'] = pd.to_datetime(df['date'])
(df.groupby(['id', df.date.dt.strftime('%Y%W')], sort=False)
   .activeMembers.agg([('Sum', 'sum'), ('Average', 'mean')])
   .add_prefix('activeMembers')
   .reset_index()
)
Output:
id date activeMembersSum activeMembersAverage
0 1 202013 10 10.000000
1 2 202013 1 1.000000
2 1 202012 68 9.714286
3 2 202012 111 15.857143
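To get from there to the exact column names in the desired output and an output.csv, one possible follow-up, building on the same grouping (a sketch only; the path is the placeholder from the question, and the isocalendar note assumes a reasonably recent pandas):

# '%Y%W' weeks differ from ISO weeks; df.date.dt.isocalendar().week is an
# alternative if ISO numbering is needed
out = (df.groupby(['id', df.date.dt.strftime('%Y%W').rename('week')], sort=False)
         .activeMembers.agg([('WeeklyActiveMembersSum', 'sum'),
                             ('WeeklyAverageActiveMembers', 'mean')])
         .reset_index())
out.to_csv('path/to/my/output.csv', index=False)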
My data includes invoices and customers.
One customer can have multiple invoices. One invoice always belongs to exactly one customer. The invoices are updated daily (Report Date).
My goal is to calculate the age of the customer in days (see column "Age in Days"). In order to achieve this, I take the first occurrence of a customer's report date and calculate the difference to the last occurrence of the report date.
e.g. Customer 1 occurs from 08-14 till 08-15. Therefore he/she is 1 day old.
Report Date Invoice No Customer No Amount Age in Days
2018-08-14 A 1 50$ 1
2018-08-14 B 1 100$ 1
2018-08-14 C 2 75$ 2
2018-08-15 A 1 20$ 1
2018-08-15 B 1 45$ 1
2018-08-15 C 2 70$ 2
2018-08-16 C 2 40$ 2
2018-08-16 D 3 100$ 0
2018-08-16 E 3 60$ 0
I solved this, but very inefficiently, and it takes too long. My data contains 26 million rows. Below I calculated the age for one customer only.
# List every customer no
customerNo = df["Customer No"].unique()
customer_age = []
# Testing for one specific customer
testCustomer = df.loc[df["Customer No"] == customerNo[0]]
testCustomer = testCustomer.sort_values(by="Report Date", ascending=True)
first_occur = testCustomer.iloc[0]['Report Date']
last_occur = testCustomer.iloc[-1]['Report Date']
age = (last_occur - first_occur).days
customer_age.extend([age] * len(testCustomer))
testCustomer.loc[:, 'Customer Age'] = customer_age
Is there a better way to solve this problem?
Use groupby.transform with first and last aggregations:
grps = df.groupby('Customer No')['Report Date']
df['Age in Days'] = (grps.transform('last') - grps.transform('first')).dt.days
[out]
Report Date Invoice No Customer No Amount Age in Days
0 2018-08-14 A 1 50$ 1
1 2018-08-14 B 1 100$ 1
2 2018-08-14 C 2 75$ 2
3 2018-08-15 A 1 20$ 1
4 2018-08-15 B 1 45$ 1
5 2018-08-15 C 2 70$ 2
6 2018-08-16 C 2 40$ 2
7 2018-08-16 D 3 100$ 0
8 2018-08-16 E 3 60$ 0
If you need one value per customer indicating its age, you can use a groupby (very common):
grpd = my_df.groupby('Customer No')['Report Date'].agg([min, max]).reset_index()
grpd['days_diff'] = (grpd['max'] - grpd['min']).dt.days
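If those per-customer values then need to go back onto the row-level frame (as in the table above), one option, reusing my_df and grpd from the snippet, is a plain merge:

# per-customer age, merged back onto every invoice row
ages = grpd.rename(columns={'days_diff': 'Age in Days'})[['Customer No', 'Age in Days']]
my_df = my_df.merge(ages, on='Customer No', how='left')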
I need to find the total monthly cumulative number of orders. I have 2 columns, OrderDate and OrderId. I can't use a list to find the cumulative numbers since the data is so large, and the result should be in year_month format along with the cumulative order total per month.
orderDate OrderId
2011-11-18 06:41:16 23
2011-11-18 04:41:16 2
2011-12-18 06:41:16 69
2012-03-12 07:32:15 235
2012-03-12 08:32:15 234
2012-03-12 09:32:15 235
2012-05-12 07:32:15 233
desired Result
Date CumulativeOrder
2011-11 2
2011-12 3
2012-03 6
2012-05 7
I have imported my Excel file into PyCharm and use pandas to read the Excel data.
I have tried to split the datetime column into year and month and then grouped, but I'm not getting the correct result.
df1 = df1[['OrderId','orderDate']]
df1['year'] = pd.DatetimeIndex(df1['orderDate']).year
df1['month'] = pd.DatetimeIndex(df1['orderDate']).month
df1.groupby(['year','month']).sum().groupby('year','month').cumsum()
print (df1)
Convert the column to datetimes, then to monthly periods with to_period, add a new column with numpy.arange, and finally remove duplicates per Date (keeping the last one) with DataFrame.drop_duplicates:
import numpy as np
import pandas as pd
df1['orderDate'] = pd.to_datetime(df1['orderDate'])
df1['Date'] = df1['orderDate'].dt.to_period('m')
#use if not sorted datetimes
#df1 = df1.sort_values('Date')
df1['CumulativeOrder'] = np.arange(1, len(df1) + 1)
print (df1)
orderDate OrderId Date CumulativeOrder
0 2011-11-18 06:41:16 23 2011-11 1
1 2011-11-18 04:41:16 2 2011-11 2
2 2011-12-18 06:41:16 69 2011-12 3
3 2012-03-12 07:32:15 235 2012-03 4
4 2012-03-12 08:32:15 234 2012-03 5
5 2012-03-12 09:32:15 235 2012-03 6
6 2012-05-12 07:32:15 233 2012-05 7
df2 = df1.drop_duplicates('Date', keep='last')[['Date','CumulativeOrder']]
print (df2)
Date CumulativeOrder
1 2011-11 2
2 2011-12 3
5 2012-03 6
6 2012-05 7
Another solution:
df2 = (df1.groupby(df1['orderDate'].dt.to_period('m')).size()
          .cumsum()
          .rename_axis('Date')
          .reset_index(name='CumulativeOrder'))
print (df2)
Date CumulativeOrder
0 2011-11 2
1 2011-12 3
2 2012-03 6
3 2012-05 7
NOTE: Looking for some help on an efficient way to do this besides a mega join and then calculating the difference between dates
I have table1 with a country ID and a date (no duplicates of these values). I want to summarize table2's information (which has country, date, cluster_x and a count variable, where cluster_x is cluster_1, cluster_2, cluster_3) so that table1 has appended to it each cluster ID's summarized count from table2, where the date from table2 occurred within 30 days prior to the date in table1.
I believe this is simple in SQL. How do I do this in pandas?
select a.date,a.country,
sum(case when a.date - b.date between 1 and 30 then b.cluster_1 else 0 end) as cluster1,
sum(case when a.date - b.date between 1 and 30 then b.cluster_2 else 0 end) as cluster2,
sum(case when a.date - b.date between 1 and 30 then b.cluster_3 else 0 end) as cluster3
from table1 a
left outer join table2 b
on a.country=b.country
group by a.date,a.country
EDIT:
Here is a somewhat altered example. Say this is table1, an aggregated data set with date, country, cluster and count. Below it is the "query" dataset (table2). In this case we want to sum the count field from table1 for cluster1, cluster2, cluster3 (there are actually 100 of them) corresponding to the country id, as long as the date field in table1 is within 30 days prior.
So for example, the first row of the query dataset has date 2/1/2015 and country 1. In table1 there is only one row within 30 days prior, and it is for cluster 2 with count 2.
Here is a dump of the two tables in CSV:
date,country,cluster,count
2014-01-30,1,1,1
2015-02-03,1,1,3
2015-01-30,1,2,2
2015-04-15,1,2,5
2015-03-01,2,1,6
2015-07-01,2,2,4
2015-01-31,2,3,8
2015-01-21,2,1,2
2015-01-21,2,1,3
and table2:
date,country
2015-02-01,1
2015-04-21,1
2015-02-21,2
Edit: Oops - wish I would have seen that edit about the join before submitting. No problem, I'll leave this since it was fun practice. Critiques welcome.
Where table1 and table2 are located in the same directory as this script at "table1.csv" and "table2.csv", this should work.
I didn't get the same result as your examples with 30 days - had to bump it to 31 days, but I think the spirit is here:
import pandas as pd
import numpy as np
table1_path = './table1.csv'
table2_path = './table2.csv'
with open(table1_path) as f:
    table1 = pd.read_csv(f)
table1.date = pd.to_datetime(table1.date)

with open(table2_path) as f:
    table2 = pd.read_csv(f)
table2.date = pd.to_datetime(table2.date)
joined = pd.merge(table2, table1, how='outer', on=['country'])
joined['datediff'] = joined.date_x - joined.date_y
filtered = joined[(joined.datediff >= np.timedelta64(1, 'D')) & (joined.datediff <= np.timedelta64(31, 'D'))]
gb_date_x = filtered.groupby(['date_x', 'country', 'cluster'])
summed = pd.DataFrame(gb_date_x['count'].sum())
result = summed.unstack()
result.reset_index(inplace=True)
result.fillna(0, inplace=True)
My test output:
ipdb> table1
date country cluster count
0 2014-01-30 00:00:00 1 1 1
1 2015-02-03 00:00:00 1 1 3
2 2015-01-30 00:00:00 1 2 2
3 2015-04-15 00:00:00 1 2 5
4 2015-03-01 00:00:00 2 1 6
5 2015-07-01 00:00:00 2 2 4
6 2015-01-31 00:00:00 2 3 8
7 2015-01-21 00:00:00 2 1 2
8 2015-01-21 00:00:00 2 1 3
ipdb> table2
date country
0 2015-02-01 00:00:00 1
1 2015-04-21 00:00:00 1
2 2015-02-21 00:00:00 2
...
ipdb> result
date_x country count
cluster 1 2 3
0 2015-02-01 00:00:00 1 0 2 0
1 2015-02-21 00:00:00 2 5 0 8
2 2015-04-21 00:00:00 1 0 5 0
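The columns of result are a MultiIndex of ('count', cluster); if flat cluster_N names like in the SQL are wanted, one way (a sketch operating on the result variable above) is to rename them afterwards:

# keep 'date_x'/'country' as-is and turn ('count', 1) etc. into 'cluster_1' etc.
result.columns = ['cluster_{}'.format(cl) if cl != '' else col
                  for col, cl in result.columns]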
UPDATE:
I don't think it makes much sense to use pandas for processing data that can't fit into your memory. Of course there are some tricks for dealing with that, but it's painful.
If you want to process your data efficiently you should use a proper tool for that.
I would recommend having a closer look at Apache Spark SQL, where you can process your distributed data on multiple cluster nodes, using much more memory/processing power/IO/etc. compared to the single-machine pandas approach.
Alternatively you can try using an RDBMS like Oracle DB (very expensive, especially the software licences, and their free version is full of limitations) or free alternatives like PostgreSQL (I can't say much about it, because of a lack of experience) or MySQL (not that powerful compared to Oracle; for example, there is no native/clean solution for dynamic pivoting, which you will most probably want to use).
OLD answer:
You can do it this way (please find explanations as comments in the code):
#
# <setup>
#
import numpy as np
import pandas as pd

dates1 = pd.date_range('2016-03-15', '2016-04-15')
dates2 = ['2016-02-01', '2016-05-01', '2016-04-01', '2015-01-01', '2016-03-20']
dates2 = [pd.to_datetime(d) for d in dates2]
countries = ['c1', 'c2', 'c3']
t1 = pd.DataFrame({
    'date': dates1,
    'country': np.random.choice(countries, len(dates1)),
    'cluster': np.random.randint(1, 4, len(dates1)),
    'count': np.random.randint(1, 10, len(dates1))
})
t2 = pd.DataFrame({'date': np.random.choice(dates2, 10), 'country': np.random.choice(countries, 10)})
#
# </setup>
#
# merge two DFs by `country`
merged = pd.merge(t1.rename(columns={'date':'date1'}), t2, on='country')
# filter dates and drop 'date1' column
merged = merged[(merged.date <= merged.date1 + pd.Timedelta('30days'))
                & (merged.date >= merged.date1)].drop(['date1'], axis=1)
# group `merged` DF by ['country', 'date', 'cluster'],
# sum up `counts` for overlapping dates,
# reset the index,
# pivot: convert `cluster` values to columns,
# taking sum's of `count` as values,
# NaN's will be replaced with zeroes
# and finally reset the index
r = (merged.groupby(['country', 'date', 'cluster'])
           .sum()
           .reset_index()
           .pivot_table(index=['country', 'date'],
                        columns='cluster',
                        values='count',
                        aggfunc='sum',
                        fill_value=0)
           .reset_index())
# rename numeric columns to: 'cluster_N'
rename_cluster_cols = {x: 'cluster_{0}'.format(x) for x in t1.cluster.unique()}
r = r.rename(columns=rename_cluster_cols)
Output (for my datasets):
In [124]: r
Out[124]:
cluster country date cluster_1 cluster_2 cluster_3
0 c1 2016-04-01 8 0 11
1 c2 2016-04-01 0 34 22
2 c3 2016-05-01 4 18 36