I am working on a transactions dataset and I need to figure out the average time between purchases for each client.
I managed to compute the difference between the latest and the earliest purchase date (in months) and divide it by the total number of purchases (NumPurchases), roughly the sketch shown after the sample table below. But I am not sure about this approach, as it does not take into consideration that not every client bought on more than one occasion.
Given the following dataset, how would you extract the average time between purchases?
CustomerID EarliestSale LatestSale NumPurchases
0 1 2017-01-05 2017-12-23 11
1 10 2017-06-20 2017-11-17 5
2 100 2017-05-10 2017-12-19 2
3 1000 2017-02-19 2017-12-30 9
4 1001 2017-02-07 2017-11-18 7
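For reference, the calculation described above looks roughly like this on the summary table (a minimal sketch; the column names and values are re-typed by hand from the sample rows above):

import pandas as pd

# re-create the summary table from the sample above
summary = pd.DataFrame({
    'CustomerID': [1, 10, 100, 1000, 1001],
    'EarliestSale': pd.to_datetime(['2017-01-05', '2017-06-20', '2017-05-10',
                                    '2017-02-19', '2017-02-07']),
    'LatestSale': pd.to_datetime(['2017-12-23', '2017-11-17', '2017-12-19',
                                  '2017-12-30', '2017-11-18']),
    'NumPurchases': [11, 5, 2, 9, 7],
})

# days between the first and the last purchase
span_days = (summary['LatestSale'] - summary['EarliestSale']).dt.days

# note: n purchases only have n - 1 gaps between them, so dividing by
# NumPurchases overcounts the intervals, and a customer with a single
# purchase has no interval at all (division by zero)
summary['AvgDaysBetween'] = span_days / (summary['NumPurchases'] - 1)
print(summary)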
Apologies for the rookie question in advance and thanks StackOverflow community :).
Given your revised question and initial dataset (I've revised your dataset slightly to include two customers):
df = pd.DataFrame({'CustomerId': ['001', '001', '002', '002'],
'SaleDate': ['2017-01-10', '2017-04-10', '2017-08-10', '2017-09-10'],
'Quantity': [5, 1, 1, 6]})
You can easily include the average time between transactions (in days) in your group by with the following code:
NOTE: This will only work if your dataset is ordered by CustomerId and then SaleDate (the code below sorts it accordingly).
import pandas as pd
df = pd.DataFrame({'CustomerId': ['001', '001', '002', '002'],
'SaleDate': ['2017-01-10', '2017-04-10', '2017-08-10', '2017-09-10'],
                   'Quantity': [5, 1, 1, 6]})
# convert the string date to a datetime
df['SaleDate'] = pd.to_datetime(df.SaleDate)
# sort the dataset
df = df.sort_values(['CustomerId', 'SaleDate'])
# calculate the difference between each date in days
# (the .shift method offsets the rows; play around with it to understand how it works)
# - we apply this to every customer using a groupby
df2 = df.groupby("CustomerId").apply(
lambda df: (df.SaleDate - df.SaleDate.shift(1)).dt.days).reset_index()
df2 = df2.rename(columns={'SaleDate': 'time-between-sales'})
df2.index = df2.level_1
# then join the result back on to the original dataframe
df = df.join(df2['time-between-sales'])
# add the mean time to your groupby
grouped = df.groupby("CustomerId").agg({
"SaleDate": ["min", "max"],
"Quantity": "sum",
"time-between-sales": "mean"})
# rename columns per your original specification
grouped.columns = grouped.columns.get_level_values(0) + grouped.columns.get_level_values(1)
grouped = grouped.rename(columns={
'SaleDatemin': 'EarliestSale',
'SaleDatemax': 'LatestSale',
'Quantitysum': 'NumPurchases',
'time-between-salesmean': 'avgTimeBetweenPurchases'})
print(grouped)
EarliestSale LatestSale NumPurchases avgTimeBetweenPurchases
CustomerId
001 2017-01-10 2017-04-10 6 90.0
002 2017-08-10 2017-09-10 7 31.0
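As a side note, the per-customer gap can also be computed more directly with groupby().diff(), which avoids the apply/shift/join round trip (a sketch using the same frame and column names as above; named aggregation requires pandas 0.25+):

# per-customer difference between consecutive sale dates, in days
df['time-between-sales'] = df.groupby('CustomerId')['SaleDate'].diff().dt.days

grouped = df.groupby('CustomerId').agg(
    EarliestSale=('SaleDate', 'min'),
    LatestSale=('SaleDate', 'max'),
    NumPurchases=('Quantity', 'sum'),
    avgTimeBetweenPurchases=('time-between-sales', 'mean'))
print(grouped)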
I am attempting to join two data frames together. However, when I use the Merge function in Pandas, in some cases it is creating rows that should not be present and are not accurate.
import pandas as pd
df = pd.DataFrame({'TimeAway Type': ['Sick', 'Vacation', 'Late'], 'Date': ['2022-03-09', '2022-03-09', '2022-03-15'], 'Hours Requested': [0.04, 3, 5]})
df2 = pd.DataFrame({'Schedule Segment': ['Sick', 'VTO', 'Tardy'], 'Date': ['2022-03-09', '2022-03-09', '2022-03-15'], 'Duration': [2, 3, 1]})
merged = pd.merge(df, df2, on=['Date'])
print(merged)
Output:

  TimeAway Type        Date  Hours Requested Schedule Segment  Duration
0          Sick  2022-03-09             0.04             Sick         2
1          Sick  2022-03-09             0.04              VTO         3
2      Vacation  2022-03-09             3.00             Sick         2
3      Vacation  2022-03-09             3.00              VTO         3
4          Late  2022-03-15             5.00            Tardy         1

As you can see from the output above, on the date where there is only one instance in each DF, everything works perfectly fine. However, on the dates where there is more than one row in each DF, extra rows are produced. It's almost as if it's saying "I don't know how to match this data, so here's every single possible combination."
Desired Output:
  TimeAway Type        Date  Hours Requested Schedule Segment  Duration
0          Sick  2022-03-09             0.04             Sick         2
1      Vacation  2022-03-09             3.00              VTO         3
2          Late  2022-03-15             5.00            Tardy         1
This is just a subset of the data. The DF is quite large and there are a lot more instances of where this occurs with different values. Is there anything that can be done to stop this from happening?
Update, per the comment on the question, to join on the order of the records within each date:
# number the records within each date (0, 1, 2, ...) on both sides
df_o = df.assign(date_order=df.groupby('Date').cumcount())
df2_o = df2.assign(date_order=df2.groupby('Date').cumcount())
# merge on the date plus that per-date position, so each row matches exactly one row
df_o.merge(df2_o, on=['Date', 'date_order'])
Output:
TimeAway Type Date Hours Requested date_order Schedule Segment Duration
0 Sick 2022-03-09 0.04 0 Sick 2
1 Vacation 2022-03-09 3.00 1 VTO 3
2 Late 2022-03-15 5.00 0 Tardy 1
This creates a pseudo-key to join on: the order of the records (cumulative count) within each day.
Try this:
df2['TimeAway Type'] = (df2['Schedule Segment'].map({'VTO':'Vacation',
'Tardy':'Late',})
.fillna(df2['Schedule Segment']))
merged = pd.merge(df, df2, on=['TimeAway Type', 'Date'])
merged
Output:
TimeAway Type Date Hours Requested Schedule Segment Duration
0 Sick 2022-03-09 0.04 Sick 2
1 Vacation 2022-03-09 3.00 VTO 3
2 Late 2022-03-15 5.00 Tardy 1
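If you want pandas to alert you when a merge is about to produce these duplicate matches instead of silently expanding the rows, the validate argument of pd.merge can act as a guard (a small sketch, reusing df and df2 from the question):

import pandas as pd

try:
    # 'one_to_one' raises MergeError if either side has duplicate join keys,
    # rather than quietly producing every possible combination
    merged = pd.merge(df, df2, on=['Date'], validate='one_to_one')
except pd.errors.MergeError as err:
    print('duplicate join keys on Date:', err)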
I want to group my dataframe by UserId, Date and category, and compute the frequency of use per day, the max duration per category, and the part of the day when each category is most used, and finally store the result in a .csv file.
name duration UserId category part_of_day Date
Settings 3.436 1 System tool evening 2020-09-10
Calendar 2.167 1 Calendar night 2020-09-11
Calendar 5.705 1 Calendar night 2020-09-11
Messages 7.907 1 Phone_and_SMS night 2020-09-11
Instagram 50.285 9 Social night 2020-09-28
Drive 30.260 9 Productivity night 2020-09-28
df.groupby(["UserId", "Date","category"])["category"].count()
My code's result is:
UserId Date category
1 2020-09-10 System tool 1
2020-09-11 Calendar 8
Clock 2
Communication 86
Health & Fitness 5
But I want this result:
UserId Date category count(category) max-duration
1 2020-09-10 System tool 1 3
2020-09-11 Calendar 2 5
2 2020-09-28 Social 1 50
Productivity 1 30
How can I do that? I cannot find a solution that gives the wanted result.
From your question, it looks like you'd like to make a table with each combination and its count. For this, you might consider using the as_index parameter of groupby:
df.groupby(["UserId", "Date"], as_index=False)["category"].count()
It looks like you might be wanting to calculate statistics for each group.
grouped = df.groupby(["UserId", "Date","category"])
result = grouped.agg({'category': 'count', 'duration': 'max'})
result.columns = ['group_count','duration_max']
result = result.reset_index()
result
UserId Date category group_count duration_max
0 1 2020-09-10 System tool 1 3.436
1 1 2020-09-11 Calendar 2 5.705
2 1 2020-09-11 Phone_and_SMS 1 7.907
3 9 2020-09-28 Productivity 1 30.260
4 9 2020-09-28 Social 1 50.285
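Since the question also asks to store the result in a .csv file, the aggregated frame can be written out with to_csv (a one-line sketch; the filename is just an example):

# write the aggregated result to disk (example filename)
result.to_csv('usage_summary.csv', index=False)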
You can take advantage of pandas.DataFrame.groupby and pandas.DataFrame.aggregate (with named aggregation) in the following way to generate your desired output in one line:
Code:
import pandas as pd
df = pd.DataFrame({'name': ['Settings','Calendar','Calendar', 'Messages', 'Instagram', 'Drive'],
'duration': [3.436, 2.167, 5.7050, 7.907, 50.285, 30.260],
'UserId': [1, 1, 1, 1, 2, 2],
'category' : ['System_tool', 'Calendar', 'Calendar', 'Phone_and_SMS', 'Social', 'Productivity'],
'part_of_day' : ['evening', 'night','night','night','night','night' ],
'Date' : ['2020-09-10', '2020-09-11', '2020-09-11', '2020-09-11', '2020-09-28', '2020-09-28'] })
df.groupby(['UserId', 'Date', 'category']).aggregate(count_cat=('category', 'count'), max_duration=('duration', 'max'))
Out:
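Running the snippet on the sample frame above should give roughly the following (reconstructed output, formatting approximate):

                                 count_cat  max_duration
UserId Date       category
1      2020-09-10 System_tool            1         3.436
       2020-09-11 Calendar               2         5.705
                  Phone_and_SMS          1         7.907
2      2020-09-28 Productivity           1        30.260
                  Social                 1        50.285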
I have a dataframe which involves Vendor, Product, Price of various listings on a market among other column values.
I need a dataframe which has the unique vendors, number of products, sum of their product listings, average price/product and (average * no. of sales) as different columns.
Something like this -
What's the best way to make this new dataframe?
Thanks!
First multiply the Number of Sales column by Price, then use DataFrameGroupBy.agg with a dictionary mapping column names to aggregate functions, and finally flatten the MultiIndex in the columns with map and rename:
df['Number of Sales'] *= df['Price']
d1 = {'Product':'size', 'Price':['sum', 'mean'], 'Number of Sales':'mean'}
df = df.groupby('Vendor').agg(d1)
df.columns = df.columns.map('_'.join)
d = {'Product_size':'No. of Product',
'Price_sum':'Sum of Prices',
'Price_mean':'Mean of Prices',
'Number of Sales_mean':'H Factor'
}
df = df.rename(columns=d).reset_index()
print (df)
Vendor No. of Product Sum of Prices Mean of Prices H Factor
0 A 4 121 30.25 6050.0
1 B 1 12 12.00 1440.0
2 C 2 47 23.50 587.5
3 H 1 45 45.00 9000.0
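Note that the first line above modifies df['Number of Sales'] in place. If you would rather keep the original column, one option is to put the product in a temporary column instead (a sketch, same column names as above; the later rename key then becomes 'sales_value_mean'):

# keep 'Number of Sales' intact and aggregate the product separately
df = df.assign(sales_value=df['Number of Sales'] * df['Price'])
d1 = {'Product': 'size', 'Price': ['sum', 'mean'], 'sales_value': 'mean'}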
You can do it using groupby(), like this:
df.groupby('Vendor').agg({'Products': 'count', 'Price': ['sum', 'mean']})
That's just three columns, but you can work out the rest.
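To spell out the rest with readable column names, named aggregation (pandas 0.25+) keeps it in one call (a sketch only, since the original frame isn't shown; the 'Product', 'Price' and 'Number of Sales' column names are assumptions based on the other answers):

summary = df.groupby('Vendor', as_index=False).agg(
    num_products=('Product', 'count'),
    price_sum=('Price', 'sum'),
    price_mean=('Price', 'mean'),
    sales_sum=('Number of Sales', 'sum'))

# the question's 'H factor' (average price * no. of sales) can then be derived,
# e.g. summary['price_mean'] * summary['sales_sum'], depending on how
# Number of Sales is stored in your data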
You can do this by using pandas pivot_table. Here is an example based on your data.
import pandas as pd
import numpy as np
>>> f = pd.pivot_table(d, index=['Vendor', 'Sales'], values=['Price', 'Product'], aggfunc={'Price': np.sum, 'Product':np.ma.count}).reset_index()
>>> f['Avg Price/Product'] = f['Price']/f['Product']
>>> f['H Factor'] = f['Sales']*f['Avg Price/Product']
>>> f.drop('Sales', axis=1)
Vendor Price Product Avg Price/Product H Factor
0 A 121 4 30.25 6050.0
1 B 12 1 12.00 1440.0
2 C 47 2 23.50 587.5
3 H 45 1 45.00 9000.0
My data frame (DF) looks like this
Customer_number Store_number year month last_buying_date1 amount
1 20 2014 10 2015-10-07 100
1 20 2014 10 2015-10-09 200
2 20 2014 10 2015-10-20 100
2 10 2014 10 2015-10-13 500
and I want to get an output like this
year month sum_purchase count_purchases distinct customers
2014 10 900 4 3
How do I get an output like this using agg and groupby? I am currently using a two-step groupby but struggling to get the distinct customers. Here's my approach:
#### Step 1 - Aggregating everything at customer_number, store_number level
aggregations = {
'amount': 'sum',
'last_buying_date1': 'count',
}
grouped_at_Cust = DF.groupby(['customer_number','store_number','month','year']).agg(aggregations).reset_index()
grouped_at_Cust.columns = ['customer_number','store_number','month','year','total_purchase','num_purchase']
#### Step2 - Aggregating at year month level
aggregations = {
'total_purchase': 'sum',
'num_purchase': 'sum',
size
}
Monthly_customers = grouped_at_Cust.groupby(['year','month']).agg(aggregations).reset_index()
Monthly_customers.columns = ['year','month','sum_purchase','count_purchase','distinct_customers']
My struggle is with the 2nd step. How do I include size in the 2nd aggregation step?
You could use groupby.agg, supplying the function nunique to return the number of unique Customer_number values in each group.
df_grp = df.groupby(['year', 'month'], as_index=False) \
.agg({'purchase_amt':['sum','count'], 'Customer_number':['nunique']})
df_grp.columns = map('_'.join, df_grp.columns.values)
df_grp
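On pandas 0.25+ you can also name the output columns directly in agg, which avoids the column-flattening step (a sketch; I'm using the amount column from the sample frame, whereas the snippet above assumed a purchase_amt column):

df_grp = df.groupby(['year', 'month'], as_index=False).agg(
    sum_purchase=('amount', 'sum'),
    count_purchases=('amount', 'count'),
    distinct_customers=('Customer_number', 'nunique'))
df_grp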
In case you are trying to group them differently (omitting certain columns) when performing the groupby operation:
df_grp_1 = df.groupby(['year', 'month']).agg({'purchase_amt':['sum','count']})
df_grp_2 = df.groupby(['Store_number', 'month', 'year'])['Customer_number'].agg('nunique')
Take the second level of the MultiIndex columns, which contains the agg operation performed:
df_grp_1.columns = df_grp_1.columns.get_level_values(1)
Merge them back on the intersection of the columns used to group them:
df_grp = df_grp_1.reset_index().merge(df_grp_2.reset_index().drop(['Store_number'],
axis=1), on=['year', 'month'], how='outer')
Rename the columns to new ones:
d = {'sum': 'sum_purchase', 'count': 'count_purchase', 'Customer_number': 'distinct_customers'}
df_grp.columns = [d.get(x, x) for x in df_grp.columns]
df_grp
NOTE: I'm looking for some help on an efficient way to do this besides a mega join and then calculating the difference between the dates.
I have table1 with a country ID and a date (no duplicate combinations of these values), and I want to summarize information from table2 (which has country, date, cluster_x and a count variable, where cluster_x is cluster_1, cluster_2, cluster_3) so that each row of table1 gets one column per cluster ID holding the summed count from table2, restricted to table2 rows whose date falls within the 30 days prior to the date in table1.
I believe this is simple in SQL; how do I do it in pandas?
select a.date,a.country,
sum(case when a.date - b.date between 1 and 30 then b.cluster_1 else 0 end) as cluster1,
sum(case when a.date - b.date between 1 and 30 then b.cluster_2 else 0 end) as cluster2,
sum(case when a.date - b.date between 1 and 30 then b.cluster_3 else 0 end) as cluster3
from table1 a
left outer join table2 b
on a.country=b.country
group by a.date,a.country
EDIT:
Here is a somewhat altered example. Say this is table1, an aggregated data set with date, country, cluster and count. Below it is the "query" dataset (table2). In this case we want to sum the count field from table1 for cluster1, cluster2, cluster3 (there are actually 100 of them), matching on the country id, as long as the date field in table1 is within the 30 days prior.
So, for example, the first row of the query dataset has date 2015-02-01 and country 1. In table1, there is only one row within the 30 days prior, and it is for cluster 2 with count 2.
Here is a dump of the two tables in CSV:
date,country,cluster,count
2014-01-30,1,1,1
2015-02-03,1,1,3
2015-01-30,1,2,2
2015-04-15,1,2,5
2015-03-01,2,1,6
2015-07-01,2,2,4
2015-01-31,2,3,8
2015-01-21,2,1,2
2015-01-21,2,1,3
and table2:
date,country
2015-02-01,1
2015-04-21,1
2015-02-21,2
Edit: Oops, I wish I had seen that edit about joining before submitting. No problem, I'll leave this up as it was fun practice. Critiques welcome.
Where table1 and table2 are located in the same directory as this script at "table1.csv" and "table2.csv", this should work.
I didn't get the same result as your examples with 30 days - had to bump it to 31 days, but I think the spirit is here:
import pandas as pd
import numpy as np
table1_path = './table1.csv'
table2_path = './table2.csv'
with open(table1_path) as f:
table1 = pd.read_csv(f)
table1.date = pd.to_datetime(table1.date)
with open(table2_path) as f:
table2 = pd.read_csv(f)
table2.date = pd.to_datetime(table2.date)
joined = pd.merge(table2, table1, how='outer', on=['country'])
joined['datediff'] = joined.date_x - joined.date_y
filtered = joined[(joined.datediff >= np.timedelta64(1, 'D')) & (joined.datediff <= np.timedelta64(31, 'D'))]
gb_date_x = filtered.groupby(['date_x', 'country', 'cluster'])
summed = pd.DataFrame(gb_date_x['count'].sum())
result = summed.unstack()
result.reset_index(inplace=True)
result.fillna(0, inplace=True)
My test output:
ipdb> table1
date country cluster count
0 2014-01-30 00:00:00 1 1 1
1 2015-02-03 00:00:00 1 1 3
2 2015-01-30 00:00:00 1 2 2
3 2015-04-15 00:00:00 1 2 5
4 2015-03-01 00:00:00 2 1 6
5 2015-07-01 00:00:00 2 2 4
6 2015-01-31 00:00:00 2 3 8
7 2015-01-21 00:00:00 2 1 2
8 2015-01-21 00:00:00 2 1 3
ipdb> table2
date country
0 2015-02-01 00:00:00 1
1 2015-04-21 00:00:00 1
2 2015-02-21 00:00:00 2
...
ipdb> result
date_x country count
cluster 1 2 3
0 2015-02-01 00:00:00 1 0 2 0
1 2015-02-21 00:00:00 2 5 0 8
2 2015-04-21 00:00:00 1 0 5 0
UPDATE:
I think it doesn't make much sense to use pandas for processing data that can't fit into your memory. Of course there are some tricks to deal with that, but it's painful.
If you want to process your data efficiently you should use a proper tool for that.
I would recommend having a closer look at Apache Spark SQL, where you can process your distributed data on multiple cluster nodes with much more memory/processing power/IO/etc. than a single-computer pandas approach.
Alternatively you can try using an RDBMS like Oracle DB (very expensive, especially the software licences, and their free version is full of limitations) or free alternatives like PostgreSQL (I can't say much about it for lack of experience) or MySQL (not as powerful as Oracle; for example, there is no native/clean solution for dynamic pivoting, which you will most probably want to use).
OLD answer:
You can do it this way (explanations are given as comments in the code):
#
# <setup>
#
dates1 = pd.date_range('2016-03-15','2016-04-15')
dates2 = ['2016-02-01', '2016-05-01', '2016-04-01', '2015-01-01', '2016-03-20']
dates2 = [pd.to_datetime(d) for d in dates2]
countries = ['c1', 'c2', 'c3']
t1 = pd.DataFrame({
'date': dates1,
'country': np.random.choice(countries, len(dates1)),
'cluster': np.random.randint(1, 4, len(dates1)),
'count': np.random.randint(1, 10, len(dates1))
})
t2 = pd.DataFrame({'date': np.random.choice(dates2, 10), 'country': np.random.choice(countries, 10)})
#
# </setup>
#
# merge two DFs by `country`
merged = pd.merge(t1.rename(columns={'date':'date1'}), t2, on='country')
# filter dates and drop 'date1' column
merged = merged[(merged.date <= merged.date1 + pd.Timedelta('30days'))\
& \
(merged.date >= merged.date1)
].drop(['date1'], axis=1)
# group `merged` DF by ['country', 'date', 'cluster'],
# sum up `counts` for overlapping dates,
# reset the index,
# pivot: convert `cluster` values to columns,
# taking sum's of `count` as values,
# NaN's will be replaced with zeroes
# and finally reset the index
r = merged.groupby(['country', 'date', 'cluster'])\
.sum()\
.reset_index()\
.pivot_table(index=['country','date'],
columns='cluster',
values='count',
aggfunc='sum',
fill_value=0)\
.reset_index()
# rename numeric columns to: 'cluster_N'
rename_cluster_cols = {x: 'cluster_{0}'.format(x) for x in t1.cluster.unique()}
r = r.rename(columns=rename_cluster_cols)
Output (for my datasets):
In [124]: r
Out[124]:
cluster country date cluster_1 cluster_2 cluster_3
0 c1 2016-04-01 8 0 11
1 c2 2016-04-01 0 34 22
2 c3 2016-05-01 4 18 36
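As a follow-up sketch (my addition, assuming the same variable names as above): (country, date) pairs from t2 with no t1 rows inside the window drop out of r entirely; a left merge back onto the distinct pairs of t2 restores them with zero counts:

# keep every (country, date) pair from the query frame, filling missing clusters with 0
r_full = (t2.drop_duplicates()
            .merge(r, on=['country', 'date'], how='left')
            .fillna(0))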