Aggregation within an index hierarchy - python

Currently, I have a dataframe with an index hierarchy for monthly cohorts. Here is how I grouped them:
grouped = dfsort.groupby(['Cohort','Lifetime_Revenue'])
cohorts = grouped.agg({'Customer_ID': pd.Series.nunique})
Which outputs:
Cohort Lifetime_Revenue Customer_ID
2014-01 149.9 1
2014-02 299.9 1
2014-03 269.91 1
329.89 1
899.88 1
2014-04 299.9 1
674.91 2
2014-05 899.88 1
2014-06 824.89 1
And so on.
I was looking to get the total sum of the Lifetime_Revenue for each cohort, as well as the total number of users in each cohort.
Basically, I want to turn it back into a regular, flat dataframe.
Anyone got any thoughts on this?
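A minimal sketch of one way to do this: reset the MultiIndex back into ordinary columns, then aggregate per cohort. This assumes the Lifetime_Revenue level is a per-customer figure, so total revenue for a cohort is revenue times the customer count on each row; if that assumption is wrong, drop the multiplication.

```python
import pandas as pd

# Rebuild a small version of the grouped frame shown above.
cohorts = pd.DataFrame(
    {"Customer_ID": [1, 1, 1, 1, 1, 1, 2, 1, 1]},
    index=pd.MultiIndex.from_tuples(
        [("2014-01", 149.9), ("2014-02", 299.9), ("2014-03", 269.91),
         ("2014-03", 329.89), ("2014-03", 899.88), ("2014-04", 299.9),
         ("2014-04", 674.91), ("2014-05", 899.88), ("2014-06", 824.89)],
        names=["Cohort", "Lifetime_Revenue"],
    ),
)

# Flatten the hierarchy back into ordinary columns.
flat = cohorts.reset_index()

# Assumption: each counted customer contributed that revenue figure.
flat["revenue"] = flat["Lifetime_Revenue"] * flat["Customer_ID"]

# One row per cohort: total revenue and total users.
summary = flat.groupby("Cohort", as_index=False).agg(
    total_revenue=("revenue", "sum"),
    total_users=("Customer_ID", "sum"),
)
```

`reset_index()` is what turns the index hierarchy back into a "regular" table; the named aggregation then gives one row per cohort.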


Applying Pandas groupby to multiple columns

I have a set of data with several different columns and daily data going back several years. The variable is exactly the same in each column. I've calculated the daily, monthly, and yearly statistics for each column, and now I want to do the same with all columns combined, to get one statistic for each day, month, and year rather than the several separate ones I calculated before.
I've been using Pandas groupby so far, with something like this:
sum_daily_files = daily_files.groupby(daily_files.Date.dt.day).sum()
sum_monthly_files = daily_files.groupby(daily_files.Date.dt.month).sum()
sum_yearly_files = daily_files.groupby(daily_files.Date.dt.year).sum()
Any suggestions on how I might go about using Pandas - or any other package - to combine the statistics together? Thanks so much!
edit
Here's a snippet of my dataframe:
Date site1 site2 site3 site4 site5 site6
2010-01-01 00:00:00 2 0 1 1 0 1
2010-01-02 00:00:00 7 5 1 3 1 1
2010-01-03 00:00:00 3 3 2 2 2 1
2010-01-04 00:00:00 0 0 0 0 0 0
2010-01-05 00:00:00 0 0 0 0 0 1
I just had to type it in because I was having trouble getting it over, so my apologies. Basically, it's six different sites from 2010 to 2019 that details how much snow (in inches) each site received on each day.
(Your problem needs to be clarified.)
Is this what you want?
all_sum_daily_files = sum_daily_files.sum(axis=1) # or daily_files.sum(axis=1)
all_sum_monthly_files = sum_monthly_files.sum(axis=1)
all_sum_yearly_files = sum_yearly_files.sum(axis=1)
If your data is already daily, there is no need to compute a daily sum first; you can use daily_files.sum(axis=1) directly.
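Putting the two steps together, a runnable sketch on a hypothetical snippet shaped like the one in the question (a datetime Date column plus one column per site): sum across the site columns first, then group that single series by month and by year.

```python
import pandas as pd

# Hypothetical data mirroring the snippet in the question.
daily_files = pd.DataFrame({
    "Date": pd.to_datetime(["2010-01-01", "2010-01-02", "2010-01-03",
                            "2010-01-04", "2010-01-05"]),
    "site1": [2, 7, 3, 0, 0],
    "site2": [0, 5, 3, 0, 0],
    "site3": [1, 1, 2, 0, 0],
})

# Combine all sites into one daily series.
all_sites = daily_files.set_index("Date").sum(axis=1)

# Then aggregate that single series by month and by year.
monthly_total = all_sites.groupby(all_sites.index.to_period("M")).sum()
yearly_total = all_sites.groupby(all_sites.index.year).sum()
```

Grouping by `index.to_period("M")` keeps year and month together, which avoids mixing January 2010 with January 2011 the way grouping on `dt.month` alone would.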

How to aggregate days by year-month and pivot so that count becomes count_source summed over the month with Python

I am manipulating some data in Python and was wondering if anyone can help.
I have data that looks like this:
count source timestamp tokens
0 1 alt-right-census 2006-03-21 setting
1 1 alt-right-census 2006-03-21 twttr
2 1 stormfront 2006-06-24 head
3 1 stormfront 2006-10-07 five
and I need data that looks like this:
count_stormfront count_alt-right-census month token
2 1 2006-01 setting
or like this:
date token alt_count storm_count
4069995 2016-09 zealand 0 0
4069996 2016-09 zero 11 8
4069997 2016-09 zika 295 160
How can I aggregate days by year-month and pivot so that count becomes count_source summed over the month?
Any help would be appreciated. Thanks!
df.groupby(['source', df['timestamp'].str[:7]]).size().unstack()
Result:
timestamp 2006-03 2006-06 2006-10
source
alt-right-census 2.0 NaN NaN
stormfront NaN 1.0 1.0
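The one-liner above assumes `timestamp` is stored as a string, so `.str[:7]` yields the year-month. To also keep the token column and get the `count_source` columns from the desired output, a sketch with pivot_table (column names here follow the question's first target layout):

```python
import pandas as pd

df = pd.DataFrame({
    "count": [1, 1, 1, 1],
    "source": ["alt-right-census", "alt-right-census",
               "stormfront", "stormfront"],
    "timestamp": ["2006-03-21", "2006-03-21", "2006-06-24", "2006-10-07"],
    "tokens": ["setting", "twttr", "head", "five"],
})

# Truncate the string timestamp to year-month.
df["month"] = df["timestamp"].str[:7]

# Pivot sources into columns, summing `count` over each month/token pair.
wide = (df.pivot_table(index=["month", "tokens"], columns="source",
                       values="count", aggfunc="sum", fill_value=0)
          .add_prefix("count_")
          .reset_index())
wide.columns.name = None
```

`fill_value=0` gives zeros instead of NaN where a source has no rows for a month/token pair, matching the zero counts in the desired output.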

pandas calculated field by comparing other fields and dictionary values

I have two dataframes. the first one looks like this:
id, orderdate, orderid, amount,camp1, camp2, camp3
1 2020-01-01 100 100 1 0 0
2 2020-02-01 120 200 1 0 1
3 2019-12-01 130 500 0 1 0
4 2019-11-01 150 750 0 1 0
5 2020-01-01 160 1000 1 1 1
The camp1, camp2 and camp3 columns show whether the customer attended a campaign.
The campaigns have a period dictionary (this needs from datetime import datetime):
camp_periods = {
    'camp1': [datetime.strptime('2019-04-08', '%Y-%m-%d'),
              datetime.strptime('2019-06-06', '%Y-%m-%d')],
    'camp2': [datetime.strptime('2019-09-15', '%Y-%m-%d'),
              datetime.strptime('2019-09-28', '%Y-%m-%d')],
    'camp3': [datetime.strptime('2019-11-15', '%Y-%m-%d'),
              datetime.strptime('2019-12-28', '%Y-%m-%d')]
}
I would like to create a table giving the number of orders and total of order amounts per customer, if the orderdate is between the campaign periods in the camp_periods dictionary and if the customer attended to that campaign.
I'm not sure I understood your question very well. I guess that by "the number of orders and total of order amounts" you mean the first n orders whose cumulative amount is under or equal to a given total. Here is my approach:
Using the data example from the question:
from operator import or_
from functools import reduce
number_orders = 2
total_order_amounts = 3000
The requirement that "the orderdate is between the campaign periods in the camp_periods dictionary and the customer attended that campaign" translates to:
cond = [(df[k].astype('bool') & df['orderdate'].between(*v)) for k, v in camp_periods.items()]
cond = reduce(or_, cond)
df_cond = df[cond]
df_final = df_cond[df_cond['amount'].cumsum() <= total_order_amounts].head(number_orders)
df_final
output:
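A more literal reading of the question is "count orders and sum amounts per customer, keeping only orders placed inside the period of a campaign that customer attended". A self-contained sketch of that reading, using the sample data (note that with this particular sample, no order date actually falls inside an attended campaign's window, so the result is empty):

```python
from datetime import datetime
from functools import reduce
from operator import or_

import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "orderdate": pd.to_datetime(["2020-01-01", "2020-02-01", "2019-12-01",
                                 "2019-11-01", "2020-01-01"]),
    "orderid": [100, 120, 130, 150, 160],
    "amount": [100, 200, 500, 750, 1000],
    "camp1": [1, 1, 0, 0, 1],
    "camp2": [0, 0, 1, 1, 1],
    "camp3": [0, 1, 0, 0, 1],
})
camp_periods = {
    "camp1": [datetime(2019, 4, 8), datetime(2019, 6, 6)],
    "camp2": [datetime(2019, 9, 15), datetime(2019, 9, 28)],
    "camp3": [datetime(2019, 11, 15), datetime(2019, 12, 28)],
}

# An order qualifies if, for any campaign, the customer attended it AND
# the order date falls inside that campaign's window.
cond = reduce(or_, (df[k].astype(bool) & df["orderdate"].between(*v)
                    for k, v in camp_periods.items()))

# Orders per customer and total amount, over qualifying orders only.
summary = (df[cond]
           .groupby("id")
           .agg(n_orders=("orderid", "count"),
                total_amount=("amount", "sum")))
```

The `n_orders`/`total_amount` names are illustrative; the per-customer groupby is the part that differs from the cumulative-sum answer above.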

subset by counting the number of times 0 occurs in a column after groupby in python

I have some typical stock data. I want to create a column called "Volume_Count" that counts the number of 0-volume days per quarter. My ultimate goal is to remove all stocks that have 0 volume for more than 5 days in a quarter. By creating this column, I can write a simple statement to subset on Volume_Count > 5.
A typical Dataset:
Stock Date Qtr Volume
XYZ 1/1/19 2019 Q1 0
XYZ 1/2/19 2019 Q1 598
XYZ 1/3/19 2019 Q1 0
XYZ 1/4/19 2019 Q1 0
XYZ 1/5/19 2019 Q1 0
XYZ 1/6/19 2019 Q1 2195
XYZ 1/7/19 2019 Q1 0
... ... and so on (for multiple stocks and quarters)
This is what I've tried - a one-liner:
df = df.groupby(['stock','Qtr'], as_index=False).filter(lambda x: len(x.Volume == 0) > 5)
However, this produced inconsistent results.
I want to remove the stock from the dataset only for the quarter where the volume == 0 for 5 or more days.
Note: I have multiple Stocks and Qtr in my dataset, therefore it's essential to groupby Qtr, Stock.
Desired Output:
I want to keep the dataset but remove any stock for a quarter if it has volume = 0 for more than 5 days. That might mean a stock is absent from the dataset for 2019 Q1 (because volume == 0 on more than 5 days) but present in the df for 2019 Q2 (volume == 0 on fewer than 5 days).
Try this:
df[df['Volume'].eq(0).groupby([df['Stock'],df['Qtr']]).transform('sum') < 5]
Details:
1. Take the Volume column of your dataframe and check whether it is zero for each record.
2. Group that boolean column by the 'Stock' and 'Qtr' columns and, using groupby and transform, assign to each record the sum of the True values from step 1 within its group.
3. Create a boolean series from that sum that is True where the count is less than 5, and use that series to boolean-index your original dataframe.
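A runnable sketch of those steps on the sample quarter (I've read "more than 5 zero-volume days" as the cutoff, so groups with at most 5 are kept; tighten the comparison if "5 or more" is intended):

```python
import pandas as pd

df = pd.DataFrame({
    "Stock": ["XYZ"] * 7,
    "Qtr": ["2019 Q1"] * 7,
    "Volume": [0, 598, 0, 0, 0, 2195, 0],
})

# Count zero-volume days per (Stock, Qtr) and broadcast the count
# back onto every row of the group.
zero_days = df["Volume"].eq(0).groupby([df["Stock"], df["Qtr"]]).transform("sum")

# Keep only stock/quarter groups with at most 5 zero-volume days.
kept = df[zero_days <= 5]
```

Because `transform` returns a value per row (unlike `agg`), the boolean mask lines up with the original index, so whole stock/quarter groups are kept or dropped together.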

Python selecting row from second dataframe based on complex criteria

I have two dataframes, one with some purchasing data, and one with a weekly calendar, e.g.
df1:
purchased_at product_id cost
01-01-2017 1 £10
01-01-2017 2 £8
09-01-2017 1 £10
18-01-2017 3 £12
df2:
week_no week_start week_end
1 31-12-2016 06-01-2017
2 07-01-2017 13-01-2017
3 14-01-2017 20-01-2017
I want to use data from the two to add a 'week_no' column to df1, which is selected from df2 based on where the 'purchased_at' date in df1 falls between the 'week_start' and 'week_end' dates in df2, i.e.
df1:
purchased_at product_id cost week_no
01-01-2017 1 £10 1
01-01-2017 2 £8 1
09-01-2017 1 £10 2
18-01-2017 3 £12 3
I've searched but I've not been able to find an example where the data is being pulled from a second dataframe using comparisons between the two, and I've been unable to correctly apply any examples I've found, e.g.
df1.loc[(df1['purchased_at'] < df2['week_end']) &
        (df1['purchased_at'] > df2['week_start']), df2['week_no']]
was unsuccessful, failing with ValueError: can only compare identically-labeled Series objects.
Could anyone help with this problem, or I'm open to suggestions if there is a better way to achieve the same outcome.
edit to add further detail of df1
df1 full dataframe headers
purchased_at purchase_id product_id product_name transaction_id account_number cost
01-01-2017 1 1 A 1 AA001 £10
01-01-2017 2 2 B 1 AA001 £8
02-01-2017 3 1 A 2 AA008 £10
03-01-2017 4 3 C 3 AB040 £12
...
09-01-2017 12 1 A 10 AB102 £10
09-01-2017 13 2 B 11 AB102 £8
...
18-01-2017 20 3 C 15 AA001 £12
So the purchase_id increases incrementally with each row, the product_id and product_name have a 1:1 relationship, the transaction_id also increases incrementally, but there can be multiple purchases within a transaction.
If your dataframes are not too big, you can use this trick.
Do a full cartesian product join of all records to all records:
df_out = pd.merge(df1.assign(key=1),df2.assign(key=1),on='key')
Next, filter out the records that do not match the criteria, in this case where purchased_at is not between week_start and week_end:
(df_out.query('week_start < purchased_at < week_end')
.drop(['key','week_start','week_end'], axis=1))
Output:
purchased_at product_id cost week_no
0 2017-01-01 1 £10 1
3 2017-01-01 2 £8 1
7 2017-01-09 1 £10 2
11 2017-01-18 3 £12 3
If you do have large dataframes, then you can use this numpy method as proposed by PiRSquared (this needs import numpy as np):
a = df1.purchased_at.values
bh = df2.week_end.values
bl = df2.week_start.values

i, j = np.where((a[:, None] >= bl) & (a[:, None] <= bh))

pd.DataFrame(
    np.column_stack([df1.values[i], df2.values[j]]),
    columns=df1.columns.append(df2.columns)
).drop(['week_start','week_end'], axis=1)
Output:
purchased_at product_id cost week_no
0 2017-01-01 00:00:00 1 £10 1
1 2017-01-01 00:00:00 2 £8 1
2 2017-01-09 00:00:00 1 £10 2
3 2017-01-18 00:00:00 3 £12 3
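For reference, a self-contained version of the cross-join approach on the sample data, with the DD-MM-YYYY dates parsed via dayfirst=True so the comparisons in query() are real date comparisons rather than string ones:

```python
import pandas as pd

df1 = pd.DataFrame({
    "purchased_at": pd.to_datetime(
        ["01-01-2017", "01-01-2017", "09-01-2017", "18-01-2017"],
        dayfirst=True),
    "product_id": [1, 2, 1, 3],
    "cost": ["£10", "£8", "£10", "£12"],
})
df2 = pd.DataFrame({
    "week_no": [1, 2, 3],
    "week_start": pd.to_datetime(
        ["31-12-2016", "07-01-2017", "14-01-2017"], dayfirst=True),
    "week_end": pd.to_datetime(
        ["06-01-2017", "13-01-2017", "20-01-2017"], dayfirst=True),
})

# Cartesian product via a constant join key, then filter to matching weeks.
df_out = pd.merge(df1.assign(key=1), df2.assign(key=1), on="key")
result = (df_out.query("week_start < purchased_at < week_end")
          .drop(["key", "week_start", "week_end"], axis=1)
          .reset_index(drop=True))
```

Note the strict inequalities follow the answer above; use <= on week_start if a purchase made exactly at the week boundary should count.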
You could just use strftime() to extract the week number from the date. If you want to keep counting weeks upwards, you need to define a "zero year" as the start of your time series and offset the week_no accordingly:
import pandas as pd
data = {'purchased_at': ['01-01-2017', '01-01-2017', '09-01-2017', '18-01-2017'], 'product_id': [1,2,1,3], 'cost':['£10', '£8', '£10', '£12']}
df = pd.DataFrame(data, columns=['purchased_at', 'product_id', 'cost'])
def getWeekNo(date, year0):
    dt = pd.to_datetime(date, dayfirst=True)
    year = int(dt.strftime('%Y'))
    weekNo = int(dt.strftime('%U'))
    return weekNo + 52*(year - year0)
df['week_no'] = df.purchased_at.apply(lambda x: getWeekNo(x, 2017))
Here, I use pd.to_datetime() to convert the date string from df into a datetime object. strftime('%Y') returns the year and strftime('%U') the week (with the first week of a year starting on its first Sunday; if weeks should start on Monday, use '%W' instead).
This way, you don't need to maintain a separate DataFrame just for week numbers.
