Aggregating customer spend without any customer ID - python

I have 2 columns as below. The first column is spend, and the second column is months from offer. Unfortunately there is no ID to identify each customer. In the case below, there are three customers. e.g. The first 5 rows represent customer 1, the next 3 rows are customer 2, and then final 7 rows are customer 3. You can tell by looking at the months_from_offer, which go from -x to x months for each customer (x is not necessarily the same for each customer, as shown here where x=2,1,3 respectively for customers 1,2,3).
What I am looking to do is calculate the difference between post-offer spend and pre-offer spend for each customer. I don't care about the individual customers themselves, but I would like an overview - e.g. 10 customers had a post/pre difference between $0 and $100.
As an example with the data below, the post/pre offer difference for customer 1 is -$10 - $32 + $23 + $54 = $35 (the $43 spent in month 0, the offer month, is not counted)
for customer 2: -$21 + $87 = $66
for customer 3: -$12 - $83 - $65 + $80 + $67 + $11 = -$2
spend months_from_offer
$10 -2
$32 -1
$43 0
$23 1
$54 2
$21 -1
$23 0
$87 1
$12 -3
$83 -2
$65 -1
$21 0
$80 1
$67 2
$11 3
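
For reference, this sample data can be built as a DataFrame directly (a sketch, keeping spend as strings, as shown above):

import pandas as pd

df = pd.DataFrame({
    'spend': ['$10', '$32', '$43', '$23', '$54', '$21', '$23', '$87',
              '$12', '$83', '$65', '$21', '$80', '$67', '$11'],
    'months_from_offer': [-2, -1, 0, 1, 2, -1, 0, 1, -3, -2, -1, 0, 1, 2, 3],
})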

You can identify the customers using the following and then groupby customer:
import numpy as np
import pandas as pd

df['customer'] = df['months_from_offer'].cumsum().shift().eq(0).cumsum().add(1)
# Another way to calculate 'customer', per teylyn's method:
# df['customer'] = np.sign(df['months_from_offer']).diff().lt(0).cumsum().add(1)
df['amount'] = df['spend'].str[1:].astype(int) * np.sign(df['months_from_offer'])
df.groupby('customer')['amount'].sum().reset_index()
Output:
customer amount
0 1 35
1 2 66
2 3 -2
How it is done:
spend months_from_offer customer amount
0 $10 -2 1 -10
1 $32 -1 1 -32
2 $43 0 1 0
3 $23 1 1 23
4 $54 2 1 54
5 $21 -1 2 -21
6 $23 0 2 0
7 $87 1 2 87
8 $12 -3 3 -12
9 $83 -2 3 -83
10 $65 -1 3 -65
11 $21 0 3 0
12 $80 1 3 80
13 $67 2 3 67
14 $11 3 3 11
Calculate the 'customer' column using cumsum, shift and eq, then add 1 so the numbering starts at customer 1.
Calculate 'amount' by stripping the '$' from 'spend' and multiplying by np.sign of 'months_from_offer'.
Sum 'amount' with a groupby on 'customer'; the totals can then be bucketed, as sketched below.
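
For the overview the question asks for (how many customers fall in each difference bracket), the per-customer totals can be bucketed, e.g. with pd.cut; a minimal sketch, with illustrative bin edges:

totals = df.groupby('customer')['amount'].sum()
bins = [-100, 0, 100]  # illustrative bracket edges
overview = pd.cut(totals, bins=bins).value_counts().sort_index()
print(overview)
# (-100, 0]    1
# (0, 100]     2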

In Excel, you can insert a helper column that increments a counter whenever the sign of the month is negative and differs from the sign in the row above (i.e. a new customer's pre-offer months are starting).
Hard-code a customer ID of 1 into the first row of data, then calculate the rest; the formula below assumes months_from_offer is in column A and the customer ID helper in column B:
=IF(AND(SIGN(A3)=-1,SIGN(A3)<>SIGN(A2)),B2+1,B2)
Copy the results and paste as values, then you can use them to aggregate your data.

Use pandas.Series.diff with cumsum to create a pseudo user id:
s = df["months_from_offer"].diff().lt(0).cumsum()
Output:
0 0
1 0
2 0
3 0
4 0
5 1
6 1
7 1
8 2
9 2
10 2
11 2
12 2
13 2
14 2
Name: months_from_offer, dtype: int64
Then use pandas.Series.clip to turn months_from_offer into -1, 0, or 1, strip the '$' from spend, and multiply:
spend = df["spend"].str.lstrip("$").astype(int) * df["months_from_offer"].clip(-1, 1)
Then use groupby.sum with the pseudo id s:
spend.groupby(s).sum()
Final output:
months_from_offer
0 35
1 66
2 -2
dtype: int64

Create id
s = df['months_from_offer'].iloc[::-1].cumsum().eq(0).iloc[::-1].cumsum()
0 1
1 1
2 1
3 1
4 1
5 2
6 2
7 2
8 3
9 3
10 3
11 3
12 3
13 3
14 3
Name: months_from_offer, dtype: int32
Then assign it:
df['id'] = s
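
With the id assigned, the post/pre difference per customer follows as in the first answer; a minimal sketch (assuming the '$' strings are parsed to integers first):

import numpy as np

df['amount'] = df['spend'].str.lstrip('$').astype(int) * np.sign(df['months_from_offer'])
print(df.groupby('id')['amount'].sum())
# id
# 1    35
# 2    66
# 3    -2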

I assume you wanted to read an Excel file using pandas.
import pandas as pd

df = pd.read_excel('file.xlsx', sheet_name='yoursheet')
pre = 0
post = 0
for i in df.index:
    if df['months_from_offer'][i] < 0:
        pre += int(df['spend'][i])   # assumes spend was read from Excel as a number
    if df['months_from_offer'][i] > 0:
        post += int(df['spend'][i])
dif = post - pre
If you would like the difference for each customer separately:
import pandas as pd

df = pd.read_excel('file.xlsx', sheet_name='yoursheet')
customers = []
last = None
pre = 0
post = 0
for i in df.index:
    # within one customer, consecutive months differ by exactly 1;
    # any other step means a new customer's records are starting
    if last is not None and df['months_from_offer'][i] - last != 1:
        customers.append(post - pre)
        pre = 0
        post = 0
    if df['months_from_offer'][i] < 0:
        pre += int(df['spend'][i])
    if df['months_from_offer'][i] > 0:
        post += int(df['spend'][i])
    last = df['months_from_offer'][i]
customers.append(post - pre)  # don't forget the final customer
Or you can use a dict to name the customers. The way I separated the customers: within one customer, consecutive months are always exactly 1 apart, so whenever that pattern breaks, another person's records must be starting.
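
A minimal sketch of that dict variant (assuming, as above, that spend is read from Excel as numbers):

customers = {}  # customer number -> post/pre difference
number = 1
pre = post = 0
last = None
for i in df.index:
    m = df['months_from_offer'][i]
    if last is not None and m - last != 1:  # month pattern breaks: new customer
        customers[number] = post - pre
        number += 1
        pre = post = 0
    if m < 0:
        pre += int(df['spend'][i])
    elif m > 0:
        post += int(df['spend'][i])
    last = m
customers[number] = post - pre
print(customers)  # {1: 35, 2: 66, 3: -2}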

Related

Pandas Conditional Rolling Count

I have a question that extends from Pandas: conditional rolling count. I would like to create a new column in a dataframe that reflects the cumulative count of rows that meets several criteria.
Using the following example and code from stackoverflow 25119524
import pandas as pd
l1 =["1", "1", "1", "2", "2", "2", "2", "2"]
l2 =[1, 2, 2, 2, 2, 2, 2, 3]
l3 =[45, 25, 28, 70, 95, 98, 120, 80]
cowmast = pd.DataFrame(list(zip(l1, l2, l3)))
cowmast.columns =['Cow', 'Lact', 'DIM']
def rolling_count(val):
    if val == rolling_count.previous:
        rolling_count.count += 1
    else:
        rolling_count.previous = val
        rolling_count.count = 1
    return rolling_count.count

rolling_count.count = 0        # static variable
rolling_count.previous = None  # static variable
cowmast['xmast'] = cowmast['Cow'].apply(rolling_count)  # new column in dataframe
cowmast
The output is xmast (the number of mastitis events) for each cow:
Cow Lact DIM xmast
0 1 1 45 1
1 1 2 25 2
2 1 2 28 3
3 2 2 70 1
4 2 2 95 2
5 2 2 98 3
6 2 2 120 4
7 2 3 80 5
What I would like to do is restart the count for each cow (cow) lactation (Lact) and only increment the count when the number of days (DIM) between rows is more than 7.
To incorporate more than one condition to reset the count for each cow's lactation (Lact), I used the following code.
def count_consecutive_items_n_cols(df, col_name_list, output_col):
    cum_sum_list = [
        (df[col_name] != df[col_name].shift(1)).cumsum().tolist()
        for col_name in col_name_list
    ]
    df[output_col] = df.groupby(
        ["_".join(map(str, x)) for x in zip(*cum_sum_list)]
    ).cumcount() + 1
    return df

count_consecutive_items_n_cols(cowmast, ['Cow', 'Lact'], 'Lxmast')
That produces the following output
Cow Lact DIM xmast Lxmast
0 1 1 45 1 1
1 1 2 25 2 1
2 1 2 28 3 2
3 2 2 70 1 1
4 2 2 95 2 2
5 2 2 98 3 3
6 2 2 120 4 4
7 2 3 80 5 1
I would appreciate insight as to how to add another condition in the cumulative count that takes into consideration the time between mastitis events (difference in DIM between rows for cows within the same Lact). If the difference in DIM between rows for the same cow and lactation is less than 7 then the count should not increment.
The output I am looking for is called "Adjusted" in the table below.
Cow Lact DIM xmast Lxmast Adjusted
0 1 1 45 1 1 1
1 1 2 25 2 1 1
2 1 2 28 3 2 1
3 2 2 70 1 1 1
4 2 2 95 2 2 2
5 2 2 98 3 3 2
6 2 2 120 4 4 3
7 2 3 80 5 1 1
In the example above, for cow 1 lact 2 the count is not incremented when the DIM goes from 25 to 28, as the difference between the two events is less than 7 days. Same for cow 2 lact 2 when it goes from 95 to 98. For the larger increments, 70 to 95 and 98 to 120, the count is increased.
Thank you for your help
John
Actually, your code to set up xmast and Lxmast can be much simplified if you use the solution with the highest upvotes in the referenced question.
Renaming your dataframe cowmast to df, you can set up xmast as follows:
df['xmast'] = df.groupby((df['Cow'] != df['Cow'].shift(1)).cumsum()).cumcount()+1
Similarly, to set up Lxmast, you can use:
df['Lxmast'] = (df.groupby([(df['Cow'] != df['Cow'].shift(1)).cumsum(),
                            (df['Lact'] != df['Lact'].shift()).cumsum()])
                  .cumcount() + 1)
Data Input
l1 =["1", "1", "1", "2", "2", "2", "2", "2"]
l2 =[1, 2, 2, 2, 2, 2, 2, 3]
l3 =[45, 25, 28, 70, 95, 98, 120, 80]
cowmast = pd.DataFrame(list(zip(l1, l2, l3)))
cowmast.columns =['Cow', 'Lact', 'DIM']
df = cowmast
Output
print(df)
Cow Lact DIM xmast Lxmast
0 1 1 45 1 1
1 1 2 25 2 1
2 1 2 28 3 2
3 2 2 70 1 1
4 2 2 95 2 2
5 2 2 98 3 3
6 2 2 120 4 4
7 2 3 80 5 1
Now, continue with the last part of your requirement, quoted below:
What I would like to do is restart the count for each cow (cow)
lactation (Lact) and only increment the count when the number of days
(DIM) between rows is more than 7.
we can do it as follows.
To make the code more readable, let's define 2 grouping sequences for the code we have so far:
m_Cow = (df['Cow'] != df['Cow'].shift()).cumsum()
m_Lact = (df['Lact'] != df['Lact'].shift()).cumsum()
Then, we can rewrite the codes to set up Lxmast in a more readable format, as follows:
df['Lxmast'] = df.groupby([m_Cow, m_Lact]).cumcount()+1
Now, turn to the main work here. Let's say we create another new column Adjusted for it:
df['Adjusted'] = (df.groupby([m_Cow, m_Lact])['DIM'].diff().abs().gt(7)
                    .groupby([m_Cow, m_Lact])
                    .cumsum() + 1)
Result:
print(df)
Cow Lact DIM xmast Lxmast Adjusted
0 1 1 45 1 1 1
1 1 2 25 2 1 1
2 1 2 28 3 2 1
3 2 2 70 1 1 1
4 2 2 95 2 2 2
5 2 2 98 3 3 2
6 2 2 120 4 4 3
7 2 3 80 5 1 1
Here, after df.groupby([m_Cow, m_Lact]), we take the column DIM and compute each row's difference from the previous row with .diff(), take the absolute value with .abs(), and check whether it is > 7 with .gt(7) - the fragment ['DIM'].diff().abs().gt(7). We then group by the same keys again with .groupby([m_Cow, m_Lact]), since this third condition applies within the grouping of the first two. Finally, we apply .cumsum() to the boolean result, so the count increments only where the third condition is true.
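For instance, on the sample data the intermediate boolean series comes out as follows (a quick check):

flags = df.groupby([m_Cow, m_Lact])['DIM'].diff().abs().gt(7)
print(flags.tolist())
# [False, False, False, False, True, False, True, False]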
Just in case you want to increment the count only when the DIM is increased by more than 7 (e.g. 70 to 78) and exclude a decrease by more than 7 (not 78 to 70), you can remove the .abs() part in the code above:
df['Adjusted'] = (df.groupby([m_Cow, m_Lact])['DIM'].diff().gt(7)
                    .groupby([m_Cow, m_Lact])
                    .cumsum() + 1)
Edit (Possible simplification depending on your data sequence)
As your sample data have the main grouping keys Cow and Lact already in sorted sequence, there's an opportunity to simplify the code further.
Different from the sample data from the referenced question, where:
col count
0 B 1
1 B 2
2 A 1 # Value does not match previous row => reset counter to 1
3 A 2
4 A 3
5 B 1 # Value does not match previous row => reset counter to 1
Here, the B in the last row is separated from the other B's, and its count must be reset to 1 rather than continue from the previous B's count of 2 (becoming 3). Hence, the grouping needs to compare the current row with the previous row to form the groups correctly. Otherwise, a plain .groupby() lumps all the B values together during processing, and the count is not correctly reset to 1 for the last entry. The sketch below illustrates the difference.
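A minimal sketch contrasting the two groupings on that sample (the column names plain and by_run are illustrative):

import pandas as pd

df2 = pd.DataFrame({'col': ['B', 'B', 'A', 'A', 'A', 'B']})

# plain groupby lumps all B's together, so the last B continues the old count
df2['plain'] = df2.groupby('col').cumcount() + 1  # last row gets 3

# grouping on consecutive runs resets the count at each run boundary
runs = (df2['col'] != df2['col'].shift()).cumsum()
df2['by_run'] = df2.groupby(runs).cumcount() + 1  # last row gets 1
print(df2)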
If your data for the main grouping keys Cow and Lact are already naturally sorted during data construction, or have been sorted by instruction such as:
df = df.sort_values(['Cow', 'Lact'])
Then, we can simplify our codes, as follows:
(when data already sorted by [Cow, Lact]):
df['xmast'] = df.groupby('Cow').cumcount()+1
df['Lxmast'] = df.groupby(['Cow', 'Lact']).cumcount()+1
df['Adjusted'] = (df.groupby(['Cow', 'Lact'])['DIM'].diff().abs().gt(7)
                    .groupby([df['Cow'], df['Lact']])
                    .cumsum() + 1)
Same result and output values in the 3 columns xmast, Lxmast and Adjusted.

How to create N groups based on conditions in columns?

I need to create groups using two columns. For example, I took shop_id and week. Here is the df:
shop_id week
0 1 1
1 1 2
2 1 3
3 2 1
4 2 2
5 3 2
6 1 5
Imagine that each group is some promo which took place in each shop consecutively (week by week). So my attempt was to sort, shift by 1 to get last_week, build booleans, and then iterate over them, incrementing a counter whenever the condition is not met:
test_df = pd.DataFrame({'shop_id':[1,1,1,2,2,3,1], 'week':[1,2,3,1,2,2,5]})
def createGroups(df, shop_id, week, group):
    '''Create groups with the same shop_id and consecutive weeks'''
    periods = []
    period = 0
    # sorting to create chronological order
    df = df.sort_values(by=[shop_id, week], ignore_index=True)
    last_week = df[week].shift(+1) == df[week] - 1
    last_shop = df[shop_id].shift(+1) == df[shop_id]
    # iterate over the booleans and increment the group by 1
    # if the shop is different or the last period is more than 1 week ago
    for p, s in zip(last_week, last_shop):
        if p and s:
            periods.append(period)
        else:
            period += 1
            periods.append(period)
    df[group] = periods
    return df

createGroups(test_df, 'shop_id', 'week', 'promo')
createGroups(test_df, 'shop_id', 'week', 'promo')
And I get the grouping I need:
shop_id week promo
0 1 1 1
1 1 2 1
2 1 3 1
3 1 5 2
4 2 1 3
5 2 2 3
6 3 2 4
However, the function seems like overkill. Any ideas on how to get the same result without a for-loop, using native pandas functions? I saw .ngroup() in the docs but have no idea how to apply it to my case. Even better would be to vectorise it somehow, but I don't know how to achieve this :(
First we want to identify the promotions (consecutive weeks within a shop), then use groupby().ngroup() to enumerate the promotions:
df = df.sort_values('shop_id')
s = df['week'].diff().ne(1).groupby(df['shop_id']).cumsum()
df['promo'] = df.groupby(['shop_id',s]).ngroup() + 1
Update: This is based on your solution:
df = df.sort_values(['shop_id', 'week'])
s = df[['shop_id', 'week']]
df['promo'] = (s['shop_id'].ne(s['shop_id'].shift()) |
               s['week'].diff().ne(1)).cumsum()
Output:
shop_id week promo
0 1 1 1
1 1 2 1
2 1 3 1
6 1 5 2
3 2 1 3
4 2 2 3
5 3 2 4

Given a value or constant, I need to only output relevant rows on Pandas

This is how my data looks like:
Day Price A Price B Price C
1 0 0 0
2 0 0 0
3 0 0 0
4 0 0 0
5 64503 43692 79982
6 86664 69990 53468
7 77924 62998 68911
8 66600 68830 94396
9 82664 89972 49614
10 59741 48904 49528
11 34030 98074 72993
12 74400 85547 37715
13 51031 50031 85345
14 74700 59932 73935
15 62290 98130 88818
I have a small python script that outputs a sum for each column. I need to input an n value (for number of days) and the summing will run and output the values.
However, for example, given n=5 (for days), I want to output only Price A/B/C rows starting from the next day (which is day 6). Hence, the row for Day 5 should be '0'.
How can I produce this logic on Pandas ?
The idea I have is to use the n input value to truncate values on the rows corresponding to that particular n day value. But how can I do this in code?
if dataframe['Day'] == n:
    dataframe['Price A'] == 0 & dataframe['Price B'] == 0 & dataframe['Price C'] == 0
You can filter rows by condition and set all columns except the first via iloc[mask, 1:]; to shift the condition to the next row, add Series.shift:
n = 5
df.iloc[(df['Day'].shift() <= n).values, 1:] = 0
print (df)
Day Price A Price B Price C
0 1 0 0 0
1 2 0 0 0
2 3 0 0 0
3 4 0 0 0
4 5 0 0 0
5 6 0 0 0
6 7 77924 62998 68911
7 8 66600 68830 94396
8 9 82664 89972 49614
9 10 59741 48904 49528
10 11 34030 98074 72993
11 12 74400 85547 37715
12 13 51031 50031 85345
13 14 74700 59932 73935
14 15 62290 98130 88818
Pseudo code:
Make sure to sort by day
Shift the price columns by n and fill in with 0
Sum accordingly
All of that can be done in one line as well; a sketch follows below.
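
A minimal sketch of those steps, assuming the frame from the question (shift(-n) drops the first n days out of the sums and fills the vacated rows with 0):

n = 5
df = df.sort_values('Day')
price_cols = ['Price A', 'Price B', 'Price C']
totals = df[price_cols].shift(-n, fill_value=0).sum()
print(totals)  # sums over days n+1 onward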
It is simply
dataframe.iloc[:n+1] = 0
This sets the values of all columns for the first n+1 rows (days 1 through n+1) to 0.
# Sample output
dataframe
a b
0 1 2
1 2 3
2 3 4
3 4 2
4 5 3
n = 1
dataframe.iloc[:n+1] = 0
dataframe
a b
0 0 0
1 0 0
2 3 4
3 4 2
4 5 3
This truncates all of the previous days. If you want to truncate only the row at position n:
dataframe.iloc[n] = 0

Selecting row from a group on highest score based on two columns

Data
Sentence Score_Unigram Score_Bigram versionId
0 As of Dat 5 1 269004158
1 Date Docum 4 3 269004158
2 As of Dat 4 1 269004158
3 Date Docum 5 3 345973060
4 x Indicate 4 1 372529352
5 Date Docum 5 3 372529352
6 1 Financial 9 1 372529352
7 020 per shar 2 0 372529352
8 Date $ in 8 1 372529352
9 Date $ in 9 4 372529352
10 4 --------- 4 1 372529352
11 Date Begin 1 0 372529352
Required Output
Sentence Score_Unigram Score_Bigram versionId
0 As of Dat 5 1 269004158
3 Date Docum 5 3 345973060
9 Date $ in 9 4 372529352
Objective
Group by versionId and get the row with the max Score_Unigram; if there is more than one such row, check the Score_Bigram column and take the row with the highest value (if there is still more than one such row, return them all).
What have I tried
maximum = 0
index_to_pick = []
for index, row_data in a.iterrows():
    if row_data['Score_Unigram'] > maximum:
        maximum = row_data['Score_Unigram']
        score_bigram = row_data['Score_Bigram']
        index_to_pick.append(index)
    elif row_data['Score_Unigram'] == maximum:
        if row_data['Score_Bigram'] > score_bigram:
            maximum = row_data['Score_Unigram']
            score_bigram = row_data['Score_Bigram']
            index_to_pick = []
            index_to_pick.append(index)
        elif row_data['Score_Bigram'] == score_bigram:
            index_to_pick.append(index)
a.loc[[index_to_pick[0]]]
Output
Sentence Score_Unigram Score_Bigram versionId
9 Date $ in 9 4 372529352
Okay, the approach is not pretty I guess (since the data is large), so I'm looking for an efficient one.
I tried idxmax, but that returns only the top one. This might be a duplicate, but I wasn't able to find one. Thanks for the help!
Use double filtering by boolean indexing within each versionId group - first keep the rows matching the group's max of Score_Unigram, then among those keep the rows matching the max of Score_Bigram:
df = df[df.groupby('versionId')['Score_Unigram'].transform('max') == df['Score_Unigram']]
df = df[df.groupby(['versionId', 'Score_Unigram'])['Score_Bigram'].transform('max') == df['Score_Bigram']]
print (df)
Sentence Score_Unigram Score_Bigram versionId
0 As of Dat 5 1 269004158
3 Date Docum 5 3 345973060
9 Date $ in 9 4 372529352
Try this on your df:
df.sort_values(['Score_Unigram', 'Score_Bigram'], ascending=False).head(1)
Output:
Sentence Score_Unigram Score_Bigram versionId
9 Date $ in 9 4 372529352
I believe you don't need to sort the data, just compare to the max values of those 2 columns:
df[(df['Score_Unigram'] == df['Score_Unigram'].max()) &
   (df['Score_Bigram'] == df['Score_Bigram'].max())]

Vectorizing conditional count in Pandas

I have a Pandas script that counts the number of readmissions to hospital within 30 days based on a few conditions. I wonder if it could be vectorized to improve performance. I've experimented with df.rolling().apply, but so far without luck.
Here's a table with contrived data to illustrate:
ID VISIT_NO ARRIVED LEFT HAD_A_MASSAGE BROUGHT_A_FRIEND
1 1 29/02/1996 01/03/1996 0 1
1 2 01/12/1996 04/12/1996 1 0
2 1 20/09/1996 21/09/1996 1 0
3 1 27/06/1996 28/06/1996 1 0
3 2 04/07/1996 06/07/1996 0 1
3 3 16/07/1996 18/07/1996 0 1
4 1 21/02/1996 23/02/1996 0 1
4 2 29/04/1996 30/04/1996 1 0
4 3 02/05/1996 02/05/1996 0 1
4 4 02/05/1996 03/05/1996 0 1
5 1 03/10/1996 05/10/1996 1 0
5 2 07/10/1996 08/10/1996 0 1
5 3 10/10/1996 11/10/1996 0 1
First, I create a dictionary with IDs:
ids = massage_df[massage_df['HAD_A_MASSAGE'] == 1]['ID']
id_dict = {id:0 for id in ids}
Everybody in this table has had a massage, but in my real dataset, not all people are so lucky.
Next, I run this bit of code:
from pandas.tseries.offsets import DateOffset

for grp, df in massage_df.groupby(['ID']):
    date_from = df.loc[df[df['HAD_A_MASSAGE'] == 1].index, 'LEFT']
    date_to = date_from + DateOffset(days=30)
    mask = ((date_from.values[0] < df['ARRIVED']) &
            (df['ARRIVED'] <= date_to.values[0]) &
            (df['BROUGHT_A_FRIEND'] == 1))
    if len(df[mask]) > 0:
        id_dict[df['ID'].iloc[0]] = len(df[mask])
Basically, I want to count the number of times someone originally came in for a massage (alone or with a friend) and then came back within 30 days with a friend. The expected result for this table would be a total of 6 readmissions, for IDs 3, 4 and 5.
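
One possible way to vectorize this (a sketch; it assumes one massage visit per ID, as in the sample, and joins each massage to that customer's later visits with a friend):

import pandas as pd

df = massage_df.copy()
df['ARRIVED'] = pd.to_datetime(df['ARRIVED'], dayfirst=True)
df['LEFT'] = pd.to_datetime(df['LEFT'], dayfirst=True)

massages = df.loc[df['HAD_A_MASSAGE'] == 1, ['ID', 'LEFT']]
friends = df.loc[df['BROUGHT_A_FRIEND'] == 1, ['ID', 'ARRIVED']]

pairs = massages.merge(friends, on='ID')
within_30 = pairs[(pairs['ARRIVED'] > pairs['LEFT']) &
                  (pairs['ARRIVED'] <= pairs['LEFT'] + pd.DateOffset(days=30))]
print(within_30.groupby('ID').size())  # IDs 3, 4 and 5 -> 2 each, 6 in total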
