Pandas Expanding Mean with Group By and before current row date - python

I have a Pandas dataframe as follows
import pandas as pd

df = pd.DataFrame([['John', '1/1/2017', '10'],
                   ['John', '2/2/2017', '15'],
                   ['John', '2/2/2017', '20'],
                   ['John', '3/3/2017', '30'],
                   ['Sue',  '1/1/2017', '10'],
                   ['Sue',  '2/2/2017', '15'],
                   ['Sue',  '3/2/2017', '20'],
                   ['Sue',  '3/3/2017', '7'],
                   ['Sue',  '4/4/2017', '20']],
                  columns=['Customer', 'Deposit_Date', 'DPD'])
What is the best way to calculate the PreviousMean column described below?
The column is the year-to-date average of DPD for that customer, i.e. it includes all DPD values up to, but not including, rows that match the current deposit date. If no previous records exist, it is null or 0.
Notes:
the data is grouped by Customer Name and expanding over Deposit Dates
within each group, the expanding mean is calculated using only values from the previous rows.
at the start of each new customer the mean is 0 or alternatively null as there are no previous records on which to form the mean
the data frame is ordered by Customer Name and Deposit_Date

Instead of grouping and expanding the mean, filter the dataframe on these conditions and calculate the mean of DPD:
Customer == current row's Customer
Deposit_Date < current row's Deposit_Date
Use df.apply to perform this operation for all rows in the dataframe:
# convert types first so the dates compare chronologically and DPD averages numerically
df['Deposit_Date'] = pd.to_datetime(df['Deposit_Date'])
df['DPD'] = df['DPD'].astype(int)
df['PreviousMean'] = df.apply(
    lambda x: df[(df.Customer == x.Customer) & (df.Deposit_Date < x.Deposit_Date)].DPD.mean(),
    axis=1)
outputs:
Customer Deposit_Date DPD PreviousMean
0 John 2017-01-01 10 NaN
1 John 2017-02-02 15 10.0
2 John 2017-02-02 20 10.0
3 John 2017-03-03 30 15.0
4 Sue 2017-01-01 10 NaN
5 Sue 2017-02-02 15 10.0
6 Sue 2017-03-02 20 12.5
7 Sue 2017-03-03 7 15.0
8 Sue 2017-04-04 20 13.0
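If you prefer 0 instead of NaN for a customer's first deposit (the question allows either), fill afterwards:
df['PreviousMean'] = df['PreviousMean'].fillna(0)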

Here's one way to exclude repeated days from the mean calculation:
import numpy as np

# create helper series which is NaN for repeated days, DPD otherwise
s = df.groupby(['Customer Name', 'Deposit_Date']).cumcount() > 0
df['DPD2'] = np.where(s, np.nan, df['DPD'])
# expanding mean per customer (expanding().mean() replaces the removed pd.expanding_mean)
df['CumMean'] = df.groupby('Customer Name')['DPD2'].transform(lambda x: x.expanding().mean())
# drop helper series
df = df.drop(columns='DPD2')
print(df)
Customer Name Deposit_Date DPD CumMean
0 John 01/01/2017 10 10.0
1 John 01/01/2017 10 10.0
2 John 02/02/2017 20 15.0
3 John 03/03/2017 30 20.0
4 Sue 01/01/2017 10 10.0
5 Sue 01/01/2017 10 10.0
6 Sue 02/02/2017 20 15.0
7 Sue 03/03/2017 30 20.0

OK, here is the best solution I've come up with thus far.
The trick is to first create an aggregated table at the customer and deposit-date level containing a shifted mean. To calculate this mean you have to calculate the cumulative sum and cumulative count first.
s = df.groupby(['Customer Name', 'Deposit_Date'])[['DPD']].agg(['count', 'sum'])
s.columns = [' '.join(col) for col in s.columns]
s.reset_index(inplace=True)
s['DPD_CumSum'] = s.groupby(['Customer Name'])['DPD sum'].cumsum()
s['DPD_CumCount'] = s.groupby(['Customer Name'])['DPD count'].cumsum()
s['DPD_CumMean'] = s['DPD_CumSum'] / s['DPD_CumCount']
s['DPD_PrevMean'] = s.groupby(['Customer Name'])['DPD_CumMean'].shift(1)
df = df.merge(s[['Customer Name', 'Deposit_Date', 'DPD_PrevMean']], how='left', on=['Customer Name', 'Deposit_Date'])
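A more compact sketch of the same shifted cumulative-mean idea, written against the question's column names ('Customer', 'Deposit_Date') and assuming Deposit_Date has been converted to datetime and DPD to a numeric type as above:

# one row per (Customer, Deposit_Date): total DPD and number of deposits that day
daily = (df.groupby(['Customer', 'Deposit_Date'], as_index=False)['DPD']
           .agg(DPD_sum='sum', DPD_count='count'))

# running mean per customer up to and including each date...
daily['CumMean'] = (daily.groupby('Customer')['DPD_sum'].cumsum()
                    / daily.groupby('Customer')['DPD_count'].cumsum())

# ...shifted so each date only sees strictly earlier dates
daily['PreviousMean'] = daily.groupby('Customer')['CumMean'].shift(1)

# bring the result back onto the original row-level frame
df = df.merge(daily[['Customer', 'Deposit_Date', 'PreviousMean']],
              on=['Customer', 'Deposit_Date'], how='left')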

Related

Pandas: Groupby and sum customer profit, for every 6 months, starting from users first transaction

I have a dataset like this:
Customer ID   Date         Profit
1             4/13/2018    10.00
1             4/26/2018    13.27
1             10/23/2018   15.00
2             1/1/2017     7.39
2             7/5/2017     9.99
2             7/7/2017     10.01
3             5/4/2019     30.30
I'd like to groupby and sum profit, for every 6 months, starting at each users first transaction.
The output ideally should look like this:
Customer ID   Date         Profit
1             4/13/2018    23.27
1             10/13/2018   15.00
2             1/1/2017     7.39
2             7/1/2017     20.00
3             5/4/2019     30.30
The closest I've been able to get on this problem is by using:
df.groupby(['Customer ID',pd.Grouper(key='Date', freq='6M', closed='left')])['Profit'].sum().reset_index()
But that doesn't start summing on each user's first transaction day.
If changing the dates is not possible (e.g. customer 2's date is 7/1/2017 and not 7/5/2017), then at least summing the profit so that it's based on each user's own 6-month purchase journey would be extremely helpful. Thank you!
I can get you buckets that start on the first of the month until you find a better solution.
df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%Y")
df = (
    df
    .set_index("Date")
    .groupby(["Customer ID"])
    .Profit
    .resample("6MS")
    .sum()
    .reset_index(name="Profit")
)
print(df)
print(df)
Customer ID Date Profit
0 1 2018-04-01 23.27
1 1 2018-10-01 15.00
2 2 2017-01-01 7.39
3 2 2017-07-01 20.00
4 3 2019-05-01 30.30
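For windows anchored on each customer's own first transaction day (what the question asks for), one possible sketch is to bucket each row by the number of whole 6-month periods elapsed since that customer's first date; the helper names (first, months, window, window_start) are just for illustration:

import pandas as pd

df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%Y")

# each customer's first transaction date, broadcast to every row
first = df.groupby("Customer ID")["Date"].transform("min")

# whole months elapsed since the first transaction, then 6-month buckets
months = (df["Date"].dt.year - first.dt.year) * 12 + (df["Date"].dt.month - first.dt.month)
months -= (df["Date"].dt.day < first.dt.day).astype(int)   # not yet a full month
window = months // 6

# label each bucket with the date it starts on (first date + a 6-month multiple)
start = [f + pd.DateOffset(months=6 * int(w)) for f, w in zip(first, window)]

out = (df.assign(window_start=start)
         .groupby(["Customer ID", "window_start"], as_index=False)["Profit"].sum())
print(out)

On the sample data above this matches the desired output (23.27 for customer 1 starting 4/13/2018, 20.00 for customer 2 starting 7/1/2017, and so on).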

Can't fill nan values in pandas even with inplace flag

I have a pandas dataframe containing NaN values for some column.
I'm trying to fill them with a default value (30), but it doesn't work.
Original dataframe:
type avg_speed
0 CAR 32.0
1 CAR NaN
2 CAR NaN
3 BIKE 16.2
4 CAR 28.5
5 SCOOTER 29.7
6 CAR 30.7
7 CAR NaN
8 BIKE NaN
9 BIKE 35.1
...
Desired result:
type avg_speed
0 CAR 32.0
1 CAR 30
2 CAR 30
3 BIKE 16.2
4 CAR 28.5
5 SCOOTER 29.7
6 CAR 30.7
7 CAR 30
8 BIKE 30
9 BIKE 35.1
My code:
def fill_with_default(pandas_df, column_name, default_value):
    print(f"Total count: {pandas_df.count()}")
    print(f"Count of Nan BEFORE: {pandas_df[column_name].isna().sum()}")
    pandas_df[column_name].fillna(default_value, inplace=True)
    print(f"Count of Nan AFTER: {pandas_df[column_name].isna().sum()}")
    return pandas_df

df = fill_with_default(df, "avg_speed", 30)
Output:
Total count: 105018
Count of Nan BEFORE: 49514
Count of Nan AFTER: 49514
The chain of dataframe transformations and the list of columns are too long to show all the steps here (join with another dataframe, drop useless columns, add useful columns, join with other dataframes, filter, etc.)
I've tried other options but they also don't work:
#pandas_df.fillna({column_name: default_value}, inplace=True)
#pandas_df.loc[pandas_df[column_name].isnull(),column_name] = default_value
...
The type of the column before applying "fillna" is float64, the same as default_value.
Therefore, my question is: what could be the potential reasons for this problem?
What kind of transformation can lead to it? This method works for another, similar dataframe; the only difference between them lies in the chain of transformations.
BTW, there is a warning in the log at this point:
/home/hadoop/.local/lib/python3.6/site-packages/pandas/core/generic.py:6287: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation:
http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._update_inplace(new_data)
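Given the SettingWithCopyWarning above, a likely cause is that pandas_df is a copy of a slice of an earlier dataframe, so the in-place fillna lands on a temporary object. A minimal sketch of the usual remedy (not a guaranteed fix for this exact pipeline): work on an explicit copy and assign the fillna result back instead of relying on inplace=True.

def fill_with_default(pandas_df, column_name, default_value):
    pandas_df = pandas_df.copy()   # detach from any parent frame
    pandas_df[column_name] = pandas_df[column_name].fillna(default_value)
    return pandas_df

df = fill_with_default(df, "avg_speed", 30)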

how to expand a string into multiple rows in dataframe?

I want to split a string into multiple rows. This is what I tried:
df.assign(MODEL_ABC = df['MODEL_ABC'].str.split('_').explode('MODEL_ABC'))
my output:
YEAR PERIOD MODEL_ABC Price Qty
0 2018 First A 75.0 25.0
If I run the split on the column individually I get the values below, but not for the entire dataframe:
A
B
this is my dataframe df
YEAR PERIOD MODEL_ABC Price Qty
0 2018 First A_B 75.0 25.0
expected output:
YEAR PERIOD MODEL_ABC Price Qty
0 2018 First A 75.0 25.0
1 2018 First B 75.0 25.0
You can do the following: start by splitting the column into lists, then explode it to create multiple rows:
df['MODEL_ABC'] = df['MODEL_ABC'].str.split('_')
df = df.explode('MODEL_ABC')
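If you prefer a single chained expression with a fresh index, a small variant (ignore_index requires pandas >= 1.1):

df = (df.assign(MODEL_ABC=df['MODEL_ABC'].str.split('_'))
        .explode('MODEL_ABC', ignore_index=True))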

How to improve function with all-to-all rows computation within a groupby object?

Say I have this simple dataframe-
dic = {'firstname': ['Steve', 'Steve', 'Steve', 'Steve', 'Steve', 'Steve'],
       'lastname': ['Johnson', 'Johnson', 'Johnson', 'Johnson', 'Johnson', 'Johnson'],
       'company': ['CHP', 'CHP', 'CHP', 'CHP', 'CHP', 'CHP'],
       'faveday': ['2020-07-13', '2020-07-20', '2020-07-16', '2020-10-14',
                   '2020-10-28', '2020-10-21'],
       'paid': [200, 300, 550, 100, 900, 650]}
df = pd.DataFrame(dic)
df['faveday'] = pd.to_datetime(df['faveday'])
print(df)
with output-
firstname lastname company faveday paid
0 Steve Johnson CHP 2020-07-13 200
1 Steve Johnson CHP 2020-07-20 300
2 Steve Johnson CHP 2020-07-16 550
3 Steve Johnson CHP 2020-10-14 100
4 Steve Johnson CHP 2020-10-28 900
5 Steve Johnson CHP 2020-10-21 650
I want to be able to keep the rows that have a faveday within 7 days of another, but also their paid columns have to sum greater than 1000.
Individually, if I wanted to apply the 7 day function, I would use-
def sefd(x):
    return np.sum((np.abs(x.values - x.values[:, None]) / np.timedelta64(1, 'D')) <= 7, axis=1) >= 2

s = df.groupby(['firstname', 'lastname', 'company'])['faveday'].transform(sefd)
df['seven_days'] = s
df = df[s]
del df['seven_days']
This would keep all of the entries (All of these are within 7 days of another faveday grouped by firstname, lastname, and company).
If I wanted to apply a function that keeps rows for the same person with the same company and a summed paid amount > 1000, I would use-
df = df[df.groupby(['lastname', 'firstname','company'])['paid'].transform(sum) > 1000]
Just a simple transform(sum) function
This would also keep all of the entries (since all are under the same name and company and sum to greater than 1000).
However, if we were to combine these two functions at the same time, one row actually would not be included.
My desired output is-
firstname lastname company faveday paid
0 Steve Johnson CHP 2020-07-13 200
1 Steve Johnson CHP 2020-07-20 300
2 Steve Johnson CHP 2020-07-16 550
4 Steve Johnson CHP 2020-10-28 900
5 Steve Johnson CHP 2020-10-21 650
Notice how index 3 is no longer valid because it's only within 7 days of index 5, but if you were to sum index 3 paid and index 5 paid, it would only be 750 (<1000).
It is also important to note that since indexes 0, 1, and 2 are all within 7 days of each other, that counts as one summed group (200 + 300 + 550 > 1000).
The logic is that I would want to first see (based on a group of firstname, lastname, and company name) whether or not a faveday is within 7 days of another. Then after confirming this, see if the paid column for these favedays sums to over 1000. If so, keep those indexes in the dataframe. Otherwise, do not.
A suggested answer given to me was-
df = df.sort_values(["firstname", "lastname", "company", "faveday"])

def date_difference_from(x, df):
    return abs((df.faveday - x).dt.days)

def grouped_dates(grouped_df):
    keep = []
    for idx, row in grouped_df.iterrows():
        within_7 = date_difference_from(row.faveday, grouped_df) <= 7
        keep.append(within_7.sum() > 1 and grouped_df[within_7].paid.sum() > 1000)
    msk = np.array(keep)
    return grouped_df[msk]

df = df.groupby(["firstname", "lastname", "company"]).apply(grouped_dates).reset_index(drop=True)
print(df)
This works perfectly for small data sets like this one, but when I apply it to a bigger dataset (10,000+ rows), some inconsistencies appear.
Is there any way to improve this code?
I found a solution that avoids looping over idx to compare whether other rows are within 7 days, but it involves unstack and reindex, so it will increase memory usage (I tried tapping into the _get_window_bounds method of rolling, but that proved above my expertise). It should be fine for the scale you describe. Although this solution is only on par with yours on the toy df you provided, it is orders of magnitude faster on larger datasets.
Edit: allow multiple deposits in one date.
Take this data (with replace=True by default in random.choice)
import string
import numpy as np
import pandas as pd

np.random.seed(123)
n = 40
df = pd.DataFrame([[a, b, b, faveday, paid]
                   for a in string.ascii_lowercase
                   for b in string.ascii_lowercase
                   for faveday, paid in zip(
                       np.random.choice(pd.date_range('2020-01-01', '2020-12-31'), n),
                       np.random.randint(100, 1200, n))
                   ], columns=['firstname', 'lastname', 'company', 'faveday', 'paid'])
df['faveday'] = pd.to_datetime(df['faveday'])
df = df.sort_values(["firstname", "lastname", "company", "faveday"]).reset_index(drop=True)
>>>print(df)
firstname lastname company faveday paid
0 a a a 2020-01-03 1180
1 a a a 2020-01-18 206
2 a a a 2020-02-02 490
3 a a a 2020-02-09 615
4 a a a 2020-02-17 471
... ... ... ... ... ...
27035 z z z 2020-11-22 173
27036 z z z 2020-12-22 863
27037 z z z 2020-12-23 675
27038 z z z 2020-12-26 1165
27039 z z z 2020-12-30 683
[27040 rows x 5 columns]
And the code
def get_valid(df, window_size=7, paid_gt=1000, groupbycols=['firstname', 'lastname', 'company']):
    # df_clean = df.set_index(['faveday'] + groupbycols).unstack(groupbycols)
    # # unstack names to bypass groupby
    df_clean = df.groupby(['faveday'] + groupbycols).paid.agg(['size', sum])
    df_clean.columns = ['ct', 'paid']
    df_clean = df_clean.unstack(groupbycols)
    df_clean = df_clean.reindex(pd.date_range(df_clean.index.min(),
                                              df_clean.index.max())).sort_index()
    # include all dates, to treat the index as integer
    window = df_clean.fillna(0).rolling(window_size + 1).sum()
    # notice fillna to prevent false NaNs while summing
    df_clean = df_clean.paid * (  # multiply times a mask for both conditions
        (window.ct > 1) & (window.paid > paid_gt)
    ).replace(False, np.nan).bfill(limit=7)
    # replacing with np.nan so we can backfill to include all dates in the window
    df_clean = (df_clean.rename_axis('faveday').stack(groupbycols)
                .reset_index(level='faveday').sort_index().reset_index())
    # reshaping to original format
    return df_clean

df1 = get_valid(df, window_size=7, paid_gt=1000,
                groupbycols=['firstname', 'lastname', 'company'])
Still running at 1.5 seconds (vs 143 seconds of your current code) and returns
firstname lastname company faveday 0
0 a a a 2020-02-02 490.0
1 a a a 2020-02-09 615.0
2 a a a 2020-02-17 1232.0
3 a a a 2020-03-09 630.0
4 a a a 2020-03-14 820.0
... ... ... ... ... ...
17561 z z z 2020-11-12 204.0
17562 z z z 2020-12-22 863.0
17563 z z z 2020-12-23 675.0
17564 z z z 2020-12-26 1165.0
17565 z z z 2020-12-30 683.0
[17566 rows x 5 columns]

Pandas/Python add a row based on condition

YY_MM_CD customerid pol_no type WE WP
2019-07 15680 1313145 new 3 89
2020-01 14672 1418080 renwd -8 223
2019-01 15681 1213143 new 4 8
2019-01 15683 1213344 new -6 3
2019-03 14678 1418280 renwd -66 -7
Now, I have several years of data and I am taking a snapshot of 2019-20. Suppose a customer in this snapshot paid a premium on 01/11/2019 but did not pay a premium on 01/11/2020; that record will then not be in the data. I need to create a dummy record for that customer, like the one for customer id 15681 below, and keep their WE and WP as 0 and 0 since the customer didn't pay.
YY_MM_CD customerid pol_no type WE WP
2019-07 15680 1313145 new 3 89
2020-01 14672 1418080 renwd -8 223
2019-01 15681 1213143 new 4 8
2020-01 15681 1213143 new 0 0
2019-03 14678 1418280 renwd -66 -7
Don't create a dummy data point. Write the expiration date next to each customer id. Then, when accessing the data, just check whether the current date is before the expiration date.
Simpler and cleaner. A small sketch of this idea follows.
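A minimal sketch of that idea, assuming YY_MM_CD parses as a monthly period and that a payment covers the following 12 months (both assumptions for illustration):

import pandas as pd

# treat YY_MM_CD as a monthly period and derive an expiration month per row
df['YY_MM_CD'] = pd.PeriodIndex(df['YY_MM_CD'], freq='M')
df['expires'] = df['YY_MM_CD'] + 12          # assumed 12-month coverage

# when reading the data for a snapshot month, check coverage instead of
# looking for a dummy renewal row
snapshot = pd.Period('2020-01', freq='M')
active = df[(df['YY_MM_CD'] <= snapshot) & (snapshot < df['expires'])]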
If you would like to do what you asked (add a row or column based on a condition):
You need to group the customers
Use a lambda function to add your condition
For example:
new_df = pd.DataFrame()
df = YOURDATA
groups = df.groupby("customerid")
for name, group in groups:            # each item is a (customerid, sub-frame) pair
    if len(group) < 2:                # your condition
        df2 = pd.DataFrame( ADD YOUR DATA HERE )
        new_df = pd.concat([new_df, df2], ignore_index=True)
At the end you can combine new_df and df with concat: https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html
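A more concrete, hypothetical sketch of adding the missing renewal rows: shift every row forward by one year, zero out WE and WP, and keep only the shifted rows whose (month, customerid) key does not already exist in the data. The one-year offset and the zeroed columns are assumptions based on the example above.

import pandas as pd

df['YY_MM_CD'] = pd.PeriodIndex(df['YY_MM_CD'], freq='M')

# candidate dummy rows: same customer and policy, one year later, zero amounts
dummies = df.copy()
dummies['YY_MM_CD'] = dummies['YY_MM_CD'] + 12
dummies[['WE', 'WP']] = 0

# keep only candidates whose (month, customer) combination is absent from the data
key = ['YY_MM_CD', 'customerid']
dummies = dummies.merge(df[key].drop_duplicates(), on=key, how='left', indicator=True)
dummies = dummies[dummies['_merge'] == 'left_only'].drop(columns='_merge')

df = pd.concat([df, dummies], ignore_index=True).sort_values(key)

In practice you would also restrict the candidate rows to months inside your 2019-20 snapshot window before concatenating.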
