Calculate recurring customers - python

I'm analyzing sales data from a shop and want to calculate the percentage of first-order customers who turn into recurring customers in the following month.
I have a DataFrame with all the orders. Each row includes a customer id, a date, and a flag indicating whether this is the customer's first order. This is my data:
import pandas as pd
data = {'Name': ['Tom', 'nick', 'krish', 'Tom'],
        'First_order': [1, 1, 1, 0],
        'Date': ['01-01-2018', '01-01-2018', '01-01-2018', '02-02-2018']}
df = pd.DataFrame(data)
I would now create a list of all new customers in January and a list of all recurring customers in February and inner-join them. Then I have two numbers with which I could calculate the percentage.
But I have no clue how I could calculate this rolling over a whole year without looping over the data frame. Is there a nice pandas/python way to do so?
The goal would be to have a new dataframe with the month and the percentage of recurring customers from the previous month.
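For concreteness, the manual January/February version described above can be sketched with a merge (my own sketch, not from the question; it assumes the sample df above with Date parsed as day-month-year):

```python
import pandas as pd

data = {'Name': ['Tom', 'nick', 'krish', 'Tom'],
        'First_order': [1, 1, 1, 0],
        'Date': ['01-01-2018', '01-01-2018', '01-01-2018', '02-02-2018']}
df = pd.DataFrame(data)
df['Date'] = pd.to_datetime(df['Date'], format='%d-%m-%Y')

# New customers in January
jan_new = df[(df['Date'].dt.month == 1) & (df['First_order'] == 1)]
# Everyone who ordered in February
feb = df[df['Date'].dt.month == 2]
# Inner join on Name -> January first-timers who came back in February
recurring = jan_new.merge(feb[['Name']], on='Name', how='inner')
pct = 100 * recurring['Name'].nunique() / jan_new['Name'].nunique()
```

Here pct comes out to about 33.3, since Tom is the only one of the three January first-order customers who reordered in February. The question is how to do this for every month at once.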

One thought would be to take all orders from January through November and add a column "recurr" which gives you a True/False based on whether this customer ordered again in the next month. Then you can take a per-month groupby with a count of rows and a sum of the Trues, and add a column giving the ratio.
EDIT: before this you may need to convert dates:
df.Date = pd.to_datetime(df.Date)
Then:
df['month'] = df['Date'].apply(lambda x: x.month) #for simplicity's sake; not hard to extend to MMYYYY
df1 = df[df.month != 12].copy() #now we select everything but December (it has no following month to check)
df1 = df1[df1.First_order == 1].copy() #and filter out non-first orders
df1['recurr'] = df1.apply(lambda x: len(df[(df.month == x.month + 1) & (df.Name == x.Name)]) > 0, axis=1) #True if the same person has an order next month
df2 = df1[['month','Name','recurr']].groupby('month').agg({'Name':'count','recurr':'sum'})
At this point, for each month, the "Name" column has the number of first orders and the "recurr" column has the number of those that ordered again the following month. A simple extra column gives you the percentage:
df2['percentage_of_recurring_customer'] = (df2.recurr/df2.Name)*100
EDIT: For any number of dates, here's a clumsy solution. Choose a start date and use that year's January as month 1, and number all months sequentially after that.
df.Date = pd.to_datetime(df.Date)
start_year = df.Date.min().year
def get_month_num(date):
    return (date.year - start_year) * 12 + date.month
Now that we have a function to convert dates, the slightly changed code:
df['month'] = df['Date'].apply(lambda x: get_month_num(x))
df1 = df[df.First_order == 1].copy()
df1['recurr'] = df1.apply(lambda x: len(df[(df.month == x.month + 1) & (df.Name == x.Name)]) > 0, axis=1)
df2 = df1[['month','Name','recurr']].groupby('month').agg({'Name':'count','recurr':'sum'})
Finally, you can make a function to revert your month numbers into dates:
def restore_month(month_num):
    year = (month_num - 1) // 12 + start_year #integer division rounds down
    month = (month_num - 1) % 12 + 1 #shift by one so month 12 maps to December, not 0
    return pd.Timestamp(str(year) + '-' + str(month) + '-1') #This returns the first of that month
df3 = df2.reset_index().copy() #removing month from index so we can change it.
df3['month_date'] = df3['month'].apply(lambda x: restore_month(x))
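As an aside (my own sketch, not part of the answer above): pandas Periods do this month arithmetic natively, so the numbering and restoring steps can be avoided entirely. Using the same sample df:

```python
import pandas as pd

data = {'Name': ['Tom', 'nick', 'krish', 'Tom'],
        'First_order': [1, 1, 1, 0],
        'Date': ['01-01-2018', '01-01-2018', '01-01-2018', '02-02-2018']}
df = pd.DataFrame(data)
df['Date'] = pd.to_datetime(df['Date'], format='%d-%m-%Y')
df['month'] = df['Date'].dt.to_period('M')  # Period handles year boundaries for us

firsts = df[df['First_order'] == 1].copy()
orders = set(zip(df['Name'], df['month']))  # every (customer, month) with any order
# m + 1 on a Period is simply "next month"; a December lookup just misses and gives False
firsts['recurr'] = [(n, m + 1) in orders for n, m in zip(firsts['Name'], firsts['month'])]
out = firsts.groupby('month').agg(first_orders=('Name', 'count'), recurr=('recurr', 'sum'))
out['percentage'] = 100 * out['recurr'] / out['first_orders']
```

The set lookup also avoids the per-row DataFrame scan inside apply, which matters once the frame gets large.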

Related

Calculate Monthly Attrition Rate

I have the following dataframe df
import pandas as pd
import random
dates = pd.date_range(start = "2015-06-02", end = "2022-05-02", freq = "3D")
boolean = [random.randint(0, 1) for i in range(len(dates))]
boolean = [bool(x) for x in boolean]
df = pd.DataFrame(
    {"Dates": dates,
     "Boolean": boolean}
)
I then add the following attributes and group the data:
df["Year"] = df["Dates"].dt.year
df["Month"] = df["Dates"].dt.month
df.groupby(by = ["Year", "Month", "Boolean"]).size().unstack()
Which gets me a table of True/False counts for each year and month.
What I need to do is the following:
Calculate the attrition rate for the most recent complete month (say 30 days). To do this I need to count the number of occurrences where Boolean == False at the beginning of this 1-month period, then count the number of occurrences where Boolean == True within this 1-month period. I then use these two numbers to get the attrition rate, which I think would be sum(True occurrences within the 1-month period) / sum(False occurrences at the beginning of the 1-month period).
I would then use this same approach to calculate the attrition rate for the entire historical period (that is, all months between 2015-06-02 and 2022-05-02).
My Current Thinking
I'm wondering if I also need to derive a day attribute (that is, df["Day"] = df["Dates"].dt.day). Once I have this, do I just need to perform the necessary arithmetic over the days in each month in each year?
Please help, I am struggling with this quite a bit
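Since the question leaves "False at the beginning of the 1-month period" somewhat open, here is a sketch under the assumption that it means the month's own False count; adjust the denominator if your definition differs:

```python
import random
import pandas as pd

random.seed(42)  # reproducible toy data, as in the question
dates = pd.date_range(start="2015-06-02", end="2022-05-02", freq="3D")
boolean = [bool(random.randint(0, 1)) for _ in range(len(dates))]
df = pd.DataFrame({"Dates": dates, "Boolean": boolean})

# True/False counts per (Year, Month) -- the same groupby as above, unstacked
counts = (df.groupby([df["Dates"].dt.year.rename("Year"),
                      df["Dates"].dt.month.rename("Month"),
                      "Boolean"])
            .size().unstack(fill_value=0))

# Attrition per month: True occurrences divided by False occurrences
attrition = counts[True] / counts[False]

# The last (Year, Month) group is a partial month, so the most recent
# complete month is the second-to-last row
latest = attrition.iloc[-2]
# Whole-period rate from the overall totals
overall = counts[True].sum() / counts[False].sum()
```

No day attribute is needed: grouping on Year and Month already slices the data into calendar months, and the per-month counts carry all the arithmetic.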

Take user input for number of months back and filter data based on that date

I am working on a problem where I have to take user input which is an integer indicating the number of months I have to look back at. For example if I want to look at the data 3 months back I must take input from user as 3.
Based on this integer user input I have to filter my dataset.
For example, today's date is 8/30/2022, so 3 months back will be 5/30/2022.
Now I want to filter my dataframe to include only those rows from this date 3 months back, i.e. 5/30/2022.
I tried using the datetime and dateutil relativedelta libraries but nothing seems to work for me.
Below is an example of my dataframe:
id text1 text2 date
1 Ram patient 5/30/2022 10:22:00
2 John patient 5/30/2022 11:45:08
3 Rich child 5/28/2022 10:45:13
so I want output to be rows corresponding to 1 and 2
You can use the pd.DateOffset class in Pandas. See the documentation for more details.
An example of this is below which assumes you have a data frame with a date column:
num_months = int(input('Please enter the number of months to look back: '))
df['date'] = pd.to_datetime(df['date'])
past_date = pd.to_datetime('today') - pd.DateOffset(months=num_months)
df = df[df['date'] >= past_date]
print(df)
The above will filter the data frame on the date column leaving only those rows where the date is on or after the calculated date i.e. today's date minus the specified number of months.
Please try to show your attempt in future; this feels like I'm doing your homework, but hopefully this gives you an idea.
import pandas as pd
df = pd.DataFrame({'id': [1, 2, 3], 'text1': ['Ram', 'John', 'Rich'], 'text2': ['patient', 'patient', 'child'], 'date': ['5/30/2021 10:22:00', '5/30/2022 11:45:08', '5/28/2022 10:45:13']})
user_input = int(input("Enter the number of months to look back: "))
#convert the date column to datetime
df['date'] = pd.to_datetime(df['date'])
#set the date column as index
df = df.set_index('date')
#sort the index
df = df.sort_index()
#filter the dataframe to get the rows which are within the last x months
df = df[df.index >= pd.Timestamp.today() - pd.DateOffset(months=user_input)]
#convert the index back to a column
df.reset_index(inplace=True)
#print the filtered dataframe
print(df)
Input:
Enter the number of months to look back: 12
Output:
date id text1 text2
0 2022-05-28 10:45:13 3 Rich child
1 2022-05-30 11:45:08 2 John patient

Group Pandas Dataframe by Year and Month

I have the following dataframe dft with two columns 'DATE' and 'Income'
dft = pd.DataFrame(chunk, columns=['DATE','Income'])
dft['DATE'] = pd.to_datetime(dft['DATE'], format='%m/%d/%Y')
dft = dft.sort_values(by='DATE', ascending=True)
I am now trying to sum the data up for each month of each year. This would mean the new dataframe has two columns like Jan 2012 and then the income for that month in that year. I can do this for just a month by using the following code but this doesn't take into account the year that month sits in. Is there a way I can groupby month and year?
monthlyincome = dft.groupby(dft['DATE'].dt.strftime('%B'))[['Income']].sum().reset_index()
The end goal is to then put this into a bar chart. I was thinking converting into two lists and then using something like:
plt.bar(xaxis,yaxis)
How can I get this to work?
Final Solution was:
dft = pd.DataFrame(chunk, columns=['DATE','Income'])
dft['DATE'] = pd.to_datetime(dft['DATE'], format='%m/%d/%Y')
dft = dft.sort_values(by='DATE', ascending=True)
periods = dft.DATE.dt.to_period("M")
group = dft.groupby(periods)['Income'].sum()
group = group.reset_index()
Thanks to Mayank.
Try this:
periods = dft.DATE.dt.to_period("M")
group = dft.groupby(periods)['Income'].sum()
This should return you year and month combined.
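For the bar chart mentioned in the question, the Period index converts cleanly to the label/value lists plt.bar expects (a sketch with made-up sample rows, since the original chunk isn't shown):

```python
import pandas as pd

# Hypothetical stand-in for the question's chunk data
dft = pd.DataFrame({'DATE': ['01/15/2012', '01/20/2012', '02/03/2012'],
                    'Income': [100, 50, 75]})
dft['DATE'] = pd.to_datetime(dft['DATE'], format='%m/%d/%Y')

monthly = dft.groupby(dft['DATE'].dt.to_period('M'))['Income'].sum()

# Lists for plt.bar(xaxis, yaxis)
xaxis = monthly.index.strftime('%b %Y').tolist()  # e.g. 'Jan 2012'
yaxis = monthly.tolist()
```

With these two lists, plt.bar(xaxis, yaxis) labels each bar with its month and year directly, so no separate lookup from Period back to string is needed.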

Calculating cumulative returns mid-year to mid-year Pandas

I have a Pandas dataframe of size (80219 * 5) with the columns 'Company Name', 'Returns' and 'Names Date' (among others). The data can range from 2002-2016 for each company, but where values are missing the data either starts at a later date or ends at an earlier date.
What I would like to do is to calculate yearly compounded returns measured from June to June for each company. If there is no data for the specific company for the full 12 months period from June to June the result should be nan. Below is my current code, but I don't know how to calculate the returns from June to June.
After having loaded the file and cleaned it I:
df[['Returns']] = df[['Returns']].apply(pd.to_numeric)
df['Names Date'] = pd.to_datetime(df['Names Date'])
df['Returns'] = df['Returns']+ 1
df = df[['Company Name','Returns','Names Date']]
df['year']=df['Names Date'].dt.year
df['cum_return'] = df.groupby(['Company Name','year']).cumprod()
df = df.groupby(['Company Name','year']).nth(11)
print(tabulate(df, headers='firstrow', tablefmt='psql'))
Which calculates the annual return from the 1st of January to the 31st of December.
I finally found a way to do it. The easiest way I could find is to calculate a rolling 12-month compounded return for each month and then slice the dataframe to give me the 12-month returns of the months I want:
import numpy as np

def myfunc(arr):
    return np.cumprod(arr)[-1]

parts = []
for name, group in df.groupby('Company Name'):
    parts.append(group['Returns'].rolling(12).apply(myfunc)) #NaN until 12 months of data exist
df['Cum returns'] = pd.concat(parts)
df = df.loc[df['Names Date'].dt.month == 6]
df['Names Date'] = df['Names Date'].dt.year
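An alternative sketch of the same June-to-June idea (my own illustration, not from the answer above): shift each month back six periods so every July-June window shares a single "fiscal year" label, then require exactly 12 observations so incomplete windows come out as NaN:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the question's data: one company, 24 months of +1%
# returns already stored as 1 + r
df = pd.DataFrame({
    'Company Name': ['A'] * 24,
    'Names Date': pd.period_range('2002-07', periods=24, freq='M').to_timestamp(),
    'Returns': np.full(24, 1.01),
})

# July 2002 .. June 2003 all map to fiscal year 2002, and so on
df['fy'] = (df['Names Date'].dt.to_period('M') - 6).dt.year

yearly = (df.groupby(['Company Name', 'fy'])['Returns']
            .agg(lambda s: s.prod() if len(s) == 12 else np.nan))
```

This avoids computing a rolling product for every month when only the June endpoints are wanted, and the len(s) == 12 guard implements the "NaN unless the full 12 months exist" requirement directly.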

Mapping Values in a pandas Dataframe column?

I am trying to filter out some data and seem to be running into some errors.
Below this statement is a replica of the following code I have:
url = "http://elections.huffingtonpost.com/pollster/2012-general-election-romney-vs-obama.csv"
source = requests.get(url).text
s = StringIO(source)
election_data = pd.DataFrame.from_csv(s, index_col=None).convert_objects(
convert_dates="coerce", convert_numeric=True)
election_data.head(n=3)
last_day = max(election_data["Start Date"])
filtered = election_data[((last_day-election_data['Start Date']).days <= 5)]
As you can see, last_day is the max of the Start Date column of election_data.
I would like to filter the data so that the difference between
the max and each Start Date is less than or equal to 5 days.
I have tried using for-loops, and various combinations of list comprehension.
filtered = election_data[map(lambda x: (last_day - x).days <= 5, election_data["Start Date"]) ]
This line would normally work; however, Python 3 gives me the following error:
<map object at 0x10798a2b0>
Your first attempt has it almost right. The issue is
(last_day - election_data['Start Date']).days
which should instead be
(last_day - election_data['Start Date']).dt.days
Series objects do not have a days attribute, only TimedeltaIndex objects do. A fully working example is below.
data = pd.read_csv(url, parse_dates=['Start Date', 'End Date', 'Entry Date/Time (ET)'])
data.loc[(data['Start Date'].max() - data['Start Date']).dt.days <= 5]
Note that I've used Series.max which is more performant than the built-in max. Also, data.loc[mask] is slightly faster than data[mask] since it is less-overloaded (has a more specialized use case).
If I understand your question correctly, you just want to keep any row whose Start Date value is <= 5 days away from the last day. This sounds like something pandas indexing could easily handle, using .loc.
If you want an entirely new DataFrame object with the filtered data:
election_data # your frame
last_day = election_data["Start Date"].max()
cutoff = pd.Timedelta(days=5) # how far back from the last day to keep
new_df = election_data.loc[(last_day - election_data["Start Date"]) <= cutoff]
Or if you just want the Start Date column post-filtering:
filtered_dates = election_data.loc[(last_day - election_data["Start Date"]) <= cutoff, "Start Date"]
Note that subtracting a datetime column from a Timestamp gives a Timedelta Series, so the threshold must be a Timedelta such as pd.Timedelta(days=5), not a plain date.
