I have a dataset with millions of records, just like the one below:

CustomerID    StartTime     EndTime
1111          2015-7-10     2016-3-7
1112          2016-1-5      2016-1-19
1113          2015-10-18    2020-9-1
This dataset contains the information for different subscription contracts, and the following is assumed:
If the contract is active, the customer pays a monthly fee in advance; the first payment is collected on the start date.
If the contract ends before the next payment date, which is exactly one month after the last payment date, the customer does not need to pay the next subscription. For instance, customer 1112 only needs to pay once.
The monthly payment fee is $10.
In this situation, I need to calculate monthly, quarterly, and annual sales between 2015 and 2020. Ideally the result would also break sales down by CustomerID so that subsequent machine learning tasks can be performed.
Importing data (I saved your table as a .csv in Excel, which is the reason for the specific formatting of the pd.to_datetime):
import pandas as pd
import numpy as np
df = pd.read_csv("Data.csv", header=0)
# convert columns "to_datetime"
df["StartTime"] = pd.to_datetime(df["StartTime"], format="%d/%m/%Y")
df["EndTime"] = pd.to_datetime(df["EndTime"], format="%d/%m/%Y")
Calculate the number of months between the start and end dates (+1 at the end because payment is made in advance, so there is a payment even if the contract is not active for a whole month):
df["Months"] = ((df["EndTime"] - df["StartTime"])/np.timedelta64(1, 'M')).astype(int) + 1
Generate a list of payment dates (from the start date, for the given number of months, one month apart). The pd.tseries.offsets.DateOffset(months=1) will ensure that the payment date is on the same day every month, rather than the default end-of-month if freq="M".
df["PaymentDates"] = df.apply(lambda x: list(pd.date_range(start=x["StartTime"], periods=x["Months"], freq=pd.tseries.offsets.DateOffset(months=1)).date), axis=1)
Create a new row for each payment date, add a payment column of 10, then pivot so that the CustomerID is the column, and the date is the row:
df = df.explode("PaymentDates").reset_index(drop=True)
df["PaymentDates"] = pd.to_datetime(df["PaymentDates"])
df["Payment"] = 10
df = pd.pivot_table(df, index="PaymentDates", columns="CustomerID", values="Payment")
Aggregate for month, quarter, and year sales (this will be an aggregation for each individual CustomerID). You can then sum across each row to get a total amount:
months = df.groupby([df.index.year, df.index.month]).sum()
quarters = df.groupby([df.index.year, df.index.quarter]).sum()
years = df.groupby(df.index.year).sum()
# total sales
months["TotalSales"] = months.sum(axis=1)
quarters["TotalSales"] = quarters.sum(axis=1)
years["TotalSales"] = years.sum(axis=1)
I realise the df.apply may be slow if you have millions of records, and there may be other ways to complete this, but this is what I have thought of.
You will also have a lot of columns if there are many millions of customers, but this way you keep all the CustomerID values separate and can tell which customers made payments in a given month.
After the number of months is calculated in df["Months"], you could also multiply it by 10 to get the total sales amount for each customer.
If that is the only per-customer figure you need, you would not need to pivot the data at all: just aggregate on the "PaymentDates" column, count the rows, and multiply by 10 to get the sales per month, quarter, or year, as in the sketch below.
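A minimal sketch of that simpler route, using the frame as it is after the explode step above but before the pivot (one row per payment, with a "Payment" column of 10; the resample aliases are one reasonable choice):

sales = df.set_index("PaymentDates")  # df after explode, before the pivot

monthly_sales = sales["Payment"].resample("M").sum()    # sales per calendar month
quarterly_sales = sales["Payment"].resample("Q").sum()  # sales per calendar quarter
annual_sales = sales["Payment"].resample("A").sum()     # sales per calendar year

per_customer = sales.groupby("CustomerID")["Payment"].sum()  # lifetime sales per customer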
Related
I have a dataframe of 8,000 transactions. The assets have all been sold/resold at least twice. Each transaction has a sales date and a unique identifier, and the resales share a common identifier (TaxID). I would like to calculate the difference between the sale prices for each common identifier, taking the date of the sale into account.
E.g., Asset 1 sold 1 Jan 2021, then again 15 Jan 2022; the TaxID is common to the two. How do I calculate, across these 8,000 rows, the difference in sale price between common identifiers? Some assets sold 3 or 4 times.
import pandas as pd

df = pd.read_csv('transactions.csv')

# sort so that each asset's sales appear in chronological order
df = df.sort_values(['TaxID', 'sale_date'])

# difference in sale price between consecutive sales of the same asset
df['diff'] = df.groupby('TaxID')['sale_price'].diff()
Using pandas, you can calculate the difference in sale price between common identifiers by first sorting the transactions by identifier and sale date, so that each asset's sales are in chronological order, and then grouping them by the shared identifier (TaxID). Applying the diff() method to the sale price within each group then gives the change in price between each resale of an asset; the first sale of each asset has no previous sale, so its difference is NaN.
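As a quick sanity check, here is the same approach run on a tiny made-up frame modelled on the Asset 1 example (the column names TaxID, sale_date, and sale_price are carried over from the snippet above):

import pandas as pd

demo = pd.DataFrame({
    'TaxID': ['A1', 'A1', 'B2', 'B2', 'B2'],
    'sale_date': pd.to_datetime(['2021-01-01', '2022-01-15',
                                 '2020-03-01', '2021-03-01', '2022-03-01']),
    'sale_price': [100000, 120000, 50000, 55000, 70000],
})

demo = demo.sort_values(['TaxID', 'sale_date'])
demo['diff'] = demo.groupby('TaxID')['sale_price'].diff()
print(demo)
# A1's second sale shows a diff of 20000; every first sale is NaN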
Say I have the following data (please note that this data set is overly simplified and is for illustrative use only - it is not the actual data I am working with):

df = pd.DataFrame({"start_date": ["2010-05-03", "2010-06-02", "2011-06-02",
                                  "2011-07-21", "2012-11-05"],
                   "boolean": [True, True, False, True, False]})
#converting start_date to datetime object
df["start_date"] = pd.to_datetime(df["start_date"], format = "%Y-%m-%d")
#Deriving year and month attributes
df["year"] = df["start_date"].dt.year
df["month"] = df["start_date"].dt.month
I then derive the following dataframe:
df2 = df.groupby(by = ["year", "month", "boolean"]).size().unstack()
This code produces the table I want: a multi-index dataframe with a count of True/False values for each year and month (the image of the table is not included here).
I get a nice looking time series plot with the following code (the image of which I have not included here):
df2.plot(
kind = "line",
figsize = (14, 4)
)
What I want is the following:
I need a way to find the number of current customers at the beginning of each month (that is, a count of the number of times boolean == False for each month).
I need a way to find the number of lost customers within each month (that is, a count of the number of times boolean == True).
I would then use these two numbers to get an attrition rate per month (the number of customers lost within each month, divided by the total number of customers at the start of that month).
I have an idea as to how to get what I want but I don't know how to implement it with code.
My thinking was that I'd need to first derive a "day" attribute (e.g., df["start_date"].dt.day) - with this attribute, I would have the beginning of each month. I would then count the number of current customers at the start of each month (which I think would be the sum total of current customers from the previous month) and then count the number of lost customers within each month (which would be the number of times "boolean == True" occurred between the first day of each month and the last day of each month). I'd then use these two numbers to get the customer attrition rate.
Once I had the monthly attrition rate, I would then plot it on a time-series graph. A sketch of this idea follows.
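A minimal sketch of that plan, under the stated assumptions that boolean == True marks a customer lost in that month, boolean == False marks a new current customer, and the month-start base is the running total of current customers from all previous months:

import pandas as pd

# df as above, with "start_date" parsed to datetime and a "boolean" column
monthly = (df.set_index("start_date")["boolean"]
             .resample("MS")
             .agg(["sum", "count"]))
monthly.columns = ["lost", "total"]                     # lost = count of True per month
monthly["gained"] = monthly["total"] - monthly["lost"]  # count of False per month

# customers at the start of each month: running total of customers gained
# in all previous months (NaN for the very first month)
monthly["start_base"] = monthly["gained"].cumsum().shift(1)

monthly["attrition_rate"] = monthly["lost"] / monthly["start_base"]
monthly["attrition_rate"].plot(kind="line", figsize=(14, 4))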
I'm working with a dataframe that has daily information (measured data) across 30 years for different variables. I am trying to group by day of the year, and then find a mean across the 30 years. How do I go about this? (The dataframe itself is not shown here.)
I tried to group by day after checking the type of the YYYYMMDD column (it's int64), but that has just added new columns for Day, Month, and Year.
I'm a bit stuck on how to calculate means from here; I would need to somehow group all Jan-1sts, Jan-2nds, etc. over the 30 years and then average them.
You can group by month and day:
df.index = pd.to_datetime(df.index)
(df.groupby([df.index.month, df.index.day]).mean()
   .reset_index()
   .rename({'level_0': 'month', 'level_1': 'day'}, axis=1))
or, if you want to group them by the day of year, i.e. 1, 2, ..., 365, group on dayofyear directly (note that as_index=False alone will not do this):
df.groupby(df.index.dayofyear).mean()
So I have sales data that I'm trying to analyze. I have datetime data ["Order Date Time"] and I'd like to see the most common hours for sales but more importantly I'd like to see what minutes have NO sales.
I have been spinning my wheels for a while and I can't get my brain around a solution. Any help is greatly appreciated.
I import the data:
df = pd.read_excel('Audit Period.xlsx')
print(df)
I clean up the data:
# Drop rows where "Order Date Time" is null and keep only that column
time_df = df[df["Order Date Time"].notnull()]
# Reset the index so it is sequential again
time_df = time_df[["Order Date Time"]].reset_index(drop=True)
# Select the first 10 rows
time_df.head(10)
I convert to datetime and I look at the month totals:
# Convert "Order Date Time" to datetime
time_df = time_df.copy()
time_df["Order Date Time"] = pd.to_datetime(time_df["Order Date Time"])
time_df = time_df.set_index(time_df["Order Date Time"])
# Group by month
grouped = time_df.resample("M").count()
time_df = pd.DataFrame({"count": grouped.values.flatten()}, index=grouped.index)
time_df.head(10)
I try to group by hour, but that gives me totals per date and time window rather than totals per hour of day (e.g., every order ever placed at noon):
# Group by hour
grouped = time_df.resample("2H").count()
time_df = pd.DataFrame({"count": grouped.values.flatten()}, index=grouped.index)
time_df.head(10)
And that is where I'm stuck. I'm trying to integrate the below suggestions but can't quite get a grasp on them yet. Any help would be appreciated.
Not sure if this is the most brilliant solution, but I would start by generating a dataframe at the level of detail I wanted, whether that is 1-hour intervals, 5-minute intervals, etc. Then in your df with all the actual data, you could do your grouping as you currently are doing it above. Once it is grouped, join the two. That way you have one dataframe that includes empty rows associated with time spans with no records. The tricky part will just be making sure you have your date and time formatted in a way that it will match and join properly.
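A minimal sketch of that join/reindex idea, assuming time_df still has its DatetimeIndex of raw order timestamps (i.e., as it is right after the set_index step above, before being overwritten by counts); the minute-level granularity and variable names are illustrative:

import pandas as pd

# count orders per minute by flooring each timestamp to the minute
per_minute = time_df.groupby(time_df.index.floor("min")).size()

# build a complete minute-level index spanning the audit period
full_range = pd.date_range(start=per_minute.index.min(),
                           end=per_minute.index.max(),
                           freq="min")

# minutes with no sales now appear explicitly with a count of 0
per_minute = per_minute.reindex(full_range, fill_value=0)
no_sales_minutes = per_minute[per_minute == 0]

# most common hours of day for sales (all orders ever placed at noon, etc.)
by_hour = time_df.groupby(time_df.index.hour).size().sort_values(ascending=False)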
I have sales data (revenue and units) by Customer, by Product (type, id, description), by "fiscal quarter id", where the fiscal quarters are unique to this company and are not regular (i.e., not the exact same number of days for each).
I want (I think?) to "split" each row into two effective observations/transactions to allocate the proper share of the units and revenue to the two regular calendar quarters that the fiscal quarter straddles.
I also have a table (df2) that maps each of the company's fiscal quarters to calendar start and end dates.
Tiny sample:
df1 = pd.DataFrame({'fisc_q_id': ['2013Q1', '2013Q2'],
'cust':['Faux Corp', 'Notaco'],
'prod_id':['ABC-123', 'DEF-456'],
'revenue':[100, 400]})
df2 = pd.DataFrame({'fisc_q_id': ['2013Q1', '2013Q2'],
'fq_start':['2012-07-29', '2012-10-28'],
'fq_end':['2012-10-27', '2013-01-26']})
Desired output would be FOUR rows, each keeping the original "fiscal quarter ID", but would add a column with the appropriate calendar quarter and the allocated revenue for that quarter.
I have some ideas as to how this might work, but my solution -- if I could even get to one -- would surely be inelegant compared to what you guys can offer.
IIUC (if I understand correctly):
import numpy as np
import pandas as pd

# Merge the dataframes
df3 = df1.merge(df2)

# Coerce the dates into datetime
df3.fq_start = pd.to_datetime(df3.fq_start)
df3.fq_end = pd.to_datetime(df3.fq_end)

# Calculate the calendar quarter for the start and end dates
df3['fq_startquarter'] = pd.PeriodIndex(df3.fq_start, freq='Q')
df3['fq_endquarter'] = pd.PeriodIndex(df3.fq_end, freq='Q')

# End date of the first calendar quarter in the range, and hence the day
# counts on either side of the partition (inclusive of both endpoints)
df3['EndQdate'] = df3['fq_start'].dt.to_period('Q').dt.end_time.dt.normalize()
df3['days1'] = (df3['EndQdate'] - df3['fq_start']).dt.days + 1
df3['days2'] = (df3['fq_end'] - df3['EndQdate']).dt.days
df3['days0'] = (df3['fq_end'] - df3['fq_start']).dt.days + 1  # = days1 + days2
df3.drop(columns=['EndQdate'], inplace=True)

# Melt the two calculated quarters into rows
df4 = pd.melt(df3,
              id_vars=['fisc_q_id', 'cust', 'prod_id', 'revenue',
                       'fq_start', 'fq_end', 'days1', 'days2', 'days0'],
              value_name='CalendarQuarter')
# A stable sort keeps the start-quarter row ahead of the end-quarter row
df4.sort_values(by='prod_id', kind='mergesort', inplace=True)

# Number the two quarters within each product so the day counts can be
# allocated to the correct side of the partition
df4['daysp'] = df4.groupby('prod_id')['CalendarQuarter'].cumcount() + 1

# Set conditions and choices and use np.select to allocate revenue
# in proportion to the days falling in each calendar quarter
conditions = (df4['daysp'] == 1, df4['daysp'] == 2)
choices = (df4['revenue'] * (df4['days1'] / df4['days0']),
           df4['revenue'] * (df4['days2'] / df4['days0']))
df4['revenuep'] = np.select(conditions, choices).round(0)
A curly one. There is certainly an opportunity to method-chain this so that it is cleaner and faster; one possible shape is sketched below.
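For illustration, a method-chained sketch of the same allocation, under the df1/df2 shapes above (the q1/q2 helper columns are introduced here purely to drive the melt):

import numpy as np
import pandas as pd

result = (
    df1.merge(df2)
       .assign(fq_start=lambda d: pd.to_datetime(d.fq_start),
               fq_end=lambda d: pd.to_datetime(d.fq_end))
       .assign(q_end=lambda d: d.fq_start.dt.to_period('Q')
                                .dt.end_time.dt.normalize())
       .assign(days0=lambda d: (d.fq_end - d.fq_start).dt.days + 1,
               days1=lambda d: (d.q_end - d.fq_start).dt.days + 1,
               days2=lambda d: (d.fq_end - d.q_end).dt.days,
               q1=lambda d: d.fq_start.dt.to_period('Q'),
               q2=lambda d: d.fq_end.dt.to_period('Q'))
       .melt(id_vars=['fisc_q_id', 'cust', 'prod_id', 'revenue',
                      'days0', 'days1', 'days2'],
             value_vars=['q1', 'q2'],
             var_name='which', value_name='CalendarQuarter')
       .assign(revenuep=lambda d: np.where(d.which.eq('q1'),
                                           d.revenue * d.days1 / d.days0,
                                           d.revenue * d.days2 / d.days0).round(0))
       .drop(columns='which')
       .sort_values(['prod_id', 'CalendarQuarter'])
       .reset_index(drop=True)
)

This yields the same four rows, keeping the original fiscal quarter ID alongside the calendar quarter and the revenue allocated to it.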