I have sales data (revenue and units) by Customer, by Product (type, id, description), by "fiscal quarter id", where the fiscal quarters are unique to this company and are not regular (i.e., not the exact same number of days for each).
I want (I think?) to "split" each row into two effective observations/transactions to allocate the proper share of the units and revenue to the two regular calendar quarters that the fiscal quarter straddles.
I also have a table (df2) that maps each of the company's fiscal quarters to calendar start and end dates.
Tiny sample:
df1 = pd.DataFrame({'fisc_q_id': ['2013Q1', '2013Q2'],
'cust':['Faux Corp', 'Notaco'],
'prod_id':['ABC-123', 'DEF-456'],
'revenue':[100, 400]})
df2 = pd.DataFrame({'fisc_q_id': ['2013Q1', '2013Q2'],
'fq_start':['2012-07-29', '2012-10-28'],
'fq_end':['2012-10-27', '2013-01-26']})
Desired output would be FOUR rows, each keeping the original "fiscal quarter ID", but would add a column with the appropriate calendar quarter and the allocated revenue for that quarter.
I have some ideas as to how this might work, but my solution -- if I could even get to one -- would surely be inelegant compared to what you guys can offer.
IIUC:
import numpy as np
#Merge the dataframes
df3 = df1.merge(df2)
#Coerce dates into datetime
df3.fq_start = pd.to_datetime(df3.fq_start)
df3.fq_end = pd.to_datetime(df3.fq_end)
#Calculate the calendar quarter for start and end
df3['fq_startquarter'] = pd.PeriodIndex(df3.fq_start, freq='Q')
df3['fq_endquarter'] = pd.PeriodIndex(df3.fq_end, freq='Q')
#Calculate the end date of the first calendar quarter in the range, and hence the day counts on either side of the partition
df3['Qdate'] = df3['fq_start'].dt.to_period("Q").dt.end_time
df3['EndQdate'] = df3['Qdate'].dt.normalize()
df3['days1'] = (df3['EndQdate'] - df3['fq_start']).dt.days + 1
df3['days2'] = (df3['fq_end'] - df3['EndQdate']).dt.days
#+1 so that days1 + days2 == dys0 (inclusive day counts)
df3['dys0'] = (df3['fq_end'] - df3['fq_start']).dt.days + 1
df3.drop(columns=['Qdate', 'EndQdate'], inplace=True)
#Melt the calculated quarters into long form
df4 = pd.melt(df3, id_vars=['fisc_q_id', 'cust', 'prod_id', 'revenue', 'fq_start', 'fq_end', 'days1', 'days2', 'dys0'], value_name='CalendarQuarter')
df4.sort_values(by='prod_id', inplace=True)
#Number the two quarters within each product to allow allocation of the calculated days
df4['daysp'] = df4.groupby('prod_id')['CalendarQuarter'].cumcount() + 1
#Set conditions and choices and use np.select to conditionally calculate revenue proportions
conditions = (df4['daysp'] == 1, df4['daysp'] == 2)
choices = (df4['revenue'] * (df4['days1'] / df4['dys0']), df4['revenue'] * (df4['days2'] / df4['dys0']))
df4['revenuep'] = np.select(conditions, choices).round(0)
#Drop the working columns that are no longer required
df4.drop(columns=['days1', 'days2', 'dys0', 'daysp'], inplace=True)
A curly one. There is certainly an opportunity to method-chain this so that it is more concise and faster.
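As a sketch of what that method-chained version could look like (same tiny df1/df2 sample; the explode-by-calendar-quarter and day-overlap logic here stand in for the melt/np.select steps, and the column names cal_q, days, total, revenuep are my own):

```python
import pandas as pd

df1 = pd.DataFrame({'fisc_q_id': ['2013Q1', '2013Q2'],
                    'cust': ['Faux Corp', 'Notaco'],
                    'prod_id': ['ABC-123', 'DEF-456'],
                    'revenue': [100, 400]})
df2 = pd.DataFrame({'fisc_q_id': ['2013Q1', '2013Q2'],
                    'fq_start': ['2012-07-29', '2012-10-28'],
                    'fq_end': ['2012-10-27', '2013-01-26']})

out = (
    df1.merge(df2)
       .assign(fq_start=lambda d: pd.to_datetime(d['fq_start']),
               fq_end=lambda d: pd.to_datetime(d['fq_end']))
       # one row per calendar quarter that the fiscal quarter touches
       .assign(cal_q=lambda d: [list(pd.period_range(s, e, freq='Q'))
                                for s, e in zip(d['fq_start'], d['fq_end'])])
       .explode('cal_q')
       .assign(cal_q=lambda d: pd.PeriodIndex(d['cal_q']),
               q_start=lambda d: d['cal_q'].dt.start_time,
               q_end=lambda d: d['cal_q'].dt.end_time.dt.normalize(),
               # inclusive day overlap between fiscal and calendar quarter
               days=lambda d: (d[['fq_end', 'q_end']].min(axis=1)
                               - d[['fq_start', 'q_start']].max(axis=1)).dt.days + 1,
               total=lambda d: (d['fq_end'] - d['fq_start']).dt.days + 1,
               revenuep=lambda d: (d['revenue'] * d['days'] / d['total']).round(0))
       .loc[:, ['fisc_q_id', 'cust', 'prod_id', 'cal_q', 'revenuep']]
       .reset_index(drop=True)
)
```

This yields the four desired rows, one per (fiscal quarter, calendar quarter) pair, with the allocated revenue summing back to the original totals.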
Related
I have a dataframe of 8,000 transactions. The assets have all sold / resold at least twice. Each transaction has a sales date. Each transaction has a unique identifier. The resales share a common identifier (TaxID). I would like to calculate the difference between the sale price for each common identifier by taking the date of the sale into account.
E.g. Asset 1 sold 1 Jan 2021, then again 15 Jan 2022. TaxID is common to the two. How do I calculate across these 8000 rows the difference in sale price between common identifiers? Some sold 3 or 4 times.
import pandas as pd
df = pd.read_csv('transactions.csv')
# sort chronologically first, then diff within each TaxID group
df = df.sort_values('sale_date')
df['diff'] = df.groupby('TaxID')['sale_price'].diff()
Using Pandas, you can calculate the difference in sale price between common identifiers by first grouping your transactions by their shared identifier (TaxID). This will create groups of transactions that have the same TaxID. Then, you would sort these groups by their sale date to ensure they are in chronological order. After that, you can use the diff() method to find the difference in sale price between each group of transactions with the same TaxID. This will give you the difference in sale price between each resale of an asset.
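A minimal runnable sketch of that approach, on hypothetical data standing in for transactions.csv:

```python
import pandas as pd

# hypothetical data standing in for transactions.csv
df = pd.DataFrame({
    'TaxID': ['A', 'A', 'B', 'B', 'B'],
    'sale_date': ['2021-01-01', '2022-01-15', '2020-06-01', '2021-06-01', '2022-06-01'],
    'sale_price': [100, 150, 200, 260, 230],
})
df['sale_date'] = pd.to_datetime(df['sale_date'])

# sort chronologically, then take the price difference within each TaxID
df = df.sort_values('sale_date')
df['diff'] = df.groupby('TaxID')['sale_price'].diff()
# the first sale of each asset has no prior price, so its diff is NaN
```

This also handles assets that sold 3 or 4 times: each resale row gets the difference from the immediately preceding sale of the same TaxID.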
Say I have the following data (please note that this data set is overly simplified and is for illustrative use only - it is not the actual data I am working with)
df = pd.DataFrame({"start_date": ["2010-05-03", "2010-06-02", "2011-06-02",
                                  "2011-07-21", "2012-11-05"],
                   "boolean": [True, True, False, True, False]})
#converting start_date to datetime object
df["start_date"] = pd.to_datetime(df["start_date"], format = "%Y-%m-%d")
#Deriving year and month attributes
df["year"] = df["start_date"].dt.year
df["month"] = df["start_date"].dt.month
I then derive the following dataframe:
df2 = df.groupby(by = ["year", "month", "boolean"]).size().unstack()
This code produces the table I want: a multi-index dataframe (table image not included here).
I get a nice looking time series plot with the following code (the image of which I have not included here):
df2.plot(
kind = "line",
figsize = (14, 4)
)
What I want is the following:
I need a way to find the number of current customers at the beginning of each month (that is, a count of the number of times boolean == False for each month).
I need a way to find the number of lost customers for each month (that is, a count of the number of times boolean == True).
I would then use these two numbers to get an attrition rate per month (something like the number of customers lost within each month, divided by the total number of customers at the start of that month).
I have an idea as to how to get what I want but I don't know how to implement it with code.
My thinking was that I'd need to first derive a "day" attribute (e.g., df["start_date"].dt.day) - with this attribute, I would have the beginning of each month. I would then count the number of current customers at the start of each month (which I think would be the sum total of current customers from the previous month) and then count the number of lost customers within each month (which would be the number of times "boolean == True" occurred between the first day of each month and the last day of each month). I'd then use these two numbers to get the customer attrition rate.
Once I had the monthly attrition rate, I would then plot it on a time-series graph
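One way the steps above could be sketched (assuming, as the question implies, that boolean == True marks a lost customer and boolean == False a current one; using the customers recorded in each month as the denominator is a simplifying assumption, not necessarily the "start of month" base described above):

```python
import pandas as pd

df = pd.DataFrame({"start_date": ["2010-05-03", "2010-06-02", "2011-06-02",
                                  "2011-07-21", "2012-11-05"],
                   "boolean": [True, True, False, True, False]})
df["start_date"] = pd.to_datetime(df["start_date"])

# count True/False occurrences per calendar month
monthly = (df.groupby([df["start_date"].dt.to_period("M"), "boolean"])
             .size()
             .unstack(fill_value=0))

# assumption: True = lost customer, False = current customer
lost = monthly[True]
current = monthly[False]
attrition = lost / (lost + current)
```

Once attrition exists, attrition.plot(figsize=(14, 4)) gives the monthly time-series plot, mirroring the df2.plot call above.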
I have a dataset with millions of records just like below
CustomerID  StartTime   EndTime
1111        2015-7-10   2016-3-7
1112        2016-1-5    2016-1-19
1113        2015-10-18  2020-9-1
This dataset contains the information for different subscription contracts and it is assumed that:
if the contract is active then the customer will need to pay a monthly fee in advance. The first payment will be collected on the start date.
If the contract ends before the next payment date, which is exactly one month after the last payment date, the customer does not need to pay the next subscription. For instance, customer 1112 only needs to pay once.
monthly payment fee is $10
In this situation, I need to calculate the monthly/quarterly/annual sales between 2015 and 2020. It is ideal to also show the breakdown of sales by different customer IDs so that subsequent machine learning tasks can be performed.
Importing data (I saved your table as a .csv in Excel, which is the reason for the specific formatting of the pd.to_datetime):
import pandas as pd
import numpy as np
df = pd.read_csv("Data.csv", header=0)
# convert columns "to_datetime"
df["StartTime"] = pd.to_datetime(df["StartTime"], format="%d/%m/%Y")
df["EndTime"] = pd.to_datetime(df["EndTime"], format="%d/%m/%Y")
Calculate the number of months between the start and end dates (+1 at the end because payment is in advance, so there is a payment even if the contract is not active for a whole month):
df["Months"] = ((df["EndTime"] - df["StartTime"])/np.timedelta64(1, 'M')).astype(int) + 1
Generate a list of payment dates (from the start date, for the given number of months, one month apart). The pd.tseries.offsets.DateOffset(months=1) will ensure that the payment date is on the same day every month, rather than the default end-of-month if freq="M".
df["PaymentDates"] = df.apply(lambda x: list(pd.date_range(start=x["StartTime"], periods=x["Months"], freq=pd.tseries.offsets.DateOffset(months=1)).date), axis=1)
Create a new row for each payment date, add a payment column of 10, then pivot so that the CustomerID is the column, and the date is the row:
df = df.explode("PaymentDates").reset_index(drop=True)
df["PaymentDates"] = pd.to_datetime(df["PaymentDates"])
df["Payment"] = 10
df = pd.pivot_table(df, index="PaymentDates", columns="CustomerID", values="Payment")
Aggregate for month, quarter, and year sales (this will be an aggregation for each individual CustomerID; you can then sum across each row to get a total amount):
months = df.groupby([df.index.year, df.index.month]).sum()
quarters = df.groupby([df.index.year, df.index.quarter]).sum()
years = df.groupby(df.index.year).sum()
# total sales
months["TotalSales"] = months.sum(axis=1)
quarters["TotalSales"] = quarters.sum(axis=1)
years["TotalSales"] = years.sum(axis=1)
I realise this may be slow for the df.apply if you have millions of records, and there may be other ways to complete this, but this is what I have thought of.
You will also have a lot of columns if there are many millions of customers, but this way you will keep all the CustomerID values separate and be able to know which customers made payments in a given month.
After the number of months is calculated in df["Months"], you could then multiply this by 10 to get the number of sales for each customer.
If this is the only data you need for each customer individually, you would not need to pivot the data at all, just aggregate on the "PaymentDates" column, count the number of rows and multiply by 10 to get the sales for month, quarter, year.
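A sketch of that pivot-free route (the n_payments helper is my own; it counts exact monthly anniversaries instead of dividing by np.timedelta64(1, 'M'), which approximates a month's length and may not be accepted by newer pandas versions):

```python
import pandas as pd

df = pd.DataFrame({"CustomerID": [1111, 1112, 1113],
                   "StartTime": ["2015-07-10", "2016-01-05", "2015-10-18"],
                   "EndTime": ["2016-03-07", "2016-01-19", "2020-09-01"]})
df["StartTime"] = pd.to_datetime(df["StartTime"])
df["EndTime"] = pd.to_datetime(df["EndTime"])

def n_payments(start, end):
    # payments fall on the start day of each month, in advance
    n = (end.year - start.year) * 12 + (end.month - start.month) + 1
    # drop the last one if its anniversary falls after the contract end
    if start + pd.DateOffset(months=n - 1) > end:
        n -= 1
    return n

df["Months"] = [n_payments(s, e) for s, e in zip(df["StartTime"], df["EndTime"])]
df["PaymentDates"] = [pd.date_range(s, periods=n, freq=pd.DateOffset(months=1))
                      for s, n in zip(df["StartTime"], df["Months"])]

# one row per payment, then count rows per month and multiply by the $10 fee
payments = df.explode("PaymentDates")
payments["PaymentDates"] = pd.to_datetime(payments["PaymentDates"])
monthly_sales = payments.groupby(payments["PaymentDates"].dt.to_period("M")).size() * 10
```

Grouping by .dt.to_period("Q") or .dt.year instead gives the quarterly and annual totals the same way.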
Assuming that I have a series made of daily values:
dates = pd.date_range('1/1/2004', periods=365, freq="D")
ts = pd.Series(np.random.randint(0,101, 365), index=dates)
I need to use .groupby or .reduce with a fixed schema of dates.
Using ts.resample('8D') isn't an option: the boundary dates must not drift within the month, and the last chunk of each month has to flex to absorb the differing month lengths (and leap years).
A list of dates can be obtained through:
g = dates[dates.day.isin([1,8,16,24])]
How can I group or reduce my data to this specific schema so I can compute the sum, max and min in a more elegant and efficient way than:
for i in range(0, len(g) - 1):
    ts.loc[(g[i] <= ts.index) & (ts.index < g[i + 1])]
Well, from a calendar point of view you can group them by calendar week, day of week, month, and so on.
If that is something you would be interested in, you could do it easily with pandas, for example:
df['week'] = df['date'].dt.isocalendar().week  # create week column
df.groupby(['week'])['values'].sum()           # sum values by week
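If the fixed 1/8/16/24 schema is required rather than calendar weeks, one possible sketch uses the date list g from the question as bin edges for pd.cut (the extra edge past the end of the data is my own assumption, so the final chunk is closed):

```python
import pandas as pd
import numpy as np

dates = pd.date_range('1/1/2004', periods=365, freq='D')
ts = pd.Series(np.random.randint(0, 101, 365), index=dates)

# bin edges on the fixed day-of-month schema, plus one edge past the data
edges = dates[dates.day.isin([1, 8, 16, 24])].append(pd.DatetimeIndex(['2005-01-01']))

# label each day with the schema date that opens its chunk;
# right=False makes each chunk the half-open interval [edge_i, edge_i+1)
bins = pd.cut(ts.index, bins=edges, right=False, labels=edges[:-1])
agg = ts.groupby(bins).agg(['sum', 'max', 'min'])
```

Every day lands in exactly one chunk, so the chunk sums add back up to ts.sum(), and sum/max/min come out in one pass instead of the explicit loop.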
I would like to return the top 10 performing (by average) variables for each week in my DataFrame. It is about 2 years' worth of data.
I am using Python to figure this out but would also eventually like to do it in SQL.
I have been able to produce code that returns the top 10 for the latest week, but I would like results for every week.
Creating a df sliced to the latest week's date range:
range_max = rtbinds['pricedate'].max()
range_min = range_max - datetime.timedelta(days=7)
sliced_df = rtbinds[(rtbinds['pricedate'] >= range_min)
& (rtbinds['pricedate'] <= range_max)]
Grouping, then sorting by 'shadow':
(sliced_df.groupby(['pricedate', 'cons_name']).aggregate(np.mean)
          .sort_values('shadow').head(10))
This returns, for the latest week of data:
pricedate cons_name shadow
2019-04-26 TEMP71_24753 -643.691
2019-04-27 TMP175_24736 -508.062
2019-04-25 TMP109_22593 -383.263
2019-04-23 TEMP48_24759 -376.967
2019-04-29 TEMP71_24753 -356.476
TMP175_24736 -327.230
TMP273_23483 -303.234
2019-04-27 TEMP71_24753 -294.377
2019-04-28 TMP175_24736 -272.603
TMP109_22593 -270.887
But, I would like a list that returns the top 10 for each week until the earliest date of my data
Heads up: pd.sort_values sorts in ascending order by default, so when you take head(10) it's actually the worst 10 (the most negative values) if we consider the natural ordering of the real numbers.
Now for your problem, here is a solution
First we need to create some columns to identify the week of the year (rtbinds is renamed df):
df['year'] = df['pricedate'].apply(lambda x: x.year)
df['week'] = df['pricedate'].apply(lambda x: x.isocalendar()[1])
Then we will group the data by ['year', 'week', 'cons_name'] :
df2 = df.groupby(['year', 'week', 'cons_name'], as_index=False).aggregate(np.mean)
You should get now a dataframe where for each (year, week) you have only one record of a cons_name with the mean shadow.
Then we will take the top 10 for each (year, week)
def udf(df):
    return df.sort_values('shadow').head(10)
df2.groupby(['year', 'week'], as_index=False).apply(udf)
This should give you the result you want.
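A compact end-to-end sketch of the same idea on made-up data (head(2) stands in for head(10) given the tiny sample; use ascending=False in the sort, or nlargest, if "top" should mean the largest means rather than the most negative):

```python
import pandas as pd

# hypothetical stand-in for rtbinds
df = pd.DataFrame({
    'pricedate': pd.to_datetime(['2019-04-23', '2019-04-24', '2019-04-25',
                                 '2019-05-01', '2019-05-02', '2019-05-03']),
    'cons_name': ['A', 'B', 'C', 'A', 'B', 'C'],
    'shadow': [-376.9, -50.0, -383.2, -10.0, -600.0, -5.0],
})

# ISO year/week columns, as in the answer above
iso = df['pricedate'].dt.isocalendar()
df['year'], df['week'] = iso.year, iso.week

# mean shadow per (year, week, cons_name), then the 2 most negative per week:
# a global ascending sort followed by groupby(...).head(n) keeps the first n
# rows of each (year, week) group
weekly = df.groupby(['year', 'week', 'cons_name'], as_index=False)['shadow'].mean()
top = (weekly.sort_values('shadow')
             .groupby(['year', 'week'])
             .head(2))
```

The sort-then-groupby-head pattern avoids the custom udf and an apply over groups.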