Say I have the following data (please note that this data set is overly simplified and is for illustrative use only - it is not the actual data I am working with)
import pandas as pd

df = pd.DataFrame({"start_date": ["2010-05-03", "2010-06-02", "2011-06-02",
                                  "2011-07-21", "2012-11-05"],
                   "boolean": [True, True, False, True, False]})
#converting start_date to datetime object
df["start_date"] = pd.to_datetime(df["start_date"], format = "%Y-%m-%d")
#Deriving year and month attributes
df["year"] = df["start_date"].dt.year
df["month"] = df["start_date"].dt.month
I then derive the following dataframe:
df2 = df.groupby(by = ["year", "month", "boolean"]).size().unstack()
This code produces the table I want, a multi-index DataFrame that looks something like this:
I get a nice looking time series plot with the following code (the image of which I have not included here):
df2.plot(
kind = "line",
figsize = (14, 4)
)
What I want is the following:
I need a way to find the number of current customers at the beginning of each month (that is, a count of the number of times "boolean == False" for each month).
I need a way to find the number of lost customers for each month (that is, a count of the number of times "boolean == True").
I would then use these two numbers to get an attrition rate per month (something like "number of customers lost within each month, divided by the total number of customers at the start of each month").
I have an idea as to how to get what I want but I don't know how to implement it with code.
My thinking was that I'd need to first derive a "day" attribute (e.g., df["start_date"].dt.day) - with this attribute, I would have the beginning of each month. I would then count the number of current customers at the start of each month (which I think would be the sum total of current customers from the previous month) and then count the number of lost customers within each month (which would be the number of times "boolean == True" occurred between the first day of each month and the last day of each month). I'd then use these two numbers to get the customer attrition rate.
Once I had the monthly attrition rate, I would then plot it on a time-series graph
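One possible reading of the plan above, as a minimal sketch rather than a definitive method. It assumes each row is one customer, that "boolean == True" marks a customer lost in the month of start_date, and that the customers at the start of a month are all "boolean == False" rows from earlier months. The names lost, new_current, current_at_start and attrition are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "start_date": pd.to_datetime(["2010-05-03", "2010-06-02", "2011-06-02",
                                  "2011-07-21", "2012-11-05"]),
    "boolean": [True, True, False, True, False],
})

# Label every row with its calendar month (a Period such as 2010-05).
month = df["start_date"].dt.to_period("M")

# Monthly counts of lost (True) and newly seen current (False) customers.
lost = df["boolean"].groupby(month).sum()
new_current = (~df["boolean"]).groupby(month).sum()

# Fill in months with no rows so the series has no gaps.
all_months = pd.period_range(month.min(), month.max(), freq="M")
lost = lost.reindex(all_months, fill_value=0)
new_current = new_current.reindex(all_months, fill_value=0)

# Assumption: the base at the start of a month is every False row from earlier months.
current_at_start = new_current.cumsum().shift(1, fill_value=0)

# Months with an empty base give NaN instead of a division by zero.
attrition = lost / current_at_start.where(current_at_start > 0)
```

From there, attrition.plot(kind="line", figsize=(14, 4)) would draw the monthly rate as a time series, in the same way as the df2.plot call above. If lost customers should also be removed from the base, the cumulative count would need adjusting accordingly.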
I have the following dataframe df
import pandas as pd
import random
dates = pd.date_range(start = "2015-06-02", end = "2022-05-02", freq = "3D")
boolean = [random.randint(0, 1) for i in range(len(dates))]
boolean = [bool(x) for x in boolean]
df = pd.DataFrame(
{"Dates":dates,
"Boolean":boolean}
)
I then add the following attributes and group the data:
df["Year"] = df["Dates"].dt.year
df["Month"] = df["Dates"].dt.month
df.groupby(by = ["Year", "Month", "Boolean"]).size().unstack()
Which gets me something looking like this:
What I need to do is the following:
Calculate the attrition rate for the most recent complete month (say 30 days). To do this I need to count the number of occurrences where "Boolean == False" at the beginning of this 1-month period, then count the number of occurrences where "Boolean == True" within this 1-month period. I then use these two numbers to get the attrition rate, which I think would be sum(True occurrences within 1-month period) / sum(False occurrences at beginning of 1-month period).
I would use this same above approach to calculate the attrition rate for the entire historical period (that is, all months in between 2015-06-02 to 2022-05-02)
My Current Thinking
I'm wondering if I also need to derive a day attribute (that is, df["Day"] = df["Dates"].dt.day). Once I have this, do I just need to perform the necessary arithmetic over the days in each month in each year?
Please help, I am struggling with this quite a bit
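A day attribute may not be needed: comparing the Dates column against the window boundaries is enough. The following is only a sketch under one interpretation of the question, namely that the base is every "Boolean == False" row dated before the window and the losses are the "Boolean == True" rows dated inside it. The function name window_attrition and the small hand-made data (used instead of the random flags so the result can be checked) are assumptions for illustration.

```python
import pandas as pd

def window_attrition(frame, start, end):
    """True rows dated in [start, end) divided by False rows dated before start."""
    before = frame["Dates"] < start
    inside = (frame["Dates"] >= start) & (frame["Dates"] < end)
    base = int((before & ~frame["Boolean"]).sum())
    lost = int((inside & frame["Boolean"]).sum())
    return lost / base if base else float("nan")

# Small hand-made data in place of the random flags, so the result is checkable.
df = pd.DataFrame({
    "Dates": pd.to_datetime(["2022-01-05", "2022-02-10", "2022-03-01", "2022-03-20",
                             "2022-04-03", "2022-04-18", "2022-05-02"]),
    "Boolean": [False, False, False, False, True, False, True],
})

# Most recent complete calendar month, relative to the last date in the data.
last_month = df["Dates"].max().to_period("M") - 1
start, end = last_month.start_time, (last_month + 1).start_time

rate = window_attrition(df, start, end)
print(last_month, rate)  # 2022-04 0.25
```

For the whole historical period, the same function could be called once for each month between the first and last date.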
I'm new to coding and am trying to make a time series scatterplot. I have hourly ozone concentrations from every day of the year for 12 years. I have calculated average and max values for each month of the year and am trying to compare the monthly average and monthly max data. I want to make 3 separate scatterplots for April, May, and June (so each graph should have two lines, avg and max). Here's what I've done so far:
#earlier in the code I specified only the months of Apr, May, Jun using:
df = df[df.month.isin([4, 5, 6])].copy()
#more code involving calculations, fast forward:
for month in avg_MDA8.month.unique():
    for month in max_MDA8.month.unique():
        data1 = avg_MDA8[avg_MDA8.month == month]
        data2 = max_MDA8[max_MDA8.month == month] # filter and plot the data for a specific month
        plt.figure() # create a new figure for each month
        plt.plot(data1.datetime, data1.r_mean, color='k',linewidth=2.0,label='average MDA8')
        plt.plot(data2.datetime, data2.r_mean, color='g',linewidth=2.0,label='max MDA8')
        plt.xlim(date(2009, 1, 1), date(2020, 12, 31))
        plt.ylim(0, 100)
        plt.title(f'Month: {month}')
        plt.ylabel('MDA8 (ppb)')
        plt.xlabel('Year')
        plt.legend(bbox_to_anchor=(1.0, 0.15))
        plt.tight_layout()
However, the output is giving me 9 total plots: April_avg/April_max, April_avg/May_max, April_avg/June_max; May_avg/April_max, etc...
I just want to compare April_avg/April_max, May_avg/May_max, June_avg/June_max.
EDIT
I'm sorry, I was wrong. The loop isn't plotting the data incorrectly; it is just printing 3 versions of each graph. Any advice on how to prevent it from duplicating the graphs?
First, note how you've overloaded month in your nested loops:
for month in avg_MDA8.month.unique():
    for month in max_MDA8.month.unique():
Every time you try to set month in the outer loop, the inner loop immediately destroys that value. Your description says that you want to get corresponding elements, and iterate through the months once, in parallel. Do this more simply: the unique months are the same set in both avg and max, right? So iterate through the months, regardless of where you got them. Use only one loop:
for month in avg_MDA8.month.unique():
    data1 = avg_MDA8[avg_MDA8.month == month]
    data2 = max_MDA8[max_MDA8.month == month]
month now takes on each desired value exactly once.
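To see the difference in isolation, here is a small sketch that counts how often the loop body runs in each version. The two frames are hypothetical stand-ins for avg_MDA8 and max_MDA8 with made-up values; only the month column matters for the comparison.

```python
import pandas as pd

# Hypothetical stand-ins for avg_MDA8 and max_MDA8: two years of Apr/May/Jun rows.
months = [4, 5, 6] * 2
avg_MDA8 = pd.DataFrame({"month": months, "r_mean": [40, 45, 50, 42, 47, 52]})
max_MDA8 = pd.DataFrame({"month": months, "r_mean": [60, 70, 80, 65, 75, 85]})

# Nested loops that share one name: the body runs 3 x 3 = 9 times,
# and the inner loop's value is the one the body sees.
nested = []
for month in avg_MDA8.month.unique():
    for month in max_MDA8.month.unique():
        nested.append(int(month))

# Single loop: the body runs once per month.
single = []
for month in avg_MDA8.month.unique():
    data1 = avg_MDA8[avg_MDA8.month == month]
    data2 = max_MDA8[max_MDA8.month == month]
    single.append((int(month), len(data1), len(data2)))

print(nested)  # [4, 5, 6, 4, 5, 6, 4, 5, 6]
print(single)  # [(4, 2, 2), (5, 2, 2), (6, 2, 2)]
```

The nested version visits each month three times, which matches the three copies of each graph described in the edit to the question.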
I have sales data (revenue and units) by Customer, by Product (type, id, description), by "fiscal quarter id", where the fiscal quarters are unique to this company and are not regular (i.e., not the exact same number of days for each).
I want (I think?) to "split" each row into two effective observations/transactions to allocate the proper share of the units and revenue to the two regular calendar quarters that the fiscal quarter straddles.
I also have a table (df2) that maps each of the company's fiscal quarters to calendar start and end dates.
Tiny sample:
df1 = pd.DataFrame({'fisc_q_id': ['2013Q1', '2013Q2'],
'cust':['Faux Corp', 'Notaco'],
'prod_id':['ABC-123', 'DEF-456'],
'revenue':[100, 400]})
df2 = pd.DataFrame({'fisc_q_id': ['2013Q1', '2013Q2'],
'fq_start':['2012-07-29', '2012-10-28'],
'fq_end':['2012-10-27', '2013-01-26']})
Desired output would be FOUR rows, each keeping the original "fiscal quarter ID", but would add a column with the appropriate calendar quarter and the allocated revenue for that quarter.
I have some ideas as to how this might work, but my solution -- if I could even get to one -- would surely be inelegant compared to what you guys can offer.
IIUC:
import numpy as np
import pandas as pd

# Merge the dataframes
df3 = df1.merge(df2)
# Coerce dates into datetime
df3.fq_start = pd.to_datetime(df3.fq_start)
df3.fq_end = pd.to_datetime(df3.fq_end)
# Calculate the calendar quarter for start and end
df3['fq_startquarter'] = pd.PeriodIndex(df3.fq_start, freq='Q')
df3['fq_endquarter'] = pd.PeriodIndex(df3.fq_end, freq='Q')
# Find the end of the first calendar quarter in the date range, and hence the day difference on either side of the partition
df3['EndQdate'] = df3['fq_start'].dt.to_period("Q").dt.end_time
df3['days1'] = (df3['EndQdate'] - df3['fq_start']).dt.days + 1
df3['days2'] = (df3['fq_end'] - df3['EndQdate']).dt.days
df3['dys0'] = (df3['fq_end'] - df3['fq_start']).dt.days
df3.drop(columns=['EndQdate'], inplace=True)
# Melt the calculated quarters into one row per fiscal quarter and calendar quarter
df4 = pd.melt(df3, id_vars=['fisc_q_id','cust','prod_id','revenue','fq_start','fq_end','days1','days2','dys0'], value_name='CalenderQuarter')
df4.sort_values(by='prod_id', inplace=True)
# Number the quarters within each product so the matching day count can be applied
df4['daysp'] = df4.groupby('prod_id')['CalenderQuarter'].cumcount() + 1
# Set conditions and choices, then use np.select to calculate the revenue proportions
conditions = [df4['daysp']==1, df4['daysp']==2]
choices = [df4['revenue']*(df4['days1']/df4['dys0']), df4['revenue']*(df4['days2']/df4['dys0'])]
# Round the allocated revenue to whole units
df4['revenuep'] = np.select(conditions, choices).round(0)
Tricky one. There is certainly an opportunity to method-chain this so that it is more concise.
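As an alternative sketch (not the method used above), the allocation can also be written as an explicit overlap calculation: for every calendar quarter a fiscal quarter touches, count the overlapping days with both end dates included and allocate revenue in that proportion. The function name split_by_calendar_quarter and the output column names cal_quarter, days and revenue_alloc are made up for illustration. Because both end dates are counted, the day totals differ by one from dys0 above.

```python
import pandas as pd

df1 = pd.DataFrame({'fisc_q_id': ['2013Q1', '2013Q2'],
                    'cust': ['Faux Corp', 'Notaco'],
                    'prod_id': ['ABC-123', 'DEF-456'],
                    'revenue': [100, 400]})
df2 = pd.DataFrame({'fisc_q_id': ['2013Q1', '2013Q2'],
                    'fq_start': ['2012-07-29', '2012-10-28'],
                    'fq_end': ['2012-10-27', '2013-01-26']})

def split_by_calendar_quarter(sales, fiscal_map):
    """One output row per (sales row, overlapping calendar quarter)."""
    merged = sales.merge(fiscal_map, on='fisc_q_id')
    merged['fq_start'] = pd.to_datetime(merged['fq_start'])
    merged['fq_end'] = pd.to_datetime(merged['fq_end'])
    rows = []
    for rec in merged.itertuples(index=False):
        # Length of the fiscal quarter, counting both its first and last day.
        total_days = (rec.fq_end - rec.fq_start).days + 1
        first_q = pd.Period(rec.fq_start, freq='Q')
        last_q = pd.Period(rec.fq_end, freq='Q')
        for q in pd.period_range(first_q, last_q):
            # Clip the calendar quarter to the fiscal quarter's date span.
            overlap_start = max(rec.fq_start, q.start_time)
            overlap_end = min(rec.fq_end, q.end_time.normalize())
            days = (overlap_end - overlap_start).days + 1
            rows.append({'fisc_q_id': rec.fisc_q_id, 'cust': rec.cust,
                         'prod_id': rec.prod_id, 'cal_quarter': str(q),
                         'days': days,
                         'revenue_alloc': rec.revenue * days / total_days})
    return pd.DataFrame(rows)

result = split_by_calendar_quarter(df1, df2)
print(result)
```

For the sample data this yields four rows, with 64 and 27 days for fiscal 2013Q1 and 65 and 26 days for fiscal 2013Q2. A units column could be allocated the same way.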
I am using the pandas date_range function to create a list of hourly timestamps for a calendar year. My code looks like this:
year = 2018
times = list(pd.date_range('{}-01-01'.format(year), '{}-12-31'.format(year), freq='H'))
I expect the length of times to be 8760 (the number of hours in a year). But when I view the length of the times vector, it is only 8737. Why????
pd.date_range includes both endpoints, but a date string with no time component is parsed as midnight at the start of that day. So here the range runs from {}-01-01 00:00 to {}-12-31 00:00: you are including the midnight value of the last day, but not its remaining 23 hours (364 * 24 + 1 = 8737).
So, you need to extend the range through the last day of the year, but omit the "celebratory" New Year hour:
>>> year = 2018
>>> times = list(pd.date_range('{}-01-01'.format(year), '{}-01-01'.format(year+1), freq='H'))
>>> times = times[:-1]
>>> len(times)
8760
You need to end the range at New Year's Day of the following year, {}-01-01, so that all of New Year's Eve, {}-12-31, is covered. But then you also get the midnight hour that starts the new year. Hence the need to eliminate the last entry in the list: times = times[:-1], so that you're ending at 11:00pm on 12-31.
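Another way to get the same result, sketched here under the assumption that naive (timezone-free) timestamps are wanted, is to give the end point an explicit time of 23:00 so nothing has to be trimmed afterwards. The helper name hourly_stamps is made up for this example.

```python
import pandas as pd

def hourly_stamps(year):
    """Every hour of `year`, from Jan 1 00:00 to Dec 31 23:00."""
    start = pd.Timestamp(year=year, month=1, day=1)
    end = pd.Timestamp(year=year, month=12, day=31, hour=23)
    # An explicit one-hour step avoids depending on the spelling of the hourly alias.
    return pd.date_range(start, end, freq=pd.Timedelta(hours=1))

times = hourly_stamps(2018)
print(len(times))                # 8760
print(times[-1])                 # 2018-12-31 23:00:00
print(len(hourly_stamps(2020)))  # 8784 (leap year)
```

Wrapping the result in list(...) gives the same kind of list as in the question.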
I would like to do some annual statistics (cumulative sum) on a daily time series of data in an xarray dataset. The tricky part is that the day on which my considered year begins must be flexible and the time series contains leap years.
I tried e.g. the following:
rollday = -181
dr = pd.date_range('2015-01-01', '2017-08-23')
foo = xr.Dataset({'data': (['time'], np.ones(len(dr)))}, coords={'time': dr})
foo_groups = foo.roll(time=rollday).groupby(foo.time.dt.year)
foo_cumsum = foo_groups.apply(lambda x: x.cumsum(dim='time', skipna=True))
which is "unfavorable" mainly because of two things:
(1) the rolling doesn't account for leap years, so I get an offset of one day per leap year, and
(2) the beginning of the first year (until the end of June) is appended to the end of the rolled time series, which creates a "fake year" where the cumulative sums don't make sense anymore.
I tried also to first cut off the ends of the time series, but then the rolling doesn't work anymore. Resampling to me also did not seem to be an option, as I could not find a fitting pandas freq string.
I'm sure there is a better/correct way to do this. Can somebody help?
You can use an xarray.DataArray that specifies the groups. One way to do this is to create an array of values (years) that define the group ids:
import numpy as np
import pandas as pd
import xarray as xr

# setup sample data
dr = pd.date_range('2015-01-01', '2017-08-23')
foo = xr.Dataset({'data': (['time'], np.ones(len(dr)))}, coords={'time': dr})
# create an array of years (modify day/month for your use case)
my_years = xr.DataArray(
    [t.year if ((t.month < 9) or ((t.month == 9) and (t.day < 15))) else (t.year + 1)
     for t in foo.indexes['time']],
    dims='time', name='my_years', coords={'time': dr})
# use that array of years (integers) to do the groupby
foo_cumsum = foo.groupby(my_years).apply(lambda x: x.cumsum(dim='time', skipna=True))
# Voila!
foo_cumsum['data'].plot()
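The year labels can also be built without a Python-level loop by using the vectorized attributes of the time index. The following is a pandas-only sketch of the same labelling, assuming the year starts on 15 September as in the code above; labels and cum are illustrative names, and the labels array could equally be wrapped in an xr.DataArray and passed to groupby as before.

```python
import numpy as np
import pandas as pd

dr = pd.date_range("2015-01-01", "2017-08-23")

# True for dates on or after 15 September, which belong to the next "year".
after_cutoff = (dr.month > 9) | ((dr.month == 9) & (dr.day >= 15))
labels = np.asarray(dr.year) + np.asarray(after_cutoff).astype(int)

# Cumulative sum that restarts at every 15 September.
data = pd.Series(np.ones(len(dr)), index=dr)
cum = data.groupby(labels).cumsum()

print(cum.loc[pd.Timestamp("2015-09-14")])  # 257.0
print(cum.loc[pd.Timestamp("2015-09-15")])  # 1.0
print(cum.loc[pd.Timestamp("2016-09-14")])  # 366.0
```

The last value is 366 because the group running from 2015-09-15 to 2016-09-14 contains 29 February 2016, so leap years are handled by the calendar dates themselves rather than by a fixed offset.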