I have the following DataFrame df:
import pandas as pd
import random
dates = pd.date_range(start = "2015-06-02", end = "2022-05-02", freq = "3D")
boolean = [random.randint(0, 1) for i in range(len(dates))]
boolean = [bool(x) for x in boolean]
df = pd.DataFrame(
    {"Dates": dates,
     "Boolean": boolean}
)
I then add the following attributes and group the data:
df["Year"] = df["Dates"].dt.year
df["Month"] = df["Dates"].dt.month
df.groupby(by = ["Year", "Month", "Boolean"]).size().unstack()
Which gets me something looking like this:
What I need to do is the following:
Calculate the attrition rate for the most recent complete month (say 30 days). To do this I need to count the number of occurrences where "Boolean == False" at the beginning of this 1-month period, then count the number of occurrences where "Boolean == True" within this 1-month period. I then use these two numbers to get the attrition rate, which I think would be sum(True occurrences within the 1-month period) / sum(False occurrences at the beginning of the 1-month period).
I would use this same approach to calculate the attrition rate for the entire historical period (that is, all months between 2015-06-02 and 2022-05-02).
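For illustration, a minimal sketch of how that calculation might look, assuming the Year/Month/Boolean counts from the groupby above and treating False as "still a customer" and True as "lost during the month" (those interpretations are my assumptions):

counts = df.groupby(["Year", "Month", "Boolean"]).size().unstack(fill_value=0)
counts = counts.rename(columns={False: "active", True: "lost"})  # assumed meanings

# Attrition per month = lost during the month / active at the start of the month.
attrition = counts["lost"] / counts["active"]

latest_rate = attrition.iloc[-1]                               # most recent month in the data
overall_rate = counts["lost"].sum() / counts["active"].sum()   # whole historical period
print(latest_rate, overall_rate)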
My Current Thinking
I'm wondering if I also need to derive a day attribute (that is, df["Day"] = df["Dates"].dt.day). Once I have this, do I just need to perform the necessary arithmetic over the days in each month of each year?
Please help, I am struggling with this quite a bit.
Related
Say I have the following data (please note that this data set is overly simplified and is for illustrative use only - it is not the actual data I am working with)
df = pd.DataFrame({"start_date": ["2010-05-03", "2010-06-02", "2011-06-02",
                                  "2011-07-21", "2012-11-05"],
                   "boolean": [True, True, False, True, False]})
#converting start_date to datetime object
df["start_date"] = pd.to_datetime(df["start_date"], format = "%Y-%m-%d")
#Deriving year and month attributes
df["year"] = df["start_date"].dt.year
df["month"] = df["start_date"].dt.month
I then derive the following dataframe:
df2 = df.groupby(by = ["year", "month", "boolean"]).size().unstack()
This code produces the table I want, which is a multi-index DataFrame that looks something like this:
I get a nice looking time series plot with the following code (the image of which I have not included here):
df2.plot(
    kind="line",
    figsize=(14, 4)
)
What I want is the following:
I need a way to find the number of current customers at the beginning of each month (that is, a count of the number of times "boolean == False" for each month).
I need a way to find the number of lost customers for each month (that is, a count of the number of times "boolean == True").
I would then use these two numbers to get an attrition rate per month (something like "number of customers lost within each month, divided by the total number of customers at the start of each month").
I have an idea as to how to get what I want but I don't know how to implement it with code.
My thinking was that I'd need to first derive a "day" attribute (e.g., df["start_date"].dt.day) - with this attribute, I would have the beginning of each month. I would then count the number of current customers at the start of each month (which I think would be the sum total of current customers from the previous month) and then count the number of lost customers within each month (which would be the number of times "boolean == True" occurred between the first day of each month and the last day of each month). I'd then use these two numbers to get the customer attrition rate.
Once I had the monthly attrition rate, I would then plot it on a time-series graph.
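A minimal sketch of one way this could be wired up, assuming the df built above and treating a False row as a current customer and a True row as a lost customer (these interpretations, and the "customers at the start of a month" definition below, are my assumptions):

monthly = df.groupby(["year", "month", "boolean"]).size().unstack(fill_value=0)
monthly = monthly.rename(columns={False: "current", True: "lost"})  # assumed meanings

# Assumption: customers at the start of a month = current customers accumulated
# up to and including the previous month; the first month has no baseline (NaN).
customers_at_start = monthly["current"].cumsum().shift(1)
attrition_rate = monthly["lost"] / customers_at_start

# Give the series a proper monthly time index so it plots as a time series.
attrition_rate.index = pd.PeriodIndex(
    [f"{y}-{m:02d}" for y, m in attrition_rate.index], freq="M")
attrition_rate.plot(kind="line", figsize=(14, 4))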
Assuming that I have a series made of daily values:
import numpy as np
import pandas as pd

dates = pd.date_range('1/1/2004', periods=365, freq="D")
ts = pd.Series(np.random.randint(0, 101, 365), index=dates)
I need to use .groupby or .reduce with a fixed schema of dates.
Using ts.resample('8d') isn't an option because the dates must not fluctuate within the month, and the last chunk of each month needs to be flexible to handle the different month lengths and leap years.
A list of dates can be obtained through:
g = dates[dates.day.isin([1,8,16,24])]
How can I group or reduce my data to this specific schema so I can compute the sum, max, and min in a more elegant and efficient way than:
for i in range(0, len(g) - 1):
    ts.loc[(g[i] < ts.index) & (ts.index < g[i + 1])]
Well, from a calendar point of view, you can group them into calendar weeks, days of the week, months, and so on.
If that is something you would be interested in, you could do that easily with datetime and pandas, for example:
import datetime
df['week'] = df['date'].dt.isocalendar().week  # create week column (dt.week is deprecated)
df.groupby(['week'])['values'].sum()  # sum values by week
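If the fixed 1/8/16/24 schema from the question is what's needed, another option might be to bin the index explicitly, e.g. with pd.cut; a minimal sketch, assuming the dates and ts defined above:

# Bin edges on the 1st, 8th, 16th and 24th of each month; a final edge one day
# past the end of the series closes the last bin. Month lengths and leap years
# are handled automatically because the edges come from the actual dates.
edges = dates[dates.day.isin([1, 8, 16, 24])].append(
    pd.DatetimeIndex([dates[-1] + pd.Timedelta(days=1)]))
groups = pd.cut(ts.index, bins=edges, right=False)
print(ts.groupby(groups).agg(['sum', 'max', 'min']))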
I have a 40 year time series in the format stn;yyyymmddhh;rainfall, where yyyy = year, mm = month, dd = day, hh = hour. The series is at an hourly resolution. I extracted the maximum values for each year with the following groupby method:
import pandas as pd
df = pd.read_csv('data.txt', delimiter = ";")
df['yyyy'] = df['yyyymmddhh'].astype(str).str[:4]
df.groupby(['yyyy'])['rainfall'].max().reset_index()
Now, I am trying to extract the maximum values for a 3-hour duration in each year. I tried this sliding maxima approach but it is not working. k is the duration I am interested in. In simple words, I need the maximum precipitation sum for multiple durations in every year (e.g. 3h, 6h, etc.).
import numpy as np

class AMS:
    def sliding_max(self, k, data):
        tp = data.values
        period = 24 * 365
        agg_values = []
        start_j = 1
        end_j = k * int(np.floor(period / k))
        for j in range(start_j, end_j + 1):
            start_i = j - 1
            end_i = j + k + 1
            agg_values.append(np.nansum(tp[start_i:end_i]))
        self.sliding_max = max(agg_values)
        return self.sliding_max
Any suggestions or improvements to my code, or is there a way I can implement it with groupby? I am a bit new to the Python environment, so please excuse me if the question isn't put properly.
Stn;yyyymmddhh;rainfall
xyz;1981010100;0.0
xyz;1981010101;0.0
xyz;1981010102;0.0
xyz;1981010103;0.0
xyz;1981010104;0.0
xyz;1981010105;0.0
xyz;1981010106;0.0
xyz;1981010107;0.0
xyz;1981010108;0.0
xyz;1981010109;0.4
xyz;1981010110;0.6
xyz;1981010111;0.1
xyz;1981010112;0.1
xyz;1981010113;0.0
xyz;1981010114;0.1
xyz;1981010115;0.6
xyz;1981010116;0.0
xyz;1981010117;0.0
xyz;1981010118;0.2
xyz;1981010119;0.0
xyz;1981010120;0.0
xyz;1981010121;0.0
xyz;1981010122;0.0
xyz;1981010123;0.0
xyz;1981010200;0.0
You first have to convert your column containing the datetimes to a Series of type datetime. You can do that parsing by providing the format of your datetimes.
df["yyyymmddhh"] = pd.to_datetime(df["yyyymmddhh"], format="%Y%M%d%H")
After having the correct data type you have to set that column as your index and can now use pandas functionality for time series data (resampling in your case).
First you resample the data to 3 hour windows and sum the values. From that you resample to yearly data and take the maximum value of all the 3 hour windows for each year.
df.set_index("yyyymmddhh").resample("3H").sum().resample("Y").max()
# Output
yyyymmddhh rainfall
1981-12-31 1.1
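Note that resample("3H") produces non-overlapping 3-hour blocks. If a sliding 3-hour sum is wanted instead (closer to the sliding-maxima attempt in the question), a rolling time window might be used; a sketch, assuming the same datetime index:

rainfall = df.set_index("yyyymmddhh")["rainfall"]
# Sliding 3-hour sums, then the yearly maximum of those sums.
print(rainfall.rolling("3H").sum().resample("Y").max())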
I have a dataframe of downsampled Open/High/Low/Last/Change/Volume values for a security over ten years.
I'm trying to get the weekly count of samples, i.e. how many samples my downsampling method (in this case a volume bar) produced per week over the entire dataset, so that I can plot it and compare it to other downsampling methods.
So far I've tried creating a series in the df called 'Year-Week', following the answers given here and here.
The problem with these answers is that my EOY dates such as '1997-12-30' get transformed to '1997-01' because of the ISO calendar system used as described in this answer, which breaks my results when I apply the value_counts method.
My code is the following:
volumeBar['Year/Week'] = (pd.Series(volumeBar.index).dt.year.astype(str) + "/" + pd.Series(volumeBar.index).dt.week.astype(str)).values
So my question is: as it stands, the following sample DatetimeIndex
Date
1997-12-22
1997-12-29
1997-12-30
becomes
Year/Week
1997/52
1997/1
1997/1
How could I get the following expected result?
Year/Week
1997/52
1997/52
1997/52
Please keep in mind that I cannot manually correct this behavior because of the size of the dataset and the erratic way these results appear due to how the ISO calendar works.
Many thanks in advance!
You can use the function get_years_week below to get years and weeks without ISO formatting.
import pandas as pd
import datetime

a = {'Date': ['1997-11-29', '1997-12-22',
              '1997-12-29',
              '1997-12-30']}
data = pd.DataFrame(a)
data['Date'] = pd.to_datetime(data['Date'])

# Function for getting weeks and years
def get_years_week(data):
    # Get year from date
    data['year'] = data['Date'].dt.year
    # Loop over each row of the date column and get the week number
    # (days since January 1st of that year, integer-divided by 7, plus 1)
    for i in range(len(data)):
        data.loc[i, 'week'] = ((data['Date'][i]
                                - datetime.datetime(data['Date'][i].year, 1, 1)).days // 7) + 1
    # Create a combined year/week column
    data['year/week'] = data['year'].astype(str) + '/' + data['week'].astype(int).astype(str)
    return data
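For reference, a quick usage sketch on the sample data above:

data = get_years_week(data)
print(data[['Date', 'year/week']])
# 1997-12-29 and 1997-12-30 stay in week 52 of 1997 instead of rolling
# over to week 1, as they would with the ISO calendar.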
I would like to do some annual statistics (cumulative sum) on a daily time series of data in an xarray dataset. The tricky part is that the day on which my considered year begins must be flexible and the time series contains leap years.
I tried e.g. the following:
import numpy as np
import pandas as pd
import xarray as xr

rollday = -181
dr = pd.date_range('2015-01-01', '2017-08-23')
foo = xr.Dataset({'data': (['time'], np.ones(len(dr)))}, coords={'time': dr})
foo_groups = foo.roll(time=rollday).groupby(foo.time.dt.year)
foo_cumsum = foo_groups.apply(lambda x: x.cumsum(dim='time', skipna=True))
which is "unfavorable" mainly because of two things:
(1) the rolling doesn't account for the leap years, so I get an offset of one day per leap year, and
(2) the beginning of the first year (until the end of June) is appended to the end of the rolled time series, which creates a "fake year" where the cumulative sums don't make sense anymore.
I also tried first cutting off the ends of the time series, but then the rolling doesn't work anymore. Resampling also did not seem to be an option, as I could not find a fitting pandas freq string.
I'm sure there is a better/correct way to do this. Can somebody help?
You can use an xarray.DataArray that specifies the groups. One way to do this is to create an array of values (years) that define the group ids:
# setup sample data
dr = pd.date_range('2015-01-01', '2017-08-23')
foo = xr.Dataset({'data': (['time'], np.ones(len(dr)))}, coords={'time': dr})
# create an array of years (modify day/month for your use case)
my_years = xr.DataArray(
    [t.year if ((t.month < 9) or ((t.month == 9) and (t.day < 15))) else (t.year + 1)
     for t in foo.indexes['time']],
    dims='time', name='my_years', coords={'time': dr})
# use that array of years (integers) to do the groupby
foo_cumsum = foo.groupby(my_years).apply(lambda x: x.cumsum(dim='time', skipna=True))
# Voila!
foo_cumsum['data'].plot()
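Because the group ids are derived from the actual calendar dates rather than from rolling the array by a fixed number of positions, leap years should no longer introduce a one-day offset, and the data before the first custom year boundary simply forms its own group instead of being appended to the end of the series.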