Python - Get policy year from datetime dataframe

I have a dataframe (df) with a column ('date') in datetime format YYYY-MM-DD. I am trying to create a new column that returns the policy year, which always starts on April 1st, so the policy year for January through March is always the prior calendar year. Some dates are rather old, so setting up individual date ranges for the sample below wouldn't be ideal.
The dataframe would look like this
df['date']
2020-12-10
2021-02-10
2019-03-31
and output should look like this
2020
2020
2018
I now know how to get the year using df['date'].dt.year. However, I am having trouble getting the dataframe to convert each year to the respective policy year so that if df['date'].dt.month >= 4 then df['date'].dt.year, else df['date'].dt.year - 1
I am not quite sure how to set this up exactly. I have been trying to avoid setting up multiple columns to hold a bool for month >= 4 and then combining them. I've gone so far as to set up the following, but get a ValueError stating that the truth value of the Series is ambiguous:
def PolYear(x):
    y = x.dt.month
    if y >= 4:
        x.dt.year
    else:
        x.dt.year - 1

df['Pol_Year'] = PolYear(df['date'])
I wasn't sure if this was the right way to go about it, so I also tried a df.loc approach for >= 4 and < 4, but got an error that the lengths of the key and value are not equal. Definitely think I'm missing something super simple.
I previously had mentioned 'fiscal year', but this is incorrect.

Quang Hoand had the right idea but used the incorrect frequency in the call to to_period(self, freq). For your purposes you want to use the following code:
df.date.dt.to_period('Q-MAR').dt.qyear
This will give you:
0 2021
1 2021
2 2019
Name: date, dtype: int64
Q-MAR defines fiscal year end in March
These values are the correct fiscal years (fiscal years use the year in which they end, not the year in which they begin [reference]). If you want the output to use the year in which they begin, it's simple:
df.date.dt.to_period('Q-MAR').dt.qyear - 1
Giving you
0 2020
1 2020
2 2018
Name: date, dtype: int64
qyear docs
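If you prefer to stay away from periods entirely, the if/else logic from the question can also be vectorized with numpy.where (a minimal sketch using the sample dates above):
import numpy as np
import pandas as pd

df = pd.DataFrame({'date': pd.to_datetime(['2020-12-10', '2021-02-10', '2019-03-31'])})

# Keep the calendar year from April onwards, otherwise use the prior year
df['Pol_Year'] = np.where(df['date'].dt.month >= 4,
                          df['date'].dt.year,
                          df['date'].dt.year - 1)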

For comparison, this is what qyear gives with the default calendar-quarter frequency ('Q'):
df.date.dt.to_period('Q').dt.qyear
Output:
0 2020
1 2021
2 2019
Name: date, dtype: int64

Related

Calculate the Number of Users at the Start of the Month

I have a table which looks like this:
ID  Start Date  End Date
1   01/01/2022  29/01/2022
2   03/01/2022
3   15/01/2022
4   01/02/2022  01/03/2022
5   01/03/2022  01/05/2022
6   01/04/2022
So, for every row i have the start date of the contract with the user and the end date. If the contract is still present, there will be no end date.
I'm trying to get a table that looks like this:
Feb  Mar  Apr  Jun
  3    3    4    3
Which counts the number of active users on the first day of the month.
What is the most efficient way to calculate this?
At the moment the only idea that came to my mind was to use a scaffold table containing the dates I'm interested in (the first day of every month) and from that easily create the new table I need.
But my question is: is there a better way to solve this? I would love to find a more efficient approach, since I would need to repeat the exact same calculation for the number of users at the start of the week.
This might help:
import pandas as pd

# initializing dataframe
df = pd.DataFrame({'start': ['01/01/2022', '03/01/2022', '15/01/2022', '01/02/2022', '01/03/2022', '01/04/2022'],
                   'end':   ['29/01/2022', '', '', '01/03/2022', '01/05/2022', '']})

# cleaning datetime (the empty ones are replaced with the max exit date)
df['start'] = pd.to_datetime(df['start'], format='%d/%m/%Y')
df['end'] = pd.to_datetime(df['end'], format='%d/%m/%Y', errors='coerce')
df['end'] = df['end'].fillna(df.end.max())

# one row per first-of-month between the earliest start and the latest end
dt_range = pd.date_range(start=df.start.min(), end=df.end.max(), freq='MS')
rows = []
for dat in dt_range:
    rows.append({'month': dat.strftime('%B - %Y'),
                 'number': len(df[(df.start <= dat) & (df.end >= dat)])})
df2 = pd.DataFrame(rows)
Output:
month number
0 January - 2022 1
1 February - 2022 3
2 March - 2022 4
3 April - 2022 4
4 May - 2022 4
Or, if you want the format as in your question:
df2.T
month January - 2022 February - 2022 March - 2022 April - 2022 May - 2022
number 1 3 4 4 4
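For the weekly count mentioned at the end of the question, the same pattern should carry over by swapping the date_range frequency (a sketch assuming Monday-anchored weeks; the weekly frame name is just for illustration):
# Count contracts active on each Monday instead of each first-of-month
wk_range = pd.date_range(start=df.start.min(), end=df.end.max(), freq='W-MON')
weekly = pd.DataFrame({'week': wk_range.strftime('%Y-%m-%d'),
                       'number': [((df.start <= d) & (df.end >= d)).sum() for d in wk_range]})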

Calculating calendar weeks from fiscal weeks

So I am really new to this and struggling with something, which I feel should be quite simple.
I have a Pandas Dataframe containing two columns: Fiscal Week (str) and Amount sold (int).
   Fiscal Week  Amount sold
0      2019031           24
1      2019041           47
2      2019221           34
3      2019231           46
4      2019241           35
My problem is the fiscal week column. It contains strings which describe the fiscal year and week. The fiscal year for this purpose starts on October 1st and ends on September 30th. So basically, 2019031 is the Monday (the 1 at the end) of the third week of October 2019. And 2019221 would be the 2nd week of March 2020.
The issue is that I want to turn this data into timeseries later. But I can't do that with the data in string format - I need it to be in date time format.
I actually added the 1s at the end of all these strings using
df['Fiscal Week']= df['Fiscal Week'].map('{}1'.format)
so that I can then turn it into a proper date:
df['Fiscal Week'] = pd.to_datetime(df['Fiscal Week'], format="%Y%W%w")
as I couldn't figure out how to do it with just the weeks and no day defined.
This, of course, returns the following:
  Fiscal Week  Amount sold
0  2019-01-21           24
1  2019-01-28           47
2  2019-06-03           34
3  2019-06-10           46
4  2019-06-17           35
As expected, this is clearly not what I need, as according to the definition of the fiscal year week 1 is not January at all but rather October.
Is there some simple solution to get the dates to what they are actually supposed to be?
Ideally I would like the final format to be e.g. 2019-03 for the first entry. So basically exactly like the string but in some kind of date format, that I can then work with later on. Alternatively, calendar weeks would also be fine.
Assuming you have a data frame with fiscal dates of the form 'YYYYWW', where YYYY is the calendar year in which the fiscal year starts and WW is the number of weeks into that year, you can convert to calendar dates as follows:
def getCalendarDate(fy_date: str):
    f_year = fy_date[0:4]
    f_week = fy_date[4:]
    fys = pd.to_datetime(f'{f_year}/10/01', format='%Y/%m/%d')
    return fys + pd.to_timedelta(int(f_week), "W")
You can then use this function to create the column of calendar dates as follows:
df['Calendar Date'] = [getCalendarDate(x) for x in df['Fiscal Week'].to_list()]
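A vectorized alternative is also possible (a sketch assuming the strings keep the 'YYYYWW1' layout from the question, so the year is the first four characters and the week number the next two):
# Split the fiscal week string, anchor at October 1st of the fiscal year,
# then add the week number as a timedelta
fy = df['Fiscal Week'].str[:4]
fw = df['Fiscal Week'].str[4:6].astype(int)
df['Calendar Date'] = pd.to_datetime(fy + '-10-01') + pd.to_timedelta(fw * 7, unit='D')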

Pandas - seaborn lineplot hue unexpected legend

I have a data frame of client names, dates and transactions. I'm not sure how far back my error goes, so here is all the pre-processing I do:
data = pd.read_excel('Test.xls')
## convert to datetime object
data['Date Order'] = pd.to_datetime(data['Date Order'], format = '%d.%m.%Y')
## add columns for month and year of each row for easier analysis later
data['month'] = data['Date Order'].dt.month
data['year'] = data['Date Order'].dt.year
So the data frame becomes something like:
Date Order NameCustomers SumOrder month year
2019-01-02 00:00:00 Customer 1 290 1 2019
2019-02-02 00:00:00 Customer 1 50 2 2019
-----
2020-06-28 00:00:00 Customer 2 900 6 2020
------
..etc.
You get the idea. Next I group by both month and year and calculate the mean.
groupedMonthYearMean = data.groupby(['month', 'year'])['SumOrder'].mean().reset_index()
Output:
month year SumOrder
1 2019 233.08
1 2020 303.40
2 2019 255.34
2 2020 842.24
--------------------------
I use the resulting dataframe to make a lineplot, which tracks the SumOrder for each month, and displays it for each year.
linechart = sns.lineplot(x='month',
                         y='SumOrder',
                         hue='year',
                         data=groupedMonthYearMean).set_title('Mean Sum Order by month')
plt.show()
I have attached a screenshot of the resulting plot - overall it seems to show what I expected to create.
In my entire data, the 'year' column has only two values: 2019 and 2020. For some reason, whatever I do, they show up as 0, -1 and -2. Any ideas what is going on?
You want to change the dtype of the year column from int to category
df['year'] = df['year'].astype('category')
This is due to how hue treats ints: a numeric hue column is interpreted as continuous, so seaborn builds a scaled numeric legend instead of listing the two actual years.
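With the grouped frame from the question, that might look like this (a sketch; the cast happens before plotting so seaborn builds a discrete legend):
import seaborn as sns
import matplotlib.pyplot as plt

# Treat year as a discrete category rather than a continuous number
groupedMonthYearMean['year'] = groupedMonthYearMean['year'].astype('category')

ax = sns.lineplot(x='month', y='SumOrder', hue='year',
                  data=groupedMonthYearMean)
ax.set_title('Mean Sum Order by month')
plt.show()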

how can I align different-day timeseries in pandas?

I have two time series, df1
day cnt
2020-03-01 135006282
2020-03-02 145184482
2020-03-03 146361872
2020-03-04 147702306
2020-03-05 148242336
and df2:
day cnt
2017-03-01 149104078
2017-03-02 149781629
2017-03-03 151963252
2017-03-04 147384922
2017-03-05 143466746
The problem is that the sensors I'm measuring are sensitive to the day of the week, so on Sunday, for instance, they will produce less cnt. Now I need to compare the time series over 2 different years, 2017 and 2020, but to do that I have to align (March, in this case) to the matching day of the week, and plot them accordingly. How do I "shift" the data to make the series comparable?
The ISO calendar is a representation of a date as a tuple (year, weeknumber, weekday). In pandas these are the dt members year, weekofyear and weekday. So assuming that the day column actually contains Timestamps (convert it first with to_datetime if it does not), you could do:
df1['Y'] = df1.day.dt.year
df1['W'] = df1.day.dt.weekofyear
df1['D'] = df1.day.dt.weekday
Then you could align the dataframes on the W and D columns
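A minimal sketch of that alignment, assuming df2 gets the same helper columns:
# Same helper columns for the 2017 frame
df2['Y'] = df2.day.dt.year
df2['W'] = df2.day.dt.weekofyear
df2['D'] = df2.day.dt.weekday

# Align the two years on week-of-year and weekday
aligned = df1.merge(df2, on=['W', 'D'], suffixes=('_2020', '_2017'))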
March 2017 started on a Wednesday, while March 2020 started on a Sunday.
So, delete the last 3 days of March 2017, and delete the first Sunday, Monday and Tuesday from March 2020.
This way you have comparable days:
df1['ctn2020'] = df1['cnt']
df2['cnt2017'] = df2['cnt']
df1 = df1.iloc[3:, 2]
df2 = df2.iloc[:-3, 2]
Since you don't want to plot the date, but want the months to align, make a new dataframe with both columns and an index column. This way you will have 3 columns: index (0-27), 2017 and 2020. The index will represent the position of each comparable day within the month.
new_df = pd.concat([df1.reset_index(drop=True), df2.reset_index(drop=True)], axis=1)
If you also want to plot the days of the week on the x axis, check out this link, to know how to get the day of the week from a date, and them change the x ticks label.
Sorry for the written step-by-step; if it all sounds confusing, I can type the whole code later for you.

Combining Columns in Isoweek Object

I have a pandas dataframe that contains a year and week column:
year  week
2018    18
2019    17
2019    17
I'm trying to combine the year and week columns into a new 'isoweek' column using the isoweek library. I can't seem to figure out how to properly loop through the rows to create the object column. If I do something like:
df['isoweek'] = Week(df['year'],df['week'])
isoweek chokes on the vectorization. I've tried creating a basic list and appending it to my dataframe, like so:
obj_list = []
for i in range(500):
    year = df['year'][i]
    week = df['week'][i]
    w = Week(year, week)
    obj_list.append(w)

df['isoweek'] = obj_list
But I end up with a simple tuple in the column.
The goal is to be able to use some of the isoweek library's operations to calculate date differences, like:
df['isoweek'] - 4
>isoweek.Week(2019, 34)
Is it even possible to store an object like this in a dataframe column? If so, how does one go about it?
As an alternative, you can use the built-in datetime handling:
df['week_start'] = pd.to_datetime(df['year'].astype(str), format='%Y') + pd.to_timedelta(df['week'].mul(7).astype(str) + ' days')
# Output:
week year week_start
0 18 2018 2018-05-07
1 17 2019 2019-04-30
2 17 2019 2019-04-30
Calculating time differences is pretty straightforward here:
# Choose 7 weeks
n_weeks = pd.to_timedelta(7, unit='W')
# Adding is simple
df['week_start'] + n_weeks
# Output
0 2018-06-25
1 2019-06-18
2 2019-06-18
For more on this, read: Pandas: How to create a datetime object from Week and Year?
Potentially you could do this
First, set up the example dataframe
import pandas as pd
from isoweek import Week

df = pd.DataFrame({'year': [2018, 2019, 2019],
                   'week': [18, 17, 17]})
Loop through the dataframe, adding the isoweek to a list
ls_isoweek = []
for row in df.itertuples():
    ls_isoweek.append(Week(row[1], row[2]))
The list looks like this
[isoweek.Week(2018, 18), isoweek.Week(2019, 17), isoweek.Week(2019, 17)]
This list can be accessed thusly
ls_isoweek[0] - 4
Produces this output
isoweek.Week(2018, 14)
However, the list can also be added back to the dataframe if you wish
df['isoweek'] = ls_isoweek
You can then do things like ...
df['isoweek_minus_4'] = df['isoweek'].apply(lambda x: x-4)
Producing a new column of isoweek.Week objects shifted back by four weeks.
A little late, but if anyone else is still looking to use a solution of this form as I was, you could use lambda functions along with apply. For the dataframe below (with int64 dtypes),
year week
0 2018 18
1 2019 17
2 2019 17
Now we use isoweek to appropriately parse the data,
from isoweek import Week
df.apply(lambda row : Week(row["year"],row["week"]),axis=1)
This produces the output,
0 (2018, 18)
1 (2019, 17)
2 (2019, 17)
dtype: object
You could also identify the (week,year) with a datetime object by combining this approach with this answer https://stackoverflow.com/a/7687085.
df.apply(lambda row : Week(int(row["year"]),int(row["week"])).monday(),axis=1)
The int appears a little redundant there, but pandas by default uses int64 which doesn't appear to function with isoweek correctly. This produces the output,
0 2018-04-30
1 2019-04-22
2 2019-04-22
dtype: object
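If you take the .monday() route above, shifting by weeks becomes plain timedelta arithmetic (a sketch; the column names are just for illustration):
# Monday of each ISO week, then shift back by four weeks
df['monday'] = df.apply(lambda row: Week(int(row['year']), int(row['week'])).monday(), axis=1)
df['monday_minus_4'] = pd.to_datetime(df['monday']) - pd.to_timedelta(4, unit='W')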
