Rounding pandas column to year - python

Python and Pandas beginner here.
I want to round off a pandas dataframe column to years. Dates before the 1st of July must be rounded off to the current year and dates after and on the 1st of July must be rounded up to the next year.
For example:
2011-04-05 must be rounded to 2011
2011-08-09 must be rounded to 2012
2011-06-30 must be rounded to 2011
2011-07-01 must be rounded to 2012
What I've tried:
pd.series.dt.round(freq='Y')
Gives the error: ValueError: <YearEnd: month=12> is a non-fixed frequency
The dataframe column has a wide variety of dates, starting from 1945 all the way up to 2021. Therefore a simple if df.date < 2011-07-01: df['Date']+ pd.offsets.YearBegin(-1) is not working.
I also tried the dt.to_period('Y') function, but then I can't give the before and after the 1st of July argument.
Any tips on how I can solve this issue?

Suppose you have this dataframe:
dates
0 2011-04-05
1 2011-08-09
2 2011-06-30
3 2011-07-01
4 1945-06-30
5 1945-07-01
Then:
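import numpy as np
import pandas as pd

# convert to datetime:
df["dates"] = pd.to_datetime(df["dates"])
# months 1-6 keep the calendar year; July onwards rolls over to the next year
df["year"] = np.where(
    (df["dates"].dt.month < 7), df["dates"].dt.year, df["dates"].dt.year + 1
)
print(df)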
Prints:
dates year
0 2011-04-05 2011
1 2011-08-09 2012
2 2011-06-30 2011
3 2011-07-01 2012
4 1945-06-30 1945
5 1945-07-01 1946
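The same rule can also be written without NumPy. A minimal sketch, assuming the column is named dates as in the example above: the comparison month >= 7 yields booleans, which are cast to 0 or 1 and added to the calendar year.

```python
import pandas as pd

df = pd.DataFrame({"dates": ["2011-04-05", "2011-08-09", "2011-06-30",
                             "2011-07-01", "1945-06-30", "1945-07-01"]})
df["dates"] = pd.to_datetime(df["dates"])

# True (July or later) becomes 1, False (January to June) becomes 0
df["year"] = df["dates"].dt.year + (df["dates"].dt.month >= 7).astype(int)
print(df)
```

Because the rule only depends on the month, no per-year cutoff date is needed, so it should behave the same for 1945 as for 2021.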

A somewhat roundabout way is to convert the date values to strings, split them, and then classify them in a loop, like so:
rounded = []
for d in df["Date"]:  # assuming the column's name is "Date"
    datesplit = str(d).split("-")  # convert the element to a string and split it
    Yr = int(datesplit[0])  # get the year and convert it back to a number
    Mth = int(datesplit[1])  # get the month and convert it back to a number
    if Mth < 7:  # any date before July 1st
        rnd_Yr = Yr
    else:  # any date on or after July 1st
        rnd_Yr = Yr + 1
    rounded.append(rnd_Yr)
df["year"] = rounded  # store the rounded years in a new column

Related

Convert Integer or Float to Year?

I am trying to convert a column with type Integer to Year. Here is my situation:
Original Column: June 13, 1980 (United States)
I split and slice it into
Year Column: 1980
Here, I tried to use:
df['Year'] = pd.to_datetime(df['Year'])
However, the year in the converted column no longer matches the original column. For example,
Original Year
1980 1970
2000 1970
2016 1970
I am looking forward to your help. Thank you in advance.
Best Regards,
Tu Le
# pandas 2.0 and later reject the unit-less 'datetime64', so the unit is spelled out
df['Year'] = df['Original'].astype(str).astype('datetime64[ns]')
print(df)
Prints:
Original Year
0 1980 1980-01-01
1 2000 2000-01-01
2 2016 2016-01-01
If you need datetimes built from the year alone, month=1 and day=1 are filled in for you. Add the format parameter, here %Y for YYYY:
df['Year'] = pd.to_datetime(df['Year'], format='%Y')
print (df)
Original Year
0 1980 1980-01-01
1 2000 2000-01-01
2 2016 2016-01-01
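The 1970 values in the question come from how pd.to_datetime treats bare integers: without a format they are read as nanoseconds since 1970-01-01. A small sketch of the difference, using a made-up Series named years:

```python
import pandas as pd

years = pd.Series([1980, 2000, 2016])

# No format: each integer is taken as nanoseconds since the Unix epoch
as_epoch = pd.to_datetime(years)

# With format="%Y": each integer is parsed as a four-digit year
as_year = pd.to_datetime(years, format="%Y")

print(as_epoch.dt.year.tolist())
print(as_year.dt.year.tolist())
```

If only the year number is needed afterwards, as_year.dt.year gives it back as an integer column.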

How to get a date from year, month, week of month and Day of week in Pandas?

I have a Pandas dataframe, which looks like below
I want to create a new column, which tells the exact date from the information from all the above columns. The code should look something like this:
df['Date'] = pd.to_datetime(df['Month']+df['WeekOfMonth']+df['DayOfWeek']+df['Year'])
I was able to find a workaround for your case. You will need to define the dictionaries for the months and the days of the week.
month = {"Jan":"01", "Feb":"02", "March":"03", "Apr": "04", "May":"05", "Jun":"06", "Jul":"07", "Aug":"08", "Sep":"09", "Oct":"10", "Nov":"11", "Dec":"12"}
week = {"Monday":1,"Tuesday":2,"Wednesday":3,"Thursday":4,"Friday":5,"Saturday":6,"Sunday":7}
With these dictionaries, the transformation I used on a custom dataframe was:
rows = [["Dec",5,"Wednesday", "1995"],
["Jan",3,"Wednesday","2013"]]
df = pd.DataFrame(rows, columns=["Month","Week","Weekday","Year"])
df['Date'] = (df["Year"] + "-" + df["Month"].map(month) + "-" + (df["Week"].apply(lambda x: (x - 1)*7) + df["Weekday"].map(week).apply(int) ).apply(str)).astype('datetime64[ns]')
However, you have to be careful. Some of the data you posted as an example produces day numbers that run past the end of the month. For example, for
row = ["Oct",5,"Friday","2018"]
the computed date is 2018-10-33, which is not a valid date. I recommend adding some logic to filter your data in order to avoid this kind of problem.
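One possible way to filter such rows, sketched here with hand-written date strings rather than the asker's data: pd.to_datetime accepts errors="coerce", which turns values that are not valid dates into NaT so they can be dropped or inspected.

```python
import pandas as pd

# Strings as they would come out of the Year-Month-Day concatenation above
raw = pd.Series(["1995-12-31", "2013-01-17", "2018-10-33"])

# Invalid calendar dates become NaT instead of raising an error
parsed = pd.to_datetime(raw, format="%Y-%m-%d", errors="coerce")

# Show which inputs could not be converted
print(raw[parsed.isna()].tolist())
```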
Let's approach it in 3 steps as follows:
1. Get the date of the month start, Month_Start, from Year and Month
2. Calculate the date offset, DateOffset, relative to Month_Start from WeekOfMonth and DayOfWeek
3. Get the actual date, Date, from Month_Start and DateOffset
Here's the code:
import time
import pandas as pd

# step 1: first day of the month, e.g. "1995" + "Dec" + "01"
df['Month_Start'] = pd.to_datetime(df['Year'].astype(str) + df['Month'] + '01', format="%Y%b%d")
# step 2: whole weeks, plus the weekday number (Monday=0), minus the weekday of the month start
df['DateOffset'] = (df['WeekOfMonth'] - 1) * 7 + df['DayOfWeek'].map(lambda x: time.strptime(x, '%A').tm_wday) - df['Month_Start'].dt.dayofweek
# step 3: shift the month start by the offset in days
df['Date'] = df['Month_Start'] + pd.to_timedelta(df['DateOffset'], unit='D')
Output:
Month WeekOfMonth DayOfWeek Year Month_Start DateOffset Date
0 Dec 5 Wednesday 1995 1995-12-01 26 1995-12-27
1 Jan 3 Wednesday 2013 2013-01-01 15 2013-01-16
2 Oct 5 Friday 2018 2018-10-01 32 2018-11-02
3 Jun 2 Saturday 1980 1980-06-01 6 1980-06-07
4 Jan 5 Monday 1976 1976-01-01 25 1976-01-26
The Date column now contains the dates derived from the information from other columns.
You can remove the working interim columns, if you like, as follows:
df = df.drop(['Month_Start', 'DateOffset'], axis=1)
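Note that the Oct / 5 / Friday / 2018 row in the output lands on 2018-11-02, in the following month. A rough sketch of how such rows could be flagged, under some assumptions: the column names match the output above, and the two lookup dictionaries (month_num, weekday_num) are stand-ins introduced here for the strptime calls so the example does not depend on the locale.

```python
import pandas as pd

df = pd.DataFrame({
    "Month": ["Dec", "Oct"],
    "WeekOfMonth": [5, 5],
    "DayOfWeek": ["Wednesday", "Friday"],
    "Year": [1995, 2018],
})

# Hypothetical lookup tables (not part of the original answer)
month_num = {m: i for i, m in enumerate(
    ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
     "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"], start=1)}
weekday_num = {d: i for i, d in enumerate(
    ["Monday", "Tuesday", "Wednesday", "Thursday",
     "Friday", "Saturday", "Sunday"])}

# First day of each month, assembled from year/month/day columns
month_start = pd.to_datetime(pd.DataFrame({
    "year": df["Year"], "month": df["Month"].map(month_num), "day": 1}))

# Same offset arithmetic as in the three steps above
offset = ((df["WeekOfMonth"] - 1) * 7
          + df["DayOfWeek"].map(weekday_num)
          - month_start.dt.dayofweek)
df["Date"] = month_start + pd.to_timedelta(offset, unit="D")

# True when the computed date has spilled into another month
df["Overflow"] = df["Date"].dt.month != month_start.dt.month
print(df)
```

Whether an overflowing row should be dropped or kept depends on how WeekOfMonth was defined in the source data, which the question does not say.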

Python: Pandas dataframe get the year to which the week number belongs and not the year of the date

I have a csv-file: https://data.rivm.nl/covid-19/COVID-19_aantallen_gemeente_per_dag.csv
I want to use it to provide insight into the corona deaths per week.
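import pandas as pd

# on_bad_lines="skip" replaces error_bad_lines=False, which was removed in pandas 2.0
df = pd.read_csv("covid.csv", on_bad_lines="skip", sep=";")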
df = df.loc[df['Deceased'] > 0]
df["Date_of_publication"] = pd.to_datetime(df["Date_of_publication"])
df["Week"] = df["Date_of_publication"].dt.isocalendar().week
df["Year"] = df["Date_of_publication"].dt.year
df = df[["Week", "Year", "Municipality_name", "Deceased"]]
df = df.groupby(by=["Week", "Year", "Municipality_name"]).agg({"Deceased" : "sum"})
df = df.sort_values(by=["Year", "Week"])
print(df)
Everything seems to be working fine except for the first 3 days of 2021. The first 3 days of 2021 are part of the last week (53) of 2020: http://week-number.net/calendar-with-week-numbers-2021.html.
When I print the dataframe this is the result:
53 2021 Winterswijk 1
Woudenberg 1
Zaanstad 1
Zeist 2
Zutphen 1
So basically what I'm looking for is a way where this line returns the year of the week number and not the year of the date:
df["Year"] = df["Date_of_publication"].dt.year
You can use dt.isocalendar().year to set up df["Year"]:
df["Year"] = df["Date_of_publication"].dt.isocalendar().year
With this, 2021-01-01 is assigned to year 2020, while 2021-01-04 is assigned to year 2021.
This mirrors how you used dt.isocalendar().week to set up df["Week"]. Since both are based on the same (year, week, day) tuple returned by dt.isocalendar(), they will always be in sync.
Demo
date_s = pd.Series(pd.date_range(start='2021-01-01', periods=5, freq='1D'))
date_s
0
0 2021-01-01
1 2021-01-02
2 2021-01-03
3 2021-01-04
4 2021-01-05
date_s.dt.isocalendar()
year week day
0 2020 53 5
1 2020 53 6
2 2020 53 7
3 2021 1 1
4 2021 1 2
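To illustrate the effect on the weekly totals, here is a small sketch with made-up numbers around the year boundary (the column names follow the question, the counts are invented). Both grouping keys are taken from the same isocalendar() result:

```python
import pandas as pd

# Made-up sample around the 2020/2021 year boundary
df = pd.DataFrame({
    "Date_of_publication": pd.to_datetime(
        ["2020-12-31", "2021-01-01", "2021-01-04"]),
    "Deceased": [1, 2, 4],
})

# ISO year and ISO week come from the same (year, week, day) result
iso = df["Date_of_publication"].dt.isocalendar()
df["Year"] = iso["year"].astype(int)
df["Week"] = iso["week"].astype(int)

# 2020-12-31 and 2021-01-01 fall into the same group: week 53 of 2020
weekly = df.groupby(["Year", "Week"])["Deceased"].sum()
print(weekly)
```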
Alternatively, you can subtract two dates and then divide the days attribute of the resulting timedelta by 7.
For example, this is the time elapsed since the start of 2021:
import datetime as dt
time_delta = (dt.datetime.today() - dt.datetime(2021, 1, 1))
The output is a datetime.timedelta object:
datetime.timedelta(days=75, seconds=84904, microseconds=144959)
For your problem, you'd do something like this, assuming df["Year"] holds the calendar year as an integer:
year_start = pd.to_datetime(df["Year"].astype(str), format="%Y")
weeks = (df["Date_of_publication"] - year_start).dt.days // 7
The output is the number of whole weeks between 1 January of that year and Date_of_publication.

How to convert columns in a dataframe into time series?

So I selected 3 columns from my dataframe in order to create a time series that I could then plot:
booking_date = pd.DataFrame({'day': hotel_bookings_cleaned["arrival_date_day_of_month"],
'month': hotel_bookings_cleaned["arrival_date_month"],
'year': hotel_bookings_cleaned["arrival_date_year"]})
and the output looks like:
day month year
0 1 July 2015
1 1 July 2015
2 1 July 2015
3 1 July 2015
4 1 July 2015
I tried using
dates = pd.to_datetime(booking_date)
but got the error message
ValueError: Unable to parse string "July" at position 0
I'm assuming I need to convert the Month column to a numeric value before I can convert it to a datetime, but I haven't been able to make any parsers work.
Try this
dates = pd.to_datetime(booking_date.astype(str).agg('-'.join, axis=1), format='%d-%B-%Y')
Out[13]:
0 2015-07-01
1 2015-07-01
2 2015-07-01
3 2015-07-01
4 2015-07-01
dtype: datetime64[ns]
Not sure if this is more performant than the previous answer, but you can convert your string column to integers with a dictionary mapping to fit the format that pandas expects in to_datetime()
month_map = {
'January':1,
'February':2,
'March':3,
'April':4,
'May':5,
'June':6,
'July':7,
'August':8,
'September':9,
'October':10,
'November':11,
'December':12
}
dates = pd.DataFrame({
'day':booking_date.day,
'month':booking_date.month.apply(lambda x: month_map[x]),
'year':booking_date.year
})
ts = pd.to_datetime(dates)
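The month dictionary does not have to be typed by hand. A short sketch, assuming English month names (calendar.month_name follows the current locale) and a small made-up booking_date frame:

```python
import calendar
import pandas as pd

booking_date = pd.DataFrame({"day": [1, 15],
                             "month": ["July", "August"],
                             "year": [2015, 2016]})

# calendar.month_name[0] is an empty string, so it is skipped
month_map = {name: num for num, name in enumerate(calendar.month_name) if name}

# Replace the month names with numbers, then assemble the dates
dates = booking_date.assign(month=booking_date["month"].map(month_map))
ts = pd.to_datetime(dates)
print(ts)
```

Using Series.map with the dictionary leaves NaN for any name that is not found, which makes misspelled months easier to spot than a KeyError inside a lambda.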

How to find the median month between two dates?

I need to find the median month value between two dates in a data frame. I am simplifying the case by showing four examples.
import pandas as pd
import numpy as np
import datetime
df=pd.DataFrame([["1/31/2016","3/1/2016"],
["6/15/2016","7/14/2016"],
["7/14/2016","8/15/2016"],
["8/7/2016","9/6/2016"]], columns=['FromDate','ToDate'])
df['Month'] = df.ToDate.dt.month-df.FromDate.dt.month
I am trying to append a column but I am not getting the desired result.
I need to see these values: [2,6,7,8].
You can calculate the average date explicitly by adding half the timedelta between 2 dates to the earlier date. Then just extract the month:
# convert to datetime if necessary
df[df.columns] = df[df.columns].apply(pd.to_datetime)
# calculate mean date, then extract month
df['Month'] = (df['FromDate'] + (df['ToDate'] - df['FromDate']) / 2).dt.month
print(df)
FromDate ToDate Month
0 2016-01-31 2016-03-01 2
1 2016-06-15 2016-07-14 6
2 2016-07-14 2016-08-15 7
3 2016-08-07 2016-09-06 8
You need to convert the strings to datetime before using dt.month.
This line calculates the average month number:
df['Month'] = (pd.to_datetime(df['ToDate']).dt.month +
pd.to_datetime(df['FromDate']).dt.month)//2
print(df)
FromDate ToDate Month
0 1/31/2016 3/1/2016 2
1 6/15/2016 7/14/2016 6
2 7/14/2016 8/15/2016 7
3 8/7/2016 9/6/2016 8
This only works with both dates in the same year.
jpp's solution is fine, but in some cases it gives an unexpected answer:
["1/1/2016","3/1/2016"]: one would expect 2, because February lies between January and March, but jpp's solution gives 1 (January), since the midpoint of that range falls on 2016-01-31.
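To see where the two approaches diverge, here is a small sketch on two hand-picked ranges (not from the question): the January to March case mentioned above, and a range that crosses a year boundary.

```python
import pandas as pd

df = pd.DataFrame({"FromDate": ["1/1/2016", "12/15/2015"],
                   "ToDate": ["3/1/2016", "2/15/2016"]})
start = pd.to_datetime(df["FromDate"], format="%m/%d/%Y")
end = pd.to_datetime(df["ToDate"], format="%m/%d/%Y")

# Month of the calendar midpoint between the two dates
df["MidpointMonth"] = (start + (end - start) / 2).dt.month

# Floor of the average of the two month numbers
df["AverageMonth"] = (start.dt.month + end.dt.month) // 2
print(df)
```

For the first row the midpoint lands in January while the averaged month number is 2. For the second row the midpoint is in January, whereas averaging 12 and 2 yields 7, which illustrates why the averaging approach is limited to dates within one year.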
