Python: How to calculate the difference between the current year and a year from a column?

I have a column "DateBecameRep_Year" that contains only year values in it (i.e. 1974, 1999, etc.). I want to create a new column in my dataframe that calculates the difference between the current year and the year in the "DateBecameRep_Year" field.
Below is the code I tried to use:
df_DD['DateBecameRep_Year'] = pd.to_datetime(df_DD['DateBecameRep_Year'])
df_DD['Current Year'] = datetime.now().year
df_DD['Current Year'] = pd.to_datetime(df_DD['Current Year'])
df_DD['Years_Since_BecameRep'] = df_DD['Current Year'] - df_DD['DateBecameRep_Year']
df_DD['Years_Since_BecameRep'] = df_DD['Years_Since_BecameRep'] / np.timedelta64(1, 'Y')
df_DD['Years_Since_BecameRep'].head()
This is the output I get which looks very strange:
My hypothesis is that this has something to do with the following:
Any help is greatly appreciated!

If you just want the difference in years, simple subtraction is enough; there is no need to convert to datetime.
import datetime
import pandas as pd

current_year = datetime.datetime.now().year  # get the current year
df_DD = pd.DataFrame.from_dict({"DateBecameRep_Year": [1999, 2000, 2015, 1898, 1788, 1854]})
df_DD['Current Year'] = current_year
df_DD["Years_Since_BecameRep"] = df_DD['Current Year'] - df_DD['DateBecameRep_Year']  # subtract to get the year delta
df_DD will be (this output was produced in 2017):
   DateBecameRep_Year  Current Year  Years_Since_BecameRep
0                1999          2017                     18
1                2000          2017                     17
2                2015          2017                      2
3                1898          2017                    119
4                1788          2017                    229
5                1854          2017                    163
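As for why the original attempt gave strange values: by default pd.to_datetime treats bare integers as nanoseconds since the Unix epoch, so 1974 becomes a timestamp a fraction of a second after 1970-01-01 rather than the year 1974. Here is a minimal sketch of the pitfall and of one way around it if datetimes are really needed. The sample values and variable names are illustrative only.

```python
import pandas as pd

years = pd.Series([1974, 1999])

# Integers are read as nanoseconds since 1970-01-01, not as years.
as_nanoseconds = pd.to_datetime(years)

# Parsing the values as strings with an explicit format gives 1 January of each year.
as_years = pd.to_datetime(years.astype(str), format="%Y")

print(as_nanoseconds.dt.year.tolist())  # [1970, 1970]
print(as_years.dt.year.tolist())        # [1974, 1999]
```

For a plain year difference, the integer subtraction shown in the answer remains the simpler option.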

Related

How to convert date format (dd/mm/yyyy) to days in python csv

I need a function to count the total number of days in the 'days' column between a start date of 1st Jan 1995 and an end date of 31st Dec 2019 in a dataframe taking leap years into account as well.
Example: 1st Jan 1995 - Day 1, 1st Feb 1995 - Day 32 .......and so on all the way to 31st.
If you want to filter a pandas dataframe to the range between two dates, you can do this:
start_date = '1995/01/01'
end_date = '1995/02/01'
df = df[ (df['days']>=start_date) & (df['days']<=end_date) ]
len(df) then gives the number of rows in the filtered dataframe.
If you instead want to calculate the number of days between two dates, you can do it without pandas, using datetime:
from datetime import datetime
start_date = '1995/01/01'
end_date = '1995/02/01'
delta = datetime.strptime(end_date, '%Y/%m/%d') - datetime.strptime(start_date, '%Y/%m/%d')
print(delta.days)
Output:
31
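Note that datetime follows the Gregorian calendar, so the subtraction does take leap years into account. The result is the number of days elapsed, so add 1 if the start date itself should count as Day 1 (1st Feb 1995 is then Day 32).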
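To get a day number for every row of a dataframe, the same subtraction can be done column-wise in pandas. This is a minimal sketch, assuming the 'days' column holds dd/mm/yyyy strings as the question title suggests. The sample rows and the 'day_number' column name are hypothetical.

```python
import pandas as pd

# Hypothetical sample; 'days' holds dd/mm/yyyy strings.
df = pd.DataFrame({"days": ["01/01/1995", "01/02/1995", "29/02/1996", "31/12/2019"]})

start = pd.Timestamp("1995-01-01")
parsed = pd.to_datetime(df["days"], format="%d/%m/%Y")

# Elapsed days plus one, so that the start date itself is Day 1.
df["day_number"] = (parsed - start).dt.days + 1
print(df)
```

With this numbering, 31st Dec 2019 comes out as Day 9131, which includes the six leap days between 1995 and 2019.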

Convert daily data to weekly by taking average of the 7 days

I've created the following dataframe from the data given on the CDC link.
googledata = pd.read_csv('/content/data_table_for_daily_case_trends__the_united_states.csv', header=2)
# Inspect data
googledata.head()
id  State          Date         New Cases
0   United States  Oct 2 2022       11553
1   United States  Oct 1 2022        8024
2   United States  Sep 30 2022      46383
3   United States  Sep 29 2022      89873
4   United States  Sep 28 2022      63763
After converting the Date column to datetime, I used a mask to keep only the last year of data:
googledata['Date'] = pd.to_datetime(googledata['Date'])
df = googledata
start_date = '2021-10-1'
end_date = '2022-10-1'
mask = (df['Date'] > start_date) & (df['Date'] <= end_date)
df = df.loc[mask]
The problem is that the data is daily, and I want it weekly: that is, I want to reduce the 365 rows to 52 rows, one per week, each holding the mean of New Cases over that week's 7 days.
I tried the following method from a previous post: link. I don't think I am even applying it correctly, because this code never asks for my dataframe!
logic = {'New Cases' : 'mean'}
offset = pd.offsets.timedelta(days=-6)
f = pd.read_clipboard(parse_dates=['Date'], index_col=['Date'])
f.resample('W', loffset=offset).apply(logic)
But I am getting the following error:
AttributeError: module 'pandas.tseries.offsets' has no attribute 'timedelta'
If I understand correctly, you want to resample:
df = df.set_index("Date")
df.index = df.index - pd.tseries.frequencies.to_offset("6D")
df = df.resample("W").agg({"New Cases": "mean"}).reset_index()
You can also use strftime to convert the date to a year-week label before applying groupby (%U numbers the weeks of the year starting on Sunday):
df['Week'] = df['Date'].dt.strftime('%Y-%U')
df.groupby('Week')['New Cases'].mean()
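As a self-contained illustration of the resampling approach, the sketch below builds a small stand-in for the CDC data and averages it per week. The sample values are made up. The on= argument is used here instead of setting the index, and the loffset argument from the question is avoided because it has been removed from recent pandas releases.

```python
import pandas as pd

# Hypothetical stand-in for the CDC data: 14 consecutive days, Monday to Sunday twice.
df = pd.DataFrame({
    "Date": pd.date_range("2022-09-19", periods=14, freq="D"),
    "New Cases": range(1, 15),
})

# 'W' bins end on Sunday; each label is the last day of its week.
weekly = df.resample("W", on="Date")["New Cases"].mean()
print(weekly)
```

Each output row is labelled with the Sunday that closes the week. If labels at the start of the week are preferred, six days can be subtracted from the resulting index afterwards.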

How to get a date from year, month, week of month and Day of week in Pandas?

I have a Pandas dataframe, which looks like below
I want to create a new column, which tells the exact date from the information from all the above columns. The code should look something like this:
df['Date'] = pd.to_datetime(df['Month']+df['WeekOfMonth']+df['DayOfWeek']+df['Year'])
I was able to find a workaround for your case. You will need to define the dictionaries for the months and the days of the week.
month = {"Jan":"01", "Feb":"02", "Mar":"03", "Apr": "04", "May":"05", "Jun":"06", "Jul":"07", "Aug":"08", "Sep":"09", "Oct":"10", "Nov":"11", "Dec":"12"}
week = {"Monday":1,"Tuesday":2,"Wednesday":3,"Thursday":4,"Friday":5,"Saturday":6,"Sunday":7}
With these dictionaries, the transformation I used on a custom dataframe was:
rows = [["Dec",5,"Wednesday", "1995"],
["Jan",3,"Wednesday","2013"]]
df = pd.DataFrame(rows, columns=["Month","Week","Weekday","Year"])
df['Date'] = (df["Year"] + "-" + df["Month"].map(month) + "-" + (df["Week"].apply(lambda x: (x - 1)*7) + df["Weekday"].map(week).apply(int) ).apply(str)).astype('datetime64[ns]')
However, you have to be careful: some of the rows you posted as examples produce a day number that exceeds the length of the month. For example, for
row = ["Oct",5,"Friday","2018"]
the resulting date is 2018-10-33, which is not a valid date. I recommend adding some logic to filter your data in order to avoid this kind of problem.
Let's approach it in 3 steps:
1. Get the month start date Month_Start from Year and Month.
2. Calculate the day offset DateOffset relative to Month_Start from WeekOfMonth and DayOfWeek.
3. Get the actual date Date from Month_Start and DateOffset.
Here's the code:
import time
df['Month_Start'] = pd.to_datetime(df['Year'].astype(str) + df['Month'] + '01', format="%Y%b%d")
df['DateOffset'] = (df['WeekOfMonth'] - 1) * 7 + df['DayOfWeek'].map(lambda x: time.strptime(x, '%A').tm_wday) - df['Month_Start'].dt.dayofweek
df['Date'] = df['Month_Start'] + pd.to_timedelta(df['DateOffset'], unit='D')
Output:
  Month  WeekOfMonth  DayOfWeek  Year Month_Start  DateOffset       Date
0   Dec            5  Wednesday  1995  1995-12-01          26 1995-12-27
1   Jan            3  Wednesday  2013  2013-01-01          15 2013-01-16
2   Oct            5     Friday  2018  2018-10-01          32 2018-11-02
3   Jun            2   Saturday  1980  1980-06-01           6 1980-06-07
4   Jan            5     Monday  1976  1976-01-01          25 1976-01-26
The Date column now contains the dates derived from the information from other columns.
You can remove the working interim columns, if you like, as follows:
df = df.drop(['Month_Start', 'DateOffset'], axis=1)
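The three steps can also be sketched as a self-contained script that flags rows whose computed date falls outside the stated month (as happens for Oct, week 5, Friday, 2018). The MONTHS and WEEKDAYS lookups and the InMonth column are assumptions of this sketch, not part of the answers above. Explicit lookups are used so that the result does not depend on the locale used for name parsing.

```python
import pandas as pd

# Explicit lookups (an assumption of this sketch) avoid locale-dependent name parsing.
MONTHS = {m: i for i, m in enumerate(
    ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
     "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"], start=1)}
WEEKDAYS = {d: i for i, d in enumerate(
    ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"])}

df = pd.DataFrame({
    "Month": ["Dec", "Jan", "Oct", "Jun", "Jan"],
    "WeekOfMonth": [5, 3, 5, 2, 5],
    "DayOfWeek": ["Wednesday", "Wednesday", "Friday", "Saturday", "Monday"],
    "Year": [1995, 2013, 2018, 1980, 1976],
})

# Step 1: first day of each month.
month_start = pd.to_datetime(pd.DataFrame({
    "year": df["Year"], "month": df["Month"].map(MONTHS), "day": 1}))

# Step 2: offset in days, treating weeks as Monday-based calendar rows.
offset = ((df["WeekOfMonth"] - 1) * 7
          + df["DayOfWeek"].map(WEEKDAYS)
          - month_start.dt.dayofweek)

# Step 3: the resulting date, plus a flag for rows that spill into another month.
df["Date"] = month_start + pd.to_timedelta(offset, unit="D")
df["InMonth"] = df["Date"].dt.month == month_start.dt.month
print(df)
```

Rows with InMonth equal to False can then be inspected or dropped, depending on how week-of-month is defined in the source data.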

Rounding pandas column to year

Python and Pandas beginner here.
I want to round off a pandas dataframe column to years. Dates before the 1st of July must be rounded off to the current year and dates after and on the 1st of July must be rounded up to the next year.
For example:
2011-04-05 must be rounded to 2011
2011-08-09 must be rounded to 2012
2011-06-30 must be rounded to 2011
2011-07-01 must be rounded to 2012
What I've tried:
pd.series.dt.round(freq='Y')
Gives the error: ValueError: <YearEnd: month=12> is a non-fixed frequency
The dataframe column has a wide variety of dates, starting from 1945 all the way up to 2021. Therefore a simple if df.date < 2011-07-01: df['Date']+ pd.offsets.YearBegin(-1) is not working.
I also tried the dt.to_period('Y') function, but then I can't give the before and after the 1st of July argument.
Any tips on how I can solve this issue?
Suppose you have this dataframe:
dates
0 2011-04-05
1 2011-08-09
2 2011-06-30
3 2011-07-01
4 1945-06-30
5 1945-07-01
Then:
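import numpy as np
import pandas as pd

# convert to datetime:
df["dates"] = pd.to_datetime(df["dates"])
# keep the year for January-June, otherwise move to the next year:
df["year"] = np.where(
    (df["dates"].dt.month < 7), df["dates"].dt.year, df["dates"].dt.year + 1
)
print(df)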
Prints:
dates year
0 2011-04-05 2011
1 2011-08-09 2012
2 2011-06-30 2011
3 2011-07-01 2012
4 1945-06-30 1945
5 1945-07-01 1946
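An equivalent formulation, offered here only as an alternative sketch, is to shift every date forward by six calendar months and take the year of the result: dates before 1 July stay in the same year, while 1 July and later land in the next one. The dataframe below reuses the sample dates from this answer.

```python
import pandas as pd

df = pd.DataFrame({"dates": pd.to_datetime(
    ["2011-04-05", "2011-08-09", "2011-06-30", "2011-07-01", "1945-06-30", "1945-07-01"])})

# Adding six calendar months pushes 1 July and later into the next year.
df["year"] = (df["dates"] + pd.DateOffset(months=6)).dt.year
print(df)
```

This avoids the explicit condition, at the cost of being slightly less obvious to read than the np.where version.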
A more roundabout way is to convert the date values to strings, split them, and classify them in a loop, like so:
rounded_years = []
for thisdate in df["Date"]:  # assuming the column's name is "Date"
    thisdate = str(thisdate)  # convert to string
    datesplit = thisdate.split("-")  # split into year, month and the rest
    Yr = int(datesplit[0])  # convert the year back to a number
    Mth = int(datesplit[1])  # convert the month back to a number
    if Mth < 7:  # any date before July 1st
        rnd_Yr = Yr
    else:  # any date on or after July 1st
        rnd_Yr = Yr + 1
    rounded_years.append(rnd_Yr)
df["year"] = rounded_years  # store the rounded years in a new column

Python: Pandas dataframe get the year to which the week number belongs and not the year of the date

I have a csv-file: https://data.rivm.nl/covid-19/COVID-19_aantallen_gemeente_per_dag.csv
I want to use it to provide insight into the corona deaths per week.
df = pd.read_csv("covid.csv", error_bad_lines=False, sep=";")
df = df.loc[df['Deceased'] > 0]
df["Date_of_publication"] = pd.to_datetime(df["Date_of_publication"])
df["Week"] = df["Date_of_publication"].dt.isocalendar().week
df["Year"] = df["Date_of_publication"].dt.year
df = df[["Week", "Year", "Municipality_name", "Deceased"]]
df = df.groupby(by=["Week", "Year", "Municipality_name"]).agg({"Deceased" : "sum"})
df = df.sort_values(by=["Year", "Week"])
print(df)
Everything seems to be working fine except for the first 3 days of 2021. The first 3 days of 2021 are part of the last week (53) of 2020: http://week-number.net/calendar-with-week-numbers-2021.html.
When I print the dataframe this is the result:
53 2021 Winterswijk 1
Woudenberg 1
Zaanstad 1
Zeist 2
Zutphen 1
So basically what I'm looking for is a way where this line returns the year of the week number and not the year of the date:
df["Year"] = df["Date_of_publication"].dt.year
You can use dt.isocalendar().year to set up df["Year"]:
df["Year"] = df["Date_of_publication"].dt.isocalendar().year
With this, the date 2021-01-01 gets year 2020, while 2021-01-04 gets year 2021.
This mirrors how you used dt.isocalendar().week to set up df["Week"]. Since both are based on the same (year, week, day) tuple returned by dt.isocalendar(), they will always be in sync.
Demo
date_s = pd.Series(pd.date_range(start='2021-01-01', periods=5, freq='1D'))
date_s
0
0 2021-01-01
1 2021-01-02
2 2021-01-03
3 2021-01-04
4 2021-01-05
date_s.dt.isocalendar()
year week day
0 2020 53 5
1 2020 53 6
2 2020 53 7
3 2021 1 1
4 2021 1 2
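To show the effect on the grouping itself, here is a small sketch that sums deaths per ISO year and week across the 2020/2021 boundary. The data is made up (one death per day), and only the column names come from the question.

```python
import pandas as pd

# Hypothetical sample: one death reported per day across the 2020/2021 boundary.
df = pd.DataFrame({
    "Date_of_publication": pd.date_range("2020-12-28", "2021-01-05", freq="D"),
    "Deceased": 1,
})

# Take year and week from the same ISO calendar so they cannot disagree.
iso = df["Date_of_publication"].dt.isocalendar()
df["Year"] = iso["year"].astype(int)
df["Week"] = iso["week"].astype(int)

totals = df.groupby(["Year", "Week"])["Deceased"].sum()
print(totals)
```

The first three days of 2021 are counted under week 53 of 2020, so that week totals 7 and week 1 of 2021 totals 2.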
You can simply subtract the two dates and then divide the days attribute of the timedelta object by 7.
For example, to see how far into 2021 we currently are (with datetime imported as dt):
time_delta = (dt.datetime.today() - dt.datetime(2021, 1, 1))
The output is a datetime.timedelta object:
datetime.timedelta(days=75, seconds=84904, microseconds=144959)
For your problem, you'd do something like this:
weeks_since = (dt.datetime.today() - df["Date_of_publication"]).dt.days // 7
The output is the number of whole weeks elapsed since each Date_of_publication.
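A reproducible variant of the same idea uses a fixed reference date instead of today's date. The reference date and sample dates below are hypothetical. Note that this counts whole 7-day blocks since the reference and is not the same thing as the ISO week number.

```python
import pandas as pd

reference = pd.Timestamp("2021-01-01")  # hypothetical reference date
dates = pd.Series(pd.to_datetime(["2021-01-01", "2021-01-08", "2021-03-17"]))

# Whole weeks elapsed since the reference date.
weeks_elapsed = (dates - reference).dt.days // 7
print(weeks_elapsed.tolist())  # [0, 1, 10]
```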
