Convert daily data to weekly by taking average of the 7 days - python

I've created the following dataframe from the data given at the CDC link.
googledata = pd.read_csv('/content/data_table_for_daily_case_trends__the_united_states.csv', header=2)
# Inspect data
googledata.head()
id  State          Date         New Cases
0   United States  Oct 2 2022   11553
1   United States  Oct 1 2022   8024
2   United States  Sep 30 2022  46383
3   United States  Sep 29 2022  89873
4   United States  Sep 28 2022  63763
After converting the date column to datetime and applying a mask, I trimmed the data down to the last year:
googledata['Date'] = pd.to_datetime(googledata['Date'])
df = googledata
start_date = '2021-10-1'
end_date = '2022-10-1'
mask = (df['Date'] > start_date) & (df['Date'] <= end_date)
df = df.loc[mask]
But the problem is that I am getting the data in terms of days, whereas I want it in terms of weeks; i.e. converting the 365 rows into 52 rows, one per week, by taking the mean of New Cases over the 7 days of each week.
I tried implementing the method shown in this previous post: link. I don't think I am even applying it correctly, because this code never asks me to pass in my dataframe anywhere:
logic = {'New Cases' : 'mean'}
offset = pd.offsets.timedelta(days=-6)
f = pd.read_clipboard(parse_dates=['Date'], index_col=['Date'])
f.resample('W', loffset=offset).apply(logic)
But I am getting the following error:
AttributeError: module 'pandas.tseries.offsets' has no attribute 'timedelta'

If I'm understanding correctly, you want to resample:
df = df.set_index("Date")
df.index = df.index - pd.tseries.frequencies.to_offset("6D")
df = df.resample("W").agg({"New Cases": "mean"}).reset_index()
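For completeness, here is a minimal sketch of the whole pipeline (read, parse, mask, weekly mean), assuming the CSV layout from the question; it uses a plain weekly resample without the 6-day label shift:
import pandas as pd
googledata = pd.read_csv('/content/data_table_for_daily_case_trends__the_united_states.csv', header=2)
googledata['Date'] = pd.to_datetime(googledata['Date'])
# keep only the last year of data
mask = (googledata['Date'] > '2021-10-01') & (googledata['Date'] <= '2022-10-01')
df = googledata.loc[mask]
# weekly mean of the daily counts; 'W' bins Monday-Sunday weeks labelled by their Sunday
weekly = df.set_index('Date').sort_index().resample('W')['New Cases'].mean().reset_index()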

You can use strftime to convert the date to a week number before applying groupby:
df['Week'] = df['Date'].dt.strftime('%Y-%U')
df.groupby('Week')['New Cases'].mean()
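Note that %U numbers weeks with Sunday as the first day, so the 'Week' values look like '2022-39'. A small sketch of keeping the result as a dataframe, assuming the same df and column names as above:
# '%Y-%U' gives e.g. '2022-39'; %U counts weeks starting on Sunday
df['Week'] = df['Date'].dt.strftime('%Y-%U')
weekly = df.groupby('Week', as_index=False)['New Cases'].mean()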

Related

How to get a date from year, month, week of month and Day of week in Pandas?

I have a Pandas dataframe which looks like the below.
I want to create a new column that gives the exact date based on the information in all the above columns. The code should look something like this:
df['Date'] = pd.to_datetime(df['Month']+df['WeekOfMonth']+df['DayOfWeek']+df['Year'])
I was able to find a workaround for your case. You will need to define dictionaries for the months and the days of the week:
month = {"Jan":"01", "Feb":"02", "March":"03", "Apr": "04", "May":"05", "Jun":"06", "Jul":"07", "Aug":"08", "Sep":"09", "Oct":"10", "Nov":"11", "Dec":"12"}
week = {"Monday":1,"Tuesday":2,"Wednesday":3,"Thursday":4,"Friday":5,"Saturday":6,"Sunday":7}
With these dictionaries, the transformation I used with a custom dataframe was:
rows = [["Dec",5,"Wednesday", "1995"],
["Jan",3,"Wednesday","2013"]]
df = pd.DataFrame(rows, columns=["Month","Week","Weekday","Year"])
df['Date'] = (df["Year"] + "-" + df["Month"].map(month) + "-" + (df["Week"].apply(lambda x: (x - 1)*7) + df["Weekday"].map(week).apply(int) ).apply(str)).astype('datetime64[ns]')
However, you have to be careful. With some of the data you posted as an example, there were dates that exceed the valid date range. For example, for
row = ["Oct",5,"Friday","2018"]
the constructed date is 2018-10-33, which is not a valid calendar date. I recommend adding some logic to filter your data in order to avoid this kind of problem.
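One sketch of such a filter, under the same column names and dictionaries as above, is to parse the assembled string with errors='coerce' so that impossible dates become NaT and can then be dropped:
# build the same date string as above, but let invalid dates (e.g. 2018-10-33) become NaT
date_str = df["Year"] + "-" + df["Month"].map(month) + "-" + ((df["Week"] - 1) * 7 + df["Weekday"].map(week)).astype(str)
df["Date"] = pd.to_datetime(date_str, errors="coerce")
df = df.dropna(subset=["Date"])  # drop rows whose week/day combination fell outside the month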
Let's approach it in 3 steps as follows:
Get the date of month start Month_Start from Year and Month
Calculate the date offsets DateOffset relative to Month_Start from WeekOfMonth and DayOfWeek
Get the actual date Date from Month_Start and DateOffset
Here's the code:
import time
df['Month_Start'] = pd.to_datetime(df['Year'].astype(str) + df['Month'] + '01', format="%Y%b%d")
df['DateOffset'] = (df['WeekOfMonth'] - 1) * 7 + df['DayOfWeek'].map(lambda x: time.strptime(x, '%A').tm_wday) - df['Month_Start'].dt.dayofweek
df['Date'] = df['Month_Start'] + pd.to_timedelta(df['DateOffset'], unit='D')
Output:
Month WeekOfMonth DayOfWeek Year Month_Start DateOffset Date
0 Dec 5 Wednesday 1995 1995-12-01 26 1995-12-27
1 Jan 3 Wednesday 2013 2013-01-01 15 2013-01-16
2 Oct 5 Friday 2018 2018-10-01 32 2018-11-02
3 Jun 2 Saturday 1980 1980-06-01 6 1980-06-07
4 Jan 5 Monday 1976 1976-01-01 25 1976-01-26
The Date column now contains the dates derived from the information from other columns.
You can remove the working interim columns, if you like, as follows:
df = df.drop(['Month_Start', 'DateOffset'], axis=1)

Pandas - seaborn lineplot hue unexpected legend

I have a data frame of client names, dates and transactions. I'm not sure how far back my error goes, so here is all the pre-processing I do:
data = pd.read_excel('Test.xls')
## convert to datetime object
data['Date Order'] = pd.to_datetime(data['Date Order'], format = '%d.%m.%Y')
## add columns for month and year of each row for easier analysis later
data['month'] = data['Date Order'].dt.month
data['year'] = data['Date Order'].dt.year
So the data frame becomes something like:
Date Order           NameCustomers  SumOrder  month  year
2019-01-02 00:00:00  Customer 1     290       1      2019
2019-02-02 00:00:00  Customer 1     50        2      2019
...
2020-06-28 00:00:00  Customer 2     900       6      2020
...etc.
You get the idea. Next I group by both month and year and calculate the mean.
groupedMonthYearMean = data.groupby(['month', 'year'])['SumOrder'].mean().reset_index()
Output:
month year SumOrder
1 2019 233.08
1 2020 303.40
2 2019 255.34
2 2020 842.24
--------------------------
I use the resulting dataframe to make a lineplot, which tracks the SumOrder for each month, and displays it for each year.
linechart = sns.lineplot(x = 'month',
y = 'SumOrder',
hue = 'year',
data = groupedMonthYearMean).set_title('Mean Sum Order by month')
plt.show()
I have attached a screenshot of the resulting plot - overall it seems to show what I expected to create.
In my entire data, the 'year' column has only two values: 2019 and 2020. For some reason, whatever I do, they show up as 0, -1 and -2. Any ideas what is going on?
You want to change the dtype of the year column from int to category
df['year'] = df['year'].astype('category')
This is because seaborn treats an integer hue column as numeric rather than categorical, which is what produces the unexpected legend entries.
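A minimal sketch of the corrected call, assuming the same groupedMonthYearMean frame as in the question:
import seaborn as sns
import matplotlib.pyplot as plt
# treat 'year' as a discrete category so the legend shows 2019 / 2020
groupedMonthYearMean['year'] = groupedMonthYearMean['year'].astype('category')
ax = sns.lineplot(x='month', y='SumOrder', hue='year', data=groupedMonthYearMean)
ax.set_title('Mean Sum Order by month')
plt.show()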

Pandas DateTime Conversions with Math Between Column

I have a DataFrame with various columns; two of them are 'Date of birth' and 'Date of murder'.
Date of birth looks like:
Date of murder looks like:
I would like to subtract each row of the DOB column from the DOM column so I can work out how old someone was when they committed the murder. Is there any way to go about this?
Now it appears to be working (mostly). I made the following edits:
df_1['Date of birth'] = pd.to_datetime(df_1['Date of birth'], errors='coerce')
df_1['Date of Murder revised'] = pd.to_datetime(df_1['Date of Murder revised'], errors='coerce')
df_1['Date at Murder'] = (df_1['Date of birth'] - df_1['Date of Murder revised'])
print(df_1['Date at Murder'].head(10))
This gives me the following output:
0 -12395 days
1 -9941 days
2 -7651 days
3 NaT
4 -9313 days
5 -9184 days
However, I would like to get years, but when I modify the code above like so:
df_1['Date at Murder'] = (df_1['Date of birth'] - df_1['Date of Murder revised']) / 365.25
I get this output:
0 -34 days +01:32:38.932238
1 -28 days +18:47:33.388090
2 -21 days +01:15:53.593429
3 NaT
4 -26 days +12:03:26.981519
If DOM and DOB are datetime type then this should do the job:
df['Age'] = df.DOM - df.DOB
If the data type is not datetime, then convert them first:
df.DOM = pandas.to_datetime(df.DOM)
df.DOB = pandas.to_datetime(df.DOB)
df['Age'] = (df.DOM - df.DOB).dt.days/365.25
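For what it's worth, the strange output in the question comes from dividing a Timedelta by a float, which still yields a Timedelta (so -12395 days / 365.25 is displayed as roughly -34 days plus a time-of-day remainder); taking .dt.days first, as above, gives a plain number. A small sketch with made-up dates:
import pandas as pd
df = pd.DataFrame({'DOB': pd.to_datetime(['1960-05-01', '1975-11-23']),
                   'DOM': pd.to_datetime(['1994-04-10', '2003-02-14'])})
# Timedelta / float is still a Timedelta; .dt.days / 365.25 is a plain float age
df['Age'] = (df.DOM - df.DOB).dt.days / 365.25
print(df)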

Filter by monthday Pandas Dataframe

I would like to filter a pandas dataframe between months for a number of years.
I have a dataframe with data from 2000-2016, and I want to filter between October 22nd and November 15th for each of the years.
To keep this simple let's say I have 4 columns. The date index, the month index, the day index, and price.
What I have attempted so far is to concatenate the month column and the day column, i.e. October 22nd becomes 1022 and November 15th becomes 1115.
The problem arises with days before the 10th, i.e. November 1st becomes 111 rather than 1101.
So when I apply a conditional filter specifying df['monthday'] > 1015 & df['monthday'] < 1115, it entirely fails to capture the November dates from November 1st to November 9th, because 111 through 119 are all less than 1015.
I have also tried comparing the number as a string, and I successfully converted 111 to the string 1101. But then that is not comparable to int(1101).
This is a seemingly easy problem that I have had no luck solving. Any help is appreciated.
Code snippets below. Thank you,
df = web.DataReader('SPY', 'yahoo', datetime.datetime(2015, 1, 1), datetime.datetime.today())
# this adds zeroes but really doesn't help me
df['Day of Month'] = df['Day of Month'].astype(str).str.zfill(2)
df['month'] = df['month'].astype(str).str.zfill(2)
# This one converts it to str but can't compare str to int
df['monthday'] = df['month'].map(str) + df['Day of Month'].map(str)
# This one converts it to a number, but can't use 111 as November 1st because it is
# smaller than 1015, i.e. October 15th, and I want to filter between those dates
df['monthday'] = pd.to_numeric(df.monthday, errors='coerce')
# here is where I attempt my inter-month filter for each year since 2000
df = df[(df['month'] >= 10) & (df['month'] <= 11) & (df['monthday'] >= 1021) & (df['monthday'] <= 1115)]
Thank you for your support.
dfperiod = df[(df['month'] >= '10') & (df['month'] <= '11') & (df['monthday'] >= '1021') & (df['monthday'] <= '1115')]
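For reference, a sketch of the string-based filter end to end, using a small hypothetical frame with a DatetimeIndex in place of the Yahoo download; zero-padding means the lexicographic comparison matches the calendar order:
import pandas as pd
df = pd.DataFrame({'price': range(6)},
                  index=pd.to_datetime(['2015-10-20', '2015-10-22', '2015-11-01',
                                        '2015-11-09', '2015-11-15', '2015-11-20']))
# zero-padded 'MMDD' strings, so '1101' sorts after '1022' and before '1115'
df['monthday'] = df.index.strftime('%m%d')
dfperiod = df[(df['monthday'] >= '1022') & (df['monthday'] <= '1115')]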

Python: How to calculate Difference between Current Year and Year from Column?

I have a column "DateBecameRep_Year" that contains only year values in it (i.e. 1974, 1999, etc.). I want to create a new column in my dataframe that calculates the difference between the current year and the year in the "DateBecameRep_Year" field.
Below is the code I tried to use:
df_DD['DateBecameRep_Year'] = pd.to_datetime(df_DD['DateBecameRep_Year'])
df_DD['Current Year'] = datetime.now().year
df_DD['Current Year'] = pd.to_datetime(df_DD['Current Year'])
df_DD['Years_Since_BecameRep'] = df_DD['Current Year'] - df_DD['DateBecameRep_Year']
df_DD['Years_Since_BecameRep'] = df_DD['Years_Since_BecameRep'] / np.timedelta64(1, 'Y')
df_DD['Years_Since_BecameRep'].head()
This is the output I get, which looks very strange:
My hypothesis is that this has something to do with the following:
Any help is greatly appreciated!
If you just want to get the difference in years, you can simply use subtraction; there is no need to convert to datetime.
import pandas as pd
import datetime
current_year = datetime.datetime.now().year #get current year
df_DD = pd.DataFrame.from_dict({"DateBecameRep_Year":[1999,2000,2015,1898,1788,1854]})
df_DD['Current Year'] = datetime.datetime.now().year
df_DD["Years_Since_BecameRep"] = df_DD['Current Year'] - df_DD['DateBecameRep_Year'] # substract to get the year delta
df_DD will be:
DateBecameRep_Year Current Year Years_Since_BecameRep
0 1999 2017 18
1 2000 2017 17
2 2015 2017 2
3 1898 2017 119
4 1788 2017 229
5 1854 2017 163
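As an aside (this is an assumption about what went wrong in the original code): passing integer years straight to pd.to_datetime interprets them as nanoseconds since the epoch rather than as years, which would explain the strange output; if a datetime column is really needed, converting via an explicit format avoids that:
# parse the integer year as an actual year, not as nanoseconds since the epoch
became_rep = pd.to_datetime(df_DD['DateBecameRep_Year'].astype(str), format='%Y')
df_DD['Years_Since_BecameRep'] = pd.Timestamp.now().year - became_rep.dt.year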
