I would like to filter a pandas dataframe between months across a number of years.
I have a dataframe with data from 2000-2016, and I want to filter between October 22nd and November 15th for each of those years.
To keep this simple, let's say I have four columns: the date index, the month, the day, and the price.
What I have attempted so far is to concatenate the month column and the day column, i.e. October 22nd becomes 1022 and November 15th becomes 1115.
The problem arises with days before the 10th of the month, e.g. November 1st becomes 111 rather than 1101.
So when I apply a conditional filter such as df['monthday'] > 1015 & df['monthday'] < 1115, it entirely fails to capture the November dates from the 1st to the 9th, because 111 through 119 are all less than 1015.
I have also tried comparing the values as strings, and I have successfully converted 111 to the string '1101'. But a string is not comparable to the integer 1101.
This is a seemingly easy problem that I have had no luck solving. Any help is appreciated.
Code snippets below. Thank you,
df = web.DataReader('SPY', 'yahoo', datetime.datetime(2015, 1, 1),
                    datetime.datetime.today())
#this adds zeroes but really doesn't help me
df['Day of Month'] = df['Day of Month'].astype(str).str.zfill(2)
df['month'] = df['month'].astype(str).str.zfill(2)
#This one converts it to str but can't compare str to int
df['monthday'] = df['month'].map(str) + df['Day of Month'].map(str)
#This one converts it to a # but can't use 111 as November 1st because it is
#smaller than 1015 ie October 15th and I want to filter between those dates.
df['monthday'] = pd.to_numeric(df.monthday, errors='coerce')
#here is where I attempt my intermonth filter for each year since 2000
df = df[(df['month'] >= 10) & (df['month'] <= 11) & (df['monthday'] >= 1021)
& (df['monthday'] <=1115)]
Thank you for your support. With both columns zero-padded, comparing the values as strings works, since equal-length digit strings sort the same way as the numbers they represent:
dfperiod = df[(df['month'] >= '10') & (df['month'] <= '11')
              & (df['monthday'] >= '1021') & (df['monthday'] <= '1115')]
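Alternatively, a numeric month-day key can be built straight from the datetime index, sidestepping the string concatenation entirely. A minimal sketch with toy data (the column name and the 1022-1115 window are taken from the question's prose; the real df from DataReader already has the DatetimeIndex this relies on):

```python
import pandas as pd

# Toy frame standing in for the DataReader result; the real df has a
# DatetimeIndex, which is all this trick needs.
df = pd.DataFrame(
    {"price": range(6)},
    index=pd.to_datetime(
        ["2015-10-21", "2015-10-22", "2015-11-01",
         "2015-11-15", "2016-11-09", "2016-11-16"]
    ),
)

# month*100 + day gives November 1st the value 1101, not 111, so it
# compares correctly as an integer for every year at once.
monthday = df.index.month * 100 + df.index.day
window = df[(monthday >= 1022) & (monthday <= 1115)]
print(window.index.strftime("%Y-%m-%d").tolist())
# ['2015-10-22', '2015-11-01', '2015-11-15', '2016-11-09']
```

Because the key is computed on the fly, no extra columns are needed and the same mask covers every year in the frame.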
I've created the following dataframe from data given at the CDC link.
googledata = pd.read_csv('/content/data_table_for_daily_case_trends__the_united_states.csv', header=2)
# Inspect data
googledata.head()
id       State           Date         New Cases
0        United States   Oct 2 2022       11553
1        United States   Oct 1 2022        8024
2        United States   Sep 30 2022      46383
3        United States   Sep 29 2022      89873
4        United States   Sep 28 2022      63763
After converting the Date column to datetime and trimming the data to the last year with a mask operation, I got the last year's data:
googledata['Date'] = pd.to_datetime(googledata['Date'])
df = googledata
start_date = '2021-10-1'
end_date = '2022-10-1'
mask = (df['Date'] > start_date) & (df['Date'] <= end_date)
df = df.loc[mask]
But the problem is that I am getting the data in terms of days, whereas I wish to convert it to weeks; i.e. collapsing the 365 rows into 52 rows, one per week, taking the mean of New Cases over the 7 days of each week.
I tried implementing the method shown in this previous post: link. I don't think I am even applying it correctly, because this code never refers to my dataframe anywhere!
logic = {'New Cases' : 'mean'}
offset = pd.offsets.timedelta(days=-6)
f = pd.read_clipboard(parse_dates=['Date'], index_col=['Date'])
f.resample('W', loffset=offset).apply(logic)
But I am getting the following error:
AttributeError: module 'pandas.tseries.offsets' has no attribute
'timedelta'
If I'm understanding correctly, you want to resample:
df = df.set_index("Date")
df.index = df.index - pd.tseries.frequencies.to_offset("6D")
df = df.resample("W").agg({"New Cases": "mean"}).reset_index()
You can use strftime to convert the date to a week number before applying groupby:
df['Week'] = df['Date'].dt.strftime('%Y-%U')
df.groupby('Week')['New Cases'].mean()
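As a runnable sketch of this groupby approach (the daily counts here are synthetic; the column names match the question):

```python
import pandas as pd

# Synthetic daily counts standing in for the CDC data.
df = pd.DataFrame({
    "Date": pd.date_range("2022-09-05", periods=8, freq="D"),
    "New Cases": [10, 20, 30, 40, 50, 60, 70, 140],
})

# %Y-%U labels each row with its year and (Sunday-first) week number,
# so grouping on the label averages the days within each calendar week.
df["Week"] = df["Date"].dt.strftime("%Y-%U")
weekly = df.groupby("Week")["New Cases"].mean()
print(weekly.tolist())  # [35.0, 105.0]
```

Note that %U starts weeks on Sunday; use %W instead for Monday-first weeks.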
I want to produce a dataframe split by day (the day of the month) but ordered by the full date. At the moment the code below splits them into dates, e.g. 1-11, 2-11, but 30-10 and 31-10 come after all my November dates.
ResultSet2 = ResultProxy2.fetchall()
df2 = pd.DataFrame(ResultSet2)
resultsrecovery = [group[1] for group in df2.groupby(["day"])]
The current code output:
I basically want the grouped dataframes for the 30th and 31st of October to come before all the ones in November.
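For what it's worth, one way to get that ordering (a sketch, assuming the frame also carries a full datetime column, here called "date") is to group on the date itself rather than the bare day number, so the groups come out chronologically and October's 30th and 31st precede November:

```python
import pandas as pd

# Toy frame with dates straddling the October/November boundary.
df2 = pd.DataFrame({
    "date": pd.to_datetime(["2020-11-01", "2020-10-30",
                            "2020-10-31", "2020-11-02"]),
    "value": [1, 2, 3, 4],
})

# groupby sorts its keys, so grouping on the full date (not just the
# day-of-month) yields the groups in chronological order.
groups = [g for _, g in df2.groupby("date")]
print([g["date"].iloc[0].strftime("%d-%m") for g in groups])
# ['30-10', '31-10', '01-11', '02-11']
```

Grouping on the full date also avoids merging, say, October 30th with a hypothetical November 30th, which grouping on the day number alone would do.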
Attempting to filter df to only include rows with date before 2018-11-06.
Column is in datetime format. Running this code returns only rows with exact date of 2018-11-06 instead of values less than. Also, when running code with less than symbol '<', only dates later than 2018-11-06 are returned. It appears that I am doing something very incorrectly.
db4=db3[~(db3['registration_dt']>'2018-11-06')]
It seems like you are comparing the string '2018-11-06' with a datetime.
import datetime as dt
# Selects all rows where registration_dt is after 6 November 2018
df = db3[db3['registration_dt'] > dt.datetime(2018,11,6)]
# Selects all rows where registration_dt is before 6 November 2018
df = db3[db3['registration_dt'] < dt.datetime(2018,11,6)]
# The ~ symbol can be read as "not"
# This selects all rows on or before 6 November 2018
df = db3[~(db3['registration_dt'] > dt.datetime(2018,11,6))]
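A quick self-contained check of those comparisons on a toy registration_dt column (three dates around the cutoff):

```python
import datetime as dt
import pandas as pd

# Toy frame: one date before, one on, one after the cutoff.
db3 = pd.DataFrame({
    "registration_dt": pd.to_datetime(["2018-11-05", "2018-11-06", "2018-11-07"])
})

# Strictly before the cutoff -> only 2018-11-05.
before = db3[db3["registration_dt"] < dt.datetime(2018, 11, 6)]
# ~(after cutoff) -> on or before, so 2018-11-05 and 2018-11-06.
on_or_before = db3[~(db3["registration_dt"] > dt.datetime(2018, 11, 6))]
print(len(before), len(on_or_before))  # 1 2
```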
Python and Pandas beginner here.
I want to round off a pandas dataframe column to years. Dates before the 1st of July must be rounded down to the current year, and dates on or after the 1st of July must be rounded up to the next year.
For example:
2011-04-05 must be rounded to 2011
2011-08-09 must be rounded to 2012
2011-06-30 must be rounded to 2011
2011-07-01 must be rounded to 2012
What I've tried:
pd.series.dt.round(freq='Y')
Gives the error: ValueError: <YearEnd: month=12> is a non-fixed frequency
The dataframe column has a wide variety of dates, starting from 1945 all the way up to 2021. Therefore a simple if df.date < 2011-07-01: df['Date']+ pd.offsets.YearBegin(-1) is not working.
I also tried the dt.to_period('Y') function, but then I can't give the before and after the 1st of July argument.
Any tips on how I can solve this issue?
Suppose you have this dataframe:
dates
0 2011-04-05
1 2011-08-09
2 2011-06-30
3 2011-07-01
4 1945-06-30
5 1945-07-01
Then:
import numpy as np
import pandas as pd

# convert to datetime:
df["dates"] = pd.to_datetime(df["dates"])
# before July -> same year; July 1st onward -> next year
df["year"] = np.where(
    df["dates"].dt.month < 7, df["dates"].dt.year, df["dates"].dt.year + 1
)
print(df)
Prints:
dates year
0 2011-04-05 2011
1 2011-08-09 2012
2 2011-06-30 2011
3 2011-07-01 2012
4 1945-06-30 1945
5 1945-07-01 1946
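An alternative sketch of the same rounding: shift every date forward six months, and the shifted calendar year is the rounded year (July 1st lands on January 1st of the next year, while June 30th lands on December 30th of the same year):

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(
    ["2011-04-05", "2011-08-09", "2011-06-30", "2011-07-01"]
))

# Adding six months pushes July-December dates into the next calendar
# year, which is exactly the rounding rule asked for.
rounded = (dates + pd.DateOffset(months=6)).dt.year
print(rounded.tolist())  # [2011, 2012, 2011, 2012]
```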
A bit of a roundabout way is to convert the date values to strings, split them, and then classify them in a loop, like so:
for i in df["Date"]:                 # assuming the column's name is "Date"
    thisdate = str(i)                # convert this date to a string
    datesplit = thisdate.split("-")  # split into [year, month, day]
    Yr = int(datesplit[0])           # convert the year back to a number
    Mth = int(datesplit[1])          # convert the month back to a number
    if Mth < 7:                      # any date before July
        rnd_Yr = Yr
    else:                            # July 1st or later
        rnd_Yr = Yr + 1
I have a dataframe which has 100,000 rows and 24 columns, representing crime over a one-year period, October 2019 - October 2020.
I'm trying to split my df into two: one dataframe with all rows from October 1st to March 31st, and a second from April 1st to October 31st.
Would anyone be able to kindly assist with how to do this using pandas?
Assuming the column is of datetime type, you can do it like this:
import pandas as pd
split_date = pd.Timestamp(2020, 3, 31)
df_1 = df.loc[df['Date'] <= split_date]
df_2 = df.loc[df['Date'] > split_date]
If the column containing the date is not of datetime type, you should first convert it:
df['Date'] = pd.to_datetime(df['Date'])
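Putting it together as a runnable sketch on toy data (the 'Date' values here are illustrative stand-ins for the crime data):

```python
import pandas as pd

# Four sample rows, two on each side of the March 31st cutoff.
df = pd.DataFrame({"Date": ["2019-10-15", "2020-03-31",
                            "2020-04-01", "2020-10-20"]})
df["Date"] = pd.to_datetime(df["Date"])  # ensure datetime type first

split_date = pd.Timestamp(2020, 3, 31)
df_1 = df.loc[df["Date"] <= split_date]  # Oct 2019 - Mar 2020 half
df_2 = df.loc[df["Date"] > split_date]   # Apr 2020 - Oct 2020 half
print(len(df_1), len(df_2))  # 2 2
```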