Python: slice yearly data between February and June with pandas - python

I have a dataset with 10 years of data from 2000 to 2010, with the initial datetime at 2000-01-01 and the data resampled to daily. I also maintain a weekly counter, and when I slice I only want week 5 to week 21 (February 1 to May 30).
I am a little stuck on how to slice this every year: does it involve a loop, or is there a time-series function in pandas that knows how to slice a specific period of every year? Below is the code I have so far; I had a for loop that was supposed to slice weeks 5 to 21, but that didn't work.
Any suggestions on how I might get this to work?
import pandas as pd
from datetime import datetime, timedelta
initial_datetime = pd.to_datetime("2000-01-01")
# Read the file
df = pd.read_csv("D:/tseries.csv")
# Convert seconds to datetime
df["Time"] = df["Time"].map(lambda dt: initial_datetime+timedelta(seconds=dt))
df = df.set_index(pd.DatetimeIndex(df["Time"]))
resampling_period = "24H"
df = df.resample(resampling_period).mean().interpolate()
df["Week"] = df.index.map(lambda dt: dt.week)
print(df)

You can slice using loc:
df.loc[df.Week.isin(range(5,22))]
If you want separate calculations per year (e.g. the mean), you can use groupby:
subset = df.loc[df.Week.isin(range(5,22))]
subset.groupby(subset.index.year).mean()
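A self-contained sketch of this approach, with a synthetic daily series standing in for the question's tseries.csv, and isocalendar().week used in place of the deprecated dt.week:

```python
import pandas as pd
import numpy as np

# Synthetic stand-in for the resampled daily data from tseries.csv.
idx = pd.date_range("2000-01-01", "2002-12-31", freq="D")
df = pd.DataFrame({"value": np.arange(len(idx), dtype=float)}, index=idx)
df["Week"] = df.index.isocalendar().week

# Keep only ISO weeks 5-21 of every year, then average per year.
subset = df.loc[df["Week"].isin(range(5, 22))]
yearly_means = subset.groupby(subset.index.year)["value"].mean()
print(yearly_means)
```

No loop is needed: the boolean mask selects the February-May window in every year at once, and groupby on the index year handles the per-year aggregation.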

Related

How to slice a pandas DataFrame between two dates (day/month) ignoring the year?

I want to filter a pandas DataFrame with DatetimeIndex for multiple years between the 15th of April and the 16th of September. Afterwards I want to set a value on the masked rows.
I was hoping for a function similar to between_time(), but this doesn't exist.
My actual solution is a loop over the unique years.
Minimal Example
import pandas as pd
df = pd.DataFrame({'target':0}, index=pd.date_range('2020-01-01', '2022-01-01', freq='H'))
start_date = "04-15"
end_date = "09-16"
for year in df.index.year.unique():
    # normal approach
    # df[f'{year}-{start_date}':f'{year}-{end_date}'] = 1
    # similar approach, slightly faster
    df.iloc[df.index.get_loc(f'{year}-{start_date}'):df.index.get_loc(f'{year}-{end_date}')+1] = 1
Does a solution exist where I can avoid the loop and maybe improve the performance?
To get the dates between April 1st and September 30th, what about using the month?
df.loc[df.index.month.isin(range(4, 10)), 'target'] = 1
If you want to map any date/time, just ignoring the year, you can replace the year to 2000 (leap year) and use:
s = pd.to_datetime(df.index.strftime('2000-%m-%d'))
df.loc[(s >= '2000-04-15') & (s <= '2000-09-16'), 'target'] = 1
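A runnable sketch of the year-replacement trick, using a daily index for brevity (the question's index is hourly, but the comparison works the same way):

```python
import pandas as pd

# Index spanning two years, as in the question (daily here for brevity).
df = pd.DataFrame({"target": 0},
                  index=pd.date_range("2020-01-01", "2022-01-01", freq="D"))

# Map every timestamp onto the leap year 2000, ignoring the real year,
# so one vectorized comparison covers all years at once.
s = pd.to_datetime(df.index.strftime("2000-%m-%d"))
df.loc[(s >= "2000-04-15") & (s <= "2000-09-16"), "target"] = 1
```

Because 2000 is a leap year, February 29 in the real data maps to a valid date, and the loop over unique years disappears entirely.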

How to groupby a fixed list of dates

Assuming that I have a series made of daily values:
dates = pd.date_range('1/1/2004', periods=365, freq="D")
ts = pd.Series(np.random.randint(0,101, 365), index=dates)
I need to use .groupby or .reduce with a fixed schema of dates.
Using ts.resample('8d') isn't an option, because the dates must not fluctuate within the month, and the last chunk of each month needs to be flexible to handle the different month lengths (including leap years).
A list of dates can be obtained through:
g = dates[dates.day.isin([1,8,16,24])]
How can I group or reduce my data to this specific schema, so I can compute the sum, max and min, in a more elegant and efficient way than:
for i in range(0, len(g)-1):
    ts.loc[(g[i] < ts.index) & (ts.index < g[i+1])]
Well, from a calendar point of view, you can group them into calendar weeks, day of the week, months, and so on.
If that is something you would be interested in, you can do it easily with pandas, for example:
import pandas as pd
df['week'] = df['date'].dt.isocalendar().week  # create week column (Series.dt.week is deprecated)
df.groupby(['week'])['values'].sum()  # sum values by weeks
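The fixed schema itself can also be handled without a loop, by binning the dates with pd.cut and grouping on the result; a sketch using the question's setup (observed=True keeps only the bins that actually contain data):

```python
import numpy as np
import pandas as pd

# Same setup as the question.
dates = pd.date_range("2004-01-01", periods=365, freq="D")
ts = pd.Series(np.random.randint(0, 101, 365), index=dates)

# Bin edges on the 1st, 8th, 16th and 24th of each month.
g = dates[dates.day.isin([1, 8, 16, 24])]

# pd.cut labels every date with its [edge, next_edge) interval, so a
# single groupby replaces the explicit loop.
bins = pd.cut(ts.index, bins=g, right=False)
stats = ts.groupby(bins, observed=True).agg(["sum", "max", "min"])
print(stats.head())
```

Dates on or after the last edge fall outside every interval and are dropped, mirroring the open-ended behavior of the loop version.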

Is it possible to create a new DataFrame in pandas from a time series, with the daily difference?

Is there a way to create a new data frame from a time series with the daily difference?
That is, suppose that on October 5 I had 5321 counts and on October 6 I had 5331 counts. This represents a difference of 10; what I want is, for example, for my DataFrame to show 10 on October 6.
Here's my code of the raw dataframe:
import pandas as pd
from datetime import datetime, timedelta
url = 'https://raw.githubusercontent.com/mariorz/covid19-mx-time-series/master/data/covid19_confirmed_mx.csv'
df = pd.read_csv(url, index_col=0)
df = df.loc['Colima','18-03-2020':'06-10-2020']
df = pd.DataFrame(df)
df.index = pd.to_datetime(df.index, format='%d-%m-%Y')
df
This is the raw outcome (a single 'Colima' column of cumulative counts, indexed by date):
Thank you guys!
There's an inbuilt diff function just for this kind of operation:
df['Diff'] = df.Colima.diff()
Yes, you can use the shift method to access the preceding row's value to calculate the difference.
df['difference'] = df.Colima - df.Colima.shift(1)
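A tiny check with the numbers from the question confirms that diff() and the shift-based subtraction agree:

```python
import pandas as pd

# Cumulative counts as described in the question.
counts = pd.Series([5321, 5331, 5340],
                   index=pd.to_datetime(["2020-10-05", "2020-10-06", "2020-10-07"]))

# diff() and the explicit shift-based subtraction give the same result:
print(counts.diff())              # NaN on the first day, then 10.0, 9.0
print(counts - counts.shift(1))
```

The first row is NaN in both cases, since there is no preceding day to subtract.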

Problem using Groupby in Python for date time. How to make a bar plot with Month/Year?

I have the following data set:
df
OrderDate Total_Charged
7/9/2017 5
7/9/2017 5
7/20/2017 10
8/20/2017 6
9/20/2019 1
...
I want to make a bar plot with month/year on the X-axis and the total charged per month/year, i.e. summed over each month and year. First I want to group by month and year, and then make the plot. However, I get an error on the first step:
df["OrderDate"]=pd.to_datetime(df['OrderDate'])
monthly_orders=df.groupby([(df.index.year),(df.index.month)]).sum()["Total_Charged"]
Got following error:
AttributeError: 'RangeIndex' object has no attribute 'year'
What am I doing wrong (what does the error mean)? How can I fix it?
Not sure why you're grouping by the index there. If you want to group by year and month respectively, you could do the following:
df["OrderDate"] = pd.to_datetime(df['OrderDate'])
df.groupby([df.OrderDate.dt.year, df.OrderDate.dt.month])['Total_Charged'].sum().plot.bar()
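With the sample rows from the question, that groupby produces the expected month/year sums (selecting the Total_Charged column explicitly keeps the datetime column out of the sum):

```python
import pandas as pd

# The sample rows from the question.
df = pd.DataFrame({
    "OrderDate": ["7/9/2017", "7/9/2017", "7/20/2017", "8/20/2017", "9/20/2019"],
    "Total_Charged": [5, 5, 10, 6, 1],
})

df["OrderDate"] = pd.to_datetime(df["OrderDate"])
monthly = df.groupby([df.OrderDate.dt.year, df.OrderDate.dt.month])["Total_Charged"].sum()
print(monthly)
# monthly.plot.bar() would then draw the month/year totals
```

The result is a Series with a (year, month) MultiIndex, which plot.bar() labels directly on the X-axis.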
pandas.DataFrame.resample
This is a versatile option, that easily implements aggregation over various time ranges (e.g. weekly, daily, quarterly, etc)
Code:
A more expansive dataset:
This code block sets up the sample dataset.
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
# list of dates
first_date = datetime(2017, 1, 1)
last_date = datetime(2019, 9, 20)
x = 4
list_of_dates = [date for date in np.arange(first_date, last_date, timedelta(days=x)).astype(datetime)]
df = pd.DataFrame({'OrderDate': list_of_dates,
                   'Total_Charged': [np.random.randint(10) for _ in range(len(list_of_dates))]})
Using resample for Monthly Sum:
requires a datetime index
import matplotlib.pyplot as plt  # needed for plt.show()

df.OrderDate = pd.to_datetime(df.OrderDate)
df.set_index('OrderDate', inplace=True)
monthly_sums = df.resample('M').sum()
monthly_sums.plot.bar(figsize=(8, 6))
plt.show()
An example with Quarterly Avg:
this shows the versatility of resample compared to groupby
Quarterly would not be easily implemented with groupby
quarterly_avg = df.resample('Q').mean()
quarterly_avg.plot.bar(figsize=(8, 6))
plt.show()

In Python/Pandas how do I convert century-months to DateTimeIndex?

I am working with a dataset that encodes dates as the integer number of months since December 1899, so month 1 is January 1900 and month 1165 is January 1997. I would like to convert to a pandas DateTimeIndex. So far the best I've come up with is:
month0 = np.datetime64('1899-12-15')
one_month = np.timedelta64(30, 'D') + np.timedelta64(630, 'm')  # 30 days 10.5 hours (timedelta64 needs an integer)
birthdates = pandas.DatetimeIndex(month0 + one_month * resp.cmbirth)
The start date is the 15th of the month, and the timedelta is 30 days 10.5 hours, the average length of a calendar month. So the date within the month drifts by a day or two.
So this seems a little hacky and I wondered if there's a better way.
You can use built-in pandas date-time functionality.
import pandas as pd
import numpy as np
indexed_months = np.random.randint(0, 1166, size=100)  # random_integers has been removed from NumPy
month0 = pd.to_datetime('1899-12-01')
date_list = [month0 + pd.DateOffset(months=mnt) for mnt in indexed_months]
birthdates = pd.DatetimeIndex(date_list)
I've made an assumption that your resp.cmbirth object looks like an array of integers between 0 and 1165.
I'm not quite clear on why you want the bin edges of the indices to be offset from the start or end of the month. This can be done:
shifted_birthdates = birthdates.shift(15, freq='D')  # pd.datetools was removed; use a frequency string
and similarly for hours if you want. There is also useful info in the answers to this SO question and the related pandas github issue.
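If resp.cmbirth is a plain integer array, the list comprehension can also be replaced by vectorized month arithmetic on a month-precision datetime64 (a sketch; the cmbirth values here are made up):

```python
import numpy as np
import pandas as pd

cmbirth = np.array([1, 1165, 600])  # hypothetical month codes since December 1899

# Month-precision datetime64 arithmetic adds whole calendar months,
# so no average-month timedelta (and no day-of-month drift) is needed.
month0 = np.datetime64("1899-12", "M")
birthdates = pd.DatetimeIndex(month0 + cmbirth.astype("timedelta64[M]"))
print(birthdates)
```

Each result lands exactly on the first of the target month, avoiding the drift the question describes.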
