Time Series Resampling with wrong output and without Frequency - python

At the moment I am working on a time series project.
I have daily data points over a 5-year timespan. In between there are some days with 0 values, and some days are missing.
For example:
2015-01-10 343
2015-03-10 128
The 2nd of October is missing.
In order to build a good Time Series Model I want to resample the Data to Monthly:
df.individuals.resample("M").sum()
but I am getting the following output:
2015-01-31 343.000000
2015-02-28 NaN
2015-03-31 64.500000
Somehow the months are completely wrong.
The expected output would look like this:
2015-31-10 Sum of all days
2015-30-11 Sum of all days
2015-31-12 Sum of all days

Pandas is interpreting your dates as %Y-%m-%d (year-month-day), whereas your data is in year-day-month order.
You should explicitly specify the date format before resampling.
Try this:
df.index = pd.to_datetime(df.index, format="%Y-%d-%m")
>>> df.resample("M").sum()
2015-10-31 471
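Here is a small self-contained sketch of the same idea, using the two values from the question plus one made-up November row. The names raw, parsed and monthly are only for illustration. It groups by calendar month with to_period("M") instead of resample, which is an alternative I swapped in so the result does not depend on how your pandas version spells the month-end resample alias:

```python
import pandas as pd

# Dates are stored as year-day-month strings, as in the question.
# The 2015-15-11 row is a made-up extra value for illustration.
raw = pd.Series(
    [343, 128, 50],
    index=["2015-01-10", "2015-03-10", "2015-15-11"],
    name="individuals",
)

# Parse the index with the explicit year-day-month format.
parsed = raw.copy()
parsed.index = pd.to_datetime(raw.index, format="%Y-%d-%m")

# Sum per calendar month; the result is indexed by monthly periods.
monthly = parsed.groupby(parsed.index.to_period("M")).sum()
print(monthly)
```

With the format given explicitly, both October values end up in the same month (343 + 128 = 471), matching the output above.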

Related

Python Timedelta[M] adds incomplete days

I have a table that has a column Months_since_Start_fin_year and a Date column. I need to add the number of months in the first column to the date in the second column.
DateTable['Date']=DateTable['First_month']+DateTable['Months_since_Start_fin_year'].astype("timedelta64[M]")
This works OK for month 0, but month 1 already has a different time and for month 2 onwards has the wrong date.
[Image: output table where the early months have the correct date, but month 2, where I would expect June 1st, actually shows May 31st]
It must be adding incomplete months, but I'm not sure how to fix it?
I have also tried
DateTable['Date']=DateTable['First_month']+relativedelta(months=DateTable['Months_since_Start_fin_year'])
but I get a type error that says
TypeError: cannot convert the series to <class 'int'>
My Months_since_Start_fin_year is type int32 and my First_month variable is datetime64[ns]
The problem with adding months as an offset to a date is that not all months are equally long (28-31 days). So you need pd.DateOffset which handles that ambiguity for you. .astype("timedelta64[M]") on the other hand only gives you the average days per month within a year (30 days 10:29:06).
Ex:
import pandas as pd
# a synthetic example since you didn't provide a mre
df = pd.DataFrame({'start_date': 7*['2017-04-01'],
                   'month_offset': range(7)})
# make sure we have datetime dtype
df['start_date'] = pd.to_datetime(df['start_date'])
# add month offset
df['new_date'] = df.apply(lambda row: row['start_date'] +
                          pd.DateOffset(months=row['month_offset']),
                          axis=1)
which would give you e.g.
df
  start_date  month_offset   new_date
0 2017-04-01             0 2017-04-01
1 2017-04-01             1 2017-05-01
2 2017-04-01             2 2017-06-01
3 2017-04-01             3 2017-07-01
4 2017-04-01             4 2017-08-01
5 2017-04-01             5 2017-09-01
6 2017-04-01             6 2017-10-01
You can find similar examples here on SO, e.g. Add months to a date in Pandas. I only modified the answer there by using an apply to be able to take the months offset from one of the DataFrame's columns.
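To make the "handles that ambiguity" point concrete, here is a minimal sketch of what pd.DateOffset does when the target month is shorter than the start day. The starts and months series are made-up illustration data, not from the question, and the list comprehension is just an alternative to apply:

```python
import pandas as pd

# Made-up start dates and month offsets for illustration.
starts = pd.to_datetime(pd.Series(["2017-01-31", "2017-04-01", "2017-04-01"]))
months = pd.Series([1, 0, 2])

# Add a calendar-aware month offset to each date, one row at a time.
shifted = pd.Series(
    [d + pd.DateOffset(months=int(m)) for d, m in zip(starts, months)]
)
print(shifted)
# 2017-01-31 + 1 month -> 2017-02-28 (clipped to the end of February)
# 2017-04-01 + 2 months -> 2017-06-01
```

The result always lands on a real calendar date, which is what a fixed-length timedelta cannot give you.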

how can I align different-day timeseries in pandas?

I have two time series, df1
day cnt
2020-03-01 135006282
2020-03-02 145184482
2020-03-03 146361872
2020-03-04 147702306
2020-03-05 148242336
and df2:
day cnt
2017-03-01 149104078
2017-03-02 149781629
2017-03-03 151963252
2017-03-04 147384922
2017-03-05 143466746
The problem is that the sensors I'm measuring are sensitive to the day of the week, so on Sunday, for instance, they will produce less cnt. Now I need to compare the time series over 2 different years, 2017 and 2020, but to do that I have to align (March, in this case) to the matching day of the week, and plot them accordingly. How do I "shift" the data to make the series comparable?
The ISO calendar is a representation of a date as a tuple (year, week number, weekday). In pandas these are available through the dt accessor: dt.isocalendar() returns the ISO year, week and day (the older dt.weekofyear was removed in pandas 2.0), and dt.weekday gives the day of the week. So assuming that the day column actually contains Timestamps (convert it first with to_datetime if it does not), you could do:
df1['Y'] = df1.day.dt.year
df1['W'] = df1.day.dt.isocalendar().week
df1['D'] = df1.day.dt.weekday
Then you could align the dataframes on the W and D columns
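A rough sketch of that alignment, using the five rows per year from the question. The helper add_iso_keys and the column suffixes are names I made up for illustration, and I take both W and D from dt.isocalendar() here:

```python
import pandas as pd

df1 = pd.DataFrame({
    "day": pd.date_range("2020-03-01", periods=5),
    "cnt": [135006282, 145184482, 146361872, 147702306, 148242336],
})
df2 = pd.DataFrame({
    "day": pd.date_range("2017-03-01", periods=5),
    "cnt": [149104078, 149781629, 151963252, 147384922, 143466746],
})

def add_iso_keys(frame):
    """Return a copy with ISO week (W) and ISO weekday (D, Monday=1) columns."""
    iso = frame["day"].dt.isocalendar()
    return frame.assign(W=iso["week"], D=iso["day"])

# Keep only the rows that share the same ISO week and weekday in both years.
merged = add_iso_keys(df1).merge(
    add_iso_keys(df2), on=["W", "D"], suffixes=("_2020", "_2017")
)
print(merged)
```

With only five days per year there is little overlap (just the Sunday of ISO week 9), but with full months the merge lines up every shared week and weekday.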
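March 2017 started on a Wednesday.
March 2020 started on a Sunday.
So delete the first Sunday, Monday and Tuesday (March 1-3) from 2020,
and delete the last 3 days of March 2017.
This way both series start on a Wednesday and have the same length, so the days are comparable.
df1['cnt2020'] = df1['cnt']
df2['cnt2017'] = df2['cnt']
# reset the index so the two series line up by position when concatenated
df1 = df1.iloc[3:, 2].reset_index(drop=True)  # 2020: drop the first 3 days
df2 = df2.iloc[:-3, 2].reset_index(drop=True)  # 2017: drop the last 3 days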
Since you don't want to plot the date, but want the months to align, make a new dataframe with both columns and an index column. This way you will have 3 columns: index (0-27), 2017 and 2020. The index will represent the position of each day within the aligned four-week window.
new_df = pd.concat([df1,df2], axis=1)
If you also want to plot the days of the week on the x axis, check out this link to see how to get the day of the week from a date, and then change the x tick labels.
Sorry for the written step-by-step; if it all sounds confusing, I can type the whole code later for you.
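For reference, the steps above could look roughly like this as one runnable piece. The cnt values are synthetic placeholders (not the real sensor data), and the frame names df_2020, df_2017 and aligned are made up for the sketch:

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins for a full March in each year; cnt values are made up.
df_2020 = pd.DataFrame({
    "day": pd.date_range("2020-03-01", "2020-03-31"),
    "cnt": np.arange(31),
})
df_2017 = pd.DataFrame({
    "day": pd.date_range("2017-03-01", "2017-03-31"),
    "cnt": np.arange(100, 131),
})

# March 2020 starts on a Sunday and March 2017 on a Wednesday, so drop the
# first three days of 2020 and the last three of 2017. Both series then
# start on a Wednesday and cover 28 days.
s2020 = df_2020.iloc[3:].reset_index(drop=True)
s2017 = df_2017.iloc[:-3].reset_index(drop=True)

# One row per aligned position (0-27), with the weekday as a label.
aligned = pd.DataFrame({
    "weekday": s2020["day"].dt.day_name(),
    "cnt2017": s2017["cnt"],
    "cnt2020": s2020["cnt"],
})
print(aligned.head())
```

From there, aligned[["cnt2017", "cnt2020"]].plot() would draw both years against the shared 0-27 index.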

How to obtain difference of a date column in groupby

Currently my data looks like :
user_ID order_number order_start_date order_value week_day
237 135950 1594878.0 2018-01-01 534.0 Monday
235 32911 1594942.0 2018-01-01 89.0 Monday
232 208474 1594891.0 2018-01-01 85.0 Monday
231 9048 1594700.0 2018-01-01 224.0 Monday
228 134896 1594633.0 2018-01-01 449.0 Monday
What I want to achieve is to group the records by user_ID, take the minimum and maximum order_start_date for each user, and find the difference between them in days. Where I am struggling:
Groupby does not inherently support a min-max difference.
It is not possible to perform numerical operations such as mean() on a datetime series that exists as a column in a dataframe, though it is possible for an individual series.
Any help?
I feel like your description was practically the pseudocode!
output = df.groupby('user_ID')['order_start_date'].apply(lambda g: g.max()-g.min())
You can then get the difference in days as numbers (rather than timedeltas):
output = [i / pd.Timedelta(days=1) for i in output]
The output on your example data is all 0 because there is only one entry per user. This is what you expect, yes?
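If it helps, here is a small sketch of the same span calculation on made-up data with more than one order per user, so the result is not all zeros. The orders frame is hypothetical, and it uses agg with .dt.days as an alternative way to get whole days while keeping user_ID as the index:

```python
import pandas as pd

# Made-up orders: users 1 and 2 have two orders each, user 3 has one.
orders = pd.DataFrame({
    "user_ID": [1, 1, 2, 2, 3],
    "order_start_date": pd.to_datetime([
        "2018-01-01", "2018-01-15",
        "2018-02-01", "2018-02-04",
        "2018-03-01",
    ]),
})

# Earliest and latest order date per user.
bounds = orders.groupby("user_ID")["order_start_date"].agg(["min", "max"])

# Difference as whole days, still indexed by user_ID.
span_days = (bounds["max"] - bounds["min"]).dt.days
print(span_days)
# user 1 -> 14, user 2 -> 3, user 3 -> 0
```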
As for taking the mean, you just need to represent the dates as seconds since some time and then take the average. I had tried to convert all to timedeltas since an old time and then average, but this post does it better and works well with groupby. Here's a test scenario where it's all data for one user_ID and the dates go from Jan 1st to Jan 5th, 2020:
import numpy as np
import pandas as pd
df.loc[:,'user_ID'] = 1111
df['order_start_date'] = pd.date_range('01-01-2020','01-05-2020',periods=5)
df['order_start_date'] = np.array(df['order_start_date'],dtype='datetime64[s]').view('i8')
output = df.groupby('user_ID')['order_start_date'].mean().astype('datetime64[s]')
Results:
user_ID
1111 2020-01-03

Converting str to datetime makes all the values go to NaTType

I have a Pandas dataframe with dates for the last two property purchases. I have subtracted one from the other, labelled that column Sale Date Diff and saved it to a csv file. Now I am trying to convert the data back to datetime, but it's problematic.
Here's the data
Area Sale Date Diff
10 Downtown 16553 days 00:00:00.000000000
167 Downtown 67 days 00:00:00.000000000
555 Upper Sahali 2289 days 00:00:00.000000000
987 Brockluhurst 2912 days 00:00:00.000000000
1400 North Shore 4663 days 00:00:00.000000000
When I first loaded the data from csv, it had a format type 'str'.
The column has some null values, so I tried the following:
gdf['Sale Date Diff'] = pd.to_datetime(gdf['Sale Date Diff'], errors='coerce')
Which converted all my data to pandas.tslib.NaTType and it now looks like this:
0 NaT
1 NaT
2 NaT
3 NaT
4 NaT
What would be a way around this?
I would also want to format the column to only have days, is that possible?
I'm not entirely convinced you're reading your csv correctly; it looks like you are splitting things into columns that shouldn't be split up. In any case, you don't want to cast to datetime, you want to cast to timedelta:
pd.to_timedelta(df['Sale Date Diff'])
10 16553 days
167 67 days
555 2289 days
987 2912 days
1400 4663 days
Name: Sale Date Diff, dtype: timedelta64[ns]
It would be helpful in the future to remove the errors='coerce' argument from your code, so you can better understand what went wrong. With that change, here is the error you would have seen:
ValueError: ('Unknown string format:', '16553 days 00:00:00.000000000')
This was caused by you trying to cast a string representing a timedelta object, to a Timestamp.
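For the follow-up about null values and showing only days, a minimal sketch might look like this. The gdf frame is a cut-down, made-up version of the data in the question, with one missing value added, and the new column name is just for illustration:

```python
import pandas as pd

# Cut-down stand-in for the csv data; the last row has a missing value.
gdf = pd.DataFrame({
    "Area": ["Downtown", "Downtown", "North Shore"],
    "Sale Date Diff": [
        "16553 days 00:00:00.000000000",
        "67 days 00:00:00.000000000",
        None,
    ],
})

# Parse the strings as timedeltas; missing or unparseable values become NaT.
diff = pd.to_timedelta(gdf["Sale Date Diff"], errors="coerce")

# Keep only the whole-day component (NaT turns into NaN here).
gdf["Sale Date Diff (days)"] = diff.dt.days
print(gdf)
```

Here errors='coerce' is only covering the genuinely missing rows, since the rest of the strings are valid timedeltas.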

cut time spells into calendar months in pandas

I have data on spells (hospital stays), each with a start and end date, but I want to count the number of days spent in hospital for calendar months. Of course, this number can be zero for months not appearing in a spell. But I cannot just attribute the length of each spell to the starting month, as longer spells run over to the following month (or more).
Basically, it would suffice for me if I could cut spells at turn-of-month datetimes, getting from the data in the first example to the data in the second:
id  start                end
1   2011-01-01 10:00:00  2011-01-08 16:03:00
2   2011-01-28 03:45:00  2011-02-04 15:22:00
3   2011-03-02 11:04:00  2011-03-05 05:24:00
id  start                end                  month    stay
1   2011-01-01 10:00:00  2011-01-08 16:03:00  2011-01  7
2   2011-01-28 03:45:00  2011-01-31 23:59:59  2011-01  4
2   2011-02-01 00:00:00  2011-02-04 15:22:00  2011-02  4
3   2011-03-02 11:04:00  2011-03-05 05:24:00  2011-03  3
I read up on the Time Series / Date functionality of pandas, but I do not see a straightforward solution to this. How can one accomplish the slicing?
It's simpler than you think: just subtract the dates. The result is a time span. See Add column with number of days between dates in DataFrame pandas
You even get to do this for the entire frame at once:
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.subtract.html
Update, now that I understand the problem better.
Add a new column: take the spell's end date; if the start date is in a different month, then set this new date's day to 01 and the time to 00:00.
This is the cut DateTime you can use to compute the portion of the stay attributable to each month. cut - start is the first month; end - cut is the second.
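A sketch of that cutting step, generalised so that a spell can span more than two months. The helper split_at_month_starts is a name I made up, and unlike the question's example it ends each piece exactly at the first instant of the next month (half-open intervals) rather than at 23:59:59, which is an assumption on my part:

```python
import pandas as pd

spells = pd.DataFrame({
    "id": [1, 2, 3],
    "start": pd.to_datetime(
        ["2011-01-01 10:00:00", "2011-01-28 03:45:00", "2011-03-02 11:04:00"]
    ),
    "end": pd.to_datetime(
        ["2011-01-08 16:03:00", "2011-02-04 15:22:00", "2011-03-05 05:24:00"]
    ),
})

def split_at_month_starts(row):
    """Yield one (id, start, end, month) piece per calendar month of a spell."""
    piece_start = row.start
    while True:
        month = piece_start.to_period("M")
        # First instant of the following month: the cut point.
        cut = (month + 1).to_timestamp()
        if row.end <= cut:
            yield row.id, piece_start, row.end, str(month)
            break
        yield row.id, piece_start, cut, str(month)
        piece_start = cut

pieces = pd.DataFrame(
    [p for row in spells.itertuples(index=False) for p in split_at_month_starts(row)],
    columns=["id", "start", "end", "month"],
)
# Length of each piece as a timedelta; sum per month as needed.
pieces["stay"] = pieces["end"] - pieces["start"]
print(pieces)
```

From here, pieces.groupby("month")["stay"].sum() gives the total time in hospital per calendar month; how to round that to whole days is up to you.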
