Interpolate date in pandas - python

I have a data frame with a date column. There are almost 7k rows and 10 of them are NaN. So I wanted to interpolate the date. I checked out the documentation and they used .interpolate(). However, when I tried that, I was not getting the desired result.
One sample row:
0 November 1, 2019
1 May 1, 2017
2 NaN
3 December 15, 2017
4 March 9, 2018
My approach:
main_df['date_added'].interpolate(method='linear', inplace=True)
When I viewed the rows, they remain NaN.
Is there a way to fill that date? I have 10 such cases in the data frame.
Thank you in advance.

You can't apply the usual arithmetic operations, e.g. multiplication and division, to the datetime type, so you can't interpolate the dates linearly as-is. One option is to convert the dates into floats by subtracting a reference timestamp and dividing by a period:
first_date = pd.to_datetime('1900-01-01')
periods = pd.Timedelta('1s')
(df['date'].sub(first_date).div(periods)
           .interpolate(method='linear')
           .mul(periods).add(first_date)
)
Output:
0 2019-11-01
1 2017-05-01
2 2017-08-23
3 2017-12-15
4 2018-03-09
Name: date, dtype: datetime64[ns]
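Put together as a self-contained sketch (the sample values are reconstructed from the question; the string column has to be converted with to_datetime first):

```python
import pandas as pd

# sample data reconstructed from the question
df = pd.DataFrame({'date': ['November 1, 2019', 'May 1, 2017', None,
                            'December 15, 2017', 'March 9, 2018']})
df['date'] = pd.to_datetime(df['date'])  # the missing value becomes NaT

first_date = pd.to_datetime('1900-01-01')
periods = pd.Timedelta('1s')

# to float seconds since the reference date, interpolate, then back to datetime
df['date'] = (df['date'].sub(first_date).div(periods)
                        .interpolate(method='linear')
                        .mul(periods).add(first_date))
```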

Related

getting week number from date python

I have code as below. My questions:
Why is it assigning week 1 to 2014-12-29 and '2014-1-1'? Why is it not assigning week 53 to 2014-12-29?
How could I assign a week number that is continuously increasing? I want '2014-12-29' and '2015-1-1' to have week 53 and '2015-1-15' to have week 55, etc.
x = pd.DataFrame(data=['2014-1-1','2014-12-29','2015-1-1','2015-1-15'], columns=['date'])
x['week_number'] = pd.DatetimeIndex(x['date']).week
As far as why the week number is 1 for 12/29/2014 -- see the question I linked to in the comments. For the second part of your question:
January 1, 2014 was a Wednesday. We can take the minimum date of your date column, get the day number and subtract from the difference:
Solution
# x["date"] = pd.to_datetime(x["date"]) # if not already a datetime column
min_date = x["date"].min().day + 1 # + 1 because they're zero-indexed
x["weeks_from_start"] = ((x["date"].diff().dt.days.cumsum() - min_date) // 7 + 1).fillna(1).astype(int)
Output:
date weeks_from_start
0 2014-01-01 1
1 2014-12-29 52
2 2015-01-01 52
3 2015-01-15 54
Step by step
The first step is to convert the date column to the datetime type, if you haven't already:
In [3]: x.dtypes
Out[3]:
date object
dtype: object
In [4]: x["date"] = pd.to_datetime(x["date"])
In [5]: x
Out[5]:
date
0 2014-01-01
1 2014-12-29
2 2015-01-01
3 2015-01-15
In [6]: x.dtypes
Out[6]:
date datetime64[ns]
dtype: object
Next, we need to find the minimum of your date column and use its day number as the starting offset (adding 1 because the day number starts at 0):
In [7]: x["date"].min().day + 1
Out[7]: 2
Next, use the built-in .diff() function to take the differences of adjacent rows:
In [8]: x["date"].diff()
Out[8]:
0 NaT
1 362 days
2 3 days
3 14 days
Name: date, dtype: timedelta64[ns]
Note that we get NaT ("not a time") for the first entry -- that's because the first row has nothing to compare to above it.
The way to interpret these values is that row 1 is 362 days after row 0, and row 2 is 3 days after row 1, etc.
If you take the cumulative sum and subtract the starting day number, you get the number of days since the starting date (2014-01-01 here), treated as if that Wednesday were day 0 of the first week. This compensates for the fact that Wednesday fell in the middle of its week:
In [9]: x["date"].diff().dt.days.cumsum() - min_date
Out[9]:
0 NaN
1 360.0
2 363.0
3 377.0
Name: date, dtype: float64
Now when we take the floor division by 7, we'll get the correct number of weeks since the starting date:
In [10]: (x["date"].diff().dt.days.cumsum() - 2) // 7 + 1
Out[10]:
0 NaN
1 52.0
2 52.0
3 54.0
Name: date, dtype: float64
Note that we add 1 because (I assume) you're counting from 1 -- i.e., 2014-01-01 is week 1 for you, and not week 0.
Finally, the .fillna is just to take care of that NaT (which turned into a NaN when we started doing arithmetic). You use .fillna(value) to fill NaNs with value:
In [11]: ((x["date"].diff().dt.days.cumsum() - 2) // 7 + 1).fillna(1)
Out[11]:
0 1.0
1 52.0
2 52.0
3 54.0
Name: date, dtype: float64
Lastly, use .astype() to convert the column to integers instead of floats.
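All the steps combined into one runnable snippet:

```python
import pandas as pd

# the question's sample data
x = pd.DataFrame(data=['2014-1-1', '2014-12-29', '2015-1-1', '2015-1-15'],
                 columns=['date'])
x['date'] = pd.to_datetime(x['date'])

min_date = x['date'].min().day + 1  # day number of the starting date, plus 1
x['weeks_from_start'] = ((x['date'].diff().dt.days.cumsum() - min_date) // 7 + 1) \
    .fillna(1).astype(int)
```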

Python Timedelta[M] adds incomplete days

I have a table that has a column Months_since_Start_fin_year and a Date column. I need to add the number of months in the first column to the date in the second column.
DateTable['Date']=DateTable['First_month']+DateTable['Months_since_Start_fin_year'].astype("timedelta64[M]")
This works OK for month 0, but month 1 already has a different time and for month 2 onwards has the wrong date.
[Image of output table: early months have the correct date, but month 2, where I would expect June 1st, actually shows May 31st.]
It must be adding incomplete months, but I'm not sure how to fix it?
I have also tried
DateTable['Date']=DateTable['First_month']+relativedelta(months=DateTable['Months_since_Start_fin_year'])
but I get a type error that says
TypeError: cannot convert the series to <class 'int'>
My Months_since_Start_fin_year is type int32 and my First_month variable is datetime64[ns]
The problem with adding months as an offset to a date is that not all months are equally long (28-31 days). So you need pd.DateOffset which handles that ambiguity for you. .astype("timedelta64[M]") on the other hand only gives you the average days per month within a year (30 days 10:29:06).
Ex:
import pandas as pd

# a synthetic example since you didn't provide a mre
df = pd.DataFrame({'start_date': 7*['2017-04-01'],
                   'month_offset': range(7)})

# make sure we have datetime dtype
df['start_date'] = pd.to_datetime(df['start_date'])

# add month offset
df['new_date'] = df.apply(lambda row: row['start_date'] +
                                      pd.DateOffset(months=row['month_offset']),
                          axis=1)
which would give you e.g.
df
start_date month_offset new_date
0 2017-04-01 0 2017-04-01
1 2017-04-01 1 2017-05-01
2 2017-04-01 2 2017-06-01
3 2017-04-01 3 2017-07-01
4 2017-04-01 4 2017-08-01
5 2017-04-01 5 2017-09-01
6 2017-04-01 6 2017-10-01
You can find similar examples here on SO, e.g. Add months to a date in Pandas. I only modified the answer there by using an apply to be able to take the months offset from one of the DataFrame's columns.
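If your First_month values always fall on the first of a month (as in the synthetic example above), a vectorized alternative, not part of the original answer but a sketch under that assumption, is to shift through the monthly period dtype. Note that .dt.to_timestamp() snaps back to the start of the month, so this only suits month-start dates:

```python
import pandas as pd

df = pd.DataFrame({'start_date': pd.to_datetime(7 * ['2017-04-01']),
                   'month_offset': range(7)})

# adding an integer Series to a monthly period Series shifts by that many months
df['new_date'] = (df['start_date'].dt.to_period('M')
                  + df['month_offset']).dt.to_timestamp()
```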

Rolling mean of 1 year after filtering of data in pandas dataframe

I am trying to calculate the 1-year rolling mean for the pandas dataframe below. 'mean_1year' is computed from the month and year columns.
For example, the month and year of the first row are '05' and '2016', so its 'mean_1year' is the average 'price' over '2016-04' back to '2015-04': (1300+1400+1500)/3 = 1400. While calculating this average, a filter also has to be applied on the "type" column: since the "type" of the first row is "A", only rows with type == "A" between '2016-04' and '2015-04' are averaged.
type year month price mean_1year
A 2016 05 1200 1400
A 2016 04 1300
A 2016 01 1400
A 2015 12 1500
Any suggestions would be appreciated. Thanks !
First you need a datetime index in ascending order so you can apply a rolling time period calculation.
df['date'] = pd.to_datetime(df['year'].astype('str')+'-'+df['month'].astype('str'))
df = df.set_index('date')
df = df.sort_index()
Then you groupby type and apply the rolling mean.
df['mean_1year'] = df.groupby('type')['price'].rolling('365D').mean().reset_index(0,drop=True)
The result is:
type year month price mean_1year
date
2015-12-01 A 2015 12 1500 1500.0
2016-01-01 A 2016 1 1400 1450.0
2016-04-01 A 2016 4 1300 1400.0
2016-05-01 A 2016 5 1200 1350.0
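A self-contained version of these steps, with the question's sample rows (note the '365D' window includes the current row, which is why the 2016-05 value comes out 1350 rather than the 1400 the question asks for):

```python
import pandas as pd

df = pd.DataFrame({'type': ['A', 'A', 'A', 'A'],
                   'year': [2016, 2016, 2016, 2015],
                   'month': [5, 4, 1, 12],
                   'price': [1200, 1300, 1400, 1500]})

df['date'] = pd.to_datetime(df['year'].astype('str') + '-' + df['month'].astype('str'))
df = df.set_index('date').sort_index()

# per-type rolling mean over a 365-day window (current row included)
df['mean_1year'] = (df.groupby('type')['price']
                      .rolling('365D').mean()
                      .reset_index(0, drop=True))
```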
"Ordinary" rolling can't be applied, because it:
- includes the current row in the window, whereas you want to exclude it,
- expands the window into the future, whereas you want it to reach back.

So I used a different approach, based on loc with suitable date slices.
As a test DataFrame I used:
type year month price
0 A 2016 5 1200
1 A 2016 4 1300
2 A 2016 1 1400
3 A 2015 12 1500
4 B 2016 5 1200
5 B 2016 4 1300
And the code is as follows:
Compute date offsets of 12 months and 1 day:
yearOffs = pd.offsets.DateOffset(months=12)
dayOffs = pd.offsets.DateOffset(days=1)
Will be needed in loc later.
Set the index to a datetime, derived from year and
month columns:
df.set_index(pd.to_datetime(df.year.astype(str)
                            + df.month.astype(str), format='%Y%m'), inplace=True)
Define the function to compute means within the current
group:
def myMeans(grp):
    wrk = grp.sort_index()
    return wrk.apply(lambda row: wrk.loc[row.name - yearOffs
                                         : row.name - dayOffs, 'price'].mean(),
                     axis=1)
Compute the means:
means = df.groupby('type').apply(myMeans).swaplevel()
So far the result is:
type
2015-12-01 A NaN
2016-01-01 A 1500.0
2016-04-01 A 1450.0
2016-05-01 A 1400.0
2016-04-01 B NaN
2016-05-01 B 1300.0
dtype: float64
but df has a single level index, with non-unique values.
So to add means to df and drop now unnecessary index,
the last step is:
df = df.set_index('type', append=True).assign(mean_1year=means)\
.reset_index(level=1).reset_index(drop=True)
The final result is:
type year month price mean_1year
0 A 2016 5 1200 1400.0
1 A 2016 4 1300 1450.0
2 A 2016 1 1400 1500.0
3 A 2015 12 1500 NaN
4 B 2016 5 1200 1300.0
5 B 2016 4 1300 NaN
For the "earliest" rows in each group the result is NaN,
as there are no source (earlier) rows to compute the means
for them (so there is apparently something wrong in the other solution).
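The whole approach, assembled into one runnable snippet over the test DataFrame above:

```python
import pandas as pd

df = pd.DataFrame({'type': ['A', 'A', 'A', 'A', 'B', 'B'],
                   'year': [2016, 2016, 2016, 2015, 2016, 2016],
                   'month': [5, 4, 1, 12, 5, 4],
                   'price': [1200, 1300, 1400, 1500, 1200, 1300]})

# offsets used to build the one-year slice ending the day before each row
yearOffs = pd.offsets.DateOffset(months=12)
dayOffs = pd.offsets.DateOffset(days=1)

df.set_index(pd.to_datetime(df.year.astype(str) + df.month.astype(str),
                            format='%Y%m'), inplace=True)

def myMeans(grp):
    wrk = grp.sort_index()
    return wrk.apply(lambda row: wrk.loc[row.name - yearOffs
                                         : row.name - dayOffs, 'price'].mean(),
                     axis=1)

means = df.groupby('type').apply(myMeans).swaplevel()
df = (df.set_index('type', append=True).assign(mean_1year=means)
        .reset_index(level=1).reset_index(drop=True))
```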

how can I align different-day timeseries in pandas?

I have two time series, df1
day cnt
2020-03-01 135006282
2020-03-02 145184482
2020-03-03 146361872
2020-03-04 147702306
2020-03-05 148242336
and df2:
day cnt
2017-03-01 149104078
2017-03-02 149781629
2017-03-03 151963252
2017-03-04 147384922
2017-03-05 143466746
The problem is that the sensors I'm measuring are sensitive to the day of the week, so on Sunday, for instance, they will produce less cnt. Now I need to compare the time series over 2 different years, 2017 and 2020, but to do that I have to align (March, in this case) to the matching day of the week, and plot them accordingly. How do I "shift" the data to make the series comparable?
The ISO calendar is a representation of a date as a tuple (year, weeknumber, weekday). In pandas these are the dt members year, weekofyear and weekday (in recent pandas, dt.isocalendar() returns all three). So assuming that the day column actually contains Timestamps (convert it first with to_datetime if it does not), you could do:
df1['Y'] = df1.day.dt.year
df1['W'] = df1.day.dt.isocalendar().week  # dt.weekofyear in older pandas
df1['D'] = df1.day.dt.weekday
Then you could align the dataframes on the W and D columns
March 2017 started on a Wednesday, while March 2020 started on a Sunday. So delete the last 3 days of March 2017, and delete the first Sunday, Monday and Tuesday from 2020. This way you have comparable days.
df1['cnt2020'] = df1['cnt']
df2['cnt2017'] = df2['cnt']
df1 = df1.iloc[3:, 2]   # drop the first 3 days of March 2020
df2 = df2.iloc[:-3, 2]  # drop the last 3 days of March 2017
Since you don't want to plot the dates, but want the months to align, make a new dataframe with both columns and a reset index. This way you will have 3 columns: an index (0-27), 2017 and 2020; the index represents the aligned day offset.
new_df = pd.concat([df1.reset_index(drop=True), df2.reset_index(drop=True)], axis=1)
If you also want to plot the days of the week on the x axis, check out this link, to know how to get the day of the week from a date, and them change the x ticks label.
Sorry for the written step-by-step; if it all sounds confusing, I can type the whole code later for you.
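As a runnable sketch of the idea, with synthetic cnt values standing in for the real data (weekday.eq(2) means Wednesday, since Monday is 0):

```python
import pandas as pd

# synthetic full-month frames standing in for the real data
df1 = pd.DataFrame({'day': pd.date_range('2020-03-01', '2020-03-31'),
                    'cnt': range(31)})   # March 2020 starts on a Sunday
df2 = pd.DataFrame({'day': pd.date_range('2017-03-01', '2017-03-31'),
                    'cnt': range(31)})   # March 2017 starts on a Wednesday

# keep rows from each frame's first Wednesday onward
a = df1[df1['day'].dt.weekday.eq(2).cummax()].reset_index(drop=True)
b = df2[df2['day'].dt.weekday.eq(2).cummax()].reset_index(drop=True)

# trim to a common length so the weekdays line up row by row
n = min(len(a), len(b))
aligned = pd.concat([a['cnt'].iloc[:n].rename('cnt2020'),
                     b['cnt'].iloc[:n].rename('cnt2017')], axis=1)
```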

How to generate a series of dates that are the first monday from a given start date and increment by 2 weeks

I have a dataframe with a column that contains the date of the first Monday of every week between an arbitrary start date and now. I wish to generate a new column that has 2-week jumps but is the same length as the original column, so it would contain repeated values. For example, this would be the result for the month of October, where the column 'weekly' exists and 'bi-weekly' is the target:
data = {'weekly': ['2018-10-08','2018-10-15','2018-10-22','2018-10-29'],
        'bi-weekly': ['2018-10-08','2018-10-08','2018-10-22','2018-10-22']}
df = pd.DataFrame(data)
At the moment I am stuck with pd.date_range(start, end, freq='14D'), but this does not contain any repeated values, which I need in order to group by.
IIUC
df.groupby(np.arange(len(df))//2).weekly.transform('first')
Out[487]:
0 2018-10-08
1 2018-10-08
2 2018-10-22
3 2018-10-22
Name: weekly, dtype: datetime64[ns]
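Self-contained, with the question's October dates:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'weekly': pd.to_datetime(['2018-10-08', '2018-10-15',
                                             '2018-10-22', '2018-10-29'])})

# group consecutive pairs of rows (0,0,1,1,...) and broadcast each pair's first date
df['bi-weekly'] = df.groupby(np.arange(len(df)) // 2).weekly.transform('first')
```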
