Python Timedelta[M] adds incomplete days

I have a table that has a column Months_since_Start_fin_year and a Date column. I need to add the number of months in the first column to the date in the second column.
DateTable['Date']=DateTable['First_month']+DateTable['Months_since_Start_fin_year'].astype("timedelta64[M]")
This works fine for month 0, but month 1 already has a different time component, and from month 2 onwards the date itself is wrong.
[Image of the output table: the early months have the correct date, but month 2, where I would expect June 1st, shows May 31st.]
It must be adding incomplete months, but I'm not sure how to fix it.
I have also tried
DateTable['Date']=DateTable['First_month']+relativedelta(months=DateTable['Months_since_Start_fin_year'])
but I get a type error that says
TypeError: cannot convert the series to <class 'int'>
My Months_since_Start_fin_year is type int32 and my First_month variable is datetime64[ns]

The problem with adding months as an offset to a date is that not all months are equally long (28-31 days), so you need pd.DateOffset, which handles that ambiguity for you. .astype("timedelta64[M]"), on the other hand, only adds the average length of a month within a year (30 days 10:29:06), which is why your dates drift. Your relativedelta attempt fails because relativedelta expects a plain Python int for months, not a whole Series, hence the TypeError.
Ex:
import pandas as pd

# a synthetic example since you didn't provide an MRE
df = pd.DataFrame({'start_date': 7 * ['2017-04-01'],
                   'month_offset': range(7)})

# make sure we have datetime dtype
df['start_date'] = pd.to_datetime(df['start_date'])

# add the month offset row by row
df['new_date'] = df.apply(lambda row: row['start_date'] +
                                      pd.DateOffset(months=row['month_offset']),
                          axis=1)
which would give you e.g.
df
   start_date  month_offset   new_date
0  2017-04-01             0 2017-04-01
1  2017-04-01             1 2017-05-01
2  2017-04-01             2 2017-06-01
3  2017-04-01             3 2017-07-01
4  2017-04-01             4 2017-08-01
5  2017-04-01             5 2017-09-01
6  2017-04-01             6 2017-10-01
You can find similar examples here on SO, e.g. Add months to a date in Pandas. I only modified the answer there by using an apply to be able to take the months offset from one of the DataFrame's columns.
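Applied to your own columns, the same pattern would look like this (a sketch, reusing the table and column names from your question; the int() cast is my addition, since DateOffset wants a plain integer):
DateTable['Date'] = DateTable.apply(
    lambda row: row['First_month'] +
                pd.DateOffset(months=int(row['Months_since_Start_fin_year'])),
    axis=1)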

Related

Python - Converting a column with weekly data to a datetime object

I have the following date column that I would like to transform to a pandas datetime object. Is it possible to do this with weekly data? For example, 1-2018 stands for week 1 in 2018 and so on. I tried the following conversion but I get an error message: Cannot use '%W' or '%U' without day and year
import pandas as pd
df1 = pd.DataFrame(columns=["date"])
df1['date'] = ["1-2018", "1-2018", "2-2018", "2-2018", "3-2018", "4-2018", "4-2018", "4-2018"]
df1["date"] = pd.to_datetime(df1["date"], format = "%W-%Y")
As the error message says, you need to add a day of the week to the datetime format; prefixing '0' supplies Sunday via %w:
df1["date"] = pd.to_datetime('0' + df1["date"], format='%w%W-%Y')
print(df1)
Output
date
0 2018-01-07
1 2018-01-07
2 2018-01-14
3 2018-01-14
4 2018-01-21
5 2018-01-28
6 2018-01-28
7 2018-01-28

Date and Time Format Conversion in Pandas, Python

Initially, my dataframe had a Month column containing numbers representing the months.
Month
1
2
3
4
I typed df["Month"] = pd.to_datetime(df["Month"]) and I get this...
Month
1970-01-01 00:00:00.000000001
1970-01-01 00:00:00.000000002
1970-01-01 00:00:00.000000003
1970-01-01 00:00:00.000000004
I would like to retain just the dates and not the time. Any solutions?
Get the date from the column using df['Month'].dt.date.
Use format='%m' in to_datetime:
df["Month"] = pd.to_datetime(df["Month"], format='%m')
print (df)
Month
0 1900-01-01
1 1900-02-01
2 1900-03-01
3 1900-04-01
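If you also want to drop the midnight time component entirely, as the comment above suggests, you can combine the two (a sketch; the astype(str) is my addition, to make the strptime-style format robust against an integer column):
import pandas as pd

df = pd.DataFrame({'Month': [1, 2, 3, 4]})
# parse month numbers (the year defaults to 1900), then keep only the date part
df['Month'] = pd.to_datetime(df['Month'].astype(str), format='%m').dt.date
print(df)
#         Month
# 0  1900-01-01
# 1  1900-02-01
# 2  1900-03-01
# 3  1900-04-01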

Time Series Resampling with wrong output and without Frequency

At the moment I am working on a time series project.
I have daily data points over a 5-year timespan. In between there are some days with 0 values and some days are missing.
For example:
2015-01-10 343
2015-03-10 128
Day 2 of October is missing.
In order to build a good time series model I want to resample the data to monthly:
df.individuals.resample("M").sum()
but I am getting the following output:
2015-01-31 343.000000
2015-02-28 NaN
2015-03-31 64.500000
Somehow the months are completely wrong.
The expected output would look like this:
2015-31-10 Sum of all days
2015-30-11 Sum of all days
2015-31-12 Sum of all days
Pandas is interpreting your date as %Y-%m-%d.
You should explicitly specify your date format before doing the resample.
Try this:
df.index = pd.to_datetime(df.index, format="%Y-%d-%m")
>>> df.resample("M").sum()
2015-10-31 471
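A minimal reproduction of the fix (a sketch, assuming the two rows from the question live in the index as strings):
import pandas as pd

df = pd.DataFrame({'individuals': [343, 128]},
                  index=['2015-01-10', '2015-03-10'])
# the strings are year-day-month, so parse them explicitly
df.index = pd.to_datetime(df.index, format="%Y-%d-%m")
print(df.resample("M").sum())
#             individuals
# 2015-10-31          471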

How to find the median month between two dates?

I need to find the median month value between two dates in a data frame. I am simplifying the case by showing four examples.
import pandas as pd
import numpy as np
import datetime
df = pd.DataFrame([["1/31/2016", "3/1/2016"],
                   ["6/15/2016", "7/14/2016"],
                   ["7/14/2016", "8/15/2016"],
                   ["8/7/2016", "9/6/2016"]], columns=['FromDate', 'ToDate'])
df['Month'] = df.ToDate.dt.month - df.FromDate.dt.month
I am trying to append a column but I am not getting the desired result.
I need to see these values: [2,6,7,8].
You can calculate the average date explicitly by adding half the timedelta between 2 dates to the earlier date. Then just extract the month:
# convert to datetime if necessary
df[df.columns] = df[df.columns].apply(pd.to_datetime)
# calculate mean date, then extract month
df['Month'] = (df['FromDate'] + (df['ToDate'] - df['FromDate']) / 2).dt.month
print(df)
FromDate ToDate Month
0 2016-01-31 2016-03-01 2
1 2016-06-15 2016-07-14 6
2 2016-07-14 2016-08-15 7
3 2016-08-07 2016-09-06 8
You need to convert the strings to datetime before using dt.month.
This line calculates the average month number:
df['Month'] = (pd.to_datetime(df['ToDate']).dt.month +
               pd.to_datetime(df['FromDate']).dt.month) // 2
print(df)
FromDate ToDate Month
0 1/31/2016 3/1/2016 2
1 6/15/2016 7/14/2016 6
2 7/14/2016 8/15/2016 7
3 8/7/2016 9/6/2016 8
This only works with both dates in the same year.
jpp's solution is fine but will in some cases give the wrong answer: for ["1/1/2016","3/1/2016"] one would expect 2, because February is between January and March, but jpp's gives 1, corresponding to January.

Filter out unique values of a column within a set time interval

I have a dataframe with dates and tick-data like below
Date Bid
0 20160601 00:00:00.020 160.225
1 20160601 00:00:00.136 160.226
2 20160601 00:00:00.192 160.225
3 20160601 00:00:00.327 160.230
4 20160601 00:00:01.606 160.231
5 20160601 00:00:01.613 160.230
I want to filter out unique values in the 'Bid' column at set intervals
E.g: 2016-06-01 00:00:00 - 00:15:00, 2016-06-01 00:15:00 - 00:30:00...
The result will be a new dataframe (keeping the filtered values with its datetime).
Here's the code I have so far:
#Convert Date column to index with seconds as base
df['Date'] = pd.DatetimeIndex(df['Date'])
df['Date'] = df['Date'].astype('datetime64[s]')
df.set_index('Date', inplace=True)
#Create new DataFrame with filtered values
ts = pd.DataFrame(df.loc['2016-06-01'].between_time('00:00', '00:30')['Bid'].unique())
With the method above I lose the dates (datetime) of the filtered values in the process of creating a new DataFrame, plus I have to manually input each date and time interval, which is unrealistic.
Output:
0
0 160.225
1 160.226
2 160.230
3 160.231
4 160.232
5 160.228
6 160.227
Ideally I'm looking for an operation where I can set the time interval as a timedelta and have an operation done on the whole file (about 8Gb) at once, creating a new DataFrame with Date and Bid columns of the unique values within the set interval. Like this
Date Bid
0 20160601 00:00:00.020 160.225
1 20160601 00:00:00.136 160.226
2 20160601 00:00:00.327 160.230
3 20160601 00:00:01.606 160.231
...
805 20160601 00:15:00.606 159.127
P.S. I also tried the pd.rolling() and pd.resample() methods with apply (e.g. lambda x: x['Bid'].unique()), but I was never able to make it work; maybe someone better at it could attempt it.
Just to clarify: this is not a rolling calculation. You mentioned attempting to solve this using rolling, but from your clarification it seems you want to split the time series into discrete, non-overlapping 15-minute windows.
Setup
df = pd.DataFrame({
    'Date': [
        '2016-06-01 00:00:00.020', '2016-06-01 00:00:00.136',
        '2016-06-01 00:15:00.636', '2016-06-01 00:15:02.836',
    ],
    'Bid': [150, 150, 200, 200]
})
print(df)
Date Bid
0 2016-06-01 00:00:00.020 150
1 2016-06-01 00:00:00.136 150 # Should be dropped
2 2016-06-01 00:15:00.636 200
3 2016-06-01 00:15:02.836 200 # Should be dropped
First, verify that your Date column is datetime:
df.Date = pd.to_datetime(df.Date)
Now use dt.floor to round each value down to the nearest 15 minutes, and use this new column to drop_duplicates per 15-minute window while still keeping the full precision of your dates.
df.assign(flag=df.Date.dt.floor('15T')).drop_duplicates(['flag', 'Bid']).drop(columns='flag')
Date Bid
0 2016-06-01 00:00:00.020 150
2 2016-06-01 00:15:00.636 200
From my original answer, but I still believe it holds value. If you'd like to access the unique values per group, you can make use of pd.Grouper and unique; learning to leverage pd.Grouper gives you a powerful tool to have with pandas:
df.groupby(pd.Grouper(key='Date', freq='15T')).Bid.unique()
Date
2016-06-01 00:00:00 [150]
2016-06-01 00:15:00 [200]
Freq: 15T, Name: Bid, dtype: object
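If you want that per-window result back as a flat table, you can explode the arrays (a sketch; note that Date then holds the window start, not the original tick timestamps):
out = (df.groupby(pd.Grouper(key='Date', freq='15T'))
         .Bid.unique()
         .explode()
         .dropna()
         .reset_index())
print(out)
#                  Date  Bid
# 0 2016-06-01 00:00:00  150
# 1 2016-06-01 00:15:00  200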
