Python Timedelta[M] adds incomplete days

I have a table that has a column Months_since_Start_fin_year and a Date column. I need to add the number of months in the first column to the date in the second column.
DateTable['Date']=DateTable['First_month']+DateTable['Months_since_Start_fin_year'].astype("timedelta64[M]")
This works fine for month 0, but month 1 already has a different time component, and from month 2 onwards the date itself is wrong.
[Image of the output table: the early months have the correct date, but month 2, where I would expect June 1st, shows May 31st.]
It must be adding incomplete months, but I'm not sure how to fix it.
I have also tried
DateTable['Date']=DateTable['First_month']+relativedelta(months=DateTable['Months_since_Start_fin_year'])
but I get a type error that says
TypeError: cannot convert the series to <class 'int'>
My Months_since_Start_fin_year is type int32 and my First_month variable is datetime64[ns]

The problem with adding months as an offset to a date is that not all months are equally long (28-31 days), so you need pd.DateOffset, which handles that ambiguity for you. .astype("timedelta64[M]"), on the other hand, only adds the average length of a month within a year (30 days 10:29:06), which is why your dates drift. Your relativedelta attempt fails because relativedelta expects a plain Python int for months, not a whole Series, hence the TypeError.
Ex:
import pandas as pd

# a synthetic example since you didn't provide an MRE
df = pd.DataFrame({'start_date': 7 * ['2017-04-01'],
                   'month_offset': range(7)})

# make sure we have datetime dtype
df['start_date'] = pd.to_datetime(df['start_date'])

# add the month offset row by row
df['new_date'] = df.apply(lambda row: row['start_date'] +
                                      pd.DateOffset(months=row['month_offset']),
                          axis=1)
which would give you e.g.
df
   start_date  month_offset   new_date
0  2017-04-01             0 2017-04-01
1  2017-04-01             1 2017-05-01
2  2017-04-01             2 2017-06-01
3  2017-04-01             3 2017-07-01
4  2017-04-01             4 2017-08-01
5  2017-04-01             5 2017-09-01
6  2017-04-01             6 2017-10-01
You can find similar examples here on SO, e.g. Add months to a date in Pandas. I only modified the answer there by using an apply to be able to take the months offset from one of the DataFrame's columns.
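Applied to your own columns, the same pattern would look like this (a sketch, reusing the table and column names from your question; the int() cast is my addition, since DateOffset wants a plain integer):
DateTable['Date'] = DateTable.apply(
    lambda row: row['First_month'] +
                pd.DateOffset(months=int(row['Months_since_Start_fin_year'])),
    axis=1)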

Related

Python - Converting a column with weekly data to a datetime object

I have the following date column that I would like to transform to a pandas datetime object. Is it possible to do this with weekly data? For example, 1-2018 stands for week 1 in 2018 and so on. I tried the following conversion but I get an error message: Cannot use '%W' or '%U' without day and year
import pandas as pd
df1 = pd.DataFrame(columns=["date"])
df1['date'] = ["1-2018", "1-2018", "2-2018", "2-2018", "3-2018", "4-2018", "4-2018", "4-2018"]
df1["date"] = pd.to_datetime(df1["date"], format = "%W-%Y")
As the error message says, you need to add a day of the week to the datetime format; prefixing '0' supplies Sunday via %w:
df1["date"] = pd.to_datetime('0' + df1["date"], format='%w%W-%Y')
print(df1)
Output
date
0 2018-01-07
1 2018-01-07
2 2018-01-14
3 2018-01-14
4 2018-01-21
5 2018-01-28
6 2018-01-28
7 2018-01-28

Date and Time Format Conversion in Pandas, Python

Initially, my dataframe had a Month column containing numbers representing the months.
Month
1
2
3
4
I typed df["Month"] = pd.to_datetime(df["Month"]) and I get this...
Month
1970-01-01 00:00:00.000000001
1970-01-01 00:00:00.000000002
1970-01-01 00:00:00.000000003
1970-01-01 00:00:00.000000004
I would like to retain just the dates and not the time. Any solutions?
Get the date from the column using df['Month'].dt.date.
Use format='%m' in to_datetime:
df["Month"] = pd.to_datetime(df["Month"], format='%m')
print (df)
Month
0 1900-01-01
1 1900-02-01
2 1900-03-01
3 1900-04-01
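If you also want to drop the midnight time component entirely, as the comment above suggests, you can combine the two (a sketch; the astype(str) is my addition, to make the strptime-style format robust against an integer column):
import pandas as pd

df = pd.DataFrame({'Month': [1, 2, 3, 4]})
# parse month numbers (the year defaults to 1900), then keep only the date part
df['Month'] = pd.to_datetime(df['Month'].astype(str), format='%m').dt.date
print(df)
#         Month
# 0  1900-01-01
# 1  1900-02-01
# 2  1900-03-01
# 3  1900-04-01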

Time Series Resampling with wrong output and without Frequency

At the moment I am working on a time series project.
I have daily data points over a 5-year timespan. In between there are some days with 0 values and some days are missing.
For example:
2015-01-10 343
2015-03-10 128
Day 2 of October is missing.
In order to build a good time series model I want to resample the data to monthly:
df.individuals.resample("M").sum()
but I am getting the following output:
2015-01-31 343.000000
2015-02-28 NaN
2015-03-31 64.500000
Somehow the months are completely wrong.
The expected output would look like this:
2015-31-10 Sum of all days
2015-30-11 Sum of all days
2015-31-12 Sum of all days
Pandas is interpreting your date as %Y-%m-%d.
You should explicitly specify your date format before doing the resample.
Try this:
df.index = pd.to_datetime(df.index, format="%Y-%d-%m")
>>> df.resample("M").sum()
2015-10-31 471
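A minimal reproduction of the fix (a sketch, assuming the two rows from the question live in the index as strings):
import pandas as pd

df = pd.DataFrame({'individuals': [343, 128]},
                  index=['2015-01-10', '2015-03-10'])
# the strings are year-day-month, so parse them explicitly
df.index = pd.to_datetime(df.index, format="%Y-%d-%m")
print(df.resample("M").sum())
#             individuals
# 2015-10-31          471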

How to find the median month between two dates?

I need to find the median month value between two dates in a data frame. I am simplifying the case by showing four examples.
import pandas as pd
import numpy as np
import datetime
df = pd.DataFrame([["1/31/2016", "3/1/2016"],
                   ["6/15/2016", "7/14/2016"],
                   ["7/14/2016", "8/15/2016"],
                   ["8/7/2016", "9/6/2016"]], columns=['FromDate', 'ToDate'])
df['Month'] = df.ToDate.dt.month - df.FromDate.dt.month
I am trying to append a column but I am not getting the desired result.
I need to see these values: [2,6,7,8].
You can calculate the average date explicitly by adding half the timedelta between 2 dates to the earlier date. Then just extract the month:
# convert to datetime if necessary
df[df.columns] = df[df.columns].apply(pd.to_datetime)
# calculate mean date, then extract month
df['Month'] = (df['FromDate'] + (df['ToDate'] - df['FromDate']) / 2).dt.month
print(df)
FromDate ToDate Month
0 2016-01-31 2016-03-01 2
1 2016-06-15 2016-07-14 6
2 2016-07-14 2016-08-15 7
3 2016-08-07 2016-09-06 8
You need to convert the strings to datetime before using dt.month.
This line calculates the average month number:
df['Month'] = (pd.to_datetime(df['ToDate']).dt.month +
               pd.to_datetime(df['FromDate']).dt.month) // 2
print(df)
FromDate ToDate Month
0 1/31/2016 3/1/2016 2
1 6/15/2016 7/14/2016 6
2 7/14/2016 8/15/2016 7
3 8/7/2016 9/6/2016 8
This only works with both dates in the same year.
jpp's solution is fine but will in some cases give the wrong answer: for ["1/1/2016","3/1/2016"] one would expect 2, because February is between January and March, but jpp's gives 1, corresponding to January.

Filter out unique values of a column within a set time interval

I have a dataframe with dates and tick-data like below
Date Bid
0 20160601 00:00:00.020 160.225
1 20160601 00:00:00.136 160.226
2 20160601 00:00:00.192 160.225
3 20160601 00:00:00.327 160.230
4 20160601 00:00:01.606 160.231
5 20160601 00:00:01.613 160.230
I want to filter out unique values in the 'Bid' column at set intervals
E.g: 2016-06-01 00:00:00 - 00:15:00, 2016-06-01 00:15:00 - 00:30:00...
The result will be a new dataframe (keeping the filtered values with its datetime).
Here's the code I have so far:
#Convert Date column to index with seconds as base
df['Date'] = pd.DatetimeIndex(df['Date'])
df['Date'] = df['Date'].astype('datetime64[s]')
df.set_index('Date', inplace=True)
#Create new DataFrame with filtered values
ts = pd.DataFrame(df.loc['2016-06-01'].between_time('00:00', '00:30')['Bid'].unique())
With the method above I lose the dates (datetime) of the filtered values in the process of creating a new DataFrame, plus I have to manually input each date and time interval, which is unrealistic.
Output:
0
0 160.225
1 160.226
2 160.230
3 160.231
4 160.232
5 160.228
6 160.227
Ideally I'm looking for an operation where I can set the time interval as a timedelta and have an operation done on the whole file (about 8Gb) at once, creating a new DataFrame with Date and Bid columns of the unique values within the set interval. Like this
Date Bid
0 20160601 00:00:00.020 160.225
1 20160601 00:00:00.136 160.226
2 20160601 00:00:00.327 160.230
3 20160601 00:00:01.606 160.231
...
805 20160601 00:15:00.606 159.127
P.S. I also tried the pd.rolling() and pd.resample() methods with apply (e.g. lambda x: x['Bid'].unique()), but I was never able to make it work; maybe someone better at it could attempt it.
Just to clarify: this is not a rolling calculation. You mentioned attempting to solve this using rolling, but from your clarification it seems you want to split the time series into discrete, non-overlapping 15-minute windows.
Setup
df = pd.DataFrame({
    'Date': [
        '2016-06-01 00:00:00.020', '2016-06-01 00:00:00.136',
        '2016-06-01 00:15:00.636', '2016-06-01 00:15:02.836',
    ],
    'Bid': [150, 150, 200, 200]
})
print(df)
Date Bid
0 2016-06-01 00:00:00.020 150
1 2016-06-01 00:00:00.136 150 # Should be dropped
2 2016-06-01 00:15:00.636 200
3 2016-06-01 00:15:02.836 200 # Should be dropped
First, verify that your Date column is datetime:
df.Date = pd.to_datetime(df.Date)
Now use dt.floor to round each value down to the nearest 15 minutes, and use this new column to drop_duplicates per 15-minute window while still keeping the full precision of your dates.
df.assign(flag=df.Date.dt.floor('15T')).drop_duplicates(['flag', 'Bid']).drop(columns='flag')
Date Bid
0 2016-06-01 00:00:00.020 150
2 2016-06-01 00:15:00.636 200
From my original answer, but I still believe it holds value. If you'd like to access the unique values per group, you can make use of pd.Grouper and unique; learning to leverage pd.Grouper gives you a powerful tool to have with pandas:
df.groupby(pd.Grouper(key='Date', freq='15T')).Bid.unique()
Date
2016-06-01 00:00:00 [150]
2016-06-01 00:15:00 [200]
Freq: 15T, Name: Bid, dtype: object
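If you want that per-window result back as a flat table, you can explode the arrays (a sketch; note that Date then holds the window start, not the original tick timestamps):
out = (df.groupby(pd.Grouper(key='Date', freq='15T'))
         .Bid.unique()
         .explode()
         .dropna()
         .reset_index())
print(out)
#                  Date  Bid
# 0 2016-06-01 00:00:00  150
# 1 2016-06-01 00:15:00  200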
