This question already has an answer here:
Filling missing date values with the least possible date in Pandas dataframe
(1 answer)
Closed 3 years ago.
LastLogin LastPurchased
2018-08-21 00:28:04.081677 0001-01-01 00:00:00
2018-08-21 00:28:58.209522 2018-08-20 00:28:58.209522
I need the difference in days, (df['LastLogin'] - df['LastPurchased']).dt.days, but there are some '0001-01-01 00:00:00' values in LastPurchased. Anything I try to do to change 0001-01-01 to a date within the pandas bounds results in Out of bounds nanosecond timestamp: 1-01-01 00:00:00. Is there any other way?
LastLogin LastPurchased Days
2018-08-21 00:28:04.081677 1999-01-01 00:00:00 6935
2018-08-21 00:28:58.209522 2018-08-20 00:28:58.209522 1
Pandas requires that the year in your datetime be greater than 1677 and less than 2262 (approximately; see pandas/_libs/tslibs/src/datetime/np_datetime.c for the exact bounds). Otherwise, the given date is outside the range that can be represented by nanosecond-resolution 64-bit integers:
>>> pd.Timestamp.max
Timestamp('2262-04-11 23:47:16.854775807')
>>> pd.Timestamp.min
Timestamp('1677-09-21 00:12:43.145225')
>>> pd.Timestamp.max - pd.Timestamp.min
datetime.timedelta(213503, 84873, 709550)
It's up to you how you want to handle this. Consider what you are ultimately trying to indicate by subtracting the date 0001-01-01. I'll assume that means a user has logged in but never purchased.
To coerce LastPurchased to either a valid Pandas Timestamp or pd.NaT ("not a time"), you can use
df['LastPurchased'] = pd.to_datetime(df['LastPurchased'], errors='coerce')
This will give NaT as the difference in those spots:
>>> pd.Timestamp(2018, 1, 1) - pd.NaT
NaT
You can use this NaT as a "sentinel" and check for it with pd.isna().
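As a rough sketch of the whole flow, the snippet below computes the day difference and leaves the "never purchased" rows missing. It assumes that '0001-01-01 00:00:00' is purely a placeholder, and it maps that string to a missing value before parsing so the result does not depend on how a particular pandas version treats out-of-range years. The sample values are simplified from the question, and the names placeholder, raw and never_purchased are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "LastLogin": ["2018-08-21 00:28:04", "2018-08-21 00:28:58"],
    "LastPurchased": ["0001-01-01 00:00:00", "2018-08-20 00:28:58"],
})

# Assumption: this exact string is a placeholder meaning "never purchased".
placeholder = "0001-01-01 00:00:00"
raw = df["LastPurchased"].mask(df["LastPurchased"] == placeholder)

df["LastLogin"] = pd.to_datetime(df["LastLogin"])
# errors="coerce" turns anything else that cannot be parsed into NaT as well.
df["LastPurchased"] = pd.to_datetime(raw, errors="coerce")

# Rows with NaT produce a missing value instead of raising.
df["Days"] = (df["LastLogin"] - df["LastPurchased"]).dt.days
never_purchased = df["LastPurchased"].isna()
```

The never_purchased mask can then be used to fill or filter those rows, depending on what a missing purchase should mean in your analysis.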
At the moment I am working on a time series project.
I have daily data points over a 5-year timespan. In between there are some days with 0 values, and some days are missing.
For example:
2015-01-10 343
2015-03-10 128
Day 2 of october is missing.
In order to build a good Time Series Model I want to resample the Data to Monthly:
df.individuals.resample("M").sum()
but I am getting the following output:
2015-01-31 343.000000
2015-02-28 NaN
2015-03-31 64.500000
Somehow the months are completely wrong.
The expected output would look like this:
2015-31-10 Sum of all days
2015-30-11 Sum of all days
2015-31-12 Sum of all days
Pandas is interpreting your date as %Y-%m-%d.
You should explicitly specify your date format before doing the resample.
Try this:
df.index = pd.to_datetime(df.index, format="%Y-%d-%m")
>>> df.resample("M").sum()
2015-10-31 471
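Here is a small self-contained sketch of the same idea, using the two sample rows from the question. It groups by calendar month with to_period("M") instead of resample, because the month-end alias accepted by resample differs between pandas versions ("M" in older releases, "ME" in newer ones). The variable names are illustrative only.

```python
import pandas as pd

# Index strings are year-day-month, as in the question.
s = pd.Series([343, 128], index=["2015-01-10", "2015-03-10"], name="individuals")

# Parse with an explicit format so the day and month are not swapped.
s.index = pd.to_datetime(s.index, format="%Y-%d-%m")

# Sum all days that fall in the same calendar month.
monthly = s.groupby(s.index.to_period("M")).sum()
```

Both rows now land in October 2015, so the monthly series has a single entry holding their sum.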
I have a table that has a column Months_since_Start_fin_year and a Date column. I need to add the number of months in the first column to the date in the second column.
DateTable['Date']=DateTable['First_month']+DateTable['Months_since_Start_fin_year'].astype("timedelta64[M]")
This works OK for month 0, but month 1 already has a different time and for month 2 onwards has the wrong date.
[Image: output table where early months have the correct date but month 2, where I would expect June 1st, actually shows May 31st]
It must be adding incomplete months, but I'm not sure how to fix it?
I have also tried
DateTable['Date']=DateTable['First_month']+relativedelta(months=DateTable['Months_since_Start_fin_year'])
but I get a type error that says
TypeError: cannot convert the series to <class 'int'>
My Months_since_Start_fin_year is type int32 and my First_month variable is datetime64[ns]
The problem with adding months as an offset to a date is that not all months are equally long (28-31 days). So you need pd.DateOffset which handles that ambiguity for you. .astype("timedelta64[M]") on the other hand only gives you the average days per month within a year (30 days 10:29:06).
Ex:
import pandas as pd
# a synthetic example since you didn't provide a mre
df = pd.DataFrame({'start_date': 7*['2017-04-01'],
                   'month_offset': range(7)})
# make sure we have datetime dtype
df['start_date'] = pd.to_datetime(df['start_date'])
# add month offset
df['new_date'] = df.apply(lambda row: row['start_date'] +
                                      pd.DateOffset(months=row['month_offset']),
                          axis=1)
which would give you e.g.
df
start_date month_offset new_date
0 2017-04-01 0 2017-04-01
1 2017-04-01 1 2017-05-01
2 2017-04-01 2 2017-06-01
3 2017-04-01 3 2017-07-01
4 2017-04-01 4 2017-08-01
5 2017-04-01 5 2017-09-01
6 2017-04-01 6 2017-10-01
You can find similar examples here on SO, e.g. Add months to a date in Pandas. I only modified the answer there by using an apply to be able to take the months offset from one of the DataFrame's columns.
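If every start date is the first of a month, as in the question, a vectorized alternative to apply is to do the arithmetic at month precision in NumPy. This is only a sketch under that assumption: truncating to datetime64[M] discards the day of the month, so it is not a general replacement for pd.DateOffset.

```python
import pandas as pd

df = pd.DataFrame({"start_date": pd.to_datetime(7 * ["2017-04-01"]),
                   "month_offset": range(7)})

# Truncate to month precision (assumes every date is the 1st of its month).
months = df["start_date"].to_numpy().astype("datetime64[M]")
# Interpret the integer offsets as whole calendar months.
offsets = df["month_offset"].to_numpy().astype("timedelta64[M]")

# Month-precision addition moves by calendar months, not by a fixed number of days.
df["new_date"] = (months + offsets).astype("datetime64[ns]")
```

Unlike .astype("timedelta64[M]") added directly to nanosecond timestamps, both operands here are month-resolution, so no averaging over month lengths takes place.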
Want to calculate the difference of days between pandas date series -
0 2013-02-16
1 2013-01-29
2 2013-02-21
3 2013-02-22
4 2013-03-01
5 2013-03-14
6 2013-03-18
7 2013-03-21
and today's date.
I tried but could not come up with logical solution.
Please help me with the code. Actually I am new to python and there are lot of syntactical errors happening while applying any function.
You could do something like
# generate time data
data = pd.to_datetime(pd.Series(["2018-09-1", "2019-01-25", "2018-10-10"]))
pd.to_datetime("now") > data
returns:
0 False
1 True
2 False
you could then use that to select the data
data[pd.to_datetime("now") > data]
Hope it helps.
Edit: I misread it but you can easily alter this example to calculate the difference:
data - pd.to_datetime("now")
returns:
0 -122 days +13:10:37.489823
1 24 days 13:10:37.489823
2 -83 days +13:10:37.489823
dtype: timedelta64[ns]
You can try the following:
>>> from datetime import datetime
>>> df
col1
0 2013-02-16
1 2013-01-29
2 2013-02-21
3 2013-02-22
4 2013-03-01
5 2013-03-14
6 2013-03-18
7 2013-03-21
Make sure to convert the column to datetime with to_datetime:
>>> df['col1'] = pd.to_datetime(df['col1'], infer_datetime_format=True)
Set the current datetime in order to compute the difference:
>>> curr_time = pd.to_datetime("now")
Now get the difference as follows:
>>> df['col1'] - curr_time
0 -2145 days +07:48:48.736939
1 -2163 days +07:48:48.736939
2 -2140 days +07:48:48.736939
3 -2139 days +07:48:48.736939
4 -2132 days +07:48:48.736939
5 -2119 days +07:48:48.736939
6 -2115 days +07:48:48.736939
7 -2112 days +07:48:48.736939
Name: col1, dtype: timedelta64[ns]
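If you only need whole days as integers, the .dt.days accessor on the resulting timedelta series may be enough. The sketch below uses a fixed stand-in for "today" so that the numbers are reproducible; in practice you would likely use pd.Timestamp.now().normalize() instead. The names dates, today and days_ago are illustrative.

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(["2013-02-16", "2013-01-29", "2013-03-21"]))

# Stand-in for the current date (assumption, for reproducibility).
today = pd.Timestamp("2019-01-01")

# Subtracting gives timedelta64 values; .dt.days extracts the whole-day part.
days_ago = (today - dates).dt.days
```

Subtracting in the order today - dates gives positive day counts for past dates.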
With NumPy you can solve it as described in difference-two-dates-days-weeks-months-years-pandas-python-2. Bottom line:
df['diff_days'] = df['First dates column'] - df['Second Date column']
# for days use 'D' for weeks use 'W', for month use 'M' and for years use 'Y'
df['diff_days']=df['diff_days']/np.timedelta64(1,'D')
print(df)
If you want days as int rather than float, use
df['diff_days']=df['diff_days']//np.timedelta64(1,'D')
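To make the difference between the two operators concrete, here is a small sketch on made-up timedeltas: true division keeps the fractional part of a day, while floor division drops it.

```python
import numpy as np
import pandas as pd

# Hypothetical differences: one and a half days, and exactly three days.
delta = pd.Series(pd.to_timedelta(["1 days 12:00:00", "3 days 00:00:00"]))

one_day = np.timedelta64(1, "D")
as_float = delta / one_day   # fractional days
as_int = delta // one_day    # whole days only
```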
From the pandas docs under Converting To Timestamps you will find:
"Converting to Timestamps To convert a Series or list-like object of date-like objects e.g. strings, epochs, or a mixture, you can use the to_datetime function"
I haven't used pandas before, but this suggests your pandas date series (a list-like object) is iterable and that each element can be converted to a standard datetime. Note that to_datetime is a module-level function (pd.to_datetime); on an individual pandas Timestamp the corresponding method is to_pydatetime().
Assuming that holds, the following function takes such a series and returns a list of timedeltas (objects representing the difference between two datetimes).
from datetime import datetime
def convert(pandas_series):
    # get the current date and time
    now = datetime.now()
    # Use a list comprehension and each Timestamp's to_pydatetime method to calculate timedeltas.
    return [now - pandas_element.to_pydatetime() for pandas_element in pandas_series]
# assuming 'some_pandas_series' is a pandas Series of datetime64 values
list_of_timedeltas = convert(some_pandas_series)
This question already has answers here:
Add months to a date in Pandas
(4 answers)
How can I get pandas Timestamp offset by certain amount of months?
(1 answer)
Closed 4 years ago.
I have multiple df, and they are indexed with timestamps for consecutive months. For example:
1996-01-01 01:00:00
1996-02-01 01:00:00
1996-03-01 01:00:00
1996-04-01 01:00:00
1996-05-01 01:00:00
1996-06-01 01:00:00
I'm trying to create a function where I can add an arbitrary number of rows onto the df, continuing on from whatever the last month happens to be. I tried to solve this by using:
df.iloc[-1].name + pd.Timedelta(1, unit='M')
in a for loop, but this only seems to add 30 days, instead of changing the month value +1. Is there a more reliable way to fetch a pd.Timestamp and add 1 month?
Thank you
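One possible approach, in line with the linked duplicates, is to use pd.DateOffset(months=...), which moves by calendar months rather than by a fixed number of days. The sketch below is only an illustration: the helper next_months, the value column and the sample frame are made up, and new rows are simply left empty.

```python
import pandas as pd

df = pd.DataFrame(
    {"value": [1.0, 2.0, 3.0]},
    index=pd.to_datetime(["1996-01-01 01:00:00",
                          "1996-02-01 01:00:00",
                          "1996-03-01 01:00:00"]),
)

def next_months(last, n):
    """Return the n monthly timestamps that follow `last` (hypothetical helper)."""
    return [last + pd.DateOffset(months=i) for i in range(1, n + 1)]

# Continue from whatever the last month in the index happens to be.
new_index = pd.DatetimeIndex(next_months(df.index[-1], 2))

# Append the new timestamps; the added rows hold NaN until filled.
extended = df.reindex(df.index.append(new_index))
```

Because each offset is computed from the last existing timestamp, the time of day (01:00:00 here) is carried over unchanged.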
When I compute the difference between two pandas datetime64 dates I get np.timedelta64. Is there any easy way to convert these deltas into representations like hours, days, weeks, etc.?
I could not find any methods in np.timedelta64 that facilitate conversions between different units, but it looks like Pandas seems to know how to convert these units to days when printing timedeltas (e.g. I get: 29 days, 23:20:00 in the string representation dataframes). Any way to access this functionality ?
Update:
Strangely, none of the following work:
> df['column_with_times'].days
> df['column_with_times'].apply(lambda x: x.days)
but this one does:
df['column_with_times'][0].days
pandas stores timedelta data in the numpy timedelta64[ns] type, but also provides the Timedelta type to wrap this for more convenience (eg to provide such accessors of the days, hours, .. and other components).
In [41]: timedelta_col = pd.Series(pd.timedelta_range('1 days', periods=5, freq='2 h'))
In [42]: timedelta_col
Out[42]:
0 1 days 00:00:00
1 1 days 02:00:00
2 1 days 04:00:00
3 1 days 06:00:00
4 1 days 08:00:00
dtype: timedelta64[ns]
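To access the different components of a full column (series), you have to use the .dt accessor. In current pandas the accessor exposes days, seconds, microseconds, nanoseconds and components directly; there is no .dt.hours, so the hours are read from .dt.components. For example:
In [43]: timedelta_col.dt.components.hours
Out[43]:
0 0
1 2
2 4
3 6
4 8
Name: hours, dtype: int64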
With timedelta_col.dt.components you get a frame with all the different components (days to nanoseconds) as different columns.
When accessing one value of the column above, this gives back a Timedelta, and on this you don't need to use the dt accessor, but you can access directly the components:
In [45]: timedelta_col[0]
Out[45]: Timedelta('1 days 00:00:00')
In [46]: timedelta_col[0].days
Out[46]: 1
So the .dt accessor provides access to the attributes of the Timedelta scalar, but on the full column. That is the reason you see that df['column_with_times'][0].days works but df['column_with_times'].days does not.
The reason that df['column_with_times'].apply(lambda x: x.days) does not work here is that, in the pandas version used in the question, apply is given the raw timedelta64 values (and not the pandas Timedelta type), and these don't have such attributes.
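As a compact illustration of the accessor, the sketch below pulls out whole days, the hours component, and the total duration in hours from a small made-up timedelta series. Note the difference between a component (the hours left over after whole days) and a total (the whole duration expressed in hours).

```python
import pandas as pd

td = pd.Series(pd.to_timedelta(["1 days 02:00:00", "29 days 23:20:00"]))

days = td.dt.days                            # whole days
hours_component = td.dt.components["hours"]  # hours left over after whole days
total_hours = td.dt.total_seconds() / 3600   # entire duration in hours
```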