Create a boolean dataframe based on the difference between two datetimes - python

I have a pandas dataframe called "gaps" that looks like this:
Index Gap in days
0 2 days 00:00:00
1 8 days 00:00:00
2 4 days 00:00:00
3 15 days 00:00:00
...
201 21 days 00:00:00
The column has already been converted to the standard timedelta format. I want to create a simple boolean Series that is True if the gap in days is more than 7 days, and False otherwise.
My initial attempt was simply:
morethan7days = gaps > 7
For which I get the error:
TypeError: invalid type comparison
Anybody know what I'm doing wrong and how to fix it?

Never mind, I got the answer through trial and error:
import datetime
morethan7days = gaps > datetime.timedelta(days=7)

You can convert the timedeltas to days with Series.dt.days and then compare against an integer:
gaps = df['Gap in days']
morethan7days = gaps.dt.days > 7
print(morethan7days)
0 False
1 True
2 False
3 True
4 True
Name: Gap in days, dtype: bool
Another solution is to compare with pandas.Timedelta:
gaps = df['Gap in days']
morethan7days = gaps > pd.Timedelta(7, unit='d')
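Either mask can then be used for counting or filtering; a small usage sketch, reusing the column name from the question:
morethan7days = df['Gap in days'] > pd.Timedelta(days=7)
print(morethan7days.sum())    # how many gaps are longer than 7 days
print(df[morethan7days])      # only the rows whose gap exceeds 7 days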

Related

replacing/re-assign pandas value with new value

I want to re-assign/replace old values with new ones in my current date column:
20000123
19850123
19880112
19951201
19850123
20190821
20000512
19850111
19670133
19850123
As you can see there is a value of 19670133 (YYYYMMDD), which is not a valid date since no month has 33 days. So I wanted to re-assign it to the end of the month. I tried converting it to the end of the month, and that works.
But when I try to replace the old values with the new ones, it becomes a problem.
What I've tried to do is this:
for x in df_tmp_customer['date']:
    try:
        df_tmp_customer['date'] = df_tmp_customer.apply(pd.to_datetime(x), axis=1)
    except Exception:
        df_tmp_customer['date'] = df_tmp_customer.apply(pd.to_datetime(x[0:6] + "01") + pd.offsets.MonthEnd(n=0), axis=1)
This is the part that converts it to the end of the month:
pd.to_datetime(x[0:6] + "01") + pd.offsets.MonthEnd(n=0)
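As a quick worked example of that expression (not part of the original post): MonthEnd(n=0) rolls a date forward to the last day of its month, so the invalid day 33 becomes 01 and is then pushed to January 31:
import pandas as pd

x = "19670133"                                      # the invalid date from the sample data
fixed = pd.to_datetime(x[0:6] + "01") + pd.offsets.MonthEnd(n=0)
print(fixed)                                        # 1967-01-31 00:00:00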
Probably not efficient on a large dataset, but it can be done using pendulum.parse():
import pendulum

def parse_dates(x: str) -> pendulum.Date:
    # decrement the value one day at a time until it parses as a valid date
    i = 0
    while True:
        try:
            return pendulum.parse(str(int(x) - i)).date()
        except ValueError:
            i += 1

df["date"] = df["date"].apply(lambda x: parse_dates(x))
print(df)
date
0 2000-01-23
1 1985-01-23
2 1988-01-12
3 1995-12-01
4 1985-01-23
5 2019-08-21
6 2000-05-12
7 1985-01-11
8 1967-01-31
9 1985-01-23
For a vectorial solution, you can use:
# try to convert to YYYYMMDD
date1 = pd.to_datetime(df['date'], format='%Y%m%d', errors='coerce')
# get rows for which conversion failed
m = date1.isna()
# try to get end of month
date2 = pd.to_datetime(df.loc[m, 'date'].str[:6], format='%Y%m', errors='coerce').add(pd.offsets.MonthEnd())
# Combine both
df['date2'] = date1.fillna(date2)
NB. This assumes df['date'] is of string dtype. If it is of integer dtype instead, use df.loc[m, 'date'].floordiv(100) in place of df.loc[m, 'date'].str[:6] (a sketch follows the output below).
Output:
date date2
0 20000123 2000-01-23
1 19850123 1985-01-23
2 19880112 1988-01-12
3 19951201 1995-12-01
4 19850123 1985-01-23
5 20190821 2019-08-21
6 20000512 2000-05-12
7 19850111 1985-01-11
8 19670133 1967-01-31 # invalid replaced by end of month
9 19850123 1985-01-23
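Building on the NB above, a minimal sketch of the integer-dtype variant (assuming df['date'] holds int64 values such as 19670133; the .astype(str) keeps the %Y%m parse unambiguous):
date1 = pd.to_datetime(df['date'], format='%Y%m%d', errors='coerce')
m = date1.isna()
# drop the day with integer division, parse year+month, then jump to the month end
date2 = pd.to_datetime(df.loc[m, 'date'].floordiv(100).astype(str),
                       format='%Y%m', errors='coerce').add(pd.offsets.MonthEnd())
df['date2'] = date1.fillna(date2)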

Getting index value in pandas multiindex based on condition

I have created a dataframe with the code below.
The objective is to find the weekly low and to get the dates at which the weekly low took place.
To do this:
import pandas as pd
from pandas_datareader import data as web
import pandas_datareader
import datetime

start = datetime.datetime(2021, 1, 1)
end = datetime.datetime.today()
df = web.DataReader('GOOG', 'yahoo', start, end)
df2 = web.DataReader('GOOG', 'yahoo', start, end)
df['Date1'] = df.index
df['month'] = df.index.month
df['week'] = df.index.week
df['day'] = df.index.day
df.set_index('week', append=True, inplace=True)
df.set_index('day', append=True, inplace=True)
To get the weekly low:
df['Low'].groupby(['week']).min().tail(50)
I am trying to find out the date on which a weekly low occurred, such as 1735.420044.
If I try to do this:
df['Low'].isin([1735.420044])
I get :
Date week day
2020-12-31 53 31 False
2021-01-04 1 4 False
2021-01-05 1 5 False
2021-01-06 1 6 False
2021-01-07 1 7 False
...
2021-08-02 31 2 False
2021-08-03 31 3 False
2021-08-04 31 4 False
2021-08-05 31 5 False
2021-08-06 31 6 False
Name: Low, Length: 151, dtype: bool
How can I get the actual dates for the low?
To get the weekly lows, you could simply access the index.
res = df['Low'].groupby(['week']).min()
res is the series of lowest prices with the date in the index. You can access the raw numpy array that represents the index with res.index.values. This will include week and day levels as well.
To get just the dates as a series, this should work:
dates = res.index.get_level_values("Date").to_series()
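Alternatively, if you want the date of each weekly low directly, a small sketch (assuming the MultiIndex built in the question) is to use idxmin, which returns the full index label of the minimum within each group:
# each label is a (Date, week, day) tuple pointing at the week's lowest 'Low'
weekly_low_labels = df['Low'].groupby(level='week').idxmin()
weekly_low_dates = [label[0] for label in weekly_low_labels]   # 'Date' is the first index level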
PS: Clarification from the comments
df['Low'].isin([1735.420044]).any() # returns False
The above doesn't work for you (it should return True if there's a match) because when you say .isin([<bunch of floats>]), you are essentially comparing floats for equality. That doesn't work reliably because floating-point comparisons can never be guaranteed to be exact; they always have to be done within a range of tolerance (this is not Python-specific, it is true for all languages). Sometimes it might seem to work in Python, but that is entirely coincidental and is a result of underlying memory optimisations. Have a look at this thread to gain some (Python-specific) insight into this.
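If you do need to locate a specific float value in the column, a tolerance-based comparison is the safer pattern; a sketch using numpy.isclose:
import numpy as np

# tolerance-based lookup instead of exact float equality
matches = df['Low'][np.isclose(df['Low'], 1735.420044)]
print(matches)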

searching pandas for value greater than a number

I have the following data:
toggle_day Diff
Date
2000-01-04 True NaT
2000-01-11 True 7 days
2000-01-24 True 13 days
2000-01-28 True 4 days
2000-02-09 True 12 days
... ... ...
2019-08-14 True 2 days
2019-08-23 True 9 days
2019-10-01 True 39 days
2019-10-02 True 1 days
2019-10-08 True 6 days
677 rows × 2 columns
I want to see the dates when Diff is greater than 20 days. To do this I have done something like this:
df1[df1.diff > 20 days]
This is wrong, I think, because I need to express the days as a datetime type. I tried df1[df1.diff > datetime.datetime(20)] but that does not work either:
TypeError: function missing required argument 'month' (pos 2)
How can I search Diff for days greater than a number?
The first idea is to compare with timedeltas:
df[df['Diff'] > pd.Timedelta(1, 'd')]
Or you can convert the timedeltas to days with Series.dt.days and compare against a number:
df[df['Diff'].dt.days > 1]
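Applied to the 20-day threshold from the question, a small sketch that also reads the dates off the index:
over_20 = df[df['Diff'] > pd.Timedelta(days=20)]
print(over_20.index)          # the DatetimeIndex of the dates whose gap exceeds 20 days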

calculate date difference between today's date and pandas date series

I want to calculate the difference in days between this pandas date series -
0 2013-02-16
1 2013-01-29
2 2013-02-21
3 2013-02-22
4 2013-03-01
5 2013-03-14
6 2013-03-18
7 2013-03-21
and today's date.
I tried but could not come up with a logical solution.
Please help me with the code. I am new to Python, and I keep running into syntax errors when applying any function.
You could do something like
# generate time data
data = pd.to_datetime(pd.Series(["2018-09-1", "2019-01-25", "2018-10-10"]))
pd.to_datetime("now") > data
returns:
0 False
1 True
2 False
You could then use that to select the data:
data[pd.to_datetime("now") > data]
Hope it helps.
Edit: I misread it but you can easily alter this example to calculate the difference:
data - pd.to_datetime("now")
returns:
0 -122 days +13:10:37.489823
1 24 days 13:10:37.489823
2 -83 days +13:10:37.489823
dtype: timedelta64[ns]
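If you only want the difference as a whole number of days, .dt.days can be chained on (a small sketch continuing the example above):
(data - pd.to_datetime("now")).dt.days   # integer day component (floor), e.g. -122, 24, -83 here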
You can try as follows:
>>> from datetime import datetime
>>> df
col1
0 2013-02-16
1 2013-01-29
2 2013-02-21
3 2013-02-22
4 2013-03-01
5 2013-03-14
6 2013-03-18
7 2013-03-21
Make sure to convert the column to datetime with to_datetime:
>>> df['col1'] = pd.to_datetime(df['col1'], infer_datetime_format=True)
Set the current datetime in order to then get the difference:
>>> curr_time = pd.to_datetime("now")
Now get the difference as follows:
>>> df['col1'] - curr_time
0 -2145 days +07:48:48.736939
1 -2163 days +07:48:48.736939
2 -2140 days +07:48:48.736939
3 -2139 days +07:48:48.736939
4 -2132 days +07:48:48.736939
5 -2119 days +07:48:48.736939
6 -2115 days +07:48:48.736939
7 -2112 days +07:48:48.736939
Name: col1, dtype: timedelta64[ns]
With numpy you can solve it as in difference-two-dates-days-weeks-months-years-pandas-python-2. The bottom line:
df['diff_days'] = df['First dates column'] - df['Second Date column']
# for days use 'D' for weeks use 'W', for month use 'M' and for years use 'Y'
df['diff_days']=df['diff_days']/np.timedelta64(1,'D')
print(df)
If you want days as an int and not as a float, use:
df['diff_days']=df['diff_days']//np.timedelta64(1,'D')
From the pandas docs under Converting To Timestamps you will find:
"Converting to Timestamps To convert a Series or list-like object of date-like objects e.g. strings, epochs, or a mixture, you can use the to_datetime function"
I haven't used pandas before but this suggests your pandas date series (a list-like object) is iterable and each element of this series is an instance of a class which has a to_datetime function.
Assuming my assumptions are correct, the following function would take such a series and return a list of timedeltas (each a timedelta object representing the difference between two datetime objects).
from datetime import datetime

def convert(pandas_series):
    # get the current date
    now = datetime.now()
    # Use a list comprehension and the pandas to_datetime method to calculate timedeltas.
    return [now - element.to_datetime() for element in pandas_series]

# assuming 'some_pandas_series' is a list-like pandas series object
list_of_timedeltas = convert(some_pandas_series)

Convert integer series to timedelta in pandas

I have a data frame in pandas which includes the number of days since an event occurred. I want to create a new column that calculates the date of the event by subtracting the number of days from the current date. Every time I attempt to apply pd.offsets.Day or pd.Timedelta I get an error stating that Series are an unsupported type. This also occurs when I use apply. When I use map I receive a runtime error saying "maximum recursion depth exceeded while calling a Python object".
For example, assume my data frame looked like this:
index days_since_event
0 5
1 7
2 3
3 6
4 0
I want to create a new column with the date of the event, so my expected outcome (using today's date of 12/29/2015) is:
index days_since_event event_date
0 5 2015-12-24
1 7 2015-12-22
2 3 2015-12-26
3 6 2015-12-23
4 0 2015-12-29
I have attempted multiple ways to do this, but have received errors for each.
One method I tried was:
now = pd.datetime.date(pd.datetime.now())
df['event_date'] = now - df.days_since_event.apply(pd.offsets.Day)
With this I received an error saying that Series are an unsupported type.
I tried the above with .map instead of .apply, and received the error that "maximum recursion depth exceeded while calling a Python object".
I also attempted to convert the days into timedelta, such as:
df.days_since_event = (dt.timedelta(days = df.days_since_event)).apply
This also received an error referencing the series being an unsupported type.
First, to convert the column with integers to a timedelta, you can use to_timedelta:
In [60]: pd.to_timedelta(df['days_since_event'], unit='D')
Out[60]:
0 5 days
1 7 days
2 3 days
3 6 days
4 0 days
Name: days_since_event, dtype: timedelta64[ns]
Then you can create a new column with the current date and subtract those timedeltas:
In [62]: df['event_date'] = pd.Timestamp('2015-12-29')
In [63]: df['event_date'] = df['event_date'] - pd.to_timedelta(df['days_since_event'], unit='D')
In [64]: df['event_date']
Out[64]:
0 2015-12-24
1 2015-12-22
2 2015-12-26
3 2015-12-23
4 2015-12-29
dtype: datetime64[ns]
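The two steps can also be collapsed into one line; a sketch using pd.Timestamp.now().normalize() so the result carries today's date without a time-of-day component:
df['event_date'] = pd.Timestamp.now().normalize() - pd.to_timedelta(df['days_since_event'], unit='D')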
Just to follow up on joris' response, you can convert an int or a float into whatever time unit you want with pd.to_timedelta(x, unit=...), changing only the entry for unit=:
# Years, Months, Days:
pd.to_timedelta(3.5, unit='Y') # returns '1095 days 17:27:36'
pd.to_timedelta(3.5, unit='M') # returns '91 days 07:27:18'
pd.to_timedelta(3.5, unit='D') # returns '3 days 12:00:00'
# Hours, Minutes, Seconds:
pd.to_timedelta(3.5, unit='h') # returns '0 days 03:30:00'
pd.to_timedelta(3.5, unit='m') # returns '0 days 00:03:30'
pd.to_timedelta(3.5, unit='s') # returns '0 days 00:00:03.50'
Note that mathematical operations are legal once correctly formatted:
pd.to_timedelta(3.5, unit='h') - pd.to_timedelta(3.25, unit='h') # returns '0 days 00:15:00'
