searching pandas for value greater than a number - python

I have the following data:
toggle_day Diff
Date
2000-01-04 True NaT
2000-01-11 True 7 days
2000-01-24 True 13 days
2000-01-28 True 4 days
2000-02-09 True 12 days
... ... ...
2019-08-14 True 2 days
2019-08-23 True 9 days
2019-10-01 True 39 days
2019-10-02 True 1 days
2019-10-08 True 6 days
677 rows × 2 columns
I want to see the dates when Diff is greater than 20 days. To do this I tried something like this:
df1[df1.diff > 20 days] This is wrong, I think because I need to express the days as a datetime type. I tried df1[df1.diff > datetime.datetime(20)] but that does not work either:
TypeError: function missing required argument 'month' (pos 2)
How can I search Diff for days greater than a number?

The first idea is to compare against a Timedelta:
df[df['Diff'] > pd.Timedelta(1, 'd')]
Or you can convert the timedeltas to whole days with Series.dt.days and compare against a number:
df[df['Diff'].dt.days > 1]
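As a minimal sketch of both approaches with the 20-day threshold from the question: the frame below is a hypothetical sample rebuilt from the last five dates shown in the question, and the names by_timedelta and by_days are made up for illustration.

```python
import pandas as pd

# Hypothetical sample: the last five dates shown in the question.
dates = pd.to_datetime(
    ["2019-08-14", "2019-08-23", "2019-10-01", "2019-10-02", "2019-10-08"]
)
# Diff is the gap to the previous date; the first row has no predecessor (NaT).
df = pd.DataFrame({"Diff": dates.to_series().diff()}, index=dates)

# Approach 1: compare the timedelta column against a Timedelta.
by_timedelta = df[df["Diff"] > pd.Timedelta(days=20)]

# Approach 2: convert to whole days and compare against a plain number.
by_days = df[df["Diff"].dt.days > 20]

print(by_timedelta)
```

Both filters keep only the 2019-10-01 row here. They can differ on fractional days: a gap of 20 days 12 hours passes the Timedelta comparison, but its dt.days value is 20, so it fails the integer comparison.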

Related

Convert string hours to minute pd.eval

I want to convert all rows of my DataFrame that contains hours and minutes into minutes only.
I have a dataframe that looks like this:
df=
time
0 8h30
1 14h07
2 08h30
3 7h50
4 8h0
5 8h15
6 6h15
I'm using the following method to convert:
df['time'] = pd.eval(
df['time'].replace(['h'], ['*60+'], regex=True))
Output
SyntaxError: invalid syntax
I think the error comes from the format of the hour; maybe pd.eval can't accept 08h30 or 8h0. How can I solve this problem?
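The guess about the hour format can be checked without pandas. This is only a sketch, assuming pd.eval follows Python's own rules for integer literals: it evaluates the rewritten strings with the built-in eval, and try_eval is a hypothetical helper.

```python
# Evaluate the expressions produced by replacing "h" with "*60+".
# Built-in eval is used only for demonstration on fixed, trusted strings.
def try_eval(expr):
    try:
        return eval(expr)
    except SyntaxError:
        return "SyntaxError"

samples = ["8h30", "8h0", "14h07", "08h30"]
results = {s: try_eval(s.replace("h", "*60+")) for s in samples}
print(results)
```

In Python 3 the strings with a leading zero (08 and 07) are rejected as integer literals, while 8h0 evaluates fine, so the leading zeros look like the likely cause rather than the 8h0 form.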
Pandas can already handle such strings if the units are included in the string. While 14h07 can't be parsed (why assume 07 is minutes?), 14h07m can be converted to a Timedelta:
>>> pd.to_timedelta("14h07m")
Timedelta('0 days 14:07:00')
Given this dataframe :
d1 = pd.DataFrame(['8h30m', '14h07m', '08h30m', '8h0m'],
columns=['time'])
You can convert the time series into a Timedelta series with pd.to_timedelta :
>>> d1['tm'] = pd.to_timedelta(d1['time'])
>>> d1
time tm
0 8h30m 0 days 08:30:00
1 14h07m 0 days 14:07:00
2 08h30m 0 days 08:30:00
3 8h0m 0 days 08:00:00
To handle the missing minutes unit in the original data, just append m:
d1['tm'] = pd.to_timedelta(d1['time'] + 'm')
Once you have a Timedelta you can calculate hours and minutes.
The components of the values can be retrieved with Timedelta.components
>>> d1.tm.dt.components.hours
0 8
1 14
2 8
3 8
Name: hours, dtype: int64
To get the total duration in minutes (or seconds, or hours), change the frequency to the matching unit, here minutes:
>>> d1.tm.astype('timedelta64[m]')
0 510.0
1 847.0
2 510.0
3 480.0
Name: tm, dtype: float64
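Some pandas versions may not accept a minute resolution in astype, so here is a hedged alternative sketch that gets the same totals by division. It starts from the original strings without the m unit; the column names total_minutes and total_minutes_alt are made up for illustration.

```python
import pandas as pd

# Strings in the question's original format, without the minutes unit.
d1 = pd.DataFrame({"time": ["8h30", "14h07", "08h30", "8h0"]})

# Append "m" so pandas can parse hours and minutes unambiguously.
tm = pd.to_timedelta(d1["time"] + "m")

# Dividing by a one-minute Timedelta yields the total minutes as floats.
d1["total_minutes"] = tm / pd.Timedelta(minutes=1)

# Equivalent route through total seconds.
d1["total_minutes_alt"] = tm.dt.total_seconds() / 60

print(d1)
```

Dividing by pd.Timedelta(hours=1) or pd.Timedelta(seconds=1) gives hours or seconds in the same way.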
Bringing all the operations together :
>>> d1['tm'] = pd.to_timedelta(d1['time'])
>>> d2 = (d1.assign(h=d1.tm.dt.components.hours,
... m=d1.tm.dt.components.minutes,
... total_minutes=d1.tm.astype('timedelta64[m]')))
>>>
>>> d2
time tm h m total_minutes
0 8h30m 0 days 08:30:00 8 30 510.0
1 14h07m 0 days 14:07:00 14 7 847.0
2 08h30m 0 days 08:30:00 8 30 510.0
3 8h0m 0 days 08:00:00 8 0 480.0
To avoid having to trim leading zeros, an alternative approach:
df[['h', 'm']] = df['time'].str.split('h', expand=True).astype(int)
df['total_min'] = df['h']*60 + df['m']
Result:
time h m total_min
0 8h30 8 30 510
1 14h07 14 7 847
2 08h30 8 30 510
3 7h50 7 50 470
4 8h0 8 0 480
5 8h15 8 15 495
6 6h15 6 15 375
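A variant of the same idea, sketched with a regular expression instead of split. The pattern and the group names h and m are assumptions for illustration; the pattern also tolerates stray whitespace around the value.

```python
import pandas as pd

df = pd.DataFrame(
    {"time": ["8h30", "14h07", "08h30", "7h50", "8h0 ", "8h15", "6h15"]}
)

# Named groups become the column names of the extracted frame.
parts = df["time"].str.extract(r"^\s*(?P<h>\d+)h(?P<m>\d+)\s*$")

# Rows that do not match come back as NaN, so they can be inspected
# before the cast to int (which would fail on NaN).
parts = parts.astype(int)

df["total_min"] = parts["h"] * 60 + parts["m"]
print(df)
```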
As another alternative that uses much the same elements as above, you could do:
df = pd.DataFrame(data=["8h30", "14h07", "08h30", "7h50", "8h0 ", "8h15", "6h15"],
columns=["time"])
First split your column on the "h":
hm = df["time"].str.split("h", expand=True)
Then combine the columns again, but zero-pad the hours and minutes in order to make valid time strings:
df2 = hm[0].str.strip().str.zfill(2) + hm[1].str.strip().str.zfill(2)
Then convert the string column with proper values to a datetime column:
df3 = pd.to_datetime(df2, format="%H%M")
Finally, calculate the number of minutes by subtracting a zero time (to get timedeltas) and dividing by a one-minute timedelta:
zerotime= pd.to_datetime("0000", format="%H%M")
df['minutes'] = (df3 - zerotime) / pd.Timedelta(minutes=1)
The results look like:
time minutes
0 8h30 510.0
1 14h07 847.0
2 08h30 510.0
3 7h50 470.0
4 8h0 480.0
5 8h15 495.0
6 6h15 375.0

Problem with Pandas dataframe.mean() and slice error

I have 2 DataFrames:
df1:
control_1 Average_return
2019-09-07 True 0
2019-06-06 True 0
2019-02-19 True 0
2019-01-17 True 0
2018-12-20 True 0
2018-11-27 True 0
2018-10-12 True 0
... ... ...
df2:
return
2010-01-01 NaN
2010-04-01 0.010920
2010-05-01 -0.004404
2010-06-01 -0.025209
2010-07-01 -0.023280
... ...
My aim is to compute the mean of the returns in df2 between two dates whenever control_1 is True.
for i in range(len(df1)): # I go through my data df1
    if df1["control_1"].iloc[i]: # I check if control_1 is true
        date_1 = df1.index[i] - pd.to_timedelta(6, unit='d') # I remove 6 days from my date
        date_2 = df1.index[i] - pd.to_timedelta(244, unit='d') # I remove 244 days from my date
        df1["Average_return"].iloc[i] = df2["return"].iloc[date_1:date_2].mean()
        # I want to take the mean of the return between my date-6 days and my date-244 days
Unfortunately it gives me this error:
TypeError: cannot do slice indexing on <class 'pandas.core.indexes.base.Index'> with these indexers [2019-05-31] of <class 'datetime.date'>
Is someone able to help me :) ?
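One possible way around the slice error, as a sketch only: iloc expects integer positions, so date bounds need label-based slicing with loc on a sorted DatetimeIndex. The data below is hypothetical and merely shaped like df1 and df2; it assumes both indexes have been converted with pd.to_datetime (the error message suggests the real index holds datetime.date objects).

```python
import pandas as pd

# Hypothetical data shaped like the question's df2, sorted by date.
df2 = pd.DataFrame(
    {"return": [0.01, -0.02, 0.03, 0.04]},
    index=pd.to_datetime(["2019-01-10", "2019-03-15", "2019-08-20", "2019-09-05"]),
).sort_index()

# Hypothetical data shaped like the question's df1.
df1 = pd.DataFrame(
    {"control_1": [True, False], "Average_return": [0.0, 0.0]},
    index=pd.to_datetime(["2019-09-07", "2019-06-06"]),
)

# Only visit the dates where control_1 is True.
for date in df1.index[df1["control_1"]]:
    start = date - pd.Timedelta(days=244)  # earlier bound
    end = date - pd.Timedelta(days=6)      # later bound
    # loc slices by label and includes both bounds.
    df1.loc[date, "Average_return"] = df2.loc[start:end, "return"].mean()

print(df1)
```

Note that the slice runs from the earlier date to the later one. The loop in the question slices from date-6 to date-244, which would select nothing on an ascending index.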

Create a boolean dataframe based on the difference between two datetimes

I have a pandas dataframe called "gaps" that looks like this:
Index Gap in days
0 2 days 00:00:00
1 8 days 00:00:00
2 4 days 00:00:00
3 15 days 00:00:00
...
201 21 days 00:00:00
The date format has been converted to the standard datetime format. I want to create a simple boolean dataframe that returns TRUE if the gap in days is more than 7 days, and FALSE otherwise.
My initial attempt was the simple:
morethan7days = gaps > 7
For which I get the error:
TypeError: invalid type comparison
Anybody know what I'm doing wrong and how to fix it?
Never mind, I got the answer through trial and error:
morethan7days = gaps > datetime.timedelta(days=7)
You can convert timedeltas to days by Series.dt.days and then compare by integer:
gaps = df['Gap in days']
morethan7days = gaps.dt.days > 7
print (morethan7days)
0 False
1 True
2 False
3 True
4 True
Name: Gap in days, dtype: bool
Another solution is to compare with pandas.Timedelta:
gaps = df['Gap in days']
morethan7days = gaps > pd.Timedelta(7, unit='d')
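Since the question asks for a boolean dataframe rather than a Series, here is a small sketch of that last step. The gap values are hypothetical (five rows chosen for illustration), and the result column name morethan7days is borrowed from the question.

```python
import pandas as pd

# Hypothetical gaps: 2, 8, 4, 15 and 21 days.
gaps = pd.DataFrame({"Gap in days": pd.to_timedelta([2, 8, 4, 15, 21], unit="D")})

# Comparing with a Timedelta works; comparing with the bare integer 7 does not,
# because pandas cannot order a timedelta against a unitless number.
mask = gaps["Gap in days"] > pd.Timedelta(days=7)

# Wrap the boolean Series in a one-column DataFrame.
morethan7days = mask.to_frame("morethan7days")
print(morethan7days)
```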

Pandas and DateTime TypeError: cannot compare a TimedeltaIndex with type float

I have a pandas Series of time differences that looks like:
print(delta_t)
1 0 days 00:00:59
3 0 days 00:04:22
6 0 days 00:00:56
8 0 days 00:01:21
19 0 days 00:01:09
22 0 days 00:00:36
...
(the full DataFrame had a bunch of NaNs which I dropped).
I'd like to know which delta_t's are less than 1 day, 1 hour, 1 minute,
so I tried:
delta_t_lt1day = delta_t[np.where(delta_t < 30.)]
but then got a:
TypeError: cannot compare a TimedeltaIndex with type float
Little help?!?!
Assuming your Series is in timedelta format, you can skip the np.where, and index using something like this, where you compare your actual values to other timedeltas, using the appropriate units:
delta_t_lt1day = delta_t[delta_t < pd.Timedelta(1,'D')]
delta_t_lt1hour = delta_t[delta_t < pd.Timedelta(1,'h')]
delta_t_lt1minute = delta_t[delta_t < pd.Timedelta(1,'m')]
You'll get the following series:
>>> delta_t_lt1day
0
1 00:00:59
3 00:04:22
6 00:00:56
8 00:01:21
19 00:01:09
22 00:00:36
Name: 1, dtype: timedelta64[ns]
>>> delta_t_lt1hour
0
1 00:00:59
3 00:04:22
6 00:00:56
8 00:01:21
19 00:01:09
22 00:00:36
Name: 1, dtype: timedelta64[ns]
>>> delta_t_lt1minute
0
1 00:00:59
6 00:00:56
22 00:00:36
Name: 1, dtype: timedelta64[ns]
You could use the Timedelta class:
import pandas as pd
deltas = pd.to_timedelta(['0 days 00:00:59',
'0 days 00:04:22',
'0 days 00:00:56',
'0 days 00:01:21',
'0 days 00:01:09',
'0 days 00:31:09',
'0 days 00:00:36'])
for e in deltas[deltas < pd.Timedelta(value=30, unit='m')]:
    print(e)
Output
0 days 00:00:59
0 days 00:04:22
0 days 00:00:56
0 days 00:01:21
0 days 00:01:09
0 days 00:00:36
Note that this filters out '0 days 00:31:09' as expected. The expression pd.Timedelta(value=30, unit='m') creates a time delta of 30 minutes.
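If a plain numeric threshold is preferred, as in the original attempt with a float, one option is to convert the timedeltas to seconds first. This is a sketch under the assumption that the threshold is meant in seconds; the name under_one_minute is made up for illustration.

```python
import pandas as pd

# The six values shown in the question, with their original index labels.
delta_t = pd.to_timedelta(
    pd.Series(
        ["0 days 00:00:59", "0 days 00:04:22", "0 days 00:00:56",
         "0 days 00:01:21", "0 days 00:01:09", "0 days 00:00:36"],
        index=[1, 3, 6, 8, 19, 22],
    )
)

# total_seconds() returns floats, so a float threshold compares cleanly.
under_one_minute = delta_t[delta_t.dt.total_seconds() < 60.0]
print(under_one_minute)
```

This keeps the rows labelled 1, 6 and 22, the same rows as the one-minute Timedelta comparison above.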

Signed time deltas to signed seconds in Pandas

Consider the following series:
> df['time_delta']
0 -1 days +00:08:11
1 0 days 01:57:46
2 0 days 00:58:34
3 0 days 17:30:23
4 -1 days +21:44:34
5 -2 days +22:01:56
6 0 days 03:18:57
7 -1 days +21:44:48
8 -1 days +00:07:56
Name: time_delta, dtype: timedelta64[ns]
Say I want to convert this timedelta to total signed seconds. That is:
Positive deltas should convert to positive seconds
Negative deltas should convert to negative seconds
For example:
0 days 00:01:05 => 65 seconds
-1 days +23:58:30 => -90 seconds
How can I get this conversion?
Failed attempt
When I try the usual:
temp_df['seconds'] = temp_df['time_delta'].dt.seconds
I end up with:
time_delta seconds
0 -1 days +00:08:11 491
1 0 days 01:57:46 7066
2 0 days 00:58:34 3514
3 0 days 17:30:23 63023
4 -1 days +21:44:34 78274
5 -2 days +22:01:56 79316
6 0 days 03:18:57 11937
7 -1 days +21:44:48 78288
8 -1 days +00:07:56 476
which correctly handled positive deltas, but not the negative ones. To see this, note that the negative deltas seem to ignore the sign of the day offset. That is, in the example above:
-1 days +21:44:48 should convert to -8112 seconds, not 78288 seconds (wrong sign and value).
If it's a Timedelta object, just divide it by Timedelta(seconds=1):
>>> pd.Timedelta(days=-1) / pd.Timedelta(seconds=1)
-86400.0
Just call abs prior to dt.total_seconds to get the absolute values:
df['seconds'] = df['time_delta'].abs().dt.total_seconds()
Example:
In [63]:
df = pd.DataFrame({'date_time':pd.date_range(dt.datetime(2015,1,1,12,10,32), dt.datetime(2015,1,3,12,12,30,2))})
df['time_delta'] = df['date_time'] - dt.datetime(2015,1,2)
df
Out[63]:
date_time time_delta
0 2015-01-01 12:10:32 -1 days +12:10:32
1 2015-01-02 12:10:32 0 days 12:10:32
2 2015-01-03 12:10:32 1 days 12:10:32
In [64]:
df['time_delta'].abs().dt.total_seconds()
Out[64]:
0 42568
1 43832
2 130232
Name: time_delta, dtype: float64
To add the signs back you can compare against pd.Timedelta(0):
In [78]:
df['seconds'] = df['time_delta'].abs().dt.total_seconds()
df.loc[df['time_delta'] < pd.Timedelta(0), 'seconds'] = -df['seconds']
df
Out[78]:
date_time time_delta seconds
0 2015-01-01 12:10:32 -1 days +12:10:32 -42568
1 2015-01-02 12:10:32 0 days 12:10:32 43832
2 2015-01-03 12:10:32 1 days 12:10:32 130232
However, I think @Ami Tamory's answer is superior
EDIT
After sleeping on this I realised that this is just dt.total_seconds:
In [137]:
df['time_delta'].dt.total_seconds()
Out[137]:
0 -42568
1 43832
2 130232
Name: time_delta, dtype: float64
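To make the difference between the two accessors concrete, here is a short sketch using the two example values from the question (65 seconds and -90 seconds). The variable names are made up for illustration.

```python
import pandas as pd

# 0 days 00:01:05 and -1 days +23:58:30 from the question's examples.
time_delta = pd.Series([pd.Timedelta(seconds=65), pd.Timedelta(seconds=-90)])

# dt.seconds is only the seconds component, always between 0 and 86399.
components_seconds = time_delta.dt.seconds.tolist()

# dt.total_seconds() is the whole signed duration expressed in seconds.
signed_seconds = time_delta.dt.total_seconds().tolist()

print(components_seconds)  # [65, 86310]
print(signed_seconds)      # [65.0, -90.0]
```

The value 86310 is the 23:58:30 part of -1 days +23:58:30 taken on its own, which is why dt.seconds appears to drop the sign.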
