I have this dataframe
open high low close volume
TimeStamp
2017-12-22 13:15:00 12935.00 13200.00 12508.71 12514.91 244.728611
2017-12-22 13:30:00 12514.91 12999.99 12508.71 12666.34 150.457869
2017-12-22 13:45:00 12666.33 12899.97 12094.00 12094.00 198.680014
2017-12-22 14:00:00 12094.01 12354.99 11150.00 11150.00 256.812634
2017-12-22 14:15:00 11150.01 12510.00 10400.00 12276.33 262.217127
I want to know whether every pair of consecutive rows is exactly 15 minutes apart in time, so I built a new column holding the index shifted by one row:
open high low close volume \
TimeStamp
2017-12-20 13:30:00 17503.98 17600.00 17100.57 17119.89 312.773644
2017-12-20 13:45:00 17119.89 17372.98 17049.00 17170.00 322.953671
2017-12-20 14:00:00 17170.00 17573.00 17170.00 17395.74 236.085829
2017-12-20 14:15:00 17395.74 17398.00 17200.01 17280.00 220.467382
2017-12-20 14:30:00 17280.00 17313.94 17150.00 17256.05 222.760598
new_time
TimeStamp
2017-12-20 13:30:00 2017-12-20 13:45:00
2017-12-20 13:45:00 2017-12-20 14:00:00
2017-12-20 14:00:00 2017-12-20 14:15:00
2017-12-20 14:15:00 2017-12-20 14:30:00
2017-12-20 14:30:00 2017-12-20 14:45:00
Now I want to locate every row that doesn't respect the 15-minute difference rule, so I did:
dfh.loc[(dfh['new_time'].to_pydatetime()-dfh.index.to_pydatetime())>datetime.timedelta(0, 900)]
I get this error:
Traceback (most recent call last):
File "<pyshell#252>", line 1, in <module>
dfh.loc[(dfh['new_time'].to_pydatetime()-dfh.index.to_pydatetime())>datetime.timedelta(0, 900)]
File "C:\Users\Araujo\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\core\generic.py", line 3614, in __getattr__
return object.__getattribute__(self, name)
AttributeError: 'Series' object has no attribute 'to_pydatetime'
Is there any way to do this?
EDIT:
shift only works when the data is periodic; is there any way to do this when it is not?
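(A minimal sketch of a direct check, assuming dfh is the frame above: the subtraction in the failing line works on pandas objects natively, and Index.to_series().diff() makes no periodicity assumption, so neither shift nor to_pydatetime is needed.)
import pandas as pd

# gap between each row and the previous one; diff() assumes nothing about periodicity
gaps = dfh.index.to_series().diff()

# rows whose gap is not exactly 15 minutes; the first row's NaT diff compares
# unequal, so it is excluded explicitly with notna()
bad = dfh[(gaps != pd.Timedelta(minutes=15)) & gaps.notna()]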
This would work:
import pandas as pd
import numpy as np
import datetime as dt
data = [
['2017-12-22 13:15:00', 12935.00, 13200.00, 12508.71, 12514.91, 244.728611],
['2017-12-22 13:30:00', 12514.91, 12999.99, 12508.71, 12666.34, 150.457869],
['2017-12-22 13:45:00', 12666.33, 12899.97, 12094.00, 12094.00, 198.680014],
['2017-12-22 14:00:00', 12094.01, 12354.99, 11150.00, 11150.00, 256.812634],
['2017-12-22 14:15:00', 11150.01, 12510.00, 10400.00, 12276.33, 262.217127]
]
df = pd.DataFrame(data, columns = ['Timestamp', 'open', 'high', 'low', 'close', 'volume'])
df['Timestamp'] = pd.to_datetime(df['Timestamp'])
df['plus_15'] = df['Timestamp'].shift(1) + dt.timedelta(minutes = 15)
df['valid_time'] = np.where((df['Timestamp'] == df['plus_15']) | (df.index == 0), 1, 0)
print(df[['Timestamp', 'valid_time']])
#output
Timestamp valid_time
0 2017-12-22 13:15:00 1
1 2017-12-22 13:30:00 1
2 2017-12-22 13:45:00 1
3 2017-12-22 14:00:00 1
4 2017-12-22 14:15:00 1
So create a new column, plus_15, that takes the previous timestamp and adds 15 minutes to it. Then create another column, valid_time, which compares the Timestamp column to the plus_15 column and marks 1 where they are equal and 0 where they are not.
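To then pull out the rows that break the rule, a short follow-up on the same frame (the invalid_rows name is just illustrative):
# rows flagged 0 are the ones whose timestamp is not previous + 15 minutes
invalid_rows = df[df['valid_time'] == 0]
print(invalid_rows[['Timestamp', 'plus_15']])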
Can we do something like this?
import io
import pandas as pd
import numpy as np
data = '''\
TimeStamp open high low close volume
2017-12-22T13:15:00 12935.00 13200.00 12508.71 12514.91 244.728611
2017-12-22T13:30:00 12514.91 12999.99 12508.71 12666.34 150.457869
2017-12-22T13:45:00 12666.33 12899.97 12094.00 12094.00 198.680014
2017-12-22T14:00:00 12094.01 12354.99 11150.00 11150.00 256.812634
2017-12-22T14:15:00 11150.01 12510.00 10400.00 12276.33 262.217127'''
df = pd.read_csv(io.StringIO(data),
                 sep=r'\s+', parse_dates=['TimeStamp'], index_col=['TimeStamp'])
df['new_time'] = df.index[1:].tolist() + [np.nan]
# df['new_time'] = np.roll(df.index, -1) # if last is not first+15min
# use boolean indexing to filter away unwanted rows
df[[(dt2-dt1)/np.timedelta64(1, 's') == 900
for dt1,dt2 in zip(df.index.values,df.new_time.values)]]
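The list comprehension above can also be written as a vectorized mask, a sketch on the same frame:
# subtract the index from new_time column-wise; the NaT in the last row
# compares False, so it is filtered out automatically
mask = (df['new_time'] - df.index) == np.timedelta64(900, 's')
print(df[mask])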
Related
I have a dataframe:
data = {'time':['08:45:00', '09:30:00', '18:00:00', '15:00:00']}
df = pd.DataFrame(data)
I would like to convert the time based on conditions: if the hour is less than 9, I want to set it to 9 and if the hour is more than 17, I need to set it to 17.
I tried this approach:
df['time'] = np.where(((df['time'].dt.hour < 9) & (df['time'].dt.hour != 0)), dt.time(9, 00))
I am getting an error: Can only use .dt accessor with datetimelike values.
Can anyone please help me with this? Thanks.
Here's a way to do what your question asks:
df.time = pd.to_datetime(df.time)
# shift each out-of-window timestamp by whole hours, working in int64 nanoseconds:
# add (9 - hour) hours to early times and (17 - hour) hours to late times
df.loc[df.time.dt.hour < 9, 'time'] = (df.time.astype('int64') + (9 - df.time.dt.hour)*3600*1000000000).astype('datetime64[ns]')
df.loc[df.time.dt.hour > 17, 'time'] = (df.time.astype('int64') + (17 - df.time.dt.hour)*3600*1000000000).astype('datetime64[ns]')
Input:
time
0 2022-06-06 08:45:00
1 2022-06-06 09:30:00
2 2022-06-06 18:00:00
3 2022-06-06 15:00:00
Output:
time
0 2022-06-06 09:45:00
1 2022-06-06 09:30:00
2 2022-06-06 17:00:00
3 2022-06-06 15:00:00
UPDATE:
Here's alternative code to try to address OP's error as described in the comments:
import pandas as pd
import datetime
data = {'time':['08:45:00', '09:30:00', '18:00:00', '15:00:00']}
df = pd.DataFrame(data)
print('', 'df loaded as strings:', df, sep='\n')
df.time = pd.to_datetime(df.time, format='%H:%M:%S')
print('', 'df converted to datetime by pd.to_datetime():', df, sep='\n')
df.loc[df.time.dt.hour < 9, 'time'] = (df.time.astype('int64') + (9 - df.time.dt.hour)*3600*1000000000).astype('datetime64[ns]')
df.loc[df.time.dt.hour > 17, 'time'] = (df.time.astype('int64') + (17 - df.time.dt.hour)*3600*1000000000).astype('datetime64[ns]')
df.time = [time.time() for time in pd.to_datetime(df.time)]
print('', 'df with time column adjusted to have hour between 9 and 17, converted to type "time":', df, sep='\n')
Output:
df loaded as strings:
time
0 08:45:00
1 09:30:00
2 18:00:00
3 15:00:00
df converted to datetime by pd.to_datetime():
time
0 1900-01-01 08:45:00
1 1900-01-01 09:30:00
2 1900-01-01 18:00:00
3 1900-01-01 15:00:00
df with time column adjusted to have hour between 9 and 17, converted to type "time":
time
0 09:45:00
1 09:30:00
2 17:00:00
3 15:00:00
UPDATE #2:
To not just change the hour for out-of-window times, but to simply apply 9:00 and 17:00 as min and max times, respectively (see OP's comment on this), you can do this:
df.loc[df['time'].dt.hour < 9, 'time'] = pd.to_datetime(pd.DataFrame({
'year':df['time'].dt.year, 'month':df['time'].dt.month, 'day':df['time'].dt.day,
'hour':[9]*len(df.index)}))
df.loc[df['time'].dt.hour > 17, 'time'] = pd.to_datetime(pd.DataFrame({
'year':df['time'].dt.year, 'month':df['time'].dt.month, 'day':df['time'].dt.day,
'hour':[17]*len(df.index)}))
df['time'] = [time.time() for time in pd.to_datetime(df['time'])]
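An arguably simpler alternative to the block above for the same clamping, sketched with Series.clip and per-row bounds built from each row's own date (lo and hi are illustrative names; this starts again from the datetime column):
# bounds at 09:00 and 17:00 on each row's own date
lo = df['time'].dt.normalize() + pd.Timedelta(hours=9)
hi = df['time'].dt.normalize() + pd.Timedelta(hours=17)

# clip accepts Series bounds aligned on the index
df['time'] = df['time'].clip(lower=lo, upper=hi)
df['time'] = [t.time() for t in df['time']]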
Since your 'time' column contains strings, they can be kept as strings and assigned new string values where appropriate. To filter for your criteria it is convenient to: create a datetime Series from the 'time' column, build boolean Series by comparing it against your criteria, then use those boolean Series to select the rows which need to change.
Your data:
import numpy as np
import pandas as pd
data = {'time':['08:45:00', '09:30:00', '18:00:00', '15:00:00']}
df = pd.DataFrame(data)
print(df.to_string())
>>>
time
0 08:45:00
1 09:30:00
2 18:00:00
3 15:00:00
Convert to datetime and make boolean Series with your criteria:
dts = pd.to_datetime(df['time'])
lt_nine = dts.dt.hour < 9
gt_seventeen = (dts.dt.hour >= 17)
print(lt_nine)
print(gt_seventeen)
>>>
0 True
1 False
2 False
3 False
Name: time, dtype: bool
0 False
1 False
2 True
3 False
Name: time, dtype: bool
Use the boolean series to assign a new value:
df.loc[lt_nine,'time'] = '09:00:00'
df.loc[gt_seventeen,'time'] = '17:00:00'
print(df.to_string())
>>>
time
0 09:00:00
1 09:30:00
2 17:00:00
3 15:00:00
Or just stick with strings altogether and create the boolean Series using regex patterns and .str.match.
data = {'time':['08:45:00', '09:30:00', '18:00:00', '15:00:00','07:22:00','22:02:06']}
dg = pd.DataFrame(data)
print(dg.to_string())
>>>
time
0 08:45:00
1 09:30:00
2 18:00:00
3 15:00:00
4 07:22:00
5 22:02:06
# regex patterns
pattern_lt_nine = r'^(0[0-8])'
pattern_gt_seventeen = r'^(1[7-9]|2[0-3])'
Make boolean Series and assign new values
gt_seventeen = dg['time'].str.match(pattern_gt_seventeen)
lt_nine = dg['time'].str.match(pattern_lt_nine)
dg.loc[lt_nine,'time'] = '09:00:00'
dg.loc[gt_seventeen,'time'] = '17:00:00'
print(dg.to_string())
>>>
time
0 09:00:00
1 09:30:00
2 17:00:00
3 15:00:00
4 09:00:00
5 17:00:00
See the pandas user guide sections "Time series / date functionality" and "Working with text data".
I have to compute a daily sum on a dataframe, but only if at least 70% of that day's data is not NaN; if more than 30% of a day is missing, that day must not be taken into account. Is there a way to create such a mask? My dataframe covers more than 17 years of hourly data.
my data is something like this:
clear skies all skies Lab
2015-02-26 13:00:00 597.5259 376.1830 307.62
2015-02-26 14:00:00 461.2014 244.0453 199.94
2015-02-26 15:00:00 283.9003 166.5772 107.84
2015-02-26 16:00:00 93.5099 50.7761 23.27
2015-02-26 17:00:00 1.1559 0.2784 0.91
... ... ...
2015-12-05 07:00:00 95.0285 29.1006 45.23
2015-12-05 08:00:00 241.8822 120.1049 113.41
2015-12-05 09:00:00 363.8040 196.0568 244.78
2015-12-05 10:00:00 438.2264 274.3733 461.28
2015-12-05 11:00:00 456.3396 330.6650 447.15
If I groupby and aggregate, then there is no way to know whether a day had missing data; such days will have lower sums, which in turn lowers my monthly means.
As said in the comments, use groupby to group the data by date and then write an appropriate selection. This is an example that would sum all days (assuming regular data points, 24 per day) with less than 50% NaN entries:
import pandas as pd
import numpy as np
# create a date range
date_rng = pd.date_range(start='1/1/2018', end='1/1/2021', freq='H')
# create random data
df = pd.DataFrame({"data":np.random.randint(0,100,size=(len(date_rng)))}, index = date_rng)
# set some values to nan
df["data"][df["data"] > 50] = np.nan
# looks like this
df.head(20)
# sum everything where less than 50% are nan
df.groupby(df.index.date).sum()[df.isna().groupby(df.index.date).sum() < 12]
Example output:
data
2018-01-01 NaN
2018-01-02 NaN
2018-01-03 487.0
2018-01-04 NaN
2018-01-05 421.0
... ...
2020-12-28 NaN
2020-12-29 NaN
2020-12-30 NaN
2020-12-31 392.0
2021-01-01 0.0
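To implement the actual 70% rule from the question rather than the hardcoded 12-of-24 above, a sketch that computes the valid fraction per calendar day:
# fraction of non-NaN samples per calendar day (mean of a boolean Series)
frac_valid = df["data"].notna().groupby(df.index.date).mean()

# daily sums, masked out where less than 70% of that day's data is present
daily_sum = df["data"].groupby(df.index.date).sum()
daily_sum[frac_valid < 0.7] = np.nan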
An alternative solution - you may find it useful & flexible:
# pip install convtools
from convtools import conversion as c
total_number = c.ReduceFuncs.Count()
total_not_none = c.ReduceFuncs.Count(where=c.item("amount").is_not(None))
total_sum = c.ReduceFuncs.Sum(c.item("amount"))
input_data = [] # e.g. iterable of dicts
converter = (
c.group_by(
c.item("key1"),
c.item("key2"),
)
.aggregate(
{
"key1": c.item("key1"),
"key2": c.item("key2"),
"sum_if_70": c.if_(
total_not_none / total_number < 0.7,
None,
total_sum,
),
}
)
.gen_converter(
debug=False
) # install black and set to True to see the generated ad-hoc code
)
result = converter(input_data)
Given a dataset where each row represents an hourly sample, i.e. each day has 24 entries, with the following index set:
...
2020-10-22T20:00:00
2020-10-22T21:00:00
2020-10-22T22:00:00
...
2020-10-23T20:00:00
2020-10-23T21:00:00
2020-10-23T22:00:00
...
Now I want to filter the dataset so that for each day only the hours between 9am and 3pm are left.
The only way I know would be to iterate over the dataset and filter row by row with a condition; however, knowing pandas, there is usually some trick for this kind of filtering that does not involve explicit iteration.
You can use the aptly named pd.DataFrame.between_time method. This will only work if your dataframe has a DatetimeIndex.
Data Creation
import numpy as np
import pandas as pd

date_index = pd.date_range("2020-10-22T20:00:00", "2020-11-22T20:00:00", freq="H")
values = np.random.rand(len(date_index), 1)
df = pd.DataFrame(values, index=date_index, columns=["value"])
print(df.head())
value
2020-10-22 20:00:00 0.637542
2020-10-22 21:00:00 0.590626
2020-10-22 22:00:00 0.474802
2020-10-22 23:00:00 0.058775
2020-10-23 00:00:00 0.904070
Method
subset = df.between_time("9:00am", "3:00pm")
print(subset.head(10))
value
2020-10-23 09:00:00 0.210816
2020-10-23 10:00:00 0.086677
2020-10-23 11:00:00 0.141275
2020-10-23 12:00:00 0.065100
2020-10-23 13:00:00 0.892314
2020-10-23 14:00:00 0.214991
2020-10-23 15:00:00 0.106937
2020-10-24 09:00:00 0.900106
2020-10-24 10:00:00 0.545249
2020-10-24 11:00:00 0.793243
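Note that between_time includes both endpoints by default; on pandas 1.4 and later the inclusive argument controls this (a sketch, assuming that version):
# keep the 09:00 rows but drop the 15:00 rows themselves
subset_left = df.between_time("9:00", "15:00", inclusive="left")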
import pandas as pd
# sample data (strings)
data = [f'2020-10-{d:02d}T{h:02d}:00:00' for h in range(24) for d in range(1, 21)]
# series of DT values
ds = pd.to_datetime(pd.Series(data), format='%Y-%m-%dT%H:%M:%S')
# filter by hours
ds_filter = ds[(ds.dt.hour >= 9) & (ds.dt.hour <= 15)]
Okay, so I have a CSV with minute data for the S&P 500 index for 2020, and I am looking for how to index out only the close and open for 9:30 and 16:00 (4:00pm); in essence I just want what the market open and close were. So far the code is:
import pandas as pd
import datetime as dt
import numpy as np
d = pd.read_csv('/Volumes/Seagate Portable/usindex_2020_all_tickers_awvbxk9/SPX_2020_2020.txt')
d.columns = ['Dates', 'Open', 'High', 'Low', 'Close']
d.drop(['High', 'Low'], axis=1, inplace=True)
d.set_index('Dates', inplace=True)
d.head()
It won't let me share the CSV file, but this is what the output looks like:
Open Close
Dates
2020-01-02 09:31:00 3247.19 3245.22
2020-01-02 09:32:00 3245.07 3244.66
2020-01-02 09:33:00 3244.89 3247.61
2020-01-02 09:34:00 3247.38 3246.92
2020-01-02 09:35:00 3246.89 3249.09
I have tried using loc and dt.time, which I am assuming is the right way to code it; I just cannot think of the exact code to index out these two times. Any ideas? Thank you!
If the .dt extractor is used on the 'Dates' column (d.Dates.dt.time[0]), the time component is datetime.time(9, 30). Therefore the Boolean match must compare against d.Dates.dt.time == dtime(9, 30), not d.Dates.dt.time == '09:30:00'.
import pandas as pd
from datetime import time as dtime
# test dataframe
d = pd.DataFrame({'Dates': ['2020-01-02 09:30:00', '2020-01-02 09:31:00', '2020-01-02 09:32:00', '2020-01-02 09:33:00', '2020-01-02 09:34:00', '2020-01-02 09:35:00', '2020-01-02 16:00:00'], 'Open': [3247.19, 3247.19, 3245.07, 3244.89, 3247.38, 3246.89, 3247.19], 'Close': [3245.22, 3245.22, 3244.66, 3247.61, 3246.92, 3249.09, 3245.22]})
# display(d)
Dates Open Close
0 2020-01-02 09:30:00 3247.19 3245.22
1 2020-01-02 09:31:00 3247.19 3245.22
2 2020-01-02 09:32:00 3245.07 3244.66
3 2020-01-02 09:33:00 3244.89 3247.61
4 2020-01-02 09:34:00 3247.38 3246.92
5 2020-01-02 09:35:00 3246.89 3249.09
6 2020-01-02 16:00:00 3247.19 3245.22
# verify Dates is a datetime format
d.Dates = pd.to_datetime(d.Dates)
# use Boolean selection for 9:30 and 16:00 (4pm)
d = d[(d.Dates.dt.time == dtime(9, 30)) | (d.Dates.dt.time == dtime(16, 0))].copy()
# set the index
d.set_index('Dates', inplace=True)
# display(d)
Open Close
Dates
2020-01-02 09:30:00 3247.19 3245.22
2020-01-02 16:00:00 3247.19 3245.22
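As an alternative to the chained equality comparisons, the Boolean selection step can use isin, which scales better if more times are added later (a sketch, applied before setting the index):
# one membership test instead of two == comparisons joined with |
d = d[d.Dates.dt.time.isin([dtime(9, 30), dtime(16, 0)])].copy()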
Try:
import pandas as pd
# create dummy daterange
date_range = pd.DatetimeIndex(pd.date_range("00:00", "23:59", freq='1min'))
# create df with enumerated column as data, and with daterange(DatetimeIndex) as index
df = pd.DataFrame(data=[i for i, d in enumerate(date_range)], index=date_range)
# boolean index using strings
four_and_nine = df[(df.index == '16:00:00') | (df.index == '21:00:00')]
print(four_and_nine)
0
2021-01-01 16:00:00 960
2021-01-01 21:00:00 1260
Pandas is pretty smart about comparing strings to actual datetimes (a DatetimeIndex in this case). Note this works here because the dummy date_range is created on the current date, which matches the date pandas attaches when parsing a bare time string.
The above selects single top-of-the-hour stamps. If you want all minutes/seconds within specific hours, use a boolean index like: df[(df.index.hour == 4) | (df.index.hour == 9)]
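With a DatetimeIndex there is also the purpose-built DataFrame.at_time for exact time-of-day selection, a sketch on the same frame:
# all rows stamped exactly 16:00, regardless of date
four_pm = df.at_time('16:00')
print(four_pm.head())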
Suppose I have a DataFrame with a DatetimeIndex like this:
Date_Time            Open    High    Low    Close   Volume
2018-01-22 11:05:00 948.00 948.10 947.95 948.10 9820.0
2018-01-22 11:06:00 948.10 949.60 948.05 949.30 33302.0
2018-01-22 11:07:00 949.25 949.85 949.20 949.85 20522.0
2018-03-27 09:15:00 907.20 908.80 905.00 908.15 126343.0
2018-03-27 09:16:00 908.20 909.20 906.55 906.60 38151.0
2018-03-29 09:30:00 908.90 910.45 908.80 910.15 46429.0
I want to select only the first row of each unique date (discarding the time), so that I get output such as below:
Date_Time Open High Low Close Volume
2018-01-22 11:05:00 948.00 948.10 947.95 948.10 9820.0
2018-03-27 09:15:00 907.20 908.80 905.00 908.15 126343.0
2018-03-29 09:30:00 908.90 910.45 908.80 910.15 46429.0
I tried with loc and iloc but they didn't help.
Any help will be greatly appreciated.
You need to group by date and get the first element of each group:
import pandas as pd
data = [['2018-01-22 11:05:00', 948.00, 948.10, 947.95, 948.10, 9820.0],
['2018-01-22 11:06:00', 948.10, 949.60, 948.05, 949.30, 33302.0],
['2018-01-22 11:07:00', 949.25, 949.85, 949.20, 949.85, 20522.0],
['2018-03-27 09:15:00', 907.20, 908.80, 905.00, 908.15, 126343.0],
['2018-03-27 09:16:00', 908.20, 909.20, 906.55, 906.60, 38151.0],
['2018-03-29 09:30:00', 908.90, 910.45, 908.80, 910.15, 46429.0]]
df = pd.DataFrame(data=data)
df = df.set_index([0])
df.columns = ['Open', 'High', 'Low', 'Close', 'Volume']
result = df.groupby(pd.to_datetime(df.index).date).head(1)
print(result)
Output
Open High Low Close Volume
0
2018-01-22 11:05:00 948.0 948.10 947.95 948.10 9820.0
2018-03-27 09:15:00 907.2 908.80 905.00 908.15 126343.0
2018-03-29 09:30:00 908.9 910.45 908.80 910.15 46429.0
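If the index is a DatetimeIndex (or converted once up front), an equivalent sketch keeps the first stamp of each calendar day with normalize plus duplicated:
# normalize() floors each timestamp to midnight; duplicated() then marks
# every stamp after the first one on each day
idx = pd.to_datetime(df.index)
result = df[~idx.normalize().duplicated(keep='first')]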