I'm trying to process some data in pandas that looks like this in the CSV:
2014.01.02,08:56,1.37549,1.37552,1.37549,1.37552,3
2014.01.02,09:00,1.37562,1.37562,1.37545,1.37545,21
2014.01.02,09:01,1.37545,1.37550,1.37542,1.37546,18
2014.01.02,09:02,1.37546,1.37550,1.37546,1.37546,15
2014.01.02,09:03,1.37546,1.37563,1.37546,1.37559,39
2014.01.02,09:04,1.37559,1.37562,1.37555,1.37561,37
2014.01.02,09:05,1.37561,1.37564,1.37558,1.37561,35
2014.01.02,09:06,1.37561,1.37566,1.37558,1.37563,38
2014.01.02,09:07,1.37563,1.37567,1.37561,1.37566,42
2014.01.02,09:08,1.37570,1.37571,1.37564,1.37566,25
I imported it using:
raw_data = pd.read_csv('raw_data.csv', engine='c', header=None, index_col=0, names=['date', 'time', 'open', 'high', 'low', 'close', 'volume'], parse_dates=[[0,1]])
But now I want to extract some random (or even contiguous) samples from the data, but only where I have 5 consecutive minutes of data. So, for instance, the row at 2014.01.02,08:56 can't be used because a gap follows it. But the row at 2014.01.02,09:00 is fine because data exists for each of the next 5 minutes.
Any suggestions on how to accomplish this in an efficient way?
Here is one way: first use .asfreq('T') to populate NaNs at the minutely frequency, then use rolling_apply to check whether the previous or the next 5 observations contain any NaNs.
# populate NaNs at minutely freq
# ======================
df = raw_data.asfreq('T')
print(df)
open high low close volume
date_time
2014-01-02 08:56:00 1.3755 1.3755 1.3755 1.3755 3
2014-01-02 08:57:00 NaN NaN NaN NaN NaN
2014-01-02 08:58:00 NaN NaN NaN NaN NaN
2014-01-02 08:59:00 NaN NaN NaN NaN NaN
2014-01-02 09:00:00 1.3756 1.3756 1.3755 1.3755 21
2014-01-02 09:01:00 1.3755 1.3755 1.3754 1.3755 18
2014-01-02 09:02:00 1.3755 1.3755 1.3755 1.3755 15
2014-01-02 09:03:00 1.3755 1.3756 1.3755 1.3756 39
2014-01-02 09:04:00 1.3756 1.3756 1.3756 1.3756 37
2014-01-02 09:05:00 1.3756 1.3756 1.3756 1.3756 35
2014-01-02 09:06:00 1.3756 1.3757 1.3756 1.3756 38
2014-01-02 09:07:00 1.3756 1.3757 1.3756 1.3757 42
2014-01-02 09:08:00 1.3757 1.3757 1.3756 1.3757 25
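# True where the current minute and the previous 4 are all present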
consecutive_previous_5min = pd.rolling_apply(df['open'], 5, lambda g: np.isnan(g).any()) == 0
consecutive_previous_5min
date_time
2014-01-02 08:56:00 False
2014-01-02 08:57:00 False
2014-01-02 08:58:00 False
2014-01-02 08:59:00 False
2014-01-02 09:00:00 False
2014-01-02 09:01:00 False
2014-01-02 09:02:00 False
2014-01-02 09:03:00 False
2014-01-02 09:04:00 True
2014-01-02 09:05:00 True
2014-01-02 09:06:00 True
2014-01-02 09:07:00 True
2014-01-02 09:08:00 True
Freq: T, dtype: bool
# use the reverse trick to get the next 5 values
consecutive_next_5min = (pd.rolling_apply(df['open'][::-1], 5, lambda g: np.isnan(g).any()) == 0)[::-1]
consecutive_next_5min
date_time
2014-01-02 08:56:00 False
2014-01-02 08:57:00 False
2014-01-02 08:58:00 False
2014-01-02 08:59:00 False
2014-01-02 09:00:00 True
2014-01-02 09:01:00 True
2014-01-02 09:02:00 True
2014-01-02 09:03:00 True
2014-01-02 09:04:00 True
2014-01-02 09:05:00 False
2014-01-02 09:06:00 False
2014-01-02 09:07:00 False
2014-01-02 09:08:00 False
Freq: T, dtype: bool
# keep rows where either the previous 5 or the next 5 elements are all non-null
df.loc[consecutive_next_5min | consecutive_previous_5min]
open high low close volume
date_time
2014-01-02 09:00:00 1.3756 1.3756 1.3755 1.3755 21
2014-01-02 09:01:00 1.3755 1.3755 1.3754 1.3755 18
2014-01-02 09:02:00 1.3755 1.3755 1.3755 1.3755 15
2014-01-02 09:03:00 1.3755 1.3756 1.3755 1.3756 39
2014-01-02 09:04:00 1.3756 1.3756 1.3756 1.3756 37
2014-01-02 09:05:00 1.3756 1.3756 1.3756 1.3756 35
2014-01-02 09:06:00 1.3756 1.3757 1.3756 1.3756 38
2014-01-02 09:07:00 1.3756 1.3757 1.3756 1.3757 42
2014-01-02 09:08:00 1.3757 1.3757 1.3756 1.3757 25
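Note that pd.rolling_apply was deprecated and later removed in favour of the rolling method on Series/DataFrame (and newer pandas prefers 'min' over 'T' as the minutely alias). A sketch of the same idea with the modern API, assuming pandas >= 1.0:
# count non-null values in each 5-minute window; a full window sums to 5
has_data = df['open'].notna()
consecutive_previous_5min = has_data.rolling(5).sum().eq(5)
# same reverse trick for the next 5 minutes
consecutive_next_5min = has_data[::-1].rolling(5).sum()[::-1].eq(5)
df.loc[consecutive_previous_5min | consecutive_next_5min]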
How do I modify my code to have groupby return the previous day's min instead of the current day's min? Please see the desired output below, as it shows exactly what I am trying to achieve.
Data
np.random.seed(5)
series = pd.Series(np.random.choice([1,3,5], 10), index = pd.date_range('2014-01-01', '2014-01-04', freq = '8h'))
series
2014-01-01 00:00:00 5
2014-01-01 08:00:00 3
2014-01-01 16:00:00 5
2014-01-02 00:00:00 5
2014-01-02 08:00:00 1
2014-01-02 16:00:00 3
2014-01-03 00:00:00 1
2014-01-03 08:00:00 1
2014-01-03 16:00:00 5
2014-01-04 00:00:00 1
Output after groupby
series.groupby(series.index.date).transform(min)
2014-01-01 00:00:00 3
2014-01-01 08:00:00 3
2014-01-01 16:00:00 3
2014-01-02 00:00:00 1
2014-01-02 08:00:00 1
2014-01-02 16:00:00 1
2014-01-03 00:00:00 1
2014-01-03 08:00:00 1
2014-01-03 16:00:00 1
2014-01-04 00:00:00 1
Desired output (yesterday's min)
2014-01-01 00:00:00 NaN
2014-01-01 08:00:00 NaN
2014-01-01 16:00:00 NaN
2014-01-02 00:00:00 3
2014-01-02 08:00:00 3
2014-01-02 16:00:00 3
2014-01-03 00:00:00 1
2014-01-03 08:00:00 1
2014-01-03 16:00:00 1
2014-01-04 00:00:00 1
You can swap the index to just the date, calculate the min per day, shift it, and then swap the original index back:
# Swap the index to just the date component
s = series.set_axis(series.index.date)
# Calculate the min per day, and shift it
t = s.groupby(level=0).min().shift()
# Final assembly
s[t.index] = t
s.index = series.index
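If you would rather not touch the index at all, the same idea can be written as a lookup that maps each row's date to the shifted per-day minimum (a sketch, with hypothetical variable names):
# per-day min, shifted so each date maps to the previous day's min
prev_day_min = series.groupby(series.index.date).min().shift()
result = pd.Series(series.index.date, index=series.index).map(prev_day_min)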
Let us do it with reindex:
series[:] = series.groupby(series.index.date).min().shift().reindex(series.index.date)
series
Out[370]:
2014-01-01 00:00:00 NaN
2014-01-01 08:00:00 NaN
2014-01-01 16:00:00 NaN
2014-01-02 00:00:00 3.0
2014-01-02 08:00:00 3.0
2014-01-02 16:00:00 3.0
2014-01-03 00:00:00 1.0
2014-01-03 08:00:00 1.0
2014-01-03 16:00:00 1.0
2014-01-04 00:00:00 1.0
Freq: 8H, dtype: float64
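The series[:] = ... trick keeps the original timestamp index while writing in the date-indexed values positionally; a plain label-aligned assignment would not match, because the right-hand side is indexed by plain dates. If you prefer to build a new series instead of overwriting in place, a sketch:
result = pd.Series(
    series.groupby(series.index.date).min().shift().reindex(series.index.date).to_numpy(),
    index=series.index)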
How do I modify my code to have the pandas rolling daily min reset each day? Please see the desired output below, as it shows exactly what I am trying to achieve.
I think I may need to use groupby to get the same result, but I'm unsure how to proceed.
Data
np.random.seed(5)
series = pd.Series(np.random.choice([1,3,5], 10), index = pd.date_range('2014-01-01', '2014-01-04', freq = '8h'))
series
2014-01-01 00:00:00 5
2014-01-01 08:00:00 3
2014-01-01 16:00:00 5
2014-01-02 00:00:00 5
2014-01-02 08:00:00 1
2014-01-02 16:00:00 3
2014-01-03 00:00:00 1
2014-01-03 08:00:00 1
2014-01-03 16:00:00 5
2014-01-04 00:00:00 1
Output after pandas rolling
series.rolling('D', min_periods=1).min()
2014-01-01 00:00:00 5.0
2014-01-01 08:00:00 3.0
2014-01-01 16:00:00 3.0
2014-01-02 00:00:00 3.0
2014-01-02 08:00:00 1.0
2014-01-02 16:00:00 1.0
2014-01-03 00:00:00 1.0
2014-01-03 08:00:00 1.0
2014-01-03 16:00:00 1.0
2014-01-04 00:00:00 1.0
Desired output (reset each day)
I can get the desired output like this but want to avoid looping:
series_list = []
for i in set(series.index.date):
    series_list.append(series.loc[str(i)].rolling('D', min_periods=1).min())
pd.concat(series_list).sort_index()
2014-01-01 00:00:00 5.0
2014-01-01 08:00:00 3.0
2014-01-01 16:00:00 3.0
2014-01-02 00:00:00 5.0
2014-01-02 08:00:00 1.0
2014-01-02 16:00:00 1.0
2014-01-03 00:00:00 1.0
2014-01-03 08:00:00 1.0
2014-01-03 16:00:00 1.0
2014-01-04 00:00:00 1.0
Use a grouped cumulative minimum, which restarts at each date boundary:
series.groupby(series.index.date).cummin()
Output:
2014-01-01 00:00:00 5
2014-01-01 08:00:00 3
2014-01-01 16:00:00 3
2014-01-02 00:00:00 5
2014-01-02 08:00:00 1
2014-01-02 16:00:00 1
2014-01-03 00:00:00 1
2014-01-03 08:00:00 1
2014-01-03 16:00:00 1
2014-01-04 00:00:00 1
Freq: 8H, dtype: int64
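This matches the loop-based output above; the only difference is the dtype (int64 here, since cummin never introduces NaNs, versus float64 from rolling). If you need identical dtypes, cast:
series.groupby(series.index.date).cummin().astype(float)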
I have a df with a MultiIndex [(latitude, longitude, time)] and 148 x 244 x 90 x 24 rows. For each latitude and longitude, the time is hourly from 2014-01-01 00:00:00 to 2014-03-31 23:00:00.
FFDI isInRange
latitude longitude time
-39.20000 140.80000 2014-01-01 00:00:00 6.20000 True
2014-01-01 01:00:00 4.10000 True
2014-01-01 02:00:00 2.40000 True
2014-01-01 03:00:00 1.90000 True
2014-01-01 04:00:00 1.70000 True
2014-01-01 05:00:00 1.50000 True
2014-01-01 06:00:00 1.40000 True
2014-01-01 07:00:00 1.30000 True
2014-01-01 08:00:00 1.20000 True
2014-01-01 09:00:00 1.00000 True
2014-01-01 10:00:00 1.00000 True
2014-01-01 11:00:00 0.90000 True
2014-01-01 12:00:00 0.90000 True
... ... ... ...
2014-03-31 21:00:00 0.30000 False
2014-03-31 22:00:00 0.30000 False
2014-03-31 23:00:00 0.50000 False
140.83786 2014-01-01 00:00:00 3.20000 True
2014-01-01 01:00:00 2.90000 True
2014-01-01 02:00:00 2.10000 True
2014-01-01 03:00:00 2.90000 True
2014-01-01 04:00:00 1.20000 True
2014-01-01 05:00:00 0.90000 True
2014-01-01 06:00:00 1.10000 True
2014-01-01 07:00:00 1.60000 True
2014-01-01 08:00:00 1.40000 True
2014-01-01 09:00:00 1.50000 True
2014-01-01 10:00:00 1.20000 True
2014-01-01 11:00:00 0.80000 True
2014-01-01 12:00:00 0.40000 True
... ... ... ...
2014-03-31 21:00:00 0.30000 False
2014-03-31 22:00:00 0.30000 False
2014-03-31 23:00:00 0.50000 False
... ... ... ...
... ... ...
-33.90000 140.80000 2014-01-01 00:00:00 6.20000 True
2014-01-01 01:00:00 4.10000 True
2014-01-01 02:00:00 2.40000 True
2014-01-01 03:00:00 1.90000 True
2014-01-01 04:00:00 1.70000 True
2014-01-01 05:00:00 1.50000 True
2014-01-01 06:00:00 1.40000 True
2014-01-01 07:00:00 1.30000 True
2014-01-01 08:00:00 1.20000 True
2014-01-01 09:00:00 1.00000 True
2014-01-01 10:00:00 1.00000 True
2014-01-01 11:00:00 0.90000 True
2014-01-01 12:00:00 0.90000 True
... ... ... ...
2014-03-31 21:00:00 0.30000 False
2014-03-31 22:00:00 0.30000 False
2014-03-31 23:00:00 0.50000 False
140.83786 2014-01-01 00:00:00 3.20000 True
2014-01-01 01:00:00 2.90000 True
2014-01-01 02:00:00 2.10000 True
2014-01-01 03:00:00 2.90000 True
2014-01-01 04:00:00 1.20000 True
2014-01-01 05:00:00 0.90000 True
2014-01-01 06:00:00 1.10000 True
2014-01-01 07:00:00 1.60000 True
2014-01-01 08:00:00 1.40000 True
2014-01-01 09:00:00 1.50000 True
2014-01-01 10:00:00 1.20000 True
2014-01-01 11:00:00 0.80000 True
2014-01-01 12:00:00 0.40000 True
... ... ... ...
2014-03-31 21:00:00 0.30000 False
2014-03-31 22:00:00 0.30000 False
2014-03-31 23:00:00 0.50000 False
78001920 rows × 1 columns
What I want to achieve is to calculate a daily maximum FFDI value over each 24-hour group for each latitude and longitude, on the condition that:
If isInRange = True for all 24 hours/rows in the group - use FFDI from 13:00:00 of the previous day to 12:00:00 of the next day
If isInRange = False for all 24 hours/rows in the group - use FFDI from 14:00:00 of the previous day to 13:00:00 of the next day
Then my code is:
df_daily_max = df.groupby(['latitude', 'longitude', pd.Grouper(freq='24H',base=13,loffset='11H',label='right',level='time')])['FFDI'].max().reset_index(name='Max FFDI') if df['isInRange'] else isInRange.groupby(['latitude', 'longitude', pd.Grouper(freq='24H',base=14,loffset='10H',label='right',level='time')])['FFDI'].max().reset_index(name='Max FFDI')
However this line raised an error:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
The if/else evaluates the whole df['isInRange'] Series as a single boolean, which is what raises the error. Instead, filter the all-True rows and the all-False rows separately, aggregate max on each, join them with concat, sort the MultiIndex, and convert back to a DataFrame with Series.reset_index:
s1 = df[df['isInRange']].groupby(['latitude', 'longitude', pd.Grouper(freq='24H',base=13,loffset='11H',label='right',level='time')])['FFDI'].max()
s2 = df[~df['isInRange']].groupby(['latitude', 'longitude', pd.Grouper(freq='24H',base=14,loffset='10H',label='right',level='time')])['FFDI'].max()
df_daily_max = pd.concat([s1, s2]).sort_index().reset_index(name='Max FFDI')
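One caveat: the base and loffset arguments were deprecated in pandas 1.1 and removed in 2.0. On recent versions the same grouping can be approximated with offset plus a manual label shift; a sketch, untested against the original data, assuming offset='13H' reproduces base=13 for a 24H bin:
g1 = pd.Grouper(freq='24H', offset='13H', label='right', level='time')
s1 = df[df['isInRange']].groupby(['latitude', 'longitude', g1])['FFDI'].max()
# loffset='11H' used to shift the result labels; apply it to the time level by hand
s1.index = s1.index.set_levels(s1.index.levels[2] + pd.Timedelta('11H'), level='time')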
I have an indicator stored in a dataframe with a 1-hour time series, and I'd like to bring it into a dataframe with a 1-minute time series, but I would like the indicator to have a value for every index of the one-minute frame, instead of one per hour.
I tried the following code:
import pandas as pd
import datetime
DAX_H1 = pd.read_csv("D:\Finance python\Data\DAX\GER30_H1_fixed.csv",index_col='date',error_bad_lines=False)
DAX_M1 = pd.read_csv("D:\Finance python\Data\DAX\GER30_m1_fixed.csv",index_col='date',error_bad_lines=False)
DAX_M1=DAX_M1.iloc[0:6000]
DAX_H1=DAX_H1.iloc[0:110]
DAX_H1["EWMA_8"]=DAX_H1["Open"].ewm(min_periods=8,com=3.5).mean()
DAX_H1.index =pd.to_datetime(DAX_H1.index)
DAX_M1.index =pd.to_datetime(DAX_M1.index)
DAX_M1["H1_EWMA_8"]=""
for i in DAX_M1.index:
    DAX_M1["H1_EWMA_8"][i] = DAX_H1["EWMA_8"][pd.Timestamp(datetime.datetime(i.year, i.month, i.day, i.hour))]
However, it doesn't seem to work, and even if it worked, I'd assume it would be very slow.
If I simply do the following:
DAX_M1["H1_EWMA_8"]=DAX_H1["EWMA_8"]
I don't have a value for "H1_EWMA_8" for each index:
2014-01-02 16:53:00 NaN
2014-01-02 16:54:00 NaN
2014-01-02 16:55:00 NaN
2014-01-02 16:56:00 NaN
2014-01-02 16:57:00 NaN
2014-01-02 16:58:00 NaN
2014-01-02 16:59:00 NaN
2014-01-02 17:00:00 9449.979026
2014-01-02 17:01:00 NaN
2014-01-02 17:02:00 NaN
Name: H1_EWMA_8, dtype: float64
Is there a simple way to replace the NaN values with the last available value of "H1_EWMA_8"?
Some part of DAX_M1 and DAX_H1 for illustration purposes:
DAX_M1:
Open High Low Close Total Ticks
date
2014-01-01 22:00:00 9597 9597 9597 9597 1
2014-01-02 07:05:00 9597 9619 9597 9618 18
2014-01-02 07:06:00 9618 9621 9617 9621 5
2014-01-02 07:07:00 9621 9623 9620 9623 6
2014-01-02 07:08:00 9623 9625 9623 9625 9
2014-01-02 07:09:00 9625 9625 9622 9623 6
2014-01-02 07:10:00 9623 9625 9622 9624 13
2014-01-02 07:11:00 9624 9626 9624 9626 8
2014-01-02 07:12:00 9626 9626 9623 9623 9
2014-01-02 07:13:00 9623 9625 9623 9625 5
DAX_H1:
Open High Low Close Total Ticks EWMA_8
date
2014-01-01 22:00:00 9597 9597 9597 9597 1 NaN
2014-01-02 07:00:00 9597 9626 9597 9607 322 NaN
2014-01-02 08:00:00 9607 9617 9510 9535 1730 NaN
2014-01-02 09:00:00 9535 9537 9465 9488 1428 NaN
2014-01-02 10:00:00 9488 9505 9478 9490 637 NaN
2014-01-02 11:00:00 9490 9512 9473 9496 817 NaN
2014-01-02 12:00:00 9496 9510 9495 9504 450 NaN
2014-01-02 13:00:00 9504 9514 9484 9484 547 9518.123073
2014-01-02 14:00:00 9484 9493 9424 9436 1497 9509.658500
Any help would be gladly appreciated!
Edit: This solution worked:
DAX_M1 = DAX_M1.fillna(method='ffill')
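Two notes on that: fillna(method='ffill') is deprecated on recent pandas in favour of DAX_M1.ffill(), and the intermediate NaN column can be skipped entirely by aligning the hourly series to the minute index in one step (a sketch):
# for each minute, take the last hourly EWMA at or before that timestamp
DAX_M1["H1_EWMA_8"] = DAX_H1["EWMA_8"].reindex(DAX_M1.index, method='ffill')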
I'm building a basic rota/schedule for staff, and have a DataFrame from a MySQL cursor which gives a list of IDs, dates, and classes:
id the_date class
0 195593 2017-09-12 14:00:00 3
1 193972 2017-09-13 09:15:00 2
2 195594 2017-09-13 14:00:00 3
3 195595 2017-09-15 14:00:00 3
4 193947 2017-09-16 17:30:00 3
5 195627 2017-09-17 08:00:00 2
6 193948 2017-09-19 11:30:00 2
7 195628 2017-09-21 08:00:00 2
8 193949 2017-09-21 11:30:00 2
9 195629 2017-09-24 08:00:00 2
10 193950 2017-09-24 10:00:00 2
11 193951 2017-09-27 11:30:00 2
12 195644 2017-09-28 06:00:00 1
13 194400 2017-09-28 08:00:00 1
14 195630 2017-09-28 08:00:00 2
15 193952 2017-09-29 11:30:00 2
16 195631 2017-10-01 08:00:00 2
17 194401 2017-10-06 08:00:00 1
18 195645 2017-10-06 10:00:00 1
19 195632 2017-10-07 13:30:00 3
If the class == 1, I need that instance duplicated 5 times.
first_class = df[df['class'] == 1]
non_first_class = df[df['class'] != 1]
first_class_replicated = pd.concat([first_class]*5, ignore_index=True).sort_values(['the_date'])
id the_date class
0 195644 2017-09-28 06:00:00 1
16 195644 2017-09-28 06:00:00 1
4 195644 2017-09-28 06:00:00 1
12 195644 2017-09-28 06:00:00 1
8 195644 2017-09-28 06:00:00 1
17 194400 2017-09-28 08:00:00 1
13 194400 2017-09-28 08:00:00 1
9 194400 2017-09-28 08:00:00 1
5 194400 2017-09-28 08:00:00 1
1 194400 2017-09-28 08:00:00 1
6 194401 2017-10-06 08:00:00 1
18 194401 2017-10-06 08:00:00 1
10 194401 2017-10-06 08:00:00 1
14 194401 2017-10-06 08:00:00 1
2 194401 2017-10-06 08:00:00 1
11 195645 2017-10-06 10:00:00 1
3 195645 2017-10-06 10:00:00 1
15 195645 2017-10-06 10:00:00 1
7 195645 2017-10-06 10:00:00 1
19 195645 2017-10-06 10:00:00 1
I then merge non_first_class and first_class_replicated. Before that though, I need the dates in first_class_replicated to increment by one day, grouped by id. Below is how I need it to look. Is there an elegant Pandas solution to this, or should I be looking at looping over a groupby series to modify the dates?
Desired:
id
0 195644 2017-09-28 6:00:00
16 195644 2017-09-29 6:00:00
4 195644 2017-09-30 6:00:00
12 195644 2017-10-01 6:00:00
8 195644 2017-10-02 6:00:00
17 194400 2017-09-28 8:00:00
13 194400 2017-09-29 8:00:00
9 194400 2017-09-30 8:00:00
5 194400 2017-10-01 8:00:00
1 194400 2017-10-02 8:00:00
6 194401 2017-10-06 8:00:00
18 194401 2017-10-07 8:00:00
10 194401 2017-10-08 8:00:00
14 194401 2017-10-09 8:00:00
2 194401 2017-10-10 8:00:00
11 195645 2017-10-06 10:00:00
3 195645 2017-10-07 10:00:00
15 195645 2017-10-08 10:00:00
7 195645 2017-10-09 10:00:00
19 195645 2017-10-10 10:00:00
You can use cumcount to number the repeats within each id, then convert the counts to_timedelta and add them to the column:
# another solution for the repeat
first_class_replicated = first_class.loc[np.repeat(first_class.index, 5)].sort_values(['the_date'])
df1 = first_class_replicated.groupby('id').cumcount()
first_class_replicated['the_date'] += pd.to_timedelta(df1, unit='D')
print (first_class_replicated)
id the_date class
0 195644 2017-09-28 06:00:00 1
16 195644 2017-09-29 06:00:00 1
4 195644 2017-09-30 06:00:00 1
12 195644 2017-10-01 06:00:00 1
8 195644 2017-10-02 06:00:00 1
17 194400 2017-09-28 08:00:00 1
13 194400 2017-09-29 08:00:00 1
9 194400 2017-09-30 08:00:00 1
5 194400 2017-10-01 08:00:00 1
1 194400 2017-10-02 08:00:00 1
6 194401 2017-10-06 08:00:00 1
18 194401 2017-10-07 08:00:00 1
10 194401 2017-10-08 08:00:00 1
14 194401 2017-10-09 08:00:00 1
2 194401 2017-10-10 08:00:00 1
11 195645 2017-10-06 10:00:00 1
3 195645 2017-10-07 10:00:00 1
15 195645 2017-10-08 10:00:00 1
7 195645 2017-10-09 10:00:00 1
19 195645 2017-10-10 10:00:00 1
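The question also mentions merging non_first_class and first_class_replicated afterwards; that final step is just another concat (a sketch, using the variables defined in the question):
result = pd.concat([non_first_class, first_class_replicated]).sort_values(['the_date'])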