Create pandas column from a different time series - python

I have an indicator stored in a dataframe with a 1-hour time series, and I'd like to bring it into a dataframe with a 1-minute time series, so that the indicator has a value for every index of the one-minute dataframe instead of only one per hour.
I tried the following code:
import pandas as pd
import datetime
DAX_H1 = pd.read_csv(r"D:\Finance python\Data\DAX\GER30_H1_fixed.csv", index_col='date', error_bad_lines=False)
DAX_M1 = pd.read_csv(r"D:\Finance python\Data\DAX\GER30_m1_fixed.csv", index_col='date', error_bad_lines=False)
DAX_M1=DAX_M1.iloc[0:6000]
DAX_H1=DAX_H1.iloc[0:110]
DAX_H1["EWMA_8"]=DAX_H1["Open"].ewm(min_periods=8,com=3.5).mean()
DAX_H1.index =pd.to_datetime(DAX_H1.index)
DAX_M1.index =pd.to_datetime(DAX_M1.index)
DAX_M1["H1_EWMA_8"]=""
for i in DAX_M1.index:
    DAX_M1["H1_EWMA_8"][i] = DAX_H1["EWMA_8"][pd.Timestamp(datetime.datetime(i.year, i.month, i.day, i.hour))]
However, it doesn't seem to work, and even if it worked, I'd assume it would be very slow.
If I simply do the following:
DAX_M1["H1_EWMA_8"]=DAX_H1["EWMA_8"]
I don't have a value for "H1_EWMA_8" for each index:
2014-01-02 16:53:00 NaN
2014-01-02 16:54:00 NaN
2014-01-02 16:55:00 NaN
2014-01-02 16:56:00 NaN
2014-01-02 16:57:00 NaN
2014-01-02 16:58:00 NaN
2014-01-02 16:59:00 NaN
2014-01-02 17:00:00 9449.979026
2014-01-02 17:01:00 NaN
2014-01-02 17:02:00 NaN
Name: H1_EWMA_8, dtype: float64
Is there a simple way to replace the NaN value with the last available value of "H1_EWMA_8"?
Some part of DAX_M1 and DAX_H1 for illustration purposes:
DAX_M1:
Open High Low Close Total Ticks
date
2014-01-01 22:00:00 9597 9597 9597 9597 1
2014-01-02 07:05:00 9597 9619 9597 9618 18
2014-01-02 07:06:00 9618 9621 9617 9621 5
2014-01-02 07:07:00 9621 9623 9620 9623 6
2014-01-02 07:08:00 9623 9625 9623 9625 9
2014-01-02 07:09:00 9625 9625 9622 9623 6
2014-01-02 07:10:00 9623 9625 9622 9624 13
2014-01-02 07:11:00 9624 9626 9624 9626 8
2014-01-02 07:12:00 9626 9626 9623 9623 9
2014-01-02 07:13:00 9623 9625 9623 9625 5
DAX_H1:
Open High Low Close Total Ticks EWMA_8
date
2014-01-01 22:00:00 9597 9597 9597 9597 1 NaN
2014-01-02 07:00:00 9597 9626 9597 9607 322 NaN
2014-01-02 08:00:00 9607 9617 9510 9535 1730 NaN
2014-01-02 09:00:00 9535 9537 9465 9488 1428 NaN
2014-01-02 10:00:00 9488 9505 9478 9490 637 NaN
2014-01-02 11:00:00 9490 9512 9473 9496 817 NaN
2014-01-02 12:00:00 9496 9510 9495 9504 450 NaN
2014-01-02 13:00:00 9504 9514 9484 9484 547 9518.123073
2014-01-02 14:00:00 9484 9493 9424 9436 1497 9509.658500
Any help would be gladly appreciated!
Edit: This solution worked:
DAX_M1 = DAX_M1.fillna(method='ffill')
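Edit 2: Note that fillna(method='ffill') is deprecated in recent pandas in favor of ffill(). A vectorized alternative (a sketch, assuming both indexes have already been converted to DatetimeIndex as in the code above) is to reindex the hourly series onto the minute index with a forward fill, which avoids the loop entirely:
# Carry the last available hourly EWMA value forward onto every minute
DAX_M1["H1_EWMA_8"] = DAX_H1["EWMA_8"].reindex(DAX_M1.index, method='ffill')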

Related

Pandas groupby and get yesterday min value

How do I modify my code to have groupby return the previous day's min instead of the current day's min? Please see the desired output below, as it shows exactly what I am trying to achieve.
Data
import numpy as np
import pandas as pd
np.random.seed(5)
series = pd.Series(np.random.choice([1,3,5], 10), index = pd.date_range('2014-01-01', '2014-01-04', freq = '8h'))
series
2014-01-01 00:00:00 5
2014-01-01 08:00:00 3
2014-01-01 16:00:00 5
2014-01-02 00:00:00 5
2014-01-02 08:00:00 1
2014-01-02 16:00:00 3
2014-01-03 00:00:00 1
2014-01-03 08:00:00 1
2014-01-03 16:00:00 5
2014-01-04 00:00:00 1
Output after groupby
series.groupby(series.index.date).transform(min)
2014-01-01 00:00:00 3
2014-01-01 08:00:00 3
2014-01-01 16:00:00 3
2014-01-02 00:00:00 1
2014-01-02 08:00:00 1
2014-01-02 16:00:00 1
2014-01-03 00:00:00 1
2014-01-03 08:00:00 1
2014-01-03 16:00:00 1
2014-01-04 00:00:00 1
Desired output (yesterday min)
2014-01-01 00:00:00 NaN
2014-01-01 08:00:00 NaN
2014-01-01 16:00:00 NaN
2014-01-02 00:00:00 3
2014-01-02 08:00:00 3
2014-01-02 16:00:00 3
2014-01-03 00:00:00 1
2014-01-03 08:00:00 1
2014-01-03 16:00:00 1
2014-01-04 00:00:00 1
You can swap the index to just the date, calculate min per day, shift it and swap the original index back:
# Swap the index to just the date component
s = series.set_axis(series.index.date)
# Calculate the min per day, and shift it
t = s.groupby(level=0).min().shift()
# Final assembly
s[t.index] = t
s.index = series.index
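With the seeded data above, this yields NaN for the three 2014-01-01 rows, 3 for the 2014-01-02 rows, and 1 thereafter, matching the desired output.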
Let us do reindex
series[:] = series.groupby(series.index.date).min().shift().reindex(series.index.date)
series
Out[370]:
2014-01-01 00:00:00 NaN
2014-01-01 08:00:00 NaN
2014-01-01 16:00:00 NaN
2014-01-02 00:00:00 3.0
2014-01-02 08:00:00 3.0
2014-01-02 16:00:00 3.0
2014-01-03 00:00:00 1.0
2014-01-03 08:00:00 1.0
2014-01-03 16:00:00 1.0
2014-01-04 00:00:00 1.0
Freq: 8H, dtype: float64

Replace loop with groupby for rolling daily calculation

How do I modify my code to have Pandas rolling daily reset each day? Please see the desired output below, as it shows exactly what I am trying to achieve.
I think I may need to use groupby to get the same result, but I'm unsure how to proceed.
Data
import numpy as np
import pandas as pd
np.random.seed(5)
series = pd.Series(np.random.choice([1,3,5], 10), index = pd.date_range('2014-01-01', '2014-01-04', freq = '8h'))
series
2014-01-01 00:00:00 5
2014-01-01 08:00:00 3
2014-01-01 16:00:00 5
2014-01-02 00:00:00 5
2014-01-02 08:00:00 1
2014-01-02 16:00:00 3
2014-01-03 00:00:00 1
2014-01-03 08:00:00 1
2014-01-03 16:00:00 5
2014-01-04 00:00:00 1
Output after pandas rolling
series.rolling('D', min_periods=1).min()
2014-01-01 00:00:00 5.0
2014-01-01 08:00:00 3.0
2014-01-01 16:00:00 3.0
2014-01-02 00:00:00 3.0
2014-01-02 08:00:00 1.0
2014-01-02 16:00:00 1.0
2014-01-03 00:00:00 1.0
2014-01-03 08:00:00 1.0
2014-01-03 16:00:00 1.0
2014-01-04 00:00:00 1.0
Desired output (reset each day)
I can get the desired output like this but want to avoid looping:
series_list = []
for i in set(series.index.date):
    series_list.append(series.loc[str(i)].rolling('D', min_periods=1).min())
pd.concat(series_list).sort_index()
2014-01-01 00:00:00 5.0
2014-01-01 08:00:00 3.0
2014-01-01 16:00:00 3.0
2014-01-02 00:00:00 5.0
2014-01-02 08:00:00 1.0
2014-01-02 16:00:00 1.0
2014-01-03 00:00:00 1.0
2014-01-03 08:00:00 1.0
2014-01-03 16:00:00 1.0
2014-01-04 00:00:00 1.0
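You can group by the calendar date and take the cumulative minimum, which resets at each day boundary: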
series.groupby(series.index.date).cummin()
Output:
2014-01-01 00:00:00 5
2014-01-01 08:00:00 3
2014-01-01 16:00:00 3
2014-01-02 00:00:00 5
2014-01-02 08:00:00 1
2014-01-02 16:00:00 1
2014-01-03 00:00:00 1
2014-01-03 08:00:00 1
2014-01-03 16:00:00 1
2014-01-04 00:00:00 1
Freq: 8H, dtype: int64
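Note that the groupby version keeps the original integer dtype, while the rolling/loop version returns floats (rolling always produces float64); the values themselves match the loop output.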

How to resample a Pandas DataFrame at a lower frequency and stop it creating NaNs?

I have a Pandas Dataframe with a DateTime index. It has closing prices of some stocks sampled at the 1-minute interval. I want to resample this dataframe and get it at the 5-minute interval, as if it had been collected in that way. For example:
SPY AAPL
DateTime
2014-01-02 09:30:00 183.91 555.890
2014-01-02 09:31:00 183.89 556.060
2014-01-02 09:32:00 183.90 556.180
2014-01-02 09:33:00 184.00 556.550
2014-01-02 09:34:00 183.98 556.325
2014-01-02 09:35:00 183.89 554.620
2014-01-02 09:36:00 183.83 554.210
I need to get something like
SPY AAPL
DateTime
2014-01-02 09:30:00 183.91 555.890
2014-01-02 09:35:00 183.89 554.620
The natural methods would be resample() or asfreq() in Pandas. They indeed produce what I need, but with some undesired output as well. My sample has no observations from 4pm of a given weekday until 9:30am of the following day, because trading halts during those hours. These methods end up filling the dataframe with NaN during these periods, when there is actually no data to resample from. Is there any option I can use to avoid this behavior? From 4:05pm until 9:25am of the following day I get lots of NaN and nothing else!
My quick and dirty solution was the following:
Prices_5min = Prices[np.remainder(Prices.index.minute, 5) == 0]
Although I believe this is a quick and elegant solution, I would assume that resample() has some option to perform this task. Any ideas? Thanks a lot!
EDIT: Following the comment regarding the undesired output, I add the following code to showcase the problem:
New_Prices = Prices.asfreq('5min')
New_Prices.loc['2014-01-02 15:50:00':'2014-01-03 9:05:00']
Out:
SPY AAPL
DateTime
2014-01-02 15:50:00 183.12 552.83
2014-01-02 15:55:00 183.08 552.89
2014-01-02 16:00:00 182.92 553.18
2014-01-02 16:05:00 NaN NaN
2014-01-02 16:10:00 NaN NaN
... ... ...
2014-01-03 08:45:00 NaN NaN
2014-01-03 08:50:00 NaN NaN
2014-01-03 08:55:00 NaN NaN
2014-01-03 09:00:00 NaN NaN
2014-01-03 09:05:00 NaN NaN
None of these NaNs should be part of the final result. They are only there because there were no trading hours. I want to avoid that.
You could simply discard the rows containing NaN values with dropna().
Demo with a slightly modified version of your input data:
SPY AAPL
DateTime
2014-01-02 09:30:00 183.91 555.890
2014-01-02 09:31:00 183.89 556.060
2014-01-02 09:32:00 183.90 556.180
2014-01-02 09:33:00 184.00 556.550
2014-01-02 09:34:00 183.98 556.325
2014-01-02 09:45:00 183.89 554.620
2014-01-02 09:46:00 183.83 554.210
Straight resampling gives rows with NaN values:
df.asfreq('5min')
SPY AAPL
DateTime
2014-01-02 09:30:00 183.91 555.89
2014-01-02 09:35:00 NaN NaN
2014-01-02 09:40:00 NaN NaN
2014-01-02 09:45:00 183.89 554.62
which go away with dropna():
df.asfreq('5min').dropna()
SPY AAPL
DateTime
2014-01-02 09:30:00 183.91 555.89
2014-01-02 09:45:00 183.89 554.62
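Note that dropna() will also discard any rows that are genuinely missing during trading hours; if those should be kept, see the interval-based approach below.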
Overview: Create an interval index to describe trading times (09:30 to 16:00 on business days). Then find the time stamps (from resample) that are in the trading window.
import pandas as pd
bdate_range = pd.bdate_range(start='2014-01-02', periods=5)
bdate_range
trading_windows = [
    (d + pd.Timedelta('9.5h'), d + pd.Timedelta('16h'))
    for d in bdate_range
]
trading_windows
trading_windows = pd.IntervalIndex.from_tuples(trading_windows)
for t in trading_windows: print(t)
(2014-01-02 09:30:00, 2014-01-02 16:00:00]
(2014-01-03 09:30:00, 2014-01-03 16:00:00]
(2014-01-06 09:30:00, 2014-01-06 16:00:00]
(2014-01-07 09:30:00, 2014-01-07 16:00:00]
(2014-01-08 09:30:00, 2014-01-08 16:00:00]
...and I created a list of 5-minute time stamps from your example (some during trading hours, others at times when trading is halted):
stamps = [
    '2014-01-02 15:50:00',
    '2014-01-02 15:55:00',
    '2014-01-02 16:00:00',
    '2014-01-02 16:05:00',
    '2014-01-02 16:10:00',
]
stamps = pd.to_datetime(stamps)
Then, I used the .contains() method of IntervalIndex to determine whether a time stamp (from resample) falls within a trading window:
mask = [trading_windows.contains(stamp).any() for stamp in stamps]
stamps[mask]
Out[3]:
DatetimeIndex(['2014-01-02 15:50:00', '2014-01-02 15:55:00',
'2014-01-02 16:00:00'],
dtype='datetime64[ns]', freq=None)
This keeps all time stamps during the trading window (whether there are actual trades or not). And you can include holidays in the creation of 'trading_windows'.
Resampling at 5-minute frequency together with the 'last' statistic should work in your case.
You can set label='right' and closed='right' in the resample call to label each bin by its right end and include that end in the bin.
Finally, you can apply ffill to avoid time leakage.
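A sketch of that suggestion (assuming Prices is the 1-minute dataframe from the question):
# Last price in each 5-minute bin, labelled and closed on the right,
# then forward-filled so gaps are filled only with past values (no time leakage)
Prices_5min = Prices.resample('5min', label='right', closed='right').last().ffill()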

Resample python list with pandas

Fairly new to python and pandas here.
I make a query that's giving me back a timeseries. I'm never sure how many data points I receive from the query (run for a single day), but what I do know is that I need to resample them to contain 24 points (one for each hour in the day).
Printing m3hstream gives
[(1479218009000L, 109), (1479287368000L, 84)]
Then I try to make a dataframe df with
df = pd.DataFrame(data = list(m3hstream), columns=['Timestamp', 'Value'])
and this gives me an output of
Timestamp Value
0 1479218009000 109
1 1479287368000 84
Following I do this
daily_summary = pd.DataFrame()
daily_summary['value'] = df['Value'].resample('H').mean()
daily_summary = daily_summary.truncate(before=start, after=end)
print "Now daily summary"
print daily_summary
But this is giving me a TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'RangeIndex'
Could anyone please let me know how to resample it so I have 1 point for each hour in the 24 hour period that I'm querying for?
Thanks.
The first thing you need to do is convert that 'Timestamp' column to actual pd.Timestamp values. It looks like those are milliseconds.
Then resample with the on parameter set to 'Timestamp'
df = df.assign(
    Timestamp=pd.to_datetime(df.Timestamp, unit='ms')
).resample('H', on='Timestamp').mean().reset_index()
Timestamp Value
0 2016-11-15 13:00:00 109.0
1 2016-11-15 14:00:00 NaN
2 2016-11-15 15:00:00 NaN
3 2016-11-15 16:00:00 NaN
4 2016-11-15 17:00:00 NaN
5 2016-11-15 18:00:00 NaN
6 2016-11-15 19:00:00 NaN
7 2016-11-15 20:00:00 NaN
8 2016-11-15 21:00:00 NaN
9 2016-11-15 22:00:00 NaN
10 2016-11-15 23:00:00 NaN
11 2016-11-16 00:00:00 NaN
12 2016-11-16 01:00:00 NaN
13 2016-11-16 02:00:00 NaN
14 2016-11-16 03:00:00 NaN
15 2016-11-16 04:00:00 NaN
16 2016-11-16 05:00:00 NaN
17 2016-11-16 06:00:00 NaN
18 2016-11-16 07:00:00 NaN
19 2016-11-16 08:00:00 NaN
20 2016-11-16 09:00:00 84.0
If you want to fill those NaN values, use ffill, bfill, or interpolate
df.assign(
    Timestamp=pd.to_datetime(df.Timestamp, unit='ms')
).resample('H', on='Timestamp').mean().reset_index().interpolate()
Timestamp Value
0 2016-11-15 13:00:00 109.00
1 2016-11-15 14:00:00 107.75
2 2016-11-15 15:00:00 106.50
3 2016-11-15 16:00:00 105.25
4 2016-11-15 17:00:00 104.00
5 2016-11-15 18:00:00 102.75
6 2016-11-15 19:00:00 101.50
7 2016-11-15 20:00:00 100.25
8 2016-11-15 21:00:00 99.00
9 2016-11-15 22:00:00 97.75
10 2016-11-15 23:00:00 96.50
11 2016-11-16 00:00:00 95.25
12 2016-11-16 01:00:00 94.00
13 2016-11-16 02:00:00 92.75
14 2016-11-16 03:00:00 91.50
15 2016-11-16 04:00:00 90.25
16 2016-11-16 05:00:00 89.00
17 2016-11-16 06:00:00 87.75
18 2016-11-16 07:00:00 86.50
19 2016-11-16 08:00:00 85.25
20 2016-11-16 09:00:00 84.00
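Note that interpolate() defaults to method='linear', which interpolates by row position rather than by timestamp; since the resampled rows are spaced exactly one hour apart, the result here is also linear in time.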
Let's try:
daily_summary = daily_summary.set_index('Timestamp')
daily_summary.index = pd.to_datetime(daily_summary.index, unit='ms')
For once an hour:
daily_summary.resample('H').mean()
or for once a day:
daily_summary.resample('D').mean()
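Either way, the key point is that resample() needs a DatetimeIndex (or a datetime column passed via the on parameter), which is why the original RangeIndex raised the TypeError.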

Selecting only date ranges in Pandas that have data every consecutive minute

I'm trying to process some data in pandas that looks like this in the CSV:
2014.01.02,08:56,1.37549,1.37552,1.37549,1.37552,3
2014.01.02,09:00,1.37562,1.37562,1.37545,1.37545,21
2014.01.02,09:01,1.37545,1.37550,1.37542,1.37546,18
2014.01.02,09:02,1.37546,1.37550,1.37546,1.37546,15
2014.01.02,09:03,1.37546,1.37563,1.37546,1.37559,39
2014.01.02,09:04,1.37559,1.37562,1.37555,1.37561,37
2014.01.02,09:05,1.37561,1.37564,1.37558,1.37561,35
2014.01.02,09:06,1.37561,1.37566,1.37558,1.37563,38
2014.01.02,09:07,1.37563,1.37567,1.37561,1.37566,42
2014.01.02,09:08,1.37570,1.37571,1.37564,1.37566,25
I imported it using:
raw_data = pd.read_csv('raw_data.csv', engine='c', header=None, index_col=0, names=['date', 'time', 'open', 'high', 'low', 'close', 'volume'], parse_dates=[[0,1]])
But now I want to extract some random (or even continuous) samples from the data, but only ones where I have 5 consecutive minutes of data with no gaps. So, for instance, the data from 2014.01.02 08:56 can't be used because it has a gap, while the data from 2014.01.02 09:00 is fine because it is followed by 5 consecutive minutes of data.
Any suggestions on how to accomplish this in an efficient way?
Here is one way: first use .asfreq('T') to insert NaNs at the missing minutes, then use rolling_apply to check whether the previous 5 or next 5 observations contain any NaNs.
# populate NaNs at minutely freq
# ======================
df = raw_data.asfreq('T')
print(df)
open high low close volume
date_time
2014-01-02 08:56:00 1.3755 1.3755 1.3755 1.3755 3
2014-01-02 08:57:00 NaN NaN NaN NaN NaN
2014-01-02 08:58:00 NaN NaN NaN NaN NaN
2014-01-02 08:59:00 NaN NaN NaN NaN NaN
2014-01-02 09:00:00 1.3756 1.3756 1.3755 1.3755 21
2014-01-02 09:01:00 1.3755 1.3755 1.3754 1.3755 18
2014-01-02 09:02:00 1.3755 1.3755 1.3755 1.3755 15
2014-01-02 09:03:00 1.3755 1.3756 1.3755 1.3756 39
2014-01-02 09:04:00 1.3756 1.3756 1.3756 1.3756 37
2014-01-02 09:05:00 1.3756 1.3756 1.3756 1.3756 35
2014-01-02 09:06:00 1.3756 1.3757 1.3756 1.3756 38
2014-01-02 09:07:00 1.3756 1.3757 1.3756 1.3757 42
2014-01-02 09:08:00 1.3757 1.3757 1.3756 1.3757 25
consecutive_previous_5min = pd.rolling_apply(df['open'], 5, lambda g: np.isnan(g).any()) == 0
consecutive_previous_5min
date_time
2014-01-02 08:56:00 False
2014-01-02 08:57:00 False
2014-01-02 08:58:00 False
2014-01-02 08:59:00 False
2014-01-02 09:00:00 False
2014-01-02 09:01:00 False
2014-01-02 09:02:00 False
2014-01-02 09:03:00 False
2014-01-02 09:04:00 True
2014-01-02 09:05:00 True
2014-01-02 09:06:00 True
2014-01-02 09:07:00 True
2014-01-02 09:08:00 True
Freq: T, dtype: bool
# use the reverse trick to get the next 5 values
consecutive_next_5min = (pd.rolling_apply(df['open'][::-1], 5, lambda g: np.isnan(g).any()) == 0)[::-1]
consecutive_next_5min
date_time
2014-01-02 08:56:00 False
2014-01-02 08:57:00 False
2014-01-02 08:58:00 False
2014-01-02 08:59:00 False
2014-01-02 09:00:00 True
2014-01-02 09:01:00 True
2014-01-02 09:02:00 True
2014-01-02 09:03:00 True
2014-01-02 09:04:00 True
2014-01-02 09:05:00 False
2014-01-02 09:06:00 False
2014-01-02 09:07:00 False
2014-01-02 09:08:00 False
Freq: T, dtype: bool
# keep rows with either have recent 5 or next 5 elements non-null
df.loc[consecutive_next_5min | consecutive_previous_5min]
open high low close volume
date_time
2014-01-02 09:00:00 1.3756 1.3756 1.3755 1.3755 21
2014-01-02 09:01:00 1.3755 1.3755 1.3754 1.3755 18
2014-01-02 09:02:00 1.3755 1.3755 1.3755 1.3755 15
2014-01-02 09:03:00 1.3755 1.3756 1.3755 1.3756 39
2014-01-02 09:04:00 1.3756 1.3756 1.3756 1.3756 37
2014-01-02 09:05:00 1.3756 1.3756 1.3756 1.3756 35
2014-01-02 09:06:00 1.3756 1.3757 1.3756 1.3756 38
2014-01-02 09:07:00 1.3756 1.3757 1.3756 1.3757 42
2014-01-02 09:08:00 1.3757 1.3757 1.3756 1.3757 25
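pd.rolling_apply has since been removed from pandas. A modern equivalent of the same two masks (a sketch, assuming df as above) counts the non-null values in each 5-row window:
valid = df['open'].notna()
# True where the current minute and the previous 4 all have data
consecutive_previous_5min = valid.rolling(5).sum() == 5
# same reverse trick for the current minute and the next 4
consecutive_next_5min = valid[::-1].rolling(5).sum()[::-1] == 5
df.loc[consecutive_previous_5min | consecutive_next_5min]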