How to calculate daily averages from noon to noon with pandas?

I am fairly new to python and pandas, so I apologise for any future misunderstandings.
I have a pandas DataFrame with hourly values, looking something like this:
2014-04-01 09:00:00 52.9 41.1 36.3
2014-04-01 10:00:00 56.4 41.6 70.8
2014-04-01 11:00:00 53.3 41.2 49.6
2014-04-01 12:00:00 50.4 39.5 36.6
2014-04-01 13:00:00 51.1 39.2 33.3
...
2016-11-30 16:00:00 16.0 13.5 36.6
2016-11-30 17:00:00 19.6 17.4 44.3
Now I need to calculate 24-hour average values for each column, starting from 2014-04-01 12:00 to 2014-04-02 11:00, and so on.
So I want daily averages from noon to noon.
Unfortunately, I have no idea how to do that. I have read some suggestions to use groupby, but I don't really know how...
Thank you very much in advance! Any help is appreciated!!

For newer versions of pandas (>= 1.1.0) use the offset argument:
df.resample('24H', offset='12H').mean()
For older versions of pandas, use the base argument.
A day is 24 hours, so base=12 starts the grouping at noon, giving noon-to-noon bins. Resample gives you all days in between, so you can .dropna(how='all') if you don't need the complete range. (I assume you have a DatetimeIndex; if not, you can use the on argument of resample to specify your datetime column.)
df.resample('24H', base=12).mean()
#df.groupby(pd.Grouper(level=0, base=12, freq='24H')).mean() # Equivalent
1 2 3
0
2014-03-31 12:00:00 54.20 41.30 52.233333
2014-04-01 12:00:00 50.75 39.35 34.950000
2014-04-02 12:00:00 NaN NaN NaN
2014-04-03 12:00:00 NaN NaN NaN
2014-04-04 12:00:00 NaN NaN NaN
... ... ... ...
2016-11-26 12:00:00 NaN NaN NaN
2016-11-27 12:00:00 NaN NaN NaN
2016-11-28 12:00:00 NaN NaN NaN
2016-11-29 12:00:00 NaN NaN NaN
2016-11-30 12:00:00 17.80 15.45 40.450000
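Since resample emits every day in the range, most rows above are empty; as noted, you can drop them:
df.resample('24H', base=12).mean().dropna(how='all')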

You could subtract 12 hours from your index and group by the normalized (midnight-floored) result:
df.groupby((df.index - pd.to_timedelta('12:00:00')).normalize()).mean()
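The group labels land on midnight of the shifted day; if you want the bins labelled at noon again, add the 12 hours back (a small sketch):
out = df.groupby((df.index - pd.to_timedelta('12:00:00')).normalize()).mean()
out.index = out.index + pd.to_timedelta('12:00:00')  # relabel bins back to noon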

You can shift the index back by 12 hours and resample at the day level.
from io import StringIO
import pandas as pd
data = """
2014-04-01 09:00:00,52.9,41.1,36.3
2014-04-01 10:00:00,56.4,41.6,70.8
2014-04-01 11:00:00,53.3,41.2,49.6
2014-04-01 12:00:00,50.4,39.5,36.6
2014-04-01 13:00:00,51.1,39.2,33.3
2016-11-30 16:00:00,16.0,13.5,36.6
2016-11-30 17:00:00,19.6,17.4,44.3
"""
df = pd.read_csv(StringIO(data), sep=',', header=None, index_col=0)
df.index = pd.to_datetime(df.index)
# shift by 12 hours
df.index = df.index - pd.Timedelta(hours=12)
# resample and drop na rows
df.resample('D').mean().dropna()
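For the sample data this should give (labels are the shifted dates; add 12 hours back if you want noon timestamps):
1 2 3
0
2014-03-31 54.20 41.30 52.233333
2014-04-01 50.75 39.35 34.950000
2016-11-30 17.80 15.45 40.450000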

Related

Transform an hourly dataframe into a monthly totalled dataframe in Python

I have a Pandas dataframe containing hourly precipitation data (tp) between 2013 and 2020, the dataframe is called df:
tp
time
2013-01-01 00:00:00 0.1
2013-01-01 01:00:00 0.1
2013-01-01 02:00:00 0.1
2013-01-01 03:00:00 0.0
2013-01-01 04:00:00 0.2
...
2020-12-31 19:00:00 0.2
2020-12-31 20:00:00 0.1
2020-12-31 21:00:00 0.0
2020-12-31 22:00:00 0.1
2020-12-31 23:00:00 0.0
I'm trying to convert this hourly dataset into monthly totals for each year. I then want to take an average of the monthly summed rainfall, so that I end up with a data frame of 12 rows, one per month, showing the average summed rainfall over the whole period.
I've tried the resample function:
df.resample('M').mean()
However, this outputs the following and is not what I'm looking to achieve:
tp1
time
2013-01-31 0.121634
2013-02-28 0.318097
2013-03-31 0.356973
2013-04-30 0.518160
2013-05-31 0.055290
...
2020-09-30 0.132713
2020-10-31 0.070817
2020-11-30 0.060525
2020-12-31 0.040002
2021-01-31 0.000000
[97 rows x 1 columns]
While it's converting the hourly data to monthly, I want to show an average of the rainfall across the years.
e.g.
January Column = Average of January rainfall between 2013 and 2020.
Assuming your index is a DatetimeIndex, you can use:
out = df.groupby(df.index.month).mean()
print(out)
# Output
tp1
time
1 0.498262
2 0.502057
3 0.502644
4 0.496880
5 0.499100
6 0.497931
7 0.504981
8 0.497841
9 0.499646
10 0.499804
11 0.506938
12 0.501172
Setup:
import pandas as pd
import numpy as np
np.random.seed(2022)
dti = pd.date_range('2013-01-31', '2021-01-31', freq='H', name='time')
df = pd.DataFrame({'tp1': np.random.random(len(dti))}, index=dti)
print(df)
# Output
tp1
time
2013-01-31 00:00:00 0.009359
2013-01-31 01:00:00 0.499058
2013-01-31 02:00:00 0.113384
2013-01-31 03:00:00 0.049974
2013-01-31 04:00:00 0.685408
... ...
2021-01-30 20:00:00 0.021295
2021-01-30 21:00:00 0.275759
2021-01-30 22:00:00 0.367263
2021-01-30 23:00:00 0.777680
2021-01-31 00:00:00 0.021225
[70129 rows x 1 columns]
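Note that this averages the hourly values within each month. If you are after the average of the monthly totals (rainfall summed per month, then averaged across years, as the question describes), a sketch would be to resample to monthly sums first:
monthly = df.resample('M').sum()                       # total rainfall per calendar month
out = monthly.groupby(monthly.index.month).mean()      # average each calendar month across years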

Reindexing timeseries data

I have an issue similar to "ValueError: cannot reindex from a duplicate axis", but the solution isn't provided there.
I have an excel file containing multiple rows and columns of weather data. Data is missing at certain intervals, although that is not shown in the sample below. I want to reindex the time column at 5-minute intervals so that I can interpolate the missing values. Data sample:
Date Time Temp Hum Dewpnt WindSpd
04/01/18 12:05 a 30.6 49 18.7 2.7
04/01/18 12:10 a NaN 51 19.3 1.3
04/01/18 12:20 a 30.7 NaN 19.1 2.2
04/01/18 12:30 a 30.7 51 19.4 2.2
04/01/18 12:40 a 30.9 51 19.6 0.9
Here's what I have tried.
import pandas as pd
ts = pd.read_excel(r'E:\DATA\AP.xlsx')  # raw string so the backslashes stay literal
ts['Time'] = pd.to_datetime(ts['Time'])
ts.set_index('Time', inplace=True)
dt = pd.date_range("2018-04-01 00:00:00", "2018-05-01 00:00:00", freq='5min', name='T')
idx = pd.DatetimeIndex(dt)
ts.reindex(idx)
I just want to have my index at 5 min frequency so that I can interpolate the NaN later. Expected output:
Date Time Temp Hum Dewpnt WindSpd
04/01/18 12:05 a 30.6 49 18.7 2.7
04/01/18 12:10 a NaN 51 19.3 1.3
04/01/18 12:15 a NaN NaN NaN NaN
04/01/18 12:20 a 30.7 NaN 19.1 2.2
04/01/18 12:25 a NaN NaN NaN NaN
04/01/18 12:30 a 30.7 51 19.4 2.2
One more approach.
df['Time'] = pd.to_datetime(df['Time'])
df = df.set_index(['Time']).resample('5min').last().reset_index()
df['Time'] = df['Time'].dt.time
df
output
Time Date Temp Hum Dewpnt WindSpd
0 00:05:00 4/1/2018 30.6 49.0 18.7 2.7
1 00:10:00 4/1/2018 NaN 51.0 19.3 1.3
2 00:15:00 NaN NaN NaN NaN NaN
3 00:20:00 4/1/2018 30.7 NaN 19.1 2.2
4 00:25:00 NaN NaN NaN NaN NaN
5 00:30:00 4/1/2018 30.7 51.0 19.4 2.2
6 00:35:00 NaN NaN NaN NaN NaN
7 00:40:00 4/1/2018 30.9 51.0 19.6 0.9
If times from multiple dates have to be re-sampled, you can use the code below.
However, you will have to separate the 'Date' & 'Time' columns later.
df1['DateTime'] = df1['Date']+df1['Time']
df1['DateTime'] = pd.to_datetime(df1['DateTime'],format='%d/%m/%Y%I:%M %p')
df1 = df1.set_index(['DateTime']).resample('5min').last().reset_index()
df1
Output
DateTime Date Time Temp Hum Dewpnt WindSpd
0 2018-01-04 00:05:00 4/1/2018 12:05 AM 30.6 49.0 18.7 2.7
1 2018-01-04 00:10:00 4/1/2018 12:10 AM NaN 51.0 19.3 1.3
2 2018-01-04 00:15:00 NaN NaN NaN NaN NaN NaN
3 2018-01-04 00:20:00 4/1/2018 12:20 AM 30.7 NaN 19.1 2.2
4 2018-01-04 00:25:00 NaN NaN NaN NaN NaN NaN
5 2018-01-04 00:30:00 4/1/2018 12:30 AM 30.7 51.0 19.4 2.2
6 2018-01-04 00:35:00 NaN NaN NaN NaN NaN NaN
7 2018-01-04 00:40:00 4/1/2018 12:40 AM 30.9 51.0 19.6 0.9
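To separate them afterwards, something like this should work:
df1['Date'] = df1['DateTime'].dt.date
df1['Time'] = df1['DateTime'].dt.time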
You can try this for example:
import pandas as pd
ts = pd.read_excel(r'E:\DATA\AP.xlsx')  # raw string so the backslashes stay literal
ts['Time'] = pd.to_datetime(ts['Time'])
ts.set_index('Time', inplace=True)
ts.resample('5T').mean()
More information here: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.resample.html
Set the Time column as the index, making sure it is DateTime type, then try
ts.asfreq('5T')
use
ts.asfreq('5T', method='ffill')
to pull previous values forward.
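Since the goal is to interpolate the NaNs afterwards, a minimal sketch combining the two steps:
ts = ts.asfreq('5T')      # insert the missing 5-minute slots as NaN
ts = ts.interpolate()     # linearly fill the numeric gaps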
I would take the approach of creating a blank table and filling it in with the data as it comes from your data source. For this example three observations are read in as NaN, and the rows for 1:15 and 1:25 are missing.
import pandas as pd
import numpy as np
rawpd = pd.read_excel('raw.xlsx')
print(rawpd)
Date Time Col1 Col2
0 2018-04-01 01:00:00 1.0 10.0
1 2018-04-01 01:05:00 2.0 NaN
2 2018-04-01 01:10:00 NaN 10.0
3 2018-04-01 01:20:00 NaN 10.0
4 2018-04-01 01:30:00 5.0 10.0
Now create a dataframe targpd with the ideal structure.
time5min = pd.date_range(start='2018/04/1 01:00',periods=7,freq='5min')
targpd = pd.DataFrame(np.nan,index = time5min,columns=['Col1','Col2'])
print(targpd)
Col1 Col2
2018-04-01 01:00:00 NaN NaN
2018-04-01 01:05:00 NaN NaN
2018-04-01 01:10:00 NaN NaN
2018-04-01 01:15:00 NaN NaN
2018-04-01 01:20:00 NaN NaN
2018-04-01 01:25:00 NaN NaN
2018-04-01 01:30:00 NaN NaN
Now the trick is to update targpd with the data sent to you in rawpd. For this to happen the Date and Time columns have to be combined in rawpd and made into an index.
print(rawpd.Date,rawpd.Time)
0 2018-04-01
1 2018-04-01
2 2018-04-01
3 2018-04-01
4 2018-04-01
Name: Date, dtype: datetime64[ns]
0 01:00:00
1 01:05:00
2 01:10:00
3 01:20:00
4 01:30:00
Name: Time, dtype: object
You can see the trick in all this above: your date data was converted to datetime, but your time data is just a string. Below, a proper index is created by use of a lambda function.
from datetime import datetime  # pd.datetime was removed in pandas 1.0
rawidx = rawpd.apply(lambda r: datetime.combine(r['Date'], r['Time']), axis=1)
print(rawidx)
This can be applied to the rawpd database as an index.
rawpd2=pd.DataFrame(rawpd[['Col1','Col2']].values,index=rawidx,columns=['Col1','Col2'])
rawpd2=rawpd2.sort_index()
print(rawpd2)
Once this is in place the update command can get you what you want.
targpd.update(rawpd2,overwrite=True)
print(targpd)
Col1 Col2
2018-04-01 01:00:00 1.0 10.0
2018-04-01 01:05:00 2.0 NaN
2018-04-01 01:10:00 NaN 10.0
2018-04-01 01:15:00 NaN NaN
2018-04-01 01:20:00 NaN 10.0
2018-04-01 01:25:00 NaN NaN
2018-04-01 01:30:00 5.0 10.0
You now have a frame ready for interpolation.
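For example (a sketch; linear interpolation over the numeric columns):
targpd = targpd.interpolate()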
I have got it to work. Thank you everyone for your time. I am providing the working code.
import pandas as pd
df = pd.read_excel(r'E:\DATA\AP.xlsx', sheet_name='Sheet1', parse_dates=[['Date', 'Time']])
df = df.set_index(['Date_Time']).resample('5min').last().reset_index()
print(df)
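To then fill the NaNs, the interpolation the question was building towards can be chained on (a sketch):
df = df.set_index('Date_Time').interpolate().reset_index()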

Pandas : merge on date and hour from datetime index

I have two data frames like the following: data frame A has datetimes down to the minute, data frame B only has the hour.
df:A
dataDate original
2018-09-30 11:20:00 3
2018-10-01 12:40:00 10
2018-10-02 07:00:00 5
2018-10-27 12:50:00 5
2018-11-28 19:45:00 7
df:B
dataDate count
2018-09-30 10:00:00 300
2018-10-01 12:00:00 50
2018-10-02 07:00:00 120
2018-10-27 12:00:00 234
2018-11-28 19:05:00 714
I'd like to merge the two on the basis of date and hour, so that data frame A gets all its rows filled by the merge on date and hour.
I can try to do it via
A['date'] = A.dataDate.dt.date
B['date'] = B.dataDate.dt.date
A['hour'] = A.dataDate.dt.hour
B['hour'] = B.dataDate.dt.hour
and then merge
merge_df = pd.merge(A, B, how='left', left_on=['date', 'hour'],
                    right_on=['date', 'hour'])
but it's a very long process. Is there an efficient way to perform the same operation with the help of pandas' time series or date functionality?
Use map if you only need to append one column from B to A, with floor to set minutes and seconds (if present) to 0:
d = dict(zip(B.dataDate.dt.floor('H'), B['count']))
A['count'] = A.dataDate.dt.floor('H').map(d)
print (A)
dataDate original count
0 2018-09-30 11:20:00 3 NaN
1 2018-10-01 12:40:00 10 50.0
2 2018-10-02 07:00:00 5 120.0
3 2018-10-27 12:50:00 5 234.0
4 2018-11-28 19:45:00 7 714.0
For a general solution, use DataFrame.join:
A.index = A.dataDate.dt.floor('H')
B.index = B.dataDate.dt.floor('H')
A = A.join(B, lsuffix='_left')
print (A)
dataDate_left original dataDate count
dataDate
2018-09-30 11:00:00 2018-09-30 11:20:00 3 NaT NaN
2018-10-01 12:00:00 2018-10-01 12:40:00 10 2018-10-01 12:00:00 50.0
2018-10-02 07:00:00 2018-10-02 07:00:00 5 2018-10-02 07:00:00 120.0
2018-10-27 12:00:00 2018-10-27 12:50:00 5 2018-10-27 12:00:00 234.0
2018-11-28 19:00:00 2018-11-28 19:45:00 7 2018-11-28 19:05:00 714.0
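If you prefer the merge from the question, flooring once into a temporary key column avoids the separate date/hour columns (a sketch; 'key' is a hypothetical column name):
merge_df = pd.merge(
    A.assign(key=A.dataDate.dt.floor('H')),
    B.assign(key=B.dataDate.dt.floor('H')),
    on='key', how='left', suffixes=('', '_B'),
).drop(columns='key')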

Resample python list with pandas

Fairly new to python and pandas here.
I make a query that's giving me back a timeseries. I'm never sure how many data points I receive from the query (run for a single day), but what I do know is that I need to resample them to contain 24 points (one for each hour in the day).
Printing m3hstream gives
[(1479218009000L, 109), (1479287368000L, 84)]
Then I try to make a dataframe df with
df = pd.DataFrame(data = list(m3hstream), columns=['Timestamp', 'Value'])
and this gives me an output of
Timestamp Value
0 1479218009000 109
1 1479287368000 84
Following I do this
daily_summary = pd.DataFrame()
daily_summary['value'] = df['Value'].resample('H').mean()
daily_summary = daily_summary.truncate(before=start, after=end)
print "Now daily summary"
print daily_summary
But this is giving me a TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'RangeIndex'
Could anyone please let me know how to resample it so I have 1 point for each hour in the 24 hour period that I'm querying for?
Thanks.
The first thing you need to do is convert that 'Timestamp' column to actual timestamps; it looks like those are milliseconds.
Then resample with the on parameter set to 'Timestamp':
df = df.assign(
    Timestamp=pd.to_datetime(df.Timestamp, unit='ms')
).resample('H', on='Timestamp').mean().reset_index()
Timestamp Value
0 2016-11-15 13:00:00 109.0
1 2016-11-15 14:00:00 NaN
2 2016-11-15 15:00:00 NaN
3 2016-11-15 16:00:00 NaN
4 2016-11-15 17:00:00 NaN
5 2016-11-15 18:00:00 NaN
6 2016-11-15 19:00:00 NaN
7 2016-11-15 20:00:00 NaN
8 2016-11-15 21:00:00 NaN
9 2016-11-15 22:00:00 NaN
10 2016-11-15 23:00:00 NaN
11 2016-11-16 00:00:00 NaN
12 2016-11-16 01:00:00 NaN
13 2016-11-16 02:00:00 NaN
14 2016-11-16 03:00:00 NaN
15 2016-11-16 04:00:00 NaN
16 2016-11-16 05:00:00 NaN
17 2016-11-16 06:00:00 NaN
18 2016-11-16 07:00:00 NaN
19 2016-11-16 08:00:00 NaN
20 2016-11-16 09:00:00 84.0
If you want to fill those NaN values, use ffill, bfill, or interpolate
df.assign(
    Timestamp=pd.to_datetime(df.Timestamp, unit='ms')
).resample('H', on='Timestamp').mean().reset_index().interpolate()
Timestamp Value
0 2016-11-15 13:00:00 109.00
1 2016-11-15 14:00:00 107.75
2 2016-11-15 15:00:00 106.50
3 2016-11-15 16:00:00 105.25
4 2016-11-15 17:00:00 104.00
5 2016-11-15 18:00:00 102.75
6 2016-11-15 19:00:00 101.50
7 2016-11-15 20:00:00 100.25
8 2016-11-15 21:00:00 99.00
9 2016-11-15 22:00:00 97.75
10 2016-11-15 23:00:00 96.50
11 2016-11-16 00:00:00 95.25
12 2016-11-16 01:00:00 94.00
13 2016-11-16 02:00:00 92.75
14 2016-11-16 03:00:00 91.50
15 2016-11-16 04:00:00 90.25
16 2016-11-16 05:00:00 89.00
17 2016-11-16 06:00:00 87.75
18 2016-11-16 07:00:00 86.50
19 2016-11-16 08:00:00 85.25
20 2016-11-16 09:00:00 84.00
Let's try setting a proper DatetimeIndex first (note this has to happen on df, which still holds the 'Timestamp' column, not on daily_summary):
df = df.set_index('Timestamp')
df.index = pd.to_datetime(df.index, unit='ms')
For once an hour:
df.resample('H').mean()
or for once a day:
df.resample('D').mean()

Python pandas time series w/ hierarchical indexing and rolling/ shifting

Struggling with pandas' rolling and shifting concepts. There are many good suggestions, including in this forum, but I have failed miserably to apply them to my scenario.
Right now I use traditional looping over the time series, but ugh, it took about 8 hours to iterate over 150,000 rows, which is roughly 3 days of data for all tickers. With 2 months of data to process it probably won't finish before I'm back from a sabbatical, not to mention the risk of an electricity cut, after which I'd have to start over.
I have the following 15 min stock price time series (hierarchical index on datetime (timestamp) and ticker; the only original column is closePrice):
closePrice
datetime ticker
2014-02-04 09:15:00 AAPL xxx
EQIX xxx
FB xxx
GOOG xxx
MSFT xxx
2014-02-04 09:30:00 AAPL xxx
EQIX xxx
FB xxx
GOOG xxx
MSFT xxx
2014-02-04 09:45:00 AAPL xxx
EQIX xxx
FB xxx
GOOG xxx
MSFT xxx
I need to add two columns:
12sma, a 12-period moving average. Having searched SO for hours, the best suggestion seems to be rolling_mean, so I tried it. But it didn't work given my TS structure: it runs top-down, so the first MA is calculated from the first 12 rows regardless of the ticker values. How do I make it average based on the index, i.e. first datetime then ticker, so I get the MA for, say, AAPL? Currently it does (AAPL+EQIX+FB+GOOG+MSFT+AAPL...up to 12th row) / 12.
Once I have the 12sma column, I need a 12ema column, a 12-period exponential MA. For the calculation, the first value in the time series for each ticker just copies the 12sma value from the same row. After that, I need the closePrice from the same row and the 12ema from the previous row, i.e. from 15 min earlier. I did a lot of research; it seems the solution would be a combination of rolling and shifting, but I can't figure out how to put it together.
I'd be grateful for any help. Thanks.
EDIT:
Thanks to Jeff's tips, after swapping and sorting the index levels I am able to get the 12sma right with rolling_mean(), and with some effort managed to insert the first 12ema value, copied from 12sma at the same timestamp:
close 12sma 12ema
sec_code datetime
AAPL 2014-02-05 11:45:00 113.0 NaN NaN
2014-02-05 12:00:00 113.2 NaN NaN
2014-02-05 13:15:00 112.9 NaN NaN
2014-02-05 13:30:00 113.2 NaN NaN
2014-02-05 13:45:00 113.0 NaN NaN
2014-02-05 14:00:00 113.1 NaN NaN
2014-02-05 14:15:00 113.3 NaN NaN
2014-02-05 14:30:00 113.3 NaN NaN
2014-02-05 14:45:00 113.3 NaN NaN
2014-02-05 15:00:00 113.2 NaN NaN
2014-02-05 15:15:00 113.2 NaN NaN
2014-02-05 15:30:00 113.3 113.16 113.16
2014-02-05 15:45:00 113.3 113.19 NaN
2014-02-05 16:00:00 113.2 113.19 NaN
2014-02-06 09:45:00 112.6 113.16 NaN
2014-02-06 10:00:00 113.5 113.19 NaN
2014-02-06 10:15:00 113.8 113.25 NaN
2014-02-06 10:30:00 113.5 113.29 NaN
2014-02-06 10:45:00 113.7 113.32 NaN
2014-02-06 11:00:00 113.5 113.34 NaN
I understand pandas has pandas.stats.moments.ewma, but I prefer to use a formula I got from a book, which needs the close price 'at the moment' and the 12ema from the previous row.
So I tried to fill the 12ema column from Feb 5, 15:45 onward. I tried apply() with a function, but shift gave an error:
def f12ema(x):
    K = 2 / (12 + 1)
    return x['price_nom'] * K + x['12ema'].shift(-1) * (1 - K)

df1.apply(f12ema, axis=1)
AttributeError: ("'numpy.float64' object has no attribute 'shift'", u'occurred at index 2014-02-05 11:45:00')
Another possibility that crossed my mind is rolling_apply(), but it is beyond my knowledge.
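For what it's worth, the book formula (K = 2/(12+1); ema_t = K*close_t + (1-K)*ema_(t-1)) is exactly what an exponentially weighted mean with adjust=False computes, so a per-ticker sketch in current pandas would be (note this seeds from the first close rather than copying the 12sma, so the first few values differ slightly from the scheme described above):
df['12ema'] = df.groupby(level='sec_code')['close'].transform(
    lambda x: x.ewm(span=12, adjust=False).mean())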
Create a date range inclusive of the times you want
In [47]: rng = date_range('20130102 09:30:00','20130105 16:00:00',freq='15T')
In [48]: rng = rng.take(rng.indexer_between_time('09:30:00','16:00:00'))
In [49]: rng
Out[49]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-01-02 09:30:00, ..., 2013-01-05 16:00:00]
Length: 108, Freq: None, Timezone: None
Create a frame similar to yours (2000 tickers x dates)
In [50]: df = DataFrame(np.random.randn(len(rng)*2000,1),columns=['close'],index=MultiIndex.from_product([rng,range(2000)],names=['date','ticker']))
Reorder the levels so that the index is ticker x date, and SORT IT!!!!
In [51]: df = df.swaplevel('ticker','date').sortlevel()
In [52]: df
Out[52]:
close
ticker date
0 2013-01-02 09:30:00 0.840767
2013-01-02 09:45:00 1.808101
2013-01-02 10:00:00 -0.718354
2013-01-02 10:15:00 -0.484502
2013-01-02 10:30:00 0.563363
2013-01-02 10:45:00 0.553920
2013-01-02 11:00:00 1.266992
2013-01-02 11:15:00 -0.641117
2013-01-02 11:30:00 -0.574673
2013-01-02 11:45:00 0.861825
2013-01-02 12:00:00 -1.562111
2013-01-02 12:15:00 -0.072966
2013-01-02 12:30:00 0.673079
2013-01-02 12:45:00 0.766105
2013-01-02 13:00:00 0.086202
2013-01-02 13:15:00 0.949205
2013-01-02 13:30:00 -0.381194
2013-01-02 13:45:00 0.316813
2013-01-02 14:00:00 -0.620176
2013-01-02 14:15:00 -0.193126
2013-01-02 14:30:00 -1.552111
2013-01-02 14:45:00 1.724429
2013-01-02 15:00:00 -0.092393
2013-01-02 15:15:00 0.197763
2013-01-02 15:30:00 0.064541
2013-01-02 15:45:00 -1.574853
2013-01-02 16:00:00 -1.023979
2013-01-03 09:30:00 -0.079349
2013-01-03 09:45:00 -0.749285
2013-01-03 10:00:00 0.652721
2013-01-03 10:15:00 -0.818152
2013-01-03 10:30:00 0.674068
2013-01-03 10:45:00 2.302714
2013-01-03 11:00:00 0.614686
...
[216000 rows x 1 columns]
Groupby the ticker. Return a DataFrame for each ticker that is the application of rolling_mean and ewma. Note that there are many options for controlling this, e.g. windowing; you could make it not wrap around days, etc.
In [53]: df.groupby(level='ticker')['close'].apply(lambda x: concat({ 'spma' : pd.rolling_mean(x,3), 'ema' : pd.ewma(x,3) }, axis=1))
Out[53]:
ema spma
ticker date
0 2013-01-02 09:30:00 0.840767 NaN
2013-01-02 09:45:00 1.393529 NaN
2013-01-02 10:00:00 0.480282 0.643504
2013-01-02 10:15:00 0.127447 0.201748
2013-01-02 10:30:00 0.270334 -0.213164
2013-01-02 10:45:00 0.356580 0.210927
2013-01-02 11:00:00 0.619245 0.794758
2013-01-02 11:15:00 0.269100 0.393265
2013-01-02 11:30:00 0.041032 0.017067
2013-01-02 11:45:00 0.258476 -0.117988
2013-01-02 12:00:00 -0.216742 -0.424986
2013-01-02 12:15:00 -0.179622 -0.257750
2013-01-02 12:30:00 0.038741 -0.320666
2013-01-02 12:45:00 0.223881 0.455406
2013-01-02 13:00:00 0.188995 0.508462
2013-01-02 13:15:00 0.380972 0.600504
2013-01-02 13:30:00 0.188987 0.218071
2013-01-02 13:45:00 0.221125 0.294942
2013-01-02 14:00:00 0.009907 -0.228185
2013-01-02 14:15:00 -0.041013 -0.165496
2013-01-02 14:30:00 -0.419688 -0.788471
2013-01-02 14:45:00 0.117299 -0.006936
2013-01-04 10:00:00 -0.060415 0.341013
2013-01-04 10:15:00 0.074068 0.604611
2013-01-04 10:30:00 -0.108502 0.440256
2013-01-04 10:45:00 -0.514229 -0.636702
... ...
[216000 rows x 2 columns]
Pretty good perf, as it's essentially looping over the tickers.
In [54]: %timeit df.groupby(level='ticker')['close'].apply(lambda x: concat({ 'spma' : pd.rolling_mean(x,3), 'ema' : pd.ewma(x,3) }, axis=1))
1 loops, best of 3: 2.1 s per loop
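rolling_mean and ewma were removed in pandas 1.0; a sketch of the equivalent with the current API (same window and center-of-mass values as above):
df.groupby(level='ticker')['close'].apply(
    lambda x: pd.concat({'spma': x.rolling(3).mean(),
                         'ema': x.ewm(com=3).mean()}, axis=1))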
