Pandas Multi-Index EWMA: Comparing same minute over multiple days - python

I am trying to plug a data set into Pandas and am doing something a bit unique with the approach.
I have a data set that looks like the following:
Date, Time, Venue, Volume, SummedVolume
2015-09-14, 09:30, NYSE, 1000, 10000
2015-09-14, 09:31, NYSE, 1100, 10100
However, I have this data sliced by minute per date. I have files going back a number of days, so I read a certain number of them and concat them into my DataFrame, typically using the last 20 days.
What I would like to do is use pandas ewma to do an ewma on the exact same minute of the day, across those 20 days, by Venue. So the result would be comparing the 09:30 minute across the last 20 days for NYSE, using an alpha of 0.5 (which I think would be span=20 in this case). Obviously, sorting the data so that the oldest data is at the back and the newest data is at the front is critical, so I am doing that as well; the data can't be in a random order.
Right now I am able to get pandas to do simple math (means, etc) on this data set using groupby on Time and Venue (shown below). However, when I try to do an ewma on this, I get errors about not being able to do an ewma on a non-unique data set - which is reasonable. But adding the Date into the MultiIndex kind of wrecks being able to compare the same exact minute to that minute on other dates.
Can anyone think of a solution here?
import pandas as pd

frame = pd.DataFrame()
concat = []
for fn in files:
    df = pd.read_csv(fn, index_col=None, header=0)
    concat.append(df)
frame = pd.concat(concat)
df = pd.DataFrame(frame)
if conf == "VenueStats":
    grouped = df.groupby(['time', 'Venue'], sort=True)
elif conf == "SymbolStats":
    grouped = df.groupby(['time', 'Symbol'], sort=True)
stats = grouped.mean().astype(int)
stats.to_csv('out.csv')
Initial output from df.head() before the mean (I changed the Venue names and values to 0 since this is sensitive information):
Date Time Venue Volume SummedVolume
0 2015-09-14 17:00 NYSE 0 0
1 2015-09-14 17:00 ARCA 0 0
2 2015-09-14 17:00 AMEX 0 0
3 2015-09-14 17:00 NASDAQ 0 0
4 2015-09-14 17:00 BATS 0 0
Output from stats.head() after the mean:
Volume SummedVolume
Time Venue
00:00 NYSE 0 0
ARCA 0 0
AMEX 0 0
NASDAQ 0 0
BATS 0 0
Here is what is different between the mean version (above) and the ewma version:
for fn in files:
    df = pd.read_csv(fn, index_col=[0,1,2], header=0)  # 0=Date, 1=Time, 2=Venue
    concat.append(df)
frame = pd.concat(concat)
df = pd.DataFrame(frame, columns=['Volume', 'SummedVolume'])
if conf == "VenueStats":
    stats = df.groupby(df.index).apply(lambda x: pd.ewma(x, span=20))
elif conf == "SymbolStats":
    stats = df.groupby(df.index).apply(lambda x: pd.ewma(x, span=20))
Here are df.head() and stats.head() from the ewma version (they look the same):
Volume SummedVolume
Date Time Venue
2015-09-14 17:00 NYSE 0 0
ARCA 0 0
AMEX 0 0
NASDAQ 0 0
BATS 0 0
Volume SummedVolume
Date Time Venue
2015-09-14 17:00 NYSE 0 0
ARCA 0 0
AMEX 0 0
NASDAQ 0 0
BATS 0 0

You want to pivot your data so that dates run down one axis and times across the other.
It is difficult to work on this problem without some reproducible data, but the solution would be something like this:
df2 = (df.reset_index()
         .groupby(['tradeDate', 'time', 'exchange'])
         .first()  # given that the data is unique by the selected grouping
         .unstack(['exchange', 'time']))
pd.ewma(df2, span=20)
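pd.ewma was removed in later pandas versions; a minimal sketch of the same idea with the current .ewm API, assuming the question's column names (Date, Time, Venue), would be:
# sketch only: pivot so dates run down the rows and (Venue, Time) across the columns,
# then take the EWMA down each column, i.e. the same minute compared across days
df2 = (df.reset_index()
         .groupby(['Date', 'Time', 'Venue'])
         .first()                      # data is unique per (Date, Time, Venue)
         .unstack(['Venue', 'Time'])
         .sort_index())                # oldest date first, so the newest day carries the most weight
stats = df2.ewm(span=20).mean()
Note that in pandas alpha = 2 / (span + 1), so span=20 corresponds to alpha ≈ 0.095; to use an alpha of 0.5 directly, pass alpha=0.5 to .ewm() instead of span.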

Related

Biweekly pandas data with period label

I'm trying to create biweekly periods from a pandas data frame, for instance like this:
import pandas as pd
import numpy as np

date_range = pd.date_range("2022-04-01", "2022-04-30", freq="B")
test_data = pd.DataFrame(np.arange(len(date_range)), index=date_range)
I'd like to have a Period index with a 2-week length. I assumed the pandas way to do it is the following:
test_data.resample("2W", kind="period").last()
However the labels I'm getting are
0
2022-03-28/2022-04-03 5
2022-04-11/2022-04-17 15
2022-04-25/2022-05-01 20
I'd expect to see something like this
0
2022-03-21/2022-04-03 0
2022-04-04/2022-04-17 10
2022-04-18/2022-05-01 20
Another interesting point is that changing to kind="timestamp" gives the values I'd like to see at the end.
0
2022-04-03 0
2022-04-17 10
2022-05-01 20
Is there any native way to get a biweekly index from pandas, or is it better to do it manually?
You can try pandas.Grouper
df = test_data.groupby(pd.Grouper(freq='2W')).last()
print(df)
0
2022-04-03 0
2022-04-17 10
2022-05-01 20
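If you also want the two-week period labels themselves (rather than just the bin-end timestamps), one rough manual option, building on the grouped result above, is to format each label from the bin's end date:
# sketch only: assumes df is the result of the groupby above,
# indexed by the end date of each 2-week bin
df.index = [f"{(end - pd.Timedelta(days=13)).date()}/{end.date()}" for end in df.index]
print(df)
#                         0
# 2022-03-21/2022-04-03   0
# 2022-04-04/2022-04-17  10
# 2022-04-18/2022-05-01  20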

Applying Pandas groupby to multiple columns

I have a data set with several different columns and daily data going back several years. The variable is exactly the same in each column. I've calculated the daily, monthly, and yearly statistics for each column, and now I want to do the same while combining all columns together, to get one statistic for each day, month, and year rather than the several different ones I calculated before.
I've been using Pandas group by so far, using something like this:
sum_daily_files = daily_files.groupby(daily_files.Date.dt.day).sum()
sum_monthly_files = daily_files.groupby(daily_files.Date.dt.month).sum()
sum_yearly_files = daily_files.groupby(daily_files.Date.dt.year).sum()
Any suggestions on how I might go about using Pandas - or any other package - to combine the statistics together? Thanks so much!
edit
Here's a snippet of my dataframe:
Date site1 site2 site3 site4 site5 site6
2010-01-01 00:00:00 2 0 1 1 0 1
2010-01-02 00:00:00 7 5 1 3 1 1
2010-01-03 00:00:00 3 3 2 2 2 1
2010-01-04 00:00:00 0 0 0 0 0 0
2010-01-05 00:00:00 0 0 0 0 0 1
I just had to type it in because I was having trouble getting it over, so my apologies. Basically, it's six different sites from 2010 to 2019 that details how much snow (in inches) each site received on each day.
(Your problem needs to be clarified.)
Is this what you want?
all_sum_daily_files = sum_daily_files.sum(axis=1) # or daily_files.sum(axis=1)
all_sum_monthly_files = sum_monthly_files.sum(axis=1)
all_sum_yearly_files = sum_yearly_files.sum(axis=1)
If your data is daily, there is no need to compute a daily sum first; you can use daily_files.sum(axis=1) directly.
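For example, a rough sketch (assuming daily_files has a datetime Date column and one column per site, as in the snippet above) that first combines the sites and then aggregates:
# total snow across all sites for each calendar day
combined = daily_files.set_index('Date').sum(axis=1)

sum_daily_all = combined.groupby(combined.index.day).sum()      # one value per day of the month
sum_monthly_all = combined.groupby(combined.index.month).sum()  # one value per month
sum_yearly_all = combined.groupby(combined.index.year).sum()    # one value per year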

How to resample/reindex/groupby a time series based on a column's data?

So I've got a pandas data frame that contains 2 values of water use at a 1-second resolution. The values are "hotIn" and "hotOut". The hotIn can record down to the tenth of a gallon at a one-second resolution, while the hotOut records whole-number pulses representing a gallon, i.e. when a pulse occurs, one gallon has passed through the meter. The pulses occur roughly every 14-15 seconds.
Data looks roughly like this:
Index hotIn(gpm) hotOut(pulse=1gal)
2019-03-23T00:00:00 4 0
2019-03-23T00:00:01 5 0
2019-03-23T00:00:02 4 0
2019-03-23T00:00:03 4 0
2019-03-23T00:00:04 3 0
2019-03-23T00:00:05 4 1
2019-03-23T00:00:06 4 0
2019-03-23T00:00:07 5 0
2019-03-23T00:00:08 3 0
2019-03-23T00:00:09 3 0
2019-03-23T00:00:10 4 0
2019-03-23T00:00:11 4 0
2019-03-23T00:00:12 5 0
2019-03-23T00:00:13 5 1
What I'm trying to do is resample or reindex the data frame based on the occurrence of pulses and sum the hotIn between the new timestamps.
For example, sum the hotIn between 00:00:00 - 00:00:05 and 00:00:06 - 00:00:13.
Results would ideally look like this:
Index hotIn sum(gpm) hotOut(pulse=1gal)
2019-03-23T00:00:05 24 1
2019-03-23T00:00:13 32 1
I've explored using a two-step for-elif loop that just checks whether hotOut == 1; it works, but it's painfully slow on large datasets. I'm positive the timestamp functionality of Pandas will be superior if this is possible.
I also can't simply resample on a set frequency, because the interval between pulses changes over time, so a general resample rule would not work. I've also run into problems with matching data frame lengths when pulling out the timestamps associated with pulses and applying them to the main frame as a new index.
IIUC, you can do:
s = df['hotOut(pulse=1gal)'].shift().ne(0).cumsum()
(df.groupby(s)
.agg({'Index':'last', 'hotIn(gpm)':'sum'})
.reset_index(drop=True)
)
Output:
Index hotIn(gpm)
0 2019-03-23T00:00:05 24
1 2019-03-23T00:00:13 33
You don't want to group on the Index. You want to group whenever 'hotOut(pulse=1gal)' changes.
s = df['hotOut(pulse=1gal)'].cumsum().shift().bfill()
(df.reset_index()
.groupby(s, as_index=False)
.agg({'Index': 'last', 'hotIn(gpm)': 'sum', 'hotOut(pulse=1gal)': 'last'})
.set_index('Index'))
hotIn(gpm) hotOut(pulse=1gal)
Index
2019-03-23T00:00:05 24 1
2019-03-23T00:00:13 33 1
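For reference, a self-contained sketch of that second approach on the sample data (assuming the timestamps live in a regular 'Index' column, so no reset_index is needed, and with the grouping key renamed for clarity):
import pandas as pd

df = pd.DataFrame({
    'Index': pd.date_range('2019-03-23', periods=14, freq='s'),
    'hotIn(gpm)': [4, 5, 4, 4, 3, 4, 4, 5, 3, 3, 4, 4, 5, 5],
    'hotOut(pulse=1gal)': [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1],
})

# a new group starts on the row after each pulse
s = df['hotOut(pulse=1gal)'].cumsum().shift().bfill().rename('grp')

out = (df.groupby(s)
         .agg({'Index': 'last', 'hotIn(gpm)': 'sum', 'hotOut(pulse=1gal)': 'last'})
         .set_index('Index'))
print(out)
#                      hotIn(gpm)  hotOut(pulse=1gal)
# Index
# 2019-03-23 00:00:05          24                   1
# 2019-03-23 00:00:13          33                   1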

Rolling up / Cumulative sum of units shipped in the 3 days since first shipment in Pandas

It's a little hard to explain but I'll try my best, please bear with me.
I have a DataFrame with ID, Shipping Date and Units.
I want to calculate the units shipped within a 3-day timeframe, and the counts should not overlap, e.g. my dataframe is as follows.
ID Shipping Date Units Expected output
153131151007 20180801 1 1
153131151007 20180828 1 2
153131151007 20180829 1 0
153131151007 20180904 1 1
153131151007 20181226 2 4
153131151007 20181227 1 0
153131151007 20181228 1 0
153131151007 20190110 1 1
153131151007 20190115 2 3
153131151007 20190116 1 0
153131151011* 20180510 1 2
153131151011* 20180511 1 0
153131151011* 20180513 1 2
153131151011* 20180515 1 0
153131151011* 20180813 1 1
153131151011* 20180822 1 2
153131151011* 20180824 1 0
153131151011* 20190103 1 1
The code should check the date, see if there are any shipments in the next 3 days, and if there is a shipment, sum it into the current date's row and make sure the summed rows are not considered again for the next date's calculation.
So for the first ID, shipping date 20181226: it checks 1226, 1227, 1228, sums them together, shows the result in 1226, and shows 0 in the next 2 cells.
Similarly for the 2nd ID, 20180510: 0510 is the first date of the shipment in the series. It checks 0510, 0511 and 0512 and sums them into 0510, zeroing the rest, which is why 0511 does not consider 0513, and 0513 is part of another shipment group.
data = pd.DataFrame({'ID':['153131151007','153131151007','153131151007','153131151007','153131151007','153131151007','153131151007','153131151007','153131151007','153131151007','153131151011*','153131151011*','153131151011*','153131151011*','153131151011*','153131151011*','153131151011*','153131151011*'],
'Date':[20180801,20180828,20180829,20180904,20181226,20181227,20181228,20190110,20190115,20190116,20180510,20180511,20180513,20180515,20180813,20180822,20180824,20190103],
'Units':[1,1,1,1,2,1,1,1,2,1,1,1,1,1,1,1,1,1]})
This works but the results are in wide format:
import pandas as pd
import numpy as np
from dateutil.parser import parse
from datetime import timedelta
data = pd.DataFrame({'ID':['153131151007','153131151007','153131151007','153131151007','153131151007','153131151007','153131151007','153131151007','153131151007','153131151007','153131151011*','153131151011*','153131151011*','153131151011*','153131151011*','153131151011*','153131151011*','153131151011*'],
'Date':[20180801,20180828,20180829,20180904,20181226,20181227,20181228,20190110,20190115,20190116,20180510,20180511,20180513,20180515,20180813,20180822,20180824,20190103],
'Units':[1,1,1,1,2,1,1,1,2,1,1,1,1,1,1,1,1,1]})
def keep_first(ser):
    ixs = []
    ts = ser.dropna().index[0]
    while ts <= ser.dropna().index.max():
        if ts in ser.dropna().index:
            ixs.append(ts)
            ts += timedelta(3)
        else:
            ts += timedelta(1)
    return np.where(ser.index.isin(ixs), ser, 0)
data['Date'] = data['Date'].map(lambda x: parse(str(x))) # parse dates
units = data.groupby(['ID', 'Date']).sum().unstack(0).resample('D').sum() # create resampled units df
units = units.sort_index(ascending=False).rolling(3, min_periods=1).sum().sort_index() # calculate forward-rolling sum
grouped_ix = data.groupby(['ID', 'Date']).sum().unstack(0).index # get indices for actual data
units.loc[grouped_ix].apply(keep_first) # get sums for actual data indices, keep only first
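For reference, a rough alternative sketch that keeps the long format (one row per original shipment, matching the Expected output column): walk each ID's dates in order and fold every shipment into a non-overlapping 3-day window anchored at its first date. This is only a sketch and assumes data as built above, with Date already parsed to datetimes.
def three_day_sum(group):
    group = group.sort_values('Date').copy()
    sums = []
    window_start = None   # first date of the current 3-day window
    anchor_pos = None     # position in sums that accumulates the window total
    for date, units in zip(group['Date'], group['Units']):
        if window_start is None or (date - window_start).days > 2:
            window_start = date            # open a new window at this shipment
            anchor_pos = len(sums)
            sums.append(units)
        else:
            sums[anchor_pos] += units      # fold into the window's first row
            sums.append(0)
    group['3day_units'] = sums
    return group

result = data.groupby('ID', group_keys=False).apply(three_day_sum)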

subset by counting the number of times 0 occurs in a column after groupby in python

I have some typical stock data. I want to create a column called "Volume_Count" that will count the number of 0 volume days per quarter. My ultimate goal is to remove all stocks that have 0 volume for more than 5 days in a quarter. By creating this column, I can write a simple statement to subset Vol_Count > 5.
A typical Dataset:
Stock Date Qtr Volume
XYZ 1/1/19 2019 Q1 0
XYZ 1/2/19 2019 Q1 598
XYZ 1/3/19 2019 Q1 0
XYZ 1/4/19 2019 Q1 0
XYZ 1/5/19 2019 Q1 0
XYZ 1/6/19 2019 Q1 2195
XYZ 1/7/19 2019 Q1 0
... ... and so on (for multiple stocks and quarters)
This is what I've tried - a 1 liner -
df = df.groupby(['stock','Qtr'], as_index=False).filter(lambda x: len(x.Volume == 0) > 5)
However, this produced inconsistent results.
I want to remove the stock from the dataset only for the quarter where the volume == 0 for 5 or more days.
Note: I have multiple Stocks and Qtr in my dataset, therefore it's essential to groupby Qtr, Stock.
Desired Output:
I want to keep the dataset but remove any stock for a quarter if it has volume = 0 for more than 5 days. That might entail a stock not being in the dataset for 2019 Q1 (because Vol == 0 for more than 5 days) but being in the df for 2019 Q2 (Vol == 0 for fewer than 5 days).
Try this:
df[df['Volume'].eq(0).groupby([df['Stock'],df['Qtr']]).transform('sum') < 5]
Details.
First, take the Volume column of your dataframe and check whether it is zero for each record.
Next, group that column by the 'Stock' and 'Qtr' columns and sum the True values from step 1, assigning that sum back to each record using groupby and transform.
Finally, create a boolean series that is True where that sum is less than 5, and use it to boolean-index your original dataframe.
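A quick runnable sketch of that on made-up data (column names taken from the question):
import pandas as pd

df = pd.DataFrame({
    'Stock': ['XYZ'] * 7 + ['ABC'] * 3,
    'Qtr': ['2019 Q1'] * 10,
    'Volume': [0, 598, 0, 0, 0, 2195, 0, 10, 0, 20],
})

# number of zero-volume days per (Stock, Qtr), broadcast back to every row
zero_days = df['Volume'].eq(0).groupby([df['Stock'], df['Qtr']]).transform('sum')

# keep only (Stock, Qtr) groups with fewer than 5 zero-volume days
filtered = df[zero_days < 5]
print(filtered)   # XYZ 2019 Q1 has 5 zero-volume days, so it is dropped; ABC stays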
