Pandas apply on rolling with multi-column output - python

I am working on code that applies a rolling window to a function returning multiple columns.
Input: Pandas Series
Expected output: 3-column DataFrame
def fun1(series):
    # Some calculations producing numbers a, b and c
    return {"a": a, "b": b, "c": c}

res.rolling('21 D').apply(fun1)
Contents of res:
time
2019-09-26 16:00:00 0.674969
2019-09-26 16:15:00 0.249569
2019-09-26 16:30:00 -0.529949
2019-09-26 16:45:00 -0.247077
2019-09-26 17:00:00 0.390827
...
2019-10-17 22:45:00 0.232998
2019-10-17 23:00:00 0.590827
2019-10-17 23:15:00 0.768991
2019-10-17 23:30:00 0.142661
2019-10-17 23:45:00 -0.555284
Length: 1830, dtype: float64
Error:
TypeError: must be real number, not dict
What I've tried:
Changing raw=True in apply
Using a lambda function in apply
Returning result in fun1 as lists/numpy arrays/dataframe/series.
I have also gone through many related posts on SO, to name a few:
Pandas - Using `.rolling()` on multiple columns
Returning two values from pandas.rolling_apply
How to apply a function to two columns of Pandas dataframe
Apply pandas function to column to create multiple new columns?
But none of the solutions specified solves this problem.
Is there a straight-forward solution to this?

Here is a hacky answer using rolling, producing a DataFrame:
import pandas as pd
import numpy as np
dr = pd.date_range('09-26-2019', '10-17-2019', freq='15T')
data = np.random.rand(len(dr))
s = pd.Series(data, index=dr)
output = pd.DataFrame(columns=['a','b','c'])
row = 0
def compute(window, df):
    global row
    a = window.max()
    b = window.min()
    c = a - b
    df.loc[row, ['a', 'b', 'c']] = [a, b, c]
    row += 1
    return 1

s.rolling('1D').apply(compute, kwargs={'df': output})
output.index = s.index
It seems that the rolling apply function always expects a single number to be returned, so that it can immediately generate a new Series from the calculations.
I am getting around this by making a new output DataFrame (with the desired output columns) and writing to that within the function. I'm not sure if there is a way to get the index within a rolling object, so I instead use global to keep an increasing count for writing new rows. In light of the point above, though, you still need to return some number. So while the actual rolling operation returns a Series of 1s, output is modified:
In[0]:
s
Out[0]:
2019-09-26 00:00:00 0.106208
2019-09-26 00:15:00 0.979709
2019-09-26 00:30:00 0.748573
2019-09-26 00:45:00 0.702593
2019-09-26 01:00:00 0.617028
...
2019-10-16 23:00:00 0.742230
2019-10-16 23:15:00 0.729797
2019-10-16 23:30:00 0.094662
2019-10-16 23:45:00 0.967469
2019-10-17 00:00:00 0.455361
Freq: 15T, Length: 2017, dtype: float64
In[1]:
output
Out[1]:
a b c
2019-09-26 00:00:00 0.106208 0.106208 0.000000
2019-09-26 00:15:00 0.979709 0.106208 0.873501
2019-09-26 00:30:00 0.979709 0.106208 0.873501
2019-09-26 00:45:00 0.979709 0.106208 0.873501
2019-09-26 01:00:00 0.979709 0.106208 0.873501
... ... ...
2019-10-16 23:00:00 0.980544 0.022601 0.957943
2019-10-16 23:15:00 0.980544 0.022601 0.957943
2019-10-16 23:30:00 0.980544 0.022601 0.957943
2019-10-16 23:45:00 0.980544 0.022601 0.957943
2019-10-17 00:00:00 0.980544 0.022601 0.957943
[2017 rows x 3 columns]
This feels like more of an exploit of rolling than an intended use, so I would be interested to see a more elegant answer.
UPDATE: Thanks to @JuanPi, you can get the rolling window index using this answer. So a non-global answer could look like this:
def compute(window, df):
    a = window.max()
    b = window.min()
    c = a - b
    df.loc[window.index.max(), ['a', 'b', 'c']] = [a, b, c]
    return 1
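The call itself is unchanged, and since each row is now written under the window's own end timestamp, re-assigning output.index afterwards is no longer needed. A quick usage sketch with the same s as above:

output = pd.DataFrame(columns=['a', 'b', 'c'])
s.rolling('1D').apply(compute, kwargs={'df': output})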

This hack seems to work for me, although the additional features of rolling cannot be applied to this solution. However, the application is significantly faster thanks to multiprocessing.
from multiprocessing import Pool
import functools
import pandas as pd

def apply_fn(indices, fn, df):
    return fn(df.loc[indices])

def rolling_apply(df, fn, window_size, start=None, end=None):
    """
    The rolling application of a function fn on a DataFrame df given the window_size
    """
    x = df.index
    if start is not None:
        x = x[x >= start]
    if end is not None:
        x = x[x <= end]
    if type(window_size) == str:
        delta = pd.Timedelta(window_size)
        index_sets = [x[(x > (i - delta)) & (x <= i)] for i in x]
    else:
        assert type(window_size) == int, "Window size should be str (representing Timedelta) or int"
        delta = window_size
        index_sets = [x[(x > (i - delta)) & (x <= i)] for i in x]
    with Pool() as pool:
        result = list(pool.map(functools.partial(apply_fn, fn=fn, df=df), index_sets))
    result = pd.DataFrame(data=result, index=x)
    return result
With the above functions in place, pass the function to be rolled into the custom rolling_apply:
result = rolling_apply(res, fun1, "21 D")
Contents of result:
a b c
time
2019-09-26 16:00:00 NaN NaN NaN
2019-09-26 16:15:00 0.500000 0.106350 0.196394
2019-09-26 16:30:00 0.500000 0.389759 -0.724829
2019-09-26 16:45:00 2.000000 0.141436 -0.529949
2019-09-26 17:00:00 6.010184 0.141436 -0.459231
... ... ... ...
2019-10-17 22:45:00 4.864015 0.204483 -0.761609
2019-10-17 23:00:00 6.607717 0.204647 -0.761421
2019-10-17 23:15:00 7.466364 0.204932 -0.761108
2019-10-17 23:30:00 4.412779 0.204644 -0.760386
2019-10-17 23:45:00 0.998308 0.203039 -0.757979
1830 rows × 3 columns
Note:
This implementation works for both Series and DataFrame input
This implementation works for both time and integer windows
The result returned by fun1 can even be a list, numpy array, series or a dictionary (a sketch follows after this list)
The window_size is only a maximum: starting indices that are less than one full window from the beginning get windows containing all elements from the start up to and including that index
The apply function should not be nested inside the rolling_apply function, since pool.map cannot accept local or lambda functions, as they cannot be pickled by the multiprocessing library
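For illustration, a minimal sketch of a dict-returning fun1 (the calculations here are placeholders, not the asker's real ones) used with the rolling_apply above; each returned dict becomes one row of the resulting DataFrame:

def fun1(series):
    # toy calculations standing in for the real ones
    a = series.max()
    b = series.min()
    c = a - b
    return {"a": a, "b": b, "c": c}

# note: on platforms that spawn worker processes, call this under `if __name__ == "__main__":`
result = rolling_apply(res, fun1, "21 D")  # DataFrame with columns a, b and c, indexed like res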

Related

python masking each day in dataframe

I have to compute a daily sum on a dataframe, but only if at least 70% of that day's data is not NaN; otherwise the day must not be taken into account. Is there a way to create such a mask? My dataframe is more than 17 years of hourly data.
My data looks something like this:
clear skies all skies Lab
2015-02-26 13:00:00 597.5259 376.1830 307.62
2015-02-26 14:00:00 461.2014 244.0453 199.94
2015-02-26 15:00:00 283.9003 166.5772 107.84
2015-02-26 16:00:00 93.5099 50.7761 23.27
2015-02-26 17:00:00 1.1559 0.2784 0.91
... ... ...
2015-12-05 07:00:00 95.0285 29.1006 45.23
2015-12-05 08:00:00 241.8822 120.1049 113.41
2015-12-05 09:00:00 363.8040 196.0568 244.78
2015-12-05 10:00:00 438.2264 274.3733 461.28
2015-12-05 11:00:00 456.3396 330.6650 447.15
If I groupby and aggregate, there is no way to know whether a day had missing data; such days will have lower sums and will therefore lower my monthly means.
As said in the comments, use groupby to group the data by date and then write an appropriate selection. This is an example that sums all days (assuming regular data points, 24 per day) with fewer than 50% NaN entries:
import pandas as pd
import numpy as np
# create a date range
date_rng = pd.date_range(start='1/1/2018', end='1/1/2021', freq='H')
# create random data
df = pd.DataFrame({"data":np.random.randint(0,100,size=(len(date_rng)))}, index = date_rng)
# set some values to nan
df.loc[df["data"] > 50, "data"] = np.nan
# looks like this
df.head(20)
# sum everything where less than 50% are nan
df.groupby(df.index.date).sum()[df.isna().groupby(df.index.date).sum() < 12]
Example output:
data
2018-01-01 NaN
2018-01-02 NaN
2018-01-03 487.0
2018-01-04 NaN
2018-01-05 421.0
... ...
2020-12-28 NaN
2020-12-29 NaN
2020-12-30 NaN
2020-12-31 392.0
2021-01-01 0.0
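If the data is not strictly regular (not exactly 24 points per day), a hedged variant is to compare each day's fraction of non-NaN entries against the asker's 70% threshold instead of a fixed count of 12 (a sketch using the same df as above):

# fraction of non-NaN entries per calendar day
valid_fraction = df["data"].notna().groupby(df.index.date).mean()

# daily sums, kept only where at least 70% of that day's entries are valid
daily_sum = df["data"].groupby(df.index.date).sum().where(valid_fraction >= 0.7)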
An alternative solution - you may find it useful & flexible:
# pip install convtools
from convtools import conversion as c

total_number = c.ReduceFuncs.Count()
total_not_none = c.ReduceFuncs.Count(where=c.item("amount").is_not(None))
total_sum = c.ReduceFuncs.Sum(c.item("amount"))

input_data = []  # e.g. iterable of dicts
converter = (
    c.group_by(
        c.item("key1"),
        c.item("key2"),
    )
    .aggregate(
        {
            "key1": c.item("key1"),
            "key2": c.item("key2"),
            "sum_if_70": c.if_(
                total_not_none / total_number < 0.7,
                None,
                total_sum,
            ),
        }
    )
    .gen_converter(
        debug=False
    )  # install black and set to True to see the generated ad-hoc code
)

result = converter(input_data)
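A hypothetical usage sketch (the key1/key2/amount fields are just the placeholders used above, and this assumes ReduceFuncs.Sum skips None values):

input_data = [
    {"key1": "2018-01-01", "key2": "data", "amount": 10.0},
    {"key1": "2018-01-01", "key2": "data", "amount": None},
    {"key1": "2018-01-02", "key2": "data", "amount": 7.5},
]
result = converter(input_data)
# one aggregated dict per (key1, key2) group; sum_if_70 is None for the
# first group because only 50% of its amounts are non-None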

Filtering out specific hours each day in pandas

Given a dataset where each row represents an hour sample, that is, each day has 24 entries with the following index set
...
2020-10-22T20:00:00
2020-10-22T21:00:00
2020-10-22T22:00:00
...
2020-10-22T20:00:00
2020-10-22T20:00:00
2020-10-22T20:00:00
...
Now I want to filter so that for each day only the hours between 9am and 3pm are left.
The only way I know of would be to iterate over the dataset and filter each row given a condition; however, knowing pandas, there is usually some trick for this kind of filtering that does not involve explicit iteration.
You can use the aptly named pd.DataFrame.between_time method. This will only work if your dataframe has a DatetimeIndex.
Data Creation
date_index = pd.date_range("2020-10-22T20:00:00", "2020-11-22T20:00:00", freq="H")
values = np.random.rand(len(date_index), 1)
df = pd.DataFrame(values, index=date_index, columns=["value"])
print(df.head())
value
2020-10-22 20:00:00 0.637542
2020-10-22 21:00:00 0.590626
2020-10-22 22:00:00 0.474802
2020-10-22 23:00:00 0.058775
2020-10-23 00:00:00 0.904070
Method
subset = df.between_time("9:00am", "3:00pm")
print(subset.head(10))
value
2020-10-23 09:00:00 0.210816
2020-10-23 10:00:00 0.086677
2020-10-23 11:00:00 0.141275
2020-10-23 12:00:00 0.065100
2020-10-23 13:00:00 0.892314
2020-10-23 14:00:00 0.214991
2020-10-23 15:00:00 0.106937
2020-10-24 09:00:00 0.900106
2020-10-24 10:00:00 0.545249
2020-10-24 11:00:00 0.793243
import pandas as pd
# sample data (strings)
data = [f'2020-10-{d:02d}T{h:02d}:00:00' for h in range(24) for d in range(1, 21)]
# series of DT values
ds = pd.to_datetime(pd.Series(data), format='%Y-%m-%dT%H:%M:%S')
# filter by hours
ds_filter = ds[(ds.dt.hour >= 9) & (ds.dt.hour <= 15)]
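If the timestamps are the index of a DataFrame rather than the values of a Series, the same hour-based filter can be written against the index (a small sketch, assuming df has a DatetimeIndex):

# keep only rows whose hour falls between 9 and 15 inclusive
subset = df[(df.index.hour >= 9) & (df.index.hour <= 15)]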

Copy row to another dataframe

I have 2 dataframes with index type DatetimeIndex and I would like to copy one row to another. The dataframes are:
variable: df
DateTime
2013-01-01 01:00:00 0.0
2013-01-01 02:00:00 0.0
2013-01-01 03:00:00 0.0
....
Freq: H, Length: 8759, dtype: float64
variable: consumption_year
Potência Ativa ... Costs
Datetime ...
2019-01-01 00:00:00 11.500000 ... 1.08874
2019-01-01 01:00:00 6.500000 ... 0.52016
2019-01-01 02:00:00 5.250000 ... 0.38183
2019-01-01 03:00:00 5.250000 ... 0.38183
[8760 rows x 5 columns]
here is my code:
mc.run_model(tmy_data)
df=round(mc.ac.fillna(0)/1000,3)
consumption_year['PVProduction'] = df.iloc[:,[1]] #1
consumption_year['PVProduction'] = df[:,1] #2
I am trying to copy the second column of df to a new column in the consumption_year dataframe, but neither of the previous attempts worked. Looking at the index, I see 3 major differences:
year (2013 and 2019)
starting hour: 01:00 and 00:00
length: 8760 and 8759
Do I need to resolve those 3 differences first (making the datetime index of df match that of consumption_year) before I can copy the data across? If so, could you provide a solution to fix those differences?
Those are the errors:
1: consumption_year['PVProduction'] = df.iloc[:,[1]]
raise IndexingError("Too many indexers")
pandas.core.indexing.IndexingError: Too many indexers
2: consumption_year['PVProduction'] = df[:,1]
raise ValueError("Can only tuple-index with a MultiIndex")
ValueError: Can only tuple-index with a MultiIndex
You can merge two data frames together.
pd.merge(df, consumption_year, left_index=True, right_index=True, how='outer')
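A minimal sketch (toy data, not the asker's frames) of what the outer index-merge does when the two DatetimeIndexes only partially overlap:

import pandas as pd

a = pd.DataFrame({'x': [1, 2]}, index=pd.to_datetime(['2019-01-01 00:00', '2019-01-01 01:00']))
b = pd.DataFrame({'y': [3, 4]}, index=pd.to_datetime(['2019-01-01 01:00', '2019-01-01 02:00']))

merged = pd.merge(a, b, left_index=True, right_index=True, how='outer')
# timestamps present in only one frame get NaN in the other frame's columns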

Not able to use a key from a merged dataframe

I've got two dataframes that both have a date column and an emaX column. When I merge them, I get the expected result of a single date column and two emaX columns. But when I try to access the date key from the merged dataframe, it returns KeyError: date.
This is the function that returns the emaX (I have two, but they're nearly identical):
def av_get_ema_20():
    ti = TechIndicators(key=TOKEN, output_format="pandas")
    emaData20, meta_ema = ti.get_ema(symbol=SYMBOL, interval=INTERVAL, time_period=20, series_type=EMA_TYPE)
    ema20renamed = pd.DataFrame(emaData20)
    ema20renamed.rename(columns={'EMA': 'ema20'}, inplace=True)
    return ema20renamed
Then I merge the two returned dataframes:
mergedDF = pd.merge(av_get_ema_10(), av_get_ema_20(), on=["date"], how="inner")
# TEST LINE
print(mergedDF)
The dataframe that is printed out appears as I expected it to be:
ema10 ema20
date
2020-01-02 11:30:00 3226.5200 NaN
2020-01-02 12:30:00 3229.0927 NaN
2020-01-02 13:30:00 3232.0558 NaN
2020-01-02 14:30:00 3235.0839 NaN
2020-01-02 15:30:00 3239.1668 NaN
... ... ...
2020-03-26 11:30:00 2524.9545 2473.8551
2020-03-26 12:30:00 2533.1755 2483.0279
2020-03-26 13:30:00 2541.2982 2492.0586
2020-03-26 14:30:00 2551.0458 2501.8540
2020-03-26 15:30:00 2565.2866 2513.9983
But then when I attempt to use the merged dataframe (for example, iterating through it), I get KeyError: date:
for index, row in mergedDF.iterrows():
    print(row["date"], row["ema10"], row["ema20"])
Am I misinterpreting the dataframe in some way or is there something else I am supposed to do prior to using the merged set (including the date)? I'm at a loss here.
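For what it's worth, the printed output above shows date as the index of the merged frame rather than as a column; if that is the case, iterrows() already yields it as the first element of each tuple rather than as a key of row. A sketch of what that access would look like:

# if 'date' is the index, it is not reachable via row["date"];
# it is the first value yielded by iterrows()
for date, row in mergedDF.iterrows():
    print(date, row["ema10"], row["ema20"])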

Pandas read_hdf query by date and time range

I have a question regarding how to filter results in the pd.read_hdf function. So here's the setup, I have a pandas dataframe (with np.datetime64 index) which I put into a hdf5 file. There's nothing fancy going on here, so no use of hierarchy or anything (maybe I could incorporate it?). Here's an example:
Foo Bar
TIME
2014-07-14 12:02:00 0 0
2014-07-14 12:03:00 0 0
2014-07-14 12:04:00 0 0
2014-07-14 12:05:00 0 0
2014-07-14 12:06:00 0 0
2014-07-15 12:02:00 0 0
2014-07-15 12:03:00 0 0
2014-07-15 12:04:00 0 0
2014-07-15 12:05:00 0 0
2014-07-15 12:06:00 0 0
2014-07-16 12:02:00 0 0
2014-07-16 12:03:00 0 0
2014-07-16 12:04:00 0 0
2014-07-16 12:05:00 0 0
2014-07-16 12:06:00 0 0
Now I store this into a .h5 using the following command:
store = pd.HDFStore('qux.h5')
#generate df
store.append('data', df)
store.close()
Next, I'll have another process which accesses this data and I would like to take date/time slices of this data. So suppose I want dates between 2014-07-14 and 2014-07-15, and only for times between 12:02:00 and 12:04:00. Currently I am using the following command to retrieve this:
pd.read_hdf('qux.h5', 'data', where='index >= 20140714 and index <= 20140715').between_time(start_time=datetime.time(12,2), end_time=datetime.time(12,4))
As far as I'm aware (someone please correct me if I'm wrong here), the entire original dataset is not read into memory if I use 'where'. So in other words:
This:
pd.read_hdf('qux.h5', 'data', where='index >= 20140714 and index <= 20140715')
Is not the same as this:
pd.read_hdf('qux.h5', 'data')['20140714':'20140715']
While the end result is exactly the same, what's being done in the background is not. So my question is, is there a way to incorporate that time range filter (i.e. .between_time()) into my where statement? Or is there another way I should structure my hdf5 file? Maybe store a table for each day?
Thanks!
EDIT:
Regarding using hierarchy, I'm aware that the structure should be highly dependent on how I'll be using the data. However, if we assume that I define a table per date (e.g. 'df/date_20140714', 'df/date_20140715', ...). Again I may be mistaken here, but using my example of querying a date/time range, I'll probably incur a performance penalty, as I'll need to read each table and merge them if I want a consolidated output, right?
See an example of selecting using a where mask. Here's an example:
In [50]: pd.set_option('max_rows',10)
In [51]: df = DataFrame(np.random.randn(1000,2),index=date_range('20130101',periods=1000,freq='H'))
In [52]: df
Out[52]:
0 1
2013-01-01 00:00:00 -0.467844 1.038375
2013-01-01 01:00:00 0.057419 0.914379
2013-01-01 02:00:00 -1.378131 0.187081
2013-01-01 03:00:00 0.398765 -0.122692
2013-01-01 04:00:00 0.847332 0.967856
... ... ...
2013-02-11 11:00:00 0.554420 0.777484
2013-02-11 12:00:00 -0.558041 1.833465
2013-02-11 13:00:00 -0.786312 0.501893
2013-02-11 14:00:00 -0.280538 0.680498
2013-02-11 15:00:00 1.533521 -1.992070
[1000 rows x 2 columns]
In [53]: store = pd.HDFStore('test.h5',mode='w')
In [54]: store.append('df',df)
In [55]: c = store.select_column('df','index')
In [56]: where = pd.DatetimeIndex(c).indexer_between_time('12:30','4:00')
In [57]: store.select('df',where=where)
Out[57]:
0 1
2013-01-01 00:00:00 -0.467844 1.038375
2013-01-01 01:00:00 0.057419 0.914379
2013-01-01 02:00:00 -1.378131 0.187081
2013-01-01 03:00:00 0.398765 -0.122692
2013-01-01 04:00:00 0.847332 0.967856
... ... ...
2013-02-11 03:00:00 0.902023 1.416775
2013-02-11 04:00:00 -1.455099 -0.766558
2013-02-11 13:00:00 -0.786312 0.501893
2013-02-11 14:00:00 -0.280538 0.680498
2013-02-11 15:00:00 1.533521 -1.992070
[664 rows x 2 columns]
In [58]: store.close()
Couple of points to note. This reads in the entire index to start. Usually this is not a burden. If it is, you can just read it in chunks (provide start/stop, though it's a bit manual to do this ATM). The current select_column I don't believe can accept a query either.
You could potentially iterate over the days (and do individual queries) if you have a gargantuan amount of data (tens of millions of rows, which are wide), which might be more efficient.
Recombining data is relatively cheap (via concat), so don't be afraid to sub-query (though doing this too much can drag performance as well).
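To fold the asker's original date range into the same kind of selection, one possible sketch (assuming the qux.h5 / 'data' layout from the question) reads only the index column, builds the positional coordinates satisfying both the date range and the time-of-day range, and passes them as where:

import datetime
import numpy as np
import pandas as pd

store = pd.HDFStore('qux.h5')
idx = pd.DatetimeIndex(store.select_column('data', 'index'))

# positions matching the time-of-day window
pos_time = idx.indexer_between_time(datetime.time(12, 2), datetime.time(12, 4))
# positions matching the date range
pos_date = np.flatnonzero((idx >= '2014-07-14') & (idx < '2014-07-16'))

# coordinates satisfying both conditions
where = np.intersect1d(pos_time, pos_date)
result = store.select('data', where=where)
store.close()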
