Is there an efficient way to iterate over Pandas DataFrame chunks? - python

Original Post
I am working with time series data and want to apply a function to each chunk of the data frame over rolling time intervals/windows. When I use rolling() and apply() on a Pandas DataFrame, the function is applied to each column separately for a given time interval. Here's example code:
Sample data
In:
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5, 6],
                   'B': [2, 4, 6, 8, 10, 12]},
                  index=pd.date_range('2019-01-01', periods=6, freq='5T'))
print(df)
Out:
                     A   B
2019-01-01 00:00:00  1   2
2019-01-01 00:05:00  2   4
2019-01-01 00:10:00  3   6
2019-01-01 00:15:00  4   8
2019-01-01 00:20:00  5  10
2019-01-01 00:25:00  6  12
Output when using the combination of rolling() and apply():
In:
print(df.rolling('15T', min_periods=2).apply(lambda x: x.sum().sum()))
Out:
                        A     B
2019-01-01 00:00:00   NaN   NaN
2019-01-01 00:05:00   3.0   6.0
2019-01-01 00:10:00   6.0  12.0
2019-01-01 00:15:00   9.0  18.0
2019-01-01 00:20:00  12.0  24.0
2019-01-01 00:25:00  15.0  30.0
Desired Out:
2019-01-01 00:00:00 NaN
2019-01-01 00:05:00 9.0
2019-01-01 00:10:00 18.0
2019-01-01 00:15:00 27.0
2019-01-01 00:20:00 36.0
2019-01-01 00:25:00 45.0
Freq: 5T, dtype: float64
Currently, I am using a for loop to do the job, but I am looking for a more efficient way to handle this operation. I would appreciate a solution within the Pandas framework, or even one using other libraries.
Note: Please do not take the example function (summation) seriously; assume that the function of interest requires iterating over the chunks of the dataset as is, i.e., with no prior per-column operations.
Thanks in advance!
Edit/Update
After reading the responses I realized that the example I gave was not adequate. Here I go again, using the same sample data:
In:
print(
    df.rolling('15T', min_periods=2).apply(
        lambda x: x['A'].mean() / x['B'].std()
    )
)
Out:
KeyError: 'A'
Desired Out:
2019-01-01 00:00:00 NaN
2019-01-01 00:05:00 1.06
2019-01-01 00:10:00 1.00
2019-01-01 00:15:00 1.50
2019-01-01 00:20:00 2.00
2019-01-01 00:25:00 2.50
Freq: 5T, dtype: float64
Again, I want to point out that the main objective of my question is to find an efficient way to iterate over chunks of dataframes. For example, I do not want a solution like the following:
df['A'].rolling('15T', min_periods=2).mean() / df['B'].rolling('15T', min_periods=2).std()
And, for those who are interested in the real problem rather than the simple example, you can check it out at mlfactor, where the triple barrier method is explained.
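For context on the KeyError above: rolling().apply() hands the function one column at a time as a Series, so the lambda never sees column names. As a baseline, here is a minimal sketch of the explicit loop described in the question, assuming the sample df above; rolling('15T') windows are right-closed, and the length check stands in for min_periods=2:
import pandas as pd

window = pd.Timedelta('15T')
out = pd.Series(index=df.index, dtype=float)
for end in df.index:
    # a '15T' rolling window ending at `end` covers (end - 15min, end]
    chunk = df[(df.index > end - window) & (df.index <= end)]
    if len(chunk) >= 2:  # emulate min_periods=2
        out.loc[end] = chunk['A'].mean() / chunk['B'].std()
print(out)
For a faster route inside pandas, versions 1.3+ also accept rolling(..., method='table') combined with apply(..., raw=True, engine='numba') (numba must be installed), which passes each window to the function as a single 2-D array instead of column by column, with columns addressed by position rather than by name.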

Related

Sum of dataframes : treating NaN as 0 when summed with other values, but returning NaN where all summed elements are NaN

I am trying to add some dataframes that contain NaN values. The data frames are indexed by time series, and in my case a NaN is meaningful: it means that a measurement wasn't done. So if all the data frames I'm adding have a NaN for a given timestamp, I need the result to have a NaN for that timestamp. But if one or more of them have a value for the timestamp, I need the sum of those values.
EDIT: Also, in my case a 0 is different from a NaN; it means that there was a measurement and it measured 0 activity, as opposed to a NaN meaning that there was no measurement. So any solution using fillna(0) won't work.
I haven't found a proper way to do this yet. Here is an example of what I want to do:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'value': [0, 1, 1, 1, np.nan, np.nan, np.nan]},
                   index=pd.date_range("01/01/2020 00:00", "01/01/2020 01:00", freq='10T'))
df2 = pd.DataFrame({'value': [0, 5, 5, 5, 5, 5, np.nan]},
                   index=pd.date_range("01/01/2020 00:00", "01/01/2020 01:00", freq='10T'))
df1 + df2
What I get:
value
2020-01-01 00:00:00 0.0
2020-01-01 00:10:00 6.0
2020-01-01 00:20:00 6.0
2020-01-01 00:30:00 6.0
2020-01-01 00:40:00 NaN
2020-01-01 00:50:00 NaN
2020-01-01 01:00:00 NaN
What I would want to have as a result :
value
2020-01-01 00:00:00 0.0
2020-01-01 00:10:00 6.0
2020-01-01 00:20:00 6.0
2020-01-01 00:30:00 6.0
2020-01-01 00:40:00 5.0
2020-01-01 00:50:00 5.0
2020-01-01 01:00:00 NaN
Does anybody know a clean way to do so?
Thank you.
(I'm using Python 3.9.1 and pandas 1.2.4)
You can use add with the fill_value=0 option. This will maintain the "all NaN" combinations as NaN:
df1.add(df2, fill_value=0)
output:
value
2020-01-01 00:00:00 0.0
2020-01-01 00:10:00 6.0
2020-01-01 00:20:00 6.0
2020-01-01 00:30:00 6.0
2020-01-01 00:40:00 5.0
2020-01-01 00:50:00 5.0
2020-01-01 01:00:00 NaN
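If more than two frames need to be combined, the same idea extends to a fold; a minimal sketch, assuming a list dfs of identically indexed frames:
from functools import reduce

# Positions that are NaN in every frame stay NaN, because fill_value
# only substitutes a 0 when at least one operand has a real value.
total = reduce(lambda a, b: a.add(b, fill_value=0), dfs)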

Efficient conditional rolling calculation Pandas

Generating the data
import numpy as np
import pandas as pd

np.random.seed(42)
date_rng = pd.date_range(start='1/1/2018', end='1/08/2018', freq='H')
df = pd.DataFrame(np.random.randint(0, 10, size=(len(date_rng))),
                  columns=['data'],
                  index=date_rng)
mask = np.random.choice([1, 0], df.shape, p=[.35, .65]).astype(bool)
df[mask] = np.nan
I want to calculate a rolling std() with window = 5: if more than half of the elements in the window are NaN, the result should be NaN; if less than half are NaN, drop the NaNs and calculate std() on the remaining elements.
I only know how to calculate normal rolling:
df.rolling(5).std()
How can I specify this condition for the rolling calculation?
I think you can use the min_periods argument of the rolling function:
df['rollingstd'] = df['data'].rolling(5, min_periods=3).std()
df.head(20)
Output:
data rollingstd
2018-01-01 00:00:00 1.0 NaN
2018-01-01 01:00:00 6.0 NaN
2018-01-01 02:00:00 1.0 2.886751
2018-01-01 03:00:00 NaN 2.886751
2018-01-01 04:00:00 5.0 2.629956
2018-01-01 05:00:00 3.0 2.217356
2018-01-01 06:00:00 NaN 2.000000
2018-01-01 07:00:00 NaN NaN
2018-01-01 08:00:00 3.0 1.154701
2018-01-01 09:00:00 NaN NaN
2018-01-01 10:00:00 5.0 NaN
2018-01-01 11:00:00 9.0 3.055050
2018-01-01 12:00:00 NaN 3.055050
2018-01-01 13:00:00 9.0 2.309401
2018-01-01 14:00:00 1.0 3.829708
2018-01-01 15:00:00 0.0 4.924429
2018-01-01 16:00:00 3.0 4.031129
2018-01-01 17:00:00 0.0 3.781534
2018-01-01 18:00:00 1.0 1.224745
2018-01-01 19:00:00 NaN 1.414214
Here is a more custom alternative:
Write a function for your logic that takes a window-sized array as input and returns the wanted result for that window:
def cus_std(x):
    notna = ~np.isnan(x)
    if notna.sum() > 2:  # at least 3 of the 5 values are present
        return np.std(x[notna], ddof=1)  # ddof=1 matches pandas' std()
    return np.nan
Then call the rolling function on your dataframe as below:
df.rolling(5).apply(cus_std, raw=True)

Include one row in multiple groupby groups

I am grouping a time series by hour to perform an operation on each hour of data separately:
import pandas as pd
from datetime import datetime, timedelta
x = [2, 2, 4, 2, 2, 0]
idx = pd.date_range(
start=datetime(2019, 1, 1),
end=datetime(2019, 1, 1, 2, 30),
freq=timedelta(minutes=30),
)
s = pd.Series(x, index=idx)
hourly = s.groupby(lambda x: x.hour)
print(s)
print("summed:")
print(hourly.sum())
which produces:
2019-01-01 00:00:00 2
2019-01-01 00:30:00 2
2019-01-01 01:00:00 4
2019-01-01 01:30:00 2
2019-01-01 02:00:00 2
2019-01-01 02:30:00 0
Freq: 30T, dtype: int64
summed:
0 4
1 6
2 2
dtype: int64
As expected.
I now want to know the area under the time series per hour, for which I can use numpy.trapz:
import numpy as np
def series_trapz(series):
    hours = [i.timestamp() / 3600 for i in series.index]
    return np.trapz(series, x=hours)
print("Area under curve")
print(hourly.agg(series_trapz))
But for this to work correctly, the boundaries between the groups must appear in both groups!
For example, the first group must be:
2019-01-01 00:00:00 2
2019-01-01 00:30:00 2
2019-01-01 01:00:00 4
and the second group must be
2019-01-01 01:00:00 4
2019-01-01 01:30:00 2
2019-01-01 02:00:00 2
etc.
Is this at all possible using pandas.groupby?
I don't think that I have your np.trapz logic completely correct here, but I think you can probably get what you want with .rolling(..., closed="both") so that the endpoints of the intervals are always included:
In [366]: s.rolling("1H", closed="both").apply(np.trapz).iloc[::2]
Out[366]:
2019-01-01 00:00:00 0.0
2019-01-01 01:00:00 5.0
2019-01-01 02:00:00 5.0
Freq: 60T, dtype: float64
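If the true hour-based spacing from series_trapz matters (the samples are 30 minutes apart, while np.trapz above assumes unit spacing), a hedged variant is to pass raw=False so each window arrives as a Series carrying its DatetimeIndex, and feed the timestamps to np.trapz as the x axis:
import numpy as np

# Assumes `s` from the question; Timestamp.timestamp() gives seconds
# since the epoch, so /3600 converts to hours, as in series_trapz.
area = s.rolling("1H", closed="both").apply(
    lambda w: np.trapz(w.values, x=[t.timestamp() / 3600 for t in w.index]),
    raw=False,
)
print(area.iloc[::2])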
I think you could repeat the rows at the group boundaries in your series using Series.repeat:
r = (s.index.minute == 0).astype(int) + 1
r[0] = 1  # don't duplicate the very first row
new_s = s.repeat(r)
print(new_s)
2019-01-01 00:00:00 2
2019-01-01 00:30:00 2
2019-01-01 01:00:00 4
2019-01-01 01:00:00 4
2019-01-01 01:30:00 2
2019-01-01 02:00:00 2
2019-01-01 02:00:00 2
2019-01-01 02:30:00 0
Then you could use Series.groupby:
groups = (new_s.index.to_series().shift(-1, fill_value=pd.Timestamp(0)).dt.minute != 0).cumsum()
for i, group in new_s.groupby(groups):
    print(group)
    print('-' * 50)
2019-01-01 00:00:00 2
2019-01-01 00:30:00 2
2019-01-01 01:00:00 4
Name: col1, dtype: int64
--------------------------------------------------
2019-01-01 01:00:00 4
2019-01-01 01:30:00 2
2019-01-01 02:00:00 2
Name: col1, dtype: int64
--------------------------------------------------
2019-01-01 02:00:00 2
2019-01-01 02:30:00 0
Name: col1, dtype: int64
--------------------------------------------------
IIUC, this can be solved manually with rolling:
hours = np.unique(s.index.floor('H'))
# the answer:
(s.add(s.shift())
  .mul(s.index.to_series()
        .diff()
        .dt.total_seconds()
        .div(3600))
  .rolling('1H').sum()[hours]
)
Output:
2019-01-01 00:00:00 NaN
2019-01-01 01:00:00 5.0
2019-01-01 02:00:00 5.0
dtype: float64

How to divide 60 mins datapoints into 15 mins?

I have a dataset with one value every 60 minutes. Now I want to divide it into 15-minute intervals, filling the new points with averages interpolated between the two hourly values. How do I do that?
Time A
2016-01-01 00:00:00 1
2016-01-01 01:00:00 5
2016-01-01 02:00:00 13
So, I now want it to be in 15mins interval with average values:
Time A
2016-01-01 00:00:00 1
2016-01-01 00:15:00 2 ### at 2016-01-01 00:00:00 values is 1 and
2016-01-01 00:30:00 3 ### at 2016-01-01 01:00:00 values is 5.
2016-01-01 00:45:00 4 ### Therefore we have to fill 4 values ( 15 mins interval )
2016-01-01 01:00:00 5 ### with the average of the hour values.
2016-01-01 01:15:00 7
2016-01-01 01:30:00 9
2016-01-01 01:45:00 11
2016-01-01 02:00:00 13
I tried resampling it to 15 minutes with mean, but that (obviously) won't work and gives NaN values. Can anyone help me out with how to do it?
I would just resample: df.resample("15min").interpolate("linear")
As you have the Time column set as the index already, it should work directly.
We can do this in one line with resample, replace and interpolate:
df.resample('15min').sum().replace(0, np.NaN).interpolate()
Output
A
Time
2016-01-01 00:00:00 1.0
2016-01-01 00:15:00 2.0
2016-01-01 00:30:00 3.0
2016-01-01 00:45:00 4.0
2016-01-01 01:00:00 5.0
2016-01-01 01:15:00 7.0
2016-01-01 01:30:00 9.0
2016-01-01 01:45:00 11.0
2016-01-01 02:00:00 13.0
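A caveat on the replace(0, np.NaN) step: it assumes the data contains no genuine zeros, which would be wiped out as well. A variant sketch that avoids this, assuming the same df, inserts the new 15-minute rows as NaN with asfreq() and then interpolates:
# asfreq() leaves the inserted rows as NaN instead of summing them to 0
df.resample('15min').asfreq().interpolate()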
You can do that like this:
import pandas as pd

df = pd.DataFrame({
    'Time': ["2016-01-01 00:00:00", "2016-01-01 01:00:00", "2016-01-01 02:00:00"],
    'A': [1, 5, 13]
})
df['Time'] = pd.to_datetime(df['Time'])
new_idx = pd.date_range(start=df['Time'].iloc[0], end=df['Time'].iloc[-1], freq='15min')
df2 = df.set_index('Time').reindex(new_idx).interpolate().reset_index()
df2.rename(columns={'index': 'Time'}, inplace=True)
print(df2)
# Time A
# 0 2016-01-01 00:00:00 1.0
# 1 2016-01-01 00:15:00 2.0
# 2 2016-01-01 00:30:00 3.0
# 3 2016-01-01 00:45:00 4.0
# 4 2016-01-01 01:00:00 5.0
# 5 2016-01-01 01:15:00 7.0
# 6 2016-01-01 01:30:00 9.0
# 7 2016-01-01 01:45:00 11.0
# 8 2016-01-01 02:00:00 13.0
If you want column A in the result to be an integer you can add something like:
df2['A'] = df2['A'].round().astype(int)

How to create a rolling time window with offset in Pandas

I want to apply some statistics on records within a time window with an offset. My data looks something like this:
lon lat stat ... speed course head
ts ...
2016-09-30 22:00:33.272 5.41463 53.173161 15 ... 0.0 0.0 511
2016-09-30 22:01:42.879 5.41459 53.173180 15 ... 0.0 0.0 511
2016-09-30 22:02:42.879 5.41461 53.173161 15 ... 0.0 0.0 511
2016-09-30 22:03:44.051 5.41464 53.173168 15 ... 0.0 0.0 511
2016-09-30 22:04:53.013 5.41462 53.173141 15 ... 0.0 0.0 511
[5 rows x 7 columns]
I need the records within time windows of 600 seconds, with steps of 300 seconds. For example, these windows:
start end
2016-09-30 22:00:00.000 2016-09-30 22:10:00.000
2016-09-30 22:05:00.000 2016-09-30 22:15:00.000
2016-09-30 22:10:00.000 2016-09-30 22:20:00.000
I have looked at Pandas rolling to do this. But it seems like it does not have the option to add the offset which I described above. Am I overlooking something, or should I create a custom function for this?
What you want to achieve should be possible by combining DataFrame.resample with DataFrame.shift.
import pandas as pd
index = pd.date_range('1/1/2000', periods=9, freq='T')
series = pd.Series(range(9), index=index)
df = pd.DataFrame(series)
That will give you a primitive time series (example taken from the DataFrame.resample API docs).
2000-01-01 00:00:00 0
2000-01-01 00:01:00 1
2000-01-01 00:02:00 2
2000-01-01 00:03:00 3
2000-01-01 00:04:00 4
2000-01-01 00:05:00 5
2000-01-01 00:06:00 6
2000-01-01 00:07:00 7
2000-01-01 00:08:00 8
Now resample by your step size (see DataFrame.resample).
sampled = df.resample('90s').sum()
This will give you non-overlapping windows of the step size.
2000-01-01 00:00:00 1
2000-01-01 00:01:30 2
2000-01-01 00:03:00 7
2000-01-01 00:04:30 5
2000-01-01 00:06:00 13
2000-01-01 00:07:30 8
Finally, shift the sampled df by one step and sum it with the unshifted version. This works because the window size is exactly twice the step size.
sampled.shift(1, fill_value=0) + sampled
This will yield:
2000-01-01 00:00:00 1
2000-01-01 00:01:30 3
2000-01-01 00:03:00 9
2000-01-01 00:04:30 12
2000-01-01 00:06:00 18
2000-01-01 00:07:30 21
There may be a more elegant solution, but I hope this helps.
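Applied to the question's own numbers (600-second windows every 300 seconds), the same trick might look like the following sketch, labeling each window by its start time; df is assumed to hold the numeric columns being summed:
# 300 s step bins; each 600 s window is then bin t plus bin t + 300 s
step = df.resample('300s').sum()
windows = step + step.shift(-1, fill_value=0)
Note that the last window only covers its first half, since there is no following bin to add.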
