I am grouping a time series by hour to perform an operation on each hour of data separately:
import pandas as pd
from datetime import datetime, timedelta

x = [2, 2, 4, 2, 2, 0]
idx = pd.date_range(
    start=datetime(2019, 1, 1),
    end=datetime(2019, 1, 1, 2, 30),
    freq=timedelta(minutes=30),
)
s = pd.Series(x, index=idx)
hourly = s.groupby(lambda x: x.hour)

print(s)
print("summed:")
print(hourly.sum())
which produces:
2019-01-01 00:00:00 2
2019-01-01 00:30:00 2
2019-01-01 01:00:00 4
2019-01-01 01:30:00 2
2019-01-01 02:00:00 2
2019-01-01 02:30:00 0
Freq: 30T, dtype: int64
summed:
0 4
1 6
2 2
dtype: int64
As expected.
I now want to know the area under the time series per hour, for which I can use numpy.trapz:
import numpy as np

def series_trapz(series):
    hours = [i.timestamp() / 3600 for i in series.index]
    return np.trapz(series, x=hours)

print("Area under curve")
print(hourly.agg(series_trapz))
But for this to work correctly, the boundaries between the groups must appear in both groups!
For example, the first group must be:
2019-01-01 00:00:00 2
2019-01-01 00:30:00 2
2019-01-01 01:00:00 4
and the second group must be
2019-01-01 01:00:00 4
2019-01-01 01:30:00 2
2019-01-01 02:00:00 2
etc.
Is this at all possible using pandas.groupby?
I don't think that I have your np.trapz logic completely correct here, but I think you can probably get what you want with .rolling(..., closed="both") so that the endpoints of the intervals are always included:
In [366]: s.rolling("1H", closed="both").apply(np.trapz).iloc[::2]
Out[366]:
2019-01-01 00:00:00 0.0
2019-01-01 01:00:00 5.0
2019-01-01 02:00:00 5.0
Freq: 60T, dtype: float64
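If you want the integral against the actual hour positions rather than np.trapz's default unit spacing, here is a hedged variant (my sketch, not from the answer above): pass raw=False so each window arrives as a Series carrying its index.
def trapz_hours(window):
    # my helper (hypothetical name): convert the window's DatetimeIndex
    # from nanoseconds since the epoch to hours, then integrate against it
    hours = window.index.astype('int64') / 3.6e12
    return np.trapz(window.values, x=hours)

s.rolling("1H", closed="both").apply(trapz_hours, raw=False).iloc[::2]
On the sample data this should give 0.0, 2.5, 2.5: the true area in value-hours, i.e. half the unit-spacing numbers above, since the samples are 30 minutes apart.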
I think you could repeat the boundary values between the groups in your series using Series.repeat:
r = (s.index.minute == 0).astype(int) + 1
r[0] = 1  # don't duplicate the very first point
new_s = s.repeat(r)
print(new_s)
2019-01-01 00:00:00 2
2019-01-01 00:30:00 2
2019-01-01 01:00:00 4
2019-01-01 01:00:00 4
2019-01-01 01:30:00 2
2019-01-01 02:00:00 2
2019-01-01 02:00:00 2
2019-01-01 02:30:00 0
Then you could use Series.groupby:
groups = (new_s.index.to_series().shift(-1, fill_value=pd.Timestamp(0)).dt.minute != 0).cumsum()
for i, group in new_s.groupby(groups):
    print(group)
    print('-' * 50)
2019-01-01 00:00:00    2
2019-01-01 00:30:00    2
2019-01-01 01:00:00    4
dtype: int64
--------------------------------------------------
2019-01-01 01:00:00    4
2019-01-01 01:30:00    2
2019-01-01 02:00:00    2
dtype: int64
--------------------------------------------------
2019-01-01 02:00:00    2
2019-01-01 02:30:00    0
dtype: int64
--------------------------------------------------
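From here you can hand the overlapping groups to the series_trapz function from the question (my addition, hedged; it assumes series_trapz is in scope):
print(new_s.groupby(groups).agg(series_trapz))
1    2.5
2    2.5
3    0.5
dtype: float64
These are areas against real hour positions, so each full hour comes out as 2.5 rather than the unit-spacing 5.0 produced by the rolling answer above.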
IIUC, this can be solved manually with rolling:
hours = np.unique(s.index.floor('H'))

# the answer:
(s.add(s.shift())
  .mul(s.index.to_series()
        .diff()
        .dt.total_seconds()
        .div(3600))
  .rolling('1H').sum()[hours]
)
Output:
2019-01-01 00:00:00    NaN
2019-01-01 01:00:00    5.0
2019-01-01 02:00:00    5.0
dtype: float64
Note that, like the rolling answer above, this sums (y1 + y2) * dt without the trapezoid rule's factor of 1/2, so it matches np.trapz with unit spacing rather than the area against an hour-valued x axis.
Original Post
I am working with time series data and I want to apply a function to each DataFrame chunk for rolling time intervals/windows. When I use rolling() and apply() on a pandas DataFrame, it applies the function to each column separately for the given time interval. Here's example code:
Sample data
In:
df = pd.DataFrame({'A': [1, 2, 3, 4, 5, 6],
                   'B': [2, 4, 6, 8, 10, 12]},
                  index=pd.date_range('2019-01-01', periods=6, freq='5T'))
print(df)
Out:
A B
2019-01-01 00:00:00 1 2
2019-01-01 00:05:00 2 4
2019-01-01 00:10:00 3 6
2019-01-01 00:15:00 4 8
2019-01-01 00:20:00 5 10
2019-01-01 00:25:00 6 12
Output when using the combination of rolling() and apply():
In:
print(df.rolling('15T', min_periods=2).apply(lambda x: x.sum().sum()))
Out:
A B
2019-01-01 00:00:00 NaN NaN
2019-01-01 00:05:00 3.0 6.0
2019-01-01 00:10:00 6.0 12.0
2019-01-01 00:15:00 9.0 18.0
2019-01-01 00:20:00 12.0 24.0
2019-01-01 00:25:00 15.0 30.0
Desired Out:
2019-01-01 00:00:00 NaN
2019-01-01 00:05:00 9.0
2019-01-01 00:10:00 18.0
2019-01-01 00:15:00 27.0
2019-01-01 00:20:00 36.0
2019-01-01 00:25:00 45.0
Freq: 5T, dtype: float64
Currently, I am using a for loop to do the job, but I am looking for a more efficient way to handle this operation. I would appreciate a solution within the pandas framework, or even with other libraries.
Note: Please do not take the example function (summation) seriously; assume that the function of interest requires iterating over the chunks of the dataset as-is, i.e., with no prior per-column operations.
Thanks in advance!
Edit/Update
After reading the responses, I have realized that the example I gave was not adequate. Here I go again, using the same sample data:
In:
print(
    df.rolling('15T', min_periods=2).apply(
        lambda x: x['A'].mean() / x['B'].std()
    )
)
Out:
KeyError: 'A'
Desired Out:
2019-01-01 00:00:00 NaN
2019-01-01 00:05:00 1.06
2019-01-01 00:10:00 1.00
2019-01-01 00:15:00 1.50
2019-01-01 00:20:00 2.00
2019-01-01 00:25:00 2.50
Freq: 5T, dtype: float64
Again, I want to point out that the main objective of my question is to find an efficient way to iterate over chunks of dataframes. For example, I do not want the following solution.
df['A'].rolling('15T', min_periods=2).mean() / df['B'].rolling('15T', min_periods=2).std()
And, for those who are interested in the real problem rather than the simple example, you can check it out at mlfactor, where the triple barrier method is explained.
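Not an answer from the original thread, but a minimal sketch of the chunk-iteration idea (assuming the sample df above): slice the frame once per window end instead of relying on pandas' per-column apply. rolling_chunks is a hypothetical helper name, and it mimics rolling's (t - window, t] semantics:
import pandas as pd

def rolling_chunks(frame, window, min_periods=1):
    # yield (end_time, chunk) pairs that mimic a time-based rolling window
    offset = pd.Timedelta(window)
    for end in frame.index:
        chunk = frame[(frame.index > end - offset) & (frame.index <= end)]
        if len(chunk) >= min_periods:
            yield end, chunk

out = pd.Series(
    {end: chunk['A'].mean() / chunk['B'].std()
     for end, chunk in rolling_chunks(df, '15T', min_periods=2)}
)
On the sample df this should reproduce the desired 1.06, 1.00, 1.50, ... values (the 00:00:00 row is simply absent rather than NaN). It is still a Python-level loop, so it buys flexibility rather than speed.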
If I make up a df with some random data:
import numpy as np
import pandas as pd

np.random.seed(11)
rows, cols = 24, 3
data = np.random.rand(rows, cols)
tidx = pd.date_range('2019-01-01', periods=rows, freq='H')
df = pd.DataFrame(data, columns=['Temperature1', 'Temperature2', 'Value'], index=tidx)
How could I use a lambda function to add 5000 to each row of the columns Temperature1 & Temperature2, but only where the df index hour is less than 6?
If I use
for hour in df.index.hour:
    if hour < 6:  # and name contains 'Temperature'
        df = df.apply(lambda x: x + 5000)
The above code isn't correct: it adds 5000 to all rows in the df. Any tips greatly appreciated!
You can do this with loc:
# get the temperature columns
temp_cols = [x for x in df.columns if 'Temperature' in x]
# update with loc access
df.loc[df.index.hour<6, temp_cols] += 5000
Output:
Temperature1 Temperature2 Value
2019-01-01 00:00:00 5000.180270 5000.019475 0.463219
2019-01-01 01:00:00 5000.724934 5000.420204 0.485427
2019-01-01 02:00:00 5000.012781 5000.487372 0.941807
2019-01-01 03:00:00 5000.850795 5000.729964 0.108736
2019-01-01 04:00:00 5000.893904 5000.857154 0.165087
2019-01-01 05:00:00 5000.632334 5000.020484 0.116737
2019-01-01 06:00:00 0.316367 0.157912 0.758980
2019-01-01 07:00:00 0.818275 0.344624 0.318799
2019-01-01 08:00:00 0.111661 0.083953 0.712726
2019-01-01 09:00:00 0.599543 0.055674 0.479797
2019-01-01 10:00:00 0.401676 0.847979 0.717849
2019-01-01 11:00:00 0.602064 0.552384 0.949102
2019-01-01 12:00:00 0.986673 0.338054 0.239875
2019-01-01 13:00:00 0.796436 0.063686 0.364616
2019-01-01 14:00:00 0.070023 0.319368 0.070383
2019-01-01 15:00:00 0.290264 0.790101 0.905400
2019-01-01 16:00:00 0.792621 0.561819 0.616018
2019-01-01 17:00:00 0.361484 0.168817 0.436241
2019-01-01 18:00:00 0.732825 0.062888 0.020733
2019-01-01 19:00:00 0.770548 0.299952 0.701164
2019-01-01 20:00:00 0.734668 0.932905 0.400328
2019-01-01 21:00:00 0.358438 0.806567 0.764491
2019-01-01 22:00:00 0.652615 0.810967 0.642215
2019-01-01 23:00:00 0.957444 0.333874 0.738253
Boolean-select the columns whose name contains 'Temperature':
m = df.columns.str.contains('Temperature')
Then select the rows with hour < 6 and update:
df.loc[df.index.hour < 6, m] += 5000
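Equivalently (a hedged alternative, not from either answer above), DataFrame.filter can pick out the temperature columns:
temp_cols = df.filter(like='Temperature').columns
df.loc[df.index.hour < 6, temp_cols] += 5000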
I have a dataset with values at 60-minute intervals. Now I want to divide it into 15-minute intervals, filling the new points by interpolating between the two hourly values. How do I do that?
Time A
2016-01-01 00:00:00 1
2016-01-01 01:00:00 5
2016-01-01 02:00:00 13
So I now want it to be at 15-minute intervals, with interpolated values:
Time A
2016-01-01 00:00:00 1
2016-01-01 00:15:00 2 ### at 2016-01-01 00:00:00 the value is 1 and
2016-01-01 00:30:00 3 ### at 2016-01-01 01:00:00 the value is 5.
2016-01-01 00:45:00 4 ### Therefore we have to fill the 15-minute slots
2016-01-01 01:00:00 5 ### in between by interpolating the hourly values.
2016-01-01 01:15:00 7
2016-01-01 01:30:00 9
2016-01-01 01:45:00 11
2016-01-01 02:00:00 13
I tried resampling it with mean to 15 minutes, but it won't work (obviously) and it gives NaN values. Can anyone help me out on how to do it?
I would just resample: df.resample("15min").interpolate("linear")
As you already have the Time column set as the index, it should work directly.
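If you prefer it spelled out, a hedged equivalent of that one-liner is to upsample with asfreq (which inserts NaN rows at the new 15-minute slots) and then interpolate:
df.resample('15min').asfreq().interpolate('linear')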
We can do this in one line with resample, replace and interpolate (note this treats any genuine zeros in the data as missing):
df.resample('15min').sum().replace(0, np.nan).interpolate()
Output
A
Time
2016-01-01 00:00:00 1.0
2016-01-01 00:15:00 2.0
2016-01-01 00:30:00 3.0
2016-01-01 00:45:00 4.0
2016-01-01 01:00:00 5.0
2016-01-01 01:15:00 7.0
2016-01-01 01:30:00 9.0
2016-01-01 01:45:00 11.0
2016-01-01 02:00:00 13.0
You can do that like this:
import pandas as pd
df = pd.DataFrame({
    'Time': ["2016-01-01 00:00:00", "2016-01-01 01:00:00", "2016-01-01 02:00:00"],
    'A': [1, 5, 13]
})
df['Time'] = pd.to_datetime(df['Time'])
new_idx = pd.date_range(start=df['Time'].iloc[0], end=df['Time'].iloc[-1], freq='15min')
df2 = df.set_index('Time').reindex(new_idx).interpolate().reset_index()
df2.rename(columns={'index': 'Time'}, inplace=True)
print(df2)
# Time A
# 0 2016-01-01 00:00:00 1.0
# 1 2016-01-01 00:15:00 2.0
# 2 2016-01-01 00:30:00 3.0
# 3 2016-01-01 00:45:00 4.0
# 4 2016-01-01 01:00:00 5.0
# 5 2016-01-01 01:15:00 7.0
# 6 2016-01-01 01:30:00 9.0
# 7 2016-01-01 01:45:00 11.0
# 8 2016-01-01 02:00:00 13.0
If you want column A in the result to be an integer you can add something like:
df2['A'] = df2['A'].round().astype(int)
I have a series of floats with a DatetimeIndex that I have resampled into 3-hour bins. As such I have an index containing
2015-01-01 09:00:00
2015-01-01 12:00:00
2015-01-01 15:00:00
2015-01-01 18:00:00
2015-01-01 21:00:00
2015-01-02 00:00:00
2015-01-02 03:00:00
2015-01-02 06:00:00
2015-01-02 09:00:00
and so forth. I am trying to sum the floats associated with each time of day, say 09:00:00, for all days.
The only way I can think to do it, with my limited experience, is to convert this series to a dataframe using the datetime index as another column, then iterate to check whether the hour slots of the datetimes are equal to one another and sum the values. I feel like this is horribly inefficient and probably not the 'correct' way to do it. Any help would be appreciated!
IIUC:
In [116]: s
Out[116]:
2015-01-01 09:00:00 3
2015-01-01 12:00:00 1
2015-01-01 15:00:00 0
2015-01-01 18:00:00 1
2015-01-01 21:00:00 0
2015-01-02 00:00:00 9
2015-01-02 03:00:00 2
2015-01-02 06:00:00 2
2015-01-02 09:00:00 7
2015-01-02 12:00:00 8
Freq: 3H, Name: val, dtype: int32
In [117]: s.groupby(s.index - s.index.normalize()).sum()
Out[117]:
00:00:00 9
03:00:00 2
06:00:00 2
09:00:00 10
12:00:00 9
15:00:00 0
18:00:00 1
21:00:00 0
Name: val, dtype: int32
or:
In [118]: s.groupby(s.index.hour).sum()
Out[118]:
0 9
3 2
6 2
9 10
12 9
15 0
18 1
21 0
Name: val, dtype: int32
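A third option (my addition, hedged): group by the datetime.time of each stamp, which gives the same 09:00:00-style labels without the timedelta arithmetic:
s.groupby(s.index.time).sum()  # same sums, keyed by datetime.time objects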
Let's look at some one-minute data:
In [513]: rng = pd.date_range('1/1/2000', periods=12, freq='T')
In [514]: ts = pd.Series(np.arange(12), index=rng)
In [515]: ts
Out[515]:
2000-01-01 00:00:00 0
2000-01-01 00:01:00 1
2000-01-01 00:02:00 2
2000-01-01 00:03:00 3
2000-01-01 00:04:00 4
2000-01-01 00:05:00 5
2000-01-01 00:06:00 6
2000-01-01 00:07:00 7
2000-01-01 00:08:00 8
2000-01-01 00:09:00 9
2000-01-01 00:10:00 10
2000-01-01 00:11:00 11
Freq: T
Suppose you wanted to aggregate this data into five-minute chunks or bars by taking
the sum of each group:
In [516]: ts.resample('5min', closed='right', label='right').sum()
Out[516]:
2000-01-01 00:00:00 0
2000-01-01 00:05:00 15
2000-01-01 00:10:00 40
2000-01-01 00:15:00 11
Freq: 5T
However, I don't want to use the resample method and still want the same input and output. How can I use groupby or reindex or other such methods?
You can use a custom pd.Grouper this way:
In [78]: ts.groupby(pd.Grouper(freq='5min', closed='right')).sum()
Out [78]:
1999-12-31 23:55:00 0
2000-01-01 00:00:00 15
2000-01-01 00:05:00 40
2000-01-01 00:10:00 11
Freq: 5T, dtype: int64
The closed='right' ensures the bin contents are the same; note that pd.Grouper labels each bin with its left edge here, so the printed labels are shifted relative to the resample output.
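If you want the labels to match as well, pd.Grouper also accepts a label argument (my addition, hedged):
ts.groupby(pd.Grouper(freq='5min', closed='right', label='right')).sum()
which should reproduce the resample labels (00:00, 00:05, ...) along with the sums.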
However, if your aim is to do more custom grouping, you can use .groupby with your own vector:
In [78]: buckets = (ts.index - ts.index[0]) / pd.Timedelta('5min')
In [79]: grp = ts.groupby(np.ceil(buckets.values))
In [80]: grp.sum()
Out[80]:
0 0
1 15
2 40
3 11
dtype: int64
The output is not exactly the same, but the method is more flexible (e.g. can create uneven buckets).
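For example, a hedged sketch of uneven buckets using searchsorted on hand-picked edges:
edges = pd.to_datetime(['2000-01-01 00:03', '2000-01-01 00:10'])
labels = edges.searchsorted(ts.index, side='right')
ts.groupby(labels).sum()
0     3
1    42
2    21
dtype: int64
This puts 00:00-00:02 in bucket 0, 00:03-00:09 in bucket 1, and the rest in bucket 2.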