I have a dataframe with one variable and an equally spaced datetime index at 1-second granularity. Say there are 1000 samples overall:
dates = pd.date_range('2015-1-1', periods=1000, freq='S')
df = pd.DataFrame(np.random.rand(1000),index=dates, columns=['X'])
X
2015-01-01 00:00:00 2.2
2015-01-01 00:00:01 2.5
2015-01-01 00:00:02 1.2
2015-01-01 00:00:03 1.5
2015-01-01 00:00:04 3.7
2015-01-01 00:00:05 3.1
etc
I want to determine the start of the rolling window (of a given length) that contains the largest number of the smallest values within that window size.
So in the example above, if the window was of size two, the answer would be:
start_index = 2015-01-01 00:00:02
end_index = 2015-01-01 00:00:03
I've tried reading the pandas documentation to see if there is a rolling computation that can help, but no luck. Thanks.
You just need to do a rolling_sum over df['X'] == df['X'].min(). Then the end of the window is simply:
>>> ts = df['X'] == df['X'].min()
>>> pd.rolling_sum(ts, win_size).argmax()
and in order to obtain the start of the window you may either shift the end of the window back or, alternatively, shift the series:
>>> pd.rolling_sum(ts.shift(-win_size), win_size).argmax()
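pd.rolling_sum has since been removed from pandas; a minimal sketch of the same idea with the current rolling API (win_size as above; the start offset assumes the 1-second spacing from the question):
import pandas as pd

ts = (df['X'] == df['X'].min()).astype(int)
end_index = ts.rolling(win_size).sum().idxmax()               # right edge of the best window
start_index = end_index - pd.Timedelta(seconds=win_size - 1)  # left edge, assuming 1-second spacing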
So, I have a DataFrame in which each row represents an event, with essentially 4 columns:
happen start_date end_date number
0 2015-01-01 2015-01-01 2015-01-03 100.0
1 2015-01-01 2015-01-01 2015-01-01 20.0
2 2015-01-01 2015-01-02 2015-01-02 50.0
3 2015-01-02 2015-01-02 2015-01-02 40.0
4 2015-01-02 2015-01-02 2015-01-03 50.0
where happen is the date the event took place, start_date and end_date are the validity of that event, and number is just a summable variable.
What I'd like to get is a DataFrame that has for each row the combination of the happen date and validity date and, contextually, the sum of the number column.
What I tried so far is a double for loop over all dates, knowing that start_date >= happen:
startdate = pd.to_datetime('01/06/2014', format='%d/%m/%Y')  # the minimum possible happen
enddate = pd.to_datetime('31/12/2021', format='%d/%m/%Y')    # the maximum possible happen (and validity)
df_day = pd.DataFrame()
for dt1 in pd.date_range(start=startdate, end=enddate):
    for dt2 in pd.date_range(start=dt1, end=enddate):
        num_sum = df[(df['happen'] == dt1) & (df['start_date'] <= dt2) &
                     (df['end_date'] >= dt2)]['number'].sum()
        row = {'happen': dt1, 'valid': dt2, 'number': num_sum}
        df_day = df_day.append(row, ignore_index=True)
and that never came to an end. So I tried another way: I generated the df with all date combinations first (about 3.8e6 rows), and then tried to fill it with a lambda func (it's crazy, I know, but I don't know how to work around it):
from itertools import product

dt1 = pd.date_range(start=startdate, end=enddate).tolist()
df_day = pd.DataFrame()
for i in dt1:
    dt_acc1 = [i]
    dt2 = pd.date_range(start=i, end=enddate).tolist()
    df_comb = pd.DataFrame(list(product(dt_acc1, dt2)), columns=['happen', 'valid'])
    df_day = df_day.append(df_comb, ignore_index=True)
df_day['number'] = 0

def append_num(happen, valid):
    return df[(df['happen'] == happen) & (df['start_date'] <= valid) &
              (df['end_date'] >= valid)]['number'].sum()

df_day['number'] = df_day.apply(lambda x: append_num(x['happen'], x['valid']), axis=1)
and this loop also takes forever.
My expected output is something like this:
happen valid number
0 2015-01-01 2015-01-01 120.0
1 2015-01-01 2015-01-02 150.0
2 2015-01-01 2015-01-03 100.0
3 2015-01-02 2015-01-02 90.0
4 2015-01-02 2015-01-03 50.0
5 2015-01-03 2015-01-03 0.0
As you can see, the first row represents the sum of all rows with happen on 2015-01-01 whose start_date and end_date contain the valid date 2015-01-01; the number column holds that sum (120. = 100. + 20.). On the second row, with valid moving one day forward, I "lose" the element with index 1 and "gain" the element with index 2 (150. = 100. + 50.).
Every help or suggestion is appreciated!
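One vectorized sketch that avoids the double loop (assuming happen, start_date and end_date are already datetimes): expand each event into one row per valid day with explode, then aggregate. Pairs with no matching event, like the 0.0 row in the expected output, would still have to be added by reindexing over the full happen/valid grid.
import pandas as pd

# one row per (event, valid day) pair, assuming df holds the event rows shown above
df['valid'] = [list(pd.date_range(s, e)) for s, e in zip(df['start_date'], df['end_date'])]
df_day = (df.explode('valid')
            .groupby(['happen', 'valid'], as_index=False)['number']
            .sum())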
I have a pandas dataframe (Python) indexed with timestamps roughly every 10 seconds. I want to find hourly averages, but all the functions I find start their averaging at even hours (e.g. hour 9 includes data from 08:00:00 to 08:59:50). Let's say I have the dataframe below.
Timestamp value data
2022-01-01 00:00:00 0.0 5.31
2022-01-01 00:00:10 0.0 0.52
2022-01-01 00:00:20 1.0 9.03
2022-01-01 00:00:30 1.0 4.37
2022-01-01 00:00:40 1.0 8.03
...
2022-01-01 13:52:30 1.0 9.75
2022-01-01 13:52:40 1.0 0.62
2022-01-01 13:52:50 1.0 3.58
2022-01-01 13:53:00 1.0 8.23
2022-01-01 13:53:10 1.0 3.07
Freq: 10S, Length: 5000, dtype: float64
So what I want to do:
Only look at data where value is consistently 1 for a full hour
Find an hourly average for these hours (which could e.g. be 01:30:00-02:29:50 and 11:16:30-12:16:20).
I hope I made my problem clear enough. How do I do this?
EDIT:
Maybe the question was phrased a bit unclearly.
I added a third column data, which is what I want to find the mean of. I am only interested in time intervals where value = 1 consistently through one hour; the rest of the data can be excluded.
EDIT #2:
A bit of background to my problem: I have a sensor giving me data every 10 seconds. For the data to be "approved", certain requirements have to be fulfilled (value in this example), and I need the hourly averages (and preferably timestamps for when they occur). So, in order to maximize the number of hours I can include in my analysis, I would like to find full hours even if they don't start at an even timestamp.
If I understand you correctly, you want a conditional mean: calculate the mean of the data column per hour, conditional on the value column being 1 for every 10s row in that hour.
Assuming your dataframe is called df, the steps to do this are:
Create a grouping column
This is your 'hour' column that can be created by
df['hour'] = df.index.hour  # assuming the timestamps are the index; use df['Timestamp'].dt.hour for a column
Create condition
Now that we've got a column to identify groups, we can check which groups are eligible - only those with value consistently equal to 1. Since there are 360 10s intervals per hour, grouping by hour and summing the value column should give exactly 360 for a fully valid hour.
Group and compute
We can now group and use the aggregate function to:
sum the value column to evaluate against our condition
compute the mean of the data column to return for the valid hours
# group and aggregate
df_mean = df[['hour', 'value', 'data']].groupby('hour').aggregate({'value': 'sum', 'data': 'mean'})
# apply condition
df_mean = df_mean[df_mean['value'] == 360]
That's it - you are left with a dataframe that contains the mean value of data for only the hours where you have a complete hour of value=1.
If you want to augment this so the grouping doesn't have to start on the full hour - say 08:00:10-09:00:10 instead of 08:00:00-09:00:00 - the solution is simple: adjust the grouping column but don't change anything else in the process.
To do this you can use a timedelta to shift the timestamps forward or back, so that the same hour extraction can still be leveraged to keep things simple.
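A minimal sketch of that adjustment, assuming the timestamps are in the index:
import pandas as pd

# hours now run from HH:00:10 to (HH+1):00:00; change the offset as needed
df['hour'] = (df.index - pd.Timedelta(seconds=10)).hour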
Infer grouping from data
One final idea - if you want to infer, on a rolling basis, which hours you have complete data for, you can do this with a rolling sum, which is even easier. You:
compute the rolling sum of value and mean of data
only select where value is equal to 360
df_roll = df.rolling(360).aggregate({'value': 'sum', 'data': 'mean'})
df_roll = df_roll[df_roll['value'] == 360]
Yes, there is. You need resample with an offset.
Make some test data
Please make sure to provide meaningful test data next time.
import pandas as pd
import numpy as np
# One day in 10 second intervals
index = pd.date_range(start='1/1/2018', end='1/2/2018', freq='10S')
df = pd.DataFrame({"data": np.random.random(len(index))}, index=index)
# This will set the first part of the data to 1, the rest to 0
df["value"] = (df.index < "2018-01-01 10:00:10").astype(int)
This is what we got:
>>> df
data value
2018-01-01 00:00:00 0.377082 1
2018-01-01 00:00:10 0.574471 1
2018-01-01 00:00:20 0.284629 1
2018-01-01 00:00:30 0.678923 1
2018-01-01 00:00:40 0.094724 1
... ... ...
2018-01-01 23:59:20 0.839973 0
2018-01-01 23:59:30 0.890321 0
2018-01-01 23:59:40 0.426595 0
2018-01-01 23:59:50 0.089174 0
2018-01-02 00:00:00 0.351624 0
Get the mean per hour with an offset
Here is a small function that checks if all value rows in the slice are equal to 1 and returns the mean if so, otherwise it (implicitly) returns None.
def get_conditioned_average(frame):
    if frame.value.eq(1).all():
        return frame.data.mean()
Now just apply this to hourly slices, starting, e.g., at 10 seconds after the full hour.
df2 = df.resample('H', offset='10S').apply(get_conditioned_average)
This is the final result:
>>> df2
2017-12-31 23:00:10 0.377082
2018-01-01 00:00:10 0.522144
2018-01-01 01:00:10 0.506536
2018-01-01 02:00:10 0.505334
2018-01-01 03:00:10 0.504431
...
2018-01-01 19:00:10 NaN
2018-01-01 20:00:10 NaN
2018-01-01 21:00:10 NaN
2018-01-01 22:00:10 NaN
2018-01-01 23:00:10 NaN
Freq: H, dtype: float64
The context
I am looking to apply a ufunc (cumsum in this case) to blocks of contiguous rows in a time series, which is stored in a pandas DataFrame.
This time series is sorted according to its DatetimeIndex.
Blocks are defined by a custom DatetimeIndex.
To do so, I came up with this (ok) code.
import random

import pandas as pd

# input dataset
length = 10
ts = pd.date_range(start='2021/01/01 00:00', periods=length, freq='1h')
random.seed(1)
val = random.sample(range(1, 10+length), length)
df = pd.DataFrame({'val': val}, index=ts)
# groupby custom datetimeindex
key_ts = [ts[i] for i in [1,3,7]]
df.loc[key_ts, 'id'] = range(len(key_ts))
df['id'] = df['id'].ffill()
# cumsum
df['cumsum'] = df.groupby('id')['val'].cumsum()
# initial dataset
In [13]: df
Out[13]:
val
2021-01-01 00:00:00 5
2021-01-01 01:00:00 3
2021-01-01 02:00:00 9
2021-01-01 03:00:00 4
2021-01-01 04:00:00 8
2021-01-01 05:00:00 13
2021-01-01 06:00:00 15
2021-01-01 07:00:00 14
2021-01-01 08:00:00 11
2021-01-01 09:00:00 7
# DatetimeIndex defining custom time intervals for 'resampling'.
In [14]: key_ts
Out[14]:
[Timestamp('2021-01-01 01:00:00', freq='H'),
Timestamp('2021-01-01 03:00:00', freq='H'),
Timestamp('2021-01-01 07:00:00', freq='H')]
# result
In [16]: df
Out[16]:
val id cumsum
2021-01-01 00:00:00 5 NaN -1
2021-01-01 01:00:00 3 0.0 3
2021-01-01 02:00:00 9 0.0 12
2021-01-01 03:00:00 4 1.0 4
2021-01-01 04:00:00 8 1.0 12
2021-01-01 05:00:00 13 1.0 25
2021-01-01 06:00:00 15 1.0 40
2021-01-01 07:00:00 14 2.0 14
2021-01-01 08:00:00 11 2.0 25
2021-01-01 09:00:00 7 2.0 32
The question
Is groupby the most efficient in terms of CPU and memory in this case where blocks are made with contiguous rows?
I would think that, with groupby, a first pass over the full dataset is made to identify all the rows to group together.
Knowing the rows are contiguous in my case, I don't need to read the full dataset to know I have gathered all the rows of the current group.
As soon as I hit the row of the next group, I know calculations are done with previous group.
In case rows are contiguous, the sorting step is lighter.
Hence the question, is there a way to mention this to pandas to save some CPU?
Thanks in advance for your feedback.
Best
groupby is clearly not the fastest solution here because it has to use either a slow sort or slow hashing operations to group the values.
What you want to implement is called a segmented cumulative sum. You can implement it quite efficiently using NumPy, but it is a bit tricky (especially because of the NaN values) and still not the fastest option, since it needs multiple passes over the id/val columns. The fastest solution is to use something like Numba to do it in a single pass.
Here is an implementation:
import numpy as np
import numba as nb

# To avoid the compilation cost at runtime, use:
# @nb.njit('int64[:](float64[:],int64[:])')
@nb.njit
def segmentedCumSum(ids, values):
    size = len(ids)
    res = np.empty(size, dtype=values.dtype)
    if size == 0:
        return res
    zero = values.dtype.type(0)
    curValue = zero
    for i in range(size):
        if not np.isnan(ids[i]):
            if i > 0 and ids[i-1] != ids[i]:
                curValue = zero
            curValue += values[i]
            res[i] = curValue
        else:
            res[i] = -1
            curValue = zero
    return res

df['cumsum'] = segmentedCumSum(df['id'].to_numpy(), df['val'].to_numpy())
Note that ids[i-1] != ids[i] might fail with big floats because of their imprecision. The best solution is to use integers and -1 to replace the NaN value. If you do want to keep the float values, you can use the expression np.abs(ids[i-1]-ids[i]) > epsilon with a very small epsilon. See this for more information.
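For reference, here is a rough sketch of the pure-NumPy route mentioned above (the function name is mine; it assumes equal ids are contiguous, as in the question, and keeps the same convention of writing -1 where id is NaN):
import numpy as np

def segmented_cumsum_np(ids, values):
    valid = ~np.isnan(ids)
    csum = np.cumsum(np.where(valid, values, 0))
    # a segment starts at position 0 and wherever the id changes (NaN != anything is True)
    new_seg = np.r_[True, ids[1:] != ids[:-1]]
    # index of the current segment's start, broadcast forward over the segment
    seg_start = np.maximum.accumulate(np.where(new_seg, np.arange(len(ids)), 0))
    # running total accumulated just before the segment started
    base = np.where(seg_start > 0, csum[seg_start - 1], 0)
    return np.where(valid, csum - base, -1).astype(values.dtype)

# df['cumsum'] = segmented_cumsum_np(df['id'].to_numpy(), df['val'].to_numpy())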
This is a small subset of my data:
heartrate
2018-01-01 00:00:00 67.0
2018-01-01 00:01:00 55.0
2018-01-01 00:02:00 60.0
2018-01-01 00:03:00 67.0
2018-01-01 00:04:00 72.0
2018-01-01 00:05:00 53.0
2018-01-01 00:06:00 62.0
2018-01-01 00:07:00 59.0
2018-01-01 00:08:00 117.0
2018-01-01 00:09:00 62.0
2018-01-01 00:10:00 65.0
2018-01-01 00:11:00 70.0
2018-01-01 00:12:00 49.0
2018-01-01 00:13:00 59.0
This data is daily heart rate data collected from patients. I am trying to see if, based on their heart rate, I can find the time window when they are asleep.
I am not sure how to write code that identifies the time window when the patient is asleep, because every few minutes there is a spike in the data. For example, in the data provided, from 2018-01-01 00:07:00 to 2018-01-01 00:08:00 the heartrate jumps from 59 to 117. Can anyone suggest a way around this, and a way to find the time window when the heartrate is below the mean for a few hours?
As mentioned in your comments, you can compute a rolling mean to smooth your signal using:
patient_data_df['rollingmeanVal'] = patient_data_df.rolling('3T').heartrate.mean()
Assuming you are using a dataframe and want to identify rows that have an HR below or equal to the mean, you can use:
HR_mean = patient_data_df['rollingmeanVal'].mean()
selected_data_df = patient_data_df[patient_data_df['rollingmeanVal'] <= HR_mean]
Then, instead of dealing with the dataframe as a time series, you can reset the index, which generates a column called index with the datetimes as values. Now that you have a dataframe with all values below the mean, you can split the rows into groups wherever there is more than 30 minutes difference between consecutive rows. This assumes that fluctuating data for up to 30 minutes is ok.
Assuming that the group with the most data is when the patient is asleep, you can identify that group. Using the first and last dates of this group, you can then identify the time window when the patient is asleep.
Reset the index, adding a new col called index with the time-series data:
selected_data_df.reset_index(inplace=True)
Group by:
selected_data_df['grp'] = selected_data_df['index'].diff().dt.seconds.ge(30 * 60).cumsum()
# keep the group with the most rows (sort the per-group counts in descending order)
sleep_grp = selected_data_df.groupby('grp').count().sort_values('index', ascending=False).head(1)
sleep_grp_index = sleep_grp.index.values[0]
sleep_df = selected_data_df[selected_data_df['grp'] == sleep_grp_index].drop('grp', axis=1)
Start of sleep time:
sleep_df['index'].iloc[0]
End of sleep time:
sleep_df['index'].iloc[-1]
You could use the Run Length Encoding (rle) function from base R to solve your problem. In step 1 you calculate the rolling mean of your patient's heart rate; you may use your solution or any other. Afterwards you add a logical flag to your data frame, e.g. patient['lowerVal'] = patient['heartrate'] < patient['rollingmeanVal']. Then apply the rle function to that lowerVal variable. As a result you get the lengths of the runs below and above the mean. By applying cumsum to the lengths, you get the locations of your sleeping time frames.
Sorry, this is Python. So you would use a Python equivalent of Run Length Encoding, for example the sketch below.
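A minimal NumPy sketch of run-length encoding (the rle function and its return values are my own illustration, not a library API):
import numpy as np

def rle(x):
    # return (run values, run lengths, run start indices) of consecutive runs in x
    x = np.asarray(x)
    change = np.flatnonzero(x[1:] != x[:-1]) + 1           # positions where the value changes
    starts = np.concatenate(([0], change))                 # start index of each run
    lengths = np.diff(np.concatenate((starts, [len(x)])))  # length of each run
    return x[starts], lengths, starts

# e.g. long runs of True in patient['lowerVal'] are candidate sleep windows:
# values, lengths, starts = rle(patient['lowerVal'].to_numpy())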
(newbie to python and pandas)
I have a data set of 15 to 20 million rows; each row is a time-indexed observation of a time a 'user' was seen, and I need to analyze the visits-per-day pattern of each user, normalized to their first visit. So, I'm hoping to plot with an X axis of "days after first visit" and a Y axis of "visits by this user on this day", i.e., I need a series indexed by a timedelta with values of visits in the period ending with that delta, e.g. [0:1, 3:5, 4:2, 6:8]. But I'm stuck very early ...
I start with something like this:
import pandas as pd
from pandas import DataFrame, Series

rng = pd.to_datetime(['2000-01-01 08:00', '2000-01-02 08:00',
                      '2000-01-01 08:15', '2000-01-02 18:00',
                      '2000-01-02 17:00', '2000-03-01 08:00',
                      '2000-03-01 08:20', '2000-01-02 18:00'])
uid = Series(['u1', 'u2', 'u1', 'u2', 'u1', 'u2', 'u2', 'u3'])
misc = Series(['', 'x1', 'A123', '1.23', '', '', '', 'u3'])
df = DataFrame({'uid': uid, 'misc': misc, 'ts': rng})
df = df.set_index(df.ts)
grouped = df.groupby('uid')
firstseen = grouped.first()
The ts values are unique to each uid but can be duplicated across uids (two uids can be seen at the same time, but any one uid is seen only once at any one timestamp).
The first step is (I think) to add a new column to the DataFrame, showing for each observation what the timedelta is back to the first observation for that user. But, I'm stuck getting that column in the DataFrame. The simplest thing I tried gives me an obscure-to-newbie error message:
df['sinceseen'] = df.ts - firstseen.ts[df.uid]
...
ValueError: cannot reindex from a duplicate axis
So I tried a brute-force method:
def f(row):
    return row.ts - firstseen.ts[row.uid]

df['sinceseen'] = Series([{idx: f(row)} for idx, row in df.iterrows()], dtype=timedelta)
In this attempt, df gets a sinceseen column but it's all NaN, and type(df.sinceseen[0]) shows float - though, if I just print the Series (in IPython), it generates a nice list of timedeltas.
I'm working back and forth through "Python for Data Analysis" and it seems like apply() should work, but
def fg(ugroup):
    ugroup['sinceseen'] = ugroup.index - ugroup.index.min()
    return ugroup

df = df.groupby('uid').apply(fg)
gives me a TypeError on the ugroup.index - ugroup.index.min() expression, even though each of the two operands is a Timestamp.
So, I'm flailing - can someone point me at the "pandas" way to get to the data structure I need?
Does this help you get started?
>>> df = DataFrame({'uid':uid,'misc':misc,'ts':rng})
>>> df = df.sort_values(["uid", "ts"])  # df.sort(...) in older pandas
>>> df["since_seen"] = df.groupby("uid")["ts"].apply(lambda x: x - x.iloc[0])
>>> df
misc ts uid since_seen
0 2000-01-01 08:00:00 u1 0 days, 00:00:00
2 A123 2000-01-01 08:15:00 u1 0 days, 00:15:00
4 2000-01-02 17:00:00 u1 1 days, 09:00:00
1 x1 2000-01-02 08:00:00 u2 0 days, 00:00:00
3 1.23 2000-01-02 18:00:00 u2 0 days, 10:00:00
5 2000-03-01 08:00:00 u2 59 days, 00:00:00
6 2000-03-01 08:20:00 u2 59 days, 00:20:00
7 u3 2000-01-02 18:00:00 u3 0 days, 00:00:00
[8 rows x 4 columns]
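From there, a possible next step toward the visits-per-day series asked for in the question (my own sketch, not part of the original answer): bucket since_seen into whole days and count visits per user per day-offset.
# whole days elapsed since each user's first visit
df["days_since"] = df["since_seen"].dt.days
# visits per user per day-offset; unstack or plot as needed
visits = df.groupby(["uid", "days_since"]).size()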