pandas rolling average with a rolling mask / excluding entries - python

I have a pandas dataframe with a time index like this
import pandas as pd
import numpy as np
idx = pd.date_range(start='2000',end='2001')
df = pd.DataFrame(np.random.normal(size=(len(idx),2)),index=idx)
which looks like this:
                   0         1
2000-01-01  0.565524  0.355548
2000-01-02 -0.234161  0.888384
I would like to compute a rolling average like
df_avg = df.rolling(60).mean()
but always excluding the entries corresponding to (let's say) 10 days before, plus or minus 2 days. In other words, for each date, df_avg should contain the mean (exponential with ewm, or flat) of the previous 60 entries, but excluding the entries from t-48 to t-52. I guess I need a kind of rolling mask, but I don't know how to build one. I could also compute two separate averages and obtain the result as their difference, but that feels dirty, and I wonder if there is a better way that generalizes to other non-linear computations...
Many thanks!

You can use apply to customize your function:
# select indexes you want to average over
avg_idx = [idx for idx in range(60) if idx not in range(8, 13)]
# do rolling computation, calculating average only on the specified indexes
df_avg = df.rolling(60).apply(lambda x: x[avg_idx].mean(), raw=True)  # raw=True passes a plain numpy array
The x array passed to apply will always have 60 entries, so you can specify your positional indexes relative to the window, knowing that the first entry (index 0) is the oldest of the 60 observations.
I am not entirely sure about your exclusion logic, but you can easily modify this solution for your case.
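If you also want the exponential flavour mentioned in the question, the same masked-window idea works with explicit weights. A minimal sketch, assuming a halflife of 10 observations (an illustrative choice, not from the question):
import numpy as np
window = 60
# positions to keep within each 60-observation window
mask = np.array([i not in range(8, 13) for i in range(window)])
halflife = 10.0  # assumed decay parameter, tune to taste
# exponential weight per position; position window-1 is the newest observation
weights = 0.5 ** ((window - 1 - np.arange(window)) / halflife)
weights = np.where(mask, weights, 0.0)
df_ewm_avg = df.rolling(window).apply(
    lambda x: np.dot(x, weights) / weights.sum(), raw=True
)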

Unfortunately, no. From the pandas source code:
df.rolling(window, min_periods=None, freq=None, center=False, win_type=None,
           on=None, axis=0, closed=None)

window : int, or offset
    Size of the moving window. This is the number of observations used for
    calculating the statistic. Each window will be a fixed size.
    If it's an offset then this will be the time period of each window. Each
    window will be a variable sized based on the observations included in
    the time-period.
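For completeness, the offset-style window that the docstring mentions does work when the index is a DatetimeIndex; it just offers no way to exclude a sub-range inside the window:
# each window spans the previous 60 calendar days rather than 60 rows
df_time_avg = df.rolling('60D').mean()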

Related

Is there a way to loop through pandas dataframe and drop windows of rows dependent on condition?

Problem Summary - I have a dataframe of ~10,000 rows. Some rows contain data aberrations that I would like to get rid of, and those aberrations are tied to observations made at certain temperatures (one of the data columns).
What I've tried - My thought is that the easiest way to remove the rows of bad data is to loop through the temperature intervals, find the maximum index that is less than each temperature interval observation, and use the df.drop function to get rid of a window of rows around that index. Between every temperature interval at which bad data is observed, I reset the index of the dataframe. However, it seems to be completely unstable! Sometimes it nearly works; other times it throws key errors. I think my problem may be in working with the dataframe in place, but I don't see another way to do it.
Example Code:
Here is an example with a synthetic dataframe and a function that uses the same principles that I've tried. Note that I've tried different renditions with .loc and .iloc (commented out below).
# Create synthetic dataframe
import pandas as pd
import numpy as np
temp_series = pd.Series(range(25, 126, 1))
temp_noise = np.random.rand(len(temp_series)) * 3
df = pd.DataFrame({'temp': (temp_series + temp_noise),
                   'data': (np.random.rand(len(temp_series))) * 400})
# Calculate length of original and copy the original because the function works in place.
before_length = len(df)
df_dup = df.copy()
temp_intervals = [50, 70, 92.7]
window = 5
From here, run a function based on the dataframe (df), the temperature observations (temp_intervals) and the window size (window):
def remove_window(df, intervals, window):
    '''Loop through the temperature intervals to define a window of indices
    around given temperatures in the dataframe to drop. Drop the window of
    indices in place and reset the index prior to moving to the next interval.
    '''
    for temp in intervals[0:len(intervals)]:
        # Find index where temperature first crosses the interval input
        cent_index = max(df.index[df['temp'] <= temp].tolist())
        # Define window of indices to remove from the df
        drop_indices = list(range(cent_index - window, cent_index + window))
        # Use df.drop
        df.drop(drop_indices, inplace=True)
        df.reset_index(drop=True)
    return df
So, is this a problem with the function I've defined, or is there a problem with df.drop?
Thank you,
Brad
It can be tricky to repeatedly delete parts of the dataframe and keep track of what you're doing. A cleaner approach is to keep track of which rows you want to delete within the loop, but only delete them outside of the loop, all at once. This should also be faster.
def remove_window(df, intervals, window):
    # Create a Boolean array indicating which rows to keep
    keep_row = np.repeat(True, len(df))
    for temp in intervals:
        # Find index where temperature first crosses the interval input
        cent_index = max(df.index[df['temp'] <= temp].tolist())
        # Mark the window of rows around cent_index for removal
        keep_row[cent_index - window:cent_index + window] = False
    # Delete all unwanted rows at once, outside the loop
    df = df[keep_row]
    df.reset_index(drop=True, inplace=True)
    return df
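A quick usage sketch with the synthetic dataframe from the question (variable names as defined there):
df_clean = remove_window(df_dup, temp_intervals, window)
# expect up to 30 rows gone: 3 intervals x a window of 10 rows, barring overlap
print(before_length - len(df_clean))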

Add rolling window to columns in each row in pandas

I have a timeseries in a dataframe and would like to add the rolling window with window size n to each of the rows.
df['A'].rolling(window=6)
This means each row would have 6 additional columns holding the respective values of the rolling window for that column. How can I achieve this using the rolling function of pandas, automatically naming the columns [t-1, t-2, t-3, ..., t-n]?
df.shift(n) is what you're looking for:
n = 6  # max t-n
for t in range(1, n + 1):
    df[f'A-{t}'] = df['A'].shift(t)
Ref. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.shift.html
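If you specifically want the t-1 ... t-n column names from the question, a variant of the same idea builds all lag columns in one go; the naming format is just an assumption here:
# assumes the usual: import pandas as pd
lags = pd.concat({f't-{t}': df['A'].shift(t) for t in range(1, n + 1)}, axis=1)
df = df.join(lags)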

Pandas - Understanding how rolling averages work

So I'm trying to calculate rolling averages, based on some column and some groupby columns.
In my case:
rolling column = RATINGS,
groupby_columns = ["DEMOGRAPHIC","ORIGINATOR","START_ROUND_60","WDAY","PLAYBACK_PERIOD"]
one group of my data looks like this: [table image omitted]
my code to compute the rolling average is:
df['rolling'] = df.groupby(groupby_columns)['RATINGS'].\
    apply(lambda x: x.shift().rolling(10, min_periods=1).mean())
What I don't understand is what is happening when the RATINGS value are starting to be NaN.
As my window size is 10, I would expect the second number in the test (index 11) to be:
np.mean([178,479,72,272,158,37,85.5,159,107,164.55]) = 171.205
But it is instead 171.9444, and the same applies to the next numbers.
What is happening here?
And how should I calculate the next rolling averages the way I want (simply averaging the last 10 ratings, and if a rating is NaN, taking the calculated average of the previous row instead)?
Any help will be appreciated.
np.mean([178,479,72,272,158,37,85.5,159,107,164.55]) = 171.205
Where does the 164.55 come from? The rest of those values are from the "RATINGS" column and the 164.55 is from the "rolling" column. Maybe I am misunderstanding what the rolling function does.
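For reference, the 171.9444 in the question is exactly what you get if pandas skips the NaN and averages the remaining nine ratings; rolling(10, min_periods=1).mean() ignores NaNs inside the window rather than substituting the previously computed rolling value. A quick check:
import numpy as np
# the ten window values minus the NaN leave these nine ratings
vals = [178, 479, 72, 272, 158, 37, 85.5, 159, 107]
print(np.mean(vals))  # 171.9444..., matching the observed output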

Removing points which deviate too much from adjacent point in Pandas

So I'm doing some time series analysis in Pandas and have a peculiar pattern of outliers which I'd like to remove. The plot below is based on a dataframe with the first column as a date and the second column as the data.
As you can see, those points of similar values, interspersed so that they look like lines, are likely instrument quirks and should be removed. I've tried rolling mean, rolling median, and removal based on standard deviation, to no avail. For an idea of density, it's daily measurements from 1984 to the present. Any ideas?
gauge = pd.read_csv('GaugeData.csv', parse_dates=[0], header=None)
gauge.columns = ['Date', 'Gauge']
gauge = gauge.set_index(['Date'])
gauge['1990':'1995'].plot(style='*')
And the result of applying a centered rolling mean:
gauge = gauge.rolling(5, center=True).mean()  # gauge.diff()
gauge['1990':'1995'].plot(style='*')
After the rolling mean: [plot omitted]
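For example, one of the deviation-based attempts looked roughly like this (the threshold is an illustrative value):
med = gauge['Gauge'].rolling(5, center=True).median()
threshold = 50.0  # chosen by eye
gauge_filtered = gauge[(gauge['Gauge'] - med).abs() < threshold]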
You can demand that each data point has at least "N" "nearby" data points within a certain distance "D".
N can be 2 or more.
nearby for element gauge[i] can be a pair like gauge[i-1] and gauge[i+1], but since some points only have neighbors on one side, you can instead ask for at least two elements whose index (date) distance is at most 2. So, at least 2 of {gauge[i-2], gauge[i-1], gauge[i+1], gauge[i+2]} should satisfy: distance(gauge[i], gauge[ix]) < D.
D - you can decide this based on how close you expect those real data points to be.
It won't be perfect, but it should get most of the noise out of the dataset.
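A minimal sketch of that filter, assuming N=2 nearby points within value distance D (both defaults illustrative):
import numpy as np

def filter_lonely_points(s, N=2, D=50.0):
    # Keep s[i] only if at least N of {s[i-2], s[i-1], s[i+1], s[i+2]}
    # lie within distance D of s[i].
    vals = s.to_numpy()
    keep = np.zeros(len(vals), dtype=bool)
    for i in range(len(vals)):
        close = 0
        for j in (i - 2, i - 1, i + 1, i + 2):
            if 0 <= j < len(vals) and abs(vals[i] - vals[j]) < D:
                close += 1
        keep[i] = close >= N
    return s[keep]

gauge_clean = filter_lonely_points(gauge['Gauge'])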

Finding a range in the series with the most frequent occurrence of the entries over a defined time (in Pandas)

I have a large dataset in Pandas in which the entries are marked with a time stamp. I'm looking for a way to get a range of a defined length (like 1 minute) with the highest occurrence of entries.
One solution could be to resample the data to a higher timeframe (such as a minute) and compare the sections with the highest number of values. However, it would only find ranges that align with the start and end times of the given timeframe.
I'd rather have a solution that finds any 1-minute range, no matter where it actually starts.
In the following example, I would be looking for the 1-minute window with the highest occurrence of entries, starting with the first signal in the range and ending with the last signal in the range:
8:50:00
8:50:01
8:50:03
8:55:00
8:59:10
9:00:01
9:00:02
9:00:03
9:00:04
9:05:00
Thus I would like to get the range 8:59:10 - 9:00:04.
Any hints on how to accomplish this?
You need to create 1-minute windows with a sliding start time of 1 second, then take the maximum occurrence over all of the windows. In pandas 0.19.0 or greater, you can resample a time series using base as an argument to start the resampled windows at different times.
I used tempfile to copy your data as a toy data set below.
import tempfile
import pandas as pd
tf = tempfile.TemporaryFile()
tf.write(b'''8:50:00
8:50:01
8:50:03
8:55:00
8:59:10
9:00:01
9:00:02
9:00:03
9:00:04
9:05:00''')
tf.seek(0)
df = pd.read_table(tf, header=None)
df.columns = ['time']
df.time = pd.to_datetime(df.time)
max_vals = []
for t in range(60):
    # .max().max() is not a mistake; use it to return just the value
    max_vals.append(
        (t, df.resample('60s', on='time', base=t).count().max().max())
    )
max(max_vals, key=lambda x: x[-1])
# returns:
(5, 5)
For this toy dataset, an offset of 5 seconds for the window boundaries (i.e. 8:49:05, 8:50:05, ...) is the first offset that achieves the maximum count: 5 entries within a 1-minute window.
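Note that in pandas 1.1 and later the base argument is deprecated; the offset argument to resample plays the same role inside the same loop:
# same sliding start, expressed with the newer API
df.resample('60s', on='time', offset=f'{t}s').count()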
