I'm trying to calculate rolling averages based on some column and some groupby columns.
In my case:
rolling column = RATINGS,
groupby_columns = ["DEMOGRAPHIC","ORIGINATOR","START_ROUND_60","WDAY","PLAYBACK_PERIOD"]
One group of my data looks like this:
my code to compute the rolling average is:
df['rolling'] = df.groupby(groupby_columns)['RATINGS'].\
    apply(lambda x: x.shift().rolling(10, min_periods=1).mean())
What I don't understand is what happens when the RATINGS values start to be NaN.
As my window size is 10, I would expect the second number in the test (index 11) to be:
np.mean([178,479,72,272,158,37,85.5,159,107,164.55]) = 171.205
But it is instead 171.9444, and the same applies to the next numbers.
What is happening here?
And how should I calculate the next rolling averages the way I want: simply average the last 10 ratings, and if a rating is NaN, take the calculated average of the previous row instead?
Any help will be appreciated.
np.mean([178,479,72,272,158,37,85.5,159,107,164.55]) = 171.205
Where does the 164.55 come from? The rest of those values are from the "RATINGS" column and the 164.55 is from the "rolling" column. Maybe I am misunderstanding what the rolling function does.
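For what it's worth, a minimal sketch of how rolling treats NaNs may explain the discrepancy: NaN entries are skipped rather than averaged in, so with min_periods=1 each window takes the mean of only the non-NaN values it contains, and the computed 'rolling' column is never part of the window.
import numpy as np
import pandas as pd

s = pd.Series([178.0, np.nan, 72.0])
# NaNs are skipped, not imputed: each window averages only
# the non-NaN values inside it
print(s.rolling(3, min_periods=1).mean())
# 0    178.0
# 1    178.0
# 2    125.0
# dtype: float64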
I have a pandas dataframe that you can see in the screenshot. The dataframe has a time resolution of 15 minutes (it is generation data). I would like to reduce this time resolution to 1 hour, meaning that I should take every 4th row, and the value in every 4th row should be the average of the last 4 rows (including that one). So it should be a rolling average with non-overlapping horizons.
I tried the following for one column (wind offshore):
df_generation = pd.read_csv("C:/Users/Desktop/Data/generation_data.csv", sep=",")
df_generation_2 = df_generation.copy()  # copy so the original frame stays untouched
df_generation_2['Wind Offshore Average'] = df_generation_2['Wind Offshore'].rolling(4).mean()
But this is not really what I want. As you can see in the screenshot, my code just created a further column with the average of the last 4 entries for every timeslot, so the rolling average has overlapping horizons. What I want is a new dataframe that only has an entry for every hour (every 4 timeslots of the original array). Do you have an idea how I can do that? I'd appreciate every comment.
From looking at your index, it looks like the .resample method is what you are looking for (the documentation has many examples for specific uses): https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.resample.html
as in
new = df_generation['Wind Offshore'].resample('1H').mean()
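Note that resample needs a datetime index (or an on= column). A minimal sketch, assuming the CSV has a timestamp column (the name 'Timestamp' here is hypothetical):
import pandas as pd

# 'Timestamp' is a hypothetical column name; adjust to the actual CSV
df_generation = pd.read_csv("generation_data.csv", sep=",", parse_dates=["Timestamp"])
df_hourly = (
    df_generation
    .set_index("Timestamp")["Wind Offshore"]
    .resample("1H")   # non-overlapping 1-hour bins
    .mean()
)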
As part of a treatment for a health-related issue, I need to measure my liquid intake (along with some other parameters), registering the amount of liquid every time I drink. I have a dataframe of several months of such registrations.
I want to sum my daily amount in an additional column (in red, image below)
As you may see, I would like to store it in the first row of each day's slice returned by df.groupby(df['Date']), for all the days.
I tried the following:
df.groupby(df.Date).first()['Total']= df.groupby(df.Date)['Drank'].fillna(0).sum()
But that seems not to be the way to do it.
Grateful for any advice.
Thanks
Michael
Use the fact that False == 0.
The first row of each date is the one where Date is not equal to Date.shift().
Use merge() to bring the per-day sum back.
import numpy as np
import pandas as pd

## construct a data set
d = pd.date_range("1-jan-2021", "1-mar-2021", freq="2H")
A = np.random.randint(20, 300, len(d)).astype(float)
A[np.random.choice(A.size, A.size // 2, replace=False)] = np.nan  # blank out half the readings
df = pd.DataFrame({"datetime": d, "Drank": A})
df = df.assign(Date=df.datetime.dt.date, Time=df.datetime.dt.time).drop(columns=["datetime"]).loc[:, ["Date", "Time", "Drank"]]
## construction done
# first row will have different date to shift
# merge Total back
df.assign(row=df.Date.eq(df.Date.shift())).merge(
    df.groupby("Date", as_index=False).agg(Total=("Drank", "sum")).assign(row=0),
    on=["Date", "row"],
    how="left",
).drop(columns="row")
I have a pandas dataframe with a time index like this
import pandas as pd
import numpy as np
idx = pd.date_range(start='2000',end='2001')
df = pd.DataFrame(np.random.normal(size=(len(idx),2)),index=idx)
which looks like this:
0 1
2000-01-01 0.565524 0.355548
2000-01-02 -0.234161 0.888384
I would like to compute a rolling average like
df_avg = df.rolling(60).mean()
but always excluding entries corresponding to (let's say) 50 days before ± 2 days. In other words, for each date, df_avg should contain the mean (exponential with ewm or flat) of the previous 60 entries, but excluding the entries from t-48 to t-52. I guess I need a kind of rolling mask, but I don't know how. I could also compute two separate averages and obtain the result as their difference, but that looks dirty, and I wonder if there is a better way that generalizes to other non-linear computations...
Many thanks!
You can use apply to customize your function:
# select indexes you want to average over
avg_idx = [idx for idx in range(60) if idx not in range(8, 13)]
# do rolling computation, calculating average only on the specified indexes
df_avg = df.rolling(60).apply(lambda x: x[avg_idx].mean(), raw=True)
The x passed to apply will always have 60 entries (with raw=True it is a plain NumPy array), so you can specify your positional index based on this, knowing that the first entry (0) is t-60.
I am not entirely sure about your exclusion logic, but you can easily modify my solution for your case.
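Put together with the frame from the question, this becomes a runnable sketch:
import numpy as np
import pandas as pd

idx = pd.date_range(start='2000', end='2001')
df = pd.DataFrame(np.random.normal(size=(len(idx), 2)), index=idx)

# keep window positions 0..59 except 8..12 (the excluded band)
avg_idx = [i for i in range(60) if i not in range(8, 13)]
# raw=True passes each window as a NumPy array, so positional
# indexing with a list of integers works
df_avg = df.rolling(60).apply(lambda x: x[avg_idx].mean(), raw=True)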
Unfortunately, no. From the pandas source code:
df.rolling(window, min_periods=None, freq=None, center=False, win_type=None,
on=None, axis=0, closed=None)
window : int, or offset
Size of the moving window. This is the number of observations used for
calculating the statistic. Each window will be a fixed size.
If its an offset then this will be the time period of each window. Each
window will be a variable sized based on the observations included in
the time-period.
Hi, I have a list of stock prices and calculated 5 moving averages.
I want to find the max number in each ROW, but the code is returning the max number for the entire array.
Here is the code
# For each stock in df: create 10, 30, 50, 100 and 200-day moving averages
MA10D = stock.rolling(10).mean()
MA30D = stock.rolling(30).mean()
MA50D = stock.rolling(50).mean()
MA100D = stock.rolling(100).mean()
MA200D = stock.rolling(200).mean()
max_line = pd.concat([MA10D, MA30D, MA50D, MA100D, MA200D],axis=0).max()
I want to create a new column with the max number (either the 10D, 30D, 50D, 100D or 200D MA), so I should get a value on each row.
Right now all I get is the max number of each entire array. I tried axis=1 and that did not work either.
Seems like a simple question, but I cannot get it written properly. Please let me know if you can help. Thanks.
The axis=0 in your code refers to the concatenation. You need to make that axis=1 to make each moving average a separate column, then use axis=1 in your call to max as well. It should look like this:
max_line = pd.concat([MA10D, MA30D, MA50D, MA100D, MA200D], axis=1).max(axis=1)
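For instance, passing keys= to concat labels the columns, and the row-wise max can then be kept alongside them (a sketch with a made-up price series):
import numpy as np
import pandas as pd

# made-up price series standing in for one stock
stock = pd.Series(np.random.uniform(90, 110, size=300))

mas = pd.concat(
    [stock.rolling(w).mean() for w in (10, 30, 50, 100, 200)],
    axis=1,
    keys=["MA10D", "MA30D", "MA50D", "MA100D", "MA200D"],
)
mas["max_line"] = mas.max(axis=1)  # row-wise max across the five averages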
I have daily CSVs that are automatically created for work; they average about 1000 rows and have exactly 630 columns. I've been trying to use pandas to create a summary report that I can write to a new .txt file each day.
The problem that I'm facing is that I don't know how to group the data by 'provider', while also performing my own calculations based on the unique values within that group.
After 'Start', the rest of the columns (-2000 to 300000) are profit/loss data based on time (milliseconds). The file is usually between 700 and 1000 lines, and I usually don't use any data past column heading '20000' (not shown).
I am trying to make an output text file that will summarize the csv file by 'provider'(there are usually 5-15 unique providers per file and they are different each day). The calculations I would like to perform are:
Provider = df.groupby('provider')
total tickets = sum of 'filled' (filled column: 1=filled, 0=reject)
share % = a providers total tickets / sum of all filled tickets in file
fill rate = sum of filled / (sum of filled + sum of rejected)
Size = Sum of 'fill_size'
1s Loss = (count how many times column '1000' < $0) / total_tickets
1s Avg = average of column '1000'
10s Loss = (count how many times MIN of range ('1000':'10000') < $0) / total_tickets
10s Avg = average of range ('1000':'10000')
Ideally, my output file will have these headings transposed across the top and the 5-15 unique providers underneath.
While I still don't understand the proper format for writing all of these custom calculations, my biggest hurdle is referencing one of my calculations in the new dataframe (i.e., total_tickets) and applying it to the next calculation (i.e., 1s Loss).
I'm looking for someone to tell me the best way to perform these calculations and maybe provide an example of at least 2 or 3 of my metrics. I think that if I have the proper format, I'll be able to run with the rest of this project.
Thanks for the help.
The function you want is DataFrame.groupby; the pandas documentation has more examples.
Usage is fairly straightforward.
You have a field called 'provider' in your dataframe, so to create groups you simply call grouped = df.groupby('provider'). Note that this does no calculations; it just tells pandas how to find the groups.
To apply functions to this object, you can do a few things:
If it's an existing function (like sum), tell the grouped object which columns you want and then call .sum(), e.g., grouped['filled'].sum() will give the sum of 'filled' for each group. If you want the sum of every column, grouped.sum() will do that. For your second example, you could divide this resulting series by df['filled'].sum() to get your percentages.
If you want to pass a custom function, you can call grouped.apply(func) to apply that function to each group.
To store your values (e.g., for total tickets), you can just assign them to variables: total_tickets = df['filled'].sum() and tickets_by_provider = grouped['filled'].sum(). You can then use these in other calculations.
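As a sketch with a tiny made-up frame, the first two metrics then come out as one line each:
import pandas as pd

# made-up data mirroring the question's 'provider' and 'filled' columns
df = pd.DataFrame({
    "provider": ["A", "A", "B", "B", "B"],
    "filled":   [1, 0, 1, 1, 0],
})

grouped = df.groupby("provider")
tickets_by_provider = grouped["filled"].sum()      # total tickets per provider
share = tickets_by_provider / df["filled"].sum()   # share of all filled tickets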
Update:
For one second loss (and for the other losses), you need two things:
The number of times df['1000'] < 0 for each provider
The total number of records for each provider
These both fit within groupby.
For the first, you can use grouped.apply with a lambda function. It could look like this:
_1s_loss_freq = grouped.apply(lambda x: x['filled'][x['1000'] < 0].sum())
For group totals, you just need to pick a column and get counts. This is done with the count() function.
records_per_group = grouped['1000'].count()
Then, because pandas aligns on indices, you can get your percentages with _1s_loss_freq / records_per_group.
The same approach applies to the 10s Loss metric.
The last question, about the average over a range of columns, relies on pandas' understanding of how it should apply functions. If you take a dataframe and call dataframe.mean(), pandas returns the mean of each column. There's a default argument in mean(), axis=0; if you change that to axis=1, pandas will instead take the mean of each row.
For your last question, 10s Avg, I'm assuming you've aggregated to the provider level already, so that each provider has one row. I'll do that with sum() below but any aggregation will do. Assuming the columns you want the mean over are stored in a list called cols, you want:
one_rec_per_provider = grouped[cols].sum()
provider_means_over_cols = one_rec_per_provider.mean(axis=1)
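For completeness, cols is assumed to hold the millisecond column headings between '1000' and '10000'; for example, if those headings are strings spaced one second apart (an assumption about the file layout):
# assumes string column headings spaced 1000 ms apart
cols = [str(ms) for ms in range(1000, 10001, 1000)]  # '1000' ... '10000'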