Calculate the maximum difference in rolling pandas - improve performance - python

I have a Dataframe with one column.
I need to calculate the average of the difference between the min and max values over 600 seconds period (10 minutes). Or more clearly this :
np.average(originalData[sensor1].rolling(600)
.apply(lambda mylist : (max(mylist) - min(mylist)), raw = True).dropna())
The code works perfectly and returns me the results I need.
The problem is that my Dataframe is pretty large (1.5 million lines, and 200 columns), and it takes a lot of time, specially if I want to go from 600 seconds to 3600 second.
I want to improve it by not to calculating the difference on every row, but by skipping 10 rows each time, it shouldn't impact the results significantly.
Meaning :
Calculate max(list)-min(list) on row 0 to 600
Calculate max(list)-min(list) on row 10 to 610
Calculate max(list)-min(list) on row 20 to 620
Calculate max(list)-min(list) on row 30 to 630
This will speed up the calculation 10 times (hopefully), but I don't see how I can do it with rolling
Any suggestions ?
Edit:
muzzyq requested sample data :
a = np.ones(1500000)
np.average(pd.Series(a).rolling(600).
apply(lambda thing : (max(thing) - min(thing)), raw = True).dropna())

You can use the resample method with '10min' as argument to group by 10 minute intervals. It is more efficient than using rolling for large sets of time series data, assuming it is set as the index.
Sample data
rng = pd.date_range('2000-01-01', periods=1_500_000, freq='S')
ts = pd.Series(np.arange(1_500_000), index=rng)
ts.head()
Output:
2000-01-01 00:00:00 0
2000-01-01 00:00:01 1
2000-01-01 00:00:02 2
2000-01-01 00:00:03 3
2000-01-01 00:00:04 4
Freq: S, dtype: int64
Answer
Using the function from your question:
np.average(ts.resample('10min').apply(lambda mylist: (max(mylist) - min(mylist))))
Output:
599.0
Alternative
Just because I'm not 100% sure what you want the outcome to look like, this will give you the range per 10 minute interval:
result = ts.resample('10min').apply(lambda mylist: (max(mylist) - min(mylist)))
result.head()
Output:
2000-01-01 00:00:00 599
2000-01-01 00:10:00 599
2000-01-01 00:20:00 599
2000-01-01 00:30:00 599
2000-01-01 00:40:00 599
Freq: 10T, dtype: int64
In this case, the answer will always be 599 as the maximum of 600 seconds is 600 and the minimum is 1, so 600 - 1 = 599

Related

Pandas: resample hourly values to monthly values with offset

I want to aggregate a pandas.Series with an hourly DatetimeIndex to monthly values - while considering the offset to midnight.
Example
Consider the following (uniform) timeseries that spans about 1.5 months.
import pandas as pd
hours = pd.Series(1, pd.date_range('2020-02-23 06:00', freq = 'H', periods=1008))
hours
# 2020-02-23 06:00:00 1
# 2020-02-23 07:00:00 1
# ..
# 2020-04-05 04:00:00 1
# 2020-04-05 05:00:00 1
# Freq: H, Length: 1000, dtype: int64
I would like to sum these to months while considering, that days start at 06:00 in this use-case. The result should be:
2020-02-01 06:00:00 168
2020-03-01 06:00:00 744
2020-04-01 06:00:00 96
freq: MS, dtype: int64
How do I do that??
What I've tried and what works
I can aggregate to days while considering the offset, using the offset parameter:
days = hours.resample('D', offset=pd.Timedelta('06:00:00')).sum()
days
# 2020-02-23 06:00:00 24
# 2020-02-24 06:00:00 24
# ..
# 2020-04-03 06:00:00 24
# 2020-04-04 06:00:00 24
# Freq: D, dtype: int64
Using the same method to aggregate to months does not work. The timestamps do not have a time component, and the values are incorrect:
months = hours.resample('MS', offset=pd.Timedelta('06:00:00')).sum()
months
# 2020-02-01 162 # wrong
# 2020-03-01 744
# 2020-04-01 102 # wrong
# Freq: MS, dtype: int64
I could do the aggregation to months as a second step after aggregating to days. In that case, the values are correct, but the time component is still missing from the timestamps:
days = hours.resample('D', offset=pd.Timedelta('06:00:00')).sum()
months = days.resample('MS', offset=pd.Timedelta('06:00:00')).sum()
months
# 2020-02-01 168
# 2020-03-01 744
# 2020-04-01 96
# Freq: MS, dtype: int64
My current workaround is adding the timedelta and resetting the frequency manually.
months.index += pd.Timedelta('06:00:00')
months.index.freq = 'MS'
months
# 2020-02-01 06:00:00 168
# 2020-03-01 06:00:00 744
# 2020-04-01 06:00:00 96
# freq: MS, dtype: int64
Not too much of an improvement on your attempt, but you could write the resampling as
months = hours.resample('D', offset='06:00:00').sum().resample('MS').sum()
changing the index labels still requires the hack you've been doing, as in adding the time delta manually and setting freq to MS
note that you can pass a string representation of the time delta to offset.
The reason two resampling operations are needed is because when the resampling frequency is greater than 'D', the offset is ignored. Once your resample at the daily level is performed with the offset, the result can be further resampled without specifying the offset.
I believe this is buggy behaviour, and I agree with you that hours.resample('MS', offset='06:00:00').sum() should produce the expected result.
Essentially, there are two issues:
the binning is incorrect when there is an offset applied & the frequency is greater than 'D'. The offset is ignored.
the offset is not reflected in the final output, the output truncates to the start or end of the period. I'm not sure if the behaviour you're expecting can be generalized for all users.
That there is a related bug issue impacting resampling with offsets. I have not determined yet whether that and the issue you face have the same root cause. Its the same root cause.

Is there a way to find hourly averages in pandas timeframes that do not start from even hours?

I have a pandas dataframe (python) indexed with timestamps roughly every 10 seconds. I want to find hourly averages, but all functions I find start their averaging at even hours (e.g. hour 9 includes data from 08.00:00 to 08:59:50). Let's say I have the dataframe below.
Timestamp value data
2022-01-01 00:00:00 0.0 5.31
2022-01-01 00:00:10 0.0 0.52
2022-01-01 00:00:20 1.0 9.03
2022-01-01 00:00:30 1.0 4.37
2022-01-01 00:00:40 1.0 8.03
...
2022-01-01 13:52:30 1.0 9.75
2022-01-01 13:52:40 1.0 0.62
2022-01-01 13:52:50 1.0 3.58
2022-01-01 13:53:00 1.0 8.23
2022-01-01 13:53:10 1.0 3.07
Freq: 10S, Length: 5000, dtype: float64
So what I want to do:
Only look at data where we have data that consistently through 1 hour has a value of 1
Find an hourly average of these hours (could e.g. be between 01:30:00-02:29:50 and 11:16:30 - 12:16:20)..
I hope I made my problem clear enough. How do I do this?
EDIT:
Maybe the question was a bit unclear phrased.
I added a third column data, which is what I want to find the mean of. I am only interested in time intervals where, value = 1 consistently through one hour, the rest of the data can be excluded.
EDIT #2:
A bit of background to my problem: I have a sensor giving me data every 10 seconds. For data to be "approved" certain requirements are to be fulfilled (value in this example), and I need the hourly averages (and preferably timestamps for when this occurs). So in order to maximize the number of possible hours to include in my analysis, I would like to find full hours even if they don't start at an even timestamp.
If I understand you correctly you want a conditional mean - calculate the mean per hour of the data column conditional on the value column being all 1 for every 10s row in that hour.
Assuming your dataframe is called df, the steps to do this are:
Create a grouping column
This is your 'hour' column that can be created by
df['hour'] = df.Timestamp.hour
Create condition
Now we've got a column to identify groups we can check which groups are eligible - only those with value consistently equal to 1. If we have 10s intervals and it's per hour then if we group by hour and sum this column then we should get 360 as there are 360 10s intervals per hour.
Group and compute
We can now group and use the aggregate function to:
sum the value column to evaluate against our condition
compute the mean of the data column to return for the valid hours
# group and aggregate
df_mean = df[['hour', 'value', 'data']].groupby('hour').aggregate({'value': 'sum', 'data': 'mean'})
# apply condition
df_mean = df_mean[df_mean['value'] == 360]
That's it - you are left with a dataframe that contains the mean value of data for only the hours where you have a complete hour of value=1.
If you want to augment this so you don't have to start with the grouping as per hour starting as 08:00:00-09:00:00 and maybe you want to start as 08:00:10-09:00:10 then the solution is simple - augment the grouping column but don't change anything else in the process.
To do this you can use datetime.timedelta to shift things forward or back so that df.Timestamp.hour can still be leveraged to keep things simple.
Infer grouping from data
One final idea - if you want to infer which hours on a rolling basis you have complete data for then you can do this with a rolling sum - this is even easier. You:
compute the rolling sum of value and mean of data
only select where value is equal to 360
df_roll = df.rolling(360).aggregate({'value': 'sum', 'data': 'mean'})
df_roll = df_roll[df_roll['value'] == 360]
Yes, there is. You need resample with an offset.
Make some test data
Please make sure to provide meaningful test data next time.
import pandas as pd
import numpy as np
# One day in 10 second intervals
index = pd.date_range(start='1/1/2018', end='1/2/2018', freq='10S')
df = pd.DataFrame({"data": np.random.random(len(index))}, index=index)
# This will set the first part of the data to 1, the rest to 0
df["value"] = (df.index < "2018-01-01 10:00:10").astype(int)
This is what we got:
>>> df
data value
2018-01-01 00:00:00 0.377082 1
2018-01-01 00:00:10 0.574471 1
2018-01-01 00:00:20 0.284629 1
2018-01-01 00:00:30 0.678923 1
2018-01-01 00:00:40 0.094724 1
... ... ...
2018-01-01 23:59:20 0.839973 0
2018-01-01 23:59:30 0.890321 0
2018-01-01 23:59:40 0.426595 0
2018-01-01 23:59:50 0.089174 0
2018-01-02 00:00:00 0.351624 0
Get the mean per hour with an offset
Here is a small function that checks if all value rows in the slice are equal to 1 and returns the mean if so, otherwise it (implicitly) returns None.
def get_conditioned_average(frame):
if frame.value.eq(1).all():
return frame.data.mean()
Now just apply this to hourly slices, starting, e.g., at 10 seconds after the full hour.
df2 = df.resample('H', offset='10S').apply(get_conditioned_average)
This is the final result:
>>> df2
2017-12-31 23:00:10 0.377082
2018-01-01 00:00:10 0.522144
2018-01-01 01:00:10 0.506536
2018-01-01 02:00:10 0.505334
2018-01-01 03:00:10 0.504431
... ... ...
2018-01-01 19:00:10 NaN
2018-01-01 20:00:10 NaN
2018-01-01 21:00:10 NaN
2018-01-01 22:00:10 NaN
2018-01-01 23:00:10 NaN
Freq: H, dtype: float64

Resampling dataframe is producing unexpected results

Long question short, what is an appropriate resampling freq/rule? Sometimes I get a dataframe mostly filled with NaNs, sometimes it works great. I thought I had a handle on it.
Below is an example,
I am processing a lot of data and was changing my resample frequency and notice that for reason certain resample rules produce only 1 element in each row to have a value, the rest of elements to have NaN's.
For example,
df = pd.DataFrame()
df['date']=pd.date_range(start='1/1/2018', end='5/08/2018')
Creating some example data,
df['data1']=np.random.randint(1, 10, df.shape[0])
df['data2']=np.random.randint(1, 10, df.shape[0])
df['data3'] = np.arange(len(df))
The data looks like,
print(df.head())
print(df.shape)
data1 data2 data3
date
2018-01-01 7 7 0
2018-01-02 8 8 1
2018-01-03 2 7 2
2018-01-04 2 2 3
2018-01-05 2 5 4
(128, 3)
When I resample the data using offset aliases I get an unexpected results.
Below I resample the data every 3 minutes.
resampled=df.resample('3T').mean()
print(resampled.head())
print(resampled.shape)
data1 data2 data3
date
2018-01-01 00:00:00 4.0 5.0 0.0
2018-01-01 00:03:00 NaN NaN NaN
2018-01-01 00:06:00 NaN NaN NaN
2018-01-01 00:09:00 NaN NaN NaN
2018-01-01 00:12:00 NaN NaN NaN
Most of the rows are filled with NaN besides the first. I believe this due to that there is no index for my resampling rule. Is this correct? '24H' is the smallest interval for this data, but anything less leaves NaN in a row.
Can a dataframe be resampled for increments less than the datetime resolution?
I have had trouble in the past trying to resample a large dataset that spanned over a year with the datetime index formatted as %Y:%j:%H:%M:%S (year:day #: hour: minute:second, note: close enough without being verbose). Attempting to resample every 15 or 30 days also produced very similar results with NaNs. I thought it was due to having an odd date format with no month, but df.head() showed the index with correct dates.
When you resample lowering the frequency (downsample), then
one of possible options to compute the result is just mean().
It actuaaly means:
The source DataFrame contains too detailed data.
You want to change the sampling frequency to some lower one and
compute e.g. a mean of each column from some number
of source rows for the current sampling period.
But when you increase the sampling frequency (upsample), then:
Your source data are too general.
You want to change the frequency to a higher one.
One of possible options to compute the result is e.g. to
interpolate between known source values.
Note that when you upsample daily data to 3-minute frequency then:
The first row will contain data between 2018-01-01 00:00:00 and
2018-01-01 00:03:00.
The next row will contain data between 2018-01-01 00:03:00 and
2018-01-01 00:06:00.
And so on.
So, based on your source data:
The first row contains data from 2018-01-01 (sharp on midnight).
Since no source data is available for the time range between
00:03:00 and 00:06:00 (on 2018-01-01), the second row contains
just NaN values.
The same pertains to further rows, up to 2018-01-01 23:57:00
(no source data for these time slices).
The next row, for 2018-01-02 00:00:00 can be filled with source data.
And so on.
There is nothing strange in this behaviour. Resample works just this way.
As you actually upsample the source data, maybe you should interpolate
the missing values?

Python Pandas - Computing stats on TimeSeriesIndexedData for each customer

UsageDate CustID1 CustID2 .... CustIDn
0 2018-01-01 00:00:00 1.095
1 2018-01-01 01:00:00 1.129
2 2018-01-01 02:00:00 1.165
3 2018-01-01 04:00:00 1.697
.
.
m 2018-31-01 23:00:00 1.835 (m,n)
The dataframe (df) has m rows and n columns. m is a Hourly TimeSeries Index which starts from first hour of month to last hour of month.
The columns are the customers which are almost 100,000.
The values at each cell of Dataframe are energy consumption values.
For every customer, I need to calculate:
1) Mean of every hour usage - so basically average of 1st hour of every day in a month, 2nd hour of every day in a month etc.
2) Summation of usage of every customer
3) Top 3 usage hours - for a customer x, it can be "2018-01-01 01:00:00",
"2018-11-01 05:00:00" "2018-21-01 17:00:00"
4) Bottom 3 usage hours - Similar explanation as above
5) Mean of usage for every customer in the month
My main point of trouble is how to aggregate data both for every customer and the hour of day, or day together.
For summation of usage for every customer, I tried:
df_temp = pd.DataFrame(columns=["TotalUsage"])
for col in df.columns:
`df_temp[col,"TotalUsage"] = df[col].apply.sum()`
However, this and many version of this which I tried are not helping me solve the problem.
Please help me with an approach and how to think about such problems.
Also, since the dataframe is large, it would be helpful if we can talk about Computational Complexity and how can we decrease computation time.
This looks like a job for pandas.groupby.
(I didn't test the code because I didn't have a good sample dataset from which to work. If there are errors, let me know.)
For some of your requirements, you'll need to add a column with the hour:
df['hour']=df['UsageDate'].dt.hour
1) Mean by hour.
mean_by_hour=df.groupby('hour').mean()
2) Summation by user.
sum_by_uers=df.sum()
3) Top usage by customer. Bottom 3 usage hours - Similar explanation as above.--I don't quite understand your desired output, you might be asking too many different questions in this question. If you want the hour and not the value, I think you may have to iterate through the columns. Adding an example may help.
4) Same comment.
5) Mean by customer.
mean_by_cust = df.mean()
I am not sure if this is all the information you are looking for but it will point you in the right direction:
import pandas as pd
import numpy as np
# sample data for 3 days
np.random.seed(1)
data = pd.DataFrame(pd.date_range('2018-01-01', periods= 72, freq='H'), columns=['UsageDate'])
data2 = pd.DataFrame(np.random.rand(72,5), columns=[f'ID_{i}' for i in range(5)])
df = data.join([data2])
# print('Sample Data:')
# print(df.head())
# print()
# mean of every month and hour per year
# groupby year month hour then find the mean of every hour in a given year and month
mean_data = df.groupby([df['UsageDate'].dt.year, df['UsageDate'].dt.month, df['UsageDate'].dt.hour]).mean()
mean_data.index.names = ['UsageDate_year', 'UsageDate_month', 'UsageDate_hour']
# print('Mean Data:')
# print(mean_data.head())
# print()
# use set_index with max and head
top_3_Usage_hours = df.set_index('UsageDate').max(1).sort_values(ascending=False).head(3)
# print('Top 3:')
# print(top_3_Usage_hours)
# print()
# use set_index with min and tail
bottom_3_Usage_hours = df.set_index('UsageDate').min(1).sort_values(ascending=False).tail(3)
# print('Bottom 3:')
# print(bottom_3_Usage_hours)
out:
Sample Data:
UsageDate ID_0 ID_1 ID_2 ID_3 ID_4
0 2018-01-01 00:00:00 0.417022 0.720324 0.000114 0.302333 0.146756
1 2018-01-01 01:00:00 0.092339 0.186260 0.345561 0.396767 0.538817
2 2018-01-01 02:00:00 0.419195 0.685220 0.204452 0.878117 0.027388
3 2018-01-01 03:00:00 0.670468 0.417305 0.558690 0.140387 0.198101
4 2018-01-01 04:00:00 0.800745 0.968262 0.313424 0.692323 0.876389
Mean Data:
ID_0 ID_1 ID_2 \
UsageDate_year UsageDate_month UsageDate_hour
2018 1 0 0.250716 0.546475 0.202093
1 0.414400 0.264330 0.535928
2 0.335119 0.877191 0.380688
3 0.577429 0.599707 0.524876
4 0.702336 0.654344 0.376141
ID_3 ID_4
UsageDate_year UsageDate_month UsageDate_hour
2018 1 0 0.244185 0.598238
1 0.400003 0.578867
2 0.623516 0.477579
3 0.429835 0.510685
4 0.503908 0.595140
Top 3:
UsageDate
2018-01-01 21:00:00 0.997323
2018-01-03 23:00:00 0.990472
2018-01-01 08:00:00 0.988861
dtype: float64
Bottom 3:
UsageDate
2018-01-01 19:00:00 0.002870
2018-01-03 02:00:00 0.000402
2018-01-01 00:00:00 0.000114
dtype: float64
For top and bottom 3 if you want to find the min sum across rows then:
df.set_index('UsageDate').sum(1).sort_values(ascending=False).tail(3)

Python- splitting a csv file into different dataframes according to date or time interval

I have imported a csv file into python. It has readings at 5 min intervals over a period of a month. There are about 250 readings per 5 min timestamp. Below is a sample of one row per timestamp. Is there a way to split the csv into different dataframes grouped by date or even 5 min interval for plotting purposes? Like i mentioned, this dataset has 250 readings per 5 min interval for a month so I would like to do this without having to hard code each dataframe for each day or each interval in the set.
df.head()
tmc_code measurement_tstamp ... miles road_order
0 112-05650 2018-05-01 00:00:00 ... 0.427814 768.0
1 112-05650 2018-05-01 00:05:00 ... 0.427814 768.0
2 112-05650 2018-05-01 00:10:00 ... 0.427814 768.0
3 112-05650 2018-05-01 00:15:00 ... 0.427814 768.0
4 112-05650 2018-05-01 00:20:00 ... 0.427814 768.0
What it sounds like to me is that you want a new DataFrame for each date. If that is what you desire, the following code will take your dataframe, and make a list of dataframes, each which will only contain data for one date.
df.measurement_tstamp = df.measurement_tstamp.str[:10]
l = df.measurement_tstamp.unique()
data = [df.loc[df['measurement_tstamp']==i] for i in l]
Edit
If you want to do it by 5 min interval, it's even simpler!
data = [df.loc[df['measurement_tstamp']==i] for i in df.measurement_tstamp.unique()]
That should do it

Categories