Count on a rolling time window in pandas - python

I'm trying to return a count on a time window about a (moving) fixed point.
It's an attempt to understand the condition of an instrument at any time, as a function of usage prior to it.
So if the instrument is used at 12.05pm, 12.10, 12.15, 12.30, 12.40 and 1pm, the usage counts would be:
12.05 -> 1 (once in the last hour)
12.10 -> 2
12.15 -> 3
12.30 -> 4
12.40 -> 5
1.00 -> 6
... but then let's say usage resumes at 1.06:
1.06 -> 6
this doesn't increase the count, as the first run is over an hour ago.
How can I calculate this count and append it as a column?
It feels like this is a groupby/aggregate/count, possibly using timedeltas in a lambda function, but I don't know where to start past that.
I'd also like to be able to play with the time window, so not just the past hour but the hour surrounding an instance, i.e. ±30 minutes.
The following code gives a starting dataframe:
s = pd.Series(pd.date_range('2020-1-1', periods=8000, freq='250s'))
df = pd.DataFrame({'Run time': s})
df_sample = df.sample(6000)
df_sample = df_sample.sort_index()
The best help I found (and to be fair I can usually hack something together from the logic) was Distinct count on a rolling time window, but I've not managed it this time.
Thanks

I've done something similar previously with the DataFrame.rolling function:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rolling.html
So for your dataset, you first need to set the index to the datetime field, then you can perform the analysis you need. Continuing on from your code:
s = pd.Series(pd.date_range('2020-1-1', periods=8000, freq='250s'))
df = pd.DataFrame({'Run time': s})
df_sample = df.sample(6000)
df_sample = df_sample.sort_index()
# Create a value we can count
df_sample['Occurrences'] = 1
# Set the index to the datetime element
df_sample = df_sample.set_index('Run time')
# Use Pandas rolling method, 3600s = 1 Hour
df_sample['Occurrences in Last Hour'] = df_sample['Occurrences'].rolling('3600s').sum()
df_sample.head(15)
Occurrences Occurrences in Last Hour
Run time
2020-01-01 00:00:00 1 1.0
2020-01-01 00:04:10 1 2.0
2020-01-01 00:08:20 1 3.0
2020-01-01 00:12:30 1 4.0
2020-01-01 00:16:40 1 5.0
2020-01-01 00:25:00 1 6.0
2020-01-01 00:29:10 1 7.0
2020-01-01 00:37:30 1 8.0
2020-01-01 00:50:00 1 9.0
2020-01-01 00:54:10 1 10.0
2020-01-01 00:58:20 1 11.0
2020-01-01 01:02:30 1 11.0
2020-01-01 01:06:40 1 11.0
2020-01-01 01:15:00 1 10.0
2020-01-01 01:19:10 1 10.0
You need to set the index to a datetime element to use the time-based window; otherwise you can only use integer window sizes corresponding to a number of rows.
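For the surrounding-hour variant mentioned in the question (±30 minutes either side of each run), one option is center=True with the offset window; this needs a reasonably recent pandas (1.3 or later), so treat it as a sketch rather than a tested drop-in:
# Centred one-hour window: 30 minutes before and 30 minutes after each run
df_sample['Occurrences in Surrounding Hour'] = (
    df_sample['Occurrences'].rolling('3600s', center=True).sum()
)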

Related

How to loop through a pandas grouped time series?

I have a dataframe like this:
datetime type d13C ... dayofyear week dmy
1 2018-01-05 15:22:30 air -8.88 ... 5 1 5-1-2018
2 2018-01-05 15:23:30 air -9.08 ... 5 1 5-1-2018
3 2018-01-05 15:24:30 air -10.08 ... 5 1 5-1-2018
4 2018-01-05 15:25:30 air -9.51 ... 5 1 5-1-2018
5 2018-01-05 15:26:30 air -9.61 ... 5 1 5-1-2018
... ... ... ... ... ... ...
341543 2018-12-17 12:42:30 air -9.99 ... 351 51 17-12-2018
341544 2018-12-17 12:43:30 air -9.53 ... 351 51 17-12-2018
341545 2018-12-17 12:44:30 air -9.54 ... 351 51 17-12-2018
341546 2018-12-17 12:45:30 air -9.93 ... 351 51 17-12-2018
341547 2018-12-17 12:46:30 air -9.66 ... 351 51 17-12-2018
Full data here: https://drive.google.com/file/d/1KmOwnpvrG2Edz1AlLyD0CKZlBpaFervM/view?usp=sharing
I'm plotting the d13C column on the Y-axis and inverse total_co2 on the X-axis, then fitting a regression line for each day in the data. I then filter out and store the dates I want, depending on whether the r^2 value of the regression line is > 0.8, like this:
import pandas as pd
from numpy.polynomial.polynomial import polyfit
import numpy as np
from scipy import stats

df = pd.read_csv('dataset.txt',
                 usecols=['datetime', 'type', 'total_co2', 'd13C', 'day', 'month', 'year', 'dayofyear', 'week', 'hour'],
                 dtype={'total_co2': np.float64, 'd13C': np.float64, 'day': str, 'month': str,
                        'year': str, 'week': str, 'hour': str, 'dayofyear': str})
df['dmy'] = df['day'] + '-' + df['month'] + '-' + df['year']  # full date column to make it easier to filter through the rows, i.e. each day
# window18 = df[df['year'] == '2018']  # selecting just the data from the year 2018

accepted_dates_list = []  # empty list to store the dates we're interested in
for d in df['dmy'].unique():  # pass through each day; .unique() ensures we don't repeat days
    acceptable_date = {}  # dictionary to store the valid dates
    period = df[df.dmy == d]  # defining each period from the dmy column
    p = (period['total_co2'])**-1
    q = period['d13C']
    c, m = polyfit(p, q, 1)  # intercept and gradient of the regression line
    slope, intercept, r_value, p_value, std_err = stats.linregress(p, q)  # statistical properties of the regression line
    if r_value**2 >= 0.8:
        acceptable_date['period'] = d  # populating the dictionary with the accepted dates and corresponding values
        acceptable_date['r-squared'] = r_value**2
        acceptable_date['intercept'] = intercept
        accepted_dates_list.append(acceptable_date)  # append the valid entries to the list

accepted_dates18 = pd.DataFrame(accepted_dates_list)  # convert the list to a DataFrame
print(accepted_dates18)
But now I want to do the same thing over three-day periods, which I'm trying to select from the dayofyear column (unsure if this is the best way or not). For example, I would want to fit the regression line using all the rows with dayofyear=5, dayofyear=6 and dayofyear=7, then the next three days, and so on until the end of the data. There are some days missing, but essentially I just need to do this for every 3 days in the data.
The output dataframe I'm trying to get would list the three-day intervals with r^2 > 0.8, something like this showing the valid date ranges:
Accepted dates
0 23-08-2018 - 25-08-2018
1 26-08-2018 - 28-08-2018
2 31-08-2018 - 02-09-2018
3 15-09-2018 - 17-09-2018
4 24-09-2018 - 26-09-2018
I'm not too sure what to do to iterate over every three days. Any help would go a long way, thanks!
Your code loops through a list of unique dates and filters the dataframe on each iteration.
Pandas implements exactly this pattern with df.groupby(). It can be used to loop over each group, or it can be combined with aggregations, function applications, and transformations; you can read more about it in the user guide. groupby can form groups according to any of the columns (or set of columns) in df, levels of the index, or any other exogenous list-like of the same length as df (we are grouping rows here, but note it can also group columns). It even has built-in implementations of the most common statistical aggregations such as mean, std, and corr, among many others.
Now to your problem. You not only want the correlation but also the equation, so you do need to loop. And to get three-day groups you can use that dayofyear column with a twist.
Take this data
import io
fo = io.StringIO(
'''datetime,d13C
2018-01-05 15:22:30,-8.88
2018-01-05 15:23:30,-9.08
2018-01-06 15:24:30,-10.0
2018-01-06 15:25:30,-9.51
2018-01-07 15:26:30,-9.61
2018-01-07 15:27:30,-9.61
2018-01-08 15:28:30,-9.61
2018-01-08 15:29:30,-9.61
2018-01-09 15:26:30,-9.61
2018-01-09 15:27:30,-9.61
''')
df = pd.read_csv(fo)
df.datetime = pd.to_datetime(df.datetime)
fo.close()
With the code for grouping and looping:
first_day = 5
days_to_group = 3
for doy, gdf in df.groupby((df.datetime.dt.dayofyear.sub(first_day) // days_to_group)
                           * days_to_group + first_day):
    print(gdf, '\n')
    print(doy, '\n')
Output
datetime d13C
0 2018-01-05 15:22:30 -8.88
1 2018-01-05 15:23:30 -9.08
2 2018-01-06 15:24:30 -10.00
3 2018-01-06 15:25:30 -9.51
4 2018-01-07 15:26:30 -9.61
5 2018-01-07 15:27:30 -9.61
5
datetime d13C
6 2018-01-08 15:28:30 -9.61
7 2018-01-08 15:29:30 -9.61
8 2018-01-09 15:26:30 -9.61
9 2018-01-09 15:27:30 -9.61
8
Now you can plug your code into this loop and get what you need.
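For instance, a rough sketch of plugging the regression step into that loop (this assumes the full dataset with a total_co2 column rather than the toy data above, and labels each interval with its first and last calendar day):
import pandas as pd
from scipy import stats

first_day = 5
days_to_group = 3
grouper = (df.datetime.dt.dayofyear.sub(first_day) // days_to_group) * days_to_group + first_day

accepted = []
for doy, gdf in df.groupby(grouper):
    p = gdf['total_co2'] ** -1
    q = gdf['d13C']
    slope, intercept, r_value, p_value, std_err = stats.linregress(p, q)
    if r_value ** 2 >= 0.8:
        accepted.append({
            'Accepted dates': f"{gdf.datetime.min():%d-%m-%Y} - {gdf.datetime.max():%d-%m-%Y}",
            'r-squared': r_value ** 2,
            'intercept': intercept,
        })

accepted_dates = pd.DataFrame(accepted)
print(accepted_dates)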
PS
You can also use df.datetime.dt.floor('3d') as the grouper but I am not aware of how to control the first_day, so use it with caution.
Here is one approach. As I understand it, the primary goal is to get from current observations (multiple per day) to a 3-day moving average. First, I created a smaller, simpler data set:
import pandas as pd
df = pd.DataFrame({'counter': [*range(100)],
                   'date': pd.date_range('2020-01-01', periods=100, freq='7H')})
df = df.set_index('date')
print(df.head())
counter
date
2020-01-01 00:00:00 0
2020-01-01 07:00:00 1
2020-01-01 14:00:00 2
2020-01-01 21:00:00 3
2020-01-02 04:00:00 4
Second, I re-sampled on a daily basis:
df2 = df['counter'].resample('1D').mean() # <-- called df2
print(df2.head())
date
2020-01-01 1.5
2020-01-02 5.0
2020-01-03 8.5
2020-01-04 12.0
2020-01-05 15.5
Freq: D, Name: counter, dtype: float64
Third, I computed mean value for a 3-day moving window:
print(df2.rolling(3).mean().head())
date
2020-01-01 NaN
2020-01-02 NaN
2020-01-03 5.0
2020-01-04 8.5
2020-01-05 12.0
Freq: D, Name: counter, dtype: float64
Seems like resample().mean() and rolling().mean() would be useful in this case.
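For reference, the two steps can be chained into a single expression on the same toy DataFrame:
three_day_avg = df['counter'].resample('1D').mean().rolling(3).mean()
print(three_day_avg.head())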

How to make a pandas series whose index is every day of 2020

I would like to make an empty pandas series with a date index which is every day of 2020. That is 01-01-2020, 02-01-2020 etc.
Although this looks very simple I couldn’t find out how to do it.
Use date_range:
range_2020 = pd.date_range("2020-01-01", "2020-12-31", freq="D")
pd.DataFrame(range(366), index=range_2020)
The output is:
0
2020-01-01 0
2020-01-02 1
2020-01-03 2
2020-01-04 3
2020-01-05 4
...
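If you literally want an empty Series rather than a DataFrame of placeholder values, a small variant would be (the explicit dtype just avoids the empty-Series dtype warning in newer pandas):
idx_2020 = pd.date_range("2020-01-01", "2020-12-31", freq="D")   # every day of 2020
empty_series = pd.Series(index=idx_2020, dtype="float64")        # all values start as NaN
print(empty_series.head())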

How to replace a time delta by NaN in a pandas series?

I would like to calculate the mean of a timedelta series, excluding the 00:00:00 values.
This is my time series:
1 00:28:00
3 01:57:00
5 00:00:00
7 01:27:00
9 00:00:00
11 01:30:00
I want to replace rows 5 and 9 with NaN and then apply .mean() to the series; mean() doesn't include NaN values, so I would get the desired result.
How can I do that?
I'm trying:
df["time_column"].replace('0 days 00:00:00', np.NaN).mean()
but no values are replaced.
One idea is to use a zero Timedelta object; the string '0 days 00:00:00' doesn't match because the column holds Timedelta values, not strings:
out = df["time_column"].replace(pd.Timedelta(0), np.NaN).mean()
print (out)
0 days 01:20:30
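If you prefer not to use replace, an equivalent sketch (not from the original answer) is to mask the zero timedeltas, which sets them to NaT so mean() skips them:
out = df["time_column"].mask(df["time_column"] == pd.Timedelta(0)).mean()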

Complete missing days in a DataFrame grouped with groupby

I have this DataFrame (posted as an image in the original question):
I applied df.groupby('site') to classify the data by this feature.
grouped = Datos.groupby('site')
After classifying it I want to complete, for all records, the "date" column day by day.
The procedure that I think I should follow will be:
1. Generate a complete sequence between start and end date. (Step completed).
for site in grouped:
    dates = ['2018-01-01', '2020-01-17']
    startDate = datetime.datetime.strptime(dates[0], "%Y-%m-%d")   # parse first date
    endDate = datetime.datetime.strptime(dates[-1], "%Y-%m-%d")    # parse last date
    days = (endDate - startDate).days                              # how many days between?
    allDates = {datetime.datetime.strftime(startDate + datetime.timedelta(days=k), "%Y-%m-%d"): 0
                for k in range(days + 1)}
2. Compare this sequence with the 'date' column of each groupby('site') group and add the dates that are not present in 'date'.
3. Write a function or loop that updates the 'date' column with the new dates and also fills the missing values with 0 (grouped.apply(add_days)).
So far I have only managed to complete step 1, so I ask for your help to complete steps 2 and 3.
I would very much appreciate your help.
Regards
I had to do much the same thing for a project.
Maybe it's not the best solution for you, but it can help (and I hope it saves you the headache I had).
Here is how I managed it, with help from https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html:
df_DateRange = pd.DataFrame()
df_1 = pd.DataFrame()
grouped = pd.DataFrame()

# 1. Create a DataFrame with all days (your step 2):
dates_list = ['2019-12-31', '2020-01-05']
df_DateRange['date'] = pd.date_range(start=dates_list[0], end=dates_list[-1], freq='1D')
df_DateRange['date'] = df_DateRange['date'].dt.strftime('%Y-%m-%d')
df_DateRange.set_index(['date'], inplace=True)

# Set the index of your Datos DataFrame:
Datos.set_index(['date'], inplace=True)

# Join both DataFrames:
df_1 = df_DateRange.join(Datos)

# 2. Replace the NaN:
df_1['site'].fillna("", inplace=True)
df_1['value'].fillna(0, inplace=True)
df_1['value2'].fillna(0, inplace=True)

# 3. Do the calculation:
grouped = df_1.groupby('site').sum()
df_DateRange:
date
0 2019-12-31
1 2020-01-01
2 2020-01-02
3 2020-01-03
4 2020-01-04
5 2020-01-05
Datos:
date site value value2
0 2020-01-01 site1 1 -1
1 2020-01-01 site2 2 -2
2 2020-01-02 site1 10 -10
3 2020-01-02 site2 20 -20
df_1:
site value value2
date
2019-12-31 0.0 0.0
2020-01-01 site1 1.0 -1.0
2020-01-01 site2 2.0 -2.0
2020-01-02 site1 10.0 -10.0
2020-01-02 site2 20.0 -20.0
2020-01-03 0.0 0.0
2020-01-04 0.0 0.0
2020-01-05 0.0 0.0
grouped:
value value2
site
0.0 0.0
site1 11.0 -11.0
site2 22.0 -22.0
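Another pattern that may help (not part of the answer above) is to reindex each site's group onto the full date range; a rough sketch, assuming Datos still has 'date', 'site', 'value' and 'value2' columns with 'date' stored as 'YYYY-MM-DD' strings:
# Full daily range as 'YYYY-MM-DD' strings to match the 'date' column
full_range = pd.date_range('2019-12-31', '2020-01-05', freq='1D').strftime('%Y-%m-%d')

# Reindex each site onto the full range and fill the missing days with 0
completed = (
    Datos.set_index('date')
         .groupby('site')[['value', 'value2']]
         .apply(lambda g: g.reindex(full_range, fill_value=0))
)
print(completed)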

Python Pandas - Computing stats on TimeSeriesIndexedData for each customer

UsageDate CustID1 CustID2 .... CustIDn
0 2018-01-01 00:00:00 1.095
1 2018-01-01 01:00:00 1.129
2 2018-01-01 02:00:00 1.165
3 2018-01-01 04:00:00 1.697
.
.
m 2018-31-01 23:00:00 1.835 (m,n)
The dataframe (df) has m rows and n columns. m is an hourly time-series index which runs from the first hour of the month to the last hour of the month.
The columns are the customers, of which there are almost 100,000.
The values at each cell of Dataframe are energy consumption values.
For every customer, I need to calculate:
1) Mean of every hour usage - so basically average of 1st hour of every day in a month, 2nd hour of every day in a month etc.
2) Summation of usage of every customer
3) Top 3 usage hours - for a customer x, it could be "2018-01-01 01:00:00", "2018-11-01 05:00:00", "2018-21-01 17:00:00"
4) Bottom 3 usage hours - Similar explanation as above
5) Mean of usage for every customer in the month
My main point of trouble is how to aggregate data both for every customer and the hour of day, or day together.
For summation of usage for every customer, I tried:
df_temp = pd.DataFrame(columns=["TotalUsage"])
for col in df.columns:
    df_temp[col, "TotalUsage"] = df[col].apply.sum()
However, this and the many versions of it which I tried are not helping me solve the problem.
Please help me with an approach and how to think about such problems.
Also, since the dataframe is large, it would be helpful if we could talk about computational complexity and how to decrease computation time.
This looks like a job for pandas.groupby.
(I didn't test the code because I didn't have a good sample dataset from which to work. If there are errors, let me know.)
For some of your requirements, you'll need to add a column with the hour:
df['hour'] = df['UsageDate'].dt.hour
1) Mean by hour.
mean_by_hour = df.groupby('hour').mean()
2) Summation by user.
sum_by_users = df.sum()
3) Top 3 usage hours per customer. I don't quite understand your desired output; you might be asking too many different questions in one question. If you want the hour and not the value, I think you may have to iterate through the columns. Adding an example would help.
4) Bottom 3 usage hours. Same comment as above.
5) Mean by customer.
mean_by_cust = df.mean()
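For points 3 and 4, a hedged per-customer sketch (the names here are illustrative; it assumes 'UsageDate' is a column and every remaining numeric column is one customer):
# Index by timestamp and drop the helper 'hour' column if it was added earlier
usage = df.set_index('UsageDate').drop(columns=['hour'], errors='ignore')

# Top/bottom 3 usage hours per customer, as lists of timestamps
top3_hours = {cust: usage[cust].nlargest(3).index.tolist() for cust in usage.columns}
bottom3_hours = {cust: usage[cust].nsmallest(3).index.tolist() for cust in usage.columns}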
I am not sure if this is all the information you are looking for but it will point you in the right direction:
import pandas as pd
import numpy as np
# sample data for 3 days
np.random.seed(1)
data = pd.DataFrame(pd.date_range('2018-01-01', periods= 72, freq='H'), columns=['UsageDate'])
data2 = pd.DataFrame(np.random.rand(72,5), columns=[f'ID_{i}' for i in range(5)])
df = data.join([data2])
# print('Sample Data:')
# print(df.head())
# print()
# mean of every month and hour per year
# groupby year month hour then find the mean of every hour in a given year and month
mean_data = df.groupby([df['UsageDate'].dt.year, df['UsageDate'].dt.month, df['UsageDate'].dt.hour]).mean()
mean_data.index.names = ['UsageDate_year', 'UsageDate_month', 'UsageDate_hour']
# print('Mean Data:')
# print(mean_data.head())
# print()
# use set_index with max and head
top_3_Usage_hours = df.set_index('UsageDate').max(1).sort_values(ascending=False).head(3)
# print('Top 3:')
# print(top_3_Usage_hours)
# print()
# use set_index with min and tail
bottom_3_Usage_hours = df.set_index('UsageDate').min(1).sort_values(ascending=False).tail(3)
# print('Bottom 3:')
# print(bottom_3_Usage_hours)
out:
Sample Data:
UsageDate ID_0 ID_1 ID_2 ID_3 ID_4
0 2018-01-01 00:00:00 0.417022 0.720324 0.000114 0.302333 0.146756
1 2018-01-01 01:00:00 0.092339 0.186260 0.345561 0.396767 0.538817
2 2018-01-01 02:00:00 0.419195 0.685220 0.204452 0.878117 0.027388
3 2018-01-01 03:00:00 0.670468 0.417305 0.558690 0.140387 0.198101
4 2018-01-01 04:00:00 0.800745 0.968262 0.313424 0.692323 0.876389
Mean Data:
ID_0 ID_1 ID_2 \
UsageDate_year UsageDate_month UsageDate_hour
2018 1 0 0.250716 0.546475 0.202093
1 0.414400 0.264330 0.535928
2 0.335119 0.877191 0.380688
3 0.577429 0.599707 0.524876
4 0.702336 0.654344 0.376141
ID_3 ID_4
UsageDate_year UsageDate_month UsageDate_hour
2018 1 0 0.244185 0.598238
1 0.400003 0.578867
2 0.623516 0.477579
3 0.429835 0.510685
4 0.503908 0.595140
Top 3:
UsageDate
2018-01-01 21:00:00 0.997323
2018-01-03 23:00:00 0.990472
2018-01-01 08:00:00 0.988861
dtype: float64
Bottom 3:
UsageDate
2018-01-01 19:00:00 0.002870
2018-01-03 02:00:00 0.000402
2018-01-01 00:00:00 0.000114
dtype: float64
For the top and bottom 3, if you instead want the smallest totals summed across customers for each hour:
df.set_index('UsageDate').sum(1).sort_values(ascending=False).tail(3)
