I want to aggregate a pandas.Series with an hourly DatetimeIndex to monthly values - while considering the offset to midnight.
Example
Consider the following (uniform) timeseries that spans about 1.5 months.
import pandas as pd
hours = pd.Series(1, pd.date_range('2020-02-23 06:00', freq = 'H', periods=1008))
hours
# 2020-02-23 06:00:00 1
# 2020-02-23 07:00:00 1
# ..
# 2020-04-05 04:00:00 1
# 2020-04-05 05:00:00 1
# Freq: H, Length: 1000, dtype: int64
I would like to sum these to months while considering, that days start at 06:00 in this use-case. The result should be:
2020-02-01 06:00:00 168
2020-03-01 06:00:00 744
2020-04-01 06:00:00 96
freq: MS, dtype: int64
How do I do that??
What I've tried and what works
I can aggregate to days while considering the offset, using the offset parameter:
days = hours.resample('D', offset=pd.Timedelta('06:00:00')).sum()
days
# 2020-02-23 06:00:00 24
# 2020-02-24 06:00:00 24
# ..
# 2020-04-03 06:00:00 24
# 2020-04-04 06:00:00 24
# Freq: D, dtype: int64
Using the same method to aggregate to months does not work. The timestamps do not have a time component, and the values are incorrect:
months = hours.resample('MS', offset=pd.Timedelta('06:00:00')).sum()
months
# 2020-02-01 162 # wrong
# 2020-03-01 744
# 2020-04-01 102 # wrong
# Freq: MS, dtype: int64
I could do the aggregation to months as a second step after aggregating to days. In that case, the values are correct, but the time component is still missing from the timestamps:
days = hours.resample('D', offset=pd.Timedelta('06:00:00')).sum()
months = days.resample('MS', offset=pd.Timedelta('06:00:00')).sum()
months
# 2020-02-01 168
# 2020-03-01 744
# 2020-04-01 96
# Freq: MS, dtype: int64
My current workaround is adding the timedelta and resetting the frequency manually.
months.index += pd.Timedelta('06:00:00')
months.index.freq = 'MS'
months
# 2020-02-01 06:00:00 168
# 2020-03-01 06:00:00 744
# 2020-04-01 06:00:00 96
# freq: MS, dtype: int64
Not too much of an improvement on your attempt, but you could write the resampling as
months = hours.resample('D', offset='06:00:00').sum().resample('MS').sum()
changing the index labels still requires the hack you've been doing, as in adding the time delta manually and setting freq to MS
note that you can pass a string representation of the time delta to offset.
The reason two resampling operations are needed is because when the resampling frequency is greater than 'D', the offset is ignored. Once your resample at the daily level is performed with the offset, the result can be further resampled without specifying the offset.
I believe this is buggy behaviour, and I agree with you that hours.resample('MS', offset='06:00:00').sum() should produce the expected result.
Essentially, there are two issues:
the binning is incorrect when there is an offset applied & the frequency is greater than 'D'. The offset is ignored.
the offset is not reflected in the final output, the output truncates to the start or end of the period. I'm not sure if the behaviour you're expecting can be generalized for all users.
That there is a related bug issue impacting resampling with offsets. I have not determined yet whether that and the issue you face have the same root cause. Its the same root cause.
Related
TLDR: When downsampling a Series with a DatetimeIndex, e.g. from hourly to daily values, how can I ensure the result only contains time periods that are fully present in the original?
Example
I'll explain with a simplified example.
Starting point: daily values
import pandas as pd
# Source data: 2 full days, AND SOME ADDITIONAL HOURS.
i = pd.date_range('2022-03-04 22:00', '2022-03-07 09:00', freq='H')
hourly = pd.Series(range(len(i)), i)
I want to resample to days, but keep only those that days are completely present in the source series.
What is working: calendar days
If a day is defined as a normal calendar day, i.e., midnight to midnight, we can do this in 2 steps:
# 1) Resample.
grouper = pd.Grouper(freq='D')
daily = hourly.groupby(grouper).sum() # or .resample('D').sum()
# 2022-03-04 1
# 2022-03-05 324
# 2022-03-06 900
# 2022-03-07 545
# Freq: D, dtype: int64
# 2) Discard incomplete days.
# (reject the days that start before the start of the first hour)
incomplete_left = daily.index < hourly.index[0]
# (reject the days that end after the end of the last hour)
incomplete_right = daily.index + pd.offsets.Day(1) > hourly.index[-1] + pd.offsets.Hour(1)
# Trim.
daily_trimmed = daily[~incomplete_left & ~incomplete_right] # Keeps 2022-03-05 and -06. Good.
# 2022-03-05 324
# 2022-03-06 900
# Freq: D, dtype: int64
Sofar, so good.
What is not working: custom starting point
But what if a day is defined as starting at 06:00 and ending at 06:00 the next calender day? I can do the resampling, but don't know how to check which timestamps to reject.
# 1) Resampling is doable:
import datetime
def gasday(ts: pd.Timestamp) -> pd.Timestamp:
day = ts.floor("D")
if ts.time() < datetime.time(hour=6):
day = day - pd.DateOffset(days=1) # get previous day
return day
daily2 = hourly.groupby(gasday).sum()
# 2022-03-04 28
# 2022-03-05 468
# 2022-03-06 1044
# 2022-03-07 230
# dtype: int64
# 2) ... but how to find the days that must be rejected??
Remarks
I'm using DatetimeIndex, instead of PeriodIndex, which is why I we have the somewhat complicated formula for incomplete_right. The reason for using DatetimeIndex is that I'm generally dealing with timezones (not shown in this example). The timestamps in the datetimeindex are left-bound.
In my use-case, I'm given the grouper function (gasday in this case), without knowing, what the cutoff time is (06:00 in this case).
If your data is guaranteed to be hourly, then you can just count the records:
daily = hourly.groupby(pd.Grouper(freq='D', offset='6H')).agg(['size','sum'])
Output:
size sum
2022-03-04 06:00:00 8 28
2022-03-05 06:00:00 24 468
2022-03-06 06:00:00 24 1044
2022-03-07 06:00:00 4 230
Looking form the data, it's fairly easy to see which ones should be dropped
complete_daily = daily.query('size==24')
Output:
size sum
2022-03-05 06:00:00 24 468
2022-03-06 06:00:00 24 1044
Update: You can also try:
daily = (hourly.reset_index().groupby(pd.Grouper(key='index', freq='D', offset='6H'))
.agg(start=('index','min'), end=('index','max'), total=(0,'sum'))
)
Output:
start end total
index
2022-03-04 06:00:00 2022-03-04 22:00:00 2022-03-05 05:00:00 28
2022-03-05 06:00:00 2022-03-05 06:00:00 2022-03-06 05:00:00 468
2022-03-06 06:00:00 2022-03-06 06:00:00 2022-03-07 05:00:00 1044
2022-03-07 06:00:00 2022-03-07 06:00:00 2022-03-07 09:00:00 230
You can then query for complete days, e.g.
daily['start'].dt.hour.eq(6) & daily['end'].dt.hour.eq(5)
I've decided to use the following logic:
If the first timestamp in hourly.index belongs to the same group (= same day) as the timestamp immediately before it, then the group is not fully present in the hourly series and the first datapoint in daily must be removed. If it does not belong to the same group, it is really the start of a new group, the group is fully present in hourly, and no change is needed to daily.
Likewise, if the end of the last timestamp in hourly.index belongs to the same day as the timestamp immediately before* it, then the group is also not fully present in the hourly series, and the final datapoint in daily must be removed.
eps = pd.Timedelta(seconds=1)
start = hourly.index[0]
if gasday(start) == gasday(start - eps):
daily2 = daily2.iloc[1:]
end = hourly.index[-1] + pd.offsets.Hour(1)
if gasday(end - eps) == gasday(end):
daily2 = daily2.iloc[:-1]
This works, and keeps the two days (2022-03-05 and -06) as wanted. It includes (-04) if we start i at 2022-03-04 06:00 or earlier. Likewise, it keeps (-06) only if we end i at 2022-03-07 05:00 or later.
*) Why before? Well, the 05:00 timestamp denotes the left-closed interval [05:00-06:00). 06:00 is actually the start of the next hour. Therefor, if this 06:00 timestamp belongs to the same day, as the moment immediately before it (05:59:59), then we do not have the complete day.
Now the only issues I have left is the following: I'd like to abstract this all away, like so:
def resample_and_trim(source, grouper):
agg = source.groupby(grouper).sum()
eps = pd.Timedelta(seconds=1)
start = source.index[0]
if grouper(start) == grouper(start - eps):
agg = agg.iloc[1:]
end = source.index[-1] + pd.offsets.Hour(1)
if grouper(end - eps) == grouper(end):
agg = agg.iloc[:-1]
return agg
And then be able to call this in both cases. The latter works:
daily2 = resample_and_trim(hourly, gasday)
# 2022-03-05 468
# 2022-03-06 1044
# dtype: int64
But the former does not:
daily = resample_and_trim(hourly, pd.Grouper(freq='H'))
# Error in `grouper(start)`
# TypeError: 'TimeGrouper' object is not callable
I'll doctor around a bit more; if I find the solution, I'll edit this answer.
I’m trying to look at some sales data for a small store. I have a time stamp of when the settlement was made, but sometimes it’s done before midnight and sometimes its done after midnight.
This is giving me data correct for some days and incorrect for others, as anything after midnight should be for the day before. I couldn’t find the correct pandas documentation for what I’m looking for.
Is there an if else solution to create a new column, loop through the NEW_TIMESTAMP column and set a custom timeframe (if after midnight, but before 3pm: set the day before ; else set the day). Every time I write something it either runs forever, or it crashes jupyter.
Data:
What I did is I created another series which says when a day should be offset back by one day, and I multiplied it by a pd.timedelta object, such that 0 turns into "0 days" and 1 turns into "1 day". Subtracting two series gives the right result.
Let me know how the following code works for you.
import pandas as pd
import numpy as np
# copied from https://stackoverflow.com/questions/50559078/generating-random-dates-within-a-given-range-in-pandas
def random_dates(start, end, n=15):
start_u = start.value//10**9
end_u = end.value//10**9
return pd.to_datetime(np.random.randint(start_u, end_u, n), unit='s')
dates = random_dates(start=pd.to_datetime('2020-01-01'),
end=pd.to_datetime('2021-01-01'))
timestamps = pd.Series(dates)
# this takes only the hour component of every datetime
hours = timestamps.dt.hour
# this takes only the hour component of every datetime
dates = timestamps.dt.date
# this compares the hours with 15, and returns a boolean if it is smaller
flag_is_day_before = hours < 15
# now you can set the dates by multiplying the 1s and 0s with a day timedelta
new_dates = dates - pd.to_timedelta(1, unit='day') * flag_is_day_before
df = pd.DataFrame(data=dict(timestamps=timestamps, new_dates=new_dates))
print(df)
This outputs
timestamps new_dates
0 2020-07-10 20:11:13 2020-07-10
1 2020-05-04 01:20:07 2020-05-03
2 2020-03-30 09:17:36 2020-03-29
3 2020-06-01 16:16:58 2020-06-01
4 2020-09-22 04:53:33 2020-09-21
5 2020-08-02 20:07:26 2020-08-02
6 2020-03-22 14:06:53 2020-03-21
7 2020-03-14 14:21:12 2020-03-13
8 2020-07-16 20:50:22 2020-07-16
9 2020-09-26 13:26:55 2020-09-25
10 2020-11-08 17:27:22 2020-11-08
11 2020-11-01 13:32:46 2020-10-31
12 2020-03-12 12:26:21 2020-03-11
13 2020-12-28 08:04:29 2020-12-27
14 2020-04-06 02:46:59 2020-04-05
I have a dataframe like this:
datetime type d13C ... dayofyear week dmy
1 2018-01-05 15:22:30 air -8.88 ... 5 1 5-1-2018
2 2018-01-05 15:23:30 air -9.08 ... 5 1 5-1-2018
3 2018-01-05 15:24:30 air -10.08 ... 5 1 5-1-2018
4 2018-01-05 15:25:30 air -9.51 ... 5 1 5-1-2018
5 2018-01-05 15:26:30 air -9.61 ... 5 1 5-1-2018
... ... ... ... ... ... ...
341543 2018-12-17 12:42:30 air -9.99 ... 351 51 17-12-2018
341544 2018-12-17 12:43:30 air -9.53 ... 351 51 17-12-2018
341545 2018-12-17 12:44:30 air -9.54 ... 351 51 17-12-2018
341546 2018-12-17 12:45:30 air -9.93 ... 351 51 17-12-2018
341547 2018-12-17 12:46:30 air -9.66 ... 351 51 17-12-2018
Full data here: https://drive.google.com/file/d/1KmOwnpvrG2Edz1AlLyD0CKZlBpaFervM/view?usp=sharing
I'm plotting d13C column on the Y-axis and inverse total_co2 on the X and then fitting a regression line for each day in the data. I then filter out and store the dates I want depending on if the r^2 value of the regression line is > 0.8 like this:
import pandas as pd
from numpy.polynomial.polynomial import polyfit
import numpy as np
from scipy import stats
df = pd.read_csv('dataset.txt', usecols = ['datetime', 'type', 'total_co2', 'd13C', 'day','month','year','dayofyear','week','hour'], dtype = {'total_co2':
np.float64, 'd13C':np.float64, 'day':str, 'month':str, 'year':str,'week':str, 'hour': str, 'dayofyear':str})
df['dmy'] = df['day'] +'-'+ df['month'] +'-'+ df['year'] # adding a full date column to make it easir to filter through
# the rows, ie. each day
# window18 = df[((df['year']=='2018'))] # selecting just the data from the year 2018
accepted_dates_list = [] # creating an empty list to store the dates that we're interested in
for d in df['dmy'].unique(): # this will pass through each day, the .unique() ensures that it doesnt go over the same days
acceptable_date = {} # creating a dictionary to store the valid dates
period = df[df.dmy==d] # defining each period from the dmy column
p = (period['total_co2'])**-1
q = period['d13C']
c,m = polyfit(p,q,1) # intercept and gradient calculation of the regression line
slope, intercept, r_value, p_value, std_err = stats.linregress(p, q) # getting some statistical properties of the regression line
if r_value**2 >= 0.8:
acceptable_date['period'] = d # populating the dictionary with the accpeted dates and corresponding other values
acceptable_date['r-squared'] = r_value**2
acceptable_date['intercept'] = intercept
accepted_dates_list.append(acceptable_date) # sending the valid stuff in the dictionary to the list
else:
pass
accepted_dates18 = pd.DataFrame(accepted_dates_list) # converting the list to a df
print(accepted_dates18)
But now I want to do the same thing, just over three day periods which I'm trying to select from the day of year column (unsure if this is the best way or not). For example, I would want to fit the regression line using all the rows with dayofyear=5, dayofyear=6, dayofyear=7, then for the next three days until the end of the data. There are some days missing, but essentially I just need to do this for every 3 days in the data.
The output dataframe I am then trying to get would have the list of the three day intervals with the r^2 >0.8, so anything like this that will show the valid date range:
Accepted dates
0 23-08-2018 - 25-08-2018
1 26-08-2018 - 28-08-2018
2 31-08-2018 - 02-09-2018
3 15-09-2018 - 17-09-2018
4 24-09-2018 - 26-09-2018
I'm not too sure what to do to iterate over every three days. Any help would go a long way, thanks!
Your code loops through a list of unique dates and filters the dataframe on each iteration.
Pandas implemented this with df.groupby(). It can be used to loop and get each group or it can be combined with aggregations, function applications, and transformations. You can read more about it on the user guide. This function can return groups according to any the columns (or set of columns) in df, levels of the index, or any other exogenous list-like with the same length as df (we are grouping rows, but note it can also group columns). It even has implementations for the most common statistical aggregations like mean, stdev, and corr, among many others.
Now to your problem. You not only want the correlation but the equation, so you do need to loop. And to get three-day groups you can use that dayofyear column with a twist.
Take this data
import io
fo = io.StringIO(
'''datetime,d13C
2018-01-05 15:22:30,-8.88
2018-01-05 15:23:30,-9.08
2018-01-06 15:24:30,-10.0
2018-01-06 15:25:30,-9.51
2018-01-07 15:26:30,-9.61
2018-01-07 15:27:30,-9.61
2018-01-08 15:28:30,-9.61
2018-01-08 15:29:30,-9.61
2018-01-09 15:26:30,-9.61
2018-01-09 15:27:30,-9.61
''')
df = pd.read_csv(fo)
df.datetime = pd.to_datetime(df.datetime)
fo.close()
With the code for grouping and looping
first_day = 5
days_to_group = 3
for doy, gdf in df.groupby((df.datetime.dt.dayofyear.sub(first_day) // days_to_group)
* days_to_group + first_day):
print(gdf, '\n')
print(doy, '\n')
Output
datetime d13C
0 2018-01-05 15:22:30 -8.88
1 2018-01-05 15:23:30 -9.08
2 2018-01-06 15:24:30 -10.00
3 2018-01-06 15:25:30 -9.51
4 2018-01-07 15:26:30 -9.61
5 2018-01-07 15:27:30 -9.61
5
datetime d13C
6 2018-01-08 15:28:30 -9.61
7 2018-01-08 15:29:30 -9.61
8 2018-01-09 15:26:30 -9.61
9 2018-01-09 15:27:30 -9.61
8
Now you can plug your code into this loop and get what you need.
PS
You can also use df.datetime.dt.floor('3d') as the grouper but I am not aware of how to control the first_day, so use it with caution.
Here is one approach. As I understand it, the primary goal is to get from current observations (multiple per day) to a 3-day moving average. First, I created a smaller, simpler data set:
import pandas as pd
df = pd.DataFrame({'counter': [*range(100)],
'date': pd.date_range('2020-01-01', periods=100, freq='7H')})
df = df.set_index('date')
print(df.head())
counter
date
2020-01-01 00:00:00 0
2020-01-01 07:00:00 1
2020-01-01 14:00:00 2
2020-01-01 21:00:00 3
2020-01-02 04:00:00 4
Second, I re-sampled on a daily basis:
df2 = df['counter'].resample('1D').mean() # <-- called df2
print(df2.head())
date
2020-01-01 1.5
2020-01-02 5.0
2020-01-03 8.5
2020-01-04 12.0
2020-01-05 15.5
Freq: D, Name: counter, dtype: float64
Third, I computed mean value for a 3-day moving window:
print(df2.rolling(3).mean().head())
date
2020-01-01 NaN
2020-01-02 NaN
2020-01-03 5.0
2020-01-04 8.5
2020-01-05 12.0
Freq: D, Name: counter, dtype: float64
Seems like resample().mean() and rolling().mean() would be useful in this case.
The problem
Suppose I have a time series dataframe df (a pandas dataframe) and some days I want to slice from it, contained in another dataframe called sample_days:
>>> df
foo bar
2020-01-01 00:00:00 0.360049 0.897839
2020-01-01 01:00:00 0.285667 0.409544
2020-01-01 02:00:00 0.323871 0.240926
2020-01-01 03:00:00 0.921623 0.766624
2020-01-01 04:00:00 0.087618 0.142409
... ... ...
2020-12-31 19:00:00 0.145111 0.993822
2020-12-31 20:00:00 0.331223 0.021287
2020-12-31 21:00:00 0.531099 0.859035
2020-12-31 22:00:00 0.759594 0.790265
2020-12-31 23:00:00 0.103651 0.074029
[8784 rows x 2 columns]
>>> sample_days
month day
0 3 16
1 7 26
2 8 15
3 9 26
4 11 25
I want to slice df with the days specified in sample_days. I can do this with for loops (see below). However, is there a way to avoid for loops (as this is more efficient)? The result should be a dataframe called sample like the following:
>>> sample
foo bar
2020-03-16 00:00:00 0.707276 0.592614
2020-03-16 01:00:00 0.136679 0.357872
2020-03-16 02:00:00 0.612331 0.290126
2020-03-16 03:00:00 0.276389 0.576996
2020-03-16 04:00:00 0.612977 0.781527
... ... ...
2020-11-25 19:00:00 0.904266 0.825501
2020-11-25 20:00:00 0.269589 0.050304
2020-11-25 21:00:00 0.271814 0.418235
2020-11-25 22:00:00 0.595005 0.973198
2020-11-25 23:00:00 0.151149 0.024057
[120 rows x 2 columns
which is just the df sliced across the correct days.
My (slow) solution
I've managed to do this using for loops and pd.concat:
sample = pd.concat([df.loc[df.index.month.isin([sample_day.month]) &
df.index.day.isin([sample_day.day])]
for sample_day in sample_days.itertuples()])
which is based on concatenating multiple days as sliced by the method indicated here. This gives the desired result but is rather slow. For example, using this method to get the first day of each month takes 0.2 seconds on average, whereas just calling df.loc[df.index.day == 1] (presumably avoiding python for loops under-the-hood) is around 300 times faster. However, this is a slice on just the day -- I am slicing on month and day.
Apologies if this has been answered somewhere else -- I've searched for quite a while but perhaps was not using the correct keywords.
You can do a string comparison of the month and days at the same time.
You need the space to differentiate between 11 2 and 1 12 for example, otherwise both would be regarded as the same.
df.loc[(df.index.month.astype(str) +' '+ df.index.day.astype(str)).isin(sample_days['month'].astype(str)+' '+sample_days['day'].astype(str))]
After getting a bit of inspiration from #Ben Pap's solution (thanks!), I've found a solution that is both fast and avoids any "hacks" like changing datetime to strings. It combines the month and day into a single MultiIndex, as below (you can make this a single line, but I've expanded it into multiple to make the idea clear).
full_index = pd.MultiIndex.from_arrays([df.index.month, df.index.day],
names=['month', 'day'])
sample_index = pd.MultiIndex.from_frame(sample_days)
sample = df.loc[full_index.isin(sample_index)]
If I run this code along with my original for loop and #Ben Pap's answer, and sample 100 days from one year time series for 2020 (8784 hours with the leap day), I get the following solution times:
Original for loop: 0.16s
#Ben Pap's solution, combining month and day into single string: 0.019s
Above solution using MultiIndex: 0.006s
so I think using a MultiIndex is the way to go.
I have a Dataframe with one column.
I need to calculate the average of the difference between the min and max values over 600 seconds period (10 minutes). Or more clearly this :
np.average(originalData[sensor1].rolling(600)
.apply(lambda mylist : (max(mylist) - min(mylist)), raw = True).dropna())
The code works perfectly and returns me the results I need.
The problem is that my Dataframe is pretty large (1.5 million lines, and 200 columns), and it takes a lot of time, specially if I want to go from 600 seconds to 3600 second.
I want to improve it by not to calculating the difference on every row, but by skipping 10 rows each time, it shouldn't impact the results significantly.
Meaning :
Calculate max(list)-min(list) on row 0 to 600
Calculate max(list)-min(list) on row 10 to 610
Calculate max(list)-min(list) on row 20 to 620
Calculate max(list)-min(list) on row 30 to 630
This will speed up the calculation 10 times (hopefully), but I don't see how I can do it with rolling
Any suggestions ?
Edit:
muzzyq requested sample data :
a = np.ones(1500000)
np.average(pd.Series(a).rolling(600).
apply(lambda thing : (max(thing) - min(thing)), raw = True).dropna())
You can use the resample method with '10min' as argument to group by 10 minute intervals. It is more efficient than using rolling for large sets of time series data, assuming it is set as the index.
Sample data
rng = pd.date_range('2000-01-01', periods=1_500_000, freq='S')
ts = pd.Series(np.arange(1_500_000), index=rng)
ts.head()
Output:
2000-01-01 00:00:00 0
2000-01-01 00:00:01 1
2000-01-01 00:00:02 2
2000-01-01 00:00:03 3
2000-01-01 00:00:04 4
Freq: S, dtype: int64
Answer
Using the function from your question:
np.average(ts.resample('10min').apply(lambda mylist: (max(mylist) - min(mylist))))
Output:
599.0
Alternative
Just because I'm not 100% sure what you want the outcome to look like, this will give you the range per 10 minute interval:
result = ts.resample('10min').apply(lambda mylist: (max(mylist) - min(mylist)))
result.head()
Output:
2000-01-01 00:00:00 599
2000-01-01 00:10:00 599
2000-01-01 00:20:00 599
2000-01-01 00:30:00 599
2000-01-01 00:40:00 599
Freq: 10T, dtype: int64
In this case, the answer will always be 599 as the maximum of 600 seconds is 600 and the minimum is 1, so 600 - 1 = 599