I have two lists.
The list times is a list of datetimes from 2018-04-10 00:00 to
2018-04-10 23:59.
For each item in times I have a corresponding label of 0 or 1 recorded in the list labels.
My goal is to get the mean label value (between 0 and 1) for every minute interval.
times = [Timestamp('2018-04-10 00:00:00.118000'),
Timestamp('2018-04-10 00:00:00.547000'),
Timestamp('2018-04-10 00:00:00.569000'),
Timestamp('2018-04-10 00:00:00.690000'),
...
Timestamp('2018-04-10 23:59:59.999000') ]
labels = [0,1,1,0,1,0,....1]
where len(times) == len(labels)
For every minute interval between 2018-04-10 00:00 and 2018-04-10 23:59 (the min and max times in the list, respectively), I am trying to get two lists:
1) The start time of the minute interval.
2) The mean label value of all the datetimes in that interval.
In particular I am having trouble with (2).
Note: the times list is not necessarily in chronological order.
First, here is how I generated data in the format described above:
import numpy as np
import pandas as pd
from datetime import datetime
size = int(1e6)
timestamp_a_day = np.linspace(datetime.now().timestamp(), datetime.now().timestamp()+24*60*60, size)
dummy_sec = np.random.rand(size)
timestamp_series = pd.Series(timestamp_a_day + dummy_sec)\
.sort_values().reset_index(drop=True)\
.apply(lambda x: datetime.fromtimestamp(x))
data = pd.DataFrame(timestamp_series, columns=['timestamp'])
data['label'] = np.random.randint(0, 2, size)
Let's solve this problem! (I hope I've understood your question correctly.)
1) data['start_interval'] = data['timestamp'].dt.floor('min')
2) data.groupby('start_interval')['label'].mean()
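If you need the result back as two plain lists (the interval starts and the mean label for each), you could for example do:

means = data.groupby('start_interval')['label'].mean()
interval_starts = means.index.tolist()   # 1) start time of each minute interval
mean_labels = means.tolist()             # 2) mean label value in that interval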
An alternative approach without pandas (see the sketch below):
1) zip times and labels, then sort;
2) write a function that returns the date, hour, and minute of a Timestamp;
3) group by that function;
4) sum and average the labels for each group.
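A minimal sketch of those steps in plain Python (the helper name minute_key is just illustrative), assuming times and labels are the lists from the question:

from collections import defaultdict

def minute_key(ts):
    # truncate a Timestamp to the start of its minute interval
    return ts.floor('min')

pairs = sorted(zip(times, labels))      # zip and sort chronologically
totals = defaultdict(lambda: [0, 0])    # minute start -> [label sum, count]
for ts, label in pairs:
    key = minute_key(ts)
    totals[key][0] += label
    totals[key][1] += 1

interval_starts = sorted(totals)                                      # 1) start of each minute interval
mean_labels = [totals[k][0] / totals[k][1] for k in interval_starts]  # 2) mean label per interval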
I am writing some code to interpolate some data with space (x, y) and time. The data needs to be on a regular grid. I can't seem to make a generalized function to find a date range with regular spacing. The range that fails for me is:
import numpy
import pandas as pd

date_min = numpy.datetime64('2022-10-24T00:00:00.000000000')
date_max = numpy.datetime64('2022-11-03T00:00:00.000000000')
And it needs to roughly match the current number of times I have, which in this case is 44:
periods = 44
I tried testing if the time difference is divisible by 2 and then adding 1 to the number of periods, which worked for a lot of cases, but it doesn't seem to really work for this time range:
def unique_diff(x):
return numpy.unique(numpy.diff(x))
unique_diff(pd.date_range(date_min, date_max, periods=periods))
Out[31]: array([20093023255813, 20093023255814], dtype='timedelta64[ns]')
unique_diff(pd.date_range(date_min, date_max, periods=periods+1))
Out[32]: array([19636363636363, 19636363636364], dtype='timedelta64[ns]')
unique_diff(pd.date_range(date_min, date_max, periods=periods-1))
Out[33]: array([20571428571428, 20571428571429], dtype='timedelta64[ns]')
However, it does work for +2:
unique_diff(pd.date_range(date_min, date_max, periods=periods+2))
Out[34]: array([19200000000000], dtype='timedelta64[ns]')
I could just keep trying different period deltas until I get a solution, but I would rather know why this is happening and how I can generalize this for any min/max times with a target number of periods.
Your date range doesn't divide evenly by the periods in nanosecond resolution:
# as the range contains both start and end, there is one step fewer than there are periods
steps = periods - 1
int(date_max - date_min) / steps
# 20093023255813.953
A solution could be to round up (or down) your max date, to make it divide evenly in nanosecond resolution:
date_max_r = date_min + int(numpy.ceil(int(date_max - date_min) / steps) * steps)
unique_diff(pd.date_range(date_min, date_max_r, periods=periods))
# array([20093023255814], dtype='timedelta64[ns]')
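If you want this for arbitrary min/max dates, the same rounding can be folded into a small helper (a sketch; the name regular_date_range is just illustrative, and unique_diff is the helper from the question):

import numpy
import pandas as pd

def regular_date_range(date_min, date_max, periods):
    # date_range with exactly `periods` points and a single uniform step,
    # obtained by rounding the end date up so the span divides evenly in nanoseconds
    steps = periods - 1
    span_ns = int(date_max - date_min)          # total span in nanoseconds
    step_ns = int(numpy.ceil(span_ns / steps))  # round the step up to whole nanoseconds
    return pd.date_range(date_min, date_min + step_ns * steps, periods=periods)

rng = regular_date_range(numpy.datetime64('2022-10-24T00:00:00.000000000'),
                         numpy.datetime64('2022-11-03T00:00:00.000000000'),
                         periods=44)
unique_diff(rng)
# array([20093023255814], dtype='timedelta64[ns]')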
I have a DataFrame with events that each have a start and end date. I also have a reporting period with a start and end date and a reporting frequency, e.g. monthly. What I want to calculate is the number of "active" events in each reporting period bin. An active event is an event for which the time overlaps with the time interval of the reporting period bin.
After struggling too much with DataFrame aggregation functions, I have come up with the following code that does the job but which is far from compact and elegant.
I'm pretty sure there is a way to write this more compactly but need some leads.
import numpy as np
import pandas as pd
import datetime as dt
# Example DF of events each with a start and end date provided as a string (my input data)
df = pd.DataFrame(columns=['id','start','end'], index=range(7), \
data=[[1,'2006-01-01','2007-10-01'],
[2,'2007-10-02','2008-12-01'],
[3,'2010-01-15','2010-10-20'],
[4,'2009-04-04','2010-06-03'],
[5,'2010-05-12','2010-08-31'],
[6,'2016-05-12','2199-12-31'],
[7,'2016-05-12','2199-12-31']])
# Reporting period in which we want to calculate the number of "ongoing"/"active" events:
reporting_period_start = '2010-01-01'
reporting_period_end = '2011-01-01'
reporting_freq = 'MS'
print('Input data:')
print(df)
# Convert the string dates to timestamps
def to_timestamp(str):
return pd.Timestamp(str)
df.start = df.start.apply(to_timestamp)
df.end = df.end.apply(to_timestamp)
# Create an additional column in the dataframe to capture the event time interval as a pandas.Interval
# pandas.Interval offers a handy .overlaps() method
def to_interval(s, e):
return pd.Interval(s, e)
df['interval'] = df.apply(lambda row: to_interval(row.start, row.end), axis=1)
# Create a data range and a period range to create reporting intervals (e.g. monthly)
# for which we want to count the number of event intervals that overlap with the reporting interval.
bins = pd.date_range(reporting_period_start, reporting_period_end, freq=reporting_freq)
print(bins)
# Convert the date ranges into a list of reporting intervals
# This is ugly code that most probably can be written a lot more elegantly
intervals = []
n = bins.values.shape[0]
i = 0;
for b in bins[:-1]:
intervals.append(pd.Interval(pd.to_datetime(bins.values[i]), pd.to_datetime(bins.values[(i+1)%n]), closed='right'))
i = i + 1
# Function for trying a pandas.Dataframe.apply / resample / groupby or something alike...
def overlaps(i1, i2):
try:
return i1.overlaps(i2)
except:
return None
result_list = np.zeros(len(intervals)).astype(int)
for index, row in df.iterrows():
j = 0
for interval in intervals:
result_list[j] = result_list[j]+overlaps(intervals[j], row.interval)
j = j + 1
print(result_list)
If you think of your intervals as step functions, which have a value of 1 for the duration of the interval and 0 otherwise, then this can be solved concisely with staircase, which is built upon pandas and numpy for analysis with step functions.
In this setup code I have changed the dates in year 2199 to None, to indicate that the end time is not known. I'm assuming that's what you wanted; if not, don't make this change.
setup
import numpy as np
import pandas as pd
# Example DF of events each with a start and end date provided as a string
df = pd.DataFrame(
columns=['id','start','end'],
index=range(7),
data=[[1,'2006-01-01','2007-10-01'],
[2,'2007-10-02','2008-12-01'],
[3,'2010-01-15','2010-10-20'],
[4,'2009-04-04','2010-06-03'],
[5,'2010-05-12','2010-08-31'],
[6,'2016-05-12',None],
[7,'2016-05-12',None]])
df["start"] = pd.to_datetime(df["start"])
df["end"] = pd.to_datetime(df["end"])
reporting_period_start = '2010-01-01'
reporting_period_end = '2011-01-01'
reporting_freq = 'MS'
solution
Your intervals do not start and end on month boundaries. We need to "floor" the start times to month boundaries and "ceiling" the end times to month boundaries, so that intervals which overlap a month also overlap each other. To my knowledge there is currently no elegant way to do this, but the following will work:
df["start"] = df["start"].dt.to_period("M").dt.to_timestamp()
df["end"] = (df["end"].dt.to_period("M")+1).dt.to_timestamp()
df now looks like this
id start end
0 1 2006-01-01 2007-11-01
1 2 2007-10-01 2009-01-01
2 3 2010-01-01 2010-11-01
3 4 2009-04-01 2010-07-01
4 5 2010-05-01 2010-09-01
5 6 2016-05-01 NaT
6 7 2016-05-01 NaT
Now we create a step function which is the combination of all intervals. When an interval starts the step function value increases by 1. When an interval finishes the value decreases by 1. So the value of the step function at any point will be the number of intervals overlapping that point. A step function is represented by the staircase.Stairs class. This class is to staircase as Series is to pandas.
import staircase as sc
stepfunction = sc.Stairs(df, "start", "end")
There are many things you can do with step functions in staircase including plotting.
stepfunction.plot(style="hlines")
Since the intervals now start and end at month boundaries, and the bins are month boundaries we can answer your question by finding the maximum value of the step function for each month.
bins = pd.date_range(reporting_period_start, reporting_period_end, freq=reporting_freq)
result = stepfunction.slice(bins).max()
result will be a pandas.Series indexed by a monthly interval index, whose values are the number of intervals overlapping with that month
[2010-01-01, 2010-02-01) 2.0
[2010-02-01, 2010-03-01) 2.0
[2010-03-01, 2010-04-01) 2.0
[2010-04-01, 2010-05-01) 2.0
[2010-05-01, 2010-06-01) 3.0
[2010-06-01, 2010-07-01) 3.0
[2010-07-01, 2010-08-01) 2.0
[2010-08-01, 2010-09-01) 2.0
[2010-09-01, 2010-10-01) 1.0
[2010-10-01, 2010-11-01) 1.0
[2010-11-01, 2010-12-01) 0.0
[2010-12-01, 2011-01-01) 0.0
dtype: float64
To recap, the solution (after imports and setup) is
df["start"] = df["start"].dt.to_period("M").dt.to_timestamp()
df["end"] = (df["end"].dt.to_period("M")+1).dt.to_timestamp()
result = sc.Stairs(df, "start", "end").slice(bins).max()
note: I am the creator of staircase. Please feel free to reach out with feedback or questions if you have any.
I have a 40 year time series in the format stn;yyyymmddhh;rainfall, where yyyy = year, mm = month, dd = day, hh = hour. The series is at an hourly resolution. I extracted the maximum values for each year with the following groupby method:
import pandas as pd
df = pd.read_csv('data.txt', delimiter = ";")
df['yyyy'] = df['yyyymmddhh'].astype(str).str[:4]
df.groupby(['yyyy'])['rainfall'].max().reset_index()
Now, I am trying to extract the maximum values for each 3 hour duration in every year. I tried this sliding maxima approach but it is not working. k is the duration I am interested in. In simple words, I need the maximum precipitation sum for multiple durations in every year (e.g. 3h, 6h, etc.):
class AMS:
def sliding_max(self, k, data):
tp = data.values
period = 24*365
agg_values = []
start_j = 1
end_j = k*int(np.floor(period/k))
for j in range(start_j, end_j + 1):
start_i = j - 1
end_i = j + k + 1
agg_values.append(np.nansum(tp[start_i:end_i]))
self.sliding_max = max(agg_values)
return self.sliding_max
Any suggestions or improvements to my code, or is there a way I can implement it with groupby? I am a bit new to the Python environment, so please excuse me if the question isn't put properly.
Stn;yyyymmddhh;rainfall
xyz;1981010100;0.0
xyz;1981010101;0.0
xyz;1981010102;0.0
xyz;1981010103;0.0
xyz;1981010104;0.0
xyz;1981010105;0.0
xyz;1981010106;0.0
xyz;1981010107;0.0
xyz;1981010108;0.0
xyz;1981010109;0.4
xyz;1981010110;0.6
xyz;1981010111;0.1
xyz;1981010112;0.1
xyz;1981010113;0.0
xyz;1981010114;0.1
xyz;1981010115;0.6
xyz;1981010116;0.0
xyz;1981010117;0.0
xyz;1981010118;0.2
xyz;1981010119;0.0
xyz;1981010120;0.0
xyz;1981010121;0.0
xyz;1981010122;0.0
xyz;1981010123;0.0
xyz;1981010200;0.0
You first have to convert your column containing the datetimes to a Series of type datetime. You can do that parsing by providing the format of your datetimes.
df["yyyymmddhh"] = pd.to_datetime(df["yyyymmddhh"], format="%Y%M%d%H")
After having the correct data type you have to set that column as your index and can now use pandas functionality for time series data (resampling in your case).
First you resample the data to 3 hour windows and sum the values. From that you resample to yearly data and take the maximum value of all the 3 hour windows for each year.
df.set_index("yyyymmddhh").resample("3H").sum().resample("Y").max()
# Output
yyyymmddhh rainfall
1981-12-31 1.1
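Note that resample("3H") uses fixed, non-overlapping 3-hour bins. If you want the sliding (overlapping) 3-hour sums your class was aiming for, one option is a time-based rolling window; a sketch, assuming the yyyymmddhh column has already been parsed as above:

hourly = df.set_index("yyyymmddhh")["rainfall"]
rolling_3h = hourly.rolling("3H").sum()        # overlapping 3-hour sums, one per hourly step
annual_max_3h = rolling_3h.resample("Y").max() # largest 3-hour total in each year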
I have a dataframe containing a time series indexed by time but with irregular time deltas as below
df
time x
2018-08-18 17:45:08 1.4562
2018-08-18 17:46:55 1.4901
2018-08-18 17:51:21 1.8012
...
2020-03-21 04:17:19 0.7623
2020-03-21 05:01:02 0.8231
2020-03-21 05:02:34 0.8038
What I want to do is get the daily difference between the two (chronologically) closest values, i.e. the closest time the next day. For example, if we have a sample at time 2018-08-18 17:45:08, and the next day we do not have a sample at the same time, but the closest sample is at, say, 2018-08-19 17:44:29, then I want to get the difference in x between these two times. How is that possible in pandas?
There will always be a sample for every single day between the first and last day in the time series.
The difference should be taken as (current x) - (past x) e.g. x_day2 - x_day1
The output's first n rows will be NaN given how the difference is taken, where n is the number of samples in the first day
EDIT: The code below works if the time deltas are regular
def get_daily_diff(data):
"""
Calculate daily difference in time series
Args:
data (pandas.Series): a pandas series of time series values indexed by pandas.Timestamp
Returns:
pandas.Series: daily difference in values
"""
df0 = data.index.searchsorted(data.index - pd.Timedelta(days=1))
df0 = df0[df0 > 0]
df0 = pd.Series(data.index[df0 - 1], index=data.index[data.shape[0] - df0.shape[0]:])
out = data.loc[df0.index] - data.loc[df0.values]
return out
However, with irregular time deltas a ValueError is thrown when defining the variable out, because there is a length mismatch between data.loc[df0.index] and data.loc[df0.values]. So the issue is to extend this function to work when the time deltas are irregular.
I would use pd.merge_asof with direction='nearest':
df['time_1d'] = df['time'] + pd.Timedelta('1D')
tmp = pd.merge_asof(df, df, left_on='time', right_on='time_1d',
                    direction='nearest', tolerance=pd.Timedelta('12H'),
                    suffixes=('', '_y'))
# x_y is the value matched from roughly one day earlier, so current - past is:
tmp['delta'] = tmp['x'] - tmp['x_y']
tmp = tmp[['time', 'x', 'delta']]
Here I have used a tolerance of 12H to make sure the first day's rows get NaN, but you could use a more appropriate value.
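For completeness, a runnable sketch with a few made-up rows (the values are purely illustrative):

import pandas as pd

df = pd.DataFrame({
    "time": pd.to_datetime(["2018-08-18 17:45:08", "2018-08-18 17:46:55",
                            "2018-08-18 17:51:21", "2018-08-19 17:44:29",
                            "2018-08-19 17:50:02"]),
    "x": [1.4562, 1.4901, 1.8012, 1.5120, 1.7990],
})

df['time_1d'] = df['time'] + pd.Timedelta('1D')
tmp = pd.merge_asof(df, df, left_on='time', right_on='time_1d',
                    direction='nearest', tolerance=pd.Timedelta('12H'),
                    suffixes=('', '_y'))
tmp['delta'] = tmp['x'] - tmp['x_y']   # current x minus the nearest value ~1 day earlier
print(tmp[['time', 'x', 'delta']])     # the first day's rows have NaN deltas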
I have a data frame with temperature measurements at a frequency of 5 minutes. I would like to resample this dataset to find the mean temperature per hour.
This is typically done using df['temps'].resample('H', how='mean') but this averages all values that fall within the hour - using all times where '12' is the hour, for example. I want something that gets all values from 30 minutes either side of the hour (or times nearest to the actual hour) and finds the mean that way. In other words, for the resampled time step of 1200, use all temperature values from 1130 to 1230 to calculate the mean.
Example code below to create a test data frame:
import pandas as pd

index = pd.date_range('1/1/2000', periods=200, freq='5min')
temps = pd.Series(range(200), index=index)
df = pd.DataFrame(index=index)
df['temps'] = temps
Can this be done using the built-in resample method? I'm sure I've done it before using pandas but cannot find any reference to it.
It seems you need:
print(df['temps'].shift(freq='30Min').resample('H').mean())
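Combined with the test frame from the question, a quick check of what each labeled hour now covers:

import pandas as pd

index = pd.date_range('1/1/2000', periods=200, freq='5min')
df = pd.DataFrame({'temps': range(200)}, index=index)

# shifting the index forward 30 minutes means the hourly bin labeled 12:00
# collects the original samples from 11:30 (inclusive) up to 12:30 (exclusive)
hourly = df['temps'].shift(freq='30Min').resample('H').mean()
print(hourly.head(3))
# 2000-01-01 00:00:00     2.5   (original samples 00:00-00:25)
# 2000-01-01 01:00:00    11.5   (original samples 00:30-01:25)
# 2000-01-01 02:00:00    23.5   (original samples 01:30-02:25)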