Resample a Pandas DataFrame to hourly using the hour as mid-point - python

I have a data frame with temperature measurements at a frequency of 5 minutes. I would like to resample this dataset to find the mean temperature per hour.
This is typically done using df['temps'].resample('H', how='mean') but this averages all values that fall within the hour - using all times where '12' is the hour, for example. I want something that gets all values from 30 minutes either side of the hour (or times nearest to the actual hour) and finds the mean that way. In other words, for the resampled time step of 1200, use all temperature values from 1130 to 1230 to calculate the mean.
Example code below to create a test data frame:
import pandas as pd

index = pd.date_range('1/1/2000', periods=200, freq='5min')
temps = pd.Series(range(200), index=index)
df = pd.DataFrame(index=index)
df['temps'] = temps
Can this be done using the built-in resample method? I'm sure I've done it before using pandas but cannot find any reference to it.

It seems you need:
print (df['temps'].shift(freq='30Min').resample('H').mean())
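As a sanity check, the shift-then-resample trick can be run against the sample frame from the question (a sketch; '60min' is spelled out instead of 'H' since the hour alias spelling differs across pandas versions):

```python
import pandas as pd

# 5-minute samples: value n sits at minute 5*n after midnight
index = pd.date_range('1/1/2000', periods=200, freq='5min')
temps = pd.Series(range(200), index=index)

# Shift timestamps forward by 30 minutes, then bin by hour:
# the bin labelled 01:00 now covers the original times 00:30-01:25.
centered = temps.shift(freq='30min').resample('60min').mean()

# Originals 00:30-01:25 are the values 6..17, whose mean is 11.5
print(centered[pd.Timestamp('2000-01-01 01:00')])  # prints 11.5
```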

Related

What df.index.freq do I have to set for two-hourly data?

When doing time series analysis we need to set a frequency, e.g. like this for monthly data:
df.index.freq = 'MS'
What frequency do I have to set if I have two-hourly data (or 30-day data, etc.)?
Thanks a lot!
I think you need DataFrame.asfreq if you need to change the frequency:
df = df.asfreq('2H')
If you need data processing, e.g. aggregating by sum per 2 hours or per 30 days:
df1 = df.resample('2H').sum()
df2 = df.resample('30D').sum()
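A minimal illustration of the difference between the two (a sketch with made-up values; '120min' is spelled out to avoid version-dependent '2H'/'2h' aliases): asfreq only picks the rows that fall on the new grid, while resample aggregates everything in between:

```python
import pandas as pd

# six hourly readings (hypothetical values)
idx = pd.date_range('2020-01-01', periods=6, freq='60min')
df = pd.DataFrame({'val': [1, 2, 3, 4, 5, 6]}, index=idx)

# asfreq: select the rows that fall on the 2-hour grid, no aggregation
picked = df.asfreq('120min')          # val: 1, 3, 5

# resample: sum every 2-hour bin
summed = df.resample('120min').sum()  # val: 3, 7, 11
```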

maximum difference between two time series of different resolution

I have two time series giving the electricity demand in one-hour resolution and five-minute resolution. I am trying to find the maximum difference between these two time series. The one-hour resolution data has 8760 rows (hourly for a year) and the 5-minute resolution data has 104,722 rows (5-minutely for a year).
The only method I can think of is to expand the hourly data into 5-minute resolution by repeating each hourly value 12 times, then find the maximum of the difference of the two datasets.
If this technique is the way to go, is there an easy way to convert my hourly data into 5-minute resolution by repeating each hourly value 12 times?
For your reference, I posted a plot of this data for one day.
P.S. I am using Python for this task.
NumPy's .repeat() function
You can change your hourly data into 5-minute data by using NumPy's repeat function:
import numpy as np
np.repeat(hourly_data, 12)
I would strongly recommend against converting the hourly data into five-minute data. If the data in both cases refers to the mean load of those time ranges, you'll be looking at more accurate data if you group the five-minute intervals into hourly datasets. You'd get more granularity the way you're talking about, but the granularity is not based on accurate data, so you're not actually getting more value from it. If you aggregate the five-minute chunks into hourly chunks and compare the series that way, you can be more confident in the trustworthiness of your results.
In order to group them together to get that result, you can define a function like the following and use the apply method like so:
from datetime import datetime as dt

def to_hour(date):
    # truncate a timestamp to the top of its hour
    date = date.strftime("%Y-%m-%d %H:00:00")
    date = dt.strptime(date, "%Y-%m-%d %H:%M:%S")
    return date

df['Aggregated_Datetime'] = df['Original_Datetime'].apply(to_hour)
df.groupby('Aggregated_Datetime').agg('Real-Time Lo
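Following that advice end to end, aggregating the five-minute series to hourly means and then comparing, might look like the sketch below (column names and values are hypothetical; `index.floor` is a simpler way to truncate timestamps to the hour than the strftime round-trip):

```python
import numpy as np
import pandas as pd

# hypothetical data: two hours of 5-minute demand readings
five_min_idx = pd.date_range('2020-01-01', periods=24, freq='5min')
five_min = pd.DataFrame({'demand': np.arange(24.0)}, index=five_min_idx)

# the matching hourly demand series (also hypothetical)
hourly_idx = pd.date_range('2020-01-01', periods=2, freq='60min')
hourly = pd.Series([4.0, 18.0], index=hourly_idx, name='demand')

# aggregate the 5-minute data into hourly means
agg = five_min.groupby(five_min.index.floor('60min'))['demand'].mean()

# maximum absolute difference between the two hourly series
max_diff = (agg - hourly).abs().max()
print(max_diff)  # hour 0: |5.5 - 4.0| = 1.5; hour 1: |17.5 - 18.0| = 0.5
```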

Python netcdf - Monthly median of daily data over the whole time period

I have a NetCDF file input.nc. This file represents nearly 18 years of data sampled every 4 days. From this file, I would like to calculate the monthly median value over the whole time period. So the output file should only contain 12 time steps.
I am using the following code:
import os
import xarray as xr

os.chdir(inbasedir)
data = xr.open_dataset('input.nc')
monthly_data = data.resample(freq='m', dim='time', how='median')
monthly_data.to_netcdf("test.nc")
Unfortunately, when I look at the output file, my code has computed the median value for each month of the whole time series, and I end up with more than 200 values. How can I change my code in order to calculate the 12 monthly medians over the whole time period?
You want to use the groupby method:
monthly_data = data.groupby('time.month').median()
There are some good examples of how to use xarray with timeseries data here: http://xarray.pydata.org/en/stable/time-series.html
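For comparison, the same "one value per calendar month" grouping in plain pandas (a sketch with synthetic daily data; `groupby('time.month')` in xarray corresponds to grouping by `index.month` here):

```python
import numpy as np
import pandas as pd

# two years of daily data, so every calendar month occurs more than once
idx = pd.date_range('2018-01-01', '2019-12-31', freq='D')
s = pd.Series(np.arange(len(idx), dtype=float), index=idx)

# one median per calendar month over the whole period: 12 values, not 24
monthly_median = s.groupby(s.index.month).median()
print(len(monthly_median))  # prints 12
```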

Resampling a frequency column in Pandas

I've been looking at the pandas resample function, and it seems to only work for frequencies of a day and longer. But I want to resample 64 Hz data into 8 Hz. The file is 170 MB, so I can't attach it here, but the data has 2 arrays, one for time and the other for the corresponding value. Is it possible to resample it by averaging? Any help would be appreciated.
Frequency is the inverse of time period. Essentially, you want to convert frequency to time period:
df['T'] = 1 / df['f']
and then resample every 0.125 s (i.e. 1/8 s). Look at the df.resample docs for help.
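In fact resample handles sub-second periods directly, so the 64 Hz series can be averaged straight into 8 Hz bins once the times form a DatetimeIndex (a sketch with synthetic values; 1/64 s = 15625 µs and 1/8 s = 125 ms):

```python
import numpy as np
import pandas as pd

# one second of 64 Hz samples (hypothetical signal values)
idx = pd.date_range('2020-01-01', periods=64, freq='15625us')  # 1/64 s
sig = pd.Series(np.arange(64.0), index=idx)

# average every 125 ms window -> 8 Hz
low = sig.resample('125ms').mean()
print(len(low))  # prints 8: each output point averages 8 input samples
```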

Dataset statistics with custom begin of the year

I would like to do some annual statistics (a cumulative sum) on a daily time series of data in an xarray dataset. The tricky part is that the day on which my considered year begins must be flexible, and the time series contains leap years.
I tried e.g. the following:
rollday = -181
dr = pd.date_range('2015-01-01', '2017-08-23')
foo = xr.Dataset({'data': (['time'], np.ones(len(dr)))}, coords={'time': dr})
foo_groups = foo.roll(time=rollday).groupby(foo.time.dt.year)
foo_cumsum = foo_groups.apply(lambda x: x.cumsum(dim='time', skipna=True))
which is "unfavorable" mainly because of two things:
(1) the rolling doesn't account for the leap years, so the get an offset of one day per leap year and
(2) the beginning of the first year (until end of June) is appended to the end of the rolled time series, which creates some "fake year" where the cumulative sums doesn't make sense anymore.
I tried also to first cut off the ends of the time series, but then the rolling doesn't work anymore. Resampling to me also did not seem to be an option, as I could not find a fitting pandas freq string.
I'm sure there is a better/correct way to do this. Can somebody help?
You can use a xarray.DataArray that specifies the groups. One way to do this is to create an array of values (years) that define the group ids:
import numpy as np
import pandas as pd
import xarray as xr

# setup sample data
dr = pd.date_range('2015-01-01', '2017-08-23')
foo = xr.Dataset({'data': (['time'], np.ones(len(dr)))}, coords={'time': dr})
# create an array of years (modify day/month for your use case)
my_years = xr.DataArray(
    [t.year if ((t.month < 9) or ((t.month == 9) and (t.day < 15))) else (t.year + 1)
     for t in foo.indexes['time']],
    dims='time', name='my_years', coords={'time': dr})
# use that array of years (integers) to do the groupby
foo_cumsum = foo.groupby(my_years).apply(lambda x: x.cumsum(dim='time', skipna=True))
# Voila!
foo_cumsum['data'].plot()