Resampling a frequency column in Pandas - python

I've been looking at the pandas resample function, and it seems to only work for daily and longer frequencies. But I want to resample 64 Hz data down to 8 Hz. The file is 170 MB, so I can't attach it here, but the data has two arrays, one for time and one for the corresponding values. Is it possible to resample it by averaging? Any help would be appreciated.

Frequency is the inverse of the time period. Essentially, you want to convert frequency to time period:
df['T'] = 1 / df['f']
Then resample every 0.125 s (i.e. 1/8 s). Look at the df.resample docs for help.
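A minimal sketch of that idea (the time and value arrays below are synthetic stand-ins for the poster's two arrays):

import numpy as np
import pandas as pd

# stand-in for the poster's data: 10 seconds of 64 Hz samples
time = np.arange(0, 10, 1 / 64)                 # seconds
value = np.random.rand(len(time))

# attach the values to a TimedeltaIndex built from the time array
df = pd.DataFrame({'value': value}, index=pd.to_timedelta(time, unit='s'))

# 8 Hz means one sample every 125 ms; average the 64 Hz samples in each bin
resampled = df.resample('125ms').mean()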

Related

Maximum difference between two time series of different resolution

I have two time series that give the electricity demand in one-hour resolution and five-minute resolution. I am trying to find the maximum difference between these two time series. The one-hour resolution data has 8760 rows (hourly for a year) and the five-minute resolution data has 104,722 rows (five-minutely for a year).
I can only think of a method that expands the hourly data into five-minute resolution by repeating each hourly value 12 times, and then takes the maximum of the difference of the two data sets.
If this technique is the way to go, is there an easy way to convert my hourly data into five-minute resolution by repeating each hourly value 12 times?
For your reference, I posted a plot of this data for one day.
P.S. I am using Python to do this task.
Numpy's .repeat() function
You can change your hourly data into 5-minute data by using numpy's repeat function
import numpy as np
np.repeat(hourly_data, 12)
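If you also want a proper 5-minute DatetimeIndex on the expanded series, a sketch along these lines could work (the hourly_data series below is a synthetic stand-in for the hourly demand):

import numpy as np
import pandas as pd

# synthetic stand-in: 8760 hourly demand values
hourly_data = pd.Series(np.random.rand(8760),
                        index=pd.date_range('2020-01-01', periods=8760, freq='h'))

# repeat each hourly value 12 times and give the result a 5-minute index
expanded = pd.Series(np.repeat(hourly_data.values, 12),
                     index=pd.date_range(hourly_data.index[0],
                                         periods=len(hourly_data) * 12,
                                         freq='5min'))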
I would strongly recommend against converting the hourly data into five-minute data. If the data in both cases refers to the mean load of those time ranges, you'll be looking at more accurate data if you group the five-minute intervals into hourly datasets. You'd get more granularity the way you're talking about, but the granularity is not based on accurate data, so you're not actually getting more value from it. If you aggregate the five-minute chunks into hourly chunks and compare the series that way, you can be more confident in the trustworthiness of your results.
In order to group them together to get that result, you can define a function like the following and use the apply method like so:
from datetime import datetime as dt

def to_hour(date):
    # round the timestamp down to the start of its hour
    date = date.strftime("%Y-%m-%d %H:00:00")
    date = dt.strptime(date, "%Y-%m-%d %H:%M:%S")
    return date

df['Aggregated_Datetime'] = df['Original_Datetime'].apply(to_hour)
# the original answer is truncated here; the aggregation was presumably
# something like this ('Real-Time Load' is a guess from the cut-off text):
df.groupby('Aggregated_Datetime').agg({'Real-Time Load': 'mean'})
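A shorter route to the same hourly aggregation, plus the maximum-difference comparison the question asks for, might look like this sketch (df, Original_Datetime and hourly_data are assumed names, and 'Real-Time Load' is only a guess at the truncated column name):

# assumes df holds the 5-minute data and hourly_data the hourly series
five_min_hourly = (df.set_index('Original_Datetime')['Real-Time Load']
                     .resample('h').mean())

# maximum absolute difference between the two hourly series
max_diff = (five_min_hourly - hourly_data).abs().max()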

Resample a Pandas DataFrame to Hourly using Hour as mid-point

I have a data frame with temperature measurements at a frequency of 5 minutes. I would like to resample this dataset to find the mean temperature per hour.
This is typically done using df['temps'].resample('H', how='mean') but this averages all values that fall within the hour - using all times where '12' is the hour, for example. I want something that gets all values from 30 minutes either side of the hour (or times nearest to the actual hour) and finds the mean that way. In other words, for the resampled time step of 1200, use all temperature values from 1130 to 1230 to calculate the mean.
Example code below to create a test data frame:
index = pd.date_range('1/1/2000', periods=200, freq='5min')
temps = pd.Series(range(200), index=index)
df = pd.DataFrame(index=index)
df['temps'] = temps
Can this be done using the built-in resample method? I'm sure I've done it before using pandas but cannot find any reference to it.
It seems you need:
print (df['temps'].shift(freq='30Min').resample('H').mean())
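Putting that together with the test frame from the question (just a sketch to show the effect):

import pandas as pd

index = pd.date_range('1/1/2000', periods=200, freq='5min')
df = pd.DataFrame({'temps': range(200)}, index=index)

# shift every timestamp forward by 30 minutes, then bin by hour: the bin
# labelled 12:00 now collects the original 11:30-12:25 samples
centred = df['temps'].shift(freq='30min').resample('h').mean()
print(centred.head())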

Calculating variations in sampling frequency

I have a sequence of timestamps (in Unix milliseconds) stored in a pandas Series. Each timestamp belongs to a sensor measurement. To get the mean sampling period (the inverse of the sampling frequency) I can just subtract the first timestamp from the last one and then divide by the number of timestamps:
# assuming df is my Series
sf = (df.iloc[-1] - df.iloc[0]) / len(df)
But this does not provide me insights about the variation of the sampling frequency.
How can I calculate the standard deviation of the sampling frequency?
If you have the timestamps stored in numerical form, I'd propose simply checking the std of the intervals between consecutive timestamps.
In your example:
df.diff().std()
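For example, with the timestamps in Unix milliseconds (the ts series below is a synthetic stand-in):

import numpy as np
import pandas as pd

# synthetic stand-in: roughly 100 Hz sampling with a little jitter, in ms
ts = pd.Series(np.cumsum(np.random.normal(10.0, 0.5, size=1000)))

intervals = ts.diff().dropna()      # ms between consecutive samples
mean_period = intervals.mean()      # mean sampling period in ms
std_period = intervals.std()        # spread of the sampling period in ms
mean_rate = 1000.0 / mean_period    # mean sampling rate in Hz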

Python netcdf - Monthly median over the whole time period of daily data

I have a NetCDF file input.nc. This file represents nearly 18 years of data sampled every 4 days. From this file, I would like to calculate the monthly median value over the whole time period, so the output file should only contain 12 time steps.
I am using the following code:
import os
import xarray as xr

os.chdir(inbasedir)
data = xr.open_dataset('input.nc')
monthly_data = data.resample(freq='m', dim='time', how='median')
monthly_data.to_netcdf("test.nc")
Unfortunately, when I look at the output file, my code has computed a median for every month of the whole time series, and I end up with more than 200 values. How can I change my code in order to calculate the 12 monthly medians over the whole time period?
You want to use the groupby method:
monthly_data = data.groupby('time.month').median()
There are some good examples of how to use xarray with timeseries data here: http://xarray.pydata.org/en/stable/time-series.html
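A minimal, self-contained sketch of the groupby approach (the variable name sst and the random data are made up):

import numpy as np
import pandas as pd
import xarray as xr

# roughly 18 years of data sampled every 4 days
time = pd.date_range('2000-01-01', periods=1642, freq='4D')
data = xr.Dataset({'sst': ('time', np.random.rand(time.size))},
                  coords={'time': time})

# one median per calendar month, pooling all years -> 12 time steps
monthly_median = data.groupby('time.month').median()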

Day delta for dates >292 years apart

I am trying to obtain day deltas for a wide range of pandas dates. However, for time deltas greater than about 292 years I obtain negative values. For example,
import pandas as pd
dates = pd.Series(pd.date_range('1700-01-01', periods=4500, freq='m'))
days_delta = (dates-dates.min()).astype('timedelta64[D]')
However, using a DatetimeIndex I can do it and it works as I want it to,
import pandas as pd
import numpy as np
dates = pd.date_range('1700-01-01', periods=4500, freq='m')
days_fun = np.vectorize(lambda x: x.days)
days_delta = days_fun(dates.date - dates.date.min())
The question then is how to obtain the correct days_delta for Series objects?
Read here specifically about timedelta limitations:
Pandas represents Timedeltas in nanosecond resolution using 64 bit integers. As such, the 64 bit integer limits determine the Timedelta limits.
Incidentally, this is the same limitation the docs mention for Timestamps in pandas:
Since pandas represents timestamps in nanosecond resolution, the timespan that can be represented using a 64-bit integer is limited to approximately 584 years
This would suggest that the same recommendations the docs make for circumventing the timestamp limitations can be applied to timedeltas. The solution to the timestamp limitations is found in the docs (here):
If you have data that is outside of the Timestamp bounds, see Timestamp limitations, then you can use a PeriodIndex and/or Series of Periods to do computations.
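A sketch of that idea for the example in the question (note that the exact return type of Period subtraction has changed between pandas versions; older versions returned a plain integer, recent ones a DateOffset whose .n attribute is the day count):

import pandas as pd

dates = pd.Series(pd.date_range('1700-01-01', periods=4500, freq='M'))

# daily-resolution Periods are backed by day ordinals, so their differences
# are not limited to the ~292-year nanosecond window of Timedelta
periods = dates.dt.to_period('D')
start = periods.min()
days_delta = periods.apply(lambda p: (p - start).n)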
Workaround
If you have continuous dates with small, representable gaps, as in your example, you can sort the series and then use cumsum to get around this problem, like this:
import pandas as pd
dates = pd.Series(pd.date_range('1700-01-01', periods=4500, freq='m'))
dates = dates.sort_values()
dateshift = dates.shift(1)
(dates - dateshift).fillna(pd.Timedelta(0)).dt.days.cumsum().describe()
count 4500.000000
mean 68466.072444
std 39543.094524
min 0.000000
25% 34233.250000
50% 68465.500000
75% 102699.500000
max 136935.000000
dtype: float64
See the min and max are both positive.
Failaround
If the gaps are too big, this workaround will not work, like here:
dates = pd.Series(pd.to_datetime(['2016-06-06', '1700-01-01', '2200-01-01']))
dates = dates.sort_values()
dateshift = dates.shift(1)
(dates - dateshift).fillna(pd.Timedelta(0)).dt.days.cumsum()
1 0
0 -97931
2 -30883
This is because we calculate the step between each pair of consecutive dates and then add them up. When the dates are sorted, we are guaranteed the smallest possible steps; however, in this case some individual steps are still too big for a nanosecond timedelta to handle.
Resetting the order
As you see in the Failaround example, the series is no longer ordered by its index. You can restore the original order by calling .sort_index() on the series, or get a fresh 0-based index with .reset_index(drop=True).
