Pandas: extend resampling window

Let's say I have a 'low-frequency' series with data points every 2 hours, which I'd like to upsample to a 1-hour freq.
Is it possible in the code snippet below to have the high-freq signal have 24 rows (instead of 23)? More precisely, I'd like the new index to range from 00:00 to 23:00 with a NaN value (instead of stopping at 22:00).
I've played quite a bit with the various options but still couldn't find a clean way to do it.
import pandas as pd
import numpy as np
low_f = pd.Series(np.random.randn(12),
                  index=pd.date_range(start='01/01/2017', freq='2H', periods=12),
                  name='2H').cumsum()
high_f = low_f.resample('1H').mean()
print(high_f.tail(1).index)
#Yields DatetimeIndex(['2017-01-01 22:00:00'], dtype='datetime64[ns]', freq='H')
#I'd like DatetimeIndex(['2017-01-01 23:00:00'], dtype='datetime64[ns]', freq='H')
#(w/ 24 elements)

You can use the DatetimeIndex.shift method to shift the dates forward by one hour, then take the union of the old index and the newly formed shifted index.
Finally, reindex the series against this combined index. As the series has no value occurring at the last index, it gets filled with NaN, per reindex's default fill_value parameter.
high_f.reindex(high_f.index.union(high_f.index.shift(1, 'H')))
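Putting the answer together with the question's setup (a quick end-to-end check; nothing here beyond the code already shown above):

```python
import pandas as pd
import numpy as np

# setup from the question: 12 points at 2-hour frequency, upsampled to hourly
low_f = pd.Series(np.random.randn(12),
                  index=pd.date_range(start='01/01/2017', freq='2h', periods=12),
                  name='2H').cumsum()
high_f = low_f.resample('1h').mean()          # 23 rows, 00:00 .. 22:00

# union of the index with itself shifted one hour, then reindex: the
# extra 23:00 stamp has no data, so reindex fills it with NaN
extended = high_f.reindex(high_f.index.union(high_f.index.shift(1, 'h')))
print(len(extended), extended.index[-1])      # 24 2017-01-01 23:00:00
```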

Related

Resample Pandas DataFrame to hourly using the hour as mid-point

I have a data frame with temperature measurements at a frequency of 5 minutes. I would like to resample this dataset to find the mean temperature per hour.
This is typically done using df['temps'].resample('H', how='mean') but this averages all values that fall within the hour - using all times where '12' is the hour, for example. I want something that gets all values from 30 minutes either side of the hour (or times nearest to the actual hour) and finds the mean that way. In other words, for the resampled time step of 1200, use all temperature values from 1130 to 1230 to calculate the mean.
Example code below to create a test data frame:
index = pd.date_range('1/1/2000', periods=200, freq='5min')
temps = pd.Series(range(200), index=index)
df = pd.DataFrame(index=index)
df['temps'] = temps
Can this be done using the built-in resample method? I'm sure I've done it before using pandas but cannot find any reference to it.
It seems you need:
print (df['temps'].shift(freq='30Min').resample('H').mean())
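Applied to the question's test frame, the shift moves every stamp forward 30 minutes, so each full hourly bin labelled HH:00 collects the original values from 30 minutes either side of the hour:

```python
import pandas as pd

# test frame from the question: 200 readings at 5-minute frequency
index = pd.date_range('1/1/2000', periods=200, freq='5min')
df = pd.DataFrame({'temps': pd.Series(range(200), index=index)})

centered = df['temps'].shift(freq='30min').resample('h').mean()
# bin 00:00 now averages the original 00:00-00:25 values (0..5)  -> 2.5
# bin 01:00 now averages the original 00:30-01:25 values (6..17) -> 11.5
```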

Day delta for dates >292 years apart

I'm trying to obtain day deltas for a wide range of pandas dates. However, for time deltas greater than 292 years I obtain negative values. For example,
import pandas as pd
dates = pd.Series(pd.date_range('1700-01-01', periods=4500, freq='m'))
days_delta = (dates-dates.min()).astype('timedelta64[D]')
However, using a DatetimeIndex I can do it and it works as I want it to,
import pandas as pd
import numpy as np
dates = pd.date_range('1700-01-01', periods=4500, freq='m')
days_fun = np.vectorize(lambda x: x.days)
days_delta = days_fun(dates.date - dates.date.min())
The question then is how to obtain the correct days_delta for Series objects?
Read here specifically about timedelta limitations:
Pandas represents Timedeltas in nanosecond resolution using 64 bit integers. As such, the 64 bit integer limits determine the Timedelta limits.
Incidentally, this is the same limitation the docs mention for Timestamps in pandas:
Since pandas represents timestamps in nanosecond resolution, the timespan that can be represented using a 64-bit integer is limited to approximately 584 years
This would suggest that the same recommendations the docs make for circumventing the timestamp limitations can be applied to timedeltas. The solution to the timestamp limitations are found in the docs (here):
If you have data that is outside of the Timestamp bounds, see Timestamp limitations, then you can use a PeriodIndex and/or Series of Periods to do computations.
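Following that advice, a minimal sketch of the Period route (Period ordinals at daily frequency are plain integer day counts, so the nanosecond Timedelta limit never comes into play; month-start frequency is used here for illustration):

```python
import pandas as pd

# ~375 years of monthly dates, well past the ~292-year Timedelta limit
dates = pd.Series(pd.date_range('1700-01-01', periods=4500, freq='MS'))

# convert to daily Periods; .ordinal is an integer day count, so the
# subtraction below is plain int arithmetic with no overflow
ordinals = dates.dt.to_period('D').apply(lambda p: p.ordinal)
days_delta = ordinals - ordinals.min()
```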
Workaround
If you have continuous dates with small, calculable gaps, as in your example, you can sort the series and then use cumsum to get around this problem, like this:
import pandas as pd
dates = pd.Series(pd.date_range('1700-01-01', periods=4500, freq='m'))
dates = dates.sort_values()
dateshift = dates.shift(1)
(dates - dateshift).fillna(pd.Timedelta(0)).dt.days.cumsum().describe()
count 4500.000000
mean 68466.072444
std 39543.094524
min 0.000000
25% 34233.250000
50% 68465.500000
75% 102699.500000
max 136935.000000
dtype: float64
See the min and max are both positive.
Failaround
If the gaps are too big, this workaround will not work. Like here:
dates = pd.Series(pd.to_datetime(['2016-06-06', '1700-01-01', '2200-01-01']))
dates = dates.sort_values()
dateshift = dates.shift(1)
(dates - dateshift).fillna(pd.Timedelta(0)).dt.days.cumsum()
1 0
0 -97931
2 -30883
This is because we calculate the step between each pair of adjacent dates, then add them up. Sorting guarantees the smallest possible steps; here, however, even a single step is too big to represent.
Resetting the order
As you see in the Failaround example, the series is no longer ordered by the index. Fix this by calling .reset_index(drop=True) on the series.

Interpolate values between sample years with Pandas

I'm trying to get interpolated values for the metric shown below using Pandas time series.
test.csv
year,metric
2020,290.72
2025,221.763
2030,152.806
2035,154.016
Code
import pandas as pd
df = pd.read_csv('test.csv', parse_dates={'Timestamp': ['year']},
                 index_col='Timestamp')
As far as I understand, this gives me a time series with January 1 of each year as the index. Now I need to fill in values for the missing years (2021, 2022, 2023, 2024, 2026, etc.).
Is there a way to do this with Pandas?
If you're using a newer version of Pandas, your DataFrame object should have an interpolate method that can be used to fill in the gaps.
It turns out interpolation only fills in values where there are none. In my case above, what I had to do was reindex so that the interval was 12 months.
# reindex with interval 12 months (M: month, S: beginning of the month)
df_reindexed = df.reindex(pd.date_range(start='20200101', end='20350101', freq='12MS'))
# method=linear works because the intervals are equally spaced out now
df_interpolated = df_reindexed.interpolate(method='linear')
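Putting the two steps together on the question's data, built inline instead of read from test.csv so it runs standalone:

```python
import pandas as pd

# the four known sample years from test.csv
df = pd.DataFrame(
    {'metric': [290.72, 221.763, 152.806, 154.016]},
    index=pd.to_datetime(['2020-01-01', '2025-01-01', '2030-01-01', '2035-01-01']),
)

# reindex to one stamp per year, then interpolate linearly
df_reindexed = df.reindex(pd.date_range('2020-01-01', '2035-01-01', freq='12MS'))
df_interpolated = df_reindexed.interpolate(method='linear')
# 2021 lies one fifth of the way from 290.72 down to 221.763 -> ~276.93
```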

Python Pandas: Group datetime column into hour and minute aggregations

This seems like it would be fairly straightforward, but after nearly an entire day I have not found the solution. I've loaded my dataframe with read_csv and easily parsed, combined, and indexed a date and a time column into one column, but now I want to be able to reshape and perform calculations based on hour and minute groupings, similar to what you can do in an Excel pivot.
I know how to resample to hour or minute but it maintains the date portion associated with each hour/minute whereas I want to aggregate the data set ONLY to hour and minute similar to grouping in excel pivots and selecting "hour" and "minute" but not selecting anything else.
Any help would be greatly appreciated.
Can't you do, where df is your DataFrame:
times = pd.to_datetime(df.timestamp_col)
df.groupby([times.dt.hour, times.dt.minute]).value_col.sum()
Wes' code didn't work for me. But the DatetimeIndex constructor (docs) did:
times = pd.DatetimeIndex(data.datetime_col)
grouped = df.groupby([times.hour, times.minute])
The DatetimeIndex object is a representation of times in pandas. The first line creates an array of the datetimes. The second line uses this array to get the hour and minute data for all of the rows, allowing the data to be grouped (docs) by these values.
Came across this when I was searching for this type of groupby. Wes' code above didn't work for me; not sure if that's because of changes in pandas over time.
In pandas 0.16.2, what I did in the end was:
grp = data.groupby(by=[data.datetime_col.map(lambda x : (x.hour, x.minute))])
grp.count()
You'd have (hour, minute) tuples as the grouped index. If you want multi-index:
grp = data.groupby(by=[data.datetime_col.map(lambda x: x.hour),
                       data.datetime_col.map(lambda x: x.minute)])
I have an alternative to Wes' and Nix's answers above, with just one line of code. Assuming your column is already a datetime column, you don't need to get the hour and minute attributes separately:
df.groupby(df.timestamp_col.dt.time).value_col.sum()
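A small self-contained check of the group-by-hour-and-minute idea (column names mirror the placeholders used in the answers above):

```python
import pandas as pd

# two rows share the 09:15 slot on different days; the date part is ignored
df = pd.DataFrame({
    'timestamp_col': pd.to_datetime(['2024-01-01 09:15',
                                     '2024-01-02 09:15',
                                     '2024-01-01 10:30']),
    'value_col': [1, 2, 3],
})

times = pd.to_datetime(df.timestamp_col)
out = df.groupby([times.dt.hour.rename('hour'),
                  times.dt.minute.rename('minute')]).value_col.sum()
# index is (hour, minute): (9, 15) -> 3, (10, 30) -> 3
```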
This might be a little late, but I found quite a good solution for anyone that has the same problem.
I have a df like this:
datetime value
2022-06-28 13:28:08 15
2022-06-28 13:28:09 30
... ...
2022-06-28 14:29:11 20
2022-06-28 14:29:12 10
I want to convert those timestamps, which are at one-second intervals, to timestamps at one-minute intervals, summing the value column in the process.
There is a neat way of doing it:
df['datetime'] = pd.to_datetime(df['datetime']) #if not already as datetime object
grouped = df.groupby(pd.Grouper(key='datetime', freq='T')).sum()
print(grouped.head())
Result:
datetime value
2022-06-28 13:28:00 45
... ...
2022-06-28 14:29:00 30
freq='T' stands for minutes ('min' in newer pandas versions). You could also group by hours or days. These codes are called offset aliases.

Calculate daily sums using python pandas

I'm trying to calculate daily sums of values using pandas. Here's the test file - http://pastebin.com/uSDfVkTS
This is the code I came up so far:
import numpy as np
import datetime as dt
import pandas as pd
f = np.genfromtxt('test', dtype=[('datetime', '|S16'), ('data', '<i4')], delimiter=',')
dates = [dt.datetime.strptime(i, '%Y-%m-%d %H:%M') for i in f['datetime']]
s = pd.Series(f['data'], index = dates)
d = s.resample('D', how='sum')
Using the given test file this produces:
2012-01-02 1128
Freq: D
The first problem is that the calculated sum corresponds to the next day. I was able to solve that by using the parameter loffset='-1d'.
Now the actual problem is that the data may start not at 00:30 but at any time of day. The data also has gaps filled with 'nan' values.
That said, is it possible to set a lower threshold of number of values that are necessary to calculate daily sums? (e.g. if there're less than 40 values in a single day, then put NaN instead of a sum)
I believe it is possible to define a custom function to do that and refer to it in the 'how' parameter, but I have no clue how to code the function itself.
You can do it directly in Pandas:
s = pd.read_csv('test', header=None, index_col=0, parse_dates=True)
d = s.groupby(lambda x: x.date()).aggregate(lambda x: sum(x) if len(x) >= 40 else np.nan)
X.2
2012-01-01 1128
A much easier way is to use pd.Grouper:
d = s.groupby(pd.Grouper(freq='1D')).sum()
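The count threshold from the accepted answer translates to the modern resample API roughly like this (a sketch on toy data; 40 is the cutoff from the question):

```python
import pandas as pd

# toy series: a full first day (47 half-hour readings) and a nearly
# empty second day (1 reading)
idx = pd.date_range('2012-01-01 00:30', periods=48, freq='30min')
s = pd.Series(1.0, index=idx)

daily = s.resample('D')
result = daily.sum().where(daily.count() >= 40)   # NaN when too few values
```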
