Apply Different Resampling Methods to the Same Column (pandas)

I have a time series and I want to apply different aggregation functions to the same column.
The main column is weight, and I want to create a DataFrame that shows both the mean and the max of the weight in each resampled period. I know I can do:
df.resample('M', how = {'weight':np.max}, kind='YearEnd')
df1.resample('M', how = {'weight': np.mean}, kind='YearEnd')
This seems inefficient. Optimally, it would be something like:
df.resample('M', how = {'weight': np.mean, 'weight':np.max}, kind='YearEnd')

Try this.
In [23]: df = DataFrame(np.random.randn(100,1),columns=['weight'],index=date_range('20000101',periods=100,freq='MS'))
In [24]: df.resample('A',how=['max','mean'])
Out[24]:
              weight
                 max      mean
2000-12-31  1.958570 -0.312230
2001-12-31  1.739518  0.035701
2002-12-31  2.503437  0.169365
2003-12-31  1.115315  0.149279
2004-12-31  2.190617 -0.087536
2005-12-31  1.286224  0.037669
2006-12-31  1.674017  0.147676
2007-12-31  2.107169 -0.064962
2008-12-31 -0.163863 -0.572363

[9 rows x 2 columns]
Supporting how as a dict like this I don't think is too hard; I will open an issue about this enhancement: https://github.com/pydata/pandas/issues/6515
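(For reference, a sketch against current pandas, where the how= argument has since been removed: the dict spelling the question wanted is now supported through agg. 'A' means year-end bins; the newest versions spell it 'YE'.)

# One pass, two aggregations; yields the same two-level
# ('weight', 'max'/'mean') columns as how=['max', 'mean'].
df.resample('A').agg({'weight': ['max', 'mean']})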

Related

Pandas Resample-Sum without Zero filling

When resampling a Series with mean aggregation (daily to monthly), missing datetimes are filled with NaNs, which is fine since we can simply remove them with the .dropna() function.
However, with sum/total aggregation, missing datetimes are filled with 0s (zeros), which is technically correct but a bit bothersome, as masks are needed to remove them.
The question is whether there is a more efficient way to resample with an aggregate sum without zero-filling or using masks, preferably something similar to dropna() but for dropping 0s.
For example:
ser = pd.Series([1]*6)
ser.index = pd.to_datetime(['2000-01-01', '2000-01-02', '2000-03-01', '2000-03-02', '2000-05-01', '2000-05-02'])
# wanted output
# 2000-01-31 2.0
# 2000-03-31 2.0
# 2000-05-31 2.0
# this is the ideal pattern, but we need it for the aggregate sum.
ser.resample('M').mean().dropna()
# 2000-01-31 1.0
# 2000-03-31 1.0
# 2000-05-31 1.0
# not ideal
ser.resample('M').sum()
# 2000-01-31 2
# 2000-02-29 0
# 2000-03-31 2
# 2000-04-30 0
# 2000-05-31 2
Using .groupby() with pd.Grouper() seems to give exactly the same behavior as resampling.
# not ideal
ser.groupby(pd.Grouper(freq='M')).sum()
# 2000-01-31 2
# 2000-02-29 0
# 2000-03-31 2
# 2000-04-30 0
# 2000-05-31 2
Using .groupby() with index.year is also doable; however, there does not seem to be an equivalent 'identity' for calendar months across years. Note that .index.month is not what we are after, since it would merge the same month from different years.
ser = pd.Series([1]*6)
ser.index = pd.to_datetime(['2000-01-01', '2000-01-02', '2002-03-01', '2002-03-02', '2005-05-01', '2005-05-02'])
ser.groupby(ser.index.year).sum()
# 2000 2
# 2002 2
# 2005 2
Use pd.offsets.MonthEnd and add it to the DatetimeIndex of ser to create a month-end grouper, then use Series.groupby with this grouper and aggregate with sum or mean:
grp = ser.groupby(ser.index + pd.offsets.MonthEnd())
s1, s2 = grp.sum(), grp.mean()
Result:
print(s1)
2000-01-31 2
2002-03-31 2
2005-05-31 2
dtype: int64
print(s2)
2000-01-31 1
2002-03-31 1
2005-05-31 1
dtype: int64
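One caveat on the grouper above: adding pd.offsets.MonthEnd() to a date that already is a month end rolls it forward to the next month, so it only behaves for data strictly inside the month. A hedged alternative sketch that stays with resample, assuming a pandas version where Resampler.sum accepts min_count:

import pandas as pd

ser = pd.Series([1] * 6)
ser.index = pd.to_datetime(['2000-01-01', '2000-01-02', '2000-03-01',
                            '2000-03-02', '2000-05-01', '2000-05-02'])

# min_count=1 makes the sum over an empty bin NaN instead of 0, so
# empty months drop out exactly like in the mean() case.
out = ser.resample('M').sum(min_count=1).dropna()
# 2000-01-31    2.0
# 2000-03-31    2.0
# 2000-05-31    2.0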

How to apply a moving average in Python considering periodic boundary conditions in the data

I would like to compute a moving average taking periodic boundary conditions into account. I will try to make myself clear.
I have this data:
Date,Q
1989-01-01 00:00,0
1989-01-02 00:00,1
1989-01-03 00:00,4
1989-01-04 00:00,6
1989-01-05 00:00,8
1989-01-06 00:00,10
1989-01-07 00:00,11
I would like to compute the moving average over 3 data points: the current one, the previous, and the next.
In particular, I would like an option to the rolling function where the first data point (index 0 in Python terms) can take the last one into account, and vice versa the last one the first. This would give me a sort of periodic boundary condition.
Indeed, I have tried the following. First, I read the dataframe:
df = pd.read_csv(fname, index_col=0, parse_dates=True)
then I apply rolling as:
df['Q'] = df['Q'].rolling(3, center=True).mean()
However, I get the following results:
Date
1989-01-01 NaN
1989-01-02 1.66
1989-01-03 3.66
1989-01-04 6
1989-01-05 8
1989-01-06 9.66
1989-01-07 NaN
I know that I could apply the min_periods=1 option, but this is not what I want. Indeed, it is clear that in the second row the result is correct:
1.66 = (0+1+4)/3
However, I would like to have this result in the first row:
(0+1+11)/3
As you may have noticed, the number 11 is the value of the last row. Similarly, in the last row I expect:
(10+11+0)/3
where 0 is the value of the first row.
Do you have any suggestions or ideas?
Thanks,
Diego
I would just duplicate the last value before the first one and the first value after the last one, sort the dataframe, and take the rolling average. Then it is enough to drop the added values:
# Prepend the last value one day before the start; append the first
# value one day after the end (index[-2]: the row just added is at -1).
df.loc[df.index[0] - pd.offsets.Day(1), 'Q'] = df.iloc[-1]['Q']
df.loc[df.index[-2] + pd.offsets.Day(1), 'Q'] = df.iloc[0]['Q']
df = df.sort_index()
df['Q'] = df['Q'].rolling(3, center=True).mean()
df = df.iloc[1:-1]  # drop the helper rows again
It gives as expected:
Q
Date
1989-01-01 4.000000
1989-01-02 1.666667
1989-01-03 3.666667
1989-01-04 6.000000
1989-01-05 8.000000
1989-01-06 9.666667
1989-01-07 7.000000
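As an alternative sketch not in the original answer, numpy can do the wrap-around padding directly (assuming df still holds the original, unpadded series and a window of 3):

import numpy as np

# Periodic padding: copy the last value in front and the first value
# behind, then take a centred 3-point average; mode='valid' trims the
# padding back off so the result aligns with the original index.
q = df['Q'].to_numpy()
padded = np.concatenate([q[-1:], q, q[:1]])
df['Q_avg'] = np.convolve(padded, np.ones(3) / 3, mode='valid')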

How to resample a pandas dataframe backwards

Hi, I am trying to resample a pandas DataFrame backwards.
This is my dataframe:
from random import randint

seconds = np.arange(20, 700, 60)
timedeltas = pd.to_timedelta(seconds, unit='s')
vals = np.array([randint(-10, 10) for a in range(len(seconds))])
df = pd.DataFrame({'values': vals}, index=timedeltas)
then I have
In [252]: df
Out[252]:
values
00:00:20 8
00:01:20 4
00:02:20 5
00:03:20 9
00:04:20 7
00:05:20 5
00:06:20 5
00:07:20 -6
00:08:20 -3
00:09:20 -5
00:10:20 -5
00:11:20 -10
and
In [253]: df.resample('5min').mean()
Out[253]:
values
00:00:20 6.6
00:05:20 -0.8
00:10:20 -7.5
and what I would like is something like
Out[***]:
values
00:01:20 6
00:06:20 valb
00:11:20 -5.8
where the value at each new time is the one obtained if I roll through the dataframe backwards and compute the mean in each bin going from the back towards the front. For example, in this case the last value should be
valc = (-6 - 3 - 5 - 5 - 10) / 5 = -5.8
which is the average of the last 5 values, and the first one should be the average of only the first 2 values, because that "bin" is incomplete.
Reading the pandas documentation I thought I had to use the parameter how='last', but in my version of pandas (0.20.3) this does not work. Additionally I tried the closed and convention options, but I wasn't able to achieve this.
Thanks for the help
The easiest way is to sort the index in reverse order, then resample to get the desired results:
df.sort_index(ascending=False).resample('5min').mean()
Resample reference: when the resample starts, the first bin has the maximum available length, in this case 5. The closed, label, and convention parameters are helpful, but they do not compute the mean going from the back towards the front. To do that, sort first.
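A hedged aside beyond the accepted answer: newer pandas (1.3+) added origin='end' to resample, which anchors the bins at the last entry so that only the earliest bin can be incomplete. I have mainly seen it used with a DatetimeIndex, so treat its behaviour on this TimedeltaIndex as an assumption:

# Assumes pandas >= 1.3 (origin='end'); bins are counted backwards
# from the final timestamp, matching the desired output above.
df.resample('5min', origin='end', closed='right', label='right').mean()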

Apply Numpy function over entire Dataframe

I am applying this function over a dataframe df1 such as the following:
                           AA       AB       AC       AD
2005-01-02 23:55:00  "EQUITY" "EQUITY" "EQUITY" "EQUITY"
2005-01-03 00:00:00     32.32  19.5299    32.32  31.0455
2005-01-04 00:00:00   31.9075  19.4487  31.9075  30.3755
2005-01-05 00:00:00   31.6151  19.5799  31.6151   29.971
2005-01-06 00:00:00   31.1426  19.7174  31.1426  29.9647
def func(x):
    for index, price in x.iteritems():
        x[index] = price / np.sum(x, axis=1)
    return x[index]

df3 = func(df1.ix[1:])
However, I only get a single column returned as opposed to 3:
2005-01-03 0.955843
2005-01-04 0.955233
2005-01-05 0.955098
2005-01-06 0.955773
2005-01-07 0.955877
2005-01-10 0.95606
2005-01-11 0.95578
2005-01-12 0.955621
I am guessing I am missing something that would make it apply to the entire DataFrame. Also, how could I keep the first row, the one that contains strings?
You need to do it the following way:
def func(row):
    return row / np.sum(row)

df2 = pd.concat([df[:1], df[1:].apply(func, axis=1)], axis=0)
It has two steps:
df[:1] extracts the first row, which contains the strings, while df[1:] is the rest of the DataFrame. Concatenating them afterwards answers the second part of your question.
For operating over rows you should use the apply() method with axis=1.
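A hedged aside beyond the answer: since each row is divided by its own row sum, the loop-free idiom is div with axis=0; this assumes the numeric rows cast cleanly to float:

# Row-normalize without apply(): df[1:] holds the numbers, df[:1] the
# "EQUITY" string row, as in the question.
num = df[1:].astype(float)
df2 = pd.concat([df[:1], num.div(num.sum(axis=1), axis=0)])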

Using Python's Pandas to find average values by bins

I just started using pandas to analyze groundwater well data over time.
My data in a text file looks like (site_no, date, well_level):
485438103132901 19800417 -7.1
485438103132901 19800506 -6.8
483622101085001 19790910 -6.7
485438103132901 19790731 -6.2
483845101112801 19801111 -5.37
484123101124601 19801111 -5.3
485438103132901 19770706 -4.98
I would like an output with average well levels binned by 5 year increments and with a count:
site_no avg 1960-end1964 count avg 1965-end1969 count avg 1970-end1974 count
I am reading in the data with:
names = ['site_no', 'date', 'wtr_lvl']
df = pd.read_csv(r'D:\info.txt', sep='\t', names=names)  # raw string avoids backslash escapes
I can find the overall average by site with:
avg = df.groupby(['site_no'])['wtr_lvl'].mean().reset_index()
My crude bin attempts use:
a1 = df[df.date > 19600000]
a2 = a1[a1.date < 19650000]
avga2 = a2.groupby(['site_no'])['wtr_lvl'].mean()
My question: how can I join the results to display them as desired? I tried merge, join, and append, but they do not allow for empty DataFrames (which happens here). Also, I am sure there is a simpler way to bin the data by date. Thanks.
The most concise way is probably to convert this to time series data and then downsample to get the means:
In [75]:
print df
                         ID  Level
1
1980-04-17  485438103132901  -7.10
1980-05-06  485438103132901  -6.80
1979-09-10  483622101085001  -6.70
1979-07-31  485438103132901  -6.20
1980-11-11  483845101112801  -5.37
1980-11-11  484123101124601  -5.30
1977-07-06  485438103132901  -4.98
In [76]:
df.Level.resample('60M', how='mean')
#also may consider different time alias: '5A', '5BA', '5AS', etc:
#see: http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases
Out[76]:
1
1977-07-31 -4.980
1982-07-31 -6.245
Freq: 60M, Name: Level, dtype: float64
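(In modern pandas the how= argument was removed; a minimal equivalent sketch:)

# Same 60-month downsample, spelled with the current resampler API.
df.Level.resample('60M').mean()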
Alternatively, you may use groupby together with cut:
In [99]:
print df.groupby(pd.cut(df.index.year, pd.date_range('1960', periods=5, freq='5A').year, include_lowest=True)).mean()
ID Level
[1960, 1965] NaN NaN
(1965, 1970] NaN NaN
(1970, 1975] NaN NaN
(1975, 1980] 4.847632e+14 -6.064286
And by ID also:
In [100]:
print df.groupby(['ID',
pd.cut(df.index.year, pd.date_range('1960', periods=5, freq='5A').year, include_lowest=True)]).mean()
Level
ID
483622101085001 (1975, 1980] -6.70
483845101112801 (1975, 1980] -5.37
484123101124601 (1975, 1980] -5.30
485438103132901 (1975, 1980] -6.27
So what I like to do is create a separate column with the rounded bin number:
bin_width = 50000   # 5 years when dates are kept as YYYYMMDD integers
mult = 1. / bin_width
df['bin'] = np.floor(df['date'] * mult + .5) / mult
Then, just group by the bins themselves:
df.groupby('bin').mean()
Another note: you can do multiple truth evaluations in one go:
df[(df.date > a) & (df.date < b)]
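A hedged sketch of the same binning with parsed dates instead of integer arithmetic, starting again from the raw integer dates read in by read_csv; the column names follow the question, while the bin range and '5AS' (5-year, year-start) edges are assumptions:

import pandas as pd

# Parse YYYYMMDD integers, cut into 5-year intervals, then aggregate
# mean and count per site and bin in one go.
df['date'] = pd.to_datetime(df['date'], format='%Y%m%d')
edges = pd.date_range('1960-01-01', '2005-01-01', freq='5AS')
df['bin'] = pd.cut(df['date'], edges, right=False)
out = df.groupby(['site_no', 'bin'])['wtr_lvl'].agg(['mean', 'count'])

With the default observed=False, empty 5-year bins simply show up with a count of 0 rather than vanishing, which sidesteps the empty-DataFrame problem when joining results.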
