I have a dataframe like
import pandas as pd
import numpy as np
range = pd.date_range('2015-01-01', '2015-01-5', freq='15min')
df = pd.DataFrame(index = range)
df['speed'] = np.random.randint(low=0, high=60, size=len(df.index))
df['otherF'] = np.random.randint(low=2, high=42, size=len(df.index))
I can easily resample and apply a built-in function such as sum():
df['speed'].resample('1D').sum()
Out[121]:
2015-01-01 2865
2015-01-02 2923
2015-01-03 2947
2015-01-04 2751
I can also apply a custom function returning multiple values:
def mu_cis(x):
    x_ = x[~np.isnan(x)]
    CI = np.std(x_) / np.sqrt(x.shape)
    return np.mean(x_), np.mean(x_) - CI, np.mean(x_) + CI, len(x_)
df['speed'].resample('1D').agg(mu_cis)
Out[122]:
2015-01-01 (29.84375, [28.1098628611], [31.5776371389], 96)
2015-01-02 (30.4479166667, [28.7806726396], [32.115160693...
2015-01-03 (30.6979166667, [29.0182072972], [32.377626036...
2015-01-04 (28.65625, [26.965228204], [30.347271796], 96)
As I have read here, I can even return multiple values with a name: pandas apply function that returns multiple values to rows in pandas dataframe
def myfunc1(x):
    x_ = x[~np.isnan(x)]
    CI = np.std(x_) / np.sqrt(x.shape)
    e = np.mean(x_)
    f = np.mean(x_) + CI
    g = np.mean(x_) - CI
    return pd.Series([e, f, g], index=['MU', 'MU+', 'MU-'])
df['speed'].resample('1D').agg(myfunc1)
which gives
Out[124]:
2015-01-01 MU 29.8438
MU+ [31.5776371389]
MU- [28.1098628611]
2015-01-02 MU 30.4479
MU+ [32.1151606937]
MU- [28.7806726396]
2015-01-03 MU 30.6979
MU+ [32.3776260361]
MU- [29.0182072972]
2015-01-04 MU 28.6562
MU+ [30.347271796]
MU- [26.965228204]
However, when I try to apply this to all the original columns, I only get NaNs:
df.resample('1D').agg(myfunc1)
Out[127]:
speed otherF
2015-01-01 NaN NaN
2015-01-02 NaN NaN
2015-01-03 NaN NaN
2015-01-04 NaN NaN
2015-01-05 NaN NaN
The results do not change whether I use agg or apply after resample().
What is the right way to do this?
The problem is in myfunc1. It returns a pd.Series, but when applied to the whole DataFrame it is handed a pd.DataFrame. The following seems to work just fine.
def myfunc1(x):
    x_ = x[~np.isnan(x)]
    CI = np.std(x_) / np.sqrt(x.shape)
    e = np.mean(x_)
    f = np.mean(x_) + CI
    g = np.mean(x_) - CI
    try:
        return pd.DataFrame([e, f, g], index=['MU', 'MU+', 'MU-'], columns=x.columns)
    except AttributeError:  # will still raise errors of other nature
        return pd.Series([e, f, g], index=['MU', 'MU+', 'MU-'])
Alternatively:
def myfunc1(x):
    x_ = x[~np.isnan(x)]
    CI = np.std(x_) / np.sqrt(x.shape)
    e = np.mean(x_)
    f = np.mean(x_) + CI
    g = np.mean(x_) - CI
    if x.ndim > 1:  # equivalent to len(x.shape) > 1
        return pd.DataFrame([e, f, g], index=['MU', 'MU+', 'MU-'], columns=x.columns)
    return pd.Series([e, f, g], index=['MU', 'MU+', 'MU-'])
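Another option (a small sketch, not the only way, and the exact behaviour can vary by pandas version) is to keep the original Series-returning myfunc1 from the question and run it per column, concatenating the results:
per_col = pd.concat(
    {col: df[col].resample('1D').agg(myfunc1) for col in df.columns},  # myfunc1 = Series-returning version
    axis=1,
)
This gives one column per original column (speed, otherF) with a (date, statistic) MultiIndex on the rows.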
Related
So, I have a DataFrame in which each row represents an event, formed by essentially 4 columns:
happen start_date end_date number
0 2015-01-01 2015-01-01 2015-01-03 100.0
1 2015-01-01 2015-01-01 2015-01-01 20.0
2 2015-01-01 2015-01-02 2015-01-02 50.0
3 2015-01-02 2015-01-02 2015-01-02 40.0
4 2015-01-02 2015-01-02 2015-01-03 50.0
where happen is the date the event took place, start_date and end_date are the validity of that event, and number is just a summable variable.
What I'd like to get is a DataFrame that has, for each row, a combination of happen date and validity date, together with the sum of the number column.
What I tried so far is a double for loop on all dates, knowing that start_date >= happen:
startdate = pd.to_datetime('01/06/2014', format='%d/%m/%Y')  # the minimum possible happen
enddate = pd.to_datetime('31/12/2021', format='%d/%m/%Y')    # the maximum possible happen (and validity)
df_day = pd.DataFrame()
for dt1 in pd.date_range(start=startdate, end=enddate):
    for dt2 in pd.date_range(start=dt1, end=enddate):
        num_sum = df[(df['happen'] == dt1) & (df['start_date'] <= dt2) &
                     (df['end_date'] >= dt2)]['number'].sum()
        row = {'happen': dt1, 'valid': dt2, 'number': num_sum}
        df_day = df_day.append(row, ignore_index=True)
and that never came to an end. So I tried another way: I generated the df with all date combinations first (like 3.8e6 rows), and then tried to fill it with a lambda func (it's crazy, I know, but I don't know how to work around it):
from itertools import product

dt1 = pd.date_range(start=startdate, end=enddate).tolist()
df_day = pd.DataFrame()
for i in dt1:
    dt_acc1 = [i]
    dt2 = pd.date_range(start=i, end=enddate).tolist()
    df_comb = pd.DataFrame(list(product(dt_acc1, dt2)), columns=['happen', 'valid'])
    df_day = df_day.append(df_comb, ignore_index=True)
df_day['number'] = 0

def append_num(happen, valid):
    return df[(df['happen'] == happen) & (df['start_date'] <= valid) &
              (df['end_date'] >= valid)]['number'].sum()

df_day['number'] = df_day.apply(lambda x: append_num(x['happen'], x['valid']), axis=1)
and this loop also takes forever.
My expected output is something like this:
happen valid number
0 2015-01-01 2015-01-01 120.0
1 2015-01-01 2015-01-02 150.0
2 2015-01-01 2015-01-03 100.0
3 2015-01-02 2015-01-02 90.0
4 2015-01-02 2015-01-03 50.0
5 2015-01-03 2015-01-03 0.0
As you can see, the first row represents the sum of all rows with happen on 2015-01-01 whose start_date/end_date range contains the valid date 2015-01-01; the number column holds that sum (120. = 100. + 20.). On the second row, with valid moving one day forward, I "lose" the element with index 1 and "gain" the element with index 2 (150. = 100. + 50.).
Every help or suggestion is appreciated!
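For reference, a minimal vectorized sketch of one possible approach (assuming df holds the example rows above with happen, start_date and end_date already as datetime64): build all (happen, valid) day pairs once, attach each event to its happen day, zero out pairs outside the validity window, and sum.
import pandas as pd

# all candidate (happen, valid) day pairs with valid >= happen
days = pd.date_range(df['happen'].min(), df['end_date'].max())
pairs = (pd.MultiIndex.from_product([days, days], names=['happen', 'valid'])
           .to_frame(index=False))
pairs = pairs[pairs['valid'] >= pairs['happen']]

# attach every event to its happen day, keep only pairs inside the validity window
merged = pairs.merge(df, on='happen', how='left')
in_window = (merged['start_date'] <= merged['valid']) & (merged['valid'] <= merged['end_date'])
merged['number'] = merged['number'].where(in_window, 0)  # days with no matching events contribute 0

result = merged.groupby(['happen', 'valid'], as_index=False)['number'].sum()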
Currently I have two data frames representing Excel spreadsheets. I wish to join the data where the dates are equal. This is a one-to-many join, as one spreadsheet has a single date per row and I need to add data from the other, which has multiple rows covering that date.
an example:
   A                      B
   date        data       date                       data
0  2015-01-01  ...     0  2015-01-01 to 2015-01-02   ...
1  2015-01-02  ...     1  2015-01-01 to 2015-01-02   ...
In this case both rows from A would receive rows 0 and 1 from B because they fall in that range.
I tried using
df3 = pandas.merge(df2, df1, how='right', validate='1:m', left_on='Travel Date/Range', right_on='End')
to accomplish this but received this error.
Traceback (most recent call last):
File "<pyshell#61>", line 1, in <module>
df3 = pandas.merge(df2, df1, how='right', validate='1:m', left_on='Travel Date/Range', right_on='End')
File "C:\Users\M199449\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\core\reshape\merge.py", line 61, in merge
validate=validate)
File "C:\Users\M199449\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\core\reshape\merge.py", line 555, in __init__
self._maybe_coerce_merge_keys()
File "C:\Users\M199449\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\core\reshape\merge.py", line 990, in _maybe_coerce_merge_keys
raise ValueError(msg)
ValueError: You are trying to merge on object and datetime64[ns] columns. If you wish to proceed you should use pd.concat
I can add more information as needed, of course.
So here's the option with merging:
Assume you have two DataFrames:
import pandas as pd
df1 = pd.DataFrame({'date': ['2015-01-01', '2015-01-02', '2015-01-03'],
'data': ['A', 'B', 'C']})
df2 = pd.DataFrame({'date': ['2015-01-01 to 2015-01-02', '2015-01-01 to 2015-01-02', '2015-01-02 to 2015-01-03'],
'data': ['E', 'F', 'G']})
Now do some cleaning to get all of the dates you need and make sure they are datetime
df1['date'] = pd.to_datetime(df1.date)
df2[['start', 'end']] = df2['date'].str.split(' to ', expand=True)
df2['start'] = pd.to_datetime(df2.start)
df2['end'] = pd.to_datetime(df2.end)
# No need for this anymore
df2 = df2.drop(columns='date')
Now merge it all together. You'll get 99x10K rows.
df = df1.assign(dummy=1).merge(df2.assign(dummy=1), on='dummy').drop(columns='dummy')
And subset to the dates that fall in between the ranges:
df[(df.date >= df.start) & (df.date <= df.end)]
# date data_x data_y start end
#0 2015-01-01 A E 2015-01-01 2015-01-02
#1 2015-01-01 A F 2015-01-01 2015-01-02
#3 2015-01-02 B E 2015-01-01 2015-01-02
#4 2015-01-02 B F 2015-01-01 2015-01-02
#5 2015-01-02 B G 2015-01-02 2015-01-03
#8 2015-01-03 C G 2015-01-02 2015-01-03
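As an aside, on pandas 1.2+ the dummy column isn't needed; a cross merge does the same thing (a small equivalent sketch using the same df1/df2):
df = df1.merge(df2, how='cross')            # full cross product of rows
df = df[(df.date >= df.start) & (df.date <= df.end)]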
If, for instance, some dates in df2 are a single date rather than a range, .str.split will give None for the second date. Just use .loc to fill it in appropriately.
df2 = pd.DataFrame({'date': ['2015-01-01 to 2015-01-02', '2015-01-01 to 2015-01-02', '2015-01-02 to 2015-01-03',
'2015-01-03'],
'data': ['E', 'F', 'G', 'H']})
df2[['start', 'end']] = df2['date'].str.split(' to ', expand=True)
df2.loc[df2.end.isnull(), 'end'] = df2.loc[df2.end.isnull(), 'start']
# data start end
#0 E 2015-01-01 2015-01-02
#1 F 2015-01-01 2015-01-02
#2 G 2015-01-02 2015-01-03
#3 H 2015-01-03 2015-01-03
Now the rest follows unchanged.
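For completeness, a minimal sketch of those remaining steps under the same assumptions (convert the filled start/end to datetime, drop the raw date column, then merge and filter as before):
df2['start'] = pd.to_datetime(df2.start)
df2['end'] = pd.to_datetime(df2.end)
df2 = df2.drop(columns='date')
df = df1.assign(dummy=1).merge(df2.assign(dummy=1), on='dummy').drop(columns='dummy')
df = df[(df.date >= df.start) & (df.date <= df.end)]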
Let's use this numpy method by @piRSquared:
df1 = pd.DataFrame({'date': ['2015-01-01', '2015-01-02', '2015-01-03'],
'data': ['A', 'B', 'C']})
df2 = pd.DataFrame({'date': ['2015-01-01 to 2015-01-02', '2015-01-01 to 2015-01-02', '2015-01-02 to 2015-01-03'],
'data': ['E', 'F', 'G']})
df2[['start', 'end']] = df2['date'].str.split(' to ', expand=True)
df2['start'] = pd.to_datetime(df2.start)
df2['end'] = pd.to_datetime(df2.end)
df1['date'] = pd.to_datetime(df1['date'])
a = df1['date'].values
bh = df2['end'].values
bl = df2['start'].values
i, j = np.where((a[:, None] >= bl) & (a[:, None] <= bh))
pd.DataFrame(np.column_stack([df1.values[i], df2.values[j]]),
columns=df1.columns.append(df2.columns))
Output:
date data date data start end
0 2015-01-01 00:00:00 A 2015-01-01 to 2015-01-02 E 2015-01-01 00:00:00 2015-01-02 00:00:00
1 2015-01-01 00:00:00 A 2015-01-01 to 2015-01-02 F 2015-01-01 00:00:00 2015-01-02 00:00:00
2 2015-01-02 00:00:00 B 2015-01-01 to 2015-01-02 E 2015-01-01 00:00:00 2015-01-02 00:00:00
3 2015-01-02 00:00:00 B 2015-01-01 to 2015-01-02 F 2015-01-01 00:00:00 2015-01-02 00:00:00
4 2015-01-02 00:00:00 B 2015-01-02 to 2015-01-03 G 2015-01-02 00:00:00 2015-01-03 00:00:00
5 2015-01-03 00:00:00 C 2015-01-02 to 2015-01-03 G 2015-01-02 00:00:00 2015-01-03 00:00:00
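Note that np.column_stack produces an object array and df1/df2 share column names, so the result has duplicated object-dtype columns. If that matters, a small cleanup sketch (the column names here are just illustrative) could be:
out = pd.DataFrame(np.column_stack([df1.values[i], df2.values[j]]),
                   columns=['date', 'data_a', 'date_range', 'data_b', 'start', 'end'])
for col in ['date', 'start', 'end']:
    out[col] = pd.to_datetime(out[col])  # restore datetime64 dtype from object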
I am learning to use the pandas resample() function; however, the following code does not return anything as expected. I resampled the time series by day.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
range = pd.date_range('2015-01-01','2015-12-31',freq='15min')
df = pd.DataFrame(index = range)
df['speed'] = np.random.randint(low=0, high=60, size=len(df.index))
df['distance'] = df['speed'] * 0.25
df['cumulative_distance'] = df.distance.cumsum()
print(df.head())
weekly_summary = pd.DataFrame()
weekly_summary['speed'] = df.speed.resample('D').mean()
weekly_summary['distance'] = df.distance.resample('D').sum()
print(weekly_summary.head())
Output
speed distance cumulative_distance
2015-01-01 00:00:00 40 10.00 10.00
2015-01-01 00:15:00 6 1.50 11.50
2015-01-01 00:30:00 31 7.75 19.25
2015-01-01 00:45:00 41 10.25 29.50
2015-01-01 01:00:00 59 14.75 44.25
[5 rows x 3 columns]
Empty DataFrame
Columns: [speed, distance]
Index: []
[0 rows x 2 columns]
Depending on your pandas version, how you will do this will vary.
In pandas 0.19.0, your code works as expected:
In [7]: pd.__version__
Out[7]: '0.19.0'
In [8]: df.speed.resample('D').mean().head()
Out[8]:
2015-01-01 28.562500
2015-01-02 30.302083
2015-01-03 30.864583
2015-01-04 29.197917
2015-01-05 30.708333
Freq: D, Name: speed, dtype: float64
In older versions your solution might not work, but at least in 0.14.1 you can tweak it to do so:
>>> pd.__version__
'0.14.1'
>>> df.speed.resample('D').mean()
29.41087328767123
>>> df.speed.resample('D', how='mean').head()
2015-01-01 29.354167
2015-01-02 26.791667
2015-01-03 31.854167
2015-01-04 26.593750
2015-01-05 30.312500
Freq: D, Name: speed, dtype: float64
This looks like an issue with an old version of pandas; in newer versions, assigning a new column will enlarge the df even when the index is not the same shape. What should work is to not make an empty df, but instead to pass the initial call to resample as the data arg for the DataFrame constructor:
In [8]:
range = pd.date_range('2015-01-01','2015-12-31',freq='15min')
df = pd.DataFrame(index = range)
df['speed'] = np.random.randint(low=0, high=60, size=len(df.index))
df['distance'] = df['speed'] * 0.25
df['cumulative_distance'] = df.distance.cumsum()
print (df.head())
weekly_summary = pd.DataFrame(df.speed.resample('D').mean())
weekly_summary['distance'] = df.distance.resample('D').sum()
print( weekly_summary.head())
speed distance cumulative_distance
2015-01-01 00:00:00 28 7.0 7.0
2015-01-01 00:15:00 8 2.0 9.0
2015-01-01 00:30:00 10 2.5 11.5
2015-01-01 00:45:00 56 14.0 25.5
2015-01-01 01:00:00 6 1.5 27.0
speed distance
2015-01-01 27.895833 669.50
2015-01-02 29.041667 697.00
2015-01-03 27.104167 650.50
2015-01-04 28.427083 682.25
2015-01-05 27.854167 668.50
Here I pass the call to resample as the data arg for the DataFrame constructor; this will take the index and column name and create a single-column df:
weekly_summary = pd.DataFrame(df.speed.resample('D').mean())
Then subsequent assignments should work as expected.
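Alternatively, on a reasonably recent pandas the whole summary can be built in one call by passing a column-to-function mapping to agg (a small sketch using the same df):
weekly_summary = df.resample('D').agg({'speed': 'mean', 'distance': 'sum'})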
I have a multi-year time series and want to find quantiles by season.
Numerically, this works fine. However, I'm getting a MultiIndex Series as output when I expect a singly-indexed DataFrame.
import pandas as pd
import numpy as np
rng = pd.date_range(start='2014-01-01', end='2016-01-01', freq='30T')
a_data = np.random.normal(loc=np.pi, scale=np.e, size=len(rng))
b_data = a_data - 5
df = pd.DataFrame(index=rng, data={'a': a_data, 'b': b_data})
grouped = df.groupby(pd.TimeGrouper(freq='QS-DEC'))
mult_idx_series = grouped.quantile(0.5)
mult_idx_series
shows a MultiIndex'd Series:
2013-12-01 a 3.079999
b -1.920001
2014-03-01 a 3.126490
b -1.873510
I'd expected (and wanted) the same output format as .median()
median_df = grouped.median()
median_df
which looks like:
a b
2013-12-01 3.079999 -1.920001
2014-03-01 3.126490 -1.873510
I should point out that:
it isn't the 0.5 quantile that I want in reality;
I know I'm only a mult_idx_series.unstack(1) away from the format I want;
I was surprised by the different return shapes and want to understand the reasoning.
The difference lies in the fact that grouped.median() calls an optimized (cythonized) median aggregation function, while grouped.quantile() calls a generic wrapper to apply the function on the groups.
Consider this:
In [56]: grouped.apply(lambda x: x.quantile(0.5))
Out[56]:
2013-12-01 a 3.175594
b -1.824406
2014-03-01 a 3.116556
b -1.883444
2014-06-01 a 3.222320
b -1.777680
2014-09-01 a 3.207015
b -1.792985
2014-12-01 a 3.114767
b -1.885233
2015-03-01 a 3.091952
b -1.908048
2015-06-01 a 3.220528
b -1.779472
2015-09-01 a 3.204990
b -1.795010
2015-12-01 a 3.108755
b -1.891245
dtype: float64
In [57]: grouped.agg(lambda x: x.quantile(0.5))
Out[57]:
a b
2013-12-01 3.175594 -1.824406
2014-03-01 3.116556 -1.883444
2014-06-01 3.222320 -1.777680
2014-09-01 3.207015 -1.792985
2014-12-01 3.114767 -1.885233
2015-03-01 3.091952 -1.908048
2015-06-01 3.220528 -1.779472
2015-09-01 3.204990 -1.795010
2015-12-01 3.108755 -1.891245
So grouped.quantile() does a general apply and not an aggregation. The reason is that quantile can also return a DataFrame (and thus is not always a pure aggregation) if you calculate multiple quantiles at once, e.g. with grouped.quantile([0.1, 0.5, 0.9]):
In [67]: grouped.quantile([0.1, 0.5, 0.9])
Out[67]:
a b
2013-12-01 0.1 -0.310566 -5.310566
0.5 3.131418 -1.868582
0.9 6.624399 1.624399
2014-03-01 0.1 -0.219992 -5.219992
0.5 3.173881 -1.826119
0.9 6.550259 1.550259
...
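So, to get the singly-indexed DataFrame for a single quantile, you can either use the agg form above or unstack the inner level of the result (a small sketch):
quantile_df = grouped.quantile(0.5).unstack(level=1)  # one row per quarter, columns a and b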
I am trying to get the 10-day aggregate of my data, which has NaN values. The sum over 10 days should return NaN if there is a NaN value anywhere in the 10-day window.
When I apply the code below, pandas treats NaN as zero and returns the sum of the remaining days.
dateRange = pd.date_range(start_date, periods=len(data), freq='D')
# Creating a data frame so that the timeseries can handle numpy array.
df = pd.DataFrame(data)
base_Series = pd.DataFrame(list(df.values), index=dateRange)
# Converting to aggregated series
agg_series = base_Series.resample('10D', how='sum')
agg_data = agg_series.values
Sample Data:
2011-06-01 46.520536
2011-06-02 8.988311
2011-06-03 0.133823
2011-06-04 0.274521
2011-06-05 1.283360
2011-06-06 2.556313
2011-06-07 0.027461
2011-06-08 0.001584
2011-06-09 0.079193
2011-06-10 2.389549
2011-06-11 NaN
2011-06-12 0.195844
2011-06-13 0.058720
2011-06-14 6.570925
2011-06-15 0.015107
2011-06-16 0.031066
2011-06-17 0.073008
2011-06-18 0.072198
2011-06-19 0.044534
2011-06-20 0.240080
Output:
2011-06-01 62.254651
2011-06-11 7.301481
This uses the numpy sum, which will return nan if a nan is present in the sum:
In [35]: s = Series(randn(100),index=date_range('20130101',periods=100))
In [36]: s.iloc[11] = np.nan
In [37]: s.resample('10D',how=lambda x: x.values.sum())
Out[37]:
2013-01-01 6.910729
2013-01-11 NaN
2013-01-21 -1.592541
2013-01-31 -2.013012
2013-02-10 1.129273
2013-02-20 -2.054807
2013-03-02 4.669622
2013-03-12 3.489225
2013-03-22 0.390786
2013-04-01 -0.005655
dtype: float64
To filter out those days which have any NaNs, I propose that you do
noNaN_days_only = s.groupby(lambda x: x.date).filter(lambda x: ~x.isnull().any())
where s is a DataFrame
Just apply an agg function:
agg_series = base_Series.resample('10D').agg(lambda x: np.nan if np.isnan(x).all() else np.sum(x))
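If the requirement is NaN whenever any value in the 10-day window is missing (as in the question), a small sketch of an alternative on a reasonably recent pandas is to disable NaN-skipping explicitly:
agg_series = base_Series.resample('10D').apply(lambda x: x.sum(skipna=False))  # NaN if any value in the window is NaN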