Dask dataframe groupby and applying custom functions - python

I have this function that I am trying to apply to a dask dataframe that calculates cooling assuming certain storage capacity and rate limits. It takes a 15-minute timestep value of cooling a building uses and returns the amount a certain storage rate can accommodate.
def cooling_kwh_by_case(row, storage_capacity, storage_rate):
    if ((row['daily_cooling_kwh'] <= storage_capacity/row['cop']) & (row['max_cooling_kw'] <= storage_rate/row['cop'])):
        return row['daily_cooling_kwh']
    elif ((row['daily_cooling_kwh'] <= storage_capacity/row['cop']) & (row['max_cooling_kw'] > storage_rate/row['cop'])):
        daily_groupby = net_load_w_times.groupby('index')['electricity_cooling_kwh'].apply(lambda x: sum(min(x,storage_rate/(4*row['cop']))))
        return daily_groupby.loc[(row.building_date)]
    else:
        n_largest = 1
        daily_groupby = net_load_w_times.groupby('index')['electricity_cooling_kwh'].apply(lambda x: x.nlargest(n_largest).sum())
        while ((daily_groupby.loc[(row.building_date)]) <= (storage_capacity/row['cop'])) & (n_largest < net_load_w_times.groupby('index')['electricity_cooling_kwh'].count()):
            n_largest += 1
            daily_groupby = net_load_w_times.groupby('index')['electricity_cooling_kwh'].apply(lambda x: x.nlargest(n_largest).sum())
        return min(storage_capacity/row['cop'],net_load_w_times.groupby('index')['electricity_cooling_kwh'].apply(lambda x: x.nlargest(n_largest-1).sum()).loc[(row.building_date)])
When I apply it, this is my error message.
<ipython-input-22-88e243d194c6> in cooling_kwh_by_case()
16 n_largest = 1
17 daily_groupby = net_load_w_times.groupby('index')['electricity_cooling_kwh'].apply(lambda x: x.nlargest(n_largest).sum())
---> 18 while ((daily_groupby.loc[(row.building_date)]) <= (storage_capacity/row['cop'])) & (n_largest < net_load_w_times.groupby('index')['electricity_cooling_kwh'].count()):
19 n_largest += 1
20 daily_groupby = net_load_w_times.groupby('index')['electricity_cooling_kwh'].apply(lambda x: x.nlargest(n_largest).sum())
ValueError: Not all divisions are known, can't align partitions. Please use `set_index` to set the index.
I think the issue is the way I calculate the value in the else branch, which covers the cases where the cooling kWh is larger than the storage_capacity parameter. To calculate this value, I apply a function to find when the sum of the largest 15-minute cooling kWh values for the day exceeds the storage_capacity. I then return the sum of the largest values.
The dataframe that I am trying to groupby in the function to return a value is called net_load_w_times:
time electricity_cooling_kwh \
building_id
2 2016-07-05 19:00:00 0.050000
2 2016-07-05 22:00:00 3.200000
2 2016-07-05 16:00:00 5.779318
2 2016-07-05 20:00:00 1.888300
2 2016-07-05 18:00:00 7.490000
electricity_heating_kwh total_site_electricity_kwh iso_zone \
building_id
2 0.000000 19.529506 MISO-E
2 0.045235 6.310719 MISO-E
2 0.000000 22.514705 MISO-E
2 0.018624 13.474863 MISO-E
2 0.005464 18.192927 MISO-E
index date
building_id
2 2|2016-10-24 2016-10-24
2 2|2016-03-05 2016-03-05
2 2|2016-08-14 2016-08-14
2 2|2016-03-05 2016-03-05
2 2|2016-03-05 2016-03-05
Desired Output:
Given cooling_kwh_by_case(row, 8, 5) it outputs:
7.717618 because this is the max cooling kWh it can add up until 8.

Dask dataframes are lazy and do not work within control flow like if-else statements or for loops. I recommend trying to find solutions within the pandas API, like the where method.
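As a rough illustration of the where approach, here is a minimal pandas sketch. It is deliberately simplified: it only caps each day's cooling at storage_capacity / cop and does not reproduce the nlargest logic from the question. The frame daily and the column stored_kwh are hypothetical names made up for the example.

```python
import pandas as pd

# Hypothetical per-day summary; column names follow the question,
# the values are invented for illustration.
daily = pd.DataFrame({
    "daily_cooling_kwh": [3.0, 12.0, 6.0],
    "cop": [2.0, 2.0, 1.0],
})
storage_capacity = 8.0

# Per-row limit, computed for the whole column at once.
limit = storage_capacity / daily["cop"]

# Series.where keeps a value where the condition holds and takes the
# replacement elsewhere, so no Python-level if/else is needed.
daily["stored_kwh"] = daily["daily_cooling_kwh"].where(
    daily["daily_cooling_kwh"] <= limit, limit
)
print(daily)
```

Because the condition is evaluated column-wise rather than row by row, the same expression can stay lazy instead of forcing a truth value inside an if statement.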

Related

Using a IF statement in a DataFrame and getting an error: The truth value of a Series is ambiguous

I have a data frame of stock prices and returns imported from yahoo finance as below.
Date        price  return
2019-01-01  54      0.05
2019-02-01  46     -0.14
2019-03-01  48      0.04
where date is the index and the return is numeric.
I am trying to create a new column which will equal 1 if the return on the following day is positive and -1 if the following return is negative.
I have used
if df['return'].shift(-1) > 0:
    df['Indicator'] = 1
else:
    df['Indicator'] = -1
However, I get the aforementioned error. I have tried using .all(), but this makes the whole indicator column equal to 1, even when the return on the following day is negative.
The desired output would be
Date        price  return  indicator
2019-01-01  54      0.05   -1
2019-02-01  46     -0.14    1
2019-03-01  48      0.04    1
The last 1 in the indicator column is assuming the return the following day, 2019-04-01 is positive.
Any advice?
Thank you
Use the numpy where function. It's more efficient and simpler:
import numpy as np
df['Indicator'] = np.where(df['return'].shift(-1)>0,1,-1 )
This would do I think:
df['Indicator'] = df['return'].shift(-1).apply(lambda x: 1 if x > 0 else -1)
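For reference, a small self-contained sketch of the np.where approach on the three rows from the question. One detail worth noting: the last row has no following return, so shift(-1) yields NaN there, the comparison NaN > 0 is False, and the indicator falls back to -1 unless that row is handled separately.

```python
import numpy as np
import pandas as pd

# The three rows from the question, with the date as the index.
df = pd.DataFrame(
    {"price": [54, 46, 48], "return": [0.05, -0.14, 0.04]},
    index=pd.to_datetime(["2019-01-01", "2019-02-01", "2019-03-01"]),
)

# shift(-1) moves each following return up one row; the comparison is
# evaluated element-wise, so no `if` on a whole Series is needed.
next_return = df["return"].shift(-1)
df["Indicator"] = np.where(next_return > 0, 1, -1)
print(df)
```

The first two rows match the desired output (-1, then 1). The third row only becomes 1 once the next period's return is present in the data.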

Pandas : When the 'apply' function is applied to the column, the 'NaN' value is output

The function is meant to output 1 if the difference between the current year and the year in the corresponding column is 5 or more, but NaN values come out instead.
import pandas as pd
from datetime import datetime
today = datetime.today()
def time(x):
    if today.year - x.year > 5:
        x = 1
        return x
    else:
        x = 0
        return x
df['VIP'] = df[condition]['DaysSinceJoined'].apply(time)
df['VIP']
The output is:
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
..
2235 NaN
2236 NaN
2237 NaN
2238 NaN
2239 NaN
Name: VIP, Length: 2240, dtype: float64
The function works just fine. The issue might lie within your initial condition.
First, let's generate a bit of sample data:
foo = pd.DataFrame({'time':['1979-11-10','1962-07-22','1987-09-16','2020-09-16']})
from datetime import datetime
today = datetime.today()
def time(x):
    if today.year - x.year > 5:
        return 1
    else:
        return 0
First we make sure it's not a data format issue, as I suggested above:
foo['VIP'] =foo['time'].apply(time)
'str' object has no attribute 'year'
We fix this by converting the dates to datetime:
foo['time'] = pd.to_datetime(foo['time'])
Let's test the function:
foo['VIP'] =foo['time'].apply(time)
time VIP
0 1979-11-10 1
1 1962-07-22 1
2 1987-09-16 1
3 2020-09-16 0
All good.
Now let's apply some random condition:
foo['VIP'] =foo[foo['time'].dt.year >1980]['time'].apply(time)
time VIP
0 1979-11-10 NaN
1 1962-07-22 NaN
2 1987-09-16 1.0
3 2020-09-16 0.0
The reason is that you first filter your dataframe down to a subset and then feed only those rows to your function. The rows that were filtered out are never processed, so they get no return value and show up as NaN when the result is assigned back to the full dataframe.
I suggest you do this with the .loc function:
foo.loc[((today.year - foo['time'].dt.year > 5) & (Other_condition_here)), 'VIP'] = 1
foo.loc[((today.year - foo['time'].dt.year <= 5) & (Other_condition_here)), 'VIP'] = 0
For more about .loc see documentation
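To make the alignment behaviour concrete, here is a minimal sketch using the sample dates from this answer. The fixed reference_year (instead of datetime.today()) and the column name VIP_filtered are assumptions introduced only to keep the example reproducible.

```python
import pandas as pd

foo = pd.DataFrame({"time": pd.to_datetime(
    ["1979-11-10", "1962-07-22", "1987-09-16", "2020-09-16"])})

reference_year = 2024  # assumption: fixed year so the output is stable
mask = foo["time"].dt.year > 1980  # stand-in for the question's `condition`

# Computing on the filtered rows returns a Series indexed by rows 2 and 3 only.
partial = (reference_year - foo.loc[mask, "time"].dt.year > 5).astype(int)

# Assignment aligns on the index, so rows 0 and 1 receive NaN.
foo["VIP_filtered"] = partial

# Starting from a default and setting matching rows with .loc avoids the NaN.
foo["VIP"] = 0
foo.loc[mask & (reference_year - foo["time"].dt.year > 5), "VIP"] = 1
print(foo)
```

The VIP_filtered column reproduces the NaN pattern from the question, while VIP holds an integer for every row.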
I guess when you use .apply it takes several arguments. Use map:
df['VIP'] = df[condition]['DaysSinceJoined'].map(time)
or:
df['VIP'] = df[condition].apply(lambda x: time(x['DaysSinceJoined']), axis=1)
If it didn't work, show us some sample data.

How to perform functions that reference previous row on subset of data in a dataframe using groupby

I have some log data that represents an item (id) and a timestamp at which an action was started, and I want to determine the time between actions on each item.
for example, I have some data that looks like this:
data = [{"timestamp":"2019-05-21T14:17:29.265Z","id":"ff9dad92-e7c1-47a5-93a7-6e49533a6e25"},{"timestamp":"2019-05-21T14:21:49.722Z","id":"ff9dad92-e7c1-47a5-93a7-6e49533a6e25"},{"timestamp":"2019-05-21T15:16:25.695Z","id":"ff9dad92-e7c1-47a5-93a7-6e49533a6e25"},{"timestamp":"2019-05-21T15:16:25.696Z","id":"ff9dad92-e7c1-47a5-93a7-6e49533a6e25"},{"timestamp":"2019-05-22T07:51:17.49Z","id":"ff12891e-5786-438b-891c-abd4244723b4"},{"timestamp":"2019-05-22T08:11:13.948Z","id":"ff12891e-5786-438b-891c-abd4244723b4"},{"timestamp":"2019-05-22T11:52:59.897Z","id":"ff12891e-5786-438b-891c-abd4244723b4"},{"timestamp":"2019-05-22T11:53:03.406Z","id":"ff12891e-5786-438b-891c-abd4244723b4"},{"timestamp":"2019-05-22T11:53:03.481Z","id":"ff12891e-5786-438b-891c-abd4244723b4"},{"timestamp":"2019-05-21T14:23:08.147Z","id":"fe55bb22-fe5b-4b12-8aaf-d5f0320ac7fa"},{"timestamp":"2019-05-21T14:29:18.228Z","id":"fe55bb22-fe5b-4b12-8aaf-d5f0320ac7fa"},{"timestamp":"2019-05-21T15:17:09.831Z","id":"fe55bb22-fe5b-4b12-8aaf-d5f0320ac7fa"},{"timestamp":"2019-05-21T15:17:09.834Z","id":"fe55bb22-fe5b-4b12-8aaf-d5f0320ac7fa"},{"timestamp":"2019-05-21T14:02:19.072Z","id":"fd3554cd-b83d-49af-a8e6-7bf41c741cd0"},{"timestamp":"2019-05-21T14:02:34.867Z","id":"fd3554cd-b83d-49af-a8e6-7bf41c741cd0"},{"timestamp":"2019-05-21T14:12:28.877Z","id":"fd3554cd-b83d-49af-a8e6-7bf41c741cd0"},{"timestamp":"2019-05-21T15:19:19.567Z","id":"fd3554cd-b83d-49af-a8e6-7bf41c741cd0"},{"timestamp":"2019-05-21T15:19:19.582Z","id":"fd3554cd-b83d-49af-a8e6-7bf41c741cd0"},{"timestamp":"2019-05-21T09:58:02.185Z","id":"f89c2e3e-06dc-467b-813b-dc92f2692f63"},{"timestamp":"2019-05-21T10:07:24.044Z","id":"f89c2e3e-06dc-467b-813b-dc92f2692f63"}]
stack = pd.DataFrame(data)
stack.head()
I have tried getting all the unique ids to split the data frame and then computing the time taken, keyed by index, to recombine with the original set as shown below. However, the function is extremely slow on large datasets and messes up both the index and the timestamp order, so the results get mismatched.
import ciso8601 as time
records = []
for i in list(stack.id.unique()):
    dff = stack[stack.id == i]
    time_taken = []
    times = []
    i = 0
    for _, row in dff.iterrows():
        if bool(times):
            print(_)
            current_time = time.parse_datetime(row.timestamp)
            prev_time = times[i]
            time_taken = current_time - prev_time
            times.append(current_time)
            i+=1
            records.append(dict(index = _, time_taken = time_taken.seconds))
        else:
            records.append(dict(index = _, time_taken = 0))
            times.append(time.parse_datetime(row.timestamp))
x = pd.DataFrame(records).set_index('index')
stack.merge(x, left_index=True, right_index=True, how='inner')
Is there a neat pandas groupby and apply way of doing this so that I don't have to split the frame and store it in memory so that can reference the previous row in the subset?
Thanks
You can use GroupBy.diff:
stack['timestamp'] = pd.to_datetime(stack['timestamp'])
# store the result in a new column instead of overwriting the timestamps
stack['time_taken'] = (stack.sort_values(['id','timestamp'])
                       .groupby('id')
                       .diff()['timestamp']
                       .dt.total_seconds()
                       .round().fillna(0))
print(stack['time_taken'])
0 0.0
1 260.0
2 3276.0
3 0.0
4 0.0
5 1196.0
6 13306.0
7 4.0
8 0.0
9 0.0
10 370.0
11 2872.0
...
If you want the resulting dataframe to be ordered by date, instead do:
stack['timestamp'] = pd.to_datetime(stack['timestamp'])
stack = stack.sort_values(['id','timestamp'])
stack['time_taken'] = (stack.groupby('id')
.diff()['timestamp']
.dt.total_seconds()
.round()
.fillna(0))
If you don't need to replace the timestamp column with datetimes, create a Series of datetimes with to_datetime and pass it to GroupBy.diff, then convert to seconds with Series.dt.total_seconds, round with Series.round if necessary, and replace missing values with 0:
t = pd.to_datetime(stack['timestamp'])
stack['time_taken'] = t.groupby(stack['id']).diff().dt.total_seconds().round().fillna(0)
print (stack)
id timestamp time_taken
0 ff9dad92-e7c1-47a5-93a7-6e49533a6e25 2019-05-21T14:17:29.265Z 0.0
1 ff9dad92-e7c1-47a5-93a7-6e49533a6e25 2019-05-21T14:21:49.722Z 260.0
2 ff9dad92-e7c1-47a5-93a7-6e49533a6e25 2019-05-21T15:16:25.695Z 3276.0
3 ff9dad92-e7c1-47a5-93a7-6e49533a6e25 2019-05-21T15:16:25.696Z 0.0
4 ff12891e-5786-438b-891c-abd4244723b4 2019-05-22T07:51:17.49Z 0.0
5 ff12891e-5786-438b-891c-abd4244723b4 2019-05-22T08:11:13.948Z 1196.0
6 ff12891e-5786-438b-891c-abd4244723b4 2019-05-22T11:52:59.897Z 13306.0
7 ff12891e-5786-438b-891c-abd4244723b4 2019-05-22T11:53:03.406Z 4.0
8 ff12891e-5786-438b-891c-abd4244723b4 2019-05-22T11:53:03.481Z 0.0
9 fe55bb22-fe5b-4b12-8aaf-d5f0320ac7fa 2019-05-21T14:23:08.147Z 0.0
10 fe55bb22-fe5b-4b12-8aaf-d5f0320ac7fa 2019-05-21T14:29:18.228Z 370.0
11 fe55bb22-fe5b-4b12-8aaf-d5f0320ac7fa 2019-05-21T15:17:09.831Z 2872.0
12 fe55bb22-fe5b-4b12-8aaf-d5f0320ac7fa 2019-05-21T15:17:09.834Z 0.0
13 fd3554cd-b83d-49af-a8e6-7bf41c741cd0 2019-05-21T14:02:19.072Z 0.0
14 fd3554cd-b83d-49af-a8e6-7bf41c741cd0 2019-05-21T14:02:34.867Z 16.0
15 fd3554cd-b83d-49af-a8e6-7bf41c741cd0 2019-05-21T14:12:28.877Z 594.0
16 fd3554cd-b83d-49af-a8e6-7bf41c741cd0 2019-05-21T15:19:19.567Z 4011.0
17 fd3554cd-b83d-49af-a8e6-7bf41c741cd0 2019-05-21T15:19:19.582Z 0.0
18 f89c2e3e-06dc-467b-813b-dc92f2692f63 2019-05-21T09:58:02.185Z 0.0
19 f89c2e3e-06dc-467b-813b-dc92f2692f63 2019-05-21T10:07:24.044Z 562.0
Or, if you need to replace the timestamp column with datetimes, use @yatu's answer.
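A compact sketch of the per-group diff on a tiny, made-up log (the ids and timestamps below are invented for illustration). The rows are interleaved on purpose to show that the difference is taken against the previous row of the same id rather than the previous row of the frame.

```python
import pandas as pd

# Hypothetical log: two items whose actions are interleaved.
log = pd.DataFrame({
    "id": ["a", "a", "b", "a", "b"],
    "timestamp": [
        "2019-05-21T14:00:00Z",
        "2019-05-21T14:01:30Z",
        "2019-05-21T15:00:00Z",
        "2019-05-21T14:05:00Z",
        "2019-05-21T15:00:10Z",
    ],
})

# Parse once, then diff within each id; the first row of each group has
# no predecessor and is filled with 0.
t = pd.to_datetime(log["timestamp"])
log["time_taken"] = t.groupby(log["id"]).diff().dt.total_seconds().fillna(0)
print(log)
```

The original row order and index are preserved, which avoids the mismatches described in the question.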

Finding start of the maximum drawdown in Pandas [duplicate]

Maximum Drawdown is a common risk metric used in quantitative finance to assess the largest negative return that has been experienced.
Recently, I became impatient with the time to calculate max drawdown using my looped approach.
def max_dd_loop(returns):
    """returns is assumed to be a pandas series"""
    max_so_far = None
    start, end = None, None
    r = returns.add(1).cumprod()
    for r_start in r.index:
        for r_end in r.index:
            if r_start < r_end:
                # .loc replaces the removed .ix indexer
                current = r.loc[r_end] / r.loc[r_start] - 1
                if (max_so_far is None) or (current < max_so_far):
                    max_so_far = current
                    start, end = r_start, r_end
    return max_so_far, start, end
I'm familiar with the common perception that a vectorized solution would be better.
The questions are:
can I vectorize this problem?
What does this solution look like?
How beneficial is it?
Edit
I modified Alexander's answer into the following function:
def max_dd(returns):
    """Assumes returns is a pandas Series"""
    r = returns.add(1).cumprod()
    dd = r.div(r.cummax()).sub(1)
    mdd = dd.min()
    # idxmin/idxmax return index labels; argmin/argmax return positions in current pandas
    end = dd.idxmin()
    start = r.loc[:end].idxmax()
    return mdd, start, end
df_returns is assumed to be a dataframe of returns, where each column is a separate strategy/manager/security, and each row is a new date (e.g. monthly or daily).
cum_returns = (1 + df_returns).cumprod()
drawdown = 1 - cum_returns.div(cum_returns.cummax())
I had first suggested using .expanding() window but that's obviously not necessary with the .cumprod() and .cummax() built ins to calculate max drawdown up to any given point:
import numpy as np
import pandas as pd
from datetime import date
df = pd.DataFrame(data={'returns': np.random.normal(0.001, 0.05, 1000)},
                  index=pd.date_range(start=date(2016, 1, 1), periods=1000, freq='D'))
df['cumulative_return'] = df.returns.add(1).cumprod().subtract(1)
df['max_drawdown'] = df.cumulative_return.add(1).div(df.cumulative_return.cummax().add(1)).subtract(1)
returns cumulative_return max_drawdown
2016-01-01 -0.014522 -0.014522 0.000000
2016-01-02 -0.022769 -0.036960 -0.022769
2016-01-03 0.026735 -0.011214 0.000000
2016-01-04 0.054129 0.042308 0.000000
2016-01-05 -0.017562 0.024004 -0.017562
2016-01-06 0.055254 0.080584 0.000000
2016-01-07 0.023135 0.105583 0.000000
2016-01-08 -0.072624 0.025291 -0.072624
2016-01-09 -0.055799 -0.031919 -0.124371
2016-01-10 0.129059 0.093020 -0.011363
2016-01-11 0.056123 0.154364 0.000000
2016-01-12 0.028213 0.186932 0.000000
2016-01-13 0.026914 0.218878 0.000000
2016-01-14 -0.009160 0.207713 -0.009160
2016-01-15 -0.017245 0.186886 -0.026247
2016-01-16 0.003357 0.190869 -0.022979
2016-01-17 -0.009284 0.179813 -0.032050
2016-01-18 -0.027361 0.147533 -0.058533
2016-01-19 -0.058118 0.080841 -0.113250
2016-01-20 -0.049893 0.026914 -0.157492
2016-01-21 -0.013382 0.013173 -0.168766
2016-01-22 -0.020350 -0.007445 -0.185681
2016-01-23 -0.085842 -0.092648 -0.255584
2016-01-24 0.022406 -0.072318 -0.238905
2016-01-25 0.044079 -0.031426 -0.205356
2016-01-26 0.045782 0.012917 -0.168976
2016-01-27 -0.018443 -0.005764 -0.184302
2016-01-28 0.021461 0.015573 -0.166797
2016-01-29 -0.062436 -0.047836 -0.218819
2016-01-30 -0.013274 -0.060475 -0.229189
... ... ... ...
2018-08-28 0.002124 0.559122 -0.478738
2018-08-29 -0.080303 0.433921 -0.520597
2018-08-30 -0.009798 0.419871 -0.525294
2018-08-31 -0.050365 0.348359 -0.549203
2018-09-01 0.080299 0.456631 -0.513004
2018-09-02 0.013601 0.476443 -0.506381
2018-09-03 -0.009678 0.462153 -0.511158
2018-09-04 -0.026805 0.422960 -0.524262
2018-09-05 0.040832 0.481062 -0.504836
2018-09-06 -0.035492 0.428496 -0.522411
2018-09-07 -0.011206 0.412489 -0.527762
2018-09-08 0.069765 0.511031 -0.494817
2018-09-09 0.049546 0.585896 -0.469787
2018-09-10 -0.060201 0.490423 -0.501707
2018-09-11 -0.018913 0.462235 -0.511131
2018-09-12 -0.094803 0.323611 -0.557477
2018-09-13 0.025736 0.357675 -0.546088
2018-09-14 -0.049468 0.290514 -0.568542
2018-09-15 0.018146 0.313932 -0.560713
2018-09-16 -0.034118 0.269104 -0.575700
2018-09-17 0.012191 0.284576 -0.570527
2018-09-18 -0.014888 0.265451 -0.576921
2018-09-19 0.041180 0.317562 -0.559499
2018-09-20 0.001988 0.320182 -0.558623
2018-09-21 -0.092268 0.198372 -0.599348
2018-09-22 -0.015386 0.179933 -0.605513
2018-09-23 -0.021231 0.154883 -0.613888
2018-09-24 -0.023536 0.127701 -0.622976
2018-09-25 0.030160 0.161712 -0.611605
2018-09-26 0.025528 0.191368 -0.601690
Given a time series of returns, we need to evaluate the aggregate return for every combination of starting point to ending point.
The first trick is to convert a time series of returns into a series of return indices. Given a series of return indices, I can calculate the return over any sub-period with the return index at the beginning ri_0 and at the end ri_1. The calculation is: ri_1 / ri_0 - 1.
The second trick is to produce a second series of inverses of return indices. If r is my series of return indices then 1 / r is my series of inverses.
The third trick is to take the matrix product of r * (1 / r).Transpose.
r is an n x 1 matrix. (1 / r).Transpose is a 1 x n matrix. The resulting product contains every combination of ri_j / ri_k. Just subtract 1 and I've actually got returns.
The fourth trick is to ensure that I'm constraining my denominator to represent periods prior to those being represented by the numerator.
Below is my vectorized function.
import numpy as np
import pandas as pd
def max_dd(returns):
    # make into a DataFrame so that it is a 2-dimensional
    # matrix such that I can perform an nx1 by 1xn matrix
    # multiplication and end up with an nxn matrix
    r = pd.DataFrame(returns).add(1).cumprod()
    # I copy r.T to ensure r's index is not the same
    # object as 1 / r.T's columns object
    x = r.dot(1 / r.T.copy()) - 1
    x.columns.name, x.index.name = 'start', 'end'
    # let's make sure we only calculate a return when start
    # is less than end.
    y = x.stack().reset_index()
    y = y[y.start < y.end]
    # my choice is to return the periods and the actual max
    # draw down
    z = y.set_index(['start', 'end']).iloc[:, 0]
    # idxmin returns the (start, end) label pair; argmin returns a position in current pandas
    return z.min(), z.idxmin()[0], z.idxmin()[1]
How does this perform?
For the vectorized solution, I ran 10 iterations over time series of lengths [10, 50, 100, 150, 200]. The time it took is below:
10: 0.032 seconds
50: 0.044 seconds
100: 0.055 seconds
150: 0.082 seconds
200: 0.047 seconds
The same test for the looped solution is below:
10: 0.153 seconds
50: 3.169 seconds
100: 12.355 seconds
150: 27.756 seconds
200: 49.726 seconds
Edit
Alexander's answer provides superior results. The same test using the modified code:
10: 0.000 seconds
50: 0.000 seconds
100: 0.004 seconds
150: 0.007 seconds
200: 0.008 seconds
I modified his code into the following function:
def max_dd(returns):
    r = returns.add(1).cumprod()
    dd = r.div(r.cummax()).sub(1)
    mdd = dd.min()
    # idxmin/idxmax return index labels; argmin/argmax return positions in current pandas
    end = dd.idxmin()
    start = r.loc[:end].idxmax()
    return mdd, start, end
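As a quick sanity check of the cummax-based approach, here is a small worked sketch. The function name max_drawdown and the five returns are made up for the example, and idxmin/idxmax are used so that start and end come back as index labels.

```python
import pandas as pd

def max_drawdown(returns):
    """Return (max drawdown, peak label, trough label) for a Series of returns."""
    wealth = returns.add(1).cumprod()              # return index
    drawdown = wealth.div(wealth.cummax()).sub(1)  # distance below running peak
    end = drawdown.idxmin()                        # label of the deepest trough
    start = wealth.loc[:end].idxmax()              # label of the peak before it
    return drawdown.min(), start, end

# Hypothetical daily returns.
returns = pd.Series(
    [0.10, -0.20, 0.05, -0.10, 0.30],
    index=pd.date_range("2020-01-01", periods=5, freq="D"),
)
mdd, start, end = max_drawdown(returns)
print(mdd, start, end)
```

Here the return index peaks at 1.1 on the first day and bottoms out at 0.8316 on the fourth, so the drawdown is 0.8316 / 1.1 - 1, roughly -24.4%.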
I recently had a similar issue, but instead of a global MDD, I was required to find the MDD for the interval after each peak. Also, in my case, I was supposed to take the MDD of each strategy alone and thus wasn't required to apply the cumprod. My vectorized implementation is also based on Investopedia.
def calc_MDD(networth):
    df = pd.Series(networth, name="nw").to_frame()
    max_peaks_idx = df.nw.expanding(min_periods=1).apply(lambda x: x.argmax()).fillna(0).astype(int)
    df['max_peaks_idx'] = pd.Series(max_peaks_idx).to_frame()
    nw_peaks = pd.Series(df.nw.iloc[max_peaks_idx.values].values, index=df.nw.index)
    df['dd'] = ((df.nw-nw_peaks)/nw_peaks)
    df['mdd'] = df.groupby('max_peaks_idx').dd.apply(lambda x: x.expanding(min_periods=1).apply(lambda y: y.min())).fillna(0)
    return df
Here is a sample after running this code:
nw max_peaks_idx dd mdd
0 10000.000 0 0.000000 0.000000
1 9696.948 0 -0.030305 -0.030305
2 9538.576 0 -0.046142 -0.046142
3 9303.953 0 -0.069605 -0.069605
4 9247.259 0 -0.075274 -0.075274
5 9421.519 0 -0.057848 -0.075274
6 9315.938 0 -0.068406 -0.075274
7 9235.775 0 -0.076423 -0.076423
8 9091.121 0 -0.090888 -0.090888
9 9033.532 0 -0.096647 -0.096647
10 8947.504 0 -0.105250 -0.105250
11 8841.551 0 -0.115845 -0.115845
And here is an image of the result when applied to the complete dataset.
Although vectorized, this code is probably slower than the other, because for each time-series, there should be many peaks, and each one of these requires calculation, and so O(n_peaks*n_intervals).
PS: I could have eliminated the zero values in the dd and mdd columns, but I find it useful that these values help indicate when a new peak was observed in the time-series.

Definite numerical integration in a python pandas dataframe

I have a pandas dataframe of variable number of columns. I'd like to numerically integrate each column of the dataframe so that I can evaluate the definite integral from row 0 to row 'n'. I have a function that works on an 1D array, but is there a better way to do this in a pandas dataframe so that I don't have to iterate over columns and cells? I was thinking of some way of using applymap, but I can't see how to make it work.
This is the function that works on a 1D array:
def findB(x,y):
    y_int = np.zeros(y.size)
    y_int_min = np.zeros(y.size)
    y_int_max = np.zeros(y.size)
    end = y.size-1
    y_int[0]=(y[1]+y[0])/2*(x[1]-x[0])
    for i in range(1,end,1):
        j=i+1
        y_int[i] = (y[j]+y[i])/2*(x[j]-x[i]) + y_int[i-1]
    return y_int
I'd like to replace it with something that calculates multiple columns of a dataframe all at once, something like this:
B_df = y_df.applymap(integrator)
EDIT:
Starting dataframe dB_df:
Sample1 1 dB Sample1 2 dB Sample1 3 dB Sample1 4 dB Sample1 5 dB Sample1 6 dB
0 2.472389 6.524537 0.306852 -6.209527 -6.531123 -4.901795
1 6.982619 -0.534953 -7.537024 8.301643 7.744730 7.962163
2 -8.038405 -8.888681 6.856490 -0.052084 0.018511 -4.117407
3 0.040788 5.622489 3.522841 -8.170495 -7.707704 -6.313693
4 8.512173 1.896649 -8.831261 6.889746 6.960343 8.236696
5 -6.234313 -9.908385 4.934738 1.595130 3.116842 -2.078000
6 -1.998620 3.818398 5.444592 -7.503763 -8.727408 -8.117782
7 7.884663 3.818398 -8.046873 6.223019 4.646397 6.667921
8 -5.332267 -9.163214 1.993285 2.144201 4.646397 0.000627
9 -2.783008 2.288842 5.836786 -8.013618 -7.825365 -8.470759
Ending dataframe B_df:
Sample1 1 B Sample1 2 B Sample1 3 B Sample1 4 B Sample1 5 B Sample1 6 B
0 0.000038 0.000024 -0.000029 0.000008 0.000005 0.000012
1 0.000034 -0.000014 -0.000032 0.000041 0.000036 0.000028
2 0.000002 -0.000027 0.000010 0.000008 0.000005 -0.000014
3 0.000036 0.000003 -0.000011 0.000003 0.000002 -0.000006
4 0.000045 -0.000029 -0.000027 0.000037 0.000042 0.000018
5 0.000012 -0.000053 0.000015 0.000014 0.000020 -0.000023
6 0.000036 -0.000023 0.000004 0.000009 0.000004 -0.000028
7 0.000046 -0.000044 -0.000020 0.000042 0.000041 -0.000002
8 0.000013 -0.000071 0.000011 0.000019 0.000028 -0.000036
9 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
In the above example,
(x[j]-x[i]) = 0.000008
First of all, you can achieve a similar result using vectorized operations. Each element of the integration is just the mean of the current and next y value scaled by the corresponding difference in x. The final integral is just the cumulative sum of these elements. You can achieve the same result by doing something like
def findB(x, y):
    """
    x : pandas.Series
    y : pandas.DataFrame
    """
    mean_y = (y[:-1] + y.shift(-1)[:-1]) / 2
    delta_x = x.shift(-1)[:-1] - x[:-1]
    # axis='index' aligns delta_x with the rows rather than the column labels
    scaled_int = mean_y.multiply(delta_x, axis='index')
    cumulative_int = scaled_int.cumsum(axis='index')
    return cumulative_int.shift(1).fillna(0)
Here DataFrame.shift and Series.shift are used to match the indices of the "next" elements to the current. You have to use DataFrame.multiply rather than the * operator to ensure that the proper axis is used ('index' vs 'column'). Finally, DataFrame.cumsum provides the final integration step. DataFrame.fillna ensures that you have a first row of zeros as you did in the original solution. The advantage of using all the native pandas functions is that you can pass in a dataframe with any number of columns and have it operate on all of them simultaneously.
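The same idea can be checked on inputs with a known integral. This is a minimal sketch, not the answer's exact function: the name cumulative_trapezoid and the test columns are assumptions, and here row i holds the integral from the first x value up to x[i], so the final row carries the full integral.

```python
import pandas as pd

def cumulative_trapezoid(x, y):
    """Cumulative trapezoidal integral of every column of y along x."""
    mean_y = (y + y.shift(-1)) / 2      # average of each value and the next
    delta_x = x.shift(-1) - x           # width of each interval
    areas = mean_y.multiply(delta_x, axis="index")  # align widths with rows
    # Shift down one row so row i holds the integral up to x[i];
    # the first row has nothing integrated yet and becomes 0.
    return areas.cumsum().shift(1).fillna(0)

x = pd.Series([0.0, 1.0, 2.0, 3.0])
y = pd.DataFrame({"linear": x, "constant": [2.0, 2.0, 2.0, 2.0]})
result = cumulative_trapezoid(x, y)
print(result)
```

The trapezoid rule is exact for linear and constant integrands, so the linear column should follow x**2 / 2 and the constant column should follow 2 * x.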
Do you really look for numeric values of the integral? Maybe you just need a picture? Then it is easier, using pyplot.
import matplotlib.pyplot as plt
# Introduce a column *bin* holding left limits of our bins.
df['bin'] = pd.cut(df['volume2'], 50).apply(lambda bin: bin.left)
# Group by bins and calculate *f*.
g = df[['bin', 'universe']].groupby('bin').sum()
# Plot the function using cumulative=True.
plt.hist(list(g.index), bins=50, weights=list(g['universe']), cumulative=True)
plt.show()
