Calculations using Pandas apply & lambda - python

I have a data frame with a column for dates (YYYY/MM/DD format) and one for wind speed measurements. Each date has more than one wind speed measurement associated with it, and I want to calculate the standard error of wind speed measurements on each day.
I use the pandas 'groupby' to group all of the wind speeds to the date they were taken on, and I calculate the mean and number of measurements taken from each day.
To calculate the standard error, I have to sum the squared differences between each value from a day and the average of values from that day. Obviously, these have different lengths, and I cannot figure out how to do this with a lambda function.
Is there a better way to go about this?
#calculate daily averages, daily number of measurements, and list of every value from day
average_from_date = df.groupby(['time'])['wind_spd_ms'].mean()
number = df.groupby(['time'])['wind_spd_ms'].count()
values_from_date = df.groupby(['time'])['wind_spd_ms'].apply(list)
#return list of standard errors for each date in the data set
standard_errors = df.groupby(['time'])['wind_spd_ms'].apply(lambda x: (sum((values_from_date -
average_from_date)**2)/(number-1)))

While the groupby.GroupBy.sem is a good way to calculate this since it is a ready function in pandas, there might be cases when you need to calculate a new column with a function that does not exist in the library.
apply() + lambda
This is how you would calculate a new column* using the "apply lambda" approach:
res1 = df.groupby(['time'])['wind_spd_ms'].apply(lambda x: ((x-np.mean(x))**2)/(len(x)-1))
What is important to understand is that since df.groupby(['time']) is a DataFrameGroupBy object, df.groupby(['time'])['wind_spd_ms'] is a SeriesGroupBy object, and the apply() method is therefore SeriesGroupBy.apply rather than pd.Series.apply. It takes a function as an argument, and the function is called once per group with that group's values as a pandas Series (here: the wind speeds for one date). Now, you know already how to calculate the standard deviation if you get a list/Series.
apply() + another function
With apply you are not restricted to lambdas; the argument can be any function that takes a pd.Series as its argument. So an equally good solution would be:
def calculate_std(x):
    ave = np.mean(x)
    return ((x-ave)**2)/(len(x)-1)
res2 = df.groupby(['time'])['wind_spd_ms'].apply(calculate_std)
For slightly more complex calculations this is a much more readable and preferable solution.
How fast are different alternatives?
One might think that "using lambdas is faster", but if you time the functions yourself you see that there are no speed gains from using lambdas:
In [3]: timeit df.groupby(['time'])['wind_spd_ms'].apply(lambda x: ((x-np.mean(x))**2)/(len(x)-1))
3.26 ms ± 358 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [4]: timeit df.groupby(['time'])['wind_spd_ms'].apply(calculate_std)
2.87 ms ± 63.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
On the other hand, the sem function mentioned by oli5679 is faster:
In [5]: timeit df.groupby(['time'])['wind_spd_ms'].sem()
1.33 ms ± 40.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
(Although, library functions are not always the fastest. For example, using scipy.ndimage.interpolation.shift for shifting..)
* This formula gives each value's squared deviation from its group mean divided by n-1, i.e. its contribution to the groupwise sample variance (not the standard error of the groupwise mean)

Pandas groupby has a 'sem', which you can use without having to create your own lambda function. See more info here.
import pandas as pd
test = pd.DataFrame({'group':['a','a','a','b','b','b'],'val':[1,100,-40,5,7,8]})
test.groupby(['group'])['val'].sem()
#a 41.554516
#b 0.881917
See below for an example of how to do it from scratch. I think the original attempt didn't quite fit the definition here. You need to divide the total squared difference from the mean by N-1 to calculate the sample variance, but you also need to divide this by N and take the square root to get the SEM.
test["squared_difference_from_average"] = (
    test["val"] - test.groupby(["group"])["val"].transform("mean")
) ** 2
group_count = test.groupby(["group"])["val"].count()
standard_errors = (
    (
        (test.groupby(["group"])["squared_difference_from_average"].sum())
        / (group_count - 1)
    )
    / group_count
) ** 0.5
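As a cross-check, the from-scratch formula can also be written as a function that returns one number per group and is passed to apply(). This is only a minimal sketch: the helper name sem_from_scratch is made up for illustration, and the data is the same small test frame as in the example above. The two results should agree.

```python
import numpy as np
import pandas as pd

test = pd.DataFrame({"group": ["a", "a", "a", "b", "b", "b"],
                     "val": [1, 100, -40, 5, 7, 8]})

def sem_from_scratch(x):
    """Standard error of the mean for one group (x is a pandas Series)."""
    n = len(x)
    # Sample variance: summed squared deviations divided by n - 1
    sample_var = ((x - x.mean()) ** 2).sum() / (n - 1)
    # SEM: square root of (sample variance / n)
    return np.sqrt(sample_var / n)

manual = test.groupby("group")["val"].apply(sem_from_scratch)
builtin = test.groupby("group")["val"].sem()
print(np.allclose(manual, builtin))
```

Because the function returns a scalar, apply() produces one row per group, unlike the element-wise lambda shown earlier.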


How to quickly subset many dataframes?

I have 180 DataFrame objects, each one has 3130 rows and it's about 300KB in memory.
The index is a DatetimeIndex, business days from 2000-01-03 to 2011-12-31:
from datetime import datetime
import pandas as pd
freq = pd.tseries.offsets.BDay()
index = pd.date_range(datetime(2000,1,3), datetime(2011,12,31), freq=freq)
df = pd.DataFrame(index=index)
df['A'] = 1000.0
df['B'] = 2000.0
df['C'] = 3000.0
df['D'] = 4000.0
df['E'] = 5000.0
df['F'] = True
df['G'] = 1.0
df['H'] = 100.0
I preprocess all the data taking advantage of numpy/pandas vectorization, then I have to loop through the dataframes day by day. To prevent 'look ahead bias' (accidentally using data from the future), I must be sure that each day I only return a subset of my dataframes, up to that datapoint. To explain: if the current datapoint I am processing is datetime(2010,5,15), I need data from datetime(2000,1,3) to datetime(2010,5,15), and it should not be possible to access data more recent than datetime(2010,5,15). With this subset I'll make other computations I can't vectorize because they are path dependent.
I modified my original loop like this:
def get_data(datapoint):
    return df.loc[:datapoint]
calendar = df.index
for datapoint in calendar:
    x = get_data(datapoint)
This kind of code is painfully slow. What is my best option to improve its speed?
If I do not try to prevent the look ahead bias my production code takes about 3 minutes to run but it is too risky. With code like this it takes 13 minutes and this is unacceptable.
A slightly faster option is using iloc instead of loc but it is still slow:
def get_data2(datapoint):
    idx = df.index.get_loc(datapoint)
    # +1 so the slice includes datapoint itself, matching df.loc[:datapoint]
    return df.iloc[:idx + 1]
%%timeit
for datapoint in calendar:
    x = get_data(datapoint)
371 ms ± 23.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
for datapoint in calendar:
    x = get_data2(datapoint)
327 ms ± 7.05 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The original code, which was not trying to prevent the possibility of look ahead bias, simply returned the whole DataFrame when called for each datapoint. In this example it is about 100 times faster; in the real code it is about 4 times faster.
def get_data_no_check():
    return df
for datapoint in calendar:
    x = get_data_no_check()
2.87 ms ± 89.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
See if this works for you:
datapoint_range = pd.date_range(datetime(2000,1,3), datetime.now(), freq=freq)
datapoint = datapoint_range[-1]
The logic is: set the end of the range to today so that it contains no future dates, then take the last date of the range.
Then use your df.loc[:datapoint] to get the range you want.
I solved it like this: first I preprocess all my data in the DataFrame to take advantage of pandas vectorization, then I convert it into a dict of dicts and iterate over that, which prevents the possibility of 'look ahead bias'. Since the data are already preprocessed, I can avoid the DataFrame overhead. The increase in processing speed in production code left me speechless: down from more than 30 minutes to 40 seconds!
# Convert each DataFrame into a dict of dicts
for s, data in self._data.items():
    self._data[s] = data.to_dict(orient='index')
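To make the idea concrete, here is a minimal sketch of iterating over such a dict of dicts. The column name, the values and the running-maximum calculation are all made up for illustration; the real path-dependent computation would go in their place.

```python
import pandas as pd

# Small stand-in for one preprocessed DataFrame (values are made up)
index = pd.date_range("2000-01-03", "2000-01-14", freq="B")
df = pd.DataFrame(
    {"A": [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0, 5.0, 3.0]}, index=index
)

# One plain dict per day: {Timestamp: {"A": value}}
rows = df.to_dict(orient="index")

running_max = []
highest_so_far = float("-inf")
for day in df.index:
    row = rows[day]  # only the current day's data is touched
    highest_so_far = max(highest_so_far, row["A"])
    running_max.append(highest_so_far)

print(running_max)
```

Each step looks up a single day by key, so nothing later than the current datapoint is read, and no DataFrame slice is built inside the loop.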

Pandas transform method performing slow

I have a canonical Pandas transform example in which performance seems inexplicably slow. I have read the Q&A on the apply method, which is related but, in my humble opinion, offers an incomplete and potentially misleading answer to my question as I explain below.
The first five lines of my dataframe are
     id        date      xvar
0  1004  1992-05-31  4.151628
1  1004  1993-05-31  2.868015
2  1004  1994-05-31  3.043287
3  1004  1995-05-31  3.189541
4  1004  1996-05-31  4.008760
There are 24,693 rows in the dataframe.
There are 2,992 unique id values.
I want to center xvar by id.
Approach 1 takes 861 ms:
df_r['xvar_center'] = (
    df_r
    .groupby('id')['xvar']
    .transform(lambda x: x - x.mean())
)
Approach 2 takes 9 ms:
# Group means
df_r_mean = (
    df_r
    .groupby('id', as_index=False)['xvar']
    .mean()
    .rename(columns={'xvar':'xvar_avg'})
)
# Merge group means onto dataframe and center
df_w = (
    pd
    .merge(df_r, df_r_mean, on='id', how='left')
    .assign(xvar_center=lambda x: x.xvar - x.xvar_avg)
)
The Q&A on the apply method recommends relying on vectorized functions whenever possible, much like @sammywemmy's comment implies. This I see as overlap. However, the Q&A on the apply method also states:
"...here are some common situations where you will want to get rid of any calls to apply...Numeric Data"
@sammywemmy's comment does not "get rid of any calls to" the transform method in their answer to my question. On the contrary, the answer relies on the transform method. Therefore, unless @sammywemmy's suggestion is strictly dominated by an alternative approach that does not rely on the transform method, I think my question and its answer are sufficiently distinct from the discussion in Q&A on the apply method. (Thank you for your patience and help.)
This answer is due to the insightful comment from @sammywemmy, who deserves all credit and no blame for any inaccuracy here. Because a similar usage of transform is illustrated in the Pandas User's Guide, I thought elaborating may be useful for others.
My hypothesis is that the problem rests with a combination of using a non-vectorized function and a large number of groups. When I change the groupby variable from id (2,992 unique values) to year (constructed from the date variable and containing 28 unique values), the performance difference between my original approach and @sammywemmy's narrows substantially but is still significant.
%%timeit
df_r['xvar_center_y'] = (
    df_r
    .groupby('year')['xvar']
    .transform(lambda x: x - x.mean())
)
11.4 ms ± 202 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
vs.
%timeit df_r['xvar_center_y'] = df_r.xvar - df_r.groupby('year')['xvar'].transform('mean')
1.69 ms ± 5.11 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
The beauty of @sammywemmy's insight is that it is easy to apply to other common transformations for potentially significant performance improvements and at a modest cost in terms of additional code. For example, consider standardizing a variable:
%%timeit
df_r['xvar_z'] = (
    df_r
    .groupby('id')['xvar']
    .transform(lambda x: (x - x.mean()) / x.std())
)
1.34 s ± 38 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
vs.
%%timeit
df_r['xvar_z'] = (
    (df_r.xvar - df_r.groupby('id')['xvar'].transform('mean'))
    / df_r.groupby('id')['xvar'].transform('std')
)
3.96 ms ± 297 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
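For anyone who wants to confirm that the two forms give the same numbers before switching, here is a small self-contained check. The frame df_demo and its random contents are made up for illustration and are much smaller than the data timed above, so this sketch says nothing about speed.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df_demo = pd.DataFrame({
    "id": rng.integers(0, 50, size=1000),
    "xvar": rng.normal(size=1000),
})

grouped = df_demo.groupby("id")["xvar"]

# Centering: Python-level lambda vs. built-in 'mean' transform
centered_lambda = grouped.transform(lambda x: x - x.mean())
centered_builtin = df_demo["xvar"] - grouped.transform("mean")

# Standardizing: lambda vs. built-in 'mean' and 'std' transforms
z_lambda = grouped.transform(lambda x: (x - x.mean()) / x.std())
z_builtin = (df_demo["xvar"] - grouped.transform("mean")) / grouped.transform("std")

print(np.allclose(centered_lambda, centered_builtin))
print(np.allclose(z_lambda, z_builtin, equal_nan=True))
```

The string names let pandas use its built-in group reductions, and the subtraction and division then run once over the whole column instead of once per group.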

Speeding up operations on large arrays & datasets (Pandas slow, Numpy better, further improvements?)

I have a large dataset comprising millions of rows and around 6 columns. The data is currently in a Pandas dataframe and I'm looking for the fastest way to operate on it. For example, let's say I want to drop all the rows where the value in one column is "1".
Here's my minimal working example:
import numpy as np
import pandas as pd

# Create dummy data arrays and pandas dataframe
array_size = int(5e6)
array1 = np.random.rand(array_size)
array2 = np.random.rand(array_size)
array3 = np.random.rand(array_size)
array_condition = np.random.randint(0, 3, size=array_size)
df = pd.DataFrame({'array_condition': array_condition, 'array1': array1, 'array2': array2, 'array3': array3})
def method1():
    df_new = df.drop(df[df.array_condition == 1].index)
EDIT: As Henry Yik pointed out in the comments, a faster Pandas approach is this:
def method1b():
    df_new = df[df.array_condition != 1]
I believe that Pandas can be quite slow at this sort of thing, so I also implemented a method using numpy, processing each column as a separate array:
def method2():
    masking = array_condition != 1
    array1_new = array1[masking]
    array2_new = array2[masking]
    array3_new = array3[masking]
    array_condition_new = array_condition[masking]
And the results:
%timeit method1()
625 ms ± 7.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit method1b()
158 ms ± 7.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit method2()
138 ms ± 3.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
So we do see a slight performance boost using numpy. However, this comes at the cost of much less readable code (i.e. having to create a mask and apply it to each array). This method doesn't seem as scalable either: if I have, say, 30 columns of data, I'll need a lot of lines of code to apply the mask to every array! Additionally, it would be useful to allow optional columns, so this method may fail when trying to operate on arrays which are empty.
Therefore, I have 2 questions:
1) Is there a cleaner / more flexible way to implement this in numpy?
2) Or better, is there any higher performance method I could use here? e.g. JIT (numba?), Cython or something else?
PS, in practice, in-place operations can be used, replacing the old array with the new one once data is dropped
Part 1: Pandas and (maybe) Numpy
Compare your method1b and method2:
method1b generates a DataFrame, which is probably what you want,
method2 generates a Numpy array, so to get fully comparable result,
you should subsequently generate a DataFrame from it.
So I changed your method2 to:
def method2():
    masking = array_condition != 1
    array1_new = array1[masking]
    array2_new = array2[masking]
    array3_new = array3[masking]
    array_condition_new = array_condition[masking]
    # Reuse the already-masked condition array instead of masking it again
    df_new = pd.DataFrame({'array_condition': array_condition_new,
        'array1': array1_new, 'array2': array2_new, 'array3': array3_new})
and then compared execution times (using %timeit).
The result was that my (expanded) version of method2 took about 5% longer
than method1b (check on your own).
So my opinion is that as long as a single operation is concerned,
it is probably better to stay with Pandas.
But if you want to perform a couple of operations on your source DataFrame
in sequence and / or you are satisfied with the result as a Numpy array,
it is worth doing the following:
Call arr = df.values to get the underlying Numpy array.
Perform all required operations on it using Numpy methods.
(Optionally) create a DataFrame from the final result.
I tried Numpy version of method1b:
def method3():
    a = df.values
    arr = a[a[:,0] != 1]
but the execution time was about 40 % longer.
The reason is probably that a Numpy array has all elements of the
same type, so the array_condition column is coerced to float and then
the whole Numpy array is created, which takes some time.
Part 2: Numpy and Numba
An alternative to consider is the Numba package, a Just-In-Time
Python compiler.
I ran the following test:
Created a Numpy array (as a preliminary step):
a = df.values
The reason is that JIT compiled methods are able to use Numpy methods and types,
but not those of Pandas.
To perform the test, I used almost the same method as above,
but with the @njit decorator (requires from numba import njit):
@njit
def method4():
    arr = a[a[:,0] != 1]
This time:
The execution time was about 45 % of the time for method1b.
But since a = df.values has been executed before the test loop,
there are doubts whether this result is comparable with earlier tests.
Anyway, try Numba on your own, maybe it will be an interesting option for you.
You may find using numpy.where useful here. It converts a Boolean mask to array indices, making life much cheaper. Combining this with numpy.vstack allows for some memory-cheap operations:
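def method3():
    # Note: == 1 selects the rows where the condition is 1;
    # use != 1 to drop those rows as in the question
    wh = np.where(array_condition == 1)
    return np.vstack(tuple(col[wh] for col in (array1, array2, array3)))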
This gives the following timeits:
>>> %timeit method2()
180 ms ± 6.66 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit method3()
96.9 ms ± 2.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Tuple unpacking allows the operation to be fairly light on memory, as when the object is vstack-ed back together, it is smaller. If you need to get your columns out of a DataFrame directly, the following code snippet may be useful:
def method3b():
    wh = np.where(array_condition == 1)
    col_names = ['array1','array2','array3']
    return np.vstack(tuple(col[wh] for col in tuple(df[col_name].to_numpy()
        for col_name in col_names)))
This allows one to grab columns by name from the DataFrame, which are then tuple unpacked on the fly. The speed is about the same:
>>> %timeit method3b()
96.6 ms ± 3.09 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Enjoy!
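On the first question (a cleaner way to apply one mask to many columns in numpy), one possible sketch is to keep the arrays in a dict keyed by name and filter them with a single comprehension. The helper name drop_rows and the small random data are made up for illustration; this has not been timed against the methods above.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
columns = {
    "array_condition": rng.integers(0, 3, size=n),
    "array1": rng.random(n),
    "array2": rng.random(n),
    "array3": rng.random(n),
}

def drop_rows(cols, keep):
    """Apply one boolean mask to every array in the dict."""
    return {name: arr[keep] for name, arr in cols.items()}

# Keep every row whose condition value is not 1
keep = columns["array_condition"] != 1
filtered = drop_rows(columns, keep)

print({name: arr.shape for name, arr in filtered.items()})
```

Adding or omitting a column then only changes the dict, not the filtering code, which may help with the 30-column and optional-column cases mentioned in the question.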

How does Pandas compute exponential moving averages under the hood?

I am trying to compare pandas EMA performance to numba performance.
Generally, I don't write functions if they are already in-built with pandas, as pandas will always be faster than my slow hand-coded python functions; for example quantile, sort values etc. I believe this is because much of pandas is coded in C under the hood, as well as pandas .apply() methods being much faster than explicit python for loops due to vectorization (but I'm open to an explanation if this is not true). But here, for computing EMA's, I have found that using numba far outperforms pandas.
The EMA I have coded is defined by
S_t = Y_1, t = 1
S_t = alpha*Y_t + (1 - alpha)*S_{t-1}, t > 1
where Y_t is the value of the time series at time t, S_t is the value of the moving average at time t, and alpha is the smoothing parameter.
The code is as follows
from numba import jit
import pandas as pd
import numpy as np
@jit
def ewm(arr, alpha):
    """
    Calculate the EMA of an array arr
    :param arr: numpy array of floats
    :param alpha: float between 0 and 1
    :return: numpy array of floats
    """
    # initialise ewm_arr
    ewm_arr = np.zeros_like(arr)
    ewm_arr[0] = arr[0]
    for t in range(1,arr.shape[0]):
        ewm_arr[t] = alpha*arr[t] + (1 - alpha)*ewm_arr[t-1]
    return ewm_arr
# initialize array and dataframe randomly
a = np.random.random(10000)
df = pd.DataFrame(a)
%timeit df.ewm(com=0.5, adjust=False).mean()
>>> 1000 loops, best of 3: 1.77 ms per loop
%timeit ewm(a, 0.5)
>>> 10000 loops, best of 3: 34.8 µs per loop
We see that the hand-coded ewm function is around 50 times faster than the pandas ewm method.
It may be the case that numba also outperforms various other pandas methods depending how one codes their function. But here I am interested in how numba outperforms pandas in calculating Exponential Moving Averages. What is pandas doing (not doing) that makes it slow - or is it that numba is just extremely fast in this case? How does pandas compute EMA's under the hood?
But here I am interested in how numba outperforms Pandas in calculating exponential moving averages.
Your version appears to be faster solely because you're passing it a NumPy array rather than a Pandas data structure:
>>> s = pd.Series(np.random.random(10000))
>>> %timeit ewm(s, alpha=0.5)
82 ms ± 10.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit ewm(s.values, alpha=0.5)
26 µs ± 193 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
>>> %timeit s.ewm(alpha=0.5).mean()
852 µs ± 5.44 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In general, comparing NumPy versus Pandas operations is apples-to-oranges. The latter is built on top of the former and will almost always trade speed for flexibility. (But, taking that into consideration, Pandas is still fast and has come to rely more heavily on Cython ops over time.) I'm not sure specifically what it is about numba/jit that behaves better with NumPy. But if you compare both functions using a Pandas Series, Pandas itself comes out faster.
How does Pandas compute EMAs under the hood?
When you call df.ewm() (without yet calling methods such as .mean() or .cov()), the intermediate result is a bona fide class EWM that's found in pandas/core/window.py.
>>> ewm = pd.DataFrame().ewm(alpha=0.1)
>>> type(ewm)
<class 'pandas.core.window.EWM'>
Whether you pass com, span, halflife, or alpha, Pandas will map this back to a com and use that.
When you call the method itself, such as ewm.mean(), this maps to ._apply(), which in this case serves as a router to the appropriate Cython function:
cfunc = getattr(_window, func, None)
In the case of .mean(), func is "ewma". _window is the Cython module pandas/_libs/window.pyx.
That brings you to the heart of things, at the function ewma(), which is where the bulk of the work takes place:
weighted_avg = ((old_wt * weighted_avg) +
(new_wt * cur)) / (old_wt + new_wt)
If you'd like a fairer comparison, call this function directly with the underlying NumPy values:
>>> from pandas._libs.window import ewma
>>> %timeit ewma(s.values, 0.4, 0, 0, 0)
513 µs ± 10.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
(Remember, it takes only a com; for that, you can use pandas.core.window._get_center_of_mass().)
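To see how the recursion from the question lines up with the public pandas API, here is a minimal check written as a plain NumPy loop (no numba, so it runs anywhere; the name ewm_loop and the random data are made up for illustration). It relies on the documented relation alpha = 1 / (1 + com) and on adjust=False, which selects the recursive form.

```python
import numpy as np
import pandas as pd

def ewm_loop(arr, alpha):
    """Recursive EMA: S_1 = Y_1, S_t = alpha*Y_t + (1 - alpha)*S_{t-1}."""
    out = np.empty_like(arr, dtype=float)
    out[0] = arr[0]
    for t in range(1, arr.shape[0]):
        out[t] = alpha * arr[t] + (1 - alpha) * out[t - 1]
    return out

rng = np.random.default_rng(0)
a = rng.random(1000)

com = 0.5
alpha = 1 / (1 + com)  # pandas' documented relation between com and alpha

by_hand = ewm_loop(a, alpha)
by_pandas = pd.Series(a).ewm(com=com, adjust=False).mean().to_numpy()

print(np.allclose(by_hand, by_pandas))
```

Note that com=0.5 corresponds to alpha = 2/3, so the two timed calls in the question, df.ewm(com=0.5, adjust=False) and ewm(a, 0.5), do not compute exactly the same average.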

How to optimize a numpy loop that sums values from an array which is indexed by another array where values equal the loop index

I have this piece of code that is called multiple times during the run of the application.
It takes an array of numbers which represent values (value_array).
These should be summed up in zones, which are defined in the zone_array.
zone_ids represents a list of all the possible zones in zone_array.
It's basically something along the lines of: I have a population raster map and I want to know how many people live in each zone of the zone map.
the code:
values = np.zeros(len(zone_ids))
for i in zone_ids:
    values[i] = round(np.nansum(value_array[zone_array == i]), 2)
return values
The culprit seems to be the for loop, but I have not found a way to eliminate it and get the same results.
I tried it with bincount but I did not succeed.
Using numba jit also has no effect.
I would like to stay away from cython as this code will be used in a Qgis plugin which has no cython support.
test code:
import numpy as np
def fill_values(zone_array, value_array, zone_ids):
    values = np.zeros(len(zone_ids))
    for i in zone_ids:
        values[i] = round(np.nansum(value_array[zone_array == i]), 2)
    return values
def run():
    # 300 different zones
    zone_ids = range(300)
    # zone map with 300 zones
    zone_array = (np.random.rand(2000, 2000) * 300).astype(int)
    # value map from which we want the sum of values per zone (real map can have NaN values)
    value_array = (np.random.rand(2000, 2000) * 10.)
    value_array[5, 5] = np.nan  # np.NAN was removed in NumPy 2.0
    fill_values(zone_array, value_array, zone_ids)
if __name__ == '__main__':
    run()
1.92 s ± 17.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
With the implementation of bincount as suggested by Divakar:
203 ms ± 15.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
With a direct usage of bincount, you would have NaNs in the summations. So, you can simply replace the NaNs with zeros and use bincount. This should be much faster, being a vectorized solution.
Hence, the implementation would be -
val_nonan = np.where(np.isnan(value_array), 0, value_array)
out = np.round(np.bincount(zone_array.ravel(), val_nonan.ravel()),2)
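As a sanity check, the bincount version can be compared against the original loop on a small map. This is a minimal sketch with made-up sizes (5 zones on a 20 x 20 grid); the minlength argument is an addition not in the answer above, used here so the output has one entry per zone even if the highest zone ids happen not to occur in the map.

```python
import numpy as np

rng = np.random.default_rng(0)
n_zones = 5
zone_array = rng.integers(0, n_zones, size=(20, 20))
value_array = rng.random((20, 20)) * 10.0
value_array[5, 5] = np.nan

# Reference: one masked nansum per zone, as in the question's loop
loop_sums = np.array(
    [np.nansum(value_array[zone_array == i]) for i in range(n_zones)]
)

# Vectorized: zero out the NaNs, then sum the values per zone id in one pass
val_nonan = np.where(np.isnan(value_array), 0, value_array)
bincount_sums = np.bincount(
    zone_array.ravel(), weights=val_nonan.ravel(), minlength=n_zones
)

print(np.allclose(loop_sums, bincount_sums))
```

Rounding is left out of the comparison; np.round(bincount_sums, 2) can be applied afterwards as in the answer.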
