I am trying to parallelize a function on a pandas DataFrame, and I am wondering why the parallelization is so much slower than the single-core solution. I am aware that parallelization has its costs... but I am curious whether there is a way to improve the code so that the parallelized version becomes faster.
In my case I have a list of User-IDs (300,000 entries, all strings) and need to check whether each User-ID is also present in another list containing only 10,000 entries.
As I cannot reproduce the original code, I am giving an example with integers that results in the same performance problem:
import pandas as pd
import numpy as np
from joblib import Parallel, delayed
import time
df = pd.DataFrame({'All': np.random.randint(50000, size=300000)})
selection = pd.Series(np.random.randint(10000, size=10000)).to_list()
t1 = time.perf_counter()
df['Is_in_selection_single'] = np.where(np.isin(df['All'], selection), 1, 0).astype('int8')
t2 = time.perf_counter()
print(t2 - t1)

def add_column(x):
    return np.where(np.isin(x, selection), 1, 0)

df['Is_in_selection_parallel'] = Parallel(n_jobs=4)(delayed(add_column)(x) for x in df['All'].to_list())
t3 = time.perf_counter()
print(t3 - t2)
The time-print results in the following:
0.0307
53.07
which means the parallelized version is roughly 1,700 times slower than the single-core one.
In my real-world example with the User-IDs, the single-core version takes 1 minute, but the parallelized one had not finished after 15 minutes...
I need the parallelization because this operation has to be done several times, so the final script takes several minutes to run.
Thank you for any suggestions!
You are splitting the job into too many sub-jobs (one per row), which creates a very large overhead. You should cut the work into a smaller number of chunks:
parallel_result = Parallel(n_jobs=4)(delayed(add_column)(x) for x in np.split(df['All'].values, 4))
df['Is_in_selection_parallel'] = np.concatenate(parallel_result)
Using 4 chunks was about 50% faster than the non-parallel version on my platform.
Using a set for membership testing provided a 2.5x improvement on my system. This could be used in addition to parallel computations (a combined sketch follows the timing results below).
df = pd.DataFrame({'All': np.random.randint(50000, size=300000)})
selection = np.random.randint(10000, size=10000)
s1 = pd.Series(selection)
s2 = set(selection)
def orig(df, s):
    df['Is_in_selection_single'] = np.where(
        np.isin(df['All'], s), 1, 0).astype('int8')
    return sum(df['Is_in_selection_single'])

def modified(df, s):
    df['Is_in_selection_single'] = df['All'].isin(s)
    return sum(df['Is_in_selection_single'])
Timing results:
%timeit orig(df, s1)
47.1 ms ± 212 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit modified(df, s2)
19 ms ± 194 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
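As a rough sketch of combining the two ideas above (set-based membership testing plus chunked parallelism), something like the following should work; the 4-chunk split, the helper name check_chunk, and the column name Is_in_selection are my own choices, not part of the original answers:

import numpy as np
import pandas as pd
from joblib import Parallel, delayed

df = pd.DataFrame({'All': np.random.randint(50000, size=300000)})
selection_set = set(np.random.randint(10000, size=10000))

def check_chunk(values, selection_set):
    # Series.isin accepts a set and returns a boolean mask
    return pd.Series(values).isin(selection_set).to_numpy()

chunks = np.array_split(df['All'].to_numpy(), 4)
parts = Parallel(n_jobs=4)(delayed(check_chunk)(c, selection_set) for c in chunks)
df['Is_in_selection'] = np.concatenate(parts).astype('int8')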
train.loc[:,'nd_mean_2021-04-15':'nd_mean_2021-08-27'] > train['q_5']
I get `FutureWarning: Automatic reindexing on DataFrame vs Series comparisons is deprecated and will raise ValueError in a future version. Do left, right = left.align(right, axis=1, copy=False) before e.g. left == right` and strange output with a lot of columns, but I expected cell values masked with True or False so that I could calculate a sum in the next step.
Comparing each column separately works just fine:
train['nd_mean_2021-04-15'] > train['q_5']
but it is slow and makes for messy code.
I've tested your original solution, and two additional ways of performing this comparison you want to make.
To cut to the chase, the following option had the smallest execution time:
%%timeit
sliced_df = df.loc[:, 'nd_mean_2021-04-15':'nd_mean_2021-08-27']
comparison_df = pd.DataFrame({col: df['q_5'] for col in sliced_df.columns})
(sliced_df > comparison_df)
# 1.46 ms ± 610 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Drawback: it's a little bit messy and requires you to create two new objects (sliced_df and comparison_df).
Option 2: Using DataFrame.apply (slower but more readable)
The second option, although slower than your original solution and the implementation above, is in my opinion the cleanest and easiest to read of them all. If you're not processing large amounts of data (I assume not, since you're using pandas rather than tools like Dask or Spark, which are more suitable for large volumes of data), then it's worth bringing to the discussion table:
%%timeit
df.loc[:, 'nd_mean_2021-04-15':'nd_mean_2021-08-27'].apply(lambda col: col > df['q_5'])
# 5.66 ms ± 897 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Original Solution
I've also tested the performance of your original implementation and here's what I got:
%%timeit
df.loc[:, 'nd_mean_2021-04-15':'nd_mean_2021-08-27'] > df['q_5']
# 2.02 ms ± 175 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Side note: if the FutureWarning message is bothering you, there's always the option to ignore it by adding the following code after your script imports:
import warnings
warnings.filterwarnings('ignore', category=FutureWarning)
DataFrame Used for Testing
All of the above implementations used the same dataframe, which I created using the following code:
import pandas as pd
import numpy as np
columns = list(
    map(
        lambda value: f'nd_mean_{value}',
        pd.date_range('2021-04-15', '2021-08-27', freq='W').to_series().dt.strftime('%Y-%m-%d').to_list()
    )
)
df = pd.DataFrame(
    {col: np.random.randint(0, 100, 10) for col in [*columns, 'q_5']}
)
I have a large dataset comprising millions of rows and around 6 columns. The data is currently in a Pandas dataframe and I'm looking for the fastest way to operate on it. For example, let's say I want to drop all the rows where the value in one column is "1".
Here's my minimal working example:
import numpy as np
import pandas as pd

# Create dummy data arrays and pandas dataframe
array_size = int(5e6)
array1 = np.random.rand(array_size)
array2 = np.random.rand(array_size)
array3 = np.random.rand(array_size)
array_condition = np.random.randint(0, 3, size=array_size)
df = pd.DataFrame({'array_condition': array_condition, 'array1': array1, 'array2': array2, 'array3': array3})
def method1():
    df_new = df.drop(df[df.array_condition == 1].index)
EDIT: As Henry Yik pointed out in the comments, a faster Pandas approach is this:
def method1b():
    df_new = df[df.array_condition != 1]
I believe that Pandas can be quite slow at this sort of thing, so I also implemented a method using numpy, processing each column as a separate array:
def method2():
    masking = array_condition != 1
    array1_new = array1[masking]
    array2_new = array2[masking]
    array3_new = array3[masking]
    array_condition_new = array_condition[masking]
And the results:
%timeit method1()
625 ms ± 7.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit method1b()
158 ms ± 7.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit method2()
138 ms ± 3.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
So we do see a performance boost using numpy, though only a slight one over the improved Pandas approach. However, this comes at the cost of much less readable code (i.e. having to create a mask and apply it to each array). This method also doesn't seem as scalable: if I have, say, 30 columns of data, I'll need a lot of lines of code applying the mask to every array! Additionally, it would be useful to allow optional columns, and this method may fail when trying to operate on arrays which are empty.
Therefore, I have 2 questions:
1) Is there a cleaner / more flexible way to implement this in numpy?
2) Or better, is there any higher performance method I could use here? e.g. JIT (numba?), Cython or something else?
PS, in practice, in-place operations can be used, replacing the old array with the new one once data is dropped
Part 1: Pandas and (maybe) Numpy
Compare your method1b and method2:
method1b generates a DataFrame, which is probably what you want,
method2 generates a Numpy array, so to get a fully comparable result you should subsequently generate a DataFrame from it.
So I changed your method2 to:
def method2():
    masking = array_condition != 1
    array1_new = array1[masking]
    array2_new = array2[masking]
    array3_new = array3[masking]
    array_condition_new = array_condition[masking]
    df_new = pd.DataFrame({'array_condition': array_condition_new,
                           'array1': array1_new, 'array2': array2_new, 'array3': array3_new})
and then compared execution times (using %timeit).
The result was that my (expanded) version of method2 took about 5% longer than method1b (check this on your own).
So my opinion is that, as long as only a single operation is concerned, it is probably better to stay with Pandas.
But if you want to perform a couple of operations in sequence on your source DataFrame, and / or you are satisfied with the result being a Numpy array, it is worthwhile to (see the sketch after this list):
Call arr = df.values to get the underlying Numpy array.
Perform all required operations on it using Numpy methods.
(Optionally) create a DataFrame from the final result.
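A minimal sketch of that round trip, assuming df is the example DataFrame from the question (the variable names are mine):

# 1. get the underlying Numpy array
arr = df.values

# 2. perform the required operations with Numpy (here: drop rows where array_condition == 1)
arr_filtered = arr[arr[:, 0] != 1]

# 3. (optionally) build a DataFrame from the final result
df_new = pd.DataFrame(arr_filtered, columns=df.columns)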
I tried a Numpy version of method1b:
def method3():
    a = df.values
    arr = a[a[:, 0] != 1]
but the execution time was about 40% longer.
The reason is probably that a Numpy array must have all its elements of the same type, so the array_condition column is coerced to float and only then is the whole Numpy array created, which takes some time.
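A quick check (my addition, using the example DataFrame above) makes the coercion visible:

print(df.dtypes)        # array_condition is an integer column, the rest are float64
print(df.values.dtype)  # the combined array is float64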
Part 2: Numpy and Numba
An alternative to consider is the Numba package - a just-in-time Python compiler.
I made the following test. First, I created a Numpy array (as a preliminary step):
a = df.values
The reason is that JIT compiled methods are able to use Numpy methods and types,
but not those of Pandas.
To perform the test, I used almost the same method as above, but with the @njit decorator (requires from numba import njit):
@njit
def method4():
    arr = a[a[:, 0] != 1]
This time the execution time was about 45% of the time for method1b.
But since a = df.values had already been executed before the timed test, there are doubts whether this result is comparable with the earlier tests.
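One way to reduce that doubt (my own sketch, not part of the original test) is to pass the array as an argument and include the df.values conversion in the timed call:

from numba import njit

@njit
def drop_ones(arr):
    # keep only the rows whose first column is not equal to 1
    return arr[arr[:, 0] != 1]

# the DataFrame -> ndarray conversion is now part of the measured time
%timeit drop_ones(df.values)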
Anyway, try Numba on your own, maybe it will be an interesting option for you.
You may find using numpy.where useful here. It converts a Boolean mask to array indices, making life much cheaper. Combining this with numpy.vstack allows for some memory-cheap operations:
def method3():
    wh = np.where(array_condition == 1)
    return np.vstack(tuple(col[wh] for col in (array1, array2, array3)))
This gives the following timeits:
>>> %timeit method2()
180 ms ± 6.66 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit method3()
96.9 ms ± 2.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Tuple unpacking allows the operation to be fairly light on memory, as when the object is vstack-ed back together, it is smaller. If you need to get your columns out of a DataFrame directly, the following code snippet may be useful:
def method3b():
    wh = np.where(array_condition == 1)
    col_names = ['array1', 'array2', 'array3']
    return np.vstack(tuple(col[wh] for col in tuple(df[col_name].to_numpy()
                                                    for col_name in col_names)))
This allows one to grab columns by name from the DataFrame, which are then tuple unpacked on the fly. The speed is about the same:
>>> %timeit method3b()
96.6 ms ± 3.09 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Enjoy!
I am trying to compare pandas EMA performance to numba performance.
Generally, I don't write functions if they are already built into pandas, as pandas will always be faster than my slow hand-coded Python functions; for example quantile, sort_values, etc. I believe this is because much of pandas is coded in C under the hood, and that pandas' .apply() method is much faster than explicit Python for loops due to vectorization (but I'm open to an explanation if this is not true). But here, for computing EMAs, I have found that numba far outperforms pandas.
The EMA I have coded is defined by
S_t = Y_1, t = 1
S_t = alpha*Y_t + (1 - alpha)*S_{t-1}, t > 1
where Y_t is the value of the time series at time t, S_t is the value of the moving average at time t, and alpha is the smoothing parameter.
The code is as follows
from numba import jit
import pandas as pd
import numpy as np

@jit
def ewm(arr, alpha):
    """
    Calculate the EMA of an array arr
    :param arr: numpy array of floats
    :param alpha: float between 0 and 1
    :return: numpy array of floats
    """
    # initialise ewm_arr
    ewm_arr = np.zeros_like(arr)
    ewm_arr[0] = arr[0]
    for t in range(1, arr.shape[0]):
        ewm_arr[t] = alpha*arr[t] + (1 - alpha)*ewm_arr[t-1]
    return ewm_arr
# initialize array and dataframe randomly
a = np.random.random(10000)
df = pd.DataFrame(a)
%timeit df.ewm(com=0.5, adjust=False).mean()
>>> 1000 loops, best of 3: 1.77 ms per loop
%timeit ewm(a, 0.5)
>>> 10000 loops, best of 3: 34.8 µs per loop
We see that the hand-coded ewm function is around 50 times faster than the pandas ewm method.
It may be the case that numba also outperforms various other pandas methods, depending on how one codes their function. But here I am interested in how numba outperforms pandas in calculating exponential moving averages. What is pandas doing (or not doing) that makes it slow - or is it that numba is just extremely fast in this case? How does pandas compute EMAs under the hood?
But here I am interested in how numba outperforms Pandas in calculating exponential moving averages.
Your version appears to be faster solely because you're passing it a NumPy array rather than a Pandas data structure:
>>> s = pd.Series(np.random.random(10000))
>>> %timeit ewm(s, alpha=0.5)
82 ms ± 10.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit ewm(s.values, alpha=0.5)
26 µs ± 193 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
>>> %timeit s.ewm(alpha=0.5).mean()
852 µs ± 5.44 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In general, comparing NumPy versus Pandas operations is apples-to-oranges. The latter is built on top of the former and will almost always trade speed for flexibility. (But, taking that into consideration, Pandas is still fast and has come to rely more heavily on Cython ops over time.) I'm not sure specifically what it is about numba/jit that behaves better with NumPy. But if you compare both functions using a Pandas Series, Pandas itself comes out faster.
How does Pandas compute EMAs under the hood?
When you call df.ewm() (without yet calling any of the methods such as .mean() or .cov()), the intermediate result is a bona fide class EWM that's found in pandas/core/window.py.
>>> ewm = pd.DataFrame().ewm(alpha=0.1)
>>> type(ewm)
<class 'pandas.core.window.EWM'>
Whether you pass com, span, halflife, or alpha, Pandas will map this back to a com and use that.
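For reference (my addition, based on the documented relationship rather than the source walk-through below): alpha = 1 / (1 + com), so the two parameterisations produce identical results:

import numpy as np
import pandas as pd

s = pd.Series(np.random.random(1000))

alpha = 0.5
com = 1 / alpha - 1  # = 1.0 for alpha = 0.5

assert np.allclose(s.ewm(alpha=alpha, adjust=False).mean(),
                   s.ewm(com=com, adjust=False).mean())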
When you call the method itself, such as ewm.mean(), this maps to ._apply(), which in this case serves as a router to the appropriate Cython function:
cfunc = getattr(_window, func, None)
In the case of .mean(), func is "ewma". _window is the Cython module pandas/libs/window.pyx.
That brings you to the heart of things, at the function ewma(), which is where the bulk of the work takes place:
weighted_avg = ((old_wt * weighted_avg) +
(new_wt * cur)) / (old_wt + new_wt)
If you'd like a fairer comparison, call this function directly with the underlying NumPy values:
>>> from pandas._libs.window import ewma
>>> %timeit ewma(s.values, 0.4, 0, 0, 0)
513 µs ± 10.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
(Remember, it takes only a com; for that, you can use pandas.core.window._get_center_of_mass().)
I have a DataFrame df with 541 columns, and I need to save all unique pairs of its column names into the rows of a separate DataFrame, repeated 8 times each.
I thought I would create an empty DataFrame fp, double loop through df's column names, insert into every 8th row, and fill in the blanks with the last available value.
When I tried to do this, though, I was baffled by how long it was taking. With 541 columns I only have to write 146,611 times, yet it's taking well over 20 minutes. This seems egregious for simple data access. Where is the problem and how can I solve it? It takes less time than that for Pandas to produce a correlation matrix with the columns, so I must be doing something wrong.
Here's a reproducible example of what I mean:
fp = np.empty(shape = (146611, 10))
fp.fill(np.nan)
fp = pd.DataFrame(fp)
%timeit for idx in range(0, len(fp)): fp.iloc[idx, 0] = idx
# 1 loop, best of 3: 22.3 s per loop
Don't do iloc/loc/chained-indexing. Using the NumPy interface alone increases speed by ~180x. If you further remove element access, we can bump this to 180,000x.
fp = np.empty(shape = (146611, 10))
fp.fill(np.nan)
fp = pd.DataFrame(fp)
# this confirms how slow data access is on my computer
%timeit for idx in range(0, len(fp)): fp.iloc[idx, 0] = idx
1 loops, best of 3: 3min 9s per loop
# this accesses the underlying NumPy array, so you can directly set the data
%timeit for idx in range(0, len(fp)): fp.values[idx, 0] = idx
1 loops, best of 3: 1.19 s per loop
This is because extensive Python-layer code runs for this fancy indexing, taking ~10 µs per loop. Pandas indexing should be used to retrieve entire subsets of data, which you then use to perform vectorized operations on the whole dataframe. Individual element access is glacial: using plain Python dictionaries would give you a >180-fold increase in performance.
Things get a lot better when you access columns or rows instead of individual elements: 3 orders of magnitude better.
# set all items in 1 go.
%timeit fp[0] = np.arange(146611)
1000 loops, best of 3: 814 µs per loop
Moral
Don't try to access individual elements via chained indexing, loc, or iloc. Generate a NumPy array in a single allocation, from a Python list (or a C-interface if performance is absolutely critical), and then perform operations on entire columns or dataframes.
Using NumPy arrays and performing operations directly on columns rather than individual elements, we got a whopping 180,000+ fold increase in performance. Not too shabby.
Edit
Comments from #kushy suggest Pandas may have optimized indexing in certain cases since I originally wrote this answer. Always profile your own code, and your mileage may vary.
Alexander's answer was the fastest for me as of 2020-01-06, when using .to_numpy() instead of .values. Tested in a Jupyter Notebook on Windows 10. Pandas version = 0.24.2
import numpy as np
import pandas as pd
fp = np.empty(shape = (146611, 10))
fp.fill(np.nan)
fp = pd.DataFrame(fp)
pd.__version__ # '0.24.2'
def func1():
    # Asker badmax solution
    for idx in range(0, len(fp)):
        fp.iloc[idx, 0] = idx

def func2():
    # Alexander Huszagh solution 1
    for idx in range(0, len(fp)):
        fp.to_numpy()[idx, 0] = idx

def func3():
    # user4322543 answer to
    # https://stackoverflow.com/questions/34855859/is-there-a-way-in-pandas-to-use-previous-row-value-in-dataframe-apply-when-previ
    new = []
    for idx in range(0, len(fp)):
        new.append(idx)
    fp[0] = new

def func4():
    # Alexander Huszagh solution 2
    fp[0] = np.arange(146611)
%timeit func1
19.7 ns ± 1.08 ns per loop (mean ± std. dev. of 7 runs, 500000000 loops each)
%timeit func2
19.1 ns ± 0.465 ns per loop (mean ± std. dev. of 7 runs, 500000000 loops each)
%timeit func3
21.1 ns ± 3.26 ns per loop (mean ± std. dev. of 7 runs, 500000000 loops each)
%timeit func4
24.7 ns ± 0.889 ns per loop (mean ± std. dev. of 7 runs, 50000000 loops each)
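Note (my addition): as written, %timeit func1 only times looking up the function name, which is why every result is in the tens of nanoseconds; to benchmark the functions themselves, they need to be called:

%timeit func1()
%timeit func2()
%timeit func3()
%timeit func4()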
I've written a bunch of code on the assumption that I was going to use Numpy arrays. Turns out the data I am getting is loaded through Pandas. I remember now that I loaded it in Pandas because I was having some problems loading it in Numpy. I believe the data was just too large.
Therefore I was wondering, is there a difference in computational ability when using Numpy vs Pandas?
If Pandas is more efficient then I would rather rewrite all my code for Pandas, but if there is no gain in efficiency then I'll just use a numpy array...
There can be a significant performance difference, of an order of magnitude for multiplications and multiple orders of magnitude for indexing a few random values.
I was actually wondering about the same thing and came across this interesting comparison:
http://penandpants.com/2014/09/05/performance-of-pandas-series-vs-numpy-arrays/
I think it's more about using the two strategically and shifting data around (from numpy to pandas or vice versa) based on the performance you see. As a recent example, I was trying to use numpy to concatenate 4 small pickle files, each with 10k rows (data.shape -> (10000, 4)).
The code was something like:
import glob
import joblib
import numpy as np

n_concat = np.empty((0, 4))
for file_path in glob.glob('data/0*', recursive=False):
    n_data = joblib.load(file_path)
    n_concat = np.vstack((n_concat, n_data))
joblib.dump(n_concat, 'data/save_file.pkl', compress=True)
This crashed my laptop (8 GB, i5) which was surprising since the volume wasn't really that huge. The 4 compressed pickled files were roughly around 5 MB each.
The same thing worked great with pandas:
for file_path in glob.glob('data/0*', recursive=False):
    n_data = joblib.load(file_path)
    try:
        df = pd.concat([df, pd.DataFrame(n_data, columns=[...])])
    except NameError:
        df = pd.concat([pd.DataFrame(n_data, columns=[...])])
joblib.dump(df, 'data/save_file.pkl', compress=True)
On the other hand, when I was implementing gradient descent by iterating over a pandas data frame, it was horribly slow, while using numpy for the job was much quicker.
In general, I've seen that pandas usually works better for moving around/munging moderately large chunks of data and doing common column operations while numpy works best for vectorized and recursive work (maybe more math intense work) over smaller sets of data.
Moving data between the two is hassle free, so I guess, using both strategically is the way to go.
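A tiny illustration of that hassle-free round trip (my addition, with throwaway data):

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, 2.0], 'b': [3.0, 4.0]})

arr = df.to_numpy()                           # pandas -> numpy
df2 = pd.DataFrame(arr, columns=df.columns)   # numpy -> pandas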
In my experiments on large numeric data, Pandas is consistently 20 TIMES SLOWER than Numpy. This is a huge difference, given that only simple arithmetic operations were performed: slicing of a column, mean(), searchsorted() - see below. Initially, I thought Pandas was based on numpy, or at least its implementation was C optimized just like numpy's. These assumptions turn out to be false, though, given the huge performance gap.
In examples below, data is a pandas frame with 8M rows and 3 columns (int32, float32, float32), without NaN values, column #0 (time) is sorted. data_np was created as data.values.astype('float32'). Results on Python 3.8, Ubuntu:
A. Column slices and mean():
# Pandas
%%timeit
x = data.x
for k in range(100): x[100000:100001+k*100].mean()
15.8 ms ± 101 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# Numpy
%%timeit
for k in range(100): data_np[100000:100001+k*100,1].mean()
874 µs ± 4.34 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Pandas is 18 times slower than Numpy (15.8ms vs 0.874 ms).
B. Search in a sorted column:
# Pandas
%timeit data.time.searchsorted(1492474643)
20.4 µs ± 920 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# Numpy
%timeit data_np[:, 0].searchsorted(1492474643)
1.03 µs ± 3.55 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Pandas is 20 times slower than Numpy (20.4µs vs 1.03µs).
EDIT: I implemented a namedarray class that bridges the gap between Pandas and Numpy: it is based on Numpy's ndarray class and hence performs better than Pandas (typically ~7x faster), and it is fully compatible with Numpy's API and all its operators, while keeping column names similar to Pandas' DataFrame, so that manipulating individual columns is easier. This is a prototype implementation. Unlike Pandas, namedarray does not allow different data types for columns. The code can be found here: https://github.com/mwojnars/nifty/blob/master/math.py (search for "namedarray").