How to speed up pandas string function?

How to speed up pandas string function? - python

I am using the pandas vectorized str.split() method to extract the first element returned from a split on "~". I also have also tried using df.apply() with a lambda and str.split() to produce equivalent results. When using %timeit, I'm finding that df.apply() is performing faster than the vectorized version.
Everything that I have read about vectorization seems to indicate that the first version should have better performance. Can someone please explain why I am getting these results? Example:
id facility
0 3466 abc~24353
1 4853 facility1~3.4.5.6
2 4582 53434_Facility~34432~cde
3 9972 facility2~FACILITY2~343
4 2356 Test~23 ~FAC1
The above dataframe has about 500,000 rows and I have also tested at around 1 million with similar results. Here is some example input and output:
Vectorization
In [1]: %timeit df['facility'] = df['facility'].str.split('~').str[0]
1.1 s ± 54.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Lambda Apply
In [2]: %timeit df['facility'] = df['facility'].astype(str).apply(lambda facility: facility.split('~')[0])
650 ms ± 52.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Does anyone know why I am getting this behavior?
Thanks!

Pandas string methods are only "vectorized" in the sense that you don't have to write the loop yourself. There isn't actually any parallelization going on, because string (especially regex problems) are inherently difficult (impossible?) to parallelize. If you really want speed, you actually should fall back to python here.
%timeit df['facility'].str.split('~', n=1).str[0]
%timeit [x.split('~', 1)[0] for x in df['facility'].tolist()]
411 ms ± 10.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
132 ms ± 302 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
For more information on when loops are faster than pandas functions, take a look at For loops with pandas - When should I care?.
As for why apply is faster, I'm of the belief that the function apply is applying (i.e., str.split) is a lot more lightweight than the string splitting happening in the bowels of Series.str.split.

Related

Pandas groupby: efficiently chain several functions

I need to group a DataFrame and apply several chained functions on each group.
My problem is basically the same as in pandas - Groupby two functions: apply cumsum then shift on each group.
There are answers there on how to obtain a correct result, however they seem to have a suboptimal performance. My specific question is thus: is there a more efficient way than the ones I describe below?
First here is some large testing data:
from string import ascii_lowercase
import numpy as np
import pandas as pd
n = 100_000_000
np.random.seed(0)
df = pd.DataFrame(
{
"x": np.random.choice(np.array([*ascii_lowercase]), size=n),
"y": np.random.normal(size=n),
}
)
Below is the performance of each function:
%timeit df.groupby("x")["y"].cumsum()
4.65 s ± 71 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df.groupby("x")["y"].shift()
5.29 s ± 54.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
A basic solution is to group twice. It seems suboptimal since grouping is a large part of the total runtime and should only be done once.
%timeit df.groupby("x")["y"].cumsum().groupby(df["x"]).shift()
10.1 s ± 63.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The accepted answer to the aforementioned question suggests to use apply with a custom function to avoid this issue. However for some reason it is actually performing much worse than the previous solution.
def cumsum_shift(s):
return s.cumsum().shift()
%timeit df.groupby("x")["y"].apply(cumsum_shift)
27.8 s ± 858 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Do you have any idea how to optimize this code? Especially in a case where I'd like to chain more than two functions, performance gains can become quite significant.

Let me know if this helps, few weeks back I was having the same issue.
I solved it by just spliting the code. And creating a separate groupby object which contains information about the groups.
# creating groupby object
g = df.groupby('x')['y']
%timeit g.cumsum()
592 ms ± 8.67 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit g.shift()
1.7 s ± 8.68 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

I would suggest to give a try to transform instead of apply
try this:
%timeit df.groupby("x")["y"].transform(np.cumsum).transform(lambda x: x.shift())
or, also try using
from toolz import pipe
%timeit df.groupby("x").pipe(lambda g: g["y"].cumsum().shift())
I am pretty sure that pipe can be more efficient than apply or transform
Let us know if it works well

Vectorization of array creation with variable indices in python - How to remove the for loop?

I am trying to vectorize creation of an array with variable indices that change with the loop variable. In the code snippet below, I want to remove the for loop and vectorize the array creation. Can someone kindly help?
#Vectorize 1
def abc(x):
return str(x)+'_variable'
ar = []
for i in range(0,100):
ar += [str('vectorize_')+abc(i)]

You're not going to get much improvement from "vectorization" here since you're working with strings, unfortunately. A pure Python comprehension is about as good as you'll be able to get, because of this constraint. "Vectorized" operations are only able to take advantage of optimized numerical C code when the data are numeric.
Here's an example of one way you might do what you want here:
In [4]: %timeit np.char.add(np.repeat("vectorize_variable_", 100), np.arange(100).astype(str))
108 µs ± 1.79 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
versus a pure Python comprehension:
In [5]: %timeit [f"vectorize_variable_{i}" for i in range(100)]
11.1 µs ± 175 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
As far as I know, using numpy really doesn't net you any performance benefits when working with strings. Of course, I may be mistaken, and would love if I am.
If you're still not convinced, here's the same test with n=10000:
In [6]: %timeit [f"vectorize_variable_{i}" for i in range(n)]
1.21 ms ± 23.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [7]: %timeit np.char.add(np.repeat("vectorize_variable_", n), np.arange(n).astype(str)
...: )
9.97 ms ± 40.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Pure Python is about 10x faster than the "vectorized" version.

Efficiently aggregate results into a Python Data Structure

I often find myself looping over some long INPUT list (or dataframe, or dictionary). Per iteration I do some calculations on the input data, I then push the results into some OUTPUT data structure. Often the final output is a dataframe (since it is convenient to deal with).
Below are two methods that loop over a long list, and aggregate some dummy results into a dataframe. Approach 1 is very slow (~3 seconds per run), whereas Approach 2 is very fast (~18 ms per run). Approach 1 is not good, because it is slow. Approach 2 is faster, but it is not ideal either, because it effectively "caches" data in a local file (and then relies on pandas to read that file back in very quickly). Ideally, we do everything in memory.
What approaches can people suggest to efficiently aggregate results? Bonus: And what if we don't know the exact size/length of our output structure (e.g. the actual output size may exceed the initial size estimate)? Any ideas appreciated.
import time
import pandas as pd
def run1(long_list):
my_df = pd.DataFrame(columns=['A','B','C'])
for el in long_list:
my_df.loc[(len)] = [el, el+1, 1/el] # Dummy calculations
return my_df
def run2(long_list):
with open('my_file.csv', 'w') as f:
f.write('A,B,C\n')
for el in long_list:
f.write(f'{el},{el+1},{1/el}\n') # Dummy calculations
return pd.read_csv('my_file.csv')
long_list = range(1, 2000)
%timeit df1 = run1(long_list) # 3 s ± 349 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df2 = run2(long_list) # 18 ms ± 697 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

You can do this by creating and then dropping a dummy input column and doing all of the calculations directly in pandas:
def func(long_list):
my_df = pd.DataFrame(long_list, columns=['input'])
my_df = my_df.assign(
A=my_df.input,
B=my_df.input+1,
C=1/my_df.input)
return my_df.drop('input', axis=1)
Comparing the times:
%timeit df1 = run1(long_list)
%timeit df2 = run2(long_list)
%timeit df3 = func(long_list)
3.81 s ± 6.99 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
5.54 ms ± 28.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
3.19 ms ± 3.95 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Pros:
All in memory
Really fast
Easy to read
Cons:
Probably not as fast as vectorized Numpy operations

You can directly build a DataFrame from a list of lists:
def run3(long_list):
return pd.DataFrame([[el, el+1, 1/el] for el in long_list],
columns=['A','B','C'])
It should be much faster than first one, and still faster that second one, because it does not use disk io.

Pandas' feather format is slow when writing a column of None

I'm testing out feather-format as a way to store pandas DataFrame files. The performance of feather seems to be extremely poor when writing columns consisting entirely of None (info() gives 0 non-null object). The following code well encapsulates the issue:
df1 = pd.DataFrame(data={'x': 1000*[None]})
%timeit df1.to_feather('.../x.feather')
5.35 s ± 303 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df1.to_pickle('.../x.pkl')
734 ms ± 60.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df1.to_parquet('.../x.parquet')
200 ms ± 5.84 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I'm using feather-format 0.4.0, pandas 0.23.4, and pyarrow 0.13.0.
How can I get these kinds of DataFrames to save without taking forever?

You could try adding a specific dtype. That being said, the numbers are a little surprising in terms of how poor feather performance is.

Why is NumPy sometimes slower than NumPy + plain Python loop?

This is based on this question asked 2018-10.
Consider the following code. Three simple functions to count non-zero elements in a NumPy 3D array (1000 × 1000 × 1000).
import numpy as np
def f_1(arr):
return np.sum(arr > 0)
def f_2(arr):
ans = 0
for val in range(arr.shape[0]):
ans += np.sum(arr[val, :, :] > 0)
return ans
def f_3(arr):
return np.count_nonzero(arr)
if __name__ == '__main__':
data = np.random.randint(0, 10, (1_000, 1_000, 1_000))
print(f_1(data))
print(f_2(data))
print(f_3(data))
Runtimes on my machine (Python 3.7.?, Windows 10, NumPy 1.16.?):
%timeit f_1(data)
1.73 s ± 21.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit f_2(data)
1.4 s ± 1.36 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit f_3(data)
2.38 s ± 956 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
So, f_2() works faster than f_1() and f_3(). However, it's not the case with data of smaller size. The question is - why so? Is it NumPy, Python, or something else?

This is due to memory access and caching. Each of these functions is doing two things, taking the first code as an example:
np.sum(arr > 0)
It first does a comparison to find where arr is greater than zero (or non-zero, since arr contains non-negative integers). This creates an intermediate array the same shape as arr. Then, it sums this array.
Straightforward, right? Well, when using np.sum(arr > 0) this is a large array. When it's large enough to not fit in cache, performance will decrease since when the processor starts to execute the sum most of the array elements will have been evicted from memory and need to be reloaded.
Since f_2 iterates over the first dimension, it is dealing with smaller sub-arrays. The same copy and sum is done, but this time the intermediate array fits in memory. It's created, used, and destroyed without ever leaving memory. This is much faster.
Now, you would think that f_3 would be fastest (using an in-built method and all), but looking at the source code shows that it uses the following operations:
a_bool = a.astype(np.bool_, copy=False)
return a_bool.sum(axis=axis, dtype=np.intp
a_bool is just another way of finding the non-zero entries, and creates a large intermediate array.
Conclusions
Rules of thumb are just that, and are frequently wrong. If you want faster code, profile it and see what the problems are (good work on that here).
Python does some things very well. In cases where it's optimized, it can be faster than numpy. Don't be afraid to use plain old python code or datatypes in combination with numpy.
If you find frequently yourself manually writing for loops for better performance you may want to take a look at numexpr - it automatically does some of this. I haven't used it much myself, but it should provide a good speedup if intermediate arrays are what's slowing down your program.

It's all a matter of how the data is laid out in memory and how the code accesses it. Essentially, data is fetched from the memory in blocks which are then cached; if an algorithm manages to use data from a block that is in the cache, there is no need to read from memory again. This can result in huge time savings, especially when the cache is much smaller than the data you are dealing with.
Consider these variations, which only differ in which axis we are iterating on:
def f_2_0(arr):
ans = 0
for val in range(arr.shape[0]):
ans += np.sum(arr[val, :, :] > 0)
return ans
def f_2_1(arr):
ans = 0
for val in range(arr.shape[1]):
ans += np.sum(arr[:, val, :] > 0)
return ans
def f_2_2(arr):
ans = 0
for val in range(arr.shape[2]):
ans += np.sum(arr[:, :, val] > 0)
return ans
And the results on my laptop:
%timeit f_1(data)
2.31 s ± 47.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit f_2_0(data)
1.88 s ± 60 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit f_2_1(data)
2.65 s ± 142 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit f_2_2(data)
12.8 s ± 650 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
You can see that f_2_1 almost as fast as f_1, which makes me think that numpy is not using the optimal access pattern (the one used by f_2_0). The explanation for how exactly caching affects the timing is in the other answer.

Let's remove the temporary array completely
As #user2699 already mentioned in his answer, allocating and writing to a large array that doesn't fit in cache can slow down the process quite a lot. To show this behavior I have written two small functions using Numba (JIT-Compiler).
In compiled languages (C,Fortran,..) you normally avoid temporary arrays. In interpreted Python (without using Cython or Numba) you often want to call a compiled function on a larger chunk of data (vectorization) because loops in interpreted code are extremely slow. But this can also have a view downsides (like temporary arrays, bad cache usage)
Function without temporary array allocation
#nb.njit(fastmath=True,parallel=False)
def f_4(arr):
sum=0
for i in nb.prange(arr.shape[0]):
for j in range(arr.shape[1]):
for k in range(arr.shape[2]):
if arr[i,j,k]>0:
sum+=1
return sum
With temporary array
Please note that if you turn on parallelization parallel=True, the compiler does not only try to parallelize the code, but also other optimizations like loop fusing are turned on.
#nb.njit(fastmath=True,parallel=False)
def f_5(arr):
return np.sum(arr>0)
Timings
%timeit f_1(data)
1.65 s ± 48.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit f_2(data)
1.27 s ± 5.66 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit f_3(data)
1.99 s ± 6.11 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit f_4(data) #parallel=false
216 ms ± 5.45 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit f_4(data) #parallel=true
121 ms ± 4.85 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit f_5(data) #parallel=False
1.12 s ± 19 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit f_5(data) #parallel=true Temp-Array is automatically optimized away
146 ms ± 12.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.