for loop performance of pandas Series - python

I am curious why, when applying a function to each element of a pd.Series inside a for loop, the execution time looks significantly better than O(N).
Consider the function below, which rotates a number bit-wise; the code itself is not important here.
def rotate(x: np.uint32) -> np.uint32:
    return np.uint32(x >> 1) | np.uint32((x & 1) << 31)
When this function is executed 1000 times in a for loop, the total time is roughly 1000 times that of a single call, as expected.
x = np.random.randint(2 ** 32 - 1, dtype=np.uint32)
%timeit rotate(x)
# 13 µs ± 807 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%%timeit
for i in range(1000):
    rotate(x)
# 9.61 ms ± 255 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
However, when I apply this function inside a for loop over a Series of size 1000, it gets significantly faster.
s = pd.Series(np.random.randint(2 ** 32 - 1, size=1000, dtype=np.uint32))
%%timeit
for x in s:
    rotate(x)
# 2.08 ms ± 113 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
What is the mechanism that makes this happen?

Note that in your first loop you're not actually using the value produced by the iterator; you call rotate(x) on the same x every time. The following is a better comparison:
%%timeit
for i in range(1000):
    rotate(i)
# 1.46 ms ± 71.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%%timeit
for x in s:
    rotate(x)
# 1.6 ms ± 66.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Not surprisingly, they perform more or less the same.
In your original example, by using the variable x declared outside the loop, the interpreter had to load it with LOAD_GLOBAL 2 (x), whereas if you use the loop variable i the interpreter can use LOAD_FAST 0 (i), which, as the name hints, is faster.
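To see the difference yourself, here is a minimal sketch (not part of the original answer) that wraps both loop bodies in functions and disassembles them with the dis module; the exact bytecode names can vary slightly between Python versions:
import dis

import numpy as np

def rotate(x: np.uint32) -> np.uint32:
    return np.uint32(x >> 1) | np.uint32((x & 1) << 31)

x = np.random.randint(2 ** 32 - 1, dtype=np.uint32)

def loop_global():
    for i in range(1000):
        rotate(x)   # x lives outside the function -> LOAD_GLOBAL

def loop_local():
    for i in range(1000):
        rotate(i)   # i is a local variable -> LOAD_FAST

dis.dis(loop_global)   # the call argument is fetched with LOAD_GLOBAL (x)
dis.dis(loop_local)    # the call argument is fetched with LOAD_FAST (i)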

Related

Broadcast comparison on sliced numpy array using "," is a lot slower than "]["

I'm not sure why a comparison on a numpy array sliced using "," is a lot slower than on one sliced using "][". For example:
import time
import numpy as np

start = time.time()
a = np.zeros((100, 100))
for _ in range(1000000):
    a[1:99][1:99] == 1
print(time.time() - start)

start = time.time()
a = np.zeros((100, 100))
for _ in range(1000000):
    a[1:99, 1:99] == 1
print(time.time() - start)
3.2756259441375732
11.044903039932251
That's over 3 times worse.
The time measurements are approximately the same using timeit.
I'm working on a recursive algorithm (this slicing is intentional), and this slowdown makes my program run a lot slower, from about 1 second to 10 seconds. I just want to know the reason behind it. Maybe this is a bug. I'm using Python 3.9.9. Thanks.
The first expression is the same as a[2:99] == 1: a (98,100) slice followed by a (97,100) slice, and then the == test.
In [177]: timeit (a[1:99][1:99]==1)
8.51 µs ± 16.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [178]: timeit (a[1:99][1:99])
383 ns ± 5.73 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [179]: timeit (a[1:99])
208 ns ± 10.4 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
The bulk of the time is the test, not the slicing.
In [180]: a[1:99,1:99].shape
Out[180]: (98, 98)
In [181]: timeit a[1:99,1:99]==1
32.2 µs ± 12.9 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [182]: timeit a[1:99,1:99]
301 ns ± 3.61 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Again the slicing is a minor part of the timing, but the == test is significantly slower. In the first case we selected a subset of the rows, so the test is on a contiguous block of the data-buffer. In the second we select a subset of rows and columns. Iteration through the data-buffer is more complicated.
We can simplify the comparison by testing a slice of columns versus a slice of rows:
In [183]: timeit a[:,2:99]==1
32.3 µs ± 13.8 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [184]: timeit a[2:99,:]==1
8.58 µs ± 10.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
As a further test, make a new array with 'F' (column-major) order. Now "rows" are the slow slice:
In [189]: b = np.array(a, order='F')
In [190]: timeit b[:,2:99]==1
8.83 µs ± 20.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [191]: timeit b[2:99,:]==1
32.8 µs ± 31.2 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
===
But why are you comparing these two slices in the first place, one that makes a (97,100) array and the other a (98,98)? They pick out different parts of a.
I wonder if you really meant to test a sequential row slice followed by a column slice, not two row slices.
In [193]: timeit (a[1:99][:,1:99]==1)
32.6 µs ± 92.4 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Comparing just the slicing, we see that the sequential version is slower, but only by a bit.
In [194]: timeit (a[1:99][:,1:99])
472 ns ± 3.76 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [195]: timeit (a[1:99,1:99])
306 ns ± 3.19 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
===
The data for a is actually stored in a 1d C array. The numpy code uses strides and shape to iterate through it when doing something like a[...] == 1.
So imagine a (3,6) array whose data buffer looks like
[0 1 2 3 4 5 0 1 2 3 4 5 0 1 2 3 4 5]
sliced with [1:3], it will use
[_ _ _ _ _ _ 0 1 2 3 4 5 0 1 2 3 4 5]
sliced with [:,1:4], it will use
[_ 1 2 3 _ _ _ 1 2 3 _ _ _ 1 2 3 _ _]
Regardless of the processor caching details, the iteration through the 2nd is more complex.
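As a quick way to see this in practice (a minimal sketch, not part of the original answer), compare the contiguity flags of the two kinds of slices; copying the strided column slice into a contiguous array first makes the element-wise test itself fast again, at the cost of the copy:
import numpy as np

a = np.zeros((100, 100))

rows = a[2:99, :]    # row slice: one contiguous block of the data buffer
cols = a[:, 2:99]    # column slice: each row of the view skips elements

print(rows.flags['C_CONTIGUOUS'])   # True
print(cols.flags['C_CONTIGUOUS'])   # False

cols_copy = np.ascontiguousarray(cols)   # pay for one copy...
# %timeit cols == 1        # strided iteration, slower
# %timeit cols_copy == 1   # ...and the == test is back to the fast case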

What is a faster option to compare values in pandas?

I am trying to structure a df for productivity. At some point I need to verify whether an id exists in a list and set an indicator based on that, but it's too slow (something like 30 seconds for the df).
Can you enlighten me on a better way to do it?
This is my current code:
data['first_time_it_happen'] = data['id'].apply(lambda x: 0 if x in old_data['id'].values else 1)
(I already tried using the column as a Series, but it did not work correctly.)
To settle some debate in the comment section, I ran some timings.
Methods to time:
def isin(df, old_data):
    return df["id"].isin(old_data["id"])

def apply(df, old_data):
    return df['id'].apply(lambda x: 0 if x in old_data['id'].values else 1)

def set_(df, old_data):
    old = set(old_data['id'].values)
    return [x in old for x in df['id']]
import pandas as pd
import string
old_data = pd.DataFrame({"id": list(string.ascii_lowercase[:15])})
df = pd.DataFrame({"id": list(string.ascii_lowercase)})
Small DataFrame tests:
# Tests ran in jupyter notebook
%timeit isin(df, old_data)
184 µs ± 5.03 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit apply(df, old_data)
926 µs ± 64.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit set_(df, old_data)
28.8 µs ± 1.16 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Large dataframe tests:
df = pd.concat([df] * 100000, ignore_index=True)
%timeit isin(df, old_data)
122 ms ± 22.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit apply(df, old_data)
56.9 s ± 6.37 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit set_(df, old_data)
974 ms ± 15 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
It seems the set method is a smidge faster than the isin method for a small DataFrame. However, that comparison radically flips for a much larger DataFrame, so in most cases the isin method will be the best way to go. The apply method is always the slowest of the bunch, regardless of DataFrame size.
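Applied back to the original question, a vectorised version of the apply/lambda line could look like this (a sketch with made-up toy data; the column names mirror the question):
import pandas as pd

data = pd.DataFrame({"id": [1, 2, 3, 4]})
old_data = pd.DataFrame({"id": [2, 4]})

# 0 if the id already exists in old_data, 1 otherwise -- same meaning as the
# original apply/lambda, but using the vectorised isin
data["first_time_it_happen"] = (~data["id"].isin(old_data["id"])).astype(int)
print(data)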

Speed up numpy indexing of large array

I have a numpy array x of size (n, n, p) and I need to index it using a list m. I need to return two new arrays of sizes (n, m, p) and (n, n-m, p). Both p and m are generally small (in the range 10 to 100), but n can be from 100 to 10000+.
When n is small, there is no issue. However, when n gets large, these indexing operations take the majority of my function's runtime.
In my actual implementation, the indexing took 15 seconds, and the rest of the function took less than 1 second.
I've tried regular indexing, np.delete, and np.take; np.take was faster by a factor of 2, and it is what I'm currently using to get the 15-second time.
An example is below:
import numpy as np

m = [1, 7, 12, 40]
r = np.arange(5000)
r = np.delete(r, m, axis=0)
x = np.random.rand(5000, 5000, 10)
%timeit tmp = x[:,m,:]
1.55 ms ± 116 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit tmp2 = x[:,r,:]
1.7 s ± 109 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit tmp = np.delete(x, r, axis=1)
1.46 ms ± 31.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit tmp2 = np.delete(x, m, axis=1)
1.64 s ± 18.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit tmp = np.take(x, m, axis=1)
1.21 ms ± 61.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit tmp2 = np.take(x, r, axis=1)
1.04 s ± 79 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Except that instead of ~1 second it's 15 times that in my real case, and I have to call this function a few hundred or a thousand times.
Is there something I can do to speed this indexing up?
I'm using Python 3.6.10 through Spyder 4.0.1 on a Windows 10 laptop with an Intel i7-8650U and 16GB of RAM. I checked the array sizes and my available RAM when executing the commands and did not hit the maximum usage at any point in the execution.
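For scale, a back-of-the-envelope check (not part of the question) shows why the complement selection dominates: it copies almost the entire array, so it is essentially bound by how fast ~2 GB can be written to memory on every call.
import numpy as np

n, p = 5000, 10
m = [1, 7, 12, 40]
r = np.delete(np.arange(n), m)

x = np.random.rand(n, n, p)

small = x[:, m, :]   # keeps 4 of 5000 columns
large = x[:, r, :]   # keeps 4996 of 5000 columns

print(small.nbytes / 1e6, "MB")   # ~1.6 MB
print(large.nbytes / 1e9, "GB")   # ~2.0 GB copied on every call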

Fast numpy row slicing on a matrix

I have the following issue: I have a matrix yj of size (m, 200) (m = 3683), and I have a dictionary that, for each key, returns a numpy array of row indices into yj (the size of the array changes from key to key, in case anyone is wondering).
Now, I have to access this matrix lots of times (around 1M times), and my code is slowing down because of the indexing (I've profiled the code and it spends 65% of its time on this step).
Here is what I've tried out:
First of all, use the indices for slicing:
>> %timeit yj[R_u_idx_train[1]]
10.5 µs ± 79.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
The variable R_u_idx_train is the dictionary that has the row indices.
I thought that maybe boolean indexing might be faster:
>> %timeit yj[R_u_idx_train_mask[1]]
10.5 µs ± 159 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
R_u_idx_train_mask is a dictionary that returns a boolean array of size m where the indices given by R_u_idx_train are set to True.
I also tried np.ix_
>> cols = np.arange(0,200)
>> %timeit ix_ = np.ix_(R_u_idx_train[1], cols); yj[ix_]
42.1 µs ± 353 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
I also tried np.take
>> %timeit np.take(yj, R_u_idx_train[1], axis=0)
2.35 ms ± 88.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
And while this seems great, it is not, since it gives an array of shape (R_u_idx_train[1].shape[0], R_u_idx_train[1].shape[0]) when it should be (R_u_idx_train[1].shape[0], 200). I guess I'm not using the method correctly.
I also tried np.compress
>> %timeit np.compress(R_u_idx_train_mask[1], yj, axis=0)
14.1 µs ± 124 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Finally I tried to index with a boolean matrix
>> %timeit yj[R_u_idx_train_mask2[1]]
244 µs ± 786 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
So, is 10.5 µs ± 79.7 ns per loop the best I can do? I could try to use Cython, but that seems like a lot of work for just indexing...
Thanks a lot.
A very smart solution was given by V.Ayrat in the comments.
>> newdict = {k: yj[R_u_idx_train[k]] for k in R_u_idx_train.keys()}
>> %timeit newdict[1]
202 ns ± 6.7 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
Anyway maybe it would still be cool to know if there is a way to speed it up using numpy!

Compute execution time of membership search of generator in python

I'm currently working with generators and factorials in python.
As an example:
itertools.permutations(range(100))
Meaning, I receive an iterator that would yield 100! values.
In reality, the code looks a bit more complicated: I'm using a list of sublists instead of range(100), with the goal of finding a combination of those sublists that meets my conditions.
This is the code:
mylist = [[0, 0, 1], ..., [5, 7, 3]]  # random numbers
x = True in (combination for combination in itertools.permutations(mylist)
             if compare(combination))
# compare() returns True for one or a few combinations in that generator
I realized this is very time-consuming. Is there a more efficient way to do this, and, moreover, a way to estimate how much time it is going to take?
I've done a few %timeit using ipython:
%timeit (combination for combination in itertools.permutations(mylist) if compare(combination))
--> 697 ns
%timeit (combination for combination in itertools.permutations(range(100)) if compare(combination))
--> 572 ns
Note: I do understand that the generator expression only creates the generator; the actual work happens when the generator is "consumed", i.e. when its values are produced.
I've seen a lot of tutorials explaining how generators work, but I've found nothing about their execution time.
Moreover, I don't need an exact value, such as I would get by timing the execution with the time module in my program; rather, I need a rough estimate before execution.
Edit:
I've also tested this with smaller inputs: lists containing 24, 10 and 5 sublists. With those, I receive an output instantly.
This means the program does work; it is just a matter of time.
My problem, stated more clearly: how much time is this going to take, and is there a less time-consuming way to do it?
A comparison of generators, generator expressions, lists and list comprehensions:
In [182]: range(5)
Out[182]: range(0, 5)
In [183]: list(range(5))
Out[183]: [0, 1, 2, 3, 4]
In [184]: (x for x in range(5))
Out[184]: <generator object <genexpr> at 0x7fc18cd88a98>
In [186]: [x for x in range(5)]
Out[186]: [0, 1, 2, 3, 4]
Some timings:
In [187]: timeit range(1000)
248 ns ± 2.79 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [188]: timeit (x for x in range(1000))
802 ns ± 6.97 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [189]: timeit [x for x in range(1000)]
43.4 µs ± 27.2 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [190]: timeit list(range(1000))
23.6 µs ± 1.03 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
The time for setting up a generator is (practically) independent of the parameter, while populating a list scales roughly with its size.
In [193]: timeit range(100000)
252 ns ± 1.57 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [194]: timeit list(range(100000))
4.41 ms ± 103 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
edit
Timings show that an "in" test on a generator is somewhat faster than building a list, but it still scales with the length:
In [264]: timeit True in (True for x in itertools.permutations(range(15),2) if x==(14,4))
17.1 µs ± 17.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [265]: timeit list (True for x in itertools.permutations(range(15),2) if x==(14,4))
18.5 µs ± 158 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [266]: timeit (14,4) in itertools.permutations(range(15),2)
8.85 µs ± 8.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [267]: timeit list(itertools.permutations(range(15),2))
11.3 µs ± 21.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
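As a rough way to get the estimate the question asks for, here is a sketch with a placeholder mylist and compare() (both are assumptions, not the original code): time compare() on a small sample of permutations and extrapolate to the worst case, where every permutation has to be generated and checked.
import itertools
import math
import timeit

mylist = [[0, 0, 1], [1, 2, 3], [5, 7, 3], [2, 2, 2]]   # placeholder data

def compare(combination):          # placeholder condition
    return combination[0][0] == 5

# time compare() over a small sample of permutations
sample = list(itertools.islice(itertools.permutations(mylist), 1000))
runs = 10
per_item = timeit.timeit(lambda: [compare(c) for c in sample],
                         number=runs) / (runs * len(sample))

# worst case: every permutation is generated and checked
total = math.factorial(len(mylist))
print(f"~{per_item:.1e} s per permutation, worst case ~{per_item * total:.1e} s")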
