Broadcast comparison on sliced numpy array using "," is a lot slower than "][" - python

I'm not sure why an equality comparison on a numpy array sliced with , is a lot slower than on one sliced with ][. For example:
import time
import numpy as np

start = time.time()
a = np.zeros((100,100))
for _ in range(1000000):
    a[1:99][1:99] == 1
print(time.time() - start)

start = time.time()
a = np.zeros((100,100))
for _ in range(1000000):
    a[1:99, 1:99] == 1
print(time.time() - start)
3.2756259441375732
11.044903039932251
That's over 3 times worse.
The time measurements are approximately the same using timeit.
I'm working on a recursive algorithm (the recursion is intentional), and this slowdown makes my program take about 10 seconds instead of 1. I just want to know the reason behind it. Maybe this is a bug. I'm using Python 3.9.9. Thanks.

The first is the same as a[2:99]==1: a (98,100) slice followed by a (97,100) slice, and then the == test.
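A quick check (my own, not part of the original answer) confirms the equivalence:
import numpy as np
a = np.zeros((100,100))
print(a[1:99][1:99].shape)                     # (97, 100)
print(np.array_equal(a[1:99][1:99], a[2:99]))  # True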
In [177]: timeit (a[1:99][1:99]==1)
8.51 µs ± 16.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [178]: timeit (a[1:99][1:99])
383 ns ± 5.73 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [179]: timeit (a[1:99])
208 ns ± 10.4 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
The bulk of the time is the test, not the slicing.
In [180]: a[1:99,1:99].shape
Out[180]: (98, 98)
In [181]: timeit a[1:99,1:99]==1
32.2 µs ± 12.9 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [182]: timeit a[1:99,1:99]
301 ns ± 3.61 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Again the slicing is a minor part of the timing, but the == test is significantly slower. In the first case we selected a subset of the rows, so the test is on a contiguous block of the data-buffer. In the second we select a subset of rows and columns. Iteration through the data-buffer is more complicated.
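A minimal sketch (mine, not from the answer) to inspect the memory layout of the two slices:
import numpy as np
a = np.zeros((100, 100))
print(a[1:99][1:99].flags['C_CONTIGUOUS'])   # True: whole rows, one contiguous block
print(a[1:99, 1:99].flags['C_CONTIGUOUS'])   # False: rows and columns restricted, a strided view
print(a[1:99][1:99].strides, a[1:99, 1:99].strides)   # same strides (800, 8), different shapes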
We can simplify the comparison by testing a slice of columns versus a slice of rows:
In [183]: timeit a[:,2:99]==1
32.3 µs ± 13.8 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [184]: timeit a[2:99,:]==1
8.58 µs ± 10.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
As a further test, make a new array in 'F' (column-major) order. Now "rows" are the slow slice:
In [189]: b = np.array(a, order='F')
In [190]: timeit b[:,2:99]==1
8.83 µs ± 20.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [191]: timeit b[2:99,:]==1
32.8 µs ± 31.2 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
===
But why are you comparing these two slices in the first place? One makes a (97,100) array and the other a (98,98); they are picking different parts of a.
I wonder if you really meant to test a sequential row, column slice, not two row slices.
In [193]: timeit (a[1:99][:,1:99]==1)
32.6 µs ± 92.4 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Comparing just the slicing we see that the sequential one is slower - by just a bit.
In [194]: timeit (a[1:99][:,1:99])
472 ns ± 3.76 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [195]: timeit (a[1:99,1:99])
306 ns ± 3.19 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
===
The data for a is actually stored in a 1d C array. The numpy code uses strides and shape to iterate through it when doing something like a[...] == 1.
So imagine a (3,6) array whose data buffer looks like
[0 1 2 3 4 5 0 1 2 3 4 5 0 1 2 3 4 5]
sliced with [1:3], it will use
[_ _ _ _ _ _ 0 1 2 3 4 5 0 1 2 3 4 5]
sliced with [:,1:4], it will use
[_ 1 2 3 _ _ _ 1 2 3 _ _ _ 1 2 3 _ _]
Regardless of the processor caching details, the iteration through the 2nd is more complex.
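A minimal sketch (my own, assuming the default 8-byte integer dtype) showing the strides and contiguity behind the diagram above:
import numpy as np
a = np.arange(18).reshape(3, 6)          # (3,6) array, C (row-major) order
print(a.strides)                         # (48, 8): 48 bytes per row, 8 per element
print(a[1:3].flags['C_CONTIGUOUS'])      # True  -> one contiguous run of the buffer
print(a[:, 1:4].flags['C_CONTIGUOUS'])   # False -> three short runs with gaps in between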


Need help in understanding the loop speed with timeit function in python

I need help in understanding how the %timeit function works in the two programs below.
Program A
a = [1,3,2,4,1,4,2]
%timeit [val + 5 for val in a]
830 ns ± 45.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Program B
import numpy as np
a = np.array([1,3,2,4,1,4,2])
%timeit [a+5]
1.07 µs ± 23.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
My confusion:
µs is bigger than ns. How does the NumPy version execute slower than the for loop here?
1.07 µs ± 23.7 ns per loop... why is part of the timing reported in ns and not in µs?
NumPy adds overhead, which impacts speed on small datasets; vectorization mostly pays off on large datasets. (As for the units: 1.07 µs is the mean per-loop time and 23.7 ns is its standard deviation; timeit simply prints each value in the most readable unit.)
You should try with larger arrays:
N = 10_000_000
a = list(range(N))
%timeit [val + 5 for val in a]
import numpy as np
a = np.arange(N)
%timeit a+5
Output:
1.51 s ± 318 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
55.8 ms ± 3.63 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Range of all elements in numpy array

Suppose I have a 1d array a, and for each element I would like to generate a range whose length is stored in ranges:
a = np.array([10,9,12])
ranges = np.array([2,4,3])
The desired output would be:
np.array([10,11,9,10,11,12,12,13,14])
I could of course use a for loop, but I prefer a fully vectorized approach. np.repeat allows one to repeat each element of a a given number of times via repeats=, but I am not aware of a similar numpy function that deals with the problem above.
>>> np.hstack([np.arange(start, start+size) for start, size in zip(a, ranges)])
array([10, 11, 9, 10, 11, 12, 12, 13, 14])
With pandas it could be easier:
>>> import pandas as pd
>>> x = pd.Series(np.repeat(a, ranges))
>>> x + x.groupby(x).cumcount()
0 10
1 11
2 9
3 10
4 11
5 12
6 12
7 13
8 14
dtype: int64
>>>
If you want a numpy array:
>>> x.add(x.groupby(x).cumcount()).to_numpy()
array([10, 11, 9, 10, 11, 12, 12, 13, 14], dtype=int64)
>>>
Someone asked about timing, so I compared the times of the three solutions (so far) in a very simple manner, using the %timeit magic function in Jupyter notebook cells.
I set it up as follows:
N = 1
a = np.array([10,9,12])
a = np.tile(a, N)
ranges = np.array([2,4,3])
ranges = np.tile(ranges, N)
a.shape, ranges.shape
This let me scale up easily (albeit with repeated rather than random values).
Then I ran:
%timeit np.hstack([np.arange(start, start+size) for start, size in zip(a, ranges)])
%timeit x = pd.Series(np.repeat(a, ranges)); x.add(x.groupby(x).cumcount()).to_numpy()
%timeit np.array([i for j in range(len(a)) for i in range(a[j],a[j]+ranges[j])])
Results are as follows:
N = 1:
9.81 µs ± 481 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
568 µs ± 20.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
3.53 µs ± 81.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
N = 10:
63.4 µs ± 976 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
575 µs ± 15.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
25.1 µs ± 698 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
N = 100:
612 µs ± 12.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
608 µs ± 25.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
237 µs ± 9.62 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
N = 1000:
6.09 ms ± 52 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
852 µs ± 2.66 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
2.44 ms ± 43.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
So the Pandas solution wins when things get to arrays of 1000 elements or more, but the Python double list comprehension does an excellent job until that point. np.hstack probably loses out because of extra memory allocation and copying, but that's a guess. Note also that the Pandas solution is nearly the same time for each array size.
Caveats still exist because there are repeated numbers, and all values are relatively small integers. This really shouldn't matter, but I'm not (yet) betting on it. (For example, Pandas groupby functionality may be fast because of the repeated numbers.)
Bonus: the OP stated in a comment that "The real life arrays are around 1000 elements, yet with ranges ranging from 100 to 1000. So becomes quite big – pr94".
So I adjusted my timing test to the following:
import numpy as np
import pandas as pd
N = 1000
a = np.random.randint(100, 1000, N)
# This is how I understand "ranges ranging from 100 to 1000"
ranges = np.random.randint(100, 1000, N)
%timeit np.hstack([np.arange(start, start+size) for start, size in zip(a, ranges)])
%timeit x = pd.Series(np.repeat(a, ranges)); x.add(x.groupby(x).cumcount()).to_numpy()
%timeit np.array([i for j in range(len(a)) for i in range(a[j],a[j]+ranges[j])])
which comes out as:
hstack: 2.78 ms ± 38.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
pandas: 18.4 ms ± 663 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
double list comprehension: 64.1 ms ± 427 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
This shows that those caveats I mentioned, in some form at least, do seem to exist. But people should double-check whether this testing code is actually the most relevant and appropriate, and whether it is correct.
This problem is probably going to be solved much faster with a Numba-compiled function:
import numba as nb
import numpy as np

@nb.jit
def expand_range(values, counts):
    n = len(values)
    m = np.sum(counts)
    r = np.zeros((m,), dtype=values.dtype)
    k = 0
    for i in range(n):
        x = values[i]
        for j in range(counts[i]):
            r[k] = x + j
            k += 1
    return r
On the very small inputs:
%timeit expand_range(a, ranges)
# 1.16 µs ± 126 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit x = pd.Series(np.repeat(a, ranges)); x.add(x.groupby(x).cumcount()).to_numpy()
# 617 µs ± 4.32 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit np.hstack([np.arange(start, start+size) for start, size in zip(a, ranges)])
# 25 µs ± 2.2 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit np.array([i for j in range(len(a)) for i in range(a[j],a[j]+ranges[j])])
# 13.5 µs ± 929 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
and on somewhat larger inputs:
b = np.random.randint(0, 1000, 1000)
b_ranges = np.random.randint(1, 10, 1000)
%timeit expand_range(b, b_ranges)
# 5.07 µs ± 98.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit x = pd.Series(np.repeat(a, ranges)); x.add(x.groupby(x).cumcount()).to_numpy()
# 617 µs ± 4.32 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit np.hstack([np.arange(start, start+size) for start, size in zip(a, ranges)])
# 25 µs ± 2.2 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit np.array([i for j in range(len(a)) for i in range(a[j],a[j]+ranges[j])])
# 13.5 µs ± 929 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
These show the Numba-based approach winning; the speed gain is at least 100x over any of the other approaches proposed so far.
With numbers closer to what has been indicated in one of the comments by the OP:
b = np.random.randint(10, 1000, 1000)
b_ranges = np.random.randint(100, 1000, 1000)
%timeit expand_range(b, b_ranges)
# 1.5 ms ± 67.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit x = pd.Series(np.repeat(b, b_ranges)); x.add(x.groupby(x).cumcount()).to_numpy()
# 91.8 ms ± 6.53 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit np.hstack([np.arange(start, start+size) for start, size in zip(b, b_ranges)])
# 10.7 ms ± 402 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit np.array([i for j in range(len(b)) for i in range(b[j],b[j]+b_ranges[j])])
# 144 ms ± 4.54 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
which is still at least a respectable 7x over the others.
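For completeness, there is also a fully vectorized pure-NumPy route (not one of the answers timed above) built from np.repeat plus a cumulative-sum offset; a sketch, using the question's small example:
import numpy as np
a = np.array([10, 9, 12])
ranges = np.array([2, 4, 3])
starts = np.repeat(a, ranges)                                      # [10 10  9  9  9  9 12 12 12]
ends = np.cumsum(ranges)
offsets = np.arange(ends[-1]) - np.repeat(ends - ranges, ranges)   # 0,1,...,range-1 per element
print(starts + offsets)                                            # [10 11  9 10 11 12 12 13 14]
Unlike the value-based groupby trick, this does not rely on the values in a being distinct.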

Is it possible to check whether the word consists of punctuation without any loops?

I have a list with a lot of words, so I don't want to write a nested loop, because it will take a lot of time for the program to run. Maybe there is a way to check whether a word contains punctuation, something like any(map(str.isdigit, s1)), which we can use when we have to check for digits?
Unless the list is very large, or your CPU is low-performance, it is not going to take much time to process a list of words. Consider the example below, which has 1 million 20-character strings.
import random
import string
In [16]: s = [''.join(random.choices(string.ascii_letters + string.punctuation, k=20)) for _ in range(1000000)]
In [17]: %%timeit -n 3 -r 3
...: [any(map(str.isdigit, s1)) for s1 in s]
...:
...:
1.23 s ± 2.53 ms per loop (mean ± std. dev. of 3 runs, 3 loops each)
In [18]: %%timeit -n 3 -r 3
...: [any([s2 in string.punctuation for s2 in s1]) for s1 in s]
...:
...:
1.72 s ± 18.1 ms per loop (mean ± std. dev. of 3 runs, 3 loops each)
You could speed it up with a regular expression
import re
import string
In [16]: s = [''.join(random.choices(string.ascii_letters + string.punctuation, k=20)) for _ in range(1000000)]
In [17]: patt = re.compile('[%s]' % re.escape(string.punctuation))
In [18]: %%timeit -n 3 -r 3
[bool(re.match(patt, s1)) for s1 in s]
1.03 s ± 3.23 ms per loop (mean ± std. dev. of 3 runs, 3 loops each)
It may depend on what you define as "punctuation". The module string defines string.punctuation as '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'. You may also define it as "what isn't alphanumeric" (a-zA-Z0-9), or "what isn't alpha" (a-zA-Z).
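A small illustration (my example, not from the answer) of how those definitions differ for a string containing a space:
import string
word = "hello world"
print(any(ch in string.punctuation for ch in word))  # False: a space is not punctuation
print(not word.isalnum())                            # True:  a space is not alphanumeric
print(not word.isalpha())                            # True:  a space is not alphabetic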
Here I define a very long string of alphanumeric characters, and the same with an added dot ., shuffled.
import numpy as np
import string
import numpy as np
import string
mystr_no_punct = np.random.choice(list(string.ascii_letters) +
                                  list(string.digits), 10**8)
mystr_withpunct = np.append(mystr_no_punct, '.')
np.random.shuffle(mystr_no_punct)
mystr_withpunct = "".join(mystr_withpunct)
mystr_no_punct = "".join(mystr_no_punct)
Below is an implementation of the naive iteration with a for loop, and some possible answers depending on what you are looking for, with time comparisons.
def naive(mystr):
    for x in mystr:
        if x in string.punctuation:
            return False
    return True
# naive solution
%timeit naive(mystr_withpunct)
%timeit naive(mystr_no_punct)
# check if string is only alnum
%timeit str.isalnum(mystr_withpunct)
%timeit str.isalnum(mystr_no_punct)
# reduce to a set of the present characters, compare with the set of punctuation characters
%timeit len(set(mystr_withpunct).intersection(set(string.punctuation))) > 0
%timeit len(set(mystr_no_punct).intersection(set(string.punctuation))) > 0
# use regex
import re
%timeit len(re.findall(rf"[{re.escape(string.punctuation)}]+", mystr_withpunct)) > 0
%timeit len(re.findall(rf"[{re.escape(string.punctuation)}]+", mystr_no_punct)) > 0
With the following results
# naive
53.9 ms ± 928 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
53.1 ms ± 261 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# str.isalnum
4.17 ms ± 25.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
4.47 ms ± 135 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# sets intersection
8.26 ms ± 21.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
8.2 ms ± 48.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# regex
8.43 ms ± 84 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
8.51 ms ± 60.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
So using the built-in isalnum is clearly the fastest. But if you have specific needs, regex or sets intersection seem a good fit.

Fast numpy row slicing on a matrix

I have the following issue: I have a matrix yj of size (m,200) (m = 3683), and I have a dictionary that, for each key, returns a numpy array of row indices for yj (for each key, the size of the array changes, just in case anyone is wondering).
Now, I have to access this matrix lots of times (around 1M times) and my code is slowing down because of the indexing (I've profiled the code and it spends 65% of the time on this step).
Here is what I've tried out:
First of all, use the indices for slicing:
>> %timeit yj[R_u_idx_train[1]]
10.5 µs ± 79.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
The variable R_u_idx_train is the dictionary that has the row indices.
I thought that maybe boolean indexing might be faster:
>> %timeit yj[R_u_idx_train_mask[1]]
10.5 µs ± 159 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
R_u_idx_train_mask is a dictionary that returns a boolean array of size m where the indices given by R_u_idx_train are set to True.
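For reference, here is a hedged sketch (with toy values of my own) of how such a boolean-mask dictionary could be built from the index dictionary:
import numpy as np
m = 3683
R_u_idx_train = {1: np.array([0, 5, 42, 100])}   # toy stand-in for the real dictionary
R_u_idx_train_mask = {}
for key, rows in R_u_idx_train.items():
    mask = np.zeros(m, dtype=bool)
    mask[rows] = True
    R_u_idx_train_mask[key] = mask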
I also tried np.ix_
>> cols = np.arange(0,200)
>> %timeit ix_ = np.ix_(R_u_idx_train[1], cols); yj[ix_]
42.1 µs ± 353 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
I also tried np.take
>> %timeit np.take(yj, R_u_idx_train[1], axis=0)
2.35 ms ± 88.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
And while this seems great, it is not, since it gives an array that is shape (R_u_idx_train[1].shape[0], R_u_idx_train[1].shape[0]) (it should be (R_u_idx_train[1].shape[0], 200)). I guess I'm not using the method correctly.
I also tried np.compress
>> %timeit np.compress(R_u_idx_train_mask[1], yj, axis=0)
14.1 µs ± 124 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Finally I tried to index with a boolean matrix
>> %timeit yj[R_u_idx_train_mask2[1]]
244 µs ± 786 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
So, is 10.5 µs ± 79.7 ns per loop the best I can do? I could try to use cython but that seems like a lot of work for just indexing...
Thanks a lot.
A very smart solution was given by V.Ayrat in the comments.
>> newdict = {k: yj[R_u_idx_train[k]] for k in R_u_idx_train.keys()}
>> %timeit newdict[1]
202 ns ± 6.7 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
Anyway maybe it would still be cool to know if there is a way to speed it up using numpy!

for loop performance of pandas Series

I am curious about the fact that, when applying a function to each element of a pd.Series inside a for loop, the execution looks significantly faster than the expected O(N) scaling.
Consider the function below, which rotates a number bit-wise (the code itself is not important here).
import numpy as np

def rotate(x: np.uint32) -> np.uint32:
    return np.uint32(x >> 1) | np.uint32((x & 1) << 31)
When executing this function 1000 times in a for loop, it simply takes on the order of 1000 times as long, as expected.
x = np.random.randint(2 ** 32 - 1, dtype=np.uint32)
%timeit rotate(x)
# 13 µs ± 807 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%%timeit
for i in range(1000):
    rotate(x)
# 9.61 ms ± 255 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
However, when I call it inside a for loop over a Series of size 1000, it gets significantly faster.
s = pd.Series(np.random.randint(2 ** 32 - 1, size=1000, dtype=np.uint32))
%%timeit
for x in s:
    rotate(x)
# 2.08 ms ± 113 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
What is the mechanism that makes this happen?
Note in your first loop you're not actually using the next value of the iterator. The following is a better comparison:
...: %%timeit
...: for i in range(1000):
...:     rotate(i)
...:
1.46 ms ± 71.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
...: %%timeit
...: for x in s:
...:     rotate(x)
...:
1.6 ms ± 66.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Not surprisingly, they perform more or less the same.
In your original example, by using a variable x declared outside the loop body, the interpreter needed to load that variable with LOAD_GLOBAL 2 (x), while if you just used the loop variable i the interpreter could use LOAD_FAST 0 (i), which, as the name hints, is faster.
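A minimal sketch (mine) using the dis module to see the two opcodes:
import dis

x = 42

def use_global():
    for i in range(3):
        x + 1          # x lives outside the function -> LOAD_GLOBAL

def use_local():
    for i in range(3):
        i + 1          # i is a local variable -> LOAD_FAST

dis.dis(use_global)    # disassembly shows LOAD_GLOBAL ... (x)
dis.dis(use_local)     # disassembly shows LOAD_FAST ... (i)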
