What are the differences in performance and behavior between Python's native sum function and NumPy's numpy.sum? Each also works on the other's type: sum accepts NumPy arrays and numpy.sum accepts Python lists, and both return the same numerical result (I haven't tested edge cases such as overflow), but with different types.
>>> import numpy as np
>>> np_a = np.array(range(5))
>>> np_a
array([0, 1, 2, 3, 4])
>>> type(np_a)
<class 'numpy.ndarray'>
>>> py_a = list(range(5))
>>> py_a
[0, 1, 2, 3, 4]
>>> type(py_a)
<class 'list'>
# The numerical answer (10) is the same for the following sums:
>>> type(np.sum(np_a))
<class 'numpy.int32'>
>>> type(sum(np_a))
<class 'numpy.int32'>
>>> type(np.sum(py_a))
<class 'numpy.int32'>
>>> type(sum(py_a))
<class 'int'>
Edit: I think my practical question here is: would using numpy.sum on a list of Python integers be any faster than using Python's own sum?
Additionally, what are the implications (including performance) of using a Python integer versus a scalar numpy.int32? For example, for a += 1, is there a behavior or performance difference if the type of a is a Python integer or a numpy.int32? I am curious if it is faster to use a NumPy scalar datatype such as numpy.int32 for a value that is added or subtracted a lot in Python code.
For clarification, I am working on a bioinformatics simulation which partly consists of collapsing multidimensional numpy.ndarrays into single scalar sums which are then additionally processed. I am using Python 3.2 and NumPy 1.6.
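For concreteness, here is a minimal sketch of the kind of collapse I mean (the shape and variable names are made up for illustration):
import numpy as np

counts = np.random.randint(0, 100, size=(4, 16, 16))  # stand-in for the simulation data
total = counts.sum()          # collapse all axes into a single scalar
print(type(total), total)     # a NumPy integer scalar, e.g. numpy.int32 or numpy.int64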
I got curious and timed it. numpy.sum seems much faster for numpy arrays, but much slower on lists.
import numpy as np
import timeit
x = range(1000)
# or
#x = np.random.standard_normal(1000)
def pure_sum():
    return sum(x)

def numpy_sum():
    return np.sum(x)
n = 10000
t1 = timeit.timeit(pure_sum, number = n)
print 'Pure Python Sum:', t1
t2 = timeit.timeit(numpy_sum, number = n)
print 'Numpy Sum:', t2
Result when x = range(1000):
Pure Python Sum: 0.445913167735
Numpy Sum: 8.54926219673
Result when x = np.random.standard_normal(1000):
Pure Python Sum: 12.1442425643
Numpy Sum: 0.303303771848
I am using Python 2.7.2 and Numpy 1.6.1
[...] my [...] question here is would using numpy.sum on a list of Python integers be any faster than using Python's own sum?
The answer to this question is: No.
Python's sum will be faster on lists, while NumPy's sum will be faster on arrays. I ran a benchmark to show the timings (Python 3.6, NumPy 1.14):
import random
import numpy as np
import matplotlib.pyplot as plt
from simple_benchmark import benchmark
%matplotlib notebook
def numpy_sum(it):
    return np.sum(it)

def python_sum(it):
    return sum(it)

def numpy_sum_method(arr):
    return arr.sum()
b_array = benchmark(
    [numpy_sum, numpy_sum_method, python_sum],
    arguments={2**i: np.random.randint(0, 10, 2**i) for i in range(2, 21)},
    argument_name='array size',
    function_aliases={numpy_sum: 'numpy.sum(<array>)', numpy_sum_method: '<array>.sum()', python_sum: "sum(<array>)"}
)

b_list = benchmark(
    [numpy_sum, python_sum],
    arguments={2**i: [random.randint(0, 10) for _ in range(2**i)] for i in range(2, 21)},
    argument_name='list size',
    function_aliases={numpy_sum: 'numpy.sum(<list>)', python_sum: "sum(<list>)"}
)
With these results:
f, (ax1, ax2) = plt.subplots(1, 2, sharey=True)
b_array.plot(ax=ax1)
b_list.plot(ax=ax2)
Left: on a NumPy array; Right: on a Python list.
Note that this is a log-log plot because the benchmark covers a very wide range of values. However for qualitative results: Lower means better.
This shows that for lists Python's sum is always faster, while for arrays np.sum or the array's sum method is faster (except for very short arrays, where Python's sum wins).
Just in case you're interested in comparing these against each other I also made a plot including all of them:
f, ax = plt.subplots(1)
b_array.plot(ax=ax)
b_list.plot(ax=ax)
ax.grid(which='both')
Interestingly, the point at which NumPy on arrays can compete with Python on lists is roughly around 200 elements! Note that this number may depend on a lot of factors, such as the Python/NumPy version, ... Don't take it too literally.
What hasn't been mentioned is the reason for this difference (I mean the large-scale difference, not the difference for short lists/arrays where the functions simply have different constant overhead). Assuming CPython, a Python list is a wrapper around a C array of pointers to Python objects (in this case Python integers). These integers can be seen as wrappers around a C integer (not strictly correct, because Python integers can be arbitrarily big and so cannot be backed by a single C integer, but it's close enough).
For example a list like [1, 2, 3] would be (schematically, I left out a few details) stored like this:
A NumPy array however is a wrapper around a C array containing C values (in this case int or long depending on 32 or 64bit and depending on the operating system).
So a NumPy array like np.array([1, 2, 3]) would look like this:
The next thing to understand is how these functions work:
Python's sum iterates over the iterable (in this case the list or array) and adds all elements.
NumPy's sum method iterates over the stored C array, adds those C values, wraps the result in a Python type (in this case numpy.int32 or numpy.int64), and returns it.
NumPy's sum function converts the input to an array (if it isn't one already) and then uses the NumPy sum method.
Clearly adding C values from a C array is much faster than adding Python objects, which is why the NumPy functions can be much faster (see the second plot above: the NumPy functions on arrays beat the Python sum by far for large arrays).
But converting a Python list to a NumPy array is relatively slow, and then you still have to add the C values, which is why for lists the Python sum is faster.
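To see where the time goes for lists, you can time the conversion step on its own (a rough sketch; absolute numbers will vary by machine and version):
import numpy as np
import timeit

lst = list(range(100000))

# np.sum(lst) roughly decomposes into a conversion step plus a C-level sum:
t_convert = timeit.timeit(lambda: np.asarray(lst), number=100)  # list -> array
t_pysum = timeit.timeit(lambda: sum(lst), number=100)           # pure Python sum
print(t_convert, t_pysum)  # the conversion alone tends to cost more than sum(lst)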
The only remaining open question is why Python's sum on an array is so slow (it's the slowest of all the compared functions). That has to do with the fact that Python's sum simply iterates over whatever you pass in. In the case of a list it gets the stored Python objects, but in the case of a 1D NumPy array there are no stored Python objects, just C values, so Python and NumPy have to create a Python object (a numpy.int32 or numpy.int64) for each element, and then these Python objects have to be added. Creating the wrapper for each C value is what makes it so slow.
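You can observe the boxing directly: every element you pull out of an array during iteration is a freshly created NumPy scalar object, not a raw C value (a small sketch; the exact dtype depends on your platform):
import numpy as np

arr = np.arange(3)
for element in arr:
    # each iteration creates a new NumPy scalar "box" around the stored C value
    print(type(element))  # e.g. <class 'numpy.int64'> (numpy.int32 on some platforms)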
Additionally, what are the implications (including performance) of using a Python integer versus a scalar numpy.int32? For example, for a += 1, is there a behavior or performance difference if the type of a is a Python integer or a numpy.int32?
I ran some tests, and for additions and subtractions of scalars you should definitely stick with Python integers, though there could be some caching going on, which means the following tests might not be totally representative:
from itertools import repeat
python_integer = 1000
numpy_integer_32 = np.int32(1000)
numpy_integer_64 = np.int64(1000)
def repeatedly_add_one(val):
    for _ in repeat(None, 100000):
        _ = val + 1
%timeit repeatedly_add_one(python_integer)
3.7 ms ± 71.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit repeatedly_add_one(numpy_integer_32)
14.3 ms ± 162 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit repeatedly_add_one(numpy_integer_64)
18.5 ms ± 494 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
def repeatedly_sub_one(val):
    for _ in repeat(None, 100000):
        _ = val - 1
%timeit repeatedly_sub_one(python_integer)
3.75 ms ± 236 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit repeatedly_sub_one(numpy_integer_32)
15.7 ms ± 437 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit repeatedly_sub_one(numpy_integer_64)
19 ms ± 834 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
It's 3-6 times faster to do scalar operations with Python integers than with NumPy scalars. I haven't checked why that's the case but my guess is that NumPy scalars are rarely used and probably not optimized for performance.
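A related behavioral point, easy to verify yourself (the exact result dtype can depend on the NumPy version): once a value is a NumPy scalar, arithmetic with it generally produces NumPy scalars again, so the overhead sticks around in a long-running accumulation:
import numpy as np

a = np.int32(0)
a += 1            # np.int32 plus a Python int yields a NumPy scalar again
print(type(a))    # typically <class 'numpy.int32'>, not a plain Python int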
The difference becomes a bit less if you actually perform arithmetic operations where both operands are numpy scalars:
def repeatedly_add_one(val):
    one = type(val)(1)  # create a 1 with the same type as the input
    for _ in repeat(None, 100000):
        _ = val + one
%timeit repeatedly_add_one(python_integer)
3.88 ms ± 273 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit repeatedly_add_one(numpy_integer_32)
6.12 ms ± 324 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit repeatedly_add_one(numpy_integer_64)
6.49 ms ± 265 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Then it's only 2 times slower.
In case you wondered why I used itertools.repeat here instead of simply for _ in range(...): repeat is faster and thus incurs less overhead per loop. Because I'm only interested in the addition/subtraction time, it's preferable not to have the looping overhead messing with the timings (at least not that much).
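If you want to verify that loop-overhead claim yourself, a quick comparison could look like this (a sketch; the timings themselves will vary):
import timeit
from itertools import repeat

def loop_repeat():
    for _ in repeat(None, 100000):
        pass              # repeat hands back the same None object every step

def loop_range():
    for _ in range(100000):
        pass              # range must produce or look up an int object each step

print(timeit.timeit(loop_repeat, number=100))
print(timeit.timeit(loop_range, number=100))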
Note that Python sum on multidimensional numpy arrays will only perform a sum along the first axis:
sum(np.array([[[2,3,4],[4,5,6]],[[7,8,9],[10,11,12]]]))
Out[47]:
array([[ 9, 11, 13],
[14, 16, 18]])
np.sum(np.array([[[2,3,4],[4,5,6]],[[7,8,9],[10,11,12]]]), axis=0)
Out[48]:
array([[ 9, 11, 13],
[14, 16, 18]])
np.sum(np.array([[[2,3,4],[4,5,6]],[[7,8,9],[10,11,12]]]))
Out[49]: 81
Numpy should be much faster, especially when your data is already a numpy array.
Numpy arrays are a thin layer over a standard C array. When numpy sum iterates over this, it isn't doing type checking and it is very fast. The speed should be comparable to doing the operation using standard C.
In comparison, Python's sum iterates over the numpy array element by element; each C value must first be boxed into a Python object before it can be added, and each addition involves type dispatch, so it is generally going to be slower.
Exactly how much slower Python's sum is than numpy's sum is not well defined, since Python's sum is itself a fairly optimized function compared to writing your own sum function in Python.
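For reference, a hand-rolled sum like the sketch below runs as Python bytecode rather than C, so it is usually slower still than the built-in sum:
def my_sum(iterable, start=0):
    # conceptually what sum() does, but executed by the Python interpreter
    total = start
    for x in iterable:
        total = total + x
    return total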
This is an extension to the answer posted above by Akavall. From that answer you can see that np.sum performs faster for np.array objects, whereas sum performs faster for list objects. To expand upon that:
Running np.sum on an np.array object vs. sum on a list object, they perform neck and neck.
# I'm running IPython
In [1]: x = range(1000) # list object
In [2]: y = np.array(x) # np.array object
In [3]: %timeit sum(x)
100000 loops, best of 3: 14.1 µs per loop
In [4]: %timeit np.sum(y)
100000 loops, best of 3: 14.3 µs per loop
Above, sum is a tiny bit faster than np.sum, although at times I've seen np.sum timings of 14.1 µs too. But mostly it's 14.3 µs.
If you use Python's sum() on a multidimensional array, it gives:
a = np.arange(6).reshape(2, 3)
print(a)
print(sum(a))
print(sum(sum(a)))
print(np.sum(a))
>>>
[[0 1 2]
[3 4 5]]
[3 5 7]
15
15
Related
I'm sure this has a name in some other domain (maybe approx count distinct?).
Suppose you want to count the number of distinct elements in a numpy array, but you only care about counts below some threshold; above that you just return that it has more than thresh unique entries. This is particularly useful for high-arity arrays, where you don't care that there are exactly 10000 distinct entries, just that there are more than, say, 10.
In a compiled language this is simple to make fast. But what fast implementations are exposed to Python?
Naively one might try numba like this:
import numba

@numba.jit(nopython=True)
def nunique_max_thresh(x, thresh=10):
    seen = set()
    for i in range(len(x)):
        seen.add(x[i])
        if len(seen) > thresh:
            return thresh
    return len(seen)
But the set usage is not supported.
Cython is an option, but I am wondering if this is already done in some library or elsewhere in Python. It seems like bottleneck would do this kind of thing, but it's not really in there:
https://bottleneck.readthedocs.io/en/latest/reference.html
For example, consider these kind of arrays:
import string
import numpy as np
np.random.seed(0)
a = np.random.choice(list(string.ascii_letters), int(1e7))  # size must be an int
b = np.ones(int(1e7))
And you just want to know whether the array has 10 or more unique values, without exploiting the fact that these are length-one strings.
For reference, this Cython version runs, but is probably not optimal:
import numpy as np
cimport numpy as np
def nunique_truncated(np.ndarray x_in, np.int thresh=10):
    seen = set()
    for i in range(x_in.shape[0]):
        seen.add(x_in[i])
        if len(seen) >= thresh:
            return thresh
    return len(seen)  # return the count when the threshold is never reached
As @hpaulj suggested, you can just use numba without a set or dict, and it should be reasonable since the use case specifically targets small thresholds. Obviously some regimes will suffer from the slow membership lookups in a list.
import numba

@numba.jit(nopython=True)
def nunique_truncated_numba(x_in, thresh=10):
    seen = list()
    for i, x in enumerate(x_in):
        if x not in seen:
            seen.append(x)
            if len(seen) > thresh:
                return len(seen)
    return len(seen)
And the hard case is really when you never hit the threshold, because then you have to scan the entire array.
In [6]: %timeit cud.nunique_truncated(b)
116 µs ± 304 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [7]: %timeit len(np.unique(b))
1.26 ms ± 2.64 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
I would be interested if anyone has other suggestions and tricks.
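One more pure-NumPy fallback I can think of (a sketch; the helper name and chunk size are made up): accumulate the uniques chunk by chunk and bail out as soon as the threshold is reached, so the full-scan cost is only paid in the hard case.
import numpy as np

def has_at_least_n_unique(arr, thresh=10, chunk=100000):
    # hypothetical helper: maintain a running array of uniques via np.unique
    seen = np.unique(arr[:0])  # empty array with matching dtype
    for start in range(0, len(arr), chunk):
        block = np.unique(arr[start:start + chunk])
        seen = np.unique(np.concatenate([seen, block]))
        if len(seen) >= thresh:
            return True
    return False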
I have a for loop doing some operation on the elements of an array. There are 1e5 elements in the array:
import numpy as np
A = np.array([1,2,3,4..........100000])
for i in range(0, len(A)):
    A[i] = (A[i]*2 + A[i]*4) ** (1/3)
I want to parallelize the above code so that each iteration of the for loop goes to a different core, to make execution faster. I have a workstation with 48 cores. How can I achieve this parallel processing in Python? Please help.
Don't bother parallelizing just yet. Right now, you're taking no advantage of numpy vectorization; you may as well be using Python list (or maybe array.array) for all the benefit numpy is giving you.
Actually use the vectorization features, and the overhead should drop by several orders of magnitude:
import numpy as np
A = np.array([1,2,3,4..........100000]) # If these are actually the values you want, use np.arange(1, 100000+1) to speed it up
A = (A * 6) ** (1 / 3)
# If the result should truncate back to int64, not convert to doubles, cast back at the end
A = A.astype(np.int64)
(A * 6) ** (1 / 3) does the same work as the for loop did, but much faster (you could match the original code more closely with A = (A * 2 + A * 4) ** (1/3), but multiplying by 2 and 4 separately and adding them together is pointless when you could just multiply by 6 directly). The final (optional, depending on intent) line gets exact equivalent behavior of the original loop by truncating back to the original integer dtype.
Comparing performance with ipython %%timeit magic for a microbenchmark:
In [2]: %%timeit
...: A = np.arange(1, 100000+1)
...: for i in range(len(A)):
...: A[i] = (A[i]*2 + A[i]*4) ** (1/3)
...:
427 ms ± 6.49 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [3]: %%timeit
...: A = np.arange(1, 100000+1)
...: A = (A * 6) ** (1/3)
...:
2.72 ms ± 51 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
The vectorized code takes about 0.6% of the time taken by the naive loop; merely parallelizing the naive loop would never come close to achieving that sort of speedup. Adding the .astype(np.int64) cast only increases runtime by about 6%, still a trivial fraction of what the original for loop required.
Let numpy do the hard work.
A = (A*2+A*4)**(1/3)
I'm trying to multiply three arrays (A x B x A) with shapes (19000, 3) x (19000, 3, 3) x (19000, 3), so that at the end I get a 1-D array of size (19000); that is, I want to contract only along the last one/two dimensions.
I've got it working with np.einsum() but I'm wondering if there is any way of making this faster, as this is the bottleneck of my whole code.
np.einsum('...i,...ij,...j', A, B, A)
I've already tried it with two separated np.einsum() calls, but that gave me the same performance:
np.einsum('...i, ...i', np.einsum('...i,...ij', A, B), A)
As well I've already tried the # operator and adding some additional axes, but that also didn't make it faster:
(A[:, None]#B#A[...,None]).squeeze()
I've tried to get it working with np.inner(), np.dot(), np.tensordot() and np.vdot(), but these never gave me the same results, so I couldn't compare them.
Any other ideas? Is there any way I could get a better performance?
I've already had a quick look at Numba, but as Numba doesn't support np.einsum() and many other NumPy functions, I would have to rewrite a lot of code.
You could use Numba
To begin with, it is always a good idea to look at what np.einsum does. With optimize='optimal' it is usually very good at finding a contraction order with fewer FLOPs. In this case there is actually only a minor optimization possible, and the intermediate array is relatively large (I will stick to the naive version). It should also be mentioned that contractions with very small (fixed?) dimensions are a quite special case. This is also a reason why it is quite easy to outperform np.einsum here (unrolling etc., which a compiler does if it knows that a loop consists of only 3 elements).
import numpy as np
A=np.random.rand(19000, 3)
B=np.random.rand(19000, 3, 3)
print(np.einsum_path('...i,...ij,...j', A, B, A,optimize="optimal")[1])
"""
Complete contraction: si,sij,sj->s
Naive scaling: 3
Optimized scaling: 3
Naive FLOP count: 5.130e+05
Optimized FLOP count: 4.560e+05
Theoretical speedup: 1.125
Largest intermediate: 5.700e+04 elements
--------------------------------------------------------------------------
scaling current remaining
--------------------------------------------------------------------------
3 sij,si->js sj,js->s
2 js,sj->s s->s
"""
Numba implementation
import numba as nb

# si,sij,sj -> s
@nb.njit(fastmath=True, parallel=True, cache=True)
def nb_einsum(A, B):
    # Check the inputs at the beginning.
    # I assume that the asserted shapes are always constant;
    # this makes it easier for the compiler to optimize.
    assert A.shape[1] == 3
    assert B.shape[1] == 3
    assert B.shape[2] == 3
    # allocate output
    res = np.empty(A.shape[0], dtype=A.dtype)
    for s in nb.prange(A.shape[0]):
        # Using a scalar accumulator like this is also important for performance.
        acc = 0
        for i in range(3):
            for j in range(3):
                acc += A[s, i] * B[s, i, j] * A[s, j]
        res[s] = acc
    return res
Timings
# warmup: the first call is always slower
# (due to compilation or loading the cached function)
res=nb_einsum(A,B)
%timeit nb_einsum(A,B)
#43.2 µs ± 1.22 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit np.einsum('...i,...ij,...j', A, B, A,optimize=True)
#450 µs ± 8.28 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit np.einsum('...i,...ij,...j', A, B, A)
#977 µs ± 4.14 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
np.allclose(np.einsum('...i,...ij,...j', A, B, A,optimize=True),nb_einsum(A,B))
#True
I want to create an empty Numpy array in Python, to later fill it with values. The code below generates a 1024x1024x1024 array with 2-byte integers, which means it should take at least 2GB in RAM.
>>> import numpy as np; from sys import getsizeof
>>> A = np.zeros((1024,1024,1024), dtype=np.int16)
>>> getsizeof(A)
2147483776
From getsizeof(A), we see that the array takes 2^31 + 128 bytes (presumably of header information.) However, using my task manager, I can see Python is only taking 18.7 MiB of memory.
Suspecting the array might be compressed, I assigned random values to each memory slot so that it could not be:
>>> for i in range(1024):
...     for j in range(1024):
...         for k in range(1024):
...             A[i,j,k] = np.random.randint(32767, dtype=np.int16)
The loop is still running, and my RAM usage is slowly increasing (presumably as the arrays composing A inflate with the incompressible noise). I'm assuming it would make my code faster to force numpy to allocate this array from the beginning. Curiously, I haven't seen this documented anywhere!
So, 1. Why does numpy do this? and 2. How can I force numpy to allocate memory?
A neat answer to your first question can also be found in this StackOverflow answer.
To answer your second question, you can force the memory to be allocated as follows in a more or less efficient manner:
A = np.empty((1024,1024,1024), dtype=np.int16)
A.fill(0)
because then the memory is touched.
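As far as I can tell, np.full achieves the same thing in one call, since it writes the fill value into every element (a sketch of the same touch-the-memory idea):
import numpy as np

A = np.full((1024, 1024, 1024), 0, dtype=np.int16)  # writing 0 everywhere touches each page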
At my machine with my setup,
A = np.empty(0)
A.resize((1024, 1024, 1024))
also does the trick, but I cannot find this behavior documented, and this might be an implementation detail; realloc is used under the hood in numpy.
Let's look at some timings for a smaller case:
In [107]: A = np.zeros(10000,int)
In [108]: for i in range(A.shape[0]): A[i]=np.random.randint(327676)
We don't need to make A 3d to get the same effect; 1d of the same total size would be just as good.
In [109]: timeit for i in range(A.shape[0]): A[i]=np.random.randint(327676)
37 ms ± 133 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Now compare that time to the alternative of generating the random numbers with one call:
In [110]: timeit np.random.randint(327676, size=A.shape)
185 µs ± 905 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Much much faster.
If we do the same loop, but simply assign the random number to a variable (and throw it away):
In [111]: timeit for i in range(A.shape[0]): x=np.random.randint(327676)
32.3 ms ± 171 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
The times are nearly the same as the original case. Assigning the values to the zeros array is not the big time consumer.
I'm not testing a very large case as you are, and my A has already been initialized in full. So you are welcome to repeat the comparisons with your size. But I think the pattern will still hold: iterating 1024x1024x1024 times (100,000x larger than my example) is the big time consumer, not the memory allocation task.
Something else you might experiment with: iterate only on the first dimension of A, and assign a randint array shaped like the other two dimensions. For example, expanding my A with a size-10 dimension:
In [112]: A = np.zeros((10,10000),int)
In [113]: timeit for i in range(A.shape[0]): A[i]=np.random.randint(327676,size=A.shape[1])
1.95 ms ± 31.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
A is 10x larger than in [107], but takes 16x less time to fill, because it only has to iterate 10 times. In numpy, if you must iterate, try to do it fewer times over larger chunks of work.
(timeit repeats the test many times (e.g. 7*10), so it isn't going to capture any initial memory allocation step, even if I use a large enough array for that to matter).
I'm trying to execute the following
from numpy import *
x = array([[3,2,3],[711,4,104],.........,[4,4,782,7845]]) # large nparray
for item in x:
    set(item)
and it takes very long compared to:
x = array([[3,2,3],[711,4,104],.........,[4,4,782,7845]]) # large nparray
for item in x:
    item.tolist()
Why does it take much longer to convert a NumPy array to a set than to a list?
I mean basically both have complexity O(n)?
TL;DR: The set() function creates a set using Python's iteration protocol. But iterating (on the Python level) over NumPy arrays is so slow that using tolist() to convert the array to a Python list before doing the iteration is (much) faster.
To understand why iterating over NumPy arrays is so slow it's important to know how Python objects, Python lists, and NumPy arrays are stored in memory.
A Python object needs some bookkeeping properties (like the reference count, a link to its class, ...) and the value it represents. For example the integer ten = 10 could look like this:
The blue circle is the "name" you use in the Python interpreter for the variable ten, and the lower object (instance) is what actually represents the integer (since the bookkeeping properties aren't important here, I ignored them in the images).
A Python list is just a collection of Python objects, for example mylist = [1, 2, 3] would be saved like this:
This time the list references the Python integers 1, 2 and 3 and the name mylist just references the list instance.
But an array myarray = np.array([1, 2, 3]) doesn't store Python objects as elements:
The values 1, 2 and 3 are stored directly in the NumPy array instance.
With this information I can explain why iterating over an array is so much slower compared to an iteration over a list:
Each time you access the next element in a list, the list just returns a stored object. That's very fast because the element already exists as a Python object (it just needs to increment the reference count by one).
On the other hand when you want an element of an array it needs to create a new Python "box" for the value with all the bookkeeping stuff before it is returned. When you iterate over the array it needs to create one Python box for each element in your array:
Creating these boxes is slow and is the main reason why iterating over NumPy arrays is much slower than iterating over Python collections (lists/tuples/sets/dictionaries), which store the values and their boxes:
import numpy as np
arr = np.arange(100000)
lst = list(range(100000))
def iterateover(obj):
    for item in obj:
        pass
%timeit iterateover(arr)
# 20.2 ms ± 155 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit iterateover(lst)
# 3.96 ms ± 26.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
The set "constructor" just does an iteration over the object.
One thing I can't answer definitely is why the tolist method is so much faster. In the end each value in the resulting Python list needs to be in a "Python box" so there's not much work that tolist could avoid. But one thing I know for sure is that list(array) is slower than array.tolist():
arr = np.arange(100000)
%timeit list(arr)
# 20 ms ± 114 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit arr.tolist()
# 10.3 ms ± 253 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Each of these has O(n) runtime complexity but the constant factors are very different.
In your case you compared set() to tolist(), which isn't a particularly good comparison. It would make more sense to compare set(arr) to list(arr), or set(arr.tolist()) to arr.tolist():
arr = np.random.randint(0, 1000, (10000, 3))
def tosets(arr):
    for line in arr:
        set(line)

def tolists(arr):
    for line in arr:
        list(line)

def tolists_method(arr):
    for line in arr:
        line.tolist()

def tosets_intermediatelist(arr):
    for line in arr:
        set(line.tolist())
%timeit tosets(arr)
# 72.2 ms ± 2.68 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit tolists(arr)
# 80.5 ms ± 2.18 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit tolists_method(arr)
# 16.3 ms ± 140 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit tosets_intermediatelist(arr)
# 38.5 ms ± 200 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
So if you want sets, you are better off using set(arr.tolist()). For bigger arrays it could make sense to use np.unique, but because your rows contain only 3 items that will likely be slower (for thousands of elements it could be much faster!).
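For reference, the np.unique variant mentioned above would look roughly like this (note it yields sorted arrays of unique values rather than set objects):
import numpy as np

arr = np.random.randint(0, 1000, (10000, 3))
# one sorted array of uniques per row; avoids building Python sets entirely
uniques_per_row = [np.unique(line) for line in arr]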
In the comments you asked about numba and yes, it's true that numba could speed this up. Numba supports typed sets (only numeric types), but that doesn't mean it will be always faster.
I'm not sure how numba (re-)implements sets but because they are typed it's likely they also avoid the "Python boxes" and store the values directly inside the set:
Sets are more complicated than lists because they involve hashing and empty slots (Python uses open addressing for sets, so I assume numba does too).
Like the NumPy array, the numba set saves the values directly. So when you convert a NumPy array to a numba set (or vice versa) it won't need to use "Python boxes" at all, so when you create the sets in a numba nopython function it will be much faster even than the set(arr.tolist()) operation:
import numba as nb

@nb.njit
def tosets_numba(arr):
    for lineno in range(arr.shape[0]):
        set(arr[lineno])
tosets_numba(arr) # warmup
%timeit tosets_numba(arr)
# 6.55 ms ± 105 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
That's roughly five times faster than the set(arr.tolist()) approach. But it's important to highlight that I did not return the sets from the function. When you return a set from a nopython numba function to Python, numba creates a Python set, including "creating the boxes" for all values in the set (that's something numba hides).
Just FYI: the same boxing/unboxing happens if you pass lists to numba nopython functions or return lists from these functions. So what's an O(1) operation in Python is an O(n) operation with numba! That's why it's generally better to pass NumPy arrays to numba nopython functions (which is O(1)).
I assume that if you return these sets from the function (not really possible right now because numba doesn't support lists of sets currently) it would be slower (because it creates a numba set and later converts it to a python set) or only marginally faster (if the conversion numbaset -> pythonset is really, really fast).
Personally I would use numba for sets only if I don't need to return them from the function and do all operations on the set inside the function and only if all the operations on the set are supported in nopython mode. In any other case I wouldn't use numba here.
Just a note: from numpy import * should be avoided, you hide several python built-in functions when you do that (sum, min, max, ...) and it puts a lot of stuff into your globals. Better to use import numpy as np. The np. in front of function calls makes the code clearer and isn't much to type.
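The shadowing is not just cosmetic; numpy's sum and the built-in sum interpret their second argument differently, for example:
from numpy import *   # shadows the built-in sum
import builtins

print(sum(range(5), -1))           # numpy's sum: -1 is the axis argument -> 10
print(builtins.sum(range(5), -1))  # built-in sum: -1 is the start value  -> 9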
Here is a way to speed things up: avoid the loop and use a multiprocessing pool.map trick:
from multiprocessing.dummy import Pool as ThreadPool
import multiprocessing
pool = ThreadPool(multiprocessing.cpu_count())  # one worker per CPU
y = pool.map(set, x)                            # apply set() to each row of x
pool.close()
pool.join()