I have a very large array and want to test each element against a condition, then convert the resulting booleans to numbers. For example, I want to check whether each element lies in the interval (0.3, 0.4) and convert True to 1 and False to 0.
i=np.random.rand(1000,1000,1000)
j=((0.3<i)*(i<0.4))*1
Does j=((0.3<i)&(i<0.4))*1 work the same as the expression above?
I know that bool*bool is time-consuming and uses a lot of memory, and so is converting bool to number. How can I speed up the computation and save memory? Is there a way to evaluate 0.3<i<0.4 quickly?
Yes, for boolean arrays & and * are identical because both are only True if both operands are True, otherwise False.
You already found out that each operation creates a temporary array (although newer NumPy versions might be optimized in that respect), so you have one temporary boolean array for each <, one for the * or the & and then you create an integer array with the * 1. Without using additional libraries you can't avoid that. NumPy is fast because it does the loops in C but that means you have to deal with temporary arrays.
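To make the memory cost of those temporaries concrete, the sketch below inspects the array sizes on a small input. The array shape and the names `small`, `lower_mask`, `upper_mask`, `combined` and `as_int8` are illustrative assumptions. The `int8` view at the end is only an option if a one-byte integer result is acceptable; it avoids the final conversion copy but does not remove the boolean temporaries.

```python
import numpy as np

rng = np.random.default_rng(0)
small = rng.random((100, 100, 10))     # small stand-in for the 1000x1000x1000 array

lower_mask = 0.3 < small               # temporary 1: one byte per element
upper_mask = small < 0.4               # temporary 2: one byte per element
combined = lower_mask & upper_mask     # temporary 3: one byte per element
as_default_int = combined * 1          # result: platform default integer type

# Reinterpret the boolean buffer as int8 without copying (True -> 1, False -> 0).
as_int8 = combined.view(np.int8)

print(small.nbytes, combined.nbytes, as_default_int.nbytes, as_int8.nbytes)
```

On the full-size array each of these buffers scales with the 10**9 elements, which is why the number of temporaries matters.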
But with additional libraries you actually can speed that up and make it more memory-efficient.
Numba:
import numba as nb
import numpy as np
@nb.njit
def numba_func(arr, lower, upper):
    res = np.zeros(arr.size, dtype=np.int8)
    arr_raveled = arr.ravel()
    for idx in range(arr.size):
        res[idx] = lower < arr_raveled[idx] < upper
    return res.reshape(arr.shape)
>>> numba_func(i, 0.3, 0.4) # sample call
Numexpr
import numexpr as ne
ne.evaluate('((0.3<i)&(i<0.4))*1')
However, numexpr is more of a black box: you don't control how much memory it needs. Still, in most cases where you deal with multiple element-wise NumPy operations it's very fast and much more memory-efficient than NumPy.
Cython
I'm using IPython magic here. If you don't use IPython or Jupyter you probably need to cythonize it yourself.
%load_ext cython
%%cython
import numpy as np
cimport numpy as cnp
cpdef cnp.int8_t[:] cython_func(double[:] arr, double lower, double upper):
    cdef Py_ssize_t idx
    cdef cnp.int8_t[:] res = np.empty(len(arr), dtype=np.int8)
    for idx in range(len(arr)):
        res[idx] = lower < arr[idx] < upper
    return res
Given that I used 1D-memoryviews here, you need to cast it to an array and reshape it afterwards:
np.asarray(cython_func(i.ravel(), 0.3, 0.4)).reshape(i.shape) # sample call
There are probably better ways to get around the ravel, asarray and reshape but those require that you know the dimension of your array.
Timing
I use a smaller array because I don't have much RAM but you can easily change the numbers:
i = np.random.random((1000, 1000, 10))
%timeit numba_func(i, 0.3, 0.4)
52.1 ms ± 3.08 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit ne.evaluate('((0.3<i)&(i<0.4))*1')
77.1 ms ± 6.59 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit np.asarray(cython_func(i.ravel(), 0.3, 0.4)).reshape(i.shape)
146 ms ± 3.12 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit ((0.3<i)&(i<0.4))*1
180 ms ± 2.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Yes, the expression works the same. Check it with
jmult = ((0.3<i)*(i<0.4))*1
jand = ((0.3<i)&(i<0.4))*1
jand == jmult
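Because `jand == jmult` returns an element-wise array rather than a single answer, a reduction such as `np.array_equal` is handy. A minimal sketch on a small array (the shape is an arbitrary choice for illustration):

```python
import numpy as np

i = np.random.rand(50, 50, 50)          # small stand-in array

jmult = ((0.3 < i) * (i < 0.4)) * 1     # bool * bool
jand = ((0.3 < i) & (i < 0.4)) * 1      # bool & bool

# Collapse the element-wise comparison into one True/False for the whole array
print(np.array_equal(jand, jmult))
```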
I am learning numba to try to optimize my code.
First, I calculated a*exp(b*x) where a and b are two parameters and x is a large NumPy array. I used the @jit decorator. The output is a 1D array with len(x) elements.
This is the code:
import timeit
import numpy as np
from numba import jit, prange
def func_np(par, x):
    return par[0]*np.exp(par[1]*x)

@jit('float64[:](float64[:], float64[:])', nopython=True, parallel=True)
def func_parallel(par, x):
    length = len(x)
    result = np.empty(length, dtype=np.float64)
    for i in prange(length):
        result[i] = par[0]*np.exp(par[1]*x[i])
    return result
x=np.array(np.arange(0,100,0.0001))
par=np.array([10,0.1])
print("numpy only")
%timeit func_np(par,x)
print("")
print("numba")
%timeit func_parallel(par,x)
The output is :
numpy only
28.1 ms ± 27.5 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
numba
2.05 ms ± 13.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
So everything worked well: Numba gives a large speedup. (The machine where this code runs has 48 cores.)
Next, I decided to use an array of a and b parameters (par=np.array([[i*10,i*0.1] for i in range(10)])); x is unchanged.
This is where my problems began. First, as the output is a 2D array of shape (10, len(x)), it seems that I cannot use the @jit decorator and must use the @guvectorize decorator instead. I have written this code (very similar to the one above), which works:
import timeit
import numpy as np
from numba import guvectorize, prange
def func_np(par, x):
    return par[:,0:1]*np.exp(par[:,1:2]*x)

@guvectorize([ 'float64[:,:], float64[:,:], float64[:]' ], '(m,n),(m,p),(n)', nopython=True, target='parallel')
def func_parallel(result, par, x):
    length = len(x)
    for i in prange(length):
        result[:,i] = (par[:,0:1]*np.exp(par[:,1:2]*x[i])).T
x=np.array(np.arange(0,100,0.0001))
par=np.array([[i*10,i*0.1] for i in range(10)])
print("numpy only")
%timeit func_np(par,x)
print("")
print("numba")
result = np.empty([par.shape[0],x.shape[0]], dtype=np.float64)
%timeit func_parallel(result, par,x);
But, the benchmarks are not good. The code with numba optimization is slower than the code written only with numpy functions.
numpy only
358 ms ± 307 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
numba
563 ms ± 2.82 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
As it is the first time I used numba, I suppose there is something that I have not understood. Any help would be appreciated.
I don't know exactly what guvectorize does to slow you down (most likely creating extra unnecessary threads), but it's simple to write a more explicit implementation.
Consider the following implementation, which is faster than the NumPy implementation.
@jit('float64[:,:](float64[:,:], float64[:])', nopython=True, parallel=True)
def func_parallel(par, x):
    result = np.zeros((par.shape[0], len(x)), dtype=np.float64)
    length = len(x)
    for i in prange(length):
        for j in range(par.shape[0]):
            result[j, i] = par[j, 0] * np.exp(par[j, 1] * x[i])
    return result
It runs about 3x faster on my 2C/4T machine, so it should be much faster on yours.
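The nested loop computes `result[j, i] = par[j, 0] * exp(par[j, 1] * x[i])`, which is the same quantity as the broadcast NumPy expression from the question. A small check of that equivalence, run in plain Python without Numba and on deliberately tiny inputs (the helper name `func_loops` is hypothetical), might look like this:

```python
import numpy as np

def func_np(par, x):
    # Broadcast version from the question: (m, 1) * exp((m, 1) * (n,)) -> (m, n)
    return par[:, 0:1] * np.exp(par[:, 1:2] * x)

def func_loops(par, x):
    # Same loop nest as the jitted function, run unjitted for checking only
    result = np.empty((par.shape[0], len(x)), dtype=np.float64)
    for i in range(len(x)):
        for j in range(par.shape[0]):
            result[j, i] = par[j, 0] * np.exp(par[j, 1] * x[i])
    return result

x = np.arange(0, 1, 0.01)                               # much smaller than the benchmark x
par = np.array([[k * 10, k * 0.1] for k in range(10)])
print(np.allclose(func_np(par, x), func_loops(par, x)))
```

Each output element depends on a single `(j, i)` pair only, so the outer loop can be handed to `prange` without iterations interfering with each other.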
I have recently discovered that Numba may run much slower than pure Python, even in nopython mode with the parallel=True option enabled.
Important: If you don't deal with Voronoi diagrams please continue reading, my question doesn't relate to them directly.
Currently, I am working on a problem where I have energy associated with the edges and cell areas of a Voronoi diagram. The scipy Voronoi object returns an array containing pairs of points (vor.ridge_points) associated with each Voronoi edge. For my code, I want to be able to get the index of an edge from the indexes of its two associated points, so I define a kind of adjacency matrix that, instead of ones and zeros, holds zeros and edge indexes.
It turns out that pure Python, when looping over NumPy arrays, is 10 times faster than Numba. Here is a toy example (I just randomly generated arrays with the same number of edges and points as in my simulation).
My guess is that it has something to do with memory allocation. Any take on the subject would be appreciated (either the reason why it is so much slower, or a better way to get the edge number from the point numbers) :)
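To make the lookup structure concrete, here is a toy sketch with three points and three ridges (made-up data, not from the simulation). Note that in this layout ridge index 0 cannot be told apart from "no edge", since both are stored as 0.

```python
import numpy as np

# Hypothetical toy data: each row holds the two point indexes of one ridge
ridge_points = np.array([[0, 1], [1, 2], [0, 2]], dtype=np.int32)
points_number = 3

matrix = np.zeros((points_number, points_number), dtype=np.int32)
for edge_index, (p1, p2) in enumerate(ridge_points):
    matrix[p1, p2] = edge_index
    matrix[p2, p1] = edge_index   # symmetric: lookup works in either order

print(matrix)
print(matrix[2, 1])   # index of the ridge between points 2 and 1
```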
# %%
from numba.np.ufunc import parallel
import numpy as np
from numba import njit
from numba import prange
# %% generating array that models array of ridges
points_number = 8802
ridges_number = 26379
np.random.seed(123)
ridge_points = np.random.randint(points_number, size=(ridges_number, 2))
# %% symmetric matrix containing indexes of all edges
# in space [original_point_1, original_point_2]
ridge_points = np.array(ridge_points, dtype=np.int32)
@njit(parallel=True, cache=True)
def jit_edges_matrix_op(r_p, r_n):
    matrix = np.zeros((r_n, r_n), dtype=np.int32)
    for i in prange(r_n):
        e1 = r_p[i, 0]
        e2 = r_p[i, 1]
        matrix[e1, e2] = i
        matrix[e2, e1] = i
    return matrix
e_matrix_op = jit_edges_matrix_op(ridge_points, ridges_number)
# %% the same but not jitted
def edges_matrix_op(r_p, r_n):
    matrix = np.zeros((r_n, r_n), dtype=np.int32)
    for i in range(r_n):
        e1 = r_p[i, 0]
        e2 = r_p[i, 1]
        matrix[e1, e2] = i
        matrix[e2, e1] = i
    return matrix
e_matrix_op = edges_matrix_op(ridge_points, ridges_number)
# %%
%%timeit
jit_edges_matrix_op(ridge_points, ridges_number)
# %%
%%timeit
edges_matrix_op(ridge_points, ridges_number)
UPDATE
Indeed, parallelization is not working properly here, so I also ran the tests with parallel=False. Here are the results:
630 ms ± 20.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) - parallel=True
553 ms ± 4.22 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) - parallel=False
66.5 ms ± 3.12 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) - pure python
UPDATE 2
Thanks to max9111 sharing a link https://github.com/numba/numba/issues/7259
There seems to be an issue with allocating large arrays with zeros (np.zeros)
The issue has been reported a couple of weeks ago, and the link contains some workaround examples.
I tried allocating np.empty()
29.7 ms ± 1.38 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)- numba parallel=True
44.7 ms ± 2.34 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)- numba parallel=False
60.4 ms ± 1.47 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)- pure python
As you can see, parallelized Numba works best, so this task is parallelizable and the overhead is not that big.
I think the crucial issue is related to what is the nature of the calculations you are performing in this part of the code:
for i in range(ridges_number):
    e1 = r_p[i, 0]
    e2 = r_p[i, 1]
    matrix[e1, e2] = i
    matrix[e2, e1] = i
Loops perform best when they are cache-local and trivially parallelizable (i.e. the calculations in each iteration are independent).
In your case, both conditions are violated. e1 and e2 do not take consecutive values across iterations. Similarly, the matrix r_p likely prevents efficient parallelization because it needs to be accessed by all of the threads in each iteration (and it is probably locked by one thread while the others wait to access it).
All in all, the function you chose to speed up may suffer the overhead of parallelization while the calculations are in effect executed sequentially. And the calculations, at least as they are at the moment, are inherently difficult to speed up by parallelization.
I'm sure this has a name in some other domain (maybe approx count distinct?).
Suppose you want to count the number of distinct elements in a NumPy array, but you only care about counts below some threshold; above it, you just report that the array has more than thresh unique entries. This is particularly useful for high-cardinality arrays, where you don't care that there are 10000 distinct entries, just that there are more than 10.
In a compiled language this is simple to make fast. But what fast implementations are exposed to Python?
Naively one might try numba like this:
@numba.jit(nopython=True)
def nunique_max_thresh(x, thresh=10):
    seen = set()
    for i in range(len(x)):
        seen.add(x[i])
        if len(seen) > thresh:
            return thresh
    return len(seen)
But the set usage is not supported.
Cython is an option but I am wondering if this is already done in some library or elsewhere in python. It seems like bottleneck would do this kind of thing but it's not really in there.
https://bottleneck.readthedocs.io/en/latest/reference.html
For example, consider these kind of arrays:
import string
import numpy as np
np.random.seed(0)
a = np.random.choice(list(string.ascii_letters), int(1e7))
b = np.ones(int(1e7))
And you just want to know if this array has 10 or more unique values. Do not use the fact that these are length one strings.
For reference, this runs. But is probably not optimal.
import numpy as np
cimport numpy as np
def nunique_truncated(np.ndarray x_in, np.int thresh=10):
    seen = set()
    for i in range(x_in.shape[0]):
        seen.add(x_in[i])
        if len(seen) >= thresh:
            return thresh
    # fewer than thresh distinct values: return the exact count
    return len(seen)
As @hpaulj suggested, you can just use Numba without a set or dict, and it should be reasonable since the use case specifically targets short lists of distinct values. Obviously some regimes will suffer from slow membership lookups.
import numba
@numba.jit(nopython=True)
def nunique_truncated_numba(x_in, thresh=10):
    seen = list()
    for i, x in enumerate(x_in):
        if x not in seen:
            seen.append(x)
            if len(seen) > thresh:
                return len(seen)
    return len(seen)
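For checking the Numba version on small inputs, a pure-Python reference with the same early-exit rule can be sketched as follows. The name `nunique_truncated_py` is hypothetical, and a `set` is used here only because this reference is not compiled; it is not meant to be fast.

```python
def nunique_truncated_py(x_in, thresh=10):
    # Hypothetical pure-Python reference: stop as soon as more than
    # `thresh` distinct values have been seen.
    seen = set()
    for x in x_in:
        seen.add(x)
        if len(seen) > thresh:
            return len(seen)      # equals thresh + 1 at this point
    return len(seen)

print(nunique_truncated_py([1.0] * 1000))        # few distinct values: full scan
print(nunique_truncated_py(list(range(1000))))   # many distinct values: early exit
```

With this rule, any return value above `thresh` only means "more than `thresh` distinct values", not an exact count.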
And the hard case is really when you do not hit the threshold (you are using python to do vectorized sweeps).
In [6]: %timeit cud.nunique_truncated(b)
116 µs ± 304 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [7]: %timeit len(np.unique(b))
1.26 ms ± 2.64 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
I would be interested if anyone has other suggestions and tricks.
I want to do the element-wise outer product of three (or four) large 2D arrays in python (values are float32 rounded to 2 decimals). They all have the same number of rows "n", but different number of columns "i", "j", "k".
The resulting array should be of shape (n, i*j*k). Then, I want to sum each column of the result to end up with a 1D array of shape (i*j*k).
np.shape(a) = (75466, 10)
np.shape(b) = (75466, 28)
np.shape(c) = (75466, 66)
np.shape(intermediate_result) = (75466, 18480)
np.shape(result) = (18480,)
Thanks to ruankesi and divakar, I got a piece of code that works:
# Multiply first two matrices
first_multi = a[...,None] * b[:,None]
# could use np.einsum('ij,ik->ijk',a,b), which is slightly faster
ab_fills = first_multi.reshape(a.shape[0], a.shape[1]*b.shape[1])
# Multiply the result with the third matrix
second_multi = ab_fills[..., None] * c[:,None]
abc_fills = second_multi.reshape(ab_fills.shape[0], ab_fills.shape[1] * c.shape[1])
# Get the result: sum columns and get a 1D array of length 10*28*66 = 18 480
result = np.sum(abc_fills, axis = 0)
Problem 1: Performance
This takes about 3 seconds, but I have to repeat this operation many times and some of the matrices are even larger (in number of rows). It is acceptable but making it faster would be nice.
Problem 2: My matrices are sparse
Indeed, for instance, "a" contains 70% of 0s. I tried to play with scipy csc_matrix, but really could not get a working version. (to get the element-wise outer product here I go via a conversion to a 3D matrix, which are not supported in scipy sparse_matrix)
Problem 3: memory usage
If I try to also work with a 4th matrix, I run into memory issues.
I imagine that converting this code to sparse_matrix would save a lot of memory, and make the calculation faster by ignoring the numerous 0 values.
Is that true? If yes, can someone help me?
Of course, if you have any suggestion for a better implementation, I am also very interested. I don't need any of the intermediate results, just the final 1D result.
I have been stuck on this part of the code for weeks, I am going nuts!
Thank you!
Edit after Divakar's answer
Approach #1:
Very nice one liner but surprisingly slower than the original approach (?).
On my test dataset, approach #1 takes 4.98 s ± 3.06 ms per loop (no speedup with optimize = True)
The original decomposed approach took 3.01 s ± 16.5 ms per loop
Approach #2:
Absolutely great, thank you! What an impressive speedup!
62.6 ms ± 233 µs per loop
About numexpr, I try to avoid as much as possible requirements for external modules, and I don't plan to use multicores/threads. This is an "embarrassingly" parallelizable task, with hundreds of thousands of objects to analyze, I'll just spread the list across available CPUs during production. I will give it a try for memory optimization.
As a brief try of numexpr with a restriction for 1 thread, performing 1 multiplication, I get a runtime of 40ms without numexpr, and 52 ms with numexpr.
Thanks again!!
Approach #1
We can use np.einsum to do sum-reductions in one go -
result = np.einsum('ij,ik,il->jkl',a,b,c).ravel()
Also, play around with the optimize flag in np.einsum by setting it as True to use BLAS.
Approach #2
We can use broadcasting to do the first step, as also mentioned in the posted code, and then leverage tensor-matrix multiplication with np.tensordot -
def broadcast_dot(a,b,c):
    first_multi = a[...,None] * b[:,None]
    return np.tensordot(first_multi,c, axes=(0,0)).ravel()
We can also use numexpr module that supports multi-core processing and also achieves better memory efficiency to get first_multi. This gives us a modified solution, like so -
import numexpr as ne
def numexpr_broadcast_dot(a,b,c):
    first_multi = ne.evaluate('A*B',{'A':a[...,None],'B':b[:,None]})
    return np.tensordot(first_multi,c, axes=(0,0)).ravel()
Timings on random float data with given dataset sizes -
In [36]: %timeit np.einsum('ij,ik,il->jkl',a,b,c).ravel()
4.57 s ± 75.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [3]: %timeit broadcast_dot(a,b,c)
270 ms ± 103 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [4]: %timeit numexpr_broadcast_dot(a,b,c)
172 ms ± 63.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Just to give a sense of improvement with numexpr -
In [7]: %timeit a[...,None] * b[:,None]
80.4 ms ± 2.64 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [8]: %timeit ne.evaluate('A*B',{'A':a[...,None],'B':b[:,None]})
25.9 ms ± 191 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
This should be substantial when extending this solution to higher number of inputs.
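The formulations discussed here can be checked against each other on small arrays. The sketch below uses made-up shapes (50 rows; 3, 4 and 5 columns) purely to keep the check quick; it compares the decomposed approach from the question with the `einsum` and `tensordot` versions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50                                   # small row count for a quick check
a, b, c = rng.random((n, 3)), rng.random((n, 4)), rng.random((n, 5))

# Original decomposed approach from the question
ab = (a[..., None] * b[:, None]).reshape(n, -1)
abc = (ab[..., None] * c[:, None]).reshape(n, -1)
ref = abc.sum(axis=0)

# Approach #1: single einsum that sums over the shared row index i
via_einsum = np.einsum('ij,ik,il->jkl', a, b, c).ravel()

# Approach #2: broadcast a and b, then contract the row axis with tensordot
via_tensordot = np.tensordot(a[..., None] * b[:, None], c, axes=(0, 0)).ravel()

print(ref.shape, np.allclose(ref, via_einsum), np.allclose(ref, via_tensordot))
```

The `tensordot` version never materialises the `(n, i*j*k)` intermediate, which is where the original approach spends most of its memory.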
I'm trying to execute the following
from numpy import *
x = array([[3,2,3],[711,4,104],.........,[4,4,782,7845]]) # large nparray
for item in x:
    set(item)
and it takes very long compared to:
x = array([[3,2,3],[711,4,104],.........,[4,4,782,7845]]) # large nparray
for item in x:
    item.tolist()
Why does it take much longer to convert a NumPy array to a set than to a list?
I mean basically both have complexity O(n)?
TL;DR: The set() function creates a set using Python's iteration protocol. But iterating (on the Python level) over NumPy arrays is so slow that using tolist() to convert the array to a Python list before doing the iteration is (much) faster.
To understand why iterating over NumPy arrays is so slow it's important to know how Python objects, Python lists, and NumPy arrays are stored in memory.
A Python object needs some bookkeeping properties (like the reference count, a link to its class, ...) and the value it represents. For example the integer ten = 10 could look like this:
The blue circle is the "name" you use in the Python interpreter for the variable ten and the lower object (instance) is what actually represents the integer (since the bookkeeping properties aren't important here I ignored them in the images).
A Python list is just a collection of Python objects, for example mylist = [1, 2, 3] would be saved like this:
This time the list references the Python integers 1, 2 and 3 and the name mylist just references the list instance.
But an array myarray = np.array([1, 2, 3]) doesn't store Python objects as elements:
The values 1, 2 and 3 are stored directly in the NumPy array instance.
With this information I can explain why iterating over an array is so much slower compared to an iteration over a list:
Each time you access the next element in a list the list just returns a stored object. That's very fast because the element already exists as Python object (it just needs to increment the reference count by one).
On the other hand when you want an element of an array it needs to create a new Python "box" for the value with all the bookkeeping stuff before it is returned. When you iterate over the array it needs to create one Python box for each element in your array:
Creating these boxes is slow and the main reason why iterating over NumPy arrays is much slower than iterating over Python collections (lists/tuples/sets/dictionaries) which store the values and their box:
import numpy as np
arr = np.arange(100000)
lst = list(range(100000))
def iterateover(obj):
    for item in obj:
        pass
%timeit iterateover(arr)
# 20.2 ms ± 155 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit iterateover(lst)
# 3.96 ms ± 26.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
The set "constructor" just does an iteration over the object.
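The boxing described above can be observed directly. This is a small sketch of CPython/NumPy behaviour as I understand it; the variable names are arbitrary.

```python
import numpy as np

arr = np.array([1, 2, 3])
lst = arr.tolist()

# Indexing the array builds a fresh NumPy scalar object on every access ...
print(type(arr[0]), arr[0] is arr[0])
# ... while the list hands back the Python int it already stores.
print(type(lst[0]), lst[0] is lst[0])
```

The identity checks compare object identity, not value: two accesses to the same array position yield equal values wrapped in two different objects.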
One thing I can't answer definitely is why the tolist method is so much faster. In the end each value in the resulting Python list needs to be in a "Python box" so there's not much work that tolist could avoid. But one thing I know for sure is that list(array) is slower than array.tolist():
arr = np.arange(100000)
%timeit list(arr)
# 20 ms ± 114 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit arr.tolist()
# 10.3 ms ± 253 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Each of these has O(n) runtime complexity but the constant factors are very different.
In your case you compared set() to tolist(), which isn't a particularly good comparison. It would make more sense to compare set(arr) to list(arr) or set(arr.tolist()) to arr.tolist():
arr = np.random.randint(0, 1000, (10000, 3))
def tosets(arr):
    for line in arr:
        set(line)

def tolists(arr):
    for line in arr:
        list(line)

def tolists_method(arr):
    for line in arr:
        line.tolist()

def tosets_intermediatelist(arr):
    for line in arr:
        set(line.tolist())
%timeit tosets(arr)
# 72.2 ms ± 2.68 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit tolists(arr)
# 80.5 ms ± 2.18 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit tolists_method(arr)
# 16.3 ms ± 140 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit tosets_intermediatelist(arr)
# 38.5 ms ± 200 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
So if you want sets you are better off using set(arr.tolist()). For bigger arrays it could make sense to use np.unique but because your rows only contain 3 items that will likely be slower (for thousands of elements it could be much faster!).
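For a single row, the two options mentioned here can be compared side by side. The row values below are made up for illustration; `np.unique` returns a sorted array rather than a set, so which one fits depends on what you do with the result afterwards.

```python
import numpy as np

row = np.array([4, 4, 782, 7845, 4])

as_set = set(row.tolist())        # Python set of Python ints
as_unique = np.unique(row)        # sorted NumPy array of the distinct values

print(as_set, as_unique)
```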
In the comments you asked about numba and yes, it's true that numba could speed this up. Numba supports typed sets (only numeric types), but that doesn't mean it will be always faster.
I'm not sure how numba (re-)implements sets but because they are typed it's likely they also avoid the "Python boxes" and store the values directly inside the set:
Sets are more complicated than lists because they involve hashes and empty slots (Python uses open addressing for sets, so I assume Numba does too).
Like the NumPy array, the Numba set stores the values directly. So when you convert a NumPy array to a Numba set (or vice versa) it won't need to use "Python boxes" at all, so when you create the sets in a Numba nopython function it will be much faster even than the set(arr.tolist()) operation:
import numba as nb
@nb.njit
def tosets_numba(arr):
    for lineno in range(arr.shape[0]):
        set(arr[lineno])
tosets_numba(arr) # warmup
%timeit tosets_numba(arr)
# 6.55 ms ± 105 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
That's roughly five times faster than the set(arr.tolist()) approach. But it's important to highlight that I did not return the sets from the function. When you return a set from a nopython numba function to Python Numba creates a python set - including "creating the boxes" for all values in the set (that's something numba is hiding).
Just FYI: The same boxing/unboxing happens if you pass lists to Numba nopython functions or return lists from these functions. So what's a O(1) operation in Python is an O(n) operation with Numba! That's why it's generally better to pass NumPy arrays to numba nopython function (which is O(1)).
I assume that if you return these sets from the function (not really possible right now because numba doesn't support lists of sets currently) it would be slower (because it creates a numba set and later converts it to a python set) or only marginally faster (if the conversion numbaset -> pythonset is really, really fast).
Personally I would use numba for sets only if I don't need to return them from the function and do all operations on the set inside the function and only if all the operations on the set are supported in nopython mode. In any other case I wouldn't use numba here.
Just a note: from numpy import * should be avoided, you hide several python built-in functions when you do that (sum, min, max, ...) and it puts a lot of stuff into your globals. Better to use import numpy as np. The np. in front of function calls makes the code clearer and isn't much to type.
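One concrete consequence of the shadowing: after `from numpy import *`, the bare name `sum` refers to `np.sum`, so a call that looks identical changes meaning. The sketch below calls both functions explicitly instead of using the star import, so it is safe to run anywhere.

```python
import builtins
import numpy as np

values = np.arange(5)

# Built-in sum: the second argument is the start value, so (0+1+2+3+4) + (-1)
print(builtins.sum(values, -1))
# NumPy sum: the second argument is the axis, so the result is just 0+1+2+3+4
print(np.sum(values, -1))
```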
Here is a way to speed things up: avoid the loop and use a multiprocessing pool.map trick
from multiprocessing.dummy import Pool as ThreadPool
import multiprocessing
pool = ThreadPool(multiprocessing.cpu_count()) # one worker thread per CPU core
y = pool.map(set,x) # apply the function to your iterable
pool.close()
pool.join()