speeding up numpy.dot inside list comprehension - python

I have a numpy script that is currently running quite slowly.
It spends the vast majority of its time performing the following operation inside a loop:
terms=zip(Coeff_3,Coeff_2,Curl_x,Curl_y,Curl_z,Ex,Ey,Ez_av)
res=[np.dot(C2,array([C_x,C_y,C_z]))+np.dot(C3,array([ex,ey,ez])) for (C3,C2,C_x,C_y,C_z,ex,ey,ez) in terms]
res=array(res)
Ex[1:Nx-1]=res[1:Nx-1,0]
Ey[1:Nx-1]=res[1:Nx-1,1]
It's the list comprehension that is really slowing this code down.
In this case, Coeff_3 and Coeff_2 are length-1000 lists whose elements are 3x3 numpy matrices, and Ex, Ey, Ez, Curl_x, etc. are all length-1000 numpy arrays.
I realize it might be faster if I did things like using a single 3x1000 E array, but I have to perform a significant amount of averaging of different E vectors between steps, which would make things very unwieldy.
Curiously, however, I perform this operation twice per loop (once for Ex, Ey, once for Ez), and performing the same operation for the Ez's takes almost twice as long:
terms2=zip(Coeff_3,Coeff_2,Curl_x,Curl_y,Curl_z,Ex_av,Ey_av,Ez)
res2=array([np.dot(C2,array([C_x,C_y,C_z]))+np.dot(C3,array([ex,ey,ez])) for (C3,C2,C_x,C_y,C_z,ex,ey,ez) in terms2])
Anyone have any idea what's happening? Forgive me if it's anything obvious; I'm very new to Python.

As pointed out in previous comments, use array operations. np.hstack(), np.vstack(), np.outer() and np.inner() are useful here. Your code could become something like this (not sure about your dimensions):
Cxyz = np.vstack((Curl_x,Curl_y,Curl_z))
C2xyz = np.dot(C2, Cxyz)
...
Check the shapes of the resulting arrays to make sure you translated your problem right. Sometimes numexpr can also speed up such tasks significantly with little extra effort.
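If the coefficient lists can be stacked into arrays up front, the whole list comprehension collapses into two batched matrix-vector products. Here is a minimal sketch, assuming Coeff_2 and Coeff_3 stack into (1000, 3, 3) arrays and the field components into (1000, 3) arrays (check the shapes against your data):
import numpy as np

# Assumed shapes: 1000 3x3 coefficient matrices and length-1000 component arrays.
C2 = np.array(Coeff_2)                               # (1000, 3, 3)
C3 = np.array(Coeff_3)                               # (1000, 3, 3)
curl = np.column_stack((Curl_x, Curl_y, Curl_z))     # (1000, 3)
E = np.column_stack((Ex, Ey, Ez_av))                 # (1000, 3)

# Two batched matrix-vector products replace the per-element np.dot calls.
res = np.einsum('nij,nj->ni', C2, curl) + np.einsum('nij,nj->ni', C3, E)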


Faster numpy array indexing when using condition (numpy.where)?

I have a huge numpy array with shape (50000000, 3) and I'm using:
x = array[np.where((array[:,0] == value) | (array[:,1] == value))]
to get the part of the array that I want. But this way seems to be quite slow.
Is there a more efficient way of performing the same task with numpy?
np.where is highly optimized and I doubt someone can write faster code than the one implemented in the latest Numpy version (disclaimer: I was the one who optimized it). That being said, the main issue here is not so much np.where as the conditional, which creates a temporary boolean array. This is unfortunately the way to do it in Numpy, and there is not much you can do about it as long as you use only Numpy with the same input layout.
One reason it is not very efficient is that the input data layout is inefficient. Indeed, assuming array is stored contiguously in memory using the default row-major ordering, array[:,0] == value will read 1 item out of every 3 items of the array in memory. Due to the way CPU caches work (i.e. cache lines, prefetching, etc.), 2/3 of the memory bandwidth is wasted. Moreover, the output boolean array also needs to be written, and filling a newly created array is a bit slow due to page faults. Note that array[:,1] == value will certainly reload data from RAM due to the size of the input (which cannot fit in most CPU caches). RAM is slow, and it is getting slower relative to the computational speed of the CPU and caches. This problem, called the "memory wall", was observed a few decades ago and is not expected to be fixed any time soon. Also note that the logical-or will create yet another array read from/written to RAM. A better data layout is a (3, 50000000) transposed array contiguous in memory (note that np.transpose does not produce a contiguous array).
Another reason for the performance issue is that Numpy tends not to be optimized for operating on very small axes.
One main solution is to create the input in a transposed layout if possible. Another solution is to write Numba or Cython code. Here is a Numba implementation for the non-transposed input:
import numpy as np
import numba as nb

# Compilation for the most frequent types.
# Please pick the right ones to speed up the compilation time.
@nb.njit(['(uint8[:,::1],uint8)', '(int32[:,::1],int32)', '(int64[:,::1],int64)', '(float64[:,::1],float64)'], parallel=True)
def select(array, value):
    n = array.shape[0]
    mask = np.empty(n, dtype=np.bool_)
    # Parallel loop over rows; builds the boolean mask without extra temporaries.
    for i in nb.prange(n):
        mask[i] = array[i, 0] == value or array[i, 1] == value
    return mask

x = array[select(array, value)]
Note that I used a parallel implementation since the or operator is sub-optimal with Numba (the only alternative seems to be native code or Cython) and also because the RAM cannot be fully saturated with one thread on some platforms like computing servers. Also note that, depending on the result of select, it can be faster to use array[np.where(select(array, value))[0]]. Indeed, if the result is sparse or very small, then np.where can be faster since it has special optimizations for these cases that boolean indexing does not perform. Note that np.where is not particularly optimized in the context of a Numba function, since Numba uses its own implementation of Numpy functions and they are sometimes not as optimized for large arrays. A faster implementation consists in creating x in parallel, but this is not trivial to do with Numba since the number of output items is not known ahead of time and threads must know where to write data, not to mention that Numpy is already fairly fast at doing this sequentially as long as the output is predictable.
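For reference, here is a minimal sketch of the transposed-layout idea mentioned above (assuming you can either produce the data in this layout directly or accept a one-time conversion cost):
import numpy as np

# arr_t is (3, 50000000) and C-contiguous, so each comparison below streams
# through contiguous memory instead of striding over every third item.
# Note: array.T alone is just a view; ascontiguousarray forces an actual copy.
arr_t = np.ascontiguousarray(array.T)

mask = (arr_t[0] == value) | (arr_t[1] == value)
x = array[mask]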

Python Code Optimization LineProfiler [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Questions concerning problems with code you've written must describe the specific problem — and include valid code to reproduce it — in the question itself. See SSCCE.org for guidance.
Closed 9 years ago.
My function gets combinations of numbers, which can be small or large depending on user input, and loops through each combination to perform some operations.
Below is a line profiling I ran on my function, and it takes 0.336 seconds to run. While this is fast, it is only a subset of a bigger framework. In this bigger framework, I will need to run this function 50-20000 times, which when multiplied by 0.336 gives 16.8 to 6720 seconds (I hope this is right). Before, it took 0.996 seconds, but I've managed to cut it by half by avoiding function calls.
The major contributors to the time are the two __getitem__ calls, which access a dictionary for information N times depending on the number of combinations. My dictionary is a collection of data and it looks something like this:
dic = {"array1", a,
"array2", b,
"array3", c,
"listofarray", [ [list of 5 array], [list of 5 array], [list of 5 2d Array ] ]
}
I was able to cut it by another ~0.01 seconds when I placed the dictionary lookup outside of the loop.
x = dic['listofarray']['list of 5 2d array']
So when I loop to get access to the 5 different elements I just do x[i].
Other than that I am lost in terms of where to add more performance boost.
Note: I apologize that I haven't provided any code. I'd love to show it, but it's proprietary. I just wanted to get some thoughts on whether I am looking in the right place for speed-ups.
I am willing to learn and apply new things, so if Cython or some other data structure can speed things up, I am all ears. Thanks so much.
PS:
inside my first __getitem__:
inside my second __getitem__:
EDIT:
I am using itertools.product(xrange(10), repeat=len(food_choices)) and iterating over this. I convert everything into numpy arrays with np.array(i).astype(float).
The major contributors to the time are the two __getitem__ calls, which access a dictionary for information N times depending on the number of combinations.
No, it isn't. Your two posted profile traces clearly show that they're NumPy/Pandas __getitem__ functions, not dict.__getitem__. So you're trying to optimize in the wrong place.
Which explains why moving all the dict stuff out of the loop made a difference of a small fraction of a percent.
Most likely the problem is that you're looping over some NumPy object, or using some fake-vectorized function (e.g., via vectorize), rather than performing some NumPy-optimized broadcasting operation. That's what you need to fix.
For example, if you compare these:
np.vectorize(lambda x: x*2)(a)
a * 2
… the second one will go at least 10x faster on any sizable array, and it's mostly because of all the time spent doing __getitem__, which includes boxing up numbers to be usable by your Python function. (There's also some additional cost in not being able to use CPU-vectorized operations, cacheable tight loops, etc., but even if you arrange things to be complicated enough that those don't enter into it, you're still going to get much faster code.)
Meanwhile:
I am using itertools.product(xrange(10), repeat=len(food_choices)) and iterating over this. I convert everything into numpy arrays with np.array(i).astype(float).
So you're creating 10**n separate n-element arrays? That's not making sensible use of NumPy. Each array is tiny, and most likely you're spending as much time building and pulling apart the arrays as you are doing actual work. Do you have the memory to build a single giant array with an extra 10**n-long axis instead? Or, maybe, batch it up into groups of, say, 100K? Because then you could actually build and process the whole array in native NumPy-vectorized code.
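As a rough illustration of that idea (the variable n here is a hypothetical stand-in for len(food_choices)):
import numpy as np

n = 4  # hypothetical stand-in for len(food_choices)

# Build all 10**n combinations at once as a single (10**n, n) float array,
# instead of converting each tuple from itertools.product one at a time.
# Rows come out in the same order as itertools.product(range(10), repeat=n).
combos = np.indices((10,) * n).reshape(n, -1).T.astype(float)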
However, the first thing you might want to try is to just run your code in PyPy instead of CPython. Some NumPy code doesn't work right with PyPy/NumPyPy, but there are fewer problems with each version, so you should definitely try it.
If you're lucky (and there's a pretty good chance of that), PyPy will JIT the repeated __getitem__ calls inside the loop, and make it much faster, with no change in your code.
If that helps (or if NumPyPy won't work on your code), Cython may be able to help more. But only if you do all the appropriate static type declarations, etc. And often, PyPy already helps enough that you're done.

Explain the speed difference between numpy's vectorized function application VS python's for loop

I was implementing a weighting system called TF-IDF on a set of 42000 images, each consisting of 784 pixels. This is basically a 42000 by 784 matrix.
The first method I attempted made use of explicit loops and took more than 2 hours.
def tfidf(color, img_pix, img_total):
    if img_pix == 0:
        return 0
    else:
        return color * np.log(img_total / img_pix)
...
result = np.array([])
for img_vec in data_matrix:
    double_vec = zip(img_vec, img_pix_vec)
    result_row = np.array([tfidf(x[0], x[1], img_total) for x in double_vec])
    try:
        result = np.vstack((result, result_row))
        # first row will throw a ValueError since vstack accepts rows of same len
    except ValueError:
        result = result_row
The second method I attempted used numpy matrices and took less than 5 minutes. Note that data_matrix, img_pix_mat are both 42000 by 784 matrices while img_total is a scalar.
result = data_matrix * np.log(np.divide(img_total,img_pix_mat))
I was hoping someone could explain the immense difference in speed.
The authors of the following paper entitled "The NumPy array: a structure for efficient numerical computation" (http://arxiv.org/pdf/1102.1523.pdf), state on the top of page 4 that they observe a 500 times speed increase due to vectorized computation. I'm presuming much of the speed increase I'm seeing is due to this. However, I would like to go a step further and ask why numpy vectorized computations are that much faster than standard python loops?
Also, perhaps you guys might know of other reasons why the first method is slow. Do try/except structures have high overhead? Or perhaps forming a new np.array in each loop iteration takes a long time?
Thanks.
It's due to the internal workings of numpy (as far as I know, numpy does its work in C internally, so everything you push down into numpy is much faster because it runs in a different, compiled language).
edit:
Taking out the zip and replacing it with a vstack should go faster too (zip tends to be slow if the arguments are very large, and vstack is faster then), because that also pushes the work into numpy (thus into C), while zip is Python.
And yes, if I understand correctly (not sure about that), you are running a try/except block 42k times; that should definitely be bad for the speed.
test:
import numpy
T = numpy.ndarray((5, 10))
for t in T:
    print t.shape
results in (10,)
This means that yes, if your matrices are 42k x 784 matrices, you are hitting the try/except block 42k times. I am assuming that has an effect on the computation times, as does doing a zip each time, but I'm not certain that would be the main cause.
(So each of your 42k iterations takes about 0.17 sec. I am quite certain that a try/except block doesn't take 0.17 seconds by itself, but maybe the overhead it causes contributes to it?)
try changing the following:
double_vec = zip(img_vec,img_pix_vec)
result_row = np.array([tfidf(x[0],x[1],img_total) for x in double_vec])
to
result_row=np.array([tfidf(img_vec[i],img_pix_vec[i],img_total) for i in xrange(len(img_vec))])
That at least gets rid of the zip statement, but I'm not sure whether the zip statement is costing you one minute or nearly two hours (I know zip is slow compared to numpy vstack, but no clue whether that would give you a two-hour gain).
The difference you're seeing isn't due to anything fancy like SSE vectorization. There are two primary reasons. The first is that NumPy is written in C, and the C implementation doesn't have to go through the tons of runtime method dispatch and exception checking and so on that a Python implementation goes through.
The second reason is that even for Python code, your loop-based implementation is inefficient. You're using vstack in a loop, and every time you call vstack, it has to completely copy all arrays you've passed to it. That adds an extra factor of len(data_matrix) to your asymptotic complexity.
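A minimal sketch of the loop-based version without the repeated copying (still much slower than the fully vectorized expression, but it removes the extra factor of len(data_matrix)):
import numpy as np

# Collect rows in a Python list and stack once at the end, instead of calling
# np.vstack inside the loop, which recopies the accumulated result every iteration.
rows = []
for img_vec in data_matrix:
    rows.append(np.array([tfidf(c, p, img_total)
                          for c, p in zip(img_vec, img_pix_vec)]))
result = np.vstack(rows)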

Techniques for working with large Numpy arrays? [duplicate]

This question already has answers here:
Very large matrices using Python and NumPy
(11 answers)
Closed 2 years ago.
There are times when you have to perform many intermediate operations on one, or more, large Numpy arrays. This can quickly result in MemoryErrors. In my research so far, I have found that pickling (pickle, cPickle, PyTables, etc.) and gc.collect() are ways to mitigate this. I was wondering if there are any other techniques experienced programmers use when dealing with large quantities of data (other than removing redundancies in your strategy/code, of course).
Also, if there's one thing I'm sure of, it's that nothing is free. With some of these techniques, what are the trade-offs (i.e., speed, robustness, etc.)?
I feel your pain... You sometimes end up storing several times the size of your array in values you will later discard. When processing one item in your array at a time, this is irrelevant, but can kill you when vectorizing.
I'll use an example from work for illustration purposes. I recently coded the algorithm described here using numpy. It is a color map algorithm, which takes an RGB image, and converts it into a CMYK image. The process, which is repeated for every pixel, is as follows:
Use the most significant 4 bits of every RGB value, as indices into a three-dimensional look up table. This determines the CMYK values for the 8 vertices of a cube within the LUT.
Use the least significant 4 bits of every RGB value to interpolate within that cube, based on the vertex values from the previous step. The most efficient way of doing this requires computing 16 arrays of uint8s the size of the image being processed. For a 24-bit RGB image, that is equivalent to needing roughly 6x the storage of the image to process it.
A couple of things you can do to handle this:
1. Divide and conquer
Maybe you cannot process a 1,000x1,000 array in a single pass. But if you can do it with a Python for loop iterating over 10 arrays of 100x1,000, it is still going to beat a Python iterator over 1,000,000 items by a very wide margin! It's going to be slower than a single pass, yes, but not by as much.
2. Cache expensive computations
This relates directly to my interpolation example above, and is harder to come across, although it is worth keeping an eye open for. Because I am interpolating on a three-dimensional cube with 4 bits in each dimension, there are only 16x16x16 possible outcomes, which can be stored in 16 arrays of 16x16x16 bytes. So I can precompute them and store them using 64KB of memory, and look up the values one by one for the whole image, rather than redoing the same operations for every pixel at huge memory cost. This already pays off for images as small as 64x64 pixels, and basically allows processing images with 6x as many pixels without having to subdivide the array.
3. Use your dtypes wisely
If your intermediate values can fit in a single uint8, don't use an array of int32s! This can turn into a nightmare of mysterious errors due to silent overflows, but if you are careful, it can provide a big saving of resources.
First, most important trick: allocate a few big arrays, and use and recycle portions of them, instead of bringing into life and discarding/garbage collecting lots of temporary arrays. It sounds a little bit old-fashioned, but with careful programming the speed-up can be impressive. (You have better control of alignment and data locality, so numeric code can be made more efficient.)
Second: use numpy.memmap and hope that the OS's caching of disk accesses is efficient enough.
Third: as pointed out by @Jaime, work on block sub-matrices if the whole matrix is too big.
EDIT:
Avoid unnecessary list comprehensions, as pointed out in this answer on SE.
The dask.array library provides a numpy interface that uses blocked algorithms to handle larger-than-memory arrays with multiple cores.
You could also look into Spartan, Distarray, and Biggus.
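As a minimal dask.array sketch (the array and chunk sizes here are just illustrative):
import dask.array as da

# A 100000 x 100000 array of random values split into 10000 x 10000 chunks;
# each chunk fits in memory even though the whole array would not.
x = da.random.random((100000, 100000), chunks=(10000, 10000))

result = (x + x.T).mean(axis=0)  # only builds a task graph; nothing is computed yet
answer = result.compute()        # evaluates chunk by chunk, possibly on several cores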
If it is possible for you, use numexpr. For numeric calculations like a**2 + b**2 + 2*a*b (for a and b being arrays) it:
- will compile machine code that executes fast and with minimal memory overhead, taking care of memory locality (and thus cache optimization) if the same array occurs several times in your expression,
- uses all cores of your dual or quad core CPU,
- is an extension to numpy, not an alternative.
For medium and large sized arrays, it is faster than numpy alone.
Take a look at the web page given above; there are examples that will help you understand whether numexpr is for you.
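For instance, a minimal numexpr sketch of the expression mentioned above:
import numpy as np
import numexpr as ne

a = np.random.rand(10 ** 7)
b = np.random.rand(10 ** 7)

# Evaluated in one pass over the data, without materializing a**2, b**2 and
# 2*a*b as separate full-size temporary arrays the way plain numpy would.
result = ne.evaluate("a**2 + b**2 + 2*a*b")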
On top of everything said in the other answers: when we'd like to store all the intermediate results of a computation (we don't always need to keep them in memory), we can also use accumulate from numpy alongside the various types of aggregations:
Aggregates
For binary ufuncs, there are some interesting aggregates that can be computed directly from the object. For example, if we'd like to reduce an array with a particular operation, we can use the reduce method of any ufunc. A reduce repeatedly applies a given operation to the elements of an array until only a single result remains.
For example, calling reduce on the add ufunc returns the sum of all elements in the array:
x = np.arange(1, 6)
np.add.reduce(x) # Outputs 15
Similarly, calling reduce on the multiply ufunc results in the product of all array elements:
np.multiply.reduce(x) # Outputs 120
Accumulate
If we'd like to store all the intermediate results of the computation, we can instead use accumulate:
np.add.accumulate(x) # Outputs array([ 1, 3, 6, 10, 15], dtype=int32)
np.multiply.accumulate(x) # Outputs array([ 1, 2, 6, 24, 120], dtype=int32)
Using these numpy operations wisely while performing many intermediate operations on one or more large Numpy arrays can give you great results without the use of any additional libraries.

Why numpy is 'slow' by itself?

Given the thread here
It seems that numpy is not ideal for ultra-fast calculation. Does anyone know what overhead we must be aware of when using numpy for numerical calculation?
Well, depends on what you want to do. XOR is, for instance, hardly relevant for someone interested in doing numerical linear algebra (for which numpy is pretty fast, by virtue of using optimized BLAS/LAPACK libraries underneath).
Generally, the big idea behind getting good performance from numpy is to amortize the cost of the interpreter over many elements at a time. In other words, move the loops from python code (slow) into C/Fortran loops somewhere in the numpy/BLAS/LAPACK/etc. internals (fast). If you succeed in that operation (called vectorization) performance will usually be quite good.
Of course, you can obviously get even better performance by dumping the python interpreter and using, say, C++ instead. Whether this approach actually succeeds or not depends on how good you are at high performance programming with C++ vs. numpy, and what operation exactly you're trying to do.
Any time you have an expression like x = a * b + c / d + e, you end up with one temporary array for a * b, one temporary array for c / d, one for one of the sums, and finally one allocation for the result. This is a limitation of Python types and operator overloading. You can, however, do things in-place explicitly using the augmented assignment operators (*=, +=, etc.) and be assured that copies aren't made.
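A small sketch of the difference (the in-place version still needs two buffers, but avoids the extra temporaries):
import numpy as np

a, b, c, d, e = (np.random.rand(10 ** 6) for _ in range(5))

# Plain expression: temporaries for a*b, c/d, their sum, plus the final result.
x = a * b + c / d + e

# In-place variant: two allocations, then augmented assignments reuse them.
x = a * b      # buffer 1
tmp = c / d    # buffer 2
x += tmp
x += e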
As for the specific reason NumPy performs more slowly in that benchmark, it's hard to tell but it probably has to do with the constant overhead of checking sizes, type-marshaling, etc. that Cython/etc. don't have to worry about. On larger problems you'd probably see it get closer.
I can't really tell, but I'd guess there are two factors:
Perhaps numpy is copying more stuff? weave is often faster when you avoid allocating big temporary arrays, but this shouldn't matter here.
numpy has a bit of overhead used in iterating over (possibly) multidimensional arrays. This overhead would normally be dwarfed by number crunching, but an xor is really really fast, so all that really matters is the overhead.
Your sub-question: for a = sin(x), how many round trips are there?
The trick is to pass a numpy array to sin(x); then there is only one 'round trip' for the whole array, since numpy returns an array of sin values. There is no Python for loop involved in this operation.
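For example:
import numpy as np

x = np.linspace(0, 2 * np.pi, 1000000)
a = np.sin(x)  # a single call; the per-element loop runs inside numpy's C code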
