Speed up N-d array dot

I'm trying to do some matrix computations like this:
import numpy as np

def get_P(X,Z):
    n_sample,n_m,n_t,n_f = X.shape
    res = np.zeros((n_sample,n_m,n_t,n_t))
    for i in range(n_sample):
        res[i,:,:,:] = np.dot(X[i,:,:,:],Z[i,:,:])
    return res
Because X and Z are large, it takes more than 600ms to compute one np.dot, and I have 10k rows in X.
Is there any way we can speed it up?

Well, there might be some avoidable overhead from your zero initialization (which gets overwritten right away): just use np.empty instead of np.zeros.
Other than that, numpy is fairly well-optimized. You can probably speed things up by using dtype=numpy.float32 instead of the default 64-bit floating point numbers for your X, Z and res – but that's about it. Dot products mostly spend their time going linearly through RAM and multiplying and summing numbers – things that numpy, compilers and CPUs are radically good at these days.
Note that, depending on how it was built, numpy may use only one CPU core at a time for this operation, so it might make sense to parallelize. For example, if you've got 16 CPU cores, you'd make 16 separate res partitions and calculate subsets of your range(n_sample) dot products on each core. Python does bring the multithreading / async facilities to do so – you'll find plenty of examples, and explaining how would lead too far.
If you can spend the development time, and have massive amounts of data, so that this pays off: you can use e.g. GPUs to multiply matrices; these are really good at that, and cuBLASLt (GEMM) is an excellent implementation, but honestly, you'd mostly be abandoning Numpy and would need to work things out yourself – in C/C++.
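A minimal sketch of the partitioning idea above, using a thread pool from the standard library. The names get_P_threaded and n_workers are made up for illustration. The sketch assumes Z has shape (n_sample, n_f, n_t) and that np.dot releases the GIL for these dtypes, so the threads can actually run side by side. If the BLAS library behind your numpy is already multithreaded, this may not help at all, so measure before adopting it.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor


def get_P_threaded(X, Z, n_workers=4):
    """Compute np.dot(X[i], Z[i]) for every sample, split across threads."""
    n_sample, n_m, n_t, n_f = X.shape
    # np.empty skips the zero fill; every slot is overwritten below.
    res = np.empty((n_sample, n_m, n_t, Z.shape[2]), dtype=np.result_type(X, Z))

    def work(sample_indices):
        # Each thread writes to a disjoint set of res[i], so no locking is needed.
        for i in sample_indices:
            res[i] = np.dot(X[i], Z[i])

    # One contiguous partition of the sample indices per worker.
    chunks = np.array_split(np.arange(n_sample), n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        list(pool.map(work, chunks))  # list() forces completion and surfaces errors
    return res
```

Threads are used rather than processes so that all workers can write into the same res array without copying it.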

You can use numpy einsum to do this multiplication in one vectorized step.
It removes the Python loop and can be faster than the loop-based dot product, although the gain depends on the array shapes, so time both. For examples, check this link: https://rockt.github.io/2018/04/30/einsum
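A sketch of what the einsum call could look like for the function in the question, assuming Z has shape (n_sample, n_f, n_t). The function names are made up for illustration. The np.matmul variant is an addition of mine, not part of the answer above: it computes the same product through broadcasting and is worth timing alongside einsum.

```python
import numpy as np


def get_P_einsum(X, Z):
    # i: sample, m: n_m axis, t: n_t axis of X, f: shared n_f axis, s: last axis of Z
    return np.einsum('imtf,ifs->imts', X, Z)


def get_P_matmul(X, Z):
    # Insert a length-1 axis so that Z[i] is broadcast over the n_m axis of X[i].
    return np.matmul(X, Z[:, np.newaxis, :, :])
```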

Related

Faster numpy array indexing when using condition (numpy.where)?

I have a huge numpy array with shape (50000000, 3) and I'm using:
x = array[np.where((array[:,0] == value) | (array[:,1] == value))]
to get the part of the array that I want. But this way seems to be quite slow.
Is there a more efficient way of performing the same task with numpy?

np.where is highly optimized and I doubt someone can write faster code than the one implemented in the latest Numpy version (disclaimer: I was one who optimized it). That being said, the main issue here is not so much np.where as the conditional, which creates a temporary boolean array. This is unfortunately the way to do that in Numpy and there is not much to do as long as you use only Numpy with the same input layout.
One reason why it is not very efficient is that the input data layout is inefficient. Indeed, assuming array is contiguously stored in memory using the default row-major ordering, array[:,0] == value will read 1 item out of every 3 items of the array in memory. Due to the way CPU caches work (i.e. cache lines, prefetching, etc.), 2/3 of the memory bandwidth is wasted. In fact, the output boolean array also needs to be written, and filling a newly-created array is a bit slow due to page faults. Note that array[:,1] == value will certainly reload data from RAM due to the size of the input (which cannot fit in most CPU caches). RAM is slow and it is getting slower compared to the computational speed of the CPU and caches. This problem, called the "memory wall", was observed a few decades ago and is not expected to be fixed any time soon. Also note that the logical-or will also create a new array read/written from/to RAM. A better data layout is a (3, 50000000) transposed array contiguous in memory (note that np.transpose does not produce a contiguous array).
Another reason explaining the performance issue is that Numpy tends not to be optimized to operate on very small axes.
One main solution is to create the input in a transposed way if possible. Another solution is to write Numba or Cython code. Here is an implementation for the non-transposed input:
import numba as nb
import numpy as np

# Compilation for the most frequent types.
# Please pick the right ones so as to speed up the compilation time.
@nb.njit(['(uint8[:,::1],uint8)', '(int32[:,::1],int32)', '(int64[:,::1],int64)', '(float64[:,::1],float64)'], parallel=True)
def select(array, value):
    n = array.shape[0]
    mask = np.empty(n, dtype=np.bool_)
    for i in nb.prange(n):
        mask[i] = array[i, 0] == value or array[i, 1] == value
    return mask

x = array[select(array, value)]
Note that I used a parallel implementation since the or operator is sub-optimal with Numba (the only solution seems to be native code or Cython) and also because the RAM cannot be fully saturated with one thread on some platforms, like computing servers. Also note that it can be faster to use array[np.where(select(array, value))[0]], depending on the result of select. Indeed, if the result is random or very small, then np.where can be faster since it has special optimizations for these cases that boolean indexing does not perform. Note that np.where is not particularly optimized in the context of a Numba function since Numba uses its own implementation of Numpy functions, and they are sometimes not as well optimized for large arrays. A faster implementation consists of creating x in parallel, but this is not trivial to do with Numba since the number of output items is not known ahead of time and threads must know where to write data, not to mention that Numpy is already fairly fast at doing that sequentially as long as the output is predictable.
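For the transposed layout mentioned above, a pure-Numpy sketch could look like the following. The small random array and the value 3 are stand-ins for the real data, which is an assumption of this example. Note that np.ascontiguousarray makes a full copy, so this only pays off if the transposed array is built once and reused for many queries, or if the data can be created in that layout from the start.

```python
import numpy as np

rng = np.random.default_rng(0)
array = rng.integers(0, 10, size=(1000, 3))  # stand-in for the (50000000, 3) array
value = 3                                    # stand-in for the searched value

# array.T is only a strided view; this copies it into a (3, n) array
# whose rows are contiguous in memory.
array_t = np.ascontiguousarray(array.T)

# Each comparison now scans one contiguous row instead of every third item.
mask = (array_t[0] == value) | (array_t[1] == value)
x = array[mask]
```

If everything downstream can work on the transposed data, array_t[:, mask] selects the same rows without going back to the original layout.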

Large numpy matrix memory issues

I have two questions, but the first takes precedence.
I was doing some timeit testing of some basic numpy operations that will be relevant to me.
I did the following
from collections import defaultdict
import numpy as np

n = 5000
j = defaultdict()
for i in range(n):
    print(i)
    j[i] = np.eye(n)
What happened is that python's memory use almost immediately shot up to 6 gigs, which is over 90% of my memory. However, numbers printed off at a steady pace, about 10-20 per second. While numbers printed off, memory use sporadically bounced down to ~4 gigs, and back up to 5, back down to 4, up to 6, down to 4.5, etc.
At 1350 iterations I had a segmentation fault.
So my question is, what was actually occurring during this time? Are these matrices actually created one at a time? Why is memory use spiking up and down?
My second question is, I may actually need to do something like this in a program I am working on. I will be doing basic arithmetic and comparisons between many large matrices, in a loop. These matrices will sometimes, but rarely, be dense. They will often be sparse.
If I actually need 5000 5000x5000 matrices, is that feasible with 6 gigs of memory? I don't know what can be done with all the tools and tricks available... Maybe I would just have to store some of them on disk and pull them out in chunks?
Any advice for if I have to loop through many matrices and do basic arithmetic between them?
Thank you.

"If I actually need 5000 5000x5000 matrices, is that feasible with 6 gigs of memory?"
If they're dense matrices, and you need them all at the same time, not by a long shot. Consider:
5K * 5K = 25M cells
25M * 8B = 200MB (assuming float64)
5K * 200MB = 1TB
The matrices are being created one at a time. As you get near 6GB, what happens depends on your platform. It might start swapping to disk, slowing your system to a crawl. There might be a fixed-size or max-size swap, so eventually it runs out of memory anyway. It may make assumptions about how you're going to use the memory, guessing that there will always be room to fit your actual working set at any given moment into memory, only to segfault when it discovers it can't. But the one thing it isn't going to do is just work efficiently.
You say that most of your matrices are sparse. In that case, use one of the sparse matrix representations. If you know which of the 5000 will be dense, you can mix and match dense and sparse matrices, but if not, just use the same sparse matrix type for everything. If this means your occasional dense matrices take 210MB instead of 200MB, but all the rest of your matrices take 1MB instead of 200MB, that's more than worthwhile as a tradeoff.
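To get a feel for the size of that tradeoff, here is a rough sketch using scipy.sparse (SciPy is an assumption here, the answer does not name a library). It compares the memory a 5000x5000 identity matrix takes in CSR format with the dense size, which is computed rather than allocated. The helper csr_nbytes is made up for illustration.

```python
import numpy as np
from scipy import sparse

n = 5000

# CSR stores only the non-zero values plus two index arrays.
identity_sparse = sparse.identity(n, dtype=np.float64, format="csr")


def csr_nbytes(m):
    """Bytes held by the three arrays backing a CSR matrix."""
    return m.data.nbytes + m.indices.nbytes + m.indptr.nbytes


dense_bytes = n * n * np.dtype(np.float64).itemsize  # 200 MB, not allocated here
sparse_bytes = csr_nbytes(identity_sparse)           # well under 1 MB

# Basic arithmetic between sparse matrices stays sparse.
total = identity_sparse + 2 * identity_sparse
```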
Also, do you actually need to work on all 5000 matrices at once? If you only need, say, the current matrix and the previous one at each step, you can generate them on the fly (or read from disk on the fly), and you only need 400MB instead of 1TB.
Worst-case scenario, you can effectively swap things manually, with some kind of caching discipline, like least-recently-used. You can easily keep, say, the last 16 matrices in memory. Keep a dirty flag on each so you know whether you have to save it when flushing it to make room for another matrix. That's about as tricky as it's going to get.
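One possible shape for that manual swapping, as a sketch only: a small least-recently-used cache that spills evicted matrices to .npy files. The class name MatrixCache, its methods, and the file naming are all made up for illustration, and the sketch assumes capacity is at least 1.

```python
import os
from collections import OrderedDict

import numpy as np


class MatrixCache:
    """Keep at most `capacity` matrices in memory; spill the rest to .npy files."""

    def __init__(self, directory, capacity=16):
        self.directory = directory
        self.capacity = capacity
        self._items = OrderedDict()  # key -> [array, dirty flag], oldest first

    def _path(self, key):
        return os.path.join(self.directory, "matrix_{}.npy".format(key))

    def _evict(self):
        while len(self._items) > self.capacity:
            key, (arr, dirty) = self._items.popitem(last=False)  # least recently used
            if dirty:
                np.save(self._path(key), arr)  # only write back if it changed

    def get(self, key):
        if key in self._items:
            self._items.move_to_end(key)       # mark as most recently used
        else:
            self._items[key] = [np.load(self._path(key)), False]
            self._evict()
        return self._items[key][0]

    def put(self, key, arr):
        # Call put again after modifying a matrix so that it is marked dirty.
        self._items[key] = [arr, True]
        self._items.move_to_end(key)
        self._evict()
```

A get for a key that was never stored raises the error from np.load; handling that is left out to keep the sketch short.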

Explain the speed difference between numpy's vectorized function application VS python's for loop

I was implementing a weighting system called TF-IDF on a set of 42000 images, each consisting 784 pixels. This is basically a 42000 by 784 matrix.
The first method I attempted made use of explicit loops and took more than 2 hours.
def tfidf(color,img_pix,img_total):
    if img_pix==0:
        return 0
    else:
        return color * np.log(img_total/img_pix)
...
result = np.array([])
for img_vec in data_matrix:
    double_vec = zip(img_vec,img_pix_vec)
    result_row = np.array([tfidf(x[0],x[1],img_total) for x in double_vec])
    try:
        result = np.vstack((result,result_row))
    # first row will throw a ValueError since vstack accepts rows of same len
    except ValueError:
        result = result_row
The second method I attempted used numpy matrices and took less than 5 minutes. Note that data_matrix, img_pix_mat are both 42000 by 784 matrices while img_total is a scalar.
result = data_matrix * np.log(np.divide(img_total,img_pix_mat))
I was hoping someone could explain the immense difference in speed.
The authors of the following paper entitled "The NumPy array: a structure for efficient numerical computation" (http://arxiv.org/pdf/1102.1523.pdf), state on the top of page 4 that they observe a 500 times speed increase due to vectorized computation. I'm presuming much of the speed increase I'm seeing is due to this. However, I would like to go a step further and ask why numpy vectorized computations are that much faster than standard python loops?
Also, perhaps you guys might know of other reasons why the first method is slow. Do try/except structures have high overhead? Or perhaps forming a new np.array for each loop takes a long time?
Thanks.

As far as I know, numpy works with C internally, so everything you push down to numpy runs much faster than the equivalent Python code.
Edit:
Taking out the zip and replacing it with a vstack should be faster too. zip tends to be slow if the arguments are very large, and vstack also moves the work into numpy (and thus into C), while zip is plain Python.
And yes, if I understand correctly (I'm not sure about that), you are running a try/except block 42k times, which should definitely be bad for speed.
Test:
T = numpy.ndarray((5,10))
for t in T:
    print(t.shape)
prints (10,) for each of the 5 rows.
This means that yes, if your matrices are 42k x 784, you are running the try/except block 42k times. I assume that affects the computation time, as does calling zip each time, but I'm not certain that either is the main cause.
(Each of your 42k iterations takes about 0.17 sec. I am quite certain that a try/except block doesn't take 0.17 seconds by itself, but maybe the overhead it causes contributes to it?)
Try changing the following:
double_vec = zip(img_vec,img_pix_vec)
result_row = np.array([tfidf(x[0],x[1],img_total) for x in double_vec])
to
result_row = np.array([tfidf(img_vec[i],img_pix_vec[i],img_total) for i in range(len(img_vec))])
That at least gets rid of the zip call, but I'm not sure whether that saves you one minute or nearly two hours (I know zip is slow compared to numpy vstack, but I have no clue whether that would give you a two-hour gain).

The difference you're seeing isn't due to anything fancy like SSE vectorization. There are two primary reasons. The first is that NumPy is written in C, and the C implementation doesn't have to go through the tons of runtime method dispatch and exception checking and so on that a Python implementation goes through.
The second reason is that even for Python code, your loop-based implementation is inefficient. You're using vstack in a loop, and every time you call vstack, it has to completely copy all arrays you've passed to it. That adds an extra factor of len(data_matrix) to your asymptotic complexity.
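To make both points concrete, here is a sketch of two alternatives to the vstack loop. The function names are made up for illustration, and img_pix_vec is assumed to be a 1-D array with one entry per pixel, as in the question's loop. The first version keeps the Python loops but fills a preallocated array, so nothing is copied repeatedly. The second pushes the loops into NumPy and handles the img_pix == 0 case with a mask.

```python
import numpy as np


def tfidf_rows_preallocated(data_matrix, img_pix_vec, img_total):
    """Loop version that fills a preallocated result instead of calling vstack."""
    result = np.zeros(data_matrix.shape, dtype=np.float64)
    for row, img_vec in enumerate(data_matrix):
        for col, (color, img_pix) in enumerate(zip(img_vec, img_pix_vec)):
            if img_pix != 0:  # entries with img_pix == 0 stay at 0
                result[row, col] = color * np.log(img_total / img_pix)
    return result


def tfidf_vectorized(data_matrix, img_pix_vec, img_total):
    """Same result with the loops running inside NumPy."""
    img_pix_vec = np.asarray(img_pix_vec, dtype=np.float64)
    weights = np.zeros_like(img_pix_vec)
    nonzero = img_pix_vec != 0
    # Only take the log where it is defined; the other weights stay at 0.
    weights[nonzero] = np.log(img_total / img_pix_vec[nonzero])
    # Broadcasting applies the per-pixel weight to every row.
    return data_matrix * weights
```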

Techniques for working with large Numpy arrays? [duplicate]

There are times when you have to perform many intermediate operations on one or more large Numpy arrays. This can quickly result in MemoryErrors. In my research so far, I have found that pickling (pickle, cPickle, PyTables etc.) and gc.collect() are ways to mitigate this. I was wondering if there are any other techniques experienced programmers use when dealing with large quantities of data (other than removing redundancies in your strategy/code, of course).
Also, if there's one thing I'm sure of, it is that nothing is free. With some of these techniques, what are the trade-offs (i.e., speed, robustness, etc.)?

I feel your pain... You sometimes end up storing several times the size of your array in values you will later discard. When processing one item in your array at a time, this is irrelevant, but can kill you when vectorizing.
I'll use an example from work for illustration purposes. I recently coded the algorithm described here using numpy. It is a color map algorithm, which takes an RGB image, and converts it into a CMYK image. The process, which is repeated for every pixel, is as follows:
Use the most significant 4 bits of every RGB value, as indices into a three-dimensional look up table. This determines the CMYK values for the 8 vertices of a cube within the LUT.
Use the least significant 4 bits of every RGB value to interpolate within that cube, based on the vertex values from the previous step. The most efficient way of doing this requires computing 16 arrays of uint8s the size of the image being processed. For a 24-bit RGB image, that is equivalent to needing about 6 times the storage of the image itself to process it.
A couple of things you can do to handle this:
1. Divide and conquer
Maybe you cannot process a 1,000x1,000 array in a single pass. But if you can do it with a python for loop iterating over 10 arrays of 100x1,000, it is still going to beat a python iterator over 1,000,000 items by a very wide margin! It's going to be slower than a single pass, yes, but not by much.
2. Cache expensive computations
This relates directly to my interpolation example above, and is harder to come across, although it is worth keeping an eye open for. Because I am interpolating on a three-dimensional cube with 4 bits in each dimension, there are only 16x16x16 possible outcomes, which can be stored in 16 arrays of 16x16x16 bytes. So I can precompute them and store them using 64KB of memory, and look up the values one by one for the whole image, rather than redoing the same operations for every pixel at huge memory cost. This already pays off for images as small as 64x64 pixels, and basically allows processing images with 6 times the number of pixels without having to subdivide the array.
3. Use your dtypes wisely
If your intermediate values can fit in a single uint8, don't use an array of int32s! This can turn into a nightmare of mysterious errors due to silent overflows, but if you are careful, it can provide a big saving of resources.
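The three points above can be combined in one small sketch. It is not the color map algorithm from this answer: a simple gamma curve is used as a stand-in, and the function names are made up for illustration. The expensive computation is cached in a 256-entry lookup table, the image is processed in blocks of rows so that temporaries stay small, and everything stays in uint8.

```python
import numpy as np


def build_gamma_lut(gamma):
    """Precompute the result for all 256 possible uint8 inputs once."""
    levels = np.arange(256, dtype=np.float64) / 255.0
    return np.round(255.0 * levels ** gamma).astype(np.uint8)


def apply_lut_in_blocks(image, lut, rows_per_block=100):
    """Apply the table to a uint8 image, one block of rows at a time."""
    out = np.empty_like(image)
    for start in range(0, image.shape[0], rows_per_block):
        stop = start + rows_per_block
        # Fancy indexing replaces every pixel value by its table entry;
        # the temporary it creates is only as large as one block.
        out[start:stop] = lut[image[start:stop]]
    return out
```

Choosing rows_per_block is a tradeoff: larger blocks mean fewer Python-level iterations, smaller blocks mean smaller temporaries.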

First most important trick: allocate a few big arrays, and use and recycle portions of them, instead of bringing into life and discarding/garbage collecting lots of temporary arrays. Sounds a little bit old-fashioned, but with careful programming speed-up can be impressive. (You have better control of alignment and data locality, so numeric code can be made more efficient.)
Second: use numpy.memmap and hope that OS caching of accesses to the disk is efficient enough.
Third: as pointed out by @Jaime, work on block sub-matrices if the whole matrix is too big.
EDIT:
Avoid unnecessary list comprehensions, as pointed out in this answer in SE.
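A small sketch of the second and third points together: an array backed by a file through numpy.memmap, reduced one block of rows at a time. The file name, shape and block size are arbitrary choices for the example. A real use case would have an array too large for RAM.

```python
import os
import tempfile

import numpy as np

path = os.path.join(tempfile.mkdtemp(), "big_array.dat")
shape = (1000, 200)

# mode="w+" creates the file; the data lives on disk, not in process memory.
big = np.memmap(path, dtype=np.float32, mode="w+", shape=shape)
big[:] = 1.0
big.flush()  # make sure the changes are written to the file

# Reopen read-only and reduce it block by block, so that only one block
# of rows has to be resident at any time.
view = np.memmap(path, dtype=np.float32, mode="r", shape=shape)
total = 0.0
for start in range(0, shape[0], 256):
    total += float(view[start:start + 256].sum(dtype=np.float64))
```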

The dask.array library provides a numpy interface that uses blocked algorithms to handle larger-than-memory arrays with multiple cores.
You could also look into Spartan, Distarray, and Biggus.

If it is possible for you, use numexpr. For numeric calculations like a**2 + b**2 + 2*a*b (for a and b being arrays), it
compiles the expression into a form that executes fast and with minimal memory overhead, taking care of memory locality (and thus cache optimization) if the same array occurs several times in your expression,
uses all cores of your dual or quad core CPU,
is an extension to numpy, not an alternative.
For medium and large sized arrays, it is faster than numpy alone.
Take a look at the web page given above, there are examples that will help you understand if numexpr is for you.

On top of everything said in the other answers: if we'd like to store all the intermediate results of a computation (we don't always need to keep them in memory), we can also use accumulate from numpy after various types of aggregations:
Aggregates
For binary ufuncs, there are some interesting aggregates that can be computed directly from the object. For example, if we'd like to reduce an array with a particular operation, we can use the reduce method of any ufunc. A reduce repeatedly applies a given operation to the elements of an array until only a single result remains.
For example, calling reduce on the add ufunc returns the sum of all elements in the array:
x = np.arange(1, 6)
np.add.reduce(x) # Outputs 15
Similarly, calling reduce on the multiply ufunc results in the product of all array elements:
np.multiply.reduce(x) # Outputs 120
Accumulate
If we'd like to store all the intermediate results of the computation, we can instead use accumulate:
np.add.accumulate(x) # Outputs array([ 1, 3, 6, 10, 15], dtype=int32)
np.multiply.accumulate(x) # Outputs array([ 1, 2, 6, 24, 120], dtype=int32)
Wisely using these numpy operations while performing many intermediate operations on one, or more, large Numpy arrays can give you great results without usage of any additional libraries.

Why numpy is 'slow' by itself?

Given the thread here
It seems that numpy is not the most ideal for ultra fast calculation. Does anyone know what overhead we must be aware of when using numpy for numerical calculation?

Well, depends on what you want to do. XOR is, for instance, hardly relevant for someone interested in doing numerical linear algebra (for which numpy is pretty fast, by virtue of using optimized BLAS/LAPACK libraries underneath).
Generally, the big idea behind getting good performance from numpy is to amortize the cost of the interpreter over many elements at a time. In other words, move the loops from python code (slow) into C/Fortran loops somewhere in the numpy/BLAS/LAPACK/etc. internals (fast). If you succeed in that operation (called vectorization) performance will usually be quite good.
Of course, you can obviously get even better performance by dumping the python interpreter and using, say, C++ instead. Whether this approach actually succeeds or not depends on how good you are at high performance programming with C++ vs. numpy, and what operation exactly you're trying to do.
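As a minimal illustration of the vectorization idea described above (the function names are made up for the example), both functions below compute the same weighted sum. The first pays the interpreter cost once per element, the second once per call.

```python
import numpy as np


def weighted_sum_loop(values, weights):
    """The loop runs in the Python interpreter, one element at a time."""
    total = 0.0
    for v, w in zip(values, weights):
        total += v * w
    return total


def weighted_sum_vectorized(values, weights):
    """The loop runs inside NumPy (and its BLAS library, if one is used)."""
    return float(np.dot(values, weights))
```

Timing the two with timeit on arrays of a few million elements is an easy way to see how large the gap is on a given machine.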

Any time you have an expression like x = a * b + c / d + e, you end up with one temporary array for a * b, one temporary array for c / d, one for one of the sums and finally one allocation for the result. This is a limitation of Python types and operator overloading. You can however do things in-place explicitly using the augmented assignment (*=, +=, etc.) operators and be assured that copies aren't made.
As for the specific reason NumPy performs more slowly in that benchmark, it's hard to tell but it probably has to do with the constant overhead of checking sizes, type-marshaling, etc. that Cython/etc. don't have to worry about. On larger problems you'd probably see it get closer.
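A sketch of the in-place approach mentioned above, assuming a, b, c, d and e are float arrays of the same shape. The function names are made up for illustration. The second version uses augmented assignment and the out= argument of np.divide, so that it allocates two arrays instead of one per operator.

```python
import numpy as np


def combine_with_temporaries(a, b, c, d, e):
    # Each operator call returns a newly allocated array.
    return a * b + c / d + e


def combine_in_place(a, b, c, d, e):
    result = a * b                  # the one allocation that is returned
    scratch = np.empty_like(result)
    np.divide(c, d, out=scratch)    # c / d written into the scratch buffer
    result += scratch               # in-place, no new array
    result += e                     # in-place, no new array
    return result
```

If the function is called repeatedly on arrays of the same shape, the scratch buffer can be allocated once outside and passed in.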

I can't really tell, but I'd guess there are two factors:
Perhaps numpy is copying more stuff? weave is often faster when you avoid allocating big temporary arrays, but this shouldn't matter here.
numpy has a bit of overhead used in iterating over (possibly) multidimensional arrays. This overhead would normally be dwarfed by number crunching, but an xor is really really fast, so all that really matters is the overhead.

Your sub-question: a = sin(x), how many roundtrips are there.
The trick is to pass a numpy array to sin(x), then there is only one 'roundtrip' for the whole array, since numpy will return an array of sin-values. There is no python for loop involved in this operation.
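In code, the difference looks like this (a small sketch; the array of 1000 points is an arbitrary choice):

```python
import math

import numpy as np

x = np.linspace(0.0, 2.0 * math.pi, 1000)

# One call from Python; the loop over the 1000 elements runs inside NumPy.
a = np.sin(x)

# 1000 separate calls, each going through the interpreter.
a_loop = np.array([math.sin(v) for v in x])
```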
