Matlab equivalent of Numpy broadcasting?

I'm trying to find some way to subtract a size-3 vector from each column of a 3*(a big number) matrix in Matlab. Of course I could use a loop, but I'm trying to find some more efficient solution, a bit like numpy broadcasting. Oh, and I can't use repmat because I just don't have enough memory for it (as it creates yet another 3*(a big number) matrix)...
Is this possible?

The other answers are a bit out of date -- Matlab R2016b added implicit expansion (broadcasting) as a standard feature. An example that matches the question:
>> A = ones(2) + [1 5]'
A =
     2     2
     6     6

Loops aren't bad in MATLAB anymore, thanks to compiler optimizations like the just-in-time accelerator (JITA). Most of the time, I've noticed that a solution with loops in current MATLAB versions is much faster than complicated (albeit cool :D) one-liners.
bsxfun might do the trick, but in my experience it tends to have memory issues as well, though less severe than repmat.
So the syntax would be:
AA = bsxfun(@minus, A, b), where b is the vector and A is your big matrix
But I urge you to profile the loopy version and then decide! Most probably, due to memory constraints, you might not have a choice :)

I don't know if this will speed up the code, but subtraction of a scalar from a vector doesn't have memory issues. Since your matrix size is so asymmetrical, the overhead from a for-loop on the short dimension is negligible.
So maybe
matout = matin;
for j = 1:size(matin, 1)  % 3 in this case
    matout(j,:) = matin(j,:) - vec_to_subtract(j);
end
Of course, you could do this in place, but I didn't know whether you wanted to preserve the original matrix.

Actually, it seems that http://www.frontiernet.net/~dmschwarz/genops.html (operator overloading with mex files) does the trick too, even though I haven't tested it yet.

Related

Faster numpy array indexing when using condition (numpy.where)?

I have a huge numpy array with shape (50000000, 3) and I'm using:
x = array[np.where((array[:,0] == value) | (array[:,1] == value))]
to get the part of the array that I want. But this way seems to be quite slow.
Is there a more efficient way of performing the same task with numpy?
np.where is highly optimized and I doubt anyone can write faster code than what is implemented in the latest Numpy version (disclaimer: I am one of the people who optimized it). That being said, the main issue here is not so much np.where as the condition, which creates a temporary boolean array. This is unfortunately the way to do it in Numpy, and there is not much you can do about it as long as you use only Numpy with the same input layout.
One reason it is not very efficient is that the input data layout is inefficient. Indeed, assuming array is stored contiguously in memory using the default row-major ordering, array[:,0] == value reads only 1 item out of every 3 in memory. Because of the way CPU caches work (cache lines, prefetching, etc.), 2/3 of the memory bandwidth is wasted. On top of that, the output boolean array also needs to be written, and filling a newly created array is a bit slow due to page faults. Note that array[:,1] == value will almost certainly reload the data from RAM due to the size of the input (which cannot fit in most CPU caches). RAM is slow, and it is getting slower relative to the computational speed of CPUs and caches. This problem, called the "memory wall", was identified a few decades ago and is not expected to be fixed any time soon. Also note that the logical-or creates yet another array that has to be read from and written to RAM. A better data layout would be a (3, 50000000) transposed array that is contiguous in memory (note that np.transpose does not produce a contiguous array).
Another reason for the performance issue is that Numpy tends not to be optimized for operating on a very small axis.
One main solution is to create the input in a transposed layout if possible. Another solution is to write Numba or Cython code. Here is an implementation for the non-transposed input:
import numpy as np
import numba as nb

# Compilation for the most frequent types.
# Please pick the right ones so as to speed up the compilation time.
@nb.njit(['(uint8[:,::1],uint8)', '(int32[:,::1],int32)', '(int64[:,::1],int64)', '(float64[:,::1],float64)'], parallel=True)
def select(array, value):
    n = array.shape[0]
    mask = np.empty(n, dtype=np.bool_)
    for i in nb.prange(n):
        mask[i] = array[i, 0] == value or array[i, 1] == value
    return mask

x = array[select(array, value)]
Note that I used a parallel implementation, since the or operator is sub-optimal with Numba (the only solution seems to be native code or Cython) and also because the RAM cannot be fully saturated with one thread on some platforms like computing servers. Also note that it can be faster to use array[np.where(select(array, value))[0]], depending on the result of select. Indeed, if the matching rows are random or very few, then np.where can be faster, since it has special optimizations for these cases that boolean indexing does not perform. Note that np.where is not particularly optimized inside a Numba function, since Numba uses its own implementations of Numpy functions and they are sometimes not as optimized for large arrays. A faster implementation would build x in parallel, but this is not trivial to do with Numba, since the number of output items is not known ahead of time and threads must know where to write their data, not to mention that Numpy is already fairly fast at doing this sequentially as long as the output is predictable.
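As a rough sketch of the transposed-layout idea mentioned above (array and value as in the question; the up-front conversion is only for illustration, ideally the data would be created in this layout in the first place):
import numpy as np

array_t = np.ascontiguousarray(array.T)   # one-off (3, N) contiguous copy

# Each comparison now reads contiguous memory instead of every third item.
mask = (array_t[0] == value) | (array_t[1] == value)
x = array_t[:, mask].T                    # same rows as array[np.where(...)]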

Speed up N-d array dot

I'm trying to do some matrix computations as
import numpy as np

def get_P(X, Z):
    n_sample, n_m, n_t, n_f = X.shape
    res = np.zeros((n_sample, n_m, n_t, n_t))
    for i in range(n_sample):
        res[i, :, :, :] = np.dot(X[i, :, :, :], Z[i, :, :])
    return res
Because the size of X and Z is large, it takes more than 600ms to compute one np.dot, and I have 10k rows in X.
Is there anyway we can speed it up?
Well, there might be some avoidable overhead posed by your zero initialization (which gets overwritten right away): Just use np.ndarray instead.
Other than that: numpy is fairly well-optimized. You can probably speed things up if you use dtype=numpy.float32 instead of the default 64-bit floating point numbers for your X, Z and res – but that's about it. Dot products mostly spend their time streaming linearly through RAM while multiplying and summing numbers – things that numpy, compilers and CPUs are remarkably good at these days.
Note that numpy will only use one CPU core at a time in its default configuration - it might make sense to parallelize; for example, if you've got 16 CPU cores, you'd make 16 separate res partitions and calculate subsets of your range(n_sample) dot products on each core; python does bring the multithreading / async facilities to do so – you'll find plenty of examples, and explaining how would lead too far.
If you can spend the development time, and need massive amounts of data, so that this pays: you can use e.g. GPUs to multiply matrices; these are really good at that, and cuBLASlt (GEMM) is an excellent implementation, but honestly, you'd mostly be abandoning Numpy and would need to work things out yourself – in C/C++.
You can use numpy einsum to do this multiplication in one vectorized step.
It will be much faster than this loop-based dot product. For examples, check this link: https://rockt.github.io/2018/04/30/einsum
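For reference, here is roughly what that could look like for the get_P function above (an untested sketch; the subscripts assume the shapes implied by the question):
import numpy as np

def get_P_einsum(X, Z):
    # X: (n_sample, n_m, n_t, n_f), Z: (n_sample, n_f, n_t)
    # res[i, m, t, u] = sum over f of X[i, m, t, f] * Z[i, f, u]
    return np.einsum('imtf,ifu->imtu', X, Z, optimize=True)
np.matmul(X, Z[:, None, :, :]) should give the same result and may be faster still, since it dispatches the batched matrix products straight to BLAS.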

Explain the speed difference between numpy's vectorized function application VS python's for loop

I was implementing a weighting system called TF-IDF on a set of 42000 images, each consisting of 784 pixels. This is basically a 42000 by 784 matrix.
The first method I attempted made use of explicit loops and took more than 2 hours.
def tfidf(color, img_pix, img_total):
    if img_pix == 0:
        return 0
    else:
        return color * np.log(img_total / img_pix)
...
result = np.array([])
for img_vec in data_matrix:
    double_vec = zip(img_vec, img_pix_vec)
    result_row = np.array([tfidf(x[0], x[1], img_total) for x in double_vec])
    try:
        result = np.vstack((result, result_row))
        # first row will throw a ValueError since vstack accepts rows of same len
    except ValueError:
        result = result_row
The second method I attempted used numpy matrices and took less than 5 minutes. Note that data_matrix, img_pix_mat are both 42000 by 784 matrices while img_total is a scalar.
result = data_matrix * np.log(np.divide(img_total,img_pix_mat))
I was hoping someone could explain the immense difference in speed.
The authors of the following paper entitled "The NumPy array: a structure for efficient numerical computation" (http://arxiv.org/pdf/1102.1523.pdf), state on the top of page 4 that they observe a 500 times speed increase due to vectorized computation. I'm presuming much of the speed increase I'm seeing is due to this. However, I would like to go a step further and ask why numpy vectorized computations are that much faster than standard python loops?
Also, perhaps you guys might know of other reasons why the first method is slow. Do try/except structures have high overhead? Or perhaps forming a new np.array on each loop iteration takes a long time?
Thanks.
It's due to the internal workings of numpy (as far as I know, numpy works in C internally, so everything you push down into numpy is actually much faster because it runs in a different language).
Edit:
Taking out the zip and replacing it with vstack should be faster too (zip tends to be slow if the arguments are very large, and vstack is faster); that also pushes the work into numpy (and thus into C), while zip stays in Python.
And yes, if I understand correctly (not sure about that), you are running a try/except block 42k times; that should definitely be bad for the speed.
Test:
import numpy
T = numpy.ndarray((5, 10))
for t in T:
    print(t.shape)
This prints (10,) for each of the 5 rows.
This means that yes, if your matrices are 42000 by 784, you are running a try/except block 42k times. I am assuming that has some effect on the computation time, as does doing a zip each time, but I'm not certain it is the main cause.
(So each of your 42k iterations takes about 0.17 sec. I am quite certain a try/except block doesn't take 0.17 seconds, but maybe the overhead it causes contributes to it?)
try changing the following:
double_vec = zip(img_vec,img_pix_vec)
result_row = np.array([tfidf(x[0],x[1],img_total) for x in double_vec])
to
result_row=np.array([tfidf(img_vec[i],img_pix_vec[i],img_total) for i in xrange(len(img_vec))])
That at least gets rid of the zip statement, but I'm not sure whether the zip statement costs you one minute or nearly two hours (I know zip is slow compared to numpy vstack, but I have no clue whether removing it would gain you two hours).
The difference you're seeing isn't due to anything fancy like SSE vectorization. There are two primary reasons. The first is that NumPy is written in C, and the C implementation doesn't have to go through the tons of runtime method dispatch and exception checking and so on that a Python implementation goes through.
The second reason is that even for Python code, your loop-based implementation is inefficient. You're using vstack in a loop, and every time you call vstack, it has to completely copy all arrays you've passed to it. That adds an extra factor of len(data_matrix) to your asymptotic complexity.
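For illustration, if you did want to keep the Python loop, collecting the rows in a list and stacking once avoids that repeated copying (a sketch reusing the names from the question):
rows = []
for img_vec in data_matrix:
    rows.append(np.array([tfidf(c, p, img_total)
                          for c, p in zip(img_vec, img_pix_vec)]))
result = np.vstack(rows)   # one copy at the end instead of one per iteration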

DFT in Python taking significantly longer than C

I'm currently working on translating some C code to Python. This code is being used to help identify errors arising from the CLEAN algorithm used in radio astronomy. In order to do this analysis, the values of the Fourier transforms of the intensity map, Q Stokes map and U Stokes map must be found at specific pixel values (given by ANT_pix). These maps are just 257*257 arrays.
The below code takes a few seconds to run with C but takes hours to run with Python. I'm pretty sure that it is terribly optimized as my knowledge of Python is quite poor.
Thanks for any help you can give.
Update: My question is whether there is a better way to implement the loops in Python which will speed things up. I've read quite a few answers here for other Python questions which recommend avoiding nested for loops if possible, and I'm just wondering if anyone knows a good way of implementing something like the Python code below without the loops, or with better-optimised loops. I realise this may be a tall order though!
I've been using the FFT up till now, but my supervisor wants to see what sort of difference the DFT will make. This is because the antenna positions will not, in general, fall on exact pixel values. Using the FFT requires rounding to the closest pixel value.
I'm using Python because CASA, the computer program used to reduce radio astronomy datasets, is written in Python, and implementing Python scripts in it is far, far easier than C.
Original Code
import math
import numpy

def DFT_Vis(ANT_Pix="", IMap="", QMap="", UMap="", NMap="", Nvis=""):
    UV = numpy.zeros([Nvis, 6])
    Offset = (NMap + 1) / 2
    ANT = ANT_Pix + Offset
    i = 0
    l = 0
    k = 0
    SumI = 0
    SumRL = 0
    SumLR = 0
    z = 0
    RL = QMap + 1j * UMap
    LR = QMap - 1j * UMap
    Factor = [math.e ** (-2j * math.pi * z / NMap) for z in range(NMap)]
    for i in range(Nvis):
        X = ANT[i, 0]
        Y = ANT[i, 1]
        for l in range(NMap):
            for k in range(NMap):
                Temp = Factor[int((X * l) % NMap)] * Factor[int((Y * k) % NMap)]
                SumI += IMap[l, k] * Temp
                SumRL += RL[l, k] * Temp
                SumLR += LR[l, k] * Temp
            k = 1
        UV[i, 0] = SumI.real
        UV[i, 1] = SumI.imag
        UV[i, 2] = SumRL.real
        UV[i, 3] = SumRL.imag
        UV[i, 4] = SumLR.real
        UV[i, 5] = SumLR.imag
        l = 1
        k = 1
        SumI = 0
        SumRL = 0
        SumLR = 0
    return UV
You should probably use numpy's fourier transform code, rather than writing your own: http://docs.scipy.org/doc/numpy/reference/routines.fft.html
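For the rounded-to-the-nearest-pixel case, that could look roughly like the sketch below (untested; it uses the same sign convention and variable names as the loops above):
import numpy as np

F_I  = np.fft.fft2(IMap)              # 2-D DFT of the whole intensity map
F_RL = np.fft.fft2(QMap + 1j * UMap)
F_LR = np.fft.fft2(QMap - 1j * UMap)

# value for one antenna position X, Y rounded to the nearest pixel
u = int(round(X)) % NMap
v = int(round(Y)) % NMap
SumI, SumRL, SumLR = F_I[u, v], F_RL[u, v], F_LR[u, v]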
If you are interested in boosting the performance of your script cython could be an option.
I am not an expert on the FFT, but my understanding is that the FFT is simply a fast way to compute the DFT. So to me your question sounds like you are trying to write a bubble sort algorithm to see if it gives a better answer than quicksort. They are both sorting algorithms that would give the same result!
So I am questioning your basic premise. I am wondering if you can just change your rounding on your data and get the same result from the SciPy FFT code.
Also, according to my DSP textbook, the FFT can produce a more accurate answer than computing the DFT the long way, simply because floating point operations are inexact, and the FFT invokes fewer floating point operations along the way to finding the correct answer.
If you have some working C code that does the calculation you want, you could always wrap the C code to let you call it from Python. Discussion here: Wrapping a C library in Python: C, Cython or ctypes?
To answer your actual question: as @ZoZo123 noted, it would be a big win to change from range() to xrange(). With range(), Python has to build a list of numbers, and then destroy the list when done; with xrange() Python just makes an iterator that yields up the numbers one at a time. (But note that in Python 3.x, range() makes an iterator and there is no xrange().)
Also, if this code does not have to integrate with the rest of your code, you might try running this code under PyPy. This is exactly the sort of code that PyPy can best optimize. The problem with PyPy is that currently your project must be "pure" Python, and it looks like you are using NumPy. (There are projects to get NumPy and PyPy to work together, but that's not done yet.) http://pypy.org/
If this code does need to integrate with the rest of your code, then I think you need to look at Cython (as noted by @Krzysztof Rosiński).

Why numpy is 'slow' by itself?

Given the thread here
It seems that numpy is not the most ideal for ultra fast calculation. Does anyone know what overhead we must be aware of when using numpy for numerical calculation?
Well, depends on what you want to do. XOR is, for instance, hardly relevant for someone interested in doing numerical linear algebra (for which numpy is pretty fast, by virtue of using optimized BLAS/LAPACK libraries underneath).
Generally, the big idea behind getting good performance from numpy is to amortize the cost of the interpreter over many elements at a time. In other words, move the loops from python code (slow) into C/Fortran loops somewhere in the numpy/BLAS/LAPACK/etc. internals (fast). If you succeed in that operation (called vectorization) performance will usually be quite good.
Of course, you can obviously get even better performance by dumping the python interpreter and using, say, C++ instead. Whether this approach actually succeeds or not depends on how good you are at high performance programming with C++ vs. numpy, and what operation exactly you're trying to do.
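To make the "amortize the interpreter cost over many elements" point concrete with the XOR example (a toy sketch):
import numpy as np

a = np.random.randint(0, 256, size=1000000).astype(np.uint8)
b = np.random.randint(0, 256, size=1000000).astype(np.uint8)

c_fast = a ^ b                                    # one call; the loop runs in C
c_slow = np.array([x ^ y for x, y in zip(a, b)],  # ~10^6 trips through the interpreter
                  dtype=np.uint8)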
Any time you have an expression like x = a * b + c / d + e, you end up with one temporary array for a * b, one temporary array for c / d, one for one of the sums and finally one allocation for the result. This is a limitation of Python types and operator overloading. You can however do things in-place explicitly using the augmented assignment (*=, +=, etc.) operators and be assured that copies aren't made.
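A small sketch of the difference (note that the in-place version clobbers c, so it only works if you no longer need it):
import numpy as np

a, b, c, d, e = (np.random.rand(1000000) for _ in range(5))

x = a * b + c / d + e   # temporaries for a*b, c/d and the partial sums

y = a * b               # one allocation for the result
c /= d                  # in-place divide, no new array (c is overwritten)
y += c
y += e

assert np.allclose(x, y)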
As for the specific reason NumPy performs more slowly in that benchmark, it's hard to tell but it probably has to do with the constant overhead of checking sizes, type-marshaling, etc. that Cython/etc. don't have to worry about. On larger problems you'd probably see it get closer.
I can't really tell, but I'd guess there are two factors:
Perhaps numpy is copying more stuff? weave is often faster when you avoid allocating big temporary arrays, but this shouldn't matter here.
numpy has a bit of overhead used in iterating over (possibly) multidimensional arrays. This overhead would normally be dwarfed by number crunching, but an xor is really really fast, so all that really matters is the overhead.
Your sub-question: for a = sin(x), how many round trips are there?
The trick is to pass a numpy array as x; then there is only one 'round trip' for the whole array, since numpy will return an array of sin values. There is no Python for loop involved in this operation.
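For example (a toy comparison):
import math
import numpy as np

x = np.linspace(0, 2 * np.pi, 1000000)

a = np.sin(x)                                # one call; the loop runs inside numpy
a_slow = np.array([math.sin(v) for v in x])  # one Python-level call per element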
