Numpy/Python numerical instability issue multiplying large arrays - python

I have created a function to rotate a vector by a quaternion:
def QVrotate_toLocal(Quaternion,Vector):
#NumSamples x Quaternion[w,x,y,z]
#NumSamples x Vector[x,y,z]
#For example shape (20000000,4) with range 0,1
# shape (20000000,3) with range -100,100
#All numbers are float 64s
Quaternion[:,2]*=-1
x,y,z=QuatVectorRotate(Quaternion,Vector)
norm=np.linalg.norm(Quaternion,axis=1)
x*=(1/norm)
y*=(1/norm)
z*=(1/norm)
return np.stack([x,y,z],axis=1)
Everything within QuatVectorRotate is addition and multiplication of (20000000,1) numpy arrays
For the data I have (20million samples for both the quaternion and vector), every time I run the code the solution oscillates between a (known) correct solution and very incorrect solution. Never deviating from pattern correct,incorrect,correct,incorrect...
This kind of numerical oscillation in static code usually means there is an ill-conditioned matrix which is being operated on, python is running out of floating point precision, or there is a silent memory overflow somewhere.
There is little linear algebra in my code, and I have checked and found the norm line to be static with every run. The problem seems to be happening somewhere in lines a= ... to d= ...
Which led me to believe that given these large arrays I was running out of memory somewhere along the line. This could still be the issue, but I dont believe it is; I have 16gb memory, and while running I never get above 75% usage. But again, I do not know enough about memory allocation to definitively rule this out. I attempted to force garbage collection at the beginning and end of the function to no avail.
Any ideas would be appreciated.
EDIT:
I just reproduced this issue with the following data and same behavior was observed.
Q=np.random.random((20000000,4))
V=np.random.random((20000000,3))

When you do Quaternion[:,2]*=-1 in the first line, you are mutating the Quaternion array. This is not a local copy of that array, but the actual array you pass in from the outside.
So every time you run this code, you have different signs on those elements. After having run the function twice, the array is back to the start (since, obviously, -1*-1 = 1).
One way to get around this is to make a local copy first:
Quaternion_temp = Quaternion.copy()

Related

Memory usage grows indefinitely when using scipy.sparse.linalg.eigsh

Here is the code:
# input:
# A : a large csr matrix (365 million rows and 1.3 billion entries), 32 bit float datatype
# get the two largest eigenvalues of A and the corresponding eigenvectors
from scipy.sparse.linalg import eigsh
(w,V) = eigsh(A,k=2,tol=10e-2,ncv=5)
As far as I can tell, there is not a lot of room to mess up here, but what I am observing is that my machine initially has plenty of memory (90G including swap), but the memory usage of eigsh slowly creeps up during the run until I run out of memory. Is there something obvious I am missing here?
What I have tried:
--Looking through the source. It is a lot, but as far as I could see, there is no memory allocated by the python code between iterations. I am not as good at Fortran, but it would be unexpected if ARPACK itself or the calling routine allocated memory.
--Tried an equivalent thing in Octave (MATLAB clone), with similar effects, although less obvious since the datatype is necessarily double precision and thus it is more constrained from the start. So perhaps it could be something with ARPACK itself?
--Googled a bunch. It looks like Scipy does (did?) use a circular reference somewhere that has caused others grief when calling eigsh multiple times, but I am calling it once, so maybe this is not the same issue.
Any help would be very greatly appreciated.

Memory leak using scipy.eigs

I have a set of large very sparse matrices which I'm trying to find the eigenvector corresponding to eigenvalue 0 of. From the structure of the underlying problem I know a solution to this must exist and that zero is also the largest eigenvalue.
To solve the problem I use scipy.sparse.linalg.eigs with arguments:
val, rho = eigs(L, k=1, which = 'LM', v0=init, sigma=-0.001)
where L is the matrix. I then call this multiple times for different matrices. At first I was having a severe problem where the memory usage just increases as I call the function more and more times. It seems eigs doesn't free all the memory it should. I solved by calling gc.collect() each time I use eigs.
But now I worry that also internally memory isn't being freed, naively I expect that using something like Arnoldi shouldn't use more memory as the algorithm progresses, it should just be storing the matrix and the current set of Lanczos vectors, but I find that memory usage increases while the code is still inside the eigs function.
Any ideas?

Printing whole matrix in Theano

I'm debugging my Theano code and printing the values of my tensors as advised here:
a_printed = theano.printing.Print("a: ")(a)
The issue is that, when a is a relatively large matrix, the value is truncated to the first couple of rows and the last couple of rows. However, I would like the whole matrix to be printed. Is this possible?
I believe you can print the underlying numpy, accessed as a.get_value(). Within numpy you can modify printing by
numpy.set_printoptions(threshold=10000000)
where threshold should be bigger than the number of elements expected, and then the whole array will show. See the documentation for set_printoptions. Note that if outputted to a console, this may freeze up because of the possibly very large amount of text.

Explain the speed difference between numpy's vectorized function application VS python's for loop

I was implementing a weighting system called TF-IDF on a set of 42000 images, each consisting 784 pixels. This is basically a 42000 by 784 matrix.
The first method I attempted made use of explicit loops and took more than 2 hours.
def tfidf(color,img_pix,img_total):
if img_pix==0:
return 0
else:
return color * np.log(img_total/img_pix)
...
result = np.array([])
for img_vec in data_matrix:
double_vec = zip(img_vec,img_pix_vec)
result_row = np.array([tfidf(x[0],x[1],img_total) for x in double_vec])
try:
result = np.vstack((result,result_row))
# first row will throw a ValueError since vstack accepts rows of same len
except ValueError:
result = result_row
The second method I attempted used numpy matrices and took less than 5 minutes. Note that data_matrix, img_pix_mat are both 42000 by 784 matrices while img_total is a scalar.
result = data_matrix * np.log(np.divide(img_total,img_pix_mat))
I was hoping someone could explain the immense difference in speed.
The authors of the following paper entitled "The NumPy array: a structure for eļ¬ƒcient numerical computation" (http://arxiv.org/pdf/1102.1523.pdf), state on the top of page 4 that they observe a 500 times speed increase due to vectorized computation. I'm presuming much of the speed increase I'm seeing is due to this. However, I would like to go a step further and ask why numpy vectorized computations are that much faster than standard python loops?
Also, perhaps you guys might know of other reasons why the first method is slow. Do try/except structures have high overhead? Or perhaps forming a new np.array for each loop is takes a long time?
Thanks.
Due to the internal workings of numpy, (as far as i know, numpy works with C internally, so everything you push down to numpy is actually much faster because it is in a different language)
edit:
taking out the zip, and replacing it with a vstack should go faster too, (zip tends to go slow if the arguments are very large, and than vstack is faster), (but that is also something which puts it into numpy (thus into C), while zip is python)
and yes, if i understand correctly - not sure about that- , you are doing 42k times a try/except block, that should definately be bad for the speed,
test:
T=numpy.ndarray((5,10))
for t in T:
print t.shape
results in (10,)
This means that yes, if your matrices are 42kx784 matrices, you are trying 42k times a try-except block, i am assuming that should put an effect in the computation times, as well as each time doing a zip each time, but not certain if that would be the main cause,
(so every one of your 42k times you run your stuff, takes 0.17sec, i am quite certain that a try/except block doesnt take 0.17 seconds, but maybe the overhead it causes or so, does contribute to it?
try changing the following:
double_vec = zip(img_vec,img_pix_vec)
result_row = np.array([tfidf(x[0],x[1],img_total) for x in double_vec])
to
result_row=np.array([tfidf(img_vec[i],img_pix_vec[i],img_total) for i in xrange(len(img_vec))])
that at least gets rid of the zip statement, But not sure if the zip statement takes down your time by one min, or by nearly two hours (i know zip is slow, compared to numpy vstack, but no clue if that would give you two hours time gain)
The difference you're seeing isn't due to anything fancy like SSE vectorization. There are two primary reasons. The first is that NumPy is written in C, and the C implementation doesn't have to go through the tons of runtime method dispatch and exception checking and so on that a Python implementation goes through.
The second reason is that even for Python code, your loop-based implementation is inefficient. You're using vstack in a loop, and every time you call vstack, it has to completely copy all arrays you've passed to it. That adds an extra factor of len(data_matrix) to your asymptotic complexity.

How does Pythonic garbage collection with numpy array appends and deletes?

I am trying to adapt the underlying structure of plotting code (matplotlib) that is updated on a timer to go from using Python lists for the plot data to using numpy arrays. I want to be able to lower the time step for the plot as much as possible, and since the data may get up into the thousands of points, I start to lose valuable time fast if I can't. I know that numpy arrays are preferred for this sort of thing, but I am having trouble figuring out when I need to think like a Python programmer and when I need to think like a C++ programmer maximize my efficiency of memory access.
It says in the scipy.org docs for the append() function that it returns a copy of the arrays appended together. Do all these copies get garbage-collected properly? For example:
import numpy as np
a = np.arange(10)
a = np.append(a,10)
print a
This is my reading of what is going on on the C++-level, but if I knew what I was talking about, I wouldn't be asking the question, so please correct me if I'm wrong! =P
First a block of 10 integers gets allocated, and the symbol a points to the beginning of that block. Then a new block of 11 integers is allocated, for a total of 21 ints (84 bytes) being used. Then the a pointer is moved to the start of the 11-int block. My guess is that this would result in the garbage-collection algorithm decrementing the reference count of the 10-int block to zero and de-allocating it. Is this right? If not, how do I ensure I don't create overhead when appending?
I also am not sure how to properly delete a numpy array when I am done using it. I have a reset button on my plots that just flushes out all the data and starts over. When I had lists, this was done using del data[:]. Is there an equivalent function for numpy arrays? Or should I just say data = np.array([]) and count on the garbage collector to do the work for me?
The point of automatic memory management is that you don't think about it. In the code that you wrote, the copies will be garbage-collected fine (it's nigh on impossible to confuse Python's memory management). However, because np.append is not in-place, the code will create a new array in memory (containing the concatenation of a and 10) and then the variable a will be updated to point to this new array. Since a now no longer points to the original array, which had a refcount of 1, its refcount is decremented to 0 and it will be cleaned up automatically. You can use gc.collect to force a full cleanup.
Python's strength does not lie in fine-tuning memory access, although it is possible to optimise. You are probably best sorted pre-allocating a (using e.g. a = np.zeros( <size> )); if you need finer tuning than that it starts to get a bit hairy. You could have a look at the Cython + Numpy tutorial for a very neat and easy way to integrate C with Python for efficiency.
Variables in Python just point to the location where their contents are stored; you can del any variable and it will decrease the reference count of its target by one. The target will be cleaned automatically after its reference count hits zero. The moral of this is, don't worry about cleaning up your memory. It will happen automatically.

Categories