Numpy array subtraction: inconsistent values for large arrays - python

Here's a problem I came across today: I am trying to subtract the first row of a (large) matrix from the entire matrix. As a test, I made all rows equal. Here's a MWE:
import numpy as np
first = np.random.normal(size=10)
reference = np.repeat((first,), 10000, axis=0)
copy_a = np.copy(reference)
copy_a -= copy_a[0]
print(np.all(copy_a == 0)) # prints False
Oh wow - False! So I tried another thing:
copy_b = np.copy(reference)
copy_b -= reference[0]
np.all(copy_b == 0) # prints True
Examining the new copy_a array, I found that copy_a[0:819] are all zeros, copy_a[820:] hold the original values, while copy_a[819] was only partially processed.
In [115]: copy_a[819]
Out[115]:
array([ 0. , 0. , 0.57704706, -0.22270692, -1.83793342,
0.58976187, -0.71014837, 1.80517635, -0.98758385, -0.65062774])
Looks like midway through the operation, numpy went back and looked at copy_a[0], found that it was now all zeros, and hence subtracted zeros from the rest of the array. I find this weird. Is this a bug, or is it an expected numpy result?

This issue has actually been reported multiple times to the numpy repository (see the links below). It is considered a bug, but it is very hard to fix without sacrificing performance (copying the input arrays), because correctly detecting whether two arrays share memory is difficult. A quick manual check is sketched after the links.
Therefore, for now, you'd better just make a copy of copy_a[0], as explained in @Torben's answer.
The essence of the issue is that you are modifying the array while iterating over it. The first 8192 elements (819 full rows of 10, plus 2 elements) come out right simply because 8192 is the size of numpy's assignment buffer.
https://github.com/numpy/numpy/issues/6119
https://github.com/numpy/numpy/issues/5241
https://github.com/numpy/numpy/issues/4802
https://github.com/numpy/numpy/issues/2705
https://github.com/numpy/numpy/issues/1683
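For what it's worth, NumPy does expose memory-overlap checks that you can run manually before an in-place operation (the exact check can be expensive in pathological cases, which is part of why it is not done automatically on every operation). A minimal sketch, continuing the arrays from the question:
np.shares_memory(copy_a, copy_a[0])      # True: the row is a view into copy_a itself
np.shares_memory(copy_a, reference[0])   # False: safe to subtract in place
np.may_share_memory(copy_a, copy_a[0])   # fast, conservative variant of the check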

The infix operator -= modifies the array in place, meaning that you are pulling the rug out from under your own feet. The effect that you see probably has to do with internal buffering of results (i.e. the first "commit" back to memory happens after 819 full rows).
The solution is to copy the subtrahend into a separate array:
copy_a -= copy_a[0].copy()
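A quick check that the copy fixes it (a minimal continuation of the question's example):
copy_c = np.copy(reference)
copy_c -= copy_c[0].copy()     # the subtrahend no longer aliases the buffer being written
print(np.all(copy_c == 0))     # True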

Related

How do I plot a weighted histogram for every row of a 2D numpy array?

I have two square matrices, one of data and one of weights, and I would like to extract histogram data for each row individually (without resorting to simply looping, and preferably within numpy). For example:
import numpy as np
data = np.array([[1,3,4,5],[2,1,5,4],[3,3,1,6],[1,2,2,2]]) #data and weights are 4x4 matrices
weights = np.array([[1,1,2,0.4],[1,3,1,1],[1,1,6,5],[1,1,1,1]])
binn = np.array([0,1,3,4,7,10]) #there are 5 bins
#bins should be the same for every row
#would like output which is a 4x5 matrix
output = np.array([[0,1,1,2.4,0],[0,4,0,2,0],[0,6,2,5,0],[0,4,0,0,0]])
This is similar to what was asked in this question: Numpy histogram on multi-dimensional array. However, I have tried to make the solutions suggested there work, and the fact that the array of weights shares the dimensionality of the data seems to break np.apply_along_axis. I've tried merging data and weights along the innermost dimension with np.concatenate((data[...,None],weights[...,None]), axis=-1) and feeding this as the array to a wrapper function which is the argument of apply_along_axis, but this produces an output array of the wrong shape.
def wrapper(data_weights):
    h, _ = np.histogram(data_weights[..., 0], bins=binn, weights=data_weights[..., 1])
    return h

arrs = np.concatenate((data[..., None], weights[..., None]), axis=-1)

# looping gives the expected result:
for i in arrs:
    print(wrapper(i))

# however
h = np.apply_along_axis(wrapper, 1, arrs)  # axis is passed positionally
print(h, h.shape)
Not only does using apply_along_axis fail to give anything obviously meaningful, it also keeps the extra dimension from bundling data and weights: h.shape == (4, 5, 2). I'm clearly using apply_along_axis wrong, but I can't see how, or whether it is even usable here (I have also looked into apply_over_axes, with no more success).
The NumPy function that does what apply_along_axis does, but with a more flexible specification, is probably np.vectorize:
def myhist(data, weights, binn):
    return np.histogram(data, bins=binn, weights=weights)[0]

myhistvec = np.vectorize(myhist, signature='(m),(m),(n)->(p)')
myhistvec(data, weights, binn)
# [[0.  1.  1.  2.4 0. ]
#  [0.  4.  0.  2.  0. ]
#  [0.  6.  2.  5.  0. ]
#  [0.  4.  0.  0.  0. ]]
myhist's role is just to provide a version with only positional arguments (maybe np.vectorize can be used with keyword arguments, but I don't know).
vectorize creates a new function that is the vectorized version of myhist.
Generally speaking, g = np.vectorize(f) is a vectorized version of f. Meaning that if g is called with scalar arguments, they are just passed to f. But if g is called with 1d array arguments, then f is called with each element of those 1d arrays, and the results of all the f calls are used to build an array of results, returned by g. The same goes if g is called with 2d, 3d, ... arguments.
So, it is a little bit like numpy versions of math functions (np.sin([[1,2],[3,4]]) returns a 2d array containing [[sin(1), sin(2)],[sin(3), sin(4)]])
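As a tiny illustration of that default element-wise behavior (a hypothetical toy function, not part of the histogram problem):
import numpy as np

def add(x, y):
    return x + y              # plain scalar function

g = np.vectorize(add)
print(g(2, 3))                # 5: scalar arguments are passed straight to add
print(g([1, 2], [10, 20]))    # [11 22]: add is called once per element pair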
Except that for myhist we don't want to iterate all the way down to each scalar of data. So we specify that with signature, which means that each "atomic" call to myhist is passed 3 arguments: two 1d arrays of the same size m (the value of m is irrelevant; all that matters is that it is the same m for the first two parameters, and that they are 1d arrays), and a third 1d array of a different size n. The return value is yet another 1d array of some other size p.
So the result is the one you wanted.
Now, I must say that this is just cosmetic, as a successful attempt to use apply_along_axis would have been too. These functions do not help at all performance-wise.
See the naive version (the one you wanted to avoid):
def naiveLoop(data, weights, binn):
    res = []
    for i in range(len(data)):
        res.append(np.histogram(data[i], weights=weights[i], bins=binn)[0])
    return np.array(res)
And compare those with timeit. The result is disappointing: the naive version needs 270 μs, while the vectorized function needs 460 μs to perform the same task.
Sure, with more rows the gap narrows in favor of the vectorized version, and at 12500 rows it finally overtook the naive one.
(Only the iteration itself, that is, the loop counter, is probably faster, but that's all: for the rest, each row is still a call to a pure Python function.)
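For reference, a minimal way to produce such timings (a sketch; absolute numbers will of course vary by machine, with data, weights and binn being the arrays defined in the question):
import timeit

t_naive = timeit.timeit(lambda: naiveLoop(data, weights, binn), number=1000)
t_vec = timeit.timeit(lambda: myhistvec(data, weights, binn), number=1000)
print(t_naive / 1000, t_vec / 1000)   # average seconds per call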
For comparison, look at a numba function, reinventing the wheel (which I need because numba's version of histogram doesn't allow weights; otherwise I could have just added a @jit before my naiveLoop to get the same result. Maybe I could have done it better, but I am a beginner in numba):
from numba import jit

@jit  # the decorator compiles the loops to machine code
def fillHist(data, weights, binn, res):
    nl = len(data)
    nc = data.shape[1]
    nb = len(binn) - 1
    for i in range(nl):
        for j in range(nc):
            for k in range(nb):
                if data[i, j] >= binn[k] and data[i, j] < binn[k + 1]:
                    res[i, k] += weights[i, j]
                    break

def numbaVersion(data, weights, binn):
    res = np.zeros((len(data), len(binn) - 1))
    fillHist(data, weights, binn, res)
    return res
Same result, obviously.
But it took only 2.9 μs! (compared to 270 μs and 460 μs for the naive and vectorized versions)
So, sometimes the most elegant solution is not the fastest. It may seem presumptuous, but I am pretty sure that if I had just posted my "vectorize" version, without comments on speed or the comparison with the naive version, you would have replied "that's what I wanted" and found the solution elegant.
Yet the solution you already knew, and did not want because it needs for loops, was almost twice as fast. And, even worse, reinventing the wheel (and in a naive way at that: no attempt to use bisection or anything clever to find the bin, just a linear scan over all the bins!) is some 150 times faster!

Computation difference between function and manual computation

I am facing a mystery right now. I get strange results in a program, and I think it may be related to the computation itself, since my function gives different results from a manual computation.
This is from my program, I am printing the values pre-computation :
print("\nPrecomputation:\nmatrix\n:", matrix)
tmp = likelihood_left * likelihood_right
print("\nconditional_dep:", tmp)
print("\nfinal result:", matrix # tmp)
I got the following output:
Precomputation:
matrix:
[array([0.08078721, 0.5802404 , 0.16957052, 0.09629893, 0.07310294])
array([0.14633129, 0.45458744, 0.20096238, 0.02142105, 0.17669784])
array([0.41198731, 0.06197812, 0.05934063, 0.23325626, 0.23343768])
array([0.15686545, 0.29516415, 0.20095091, 0.14720275, 0.19981674])
array([0.15965914, 0.18383683, 0.10606946, 0.14234812, 0.40808645])]
conditional_dep: [0.01391123 0.01388155 0.17221067 0.02675524 0.01033257]
final result: [0.07995043 0.03485223 0.02184015 0.04721548 0.05323298]
The thing is, when I run the following code:
matrix = [np.array([0.08078721, 0.5802404 , 0.16957052, 0.09629893, 0.07310294]),
          np.array([0.14633129, 0.45458744, 0.20096238, 0.02142105, 0.17669784]),
          np.array([0.41198731, 0.06197812, 0.05934063, 0.23325626, 0.23343768]),
          np.array([0.15686545, 0.29516415, 0.20095091, 0.14720275, 0.19981674]),
          np.array([0.15965914, 0.18383683, 0.10606946, 0.14234812, 0.40808645])]
tmp = np.asarray([0.01391123, 0.01388155, 0.17221067, 0.02675524, 0.01033257])
matrix @ tmp
The values in use are exactly the same as they should be in the computation before but I get the following result:
array([0.04171218, 0.04535276, 0.02546353, 0.04688848, 0.03106443])
This result is obviously different from the previous one, and it is the correct one (I computed the dot product by hand).
I have been facing this problem the whole day and I did not find anything useful online. If any of you has even a tiny idea where it can come from, I'd be really happy :D
Thanks in advance,
Yann
PS: I can show more of the code if needed.
PS2: I don't know if it is relevant but this is used in a dynamic programming algorithm.
To recap our discussion in the comments, in the first part ("pre-computation"), the following is true about the matrix object:
>>> matrix.shape
(5,)
>>> matrix.dtype
dtype('O') # aka object
And as you say, this is due to matrix being a slice of a larger, non-uniform array. Let's recreate this situation:
>>> matrix = np.array([[],  # the ragged first element forces dtype=object; it is then sliced away
...                    np.array([0.08078721, 0.5802404 , 0.16957052, 0.09629893, 0.07310294]),
...                    np.array([0.14633129, 0.45458744, 0.20096238, 0.02142105, 0.17669784]),
...                    np.array([0.41198731, 0.06197812, 0.05934063, 0.23325626, 0.23343768]),
...                    np.array([0.15686545, 0.29516415, 0.20095091, 0.14720275, 0.19981674]),
...                    np.array([0.15965914, 0.18383683, 0.10606946, 0.14234812, 0.40808645])],
...                   dtype=object)[1:]  # newer NumPy requires dtype=object for ragged input
It is now not a matrix with scalars in rows and columns, but a column vector of column vectors. Technically, matrix @ tmp is an operation between two 1-D arrays, and hence NumPy should, according to the documentation, calculate the inner product of the two. This is true in this case, with the convention that the sum runs over the first axis:
>>> np.array([matrix[i] * tmp[i] for i in range(5)]).sum(axis=0)
array([0.07995043, 0.03485222, 0.02184015, 0.04721548, 0.05323298])
>>> matrix @ tmp
array([0.07995043, 0.03485222, 0.02184015, 0.04721548, 0.05323298])
This is essentially the same as taking the transpose of the proper 2-D matrix before the multiplication:
>>> np.stack(matrix).T @ tmp
array([0.07995043, 0.03485222, 0.02184015, 0.04721548, 0.05323298])
Equivalently, as noted by @jirasssimok:
>>> tmp @ np.stack(matrix)
array([0.07995043, 0.03485222, 0.02184015, 0.04721548, 0.05323298])
Hence the erroneous or unexpected result.
As you have already resolved to do in the comments, this can be avoided in the future by ensuring all matrices are proper 2-D arrays.
It looks like you got the operands switched in one of your matrix multiplications.
Using the same values of matrix and tmp that you provided, matrix @ tmp and tmp @ matrix produce the two results you showed.[1]
matrix = [np.array([0.08078721, 0.5802404 , 0.16957052, 0.09629893, 0.07310294]),
          np.array([0.14633129, 0.45458744, 0.20096238, 0.02142105, 0.17669784]),
          np.array([0.41198731, 0.06197812, 0.05934063, 0.23325626, 0.23343768]),
          np.array([0.15686545, 0.29516415, 0.20095091, 0.14720275, 0.19981674]),
          np.array([0.15965914, 0.18383683, 0.10606946, 0.14234812, 0.40808645])]
tmp = np.asarray([0.01391123, 0.01388155, 0.17221067, 0.02675524, 0.01033257])

print(matrix @ tmp)  # [0.04171218 0.04535276 0.02546353 0.04688848 0.03106443]
print(tmp @ matrix)  # [0.07995043 0.03485222 0.02184015 0.04721548 0.05323298]
To make it a little more obvious what your code is doing, you might also consider using np.dot instead of @. If you pass matrix as the first argument and tmp as the second, it will have the result you want, and it will make it clearer that you're conceptually calculating dot products rather than multiplying matrices.
As an additional note, if you're performing matrix operations on matrix, it might be better if it were a single two-dimensional array instead of a list of 1-dimensional arrays. This will prevent errors of the sort you'll see right now if you try to run matrix @ matrix. It would also let you write matrix.dot(tmp) instead of np.dot(matrix, tmp) if you wanted to.
(I'd guess that you can use np.stack or a similar function to create matrix, or you can call np.stack on matrix after creating it.)
[1] Because tmp has only one dimension and matrix has two, NumPy can and will treat tmp as whichever type of vector makes the multiplication work (by temporarily promoting it to 2-D). So tmp is treated as a column vector in matrix @ tmp and as a row vector in tmp @ matrix.
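A minimal sketch of that fix, reusing matrix and tmp from above (np.stack turns the list of 1-D rows into a proper (5, 5) array):
matrix2d = np.stack(matrix)
print(matrix2d.shape)      # (5, 5)
print(matrix2d @ tmp)      # [0.04171218 0.04535276 0.02546353 0.04688848 0.03106443]
print(matrix2d.dot(tmp))   # same result, spelled as a dot product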

NumPy: Compute mode row-wise spanning over multiple arrays from iterator

In my application I receive from an iterator an arbitrary number (let's say 1000 for now) of big 1-dimensional arrays arr1, arr2, arr3, ..., arr1000 (10000 entries each). Each entry is an integer between 0 and n, where in this case n = 9. My ultimate goal is to compute a 1-dimensional array result such that result[i] == the mode of arr1[i], arr2[i], arr3[i], ..., arr1000[i].
However, it is not tractable to concatenate the arrays to one big matrix and then compute the mode row-wise, since this may exceed the RAM on my machine.
An alternative would be to set up an array res2 of shape (10000, 10), then loop through every array, using each entry e as an index to increase the value of res2[i][e] by 1. After looping, I would apply something like argmax. However, this is too slow.
So: Is there a way to perform the task in a fast way, maybe by using NumPy's advanced indexing?
EDIT (due to the comments):
This is basically the code which calculates the modes row-wise, avoiding concatenating the arrays:
def foo(length, n):
    counts = np.zeros((length, n), dtype=np.int_)
    for arr in array_iterator():
        i = 0
        for e in arr:
            counts[i][e] += 1
            i += 1
    return np.argmax(counts, axis=1)
It already takes 60 seconds for 100 arrays of size 10000 (although there is more work done behind the scenes, which contributes to that time; however, this work scales linearly with the number of arrays).
Regarding the real sizes:
The number of different arrays is really arbitrary. It's a parameter of the experiments, and I'd like to be able to set it even to values like 10^6. The length of each array depends on the data set I'm working with. This could be 10000, or 100000, or even worse. However, splitting it into smaller pieces may be possible, though annoying.
My free RAM for this task is about 4 GB.
EDIT 2:
The running time I gave above leads to a wrong impression. Actually, the running time that belongs just to the inner loop (for e in arr) in the above-mentioned scenario is only 5 seconds, which is now OK for me, since it's negligible compared to the remaining running time. I will leave this question open for a moment anyway, since there might be an even faster method waiting out there.
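For what it's worth, the inner Python loop can indeed be replaced with the advanced indexing the question asks about: within a single array, each row index appears exactly once, so a fancy-indexed increment is safe (there are no duplicate (i, e) pairs to worry about). A minimal sketch, assuming array_iterator() yields 1-D integer arrays of equal length as in the question:
def foo_vectorized(length, n):
    counts = np.zeros((length, n), dtype=np.int_)
    rows = np.arange(length)
    for arr in array_iterator():
        counts[rows, arr] += 1   # one increment per row; row indices are unique
    return np.argmax(counts, axis=1)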

Fast way to construct a matrix in Python

I have been browsing through the questions, and could find some help, but I prefer having confirmation by asking it directly. So here is my problem.
I have a (numpy) array u of dimension N, from which I want to build a square matrix k of size N×N. Basically, each matrix element k(i,j) is defined as k(i,j) = exp(-|u_i - u_j|^2).
My first naive way to do it was like this, which is, I believe, Fortran-like:
for i in range(N):
    for j in range(N):
        k[i][j] = np.exp(np.sum(-(u[i] - u[j])**2))
However, this is extremely slow. For N=1000, for example, it is taking around 15 seconds.
My other way to proceed is the following (inspired by other questions/answers):
i, j = np.ogrid[:N,:N]
k = np.exp(np.sum(-(u[i]-u[j])**2,axis=2))
This is way faster, as for N=1000, the result is almost instantaneous.
So I have two questions.
1) Why is the first method so slow, and why is the second one so fast?
2) Is there a faster way to do it? For N=10000, it is starting to take quite some time already, so I really don't know if this was the "right" way to do it.
Thank you in advance!
P.S: the matrix is symmetric, so there must also be a way to make the process faster by calculating only the upper half of the matrix, but my question was more related to the way to manipulate arrays, etc.
First, a small remark: there is no need to use np.sum if u is a plain 1-D array (e.g. if it could be re-written as u = np.arange(N)), which seemed to be the case since you wrote that it is of dimension N.
1) First question:
Accessing indices in Python is slow, so it is best not to use [] if there is a way to avoid it. Plus, you call np.exp and np.sum many times, whereas they can be called once on whole vectors and matrices. So your second proposal is better, since you compute your k all at once instead of element by element.
2) Second question:
Yes, there is. You should consider using only numpy functions and no indexing (around 3 times faster):
k = np.exp(-np.power(np.subtract.outer(u,u),2))
(NB: you can keep **2 instead of np.power, which is a bit faster but has lower precision.)
Edit (taking into account that u is an array of pairs):
With tuple data, it's a bit more complicated:
ma = np.subtract.outer(u[:,0],u[:,0])**2
mb = np.subtract.outer(u[:,1],u[:,1])**2
k = np.exp(-np.add(ma, mb))
You'll have to use np.subtract.outer twice, since it would return a 4-dimensional array if you did it in one go (and compute lots of useless data), whereas u[i]-u[j] returns a 3-dimensional array.
I used np.add instead of np.sum since it keeps the array dimensions.
NB: I checked with
N = 10000
u = np.random.random_sample((N,2))
It returns the same as your proposals (but 1.7 times faster).
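As a further option (my addition, not part of the original answer): scipy has a dedicated routine for pairwise squared Euclidean distances, which avoids building large intermediate difference arrays. A sketch assuming u has shape (N, 2) as above:
from scipy.spatial.distance import cdist

k = np.exp(-cdist(u, u, 'sqeuclidean'))   # k[i, j] = exp(-|u_i - u_j|^2)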

Scipy sparse matrices element wise multiplication

I am trying to do an element-wise multiplication for two large sparse matrices. Both are of size around (400K X 500K), with around 100M elements.
However, they might not have non-zero elements in the same positions, and they might not have the same number of non-zero elements. In either situation, I'm okay with the product of a non-zero value in one matrix and a zero value in the other being zero.
I keep running out of memory (8GB) with every approach, which doesn't make much sense; I shouldn't be. Here is what I've tried.
A and B are sparse matrices (I've tried both the COO and CSC formats).
# I have loaded sparse matrices A and B, and have a file opened in write mode
row, col = A.nonzero()
index = zip(row, col)
del row, col
for i, j in index:
    # Approach 1
    A[i, j] *= B[i, j]
    # Approach 2
    someopenfile.write(' '.join([str(i), str(j), str(A[i, j] * B[i, j]), '\n']))
    # Approach 3
    if B[i, j] != 0:
        A[i, j] = A[i, j] * B[i, j]  # or, I wrote it to a file instead,
                                     # like in approach 2
If I comment out the for loop, I see that I use almost 3.5GB of memory. But the moment I use the loop, whether I'm writing the products to a file or back into a matrix, the memory usage shoots up to full memory, forcing me to stop the execution, or the system hangs. How can I do this operation without consuming so much memory?
I suspect that your sparse matrices are becoming non-sparse when you perform the operation. Have you tried just:
A.multiply(B)
I suspect that it will be better optimised than anything that you can easily do yourself.
If A is not already the correct type of sparse matrix, you might need:
A = A.tocsr()
# May also need
# B = B.tocsr()
A = A.multiply(B)
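A quick way to sanity-check this on synthetic data before running it on the real 400K × 500K matrices (a minimal sketch; the sizes and density below are made up for illustration):
from scipy import sparse

A = sparse.random(1000, 1200, density=0.01, format='csr', random_state=0)
B = sparse.random(1000, 1200, density=0.01, format='csr', random_state=1)

C = A.multiply(B)        # element-wise product; only positions non-zero in both survive
print(C.shape, C.nnz)    # stays sparse, so memory use remains modest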
