I have a 2D MxN array A, each row of which is a sequence of indices padded with -1's at the end, e.g.:
[[ 2 1 -1 -1 -1]
[ 1 4 3 -1 -1]
[ 3 1 0 -1 -1]]
I have another MxN array of float values B:
[[ 0.7 0.4 1.5 2.0 4.4 ]
[ 0.8 4.0 0.3 0.11 0.53]
[ 0.6 7.4 0.22 0.71 0.06]]
and I want to use the indices in A to filter B, i.e. for each row, only the locations whose indices appear in A retain their values, and the values at all other locations are set to 0.0, so the result would look like:
[[ 0.0 0.4 1.5 0.0 0.0 ]
[ 0.0 4.0 0.0 0.11 0.53 ]
[ 0.6 7.4 0.0 0.71 0.0]]
What's a good way to do this in "pure" numpy? (I would like to do this in pure numpy so I can jit it in jax.)
Numpy supports fancy indexing. Ignoring the "-1" entries for the moment, you can do something like this:
index = (np.arange(B.shape[0]).reshape(-1, 1), A)
result = np.zeros_like(B)
result[index] = B[index]
This works because the indices are broadcast against each other. The column vector np.arange(B.shape[0]).reshape(-1, 1) pairs every element of a given row of A with the corresponding row in B and result.
This example does not yet address the fact that -1 is a valid numpy index: it refers to the last column, so the padding spuriously copies last-column values. You need to clear the elements that correspond to -1 in A whenever the last-column index (here 4) is not genuinely present in that row:
mask = (A == -1).any(axis=1) & (A != A.shape[1] - 1).all(axis=1)
result[mask, -1] = 0.0
Here, the mask is [True, False, True]: the second row is excluded because, even though it has a -1 in it, it also contains a 4, so its last column must keep its value.
This approach is fairly efficient. It will create no more than a couple of boolean arrays of the same shape as A for the mask.
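Putting the two steps together, a minimal end-to-end version of the above (using the arrays from the question):

import numpy as np

A = np.array([[ 2,  1, -1, -1, -1],
              [ 1,  4,  3, -1, -1],
              [ 3,  1,  0, -1, -1]])
B = np.array([[0.7, 0.4, 1.5,  2.0,  4.4 ],
              [0.8, 4.0, 0.3,  0.11, 0.53],
              [0.6, 7.4, 0.22, 0.71, 0.06]])

# Pair each row of A with its own row number, then copy the selected values.
index = (np.arange(B.shape[0]).reshape(-1, 1), A)
result = np.zeros_like(B)
result[index] = B[index]

# Undo the spurious writes to the last column caused by the -1 padding.
mask = (A == -1).any(axis=1) & (A != A.shape[1] - 1).all(axis=1)
result[mask, -1] = 0.0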
You can use broadcasting, but note that it will create a large intermediate array of shape (M, N, N) (in pure numpy at least):
import numpy as np
A = ...  # the index array A from the question
B = ...  # the float array B from the question
M, N = A.shape
out = np.where(np.any(A[..., None] == np.arange(N), axis=1), B, 0.0)
out:
array([[0. , 0.4 , 1.5 , 0. , 0. ],
[0. , 4. , 0. , 0.11, 0.53],
[0.6 , 7.4 , 0. , 0.71, 0. ]])
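Since the question asks about jitting in JAX: the broadcasting solution translates directly. A rough sketch of my own (not from the answer above; the name filter_by_indices is made up, and I'm assuming A and B are passed in as JAX arrays with static shapes):

import jax
import jax.numpy as jnp

@jax.jit
def filter_by_indices(A, B):
    # (M, N) mask: True where the column index occurs somewhere in that row of A.
    # The -1 padding never matches any index in range(N), so it is ignored.
    mask = jnp.any(A[..., None] == jnp.arange(B.shape[1]), axis=1)
    return jnp.where(mask, B, 0.0)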
Another possible solution: replace each -1 with an index that is already present in the row (the row maximum), then scatter True into a boolean mask with np.put_along_axis; writing True to the same position more than once is harmless.
maxr = np.max(A, axis=1)                       # an index known to be valid in each row
A = np.where(A == -1, maxr.reshape(-1,1), A)   # replace the -1 padding with it
mask = np.zeros(np.shape(B), dtype=bool)
np.put_along_axis(mask, A, True, axis=1)       # duplicate indices just re-set True
np.where(mask, B, 0)
Output:
array([[0. , 0.4 , 1.5 , 0. , 0. ],
[0. , 4. , 0. , 0.11, 0.53],
[0.6 , 7.4 , 0. , 0.71, 0. ]])
EDIT (when there are rows with only -1)
The following code handles the possibility, raised by @MadPhysicist (whom I thank), of rows containing only -1; it takes only two extra lines added to my previous code.
A = np.array([[ 2, 1, -1, -1, -1],
[ -1, -1, -1, -1, -1],
[ 3, 1, 0, -1, -1]])
B = np.array([[ 0.7, 0.4, 1.5, 2.0, 4.4 ],
[ 0.8, 4.0, 0.3, 0.11, 0.53],
[ 0.6, 7.4, 0.22, 0.71, 0.06]])
rminus1 = np.all(A == -1, axis=1) # new
maxr = np.max(A, axis=1)
A = np.where(A == -1, maxr.reshape(-1,1), A)
mask = np.zeros(np.shape(B), dtype=bool)
np.put_along_axis(mask, A, True, axis=1)
C = np.where(mask, B, 0)
C[rminus1, :] = 0 # new
Output:
array([[0. , 0.4 , 1.5 , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. ],
[0.6 , 7.4 , 0. , 0.71, 0. ]])
I am having difficulty selecting rows using two conditions in NumPy. The following code does not return the intended output:
import numpy as np

tot_length = 0.3
steps = 0.1
start_val = 0.0
list_no = np.arange(start_val, tot_length, steps)
x, y, z = np.meshgrid(*[list_no for _ in range(3)], sparse=True)
a = ((x >= y) & (y >= z)).nonzero()  # this may be the problem
output
(array([0, 0, 0, 1, 1, 1, 1, 2, 2, 2]), array([0, 1, 2, 1, 1, 2, 2, 2, 2, 2]), array([0, 0, 0, 0, 1, 0, 1, 0, 1, 2]))
whereas, the intended output
[[0. 0. 0. ]
[0.1 0. 0. ]
[0.1 0.1 0. ]
[0.1 0.1 0.1]
[0.2 0. 0. ]
[0.2 0.1 0. ]
[0.2 0.1 0.1]
[0.2 0.2 0. ]
[0.2 0.2 0.1]
[0.2 0.2 0.2]]
ndarray.nonzero, like np.where, returns a tuple of arrays of indices. This makes it easy to unpack those indices into separate arrays, each of which can then be used to index along a given axis. Stacking them up into a 2D array is trivial though: simply build a new array from the tuple and transpose it:
ix = np.array(((x>=y) & (y>=z)).nonzero()).T
Then you can easily use the array of indices to index list_no as:
list_no[ix]
array([[0. , 0. , 0. ],
[0. , 0.1, 0. ],
[0. , 0.2, 0. ],
[0.1, 0.1, 0. ],
[0.1, 0.1, 0.1],
[0.1, 0.2, 0. ],
[0.1, 0.2, 0.1],
[0.2, 0.2, 0. ],
[0.2, 0.2, 0.1],
[0.2, 0.2, 0.2]])
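As a side note (my addition, not part of the original answer), np.argwhere performs the same stack-and-transpose in one call:

ix = np.argwhere((x >= y) & (y >= z))  # equivalent to np.transpose(np.nonzero(...))
list_no[ix]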
I'm having a problem where I'm getting different random numbers across different computers despite
scipy.__version__ == '1.2.1' on all computers
numpy.__version__ == '1.15.4' on all computers
random_state seed is fixed to the same number (42) in every function call that generates random numbers for reproducible results
The code is a bit too complex to post in full here, but I noticed that results start to diverge specifically when sampling from a multivariate normal:
import numpy as np
from scipy import stats
seed = 42
n_sim = 1000000
d = corr_mat.shape[0] # corr_mat is a 15x15 correlation matrix, numpy.ndarray
# results diverge from here across different hardware
z = stats.multivariate_normal(mean=np.zeros(d), cov=corr_mat).rvs(n_sim, random_state=seed)
corr_mat is a correlation matrix (see Appendix below) and is the same across all computers.
The two different computers we are testing on are:
Computer 1
OS: Windows 7
Processor: Intel(R) Xeon(R) CPU E5-2623 v4 @ 2.60 GHz (2 processors)
RAM: 64 GB
System type: 64-bit
Computer 2
OS: Windows 7
Processor: Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.10 GHz (2 processors)
RAM: 64 GB
System type: 64-bit
Appendix
corr_mat
>>> array([[1. , 0.15, 0.25, 0.25, 0.25, 0.25, 0.1 , 0.1 , 0.1 , 0.25, 0.25,
0.25, 0.1 , 0.1 , 0.1 ],
[0.15, 1. , 0. , 0. , 0. , 0. , 0.15, 0.05, 0.15, 0.15, 0.15,
0. , 0.15, 0.15, 0.15],
[0.25, 0. , 1. , 0.25, 0.25, 0.25, 0.2 , 0. , 0.2 , 0.2 , 0.2 ,
0.25, 0.2 , 0.2 , 0.2 ],
[0.25, 0. , 0.25, 1. , 0.25, 0.25, 0.2 , 0. , 0.2 , 0.2 , 0.2 ,
0.25, 0.2 , 0.2 , 0.2 ],
[0.25, 0. , 0.25, 0.25, 1. , 0.25, 0.2 , 0. , 0.2 , 0.2 , 0.2 ,
0.25, 0.2 , 0.2 , 0.2 ],
[0.25, 0. , 0.25, 0.25, 0.25, 1. , 0.2 , 0. , 0.2 , 0.2 , 0.2 ,
0.25, 0.2 , 0.2 , 0.2 ],
[0.1 , 0.15, 0.2 , 0.2 , 0.2 , 0.2 , 1. , 0.15, 0.25, 0.25, 0.25,
0.2 , 0.25, 0.25, 0.25],
[0.1 , 0.05, 0. , 0. , 0. , 0. , 0.15, 1. , 0.15, 0.15, 0.15,
0. , 0.15, 0.15, 0.15],
[0.1 , 0.15, 0.2 , 0.2 , 0.2 , 0.2 , 0.25, 0.15, 1. , 0.25, 0.25,
0.2 , 0.25, 0.25, 0.25],
[0.25, 0.15, 0.2 , 0.2 , 0.2 , 0.2 , 0.25, 0.15, 0.25, 1. , 0.25,
0.2 , 0.25, 0.25, 0.25],
[0.25, 0.15, 0.2 , 0.2 , 0.2 , 0.2 , 0.25, 0.15, 0.25, 0.25, 1. ,
0.2 , 0.25, 0.25, 0.25],
[0.25, 0. , 0.25, 0.25, 0.25, 0.25, 0.2 , 0. , 0.2 , 0.2 , 0.2 ,
1. , 0.2 , 0.2 , 0.2 ],
[0.1 , 0.15, 0.2 , 0.2 , 0.2 , 0.2 , 0.25, 0.15, 0.25, 0.25, 0.25,
0.2 , 1. , 0.25, 0.25],
[0.1 , 0.15, 0.2 , 0.2 , 0.2 , 0.2 , 0.25, 0.15, 0.25, 0.25, 0.25,
0.2 , 0.25, 1. , 0.25],
[0.1 , 0.15, 0.2 , 0.2 , 0.2 , 0.2 , 0.25, 0.15, 0.25, 0.25, 0.25,
0.2 , 0.25, 0.25, 1. ]])
The following is an educated guess which I cannot validate since I don't have multiple machines.
Sampling from a correlated multinormal is typically done by sampling from an uncorrelated standard normal and then multiplying by a "square root" of the covariance matrix. With the seed set to 42 and your covariance matrix, I get a fairly similar sample to the one scipy produces if I instead use identity(15) for the covariance and then multiply by l*sqrt(d), where l, d, r = np.linalg.svd(covariance).
SVD is, I suppose, complex enough to explain small differences between platforms.
How can this snowball into something significant?
I think your choice of covariance matrix is to blame, since it has repeated eigenvalues. As a consequence, the SVD is not unique: eigenspaces belonging to a repeated eigenvalue can be rotated freely. This has the potential to hugely amplify a small numerical difference.
It would be interesting to see whether the differences you see persist if you test with a different covariance matrix with unique eigenvalues.
Edit:
For reference, here is what I tried for your smaller (6D) example:
>>> cm6 = np.array([[1,.5,.15,.15,0,0], [.5,1,.15,.15,0,0],[.15,.15,1,.25,0,0],[.15,.15,.25,1,0,0],[0,0,0,0,1,.1],[0,0,0,0,.1,1]])
>>> ls6,ds6,rs6 = np.linalg.svd(cm6)
>>> np.random.seed(42)
>>> cs6 = stats.multivariate_normal(cov=cm6).rvs()
>>> np.random.seed(42)
>>> is6 = stats.multivariate_normal(cov=np.identity(6)).rvs()
>>> LS6 = ls6*np.sqrt(ds6)
>>> np.allclose(cs6, LS6 @ is6)
True
As you report that the problem persists even with unique eigenvalues, here is one more possibility. Above I used svd to compute eigenvectors/eigenvalues, which is OK since cov is symmetric. What happens if we use eigh instead?
>>> de6,le6 = np.linalg.eigh(cm6)
>>> LE6 = le6*np.sqrt(de6)
>>> cs6
array([-0.00364915, -0.23778611, -0.50111166, -0.7878898 , -0.91913994,
1.12421904])
>>> LE6 @ is6
array([ 0.54338614, 1.04010029, -0.71379193, -0.88313042, -0.60813547,
0.26082989])
These are different. Why? First, eigh orders the eigenspaces the other way round:
>>> ds6
array([1.7 , 1.1 , 1.05, 0.9 , 0.75, 0.5 ])
>>> de6
array([0.5 , 0.75, 0.9 , 1.05, 1.1 , 1.7 ])
Does that fix it? Almost.
>>> LE6[:, ::-1] @ is6
array([-0.00364915, -0.23778611, -0.50111166, -0.7878898 , -1.12421904,
0.91913994])
We see that the last two samples are swapped and their signs flipped. It turns out this is due to the sign of one eigenvector being inverted.
So even with unique eigenvalues we can get large differences because of ambiguities in (1) the order of the eigenspaces and (2) the signs of the eigenvectors.
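To make that concrete, here is a rough sketch (my addition, not part of the original answer) of a helper that removes both ambiguities before comparing decompositions across machines; it only fully pins things down when the eigenvalues are unique:

import numpy as np

def canonical_eigh(cov):
    # eigh returns eigenvalues in ascending order; sort descending to fix (1).
    d, v = np.linalg.eigh(cov)
    order = np.argsort(d)[::-1]
    d, v = d[order], v[:, order]
    # Fix (2): flip each eigenvector so its largest-magnitude entry is positive.
    idx = np.argmax(np.abs(v), axis=0)
    v = v * np.sign(v[idx, np.arange(v.shape[1])])
    return d, v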
I have a numpy array, and when I print it I get the output below. I expected print(feat.shape) to return (105835, 99, 13), and I expected feat to have 3 dimensions.
print(feat.ndim)
print(feat.shape)
print(feat.size)
print(feat[1].ndim)
print(feat[1].shape)
print(feat[1].size)
1
(105835,)
105835
2
(99, 13)
1287
I don't know how to fix this, but feat is an MFCC feature. If I print feat, this is what I get:
array([array([[-1.0160675e+01, -1.3804866e+01, 9.1880971e-01, ...,
1.5415058e+00, 1.1875046e-02, -5.8664594e+00],
[-9.9697800e+00, -1.3823588e+01, -7.0778362e-02, ...,
1.5948311e+00, 4.3481258e-01, -5.1646194e+00],
[-9.9518738e+00, -1.2771760e+01, -1.2623003e-01, ...,
3.4290311e+00, 2.7361808e+00, -6.0621500e+00],
...,
[-11.605266 , -7.1909204, -33.44656 , ..., -11.974911 ,
12.825395 , 10.635098 ],
[-11.769397 , -9.340318 , -34.413307 , ..., -10.077869 ,
8.821722 , 7.704534 ],
[-12.301968 , -10.67318 , -32.46104 , ..., -6.829077 ,
15.29837 , 13.100596 ]], dtype=float32)], dtype=object)
The same structure can be created in a simpler way:
import numpy as np

ain = np.random.rand(2, 2)
a = np.ndarray(3, dtype=object)  # an empty object array with 3 slots
a[:] = [ain] * 3                 # each slot holds a 2x2 float array
#array([array([[ 0.14, 0.56],
# [ 0.9 , 0.9 ]]),
# array([[ 0.14, 0.56],
# [ 0.9 , 0.9 ]]),
# array([[ 0.14, 0.56],
# [ 0.9 , 0.9 ]])], dtype=object)
The problem arises because a.dtype is object. You can reconstruct your data with:
a = np.array(list(a))
#array([
# [[ 0.14, 0.56],
# [ 0.9 , 0.9 ]],
# [[ 0.14, 0.56],
# [ 0.9 , 0.9 ]],
# [[ 0.14, 0.56],
# [ 0.9 , 0.9 ]]])
This will have the float dtype inherited from the base arrays.
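Alternatively (my suggestion, not part of the answer above), np.stack applied to the original object array rebuilds the regular array in one step, and raises a clear error if the element shapes disagree:

a3 = np.stack(a)  # shape (3, 2, 2) here; requires every element to have the same shape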
numpy.array has a handy .tostring() method which produces a compact representation of the array as a bytestring. But how do I restore the original array from the bytestring? numpy.fromstring() only produces a 1-dimensional array, and there is no numpy.array.fromstring(). Seems like I ought to be able to provide a string, a shape, and a type, and go, but I can't find the function.
>>> x
array([[ 0. , 0.125, 0.25 ],
[ 0.375, 0.5 , 0.625],
[ 0.75 , 0.875, 1. ]])
>>> s = x.tostring()
>>> numpy.fromstring(s)
array([ 0. , 0.125, 0.25 , 0.375, 0.5 , 0.625, 0.75 , 0.875, 1. ])
>>> y = numpy.fromstring(s).reshape((3, 3))
>>> y
array([[ 0. , 0.125, 0.25 ],
[ 0.375, 0.5 , 0.625],
[ 0.75 , 0.875, 1. ]])
It does not seem to exist; you can easily write it yourself, though:
def numpy_2darray_fromstring(s, nrows=1, dtype=float):
    # Each row occupies an equal-sized chunk of the bytestring.
    chunk_size = len(s) // nrows
    return numpy.array([numpy.fromstring(s[i*chunk_size:(i+1)*chunk_size], dtype=dtype)
                        for i in range(nrows)])
An update to Mike Graham's answer:
numpy.fromstring is deprecated and should be replaced by numpy.frombuffer
in the case of complex numbers, the dtype should be specified explicitly
So the above example would become:
>>> x = numpy.array([[1, 2j], [3j, 4]])
>>> x
array([[1.+0.j, 0.+2.j],
[0.+3.j, 4.+0.j]])
>>> s = x.tostring()
>>> y = numpy.frombuffer(s, dtype=x.dtype).reshape(x.shape)
>>> y
array([[1.+0.j, 0.+2.j],
[0.+3.j, 4.+0.j]])
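Note that .tostring() is itself a deprecated alias of .tobytes() in current NumPy, so the fully up-to-date round trip (same data as above) is:
>>> s = x.tobytes()
>>> numpy.frombuffer(s, dtype=x.dtype).reshape(x.shape)
array([[1.+0.j, 0.+2.j],
       [0.+3.j, 4.+0.j]])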