A neater way to set values at indexes with NumPy

I have a numpy array initially with zeros, like this:
import numpy as np

v = np.zeros((5, 5))
v
array([[ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.]])
I also have a set of arrays idx1 and idx2.
idx1
array([[0, 3],
       [0, 4],
       [1, 3],
       [2, 4]])
idx2
array([[0, 1],
       [0, 2],
       [0, 4],
       [1, 3]])
Treat each pair of values as row and column indices; for example, the first pair (0, 3) in idx1 indexes v[0, 3], and so on.
I first want to set the values at the indices specified by idx1 to 1, and then set the indices specified by idx2 to 0.
Note also that whenever a pair (i, j) appears in one of these arrays, I want to set both v[i, j] and v[j, i].
My final result becomes:
array([[ 0.,  0.,  0.,  1.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  1.],
       [ 1.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  1.,  0.,  0.]])
I currently achieve this by doing:
def set_vals(x, i, j, v):
    x[i, j] = x.T[i, j] = v

v = np.zeros((5, 5))
i1, j1 = idx1[:, 0], idx1[:, 1]
i2, j2 = idx2[:, 0], idx2[:, 1]
set_vals(v, i1, j1, 1)
set_vals(v, i2, j2, 0)
v  # the result
However, I believe there might be a better way. Would love to hear any thoughts/suggestions for improvement. Thanks!

In search of a more "compact" way of expressing it, I got this -
v = np.zeros((5, 5))
v[tuple(np.r_[idx1, idx1[:, ::-1]].T)] = 1
v[tuple(np.r_[idx2, idx2[:, ::-1]].T)] = 0
On Python 3.5+ (PEP 448), you can use the * unpacking operator to reduce this further:
v[[*np.r_[idx1, idx1[:, ::-1]].T]] = 1
v[[*np.r_[idx2, idx2[:, ::-1]].T]] = 0
v
array([[ 0.,  0.,  0.,  1.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  1.],
       [ 1.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  1.,  0.,  0.]])
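If you do this in several places, the symmetric assignment can also live in a small reusable helper. A minimal sketch (set_symmetric is a name introduced here, not from the post above):
import numpy as np

idx1 = np.array([[0, 3], [0, 4], [1, 3], [2, 4]])
idx2 = np.array([[0, 1], [0, 2], [0, 4], [1, 3]])

def set_symmetric(a, idx, value):
    # Stack each (i, j) pair with its mirror (j, i), then assign
    # both locations in one advanced-indexing operation.
    ij = np.r_[idx, idx[:, ::-1]]
    a[ij[:, 0], ij[:, 1]] = value

v = np.zeros((5, 5))
set_symmetric(v, idx1, 1)
set_symmetric(v, idx2, 0)
print(v)  # matches the result shown above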

Related

How to distribute a Numpy array along the diagonal of an array of higher dimension?

I have three two-dimensional Numpy arrays x, w, d and want to create a fourth one called a. w and d define only the shape of a, which is d.shape + w.shape. I want a to contain x at the entries assigned in the loop below, with zeros elsewhere.
Specifically, I want a loop-free version of this code:
a = np.zeros(d.shape + w.shape)
for j in range(d.shape[1]):
    a[:, j, :, j] = x
For example, given:
x = np.array([[ 2,  3],
              [ 1,  1],
              [ 8, 10],
              [ 0,  1]])
w = np.array([[ 0,  1,  1],
              [-1, -2,  1]])
d = np.matmul(x, w)
I want a to be
array([[[[ 2.,  0.,  0.],
         [ 3.,  0.,  0.]],
        [[ 0.,  2.,  0.],
         [ 0.,  3.,  0.]],
        [[ 0.,  0.,  2.],
         [ 0.,  0.,  3.]]],
       [[[ 1.,  0.,  0.],
         [ 1.,  0.,  0.]],
        [[ 0.,  1.,  0.],
         [ 0.,  1.,  0.]],
        [[ 0.,  0.,  1.],
         [ 0.,  0.,  1.]]],
       [[[ 8.,  0.,  0.],
         [10.,  0.,  0.]],
        [[ 0.,  8.,  0.],
         [ 0., 10.,  0.]],
        [[ 0.,  0.,  8.],
         [ 0.,  0., 10.]]],
       [[[ 0.,  0.,  0.],
         [ 1.,  0.,  0.]],
        [[ 0.,  0.,  0.],
         [ 0.,  1.,  0.]],
        [[ 0.,  0.,  0.],
         [ 0.,  0.,  1.]]]])
This answer inspired the following solution:
# shape a: (4, 3, 2, 3)
# shape x: (4, 2)
a = np.zeros(d.shape + w.shape)
a[:, np.arange(a.shape[1]), :, np.arange(a.shape[3])] = x
It uses Numpy's broadcasting (see here or here) in combination with advanced indexing to enlarge x to fit the slicing.
I happen to have an even simpler solution:
a = np.tensordot(x, np.identity(3), axes=0).swapaxes(1, 2)
The size of the identity matrix is determined by the number of times you wish to repeat the elements of x.
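The same array can also be seen as an outer product of x with the identity matrix, which gives another equivalent formulation. A small sketch (n_rep is an illustrative name for d.shape[1]):
import numpy as np

x = np.array([[2, 3], [1, 1], [8, 10], [0, 1]])
n_rep = 3  # number of diagonal copies, i.e. d.shape[1]

# a[n, j, k, l] = x[n, k] when j == l, else 0
a = np.einsum('nk,jl->njkl', x, np.identity(n_rep))

# The equivalent pure-broadcasting form:
b = x[:, None, :, None] * np.identity(n_rep)[None, :, None, :]
assert np.array_equal(a, b)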

Map numpy categorical data to a numpy vector

I have a numpy array that looks like:
my_arr = array([[0., 0., 0., 0., 1., 0.],
                [0., 1., 0., 0., 0., 0.],
                [0., 0., 0., 1., 0., 0.],
                [0., 0., 0., 0., 1., 0.],
                [1., 0., 0., 0., 0., 0.],
                [0., 0., 0., 0., 1., 0.],
                [0., 1., 0., 0., 0., 0.],
                [0., 0., 0., 0., 1., 0.],
                ...
                ...])
I want to return a vector that contains, for each row of my_arr, the index of the entry with value one. How can I do so?
You can use np.argmax() for that.
inds = np.argmax(my_arr, axis=1)
# array([4, 1, 3, 4, 0, 4, 1, 4])
np.where(my_arr)[1]
See the docs: https://docs.scipy.org/doc/numpy/reference/generated/numpy.where.html
You can use np.argwhere to return an array of coordinates:
arr = np.random.randint(0, 2, (5, 5))
print(arr)
[[0 0 1 1 1]
 [0 1 0 1 1]
 [1 1 0 0 1]
 [1 1 1 0 0]
 [1 1 1 1 0]]
res = np.argwhere(arr)
print(res)
array([[0, 2], [0, 3], ..., [4, 2], [4, 3]], dtype=int64)
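One caveat worth keeping in mind (an observation added here, not from the answers above): np.argmax returns 0 for an all-zero row, while np.where and np.argwhere simply skip such rows, so the approaches only agree when every row really is one-hot. A quick sketch:
import numpy as np

my_arr = np.array([[0., 0., 0., 0., 1., 0.],
                   [0., 1., 0., 0., 0., 0.],
                   [0., 0., 0., 0., 0., 0.]])

print(np.argmax(my_arr, axis=1))  # [4 1 0] -- the last 0 is spurious
print(np.where(my_arr == 1)[1])   # [4 1] -- the all-zero row is dropped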

Numpy Cyclic Broadcast of Fancy Indexing

A is a numpy array with shape (6, 8).
I want:
x_id = np.array([0, 3])
y_id = np.array([1, 3, 4, 7])
A[x_id, y_id] += 1  # this doesn't actually work
Tricks like ::2 won't work because the indices do not increase regularly.
I don't want to use extra memory to repeat [0, 3] into a new array [0, 3, 0, 3], because that is slow.
The indices for the two dimensions do not have equal length.
What I want is equivalent to:
A[0, 1] += 1
A[3, 3] += 1
A[0, 4] += 1
A[3, 7] += 1
Can numpy do something like this?
Update:
Not sure if broadcast_to or stride_tricks is faster than nested python loops. (Repeat NumPy array without replicating data?)
You can reshape y_id into a 2-D array whose second dimension matches x_id; the two index arrays are then broadcast together automatically because of the dimension difference:
x_id = np.array([0, 3])
y_id = np.array([1, 3, 4, 7])

A = np.zeros((6, 8))
A[x_id, y_id.reshape(-1, x_id.size)] += 1
A
array([[ 0.,  1.,  0.,  0.,  1.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  1.,  0.,  0.,  0.,  1.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.]])
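A side note added here: fancy-index += applies the increment only once per unique location, so if the broadcast index pairs ever contain duplicates, np.add.at is the accumulating alternative. A sketch under the same setup:
import numpy as np

x_id = np.array([0, 3])
y_id = np.array([1, 3, 4, 7])

A = np.zeros((6, 8))
# ufunc.at does an unbuffered in-place add: duplicate (row, col)
# pairs each contribute, which plain fancy-index += would not.
np.add.at(A, (x_id, y_id.reshape(-1, x_id.size)), 1)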

allocate memory in python for large scipy.sparse matrix operations

Is there a way I can allocate memory for scipy sparse matrix functions to process large datasets?
Specifically, I'm attempting to use Asymmetric Least Squares Smoothing (translated into python here and the original here) to perform a baseline correction on a large mass spec dataset (length of ~60,000).
The function (see below) uses the scipy.sparse matrix operations.
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import spsolve

def baseline_als(y, lam, p, niter):
    L = len(y)
    D = sparse.csc_matrix(np.diff(np.eye(L), 2))
    w = np.ones(L)
    for i in range(niter):
        W = sparse.spdiags(w, 0, L, L)
        Z = W + lam * D.dot(D.transpose())
        z = spsolve(Z, w*y)
        w = p * (y > z) + (1-p) * (y < z)
    return z
I have no problem when I pass data sets that are 10,000 or less in length:
baseline_als(np.ones(10000), 100, 0.1, 10)
But when passing larger data sets, e.g.
baseline_als(np.ones(50000), 100, 0.1, 10)
I get a MemoryError for the line
D = sparse.csc_matrix(np.diff(np.eye(L), 2))
Try changing
D = sparse.csc_matrix(np.diff(np.eye(L), 2))
to
diag = np.ones(L - 2)
D = sparse.spdiags([diag, -2*diag, diag], [0, -1, -2], L, L-2)
D will be a sparse matrix in DIAgonal format. If it turns out that being in CSC format is important, convert it using the tocsc() method:
D = sparse.spdiags([diag, -2*diag, diag], [0, -1, -2], L, L-2).tocsc()
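On newer SciPy versions, sparse.diags should give an equivalent one-liner; a sketch (relying on the documented broadcasting of scalar diagonal values when an explicit shape is passed):
from scipy import sparse

L = 8
D = sparse.diags([1.0, -2.0, 1.0], [0, -1, -2], shape=(L, L - 2), format='csc')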
The following example shows that the old and new versions generate the same matrix:
In [67]: from scipy import sparse
In [68]: L = 8
Original:
In [69]: D = sparse.csc_matrix(np.diff(np.eye(L), 2))
In [70]: D.A
Out[70]:
array([[ 1.,  0.,  0.,  0.,  0.,  0.],
       [-2.,  1.,  0.,  0.,  0.,  0.],
       [ 1., -2.,  1.,  0.,  0.,  0.],
       [ 0.,  1., -2.,  1.,  0.,  0.],
       [ 0.,  0.,  1., -2.,  1.,  0.],
       [ 0.,  0.,  0.,  1., -2.,  1.],
       [ 0.,  0.,  0.,  0.,  1., -2.],
       [ 0.,  0.,  0.,  0.,  0.,  1.]])
New version:
In [71]: diag = np.ones(L - 2)
In [72]: D = sparse.spdiags([diag, -2*diag, diag], [0, -1, -2], L, L-2)
In [73]: D.A
Out[73]:
array([[ 1.,  0.,  0.,  0.,  0.,  0.],
       [-2.,  1.,  0.,  0.,  0.,  0.],
       [ 1., -2.,  1.,  0.,  0.,  0.],
       [ 0.,  1., -2.,  1.,  0.,  0.],
       [ 0.,  0.,  1., -2.,  1.,  0.],
       [ 0.,  0.,  0.,  1., -2.,  1.],
       [ 0.,  0.,  0.,  0.,  1., -2.],
       [ 0.,  0.,  0.,  0.,  0.,  1.]])
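Putting the pieces together, the original function with the sparse construction substituted might look like this (a sketch; baseline_als_sparse is a name introduced here):
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import spsolve

def baseline_als_sparse(y, lam, p, niter):
    L = len(y)
    diag = np.ones(L - 2)
    # Build the second-difference matrix directly in sparse form,
    # avoiding the dense L x L intermediate created by np.eye(L).
    D = sparse.spdiags([diag, -2*diag, diag], [0, -1, -2], L, L - 2)
    w = np.ones(L)
    for _ in range(niter):
        W = sparse.spdiags(w, 0, L, L)
        Z = W + lam * D.dot(D.transpose())
        z = spsolve(Z, w * y)
        w = p * (y > z) + (1 - p) * (y < z)
    return z

baseline_als_sparse(np.ones(50000), 100, 0.1, 10)  # no dense eye(50000) allocated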

Looping through a 2D numpy array (e.g. to create a line)

I often find myself having to create a line (or some other shape) within a 2D array; that is, the array is zero everywhere except where y = mx + c. (Aside: the motivation for this approach, rather than storing a line in a 1D array, is that my work often requires 2D Fourier transforms, so I need the zeros everywhere apart from the line/shape/etc.)
My usual approach for doing this is the following:
import numpy
array = numpy.zeros((height, width))
for i, line in enumerate(array):
    for j, pixel in enumerate(line):
        if j == m*i + c:
            array[i, j] = 1
This works fine, but it doesn't strike me as particularly pythonic, and it tends to get pretty slow when the array gets big. So, my question is a rather general one - does anybody know of a better way of doing this?
Thanks in advance!
You could use broadcasting here to get rid of those nested loops -
import numpy as np
out = (np.arange(width) == m*np.arange(height)[:, None] + c) + 0.0
As an example to verify for correctness, with these parameters -
height = 10
width = 10
m = 0.5
c = 6
you would have -
In [306]: array
Out[306]:
array([[ 0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.]])
In [307]: out
Out[307]:
array([[ 0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.]])
The function np.fromfunction was designed for cases where an array can be constructed from the indices, such as this scenario.
In your case,
np.fromfunction(lambda i, j: j == m*i + c, (height, width), dtype=float)
would be equivalent to your approach, but using numpy's routines rather than Python for-loops.
Short demo:
import numpy as np
height, width = 10, 10
m, c = 2, 4

a = np.zeros((height, width))
for i, line in enumerate(a):
    for j, pixel in enumerate(line):
        if j == m*i + c:
            a[i, j] = 1

b = np.fromfunction(lambda i, j: j == m*i + c, (height, width), dtype=float)

np.all(a == b)
# True

b.astype(int)  # astype added to compact the output (no need for all the periods)
# array([[0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
#        [0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
#        [0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
#        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
#        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
#        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
#        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
#        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
#        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
#        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
Edit: Even though this answer got accepted, I want to point out that @Divakar's answer is about 10 times faster on my machine. If you're looking for speed, use that answer when your problem lends itself to vectorization as easily as Divakar showed (not every fromfunction call can be vectorized so easily). I upvoted it, because it's a nice approach to this problem.
Use np.put, but you need to create the list of flat indices, which you can do with a list comprehension. Note that np.put indexes into the flattened array, so the flat index for position (i, j) is i*arr.shape[1] + j:
>>> np.put(arr, [i*arr.shape[1] + j for i in range(arr.shape[0]) for j in range(arr.shape[1]) if j == m*i + c], 1)
Demo:
>>> arr = np.zeros((5, 3))
>>> np.put(arr, [i*arr.shape[1] + j for i in range(arr.shape[0]) for j in range(arr.shape[1]) if j == 3*i + 1], 1)
>>> arr
array([[ 0.,  1.,  0.],
       [ 0.,  0.,  0.],
       [ 0.,  0.,  0.],
       [ 0.,  0.,  0.],
       [ 0.,  0.,  0.]])
>>> np.put(arr, [i*arr.shape[1] + j for i in range(arr.shape[0]) for j in range(arr.shape[1]) if j == 0.5*i + 2], 1)
>>> arr
array([[ 0.,  1.,  1.],
       [ 0.,  0.,  0.],
       [ 0.,  0.,  0.],
       [ 0.,  0.,  0.],
       [ 0.,  0.,  0.]])
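For completeness, here is one more vectorized route, added as a sketch (not from the answers above): compute the column for each row directly and assign only where it lands on an integer inside the grid.
import numpy as np

height, width, m, c = 10, 10, 0.5, 6

rows = np.arange(height)
cols = m * rows + c
# Keep rows whose computed column is an exact integer inside the grid.
hit = (cols == np.floor(cols)) & (cols >= 0) & (cols < width)

out = np.zeros((height, width))
out[rows[hit], cols[hit].astype(int)] = 1
This produces the same array as the m = 0.5, c = 6 example above.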
