I have a 2D numpy array of values, a list of x-coordinates, and a list of y-coordinates. The x-coordinates increase left-to-right and the y-coordinates increase top-to-bottom.
For example:
a = np.random.random((3, 3))
a[0][1] = 9.0
a[0][2] = 9.0
a[1][1] = 9.0
a[1][2] = 9.0
xs = list(range(1112, 1115))
ys = list(range(1109, 1112))
Output:
[[0.48148651 9.         9.        ]
 [0.09030393 9.         9.        ]
 [0.79271224 0.83413552 0.29724989]]
[1112, 1113, 1114]
[1109, 1110, 1111]
I want to remove the values from the 2D array that are greater than 1. I also want to combine the lists xs and ys to get a list of all the coordinate pairs for points that are kept.
In this example I want to remove a[0][1], a[0][2], a[1][1], a[1][2] and I want the list of coordinate pairs to be
[[1112, 1109], [1112, 1110], [1112, 1111], [1113, 1111], [1114, 1111]]
I have been able to accomplish this using a double for loop and if statements:
a_values = []
point_pairs = []
for i in range(0, a.shape[0]):
    for j in range(0, a.shape[1]):
        if a[i][j] < 1:
            a_values.append(a[i][j])
            point_pairs.append([xs[j], ys[i]])
print(a_values)
print(point_pairs)
Output:
[0.48148650831317796, 0.09030392566133771, 0.7927122386213029, 0.8341355206494774, 0.2972498933037804]
[[1112, 1109], [1112, 1110], [1112, 1111], [1113, 1111], [1114, 1111]]
What is a more efficient way of doing this?
You can use np.nonzero to get the indices of the elements you want to keep:
mask = a < 1
i, j = np.nonzero(mask)
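For the example above, this gives i = [0, 1, 2, 2, 2] and j = [0, 0, 0, 1, 2]: the row and column index of each kept element.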
The fancy indices i and j can be used to get the elements of xs and ys directly if they are numpy arrays:
xs = np.array(xs)
ys = np.array(ys)
point_pairs = np.stack((xs[j], ys[i]), axis=-1)
You can also use np.take to make the conversion happen under the hood:
point_pairs = np.stack((np.take(xs, j), np.take(ys, i)), axis=-1)
The elements of a that remain are exactly those selected by the mask:
a_points = a[mask]
Alternatively:
i, j = np.nonzero(a < 1)
point_pairs = np.stack((np.take(xs, j), np.take(ys, i)), axis=-1)
a_points = a[i, j]
In this context, you can use np.where as a drop-in alias for np.nonzero: called with a single condition and no other arguments, np.where(a < 1) returns the same pair of index arrays.
Notes
If you are using numpy, there is rarely a need for lists. Converting with xs = np.array(xs), or better yet initializing directly with xs = np.arange(1112, 1115), is faster and easier.
Numpy arrays should generally be indexed with a single tuple index: a[0, 1], not a[0][1]. For your simple case the behavior happens to be the same, but it will not be in the general case. a[0, 1] indexes into the original array. a[0], by contrast, is a view of the first row, i.e., a separate array object, and a[0][1] indexes into that new object. You just happened to get lucky here: the view shares the base memory, so the assignment is visible in a itself. That would not be the case if you tried a mask or fancy index, for example.
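A minimal sketch of the difference, using throwaway values:

import numpy as np

a = np.zeros((3, 3))
a[0, 1] = 9.0     # tuple index: writes into a directly
a[0][1] = 9.0     # also visible in a, but only because a[0] happens to be a view

mask = a > 1
a[mask][0] = 5.0  # a[mask] is a copy, so this write is silently lost
a[mask] = 0.0     # assigning through the mask itself does modify a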
On a related note, setting a rectangular swath in an array only requires one line: a[:-1, 1:] = 9.
I would write your example something like this:
a = np.random.random((3, 3))
a[:-1, 1:] = 9.0
xs = np.arange(1112, 1115)
ys = np.arange(1109, 1112)
i, j = np.nonzero(a < 1)
point_pairs = np.stack((xs[j], ys[i]), axis=-1)
a_points = a[i, j]
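This reproduces the pairs from your example:

print(point_pairs)
# [[1112 1109]
#  [1112 1110]
#  [1112 1111]
#  [1113 1111]
#  [1114 1111]]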
Related
I have an n row, m column numpy array, and would like to create a new k x m array by selecting k random elements from each column of the array. I wrote the following python function to do this, but would like to implement something more efficient and faster:
import random
import numpy as np

def sample_array_cols(MyMatrix, nelements):
    vmat = []
    TempMat = MyMatrix.T
    for v in TempMat:
        v = np.ndarray.tolist(v)
        subv = random.sample(v, nelements)
        vmat = vmat + [subv]
    return np.array(vmat).T
One question is whether there's a way to loop over each column without transposing the array (and then transposing back). More importantly, is there some way to map the random sample onto each column that would be faster than having a for loop over all columns? I don't have that much experience with numpy objects, but I would guess that there should be something analogous to apply/mapply in R that would work?
One alternative is to randomly generate the indices first and then use np.take_along_axis to map them onto the original array (note that randint samples with replacement, unlike random.sample):
arr = np.random.randn(1000, 5000) # arbitrary
k = 10 # arbitrary
n, m = arr.shape
idx = np.random.randint(0, n, (k, m))
new = np.take_along_axis(arr, idx, axis=0)
Output (shape):
In [215]: new.shape
Out[215]: (10, 5000)  # (k, m)
To sample each column without replacement, just like your original solution, you can argsort random keys: the argsort of a column of i.i.d. random values is an independent random permutation of that column's row indices, so keeping the first k rows samples each column without replacement:
import numpy as np
matrix = np.arange(4*3).reshape(4,3)
matrix
Output
array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11]])
k = 2
np.take_along_axis(matrix, np.random.rand(*matrix.shape).argsort(axis=0)[:k], axis=0)
Output
array([[ 9,  1,  2],
       [ 3,  4, 11]])
I would:
- pre-allocate the result array and fill it in column by column, and
- use NumPy integer-array indexing.
import numpy

def sample_array_cols(matrix, n_result):
    n, m = matrix.shape
    # Pre-allocate the output, then fill one column at a time
    vmat = numpy.empty((n_result, m), dtype=matrix.dtype)
    for c in range(m):
        random_indices = numpy.random.randint(0, n, n_result)
        vmat[:, c] = matrix[random_indices, c]
    return vmat
Not quite fully vectorized, but better than building up a list, and the code reads just like your description. Note that randint samples with replacement, unlike your random.sample version.
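For reference, the remaining loop can also be removed by pairing a full block of random row indices with a broadcast column index. A sketch of the same with-replacement sampling (the helper name is mine):

import numpy as np

def sample_array_cols_vectorized(matrix, n_result):
    n, m = matrix.shape
    # One random row index per output cell; the arange selects column c
    # within column c, so each output column draws from its own input column.
    row_idx = np.random.randint(0, n, (n_result, m))
    return matrix[row_idx, np.arange(m)]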
I have two lists of [x, y, z] data. I want to concatenate these lists, sort them, and then extract a matrix of the z values, with x increasing along the columns and y increasing along the rows.
To give an example:
import numpy as np

list1 = np.linspace(-2, 2, 3)
list2 = np.linspace(-1, 1, 3)
dat1 = []
for x in list1:
    for y in list1:
        z = x * y
        dat1 += [[x, y, z]]
dat1 = np.array(dat1)
dat2 = []
for x in list2:
    for y in list2:
        z = x * y
        dat2 += [[x, y, z]]
dat2 = np.array(dat2)
I can build an array from the z values for each of these lists individually using:
dat1[:, 2].reshape((list1.shape[0], list1.shape[0]))
but I want an (ordered) array of all values from both lists, i.e. I want to do the same thing with the full sorted data set:
dat_full=np.vstack((dat1, dat2))
dat_index = np.lexsort((dat_full[:,1], dat_full[:,0]))
dat_sorted = dat_full[dat_index]
The problem is that this is no longer a square array, so I can't use the simple reshape trick I used previously. Is there a good way to do this?
Edit:
I should clarify that I am only interested in the unique rows of the concatenated array, which can be found using:
dat_full = np.unique(np.vstack((dat1, dat2)), axis=0)
dat_index = np.lexsort((dat_full[:,1], dat_full[:,0]))
dat_sorted = dat_full[dat_index]
Like markuscosinus said, the problem is that you would need a "matrix" with varying row and column sizes, which cannot be done in NumPy. One alternative you may consider, however, is a masked array, if you can work with that. It lets you keep all the values in one array while masking the "gaps" as invalid. For example, you could do it like this (I have changed how you create dat1 and dat2, but the result is the same):
import numpy as np
list1 = np.linspace(-2, 2, 3)
list2 = np.linspace(-1, 1, 3)
# Evaluate using grids instead of loops
xg1, yg1 = np.meshgrid(list1, list1, indexing='ij')
x1, y1 = xg1.ravel(), yg1.ravel()
xg2, yg2 = np.meshgrid(list2, list2, indexing='ij')
x2, y2 = xg2.ravel(), yg2.ravel()
dat1 = np.stack([x1, y1, x1 * y1], axis=-1)
dat2 = np.stack([x2, y2, x2 * y2], axis=-1)
# Full dataset
dat_full = np.concatenate([dat1, dat2])
# Remove repeated rows
_, idx = np.unique(dat_full, return_index=True, axis=0)
dat_uniq = dat_full[idx]
# Find the grid position of each row's X and Y value
x_uniq, x_idx = np.unique(dat_uniq[:, 0], return_inverse=True)
y_uniq, y_idx = np.unique(dat_uniq[:, 1], return_inverse=True)
# Make the array as big as the number of unique X and Y values
result = np.zeros((x_uniq.size, y_uniq.size), dtype=dat_full.dtype)
# Make mask for array
mask = np.ones_like(result, dtype=bool)
# Fill array and mask
result[x_idx, y_idx] = dat_uniq[:, 2]
mask[x_idx, y_idx] = False
# Make masked array
result = np.ma.masked_array(result, mask)
print(result)
Output:
[[4.0 -- -0.0 -- -4.0]
 [-- 1.0 -0.0 -1.0 --]
 [-0.0 -0.0 0.0 0.0 0.0]
 [-- -1.0 0.0 1.0 --]
 [-4.0 -- 0.0 -- 4.0]]
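If you can work with the masked result downstream, the gaps propagate automatically through NumPy reductions; a small usage sketch:

print(result.mean())          # averages over the valid entries only
print(result.filled(np.nan))  # or turn the gaps into NaN, e.g. for plotting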
My approach would be:
result = []
_, occurrences = np.unique(dat_sorted[:, 0], return_inverse=True)
for i in range(np.max(occurrences) + 1):
    result.append(dat_sorted[occurrences == i, 2])
This will give you an x-value-ordered list of y-value-ordered arrays of z values. It is not a matrix because some x values occur more often than others, resulting in arrays of different sizes.
I have an array with 5 columns, consisting of 4 values and one index. I sort and split the array along the index, which leaves me with splits of different lengths. From there I want to calculate, for every split, the mean and variance of the fourth value and the covariance of the first 3 values. My current approach works with a for loop, which I would like to replace by matrix operations, but I am struggling with the different sizes of my matrices.
import numpy as np
A = np.random.rand(10,5)
A[:,-1] = np.random.randint(4, size=10)
sorted_A = A[np.argsort(A[:,4])]
splits = np.split(sorted_A, np.where(np.diff(sorted_A[:,4]))[0]+1)
My current for loop looks like this:
result = np.zeros((len(splits), 5))
for idx, values in enumerate(splits):
    if len(values) > 0:
        result[idx, 0] = np.mean(values[:, 3])
        result[idx, 1] = np.var(values[:, 3])
        result[idx, 2:5] = np.cov(values[:, 0:3].transpose(), ddof=0).diagonal()
    else:
        result[idx, 0] = values[:, 3]
I tried to work with masked arrays without success, since I couldn't load the matrices into the masked arrays in a proper form. Maybe someone knows how to do this or has a different suggestion.
You can use np.add.reduceat as follows:
>>> idx = np.concatenate([[0], np.where(np.diff(sorted_A[:,4]))[0]+1, [A.shape[0]]])
>>> result2 = np.empty((idx.size-1, 5))
>>> result2[:, 0] = np.add.reduceat(sorted_A[:, 3], idx[:-1]) / np.diff(idx)
>>> result2[:, 1] = np.add.reduceat(sorted_A[:, 3]**2, idx[:-1]) / np.diff(idx) - result2[:, 0]**2
>>> result2[:, 2:5] = np.add.reduceat(sorted_A[:, :3]**2, idx[:-1], axis=0) / np.diff(idx)[:, None]
>>> result2[:, 2:5] -= (np.add.reduceat(sorted_A[:, :3], idx[:-1], axis=0) / np.diff(idx)[:, None])**2
>>>
>>> np.allclose(result, result2)
True
Note that the diagonal of the covariance matrix is just the variances, which simplifies this vectorization quite a bit: via Var(x) = E[x**2] - (E[x])**2, every statistic reduces to group sums of x and x**2.
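The workhorse here is np.add.reduceat, which sums the slices between consecutive offsets; a toy illustration with made-up values:

import numpy as np

v = np.array([1., 2., 3., 4., 5.])
idx = np.array([0, 2])                    # groups are v[0:2] and v[2:5]
sums = np.add.reduceat(v, idx)            # array([ 3., 12.])
counts = np.diff(np.append(idx, v.size))  # array([2, 3])
means = sums / counts                     # array([1.5, 4. ])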
I have a set of data in Python like:
x y angle
I want to calculate the distance between every possible pair of points and plot the distances against the difference between the two angles.
import numpy as np
import matplotlib.pyplot as plt

x, y, a = np.loadtxt('w51e2-pa-2pk.log', unpack=True)
n = 0
f = ((x[n] - x[n+1:])**2 + (y[n] - y[n+1:])**2)**0.5
d = a[n] - a[n+1:]
plt.scatter(f, d)
There are 255 points in my data.
f is the distance and d is the difference between two angles.
My question is: can I set n = [1, 2, 3, ..., 255] and do the calculation again to get the f and d of all possible pairs?
You can obtain the pairwise distances through broadcasting, treating the subtraction as an outer operation on the array of 2-dimensional vectors, as follows:
vecs = np.stack((x, y)).T
np.linalg.norm(vecs[np.newaxis, :] - vecs[:, np.newaxis], axis=2)
For example,
In [1]: import numpy as np
...: x = np.array([1, 2, 3])
...: y = np.array([3, 4, 6])
...: vecs = np.stack((x, y)).T
...: np.linalg.norm(vecs[np.newaxis, :] - vecs[:, np.newaxis], axis=2)
...:
Out[1]:
array([[ 0.        ,  1.41421356,  3.60555128],
       [ 1.41421356,  0.        ,  2.23606798],
       [ 3.60555128,  2.23606798,  0.        ]])
Here, the (i, j)'th entry is the distance between the i'th and j'th vectors.
The case of the pairwise differences between angles is similar, but simpler, as you only have one dimension to deal with:
In [2]: a = np.array([10, 12, 15])
...: a[np.newaxis, :] - a[:, np.newaxis]
...:
Out[2]:
array([[ 0,  2,  5],
       [-2,  0,  3],
       [-5, -3,  0]])
Moreover, plt.scatter does not care that the results are given as matrices, so, putting everything together in the notation of the question, you can obtain the plot of angle differences against distances with something like
vecs = np.stack((x, y)).T
f = np.linalg.norm(vecs[np.newaxis, :] - vecs[:, np.newaxis], axis=2)
d = a[np.newaxis, :] - a[:, np.newaxis]
plt.scatter(f, d)
You have to use a for loop and range() to iterate over n, e.g. like this:
n = len(x)
for i in range(n):
    # do something with the current index
    # e.g. print the points
    print(x[i])
    print(y[i])
But note that if you use i+1 inside the last iteration, it will already be outside of your list.
Also be careful with the calculation (x[n]) - x[n+1:]: it works for NumPy arrays, where the scalar x[n] is broadcast against the slice, but it would fail for plain Python lists, since you cannot subtract a list from a number.
Maybe you will even have to use two nested loops to do what you want. I guess you want to calculate the distance between each pair of points, so a two-dimensional array may be the data structure you want.
If you are interested in all combinations of the points in x and y, I suggest using itertools, which will give you all possible combinations. Then you can do it as follows:
import itertools
f = [((x[i]-x[j])**2 + (y[i]-y[j])**2)**0.5 for i, j in itertools.product(range(len(x)), repeat=2) if i != j]
# and similar for the angles
But maybe there is even an easier way...
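One fully vectorized possibility (a sketch, assuming x, y, and a are the 1-D NumPy arrays loaded in the question): np.triu_indices enumerates every unordered pair exactly once, avoiding both the nested loop and the duplicate/self pairs.

import numpy as np
import matplotlib.pyplot as plt

i, j = np.triu_indices(len(x), k=1)     # all index pairs with i < j
f = np.hypot(x[i] - x[j], y[i] - y[j])  # pairwise distances
d = a[i] - a[j]                         # pairwise angle differences
plt.scatter(f, d)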
TL;DR:
What is the theano.scan equivalent of:
M = np.arange(9).reshape(3, 3)
for i in range(M.shape[0]):
    for j in range(M.shape[1]):
        M[i, j] += 5
M
possibly (if doable) without using nested scans?
Note that this question is not specifically about how to apply an operation elementwise to a matrix, but more generally about how to implement a nested looping construct like the one above with theano.scan.
Long version:
theano.scan (or equivalently in this case, theano.map) lets you map a function over multiple indices by simply providing the index sequences to the sequences argument, with something like
import numpy as np
import theano
import theano.tensor as T

M = T.dmatrix('M')

def map_func(i, j, matrix):
    return matrix[i, j] + 5

results, updates = theano.scan(map_func,
                               sequences=[T.arange(M.shape[0]), T.arange(M.shape[1])],
                               non_sequences=[M])
f = theano.function(inputs=[M], outputs=results)
f(np.arange(9).reshape(3, 3))
# array([  5.,   9.,  13.])
which is roughly equivalent to a python loop of the form:
M = np.arange(9).reshape(3, 3)
for i, j in zip(np.arange(M.shape[0]), np.arange(M.shape[1])):
    M[i, j] += 5
M
which increases all the elements on the diagonal of M by 5.
But what if I want to find the theano.scan equivalent of:
M = np.arange(9).reshape(3, 3)
for i in range(M.shape[0]):
    for j in range(M.shape[1]):
        M[i, j] += 5
M
possibly without nesting scans?
One way is of course to flatten the matrix, scan through the flattened elements, and then reshape it to the original shape, with something like
import theano
import theano.tensor as T
M = T.dmatrix('M')

def map_func(i, X):
    return X[i] + 5

M_flat = T.flatten(M)
results, updates = theano.map(map_func,
                              sequences=T.arange(M.shape[0] * M.shape[1]),
                              non_sequences=M_flat)
final_M = T.reshape(results, M.shape)
f = theano.function([M], final_M)
f([[1, 2], [3, 4]])
but is there a better way that doesn't involve explicitly flattening and reshaping the matrix?
Here is an example of how this kind of thing can be achieved using nested theano.scan calls.
In this example we add the number 3.141 to every element of a matrix, effectively simulating in a convoluted way the output of H + 3.141:
H = T.dmatrix('H')

def fn2(col, row, matrix):
    return matrix[row, col] + 3.141

def fn(row, matrix):
    res, updates = theano.scan(fn=fn2,
                               sequences=T.arange(matrix.shape[1]),
                               non_sequences=[row, matrix])
    return res

results, updates = theano.scan(fn=fn,
                               sequences=T.arange(H.shape[0]),
                               non_sequences=[H])
f = theano.function([H], results)
f([[0, 1], [2, 3]])
# array([[ 3.141,  4.141],
#        [ 5.141,  6.141]])
As another example, let us add to each element of a matrix the product of its row and column indices:
H = T.dmatrix('H')

def fn2(col, row, matrix):
    return matrix[row, col] + row * col

def fn(row, matrix):
    res, updates = theano.scan(fn=fn2,
                               sequences=T.arange(matrix.shape[1]),
                               non_sequences=[row, matrix])
    return res

results, updates = theano.scan(fn=fn,
                               sequences=T.arange(H.shape[0]),
                               non_sequences=[H])
f = theano.function([H], results)
f(np.arange(9).reshape(3, 3))
# array([[  0.,   1.,   2.],
#        [  3.,   5.,   7.],
#        [  6.,   9.,  12.]])
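For comparison, here is a plain-NumPy check of what the nested scan computes, using index grids instead of loops (a sketch):

import numpy as np

H = np.arange(9).reshape(3, 3).astype(float)
i, j = np.indices(H.shape)  # row index and column index of every element
print(H + i * j)
# [[ 0.  1.  2.]
#  [ 3.  5.  7.]
#  [ 6.  9. 12.]]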