I have an n-row, m-column NumPy array and would like to create a new k x m array by selecting k random elements from each column. I wrote the following Python function to do this, but would like something faster:
import random
import numpy as np

def sample_array_cols(MyMatrix, nelements):
    vmat = []
    TempMat = MyMatrix.T
    for v in TempMat:
        v = np.ndarray.tolist(v)
        subv = random.sample(v, nelements)  # sampling without replacement
        vmat = vmat + [subv]
    return(np.array(vmat).T)
One question is whether there's a way to loop over each column without transposing the array (and then transposing back). More importantly, is there some way to map the random sample onto each column that would be faster than having a for loop over all columns? I don't have that much experience with numpy objects, but I would guess that there should be something analogous to apply/mapply in R that would work?
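One alternative is to randomly generate the indices first, and then use np.take_along_axis to map them to the original array. Note that np.random.randint draws with replacement, so the same row can be picked more than once within a column: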
arr = np.random.randn(1000, 5000) # arbitrary
k = 10 # arbitrary
n, m = arr.shape
idx = np.random.randint(0, n, (k, m))
new = np.take_along_axis(arr, idx, axis=0)
Output (shape):
In [215]: new.shape
Out[215]: (10, 5000) # (k x m)
To sample each column without replacement, just like your original solution, argsort a matrix of random numbers and keep the first k rows of indices:
import numpy as np
matrix = np.arange(4*3).reshape(4,3)
matrix
Output
array([[ 0, 1, 2],
[ 3, 4, 5],
[ 6, 7, 8],
[ 9, 10, 11]])
k = 2
np.take_along_axis(matrix, np.random.rand(*matrix.shape).argsort(axis=0)[:k], axis=0)
Output
array([[ 9, 1, 2],
[ 3, 4, 11]])
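The same without-replacement sampling can also be sketched with NumPy's newer Generator API, assuming NumPy 1.20 or later, where Generator.permuted shuffles each column independently. The function name and the seed below are only illustrative.

```python
import numpy as np

def sample_cols_without_replacement(arr, k, rng):
    # Shuffle every column independently (a copy is returned),
    # then keep the first k rows of the shuffled result.
    shuffled = rng.permuted(arr, axis=0)
    return shuffled[:k]

rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable
matrix = np.arange(4 * 3).reshape(4, 3)
sample = sample_cols_without_replacement(matrix, 2, rng)
print(sample.shape)  # (2, 3)
```

Each column of the result contains k distinct entries taken from the matching column of the input.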
I would pre-allocate the result array and fill in its columns one at a time, using NumPy integer-array indexing:
import numpy

def sample_array_cols(matrix, n_result):
    (n, m) = matrix.shape
    # Pre-allocate the (n_result, m) result array
    vmat = numpy.empty((n_result, m), dtype=matrix.dtype)
    for c in range(m):
        # Row indices for this column, drawn with replacement
        random_indices = numpy.random.randint(0, n, n_result)
        vmat[:, c] = matrix[random_indices, c]
    return vmat
Not quite fully vectorized, but better than building up a list, and the code scans just like your description.
I have a 3D NumPy array of shape (3, 3, 3) and would like to obtain the indices of the maximum value in each plane, where by "plane" I mean a slice a[:, :, i]:
a = np.random.rand(3,3,3)
>>> a[:,:,0]
array([[0.98423332, 0.44410844, 0.06945133],
[0.69876575, 0.87411547, 0.53595041],
[0.53418486, 0.16186808, 0.60579623]])
>>> a[:,:,1]
array([[0.38969199, 0.80202126, 0.62189662],
[0.66609605, 0.09771614, 0.74061269],
[0.77081531, 0.20068743, 0.72762023]])
>>> a[:,:,2]
array([[0.57110332, 0.29021439, 0.15433043],
[0.21762439, 0.93112448, 0.05763075],
[0.77880124, 0.36637245, 0.29070822]])
I have a solution, but I would like something shorter and quicker, without for loops. My solution is as follows:
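for i in range(3):
    # argmax() returns a flat index into the 3x3 plane; split it into row and column
    x = int(a[:, :, i].argmax()) // 3
    y = int(a[:, :, i].argmax()) % 3
    z = i
    print((x, y, z))
    print(a[x][y][z])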
(0, 0, 0)
0.9842333247061394
(0, 1, 1)
0.8020212566990867
(1, 1, 2)
0.9311244845473187
Reshape the input array to 2D by merging the last two axes, then apply argmax along the second (merged) axis. This gives a vectorized approach in which each index along the first axis, i.e. a[i], is treated as one plane:
def argmax_each_plane(a):
    a2D = a.reshape(a.shape[0],-1)
    idx = a2D.argmax(1)
    indices = np.unravel_index(idx, a.shape[1:])
    vals = a2D[np.arange(len(idx)), idx]
    return vals, np.c_[indices]
Sample run -
In [60]: np.random.seed(0)
...: a = np.random.rand(3,3,3)
In [61]: a
Out[61]:
array([[[0.5488135 , 0.71518937, 0.60276338],
[0.54488318, 0.4236548 , 0.64589411],
[0.43758721, 0.891773 , 0.96366276]],
[[0.38344152, 0.79172504, 0.52889492],
[0.56804456, 0.92559664, 0.07103606],
[0.0871293 , 0.0202184 , 0.83261985]],
[[0.77815675, 0.87001215, 0.97861834],
[0.79915856, 0.46147936, 0.78052918],
[0.11827443, 0.63992102, 0.14335329]]])
In [62]: v, ind = argmax_each_plane(a)
In [63]: v
Out[63]: array([0.96366276, 0.92559664, 0.97861834])
In [64]: ind
Out[64]:
array([[2, 2],
[1, 1],
[0, 2]])
If you need the plane index as well, return np.c_[indices[0], indices[1], range(len(a2D))] from the function instead.
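The question defines plane i as a[:, :, i], whereas the function above works on a[i]. A minimal sketch of the same idea for planes along the last axis, assuming np.moveaxis is acceptable here; the helper name is hypothetical, and the data are the three planes printed in the question:

```python
import numpy as np

def argmax_last_axis_planes(a):
    # Hypothetical helper: treats a[:, :, i] as plane i.
    planes = np.moveaxis(a, -1, 0)              # plane i is now planes[i]
    flat = planes.reshape(planes.shape[0], -1)  # one row per plane
    idx = flat.argmax(axis=1)
    x, y = np.unravel_index(idx, planes.shape[1:])
    z = np.arange(planes.shape[0])
    return flat[z, idx], np.c_[x, y, z]

# The three planes shown in the question, stacked along the last axis
a = np.stack([
    [[0.98423332, 0.44410844, 0.06945133],
     [0.69876575, 0.87411547, 0.53595041],
     [0.53418486, 0.16186808, 0.60579623]],
    [[0.38969199, 0.80202126, 0.62189662],
     [0.66609605, 0.09771614, 0.74061269],
     [0.77081531, 0.20068743, 0.72762023]],
    [[0.57110332, 0.29021439, 0.15433043],
     [0.21762439, 0.93112448, 0.05763075],
     [0.77880124, 0.36637245, 0.29070822]],
], axis=-1)

vals, ind = argmax_last_axis_planes(a)
print(ind.tolist())  # [[0, 0, 0], [0, 1, 1], [1, 1, 2]]
```

The rows of ind are (x, y, z) triples, matching the tuples printed by the loop in the question.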
Is it possible to simplify this:
import numpy as np
a = np.random.random_sample((40, 3))
data_base = np.random.random_sample((20, 3))
mean = np.random.random_sample((40,))
data = []
for s in data_base:
data.append(mean + np.dot(a, s))
data should have shape (20, 40). I was wondering if I could use broadcasting instead of the loop. I could not get it to work with np.add and some [:, None] indexing, so I am probably not using them correctly.
Your loop builds a list of 20 arrays, which converts to a (20, 40) array:
In [385]: len(data)
Out[385]: 20
In [386]: data = np.array(data)
In [387]: data.shape
Out[387]: (20, 40)
The straightforward application of dot produces the same thing:
In [388]: M2=mean+np.dot(data_base, a.T)
In [389]: np.allclose(M2,data)
Out[389]: True
The matmul operator (@) also works with these arrays (no need to expand and squeeze):
M3 = data_base @ a.T + mean
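To make the shapes explicit, here is a small self-contained check, assuming the same array sizes as in the question. The random values and the names looped and vectorized are only illustrative. The product has shape (20, 40), and mean, with shape (40,), is broadcast across its rows.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.random((40, 3))
data_base = rng.random((20, 3))
mean = rng.random(40)

# Loop version from the question: one length-40 vector per row of data_base
looped = np.array([mean + np.dot(a, s) for s in data_base])

# (20, 3) @ (3, 40) -> (20, 40); mean (40,) is added to every row
vectorized = data_base @ a.T + mean

print(vectorized.shape)                 # (20, 40)
print(np.allclose(looped, vectorized))  # True
```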
Given two arrays where each row represents a circle (x, y, r):
data = {}
data[1] = np.array([[455.108, 97.0478, 0.0122453333],
[403.775, 170.558, 0.0138770952],
[255.383, 363.815, 0.0179857619]])
data[2] = np.array([[455.103, 97.0473, 0.012041],
[210.19, 326.958, 0.0156912857],
[455.106, 97.049, 0.0150472381]])
I would like to pull out all pairs of circles that overlap (i.e. are not disjoint). This can be done by:
close_data = {}
for row1 in data[1]:  # loop over first array
    for row2 in data[2]:  # loop over second array
        condition = ((abs(row1[0]-row2[0]) + abs(row1[1]-row2[1])) < (row1[2]+row2[2]))
        if condition:  # circles overlap if true
            if tuple(row1) not in close_data:
                close_data[tuple(row1)] = [row1, row2]  # pull out close data points
            else:
                close_data[tuple(row1)].append(row2)
for k, v in close_data.items():
    print(k, v)
#desired outcome
#(455.108, 97.047799999999995, 0.012245333299999999)
#[array([ 4.55108000e+02, 9.70478000e+01, 1.22453333e-02]),
# array([ 4.55103000e+02, 9.70473000e+01, 1.20410000e-02]),
# array([ 4.55106000e+02, 9.70490000e+01, 1.50472381e-02])]
However the multiple loops over the arrays are very inefficient for large datasets. Is it possible to vectorize the calculations so I get the advantage of using numpy?
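The most difficult bit is actually getting back to your representation of the result. Note that I squared the coordinate differences and the sum of the radii, so this tests the Euclidean distance rather than the sum of absolute differences used in your code. If you really don't want Euclidean distances, you have to change that back.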
import numpy as np
data = {}
data[1] = np.array([[455.108, 97.0478, 0.0122453333],
[403.775, 170.558, 0.0138770952],
[255.383, 363.815, 0.0179857619]])
data[2] = np.array([[455.103, 97.0473, 0.012041],
[210.19, 326.958, 0.0156912857],
[455.106, 97.049, 0.0150472381]])
d1 = data[1][:, None, :]
d2 = data[2][None, :, :]
dists2 = ((d1[..., :2] - d2[..., :2])**2).sum(axis = -1)
radss2 = (d1[..., 2] + d2[..., 2])**2
inds1, inds2 = np.where(dists2 <= radss2)
# translate to your representation:
bnds = np.r_[np.searchsorted(inds1, np.arange(3)), len(inds1)]
rows = [data[2][inds2[bnds[i]:bnds[i+1]]] for i in range(3)]
out = dict([(tuple (data[1][i]), rows[i]) for i in range(3) if rows[i].size > 0])
Here is a pure numpythonic way (a is data[1] and b is data[2]):
In [80]: p = np.arange(3) # for creating the indices of combinations using np.tile and np.repeat
In [81]: a = a[np.repeat(p, 3)] # creates the first column of combination array
In [82]: b = b[np.tile(p, 3)] # creates the second column of combination array
In [83]: abs(a[:, :2] - b[:, :2]).sum(1) < a[:, 2] + b[:, 2]
Out[83]: array([ True, False,  True, False, False, False, False, False, False], dtype=bool)
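For comparison, the question's own test (sum of absolute coordinate differences against the sum of the radii) can be sketched with broadcasting and then folded back into the dictionary layout the question asks for. The names c1, c2, overlap and pairs are illustrative, not from the original posts.

```python
import numpy as np

c1 = np.array([[455.108, 97.0478, 0.0122453333],
               [403.775, 170.558, 0.0138770952],
               [255.383, 363.815, 0.0179857619]])
c2 = np.array([[455.103, 97.0473, 0.012041],
               [210.19, 326.958, 0.0156912857],
               [455.106, 97.049, 0.0150472381]])

# Pairwise |dx| + |dy| via broadcasting: shape (len(c1), len(c2))
manhattan = np.abs(c1[:, None, :2] - c2[None, :, :2]).sum(axis=-1)
radii_sum = c1[:, None, 2] + c2[None, :, 2]
overlap = manhattan < radii_sum

# (row in c1, row in c2) index pairs that satisfy the condition
pairs = np.argwhere(overlap).tolist()
print(pairs)  # [[0, 0], [0, 2]]

# Same layout as the question: key is the c1 row, value is [c1 row, matching c2 rows...]
close_data = {tuple(row): [row, *c2[mask]]
              for row, mask in zip(c1, overlap) if mask.any()}
```

Only the first circle of the first array has any matches, so close_data ends up with a single key.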
I'm trying to get the values of (nRows, nCols) from a 2D matrix, but when it is a single row (e.g. x = np.array([1, 2, 3, 4])), x.shape returns (4,), so my statement (nRows, nCols) = x.shape raises "ValueError: need more than 1 value to unpack".
Any suggestions on how I can make this statement more adaptable? It's for a function that is used in many programs and should work with both single-row and multi-row matrices. Thanks!
You could create a function that returns a tuple of rows and columns like this:
def rowsCols(a):
    if len(a.shape) > 1:
        rows = a.shape[0]
        cols = a.shape[1]
    else:
        rows = a.shape[0]
        cols = 0
    return (rows, cols)
where a is the array you input to the function. Here's an example of using the function:
import numpy as np
x = np.array([1,2,3])
y = np.array([[1,2,3],[4,5,6],[7,8,9],[10,11,12]])
def rowsCols(a):
    if len(a.shape) > 1:
        rows = a.shape[0]
        cols = a.shape[1]
    else:
        rows = a.shape[0]
        cols = 0
    return (rows, cols)
(nRows, nCols) = rowsCols(x)
print('rows {} and columns {}'.format(nRows, nCols))
(nRows, nCols) = rowsCols(y)
print('rows {} and columns {}'.format(nRows, nCols))
This prints rows 3 and columns 0 then rows 4 and columns 3. Alternatively, you can use the atleast_2d function for a more concise approach:
(r, c) = np.atleast_2d(x).shape
print('rows {}, cols {}'.format(r, c))
(r, c) = np.atleast_2d(y).shape
print('rows {}, cols {}'.format(r, c))
Which prints rows 1, cols 3 and rows 4, cols 3.
If your function uses
(nRows, nCols) = x.shape
it probably also indexes or iterates on x with the assumption that it has nRows rows, e.g.
x[0,:]
for row in x:
    # do something with the row
Common practice is to reshape x (as needed) so it has at least 1 row. In other words, change the shape from (n,) to (1,n).
x = np.atleast_2d(x)
does this nicely. Inside a function, such a change to x won't affect x outside it. This way you can treat x as 2d throughout your function, rather than constantly checking whether it is 1d or 2d.
"Python: How can I force 1-element NumPy arrays to be two-dimensional?" is one of many previous SO questions that ask about treating 1d arrays as 2d.
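As a small illustration of the atleast_2d advice, here is a sketch of a function that normalises its input once and then treats it as 2d everywhere. The function row_sums is a made-up example, not something from the question.

```python
import numpy as np

def row_sums(x):
    # Hypothetical example function: sums each row of x.
    x = np.atleast_2d(x)      # (n,) becomes (1, n); 2-D input keeps its shape
    n_rows, n_cols = x.shape  # unpacking two values is now always safe
    return np.array([x[i, :].sum() for i in range(n_rows)])

x = np.array([1, 2, 3, 4])
y = np.array([[1, 2, 3], [4, 5, 6]])

print(row_sums(x).tolist())  # [10]
print(row_sums(y).tolist())  # [6, 15]
print(x.shape)               # (4,)  the caller's array keeps its 1-D shape
```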