Mapping between two arrays with a precedence array

Given a source array
src = np.random.rand(320,240)
and an index array
idx = np.indices(src.shape).reshape(2, -1)
np.random.shuffle(idx.T)
we can map the linear index i in src to the 2-dimensional index idx[:,i] in a destination array dst via
dst = np.empty_like(src)
dst[tuple(idx)] = src.ravel()
This is discussed in Python: Mapping between two arrays with an index array
However, if this mapping is not 1-to-1, i.e., multiple entries in src map to the same entry in dst, according to the docs it is unspecified which of the source entries will be written to dst:
For advanced assignments, there is in general no guarantee for the iteration order. This means that if an element is set more than once, it is not possible to predict the final result.
If we are additionally given a precedence array
p = np.random.rand(*src.shape)
how can we use p to disambiguate this situation, i.e., write the entry with highest precedence according to p?
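For reference, here is a minimal sketch of one deterministic approach based on np.lexsort (my own sketch, not taken from the answer below): sort the source entries by destination cell and then by precedence, and keep only the last (highest-precedence) entry of each destination group. It assumes idx has shape (2, src.size) as above, possibly with duplicates; cells that are never written stay NaN.
import numpy as np
lin = np.ravel_multi_index(idx, src.shape)   # linear destination index per source element
order = np.lexsort((p.ravel(), lin))         # sort by destination cell, then by precedence
lin_sorted = lin[order]
# the last element of each destination group has the highest precedence
last = np.r_[lin_sorted[1:] != lin_sorted[:-1], True]
dst = np.full(src.size, np.nan)
dst[lin_sorted[last]] = src.ravel()[order][last]
dst = dst.reshape(src.shape)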

Here is a method using a sparse matrix for sorting (it has large overhead but scales better than argsort, presumably because it uses some radix-sort-like method internally). Entries that lose the precedence contest are explicitly set to -1. We make the destination array one cell too big; the surplus cell serves as a trash can.
import numpy as np
from scipy import sparse
N = 2
idx = np.random.randint(0, N, (2, N, N))
prec = np.random.random((N, N))
src = np.arange(N*N).reshape(N, N)
def f_sparse(idx, prec, src):
    # linear destination index for every source element
    idx = np.ravel_multi_index(idx, src.shape).ravel()
    # CSR matrix with one entry per source element (row = source position,
    # column = destination index); converting to CSC groups entries by destination
    sp = sparse.csr_matrix((prec.ravel(), idx, np.arange(idx.size+1)),
                           (idx.size, idx.size)).tocsc()
    top = sp.indptr.argmax()  # all non-empty columns lie before this pointer
    # column-wise maximum precedence, expanded back to one value per entry
    mx = np.repeat(np.maximum.reduceat(sp.data, sp.indptr[:top]),
                   np.diff(sp.indptr[:top+1]))
    # entries that do not attain the maximum are redirected to the trash cell
    res = idx.copy()
    res[sp.indices[sp.data != mx]] = -1
    dst = np.full((idx.size + 1,), np.nan)  # one cell too big: last cell is the trash can
    dst[res] = src.ravel()
    return dst[:-1].reshape(src.shape)
print(idx)
print(prec)
print(src)
print(f_sparse(idx, prec, src))
Sample run:
[[[1 0]
  [1 0]]
 [[0 1]
  [0 0]]]
[[0.90995366 0.92095225]
 [0.60997092 0.84092015]]
[[0 1]
 [2 3]]
[[ 3.  1.]
 [ 0. nan]]


NumPy template matching SQDIFF with sliding_window_view

The SQDIFF is defined as in the OpenCV documentation,
R(x, y) = sum_{x', y'} (T(x', y') - I(x + x', y + y'))^2
(I believe they omit channels). In plain NumPy this would be:
import numpy as np
import cv2 as cv

A = np.arange(27, dtype=np.float32)
A = A.reshape(3, 3, 3)  # the "image"
B = np.ones([2, 2, 3], dtype=np.float32)  # window
rw, rh = A.shape[0] - B.shape[0] + 1, A.shape[1] - B.shape[1] + 1  # result size
result = np.zeros([rw, rh])
for i in range(rw):
    for j in range(rh):
        w = A[i:i + B.shape[0], j:j + B.shape[1]]
        res = B - w
        result[i, j] = np.sum(res ** 2)
cv_result = cv.matchTemplate(A, B, cv.TM_SQDIFF)  # this result is the same as the simple for loops
assert np.allclose(cv_result, result)
This is a comparatively slow solution. I have read about sliding_window_view but cannot get it right.
# This will fail with these large arrays but is ok for smaller ones
A = np.random.rand(1028, 1232, 3).astype(np.float32)
B = np.random.rand(248, 249, 3).astype(np.float32)
locations = np.lib.stride_tricks.sliding_window_view(A, B.shape)
sqdiff = np.sum((B - locations) ** 2, axis=(-1,-2, -3, -4)) # This will fail with normal sized images
This fails with a MemoryError even though the result easily fits in memory. How can I reproduce the results of cv2.matchTemplate in this faster way?
As a last resort, you may perform the computation in tiles, instead of computing "all at once".
np.lib.stride_tricks.sliding_window_view returns a view of the data, so it doesn't consume a lot of RAM.
The expression B - locations can't use a view, and requires the RAM for storing an array with shape (781, 984, 1, 248, 249, 3) of float elements.
The total RAM for storing B - locations is 781*984*1*248*249*3*4 = 569,479,908,096 bytes (about 530 GB).
To avoid storing all of B - locations in RAM at once, we may compute sqdiff in tiles, where each "tile" computation requires less RAM.
A simple tiling scheme uses every row as a tile: loop over the rows of sqdiff and compute the output row by row.
Example:
sqdiff = np.zeros((locations.shape[0], locations.shape[1]), np.float32)  # Allocate an array for storing the result.
# Compute sqdiff row by row instead of computing all at once.
for i in range(sqdiff.shape[0]):
    sqdiff[i, :] = np.sum((B - locations[i, :, :, :, :, :]) ** 2, axis=(-1, -2, -3, -4))
Executable code sample:
import numpy as np
import cv2
A = np.random.rand(1028, 1232, 3).astype(np.float32)
B = np.random.rand(248, 249, 3).astype(np.float32)
locations = np.lib.stride_tricks.sliding_window_view(A, B.shape)
cv_result = cv2.matchTemplate(A, B, cv2.TM_SQDIFF) # this result is the same as the simple for loops
#sqdiff = np.sum((B - locations) ** 2, axis=(-1, -2, -3, -4)) # This will fail with normal sized images
sqdiff = np.zeros((locations.shape[0], locations.shape[1]), np.float32)  # Allocate an array for storing the result.
# Compute sqdiff row by row instead of computing all at once.
for i in range(sqdiff.shape[0]):
    sqdiff[i, :] = np.sum((B - locations[i, :, :, :, :, :]) ** 2, axis=(-1, -2, -3, -4))
assert np.allclose(cv_result, sqdiff)
I know the solution is a bit disappointing... But it is the only generic solution I could find.
Note that expanding the square, the SQDIFF
SQDIFF(x, y) = sum_{m, n} (T(m, n) - I(x + m, y + n))^2
is equivalent to
SQDIFF = I^2 star 1_[m, n] - 2 * (I star T) + T^2 star 1_[k, l]
where the 'star' operation is a cross-correlation, 1_[m, n] is a window the size of the template, and 1_[k, l] is a window with the size of the image.
You can compute the cross-correlation terms using 'scipy.signal.correlate' and find the matches by looking for local minima in the square difference map.
You might want to do some non-minimum suppression too.
This solution will require orders of magnitude less memory to store.
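A minimal sketch of that idea (my own, using scipy.signal.fftconvolve rather than scipy.signal.correlate, realizing each cross-correlation as a convolution with a flipped kernel; FFT round-off means the result only matches cv2.matchTemplate to a tolerance):
import numpy as np
from scipy.signal import fftconvolve

A = np.random.rand(1028, 1232, 3).astype(np.float32)
B = np.random.rand(248, 249, 3).astype(np.float32)

# I star T: cross-correlation == convolution with the flipped template
cross = fftconvolve(A, B[::-1, ::-1, ::-1], mode='valid')[:, :, 0]
# I^2 star 1_[m, n]: sliding-window sum of A**2
win_sq = fftconvolve(A ** 2, np.ones_like(B), mode='valid')[:, :, 0]
# T^2 star 1_[k, l] reduces to the constant sum(B**2)
sqdiff = win_sq - 2.0 * cross + np.sum(B ** 2)  # shape (781, 984)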
For more help, please post a reproducible example with an image and template that are valid for the algorithm. Using noise will result in meaningless outputs.

Efficient way to check every value in a 2D Python array

I have a 2D numpy array of values, a list of x-coordinates, and a list of y-coordinates. The x-coordinates increase left-to-right and the y-coordinates increase top-to-bottom.
For example:
a = np.random.random((3, 3))
a[0][1] = 9.0
a[0][2] = 9.0
a[1][1] = 9.0
a[1][2] = 9.0
xs = list(range(1112, 1115))
ys = list(range(1109, 1112))
Output:
[[0.48148651 9. 9. ]
[0.09030393 9. 9. ]
[0.79271224 0.83413552 0.29724989]]
[1112, 1113, 1114]
[1109, 1110, 1111]
I want to remove the values from the 2D array that are greater than 1. I also want to combine the lists xs and ys to get a list of all the coordinate pairs for points that are kept.
In this example I want to remove a[0][1], a[0][2], a[1][1], a[1][2] and I want the list of coordinate pairs to be
[[1112, 1109], [1112,1110], [1112, 1111], [1113, 1111], [1114, 1111]]
I have been able to accomplish this using a double for loop and if statements:
a_values = []
point_pairs = []
for i in range(0, a.shape[0]):
    for j in range(0, a.shape[1]):
        if a[i][j] < 1:
            a_values.append(a[i][j])
            point_pairs.append([xs[j], ys[i]])
print(a_values)
print(point_pairs)
Output:
[0.48148650831317796, 0.09030392566133771, 0.7927122386213029, 0.8341355206494774, 0.2972498933037804]
[[1112, 1109], [1112, 1110], [1112, 1111], [1113, 1111], [1114, 1111]]
What is a more efficient way of doing this?
You can use np.nonzero to get the indices of the elements you want to keep:
mask = a < 1
i, j = np.nonzero(mask)
The fancy indices i and j can be used to get the elements of xs and ys directly if they are numpy arrays:
xs = np.array(xs)
ys = np.array(ys)
point_pairs = np.stack((xs[j], ys[i]), axis=-1)
You can also use np.take to make the conversion happen under the hood:
point_pairs = np.stack((np.take(xs, j), np.take(ys, i)), axis=-1)
The remaining elements of a are those selected by the mask:
a_points = a[mask]
Alternatively:
i, j = np.nonzero(a < 1)
point_pairs = np.stack((np.take(xs, j), np.take(ys, i)), axis=-1)
a_points = a[i, j]
In this context, you can use np.where as a drop-in alias for np.nonzero.
Notes
If you are using numpy, there is rarely a need for lists. Putting xs = np.array(xs), or even just initializing it as xs = np.arange(1112, 1115) is faster and easier.
Numpy arrays should generally be indexed through a single index: a[0, 1], not a[0][1]. For your simple case the behavior happens to be the same, but it will not be in the general case. a[0, 1] is an index into the original array, while a[0] is a view of the first row, i.e., a separate array object, and a[0][1] indexes into that new object. You just happened to get lucky here: the view shares the base memory, so the assignment is visible in a itself. This would not be the case if you tried a mask or fancy index, for example (see the short demonstration after these notes).
On a related note, setting a rectangular swath in an array only requires one line: a[1:, :-1] = 9.
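A quick demonstration of the chained-indexing pitfall (a sketch of my own):
a = np.zeros((3, 3))
a[0][1] = 9.0       # works: a[0] is a view sharing memory with a
a[a < 1][0] = 9.0   # silently does nothing: a[a < 1] is a copy
print(a)            # only the first assignment is visible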
I would write your example something like this:
a = np.random.random((3, 3))
a[1:, :-1] = 9.0
xs = np.arange(1112, 1115)
ys = np.arange(1109, 1112)
i, j = np.nonzero(a < 1)
point_pairs = np.stack((xs[j], ys[i]), axis=-1)
a_points = a[i, j]

How to choose elements out of a matrix randomly weighted

I am pretty new to Python and have some problems with randomness.
I am looking for something similar to RandomChoice in Mathematica.
I create a matrix of dimension, let's say, 10x3 with random numbers greater than 0. Let us call the total sum of every row s_i for i=0,...,9.
Later I want to choose, for every row, 2 out of 3 elements (no repetition) with weighted probability s_ij/s_i.
So I need something like this, but with weighted probabilities:
import random
import numpy as np

n = 10
aa = np.random.uniform(1000, 2500, (n, 3))
print(aa)
help = [0, 1, 2]
dd = np.zeros((n, 2))
for i in range(n):
    cc = random.sample(help, 2)
    dd[i, 0] = aa[i, cc[0]]
    dd[i, 1] = aa[i, cc[1]]
print(dd)
Here, speed is additionally an important factor, since I will use this in a Monte Carlo approach (that's the reason I switched from Mathematica to Python), and I guess the above code can be improved considerably.
Thanks in advance for any tips/help.
EDIT: I now have the following, which is working but does not look like good code to me:
# pre-defined lists
nn = 3
aa = np.random.uniform(1000, 2500, (nn, 3))
help1 = [0, 1, 2]
help2 = aa.sum(axis=1)
# now I create a weighted prob list and fill it
help3 = np.zeros((nn, 3))
for i in range(nn):
    help3[i, 0] = aa[i, 0] / help2[i]
    help3[i, 1] = aa[i, 1] / help2[i]
    help3[i, 2] = aa[i, 2] / help2[i]
# every timestep when I have to choose 2 out of 3
help5 = np.zeros((nn, 2))
for i in range(nn):
    help4 = np.random.choice(help1, 2, replace=False, p=[help3[i, 0], help3[i, 1], help3[i, 2]])
    help5[i, 0] = aa[i, help4[0]]
    help5[i, 1] = aa[i, help4[1]]
print(help5)
As pointed out in the comments, np.random.choice accepts a p parameter with the sampling probabilities, so you can simply use that in a loop:
import numpy as np
# Make input data
np.random.seed(0)
n = 10
aa = np.random.uniform(1000, 2500, (n, 3))
s = np.random.rand(n, 3)
# Normalize weights
s_norm = s / s.sum(1, keepdims=True)
# Output array
out = np.empty((n, 2), dtype=aa.dtype)
# Sample iteratively
for i in range(n):
    out[i] = aa[i, np.random.choice(3, size=2, replace=False, p=s_norm[i])]
This is not the most efficient way to do things, though, as usually using vectorized operations is much faster than looping. Unfortunately, I don't think there is any way to sample from multiple categorical distributions at the same time (see NumPy issue #15201). However, since you always want to get two elements out of three, you could sample the element that you want to remove (with inverted probabilities) and then keep the other two. This snippet does something like that:
import numpy as np
# Make input data
np.random.seed(0)
n = 10
aa = np.random.uniform(1000, 2500, (n, 3))
s = np.random.rand(n, 3)
print(s)
# [[0.26455561 0.77423369 0.45615033]
# [0.56843395 0.0187898 0.6176355 ]
# [0.61209572 0.616934 0.94374808]
# [0.6818203 0.3595079 0.43703195]
# [0.6976312 0.06022547 0.66676672]
# [0.67063787 0.21038256 0.1289263 ]
# [0.31542835 0.36371077 0.57019677]
# [0.43860151 0.98837384 0.10204481]
# [0.20887676 0.16130952 0.65310833]
# [0.2532916 0.46631077 0.24442559]]
# Invert weights
si = 1 / s
# Normalize
si_norm = si / si.sum(1, keepdims=True)
# Accumulate
si_cum = np.cumsum(si_norm, axis=1)
# Sample according to inverted probabilities
t = np.random.rand(n, 1)
idx = np.argmax(t < si_cum, axis=1)
# Get non-sampled indices
r = np.arange(3)
m = r != idx[:, np.newaxis]
choice = np.broadcast_to(r, m.shape)[m].reshape(n, -1)
print(choice)
# [[1 2]
# [0 2]
# [0 2]
# [1 2]
# [0 2]
# [0 2]
# [0 1]
# [1 2]
# [0 2]
# [1 2]]
# Get corresponding data
out = np.take_along_axis(aa, choice, 1)
One possible drawback of this is that the chosen elements will always be in order (that is, for a given row, you may get the pairs of indices (0, 1), (0, 2) or (1, 2), but not (1, 0), (2, 0) or (2, 1)).
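If the within-row order matters, one possible fix (my own addition, not part of the original answer) is to apply an independent random permutation to each row of choice before gathering:
# argsort of uniform noise gives an independent random permutation per row
perm = np.random.rand(n, 2).argsort(axis=1)
choice = np.take_along_axis(choice, perm, axis=1)
out = np.take_along_axis(aa, choice, 1)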
Of course, if you really just need a few samples, then the loop is probably the most convenient and maintainable solution, the second one would only be useful if you need to do this at larger scale.

Calculate mean, variance, covariance of different length matrices in a split list

I have an array of 5 values per row, consisting of 4 values and one index. I sort and split the array along the index. This leads to splits (matrices) of different lengths. From here on I want to calculate, for every split, the mean and variance of the fourth value and the covariance of the first 3 values. My current approach works with a for loop, which I would like to replace by matrix operations, but I am struggling with the different sizes of my matrices.
import numpy as np
A = np.random.rand(10,5)
A[:,-1] = np.random.randint(4, size=10)
sorted_A = A[np.argsort(A[:,4])]
splits = np.split(sorted_A, np.where(np.diff(sorted_A[:,4]))[0]+1)
My current for loop looks like this:
result = np.zeros((len(splits), 5))
for idx, values in enumerate(splits):
    if len(values) > 0:
        result[idx, 0] = np.mean(values[:, 3])
        result[idx, 1] = np.var(values[:, 3])
        result[idx, 2:5] = np.cov(values[:, 0:3].transpose(), ddof=0).diagonal()
    else:
        result[idx, 0] = values[:, 3]
I tried to work with masked arrays without success, since I couldn't load the matrices into the masked arrays in a proper form. Maybe someone knows how to do this or has a different suggestion.
You can use np.add.reduceat as follows:
>>> idx = np.concatenate([[0], np.where(np.diff(sorted_A[:,4]))[0]+1, [A.shape[0]]])
>>> result2 = np.empty((idx.size-1, 5))
>>> result2[:, 0] = np.add.reduceat(sorted_A[:, 3], idx[:-1]) / np.diff(idx)
>>> result2[:, 1] = np.add.reduceat(sorted_A[:, 3]**2, idx[:-1]) / np.diff(idx) - result2[:, 0]**2
>>> result2[:, 2:5] = np.add.reduceat(sorted_A[:, :3]**2, idx[:-1], axis=0) / np.diff(idx)[:, None]
>>> result2[:, 2:5] -= (np.add.reduceat(sorted_A[:, :3], idx[:-1], axis=0) / np.diff(idx)[:, None])**2
>>>
>>> np.allclose(result, result2)
True
Note that the diagonal of the covariance matrix is just the variances, which simplifies this vectorization quite a bit: each one reduces to E[x**2] - E[x]**2, computable with two reduceat calls.
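For reference, np.add.reduceat(x, starts) sums x over the slices starts[0]:starts[1], starts[1]:starts[2], ..., starts[-1]:, which is what replaces the per-split loop here. A tiny sketch:
import numpy as np
x = np.array([1., 2., 3., 4., 5.])
np.add.reduceat(x, [0, 2, 3])  # -> array([3., 3., 9.]), the sums of x[0:2], x[2:3], x[3:]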

How to add column to numpy array

I am trying to add one column to the array created from recfromcsv. In this case it's an array of shape (210, 8) (rows, cols).
I want to add a ninth column. Empty or with zeroes doesn't matter.
from numpy import genfromtxt
from numpy import recfromcsv
import numpy as np
import time
if __name__ == '__main__':
    print("testing")
    my_data = recfromcsv('LIAB.ST.csv', delimiter='\t')
    array_size = my_data.size
    #my_data = np.append(my_data[:array_size],my_data[9:],0)
    new_col = np.sum(x, 1).reshape((x.shape[0], 1))
    np.append(x, new_col, 1)
I think your problem is that you are expecting np.append to add the column in-place, but because of how numpy data is stored, it instead creates a copy of the joined arrays:
Returns
-------
append : ndarray
A copy of `arr` with `values` appended to `axis`. Note that `append`
does not occur in-place: a new array is allocated and filled. If
`axis` is None, `out` is a flattened array.
so you need to save the output all_data = np.append(...):
my_data = np.random.random((210,8)) #recfromcsv('LIAB.ST.csv', delimiter='\t')
new_col = my_data.sum(1)[...,None] # None keeps (n, 1) shape
new_col.shape
#(210,1)
all_data = np.append(my_data, new_col, 1)
all_data.shape
#(210,9)
Alternative ways:
all_data = np.hstack((my_data, new_col))
#or
all_data = np.concatenate((my_data, new_col), 1)
I believe the only difference between these three functions (as well as np.vstack) is their default behavior for when axis is unspecified:
concatenate assumes axis = 0
hstack assumes axis = 1 unless inputs are 1d, then axis = 0
vstack assumes axis = 0 after adding an axis if inputs are 1d
append flattens array
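A quick illustration of those defaults:
a = np.ones((2, 3))
b = np.zeros((2, 1))
np.concatenate((a, b), 1).shape  # (2, 4)
np.hstack((a, b)).shape          # (2, 4) -- same as concatenate with axis=1 here
np.append(a, b).shape            # (8,)   -- no axis given, so both inputs are flattened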
Based on your comment, and looking more closely at your example code, I now believe that what you are probably looking to do is add a field to a record array. You imported both genfromtxt, which returns a structured array, and recfromcsv, which returns the subtly different record array (recarray). Since you used recfromcsv, my_data is actually a recarray, which means that most likely my_data.shape = (210,), since recarrays are 1d arrays of records, where each record is a tuple with the given dtype.
So you could try this:
import numpy as np
from numpy.lib.recfunctions import append_fields
x = np.random.random(10)
y = np.random.random(10)
z = np.random.random(10)
data = np.array( list(zip(x,y,z)), dtype=[('x',float),('y',float),('z',float)])
data = np.recarray(data.shape, data.dtype, buf=data)
data.shape
#(10,)
tot = data['x'] + data['y'] + data['z'] # sum(axis=1) won't work on recarray
tot.shape
#(10,)
all_data = append_fields(data, 'total', tot, usemask=False)
all_data
#array([(0.4374783740738456 , 0.04307289878861764, 0.021176067323686598, 0.5017273401861498),
# (0.07622262416466963, 0.3962146058689695 , 0.27912715826653534 , 0.7515643883001745),
# (0.30878532523061153, 0.8553768789387086 , 0.9577415585116588 , 2.121903762680979 ),
# (0.5288343561208022 , 0.17048864443625933, 0.07915689716226904 , 0.7784798977193306),
# (0.8804269791375121 , 0.45517504750917714, 0.1601389248542675 , 1.4957409515009568),
# (0.9556552723429782 , 0.8884504475901043 , 0.6412854758843308 , 2.4853911958174133),
# (0.0227638618687922 , 0.9295332854783015 , 0.3234597575660103 , 1.275756904913104 ),
# (0.684075052174589 , 0.6654774682866273 , 0.5246593820025259 , 1.8742119024637423),
# (0.9841793718333871 , 0.5813955915551511 , 0.39577520705133684 , 1.961350170439875 ),
# (0.9889343795296571 , 0.22830104497714432, 0.20011292764078448 , 1.4173483521475858)],
# dtype=[('x', '<f8'), ('y', '<f8'), ('z', '<f8'), ('total', '<f8')])
all_data.shape
#(10,)
all_data.dtype.names
#('x', 'y', 'z', 'total')
If you have an array a of, say, 210 rows by 8 columns:
import numpy
a = numpy.empty([210, 8])
and want to add a ninth column of zeros, you can do this:
b = numpy.append(a, numpy.zeros([len(a), 1]), 1)
The easiest solution is to use numpy.insert().
The advantage of np.insert() over np.append() is that you can insert the new columns at custom indices.
import numpy as np
X = np.arange(20).reshape(10,2)
X = np.insert(X, [0,2], np.random.rand(X.shape[0]*2).reshape(-1,2)*10, axis=1)
np.append or np.hstack expects the appended column to have the proper shape, that is, N x 1. We can use np.zeros to create this zeros column (or np.ones to create a ones column) and append it to our original matrix (2D array).
def append_zeros(x):
    zeros = np.zeros((len(x), 1))  # zeros column as 2D array
    return np.hstack((x, zeros))   # append column
I add a new column of ones to a matrix Z in this way:
Z = np.append([[1 for _ in range(len(Z))]], Z.T, 0).T
Maybe it is not that efficient?
It can be done like this:
import numpy as np
# create a random matrix:
A = np.random.normal(size=(5,2))
# add a column of zeros to it:
print(np.hstack((A,np.zeros((A.shape[0],1)))))
In general, if A is an m*n matrix and you need to add a column, you have to create an m*1 matrix of zeros, then use "hstack" to add it to the right of A.
Similar to some of the other answers suggesting using numpy.hstack, but more readable:
import numpy as np
# declare 10 rows x 3 cols integer array of all 1s
arr = np.ones((10, 3), dtype=np.int64)
# get the number of rows in the original array (as if we didn't know it was 10 or it could be different in other cases)
numRows = arr.shape[0]
# declare the new array which will be the new column, integer array of all 0s so it's visually distinct from the original array
additionalColumn = np.zeros((numRows, 1), dtype=np.int64)
# use hstack to tack on the additional column
result = np.hstack((arr, additionalColumn))
print(result)
result:
$ python3 scratchpad.py
[[1 1 1 0]
[1 1 1 0]
[1 1 1 0]
[1 1 1 0]
[1 1 1 0]
[1 1 1 0]
[1 1 1 0]
[1 1 1 0]
[1 1 1 0]
[1 1 1 0]]
Here's a shorter one-liner:
import numpy as np
data = np.random.rand(210, 8)
data = np.c_[data, np.zeros(len(data))]
This is something I often use to convert points to homogeneous coordinates, with np.ones instead of np.zeros.
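For instance, a small sketch of that homogeneous-coordinates use:
pts = np.random.rand(5, 2)             # 2D points, one per row
pts_h = np.c_[pts, np.ones(len(pts))]  # append a column of ones -> homogeneous coordinates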
