Efficiently select random matrix indices with given probabilities - python

I have a numpy array of probabilities, such as:
[[0.1, 0,    0.3 ],
 [0.2, 0,    0.05],
 [0,   0.15, 0.2 ]]
I want to select an element (e.g., select some indices (i,j)) from this matrix, with probability weighted according to this matrix. The actual matrices this will be working with are large (up to 1000x1000), so I'm looking for an efficient way to do this. This is my current solution:
def weighted_mat_choice(prob_mat):
    """
    Randomly select indices of the matrix according to the probabilities in prob_mat
    :param prob_mat: Normalized probabilities to select each element
    :return: indices (i, j) selected
    """
    inds_mat = [[(i, j) for j in xrange(prob_mat.shape[1])] for i in xrange(prob_mat.shape[0])]
    inds_list = [item for sublist in inds_mat for item in sublist]
    inds_of_inds = xrange(len(inds_list))
    prob_list = prob_mat.flatten()
    pick_ind_of_ind = np.random.choice(inds_of_inds, p=prob_list)
    pick_ind = inds_list[pick_ind_of_ind]
    return pick_ind
which is definitely not efficient. (Basically, linearizing the matrix, creating a list of index tuples, and then picking accordingly.) Is there a better way to do this selection?

You don't need a list of tuples for choice(). Just draw from an arange(n) array and convert the chosen flat index back to two dimensions with unravel_index().
import numpy as np

p = np.array(
    [[0.1, 0,    0.3 ],
     [0.2, 0,    0.05],
     [0,   0.15, 0.2 ]]
)

p_flat = p.ravel()
ind = np.arange(len(p_flat))
res = np.column_stack(
    np.unravel_index(
        np.random.choice(ind, p=p_flat, size=10000),
        p.shape))
The result:
array([[0, 2],
[2, 2],
[2, 1],
...,
[1, 0],
[0, 2],
[0, 0]], dtype=int64)
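If you only need a single draw, as in the original function, the same idea collapses to a couple of lines (a sketch along the lines of the answer above):

import numpy as np

def weighted_mat_choice(prob_mat):
    """Pick one (i, j) pair with probability prob_mat[i, j]."""
    flat = prob_mat.ravel()
    # Draw a position in the flattened matrix, then map it back to 2-D indices.
    pick = np.random.choice(flat.size, p=flat)
    return np.unravel_index(pick, prob_mat.shape)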

Related

Rearranging array of vertices into array of edges

I have a 3x2 array where each row represents a vertex of a triangle. I would like to reshape it in order to obtain a new array where each row represents a side.
I'm currently trying the following approach:
points = np.array([[0, 0], [0, 1], [1, 0]])
sides = np.array([
    [points[0], points[1]],
    [points[1], points[2]],
    [points[2], points[0]]
])
Is there any built-in function to do this in a more elegant way?
Elegance is a matter of definition; whether you find the following solution more elegant is up to you. I use np.roll to shift the row indices from [0], [1], [2] to [1], [2], [0] and then pair the shifted and unshifted arrays with np.stack, similar to what you do in your manual code (compare the index pairs you create: they are the same).
import numpy as np

points = np.array([[0, 0], [0, 1], [1, 0]])
print(points)
#array([[0, 0],
#       [0, 1],
#       [1, 0]])

sides = np.stack([
    points,
    np.roll(points, -1, axis=0)   # roll the rows so each vertex is paired with the next one
], axis=1)
print(sides)
#array([[[0, 0],
# [0, 1]],
#
# [[0, 1],
# [1, 0]],
#
# [[1, 0],
# [0, 0]]])
Keep in mind that this solution pairs each vertex with the next one and wraps the last vertex back to the first, i.e. it assumes the vertices describe a closed polygon.
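As a quick check (my own example, not from the original answer), the same roll/stack pairing also closes a four-vertex square:

square = np.array([[0, 0], [0, 1], [1, 1], [1, 0]])
edges = np.stack([square, np.roll(square, -1, axis=0)], axis=1)
# edges[i] is the pair (square[i], square[(i + 1) % 4]); the last edge,
# [[1, 0], [0, 0]], closes the polygon.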

Calculate Similarity multi dimensions array Using fastdtw

I have been trying to use fastdtw to calculate similarity.
Here is a working example; the reported similarity is 0.916%.
import numpy as np
from scipy.spatial.distance import euclidean
from fastdtw import fastdtw

dataSetI = [1, 0.5, 2, 2]
dataSetII = [1, 1, 1, 0.51, 2, 1]
x = np.array(dataSetI)
y = np.array(dataSetII)
distance, path = fastdtw(x, y, dist=euclidean)
print("{:.3f}%".format(similarity))  # 0.916% (similarity is derived from distance; that step is not shown here)
But the datasets I want to compare are multi-dimensional arrays whose rows have varying lengths.
Example:
dataSetI = [[1, 0.5], [2, 2], []]
dataSetII = [[1, 1,3,5], [1, 0.51], [2, 1,5,6,7]]
x = np.array(dataSetI)
y = np.array(dataSetII)
distance, path = fastdtw(x, y, dist=euclidean)
#error here
ValueError: setting an array element with a sequence.
So my question is: can I do this using fastdtw, or is there another library that can? Please let me know. Thanks.
No! Don't use FastDTW.
FastDTW is approximate and generally slower than the algorithm it approximates; see
Renjie Wu, Eamonn J. Keogh: ICDE 2021: 2327-2328.

Normalizing vectors contained in an array

I've got an array, called X, where every element is a 2d-vector itself. The diagonal of this array is filled with nothing but zero-vectors.
Now I need to normalize every vector in this array, without changing the structure of it.
First I tried to calculate the norm of every vector and put it in an array, called N. After that I wanted to divide every element of X by every element of N.
Two problems occurred to me:
1) Many entries of N are zero, which is obviously a problem when I try to divide by them.
2) The shapes of the arrays don't match, so np.divide() doesn't work as expected.
Beyond that, I don't think it's a good idea to calculate N like this, because later on I want to be able to do the same with more than two vectors.
import numpy as np
# Example array
X = np.array([[[0, 0], [1, -1]], [[-1, 1], [0, 0]]])
# Array containing the norms
N = np.vstack((np.linalg.norm(X[0], axis=1), np.linalg.norm(X[1], axis=1)))
R = np.divide(X, N)
I want the output to look like this:
R = np.array([[[0, 0], [0.70710678, -0.70710678]], [[-0.70710678, 0.70710678], [0, 0]]])
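For reference, the division-by-norm described above can be written directly in NumPy; a minimal sketch (the where= handling of zero vectors is my own choice, not part of the question):

import numpy as np

X = np.array([[[0, 0], [1, -1]], [[-1, 1], [0, 0]]], dtype=float)

# Norm of every innermost vector; keepdims lets it broadcast against X.
N = np.linalg.norm(X, axis=-1, keepdims=True)

# Divide only where the norm is non-zero, so zero vectors stay zero.
R = np.divide(X, N, out=np.zeros_like(X), where=N > 0)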
You do not need to use sklearn. Just define a function and use a list comprehension.
Assuming that the 0th dimension of X equals the number of 2D arrays you have, use this:
import numpy as np

# Example array
X = np.array([[[0, 0], [1, -1]], [[-1, 1], [0, 0]]])

def stdmtx(X):
    X = X - X.mean(axis=1)[:, np.newaxis]
    X = X / X.std(axis=1, ddof=1)[:, np.newaxis]
    return np.nan_to_num(X)  # zero-variance rows produce NaNs; replace them with 0

R = np.array([stdmtx(X[i, :, :]) for i in range(X.shape[0])])
The desired output R:
array([[[ 0. , 0. ],
[ 0.70710678, -0.70710678]],
[[-0.70710678, 0.70710678],
[ 0. , 0. ]]])

How can I select a row from a SparseTensor in TensorFlow?

Say, if I have two SparseTensors as following:
[[1, 0, 0, 0],
[2, 0, 0, 0],
[1, 2, 0, 0]]
and
[[1.0, 0, 0, 0],
[1.0, 0, 0, 0],
[0.3, 0.7, 0, 0]]
and I want to extract the first two rows out of them. I need both indices and values of non-zeros entries as SparseTensors so that I can pass the result to tf.nn.embedding_lookup_sparse. How can I do this?
My application is:
I want to use word embeddings, which is quite straightforward in TensorFlow. But now I want to use sparse embeddings, i.e.: common words have their own embeddings, while the embedding of a rare word is a sparse linear combination of the embeddings of common words.
So I need two cookbooks to indicate how sparse embeddings are composed. In the aforementioned example, the cookbook says: for the first word, its embedding consists of its own embedding with weight 1.0. Things are similar for the second word. For the last word, it says: the embedding of this word is a linear combination of the embeddings of the first two words, with weights 0.3 and 0.7 respectively.
I need to extract a row, then feed the indices and weights to tf.nn.embedding_lookup_sparse to obtain the final embeddings. How can I do that in TensorFlow?
Or do I need to work around it, i.e. preprocess my data and handle the cookbook outside of TensorFlow?
I checked in with one of the engineers here who knows more about this area, and here's what he passed on:
I am not sure if we have an efficient implementation of this, but here is a not-so-optimal implementation using the dynamic_partition and gather ops.
def sparse_slice(indices, values, needed_row_ids):
    num_rows = tf.shape(indices)[0]
    partitions = tf.cast(tf.equal(indices[:, 0], needed_row_ids), tf.int32)
    rows_to_gather = tf.dynamic_partition(tf.range(num_rows), partitions, 2)[1]
    slice_indices = tf.gather(indices, rows_to_gather)
    slice_values = tf.gather(values, rows_to_gather)
    return slice_indices, slice_values

with tf.Session().as_default():
    indices = tf.constant([[0, 0], [1, 0], [2, 0], [2, 1]])
    values = tf.constant([1.0, 1.0, 0.3, 0.7], dtype=tf.float32)
    needed_row_ids = tf.constant([1])
    slice_indices, slice_values = sparse_slice(indices, values, needed_row_ids)
    print(slice_indices.eval(), slice_values.eval())
Update:
The engineer sent on an example to help with multiple rows too, thanks for pointing that out!
def sparse_slice(indices, values, needed_row_ids):
    needed_row_ids = tf.reshape(needed_row_ids, [1, -1])
    num_rows = tf.shape(indices)[0]
    partitions = tf.cast(tf.reduce_any(tf.equal(tf.reshape(indices[:, 0], [-1, 1]), needed_row_ids), 1), tf.int32)
    rows_to_gather = tf.dynamic_partition(tf.range(num_rows), partitions, 2)[1]
    slice_indices = tf.gather(indices, rows_to_gather)
    slice_values = tf.gather(values, rows_to_gather)
    return slice_indices, slice_values

with tf.Session().as_default():
    indices = tf.constant([[0, 0], [1, 0], [2, 0], [2, 1]])
    values = tf.constant([1.0, 1.0, 0.3, 0.7], dtype=tf.float32)
    needed_row_ids = tf.constant([0, 2])
    slice_indices, slice_values = sparse_slice(indices, values, needed_row_ids)
    print(slice_indices.eval(), slice_values.eval())
Let sp be the name of your 2d SparseTensor. You can first create an indicator tensor for the rows of your SparseTensor that you want to extract, namely
mask = tf.concat([tf.constant([True, True]),
                  tf.fill([sp.dense_shape[0] - 2], False)], axis=0)
Next use tf.gather to propagate this to the sparse indices:
mask_sp = tf.gather(mask, sp.indices[:, 0])
Finally,
values = tf.boolean_mask(sp.values, mask_sp)
indices = tf.boolean_mask(sp.indices, mask_sp)
dense_shape = [sp.dense_shape[0] - 2, sp.dense_shape[1]]
output_sp = tf.SparseTensor(indices=indices, values=values, dense_shape=dense_shape)
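If the goal is then to feed the result into tf.nn.embedding_lookup_sparse as described in the question, the call would look roughly like this (a sketch; ids_sp and weights_sp stand for the sliced SparseTensors of word ids and weights, and embedding_matrix is an assumed [vocab_size, dim] variable of common-word embeddings):

# Weighted combination of the common-word embeddings for each sliced row
# (ids_sp, weights_sp and embedding_matrix are assumed to exist already).
combined = tf.nn.embedding_lookup_sparse(
    embedding_matrix, sp_ids=ids_sp, sp_weights=weights_sp, combiner="sum")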
Shouldn't it behave more like this:
This version keeps the order and frequency of the indices in selected_indices and therefore makes it possible to, e.g., select the same row multiple times:
import tensorflow as tf
tf.enable_eager_execution()

def sparse_gather(indices, values, selected_indices, axis=0):
    """
    indices: [[idx_ax0, idx_ax1, idx_ax2, ..., idx_axk], ...]
    values:  [value1, ..., valuen]
    """
    mask = tf.equal(indices[:, axis][tf.newaxis, :], selected_indices[:, tf.newaxis])
    to_select = tf.where(mask)[:, 1]
    return tf.gather(indices, to_select, axis=0), tf.gather(values, to_select, axis=0)

indices = tf.constant([[1, 0], [2, 0], [3, 0], [7, 0]])
values = tf.constant([1.0, 2.0, 3.0, 7.0], dtype=tf.float32)
needed_row_ids = tf.constant([7, 3, 2, 2, 3, 7])
slice_indices, slice_values = sparse_gather(indices, values, needed_row_ids)
print(slice_indices, slice_values)
I tried the answer by "Pete Warden" which only worked for small data. Given sparsetensor A with m nonzero elements, we would like to take out n rows. The tf.equal would take m*n space, which is not acceptable in my task.
My suggestion is to use Scipy.sparse instead of tensorflow.
In details:
take out all data from tf, indices & data, and form a Scipy.sparse. use coo
If u need to take out rows, use csr formate. if u need to take out cols, use csc
A[:,m]
transform to coo
transform to tf
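A rough sketch of that round trip (the example data is made up; only the COO/CSR conversions and the final tf.SparseTensor call are the point):

import numpy as np
import scipy.sparse as sp
import tensorflow as tf

# Indices/values/shape pulled out of an existing SparseTensor (made-up example data).
indices = np.array([[0, 0], [1, 0], [2, 0], [2, 1]])
values = np.array([1.0, 1.0, 0.3, 0.7], dtype=np.float32)
dense_shape = (3, 4)

# 1. Build a scipy COO matrix from the (row, col, value) triplets.
A = sp.coo_matrix((values, (indices[:, 0], indices[:, 1])), shape=dense_shape)

# 2. CSR slices rows efficiently (use CSC instead for columns).
sliced = A.tocsr()[[0, 1], :]

# 3. Back to COO to recover the triplets of the slice.
sliced = sliced.tocoo()

# 4. Rebuild a tf.SparseTensor from the sliced data.
out = tf.SparseTensor(
    indices=np.stack([sliced.row, sliced.col], axis=1).astype(np.int64),
    values=sliced.data,
    dense_shape=sliced.shape)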

How do I iterate through a paired list when using map and lambda?

I'm stuck on how to iterate through a paired list while I'm using the map and lambda functions. I want to create a series of histograms, based on a central location, of the distances from selected locations (x, y) to that center and the number of times each distance appears, but I keep getting an index out of range error and I don't understand why. I'm not sure how to iterate through locations when I need to pick the two values out of each pair. The whole thing works except the n part.
Sorry for not being clearer: locations = numpy.array((x, y)) holds locations from a boolean array, which gives the specific positions I wanted to test instead of the whole array. The (x, y) values produced form a two-row array where the values I want are paired column-wise. The code before this was:
def detect_peaks(data):
    average = numpy.average(data) * 2
    local_max = data > average
    return local_max

(x, y) = numpy.where(detect_peaks(data))

for test_x in range(0, 8):
    for test_y in range(0, 8):
        distances = []
        locations = numpy.array((x, y))
        central = numpy.array((test_x, test_y))
        [map(lambda x1: distances.append(numpy.linalg.norm(locations[(x1, 0), (x1, 1)] - central)), n) for n in locations]
        histogram = numpy.histogram(distances, bins=10)
I'll rewrite the map/lambda function and come back. Thanks!
Is this what you want? x and y are arrays of int, not float.
Not that I like the double for loops (they should be replaced with a more vectorized algorithm), but to keep the change minimal and highlight the problematic line, here it is:
>>> a2
array([[ 0.92607265,  1.26155686,  0.31516174,  0.91750943,  0.30713193],
       [ 1.0601752 ,  1.10404664,  0.67766044,  0.36434503,  0.64966887],
       [ 1.29878971,  0.66417322,  0.48084284,  1.0956822 ,  0.27142741],
       [ 0.27654032,  0.29566566,  0.33565457,  0.29749312,  0.34113315],
       [ 0.33608323,  0.25230828,  0.41507646,  0.37872512,  0.60471438]])
>>> numpy.where(detect_peaks(a2))
(array([0, 2]), array([1, 0]))
>>> def func1(locations):  # You don't need to unpack the numpy.where result.
        for test_x in range(0, 4):
            for test_y in range(0, 4):
                locations = numpy.array(locations)
                central = numpy.array((test_x, test_y))
                # Vectorization is almost always better.
                # Be careful: iterating over an array iterates over its rows, so transpose it first.
                distances = map(numpy.linalg.norm, (locations - central.reshape((2, -1))).T)
                histogram = numpy.histogram(distances, bins=10)
                print 'cental point:', central
                print 'distance lst:', distances
                print histogram
                print '-------------------------'
And the result:
>>> func1(numpy.where(detect_peaks(a2)))
cental point: [0 0]
distance lst: [1.0, 2.0]
(array([1, 0, 0, 0, 0, 0, 0, 0, 0, 1]), array([ 1. , 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2. ]))
-------------------------
cental point: [0 1]
distance lst: [0.0, 2.2360679774997898]
(array([1, 0, 0, 0, 0, 0, 0, 0, 0, 1]), array([ 0. , 0.2236068 , 0.4472136 , 0.67082039, 0.89442719, 1.11803399, 1.34164079, 1.56524758, 1.78885438, 2.01246118, 2.23606798]))
-------------------------#and more
Came up with this:
def detect_peaks(arrayfinal):
    average = numpy.average(arrayfinal)
    local_max = arrayfinal > average
    return local_max

def dist(distances, center, n):
    distance = numpy.linalg.norm(n - center)
    distances.append(distance)

def histotest():
    peaks = numpy.where(detect_peaks(arrayfinal))
    ordered = zip(peaks[0], peaks[1])
    for test_x in range(0, 2048):
        for test_y in range(0, 2048):
            distances = []
            center = numpy.array((test_x, test_y))
            [dist(distances, center, n) for n in ordered]
            histogram = numpy.histogram(distances, bins=30)
            print histogram
It appears to work, but I like yours better.
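For reference, the "more vectorized algorithm" mentioned in the answer could look roughly like this (a sketch, not the answer's code: peak coordinates come from np.argwhere, and all center-to-peak distances are computed at once by broadcasting):

import numpy as np

def peak_distance_histograms(data, grid_size, bins=30):
    # (n, 2) array of (row, col) peak positions, equivalent to zip(*numpy.where(...)).
    peaks = np.argwhere(data > np.average(data))
    # All (test_x, test_y) grid centers as a (k, 2) array.
    xs, ys = np.meshgrid(np.arange(grid_size), np.arange(grid_size), indexing="ij")
    centers = np.stack([xs.ravel(), ys.ravel()], axis=1)
    # Pairwise center-to-peak distances via broadcasting: shape (k, n).
    d = np.linalg.norm(centers[:, None, :] - peaks[None, :, :], axis=-1)
    # One histogram per center.
    return [np.histogram(row, bins=bins) for row in d]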
