Calculate similarity of multi-dimensional arrays using fastdtw - Python

I have been trying to use fastdtw to calculate similarity.
Here is a working example; the similarity comes out as 0.916%.
import numpy as np
from scipy.spatial.distance import euclidean
from fastdtw import fastdtw

dataSetI = [1, 0.5, 2, 2]
dataSetII = [1, 1, 1, 0.51, 2, 1]
x = np.array(dataSetI)
y = np.array(dataSetII)
distance, path = fastdtw(x, y, dist=euclidean)
# similarity is derived from distance (derivation omitted here)
print("{:.3f}%".format(similarity))  # 0.916%
But the datasets I am going to compare are multidimensional arrays whose rows have varying lengths.
Example:
dataSetI = [[1, 0.5], [2, 2], []]
dataSetII = [[1, 1, 3, 5], [1, 0.51], [2, 1, 5, 6, 7]]
x = np.array(dataSetI)
y = np.array(dataSetII)
distance, path = fastdtw(x, y, dist=euclidean)
#error here
ValueError: setting an array element with a sequence.
So my question is: can I do this using fastdtw, or is there another library that can? Please let me know. Thanks.

No! Don't use FastDTW.
FastDTW is approximate and generally slower than the exact algorithm it approximates; see Renjie Wu, Eamonn J. Keogh, "FastDTW is Approximate and Generally Slower Than the Algorithm It Approximates", ICDE 2021, pp. 2327-2328.
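For ragged rows like those in the question, an element-wise Euclidean distance is undefined in the first place, so some per-element distance has to be chosen no matter which DTW implementation you use. Below is a minimal exact-DTW sketch (plain dynamic programming, not fastdtw); the mean_dist helper is purely hypothetical and only there to make the example run end to end:
import numpy as np

def dtw(seq_a, seq_b, dist):
    # classic O(n*m) dynamic-programming DTW over two sequences of "elements"
    n, m = len(seq_a), len(seq_b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(seq_a[i - 1], seq_b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

def mean_dist(a, b):
    # hypothetical per-element distance for rows of different lengths:
    # compare their means, and let an empty row match anything at zero cost
    if len(a) == 0 or len(b) == 0:
        return 0.0
    return abs(np.mean(a) - np.mean(b))

dataSetI = [[1, 0.5], [2, 2], []]
dataSetII = [[1, 1, 3, 5], [1, 0.51], [2, 1, 5, 6, 7]]
print(dtw(dataSetI, dataSetII, mean_dist))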

Related

Scipy KDTree get rectangular subset of grid defined by two points

I am starting from the following example:
import numpy as np
from scipy import spatial
x, y = np.mgrid[0:5, 2:8]
tree = spatial.KDTree(list(zip(x.ravel(), y.ravel())))
pts = np.array([[0, 0], [2.1, 2.9]])
idx = tree.query(pts)[1]
data = tree.data[??????????]
If I input two arbitrary points (see the variable pts), I want to return all grid coordinates that lie within the rectangle defined by the two points (the KDTree query snaps each point to its closest neighbour on the grid). So in this case:
array([[0, 0],
[0, 1],
[0, 2],
[1, 0],
[1, 1],
[1, 2],
[2, 0],
[2, 1],
[2, 2]])
How can I achieve that from the tree data?
It seems I found a solution:
from scipy import spatial
import numpy as np
x, y = np.mgrid[0:5, 0:5]
tree = spatial.KDTree(list(zip(x.ravel(), y.ravel())))
pts = np.array([[0, 0], [2.1, 2.2]])
idx = tree.query(pts)[1]
data = tree.data[[idx[0], idx[1]]]
rectangle = tree.data[np.where(
    (tree.data[:, 0] >= min(data[:, 0])) & (tree.data[:, 0] <= max(data[:, 0])) &
    (tree.data[:, 1] >= min(data[:, 1])) & (tree.data[:, 1] <= max(data[:, 1])))]
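For reference, here is an equivalent but slightly more compact variant of the same masking step (a sketch that reuses tree and data from the snippet above; it still only uses query to snap the two corner points):
lo = data.min(axis=0)   # lower-left corner of the snapped rectangle
hi = data.max(axis=0)   # upper-right corner
rectangle = tree.data[np.all((tree.data >= lo) & (tree.data <= hi), axis=1)]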
However, I would love to see a solution using the query option!

How can I select a row from a SparseTensor in TensorFlow?

Say, if I have two SparseTensors as following:
[[1, 0, 0, 0],
[2, 0, 0, 0],
[1, 2, 0, 0]]
and
[[1.0, 0, 0, 0],
[1.0, 0, 0, 0],
[0.3, 0.7, 0, 0]]
and I want to extract the first two rows out of them. I need both indices and values of the non-zero entries as SparseTensors so that I can pass the result to tf.nn.embedding_lookup_sparse. How can I do this?
My application is:
I want to use word embeddings, which is quite straightforward in TensorFlow. But now I want to use sparse embeddings, i.e. common words have their own embeddings, while the embedding of a rare word is a sparse linear combination of the embeddings of common words.
So I need two cookbooks to indicate how the sparse embeddings are composed. In the example above, the cookbook says: for the first word, its embedding consists of its own embedding with weight 1.0. The same goes for the second word. For the last word, it says: the embedding of this word is a linear combination of the embeddings of the first two words, with weights 0.3 and 0.7 respectively.
I need to extract a row, then feed the indices and weights to tf.nn.embedding_lookup_sparse to obtain the final embeddings. How can I do that in TensorFlow?
Or do I need to work around it, i.e. preprocess my data and handle the cookbook outside of TensorFlow?
I checked in with one of the engineers here who knows more about this area, and here's what he passed on:
I am not sure if we have an efficient implementation of this, but here is a not-so-optimal implementation using the dynamic_partition and gather ops.
import tensorflow as tf

def sparse_slice(indices, values, needed_row_ids):
    num_rows = tf.shape(indices)[0]
    partitions = tf.cast(tf.equal(indices[:, 0], needed_row_ids), tf.int32)
    rows_to_gather = tf.dynamic_partition(tf.range(num_rows), partitions, 2)[1]
    slice_indices = tf.gather(indices, rows_to_gather)
    slice_values = tf.gather(values, rows_to_gather)
    return slice_indices, slice_values

with tf.Session().as_default():
    indices = tf.constant([[0, 0], [1, 0], [2, 0], [2, 1]])
    values = tf.constant([1.0, 1.0, 0.3, 0.7], dtype=tf.float32)
    needed_row_ids = tf.constant([1])
    slice_indices, slice_values = sparse_slice(indices, values, needed_row_ids)
    print(slice_indices.eval(), slice_values.eval())
Update:
The engineer sent an example that handles multiple rows too, thanks for pointing that out!
def sparse_slice(indices, values, needed_row_ids):
    needed_row_ids = tf.reshape(needed_row_ids, [1, -1])
    num_rows = tf.shape(indices)[0]
    partitions = tf.cast(tf.reduce_any(
        tf.equal(tf.reshape(indices[:, 0], [-1, 1]), needed_row_ids), 1), tf.int32)
    rows_to_gather = tf.dynamic_partition(tf.range(num_rows), partitions, 2)[1]
    slice_indices = tf.gather(indices, rows_to_gather)
    slice_values = tf.gather(values, rows_to_gather)
    return slice_indices, slice_values

with tf.Session().as_default():
    indices = tf.constant([[0, 0], [1, 0], [2, 0], [2, 1]])
    values = tf.constant([1.0, 1.0, 0.3, 0.7], dtype=tf.float32)
    needed_row_ids = tf.constant([0, 2])
    slice_indices, slice_values = sparse_slice(indices, values, needed_row_ids)
    print(slice_indices.eval(), slice_values.eval())
Let sp be the name of your 2d SparseTensor. You can first create an indicator tensor for the rows of your SparseTensor that you want to extract, namely
mask = tf.concat([tf.constant([True, True]),
                  tf.fill([sp.dense_shape[0] - 2], False)], axis=0)
Next use tf.gather to propagate this to the sparse indices:
mask_sp = tf.gather(mask, sp.indices[:, 0])
Finally,
values = tf.boolean_mask(sp.values, mask_sp)
indices = tf.boolean_mask(sp.indices, mask_sp)
dense_shape = [sp.dense_shape[0] - 2, sp.dense_shape[1]]
output_sp = tf.SparseTensor(indices=indices, values=values, dense_shape=dense_shape)
Shouldn't it behave more like this:
This version keeps the order and frequency of the indices in selected_indices and therefore makes it possible to, e.g., select the same row multiple times:
import tensorflow as tf
tf.enable_eager_execution()

def sparse_gather(indices, values, selected_indices, axis=0):
    """
    indices: [[idx_ax0, idx_ax1, idx_ax2, ..., idx_axk], ..., []]
    values: [value1, value2, ..., valuen]
    """
    mask = tf.equal(indices[:, axis][tf.newaxis, :], selected_indices[:, tf.newaxis])
    to_select = tf.where(mask)[:, 1]
    return tf.gather(indices, to_select, axis=0), tf.gather(values, to_select, axis=0)

indices = tf.constant([[1, 0], [2, 0], [3, 0], [7, 0]])
values = tf.constant([1.0, 2.0, 3.0, 7.0], dtype=tf.float32)
needed_row_ids = tf.constant([7, 3, 2, 2, 3, 7])
slice_indices, slice_values = sparse_gather(indices, values, needed_row_ids)
print(slice_indices, slice_values)
I tried the answer by "Pete Warden", which only worked for small data. Given a SparseTensor A with m non-zero elements from which we would like to take out n rows, the tf.equal mask takes m*n space, which is not acceptable for my task.
My suggestion is to use scipy.sparse instead of TensorFlow.
In detail (see the sketch below):
Pull all the data out of TF, indices and values, and build a scipy.sparse matrix in COO format.
If you need to take out rows, convert to CSR format; if you need to take out columns (e.g. A[:, m]), convert to CSC.
Do the slicing, convert the result back to COO, and then transform it back to a TF SparseTensor.
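Here is a sketch of that scipy.sparse route, assuming the SparseTensor's indices and values have already been pulled out of TensorFlow as numpy arrays (e.g. via sess.run or .numpy()); the example data mirrors the question:
import numpy as np
import scipy.sparse as sps

indices = np.array([[0, 0], [1, 0], [2, 0], [2, 1]])   # from the SparseTensor
values = np.array([1.0, 1.0, 0.3, 0.7])
dense_shape = (3, 4)

# build a COO matrix, then convert to CSR for fast row slicing
A = sps.coo_matrix((values, (indices[:, 0], indices[:, 1])), shape=dense_shape).tocsr()

rows = A[[0, 2], :].tocoo()                        # take out rows 0 and 2
new_indices = np.column_stack((rows.row, rows.col))
new_values = rows.data
# new_indices and new_values can now be fed back into tf.SparseTensor
print(new_indices, new_values)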

Efficiently select random matrix indices with given probabilities

I have a numpy array of probabilities, such as:
[[0.1, 0,    0.3 ],
 [0.2, 0,    0.05],
 [0,   0.15, 0.2 ]]
I want to select an element (e.g., select some indices (i,j)) from this matrix, with probability weighted according to this matrix. The actual matrices this will be working with are large (up to 1000x1000), so I'm looking for an efficient way to do this. This is my current solution:
def weighted_mat_choice(prob_mat):
"""
Randomly select indices of the matrix according to the probabilities in prob_mat
:param prob_mat: Normalized probabilities to select each element
:return: indices (i, j) selected
"""
inds_mat = [[(i, j) for j in xrange(prob_mat.shape[1])] for i in xrange(prob_mat.shape[0])]
inds_list = [item for sublist in inds_mat for item in sublist]
inds_of_inds = xrange(len(inds_list))
prob_list = prob_mat.flatten()
pick_ind_of_ind = np.random.choice(inds_of_inds, p=prob_list)
pick_ind = inds_list[pick_ind_of_ind]
return pick_ind
which is definitely not efficient. (Basically, linearizing the matrix, creating a list of index tuples, and then picking accordingly.) Is there a better way to do this selection?
You don't need a list of tuples to choose from. Just use an arange(n) array of flat indices, and convert the chosen index back to two dimensions with unravel_index().
import numpy as np
p = np.array(
[[0.1, 0, 0.3,],
[0.2, 0, 0.05],
[0, 0.15, 0.2]]
)
p_flat = p.ravel()
ind = np.arange(len(p_flat))
res = np.column_stack(
np.unravel_index(
np.random.choice(ind, p=p_flat, size=10000),
p.shape))
The result:
array([[0, 2],
[2, 2],
[2, 1],
...,
[1, 0],
[0, 2],
[0, 0]], dtype=int64)
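If you only want a single (i, j) pair, as in the original question, the same approach works without the size argument (reusing ind and p_flat from above):
i, j = np.unravel_index(np.random.choice(ind, p=p_flat), p.shape)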

Interpolating a 2D data grid in python

I have a 2D grid with radioactive beta-decay rates. Each value corresponds to a rate at a specific pair of temperature and density (both on a logarithmic scale). What I would like to do, given a temperature and density pair (after taking their logarithms), is find the matching value in the table. I tried using scipy's interpolate interpn function, but I got a little confused; I would be grateful for help.
What I have so far:
import numpy as np

pointsx = np.array([7+0.2*i for i in range(0,16)]) #temperature range
pointsy = np.array([i for i in range(0,11)]) #rho_el range
data = np.loadtxt(filename) #getting data from file
logT = np.log10(T) #wanted temperature, logarithmic
logrho = np.log10(rho) #wanted rho, logarithmic
The interpn function has the following arguments: points, values, xi, method='linear', bounds_error=True, fill_value=nan. I figure that points will be my pointsx and pointsy, values is quite obviously the data, and xi will be the (T, rho) pair I'm looking for. But I'm not sure what dimensions they should have. Does points have to be the same size as the data, so that I would have to build an array of the corresponding (T, rho) pairs for points, and then pass a single (T, rho) pair as xi?
When you aren't certain about how a function works, it's always a good idea to open up a REPL and test it yourself. In this case, the function works exactly as expected, given your understanding of the documentation.
>>> import scipy.interpolate
>>> points = [[1, 2, 3, 4], [1, 2, 3, 4]]  # Input values for each grid dimension
>>> values = [[1, 2, 3, 4], [2, 3, 4, 5], [3, 4, 5, 6], [4, 5, 6, 7]] # The grid itself
>>> xi = (1, 1.5)
>>> scipy.interpolate.interpn(points, values, xi)
array([ 1.5])
>>> xi = [[1, 1.5], [2, 1.5], [2, 2.5], [3, 2.5], [3, 3.5], [4, 3.5]]
>>> scipy.interpolate.interpn(points, values, xi)
array([ 1.5, 2.5, 3.5, 4.5, 5.5, 6.5])
The only thing you missed was that points is supposed to be a tuple. But as you can see from the above, it works even if points isn't a tuple.
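Applied to your own variables, the call would look roughly like this (a sketch; it assumes the table in filename has one row per temperature value, i.e. shape (16, 11), which you should verify):
import numpy as np
from scipy.interpolate import interpn

pointsx = np.array([7 + 0.2 * i for i in range(0, 16)])   # log T grid
pointsy = np.array([i for i in range(0, 11)])             # log rho grid
data = np.random.rand(16, 11)    # stand-in for np.loadtxt(filename)

logT, logrho = 7.3, 4.6          # hypothetical query point
rate = interpn((pointsx, pointsy), data, (logT, logrho))
print(rate)                      # one interpolated decay rate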

How do I "randomly" select numbers with a specified bias toward a particular number

How do I generate random numbers with a specified bias toward one number? For example, how would I pick between two numbers, 1 and 2, with a 90% bias toward 1? The best I can come up with is...
import random
print random.choice([1, 1, 1, 1, 1, 1, 1, 1, 1, 2])
Is there a better way to do this? The method I showed works in simple examples, but eventually I'll have to do more complicated selections with very specific biases (such as 37.65%), which would require a very long list.
EDIT:
I should have added that I'm stuck on numpy 1.6 so I can't use numpy.random.choice.
np.random.choice has a p parameter which you can use to specify the probability of the choices:
np.random.choice([1,2], p=[0.9, 0.1])
The algorithm used by np.random.choice() is relatively simple to replicate if you only need to draw one item at a time.
import numpy as np

def simple_weighted_choice(choices, weights, prng=np.random):
    running_sum = np.cumsum(weights)
    u = prng.uniform(0.0, running_sum[-1])
    i = np.searchsorted(running_sum, u, side='left')
    return choices[i]
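For example, a single biased draw with the helper above (the weights do not have to sum to 1, since the uniform draw is scaled by the final running sum):
print(simple_weighted_choice([1, 2], [0.9, 0.1]))   # prints 1 about 90% of the time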
For random sampling with replacement, the essential code in np.random.choice is
cdf = p.cumsum()
cdf /= cdf[-1]
uniform_samples = self.random_sample(shape)
idx = cdf.searchsorted(uniform_samples, side='right')
So we can use that in a new function that does the same thing (but without error checking and other niceties):
import numpy as np

def weighted_choice(values, p, size=1):
    values = np.asarray(values)
    cdf = np.asarray(p).cumsum()
    cdf /= cdf[-1]
    uniform_samples = np.random.random_sample(size)
    idx = cdf.searchsorted(uniform_samples, side='right')
    sample = values[idx]
    return sample
Examples:
In [113]: weighted_choice([1, 2], [0.9, 0.1], 20)
Out[113]: array([1, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1])
In [114]: weighted_choice(['cat', 'dog', 'goldfish'], [0.3, 0.6, 0.1], 15)
Out[114]:
array(['cat', 'dog', 'cat', 'dog', 'dog', 'dog', 'dog', 'dog', 'dog',
'dog', 'dog', 'dog', 'goldfish', 'dog', 'dog'],
dtype='|S8')
Something like this should do the trick, working with arbitrary floating-point probabilities without creating an intermediate array.
import random
from itertools import accumulate  # for python 3.x

def accumulate(l):  # for python 2.x
    tmp = 0
    for n in l:
        tmp += n
        yield tmp

def random_choice(a, p):
    sums = sum(p)
    accum = accumulate(p)  # cumulative list of probabilities
    accum = [n / sums for n in accum]  # normalize
    rnd = random.random()
    for i, item in enumerate(accum):
        if rnd < item:
            return a[i]
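For example:
print(random_choice([1, 2], [0.9, 0.1]))   # returns 1 about 90% of the time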
Another easy option is to look up an index in a table of cumulative probabilities.
Make a table of cumulative weights, for as many weights as you need, looking for example like this:
prb = [0.5, 0.65, 0.8, 1]
Get the index with something like this:
def get_in_range(prb, pointer):
    """Returns index of the matching range in table prb"""
    found = 0
    for p in prb:
        if pointer > p:
            found += 1
    return found
The index returned by get_in_range can then be used to pick from a corresponding table of values.
Example usage:
import random
values = [1, 2, 3]
weights = [0.9, 0.99, 1]
result = values[get_in_range(weights, random.random())]
This chooses 1 with probability 90%, 2 with 9%, and 3 with 1%.
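A quick empirical check of those numbers (a sketch, reusing values, weights and get_in_range from above):
from collections import Counter
draws = (values[get_in_range(weights, random.random())] for _ in range(100000))
print(Counter(draws))   # expect roughly 90000 ones, 9000 twos, 1000 threes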
