Two Issues about mlpy.dtw package in Python?

As a newbie to Dynamic Time Warping (DTW), I find that its Python implementation mlpy.dtw is not documented in much detail. I have some questions about its return value.
Regarding the returned value dist, I have two questions:
Is there a typo here? For standard DTW, the documentation says
Standard DTW as described in [Muller07], using the Euclidean distance
(absolute value of the difference) or squared Euclidean distance (as
in [Keogh01]) as local cost measure.
and for subsequence DTW, the document says
Subsequence DTW as described in [Muller07], assuming that the length
of y is much larger than the length of x and using the Manhattan
distance (absolute value of the difference) as local cost measure.
Does the same so-called "absolute value of the difference" correspond to two different distance metrics?
Total distance? After running the snippet
dist, cost, path = mlpy.dtw_std(x, y, dist_only=False)
dist is a single value. Is it the accumulated sum of the distances between all matched pairs?

Yes, the mlpy.dtw() function is not well documented.
First question: no typo here. As you can see in the documentation, the Euclidean, squared Euclidean and Manhattan distances refer to the local cost measure. In this case the cost measure is defined as a distance between two real values (one dimension); see cost in the pseudocode at http://en.wikipedia.org/wiki/Dynamic_time_warping. So, in one dimension, the Manhattan distance and the Euclidean distance are the same (http://en.wikipedia.org/wiki/Euclidean_distance#One_dimension). In any case, in the standard DTW you can choose between the Euclidean distance (absolute value of the difference) and the squared Euclidean distance (squared difference) via the squared parameter:
>>> import mlpy
>>> mlpy.dtw_std([1,2,3], [4,5,6], squared=False) # Euclidean distance
9.0
>>> mlpy.dtw_std([1,2,3], [4,5,6], squared=True) # Squared Euclidean distance
26.0
Second question: dist is the cost of the unnormalized minimum-distance warp path between the time series x and y, i.e. the unnormalized DTW distance. You can normalize it by dividing by len(x) + len(y). See http://www.irit.fr/~Julien.Pinquier/Docs/TP_MABS/res/dtw-sakoe-chiba78.pdf
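As a quick illustration of that normalization (a minimal sketch, assuming the mlpy.dtw_std call shown in the question; the toy series are arbitrary):
import mlpy
x = [1, 2, 3]
y = [4, 5, 6, 7]
dist = mlpy.dtw_std(x, y)                # unnormalized DTW distance
norm_dist = dist / (len(x) + len(y))     # normalization suggested above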
Cheers,
Davide

It seems to be an error in the documentation. The Euclidean distance is not the "absolute value of the difference"; that is the correct description of the Manhattan metric. The author was probably thinking of the one-dimensional case, where the Euclidean and Manhattan metrics coincide (and the Euclidean metric really is the absolute value of the difference). I am not familiar with the library; if it only operates on one-dimensional objects, then there is no error and the two distance measures are equivalent.
The dist value is the cost of the best time warp (measured as the summed costs of the matches; see the algorithm definition on Wikipedia). So it is in fact the minimum edit distance between the two sequences, where the cost of each particular edit is the dissimilarity (distance) between the "matched" objects.
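To make that concrete, here is a small check I would run (my own sketch, not from the answer, assuming mlpy's documented (path_x, path_y) return format): with squared=False, dist should equal the sum of |x[i] - y[j]| over the matched pairs on the warp path.
import numpy as np
import mlpy
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 2.0, 5.0])
dist, cost, path = mlpy.dtw_std(x, y, dist_only=False)
px, py = path                              # index arrays into x and y
print(dist, np.abs(x[px] - y[py]).sum())   # the two values should agree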

Related

What unit is the distance in when using cKDTree?

I am trying to calculate the distance between the closest points in two geodataframes.
I used the function created by jHUW here. The function is as follows:
import numpy as np
import pandas as pd
from scipy.spatial import cKDTree

def ckdnearest(gdA, gdB):
    nA = np.array(list(gdA.geometry.apply(lambda x: (x.x, x.y))))
    nB = np.array(list(gdB.geometry.apply(lambda x: (x.x, x.y))))
    btree = cKDTree(nB)
    dist, idx = btree.query(nA, k=1)
    gdB_nearest = gdB.iloc[idx].drop(columns="geometry").reset_index(drop=True)
    gdf = pd.concat(
        [
            gdA.reset_index(drop=True),
            gdB_nearest,
            pd.Series(dist, name='dist')
        ],
        axis=1)
    return gdf
It's working fine between my datasets, but I was wondering what unit the returned distance is in. I did some research and found that the unit will be the same as the unit of the array used. I used an array of lat-lons, like so:
array([[-122.3295182, 47.6202074],
[-122.296276 , 37.8789939],
[-122.6857603, 45.5289172],
[-118.3804073, 33.9017057],
[ -93.2911788, 44.860997 ]])
I tried to find out what the units of lat-lons would be, but was unsuccessful. I also checked the distance between some of the point pairs on Google Maps to get some insight, but couldn't make sense of the results. For instance, Google Maps shows a distance of 1.5 miles for my first pair, but the distance returned by the function is 0.0087466. I understand that cKDTree calculates the Euclidean distance, but even then the difference seems quite large. Please provide some insight if you have any.
The result from Scipy is indeed an L2 norm (aka Euclidean distance). The meaning of this distance depends on the chosen coordinate system. In your case, you appear to be using a geographic coordinate system (which is a spherical coordinate system). As a result, the coordinates are angles and cannot be linearly converted to metres (for example, in Antarctica a change in longitude barely changes the distance in metres). Additionally, you need to consider the curvature of the space while computing the distance: what we experience as a straight line on Earth is a geodesic in your geographic space, and the L2 norm computed by Scipy does not account for this. In fact, using this metric probably gives wrong results: the L2 norm over-estimates the actual geodesic distance (whether measured in metres or radians) near the poles compared to the equator. This means two points near the North Pole can be considered as close as two points located in Japan and Europe respectively. Thus, you certainly need to use a better metric. As for the unit of the distance, it does not make much sense, mainly because of this issue. One relatively good metric would be the length of the geodesic (possibly in metres) or the angle between the two points. Unfortunately, AFAIK this is not possible with Scipy... Using a GIS library (like GDAL) may help.
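If an approximate great-circle distance is enough, one workable option (a sketch of my own, not part of the original answer) is to replace cKDTree with sklearn's BallTree using the haversine metric, which expects [lat, lon] in radians and returns distances in radians; the helper name and Earth radius below are my own choices, and gdA/gdB are assumed to hold lon/lat points in degrees.
import numpy as np
from sklearn.neighbors import BallTree

def nearest_haversine(gdA, gdB, earth_radius_m=6371000):
    # BallTree's haversine metric expects [lat, lon] in radians
    A = np.radians(np.array([(p.y, p.x) for p in gdA.geometry]))
    B = np.radians(np.array([(p.y, p.x) for p in gdB.geometry]))
    tree = BallTree(B, metric="haversine")
    dist_rad, idx = tree.query(A, k=1)
    # great-circle distances in metres, plus indices into gdB
    return dist_rad[:, 0] * earth_radius_m, idx[:, 0]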

Is squared Euclidean distance an admissible heuristic?

I have a grid such as the one shown in the picture. So far I have implemented the functions below as my heuristics. The goal of the game is to collect all the numbers placed on this grid, starting from point A.
Manhattan distance to each number, taking the maximum of these as the heuristic:
distance = abs(A_x - x_i) + abs(A_y - y_i)
if distance > manhMax:
    manhMax = distance
Summation of all the Manhattan distances to the numbers placed. (This, I presume, is not admissible, since it overestimates the cost to reach the goal; correct me if I am wrong.)
My problem is that the first method expands more states than I need, and the second method is not admissible. I am implementing my own heuristic at the moment.
I came up with the idea of using the squared Euclidean distances along the path from A to 2, then to 1, then to 3 and 0. There is no required order in which to collect the numbers. The trouble is that plain Euclidean distance expands too many states, even though it is admissible. Could you help me with a suitable distance or method to achieve my task?
Thank you!
I suspect you're having trouble interpreting your approach, because this problem does not have the simple, single-goal paradigm used to define "admissible". Rather, you have a small TSP (Traveling Salesman Problem), in which you can collect the items in any of 4! orders. You didn't describe which distance you used in your approach, but no simplistic computation will do. Adding all 10 distances (for n=5 nodes; 5x4 / 2) is simply over-spending, as you will traverse only 4 of those edges. Similarly, adding the distances from A to each of the items is wrong.
Instead, you need to apply the heuristic to each edge of the path, and then generate the heuristic estimate for the four-edge path under consideration. For this treatment, Euclidean distance is admissible -- but its square is not: it seriously overestimates, and is in the wrong units (area, rather than distance). Manhattan distance will likely be better for you.
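As a rough illustration of that per-edge treatment (my own sketch, not the answer's code; remaining_path_heuristic, start and items are hypothetical names, and math.dist needs Python 3.8+), one admissible estimate is the cheapest Euclidean length over all visiting orders of the remaining items:
from itertools import permutations
from math import dist   # Euclidean distance between two points

def remaining_path_heuristic(start, items):
    best = float("inf")
    for order in permutations(items):
        total, prev = 0.0, start
        for p in order:
            total += dist(prev, p)   # one edge of this candidate path
            prev = p
        best = min(best, total)
    return best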
Note that you have a problem case in this example, as edge 3-0 will be underestimated by a factor of 4 to 5, depending on your heuristic.

How to define a range of values for the eps parameter of sklearn.cluster.DBSCAN?

I want to use DBSCAN with the metric sklearn.metrics.pairwise.cosine_similarity to cluster points that have cosine similarity close to 1 (i.e. whose vectors (from "the" origin) are parallel or almost parallel).
The issue:
eps is the maximum distance between two samples for them to be considered as in the same neighbourhood by DBSCAN - meaning that if the distance between two points is lower than or equal to eps, these points are considered neighbours;
but
sklearn.metrics.pairwise.cosine_similarity spits out values between -1 and 1 and I want DBSCAN to consider two points to be neighbours if the distance between them is, say, between 0.75 and 1 - i.e. greater than or equal to 0.75.
I see two possible solutions:
pass a range of values to the eps parameter of DBSCAN e.g. eps=[0.75,1]
Pass the value eps=-0.75 to DBSCAN but (somehow) force it to use the negative of the cosine similarities matrix that is spit out by sklearn.metrics.pairwise.cosine_similarity
I do not know how to implement either of these.
Any guidance would be appreciated!
DBSCAN has a metric keyword argument. Docstring:
metric : string, or callable
The metric to use when calculating distance between instances in a
feature array. If metric is a string or callable, it must be one of
the options allowed by metrics.pairwise.calculate_distance for its
metric parameter.
If metric is "precomputed", X is assumed to be a distance matrix and
must be square. X may be a sparse matrix, in which case only "nonzero"
elements may be considered neighbors for DBSCAN.
So probably the easiest thing to do is to precompute a distance matrix using cosine similarity as your distance metric, preprocess the distance matrix such that it fits your bespoke distance criterion (probably something like D = np.abs(np.abs(CD) -1), where CD is your cosine distance matrix), and then set metric to precomputed, and pass the precomputed distance matrix D in for X, i.e. the data.
For example:
#!/usr/bin/env python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import DBSCAN
total_samples = 1000
dimensionality = 3
points = np.random.rand(total_samples, dimensionality)
cosine_distance = cosine_similarity(points)
# option 1) vectors are close to each other if they are parallel
bespoke_distance = np.abs(np.abs(cosine_distance) -1)
# option 2) vectors are close to each other if they point in the same direction
bespoke_distance = np.abs(cosine_distance - 1)
results = DBSCAN(metric='precomputed', eps=0.25).fit(bespoke_distance)
A) check out Generalized DBSCAN which works fine with similarities too. With cosine, sklearn will supposedly be slow anyway.
B) you can trivially use: cosine distance = 1 - cosine similarity. But that may well cause the sklearn implementation to run in O(n²).
C) you supposedly can even pass -cosinesimilarity as precomputed distance matrix and use -0.75 as eps.
D) just make a binary distance matrix (in O(n²) memory, though, so slow), where distance = 0 if the cosine similarity is larger than your threshold, and 1 otherwise. Then use DBSCAN with eps=0.5. It is trivial to show that distance < eps if and only if similarity > threshold.
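A minimal sketch of option D (my own illustration; the 0.75 threshold, toy data and min_samples value are assumptions):
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics.pairwise import cosine_similarity

X = np.random.rand(100, 3)                         # toy data
sim = cosine_similarity(X)
threshold = 0.75
binary_dist = (sim <= threshold).astype(float)     # 0 if similar enough, else 1
labels = DBSCAN(metric="precomputed", eps=0.5, min_samples=5).fit_predict(binary_dist)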
A few options:
dist = np.abs(cos_sim - 1) accepted answer here
dist = np.arccos(cos_sim) / np.pi https://math.stackexchange.com/a/3385463/816178
dist = 1 - (sim + 1) / 2 https://math.stackexchange.com/q/3241174/816178
I've found they all work the same in practice for this application (precomputed distances in hierarchical clustering; I've hit this snag too). As I understand it, #2 is the more mathematically correct approach, since it preserves angular distance.
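For reference, a minimal sketch of option #2 used as a precomputed metric (my own illustration; the toy data and eps value are assumptions, and the clip is only for numerical safety before arccos):
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import DBSCAN

X = np.random.rand(200, 3)
cos_sim = np.clip(cosine_similarity(X), -1.0, 1.0)
ang_dist = np.arccos(cos_sim) / np.pi              # angular distance in [0, 1]
labels = DBSCAN(metric="precomputed", eps=0.1).fit_predict(ang_dist)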

Distance metric for n binary vectors

I have n and m binary vectors (of length 1500) from sets A and B respectively.
I need a metric that can say how similar (a kind of distance metric) those n vectors and m vectors are.
The output should be total_distance_of_n_vectors and total_distance_of_m_vectors.
And if total_distance_of_n_vectors > total_distance_of_m_vectors, it means set B has more similar vectors than set A.
Which metric should I use? I thought of Jaccard similarity, but I am not able to put it in this context. Should I find the distance of each vector against every other to get the total distance, or something else?
There are two concepts relevant to your question, which you should consider separately.
Similarity Measure:
Independent of your scoring mechanism, you should find a similarity measure which suits your data best. It can be the Euclidean distance (not well suited to a 1500-dimensional space), a cosine (dot-product based) distance, or a Hamming distance (assuming your input features are completely independent, which is rarely the case).
A lot can go on in your distance function, and you should find one which makes sense for your data.
Scoring Mechanism:
You mention total_distance_of_vectors in your question, which probably is not what you want. If n >> m, the total sum of distances for the n vectors will almost certainly be larger than the total distance for the m vectors.
What you're looking for is most probably the average of the distances between the members of each set. Then, depending on whether you want your average to be sensitive to outliers or not, you can go for the average of the distances or the average of the squared distances.
If you want to dig deeper, you can also get the mean and variance of the distances within the two sets and compare the distributions.
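A minimal sketch of the "average distance within each set" idea (my own illustration; the set sizes are made up, and Hamming could be substituted for Jaccard):
import numpy as np
from scipy.spatial.distance import pdist

A = np.random.randint(0, 2, size=(40, 1500))    # n binary vectors
B = np.random.randint(0, 2, size=(60, 1500))    # m binary vectors

avg_dist_A = pdist(A, metric="jaccard").mean()  # average over all pairs within A
avg_dist_B = pdist(B, metric="jaccard").mean()  # same for B
# the set with the smaller average distance contains the more similar vectors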

Wrap-around when calculating distance for k-means

I'm trying to do a K-means clustering of some dataset using sklearn. The problem is that one of the dimensions is hour-of-day: a number from 0-23, and so the distance algorithm then thinks that 0 is very far from 23, because in absolute terms it is. In reality and for my purposes, hour 0 is very close to hour 23. Is there a way to make the distance algorithm do some form of wrap-around so it computes the more 'real' time difference?
I'm doing something simple, similar to the following:
from numpy import vstack
from sklearn.cluster import KMeans

clusters = KMeans(n_clusters=2)
data = vstack(data)
fit = clusters.fit(data)
classes = fit.predict(data)
data elements looks something like [22, 418, 192] where the first element is the hour.
Any ideas?
Even though @elyase's answer is accepted, I think it is not the correct approach.
Yes, to use such a distance you have to refine your distance measure and so use a different library. But what is more important, the concept of the mean used in k-means won't suit the cyclic dimension. Let's consider the following example:
# current cluster X, based on centroid position Xc=24
x1 = 1
x2 = 24
# current cluster Y, based on centroid position Yc=10
y1 = 12
y2 = 13
Computing the simple arithmetic mean will place the centroids at Xc=12.5, Yc=12.5, which from the point of view of the cyclic measure is incorrect; it should be Xc=0.5, Yc=12.5. As you can see, assignment based on the cyclic distance measure is not "compatible" with the simple mean operation, and leads to bizarre results:
Simple k-means will result in clusters {x1,y1}, {x2,y2}
Simple k-means + cyclic distance measure will result in a degenerate super cluster {x1,x2,y1,y2}
The correct clustering would be {x1,x2}, {y1,y2}
Solving this problem requires checking, for each point, whether it is better to use it as-is or to represent it as x' = x - 24 when computing the "simple average". Unfortunately, given n points this makes 2^n possibilities.
This looks like a use case for kernelized k-means, where you actually cluster in the abstract feature space (in your case, a "tube" rolled around the time dimension) induced by the kernel (a "similarity measure" that is the inner product of some vector space).
Details of the kernel k-means are given here
Why k-means doesn't work with arbitrary distances
K-means is not a distance-based algorithm.
K-means minimizes the Within-Cluster-Sum-of-Squares, which is a kind of variance (it's roughly the weighted average variance of all clusters, where each object and dimension is given the same weight).
In order for Lloyds algorithm to converge you need to have both steps optimize the same function:
the reassignment step
the centroid update step
Now the "mean" function is a least-squares estimator. I.e. choosing the mean in step 2 is optimal for the WCSS objective. Assigning objects by least-squares deviation (= squared Euclidean distance, monotone to Euclidean distance) in step 1 also yields guaranteed convergence. The mean is exactly where your wrap-around idea would fall apart.
If you plug in a random other distance function as suggested by @elyase, k-means might no longer converge.
Proper solutions
There are various solutions to this:
Use K-medoids (PAM). By choosing the medoid instead of the mean you do get guaranteed convergence with arbitrary distances. However, computing the medoid is rather expensive.
Transform the data into a kernel space where you are happy with minimizing Sum-of-Squares. For example, you could transform the hour into sin(hour / 12 * pi), cos(hour / 12 * pi) which may be okay for SSQ (see the sketch after this list).
Use other, distance-based clustering algorithms. K-means is old, and there has been a lot of research on clustering since. You may want to start with hierarchical clustering (which actually is just as old as k-means), and then try DBSCAN and the variants of it.
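A minimal sketch of that sin/cos transformation (my own illustration; the toy rows mimic the [hour, feat1, feat2] shape from the question, and feature scaling is left out for brevity):
import numpy as np
from sklearn.cluster import KMeans

data = np.array([[22, 418, 192],
                 [1, 300, 150],
                 [12, 410, 200]], dtype=float)    # toy rows: [hour, feat1, feat2]

hour = data[:, 0]
hour_sin = np.sin(hour / 12 * np.pi)              # hour mapped onto the unit circle
hour_cos = np.cos(hour / 12 * np.pi)
features = np.column_stack([hour_sin, hour_cos, data[:, 1:]])

labels = KMeans(n_clusters=2, n_init=10).fit_predict(features)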
The easiest approach, to me, is to adapt the K-means algorithm to the wrap-around dimension by computing a "circular mean" for that dimension. Of course, you will also need to change the distance-to-centroid calculation accordingly.
# compute the circular mean of hour 0 and hour 23
import numpy as np
hours = np.array(range(24))
# hours to angles
angles = hours / 24 * (2 * np.pi)
sin = np.sin(angles)
cos = np.cos(angles)
a = np.arctan2(sin[23] + sin[0], cos[23] + cos[0])
if a < 0: a += 2 * np.pi
# angle back to hour
hour = a * 24 / (2 * np.pi)
# hour == 23.5
