Hierarchical clustering of different length time series data using scipy and DTW - python

I have a set of time series data having different lengths and I am trying to cluster them using Dynamic Time Warping (DTW).
For completeness, I am using this simple implementation of DTW:
from math import sqrt

def DTWDistance(s1, s2):
    DTW = {}
    for i in range(len(s1)):
        DTW[(i, -1)] = float('inf')
    for i in range(len(s2)):
        DTW[(-1, i)] = float('inf')
    DTW[(-1, -1)] = 0
    for i in range(len(s1)):
        for j in range(len(s2)):
            dist = (s1[i] - s2[j]) ** 2
            DTW[(i, j)] = dist + min(DTW[(i-1, j)], DTW[(i, j-1)], DTW[(i-1, j-1)])
    return sqrt(DTW[(len(s1)-1, len(s2)-1)])
and I have my time series data of the form
timeseries = [[0, 1, 2, 3, 4, 5, 6, 7, 8],
              [0, 0, 1, 2, 3, 4],
              [6, 7, 8, 9, 10, 11, 12, 13, 14],
              [15, 14, 13, 14, 15, 16, 17, 18, 19]]
When I try
import scipy.cluster.hierarchy as hac
Z = hac.linkage(timeseries, method='complete', metric=DTWDistance)
I get "ValueError: setting an array element with a sequence.", which is understandable because the scipy.cluster.hierarchy.linkage documentation says
...a collection of m observation vectors in n dimensions may be passed
as an m by n array. All elements of the condensed distance matrix must
be finite, i.e. no NaNs or infs.
Clearly my input doesn't fulfil this specification. What is the correct approach to clustering time series data of different lengths?
EDIT 1
A simple workaround would be to pad the shorter time series with zeros so that an m x n matrix can be obtained as required, but I am not sure whether this would alter the semantics of the time series.
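A sketch of a further workaround, using DTWDistance and timeseries as defined above: linkage also accepts a precomputed condensed distance matrix, so the m-by-n requirement can be bypassed by computing the pairwise DTW distances directly and clustering on those:

import numpy as np
import scipy.cluster.hierarchy as hac

# Build the condensed (upper-triangular) pairwise DTW distance matrix
# and pass it to linkage directly, bypassing the m-by-n observation format.
n = len(timeseries)
condensed = np.array([DTWDistance(timeseries[i], timeseries[j])
                      for i in range(n) for j in range(i + 1, n)])

Z = hac.linkage(condensed, method='complete')

The resulting linkage matrix Z can then be cut into flat clusters with hac.fcluster as usual.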

Related

randomly split data in n groups?

I am currently trying to write code for splitting given data into a number of groups.
The groups should be created randomly, and together they should cover the entire data.
So let's suppose there's an array A of, e.g., shape (3, 3, 3) that has 27 root elements e:
array([[[ 0,  1,  2],
        [ 3,  4,  5],
        [ 6,  7,  8]],

       [[ 9, 10, 11],
        [12, 13, 14],
        [15, 16, 17]],

       [[18, 19, 20],
        [21, 22, 23],
        [24, 25, 26]]])
I want to create n groups such that g1 & g2 & ... & gn will "add up" to the original array A.
I shuffled A as follows:
def shuffle(array):
    shuf = array.ravel()
    np.random.shuffle(shuf)
    return np.reshape(shuf, array.shape)
But how do I create n groups (n < e) randomly?
Thanks!
Leo
Though not so elegant, the following code will split the array into n groups, ensuring each group has at least one element, and spread the remaining elements randomly.
import numpy as np

def shuffle_and_group(array, n):
    shuf = array.ravel()
    np.random.shuffle(shuf)
    shuf = list(shuf)
    groups = []
    for i in range(n):  # ensure no group is empty
        groups.append([shuf.pop()])
    for num in shuf:  # spread the remaining elements randomly
        groups[np.random.randint(n)].append(num)
    return groups

array = np.arange(15)
print(shuffle_and_group(array, 9))
In case you are worried about running time: the code has time complexity O(e), where e is the number of elements.
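If roughly equal group sizes are acceptable (a different contract from the random group sizes above), a shorter sketch along the same lines is to shuffle a flat copy and let numpy split it into n chunks; shuffle_and_split below is a hypothetical helper name, not part of the answer above:

import numpy as np

def shuffle_and_split(array, n):
    # Shuffle a flat copy so the original array is left untouched,
    # then split it into n nearly equal groups.
    shuf = array.ravel().copy()
    np.random.shuffle(shuf)
    return np.array_split(shuf, n)

print(shuffle_and_split(np.arange(15), 9))

np.array_split also yields n non-empty groups here, since n is smaller than the number of elements.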

Is there a way to conditionally index 3D-numpy array?

Having an array A with the shape (2, 6, 60), is it possible to index it based on a binary array B of shape (6,)?
The 6 and 60 are quite arbitrary; they simply describe the 2D data I wish to access.
The underlying thing I am trying to do is to calculate two variants of the 2D data (in this case, (6,60)) and then efficiently select the ones with the lowest total sum - that is where the binary (6,) array comes from.
Example: For B = [1,0,1,0,1,0] what I wish to receive is equal to stacking
A[1,0,:]
A[0,1,:]
A[1,2,:]
A[0,3,:]
A[1,4,:]
A[0,5,:]
but I would like to do it by direct indexing and not a for-loop.
I have tried A[B], A[:,B,:], A[B,:,:], and A[:,:,B], with none of them providing the desired (6, 60) matrix.
import numpy as np

A = np.array([[4, 4, 4, 4, 4, 4], [1, 1, 1, 1, 1, 1]])
A = np.atleast_3d(A)
A = np.tile(A, (1, 1, 60))
B = np.array([1, 0, 1, 0, 1, 0])
A[B]
The expected result is a (6, 60) array containing the elements from A as described above; what I receive instead is either (2, 6, 60) or (6, 6, 60).
Thank you in advance,
Linus
You can generate a range of the indices you want to iterate over, in your case from 0 to 5:
count = A.shape[1]
indices = np.arange(count) # np.arange(6) for your particular case
>>> indices
array([0, 1, 2, 3, 4, 5])
And then you can use that to do your advanced indexing:
result_array = A[B[indices], indices, :]
If you always use the full range from 0 to length - 1 (i.e. 0 to 5 in your case) of the second axis of A in increasing order, you can simplify that to:
result_array = A[B, indices, :]
# or the ugly result_array = A[B, np.arange(A.shape[1]), :]
Or even this if it's always 6:
result_array = A[B, np.arange(6), :]
An alternative solution uses np.take_along_axis (available since NumPy 1.15; see the docs):
import numpy as np
x = np.arange(2*6*6).reshape((2,6,6))
m = np.zeros(6, int)
m[0] = 1
#example: [1, 0, 0, 0, 0, 0]
np.take_along_axis(x, m[None, :, None], 0) #add dimensions to mask to match array dimensions
array([[[36, 37, 38, 39, 40, 41],
        [ 6,  7,  8,  9, 10, 11],
        [12, 13, 14, 15, 16, 17],
        [18, 19, 20, 21, 22, 23],
        [24, 25, 26, 27, 28, 29],
        [30, 31, 32, 33, 34, 35]]])
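Note that the expanded mask keeps a leading length-1 axis in the result (shape (1, 6, 6) above); dropping it recovers the 2-D array, e.g.:

result = np.take_along_axis(x, m[None, :, None], 0)[0]
print(result.shape)  # (6, 6) here; (6, 60) for the shapes in the original question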

How to extract columns from an indexed matrix?

I have the following matrix:
M = np.matrix([[ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10],
               [11, 12, 13, 14, 15, 16, 17, 18, 19, 20],
               [21, 22, 23, 24, 25, 26, 27, 28, 29, 30]])
And I receive a vector indexing the columns of the matrix:
index = np.array([1,1,2,2,2,2,3,4,4,4])
This vector has 4 different values, so my objective is to create a list containing four new matrices, such that the first matrix is made of the first two columns of M, the second matrix is made of columns 3 to 6, and so on:
M1 = np.matrix([[1,2],[11,12],[21,22]])
M2 = np.matrix([[3,4,5,6],[13,14,15,16],[23,24,25,26]])
M3 = np.matrix([[7],[17],[27]])
M4 = np.matrix([[8,9,10],[18,19,20],[28,29,30]])
l = [M1, M2, M3, M4]
I need to do this in an automated way, since the number of rows and columns of M, as well as the indexing scheme, are not fixed. How can I do this?
There are 3 points to note:
For a variable number of variables, as in this case, the recommended solution is to use a dictionary.
You can use simple numpy indexing for the individual case.
Unless you have a very specific reason, use numpy.array instead of numpy.matrix.
Combining these points, you can use a dictionary comprehension:
d = {k: np.array(M[:, np.where(index==k)[0]]) for k in np.unique(index)}
Result:
{1: array([[ 1,  2],
           [11, 12],
           [21, 22]]),
 2: array([[ 3,  4,  5,  6],
           [13, 14, 15, 16],
           [23, 24, 25, 26]]),
 3: array([[ 7],
           [17],
           [27]]),
 4: array([[ 8,  9, 10],
           [18, 19, 20],
           [28, 29, 30]])}
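As a follow-up sketch (not part of the answer above): if, as in the example, equal labels are always contiguous in index, you can also let numpy do the slicing by splitting at the first occurrence of each label:

import numpy as np

# Split M at the first occurrence of each label in index
# (assumes the labels are grouped contiguously, as in the example).
_, first = np.unique(index, return_index=True)
parts = np.split(np.asarray(M), first[1:], axis=1)
# parts[0] holds columns 0-1, parts[1] columns 2-5, and so on.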
import numpy as np

M = np.matrix([[ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10],
               [11, 12, 13, 14, 15, 16, 17, 18, 19, 20],
               [21, 22, 23, 24, 25, 26, 27, 28, 29, 30]])
index = np.array([1, 1, 2, 2, 2, 2, 3, 4, 4, 4])

m = [[], [], [], []]
for i, c in enumerate(index):
    m[c - 1].append(i)   # group label c owns column i
for idx in m:
    print(M[:, idx])
This is a little hard-coded (I assumed you will always want 4 matrices and that the labels run from 1 to 4), but you can change it for more generality.

Function reads np.array - produces the mean for k nn to number p in np.array

I need to define a function which reads a numpy array and produces the mean of the k nearest points to a number p in the array.
Example:
array = np.array([1, 2, 3, 4, 5, 6, 7, 50, 24, 32, 9, 11, 12, 10])
p = 15   # note: p is not necessarily a number in the array; I need the
         # number closest to p, or p itself if it is present
k = 3
In this case, I would need to generate the mean of [11, 12, 10],
as they are the closest to p = 15.
With the above numbers, I need to find the mean of the k points closest to p, where p may or may not be explicitly present in the array.
I am new and very confused at this point and feel I have exhausted my resources. I feel this question has been asked before but the answers are much too complex for what I need.
Thanks in advance.
Given a (1d) array arr and scalar input p, here's how you could find the mean of the n nearest values:
import numpy as np

def neighbor_mean(arr, p, n=3):
    idx = np.abs(arr - p).argsort()[:n]
    return arr[idx].mean()
arr = np.array([1, 2, 3, 4, 5, 6, 7, 50, 24, 32, 9, 11, 12, 10])
neighbor_mean(arr, p=15)
# 11.0
In the above, first you take the absolute differences:
np.abs(arr - 15)
# array([14, 13, 12, 11, 10, 9, 8, 35, 9, 17, 6, 4, 3, 5])
Then argsort() returns the indices that would sort an array. We're interested in the n-smallest absolute differences. This is what you're really looking for, rather than sorting the differences directly.
np.abs(arr - p).argsort()[:3]
# array([12, 11, 13])
Lastly you want to index your input array arr and take the mean of this:
arr[[12, 11, 13]]
# array([12, 11, 10]) # mean: 11.0
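A possible refinement on the same idea: for large arrays a full sort is unnecessary, since only the n smallest differences matter, and np.argpartition selects them in linear time. A minimal sketch (the helper name is hypothetical):

def neighbor_mean_partition(arr, p, n=3):
    # argpartition moves the n smallest absolute differences to the front
    # (in no particular order), which is all the mean needs.
    idx = np.argpartition(np.abs(arr - p), n)[:n]
    return arr[idx].mean()

neighbor_mean_partition(arr, p=15)
# 11.0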

calculate histogram peaks in python

In Python, how do I calculate the peaks of a histogram?
I tried this:
import numpy as np
from scipy.signal import argrelextrema
data = [0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 1, 2, 3, 4,
        5, 6, 7, 8, 9, 5, 6, 7, 8, 9, 5, 6, 7, 8, 9,
        12,
        15, 16, 17, 18, 19, 15, 16, 17, 18,
        19, 20, 21, 22, 23, 24]
h = np.histogram(data, bins=[0, 5, 10, 15, 20, 25])
hData = h[0]
peaks = argrelextrema(hData, np.greater)
But the result was:
(array([3]),)
I'd expect it to find the peaks in bin 0 and bin 3.
Note that the peaks span more than one bin; I don't want a peak that spans several bins to be counted as multiple peaks.
I'm open to another way to get the peaks.
Note:
>>> h[0]
array([19, 15, 1, 10, 5])
>>>
In computational topology, the formalism of persistent homology provides a definition of "peak" that seems to address your need. In the 1-dimensional case, the peaks correspond to the blue persistence bars illustrated in the blog article linked below.
A description of the algorithm is given in this Stack Overflow answer to a peak detection question.
The nice thing is that this method not only identifies the peaks but also quantifies their "significance" in a natural way.
A simple and efficient implementation (as fast as sorting the numbers) and the source material for the above answer are given in this blog article:
https://www.sthu.org/blog/13-perstopology-peakdetection/index.html
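If you would rather stay within scipy, here is a rough sketch in the same spirit, assuming the histogram counts from the question: scipy.signal.find_peaks exposes a prominence measure that plays a role similar to persistence, and padding both ends lets a boundary bin such as bin 0 qualify as a peak:

import numpy as np
from scipy.signal import find_peaks

hData = np.array([19, 15, 1, 10, 5])  # histogram counts from the question

# Pad both ends so boundary bins can be detected, then filter by prominence.
padded = np.concatenate(([hData.min() - 1], hData, [hData.min() - 1]))
peaks, props = find_peaks(padded, prominence=5)
peaks -= 1  # undo the padding offset

print(peaks)                   # [0 3]
print(props["prominences"])    # [19.  9.]

The prominence threshold (5 here) plays the role of the significance cut-off; raising it discards the smaller peaks first.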
Try the findpeaks library.
pip install findpeaks
# Your input data:
data = [0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 1, 2, 3, 4, 5, 6, 7, 8, 9, 5, 6, 7, 8, 9, 5, 6, 7, 8, 9, 12, 15, 16, 17, 18, 19, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24,]
# import library
from findpeaks import findpeaks
# Find some peaks using the smoothing parameter.
fp = findpeaks(lookahead=1, interpolate=10)
# fit
results = fp.fit(data)
# Make plot
fp.plot()
# Results with respect to original input data.
results['df']
# Results based on interpolated smoothed data.
results['df_interp']
I wrote an easy function:
import numpy as np

def find_peaks(a):
    x = np.array(a)
    max_value = np.max(x)
    length = len(a)
    ret = []
    for i in range(length):
        ispeak = True
        if i - 1 >= 0:  # compare with the left neighbour, if there is one
            ispeak &= (x[i] > 1.8 * x[i - 1])
        if i + 1 < length:  # compare with the right neighbour, if there is one
            ispeak &= (x[i] > 1.8 * x[i + 1])
        ispeak &= (x[i] > 0.05 * max_value)
        if ispeak:
            ret.append(i)
    return ret
I defined a peak as a value bigger than 180% of each of its neighbours and bigger than 5% of the maximum value. Of course, you can adapt the thresholds to find the best setup for your problem.
