The process of dynamic time warping - python

Hi, I want to do DTW and clustering, and I just have a question.
import numpy as np
import rpy2.robjects.numpy2ri
from rpy2.robjects.packages import importr
rpy2.robjects.numpy2ri.activate()
# Set up our R namespaces
R = rpy2.robjects.r
DTW = importr('dtw')
# Generate our data
idx = np.linspace(0, 2*np.pi, 100)
template = np.cos(idx)
query = np.sin(idx) + np.array(R.runif(100))/10
# Calculate the alignment vector and corresponding distance
alignment = R.dtw(query, template, keep=True)
dist = alignment.rx('distance')[0][0]
print(dist)
In this code there are two time-series variables.
If I have many time-series variables (like in the picture below),
how can I go through the process?
I think I would set one fixed variable and compare it with all the other variables, as in the second picture below,
which shows comparing variable A with every other variable.
After calculating all the distances, I would run a clustering method based on those distances.
Is that right?
And can I choose the fixed variable arbitrarily?
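If it helps, here is a minimal sketch of that workflow under one common variation: instead of comparing everything against a single fixed series, compute the full pairwise DTW distance matrix and feed it to hierarchical clustering. It reuses the rpy2/R-dtw bridge from the code above; the toy series matrix X and the choice of two clusters are assumptions for illustration only.
import numpy as np
import rpy2.robjects.numpy2ri
from rpy2.robjects.packages import importr
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform
rpy2.robjects.numpy2ri.activate()
R = rpy2.robjects.r
importr('dtw')
# Toy data: 5 series of length 100, one per row (placeholder for your data)
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 100))
# Pairwise DTW distance matrix
n = X.shape[0]
D = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        alignment = R.dtw(X[i], X[j], keep=True)
        D[i, j] = D[j, i] = alignment.rx('distance')[0][0]
# Hierarchical clustering on the precomputed distances
Z = linkage(squareform(D), method='average')
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)
Comparing one fixed series against all the others gives a single distance per series, which you can also cluster, but the full pairwise matrix avoids the arbitrary choice of a reference series.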

Related

Principal Component Analysis, add a line to the 3d graph showing the first principal component

I am conducting PCA on a dataset and attempting to add a line to my 3D graph showing the first principal component. I have tried a few methods but have not been able to display it as a line in the 3D graph. Any help is greatly appreciated. My code is as follows:
import numpy as np
np.set_printoptions (suppress=True, precision=5, linewidth=150)
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import LabelEncoder
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
file_name = 'C:/Users/data'
input_data = pd.read_csv (file_name + '.csv', header=0, index_col=0)
A = input_data.A.values.astype(float)
B = input_data.B.values.astype(float)
C = input_data.C.values.astype(float)
D = input_data.D.values.astype(float)
E = input_data.E.values.astype(float)
F = input_data.F.values.astype(float)
X = np.column_stack((A, B, C, D, E, F))
ncompo = int (input ("Number of components to study: "))
print("")
pca = PCA (n_components = ncompo)
pcafit = pca.fit(X)
cov_mat = np.cov(X, rowvar=0)
eig_vals, eig_vecs = np.linalg.eig(cov_mat)
perc = pcafit.explained_variance_ratio_
perc_x = range(1, len(perc)+1)
plt.plot(perc_x, perc)
plt.xlabel('Components')
plt.ylabel('Percentage of Variance Explained')
plt.show()
#3d Graph
plt.clf()
le = LabelEncoder()
le.fit(input_data.Grade)
number = le.transform(input_data.Grade)
colormap = np.array(['green', 'blue', 'red', 'yellow'])
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(D, E, F, c=colormap[number])
ax.set_xlabel('D')
ax.set_ylabel('E')
ax.set_zlabel('F')
plt.title('PCA')
plt.show()
Some remarks to begin with:
You are computing PCA twice! Computing PCA means computing the eigenvalues and eigenvectors of the covariance matrix. So either you use the sklearn function pca.fit, or you do it yourself, but you don't need to do both, unless you want to look inside pca.fit and see for yourself that it does exactly what you expect it to do (if that is what you wanted, fine; it is a good kind of checking to do, and I did it once myself). Of course pca.fit has another advantage: once you have it, it also provides pca.transform to project points into the component space. But that, too, is simply a change of basis using the eigenvector matrix.
The pca object lets you get the eigenvectors (pca.components_) and the eigenvalues (pca.explained_variance_).
pca.fit fits the object in place: it does not return a new PCA object, it just fits (and returns) the one you have. So there is no need to keep pcafit and use it.
This is not a minimal reproducible example, as required on SO. We should be able to copy, paste and run it to see exactly your problem, not guess what kind of secret data you have. And in the meantime it should be minimal, so it should contain example data generation (it doesn't matter if that data makes no sense; sometimes it is even better, since it allows some testing. In my following code I generate my own noisy data along an axis, which lets me verify that I am indeed able to "guess" what that axis was). Also, since your problem concerns only the 3D plot, there is no need to include the plotting of explained variance here; that part is not part of your question.
Now, to draw the principal component: well, you already did the hard part, twice, which is computing it. It is the eigenvector associated with the largest eigenvalue.
With the pca object there is no need to search for it, because the components are already sorted. So it is simply pca.components_[0]. And since you want to plot in the D, E, F space, you only need to draw the vector pca.components_[0][3:],
with correct scaling.
You can do that with plot, providing just two points (the first and the last).
Here is my version (which, by the way, also shows what a minimal reproducible example looks like):
import numpy as np
np.set_printoptions (suppress=True, precision=5, linewidth=150)
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import LabelEncoder
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
# Generation of random data along a given vector
vec=np.array([1, -1, 0.5, -0.5, 0.75, 0.75]).reshape(-1,1)
# 10000 random data, that are U[0,10]×vec + gaussian noise std=1
X=(vec*np.random.rand(10000)*10 + np.random.normal(0,1,(6,10000))).T
(A,B,C,D,E,F)=X.T
input_data = pd.DataFrame({'A':A,'B':B,'C':C,'D':D,'E':E, 'F':F, 'Grade':np.random.randint(1,5, (10000,))})
ncompo=6
pca = PCA (n_components = ncompo)
pca.fit(X)
# Redundant
cov_mat = np.cov(X, rowvar=0)
eig_vals, eig_vecs = np.linalg.eig(cov_mat)
# See
print("Eigen values")
print(eig_vals)
print(pca.explained_variance_)
print("Eigen vec")
print(eig_vecs)
print(pca.components_)
# Note, compare first components to
print("Main component")
print(vec/np.linalg.norm(vec))
print(pca.components_[0])
#3d Graph
le = LabelEncoder()
le.fit(input_data.Grade)
number = le.transform(input_data.Grade)
fig = plt.figure()
colormap = np.array(['green', 'blue', 'red', 'yellow'])
ax = fig.add_subplot(111, projection='3d')
ax.scatter(D, E, F, c=colormap[number])
U=pca.components_[0]
sc1=max(D)/U[3]
sc2=min(D)/U[3]
# Draw the 1st principal component as a blue line
ax.plot([sc1*U[3],sc2*U[3]], [sc1*U[4], sc2*U[4]], [sc1*U[5], sc2*U[5]], linewidth=3)
ax.set_xlabel('D')
ax.set_ylabel('E')
ax.set_zlabel('F')
plt.title('PCA')
plt.show()
My example is not that minimal, because I took advantage of it to illustrate my first remark, so I also computed PCA twice in order to compare both results.
So, here I print the eigenvalues:
Eigen values
[30.88941 1.01334 0.99512 0.96493 0.97692 0.98101]
[30.88941 1.01334 0.99512 0.98101 0.97692 0.96493]
(the first line being your computation by diagonalisation of the covariance matrix, the second being pca.explained_variance_)
As you can see, they are the same values; the second is just sorted in decreasing order.
Likewise, for the eigenvectors:
Eigen vec
[[-0.52251 -0.27292 0.40863 -0.06321 0.26699 0.6405 ]
[ 0.52521 0.07577 -0.34211 0.27583 -0.04161 0.72357]
[-0.26266 -0.41332 -0.60091 0.38027 0.47573 -0.16779]
[ 0.26354 -0.52548 0.47284 0.59159 -0.24029 -0.15204]
[-0.39493 0.63946 0.07496 0.64966 -0.08619 0.00252]
[-0.3959 -0.25276 -0.35452 -0.0572 -0.79718 0.12217]]
[[ 0.52251 -0.52521 0.26266 -0.26354 0.39493 0.3959 ]
[-0.27292 0.07577 -0.41332 -0.52548 0.63946 -0.25276]
[-0.40863 0.34211 0.60091 -0.47284 -0.07496 0.35452]
[-0.6405 -0.72357 0.16779 0.15204 -0.00252 -0.12217]
[-0.26699 0.04161 -0.47573 0.24029 0.08619 0.79718]
[-0.06321 0.27583 0.38027 0.59159 0.64966 -0.0572 ]]
These are also the same, up to sorting and transposition.
Eigenvectors are presented column-wise when you diagonalize a matrix,
whereas for pca.components_ each row is an eigenvector.
But you can see that in the first matrix, the eigenvector associated with the biggest eigenvalue (which was the first one), i.e. the first column (-0.52, 0.52, ...),
is the same (up to an irrelevant sign flip) as the first row of pca.components_.
Likewise, the 4th biggest eigenvalue in your diagonalisation was the last one,
and if you look at the last column of your eigenvectors (0.64, 0.72, -0.17, ...), it is the same as the 4th row of pca.components_ (again up to an irrelevant ×(-1) factor).
So, long story short, you already have the eigenvalues in pca.explained_variance_, sorted from biggest to smallest, and the eigenvectors in pca.components_, in the same order.
The last thing I print here is a comparison between the first component (pca.components_[0]) and the vector I used to generate the data in the first place (my data are all collinear to a vector vec, plus Gaussian noise).
Main component
[[ 0.52523]
[-0.52523]
[ 0.26261]
[-0.26261]
[ 0.39392]
[ 0.39392]]
[ 0.52251 -0.52521 0.26266 -0.26354 0.39493 0.3959 ]
As expected, PCA correctly found that main axis.
So much for the side comments.
What you were really looking for is
ax.plot([sc1*U[3],sc2*U[3]], [sc1*U[4], sc2*U[4]], [sc1*U[5], sc2*U[5]], linewidth=3)
sc1 and sc2 are just scaling factors (here I chose them so that the line scales roughly like the data). Another way would have been to set ax.set_xlim, ax.set_ylim, ax.set_zlim from D.min(), D.max(), E.min(), E.max(), etc.,
and then just use big values for sc1 and sc2, like
sc1=1000
sc2=-1000
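For completeness, a quick sketch of that alternative, reusing the D, E, F, U and ax names from the code above:
# Alternative scaling: clamp the axes to the data range, then use
# deliberately large scale factors so the line spans the whole box.
ax.set_xlim(D.min(), D.max())
ax.set_ylim(E.min(), E.max())
ax.set_zlim(F.min(), F.max())
sc1, sc2 = 1000, -1000
ax.plot([sc1*U[3], sc2*U[3]], [sc1*U[4], sc2*U[4]], [sc1*U[5], sc2*U[5]], linewidth=3)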

rpy2 Dynamic Time Warping (dtw) in python - windowing does not work

A now-closed discussion shows how to use the R dtw package in Python. This is a little clumsy, but the R dtw package is great and better than the currently available Python dtw implementations. Unfortunately, the windowing functions such as the Sakoe-Chiba band do not work when trying to specify a "window.size". There appears to be an issue with the mapping of the argument. Note that "." in argument names is supposed to be replaced with "_" when using rpy2, but following this convention, the argument is still not being used for some reason.
import numpy as np
import rpy2.robjects.numpy2ri
from rpy2.robjects.packages import importr
rpy2.robjects.numpy2ri.activate()
# Set up our R namespaces
R = rpy2.robjects.r
DTW = importr('dtw')
# Generate our data
idx = np.linspace(0, 2*np.pi, 100)
template = np.cos(idx)
query = np.sin(idx) + np.array(R.runif(100))/10
# Calculate the alignment vector and corresponding distance
alignment = R.dtw(query, template, keep=True, window_type='sakoechiba',
                  window_size=5)
>>> RRuntimeError: Error in window.function(row(wm), col(wm), query.size= n, reference.size = m, :
argument "window.size" is missing, with no default
You can see that the error states "window.size" is missing, despite "window_size" clearly being specified in the rpy2 fashion.
Just a note from the future: this question is now superseded by the feature-equivalent dtw-python package (also found on PyPI). The rpy2-R-dtw bridge should no longer be necessary.
Answering my own question in case anyone ever has the same issue. The problem is the argument mapping and the R three dots ellipsis ‘...’. This can be fixed by specifying the mapping manually.
from rpy2.robjects.functions import SignatureTranslatedFunction
R.dtw = SignatureTranslatedFunction(R.dtw,
                                    init_prm_translate={'window_size': 'window.size'})
So with this specification the window_size argument is used correctly.
import numpy as np
import rpy2.robjects.numpy2ri
from rpy2.robjects.packages import importr
from rpy2.robjects.functions import SignatureTranslatedFunction
rpy2.robjects.numpy2ri.activate()
# Set up our R namespaces
R = rpy2.robjects.r
DTW = importr('dtw')
R.dtw = SignatureTranslatedFunction(R.dtw,
                                    init_prm_translate={'window_size': 'window.size'})
# Generate our data
idx = np.linspace(0, 2*np.pi, 100)
template = np.cos(idx)
query = np.sin(idx) + np.array(R.runif(100))/10
# Calculate the alignment vector and corresponding distance
alignment = R.dtw(query, template, keep=True, window_type='sakoechiba',
                  window_size=10)
dist = alignment.rx('distance')[0][0]
print(dist)
>>> 117.348292359
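(Following up on the dtw-python note above: here is a minimal sketch of the same windowed alignment using that package instead of the rpy2 bridge. The keyword names window_type, window_args and keep_internals are taken from dtw-python's documentation as I recall it, so treat them as assumptions to verify against that package's docs.)
import numpy as np
from dtw import dtw  # the dtw-python package
idx = np.linspace(0, 2*np.pi, 100)
template = np.cos(idx)
query = np.sin(idx) + np.random.uniform(size=100)/10
# keep_internals plays the role of keep=TRUE; window_args carries the
# Sakoe-Chiba band width (argument names assumed, see lead-in above).
alignment = dtw(query, template, keep_internals=True,
                window_type='sakoechiba', window_args={'window_size': 10})
print(alignment.distance)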

Gap Statistic Method

import sys
import numpy as np
import scipy.io as sio
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.svm import SVC

filename = sys.argv[1]
datafile = sio.loadmat(filename)
data = datafile['bow']
sizedata = [len(data), len(data[0])]
gap = []
SD = []
for knum in xrange(10, 20):
    print knum
    # Clustering original data
    kmeanspp = KMeans(n_clusters=knum, init='k-means++', max_iter=100, n_jobs=1)
    kmeanspp.fit(data)
    dispersion = kmeanspp.inertia_
    # Clustering reference data
    nrefs = 10
    refDisp = np.zeros(nrefs)
    for nref in xrange(nrefs):
        refdata = np.random.random_sample((sizedata[0], sizedata[1]))
        refkmeans = KMeans(n_clusters=knum, init='k-means++', max_iter=100, n_jobs=1)
        refkmeans.fit(refdata)
        refdisp = refkmeans.inertia_
        refDisp[nref] = np.log(refdisp)
    mean_log_refdisp = np.mean(refDisp)
    gap.append(mean_log_refdisp - np.log(dispersion))
    # Calculating standard deviation
    sd = (sum([(r-m)**2 for r, m in zip(refDisp, [mean_log_refdisp]*nrefs)])/nrefs)**0.5
    SD.append(sd)

SD = [sd*((1+(1/nrefs))**0.5) for sd in SD]

# Determining optimal k
opt_k = None
diff = []
for i in xrange(len(gap)-1):
    diff = (SD[i+1]-(gap[i+1]-gap[i]))
    if diff > 0:
        opt_k = i+10
        break

print diff
plt.plot(np.linspace(10, 19, 10, True), gap)
plt.show()
Here I am trying to implement the Gap Statistic method for determining the optimal number of clusters. But the problem is that every time I run the code I get a different value for k.
What is the solution to the problem?
How can the value of optimal k differ for the same data?
I have stored the data in a .mat file beforehand and I am passing it as an argument via terminal
I am looking for the smallest value of k for which Gap(k) >= Gap(k+1) - s(k+1), where s(k+1) = sd(k+1)*sqrt(1 + 1/B), sd being the standard deviation of the reference distribution and B the number of Monte Carlo reference samples.
Stated otherwise, I am searching for the value of k for which
s(k+1) - Gap(k+1) + Gap(k) >= 0
A couple of problems with your simulation:
1- sd = (sum([(r-m)**2 for r,m in zip(refDisp,[mean_log_refdisp]*nrefs)])/nrefs)**0.5
Why did you multiply the second component of the zip by nrefs? That is not needed according to the original paper.
2-
if diff>0:
    opt_k = i+10
    break
If you test diff > 0, you actually want diff >= 0, since equality can happen as well.
As for why you get a different number of clusters each time: as others have said, it is a Monte Carlo simulation, so there is randomness, and it also depends on what you are clustering and on your dataset. I suggest you also test your algorithm against the Silhouette and Elbow methods to get a better idea of the number of clusters; see the sketch below.
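A minimal sketch of such a cross-check with scikit-learn's silhouette score, assuming data is the same (n_samples, n_features) array as in the question (random placeholder data used here):
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
data = np.random.random_sample((200, 5))  # placeholder for your 'bow' matrix
for k in range(10, 20):
    labels = KMeans(n_clusters=k, init='k-means++', max_iter=100).fit_predict(data)
    # A higher silhouette score indicates better-separated clusters for this k.
    print(k, silhouette_score(data, labels))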
One option is to run your function several times, then average the gap statistics and the s values, and find the smallest k where the average s(k+1) - Gap(k+1) + Gap(k) is greater than or equal to zero, as sketched below.
This will take longer but gives a more reliable result.
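A hedged sketch of that idea, assuming a helper gap_and_s(data, k_range) that returns the gap and s arrays from one run of the code above (the helper name and signature are hypothetical):
import numpy as np
def averaged_gap_k(data, k_range, gap_and_s, n_repeats=5):
    # Average Gap(k) and s(k) over several independent runs, then apply the
    # rule: smallest k with s(k+1) - Gap(k+1) + Gap(k) >= 0.
    gaps, sds = [], []
    for _ in range(n_repeats):
        g, s = gap_and_s(data, k_range)  # one run of the gap-statistic code
        gaps.append(g)
        sds.append(s)
    g = np.mean(gaps, axis=0)
    s = np.mean(sds, axis=0)
    ks = list(k_range)
    for i in range(len(ks) - 1):
        if s[i+1] - g[i+1] + g[i] >= 0:
            return ks[i]
    return None  # no k in the range satisfied the criterion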

Ways to Create Tables and Presentable Objects Other than Plots in Python

I have the following code that runs through the following steps:
Draw a number of points from a true distribution.
Use those points with curve_fit to extract the parameters.
Check if those parameters are, on average, close to the true values.
(You can do this by creating the "pull distribution" and seeing whether it follows a standard normal distribution.)
# This script calculates the mean and standard deviation for
# the pull distributions on the estimators that curve_fit returns
import numpy as np
from scipy.optimize import curve_fit
import matplotlib.pyplot as plt
import gauss
import format

numTrials = 10000
# Pull given by (a_j - a_true)/a_error
error_vec_A = []
error_vec_mean = []
error_vec_sigma = []

# Loop to determine pull distribution
for i in xrange(0, numTrials):
    # Draw from primary distribution
    mean = 0; var = 1; sigma = np.sqrt(var)
    N = 20000
    A = 1/np.sqrt((2*np.pi*var))
    points = gauss.draw_1dGauss(mean, var, N)
    # Histogram parameters
    bin_size = 0.1; min_edge = mean - 6*sigma; max_edge = mean + 9*sigma
    Nn = (max_edge - min_edge)/bin_size; Nplus1 = Nn + 1
    bins = np.linspace(min_edge, max_edge, Nplus1)
    # Obtain histogram from primary distributions
    hist, bin_edges = np.histogram(points, bins, density=True)
    bin_centres = (bin_edges[:-1] + bin_edges[1:])/2
    # Initial guess
    p0 = [5, 2, 4]
    coeff, var_matrix = curve_fit(gauss.gaussFun, bin_centres, hist, p0=p0)
    # Get the fitted curve
    hist_fit = gauss.gaussFun(bin_centres, *coeff)
    # Error on the estimates
    error_parameters = np.sqrt(np.array([var_matrix[0][0], var_matrix[1][1], var_matrix[2][2]]))
    # Obtain the error for each value: A, mu, sigma
    A_std = (coeff[0] - A)/error_parameters[0]
    mean_std = ((coeff[1] - mean)/error_parameters[1])
    sigma_std = (np.abs(coeff[2]) - sigma)/error_parameters[2]
    # Store results in container
    error_vec_A.append(A_std)
    error_vec_mean.append(mean_std)
    error_vec_sigma.append(sigma_std)

# Plot the distribution of each estimator
plt.figure(1); plt.hist(error_vec_A, bins, normed=True); plt.title('Pull of A')
plt.figure(2); plt.hist(error_vec_mean, bins, normed=True); plt.title('Pull of Mu')
plt.figure(3); plt.hist(error_vec_sigma, bins, normed=True); plt.title('Pull of Sigma')

# Store key information regarding distribution
mean_A = np.mean(error_vec_A); sigma_A = np.std(error_vec_A)
mean_mu = np.mean(error_vec_mean); sigma_mu = np.std(error_vec_mean)
mean_sigma = np.mean(error_vec_sigma); sigma_sig = np.std(error_vec_sigma)
info = np.array([[mean_A, sigma_A], [mean_mu, sigma_mu], [mean_sigma, sigma_sig]])
My problem is that I don't know how to use Python to format the data into a table. Currently I have to manually read off the variables and paste them into Google Docs to present the information. I'm just wondering how I can do that using pandas or some other library.
Here's an example of the manual insertion:
Trial 1 Trial 2 Trial 3
Seed [0.2,0,1] [10,2,5] [5,2,4]
Bins for individual runs 20 20 20
Points Thrown 1000 1000 1000
Number of Runs 5000 5000 5000
Bins for pull dist fit 20 20 20
Mean_A -0.11177 -0.12249 -0.10965
sigma_A 1.17442 1.17517 1.17134
Mean_mu 0.00933 -0.02773 -0.01153
sigma_mu 1.38780 1.38203 1.38671
Mean_sig 0.05292 0.06694 0.04670
sigma_sig 1.19411 1.18438 1.19039
I would like to automate this table so If I change my parameters in my code, I get a new table with that new data.
I would go with the CSV module to generate a presentable table.
If you're not already using it, the IPython notebook is really good for rendering rich display formats. It's really good in a lot of other ways, too.
It will render pandas DataFrame objects as an HTML table when they're either the last, unreturned value in a cell or when you explicitly call the IPython.core.display.display function instead of print.
If you're not already using pandas, I highly recommend it. It's basically a wrapper around 2D and 3D numpy arrays; it's just as fast, but it has nice naming conventions, data grouping and filtering functions, and some other cool stuff.
At that point, it depends on how you want to present it. You can use nbconvert to render a whole notebook as static html or a pdf. You can copy-paste the html table into Excel or PowerPoint or an E-mail.
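As a concrete illustration, here is a minimal sketch that turns the info array from the script above into a pandas DataFrame and exports it; the row/column labels and output file names are assumptions for the example (the numbers are the Trial 1 values from the table above):
import numpy as np
import pandas as pd
# Rows are (pull mean, pull std) for A, mu and sigma, as in the script above.
info = np.array([[-0.11177, 1.17442],
                 [0.00933, 1.38780],
                 [0.05292, 1.19411]])
table = pd.DataFrame(info, index=['A', 'mu', 'sigma'],
                     columns=['pull mean', 'pull std'])
print(table)                   # plain-text table in the console
table.to_csv('pulls.csv')      # open in Google Docs / Excel
html = table.to_html()         # paste into a notebook, e-mail, etc.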

find tangent vector at a point for discrete data points

I have a curve given by a minimum of two points in space, e.g.:
A = np.array([-1452.18133319, 3285.44737438, -7075.49516676])
B = np.array([-1452.20175668, 3285.29632734, -7075.49110863])
I want to find the tangent of the curve at discrete points along it, e.g. the beginning and the end of the curve. I know how to do it in Matlab, but I want to do it in Python. This is the Matlab code:
A = [-1452.18133319 3285.44737438 -7075.49516676];
B = [-1452.20175668 3285.29632734 -7075.49110863];
points = [A; B];
distance = [0.; 0.1667];
pp = interp1(distance, points,'pchip','pp');
[breaks,coefs,l,k,d] = unmkpp(pp);
dpp = mkpp(breaks,repmat(k-1:-1:1,d*l,1).*coefs(:,1:k-1),d);
ntangent=zeros(length(distance),3);
for j=1:length(distance)
    ntangent(j,:) = ppval(dpp, distance(j));
end
%The solution would be at beginning and end:
%ntangent =
% -0.1225 -0.9061 0.0243
% -0.1225 -0.9061 0.0243
Any ideas? I tried to find a solution using numpy and scipy with multiple methods, e.g.
tck, u = scipy.interpolate.splprep(data)
but none of the methods seem to satisfy what I want.
Give der=1 to splev to get the derivative of the spline:
from scipy import interpolate
import numpy as np
t=np.linspace(0,1,200)
x=np.cos(5*t)
y=np.sin(7*t)
tck, u = interpolate.splprep([x,y])
ti = np.linspace(0, 1, 200)
dxdt, dydt = interpolate.splev(ti,tck,der=1)
OK, I found the solution, which is a small modification of "pv"'s answer above (note that splev works only on 1-D vectors).
One problem I was having originally with tck, u = scipy.interpolate.splprep(data) is that it requires a minimum of 4 points to work (Matlab works with two points), and I was using only two. After increasing the number of data points, it works as I want.
Here is the solution for completeness:
import numpy as np
import matplotlib.pyplot as plt
from scipy import interpolate
data = np.array([[-1452.18133319 , 3285.44737438, -7075.49516676],
[-1452.20175668 , 3285.29632734, -7075.49110863],
[-1452.32645025 , 3284.37412457, -7075.46633213],
[-1452.38226151 , 3283.96135828, -7075.45524248]])
distance=np.array([0., 0.15247556, 1.0834, 1.50007])
data = data.T
tck,u = interpolate.splprep(data, u=distance, s=0)
yderv = interpolate.splev(u,tck,der=1)
and the tangents are (which match the Matlab results when the same data are used):
(-0.13394599723751408, -0.99063114953803189, 0.026614957159932656)
(-0.13394598523149195, -0.99063115868512985, 0.026614950816003666)
(-0.13394595055068903, -0.99063117647357712, 0.026614941718878599)
(-0.13394595652952143, -0.9906311632471152, 0.026614954146007865)
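One caveat worth noting: splev(..., der=1) returns the derivative with respect to the parameter u, so its magnitude depends on the parametrization. If you only care about the tangent direction (e.g. to compare with the Matlab output above), you can normalize to unit vectors; a minimal sketch reusing yderv from the code above:
import numpy as np
derv = np.array(yderv)            # shape (3, n_points): one row per coordinate
norms = np.linalg.norm(derv, axis=0)
unit_tangents = (derv / norms).T  # one unit tangent vector per point
print(unit_tangents)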
