I am working with multivariate time series data: for every instant of time there are three data points available.
Format:
| X | Y | Z |
So one time series in the above format is generated in real time. I am trying to find a good match of this real-time series within another, already stored base time series (which is much larger in size and was collected at a different frequency). If I apply standard DTW to each of the components (X, Y, Z) individually, they might end up matching at different points within the base data, which is unfavorable. So I need to find a point in the base data where all three components (X, Y, Z) match well, and at the same point.
I have researched the matter and found that multidimensional DTW is a perfect solution to such a problem. In R the dtw package does include multidimensional DTW, but I have to implement it in Python. The R-Python bridging package rpy2 can probably be of help here, but I have no experience with R. I have looked through the DTW packages available in Python, like mlpy and dtw, but they are of no help. Can anyone suggest a Python package that does the same, or code for multidimensional DTW using rpy2?
Thanks in advance!
Thanks @lgautier. I dug deeper and found an implementation of multivariate DTW using rpy2 in Python. Just passing the template and query as 2-D matrices (matrices as in R) allows the rpy2 dtw package to do a multivariate DTW. Also, if you have R installed, loading the R dtw library and running "?dtw" gives access to the library's documentation and the different functionalities it provides.
For future reference to other users with similar questions:
Official documentation of R dtw package: https://cran.r-project.org/web/packages/dtw/dtw.pdf
Sample code, passing two 2-D matrices for multivariate DTW; the open_begin and open_end arguments enable subsequence matching:
import numpy as np
import rpy2.robjects.numpy2ri
rpy2.robjects.numpy2ri.activate()
from rpy2.robjects.packages import importr
import rpy2.robjects as robj
R = rpy2.robjects.r
DTW = importr('dtw')
# Generate our data
template = np.array([[1,2,3,4,5],[1,2,3,4,5]]).transpose()
rt,ct = template.shape
query = np.array([[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]]).transpose()
rq,cq = query.shape
# converting numpy matrices to R matrices
templateR = R.matrix(template, nrow=rt, ncol=ct)
queryR = R.matrix(query, nrow=rq, ncol=cq)
# Calculate the alignment and the corresponding distance
alignment = R.dtw(templateR, queryR, keep=True, step_pattern=R.rabinerJuangStepPattern(4, "c"), open_begin=True, open_end=True)
dist = alignment.rx('distance')[0][0]
print(dist)
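If you also need the warping path, not just the distance, the same alignment object carries the index vectors (a short follow-up sketch; I am assuming rpy2's numpy2ri converter hands them back as numeric vectors):
# index1/index2 index the first and second series passed to R's dtw();
# R indices are 1-based, hence the -1 to use them with numpy arrays
idx1 = np.array(alignment.rx('index1')[0], dtype=int) - 1
idx2 = np.array(alignment.rx('index2')[0], dtype=int) - 1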
It seems like tslearn's dtw_path() is exactly what you are looking for. To quote the docs linked before:
Compute Dynamic Time Warping (DTW) similarity measure between (possibly multidimensional) time series and return both the path and the similarity.
[...]
It is not required that both time series share the same size, but they must be the same dimension. [...]
The implementation they provide follows:
H. Sakoe, S. Chiba, “Dynamic programming algorithm optimization for spoken word recognition,” IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 26(1), pp. 43–49, 1978.
I think that it is a good idea to try out a method in whatever implementation is already available before considering whether it is worth working on a reimplementation.
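For concreteness, a minimal sketch of calling dtw_path() on two multivariate series (random arrays here purely for illustration; assumes tslearn is installed):
import numpy as np
from tslearn.metrics import dtw_path

short_series = np.random.rand(20, 3)    # e.g. the live (X, Y, Z) snippet
long_series = np.random.rand(100, 3)    # e.g. a stretch of the base data

path, dist = dtw_path(short_series, long_series)
print(dist)        # DTW distance between the two series
print(path[:5])    # first few (i, j) index pairs of the warping path
Note that dtw_path() aligns the two series end to end; if you need the open-ended subsequence matching from the rpy2 example, check the tslearn documentation for its subsequence variant (dtw_subsequence_path in recent versions).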
Did you try the following?
from rpy2.robjects.packages import importr
# You'll obviously need the R package "dtw" installed with your R
dtw = importr("dtw")
# all functions and objects in the R package "dtw" are now available
# with `dtw.<function or object>`
I happened upon this post and thought I would provide some updated information in case anyone else is trying to find a way to do multivariate DTW in Python. The DTAIDistance package has the option to perform multivariate DTW.
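A minimal sketch of what that could look like, assuming the dtaidistance package is installed and that its dtw_ndim module accepts two (n_timesteps, n_dims) arrays:
import numpy as np
from dtaidistance import dtw_ndim

series1 = np.random.rand(20, 3)     # e.g. the live (X, Y, Z) snippet
series2 = np.random.rand(100, 3)    # e.g. a window of the base data

d = dtw_ndim.distance(series1, series2)   # dependent multivariate DTW distance
print(d)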
I am trying to solve an optimization problem in Python, using gekko, where one of the variables takes on a random value at each time step, but I haven't been able to use the gekko function that returns random numbers.
Following the documentation page (http://t-t.dk/gekko/docs/user-manual/functions.htm), the function rnorm returns "a random number from a normal distribution with mean and variance provided". I used it as shown here:
x = m.Var(value=0)
m.Equation(x == 5.*m.rnorm(0, 1))
provided that
m = GEKKO()
but I get the following error message:
AttributeError: 'GEKKO' object has no attribute 'rnorm'
I would like to know if there is something that I am missing or if there is another way to get random numbers.
The documentation page you linked is for a different package, not the GEKKO Optimization Suite in Python. I suggest looking at this page for the correct documentation: https://gekko.readthedocs.io/en/latest/model_methods.html
As for your question about random numbers, I suggest using another package such as Python's random module or numpy's random.normal. I'm not sure exactly how to apply it in your problem without seeing more code; one option is to build an array of random numbers, one per timestep, and multiply or add it in where needed while writing the problem in Gekko.
The documentation link that you provided is to different gekko software:
Gekko Timeseries and Modeling Software is a free and open-source software system for managing and analyzing timeseries data, and for solving and analyzing large-scale economic models. See the Gekko homepage: www.t-t.dk/gekko. Read more about the status of different Gekko versions on the Gekko versions overview page.
The Gekko Optimization Suite in Python (pip install gekko) is described in the Wikipedia article and in the Read the Docs documentation.
GEKKO is a Python package for machine learning and optimization of mixed-integer and differential algebraic equations. It is coupled with large-scale solvers for linear, quadratic, nonlinear, and mixed integer programming (LP, QP, NLP, MILP, MINLP). Modes of operation include parameter regression, data reconciliation, real-time optimization, dynamic simulation, and nonlinear predictive control. GEKKO is an object-oriented Python library to facilitate local execution of APMonitor.
Both software packages can analyze time-series data. The numpy.random.randn() function can be used with gekko.
from gekko import GEKKO
import numpy as np
m = GEKKO(remote=False)
p = m.Param()
x = m.Var()
m.Equation(x==5*p)
for i in range(10):
p.value = np.random.randn()
m.solve(disp=False)
print(x.value[0],p.value[0])
This solves the optimization problem 10 times with different values for p, each sampled from a zero-mean normal distribution.
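If the randomness needs to enter at each point of a single dynamic horizon instead of across repeated solves, one option (my own sketch, with an illustrative differential equation and IMODE choice, not taken from the answer above) is to pre-draw an array of random values, one per time point, and load it into a time-varying parameter:
from gekko import GEKKO
import numpy as np

m = GEKKO(remote=False)
n = 11
m.time = np.linspace(0, 10, n)            # time horizon with n points
p = m.Param(value=np.random.randn(n))     # one random draw per time point
x = m.Var(value=0)
m.Equation(x.dt() == 5*p)                 # x driven by the random signal
m.options.IMODE = 4                       # dynamic simulation mode
m.solve(disp=False)
print(x.value)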
I am working on an image processing tool, and I am having some trouble finding a good Python substitute for MATLAB's fitdist.
The matlab code works something like this:
pdR = fitdist(Red,'Kernel','Support','positive');
Have any of you found a good implementation of this in Python?
Generally SciPy is useful in your case:
import numpy as np
import scipy.stats as st

data = np.abs(np.random.randn(1000))  # placeholder for your positive-valued channel (e.g. Red)
# Kernel density estimate (analogous to fitdist(..., 'Kernel'))
kde = st.gaussian_kde(data)
# Fit to a specified parametric distribution (normal distribution in this example)
mu, sigma = st.norm.fit(data)
Full reference is here: https://docs.scipy.org/doc/scipy/reference/stats.html
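If it helps, the fitted objects from the snippet above can then be evaluated on a grid, e.g. for plotting or comparison (an illustrative follow-up using the kde, mu and sigma names introduced there):
xs = np.linspace(data.min(), data.max(), 200)
kde_pdf = kde(xs)                      # gaussian_kde objects are callable
norm_pdf = st.norm.pdf(xs, mu, sigma)  # density of the fitted normal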
While working on some statistical analysis tools, I discovered there are at least 3 Python methods to calculate mean and standard deviation (not counting the "roll your own" techniques):
np.mean(), np.std() (with ddof=0 or 1)
statistics.mean(), statistics.pstdev() (and/or statistics.stdev)
the scipy.stats package
That has me scratching my head. There should be one obvious way to do it, right? :-) I've found some older SO posts. One compares the performance advantages of np.mean() vs statistics.mean(). It also highlights differences in the sum operator. That post is here:
why-is-statistics-mean-so-slow
I am working with numpy array data, and my values fall in a small range (-1.0 to 1.0, or 0.0 to 10.0), so the numpy functions seem the obvious answer for my application. They have a good balance of speed, accuracy, and ease of implementation for the data I will be processing.
It appears the statistics module is primarily for those that have data in lists (or other forms), or for widely varying ranges [1e+5, 1.0, 1e-5]. Is that still a fair statement? Are there any numpy enhancements that address the differences in the sum operator? Do recent developments bring any other advantages?
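As a quick illustrative check (my own snippet, not part of the original question), the numpy and statistics results line up once ddof is accounted for:
import statistics
import numpy as np

data = [1.0, 2.0, 3.0, 4.0, 5.0]
assert np.isclose(np.mean(data), statistics.mean(data))
assert np.isclose(np.std(data, ddof=0), statistics.pstdev(data))  # population std
assert np.isclose(np.std(data, ddof=1), statistics.stdev(data))   # sample std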
Numerical algorithms generally have positive and negative aspects: some are faster, or more accurate, or require a smaller memory footprint. When faced with a choice of 3-4 ways to do a calculation, a developer's responsibility is to select the "best" method for his/her application. Generally this is a balancing act between competing priorities and resources.
My intent is to solicit replies from programmers experienced in statistical analysis to provide insights into the strengths and weaknesses of the methods above (or other/better methods). [I'm not interested in speculation or opinions without supporting facts.] I will make my own decision based on my design requirements.
Why does NumPy duplicate features of SciPy?
From the SciPy FAQ What is the difference between NumPy and SciPy?:
In an ideal world, NumPy would contain nothing but the array data type and the most basic operations: indexing, sorting, reshaping, basic elementwise functions, etc. All numerical code would reside in SciPy. However, one of NumPy’s important goals is compatibility, so NumPy tries to retain all features supported by either of its predecessors.
It recommends using SciPy over NumPy:
In any case, SciPy contains more fully-featured versions of the linear algebra modules, as well as many other numerical algorithms. If you are doing scientific computing with Python, you should probably install both NumPy and SciPy. Most new features belong in SciPy rather than NumPy.
When should I use the statistics library?
From the statistics library documentation:
The module is not intended to be a competitor to third-party libraries such as NumPy, SciPy, or proprietary full-featured statistics packages aimed at professional statisticians such as Minitab, SAS and Matlab. It is aimed at the level of graphing and scientific calculators.
Thus I would not use it for serious (i.e. resource intensive) computation.
What is the difference between statsmodels and SciPy?
From the statsmodels about page:
The models module of scipy.stats was originally written by Jonathan Taylor. For some time it was part of scipy but was later removed. During the Google Summer of Code 2009, statsmodels was corrected, tested, improved and released as a new package. Since then, the statsmodels development team has continued to add new models, plotting tools, and statistical methods.
Thus you may have a requirement that SciPy is not able to fulfill, or is better fulfilled by a dedicated library.
For example the SciPy documentation for scipy.stats.probplot notes that
Statsmodels has more extensive functionality of this type, see statsmodels.api.ProbPlot.
Thus in cases like these you will need to turn to statistical libraries beyond SciPy.
Just a simple benchmark.
statistics vs NumPy
import statistics as stat
import numpy as np
from timeit import default_timer as timer
l = list(range(1_000_000))
start = timer()
m, std = stat.mean(l), stat.stdev(l)
end = timer()
print(end-start, "sec elapsed.")
print("mean, std:", m, std)
start = timer()
m, std = np.mean(l), np.std(l)
end = timer()
print(end-start, "sec elapsed.")
print("mean, std:", m, std)
(A single timed run of each, with no repetitions, for brevity.)
Result
NumPy is way faster. (The tiny difference in the standard deviations is expected: statistics.stdev uses the sample formula with n-1 in the denominator, while np.std defaults to the population formula, ddof=0.)
# statistics
3.603845 sec elapsed.
mean, std: 499999.5 288675.2789323441
# numpy
0.2315261999999998 sec elapsed.
mean, std: 499999.5 288675.1345946685
Long-time R and Python user here. I use R for my daily data analysis and Python for tasks heavier on text processing and shell scripting.

I am working with increasingly large data sets, and these files often arrive as binary or text files. What I normally do is apply statistical/machine learning algorithms and create statistical graphics. I sometimes use R with SQLite and write C for iteration-intensive tasks; before looking into Hadoop, I am considering investing some time in NumPy/SciPy because I've heard it has better memory management (and the transition to NumPy/SciPy for someone with my background seems not that big).

I wonder if anyone has experience using the two and could comment on the improvements in this area, and whether there are idioms in NumPy that deal with this issue. (I'm also aware of rpy2 but am wondering whether NumPy/SciPy can handle most of my needs.) Thanks -
When looking for an environment to do machine learning and statistics, R's strength is most certainly the diversity of its libraries. To my knowledge, SciPy + SciKits cannot be a replacement for CRAN.
Regarding memory usage, R uses a pass-by-value paradigm while Python uses pass-by-reference. Pass-by-value can lead to more "intuitive" code, while pass-by-reference can help optimize memory usage. NumPy also allows one to have "views" on arrays (a kind of subarray without a copy being made).
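To illustrate that last point, a slice of a NumPy array is a view that shares memory with the original array (a small sketch of my own):
import numpy as np

a = np.arange(10)
v = a[2:6]           # a view: no data is copied
v[0] = 99            # modifies a[2] as well
print(a)             # [ 0  1 99  3  4  5  6  7  8  9]
print(v.base is a)   # True: v shares a's memory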
Regarding speed, pure Python is faster than pure R for accessing individual elements in an array, but this advantage disappears when dealing with numpy arrays (benchmark). Fortunately, Cython lets one get serious speed improvements easily.
If working with Big Data, I find the support for storage-based arrays better with Python (HDF5).
I am not sure you should ditch one for the other but rpy2 can help you explore your options about a possible transition (arrays can be shuttled between R and Numpy without a copy being made).
I use NumPy daily and R nearly so.
For heavy number crunching, I prefer NumPy to R by a large margin (including R packages like 'Matrix'). I find the syntax cleaner, the function set larger, and computation quicker (although I don't find R slow by any means). NumPy's broadcasting functionality, for instance, I do not think has an analog in R.
For instance, to read in a data set from a csv file and 'normalize' it for input to an ML algorithm (e.g., mean center then re-scale each dimension) requires just this:
import numpy as NP

data = NP.loadtxt(data1, delimiter=",")   # data1 is the path to the csv file; 'data' is a NumPy array
data -= NP.mean(data, axis=0)             # mean-center each column
data /= NP.max(data, axis=0)              # re-scale each column
Also, I find that when coding ML algorithms, I need data structures that I can operate on element-wise and that also understand linear algebra (e.g., matrix multiplication, transpose, etc.). NumPy gets this and allows you to create these hybrid structures easily (no operator overloading or subclassing, etc.).
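For example, the same ndarray handles both element-wise arithmetic and linear algebra (a small illustrative snippet of my own):
import numpy as np

A = np.array([[1., 2.], [3., 4.]])
B = np.array([[5., 6.], [7., 8.]])

print(A * B)      # element-wise product
print(A @ B)      # matrix product (equivalently np.dot(A, B))
print(A.T)        # transpose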
You won't be disappointed by NumPy/SciPy, more likely you'll be amazed.
So, a few recommendations--in general and in particular, given the facts in your question:
Install both NumPy and SciPy. As a rough guide, NumPy provides the core data structures (in particular the ndarray) and SciPy (which is actually several times larger than NumPy) provides the domain-specific functions (e.g., statistics, signal processing, integration).

Install the repository versions, particularly w/r/t NumPy, because the dev version is 2.0. Matplotlib and NumPy are tightly integrated; you can use one without the other of course, but both are the best in their respective class among Python libraries. You can get all three via easy_install, which I assume you already have.

NumPy/SciPy have several modules specifically directed to Machine Learning/Statistics, including the Clustering package and the Statistics package, as well as packages directed to general computation that make coding ML algorithms a lot faster, in particular Optimization and Linear Algebra.

There are also the SciKits, not included in the base NumPy or SciPy libraries; you need to install them separately. Generally speaking, each SciKit is a set of convenience wrappers to streamline coding in a given domain. The SciKits you are likely to find most relevant are: ann (approximate Nearest Neighbor) and learn (a set of ML/Statistics regression and classification algorithms, e.g., Logistic Regression, Multi-Layer Perceptron, Support Vector Machine).
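The "learn" scikit mentioned above has since grown into scikit-learn; as a quick illustration (my own sketch, using a bundled toy dataset), fitting one of the classifiers listed, Logistic Regression, looks like this today:
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)                  # toy dataset for illustration
clf = LogisticRegression(max_iter=1000).fit(X, y)  # fit the classifier
print(clf.score(X, y))                             # training accuracy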
I am trying to find a numerical package which will fit a natural spline which minimizes weighted least squares.
There is a package in scipy which does what I want for unnatural splines.
import numpy as np
import matplotlib.pyplot as plt
from scipy import interpolate
x = np.arange(0,5,1.0/6)
xs = np.arange(0,5,1.0/500)
y = np.sin(x+1) + .2*np.random.rand(len(x)) -.1
knots = np.array([1,2,3,4])
tck = interpolate.splrep(x,y,s=0,k=3,t=knots,task=-1)
ys = interpolate.splev(xs,tck,der=0)
plt.figure()
plt.plot(xs,ys,x,y,'x')
The spline.py file inside of this tar file from this page does a natural spline fit by default. There is also some code on this page that claims to do mostly what you want. The pyD3D package also has a natural spline function in its pyDataUtils module. This last one looks the most promising to me. However, it doesn't appear to have the option of setting your own knots. Maybe if you look at the source you can find a way to rectify that.
Also, I found this message on the SciPy mailing list which says that using s=0.0 (as in your given code) makes splines fitted with the above procedure natural, according to the writer of the message. I did find the splmake function, which has an option to do a natural spline fit, but upon looking at the source I found that it isn't implemented yet.
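As a side note, if plain interpolation (rather than the weighted least-squares fit you are after) is ever enough, newer SciPy versions expose natural boundary conditions directly through CubicSpline; a small sketch mirroring the data from your example:
import numpy as np
from scipy.interpolate import CubicSpline

x = np.arange(0, 5, 1.0/6)
y = np.sin(x + 1) + 0.2*np.random.rand(len(x)) - 0.1
xs = np.arange(0, 5, 1.0/500)

cs = CubicSpline(x, y, bc_type='natural')   # natural boundary conditions
ys_natural = cs(xs)                         # evaluate the interpolant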