A now-closed discussion shows how to use the R dtw package in Python. This is a little clumsy, but the R dtw package is great and better than the currently available Python DTW implementations. Unfortunately, windowing functions such as the Sakoe-Chiba band do not work when I try to specify a "window.size". There appears to be an issue with how the argument is mapped. Note that with rpy2, "." in argument names is supposed to be replaced with "_". But even when I follow this convention, the argument is not being used for some reason.
import numpy as np
import rpy2.robjects.numpy2ri
from rpy2.robjects.packages import importr
rpy2.robjects.numpy2ri.activate()
# Set up our R namespaces
R = rpy2.robjects.r
DTW = importr('dtw')
# Generate our data
idx = np.linspace(0, 2*np.pi, 100)
template = np.cos(idx)
query = np.sin(idx) + np.array(R.runif(100))/10
# Calculate the alignment vector and corresponding distance
alignment = R.dtw(query, template, keep=True, window_type='sakoechiba',
                  window_size=5)
>>> RRuntimeError: Error in window.function(row(wm), col(wm), query.size= n, reference.size = m, :
argument "window.size" is missing, with no default
You can see that the error states "window.size" is missing, despite "window_size" clearly being specified in the rpy2 fashion.
Just a note from the future: this question is now superseded by the feature-equivalent dtw-python package (also found on PyPI). The rpy2-R-dtw bridge should no longer be necessary.
Answering my own question in case anyone ever has the same issue. The problem lies in the argument mapping combined with R's three-dots ellipsis ‘...’: window_size is not translated to window.size automatically. This can be fixed by specifying the mapping manually.
from rpy2.robjects.functions import SignatureTranslatedFunction
R.dtw = SignatureTranslatedFunction(R.dtw,
                                    init_prm_translate={'window_size': 'window.size'})
So with this specification the window_size argument is used correctly.
import numpy as np
import rpy2.robjects.numpy2ri
from rpy2.robjects.packages import importr
from rpy2.robjects.functions import SignatureTranslatedFunction
rpy2.robjects.numpy2ri.activate()
# Set up our R namespaces
R = rpy2.robjects.r
DTW = importr('dtw')
R.dtw = SignatureTranslatedFunction(R.dtw,
                                    init_prm_translate={'window_size': 'window.size'})
# Generate our data
idx = np.linspace(0, 2*np.pi, 100)
template = np.cos(idx)
query = np.sin(idx) + np.array(R.runif(100))/10
# Calculate the alignment vector and corresponding distance
alignment = R.dtw(query, template, keep=True, window_type='sakoechiba',
                  window_size=10)
dist = alignment.rx('distance')[0][0]
print(dist)
>>> 117.348292359
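To make the idea behind init_prm_translate a bit more concrete, here is a minimal pure-Python sketch of keyword renaming. It does not call R or rpy2 at all: translate_kwargs and fake_r_function are hypothetical names invented for this illustration, and the sketch only mimics the renaming step, not how rpy2 is actually implemented.

```python
def translate_kwargs(func, translation):
    """Wrap func so that selected keyword names are renamed before the call."""
    def wrapper(*args, **kwargs):
        # Rename only the keywords listed in the mapping; pass the rest through
        renamed = {translation.get(name, name): value
                   for name, value in kwargs.items()}
        return func(*args, **renamed)
    return wrapper


def fake_r_function(*args, **kwargs):
    """Hypothetical stand-in for an R function: reports the names it received."""
    return sorted(kwargs)


# Same mapping as in the answer: Python-style name -> R-style name
wrapped = translate_kwargs(fake_r_function, {"window_size": "window.size"})
received = wrapped(1, 2, keep=True, window_size=10)
print(received)  # ['keep', 'window.size']
```

The wrapped function sees "window.size", which is the name the R side asks for in the error message above.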
Related
I want to plot an array of values against a theoretical distribution using a QQ-Plot in Python. Ideally, I want to create the plot using the library Plotnine.
But when I try to create the plot, I'm getting error messages... here's my code with example data:
from scipy.stats import beta
from plotnine import *
import statsmodels.api as sm
import numpy as np
n = 207
values = -1 + np.random.beta(n/2-1, n/2-1, 100) * 2 # my data
dist = beta(n/2-1, n/2-1, loc = -1, scale = 2) # theoretical distribution
# 1. try:
ggplot(aes(sample = values)) + stat_qq(distribution = dist)
# gives ValueError: Unknown continuous distribution '<scipy.stats._distn_infrastructure.rv_frozen object at 0x0000029755C5C070>'
# 2. try:
params = {'a':n/2-1, 'b':n/2-1, 'loc':-1, 'scale':2}
ggplot(aes(sample = values)) + stat_qq(distribution = 'beta', dparams = params)
# gives TypeError: '>' not supported between instances of 'numpy.ndarray' and 'int'
Does anyone know what I'm doing wrong?
When I try to plot using statsmodels, it seems to work fine:
sm.qqplot(values, dist, line = '45')
As always, any help is highly appreciated!
This is a bug in plotnine. Until it is fixed, you can try passing the arguments as a tuple instead of a dict. However, be careful about the positional matching of the arguments (a, b, loc, scale).
Edit
The bug is fixed in the current development version of plotnine and you can use a dict to pass the arguments.
I would like to use the function sigest() of kernlab in Python to estimate a good range for sigmas that I'll use in the construction of RBF Kernels. I am using rpy2 but I can't figure out what would be the argument for "na_action".
Recommended syntax in R:
sigest(x, frac = 0.5, scaled = TRUE, na.action = na.omit)
My syntax:
sigest(np.asmatrix(x), frac = 0.5, scaled = True,
na_action = pandas2ri.pandas.DataFrame.dropna)
x is the data matrix. I also tried
sigest(np.asmatrix(x), frac = 0.5, scaled = True,
na_action = pd.DataFrame.dropna)
Libraries used: matplotlib, numpy, pandas. Also numpy2ri and pandas2ri.
import matplotlib
import numpy as np
import pandas as pd
import rpy2
import rpy2.robjects as robj
from rpy2.robjects.packages import importr
from rpy2.robjects import numpy2ri
rpy2.robjects.numpy2ri.activate()
lab = importr("kernlab")
# omitting the initialization of x: it is read from a CSV file and is an array of shape (40, 1)
y = lab.sigest(np.asmatrix(x), frac = 0.5, scaled = True, na_action = 'ignore')
None of those pandas methods will work for the na.action argument, which expects a reference to the R function stats::na.omit. You must therefore reference that R function somehow. Additionally, because the parameter has a dot in its name, which is not allowed in Python identifiers, consider translating the parameter name manually with rpy2's SignatureTranslatedFunction if importr does not handle it automatically:
from rpy2.robjects.functions import SignatureTranslatedFunction
from rpy2.robjects.packages import importr
lab = importr('kernlab')
lab.sigest = SignatureTranslatedFunction(lab.sigest,
                                         init_prm_translate={'na_action': 'na.action'})
Then try passing the required action (to the renamed parameter) as a string, so that Python does not try to call it directly. You can do the same for other R functions that take the same na.action argument, such as t.test, cor.test, and lm:
y = lab.sigest(np.asmatrix(x), frac=0.5, scaled=True, na_action="na.omit")
So far I found 4 ways to find peaks in Python, however none of them can specify the number of peaks like Matlab does. Can someone provide some insight?
import scipy.signal as sg
import numpy as np
# Method 1
sg.find_peaks_cwt(vector, np.arange(1,4),max_distances=np.arange(1, 4)*2)
# Method 2
sg.argrelextrema(np.array(vector),comparator=np.greater,order=2)
# Method 3
sg.find_peaks(vector, height=7, distance=2.1)
# Method 4
detect_peaks.detect_peaks(vector, mph=7, mpd=2)
Below is the Matlab code that I want to emulate:
[pks,locs] = findpeaks(data,'Npeaks',n)
If you want the exact function Matlab has, why not just use that function? If you have the rest of your data in Python, then you can just use the module provided by Matlab.
import matlab.engine #import matlab engine
eng = matlab.engine.start_matlab() #Start matlab engine
a = [(0.1*i)*(0.1*i-1)*(0.1*i-2) for i in range(50)] #Create some data with peaks
b = eng.findpeaks(matlab.double(a),'Npeaks',1) #Find 1 peak
Try the findpeaks library. Multiple methods are available for the detections of peaks and valleys in 1D-vectors and 2D-arrays (images).
pip install findpeaks
Let's create some peaks:
import numpy as np
i = 10000
xs = np.linspace(0, 3.7*np.pi, i)
X = (0.3*np.sin(xs) + np.sin(1.3 * xs) + 0.9 * np.sin(4.2 * xs) + 0.06 *
     np.random.randn(i))
# import library
from findpeaks import findpeaks
# Initialize
fp = findpeaks()
# Find the peaks (high/low)
results = fp.fit(X)
# Make plot
fp.plot()
# Some of the results:
results['df']
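For the original request (capping the number of peaks returned), a small pure-Python sketch may also be enough. It assumes that "n peaks" means either the first n local maxima in index order or, with a flag, the n tallest ones. find_n_peaks is a hypothetical helper written for this illustration, not a function from scipy or findpeaks. It uses zero-based indices and ignores flat-topped peaks.

```python
def find_n_peaks(values, n_peaks, sort_by_height=False):
    """Return (peaks, locations) for at most n_peaks strict local maxima."""
    # A strict local maximum is larger than both of its neighbours
    locations = [
        i for i in range(1, len(values) - 1)
        if values[i - 1] < values[i] > values[i + 1]
    ]
    if sort_by_height:
        # Tallest peaks first; the sort is stable, so ties keep index order
        locations = sorted(locations, key=lambda i: values[i], reverse=True)
    locations = locations[:n_peaks]
    return [values[i] for i in locations], locations


data = [0, 2, 1, 5, 3, 4, 0, 7, 6]
peaks, locs = find_n_peaks(data, 2)
print(peaks, locs)  # [2, 5] [1, 3]
tallest, tallest_locs = find_n_peaks(data, 2, sort_by_height=True)
print(tallest, tallest_locs)  # [7, 5] [7, 3]
```

The same slicing idea applies to the index array returned by sg.find_peaks: take the first n entries, or sort the indices by peak height before slicing.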
I am using the Python version of the Shogun Toolbox.
I want to use the LinearTimeMMD, which accepts data under the streaming interface CStreamingFeatures. I have the data in the form of two RealFeatures objects: feat_p and feat_q. These work just fine with the QuadraticTimeMMD.
In order to use it with the LinearTimeMMD, I need to create StreamingFeatures objects from these. In this case, these would be StreamingRealFeatures, as far as I know.
My first approach was using this:
gen_p, gen_q = StreamingRealFeatures(feat_p), StreamingRealFeatures(feat_q)
This however does not seem to work: The LinearTimeMMD delivers warnings and an unrealistic result (growing constantly with the number of samples) and calling gen_p.get_dim_feature_space() returns -1. Also, if I try calling gen_p.get_streamed_features(100) this results in a Memory Access Error.
I tried another approach using StreamingFileFromFeatures:
streamFile_p = sg.StreamingFileFromRealFeatures()
streamFile_p.set_features(feat_p)
streamFile_q = sg.StreamingFileFromRealFeatures()
streamFile_q.set_features(feat_q)
gen_p = StreamingRealFeatures(streamFile_p, False, 100)
gen_q = StreamingRealFeatures(streamFile_q, False, 100)
But this results in the same situation with the same described problems.
It seems that in both cases, the contents of the RealFeatures object handed to the StreamingRealFeatures object cannot be accessed.
What am I doing wrong?
EDIT: I was asked for a small working example to show the error:
import os
SHOGUN_DATA_DIR=os.getenv('SHOGUN_DATA_DIR', '../../../data')
import shogun as sg
from shogun import StreamingRealFeatures
import numpy as np
from matplotlib import pyplot as plt
from scipy.stats import laplace, norm
def sample_gaussian_vs_laplace(n=220, mu=0.0, sigma2=1, b=np.sqrt(0.5)):
    # sample from both distributions
    X=norm.rvs(size=n)*np.sqrt(sigma2)+mu
    Y=laplace.rvs(size=n, loc=mu, scale=b)
    return X,Y
# Main Script
mu=0.0
sigma2=1
b=np.sqrt(0.5)
n=220
X,Y=sample_gaussian_vs_laplace(n, mu, sigma2, b)
# turn data into Shogun representation (columns vectors)
feat_p=sg.RealFeatures(X.reshape(1,len(X)))
feat_q=sg.RealFeatures(Y.reshape(1,len(Y)))
gen_p, gen_q = StreamingRealFeatures(feat_p), StreamingRealFeatures(feat_q)
print("Dimensions: ", gen_p.get_dim_feature_space())
print("Number of features: ", gen_p.get_num_features())
print("Number of vectors: ", gen_p.get_num_vectors())
test_features = gen_p.get_streamed_features(1)
print("success")
EDIT 2: The Output of the working example:
Dimensions: -1
Number of features: -1
Number of vectors: 1
Segmentation fault (core dumped)
EDIT 3: Additional Code with LinearTimeMMD using the RealFeatures directly.
mmd = sg.LinearTimeMMD()
kernel = sg.GaussianKernel(10, 1)
mmd.set_kernel(kernel)
mmd.set_p(feat_p)
mmd.set_q(feat_q)
mmd.set_num_samples_p(1000)
mmd.set_num_samples_q(1000)
alpha = 0.05
# Code taken from notebook example on
# http://www.shogun-toolbox.org/notebook/latest/mmd_two_sample_testing.html
# Location on page: In[16]
block_size=100
mmd.set_num_blocks_per_burst(block_size)
# compute an unbiased estimate in linear time
statistic=mmd.compute_statistic()
print("MMD_l[X,Y]^2=%.2f" % statistic)
EDIT 4: Additional code sample showing the growing mmd problem:
import os
SHOGUN_DATA_DIR=os.getenv('SHOGUN_DATA_DIR', '../../../data')
import shogun as sg
from shogun import StreamingRealFeatures
import numpy as np
from matplotlib import pyplot as plt
def mmd(n):
    X = [(1.0,i) for i in range(n)]
    Y = [(2.0,i) for i in range(n)]
    X = np.array(X)
    Y = np.array(Y)
    # turn data into Shogun representation (columns vectors)
    feat_p=sg.RealFeatures(X.reshape(2, len(X)))
    feat_q=sg.RealFeatures(Y.reshape(2, len(Y)))
    mmd = sg.LinearTimeMMD()
    kernel = sg.GaussianKernel(10, 1)
    mmd.set_kernel(kernel)
    mmd.set_p(feat_p)
    mmd.set_q(feat_q)
    mmd.set_num_samples_p(100)
    mmd.set_num_samples_q(100)
    alpha = 0.05
    block_size=100
    mmd.set_num_blocks_per_burst(block_size)
    # compute an unbiased estimate in linear time
    statistic=mmd.compute_statistic()
    print("N =", n)
    print("MMD_l[X,Y]^2=%.2f" % statistic)
    print()
for n in [1000, 10000, 15000, 20000, 25000, 30000]:
    mmd(n)
Output:
N = 1000
MMD_l[X,Y]^2=-12.69
N = 10000
MMD_l[X,Y]^2=-40.14
N = 15000
MMD_l[X,Y]^2=-49.16
N = 20000
MMD_l[X,Y]^2=-56.77
N = 25000
MMD_l[X,Y]^2=-63.47
N = 30000
MMD_l[X,Y]^2=-69.52
For some reason, the Python environment on my machine is broken, so I can't give a snippet in Python. But let me point to a working example in C++ that attempts to address the issues (https://gist.github.com/lambday/983830beb0afeb38b9447fd91a143e67).
I think the easiest way is to create a StreamingRealFeatures instance directly from a RealFeatures instance (as you tried the first time). Check the test1() and test2() methods in the gist, which show the equivalence of using RealFeatures and StreamingRealFeatures in the use case in question. The reason you were getting weird results when streaming directly is that, in order to start the streaming process, the start_parser method of the StreamingRealFeatures class has to be called. We handle these technicalities internally inside the MMD classes. But when you use the streaming features directly, you need to invoke it yourself (see the test3() method in my attached example).
Please note that the compute_statistic() method doesn't return MMD directly, but rather returns \frac{n_x\times n_y}{n_x+n_y}\times MMD^2 (as mentioned in the doc http://shogun.ml/api/latest/classshogun_1_1CMMD.html). With that in mind, maybe the results you are getting for varying number of samples make sense.
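As a rough illustration of the scaling described above, the factor can be divided out again to get back to MMD^2. This is only arithmetic on the formula quoted in the answer. mmd_squared_from_statistic is a hypothetical helper, not part of Shogun, and the input value is made up for the example. Which sample counts n_x and n_y apply to a given run is an assumption you would need to check against your own settings.

```python
def mmd_squared_from_statistic(statistic, n_x, n_y):
    """Undo the n_x * n_y / (n_x + n_y) scaling to recover MMD^2."""
    scale = (n_x * n_y) / (n_x + n_y)
    return statistic / scale


# Hypothetical value, chosen only to make the arithmetic easy to follow
example_statistic = 50.0
estimate = mmd_squared_from_statistic(example_statistic, n_x=100, n_y=100)
print(estimate)  # 1.0, because the scale is 100 * 100 / 200 = 50
```

Because the factor grows with the number of samples, the raw statistic can grow in magnitude even when the underlying MMD^2 estimate stays about the same.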
Hope it helps.
Hi, I want to do DTW and clustering, and I have a question.
import numpy as np
import rpy2.robjects.numpy2ri
from rpy2.robjects.packages import importr
rpy2.robjects.numpy2ri.activate()
# Set up our R namespaces
R = rpy2.robjects.r
DTW = importr('dtw')
# Generate our data
idx = np.linspace(0, 2*np.pi, 100)
template = np.cos(idx)
query = np.sin(idx) + np.array(R.runif(100))/10
# Calculate the alignment vector and corresponding distance
alignment = R.dtw(query, template, keep=True)
dist = alignment.rx('distance')[0][0]
print(dist)
In this code, there are two time-series variables.
If I have many time-series variables (as in the picture below),
how can I go through the process?
I think I will fix one variable and compare it with the other variables, as in the picture below.
The picture shows variable A being compared with all the other variables.
After calculating all the distances, I will apply a clustering method based on the distances.
Is that right?
And can I choose the fixed variable arbitrarily?
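One possible reading of the process described above, sketched in pure Python so that it runs without R: compute the DTW distance for every pair of series, so that no series has to be singled out, and pass the resulting matrix to a distance-based clustering method. dtw_distance and pairwise_dtw are hypothetical helpers written for this illustration. They use a plain dynamic-programming DTW with absolute differences, so the numbers will generally differ from those of the R dtw package, whose default step pattern is different. The three short series are made-up example data.

```python
def dtw_distance(a, b):
    """Basic dynamic-programming DTW with absolute difference as local cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three neighbouring partial alignments
            cost[i][j] = step + min(cost[i - 1][j],
                                    cost[i][j - 1],
                                    cost[i - 1][j - 1])
    return cost[n][m]


def pairwise_dtw(series):
    """Symmetric matrix of DTW distances between every pair of series."""
    k = len(series)
    matrix = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1, k):
            matrix[i][j] = matrix[j][i] = dtw_distance(series[i], series[j])
    return matrix


# Made-up example data: B is A with a repeated leading value, C is flat
series = {
    "A": [0, 1, 2, 1, 0],
    "B": [0, 0, 1, 2, 1, 0],
    "C": [5, 5, 5, 5, 5],
}
names = list(series)
distances = pairwise_dtw([series[name] for name in names])
print(distances[0])  # [0.0, 0.0, 21.0]
```

Row 0 of the matrix is the "compare A with all variables" step from the question. The full matrix simply repeats that step for every choice of fixed variable.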