Python: finding the value of a random variable for a CDF

I apologize in advance if this is poorly worded.
If I have stdDev = 1 and mean = 0, scipy.stats.norm.cdf(-1, loc=0, scale=1) will give me the probability that a normally distributed random variable will be <= -1, and that is 0.15865525393145707.
Given 0.15865..., how do I find the value that gives me -1?
i.e. value(cdf = 0.15865, loc = 0, scale = 1)
Thanks for the help.

edit: you actually need to import norm from scipy.stats (from scipy.stats import norm).
I found the answer. You need to use ppf in scipy.stats, which stands for "percent point function" and is the inverse of the CDF.
So let's say you have a normal distribution with stdDev = 1 and mean = 0, and you want to find the value below which the random variable falls ~15% of the time. Just use:
value = norm.ppf(0.15, loc = 0, scale = 1)
This will return ~ -1. Likewise, if you do:
cdf = norm.cdf(-1, loc = 0, scale = 1)
This will return ~ 0.15 or 15%.
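Putting it together, a minimal self-contained sketch of the round trip (standard normal, as above):
from scipy.stats import norm

p = norm.cdf(-1, loc=0, scale=1)   # P(X <= -1), about 0.1587
x = norm.ppf(p, loc=0, scale=1)    # value whose CDF is p, about -1.0
print(p, x)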
Cool beans.

Related

Guess fitting Voigt curve to data - Python script behaves erratically

Let me describe what I'm attempting to do. This requires the eyes of somebody more knowledgeable of Python than myself.
I have a set of data (actually sediment diameter vs. percentage in a sample) and when plotted it shows a unique spectrum. I'm assuming that there are "modes" hidden within the data, and am trying to force-fit Voigt, Gaussian, or Lorentzian curves to draw out some information. The framework of this script came from a person doing a similar thing on XRD data. I'm not quite proficient enough to really understand how the script is achieving the goals, so I'm having trouble isolating a few strange behaviors. Let me outline the weirdness first, then I'll share the code.
If I run the code over and over again with the same data, the results are not always the same. Not only that, but maybe 25% of the time, I get an error that I can't figure out. Why does this error happen, and why is it only happening some of the time?
TypeError: unsupported operand type(s) for -: 'tuple' and 'float'
When I define "spec" in the beginning of the code, I have to specify model types. By chance, I tried VoigtModel first, and again, it works most of the time. However, if I set the type to GaussianModel or LorentzianModel, the script doesn't run at all:
TypeError: can't multiply sequence by non-int of type 'float'
In the script, I ask it to print some information regarding the curves that it fit. Specifically, the x, y values of the peak of the curve. However, when I run it again, it may fit different curves, but the print() output doesn't change. Like, what?
If anybody could give the code a try and perhaps offer some insight as to what's wonky about this code, I'd be hugely grateful.
edit: I've discovered that if I add more {'type': 'VoigtModel'} entries to spec = , the frequency of script failure decreases. If I remove some (leaving one or two), it fails far more often. I could still use some help understanding the connection.
The code:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import math
import random
from scipy import signal  # needed below for signal.find_peaks_cwt
from lmfit import models
x = 0, 0.09326263, 0.186541806, 0.279826296, 0.373096863, 0.466372043, 0.559644359, 0.652910952, 0.746190193, 0.839463682, 0.932734784, 1.026014714, 1.119288717, 1.212558343, 1.305836463, 1.399111865, 1.492381488, 1.585657384, 1.678931325, 1.772207061, 1.865478378, 1.958752334, 2.05202538, 2.145299504, 2.238574433, 2.331847735, 2.425123471, 2.518395825, 2.611671451, 2.704945386, 2.798218396, 2.891491964, 2.984766114, 3.078040106, 3.171314505, 3.264585057, 3.357863555, 3.451137678, 3.544409886, 3.637684839, 3.730956661, 3.824229504, 3.917507936, 4.010781777, 4.104055591, 4.197326, 4.290603266, 4.383874926, 4.477149297, 4.57042345, 4.663698494, 4.756972396, 4.850245469, 4.943519232, 5.036793499, 5.13006734, 5.223340556, 5.316615186, 5.409888929, 5.503163537, 5.596438512, 5.689708905, 5.782986369, 5.876257098, 5.969532028, 6.062807987, 6.156078156, 6.249352461, 6.342627453, 6.43590194, 6.529177933, 6.622450379, 6.715725752, 6.808997914, 6.902272777, 6.995546352, 7.088819796, 7.18209372, 7.275367937, 7.36864248, 7.461916216, 7.555189618, 7.648464489, 7.741737739, 7.835015624, 7.928288902, 8.021559911, 8.114833257, 8.208110415, 8.301378965, 8.394658258, 8.487929146, 8.581205011, 8.674478952, 8.767749555, 8.861024001, 8.954299075, 9.047574353, 9.140848269, 9.234120373, 9.327394253, 9.420668151, 9.513942544, 9.607217038, 9.700491238, 9.793764758, 9.887039268, 9.980313168, 10.0735868, 10.16686092, 10.26013875, 10.35340805, 10.44668356, 10.53995856, 10.63323182, 10.72650553
y = 0.001352, 0.001721, 0.002661, 0.00523, 0.010879, 0.020142, 0.030427, 0.039188, 0.046922, 0.055438, 0.065352, 0.076432, 0.089913, 0.107888, 0.132296, 0.164797, 0.208043, 0.266067, 0.343688, 0.443698, 0.565158, 0.704086, 0.854979, 1.01437, 1.17932, 1.34739, 1.51366, 1.67215, 1.81638, 1.94147, 2.0432, 2.11934, 2.16792, 2.19005, 2.18907, 2.17172, 2.14565, 2.11866, 2.09749, 2.08736, 2.09102, 2.1084, 2.13739, 2.17478, 2.21729, 2.26139, 2.30342, 2.33966, 2.36671, 2.38045, 2.37413, 2.33769, 2.26088, 2.13908, 1.9769, 1.78619, 1.57832, 1.35944, 1.13483, 0.919488, 0.743312, 0.637312, 0.615423, 0.665356, 0.744581, 0.78791, 0.743882, 0.617121, 0.46602, 0.356204, 0.320677, 0.361725, 0.45788, 0.566712, 0.650727, 0.701846, 0.739237, 0.788714, 0.863346, 0.956347, 1.04314, 1.09353, 1.0874, 1.02493, 0.925497, 0.815472, 0.721377, 0.658056, 0.628985, 0.623906, 0.617012, 0.578717, 0.487132, 0.346259, 0.185964, 0.066494, 0.011942, 0.000815, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
#xlog = [math.log(xval) for xval in x]
spec = {
    'x': x,
    'y': y,
    'model': [
        {'type': 'VoigtModel'},
        {'type': 'VoigtModel'},
        {'type': 'VoigtModel'},
        {'type': 'VoigtModel'},
    ]}
plt.plot(spec['x'], spec['y'])
plt.show()
def update_spec_from_peaks(spec, model_indicies, peak_widths=(1, 50), **kwargs):
    x = spec['x']
    y = spec['y']
    x_range = np.max(x) - np.min(x)
    peak_indicies = signal.find_peaks_cwt(y, peak_widths)
    np.random.shuffle(peak_indicies)
    for peak_indicie, model_indicie in zip(peak_indicies.tolist(), model_indicies):
        model = spec['model'][model_indicie]
        if model['type'] in ['GaussianModel', 'LorentzianModel', 'VoigtModel']:
            params = {
                'height': y[peak_indicie],
                'sigma': x_range / len(x) * np.min(peak_widths),
                'center': x[peak_indicie]
            }
            if 'params' in model:
                model.update(params)
            else:
                model['params'] = params
    return peak_indicies
#
peaks_found = update_spec_from_peaks(spec, [0], peak_widths=(5,))
print(peaks_found)
for i in peaks_found:
    print(x[i], y[i])
def generate_model(spec):
    composite_model = None
    params = None
    x = spec['x']
    y = spec['y']
    x_min = np.min(x)
    x_max = np.max(x)
    x_range = x_max - x_min
    y_max = np.max(y)
    for i, basis_func in enumerate(spec['model']):
        prefix = f'm{i}_'
        model = getattr(models, basis_func['type'])(prefix=prefix)
        if basis_func['type'] in ['GaussianModel', 'LorentzianModel', 'VoigtModel']: # for now VoigtModel has gamma constrained to sigma
            model.set_param_hint('sigma', min=1e-6, max=x_range)
            model.set_param_hint('center', min=x_min, max=x_max)
            model.set_param_hint('height', min=1e-6, max=1.1*y_max)
            model.set_param_hint('amplitude', min=1e-6)
            # default guess is horrible!! do not use guess()
            default_params = {
                prefix+'center': x_min + x_range * random.random(),
                prefix+'height': y_max * random.random(),
                prefix+'sigma': x_range * random.random()
            }
        else:
            raise NotImplemented(f'model {basis_func["type"]} not implemented yet')
        if 'help' in basis_func: # allow override of settings in parameter
            for param, options in basis_func['help'].items():
                model.set_param_hint(param, **options)
        model_params = model.make_params(**default_params, **basis_func.get('params', {}))
        if params is None:
            params = model_params
        else:
            params.update(model_params)
        if composite_model is None:
            composite_model = model
        else:
            composite_model = composite_model + model
    return composite_model, params
model, params = generate_model(spec)
output = model.fit(spec['y'], params, x=spec['x'])
fig, ax = plt.subplots()
ax.scatter(spec['x'], spec['y'], s=4)
components = output.eval_components(x=spec['x'])
print(len(spec['model']))
for i, model in enumerate(spec['model']):
    ax.plot(spec['x'], components[f'm{i}_'])
It should be sort of obvious that any code will run exactly the same every single time when given the same inputs.
The fitting appears to behave erratically because you are deliberately giving it erratic inputs: you are telling it to randomize the initial starting values. You are also setting bounds programmatically without checking how close the initial values are to those bounds. So, ask yourself: why are you doing these things?
Your code seems quite complicated, possibly so much so that you don't understand it yourself. Start by getting rid of all the junk. Maybe make a model that is a sum of Gaussians, something like this (the code will run and gives a decent fit):
import numpy as np
import matplotlib.pyplot as plt
from scipy import signal
from lmfit import models
x = np.array([0, 0.09326263, 0.186541806, 0.279826296, 0.373096863, 0.466372043, 0.559644359, 0.652910952, 0.746190193, 0.839463682, 0.932734784, 1.026014714, 1.119288717, 1.212558343, 1.305836463, 1.399111865, 1.492381488, 1.585657384, 1.678931325, 1.772207061, 1.865478378, 1.958752334, 2.05202538, 2.145299504, 2.238574433, 2.331847735, 2.425123471, 2.518395825, 2.611671451, 2.704945386, 2.798218396, 2.891491964, 2.984766114, 3.078040106, 3.171314505, 3.264585057, 3.357863555, 3.451137678, 3.544409886, 3.637684839, 3.730956661, 3.824229504, 3.917507936, 4.010781777, 4.104055591, 4.197326, 4.290603266, 4.383874926, 4.477149297, 4.57042345, 4.663698494, 4.756972396, 4.850245469, 4.943519232, 5.036793499, 5.13006734, 5.223340556, 5.316615186, 5.409888929, 5.503163537, 5.596438512, 5.689708905, 5.782986369, 5.876257098, 5.969532028, 6.062807987, 6.156078156, 6.249352461, 6.342627453, 6.43590194, 6.529177933, 6.622450379, 6.715725752, 6.808997914, 6.902272777, 6.995546352, 7.088819796, 7.18209372, 7.275367937, 7.36864248, 7.461916216, 7.555189618, 7.648464489, 7.741737739, 7.835015624, 7.928288902, 8.021559911, 8.114833257, 8.208110415, 8.301378965, 8.394658258, 8.487929146, 8.581205011, 8.674478952, 8.767749555, 8.861024001, 8.954299075, 9.047574353, 9.140848269, 9.234120373, 9.327394253, 9.420668151, 9.513942544, 9.607217038, 9.700491238, 9.793764758, 9.887039268, 9.980313168, 10.0735868, 10.16686092, 10.26013875, 10.35340805, 10.44668356, 10.53995856, 10.63323182, 10.72650553])
y = np.array([0.001352, 0.001721, 0.002661, 0.00523, 0.010879, 0.020142, 0.030427, 0.039188, 0.046922, 0.055438, 0.065352, 0.076432, 0.089913, 0.107888, 0.132296, 0.164797, 0.208043, 0.266067, 0.343688, 0.443698, 0.565158, 0.704086, 0.854979, 1.01437, 1.17932, 1.34739, 1.51366, 1.67215, 1.81638, 1.94147, 2.0432, 2.11934, 2.16792, 2.19005, 2.18907, 2.17172, 2.14565, 2.11866, 2.09749, 2.08736, 2.09102, 2.1084, 2.13739, 2.17478, 2.21729, 2.26139, 2.30342, 2.33966, 2.36671, 2.38045, 2.37413, 2.33769, 2.26088, 2.13908, 1.9769, 1.78619, 1.57832, 1.35944, 1.13483, 0.919488, 0.743312, 0.637312, 0.615423, 0.665356, 0.744581, 0.78791, 0.743882, 0.617121, 0.46602, 0.356204, 0.320677, 0.361725, 0.45788, 0.566712, 0.650727, 0.701846, 0.739237, 0.788714, 0.863346, 0.956347, 1.04314, 1.09353, 1.0874, 1.02493, 0.925497, 0.815472, 0.721377, 0.658056, 0.628985, 0.623906, 0.617012, 0.578717, 0.487132, 0.346259, 0.185964, 0.066494, 0.011942, 0.000815, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
peaks = signal.find_peaks_cwt(y, (1.5, 25))
xstep = x.ptp() / len(x)
model, params = None, None
for i, peak_index in enumerate(peaks):
    this_model = models.GaussianModel(prefix=f'p{1+i:d}_')
    this_params = this_model.make_params(amplitude=y[peak_index], center=x[peak_index], sigma=2*xstep)
    if model is None:
        model = this_model
        params = this_params
    else:
        model += this_model
        params.update(this_params)
result = model.fit(y, params, x=x)
print(result.fit_report())
plt.plot(x, y, label='data')
plt.plot(x, result.best_fit, label='fit')
plt.legend()
plt.show()
Does it need to be a lot more complicated than that? Hm, maybe not. This gives a decent fit, though it might be missing a subtle shoulder peak at around x=7.
Start simple. Keep it simple for as long as possible. Add complexity only when it simplifies something else.

Robust Linear Model - No exogenous var, just constants

I'm doing a robust linear regression on only a constant (a column of 1s) and no exogenous variable. I'm able to fit the model just fine by passing in a list of 1's equal in length to the 'xi_list' from the code snippet below.
import numpy as np
import statsmodels.api as sm

def sigma_and_miu(gvkey, statevar_dict):
    statevar_list = statevar_dict[gvkey]
    xi_list = [np.log(statevar_list[i]) - np.log(statevar_list[i-1]) for i in range(1, len(statevar_list))]
    x = np.array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
    y = np.array(xi_list)
    rlm_model = sm.RLM(y, x, M=sm.robust.norms.HuberT())
    rlm_results = rlm_model.fit()
    sigma = np.std(rlm_results.resid * rlm_results.weights)
    miudelta = rlm_results.params[0] + (0.5 * sigma ** 2)
    return miudelta, sigma
This function is run with the following inputs.
dict = {1004:[1796.6, 1938.6, 2085.4, 2009.4, 1906.1, 2002.2, 2164.9, 2478.8, 2357.4, 2662.1, 2911.2, 2400.4, 2535.9, 2812.3, 2873.1, 2775.5, 3374.2, 3345.5, 3466.3, 2409.4]}
key = 1004
miu, sigma = sigma_and_miu(key,dict)
However, I'm looking for a more scalable approach. I was thinking that one solution could be to add a loop that appends as many 1's as the length of the xi_list variable, but that does not seem very efficient.
I know there is sm.add_constant(), and I tried adding this constant to my 'y' variable while leaving 'x' blank in the sm.RLM() call, but then the model won't run.
So my question is: is there a better way to create the list of 1s, or should I just go with the loop?
Use basic numpy vectorized computation
e.g.
statevar = np.asarray(statevar_list)
y = np.log(statevar[1:]) - np.log(statevar[:-1])
x = np.ones(len(y))
Aside: rlm_results should have the robust estimate of the standard deviation used in the estimation available as its scale attribute.
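Putting that together, a sketch of the function using the vectorized differences (assuming statsmodels is imported as sm, as in the question):
import numpy as np
import statsmodels.api as sm

def sigma_and_miu(gvkey, statevar_dict):
    statevar = np.asarray(statevar_dict[gvkey])
    y = np.log(statevar[1:]) - np.log(statevar[:-1])   # log-differences
    x = np.ones(len(y))                                # constant-only design
    rlm_results = sm.RLM(y, x, M=sm.robust.norms.HuberT()).fit()
    sigma = np.std(rlm_results.resid * rlm_results.weights)
    miudelta = rlm_results.params[0] + 0.5 * sigma ** 2
    return miudelta, sigma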

Random list of ones and zeros with minimum distance between ones

I would like to have a random list where the occurrence of ones is 10% and the rest of the items are zeros. The length of this list is 1000. I would like the values to be in a random order, with an adjustable minimum distance between ones. So for example, if I choose a value of 3, the list would look something like this:
[0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, ...]
What is the most elegant way to achieve this?
Edit. I was asked for more information and to show some effort.
This is for a study where 0 signifies one type of stimulus and 1 another kind of stimulus, and we want to have a minimum distance between stimuli of type 1.
So far I have achieved this with:
import random

trials = [0]*400
trials.extend([1]*100)
random.shuffle(trials)
# Make sure a fixed minimum number of standard runs follow each deviant
i = 0
while i < len(trials):
    if trials[i] == 1:
        trials[i+1:i+1] = 5*[0]
        i = i + 6
    else:
        i = i + 1
This gives me a list of length 1000, but it seems a little clumsy to me, so out of curiosity I was wondering if there is a better way to do this.
You have essentially a binomial random variable. The waiting time between successes for a binomial random variable is given by the negative binomial distribution. Using this distribution, we can get a random sequence of intervals between successes for a binomial variable with the specified success rate. Then we simply add your "refractory period" to all intervals and create a binary representation.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import nbinom
min_failures = 3 # refractory period
total_successes = 100
total_time = 1000
# create a negative binomial distribution to model the waiting times to the next success for a Bernoulli RV;
rv = nbinom(1, total_successes / float(total_time))
# get interval lengths between successes;
intervals = rv.rvs(size=total_successes)
# get event times
events = np.cumsum(intervals)
# rescale event times to fit into the total time - refractory time
total_refractory = total_successes * min_failures
remaining_time = total_time - total_refractory
events = events.astype(float) / np.max(events) * remaining_time
# add refractory periods
intervals = np.diff(np.r_[0, events])
intervals += min_failures
events = np.r_[0, np.cumsum(intervals[:-1])] # series starts with success
# create binary representation
binary = np.zeros((total_time), dtype=np.uint8)
binary[events.astype(int)] = 1
To check that the inter-event intervals match your expectations, plot a histogram:
# check that intervals match our expectations
fig, ax = plt.subplots(1,1)
ax.hist(intervals, bins=20, density=True)
ax.set_xlabel('Interval length')
ax.set_ylabel('Normalised frequency')
xticks = ax.get_xticks()
ax.set_xticks(np.r_[xticks, min_failures])
plt.show()
My approach to this problem is to maintain a list of candidate positions from which the next position is chosen randomly. The surrounding range of positions is then checked to be empty; if so, this position is accepted and the whole range around it, in which no future position is allowed, is removed from the list of available candidates. This keeps the number of loop iterations to a minimum.
It may happen (if mindist is big compared to the number of positions) that fewer than the required number of positions are returned. In that case, the function needs to be called again, as shown below.
import random

def array_ones(mindist, length_array, numones):
    result = [0]*length_array
    candidates = range(length_array)
    while sum(result) < numones and len(candidates) > 0:
        # choose one position randomly from candidates
        pos = candidates[random.randint(0, len(candidates)-1)]
        L = pos-mindist if pos >= mindist else 0
        U = pos+mindist if pos <= length_array-1-mindist else length_array-1
        if sum(result[L:U+1]) == 0: # no taken positions around
            result[pos] = 1
            # remove all candidates around this position
            no_candidates = set(range(L, U+1))
            candidates = list(set(candidates).difference(no_candidates))
    return result, sum(result)
def main():
    numones = 5
    numtests = 50
    mindist = 4
    while True:
        arr, ones = array_ones(mindist, numtests, numones)
        if ones == numones:
            break
    print(arr)

if __name__ == '__main__':
    main()
The function returns the array and its number of ones. Set difference is used to remove a whole range of candidate positions non-iteratively.
Seems that there wasn't a very simple one-line answer to this problem. I finally came up with this:
import numpy as np

def construct_list(n_zeros, n_ones, min_distance):
    if min_distance > (n_zeros + n_ones) / n_ones:
        raise ValueError("Minimum distance too high.")
    initial_zeros = n_zeros - min_distance * n_ones
    block = np.random.permutation(np.array([0]*initial_zeros + [1]*n_ones))
    ones = np.where(block == 1)[0].repeat(min_distance)
    # Insert min_distance number of 0s after each 1
    block = np.insert(block, ones+1, 0)
    return block.tolist()
This seems simpler than the other answers, although Paul's answer was just a little faster with the values n_zeros=900, n_ones=100, min_distance=3.
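For example, a quick sanity check with those values (an illustrative snippet, not from the original post):
import numpy as np

lst = construct_list(n_zeros=900, n_ones=100, min_distance=3)
print(len(lst), sum(lst))              # 1000 100
ones_idx = np.where(np.array(lst) == 1)[0]
print(np.diff(ones_idx).min())         # at least 4, i.e. at least 3 zeros between consecutive ones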

Root mean square of a function in python

I want to calculate root mean square of a function in Python. My function is in a simple form like y = f(x). x and y are arrays.
I searched the NumPy and SciPy docs and couldn't find anything.
I'm going to assume that you want to compute the expression given by the following pseudocode:
ms = 0
for i = 1 ... N
    ms = ms + y[i]^2
ms = ms / N
rms = sqrt(ms)
i.e. the square root of the mean of the squared values of elements of y.
In numpy, you can simply square y, take its mean and then its square root as follows:
rms = np.sqrt(np.mean(y**2))
So, for example:
>>> y = np.array([0, 0, 1, 1, 0, 1, 0, 1, 1, 1]) # Six 1's
>>> y.size
10
>>> np.mean(y**2)
0.59999999999999998
>>> np.sqrt(np.mean(y**2))
0.7745966692414834
Do clarify your question if you mean to ask something else.
You could use the sklearn function
from sklearn.metrics import mean_squared_error
rmse = mean_squared_error(y_actual,[0 for _ in y_actual], squared=False)
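Note that in newer scikit-learn versions (1.4+, if memory serves) the squared=False argument is deprecated and there is a dedicated helper instead; a sketch, assuming a recent scikit-learn:
from sklearn.metrics import root_mean_squared_error

# RMS of y_actual, computed as the RMSE against an all-zero prediction
rmse = root_mean_squared_error(y_actual, [0 for _ in y_actual])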
numpy.std(x) tends to rms(x) when mean(x) tends to 0 (thanks to #Seb), as is typical for sound recordings, vibrations, and other signals that fluctuate around zero.
rms = lambda x_seq: (sum(x*x for x in x_seq)/len(x_seq))**(1/2)
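A quick illustration of that relationship (illustrative data, not from the original answer):
import numpy as np

x = np.sin(np.linspace(0, 20 * np.pi, 10000))   # zero-mean signal
print(np.sqrt(np.mean(x**2)), np.std(x))        # both roughly 0.7071 for a unit-amplitude sine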
In case you'd like to frame your array before computing RMS, here is a numpy solution:
nframes = 1000
rms = np.array([
    np.sqrt(np.mean(arr**2))
    for arr in np.array_split(arr, nframes)
])
If you'd like to specify frame length instead of frame counts, you'd do this first:
frame_length = 200
arr_length = arr.shape[0]
nframes = arr_length // frame_length +1
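Putting the two snippets together, a sketch with an illustrative signal (the random data here is just a stand-in):
import numpy as np

arr = np.random.randn(10000)                     # stand-in signal
frame_length = 200
nframes = arr.shape[0] // frame_length + 1       # derive frame count from frame length
rms_per_frame = np.array([
    np.sqrt(np.mean(frame**2))
    for frame in np.array_split(arr, nframes)
])
print(rms_per_frame.shape)                       # (51,)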

Find the position of a lowest difference between numpy arrays

I've got two music files: one lossless, with a little sound gap at the beginning (at the moment it's just silence, but it could be anything: a sinusoid or just some noise), and one mp3:
plt.plot(y[:100000])   # plot of the lossless file (figure omitted)
plt.plot(y2[:100000])  # plot of the mp3 (figure omitted)
These lists are similar but not identical, so I need to cut this gap, i.e. find the offset of one list within the other with the lowest delta error.
And here's my solution (5.7065 sec.):
error = []
for i in range(25000):
    y_n = y[i:100000]
    y2_n = y2[:100000-i]
    error.append(abs(y_n - y2_n).mean())
start = np.array(error).argmin()
print(start, error[start]) #23057 0.0100046
Is there any pythonic way to solve this?
Edit:
After calculating the mean distance between special points (e.g. where the data == 0.5), I reduced the search range from 25000 to 2000 points. This gives a reasonable time of 0.3871 s:
a = np.where(y[:100000].round(1) == 0.5)[0]
b = np.where(y2[:100000].round(1) == 0.5)[0]
mean = int((a - b[:len(a)]).mean())
delta = 1000
error = []
for i in range(mean - delta, mean + delta):
    ...
What you are trying to do is a cross-correlation of the two signals.
This can be done easily using signal.correlate from the scipy library:
import scipy.signal
import numpy as np
# limit your signal length to speed things up
lim = 25000
# do the actual correlation
corr = scipy.signal.correlate(y[:lim], y2[:lim], mode='full')
# The offset is the maximum of your correlation array,
# itself being offset by (lim - 1):
offset = np.argmax(corr) - (lim - 1)
You might want to take a look at this answer to a similar problem.
Let's generate some data first
N = 1000
y1 = np.random.randn(N)
y2 = y1 + np.random.randn(N) * 0.05
y2[0:int(N / 10)] = 0
In these data, y1 and y2 are almost the same (note the small added noise), but the first 10% of y2 is empty (similar to your example).
We can now calculate the absolute difference between the two vectors and find the first element for which the absolute difference is below a sensitivity threshold:
abs_delta = np.abs(y1 - y2)
THRESHOLD = 1e-2
sel = abs_delta < THRESHOLD
ix_start = np.where(sel)[0][0]
fig, axes = plt.subplots(3, 1)
ax = axes[0]
ax.plot(y1, '-')
ax.set_title('y1')
ax.axvline(ix_start, color='red')
ax = axes[1]
ax.plot(y2, '-')
ax.axvline(ix_start, color='red')
ax.set_title('y2')
ax = axes[2]
ax.plot(abs_delta)
ax.axvline(ix_start, color='red')
ax.set_title('abs diff')
This method works if the overlapping parts are indeed "almost identical". You will have to think of smarter alignment ways if the similarity is low.
I think what you are looking for is correlation. Here is a small example.
import numpy as np
equal_part = [0, 1, 2, 3, -2, -4, 5, 0]
y1 = equal_part + [0, 1, 2, 3, -2, -4, 5, 0]
y2 = [1, 2, 4, -3, -2, -1, 3, 2]+y1
np.argmax(np.correlate(y1, y2, 'same'))
Out:
7
So this returns the time-difference, where the correlation between both signals is at its maximum. As you can see, in the example the time difference should be 8, but this depends on your data...
Also note that both signals have the same length.
