Python: slice array uniformly with respect to dataset

I have a data set with time t and data d. Unfortunately, I changed the export rate partway through (the rate was too high initially). I would like to sample the data so that I effectively thin out the points exported at the high rate but keep the points exported at the low rate near the end.
Consider the following code:
arr = np.loadtxt(file_name,skiprows=3)
Where t = arr[:,0], d = arr[:,1].
Here is a function to get a uniform slicing:
def get_uniform_slices(arr, N_desired_points):
    # assumes `import math as m` and `import numpy as np` earlier in the script
    s = arr.shape
    if s[0] > N_desired_points:
        n_skip = m.ceil(s[0]/N_desired_points)
    else:
        n_skip = 1
    return arr[0::n_skip,:] # Sample output
However, the data then looks fine for the high-frequency exported data, but is too sparse for the low-frequency exported data.
Is there some way to slice such that indexes are uniformly spaced with respect to t?
Any help is greatly appreciated.
This is the function I used to find the indexes, based on the accepted answer (nearest is defined in that answer):
def get_uniform_index(t,N_desired_points):
    t_uniform = np.linspace(np.amin(t),np.amax(t),N_desired_points)
    t_desired = [nearest(t_d, t) for t_d in t_uniform]
    i = np.in1d(t, t_desired)
    return i

You have 2d data e.g.,
t = np.arange(0., 100., 0.5)
d = np.random.rand(len(t))
You want to keep only particular values of data at uniformly spaced times, e.g.
t_desired = np.arange(0., 100., 1.)
Let's pick out the data points at the desired times using the in1d function:
d_pruned = d[np.in1d(t, t_desired)]
Of course, you must choose t_desired yourself, and its values must match values in t. If that's a problem, you could pick approximately uniform times using e.g.,
def nearest(x, arr):
    index = (np.abs(arr - x)).argmin()
    return arr[index]
t_uniform = np.arange(0., 100., 1.)
t_desired = [nearest(t_d, t) for t_d in t_uniform]
Here is the complete code:
import numpy as np
t = np.arange(0., 100., 0.5)
d = np.random.rand(len(t))
def nearest(x, arr):
    index = (np.abs(arr - x)).argmin()
    return arr[index]
t_uniform = np.arange(0., 100., 1.)
t_desired = [nearest(t_d, t) for t_d in t_uniform]
d_pruned = d[np.in1d(t, t_desired)]
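The nearest-time idea above can also be sketched without the Python-level loop, returning integer indices instead of a boolean mask. This is only a sketch under assumptions: the function name uniform_time_indices and the synthetic data (dense export before t = 10, sparse afterwards) are hypothetical and not part of the original question or answer.

```python
import numpy as np

def uniform_time_indices(t, n_points):
    """Indices of the samples in t closest to n_points evenly spaced times."""
    t = np.asarray(t, dtype=float)
    targets = np.linspace(t.min(), t.max(), n_points)
    # For every target time, find the index of the nearest sample by
    # broadcasting |t - target| into an (n_points, len(t)) array.
    idx = np.abs(t[None, :] - targets[:, None]).argmin(axis=1)
    # Several targets may share the same nearest sample; keep each index once.
    return np.unique(idx)

# Hypothetical data: exported every 0.01 up to t = 10, then every 1.0.
t = np.concatenate([np.arange(0.0, 10.0, 0.01), np.arange(10.0, 100.0, 1.0)])
d = np.sin(t)

keep = uniform_time_indices(t, 50)
t_pruned, d_pruned = t[keep], d[keep]
```

The broadcast builds an array of shape (n_points, len(t)), so for very long recordings a loop like the one in the answer may use less memory.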

Related

Finding anomalous values from sinusoidal data

How can I find anomalous values in the following data? I am simulating a sinusoidal pattern. I can plot the data and spot any anomalies or noise, but how can I do it without plotting the data? I am looking for simple approaches other than machine learning methods.
import random
import numpy as np
import matplotlib.pyplot as plt
N = 10 # Set signal sample length
t1 = -np.pi # Simulation begins at t1
t2 = np.pi; # Simulation ends at t2
in_array = np.linspace(t1, t2, N)
print("in_array : ", in_array)
out_array = np.sin(in_array)
plt.plot(in_array, out_array, color = 'red', marker = "o") ; plt.title("numpy.sin()")
Inject random noise
noise_input = random.uniform(-.5, .5); print("Noise : ",noise_input)
in_array[random.randint(0,len(in_array)-1)] = noise_input
print(in_array)
plt.plot(in_array, out_array, color = 'red', marker = "o") ; plt.title("numpy.sin()")
Data with noise
I've thought of the following approach to your problem. Since only a few values in the time vector are anomalous, the rest of the values follow a regular progression. If we group the steps between consecutive points into clusters and calculate the average step of the biggest cluster (essentially the pool of values that represents the real signal), we can use that average to run a triad detection, within a given threshold, over the vector and detect which elements are anomalous.
For this we need two functions: calculate_average_step which will calculate that average for the biggest cluster of close values, and then we need detect_anomalous_values which will yield the indexes of the anomalous values in our vector, based on that average calculated earlier.
After we detected the anomalous values, we can go ahead and replace them with an estimated value, which we can determine from our average step value and by using the adjacent points in the vector.
import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
def calculate_average_step(array, threshold=5):
    """
    Determine the average step by doing a weighted average based on clustering of averages.
    array: our array
    threshold: the +/- offset (in percent) for grouping clusters. Applicable to all elements in the array.
    """
    # determine all the steps
    steps = []
    for i in range(0, len(array) - 1):
        steps.append(abs(array[i] - array[i+1]))
    # determine the steps clusters
    clusters = []
    skip_indexes = []
    cluster_index = 0
    for i in range(len(steps)):
        if i in skip_indexes:
            continue
        # determine the cluster band (based on threshold)
        cluster_lower = steps[i] - (steps[i]/100) * threshold
        cluster_upper = steps[i] + (steps[i]/100) * threshold
        # create the new cluster
        clusters.append([])
        clusters[cluster_index].append(steps[i])
        # try to match elements from the rest of the array
        for j in range(i + 1, len(steps)):
            if not (cluster_lower <= steps[j] <= cluster_upper):
                continue
            clusters[cluster_index].append(steps[j])
            skip_indexes.append(j)
        cluster_index += 1 # increment the cluster id
    clusters = sorted(clusters, key=lambda x: len(x), reverse=True)
    biggest_cluster = clusters[0] if len(clusters) > 0 else None
    if biggest_cluster is None:
        return None
    return sum(biggest_cluster) / len(biggest_cluster) # return our most common average
def detect_anomalous_values(array, regular_step, threshold=5):
    """
    Will scan every triad (3 points) in the array to detect anomalies.
    array: the array to iterate over.
    regular_step: the step around which we form the upper/lower band for filtering
    threshold: +/- variation (in percent) between the steps of the first and median element and median and third element.
    """
    assert(len(array) >= 3) # must have at least 3 elements
    anomalous_indexes = []
    step_lower = regular_step - (regular_step / 100) * threshold
    step_upper = regular_step + (regular_step / 100) * threshold
    # detection will be forward from i (hence at least 3 elements must be available)
    for i in range(0, len(array) - 2):
        a = array[i]
        b = array[i+1]
        c = array[i+2]
        first_step = abs(a-b)
        second_step = abs(b-c)
        first_belonging = step_lower <= first_step <= step_upper
        second_belonging = step_lower <= second_step <= step_upper
        # detect that both steps are alright
        if first_belonging and second_belonging:
            continue # all is good here, nothing to do
        # detect if the first point in the triad is bad
        if not first_belonging and second_belonging:
            anomalous_indexes.append(i)
        # detect the last point in the triad is bad
        if first_belonging and not second_belonging:
            anomalous_indexes.append(i+2)
        # detect the mid point in triad is bad (or everything is bad)
        if not first_belonging and not second_belonging:
            anomalous_indexes.append(i+1)
        # we won't add here the others because they will be detected by
        # the rest of the triad scans
    return sorted(set(anomalous_indexes)) # return unique indexes
if __name__ == "__main__":
    N = 10 # Set signal sample length
    t1 = -np.pi # Simulation begins at t1
    t2 = np.pi # Simulation ends at t2
    in_array = np.linspace(t1, t2, N)
    # add some noise
    noise_input = random.uniform(-.5, .5)
    in_array[random.randint(0, len(in_array)-1)] = noise_input
    noisy_out_array = np.sin(in_array)
    # display noisy sin
    plt.figure()
    plt.plot(in_array, noisy_out_array, color = 'red', marker = "o")
    plt.title("noisy numpy.sin()")
    # detect anomalous values
    average_step = calculate_average_step(in_array)
    anomalous_indexes = detect_anomalous_values(in_array, average_step)
    # replace anomalous points with an estimated value based on our calculated average
    for anomalous in anomalous_indexes:
        if anomalous > 0:
            # forward extrapolation from the previous point
            in_array[anomalous] = in_array[anomalous-1] + average_step
        else:
            # index 0 has no previous point (index -1 would silently wrap to the
            # last element instead of raising IndexError), so extrapolate backward
            in_array[anomalous] = in_array[anomalous+1] - average_step
    # generate sine wave
    out_array = np.sin(in_array)
    plt.figure()
    plt.plot(in_array, out_array, color = 'green', marker = "o")
    plt.title("cleaned numpy.sin()")
    plt.show()
Noisy sine:
Cleaned sine:
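For comparison, here is a compact vectorized sketch of a similar idea. It is not the answer's clustering method: it swaps in the median of the steps as the "regular" step, which is an assumption that holds only when most steps are regular. The name find_step_outliers, the rel_tol parameter, and the fixed anomaly position are hypothetical.

```python
import numpy as np

def find_step_outliers(t, rel_tol=0.05):
    """Flag points whose neighbouring steps deviate from the median step."""
    t = np.asarray(t, dtype=float)
    steps = np.diff(t)
    typical = np.median(steps)  # stands in for the biggest-cluster average
    bad_step = np.abs(steps - typical) > rel_tol * abs(typical)
    bad = np.zeros(t.size, dtype=bool)
    # An interior point is anomalous when the step into it and out of it are both off.
    bad[1:-1] = bad_step[:-1] & bad_step[1:]
    # An end point has only one step; blame it only if its neighbour is not to blame.
    bad[0] = bad_step[0] and not bad_step[1]
    bad[-1] = bad_step[-1] and not bad_step[-2]
    return np.flatnonzero(bad), typical

t = np.linspace(-np.pi, np.pi, 10)
t[4] = 0.2  # inject one anomalous time value at a known position
idx, step = find_step_outliers(t)
print(idx, step)
```

Like the triad scan, this needs at least three points, and it would likely misbehave when several neighbouring values are anomalous at once.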
Your problem lies in the time vector (which is one-dimensional). You will need to apply some sort of filter to that vector.
First thing that came to mind was medfilt (median filter) from scipy and it looks something like this:
from scipy.signal import medfilt
l1 = [0, 10, 20, 30, 2, 50, 70, 15, 90, 100]
l2 = medfilt(l1)
print(l2)
the output of this will be:
[ 0. 10. 20. 20. 30. 50. 50. 70. 90. 90.]
The problem with this filter, though, is that if noise values sit at the edges of the vector, as in [200, 0, 10, 20, 30, 2, 50, 70, 15, 90, 100, -50], the output would be something like [ 0. 10. 10. 20. 20. 30. 50. 50. 70. 90. 90. 0.], which is obviously not OK for the sine plot, since it produces the same artifacts in the array of sine values.
A better approach is to treat the time vector as the y output and its index values as the x input, and to fit a linear regression to this "time linear function". Note the quotes: they just mean we're faking a two-dimensional model by supplying a fake x vector. The code uses scipy's linregress (linear regression) function:
import numpy as np
from scipy.stats import linregress
l1 = [5, 0, 10, 20, 30, -20, 50, 70, 15, 90, 100]
l1_x = np.arange(len(l1)) # an array rather than a range, so the line below is plain array arithmetic
slope, intercept, r_val, p_val, std_err = linregress(l1_x, l1)
l1 = intercept + slope * l1_x
print(l1)
whose output will be:
[-10.45454545 -1.63636364 7.18181818 16. 24.81818182
33.63636364 42.45454545 51.27272727 60.09090909 68.90909091
77.72727273]
Now let's apply this to your time vector.
import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import linregress
N = 20
# N = 10 # Set signal sample length
t1 = -np.pi # Simulation begins at t1
t2 = np.pi; # Simulation ends at t2
in_array = np.linspace(t1, t2, N)
# add some noise
noise_input = random.uniform(-.5, .5);
in_array[random.randint(0, len(in_array)-1)] = noise_input
# apply filter on time array
in_array_x = np.arange(len(in_array)) # an array rather than a range, so the line below is plain array arithmetic
slope, intercept, r_val, p_val, std_err = linregress(in_array_x, in_array)
in_array = intercept + slope * in_array_x
# generate sine wave
out_array = np.sin(in_array)
print("OUT ARRAY")
print(out_array)
plt.plot(in_array, out_array, color = 'red', marker = "o") ; plt.title("numpy.sin()")
plt.show()
the output will be:
The resulting signal will be an approximation of the original, as with any form of extrapolation, interpolation, or regression filtering.

Defining a numeric (custom) likelihood function in PyMC3

After looking at several questions/answers (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11) and PyMC3's documentation, I've managed to create a MCVE of my MCMC setup (see below).
My fitted parameters are continuous and discrete, so the priors are defined using pm.Uniform and pm.DiscreteUniform (with a re-scaling applied to the latter). My likelihood function is particularly convoluted (it involves comparing the N-dimensional histograms of my observed data and some synthetic data generated using the free parameters), so I had to write it using theano's @as_op decorator.
The implementation shown here works on a toy model working on random data, but in my actual model the likelihood and parameters are very similar.
My questions are:
Is this correct? Is there anything I should be doing different?
The call to the likelihood function is just thrown there apparently doing nothing and connected to nothing. Is this the proper way to do this?
I'm using NUTS for the continuous parameters, but since my likelihood is numeric, I don't think I should be able to do this. Since the code still runs, I'm not sure what's going on.
This is the first time I've used PyMC3 so any pointers will be really helpful.
import matplotlib.pyplot as plt
import numpy as np
import pymc3 as pm
import theano.tensor as tt
from theano.compile.ops import as_op
def main():
    trace = bayesMCMC()
    print(pm.summary(trace))
    pm.traceplot(trace)
    plt.show()
def bayesMCMC():
    """
    Define and process the full model.
    """
    with pm.Model() as model:
        # Define uniform priors.
        A = pm.Uniform("A", lower=0., upper=5.)
        B = pm.Uniform("B", lower=10., upper=20.)
        C = pm.Uniform("C", lower=0., upper=1.)
        # Define discrete priors.
        minD, maxD, stepD = 0.005, 0.06, 0.005
        ND = int((maxD - minD) / stepD)
        D = pm.DiscreteUniform("D", 0., ND)
        minE, maxE, stepE = 9., 10., 0.05
        NE = int((maxE - minE) / stepE)
        E = pm.DiscreteUniform("E", 0., NE)
        # Is this correct??
        logp(A, B, C, D, E)
        step1 = pm.NUTS(vars=[A, B, C])
        print("NUTS")
        step2 = pm.Metropolis(vars=[D, E])
        print("Metropolis")
        trace = pm.sample(300, [step1, step2]) # , start)
    return trace
@as_op(
    itypes=[tt.dscalar, tt.dscalar, tt.dscalar, tt.lscalar, tt.lscalar],
    otypes=[tt.dscalar])
def logp(A, B, C, D, E):
    """
    Likelihood evaluation.
    """
    # Get observed data and some extra info to re-scale the discrete parameters
    obsData, minD, stepD, minE, stepE = obsservedData()
    # Scale discrete parameters
    D, E = D * stepD + minD, E * stepE + minE
    # Generate synthetic data using the prior values
    synthData = synthetic(A, B, C, D, E)
    # Generate N-dimensional histograms for both data sets.
    obsHist, edges = np.histogramdd(obsData)
    synHist, _ = np.histogramdd(synthData, bins=edges)
    # Flatten both histograms
    obsHist_f, synHist_f = obsHist.ravel(), synHist.ravel()
    # Remove all bins where N_bin=0.
    binNzero = obsHist_f != 0
    obsHist_f, synHist_f = obsHist_f[binNzero], synHist_f[binNzero]
    # Assign small value to the 0 elements in synHist_f to avoid issues with
    # the log()
    synHist_f[synHist_f == 0] = 0.001
    # Compare the histograms of the observed and synthetic data via a Poisson
    # likelihood ratio.
    lkl = -2. * np.sum(synHist_f - obsHist_f * np.log(synHist_f))
    return lkl
def obsservedData():
    """Some 'observed' data."""
    np.random.seed(12345)
    N = 1000
    obsData = np.random.uniform(0., 10., (N, 3))
    minD, stepD = 0.005, 0.005
    minE, stepE = 9., 0.05
    return obsData, minD, stepD, minE, stepE
def synthetic(A, B, C, D, E):
    """
    Dummy function to generate synthetic data. The actual function makes use
    of the A, B, C, D, E variables (obviously).
    """
    M = np.random.randint(100, 1000)
    synthData = np.random.uniform(0., 10., (M, 3))
    return synthData
if __name__ == "__main__":
    main()
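The histogram comparison inside logp can be exercised on its own with NumPy only, which may help when checking the numeric part separately from the sampler. This sketch says nothing about whether the PyMC3 wiring in the question is correct. The function name poisson_hist_stat and the random inputs are hypothetical; the formula is copied from the question's logp.

```python
import numpy as np

def poisson_hist_stat(obs, syn, bins=10):
    """Histogram-based statistic from the question's logp, without theano."""
    obs_hist, edges = np.histogramdd(obs, bins=bins)
    # Reuse the observed edges so both histograms share the same bins.
    syn_hist, _ = np.histogramdd(syn, bins=edges)
    obs_f, syn_f = obs_hist.ravel(), syn_hist.ravel()
    # Keep only the bins where the observed histogram is non-empty.
    keep = obs_f != 0
    obs_f, syn_f = obs_f[keep], syn_f[keep]
    # Replace empty synthetic bins with a small value to avoid log(0).
    syn_f = np.where(syn_f == 0, 0.001, syn_f)
    return -2.0 * np.sum(syn_f - obs_f * np.log(syn_f))

rng = np.random.default_rng(12345)
obs = rng.uniform(0.0, 10.0, (1000, 3))  # stand-in "observed" data
syn = rng.uniform(0.0, 10.0, (500, 3))   # stand-in "synthetic" data
value = poisson_hist_stat(obs, syn)
print(value)
```

With this sign convention the statistic is largest when the synthetic histogram equals the observed one, since obs * log(s) - s peaks at s = obs in every bin.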

Python merge datasets X1(t), X2(t) -> X1(X2)

I have some datasets (let's stick with two here) that depend on a common variable t, like X1(t) and X2(t). However, X1(t) and X2(t) don't have to share the same t values or even have the same number of data points.
For example they could look like:
t1 = [2,6,7,8,10,13,14,16,17]
X1 = [10,10,10,20,20,20,30,30,30]
t2 = [3,4,5,6,8,10,11,14,15,16]
X2 = [95,100,100,105,158,150,142,196,200,204]
I am trying to create a new dataset YNew(XNew) (=X2(X1)) such that both datasets are linked without the shared variable t.
In this case it should look like:
XNew = [10,20,30]
YNew = [100,150,200]
where each occurring X1 value is assigned a corresponding X2 value (a mean value).
Is there an easy, already known way to achieve this (maybe with pandas)?
My first guess would be to find all t-values for a certain X1-value (in the example case the X1-value 10 would lie in the range 2,...,7) and then look for all X2-values in that range and get their mean value. Then you should be able to assign YNew(XNew).
Thanks for every advice!
Update:
I added a graph, so maybe my intentions are a bit more clear. I want to assign the mean X2-value to the corresponding X1-value in the marked regions (where the same X1-values occur).
graph corresponding to example lists
Alright, I just tried to implement what I mentioned and it works the way I wanted.
Although I think that some things are still a little clumsy...
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# datasets to treat
t1 = [2,6,7,8,10,13,14,16,17]
X1 = [10,10,10,20,20,20,30,30,30]
t2 = [3,4,5,6,8,10,11,14,15,16]
X2 = [95,100,100,105,158,150,142,196,200,204]
X1Series = pd.Series(X1, index = t1)
X2Series = pd.Series(X2, index = t2)
X1Values = X1Series.drop_duplicates().values # returns all occurring values of X1 without duplicates as an array
# lists for results
XNew = []
YNew = []
# for every occurring value of X1, find the mean value of X2 in the t-range of that X1 value
for value in X1Values:
    indexpos = X1Series[X1Series == value].index.values
    max_t = indexpos[indexpos.argmax()] # get max and min index of the range of X1
    min_t = indexpos[indexpos.argmin()]
    print("X1 = "+str(value)+" occurs in range from "+str(min_t)+" to "+str(max_t))
    slicedX2 = X2Series[(X2Series.index >= min_t) & (X2Series.index <= max_t)] # select range of X2
    print("in this range there are following values of X2:")
    print(slicedX2)
    mean = slicedX2.mean() # calculate mean value of selection and append extracted values
    print("with the mean value of: " + str(mean))
    XNew.append(value)
    YNew.append(mean)
fig = plt.figure()
ax1 = fig.add_subplot(211)
ax2 = fig.add_subplot(212)
ax1.plot(t1, X1,'ro-',label='X1(t)')
ax1.plot(t2, X2,'bo',label='X2(t)')
ax1.legend(loc=2)
ax1.set_xlabel('t')
ax1.set_ylabel('X1/X2')
ax2.plot(XNew,YNew,'ro-',label='YNew(XNew)')
ax2.legend(loc=2)
ax2.set_xlabel('XNew')
ax2.set_ylabel('YNew')
plt.show()
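The same range-then-mean idea can be sketched more compactly with NumPy boolean masks. This is only an illustration on the example lists from the question, not a general merge routine; the names x_new and y_new are hypothetical, and it assumes each X1 value occupies one contiguous range of t.

```python
import numpy as np

t1 = np.array([2, 6, 7, 8, 10, 13, 14, 16, 17])
X1 = np.array([10, 10, 10, 20, 20, 20, 30, 30, 30])
t2 = np.array([3, 4, 5, 6, 8, 10, 11, 14, 15, 16])
X2 = np.array([95, 100, 100, 105, 158, 150, 142, 196, 200, 204])

x_new = np.unique(X1)  # every distinct X1 value, sorted
y_new = []
for value in x_new:
    ts = t1[X1 == value]                          # times at which X1 takes this value
    in_range = (t2 >= ts.min()) & (t2 <= ts.max())
    y_new.append(X2[in_range].mean())             # mean of X2 over that time range
y_new = np.array(y_new)

print(x_new)  # [10 20 30]
print(y_new)  # [100. 150. 200.]
```

If an X1 value has no X2 samples inside its time range, the mean of the empty selection is nan (with a NumPy warning), so such cases would need separate handling.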

python: plot unevenly distributed axis

I am using python and have a plot which looks like this:
Now the problem is that most bins are in the range 0-500 on the x-axis, so I want to make the x-axis like [0, 100, 200, 300, 400, 500, 1000, 1500, 2000, 2500], with each interval having the same length.
I don't know how to do this in python. Any idea?
Perhaps there's a simpler way to do this, but it's certainly possible to do so in pyplot using these two steps:
Plot a different function, namely one with the same y values but different x values
Manipulate the x-ticks so that it appears like you've plotted your original function (but with a different axis).
I'll start with 2. Note the existence of the xticks function, which allows you to do stuff like this:
ticks = [0, 100, 200, 300, 400, 500, 1000, 1500, 2000, 2500]
xticks(range(10), ticks)
This allows you to set both the locations of the xticks and their labels.
Now, for 1., you just need to translate your original x array to a new_x array, which is spread out over arange(10), but non-linearly, according to your labels. If your points are in the array x, then using scipy's interpolate.interp1d:
from scipy import interpolate
new_x = interpolate.interp1d(ticks, arange(10))(x)
In conclusion, use plot(new_x, y) with the xticks above.
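As a rough sketch of the mapping step, the same piecewise-linear translation can be done with np.interp, swapped in here for interp1d to avoid the scipy dependency. The sample x values are hypothetical.

```python
import numpy as np

ticks = [0, 100, 200, 300, 400, 500, 1000, 1500, 2000, 2500]
positions = np.arange(len(ticks))  # evenly spaced positions 0..9 on the plotted axis

# Hypothetical data points on the original (uneven) x scale.
x = np.array([0, 50, 250, 500, 750, 2500])

# Piecewise-linear map: each tick value lands on its evenly spaced position.
new_x = np.interp(x, ticks, positions)
print(new_x)  # [0.  0.5 2.5 5.  5.5 9. ]
```

Plotting then follows the two steps above: plot y against new_x and pass positions and ticks to xticks. Note that np.interp clamps values outside the tick range to the end positions, whereas interp1d raises an error by default.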
As already said, you have to map the original abscissae to a new range and then draw the xticks accordingly. The first part is the toughest, of course, and can be done in different ways; my take uses a vectorized approach with numpy and computes the function body at runtime using eval.
def make_xmap(l):
    from numpy import array
    ll = len(l)
    dy = 1.0 / (ll-1)
    def f(l, i):
        if i == 0 : return "0.0"
        y0 = i*dy-dy
        x0, x1 = l[i-1:i+1]
        return '%r+%r*(x-%r)/%r'%(y0,dy,x0,x1-x0)
    fmt = 'numpy.where(x<%f,%s%s'
    body = ' '.join(fmt%(j,f(l,i),"," if i<(ll-1) else ", 1.0") for i, j in enumerate(l))
    tail = ')'*ll
    def xm(x):
        x = array(x)
        return eval(body+tail)
    return xm
import numpy
xm = make_xmap([0.,200.,1000.])
x = (-10.,0.,100.,200.,600.,1000.,1010)
print(xm(x))
# [0.0, 0.0, 0.25, 0.5, 0.75, 1.0, 1.0]
Note that you have to import numpy in your code, because we have used numpy.where to construct the function body... If you prefer to import numpy as np modify the fmt string in the factory function...
The second part is easier, if you have an x and an y array to plot, with the subdivision from your example, you can do
import numpy # I touched this point before...
...
intervals = [0., 100., 200., 300., 400., 500., 1000., 1500., 2000., 2500.]
xm = make_xmap(intervals)
plt.plot(xm(x),y)
plt.xticks(xm(intervals), [str(xi) for xi in intervals])
plt.show()
A small optimization
You may want to change
...
    tail = ')'*ll
    def xm(x):
        x = array(x)
        return eval(body+tail)
...
to
...
    tail = ')'*ll
    code = compile(body+tail,'','eval')
    def xm(x):
        x = array(x)
        return eval(code)
...
This small optimization avoids the compilation of the code string every time you call the mapping function, and is of course more relevant if the mapping is used many times on short inputs.

Estimate formants using LPC in Python

I'm new to signal processing (and numpy, scipy, and matlab for that matter). I'm trying to estimate vowel formants with LPC in Python by adapting this matlab code:
http://www.mathworks.com/help/signal/ug/formant-estimation-with-lpc-coefficients.html
Here is my code so far:
#!/usr/bin/env python
import sys
import numpy
import wave
import math
from scipy.signal import lfilter, hamming
from scikits.talkbox import lpc
"""
Estimate formants using LPC.
"""
def get_formants(file_path):
    # Read from file.
    spf = wave.open(file_path, 'r') # http://www.linguistics.ucla.edu/people/hayes/103/Charts/VChart/ae.wav
    # Get file as numpy array.
    x = spf.readframes(-1)
    x = numpy.fromstring(x, 'Int16')
    # Get Hamming window.
    N = len(x)
    w = numpy.hamming(N)
    # Apply window and high pass filter.
    x1 = x * w
    x1 = lfilter([1., -0.63], 1, x1)
    # Get LPC.
    A, e, k = lpc(x1, 8)
    # Get roots.
    rts = numpy.roots(A)
    rts = [r for r in rts if numpy.imag(r) >= 0]
    # Get angles.
    angz = numpy.arctan2(numpy.imag(rts), numpy.real(rts))
    # Get frequencies.
    Fs = spf.getframerate()
    frqs = sorted(angz * (Fs / (2 * math.pi)))
    return frqs
print(get_formants(sys.argv[1]))
Using this file as input, my script returns this list:
[682.18960189917243, 1886.3054773107765, 3518.8326108511073, 6524.8112723782951]
I didn't even get to the last steps where they filter the frequencies by bandwidth because the frequencies in the list aren't right. According to Praat, I should get something like this (this is the formant listing for the middle of the vowel):
Time_s F1_Hz F2_Hz F3_Hz F4_Hz
0.164969 731.914588 1737.980346 2115.510104 3191.775838
What am I doing wrong?
Thanks very much
UPDATE:
I changed this
x1 = lfilter([1., -0.63], 1, x1)
to
x1 = lfilter([1], [1., 0.63], x1)
as per Warren Weckesser's suggestion and am now getting
[631.44354635609318, 1815.8629524985781, 3421.8288991389031, 6667.5030877036006]
I feel like I'm missing something since F3 is very off.
UPDATE 2:
I realized that the order being passed to scikits.talkbox.lpc was off due to a difference in sampling frequency. Changed it to:
Fs = spf.getframerate()
ncoeff = 2 + Fs // 1000 # integer division keeps the order an int on Python 3 as well
A, e, k = lpc(x1, ncoeff)
Now I'm getting:
[257.86573127888488, 774.59006835496086, 1769.4624576002402, 2386.7093679399809, 3282.387975973973, 4413.0428174593926, 6060.8150432549655, 6503.3090645887842, 7266.5069407315023]
Much closer to Praat's estimation!
The problem had to do with the order being passed to the lpc function. 2 + fs / 1000 where fs is the sampling frequency is the rule of thumb according to:
http://www.phon.ucl.ac.uk/courses/spsci/matlab/lect10.html
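The root-to-frequency conversion used in get_formants can be checked in isolation on a synthetic polynomial whose resonances are known in advance. This is only a sketch: the sampling rate, pole radius, and target frequencies are made-up values, and no real audio or LPC fit is involved.

```python
import numpy as np

fs = 8000.0                    # assumed sampling rate in Hz
true_freqs = [700.0, 1700.0]   # made-up resonance frequencies in Hz
radius = 0.97                  # pole radius just inside the unit circle

# Build a polynomial with one complex-conjugate pole pair per resonance.
poles = [radius * np.exp(1j * 2 * np.pi * f / fs) for f in true_freqs]
a = np.real(np.poly(poles + [np.conj(p) for p in poles]))

# Same steps as in get_formants: roots -> angles -> frequencies.
roots = np.roots(a)
roots = roots[np.imag(roots) > 0]  # keep one root of each conjugate pair
freqs = np.sort(np.angle(roots) * fs / (2 * np.pi))
print(freqs)
```

Each conjugate pair contributes one frequency, so the order passed to lpc limits how many candidate formants can come out.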
I have not been able to get the results you expect, but I do notice two things which might cause some differences:
Your code uses [1, -0.63] where the MATLAB code from the link you provided has [1 0.63].
Your processing is being applied to the entire x vector at once instead of smaller segments of it (see where the MATLAB code does this: x = mtlb(I0:Iend); ).
Hope that helps.
There are at least two problems:
According to the link, the "pre-emphasis filter is a highpass all-pole (AR(1)) filter". The signs of the coefficients given there are correct: [1, 0.63]. If you use [1, -0.63], you get a lowpass filter.
You have the first two arguments to scipy.signal.lfilter reversed.
So, try changing this:
x1 = lfilter([1., -0.63], 1, x1)
to this:
x1 = lfilter([1.], [1., 0.63], x1)
I haven't tried running your code yet, so I don't know if those are the only problems.
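The highpass/lowpass claim for the all-pole filter can be checked by evaluating the gain at DC and at the Nyquist frequency. This sketch uses only NumPy, and the helper name gain is hypothetical. It follows the same (b, a) coefficient convention as scipy.signal.lfilter, with numerator b and denominator a in powers of z^-1.

```python
import numpy as np

def gain(b, a, w):
    """|H(e^{jw})| for H(z) = B(z) / A(z), coefficients in powers of z^-1."""
    z_inv = np.exp(-1j * w)
    # np.polyval expects the highest power first, so reverse the coefficients.
    num = np.polyval(b[::-1], z_inv)
    den = np.polyval(a[::-1], z_inv)
    return abs(num / den)

for a in ([1.0, 0.63], [1.0, -0.63]):
    dc = gain([1.0], a, 0.0)        # gain at frequency 0
    nyq = gain([1.0], a, np.pi)     # gain at the Nyquist frequency
    kind = "highpass" if nyq > dc else "lowpass"
    print(a, round(dc, 3), round(nyq, 3), kind)
```

For the denominator [1, 0.63] the gain rises from about 0.61 at DC to about 2.70 at Nyquist, and for [1, -0.63] the two values swap, which matches the description above.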
