Python/Scipy Kolmogorov-Zurbenko filter?

I'm looking for a Python-based Kolmogorov-Zurbenko filter which receives a time-series input and filters it based on a window size and number of iterations and haven't found anything that seems to work. Has anyone had better luck than I?
Thanks!

I have just been looking into the same issue. The actual KZ filter is very easy in pandas:
import pandas as pd

def kz(series, window, iterations):
    """KZ filter implementation.

    series: pandas Series to be filtered
    window: filter window m in the units of the data (m = 2q+1)
    iterations: number of times the moving average is evaluated
    """
    z = series.copy()
    for i in range(iterations):
        # on older pandas this was pd.rolling_mean(z, window, min_periods=1, center=True)
        z = z.rolling(window=window, min_periods=1, center=True).mean()
    return z
What cannot be easily realized, to my knowledge, is the adaptive version of the Kolmogorov-Zurbenko filter (KZA). This would at least require a rolling-mean method which allows different window lengths to the left and right of the center. The C code at https://cran.r-project.org/web/packages/kza/index.html looks fairly simple and straightforward, but it relies on loops and would therefore be quite slow if translated to pure Python directly; a vectorized sketch of the unequal-window average it needs is below.
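For what it's worth, here is a minimal numpy sketch of such an asymmetric centred moving average (the function name and the cumulative-sum approach are my own, and this is only the unequal-window building block, not the KZA break-detection logic):

import numpy as np

def asym_rolling_mean(x, q_left, q_right):
    """Mean of x[i - q_left : i + q_right + 1] for every index i,
    with the window truncated at the edges of the array."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    csum = np.concatenate(([0.0], np.cumsum(x)))      # csum[k] = sum of x[:k]
    lo = np.clip(np.arange(n) - q_left, 0, n)         # left window edge per point
    hi = np.clip(np.arange(n) + q_right + 1, 0, n)    # right window edge per point
    return (csum[hi] - csum[lo]) / (hi - lo)

With q_left == q_right == q this reproduces the symmetric window of the kz() function above (apart from NaN handling), so a KZA port could vary q_left and q_right per point without a Python-level loop.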

Related

Does anyone know a more efficient way to run a pairwise comparison of hundreds of trajectories?

So I have two different files containing multiple trajectories in a square map (512x512 pixels). Each file contains information about the spatial position of each particle within a track/trajectory (X and Y coordinates) and about which track/trajectory each spot belongs to (TRACK_ID).
My goal was to find a way to cluster similar trajectories between both files. I found a nice way to do this (distance clustering comparison), but the code is too slow. I was just wondering if someone has some suggestions to make it faster.
(My files contain, per spot, columns such as TRACK_ID, POSITION_X, POSITION_Y and FRAME; the example table is omitted here.)
The approach that I implemented finds similar trajectories based on something called the Fréchet distance (maybe not too relevant here). Below you can find the function that I wrote, but briefly this is the rationale:
group all the spots by track using the pandas groupby function, for file1 (growth_xml) and file2 (shrinkage_xml)
for each trajectory in growth_xml (loop) I compare it with each trajectory in shrinkage_xml
if they pass the Fréchet distance criterion that I defined (an if statement) I save both tracks in a new table. You can see an additional filter condition that I called delay, but I guess that is not important to explain here.
so really simple:
import numpy as np
import pandas as pd
# frechetDist() is defined elsewhere (discrete Frechet distance between two point lists)

def distance_clustering(growth_xml, shrinkage_xml):
    coords_g = pd.DataFrame()  # empty dataframes to save filtered tracks
    coords_s = pd.DataFrame()
    counter = 0  # initialize counter to count number of filtered tracks
    for track_g, param_g in growth_xml.groupby('TRACK_ID'):
        # define growing track as multi-point line object
        traj1 = [(x, y) for x, y in zip(param_g.POSITION_X.values, param_g.POSITION_Y.values)]
        for track_s, param_s in shrinkage_xml.groupby('TRACK_ID'):
            # define shrinking track as a second multi-point line object
            traj2 = [(x, y) for x, y in zip(param_s.POSITION_X.values, param_s.POSITION_Y.values)]
            # compute delay between shrinkage and growing ends to use as an extra filter
            delay = param_s.FRAME.iloc[0] - param_g.FRAME.iloc[0]
            # keep track only if the Frechet distance is lower than 0.2 microns
            if frechetDist(traj1, traj2) < 0.2 and delay > 0:
                counter += 1
                param_g = param_g.assign(NEW_ID=np.ones(param_g.shape[0]) * counter)
                coords_g = pd.concat([coords_g, param_g])
                param_s = param_s.assign(NEW_ID=np.ones(param_s.shape[0]) * counter)
                coords_s = pd.concat([coords_s, param_s])
    coords_g.reset_index(drop=True, inplace=True)
    coords_s.reset_index(drop=True, inplace=True)
    return coords_g, coords_s
The main problem is that most of the time I have more than two thousand tracks (!!) and this pairwise combination takes forever. I'm wondering if there's a simple and more efficient way to do this. Perhaps by doing the pairwise comparison in multiple small areas instead of over the whole map? Not sure...
Have you tried building a (DeltaX, DeltaY) look-up table for the pairwise distances? It will take a long time to compute the LUT once, but you can write it to a file and load it when the algorithm starts.
Then you only have to look up the right entry to get the result instead of computing it each time.
You could also use a polynomial regression for the distance calculation; it will be less precise but definitely faster.
Maybe not an outright answer, but it's been a while. Could you not segment the lines and use a minimum bounding box around each segment to assess similarities? I might be thinking of your problem the wrong way around, I'm not sure. Right now I'm trying to work with polygons from two different data sets and want to optimize the processing by first identifying the polygons in both geometries that overlap.
In your case, I think segments would leave you with some edge artifacts. Maybe look at this paper: https://drops.dagstuhl.de/opus/volltexte/2021/14879/pdf/OASIcs-ATMOS-2021-10.pdf or this paper (with Python code): https://www.austriaca.at/0xc1aa5576_0x003aba2b.pdf
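Building on the bounding-box idea above: if matching tracks are assumed to be spatially close, a KD-tree on per-track centroids can pre-select candidate pairs so that frechetDist only runs on nearby tracks. A minimal sketch (the centroid criterion and the radius value are assumptions, not part of the original code):

import numpy as np
from scipy.spatial import cKDTree

def candidate_pairs(growth_xml, shrinkage_xml, radius=5.0):
    """Return (growth TRACK_ID, shrinkage TRACK_ID) pairs whose centroids lie
    within `radius` of each other; only these need the Frechet test."""
    g_ids, g_cent = [], []
    for tid, grp in growth_xml.groupby('TRACK_ID'):
        g_ids.append(tid)
        g_cent.append([grp.POSITION_X.mean(), grp.POSITION_Y.mean()])
    s_ids, s_cent = [], []
    for tid, grp in shrinkage_xml.groupby('TRACK_ID'):
        s_ids.append(tid)
        s_cent.append([grp.POSITION_X.mean(), grp.POSITION_Y.mean()])
    tree = cKDTree(np.asarray(s_cent))
    pairs = []
    for gi, c in enumerate(np.asarray(g_cent)):
        for si in tree.query_ball_point(c, r=radius):
            pairs.append((g_ids[gi], s_ids[si]))
    return pairs

The radius has to be generous (roughly the Fréchet cutoff plus a typical track extent, in the same units as POSITION_X/Y) so that no genuinely matching pair is discarded; the Fréchet test itself still decides which surviving pairs are kept.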

Avoiding hard-coding a lot of coupled ODEs in Python

First off, I'm sorry if the title is not entirely fitting; I had a hard time finding an appropriate one (which might also have affected my search efficiency for already-asked questions like this :/ ).
The problem is the following. While it is comparably easy to solve coupled ODEs in Python with SciPy, I still have to write the ODE down explicitly term by term. For example, for a coupled ODE of the form
dc_0/dt = a*c_0 + b*c_1 and dc_1/dt = c*c_0
I would set up something like:
import numpy as np
from scipy.integrate import ode

a = 1
b = 2
c = 3
val = []

def dC_dt(t, C):
    return [a*C[0] + b*C[1],
            c*C[0]]

c0, t0 = [1.0, 0.0], 0
r = ode(dC_dt).set_integrator('zvode', method='bdf', with_jacobian=False)
r.set_initial_value(c0, t0)
t1 = 0.001
dt = 0.000005
while r.successful() and r.t < t1:
    r.integrate(r.t + dt)
    val.append(r.y)
However, now I have coupled ODEs of the rough form
dc_{m,n}/dt = a*c_{m,n} + b*c_{m+1,n-1} + k*c_{m-1,n+1}
with c_{0,0} = 1, and I have to include all orders with m^2 + n^2 - m*n smaller than a max value.
For a small max value, what I did was use a dictionary to map the two-index notation onto a 1D list,
dict_in = {'0,0': 0, '-1,0': 2, ...}
and then I entered the ODE for each order by hand:
def dC_dt(t, C):
    return [a*C[dict_in['0,0']] + b*C[dict_in['1,-1']] ...
Now I basically have to do that for some 100 coupled equations, which I of course do not want to hard-code, so I was trying to figure out a way to build the ODEs with a loop or something similar. However, I couldn't yet find a way around having two indices in my coefficients together with the condition of only including orders with m^2 + n^2 - m*n smaller than a max value.
As I am running into some deadlines, I figured it is time to ask smarter people for help.
Thanks for reading my question!
I had a similar problem. If you fill your dictionary, you can just redeclare the function multiple times inside the loop. This is a silly example of how it works:
dict_in = {'0,0': 0, '-1,0': 2}

for elem in dict_in:
    def dC_dt(t, C):
        # return [a*C[dict_in['0,0']] + b*C[dict_in['1,-1']] ...
        return dict_in[elem]
    t, C = 0, 0
    print(dC_dt(t, C))
    # r = ode(dC_dt).set_integrator('zvode', method='bdf', with_jacobian=False)
If you need to use several functions together, you can use anonymous functions and store them in a list. Another example:
functions_list = list()
for i in range(4):
    f = lambda n=i: n          # bind the current value of i as a default argument
    functions_list.append(f)

for j in range(4):
    print(functions_list[j]())
You can use a list or a generator too. For example, you could write the values to a txt file and read them back with the readline function each time.
As pointed out in the comments below, if you use lambda functions you should pay attention to how they capture references. See also https://docs.python.org/3/faq/programming.html#why-do-lambdas-defined-in-a-loop-with-different-values-all-return-the-same-result
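For the two-index bookkeeping in the original question, a loop can also build the index map and the right-hand side directly, without redeclaring anything. A minimal sketch (the coefficients a, b, k and the cutoff max_order are placeholders; the coupling pattern is the one written in the question):

import numpy as np
from scipy.integrate import ode

a, b, k = 1.0, 2.0, 3.0         # placeholder coefficients
max_order = 10                  # placeholder cutoff for m**2 + n**2 - m*n

# enumerate every order (m, n) inside the cutoff and map it to a flat index
orders = [(m, n)
          for m in range(-max_order, max_order + 1)
          for n in range(-max_order, max_order + 1)
          if m**2 + n**2 - m*n < max_order]
index = {mn: i for i, mn in enumerate(orders)}

def dC_dt(t, C):
    dC = np.zeros(len(orders), dtype=complex)
    for (m, n), i in index.items():
        dC[i] = a * C[i]
        # couple to the neighbouring orders, but only if they are inside the cutoff
        if (m + 1, n - 1) in index:
            dC[i] += b * C[index[(m + 1, n - 1)]]
        if (m - 1, n + 1) in index:
            dC[i] += k * C[index[(m - 1, n + 1)]]
    return dC

C0 = np.zeros(len(orders), dtype=complex)
C0[index[(0, 0)]] = 1.0         # c_{0,0} = 1
r = ode(dC_dt).set_integrator('zvode', method='bdf')
r.set_initial_value(C0, 0.0)

Orders outside the cutoff are simply treated as absent, which is what the "in index" checks do.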

Python - Cosine gradually reveals small-amp oscillations ("wobbles")

I have a problem that is equal parts trig and Python. I am plotting a cosine over time interval [0,t] whose frequency changes (slightly) according to another cosine function. So what I'd expect to see is a repeating pattern of higher-to-lower frequency that repeats over the duration of the window [0,t].
Instead, what I'm seeing is that a low-freq motif emerges in the cosine plot and repeats over time, each time becoming lower and lower in frequency, until eventually the cosine doesn't even oscillate properly; it just "wobbles", for lack of a better term.
I don't understand how this is emerging over the course of the window [0,t] because cosine is (obviously) periodic and the function modulating it is as well. So how can "new" behavior emerge?? The behavior should be identical across all periods of the modulatory cosine that tunes the freq of the base cosine, right?
As a note, I'm technically using a modified cosine: instead of cos(wt) I'm using e^(cos(wt)) [called the von Mises equation or something similar].
Minimum needed Code:
import numpy as np
import matplotlib.pyplot as plt

cos_plot = []
for wind, pos_theta in zip(window, pos_theta_vec):  # window is a vec of time increments
    # for ref: DBFT(pos_theta) = (1/(2*np.pi))*np.cos(np.radians(pos_theta - base_pos))
    f = float(baserate + DBFT(pos_theta))  # DBFT() returns a value in [-0.15, 0.15] periodically, depending on pos_theta
    cos_plot.append(np.exp(np.cos(f*2*np.pi*wind)))

plt.plot(cos_plot)
plt.show()
What you are observing could be due to "aliasing", i.e. the emergence of spurious low-frequency patterns when a high-frequency function is sampled with a step that is too big.
If the issue is NOT aliasing, consider that any function shape between -1 and 1 can be obtained from cos(f(x)*x) simply by choosing f(x) appropriately:
for any function with -1 <= g(x) <= 1, set f(x) = arccos(g(x))/x (a tiny numeric check of this is below).
To look for the problem, try plotting your "frequency" and see if anything really strange is present in it. Maybe you have a bug in DBFT.
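For example, assuming numpy (the target shape g(x) here is arbitrary):

import numpy as np

x = np.linspace(0.1, 10.0, 500)         # avoid x = 0 because of the division by x
g = 0.5 * np.sin(3.0 * x)               # any target shape with -1 <= g(x) <= 1
f = np.arccos(g) / x
print(np.allclose(np.cos(f * x), g))    # True: cos(f(x)*x) reproduces g(x)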
In the interest of posterity, in case anyone ever needs an answer to this question:
I wanted a cosine whose frequency was a time-varying function freq(t). My mistake was simply evaluating this function at each time t, i.e. A*cos(2*pi*freq(t)*t). Instead, you need to integrate freq(t) from 0 to t at each time point: y = cos(2*pi*integral_0^t(f(tau) dtau) + 2*pi*f0*t + phase). The term for this procedure is a frequency sweep or chirp (not identical terms, but similar if you need to google/search SO for answers); a minimal numerical sketch is below.
Thanks to those who responded with help :)
SB
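A minimal numpy sketch of that fix (all the numbers here are placeholders, and the integral is approximated with a cumulative sum):

import numpy as np
import matplotlib.pyplot as plt

t = np.linspace(0.0, 10.0, 5000)            # time axis (placeholder values)
dt = t[1] - t[0]
freq = 1.0 + 0.15 * np.cos(0.5 * t)         # hypothetical time-varying frequency f(t)

# integrate f(t) up to each time point to build the phase,
# instead of evaluating cos(2*pi*f(t)*t) directly
phase = 2.0 * np.pi * np.cumsum(freq) * dt
y = np.exp(np.cos(phase))                   # the e^cos() waveform from the question

plt.plot(t, y)
plt.show()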

Trimmed Mean with Percentage Limit in Python?

I am trying to calculate the trimmed mean, which excludes the outliers, of an array.
I found there is a function called scipy.stats.tmean, but it requires the user to specify the range by absolute values instead of percentages.
In Matlab, we have m = trimmean(X,percent), that does exactly what I want.
Do we have the counterpart in Python?
At least for scipy v0.14.0, there is a dedicated function for this (scipy.stats.trim_mean):
from scipy import stats
m = stats.trim_mean(X, 0.1) # Trim 10% at both ends
which uses stats.trimboth internally.
From the source code it is possible to see that with proportiontocut=0.1 the mean will be calculated using 80% of the data. Note that scipy.stats.trim_mean cannot handle np.nan.
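A minimal workaround for NaNs, assuming X is a 1-D numpy array, is to drop them before trimming:

import numpy as np
from scipy import stats

X = np.array([1.0, 2.0, np.nan, 3.0, 100.0])   # toy data with a NaN
m = stats.trim_mean(X[~np.isnan(X)], 0.1)      # drop the NaNs first, then trim 10% at each end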
(Edit: the context for this answer was that scipy.stats.trim_mean wasn't documented yet. Now that it's publicly available, use that function instead of rolling your own. My answer below is kept for historical purpose.)
You can also implement the whole thing yourself, following the instruction in the MatLab documentation.
Here's the code in Python 2:
from numpy import mean, sort

def trimmean(arr, percent):
    n = len(arr)
    k = int(round(n * (float(percent) / 100) / 2))   # number of values to cut from each end
    return mean(sort(arr)[k:n-k])                    # sort first, then drop k values from both ends
Here's a manual implementation using floor from the math library...
import math
from functools import reduce   # reduce is a builtin on Python 2, moved to functools in Python 3

def trimMean(tlist, tperc):
    removeN = int(math.floor(len(tlist) * tperc / 2))   # values to cut from each end
    tlist.sort()
    if removeN > 0:
        tlist = tlist[removeN:-removeN]
    return reduce(lambda a, b: a + b, tlist) / float(len(tlist))

Performing a moving linear fit to 1D data in Python

I have a 1D array of data and wish to extract the spatial variation. The standard way to do this, which I wish to pythonize, is to perform a moving linear regression on the data and save the gradient...
import itertools
import numpy as np

def nssl_kdp(phidp, distance, fitlen):
    kdp = np.zeros(phidp.shape, dtype=float)
    myshape = kdp.shape
    for swn in range(myshape[0]):
        print("Sweep ", swn + 1)
        for rayn in range(myshape[1]):
            print("ray ", rayn + 1)
            # gradient of a linear fit over a window of 2*fitlen gates
            small = [np.polyfit(distance[a:a+2*fitlen],
                                phidp[swn, rayn, a:a+2*fitlen], 1)[0]
                     for a in range(myshape[2] - 2*fitlen)]
            # pad both ends with the first/last gradient so the output matches the input length
            kdp[swn, rayn, :] = np.array(list(itertools.chain(*[fitlen*[small[0]], small, fitlen*[small[-1]]])))
    return kdp
This works well but is SLOW... I need to do this 17*360 times...
I imagine the overhead is in the iterator in the [... for a in xrange(...)] line... Is there an implementation of a moving fit in numpy/scipy?
The calculation for linear regression is based on sums of various values, so you could write a more efficient routine that modifies the sums as the window moves (adding one point and subtracting an earlier one).
This will be much more efficient than repeating the process every time the window shifts, but it is open to rounding errors, so you would need to restart the running sums occasionally.
You can probably do better than this for equally spaced points by pre-calculating all the x dependencies, but I don't understand your example in detail so am unsure whether it's relevant.
So I guess I'll just assume that it is.
The slope is (N*ΣXY - (ΣX)(ΣY)) / (N*ΣX² - (ΣX)²) - http://easycalculation.com/statistics/learn-regression.php
For evenly spaced data the denominator is fixed (since you can shift the x axis to the start of the window without changing the gradient). The (ΣX) in the numerator is also fixed (for the same reason). So you only need to be concerned with ΣXY and ΣY. The latter is trivial - just add and subtract a value. The former decreases by ΣY (each X weighting decreases by 1) and increases by (N-1)Y (assuming x_0 is 0 and x_N is N-1) at each step.
I suspect that's not clear. What I am saying is that the formula for the slope does not need to be completely recalculated at each step. In particular, at each step you can rename the X values as 0, 1, ..., N-1 without changing the slope, so almost everything in the formula stays the same. All that changes are two terms, which depend on Y as Y_0 "drops out" of the window and Y_N "moves in". A vectorized sketch of the same formula is below.
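For evenly spaced samples, the same slope formula can also be evaluated for every window at once with numpy convolutions instead of an incremental update (which sidesteps the rounding-error restarts). A rough sketch, assuming a plain 1-D signal y with spacing dx:

import numpy as np

def rolling_slope(y, N, dx=1.0):
    """Least-squares slope over every length-N window of equally spaced
    samples y (spacing dx). Returns len(y) - N + 1 slopes."""
    y = np.asarray(y, dtype=float)
    idx = np.arange(len(y), dtype=float)
    ones = np.ones(N)
    Sy = np.convolve(y, ones, mode='valid')          # sum of y over each window
    Sty = np.convolve(idx * y, ones, mode='valid')   # sum of (absolute index)*y over each window
    start = np.arange(len(Sy), dtype=float)          # index where each window starts
    Sxy = Sty - start * Sy                           # shift x so every window runs 0..N-1
    Sx = N * (N - 1) / 2.0                           # sum of 0..N-1, fixed for all windows
    Sxx = (N - 1) * N * (2 * N - 1) / 6.0            # sum of squares of 0..N-1, also fixed
    return (N * Sxy - Sx * Sy) / (N * Sxx - Sx**2) / dx

If distance is evenly spaced, this would do the same job as the inner polyfit list comprehension (with N = 2*fitlen), apart from the end padding, which would still be needed.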
I've used these moving window functions from the somewhat old scikits.timeseries module with some success. They are implemented in C, but I haven't managed to use them in a situation where the moving window varies in size (not sure if you need that functionality).
http://pytseries.sourceforge.net/lib.moving_funcs.html
Head here for downloads (if using Python 2.7+, you'll probably need to compile the extension itself -- I did this for 2.7 and it works fine):
http://sourceforge.net/projects/pytseries/files/scikits.timeseries/0.91.3/
I/we might be able to help you more if you clean up your example code a bit. I'd consider defining some of the arguments/objects in lines 7 and 8 (where you're defining 'small') as variables, so that you don't end row 8 with so many hard-to-follow parentheses.
Ok.. I have what seems to be a solution - not an answer per se, but a way of doing a moving, multi-point differential... I have tested this and the result looks very, very similar to a moving regression... I used a 1D Sobel-style filter (a ramp from -1 to 1 convolved with the data):
import numpy as np

def KDP(phidp, dx, fitlen):
    kdp = np.zeros(phidp.shape, dtype=float)
    myshape = kdp.shape
    for swn in range(myshape[0]):
        # print("Sweep ", swn + 1)
        for rayn in range(myshape[1]):
            # print("ray ", rayn + 1)
            kdp[swn, rayn, :] = sobel(phidp[swn, rayn, :], window_len=fitlen) / dx
    return kdp

def sobel(x, window_len=11):
    """Sobel-style differential filter for calculating KDP.

    Output: the differential signal (unscaled for gate spacing).
    """
    s = np.r_[x[window_len-1:0:-1], x, x[-1:-window_len:-1]]      # pad both ends by reflection
    w = 2.0 * np.arange(window_len) / (window_len - 1.0) - 1.0    # ramp from -1 to 1
    w = w / (abs(w).sum())
    y = np.convolve(w, s, mode='valid')
    return -1.0 * y[window_len // 2 : len(x) + window_len // 2] / (window_len / 3.0)
this runs QUICK!
