Multithreading 1D Median Filtering in Python

I have tried the following Python median filtering functions on time-series signals to find the fastest and most efficient one.
sig is a numpy array of size 80×188 which contains 188 samples measured by 80 sensors.
import numpy as np
from scipy.ndimage import median_filter
from scipy.signal import medfilt
from scipy.signal import medfilt2d
import time
sig = np.random.rand(80,188).astype('f')
print(type(sig))
print(type(sig[0][0]))
window_length = 181
t = time.time()
sigFiltered = medfilt2d(sig, (1,window_length))
elapsed = time.time() - t
print('scipy.signal.medfilt2d: %g seconds' % elapsed)
t = time.time()
sigFiltered = median_filter(sig, (1,window_length))
elapsed = time.time() - t
print('scipy.ndimage.median_filter: %g seconds' % elapsed)
t = time.time()
sigFiltered = medfilt(sig, (1,window_length))
elapsed = time.time() - t
print('scipy.signal.medfilt: %g seconds' % elapsed)
The result of the filter is another time-series array of size 80×188 with smoothed time-points for each sensor.
MATLAB's medfilt1(sig, 181, [], 2) performs the same filtering about 10 times faster than scipy.signal.medfilt2d, which was the fastest of the functions above: on my machine, MATLAB takes 2 ms vs. Python's 20 ms. I suspect MATLAB is multithreading the computation and Python is not.
Is there any way to perform multithreaded median filtering to speed up the process, e.g. by assigning sensors to different threads? Is a more efficient median filter available in Python? Can I match MATLAB's performance in Python, or at least get closer to it?

With such a long filter relative to the input, most outputs of a standard medfilt are going to be identical. Were this a convolution, it would be a "full" convolution; if you instead compute only the "valid" outputs, it will be much faster in this case:
t = time.time()
medians = []
# only the "valid" positions: 188 - 181 + 1 = 8 windows
for i in range(188 - window_length + 1):
    sig2 = sig[:, i:i + window_length]
    f = np.median(sig2, axis=1)
    medians.append(f)
sigFiltered = np.stack(medians).T
elapsed = time.time() - t
print('numpy.median: %g seconds' % elapsed)
numpy.median: 0.0015518 seconds
This is in the ballpark of the requested millisecond-scale runtime for the 188-sample input.
Moreover, each output value here changes only slowly and rarely as new input samples arrive. You could therefore speed this up considerably by using a hop larger than 1 (a sketch follows below).
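A minimal sketch of the hop idea, reusing sig and window_length from the question (the hop of 4 and the fill-by-repetition are my own choices, not from the answer):
hop = 4
t = time.time()
positions = range(0, 188 - window_length + 1, hop)
coarse = np.stack([np.median(sig[:, i:i + window_length], axis=1)
                   for i in positions]).T
# fill the skipped positions by repeating the last computed median
sigFiltered = np.repeat(coarse, hop, axis=1)[:, :188 - window_length + 1]
elapsed = time.time() - t
print('hopped numpy.median: %g seconds' % elapsed)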

I'm wondering why you're using a median filter of 181 points on a data length of 188. The filter is so long that you're essentially throwing away all the data and replacing it with each sensor's global median. Typical median filter lengths are a few samples, depending on the kind of transients you want to filter out.
The filter length also explains why it's so slow. On my machine, your median_filter example takes 46 ms; running with a more typical filter size of 3 samples takes 0.7 ms (see the sketch below).
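For reference, the small-kernel call looks like this (a sketch reusing the arrays from the question):
t = time.time()
sigFiltered = median_filter(sig, size=(1, 3))
print('median_filter, kernel 3: %g seconds' % (time.time() - t))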

Related

Most computational-time efficient/fastest way to compute rolling (linear) regression in Python (Numpy or Pandas)

I need a very fast and efficient way to do rolling linear regression.
I looked through these two threads :
Efficient way to do a rolling linear regression
Rolling linear regression
From them, I had inferred that numpy was (computationally) the fastest. However, using my (limited) Python skills, I found the time to compute the same set of rolling data was essentially the same.
Is there a faster way to compute than the 3 methods I post below? I would have thought the numpy way would be much faster, but unfortunately it wasn't.
########## testing time for pd rolling vs numpy rolling
import time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

def fitcurve(x_pts):
    poly = np.polyfit(np.arange(len(x_pts)), x_pts, 1)
    return np.poly1d(poly)[1]

win_ = 30
# tmp_ = data_.Close
tmp_ = pd.Series(np.random.rand(10000))
s_time = time.time()
roll_pd = tmp_.rolling(win_).apply(lambda x: fitcurve(x)).to_numpy()
print('pandas rolling time is', time.time() - s_time)
plt.show()
pd.Series(roll_pd).plot()
########
s_time = time.time()
roll_np = np.empty(0)
for cnt_ in range(len(tmp_) - win_):
    tmp1_ = tmp_[cnt_:cnt_ + win_]
    grad_ = np.linalg.lstsq(np.vstack([np.arange(win_), np.ones(win_)]).T, tmp1_, rcond=None)[0][0]
    roll_np = np.append(roll_np, grad_)
print('numpy rolling time is', time.time() - s_time)
plt.show()
pd.Series(roll_np).plot()
#################
s_time = time.time()
roll_st = np.empty(0)
for cnt_ in range(len(tmp_) - win_):
    slope, intercept, r_value, p_value, std_err = stats.linregress(np.arange(win_), tmp_[cnt_:cnt_ + win_])
    roll_st = np.append(roll_st, slope)
print('stats rolling time is', time.time() - s_time)
plt.show()
pd.Series(roll_st).plot()
tl;dr
My answer is
view = np.lib.stride_tricks.sliding_window_view(tmp_, (win_,))
xxx = np.vstack([np.arange(win_), np.ones(win_)]).T
roll_mat = (np.linalg.inv(xxx.T @ xxx) @ xxx.T @ view.T)[0]
And it takes 1.2 ms to compute, compared to 2 seconds for your pandas and numpy versions, and 3.5 seconds for your stats version.
Long version
One method is to use sliding_window_view to transform your tmp_ array into an array of windows (a fake one: it is just a view, not really a 10000x30 array of data; it is just tmp_ viewed differently, hence the _view in the function name).
There is no direct advantage on its own, but from there you can try to take advantage of vectorization.
I do that in two different ways: an easy one, and one that takes a minute of thinking. Since I put the best answer first, the rest of this message can appear chronologically inconsistent (I say things like "in my previous answer" when the previous answer comes later), but I have tried to write both answers consistently.
New answer: matrix operations
One way to do that (since lstsq is one of the rare numpy methods that won't just do it naturally) is to go back to what lstsq(X, Y) computes in reality: (XᵀX)⁻¹XᵀY.
So let's just do that. In Python, with xxx being the X array (of arange and ones in your example) and view being the array of windows into your data (that is, view[i] is tmp_[i:i+win_]), that would be np.linalg.inv(xxx.T @ xxx) @ xxx.T @ view[i] for each row i. We could vectorize that operation with np.vectorize to avoid iterating over i, as I did for my first solution (see below). But we don't need to: this is just a matrix times a vector, and performing a matrix-times-vector product for each vector in an array of vectors is just matrix multiplication!
Hence my second (and probably final) answer:
view = np.lib.stride_tricks.sliding_window_view(tmp_, (win_,))
xxx = np.vstack([np.arange(win_), np.ones(win_)]).T
roll_mat = (np.linalg.inv(xxx.T @ xxx) @ xxx.T @ view.T)[0]
roll_mat is still identical to roll_np (with one extra entry, because your roll_np stopped one row short of the last possible one); see below for graphical proof with my first answer. I could provide a new image for this one, but it would be indistinguishable from the one I already used. So, the same result (unsurprisingly, I should say... but sometimes it is still a surprise when things work exactly as theory says they should).
But the timing is something else. As promised, my previous factor of 4 is nothing compared to what real vectorization can do. See the updated timing table:
Method                              Time
pandas                              2.10 s
numpy roll                          2.03 s
stat                                3.58 s
numpy view/vectorize (see below)    0.46 s
numpy view/matmult                  1.2 ms
The important part is the 'ms' there, compared to the 's' everywhere else.
So this time the speedup factor is about 1700!
Old answer: vectorize
A lame method, once we have this view, is to use np.vectorize from there. I call it lame because vectorize is not supposed to be efficient: it is just a for loop by another name, and the official documentation clearly says it is "not to be used for performance". And yet, it is an improvement over your code:
view = np.lib.stride_tricks.sliding_window_view(tmp_, (win_,))
xxx = np.vstack([np.arange(win_), np.ones(win_)]).T
f = np.vectorize(lambda y: np.linalg.lstsq(xxx, y, rcond=None)[0][0], signature='(n)->()')
roll_vectorize = f(view)
First, let's verify the result:
plt.scatter(f(view)[:-1], roll_np)
So, obviously, the same results as roll_np (which, checked the same way, also matches the two other methods, with the same variation in indexing, since the three methods do not use the same strategy at the borders).
And the interesting part, the timings:
Method                  Time
pandas                  2.10 s
numpy roll              2.03 s
stat                    3.58 s
numpy view/vectorize    0.46 s
So, you see, it is not supposed to be used for performance, and yet I gain more than a 4x speedup with it.
I am pretty sure that a more fully vectorized method (alas, lstsq doesn't directly allow it, unlike most numpy functions) would be even faster; one possibility is sketched below.
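For instance, a minimal sketch of a fully closed-form variant (my own addition, not from the answer above): since x = arange(win_) is the same for every window, the OLS slope reduces to cov(x, y) / var(x), and the per-window means can all be computed in one shot:
import numpy as np

win_ = 30
tmp_ = np.random.rand(10000)

x = np.arange(win_)
view = np.lib.stride_tricks.sliding_window_view(tmp_, win_)  # shape (N - win_ + 1, win_)
my = view.mean(axis=1)                     # mean of y in each window
mxy = view @ x / win_                      # mean of x*y in each window
slope = (mxy - x.mean() * my) / x.var()    # cov(x, y) / var(x)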
First, if you need some tips for optimizing your Python code, I believe this playlist might help you.
To make it faster: np.append is never a good way to build an array. Think of it in terms of memory: every time you append, a completely new array one element larger is allocated, all n existing items are copied across, and the new value is written in the last place, so building the result this way costs quadratic copying.
So when I changed it to the following:
########## testing time for pd rolling vs numpy rolling
def fitcurve(x_pts):
    poly = np.polyfit(np.arange(len(x_pts)), x_pts, 1)
    return np.poly1d(poly)[1]

win_ = 30
# tmp_ = data_.Close
tmp_ = pd.Series(np.random.rand(10000))
s_time = time.time()
roll_pd = tmp_.rolling(win_).apply(lambda x: fitcurve(x)).to_numpy()
print('pandas rolling time is', time.time() - s_time)
plt.show()
pd.Series(roll_pd).plot()
########
s_time = time.time()
roll_np = np.zeros(len(tmp_) - win_)  ### Change: preallocate the full result
for cnt_ in range(len(tmp_) - win_):
    tmp1_ = tmp_[cnt_:cnt_ + win_]
    grad_ = np.linalg.lstsq(np.vstack([np.arange(win_), np.ones(win_)]).T, tmp1_, rcond=None)[0][0]
    roll_np[cnt_] = grad_  ### Change: assign in place
    # roll_np = np.append(roll_np, grad_)  ### Change: no more append
print('numpy rolling time is', time.time() - s_time)
plt.show()
pd.Series(roll_np).plot()
#################
s_time = time.time()
roll_st = np.empty(0)
from scipy import stats
for cnt_ in range(len(tmp_) - win_):
    slope, intercept, r_value, p_value, std_err = stats.linregress(np.arange(win_), tmp_[cnt_:cnt_ + win_])
    roll_st = np.append(roll_st, slope)
print('stats rolling time is', time.time() - s_time)
plt.show()
pd.Series(roll_st).plot()
I initialized the array up front with the size it is expected to end up with (len(tmp_) - win_, matching the range), and just assigned values into it later, and it was much faster.
There are also some other tips. Python is an interpreted language, so every statement in a loop pays interpreter overhead on each iteration; if you can express the work in a single construct whose loop runs inside the interpreter's C machinery, it is usually faster. Think of list comprehensions, for example (a tiny illustration follows).
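A hypothetical micro-example of my own to illustrate that last point:
squares = []
for i in range(10000):
    squares.append(i * i)  # interpreter overhead on every iteration

squares = [i * i for i in range(10000)]  # the loop runs in C and is typically faster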

Optimizing Kernel Density Bandwidth using Python

I am attempting to build a class that automatically determines the optimal bandwidth for a kernel density estimate. I am using the FFTKDE method of KDEpy, since I am required to calculate this quantity very quickly. I am aware that KDEpy offers Scott's rule, Silverman's rule and Improved Sheather-Jones, but I am keen to optimise the bandwidth directly myself.
I would like to calculate the maximum likelihood cross-validation (MLCV) score so that I can optimise it. However, my code is terribly slow, since on each iteration of the minimisation I need to estimate the KDE once for every one of the 20,000 data points.
The quantity I am attempting to optimise is the following (see page 8 of the linked reference; reconstructed here to match the code below):
MLCV(h) = (1/N) Σᵢ log[ Σ_{j≠i} K((Xᵢ − Xⱼ)/h) ] − log[(N−1)h]
My code for calculating this loss is as follows:
import numpy as np
from KDEpy import FFTKDE
from tqdm import tqdm
import time

def MLCV(data, bw):
    N = len(data)
    idx = np.ones(N, bool)
    logs = np.empty(N)
    for i in tqdm(range(N)):
        idx[i] = False  # leave point i out
        x_kde, y_kde = FFTKDE(bw=bw).fit(data[idx]).evaluate(2**13)
        idx[i] = True
        logs[i] = np.sum(np.log(y_kde))
    MLCV = np.sum(logs) / N - np.log((N - 1) * bw)
    return MLCV
data = np.random.normal(size=20000)
bw = 0.01
t0 = time.process_time()
mlcv = MLCV(data, bw)
t1 = time.process_time()
print("MLCV = {:3.3E}, Elapsed Time = {:3.3}s".format(mlcv, t1-t0))
Output:
MLCV = -1.077E+05, Elapsed Time = 46.3s
Can anyone suggest a means of making this faster, or an alternative, quicker algorithm?
I have also considered simply minimising the negative log of the output, which I have seen elsewhere:
def L(data, bw):
    x_kde, y_kde = FFTKDE(bw=bw).fit(data).evaluate(2**13)
    return -np.sum(np.log(y_kde))
However, my intuition tells me neither method is the correct solution, since I cannot evaluate the KDE at the actual data points, only interpolate onto them, because the FFT method requires the evaluation points to lie on a grid.
Is there a loss that suits my needs? Can anyone suggest a better solution than I have come up with?
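For what it's worth, here is a minimal sketch of one possible shortcut (my own, not from the thread), assuming the default Gaussian kernel and that grid interpolation is acceptable: the leave-one-out density at each data point can be recovered from a single full fit, since removing point i just subtracts its own kernel contribution K(0)/(Nh) and rescales:
import numpy as np
from KDEpy import FFTKDE

def MLCV_fast(data, bw):
    N = len(data)
    # fit once on all the data and evaluate on a grid
    x_grid, y_grid = FFTKDE(bw=bw).fit(data).evaluate(2**13)
    # interpolate the density back onto the data points themselves
    f_hat = np.interp(data, x_grid, y_grid)
    # leave-one-out identity: f_loo(X_i) = (N * f_hat(X_i) - K(0)/h) / (N - 1)
    K0 = 1.0 / np.sqrt(2.0 * np.pi)        # Gaussian kernel at zero
    f_loo = (N * f_hat - K0 / bw) / (N - 1)
    f_loo = np.clip(f_loo, 1e-300, None)   # guard against log(0)
    return np.mean(np.log(f_loo))          # equals the MLCV score above
This replaces the 20,000 per-point FFT fits in each evaluation with one fit plus an interpolation, at the cost of exactly the grid-interpolation error the question already worries about.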

Simulating a time-inhomogeneous Poisson process using the thinning method and the NeuroTools python module

There are several threads asking for a way to simulate time-inhomogeneous Poisson processes in Python. The NeuroTools module offers a simple way to do so via the inh_poisson_generator() function, whose help text is reproduced at the bottom of this question. The function was originally designed to simulate spike trains, and uses the thinning method.
I would like to simulate a spike train over 2000 ms. The spike rate (in Hertz) changes every millisecond and lies between 20 and 160 spikes/second. I've tried to simulate this with the following code:
import numpy as np
import random
from NeuroTools import stgen
import matplotlib.pyplot as plt

st_gen = stgen.StGen()
time = np.arange(0, 2000)
t_rate = []
for i in range(2000):
    t_rate.append(random.randrange(20, 161, 1))
t_rate = np.array(t_rate)
Psim = st_gen.inh_poisson_generator(rate=t_rate, t=time, t_stop=2000, array=True)
However, the code returns very few timestamps (e.g., array([ 397.55345905, 1208.79804513, 1478.03525045, 1982.63643262])), which doesn't make sense to me. I would appreciate any help on this.
inh_poisson_generator(self, rate, t, t_stop, array=False) method of NeuroTools.stgen.StGen instance
Returns a SpikeTrain whose spikes are a realization of an inhomogeneous
poisson process (dynamic rate). The implementation uses the thinning
method, as presented in the references.
Inputs:
rate - an array of the rates (Hz) where rate[i] is active on interval
[t[i],t[i+1]]
t - an array specifying the time bins (in milliseconds) at which to
specify the rate
t_stop - length of time to simulate process (in ms)
array - if True, a numpy array of sorted spikes is returned,
rather than a SpikeList object.
Note:
t_start=t[0]
References:
Eilif Muller, Lars Buesing, Johannes Schemmel, and Karlheinz Meier
Spike-Frequency Adapting Neural Ensembles: Beyond Mean Adaptation and Renewal Theories
Neural Comput. 2007 19: 2958-3010.
Devroye, L. (1986). Non-uniform random variate generation. New York: Springer-Verlag.
Examples:
>> time = arange(0,1000)
>> stgen.inh_poisson_generator(time, sin(time), 1000)
I don't really have an answer for you, but since this post helped me to get started with NeuroTools, I thought I'd share my small example, which is working fine.
For inh_poisson_generator() the rate input is in Hz and all times are in ms. I use an average rate of 1.6 spikes/ms, so I expect to receive ~4000 events, and the results confirm that just fine.
I guess it might be an issue that you are using a non-continuous rate. However, I barely know anything about the algorithm implemented for this function.
I hope my example can help you somehow! (A plain-NumPy sketch of the thinning method itself follows after the code.)
import math
import numpy as np
from NeuroTools import stgen

v0 = 1.6        # baseline rate in spikes/ms
Amp = 1         # amplitude in spikes/ms
w = 4 / 1000.0  # periodic frequency in 1/ms
st_gen = stgen.StGen()
tstop = 2500.0
intervals = np.arange(0, tstop, 0.05)
rate = np.array([])
for tt in intervals:
    v_next = v0 + Amp * math.sin(2 * math.pi * w * tt)
    if v_next > 0.0:
        rate = np.append(rate, v_next * 1000)  # convert to Hz
    else:
        rate = np.append(rate, 0.0)
# important to have rate in Hz and all other times in ms
PSim = st_gen.inh_poisson_generator(rate=rate, t=intervals, t_stop=2500.0, array=True)
print(len(PSim))
print(np.mean(rate) / 1000 * tstop)  # expected number of spikes
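For completeness, here is a minimal plain-NumPy sketch of the thinning method itself (my own addition, not part of NeuroTools; the helper name thinning is hypothetical), using the same conventions of rates in Hz and times in ms:
import numpy as np

rng = np.random.default_rng(0)

def thinning(rate_hz, t_ms, t_stop_ms):
    # simulate an inhomogeneous Poisson process by thinning:
    # rate_hz[i] applies on the interval [t_ms[i], t_ms[i+1])
    lam_max = rate_hz.max() / 1000.0  # peak rate in spikes/ms
    # homogeneous candidate spikes at lam_max (oversample to be safe)
    n_guess = int(3 * lam_max * t_stop_ms) + 10
    candidates = np.cumsum(rng.exponential(1.0 / lam_max, size=n_guess))
    candidates = candidates[candidates < t_stop_ms]
    # accept each candidate with probability rate(t) / lam_max
    idx = np.searchsorted(t_ms, candidates, side='right') - 1
    accept = rng.random(len(candidates)) < (rate_hz[idx] / 1000.0) / lam_max
    return candidates[accept]

rate = rng.integers(20, 161, size=2000).astype(float)  # Hz, one value per ms
spikes = thinning(rate, np.arange(2000.0), 2000.0)
print(len(spikes))
With the question's 20-160 Hz rate over 2000 ms this yields roughly mean(rate)/1000 * 2000 ≈ 180 spikes, which is the order of magnitude the original code should have produced.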

Generating low discrepancy quasi-random sequences in python/numpy/scipy?

There is already a question on this, but the answer contains a broken link, and since it is over two years old I'm hoping there's a better solution now :)
Low discrepancy quasi-random sequences, e.g. Sobol sequences, fill a space more uniformly than uniformly random sequences. Is there a good/easy way to generate them in python?
I think the best alternative for Low Discrepancy sequences in Python is Sensitivity Analysis Library (SALib):
https://github.com/SALib/SALib
I think this is an active project, and you can contact the author to check whether the functionality you need is already implemented. If that doesn't solve your problem, Corrado Chisari ported John Burkardt's MATLAB SOBOL implementation to Python; you can access it here:
http://people.sc.fsu.edu/~jburkardt/py_src/sobol/sobol.html
Someone cleaned up the comments in these sources and put them into docstring format. It is much more readable, and you can access it here:
https://github.com/naught101/sobol_seq
SciPy has this option now: http://scipy.github.io/devdocs/generated/scipy.stats.qmc.Sobol.html A minimal usage sketch follows.
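A minimal usage sketch (assuming SciPy >= 1.7, where scipy.stats.qmc landed):
from scipy.stats import qmc

sampler = qmc.Sobol(d=2, scramble=True, seed=42)
points = sampler.random_base2(m=8)  # 2**8 = 256 points in [0, 1)^2
print(points.shape)                 # (256, 2)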
PyTorch also provides an option for generating Sobol quasi-random numbers. It allows up to ~1k dimensions and has an option to switch on scrambling.
https://pytorch.org/docs/stable/generated/torch.quasirandom.SobolEngine.html
Chaospy is also a valid option. One can select several approaches for low-discrepancy sampling (including Sobol, Latin hypercube, etc.); for more details see the documentation.
I think the easiest way to do it now (as of SciPy >= 1.7.1) is how I'm doing it here. It's good for up to 21,201 dimensions, since SciPy implemented the Joe and Kuo direction numbers, the highest dimension count available open source: https://web.maths.unsw.edu.au/~fkuo/sobol/
Below I show how to use the base2 method (with Owen scrambling) and the random method (which generates an arbitrary number of points from the sequence), and how to skip the first point.
Note that this routine can be quite slow (due to ndtri, the inverse normal CDF used to convert the points to shocks), especially at high dimensions and high simulation counts. Point generation from the Sobol sequence itself is quite fast, but for most Monte Carlo simulations you then convert the points into shocks (possibly using a distribution other than the standard normal).
This at least lets you generate the points in Python code directly.
Also, in the QMCgenerate routine, I'm skipping the first point (which is all zeros); while this is commonly done, some papers suggest not doing so (but I haven't seen a good alternative; if you have one, feel free to comment). I transpose the points just so I can paste them into Excel later and examine the generated shocks. Anyway, I hope those of you who need this algorithm find it useful.
from scipy.stats import qmc  # needs SciPy >= 1.7.1
from scipy.special import ndtri
import numpy as np
import timeit

time_periods = 252
factors = 12
# IF using base2 generation, need a pow(2, m)
sims = 8192
dimensions = factors * time_periods

def RQMCgenerate(dimensions, sims, seed):
    start_time = timeit.default_timer()
    m = 10  # start at 1024 sims
    while pow(2, m) < sims:  # m = 17 -> 131,072 sims; m = 16 -> 65,536 sims
        m = m + 1
    RQMCgenerator = qmc.Sobol(dimensions, scramble=True, seed=seed)
    RQMCsamples = RQMCgenerator.random_base2(m)
    print('\n' + 'Time after sample generation RQMC:', (timeit.default_timer() - start_time), 'seconds')
    sobol = ndtri(RQMCsamples).T  # get normsinv(points) and transpose to dimensions x sims
    del RQMCsamples
    print('\n' + 'Time after ndtri (normsinv) of', sims, 'sims x dimensions', dimensions, 'Randomized Sobol points:', (timeit.default_timer() - start_time), 'seconds')
    return sobol

def QMCgenerate(dimensions, sims):
    start_time = timeit.default_timer()
    QMCgenerator = qmc.Sobol(dimensions, scramble=False)
    QMCgenerator.fast_forward(1)  # skip first point, where normsinv(0) = -Inf
    QMCsamples = QMCgenerator.random(sims)  # generates point counts that need not be powers of 2
    print('\n' + 'Time after sample generation QMC:', (timeit.default_timer() - start_time), 'seconds')
    sobol = ndtri(QMCsamples).T  # get normsinv(points) and transpose to dimensions x sims
    del QMCsamples
    print('\n' + 'Time after ndtri (normsinv) of', sims, 'sims x dimensions', dimensions, 'Sobol points:', (timeit.default_timer() - start_time), 'seconds')
    return sobol

RQMCsobol = RQMCgenerate(dimensions, sims, seed=0)  # note: sims rounded up to pow(2, m) if not a power of 2
sobol = QMCgenerate(dimensions, sims)
Time after sample generation RQMC: 0.4269224999952712 seconds
Time after ndtri (normsinv) of 8192 sims x dimensions 3024 Randomized Sobol points: 1.0048970999996527 seconds
Time after sample generation QMC: 0.0630135999963386 seconds
Time after ndtri (normsinv) of 8192 sims x dimensions 3024 Sobol points: 0.5444753999981913 seconds
This gets much slower at higher sims x dimensions, although I haven't found a faster Python conversion of points to normally distributed shocks than ndtri:
Time after sample generation RQMC: 2.1779929000040283 seconds
Time after ndtri (normsinv) of 131072 sims x dimensions 3024 Randomized Sobol points: 10.617904700004146 seconds
Time after sample generation QMC: 1.079756200000702 seconds
Time after ndtri (normsinv) of 131072 sims x dimensions 3024 Sobol points: 9.545934699999634 seconds

Any way to optimize numpy stats functions (e.g., via numexpr)?

I need to calculate the standard deviation and other statistics on a large multidimensional ndarray of gridded point data. Example:
import numpy as np
# ... gridded data are read into g1, g2, g3 arrays ...
allg = np.array([g1, g2, g3])
allmg = np.ma.masked_values(allg, -99.)
sd = np.zeros((3315, 8325))  # reducing axis 0 leaves one grid's shape
np.std(allmg, axis=0, ddof=1, out=sd)
I've seen the performance advantages of wrapping numpy calculations in numexpr.evaluate() on various websites, but I don't think there's a way to run np.std() inside numexpr.evaluate() (correct me if I'm wrong). Are there any other ways I can optimize the np.std() call? It currently takes about 18 s on my system, and I'm hoping to make that much faster somehow.
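For what it's worth, while np.std() itself can't run inside evaluate(), the variance arithmetic underneath it can. A minimal sketch of that idea (my own, not from the thread; it ignores the masked values for simplicity):
import numpy as np
import numexpr as ne

rng = np.random.default_rng(0)
allg = rng.random((3, 3315, 8325), dtype=np.float32)  # stand-in for the real grids
m = allg.mean(axis=0)                                 # per-cell mean across the 3 grids
ss = ne.evaluate("sum((allg - m)**2, axis=0)")        # squared deviations, reduced inside numexpr
sd = np.sqrt(ss / (allg.shape[0] - 1))                # ddof=1, as in the question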
Maybe you can use multiprocessing to do the calculation in several processes (a sketch of that idea follows at the end of this answer). But before trying that, you can try to rearrange your data so that you can call std() on the last axis. Here is an example:
import numpy as np
import time

data = np.random.random((4000, 4000))
start = time.perf_counter()
np.std(data, axis=0)
print(time.perf_counter() - start)
start = time.perf_counter()
np.std(data, axis=1)
print(time.perf_counter() - start)
The result on my PC is:
0.511926329834
0.273098421142
Since all the data along the last axis are in contiguous memory, data access uses the CPU cache more effectively.
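And here is a minimal sketch of the multiprocessing idea mentioned above (my own; the chunk count and pool size are arbitrary choices, and for arrays this large the cost of pickling chunks between processes may eat into the gains):
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def chunk_std(chunk):
    # standard deviation across the stack of grids, ddof=1 as in the question
    return np.std(chunk, axis=0, ddof=1)

if __name__ == '__main__':
    rng = np.random.default_rng(0)
    allg = rng.random((3, 3315, 8325), dtype=np.float32)  # stand-in for the real grids
    chunks = np.array_split(allg, 8, axis=1)  # split the grid rows into 8 pieces
    with ProcessPoolExecutor(max_workers=8) as ex:
        parts = list(ex.map(chunk_std, chunks))
    sd = np.concatenate(parts, axis=0)
    print(sd.shape)  # (3315, 8325)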
