Any way to optimize numpy stats functions (e.g., via numexpr)? - python

I need to calculate standard deviation and other stats on a large multidimensional ndarray of gridded point data. Example:
import numpy as np
# ... gridded data are read into g1, g2, g3 arrays ...
allg = np.array( [g1, g2, g3] )
allmg = np.ma.masked_values(allg, -99.)
sd = np.zeros((3, 3315, 8325))
np.std(allmg, axis=0, ddof=1, out=sd)
I've seen the performance advantages of wrapping numpy calculations in numexpr.evaluate() on various websites but I don't think there's a way to run np.std() in numexpr.evaluate() (correct me if I'm wrong). Are there any other ways I can optimize the np.std() call? It currently takes about 18 sec to calculate on my system...hoping to make that much faster somehow...

Maybe you can use multiprocessing to do the calculation in several processes. But before trying that, you can try to rearrange your data so that you can call std() on the last axis. Here is an example:
import numpy as np
import time
data = np.random.random((4000, 4000))
start = time.perf_counter()
np.std(data, axis=0)
print(time.perf_counter() - start)
start = time.perf_counter()
np.std(data, axis=1)
print(time.perf_counter() - start)
The result on my PC is:
0.511926329834
0.273098421142
Since all the data along the last axis are contiguous in memory, data access uses the CPU cache more effectively.
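As a rough sketch of that rearrangement for a stack of grids like the one in the question (the array sizes and variable names below are made up for illustration), the stacking axis can be moved to the end and the data copied into contiguous memory before calling std. Whether this is actually faster depends on the machine and the array sizes, and the copy itself costs time, so it is worth timing both variants:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical small stand-in for the stacked grids: 3 grids of 200 x 300 points.
grids = rng.random((3, 200, 300))

# Reduction over the first axis, as in the question.
sd_first_axis = np.std(grids, axis=0, ddof=1)

# Move the stacking axis to the end and make the result contiguous, so that
# the 3 values reduced for each grid point sit next to each other in memory.
grids_last = np.ascontiguousarray(np.moveaxis(grids, 0, -1))
sd_last_axis = np.std(grids_last, axis=-1, ddof=1)

# Both layouts give the same statistics.
print(np.allclose(sd_first_axis, sd_last_axis))
```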

Related

Most computational-time efficient/fastest way to compute rolling (linear) regression in Python (Numpy or Pandas)

I need a very fast and efficient way to do rolling linear regression.
I looked through these two threads :
Efficient way to do a rolling linear regression
Rolling linear regression
From them, I had inferred that numpy was (computationally) the fastest. However, using my (limited) Python skills, I found that the time to compute the same set of rolling data was *** the same ***.
Is there a faster way to compute this than any of the 3 methods I post below? I would have thought the numpy way would be much faster, but unfortunately it wasn't.
########## testing time for pd rolling vs numpy rolling
def fitcurve(x_pts):
    poly = np.polyfit(np.arange(len(x_pts)), x_pts, 1)
    return np.poly1d(poly)[1]
win_ = 30
# tmp_ = data_.Close
tmp_ = pd.Series(np.random.rand(10000))
s_time = time.time()
roll_pd = tmp_.rolling(win_).apply(lambda x: fitcurve(x)).to_numpy()
print('pandas rolling time is', time.time() - s_time)
plt.show()
pd.Series(roll_pd).plot()
########
s_time = time.time()
roll_np = np.empty(0)
for cnt_ in range(len(tmp_)-win_):
    tmp1_ = tmp_[cnt_:cnt_+ win_]
    grad_ = np.linalg.lstsq(np.vstack([np.arange(win_), np.ones(win_)]).T, tmp1_, rcond = None)[0][0]
    roll_np = np.append(roll_np, grad_)
print('numpy rolling time is', time.time() - s_time)
plt.show()
pd.Series(roll_np).plot()
#################
s_time = time.time()
roll_st = np.empty(0)
from scipy import stats
for cnt_ in range(len(tmp_)-win_):
    slope, intercept, r_value, p_value, std_err = stats.linregress(np.arange(win_), tmp_[cnt_:cnt_ + win_])
    roll_st = np.append(roll_st, slope)
print('stats rolling time is', time.time() - s_time)
plt.show()
pd.Series(roll_st).plot()
tl;dr
My answer is
view = np.lib.stride_tricks.sliding_window_view(tmp_, (win_,))
xxx=np.vstack([np.arange(win_), np.ones(win_)]).T
roll_mat=(np.linalg.inv(xxx.T @ xxx) @ (xxx.T) @ view.T)[0]
And it takes 1.2 ms to compute, compared to 2 seconds for your pandas and numpy version, and 3.5 seconds for your stat version.
Long version
One method could be to use sliding_window_view to transform your tmp_ array into an array of windows (a fake one: it is just a view, not really a 10000x30 array of data. It is just tmp_ viewed differently, hence the _view in the function name).
That alone brings no direct advantage. But from there, you can try to take advantage of vectorization.
I do that in two different ways: an easy one, and one that takes a minute of thinking. Since I put the best answer first, the rest of this message can appear chronologically inconsistent (I say things like "in my previous answer" when the previous answer comes later), but I tried to write both answers consistently.
New answer : matrix operations
One method to do that (since lstsq is one of the rare numpy functions that won't just do it naturally) is to go back to what lstsq(X,Y) computes mathematically for a full-rank X: (XᵀX)⁻¹Xᵀ Y
So let's just do that. In Python, with xxx being the X array (of arange and 1 in your example) and view the array of windows onto your data (that is, view[i] is tmp_[i:i+win_]), that would be np.linalg.inv(xxx.T@xxx)@xxx.T@view[i] for each row i. We could vectorize that operation with np.vectorize to avoid iterating over i, as I did for my first solution (see below). But the thing is, we don't need to. That is just a matrix times a vector. And the operation computing a matrix times a vector for each vector in an array of vectors is just matrix multiplication!
Hence my 2nd (and probably final) answer
view = np.lib.stride_tricks.sliding_window_view(tmp_, (win_,))
xxx=np.vstack([np.arange(win_), np.ones(win_)]).T
roll_mat=(np.linalg.inv(xxx.T @ xxx) @ (xxx.T) @ view.T)[0]
roll_mat is still identical to roll_np, with one extra row because your roll_np stopped one row short of the last possible one (see below for graphical proof with my first answer. I could provide a new image for this one, but it is indistinguishable from the one I already used). So, same result (unsurprisingly, I should say... but sometimes it is still a surprise when things work exactly like theory says they do).
But the timing is something else. As promised, my previous factor of 4 was nothing compared to what real vectorization can do. See the updated timing table:
Method | Time
pandas | 2.10 s
numpy roll | 2.03 s
stat | 3.58 s
numpy view/vectorize (see below) | 0.46 s
numpy view/matmult | 1.2 ms
The important part is the 'ms', compared to the 's' of the others.
So this time the factor is about 1700!
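The equivalence between the matrix-product form and a per-window lstsq call can be checked on a small example. This is only a sketch: the array y, its length, and the variable names are made up for illustration, and sliding_window_view needs NumPy 1.20 or later.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

rng = np.random.default_rng(0)
y = rng.random(200)   # hypothetical stand-in for tmp_
win = 30

# Design matrix shared by every window: one column for x, one for the intercept.
X = np.vstack([np.arange(win), np.ones(win)]).T

# View of all windows, shape (200 - 30 + 1, 30) = (171, 30); no data is copied.
windows = sliding_window_view(y, win)

# All slopes at once: row 0 of (X^T X)^-1 X^T applied to every window.
slopes_matmul = (np.linalg.inv(X.T @ X) @ X.T @ windows.T)[0]

# Reference: one lstsq call per window.
slopes_lstsq = np.array(
    [np.linalg.lstsq(X, w, rcond=None)[0][0] for w in windows]
)

print(np.allclose(slopes_matmul, slopes_lstsq))
```

Explicitly inverting XᵀX is fine for a small, well-conditioned design matrix like this one; for badly conditioned problems np.linalg.pinv(X) would be a more cautious choice.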
Old-answer : vectorize
A lame method, once we have this view, could be to use np.vectorize from there. I call it lame because vectorize is not supposed to be efficient. It is just a for loop called by another name. The official documentation clearly says it is provided primarily for convenience, not for performance. And yet, it would be an improvement over your code:
view = np.lib.stride_tricks.sliding_window_view(tmp_, (win_,))
xxx=np.vstack([np.arange(win_), np.ones(win_)]).T
f = np.vectorize(lambda y: np.linalg.lstsq(xxx,y,rcond=None)[0][0], signature='(n)->()')
roll_vectorize=f(view)
First, let's verify the result:
plt.scatter(f(view)[:-1], roll_np)
So, obviously, same results as roll_np (which, I've checked the same way, gives the same results as the two others, with the same variation in indexing, since the 3 methods do not all use the same strategy at the border).
And the interesting part, timings:
Method | Time
pandas | 2.10 s
numpy roll | 2.03 s
stat | 3.58 s
numpy view/vectorize | 0.46 s
So, you see, it is not supposed to be for performance, and yet I gain more than a factor of 4 with it.
I am pretty sure that a more vectorized method (alas, lstsq doesn't directly allow it, unlike most numpy functions) would be even faster.
First if you need some tips for optimizing your python code, I believe this playlist might help you.
To make it faster: np.append is never a good way to build up an array. Think of it in terms of memory: every call to np.append allocates a completely new array that is one element larger (n+1, where n is the old size), copies the n old items into it, and then puts the new item in the last place.
So I changed it as follows:
########## testing time for pd rolling vs numpy rolling
def fitcurve(x_pts):
    poly = np.polyfit(np.arange(len(x_pts)), x_pts, 1)
    return np.poly1d(poly)[1]
win_ = 30
# tmp_ = data_.Close
tmp_ = pd.Series(np.random.rand(10000))
s_time = time.time()
roll_pd = tmp_.rolling(win_).apply(lambda x: fitcurve(x)).to_numpy()
print('pandas rolling time is', time.time() - s_time)
plt.show()
pd.Series(roll_pd).plot()
########
s_time = time.time()
roll_np = np.zeros(len(tmp_)-win_) ### Change
for cnt_ in range(len(tmp_)-win_):
    tmp1_ = tmp_[cnt_:cnt_+ win_]
    grad_ = np.linalg.lstsq(np.vstack([np.arange(win_), np.ones(win_)]).T, tmp1_, rcond = None)[0][0]
    roll_np[cnt_] = grad_ ### Change
    # roll_np = np.append(roll_np, grad_) ### Change
print('numpy rolling time is', time.time() - s_time)
plt.show()
pd.Series(roll_np).plot()
#################
s_time = time.time()
roll_st = np.empty(0)
from scipy import stats
for cnt_ in range(len(tmp_)-win_):
    slope, intercept, r_value, p_value, std_err = stats.linregress(np.arange(win_), tmp_[cnt_:cnt_ + win_])
    roll_st = np.append(roll_st, slope)
print('stats rolling time is', time.time() - s_time)
plt.show()
pd.Series(roll_st).plot()
I initialized the array up front with its expected final size (len(tmp_)-win_) and just assigned values to it afterwards, and it was much faster.
There are also some other tips. Python is an interpreted language, so every operation executed inside a Python-level loop carries interpreter overhead. If you can do more work per statement, for example with a list comprehension or a single call that processes a whole array, it will generally be faster.
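The difference between the two patterns can be seen in isolation with a minimal sketch (the array size and names are made up for illustration). Both loops produce the same array, but the first reallocates and copies on every iteration while the second writes into memory allocated once:

```python
import numpy as np

n = 1000
values = np.arange(n, dtype=float)

# Growing with np.append: each call allocates a new array and copies the old one.
grown = np.empty(0)
for v in values:
    grown = np.append(grown, v)

# Preallocating: the array is allocated once and filled in place.
filled = np.empty(n)
for i, v in enumerate(values):
    filled[i] = v

print(np.array_equal(grown, filled))
```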

Is there any simple method to parallel np.einsum?

I would like to know whether there is any simple method to parallelize einsum in NumPy.
I found some discussions
Numpy np.einsum array multiplication using multiple cores
Any chance of making this faster? (numpy.einsum)
numpy.tensordot() only works for binary contractions over a single axis, and Numba requires writing out specific loops. Is there any simple and robust approach to parallelize einsum (possibly including opt-einsum, tf-einsum, etc.) with arbitrary contractions?
A sample code is as follows (if necessary I can use a more complicated contraction as the example):
import numpy as np
import timeit
import time
na = nc = 1000
nb = 1000
n_iter = 10
A = np.random.random((na,nb))
B = np.random.random((nb,nc))
t_total = 0.
for i in range(n_iter):
    start = time.time()
    C = np.einsum('ij,jk->ik', A, B)
    end = time.time()
    t_total += end - start
print('AB->C',(t_total)/n_iter)
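One thing that may be worth trying before any explicit parallelization: np.einsum accepts an optimize argument, and with optimize=True pairwise contractions such as this one can be handed to BLAS-backed routines, which are multithreaded in many NumPy builds. Whether this helps depends on the contraction and on how NumPy was built, so treat the following as a sketch to time on your own machine (the array sizes are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((200, 300))
B = rng.random((300, 100))

# Default einsum path.
C_plain = np.einsum('ij,jk->ik', A, B)

# Let einsum pick a contraction path; this may dispatch to BLAS-backed code.
C_opt = np.einsum('ij,jk->ik', A, B, optimize=True)

# Plain matrix product for reference.
C_mat = A @ B

print(np.allclose(C_plain, C_opt), np.allclose(C_opt, C_mat))
```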

Optimizing Kernel Density Bandwidth using Python

I am attempting to build a class that automatically determines the optimal bandwidth for a kernel density estimate. I am using the FFTKDE method of KDEpy for my purposes, since I am required to calculate this quantity very quickly. I am aware that there are the options of Scott's Rule, Silverman's Rule and Improved Sheather-Jones in KDEpy, but I am keen to directly optimise my bandwidth.
I would like to calculate the maximum likelihood cross-validation (MLCV) score so that I can optimize it. However, my code is terribly slow, since I need to estimate the KDE for each data point in 20,000 data points, at each iteration of the minimisation.
The equation I am attempting to optimize looks like this (see page 8, here):
My code for calculating this loss is as follows:
import numpy as np
from KDEpy import FFTKDE
from tqdm import tqdm
import time
def MLCV(data, bw):
    N = len(data)
    idx = np.ones(N, bool)
    logs = np.empty(N)
    for i in tqdm(range(N)):
        idx[i]=False
        x_kde, y_kde = FFTKDE(bw=bw).fit(data[idx]).evaluate(2**13)
        idx[i]=True
        logs[i] = np.sum(np.log(y_kde))
    MLCV = np.sum(logs)/N - np.log((N-1)*bw)
    return MLCV
data = np.random.normal(size=20000)
bw = 0.01
t0 = time.process_time()
mlcv = MLCV(data, bw)
t1 = time.process_time()
print("MLCV = {:3.3E}, Elapsed Time = {:3.3}s".format(mlcv, t1-t0))
Output:
MLCV = -1.077E+05, Elapsed Time = 46.3s
Can anyone suggest a means of making this faster/an alternative, quicker algorithm?
I have also considered simply minimising the negative log of the output, which I have seen elsewhere:
def L(data, bw):
    x_kde, y_kde = FFTKDE(bw=bw).fit(data).evaluate(2**13)
    return -np.sum(np.log(y_kde))
However, my intuition tells me neither method is the correct solution, since I cannot directly calculate the values for the actual data, only interpolate them, due to the FFT method requiring points to be on a grid.
Is there a loss that suits my needs? Can anyone suggest a better solution than I have come up with?
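One possible direction, sketched here under stated assumptions rather than taken from KDEpy: if a Gaussian kernel is acceptable, the leave-one-out score can be evaluated directly at the data points with pairwise kernel values, with no grid and no refitting per point. The function name mlcv_direct and the chunk parameter are hypothetical; the final line mirrors the last line of the MLCV function above (mean of the logs minus log((N-1)*bw)). The work is O(N²), done in chunks to limit memory, and for very small bandwidths an isolated point can have a zero leave-one-out sum, which gives log(0).

```python
import numpy as np

def mlcv_direct(data, bw, chunk=1000):
    """Leave-one-out MLCV score with a Gaussian kernel, evaluated at the data points."""
    data = np.asarray(data, dtype=float)
    n = data.size
    log_sums = np.empty(n)
    norm = 1.0 / np.sqrt(2.0 * np.pi)
    for start in range(0, n, chunk):
        stop = min(start + chunk, n)
        # Scaled differences between the points in this chunk and all points.
        u = (data[start:stop, None] - data[None, :]) / bw
        k = norm * np.exp(-0.5 * u**2)
        # Leave-one-out: drop each point's own kernel contribution.
        k[np.arange(stop - start), np.arange(start, stop)] = 0.0
        log_sums[start:stop] = np.log(k.sum(axis=1))
    return log_sums.mean() - np.log((n - 1) * bw)

rng = np.random.default_rng(0)
sample = rng.normal(size=500)
print(mlcv_direct(sample, bw=0.3))
```

This score could then be passed to a scalar optimizer over bw, which is a design choice to weigh against the speed of the FFT-based estimate.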

Multithreading 1D Median Filtering in Python

I have tried the following Python median filtering functions on time-series signals to find the fastest and most efficient one.
sig is a numpy array of size 80×188 which contains 188 samples measured by 80 sensors.
import numpy as np
from scipy.ndimage import median_filter
from scipy.signal import medfilt
from scipy.signal import medfilt2d
import time
sig = np.random.rand(80,188).astype('f')
print(type(sig))
print(type(sig[0][0]))
window_length = 181
t = time.time()
sigFiltered = medfilt2d(sig, (1,window_length))
elapsed = time.time() - t
print('scipy.signal.medfilt2d: %g seconds' % elapsed)
t = time.time()
sigFiltered = median_filter(sig, (1,window_length))
elapsed = time.time() - t
print('scipy.ndimage.median_filter: %g seconds' % elapsed)
t = time.time()
sigFiltered = medfilt(sig, (1,window_length))
elapsed = time.time() - t
print('scipy.signal.medfilt: %g seconds' % elapsed)
The code can be tried here.
The result of the filter is another time-series array of size 80×188 with smoothed time-points for each sensor.
MATLAB medfilt1(sig, 181, [], 2) performs the filtering on the same data 10 times faster compared to scipy.signal.medfilt2d, which was the fastest among other functions. On my machine, MATLAB=2ms vs Python=20 ms. I think MATLAB performs multithreading processing and python does not.
Is there any way to perform multithreaded median filtering to speed up the process and assign sensors to different threads? Is there a more efficient median filter available in Python? Can I achieve the performance of MATLAB in Python, or at least get closer to it?
With such a long filter relative to the input, most outputs of a standard medfilt are going to be the same. Were this a convolution, it would be a "full" convolution. If you instead only compute the outputs of a "valid" convolution, that will be much faster in this case:
t = time.time()
medians = []
for i in range(sig.shape[1] - window_length + 1):  # every "valid" window position
    sig2 = sig[:, i:i+window_length]
    f = np.median(sig2, axis=1)
    medians.append(f)
sigFiltered = np.stack(medians).T
elapsed = time.time() - t
print('numpy.median: %g seconds' % elapsed)
numpy.median: 0.0015518 seconds
This is in the ballpark of the requested 1 ms runtime per 188 sample size.
Even these values will change very slowly, or rarely, as new input samples arrive. You could therefore speed this up considerably by using a hop larger than 1.
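The same "valid" medians can also be computed without a Python loop, as a sketch assuming NumPy 1.20 or later for sliding_window_view (the hop value is made up for illustration). With 188 samples and a window of 181 there are 188 - 181 + 1 = 8 valid window positions per sensor:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

rng = np.random.default_rng(0)
sig = rng.random((80, 188)).astype('f')
window_length = 181

# View of every valid window along the time axis: shape (80, 8, 181).
windows = sliding_window_view(sig, window_length, axis=1)

# Median over each window: shape (80, 8).
valid_medians = np.median(windows, axis=-1)

# A hop larger than 1 just subsamples the window positions before the median.
hop = 4  # hypothetical value
hopped_medians = np.median(windows[:, ::hop, :], axis=-1)

print(valid_medians.shape, hopped_medians.shape)
```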
I'm wondering why you're using a median filter of 181 points for a data length of 188? The filter is so long that you're essentially just throwing away all the data and replacing it with the global median of the sensor's output. Typical median filter lengths would be a few samples, depending on what kind of transients you want to filter out.
The filter length also explains why it's so slow. On my machine, your median_filter example takes 46 ms. Running with a more normal filter size of 3 samples takes 0.7 ms.

Optimize Function - use array as input

I am playing with SciPy today and I wanted to test least square fitting. The function malo(time) works perfectly in returning me calculated concentrations if I put it in a loop which iterates over an array of timesteps (in the code "time").
Now I want to compare my calculated concentrations with my measured ones. I created a residuals function which calculates the difference between measured concentration (in the script an array called conc) and the modelled concentration with malo(time).
With optimize.leastsq I want to fit the parameter PD so that both curves match as well as possible. I don't see a mistake in my code, and malo(time) performs well, but whenever I run the optimize.leastsq command Python says "only length-1 arrays can be converted to Python scalars". If I set the timedt array to a single value, the code runs without any error.
Do you see any chance to convince Python to use my array of timesteps in the loop?
import pylab as p
import math as m
import numpy as np
from scipy import optimize
Q = 0.02114
M = 7500.0
dt = 30.0
PD = 0.020242215
tom = 26.0 # minutes
tos = tom * 60.0 # seconds
timedt = np.array([30.,60.,90])
conc= np.array([ 2.7096, 2.258 , 1.3548, 0.9032, 0.9032])
def malo(time):
    M1 = M/Q
    M2 = 1/(tos*m.sqrt(4*m.pi*PD*((time/tos)**3)))
    M3a = (1 - time/tos)**2
    M3b = 4*PD*(time/tos)
    M3 = m.exp(-1*(M3a/M3b))
    out = M1 * M2 * M3
    return out
def residuals(p,y,time):
    PD = p
    err = y - malo(timedt)
    return err
p0 = 0.05
p1 = optimize.leastsq(residuals,p0,args=(conc,timedt))
Notice that you're working here with arrays defined in the NumPy module, e.g.:
timedt = np.array([30.,60.,90])
conc= np.array([ 2.7096, 2.258 , 1.3548, 0.9032, 0.9032])
Now, those arrays are not part of standard Python (which is a general-purpose language). The problem is that you're mixing arrays with regular operations from the math module, which is part of standard Python and only meant to work on scalars.
So, for example:
M2 = 1/(tos*m.sqrt(4*m.pi*PD*((time/tos)**3)))
will work if you use np.sqrt instead, which is designed to work on arrays:
M2 = 1/(tos*np.sqrt(4*m.pi*PD*((time/tos)**3)))
And so on.
NB: SciPy and other modules meant for numeric/scientific programming know about NumPy and are built on top of it, so those functions should all work on arrays. Just don't use math when working with them. NumPy comes with replicas of all those functions (sqrt, cos, exp, ...) to work with your arrays.
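Putting this together, a version of the model that works on arrays might look like the sketch below. Two things here are assumptions beyond the answer above: PD is passed into malo as an argument so that the optimizer can actually vary it, and the fit is run on synthetic data generated from the model itself (times and conc_synth are made-up names and values), because the measured conc array in the question has five entries while timedt has only three.

```python
import numpy as np
from scipy import optimize

Q = 0.02114
M = 7500.0
tos = 26.0 * 60.0  # seconds

def malo(time, PD):
    """Modelled concentration; works on scalars and NumPy arrays alike."""
    tau = time / tos
    M1 = M / Q
    M2 = 1 / (tos * np.sqrt(4 * np.pi * PD * tau**3))
    M3 = np.exp(-((1 - tau) ** 2) / (4 * PD * tau))
    return M1 * M2 * M3

def residuals(p, y, time):
    # p is the parameter vector handed over by leastsq; p[0] is PD.
    return y - malo(time, p[0])

# Synthetic data (assumption): model output for a known PD on 20 time steps.
times = np.linspace(600.0, 3000.0, 20)
conc_synth = malo(times, 0.02)

p_fit, ier = optimize.leastsq(residuals, [0.05], args=(conc_synth, times))
print(p_fit, ier)
```

With real measurements, y and time must have the same length, otherwise the subtraction in residuals fails with a broadcasting error.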
