I have a series X of length n(=300,000). Using a window length of w (=40), I need to implement:
mu(i)= X(i)-X(i-w)
s(i) = sum{k=i-w to i} [X(k)-X(k-1) - mu(i)]^2
I was wondering if there's a way to prevent loops here. The fact that mu(i) is constant in second equation is causing complications in vectorization. I did the following so far:
x1=x.shift(1)
xw=x.shift(w)
mu= x-xw
dx=(x-x1-mu)**2 # wrong because mu wouldn't be constant for each i
s=pd.rolling_sum(dx,w)
The above code would work (and was working) in a loop setting but takes too long, so any help regarding vectorization or other speed improvement methods would be helpful. I posted this on crossvalidated with mathjax formatting but that doesn't seem to work here.
https://stats.stackexchange.com/questions/241050/python-vectorization-with-a-constant
Also just to clarify, I wasn't using a double loop, just a single one originally:
for i in np.arange(w, len(X)):
    x=X.ix[i-w:i,0] # clip a series of size w
    x1=x.shift(1)
    mu.ix[i]= x.ix[-1]-x.ix[0]
    temp= (x-x1-mu.ix[i])**2 # returns a series of size w but now mu is constant
    s.ix[i]= temp.sum()
Approach #1 : One vectorized approach would be using broadcasting -
N = X.shape[0]
a = np.arange(N)
k2D = a[:,None] - np.arange(w+1)[::-1]
mu1D = X - X[a-w]
out = ((X[k2D] - X[k2D-1] - mu1D[:,None])**2).sum(-1)
We can further optimize the last step to get squared summations with np.einsum -
subs = X[k2D] - X[k2D-1] - mu1D[:,None]
out = np.einsum('ij,ij->i',subs,subs)
Further improvement is possible with the use of NumPy strides to get X[k2D] and X[k2D-1].
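The stride-based variant mentioned above could look roughly like the following. This is only a sketch under a few assumptions: it needs NumPy 1.20 or later for sliding_window_view, the name rolling_s is made up for illustration, and unlike the broadcasting version it returns values only for i >= w+1, where the whole window exists, instead of wrapping around to the end of X.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def rolling_s(X, w):
    """Return (i, s) with s[j] = sum_{k=i[j]-w}^{i[j]} (X[k]-X[k-1]-mu(i[j]))**2."""
    X = np.asarray(X, dtype=float)
    d = np.diff(X)                        # d[k-1] = X[k] - X[k-1]
    # One row per valid i; each row is a view on w+1 consecutive differences.
    win = sliding_window_view(d, w + 1)   # shape (n - w - 1, w + 1)
    i = np.arange(w + 1, X.shape[0])      # row j of win belongs to i = j + w + 1
    mu = X[i] - X[i - w]
    subs = win - mu[:, None]              # mu is constant within each row
    return i, np.einsum('ij,ij->i', subs, subs)
```

The window view itself costs no extra memory, but the subtraction still materializes an array of shape (n - w - 1, w + 1).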
Approach #2 : To save memory when working with very large arrays, we can use a single loop instead of the two loops used in the original code, like so -
N = X.shape[0]
s = np.zeros((N))
k_idx = np.arange(-w,1)
for i in range(N):
    mu = X[i]-X[i-w]
    s[i] = ((X[k_idx]-X[k_idx-1] - mu)**2).sum()
    k_idx += 1
Again, np.einsum could be used here to compute s[i], like so -
subs = X[k_idx]-X[k_idx-1] - mu
s[i] = np.einsum('i,i->',subs,subs)
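A different technique from the two approaches above, offered only as a sketch: expanding the square gives s(i) = sum(d_k^2) - 2*mu(i)*sum(d_k) + (w+1)*mu(i)^2 with d_k = X(k)-X(k-1), and sum(d_k) over k = i-w..i telescopes to X(i)-X(i-w-1). That allows an O(n) computation with one cumulative sum and no 2D array. The function name rolling_s_cumsum is hypothetical, results are returned only for i >= w+1, and the subtraction of cumulative sums can lose some precision on long or badly scaled series.

```python
import numpy as np

def rolling_s_cumsum(X, w):
    """O(n) evaluation of s(i) for i = w+1 .. n-1 via the expanded square."""
    X = np.asarray(X, dtype=float)
    d = np.diff(X)                                    # d[k-1] = X[k] - X[k-1]
    c2 = np.concatenate(([0.0], np.cumsum(d ** 2)))   # c2[j] = sum(d[:j]**2)
    i = np.arange(w + 1, X.shape[0])
    sum_d2 = c2[i] - c2[i - w - 1]    # sum of d_k**2 for k = i-w .. i
    sum_d = X[i] - X[i - w - 1]       # telescoped sum of d_k for k = i-w .. i
    mu = X[i] - X[i - w]
    return i, sum_d2 - 2 * mu * sum_d + (w + 1) * mu ** 2
```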
Related
I'm making a script that does some mathematical morphology on images (mainly GIS rasters). So far I've implemented erosion and dilation; opening/closing with reconstruction is still on the TODO list, but that's not the subject here.
My implementation is very simple with nested loops, which I tried on a 10900x10900 raster and it took an absurdly long amount of time to finish, obviously.
Before I continue with other operations, I'd like to know if there's a faster way to do this.
My implementation:
def erode(image, S):
    (m, n) = image.shape
    buffer = np.full((m, n), 0).astype(np.float64)
    for i in range(S, m - S):
        for j in range(S, n - S):
            buffer[i, j] = np.min(image[i - S: i + S + 1, j - S: j + S + 1]) #dilation is just np.max()
    return buffer
I've heard about vectorization but I'm not quite sure I understand it too well. Any advice or pointers are appreciated. Also I am aware that opencv has these morphological operations, but I want to implement my own to learn about them.
The question here is whether you want a more efficient implementation because you want to learn about numpy, or whether you want a more efficient algorithm.
I think there are two obvious things that could be improved in your approach. One is that you want to avoid looping at the Python level, because that is slow. The other is that you're taking the minimum (or maximum) of overlapping parts of arrays, and you can make this more efficient if you reuse the effort you put into finding the previous one.
I will illustrate that with 1d implementations of erosion.
Baseline for comparison
Here is basically your implementation just a 1d version:
def erode(image, S):
    n = image.shape[0]
    buffer = np.full(n, 0).astype(np.float64)
    for i in range(S, n - S):
        buffer[i] = np.min(image[i - S: i + S + 1]) #dilation is just np.max()
    return buffer
You can make this faster using stride_tricks/sliding_window_view, i.e. by avoiding the Python loops and doing the work at the numpy level.
Faster Implementation
np.lib.stride_tricks.sliding_window_view(arr,2*S+1).min(1)
Notice that it's not quite doing the same thing, since it only starts calculating values once there are 2S+1 values to take the minimum of. But for this illustration I will ignore this problem.
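For the 2D raster from the question, the same idea might be applied along one axis and then the other, since the minimum over a square window equals the minimum over rows of the per-row minima. The following is a sketch only, assuming NumPy 1.20 or later and S >= 1; erode_2d is a hypothetical name, and the border of width S is left at 0 as in the question's code.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def erode_2d(image, S):
    """Square-window erosion with half-width S; border pixels stay 0."""
    image = np.asarray(image, dtype=np.float64)
    k = 2 * S + 1
    out = np.zeros_like(image)
    # Pass 1: minimum over k consecutive columns -> shape (m, n - 2S)
    rows = sliding_window_view(image, k, axis=1).min(axis=-1)
    # Pass 2: minimum over k consecutive rows -> shape (m - 2S, n - 2S)
    out[S:-S, S:-S] = sliding_window_view(rows, k, axis=0).min(axis=-1)
    return out
```

Replacing both min calls with max would give the corresponding dilation.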
Faster Algorithm
A completely different approach would be to not calculate the min from scratch each time, but to keep the values ordered and only add one and remove one when moving the window one step to the right.
Here is a rough implementation of that:
from sortedcontainers import SortedDict  # third-party package

def smart_erode(arr, m):
    sd = SortedDict()  # maps each value in the current window to its count
    for new in arr[:m]:
        if new in sd:
            sd[new] += 1
        else:
            sd[new] = 1
    for to_remove,new in zip(arr[:-m+1],arr[m:]):
        yield sd.keys()[0]  # smallest key is the minimum of the current window
        if new in sd:
            sd[new] += 1
        else:
            sd[new] = 1
        if sd[to_remove] > 1:
            sd[to_remove] -= 1
        else:
            sd.pop(to_remove)
    yield sd.keys()[0]
Notice that an ordered set wouldn't work, and an ordered list would need a way to remove just one element with a specific value, since you could have repeated values in your array. I am using an ordered dict to store the number of items present for each value.
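If installing a sorted-container package is not an option, a monotonic deque is another well-known way to reuse work between neighbouring windows. This is a swapped-in technique, not the SortedDict approach described above, and the name sliding_min is made up for this sketch. Each index enters and leaves the deque at most once, so the whole pass is O(n) regardless of the window size m.

```python
from collections import deque
import numpy as np

def sliding_min(arr, m):
    """Yield the minimum of every length-m window of arr, left to right."""
    dq = deque()  # indices into arr; their values increase from front to back
    for i, v in enumerate(arr):
        # Older entries that are >= v can never be a window minimum again.
        while dq and arr[dq[-1]] >= v:
            dq.pop()
        dq.append(i)
        # Drop the front index once it has slid out of the window.
        if dq[0] <= i - m:
            dq.popleft()
        if i >= m - 1:
            yield arr[dq[0]]
```

Like the generator above, it produces n-m+1 values; flipping the comparison to <= turns it into a sliding maximum for dilation.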
A Rough Benchmark
I want to illustrate how the 3 implementations compare for different window sizes. So I am testing them with an array of 10^5 random integers for different window sizes ranging from 10^3 to 10^4.
arr = np.random.randint(0,10**5,10**5)
sliding_window_times = []
op_times = []
better_alg_times = []
for m in np.linspace(0,10**4,11)[1:].astype('int'):
    x = %timeit -o -n 1 -r 1 np.lib.stride_tricks.sliding_window_view(arr,2*m+1).min(1)
    sliding_window_times.append(x.best)
    x = %timeit -o -n 1 -r 1 erode(arr,m)
    op_times.append(x.best)
    x = %timeit -o -n 1 -r 1 tuple(smart_erode(arr,2*m+1))
    better_alg_times.append(x.best)
    print("")
pd.DataFrame({"Baseline Comparison":op_times,
              'Faster Implementation':sliding_window_times,
              'Faster Algorithm':better_alg_times,
             },
             index = np.linspace(0,10**4,11)[1:].astype('int')
            ).plot.bar()
Notice that for very small window sizes the raw power of the numpy implementation wins out but very quickly the amount of work we are saving by not calculating the min from scratch is more important.
So I have this definition here,
DP[i,j] = f[i,j] + min(DP[i-1, j-1], DP[i-1, j], DP[i-1, j+1])
which defines the minimum accrued cost to go from the top of the NxM matrix to the bottom of the matrix. Each cell in f represents a value/cost (1.2, 0, 10, etc.) to travel to that cell from another cell.
The matrix may be large (1500x1500; it's the gradient map of an image), and the DP algorithm I programmed came out to be about a second per run for my matrices. This needs to run hundreds of times per execution, so total program run time comes out to be several minutes. This loop is about 99% of my run time, so I am trying to optimize it with Python/numpy's vectorization methods. I only have access to Numpy and Scipy.
Note: I don't program in python hardly at all, so the solution may just be obvious idk.
First attempt, Just the straightforward loop, time here is about 2-2.5 seconds per run
DP = f.copy()
for r in range(2, len(DP) - 1): # Start at row 2 since row one doesn't change
    for c in range(1, len(DP[0]) - 1):
        DP[r][c] += min(DP[r - 1, c-1:c+2])
Second attempt: I tried to leverage the numpy function "fromiter" to calculate entire rows at a time rather than column by column. Time here is about 1-1.5 seconds per run. My goal is to get this at least an order of magnitude faster, but I am stumped on how else I can optimize this.
DP = f.copy()
for r in range(2, len(DP) - 1):
    def foo(arr):
        idx, val = arr
        if idx == 0 or idx == len(DP[0]) - 1:
            return np.inf
        return val + min(DP[r - 1, idx - 1], DP[r - 1, idx], DP[r - 1, idx + 1])
    # np.fromiter requires an explicit dtype
    DP[r, :] = np.fromiter(map(foo, enumerate(DP[r, :])), dtype=float)
As hpaulj stated, since your problem is inherently sequential it will be hard to fully vectorize, although it seems possible (every cell is updated based on values of the row r=2; the difference is the number of triplets from row 2 considered for each of the following rows), so perhaps you can find a smart way to do it!
That being said, a quick and half-vectorized solution would be to use the neat way of performing sliding windows with fancy indexing proposed by user42541, so we replace the inner loop with a vectorized call:
indexer = np.arange(3)[:,None] + np.arange(DP.shape[1] - 2)[None,:]
for r in range(2, DP.shape[0] - 1):
    DP[r,1:-1] += np.min(DP[r-1,indexer], axis = 0)
This results in a speed-up relative to your double loop method (your vectorized solution didn't work on my PC) of about two orders of magnitude for a 1500x1500 array of integers.
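A possible variation on the same row-by-row idea replaces the fancy-indexed window with three shifted slices of the previous row combined by np.minimum, which avoids building the index array. It is only a sketch: accumulate_cost is a hypothetical name, and it keeps the same conventions as the loops above (rows 2 to N-2, interior columns only), which may or may not be the intended boundary handling.

```python
import numpy as np

def accumulate_cost(f):
    """Row-by-row DP; each interior cell adds the min of its three upper neighbours."""
    DP = np.array(f, dtype=np.float64)   # work on a float copy of the cost matrix
    for r in range(2, DP.shape[0] - 1):
        prev = DP[r - 1]
        # prev[:-2], prev[1:-1], prev[2:] are the up-left, up and up-right neighbours
        DP[r, 1:-1] += np.minimum(np.minimum(prev[:-2], prev[1:-1]), prev[2:])
    return DP
```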
I'm trying to write a function that goes through the Jacobi iteration method for solving a system of linear equations. I've got most of it down, I just need to figure out how to iterate the last for loop either 1000 times or until the break condition is met. How can I make it so the value of x updates each iteration?
import numpy as np
def Jacobi(A,b,err):
    n = A.shape
    k = np.zeros(n[0])
    D = np.zeros(n)
    U = np.zeros(n)
    L = np.zeros(n)
    for i in range(n[0]):
        for j in range(n[0]):
            if i == j:
                D[i,j] = A[i,j]
            elif i < j:
                U[i,j] = A[i,j]
            else:
                L[i,j] = A[i,j]
    w = []
    for i in range(1000):
        x = np.linalg.inv(D)*(U+L)*x +np.linalg.inv(D)*b
        w.append(x)
        if abs(w[-1] - w[-2]) < err:
            break
    return w[-1]
For reference, my error statement says a list index in the if clause is out of range. I assume this is because there's only one element in w since I don't know how to make the for loop. Thanks in advance for any help.
I'm pretty sure you missed the intent of that exercise: if you are allowed to use inv, then you could just as well use linalg.inv(A) or, better, linalg.solve(A,b). Note that you have sign errors and that * is not matrix multiplication between numpy arrays. (Your declaration of the arrays is incompatible with their later use.)
Your specific problem can be solved by adding an additional test
if i>1 and abs(w[-1] - w[-2]) < err:
When the first condition fails, the second is not evaluated.
You should contemplate if it is a waste of memory to construct the w list when all you ever need is the last two entries.
x_last, x = x, jacobi_step(A,b,x)
would also work to have these available.
The preparation can be reduced to
D=np.diag(A); A_reduced = A-np.diag(D);
then the Jacobi step is simply, using that the arithmetic operations are applied element-wise by default
x_last, x = x, (b-A_reduced.dot(x))/D
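Putting these pieces together, a complete function might look like the sketch below. The zero starting vector, the max_iter parameter and the max-norm stopping test are assumptions made for illustration; they are not prescribed by the exercise. Convergence is only guaranteed for suitable matrices, for example strictly diagonally dominant ones.

```python
import numpy as np

def jacobi(A, b, err=1e-10, max_iter=1000):
    """Jacobi iteration for A x = b, stopping when successive iterates agree."""
    A = np.asarray(A, dtype=float)
    b = np.asarray(b, dtype=float)
    D = np.diag(A)                 # 1-D array holding the diagonal of A
    A_reduced = A - np.diag(D)     # off-diagonal part (L + U)
    x = np.zeros_like(b)           # assumed initial guess
    for _ in range(max_iter):
        x_last, x = x, (b - A_reduced.dot(x)) / D
        if np.max(np.abs(x - x_last)) < err:   # largest component-wise change
            break
    return x
```

Only the last two iterates are kept, so memory use does not grow with the number of iterations.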
I've read a lot about different techniques for iterating over numpy arrays recently and it seems that consensus is not to iterate at all (for instance, see a comment here). There are several similar questions on SO, but my case is a bit different as I have to combine "iterating" (or not iterating) and accessing previous values.
Let's say there are N (N is small, usually 4, might be up to 7) 1-D numpy arrays of float128 in a list X, all arrays are of the same size. To give you a little insight, these are data from PDE integration, each array stands for one function, and I would like to apply a Poincare section. Unfortunately, the algorithm should be both memory- and time-efficient since these arrays are sometimes ~1Gb each, and there are only 4Gb of RAM on board (I've just learnt about memmap'ing of numpy arrays and now consider using them instead of regular ones).
One of these arrays is used for "filtering" the others, so I start with secaxis = X.pop(idx). Now I have to locate pairs of indices where (secaxis[i-1] > 0 and secaxis[i] < 0) or (secaxis[i-1] < 0 and secaxis[i] > 0) and then apply simple algebraic transformations to remaining arrays, X (and save results). Worth mentioning, data shouldn't be wasted during this operation.
There are multiple ways for doing that, but none of them seem efficient (and elegant enough) to me. One is a C-like approach, where you just iterate in a for-loop:
import array # better than lists
res = [ array.array('d') for _ in X ]
for i in xrange(1,secaxis.size):
    if condition: # see above
        co = -secaxis[i-1]/secaxis[i]
        for j in xrange(N):
            res[j].append( (X[j][i-1] + co*X[j][i])/(1+co) )
This is clearly very inefficient and, besides, not Pythonic.
Another way is to use numpy.nditer, but I haven't figured out yet how one accesses the previous value, though it allows iterating over several arrays at once:
# without secaxis = X.pop(idx)
it = numpy.nditer(X)
for vec in it:
    # vec[idx] is current value, how do you get the previous (or next) one?
Third possibility is to first find sought indices with efficient numpy slices, and then use them for bulk multiplication/addition. I prefer this one for now:
res = []
inds, = numpy.where((secaxis[:-1] < 0) * (secaxis[1:] > 0) +
                    (secaxis[:-1] > 0) * (secaxis[1:] < 0))
coefs = -secaxis[inds] / secaxis[inds+1] # array of coefficients
for f in X: # loop is done only N-1 times, that is, 3 to 6
    res.append( (f[inds] + coefs*f[inds+1]) / (1+coefs) )
But this is seemingly done in 7 + 2*(N - 1) passes, moreover, I'm not sure about secaxis[inds] type of addressing (it is not slicing and generally it has to find all elements by indices just like in the first method, doesn't it?).
Finally, I've also tried using itertools and it resulted in monstrous and obscure structures, which might stem from the fact that I'm not very familiar with functional programming:
def filt(x):
    return (x[0] < 0 and x[1] > 0) or (x[0] > 0 and x[1] < 0)
import array
from itertools import izip, tee, ifilter
res = [ array.array('d') for _ in X ]
iters = [iter(x) for x in X] # N-1 iterators in a list
prev, curr = tee(izip(*iters)) # 2 similar iterators, each of which
                               # consists of N-1 iterators
next(curr, None) # one of them is now for current value
seciter = tee(iter(secaxis))
next(seciter[1], None)
for x in ifilter(filt, izip(seciter[0], seciter[1], prev, curr)):
    co = - x[0]/x[1]
    for r, p, c in zip(res, x[2], x[3]):
        r.append( (p+co*c) / (1+co) )
Not only does this look very ugly, it also takes an awful lot of time to complete.
So, I have the following questions:
Of all these methods, is the third one indeed the best? If so, what can be done to improve it?
Are there any other, better ones yet?
Out of sheer curiosity, is there a way to solve the problem using nditer?
Finally, will I be better off using memmap versions of numpy arrays, or will it probably slow things down a lot? Maybe I should only load secaxis array into RAM, keep others on disk and use third method?
(bonus question) List of equal in length 1-D numpy arrays comes from loading N .npy files whose sizes aren't known beforehand (but N is). Would it be more efficient to read one array, then allocate memory for one 2-D numpy array (slight memory overhead here) and read remaining into that 2-D array?
The numpy.where() version is fast enough; you can speed it up a little with method3(). If the > condition can be changed to >=, you can also use method4().
import numpy as np
a = np.random.randn(100000)
def method1(a):
    idx = []
    for i in range(1, len(a)):
        if (a[i-1] > 0 and a[i] < 0) or (a[i-1] < 0 and a[i] > 0):
            idx.append(i)
    return idx
def method2(a):
    inds, = np.where((a[:-1] < 0) * (a[1:] > 0) +
                     (a[:-1] > 0) * (a[1:] < 0))
    return inds + 1
def method3(a):
    m = a < 0
    p = a > 0
    return np.where((m[:-1] & p[1:]) | (p[:-1] & m[1:]))[0] + 1
def method4(a):
    return np.where(np.diff(a >= 0))[0] + 1
assert np.allclose(method1(a), method2(a))
assert np.allclose(method2(a), method3(a))
assert np.allclose(method3(a), method4(a))
%timeit method1(a)
%timeit method2(a)
%timeit method3(a)
%timeit method4(a)
the %timeit result:
1 loop, best of 3: 294 ms per loop
1000 loops, best of 3: 1.52 ms per loop
1000 loops, best of 3: 1.38 ms per loop
1000 loops, best of 3: 1.39 ms per loop
I'll need to read your post in more detail, but will start with some general observations (from previous iteration questions).
There isn't an efficient way of iterating over arrays in Python, though there are things that slow things down. I like to distinguish between the iteration mechanism (nditer, for x in A:) and the action (alist.append(...), x[i+1] += 1). The big time consumer is usually the action, done many times, not the iteration mechanism itself.
Letting numpy do the iteration in compiled code is the fastest.
xdiff = x[1:] - x[:-1]
is much faster than
xdiff = np.zeros(x.shape[0]-1)
for i in range(x.shape[0]-1):
    xdiff[i] = x[i+1] - x[i]
The np.nditer isn't any faster.
nditer is recommended as a general iteration tool in compiled code. But its main value lies in handling broadcasting and coordinating the iteration over several arrays (input/output). And you need to use buffering and c like code to get the best speed from nditer (I'll look up a recent SO question).
https://stackoverflow.com/a/39058906/901925
Don't use nditer without studying the relevant iteration tutorial page (the one that ends with a cython example).
=========================
Just judging from experience, this approach will be fastest. Yes it's going to iterate over secaxis a number of times, but those are all done in compiled code, and will be much faster than any iteration in Python. And the for f in X: iteration is just a few times.
res = []
inds, = numpy.where((secaxis[:-1] < 0) * (secaxis[1:] > 0) +
                    (secaxis[:-1] > 0) * (secaxis[1:] < 0))
coefs = -secaxis[inds] / secaxis[inds+1] # array of coefficients
for f in X:
    res.append( (f[inds] + coefs*f[inds+1]) / (1+coefs) )
@HYRY has explored alternatives for making the where step faster. But as you can see, the differences aren't that big. Other possible tweaks:
inds1 = inds+1
coefs = -secaxis[inds] / secaxis[inds1]
coefs1 = coefs+1
for f in X:
    res.append(( f[inds] + coefs*f[inds1]) / coefs1)
If X was an array, res could be an array as well.
res = (X[:,inds] + coefs*X[:,inds1])/coefs1
But for small N I suspect the list res is just as good. Don't need to make the arrays any bigger than necessary. The tweaks are minor, just trying to avoid recalculating things.
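Collected into one function, the steps above might read as follows. This is a sketch rather than a tuned implementation: poincare_section is a made-up name, the boolean-mask form of the crossing test is borrowed from the other answer's method3, and exact zeros in secaxis are ignored just as in the original condition.

```python
import numpy as np

def poincare_section(X, idx):
    """Interpolate every array in X (except X[idx]) at sign changes of X[idx]."""
    X = list(X)                    # shallow copy so the caller's list is untouched
    secaxis = X.pop(idx)
    neg = secaxis < 0
    pos = secaxis > 0
    # i such that secaxis[i] and secaxis[i+1] have strictly opposite signs
    inds, = np.nonzero((neg[:-1] & pos[1:]) | (pos[:-1] & neg[1:]))
    inds1 = inds + 1
    coefs = -secaxis[inds] / secaxis[inds1]
    coefs1 = coefs + 1
    return [(f[inds] + coefs * f[inds1]) / coefs1 for f in X]
```

For example, with secaxis going from 1 to -1 the coefficient is 1, so the result is the midpoint of the two neighbouring samples of each remaining array.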
=================
This use of np.where is just np.nonzero. That actually makes two passes of the array, once with np.count_nonzero to determine how many values it will return, and create the return structure (list of arrays of now known length). And a second loop to fill in those indices. So multiple iterations are fine if it keeps action simple.
I am writing a program to perform numerical calculation with a Hessian matrix. The Hessian matrix is 500 x 500 and I need to populate it hundreds of times over. I am populating it with two for loops each time. My problem is that this is prohibitively slow. Here is my code:
#create these outside function
hess = np.empty([500,500])
b = np.empty([500])
def hess_h(x):
    #create these first so they aren't calculated every iteration
    for k in range(500):
        b[k] = (1-np.dot(a[k],x))**2
    for i in range(500):
        for j in range(500):
            if i == j:
                #these are values along diagonal
                hess[i,j] = float(2*(1-x[i])**2 + 4*x[i]**2)/(1-x[i]**2)**2 \
                            - float(a[i,j]*sum(a[i]))/b[i]
            #the matrix is symmetric so only calculate upper triangle
            elif j > i :
                hess[i,j] = -float(a[i,j]*sum(a[i]))/b[i]
            elif i > j:
                hess[i,j] = hess[j,i]
    return hess
I calculate that hess_h(np.zeros(500)) takes 10.2289998531 sec to run. That is too long and I need to figure out another way.
Look for patterns in your calculation, in particular things that you can calculate over the whole range of i and j.
I see for example a diagonal where i==j
hess[i,j] = float(2*(1-x[i])**2 + 4*x[i]**2)/(1-x[i]**2)**2 \
- float(a[i,j]*sum(a[i]))/b[i]
Can you change that to a one time expression, something like:
(2*(1-x)**2 + 4*x**2)/(1-x**2)**2 - np.diagonal(a)*a.sum(axis=1)/b
The other pieces work with up and lower triangular elements. There are functions like np.triu that give you their indices.
I'm trying to give you tools and thought processes for solving this with a few numpy vectorized operations, instead of iterating over all elements of i and j.
Looks like
-a[i,j]*sum(a[i])/b[i]
is used for every element. I assume a is a (500,500) array. Can you use
-a*a.sum(axis=?)/b
b can be 'vectorized'
b[k] = (1-np.dot(a[k],x))**2
with something like:
(1 - np.dot(a, x))**2
or
(1 - np.einsum('kj,j->k',a,x))**2
test the details on a smaller a.
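Assembled, the hints above might lead to something like the following sketch. It assumes a is a square 2D array and x a 1D array of matching length, passes a in as an argument instead of reading a global, and uses the hypothetical name hess_h_vectorized. It mirrors the loop version literally (upper triangle computed, then copied to the lower triangle), so it is worth checking against the loop on a small a before trusting it.

```python
import numpy as np

def hess_h_vectorized(a, x):
    """Loop-free version of hess_h for a square array a and vector x."""
    b = (1 - a.dot(x)) ** 2                      # b[k] = (1 - a[k].x)**2
    # full[i, j] = -a[i, j] * sum(a[i]) / b[i]
    full = -a * (a.sum(axis=1) / b)[:, None]
    upper = np.triu(full, k=1)                   # strictly upper triangle
    hess = upper + upper.T                       # mirror it into the lower triangle
    diag = (2 * (1 - x) ** 2 + 4 * x ** 2) / (1 - x ** 2) ** 2 + np.diagonal(full)
    np.fill_diagonal(hess, diag)
    return hess
```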