I'd like to find a fast way to update a sum of squared residuals, when I know that only a small fraction of the terms are changing. Let me describe the problem in more detail.
I have N data points from noisy step-function data.
import numpy as np

N = 100000
realStepList = [200, 500, 900]
x = np.zeros(N)
for realStep in realStepList:
    x[realStep:] += 1
x += np.random.randn(len(x)) * 0.1  # Add noise
I'd like to calculate the sum of squared residuals for this data and an arbitrary list of step locations. Here is how I do this.
a = [0, 250, 550, N]
def Q(x, a):
    q = np.sum([np.sum((x[ai:af] - i)**2) for i, (ai, af) in enumerate(zip(a[:-1], a[1:]))])
    return q
a is my list of potential steps. It's easier to use a list that always has 0 as the first element and N as the last element.
This is relatively slow, since it is a sum over N squares. However, I realized that if I change a by a relatively small amount, most of these N terms will remain unchanged, which means I don't have to compute them again.
So let's say I have already computed Q(x,a) as above. I now have another list
b = [aa + dd for aa, dd in zip(a, d)]
where d is the difference between the two lists. Rather than calculating Q(x,b) as above (another sum over N elements), I want to find
deltaQ(x, a, d) such that
Q(x, b) = Q(x,a) + deltaQ(x, a, d)
I have written such a function, but it is slow and sloppy. In fact, it is slower than Q!
def deltaQ(x, a, d):
    z = np.zeros(len(x))
    J = np.zeros(len(x))
    s = 0
    for j, [dd, aa] in enumerate(zip(d, a[1:-1])):
        if dd >= 0:
            z[aa:aa+dd] += 1
            s += sum(x[aa:aa+dd])
        if dd < 0:
            z[aa+dd:aa] += -1
            s += -sum(x[aa+dd:aa])
        J[aa:] += 1
    dq = 2*s - sum((J**2 - (J-z)**2))
    return dq
The idea is to identify all the points in x which will be affected. For example, if the original list was a = [0, 5, 10] and b = [0, 7, 10], then only the terms corresponding to x[5:7] will change in the sum. I keep track of this with the list z. I then calculate the change based on this.
I don't think I'm the first person in the world to have this problem. So my question is:
Is there a fast way to calculate the difference in the sum of squared residuals, since this will often be a sum over many fewer elements than recalculating the new sum from scratch?
First of all, I was able to run Q with the original code, only modifying N, to get the following timings on a fairly standard issue laptop (nothing too fancy):
N = 1e6: 0.00236s per loop
N = 1e7: 0.0260s per loop
N = 1e8: 0.251s per loop
The process went into swap at N = 1e9, but I would find a timing of 2.5 seconds quite acceptable for that size, assuming you had enough RAM available.
That being said, I was able to get a 10% speedup by changing the inner np.sum to np.ndarray.sum on the result of the call to np.power:
def Q1(x, a):
    return sum(((x[ai:af] - i)**2).sum() for i, (ai, af) in enumerate(zip(a[:-1], a[1:])))
Now here is a version that is three times slower:
def offset(x, a):
    d = np.zeros(x.shape, dtype=int)
    d[a[1:-1]] = 1
    # Add out=d to make this run 4 times slower
    return np.cumsum(d)

def Q2(x, a):
    return np.sum((x - offset(x, a))**2)
Why does this help? Well, notice what offset does: it readjusts x to the baseline that you chose. In the long run this does two things. First, you get a much more vectorized solution than the one you are currently proposing. Secondly, it allows you to rewrite your delta function in terms of the different b arrays that you chose instead of having to compute d, which may not even be possible if len(a) != len(b).
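To make the behaviour of offset concrete, here is a small check on a toy input. The 10-sample array and the step list are made up for illustration, and dtype=int is used because the np.int alias no longer exists in recent NumPy releases.

```python
import numpy as np

def offset(x, a):
    # Mark the interior step locations, then accumulate to get the level per sample.
    d = np.zeros(x.shape, dtype=int)
    d[a[1:-1]] = 1
    return np.cumsum(d)

x_toy = np.zeros(10)    # illustrative data, not from the question
a_toy = [0, 3, 7, 10]   # steps at indices 3 and 7
levels = offset(x_toy, a_toy)
print(levels)           # [0 0 0 1 1 1 1 2 2 2]
```

Each sample gets the index of the segment it falls into, which is exactly the value that Q subtracts from x in that segment.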
The delta for a single point is $(x - j)^2 - (x - i)^2$. If you expand out all the mess, you get $(j - i)(j + i - 2x)$, where i and j are the step values returned by offset for a and b respectively. Not only does this simplify the computation greatly, but j - i is nonzero exactly where you need to compute the deltas, so it doubles as the mask:
def deltaQ1(x, a, b):
    i = offset(x, a)
    j = offset(x, b)
    d = j - i
    mask = d.astype(bool)
    return (d[mask] * (j[mask] + i[mask] - 2 * x[mask])).sum()
This function runs roughly 10 to 15 times faster than your original implementation (but keep in mind that it takes a and b instead of a and d as inputs). Calling Q1(x, b) - Q1(x, a) is still about twice as fast, though. The new function also creates a bunch of temporary arrays, but their number can easily be reduced.
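As a rough sketch of how the same identity could be applied to only the affected samples, the hypothetical helper delta_q_window below restricts the work to the span between the lowest and highest step that moved. It assumes a and b are sorted and have the same length, and it uses np.searchsorted to get the step level inside the window instead of building a full-length offset array. None of these names come from the original post.

```python
import numpy as np

def q_full(x, a):
    # Reference: full sum of squared residuals, same computation as Q1.
    return sum(((x[ai:af] - i) ** 2).sum()
               for i, (ai, af) in enumerate(zip(a[:-1], a[1:])))

def delta_q_window(x, a, b):
    # Hypothetical helper; assumes len(a) == len(b) and both lists are sorted.
    inner_a = np.asarray(a[1:-1])
    inner_b = np.asarray(b[1:-1])
    moved = inner_a != inner_b
    if not moved.any():
        return 0.0
    # Samples outside [lo, hi) keep the same level under a and b.
    lo = min(inner_a[moved].min(), inner_b[moved].min())
    hi = max(inner_a[moved].max(), inner_b[moved].max())
    t = np.arange(lo, hi)
    i = np.searchsorted(inner_a, t, side="right")  # level under a
    j = np.searchsorted(inner_b, t, side="right")  # level under b
    return float(np.sum((j - i) * (j + i - 2 * x[lo:hi])))

rng = np.random.default_rng(0)
N = 100000
x = np.zeros(N)
for real_step in [200, 500, 900]:
    x[real_step:] += 1
x += rng.standard_normal(N) * 0.1  # add noise

a = [0, 250, 550, N]
b = [0, 180, 565, N]
delta = delta_q_window(x, a, b)
reference = q_full(x, b) - q_full(x, a)
print(delta, reference)
```

The cost here scales with the width of the window rather than with N, so whether it pays off depends on how far apart the moved steps are.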
Timings
Here are some sample timings on my computer, in addition to the ones shown above (using the data provided, with a = [0, 250, 550, N], b = [0, 180, 565, N] and therefore d = [0, -70, 15, 0], where relevant):
Raw residuals:
Q: 147µs per loop
Q1: 135µs per loop <-- Use this one!
Q2: 453µs per loop
Delta of residuals:
deltaQ: 8363µs per loop
deltaQ1: 656µs per loop
Q(x, b) - Q(x, a): 297µs per loop
Q1(x, b) - Q1(x, a): 275µs per loop <-- Best solution?
Final note: I have the distinct impression that your original implementation of the delta function is not correct. It does not agree with the result of Q(x, b) - Q(x, a), but deltaQ1(x, a, b) does.
TL;DR
Please don't optimize prematurely. If you do it right, it is of course possible to write a specialized C function to hold i - j and i + j in memory for you which will work much faster, but I doubt you will get much mileage out of a vectorized pipeline. Part of the reason is that you will end up spending a lot of time figuring out how a complex set of indices intermeshes instead of just adding numbers together.
Related
So I have this definition here,
DP[i, j] = f[i, j] + min(DP[i-1, j-1], DP[i-1, j], DP[i-1, j+1])
which defines the minimum accrued cost to go from the top of the NxM matrix to the bottom of the matrix. Each cell in f represents a value/cost (1.2, 0, 10, etc.) to travel to that cell from another cell.
The matrix may be large (1500x1500; it's a gradient map of an image), and the DP algorithm I programmed came out to be about a second per run for my matrices. It needs to run hundreds of times per execution, so the total program run time comes out to several minutes. This loop is about 99% of my bottleneck, so I am trying to optimize it with Python/NumPy's vectorization methods. I only have access to NumPy and SciPy.
Note: I hardly program in Python at all, so the solution may just be obvious.
First attempt: just the straightforward loop. Time here is about 2-2.5 seconds per run.
DP = f.copy()
for r in range(2, len(DP) - 1):  # Start at row 2 since row one doesn't change
    for c in range(1, len(DP[0]) - 1):
        DP[r][c] += min(DP[r - 1, c-1:c+2])
Second attempt: I tried to use NumPy's fromiter function to calculate entire rows at a time rather than column by column. Time here is about 1-1.5 seconds per run. My goal is to get this at least an order of magnitude faster, but I am stumped on how else I can optimize this.
DP = f.copy()
for r in range(2, len(DP) - 1):
    def foo(arr):
        idx, val = arr
        # border columns have no full three-cell neighbourhood
        if idx == 0 or idx == len(DP[0]) - 1:
            return np.inf
        return val + min(DP[r - 1, idx - 1], DP[r - 1, idx], DP[r - 1, idx + 1])
    DP[r, :] = np.fromiter(map(foo, enumerate(DP[r, :])), dtype=float)
As hpaulj stated, your problem is inherently sequential, so it will be hard to fully vectorize. It does seem possible, though (every cell is updated based on values of row r=2; what differs between the following rows is the number of triplets from row 2 that are considered), so perhaps you can find a smart way to do it!
That being said, a quick and half-vectorized solution would be to use the neat way of performing sliding windows with fancy indexing proposed by user42541, so we replace the inner loop with a vectorized call:
indexer = np.arange(3)[:,None] + np.arange(DP.shape[1] - 2)[None,:]
for r in range(2, DP.shape[0] - 1):
    DP[r,1:-1] += np.min(DP[r-1,indexer], axis = 0)
This results in a speed-up relative to your double-loop method (your vectorized solution didn't work on my PC) of about two orders of magnitude for a 1500x1500 array of integers.
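As a quick sanity check of the row-at-a-time idea, the sketch below compares the double loop against a row-wise update on a small random matrix. Note that it swaps the fancy-indexing window for np.minimum over three shifted slices of the previous row, which is a different (but equivalent) way to get the three-cell minimum; the function names and the test matrix are made up for illustration.

```python
import numpy as np

def dp_double_loop(f):
    # Reference implementation: same bounds as the loop in the question.
    DP = f.copy()
    for r in range(2, len(DP) - 1):
        for c in range(1, len(DP[0]) - 1):
            DP[r, c] += min(DP[r - 1, c - 1:c + 2])
    return DP

def dp_row_wise(f):
    # One vectorized update per row; rows still have to be processed in order.
    DP = f.copy()
    for r in range(2, DP.shape[0] - 1):
        prev = DP[r - 1]
        DP[r, 1:-1] += np.minimum(np.minimum(prev[:-2], prev[1:-1]), prev[2:])
    return DP

rng = np.random.default_rng(1)
f = rng.integers(0, 10, size=(40, 30)).astype(float)  # illustrative cost map
same = np.array_equal(dp_double_loop(f), dp_row_wise(f))
print(same)
```

The outer loop over rows remains because each row depends on the finished previous row; only the inner loop over columns is vectorized.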
I am building a function, which contains many loops and conditions in it.
The input of the function is an element of a list.
I want the function to store its result so that the next time I don't need to run through those loops. The real code is really large, so I pasted the main lines below as a toy model of the real code:
a=[1,2,3,4,5,6,7]
def ff(x):
    b=0
    for i in range(10000):
        for k in range(10000):
            if k/2 >20:
                for j in range(1000):
                    if j**2-j>1:
                        b += a[x]^2+a[x]
    return b

ff(2)
So, in fact the result of ff should be simple, but due to the loops and conditions it runs really slow. I don't want to run through the loops each time I call ff.
It is a bit like the idea that the function is a tensor in TensorFlow and the index is the feed value: the structure is built first and can then be executed with different feed values. Maybe what I want is symbolic computation.
Is there a way to store the result as a structure, so that next time I just feed in the value of the index?
I cannot simply feed in the values of a, since a can have other shapes.
Your code is equivalent to (if you'll start analyzing what each one of the loops is actually doing...):
def ff(x):
    return 995900780000 * (a[x]^2+a[x])
This code should run very fast...
The condition k/2 > 20 can be restated as k > 40, so rather than starting the k loop from 0, start it from 41 and eliminate that condition. Likewise, the condition j**2 - j > 1 implies that you are only interested in j >= 2: the equation j**2 - j = 1 has two roots, one of which is less than 0 (and you aren't interested in those values), while the other is about 1.6, and the first integer greater than that is 2. So start the j loop from 2 and eliminate that condition. Finally, the amount added to b does not depend on i, k or j, so make the rhs 1 to count the iterations. You now have
def ff(x):
    b=0
    for i in range(10000):
        for k in range(41, 10000):
            for j in range(2, 1000):
                b += 1
    return b
The j loop will run 1000 - 2 = 998 times, k will run 10000 - 41 = 9959 times and i will run 10000 times. The total number of times that b will be incremented is 998*9959*10000 = 99390820000. That's how many times you will have added your rhs (a[x]**2 + a[x]) together, which, except for a different value, is what @alfasin is pointing out: your loops are effectively adding the rhs 99390820000 times, so the result will be 99390820000*(a[x]**2 + a[x]) and you never have to run the loop. Your whole function reduces to:
def ff(x):
    return 99390820000*(a[x]**2 + a[x])
The "structure" of adding something within nested loops is to multiply that something by the product of the number of times each loop is run. So if you had
b = 0
for i in range(6):
    for j in range(7):
        b += 1
the value of b would be 6*7...and the answer (as always ;-)) is 42. If you were adding f(x) to b each time then the answer would be 42*f(x).
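The per-loop counts can be checked cheaply by counting each loop on its own rather than running the full nest. The sketch below assumes Python 3 (so k / 2 is true division) and reads the rhs as a[x]**2 + a[x], i.e. with ** rather than the bitwise ^ of the original snippet; the variable names are made up for illustration.

```python
# Count how many values of each loop variable pass its condition.
k_count = sum(1 for k in range(10000) if k / 2 > 20)   # k = 41 .. 9999
j_count = sum(1 for j in range(1000) if j**2 - j > 1)  # j = 2 .. 999
i_count = 10000                                        # no condition on i

total = i_count * k_count * j_count

a = [1, 2, 3, 4, 5, 6, 7]

def ff(x):
    # Closed form: the rhs is added `total` times.
    return total * (a[x]**2 + a[x])

print(k_count, j_count, total)  # 9959 998 99390820000
print(ff(2))
```

Counting the loops independently works here only because none of the conditions depends on another loop's variable.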
I've written out a recursive algorithm for a little homegrown computer algebra system, where I'm applying pairwise reductions to the list of operands of an algebraic operation (adjacent operands only, as the algebra is non-commutative). I'm trying to get an idea of the runtime complexity of my algorithm (but unfortunately, as a physicist it's been a very long time since I took any undergrad CS courses that dealt with complexity analysis). Without going into details of the specific problem, I think I can formalize the algorithm in terms of a function f that is a "divide" step and a function g that combines the results. My algorithm would then take the following formal representation:
f(1) = 1 # recursion anchor for f
f(n) = g(f(n/2), f(n/2))
g(n, 0) = n, g(0, m) = m # recursion ...
g(1, 0) = g(0, 1) = 1 # ... anchors for g
/ g(g(n-1, 1), m-1) if reduction is "non-neutral"
g(n, m) = | g(n-1, m-1) if reduction is "neutral"
\ n + m if no reduction is possible
In this notation, the functions f and g receive lists as arguments and return lists, with the length of the input/output lists being the argument and the right-hand-side of the equations above.
For the full story, the actual code corresponding to f and g is the following:
def _match_replace_binary(cls, ops: list) -> list:
    """Reduce list of `ops`"""
    n = len(ops)
    if n <= 1:
        return ops
    ops_left = ops[:n//2]
    ops_right = ops[n//2:]
    return _match_replace_binary_combine(
        cls,
        _match_replace_binary(cls, ops_left),
        _match_replace_binary(cls, ops_right))

def _match_replace_binary_combine(cls, a: list, b: list) -> list:
    """combine two fully reduced lists a, b"""
    if len(a) == 0 or len(b) == 0:
        return a + b
    if len(a) == 1 and len(b) == 1:
        return a + b
    r = _get_binary_replacement(a[-1], b[0], cls._binary_rules)
    if r is None:
        return a + b
    if r == cls.neutral_element:
        return _match_replace_binary_combine(cls, a[:-1], b[1:])
    r = [r, ]
    return _match_replace_binary_combine(
        cls,
        _match_replace_binary_combine(cls, a[:-1], r),
        b[1:])
I'm interested in the worst-case number of times _get_binary_replacement is called, depending on the size of ops.
So I think I've got it now. To restate the problem: find the number of calls to _get_binary_replacement when calling _match_replace_binary with an input of size n.
Define a function g(n, m) (as in the original question) that maps the sizes of the two inputs of _match_replace_binary_combine to the size of the output.
Define a function T_g(n, m) that maps the sizes of the two inputs of _match_replace_binary_combine to the total number of calls to g required to obtain the result. This is also the (worst-case) number of calls to _get_binary_replacement, as each call to _match_replace_binary_combine calls _get_binary_replacement at most once.
We can now consider the worst case and best case for g:
best case (no reduction): g(n,m) = n + m, T_g(n, m) = 1
worst case (all non-neutral reduction): g(n, m) = 1, T_g(n, m) = 2*(n+m) - 1 (I determined this empirically)
Now, the master theorem (WP) applies:
Going through the description on WP:
k=1 (the recursion anchor is for size 1)
We split into a = 2 subproblems of size n/2 in constant (d = 1) time
After solving the subproblems, the amount of work required to combine the results is c = T_g(n/2, n/2). This is 2n - 1 (linear in n) in the worst case and 1 in the best case
Thus, following the examples on the WP page for the master theorem, the worst case complexity is n * log(n), and the best case complexity is n
Empirical trials seem to bear out this result. Any objections to my line of reasoning?
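The two regimes can be illustrated by unrolling the recurrence numerically. This is only a sketch of the recurrence T(n) = 2 T(n/2) + c(n), not of the actual reduction code: total_cost and combine_cost are hypothetical names, and the plain linear cost c(n) = n stands in for the worst-case combine step.

```python
def total_cost(n, combine_cost):
    # T(1) = 0, T(n) = T(floor(n/2)) + T(ceil(n/2)) + combine_cost(n)
    if n <= 1:
        return 0
    half = n // 2
    return (total_cost(half, combine_cost)
            + total_cost(n - half, combine_cost)
            + combine_cost(n))

n = 1024
linear = total_cost(n, lambda size: size)  # combine step linear in n
constant = total_cost(n, lambda size: 1)   # combine step constant

print(linear)    # 10240 = n * log2(n)
print(constant)  # 1023  = n - 1
```

A linear combine step gives n * log2(n) in total, while a constant one gives n - 1, which matches the n * log(n) versus n conclusion.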
I've read a lot about different techniques for iterating over numpy arrays recently and it seems that consensus is not to iterate at all (for instance, see a comment here). There are several similar questions on SO, but my case is a bit different as I have to combine "iterating" (or not iterating) and accessing previous values.
Let's say there are N (N is small, usually 4, might be up to 7) 1-D numpy arrays of float128 in a list X, all of the same size. To give you a little insight, these are data from PDE integration, each array stands for one function, and I would like to apply a Poincaré section. Unfortunately, the algorithm should be both memory- and time-efficient since these arrays are sometimes ~1 GB each, and there are only 4 GB of RAM on board (I've just learnt about memmap'ing of numpy arrays and now consider using them instead of regular ones).
One of these arrays is used for "filtering" the others, so I start with secaxis = X.pop(idx). Now I have to locate pairs of indices where (secaxis[i-1] > 0 and secaxis[i] < 0) or (secaxis[i-1] < 0 and secaxis[i] > 0) and then apply simple algebraic transformations to remaining arrays, X (and save results). Worth mentioning, data shouldn't be wasted during this operation.
There are multiple ways for doing that, but none of them seem efficient (and elegant enough) to me. One is a C-like approach, where you just iterate in a for-loop:
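import array  # better than lists
res = [array.array('d') for _ in X]
for i in range(1, secaxis.size):
    # sign change between consecutive samples (the condition described above)
    if (secaxis[i-1] > 0 and secaxis[i] < 0) or (secaxis[i-1] < 0 and secaxis[i] > 0):
        co = -secaxis[i-1] / secaxis[i]
        for j in range(len(X)):  # X holds the N-1 remaining arrays
            res[j].append((X[j][i-1] + co*X[j][i]) / (1 + co))
This is clearly very inefficient, and it is not Pythonic either.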
Another way is to use numpy.nditer, but I haven't figured out yet how one accesses the previous value, though it allows iterating over several arrays at once:
# without secaxis = X.pop(idx)
it = numpy.nditer(X)
for vec in it:
    # vec[idx] is current value, how do you get the previous (or next) one?
    pass
Third possibility is to first find sought indices with efficient numpy slices, and then use them for bulk multiplication/addition. I prefer this one for now:
res = []
inds, = numpy.where((secaxis[:-1] < 0) * (secaxis[1:] > 0) +
                    (secaxis[:-1] > 0) * (secaxis[1:] < 0))
coefs = -secaxis[inds] / secaxis[inds+1]  # array of coefficients
for f in X:  # loop is done only N-1 times, that is, 3 to 6
    res.append( (f[inds] + coefs*f[inds+1]) / (1+coefs) )
But this is seemingly done in 7 + 2*(N - 1) passes, moreover, I'm not sure about secaxis[inds] type of addressing (it is not slicing and generally it has to find all elements by indices just like in the first method, doesn't it?).
Finally, I've also tried using itertools and it resulted in monstrous and obscure structures, which might stem from the fact that I'm not very familiar with functional programming:
import array
from itertools import tee

def filt(x):
    # x[0] and x[1] are consecutive secaxis values; keep only sign changes
    return (x[0] < 0 and x[1] > 0) or (x[0] > 0 and x[1] < 0)

res = [array.array('d') for _ in X]
iters = [iter(x) for x in X]   # N-1 iterators in a list
prev, curr = tee(zip(*iters))  # 2 similar iterators, each of which
                               # yields tuples of N-1 values
next(curr, None)               # one of them is now for the current value
seciter = tee(iter(secaxis))
next(seciter[1], None)
for x in filter(filt, zip(seciter[0], seciter[1], prev, curr)):
    co = -x[0] / x[1]
    for r, p, c in zip(res, x[2], x[3]):
        r.append((p + co*c) / (1 + co))
Not only does this look very ugly, it also takes an awful lot of time to complete.
So, I have following questions:
Of all these methods, is the third one indeed the best? If so, what can be done to improve the last one?
Are there any other, better ones yet?
Out of sheer curiosity, is there a way to solve the problem using nditer?
Finally, will I be better off using memmap versions of numpy arrays, or will it probably slow things down a lot? Maybe I should only load secaxis array into RAM, keep others on disk and use third method?
(bonus question) List of equal in length 1-D numpy arrays comes from loading N .npy files whose sizes aren't known beforehand (but N is). Would it be more efficient to read one array, then allocate memory for one 2-D numpy array (slight memory overhead here) and read remaining into that 2-D array?
The numpy.where() version is fast enough; you can speed it up a little with method3(). If the > condition can be changed to >=, you can also use method4().
import numpy as np

a = np.random.randn(100000)

def method1(a):
    idx = []
    for i in range(1, len(a)):
        if (a[i-1] > 0 and a[i] < 0) or (a[i-1] < 0 and a[i] > 0):
            idx.append(i)
    return idx

def method2(a):
    inds, = np.where((a[:-1] < 0) * (a[1:] > 0) +
                     (a[:-1] > 0) * (a[1:] < 0))
    return inds + 1

def method3(a):
    m = a < 0
    p = a > 0
    return np.where((m[:-1] & p[1:]) | (p[:-1] & m[1:]))[0] + 1

def method4(a):
    return np.where(np.diff(a >= 0))[0] + 1

assert np.allclose(method1(a), method2(a))
assert np.allclose(method2(a), method3(a))
assert np.allclose(method3(a), method4(a))

%timeit method1(a)
%timeit method2(a)
%timeit method3(a)
%timeit method4(a)
the %timeit result:
1 loop, best of 3: 294 ms per loop
1000 loops, best of 3: 1.52 ms per loop
1000 loops, best of 3: 1.38 ms per loop
1000 loops, best of 3: 1.39 ms per loop
I'll need to read your post in more detail, but will start with some general observations (from previous iteration questions).
There isn't an efficient way of iterating over arrays in Python, though there are things that slow things down. I like to distinguish between the iteration mechanism (nditer, for x in A:) and the action (alist.append(...), x[i+1] += 1). The big time consumer is usually the action, done many times, not the iteration mechanism itself.
Letting numpy do the iteration in compiled code is the fastest.
xdiff = x[1:] - x[:-1]
is much faster than
xdiff = np.zeros(x.shape[0]-1)
for i in range(x.shape[0]-1):
    xdiff[i] = x[i+1] - x[i]
The np.nditer isn't any faster.
nditer is recommended as a general iteration tool in compiled code. But its main value lies in handling broadcasting and coordinating the iteration over several arrays (input/output). And you need to use buffering and C-like code to get the best speed from nditer (I'll look up a recent SO question).
https://stackoverflow.com/a/39058906/901925
Don't use nditer without studying the relevant iteration tutorial page (the one that ends with a cython example).
=========================
Just judging from experience, this approach will be fastest. Yes it's going to iterate over secaxis a number of times, but those are all done in compiled code, and will be much faster than any iteration in Python. And the for f in X: iteration is just a few times.
res = []
inds, = numpy.where((secaxis[:-1] < 0) * (secaxis[1:] > 0) +
                    (secaxis[:-1] > 0) * (secaxis[1:] < 0))
coefs = -secaxis[inds] / secaxis[inds+1]  # array of coefficients
for f in X:
    res.append( (f[inds] + coefs*f[inds+1]) / (1+coefs) )
@HYRY has explored alternatives for making the where step faster. But as you can see, the differences aren't that big. Other possible tweaks:
inds1 = inds+1
coefs = -secaxis[inds] / secaxis[inds1]
coefs1 = coefs+1
for f in X:
    res.append(( f[inds] + coefs*f[inds1]) / coefs1)
If X was an array, res could be an array as well.
res = (X[:,inds] + coefs*X[:,inds1])/coefs1
But for small N I suspect the list res is just as good. Don't need to make the arrays any bigger than necessary. The tweaks are minor, just trying to avoid recalculating things.
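To check that the vectorized version really reproduces the explicit loop, here is a small self-contained comparison on random data. The function names, array sizes and seed are made up for illustration; the crossing test uses the boolean-mask form from method3 and the interpolation formula from the question.

```python
import numpy as np

def section_loop(secaxis, X):
    # Explicit Python loop: interpolate every array at each sign change.
    res = [[] for _ in X]
    for i in range(1, secaxis.size):
        if (secaxis[i-1] > 0 and secaxis[i] < 0) or (secaxis[i-1] < 0 and secaxis[i] > 0):
            co = -secaxis[i-1] / secaxis[i]
            for j, f in enumerate(X):
                res[j].append((f[i-1] + co * f[i]) / (1 + co))
    return [np.array(r) for r in res]

def section_vectorized(secaxis, X):
    # Find all sign changes at once, then use fancy indexing on each array.
    neg, pos = secaxis < 0, secaxis > 0
    inds, = np.where((neg[:-1] & pos[1:]) | (pos[:-1] & neg[1:]))
    inds1 = inds + 1
    coefs = -secaxis[inds] / secaxis[inds1]
    coefs1 = coefs + 1
    return [(f[inds] + coefs * f[inds1]) / coefs1 for f in X]

rng = np.random.default_rng(2)
secaxis = rng.standard_normal(1000)
X = [rng.standard_normal(1000) for _ in range(3)]

loop_res = section_loop(secaxis, X)
vec_res = section_vectorized(secaxis, X)
agree = all(np.allclose(l, v) for l, v in zip(loop_res, vec_res))
print(agree, len(vec_res[0]))
```

Because the two neighbouring values have opposite signs at every selected index, coefs is positive and the division by coefs1 is safe.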
=================
This use of np.where is just np.nonzero. That actually makes two passes of the array, once with np.count_nonzero to determine how many values it will return, and create the return structure (list of arrays of now known length). And a second loop to fill in those indices. So multiple iterations are fine if it keeps action simple.
I am writing a program to perform numerical calculations with a Hessian matrix. The Hessian matrix is 500 x 500 and I need to populate it hundreds of times over. I am populating it with two for loops each time. My problem is that this is prohibitively slow. Here is my code:
#create these outside function
hess = np.empty([500,500])
b = np.empty([500])

def hess_h(x):
    #create these first so they aren't calculated every iteration
    for k in range(500):
        b[k] = (1-np.dot(a[k],x))**2
    for i in range(500):
        for j in range(500):
            if i == j:
                #these are values along diagonal
                hess[i,j] = float(2*(1-x[i])**2 + 4*x[i]**2)/(1-x[i]**2)**2 \
                            - float(a[i,j]*sum(a[i]))/b[i]
            #the matrix is symmetric so only calculate upper triangle
            elif j > i :
                hess[i,j] = -float(a[i,j]*sum(a[i]))/b[i]
            elif i > j:
                hess[i,j] = hess[j,i]
    return hess
I calculate that hess_h(np.zeros(500)) takes 10.2289998531 sec to run. That is too long and I need to figure out another way.
Look for patterns in your calculation, in particular things that you can calculate over the whole range of i and j.
I see for example a diagonal where i==j
hess[i,j] = float(2*(1-x[i])**2 + 4*x[i]**2)/(1-x[i]**2)**2 \
- float(a[i,j]*sum(a[i]))/b[i]
Can you change that to a one time expression, something like:
(2*(1-x)**2 + 4*x**2)/(1-x**2)**2 - np.diagonal(a)*a.sum(axis=1)/b
The other pieces work with upper and lower triangular elements. There are functions like np.triu that give you their indices.
I'm trying to give you tools and thought processes for solving this with a few numpy vectorized operations, instead of iterating over all elements of i and j.
Looks like
-a[i,j]*sum(a[i])/b[i]
is used for every element. I assume a is a (500,500) array. Can you use
-a*a.sum(axis=?)/b
b can be 'vectorized'
b[k] = (1-np.dot(a[k],x))**2
with something like:
(1 - np.dot(a, x))**2
or
(1 - np.einsum('kj,j->k', a, x))**2
test the details on a smaller a.
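Putting the hints above together, one possible vectorized version is sketched below and checked against the double loop on a small matrix. It assumes a is a square NumPy array and x a 1-D array; hess_loop, hess_vectorized and the random test data are illustrative names and values, not part of the original code.

```python
import numpy as np

def hess_loop(a, x):
    # Reference: the double loop from the question, for an n x n matrix.
    n = len(x)
    b = np.array([(1 - np.dot(a[k], x)) ** 2 for k in range(n)])
    hess = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            if i == j:
                hess[i, j] = ((2 * (1 - x[i]) ** 2 + 4 * x[i] ** 2) / (1 - x[i] ** 2) ** 2
                              - a[i, j] * a[i].sum() / b[i])
            elif j > i:
                hess[i, j] = -a[i, j] * a[i].sum() / b[i]
            else:
                hess[i, j] = hess[j, i]
    return hess

def hess_vectorized(a, x):
    b = (1 - a @ x) ** 2
    # full[i, j] = -a[i, j] * sum(a[i]) / b[i]
    full = -a * (a.sum(axis=1) / b)[:, None]
    upper = np.triu(full, k=1)
    hess = upper + upper.T  # mirror the strict upper triangle
    diag = (2 * (1 - x) ** 2 + 4 * x ** 2) / (1 - x ** 2) ** 2
    hess[np.diag_indices_from(hess)] = diag + np.diag(full)
    return hess

rng = np.random.default_rng(3)
n = 8
a = rng.uniform(0.0, 0.1, size=(n, n))
x = rng.uniform(-0.5, 0.5, size=n)
match = np.allclose(hess_loop(a, x), hess_vectorized(a, x))
print(match)
```

The row sums are taken with axis=1 because sum(a[i]) in the loop adds up row i; testing on a small n first, as suggested, makes it easy to catch a wrong axis.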