I recently wrote a function that computes the log-likelihood of an ordered logit model.
But it takes a long time to run on large data sets.
So I want to rewrite the code, replacing the if statements with numpy.where.
My new code has a problem that I don't know how to solve.
If you know how, please help me. Thank you very much!
This is my original function:
import numpy as np
from scipy.stats import logistic
def func(y, X, thresholds):
    ll = 0.0
    for row in zip(y, X):
        if row[0] == 0:
            ll += logistic.logcdf(thresholds[0] - row[1])
        elif row[0] == len(thresholds):
            ll += logistic.logcdf(row[1] - thresholds[-1])
        else:
            for i in xrange(1, len(thresholds)):
                if row[0] == i:
                    diff_prob = logistic.cdf(thresholds[i] - row[1]) - logistic.cdf(thresholds[i - 1] - row[1])
                    if diff_prob <= 10 ** -5:
                        ll += np.log(10 ** -5)
                    else:
                        ll += np.log(diff_prob)
    return ll
y = np.array([0, 1, 2])
X = [2, 2, 2]
thresholds = np.array([2, 3])
print func(y, X, thresholds)
This is the new, but incomplete, code:
y = np.array([0, 1, 2])
X = [2, 2, 2]
thresholds = np.array([2, 3])
ll = np.where(y == 0, logistic.logcdf(thresholds[0] - X),
              np.where(y == len(thresholds), logistic.logcdf(X - thresholds[-1]),
                       np.log(logistic.cdf(thresholds[1] - X) - logistic.cdf(thresholds[0] - X))))
print ll.sum()
The problem is that I don't know how to rewrite the inner loop (for i in xrange(1, len(thresholds)):) with np.where.
I think asking how to implement it just using np.where is a bit of an X/Y problem.
So I'll try to explain how I would approach optimizing this function.
My first instinct is to get rid of the inner for loop, which was the pain point anyway:
import numpy as np
from scipy.stats import logistic
def func1(y, X, thresholds):
    ll = 0.0
    for row in zip(y, X):
        if row[0] == 0:
            ll += logistic.logcdf(thresholds[0] - row[1])
        elif row[0] == len(thresholds):
            ll += logistic.logcdf(row[1] - thresholds[-1])
        else:
            diff_prob = logistic.cdf(thresholds[row[0]] - row[1]) - \
                        logistic.cdf(thresholds[row[0] - 1] - row[1])
            diff_prob = 10 ** -5 if diff_prob < 10 ** -5 else diff_prob
            ll += np.log(diff_prob)
    return ll
y = np.array([0, 1, 2])
X = [2, 2, 2]
thresholds = np.array([2, 3])
print(func1(y, X, thresholds))
I have just replaced i with row[0], without changing the semantics of the loop. So that's one for loop less.
Now I would like to have the form of the statements in the different branches of the if-else to be the same. To that end:
import numpy as np
from scipy.stats import logistic
def func2(y, X, thresholds):
    ll = 0.0
    for row in zip(y, X):
        if row[0] == 0:
            ll += logistic.logcdf(thresholds[0] - row[1])
        elif row[0] == len(thresholds):
            ll += logistic.logcdf(row[1] - thresholds[-1])
        else:
            ll += np.log(
                np.maximum(
                    10 ** -5,
                    logistic.cdf(thresholds[row[0]] - row[1]) -
                    logistic.cdf(thresholds[row[0] - 1] - row[1])
                )
            )
    return ll
y = np.array([0, 1, 2])
X = [2, 2, 2]
thresholds = np.array([2, 3])
print(func2(y, X, thresholds))
Now the expression in each branch is of the form ll += expr.
At this point there are a couple of different paths the optimization can take. You can try to optimize the loop away by writing it as a comprehension, but I suspect that it will not give you much of a speed-up.
An alternate path is to pull the if conditions out of the loop. That is what your intent with np.where was as well:
import numpy as np
from scipy.stats import logistic
def func3(y, X, thresholds):
    y_0 = y == 0
    y_end = y == len(thresholds)
    y_rest = ~(y_0 | y_end)
    ll_1 = logistic.logcdf(thresholds[0] - X[y_0])
    ll_2 = logistic.logcdf(X[y_end] - thresholds[-1])
    ll_3 = np.log(
        np.maximum(
            10 ** -5,
            logistic.cdf(thresholds[y[y_rest]] - X[y_rest]) -
            logistic.cdf(thresholds[y[y_rest] - 1] - X[y_rest])
        )
    )
    return np.sum(ll_1) + np.sum(ll_2) + np.sum(ll_3)
y = np.array([0, 1, 2])
X = np.array([2, 2, 2])
thresholds = np.array([2, 3])
print(func3(y, X, thresholds))
Note that I turned X into an np.array to be able to use fancy indexing on it.
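To illustrate why the conversion matters, here is a minimal sketch (the names X_list, mask and selected are only for illustration): a boolean mask cannot index a plain Python list, but it does select the matching elements of an np.array.

```python
import numpy as np

y = np.array([0, 1, 2])
X_list = [2, 2, 2]
mask = y == 0  # boolean array: [True, False, False]

# A plain list does not understand boolean-array indexing.
try:
    X_list[mask]
    list_indexing_works = True
except TypeError:
    list_indexing_works = False

# After conversion, the mask picks out the elements where y == 0.
X = np.asarray(X_list)
selected = X[mask]
print(list_indexing_works, selected)
```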
At this point, I'd wager that it is fast enough for my purposes. However, you can stop earlier or go beyond this point, depending on your requirements.
On my computer, I get the following results:
y = np.random.random_integers(0, 10, size=(10000,))
X = np.random.random_integers(0, 10, size=(10000,))
thresholds = np.cumsum(np.random.rand(10))
%timeit func(y, X, thresholds) # Original
1 loops, best of 3: 1.51 s per loop
%timeit func1(y, X, thresholds) # Removed for-loop
1 loops, best of 3: 1.46 s per loop
%timeit func2(y, X, thresholds) # Standardized if statements
1 loops, best of 3: 1.5 s per loop
%timeit func3(y, X, thresholds) # Vectorized ~ 500x improvement
100 loops, best of 3: 2.74 ms per loop
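Before trusting timings like these, it is worth checking that the vectorized version still returns the same value as the loop. The following is only a sketch under stated assumptions: log_cdf and cdf are hand-written numpy stand-ins for scipy's logistic.logcdf and logistic.cdf (so the check needs nothing but numpy), and loglik_loop / loglik_vec are hypothetical names mirroring func2 and func3.

```python
import numpy as np

def log_cdf(z):
    # log(1 / (1 + exp(-z))), written with logaddexp for numerical stability
    return -np.logaddexp(0.0, -z)

def cdf(z):
    return np.exp(log_cdf(z))

def loglik_loop(y, X, thresholds, floor=1e-5):
    # Row-by-row version, same branching as the loop-based functions above.
    ll = 0.0
    for yi, xi in zip(y, X):
        if yi == 0:
            ll += log_cdf(thresholds[0] - xi)
        elif yi == len(thresholds):
            ll += log_cdf(xi - thresholds[-1])
        else:
            diff = cdf(thresholds[yi] - xi) - cdf(thresholds[yi - 1] - xi)
            ll += np.log(max(floor, diff))
    return ll

def loglik_vec(y, X, thresholds, floor=1e-5):
    # Mask-based version: one boolean mask per branch.
    first = y == 0
    last = y == len(thresholds)
    rest = ~(first | last)
    diff = cdf(thresholds[y[rest]] - X[rest]) - cdf(thresholds[y[rest] - 1] - X[rest])
    return (log_cdf(thresholds[0] - X[first]).sum()
            + log_cdf(X[last] - thresholds[-1]).sum()
            + np.log(np.maximum(floor, diff)).sum())

rng = np.random.default_rng(0)
y = rng.integers(0, 11, size=1000)
X = rng.integers(0, 11, size=1000).astype(float)
thresholds = np.cumsum(rng.random(10))
print(np.isclose(loglik_loop(y, X, thresholds), loglik_vec(y, X, thresholds)))
```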
Is there a faster way to compute the following matrix (from this paper), given an n x n matrix M and an n-vector X?
[formula image not preserved]
I currently compute it as follows:
import math
import numpy as np
# M, X are given as numpy arrays
n = X.shape[0]
G = np.zeros((n,n))
for i in range(0,n):
    for j in range(i,n):
        xi = X[i]
        if i == j:
            G[i,j] = abs(xi)
        else:
            xi2 = xi*xi
            xj = X[j]
            xj2 = xj*xj
            mij = M[i,j]
            mid = (xi2 - xj2)/mij
            top = mij*mij + mid*mid + 2*xi2 + 2*xj2
            G[i,j] = math.sqrt(top)/2
This is very slow, but I suspect there is a nicer "numpythonic" way of doing this instead of looping...
EDIT: While all answers work and are much faster than my naive implementation, I chose the one I benchmarked to be the fastest. Thanks!
Quite straightforward actually.
import math
import numpy as np
n = 5
M = np.random.rand(n, n)
X = np.random.rand(n)
Your code and result:
G = np.zeros((n,n))
for i in range(0,n):
    for j in range(i,n):
        xi = X[i]
        if i == j:
            G[i,j] = abs(xi)
        else:
            xi2 = xi*xi
            xj = X[j]
            xj2 = xj*xj
            mij = M[i,j]
            mid = (xi2 - xj2)/mij
            top = mij*mij + mid*mid + 2*xi2 + 2*xj2
            G[i,j] = math.sqrt(top)/2
array([[0.77847813, 5.26334534, 0.8794082 , 0.7785694 , 0.95799072],
[0. , 0.15662266, 0.88085031, 0.47955479, 0.99219171],
[0. , 0. , 0.87699707, 8.92340836, 1.50053712],
[0. , 0. , 0. , 0.45608367, 0.95902308],
[0. , 0. , 0. , 0. , 0.95774452]])
Using broadcasting:
temp = M**2 + ((X[:, None]**2 - X[None, :]**2) / M)**2 + 2 * (X[:, None]**2) + 2 * (X[None, :]**2)
G = np.sqrt(temp) / 2
array([[0.8284724 , 5.26334534, 0.8794082 , 0.7785694 , 0.95799072],
[0.89251217, 0.25682736, 0.88085031, 0.47955479, 0.99219171],
[0.90047282, 1.10306597, 0.95176428, 8.92340836, 1.50053712],
[0.85131766, 0.47379576, 0.87723514, 0.55013345, 0.95902308],
[0.9879939 , 1.46462011, 0.99516443, 0.95774481, 1.02135642]])
Note that you did not use the formula directly for diagonal elements and only computed for upper triangular region of G. I simply implemented the formula to calculate all G[i, j].
Note: If diagonal elements of M don't matter and they contain some zeros, just add some offset to avoid the divide by zero error like:
M[np.arange(n), np.arange(n)] += 1e-5
# Do calculation to get G
# Assign diagonal to X
G[np.arange(n), np.arange(n)] = abs(X)
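If the goal is to reproduce the loop's output exactly (upper triangle only, abs(X) on the diagonal), the broadcast result can be masked afterwards. This is a sketch, not the only way to do it; g_loop and g_broadcast are hypothetical helper names, and np.triu / np.fill_diagonal are used instead of the offset trick.

```python
import math
import numpy as np

def g_loop(M, X):
    # Reference: the nested-loop computation from the question.
    n = X.shape[0]
    G = np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):
            if i == j:
                G[i, j] = abs(X[i])
            else:
                mid = (X[i] ** 2 - X[j] ** 2) / M[i, j]
                top = M[i, j] ** 2 + mid ** 2 + 2 * X[i] ** 2 + 2 * X[j] ** 2
                G[i, j] = math.sqrt(top) / 2
    return G

def g_broadcast(M, X):
    xi2 = X[:, None] ** 2  # column: x_i^2
    xj2 = X[None, :] ** 2  # row:    x_j^2
    # Zeros on the diagonal of M only yield nan in entries that are overwritten below.
    with np.errstate(divide="ignore", invalid="ignore"):
        full = np.sqrt(M ** 2 + ((xi2 - xj2) / M) ** 2 + 2 * xi2 + 2 * xj2) / 2
    G = np.triu(full, k=1)          # keep the strict upper triangle, zero the rest
    np.fill_diagonal(G, np.abs(X))  # the diagonal is |x_i|, as in the loop
    return G

rng = np.random.default_rng(0)
M = rng.random((6, 6)) + 0.1
X = rng.random(6)
print(np.allclose(g_loop(M, X), g_broadcast(M, X)))
```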
First, your function does not match your equation. This line
mid = (xi2 - xj2)/mij
should be
mid = (xi - xj)/mij
Second, here is your equation implemented with numpy.
Generate test data
test_m = np.array(
[
[1, 2, 3, 4, 5],
[1, 2, 3, 4, 5],
[1, 2, 3, 4, 5],
[1, 2, 3, 4, 5],
[1, 2, 3, 4, 5],
]
)
test_x = np.array([5, 6, 7, 8, 9])
Build the function
def solve(m, x):
    x_size = x.shape[0]
    x = x.reshape(1, -1)
    reshaped_x = x.reshape(-1, 1)
    result = np.sqrt(
        m ** 2
        + ((reshaped_x - x) / m) ** 2
        + 2 * np.repeat(reshaped_x, x_size, axis=1) ** 2
        + 2 * np.repeat(x, x_size, axis=0) ** 2
    ) / 2
    return result
run
print(solve(test_m, test_x))
In fact, the result expression can be simplified, because broadcasting makes np.repeat unnecessary:
result = np.sqrt(
m ** 2
+ ((reshaped_x - x) / m) ** 2
+ 2 * reshaped_x ** 2
+ 2 * x ** 2
) / 2
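As a quick sanity check of that simplification, the sketch below (variable names row, col, explicit and broadcast are illustrative) compares the np.repeat terms with their broadcast counterparts on the test vector used above.

```python
import numpy as np

x = np.array([5, 6, 7, 8, 9])
n = x.shape[0]
row = x.reshape(1, -1)  # shape (1, n)
col = x.reshape(-1, 1)  # shape (n, 1)

# Explicitly tiled (n, n) arrays versus letting numpy broadcast (n, 1) + (1, n).
explicit = 2 * np.repeat(col, n, axis=1) ** 2 + 2 * np.repeat(row, n, axis=0) ** 2
broadcast = 2 * col ** 2 + 2 * row ** 2
print(np.array_equal(explicit, broadcast))
```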
Tested with Google Colab:
import numba
import numpy as np
import math
# your implementation
def bench_1(n):
    # M, X are given as numpy arrays
    G = np.zeros((n,n))
    M = np.random.rand(n, n)
    X = np.random.rand(n)
    for i in range(0,n):
        for j in range(i,n):
            xi = X[i]
            if i == j:
                G[i,j] = abs(xi)
            else:
                xi2 = xi*xi
                xj = X[j]
                xj2 = xj*xj
                mij = M[i,j]
                mid = (xi2 - xj2)/mij
                top = mij*mij + mid*mid + 2*xi2 + 2*xj2
                G[i,j] = math.sqrt(top)/2
    return G
%%timeit
n = 1000
bench_1(n)
1 loop, best of 3: 1.61 s per loop
Using Numba to compile the function:
@numba.jit(nopython=True, parallel=True)
def bench_2(n):
    # M, X are given as numpy arrays
    G = np.zeros((n,n))
    M = np.random.rand(n, n)
    X = np.random.rand(n)
    for i in range(0,n):
        for j in range(i,n):
            xi = X[i]
            if i == j:
                G[i,j] = abs(xi)
            else:
                xi2 = xi*xi
                xj = X[j]
                xj2 = xj*xj
                mij = M[i,j]
                mid = (xi2 - xj2)/mij
                top = mij*mij + mid*mid + 2*xi2 + 2*xj2
                G[i,j] = math.sqrt(top)/2
    return G
%%timeit
n = 1000
bench_2(n)
The slowest run took 88.13 times longer than the fastest. This could mean that an intermediate result is being cached.
1 loop, best of 3: 9.8 ms per loop
Let's say I have an array sig:
sig = np.array([1,2,3,4,5])
and another array k, which consists of indexes:
k = np.array([1,2,0,4])
I want to compute an array that interpolates between sig[k[i]-1] and sig[k[i]], but only if k[i] != 0 and k[i] != len(k), i.e.
p=2
result = np.zeros(len(k))
for i in range(len(k)):
    if(k[i] == 0):
        result[i] = sig[k[i]]
    elif(k[i] == len(k)):
        result[i] = sig[k[i] -1]
    else:
        result[i] = sig[k[i] -1] + (sig[k[i]] - sig[k[i]-1])*(p - k[i-1])/(k[i] - k[i-1])
How do I vectorize this, without looping over len(k)?
Expected: result = array([1.66666667, 3, 1, 4])
For k = 0 and k = 4 no interpolation takes place; the values sig[0] and sig[3] are returned, respectively.
For a (very) limited number of cases like here, one approach to vectorizing such code is to build a linear combination of the cases and their corresponding calculations.
So, set up vectors
alpha = (k == 0) to match the first case,
beta = (k > 0) to match the second case, and
gamma = (k < len(k)) to match the third case.
Then, build up a proper linear combination like:
alpha * sig[k] + beta * sig[k-1] + gamma * (sig[k] - sig[k-1]) * (p - np.roll(k, 1)) / (k - np.roll(k, 1))
Note that, because of the way beta and gamma are set up above, the calculations of the second and third cases are combined. Also, we need np.roll here to get the proper k[i-1].
The final solution, minimized to a one-liner, looks like this:
import numpy as np
# Inputs
sig = np.array([1, 2, 3, 4, 5])
k = np.array([1, 2, 0, 4])
p = 2
# Original solution using loop
result = np.zeros(len(k))
for i in range(len(k)):
    if(k[i] == 0):
        result[i] = sig[k[i]]
    elif(k[i] == len(k)):
        result[i] = sig[k[i] -1]
    else:
        result[i] = sig[k[i] -1] + (sig[k[i]] - sig[k[i]-1])*(p - k[i-1])/(k[i] - k[i-1])
# Vectorized solution
res = (k == 0) * sig[k] + (k > 0) * sig[k-1] + (k < len(k)) * (sig[k] - sig[k-1]) * (p - np.roll(k, 1)) / (k - np.roll(k, 1))
# Outputs
print('Original solution using loop:\n ', result)
print('Vectorized solution:\n ', res)
The outputs are identical:
Original solution using loop:
[1.66666667 3. 1. 4. ]
Vectorized solution:
[1.66666667 3. 1. 4. ]
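Note that in the one-liner the masks overlap: for k == 0 the interpolation term is still active, and it only vanishes in the sample because p - k[i-1] happens to be 0 there. As an alternative sketch (not the approach above, and assuming np.select is acceptable), the three branches can be kept mutually exclusive; k_prev, interp and res are illustrative names.

```python
import numpy as np

sig = np.array([1, 2, 3, 4, 5])
k = np.array([1, 2, 0, 4])
p = 2

k_prev = np.roll(k, 1)  # k[i-1]; wraps around for i == 0, just like the loop does

# Evaluate the interpolation everywhere; entries for the special cases are discarded below.
with np.errstate(divide="ignore", invalid="ignore"):
    interp = sig[k - 1] + (sig[k] - sig[k - 1]) * (p - k_prev) / (k - k_prev)

# The first matching condition wins; everything else falls back to the interpolation.
res = np.select(
    [k == 0, k == len(k)],
    [sig[k].astype(float), sig[k - 1].astype(float)],
    default=interp,
)
print(res)
```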
Hope that helps!
I'm currently trying to calculate the sum of the sums of all sub-squares in a 10.000 x 10.000 array of values. As an example, if my array was:
1 1 1
2 2 2
3 3 3
I want the result to be :
1+1+1+2+2+2+3+3+3 [sum of squares of size 1]
+(1+1+2+2)+(1+1+2+2)+(2+2+3+3)+(2+2+3+3) [sum of squares of size 2]
+(1+1+1+2+2+2+3+3+3) [sum of squares of size 3]
________________________________________
68
So, as a first try, I wrote very simple Python code to do that. As it ran in O(k^2.n^2) (n being the size of the big array and k the size of the sub-squares), the processing was awfully slow. I wrote another algorithm in O(n^2) to speed it up:
import numpy
def getSum(tab,size):
    n = len(tab)
    tmp = numpy.zeros((n,n))
    # column-wise sliding sums of height `size`
    for i in xrange(0,n):
        sum = 0
        for j in xrange(0,size):
            sum += tab[j][i]
        tmp[0][i] = sum
        for j in xrange(1,n-size+1):
            sum += (tab[j+size-1][i] - tab[j-1][i])
            tmp[j][i] = sum
    # row-wise sliding sums of width `size` over the column sums
    finalsum = 0
    for i in xrange(0,n-size+1):
        sum = 0
        for j in xrange(0,size):
            sum += tmp[i][j]
        finalsum += sum
        for j in xrange(1,n-size+1):
            sum += (tmp[i][j+size-1] - tmp[i][j-1])
            finalsum += sum
    return finalsum
So this code works fine. Given an array and a sub-square size, it returns the sum of the values in all sub-squares of that size. I then iterate over the sub-square sizes to cover all possible values.
The problem is that this is again way too slow for big arrays (over 20 days for a 10.000 x 10.000 array). I googled it and learned that I could vectorize the iterations over arrays with numpy. However, I couldn't figure out how to do that in my case...
If someone can help me speed up my algorithm, or point me to good documentation on the subject, I'll be glad!
Thank you !
Following the excellent idea of @Divakar, I would suggest using integral images to speed up convolutions. If the matrix is very big, you have to convolve it several times (once for each kernel size). Several convolutions (or evaluations of sums inside a square) can be computed very efficiently using integral images (aka summed area tables).
Once an integral image M is computed, the sum of all values inside a region (x0, y0) - (x1, y1) can be computed from just four lookups, regardless of the size of the window (picture from Wikipedia):
M[x1, y1] - M[x1, y0] - M[x0, y1] + M[x0, y0]
This can be very easily vectorized in numpy. An integral image can be calculated with cumsum. Following the example:
tab = np.array([[1, 1, 1], [2, 2, 2], [3, 3, 3]])
M = tab.cumsum(0).cumsum(1) # Create integral images
M = np.pad(M, ((1,0), (1,0)), mode='constant') # pad it with a row and column of zeros
M is padded with a row and a column of zeros to handle the first row (where x0 = 0 or y0 = 0).
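As a small illustration of the four-lookup formula on the padded integral image, here is a sketch for a single window; window_sum and its half-open index convention (rows r0:r1, columns c0:c1) are assumptions made for this example only.

```python
import numpy as np

tab = np.array([[1, 1, 1], [2, 2, 2], [3, 3, 3]])
M = np.pad(tab.cumsum(0).cumsum(1), ((1, 0), (1, 0)), mode="constant")

def window_sum(M, r0, c0, r1, c1):
    # Sum of tab[r0:r1, c0:c1], read from four entries of the padded integral image.
    return M[r1, c1] - M[r0, c1] - M[r1, c0] + M[r0, c0]

# Bottom-right 2x2 window: [[2, 2], [3, 3]]
print(window_sum(M, 1, 1, 3, 3), tab[1:3, 1:3].sum())
```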
Then, given a window size W, the sum of EVERY window of size W can be computed efficiently and fully vectorized with numpy as:
all_sums = M[W:, W:] - M[:-W, W:] - M[W:, :-W] + M[:-W, :-W]
Note that the vectorized operation above, calculates the sum of every window, i.e. every A, B, C, and D of the matrix. The sum of all windows is then calculated as
total = all_sums.sum()
Note that, unlike with convolutions, the integral image has to be computed only once for N different sizes, so the code can be written very efficiently as:
def get_all_sums(A):
    M = A.cumsum(0).cumsum(1)
    M = np.pad(M, ((1,0), (1,0)), mode='constant')
    total = 0
    for W in range(1, A.shape[0] + 1):
        tmp = M[W:, W:] + M[:-W, :-W] - M[:-W, W:] - M[W:, :-W]
        total += tmp.sum()
    return total
The output for the example:
>>> get_all_sums(tab)
68
Some timings comparing convolutions to integral images for matrices of different sizes. getAllSums refers to Divakar's convolutional method, while get_all_sums refers to the integral-image-based method described above:
>>> R1 = np.random.randn(10, 10)
>>> R2 = np.random.randn(100, 100)
1) With R1 10x10 matrix:
>>> %time getAllSums(R1)
CPU times: user 353 µs, sys: 9 µs, total: 362 µs
Wall time: 335 µs
2393.5912717342017
>>> %time get_all_sums(R1)
CPU times: user 243 µs, sys: 0 ns, total: 243 µs
Wall time: 248 µs
2393.5912717342012
2) With R2 100x100 matrix:
>>> %time getAllSums(R2)
CPU times: user 698 ms, sys: 0 ns, total: 698 ms
Wall time: 701 ms
176299803.29826894
>>> %time get_all_sums(R2)
CPU times: user 2.51 ms, sys: 0 ns, total: 2.51 ms
Wall time: 2.47 ms
176299803.29826882
Note that using integral images is roughly 300 times faster than convolutions for the 100x100 matrix (701 ms vs. 2.47 ms).
Those sliding summations are best computed as 2D convolutions, which can be calculated efficiently with scipy's convolve2d. Thus, for a specific size, you could get the summations like so -
import numpy as np
from scipy.signal import convolve2d
def getSum(tab,size):
    # Define kernel and perform convolution to get such sliding windowed summations
    kernel = np.ones((size,size),dtype=tab.dtype)
    return convolve2d(tab, kernel, mode='valid').sum()
To get summations across all sizes, I think the best way, in terms of both memory and performance, is to loop over all possible sizes. Thus, to get the final summation, you would have -
def getAllSums(tab):
    finalSum = 0
    for i in range(tab.shape[0]):
        finalSum += getSum(tab,i+1)
    return finalSum
Sample run -
In [51]: tab
Out[51]:
array([[1, 1, 1],
[2, 2, 2],
[3, 3, 3]])
In [52]: getSum(tab,1) # sum of squares of size 1
Out[52]: 18
In [53]: getSum(tab,2) # sum of squares of size 2
Out[53]: 32
In [54]: getSum(tab,3) # sum of squares of size 3
Out[54]: 18
In [55]: getAllSums(tab) # sum of squares of all sizes
Out[55]: 68
Based on the idea of counting how many times each number is counted, I came up with this simple code:
def get_sum(matrix, n):
    ret = 0
    for i in range(n):
        for j in range(n):
            for k in range(1, n + 1):
                # k is the square size. count is the number of times the element is counted.
                count = min(k, n - k + 1, i + 1, n - i) * min(k, n - k + 1, j + 1, n - j)
                ret += count * matrix[i][j]
    return ret
a = [[1, 1, 1], [2, 2, 2], [3, 3, 3]]
print get_sum(a, 3) # 68
Divakar's solution is fantastic; however, I think mine could be more efficient, at least in asymptotic time complexity (O(n^3) compared with Divakar's O(n^3 log n)).
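The same counting idea can also be expressed with numpy arrays. The sketch below is an illustration rather than part of the answer: get_sum_weights is a hypothetical name, C[k-1, i] holds the per-axis factor min(k, n - k + 1, i + 1, n - i), and the weight of element (i, j) is the sum over k of C[k-1, i] * C[k-1, j]. It is still O(n^3) because of the matrix product, but the loops run inside numpy.

```python
import numpy as np

def get_sum_weights(A):
    A = np.asarray(A)
    n = A.shape[0]
    k = np.arange(1, n + 1)[:, None]  # square sizes, as a column
    i = np.arange(n)[None, :]         # positions along one axis, as a row
    # Per-axis count factor for every (square size, position) pair.
    C = np.minimum(np.minimum(k, n - k + 1), np.minimum(i + 1, n - i))
    # W[i, j] = number of squares (over all sizes) that contain element (i, j).
    W = C.T @ C
    return (W * A).sum()

print(get_sum_weights([[1, 1, 1], [2, 2, 2], [3, 3, 3]]))
```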
I now have an O(n^2) solution...
Basically, we can get that:
def get_sum2(matrix, n):
    ret = 0
    for i in range(n):
        for j in range(n):
            x = min(i + 1, n - i)
            y = min(j + 1, n - j)
            # k <= half
            half = (n + 1) // 2
            for k in range(1, half + 1):
                count = min(k, x) * min(k, y)
                ret += count * matrix[i][j]
            # k > half
            for k in range(half + 1, n + 1):
                count = min(n + 1 - k, x) * min(n + 1 - k, y)
                ret += count * matrix[i][j]
    return ret
You can see that sum(min(k, x) * min(k, y)) can be calculated in O(1) when 1 <= k <= n/2.
So we arrive at this O(n^2) code:
def get_square_sum(n):
    return n * (n + 1) * (2 * n + 1) // 6
def get_linear_sum(a, b):
    return (b - a + 1) * (a + b) // 2
def get_count(x, y, k_end):
    # k <= min(x, y), count is k*k
    sum1 = get_square_sum(min(x, y))
    # k > min(x, y) and k <= max(x, y), count is k * min(x, y)
    sum2 = get_linear_sum(min(x, y) + 1, max(x, y)) * min(x, y)
    # k > max(x, y), count is x * y
    sum3 = x * y * (k_end - max(x, y))
    return sum1 + sum2 + sum3
def get_sum3(matrix, n):
    ret = 0
    half = n // 2
    for i in range(n):
        for j in range(n):
            x = min(i + 1, n - i)
            y = min(j + 1, n - j)
            # k <= half
            ret += get_count(x, y, half) * matrix[i][j]
            # k > half: the remaining n - half sizes, counted via n + 1 - k
            ret += get_count(x, y, n - half) * matrix[i][j]
    return ret
Test:
a = [[1, 1, 1], [2, 2, 2], [3, 3, 3]]
n = 1000
b = [[1] * n] * n
print get_sum3(a, 3) # 68
print get_sum3(b, n) # 33500333666800
You can port my O(n^2) Python code to C, and I believe the result will be a very efficient solution...
I am numerically solving for x(t) for a system of first order differential equations. The system is:
dx/dt = y
dy/dt = -x - a*y(x^2 + y^2 -1)
I have implemented the Forward Euler method to solve this problem as follows:
def forward_euler():
    h = 0.01
    num_steps = 10000
    x = np.zeros([num_steps + 1, 2]) # steps, number of solutions
    y = np.zeros([num_steps + 1, 2])
    a = 1.
    x[0, 0] = 10. # initial condition 1st solution
    y[0, 0] = 5.
    x[0, 1] = 0. # initial condition 2nd solution
    y[0, 1] = 0.0000000001
    for step in xrange(num_steps):
        x[step + 1] = x[step] + h * y[step]
        y[step + 1] = y[step] + h * (-x[step] - a * y[step] * (x[step] ** 2 + y[step] ** 2 - 1))
    return x, y
Now I would like to vectorize the code further and keep x and y in the same array. I have come up with the following solution:
def forward_euler_vector():
    num_steps = 10000
    h = 0.01
    x = np.zeros([num_steps + 1, 2, 2]) # steps, variables, number of solutions
    a = 1.
    x[0, 0, 0] = 10. # initial conditions 1st solution
    x[0, 1, 0] = 5.
    x[0, 0, 1] = 0. # initial conditions 2nd solution
    x[0, 1, 1] = 0.0000000001
    def f(x):
        return np.array([x[1],
                         -x[0] - a * x[1] * (x[0] ** 2 + x[1] ** 2 - 1)])
    for step in xrange(num_steps):
        x[step + 1] = x[step] + h * f(x[step])
    return x
The question: forward_euler_vector() works, but is this the best way to vectorize it? I am asking because the vectorized version runs about 20 ms slower on my laptop:
In [27]: %timeit forward_euler()
1 loops, best of 3: 301 ms per loop
In [65]: %timeit forward_euler_vector()
1 loops, best of 3: 320 ms per loop
There is always the trivial autojit solution:
def forward_euler(initial_x, initial_y, num_steps, h):
    x = np.zeros([num_steps + 1, 2]) # steps, number of solutions
    y = np.zeros([num_steps + 1, 2])
    a = 1.
    x[0, 0] = initial_x[0] # initial condition 1st solution
    y[0, 0] = initial_y[0]
    x[0, 1] = initial_x[1] # initial condition 2nd solution
    y[0, 1] = initial_y[1]
    for step in xrange(int(num_steps)):
        x[step + 1] = x[step] + h * y[step]
        y[step + 1] = y[step] + h * (-x[step] - a * y[step] * (x[step] ** 2 + y[step] ** 2 - 1))
    return x, y
Timings:
from numba import autojit
jit_forward_euler = autojit(forward_euler)
%timeit forward_euler([10,0], [5,0.0000000001], 1E4, 0.01)
1 loops, best of 3: 385 ms per loop
%timeit jit_forward_euler([10,0], [5,0.0000000001], 1E4, 0.01)
100 loops, best of 3: 3.51 ms per loop
@Ophion's comment explains very well what's going on. The call to array() within f(x) introduces some overhead that kills the benefit of the vectorized multiplication in the expression h * f(x[step]).
And as he says, you may be interested in having a look at scipy.integrate for a nice set of numerical integrators.
To solve the problem at hand of vectorising your code, you want to avoid recreating the array every time you call f. You would like to initialize the array once, and return it modified at every call. This is similar to what a static variable is in C/C++.
You can achieve this with a mutable default argument, which is evaluated once, at the time the function f(x) is defined, and which has local scope. Since it has to be mutable, you encapsulate it in a list of a single element:
def f(x, static_tmp=[np.empty((2,2))]):
    # static_tmp is created once, at definition time, and reused on every call
    static_tmp[0][0]=x[1]
    static_tmp[0][1]=-x[0] - a * x[1] * (x[0] ** 2 + x[1] ** 2 - 1)
    return static_tmp[0]
With this modification to your code, the overhead of array creation disappears, and on my machine I gain a small improvement:
%timeit forward_euler() #258ms
%timeit forward_euler_vector() #248ms
This means that the gain of optimizing matrix multiplication with numpy is quite small, at least on the problem at hand.
You may want to get rid of the function f straight away as well, doing its operations within the for loop, getting rid of the call overhead. This trick of the default argument can however be applied also with scipy more general time integrators, where you must provide a function f.
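A minimal sketch of that last suggestion, with the right-hand side written directly inside the loop so that no helper function or temporary np.array call is needed. The function name forward_euler_inline, its parameters and the state layout (steps, variables, solutions) are assumptions made for this illustration.

```python
import numpy as np

def forward_euler_inline(x0, y0, num_steps=10000, h=0.01, a=1.0):
    x0 = np.asarray(x0, dtype=float)
    y0 = np.asarray(y0, dtype=float)
    # state[step, 0] holds x and state[step, 1] holds y, one column per solution
    state = np.zeros((num_steps + 1, 2, x0.size))
    state[0, 0] = x0
    state[0, 1] = y0
    for step in range(num_steps):
        x, y = state[step]  # views into the current step, no copies
        state[step + 1, 0] = x + h * y
        state[step + 1, 1] = y + h * (-x - a * y * (x ** 2 + y ** 2 - 1))
    return state

traj = forward_euler_inline([10.0, 0.0], [5.0, 1e-10], num_steps=3)
print(traj[1])
```

Whether this is actually faster than the versions above would need to be timed on the machine in question.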
EDIT: as pointed out by Jaime, another way to go is to treat static_tmp as an attribute of the function f, and to create it after having declared the function but before calling it:
def f(x):
    f.static_tmp[0]=x[1]
    f.static_tmp[1]=-x[0] - a * x[1] * (x[0] ** 2 + x[1] ** 2 - 1)
    return f.static_tmp
f.static_tmp=np.empty((2,2))
My Setup: Python 2.7.4.1, Numpy MKL 1.7.1, Windows 7 x64, WinPython
Context:
I tried to implement the Sequential Minimal Optimization algorithm for solving SVM. I use maximal violating pair approach.
The problem:
In the working set selection procedure I want to find the maximum value of the gradient, and its index, among the elements that meet a condition: y[i]*alpha[i] < 0 (for y[i] = -1) or y[i]*alpha[i] < C (for y[i] = 1).
import numpy as np
#y - array of -1 and 1
y=np.array([-1,1,1,1,-1,1])
#alpha- array of floats in range [0,C]
alpha=np.array([0.4,0.1,1.33,0,0.9,0])
#grad - array of floats
grad=np.array([-1,-1,-0.2,-0.4,0.4,0.2])
GMaxI=float('-inf')
GMax_idx=-1
n=alpha.shape[0] #usually n=100000
C=4
B=[0,0,C]
for i in xrange(0,n):
    yi=y[i] #-1 or 1
    alpha_i=alpha[i]
    if (yi * alpha_i< B[yi+1]): # B[-1+1]=0 B[1+1]=C
        if( -yi*grad[i]>=GMaxI):
            GMaxI= -yi*grad[i]
            GMax_idx = i
This procedure is called many times (~50000) and the profiler shows that it is the bottleneck.
Is it possible to vectorize this code?
Edit 1:
Add some small exemplary data
Edit 2:
I have checked the solutions proposed by hwlau, larsmans and Mr E. Only the solution proposed by Mr E is correct. Below is sample code with all three answers:
import numpy as np
y=np.array([ -1, -1, -1, -1, -1, -1, -1, -1])
alpha=np.array([0, 0.9, 0.4, 0.1, 1.33, 0, 0.9, 0])
grad=np.array([-3, -0.5, -1, -1, -0.2, -4, -0.4, -0.3])
C=4
B=np.array([0,0,C])
#hwlau - wrong index and value
filter = (y*alpha < C*0.5*(y+1)).astype('float')
GMax_idx = (filter*(-y*grad)).argmax()
GMax = -y[GMax_idx]*grad[GMax_idx]
print GMax_idx,GMax
#larsmans - wrong index
neg_y_grad = (-y * grad)[y * alpha < B[y + 1]]
GMaxI = np.max(neg_y_grad)
GMax_ind = np.argmax(neg_y_grad)
print GMax_ind,GMaxI
#Mr E - correct result
BY = np.take(B, y+1)
valid_mask = (y * alpha < BY)
values = -y * grad
values[~valid_mask] = np.min(values) - 1.0
GMaxI = values.max()
GMax_idx = values.argmax()
print GMax_idx,GMaxI
Output (GMax_idx, GMaxI)
0 -3.0
3 -0.2
4 -0.2
Conclusions
After checking all solutions, the fastest one (2x-6x) is the solution proposed by @ali_m. However, it requires installing some Python packages: numba and all its prerequisites.
I had some trouble using numba with class methods, so I created global functions that are autojitted with numba. My solution looks something like this:
from numba import autojit
@autojit
def FindMaxMinGrad(A,B,alpha,grad,y):
    '''
    Finds i,j indices with the maximal violating pair scheme
    A,B - 3 dim arrays, contains bounds A=[-C,0,0], B=[0,0,C]
    alpha - array like, contains alpha coefficients
    grad - array like, gradient
    y - array like, labels
    '''
    GMaxI=-100000
    GMaxJ=-100000
    GMax_idx=-1
    GMin_idx=-1
    for i in range(0,alpha.shape[0]):
        if (y[i] * alpha[i]< B[y[i]+1]):
            if( -y[i]*grad[i]>GMaxI):
                GMaxI= -y[i]*grad[i]
                GMax_idx = i
        if (y[i] * alpha[i]> A[y[i]+1]):
            if( y[i]*grad[i]>GMaxJ):
                GMaxJ= y[i]*grad[i]
                GMin_idx = i
    return (GMaxI,GMaxJ,GMax_idx,GMin_idx)
class SVM(object):
    def working_set(self,....):
        FindMaxMinGrad(.....)
You can probably do quite a lot better than plain vectorization if you use numba to JIT-compile your original loop-based code.
import numpy as np
from numba import autojit
@autojit
def jit_max_grad(y, alpha, grad, B):
    maxgrad = -np.inf
    maxind = -1
    for ii in xrange(alpha.shape[0]):
        if (y[ii] * alpha[ii] < B[y[ii] + 1]):
            g = -y[ii] * grad[ii]
            if g >= maxgrad:
                maxgrad = g
                maxind = ii
    return maxind, maxgrad
For comparison, here's Mr E's vectorized version:
def mr_e_max_grad(y, alpha, grad, B):
    BY = np.take(B, y+1)
    valid_mask = (y * alpha < BY)
    values = -y * grad
    values[~valid_mask] = np.min(values) - 1.0
    GMaxI = values.max()
    GMax_idx = values.argmax()
    return GMax_idx, GMaxI
Timing:
y = np.array([ -1, -1, -1, -1, -1, -1, -1, -1])
alpha = np.array([0, 0.9, 0.4, 0.1, 1.33, 0, 0.9, 0])
grad = np.array([-3, -0.5, -1, -1, -0.2, -4, -0.4, -0.3])
C = 4
B = np.array([0,0,C])
%timeit mr_e_max_grad(y, alpha, grad, B)
# 100000 loops, best of 3: 19.1 µs per loop
%timeit jit_max_grad(y, alpha, grad, B)
# 1000000 loops, best of 3: 1.07 µs per loop
Update: if you want to see what the timings look like on bigger arrays, it's easy to define a function that generates semi-realistic fake data based on your description in the question:
def make_fake(n, C=4):
    y = np.random.choice((-1, 1), n)
    alpha = np.random.rand(n) * C
    grad = np.random.randn(n)
    B = np.array([0,0,C])
    return y, alpha, grad, B
%%timeit y, alpha, grad, B = make_fake(100000, 4)
mr_e_max_grad(y, alpha, grad, B)
# 1000 loops, best of 3: 1.83 ms per loop
%%timeit y, alpha, grad, B = make_fake(100000, 4)
jit_max_grad(y, alpha, grad, B)
# 1000 loops, best of 3: 471 µs per loop
I think this is a fully vectorized version:
import numpy as np
#y - array of -1 and 1
y=np.array([-1,1,1,1,-1,1])
#alpha- array of floats in range [0,C]
alpha=np.array([0.4,0.1,1.33,0,0.9,0])
#grad - array of floats
grad=np.array([-1,-1,-0.2,-0.4,0.4,0.2])
#B - upper bounds, as in the question
C=4
B=np.array([0,0,C])
BY = np.take(B, y+1)
valid_mask = (y * alpha < BY)
values = -y * grad
values[~valid_mask] = np.min(values) - 1.0
GMaxI = values.max()
GMax_idx = values.argmax()
Here you go:
y=np.array([-1,1,1,1,-1,1])
alpha=np.array([0.4,0.1,1.33,0,0.9,0])
grad=np.array([-1,-1,-0.2,-0.4,0.4,0.2])
C=4
filter = (y*alpha < C*0.5*(y+1)).astype('float')
GMax_idx = (filter*(-y*grad)).argmax()
GMax = -y[GMax_idx]*grad[GMax_idx]
I have not benchmarked it, but it is purely numerical and vectorized, so it should be fast.
If you change B from a list to a NumPy array, you can at least vectorize the yi * alpha_i < B[yi+1] test and push the loop inwards:
GMaxI = float('-inf')
GMax_idx = -1
for i in np.where(y * alpha < B[y + 1])[0]:
    if -y[i] * grad[i] >= GMaxI:
        GMaxI = -y[i] * grad[i]
        GMax_idx = i
That should save a bit of time. Next up, you can vectorize -y[i] * grad[i]:
GMaxI = float('-inf')
GMax_idx = -1
neg_y_grad = -y * grad
for i in np.where(y * alpha < B[y + 1])[0]:
    if neg_y_grad[i] >= GMaxI:
        GMaxI = neg_y_grad[i]
        GMax_idx = i
Finally, we can vectorize away the entire loop by using max and argmax on -y * grad, filtered by y * alpha < B[y + 1]:
neg_y_grad = (-y * grad)
GMaxI = np.max(neg_y_grad[y * alpha < B[y + 1]])
GMax_idx = np.where(neg_y_grad == GMaxI)[0][0]
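One caveat with looking the maximum up again by value is that the match may land on an element that did not pass the filter, or on a different tie than the loop would pick. A possible variant, sketched here under the assumption that B is a NumPy array and y an integer array, masks invalid entries with -inf and takes the last maximum so that it mirrors the loop's >= comparison; masked_argmax is a hypothetical name.

```python
import numpy as np

def masked_argmax(y, alpha, grad, B):
    valid = y * alpha < B[y + 1]
    if not valid.any():
        return -1, -np.inf  # same defaults as the loop when nothing qualifies
    # Invalid entries become -inf so they can never win the maximum.
    values = np.where(valid, -y * grad, -np.inf)
    # argmax returns the first maximum; searching the reversed array yields the last one.
    idx = len(values) - 1 - int(np.argmax(values[::-1]))
    return idx, values[idx]

y = np.array([-1, -1, -1, -1, -1, -1, -1, -1])
alpha = np.array([0, 0.9, 0.4, 0.1, 1.33, 0, 0.9, 0])
grad = np.array([-3, -0.5, -1, -1, -0.2, -4, -0.4, -0.3])
C = 4
B = np.array([0, 0, C])
print(masked_argmax(y, alpha, grad, B))
```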