I'm currently trying to calculate the sum of the sums of all subsquares in a 10,000 x 10,000 array of values. As an example, if my array were:
1 1 1
2 2 2
3 3 3
I want the result to be:
1+1+1+2+2+2+3+3+3 [sum of squares of size 1]
+(1+1+2+2)+(1+1+2+2)+(2+2+3+3)+(2+2+3+3) [sum of squares of size 2]
+(1+1+1+2+2+2+3+3+3) [sum of squares of size 3]
________________________________________
68
So, as a first try, I wrote a very simple Python code to do that. As it ran in O(k^2 * n^2) (n being the size of the big array and k the size of the subsquares being summed), the processing was awfully long. I wrote another algorithm in O(n^2) to speed it up:
import numpy

def getSum(tab, size):
    n = len(tab)
    tmp = numpy.zeros((n, n))
    # Vertical sliding-window sums: tmp[j][i] holds the sum of
    # tab[j..j+size-1][i].
    for i in xrange(0, n):
        total = 0
        for j in xrange(0, size):
            total += tab[j][i]
        tmp[0][i] = total
        for j in xrange(1, n - size + 1):
            total += (tab[j + size - 1][i] - tab[j - 1][i])
            tmp[j][i] = total
    # Horizontal sliding-window sums over tmp yield the subsquare sums.
    finalsum = 0
    for i in xrange(0, n - size + 1):
        total = 0
        for j in xrange(0, size):
            total += tmp[i][j]
        finalsum += total
        for j in xrange(1, n - size + 1):
            finalsum += (tmp[i][j + size - 1] - tmp[i][j - 1])
    return finalsum
So this code works fine. Given an array and a size of subsquares, it will return the sum of the values in all subsquares of that size. I basically iterate over the subsquare sizes to get all the possible sums.
The problem is that this is again way too long for big arrays (over 20 days for a 10,000 x 10,000 array). I googled it and learned I could vectorize the iterations over arrays with numpy. However, I couldn't figure out how to do that in my case...
If someone can help me speed my algorithm up, or point me to good documentation on the subject, I'll be glad!
Thank you!
Following the excellent idea of @Divakar, I would suggest using integral images to speed up the convolutions. If the matrix is very big, you have to convolve it several times (once for each kernel size). Several convolutions (or evaluations of sums inside a square) can be computed very efficiently using integral images (a.k.a. summed area tables).
Once an integral image M is computed, the sum of all values inside any region (x0, y0) - (x1, y1) can be obtained with just 4 arithmetic operations, regardless of the window size (see the illustration on Wikipedia's summed-area table page):
M[x1, y1] - M[x1, y0] - M[x0, y1] + M[x0, y0]
This can be very easily vectorized in numpy, as an integral image can be calculated with cumsum. Following the example:
import numpy as np

tab = np.array([[1, 1, 1], [2, 2, 2], [3, 3, 3]])
M = tab.cumsum(0).cumsum(1)                       # create the integral image
M = np.pad(M, ((1, 0), (1, 0)), mode='constant')  # pad it with a row and column of zeros
M is padded with a leading row and column of zeros to handle windows that touch the first row or column (where x0 = 0 or y0 = 0).
Then, given a window size W, the sum of EVERY window of size W can be computed efficiently and fully vectorized with numpy as:
all_sums = M[W:, W:] - M[:-W, W:] - M[W:, :-W] + M[:-W, :-W]
Note that the vectorized operation above calculates the sum of every W x W window of the matrix at once. The sum of all those windows is then calculated as
total = all_sums.sum()
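Continuing the example, for W = 2 the window sums reproduce the size-2 squares from the question:
>>> W = 2
>>> all_sums = M[W:, W:] - M[:-W, W:] - M[W:, :-W] + M[:-W, :-W]
>>> all_sums
array([[ 6,  6],
       [10, 10]])
>>> all_sums.sum()
32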
Note that, unlike with convolutions, the integral image has to be computed only once for all N different sizes. The code can thus be written very efficiently as:
def get_all_sums(A):
    M = A.cumsum(0).cumsum(1)
    M = np.pad(M, ((1, 0), (1, 0)), mode='constant')
    total = 0
    for W in range(1, A.shape[0] + 1):
        tmp = M[W:, W:] + M[:-W, :-W] - M[:-W, W:] - M[W:, :-W]
        total += tmp.sum()
    return total
The output for the example:
>>> get_all_sums(tab)
68
Some timings comparing convolutions to integral images with matrices of different sizes. getAllSums refers to Divakar's convolution-based method, while get_all_sums refers to the integral image based method described above:
>>> R1 = np.random.randn(10, 10)
>>> R2 = np.random.randn(100, 100)
1) With R1 10x10 matrix:
>>> %time getAllSums(R1)
CPU times: user 353 µs, sys: 9 µs, total: 362 µs
Wall time: 335 µs
2393.5912717342017
>>> %time get_all_sums(R1)
CPU times: user 243 µs, sys: 0 ns, total: 243 µs
Wall time: 248 µs
2393.5912717342012
2) With R2 100x100 matrix:
>>> %time getAllSums(R2)
CPU times: user 698 ms, sys: 0 ns, total: 698 ms
Wall time: 701 ms
176299803.29826894
>>> %time get_all_sums(R2)
CPU times: user 2.51 ms, sys: 0 ns, total: 2.51 ms
Wall time: 2.47 ms
176299803.29826882
Note that using integral images is about 300 times faster than convolutions for large enough matrices.
Those sliding summations are best suited to be calculated as 2D convolutions, and those can be efficiently calculated with scipy's convolve2d. Thus, for a specific size, you could get the summations like so -
import numpy as np
from scipy.signal import convolve2d

def getSum(tab, size):
    # Define kernel and perform convolution to get sliding windowed summations
    kernel = np.ones((size, size), dtype=tab.dtype)
    return convolve2d(tab, kernel, mode='valid').sum()
To get the summations across all sizes, I think the best way, both in terms of memory and performance efficiency, would be to loop over all possible sizes. Thus, to get the final summation, you would have -
def getAllSums(tab):
    finalSum = 0
    for i in range(tab.shape[0]):
        finalSum += getSum(tab, i + 1)
    return finalSum
Sample run -
In [51]: tab
Out[51]:
array([[1, 1, 1],
       [2, 2, 2],
       [3, 3, 3]])
In [52]: getSum(tab,1) # sum of squares of size 1
Out[52]: 18
In [53]: getSum(tab,2) # sum of squares of size 2
Out[53]: 32
In [54]: getSum(tab,3) # sum of squares of size 3
Out[54]: 18
In [55]: getAllSums(tab) # sum of squares of all sizes
Out[55]: 68
Based on the idea of calculating how many times each number is counted, I came up with this simple code:
def get_sum(matrix, n):
    ret = 0
    for i in range(n):
        for j in range(n):
            for k in range(1, n + 1):
                # k is the square size; count is the number of times
                # matrix[i][j] is counted across all k x k squares.
                count = min(k, n - k + 1, i + 1, n - i) * min(k, n - k + 1, j + 1, n - j)
                ret += count * matrix[i][j]
    return ret

a = [[1, 1, 1], [2, 2, 2], [3, 3, 3]]
print get_sum(a, 3) # 68
Divakar's solution is fantastic; however, I think mine could be more efficient, at least in asymptotic time complexity (O(n^3) compared with Divakar's O(n^3 log n)).
I have an O(n^2) solution now...
Basically, we can observe that the count min(k, n + 1 - k, x) * min(k, n + 1 - k, y) simplifies on each half of the range of k, so we can split the loop over k at half = (n + 1) / 2:
def get_sum2(matrix, n):
    ret = 0
    for i in range(n):
        for j in range(n):
            x = min(i + 1, n - i)
            y = min(j + 1, n - j)
            half = (n + 1) // 2
            # k <= half
            for k in range(1, half + 1):
                count = min(k, x) * min(k, y)
                ret += count * matrix[i][j]
            # k > half
            for k in range(half + 1, n + 1):
                count = min(n + 1 - k, x) * min(n + 1 - k, y)
                ret += count * matrix[i][j]
    return ret
You can see that sum(min(k, x) * min(k, y)) over a range of k can be calculated in O(1) with closed-form formulas (a sum of squares plus an arithmetic series), since min(k, x) equals k up to x and stays constant afterwards.
So we arrive at this O(n^2) code:
def get_square_sum(n):
    # 1^2 + 2^2 + ... + n^2
    return n * (n + 1) * (2 * n + 1) / 6

def get_linear_sum(a, b):
    # a + (a + 1) + ... + b
    return (b - a + 1) * (a + b) / 2

def get_count(x, y, k_end):
    # Sum of min(k, x) * min(k, y) for k = 1..k_end.
    # k <= min(x, y): count is k * k
    sum1 = get_square_sum(min(x, y))
    # min(x, y) < k <= max(x, y): count is k * min(x, y)
    sum2 = get_linear_sum(min(x, y) + 1, max(x, y)) * min(x, y)
    # k > max(x, y): count is x * y
    sum3 = x * y * (k_end - max(x, y))
    return sum1 + sum2 + sum3

def get_sum3(matrix, n):
    ret = 0
    for i in range(n):
        for j in range(n):
            x = min(i + 1, n - i)
            y = min(j + 1, n - j)
            half = n // 2
            # k <= half: count is min(k, x) * min(k, y)
            ret += get_count(x, y, half) * matrix[i][j]
            # k > half: substituting k' = n + 1 - k maps k = half+1..n
            # onto k' = 1..n-half, so this is get_count(x, y, n - half)
            ret += get_count(x, y, n - half) * matrix[i][j]
    return ret
Test:
a = [[1, 1, 1], [2, 2, 2], [3, 3, 3]]
n = 1000
b = [[1] * n] * n
print get_sum3(a, 3) # 68
print get_sum3(b, n) # 33500333666800
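For extra confidence in the closed-form counting, here is a brute-force cross-check on small random matrices (a sketch; the naive nested loops are only feasible for tiny n):
import random

def brute_force(matrix, n):
    # Sum every k x k square the slow, obvious way.
    total = 0
    for size in range(1, n + 1):
        for i in range(n - size + 1):
            for j in range(n - size + 1):
                for di in range(size):
                    for dj in range(size):
                        total += matrix[i + di][j + dj]
    return total

for n in range(1, 8):
    m = [[random.randint(0, 9) for _ in range(n)] for _ in range(n)]
    assert get_sum3(m, n) == brute_force(m, n)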
You can rewrite my O(n^2) Python code in C and I believe it will result in a very efficient solution...
Related
Problem:
I am trying to increase the speed of an aerodynamics function in Python.
Function Set:
import numpy as np
from numba import njit


def calculate_velocity_induced_by_line_vortices(
    points, origins, terminations, strengths, collapse=True
):
    # Expand the dimensionality of the points input. It is now of shape (N x 1 x 3).
    # This will allow NumPy to broadcast the upcoming subtractions.
    points = np.expand_dims(points, axis=1)

    # Define the vectors from the vortex to the points. r_1 and r_2 now both are of
    # shape (N x M x 3). Each row/column pair holds the vector associated with each
    # point/vortex pair.
    r_1 = points - origins
    r_2 = points - terminations

    r_0 = r_1 - r_2
    r_1_cross_r_2 = nb_2d_explicit_cross(r_1, r_2)
    r_1_cross_r_2_absolute_magnitude = (
        r_1_cross_r_2[:, :, 0] ** 2
        + r_1_cross_r_2[:, :, 1] ** 2
        + r_1_cross_r_2[:, :, 2] ** 2
    )
    r_1_length = nb_2d_explicit_norm(r_1)
    r_2_length = nb_2d_explicit_norm(r_2)

    # Define the radius of the line vortices. This is used to get rid of any
    # singularities.
    radius = 3.0e-16

    # Set the lengths and the absolute magnitudes to zero, at the places where the
    # lengths and absolute magnitudes are less than the vortex radius.
    r_1_length[r_1_length < radius] = 0
    r_2_length[r_2_length < radius] = 0
    r_1_cross_r_2_absolute_magnitude[r_1_cross_r_2_absolute_magnitude < radius] = 0

    # Calculate the vector dot products.
    r_0_dot_r_1 = np.einsum("ijk,ijk->ij", r_0, r_1)
    r_0_dot_r_2 = np.einsum("ijk,ijk->ij", r_0, r_2)

    # Calculate k and then the induced velocity, ignoring any divide-by-zero or nan
    # errors. k is of shape (N x M).
    with np.errstate(divide="ignore", invalid="ignore"):
        k = (
            strengths
            / (4 * np.pi * r_1_cross_r_2_absolute_magnitude)
            * (r_0_dot_r_1 / r_1_length - r_0_dot_r_2 / r_2_length)
        )

        # Set the shape of k to be (N x M x 1) to support numpy broadcasting in the
        # subsequent multiplication.
        k = np.expand_dims(k, axis=2)

        induced_velocities = k * r_1_cross_r_2

    # Set the values of the induced velocity to zero where there are singularities.
    induced_velocities[np.isinf(induced_velocities)] = 0
    induced_velocities[np.isnan(induced_velocities)] = 0

    if collapse:
        induced_velocities = np.sum(induced_velocities, axis=1)

    return induced_velocities


@njit
def nb_2d_explicit_norm(vectors):
    return np.sqrt(
        (vectors[:, :, 0]) ** 2 + (vectors[:, :, 1]) ** 2 + (vectors[:, :, 2]) ** 2
    )


@njit
def nb_2d_explicit_cross(a, b):
    e = np.zeros_like(a)
    e[:, :, 0] = a[:, :, 1] * b[:, :, 2] - a[:, :, 2] * b[:, :, 1]
    e[:, :, 1] = a[:, :, 2] * b[:, :, 0] - a[:, :, 0] * b[:, :, 2]
    e[:, :, 2] = a[:, :, 0] * b[:, :, 1] - a[:, :, 1] * b[:, :, 0]
    return e
Context:
This function is used by Ptera Software, an open-source solver for flapping wing aerodynamics. Profiling shows that it is by far the largest contributor to Ptera Software's run time.
Currently, Ptera Software takes just over 3 minutes to run a typical case, and my goal is to get this below 1 minute.
The function takes in a group of points, origins, terminations, and strengths. At every point, it finds the induced velocity due to the line vortices, which are characterized by the groups of origins, terminations, and strengths. If collapse is true, then the output is the cumulative velocity induced at each point due to the vortices. If false, the function outputs each vortex's contribution to the velocity at each point.
During a typical run, the velocity function is called approximately 2000 times. At first, the calls involve vectors with relatively small input arguments (around 200 points, origins, terminations, and strengths). Later calls involve large input arguments (around 400 points and around 6,000 origins, terminations, and strengths). An ideal solution would be fast for all size inputs, but increasing the speed of large input calls is more important.
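For orientation, here is a quick shape check with small, made-up sizes (4 points and 6 vortices), illustrating the two output modes:
import numpy as np

points = np.random.random((4, 3))
origins = np.random.random((6, 3))
terminations = np.random.random((6, 3))
strengths = np.random.random(6)

v = calculate_velocity_induced_by_line_vortices(points, origins, terminations, strengths)
print(v.shape)      # (4, 3): one collapsed velocity per point

v_all = calculate_velocity_induced_by_line_vortices(points, origins, terminations, strengths, collapse=False)
print(v_all.shape)  # (4, 6, 3): per-point, per-vortex contributions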
For testing, I recommend running the following script with your own implementation of the function:
import timeit

import matplotlib.pyplot as plt
import numpy as np

n_repeat = 2
n_execute = 10 ** 3
min_oom = 0
max_oom = 3

times_py = []

for i in range(max_oom - min_oom + 1):
    n_elem = 10 ** i
    n_elem_pretty = np.format_float_scientific(n_elem, 0)
    print("Number of elements: " + n_elem_pretty)

    # Benchmark Python.
    print("\tBenchmarking Python...")
    setup = '''
import numpy as np

these_points = np.random.random((''' + str(n_elem) + ''', 3))
these_origins = np.random.random((''' + str(n_elem) + ''', 3))
these_terminations = np.random.random((''' + str(n_elem) + ''', 3))
these_strengths = np.random.random(''' + str(n_elem) + ''')

def calculate_velocity_induced_by_line_vortices(points, origins, terminations,
                                                strengths, collapse=True):
    pass
'''
    statement = '''
results_orig = calculate_velocity_induced_by_line_vortices(these_points, these_origins,
                                                           these_terminations,
                                                           these_strengths)
'''
    times = timeit.repeat(repeat=n_repeat, stmt=statement, setup=setup, number=n_execute)
    time_py = min(times) / n_execute
    time_py_pretty = np.format_float_scientific(time_py, 2)
    print("\t\tAverage Time per Loop: " + time_py_pretty + " s")

    # Record the times.
    times_py.append(time_py)

sizes = [10 ** i for i in range(max_oom - min_oom + 1)]

fig, ax = plt.subplots()

ax.plot(sizes, times_py, label='Python')
ax.set_xscale("log")
ax.set_xlabel("Size of List or Array (elements)")
ax.set_ylabel("Average Time per Loop (s)")
ax.set_title(
    "Comparison of Different Optimization Methods\nBest of "
    + str(n_repeat)
    + " Runs, each with "
    + str(n_execute)
    + " Loops"
)
ax.legend()
plt.show()
Previous Attempts:
My prior attempts at speeding up this function involved vectorizing it (which worked great, so I kept those changes) and trying out Numba's JIT compiler. I had mixed results with Numba. When I tried to use Numba on a modified version of the entire velocity function, my results were much slower than before. However, I found that Numba significantly sped up the cross-product and norm functions, which I implemented above.
Updates:
Update 1:
Based on Mercury's comment (which has since been deleted), I replaced
points = np.expand_dims(points, axis=1)
r_1 = points - origins
r_2 = points - terminations
with two calls to the following function:
@njit
def subtract(a, b):
    c = np.empty((a.shape[0], b.shape[0], 3))
    for i in range(a.shape[0]):
        for j in range(b.shape[0]):
            for k in range(3):
                c[i, j, k] = a[i, k] - b[j, k]
    return c
This resulted in a speed increase from 227 s to 220 s. This is better! However, it is still not fast enough.
I also have tried setting the njit fastmath flag to true, and using a numba function instead of calls to np.einsum. Neither increased the speed.
Update 2:
With Jérôme Richard's answer, the run time is now 156 s, which is a decrease of 29%! I'm satisfied enough to accept this answer, but feel free to make other suggestions if you think you can improve on their work!
First of all, Numba can perform parallel computations, resulting in faster code if you manually request it, mainly using parallel=True and prange. This is useful for big arrays (but not for small ones).
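As a minimal, self-contained sketch of that pattern (illustrative only, separate from the implementation below):
import numpy as np
from numba import njit, prange

@njit(parallel=True)
def row_sums(a):
    out = np.empty(a.shape[0])
    # The iterations of the prange loop are distributed across threads.
    for i in prange(a.shape[0]):
        s = 0.0
        for j in range(a.shape[1]):
            s += a[i, j]
        out[i] = s
    return out

print(row_sums(np.random.random((1000, 100))).shape)  # (1000,)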
Moreover, your computation is mainly memory bound. Thus, you should avoid creating big arrays when they are not reused multiple times, or more generally when they cannot be recomputed on the fly (in a relatively cheap way). This is the case for r_0 for example.
In addition, memory access patterns matter: vectorization is more efficient when accesses are contiguous in memory and the cache/RAM is used more efficiently. Consequently, arr[0, :, :] = 0 should be faster than arr[:, :, 0] = 0. Similarly, arr[:, :, 0] = arr[:, :, 1] = 0 should be much slower than arr[:, :, 0:2] = 0, since the former performs two non-contiguous memory passes while the latter performs only one, more contiguous, pass. Sometimes, it can be beneficial to transpose your data so that subsequent calculations are much faster.
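A rough way to compare the two access patterns on your own machine (a sketch; absolute timings vary with hardware and array shape):
import timeit
import numpy as np

arr = np.zeros((500, 500, 3))

def two_passes():
    arr[:, :, 0] = 0
    arr[:, :, 1] = 0

def one_pass():
    arr[:, :, 0:2] = 0

print(timeit.timeit(two_passes, number=1000))
print(timeit.timeit(one_pass, number=1000))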
Moreover, Numpy tends to create many temporary arrays that are costly to allocate. This is a huge problem when the input arrays are small. The Numba jit can avoid that in most cases.
Finally, regarding your computation, it may be a good idea to use GPUs for big arrays (definitely not for small ones). You can take a look at cupy or clpy to do that quite easily.
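For completeness, a minimal sketch of the cupy workflow (this assumes a CUDA-capable GPU with cupy installed; the array sizes are hypothetical):
import numpy as np
import cupy as cp

r = cp.asarray(np.random.random((400, 6000, 3)))  # host -> device transfer
lengths = cp.sqrt((r ** 2).sum(axis=2))           # computed on the GPU
lengths_host = cp.asnumpy(lengths)                # device -> host transfer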
Here is an optimized implementation working on the CPU:
import numpy as np
from numba import njit, prange

@njit(parallel=True)
def subtract(a, b):
    c = np.empty((a.shape[0], b.shape[0], 3))
    for i in prange(c.shape[0]):
        for j in range(c.shape[1]):
            for k in range(3):
                c[i, j, k] = a[i, k] - b[j, k]
    return c

@njit(parallel=True)
def nb_2d_explicit_norm(vectors):
    res = np.empty((vectors.shape[0], vectors.shape[1]))
    for i in prange(res.shape[0]):
        for j in range(res.shape[1]):
            res[i, j] = np.sqrt(vectors[i, j, 0] ** 2 + vectors[i, j, 1] ** 2 + vectors[i, j, 2] ** 2)
    return res

# NOTE: better memory access pattern
@njit(parallel=True)
def nb_2d_explicit_cross(a, b):
    e = np.empty(a.shape)
    for i in prange(e.shape[0]):
        for j in range(e.shape[1]):
            e[i, j, 0] = a[i, j, 1] * b[i, j, 2] - a[i, j, 2] * b[i, j, 1]
            e[i, j, 1] = a[i, j, 2] * b[i, j, 0] - a[i, j, 0] * b[i, j, 2]
            e[i, j, 2] = a[i, j, 0] * b[i, j, 1] - a[i, j, 1] * b[i, j, 0]
    return e

# NOTE: avoid the slow building of temporary arrays
@njit(parallel=True)
def cross_absolute_magnitude(cross):
    return cross[:, :, 0] ** 2 + cross[:, :, 1] ** 2 + cross[:, :, 2] ** 2

# NOTE: avoid the slow building of temporary arrays again and multiple passes in memory
# Warning: does the work in-place
@njit(parallel=True)
def discard_singularities(arr):
    for i in prange(arr.shape[0]):
        for j in range(arr.shape[1]):
            for k in range(3):
                if np.isinf(arr[i, j, k]) or np.isnan(arr[i, j, k]):
                    arr[i, j, k] = 0.0

@njit(parallel=True)
def compute_k(strengths, r_1_cross_r_2_absolute_magnitude, r_0_dot_r_1, r_1_length, r_0_dot_r_2, r_2_length):
    return (strengths
            / (4 * np.pi * r_1_cross_r_2_absolute_magnitude)
            * (r_0_dot_r_1 / r_1_length - r_0_dot_r_2 / r_2_length))

@njit(parallel=True)
def rDotProducts(b, c):
    assert b.shape == c.shape and b.shape[2] == 3
    n, m = b.shape[0], b.shape[1]
    ab = np.empty((n, m))
    ac = np.empty((n, m))
    for i in prange(n):
        for j in range(m):
            ab[i, j] = 0.0
            ac[i, j] = 0.0
            for k in range(3):
                a = b[i, j, k] - c[i, j, k]
                ab[i, j] += a * b[i, j, k]
                ac[i, j] += a * c[i, j, k]
    return (ab, ac)

# Compute `np.sum(arr, axis=1)` in parallel.
@njit(parallel=True)
def collapseArr(arr):
    assert arr.shape[2] == 3
    n, m = arr.shape[0], arr.shape[1]
    res = np.empty((n, 3))
    for i in prange(n):
        res[i, 0] = np.sum(arr[i, :, 0])
        res[i, 1] = np.sum(arr[i, :, 1])
        res[i, 2] = np.sum(arr[i, :, 2])
    return res

def calculate_velocity_induced_by_line_vortices(points, origins, terminations, strengths, collapse=True):
    r_1 = subtract(points, origins)
    r_2 = subtract(points, terminations)
    # NOTE: r_0 is computed on the fly by rDotProducts
    r_1_cross_r_2 = nb_2d_explicit_cross(r_1, r_2)
    r_1_cross_r_2_absolute_magnitude = cross_absolute_magnitude(r_1_cross_r_2)
    r_1_length = nb_2d_explicit_norm(r_1)
    r_2_length = nb_2d_explicit_norm(r_2)
    radius = 3.0e-16
    r_1_length[r_1_length < radius] = 0
    r_2_length[r_2_length < radius] = 0
    r_1_cross_r_2_absolute_magnitude[r_1_cross_r_2_absolute_magnitude < radius] = 0
    r_0_dot_r_1, r_0_dot_r_2 = rDotProducts(r_1, r_2)
    with np.errstate(divide="ignore", invalid="ignore"):
        k = compute_k(strengths, r_1_cross_r_2_absolute_magnitude, r_0_dot_r_1, r_1_length, r_0_dot_r_2, r_2_length)
        k = np.expand_dims(k, axis=2)
        induced_velocities = k * r_1_cross_r_2
    discard_singularities(induced_velocities)
    if collapse:
        induced_velocities = collapseArr(induced_velocities)
    return induced_velocities
On my machine, this code is 2.5 times faster than the initial implementation on arrays of size 10**3. It also uses a bit less memory.
I want to try to implement the following code from MATLAB in Python (I am not familiar with Python in general, but I am trying to translate it from MATLAB using the basics):
% n is a random integer from 1 to 10
% First set the random seed (because we want our results to be reproducible;
% the seed sets a starting point in the sequence of random numbers the program
% generates).
rng(n)

% Generate random columns
a = rand(n, 1);
b = rand(n, 1);
c = rand(n, 1);

% Convert to a matrix
A = zeros(n);
for i = 1:n
    if i ~= n
        A(i + 1, i) = a(i + 1);
        A(i, i + 1) = c(i);
    end
    A(i, i) = b(i);
end
This is my attempt in Python:
import numpy as np

## n is a random integer from 1 to 10
np.random.seed(n)

### generate random columns:
a = np.random.rand(n)
b = np.random.rand(n)
c = np.random.rand(n)

A = np.zeros((n, n))  ## create zero n-by-n matrix
for i in range(0, n):
    if (i != n):
        A[i + 1, i] = a[i + 1]
        A[i, i + 1] = c[i]
    A[i, i] = b[i]
I run into an error on the line A[i + 1, i] = a[i + 1]. Is there any structure in Python that I am missing here?
As the above comments clearly point out the indexing error, here is a numpy way of doing it based on np.diag:
import numpy as np
# for reproducibility
np.random.seed(42)
# n is random integer from 1 to 10
n = np.random.randint(low=1, high=10)
# first diagonal below main diag: k = -1
a = np.random.rand(n-1)
# main diag: k = 0
b = np.random.rand(n)
# first diagonal above main diag: k = 1
c = np.random.rand(n-1)
# sum all 2-d arrays in order to obtain A
A = np.diag(a, k=-1) + np.diag(b, k=0) + np.diag(c, k=1)
The short answer is that for i = 1:n iterates over [1, n], inclusive on both bounds, while for i in range(n): iterates over [0, n), exclusive on the right bound. Therefore, the check if i ~= n correctly tests whether you are at the right edge, while if (i != n): does not. Replace it with
if i != n - 1:
The long answer is that you don't need any of that code in either language, since both MATLAB and numpy are intended to be used with vectorized operations. In MATLAB, you can write
A = diag(a(2:end), -1) + diag(b, 0) + diag(c(1:end-1), +1)
In numpy, it's very similar:
A = np.diag(a[1:], -1) + np.diag(b, 0) + np.diag(c[:-1], +1)
There are other tricks you can use, especially if you just want random numbers in the matrix:
A = np.random.rand(n, n)
A[np.tril_indices(n, -2)] = A[np.triu_indices(n, 2)] = 0
You can use other index-based approaches:
i, j = np.diag_indices(n)
i = np.concatenate((i[:-1], i, i[1:]))
j = np.concatenate((j[1:], j, j[:-1]))
A = np.zeros((n, n))
A[i, j] = np.random.rand(3 * n - 2)
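As a quick sanity check, the index-based construction touches exactly the tridiagonal positions (a sketch):
import numpy as np

n = 5
i, j = np.diag_indices(n)
i = np.concatenate((i[:-1], i, i[1:]))
j = np.concatenate((j[1:], j, j[:-1]))
A = np.zeros((n, n))
A[i, j] = np.random.rand(3 * n - 2)

# |row - col| > 1 selects everything outside the three central diagonals.
off_band = np.abs(np.subtract.outer(np.arange(n), np.arange(n))) > 1
assert np.all(A[off_band] == 0)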
Let's say we have an array sig:
import numpy as np

sig = np.array([1, 2, 3, 4, 5])
And another array k, which consists of indices:
k = np.array([1,2,0,4])
I want to find an array that interpolates between sig[k[i]-1] and sig[k[i]], but only if k[i] != 0 and k[i] != len(k), i.e.:
p = 2
result = np.zeros(len(k))
for i in range(len(k)):
    if (k[i] == 0):
        result[i] = sig[k[i]]
    elif (k[i] == len(k)):
        result[i] = sig[k[i] - 1]
    else:
        result[i] = sig[k[i] - 1] + (sig[k[i]] - sig[k[i] - 1]) * (p - k[i - 1]) / (k[i] - k[i - 1])
How do I do this without looping over len(k), using vectorization?
Expected: result = array([1.66666667, 3, 1, 4])
For k = 0 and k = 4 I did not interpolate; the values were returned as sig[0] and sig[3] respectively.
For a (very) limited number of cases like this one, an approach to vectorizing such code is to build a linear combination of each case and the corresponding calculation.
So, set up vectors
alpha = (k == 0) to match the first case,
beta = (k > 0) to match the second case, and
gamma = (k < len(k)) to match the third case.
Then, build up a proper linear combination like:
alpha * sig[k] + beta * sig[k-1] + gamma * (sig[k] - sig[k-1]) * (p - np.roll(k, 1)) / (k - np.roll(k, 1))
Pay attention that, by the way beta and gamma are set up above, the calculations of the second and third cases can be combined. Also, we need np.roll here to get the proper k[i-1].
The final solution, minimized to a one-liner, looks like this:
import numpy as np

# Inputs
sig = np.array([1, 2, 3, 4, 5])
k = np.array([1, 2, 0, 4])
p = 2

# Original solution using loop
result = np.zeros(len(k))
for i in range(len(k)):
    if (k[i] == 0):
        result[i] = sig[k[i]]
    elif (k[i] == len(k)):
        result[i] = sig[k[i] - 1]
    else:
        result[i] = sig[k[i] - 1] + (sig[k[i]] - sig[k[i] - 1]) * (p - k[i - 1]) / (k[i] - k[i - 1])

# Vectorized solution
res = (k == 0) * sig[k] + (k > 0) * sig[k - 1] + (k < len(k)) * (sig[k] - sig[k - 1]) * (p - np.roll(k, 1)) / (k - np.roll(k, 1))

# Outputs
print('Original solution using loop:\n ', result)
print('Vectorized solution:\n ', res)
The outputs are identical:
Original solution using loop:
[1.66666667 3. 1. 4. ]
Vectorized solution:
[1.66666667 3. 1. 4. ]
Hope that helps!
I am numerically solving for x(t) for a system of first-order differential equations. The system is:
dx/dt = y
dy/dt = -x - a*y*(x^2 + y^2 - 1)
I have implemented the Forward Euler method to solve this problem as follows:
import numpy as np

def forward_euler():
    h = 0.01
    num_steps = 10000

    x = np.zeros([num_steps + 1, 2])  # steps, number of solutions
    y = np.zeros([num_steps + 1, 2])
    a = 1.

    x[0, 0] = 10.  # initial condition, 1st solution
    y[0, 0] = 5.
    x[0, 1] = 0.   # initial condition, 2nd solution
    y[0, 1] = 0.0000000001

    for step in xrange(num_steps):
        x[step + 1] = x[step] + h * y[step]
        y[step + 1] = y[step] + h * (-x[step] - a * y[step] * (x[step] ** 2 + y[step] ** 2 - 1))
    return x, y
Now I would like to vectorize the code further and keep x and y in the same array. I have come up with the following solution:
def forward_euler_vector():
    num_steps = 10000
    h = 0.01

    x = np.zeros([num_steps + 1, 2, 2])  # steps, variables, number of solutions
    a = 1.

    x[0, 0, 0] = 10.  # initial conditions, 1st solution
    x[0, 1, 0] = 5.
    x[0, 0, 1] = 0.   # initial conditions, 2nd solution
    x[0, 1, 1] = 0.0000000001

    def f(x):
        return np.array([x[1],
                         -x[0] - a * x[1] * (x[0] ** 2 + x[1] ** 2 - 1)])

    for step in xrange(num_steps):
        x[step + 1] = x[step] + h * f(x[step])
    return x
The question: forward_euler_vector() works, but was this the best way to vectorize it? I am asking because the vectorized version runs about 20 ms slower on my laptop:
In [27]: %timeit forward_euler()
1 loops, best of 3: 301 ms per loop
In [65]: %timeit forward_euler_vector()
1 loops, best of 3: 320 ms per loop
There is always the trivial autojit solution:
def forward_euler(initial_x, initial_y, num_steps, h):
    x = np.zeros([num_steps + 1, 2])  # steps, number of solutions
    y = np.zeros([num_steps + 1, 2])
    a = 1.

    x[0, 0] = initial_x[0]  # initial condition, 1st solution
    y[0, 0] = initial_y[0]
    x[0, 1] = initial_x[1]  # initial condition, 2nd solution
    y[0, 1] = initial_y[1]

    for step in xrange(int(num_steps)):
        x[step + 1] = x[step] + h * y[step]
        y[step + 1] = y[step] + h * (-x[step] - a * y[step] * (x[step] ** 2 + y[step] ** 2 - 1))
    return x, y
Timings:
from numba import autojit
jit_forward_euler = autojit(forward_euler)
%timeit forward_euler([10,0], [5,0.0000000001], 1E4, 0.01)
1 loops, best of 3: 385 ms per loop
%timeit jit_forward_euler([10,0], [5,0.0000000001], 1E4, 0.01)
100 loops, best of 3: 3.51 ms per loop
@Ophion's comment explains very well what's going on: the call to array() within f(x) introduces some overhead, which kills the benefit of using matrix multiplication in the expression h * f(x[step]).
And as he says, you may be interested in having a look at scipy.integrate for a nice set of numerical integrators.
To solve the problem at hand of vectorising your code, you want to avoid recreating the array every time you call f. You would like to initialize the array once, and return it modified at every call. This is similar to what a static variable is in C/C++.
You can achieve this with a mutable default argument, which is interpreted once, at the time of the definition of the function f(x), and which has local scope. Since it has to be mutable, you encapsulate it in a list of a single element:
def f(x, static_tmp=[np.empty((2, 2))]):
    static_tmp[0][0] = x[1]
    static_tmp[0][1] = -x[0] - a * x[1] * (x[0] ** 2 + x[1] ** 2 - 1)
    return static_tmp[0]
With this modification to your code, the overhead of array creation disappears, and on my machine I gain a small improvement:
%timeit forward_euler() #258ms
%timeit forward_euler_vector() #248ms
This means that the gain from optimizing the matrix arithmetic with numpy is quite small, at least on the problem at hand.
You may want to get rid of the function f altogether, doing its operations within the for loop to remove the call overhead. The trick with the default argument can, however, also be applied with scipy's more general time integrators, where you must provide a function f.
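For reference, a minimal sketch of handing this same system to scipy.integrate.odeint (the initial conditions and step size are taken from the question; this is an illustration, not a tuned replacement):
import numpy as np
from scipy.integrate import odeint

a = 1.

def f(state, t):
    # Same right-hand side as above: state = [x, y].
    x, y = state
    return [y, -x - a * y * (x ** 2 + y ** 2 - 1)]

t = np.linspace(0, 100, 10001)   # h = 0.01 over 10000 steps
sol = odeint(f, [10., 5.], t)    # first set of initial conditions; shape (10001, 2)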
EDIT: as pointed out by Jaime, another way to go is to treat static_tmp as an attribute of the function f, creating it after declaring the function but before calling it:
def f(x):
    f.static_tmp[0] = x[1]
    f.static_tmp[1] = -x[0] - a * x[1] * (x[0] ** 2 + x[1] ** 2 - 1)
    return f.static_tmp

f.static_tmp = np.empty((2, 2))
This is a follow up to How to set up and solve simultaneous equations in python but I feel deserves its own reputation points for any answer.
For a fixed integer n, I have a set of 2(n-1) simultaneous equations as follows.
M(p) = 1+((n-p-1)/n)*M(n-1) + (2/n)*N(p-1) + ((p-1)/n)*M(p-1)
N(p) = 1+((n-p-1)/n)*M(n-1) + (p/n)*N(p-1)
M(1) = 1+((n-2)/n)*M(n-1) + (2/n)*N(0)
N(0) = 1+((n-1)/n)*M(n-1)
M(p) is defined for 1 <= p <= n-1. N(p) is defined for 0 <= p <= n-2. Notice also that p is just a constant integer in every equation so the whole system is linear.
Some very nice answers were given for how to set up a system of equations in python. However, the system is sparse and I would like to solve it for large n. How can I instead use scipy's sparse matrix representation and, for example, http://docs.scipy.org/doc/scipy/reference/sparse.linalg.html?
I wouldn't normally keep beating a dead horse, but it happens that my non-vectorized approach to solving your other question has some merit when things get big. Because I was actually filling the coefficient matrix one item at a time, it is very easy to translate into the COO sparse matrix format, which can be efficiently transformed to CSC and solved. The following does it:
import numpy as np
import scipy.sparse
import scipy.sparse.linalg

def sps_solve(n):
    # Solution vector is [N[0], N[1], ..., N[n - 2], M[1], M[2], ..., M[n - 1]]
    n_pos = lambda p: p
    m_pos = lambda p: p + n - 2
    data = []
    row = []
    col = []
    # p = 0
    # n * N[0] + (1 - n) * M[n - 1] = n
    row += [n_pos(0), n_pos(0)]
    col += [n_pos(0), m_pos(n - 1)]
    data += [n, 1 - n]
    for p in xrange(1, n - 1):
        # n * M[p] + (1 + p - n) * M[n - 1] - 2 * N[p - 1] +
        # (1 - p) * M[p - 1] = n
        row += [m_pos(p)] * (4 if p > 1 else 3)
        col += ([m_pos(p), m_pos(n - 1), n_pos(p - 1)] +
                ([m_pos(p - 1)] if p > 1 else []))
        data += [n, 1 + p - n, -2] + ([1 - p] if p > 1 else [])
        # n * N[p] + (1 + p - n) * M[n - 1] - p * N[p - 1] = n
        row += [n_pos(p)] * 3
        col += [n_pos(p), m_pos(n - 1), n_pos(p - 1)]
        data += [n, 1 + p - n, -p]
    if n > 2:
        # p = n - 1
        # n * M[n - 1] - 2 * N[n - 2] + (2 - n) * M[n - 2] = n
        row += [m_pos(n - 1)] * 3
        col += [m_pos(n - 1), n_pos(n - 2), m_pos(n - 2)]
        data += [n, -2, 2 - n]
    else:
        # p = 1
        # n * M[1] - 2 * N[0] = n
        row += [m_pos(n - 1)] * 2
        col += [m_pos(n - 1), n_pos(n - 2)]
        data += [n, -2]
    coeff_mat = scipy.sparse.coo_matrix((data, (row, col))).tocsc()
    return scipy.sparse.linalg.spsolve(coeff_mat,
                                       np.ones(2 * (n - 1)) * n)
It is of course much more verbose than building it from vectorized blocks, as TheodorosZelleke does, but an interesting thing happens when you time both approaches:
First, and this is (very) nice, time scales linearly in both solutions, as one would expect from using the sparse approach. But the solution I gave in this answer is always faster, more so for larger n. Just for the fun of it, I also timed TheodorosZelleke's dense approach from the other question, which gives a nice graph showing the different scaling of both types of solutions, and how very early, somewhere around n = 75, the solution here should be your choice:
I don't know enough about scipy.sparse to figure out the difference between the two sparse approaches, although I strongly suspect the use of LIL-format sparse matrices. There may be some very marginal performance gain, along with a lot of compactness in the code, from turning TheodorosZelleke's answer into COO format. But that is left as an exercise for the OP!
This is a solution using scipy.sparse. Unfortunately, the problem is not stated here, so in order to understand this solution, future visitors will first have to look up the problem under the link provided in the question.
Solution using scipy.sparse:
from scipy.sparse import spdiags, lil_matrix, vstack, hstack
from scipy.sparse.linalg import spsolve
import numpy as np

def solve(n):
    nrange = np.arange(n)
    diag = np.ones(n - 1)

    # upper left block
    n_to_M = spdiags(-2. * diag, 0, n - 1, n - 1)

    # lower left block
    n_to_N = spdiags([n * diag, -nrange[-1:0:-1]], [0, 1], n - 1, n - 1)

    # upper right block
    m_to_M = lil_matrix(n_to_N)
    m_to_M[1:, 0] = -nrange[1:-1].reshape((n - 2, 1))

    # lower right block
    m_to_N = lil_matrix((n - 1, n - 1))
    m_to_N[:, 0] = -nrange[1:].reshape((n - 1, 1))

    # build A, combine all blocks
    coeff_mat = hstack(
        (vstack((n_to_M, n_to_N)),
         vstack((m_to_M, m_to_N))))

    # const vector, right side of eq.
    const = n * np.ones((2 * (n - 1), 1))

    return spsolve(coeff_mat.tocsr(), const).reshape((-1, 1))
There's some code that I've looked at before here: http://jkwiens.com/heat-equation-using-finite-difference/ His function implements a finite difference method to solve the heat equation using the scipy sparse matrix package.
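The core ingredient in such schemes is a sparse second-difference operator; here is a minimal sketch of building one with spdiags (the grid size and spacing are hypothetical):
import numpy as np
from scipy.sparse import spdiags

n, dx = 100, 0.01
e = np.ones(n)
# Tridiagonal 1-D Laplacian: (u[i-1] - 2*u[i] + u[i+1]) / dx**2
L = spdiags([e, -2 * e, e], [-1, 0, 1], n, n) / dx ** 2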