Python for loop multiprocessing - python

I have a matrix of matrices that I need to do computations on (i.e. a x by y by z by z matrix, I need to perform a computation on each of the z by z matrices, so x*y total operations). Currently, I am looping over the first two dimensions of the matrix and performing the computation, but they are entirely independent. Is there a way I can compute these in parallel, without knowing in advance how many such matrices there will be (i.e. x and y unknown).

Yes; see the multiprocessing module. Here's an example (tweaked from the one in the docs to suit your use case). (I assume z = 1 here for simplicity, so that f takes a scalar.)
from multiprocessing import Pool
# x = 2, y = 3, z = 1 - needn't be known in advance
matrix = [[1, 2, 3], [4, 5, 6]]
def f(x):
    # Your computation for each z-by-z submatrix goes here.
    return x**2
# The guard is needed on platforms that spawn (rather than fork) worker processes.
if __name__ == "__main__":
    with Pool() as p:
        flat_results = p.map(f, [x for row in matrix for x in row])
    # If you don't need to preserve the x*y shape of the input matrix,
    # you can use `flat_results` and skip the rest.
    x = len(matrix)
    y = len(matrix[0])
    results = [flat_results[i*y:(i+1)*y] for i in range(x)]
    # Now `results` contains `[[1, 4, 9], [16, 25, 36]]`
This will divide up the x * y computations across several processes (one per CPU core; this can be tweaked by providing an argument to Pool()).
Depending on what you're doing, consider trying vectorized operations (as opposed to for loops) with numpy first; you might find that it's fast enough to make multiprocessing unnecessary. (If matrix were a numpy array in this example, the code would just be results = matrix**2.)
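To illustrate the vectorized route for z > 1: most numpy.linalg routines treat the last two axes as the matrix and broadcast over the leading ones, so the x and y loops may not be needed at all. The following is only a sketch; it assumes (hypothetically) that the per-matrix computation is a determinant and that the input is an array of shape (x, y, z, z).

```python
import numpy as np

# Hypothetical input: a 2 x 3 grid of 4 x 4 matrices (x=2, y=3, z=4).
rng = np.random.default_rng(0)
stack = rng.normal(size=(2, 3, 4, 4))

# np.linalg.det works on the last two axes and broadcasts over the others,
# so this computes all x*y determinants in one call.
dets = np.linalg.det(stack)

# Reference result from the explicit double loop, for comparison.
looped = np.array([[np.linalg.det(stack[i, j]) for j in range(stack.shape[1])]
                   for i in range(stack.shape[0])])
print(dets.shape, np.allclose(dets, looped))
```

If your computation has no such batched NumPy equivalent, the multiprocessing approach remains the more general option.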

The way I approach parallel processing in python is to define a function to do what I want, then apply it in parallel using multiprocessing.Pool.starmap. It's hard to suggest any code for your problem without knowing what you are computing and how.
import multiprocessing as mp
def my_function(matrices_to_compare, matrix_of_matrices):
    m1 = matrices_to_compare[0]
    m2 = matrices_to_compare[1]
    result = m1 - m2 # or whatever you want to do
    return result
matrices_x = <list of x matrices>
matrices_y = <list of y matrices>
matrices_to_compare = list(zip(matrices_x,matrices_y))
with mp.Pool(mp.cpu_count()) as pool:
    results = pool.starmap(my_function,
                           [(x, matrix_of_matrices) for x in matrices_to_compare],
                           chunksize=1)
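For the x by y by z by z layout described in the question, one possible arrangement is to flatten the first two axes, map a worker over the resulting list of z-by-z blocks, and reshape the results back. This is a sketch only: per_matrix is a hypothetical placeholder for the real computation, and flatten_blocks and parallel_apply are names invented for this example.

```python
import numpy as np
from multiprocessing import Pool

def per_matrix(m):
    # Hypothetical stand-in for the real z-by-z computation.
    return np.trace(m @ m.T)

def flatten_blocks(arr):
    # Turn an (x, y, z, z) array into a list of x*y blocks of shape (z, z).
    x, y, z, _ = arr.shape
    return list(arr.reshape(x * y, z, z))

def parallel_apply(arr, processes=None):
    # x and y are read from the array at run time; they need not be known in advance.
    x, y = arr.shape[:2]
    with Pool(processes) as pool:
        flat = pool.map(per_matrix, flatten_blocks(arr))
    return np.array(flat).reshape(x, y)

if __name__ == "__main__":
    arr = np.arange(2 * 3 * 2 * 2, dtype=float).reshape(2, 3, 2, 2)
    print(parallel_apply(arr))
```

The worker is defined at module level so that it can be pickled and sent to the worker processes.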

An alternative to the multiprocessing pool approach proposed by other answers - if you have a GPU available possibly the most straightforward approach would be to use a tensor algebra package taking advantage of it, such as cupy or torch.
You could also get some more speedup by compiling your code for the CPU, either ahead of time with Cython or just in time with Numba (for GPU programming there is also numba.cuda, which however requires some background to use).

Related

Fast way to calculate customized function on a multi-dimensional array?

I was trying to evaluate a customized function over every point on an n-dimensional grid, after which I can marginalize and do corner plots.
This is conceptually simple but I'm struggling to find a way to do it efficiently. I tried a loop regardless, and it is indeed too slow, especially considering that I will be needing this for more parameters (a1, a2, a3...) as well. I was wondering if there is a faster way or any reliable package that could help?
EDITS: Sorry that my description of myfunction hasn't been very clear, since the function takes some specific external data. Nevertheless here's a sample function that demonstrates it:
import numpy as np
from scipy.ndimage import gaussian_filter
#This gaussian filter is needed to process my data
data = gaussian_filter(np.array([[1, 2], [3, 4]]), sigma = 1)
model1 = np.array([[0, 1], [2, 3]])
model2 = np.array([[2, 3], [4, 5]])
models = np.array([model1, model2])
(This is just a demonstration. The actual data and models are some 500x500-ish 2D arrays.)
and then
from scipy.special import gammaln
def myfunction(params):
    """
    Calculates the log of the Poisson likelihood of generating data
    given model params and fits.
    params: array-like, with number entries. For example,
    if params = np.array([a1, a2]), we generate a model of
    a1*model1 + a2*model2.
    """
    model_combined = np.sum(models * params[:,None,None], axis=0)
    #Unfortunately I need to process my combined model as well
    model_smeared = gaussian_filter(model_combined, sigma=1)
    #This following line is calculating the log of the Poisson likelihood
    #of each pixel taking its value given the combined model as the expectation
    #value, taking advantage that numpy does element-wise calculations
    #automatically in this case
    loglikelihood_array = data * np.log(model_combined) - model_combined - gammaln(data+1)
    #Sum up the loglikelihoods
    loglikelihood_sum = np.sum(loglikelihood_array)
    return loglikelihood_sum
The function itself will return me results immediately, but not so if I just simply write a for-loop to calculate it for, say, 100x100 pairs of parameters.
EDIT #2 I understand that the for-loop within my shown code (sorry for my sloppiness) is confusing (and thanks for the comments for the broadcasting simplification!), and I've just edited that.
My real problem isn't with the combining of the models[i], but with the implementation of the entire function (again described by a very sloppy for-loop here), and loglikes is what I finally wanted:
a1_array = np.linspace(0, 2, 100)
a2_array = np.linspace(2, 4, 100)
loglikes = np.empty((100, 100))
for i in range(len(a1_array)):
    for j in range(len(a2_array)):
        loglikes[i][j] = myfunction(np.array([a1_array[i], a2_array[j]]))
I think there should be a better way of doing this out there than this for-loop, but was unfortunately unaware of it. When I say more parameters I mean, say adding an a3_array = np.linspace(3, 5, 100) and then loglikes will be a 3-dimensional array, and so on.
Thanks again so much for any feedback!
Vectorising that loop won't save you any time and in fact may make things worse.
Instead of looping through a1_array and a2_array to create pairs, you can generate all pairs from the start by putting them in a 100x100x2 array. This operation takes an insignificant amount of time compared to Python loops. But once you are inside the function and broadcasting your arrays so that you can do your calculations on data, you are suddenly dealing with 100x100x2x500x500 arrays. You won't have memory for this, and if you rely on swapping to disk the operations become far slower.
Not only are you not saving any time (well, you do for very small arrays but it's the difference between 0.03 s vs 0.005 s), but with python loops you're only using a few 10s of MB of RAM, while with the vectorised approach it skyrockets into the GB.
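To put a number on the memory claim, here is a quick back-of-the-envelope calculation for the broadcast intermediate. It assumes 8-byte float64 elements, which is NumPy's default floating-point type but is not stated in the original.

```python
import numpy as np

# Shape of the fully broadcast intermediate described above:
# 100 x 100 parameter pairs, 2 models, 500 x 500 pixels each.
shape = (100, 100, 2, 500, 500)
itemsize = np.dtype(np.float64).itemsize  # 8 bytes; float64 is an assumption

n_elements = int(np.prod(shape, dtype=np.int64))
n_bytes = n_elements * itemsize
print(f"{n_elements:,} elements, {n_bytes / 1e9:.0f} GB")
```

That is 5 billion elements, or about 40 GB, for a single temporary array; even after summing over the model axis the result is still half that size.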
But out of curiosity, this is how it could be vectorised.
import itertools
import numpy as np
from scipy.ndimage import gaussian_filter
from scipy.special import gammaln
def vectorised(params):
    model_combined = np.sum(params[...,None,None] * models, axis=2)
    model_smeared = gaussian_filter(model_combined, sigma=1)
    log_array = data * np.log(model_combined) - model_combined - gammaln(data+1)
    return np.sum(log_array, axis=(-2, -1))
np.random.seed(0)
M = 500
N = 100
# data needs the same M x M shape as each model, otherwise it cannot broadcast
data = gaussian_filter(np.random.randint(0, 1000, (M, M)), sigma=1)
models = np.random.randint(1, 1000, (2, M, M))
a1 = np.linspace(0, 2, N)
a2 = np.linspace(2, 4, N)
a = np.array(list(itertools.product(a1, a2))).reshape((N, N, 2))
# Warning: with these sizes the intermediates need tens of GB of RAM
log_sum = vectorised(a)
And some benchmarks
Bottom line: a Python loop run 10000 times just to fetch some array elements takes 0.001 s in total. This is insignificant compared to your function taking 0.01 s for each call.
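Since the loop overhead is negligible, the plain loop can simply be generalised to more parameters (a3 and so on) rather than vectorised. One way to avoid writing a new nested for per parameter is np.ndindex. The sketch below uses a hypothetical stand-in for myfunction and small made-up grids so that it runs on its own.

```python
import numpy as np

def myfunction(params):
    # Hypothetical stand-in for the real log-likelihood; any function
    # that maps a 1-D parameter vector to a scalar fits here.
    return -np.sum((params - 1.0) ** 2)

# One 1-D grid per parameter; add or remove entries to change dimensionality.
grids = [np.linspace(0, 2, 5), np.linspace(2, 4, 4), np.linspace(3, 5, 3)]
shape = tuple(len(g) for g in grids)
loglikes = np.empty(shape)

# np.ndindex yields every index tuple of an array with this shape,
# replacing one nested for-loop per parameter.
for idx in np.ndindex(*shape):
    params = np.array([g[i] for g, i in zip(grids, idx)])
    loglikes[idx] = myfunction(params)

print(loglikes.shape)
```

The total number of calls still grows as the product of the grid sizes, so this tidies the code but does not change the run time.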

Numpy: create a "superset" array view

I am trying to create a view x_ of a vector x that is augmented, but still references the same memory location as x. That way, I don't need to run an augment function every time I want the augmented vector, but can simply refer to x_.
Is there a way to re-write this so that the assertion below is true? I am looking to maximize efficiency.
import numpy as np
x = np.arange(10)
ones = np.ones(len(x), dtype=x.dtype)
x_ = np.stack([x, ones], axis=0)
x[0] = 11
assert x_[0, 0] == 11
Note: I have a feeling that this could be impossible or inefficient because it would break contiguous storage. I would appreciate an explanation if this is the case.
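One possible way to get the asserted behaviour is to reverse the construction: np.stack always copies its inputs into a new array, so instead allocate the augmented array first and make x a view of its first row. This is a sketch under the assumption that you control where x is created; it does not help if x already exists elsewhere and must keep its own memory.

```python
import numpy as np

n = 10
# Allocate the augmented array once; row 1 holds the constant ones.
x_ = np.ones((2, n), dtype=np.int64)

# x is a view of row 0, so it shares memory with x_.
x = x_[0]
x[:] = np.arange(n)  # fill in place; `x = np.arange(n)` would rebind x instead

x[0] = 11
assert x_[0, 0] == 11
assert np.shares_memory(x, x_)
```

Both x and x_ stay contiguous here, because x is simply the first row of a C-ordered array.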

How to multiprocess iteratively into several numpy arrays?

Which is the correct way to parallelize this? Essentially, I have a very large 2D array I want to do a linear fit of each row to a separate array of the same length (x), which would be constant for all the rows. The expected result is a 1D array (data_slopes) with the linear fit slopes. This code works but it is very slow:
for j in range(img1_data_r.shape[0]):
    y = img1_data_r[j,:]
    model = LinearRegression()
    model.fit(x.reshape((-1, 1)),y,1)
    data_slopes[j] = model.coef_[0]
I have no previous experience with multiprocessing.Pool, and my attempts so far have been unsuccessful.
If you can pass in the domain of the data that needs to processed, you can use xargs to run your program in parallel. xargs allows you to execute a program in parallel, passing in different parameters read from stdin. I've used it successfully to make bash shells work in parallel.
See if this question helps you: Python read .json files from GCS into pandas DF in parallel
You can try the following. Instead of iterating over the ranges, what I would recommend is that you make a function that takes in a 2D array and returns your expected LinearRegression output. Then you can create a list that has all the 2D arrays you need to iterate over (iterator) -
import multiprocessing as mp
#Function that works on a single object
def fn(x):
    out = x**3 #your code here
    return out
iterator = [1,2,3,4,5,6,7,8,9,10] #list of objects that you need to run your function on
pool = mp.Pool(processes=4) #Number of worker processes you want to use
results = [pool.apply_async(fn, args=(x,)) for x in iterator] #submits one task per object without blocking
output = [p.get() for p in results] #collects and returns the results as a list of outputs.
output
[1, 8, 27, 64, 125, 216, 343, 512, 729, 1000]
pool.apply_async returns immediately with an AsyncResult, so the list comprehension submits every task without waiting for earlier ones to finish; each p.get() then blocks until that task's result is ready.
Here is an example of how you can use multiprocessing to operate on rows of a 2d numpy array and a constant vector.
In this example, the same vector b (equivalent to your x) is dot-producted with each of the rows of the a array.
import numpy as np
from multiprocessing import Pool
def dot_product(row, vec):
    return (row * vec).sum()
a = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9],
              [10, 11, 12]])
b = np.array([10, 11, 12])
p = Pool(3) # max number of simultaneous processes
print(p.starmap(dot_product, ((row, b) for row in a)))
Note that you can only pass picklable objects to a multiprocessing.Pool. NumPy arrays are picklable, and so is a module-level function (such as dot_product here), but instance methods are not picklable in Python 2, and in Python 3 pickling a bound method sends a copy of the whole instance to the worker. So rather than using a method of your model (LinearRegression()) as first argument to Pool.map (or Pool.starmap), you should instantiate LinearRegression separately inside the function for each call.
Putting this together for you (although obviously I don't have enough information to test this), you would get something like this:
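from multiprocessing import Pool
from sklearn.linear_model import LinearRegression
def get_data_slope(row, x):
    model = LinearRegression()
    model.fit(x.reshape((-1, 1)), row, 1)
    return model.coef_[0]
# data_slopes, img1_data_r and x are the arrays from the question
p = Pool(3)
data_slopes[:] = p.starmap(get_data_slope, ((row, x) for row in img1_data_r))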
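Because every row is fitted against the same x, it may also be worth trying the closed-form least-squares slope, sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))**2), computed for all rows at once with NumPy before reaching for multiprocessing. The sketch below uses random stand-in data and a hypothetical helper name (row_slopes); it is an unweighted fit and has not been checked against the asker's actual arrays.

```python
import numpy as np

def row_slopes(data, x):
    """Least-squares slope of each row of `data` against the shared `x`."""
    xc = x - x.mean()                              # centred x, shape (n,)
    yc = data - data.mean(axis=1, keepdims=True)   # each row centred, shape (rows, n)
    return (yc @ xc) / (xc @ xc)                   # one slope per row

# Hypothetical stand-in data: 1000 rows of 50 samples each.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
img = rng.normal(size=(1000, 50))
data_slopes = row_slopes(img, x)
print(data_slopes.shape)
```

This replaces the per-row model fits with a single matrix-vector product.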

Einsum slower than explicit Numpy implementation for n-mode tensor-matrix product

I'm trying to implement the n-mode tensor-matrix product (as defined by Kolda and Bader: https://www.sandia.gov/~tgkolda/pubs/pubfiles/SAND2007-6702.pdf) efficiently in Python using Numpy. The operation effectively gets down to (for matrix U, tensor X and axis/mode k):
1. Extract all vectors along axis k from X by collapsing all other axes.
2. Multiply these vectors on the left by U using standard matrix multiplication.
3. Insert the vectors again into the output tensor using the same shape, apart from X.shape[k], which is now equal to U.shape[0] (initially, X.shape[k] must be equal to U.shape[1], as a result of the matrix multiplication).
I've been using an explicit implementation for a while which performs all these steps separately:
1. Transpose the tensor to bring axis k to the front (in my full code I added an exception in case k == X.ndim - 1, in which case it's faster to leave it there and transpose all future operations, or at least in my application, but that's not relevant here).
2. Reshape the tensor to collapse all other axes.
3. Calculate the matrix multiplication.
4. Reshape the tensor to reconstruct all other axes.
5. Transpose the tensor back into the original order.
I would think this implementation creates a lot of unnecessary (big) arrays, so once I discovered np.einsum I thought this would speed things up considerably. However using the code below I got worse results:
import numpy as np
from time import time
def mode_k_product(U, X, mode):
    transposition_order = list(range(X.ndim))
    transposition_order[mode] = 0
    transposition_order[0] = mode
    Y = np.transpose(X, transposition_order)
    transposed_ranks = list(Y.shape)
    Y = np.reshape(Y, (Y.shape[0], -1))
    Y = U @ Y
    transposed_ranks[0] = Y.shape[0]
    Y = np.reshape(Y, transposed_ranks)
    Y = np.transpose(Y, transposition_order)
    return Y
def einsum_product(U, X, mode):
    axes1 = list(range(X.ndim))
    axes1[mode] = X.ndim + 1
    axes2 = list(range(X.ndim))
    axes2[mode] = X.ndim
    return np.einsum(U, [X.ndim, X.ndim + 1], X, axes1, axes2, optimize=True)
def test_correctness():
    A = np.random.rand(3, 4, 5)
    for i in range(3):
        B = np.random.rand(6, A.shape[i])
        X = mode_k_product(B, A, i)
        Y = einsum_product(B, A, i)
        print(np.allclose(X, Y))
def test_time(method, amount):
    U = np.random.rand(256, 512)
    X = np.random.rand(512, 512, 256)
    start = time()
    for i in range(amount):
        method(U, X, 1)
    return (time() - start)/amount
def test_times():
    print("Explicit:", test_time(mode_k_product, 10))
    print("Einsum:", test_time(einsum_product, 10))
test_correctness()
test_times()
Timings for me:
Explicit: 3.9450525522232054
Einsum: 15.873924326896667
Is this normal or am I doing something wrong? I know there are circumstances where storing intermediate results can decrease complexity (e.g. chained matrix multiplication), however in this case I can't think of any calculations that are being repeated. Is matrix multiplication so optimized that it removes the benefits of not transposing (which technically has a lower complexity)?
I'm more familiar with the subscripts style of using einsum, so worked out these equivalences:
In [194]: np.allclose(np.einsum('ij,jkl->ikl',B0,A), einsum_product(B0,A,0))
Out[194]: True
In [195]: np.allclose(np.einsum('ij,kjl->kil',B1,A), einsum_product(B1,A,1))
Out[195]: True
In [196]: np.allclose(np.einsum('ij,klj->kli',B2,A), einsum_product(B2,A,2))
Out[196]: True
With a mode parameter, your approach in einsum_product may be best. But the equivalences help me visualize the calculation better, and may help others.
Timings should basically be the same. There's an extra setup time in einsum_product that should disappear in larger dimensions.
After updating Numpy, Einsum is only slightly slower than the explicit method, with or without multi-threading (see comments to my question).
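For comparison, the n-mode product defined in the question can also be written with np.tensordot and np.moveaxis, which avoids the manual transpose and reshape bookkeeping. This is a sketch (mode_k_tensordot is a name invented here); whether it is faster than either version in the question has not been measured.

```python
import numpy as np

def mode_k_tensordot(U, X, mode):
    # Contract axis 1 of U with axis `mode` of X. tensordot puts U's
    # remaining axis first, followed by X's remaining axes in order.
    Y = np.tensordot(U, X, axes=(1, mode))
    # Move the new axis from position 0 back to position `mode`.
    return np.moveaxis(Y, 0, mode)

rng = np.random.default_rng(0)
X = rng.random((3, 4, 5))
U = rng.random((6, 4))
Y = mode_k_tensordot(U, X, 1)
print(Y.shape)
```

For mode 1 this matches the subscript form np.einsum('ij,kjl->kil', U, X) given above.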

best method of making an array

I'm new to programming and am a bit unsure about how to write my own for loop. This is what I would like to do:
Let us subdivide the interval [0,1] into n points x_0 = 0, ..., x_{n-1} = 1.
Write a function compute_discrete_u(epsilon, n) that returns two numpy arrays:
x_array contains the coordinates of the n points
u_array contains the discrete values of u at these points.
u(x)=sin(1x+ϵ)
Thank you!
First of all, you do not need a for loop at all. You want to use numpy, so you can use the vectorized operations that numpy is built upon.
Here's the function you are literally asking for (and most likely not how you should solve your problem):
# Do NOT use this.
import numpy as np
def compute_discrete_u(epsilon, n):
    x = np.linspace(0, 1, n)
    return x, np.sin(x + epsilon)
That's quite an awkward API. From a design point-of-view, you are mixing two responsibilities in the function:
1. Generating a certain x vector.
2. Calculating a u vector based on a mathematical function.
You should not do this for complexity and reusability reasons. What if you want a non-uniform x later on?
So here's what you should do:
import numpy as np
def compute_u(x, epsilon):
    return np.sin(x + epsilon)
x = np.linspace(0, 1, num=101)
u = compute_u(x, epsilon=1e-3)
This is easier to understand because the function is just the mathematical function. Additionally, you can compute u for any x array (or single float) you like. If you do not need compute_u elsewhere, you may even drop it completely and write u = np.sin(x + epsilon).
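As a small illustration of the reusability point, the same compute_u can be applied to a non-uniform grid or to a single float without any change. The geometric grid below is a made-up example, not something from the question.

```python
import numpy as np

def compute_u(x, epsilon):
    # Same function as in the answer above.
    return np.sin(x + epsilon)

# Hypothetical non-uniform grid: points clustered near 0.
x_nonuniform = np.geomspace(1e-3, 1.0, num=50)
u_nonuniform = compute_u(x_nonuniform, epsilon=1e-3)

# A single float works too, because np.sin accepts scalars.
u_single = compute_u(0.5, epsilon=1e-3)
print(u_nonuniform.shape, u_single)
```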
