I'm using griddata to "mount" an array with a large number of slices, and I would like to know whether I can compute the functions (one per slice) on each of my 4 cores in order to speed up the process.
import numpy as np

size = 8
Y = np.arange(2000)
X = np.arange(2000)
(xx, yy) = np.meshgrid(X, Y)
array = np.zeros((Y.shape[0], X.shape[0], size))

array[:,:,0] = 0
array[:,:,1] = X+Y
array[:,:,2] = X**2+Y**2+X+Y
array[:,:,3] = X**3+Y**3+X**2+Y**2+X+Y
array[:,:,4] = X**4+Y**4+X**3+Y**3+X**2+Y**2+X+Y
array[:,:,5] = X**5+Y**5+X**4+Y**4+X**3+Y**3+X**2+Y**2+X+Y
array[:,:,6] = X**6+Y**6+X**5+Y**5+X**4+Y**4+X**3+Y**3+X**2+Y**2+X+Y
array[:,:,7] = X**7+Y**7+X**6+Y**6+X**5+Y**5+X**4+Y**4+X**3+Y**3+X**2+Y**2+X+Y
So here I would like to calculate array[:,:,0] and array[:,:,1] with the first core, then array[:,:,2] and array[:,:,3] with the second core, and so on. Is that possible?
----EDIT LATER---
There is no link between the different "slices"... my different functions are independent of each other:
array[:,:,0] = 0
array[:,:,1] = X+Y
array[:,:,2] = X*np.cos(X)+Y*np.sin(Y)
array[:,:,3] = X**3+np.sin(X)+X**2+Y**2+np.sin(Y)
...
You can try with multiprocessing.Pool:
from multiprocessing import Pool
import numpy as np

size = 8
Y = np.arange(2000)
X = np.arange(2000)
(xx, yy) = np.meshgrid(X, Y)
array = np.zeros((Y.shape[0], X.shape[0], size))

def func(i):  # you need to call a function with Pool
    array_ = np.zeros((Y.shape[0], X.shape[0]))
    for j in range(1, i):
        array_ += X**j + Y**j
    return array_

if __name__ == '__main__':
    p = Pool(4)  # if you have 4 cores in your processor
    result = p.map(func, range(1, 8))
    for i in range(1, 8):
        array[:, :, i] = result[i-1]
Keep in mind that multiprocessing in Python does not share memory; that's why you have to create array_ inside the worker and copy the results back into array with the for-loop at the end of the code.
Since your application (with these dimensions) doesn't need a lot of computing time, it is possible that this method will actually be slower. Each worker also gets its own copy of all your variables, which may cause a memory overflow.
You should also double-check the func I wrote, as I didn't completely verify that it does what it is supposed to do :)
If you want to apply a single function over an array of data, then using e.g. a multiprocessing.Pool is a good solution, provided that both the input and output of the calculation are relatively small.
You, however, want to do many different calculations on two input arrays, and every one of those calculations returns an array.
Since separate processes do not share memory, the X and Y arrays have to be transported to each worker process when it is started. And the result of each calculation (which is also a numpy array the same size as X and Y) has to be returned to the parent process.
Depending on e.g. the size of the arrays and the number of cores, the overhead from transferring all those arrays between the worker processes and the parent process via interprocess communication ("IPC") will cost time, reducing the advantage of using multiple cores.
Keep in mind that the parent process has to listen for and handle IPC requests from all the worker processes. So you've shifted the bottleneck from calculation to communication.
So it is not a given that multiprocessing will actually improve performance in this case. It depends on the details of the actual problem (number of cores, array size, amount of physical memory et cetera).
You will have to do some careful performance measurements using e.g. Pool or Process with realistic array sizes.
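As a rough illustration, a minimal timing sketch along these lines might look like the following; slice_func here is just a stand-in for the real per-slice calculation, and the sizes are illustrative:
import time
import numpy as np
from multiprocessing import Pool

X = np.arange(2000, dtype=np.float64)
Y = np.arange(2000, dtype=np.float64)

def slice_func(i):
    # stand-in for one independent per-slice calculation
    return sum(X**j + Y**j for j in range(1, i + 1))

if __name__ == '__main__':
    t0 = time.time()
    serial = [slice_func(i) for i in range(1, 8)]
    print('serial loop:', time.time() - t0)

    t0 = time.time()
    with Pool(4) as p:
        parallel = p.map(slice_func, range(1, 8))
    print('Pool(4):    ', time.time() - t0)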
Three things:
The most important question is: why are you doing this?
Your NumPy build may already be making use of multiple cores. I am not sure off the top of my head how to check (see questions like this), or if absolutely necessary take a look at the Numexpr library, https://github.com/pydata/numexpr (see the small Numexpr sketch after the code below).
About the "Y" in your likely XY problem - you are re-calculating data that you can instead re-use:
import numpy as np

size = 8
Y = np.arange(2000)
X = np.arange(2000)
(xx, yy) = np.meshgrid(X, Y)
array = np.zeros((Y.shape[0], X.shape[0], size))

array[..., 0] = 0
for i in range(1, size):
    array[..., i] = X ** i + Y ** i + array[..., i - 1]
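As a small illustration of the Numexpr point above (a sketch, assuming numexpr is installed): ne.evaluate() compiles the expression string and evaluates it across several threads.
import numexpr as ne
import numpy as np

X = np.arange(2000, dtype=np.float64)
Y = np.arange(2000, dtype=np.float64)

# the same expression as the array[:,:,5] slice in the question, here on the 1-D X and Y
slice_5 = ne.evaluate("X**5 + Y**5 + X**4 + Y**4 + X**3 + Y**3 + X**2 + Y**2 + X + Y")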
I have the following function, which accepts an indicator matrix of shape (20,000 x 20,000), and I have to run the function 20,000 x 20,000 = 400,000,000 times. Note that indicator_Matrix has to be in the form of a pandas dataframe when passed as a parameter to the function, as my actual problem's dataframe has a timeIndex and integer columns, but I have simplified this a bit for the sake of understanding the problem.
Pandas Implementation
import numpy as np
import pandas as pd

indicator_Matrix = pd.DataFrame(np.random.randint(0, 2, [20000, 20000]))

def operations(indicator_Matrix):
    s = indicator_Matrix.sum(axis=1)
    d = indicator_Matrix.div(s, axis=0)
    res = d[d > 0].mean(axis=0)
    return res.iloc[-1]
I tried to improve it by using numpy, but it is still taking ages to run. I also tried concurrent.futures.ThreadPoolExecutor, but it still takes a long time to run and is not much of an improvement over the list comprehension.
Numpy Implementation
indicator_Matrix = pd.DataFrame(np.random.randint(0, 2, [20000, 20000]))

def operations(indicator_Matrix):
    s = indicator_Matrix.to_numpy().sum(axis=1)
    d = (indicator_Matrix.to_numpy().T / s).T
    d = pd.DataFrame(d, index=indicator_Matrix.index, columns=indicator_Matrix.columns)
    res = d[d > 0].mean(axis=0)
    return res.iloc[-1]

output = [operations(indicator_Matrix) for i in range(0, 20000**2)]
Note that the reason I convert d to a dataframe again is that I need to obtain the column means and retain only the last column mean using .iloc[-1]. d[d>0].mean(axis=0) returns column means, i.e.
2478 1.0
0 1.0
Update: I am still stuck in this problem. I wonder if using gpu packages like cudf and CuPy on my local desktop would make any difference.
Assuming the answer of @CrazyChucky is correct, one can implement a faster parallel Numba implementation. The idea is to use plain loops and to read the data contiguously, which makes the computation cache-friendly/memory-efficient. Here is an implementation:
import numba as nb
import numpy as np

@nb.njit(['(int_[:,:],)', '(int_[:,::1],)', '(int_[::1,:],)'], parallel=True)
def compute_fastest(matrix):
    n, m = matrix.shape
    sum_by_row = np.zeros(n, matrix.dtype)
    is_row_major = matrix.strides[0] >= matrix.strides[1]
    if is_row_major:
        for i in nb.prange(n):
            s = 0
            for j in range(m):
                s += matrix[i, j]
            sum_by_row[i] = s
    else:
        for chunk_id in nb.prange(0, (n+63)//64):
            start = chunk_id * 64
            end = min(start+64, n)
            for j in range(m):
                for i2 in range(start, end):
                    sum_by_row[i2] += matrix[i2, j]
    count = 0
    s = 0.0
    for i in range(n):
        value = matrix[i, -1] / sum_by_row[i]
        if value > 0:
            s += value
            count += 1
    return s / count

# output = [compute_fastest(indicator_Matrix.to_numpy()) for i in range(0,20000**2)]
Pandas dataframes can contain both row-major and column-major arrays. Depending on the memory layout, it is better to iterate over the rows or over the columns. This is why there are two implementations of the sum, selected based on is_row_major. There are also 3 Numba signatures: one for row-major contiguous arrays, one for column-major contiguous arrays and one for non-contiguous arrays. Numba will compile the 3 function variants and automatically pick the best one at runtime. The JIT compiler of Numba can generate a faster implementation (e.g. using SIMD instructions) when the input 2D array is known to be contiguous.
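For instance, a quick way to see which layout (and therefore which branch and signature) applies to a given dataframe's underlying array is to inspect its flags and strides; a small illustrative check, not part of the benchmark:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 2, [2000, 2000]))
m = df.to_numpy()

# C_CONTIGUOUS means row-major, F_CONTIGUOUS means column-major;
# compute_fastest compares the two strides to choose its loop order.
print(m.flags['C_CONTIGUOUS'], m.flags['F_CONTIGUOUS'])
print(m.strides)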
Experimental Results
This computation is about 14.5 times faster than operations_simpler on my i5-9600KF processor (6 cores). It still takes a lot of time, but the computation is memory-bound and nearly optimal on my machine: it is bound by the main memory, which has to be read:
On a 2000x2000 dataframe with 32-bit integers:
- operations: 86.310 ms/iter
- operations_simpler: 5.450 ms/iter
- compute_fastest: 0.375 ms/iter
- optimal: 0.345-0.370 ms/iter
If you want faster code, then you need to use more compact data types. For example, a uint8 data type is large enough to contain the values 0 and 1, and it is 4 times smaller in memory on Windows. This means the code can be up to 4 times faster in this case. The smaller the data type, the faster the program. One could even try to pack 8 columns into 1 using bit tweaks, though this is generally significantly slower with Numba unless you have a lot of available cores.
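For example, a quick (hypothetical) check of the memory saving from such a cast:
import numpy as np
import pandas as pd

# toy stand-in for the real indicator matrix
df = pd.DataFrame(np.random.randint(0, 2, [2000, 2000]))

dense = df.to_numpy()             # default integer dtype (int32 on Windows, int64 elsewhere)
compact = dense.astype(np.uint8)  # 0/1 values fit in a single byte
print(dense.nbytes / 1e6, 'MB ->', compact.nbytes / 1e6, 'MB')

# Note: compute_fastest above is compiled with integer signatures only, so a matching
# '(uint8[:,:],)' signature would have to be added before it accepts this array.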
Notes & Discussion
The above code works only with uniformly-typed columns. If this is not the case, you can split the dataframe into multiple column groups and convert each group to a Numpy array so as to then call the Numba function (modified to support groups). Note that the @CrazyChucky code has a similar issue: a dataframe column with mixed datatypes converted to a Numpy array results in an object-based Numpy array, which is very inefficient (especially as a row-major Numpy array).
Note that using a GPU will not make the computation faster unless the input dataframe is already stored in the GPU memory. Indeed, CPU-GPU data transfers are more expensive than just reading the RAM (due to the interconnect overhead, generally a rather slow PCIe link). Note that GPU memory is quite limited compared to CPU memory. If the target dataframe(s) do not need to be transferred, then using cudf is relatively simple and should give a small speed-up. For faster code, one needs to implement a fast CUDA kernel, but this is clearly far from easy for dataframes with mixed datatypes. In the end, the resulting speed-up should be main_ram_throughput / gpu_ram_throughput, assuming there is no data transfer. Note that this factor is generally 5-12. Note also that CUDA and cudf require an Nvidia GPU.
Finally, reducing the input data size or just the amount of computation is certainly the best solution (as indicated in the comment by @zvone), since the problem is very computationally intensive.
You're doing some extra math you don't have to. In plain English, what you're doing is:
Summing each column
Turning the list of sums "sideways" and dividing each column by it
Taking the mean of each column, ignoring values ≤ 0
Returning only the rightmost mean
After step one, you no longer need anything but the rightmost column; you can ignore the other columns, only dividing and averaging the one whose result you care about. Changing your code accordingly:
def operations_simpler(indicator_matrix):
    sums = indicator_matrix.sum(axis=1)
    last_column = indicator_matrix.iloc[:, -1]
    divided = last_column / sums
    return divided[divided > 0].mean()
...yields the same result, and takes about a hundredth of the time. Extrapolating from shorter test runs, this cuts the time for 400,000,000 runs on my machine from about 114 years down to... about 324 days. Still not great. So far I've not managed to get it to run any faster by converting to NumPy, compiling with Numba, or employing multiprocessing, but I'll go ahead and post this for now in case it's helpful.
Note: You're unlikely to see any improvements with compute-heavy work like this from threading; if anything, you'd want to use multiprocessing. concurrent.futures offers executors for both. Threads are mostly useful to avoid waiting around for I/O.
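If you do experiment with processes, a minimal concurrent.futures sketch could look like the one below; run_once and its random input are hypothetical stand-ins for however you would really split the work, and operations_simpler is the function defined above:
from concurrent.futures import ProcessPoolExecutor
import numpy as np
import pandas as pd

def run_once(seed):
    # build this task's own (toy) input and run the real work on it
    rng = np.random.default_rng(seed)
    m = pd.DataFrame(rng.integers(0, 2, (2000, 2000)))
    return operations_simpler(m)

if __name__ == '__main__':
    with ProcessPoolExecutor(max_workers=4) as ex:
        results = list(ex.map(run_once, range(8)))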
As per the previous answer, you can use Numba, or you can use other alternatives such as Dask, a distributed computing package that parallelizes your function's execution by dividing the data into smaller chunks and distributing the computation across many CPU cores or even multiple machines:
import dask.array as da

def operations(indicator_matrix):
    s = indicator_matrix.sum(axis=1)
    d = indicator_matrix.div(s, axis=0)
    res = d[d > 0].mean(axis=0)
    return res.iloc[-1]

indicator_matrix_dask = da.from_array(indicator_matrix, chunks=(1000, 1000))
output_dask = indicator_matrix_dask.map_blocks(operations, dtype=float)
output = output_dask.compute()
Or you can use CuPy, which uses the GPU to speed up your function's execution:
import cupy as cp

def operations(indicator_matrix):
    s = cp.sum(indicator_matrix, axis=1)
    d = cp.divide(indicator_matrix.T, s).T
    d = pd.DataFrame(d, index=indicator_matrix.index, columns=indicator_matrix.columns)
    res = d[d > 0].mean(axis=0)
    return res.iloc[-1]

indicator_matrix_cupy = cp.asarray(indicator_matrix)
output_cupy = operations(indicator_matrix_cupy)
output = cp.asnumpy(output_cupy)
I am trying to simulate a systolic array structure -- all of which I learned from these slides: http://web.cecs.pdx.edu/~mperkows/temp/May22/0020.Matrix-multiplication-systolic.pdf -- for matrix multiplication in a Python environment. An integral part of a systolic array is that data flow between PEs is concurrent with any multiplication or addition that occurs on any one node. I am having difficulty working out how exactly to implement such a concurrent procedure in Python. Specifically, I hope to better understand a computational approach to feed the elements of the matrices to be multiplied into the systolic array in a cascading fashion, while allowing these elements to propagate through the array concurrently.
I have begun writing some code in Python to multiply two 3 by 3 arrays, but ultimately I want to simulate any sized systolic array and work with any sized a and b matrices.
from threading import Thread
from collections import deque
import numpy as np

vals_deque = deque(maxlen=9*2)  # will hold the interconnections between nodes of the systolic array
dump = deque(maxlen=9)  # will be the output of the SystolicArray
prev_size = 0

def setupSystolicArray():
    global SystolicArray
    SystolicArray = [[NodeSystolic(i, j) for j in range(3)] for i in range(3)]

def spreadInputs(a, b):
    # needs some way to initially propagate the elements of a and b through the top and leftmost parts of the systolic array
    # start all the nodes of the systolic array; they are waiting for an input
    for row_of_nodes in SystolicArray:
        for node in row_of_nodes:
            node.start()
    # even if I found a way to put these inputs into the array at the start, I am not sure how to
    # coordinate future inputs into the array in the cascading fashion described in the slides
    while len(dump) != 9:
        if len(vals_deque) != prev_size:
            vals = vals_deque[-1]
            row = vals['t'][0]
            col = vals['l'][0]
            a = vals['t'][1]
            b = vals['l'][1]
            # these if/elif statements track whether the outputs are at the edge of the systolic array and can be removed
            if row >= 3:
                dump.append(a)
            elif col >= 3:
                dump.append(b)
            else:
                # something is wrong with the logic here
                SystolicArray[row][col - 1].update(a, b)
                SystolicArray[row - 1][col].update(a, b)

class NodeSystolic:
    def __init__(self, row, col):
        self.row = row
        self.col = col
        self.currval = 0
        self.up = False
        self.ain = 0  # coming in from the top
        self.bin = 0  # coming in from the side

    def start(self):
        Thread(target=self.continuous, args=()).start()

    def continuous(self):
        while True:
            if self.up:
                self.currval = self.ain * self.bin
                self.up = False
                self.passon(self.ain, self.bin)
            else:
                pass

    def update(self, left, top):
        self.up = True
        self.ain = top
        self.bin = left

    def passon(self, t, l):
        # this will pass on the inputs this node has received to the next nodes
        vals_deque.append({'t': [self.row + 1, self.ain], 'l': [self.col + 1, self.bin]})

    def returnValue(self):
        return self.currval

def main():
    a = np.array([
        [1, 2, 3],
        [4, 5, 6],
        [7, 8, 9],
    ])
    b = np.array([
        [1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]
    ])
    setupSystolicArray()
    spreadInputs(a, b)
The above code is not operational and still has many bugs. I was hoping someone could give me pointers on how to improve the code, or tell me whether there is a much simpler way to model the parallel procedures of a systolic array with asynchronous properties in Python, so that with very large systolic array sizes I won't have to worry about creating too many threads (nodes).
It's interesting to think about simulating a systolic array in Python, but I think there are some significant difficulties in doing this along the lines you've sketched out above.
Most importantly, there are the issues about Python's limited scope for true parallelism caused by the Global Interpreter Lock. This means that you won't get any significant parallelism for compute-limited tasks, and its threads are probably best suited to handling I/O-limited tasks such as web requests or filesystem accesses. The nearest Python can get to this is probably via the multiprocessing module, but that would require a separate process for each node.
Secondly, even if you were going to get parallelism in the numerical operations within your systolic array, you'd need to have some locking mechanisms to allow different threads to exchange data (or messages) without corrupting each other's memory when they try to read and write data at the same time.
As regards the data structures in your example, I think you might be better off having each node in the systolic array hold a reference to its upstream nodes, rather than knowing that it lies at a particular location in an NxM grid. I don't think there's any reason why a systolic array needs to be a rectangular grid, and any form of Directed Acyclic Graph (DAG) would still have the potential for efficient distributed computation.
Overall, I'd expect the computational overheads of doing this simulation in Python to be enormous relative to what could be achieved by lower-level languages such as Scala or C++. Even then, unless each node in the systolic array is doing a lot of computation (i.e. much more than a few multiply-adds), then the overheads of exchanging messages between nodes will be substantial. So, I presume your simulation is mainly to get an understanding of the data flows, and the high-level behaviour of the array, rather than to get anywhere close to what could be provided by custom DSP (Digital Signal Processing) hardware. If that's the case, then I'd be tempted just to do without the threading and use a centralized message-queue to which all nodes submit messages that are delivered by a global message-distribution mechanism.
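To make that last suggestion concrete, here is a minimal single-threaded sketch of the central message-queue idea for an output-stationary systolic array; the message format and scheduling are my own illustration, not a reference design. Each cycle, every message due at that cycle is delivered, each cell multiplies the a/b pair it received and accumulates, then forwards a to the right and b downwards.
from collections import deque
import numpy as np

def systolic_matmul(A, B):
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    acc = np.zeros((n, m))   # each cell (i, j) keeps its own partial sum of C[i, j]
    queue = deque()          # the single, global message queue: (cycle, kind, i, j, value)

    # Skewed feeding: A[i, s] enters cell (i, 0) at cycle i + s,
    # B[s, j] enters cell (0, j) at cycle j + s.
    for s in range(k):
        for i in range(n):
            queue.append((i + s, 'a', i, 0, A[i, s]))
        for j in range(m):
            queue.append((j + s, 'b', 0, j, B[s, j]))

    cycle = 0
    while queue:
        arrived_a, arrived_b, pending = {}, {}, deque()
        while queue:                       # deliver everything due this cycle
            t, kind, i, j, val = queue.popleft()
            if t != cycle:
                pending.append((t, kind, i, j, val))
            elif kind == 'a':
                arrived_a[(i, j)] = val
            else:
                arrived_b[(i, j)] = val
        queue = pending
        for (i, j), a_val in arrived_a.items():
            b_val = arrived_b[(i, j)]      # with this schedule a and b always arrive in pairs
            acc[i, j] += a_val * b_val
            if j + 1 < m:                  # forward a to the right neighbour for the next cycle
                queue.append((cycle + 1, 'a', i, j + 1, a_val))
            if i + 1 < n:                  # forward b to the neighbour below for the next cycle
                queue.append((cycle + 1, 'b', i + 1, j, b_val))
        cycle += 1
    return acc

A = np.arange(1, 10).reshape(3, 3)
B = np.arange(1, 10).reshape(3, 3)
print(np.allclose(systolic_matmul(A, B), A @ B))   # True
Because there is a single queue and a single loop, there is no locking to worry about, and the same structure would extend to a DAG of nodes rather than a rectangular grid.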
Here is my problem:
I would like to define an array of persons and change the entries of this array in a for loop. Since I also would like to see the asymptotics of the resulting distribution, I want to repeat this simulation quite a lot of times, so I use a matrix to store the individual arrays, one per row. I know how to do this with two for loops:
import random
import numpy as np

nobs = 100
rep = 10**2
steps = 10**2
dmoney = 1

state = np.matrix([[10] * nobs] * rep)

for i in range(steps):
    for j in range(rep):
        sample = random.sample(range(state.shape[1]), 2)
        state[j, sample[0]] = state[j, sample[0]] + dmoney
        state[j, sample[1]] = state[j, sample[1]] - dmoney
I thought I'd use the multiprocessing library, but I don't know how to do it, because in my simple mind the workers would manipulate the same global matrix in parallel, which I read is not a good idea.
So, how can I do this, to speed up calculations?
Thanks in advance.
OK, so this might not be much use; I haven't profiled it to see if there's a speed-up, but list comprehensions will be a little faster than normal loops anyway.
...
y_ix = np.arange(rep)  # create once as same for each loop

for i in range(steps):
    # presumably the two locations in the population to swap need refreshing each loop
    x_ix = np.array([np.random.choice(nobs, 2) for j in range(rep)])
    state[y_ix, x_ix[:, 0]] += dmoney
    state[y_ix, x_ix[:, 1]] -= dmoney
PS what numpy splits over multiple processors depends on which libraries it was compiled against (BLAS etc.). You will be able to find info online about this.
EDIT I can confirm, after comparing the original with the numpy indexed version above, that the original method is faster!
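If you still want to try multiprocessing, the natural split is over the independent repetitions (rows) rather than over the shared matrix; a rough sketch, with illustrative names, where each worker simulates its own block of rows with its own RNG stream and the parent stacks the results:
import numpy as np
from multiprocessing import Pool

nobs, steps, dmoney = 100, 10**2, 1

def simulate_block(args):
    n_rows, seed = args
    rng = np.random.default_rng(seed)     # give each worker its own RNG stream
    block = np.full((n_rows, nobs), 10)
    for _ in range(steps):
        for j in range(n_rows):
            a, b = rng.choice(nobs, 2, replace=False)
            block[j, a] += dmoney
            block[j, b] -= dmoney
    return block

if __name__ == '__main__':
    rep, n_workers = 10**2, 4
    tasks = [(rep // n_workers, seed) for seed in range(n_workers)]
    with Pool(n_workers) as p:
        blocks = p.map(simulate_block, tasks)
    state = np.vstack(blocks)
Whether this beats the vectorised version above is exactly the kind of thing worth measuring, since the per-row work here is tiny compared to the cost of starting processes and shipping the blocks back.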
I have a large-ish array artist_topic_probs (112,312 item rows by ~100 feature columns), and I want to calculate the pairwise cosine similarity between a (large sample of) random pairs of rows from this array. Here are the relevant bits of my current code:
import numpy as np

# the number of random pairs to check (10 million here)
random_sample_size = 10000000

# I want to make sure they're unique, and that I'm never comparing a row to itself
# so I generate my set of comparisons like so:
np.random.seed(99)
comps = set()
while len(comps) < random_sample_size:
    a = np.random.randint(0, 112312)
    b = np.random.randint(0, 112312)
    if a != b:
        comp = tuple(sorted([a, b]))
        comps.add(comp)

# convert to list at the end to ensure sort order
# not positive if this is needed...I've seen conflicting opinions
comps = list(sorted(comps))
This generates a list of tuples, where each are the two rows between which I'll calculate similarity. Then I just use a simple loop to calculate all the similarities:
from scipy.spatial.distance import cosine

c_dists = []
for a, b in comps:
    c_dists.append(cosine(artist_topic_probs[a], artist_topic_probs[b]))
(of course, cosine here gives distance, not a similarity, but we can easily get that with sim = 1.0 - dist. I used similarity in the title because it's the more common term)
This works fine, but isn't too fast, and I need to repeat the procedure many times. I have 32 cores to work with, so parallelization seems like a good bet, but I'm not sure the best way to go about it. My idea was something like:
pool = mp.Pool(processes=32)
c_dists = [pool.apply(cosine, args=(artist_topic_probs[a], artist_topic_probs[b]))
           for a, b in comps]
But testing this approach out on my laptop with some test data hasn't been working (it just hangs, or at least is taking so much longer than the simple loop that I got sick of waiting and killed it). My concern is the indexing of the matrix being some sort of bottleneck, but I'm not sure. Any ideas on how to effectively parallelize this (or otherwise speed up the process)?
First of all, you might want to use itertools.combinations and random.sample to get unique pairs in the future, but it won't work in this case due to memory issues. Then, multiprocessing is not multithreading, i.e. spawning a new process involves huge system overhead. There is little sense in spawning a process for each individual task; a task must be well worth the overhead to rationalise starting a new process, hence you'd better split all the work into separate jobs (into as many pieces as the number of cores you want to use).
Also, don't forget that the multiprocessing implementation serialises the entire namespace and loads it into memory N times, where N is the number of processes. This can lead to intensive swapping if you don't have enough RAM to store N copies of your huge array. So you might want to reduce the number of cores.
Updated to restore initial order as you requested.
I made a test data-set of identical vectors, hence cosine must return a vector of zeros.
from __future__ import division, print_function
import math
import multiprocessing as mp
from scipy.spatial.distance import cosine
from operator import itemgetter
import itertools

def worker(enumerated_comps):
    return [(ind, cosine(artist_topic_probs[a], artist_topic_probs[b])) for ind, (a, b) in enumerated_comps]

def slice_iterable(iterable, chunk):
    """
    Slices an iterable into chunks of size n
    :param chunk: the number of items per slice
    :type chunk: int
    :type iterable: collections.Iterable
    :rtype: collections.Generator
    """
    _it = iter(iterable)
    return itertools.takewhile(
        bool, (tuple(itertools.islice(_it, chunk)) for _ in itertools.count(0))
    )

# Test data
artist_topic_probs = [range(10) for _ in xrange(10)]
comps = tuple(enumerate([(1, 2), (1, 3), (1, 4), (1, 5)]))

n_cores = 2
chunksize = int(math.ceil(len(comps)/n_cores))
jobs = tuple(slice_iterable(comps, chunksize))

pool = mp.Pool(processes=n_cores)
work_res = pool.map_async(worker, jobs)
c_dists = map(itemgetter(1), sorted(itertools.chain(*work_res.get())))

print(c_dists)
Output:
[2.2204460492503131e-16, 2.2204460492503131e-16, 2.2204460492503131e-16, 2.2204460492503131e-16]
These values are fairly close to zero.
P.S.
From the multiprocessing.Pool.apply docs
Equivalent of the apply() built-in function. It blocks until the
result is ready, so apply_async() is better suited for performing
work in parallel. Additionally, func is only executed in one of the
workers of the pool.
scipy.spatial.distance.cosine, as you can see by following the link, introduces a significant overhead in your computations, because each invocation recomputes the norms of the two vectors you're comparing. For the size of your sample this amounts to 20 million norms computed; if you memoize the norms of your ~100 thousand vectors in advance, you can save approximately 60% of your computation time, because each distance involves a dot product, u*v, and two norm calculations, and each of these three operations is roughly equivalent in terms of operation count.
Further, you're using explicit loops; if you could put your logic inside a vectorized numpy operation you could trim off another large slice of your computational time.
Eventually, you talk about cosine similarity... consider that scipy.spatial.distance.cosine computes the cosine distance instead. The relationship is easy, cs = 1 - cd, but I haven't seen this in your posted code.
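A minimal sketch of the precomputed-norm idea, using toy data; for the real 10 million pairs the fancy indexing would be done in chunks to keep memory in check:
import numpy as np

# toy stand-ins for the real data
artist_topic_probs = np.random.rand(1000, 100)
comps = [(0, 1), (2, 3), (10, 20)]

norms = np.linalg.norm(artist_topic_probs, axis=1)    # one norm per row, computed once

a_idx, b_idx = np.array(comps).T
dots = np.einsum('ij,ij->i',
                 artist_topic_probs[a_idx],
                 artist_topic_probs[b_idx])            # vectorised row-wise dot products
sims = dots / (norms[a_idx] * norms[b_idx])            # cosine similarity
dists = 1.0 - sims                                     # same convention as scipy's cosine distance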
I am trying to optimize my code using Python's multiprocessing.Pool module, but I am not getting the speed-up results that I would logically expect.
The main computation involves calculating matrix-vector products for a large number of vectors with a fixed, large sparse matrix. Below is a toy example which performs what I need, but with random matrices.
import time
import numpy as np
import scipy.sparse as sp

def calculate(vector, matrix = None):
    for i in range(50):
        v = matrix.dot(vector)
    return v

if __name__ == '__main__':
    N = 1e6
    matrix = sp.rand(N, N, density = 1e-5, format = 'csr')

    t = time.time()
    res = []
    for i in range(10):
        res.append(calculate(np.random.rand(N), matrix = matrix))
    print time.time() - t
The method terminates in about 30 seconds.
Now, since the calculation of each element of results does not depend on the result of any other calculation, it is natural to think that parallel calculation will speed up the process. The idea is to create 4 processes, and if each does some of the calculations, then the time it takes for all the processes to complete should decrease by some factor around 4. To do this, I wrote the following code:
import time
import numpy as np
import scipy.sparse as sp
from multiprocessing import Pool
from functools import partial

def calculate(vector, matrix = None):
    for i in range(50):
        v = matrix.dot(vector)
    return v

if __name__ == '__main__':
    N = 1e6
    matrix = sp.rand(N, N, density = 1e-5, format = 'csr')

    t = time.time()
    input = []
    for i in range(10):
        input.append(np.random.rand(N))

    mp = partial(calculate, matrix = matrix)
    p = Pool(4)
    res = p.map(mp, input)

    print time.time() - t
My problem is that this code takes slightly above 20 seconds to run, so I did not even improve performance by a factor of 2! Even worse, the performance does not improve even if the pool contains 8 processes! Any idea why the speed-up is not happening?
Note: My actual method takes much longer, and the input vectors are stored in a file. If I split the file into 4 pieces and then run my script in a separate process for each file manually, each process terminates four times as quickly as it would for the whole file (as expected). I am confused why this speed-up (which is obviously possible) is not happening with multiprocessing.Pool.
Edit: I have just found this question, Multiprocessing.Pool makes Numpy matrix multiplication slower, which may be related. I have to check, though.
Try:
p = Pool(4)
for i in range(10):
    input = np.random.rand(N)
    p.apply_async(calculate, args=(input, matrix))  # perform function calculate as new process with arguments input and matrix
p.close()
p.join()  # wait for all processes to complete
I suspect that the "partial" object and map are resulting in a blocking behavior. (though I have never used partial, so I'm not familiar with it.)
"apply_async" (or "map_async") are multiprocessing methods that specifically do not block - (see: Python multiprocessing.Pool: when to use apply, apply_async or map?)
Generally, for "embarrassingly parallel problems" like this, apply_async works for me.
EDIT:
I tend to write results to MySQL databases when I'm done - the implementation I provided doesn't work if that's not your approach. "map" is probably the right answer if you want to use order in the list as your way of tracking which entry is which, but I remain suspicious of the "partial" objects.