Providing shared read-only ressources to parallel processes

Providing shared read-only ressources to parallel processes - python

I am working on a problem that allows for some rather unproblematic parallelisation. I am having difficulties figuring out what suitable. parallelising mechanisms are available in Python. I am working with python 3.9 on MacOS.
My pipeline is:
get_common_input() acquires some data in a way not easily parallelisable. If that matters, its return value common_input_1 a list of list of integers.
parallel_computation_1() gets the common_input_1 and an individual input from a list individual_inputs. The common input is only read.
common_input_2 is more or less the collected outputs from parallel_computation_1()`.
parallel_computation_2() then again gets common_input_2 as read only input, plus some individual input.
I could do the following:
import multiprocessing
common_input_1 = None
common_input_2 = None
def parallel_computation_1(individual_input):
return sum(1 for i in common_input_1 if i == individual_input)
def parallel_computation_2(individual_input):
return individual_input in common_input_2
def main():
multiprocessing.set_start_method('fork')
global common_input_1
global common_input_2
common_input_1 = [1, 2, 3, 1, 1, 3, 1]
individual_inputs_1 = [0,1,2,3]
individual_inputs_2 = [0,1,2,3,4]
with multiprocessing.Pool() as pool:
common_input_2 = pool.map(parallel_computation_1, individual_inputs_1)
with multiprocessing.Pool() as pool:
common_output = pool.map(parallel_computation_2, individual_inputs_2)
print(common_output)
if __name__ == '__main__':
main()
As suggested in this answer, I use global variables to share the data. That works if I use set_start_method('fork') (which works for me, but seems to be problematic on MacOS).
Note that if I remove the second with multiprocessing.Pool() to have just one Pool used for both parallel tasks, things won't work (the processes don't see the new value of common_input_2).
Apart from the fact that using global variables seems like bad coding style to me (Is it? That's just my gut feeling), the need to start a new pool doesn't please me, as it introduces some probably unnecessary overhead.
What do you think about these concerns, esp. the second one?
Are there good alternatives? I see that I could use multiprocessing.Array, but since my data are a lists of lists, I would need to flatten it into a single list and use that in parallel_computation in some nontrivial way. If my shared input was even more complex, I would have to put quite some effort into wrapping this into multiprocessing.Value or multiprocessing.Array's.

You can define and compute output_1 as a global variable before creating your process pool; that way each process will have access to the data; this won't result in any memory duplication because you're not changing that data (copy-on-write).
_output_1 = serial_computation()
def parallel_computation(input_2):
# here you can access _output_1
# you must not modify it as this will result in creating new copy in the child process
...
def main():
input_2 = ...
with Pool() as pool:
output_2 = pool.map(parallel_computation, input_2)

Related

Parallelizing for loop in Python 2.7

I'm very new to Python (and coding in general) and I need help parallising the code below. I looked around and found some packages (eg. Multiprocessing & JobLib) which could be useful.
However, I have trouble using it in my example. My code makes an outputfile, and updates it doing the loop(s). Therefore is it not directly paralisable, so I think I need to make smaller files. After this, I could merge the files together.
I'm unable to find a way to do this, could someone be so kind and give me a decent start?
I appreciate any help,
A code newbie
Code:
def delta(graph,n,t,nx,OutExt):
fout_=open(OutExt+'Delta'+str(t)+'.txt','w')
temp=nx.Graph(graph)
for u in range(0,n):
#print "stamp: "+str(t)+" node: "+str(u)
for v in range(u+1,n):
#print str(u)+"\t"+str(v)
Stat = dict()
temp.add_edge(u,v)
MineDeltaGraphletTransitionsFromDynamicNetwork(graph,temp,Stat,u,v)
for a in Stat:
for b in Stat[a]:
fout_.write(str(t)+"\t"+str(u)+"\t"+str(v)+"\t"+str(a)+"\t"+str(b)+"\t"+str(Stat[a][b])+"\n")
if not graph.has_edge(u,v):
temp.remove_edge(u,v)
del temp
fout_.close()

As a start, find the part of the code that you want to be able to execute in parallel with something (perhaps with other invocations of that very same function). Then, figure out how to make this code not share mutable state with anything else.
Mutable state is the enemy of parallel execution. If two pieces of code are executing in parallel and share mutable state, you don't know what the outcome will be (and the outcome will be different each time you run the program). This is becaues you don't know what order the code from the parallel executions will run in. Perhaps the first will mutate something and then the second one will compute something. Or perhaps the second one will compute something and then the first one will mutate it. Who knows? There are solutions to that problem but they involve fine-grained locking and careful reasoning about what can change and when.
After you have an algorithm with a core that doesn't share mutable state, factor it into a separate function (turning locals into parameters).
Finally, use something like the threading (if your computations are primarily in CPython extension modules with good GIL behavior) or multiprocessing (otherwise) modules to execute the algorithm core function (which you have abstracted out) at some level of parallelism.
The particular code example you've shared is a challenge because you use the NetworkX library and a lot of shared mutable state. Each iteration of your loop depends on the results of the previous, apparently. This is not obviously something you can parallelize. However, perhaps if you think about your goals more abstractly you will be able to think of a way to do it (remember, the key is to be able to expressive your algorithm without using shared mutable state).
Your function is called delta. Perhaps you can split your graph into sub-graphs and compute the deltas of each (which are now no longer shared) in parallel.
If the code within your outermost loop is concurrent safe (I don't know if it is or not), you could rewrite it like this for parallel execution:
from multiprocessing import Pool
def do_one_step(nx, graph, n, t, OutExt, u):
# Create a separate output file for this set of results.
name = "{}Delta{}-{}.txt".format(OutExt, t, u)
fout_ = open(name, 'w')
temp = nx.Graph(graph)
for v in range(u+1,n):
Stat = dict()
temp.add_edge(u,v)
MineDeltaGraphletTransitionsFromDynamicNetwork(graph,temp,Stat,u,v)
for a in Stat:
for b in Stat[a]:
fout_.write(str(t)+"\t"+str(u)+"\t"+str(v)+"\t"+str(a)+"\t"+str(b)+"\t"+str(Stat[a][b])+"\n")
if not graph.has_edge(u,v):
temp.remove_edge(u,v)
fout_.close()
def delta(graph,n,t,nx,OutExt):
pool = Pool()
pool.map(
partial(
do_one_step,
nx,
graph,
n,
t,
OutExt,
),
range(0,n),
)
This supposes that all of the arguments can be serialized across processes (required for any argument you pass to a function you call with multiprocessing). I suspect that nx and graph may be problems but I don't know what they are.
And again, this assumes it's actually correct to concurrently execute the inner loop.

Best use pool.map. Here an example that shows what you need to do. Here a simple example of how multiprocessing works with pool:
Single threaded, basic function:
def f(x):
return x*x
if __name__ == '__main__':
print(map(f, [1, 2, 3]))
>> [1, 4, 9]
Using multiple processors:
from multiprocessing import Pool
def f(x):
return x*x
if __name__ == '__main__':
p = Pool(3) # 3 parallel pools
print(p.map(f, [1, 2, 3]))
Using 1 processor
from multiprocessing.pool import ThreadPool as Pool
def f(x):
return x*x
if __name__ == '__main__':
p = Pool(3) # 3 parallel pools
print(p.map(f, [1, 2, 3]))
When you use map you can easily get a list back from the results of your function.

Multiprocessing storing read-only string-array for all processes

I am trying to create workers for a task that involves reading a lot of files and analyzing them.
I want something like this:
list_of_unique_keys_from_csv_file = [] # About 200mb array (10m rows)
# a list of uniquekeys for comparing inside worker processes to a set of flatfiles
I need more threads as it is going very slow, doing the comparison with one process (10 minutes per file).
I have another set of flat-files that I compare the CSV file to, to see if unique keys exist. This seems like a map reduce type of problem.
main.py:
def worker_process(directory_glob_of_flat_files, list_of_unique_keys_from_csv_file):
# Do some parallel comparisons "if not in " type stuff.
# generate an array of
# lines of text like : "this item_x was not detected in CSV list (from current_flatfile)"
if current_item not in list_of_unique_keys_from_csv_file:
all_lines_this_worker_generated.append(sometext + current_item)
return all_lines_this_worker_generated
def main():
all_results = []
pool = Pool(processes=6)
partitioned_flat_files = [] # divide files from glob by 6
results = pool.starmap(worker_process, partitioned_flat_files, {{{{i wanna pass in my read-only parameter}}}})
pool.close()
pool.join()
all_results.extend(results )
resulting_file.write(all_results)
I am using both a linux and a windows environment, so perhaps I need something cross-platform compatible (the whole fork() discussion).
Main Question: Do I need some sort of Pipe or Queue, I can't seem to find good examples of how to transfer around a big read-only string array, a copy for each worker process?

You can just split your read-only parameters and then pass them in. The multiprocessing module is cross-platform compatible, so don't worry about it.
Actually, every process, even sub-process, has its own resources, that means no matter how you pass the parameters to it, it will keep a copy of the original one instead of sharing it. In this simple case, when you pass the parameters from main process into sub-processes, Pool automatically makes a copy of your variables. Because sub-processes just have the copies of original one, so the modification cannot be shared. It doesn't matter in this case as your variables are read-only.
But be careful about your code, you need to wrap the parameters you need into an
iterable collection, for example:
def add(a, b):
return a + b
pool = Pool()
results = pool.starmap(add, [(1, 2), (3, 4)])
print(results)
# [3, 7]

Python Using Multiprocessing

I am trying to use multiprocessing in python 3.6. I have a for loopthat runs a method with different arguments. Currently, it is running one at a time which is taking quite a bit of time so I am trying to use multiprocessing. Here is what I have:
def test(self):
for key, value in dict.items():
pool = Pool(processes=(cpu_count() - 1))
pool.apply_async(self.thread_process, args=(key,value))
pool.close()
pool.join()
def thread_process(self, key, value):
# self.__init__()
print("For", key)
I think what my code is using 3 processes to run one method but I would like to run 1 method per process but I don't know how this is done. I am using 4 cores btw.

You're making a pool at every iteration of the for loop. Make a pool beforehand, apply the processes you'd like to run in multiprocessing, and then join them:
from multiprocessing import Pool, cpu_count
import time
def t():
# Make a dummy dictionary
d = {k: k**2 for k in range(10)}
pool = Pool(processes=(cpu_count() - 1))
for key, value in d.items():
pool.apply_async(thread_process, args=(key, value))
pool.close()
pool.join()
def thread_process(key, value):
time.sleep(0.1) # Simulate a process taking some time to complete
print("For", key, value)
if __name__ == '__main__':
t()

You're not populating your multiprocessing.Pool with data - you're re-initializing the pool on each loop. In your case you can use Pool.map() to do all the heavy work for you:
def thread_process(args):
print(args)
def test():
pool = Pool(processes=(cpu_count() - 1))
pool.map(thread_process, your_dict.items())
pool.close()
if __name__ == "__main__": # important guard for cross-platform use
test()
Also, given all those self arguments I reckon you're snatching this off of a class instance and if so - don't, unless you know what you're doing. Since multiprocessing in Python essentially works as, well, multi-processing (unlike multi-threading) you don't get to share your memory, which means your data is pickled when exchanging between processes, which means anything that cannot be pickled (like instance methods) doesn't get called. You can read more on that problem on this answer.

I think what my code is using 3 processes to run one method but I would like to run 1 method per process but I don't know how this is done. I am using 4 cores btw.
No, you are in fact using the correct syntax here to utilize 3 cores to run an arbitrary function independently on each. You cannot magically utilize 3 cores to work together on one task with out explicitly making that a part of the algorithm itself/ coding that your self often using threads (which do not work the same in python as they do outside of the language).
You are however re-initializing the pool every loop you'll need to do something like this instead to actually perform this properly:
cpus_to_run_on = cpu_count() - 1
pool = Pool(processes=(cpus_to_run_on)
# don't call a dictionary a dict, you will not be able to use dict() any
# more after that point, that's like calling a variable len or abs, you
# can't use those functions now
pool.map(your_function, your_function_args)
pool.close()
Take a look at the python multiprocessing docs for more specific information if you'd like to get a better understanding of how it works. Under python, you cannot utilize threading to do multiprocessing with the default CPython interpreter. This is because of something called the global interpreter lock, which stops concurrent resource access from within python itself. The GIL doesn't exist in other implementations of the language, and is not something other languages like C and C++ have to deal with (and thus you can actually use threads in parallel to work together on a task, unlike CPython)
Python gets around this issue by simply making multiple interpreter instances when using the multiprocessing module, and any message passing between instances is done via copying data between processes (ie the same memory is typically not touched by both interpreter instances). This does not however happen in the misleadingly named threading module, which often actually slow processes down because of a process called context switching. Threading today has limited usefullness, but provides an easier way around non GIL locked processes like socket and file reads/writes than async python.
Beyond all this though there is a bigger problem with your multiprocessing. Your writing to standard output. You aren't going to get the gains you want. Think about it. Each of your processes "print" data, but its all being displayed in one terminal/output screen. So even if your processes are "printing" they aren't really doing that independently, and the information has to be coalesced back into another processes where the text interface lies (ie your console). So these processes write whatever they were going to to some sort of buffer, which then has to be copied (as we learned from how multiprocessing works) to another process which will then take that buffered data and output it.
Typically dummy programs use printing as a means of showing how there is no order between execution of these processes, that they can finish at different times, they aren't meant to demonstrate the performance benefits of multi core processing.

I have experimented a bit this week with multiprocessing. The fastest way that I discovered to do multiprocessing in python3 is using imap_unordered, at least in my scenario. Here is a script you can experiment with using your scenario to figure out what works best for you:
import multiprocessing
NUMBER_OF_PROCESSES = multiprocessing.cpu_count()
MP_FUNCTION = 'imap_unordered' # 'imap_unordered' or 'starmap' or 'apply_async'
def process_chunk(a_chunk):
print(f"processig mp chunk {a_chunk}")
return a_chunk
map_jobs = [1, 2, 3, 4]
result_sum = 0
if MP_FUNCTION == 'imap_unordered':
pool = multiprocessing.Pool(processes=NUMBER_OF_PROCESSES)
for i in pool.imap_unordered(process_chunk, map_jobs):
result_sum += i
elif MP_FUNCTION == 'starmap':
pool = multiprocessing.Pool(processes=NUMBER_OF_PROCESSES)
try:
map_jobs = [(i, ) for i in map_jobs]
result_sum = pool.starmap(process_chunk, map_jobs)
result_sum = sum(result_sum)
finally:
pool.close()
pool.join()
elif MP_FUNCTION == 'apply_async':
with multiprocessing.Pool(processes=NUMBER_OF_PROCESSES) as pool:
result_sum = [pool.apply_async(process_chunk, [i, ]).get() for i in map_jobs]
result_sum = sum(result_sum)
print(f"result_sum is {result_sum}")
I found that starmap was not too far behind in performance, in my scenario it used more cpu and ended up being a bit slower. Hope this boilerplate helps.

Adding state to a function which gets called via pool.map -- how to avoid pickling errors

I've hit the common problem of getting a pickle error when using the multiprocessing module.
My exact problem is that I need to give the function I'm calling some state before I call it in the pool.map function, but in doing so, I cause the attribute lookup __builtin__.function failed error found here.
Based on the linked SO answer, it looks like the only way to use a function in pool.map is to call the defined function itself so that it is looked up outside the scope of the current function.
I feel like I explained the above poorly, so here is the issue in code. :)
Testing without pool
# Function to be called by the multiprocessing pool
def my_func(x):
massive_list, medium_list, index1, index2 = x
result = [massive_list[index1 + x][index2:] for x in xrange(10)]
return result in medium_list
if __name__ == '__main__':
data = [comprehension which loads a ton of state]
source = [comprehension which also loads a medium amount of state]
for num in range(100):
to_crunch = ((massive_list, small_list, num, x) for x in range(1000))
result = map(my_func, to_crunch)
This works A-OK and just as expected. The only thing "wrong" with it is that it's slow.
Pool Attempt 1
# (Note: my_func() remains the same)
if __name__ == '__main__':
data = [comprehension which loads a ton of state]
source = [comprehension which also loads a medium amount of state]
pool = multiprocessing.Pool(2)
for num in range(100):
to_crunch = ((massive_list, small_list, num, x) for x in range(1000))
result = pool.map(my_func, to_crunch)
This technically works, but it is a stunning 18x slower! The slow down must be coming from not only copying the two massive data structures on each call, but also pickling/unpickling them as they get passed around. The non-pool version benefits from only having to pass the reference to the massive list around, rather than the actual list.
So, having tracked down the bottleneck, I try to store the two massive lists as state inside of my_func. That way, if I understand correctly, it will only need to be copied once for each worker (in my case, 4).
Pool Attempt 2:
I wrap up my_func in a closure passing in the two lists as stored state.
def build_myfunc(m,s):
def my_func(x):
massive_list = m # close the state in there
small_list = s
index1, index2 = x
result = [massive_list[index1 + x][index2:] for x in xrange(10)]
return result in medium_list
return my_func
if __name__ == '__main__':
data = [comprehension which loads a ton of state]
source = [comprehension which also loads a medium amount of state]
modified_func = build_myfunc(data, source)
pool = multiprocessing.Pool(2)
for num in range(100):
to_crunch = ((massive_list, small_list, num, x) for x in range(1000))
result = pool.map(modified_func, to_crunch)
However, this returns the pickle error as (based on the above linked SO question) you cannot call a function with multiprocessing from inside of the same scope.
Error:
PicklingError: Can't pickle <type 'function'>: attribute lookup __builtin__.function failed
So, is there a way around this problem?

Map is a way to distribute workload. If you store the data in the func i think you vanish the initial purpose.
Let's try to find why it is slower. It's not normal and there must be something else.
First, the number of processes must be suitable for the machine running them. In your example you're using a pool of 2 processes so a total of 3 processes is involved. How many cores are on the system you're using? What else is running? What's the system load while crunching data?
What does the function do with the data? Does it access disk? Or maybe it uses DB which means there is probably another process accessing disk and cores.
What about memory? Is it sufficient for storing the initial lists?
The right implementation is your Attempt 1.
Try to profile the execution using iostat for example. This way you can spot the bottlenecks.
If it stalls on the cpu then you can try some tweaks to the code.
From another answer on Stackoverflow (by me so no problem copy and pasting it here :P ):
You're using .map() which collect the results and then returns. So for large dataset probably you're stuck in the collecting phase.
You can try using .imap() which is the iterator version on .map() or even the .imap_unordered() if the order of results is not important (as it seems from your example).
Here's the relevant documentation. Worth noting the line:
For very long iterables using a large value for chunksize can make the job complete much faster than using the default value of 1.

Python multiprocess with pool workers - memory use optimization

I have a fuzzy string matching script that looks for some 30K needles in a haystack of 4 million company names. While the script works fine, my attempts at speeding up things via parallel processing on an AWS h1.xlarge failed as I'm running out of memory.
Rather than trying to get more memory as explained in response to my previous question, I'd like to find out how to optimize the workflow - I'm fairly new to this so there should be plenty of room. Btw, I've already experimented with queues (also worked but ran into the same MemoryError, plus looked through a bunch of very helpful SO contributions, but not quite there yet.
Here's what seems most relevant of the code. I hope it sufficiently clarifies the logic - happy to provide more info as needed:
def getHayStack():
## loads a few million company names into id: name dict
return hayCompanies
def getNeedles(*args):
## loads subset of 30K companies into id: name dict (for allocation to workers)
return needleCompanies
def findNeedle(needle, haystack):
""" Identify best match and return results with score """
results = {}
for hayID, hayCompany in haystack.iteritems():
if not isnull(haystack[hayID]):
results[hayID] = levi.setratio(needle.split(' '),
hayCompany.split(' '))
scores = list(results.values())
resultIDs = list(results.keys())
needleID = resultIDs[scores.index(max(scores))]
return [needleID, haystack[needleID], max(scores)]
def runMatch(args):
""" Execute findNeedle and process results for poolWorker batch"""
batch, first = args
last = first + batch
hayCompanies = getHayStack()
needleCompanies = getTargets(first, last)
needles = defaultdict(list)
current = first
for needleID, needleCompany in needleCompanies.iteritems():
current += 1
needles[targetID] = findNeedle(needleCompany, hayCompanies)
## Then store results
if __name__ == '__main__':
pool = Pool(processes = numProcesses)
totalTargets = len(getTargets('all'))
targetsPerBatch = totalTargets / numProcesses
pool.map_async(runMatch,
itertools.izip(itertools.repeat(targetsPerBatch),
xrange(0,
totalTargets,
targetsPerBatch))).get(99999999)
pool.close()
pool.join()
So I guess the questions are: How can I avoid loading the haystack for all workers - e.g. by sharing the data or taking a different approach like dividing the much larger haystack across workers rather than the needles? How can I otherwise improve memory usage by avoiding or eliminating clutter?

Your design is a bit confusing. You're using a pool of N workers, and then breaking your M jobs work up into N tasks of size M/N. In other words, if you get that all correct, you're simulating worker processes on top of a pool built on top of worker processes. Why bother with that? If you want to use processes, just use them directly. Alternatively, use a pool as a pool, sends each job as its own task, and use the batching feature to batch them up in some appropriate (and tweakable) way.
That means that runMatch just takes a single needleID and needleCompany, and all it does is call findNeedle and then do whatever that # Then store results part is. And then the main program gets a lot simpler:
if __name__ == '__main__':
with Pool(processes=numProcesses) as pool:
results = pool.map_async(runMatch, needleCompanies.iteritems(),
chunkSize=NUMBER_TWEAKED_IN_TESTING).get()
Or, if the results are small, instead of having all of the processes (presumably) fighting over some shared resulting-storing thing, just return them. Then you don't need runMatch at all, just:
if __name__ == '__main__':
with Pool(processes=numProcesses) as pool:
for result in pool.imap_unordered(findNeedle, needleCompanies.iteritems(),
chunkSize=NUMBER_TWEAKED_IN_TESTING):
# Store result
Or, alternatively, if you do want to do exactly N batches, just create a Process for each one:
if __name__ == '__main__':
totalTargets = len(getTargets('all'))
targetsPerBatch = totalTargets / numProcesses
processes = [Process(target=runMatch,
args=(targetsPerBatch,
xrange(0,
totalTargets,
targetsPerBatch)))
for _ in range(numProcesses)]
for p in processes:
p.start()
for p in processes:
p.join()
Also, you seem to be calling getHayStack() once for each task (and getNeedles as well). I'm not sure how easy it would be to end up with multiple copies of this live at the same time, but considering that it's the largest data structure you have by far, that would be the first thing I try to rule out. In fact, even if it's not a memory-usage problem, getHayStack could easily be a big performance hit, unless you're already doing some kind of caching (e.g., explicitly storing it in a global or a mutable default parameter value the first time, and then just using it), so it may be worth fixing anyway.
One way to fix both potential problems at once is to use an initializer in the Pool constructor:
def initPool():
global _haystack
_haystack = getHayStack()
def runMatch(args):
global _haystack
# ...
hayCompanies = _haystack
# ...
if __name__ == '__main__':
pool = Pool(processes=numProcesses, initializer=initPool)
# ...
Next, I notice that you're explicitly generating lists in multiple places where you don't actually need them. For example:
scores = list(results.values())
resultIDs = list(results.keys())
needleID = resultIDs[scores.index(max(scores))]
return [needleID, haystack[needleID], max(scores)]
If there's more than a handful of results, this is wasteful; just use the results.values() iterable directly. (In fact, it looks like you're using Python 2.x, in which case keys and values are already lists, so you're just making an extra copy for no good reason.)
But in this case, you can simplify the whole thing even farther. You're just looking for the key (resultID) and value (score) with the highest score, right? So:
needleID, score = max(results.items(), key=operator.itemgetter(1))
return [needleID, haystack[needleID], score]
This also eliminates all the repeated searches over score, which should save some CPU.
This may not directly solve the memory problem, but it should hopefully make it easier to debug and/or tweak.
The first thing to try is just to use much smaller batches—instead of input_size/cpu_count, try 1. Does memory usage go down? If not, we've ruled that part out.
Next, try sys.getsizeof(_haystack) and see what it says. If it's, say, 1.6GB, then you're cutting things pretty fine trying to squeeze everything else into 0.4GB, so that's the way to attack it—e.g., use a shelve database instead of a plain dict.
Also try dumping memory usage (with the resource module, getrusage(RUSAGE_SELF)) at the start and end of the initializer function. If the final haystack is only, say, 0.3GB, but you allocate another 1.3GB building it up, that's the problem to attack. For example, you might spin off a single child process to build and pickle the dict, then have the pool initializer just open it and unpickle it. Or combine the two—build a shelve db in the first child, and open it read-only in the initializer. Either way, this would also mean you're only doing the CSV-parsing/dict-building work once instead of 8 times.
On the other hand, if your total VM usage is still low (note that getrusage doesn't directly have any way to see your total VM size—ru_maxrss is often a useful approximation, especially if ru_nswap is 0) at time the first task runs, the problem is with the tasks themselves.
First, getsizeof the arguments to the task function and the value you return. If they're large, especially if they either keep getting larger with each task or are wildly variable, it could just be pickling and unpickling that data takes too much memory, and eventually 8 of them are together big enough to hit the limit.
Otherwise, the problem is most likely in the task function itself. Either you've got a memory leak (you can only have a real leak by using a buggy C extension module or ctypes, but if you keep any references around between calls, e.g., in a global, you could just be holding onto things forever unnecessarily), or some of the tasks themselves take too much memory. Either way, this should be something you can test more easily by pulling out the multiprocessing and just running the tasks directly, which is a lot easier to debug.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.