Is numpy rng thread safe?

Is numpy rng thread safe? - python

I implemented a function that uses the numpy random generator to simulate some process. Here is a minimal example of such a function:
def thread_func(cnt, gen):
s = 0.0
for _ in range(cnt):
s += gen.integers(6)
return s
Now I wrote a function that uses python's starmap to call the thread_func. If I were to write it like this (passing the same rng reference to all processes):
from multiprocessing import Pool
import numpy as np
def evaluate(total_cnt, thread_cnt):
gen = np.random.default_rng()
cnt_per_thread = total_cnt // thread_cnt
with Pool(thread_cnt) as p:
vals = p.starmap(thread_func, [(cnt_per_thread,gen) for _ in range(thread_cnt)])
return vals
The result of evaluate(100000, 5) is an array of 5 same values, for example:
[49870.0, 49870.0, 49870.0, 49870.0, 49870.0]
However if I pass a different rng to all processes, for example by doing:
vals = p.starmap(thread_func, [(cnt_per_thread,np.random.default_rng()) for _ in range(thread_cnt)])
I get the expected result (5 different values), for example:
[49880.0, 49474.0, 50232.0, 50038.0, 50191.0]
Why does this happen?

TL;DR: as pointed out by #MichaelSzczesny, the main problem appear that you use processes which operate on a copy of the same RNG object having the same initial state.
Random number generator (RNG) objects are initialized with an integer called a seed which is modified when a new number is generated using an iterative operation (eg. (seed * huge_number) % another_huge_number).
It is not a good idea to use the same RNG object for multiple threads operations on it are inherently sequential. In the best case, if two threads accesses it in a protected way (eg. using critical sections), the result is dependent of the ordering of the thread. Additionally, performance is affected since doing that cause an effect called cache line bouncing slowing down the execution of the threads accessing to the same object. In the worst case, the RNG object is unprotected and this cause a race condition. Such an issue cause the seed to be possibly the same for multiple threads and so the result (that was supposed to be random).
CPython uses giant mutex called the global interpreter lock (GIL) that protects access to Python objects. It prevents multiple threads from executing Python bytecodes at once. The goal is to protect the interpreter but not the object state. Many function of Numpy release the GIL so the code can scale in parallel. The thing is it cause race condition if you use them from the same thread. It is your responsibility to use locks to protect Numpy objects.
In your case, I cannot reproduce the problem with thread but I can with processes. Thus, I think you use processes in your example. For processes, you should use:
from multiprocessing import Pool
And for threads you should use:
from multiprocessing.pool import ThreadPool as Pool
Processes behave differently from threads because they do not operate on shared objects (at least not by default). Instead, processes operates on object copies. Processes produce the same output since the initial state of the RNG object is the same in all processes.
Put it shortly, please use one different RNG per thread. A typical solution is to create N threads with they own RNG object and then communicate with them to send some work (eg. using queues). This is called a thread pool. An alternative option might be to use thread local storage.
Note that the Numpy documentation provides an example in Section Multithreaded Generation.

Related

Concurrent Futures: When and how to implement?

from concurrent.futures import ProcessPoolExecutor
from concurrent.futures import as_completed
import numpy as np
import time
#creating iterable
testDict = {}
for i in range(1000):
testDict[i] = np.random.randint(1,10)
#default method
stime = time.time()
newdict = []
for k, v in testDict.items():
for i in range(1000):
v = np.tanh(v)
newdict.append(v)
etime = time.time()
print(etime - stime)
#output: 1.1139910221099854
#multi processing
stime = time.time()
testresult = []
def f(item):
x = item[1]
for i in range(1000):
x = np.tanh(x)
return x
def main(testDict):
with ProcessPoolExecutor(max_workers = 8) as executor:
futures = [executor.submit(f, item) for item in testDict.items()]
for future in as_completed(futures):
testresult.append(future.result())
if __name__ == '__main__':
main(testDict)
etime = time.time()
print(etime - stime)
#output: 3.4509658813476562
Learning multiprocessing and testing stuff. Ran a test to check if I have implemented this correctly. Looking at the output time taken, concurrent method is 3 times slower. So what's wrong?
My objective is to parallelize a script which mostly operates on a dictionary of around 500 items. Each loop, values of those 500 items are processed and updated. This loops for let's say 5000 generations. None of the k,v pairs interact with other k,v pairs. [Its a genetic algorithm].
I am also looking at guidance on how to parallelize the above described objective. If I use the correct concurrent futures method on each of my function in my genetic algorithm code, where each function takes an input of a dictionary and outputs a new dictionary, will it be useful? Any guides/resources/help is appreciated.
Edit: If I run this example: https://docs.python.org/3/library/concurrent.futures.html#processpoolexecutor-example, it takes 3 times more to solve than a default for loop check.

There are a couple basic problems here, you're using numpy but you're not vectorizing your calculations. You'll not benefit from numpy's speed benefit with the way you write your code here, and might as well just use the standard library math module, which is faster than numpy for this style of code:
# 0.089sec
import math
for k, v in testDict.items():
for i in range(1000):
v = math.tanh(v)
newdict.append(v)
Once you vectorise the operation, only then you see the benefit of numpy:
# 0.016sec
for k, v in testDict.items():
arr = no.full(1000, v)
arr2 = np.tanh(arr)
newdict.append(arr2[-1])
For comparison, your original single threaded code runs in 1.171sec on my test machine. As you can see here, when it's not used properly, NumPy can be a couple orders of magnitude slower than even pure Python.
Now on to why you're seeing what you're seeing.
To be honest, I can't replicate your timing results. Your original multiprocessing code runs in 0.299sec for me macOS on Python 3.6), which is faster than the single process code. But if I have to take a guess, you're probably using Windows? In some platforms like Windows, creating a child process and setting up an environment to run multiprocessing task is very expensive, so using multiprocessing for a task that lasts less than a few seconds is of dubious benefit. If your are interested in why, read here.
Also, in platforms that lacks a usable fork() like MacOS after Python 3.8 or Windows, when you use multiprocessing, the child process has to reimport the module, so if you put both code in the same file, it has to run your single threaded code in the child processes before it can run the multiprocessing code. You'll likely want to put your test code in a function and protect the top level code with if __name__ == "__main__" block. On Mac with Python 3.8 or higher, you can also revert to using fork method by calling multiprocessing.set_start_method("fork") if you're not calling into Mac's non-fork-safe framework libraries.
With that out of the way, on to your title question.
When you use multiprocessing, you need to copy data to the child process and back to the main process to retrieve the result and there's a cost to spawning child processes. To benefit from multiprocessing, you need to design your workload so that this part of the cost is negligible.
If your data comes from external source, try loading the data in the child processes, rather than having the main process load the data then transfer it to the child process, have the main process tell the child how to fetch its slice of data. Here you're generating the testDict in the main process, so if you can, parallelize that and move them to the children instead.
Also, since you're using numpy, if you vectorise your operations properly, numpy will release the GIL while doing vectorised operations, so you may be able to just use multithreading instead. Since numpy doesn't hold GIL during vector operation, you can take advantage of multiple threads in a single Python process, and you don't need to fork or copy data over to child processes, as threads share memory.

Does multiprocessing copy the object in this scenario?

import multiprocessing
import numpy as np
import multiprocessing as mp
import ctypes
class Test():
def __init__(self):
shared_array_base = multiprocessing.Array(ctypes.c_double, 100, lock=False)
self.a = shared_array = np.ctypeslib.as_array(shared_array_base)
def my_fun(self,i):
self.a[i] = 1
if __name__ == "__main__":
num_cores = multiprocessing.cpu_count()
t = Test()
def my_fun_wrapper(i):
t.my_fun(i)
with mp.Pool(num_cores) as p:
p.map(my_fun_wrapper, np.arange(100))
print(t.a)
In the code above, I'm trying to write a code to modify an array, using multiprocessing. The function my_fun(), executed in each process, should modify the value for the array a[:] at index i which is passed to my_fun() as a parameter. With regards to the code above, I would like to know what is being copied.
1) Is anything in the code being copied by each process? I think the object might be but ideally nothing is.
2) Is there a way to get around using a wrapper function my_fun() for the object?

Almost everything in your code is getting copied, except the shared memory you allocated with multiprocessing.Array. multiprocessing is full of unintuitive, implicit copies.
When you spawn a new process in multiprocessing, the new process needs its own version of just about everything in the original process. This is handled differently depending on platform and settings, but we can tell you're using "fork" mode, because your code wouldn't work in "spawn" or "forkserver" mode - you'd get an error about the workers not being able to find my_fun_wrapper. (Windows only supports "spawn", so we can tell you're not on Windows.)
In "fork" mode, this initial copy is made by using the fork system call to ask the OS to essentially copy the whole entire process and everything inside. The memory allocated by multiprocessing.Array is sort of "external" and isn't copied, but most other things are. (There's also copy-on-write optimization, but copy-on-write still behaves as if everything was copied, and the optimization doesn't work very well in Python due to refcount updates.)
When you dispatch tasks to worker processes, multiprocessing needs to make even more copies. Any arguments, and the callable for the task itself, are objects in the master process, and objects inherently exist in only one process. The workers can't access any of that. They need their own versions. multiprocessing handles this second round of copies by pickling the callable and arguments, sending the serialized bytes over interprocess communication, and unpickling the pickles in the worker.
When the master pickles my_fun_wrapper, the pickle just says "look for the my_fun_wrapper function in the __main__ module", and the workers look up their version of my_fun_wrapper to unpickle it. my_fun_wrapper looks for a global t, and in the workers, that t was produced by the fork, and the fork produced a t with an array backed by the shared memory you allocated with your original multiprocessing.Array call.
On the other hand, if you try to pass t.my_fun to p.map, then multiprocessing has to pickle and unpickle a method object. The resulting pickle doesn't say "look up the t global variable and get its my_fun method". The pickle says to build a new Test instance and get its my_fun method. The pickle doesn't have any instructions in it about using the shared memory you allocated, and the resulting Test instance and its array are independent of the original array you wanted to modify.
I know of no good way to avoid needing some sort of wrapper function.

Python Using Multiprocessing

I am trying to use multiprocessing in python 3.6. I have a for loopthat runs a method with different arguments. Currently, it is running one at a time which is taking quite a bit of time so I am trying to use multiprocessing. Here is what I have:
def test(self):
for key, value in dict.items():
pool = Pool(processes=(cpu_count() - 1))
pool.apply_async(self.thread_process, args=(key,value))
pool.close()
pool.join()
def thread_process(self, key, value):
# self.__init__()
print("For", key)
I think what my code is using 3 processes to run one method but I would like to run 1 method per process but I don't know how this is done. I am using 4 cores btw.

You're making a pool at every iteration of the for loop. Make a pool beforehand, apply the processes you'd like to run in multiprocessing, and then join them:
from multiprocessing import Pool, cpu_count
import time
def t():
# Make a dummy dictionary
d = {k: k**2 for k in range(10)}
pool = Pool(processes=(cpu_count() - 1))
for key, value in d.items():
pool.apply_async(thread_process, args=(key, value))
pool.close()
pool.join()
def thread_process(key, value):
time.sleep(0.1) # Simulate a process taking some time to complete
print("For", key, value)
if __name__ == '__main__':
t()

You're not populating your multiprocessing.Pool with data - you're re-initializing the pool on each loop. In your case you can use Pool.map() to do all the heavy work for you:
def thread_process(args):
print(args)
def test():
pool = Pool(processes=(cpu_count() - 1))
pool.map(thread_process, your_dict.items())
pool.close()
if __name__ == "__main__": # important guard for cross-platform use
test()
Also, given all those self arguments I reckon you're snatching this off of a class instance and if so - don't, unless you know what you're doing. Since multiprocessing in Python essentially works as, well, multi-processing (unlike multi-threading) you don't get to share your memory, which means your data is pickled when exchanging between processes, which means anything that cannot be pickled (like instance methods) doesn't get called. You can read more on that problem on this answer.

I think what my code is using 3 processes to run one method but I would like to run 1 method per process but I don't know how this is done. I am using 4 cores btw.
No, you are in fact using the correct syntax here to utilize 3 cores to run an arbitrary function independently on each. You cannot magically utilize 3 cores to work together on one task with out explicitly making that a part of the algorithm itself/ coding that your self often using threads (which do not work the same in python as they do outside of the language).
You are however re-initializing the pool every loop you'll need to do something like this instead to actually perform this properly:
cpus_to_run_on = cpu_count() - 1
pool = Pool(processes=(cpus_to_run_on)
# don't call a dictionary a dict, you will not be able to use dict() any
# more after that point, that's like calling a variable len or abs, you
# can't use those functions now
pool.map(your_function, your_function_args)
pool.close()
Take a look at the python multiprocessing docs for more specific information if you'd like to get a better understanding of how it works. Under python, you cannot utilize threading to do multiprocessing with the default CPython interpreter. This is because of something called the global interpreter lock, which stops concurrent resource access from within python itself. The GIL doesn't exist in other implementations of the language, and is not something other languages like C and C++ have to deal with (and thus you can actually use threads in parallel to work together on a task, unlike CPython)
Python gets around this issue by simply making multiple interpreter instances when using the multiprocessing module, and any message passing between instances is done via copying data between processes (ie the same memory is typically not touched by both interpreter instances). This does not however happen in the misleadingly named threading module, which often actually slow processes down because of a process called context switching. Threading today has limited usefullness, but provides an easier way around non GIL locked processes like socket and file reads/writes than async python.
Beyond all this though there is a bigger problem with your multiprocessing. Your writing to standard output. You aren't going to get the gains you want. Think about it. Each of your processes "print" data, but its all being displayed in one terminal/output screen. So even if your processes are "printing" they aren't really doing that independently, and the information has to be coalesced back into another processes where the text interface lies (ie your console). So these processes write whatever they were going to to some sort of buffer, which then has to be copied (as we learned from how multiprocessing works) to another process which will then take that buffered data and output it.
Typically dummy programs use printing as a means of showing how there is no order between execution of these processes, that they can finish at different times, they aren't meant to demonstrate the performance benefits of multi core processing.

I have experimented a bit this week with multiprocessing. The fastest way that I discovered to do multiprocessing in python3 is using imap_unordered, at least in my scenario. Here is a script you can experiment with using your scenario to figure out what works best for you:
import multiprocessing
NUMBER_OF_PROCESSES = multiprocessing.cpu_count()
MP_FUNCTION = 'imap_unordered' # 'imap_unordered' or 'starmap' or 'apply_async'
def process_chunk(a_chunk):
print(f"processig mp chunk {a_chunk}")
return a_chunk
map_jobs = [1, 2, 3, 4]
result_sum = 0
if MP_FUNCTION == 'imap_unordered':
pool = multiprocessing.Pool(processes=NUMBER_OF_PROCESSES)
for i in pool.imap_unordered(process_chunk, map_jobs):
result_sum += i
elif MP_FUNCTION == 'starmap':
pool = multiprocessing.Pool(processes=NUMBER_OF_PROCESSES)
try:
map_jobs = [(i, ) for i in map_jobs]
result_sum = pool.starmap(process_chunk, map_jobs)
result_sum = sum(result_sum)
finally:
pool.close()
pool.join()
elif MP_FUNCTION == 'apply_async':
with multiprocessing.Pool(processes=NUMBER_OF_PROCESSES) as pool:
result_sum = [pool.apply_async(process_chunk, [i, ]).get() for i in map_jobs]
result_sum = sum(result_sum)
print(f"result_sum is {result_sum}")
I found that starmap was not too far behind in performance, in my scenario it used more cpu and ended up being a bit slower. Hope this boilerplate helps.

Multiprocessing Pool in Python - Only single CPU is utilized

Original Question
I am trying to use multiprocessing Pool in Python. This is my code:
def f(x):
return x
def foo():
p = multiprocessing.Pool()
mapper = p.imap_unordered
for x in xrange(1, 11):
res = list(mapper(f,bar(x)))
This code makes use of all CPUs (I have 8 CPUs) when the xrange is small like xrange(1, 6). However, when I increase the range to xrange(1, 10). I observe that only 1 CPU is running at 100% while the rest are just idling. What could be the reason? Is it because, when I increase the range, the OS shutdowns the CPUs due to overheating?
How can I resolve this problem?
minimal, complete, verifiable example
To replicate my problem, I have created this example: Its a simple ngram generation from a string problem.
#!/usr/bin/python
import time
import itertools
import threading
import multiprocessing
import random
def f(x):
return x
def ngrams(input_tmp, n):
input = input_tmp.split()
if n > len(input):
n = len(input)
output = []
for i in range(len(input)-n+1):
output.append(input[i:i+n])
return output
def foo():
p = multiprocessing.Pool()
mapper = p.imap_unordered
num = 100000000 #100
rand_list = random.sample(xrange(100000000), num)
rand_str = ' '.join(str(i) for i in rand_list)
for n in xrange(1, 100):
res = list(mapper(f, ngrams(rand_str, n)))
if __name__ == '__main__':
start = time.time()
foo()
print 'Total time taken: '+str(time.time() - start)
When num is small (e.g., num = 10000), I find that all 8 CPUs are utilised. However, when num is substantially large (e.g.,num = 100000000). Only 2 CPUs are used and rest are idling. This is my problem.
Caution: When num is too large it may crash your system/VM.

First, ngrams itself takes a lot of time. While that's happening, it's obviously only one one core. But even when that finishes (which is very easy to test by just moving the ngrams call outside the mapper and throwing a print in before and after it), you're still only using one core. I get 1 core at 100% and the other cores all around 2%.
If you try the same thing in Python 3.4, things are a little different—I still get 1 core at 100%, but the others are at 15-25%.
So, what's happening? Well, in multiprocessing, there's always some overhead for passing parameters and returning values. And in your case, that overhead completely swamps the actual work, which is just return x.
Here's how the overhead works: The main process has to pickle the values, then put them on a queue, then wait for values on another queue and unpickle them. Each child process waits on the first queue, unpickles values, does your do-nothing work, pickles the values, and puts them on the other queue. Access to the queues has to be synchronized (by a POSIX semaphore on most non-Windows platforms, I think an NT kernel mutex on Windows).
From what I can tell, your processes are spending over 99% of their time waiting on the queue or reading or writing it.
This isn't too unexpected, given that you have a large amount of data to process, and no computation at all beyond pickling and unpickling that data.
If you look at the source for SimpleQueue in CPython 2.7, the pickling and unpickling happens with the lock held. So, pretty much all the work any of your background processes do happens with the lock held, meaning they all end up serialized on a single core.
But in CPython 3.4, the pickling and unpickling happens outside the lock. And apparently that's enough work to use up 15-25% of a core. (I believe this change happened in 3.2, but I'm too lazy to track it down.)
Still, even on 3.4, you're spending far more time waiting for access to the queue than doing anything, even the multiprocessing overhead. Which is why the cores only get up to 25%.
And of course you're spending orders of magnitude more time on the overhead than the actual work, which makes this not a great test, unless you're trying to test the maximum throughput you can get out of a particular multiprocessing implementation on your machine or something.
A few observations:
In your real code, if you can find a way to batch up larger tasks (explicitly—just relying on chunksize=1000 or the like here won't help), that would probably solve most of your problem.
If your giant array (or whatever) never actually changes, you may be able to pass it in the pool initializer, instead of in each task, which would pretty much eliminate the problem.
If it does change, but only from the main process side, it may be worth sharing rather than passing the data.
If you need to mutate it from the child processes, see if there's a way to partition the data so each task can own a slice without contention.
Even if you need fully-contended shared memory with explicit locking, it may still be better than passing something this huge around.
It may be worth getting a backport of the 3.2+ version of multiprocessing or one of the third-party multiprocessing libraries off PyPI (or upgrading to Python 3.x), just to move the pickling out of the lock.

The problem is that your f() function (which is the one running on separate processes) is doing nothing special, hence it is not putting load on the CPU.
ngrams(), on the other hand, is doing some "heavy" computation, but you are calling this function on the main process, not in the pool.
To make things clearer, consider that this piece of code...
for n in xrange(1, 100):
res = list(mapper(f, ngrams(rand_str, n)))
...is equivalent to this:
for n in xrange(1, 100):
arg = ngrams(rand_str, n)
res = list(mapper(f, arg))
Also the following is a CPU-intensive operation that is being performed on your main process:
num = 100000000
rand_list = random.sample(xrange(100000000), num)
You should either change your code so that sample() and ngrams() are called inside the pool, or change f() so that it does something CPU-intensive, and you'll see a high load on all of your CPUs.

multiprocessing.pool context and load balancing

I've encountered some unexpected behaviour of the python multiprocessing Pool class.
Here are my questions:
1) When does Pool creates its context, which is later used for serialization? The example below runs fine as long as the Pool object is created after the Container definition. If you swap the Pool initializations, serialization error occurs. In my production code I would like to initialize Pool way before defining the container class. Is it possible to refresh Pool "context" or to achieve this in another way.
2) Does Pool have its own load balancing mechanism and if so how does it work?
If I run a similar example on my i7 machine with the pool of 8 processes I get the following results:
- For a light evaluation function Pool favours using only one process for computation. It creates 8 processes as requested but for most of the time only one is used (I printed the pid from inside and also see this in htop).
- For a heavy evaluation function the behaviour is as expected. It uses all 8 processes equally.
3) When using Pool I always see 4 more processes that I requested (i.e. for Pool(processes=2) I see 6 new processes). What is their role?
I use Linux with Python 2.7.2
from multiprocessing import Pool
from datetime import datetime
POWER = 10
def eval_power(container):
for power in xrange(2, POWER):
container.val **= power
return container
#processes = Pool(processes=2)
class Container(object):
def __init__(self, value):
self.val = value
processes = Pool(processes=2)
if __name__ == "__main__":
cont = [Container(foo) for foo in xrange(20)]
then = datetime.now()
processes.map(eval_power, cont)
now = datetime.now()
print "Eval time:", now - then
EDIT - TO BAKURIU
1) I was afraid that that's the case.
2) I don't understand what the linux scheduler has to do with python assigning computations to processes. My situation can be ilustrated by the example below:
from multiprocessing import Pool
from os import getpid
from collections import Counter
def light_func(ind):
return getpid()
def heavy_func(ind):
for foo in xrange(1000000):
ind += foo
return getpid()
if __name__ == "__main__":
list_ = range(100)
pool = Pool(4)
l_func = pool.map(light_func, list_)
h_func = pool.map(heavy_func, list_)
print "light func:", Counter(l_func)
print "heavy func:", Counter(h_func)
On my i5 machine (4 threads) I get the following results:
light func: Counter({2967: 100})
heavy func: Counter({2969: 28, 2967: 28, 2968: 23, 2970: 21})
It seems that the situation is as I've described it. However I still don't understand why python does it this way. My guess would be that it tries to minimise communication expenses, but still the mechanism which it uses for load balancing is unknown. The documentation isn't very helpful either, the multiprocessing module is very poorly documented.
3) If I run the above code I get 4 more processes as described before. The screen comes from htop: http://i.stack.imgur.com/PldmM.png

The Pool object creates the subprocesses during the call to __init__ hence you must define Container before. By the way, I wouldn't include all the code in a single file but use a module to implement the Container and other utilities and write a small file that launches the main program.
The Pool does exactly what is described in the documentation. In particular it has no control over the scheduling of the processes hence what you see is what Linux's scheduler thinks it is right. For small computations they take so little time that the scheduler doesn't bother parallelizing them(this probably have better performances due to core affinity etc.)
Could you show this with an example and what you see in the task manager? I think they may be the processes that handle the queue inside the Pool, but I'm not sure. On my machine I can see only the main process plus the two subprocesses.
Update on point 2:
The Pool object simply puts the tasks into a queue, and the child processes get the arguments from this queue. If a process takes almost no time to execute an object, than Linux scheduler let the process execute more time(hence consuming more items from the queue). If the execution takes much time then this scheduler will change processes and thus the other child processes are also executed.
In your case a single process is consuming all items because the computation take so little time that before the other child processes are ready it has already finished all items.
As I said, Pool doesn't do anything about balancing the work of the subprocesses. It's simply a queue and a bunch of workers, the pool puts items in the queue and the processes get the items and compute the results. AFAIK the only thing that it does to control the queue is putting a certain number of tasks in a single item in the queue(see the documentation) but there is no guarantee about which process will grab which task. Everything else is left to the OS.
On my machine the results are less extreme. Two processes get about twice the number of calls than the other two for the light computation, while for the heavy one all have more or less the same number of items processed. Probably on different OSes and/or hardware we would obtain even different results.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.