Python multiprocessing: numpy.linalg.pinv causes segfault

I wrote a function using Python's multiprocessing package to try to speed up my code:
from arch.univariate import ARX, GARCH
from multiprocessing import Process
import multiprocessing
import time

def batch_learning(X, lag_array=None):
    """
    X is a time series array
    lag_array contains all possible lag numbers
    """
    # init a queue used for triggering different processes
    queue = multiprocessing.JoinableQueue()
    data = multiprocessing.Queue()

    # a worker called ARX_fit triggered by queue.get()
    def ARX_fit(queue):
        while True:
            q = queue.get()
            q.volatility = GARCH()
            print "Starting to fit lags %s" % str(q.lags.size/2)
            try:
                q_res = q.fit(update_freq=500)
            except:
                print "Error:...."
            print "finished lags %s" % str(q.lags.size/2)
            queue.task_done()

    # init four processes
    for i in range(4):
        process_i = Process(target=ARX_fit, name="Process_%s" % str(i), args=(queue,))
        process_i.start()

    # put ARX model objects into queue continuously
    for num in lag_array:
        queue.put(ARX(X, lags=num))

    # sync processes here
    queue.join()
    return
After calling function:
batch_learning(a, lag_array=range(1,10))
However, it got stuck partway through, and I got the printout below:
Starting to fit lags 1
Starting to fit lags 3
Starting to fit lags 2
Starting to fit lags 4
finished lags 1
finished lags 2
Starting to fit lags 5
finished lags 3
Starting to fit lags 6
Starting to fit lags 7
finished lags 4
Starting to fit lags 8
finished lags 6
finished lags 5
Starting to fit lags 9
It runs forever without any further printouts on my Mac OS El Capitan. Using PyCharm's debug mode, and thanks to Tim Peters' suggestions, I found that the processes actually quit unexpectedly. Under debug mode I can pinpoint that it is the svd function inside numpy.linalg.pinv(), used by the arch library, that causes the problem. So my question is: why? It works in a single-process for loop, but it fails with 2 or more processes. I don't know how to fix this problem. Is it a numpy bug? Can anyone help me a bit here?

I have to answer this question myself and provide my solution. I have already solved this issue, thanks to the help from Tim Peters and aganders.
Multiprocessing often hangs when you use numpy/scipy on Mac OS because of the Accelerate Framework in Apple's OS, which is used in place of the OpenBLAS that numpy is otherwise built against. Simply put, in order to solve this kind of problem, you have to do the following:
uninstall numpy and scipy (the scipy version needs to match the numpy version)
follow the procedure in this link to rebuild numpy with OpenBLAS.
reinstall scipy and test your code to see if it works.
A heads-up for testing multiprocessing code on Mac OS: when you run your code, it is better to set an environment variable:
OPENBLAS_NUM_THREADS=1 python import_test.py
The reason for doing this is that OpenBLAS by default creates 2 threads for each core, in which case there are 8 threads running (2 per core) even though you set up only 4 processes. This adds a bit of thread-switching overhead. I tested the OPENBLAS_NUM_THREADS=1 config to limit each process to 1 thread per core, and it is indeed faster than the default settings.
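If you prefer to set this from inside the script rather than on the command line, a rough sketch is to set the variable before numpy is imported, since OpenBLAS sizes its thread pool when it is first loaded:

import os

# Must happen before the first numpy import, otherwise OpenBLAS has
# already sized its thread pool from the defaults.
os.environ["OPENBLAS_NUM_THREADS"] = "1"

import numpy as np  # numpy (and anything built on it) imported only after the variable is set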

There's not much to go on here, and the code indentation is wrong so it's hard to guess what you're really doing. To the extent I can guess, what you're seeing could happen if the OS killed a process in a way that didn't raise a Python exception.
One thing to try: first make a list, ps, of your four process_i objects. Then before queue.join() add:
while ps:
    new_ps = []
    for p in ps:
        if p.is_alive():
            new_ps.append(p)
        else:
            print("*********", p.name, "exited with", p.exitcode)
    ps = new_ps
    time.sleep(1)
So about once per second, this just runs through the list of worker processes to see whether any have (unexpectedly!) died. If one (or more) has, it displays the process name (which you supplied already) and the process exit code (as given by your OS). If that triggers, it would be a big clue.
If none die, then we have to wonder whether
q_res=q.fit(update_freq=500)
"simply" takes a very long time for some q states.


Faster Startup of Processes Python

I'm trying to run two functions in Python 3 in parallel. They each take about 30ms, and unfortunately, after writing a test script, I've found that the startup time to get the processes running in the background is over 100ms, which is a pretty high overhead that I would like to avoid. Is anybody aware of a faster way to run functions concurrently in Python 3 (with lower overhead, ideally in the ones or tens of milliseconds) where I can still get the results back in the main process? Any guidance on this would be appreciated, and if there is any information that I can provide, please let me know.
For hardware information, I'm running this on a 2019 MacBook Pro with Python 3.10.9 with a 2GHz Quad-Core Intel Core i5.
I've provided the script that I've written below as well as the output that I typically get from it.
import multiprocessing as mp
import time
import numpy as np

def t(s):
    return (time.perf_counter() - s) * 1000

def run0(s):
    print(f"Time to reach run0: {t(s):.2f}ms")
    time.sleep(0.03)
    return np.ones((1, 4))

def run1(s):
    print(f"Time to reach run1: {t(s):.2f}ms")
    time.sleep(0.03)
    return np.zeros((1, 5))

def main():
    s = time.perf_counter()
    with mp.Pool(processes=2) as p:
        print(f"Time to init pool: {t(s):.2f}ms")
        f0 = p.apply_async(run0, args=(time.perf_counter(),))
        f1 = p.apply_async(run1, args=(time.perf_counter(),))
        r0 = f0.get()
        r1 = f1.get()
        print(r0, r1)
        print(f"Time to run end-to-end: {t(s):.2f}ms")

if __name__ == "__main__":
    main()
Below is the output that I typically get from running the above script
Time to init pool: 33.14ms
Time to reach run0: 198.50ms
Time to reach run1: 212.06ms
[[1. 1. 1. 1.]] [[0. 0. 0. 0. 0.]]
Time to run end-to-end: 287.68ms
Note: I'm looking to decrease the times on the 2nd and 3rd lines by a factor of 10-20x. I know that is a lot, and if it is not possible, that is perfectly fine, but I was just wondering if anybody more knowledgeable would know of any methods. Thanks!
Several points to consider:
"Time to init pool" is misleading: the child processes haven't finished starting, only the main process has initiated their startup. Once the workers have actually started, the "Time to reach run" figures should drop, because they no longer include process startup. If you have a long-lived pool of workers, you only pay the startup cost once.
Interpreter startup cost is often dominated by imports. In this case you really only have numpy, and it is used by the target function, so you can't exactly get rid of it. Another import that can be slow is the automatic import of site, but skipping that one makes other imports difficult.
You're on macOS and can switch to using "fork" instead of "spawn", which should be much faster, but it fundamentally changes how multiprocessing works in a few ways (and is incompatible with certain OS libraries); see the sketch after the example below.
example:
import multiprocessing as mp
import time
# import numpy as np

def run():
    time.sleep(0.03)
    return "whatever"

def main():
    s = time.perf_counter()
    with mp.Pool(processes=1) as p:
        p.apply_async(run).get()
        print(f"first job time: {(time.perf_counter() - s)*1000:.2f}ms")
        # first job 166ms with numpy ; 85ms without ; 45ms on linux (wsl2 ubuntu 20.04) with fork
        s = time.perf_counter()
        p.apply_async(run).get()
        print(f"after startup job time: {(time.perf_counter() - s)*1000:.2f}ms")
        # second job about 30ms every time

if __name__ == "__main__":
    main()
You can switch to Python 3.11+, as it has a faster startup time (and faster everything), but as your application grows you will see even slower startup times than in your toy example.
One option is to run your application inside a Linux docker image so you can use fork to avoid the spawn overhead (though the copy-on-write overhead will still be visible).
The ultimate solution? Don't write your application in Python (or any other language with a VM or a garbage collector). Python multiprocessing isn't made for small fast tasks but for long-running tasks; if you need that low a startup time, write it in C or C++.
If you have to use Python, then you should reuse your workers to "absorb" this startup time in a much larger task time.
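The "fork" switch mentioned above could look roughly like this (a sketch only, with the usual caveat that fork on macOS can clash with some system libraries):

import multiprocessing as mp
import time

def run():
    time.sleep(0.03)
    return "whatever"

def main():
    # Ask for the "fork" start method explicitly; on macOS the default is "spawn".
    ctx = mp.get_context("fork")
    s = time.perf_counter()
    with ctx.Pool(processes=2) as p:
        p.apply_async(run).get()
        print(f"first job time: {(time.perf_counter() - s)*1000:.2f}ms")

if __name__ == "__main__":
    main()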

Start a new multiprocessing.pool() process for each one that dies

I'm using a Python API for a piece of proprietary software to run numerical simulations. I need to do quite a few, so I have tried to speed things up using multiprocessing.Pool() to run simulations in parallel. The simulations are independent, and the function passed to multiprocessing.Pool() returns nothing, but the simulation results are saved to disk. As far as I understand, this should be similar to opening X terminals and running a call to the API from each.
Using multiprocessing starts off well: I can see all processors running at 100%, which is expected for the simulations. However, after a while the processes seem to die. Eventually I end up with no active processes but still simulations that have not started. I think the problem is that the API is sometimes a little buggy: certain errors cause the Python kernel to crash, and I think this is likely what is happening with my multiprocessing.Pool().
Is there a way that I can add a new process for each one that dies, so that there will always be processes in the pool? For now I can run the individual simulations that give problems manually.
Below is a minimal working example, but I am not sure how to reproduce an error that causes the kernel to crash, so it is not of much use.
from multiprocessing import Pool
from multiprocessing import cpu_count
import time

def test_function(a, b):
    "Takes in two variables to justify starmap, pause, return nothing"
    print(f'running case {a}')
    # api(a, b) - Runs a simulation and saves output to disk
    # include error that "randomly" crashes python console/process
    time.sleep(5)

if __name__ == '__main__':
    case_names = list(range(60))
    b = 'b'
    inputs = [(a, b) for a in case_names]  # All the inputs in order needed by run_wdi
    start_time = time.time()
    # no_processes = cpu_count()
    no_processes = min(cpu_count(), len(inputs))
    print(f"Using {no_processes} processes on {cpu_count()} cpu's")
    # with Pool(processes=no_processes) as pool:
    with Pool() as pool:
        result = pool.starmap(test_function, inputs)
    end_time = time.time()
    print(f'Total time {end_time-start_time}')
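One possible pattern (a sketch only; api(a, b) stands in for the proprietary call and is hypothetical) is to skip Pool, drive plain Process workers from a shared queue of cases, and respawn a worker whenever one dies while cases remain:

from multiprocessing import Process, Queue, cpu_count
import queue
import time

def worker(pending):
    while True:
        try:
            a, b = pending.get_nowait()
        except queue.Empty:
            return  # no cases left, exit cleanly
        print(f'running case {a}')
        # api(a, b)  # hypothetical simulation call that may crash the process
        time.sleep(5)

if __name__ == '__main__':
    pending = Queue()
    for a in range(60):
        pending.put((a, 'b'))
    workers = [Process(target=worker, args=(pending,)) for _ in range(cpu_count())]
    for w in workers:
        w.start()
    while workers:
        for w in list(workers):
            if not w.is_alive():
                workers.remove(w)
                if w.exitcode != 0 and not pending.empty():
                    # The worker crashed; its current case is lost, but we
                    # replace the process so the remaining cases still run.
                    replacement = Process(target=worker, args=(pending,))
                    replacement.start()
                    workers.append(replacement)
        time.sleep(1)

A crashed worker's in-flight case is lost and would still need to be rerun manually, which matches how the problematic simulations are handled now.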

Python: parallel execution of a function which has a sequential loop inside

I am reproducing some simple 10-arm bandit experiments from Sutton and Barto's book Reinforcement Learning: An Introduction.
Some of these require significant computation time so I tried to get the advantage of my multicore CPU.
Here is the function which I need to run 2000 times. It has 1000 sequential steps which incrementally improve the reward:
import numpy as np

def foo(eps):  # need an (unused) argument to use pool.map()
    # initialising
    # the true values of the actions
    q = np.random.normal(0, 1, size=10)
    # the estimated values
    q_est = np.zeros(10)
    # the counter of how many times each of the 10 actions was chosen
    n = np.zeros(10)
    rewards = []
    for i in range(1000):
        # choose an action based on its estimated value
        a = np.argmax(q_est)
        # get the normally distributed reward
        rewards.append(np.random.normal(q[a], 1))
        # increment the chosen action counter
        n[a] += 1
        # update the estimated value of the action
        q_est[a] += (rewards[-1] - q_est[a]) / n[a]
    return rewards
I execute this function 2000 times to get (2000, 1000) array:
reward = np.array([foo(0) for _ in range(2000)])
Then I plot the mean reward across 2000 experiments:
import matplotlib.pyplot as plt
plt.plot(np.arange(1000), reward.mean(axis=0))
[sequential plot]
which fully corresponds to the expected result (it looks the same as in the book).
But when I try to execute it in parallel, I get a much greater standard deviation of the average reward:
import multiprocessing as mp

with mp.Pool(mp.cpu_count()) as pool:
    reward_p = np.array(pool.map(foo, [0]*2000))

plt.plot(np.arange(1000), reward_p.mean(axis=0))
[parallel plot]
I suppose this is due to the parallelization of the loop inside foo. As I reduce the number of cores allocated to the task, the reward plot approaches the expected shape.
Is there a way to get the advantage of the multiprocessing here while getting the correct results?
UPD:
I tried running the same code on Windows 10, and the sequential and parallel results turned out to be the same! What might be the reason?
Ubuntu 20.04, Python 3.8.5, jupyter
Windows 10, Python 3.7.3, jupyter
As we found out, it behaves differently on Windows and Ubuntu. It is probably because of this (from the Python docs on start methods):
spawn: The parent process starts a fresh Python interpreter process. The child process will only inherit those resources necessary to run the process object's run() method. In particular, unnecessary file descriptors and handles from the parent process will not be inherited. Starting a process using this method is rather slow compared to using fork or forkserver. Available on Unix and Windows. The default on Windows and macOS.

fork: The parent process uses os.fork() to fork the Python interpreter. The child process, when it begins, is effectively identical to the parent process. All resources of the parent are inherited by the child process. Note that safely forking a multithreaded process is problematic. Available on Unix only. The default on Unix.
Try adding this line to your code:
mp.set_start_method('spawn')
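Applied as a standalone script, the suggestion would look roughly like this (spawn requires the worker function to be importable, so it may not work pasted directly into a Jupyter cell). The likely reason it helps is that spawned workers seed NumPy's random generator independently, while forked workers can inherit the parent's already-initialized random state:

import multiprocessing as mp
import numpy as np

def foo(eps):  # unused argument so pool.map() can feed it
    q = np.random.normal(0, 1, size=10)   # true action values
    q_est = np.zeros(10)                   # estimated values
    n = np.zeros(10)                       # choice counters
    rewards = []
    for i in range(1000):
        a = np.argmax(q_est)
        rewards.append(np.random.normal(q[a], 1))
        n[a] += 1
        q_est[a] += (rewards[-1] - q_est[a]) / n[a]
    return rewards

if __name__ == '__main__':
    # Force 'spawn' (already the default on Windows and macOS, but not on Linux).
    mp.set_start_method('spawn')
    with mp.Pool(mp.cpu_count()) as pool:
        reward_p = np.array(pool.map(foo, [0] * 2000))
    print(reward_p.mean(axis=0)[:5])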

Python multiprocessing pool.map self._event.wait(timeout) is hanging. Why is pool.map wait not responding?

multiprocessing pool.map works nicely on my old PC but does not work on the new PC.
It hangs in the call to
def wait(self, timeout=None):
    self._event.wait(timeout)
at which point the CPU utilization drops to zero percent with no further response, as if it has gone to sleep.
I wrote a simple test.py as follows
import multiprocessing as mp

letters = ['A', 'B', 'C']

def doit(letter):
    for i in range(1000):
        print(str(letter) + ' ' + str(i))

if __name__ == '__main__':
    pool = mp.Pool()
    pool.map(doit, letters)
This works on the old PC with an i7-7700k (4 cores, 8 logical), Python 3.6.5 64-bit, Win10 Pro, PyCharm 2018.1, where stdout displays the letters and numbers in non-sequential order as expected.
However, the same code does not work on the new build: i9-7960 (16 cores, 32 logical), Python 3.7 64-bit, Win10 Pro, PyCharm 2018.3.
The new PC's BIOS has not been updated since 2017/11 (4 months older).
pool.py appears to be the same on both machines (2006-2008 R Oudkerk)
The codeline where it hangs in the 'wait' function is ...
self._event.wait(timeout)
Any help please on where I might look next to find the cause.
Thanks in advance.
....
EDIT:
My further interpretation -
1. The GIL (Global Interpreter Lock) is not relevant here, as it relates to multithreading only, not multiprocessing.
2. multiprocessing.Manager is unnecessary here, as the code consumes static input and produces independent output. So pool.close and pool.join are not required either, as I am not joining results afterwards.
3. This link is a good introduction to multiprocessing, though I don't see a solution in it.
https://docs.python.org/2/library/multiprocessing.html#windows

Python multithreading - memory not released when run using a while statement

I built a scraper (worker) that is launched XX times through multithreading (via Jupyter Notebook, Python 2.7, Anaconda).
The script has the following format, as described on python.org:
# Python 2.7 imports for the queue/thread pattern below
from Queue import Queue
from threading import Thread

def worker():
    while True:
        item = q.get()
        do_work(item)
        q.task_done()

q = Queue()
for i in range(num_worker_threads):
    t = Thread(target=worker)
    t.daemon = True
    t.start()

for item in source():
    q.put(item)

q.join()  # block until all tasks are done
When I run the script as is, there are no issues. Memory is released after the script finishes.
However, I want to run the script 20 times (batching of a sort),
so I turned the script above into a function and run the function using the code below:
def multithreaded_script():
    # my script - the code from above
    pass

x = 0
while x < 20:
    x += 1
    multithreaded_script()
Memory builds up with each iteration, and eventually the system starts writing it to disk.
Is there a way to clear out the memory after each run?
I tried:
setting all the variables to None
setting sleep(30) at the end of each iteration (in case it takes time for the RAM to be released)
and nothing seems to help.
Any ideas on what else I can try to get the memory to clear out after each run within the while statement?
If not, is there a better way to execute my script XX times that would not eat up the RAM?
Thank you in advance.
TL;DR solution: make sure to end each function with return to ensure all local variables are released from RAM.
Per Pavel's suggestion, I used a memory tracker (unfortunately the suggested memory tracker didn't work for me, so I used Pympler).
Implementation was fairly simple:
from pympler.tracker import SummaryTracker

tracker = SummaryTracker()

# ~~~~~~~~~ YOUR CODE ~~~~~~~~~

tracker.print_diff()
The tracker gave a nice output, which made it obvious that local variables generated by functions were not being destroyed.
Adding "return" at the end of every function fixed the issue.
Takeaway:
If you are writing a function that processes info and generates local variables but doesn't pass them to anything else, make sure to end the function with return anyway. This will help prevent the memory issues you may otherwise run into.
Additional notes on memory usage & BeautifulSoup:
If you are using BeautifulSoup / BS4 with multithreading and multiple workers, and have a limited amount of free RAM, you can also use soup.decompose() to destroy the soup object right after you are done with it, instead of waiting for the function to return or the code to stop running.
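A rough sketch of that decompose pattern (the HTML and the extraction step here are only illustrative):

from bs4 import BeautifulSoup

def do_work(html):
    soup = BeautifulSoup(html, "html.parser")
    titles = [tag.get_text() for tag in soup.find_all("h1")]
    # Free the parsed tree immediately instead of waiting for the
    # function to return.
    soup.decompose()
    return titles

print(do_work("<html><body><h1>hello</h1></body></html>"))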
