ipyparallel's LoadBalancedView bloats memory, how can I avoid that?

ipyparallel's LoadBalancedView bloats memory, how can I avoid that? - python

This issue may be related to https://github.com/ipython/ipyparallel/issues/207 which is also not marked as solved, yet.
I also opened this issue here https://github.com/ipython/ipyparallel/issues/286
I want to execute multiple tasks in parallel using python and ipyparallel in a jupyter notebook and using 4 local engines by executing ipcluster start in a local console.
Besides that one can also use DirectView, I use LoadBalancedView to map a set of tasks. Each task takes around 0.2 seconds (can vary though) and each task does a MySQL query where it loads some data and then processes it.
Working with ~45000 tasks works fine, however, my memory grows really high. This is actually bad because I want to run another experiment with over 660000 tasks which I can't run anymore because it bloats up my memory limit of 16 GB and then the memory swapping on my local drive starts. However, when using the DirectView my memory grows relatively small and is never full. But I actually need LoadBalancedView.
Even when running a minimal working example without database query this happens (see below).
I am not perfectly familiar with the ipyparallel library but I've read something about logs and caches that the ipcontroler does which may cause this. I am still not sure if it is a bug or if I can change some settings to avoid my problem.
Running a MWE
For my Python 3.5.3 environment running on Windows 10 I use the following (recent) packages:
ipython 6.1.0
ipython_genutils 6.1.0
ipyparallel 6.0.2
jupyter 1.0.0
jupyter_client 4.4.0
jupyter_console 5.0.0
jupyter_core 4.2.0
I would like the following example to work for LoadBalancedView without the immense memory growth (if possible at all):
Start ipcluster start on a console
Run a jupyter notebook with the following three cells:
<1st cell>
import ipyparallel as ipp
rc = ipp.Client()
lview = rc.load_balanced_view()
<2nd cell>
%%px --local
import time
<3rd cell>
def sleep_here(i):
time.sleep(0.2)
return 42
amr = lview.map_async(sleep_here, range(660000))
amr.wait_interactive()

Related

Python multiprocessing uses all cores even when limited

I am using the multiprocessing module from Python in order to speed-up my code, but for some reason the program ends up using more than the number of cores I specify. Indeed, when looking at htop to check the core usage, I see that all 60 available cores are used at times instead of 10 as needed.
As an example, this code already produces the behavior:
import numpy as np
import multiprocessing
def square(i):
size = 10000
return np.dot(np.random.normal(size=size), np.random.normal(size=(size, size)) # np.random.normal(size=size))
with multiprocessing.Pool(10) as pool:
t = pool.map(square, range(100))
I am running this in a Jupyter Notebook inside a JupyterHub. Up until a few hours ago, everything was running smoothly (meaning that the number of cores used was not going beyond 10), the only big change that happened is that I updated all my packages via conda update --all and specifically updated scipy with conda install -c conda-forge scipy. However I have no idea what could have caused the issue.

app package built by pyinstaller fails after it uses tqdm

I have a program which can optionally use tqdm to display a progress bar for long calculations.
I've packaged the program using pyinstaller.
The packaged program runs to completion just fine if I don't enable tqdm.
If I enable tqdm, the package works fine and displays progress bars until it gets to the very end, then the message below is displayed and the packaged program is relaunched (with the wrong args).
multiprocessing/resource_tracker.py:104: UserWarning: resource_tracker:
process died unexpectedly, relaunching. Some resources might leak.
There is nothing fancy about the way the program invokes tqdm.
There is a loop that iterates over the members of a list.
The program does not import the multiprocessing module.
def best_cost(self, guesses, steps=1, verbose=0) -> None:
""" find the guess with the minimum cost """
self.cost = float("inf")
for guess in guesses if verbose < 3 else tqdm(guesses, mininterval=1.0):
groups = dict()
for (word, frequency) in self.words.items():
result = calc_result(guess, word)
groups.setdefault(result, {})[word] = frequency
self.check_cost(guess, groups, steps=steps)
return
The problem occurs whether pyinstaller creates the package with --onedir or --onefile.
I'm running on MACOS Monterey V12.4 on an Intel processor.
I'm using python 3.10.4, but the problem also occurs with 3.9. Pip installed packages include: pyinstaller 5.1, pyinstaller-hooks-contrib 2022.7, and tqdm 4.64.0.
I package the program in a virtualenv.
I found this in the documentation for the multiprocessing module:
On Unix using the spawn or forkserver start methods will also start a resource tracker process which tracks the unlinked named system resources (such as named semaphores or SharedMemory objects) created by processes of the program. When all processes have exited the resource tracker unlinks any remaining tracked object. Usually there should be none, but if a process was killed by a signal there may be some “leaked” resources. (Neither leaked semaphores nor shared memory segments will be automatically unlinked until the next reboot. This is problematic for both objects because the system allows only a limited number of named semaphores, and shared memory segments occupy some space in the main memory.)
I can't figure out how to debug the the problem since it only occurs with the pyinstaller packaged program and my IDE (vscode) doesn't help.
My program doesn't use the multiprocessing module.
The tqdm documentation doesn't mention issues like this, but it does include the multiprocessing module to create locks (semaphores) in the tqdm class which
get used by the instances.
There aren't any tqdm methods I can find to delete the semaphores resources;
I guess that is just supposed to happen automatically when the program ends.
I think the pyinstaller packaged program may also use the multiprocessing
module.
Perhaps pyinstaller is interfering with deletion of the tqdm semaphores somehow?
The packaged program works fine till it starts to exit and then gets restarted.
Does anyone know how I can get the packaged program to exit cleanly?

pytorch profiler nvidia cuda permission error (microsoft wsl): CUPTI_ERROR_NOT_INITIALIZED, CUPTI_ERROR_INSUFFICIENT_PRIVILEGES

I am trying to run a profiling script for pytorch on MS WSL 2.0 Ubuntu 20.04.
WSL is on the newest version (wsl --update). I am running the stable conda pytorch cuda 11.3 version from the pytorch website with pytorch 1.11. My GPU is a GTX 1650 Ti.
I can run my script fine and it finishes without error, but when I try to profile it using pytorch's bottleneck profiling tool python -m torch.utils.bottleneck run.py
it first throws this warning when starting the autograd profiler:
Running your script with the autograd profiler...
WARNING:2022-06-01 13:37:49 513:513 init.cpp:129] function status failed with error CUPTI_ERROR_NOT_INITIALIZED (15)
WARNING:2022-06-01 13:37:49 513:513 init.cpp:130] CUPTI initialization failed - CUDA profiler activities will be missing
Then, if I run for a small number of epochs, the script finishes fine, and it shows also the cuda profiling stats (even though it says profiler activities will be missing). But when I do a longer run, I get the message Killed after the script runs "through" the autograd profiler. The command dmesg gives this output at the end:
[ 1224.321233] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/,task=python,pid=295,uid=1000
[ 1224.321421] Out of memory: Killed process 295 (python) total-vm:55369308kB, anon-rss:15107852kB, file-rss:0kB, shmem-rss:353072kB, UID:1000 pgtables:39908kB oom_score_adj:0
[ 1224.746786] oom_reaper: reaped process 295 (python), now anon-rss:0kB, file-rss:0kB, shmem-rss:353936kB
So, when using the profiler, there seems to be a memory error (it might not necessarily be related to the above CUPTI warning). Is this related to the profiler somehow saving too much data in-mem? Then, it might be a common problem that occurs for too long runs, right?
The cuda warning CUPTI_ERROR_NOT_INITIALIZED indicates that my CUPTI (short for "CUDA Profiling Tools Interface") is not running. I read in another post that this might be related to me running a newer version of CUPTI that is not backcompatible with the older version of CUDA 11.3. As cupti is not included in the cudatoolkit on conda by default, the system is probably trying to use / locate the cupti, but does not find it / cannot use it.
I'd appreciate any help for this issue. It would be quite nice to see a longer profiling run, in order to determine the bottlenecks / expensive operations in my pytorch code.
Thanks!

Import trax takes too long to load

I was stumped the first time I loaded this library. In my local computer it tooks me at least 40s to load trax on a local Jupyter Notebook and more than 1 minute to load it on a shared Colab environment.
import trax
I'm not sure if it's an issue with my installation or a BUG in the version of trax I'm using?
I'm new in trax, and in fact my experience is with Keras and TensorFlow so I'd like to get an opinion from someone in the trax community, if this is normal or not.
Thanks a lot in advance!
BTW: I'm using trax 1.4.1 with Python 3.9.6 and my local computer has the following specs:
Intel(R) Core(TM) i5-6200U CPU # 2.30GHz, 4 cores and 16GB RAM.

After a while searching it seems that this is normal behavior. In fact there is an issue raised in the official Trax repository about this: import trax takes 17 seconds #1368.
Apparently the fastmath module that contains the trax re-implements for maths operations has plenty of dependencies.
From the issue thread:
This is a well known problem occurring on basically all setups (local, colab, gpu cluster) and it is not a big issue for running long experiments, however it does make local debugging hard. I have tried debugging the import graph with profiler, but without success yet. It looks like even from trax import fastmath has plenty of dependencies - here is the tree generated by importlab library for trax.fastmath.init module:
importlab --tree init.py
out:
https://gist.github.com/syzymon/3bb6f59063f918b4b62b77cdb223da72
So in conclusion, whether it takes you 17 seconds or
40 seconds like me, is a known behavior in Trax.

Printed output not displayed when using joblib in jupyter notebook

So I am using joblib to parallelize some code and I noticed that I couldn't print things when using it inside a jupyter notebook.
I tried using doing the same example in ipython and it worked perfectly.
Here is a minimal (not) working example to write in a jupyter notebook cell
from joblib import Parallel, delayed
Parallel(n_jobs=8)(delayed(print)(i) for i in range(10))
So I am getting the output as [None, None, None, None, None, None, None, None, None, None] but nothing is printed.
What I expect to see (print order could be random in reality):
1
2
3
4
5
6
7
8
9
10
[None, None, None, None, None, None, None, None, None, None]
Note:
You can see the prints in the logs of the notebook process. But I would like the prints to happen in the notebook, not the logs of the notebook process.
EDIT
I have opened a Github issue, but with minimal attention so far.

I think this caused in part by the way Parallel spawns the child workers, and how Jupyter Notebook handles IO for those workers. When started without specifying a value for backend, Parallel will default to loky which utilizes a pooling strategy that directly uses a fork-exec model to create the subprocesses.
If you start Notebook from a terminal using
$ jupyter-notebook
the regular stderr and stdout streams appear to remain attached to that terminal, while the notebook session will start in a new browser window. Running the posted code snippet in the notebook does produce the expected output, but it seems to go to stdout and ends up in the terminal (as hinted in the Note in the question). This further supports the suspicion that this behavior is caused by the interaction between loky and notebook, and the way the standard IO streams are handled by notebook for child processes.
This lead me to this discussion on github (active within the past 2 weeks as of this posting) where the authors of notebook appear to be aware of this, but it would seem that there is no obvious and quick fix for the issue at the moment.
If you don't mind switching the backend that Parallel uses to spawn children, you can do so like this:
from joblib import Parallel, delayed
Parallel(n_jobs=8, backend='multiprocessing')(delayed(print)(i) for i in range(10))
with the multiprocessing backend, things work as expected. threading looks to work fine too. This may not be the solution you were hoping for, but hopefully it is sufficient while the notebook authors work on finding a proper solution.
I'll cross-post this to GitHub in case anyone there cares to add to this answer (I don't want to misstate anyone's intent or put words in people mouths!).
Test Environment:
MacOS - Mojave (10.14)
Python - 3.7.3
pip3 - 19.3.1
Tested in 2 configurations. Confirmed to produce the expected output when using both multiprocessing and threading for the backend parameter. Packages install using pip3.
Setup 1:
ipykernel 5.1.1
ipython 7.5.0
jupyter 1.0.0
jupyter-client 5.2.4
jupyter-console 6.0.0
jupyter-core 4.4.0
notebook 5.7.8
Setup 2:
ipykernel 5.1.4
ipython 7.12.0
jupyter 1.0.0
jupyter-client 5.3.4
jupyter-console 6.1.0
jupyter-core 4.6.2
notebook 6.0.3
I also was successful using the same versions as 'Setup 2' but with the notebook package version downgraded to 6.0.2.
Note:
This approach works inconsistently on Windows. Different combinations of software versions yield different results. Doing the most intuitive thing-- upgrading everything to the latest version-- does not guarantee it will work.

In Z4-tier's git link scottgigante's method work in Windows, but opposite to the mentined results: in Jupyter notebook, the "multiprocessing" backend hang forever, but the default loky work well (python 3.8.5 and notebook 6.1.1):
from joblib import Parallel, delayed
import sys
def g(x):
stream = getattr(sys, "stdout")
print("{}".format(x), file=stream)
stream.flush()
return x
Parallel(n_jobs=2)(delayed(g)(x**2) for x in range(5))
executed in 91ms, finished 11:17:25 2021-05-13
[0, 1, 4, 9, 16]
A simpler method is to use the identity function in delay:
Parallel(n_jobs=2)(delayed(lambda y:y)([np.log(x),np.sin(x)]) for x in range(5))
executed in 151ms, finished 09:34:18 2021-05-17
[[-inf, 0.0],
[0.0, 0.8414709848078965],
[0.6931471805599453, 0.9092974268256817],
[1.0986122886681098, 0.1411200080598672],
[1.3862943611198906, -0.7568024953079282]]
Or use like this:
Parallel(n_jobs=2)(delayed(lambda y:[np.log(y),np.sin(y)])(x) for x in range(5))
executed in 589ms, finished 09:44:57 2021-05-17
[[-inf, 0.0],
[0.0, 0.8414709848078965],
[0.6931471805599453, 0.9092974268256817],
[1.0986122886681098, 0.1411200080598672],
[1.3862943611198906, -0.7568024953079282]]

This problem is fixed in the latest version of ipykernel.
To solve the issue just do pip install --upgrade ipykernel.
From what I understand, any version above 6 will do, as mentioned in this Github comment by one of the maintainers.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.