I have a program where each task is a call to an external C++ program through subprocess.Popen. The tasks are arranged in a graph and everything is executed through Dask's get function.
I have a single-node version of this program that works just fine with dask.threaded, and I am trying to extend it to a distributed setting. My goal is to run it on a Slurm cluster, but I have trouble deploying the workers. When I run the following:
screen -d -m dask-scheduler --scheduler-file scheduler.json
screen -d -m srun dask-worker --scheduler-file scheduler.json
python3 myscript.py
only a single core gets used on each node (out of the twenty cores per node).
I did suspect some issue with the GIL, but the script works just fine with dask.threaded, so I am not quite sure what is going on; any help would be appreciated.
I recommend looking at the dashboard to see how many tasks Dask is running at a time on each worker:
Documentation here: http://dask.pydata.org/en/latest/diagnostics-distributed.html
If you see that Dask is only running one task per worker, then it's probably a problem with how you've set up your workers (you might want to look at the worker page to get a sense of what Dask thinks you've asked for).
If you see that Dask is running many tasks per worker concurrently then it's probably an issue with your function.
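If you prefer to check from the client side, here is a minimal sketch (my own addition, not from the original post) that connects to the same scheduler file and prints how many threads each worker advertises:
from dask.distributed import Client

client = Client(scheduler_file='scheduler.json')

# scheduler_info() describes each connected worker; if every worker
# reports a thread count of 1, the workers were started single-threaded.
for addr, worker in client.scheduler_info()['workers'].items():
    print(addr, worker.get('nthreads', worker.get('ncores')))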
Related
I am parallelizing code across 30 CPUs and confirmed that outside a container this works fine using the Python library 'pathos'.
from pathos.pools import ProcessPool
import pandas as pd

pool = ProcessPool(nodes=30)
results = pool.map(func_that_needs_to_run_in_parallel, range(30))
pool.close()
pool.join()
results_df = pd.concat(results)
However, it doesn't work while running the code as part of a Flask app in a Docker container. I have three containers:
the Flask app,
Redis, which I use to offload all the heavy processing to a worker process,
the worker process.
The code for the worker process can be summarised as:
#some code that needs to be run on only one cpu
#the above 'ProcessPool' code snippet for one particularly resource-intensive task
#some code that needs to be run on only one cpu
When I run the app, the parallelized part of the code in the worker container never uses more than 4 CPUs. I confirmed this with docker stats and htop. There are no CPU usage limits on the containers in the docker-compose YAML file.
htop shows that the code runs on only 4 CPUs at any one time, but it randomly switches which CPUs it is using during the task, so the worker container can evidently access all 48 CPUs.
Summary: Running the app with this multiprocessing code is helpful but there is a ceiling on CPU usage.
Early Docker literature (2016) suggested one container per CPU, which is clearly not the case. The idea is to configure CPU limits at run time, in the same way you assign memory:
docker run -it --cpus="30" debian /bin/bash
I found the Docker documentation on container resource allocation useful here.
If pathos is the issue, why not switch to the standard multiprocessing.Pool via its apply, map_async, or imap methods?
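For reference, a minimal sketch of that standard-library alternative (the function body is a placeholder standing in for the question's real task; this is not code from the original answer):
import pandas as pd
from multiprocessing import Pool

def func_that_needs_to_run_in_parallel(i):
    # placeholder body standing in for the question's real task
    return pd.DataFrame({'chunk': [i]})

if __name__ == '__main__':
    # Same shape as the pathos snippet above, using the standard-library pool.
    with Pool(processes=30) as pool:
        results = pool.map(func_that_needs_to_run_in_parallel, range(30))
    results_df = pd.concat(results)
    print(results_df.shape)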
I was trying to run Slurm jobs with srun in the background. Unfortunately, right now, because I have to run things through Docker, it's a bit annoying to use sbatch, so I am trying to find out if I can avoid it altogether.
From my observations, whenever I run srun, say:
srun docker image my_job_script.py
and then close the window where I was running the command (to avoid receiving all the print statements) and open another terminal window to see if the command is still running, it seems that my running script gets cancelled for some reason. Since it isn't submitted through sbatch, it doesn't give me a file with the error log (as far as I know), so I have no idea why it closed.
I also tried:
srun docker image my_job_script.py &
to give control back to me in the terminal. Unfortunately, if I do that it still keeps printing things to my terminal screen, which I am trying to avoid.
Essentially, I log into a remote computer through ssh and then run an srun command, but it seems that if I terminate my ssh connection, the srun command is automatically killed. Is there a way to stop this?
Ideally I would like to essentially send the script to run and not have it be cancelled for any reason unless I cancel it through scancel and it should not print to my screen. So my ideal solution is:
keep running the srun script even if I log out of the ssh session
keep running the srun script even if I close the window from which I sent the command
keep running the srun script, let me leave the srun session, and not print to my screen (i.e. essentially run in the background)
This would be my ideal solution.
For the curious crowd who want to know the issue with sbatch, I want to be able to do this (which would be the ideal solution):
sbatch docker image my_job_script.py
however, as people will know, it does not work, because sbatch expects a "batch" script and docker is not one. Essentially, a simple solution (that doesn't really work for my case) would be to wrap the docker command in a batch script:
#!/usr/bin/sh
docker image my_job_script.py
unfortunately, I am actually using my batch script to encode a lot of information (sort of like a config file) about the task I am running. So doing that might affect jobs I run, because their underlying file would be changing. That is avoided by sending the job directly to sbatch, since it essentially creates a copy of the batch script (as noted in this question: Changing the bash script sent to sbatch in slurm during run a bad idea?). So the real solution to my problem would be to have my batch script contain all the information that my script requires, and then somehow call docker from Python and pass it all that information. Unfortunately, some of the information consists of function pointers and objects, so it's not even clear to me how I would pass such things to a docker command run from Python.
Or maybe being able to pass the docker command directly to sbatch, instead of using a batch script, would also solve the problem.
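On the "call docker from Python and pass it all the information" idea, here is a hypothetical sketch (the image name, paths, and entry script are placeholders I made up; it assumes the objects are picklable and that the same module is importable inside the container):
import pickle
import subprocess

def my_task_function(x):
    # stands in for one of the question's "function pointers"
    return x * 2

config = {'func': my_task_function, 'params': [1, 2, 3]}

# Serialize the configuration; module-level functions pickle by reference,
# so the script inside the container must be able to import this same module.
with open('job_config.pkl', 'wb') as f:
    pickle.dump(config, f)

# Bind-mount the pickle into the container and hand its path to the script.
subprocess.run(
    ['docker', 'run', '-v', '/abs/path/on/host:/work', 'my_image',
     'python3', '/work/my_job_script.py', '/work/job_config.pkl'],
    check=True,
)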
The outputs can be redirected with the options -o for stdout and -e for stderr.
So, the job can be launched in background and with the outputs redirected:
$ srun -o file.out -e file.err docker image my_job_script.py &
Another approach is to use a terminal multiplexer like tmux or screen.
For example, create a new tmux session by typing tmux. In that session, run srun with your script. You can then detach from the tmux session, which returns you to your main shell so you can go about your other business, or you can log off entirely. When you want to check in on your script, just reattach to the session. See the tmux documentation (man tmux) for how to detach and reattach on your OS.
Any output redirection using the -o or -e options will still work with this technique, and you can run multiple srun commands concurrently in different tmux windows. I've found this approach useful for running concurrent pipelines (genomics in this case).
I was wondering this too, because the differences between sbatch and srun are not very clearly explained or motivated. I looked at the code and found:
sbatch
sbatch pretty much just sends a shell script to the controller, tells it to run it, and then exits. It does not need to keep running while the job is happening. It does have a --wait option to stay running until the job is finished, but all that does is poll the controller every 2 seconds to ask whether the job is done.
sbatch can't run a job across multiple nodes - the code simply isn't in sbatch.c. sbatch is not implemented in terms of srun, it's a totally different thing.
Also, its argument must be a shell script. That is a bit of a weird limitation, but it does have a --wrap option so that it can automatically wrap a real program in a shell script for you. Good luck getting all the escaping right with that!
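For illustration, a wrapped submission would look roughly like this (reusing the question's own placeholder command, so treat it as a sketch rather than a verified invocation):
sbatch --wrap "docker image my_job_script.py"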
srun
srun is more like an MPI runner. It directly starts tasks on lots of nodes (one task per node by default, though you can override that with --ntasks). It's intended for MPI, so all of the tasks will run simultaneously. It won't start any of them until all the nodes have a slot free.
It must keep running while the job is in progress. You can send it to the background with &, but this is still different from sbatch. If you need to start a million sruns you're going to have a problem. A million sbatch submissions should (in theory) work fine.
There is no way to have srun exit and leave the job still running like there is with sbatch. srun itself acts as a coordinator for all of the nodes in the job, and it updates the job status etc. so it needs to be running for the whole thing.
We have a Python application with some Celery workers.
We use the following command to start a Celery worker:
python celery -A proj worker --queue=myqueue -P prefork --maxtasksperchild=500
We have two issues with our celery workers.
We have a memory leak
We have pretty big load and we need a lot of workers to process everything fast
We're still looking into the memory leak, but since it's legacy code it's pretty hard to find the cause and it will take some time to resolve. To prevent leaks we're using --maxtasksperchild, so each worker restarts itself after processing 500 events. And it works OK: memory grows only to some level.
The second issue is a bit harder. To process all the events from our Celery queue we have to start more workers. But with prefork each process eats a lot of memory (about 110 MB in our case), so we either need a lot of servers to start the right number of workers or we have to switch from prefork to eventlet:
python celery -A proj worker --queue=myqueue -P eventlet --concurrency=10
In this case we'll use the same amount of memory (about 110 MB per process), but each process will run 10 workers, which is much more memory efficient. The issue with this is that we still have issue #1 (the memory leak), and we can't use --maxtasksperchild because it doesn't work with eventlet.
Any thoughts on how we can use something like --maxtasksperchild with eventlet?
Upgrade Celery. I've just quickly scanned the master code; they promise max-memory-per-child. I hope it works with all concurrency models, but I haven't tried it yet.
Set up process monitoring and send a graceful terminate signal to workers that go above a memory threshold. Works for me.
Run Celery in a control group with limited memory. Works for me.
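For the first suggestion, a minimal sketch of the corresponding setting, assuming Celery 4+ (the app name, broker URL, and threshold are placeholders; the value is specified in kilobytes):
from celery import Celery

app = Celery('proj', broker='redis://localhost:6379/0')

# Recycle a pool worker once it exceeds roughly 200 MB of resident memory.
# CLI equivalent: the worker's --max-memory-per-child option.
app.conf.worker_max_memory_per_child = 200 * 1024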
I am using the Python yelp/mrjob framework for my MapReduce jobs. There is only about 4 GB of data and I don't want to go through the trouble of setting up Hadoop or EMR. I have a 64-core machine and it takes about 2 hours to process the data with mrjob. I noticed that mrjob assigns 54 mappers to my job, but it seems to run only one at a time. Is there a way to make mrjob run all tasks in parallel with all my CPU cores?
I manually changed the number of tasks, but it didn't help much.
--jobconf mapred.map.tasks=10 --jobconf mapred.reduce.tasks=10
EDIT:
I pass -r local when I execute the job; however, looking at the code, it seems to default to running one process at a time. Please tell me I am wrong.
The local job runner for mrjob just spawns one subprocess for each MR stage, one for the mapper, one for the combiner (optionally), and one for the reducer, and passes data between them via a pipe. It is not designed to have any parallelism at all, so it will never take advantage of your 64 cores.
My suggestion would be to run Hadoop on your local machine and submit the job with the -r hadoop option. A Hadoop cluster running on your local machine in pseudo-distributed mode should be able to take advantage of your multiple cores.
See this question which addresses that topic: Full utilization of all cores in Hadoop pseudo-distributed mode
The runner for a job can be specified via the command line with the -r option.
When you run an mrjob script from the command line, the default run mode is inline, which runs your job on your local machine in a single process. The other obvious options for running jobs are emr and hadoop.
You can make the job run in parallel on your local machine by setting the runner to local:
$ python myjob.py -r local
Those --jobconf options are only recognised by Hadoop (i.e. on EMR or a Hadoop cluster).
Even when I give a parameter to the groupByKey function, for example groupByKey(4), and check with the top command, Spark is still using only one core. I run my script like this:
spark-submit --master local[4] program.py
So why does Spark only use one core when I tell it to use 4?
You're running this on Linux, if the tags on your question are to be trusted. Under Linux, top does not, by default, show every thread (it shows every process). local[4] tells Spark to work locally with 4 threads (not processes).
Run top -H to pick up the threads.
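As a quick sanity check inside program.py itself (a sketch I am adding, not part of the original answer), you can print the parallelism Spark thinks it has:
from pyspark import SparkContext

# getOrCreate() returns the SparkContext the program already created,
# or builds a new one if the script is run standalone.
sc = SparkContext.getOrCreate()
print(sc.master)               # expect local[4] when submitted as above
print(sc.defaultParallelism)   # expect 4 under local[4]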