I am parallelizing code across 30 CPUs and have confirmed that, outside a container, this works fine using the Python library 'pathos':
from pathos.multiprocessing import ProcessPool
import pandas as pd

pool = ProcessPool(nodes=30)
results = pool.map(func_that_needs_to_run_in_parallel, range(30))
pool.close()
pool.join()
results_df = pd.concat(results)
However, it doesn't work when the code runs as part of a Flask app in a Docker container. I have three containers:
the Flask app,
Redis, which I use to offload all the heavy processing to a worker process,
the worker process.
The code for the worker process can be summarised as:
#some code that needs to be run on only one cpu
#the above 'ProcessPool' code snippet for one particularly resource-intensive task
#some code that needs to be run on only one cpu
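To make the structure concrete, here is a minimal sketch of that worker task (only the names carried over from the snippet above are real; everything else is illustrative):
from pathos.multiprocessing import ProcessPool
import pandas as pd

def worker_task():
    # ...some code that needs to run on only one CPU...
    pool = ProcessPool(nodes=30)   # the resource-intensive parallel step
    results = pool.map(func_that_needs_to_run_in_parallel, range(30))
    pool.close()
    pool.join()
    results_df = pd.concat(results)
    # ...some code that needs to run on only one CPU...
    return results_df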
When I run the app, the parallelized part of the code in the worker container never uses more than 4 CPUs. I confirmed this with docker stats and htop, and there are no CPU limits set on the containers in the docker-compose YAML file.
htop shows the code running on only 4 CPUs at any one time, but it randomly switches which CPUs it uses during the task, so the worker container can evidently access all 48 CPUs.
Summary: running the app with this multiprocessing code does help, but there is a ceiling of 4 CPUs on its usage.
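For what it's worth, here is a small diagnostic sketch (not part of the app code) that can be run inside the worker container to compare the CPU count Python reports with the CPUs the process is actually allowed to run on:
import multiprocessing
import os

print("cpu_count:", multiprocessing.cpu_count())            # CPUs the machine reports
print("usable CPUs:", len(os.sched_getaffinity(0)))         # CPUs this process may be scheduled on (Linux-only call)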
Early Docker literature (circa 2016) suggested one container per CPU, which is clearly not the case anymore. The idea is to configure CPU allocation at run time, in the same way you assign memory:
docker run -it --cpus="30" debian /bin/bash
I found the Docker documentation on container resource allocation useful here.
If pathos is the issue, why not switch to the standard multiprocessing.Pool() via its apply, map_async, or imap methods?
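For instance, a rough standard-library equivalent of the pathos snippet (a sketch only; it assumes func_that_needs_to_run_in_parallel is defined at module level so it can be pickled):
from multiprocessing import Pool
import pandas as pd

if __name__ == "__main__":
    with Pool(processes=30) as pool:   # mirrors ProcessPool(nodes=30)
        results = pool.map(func_that_needs_to_run_in_parallel, range(30))
    results_df = pd.concat(results)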
Related
I have a Flask service running on a particular port xxxx. Inside this Flask service is an endpoint:
/buildGlobalIdsPool
This endpoint uses Python's multiprocessing library's Pool object to run parallel processes of a function:
from multiprocessing import Pool

with Pool() as p:
    p.starmap(api.build_global_ids_with_recordlinkage, args)
We use the pm2 process manager on a Linux server to manage our services. I am hitting the endpoint from Postman, and everything works fine until the code above is reached. As soon as processes are supposed to spawn, pm2 kills the main Flask process, but the spawned processes persist (checking with lsof -i:xxxx, I see multiple python3 processes running on this port). This happens whether I run the service with pm2 or simply with python3 app.py. The program works fine on my local Windows 10 machine.
I'm just curious what I could be missing, native to Linux or pm2, that is killing the main process or disallowing multiple processes on the same port, while my local machine handles the program just fine.
Thanks!
I have a Python process that does some heavy computation with Pandas and the like. It is not my code, so I don't have much knowledge of it.
The situation: this Python code used to run perfectly fine on a server with 8 GB of RAM, maxing out all the resources available.
We moved this code to Kubernetes and we can't make it run: even after increasing the allocated resources to 40 GB, the process is greedy and will inevitably try to grab as much memory as it can until it exceeds the container limit and is killed by Kubernetes.
I know this code is probably suboptimal and needs rework on its own.
However, my question is how to get Docker on Kubernetes to mimic what Linux did on the server: give the process as many resources as it needs without killing it?
I found out that running something like this seems to work:
import resource
import os

# Read the cgroup (v1) memory limit, if present, and apply it as both the
# soft and hard limit on this process's address space (RLIMIT_AS), so that
# over-allocation raises MemoryError instead of triggering the OOM killer.
if os.path.isfile('/sys/fs/cgroup/memory/memory.limit_in_bytes'):
    with open('/sys/fs/cgroup/memory/memory.limit_in_bytes') as limit:
        mem = int(limit.read())
        resource.setrlimit(resource.RLIMIT_AS, (mem, mem))
This reads the memory limit file from cgroups and sets it as both the hard and soft limit for the process's maximum address space (RLIMIT_AS).
You can test by running something like:
docker run -it --rm -m 1G --cpus 1 python:rc-alpine
and then trying to allocate 1 GB of RAM before and after running the script above.
With the script, you'll get a MemoryError; without it, the container will be killed.
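A hedged sketch of that allocation test from the Python prompt inside the container (the size just matches the -m 1G limit above):
big = bytearray(1024 * 1024 * 1024)   # try to allocate ~1 GiB: MemoryError with the rlimit in place, likely an OOM kill without it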
Using the --oom-kill-disable option with a memory limit works for me (12 GB of memory) in a Docker container. Perhaps it applies to Kubernetes as well.
docker run -dp 80:8501 --oom-kill-disable -m 12g <image_name>
Hence the question: how do you mimic "--oom-kill-disable=true" in Kubernetes?
I have a program where each task is a call to an external C++ program through subprocess.Popen. The tasks are arranged in a graph and everything is executed through Dask's get command.
I have a single-node version of this program that works just fine with dask.threaded, and I am trying to extend it to a distributed setting. My goal is to run it on a Slurm cluster, but I am having trouble deploying the workers. When I run the following:
screen -d -m dask-scheduler --scheduler-file scheduler.json
screen -d -m srun dask-worker --scheduler-file scheduler.json
python3 myscript.py
only a single core gets used on every node (out of twenty cores per node).
I did suspect some issues with the GIL, but the script works just fine with dask.threaded, so I am not quite sure what is going on; any help would be appreciated.
I recommend looking at the dashboard to see how many tasks Dask is running at a time on each worker:
Documentation here: http://dask.pydata.org/en/latest/diagnostics-distributed.html
If you see that Dask is only running one task per worker, then it's probably a problem with how you've set up your workers (you might want to look at the worker page to get a sense of what Dask thinks you've asked for).
If you see that Dask is running many tasks per worker concurrently then it's probably an issue with your function.
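If it does turn out to be worker setup, one thing worth trying (a hedged sketch; the exact flag names depend on your Dask version, and --nprocs was later renamed --nworkers) is launching the workers with explicit process and thread counts, for example:
srun dask-worker --scheduler-file scheduler.json --nprocs 20 --nthreads 1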
I'm running a Celery worker machine, using Redis as the broker, with the following configuration:
celery -A project.tasks:app worker -l info --concurrency=8
When checking the number of running Celery processes, I see more than 8.
Is there something that I am missing? Is there a limit for max concurrency?
This problem causes huge memory allocation, and is killing the machine.
With the default settings, Celery will always start one more process than the number you ask for. This additional process is a kind of bookkeeping process that coordinates the other processes that are part of the worker: it communicates with the rest of Celery and dispatches tasks to the processes that actually run them.
Switching to a pool implementation other than the "prefork" default might reduce the number of processes created, but that opens a new can of worms.
For the concurrency problem, I have no suggestion.
For the memory problem, you can look at the Redis configuration in ~/.redis/redis.conf. It has a maxmemory attribute which sets a limit on the memory used by tasks…
See the Redis configuration
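For instance, the relevant directive in redis.conf looks like this (the value is purely illustrative):
maxmemory 4gb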
We have a Python application with some Celery workers.
We use the following command to start a Celery worker:
python celery -A proj worker --queue=myqueue -P prefork --maxtasksperchild=500
We have two issues with our Celery workers:
We have a memory leak.
We have a pretty big load and need a lot of workers to process everything fast.
We're still looking into the memory leak, but since it's legacy code it's pretty hard to find the cause, and it will take some time to resolve. To contain the leak we're using --maxtasksperchild, so each worker restarts itself after processing 500 events. This works OK; memory only grows to a certain level.
The second issue is a bit harder. To process all the events from our Celery queue we have to start more workers, but with prefork each process eats a lot of memory (about 110 MB in our case), so we either need a lot of servers to start the right number of workers or we have to switch from prefork to eventlet:
python celery -A proj worker --queue=myqueue -P eventlet --concurrency=10
In this case we'll use the same amount of memory (about 110 MB per process), but each process will run 10 eventlet workers, which is much more memory efficient. The problem is that we still have issue #1 (the memory leak), and we can't use --maxtasksperchild because it doesn't work with eventlet.
Any thoughts on how we can use something like --maxtasksperchild with eventlet?
Upgrade Celery: I've just quickly scanned the master branch, and it promises max-memory-per-child (see the example command after this list). I hope it works with all concurrency models; I haven't tried it yet.
Set up process monitoring and send a graceful terminate signal to workers above a memory threshold. Works for me.
Run Celery in a control group (cgroup) with limited memory. Works for me.
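A hedged sketch of that first option, once you're on a Celery version that ships it (the flag takes a value in KiB according to the docs; the 300000 here is illustrative and untested):
celery -A proj worker --queue=myqueue -P prefork --max-memory-per-child=300000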