I am currently trying to run a long-running Python script on Ubuntu 12.04, on a DigitalOcean droplet. It has no visible memory leaks (top shows constant memory usage). After running without incident for about 12 hours (there are no uncaught exceptions and used memory does not increase), the script gets killed.
The only messages present in syslog relating to the script are
Sep 11 06:35:06 localhost kernel: [13729692.901711] select 19116 (python), adj 0, size 62408, to kill
Sep 11 06:35:06 localhost kernel: [13729692.901713] send sigkill to 19116 (python), adj 0, size 62408
I've encountered similar problems before (with other scripts) on Ubuntu 12.04, but back then the logs contained the additional information that the scripts were killed by the oom-killer.
Those scripts, as well as this one, occupy a maximum of 30% of available memory.
Since I can't find any problems with the actual code, could this be an OS problem? If so, how do I go about fixing it?
Your process was indeed killed by the oom-killer. The log message "select … to kill" hints at that.
Your script probably didn't do anything wrong; it was selected to be killed because it was using the most memory at the time.
You have to provide more free memory: add more (virtual) RAM if you can, move other services from this machine to a different one, or try to optimize memory usage in your script.
See e.g. Debug out-of-memory with /var/log/messages for debugging hints. You could also try to spare your script from being killed: How to set OOM killer adjustments for daemons permanently? But note that killing some other process at random instead may leave the whole machine in an unstable state. In the end you will have to sort out the actual memory requirements and then make sure enough memory is available for peak loads.
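As a minimal sketch of that second option (my example, not from the linked post): on any reasonably modern kernel you can adjust the victim-selection score through /proc/<pid>/oom_score_adj. Note that writing a negative value generally requires root (CAP_SYS_RESOURCE).

# Hedged sketch: make this process a less attractive oom-killer victim.
# Scores range from -1000 (never kill) to 1000 (kill first); lowering
# the score below its current value requires root.
with open('/proc/self/oom_score_adj', 'w') as f:
    f.write('-500')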
Related
I have a Python process that does some heavy computations with Pandas and such. It's not my code, so I basically don't have much knowledge of it.
The situation is this: the Python code used to run perfectly fine on a server with 8GB of RAM, maxing out all the available resources.
We moved this code to Kubernetes and we can't make it run: even after increasing the allocated resources up to 40GB, this process is greedy and will inevitably try to grab as much memory as it can, until it goes over the container limit and gets killed by Kubernetes.
I know this code is probably suboptimal and needs rework on its own.
However, my question is: how do I get Docker on Kubernetes to mimic what Linux did on the server, i.e. give the process as much memory as it needs without killing it?
I found out that running something like this seems to work:
import resource
import os

# Read the container's memory limit from the cgroup (v1) filesystem and
# set it as this process's address-space rlimit, so oversized allocations
# raise MemoryError instead of triggering the kernel's OOM killer.
if os.path.isfile('/sys/fs/cgroup/memory/memory.limit_in_bytes'):
    with open('/sys/fs/cgroup/memory/memory.limit_in_bytes') as limit:
        mem = int(limit.read())
        resource.setrlimit(resource.RLIMIT_AS, (mem, mem))
This reads the memory limit from the cgroup filesystem and sets it as both the soft and hard limit for the process's maximum address space.
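Note that this path is specific to cgroup v1. On hosts using cgroup v2 (an assumption about your cluster; check which version your nodes run), the limit lives at /sys/fs/cgroup/memory.max instead and reads "max" when unlimited, so a variant of the same idea would be:

import resource

# Hedged cgroup-v2 variant: memory.max holds the limit in bytes,
# or the literal string "max" when no limit is set.
try:
    with open('/sys/fs/cgroup/memory.max') as limit:
        raw = limit.read().strip()
    if raw != 'max':
        mem = int(raw)
        resource.setrlimit(resource.RLIMIT_AS, (mem, mem))
except FileNotFoundError:
    pass  # not a cgroup-v2 host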
You can test it by running something like:
docker run -it --rm -m 1G --cpus 1 python:rc-alpine
and then trying to allocate 1 GB of RAM before and after running the script above. With the script, you'll get a MemoryError; without it, the container will be killed.
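The allocation test itself can be as small as this (my sketch; run it inside the container's Python shell):

# Try to allocate ~1 GiB in one chunk. With the rlimit set this raises
# MemoryError; without it, the kernel oom-kills the container instead.
buf = bytearray(1024 ** 3)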
Using the --oom-kill-disable option together with a memory limit works for me (12GB memory) in a Docker container. Perhaps it applies to Kubernetes as well.
docker run -dp 80:8501 --oom-kill-disable -m 12g <image_name>
Hence the question: how to mimic "--oom-kill-disable=true" in Kubernetes?
Using Docker (docker-compose) on macOS. When running the Docker containers and attaching Visual Studio Code (VSCode) to the active app container, the hyperkit process can go crazy :( and the MacBook fans have to run at full speed to try to keep the temperature down.
When using VSCode on Python files, I noticed that actions that scan/parse your file, such as those done by pylint, drive hyperkit CPU usage to the max, and the MacBook fans go on full speed :(. Hyperkit CPU usage goes down again when pylint is finished.
When using VSCode to debug my Django Python app, hyperkit CPU usage goes to the max again. While actively debugging, hyperkit goes wild, but it does settle down again afterwards.
I'm currently switching from "bind mounts" to "volume mounts". I think I see some improvements, but I haven't done enough testing to say anything conclusive. So far I've only switched my source code to a "volume mount" instead of a "bind mount"; I will do the same for my static files and database and see if that results in further improvements.
You can check out this Stack Overflow post on Docker volumes for some more info on the subject.
Here are some posts I found regarding this issue:
https://code.visualstudio.com/docs/remote/containers
https://github.com/docker/for-mac/issues/1759
Any other ideas on how to keep the hyperkit process under control?
[update 27 March] Docker debug mode was set to TRUE. I've changed this to FALSE, but I have not seen any significant improvements.
[update 27 March] Using the "delegated" option for my source code (app) folder; first impressions are positive. I'm seeing significant performance improvements; we'll have to see if it lasts 😀
FYI, the Docker docs on delegated: the container's view is authoritative (permit delays before updates on the container appear in the host).
[update 27 March] I've also reduced the number of CPU cores Docker Desktop can use (Settings -> Advanced). Hopefully this prevents the CPU from getting too hot.
I "solved" this issue by using http://docker-sync.io to create volumes that I can mount without raising my CPU usage at all. I am currently running 8 containers (6 Python and 2 node) with file watchers on and my CPU is at 25% usage.
I created a very simple test that launches and closes a piece of software, using the Python Nose test platform, to track down a bug in the startup sequence of the software I was working on.
The test was set up so that it would launch and close the software about 1,500 times in a single execution.
A few hours later, I discovered that the test was no longer able to launch the software after around 300 iterations; it was timing out while waiting for the process to start. As soon as I logged back in, the test started launching the process without any problem, and all the tests started passing as well.
This is quite puzzling to me. I have never seen this behavior, and it never happened on Windows either.
I am wondering if there is some sort of power-saving state in which the Mac waits for currently running processes to finish and prohibits new processes from starting.
I would really appreciate it if anybody could shed some light on this.
I was running Python 2.7.x on High Sierra.
I am not aware of any state where the system flat out denies new processes while old ones are still running.
However, I can easily imagine a situation in which a process may hang because of some unexpected dependency on e.g. the window server. For example, I once noticed that rsvg-convert, a command-line SVG-to-image converter, running in an SSH session, had different fonts available to it depending on whether I was also simultaneously logged in on the console. This behavior went away when I recompiled the SVG libraries to exclude all references to macOS-specific libraries...
I have some big computation programs running that do not involve disk operations. However, when I monitor the process in htop, it shows a lot of disk wait, and CPU utilization is only 10%. Looking with iotop, I am not seeing any irregular disk reads/writes. What might be the possible reason for this, and how do I debug it? The programs are written in Python 3 and C, and I am running on Ubuntu 14.04 and Ubuntu 16.04.
Sorry, the program is too big and I don't know which part is misbehaving, so I can't paste it here for testing.
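For what it's worth, here is a minimal sketch of how one could sample what such a process is blocked on (the PID below is a placeholder): a state of "D" in /proc/<pid>/stat means uninterruptible sleep, which htop counts as disk wait, and /proc/<pid>/wchan names the kernel function the process is sleeping in.

import time

pid = 12345  # placeholder: PID of the computation process

for _ in range(10):
    with open('/proc/%d/stat' % pid) as f:
        # The third field (after the parenthesized command name) is the state:
        # R running, S sleeping, D uninterruptible sleep (shown as disk wait).
        state = f.read().rsplit(')', 1)[1].split()[0]
    with open('/proc/%d/wchan' % pid) as f:
        wchan = f.read().strip()  # kernel symbol the task is blocked in, or '0'
    print(state, wchan or '-')
    time.sleep(1)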
I'm stuck trying to debug an Apache process that keeps growing in memory size. I'm running Apache 2.4.6 with MPM Prefork on a virtual Ubuntu host with 4GB of RAM, serving a Django app with mod_wsgi. The app is heavy with AJAX calls, and Apache is getting between 300 and 1000 requests per minute. Here's what I'm seeing:
As soon as I restart Apache, the first child process (the one with the lowest PID) keeps growing its memory usage, reaching over a gigabyte in 6 or 7 minutes. All the other Apache processes keep memory usage between 10MB-50MB per process.
CPU usage for the troublesome process will fluctuate, sometimes dipping down very low, other times hovering at 20% or sometimes spiking higher.
The troublesome process will run indefinitely until I restart Apache.
I can see in my Django logs that the troublesome process is serving some requests to multiple remote IPs (I'm seeing reports of caught exceptions for URLs my app doesn't like, primarily).
Apache error logs will often (but not always) show "IOError: failed to write data" for the PID, sometimes across multiple IPs.
Apache access logs do not show any completed requests associated with this PID.
Running strace on the PID produces no output other than 'restart_syscall(<... resuming interrupted call ...>', even when I can see that PID mentioned in my app logs while strace was running.
I've tried setting low values of MaxRequestsPerChild and MaxMemFree and neither has seemed to have any effect.
What could this be, or how could I debug further? The fact that I see no output from strace makes me suspect that my application has an infinite loop. If that were the case, how could I go about tracing the PID back to the code path it executed or the request that started the trouble?
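One possible way to get that trace (a sketch, assuming you can add code to the WSGI application; faulthandler is built into Python 3.3+ and available as a PyPI backport for Python 2) is to register a signal handler that dumps every thread's Python stack:

import faulthandler
import signal

# On SIGUSR1, dump all threads' Python tracebacks to stderr, which
# mod_wsgi routes to the Apache error log.
faulthandler.register(signal.SIGUSR1)

# Then, from a shell, target the troublesome child:
#   kill -USR1 <PID>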
Instead of restarting Apache, stop and then start it; there is a known, unfixed memory-leak issue with Apache.
Also, consider using nginx and Gunicorn: this setup is a lightweight, faster, and often recommended alternative for serving your Django app and static files.
References:
Performance
Memory Usage
Apache/Nginx Comparison