Killing a Kubernetes pod on the fly with Python

I have an app running on Kubernetes, and I want a way to kill the pod dynamically (from the code).
Does anyone know a way to do it? I tried exiting the process and raising an error, but that didn't work.

Based on the answer to my comment above, I understand your app has a memory leak and you want to restart the pod to clean up that memory leak.
First, I will state the obvious: the best way to fix the problem is to fix the leak in the code. That being said, you have also said that finding the problem will take a lot of time.
The other solution I can think of is to use Kubernetes' own resource requests/limits functionality.
You declare how much memory and/or CPU you want the node to reserve for your app (the request), and how much memory and/or CPU the node may give your app at most (the limit). If CPU usage exceeds the declared limit, the app is throttled. If memory usage exceeds the declared limit, the container is OOMKilled.
Keeping the memory limit low lets Kubernetes restart your app whenever it exceeds the limit you declared.
Details and examples (including a memory-leak example) can be found here: https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/
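As a rough illustration of the limits approach (not part of the original answer), here is a minimal sketch using the official kubernetes Python client to create a pod with a tight memory limit; the pod name, image, and limit values are invented for illustration.
from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() when running inside the cluster

container = client.V1Container(
    name="leaky-app",                      # hypothetical container name
    image="example.com/leaky-app:latest",  # hypothetical image
    resources=client.V1ResourceRequirements(
        requests={"memory": "128Mi", "cpu": "250m"},
        limits={"memory": "256Mi", "cpu": "500m"},  # exceeding 256Mi => OOMKilled
    ),
)

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="leaky-app"),
    spec=client.V1PodSpec(
        containers=[container],
        restart_policy="Always",  # the kubelet restarts the container after an OOM kill
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
With this in place, the kubelet kills and restarts the container whenever the leak pushes it past 256Mi, which effectively gives you the periodic restart you were after.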

Related

Difference between cAdvisor memory report and resources maxrss in python

I'm running a simple Flask web app on Kubernetes. Recently I noticed a curious behavior while testing memory consumption. I use the following Python code to report the total RSS used by my process:
resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
After some warm-up requests to the server, the reported resident memory was about 128 MB.
cAdvisor, on the other hand, reported 102 MiB of RSS. If it were the opposite, it would make some sense, since the container could be using some memory for things other than my app; the weird thing is that the Python process apparently uses more memory than the container is aware of.
Converting from MiB to MB does not explain the gap, since 102 MiB ≈ 107 MB.
What does the memory usage reported by cAdvisor stand for? Which number should I use as a reliable memory usage report?
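One detail worth checking (my addition, not part of the original question): ru_maxrss is the peak resident set size, reported in kilobytes on Linux, while cAdvisor's rss is the current resident set size, so the two are not directly comparable. A minimal sketch that prints both for the current process (the /proc parsing assumes a Linux container):
import resource

def peak_rss_mib():
    # ru_maxrss is in kilobytes on Linux (bytes on macOS) and is the *peak* RSS
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0

def current_rss_mib():
    # VmRSS in /proc/self/status is the *current* RSS, also in kilobytes
    with open("/proc/self/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1]) / 1024.0
    return None

print("peak RSS:    %.1f MiB" % peak_rss_mib())
print("current RSS: %.1f MiB" % current_rss_mib())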

How does app engine (python) manage memory across requests (Exceeded soft private memory limit)

I'm experiencing occasional 'Exceeded soft private memory limit' errors in a wide variety of request handlers in App Engine. I understand that this error means the RAM used by the instance has exceeded the amount allocated, and that this causes the instance to shut down.
I'd like to understand what might be the possible causes of the error, and to start, I'd like to understand how app engine python instances are expected to manage memory. My rudimentary assumptions were:
An F2 instance starts with 256 MB.
When it starts up, it loads my application code - let's say 30 MB.
When it handles a request, it has 226 MB available.
As long as that request does not exceed 226 MB (plus a margin of error), the request completes without error.
If it does exceed 226 MB plus the margin, the instance completes the request, logs the 'Exceeded soft private memory limit' error, then terminates - now go back to step 1.
When that request returns, any memory used by it is freed up - i.e. the unused RAM goes back to 226 MB.
Steps 3-4 are repeated for each request passed to the instance, indefinitely.
That's how I presumed it would work, but given that I'm occasionally seeing this error across a fairly wide set of request handlers, I'm now not so sure. My questions are:
a) Does step #4 happen?
b) What could cause it not to happen? or not to fully happen? e.g. how could memory leak between requests?
c) Could storing data in module-level variables cause memory usage to leak? (I'm not knowingly using module-level variables in that way)
d) What tools / techniques can I use to get more data? E.g. measure memory usage at entry to request handler?
In answers/comments, where possible, please link to the gae documentation.
[edit] Extra info: my app is configured as threadsafe: false. If this has a bearing on the answer, please state what it is. I plan to change to threadsafe: true soon.
[edit] Clarification: This question is about the expected behavior of gae for memory management. So while suggestions like 'call gc.collect()' might well be partial solutions to related problems, they don't fully answer this question. Up until the point that I understand how gae is expected to behave, using gc.collect() would feel like voodoo programming to me.
Finally: if I've got this all backwards then I apologize in advance - I really can't find much useful info on this, so I'm mostly guessing.
App Engine's Python interpreter does nothing special, in terms of memory management, compared to any other standard Python interpreter. So, in particular, there is nothing special that happens "per request", such as your hypothetical step 4. Rather, as soon as any object's reference count decreases to zero, the Python interpreter reclaims that memory (module gc is only there to deal with garbage cycles -- when a bunch of objects never get their reference counts down to zero because they refer to each other even though there is no accessible external reference to them).
So, memory could easily "leak" (in practice, though technically it's not a leak) "between requests" if you use any global variable: such variables survive the instance of the handler class and its (e.g.) get method - i.e., your point (c), though you say you are not doing that. A sketch of this pattern follows.
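Here is a minimal sketch (hypothetical handler and variable names, assuming a webapp2-style handler) of how a module-level variable keeps memory alive between requests:
import webapp2

_request_log = []  # module-level: survives across requests served by this instance

class MyHandler(webapp2.RequestHandler):
    def get(self):
        # Each request appends data that is never removed, so the instance's
        # resident memory keeps growing until the soft private memory limit is hit.
        _request_log.append(self.request.body)
        self.response.write('ok')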
Once you declare your module to be threadsafe, an instance may happen to serve multiple requests concurrently (up to what you've set as max_concurrent_requests in the automatic_scaling section of your module's .yaml configuration file; the default value is 8). So your instance's RAM will need to be a multiple of what each request needs.
As for (d), to "get more data" (I imagine you actually mean, get more RAM), the only thing you can do is configure a larger instance_class for your memory-hungry module.
To use less RAM, there are many techniques -- which have nothing to do with App Engine, everything to do with Python, and in particular, everything to do with your very specific code and its very specific needs.
The one GAE-specific issue I can think of is that ndb's caching has been reported to leak -- see https://code.google.com/p/googleappengine/issues/detail?id=9610 ; that thread also suggests workarounds, such as turning off ndb caching or moving to old db (which does no caching and has no leak). If you're using ndb and have not turned off its caching, that might be the root cause of "memory leak" problems you're observing.
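If you want to try the workaround of turning off ndb's caching, a minimal sketch (my wording, not taken from the linked thread) is:
from google.appengine.ext import ndb

# Call this where a request context exists (e.g. at the start of a handler)
ctx = ndb.get_context()
ctx.set_cache_policy(False)     # disable the in-context (per-request) cache
ctx.set_memcache_policy(False)  # optionally also skip memcache
Caching can also be disabled per model by setting _use_cache = False on the model class.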
Point 4 is an invalid assumption. Python's memory management doesn't return memory that easily: the Python process keeps holding that memory, unused, until the garbage collector makes a pass. In the meantime, if some other request requires more memory, new memory may be allocated on top of the memory from the first request. If you want to force Python to garbage-collect, you can use gc.collect(), as mentioned here.
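As a small sketch of forcing a collection pass at the end of each request (the handler names here are hypothetical):
import gc
import logging

def handle_request(request):
    try:
        return do_work(request)   # hypothetical request handler
    finally:
        collected = gc.collect()  # force a full garbage-collection pass
        logging.info('gc.collect() found %d unreachable objects', collected)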
Take a look at this Q&A for approaches to check on garbage collection and for potential alternate explanations: Google App Engine DB Query Memory Usage

Django memory leak: possible causes?

I have a Django application that every so often runs into a memory leak.
I am not using large data that could overload the memory; in fact, the application 'eats' memory incrementally (in a week the memory goes from ~70 MB to 4 GB), which is why I suspect the garbage collector is missing something, though I am not sure. Also, it seems this increase does not depend on the number of requests.
Obvious things like DEBUG=True, leaving files open, etc. do not apply here.
I'm using uWSGI 2.0.3 (+ nginx) and Django 1.4.5
I could set up uWSGI to restart the server when the memory exceeds a certain limit, but I would rather not do that, since it is not really a solution.
Are there any well-known situations where the garbage collector "doesn't do its work properly"? Could you provide some code examples?
Is there any configuration of uWSGI + Django that could cause this?
I haven't found the exact thing I was looking for (each project is a world of its own!), but by following some clues and advice I managed to solve the issue. I'll share a few links that could help if you are facing a similar problem.
django memory leaks, part I, django memory leaks, part II and Finding and fixing memory leaks in Python
Some useful SO answers/questions:
Which Python memory profiler is recommended?, Is there any working memory profiler for Python3, Python memory leaks and Python: Memory leak debugging
Update
pyuwsgimemhog is a new tool that helps to find out where the leak is.
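In addition to the tools linked above (this is my own suggestion, not from the original answer), the standard-library tracemalloc module (Python 3.4+) can narrow down which source lines keep allocating between two points in time:
import tracemalloc

tracemalloc.start()
baseline = tracemalloc.take_snapshot()

# ... exercise the suspected code path, e.g. replay a batch of requests ...

current = tracemalloc.take_snapshot()
for stat in current.compare_to(baseline, 'lineno')[:10]:
    print(stat)  # top 10 source lines whose allocations grew since the baseline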
Django doesn't have known memory leak issues.
I had a similar memory issue. I found that a slow SQL query was causing a high DB CPU percentage. The memory issue was fixed after I fixed the slow SQL.

Large celery task memory leak

I have a huge celery task that works basically like this:
@task
def my_task(id):
    if settings.DEBUG:
        print "Don't run this with debug on."
        return False

    related_ids = get_related_ids(id)

    chunk_size = 500

    for i in xrange(0, len(related_ids), chunk_size):
        ids = related_ids[i:i+chunk_size]
        MyModel.objects.filter(pk__in=ids).delete()
        print_memory_usage()
I also have a manage.py command that just runs my_task(int(args[0])), so this can either be queued or run on the command line.
When run on the command line, print_memory_usage() reveals a relatively constant amount of memory used.
When run inside celery, print_memory_usage() reveals an ever-increasing amount of memory, continuing until the process is killed (I'm using Heroku with a 1GB memory limit, but other hosts would have a similar problem.) The memory leak appears to correspond with the chunk_size; if I increase the chunk_size, the memory consumption increases per-print. This seems to suggest that either celery is logging queries itself, or something else in my stack is.
Does celery log queries somewhere else?
Other notes:
DEBUG is off.
This happens both with RabbitMQ and Amazon's SQS as the queue.
This happens both locally and on Heroku (though it doesn't get killed locally due to having 16 GB of RAM.)
The task actually goes on to do more things than just deleting objects. Later it creates new objects via MyModel.objects.get_or_create(). This also exhibits the same behavior (memory grows under celery, doesn't grow under manage.py).
A bit of necroposting, but this may help people in the future. Although the best solution is to track down the source of the problem, sometimes that is not possible, for example because the source of the problem is outside of your control. In that case you can use the --max-memory-per-child option when spawning the Celery worker process.
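As a hedged sketch (the project name, broker URL, and the 300 MB figure are invented), the limit can be set either on the command line (celery -A proj worker --max-memory-per-child=300000, value in KiB) or in the app configuration:
from celery import Celery

app = Celery('proj', broker='redis://localhost:6379/0')

# Replace a worker child process once it has used roughly 300 MB of resident memory
app.conf.worker_max_memory_per_child = 300000  # KiB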
This turned out not to have anything to do with Celery. Instead, it was New Relic's logger that consumed all of that memory. Despite DEBUG being set to False, it was storing every SQL statement in memory in preparation for sending it to their logging server. I do not know if it still behaves this way, but it wouldn't flush that memory until the task fully completed.
The workaround was to use subtasks for each chunk of ids, to do the delete on a finite number of items.
The reason this wasn't a problem when running this as a management command is that new relic's logger wasn't integrated into the command framework.
Other proposed solutions attempted to reduce the overhead of the chunking operation, which doesn't help with an O(N) scaling concern, or to force the Celery tasks to fail if a memory limit is exceeded (a feature that didn't exist at the time, but might eventually have worked with infinite retries).
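A rough sketch of the subtask-per-chunk workaround described above (MyModel and get_related_ids come from the question; the decorator and chunk size are assumptions):
from celery import shared_task

# MyModel and get_related_ids are assumed to be defined in the project, as in the question
CHUNK_SIZE = 500

@shared_task
def delete_chunk(ids):
    # Each chunk is deleted in its own short-lived task, so any per-task state
    # (query logs, instrumentation buffers, etc.) is released when the task ends.
    MyModel.objects.filter(pk__in=ids).delete()

@shared_task
def my_task(id):
    related_ids = get_related_ids(id)
    for i in range(0, len(related_ids), CHUNK_SIZE):
        delete_chunk.delay(related_ids[i:i + CHUNK_SIZE])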
Try using the @shared_task decorator.
You can, however, run the worker with the --autoscale n,0 option. If the minimum number of pool processes is 0, Celery will kill unused workers and memory will be released.
But this is not a good solution.
A lot of memory is used by Django's Collector: before deleting, it collects all related objects and deletes them first. You can set on_delete to SET_NULL on model fields.
Another possible solution is deleting objects in limited batches, for example a certain number of objects per hour. That will lower memory usage.
Django does not have a raw_delete; you can use raw SQL for this (a sketch follows).
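A hedged sketch of those two suggestions (model, app, and table names are invented; the raw SQL assumes Django's default table naming):
from django.db import connection, models

class Child(models.Model):
    parent = models.ForeignKey(
        'Parent',
        null=True,
        on_delete=models.SET_NULL,  # deleting a Parent nulls this FK instead of deleting related rows
    )

def raw_delete_children(parent_id):
    # Bypasses the ORM's delete()/Collector machinery entirely
    with connection.cursor() as cursor:
        cursor.execute('DELETE FROM myapp_child WHERE parent_id = %s', [parent_id])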

Tornado memory leaks

We are developing an application that uses Tornado (2.4) and tornadio2 for transport. We have a problem with memory leaks and tried to find what is wrong with Pympler, but it didn't catch any leaks. The problem remains: pmap shows us that the memory is being used (see the screenshot linked here: http://prntscr.com/16wv6k).
More than 90% of the memory is used by one anonymous mapping.
With every user who connects, our application reserves some memory, but when the user leaves, the memory stays reserved and is not freed. We can't understand what the problem is.
The question is: what should we do to remove these leaks? We have to reboot the server every hour for just 500 users online. That's bad.
Maybe it's this (already closed) bug?
https://github.com/facebook/tornado/commit/bff07405549a6eb173a4cfc9bbc3fc7c6da5cdd7
