Tornado memory leaks - python

We are developing an application that uses Tornado (2.4) with tornadio2 for transport. We have a problem with memory leaks. We tried to find the cause with pympler, but it hasn't caught any leaks. The problem remains, though: pmap shows us that the memory is in use (see the screenshot linked at http://prntscr.com/16wv6k).
More than 90% of the memory is used by one anonymous ("anon") mapping.
With every user who connects, our application reserves some memory, but when the user disconnects, the memory stays reserved and is never freed. We can't understand what the problem is.
The question is: what should we do to remove these leaks? We have to reboot the server every hour for just 500 users online. That's bad.

Maybe it's this (already closed) bug?
https://github.com/facebook/tornado/commit/bff07405549a6eb173a4cfc9bbc3fc7c6da5cdd7
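
For anyone reproducing the hunt, the pympler check described above is typically done with its SummaryTracker; a minimal sketch of that diffing approach:

from pympler import tracker

tr = tracker.SummaryTracker()

# ... let some users connect and then disconnect ...

# Prints a diff of object counts/sizes since the last call; a leak at
# the Python level shows up as rows whose counts keep growing. A leak
# that only pmap can see (as here) points below Python, e.g. at C
# extensions or the allocator not returning pages to the OS.
tr.print_diff()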

Related

killing kube pod on the fly with python

I have an app running on Kubernetes, and I want a way to dynamically (from the code) kill the pod.
Does someone know a way to do it? I tried exiting the process and raising an error, but it didn't work.
Based on the answer to my comment above, I understand your app has a memory leak and you want to restart the pod to clean up that memory leak.
First, I will state the obvious: the best way to fix the problem is to fix the leak in the code. That being said, you have also said that finding the problem will take a lot of time.
The other solution I can think of is to utilise Kubernetes' own resource requests/limits functionality.
You declare how much memory and/or CPU you want the node to reserve for your app, and you can also declare the maximum memory and/or CPU the node should give your app. If CPU usage exceeds the declared limit, the app is throttled. If memory usage exceeds the declared limit, the pod is OOMKilled.
Keeping the memory limit low will cause your app to be restarted whenever it exceeds the limit you declared.
Details, and examples (including a memory leak example), can be found here: https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/
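
As a footnote on the original approach of exiting from the code: sys.exit() raises SystemExit, which many web frameworks catch, which would explain why exiting "didn't work". A minimal sketch that bypasses that, assuming the pod's restartPolicy is Always (the default):

import os

def restart_pod():
    # os._exit terminates the process immediately, skipping exception
    # handlers and cleanup, so the container dies with a non-zero exit
    # code and Kubernetes starts a fresh one in its place.
    os._exit(1)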

How does app engine (python) manage memory across requests (Exceeded soft private memory limit)

I'm experiencing occasional 'Exceeded soft private memory limit' errors in a wide variety of request handlers in App Engine. I understand that this error means that the RAM used by the instance has exceeded the amount allocated, and how that causes the instance to shut down.
I'd like to understand what might be the possible causes of the error, and to start, I'd like to understand how app engine python instances are expected to manage memory. My rudimentary assumptions were:
1. An F2 instance starts with 256 MB
2. When it starts up, it loads my application code - let's say, 30 MB
3. When it handles a request it has 226 MB available
   - so long as that request does not exceed 226 MB (+ margin of error), the request completes without error
   - if it does exceed 226 MB + margin, the instance completes the request, logs the 'Exceeded soft private memory limit' error, then terminates - now go back to step 1
4. When that request returns, any memory used by it is freed up - i.e. the unused RAM goes back to 226 MB
5. Steps 3-4 are repeated for each request passed to the instance, indefinitely
That's how I presumed it would work, but given that I'm occasionally seeing this error across a fairly wide set of request handlers, I'm now not so sure. My questions are:
a) Does step #4 happen?
b) What could cause it not to happen? or not to fully happen? e.g. how could memory leak between requests?
c) Could storage in module level variables causes memory usage to leak? (I'm not knowingly using module level variables in that way)
d) What tools / techniques can I use to get more data? E.g. measure memory usage at entry to request handler?
In answers/comments, where possible, please link to the gae documentation.
[edit] Extra info: my app is configured as threadsafe: false. If this has a bearing on the answer, please state what it is. I plan to change to threadsafe: true soon.
[edit] Clarification: This question is about the expected behavior of gae for memory management. So while suggestions like 'call gc.collect()' might well be partial solutions to related problems, they don't fully answer this question. Up until the point that I understand how gae is expected to behave, using gc.collect() would feel like voodoo programming to me.
Finally: If I've got this all backwards, then I apologize in advance - I really can't find much useful info on this, so I'm mostly guessing...
App Engine's Python interpreter does nothing special, in terms of memory management, compared to any other standard Python interpreter. So, in particular, there is nothing special that happens "per request", such as your hypothetical step 4. Rather, as soon as any object's reference count decreases to zero, the Python interpreter reclaims that memory (module gc is only there to deal with garbage cycles -- when a bunch of objects never get their reference counts down to zero because they refer to each other even though there is no accessible external reference to them).
So, memory could easily "leak" (in practice, though technically it's not a leak) "between requests" if you use any global variable -- such variables will outlive the handler-class instance and its (e.g.) get method -- i.e., your point (c), though you say you are not doing that.
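
A contrived sketch of point (c), to make the mechanism concrete (the handler and cache names are mine, using a webapp2-style handler, the stock GAE Python 2.7 framework):

import webapp2

_cache = {}  # module-level, so it survives across requests on this instance

class LeakyHandler(webapp2.RequestHandler):
    def get(self):
        # Every request adds an entry that is never evicted; the
        # instance's memory grows until the soft limit is exceeded.
        _cache[self.request.url] = self.request.body
        self.response.write('cached %d entries' % len(_cache))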
Once you declare your module to be threadsafe, an instance may happen to serve multiple requests concurrently (up to what you've set as max_concurrent_requests in the automatic_scaling section of your module's .yaml configuration file; the default value is 8). So, your instance's RAM will need to be a multiple of what each request needs.
As for (d), to "get more data" (I imagine you actually mean, get more RAM), the only thing you can do is configure a larger instance_class for your memory-hungry module.
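
If (d) really is about measurement rather than more RAM, the Python 2.7 runtime exposes a runtime API for exactly this; a hedged sketch (names are from the GAE docs as I recall them -- verify against the documentation before relying on it):

import logging
import webapp2
from google.appengine.api import runtime

class MeasuredHandler(webapp2.RequestHandler):
    def get(self):
        # Log instance memory at entry to the handler; comparing
        # entry/exit values across requests shows what accumulates.
        logging.info('memory at entry: %s MB', runtime.memory_usage().current())
        # ... normal handler work ...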
To use less RAM, there are many techniques -- which have nothing to do with App Engine, everything to do with Python, and in particular, everything to do with your very specific code and its very specific needs.
The one GAE-specific issue I can think of is that ndb's caching has been reported to leak -- see https://code.google.com/p/googleappengine/issues/detail?id=9610 ; that thread also suggests workarounds, such as turning off ndb caching or moving to old db (which does no caching and has no leak). If you're using ndb and have not turned off its caching, that might be the root cause of "memory leak" problems you're observing.
Point 4 is an invalid assumption: Python's garbage collector doesn't return memory that readily. The Python process holds on to that memory, and it is not reclaimed until the garbage collector makes a pass. In the meantime, if some other request requires more memory, new memory may be allocated on top of the memory from the first request. If you want to force Python to garbage-collect, you can use gc.collect() as mentioned here
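
For illustration, a minimal way to force a collection pass at the end of a request, as suggested above:

import gc

def after_request():
    # Force a full collection pass; the return value is the number of
    # unreachable objects found (cycles that refcounting alone missed).
    collected = gc.collect()
    return collected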
Take a look at this Q&A for approaches to check on garbage collection and for potential alternate explanations: Google App Engine DB Query Memory Usage

Django memory leak: possible causes?

I have a Django application that every so often runs into a memory leak.
I am not using large data that could overload the memory; in fact, the application 'eats' memory incrementally (in a week, it goes from ~70 MB to 4 GB), which is why I suspect the garbage collector is missing something, though I am not sure. It also seems that this growth does not depend on the number of requests.
Obvious things like DEBUG=True, leaving files open, etc. do not apply here.
I'm using uWSGI 2.0.3 (+ nginx) and Django 1.4.5.
I could set up uWSGI to restart the worker when memory exceeds a certain limit, but I wouldn't like to do that, since it is not really a solution.
Are there any well-known situations where the garbage collector "doesn't do its work properly"? Could you provide some code examples?
Is there any configuration of uWSGI + Django that could cause this?
I haven't found the exact cause I was looking for (each project is a world!), but by following some clues and advice I managed to solve the issue. I'm sharing a few links that could help if you are facing a similar problem.
django memory leaks, part I, django memory leaks, part II and Finding and fixing memory leaks in Python
Some useful SO answers/questions:
Which Python memory profiler is recommended?, Is there any working memory profiler for Python3, Python memory leaks and Python: Memory leak debugging
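For a concrete starting point, here is a small probe in the spirit of those links, using objgraph (the choice of objgraph is mine, not the original poster's; any of the profilers above works similarly):

import gc
import objgraph

def dump_growth():
    gc.collect()                    # collect what can be collected first
    objgraph.show_growth(limit=10)  # print object types whose counts grew

Call it periodically (e.g. from a debug-only view) and watch for types whose counts rise without bound between calls.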
Update
pyuwsgimemhog is a new tool that helps to find out where the leak is.
Django doesn't have known memory leak issues.
I had a similar memory issue. I found a slow SQL query that was causing high DB CPU usage; the memory issue was fixed after I fixed the slow query.

web.py memory leak

Am I doing something wrong or does web.py leak memory?
import web

class Index:
    def GET(self):
        return 'hello web.py'

app = web.application(('/*', 'Index'), globals())

if __name__ == '__main__':
    app.run()
Run the above file. Watch how much memory the task uses. Go to localhost:8080 in your browser. Close the browser (to keep the page from getting cached), then open the page again, and see how the memory usage rises. It goes up every time you close the browser and re-visit the page.
Running python 2.6 on Win XP.
After running your code and sending it thousands of requests (via another Python process using urllib2), I find that it grows by about 200k over the course of the first few hundred requests and then stops growing. That doesn't seem unreasonable, and it needn't indicate a memory leak. Remember that Python uses automatic memory management via a combination of reference counting and garbage collection, so there's no guarantee that every bit of memory it uses is reusable the instant it's no longer in use; and it may request memory from the OS and then not return it even though it isn't needed any more.
So I think the answer is: you aren't doing anything wrong, and web.py doesn't appear to leak memory.
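
For reference, the load-generation half of that test can be reproduced in a few lines (Python 2, matching the question; the URL assumes the default web.py port):

import urllib2

for i in xrange(5000):
    # Each response is read fully and discarded; watch the server
    # process's memory in Task Manager while this runs.
    urllib2.urlopen('http://localhost:8080/').read()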

SimpleXmlRpcServer _sock.recv freezes after thousands of requests

I'm serving requests from several XMLRPC clients over WAN. The thing works great for, let's say, a period of one day (sometimes two), then freezes in socket.py:
data = self._sock.recv(self._rbufsize)
(_sock.timeout is -1, and _sock.gettimeout() returns None.)
There is nothing special I do in the main thread (just receiving XMLRPC calls); there are another two threads talking to the DB. Both of those threads work fine and survive the freeze (I checked with WinPdb). Clients send requests no longer than 1 KB, with nothing special in the content: just nice, clean strings in a dictionary. Between two freezes I serve tens of thousands of requests without problems.
Firewall is off, no strange software on the same machine, etc...
I use Windows XP and Python 2.6.4. I've checked the differences between 2.6.4 and 2.6.5 and didn't find anything important (or am I mistaken?). Version 2.7 is not an option, as I would be missing binaries for MySQLdb.
The only thing that happens from time to time, caused by clients with poor internet connections, is that sockets break. This happens every 5-10 minutes (there are just five clients, each accessing the server every 2 seconds).
I've spent a great deal of time on this issue, and I'm beginning to run out of ideas. Any hint or thought would be highly appreciated.
What exactly is happening in your OS's TCP/IP stack (possibly in the Python layers on top, but that's less likely) to cause this is a mystery. As a practical workaround, I'd set a timeout longer than the delays you expect between requests (10 seconds should be plenty if you expect a request every 2 seconds) and, if one occurs, close and reopen the socket. (Calibrate the delay needed to work around the freezes without interrupting normal traffic by trial and error.) It's unpleasant to hack in a fix without understanding the problem, I know, but being pragmatic about such things is a necessary survival trait in the world of writing, deploying and operating actual server systems. Be sure to comment the workaround accurately for future maintainers!
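
A minimal sketch of that workaround (Python 2 module names; the port and registered function are placeholders):

import socket
from SimpleXMLRPCServer import SimpleXMLRPCServer

socket.setdefaulttimeout(10)  # well above the ~2 s between client calls

server = SimpleXMLRPCServer(('0.0.0.0', 8000), logRequests=False)
server.register_function(pow)  # stand-in for the real handlers

while True:
    # With a default timeout set, a hung recv() now raises
    # socket.timeout; the server's handle_error() logs it and that
    # connection is dropped, so the loop keeps serving instead of
    # freezing forever in _sock.recv().
    server.handle_request()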
Thanks so much for the fast response. Right after I received it, I increased the timeout to 10 seconds. It is all running without problems now, but of course I'll need to wait another day or two for some confirmation; only after 5 days will I be sure, and then I'll come back with the results. I see that 140K requests have already gone through fine, but given how hard this one has been, I'd wait for at least another 200K.
What you were proposing about auto-adaptation of timeouts (without taking the system down) also sounds reasonable. Would the right way to go be to create a small class (e.g. AutoTimeoutCalibrator) and embed it directly into serial.py?
Yes - being pragmatic is the only way without losing another 10 days trying to figure out the real reason behind it.
Thanks again; I'll be back with the results.
(Sorry, but for some reason I was not able to post this as a reply to your post.)
