I have a Google App Engine app (a Flask app) that seems to have a memory leak. See the plot of the memory usage below. The memory usage continually creeps up until it hits the limit, at which point the instance is shut down and a new one is started.
It's a simple API with about 8 endpoints. None of them handle large amounts of data.
I added an endpoint that takes a memory snapshot with the tracemalloc package, compares it to the previous snapshot, and writes the output to Google Cloud Storage.
I don't see anything in the reports that indicates a memory leak. The peak memory usage is recorded as about 0.12 GiB.
I am also calling gc.collect() at the end of every function that is called by each endpoint.
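For context, a minimal sketch of the kind of diagnostic endpoint described above (the bucket name, report path, and Flask wiring here are placeholders, not my actual code):

    import gc
    import tracemalloc

    from flask import Flask
    from google.cloud import storage

    app = Flask(__name__)
    tracemalloc.start()
    _last_snapshot = None

    @app.route("/debug/memory")
    def memory_report():
        global _last_snapshot
        gc.collect()
        snapshot = tracemalloc.take_snapshot()
        lines = []
        if _last_snapshot is not None:
            # Top allocation differences since the previous snapshot.
            for stat in snapshot.compare_to(_last_snapshot, "lineno")[:25]:
                lines.append(str(stat))
        _last_snapshot = snapshot
        report = "\n".join(lines) or "first snapshot"
        # Placeholder bucket/path for the report.
        storage.Client().bucket("my-debug-bucket").blob(
            "memory/report.txt"
        ).upload_from_string(report)
        return "ok"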
Any ideas on how to diagnose this, or what might be causing it?
There could be many reasons for this. Is your app creating temporary files? On App Engine standard, /tmp is an in-memory filesystem, so temporary files that are never cleaned up consume instance memory and can look like a leak; they can also be produced by errors or warnings. First of all, I would check the Stackdriver logs for errors and warnings and try to fix them.
Is your application interacting with databases or storage buckets? Some memory issues come from a bad interaction between your app and a data storage service. A similar issue was encountered here and was mitigated by handling the Google Cloud Storage errors.
Another thing you can do is investigate how memory is actually used in your functions. For this there are some nice tools, such as Pympler and Heapy. Playing with them may give you valuable clues about the issue.
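For example, a minimal sketch with Pympler's SummaryTracker (the endpoint body is a placeholder): it prints a diff of object counts and sizes since the previous call, and rows that grow steadily across requests point at the leaking type.

    from pympler import tracker

    memory_tracker = tracker.SummaryTracker()

    def my_endpoint_handler():
        do_the_real_work()  # placeholder for the endpoint's logic
        # Print object types whose count/size changed since the last call.
        memory_tracker.print_diff()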
Related
I am new to Google App Engine, and I am trying to find an authoritative source for how much soft memory my application is consuming.
I am running the F1 instance class (128 MB memory limit) in the standard environment and have not yet hit a soft memory limit exceeded error.
The tools I am using to check memory are:
Google App Engine Dashboard (Memory usage chart) - shows memory use gradually increasing over the past week from 250MB to over 1GB. Refer to first image below.
Google App Engine Dashboard (Instance summary table) - shows the average Memory usage at 122MB. Refer to first image below.
Logging runtime.memory_usage() - shows a range between 120MB and 160MB throughout the day (see the sketch after this list).
Stackdriver Monitoring - shows memory mostly hovering around 150MB, but spiking as new instances are spawned. Refer to second image below.
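For reference, the third figure comes from something like the sketch below, assuming the first-generation Python 2.7 runtime API (the accessor names follow that API; this reports memory for the current instance only, not an app-wide aggregate):

    import logging
    from google.appengine.api import runtime

    def log_memory():
        usage = runtime.memory_usage()
        # Memory in MB for this instance only.
        logging.info("current: %s MB, 1m average: %s MB",
                     usage.current(), usage.average1m())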
I'd appreciate any guidance on which information source I should use to determine the application's actual memory use, and which one Google uses to raise a soft memory limit error.
App Engine Dashboard:
Stackdriver Monitoring:
App Engine won't throw an exception when you reach the soft limit. Instead, your instance will be gracefully restarted (stop accepting new requests, finish any existing requests, and shutdown).
In your first graph, the "250MB to over 1GB" is the aggregate memory usage over all your App Engine instances. You can see in the instance summary table that the average memory per instance is 122.3MB, so it's under the soft limit.
The Stackdriver graph is showing aggregate memory usage over a region. You can see that the spikes in memory correlate with multiple instances running simultaneously.
I'm running a simple Flask web app on Kubernetes. Recently I noticed some curious behavior while testing memory consumption. Using the following Python code, I report the total RSS used by my process:
    import resource
    resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
After some warm-up requests to the server, the reported resident memory was about 128 MB.
cAdvisor, on the other hand, reported 102 MiB of RSS. If it were the other way around it would make some sense, since the container could be using some memory for things besides my app; the weird thing is that the Python process apparently uses more memory than the container is aware of.
Converting from MiB to MB does not explain the gap, since 102 MiB ≈ 107 MB.
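A small sketch that normalizes both readings to bytes before comparing (psutil is my assumption here, not part of the original setup); note that ru_maxrss is the peak RSS and is reported in kilobytes on Linux, while container monitors sample the current RSS:

    import resource
    import psutil  # assumption: not part of the original setup

    # Peak RSS: kilobytes on Linux, so convert to bytes.
    peak_rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss * 1024
    # Current RSS in bytes, closer to what cAdvisor-style monitors sample.
    current_rss = psutil.Process().memory_info().rss

    print("peak:    %.1f MiB / %.1f MB" % (peak_rss / 2.0**20, peak_rss / 1e6))
    print("current: %.1f MiB / %.1f MB" % (current_rss / 2.0**20, current_rss / 1e6))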
What does the memory usage reported by cAdvisor stand for? Which number should I use as a reliable memory usage report?
I have a Django application that every so often runs into a memory leak.
I am not handling large data that could overload the memory; in fact, the application 'eats' memory incrementally (in a week it goes from ~70 MB to 4 GB), which is why I suspect the garbage collector is missing something, though I'm not sure. Also, the growth does not seem to depend on the number of requests.
Obvious things like DEBUG=True, leaving files open, etc. do not apply here.
I'm using uWSGI 2.0.3 (+ nginx) and Django 1.4.5
I could configure uWSGI to restart workers when memory exceeds a certain limit, but I'd rather not do that, since it is not really a solution.
Are there any well-known situations where the garbage collector "doesn't do its work properly"? Could you provide some code examples?
Is there any configuration of uWSGI + Django that could cause this?
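(For illustration, one classic situation on the Python 2 interpreters this Django/uWSGI stack ran on: a reference cycle through objects that define __del__ is never collected; CPython 2 parks such objects in gc.garbage, where they accumulate forever. PEP 442 fixed this in Python 3.4.)

    import gc

    class Node(object):
        def __del__(self):
            # On Python 2, __del__ makes any cycle containing this
            # object uncollectable.
            pass

    def make_cycle():
        a, b = Node(), Node()
        a.other, b.other = b, a  # reference cycle

    for _ in range(1000):
        make_cycle()

    gc.collect()
    # On Python 2 this prints a growing count; the memory is never freed.
    print(len(gc.garbage))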
I haven't found exactly what I was looking for (each project is a world!), but by following some clues and advice I managed to solve the issue. I'll share a few links that could help if you are facing a similar problem.
django memory leaks, part I, django memory leaks, part II and Finding and fixing memory leaks in Python
Some useful SO answers/questions:
Which Python memory profiler is recommended?, Is there any working memory profiler for Python3, Python memory leaks and Python: Memory leak debugging
Update
pyuwsgimemhog is a new tool that helps find where the leak is.
Django doesn't have known memory leak issues.
I had a similar memory issue. I found that a slow SQL query was causing a high DB CPU percentage. The memory issue was fixed after I fixed the slow query.
I'm working with GAE, and I'm trying to process a large zip file (~150 MB zipped, ~500 MB unzipped), which I need to do every day for my app.
I created a module to load a file from Google Cloud Storage and parse through it, saving specific pieces of information to Google Datastore along the way. The problem is that the instance shuts itself down within a few minutes, and I basically lose my place in the file. I am giving the instance more than enough CPU/memory, so that's not the issue.
Is there some way to handle this? The documentation for handling shutdowns is quite limited, and it seems shutdown requests aren't even guaranteed. It seems really odd to me that GAE can't handle a ~150 MB file, nor can it guarantee 10-15 minutes of uptime at a time. Is there a way to get around these limitations? Thanks.
EDIT:
Why, when I load my module ([modulename].[appname].appspot.com), does it hit all available instances?
The documentation states
"http://module.app-id.appspot.com
Send the request to an available instance of the default version of the named module (round robin scheduling is used)."
Did you really measure that there is enough memory? If you load the 500 MB unzipped file into memory, that's a lot.
I have seen this behavior when running out of memory. I would suggest trying a smaller test file. If that works, try to implement a streaming solution where the size of the file won't matter, since it is never fully loaded into memory.
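A rough sketch of what I mean by streaming, assuming the file object opened from Cloud Storage is seekable (handle_chunk is a placeholder for the parsing/Datastore logic):

    import zipfile

    def process_zip_streaming(gcs_file, chunk_size=1 << 20):
        # gcs_file: a seekable file-like object opened from Cloud Storage.
        with zipfile.ZipFile(gcs_file) as zf:
            for name in zf.namelist():
                with zf.open(name) as member:
                    while True:
                        chunk = member.read(chunk_size)
                        if not chunk:
                            break
                        handle_chunk(chunk)  # placeholder: parse, write to Datastore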
So my friend told me that instances on Heroku are persistent (I'm not sure if the vocab is right, but he implied that all users share the same instance).
So, if I have app.py, and an instance runs it, then all users share that instance. That means we can use a dict as a temporary cache for storing small things for faster response time.
So for example, if I'm serving an API, I could maybe define a cache like the sketch below and then use it.
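Something along these lines (names are illustrative; compute_expensive_result stands in for the real work):

    from flask import Flask, jsonify

    app = Flask(__name__)
    CACHE = {}  # module-level dict used as an in-process cache

    @app.route("/things/<key>")
    def get_thing(key):
        if key not in CACHE:
            CACHE[key] = compute_expensive_result(key)  # placeholder
        return jsonify(CACHE[key])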
How true is that? I tried looking this up, but could not find anything.
I deployed the linked API to Heroku on 1 dyno, and with just a few requests per second it was taking over 100 seconds to respond, so my understanding is that the cache wasn't working. (It might be worth noting that the majority of the time was spent in request queueing, according to New Relic.)
The Heroku Devcenter has several articles about the Heroku architecture.
Processes don't share memory. Moreover, your code is compiled into a slug and optimized for distribution to the dyno manager. In simple words, it means you don't even know which machine will execute your code. Theoretically, 5 users hitting your app may be routed to 5 different machines and processes.
Last but not least, keep in mind that if your app has only a single web dyno running, that dyno will sleep. You need more than one web dyno to prevent it from sleeping. When the dyno enters sleep mode, its memory is released and you will lose all the data held in memory.
This means that your approach will not work.
Generally speaking, on Heroku you should use external storage. For example, you can use a Memcached add-on and store your cache data in Memcached.
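A minimal sketch with the python-memcached client (the server address comes from the add-on's config vars, whose names vary by add-on, and load_profile_from_db is a placeholder):

    import os
    import memcache

    mc = memcache.Client([os.environ.get("MEMCACHE_SERVERS", "127.0.0.1:11211")])

    def get_user_profile(user_id):
        key = "profile:%s" % user_id
        profile = mc.get(key)
        if profile is None:
            profile = load_profile_from_db(user_id)  # placeholder
            mc.set(key, profile, time=300)  # expire after 5 minutes
        return profile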
Also note that you should not use the file system as a cache: not only is it slower than Memcached, but the Cedar stack file system should also be considered ephemeral.