Tornado file uploads - high memory usage - python

We have a tornado application running behind nginx, supporting user file uploads (I just use self.request.files to access the uploaded files). The maximum file size is 10MB and this is set in the nginx config, so the python process should never see files larger than that.
I've noticed that every time a user uploads a file, the memory goes up a little, but I can't figure out a pattern. I've tried to check for memory leaks (using pympler and objgraph) but couldn't find anything particularly suspicious. They only show me that the top memory-consuming objects are strings and dicts, with combined object sizes of no more than 7-8 MB. If the uploaded file itself still had a reference after the request completed, I would also expect the bytes type to be reported by pympler and/or objgraph, which I don't see.
I wonder how to best deal with this situation. Is this another case of the "high water" behavior? Would switching to stream_request_body yield better results? Or is it easier to simply restart the process once it hits a certain threshold?
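A minimal sketch of what a streaming handler could look like, assuming Tornado 4.0+ and writing each chunk straight to a temporary file (the target path is a placeholder, and multipart boundary parsing is omitted for brevity):

import tornado.web

@tornado.web.stream_request_body
class UploadHandler(tornado.web.RequestHandler):
    def prepare(self):
        # Tornado enforces its own body-size limit in addition to nginx's.
        self.request.connection.set_max_body_size(10 * 1024 * 1024)
        self._file = open("/tmp/upload.bin", "wb")  # placeholder path

    def data_received(self, chunk):
        # Each chunk is written out as it arrives, so the full upload is
        # never buffered in memory at once.
        self._file.write(chunk)

    def post(self):
        self._file.close()
        self.write("upload complete")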

Related

How to limit the amount of memory available to Python

I have a Django REST Framework service. Once in a while, the service gets a large request, and when that happens, it eats a large chunk of OS memory (a Kubernetes pod in my case). After the request is processed and garbage collection is done (sometimes forced), the objects are released, but the amount of memory occupied by Python never goes down until the OS starts running low on memory.
I have read this, this and this. And it looks like I cannot force Python to give that memory back to the OS. However, is there a way to limit the amount of memory Python has access to in the first place? This way, I can ensure that Python doesn't eat up memory on its own and other processes have plenty to play around with too. Kinda similar to how you can set the memory a JVM can use.
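One thing worth noting, as a minimal sketch assuming a Unix-like host: the resource module can cap the process's address space, so allocations beyond the limit raise MemoryError instead of growing the pod's memory indefinitely (the 512 MB figure is an arbitrary example):

import resource

# Cap this process's virtual address space; the hard limit is left unchanged.
limit_bytes = 512 * 1024 * 1024
soft, hard = resource.getrlimit(resource.RLIMIT_AS)
resource.setrlimit(resource.RLIMIT_AS, (limit_bytes, hard))

From the outside, a container memory limit on the pod achieves a similar cap, although the process is killed by the OOM killer rather than seeing a MemoryError.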

Infinite While Loop - Memory Use?

I am trying to figure out how a while loop determines how much memory to use.
At the basic level:
while True:
    pass
If I did a similar thing in PHP, it would grind my localhost to a crawl. But in this case I would expect it to use next to nothing.
So for an infinite loop in python, should I expect that it would use up tons of memory, or does it scale according to what is being done, and where?
Your infinite loop is a no-op (it doesn't do anything), so it won't increase the memory use beyond what is being used by the rest of your program. To answer your question, you need to post the code that you suspect is causing memory problems.
In PHP, however, the same loop will "hang" because the web server is expecting a response to send back to the client. Since no response is being received, the web browser will simply "freeze". Depending on how the web server is configured, it may choose to end the process and issue a timeout error.
You could do the same if you used Python and a web framework and put an infinite loop in one of your methods that returns a response to the client.
If you ran the equivalent PHP code from the shell, it will have the same effect as if it was written in Python (or any other language). That is, your console will block until you kill the process.
I'm asking because I want to create a program that runs infinitely, but I'm not sure how to determine its footprint or how much it will take from system resources.
A program that runs indefinitely (I think that's what you mean) generally falls into one of two cases:
It's waiting to do some work on a trigger (like a web server, which runs indefinitely but just sits there until someone visits your website).
It's doing a process that is taking a long time.
For #2, you need to determine the resource use by figuring out what work is being done.
If it's building a large list of items to do some calculations/sorting, then memory use will grow as the list grows (see the sketch below).
If it's processing a bunch of files, and during this process it generates a lot of output stored on disk, then disk usage will grow and then shrink when the process is done.
If it's a rendering engine, then memory use and CPU use will increase, along with disk use as memory is swapped out during rendering. However, such a system will not tax the disk too much.
The bottom line is, you can't get an answer to this unless you explain the process being run.
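To make the first point concrete, a minimal sketch (the file name and the handle() helper are placeholders): the accumulating version's memory use grows with the size of the input, while the streaming version stays roughly flat because each line is discarded after it is processed.

# Accumulating: builds the whole list in memory before doing anything with it.
lines = [line.upper() for line in open("big_input.txt")]

# Streaming: handles one line at a time, so memory use stays roughly constant.
with open("big_input.txt") as f:
    for line in f:
        handle(line.upper())  # hypothetical per-line handler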

Frustrating behavior for google modules/backends

I'm working with GAE, and I'm trying to process a large zip file (~150 MB zipped, ~500 MB unzipped), which I need to do every day for my app.
I created a module to load a file from Google Cloud Storage and parse through it, saving specific pieces of information in Google Datastore along the way. The problem is that it shuts itself down within a few minutes, and I basically lose where I am in the file. I am giving the instance more than enough CPU/memory, so that's not the issue.
Is there some way to handle this? The documentation for handling shutdowns is quite limited, and it seems shutdown requests aren't even guaranteed. It seems really odd to me that GAE isn't able to handle a ~150 MB file, nor can it guarantee 10-15 minutes of uptime at a time. Is there a way to get around these limitations? Thanks.
EDIT:
Why, when I go to load my module ([modulename].[appname].appspot.com), does it load all available instances?
The documentation states
"http://module.app-id.appspot.com
Send the request to an available instance of the default version of the named module (round robin scheduling is used)."
Did you really measure that there is enough memory? If you load the 500 MB unzipped file into memory, that's a lot.
I have seen this behavior when running out of memory. I would suggest trying with a smaller test file. If that works, try to implement a streaming solution, where the size of the file won't matter since it is never loaded into memory all at once.
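As a minimal sketch of the streaming idea (the archive path, member name, and process() helper are placeholders; on GAE the archive would be read from Cloud Storage rather than the local path shown here), zipfile can decompress a member incrementally instead of reading the whole archive into memory:

import zipfile

with zipfile.ZipFile("data.zip") as archive:      # placeholder path
    with archive.open("records.csv") as member:   # placeholder member name
        for raw_line in member:                   # decompressed in chunks, not all at once
            process(raw_line.decode("utf-8"))     # hypothetical per-line handler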

Store file on server for a short time period

What's the best way to name and store a generated file on a server, such that if the user requests the file in the next 5 minutes or so, you return it; otherwise, you return an error code? I am using Python and Webapp2 (although this would work with any WSGI server).
I would suggest having the client create a UUID and use it as the filename on the server; if the server already has something stored under that name, send back an error to the client (forcing a retry). Under most circumstances, the UUID will be completely unique and won't collide with anything already stored. If it does, the client can pick a new name and try again. If you want to make this slightly better, wait a random number of milliseconds between retries to reduce the likelihood of repeated collisions.
That'd be my approach to this specific, insecure, short-term storage problem.
As for removal, I'd leave that as the responsibility of the server: at intervals, check whether any file is more than 5 minutes old and remove it. As long as in-progress downloads keep the file open, deleting it shouldn't interrupt them.
If you want to leave the client in control, you will not have an easy way to enforce deletion when the client is offline, so I'd suggest keeping a list of the files in date order and deleting them:
in a background thread as necessary if you expect to be running a long time
at startup (which will require persisting these to disk)
at shutdown (doesn't require persisting to disk)
However, all of these mechanisms are prone to leaving unnecessary files on the server if you crash or lose the persistent information, so I'd still recommend making the deletion the responsibility of the server.
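A minimal sketch of the server-side approach, assuming a plain directory on disk (the storage location is a placeholder, and the UUID is generated server-side here for brevity; a client-supplied UUID works the same way once it reaches the server):

import os
import time
import uuid

STORAGE_DIR = "/tmp/shortlived"   # placeholder location
MAX_AGE_SECONDS = 5 * 60

def store(data):
    # A fresh UUID makes a collision with an existing file practically impossible.
    name = uuid.uuid4().hex
    with open(os.path.join(STORAGE_DIR, name), "wb") as f:
        f.write(data)
    return name

def sweep():
    # Run periodically on the server: delete anything older than five minutes.
    now = time.time()
    for name in os.listdir(STORAGE_DIR):
        path = os.path.join(STORAGE_DIR, name)
        if now - os.path.getmtime(path) > MAX_AGE_SECONDS:
            os.remove(path)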

How does urllib.urlopen() work?

Let's consider a big file (~100 MB), and let's say the file is line-based (a text file with relatively short lines, ~80 chars).
If I use the built-in open()/file(), the file will be loaded in a lazy manner.
I.e., if I do aFile.readline(), only a chunk of the file will reside in memory. Does urllib.urlopen() do something similar (with usage of a cache on disk)?
How big is the difference in performance between urllib.urlopen().readline() and file().readline()? Let's consider that the file is located on localhost. Once I open it with urllib.urlopen() and then with file(), how big will the difference in performance/memory consumption be when I loop over the file with readline()?
What is the best way to process a file opened via urllib.urlopen()? Is it faster to process it line by line? Or should I load a bunch of lines (~50) into a list and then process the list?
open (or file) and urllib.urlopen look like they're more or less doing the same thing here. urllib.urlopen (basically) creates a socket._socketobject and then invokes its makefile method (the contents of that method are included below):
def makefile(self, mode='r', bufsize=-1):
    """makefile([mode[, bufsize]]) -> file object
    Return a regular file object corresponding to the socket. The mode
    and bufsize arguments are as for the built-in open() function."""
    return _fileobject(self._sock, mode, bufsize)
Does the urllib.urlopen() do something similar (with usage of a cache on disk)?
The operating system does. When you use a networking API such as urllib, the operating system and the network card do the low-level work of splitting the data into small packets that are sent over the network, and of receiving incoming packets. Those are stored in a cache so that the application can abstract away the packet concept and pretend it is sending and receiving continuous streams of data.
How big is the difference in performance between urllib.urlopen().readline() and file().readline()?
It is hard to compare these two. For urllib, this depends on the speed of the network, as well as the speed of the server. Even for local servers, there is some abstraction overhead, so that, usually, it is slower to read from the networking API than from a file directly.
For actual performance comparisons, you will have to write a test script and do the measurement. However, why do you even bother? You cannot replace one with another since they serve different purposes.
What is the best way to process a file opened via urllib.urlopen()? Is it faster to process it line by line? Or should I load a bunch of lines (~50) into a list and then process the list?
Since the bottleneck is the networking speed, it might be a good idea to process the data as soon as you get it. This way, the operating system can cache more incoming data "in the background".
It makes no sense to cache lines in a list before processing them. Your program will just sit there waiting for enough data to arrive while it could be doing something useful already.
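As a minimal sketch of processing the response line by line as the data arrives (the URL and the handle() helper are placeholders; in Python 3 the equivalent call is urllib.request.urlopen):

import urllib

response = urllib.urlopen("http://localhost/bigfile.txt")  # placeholder URL
try:
    line = response.readline()
    while line:
        handle(line)  # hypothetical per-line handler, called as soon as the line is available
        line = response.readline()
finally:
    response.close()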
