I'm working with GAE, and I'm trying to process a large zip file (~150 MB zipped, ~500 MB unzipped), which I need to do every day for my app.
I created a module to load a file from Google Cloud Storage and parse through it, saving specific pieces of information in Google Datastore along the way. The problem is that the instance shuts itself down within a few minutes, and I basically lose my place in the file. I am giving the instance more than enough CPU/memory, so that's not the issue.
Is there some way to handle this? The documentation for handling shutdowns is quite limited, and it seems shutdown requests aren't even guaranteed. It seems really odd to me that GAE can't handle a ~150 MB file, nor guarantee 10-15 minutes of uptime at a time. Is there a way to get around these limitations? Thanks.
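For reference, the shutdown-hook approach described in the docs looks roughly like this (a sketch only, assuming the legacy Python runtime and a manual-scaling module; the Checkpoint model and the offset bookkeeping are hypothetical, and as noted the hook is not guaranteed to run):

# Sketch of checkpointing progress with the legacy runtime's shutdown hook.
# The Checkpoint entity and byte-offset bookkeeping are made up.
from google.appengine.api import runtime
from google.appengine.ext import ndb

class Checkpoint(ndb.Model):
    offset = ndb.IntegerProperty(default=0)  # how far into the file we got

CHECKPOINT_KEY = ndb.Key(Checkpoint, "daily-import")
current_offset = 0

def save_progress(offset):
    Checkpoint(key=CHECKPOINT_KEY, offset=offset).put()

def on_shutdown():
    # Called when the instance is about to be stopped (not guaranteed),
    # so persist the current position one last time.
    save_progress(current_offset)

runtime.set_shutdown_hook(on_shutdown)

# Inside the processing loop, checkpoint periodically and bail out early if
# runtime.is_shutting_down() returns True, so a fresh instance can resume
# from the saved offset instead of starting over.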
EDIT:
Why, when I go to load my module ([modulename].[appname].appspot.com), does it load all available instances?
The documentation states:
"http://module.app-id.appspot.com — Send the request to an available instance of the default version of the named module (round robin scheduling is used)."
Did you really measure that there is enough memory? If you load the 500 MB unzipped into memory, that's a lot.
I have seen this behavior when running out of memory. I would suggest trying with a smaller test file. If that works, try to implement a streaming solution where the size of the file won't matter, since it is never loaded into memory all at once.
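If it helps, here is a minimal sketch of such a streaming approach, assuming the legacy GAE cloudstorage client and an archive that holds a single large text member; the bucket path and the line handler are illustrative only:

# Stream-parse a zip from GCS without holding the whole archive in memory.
# Assumes the GoogleAppEngineCloudStorageClient package (module `cloudstorage`).
import zipfile
import cloudstorage

BUCKET_PATH = "/my-bucket/daily-dump.zip"  # hypothetical object path

def process_zip():
    # cloudstorage.open returns a seekable, file-like reader, which is what
    # zipfile needs; members are decompressed incrementally as they are read.
    gcs_file = cloudstorage.open(BUCKET_PATH)
    try:
        with zipfile.ZipFile(gcs_file) as archive:
            member_name = archive.namelist()[0]
            with archive.open(member_name) as member:
                for line in member:        # one line/chunk in memory at a time
                    handle_line(line)
    finally:
        gcs_file.close()

def handle_line(line):
    pass  # placeholder for the Datastore writes described in the question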
Related
So I have this large file (about 1.5 GB) which I load into Python pretty regularly. The loading parses it into an object that ends up taking about 3 GB of RAM.
The loading process is not that long (about 40 seconds on my PC), but it still becomes an issue when I want to debug programs that load it.
I was trying to come up with a solution for loading it more quickly. At first I thought about pickling the resulting Python object, but as I said earlier it's 3 GB, so unpickling it ended up taking even longer than the parsing process.
Is there a way to let Python access it more quickly? I am not really opposed to any working solution (a cloud server? another programming language?), but I am not even sure whether this is technically possible at all.
I am using moviepy to insert text into different parts of a video in my Django project. Here is my code:
from moviepy.editor import VideoFileClip, TextClip, CompositeVideoClip
txt = TextClip('Hello', font="Tox-Typewriter")
video = VideoFileClip("videofile.mp4").subclip(0,31)
final_clip = CompositeVideoClip([video, txt]).set_duration(video.duration)
final_clip.write_videofile("media/{}.mp4".format('hello'),
fps=24,threads=4,logger=None)
final_clip.close()
I am getting the video written to a file in 10 s and showing the video in the browser. The issue is when there are simultaneous requests to the server: if 5 simultaneous requests come in, each response takes 50 s instead of 10 s. It seems that there is some resource shared by all these requests, and one is waiting for another to release it, but I could not find out where this is happening. I have tried using 5 separate files, one for each request, thinking that all the requests opening the same file was the problem, but that did not work out. Please help me find a solution.
Without knowing more about your application setup, any answer to this question will really be a shot in the dark.
As you know, editing video, or making any changes to video, is going to be resource intensive. In this instance you are a lot better off offloading the processing to a dedicated task runner (Celery, django-q). Not only does this avoid holding open web server resources until the task is complete, it also means you can offload the "work" to machines better suited for the job (optimized for IO- or CPU-bound work, depending on the use case).
In development, if you are running the local development server, you will only be using one process. One process, when sent multiple intensive requests, will get blocked. You could look at using something like gunicorn or waitress and set the number of worker processes to more than 1.
But still, at some point you are going to have to offload this work to a task runner; doing it inline in a production environment could end up over-consuming web server resources.
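As a hedged sketch of what that offloading might look like with Celery (the task name, file paths, and caption are invented, and your broker/queue setup will differ):

# tasks.py -- move the moviepy work into a Celery task so the web process
# only enqueues a job and returns immediately. Names are illustrative.
from celery import shared_task
from moviepy.editor import VideoFileClip, TextClip, CompositeVideoClip

@shared_task
def render_captioned_video(source_path, output_path, caption):
    txt = TextClip(caption, font="Tox-Typewriter")
    video = VideoFileClip(source_path).subclip(0, 31)
    final_clip = CompositeVideoClip([video, txt]).set_duration(video.duration)
    final_clip.write_videofile(output_path, fps=24, threads=4, logger=None)
    final_clip.close()
    return output_path

# In the Django view, enqueue instead of rendering inline:
#   render_captioned_video.delay("videofile.mp4", "media/hello.mp4", "Hello")
# then poll task state (or notify the client) to know when the file is ready.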
On a more technical note, have you looked at this issue on GitHub:
https://github.com/Zulko/moviepy/issues/645
They talk about passing the parameter `progress_bar=False`. If in your use case you are writing 4 files and each is writing to a progress bar, you might be getting IO swamped.
Also, consider running a profiler while replicating the issue; it might give you better insight into where the bottleneck is occurring (IO or CPU).
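For example, a quick profiling sketch (render_once is a placeholder for the write_videofile code from the question):

# Profile one render to see whether time goes to the encoder (CPU) or to
# reading/writing files (IO).
import cProfile
import pstats

def render_once():
    pass  # put the CompositeVideoClip/write_videofile call from above here

profiler = cProfile.Profile()
profiler.enable()
render_once()
profiler.disable()

pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)  # top 20 by cumulative time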
I am currently developing a data-processing web server (Linux) using Python Flask.
The general workflow is:
1. Get an input file from the user (handled by Python Flask).
2. Flask passes this input file to a Java program.
3. The Java program processes the input file and saves the outputs (multiple files) on the server.
4. Flask calls another Python script which processes these outputs to get the final result and returns it to the client (a rough sketch of this flow follows below).
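For concreteness, here is a rough sketch of that flow, not my actual code; the endpoint, upload field name, jar name, and the fixed /srv/work paths are all made up. The fixed intermediate paths are exactly where concurrent requests can step on each other.

# Sketch of the Flask -> Java -> Python post-processing pipeline.
import subprocess
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/process", methods=["POST"])
def process():
    uploaded = request.files["data"]                 # step 1: input from the user
    uploaded.save("/srv/work/input.dat")             # shared, fixed path
    subprocess.run(                                  # steps 2-3: the Java processor
        ["java", "-jar", "processor.jar", "/srv/work/input.dat", "/srv/work/out/"],
        check=True,
    )
    result = postprocess("/srv/work/out/")           # step 4: second Python stage
    return jsonify(result)

def postprocess(output_dir):
    # placeholder for the script that turns the Java outputs into the final result
    return {"status": "done"}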
The problem is that between step 3 and step 4 there exist some intermediate files. This would not have been a problem at all if this were a local program, but as a server program, when more than one client accesses it, a client could get an unexpected result generated from input provided by another user who is using the web program at the same time.
From where I see it, this is a mutual-exclusion problem on file access. I have had mutual-exclusion problems with threads before, and I solved some of them using thread locks, such as synchronized blocks in Java and Lock in Python, but I am not sure what to do when it comes to files instead of threads.
It occurred to me that maybe I can spawn different copies of the files for different clients. But as I understand it, HTTP is stateless, so you can't really know who is accessing the server. I don't want to add a login system and a user database for this purpose, as I sense there is a much simpler and better way to resolve this problem.
I have been looking for a good solution for days but haven't found an ideal one, so I am looking for some advice here. Any suggestions will be highly appreciated. If you can suggest a viable solution, please feel free to provide your name so I can add you to the thank-you list of the digital and paper publications about this tool when it's published.
As a systems person, I would suggest something like this:
https://docs.python.org/3/library/fcntl.html#fcntl.lockf
This is how I would solve it; there are many ways to approach this problem, and which one is best is of course up for debate.
Assume the output file is where the conflict happens.
Lock the file and keep polling until the resource is released (the user has to wait), so that only one user accesses the file at a time. Poll with time.sleep for 2-3 seconds and wrap the lock attempt in a try/except. Only when the lock on the output file is released does the next user's process continue normally.
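A minimal sketch of that polling/locking idea, assuming a single shared output file at a made-up path (fcntl is POSIX-only):

# Acquire an exclusive lock on the shared output file, polling until it is free.
import fcntl
import time

OUTPUT_PATH = "/tmp/shared_output.dat"  # assumed location of the contested file

def write_output_exclusively(data, poll_seconds=2.0):
    with open(OUTPUT_PATH, "ab") as f:
        while True:
            try:
                # Non-blocking exclusive lock; raises OSError if another
                # request currently holds the lock.
                fcntl.lockf(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
                break
            except OSError:
                time.sleep(poll_seconds)  # wait and poll again
        try:
            f.write(data)
            f.flush()
        finally:
            fcntl.lockf(f, fcntl.LOCK_UN)  # release so the next request can proceed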
Another easy way is to dump the data into an RDBMS like MySQL or Postgres; it will handle all of the file-access nightmares that come from concurrent requests (i.e. put the output in a database instead of a file).
We have a tornado application running behind nginx, supporting user file uploads (I just use self.request.files to access the uploaded files). The maximum file size is 10MB and this is set in the nginx config, so the python process should never see files larger than that.
I've noticed that every time a user uploads a file, memory usage goes up by a little, but I can't figure out a pattern. I've tried to find memory leaks (using pympler and objgraph) but couldn't find anything particularly suspicious. They only show me that the top memory-consuming objects are strings and dicts, with combined object sizes of no more than 7-8 MB. If the uploaded file itself still had a reference after the request completed, I would also expect the bytes type to be reported by pympler and/or objgraph, which I don't see.
I wonder how to best deal with this situation. Is this another case of the "high water" behavior? Would switching to stream_request_body yield better results? Or is it easier to simply restart the process once it hits a certain threshold?
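For context, my understanding is that a stream_request_body handler would look roughly like this (a sketch only; the handler and destination path are made up), with chunks handed to data_received as they arrive instead of being buffered into self.request.files:

# Stream an upload to disk chunk by chunk so the full body is never in memory.
import tornado.web

@tornado.web.stream_request_body
class UploadHandler(tornado.web.RequestHandler):
    def prepare(self):
        # Called before the body arrives; open the destination once per request.
        self._out = open("/tmp/upload.bin", "wb")  # hypothetical destination

    def data_received(self, chunk):
        # Chunks arrive incrementally as the client sends them.
        self._out.write(chunk)

    def put(self):
        self._out.close()
        self.write("ok")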
We have a web service which serves small, arbitrary segments of a fixed inventory of larger MP3 files. The MP3 files are generated on-the-fly by a python application. The model is, make a GET request to a URL specifying which segments you want, get an audio/mpeg stream in response. This is an expensive process.
We're using Nginx as the front-end request handler. Nginx takes care of caching responses for common requests.
We initially tried using Tornado on the back-end to handle requests from Nginx. As you would expect, the blocking MP3 operation kept Tornado from doing its thing (asynchronous I/O). So, we went multithreaded, which solved the blocking problem, and performed quite well. However, it introduced a subtle race condition (under real world load) that we haven't been able to diagnose or reproduce yet. The race condition corrupts our MP3 output.
So we decided to set our application up as a simple WSGI handler behind Apache/mod_wsgi (still with Nginx up front). This eliminates the blocking issue and the race condition, but creates a cascading load (i.e. Apache creates too many processes) on the server under real-world conditions. We're working on tuning Apache/mod_wsgi right now, but we're still at a trial-and-error phase. (Update: we've switched back to Tornado. See below.)
Finally, the question: are we missing anything? Is there a better way to serve CPU-expensive resources over HTTP?
Update: Thanks to Graham's informative article, I'm pretty sure this is an Apache tuning problem. In the meantime, we've gone back to using Tornado and are trying to resolve the data-corruption issue.
For those who were so quick to throw more iron at the problem, Tornado and a bit of multi-threading (despite the data integrity problem introduced by threading) handles the load acceptably on a small (single core) Amazon EC2 instance.
Have you tried Spawning? It is a WSGI server with a flexible assortment of threading modes.
Are you making the mistake of using embedded mode of Apache/mod_wsgi? Read:
http://blog.dscpl.com.au/2009/03/load-spikes-and-excessive-memory-usage.html
Ensure you use daemon mode if using Apache/mod_wsgi.
You might consider a queuing system with AJAX notification methods.
Whenever there is a request for your expensive resource, and that resource needs to be generated, add that request to the queue (if it's not already there). That queuing operation should return an ID of an object that you can query to get its status.
Next you have to write a background service that spins up worker threads. These workers simply dequeue a request, generate the data, then save the data's location in the request object.
The webpage can make AJAX calls to your server to find out the progress of the generation and to give a link to the file once it's available.
This is how LARGE media sites work - those that have to deal with video in particular. It might be overkill for your MP3 work however.
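A minimal in-process sketch of that pattern (a real deployment would likely use a proper broker and a persistent job store; every name here is invented):

# Queue expensive render jobs and let background workers drain the queue;
# the AJAX endpoint just looks up jobs[job_id] to report status/location.
import queue
import threading
import uuid

jobs = {}                      # job_id -> {"status": ..., "location": ...}
work_queue = queue.Queue()

def enqueue_render(segment_spec):
    """Called by the request handler: register the job and return its id."""
    job_id = str(uuid.uuid4())
    jobs[job_id] = {"status": "queued", "location": None}
    work_queue.put((job_id, segment_spec))
    return job_id

def worker():
    while True:
        job_id, segment_spec = work_queue.get()
        jobs[job_id]["status"] = "rendering"
        path = render_mp3(segment_spec)            # the expensive generation step
        jobs[job_id].update(status="done", location=path)
        work_queue.task_done()

def render_mp3(segment_spec):
    return "/var/cache/mp3/%s.mp3" % uuid.uuid4()  # placeholder for the real renderer

# Start a small pool of background workers.
for _ in range(4):
    threading.Thread(target=worker, daemon=True).start()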
Alternatively, look into running a couple of machines to distribute the load. Your threads on Apache will still block, but at least you won't consume resources on the web server.
Please define "cascading load", as it has no common meaning.
Your most likely problem is going to be if you're running too many Apache processes.
For a load like this, make sure you're using the prefork mpm, and make sure you're limiting yourself to an appropriate number of processes (no less than one per CPU, no more than two).
It looks like you are doing things right and are just lacking CPU power: can you determine what the CPU load is while generating these MP3s?
I think the next thing you have to do is add more hardware to render the MP3s on other machines, or else find a way to deliver pre-rendered MP3s (maybe you can cache some of your media?).
BTW, scaling for the web was the theme of a keynote lecture by Jacob Kaplan-Moss at PyCon Brasil this year, and it is far from being a solved problem. The stack of technologies one needs to handle is quite impressive. (I could not find an online copy of the presentation, though; sorry for that.)