I am using moviepy to insert text into different parts of a video in my Django project. Here is my code.
from moviepy.editor import VideoFileClip, TextClip, CompositeVideoClip

# Overlay the text on the first 31 seconds of the video, then render to disk.
txt = TextClip('Hello', font="Tox-Typewriter")
video = VideoFileClip("videofile.mp4").subclip(0, 31)
final_clip = CompositeVideoClip([video, txt]).set_duration(video.duration)
final_clip.write_videofile("media/{}.mp4".format('hello'),
                           fps=24, threads=4, logger=None)
final_clip.close()
The video gets written to a file in about 10 seconds, and I then show it in the browser. The issue arises when there are simultaneous requests to the server. Say 5 simultaneous requests come in: each response then takes 50 seconds instead of 10. It seems there is some resource shared by all these requests, and each one waits for another to release it, but I could not find out where this happens. I have tried using 5 separate files, one per request, thinking the problem was all requests opening the same file, but that did not work. Please help me find a solution.
So without knowing more about your application setup, any answer to this question is really a shot in the dark.
As you know, editing video, or any change to video, is resource intensive. In this instance you are much better off handing any processing to a dedicated task runner (Celery, django-q). Not only does this avoid holding server resources open until the task is complete, it also means you can offload the work to machines better suited for the job (optimized for IO-bound or CPU-bound work, depending on the use case).
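For illustration, here is a minimal sketch of offloading the render to Celery (the broker URL, the tasks module name, and the output name are my own assumptions, not part of the original code):

from celery import Celery
from moviepy.editor import VideoFileClip, TextClip, CompositeVideoClip

app = Celery("tasks", broker="redis://localhost:6379/0")  # assumed broker

@app.task
def render_video(name):
    # Same rendering as in the question, now run in a worker process.
    txt = TextClip('Hello', font="Tox-Typewriter")
    video = VideoFileClip("videofile.mp4").subclip(0, 31)
    final_clip = CompositeVideoClip([video, txt]).set_duration(video.duration)
    final_clip.write_videofile("media/{}.mp4".format(name),
                               fps=24, threads=4, logger=None)
    final_clip.close()

The Django view would then call render_video.delay("hello") and return immediately; the worker does the heavy lifting, so web processes are never tied up for 10 seconds at a time.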
In development, if you are running the local development server, you will be using only one process. One process, when sent multiple intensive requests, will get blocked. You could look at using something like gunicorn or waitress and set the number of worker processes to more than 1 (e.g. gunicorn myproject.wsgi --workers 4).
But at some point you are going to have to offload this work to a task runner anyway; doing such work in a production web process can end up over-consuming web server resources.
On a more technical note, have you looked at this issue on GitHub:
https://github.com/Zulko/moviepy/issues/645
They talk about passing the parameter `progress_bar=False`. If in your use case several files are being written at once and each is writing a progress bar to stdout, you might be getting IO swamped.
Also, consider running a profiler while replicating the issue; it might give you better insight into where the bottleneck is occurring (IO or CPU).
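For example, a minimal profiling sketch with the standard library (the render() wrapper is hypothetical; put the moviepy code from the question inside it):

import cProfile
import pstats

def render():
    ...  # the moviepy code from the question goes here

cProfile.run("render()", "render.prof")
# Show the 20 entries with the highest cumulative time.
pstats.Stats("render.prof").sort_stats("cumulative").print_stats(20)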
Just point me in the right direction, please. I have no idea how to deal with this.
So, I want to connect my scraping script, written in Python, to a front-end Android app. I have already written the script, and the front end is ready as well. However, I don't know how these two things should communicate, such that the script constantly listens for requests from the Android app (through Firebase, maybe?).
However, there is one more thing. Since multiple users will use the app at the same time, there will be parallel requests sent from the app as well. How do I let the script process these requests concurrently, without waiting for the first one to complete? All the scraping is done through the requests library. I researched a bit and found some hints related to threading, queues, async, etc.
Kindly tell me which way to go.
In the given scenario, you can put the script in an Azure Python Function and call it whenever required. However, as you mentioned, there will be multiple parallel requests, which can be a concern given the single-threaded architecture of the Python worker.
How to handle such scenarios is documented in the Python Functions developer reference: functions-reference-python (also check asyncio-eventloop).
Here are methods to handle this:
Use async calls (see the sketch below).
Add more language worker processes per host. This can be done with the application setting FUNCTIONS_WORKER_PROCESS_COUNT, up to a maximum value of 10.
[Please note that each new language worker is spawned every 10 seconds until they are warm.]
Here is a GitHub issue which discusses this in detail: https://github.com/Azure/azure-functions-python-worker/issues/236
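As a rough sketch, an HTTP-triggered async function might look like this (the query parameter and the use of the third-party aiohttp library for the outbound call are my own assumptions):

import aiohttp
import azure.functions as func

async def main(req: func.HttpRequest) -> func.HttpResponse:
    url = req.params.get("url", "https://example.com")  # assumed parameter
    # Awaiting the outbound call frees the single worker to serve other requests.
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            body = await resp.text()
    return func.HttpResponse(body)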
Is it possible to create as many threads as needed to use 100% of the CPU, and is that really efficient? I'm planning to create a crawler in Python, and, to make the program efficient, I want to create as many threads as possible, where each thread downloads one website. I tried looking up some information online; unfortunately I didn't find much.
You are confusing your terminology, but that is OK. A very high-level overview would help.
Concurrency can consist of IO-bound work (reading and writing to disk, HTTP requests, etc.) and CPU-bound work (say, running a machine-learning optimization function on a big data set).
With IO-bound work, which I assume is what you are referring to, your CPU is in fact not working very hard, but rather waiting around for data to come back.
Contrast that with multiprocessing, where you can use multiple cores of your machine to do more intense CPU-bound work.
That said, multithreading could help you, but I would advise using the asyncio and aiohttp modules for Python. They make sure that while you are waiting for one response to be returned, the software can continue with other requests.
I use asyncio, aiohttp and bs4 when I need to do some web-scraping.
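A minimal sketch of that pattern, assuming a hypothetical URL list and a 30-second total timeout:

import asyncio
import aiohttp

async def fetch(session, url):
    async with session.get(url) as resp:
        return await resp.text()

async def main(urls):
    timeout = aiohttp.ClientTimeout(total=30)
    async with aiohttp.ClientSession(timeout=timeout) as session:
        # gather() runs all requests concurrently; the CPU is free
        # while each request waits on the network.
        return await asyncio.gather(*(fetch(session, u) for u in urls))

pages = asyncio.run(main(["http://example.com/a", "http://example.com/b"]))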
I have inherited a rather large code base that uses Tornado to compute and serve big and complex data types (imagine a 1 MB XML file). Currently there are 8 instances of Tornado running to compute and serve this data.
That was a wrong design decision from the start, and I am facing many, many timeouts from applications that access the servers.
I'd like to change as few lines of code as possible in the legacy code base because I do not want to break anything that has already been tested in the field. What can I do to transform this system into a threaded one that can execute more xml-computation in parallel?
"transform this system into a threaded one that can execute more xml-computation in parallel"
If there are enough Tornado instances to saturate the computational resources, moving to a threaded model will probably not gain much performance. Getting rid of blocking code, however, helps with connection timeouts.
Another option is getting rid of all asynchronous code and using tornado.wsgi.WSGIApplication. That way, you can run the application on a threaded WSGI server. Features that are not available in WSGI mode are listed here.
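For reference, a minimal sketch of that approach (the handler name and the compute_big_xml function are hypothetical; note tornado.wsgi.WSGIApplication exists in Tornado 5.x and earlier but was removed in Tornado 6):

import tornado.web
import tornado.wsgi

class XmlHandler(tornado.web.RequestHandler):
    def get(self):
        self.set_header("Content-Type", "application/xml")
        self.write(compute_big_xml())  # hypothetical blocking XML computation

# The handlers are reused as-is; only the application class changes.
wsgi_app = tornado.wsgi.WSGIApplication([(r"/xml", XmlHandler)])
# wsgi_app can now be served by any threaded WSGI server (e.g. waitress).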
Use Tornado just to receive non-blocking requests. For the actual XML processing you can then spawn another process or use an async task processor like Celery. Using Celery would make it easy to scale your system in the future. In fact, with this model you'll need just one Tornado instance.
@Eren - I don't think the computational resources are getting saturated. It is just that no more than 8 requests are processed simultaneously, since Tornado right now is serving requests in blocking mode.
I'm looking for a Python library or a command-line tool for downloading multiple files in parallel. My current solution is to download the files sequentially, which is slow. I know you can easily write a half-assed threaded solution in Python, but I always run into annoying problems when using threading. It is for polling a large number of XML feeds from websites.
My requirements for the solution are:
1. Should be interruptible: Ctrl+C should immediately terminate all downloads.
2. There should be no leftover processes that you have to kill manually using kill, even if the main program crashes or an exception is thrown.
3. It should work on Linux and Windows too.
4. It should retry downloads, be resilient against network errors, and should time out properly.
5. It should be smart about not hammering the same server with 100+ simultaneous downloads, but queue them in a sane way.
6. It should handle important HTTP status codes like 301, 302 and 304. That means that for each file, it should take the Last-Modified value as input and only download if it has changed since last time.
7. Preferably it should have a progress bar, or it should be easy to write a progress bar for it, to monitor the download progress of all files.
8. Preferably it should take advantage of HTTP keep-alive to maximize the transfer speed.
Please don't suggest how I might go about implementing the above requirements. I'm looking for a ready-made, battle-tested solution.
I guess I should describe what I want it for, too... I have about 300 different data feeds, served as XML files from 50 data providers. Each file is between 100 KB and 5 MB in size. I need to poll them frequently (as in once every few minutes) to determine whether any of them has new data I need to process. So it is important that the downloader uses HTTP caching to minimize the amount of data to fetch. It also uses gzip compression, obviously.
Then the big problem is how to use the bandwidth as efficiently as possible without overstepping any boundaries. For example, one data provider may consider it abuse if you open 20 simultaneous connections to their data feeds; instead it may be better to use one or two connections that are reused for multiple files. Or your own connection may be limited in strange ways: my ISP limits the number of DNS lookups you can do, so some kind of DNS caching would be nice.
You can try pycurl. The interface is not easy at first, but once you look at examples it is not hard to understand. I have used it to fetch thousands of web pages in parallel on a meagre Linux box.
You don't have to deal with threads, so it terminates gracefully and leaves no processes behind.
It provides options for timeouts and HTTP status handling.
It works on both Linux and Windows.
The only problem is that it provides only basic infrastructure (basically a Python layer on top of the excellent curl library). You will have to write a few lines to achieve the features you want.
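A minimal sketch of parallel downloads with pycurl's multi interface (the URLs and file names here are hypothetical):

import pycurl

urls = ["http://example.com/feed1.xml", "http://example.com/feed2.xml"]
multi = pycurl.CurlMulti()
handles = []
for i, url in enumerate(urls):
    c = pycurl.Curl()
    c.setopt(pycurl.URL, url)
    c.fp = open("feed%d.xml" % i, "wb")
    c.setopt(pycurl.WRITEDATA, c.fp)
    c.setopt(pycurl.TIMEOUT, 30)        # per-transfer timeout
    c.setopt(pycurl.FOLLOWLOCATION, 1)  # follow 301/302 redirects
    multi.add_handle(c)
    handles.append(c)

# Drive all transfers to completion without any threads.
num_active = len(handles)
while num_active:
    while True:
        ret, num_active = multi.perform()
        if ret != pycurl.E_CALL_MULTI_PERFORM:
            break
    multi.select(1.0)  # wait for activity on any handle

for c in handles:
    c.fp.close()
    multi.remove_handle(c)
    c.close()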
There are lots of options, but it will be hard to find one which fits all your needs.
In your case, try this approach:
Create a queue.
Put the URLs to download into this queue (or "config objects" which contain the URL and other data like the user name, the destination file, etc.).
Create a pool of threads.
Each thread should try to fetch a URL (or a config object) from the queue and process it.
Use another thread to collect the results (i.e. another queue). When the number of result objects equals the number of puts into the first queue, you're finished.
Make sure that all communication goes via the queue or the "config object". Avoid accessing data structures shared between threads. This should save you 99% of the problems.
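A minimal Python 3 sketch of that design (the URL list, worker count, and timeout are hypothetical):

import queue
import threading
import urllib.request

task_q = queue.Queue()
result_q = queue.Queue()

def worker():
    while True:
        url = task_q.get()
        if url is None:          # sentinel: no more work
            break
        try:
            data = urllib.request.urlopen(url, timeout=30).read()
            result_q.put((url, data))
        except Exception as exc:
            result_q.put((url, exc))
        finally:
            task_q.task_done()

urls = ["http://example.com/feed%d.xml" % i for i in range(10)]
threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for url in urls:
    task_q.put(url)

# Collect results: exactly one result per URL put into the task queue.
for _ in urls:
    url, payload = result_q.get()
    print(url, "failed" if isinstance(payload, Exception) else "%d bytes" % len(payload))

for _ in threads:
    task_q.put(None)   # stop the workers
for t in threads:
    t.join()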
I don't think such a complete library exists, so you'll probably have to write your own. I suggest taking a look at gevent for this task; it even provides a concurrent_download.py example script. You can then use urllib2 for most of the other requirements, such as handling HTTP status codes and displaying download progress.
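A minimal sketch of the gevent approach (using the Python 3 standard urllib here; the URL list and pool size are hypothetical):

import gevent.monkey
gevent.monkey.patch_all()  # patch before importing the blocking libraries

import urllib.request
from gevent.pool import Pool

def fetch(url):
    return url, urllib.request.urlopen(url, timeout=30).read()

pool = Pool(10)  # at most 10 downloads in flight at once
urls = ["http://example.com/feed%d.xml" % i for i in range(30)]
for url, body in pool.imap_unordered(fetch, urls):
    print(url, len(body))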
I would suggest Twisted. It is not a ready-made solution, but it provides the main building blocks to implement every feature you listed in an easy way, and it does not use threads.
If you are interested, take a look at the following links:
http://twistedmatrix.com/documents/current/api/twisted.web.client.html#getPage
http://twistedmatrix.com/documents/current/api/twisted.web.client.html#downloadPage
As per your requirements:
1. Supported out of the box.
2. Supported out of the box.
3. Supported out of the box.
4. Timeouts supported out of the box; other error handling is done through deferreds.
5. Achieved easily using cooperators (example 7).
6. Supported out of the box.
7. Not supported; solutions exist (and they are not that hard to implement).
8. Not supported; it can be implemented (but it will be relatively hard).
Nowadays there are excellent Python libraries you might want to use: urllib3 and requests.
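For instance, a requests.Session reuses TCP connections, which covers the keep-alive requirement (the URLs here are hypothetical):

import requests

session = requests.Session()  # pools connections, so keep-alive is automatic
for url in ["http://example.com/a.xml", "http://example.com/b.xml"]:
    resp = session.get(url, timeout=30)
    resp.raise_for_status()   # surfaces 4xx/5xx responses as exceptions
    print(url, len(resp.content))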
Try using aria2 through the simple Python subprocess module.
It provides all the requirements from your list out of the box, except requirement 7, and 7 is easy to write.
aria2c has a nice XML-RPC/JSON-RPC interface to interact with it from your scripts.
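A minimal sketch of driving aria2c via subprocess (the input file name and the option values are my own assumptions):

import subprocess

# urls.txt contains one URL per line; aria2c handles parallelism,
# retries, timeouts, and per-server connection limits itself.
subprocess.run([
    "aria2c",
    "--input-file=urls.txt",
    "--max-concurrent-downloads=5",
    "--max-connection-per-server=2",
    "--retry-wait=5",
    "--timeout=30",
], check=True)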
Does urlgrabber fit your requirements?
http://urlgrabber.baseurl.org/
If it doesn't, you could consider volunteering to help finish it. Contact the authors, Michael Stenner and Ryan Tomayko.
Update: Googling for "parallel wget" yields these, among others:
http://puf.sourceforge.net/
http://www.commandlinefu.com/commands/view/3269/parallel-file-downloading-with-wget
It seems like you have a number of options to choose from.
I used the standard libraries for that, urllib.urlretrieve to be precise. I downloaded podcasts this way, via a simple thread pool, each thread using its own retrieve. I did about 10 simultaneous connections; more should not be a problem. Resuming an interrupted download, maybe not. Ctrl-C could be handled, I guess. It worked on Windows, and I installed a handler for progress bars. All in all, two screens of code, plus two screens for generating the URLs to retrieve.
This seems pretty flexible:
http://keramida.wordpress.com/2010/01/19/parallel-downloads-with-python-and-gnu-wget/
Threading isn't "half-assed" unless you're a bad programmer. The best general approach to this problem is the producer/consumer model. You have one dedicated URL producer and N dedicated download threads (or even processes, if you use the multiprocessing module).
As for all of your requirements, ALL of them CAN be met with the normal Python threaded model (yes, even catching Ctrl+C -- I've done it).
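A minimal sketch of the Ctrl+C part, assuming Python 3 and hypothetical URLs: daemon worker threads die with the main thread, so a KeyboardInterrupt in the main thread stops everything at once.

import queue
import threading
import urllib.request

q = queue.Queue()

def download_worker():
    while True:
        url = q.get()
        try:
            urllib.request.urlopen(url, timeout=30).read()
        finally:
            q.task_done()

for _ in range(4):
    threading.Thread(target=download_worker, daemon=True).start()

for i in range(10):
    q.put("http://example.com/feed%d.xml" % i)

try:
    q.join()   # wait for all downloads to finish
except KeyboardInterrupt:
    pass       # daemon workers exit immediately with the main thread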
We have a web service which serves small, arbitrary segments of a fixed inventory of larger MP3 files. The MP3 files are generated on-the-fly by a python application. The model is, make a GET request to a URL specifying which segments you want, get an audio/mpeg stream in response. This is an expensive process.
We're using Nginx as the front-end request handler. Nginx takes care of caching responses for common requests.
We initially tried using Tornado on the back-end to handle requests from Nginx. As you would expect, the blocking MP3 operation kept Tornado from doing its thing (asynchronous I/O). So, we went multithreaded, which solved the blocking problem, and performed quite well. However, it introduced a subtle race condition (under real world load) that we haven't been able to diagnose or reproduce yet. The race condition corrupts our MP3 output.
So we decided to set our application up as a simple WSGI handler behind Apache/mod_wsgi (still with Nginx up front). This eliminates the blocking issue and the race condition, but creates a cascading load (i.e. Apache creates too many processes) on the server under real-world conditions. We're working on tuning Apache/mod_wsgi right now, but are still at a trial-and-error phase. (Update: we've switched back to Tornado. See below.)
Finally, the question: are we missing anything? Is there a better way to serve CPU-expensive resources over HTTP?
Update: Thanks to Graham's informed article, I'm pretty sure this is an Apache tuning problem. In the meantime, we've gone back to using Tornado and are trying to resolve the data-corruption issue.
For those who were so quick to throw more iron at the problem: Tornado and a bit of multithreading (despite the data integrity problem introduced by threading) handle the load acceptably on a small (single core) Amazon EC2 instance.
Have you tried Spawning? It is a WSGI server with a flexible assortment of threading modes.
Are you making the mistake of using embedded mode of Apache/mod_wsgi? Read:
http://blog.dscpl.com.au/2009/03/load-spikes-and-excessive-memory-usage.html
Ensure you use daemon mode if using Apache/mod_wsgi.
You might consider a queuing system with AJAX notification methods.
Whenever there is a request for your expensive resource, and that resource needs to be generated, add that request to the queue (if it's not already there). That queuing operation should return an ID of an object that you can query to get its status.
Next you have to write a background service that spins up worker threads. These workers simply dequeue a request, generate the data, then save the data's location in the request object.
The webpage can make AJAX calls to your server to find out the progress of the generation and to give a link to the file once it's available.
This is how LARGE media sites work, those that have to deal with video in particular. It might be overkill for your MP3 work, however.
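A minimal sketch of that queue-plus-polling pattern (the job store, worker, and generate_mp3 function are hypothetical):

import queue
import threading
import uuid

jobs = {}               # job_id -> {"status": ..., "location": ...}
work_q = queue.Queue()

def enqueue(params):
    job_id = str(uuid.uuid4())
    jobs[job_id] = {"status": "queued", "location": None}
    work_q.put((job_id, params))
    return job_id       # handed to the client for later polling

def worker():
    while True:
        job_id, params = work_q.get()
        jobs[job_id]["status"] = "working"
        path = generate_mp3(params)  # hypothetical expensive step
        jobs[job_id].update(status="done", location=path)
        work_q.task_done()

threading.Thread(target=worker, daemon=True).start()

def status(job_id):
    # The AJAX endpoint would return this as JSON.
    return jobs.get(job_id, {"status": "unknown"})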
Alternatively, look into running a couple of machines to distribute the load. Your threads on Apache will still block, but at least you won't consume resources on the web server.
Please define "cascading load", as it has no common meaning.
Your most likely problem is going to be if you're running too many Apache processes.
For a load like this, make sure you're using the prefork MPM, and make sure you're limiting yourself to an appropriate number of processes (no fewer than one per CPU, no more than two).
It looks like you are doing things right -- you're just lacking CPU power. Can you determine the CPU load in the process of generating these MP3s?
I think the next thing you have to do is add more hardware to render the MP3s on other machines. Either that, or find a way to deliver pre-rendered MP3s (maybe you can cache some of your media?).
BTW, scaling for the web was the theme of a keynote lecture by Jacob Kaplan-Moss at PyCon Brasil this year, and it is far from being a solved problem. The stack of technologies one needs to handle is quite impressive (I could not find an online copy of the presentation, though -- sorry for that).