I'm doing some metric analysis on my web app, which makes extensive use of celery. I have one metric which measures the full trip from a post_save signal through a celery task (which itself calls a number of different celery tasks) to the end of that task. I've been hitting the server with up to 100 requests in 5 seconds.
What I find interesting is that when I hit the server with hundreds of requests (which entails thousands of celery tasks being queued), the time it takes for the trip from post_save to the end of the main celery task increases significantly, even though I never do any additional database calls, and none of the celery tasks should be blocking the main task.
Could the fact that there are so many celery tasks in the queue when I make a bunch of requests really quickly be slowing down the logic in my post_save function and main celery task? That is, could the processing associated with getting the sub-tasks that the main celery task creates onto a crowded queue be having a significant impact on the time it takes to reach the end of the main celery task?
It's impossible to really answer your question without an in-depth analysis of your actual code AND benchmark protocol, and while I have some working experience with Python, Django and Celery, I wouldn't be able to do such an in-depth analysis here anyway. There are, however, a couple of very obvious points:
if your workers are running on the same computer as your Django instance, they will compete with Django process(es) for CPU, RAM and IO.
if the benchmark "client" is also running on the same computer then you have a "heisenbench" case - bombarding a server with hundreds of HTTP requests per second also uses a serious amount of resources...
To make a long story short: concurrent / parallel programming won't give you more processing power, it will only allow you to (more or less) easily scale horizontally.
I'm not sure about slowing down, but it can cause your application to hang. I've had this problem where one application would back up several other queues with no workers. My application could then no longer queue messages.
If you open up a Django shell, try to queue a task, and then hit Ctrl+C, you'll get a stack trace. I can't quite remember what it should look like, but if you post it here I could confirm it.
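For reference, a minimal way to reproduce that check in the Django shell; the task name here is a hypothetical placeholder, substitute one of your own tasks:

```python
# Run inside: python manage.py shell
from myapp.tasks import my_task  # hypothetical task, use one of yours

# If the broker connection is healthy this returns almost immediately;
# if it hangs here, interrupt with Ctrl+C and inspect the traceback.
result = my_task.delay()
print(result.id)
```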
Related
I am currently using celery to run my background tasks.
Let's say I want to run these background tasks on a weekly basis.
Is it a good idea to throw 20 tasks under one worker?
Each task would make at least 800 web requests.
In the celery docs it says:
A single Celery process can process millions of tasks a minute, with sub-millisecond round-trip latency (using RabbitMQ, librabbitmq, and optimized settings).
So basically, one task could be for one user. I would need to run at least 50 different tasks, each of them making about 800 web requests. I thought maybe I would need a new worker for each task, but reviewing the docs it doesn't seem like I need multiple workers; instead I can throw everything at one worker and it would be just fine. I don't feel confident about that, though. What should I do in my case: if I am making 800 web requests per task, do I need multiple workers, or should I just do everything under one worker?
Is it OK to run certain pieces of code asynchronously in a Django web app? If so, how?
For example:
I have a search algorithm that returns hundreds or thousands of results. I want to record in the database that these items were the result of the search, so I can see what users are searching for most. I don't want the client to have to wait for hundreds or thousands of extra database inserts. Is there a way I can do this asynchronously? Is there any danger in doing so? Is there a better way to achieve this?
As far as Django is concerned, yes.
The bigger concern is your web server and whether it plays nicely with threading. For instance, the sync workers of gunicorn are single-threaded, but there are other worker types, such as the greenlet-based ones, and I'm not sure how well those play with threads.
Combining threading and multiprocessing can be an issue if you're forking from threads:
Status of mixing multiprocessing and threading in Python
http://bugs.python.org/issue6721
That being said, I know of popular performance-analytics utilities that have been using threads to report metrics, so it seems to be an accepted practice.
In sum, it seems safest to use the threading.Thread object from the standard library, as long as whatever you do in it doesn't fork (i.e. doesn't use Python's multiprocessing library).
https://docs.python.org/2/library/threading.html
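For example, a minimal sketch of offloading the search-result inserts to a background thread; the model, the view and the run_search function are hypothetical placeholders, not anything from the question:

```python
import threading

from django.http import JsonResponse

from myapp.models import SearchHit  # hypothetical model logging search results


def _log_results(query, results):
    # Runs in a background thread; no fork/multiprocessing involved.
    SearchHit.objects.bulk_create(
        [SearchHit(query=query, item_id=r.id) for r in results]
    )


def search_view(request):
    query = request.GET.get("q", "")
    results = run_search(query)  # hypothetical search function

    # Fire-and-forget: the client gets its response without waiting
    # for the hundreds/thousands of inserts.
    t = threading.Thread(target=_log_results, args=(query, results))
    t.daemon = True
    t.start()

    return JsonResponse({"results": [r.id for r in results]})
```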
Offloading requests from the main thread is a common practice, since the end goal is to return a result to the client (browser) as quickly as possible.
As I am sure you are aware, HTTP is blocking - so until you return a response, the client cannot do anything (it is blocked, in a waiting state).
The de facto way of offloading requests is through Celery, which is a task-queueing system.
I highly recommend you read the Celery introduction, but in summary here is what happens:
You mark certain pieces of code as "tasks". These are usually functions that you want to run asynchronously.
Celery manages workers - you can think of them as threads - that will run these tasks.
To communicate with the workers, a message queue is required. RabbitMQ is the one most often recommended.
Once you have all the components running (it takes but a few minutes), your workflow goes like this:
In your view, when you want to offload some work, you call the function that does that work with the .delay() method. This puts a message on the queue and triggers a worker to start executing the function in the background (a minimal sketch follows these steps).
Your view then returns a response immediately.
You can then check for the result of the task, and take appropriate actions based on what needs to be done. There are ways to track progress as well.
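Here is a minimal sketch of that workflow; the task name, the view and the report contents are assumptions for illustration only:

```python
# tasks.py
from celery import shared_task


@shared_task
def generate_report(keywords):
    # Hypothetical expensive work, e.g. analytics on search keywords.
    return {"keywords": keywords, "count": len(keywords)}


# views.py
from django.http import JsonResponse

from .tasks import generate_report


def report_view(request):
    keywords = request.GET.get("q", "").split()

    # .delay() only puts a message on the queue; a worker executes it.
    async_result = generate_report.delay(keywords)

    # The view returns immediately; the task id can be polled later
    # via AsyncResult if you need the outcome.
    return JsonResponse({"task_id": async_result.id})
```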
It is also good practice to include caching - so that you are not executing expensive tasks unnecessarily. For example, you might choose to offload a request to do some analytics on search keywords that will be placed in a report.
Once the report is generated, I would cache the results (if applicable) so that the same report can be displayed if requested later - rather than be generated again.
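A hedged sketch of what that caching could look like with Django's cache framework; the key naming, the timeout and the blocking .get() call are arbitrary choices made here for brevity:

```python
from django.core.cache import cache

from .tasks import generate_report  # hypothetical task from the sketch above


def get_or_build_report(keywords):
    key = "report:%s" % "-".join(sorted(keywords))
    report = cache.get(key)
    if report is None:
        # Not cached yet: offload the expensive generation to a worker.
        # Blocking on the result keeps the sketch short; in a real view
        # you would usually return the task id and poll instead.
        report = generate_report.delay(keywords).get(timeout=60)
        cache.set(key, report, 60 * 60)  # cache for an hour
    return report
```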
I have tasks that need to load a few large files from the internet and then do some processing. Run synchronously, the loading takes ~3s and the processing ~0.2s. Although the processing is much faster than the loading, it still takes a considerable amount of time.
I wonder what would be the best celery configuration to handle my scenario. Multi-processing, Eventlet, or maybe something else?
This question really calls for an answer comparing multi-process/thread workers with green threads. However, generally speaking, in the context of Celery concurrency it doesn't make a difference which you use, unless you have limited resources (and too many tasks), or you are making too many outbound connections and are I/O-bound, in which case you will have to go "green" and use eventlet.
A good example of this is Instagram's talk at PyCon 2013 (Messaging at Scale at Instagram): they use both. The main usage is threaded tasks, yet they use the "green" approach for tasks that do nothing but make outbound requests to other websites, like Twitter, Facebook and Tumblr. Those kinds of tasks don't deserve a complete thread/process, as no real processing is happening; moreover, the request/response cycle takes some time, so the best thing to do with those tasks is to make them green.
You can create worker(s) that use a thread/process per task and only consume tasks from specific queues, and other worker(s) that use greenlets and only consume other tasks from other queues. Then you can decide which task goes where as per the explanation above!
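A rough sketch of that split, assuming hypothetical download and process tasks; the queue names, pool sizes and the "proj" module are placeholders, and the settings are shown in the newer lowercase Celery style:

```python
# proj/celery.py
from celery import Celery

app = Celery("proj", broker="amqp://localhost")

# Route I/O-bound tasks and CPU-bound tasks to separate queues.
app.conf.task_routes = {
    "proj.tasks.download": {"queue": "io"},
    "proj.tasks.process": {"queue": "cpu"},
}

# Then start one worker per pool type, e.g.:
#   celery -A proj worker -Q io  -P eventlet -c 200   # green threads for downloads
#   celery -A proj worker -Q cpu -P prefork  -c 4     # processes for the CPU work
```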
I have a web scraper (command-line scripts) written in Python that runs on 4-5 Amazon EC2 instances.
What I do is place a copy of these Python scripts on these EC2 servers and run them.
So every time I change the program, I have to do it for all the copies.
So you can see the problem of redundancy, management and monitoring.
To reduce the redundancy and for easier management, I want to place the code on a separate server, from which it can be executed on the other EC2 servers, and also monitor these Python programs and the logs they create through a Django/web interface running on that server.
There are at least two issues you're dealing with:
monitoring of execution of the scraping tasks
deployment of code to multiple servers
and each of them requires a different solution.
In general I would recommend using a task queue for this kind of assignment (I have tried Celery running on Amazon EC2 and was very pleased with it).
One advantage of the task queue is that it abstracts the definition of the task from the worker that actually performs it. So you send tasks to the queue, and then a variable number of workers (servers running multiple worker processes) process those tasks by asking for them one at a time. Each idle worker connects to the queue and asks for some work; if it receives a task, it starts processing it, possibly sends the results back, then asks for another task, and so on.
This means that the number of workers can change over time, and they will process tasks from the queue automatically until there are no more tasks to process. A great use case for this is Amazon's spot instances, which will greatly reduce the cost. Just send your tasks to the queue, create X spot requests and watch the servers process your tasks. You don't really need to care about servers going up and down at any moment because the price went above your bid. That's nice, isn't it?
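As a rough sketch, the producer side could look something like this; the scrape_page task and the URL list are hypothetical:

```python
# On the "master" server: push the whole batch of scraping jobs onto the
# queue; whatever spot instances happen to be alive will drain it at
# their own pace.
from proj.tasks import scrape_page  # hypothetical Celery task

urls = ["http://example.com/page/%d" % i for i in range(1000)]

for url in urls:
    scrape_page.delay(url)
```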
Now, this implicitly takes care of monitoring: Celery has tools for monitoring the queue and the workers, and it can even be integrated with Django using django-celery.
When it comes to deployment of code to multiple servers, Celery doesn't support that. The reasons behind this vary; see e.g. this discussion. One of them might be that it's just difficult to implement.
I think it's possible to live without it, but if you really care, I think there's a relatively simple DIY solution: put your code under VCS (I recommend Git) and check for updates on a regular basis. If there's an update, run a bash script which will kill your workers, apply all the updates and start the workers again so that they can process more tasks. Given Celery's ability to handle failure, this should work just fine.
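A very rough sketch of such a DIY update check, written in Python rather than bash for consistency with the rest of this thread; the repository path, branch setup and restart command are all assumptions:

```python
# update_workers.py - run periodically (e.g. from cron) on each worker box.
import subprocess

REPO_DIR = "/srv/scraper"                            # hypothetical checkout location
RESTART = ["sudo", "service", "celeryd", "restart"]  # or however you run your workers

subprocess.check_call(["git", "fetch"], cwd=REPO_DIR)
local = subprocess.check_output(["git", "rev-parse", "HEAD"], cwd=REPO_DIR)
remote = subprocess.check_output(["git", "rev-parse", "@{u}"], cwd=REPO_DIR)

if local != remote:
    # New code available: pull it and bounce the workers so they pick it up.
    subprocess.check_call(["git", "pull"], cwd=REPO_DIR)
    subprocess.check_call(RESTART)
```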
I have about 1000-10000 jobs which I need to run on a constant basis, every minute or so. Sometimes a new job comes in or another needs to be cancelled, but that's a rare event. Jobs are tagged and must be distributed among workers, each of which processes only jobs of a specific kind.
For now I want to use cron and load the whole database of jobs into some broker -- RabbitMQ or beanstalkd (I haven't decided which one to use yet).
But this approach seems ugly to me (using a timer to simulate infinity, loading the whole database, etc.) and has a disadvantage: for example, if some kind of job is processed more slowly than it is added to the queue, the queue may be overwhelmed and the message broker will eat all the RAM, then swap, and then just halt.
Are there any other possibilities? Am I using the wrong pattern for the job? (Maybe I don't need a queue or something..?)
P.S. I'm using Python, if that's important.
You create your initial batch of jobs and add them to the queue.
You have n consumers of the queue, each running the jobs. Adding consumers to the queue simply round-robins the distribution of jobs to each listening consumer, giving you arbitrary horizontal scalability.
Each job can, upon completion, be responsible for resubmitting itself back to the queue. This means that your job queue won't grow beyond the length that it was when you initialised it.
The master job can, if need be, spawn sub-jobs and add them to the queue.
For different types of jobs it is probably a good idea to use different queues. That way you can balance the load more effectively by having different quantities/horsepower of workers running the jobs from the different queues.
The fact that you are running Python isn't important here; it's the pattern, not the language, that you need to nail first.
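A minimal sketch of the "resubmit itself" idea using Celery (since the rest of this thread uses it); the task name, per-job logic and one-minute delay are placeholders:

```python
from celery import shared_task


@shared_task(bind=True)
def recurring_job(self, job_id):
    do_the_work(job_id)  # hypothetical per-job logic

    # Re-enqueue the same job so the queue never grows beyond its
    # initial size; run again roughly a minute from now.
    self.apply_async(args=(job_id,), countdown=60)
```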
You can use an asynchronous framework, e.g. Twisted.
I also don't think it's a good idea to run a script from the cron daemon every minute (and you mentioned the reasons yourself), so I suggest Twisted. It doesn't give you any benefit with scheduling, but you get flexibility in process management and memory sharing.
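For instance, a minimal Twisted sketch that keeps a batch of jobs running inside one long-lived process; the run_jobs function and the 60-second interval are placeholders:

```python
from twisted.internet import reactor
from twisted.internet.task import LoopingCall


def run_jobs():
    # Placeholder: iterate over the in-memory job list and dispatch work.
    pass


loop = LoopingCall(run_jobs)
loop.start(60)   # call run_jobs every 60 seconds, starting immediately
reactor.run()
```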