Newbie question about Django app design:
I'm building a reporting engine for my website. I have a large (and growing) amount of data, and an algorithm that must be applied to it. The calculations promise to be heavy on resources, and it would be stupid to perform them on user requests. So I'm thinking of putting them into a background process that runs continuously and returns results from time to time, which could then be fed to the Django views for producing HTML output on demand.
My question is: what is the proper design approach for building such a system? Any thoughts?
Celery is one of your best choices. We are using it successfully. It has a powerful scheduling mechanism - you can either schedule tasks as a timed job or trigger tasks in background when user (for example) requests it.
It also provides ways to query the status of such background tasks and has a number of flow control features. It makes distributing the work very easy, i.e. your Celery background tasks can run on a separate machine (this is very useful, for example, with the Heroku web/worker split, where a web process is limited to a maximum of 30s per request). It supports various queue backends (a database, RabbitMQ, or a number of other queuing mechanisms). In the simplest setup it can use the same database that your Django site already uses, which makes it easy to set up.
And if you are using automated tests, it also has a feature that helps with testing: it can be set to "eager" mode, where background tasks are not executed in the background but run synchronously in the calling process, which makes the logic predictable to test.
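As a sketch of the eager mode mentioned above: in a Django settings module used only for tests, something like the following is usually enough. The setting names depend on the Celery version and on how the app loads its configuration. The names below assume Celery 4 or later reading Django settings with the CELERY namespace, which is an assumption about your setup (older 3.x releases used CELERY_ALWAYS_EAGER instead). The file name is hypothetical.

```python
# settings_test.py (hypothetical file name): overrides used only when running tests.

# Run tasks synchronously in the calling process instead of sending them to a worker.
CELERY_TASK_ALWAYS_EAGER = True

# Re-raise exceptions from eagerly executed tasks so that a failing task fails the test.
CELERY_TASK_EAGER_PROPAGATES = True
```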
More info here: http://docs.celeryproject.org:8000/en/latest/django/
Do you mean the results are written to a database, or do you want to create Django views directly from your independently running code?
If you have large amounts of data, I like to use Python's multiprocessing. You can create a generator that fills a JoinableQueue with the tasks to do, and a pool of workers that consume them. This way you should be able to maximize resource utilization on your system.
The multiprocessing module also allows you to distribute tasks over the network (e.g. with multiprocessing.Manager()). With this in mind, you should easily be able to scale things up if you need a second machine to process the data in time.
Example:
This example shows how to spawn multiple processes. The generator function should query the database for all new entries that need heavy lifting. The consumers take the individual items from the queue and do the actual calculations.
import time
from multiprocessing import JoinableQueue, Process


def generator(queue):
    """Puts items in the queue. For example, query the database for all new,
    unprocessed entries that need some serious math done."""
    while True:
        queue.put("Item")
        time.sleep(0.1)


def consumer(queue, consumer_id):
    """Consumes items from the queue. Do your calculations here."""
    while True:
        item = queue.get()
        print("Process %s has done: %s" % (consumer_id, item))
        queue.task_done()


if __name__ == "__main__":
    queue = JoinableQueue()  # unbounded queue shared by all processes
    p = Process(target=generator, args=(queue,))
    p.start()
    workers = [Process(target=consumer, args=(queue, x)) for x in range(2)]
    for w in workers:
        w.start()
    # Both loops run forever, so these joins block until the processes are killed.
    p.join()
    for w in workers:
        w.join()
Why don't you have a URL or Python script that triggers whatever calculation you need every time it's run, and then fetch that URL or run that script via a cron job on the server? From your question, it doesn't seem like you need a whole lot more than that.
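A minimal sketch of the cron approach, assuming the heavy calculation can be wrapped in one function and its result stored somewhere the views can read cheaply. The function names, the JSON file and the crontab schedule are placeholders for illustration, not something from the question.

```python
import json
import os
import tempfile


def compute_report(rows):
    """Placeholder for the heavy calculation; here it just aggregates numbers."""
    return {"count": len(rows), "total": sum(rows)}


def write_report(report, path):
    """Write the result atomically so a view never reads a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(report, handle)
    os.replace(tmp_path, path)  # atomic rename on the same filesystem


def read_report(path):
    """What a view would call: cheap, no calculation at request time."""
    with open(path) as handle:
        return json.load(handle)


# Example crontab entry (hypothetical paths), running the script every 15 minutes:
# */15 * * * * /usr/bin/python /path/to/build_report.py
```

The view only ever calls read_report, so the cost of a request no longer depends on how heavy the calculation is.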
I am currently working on a test system that uses selenium grid for WhatsApp automation.
WhatsApp requires a QR code scan to log in, but once the code has been scanned, the session persists as long as the cookies remain saved in the browser's user data directory.
I would like to run a series of tests concurrently while making sure that every session is only used by one thread at any given time.
I would also like to be able to add additional tests to the queue while tests are being run.
So far I have considered using the ThreadPoolExecutor context manager in order to limit the maximum available workers to the maximum number of sessions. Something like this:
import queue
from concurrent.futures import ThreadPoolExecutor, as_completed

def make_queue(questions):
    q = queue.Queue()
    for question in questions:
        q.put(question)
    return q

def test_conversation(q):
    item = q.get()
    # Whatsapp test happens here
    q.task_done()

def run_tests(questions):
    q = make_queue(questions)
    with ThreadPoolExecutor(max_workers=number_of_sessions) as executor:
        # Submit one task per queued question.
        futures = [executor.submit(test_conversation, q) for _ in range(q.qsize())]
        for f in as_completed(futures):
            pass  # save results somewhere
It does not include some way to make sure that every thread gets its own session though and as far as I know I can only send one parameter to the function that the executor calls.
I could make some complicated checkout system that works like borrowing books from a library so that every session can only be checked out once at any given time, but I'm not confident in making something that is thread safe and works in all cases. Even the ones I can't think of until they happen.
I am also not sure how I would keep the thing going while adding items to the queue without it locking up my entire application. Would I have to run run_tests() in its own thread?
Is there an established way to do this? Any help would be much appreciated.
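One common pattern for the "library checkout" idea is to keep the sessions themselves in a queue.Queue, which is already thread-safe: a worker blocks on get() until a session is free and puts it back when it is done. The sketch below assumes each session can be represented by some object (here just a string standing for a hypothetical user data directory), and run_conversation is a stand-in for the real Selenium test.

```python
import queue
from concurrent.futures import ThreadPoolExecutor

# Hypothetical session identifiers, e.g. one browser user data directory each.
SESSIONS = ["session-a", "session-b"]

session_pool = queue.Queue()
for session in SESSIONS:
    session_pool.put(session)


def run_conversation(session, question):
    """Stand-in for the real WhatsApp test; returns something to record."""
    return (session, question)


def conversation_task(question):
    session = session_pool.get()  # blocks until a session is free
    try:
        return run_conversation(session, question)
    finally:
        session_pool.put(session)  # always hand the session back, even on error


# The executor has its own internal work queue, so submit() returns immediately
# and more tests can be submitted at any time while others are still running.
executor = ThreadPoolExecutor(max_workers=len(SESSIONS))
futures = [executor.submit(conversation_task, q) for q in ["q1", "q2", "q3"]]
results = [f.result() for f in futures]
executor.shutdown()
```

Because submit() does not block, the executor can be kept around as a long-lived object and fed from anywhere in the application, which may remove the need to run the submitting loop in its own thread.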
Is it OK to run certain pieces of code asynchronously in a Django web app? If so, how?
For example:
I have a search algorithm that returns hundreds or thousands of results. I want to enter into the database that these items were the result of the search, so I can see what users are searching most. I don't want the client to have to wait an extra hundred or thousand more database inserts. Is there a way I can do this asynchronously? Is there any danger in doing so? Is there a better way to achieve this?
As far as Django is concerned, yes.
The bigger concern is your web server and if it plays nice with threading. For instance, the sync workers of gunicorn are single threads, but there are other engines, such as greenlet. I'm not sure how well they play with threads.
Combining threading and multiprocessing can be an issue if you're forking from threads:
Status of mixing multiprocessing and threading in Python
http://bugs.python.org/issue6721
That being said, I know of popular performance analytics utilities that have been using threads to report on metrics, so seems to be an accepted practice.
In sum, it seems safest to use the threading.Thread object from the standard library, so long as whatever you do in it doesn't fork (e.g. via Python's multiprocessing library).
https://docs.python.org/2/library/threading.html
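A minimal sketch of the threading.Thread approach, assuming the slow part is recording search results. The list stands in for the database table, and all function names are made up for illustration.

```python
import threading

search_log = []  # stand-in for the database table
search_log_lock = threading.Lock()


def record_search_results(query, result_ids):
    """Stand-in for the slow part (hundreds of inserts)."""
    with search_log_lock:
        for result_id in result_ids:
            search_log.append((query, result_id))


def search_view(query):
    result_ids = [1, 2, 3]  # placeholder for the real search
    worker = threading.Thread(
        target=record_search_results, args=(query, result_ids)
    )
    worker.start()  # the return below does not wait for the inserts
    return result_ids, worker
```

One caveat: if the server process is killed or recycled before the thread finishes, the pending writes are lost.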
Offloading requests from the main thread is a common practice; as the end goal is to return a result to the client (browser) as quickly as possible.
As I am sure you are aware, HTTP is blocking - so until you return a response, the client cannot do anything (it is blocked, in a waiting state).
The de-facto way of offloading requests is through celery which is a task queuing system.
I highly recommend you read the introduction to celery topic, but in summary here is what happens:
You mark certain pieces of code as "tasks". These are usually functions that you want to run asynchronously.
Celery manages workers - you can think of them as threads - that will run these tasks.
To communicate with the worker a message queue is required. RabbitMQ is the one often recommended.
Once you have all the components running (it takes but a few minutes), your workflow goes like this:
In your view, when you want to offload some work, you call the function that does that work through its .delay() method. This sends a message that triggers a worker to start executing the function in the background.
Your view then returns a response immediately.
You can then check for the result of the task, and take appropriate actions based on what needs to be done. There are ways to track progress as well.
It is also good practice to include caching - so that you are not executing expensive tasks unnecessarily. For example, you might choose to offload a request to do some analytics on search keywords that will be placed in a report.
Once the report is generated, I would cache the results (if applicable) so that the same report can be displayed if requested later - rather than be generated again.
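A rough sketch of that caching idea, using a plain dictionary with an expiry time as a stand-in for a real cache (in a Django project you would more likely use Django's cache framework). The names and the five-minute lifetime are assumptions for illustration.

```python
import time

_report_cache = {}  # key -> (expiry timestamp, value); stand-in for a real cache
CACHE_SECONDS = 300  # hypothetical lifetime


def generate_report(keyword):
    """Placeholder for the expensive work."""
    generate_report.calls += 1
    return "report for %s" % keyword


generate_report.calls = 0  # counts how often the expensive work really ran


def get_report(keyword, now=None):
    now = time.time() if now is None else now
    cached = _report_cache.get(keyword)
    if cached is not None and cached[0] > now:
        return cached[1]  # still fresh: skip the expensive work
    value = generate_report(keyword)
    _report_cache[keyword] = (now + CACHE_SECONDS, value)
    return value
```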
I have a CPU intensive Celery task. I would like to use all the processing power (cores) across lots of EC2 instances to get this job done faster (a celery parallel distributed task with multiprocessing - I think).
Threading, multiprocessing, distributed computing, and distributed parallel processing are all terms I'm trying to understand better.
Example task:
@app.task
def process_all_ids():
    for item in list_of_millions_of_ids:
        id = item  # do some long complicated equation here very CPU heavy!!!!!!!
        database.objects(newid=id).save()
Using the code above (with an example if possible), how would one go about distributing this task using Celery, so that this one task is split up to utilise all the CPU power across all available machines in the cloud?
Your goals are:
Distribute your work to many machines (distributed computing/distributed parallel processing)
Distribute the work on a given machine across all CPUs (multiprocessing/threading)
Celery can do both of these for you fairly easily. The first thing to understand is that each celery worker is configured by default to run as many tasks as there are CPU cores available on a system:
Concurrency is the number of prefork worker process used to process
your tasks concurrently, when all of these are busy doing work new
tasks will have to wait for one of the tasks to finish before it can
be processed.
The default concurrency number is the number of CPU’s on that machine
(including cores), you can specify a custom number using -c option.
There is no recommended value, as the optimal number depends on a
number of factors, but if your tasks are mostly I/O-bound then you can
try to increase it, experimentation has shown that adding more than
twice the number of CPU’s is rarely effective, and likely to degrade
performance instead.
This means each individual task doesn't need to worry about using multiprocessing/threading to make use of multiple CPUs/cores. Instead, celery will run enough tasks concurrently to use each available CPU.
With that out of the way, the next step is to create a task that handles processing some subset of your list_of_millions_of_ids. You have a couple of options here - one is to have each task handle a single ID, so you run N tasks, where N == len(list_of_millions_of_ids). This will guarantee that work is evenly distributed amongst all your tasks since there will never be a case where one worker finishes early and is just waiting around; if it needs work, it can pull an id off the queue. You can do this (as mentioned by John Doe) using the celery group.
tasks.py:
@app.task
def process_id(item):
    id = item  # long complicated equation here
    database.objects(newid=id).save()
And to execute the tasks:
from celery import group
from tasks import process_id
# .s() builds a task signature; calling process_id(item) directly would run it locally.
jobs = group(process_id.s(item) for item in list_of_millions_of_ids)
result = jobs.apply_async()
Another option is to break the list into smaller pieces and distribute the pieces to your workers. This approach runs the risk of wasting some cycles, because you may end up with some workers waiting around while others are still doing work. However, the celery documentation notes that this concern is often unfounded:
Some may worry that chunking your tasks results in a degradation of
parallelism, but this is rarely true for a busy cluster and in
practice since you are avoiding the overhead of messaging it may
considerably increase performance.
So, you may find that chunking the list and distributing the chunks to each task performs better, because of the reduced messaging overhead. You can probably also lighten the load on the database a bit this way, by calculating each id, storing it in a list, and then adding the whole list into the DB once you're done, rather than doing it one id at a time. The chunking approach would look something like this
tasks.py:
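@app.task
def process_ids(items):
    for item in items:
        id = item  # long complicated equation here
        database.objects(newid=id).save()  # Still adding one id at a time, but you don't have to.
And to start the tasks:
from celery import group
from tasks import process_ids
chunk_size = 30  # ids per task. Experiment with what number works best here.
# Celery also has a built-in process_ids.chunks(iterable, n), but it calls the task once
# per item (unpacking each item as arguments), so it does not fit a task that takes a list.
jobs = group(
    process_ids.s(list_of_millions_of_ids[i:i + chunk_size])
    for i in range(0, len(list_of_millions_of_ids), chunk_size)
)
jobs.apply_async()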
You can experiment a bit with what chunking size gives you the best result. You want to find a sweet spot where you're cutting down messaging overhead while also keeping the size small enough that you don't end up with workers finishing their chunk much faster than another worker, and then just waiting around with nothing to do.
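To make the trade-off concrete, here is a small plain-Python sketch (no Celery involved) of how the chunk size maps to the number of tasks, and therefore to the number of messages sent. The helper name is made up for illustration.

```python
def split_into_chunks(items, chunk_size):
    """Return consecutive slices of `items`, each at most `chunk_size` long."""
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


ids = list(range(100))
for size in (1, 30, 100):
    pieces = split_into_chunks(ids, size)
    # One message per piece: small chunks balance better, large chunks send fewer messages.
    print("chunk size %3d -> %3d tasks, last one has %d ids"
          % (size, len(pieces), len(pieces[-1])))
```

A chunk size of 1 is the one-task-per-id option, and a chunk size equal to the list length degenerates to a single task with no parallelism at all.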
In the world of distribution there is only one thing you should remember above all :
Premature optimization is the root of all evil. By D. Knuth
I know it sounds evident but before distributing double check you are using the best algorithm (if it exists...).
Having said that, optimizing distribution is a balancing act between 3 things:
(1) Writing/reading data from a persistent medium,
(2) Moving data from medium A to medium B,
(3) Processing data.
Computers are made so the closer you get to your processing unit (3) the faster and more efficient (1) and (2) will be. The order in a classic cluster will be : network hard drive, local hard drive, RAM, inside processing unit territory...
Nowadays processors are becoming sophisticated enough to be considered as an ensemble of independent hardware processing units commonly called cores, these cores process data (3) through threads (2).
Imagine your core is so fast that when you send data with one thread you are using 50% of the computer power, if the core has 2 threads you will then use 100%. Two threads per core is called hyper threading, and your OS will see 2 CPUs per hyper threaded core.
Managing threads in a processor is commonly called multi-threading.
Managing CPUs from the OS is commonly called multi-processing.
Managing concurrent tasks in a cluster is commonly called parallel programming.
Managing dependent tasks in a cluster is commonly called distributed programming.
So where is your bottleneck ?
In (1): Try to persist and stream from the upper level (the one closer to your processing unit, for example if network hard drive is slow first save in local hard drive)
In (2): This is the most common one, try to avoid communication packets not needed for the distribution or compress "on the fly" packets (for example if the HD is slow, save only a "batch computed" message and keep the intermediary results in RAM).
In (3): You are done! You are using all the processing power at your disposal.
What about Celery ?
Celery is a messaging framework for distributed programming, that will use a broker module for communication (2) and a backend module for persistence (1), this means that you will be able by changing the configuration to avoid most bottlenecks (if possible) on your network and only on your network.
First profile your code to achieve the best performance in a single computer.
Then use celery in your cluster with the default configuration and set CELERY_RESULT_PERSISTENT=True :
from celery import Celery

app = Celery('tasks',
             broker='amqp://guest@localhost//',
             backend='redis://localhost')

@app.task
def process_id(all_the_data_parameters_needed_to_process_in_this_computer):
    # code that does stuff and produces `result`
    return result
During execution, open your favorite monitoring tools (I use the default one for RabbitMQ, flower for Celery, and top for CPUs); your results will be saved in your backend. An example of a network bottleneck is the task queue growing so much that it delays execution. In that case you can change modules or the Celery configuration; if not, your bottleneck is somewhere else.
Why not use group celery task for this?
http://celery.readthedocs.org/en/latest/userguide/canvas.html#groups
Basically, you should divide ids into chunks (or ranges) and give them to a bunch of tasks in group.
For something more sophisticated, like aggregating the results of particular Celery tasks, I have successfully used a chord task for a similar purpose:
http://celery.readthedocs.org/en/latest/userguide/canvas.html#chords
Increase settings.CELERYD_CONCURRENCY to a number that is reasonable and you can afford, then those celery workers will keep executing your tasks in a group or a chord until done.
Note: due to a bug in kombu, there was trouble in the past with reusing workers for a high number of tasks. I don't know if it's fixed now. Maybe it is, but if not, reduce CELERYD_MAX_TASKS_PER_CHILD.
Example based on simplified and modified code I run:
from celery import chord

@app.task
def do_matches():
    match_data = ...
    result = chord(single_batch_processor.s(m) for m in match_data)(summarize.s())
summarize gets results of all single_batch_processor tasks. Every task runs on any Celery worker, kombu coordinates that.
Now I get it: single_batch_processor and summarize ALSO have to be celery tasks, not regular functions - otherwise of course it will not be parallelized (I'm not even sure chord constructor will accept it if it's not a celery task).
Adding more celery workers will certainly speed up executing the task. You might have another bottleneck though: the database. Make sure it can handle the simultaneous inserts/updates.
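One way to reduce the pressure on the database, assuming results can be written in batches rather than one row per id, is to collect the results of a chunk and insert them in a single transaction. The sketch below uses sqlite3 only as a stand-in for whatever database is really in use; the table and function names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database, only for illustration
conn.execute("CREATE TABLE results (newid INTEGER)")


def save_batch(connection, new_ids):
    """Insert all ids of one chunk in a single transaction."""
    with connection:  # commits on success, rolls back on error
        connection.executemany(
            "INSERT INTO results (newid) VALUES (?)",
            [(new_id,) for new_id in new_ids],
        )


save_batch(conn, [10, 20, 30])
row_count = conn.execute("SELECT COUNT(*) FROM results").fetchone()[0]
```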
Regarding your question: You are adding celery workers by assigning another process on your EC2 instances as celeryd. Depending on how many workers you need you might want to add even more instances.
I have my server on Google App Engine
One of my jobs is to match a huge set of records with another.
This takes very long if I have to match 10000 records with 100.
What's the best way of implementing this?
I'm using the web2py stack and have deployed my application on Google App Engine.
Maybe I'm misunderstanding something, but this sounds like the perfect match for a task queue, and I can't see how multithreading will help: I thought it only meant that you can serve many responses simultaneously, which won't help if your responses take longer than the 30 second limit.
With a task you can add it, then process until the time limit, then recreate another task with the remainder of the task if you haven't finished your job by the time limit.
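The "process until the time limit, then hand the remainder to a new task" idea can be sketched in plain Python as follows. The enqueue callable is a hypothetical placeholder for whatever schedules the follow-up task on your platform, and the 25 second default is an assumed safety margin below the 30 second limit mentioned above.

```python
import time


def process_until_deadline(records, handle, time_limit, clock=time.monotonic):
    """Process records until `time_limit` seconds have passed.

    Returns the records that were not processed, to be passed to a follow-up task.
    """
    started = clock()
    for index, record in enumerate(records):
        if clock() - started >= time_limit:
            return records[index:]  # remainder for the next task
        handle(record)
    return []  # everything was processed


def run_task(records, handle, enqueue, time_limit=25):
    """`enqueue` is a hypothetical callable that schedules another run_task."""
    remainder = process_until_deadline(records, handle, time_limit)
    if remainder:
        enqueue(remainder)
```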
Multithreading your code is not supported on GAE so you can not explicitly use it.
GAE itself can be multithreaded, which means that one frontend instance can handle multiple http requests simultaneously.
In your case, the best way to achieve parallel task execution is the Task Queue.
The basic structure for what you're doing is to have the cron job be responsible for dividing the work into smaller units, and executing each unit with the task queue. The payload for each task would be information that identifies the entities in the first set (such as a set of keys). Each task would perform whatever queries are necessary to join the entities in the first set with the entities in the second set, and store intermediate (or perhaps final) results. You can tweak the payload size and task queue rate until it performs the way you desire.
If the results of each task need to be aggregated, you can have each task record its completion and test for whether all tasks are complete, or just have another job that polls the completion records, to fire off the aggregation. When the MapReduce feature is more widely available, that will be a framework for performing this kind of work.
http://www.youtube.com/watch?v=EIxelKcyCC0
http://code.google.com/p/appengine-mapreduce/
I'm using Celery to queue jobs from a CGI application I made. The way I've set it up, Celery makes each job run one- or two-at-a-time by setting CELERYD_CONCURRENCY = 1 or = 2 (so they don't crowd the processor or thrash from memory consumption). The queue works great, thanks to advice I got on StackOverflow.
Each of these jobs takes a fair amount of time (~30 minutes serial), but is embarrassingly parallel. For this reason, I was using Pool.map to split it and do the work in parallel. It worked great from the command line, and I got runtimes around 5 minutes using a new many-cored chip.
Unfortunately, there is some limitation that does not allow daemonic process to have subprocesses, and when I run the fancy parallelized code within the CGI queue, I get this error:
AssertionError: daemonic processes are not allowed to have children
I noticed other people have had similar questions, but I can't find an answer that wouldn't require abandoning Pool.map altogether, and making more complicated thread code.
What is the appropriate design choice here? I can easily run my serial jobs using my Celery queue. I can also run my much faster parallelized jobs without a queue. How should I approach this, and is it possible to get what I want (both the queue and the per-job parallelization)?
A couple of ideas I've had (some are quite hacky):
The job sent to the Celery queue simply calls the command line program. That program can use Pool as it pleases, and then saves the result figures & data to a file (just as it does now). Downside: I won't be able to check on the status of the job or see if it terminated successfully. Also, system calls from CGI may cause security issues.
Obviously, if the queue is very full of jobs, I can make use of the CPU resources (by setting CELERYD_CONCURRENCY = 6 or so); this will allow many people to be "at the front of the queue" at once. Downside: Each job will spend a lot of time at the front of the queue; if the queue isn't full, there will be no speedup. Also, many partially finished jobs will be stored in memory at the same time, using much more RAM.
Use Celery's @task to parallelize within sub-jobs. Then, instead of setting CELERYD_CONCURRENCY = 1, I would set it to 6 (or however many sub jobs I'd like to allow in memory at a time). Downside: First of all, I'm not sure whether this will successfully avoid the "task-within-task" problem. But also, the notion of queue position may be lost, and many partially finished jobs may end up in memory at once.
Perhaps there is a way to call Pool.map and specify that the threads are non-daemonic? Or perhaps there is something more lightweight I can use instead of Pool.map? This is similar to an approach taken on another open StackOverflow question. Also, I should note that the parallelization I exploit via Pool.map is similar to linear algebra, and there is no inter-process communication (each just runs independently and returns its result without talking to the others).
Throw away Celery and use multiprocessing.Queue. Then maybe there'd be some way to use the same "thread depth" for every thread I use (i.e. maybe all of the threads could use the same Pool, avoiding nesting)?
Thanks a lot in advance.
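A sketch of the first idea above (the queued job simply launches the command line program), assuming the program can be started with the standard subprocess module. The child process is an ordinary new process, so it is free to use Pool, and its return code gives at least a basic success check. Passing the command as a list, without a shell, avoids shell interpolation of user input. The child code here is a stand-in for the real program.

```python
import subprocess
import sys


def run_job(command):
    """Run the command line program and report whether it succeeded."""
    completed = subprocess.run(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    return completed.returncode, completed.stdout, completed.stderr


# Stand-in for the real program: a child Python process that uses Pool.map.
child_code = (
    "from multiprocessing import Pool\n"
    "if __name__ == '__main__':\n"
    "    with Pool(2) as pool:\n"
    "        print(sum(pool.map(abs, [-1, -2, -3])))\n"
)
returncode, out, err = run_job([sys.executable, "-c", child_code])
```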
What you need is a workflow management system (WFMS) that manages
task concurrency
task dependency
task nesting
among other things.
From a very high level view, a WFMS sits on top of a task pool like celery, and submits the tasks which are ready to execute to the pool. It is also responsible for opening up a nest and submitting the tasks in the nest accordingly.
I've developed a system to do just that. It's called pomsets. Try it out, and feel free to send me any questions.
I normally use multiprocess daemons based on Twisted with forking, and Gearman for the job queue.
Take a look at Gearman.