I have a simple function that goes over a list of URLs, uses GET to retrieve some information, and updates the DB (PostgreSQL) accordingly. The function works perfectly. However, going over each URL one at a time takes too much time.
Using Python, I'm able to do the following to parallelize these tasks:
from multiprocessing import Pool

def updateDB(ip):
    ...  # code goes here

if __name__ == '__main__':
    pool = Pool(processes=4)  # process per core
    pool.map(updateDB, ip)
This works pretty well. However, I'm trying to figure out how to do the same in a Django project. Currently I have a function (a view) that goes over each URL to get the information and updates the DB.
The only thing I could find is Celery, but that seems to be overkill for the simple task I want to perform.
Is there anything simpler I can do, or do I have to use Celery?
Currently I have a function (a view) that goes over each URL to get the information and updates the DB.
That means response time does not matter to you, and instead of doing the work in the background (asynchronously), you are OK with doing it in the foreground as long as your response time is cut by a factor of 4 (using 4 sub-processes/threads). If that is the case, you can simply put your sample code in your view. Like:
from multiprocessing import Pool
from django.http import HttpResponse

def updateDB(ip):
    ...  # code goes here

def my_view(request):
    pool = Pool(processes=4)  # process per core
    pool.map(updateDB, ip)  # 'ip' is your list of URLs
    return HttpResponse("SUCCESS")
But if you want to do it asynchronously in the background, then you should use Celery or follow one of @BasicWolf's suggestions.
Though using Celery may seem like overkill, it is a well-known way of doing asynchronous tasks. Essentially, Django serves the WSGI request-response cycle, which knows nothing of multiprocessing or background tasks.
Here are alternative options:
Django background tasks - might fit your case better.
Redis queue
I recommend using gevent for a multithreading-style solution instead of multiprocessing. Multiprocessing can cause problems in production environments where spawning new processes is restricted.
Example code:
from django.shortcuts import HttpResponse
from gevent.pool import Pool

def square(number):
    return number * number

def home(request):
    pool = Pool(50)
    numbers = [1, 3, 5]
    results = pool.map(square, numbers)
    return HttpResponse(results)
Related
I am somewhat new to both threading and multiprocessing in Python, as well as to the concept of the GIL. I have a situation where I have time-consuming fire-and-forget tasks that the server needs to run, but the server should immediately reply to the client and basically say "okay, your thing was submitted" so that the client does not hang waiting for the thing to complete. An example of what one of these "things" might do is pull down some data from a database or two, compare that data, and then write the result to another database. The databases are remote, not on the same host as the server itself. Another example is crunching some data and then sending a text as a result. The client does not care about the data, but someone will later receive a text with information that is the result of the data crunching from the various dictionaries and database entries. However, there could be many such requests pouring in from many clients. The goal here is to spawn a thread or process that essentially kills itself when done, because we don't care at all about returning any data from it.
At a glance, my understanding is that both multiprocessing and threading can achieve similar results for this use case. My main concern is that I can immediately launch the function to go do its own thing and return to the client quickly so it does not hang. There are many, many requests coming in simultaneously from many, many clients in this scenario. As a result, my understanding is that multiprocessing may be better, so that these tasks would not need to be executed as sequential threads because of the GIL. However, I am unsure of how to make the processes end themselves when they are done with their task rather than needing to wait for them.
An example of the problem
@route('/api/example', methods=["POST"])
def example_request(self, request):
    request_data = request.get_json()
    crunch_data_and_send_text(request_data)  # Takes maybe 5-10 seconds, doesn't return data
    return  # Return to client. Would like to return immediately rather than waiting
Would threading or multiprocessing be better here? And how can I make the process (or thread) .join() itself when it is done, rather than needing to join it before I can return to the client?
I have also considered asyncio, which I think would also improve this, but the existing codebase I have inherited is so large that it is infeasible to rewrite it as async for the time being, and library replacements might need to be found in that case, so it is not an option.
# Threading
from threading import Thread

@route('/api/example', methods=["POST"])
def example_request(self, request):
    request_data = request.get_json()
    fire_and_forget = Thread(target=crunch_data_and_send_text, args=(request_data,))
    fire_and_forget.start()
    return  # Return to client. Would like to return immediately rather than waiting
# Multiprocessing
from multiprocessing import Process

@route('/api/example', methods=["POST"])
def example_request(self, request):
    request_data = request.get_json()
    fire_and_forget = Process(target=crunch_data_and_send_text, args=(request_data,))
    fire_and_forget.start()
    return  # Return to client. Would like to return immediately rather than waiting
Which of these is better for this use case? Is there a way I can have them .join() themselves automatically when they finish, rather than needing to sit in the function and wait for them to complete before returning to the client?
To be clear, asyncio is unfortunately NOT an option for me.
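One thing worth noting about the threading variant above: a Python `Thread` does not need an explicit `join()` to be cleaned up. Once its target function returns, the thread terminates and the interpreter reclaims it, so start-and-return already behaves as fire-and-forget. A minimal self-contained sketch, where `crunch` is a stand-in for the real slow task:

```python
import threading
import time

results = []

def crunch(data):
    # Stand-in for the slow task; runs in the background thread.
    time.sleep(0.1)
    results.append(data * 2)

def handle_request(data):
    # Start the worker and return immediately -- no join() needed;
    # the thread cleans itself up when crunch() returns.
    threading.Thread(target=crunch, args=(data,)).start()
    return "submitted"

print(handle_request(21))  # prints "submitted" immediately
```

A `Process` behaves similarly, though each spawn is heavier; for CPU-bound work you would typically reuse a pool instead of spawning per request.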
I suggest using the Advanced Python Scheduler (APScheduler).
Instead of running your function in a thread, schedule it to run and immediately return to the client.
After setting up your Flask app, set up Flask-APScheduler and then schedule your function to run in the background.
from datetime import datetime

from apscheduler.schedulers.background import BackgroundScheduler

scheduler = BackgroundScheduler({
    # --- set up the scheduler ---
})

@route('/api/example', methods=["POST"])
def example_request(self, request):
    request_data = request.get_json()
    job = scheduler.add_job(crunch_data_and_send_text, 'date', run_date=datetime.utcnow())
    return "The request is being processed ..."
To pass arguments to crunch_data_and_send_text you can wrap the call in a lambda (or use add_job's args parameter):
lambda: crunch_data_and_send_text(request_data)
Here is the User Guide.
I am currently working on a test system that uses Selenium Grid for WhatsApp automation.
WhatsApp requires a QR code scan to log in, but once the code has been scanned, the session persists as long as the cookies remain saved in the browser's user data directory.
I would like to run a series of tests concurrently while making sure that every session is only used by one thread at any given time.
I would also like to be able to add additional tests to the queue while tests are being run.
So far I have considered using the ThreadPoolExecutor context manager in order to limit the maximum available workers to the maximum number of sessions. Something like this:
import queue
import concurrent.futures
from concurrent.futures import ThreadPoolExecutor

def make_queue(questions):
    q = queue.Queue()
    for question in questions:
        q.put(question)
    return q

def test_conversation(q):
    item = q.get()
    # WhatsApp test happens here
    q.task_done()

def run_tests(questions):
    q = make_queue(questions)
    # number_of_sessions is defined elsewhere
    with ThreadPoolExecutor(max_workers=number_of_sessions) as executor:
        futures = []
        while not q.empty():
            futures.append(executor.submit(test_conversation, q))
        for f in concurrent.futures.as_completed(futures):
            # save results somewhere
            pass
It does not include a way to make sure that every thread gets its own session, though, and as far as I know I can only send one parameter to the function that the executor calls.
I could make some complicated checkout system that works like borrowing books from a library so that every session can only be checked out once at any given time, but I'm not confident in making something that is thread safe and works in all cases. Even the ones I can't think of until they happen.
I am also not sure how I would keep this going while adding items to the queue without it locking up my entire application. Would I have to run run_tests() in its own thread?
Is there an established way to do this? Any help would be much appreciated.
In the middle of a function I'd like to be able to fire off a call to the DB (burn results) while the function keeps on running, so that I do not experience an I/O bottleneck. This is NOT a web application; everything is offline.
A snippet for explanatory purposes:
a = list(range(100))
for i in a:
    my_output = very_long_function(i)
    # I'd like to_sql to run in a fire-and-forget fashion
    function_of_choice_to_sql(my_output)
I was wondering whether I was better off with the threading library, asyncio, or other tools. I was unsuccessful in this particular endeavour with all of them; I'll take any working solution.
Any help?
P.S.: there will likely be no problems with concurrency/locking and the like, since in my case the time my function takes to compute is far larger than the time for the database write.
You could use a ThreadPoolExecutor; it provides a simple interface for scheduling callables to a pool of workers. In particular, you might be interested in the map method:
from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(max_workers=4) as executor:
    # Lazy map (Python 3 only)
    outputs = map(very_long_function, range(10))
    # Asynchronous map
    results = executor.map(function_of_choice_to_sql, outputs)
    # Wait for the results
    print(list(results))
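If the goal is closer to fire-and-forget while the main loop keeps computing, a variant that `submit`s each write as it becomes available may fit the original loop more directly. In this sketch, `very_long_function` and `function_of_choice_to_sql` are stand-ins for the real computation and DB write:

```python
from concurrent.futures import ThreadPoolExecutor

written = []

def very_long_function(i):
    return i * i  # stand-in for the slow computation

def function_of_choice_to_sql(output):
    written.append(output)  # stand-in for the DB write

with ThreadPoolExecutor(max_workers=4) as executor:
    for i in range(10):
        my_output = very_long_function(i)  # runs in the main thread
        executor.submit(function_of_choice_to_sql, my_output)  # fire and forget
# Leaving the with-block waits for any pending writes to finish.

print(sorted(written))  # → [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```

Each write is handed off as soon as its input is ready, so the computation never blocks on the database.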
I am trying to implement a REST API in Python using Flask. However, the API implementation in turn needs to make some network calls (to a db, for example). To improve throughput, can I make this asynchronous? By that I mean this: let's suppose the REST API is foo()
def foo():
    # 1. do stuff as needed
    # 2. call make_network_call and wait for it to return.
    # 3. do stuff as needed with the returned data.
    # 4. return.
Now if I know it's going to take some time at step 2, can I give up the CPU here, process other incoming requests, and come back to it when it returns? If so, how do I do it and what frameworks are involved? I am currently using Python with Flask.
Flask can be run with multiple threads or processes when it's launched; see this question. It won't make foo() any more efficient, but you will be able to serve multiple clients simultaneously.
To run it with multiple threads or processes, you can specify so in the Flask.run() keywords:
For threads:
if __name__ == '__main__':
    app.run(threaded=True)
Or for processes:
if __name__ == '__main__':
    app.run(processes=5)  # Or however many you may want.
If you're using a recent (>= 3.2) version of Python, you can use concurrent.futures. That would look like:
from concurrent.futures import ThreadPoolExecutor

def other_func():
    with ThreadPoolExecutor() as executor:
        future = executor.submit(foo)
        # do other stuff
        return future.result()
Have a look at Klein -- it's a Flask/Twisted hybrid. Twisted is an asynchronous, reactor-pattern framework that works at a lower level than you'd be used to in Flask.
Klein is like a wrapper on top of Twisted that acts very much like Flask -- it allows you to write deferred code using the reactor.
https://github.com/twisted/klein
Newb question about Django app design:
I'm building a reporting engine for my web site. I have a big (and growing) amount of data, and some algorithm which must be applied to it. The calculations promise to be heavy on resources, and it would be stupid to perform them on users' requests. So I'm thinking of putting them into a background process which would run continuously and from time to time return results, which could be fed to the Django view routine to produce HTML output on demand.
And my question is: what is the proper design approach for building such a system? Any thoughts?
Celery is one of your best choices. We are using it successfully. It has a powerful scheduling mechanism - you can either schedule tasks as timed jobs or trigger tasks in the background when a user (for example) requests it.
It also provides ways to query the status of such background tasks and has a number of flow-control features. It allows for very easy distribution of the work - i.e. your Celery background tasks can run on a separate machine (this is very useful, for example, with Heroku's web/worker split, where the web process is limited to a max of 30s per request). It provides various queue backends (database, RabbitMQ, or a number of other queuing mechanisms). With the simplest setup it can use the same database that your Django site already uses (which makes it easy to set up).
And if you are using automated tests, it also has a feature that helps with testing - it can be set to "eager" mode, where background tasks are not executed in the background, thus giving predictable logic in tests.
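For reference, the eager switch is just a configuration flag; with a Celery 3.x-era Django setup it could look like the sketch below (the setting names differ in Celery 4+, where they are `task_always_eager` and `task_eager_propagates` on the app config):

```python
# settings.py used by the test run -- sketch, Celery 3.x-style names
CELERY_ALWAYS_EAGER = True                 # run tasks inline, not in the background
CELERY_EAGER_PROPAGATES_EXCEPTIONS = True  # let task errors surface in tests
```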
More info here: http://docs.celeryproject.org:8000/en/latest/django/
Do you mean the results are returned into a database, or do you want to create Django views directly from your independently running code?
If you have large amounts of data, I like to use Python's multiprocessing. You can create a generator which fills a JoinableQueue with the different tasks to do, and a pool of workers consuming the different tasks. This way you should be able to maximize resource utilization on your system.
The multiprocessing module also allows you to distribute tasks over the network (e.g. via multiprocessing.Manager()). With this in mind you should easily be able to scale things up if you need a second machine to process the data in time.
Example:
This example shows how to spawn multiple processes. The generator function should query the database for all new entries that need heavy lifting. The consumers take the individual items from the queue and do the actual calculations.
import time
from multiprocessing import JoinableQueue, Process

QUEUE = JoinableQueue(-1)

def generator():
    """ Puts items in the queue. For example, query the database for all new,
    unprocessed entries that need some serious math done... """
    while True:
        QUEUE.put("Item")
        time.sleep(0.1)

def consumer(consumer_id):
    """ Consumes items from the queue... Do your calculations here... """
    while True:
        item = QUEUE.get()
        print("Process %s has done: %s" % (consumer_id, item))
        QUEUE.task_done()

if __name__ == '__main__':
    p = Process(target=generator)
    p.start()

    workers = []
    for x in range(2):
        w = Process(target=consumer, args=(x,))
        w.start()
        workers.append(w)

    p.join()
    for w in workers:
        w.join()
Why don't you have a URL or Python script that triggers whatever calculation you need done every time it's run, and then fetch that URL or run that script via a cronjob on the server? From what your question says, it doesn't seem like you need much more than that.
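Sketched as a crontab entry, assuming a hypothetical script path `/path/to/report.py` (or a hypothetical trigger URL fetched with curl):

```shell
# m h dom mon dow  command -- run the heavy calculation every 15 minutes
*/15 * * * * /usr/bin/python /path/to/report.py
# ...or fetch a trigger URL instead:
# */15 * * * * curl -s http://localhost:8000/reports/recalculate/
```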