In the middle of a function I'd like to be able to fire off a call to the DB (write the results and forget about them) while the function keeps on running, so that I do not experience an I/O bottleneck. This is NOT a web application; everything is offline.
Snippet for explanatory purposes:
a = list(range(100))

for i in a:
    my_output = very_long_function(i)
    # I'd like to_sql to run in a "fire-and-forget" fashion
    function_of_choice_to_sql(my_output)
I was wondering whether I was better off with the threading library, asyncio, or other tools, but I was unsuccessful in this particular endeavour with any of them. I'll take any working solution.
Any help?
P.S.: there will likely be no problems with concurrency/locking and the like, since in my case the time my function takes to compute is far larger than the time it takes for the database write.
You could use a ThreadPoolExecutor; it provides a simple interface for scheduling callables to a pool of workers. In particular, you might be interested in the map method:
from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(max_workers=4) as executor:
    # Lazy map (Python 3 only)
    outputs = map(very_long_function, range(10))
    # Asynchronous map
    results = executor.map(function_of_choice_to_sql, outputs)
    # Wait for the results
    print(list(results))
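If you want literal fire-and-forget semantics inside your original loop rather than map, something along these lines should also work (a sketch using the hypothetical names from your snippet):

from concurrent.futures import ThreadPoolExecutor

# One worker is enough here, since the DB write is much faster than the
# computation; the executor simply keeps the writes off the main thread.
with ThreadPoolExecutor(max_workers=1) as executor:
    for i in range(100):
        my_output = very_long_function(i)
        # submit() returns immediately; the write happens in the background
        executor.submit(function_of_choice_to_sql, my_output)
    # leaving the with-block waits for any pending writes to finish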
(Preamble: I use python-telegram-bot to run a Telegram bot which registers users' messages in Google Sheets. None of this is relevant to this question, but it may provide some context for the source of the trouble. The thing is that the Google Sheets API does not allow too-frequent access to the sheets, so if many users try to write there, I need to process their requests with some delay.)
I know it is considered very bad practice to use the threading module to process tasks, because of the GIL. But by the nature of my task, I receive a flow of requests from users, and I would like to process them with some delay (like 1 to 10 seconds later than they were actually received). (Right now I use celery+redis to process delayed tasks, but it looks like overkill to me for such a trivial thing as delayed execution; then again, I may be wrong.)
So I wonder whether I can use concurrent.futures.ProcessPoolExecutor (as explained, for example, here: https://idolstarastronomer.com/two-futures.html), or whether it will result in the kind of disaster promised by most of the people who warn against using threading in Python.
Here is purely hypothetical code that runs something with a delay using ProcessPoolExecutor. Will it end up in a disaster under some conditions (too many delayed requests, for instance)?
import concurrent.futures
import time
import random

def register_with_delay():
    time.sleep(random.randint(0, 10))
    print("I'm in the delayed registration")

def main():
    with concurrent.futures.ProcessPoolExecutor() as executor:
        futures = [executor.submit(register_with_delay) for _ in range(10)]
        for i in range(10):
            print("I'm in the main loop")
            time.sleep(random.randint(0, 1))

if __name__ == '__main__':
    main()
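For comparison, and purely as my own sketch rather than something from the question, here is the same hypothetical code with a thread pool; since the tasks only sleep or wait on I/O such as the Sheets API, the GIL is not a practical concern:

import concurrent.futures
import time
import random

def register_with_delay():
    time.sleep(random.randint(0, 10))    # stand-in for the delayed Sheets write
    print("I'm in the delayed registration")

def main():
    with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
        futures = [executor.submit(register_with_delay) for _ in range(10)]
        for _ in range(10):
            print("I'm in the main loop")
            time.sleep(random.randint(0, 1))

if __name__ == '__main__':
    main()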
I have a simple function that goes over a list of URLs, uses GET to retrieve some information, and updates the DB (PostgreSQL) accordingly. The function works perfectly. However, going over each URL one at a time takes too much time.
Using Python, I'm able to do the following to parallelize these tasks:
from multiprocessing import Pool

def updateDB(ip):
    # code goes here...
    pass

if __name__ == '__main__':
    pool = Pool(processes=4)  # one process per core
    pool.map(updateDB, ip)
This works pretty well. However, I'm trying to figure out how to do the same in a Django project. Currently I have a function (view) that goes over each URL to get the information and update the DB.
The only thing I could find is using Celery, but this seems to be overkill for the simple task I want to perform.
Is there anything simpler that I can do, or do I have to use Celery?
Currently I have a function (view) that goes over each URL to get the information and update the DB.
This suggests that response time does not matter much to you, and that instead of doing the work in the background (asynchronously), you are OK with doing it in the foreground as long as the response time is cut by a factor of 4 (using 4 sub-processes/threads). If that is the case, you can simply put your sample code in your view, like so:
from multiprocessing import Pool

from django.http import HttpResponse

def updateDB(ip):
    # code goes here...
    pass

def my_view(request):
    pool = Pool(processes=4)  # one process per core
    pool.map(updateDB, ip)
    return HttpResponse("SUCCESS")
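One caveat, which is my own note rather than part of the original answer: the pool above is never closed, so workers can pile up across requests, and some deployments restrict spawning processes from inside a request at all. If the work is I/O-bound, a thread-backed pool from multiprocessing.dummy is a drop-in sketch of the same idea:

from multiprocessing.dummy import Pool  # same Pool API, backed by threads
from django.http import HttpResponse

def my_view(request):
    pool = Pool(4)               # 4 worker threads instead of processes
    try:
        pool.map(updateDB, ip)   # 'ip' is the iterable of addresses from the question
    finally:
        pool.close()
        pool.join()              # make sure workers are cleaned up per request
    return HttpResponse("SUCCESS")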
But if you want to do it asynchronously in the background, then you should use Celery or follow one of @BasicWolf's suggestions.
Though using Celery may seem like overkill, it is a well-known way of doing asynchronous tasks. Essentially, Django serves the WSGI request-response cycle, which knows nothing of multiprocessing or background tasks.
Here are alternative options:
Django background tasks - might fit your case better (a minimal sketch follows after this list).
Redis queue
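For the first option, a minimal sketch assuming the django-background-tasks package is installed and 'background_task' is added to INSTALLED_APPS (the function and variable names are mine, not from the question):

from background_task import background
from django.http import HttpResponse

@background(schedule=5)          # run roughly 5 seconds after being queued
def update_db_for(url):
    # the GET request + DB update from the question would go here
    pass

def my_view(request):
    for url in url_list:         # 'url_list' is a hypothetical iterable of URLs
        update_db_for(url)       # only enqueues; a worker started with
                                 # `manage.py process_tasks` actually runs it
    return HttpResponse("queued")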
I would recommend using gevent for a greenlet-based (cooperative multithreading) solution instead of multiprocessing. Multiprocessing can cause problems in production environments where spawning new processes is restricted.
Example code:
from django.http import HttpResponse
from gevent.pool import Pool

def square(number):
    return number * number

def home(request):
    pool = Pool(50)
    numbers = [1, 3, 5]
    results = pool.map(square, numbers)
    return HttpResponse(results)
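One assumption worth spelling out (not stated in the original answer): for gevent to actually overlap network I/O such as the GET requests from the question, the standard library generally needs to be monkey-patched before anything else imports sockets. A standalone sketch:

from gevent import monkey
monkey.patch_all()               # must run as early as possible

import requests                  # assumed HTTP client; now cooperative
from gevent.pool import Pool

def fetch(url):
    # each greenlet yields while waiting on the network, so many
    # requests can be in flight at once inside a single process
    return requests.get(url).status_code

pool = Pool(50)
statuses = pool.map(fetch, ["https://example.com"] * 10)
print(statuses)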
I am trying to implement a REST API in Python using Flask. However, the API implementation in turn needs to make some network calls (to a DB, for example). To improve throughput, can I make this asynchronous? By that I mean the following. Let's suppose the REST endpoint is foo():
def foo():
    # 1. do stuff as needed
    # 2. call make_network_call and wait for it to return
    # 3. do stuff as needed with the returned data
    # 4. return
    pass
Now, if I know it's going to take some time at step 2, can I give up the CPU here, process other incoming requests, and come back to it when it returns? If so, how do I do it and what frameworks are involved? I am currently using Python with Flask.
Flask can be run with multiple threads or processes when it's launched; see this question. It won't make foo() any more efficient, but you will be able to serve multiple clients simultaneously.
To run it with multiple threads or processes, you can specify so in the Flask.run() keyword arguments.
For threads:
if __name__ == '__main__':
    app.run(threaded=True)
Or for processes:
if __name__ == '__main__':
    app.run(processes=5)  # Or however many you may want.
If you're using a recent (>= 3.2) version of Python, you can use concurrent.futures. That would look like:
from concurrent.futures import ThreadPoolExecutor

def other_func():
    with ThreadPoolExecutor() as executor:  # note: the executor must be instantiated
        future = executor.submit(foo)
        # do other stuff
        return future.result()
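Wired into the foo() view from the question, the same idea might look like the sketch below; make_network_call, the pool size, and the jsonify usage are my assumptions rather than anything from the question:

from concurrent.futures import ThreadPoolExecutor
from flask import Flask, jsonify

app = Flask(__name__)
executor = ThreadPoolExecutor(max_workers=4)     # created once, shared by requests

def make_network_call():
    return {"status": "ok"}                      # placeholder for the real DB call

@app.route('/foo')
def foo():
    # 1. do stuff as needed
    future = executor.submit(make_network_call)  # 2. starts in a worker thread
    # 3. do other independent work while the call is in flight
    data = future.result()                       # block only when the data is needed
    return jsonify(data)                         # 4. return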
Have a look at Klein -- it's a Flask/Twisted hybrid. Twisted is an asynchronous, reactor-pattern framework that works at a lower level than what you'd be used to in Flask.
Klein is like a wrapper on top of Twisted that acts very much like Flask -- it allows you to write deferred code using the reactor.
https://github.com/twisted/klein
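A minimal sketch of what that can look like, assuming Klein is installed; make_network_call stands in for the blocking work from the question:

from klein import Klein
from twisted.internet import threads

app = Klein()

def make_network_call():
    return "some data"           # placeholder for the real DB/network work

@app.route('/foo')
def foo(request):
    # deferToThread pushes the blocking call onto Twisted's thread pool;
    # returning the Deferred lets the reactor serve other requests meanwhile.
    d = threads.deferToThread(make_network_call)
    d.addCallback(lambda data: data.encode('utf-8'))
    return d

app.run("localhost", 8080)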
First of all, I know I can use threading to accomplish such a task, like so:
import Queue
import threading

# called by each thread
def do_stuff(q, arg):
    result = heavy_operation(arg)
    q.put(result)

operations = range(1, 10)
q = Queue.Queue()

for op in operations:
    t = threading.Thread(target=do_stuff, args=(q, op))
    t.daemon = True
    t.start()

s = q.get()
print s
However, in Google App Engine there's something called ndb tasklets, and according to their documentation you can execute code in parallel using them.
Tasklets are a way to write concurrently running functions without
threads; tasklets are executed by an event loop and can suspend
themselves blocking for I/O or some other operation using a yield
statement. The notion of a blocking operation is abstracted into the
Future class, but a tasklet may also yield an RPC in order to wait for
that RPC to complete.
Is it possible to accomplish something like the example with threading above?
I already know how to handle retrieving entities using get_async() (got it from the examples on their docs page), but it's very unclear to me when it comes to parallel code execution.
Thanks.
The answer depends on what your heavy_operation really is. If heavy_operation uses RPCs (Remote Procedure Calls, such as datastore access, UrlFetch, etc.), then the answer is yes.
In "how to understand appengine ndb.tasklet?" I asked a similar question; you may find more details there.
May I put any kind of code inside a function and decorate it with @ndb.tasklet, then use it as an async function later? Or must it be an App Engine RPC?
The Answer
Technically yes, but it will not run asynchronously. When you decorate a non-yielding function with @tasklet, its Future's value is computed and set when you call that function. That is, it runs through the entire function when you call it. If you want to achieve asynchronous operation, you must yield on something that does asynchronous work. Generally in GAE it will work its way down to an RPC call.
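For illustration, a minimal tasklet that does run asynchronously because it yields on RPC futures; this is my own sketch based on the ndb documentation, with hypothetical keys:

from google.appengine.ext import ndb

@ndb.tasklet
def get_two(key1, key2):
    # both RPCs are started before either result is awaited,
    # so they run in parallel under ndb's event loop
    future1 = key1.get_async()
    future2 = key2.get_async()
    ent1 = yield future1
    ent2 = yield future2
    raise ndb.Return((ent1, ent2))

# Usage: get_two(k1, k2).get_result() blocks until both RPCs have completed.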
Newbie question about Django app design:
I'm building a reporting engine for my web site. I have a large (and growing) amount of data, and some algorithm that must be applied to it. The calculations promise to be heavy on resources, and it would be wasteful to perform them on user requests. So I am thinking of putting them into a background process that would run continuously and return results from time to time; those results could then be fed to the Django view routine to produce HTML output on demand.
And my question is: what is the proper design approach for building such a system? Any thoughts?
Celery is one of your best choices. We are using it successfully. It has a powerful scheduling mechanism - you can either schedule tasks as timed jobs or trigger tasks in the background when a user (for example) requests them.
It also provides ways to query the status of such background tasks and has a number of flow-control features. It allows for very easy distribution of the work - i.e. your Celery background tasks can run on a separate machine (this is very useful, for example, with Heroku's web/worker split, where the web process is limited to 30s per request). It provides various queue backends (it can use a database, RabbitMQ, or a number of other queuing mechanisms). With the simplest setup it can use the same database your Django site already uses, which makes it easy to set up.
And if you are using automated tests, it also has a feature that helps with testing - it can be set to an "eager" mode in which background tasks are not executed in the background, which keeps the test logic predictable.
More info here: http://docs.celeryproject.org:8000/en/latest/django/
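For illustration only, a minimal sketch of such a task, assuming a standard Celery + Django setup with a configured broker (the task name and arguments are hypothetical):

from celery import shared_task

@shared_task
def build_report(report_id):
    # heavy calculations go here; results are typically written to the DB
    # so a normal Django view can render them on demand
    pass

# From a view or a periodic schedule:
#     build_report.delay(42)     # runs in a worker process, not in the request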
Do you mean that the results are returned into a database, or do you want to create Django views directly from your independently running code?
If you have large amounts of data, I like to use Python's multiprocessing. You can create a generator which fills a JoinableQueue with the different tasks to do, and a pool of workers consuming the different tasks. This way you should be able to maximize resource utilization on your system.
The multiprocessing module also allows you to spread tasks over the network (e.g. via multiprocessing.Manager()). With this in mind, you should easily be able to scale things up if you need a second machine to process the data in time.
Example:
This example shows how to spawn multiple processes. The generator function should query the database for all new entries that need heavy lifting. The consumers take the individual items from the queue and do the actual calculations.
import time

from multiprocessing import JoinableQueue, Process

QUEUE = JoinableQueue(-1)  # unbounded queue shared with the child processes

def generator():
    """Puts items in the queue. For example, query the database for all new,
    unprocessed entries that need some serious math done..."""
    while True:
        QUEUE.put("Item")
        time.sleep(0.1)

def consumer(consumer_id):
    """Consumes items from the queue... Do your calculations here..."""
    while True:
        item = QUEUE.get()
        print "Process %s has done: %s" % (consumer_id, item)
        QUEUE.task_done()

p = Process(target=generator)
p.start()

for x in range(0, 2):
    w = Process(target=consumer, args=(x,))
    w.start()

p.join()
w.join()
Why don't you have a URL or Python script that triggers whatever calculation you need done every time it's run, and then fetch that URL or run that script via a cronjob on the server? From your question it doesn't seem like you need a whole lot more than that.