I'm trying to wrap my head around asyncio and aiohttp and for the first time in years programming makes me feel utterly stupid and incapable. Which is kind of beautiful, in a weirdo Zen way. But alas, there's work to get done.
I've got an existing class that can do numerous wondrous things on the web, like signing up to a web site, getting data, the works. And now I need like, 100 or 1000 of these little worker bees to sign up. Code looks roughly like this:
class Worker(object):

    def signup(self, ...):
        ...
        data = self.make_request(url, data)
        self.user_id = data.get("user_id")
        return self

    def make_request(self, url, data):
        response = requests.post(url, data=data)
        return response.json()

workers = [Worker().signup() for n in range(100)]
As you can see, we're using the requests module to make a POST request. However, this is blocking, so we'll have to wait for worker N to finish signing up before we start signing up worker N+1. Fortunately, the original author of the Worker class (that sounds charmingly Marxist) in her infinite wisdom wrapped every HTTP call in the self.make_request method, so making the whole Worker non-blocking should just be a matter of swapping out the requests library for a non-blocking one aaaaand Bob's your uncle, right? This is how far I got:
class AsyncWorker(Worker):

    @asyncio.coroutine
    def make_request(self, url, data):
        response = yield from aiohttp.request('post', url, data=data)
        return (yield from response.json())

coroutines = [AsyncWorker().signup() for n in range(100)]

loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.wait(coroutines))
loop.close()
But this raises an AttributeError: 'generator' object has no attribute 'get' in the signup method where I do self.user_id = data.get("user_id"). And beyond that, I still don't have the workers in a neat dictionary. I'm aware that I'm most likely completely misunderstanding how asyncio works - but I already spent a day reading through various docs, mind-shattering tutorials by David Beazley, and masses of toy examples that are simple enough for me to understand but too simple to apply to this situation. How should I structure my worker and my async loop to sign up 100 workers in parallel and eventually get a list of all workers after they have signed up?
Once you use yield (or yield from) in a function, that function becomes a coroutine. That means you can't get a result by just calling it: you'll get a generator object instead. You must at least do this:
@asyncio.coroutine
def some_coroutine(*args):
    # ...
    # ...
    result = yield from tasty.asyncio.function()
    return result

def coroutine_user():
    # data = some_coroutine() would give you a generator object instead of a result
    data = yield from some_coroutine()
    return data  # data here is a plain result: you can call your .get or whatever
Guess what happens when you call coroutine_user():
>>> coroutine_user()
<generator object coroutine_user at 0x7fe13b8a47e0>
The lack of an asyncio.coroutine decorator doesn't help at all: coroutines are contagious! To get a result from a coroutine you must use yield from, and that turns your own function into another coroutine!
Though things aren't always that bad (usually you can manually iterate a generator object without relying on yield from), asyncio will specifically stop you from doing so: it breaks some internals (you can do it only from a Future or an asyncio.coroutine). So just use concurrent.futures or something similar unless you're going to turn all of your code into coroutines. As an alternative, isolate all users of aiohttp.request from your ordinary methods and work with both coroutine-based async workers and plain old synchronous code. Diving into asyncio and actually refactoring all your code is obviously an option too: you basically need to put yield from before every call to any method infected with asyncio.
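To make it concrete for the example above: a minimal sketch of that refactoring, where signup itself becomes a coroutine and the results are collected with asyncio.gather (signup_url and form_data are placeholders standing in for the elided arguments):

import asyncio
import aiohttp

class AsyncWorker(Worker):

    @asyncio.coroutine
    def signup(self):
        # signup_url and form_data are placeholders for the real arguments
        data = yield from self.make_request(signup_url, form_data)
        self.user_id = data.get("user_id")
        return self

    @asyncio.coroutine
    def make_request(self, url, data):
        response = yield from aiohttp.request('post', url, data=data)
        return (yield from response.json())

loop = asyncio.get_event_loop()
# gather returns the coroutines' results in order, so this really is
# "a list of all workers after they signed up"
workers = loop.run_until_complete(
    asyncio.gather(*[AsyncWorker().signup() for n in range(100)]))
loop.close()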
Edit: I am closing this question.
As it turns out, my goal of having parallel HTTP posts is pointless. After implementing it successfully with aiohttp, I ran into deadlocks elsewhere in the pipeline.
I will reformulate this and post a single question in a few days.
Problem
I want to have a class that, during some other computation, holds generated data and can write it to a DB via HTTP (details below) when convenient. It's gotta be a class as it is also used to load/represent/manipulate data.
I have written a naive, nonconcurrent implementation that works:
The class is initialized and then used in a main loop. Data produced in that loop is added to a naive "queue" (a plain list of HTTP requests). At certain intervals in the main loop, the class calls a function to write those requests out and clear the "queue".
As you might expect, this is IO-bound: whenever I need to write the "queue", the main loop halts. Furthermore, since the main computation runs on a GPU, the loop is not really CPU-bound either.
Essentially, I want to have a queue, and, say, ten workers running in the background and pushing items to the http connector, waiting for the push to finish and then taking on the next (or just waiting for the next write call, not a big deal). In the meantime, my main loop runs and adds to the queue.
Program example
My naive program looks something like this:
class data_storage(...):

    def add(self, results):
        ...

    def write_queue(self):
        if len(self.queue) > 0:
            res = self.connector.run(self.queue)
            self.queue = []

def main_loop(storage):
    # do many things
    for batch in dataset:  # simplified example
        # do stuff
        for item in some_other_loop:
            (...)
            storage.add(results)
        # for example, call each iteration
        storage.write_queue()

if __name__ == "__main__":
    storage = data_storage()
    main_loop(storage)
    ...
In detail: the connector class is from the 'neo4j-connector' package and posts to my Neo4j database. It essentially does the JSON formatting and uses Python's requests library under the hood.
This works, even without a real queue, since nothing is concurrent.
Now I have to make it work concurrently.
From my research, I have seen that ideally I would want a "producer-consumer" pattern, where both are initialized via asyncio. I have only seen this implemented via functions, not classes, so I don't know how to approach this. With functions, my main loop should be a producer coroutine and my write function becomes the consumer. Both are initiated as tasks on the queue and then gathered, where I'd initialize only one producer but many consumers.
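For reference, the implementations I've seen look roughly like this (a sketch with stand-in names: compute_items and write_item are placeholders for the main loop's work and the HTTP/DB call):

import asyncio

async def producer(queue):
    for item in compute_items():  # placeholder for the main loop's output
        await queue.put(item)
    await queue.join()  # block until every queued item has been processed

async def consumer(queue):
    while True:
        item = await queue.get()
        try:
            await write_item(item)  # placeholder for the HTTP/DB call
        finally:
            queue.task_done()

async def main():
    queue = asyncio.Queue(maxsize=100)
    consumers = [asyncio.ensure_future(consumer(queue)) for _ in range(10)]
    await producer(queue)  # one producer, many consumers
    for c in consumers:  # every item is done, so stop the consumers
        c.cancel()

asyncio.get_event_loop().run_until_complete(main())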
My issue is that the main loop includes parts that are already parallel (e.g. PyTorch). asyncio is not thread-safe, so I don't think I can just wrap everything in async decorators and make a coroutine. This is also precisely why I want the DB logic in a separate class.
I also don't actually want or need the main loop to run "concurrently" on the same thread with the workers. But it's fine if that's the outcome as the workers don't do much on the CPU. But technically speaking, I want multi-threading? I have no idea.
My only other option would be to write into the queue until it is "full", halt the loop and then use multiple threads to dump it to the DB. Still, this would be much slower than doing it while the main loop is running. My gain would be minimal, just concurrency while working through the queue. I'd settle for it if need be.
However, based on a Stack Overflow post, I came up with this small change:
class data_storage(...):

    def add(self, results):
        ...

    def background(f):
        def wrapped(*args, **kwargs):
            return asyncio.get_event_loop().run_in_executor(None, f, *args, **kwargs)
        return wrapped

    @background
    def write_queue(self):
        if len(self.queue) > 0:
            res = self.connector.run(self.queue)
            self.queue = []
Shockingly this sort of "works" and is blazingly fast. Of course since it's not a real queue, things get overwritten. Furthermore, this overwhelms or deadlocks the HTTP API and in general produces a load of errors.
But since this - in principle - works, I wonder if I could do the following:
class data_storage(...):

    def add(self, results):
        ...

    def background(f):
        def wrapped(*args, **kwargs):
            return asyncio.get_event_loop().run_in_executor(None, f, *args, **kwargs)
        return wrapped

    @background
    def post(self, items):
        if len(items) > 0:
            self.nr_workers.increase()
            res = self.connector.run(items)
            self.nr_workers.decrease()

    def write_queue(self):
        if self.nr_workers < 10:
            items = self.queue.get(200)  # extract and delete from queue, non-concurrent
            self.post(items)  # add a "worker"
for some hypothetical queue and nr_workers objects. Then, at the end of the main loop, have a function that blocks progress until the number of workers is zero and then clears the rest of the queue non-concurrently.
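For concreteness, those hypothetical objects could be built from the standard library's thread-safe primitives. A sketch, assuming self.connector.run can safely be called from several threads at once (connector and the batch size are placeholders):

import queue
from concurrent.futures import ThreadPoolExecutor

class data_storage:

    def __init__(self, connector, max_workers=10):
        self.connector = connector
        self.queue = queue.Queue()
        # the pool caps the number of concurrent "workers" at max_workers
        self.pool = ThreadPoolExecutor(max_workers=max_workers)
        self.pending = []

    def add(self, item):
        self.queue.put(item)

    def write_queue(self, batch_size=200):
        # only the main loop calls this, so the empty()/get_nowait() pair won't race
        batch = []
        while len(batch) < batch_size and not self.queue.empty():
            batch.append(self.queue.get_nowait())
        if batch:
            self.pending.append(self.pool.submit(self.connector.run, batch))

    def drain(self):
        # call at the end of the main loop: blocks until every post has finished
        for future in self.pending:
            future.result()
        self.pending = []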
This seems like a monumentally bad idea, but I don't know how else to implement it. If this is terrible, I'd like to know before I put more work into it. Do you think it would work?
Otherwise, could you give me any pointers as how to approach this situation correctly?
Some key words, tools or things to research would of course be enough.
Thank you!
I'm not really concerned with which method is used, whether it be urllib3, requests, or even threading. I am using a software library which has the notion of "tasks". I am hooking its beginTask and taskComplete functions as follows:
class addon:

    def __init__(self):
        self.passed_check = {}

    def beginTask(self, task):
        r = requests.get("https://website.com/?name=" + task.attribute)
        result = json.loads(r.text)
        if result['attr'] == 1:
            self.passed_check[task.id] = 1

    def taskComplete(self, task):
        if self.passed_check.get(task.id):
            pass  # do SOMETHING
        else:
            pass  # do something else
The flow of the code is:
1) The library handles the initial state of the task.
2) The beginTask function runs (I overrode it with my addon; normally it's blank).
3) The IO-intensive task begins, handled by the native library.
4) The taskComplete function runs (I overrode it with my addon; normally it's blank).
The issue I have with the above code is that my call to requests is blocking IO, which therefore blocks the rest of the task. The following is not valid code, but it's roughly what I want to do:
class addon:

    def __init__(self):
        self.passed_check = {}
        self.requests_mapper = {}

    def beginTask(self, task):
        # async=True is imaginary here: kick off the request without blocking
        self.requests_mapper[task.id] = requests.get(
            "https://website.com/?name=" + task.attribute, async=True)

    def taskComplete(self, task):
        # busy-wait until the request has come back
        while self.requests_mapper[task.id].request_complete == False:
            pass
        result = json.loads(self.requests_mapper[task.id].text)
        if result['attr'] == 1:
            self.passed_check[task.id] = 1
        if self.passed_check.get(task.id):
            pass  # do SOMETHING
        else:
            pass  # do something else
Essentially, I want to kick off an async call to make the HTTP request and let the library and the HTTP request run in parallel, but before the task completes, ensure the HTTP request has finished so I can do my thing (hence the while loop: in theory the task can complete before the HTTP response comes back). I also need to make sure the entire response has been received so I can properly evaluate it before ending the task. In other words, I need to block in the taskComplete function.
Since multiple tasks can run in tandem, I am using requests_mapper, a dict keyed by task id, to keep track of which request belongs to which task.
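One standard-library shape that seems to match what I'm after - a sketch using concurrent.futures (the URL and attribute names are kept from above) - would be to store a Future per task and block on it in taskComplete:

import json
import requests
from concurrent.futures import ThreadPoolExecutor

class addon:

    def __init__(self):
        self.passed_check = {}
        self.requests_mapper = {}
        self.pool = ThreadPoolExecutor(max_workers=10)

    def beginTask(self, task):
        # kick off the HTTP request without blocking the task
        self.requests_mapper[task.id] = self.pool.submit(
            requests.get, "https://website.com/?name=" + task.attribute)

    def taskComplete(self, task):
        # .result() blocks only if the request hasn't finished yet
        r = self.requests_mapper.pop(task.id).result()
        result = json.loads(r.text)
        if result['attr'] == 1:
            self.passed_check[task.id] = 1
        if self.passed_check.get(task.id):
            pass  # do SOMETHING
        else:
            pass  # do something else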
Here's a diagram (I can't link yet):
https://i.stack.imgur.com/Hy4cO.png
Any suggestions?
I am working on a chatbot where, before I reply to the user, I make a DB call to save the chat in a table. This is done each time the user types something, and it increases the response time.
So to decrease the response time, we need to call this asynchronously.
How to do this in Python 3?
I have read tutorials on the asyncio library, but did not understand it completely and could not figure out how to make it work here.
Another workaround is to use a queueing system, but that sounds like overkill.
Example:
request = get_request_from_chat()
res = call_some_function_to_prepare_response()
save_data()  # this should be called asynchronously
reply()  # this should not wait for save_data() to finish
Any suggestions are welcome.
Use loop.create_task(some_async_function()) to run an async function "in the background". For example, this answer shows how to do that in case of a trivial client-server communication.
In your case the pseudo-code would look like this:
request = await get_request_from_chat()
res = call_some_function_to_prepare_response()
loop = asyncio.get_event_loop()
loop.create_task(save_data()) # runs in the "background"
reply() # doesn't wait for save_data() to finish
For this to work, of course, the program must be written for asyncio and save_data must be a coroutine. For a chat server that's a good approach to follow anyway, so I would recommend giving asyncio a chance.
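A self-contained sketch of the same idea (save_data's body and the delays are stand-ins):

import asyncio

async def save_data():
    await asyncio.sleep(1)  # stands in for the DB write
    print("saved")

async def handle_message():
    res = "prepared response"  # stands in for preparing the reply
    asyncio.get_event_loop().create_task(save_data())  # runs in the background
    print("replying:", res)  # executes immediately, before save_data finishes

loop = asyncio.get_event_loop()
loop.run_until_complete(handle_message())
loop.run_until_complete(asyncio.sleep(1.1))  # let the background task finish before closing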
Because you mentioned "Another workaround is to use a queueing system, but that sounds like overkill", I assume you are open to other solutions, so I will propose a multi-threading approach:
from concurrent.futures import ThreadPoolExecutor
from time import sleep

def long_running_function(param1):
    print(param1)
    sleep(10)
    return "Complete"

with ThreadPoolExecutor(max_workers=10) as executor:
    future = executor.submit(long_running_function, "Param1")
    print(future.result(timeout=12))
Steps:
1) You create a ThreadPoolExecutor and define the maximum number of concurrent tasks.
2) You submit a function with the arguments it needs.
3) You call result() on the return value from submit() when you need the results.
Note that result() can raise an exception if one was raised in the submitted function.
You can also check whether the result of your call is ready with future.done(), which returns True or False.
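For example, to poll with done() instead of blocking on result(), continuing the example above:

from concurrent.futures import ThreadPoolExecutor
from time import sleep

with ThreadPoolExecutor(max_workers=10) as executor:
    future = executor.submit(long_running_function, "Param1")
    while not future.done():  # non-blocking readiness check
        sleep(0.5)  # stand-in for doing other useful work meanwhile
    print(future.result())  # returns immediately once done() is True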
I have a "gateway" app written in Tornado, using @tornado.gen.coroutine to transfer information from one handler to another. I'm trying to do some debugging/status testing. What I'd like to be able to do is enumerate all of the currently blocked/waiting coroutines that are live at a given moment. Is this information accessible somewhere in Tornado?
Maybe you mean the IOLoop's _handlers dict? Try adding this in a periodic callback:
def print_current_handlers():
    io_loop = ioloop.IOLoop.current()
    print(io_loop._handlers)
Update: I've checked the source code and now think there is no simple way to trace currently running gen.coroutines; A. Jesse Jiryu Davis is right!
But you can trace all "async" calls (yields) from coroutines: each yield from a generator goes through IOLoop.add_callback (http://www.tornadoweb.org/en/stable/ioloop.html#callbacks-and-timeouts).
So, by examining io_loop._callbacks you can see which yields are in the IOLoop right now.
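For example (bearing in mind that _callbacks, like _handlers, is a private attribute and may change between Tornado versions):

from tornado import ioloop

def print_current_callbacks():
    io_loop = ioloop.IOLoop.current()
    print(io_loop._callbacks)

# sample the pending callbacks once a second
ioloop.PeriodicCallback(print_current_callbacks, 1000).start()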
There's a lot of interesting stuff in here :) https://github.com/tornadoweb/tornado/blob/master/tornado/gen.py
No there isn't, but you could perhaps create your own decorator that wraps gen.coroutine, then updates a data structure when the coroutine begins.
import weakref
import functools

from tornado import gen
from tornado.ioloop import IOLoop

all_coroutines = weakref.WeakKeyDictionary()

def tracked_coroutine(fn):
    coro = gen.coroutine(fn)

    @functools.wraps(coro)
    def start(*args, **kwargs):
        future = coro(*args, **kwargs)
        all_coroutines[future] = str(fn)
        return future

    return start

@tracked_coroutine
def five_second_coroutine():
    yield gen.sleep(5)

@tracked_coroutine
def ten_second_coroutine():
    yield gen.sleep(10)

@gen.coroutine
def tracker():
    while True:
        running = list(all_coroutines.values())
        print(running)
        yield gen.sleep(1)

loop = IOLoop.current()
loop.spawn_callback(tracker)
loop.spawn_callback(five_second_coroutine)
loop.spawn_callback(ten_second_coroutine)
loop.start()
If you run this script for a few seconds you'll see two active coroutines, then one, then none.
Note the warning in the docs about the dictionary changing size, you should catch "RuntimeError" in "tracker" to deal with that problem.
This is a bit complex; you might get all you need much more simply by turning on Tornado's logging and using set_blocking_log_threshold.
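For instance (the 0.2-second threshold is just an example value):

from tornado.ioloop import IOLoop

# log a stack trace whenever a callback blocks the loop for more than 0.2s
IOLoop.current().set_blocking_log_threshold(0.2)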
I'm slowly but surely getting the hang of Twisted, but I'm not sure how I should be approaching this particular project.
I'm trying to create a class for batch processing of web pages. There are multiple web pages that I would like to process independently, so it makes sense to have a pipeline of sorts for each url. Additionally, I would like to call a one-time preprocessing function before any of the urls are processed, and when all urls have been processed, I would like to call a post-processing function. Importantly, I'd like to be able to subclass this processing class and override certain methods based on the contents I'm trying to process -- not all web-pages will require the same processing steps.
If this were synchronous code, I would probably do this with a context manager. Consider the following example code:
class Pipeline(object):

    def __init__(self, urls):
        self.urls = urls  # iterable
        self.proceed = False  # "continue" is a reserved word, so another name is needed

    def __enter__(self):
        self.proceed = self.preprocess()
        return self

    def __exit__(self, type, value, traceback):
        if self.proceed:  # if we decided to run the batch pipeline...
            self.postprocess()

    def preprocess(self):
        # does some stuff and returns a bool
        ...

    def postprocess(self):
        # do some stuff
        ...

    def pipeline(self):
        for url in self.urls:
            try:
                ...  # download url, do some stuff
            except Exception:
                ...  # recover so that other urls are not interrupted
After which, I would use it as follows:
with Pipeline(list_of_urls) as p:
    p.pipeline()
This works well with synchronous network operations, but not with Twisted, since the pipeline function returns before the processing has actually finished, which triggers __exit__ too early.
Additionally, I would like each URL's processing to take place completely separately because there may be conditional branching based on the result of my queries. For this reason, using Twisted's DeferredList is not desirable.
In a nutshell, I need the following:
Preprocess must run before all else
Postprocess must run when the following is true:
At least one url began processing (preprocess returns True)
All urls have either completed or thrown an exception
What's the sanest way to set something like this up with Twisted? The issue I'm having is that some of the code involves asynchronous IO and some is just straight synchronous logic (i.e. processing the results in memory), so I'm not sure how to make the whole thing work with deferreds (or whether I even should).
Any advice?
In light of the comments for the original question, I would suggest using DeferredList in combination with maybeDeferred. I propose the latter because of this:
The issue I'm having is that some of the code involves asynchronous IO and some is just straight synchronous logic (i.e. processing the results in memory), so I'm not sure how to make the whole thing work with deferreds (or whether I even should).
Using a maybeDeferred allows you to treat all of your function calls as though they were asynchronous, whether they actually are or not.
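A sketch of that combination (process_url, preprocess and postprocess stand in for the pipeline's own methods):

from twisted.internet import defer

def run_pipeline(urls):
    # preprocess may be synchronous or asynchronous; maybeDeferred doesn't care
    d = defer.maybeDeferred(preprocess)

    def start(should_run):
        if not should_run:
            return
        # each per-url chain runs independently; consumeErrors keeps one
        # failure from interrupting the others
        work = [defer.maybeDeferred(process_url, url) for url in urls]
        dl = defer.DeferredList(work, consumeErrors=True)
        dl.addCallback(lambda results: postprocess())
        return dl

    d.addCallback(start)
    return d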
The following structure helps me in a similar situation:
class Processor:

    def __init__(self):
        self.runned = 0
        self.hasSuccess = False
        self.preprocess()

    def launch(self, urls):
        for url in urls:
            self.runned += 1  # count up front, in case a deferred fires synchronously
            dfrd = url.process()
            dfrd.addCallback(self.succ)
            dfrd.addErrback(self.fail)

    def fail(self, reason):
        self.runned -= 1
        if self.runned == 0 and self.hasSuccess:
            self.postprocess()

    def succ(self, arg):
        self.runned -= 1
        self.hasSuccess = True
        if not self.runned:
            self.postprocess()
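Hypothetical usage, assuming each url object's process() method returns a Deferred:

from twisted.internet import reactor

processor = Processor()
processor.launch(list_of_url_objects)  # placeholder: objects whose process() returns a Deferred
reactor.run()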