How should I structure a batch processing pipeline using Twisted?

I'm slowly but surely getting the hang of Twisted, but I'm not sure how I should be approaching this particular project.
I'm trying to create a class for batch processing of web pages. There are multiple web pages that I would like to process independently, so it makes sense to have a pipeline of sorts for each url. Additionally, I would like to call a one-time preprocessing function before any of the urls are processed, and when all urls have been processed, I would like to call a post-processing function. Importantly, I'd like to be able to subclass this processing class and override certain methods based on the contents I'm trying to process -- not all web-pages will require the same processing steps.
If this were synchronous code, I would probably do this with a context manager. Consider the following example code:
class Pipeline(object):
    def __init__(self, urls):
        self.urls = urls  # iterable
        self.proceed = False  # ("continue" is a reserved word, so a different name is used here)

    def __enter__(self):
        self.proceed = self.preprocess()
        return self

    def __exit__(self, type, value, traceback):
        if self.proceed:  # if we decided to run the batch pipeline...
            self.postprocess()

    def preprocess(self):
        # does some stuff and returns a bool
        return True

    def postprocess(self):
        # do some stuff
        pass

    def pipeline(self):
        for url in self.urls:
            try:
                pass  # download url, do some stuff
            except Exception:
                pass  # recover so that other urls are not interrupted
After which, I would use it as follows:
with Pipeline(list_of_urls) as p:
    p.pipeline()
This works well with synchronous network operations, but not with Twisted, since the pipeline function returns before the processing has actually finished, which means __exit__ is called too early.
Additionally, I would like each URL's processing to take place completely separately because there may be conditional branching based on the result of my queries. For this reason, using Twisted's DeferredList is not desirable.
In a nutshell, I need the following:
Preprocess must run before all else
Postprocess must run when the following is true:
At least one url began processing (preprocess returns True)
All urls have either completed or thrown an exception
What's the sanest way to set something like this up with Twisted? The issue I'm having is that some of the code involves asynchronous IO and some is just straight synchronous logic (i.e. processing the results in memory), so I'm not sure how to make the whole thing work with deferreds (or whether I even should).
Any advice?

In light of the comments for the original question, I would suggest using DeferredList in combination with maybeDeferred. I propose the latter because of this:
The issue I'm having is that some of the code involves asynchronous IO and some is just straight synchronous logic (i.e. processing the results in memory), so I'm not sure how to make the whole thing work with deferreds (or whether I even should).
Using a maybeDeferred allows you to treat all of your function calls as though they were asynchronous, whether they actually are or not.
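For concreteness, a minimal sketch of that combination might look like the following; the preprocess/process_url/postprocess bodies here are hypothetical stand-ins for the question's real steps, not working code from the question:

from twisted.internet import defer

class Pipeline(object):
    def __init__(self, urls):
        self.urls = urls

    def run(self):
        # maybeDeferred wraps the call so it works whether preprocess is
        # synchronous or returns a Deferred.
        d = defer.maybeDeferred(self.preprocess)
        d.addCallback(self._start)
        return d

    def _start(self, should_continue):
        if not should_continue:
            return
        deferreds = []
        for url in self.urls:
            d = defer.maybeDeferred(self.process_url, url)
            # A per-URL errback keeps one failure from interrupting the others.
            d.addErrback(self._log_failure, url)
            deferreds.append(d)
        # DeferredList fires once every per-URL chain has completed or failed.
        dl = defer.DeferredList(deferreds)
        dl.addCallback(lambda results: self.postprocess())
        return dl

    def _log_failure(self, failure, url):
        print("failed: %s %s" % (url, failure))

    def preprocess(self):
        # does some stuff and returns a bool (or a Deferred firing one)
        return True

    def process_url(self, url):
        # download url, do some stuff; may return a Deferred
        pass

    def postprocess(self):
        # do some stuff
        pass

Because each per-URL chain consumes its own failure in the errback, the DeferredList here only serves to know when everything has settled before postprocess runs.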

The following structure helps me in a similar situation:
class Processor:
    def __init__(self):
        self.runned = 0          # number of URLs still in flight
        self.hasSuccess = False  # set once at least one URL completes successfully
        self.preprocess()

    def launch(self, urls):
        for url in urls:
            dfrd = url.process()
            dfrd.addCallback(self.succ)
            dfrd.addErrback(self.fail)
            self.runned += 1

    def fail(self, reason):
        self.runned -= 1
        if self.runned == 0 and self.hasSuccess:
            self.postprocess()

    def succ(self, arg):
        self.runned -= 1
        self.hasSuccess = True
        if not self.runned:
            self.postprocess()


Use objects in parallel processing and "restart" these

The goal
I want to retrieve information from webpages in parallel. As soon as one of the "crawlers" finds the result we are looking for, we terminate; if not, we refresh the page we just looked at and search again. To put this differently:
Open the webpage in 3 processes (same page, delayed X sec)
Return the result as soon as we have it (per process, not all at once)
If this result == True, we are done and terminate the pool
If not, we call .restart() and add it to the pool again
Repeat
The code side
Scrape class
Let's first define the Scrape object:
import random
import time
import multiprocessing

# Result simulation array, False is much more likely than True
RES = [True, False, False, False, False, False, False, False, False, False, False, False, False]

class Scrape:
    def __init__(self, url):
        self.url = url
        self.result = None

    def get_result(self):
        return self.result

    def scrape(self):
        # Go to url
        # simulate work
        time.sleep(random.randrange(5))
        # simulate result
        result = RES[random.randrange(13)]
        # Return what we found on the page
        self.result = result

    def restart(self):
        # >> Some page refreshing
        self.scrape()
So we go to the webpage and do some work (scrape), then we can access the result via get_result, and if it is not what we want we can call restart. Note that in reality this class is much more complex, so recreating it from scratch would waste the driver start-up time (compared to reusing the same object via restart).
Parallel code
Here is where I got stuck: while I have used map hundreds of times, I don't know how to keep the Scrape objects, call restart on them, and add them to the pool again. I was thinking of something like the code below, but it does not work as I want it to. Perhaps a queue is a better approach for this, but I'm not familiar with it.
# Function to create the scrapers
def obj_create(url):
    print('Create')
    a = Scrape(url)
    a.scrape()
    return a

# Function to restart the scraper
def obj_restart(a):
    print('Restart')
    a.restart()
    a.scrape()
    return a

# Callback
def call_back(scrape_obj):
    if scrape_obj.get_result():
        pool.terminate()
        # Also somehow return the result...
    else:
        # Restart and add again
        pool.apply_async(obj_restart, (scrape_obj,), callback=call_back)

pool = multiprocessing.Pool(3)
url = 'test.com'
delay = 0.001
n = 3
for i in range(n):
    time.sleep(delay)
    pool.apply_async(obj_create, (url,), callback=call_back)
pool.close()
pool.join()
I tried my best to make this a reproducible example and to explain it as well as I could, but please let me know if anything is unclear!
in parallel
It seems you haven't heard about the GIL yet? If so, I'm sorry to disappoint you, but threads won't give you true parallelism because of it.
More than that: in the case you've described, you don't even need it, because most of the time your workers would just be blocked waiting for the network to reply. What you need is a good async framework, a.k.a. an event loop, that optimizes execution so that blocking is reduced to a minimum.
Let me introduce aiohttp. It gives you properly implemented HTTP(S) calls, and the glorious asyncio provides a neat mechanism to cancel, retry, and handle errors.
In particular, have a look at asyncio.gather.
P.S. multiprocessing will make things parallel, yet it is overkill here due to the IO-bound nature of the task. You can stay within a single process and avoid all the multiprocessing headaches simply by rejecting that kind of parallelism, which would be blocked most of the time anyway.
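For illustration, a minimal aiohttp sketch of that approach might look like the following. The URL, the success check, and the retry delay are placeholders, and asyncio.wait(..., return_when=FIRST_COMPLETED) is used rather than a bare gather so that the first successful scraper wins and the others are cancelled:

import asyncio
import aiohttp

async def scrape_until_found(session, url, start_delay):
    await asyncio.sleep(start_delay)           # staggered start, as in the question
    while True:
        async with session.get(url) as resp:
            text = await resp.text()
        if "expected result" in text:          # placeholder success condition
            return text
        await asyncio.sleep(1)                 # "refresh" the page and try again

async def main():
    url = "http://test.com"                    # placeholder URL
    async with aiohttp.ClientSession() as session:
        tasks = [asyncio.create_task(scrape_until_found(session, url, i * 0.001))
                 for i in range(3)]
        # Return as soon as the first scraper succeeds and cancel the rest.
        done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
        for task in pending:
            task.cancel()
        return next(iter(done)).result()

if __name__ == "__main__":
    print(asyncio.run(main()))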

Python asyncio: Queue concurrent to normal code

Edit: I am closing this question.
As it turns out, my goal of having parallel HTTP posts is pointless. After implementing it successfully with aiohttp, I ran into deadlocks elsewhere in the pipeline.
I will reformulate this and post a single question in a few days.
Problem
I want to have a class that, during some other computation, holds generated data and can write it to a DB via HTTP (details below) when convenient. It's gotta be a class as it is also used to load/represent/manipulate data.
I have written a naive, nonconcurrent implementation that works:
The class is initialized and then used in a "main loop". In this main loop, data is added to a naive "queue" (a list of HTTP requests). At certain intervals in the main loop, the class calls a function to write those requests and clear the queue.
As you can expect, this is IO bound. Whenever I need to write the "queue", the main loop halts. Furthermore, since the main computation runs on a GPU, the loop is also not really CPU bound.
Essentially, I want to have a queue, and, say, ten workers running in the background and pushing items to the http connector, waiting for the push to finish and then taking on the next (or just waiting for the next write call, not a big deal). In the meantime, my main loop runs and adds to the queue.
Program example
My naive program looks something like this
class data_storage(...):

    def add(...):

    def write_queue(self):
        if len(self.queue) > 0:
            res = self.connector.run(self.queue)
            self.queue = []

def main_loop(storage):
    # do many things
    for batch in dataset:  # simplified example
        # Do stuff
        for some_other_loop:
            (...)
            storage.add(results)
        # For example, call each iteration
        storage.write_queue()

if __name__ == "__main__":
    storage = data_storage()
    main_loop(storage)
    ...
In detail: the connector class is from the 'neo4j-connector' package and posts to my Neo4j database. It essentially does JSON formatting and uses Python's "requests" API.
This works, even without a real queue, since nothing is concurrent.
Now I have to make it work concurrently.
From my research, I have seen that ideally I would want a "producer-consumer" pattern, where both are initialized via asyncio. I have only seen this implemented via functions, not classes, so I don't know how to approach this. With functions, my main loop should be a producer coroutine and my write function becomes the consumer. Both are initiated as tasks on the queue and then gathered, where I'd initialize only one producer but many consumers.
My issue is that the main loop includes parts that are already parallel (e.g. PyTorch). Asyncio is not thread-safe, so I don't think I can just wrap everything in async decorators and make the whole thing a coroutine. This is also precisely why I want the DB logic in a separate class.
I also don't actually want or need the main loop to run "concurrently" on the same thread with the workers. But it's fine if that's the outcome as the workers don't do much on the CPU. But technically speaking, I want multi-threading? I have no idea.
My only other option would be to write into the queue until it is "full", halt the loop and then use multiple threads to dump it to the DB. Still, this would be much slower than doing it while the main loop is running. My gain would be minimal, just concurrency while working through the queue. I'd settle for it if need be.
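To make the producer/consumer idea concrete, a minimal thread-based sketch of what I have in mind might look like this, assuming connector.run stays a plain blocking call (the class name and worker count are placeholders):

import queue
import threading

class data_storage_threaded:
    def __init__(self, connector, n_workers=10):
        self.connector = connector
        self.queue = queue.Queue()
        for _ in range(n_workers):
            threading.Thread(target=self._worker, daemon=True).start()

    def add(self, item):
        # Producer side: called from the main loop, returns immediately.
        self.queue.put(item)

    def _worker(self):
        # Consumer side: each worker blocks on the queue and posts one item at a time.
        while True:
            item = self.queue.get()
            try:
                self.connector.run([item])
            finally:
                self.queue.task_done()

    def flush(self):
        # Block until every queued item has been posted (e.g. at the end of the main loop).
        self.queue.join()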
However, from a stackoverflow post, I came up with this small change
class data_storage(...):

    def add(...):

    def background(f):
        def wrapped(*args, **kwargs):
            return asyncio.get_event_loop().run_in_executor(None, f, *args, **kwargs)
        return wrapped

    @background
    def write_queue(self):
        if len(self.queue) > 0:
            res = self.connector.run(self.queue)
            self.queue = []
Shockingly this sort of "works" and is blazingly fast. Of course since it's not a real queue, things get overwritten. Furthermore, this overwhelms or deadlocks the HTTP API and in general produces a load of errors.
But since this works in principle, I wonder if I could do the following:
class data_storage(...):

    def add(...):

    def background(f):
        def wrapped(*args, **kwargs):
            return asyncio.get_event_loop().run_in_executor(None, f, *args, **kwargs)
        return wrapped

    @background
    def post(self, items):
        if len(items) > 0:
            self.nr_workers.increase()
            res = self.connector.run(items)
            self.nr_workers.decrease()

    def write_queue(self):
        if self.nr_workers < 10:
            items = self.queue.get(200)  # Extract and delete from queue, non-concurrent
            self.post(items)  # Add "Worker"
for some hypothetical queue and nr_workers objects. Then, at the end of the main loop, a function would block progress until the number of workers is zero and clear the rest of the queue non-concurrently.
This seems like a monumentally bad idea, but I don't know how else to implement it. If this is terrible, I'd like to know before I start doing more work on it. Do you think it would work?
Otherwise, could you give me any pointers on how to approach this situation correctly?
Some key words, tools or things to research would of course be enough.
Thank you!

Python - Async HTTP request, Blocking at the end for complete response HTTP (urllib3, requests, threading)

I'm not really concerned with the method used, whether it be urllib3, requests, or even threading. I am using a software library which has the notion of "tasks". I am hooking the beginTask and taskComplete functions as follows:
class addon:
    def __init__(self):
        self.passed_check = {}

    def beginTask(self, task):
        r = requests.get("https://website.com/?name=" + task.attribute)
        result = json.loads(r.text)
        if result['attr'] == 1:
            self.passed_check[task.id] = 1

    def taskComplete(self, task):
        if self.passed_check[task.id]:
            pass  # do SOMETHING
        else:
            pass  # do Something else
The flow of the code is:
Library handles initial state of task. - Handled by the library.
beginTask function runs. - I overrode it with my addon, normally it's blank.
IO intensive task begins. - Handled by the native library.
taskComplete function runs. - I overrode it with my addon, normally it's blank.
The issue I have with the above code is that my call to requests is blocking IO, which therefore blocks the rest of the task. The following is not correct, but I really want to do something like this:
class addon:
    def __init__(self):
        self.passed_check = {}
        self.requests_mapper = {}

    def beginTask(self, task):
        self.requests_mapper[task.id] = requests.get(
            "https://website.com/?name=" + task.attribute, async=True)

    def taskComplete(self, task):
        while self.requests_mapper[task.id].request_complete == False:
            pass
        result = json.loads(self.requests_mapper[task.id].text)
        if result['attr'] == 1:
            self.passed_check[task.id] = 1
        if self.passed_check[task.id]:
            pass  # do SOMETHING
        else:
            pass  # do Something else
Essentially, I want to kick off an asynchronous HTTP request, let the library task and the HTTP request run in parallel, but make sure the HTTP request completes before the task does so that I can do my thing (which is the reason for the while loop; in theory the task can complete before the HTTP request comes back). I also need to make sure the entire response has been received so I can properly evaluate it prior to ending the task. In other words, I need to block in the taskComplete function.
Since multiple tasks can run in tandem, I am using requests_mapper, a dict keyed by task id, to keep track of which task each request belongs to.
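One way I imagine this could look, sketched with concurrent.futures instead of the made-up async=True flag (the thread pool size is arbitrary): beginTask submits the request to a thread pool and stores the Future, and taskComplete blocks on Future.result() until the full response is in.

import json
from concurrent.futures import ThreadPoolExecutor

import requests

class addon:
    def __init__(self):
        self.passed_check = {}
        self.futures = {}
        self.executor = ThreadPoolExecutor(max_workers=8)  # arbitrary pool size

    def beginTask(self, task):
        # Fire off the HTTP request without blocking the rest of the task.
        self.futures[task.id] = self.executor.submit(
            requests.get, "https://website.com/?name=" + task.attribute)

    def taskComplete(self, task):
        # .result() blocks until the complete response has been received.
        r = self.futures.pop(task.id).result()
        result = json.loads(r.text)
        if result['attr'] == 1:
            self.passed_check[task.id] = 1
        if self.passed_check.get(task.id):
            pass  # do SOMETHING
        else:
            pass  # do Something else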
Here's a diagram (I can't link yet):
https://i.stack.imgur.com/Hy4cO.png
Any suggestions?

Parallelise web tasks with asyncio in Python

I'm trying to wrap my head around asyncio and aiohttp and for the first time in years programming makes me feel utterly stupid and incapable. Which is kind of beautiful, in a weirdo Zen way. But alas, there's work to get done.
I've got an existing class that can do numerous wondrous things on the web, like signing up to a web site, getting data, the works. And now I need like, 100 or 1000 of these little worker bees to sign up. Code looks roughly like this:
class Worker(object):

    def signup(self, ...):
        ...
        data = self.make_request(url, data)
        self.user_id = data.get("user_id")
        return self

    def make_request(self, url, data):
        response = requests.post(url, data=data)
        return response.json()

workers = [Worker().signup() for n in range(100)]
As you can see, we're using the requests module to make a POST request. However this is blocking, so we'll have to wait for worker N to finish signing up before we start signing up worker N+1. Fortunately, the original author of the Worker class (that sounds charmingly Marxist) in her infinite wisdom wrapped every HTTP call in the self.make_request method, so making the whole Worker non blocking should just be a matter of swapping out the requests library for a non-blocking one aaaaand bob's your uncle, right? This is how far I got:
class AsyncWorker(Worker):

    @asyncio.coroutine
    def make_request(self, url, data):
        response = yield from aiohttp.request('post', url, data=data)
        return (yield from response.json())

coroutines = [AsyncWorker().signup() for n in range(100)]
loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.wait(coroutines))
loop.close()
But this will raise an AttributeError: 'generator' object has no attribute 'get' in the signup method where I do self.user_id = data.get("user_id"). And beyond that, I still don't have the workers in a neat dictionary. I'm aware that I'm most likely completely misunderstanding how asyncio works - but I already spent a day reading through various docs, mind-shattering tutorials by David Beazley, and masses of toy examples that are simple enough for me to understand yet too simple to apply to this situation. How should I structure my worker and my async loop to sign up 100 workers in parallel and eventually get a list of all workers after they have signed up?
Once you use yield (or yield from) in a function, that function becomes a coroutine. That means you can't get a result by just calling it: you will get a generator object. You must at least do this:
@asyncio.coroutine
def some_coroutine(*args):
    #...
    #...
    yield from tasty.asyncio.function()
    return result

def coroutine_user():
    # data = some_coroutine() will give you a generator object instead of result
    data = yield from some_coroutine()
    return data  # data here is a plain result: you can call your .get or whatever
Guess what happens when you call coroutine_user():
>>> coroutine_user()
<generator object coroutine_user at 0x7fe13b8a47e0>
Leaving off the asyncio.coroutine decorator doesn't help at all: coroutines are contagious! To get a result in a function, you must use yield from, and that turns your function into another coroutine!
Though things aren't always that bad (usually you can manually iterate a generator object without relying on yield from), asyncio will specifically stop you from doing it: it breaks some internals (you can do it only from a Future or an asyncio.coroutine). So just use concurrent.futures or something similar unless you're going to turn all of your code into coroutines. As an alternative, isolate all users of aiohttp.request from your usual methods and work with both coroutine-based async workers and plain old synchronous code. Diving into asyncio and actually refactoring all your code is an option too, obviously: you basically need to put yield from before every call to any asyncio-infected method.
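For illustration, a minimal sketch of that last route applied to the question's Worker example might look like this (using the same old-style @asyncio.coroutine / aiohttp.request syntax as the question; the signup parameters and URL are placeholders):

import asyncio
import aiohttp

class AsyncWorker:
    @asyncio.coroutine
    def signup(self, url, form_data):  # placeholder parameters
        # yield from is what actually runs the coroutine and gives a plain dict back
        data = yield from self.make_request(url, form_data)
        self.user_id = data.get("user_id")
        return self

    @asyncio.coroutine
    def make_request(self, url, data):
        response = yield from aiohttp.request('post', url, data=data)
        return (yield from response.json())

loop = asyncio.get_event_loop()
workers = loop.run_until_complete(asyncio.gather(
    *(AsyncWorker().signup("http://example.com/signup", {"n": n}) for n in range(100))))
loop.close()

asyncio.gather returns the results in order, so workers ends up as the list of signed-up workers the question asks for.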

Integrating with deluged api - twisted deferred

I have a very simple script that monitors a file transfer's progress: it compares the actual size with the target size, then calculates the file's hash, compares it with the desired hash, and fires up a few extra things when everything seems alright.
I've replaced the tool used for the file transfers (wget) with deluged, which has a neat api to integrate with.
Instead of comparing the file progress and the hashes, I now only need to know when deluged has finished downloading the files. To achieve that, I was able to modify this script to my needs, but I'm stuck trying to wrap my head around the Twisted framework that deluged makes use of.
To try to get over it, I grabbed a sample script from the Twisted Deferred documentation, wrapped a class around it, and attempted to use the same concept I'm using in the script I mentioned.
Now, I don't know exactly what to do with the reactor object, since it's basically a blocking loop that can't be restarted.
This is my sample code I'm working with:
from twisted.internet import reactor, defer
import time

class DummyDataGetter:
    done = False
    result = 0

    def getDummyData(self, x):
        d = defer.Deferred()
        # simulate a delayed result by asking the reactor to fire the
        # Deferred in 2 seconds time with the result x * 3
        reactor.callLater(2, d.callback, x * 3)
        return d

    def assignResult(self, d):
        """
        Data handling function to be added as a callback: handles the
        data by printing the result
        """
        self.result = d
        self.done = True
        reactor.stop()

    def run(self):
        d = self.getDummyData(3)
        d.addCallback(self.assignResult)
        reactor.run()

getter = DummyDataGetter()
getter.run()

while not getter.done:
    time.sleep(0.5)

print getter.result

# then somewhere else I want to get dummy data again
getter = DummyDataGetter()
getter.run()  # this throws an exception of type error.ReactorNotRestartable

while not getter.done:
    time.sleep(0.5)

print getter.result
My questions are:
Should the reactor be started in another thread to prevent it from blocking the code?
If so, how would I add more callbacks to this reactor living in a separate thread? Simply by doing something similar to reactor.callLater(2, d.callback, x * 3) from my main thread?
If not, what is the technique for overcoming this problem of not being able to start/stop the reactor more than once in the same process?
OK, the easiest approach I found is to simply have a separate script, called using subprocess.Popen, dump the statuses of the torrents and anything else needed to stdout (serialized as JSON), and pipe that into the calling script.
Way less traumatic than learning Twisted, but of course far from optimal.
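A rough sketch of that wiring, where the helper script name and the JSON fields are made up:

import json
import subprocess

# The child script runs the reactor / deluged client, prints the torrent statuses
# as a single JSON document on stdout and then exits; we only consume its output.
proc = subprocess.Popen(["python", "dump_torrent_status.py"], stdout=subprocess.PIPE)
out, _ = proc.communicate()

for torrent in json.loads(out):
    if torrent.get("is_finished"):
        print("finished: %s" % torrent.get("name"))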
