The goal
I want to retrieve information from webpages (in parallel). As soon as one of the "crawlers" finds the result we are looking for we terminate, if not we refresh the page we just looked at and search again. To put this differently:
Open the webpages in 3 processes (same page, delayed X sec)
Return the result as soon as we have it (per process not all at once)
If this result ==True we are done and terminate the pool
if not, we call .restart() and add it to the pool again
repeat
The code side
Scrape class
Lets first define the Scraper object:
import random
import time
import multiprocessing
# Result simulation array, False is much more likely than True
RES = [True, False, False, False, False, False, False, False, False, False, False, False, False]
class Scrape:
def __init__(self, url):
self.url = url
self.result = None
def get_result(self):
return self.result
def scrape(self):
# Go to url
# simulate work
time.sleep(random.randrange(5))
# simulate result
result = RES[random.randrange(13)]
# Return what we found on the page
self.result = result
def restart(self):
# >> Some page refreshing
self.scrape()
So we go to the webpage and do some work (scrape) then we can access the results via get_result and if this is not what we want we can call restart. Note that in reality this class is much more complex so creating it all over again would be a waste of starting up the driver (compared to reusing the same class via restart)
Parallel code
Here is where I got stuck, while I used map hundreds of times I don't know how I can keep the Scrape objects and call restart and add them to the pool again. I was thinking of something like this, but this does not work as I want it to. Perhaps a queue is a better approach for this but I'm not familiar with it.
# Function to create the scrapers
def obj_create(url):
print('Create')
a = Scrape(url)
a.scrape()
return a
# Function to restart the scraper
def obj_restart(a):
print('Restart')
a.restart()
a.scrape()
return a
# Callback
def call_back(scrape_obj):
if scrape_obj.get_result():
pool.terminate()
# Also somehow return the result...
else:
# Restart and add again
pool.apply_async(obj_restart, scrape_obj, callback=call_back)
pool = multiprocessing.Pool(3)
url = 'test.com'
delay = 0.001
n = 3
for i in range(n):
time.sleep(delay)
pool.apply_async(obj_create, url, callback=call_back)
pool.close()
pool.join()
I tried my best to make this reproducible example, and explain it as well as I could, but please let me know if anything is unclear!
python
in parallel
Seems you never heard about the GIL yet? If so, I am sorry to disappoint you, but you can't get true parallelism because of it.
More than that. In the case you've described.. you don't ever need one. Because most of the time your threads will be blocked by the network interface to reply back. Thus what you need is a good async framework, a-ka event loop, that will optimize your execution such that blocking reduces to a possible minimum.
Let me introduce aiohttp. It gives you rightly implemented HTTP(s) calls and the glorious asyncio provides you with neat mechanism to cancel, retry and handle errors.
In particular, have a look at the gather.
P.S. multiprocessing will make things parallel, yet it is an overkill here due to IO-bound nature of the task. You may stay within the same process and avoid all the multiprocessing hell simply by rejecting this kind of parallelism which either way will be blocked most of the time.
Related
I have an always-on video stream being processed in an infinite loop. Once a certain object is detected, a second I/O bound method (let's refer to this as FuncIO) is triggered. Ideally, only 1 of FuncIO should run at a time. Once FuncIO completes, the parent loop should continue (i.e., wait for the next trigger of FuncIO).
Here is the pseudocode:
def FuncIO(self):
if self._funcio_running:
# Only 1 instance of FuncIO should run at a time.
# Is this the best place to enforce this?
return
self._funcio_running = true
PerformsBlockingIO()
self._funcio_running = false
return
def main_loop(self):
while True:
if detect_object():
# Run FuncIO asynchronously
else:
# Performs other tasks.
I'm a bit new to asyncio so I would like to know if there is an existing design pattern I can use to handle this scenario.
Thanks!
As far as I understand from your question, you don't leverage the advantages of asynchronous functionality with this approach, and maybe you don't even need it here.
If I didn't understand your question correctly, so there is a special lock mechanism in asyncio, you can read more here, would be something like that:
async def FuncIO(self):
await lock.acquire()
try:
PerformsBlockingIO()
finally:
lock.release()
Edit: I am closing this question.
As it turns out, my goal of having parallel HTTP posts is pointless. After implementing it successfully with aiohttp, I run into deadlocks elsewhere in the pipeline.
I will reformulate this and post a single question in a few days.
Problem
I want to have a class that, during some other computation, holds generated data and can write it to a DB via HTTP (details below) when convenient. It's gotta be a class as it is also used to load/represent/manipulate data.
I have written a naive, nonconcurrent implementation that works:
The class is initialized and then used in a "main loop". Data is added to it in this main loop to a naive "Queue" (a list of HTTP requests). At certain intervals in the main loop, the class calls a function to write those requests and clear the "queue".
As you can expect, this is IO bound. Whenever I need to write the "queue", the main loop halts. Furthermore, since the main computation runs on a GPU, the loop is also not really CPU bound.
Essentially, I want to have a queue, and, say, ten workers running in the background and pushing items to the http connector, waiting for the push to finish and then taking on the next (or just waiting for the next write call, not a big deal). In the meantime, my main loop runs and adds to the queue.
Program example
My naive program looks something like this
class data_storage(...):
def add(...):
def write_queue(self):
if len(self.queue) > 0:
res = self.connector.run(self.queue)
self.queue = []
def main_loop(storage):
# do many things
for batch in dataset: #simplified example
# Do stuff
for some_other_loop:
(...)
storage.add(results)
# For example, call each iteration
storage.write_queue()
if __name__ == "__main__":
storage=data_storage()
main_loop(storage)
...
In detail: the connector class is from the package 'neo4j-connector' to post to my Neo4j database. It essentially does JSON formatting and uses the "requests" api from python.
This works, even without a real queue, since nothing is concurrent.
Now I have to make it work concurrently.
From my research, I have seen that ideally I would want a "producer-consumer" pattern, where both are initialized via asyncio. I have only seen this implemented via functions, not classes, so I don't know how to approach this. With functions, my main loop should be a producer coroutine and my write function becomes the consumer. Both are initiated as tasks on the queue and then gathered, where I'd initialize only one producer but many consumers.
My issue is that the main loop includes parts that are already parallel (e.g. PyTorch). Asyncio is not thread safe, so I don't think I can just wrap everything in async decorators and make a co-routine. This is also precisely why I want the DB logic in a separate class.
I also don't actually want or need the main loop to run "concurrently" on the same thread with the workers. But it's fine if that's the outcome as the workers don't do much on the CPU. But technically speaking, I want multi-threading? I have no idea.
My only other option would be to write into the queue until it is "full", halt the loop and then use multiple threads to dump it to the DB. Still, this would be much slower than doing it while the main loop is running. My gain would be minimal, just concurrency while working through the queue. I'd settle for it if need be.
However, from a stackoverflow post, I came up with this small change
class data_storage(...):
def add(...):
def background(f):
def wrapped(*args, **kwargs):
return asyncio.get_event_loop().run_in_executor(None, f, *args, **kwargs)
return wrapped
#background
def write_queue(self):
if len(self.queue) > 0:
res = self.connector.run(self.queue)
self.queue = []
Shockingly this sort of "works" and is blazingly fast. Of course since it's not a real queue, things get overwritten. Furthermore, this overwhelms or deadlocks the HTTP API and in general produces a load of errors.
But since this - in principle - works, I wonder if I could do is the following:
class data_storage(...):
def add(...):
def background(f):
def wrapped(*args, **kwargs):
return asyncio.get_event_loop().run_in_executor(None, f, *args, **kwargs)
return wrapped
#background
def post(self, items):
if len(items) > 0:
self.nr_workers.increase()
res = self.connector.run(items)
self.nr_workers.decrease()
def write_queue(self):
if self.nr_workers < 10:
items=self.queue.get(200) # Extract and delete from queue, non-concurrent
self.post(items) # Add "Worker"
for some hypothetical queue and nr_workers objects. Then at the end of the main loop, have a function that blocks progress until number of workers is zero and clears, non-concurrently, the rest of the queue.
This seems like a monumentally bad idea, but I don't know how else to implement this. If this is terrible, I'd like to know before I start doing more work on this. Do you think it would it work?
Otherwise, could you give me any pointers as how to approach this situation correctly?
Some key words, tools or things to research would of course be enough.
Thank you!
I am working on a chatbot, where before I reply to the user I make a DB call to save the chat in a table. This will be done each time user types something, and it increases the response time.
So to decrease the response time, we need to call this asynchronously.
How to do this in Python 3?
I have read tutorials of asyncio library, but did not understand it completely and could not understand how to make it work.
Another workaround is to use queuing system, but that sounds like an overkill.
Example:
request = get_request_from_chat
res = call_some_function_to_prepare_response()
save_data() # this will be call asynchronously
reply() # this should not wait save_data() to finish
Any suggestions are welcome.
Use loop.create_task(some_async_function()) to run an async function "in the background". For example, this answer shows how to do that in case of a trivial client-server communication.
In your case the pseudo-code would look like this:
request = await get_request_from_chat()
res = call_some_function_to_prepare_response()
loop = asyncio.get_event_loop()
loop.create_task(save_data()) # runs in the "background"
reply() # doesn't wait for save_data() to finish
For this to work, of course, the program must be written for asyncio and save_data must be a coroutine. For a chat server it's a good approach to follow anyway, so I would recommend to give asyncio a chance.
Because you mentioned
Another workaround is to use queuing system, but that sounds like an
overkill.
I assume you are open to other solutions so I will propose multi-threading approach:
from concurrent.futures import ThreadPoolExecutor
from time import sleep
def long_runnig_funciton(param1):
print(param1)
sleep(10)
return "Complete"
with ThreadPoolExecutor(max_workers=10) as executor:
future = executor.submit(long_runnig_funciton,["Param1"])
print(future.result(timeout=12))
Steps:
1) You create a ThreadPoolExecutor and define maximum number of concurrent tasks.
2) You submit a function with arguments it needs
3) You call result() on the return value from submit() when you need the results
Note that the result() can throw exception if exception was thrown in the submitted function
You can also check if the result of your call is ready with future.done() which returns True or False
I have a simple Flask web app that make many HTTP requests to an external service when a user push a button. On the client side I have an angularjs app.
The server side of the code look like this (using multiprocessing.dummy):
worker = MyWorkerClass()
pool = Pool(processes=10)
result_objs = [pool.apply_async(worker.do_work, (q,))
for q in queries]
pool.close() # Close pool
pool.join() # Wait for all task to finish
errors = not all(obj.successful() for obj in result_objs)
# extract result only from successful task
items = [obj.get() for obj in result_objs if obj.successful()]
As you can see I'm using apply_async because I want to later inspect each task and extract from them the result only if the task didn't raise any exception.
I understood that in order to show a progress bar on client side, I need to publish somewhere the number of completed tasks so I made a simple view like this:
#app.route('/api/v1.0/progress', methods=['GET'])
def view_progress():
return jsonify(dict(progress=session['progress']))
That will show the content of a session variable. Now, during the process, I need to update that variable with the number of completed tasks (the total number of tasks to complete is fixed and known).
Any ideas about how to do that? I working in the right direction?
I'have seen similar questions on SO like this one but I'm not able to adapt the answer to my case.
Thank you.
For interprocess communication you can use a multiprocessiong.Queue and your workers can put_nowait tuples with progress information on it while doing their work. Your main process can update whatever your view_progress is reading until all results are ready.
A bit like in this example usage of a Queue, with a few adjustments:
In the writers (workers) I'd use put_nowait instead of put because working is more important than waiting to report that you are working (but perhaps you judge otherwise and decide that informing the user is part of the task and should never be skipped).
The example just puts strings on the queue, I'd use collections.namedtuples for more structured messages. On tasks with many steps, this enables you to raise the resolution of you progress report, and report more to the user.
In general the approach you are taking is okay, I do it in a similar way.
To calculate the progress you can use an auxiliary function that counts the completed tasks:
def get_progress(result_objs):
done = 0
errors = 0
for r in result_objs:
if r.ready():
done += 1
if not r.successful():
errors += 1
return (done, errors)
Note that as a bonus this function returns how many of the "done" tasks ended in errors.
The big problem is for the /api/v1.0/progress route to find the array of AsyncResult objects.
Unfortunately AsyncResult objects cannot be serialized to a session, so that option is out. If your application supports a single set of async tasks at a time then you can just store this array as a global variable. If you need to support multiple clients, each with a different set of async tasks, then you will need figure out a strategy to keep client session data in the server.
I implemented the single client solution as a quick test. My view functions are as follows:
results = None
#app.route('/')
def index():
global results
results = [pool.apply_async(do_work) for n in range(20)]
return render_template('index.html')
#app.route('/api/v1.0/progress')
def progress():
global results
total = len(results)
done, errored = get_progress(results)
return jsonify({'total': total, 'done': done, 'errored': errored})
I hope this helps!
I think you should be able to update the number of completed tasks using multiprocessing.Value and multiprocessing.Lock.
In your main code, use:
processes=multiprocessing.Value('i', 10)
lock=multiprocessing.Lock()
And then, when you call worker.dowork, pass a lock object and the value to it:
worker.dowork(lock, processes)
In your worker.dowork code, decrease "processes" by one when the code is finished:
lock.acquire()
processes.value-=1
lock.release()
Now, "processes.value" should be accessible from your main code, and be equal to the number of remaining processes. Make sure you acquire the lock before acessing processes.value, and release the lock afterwards
I'm slowly but surely getting the hang of twisted, but I'm not sure how I should be approaching this particular project.
I'm trying to create a class for batch processing of web pages. There are multiple web pages that I would like to process independently, so it makes sense to have a pipeline of sorts for each url. Additionally, I would like to call a one-time preprocessing function before any of the urls are processed, and when all urls have been processed, I would like to call a post-processing function. Importantly, I'd like to be able to subclass this processing class and override certain methods based on the contents I'm trying to process -- not all web-pages will require the same processing steps.
If this were synchronous code, I would probably do this with a context manager. Consider the following example code:
class Pipeline(object):
def __init__(self, urls):
self.urls = urls # iterable
self.continue = False
def __enter__(self):
self.continue = self.preprocess()
def __exit__(self, type, value, traceback):
if self.continue: # if we decided to run the batch pipeline...
self.postprocess()
def preprocess(self):
# does some stuff and returns a bool
def postprocess(self):
# do some stuff
def pipeline(self):
for url in self.urls:
try:
# download url, do some stuff
except:
# recover so that other urls are not interrupted
After which, I would use it as follows:
with Pipeline(list_of_urls) as p:
p.pipeline()
This works well with synchronous network operations, but not with Twisted, since the pipeline function will return before the end of the processing pipeline, thus calling __exit__.
Additionally, I would like each URL's processing to take place completely separately because there may be conditional branching based on the result of my queries. For this reason, using Twisted's DeferredList is not desirable.
In a nutshell, I need the following:
Preprocess must run before all else
Postprocess must run when the following is true:
At least one url began processing (preprocess returns True)
All urls have either completed or thrown an exception
What's the sanest way to set something like this up with Twisted? The issue I'm having is that some of the code involves asynchronous IO and some is just straight synchronous logic (i.e. processing the results in memory), so I'm not sure how to make the whole thing work with deferreds (or whether I even should).
Any advice?
In light of the comments for the original question, I would suggest using DeferredList in combination with maybeDeferred. I propose the latter because of this:
The issue I'm having is that some of the code involves asynchronous IO and some is just straight synchronous logic (i.e. processing the results in memory), so I'm not sure how to make the whole thing work with deferreds (or whether I even should).
Using a maybeDeferred allows you to treat all of your function calls as though they were asynchronous, whether they actually are or not.
Following structure helps me in close condition:
def Processor:
def __init__()
self.runned = 0
self.hasSuccess = True
self.preprocess()
def launch(self, urls):
for url in urls:
dfrd = url.process()
dfrd.addCallback(self.succ)
dfrd.addErrback(self.fail)
self.runned += 1
def fail(self, reason):
self.runned -= 1
if self.runned == 0 and self.hasSuccess:
self.postprocess()
def succ(self, arg):
self.runned -= 1
self.hasSuccess = True
if not self.runned
self.postprocess()