Python multithreading with list I/O

I want to achieve multithreading in Python where the threaded function does some work and appends a URL to a list of URLs (links), and a listener in the calling script watches the links list for new elements to iterate over. Confused? Me too; I'm not even sure how to explain this, so let me try to demonstrate with pseudo-code:
from multiprocessing import Pool

def worker(links):
    # do lots of things with urllib2, including finding elements with BeautifulSoup,
    # extracting text from those elements and using it to compile the unique URL;
    # finally, append the URL that was gathered in the `lots of things` section to the list
    links.append('http://myUniqueURL.com')  # this will be unique for each time `worker` is called

links = []
for i in MyBigListOfJunk:
    Pool().apply(worker, (links,))

for link in links:
    # do a bunch of stuff with this link, including using it to retrieve the html source with urllib2
Now, rather than waiting for all the worker threads to finish and then iterating over links all at once, is there a way for me to iterate over the URLs as they are appended to the links list? Basically, the worker iteration that generates the links list HAS to be separate from the iteration over links itself; however, rather than running them sequentially, I was hoping I could run them somewhat concurrently and save some time. Currently I call worker upwards of 30-40 times within a loop, and the entire script takes roughly 20 minutes to finish executing.
Any thoughts would be very welcome, thank you.

You should use the Queue class for this. It is a thread-safe FIFO container: its get() method removes an item from the queue and, importantly, blocks when there are no items, waiting until other processes add them.
If you use multiprocessing, then you should use the Queue from that module, not the Queue module.
Next time you ask a question about processes, state the exact Python version you want it for. This is for 2.6.
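For example, a minimal sketch reusing the names from the question: each worker puts its URL onto the queue as soon as it has it, and the main process handles links as they arrive rather than after everything has finished.
from multiprocessing import Process, Queue

def worker(junk, links):
    # ... do lots of things with urllib2 and BeautifulSoup here ...
    links.put('http://myUniqueURL.com')   # push the URL as soon as it is known

if __name__ == '__main__':
    links = Queue()
    procs = [Process(target=worker, args=(junk, links)) for junk in MyBigListOfJunk]
    for p in procs:
        p.start()
    for _ in procs:
        link = links.get()   # blocks until some worker has produced a URL
        # ... do a bunch of stuff with this link as soon as it arrives ...
        print(link)
    for p in procs:
        p.join()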

Related

Python asyncio: as_completed in order

TL;DR
Is there a way to wait on multiple futures, and yield from them as they are completed in a given order?
Long story
Imagine you have two data sources. One gives you id -> name mapping, the other gives you id -> age mapping. You want to compute (name, age) -> number_of_ids_with_that_name_and_age.
There is too much data to just load it, but both data sources support paging/iterating and ordering by id.
So you write something like
def iterate_names():
    for page in get_name_page_numbers():
        yield from iterate_name_page(page)  # yields (id, name) pairs
and the same for age, and then you iterate over iterate_names() and iterate_ages().
What is wrong with that? What happens is:
you request one page of names and ages
you get them
you process the data until you reach the end of a page, let's say, ages
you request another page of ages
you process the data until ...
Basically, you are not waiting for any requests while you process data.
You could use asyncio.gather to send all requests and wait for all data, but then:
when the first page arrives, you are still waiting for the others
you run out of memory
There is asyncio.as_completed which allows you to send all requests and process pages as you get results, but you will get pages out of order, so you will not be able to do the processing.
Ideally, there would be a function that would make the first request, and as the response comes, make the second request and yield results from the first at the same moment.
Is that possible?
There are a lot of things going on in your question; I'll try to get to all of them.
Is there a way to wait on multiple futures, and yield from them as they are completed in a given order?
Yes. Your code can yield from or await any number of futures in sequence. If you are talking about Tasks specifically and you want these tasks to be executing concurrently, they simply need to be assigned to the loop (done when you call asyncio.ensure_future() or loop.create_task()) and the loop needs to be running.
As for yielding from them in sequence, you can establish what that sequence is in the first place as you create the tasks. In a simple example where you have created all of the tasks/futures before you start to process their results, you could use a list to store the task futures and finally pull from the list:
import asyncio

loop = asyncio.get_event_loop()

tasks_im_waiting_for = []
for thing in things_to_get:
    task = loop.create_task(get_a_thing_coroutine(thing))
    tasks_im_waiting_for.append(task)

@asyncio.coroutine
def process_gotten_things(getter_tasks):
    for task in getter_tasks:
        result = yield from task
        print("We got {}".format(result))

loop.run_until_complete(process_gotten_things(tasks_im_waiting_for))
That example will only process one result at a time, but will still allow any of the scheduled getter tasks to continue doing their thing while it's waiting for the next one in the sequence to complete. If the processing order didn't matter as much and we wanted to process more than one potentially-ready result at a time, then we could use a deque instead of a list, with more than one process_gotten_things task .pop()ing the getter tasks from the deque. If we wanted to get even more advanced, we can do as Vincent suggests in a comment to your question and use an asyncio.Queue instead of a deque. With such a queue, you can have a producer adding tasks to the queue running concurrently with the task-processing consumers.
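As a rough sketch of that Queue variant (reusing loop, things_to_get, and get_a_thing_coroutine from the example above; a single consumer and a None sentinel keep it simple):
@asyncio.coroutine
def produce_getter_tasks(queue, things_to_get):
    for thing in things_to_get:
        task = loop.create_task(get_a_thing_coroutine(thing))
        yield from queue.put(task)      # hand each scheduled task to the consumer
    yield from queue.put(None)          # sentinel: nothing more is coming

@asyncio.coroutine
def consume_getter_tasks(queue):
    while True:
        task = yield from queue.get()
        if task is None:
            break
        result = yield from task
        print("We got {}".format(result))

task_queue = asyncio.Queue()
loop.run_until_complete(asyncio.gather(
    produce_getter_tasks(task_queue, things_to_get),
    consume_getter_tasks(task_queue),
))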
Using a deque, or Queue for sequencing futures for processing has a disadvantage though, and that's that you are only processing as many futures concurrently as you have running processor tasks. You could create a new processor task every single time you queued up a new future to be processed, but at that point, this queue becomes a completely redundant data structure because asyncio already gives you a queue-like object where every thing added gets processed concurrently: the event loop. For every task we schedule, we can also schedule its processing. Revising the above example:
for thing in things_to_get:
    getter_task = loop.create_task(get_a_thing_coroutine(thing))
    processor_task = loop.create_task(process_gotten_thing(getter_task))
    # Tasks are futures; the processor can await the getter's result once started
Now let's say that our getter might return multiple things (kind of like your scenario) and each of those things needs some processing. That brings me to a different asyncio design pattern: sub-tasks. Your tasks can schedule other tasks on the event loop. As the event loop is run, the order of your first tasks will still be maintained, but if any one of them ends up waiting on something, there's a chance one of your sub-tasks will get started in the midst of things. Revising the above scenario, we might pass the loop to our coroutine so the coroutine can schedule the tasks that processes its results:
for thing in things_to_get:
    task = loop.create_task(get_a_thing_coroutine(thing, loop))

@asyncio.coroutine
def get_a_thing_coroutine(thing, loop):
    results = yield from long_time_database_call(thing)
    subtasks = []
    for result in results:
        subtasks.append(loop.create_task(process_result(result)))
    # With subtasks scheduled in the order we like, wait for them
    # to finish before we consider THIS task complete.
    yield from asyncio.wait(subtasks)
All these advanced patterns start tasks in the order you want, but might finish processing them in any order. If you truly need to process the results in the exact same order that you started getting those results, then stick to a single processor pulling result futures from a sequence or yielding from an asyncio.Queue.
You'll also notice that to ensure tasks start in a predictable order, I explicitly schedule them with loop.create_task(). While asyncio.gather() and asyncio.wait() will happily take coroutine objects and schedule/wrap them as Tasks, they have problems with scheduling them in a predictable order as of this writing. See asyncio issue #432.
OK, let's get back to your specific case. You have two separate sources of results, and those results need to be joined together by a common key, an id. The patterns I mentioned for getting things and processing those things don't account for such a problem, and I don't know the perfect pattern for it off the top of my head. I'll go through what I might do to attempt this though.
We need some objects to maintain the state of what we know and what we've done so far for the sake of correlating that knowledge as it grows.
# defaultdicts are great for representing knowledge that an interested
# party might want whether or not we have any knowledge to begin with:
from collections import defaultdict
# Let's start with a place to store our end goal:
name_and_age_to_id_count = defaultdict(int)
# Given we're correlating info from two sources, let's make two places to
# store that info, keyed by what we're joining on: id
# When we correlate this info, only one side might be known, so use a
# Future on both sides to represent data we may or may not have yet.
id_to_age_future = defaultdict(loop.create_future)
id_to_name_future = defaultdict(loop.create_future)
# As soon as we learn the name or age for an id, we can begin processing
# the joint information, but because this information is coming from
# multiple sources we want to process concurrently we need to keep track
# of what ids we've started processing the joint info for.
ids_scheduled_for_processing = set()
We know we'll be getting this information in "pages" via the iterators you mentioned, so let's start there in designing our tasks:
@asyncio.coroutine
def process_name_page(page_number):
    subtasks = []
    for id, name in iterate_name_page(page_number):
        name_future = id_to_name_future[id]
        name_future.set_result(name)
        if id not in ids_scheduled_for_processing:
            age_future = id_to_age_future[id]
            task = loop.create_task(increment_name_age_pair(id, name_future, age_future))
            subtasks.append(task)
            ids_scheduled_for_processing.add(id)
    yield from asyncio.wait(subtasks)

@asyncio.coroutine
def process_age_page(page_number):
    subtasks = []
    for id, age in iterate_age_page(page_number):
        age_future = id_to_age_future[id]
        age_future.set_result(age)
        if id not in ids_scheduled_for_processing:
            name_future = id_to_name_future[id]
            task = loop.create_task(increment_name_age_pair(id, name_future, age_future))
            subtasks.append(task)
            ids_scheduled_for_processing.add(id)
    yield from asyncio.wait(subtasks)
Those coroutines schedule the name/age pair of an id to be processed—more specifically, the name and age futures for an id. Once started, the processor will await both futures' results (joining them, in a sense).
@asyncio.coroutine
def increment_name_age_pair(id, name_future, age_future):
    # This will wait until both futures are resolved and let other tasks work in the meantime:
    pair = ((yield from name_future), (yield from age_future))
    name_and_age_to_id_count[pair] += 1
    # If memory is a concern:
    ids_scheduled_for_processing.discard(id)
    del id_to_age_future[id]
    del id_to_name_future[id]
OK, we've got tasks for getting/iterating the pages and subtasks for processing what's in those pages. Now we need to actually schedule the getting of those pages. Back to your problem: we've got two data sources we want to pull from, and we want to pull from them in parallel. We assume the order of information from one closely correlates with the order of information from the other, so we interleave the processing of both in the event loop.
from itertools import zip_longest

page_processing_tasks = []
# Interleave name and age pages:
for name_page_number, age_page_number in zip_longest(
        get_name_page_numbers(),
        get_age_page_numbers()):
    # Explicitly schedule each one as a task, in the order we want, because gather
    # and wait have non-deterministic scheduling order:
    if name_page_number is not None:
        page_processing_tasks.append(loop.create_task(process_name_page(name_page_number)))
    if age_page_number is not None:
        page_processing_tasks.append(loop.create_task(process_age_page(age_page_number)))
Now that we have scheduled the top level tasks, we can finally actually do the things:
loop.run_until_complete(asyncio.wait(page_processing_tasks))
print(name_and_age_to_id_count)
asyncio may not solve all of your parallel processing woes. You mentioned that the "processing" of each page takes forever. If it takes forever because it's awaiting responses from a server, then this architecture is a neat, lightweight approach to do what you need (just make sure the I/O is being done with asyncio loop-aware tools).
If it takes forever because Python is crunching numbers or moving things around with CPU and memory, asyncio's single-threaded event loop doesn't help you much because only one Python operation is happening at a time. In this scenario, you may want to look into using loop.run_in_executor with a pool of Python interpreter processes if you'd like to stick with asyncio and the sub-task pattern. You could also develop a solution using the concurrent.futures library with a process pool instead of using asyncio.
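If you do go the executor route, a hedged sketch of how a processing sub-task might hand its heavy lifting off to a process pool could look like this (cpu_heavy_crunch is a hypothetical placeholder, and loop is the event loop from the examples above):
import asyncio
from concurrent.futures import ProcessPoolExecutor

process_pool = ProcessPoolExecutor(max_workers=4)

@asyncio.coroutine
def process_in_executor(result):
    # Ship the CPU-heavy part to another interpreter process and yield until it
    # finishes, so the event loop stays free to run the other tasks meanwhile.
    crunched = yield from loop.run_in_executor(process_pool, cpu_heavy_crunch, result)
    return crunched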
Note: The example generator you gave might be confusing to some because it uses yield from to delegate generation to an inner generator. It just so happens that asyncio coroutines use the same expression to await a future result and tell the loop it can run other coroutines' code if it wants.
asyncio has no such functionality built in, but you can write a simple wrapper around as_completed that yields data in order.
It can be built with a small sliding-window buffer that stores results which complete early, while the older results are not yet available.
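For instance, a sketch of such a wrapper (this needs Python 3.6+ for the async generator syntax, and the buffer here is unbounded in the worst case; a true sliding window would also cap how many requests are in flight):
import asyncio

async def as_completed_in_order(coros):
    """Run everything concurrently, but yield results in submission order."""
    async def tagged(index, coro):
        return index, await coro

    buffered = {}     # results that finished ahead of their turn
    next_index = 0
    for fut in asyncio.as_completed([tagged(i, c) for i, c in enumerate(coros)]):
        index, result = await fut
        buffered[index] = result
        # Drain the contiguous prefix as soon as it becomes available.
        while next_index in buffered:
            yield buffered.pop(next_index)
            next_index += 1
Pages could then be consumed with async for page in as_completed_in_order(...) and processed strictly in submission order.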

how to introduce fault tolerance into beautiful soup

I'm interested in scraping a lot of different websites as quickly as possible. The URLs could have any number of issues for web scraping; e.g., they may refer to files instead of sites, or they may not point to anything real at all. The issue I haven't been able to resolve is what to do when BeautifulSoup hangs, or fails for some reason and doesn't exit. There needs to be a way to stop the html parsing if it can't finish after X seconds. This turns out to be remarkably non-trivial, but it seems I'm not the only one to hit it; this page provides the most relevant information I've found: http://eli.thegreenplace.net/2011/08/22/how-not-to-set-a-timeout-on-a-computation-in-python
So, given that it's quite hard to kill a hanging computation such as BeautifulSoup(text) after a certain amount of time has passed, what should I do?
Absent some kind of inbuilt functionality in bs4 you can always use multiprocessing.
In [1]: from bs4 import BeautifulSoup
In [2]: from multiprocessing import Pool
In [3]: p = Pool(1)
In [4]: j = p.apply_async(BeautifulSoup, ["<html></html>"])
In [5]: j.get(timeout=5)
Out[5]: <html></html>
EDIT: for explanation
If there is no built-in functionality that supports timeouts for bs4 parsing, then multiprocessing is your only option, because if you were to run it naively, just saying
BeautifulSoup(html)
then you would be running that instruction in a single Python process. If, in that process, any one of your instructions gets stuck in a way that just chews up CPU and never exits, you the user see it happen right there in your shell, and frustration inevitably follows. In your case bs4 probably got stuck in some kind of loop trying to parse your html, so without multiprocessing you have no recourse other than to (a) kill the process or (b) wait for it to finish.
Multiprocessing lets you get out of the single-process paradigm. When you create a pool, multiprocessing spawns new Python interpreter processes that run in the background; the number of processes spawned corresponds to the number of workers you allocate to the pool. In our case we gave the pool one worker. If one of those background processes takes too long to return a result for the command it was given, we have the option to kill it. That is essentially what you are doing here.
The differences between Pool.map, Pool.apply, and Pool.apply_async are best left to the documentation. map doesn't really fit what you're doing because you only need one worker to perform your task. apply would work, but it's essentially a synchronous call: it waits for the result before continuing. apply_async is asynchronous, which makes it good for performing work in parallel. I'm not sure what your specific requirements are, so I'll leave the decision of which function to use up to you.
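Wrapped up as a helper, the same idea might look something like this (an untested sketch: it pays the cost of a fresh worker process per call, and the parsed soup has to survive being pickled back to the parent, as in the session above):
from multiprocessing import Pool, TimeoutError
from bs4 import BeautifulSoup

def parse_with_timeout(html, seconds=5, parser="html.parser"):
    """Parse html in a throwaway worker process; give up after `seconds`."""
    pool = Pool(1)
    try:
        return pool.apply_async(BeautifulSoup, (html, parser)).get(timeout=seconds)
    except TimeoutError:
        return None               # caller decides how to treat a hung parse
    finally:
        pool.terminate()          # kills the worker even if it is stuck
        pool.join()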

using multiple threads in Python

I'm trying to solve a problem, where I have many (on the order of ten thousand) URLs, and need to download the content from all of them. I've been doing this in a "for link in links:" loop up till now, but the amount of time it's taking is now too long. I think it's time to implement a multithreaded or multiprocessing approach. My question is, what is the best approach to take?
I know about the Global Interpreter Lock, but since my problem is network-bound, not CPU-bound, I don't think that will be an issue. I need to pass data back from each thread/process to the main thread/process. I don't need help implementing whatever approach (Terminate multiple threads when any thread completes a task covers that), I need advice on which approach to take. My current approach:
data_list = get_data(...)
output = []
for datum in data_list:
    output.append(get_URL_data(datum))
return output
There's no other shared state.
I think the best approach would be to have a queue with all the data in it, and have several worker threads pop from the input queue, get the URL data, then push onto an output queue.
Am I right? Is there anything I'm missing? This is my first time implementing multithreaded code in any language, and I know it's generally a Hard Problem.
For your specific task I would recommend a multiprocessing worker pool. You define a pool and tell it how many processes you want to use (one per processor core by default), as well as the function you want to run on each unit of work. Then you put every unit of work (in your case, the list of URLs) in a list and give it to the worker pool.
Your output will be a list of the return values of your worker function, one for every item of work in your original list. All the multiprocessing goodness happens in the background. There are of course other ways of working with a worker pool, but this is my favourite one.
Happy multi-processing!
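As a rough sketch (reusing get_data from the question as the source of work items; the worker here just fetches the page body, and the __main__ guard matters on platforms that spawn rather than fork):
from multiprocessing import Pool
import urllib2   # the question's tooling; urllib.request on Python 3

def get_url_data(datum):
    # stand-in for the question's get_URL_data(): fetch and return whatever you need
    return urllib2.urlopen(datum).read()

if __name__ == '__main__':
    data_list = get_data()        # however you get your ~10,000 work items, as in the question
    pool = Pool()                 # one worker process per CPU core by default
    output = pool.map(get_url_data, data_list)
    pool.close()
    pool.join()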
The best approach I can think of for your use case is a thread pool with a work queue. The threads in the pool get work from the queue, do the work, and then go get more work. That way you can finely control the number of threads working on your URLs.
So, create a WorkQueue, which in your case is basically a queue containing the URLs that need to be downloaded.
Create a thread pool, which creates the number of threads you specify, fetches work from the WorkQueue, and assigns it to a thread. Each time a thread finishes and returns, check whether the work queue has more work and, if so, assign work to that thread again. You may also want to add a hook so that every time work is added to the work queue, it gets assigned to a free thread if one is available.
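A bare-bones sketch of that pattern with the standard library (reusing get_URL_data and data_list from the question; Queue is the Python 2 module name, spelled queue on Python 3, and the thread count is arbitrary):
import threading
import Queue          # 'queue' on Python 3

work_queue = Queue.Queue()
output_queue = Queue.Queue()

def worker():
    while True:
        datum = work_queue.get()
        try:
            output_queue.put(get_URL_data(datum))   # from the question
        finally:
            work_queue.task_done()

for datum in data_list:                             # from the question
    work_queue.put(datum)
for _ in range(20):                                 # tune the thread count to taste
    t = threading.Thread(target=worker)
    t.daemon = True
    t.start()
work_queue.join()                                   # wait until every datum is handled
output = [output_queue.get() for _ in range(output_queue.qsize())]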
The fastest and most efficient way of doing IO-bound tasks like this is an asynchronous event loop. libcurl can do this, and there is a Python wrapper for it called pycurl. Using its "multi" interface you can do high-performance client activities. I have done over 1000 simultaneous fetches as fast as one.
However, the API is quite low-level and difficult to use. There is a simplifying wrapper here, which you can use as an example.
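To give a feel for it, here is a rough, untested outline of the usual pump-and-select loop for the multi interface (handle cleanup and error handling are omitted):
import pycurl
from io import BytesIO

def fetch_many(urls):
    """Drive many transfers on one thread with pycurl's multi interface."""
    multi = pycurl.CurlMulti()
    handles = []
    for url in urls:
        c = pycurl.Curl()
        c.buffer = BytesIO()                     # hang the response buffer off the handle
        c.setopt(pycurl.URL, url)
        c.setopt(pycurl.WRITEFUNCTION, c.buffer.write)
        multi.add_handle(c)
        handles.append(c)
    num_active = len(handles)
    while num_active:
        while True:
            ret, num_active = multi.perform()    # pump all transfers that are ready
            if ret != pycurl.E_CALL_MULTI_PERFORM:
                break
        multi.select(1.0)                        # wait for sockets to become ready
    return [c.buffer.getvalue() for c in handles]   # one body per input URL, in order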

Multi-step, concurrent HTTP requests in Python

I need to do some three-step web scraping in Python. I have a couple base pages that I scrape initially, and I need to get a few select links off those pages and retrieve the pages they point to, and repeat that one more time. The trick is I would like to do this all asynchronously, so that every request is fired off as soon as possible, and the whole application isn't blocked on a single request. How would I do this?
Up until this point, I've been doing one-step scraping with eventlet, like this:
import eventlet
import eventlet.green.urllib2

urls = ['http://example.com', '...']

def scrape_page(url):
    """Gets the data from the web page."""
    body = eventlet.green.urllib2.urlopen(url).read()
    # Do something with body
    return data

pool = eventlet.GreenPool()
for data in pool.imap(scrape_page, urls):
    # Handle the data...
However, if I extend this technique and include a nested GreenPool.imap loop, it blocks until all the requests in that group are done, meaning the application can't start more requests as needed.
I know I could do this with Twisted or another asynchronous server, but I don't need such a huge library and I would rather use something lightweight. I'm open to suggestions, though.
Here is an idea... but forgive me since I don't know eventlet. I can only provide a rough concept.
Consider your "step 1" pool the producers. Create a queue and have your step 1 workers place any new urls they find into the queue.
Create another pool of workers. Have these workers pull from the queue for urls and process them. If during their process they discover another url, put that into the queue. They will keep feeding themselves with subsequent work.
Technically this approach would make it easily recursive beyond 1,2,3+ steps. As long as they find new urls and put them in the queue, the work keeps happening.
Better yet, start out with the original urls in the queue, and just create a single pool that puts new urls to that same queue. Only one pool needed.
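A rough sketch of that single-queue, single-pool shape in eventlet (extract_links is a hypothetical helper; the loop keeps draining the queue, waits for in-flight fetches that may add more, and stops when both are exhausted):
import eventlet
from eventlet.green import urllib2   # green (non-blocking) version of urllib2

def fetch(url, q):
    """Fetch one page and feed any newly discovered URLs back into the queue."""
    body = urllib2.urlopen(url).read()
    for new_url in extract_links(body):   # hypothetical link extractor
        q.put(new_url)

def crawl(start_urls, concurrency=20):
    pool = eventlet.GreenPool(concurrency)
    q = eventlet.Queue()
    seen = set()
    for url in start_urls:
        q.put(url)
    # Keep going while there are queued URLs, or fetches in flight that might add more.
    while True:
        while not q.empty():
            url = q.get()
            if url not in seen:
                seen.add(url)
                pool.spawn_n(fetch, url, q)
        pool.waitall()
        if q.empty():
            break
    return seen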
Post note
Funny enough, after I posted this answer and went to look for what the eventlet 'queue' equivalent was, I immediately found an example showing exactly what I just described:
http://eventlet.net/doc/examples.html#producer-consumer-web-crawler
In that example there are producer and fetch methods. The producer pulls urls from the queue and spawns green threads to fetch them; fetch then puts any new urls back into the queue, and they keep feeding each other.

Which strategy to use with multiprocessing in python

I am completely new to multiprocessing. I have been reading documentation about multiprocessing module. I read about Pool, Threads, Queues etc. but I am completely lost.
What I want to do with multiprocessing is to convert my humble http downloader to work with multiple workers. What I do at the moment is: download a page, parse the page to find interesting links, and continue until all interesting links are downloaded. Now I want to implement this with multiprocessing, but I have no idea how to organize this workflow. I had two thoughts about it. First, I thought about having two queues: one queue for links that need to be downloaded, the other for links to be parsed. One worker downloads pages and adds them to the queue of items that need to be parsed; another process parses a page and adds the links it finds interesting to the other queue. The problems I expect from this approach are: first of all, why download only one page at a time and parse only one page at a time? Moreover, how does one process know that there are items still to be added to the queue later, after it has exhausted all the items currently in the queue?
Another approach I thought about: have a function that can be called with a URL as an argument. The function downloads the document and starts parsing it for links. Every time it encounters an interesting link, it instantly creates a new thread running the same function as itself. The problem I have with this approach is: how do I keep track of all the processes spawned all around, how do I know whether there are still processes running, and how do I limit the maximum number of processes?
So I am completely lost. Can anyone suggest a good strategy, and perhaps show some example codes about how to go with the idea.
Here is one approach, using multiprocessing. (Many thanks to @Voo for suggesting many improvements to the code.)
import multiprocessing as mp
import logging
import Queue
import time

logger=mp.log_to_stderr(logging.DEBUG) # or,
# logger=mp.log_to_stderr(logging.WARN) # uncomment this to silence debug and info messages

def worker(url_queue,seen):
    while True:
        url=url_queue.get()
        if url not in seen:
            logger.info('downloading {u}'.format(u=url))
            seen[url]=True
            # Replace this with code to download the url
            # urllib2.urlopen(...)
            time.sleep(0.5)
            content=url
            logger.debug('parsing {c}'.format(c=content))
            # replace this with code that finds interesting links and
            # puts them in url_queue
            for i in range(3):
                if content<5:
                    u=2*content+i-1
                    logger.debug('adding {u} to url_queue'.format(u=u))
                    time.sleep(0.5)
                    url_queue.put(u)
        else:
            logger.debug('skipping {u}; seen before'.format(u=url))
        url_queue.task_done()

if __name__=='__main__':
    num_workers=4
    url_queue=mp.JoinableQueue()
    manager=mp.Manager()
    seen=manager.dict()
    # prime the url queue with at least one url
    url_queue.put(1)
    downloaders=[mp.Process(target=worker,args=(url_queue,seen))
                 for i in range(num_workers)]
    for p in downloaders:
        p.daemon=True
        p.start()
    url_queue.join()
A pool of (4) worker processes is created.
There is a JoinableQueue, called url_queue.
Each worker gets a url from the url_queue, finds new urls and adds them to the url_queue.
Only after adding new items does it call url_queue.task_done().
The main process calls url_queue.join(). This blocks the main process until task_done has been called for every task in the url_queue.
Since the worker processes have the daemon attribute set to True, they too end when the main process ends.
All the components used in this example are also explained in Doug Hellman's excellent Python Module of the Week tutorial on multiprocessing.
What you're describing is essentially graph traversal. Most graph traversal algorithms (those more sophisticated than depth-first) keep track of two sets of nodes; in your case, the nodes are URLs.
The first set is called the "closed set", and represents all of the nodes that have already been visited and processed. If, while you're processing a page, you find a link that happens to be in the closed set, you can ignore it, it's already been handled.
The second set is unsurprisingly called the "open set", and includes all of the edges that have been found, but not yet processed.
The basic mechanism is to start by putting the root node into the open set (the closed set is initially empty, no nodes have been processed yet), and start working. Each worker takes a single node from the open set, copies it to the closed set, processes the node, and adds any nodes it discovers back to the open set (so long as they aren't already in either the open or closed sets). Once the open set is empty, (and no workers are still processing nodes) the graph has been completely traversed.
Actually implementing this with multiprocessing probably means that you'll have a master task that keeps track of the open and closed sets. When a worker in the worker pool indicates that it is ready for work, the master moves a node from the open set to the closed set and starts the worker on it. The workers can then pass all of the nodes they find back to the master without worrying about whether they are already closed; the master will ignore nodes that are already closed.
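Very roughly, the master's bookkeeping could be sketched like this (process_node and fetch_links are hypothetical placeholders; the master blocks on the oldest outstanding result, which keeps the example simple but may delay feeding newly discovered nodes back in):
import multiprocessing as mp

def process_node(url):
    # hypothetical worker: fetch the page and return the links found on it
    return fetch_links(url)     # fetch_links is assumed to exist elsewhere

def traverse(root_urls, num_workers=4):
    open_set = set(root_urls)   # discovered but not yet handed to a worker
    closed_set = set()          # already handed to a worker
    pool = mp.Pool(num_workers)
    in_flight = []              # AsyncResults for nodes currently being processed
    while open_set or in_flight:
        # Move every open node to the closed set and hand it to the pool.
        while open_set:
            url = open_set.pop()
            closed_set.add(url)
            in_flight.append(pool.apply_async(process_node, (url,)))
        # Wait for the oldest outstanding node; the master filters out closed nodes.
        for new_url in in_flight.pop(0).get():
            if new_url not in closed_set:
                open_set.add(new_url)
    pool.close()
    pool.join()
    return closed_set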
