Assistance with Python multithreading

Currently, I have a list of URLs to grab content from, and I'm doing it serially. I would like to change this to grabbing them in parallel. Here is my pseudocode. Is the design sound? I understand that .start() starts the thread, but my database is not updated. Do I need to use q.get()? Thanks.
import threading
import Queue

q = Queue.Queue()

def do_database(url):
    """ grab url then input to database """
    webdata = grab_url(url)
    try:
        insert_data_into_database(webdata)
    except:
        ....
    else:
        < do I need to do anything with the queue after each db operation is done? >

def put_queue(q, url):
    q.put(do_database(url))

for myfiles in currentdir:
    url = myfiles + some_other_string
    t = threading.Thread(target=put_queue, args=(q, url))
    t.daemon = True
    t.start()

It's odd that you're putting stuff into q but never taking anything out of q. What is the purpose of q? In addition, since do_database() doesn't return anything, sure looks like the only thing q.put(do_database(url)) does is put None into q.
The usual way these things work, a description of work to do is added to a queue, and then a fixed number of threads take turns pulling things off the queue. You probably don't want to create an unbounded number of threads ;-)
Here's a pretty complete - but untested - sketch:
import threading
import Queue

NUM_THREADS = 5  # whatever

q = Queue.Queue()
END_OF_DATA = object()  # a unique object

class Worker(threading.Thread):
    def run(self):
        while True:
            url = q.get()
            if url is END_OF_DATA:
                break
            webdata = grab_url(url)
            try:
                # Does your database support concurrent updates
                # from multiple threads?  If not, need to put
                # this in a "with some_global_mutex:" block.
                insert_data_into_database(webdata)
            except:
                # ....
                pass

threads = [Worker() for _ in range(NUM_THREADS)]
for t in threads:
    t.start()

for myfiles in currentdir:
    url = myfiles + some_other_string
    q.put(url)

# Give each thread an END_OF_DATA marker.
for _ in range(NUM_THREADS):
    q.put(END_OF_DATA)

# Shut down cleanly.  `daemon` is way overused.
for t in threads:
    t.join()

You should do this with asynchronous programming rather than threads. Threading in Python is problematic (see: Global Interpreter Lock), and anyway you're not trying to achieve multicore performance here. You just need a way to multiplex potentially long-running I/O. For that you can use a single thread and an event-driven library such as Twisted.
Twisted comes with HTTP functionality, so you can issue many concurrent requests and react (by populating your database) when results come in. Be aware that this model of programming may take a little getting used to, but it will give you good performance if the number of requests you're making is not astronomical (i.e. if you can get it all done on one machine, which it seems is your intention).
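As a rough, untested sketch of that model (reusing insert_data_into_database from your pseudocode; the URLs are placeholders), issuing concurrent requests with Twisted's Agent and reacting to each response as it arrives could look like this:

from twisted.internet import reactor, defer
from twisted.web.client import Agent, readBody

agent = Agent(reactor)

def fetch(url):
    # Issue the request; the returned Deferred eventually fires with the body bytes.
    d = agent.request(b'GET', url)
    d.addCallback(readBody)
    return d

def store(body, url):
    insert_data_into_database(body)  # your own DB code, as in the question

def fetch_all(urls):
    deferreds = []
    for url in urls:
        d = fetch(url)
        d.addCallback(store, url)
        deferreds.append(d)
    # Fires once every request has either succeeded or failed.
    return defer.DeferredList(deferreds, consumeErrors=True)

done = fetch_all([b'http://example.com/', b'http://example.org/'])  # placeholder URLs
done.addBoth(lambda _: reactor.stop())
reactor.run()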

For the DB, you have to commit before your changes become effective. Committing after every insert is not optimal; committing after a batch of changes gives much better performance.
For parallelism, Python isn't a natural fit. For your use case, I suggest using Python with gevent as a painless solution.
Here is a much more efficient pseudo implementation FYI:
import gevent
from gevent.monkey import patch_all
patch_all()  # to use with urllib, etc.
from gevent.queue import Queue

def web_worker(q, url):
    result = grab_something(url)  # placeholder: fetch the url however you like
    q.put(result)                 # gevent queues use put(), not push()

def db_worker(q):
    buf = []
    while True:
        buf.append(q.get())
        if len(buf) > 20:
            insert_stuff_in_buf_to_db(buf)  # placeholder: bulk insert
            db_commit()                     # commit once per batch
            buf = []

def run(urls):
    q = Queue()
    gevent.spawn(db_worker, q)
    workers = [gevent.spawn(web_worker, q, url) for url in urls]
    gevent.joinall(workers)  # give the greenlets a chance to run before exiting

run(urls)
Plus, since this implementation is entirely single-threaded, you can safely manipulate shared data between workers: the queue, the DB connection, global variables, and so on.
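To illustrate that point, here is a tiny standalone sketch (with made-up numbers): greenlets can update a shared dict without a lock, because control only switches at blocking calls such as gevent.sleep() or network I/O.

import gevent

stats = {'done': 0}   # shared between all greenlets, no lock needed

def worker(n):
    gevent.sleep(0.01 * n)   # control can switch to another greenlet only here
    stats['done'] += 1       # safe: no yield point inside this read-modify-write

gevent.joinall([gevent.spawn(worker, n) for n in range(10)])
print(stats['done'])  # always 10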


Store the results of a multiprocessing queue in python

I'm trying to store the results of multiple API requests using a multiprocessing queue, as the API can't handle more than 5 connections at once.
I found part of a solution in How to use multiprocessing with requests module?
import queue
import multiprocessing

def worker(input_queue, stop_event):
    while not stop_event.is_set():
        try:
            # Check if any request has arrived in the input queue. If not,
            # loop back and try again.
            request = input_queue.get(True, 1)
            input_queue.task_done()
        except queue.Empty:
            continue
        print('Started working on:', request)
        api_request_function(request)  # make request using a function I wrote
        print('Stopped working on:', request)

def master(api_requests):
    input_queue = multiprocessing.JoinableQueue()
    stop_event = multiprocessing.Event()
    workers = []

    # Create workers.
    for i in range(3):
        p = multiprocessing.Process(target=worker,
                                    args=(input_queue, stop_event))
        workers.append(p)
        p.start()

    # Distribute work.
    for requests in api_requests:
        input_queue.put(requests)

    # Wait for the queue to be consumed.
    input_queue.join()

    # Ask the workers to quit.
    stop_event.set()

    # Wait for workers to quit.
    for w in workers:
        w.join()

    print('Done')
I've looked at the documentation of threading and pooling, but I'm missing a step. The above runs, and all requests get a 200 status code, which is great. But how do I store the results of the requests so I can use them?
Thanks for your help
Shan
I believe you have to use a Queue. The code can be a little tricky; you need to read up on the multiprocessing module. In general, with multiprocessing, all the variables are copied for each worker, hence you can't do something like appending to a global variable, since it will literally be copied and the original will be untouched. There are a few functions that already automatically incorporate workers, queues, and return values. Personally, I try to write my functions to work with mp.map, like below:
import multiprocessing

def worker(*args, **kargs):
    # do stuff
    return 'thing'

output = multiprocessing.Pool().map(worker, [1, 2, 3, 4, 5])
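Applied to the question's use case, a Python 3 sketch might look like the following. It assumes api_request_function() is the question's own helper, that it is importable at module level, and that whatever it returns can be pickled; Pool(5) also caps concurrency at the API's limit of 5 connections.

import multiprocessing

def worker(request):
    # api_request_function() is the questioner's helper; its return value
    # is pickled and sent back to the parent process by the pool.
    return api_request_function(request)

if __name__ == '__main__':
    api_requests = ['req1', 'req2', 'req3']     # placeholder inputs
    with multiprocessing.Pool(5) as pool:       # at most 5 concurrent workers
        results = pool.map(worker, api_requests)
    print(results)                              # results arrive in input order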

Python Requests: Don't wait for request to finish

In Bash, it is possible to execute a command in the background by appending &. How can I do it in Python?
while True:
    data = raw_input('Enter something: ')
    requests.post(url, data=data)  # Don't wait for it to finish.
    print('Sending POST request...')  # This should appear immediately.
Here's a hacky way to do it:
try:
    requests.get("http://127.0.0.1:8000/test/", timeout=0.0000000001)
except requests.exceptions.ReadTimeout:
    pass
Edit: for those of you who observed that this will not await a response: that is my understanding of the question, "fire and forget... do not wait for it to finish". There are much more thorough and complete ways to do it with threads or async if you need response context, error handling, etc.
I use multiprocessing.dummy.Pool. I create a singleton thread pool at the module level, and then use pool.apply_async(requests.get, [params]) to launch the task.
This command gives me a future, which I can add to a list with other futures indefinitely until I'd like to collect all or some of the results.
multiprocessing.dummy.Pool is, against all logic and reason, a THREAD pool and not a process pool.
Example (works in both Python 2 and 3, as long as requests is installed):
from multiprocessing.dummy import Pool
import requests

pool = Pool(10)  # Creates a pool with ten threads; more threads = more concurrency.
                 # "pool" is a module attribute; you can be sure there will only
                 # be one of them in your application
                 # as modules are cached after initialization.

if __name__ == '__main__':
    futures = []
    for x in range(10):
        futures.append(pool.apply_async(requests.get, ['http://example.com/']))
    # futures is now a list of 10 futures.
    for future in futures:
        print(future.get())  # For each future, wait until the request is
                             # finished and then print the response object.
The requests will be executed concurrently, so running all ten of these requests should take no longer than the longest one. This strategy will only use one CPU core, but that shouldn't be an issue because almost all of the time will be spent waiting for I/O.
Elegant solution from Andrew Gorcester. In addition, without using futures, it is possible to use the callback and error_callback arguments (see the
docs) in order to perform asynchronous processing:
def on_success(r: Response):
    if r.status_code == 200:
        print(f'Post succeed: {r}')
    else:
        print(f'Post failed: {r}')

def on_error(ex: Exception):
    print(f'Post requests failed: {ex}')

pool.apply_async(requests.post,
                 args=['http://server.host'],
                 kwds={'json': {'key': 'value'}},
                 callback=on_success,
                 error_callback=on_error)
According to the docs, you should move to another library:
Blocking Or Non-Blocking?
With the default Transport Adapter in place, Requests does not provide
any kind of non-blocking IO. The Response.content property will block
until the entire response has been downloaded. If you require more
granularity, the streaming features of the library (see Streaming
Requests) allow you to retrieve smaller quantities of the response at
a time. However, these calls will still block.
If you are concerned about the use of blocking IO, there are lots of
projects out there that combine Requests with one of Python’s
asynchronicity frameworks.
Two excellent examples are
grequests and
requests-futures.
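For example, with requests-futures (a short sketch; the URL is a placeholder), each call returns immediately with a future, and you only block if and when you ask for the result:

from requests_futures.sessions import FuturesSession

session = FuturesSession()          # uses a small thread pool under the hood

# The call returns immediately with a future; the request runs in the background.
future = session.get('http://example.com/')

# ... do other work here ...

response = future.result()          # block only when you actually need the response
print(response.status_code)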
Simplest and Most Pythonic Solution using threading
A simple way to send a POST/GET request, or to execute any other function, without waiting for it to finish is to use the built-in Python module threading.
import threading
import requests

def send_req():
    requests.get("http://127.0.0.1:8000/test/")

for x in range(100):
    threading.Thread(target=send_req).start()  # starts a new thread and continues
Other Important Features of threading
You can turn these threads into daemons using thread_obj.daemon = True
You can wait for one to finish executing and then continue using thread_obj.join()
You can check whether a thread is alive using thread_obj.is_alive(), which returns True or False
You can check the active thread count with threading.active_count()
Official Documentation
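A small sketch tying the features above together, using the same placeholder URL as the example before the list:

import threading
import requests

def send_req():
    requests.get("http://127.0.0.1:8000/test/")

threads = []
for x in range(10):
    t = threading.Thread(target=send_req)
    t.daemon = True                      # daemon threads won't keep the process alive on exit
    t.start()
    threads.append(t)

print(threading.active_count())          # main thread plus however many workers are still running

for t in threads:
    t.join()                             # wait for this request thread to finish
    print(t.is_alive())                  # False once join() returns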
If you can write the code to be executed separately in a separate Python program, here is a possible solution based on subprocessing.
Otherwise you may find this question and the related answer useful: the trick is to use the threading library to start a separate thread that will execute the separated task.
A caveat with both approaches is the number of items (that is to say, the number of threads) you have to manage. If there are too many items in the parent, you may consider halting after every batch of items until at least some threads have finished, but I think this kind of management is non-trivial.
For a more sophisticated solution you can use an actor-based approach; I have not used this library myself, but I think it could help in that case.
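As for the caveat about managing the number of threads, one standard-library way to keep it bounded is concurrent.futures.ThreadPoolExecutor. A minimal sketch (the URLs and payload are placeholders):

from concurrent.futures import ThreadPoolExecutor
import requests

urls = ['http://example.com/a', 'http://example.com/b']   # placeholder work items

def send(url):
    # Fire the request; callers can ignore the returned Response if they don't need it.
    return requests.post(url, json={'key': 'value'})

with ThreadPoolExecutor(max_workers=5) as executor:       # never more than 5 threads
    futures = [executor.submit(send, url) for url in urls]
# Leaving the "with" block waits for all submitted requests to finish.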
from multiprocessing.dummy import Pool
import requests

pool = Pool()

def on_success(r):
    print('Post succeed')

def on_error(ex):
    print('Post requests failed')

def call_api(url, data, headers):
    requests.post(url=url, data=data, headers=headers)

def pool_processing_create(url, data, headers):
    pool.apply_async(call_api, args=[url, data, headers],
                     callback=on_success, error_callback=on_error)

What kind of problems (if any) would there be combining asyncio with multiprocessing?

As almost everyone is aware when they first look at threading in Python, there is the GIL that makes life miserable for people who actually want to do processing in parallel - or at least give it a chance.
I am currently looking at implementing something like the Reactor pattern. Effectively I want to listen for incoming socket connections on one thread-like, and when someone tries to connect, accept that connection and pass it along to another thread-like for processing.
I'm not (yet) sure what kind of load I might be facing. I know there is currently a 2MB cap set up on incoming messages. Theoretically we could get thousands per second (though I don't know if practically we've seen anything like that). The amount of time spent processing a message isn't terribly important, though obviously quicker would be better.
I was looking into the Reactor pattern, and developed a small example using the multiprocessing library that (at least in testing) seems to work just fine. However, now/soon we'll have the asyncio library available, which would handle the event loop for me.
Is there anything that could bite me by combining asyncio and multiprocessing?
You should be able to safely combine asyncio and multiprocessing without too much trouble, though you shouldn't be using multiprocessing directly. The cardinal sin of asyncio (and any other event-loop based asynchronous framework) is blocking the event loop. If you try to use multiprocessing directly, any time you block to wait for a child process, you're going to block the event loop. Obviously, this is bad.
The simplest way to avoid this is to use BaseEventLoop.run_in_executor to execute a function in a concurrent.futures.ProcessPoolExecutor. ProcessPoolExecutor is a process pool implemented using multiprocessing.Process, but asyncio has built-in support for executing a function in it without blocking the event loop. Here's a simple example:
import time
import asyncio
from concurrent.futures import ProcessPoolExecutor

def blocking_func(x):
    time.sleep(x)  # Pretend this is expensive calculations
    return x * 5

@asyncio.coroutine
def main():
    #pool = multiprocessing.Pool()
    #out = pool.apply(blocking_func, args=(10,))  # This blocks the event loop.
    executor = ProcessPoolExecutor()
    out = yield from loop.run_in_executor(executor, blocking_func, 10)  # This does not
    print(out)

if __name__ == "__main__":
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main())
For the majority of cases, this function alone is good enough. If you find yourself needing other constructs from multiprocessing, like Queue, Event, Manager, etc., there is a third-party library called aioprocessing (full disclosure: I wrote it) that provides asyncio-compatible versions of all the multiprocessing data structures. Here's an example demoing that:
import time
import asyncio
import aioprocessing
import multiprocessing

def func(queue, event, lock, items):
    with lock:
        event.set()
        for item in items:
            time.sleep(3)
            queue.put(item + 5)
    queue.close()

@asyncio.coroutine
def example(queue, event, lock):
    l = [1, 2, 3, 4, 5]
    p = aioprocessing.AioProcess(target=func, args=(queue, event, lock, l))
    p.start()
    while True:
        result = yield from queue.coro_get()
        if result is None:
            break
        print("Got result {}".format(result))
    yield from p.coro_join()

@asyncio.coroutine
def example2(queue, event, lock):
    yield from event.coro_wait()
    with (yield from lock):
        yield from queue.coro_put(78)
        yield from queue.coro_put(None)  # Shut down the worker

if __name__ == "__main__":
    loop = asyncio.get_event_loop()
    queue = aioprocessing.AioQueue()
    lock = aioprocessing.AioLock()
    event = aioprocessing.AioEvent()
    tasks = [
        asyncio.async(example(queue, event, lock)),
        asyncio.async(example2(queue, event, lock)),
    ]
    loop.run_until_complete(asyncio.wait(tasks))
    loop.close()
Yes, there are quite a few bits that may (or may not) bite you.
When you run something like asyncio it expects to run on one thread or process. This does not (by itself) work with parallel processing. You somehow have to distribute the work while leaving the IO operations (specifically those on sockets) in a single thread/process.
While your idea to hand off individual connections to a different handler process is nice, it is hard to implement. The first obstacle is that you need a way to pull the connection out of asyncio without closing it. The next obstacle is that you cannot simply send a file descriptor to a different process unless you use platform-specific (probably Linux) code from a C-extension.
Note that the multiprocessing module is known to create a number of threads for communication. Most of the time when you use communication structures (such as Queues), a thread is spawned. Unfortunately those threads are not completely invisible. For instance they can fail to tear down cleanly (when you intend to terminate your program), but depending on their number the resource usage may be noticeable on its own.
If you really intend to handle individual connections in individual processes, I suggest examining different approaches. For instance, you can put a socket into listen mode and then simultaneously accept connections from multiple worker processes in parallel. Once a worker is finished processing a request, it can go accept the next connection, so you still use fewer resources than forking a process for each connection. Spamassassin and Apache (mpm prefork) use this worker model, for instance. It might end up easier and more robust depending on your use case. Specifically, you can make your workers die after serving a configured number of requests and be respawned by a master process, thereby eliminating much of the negative effect of memory leaks.
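A minimal sketch of that pre-fork accept model (assuming a Unix-like platform with the fork start method, so the workers inherit the listening socket; the port and handler are placeholders):

import socket
import multiprocessing

def handle(conn):
    data = conn.recv(4096)       # placeholder request handling: echo
    conn.sendall(data)

def worker(server_sock):
    while True:
        conn, addr = server_sock.accept()   # worker processes take turns accepting
        try:
            handle(conn)
        finally:
            conn.close()

if __name__ == '__main__':
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind(('0.0.0.0', 8000))
    # Put the socket into listen mode once, then let several worker
    # processes accept from it in parallel.
    server.listen(128)

    procs = [multiprocessing.Process(target=worker, args=(server,)) for _ in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()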
Based on @dano's answer above, I wrote this function to replace places where I used to use multiprocessing pool + map.

import asyncio
from concurrent.futures import ProcessPoolExecutor
from typing import Callable

def asyncio_friendly_multiproc_map(fn: Callable, l: list):
    """
    This is designed to replace the use of this pattern:

        with multiprocessing.Pool(5) as p:
            results = p.map(analyze_day, list_of_days)

    by letting the caller drop in a replacement:

        asyncio_friendly_multiproc_map(analyze_day, list_of_days)
    """
    tasks = []
    with ProcessPoolExecutor(5) as executor:
        for e in l:
            tasks.append(asyncio.get_event_loop().run_in_executor(executor, fn, e))
    res = asyncio.get_event_loop().run_until_complete(asyncio.gather(*tasks))
    return res
See PEP 3156, in particular the section on Thread interaction:
http://www.python.org/dev/peps/pep-3156/#thread-interaction
This clearly documents the new asyncio methods you might use, including run_in_executor(). Note that Executor is defined in concurrent.futures; I suggest you also have a look there.
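For reference, a minimal sketch of that API in the same old-style coroutine syntax used in the answers above (passing None picks the loop's default thread pool executor):

import asyncio
import time

def blocking_io():
    time.sleep(1)        # stands in for any blocking call
    return 'done'

@asyncio.coroutine
def main(loop):
    # None means "run blocking_io in the loop's default ThreadPoolExecutor".
    result = yield from loop.run_in_executor(None, blocking_io)
    print(result)

loop = asyncio.get_event_loop()
loop.run_until_complete(main(loop))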

scan websites for content (fast)

I have thousands of websites in a database and I want to search all of the websites for a specific string. What is the fastest way to do it? I think I should get the content of each website first - this would be the way I do it:
import urllib2, re

string = "search string"
source = urllib2.urlopen("http://website1.com").read()
if re.search(string, source):
    print "My search string: " + string
and search for the string. But this is very slow. How can I accelerate it in python?
I don't think your issue is the program - it is the fact that you are executing an HTTP request for thousands of sites. You could investigate different solutions involving some sort of parallel processing, but regardless of how efficient you make the parsing code you are going to hit a bottleneck with the requests in your current implementation.
Here is a basic example that uses the Queue and threading modules. I would suggest reading up on the benefits of multiprocessing vs. multiple threads (such as the post mentioned by @JonathanV), but this will hopefully be somewhat helpful in understanding what is happening:
import Queue
import threading
import time
import urllib2

my_sites = [
    'http://news.ycombinator.com',
    'http://news.google.com',
    'http://news.yahoo.com',
    'http://www.cnn.com'
]

# Create a queue for our processing
queue = Queue.Queue()

class MyThread(threading.Thread):
    """Create a thread to make the url call."""

    def __init__(self, queue):
        super(MyThread, self).__init__()
        self.queue = queue

    def run(self):
        while True:
            # Grab a url from our queue and make the call.
            my_site = self.queue.get()
            url = urllib2.urlopen(my_site)

            # Grab a little data to make sure it is working
            print url.read(1024)

            # Send the signal to indicate the task has completed
            self.queue.task_done()

def main():
    # This will create a 'pool' of threads to use in our calls
    for _ in range(4):
        t = MyThread(queue)

        # A daemon thread runs but does not block our main function from exiting
        t.setDaemon(True)

        # Start the thread
        t.start()

    # Now go through our site list and add each url to the queue
    for site in my_sites:
        queue.put(site)

    # join() ensures that we wait until our queue is empty before exiting
    queue.join()

if __name__ == '__main__':
    start = time.time()
    main()
    print 'Total Time: {0}'.format(time.time() - start)
For good resources on threading in particular, see Doug Hellmann's post here, an IBM article here (this has become my general threading setup as evidenced by the above) and the actual docs here.
Try looking into using multiprocessing to run multiple searches at the same time. Multithreading works too, but the shared memory can turn into a curse if not managed properly. Take a look at this discussion to help you see which choice would work for you.
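A minimal sketch of that multiprocessing route (written in Python 2 to match the question's urllib2 code; the site list is a placeholder):

import urllib2
import multiprocessing

SEARCH_STRING = "search string"

def contains_string(url):
    # Each worker process fetches one site and checks it for the string.
    try:
        source = urllib2.urlopen(url).read()
    except urllib2.URLError:
        return url, False
    return url, SEARCH_STRING in source

if __name__ == '__main__':
    sites = ['http://website1.com', 'http://website2.com']  # placeholder list
    pool = multiprocessing.Pool(processes=8)
    for url, found in pool.map(contains_string, sites):
        if found:
            print "Found on:", url
    pool.close()
    pool.join()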

Throughput differences when using coroutines vs threading

A few days ago I asked a question on SO about helping me design a paradigm for structuring multiple HTTP requests.
Here's the scenario. I would like a have a multi-producer, multi-consumer system. My producers crawl and scrape a few sites and add the links that it finds into a queue. Since I'll be crawling multiple sites, I would like to have multiple producers/crawlers.
The consumers/workers feed off this queue, make TCP/UDP requests to these links and saves the results to my Django DB. I would also like to have multiple-workers as each queue item is totally independent of each other.
People suggested that I use a coroutine library for this, i.e. gevent or eventlet. Having never worked with coroutines, I read that even though the programming paradigm is similar to threaded paradigms, only one thread is actively executing, but when blocking calls occur, such as I/O calls, the stacks are switched in memory and another green thread takes over until it encounters some sort of blocking I/O call. Hopefully I got this right? Here's the code from one of my SO posts:
import gevent
from gevent.queue import *
import time
import random

q = JoinableQueue()
workers = []
producers = []

def do_work(wid, value):
    gevent.sleep(random.randint(0, 2))
    print 'Task', value, 'done', wid

def worker(wid):
    while True:
        item = q.get()
        try:
            print "Got item %s" % item
            do_work(wid, item)
        finally:
            print "No more items"
            q.task_done()

def producer():
    while True:
        item = random.randint(1, 11)
        if item == 10:
            print "Signal Received"
            return
        else:
            print "Added item %s" % item
            q.put(item)

for i in range(4):
    workers.append(gevent.spawn(worker, random.randint(1, 100000)))

# This doesn't work.
for j in range(2):
    producers.append(gevent.spawn(producer))

# Uncommenting this makes this script work.
# producer()

q.join()
This works well because the sleep calls are blocking calls and when a sleep event occurs, another green thread takes over. This is a lot faster than sequential execution.
As you can see, I don't have any code in my program that purposely yields the execution of one thread to another thread. I fail to see how this fits into scenario above as I would like to have all the threads executing simultaneously.
Everything works fine, but I feel the throughput I've achieved using gevent/eventlet is higher than the original sequentially running program, yet drastically lower than what could be achieved using real threading.
If I were to re-implement my program using threading mechanisms, each of my producers and consumers could simultaneously be working without the need to swap stacks in and out like coroutines.
Should this be re-implemented using threading? Is my design wrong? I've failed to see the real benefits of using coroutines.
Maybe my concepts are little muddy but this is what I've assimilated. Any help or clarification of my paradigm and concepts would be great.
Thanks
As you can see, I don't have any code in my program that purposely yields the execution of one thread to another thread. I fail to see how this fits into scenario above as I would like to have all the threads executing simultaneously.
There is a single OS thread but several greenlets. In your case gevent.sleep() allows workers to execute concurrently. Blocking IO calls such as urllib2.urlopen(url).read() do the same if you use urllib2 patched to work with gevent (by calling gevent.monkey.patch_*()).
See also A Curious Course on Coroutines and Concurrency to understand how a code can work concurrently in a single threaded environment.
To compare throughput differences between gevent, threading, and multiprocessing, you could write code that is compatible with all approaches:
#!/usr/bin/env python
concurrency_impl = 'gevent'  # single process, single thread
##concurrency_impl = 'threading'  # single process, multiple threads
##concurrency_impl = 'multiprocessing'  # multiple processes

if concurrency_impl == 'gevent':
    import gevent.monkey; gevent.monkey.patch_all()

import logging
import time
import random
from itertools import count, islice

info = logging.info

if concurrency_impl in ['gevent', 'threading']:
    from Queue import Queue as JoinableQueue
    from threading import Thread
if concurrency_impl == 'multiprocessing':
    from multiprocessing import Process as Thread, JoinableQueue
The rest of the script is the same for all concurrency implementations:
def do_work(wid, value):
    time.sleep(random.randint(0, 2))
    info("%d Task %s done" % (wid, value))

def worker(wid, q):
    while True:
        item = q.get()
        try:
            info("%d Got item %s" % (wid, item))
            do_work(wid, item)
        finally:
            q.task_done()
            info("%d Done item %s" % (wid, item))

def producer(pid, q):
    for item in iter(lambda: random.randint(1, 11), 10):
        time.sleep(.1)  # simulate a green blocking call that yields control
        info("%d Added item %s" % (pid, item))
        q.put(item)
    info("%d Signal Received" % (pid,))
Don't execute code at module level; put it in main():
def main():
    logging.basicConfig(level=logging.INFO,
                        format="%(asctime)s %(process)d %(message)s")

    q = JoinableQueue()
    it = count(1)
    producers = [Thread(target=producer, args=(i, q)) for i in islice(it, 2)]
    workers = [Thread(target=worker, args=(i, q)) for i in islice(it, 4)]
    for t in producers + workers:
        t.daemon = True
        t.start()

    for t in producers: t.join()  # put items in the queue
    q.join()  # wait while it is empty
    # exit main thread (daemon workers die at this point)

if __name__ == "__main__":
    main()
gevent is great when you have very many (green) threads. I tested it with thousands and it worked very well. You have to make sure all the libraries you use, both for scraping and for saving to the db, are green. AFAIK, if they use Python's socket, gevent injection ought to work. Extensions written in C (e.g. mysqldb) would block, however, and you'd need to use green equivalents instead.
If you use gevent you could mostly do away with queues: spawn a new (green) thread for every task, with the code for the thread being as simple as db.save(web.get(address)). gevent will take care of preemption when some library in db or web blocks. It will work as long as your tasks fit in memory.
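A rough sketch of that spawn-per-task style (save_to_db is a hypothetical green DB helper standing in for "db.save", and urllib2 stands in for "web.get"; the addresses are placeholders):

import gevent
from gevent import monkey
monkey.patch_all()        # make socket-based libraries cooperative

import urllib2            # now green thanks to the monkey patch

def handle(address):
    page = urllib2.urlopen(address).read()   # "web.get(address)"
    save_to_db(address, page)                # "db.save(...)": hypothetical green DB helper

addresses = ['http://example.com/a', 'http://example.com/b']   # placeholder
gevent.joinall([gevent.spawn(handle, addr) for addr in addresses])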
In this case, your problem is not program speed (i.e., the choice of gevent or threading), but network IO throughput. That is (or should be) the bottleneck that determines how fast the program runs.
Gevent is one nice way to make sure that is the bottleneck, and not your program's architecture.
This is the sort of process you'd want:
import gevent
from gevent.queue import Queue, JoinableQueue
from gevent.monkey import patch_all

patch_all()  # Patch urllib2, etc.

def worker(work_queue, output_queue):
    for work_unit in work_queue:
        finished = do_work(work_unit)
        output_queue.put(finished)
        work_queue.task_done()

def producer(input_queue, work_queue):
    for url in input_queue:
        url_list = crawl(url)
        for work in url_list:
            work_queue.put(work)
        input_queue.task_done()

def do_work(work):
    gevent.sleep(0)  # Actually process link here
    return work

def crawl(url):
    gevent.sleep(0)
    return list(url)  # Actually process url here

input = JoinableQueue()
work = JoinableQueue()
output = Queue()

workers = [gevent.spawn(worker, work, output) for i in range(0, 10)]
producers = [gevent.spawn(producer, input, work) for i in range(0, 10)]

list_of_urls = ['foo', 'bar']

for url in list_of_urls:
    input.put(url)

# Wait for input to finish processing
input.join()
print 'finished producing'

# Wait for workers to finish processing work
work.join()
print 'finished working'

# We now have output!
print 'output:'
for message in output:
    print message

# Or if you'd like, you could use the output as it comes!
You don't need to wait for the input and work queues to finish; I've just demonstrated that here.
