Throughput differences when using coroutines vs threading

A few days ago I asked a question on SO about how to design a paradigm for structuring multiple HTTP requests.
Here's the scenario. I would like to have a multi-producer, multi-consumer system. My producers crawl and scrape a few sites and add the links that they find to a queue. Since I'll be crawling multiple sites, I would like to have multiple producers/crawlers.
The consumers/workers feed off this queue, make TCP/UDP requests to these links and save the results to my Django DB. I would also like to have multiple workers, as each queue item is totally independent of the others.
People suggested that I use a coroutine library for this, i.e. Gevent or Eventlet. Having never worked with coroutines, I read that even though the programming paradigm is similar to the threaded paradigm, only one thread is actively executing at a time; when a blocking call occurs (such as an I/O call), the stacks are switched in memory and another green thread takes over until it, in turn, encounters a blocking I/O call. Hopefully I got this right? Here's the code from one of my SO posts:
import gevent
from gevent.queue import *
import time
import random
q = JoinableQueue()
workers = []
producers = []
def do_work(wid, value):
    gevent.sleep(random.randint(0,2))
    print 'Task', value, 'done', wid
def worker(wid):
    while True:
        item = q.get()
        try:
            print "Got item %s" % item
            do_work(wid, item)
        finally:
            print "No more items"
            q.task_done()
def producer():
    while True:
        item = random.randint(1, 11)
        if item == 10:
            print "Signal Received"
            return
        else:
            print "Added item %s" % item
            q.put(item)
for i in range(4):
    workers.append(gevent.spawn(worker, random.randint(1, 100000)))
# This doesn't work.
for j in range(2):
    producers.append(gevent.spawn(producer))
# Uncommenting this makes this script work.
# producer()
q.join()
This works well because the sleep calls are blocking calls and when a sleep event occurs, another green thread takes over. This is a lot faster than sequential execution.
As you can see, I don't have any code in my program that purposely yields the execution of one thread to another thread. I fail to see how this fits into scenario above as I would like to have all the threads executing simultaneously.
All works fine, but I feel the throughput that I've achieved using Gevent/Eventlets is higher than the original sequentially running program but drastically lower than what could be achieved using real-threading.
If I were to re-implement my program using threading mechanisms, each of my producers and consumers could simultaneously be working without the need to swap stacks in and out like coroutines.
Should this be re-implemented using threading? Is my design wrong? I've failed to see the real benefits of using coroutines.
Maybe my concepts are a little muddy, but this is what I've assimilated. Any help or clarification of my paradigm and concepts would be great.
Thanks

"As you can see, I don't have any code in my program that purposely yields the execution of one thread to another thread. I fail to see how this fits into scenario above as I would like to have all the threads executing simultaneously."
There is a single OS thread but several greenlets. In your case gevent.sleep() allows workers to execute concurrently. Blocking IO calls such as urllib2.urlopen(url).read() do the same if you use urllib2 patched to work with gevent (by calling gevent.monkey.patch_*()).
See also A Curious Course on Coroutines and Concurrency to understand how code can run concurrently in a single-threaded environment.
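To make the idea concrete, here is a minimal sketch of cooperative scheduling in one OS thread, using plain Python 3 generators. It is not how gevent is implemented; the names task and run_round_robin are made up for this illustration, and each yield merely stands in for a blocking call such as gevent.sleep().

```python
from collections import deque

def task(name, steps, log):
    """A cooperative task: each `yield` hands control back to the scheduler."""
    for step in range(steps):
        log.append((name, step))
        yield  # stands in for a blocking call such as gevent.sleep()

def run_round_robin(tasks):
    """Run generator-based tasks in one thread until all are exhausted."""
    ready = deque(tasks)
    while ready:
        current = ready.popleft()
        try:
            next(current)          # run the task until its next yield
        except StopIteration:
            continue               # task finished; do not reschedule it
        ready.append(current)      # otherwise put it at the back of the line

log = []
run_round_robin([task("A", 2, log), task("B", 3, log)])
print(log)
```

The log shows the two tasks interleaving even though only one of them is ever running at a given moment, which is the same effect the greenlets in the question get from gevent.sleep().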
To compare throughput between gevent, threading, and multiprocessing, you could write code that is compatible with all three approaches:
#!/usr/bin/env python
concurrency_impl = 'gevent' # single process, single thread
##concurrency_impl = 'threading' # single process, multiple threads
##concurrency_impl = 'multiprocessing' # multiple processes
if concurrency_impl == 'gevent':
    import gevent.monkey; gevent.monkey.patch_all()
import logging
import time
import random
from itertools import count, islice
info = logging.info
if concurrency_impl in ['gevent', 'threading']:
    from Queue import Queue as JoinableQueue
    from threading import Thread
if concurrency_impl == 'multiprocessing':
    from multiprocessing import Process as Thread, JoinableQueue
The rest of the script is the same for all concurrency implementations:
def do_work(wid, value):
    time.sleep(random.randint(0,2))
    info("%d Task %s done" % (wid, value))
def worker(wid, q):
    while True:
        item = q.get()
        try:
            info("%d Got item %s" % (wid, item))
            do_work(wid, item)
        finally:
            q.task_done()
            info("%d Done item %s" % (wid, item))
def producer(pid, q):
    for item in iter(lambda: random.randint(1, 11), 10):
        time.sleep(.1) # simulate a green blocking call that yields control
        info("%d Added item %s" % (pid, item))
        q.put(item)
    info("%d Signal Received" % (pid,))
Don't execute code at module level; put it in main():
def main():
    logging.basicConfig(level=logging.INFO,
                        format="%(asctime)s %(process)d %(message)s")
    q = JoinableQueue()
    it = count(1)
    producers = [Thread(target=producer, args=(i, q)) for i in islice(it, 2)]
    workers = [Thread(target=worker, args=(i, q)) for i in islice(it, 4)]
    for t in producers+workers:
        t.daemon = True
        t.start()
    for t in producers: t.join() # put items in the queue
    q.join() # wait while it is empty
    # exit main thread (daemon workers die at this point)
if __name__=="__main__":
    main()

gevent is great when you have very many (green) threads. I tested it with thousands and it worked very well. You have to make sure that all the libraries you use, both for scraping and for saving to the DB, are green. AFAIK, if they use Python's socket module, gevent's monkey patching ought to work. Extensions written in C (e.g. mysqldb) would block, however, and you'd need to use green equivalents instead.
If you use gevent you could mostly do away with queues: spawn a new (green) thread for every task, with the code for the thread being as simple as db.save(web.get(address)). gevent will take care of switching to another green thread when some library call in db or web blocks. It will work as long as your tasks fit in memory.

In this case, your problem is not program speed (i.e. the choice of gevent or threading), but network I/O throughput. That is (or should be) the bottleneck that determines how fast the program runs.
Gevent is one nice way to make sure that is the bottleneck, and not your program's architecture.
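A rough way to see this for yourself is to simulate the network wait and time it. The sketch below is only an illustration in Python 3: fake_request is a hypothetical stand-in that sleeps instead of touching the network, the URLs are made up, and a standard-library thread pool is used so that it runs without gevent installed.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_request(url, delay=0.05):
    """Hypothetical stand-in for a network call; it only sleeps."""
    time.sleep(delay)
    return url

def timed(fn):
    """Return fn()'s result together with the wall-clock time it took."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start

urls = ["url-%d" % i for i in range(8)]  # made-up work items

# One request after another: the waits add up.
sequential, t_seq = timed(lambda: [fake_request(u) for u in urls])

# All requests in flight at once: the waits overlap.
with ThreadPoolExecutor(max_workers=8) as pool:
    threaded, t_thr = timed(lambda: list(pool.map(fake_request, urls)))

print("sequential: %.2fs, overlapped: %.2fs" % (t_seq, t_thr))
```

Once the waits overlap, the total time is close to the longest single wait, whichever mechanism (green threads or OS threads) does the overlapping.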
This is the sort of process you'd want:
import gevent
from gevent.queue import Queue, JoinableQueue
from gevent.monkey import patch_all
patch_all() # Patch urllib2, etc
def worker(work_queue, output_queue):
    for work_unit in work_queue:
        finished = do_work(work_unit)
        output_queue.put(finished)
        work_queue.task_done()
def producer(input_queue, work_queue):
    for url in input_queue:
        url_list = crawl(url)
        for work in url_list:
            work_queue.put(work)
        input_queue.task_done()
def do_work(work):
    gevent.sleep(0) # Actually process link here
    return work
def crawl(url):
    gevent.sleep(0)
    return list(url) # Actually process url here
input = JoinableQueue()
work = JoinableQueue()
output = Queue()
workers = [gevent.spawn(worker, work, output) for i in range(0, 10)]
producers = [gevent.spawn(producer, input, work) for i in range(0, 10)]
list_of_urls = ['foo', 'bar']
for url in list_of_urls:
    input.put(url)
# Wait for input to finish processing
input.join()
print 'finished producing'
# Wait for workers to finish processing work
work.join()
print 'finished working'
# We now have output!
print 'output:'
output.put(StopIteration) # iterating over a gevent queue ends at this sentinel
for message in output:
    print message
# Or if you'd like, you could use the output as it comes!
You don't need to wait for the input and work queues to finish; I've just demonstrated that here.

Related

How to avoid to start hundreds of threads when starting (very short) actions at different timings in the future

I use this method to launch a few dozen (fewer than a thousand) calls to do_it at different times in the future:
import threading
timers = []
while True:
    for i in range(20):
        t = threading.Timer(i * 0.010, do_it, [i]) # I pass the parameter i to function do_it
        t.start()
        timers.append(t) # so that they can be cancelled if needed
    wait_for_something_else() # this can last from 5 ms to 20 seconds
The runtime of each do_it call is very fast (much less than 0.1 ms) and non-blocking. I would like to avoid spawning hundreds of new threads for such a simple task.
How could I do this with only one additional thread for all do_it calls?
Is there a simple way to do this in Python using only the standard library, without any third-party library?

As I understand it, you want a single worker thread that can process submitted tasks, not in the order they are submitted, but rather in some prioritized order. This seems like a job for the thread-safe queue.PriorityQueue.
from dataclasses import dataclass, field
from threading import Thread
from typing import Any
from queue import PriorityQueue
@dataclass(order=True)
class PrioritizedItem:
    priority: float
    item: Any=field(compare=False)
def thread_worker(q: PriorityQueue[PrioritizedItem]):
    while True:
        do_it(q.get().item)
        q.task_done()
q = PriorityQueue()
t = Thread(target=thread_worker, args=(q,))
t.start()
while True:
    for i in range(20):
        q.put(PrioritizedItem(priority=i * 0.010, item=i))
    wait_for_something_else()
This code assumes you want to run forever. If not, you can add a timeout to the q.get in thread_worker, and return when the queue.Empty exception is thrown because the timeout expired. Like that you'll be able to join the queue/thread after all the jobs have been processed, and the timeout has expired.
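A minimal sketch of that timeout variant follows. The idle_timeout parameter, the results list and the mock do_it are assumptions added for illustration; the rest mirrors the code above.

```python
from dataclasses import dataclass, field
from queue import Empty, PriorityQueue
from threading import Thread
from typing import Any

@dataclass(order=True)
class PrioritizedItem:
    priority: float
    item: Any = field(compare=False)

results = []

def do_it(x):
    # Mock task for illustration; records the order of execution.
    results.append(x)

def thread_worker(q, idle_timeout=0.2):
    while True:
        try:
            entry = q.get(timeout=idle_timeout)
        except Empty:
            return  # nothing arrived within idle_timeout seconds: stop
        do_it(entry.item)
        q.task_done()

q = PriorityQueue()
for i in (3, 1, 2):
    q.put(PrioritizedItem(priority=i * 0.010, item=i))

t = Thread(target=thread_worker, args=(q,))
t.start()
q.join()   # every queued job has been processed
t.join()   # the worker returned after its idle timeout expired
print(results)
```

The timeout should be longer than the longest gap you expect between submissions, otherwise the worker exits while the producer is still busy in wait_for_something_else().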
If you want to wait until some specific time in the future to run the tasks, it gets a bit more complicated. Here's an approach that extends the above approach by sleeping in the worker thread until the specified time has arrived, but be aware that time.sleep is only as accurate as your OS allows it to be.
from dataclasses import astuple, dataclass, field
from datetime import datetime, timedelta
from time import sleep
from threading import Thread
from typing import Any
from queue import PriorityQueue
@dataclass(order=True)
class TimedItem:
    when: datetime
    item: Any=field(compare=False)
def thread_worker(q: PriorityQueue[TimedItem]):
    while True:
        when, item = astuple(q.get())
        sleep_time = (when - datetime.now()).total_seconds()
        if sleep_time > 0:
            sleep(sleep_time)
        do_it(item)
        q.task_done()
q = PriorityQueue()
t = Thread(target=thread_worker, args=(q,))
t.start()
while True:
    now = datetime.now()
    for i in range(20):
        q.put(TimedItem(when=now + timedelta(seconds=i * 0.010), item=i))
    wait_for_something_else()
To address this problem using only a single extra thread we have to sleep in that thread, so it's possible that new tasks with higher priority could come in while the worker is sleeping. In that case the worker would process that new high priority task after it's done with the current one. The above code assumes that scenario will not happen, which seems reasonable based on the problem description. If that might happen you can alter the sleep code to repeatedly poll if the task at the front of the priority queue has come due. The disadvantage with a polling approach like that is that it would be more CPU intensive.
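If earlier-due tasks can arrive while the worker sleeps, one alternative to polling (a different technique from the code above, sketched here under my own assumptions) is to sleep on a threading.Condition with a timeout, so that a newly submitted task wakes the worker and it re-checks which task is due first. The class name SingleThreadScheduler and its methods are made up for this example.

```python
import heapq
import itertools
import threading
import time

class SingleThreadScheduler:
    """Runs callbacks at requested times on one worker thread."""

    def __init__(self):
        self._heap = []                  # entries: (due_time, seq, func, args)
        self._seq = itertools.count()    # tie-breaker, so funcs are never compared
        self._cond = threading.Condition()
        self._stopping = False
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def call_later(self, delay, func, *args):
        with self._cond:
            entry = (time.monotonic() + delay, next(self._seq), func, args)
            heapq.heappush(self._heap, entry)
            self._cond.notify()          # wake the worker to re-check the earliest task

    def stop(self):
        """Let the worker run the remaining tasks, then exit."""
        with self._cond:
            self._stopping = True
            self._cond.notify()
        self._thread.join()

    def _run(self):
        while True:
            with self._cond:
                while True:
                    if self._heap:
                        wait = self._heap[0][0] - time.monotonic()
                        if wait <= 0:
                            break                # earliest task is due
                        self._cond.wait(wait)    # sleep, but wake early on notify
                    elif self._stopping:
                        return
                    else:
                        self._cond.wait()        # idle until something is submitted
                _, _, func, args = heapq.heappop(self._heap)
            func(*args)                          # run the task outside the lock

results = []
scheduler = SingleThreadScheduler()
scheduler.call_later(0.30, results.append, "late")
scheduler.call_later(0.05, results.append, "early")  # submitted second, due first
scheduler.stop()
print(results)
```

Here the task submitted second runs first because the worker is woken by notify() instead of sleeping through its original 0.30 s wait, and no CPU is spent polling.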
Also, if you can guarantee that the relative order of the tasks won't change after they've been submitted to the worker, then you can replace the priority queue with a regular queue.Queue to simplify the code somewhat.
These do_it tasks can be cancelled by removing them from the queue.
The above code was tested with the following mock definitions:
def do_it(x):
    print(x)
def wait_for_something_else():
    sleep(5)
An alternative approach that uses no extra threads would be to use asyncio, as pointed out by smcjones. Here's an approach using asyncio that calls do_it at specific times in the future by using loop.call_later:
import asyncio
def do_it(x):
    print(x)
async def wait_for_something_else():
    await asyncio.sleep(5)
async def main():
    loop = asyncio.get_running_loop()
    while True:
        for i in range(20):
            loop.call_later(i * 0.010, do_it, i)
        await wait_for_something_else()
asyncio.run(main())
These do_it tasks can be cancelled using the handle returned by loop.call_later.
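As a small illustration of cancelling, the sketch below keeps the handles in a dict keyed by the argument and cancels one of them before it fires. The handles dict, the results list and the short sleep standing in for wait_for_something_else are assumptions made for the example.

```python
import asyncio

results = []

def do_it(x):
    results.append(x)

async def main():
    loop = asyncio.get_running_loop()
    # Keep each TimerHandle so individual calls can be cancelled later.
    handles = {i: loop.call_later(i * 0.010, do_it, i) for i in range(5)}
    handles[3].cancel()          # do_it(3) will never run
    await asyncio.sleep(0.2)     # give the remaining timers time to fire

asyncio.run(main())
print(results)
```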
This approach will, however, require either switching over your program to use asyncio throughout, or running the asyncio event loop in a separate thread.
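For the second option, a sketch of running the event loop in its own thread could look like the following. The schedule helper is a made-up name; the important detail is that loop.call_later is not thread-safe, so calls from the main thread go through loop.call_soon_threadsafe.

```python
import asyncio
import threading

results = []

def do_it(x):
    results.append(x)

# The event loop lives in one extra thread; the main thread stays synchronous.
loop = asyncio.new_event_loop()
loop_thread = threading.Thread(target=loop.run_forever, daemon=True)
loop_thread.start()

def schedule(delay, func, *args):
    # call_later is not thread-safe, so hop onto the loop's thread first.
    loop.call_soon_threadsafe(loop.call_later, delay, func, *args)

for i in range(5):
    schedule(i * 0.010, do_it, i)

# Shut down: stop the loop after the last timer, then wait for the thread.
schedule(0.2, loop.stop)
loop_thread.join()
loop.close()
print(results)
```

All do_it calls run on the loop's thread, so this uses exactly one additional thread no matter how many calls are scheduled.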

It sounds like you want something to be non-blocking and asynchronous, but also single-processed and single-threaded (one thread dedicated to do_it).
If this is the case, and especially if any networking is involved, so long as you're not actively doing serious I/O on your main thread, it is probably worthwhile using asyncio instead.
It's designed to handle non-blocking operations, and allows you to make all of your requests without waiting for a response.
Example:
import asyncio
# do_it and wait_for_something_else are assumed to be coroutine functions here
async def main():
    while True:
        tasks = []
        for i in range(20):
            tasks.append(asyncio.create_task(do_it(i)))
        await wait_for_something_else()
        for task in tasks:
            await task
asyncio.run(main())
Given the time spent on blocking I/O (seconds) - you'll probably waste more time managing threads than you will save on generating a separate thread to do these other operations.

As you have said that in your code each series of 20 do_it calls starts when wait_for_something_else is finished, I would recommend calling the join method in each iteration of the while loop:
import threading
timers = []
while True:
    for i in range(20):
        t = threading.Timer(i * 0.010, do_it, [i]) # I pass the parameter i to function do_it
        t.start()
        timers.append(t) # so that they can be cancelled if needed
    wait_for_something_else() # this can last from 5 ms to 20 seconds
    for t in timers[-20:]:
        t.join()

The do_it calls should run in order and be cancellable:
run all do_it calls in one thread and sleep for the specific interval between them (possibly with something other than sleep)
use a variable "should_run_it" to check whether each do_it should run or not (cancellable?)
Is it something like this?
import threading
import time
def do_it(i):
    print(f"[{i}] {time.time()}")
should_run_it = {i:True for i in range(20)}
def guard_do_it(i):
    if should_run_it[i]:
        do_it(i)
def run_do_it():
    for i in range(20):
        guard_do_it(i)
        time.sleep(0.010)
if __name__ == "__main__":
    t = threading.Timer(0.010, run_do_it)
    start = time.time()
    print(start)
    t.start()
    #should_run_it[5] = should_run_it[10] = should_run_it[15] = False # test
    t.join()
    end = time.time()
    print(end)
    print(end - start)

I don't have a ton of experience with threading in Python, so please go easy on me. The concurrent.futures library is a part of Python3 and it's dead simple. I'm providing an example for you so you can see how straightforward it is.
Concurrent.futures with exactly one thread for do_it() and concurrency:
import concurrent.futures
import time
def do_it(iteration):
    time.sleep(0.1)
    print('do it counter', iteration)
def wait_for_something_else():
    time.sleep(1)
    print('waiting for something else')
def single_thread():
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as executor:
        futures = (executor.submit(do_it, i) for i in range(20))
        for future in concurrent.futures.as_completed(futures):
            future.result()
def do_asap():
    wait_for_something_else()
with concurrent.futures.ThreadPoolExecutor() as executor:
    futures = [executor.submit(single_thread), executor.submit(do_asap)]
    for future in concurrent.futures.as_completed(futures):
        future.result()
The code above runs every do_it() call on a single thread: in single_thread(), the executor is created with max_workers=1, which limits that pool to exactly one worker thread.
In the final with block, both single_thread and do_asap are submitted to a second ThreadPoolExecutor, so the two run concurrently and the do_it calls on their single thread do not block wait_for_something_else.
The concurrent.futures documentation describes how to control the number of threads. When max_workers is not specified, as for the outer executor here, it defaults to min(32, os.cpu_count() + 4).

Store the results of a multiprocessing queue in python

I'm trying to store the results of multiple API requests using a multiprocessing queue, as the API can't handle more than 5 connections at once.
I found part of a solution in How to use multiprocessing with requests module?
def worker(input_queue, stop_event):
    while not stop_event.is_set():
        try:
            # Check if any request has arrived in the input queue. If not,
            # loop back and try again.
            request = input_queue.get(True, 1)
            input_queue.task_done()
        except queue.Empty:
            continue
        print('Started working on:', request)
        api_request_function(request) #make request using a function I wrote
        print('Stopped working on:', request)
def master(api_requests):
    input_queue = multiprocessing.JoinableQueue()
    stop_event = multiprocessing.Event()
    workers = []
    # Create workers.
    for i in range(3):
        p = multiprocessing.Process(target=worker,
                                    args=(input_queue, stop_event))
        workers.append(p)
        p.start()
    # Distribute work.
    for requests in api_requests:
        input_queue.put(requests)
    # Wait for the queue to be consumed.
    input_queue.join()
    # Ask the workers to quit.
    stop_event.set()
    # Wait for workers to quit.
    for w in workers:
        w.join()
    print('Done')
I've looked at the documentation of threading and pooling, but I'm missing a step. The above runs and all requests get a 200 status code, which is great. But how do I store the results of the requests so that I can use them?
Thanks for your help
Shan

I believe you have to make a Queue. The code can be a little tricky; you need to read up on the multiprocessing module. In general, with multiprocessing, all the variables are copied for each worker, hence you can't do something like appending to a global variable, since only the worker's copy will be modified and the original will be left untouched. There are a few functions that already incorporate workers, queues, and return values automatically. Personally, I try to write my functions to work with Pool.map, like below:
import multiprocessing
def worker(*args, **kwargs):
    # do stuff
    return 'thing'
output = multiprocessing.Pool().map(worker, [1, 2, 3, 4, 5])
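Applied to the question, a sketch along these lines would cap concurrency at 5 and hand back the results as a list. Note two assumptions: api_request_function here is a dummy that returns a dict instead of calling a real API, and the example uses multiprocessing.pool.ThreadPool, which offers the same map interface as Pool but runs the workers as threads (usually sufficient for network-bound requests).

```python
from multiprocessing.pool import ThreadPool

def api_request_function(request):
    # Hypothetical stand-in for the real API call; returns what should be stored.
    return {"request": request, "status": 200}

api_requests = ["a", "b", "c", "d", "e", "f", "g"]  # made-up request descriptions

# At most 5 requests are in flight at once; map() returns results in input order.
with ThreadPool(processes=5) as pool:
    responses = pool.map(api_request_function, api_requests)

print(responses[0])
```

Swapping ThreadPool for multiprocessing.Pool should work the same way, provided the pool is created under an if __name__ == '__main__': guard and the worker function and its return values can be pickled.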

Monitoring a threaded Python program with htop

First of all, this is the code I am referring to:
from random import randint
import time
from threading import Thread
import Queue
class TestClass(object):
    def __init__(self, queue):
        self.queue = queue
    def do(self):
        while True:
            wait = randint(1, 10)
            time.sleep(1.0/wait)
            print '[>] Enqueuing from TestClass.do...', wait
            self.queue.put(wait)
class Handler(Thread):
    def __init__(self, queue):
        Thread.__init__(self)
        self.queue = queue
    def run(self):
        task_no = 0
        while True:
            task = self.queue.get()
            task_no += 1
            print ('[<] Dequeuing from Handler.run...', task,
                   'task_no=', task_no)
            time.sleep(1) # emulate processing time
            print ('[*] Task %d done!') % task_no
            self.queue.task_done()
def main():
    q = Queue.Queue()
    watchdog = TestClass(q)
    observer = Thread(target=watchdog.do)
    observer.setDaemon(True)
    handler = Handler(q)
    handler.setDaemon(True)
    handler.start()
    observer.start()
    try:
        while True:
            wait = randint(1, 10)
            time.sleep(1.0/wait)
            print '[>] Enqueuing from main...', wait
            q.put(wait)
    except KeyboardInterrupt:
        print '[*] Exiting...', True
if __name__ == '__main__':
    main()
While the code is not very important to my question, it is a simple script that spawns 2 threads, on top of the main one. Two of them enqueue "tasks", and one dequeues them and "executes" them.
I am just starting to study threading in Python, and I have of course run into the subject of the GIL, so I expected to have one process. But the thing is, when I monitor this particular script with htop, I notice not 1, but 3 processes being spawned.
How is this possible?

The GIL means only one thread will "do work" at a time, but it doesn't mean that Python won't spawn the threads. In your case, you asked Python to spawn two threads, so it did (giving you a total of three threads). FYI, htop lists both processes and threads by default, in case this was causing your confusion.
Python threads are useful for when you want concurrency but don't need parallelism. Concurrency is a tool for making programs simpler and more modular; it allows you to spawn a thread per task instead of having to write one big (often messy) while loop and/or use a bunch of callbacks (like JavaScript).
If you're interested in this subject, I recommend googling "concurrency versus parallelism". The concept is not language specific.
Edit: Alternatively, you can just read this Stack Overflow thread.
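You can check from inside the program that these are threads of one process and not separate processes. This is a small Python 3 sketch; record_pid and the 0.2-second sleep are made up for the demonstration.

```python
import os
import threading
import time

pids = []

def record_pid():
    pids.append(os.getpid())  # each thread reports the process it runs in
    time.sleep(0.2)           # stay alive long enough to be counted

threads = [threading.Thread(target=record_pid) for _ in range(2)]
for t in threads:
    t.start()

alive = threading.active_count()  # counts the main thread as well
for t in threads:
    t.join()

print("threads alive:", alive, "distinct PIDs:", len(set(pids)))
```

Both worker threads report the same PID as the main thread, which is consistent with the extra rows in the monitor being threads of a single process.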

How to list Processes started by multiprocessing Pool?

While attempting to store multiprocessing's Process instances in the multiprocessing list variable 'poolList', I am getting the following exception:
SimpleQueue objects should only be shared between processes through inheritance
The reason I would like to store the Process instances in a variable is to be able to terminate all or just some of them later (if, for example, a process freezes). If storing a Process in a variable is not an option, I would like to know how to get or list all the processes started by a multiprocessing Pool. That would be very similar to what the .current_process() method does, except that .current_process() gets only a single process, while I need all the processes started or all the processes currently running.
Two questions:
Is it even possible to store an instance of the Process (as returned by mp.current_process())?
Currently I am only able to get a single process from inside the function that the process is running (from inside myFunct(), using the .current_process() method).
Instead I would like to list all the processes currently run by multiprocessing. How can I achieve this?
import multiprocessing as mp
poolList=mp.Manager().list()
def myFunct(arg):
    print 'myFunct(): current process:', mp.current_process()
    try: poolList.append(mp.current_process())
    except Exception, e: print e
    for i in range(110):
        for n in range(500000):
            pass
        poolDict[arg]=i
    print 'myFunct(): completed', arg, poolDict
from multiprocessing import Pool
pool = Pool(processes=2)
myArgsList=['arg1','arg2','arg3']
pool=Pool(processes=2)
pool.map_async(myFunct, myArgsList)
pool.close()
pool.join()

To list the processes started by a Pool() instance (which is what you mean, if I understand you correctly), there is the pool._pool list, which contains the Process instances.
However, it is not part of the documented interface and hence, really should not be used.
BUT...it seems a little bit unlikely that it would change just like that anyway. I mean, should they stop having an internal list of processes in the pool? And not call that _pool?
And also, it annoys me that there at least isn't a get processes-method. Or something.
And handling it breaking due to some name change should not be that difficult.
But still, use at your own risk:
from multiprocessing import pool
# Have to run in main
if __name__ == '__main__':
    # Create 3 worker processes
    _my_pool = pool.Pool(3)
    # Loop, terminate, and remove from the process list
    # Use a copy [:] of the list to remove items correctly
    for _curr_process in _my_pool._pool[:]:
        print("Terminating process "+ str(_curr_process.pid))
        _curr_process.terminate()
        _my_pool._pool.remove(_curr_process)
    # If you call _repopulate_pool, the pool will again contain 3 worker processes.
    _my_pool._repopulate_pool()
    for _curr_process in _my_pool._pool[:]:
        print("After repopulation "+ str(_curr_process.pid))
The example creates a pool and manually terminates all processes.
It is important that you remember to remove each process you terminate from the pool yourself if you want Pool() to continue working as usual.
_my_pool._repopulate_pool() increases the number of worker processes to 3 again. It is not needed to answer the question, but it gives a little behind-the-scenes insight.

Yes, you can get all active processes and act on them based on the process name,
e.g.
multiprocessing.Process(target=foo, name="refresh-reports")
and then
for p in multiprocessing.active_children():
    if p.name == "refresh-reports":
        p.terminate()

You're creating a managed List object, but then letting the associated Manager object expire.
Process objects aren't shareable because they aren't pickle-able; that is, they aren't simple.
Oddly, the multiprocessing module doesn't have the equivalent of threading.enumerate(); that is, you can't list all outstanding processes. As a workaround, I just store procs in a list. I never terminate() a process, but do sys.exit(0) in the parent. It's rough, because the workers will leave things in an inconsistent state, but it's okay for smaller programs.
To kill a frozen worker, I suggest: 1) worker receives "heartbeat" jobs in a queue every now and then, 2) if parent notices worker A hasn't responded to a heartbeat in a certain amount of time, then p.terminate(). Consider restating the problem in another SO question, as it's interesting.
To be honest the map stuff is much easier than using a Manager.
Here's a Manager example I've used. A worker adds stuff to a shared list. Another worker occasionally wakes up, processes everything on the list, then goes back to sleep. The code also has verbose logs, which are essential for ease in debugging.
# producer adds to fixed-sized list; scanner uses them
import logging, multiprocessing, sys, time
def producer(objlist):
    '''
    add an item to list every sec; ensure fixed size list
    '''
    logger = multiprocessing.get_logger()
    logger.info('start')
    while True:
        try:
            time.sleep(1)
        except KeyboardInterrupt:
            return
        msg = 'ding: {:04d}'.format(int(time.time()) % 10000)
        logger.info('put: %s', msg)
        del objlist[0]
        objlist.append( msg )
def scanner(objlist):
    '''
    every now and then, run calculation on objlist
    '''
    logger = multiprocessing.get_logger()
    logger.info('start')
    while True:
        try:
            time.sleep(5)
        except KeyboardInterrupt:
            return
        logger.info('items: %s', list(objlist))
def main():
    logger = multiprocessing.log_to_stderr(
        level=logging.INFO
    )
    logger.info('setup')
    # create fixed-length list, shared between producer & consumer
    manager = multiprocessing.Manager()
    my_objlist = manager.list( # pylint: disable=E1101
        [None] * 10
    )
    multiprocessing.Process(
        target=producer,
        args=(my_objlist,),
        name='producer',
    ).start()
    multiprocessing.Process(
        target=scanner,
        args=(my_objlist,),
        name='scanner',
    ).start()
    logger.info('running forever')
    try:
        manager.join() # wait until both workers die
    except KeyboardInterrupt:
        pass
    logger.info('done')
if __name__=='__main__':
    main()

Assistance with Python multithreading

Currently, I have a list of URLs to grab contents from, and I am doing it serially. I would like to change it to grab them in parallel. This is pseudocode. I would like to ask: is the design sound? I understand that .start() starts the thread; however, my database is not updated. Do I need to use q.get()? Thanks
import threading
import Queue
q = Queue.Queue()
def do_database(url):
    """ grab url then input to database """
    webdata = grab_url(url)
    try:
        insert_data_into_database(webdata)
    except:
        ....
    else:
        < do I need to do anything with the queue after each db operation is done?>
def put_queue(q, url ):
    q.put( do_database(url) )
for myfiles in currentdir:
    url = myfiles + some_other_string
    t=threading.Thread(target=put_queue,args=(q,url))
    t.daemon=True
    t.start()

It's odd that you're putting stuff into q but never taking anything out of q. What is the purpose of q? In addition, since do_database() doesn't return anything, sure looks like the only thing q.put(do_database(url)) does is put None into q.
The usual way these things work, a description of work to do is added to a queue, and then a fixed number of threads take turns pulling things off the queue. You probably don't want to create an unbounded number of threads ;-)
Here's a pretty complete - but untested - sketch:
import threading
import Queue
NUM_THREADS = 5 # whatever
q = Queue.Queue()
END_OF_DATA = object() # a unique object
class Worker(threading.Thread):
    def run(self):
        while True:
            url = q.get()
            if url is END_OF_DATA:
                break
            webdata = grab_url(url)
            try:
                # Does your database support concurrent updates
                # from multiple threads? If not, need to put
                # this in a "with some_global_mutex:" block.
                insert_data_into_database(webdata)
            except:
                #....
                pass
threads = [Worker() for _ in range(NUM_THREADS)]
for t in threads:
    t.start()
for myfiles in currentdir:
    url = myfiles + some_other_string
    q.put(url)
# Give each thread an END_OF_DATA marker.
for _ in range(NUM_THREADS):
    q.put(END_OF_DATA)
# Shut down cleanly. `daemon` is way overused.
for t in threads:
    t.join()

You should do this with asynchronous programming rather than threads. Threading in Python is problematic (see: Global Interpreter Lock), and anyway you're not trying to achieve multicore performance here. You just need a way to multiplex potentially long-running I/O. For that you can use a single thread and an event-driven library such as Twisted.
Twisted comes with HTTP functionality, so you can issue many concurrent requests and react (by populating your database) when results come in. Be aware that this model of programming may take a little getting used to, but it will give you good performance if the number of requests you're making is not astronomical (i.e. if you can get it all done on one machine, which it seems is your intention).
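To give a feel for the single-threaded, event-driven model without installing Twisted, here is a sketch that uses asyncio from the standard library instead (a different library from the one recommended above). fetch and store are hypothetical stand-ins: the first sleeps instead of making an HTTP request, the second writes to a dict instead of a database.

```python
import asyncio

async def fetch(url, delay):
    # Hypothetical stand-in for an HTTP request; awaiting yields to the event loop.
    await asyncio.sleep(delay)
    return url, "content of %s" % url

async def store(db, url, body):
    db[url] = body   # stand-in for a database insert

async def crawl(urls):
    db = {}

    async def fetch_and_store(url):
        fetched_url, body = await fetch(url, 0.05)
        await store(db, fetched_url, body)

    # All requests are in flight at once on a single thread.
    await asyncio.gather(*(fetch_and_store(u) for u in urls))
    return db

db = asyncio.run(crawl(["u1", "u2", "u3"]))
print(sorted(db))
```

Each result is stored as soon as its own request completes, and no locking is needed because only one coroutine runs at a time.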

For the DB, you have to commit before your changes take effect. But committing after every insert is not optimal; committing after a batch of changes gives much better performance.
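As an illustration of batched commits, here is a sketch using sqlite3 from the standard library with an in-memory database. The table name, the generated rows and the batch size of 20 are all made up for the example; your own database driver may differ in the details.

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # in-memory database for illustration
conn.execute("CREATE TABLE pages (url TEXT, body TEXT)")

rows = [("url-%d" % i, "body %d" % i) for i in range(100)]  # made-up data

BATCH_SIZE = 20
for start in range(0, len(rows), BATCH_SIZE):
    batch = rows[start:start + BATCH_SIZE]
    conn.executemany("INSERT INTO pages VALUES (?, ?)", batch)
    conn.commit()                    # one commit per batch, not per row

count = conn.execute("SELECT COUNT(*) FROM pages").fetchone()[0]
print(count)
```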
As for parallelism, Python wasn't born for this. For your use case, I suppose using Python with gevent would be a painless solution.
Here is a much more efficient pseudo-implementation, FYI:
import gevent
from gevent.monkey import patch_all
patch_all() # to use with urllib, etc
from gevent.queue import Queue
def web_worker(q, url):
    grab_something
    q.put(result)
def db_worker(q):
    buf = []
    while True:
        buf.append(q.get())
        if len(buf) > 20:
            insert_stuff_in_buf_to_db
            db_commit
            buf = []
def run(urls):
    q = Queue()
    gevent.spawn(db_worker, q)
    for url in urls:
        gevent.spawn(web_worker, q, url)
run(urls)
Plus, since this implementation is entirely single-threaded, you can safely manipulate data shared between workers, such as the queue, the DB connection, global variables, etc.
