Hi, can you please tell me how to run different functions in different threads using a thread pool
in Twisted? Say
I have a list of ids x=[1,2,3,4], where 1, 2, etc. are ids (I get them from a database, and each one refers to a Python script somewhere on disk).
What I want to do is
traverse the list x and run every script in a different thread until they have all completed.
Thanks Calderone, your code helped me a lot.
I have a few doubts. For example, I can resize the thread pool this way:
from twisted.internet import reactor
reactor.suggestThreadPoolSize(30)
Say all 30 available threads are busy and there are still some ids in the list (or dict or tuple).
1- In this situation, will all ids be traversed? I mean, as soon as a thread is free, will the next tool (id)
be assigned to the freed thread?
2- There are also cases where one tool must be executed before a second tool, and one tool's output will be used by another tool. How will that be managed with Twisted threads?
Threads in Twisted are primarily used via twisted.internet.threads.deferToThread. Alternatively, there's a new interface which is slightly more flexible, twisted.internet.threads.deferToThreadPool. Either way, the answer is roughly the same, though. Iterate over your data and use one of these functions to dispatch it to a thread. You get back a Deferred from either which will tell you what the result is, when it is available.
from twisted.internet.threads import deferToThread
from twisted.internet.defer import gatherResults
from twisted.internet import reactor

def double(n):
    return n * 2

data = [1, 2, 3, 4]

results = []
for datum in data:
    # Each call runs double(datum) in a thread from the reactor's pool
    results.append(deferToThread(double, datum))

d = gatherResults(results)

def displayResults(results):
    print('Doubled data: %r' % (results,))

d.addCallback(displayResults)
d.addCallback(lambda ignored: reactor.stop())

reactor.run()
You can read more about threading in Twisted in the threading howto.
In a part of my software written in Python, I have a list of items whose size can vary from 12 items down to just one. For each item in this list I'm doing some processing (sending an HTTP request related to the given item, parsing the results, and many other operations). I'd like to speed up my code using threading: I'd like to create 2 threads, where each one takes a number of items and does the processing asynchronously.
Example 1: Let's say my list has 12 items. In this case each thread would take 6 items and call the processing functions on each item.
Example 2: Now let's say my list has 9 items. One thread would take 5 items and the other thread would take the remaining 4 items.
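For concreteness, the split described in the two examples could be sketched as follows. This is only an illustration of the requested behaviour: `split_in_two`, `process_chunk` and `run_in_two_threads` are hypothetical names, and the multiplication stands in for the real per-item processing.

```python
import threading

def split_in_two(items):
    """Split items in two; the first part gets the extra element if the count is odd."""
    middle = (len(items) + 1) // 2
    return items[:middle], items[middle:]

def process_chunk(chunk, results):
    # Hypothetical stand-in for the real per-item processing.
    for item in chunk:
        results.append(item * 10)

def run_in_two_threads(items):
    first, second = split_in_two(items)
    first_results, second_results = [], []
    threads = [
        threading.Thread(target=process_chunk, args=(first, first_results)),
        threading.Thread(target=process_chunk, args=(second, second_results)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Each thread wrote to its own list, so no locking is needed here.
    return first_results + second_results
```

With 12 items this gives two chunks of 6, and with 9 items chunks of 5 and 4. A fixed split like this cannot rebalance if one chunk turns out to be slower than the other.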
Currently I'm not applying any threading, and my code base is very large, so here is some code that does almost the same thing as my case:
# This procedure needs to be used with threading.
itemList = getItems()  # This function returns an unknown number of items, between 1 and 12
if len(itemList) > 0:  # Make sure that the list is not empty
    for item in itemList:
        processItem(item)  # This is an imaginary function that does the processing on each item
The code above is a basic, stripped-down version of what I'm doing. I can't figure out how to make my threads flexible, so that one takes a number of items and the other takes the rest (as explained in examples 1 and 2).
Thanks for your time.
You might rather implement it using shared queues
https://docs.python.org/3/library/queue.html#queue-objects
import queue
import threading

def worker():
    while True:
        item = q.get()
        if item is None:
            break
        do_work(item)
        q.task_done()

q = queue.Queue()
threads = []
for i in range(num_worker_threads):
    t = threading.Thread(target=worker)
    t.start()
    threads.append(t)

for item in source():
    q.put(item)

# block until all tasks are done
q.join()

# stop workers
for i in range(num_worker_threads):
    q.put(None)
for t in threads:
    t.join()
Quoting from
https://docs.python.org/3/library/queue.html#module-queue:
The queue module implements multi-producer, multi-consumer queues. It
is especially useful in threaded programming when information must be
exchanged safely between multiple threads.
The idea is that you have a shared storage and each thread attempts reading items from it one-by-one.
This is much more flexible than distributing the load in advance as you don't know how threads execution will be scheduled by your OS, how much time each iteration would take etc.
Furthermore, you might add items for further processing to this queue dynamically — for example, having a producer thread running in parallel.
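As a rough, self-contained variant of the shared-queue idea, the sketch below wraps the pattern in one function and collects the results. The name `process_all` and the `handle` callback are assumptions made for illustration, and items are assumed to be unique, hashable and never `None` (since `None` is used as the stop signal).

```python
import queue
import threading

def process_all(items, handle, num_workers=2):
    """Feed items to worker threads through a shared queue and collect results."""
    tasks = queue.Queue()
    results = queue.Queue()

    def worker():
        while True:
            item = tasks.get()
            if item is None:           # sentinel: no more work for this worker
                tasks.task_done()
                break
            try:
                outcome = handle(item)
            except Exception as exc:   # keep the worker alive; report the error instead
                outcome = exc
            results.put((item, outcome))
            tasks.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for item in items:
        tasks.put(item)
    for _ in threads:
        tasks.put(None)                # one sentinel per worker
    tasks.join()                       # wait until every queued entry was handled
    for t in threads:
        t.join()

    collected = {}
    while not results.empty():         # safe: all workers have finished
        item, outcome = results.get()
        collected[item] = outcome
    return collected
```

Whichever worker is free takes the next item, so 9 items may end up split 5/4, 6/3 or otherwise depending on how long each one takes.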
Some helpful links:
A brief introduction into concurrent programming in python:
http://www.slideshare.net/dabeaz/an-introduction-to-python-concurrency
More details on producer-consumer pattern with line-by-line explanation:
http://www.informit.com/articles/article.aspx?p=1850445&seqNum=8
You can use the ThreadPoolExecutor class from the concurrent.futures module in Python 3. The module is not present in Python 2, but there are some workarounds (which I will not discuss).
A thread pool executor does basically what @ffeast proposed, but with fewer lines of code for you to write. It manages a pool of threads which will execute all the tasks that you submit to it, presumably in the most efficient manner possible. The results will be returned through Future objects, which represent a "pending" result.
Since you seem to know the list of tasks up front, this is especially convenient for you. While you cannot guarantee how the tasks will be split between the threads, the result will probably be at least as good as anything you coded by hand.
from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(max_workers=2) as executor:
    for item in getItems():
        executor.submit(processItem, item)
If you need more information with the output, like some way of identifying the futures that have completed or getting results out of them, see the example in the Python documentation (on which the code above is heavily based).
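As a rough idea of what that can look like, here is a minimal sketch that maps each future back to its item and gathers the results with `as_completed`. `process_item` is a hypothetical stand-in for `processItem`, and `process_with_pool` is a name made up for this example.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def process_item(item):
    # Hypothetical stand-in for the question's processItem.
    return item * 2

def process_with_pool(items, max_workers=2):
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Remember which item each future belongs to.
        future_to_item = {executor.submit(process_item, item): item
                          for item in items}
        # as_completed yields futures in the order they finish.
        for future in as_completed(future_to_item):
            item = future_to_item[future]
            try:
                results[item] = future.result()
            except Exception as exc:
                results[item] = exc    # record the failure instead of aborting
    return results
```

Items are used as dictionary keys here, so this version assumes they are hashable and unique.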
To begin with, we're given the following piece of code:
from validate_email import validate_email
import time
import os

def verify_emails(email_path, good_filepath, bad_filepath):
    good_emails = open(good_filepath, 'w+')
    bad_emails = open(bad_filepath, 'w+')

    emails = set()

    with open(email_path) as f:
        for email in f:
            email = email.strip()

            if email in emails:
                continue
            emails.add(email)

            if validate_email(email, verify=True):
                good_emails.write(email + '\n')
            else:
                bad_emails.write(email + '\n')

if __name__ == "__main__":
    os.system('cls')
    verify_emails("emails.txt", "good_emails.txt", "bad_emails.txt")
I expect contacting SMTP servers to be by far the most expensive part of my program when emails.txt contains a large number of lines (>1k). Using some form of parallel or asynchronous I/O should speed this up a lot, since I can wait for multiple servers to respond at once instead of waiting sequentially.
As far as I have read:
Asynchronous I/O operates by queuing a request for I/O to the file
descriptor, tracked independently of the calling process. For a file
descriptor that supports asynchronous I/O (raw disk devices
typically), a process can call aio_read() (for instance) to request a
number of bytes be read from the file descriptor. The system call
returns immediately, whether or not the I/O has completed. Some time
later, the process then polls the operating system for the completion
of the I/O (that is, buffer is filled with data).
To be honest, I didn't quite understand how to implement async I/O in my program. Can anybody take a little time and explain the whole process to me?
EDIT, as PArakleta suggested:
from validate_email import validate_email
import time
import os
from multiprocessing import Pool
import itertools

def validate_map(e):
    return (validate_email(e.strip(), verify=True), e)

seen_emails = set()
def unique(e):
    if e in seen_emails:
        return False
    seen_emails.add(e)
    return True

def verify_emails(email_path, good_filepath, bad_filepath):
    good_emails = open(good_filepath, 'w+')
    bad_emails = open(bad_filepath, 'w+')

    with open(email_path, "r") as f:
        for result in Pool().imap_unordered(validate_map,
                                            itertools.ifilter(unique, f)):
            (good, email) = result
            if good:
                good_emails.write(email)
            else:
                bad_emails.write(email)
    good_emails.close()
    bad_emails.close()

if __name__ == "__main__":
    os.system('cls')
    verify_emails("emails.txt", "good_emails.txt", "bad_emails.txt")
You're asking the wrong question
Having looked at the validate_email package, your real problem is that you're not efficiently batching your results. You should only be doing the MX lookup once per domain and then only connect to each MX server once, go through the handshake, and then check all of the addresses for that server in a single batch. Thankfully the validate_email package does the MX result caching for you, but you still need to group the email addresses by server to batch the query to the server itself.
You need to edit the validate_email package to implement batching, and then probably give a thread to each domain using the actual threading library rather than multiprocessing.
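The grouping step on its own might look like the sketch below. It only covers deduplicating and bucketing addresses by domain; the per-server batched check is not shown because it depends on changes to validate_email. `group_by_domain` is a hypothetical helper name, and grouping by the literal domain text is a simplification, since several domains can share the same MX server.

```python
from collections import defaultdict

def group_by_domain(addresses):
    """Return {domain: [addresses]} with duplicates and blank lines dropped."""
    groups = defaultdict(list)
    seen = set()
    for raw in addresses:
        address = raw.strip()
        if not address or address in seen:
            continue
        seen.add(address)
        # Everything after the last "@" is treated as the domain.
        _, sep, domain = address.rpartition("@")
        key = domain.lower() if sep else ""   # malformed addresses share the "" key
        groups[key].append(address)
    return dict(groups)
```

Each resulting list could then be handed to one thread, so that a single connection per server handles the whole batch.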
It's always important to profile your program if it's slow and figure out where it is actually spending the time rather than trying to apply optimisation tricks blindly.
The requested solution
IO is already asynchronous if you are using buffered IO and your use case fits with the OS buffering. The only place you could potentially get some advantage is in read-ahead but Python already does this if you use the iterator access to a file (which you are doing). AsyncIO is an advantage to programs that are moving large amounts of data and have disabled the OS buffers to prevent copying the data twice.
You need to actually profile/benchmark your program to see if it has any room for improvement. If your disks aren't already throughput bound then there is a chance to improve the performance by parallel execution of the processing of each email (address?). The easiest way to check this is probably to check to see if the core running your program is maxed out (i.e. you are CPU bound and not IO bound).
If you are CPU bound then you need to look at threading. Unfortunately Python threading doesn't work in parallel unless you have non-Python work to be done so instead you'll have to use multiprocessing (I'm assuming validate_email is a Python function).
How exactly you proceed depends on where the bottlenecks in your program are and how much of a speed up you need to get to the point where you are IO bound (since you cannot actually go any faster than that you can stop optimising when you hit that point).
The emails set object is hard to share because you'll need to lock around it so it's probably best that you keep that in one thread. Looking at the multiprocessing library the easiest mechanism to use is probably Process Pools.
Using this you would need to wrap your file iterable in an itertools.ifilter which discards duplicates, and then feed this into a Pool.imap_unordered and then iterate that result and write into your two output files.
Something like:
with open(email_path) as f:
    for result in Pool().imap_unordered(validate_map,
                                        itertools.ifilter(unique, f)):
        (good, email) = result
        if good:
            good_emails.write(email)
        else:
            bad_emails.write(email)
The validate_map function should be something simple like:
def validate_map(e):
    return (validate_email(e.strip(), verify=True), e)
The unique function should be something like:
seen_emails = set()
def unique(e):
    if e in seen_emails:
        return False
    seen_emails.add(e)
    return True
ETA: I just realised that validate_email is a library which actually contacts SMTP servers. Given that it's not busy in Python code you can use threading. The threading API though is not as convenient as the multiprocessing library but you can use multiprocessing.dummy to have a thread based Pool.
If you are CPU bound then it's not really worth having more threads/processes than cores but since your bottleneck is network IO you can benefit from many more threads/processes. Since processes are expensive you want to swap to threads and then crank up the number running in parallel (although you should be polite not to DOS-attack the servers you are connecting to).
Consider from multiprocessing.dummy import Pool as ThreadPool and then call ThreadPool(processes=32).imap_unordered().
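Put together, a thread-based version could look roughly like the following sketch. `fake_validate` is a hypothetical stand-in for `validate_email(address, verify=True)` so that the snippet runs without network access, and `partition` and `make_unique_filter` are names made up here. On Python 3 the built-in `filter` is lazy and takes the place of `itertools.ifilter`.

```python
from multiprocessing.dummy import Pool as ThreadPool

def fake_validate(address):
    # Hypothetical stand-in for validate_email(address, verify=True).
    return "@" in address

def validate_map(line):
    address = line.strip()
    return fake_validate(address), address

def make_unique_filter():
    seen = set()
    def unique(line):
        # The set is only touched while the input is iterated, not by the workers.
        address = line.strip()
        if not address or address in seen:
            return False
        seen.add(address)
        return True
    return unique

def partition(lines, workers=32):
    """Return (good, bad) address lists, validated by a pool of threads."""
    good, bad = [], []
    with ThreadPool(processes=workers) as pool:
        unique_lines = filter(make_unique_filter(), lines)
        # Results arrive in completion order, not input order.
        for ok, address in pool.imap_unordered(validate_map, unique_lines):
            (good if ok else bad).append(address)
    return good, bad
```

The `workers` value is a knob to tune: larger values overlap more network waits, but also put more simultaneous load on the mail servers being queried.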
I need to do a lot of webscraping from domains stored in a .txt file (about 50 MB size).
I want to do it multi-threaded. Hence I am loading a number of entries into a Python list and process each with threads.
Example:
import threading

biglist = ['google.com','facebook.com','apple.com']
threads = [threading.Thread(target=fetch_url, args=(domain,))
           for domain in biglist]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
It works but it seems to me that it's not very efficient, as there is a lot of memory usage and it takes a lot of time to complete.
What better ways are there to achieve what I'm doing?
Don't use lists/threads, but a queue/processes instead.
If you know Redis I suggest RQ (http://python-rq.org) - I'm doing the same thing and works quite nicely.
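Without Redis, the underlying point (a fixed number of workers instead of one thread per list entry, and no need to hold the whole file in a list) can be approximated with the standard library. The sketch below is one such approximation, not RQ's API: `batches` and `scrape_all` are hypothetical names, and `fetch` is whatever callable does the actual scraping.

```python
import itertools
from concurrent.futures import ThreadPoolExecutor

def batches(lines, size):
    """Yield lists of at most `size` stripped, non-empty lines."""
    cleaned = (line.strip() for line in lines if line.strip())
    while True:
        batch = list(itertools.islice(cleaned, size))
        if not batch:
            return
        yield batch

def scrape_all(lines, fetch, workers=8, batch_size=100):
    """Call fetch(domain) for every line, with only one batch pending at a time."""
    results = []
    with ThreadPoolExecutor(max_workers=workers) as executor:
        for batch in batches(lines, batch_size):
            # map() yields results in input order; extend() waits for the batch.
            results.extend(executor.map(fetch, batch))
    return results
```

Because a file object is iterated lazily, something like `scrape_all(open_file, fetch_url)` would read the domain file batch by batch, with at most `workers` threads alive regardless of how many domains there are.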
You are using the wrong library. threading.Thread does not really benefit from multiple processors as it is blocked by the Global Interpreter Lock.
From the documentation of the threading module (c.f.):
CPython implementation detail: In CPython, due to the Global Interpreter Lock, only one thread can execute Python code at once (even though certain performance-oriented libraries might overcome this limitation). If you want your application to make better use of the computational resources of multi-core machines, you are advised to use multiprocessing. However, threading is still an appropriate model if you want to run multiple I/O-bound tasks simultaneously.
I suggest you use a process pool from the multiprocessing module and map() to compute the results in parallel. In that case it doesn't make sense to use more processes than you have processors.
From the documentation of the multiprocessing module (c.f.):
multiprocessing is a package that supports spawning processes using an API similar to the threading module. The multiprocessing package offers both local and remote concurrency, effectively side-stepping the Global Interpreter Lock by using subprocesses instead of threads. Due to this, the multiprocessing module allows the programmer to fully leverage multiple processors on a given machine. It runs on both Unix and Windows.
Example:
from multiprocessing import Pool

number_of_processors = 3
data = range(10)

def func(x):
    print("processing %d" % x)
    return x * x

if __name__ == "__main__":
    pool = Pool(number_of_processors)
    ret = pool.map(func, data)
    print(ret)
The output is:
$ python test.py
processing 0
processing 1
processing 3
processing 2
processing 4
processing 5
processing 6
processing 7
processing 8
processing 9
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
Adding to moooeeeep's answer.
There is another way of handling a lot of connections in one thread, without spawning expensive processes. Gevent has an API similar to multiprocessing/multithreading.
Docs and tutorials:
http://www.gevent.org/intro.html
http://sdiehl.github.io/gevent-tutorial/
Also there is a python framework for scraping urls:
http://scrapy.org/
Example from the docs of async/sync URL fetching:
import gevent.monkey
gevent.monkey.patch_socket()

import gevent
import urllib2
import simplejson as json

def fetch(pid):
    response = urllib2.urlopen('http://json-time.appspot.com/time.json')
    result = response.read()
    json_result = json.loads(result)
    datetime = json_result['datetime']

    print('Process %s: %s' % (pid, datetime))
    return json_result['datetime']

def synchronous():
    for i in range(1,10):
        fetch(i)

def asynchronous():
    threads = []
    for i in range(1,10):
        threads.append(gevent.spawn(fetch, i))
    gevent.joinall(threads)

print('Synchronous:')
synchronous()

print('Asynchronous:')
asynchronous()
As almost everyone is aware when they first look at threading in Python, there is the GIL that makes life miserable for people who actually want to do processing in parallel - or at least give it a chance.
I am currently looking at implementing something like the Reactor pattern. Effectively I want to listen for incoming socket connections on one thread-like, and when someone tries to connect, accept that connection and pass it along to another thread-like for processing.
I'm not (yet) sure what kind of load I might be facing. I know there is currently a 2MB cap set up on incoming messages. Theoretically we could get thousands per second (though I don't know if practically we've seen anything like that). The amount of time spent processing a message isn't terribly important, though obviously quicker would be better.
I was looking into the Reactor pattern, and developed a small example using the multiprocessing library that (at least in testing) seems to work just fine. However, now/soon we'll have the asyncio library available, which would handle the event loop for me.
Is there anything that could bite me by combining asyncio and multiprocessing?
You should be able to safely combine asyncio and multiprocessing without too much trouble, though you shouldn't be using multiprocessing directly. The cardinal sin of asyncio (and any other event-loop based asynchronous framework) is blocking the event loop. If you try to use multiprocessing directly, any time you block to wait for a child process, you're going to block the event loop. Obviously, this is bad.
The simplest way to avoid this is to use BaseEventLoop.run_in_executor to execute a function in a concurrent.futures.ProcessPoolExecutor. ProcessPoolExecutor is a process pool implemented using multiprocessing.Process, but asyncio has built-in support for executing a function in it without blocking the event loop. Here's a simple example:
import time
import asyncio
from concurrent.futures import ProcessPoolExecutor

def blocking_func(x):
    time.sleep(x)  # Pretend this is expensive calculations
    return x * 5

@asyncio.coroutine
def main():
    #pool = multiprocessing.Pool()
    #out = pool.apply(blocking_func, args=(10,))  # This blocks the event loop.
    executor = ProcessPoolExecutor()
    out = yield from loop.run_in_executor(executor, blocking_func, 10)  # This does not
    print(out)

if __name__ == "__main__":
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main())
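The example above uses generator-based coroutines (`yield from`), a style that was removed in Python 3.11. On Python 3.7 or newer, the same idea might be written with `async`/`await` as in the sketch below. Two things here are choices made for this sketch and not part of the answer: the helper names `run_blocking` and `demo`, and the use of a `ThreadPoolExecutor` so the snippet runs the same way on every platform. For CPU-bound work a `ProcessPoolExecutor` can be passed to `run_in_executor` in the same way, as the answer does.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

def blocking_func(x):
    time.sleep(0.05)  # Pretend this is an expensive, blocking call
    return x * 5

async def run_blocking(executor, x):
    loop = asyncio.get_running_loop()
    # The event loop stays free while blocking_func runs in the executor.
    return await loop.run_in_executor(executor, blocking_func, x)

async def main(executor):
    # Both blocking calls overlap instead of running back to back.
    return await asyncio.gather(run_blocking(executor, 2),
                                run_blocking(executor, 3))

def demo():
    with ThreadPoolExecutor(max_workers=2) as executor:
        return asyncio.run(main(executor))
```

`asyncio.gather` returns the results in the order the awaitables were passed in, regardless of which call finishes first.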
For the majority of cases, this function alone is good enough. If you find yourself needing other constructs from multiprocessing, like Queue, Event, Manager, etc., there is a third-party library called aioprocessing (full disclosure: I wrote it), that provides asyncio-compatible versions of all the multiprocessing data structures. Here's an example demoing that:
import time
import asyncio
import aioprocessing
import multiprocessing

def func(queue, event, lock, items):
    with lock:
        event.set()
        for item in items:
            time.sleep(3)
            queue.put(item+5)
    queue.close()

@asyncio.coroutine
def example(queue, event, lock):
    l = [1,2,3,4,5]
    p = aioprocessing.AioProcess(target=func, args=(queue, event, lock, l))
    p.start()
    while True:
        result = yield from queue.coro_get()
        if result is None:
            break
        print("Got result {}".format(result))
    yield from p.coro_join()

@asyncio.coroutine
def example2(queue, event, lock):
    yield from event.coro_wait()
    with (yield from lock):
        yield from queue.coro_put(78)
        yield from queue.coro_put(None) # Shut down the worker

if __name__ == "__main__":
    loop = asyncio.get_event_loop()
    queue = aioprocessing.AioQueue()
    lock = aioprocessing.AioLock()
    event = aioprocessing.AioEvent()
    # asyncio.async() was renamed to asyncio.ensure_future()
    tasks = [
        asyncio.ensure_future(example(queue, event, lock)),
        asyncio.ensure_future(example2(queue, event, lock)),
    ]
    loop.run_until_complete(asyncio.wait(tasks))
    loop.close()
Yes, there are quite a few bits that may (or may not) bite you.
When you run something like asyncio it expects to run on one thread or process. This does not (by itself) work with parallel processing. You somehow have to distribute the work while leaving the IO operations (specifically those on sockets) in a single thread/process.
While your idea to hand off individual connections to a different handler process is nice, it is hard to implement. The first obstacle is that you need a way to pull the connection out of asyncio without closing it. The next obstacle is that you cannot simply send a file descriptor to a different process unless you use platform-specific (probably Linux) code from a C-extension.
Note that the multiprocessing module is known to create a number of threads for communication. Most of the time when you use communication structures (such as Queues), a thread is spawned. Unfortunately those threads are not completely invisible. For instance they can fail to tear down cleanly (when you intend to terminate your program), but depending on their number the resource usage may be noticeable on its own.
If you really intend to handle individual connections in individual processes, I suggest examining different approaches. For instance you can put a socket into listen mode and then simultaneously accept connections from multiple worker processes in parallel. Once a worker is finished processing a request, it can go accept the next connection, so you still use fewer resources than forking a process for each connection. Spamassassin and Apache (mpm prefork) can use this worker model for instance. It might end up easier and more robust depending on your use case. Specifically you can make your workers die after serving a configured number of requests and be respawned by a master process, thereby eliminating much of the negative effects of memory leaks.
Based on @dano's answer above I wrote this function to replace places where I used to use multiprocess pool + map.
import asyncio
from concurrent.futures import ProcessPoolExecutor
from typing import Callable

def asyncio_friendly_multiproc_map(fn: Callable, l: list):
    """
    This is designed to replace the use of this pattern:

        with multiprocessing.Pool(5) as p:
            results = p.map(analyze_day, list_of_days)

    By letting caller drop in replace:

        asyncio_friendly_multiproc_map(analyze_day, list_of_days)
    """
    tasks = []
    with ProcessPoolExecutor(5) as executor:
        for e in l:
            tasks.append(asyncio.get_event_loop().run_in_executor(executor, fn, e))
        res = asyncio.get_event_loop().run_until_complete(asyncio.gather(*tasks))
    return res
See PEP 3156, in particular the section on Thread interaction:
http://www.python.org/dev/peps/pep-3156/#thread-interaction
This documents clearly the new asyncio methods you might use, including run_in_executor(). Note that the Executor is defined in concurrent.futures, I suggest you also have a look there.
I have a very simple script that monitors a file transfer progress, comparing its actual size with the target then calculating its hash, comparing with the desired hash and firing up a few extra things when everything seems alright.
I've replaced the tool used for the file transfers (wget) with deluged, which has a neat api to integrate with.
Instead of comparing the file progress and the hashes, I now only need to know when deluged has finished downloading the files. To achieve that, I was able to modify this script to my needs, but I'm stuck trying to wrap my head around the Twisted framework, which deluged makes use of.
To try getting over it, I grabbed one sample script from twisted deferred documentation, wrapped a class around it and attempted to use the same concept I'm using on this script I mentioned.
Now, I don't know exactly what to do with the reactor object, since it's basically a blocking loop that can't be restarted.
This is my sample code I'm working with:
from twisted.internet import reactor, defer
import time

class DummyDataGetter:
    done = False
    result = 0

    def getDummyData(self, x):
        d = defer.Deferred()
        # simulate a delayed result by asking the reactor to fire the
        # Deferred in 2 seconds time with the result x * 3
        reactor.callLater(2, d.callback, x * 3)
        return d

    def assignResult(self, d):
        """
        Data handling function to be added as a callback: handles the
        data by printing the result
        """
        self.result = d
        self.done = True
        reactor.stop()

    def run(self):
        d = self.getDummyData(3)
        d.addCallback(self.assignResult)
        reactor.run()

getter = DummyDataGetter()
getter.run()
while not getter.done:
    time.sleep(0.5)
print(getter.result)

# then somewhere else I want to get dummy data again
getter = DummyDataGetter()
getter.run()  # this throws an exception of type error.ReactorNotRestartable
while not getter.done:
    time.sleep(0.5)
print(getter.result)
My questions are:
Should the reactor be started in another thread to prevent it from blocking the code?
If so, how would I add more callbacks to this reactor living in a separate thread? Simply by doing something similar to reactor.callLater(2, d.callback, x * 3) from my main thread?
If not, what is the technique to overcome this problem of not being able to start/stop the reactor two or more times in the same process?
OK, the easiest approach I found is to simply have a separate script called using subprocess.Popen, dump the statuses of the torrents and anything else needed to stdout (serialized as JSON), and pipe that into the calling script.
Way less traumatic than learning Twisted, but of course far from optimal.
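A minimal sketch of that workaround, assuming the child prints a single JSON document to stdout and then exits. The child source below is a hypothetical placeholder that just prints canned data. In the real setup it would be the script that talks to deluged and runs the reactor once. `subprocess.run` is used here as a convenience wrapper around `Popen`.

```python
import json
import subprocess
import sys

# Hypothetical child script: a placeholder for the Twisted/deluge client,
# which would run the reactor once, print JSON to stdout and exit.
CHILD_SOURCE = """
import json
statuses = {"example.iso": {"progress": 100.0, "finished": True}}
print(json.dumps(statuses))
"""

def fetch_statuses(timeout=30):
    """Run the child in a fresh interpreter and parse its JSON output."""
    completed = subprocess.run(
        [sys.executable, "-c", CHILD_SOURCE],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        timeout=timeout,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return json.loads(completed.stdout.decode("utf-8"))
```

Every call starts a new process with its own reactor, so the function can be called repeatedly without running into ReactorNotRestartable, at the cost of one interpreter start-up per call.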