select() on socket and another event mechanism - python

I want to read from a socket when data is available and in the same thread I want to read items from a message queue like this:
while True:
    ready = select.select([some_socket, some_messagequeue], [], [])[0]
    if some_socket in ready:
        read_and_handle_data_from_socket()
    if some_messagequeue in ready:
        read_and_handle_data_from_messagequeue()
In other words: I want select() to return as soon as the thread receives a message via some process-internal messaging system.
From what I have read so far there seem to be two approaches: selecting on the message queue itself, or creating an os.pipe() for aborting the select(), but I haven't found a nice implementation yet.
Approach 1: There seem to be two Queue implementations: multiprocessing.Queue and queue.Queue (Python 3). multiprocessing.Queue has a _reader member which can be used with select(), but only queue.Queue allows arbitrary data structures to be queued without having to mess with pickling.
Question: Is there a way to use select() on a queue.Queue as well?
Approach 2 would look like this:
import os, queue, select, threading, time

r, w = os.pipe()
some_socket = 67  # FD of some other socket

q = queue.Queue()

def read_fd():
    while True:
        ready = select.select([r, some_socket], [], [])[0]
        if r in ready:
            os.read(r, 100)
            print('handle task: ', q.get())
        if some_socket in ready:
            print('socket has data')

threading.Thread(target=read_fd, daemon=True).start()

while True:
    q.put('some task')
    os.write(w, b'x')
    print('scheduled task')
    time.sleep(1)
And this works, but in my eyes this code is quite cumbersome and not very pythonic. Question: is there a nicer way to just send 'signals' through an os.pipe() (or any other mechanism)?
Approach 3..N: Question: how would you solve this?
I know libraries like ZeroMQ, but since I'm working on an embedded project I'd prefer a solution that comes with the native Python (3.3) distribution. And I think there should be a solution about as short as the first example; after all I just want to abort the select() if something happens on the message queue.

Approach 3: have two threads. Thread 1 waits on select(), thread 2 waits on the message queue, and a mutex prevents the two handlers from firing simultaneously. Why have threads if you aren't going to use them?
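A minimal sketch of that idea, assuming an already-connected socket; all names here are made up for illustration:
import queue, select, threading

handler_lock = threading.Lock()   # keeps the two handlers from running at the same time
msg_queue = queue.Queue()

def socket_worker(sock):
    while True:
        ready = select.select([sock], [], [])[0]
        if sock in ready:
            with handler_lock:
                data = sock.recv(4096)
                print('socket data:', data)

def queue_worker():
    while True:
        task = msg_queue.get()        # blocks until a message arrives
        with handler_lock:
            print('handle task:', task)

# threading.Thread(target=socket_worker, args=(some_socket,), daemon=True).start()
# threading.Thread(target=queue_worker, daemon=True).start()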

You can create a pipe pair of file descriptors, signal a queue push by writing to the write end of it, and wait for queue activity in the same select() on the read end of the pipe.
On Linux specifically, there is also the eventfd(2) system call that can be used for the same purpose instead of pipe(2) (this might be useful).
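A minimal sketch of how the pipe trick can be wrapped so that the queue itself becomes selectable (the class and its method names are made up, not part of the standard library):
import os, queue, select

class SelectableQueue:
    """A queue.Queue paired with a pipe so that select() can wait on it."""

    def __init__(self):
        self._queue = queue.Queue()
        self._read_fd, self._write_fd = os.pipe()

    def fileno(self):
        # select() accepts any object with a fileno() method
        return self._read_fd

    def put(self, item):
        self._queue.put(item)
        os.write(self._write_fd, b'x')   # wake up whoever is selecting

    def get(self):
        os.read(self._read_fd, 1)        # consume one wake-up byte
        return self._queue.get()

# usage:
# mq = SelectableQueue()
# ready = select.select([some_socket, mq], [], [])[0]
# if mq in ready:
#     task = mq.get()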

How to make a python thread dependent on the completion of another thread?

Basically I want to make something like 15000 GET requests of the form GET www.somewebsite.com/archive/1, www.somewebsite.com/archive/2, ..., and write the content of each to its own file locally. But doing all those in order takes a while, and doing them all with their own thread results in all sorts of IO and HTTP errors. If I do, say, 50 at a time it works fine. What I want to do is create a chunk thread that I spawn 50 threads off of, and then spawn another chunk thread when that one is finished. But I haven't found a way to do this.
I need a way to say "don't execute any more lines until this thread is completed" or a way to queue up threads that get executed asynchronously in order.
Python has a built-in library, multiprocessing, that will allow you to implement simple batch processing with a queue.
import multiprocessing

static_input = range(100)
chunksize = 10

def work(item):
    return "Number " + str(item)

with multiprocessing.Pool() as pool:
    for out in pool.imap_unordered(work, static_input, chunksize):
        print(out)
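Applied to the original use case, the same pattern might look roughly like this. This is only a sketch; the URL scheme, file naming, and pool size are assumptions, not part of the answer above:
import multiprocessing
import urllib.request

urls = ['http://www.somewebsite.com/archive/%d' % i for i in range(1, 15001)]

def fetch(url):
    # download one archive page and return its body
    with urllib.request.urlopen(url) as resp:
        return url, resp.read()

if __name__ == '__main__':
    with multiprocessing.Pool(50) as pool:          # roughly 50 requests in flight at a time
        for url, body in pool.imap_unordered(fetch, urls):
            filename = url.rsplit('/', 1)[-1] + '.html'
            with open(filename, 'wb') as f:
                f.write(body)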
"You need to use join method of Thread object in the end of the script."
This has been stated here by maksim skurydzin.
You might also want to take a look at the multiprocessing class here.

I am not sure why multiprocessing Queue isn't working. What is it doing?

I am using Python's built-in socket and multiprocessing libraries to scan TCP ports of a host. I know my first function works, and I am just trying to make it work with a multiprocessing Queue and Process; not sure where I am going wrong.
If I remove the Queue everything seems to complete, I just actually need to get the results from it.
from multiprocessing import Process, Queue
import socket

def tcp_connect(ip, port_number):
    try:
        scanner = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        scanner.settimeout(0.1)
        scanner.connect((str(ip), port_number))
        scanner.close()
        # put into queue
        q.put(port_number)
    except:
        pass

RESULTS = []
list_of_numbs = list(range(1, 501))

for numb in list_of_numbs:
    # make my queue
    q = Queue()
    p = Process(target=tcp_connect, args=('google', numb))
    p.start()
    # take my results from my queue and append to a list
    RESULTS.append(q.get())
    p.join()

print(RESULTS)
I would just like it to print out port numbers that were open. Right now since it is scanning google.com it should really only return 80 and 443.
EDIT: This would work if I used Pool, but the reason I went to Process and Queue is that the bigger piece of this runs in Django with Celery, and they don't allow daemon processes when executing with Pool.
For work like this, a multiprocessing.Pool would be a better way of handling it.
You don't have to worry about creating Processes and Queues; all that is done for you in the background. Your worker function only has to return a result, and that will be transported to the parent process for you.
I would suggest using multiprocessing.Pool.imap_unordered(), because it starts returning results as soon as it is available.
One thing: the worker function takes only one argument. If you need multiple different arguments for each call, wrap them in a tuple.
If you have arguments that are the same for all calls, use functools.partial.
A slightly more modern approach would be to use an Executor.map() method from concurrent.futures. Since your work consists mainly of socket calls, you could use a ThreadPoolExecutor here, I think. That should be slightly less resource-intensive than a ProcessPoolExecutor.
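A minimal sketch of that suggestion (the host and port range come from the question; the timeout, worker count, and function names are assumptions):
import socket
from concurrent.futures import ThreadPoolExecutor

def check_port(port_number, host='google.com'):
    # Return the port number if a TCP connect succeeds, otherwise None.
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as scanner:
            scanner.settimeout(0.5)
            scanner.connect((host, port_number))
        return port_number
    except OSError:
        return None

if __name__ == '__main__':
    with ThreadPoolExecutor(max_workers=50) as executor:
        open_ports = [p for p in executor.map(check_port, range(1, 501)) if p is not None]
    print(open_ports)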

What kind of problems (if any) would there be combining asyncio with multiprocessing?

As almost everyone is aware when they first look at threading in Python, there is the GIL that makes life miserable for people who actually want to do processing in parallel - or at least give it a chance.
I am currently looking at implementing something like the Reactor pattern. Effectively I want to listen for incoming socket connections on one thread-like, and when someone tries to connect, accept that connection and pass it along to another thread-like for processing.
I'm not (yet) sure what kind of load I might be facing. I know there is currently a 2MB cap set up on incoming messages. Theoretically we could get thousands per second (though I don't know if practically we've seen anything like that). The amount of time spent processing a message isn't terribly important, though obviously quicker would be better.
I was looking into the Reactor pattern, and developed a small example using the multiprocessing library that (at least in testing) seems to work just fine. However, now/soon we'll have the asyncio library available, which would handle the event loop for me.
Is there anything that could bite me by combining asyncio and multiprocessing?
You should be able to safely combine asyncio and multiprocessing without too much trouble, though you shouldn't be using multiprocessing directly. The cardinal sin of asyncio (and any other event-loop based asynchronous framework) is blocking the event loop. If you try to use multiprocessing directly, any time you block to wait for a child process, you're going to block the event loop. Obviously, this is bad.
The simplest way to avoid this is to use BaseEventLoop.run_in_executor to execute a function in a concurrent.futures.ProcessPoolExecutor. ProcessPoolExecutor is a process pool implemented using multiprocessing.Process, but asyncio has built-in support for executing a function in it without blocking the event loop. Here's a simple example:
import time
import asyncio
from concurrent.futures import ProcessPoolExecutor

def blocking_func(x):
    time.sleep(x)  # Pretend this is expensive calculations
    return x * 5

@asyncio.coroutine
def main():
    #pool = multiprocessing.Pool()
    #out = pool.apply(blocking_func, args=(10,))  # This blocks the event loop.
    executor = ProcessPoolExecutor()
    out = yield from loop.run_in_executor(executor, blocking_func, 10)  # This does not
    print(out)

if __name__ == "__main__":
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main())
For the majority of cases, this function alone is good enough. If you find yourself needing other constructs from multiprocessing, like Queue, Event, Manager, etc., there is a third-party library called aioprocessing (full disclosure: I wrote it) that provides asyncio-compatible versions of all the multiprocessing data structures. Here's an example demoing that:
import time
import asyncio
import aioprocessing
import multiprocessing

def func(queue, event, lock, items):
    with lock:
        event.set()
        for item in items:
            time.sleep(3)
            queue.put(item+5)
    queue.close()

@asyncio.coroutine
def example(queue, event, lock):
    l = [1, 2, 3, 4, 5]
    p = aioprocessing.AioProcess(target=func, args=(queue, event, lock, l))
    p.start()
    while True:
        result = yield from queue.coro_get()
        if result is None:
            break
        print("Got result {}".format(result))
    yield from p.coro_join()

@asyncio.coroutine
def example2(queue, event, lock):
    yield from event.coro_wait()
    with (yield from lock):
        yield from queue.coro_put(78)
        yield from queue.coro_put(None)  # Shut down the worker

if __name__ == "__main__":
    loop = asyncio.get_event_loop()
    queue = aioprocessing.AioQueue()
    lock = aioprocessing.AioLock()
    event = aioprocessing.AioEvent()
    tasks = [
        asyncio.async(example(queue, event, lock)),
        asyncio.async(example2(queue, event, lock)),
    ]
    loop.run_until_complete(asyncio.wait(tasks))
    loop.close()
Yes, there are quite a few bits that may (or may not) bite you.
When you run something like asyncio it expects to run on one thread or process. This does not (by itself) work with parallel processing. You somehow have to distribute the work while leaving the IO operations (specifically those on sockets) in a single thread/process.
While your idea to hand off individual connections to a different handler process is nice, it is hard to implement. The first obstacle is that you need a way to pull the connection out of asyncio without closing it. The next obstacle is that you cannot simply send a file descriptor to a different process unless you use platform-specific (probably Linux) code from a C-extension.
Note that the multiprocessing module is known to create a number of threads for communication. Most of the time when you use communication structures (such as Queues), a thread is spawned. Unfortunately those threads are not completely invisible. For instance they can fail to tear down cleanly (when you intend to terminate your program), but depending on their number the resource usage may be noticeable on its own.
If you really intend to handle individual connections in individual processes, I suggest examining different approaches. For instance, you can put a socket into listen mode and then have multiple worker processes accept connections from it in parallel. Once a worker is finished processing a request, it can go accept the next connection, so you still use fewer resources than forking a process for each connection. SpamAssassin and Apache (mpm prefork) use this worker model, for instance. It might end up easier and more robust depending on your use case. Specifically, you can make your workers die after serving a configured number of requests and be respawned by a master process, thereby eliminating much of the negative effects of memory leaks.
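A rough sketch of that shared-listening-socket worker model, assuming a Unix-like system with the default fork start method; the port, worker count, and echo handling are placeholders:
import multiprocessing
import socket

def worker(listen_sock, max_requests=100):
    # Die after a fixed number of requests; a real master process would respawn us.
    for _ in range(max_requests):
        conn, addr = listen_sock.accept()   # workers compete for incoming connections
        with conn:
            data = conn.recv(4096)
            conn.sendall(data)

if __name__ == '__main__':
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind(('0.0.0.0', 8000))
    listener.listen(128)
    workers = [multiprocessing.Process(target=worker, args=(listener,))
               for _ in range(4)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()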
Based on dano's answer above, I wrote this function to replace places where I used to use multiprocessing Pool + map.
import asyncio
from concurrent.futures import ProcessPoolExecutor
from typing import Callable

def asyncio_friendly_multiproc_map(fn: Callable, l: list):
    """
    This is designed to replace the use of this pattern:
        with multiprocessing.Pool(5) as p:
            results = p.map(analyze_day, list_of_days)
    by letting the caller drop in a replacement:
        asyncio_friendly_multiproc_map(analyze_day, list_of_days)
    """
    tasks = []
    with ProcessPoolExecutor(5) as executor:
        for e in l:
            tasks.append(asyncio.get_event_loop().run_in_executor(executor, fn, e))
    res = asyncio.get_event_loop().run_until_complete(asyncio.gather(*tasks))
    return res
See PEP 3156, in particular the section on Thread interaction:
http://www.python.org/dev/peps/pep-3156/#thread-interaction
This clearly documents the new asyncio methods you might use, including run_in_executor(). Note that Executor is defined in concurrent.futures; I suggest you also have a look there.

Why are Python multiprocessing Pipe unsafe?

I don't understand why Pipes are said unsafe when there are multiple senders and receivers.
How can the following code be turned into code using Queues, if this is the case? Queues don't throw EOFError when closed, so my processes can't stop. Should I endlessly send 'poison' messages to tell them to stop (this way, I'm sure all my processes receive at least one poison pill)?
I would like to keep the pipe p1 open until I decide otherwise (here it's when I have sent the 10 messages).
from multiprocessing import Pipe, Process
from random import randint, random
from time import sleep

def job(name, p_in, p_out):
    print(name + ' starting')
    nb_msg = 0
    try:
        while True:
            x = p_in.recv()
            print(name + ' receives ' + x)
            nb_msg = nb_msg + 1
            p_out.send(x)
            sleep(random())
    except EOFError:
        pass
    print(name + ' ending ... ' + str(nb_msg) + ' message(s)')

if __name__ == '__main__':
    p1_in, p1_out = Pipe()
    p2_in, p2_out = Pipe()

    proc = []
    for i in range(3):
        p = Process(target=job, args=(str(i), p1_out, p2_in))
        p.start()
        proc.append(p)

    for x in range(10):
        p1_in.send(chr(97+x))
    p1_in.close()

    for p in proc:
        p.join()

    p1_out.close()
    p2_in.close()

    try:
        while True:
            print(p2_out.recv())
    except EOFError:
        pass
    p2_out.close()
Essentially, the problem is that Pipe is a thin wrapper around a platform-defined pipe object. recv simply repeatedly receives a buffer of bytes until a complete Python object is obtained. If two threads or processes use recv on the same pipe, the reads may interleave, leaving each process with half a pickled object and thus corrupting the data. Queues do proper synchronization between processes, at the expense of more complexity.
As the multiprocessing documentation puts it:
Note that data in a pipe may become corrupted if two processes (or threads) try to read from or write to the same end of the pipe at the same time. Of course there is no risk of corruption from processes using different ends of the pipe at the same time.
You don't have to endlessly send poison pills; one per worker is all you need. Each worker picks up exactly one poison pill before exiting, so there's no danger that a worker will somehow miss the message.
You should also consider using multiprocessing.Pool instead of reimplementing the "worker process" model -- Pool has a lot of methods which make distributing work across multiple threads very easy.
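A minimal sketch of the queue-plus-sentinel version of the question's worker model (the sentinel value, worker count, and names are assumptions):
from multiprocessing import Process, Queue

SENTINEL = None   # the 'poison pill'

def job(name, in_q, out_q):
    nb_msg = 0
    while True:
        x = in_q.get()
        if x is SENTINEL:            # each worker consumes exactly one pill and exits
            break
        nb_msg += 1
        out_q.put(x)
    print(name + ' ending ... ' + str(nb_msg) + ' message(s)')

if __name__ == '__main__':
    in_q, out_q = Queue(), Queue()
    workers = [Process(target=job, args=(str(i), in_q, out_q)) for i in range(3)]
    for w in workers:
        w.start()
    for x in range(10):
        in_q.put(chr(97 + x))
    for _ in workers:                # one poison pill per worker
        in_q.put(SENTINEL)
    for w in workers:
        w.join()
    while not out_q.empty():         # safe to drain here: all workers have exited
        print(out_q.get())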
I don't understand why Pipes are said unsafe when there are multiple senders and receivers.
Consider you put water into a pipe from source A and B simultaneously. On the other end of the pipe, it will be impossible for you to find out which part of the water came from A or B, right? :)
A pipe transports a data stream on the byte level. Without a communication protocol on top of it, it does not know what a message is and therefore can't ensure message integrity. Therefore, it is not only 'unsafe' to use pipes with multiple senders. It is a major design flaw and will most likely lead to communication problems.
Queues, however, are implemented on a higher level. They are designed for communicating messages (or even abstract objects). Queues are made for keeping a message/object self-contained. Multiple sources can put objects into a queue and multiple consumers can pull these objects while being 100 % sure that whatever got into the queue as a unit also comes out of it as a unit.
Edit after quite a while:
I should add that in the byte stream, all bytes are retrieved in the same order as sent (guaranteed). The issue with multiple senders is that the sending order (the order of input) might already be unclear or random, i.e. multiple streams might mix in an unpredictable fashion.
A common queue implementation guarantees that single messages are kept intact, even if there are multiple senders. Messages are retrieved in the order sent, too. With multiple competing senders and without further synchronization mechanisms there is, however, again no guarantee about the order of the input messages.

How to pause and resume a thread using the threading module?

I have a long process that I've scheduled to run in a thread, because otherwise it will freeze the UI in my wxpython application.
I'm using:
threading.Thread(target=myLongProcess).start()
to start the thread and it works, but I don't know how to pause and resume the thread. I looked in the Python docs for the above methods, but wasn't able to find them.
Could anyone suggest how I could do this?
I did some speed tests as well; the time to set the flag and for action to be taken is pleasantly fast: 0.00002 seconds on a slow two-processor Linux box.
Example of thread pause test using set() and clear() events:
import threading
import time

# This function gets called by our thread.. so it basically becomes the thread init...
def wait_for_event(e):
    while True:
        print('\tTHREAD: This is the thread speaking, we are Waiting for event to start..')
        event_is_set = e.wait()
        print('\tTHREAD: WHOOOOOO HOOOO WE GOT A SIGNAL : %s' % event_is_set)
        # or for Python >= 3.6
        # print(f'\tTHREAD: WHOOOOOO HOOOO WE GOT A SIGNAL : {event_is_set}')
        e.clear()

# Main code
e = threading.Event()
t = threading.Thread(name='pausable_thread',
                     target=wait_for_event,
                     args=(e,))
t.start()

while True:
    print('MAIN LOOP: still in the main loop..')
    time.sleep(4)
    print('MAIN LOOP: I just set the flag..')
    e.set()
    print('MAIN LOOP: now Im gonna do some processing')
    time.sleep(4)
    print('MAIN LOOP: .. some more processing im doing yeahhhh')
    time.sleep(4)
    print('MAIN LOOP: ok ready, soon we will repeat the loop..')
    time.sleep(2)
There is no method for other threads to forcibly pause a thread (any more than there is for other threads to kill that thread) -- the target thread must cooperate by occasionally checking appropriate "flags" (a threading.Condition might be appropriate for the pause/unpause case).
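A minimal sketch of that cooperative flag-checking approach, using a threading.Event instead of a Condition for simplicity (all names are made up for illustration):
import threading
import time

unpaused = threading.Event()
unpaused.set()                      # start in the running state

def do_one_chunk_of_work():
    time.sleep(0.1)                 # placeholder for one step of the long task

def worker():
    while True:
        unpaused.wait()             # blocks while the flag is cleared
        do_one_chunk_of_work()

t = threading.Thread(target=worker, daemon=True)
t.start()

unpaused.clear()                    # pause the worker at its next check
unpaused.set()                      # resume it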
If you're on a unix-y platform (anything but Windows, basically), you could use multiprocessing instead of threading -- that is much more powerful, and lets you send signals to the "other process"; SIGSTOP unconditionally pauses a process and SIGCONT continues it (if your process needs to do something right before it pauses, consider also the SIGTSTP signal, which the other process can catch to perform such pre-suspension duties). There may be ways to obtain the same effect on Windows, but I'm not knowledgeable about them, if any.
You can use signals: http://docs.python.org/library/signal.html#signal.pause
To avoid using signals you could use a token passing system. If you want to pause it from the main UI thread you could probably just use a Queue.Queue object to communicate with it.
Just put a message telling the thread to sleep for a certain amount of time onto the queue.
Alternatively you could simply continuously push tokens onto the queue from the main UI thread. The worker should just check the queue every N seconds (0.2 or something like that). When there are no tokens to dequeue, the worker thread will block. When you want it to start again, just start pushing tokens onto the queue from the main thread again.
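A rough sketch of the token idea (names and timings are assumptions):
import queue
import threading
import time

tokens = queue.Queue()

def do_one_chunk_of_work():
    time.sleep(0.1)                 # placeholder for one step of the long task

def worker():
    while True:
        tokens.get()                # blocks until the UI thread supplies a token
        do_one_chunk_of_work()

threading.Thread(target=worker, daemon=True).start()

# In the UI thread: keep feeding tokens while the work should run,
# stop feeding them to pause, start feeding again to resume.
for _ in range(50):
    tokens.put(None)
    time.sleep(0.2)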
The multiprocessing module works fine on Windows. See the documentation here (end of first paragraph):
http://docs.python.org/library/multiprocessing.html
On the wxPython IRC channel, we had a couple fellows trying multiprocessing out and they said it worked. Unfortunately, I have yet to see anyone who has written up a good example of multiprocessing and wxPython.
If you (or anyone else on here) come up with something, please add it to the wxPython wiki page on threading here: http://wiki.wxpython.org/LongRunningTasks
You might want to check that page out regardless as it has several interesting examples using threads and queues.
You might take a look at the Windows API for thread suspension.
As far as I'm aware there is no POSIX/pthread equivalent. Furthermore, I cannot ascertain whether thread handles/IDs are made available from Python. There are also potential issues with Python: since its scheduling is done using the native scheduler, it is unlikely to expect threads to suspend, particularly if threads are suspended while holding the GIL, amongst other possibilities.
I had the same issue. It is more effective to use time.sleep(1800) in the thread loop to pause the thread execution.
e.g.:
MON, TUE, WED, THU, FRI, SAT, SUN = range(7)  # Enumerate days of the week

Thread 1:
def run(self):
    while not self.exit:
        try:
            localtime = time.localtime(time.time())
            # Evaluate stock
            if localtime.tm_hour > 16 or localtime.tm_wday > FRI:
                # do something
                pass
            else:
                print('Waiting to evaluate stocks...')
                time.sleep(1800)
        except:
            print(traceback.format_exc())

Thread 2:
def run(self):
    while not self.exit:
        try:
            localtime = time.localtime(time.time())
            if localtime.tm_hour >= 9 and localtime.tm_hour <= 16:
                # do something
                pass
            else:
                print('Waiting to update stocks indicators...')
                time.sleep(1800)
        except:
            print(traceback.format_exc())
