Asynchronous I/O multiplexing (socket and inter-thread) in Python

I would like to have a Python thread wait either for data coming from one socket (serial port, TCP/IP, etc.), or for data coming from another thread.
And I would like a portable Windows-and-Linux solution.
What I am looking for is similar to select.select() but I believe I cannot use select.select() on Windows for inter-thread communication.
Is this easily possible?

Are you certain that it is necessary to use threads? Are you using some foreign API that requires their use?
Anyway, using Twisted, you can easily and portably listen on any file-like object (including serial ports and TCP sockets). Additionally, provided that you do in fact need to use threads, Twisted provides several tools for working with them. The simplest method, given your description, would be to call reactor.callFromThread. If you want to get data back and not simply call the function in the reactor thread, Twisted provides twisted.internet.threads.blockingCallFromThread, which will block until the function in the reactor thread returns (or, if it returns a Deferred, until that Deferred fires).
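As a minimal sketch of that pattern (the worker thread and the handle_data function are made up for this example):

from twisted.internet import reactor
from twisted.internet.threads import blockingCallFromThread

def handle_data(data):
    # Runs in the reactor thread, so it may safely touch Twisted objects.
    print("received from worker:", data)
    return len(data)

def worker():
    # Fire-and-forget: schedule handle_data() in the reactor thread.
    reactor.callFromThread(handle_data, b"hello")
    # Or block this thread until the reactor thread returns a result.
    result = blockingCallFromThread(reactor, handle_data, b"more data")
    print("reactor returned", result)

reactor.callInThread(worker)  # run worker() in Twisted's thread pool
reactor.run()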

Is Flask-SocketIO's emit function thread safe?

I have a Flask-SocketIO application. Can I safely call socketio.emit() from different threads? Is socketio.emit() atomic like the normal socket.send()?
The socketio.emit() function is thread-safe, or I should say that it is intended to be thread-safe, as there is currently one open issue related to this. Note that 'thread' in this context means a supported threading model. Most people use Flask-SocketIO in conjunction with eventlet or gevent in production, so in those contexts a thread means a "green" thread.
The open issue is related to using a message queue, which is necessary when you have multiple servers. In that setup, the accesses to the queue are not thread-safe at this time. This is a bug that needs to be fixed, but as a workaround, you can create a different socketio object per thread.
On the second question, regarding whether socketio.emit() is atomic, the answer is no. This is not a simple socket write operation. The payload needs to be formatted in a certain way to comply with the Socket.IO protocol, and then, depending on the selected transport (long-polling or WebSocket), the write happens in a completely different way.
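As a minimal sketch of emitting from a background task (the event name 'status' and the payload are invented for this example):

from flask import Flask
from flask_socketio import SocketIO

app = Flask(__name__)
socketio = SocketIO(app)

def background_sender():
    while True:
        socketio.sleep(5)  # cooperative sleep, friendly to eventlet/gevent
        socketio.emit('status', {'alive': True})  # emit from a background task

@socketio.on('connect')
def on_connect():
    socketio.start_background_task(background_sender)

if __name__ == '__main__':
    socketio.run(app)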

Handling Incoming Data from Multiple Sockets in Python

Background:
I have a current implementation that receives data from about 120 different socket connections in Python. In my current implementation, I handle each of these separate socket connections with a dedicated thread. Each of these threads parses the data and eventually stores it within a shared, locked dictionary. These sockets DO NOT have uniform data rates; some sockets get more data than others.
Question:
Is this the best way to handle incoming data in Python, or does Python have a better way of handling multiple sockets per thread?
Using an asynchronous approach will make you much happier. For an example of a well-done implementation of this in a well-known application, Tornado is perfect. You can easily use Tornado's ioloop for things other than web servers, too.
There are alternative libraries such as gevent, but I believe Tornado is a better place to look first, since it provides both the loop and a web server implemented on top of it as a great example of how to use the loop well.
If you're using threads, that's basically the way you'd go about it.
The alternative is to use one of the various asynchronous networking libraries out there, such as Twisted, Tornado, or GEvent.
As mentioned in your Asynchronous UDP Socket Reading question, asyncoro can be used to process many asynchronous sockets efficiently. Another benefit of asyncoro for your problem is that you don't need to worry about locking the shared dictionary, as with asyncoro at most one coroutine is executing at any time and there is no forced preemption.
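To make the single-threaded, event-driven approach concrete (independent of Tornado or asyncoro), here is a minimal sketch using Python's standard selectors module; the port and the per-connection handling are placeholders:

import selectors
import socket

sel = selectors.DefaultSelector()
shared_data = {}  # no lock needed: everything runs in a single thread

def accept(server_sock):
    conn, addr = server_sock.accept()
    conn.setblocking(False)
    sel.register(conn, selectors.EVENT_READ, read)

def read(conn):
    data = conn.recv(4096)
    if data:
        shared_data[conn.fileno()] = data  # parse/store per-connection data here
    else:
        sel.unregister(conn)
        conn.close()

server = socket.socket()
server.bind(('0.0.0.0', 9000))  # placeholder port
server.listen()
server.setblocking(False)
sel.register(server, selectors.EVENT_READ, accept)

while True:
    for key, _ in sel.select():
        key.data(key.fileobj)  # dispatch to accept() or read()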

Should I use epoll or just blocking recv in threads?

I'm trying to write a scalable custom web server.
Here's what I have so far:
The main loop and request interpreter are in Cython. The main loop accepts connections and assigns the sockets to one of the processes in the pool (it has to be processes; threads won't get any benefit from multi-core hardware because of the GIL).
Each process has a thread pool. The process assigns the socket to a thread.
The thread calls recv (blocking) on the socket and waits for data. When some shows up, it gets piped into the request interpreter, and then sent via WSGI to the application running in that thread.
Now I've heard about epoll and am a little confused. Is there any benefit to using epoll to get socket data and then pass that directly to the processes? Or should I just go the usual route of having each thread wait on recv?
PS: What is epoll actually used for? It seems like multithreading and blocking fd calls would accomplish the same thing.
If you're already using multiple threads, epoll doesn't offer you much additional benefit.
The point of epoll is that a single thread can listen for activity on many file descriptors simultaneously (and respond to events on each as they occur), and thus provide event-driven multitasking without requiring the spawning of additional threads. Threads are relatively cheap (compared to spawning processes), but each one does require some overhead (after all, each has to maintain a call stack).
If you wanted to, you could rewrite your pool processes to be single-threaded using epoll, which would reduce your overall thread usage count, but of course you'd have to consider whether that's something you care about or not - in general, for low numbers of simultaneous requests on each worker, the overhead of spawning threads wouldn't matter, but if you want each worker to be able to handle 1000s of open connections, that overhead can become significant (and that's where epoll shines).
But...
What you're describing sounds suspiciously like you're basically reinventing the wheel - your:
main loop and request interpreter
pool of processes
sound almost exactly like:
nginx (or any other load balancer/reverse proxy)
A pre-forking tornado app
Tornado is a single-threaded Python web server module using epoll, and it has pre-forking capability built in (meaning that it spawns multiple copies of itself as separate processes, effectively creating a process pool). Tornado is based on the technology created to power FriendFeed - they needed a way to handle huge numbers of open connections for long-polling clients looking for new real-time updates.
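For reference, a minimal pre-forking Tornado sketch (using the older IOLoop.instance() API; the handler and port are placeholders):

import tornado.ioloop
import tornado.web
from tornado.httpserver import HTTPServer

class MainHandler(tornado.web.RequestHandler):
    def get(self):
        self.write("hello")

app = tornado.web.Application([(r"/", MainHandler)])
server = HTTPServer(app)
server.bind(8888)   # placeholder port
server.start(0)     # fork one worker process per CPU core
tornado.ioloop.IOLoop.instance().start()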
If you're doing this as a learning process, then by all means, reinvent away! It's a great way to learn. But if you're actually trying to build an application on top of these kinds of things, I'd highly recommend considering using the existing, stable, communally-developed projects - it'll save you a lot of time, false starts, and potential gotchas.
(P.S. I approve of your avatar. <3)
The epoll function (and the other functions in the same family, poll and select) allows you to write single-threaded networking code that manages multiple network connections. Since there is no threading, there is no need for synchronisation as would be required in a multi-threaded program (this can be difficult to get right).
On the other hand, you'll need to have an explicit state machine for each connection. In a threaded program, this state machine is implicit.
Those functions just offer another way to multiplex multiple connections in a process. Sometimes it is easier not to use threads; other times you're already using threads, and thus it is easier just to use blocking sockets (which release the GIL in Python).
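To illustrate the explicit per-connection state machine, here is a minimal Linux-only sketch using select.epoll directly; the port is a placeholder:

import select
import socket

server = socket.socket()
server.bind(('0.0.0.0', 8000))  # placeholder port
server.listen()
server.setblocking(False)

ep = select.epoll()
ep.register(server.fileno(), select.EPOLLIN)
connections = {}  # fd -> socket: the explicit per-connection state

while True:
    for fd, event in ep.poll():
        if fd == server.fileno():
            conn, _ = server.accept()
            conn.setblocking(False)
            ep.register(conn.fileno(), select.EPOLLIN)
            connections[conn.fileno()] = conn
        elif event & select.EPOLLIN:
            conn = connections[fd]
            data = conn.recv(4096)
            if not data:  # peer closed: tear down this connection's state
                ep.unregister(fd)
                conn.close()
                del connections[fd]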

What's the Tornado ioloop, and what is Tornado's workflow?

I want to know Tornado's internal workflow. I have seen this article, and it's great, but there are some things I just can't figure out.
Within ioloop.py, there is a function like this:
def add_handler(self, fd, handler, events):
    """Registers the given handler to receive the given events for fd."""
    self._handlers[fd] = handler
    self._impl.register(fd, events | self.ERROR)
So what does this mean? Will every request trigger add_handler, or is it just triggered once at initialization?
Will every socket connection generate a file descriptor, or is it just generated once?
What's the relationship between ioloop and iostream?
How does httpserver work with ioloop and iostream?
Is there any workflow chart, so I can see it clearly?
Sorry for all these questions, I'm just confused.
Any link, suggestion, or tip helps. Many thanks :)
I'll see if I can answer your questions in order:
Here _impl is whichever socket polling mechanism is available, epoll on Linux, select on Windows. So self._impl.register(fd, events | self.ERROR) passes the "wait for some event" request to the underlying operating system, also specifically including error events.
When run, the HTTPServer will register sockets to accept connections on, using IOLoop.add_handler(). When connections are accepted, they will generate more communication sockets, which will likely also add event handlers via an IOStream, which may also call add_handler(). So new handlers will be added both at the beginning, and as connections are received.
Yes, every new socket connection will have a unique file descriptor. The original socket the HTTPServer is listening on should keep its file descriptor though. File descriptors are provided by the operating system.
The IOLoop handles events to do with the sockets, for example whether they have data available to be read, whether they may be written to, and whether an error has occurred. By using operating system services such as epoll or select, it can do this very efficiently.
An IOStream handles streaming data over a single connection, and uses the IOLoop to do this asynchronously. For example an IOStream can read as much data as is available, then use IOLoop.add_handler() to wait until more data is available.
On listen(), the HTTPServer creates a socket which it uses to listen for connections using the IOLoop. When a connection is obtained, it uses socket.accept() to create a new socket which is then used for communicating with the client using a new HTTPConnection.
The HTTPConnection uses an IOStream to transfer data to or from the client. This IOStream uses the IOLoop to do this in an asynchronous and non-blocking way. Many IOStream and HTTPConnection objects can be active at once, all using the same IOLoop.
I hope this answers some of your questions. I don't know of a good structural chart, but the overall idea should be fairly similar for other webservers too, so there might be some good info around. That in-depth article you linked to did look pretty useful, so if you understand enough I'd recommend giving it another go :).
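As a small illustration of add_handler() outside of HTTPServer, here is a sketch (using the older IOLoop API quoted above) that registers a plain listening socket with the loop; the port is a placeholder:

import socket
from tornado import ioloop

loop = ioloop.IOLoop.instance()

server = socket.socket()
server.bind(('0.0.0.0', 8888))  # placeholder port
server.listen(128)
server.setblocking(False)

def on_readable(fd, events):
    # Called by the IOLoop whenever the listening socket becomes readable,
    # i.e. a client is waiting to be accepted.
    conn, addr = server.accept()
    print("accepted connection from", addr)
    conn.close()

# Roughly what HTTPServer does internally when it starts listening:
loop.add_handler(server.fileno(), on_readable, loop.READ)
loop.start()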

Mix Python Twisted with multiprocessing?

I need to write a proxy-like program in Python; the workflow is very similar to a web proxy. The program sits in between the client and the server, intercepts requests sent by the client to the server, processes each request, then sends it on to the original server. Of course, the protocol used is a private protocol over TCP.
To minimize the effort, I want to use Python Twisted to handle the request receiving (the part that acts as a server) and resending (the part that acts as a client).
To maximize performance, I want to use Python multiprocessing (threading has the GIL limit) to separate the program into three parts (processes). The first process runs Twisted to receive requests, puts each request in a queue, and returns success immediately to the original client. The second process takes requests from the queue, processes them further, and puts them on another queue. The third process takes requests from the second queue and sends them to the original server.
I am a newcomer to Python Twisted. I know it is event driven, and I have also heard it's better not to mix Twisted with threading or multiprocessing. So I don't know whether this approach is appropriate, or whether there is a more elegant way using just Twisted?
Twisted has its own event-driven way of running subprocesses which is (in my humble, but correct, opinion) better than the multiprocessing module. The core API is spawnProcess, but tools like ampoule provide higher-level wrappers over it.
If you use spawnProcess, you will be able to handle output from subprocesses in the same way you'd handle any other event in Twisted; if you use multiprocessing, you'll need to develop your own queue-based way of getting output from a subprocess into the Twisted mainloop somehow, since the normal callFromThread API that a thread might use won't work from another process. Depending on how you call it, it will either try to pickle the reactor, or just use a different non-working reactor in the subprocess; either way it will lose your call forever.
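A minimal sketch of the spawnProcess approach (the child command run here is just an illustration):

import sys
from twisted.internet import reactor, protocol

class Collector(protocol.ProcessProtocol):
    def __init__(self):
        self.output = b""

    def outReceived(self, data):
        # Delivered in the reactor thread whenever the child writes to stdout.
        self.output += data

    def processEnded(self, reason):
        print("child said:", self.output.decode())
        reactor.stop()

reactor.spawnProcess(Collector(), sys.executable,
                     [sys.executable, "-c", "print('hello from the subprocess')"],
                     env=None)  # env=None inherits the parent environment
reactor.run()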
ampoule is the first thing I think when reading your question.
It is a simple process pool implementation which uses the AMP protocol to communicate. You can use the deferToAMPProcess function; it's very easy to use.
You can try something like the cooperative multitasking technique described at http://us.pycon.org/2010/conference/schedule/event/73/. It's similar to the technique Glyph mentioned and it's worth a try.
You can try to use ZeroMQ with Twisted but it's really hard and experimental for now :)
