How to close a socket file object (makefile) from a different thread? - python

I'm using socket.makefile() in order to read a textual input stream line by line.
I'm also using a separate thread to do that in the background:
import threading
import socket
import time
s = socket.create_connection(("127.0.0.1", 9000))
f = s.makefile()
threading.Thread(target=lambda: print(f.readline())).start()
f.close() # This blocks.
How can I close the file or socket without blocking?

You may close it like so:
# Order is important.
s.close() # First mark the socket as closed
f.detach().detach().close() # Mark the file object as closed, and actually close the socket.
Internally in CPython, f.readline() takes a lock in the underlying buffered reader object. It does so in order to make sure only one thread is reading at a time. The same lock is used by f.close() to flush its buffers.
If you call close() while readline() is waiting in the background, close() will block forever trying to acquire that lock, until the socket receives a new line.
Unfortunately, closing at the socket level (s.close()) is not enough either: it only marks the socket as closed, and the socket is not actually closed until all remaining file objects created by makefile() are closed as well.
Using f.detach() you descend through the IO abstractions (TextIOWrapper -> BufferedReader -> SocketIO) until you reach the SocketIO object. Closing the SocketIO is what marks the file object as actually closed, and once all of them are closed, the real socket is closed, which raises an error in the readline() thread.
Alternatively, you can directly close the underlying socket file descriptor using socket.close(s.detach()).
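Putting it together with the snippet from the question, a minimal sketch (assuming something line-oriented is listening on port 9000):
import socket
import threading

s = socket.create_connection(("127.0.0.1", 9000))
f = s.makefile()

def reader():
    try:
        print(f.readline())
    except (OSError, ValueError) as e:
        # readline() fails once the real socket is closed underneath it
        print("reader unblocked:", e)

t = threading.Thread(target=reader)
t.start()

# Order is important.
s.close()                    # first mark the socket as closed
f.detach().detach().close()  # TextIOWrapper -> BufferedReader -> SocketIO;
                             # closing the SocketIO really closes the socket
t.join()                     # no longer blocks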

Related

broken pipe error with python multiprocessing and socketserver

Essentially I'm using the socketserver Python library to try and handle communications from a central server to multiple Raspberry Pi 4 and ESP32 peripherals. Currently I have the socketserver running serve_forever; the request handler then calls a method from a process manager class, which starts a process that should handle the actual communication with the client.
It works fine if I use .join() on the process so that the process manager method doesn't exit, but that's not how I would like it to run. Without .join() I get a broken pipe error as soon as the client communication process tries to send a message back to the client.
This is the process manager class; it gets defined in the main file, and buildprocess is called through the request handler of the socketserver class:
import multiprocessing as mp
mp.allow_connection_pickling()
import queuemanager as qm
import hostmain as hmain
import camproc
import keyproc
import controlproc

# method that gets called into a process so that class and socket share memory
def callprocess(periclass, peritype, clientsocket, inqueue, genqueue):
    periclass.startup(clientsocket)

class ProcessManager(qm.QueueManager):
    def wipeproc(self, target):
        # TODO make wipeproc integrate with the queue manager rather than directly to the class
        for macid in list(self.procdict.keys()):
            if target == macid:
                # calls proc kill for the class
                try:
                    self.procdict[macid]["class"].prockill()
                except Exception as e:
                    print("exception:", e, "in wipeproc")
                # waits for process to exit naturally (class threads to close)
                self.procdict[macid]["process"].join()
                # remove dict entry for this macid
                self.procdict.pop(macid)

    # called externally to create the new process and append to procdict
    def buildprocess(self, peritype, macid, clientsocket):
        # TODO put some logic here to handle the differences of the controller process
        # generates queue object
        inqueue = mp.Queue()
        # creates periclass instance based on type
        if peritype == hmain.cam:
            periclass = camproc.CamMain(self, inqueue, self.genqueue)
        elif peritype == hmain.keypad:
            print("to be added to")
        elif peritype == hmain.motion:
            print("to be added to")
        elif peritype == hmain.controller:
            print("to be added to")
        # init and start call for the new process
        self.procdict[macid] = {"type": peritype, "inqueue": inqueue, "class": periclass, "process": None}
        self.procdict[macid]["process"] = mp.Process(target=callprocess,
            args=(self.procdict[macid]["class"], self.procdict[macid]["type"], clientsocket,
                  self.procdict[macid]["inqueue"], self.genqueue))
        self.procdict[macid]["process"].start()
        # updating the process dictionary before class obj gets appended
        # if macid in list(self.procdict.keys()):
        #     self.wipeproc(macid)
        print(self.procdict)
        print("client added")
To my eye, all the pertinent objects are stored in the procdict dictionary, but as I mentioned, it just gets a broken pipe error unless I join the process with self.procdict[macid]["process"].join() before the end of the buildprocess method.
I would like it to exit the method but leave the communication process running as is. I've tried a few different things with restructuring what gets defined within the process and without, but to no avail. So far I haven't been able to find any pertinent solutions online, but of course I may have missed something too.
Thank you for reading this far if you did! I've been stuck on this for a couple of days, so any help would be appreciated; this is my first project with multiprocessing and sockets on any sort of scale.
#################
Edit to include pastebin with all the code:
https://pastebin.com/u/kadytoast/1/PPWfyCFT
Without .join() i get a broken pipe error as soon as the client communication process tries to send a message back to the client.
That's because as soon as the request handler's handle() returns, the socketserver shuts down the connection. The fact that socketserver simplifies the task of writing network servers means it does certain things automatically that are usually done in the course of network request handling. Your code is not quite making the intended use of socketserver. In particular, for handling requests asynchronously, the Asynchronous Mixins are intended: with the ForkingMixIn, the server spawns a new process for each request, in contrast to your current code, which does this by itself with mp.Process. So I think you basically have two options:
code less of the request handling yourself and use the provided socketserver machinery (see the sketch below)
stay with your own handling and don't use socketserver at all, so it won't get in the way.
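For the first option, a minimal sketch of what the forking server could look like (the handler body is an assumption, standing in for the real peripheral protocol):
import socketserver

class PeripheralHandler(socketserver.BaseRequestHandler):
    def handle(self):
        # handle() runs in its own forked child; the connection stays open
        # until this method returns, so the per-client communication loop
        # lives here instead of in a hand-rolled mp.Process.
        while True:
            data = self.request.recv(4096)
            if not data:
                break                   # client closed the connection
            self.request.sendall(data)  # placeholder: echo back to the client

if __name__ == "__main__":
    # ForkingTCPServer = TCPServer + ForkingMixIn: one child process per connection.
    with socketserver.ForkingTCPServer(("0.0.0.0", 9999), PeripheralHandler) as server:
        server.serve_forever()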

Why can `popen.stdout.readline` deadlock and what to do about it?

From the Python documentation
Warning Use communicate() rather than .stdin.write, .stdout.read or .stderr.read to avoid deadlocks due to any of the other OS pipe buffers filling up and blocking the child process.
I'm trying to understand why this would deadlock. For some background, I am spawning N processes in parallel:
for c in commands:
    h = subprocess.Popen(c, stdout=subprocess.PIPE, stderr=subprocess.PIPE, universal_newlines=True)
    handles.append(h)
Then printing the output of each process 1-by-1:
for handle in handles:
    while handle.poll() is None:
        try:
            line = handle.stdout.readline()
        except UnicodeDecodeError:
            line = "((INVALID UNICODE))\n"
        sys.stdout.write(line)
    if handle.returncode != 0:
        print(handle.stdout.read(), file=sys.stdout)
    if handle.returncode != 0:
        print(handle.stderr.read(), file=sys.stderr)
Occasionally this does in fact deadlock. Unfortunately, the documentation's recommendation to use communicate() is not going to work for me, because this process could take several minutes to run, and I don't want it to appear dead during this time. It should print output in real time.
I have several options, such as changing the bufsize argument, polling in a different thread for each handle, etc. But in order to decide on the best way to fix this, I think I need to understand what the fundamental reason for the deadlock is in the first place. Something to do with buffer sizes, apparently, but what? I can hypothesize that maybe all of these processes are sharing a single OS kernel object, and because I'm only draining the buffer of one of the processes, the other ones fill it up, in which case the second option above would probably fix it. But maybe that's not even the real problem.
Can anyone shed some light on this?
The bidirectional communication between the parent and child processes uses two unidirectional pipes, one for each direction. OK, stderr is a third one, but the idea is the same.
A pipe has two ends: one for writing, one for reading. The capacity of a pipe used to be 4K and is now 64K on modern Linux; one can expect similar values on other systems. This means the writer can write to a pipe without problems up to that limit, but then the pipe gets full and a further write blocks until the reader reads some data from the other end.
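A tiny sketch illustrating the write side (the 64K figure is the typical modern Linux default mentioned above):
import os

r, w = os.pipe()               # one unidirectional pipe: read end, write end
n = os.write(w, b"x" * 65536)  # typically fills the pipe buffer on Linux
print(n)                       # 65536: all accepted even though nobody reads
# One more write would now block until someone reads from r:
# os.write(w, b"x")            # <- would hang here
os.close(r)
os.close(w)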
From the reader's point of view, the situation is obvious: a regular read blocks until data is available.
To summarize: a deadlock occurs when a process attempts to read from a pipe that nobody is writing to, or when it writes data larger than the pipe's capacity to a pipe that nobody is reading from.
Typically the two processes act as a client and a server and use some kind of request/response-style communication, something like half-duplex: one side writes while the other reads, and then they switch roles. This is practically the most complex setup we can handle with standard synchronous programming. And a deadlock can still occur when the client and server somehow get out of sync, which can be caused by an empty response, an unexpected error message, etc.
If there are several child processes, or the communication protocol is not so simple, or we just want a robust solution, we need the parent to operate on all the pipes at once. communicate() uses threads for this purpose. The other approach is asynchronous I/O: first check which descriptor is ready for I/O, and only then read from or write to that pipe (or socket). The old and deprecated asyncore library implemented that.
At the low level, the select (or similar) system call checks which file handles from a given set are ready for I/O. But at that level, we can do only one read or write before re-checking. That is the problem with this snippet:
while handle.poll() is None:
    try:
        line = handle.stdout.readline()
    except UnicodeDecodeError:
        line = "((INVALID UNICODE))\n"
The poll check tells us there is something to be read, but this does not mean we will be able to read repeatedly until a newline! We can only do one read and append the data to an input buffer. If there is a newline, we can extract the whole line and process it. If not, we need to wait for the next successful poll and read.
Writes behave similarly. We can write once, check the number of bytes written and remove that many bytes from the output buffer.
That implies that line buffering and all that higher-level stuff needs to be implemented on top of that. Fortunately, the successor of asyncore offers what we need: asyncio subprocesses.
I hope that explains the deadlock. The solution is the expected one: if you need to do several things at once, use either threading or asyncio.
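For the threading route, here is a minimal sketch (not part of the original answer; commands is the list from the question) that drains each pipe in its own thread, so no pipe can fill up while another one is being read:
import subprocess
import sys
import threading
from queue import Queue, Empty

def pump(tag, pipe, q):
    # A dedicated thread per pipe keeps it drained, so no child ever blocks
    # on a full pipe while the main thread is busy with another one.
    for line in pipe:
        q.put((tag, line))
    pipe.close()

q = Queue()
handles = [subprocess.Popen(c, stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                            universal_newlines=True) for c in commands]
pumps = []
for i, h in enumerate(handles):
    for tag, pipe in (('%d/out' % i, h.stdout), ('%d/err' % i, h.stderr)):
        t = threading.Thread(target=pump, args=(tag, pipe, q), daemon=True)
        t.start()
        pumps.append(t)

# The main thread prints lines in real time as they arrive from any child.
while any(t.is_alive() for t in pumps) or not q.empty():
    try:
        tag, line = q.get(timeout=0.1)
    except Empty:
        continue
    sys.stdout.write('[%s] %s' % (tag, line))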
UPDATE:
Below is a short asyncio test program. It reads inputs from several child processes and prints the data line by line.
But first, a cmd.py helper which prints a line in several small chunks, to demonstrate the line buffering. Try it e.g. with python3 cmd.py 10.
import sys
import time

def countdown(n):
    print('START', n)
    while n >= 0:
        print(n, end=' ', flush=True)
        time.sleep(0.1)
        n -= 1
    print('END')

if __name__ == '__main__':
    args = sys.argv[1:]
    if len(args) != 1:
        sys.exit(3)
    countdown(int(args[0]))
And the main program:
import asyncio

PROG = 'cmd.py'
NPROC = 12

async def run1(*execv):
    """Run a program, read input lines."""
    proc = await asyncio.create_subprocess_exec(
        *execv,
        stdin=asyncio.subprocess.DEVNULL,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.DEVNULL)
    # proc.stdout is a StreamReader object
    async for line in proc.stdout:
        print("Got line:", line.decode().strip())

async def manager(prog, nproc):
    """Spawn 'nproc' copies of python script 'prog'."""
    tasks = [asyncio.create_task(run1('python3', prog, str(i))) for i in range(nproc)]
    await asyncio.wait(tasks)

if __name__ == '__main__':
    asyncio.run(manager(PROG, NPROC))
The async for line ... is a feature of StreamReader, similar to the for line in file: idiom. It can be replaced with:
while True:
    line = await proc.stdout.readline()
    if not line:
        break
    print("Got line:", line.decode().strip())

Python: write to single file from multiple processes (ZMQ)

I want to write to a single file from multiple processes. To be precise, I would rather not use the multiprocessing Queue solution, as there are several submodules written by other developers. However, each write to the file from such submodules is associated with a write to a zmq queue. Is there a way I can redirect the zmq messages to a file? Specifically, I am looking for something along the lines of http://www.huyng.com/posts/python-logging-from-multiple-processes/ without using the logging module.
It's fairly straightforward. In one process, bind a PULL socket and open a file.
Every time the PULL socket receives a message, it writes directly to the file.
import zmq

# Sentinel telling the sink to close the file and exit. socket.recv()
# returns bytes, so the sentinel must be bytes too (the original chr(4)
# would never compare equal to a received chunk in Python 3).
EOF = b'\x04'

def file_sink(filename, url):
    """forward messages on zmq to a file"""
    socket = zmq.Context.instance().socket(zmq.PULL)
    socket.bind(url)
    written = 0
    with open(filename, 'wb') as f:
        while True:
            chunk = socket.recv()
            if chunk == EOF:
                break
            f.write(chunk)
            written += len(chunk)
    socket.close()
    return written
In the remote processes, create a FileProxy object whose write method just sends a message over zmq:
class FileProxy(object):
    """Proxy to a remote file over zmq"""
    def __init__(self, url):
        self.socket = zmq.Context.instance().socket(zmq.PUSH)
        self.socket.connect(url)

    def write(self, chunk):
        """write a chunk of bytes to the remote file"""
        self.socket.send(chunk)
And, just for fun, if you call FileProxy.write(EOF), the sink process will close the file and exit.
If you want to write multiple files, you can do this fairly easily either by starting multiple sinks and having one URL per file,
or making the sink slightly more sophisticated and using multipart messages to indicate what file is to be written.
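A sketch of how the pieces could be wired together (the process layout and the URL are assumptions; file_sink, FileProxy and EOF are as defined above):
import multiprocessing as mp

URL = 'tcp://127.0.0.1:5557'

def worker(n):
    # each process gets its own PUSH socket via the proxy
    FileProxy(URL).write(b'hello from worker %d\n' % n)

if __name__ == '__main__':
    sink = mp.Process(target=file_sink, args=('out.log', URL))
    sink.start()
    workers = [mp.Process(target=worker, args=(i,)) for i in range(4)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    FileProxy(URL).write(EOF)  # tell the sink to close the file and exit
    sink.join()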

In Python, when do threads die?

I have a service that spawns threads, and I may have a resource leak in some code I am using.
I have similar code in Python that uses threads:
import threading

class Worker(threading.Thread):
    def __init__(self):
        threading.Thread.__init__(self)

    def run(self):
        # now i am using django orm to make a query
        dataList = Mydata.objects.filter(date__isnull=True)[:chunkSize]
        print('%s - DB worker finished reading %s entries' % (datetime.now(), len(dataList)))

while True:
    myWorker = Worker()
    myWorker.start()
    while myWorker.is_alive():  # wait for worker to finish
        do_other_work()
Is it OK?
Will the threads die when they finish executing the run method?
Am I causing a resource leak?
Looking at your previous question (which you linked in a comment), the problem is that you're running out of file descriptors.
From the official doc:
File descriptors are small integers corresponding to a file that has been opened by the current process. For example, standard input is usually file descriptor 0, standard output is 1, and standard error is 2. Further files opened by a process will then be assigned 3, 4, 5, and so forth. The name “file descriptor” is slightly deceptive; on Unix platforms, sockets and pipes are also referenced by file descriptors.
Now I'm guessing, but it could be that you're doing something like:
class Worker(threading.Thread):
    def __init__(self):
        threading.Thread.__init__(self)

    def run(self):
        my_file = open('example.txt')
        # operations
        my_file.close()  # without this line!
You need to close your files!
You're probably starting many threads, and each one of them is opening a file but not closing it; this way, after some time, you have no more "small integers" left to assign when opening a new file.
Also note that in the # operations part anything could happen: if an exception is thrown, the file will not be closed unless the code is wrapped in a try/finally statement.
There's a better way of dealing with files: the with statement:
with open('example.txt') as my_file:
    # bunch of operations with the file
# other operations for which you don't need the file
Once a thread object is created, its activity must be started by calling the thread’s start() method. This invokes the run() method in a separate thread of control.
Once the thread’s activity is started, the thread is considered ‘alive’. It stops being alive when its run() method terminates – either normally, or by raising an unhandled exception. The is_alive() method tests whether the thread is alive.
From the Python documentation.
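To see this directly, a small self-contained sketch (independent of the question's Django code):
import threading
import time

def work():
    time.sleep(0.1)  # stands in for the body of run()

t = threading.Thread(target=work)
print(t.is_alive())  # False: not started yet
t.start()
print(t.is_alive())  # True: run() is executing
t.join()             # wait for run() to terminate
print(t.is_alive())  # False: the thread is dead; its resources can be reclaimed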

Producing content indefinitely in a separate thread for all connections?

I have a Twisted project which seeks to essentially rebroadcast collected data over TCP as JSON. I have a USB library which I need to subscribe to and read synchronously in an indefinite while loop, like so:
while True:
    for line in usbDevice.streamData():
        data = MyBrandSpankingNewUSBDeviceData(line)
        # parse the data, convert to JSON
        output = convertDataToJSON(data)
        # broadcast the data
        ...
The problem, of course, is the .... Essentially, I need to start this process as soon as the server starts and end it when the server stops (Protocol.doStart and Protocol.doStop), and have it constantly running and broadcasting output to every connected transport.
How can I do this in Twisted? Obviously, I'd need to have the while loop run in its own thread, but how can I "subscribe" clients to listen to output? It's also important that the USB data collection only be running once, as it could seriously mess things up to have it running more than once.
In a nutshell, here's my architecture:
The server has a USB hub which is streaming data all the time. The server is constantly subscribed to this USB hub and is constantly reading data.
Clients will come and go, connecting and disconnecting at will.
We want to send data to all connected clients whenever it is available. How can I do this in Twisted?
One thing you probably want to do is try to extend the common protocol/transport independence. Even though you need a thread with a long-running loop, you can hide this from the protocol. The benefit is the same as usual: the protocol becomes easier to test, and if you ever manage to have a non-threaded implementation of reading the USB events, you can just change the transport without changing the protocol.
from threading import Thread

class USBThingy(Thread):
    def __init__(self, reactor, device, protocol):
        Thread.__init__(self)  # required before the thread can be started
        self._reactor = reactor
        self._device = device
        self._protocol = protocol

    def run(self):
        while True:
            for line in self._device.streamData():
                self._reactor.callFromThread(self._protocol.usbStreamLineReceived, line)
The use of callFromThread is part of what makes this solution usable. It makes sure the usbStreamLineReceived method gets called in the reactor thread rather than in the thread that's reading from the USB device. So from the perspective of that protocol object, there's nothing special going on with respect to threading: it just has its method called once in a while when there's some data to process.
Your protocol then just needs to implement usbStreamLineReceived somehow, and implement your other application-specific logic, like keeping a list of observers:
class SomeUSBProtocol(object):
    def __init__(self):
        self.observers = []

    def usbStreamLineReceived(self, line):
        data = MyBrandSpankingNewUSBDeviceData(line)
        # broadcast the data
        for obs in self.observers[:]:
            obs(data)
And then observers can register themselves with an instance of this class and do whatever they want with the data:
class USBObserverThing(Protocol):
    def connectionMade(self):
        self.factory.usbProto.observers.append(self.emit)

    def connectionLost(self, reason):
        self.factory.usbProto.observers.remove(self.emit)

    def emit(self, data):
        # parse the data, convert to JSON
        output = convertDataToJSON(data)
        self.transport.write(output)
Hook it all together:
from twisted.internet import reactor
from twisted.internet.protocol import Protocol, ServerFactory

usbDevice = ...
usbProto = SomeUSBProtocol()
thingy = USBThingy(reactor, usbDevice, usbProto)
thingy.start()

factory = ServerFactory()
factory.protocol = USBObserverThing
factory.usbProto = usbProto
reactor.listenTCP(12345, factory)
reactor.run()
You can imagine a better observer register/unregister API (like one using actual methods instead of direct access to that list). You could also imagine giving the USBThingy a method for shutting down so SomeUSBProtocol could control when it stops running (so your process will actually be able to exit).
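For instance, a minimal sketch of such an API (the method names are assumptions):
class SomeUSBProtocol(object):
    def __init__(self):
        self._observers = []

    def register(self, observer):
        self._observers.append(observer)

    def unregister(self, observer):
        self._observers.remove(observer)

    def usbStreamLineReceived(self, line):
        data = MyBrandSpankingNewUSBDeviceData(line)
        # iterate over a copy so observers may unregister during the broadcast
        for obs in self._observers[:]:
            obs(data)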
