Python: write to a single file from multiple processes (ZMQ)

I want to write to a single file from multiple processes. To be precise, I would rather not use the multiprocessing Queue solution, because several of the submodules involved were written by other developers. However, each write to the file in those submodules is associated with a write to a ZMQ queue. Is there a way I can redirect the ZMQ messages to a file? Specifically, I am looking for something along the lines of http://www.huyng.com/posts/python-logging-from-multiple-processes/ without using the logging module.

It's fairly straightforward. In one process, bind a PULL socket and open a file.
Every time the PULL socket receives a message, it writes directly to the file.
import zmq

# Sentinel message telling the sink to close the file and exit
# (bytes, so it compares equal to what socket.recv() returns in Python 3).
EOF = b'\x04'

def file_sink(filename, url):
    """Forward messages received over zmq to a file."""
    socket = zmq.Context.instance().socket(zmq.PULL)
    socket.bind(url)
    written = 0
    with open(filename, 'wb') as f:
        while True:
            chunk = socket.recv()
            if chunk == EOF:
                break
            f.write(chunk)
            written += len(chunk)
    socket.close()
    return written
In the remote processes, create a Proxy object,
whose write method just sends a message over zmq:
class FileProxy(object):
    """Proxy to a remote file over zmq."""
    def __init__(self, url):
        self.socket = zmq.Context.instance().socket(zmq.PUSH)
        self.socket.connect(url)

    def write(self, chunk):
        """Write a chunk of bytes to the remote file."""
        self.socket.send(chunk)
And, just for fun, if you call Proxy.write(EOF), the sink process will close the file and exit.
If you want to write multiple files, you can do this fairly easily either by starting multiple sinks and having one URL per file,
or making the sink slightly more sophisticated and using multipart messages to indicate what file is to be written.
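For instance, a multipart-aware sink might route two-frame messages of the form [filename, chunk] to per-file handles, roughly like this (a sketch only; the frame layout and EOF handling are assumptions). On the proxy side, write would then call self.socket.send_multipart([filename, chunk]).
import zmq

EOF = b'\x04'  # per-file end marker, same idea as the single-file sink above

def multi_file_sink(url):
    """Sketch: route two-frame messages [filename, chunk] to per-file handles."""
    socket = zmq.Context.instance().socket(zmq.PULL)
    socket.bind(url)
    files = {}  # filename (bytes) -> open file object
    try:
        while True:
            filename, chunk = socket.recv_multipart()
            if chunk == EOF:
                f = files.pop(filename, None)
                if f is not None:
                    f.close()
                continue
            if filename not in files:
                files[filename] = open(filename.decode(), 'ab')
            files[filename].write(chunk)
    finally:
        for f in files.values():
            f.close()
        socket.close()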

Related

Python atomic logs

I have a logger running on a few thousand processes, and they all write to the same file in append-mode. What would be a good way to guarantee that writes are atomic -- that is, each time a process writes to a log its entire contents are written in one block and there's no other process that writes to that file at the same time?
My thought was doing something like:
import os
from logging import getLogger

logger = getLogger()
lockfile = '/tmp/loglock'

def atomic_log(msg):
    while True:
        if os.path.exists(lockfile):
            continue  # someone else holds the lock; keep waiting
        with open(lockfile, 'w') as f:
            logger.write(msg)
        os.remove(lockfile)
        break

def some_function(request):
    atomic_log("Hello")
What would be an actual way to do the above on a posix system?
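For illustration, a race-free version of the lock-file idea on POSIX would create the lock atomically with os.open and O_CREAT | O_EXCL instead of checking os.path.exists first (a minimal sketch; the helper name and sleep interval are assumptions):
import os
import time

LOCKFILE = '/tmp/loglock'  # same path as in the sketch above

def atomic_append(path, msg):
    """Sketch: append msg to path while holding an exclusive lock file."""
    while True:
        try:
            # O_CREAT | O_EXCL fails atomically if the lock file already exists.
            fd = os.open(LOCKFILE, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            time.sleep(0.01)  # another process holds the lock; retry shortly
    try:
        with open(path, 'a') as f:
            f.write(msg + '\n')
    finally:
        os.close(fd)
        os.remove(LOCKFILE)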

How to close a socket file object (makefile) from a different thread?

I'm using socket.makefile() in order to read a textual line-by-line input stream.
I'm also using a separate thread in order to do that in the background:
import threading
import socket
import time
s = socket.create_connection(("127.0.0.1", 9000))
f = s.makefile()
threading.Thread(target=lambda: print(f.readline())).start()
f.close() # This blocks.
How can I close the file or socket without blocking?
You may close it like so:
# Order is important.
s.close() # First mark the socket as closed
f.detach().detach().close() # Mark the file object as closed, and actually close the socket.
Internally in CPython, f.readline() takes a lock in the underlying BufferedIO object. It does so in order to make sure only one thread is reading at a time. The same lock is used by f.close() for flushing its buffers.
If you call close() while readline() is waiting in the background, your code will wait forever trying to acquire that lock, until the socket receives a new line.
Unfortunately, closing the socket at the socket level (s.close()) is not enough. Closing the socket only marks it as closed, but does not actually close it until all remaining file objects (makefiles) are closed as well.
Using f.detach() you traverse lower into the IO abstractions (TextIOWrapper -> BufferedIO -> SocketIO) until you reach the SocketIO class. Closing the SocketIO is what marks the file object as actually closed, and once all of them are closed, the real socket is closed and an error is raised in the readline() thread.
Alternatively, you can directly close the underlying socket file descriptor using socket.close(s.detach()).
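Putting the pieces together with the question's example, a non-blocking close from the main thread might look like this (a sketch reusing the s and f names from above):
import socket
import threading

s = socket.create_connection(("127.0.0.1", 9000))
f = s.makefile()

def close_connection():
    """Unblock a reader thread that is stuck in f.readline()."""
    s.close()                    # first mark the socket object as closed
    f.detach().detach().close()  # then close the underlying SocketIO, which really closes the fd

reader = threading.Thread(target=lambda: print(f.readline()))
reader.start()
close_connection()  # readline() in the reader thread now errors out instead of blocking forever
reader.join()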

Multithreading makes me get the "ValueError: I/O operation on closed file" error. Why?

I am writing a Flask Web Application using WTForms. In one of the forms the user should upload a csv file and the server will analyze the received data. This is the code I am using.
filename = token_hex(8) + '.csv' # Generate a new filename
form.dataset.data.save('myapp/datasets/' + filename) # Save the received file
dataset = genfromtxt('myapp/datasets/' + filename, delimiter=',') # Open the newly generated file
# analyze 'dataset'
As long as I was using this code inside a single-thread application everything was working. I tried adding a thread in the code. Here's the procedure called by the thread (the same exact code inside a function):
def execute_analysis(form):
    filename = token_hex(8) + '.csv'  # Generate a new filename
    form.dataset.data.save('myapp/datasets/' + filename)  # Save the received file
    dataset = genfromtxt('myapp/datasets/' + filename, delimiter=',')  # Open the newly generated file
    # analyze 'dataset'
and here's how I call the thread
import threading

@posts.route("/estimation", methods=['GET', 'POST'])
@login_required
def estimate_parameters():
    form = EstimateForm()
    if form.validate_on_submit():
        threading.Thread(target=execute_analysis, args=[form]).start()
        flash("Your request has been received. Please check the site in again in a few minutes.", category='success')
        # return render_template('posts/post.html', title=post.id, post=post)
    return render_template('estimations/estimator.html', title='New Analysis', form=form, legend='New Analysis')
But now I get the following error:
ValueError: I/O operation on closed file.
The error is raised by the save function call. Why is it not working? How should I fix this?
It's hard to tell without further context, but I suspect it's likely that you're returning from a function or exiting a context manager which causes some file descriptor to close, and hence causes the save(..) call to fail with ValueError.
If so, one direct fix would be to wait for the thread to finish before returning/closing the file. Something along the lines of:
def handle_request(form):
    ...
    analyzer_thread = threading.Thread(target=execute_analysis, args=[form])
    analyzer_thread.start()
    ...
    analyzer_thread.join()  # wait for completion of execute_analysis
    cleanup_context(form)
    return
Here is a reproducible minimal example of the problem I am describing:
import threading

SEM = threading.Semaphore(0)

def run(fd):
    SEM.acquire()  # wait till release
    fd.write("This will fail :(")

fd = open("test.txt", "w+")
other_thread = threading.Thread(target=run, args=[fd])
other_thread.start()

fd.close()
SEM.release()  # release the semaphore, so other_thread will acquire & proceed
other_thread.join()
Note that the main thread closes the file, and the other thread then fails on the write call with ValueError: I/O operation on closed file, as in your case.
I don't know the framework sufficiently to tell exactly what happened, but I can tell you how you probably can fix it.
Whenever you have a resource that is shared by multiple threads, use a lock.
import threading
from threading import Lock

LOCK = Lock()

def process():
    LOCK.acquire()
    ...  # open a file, write some data to it, etc.
    LOCK.release()

    # alternatively, use the context manager syntax
    with LOCK:
        ...

threading.Thread(target=process).start()
threading.Thread(target=process).start()
Documentation on threading.Lock:
The class implementing primitive lock objects. Once a thread has acquired a lock, subsequent attempts to acquire it block, until it is released
Basically, after thread 1 calls LOCK.acquire(), subsequent calls, e.g. from other threads, will cause those threads to freeze and wait until something calls LOCK.release() (usually thread 1, after it finishes its business with the resource).
If the filenames are randomly generated, then I wouldn't expect problems with one thread closing the other's file, unless both of them happen to generate the same name. But perhaps you can figure it out with some experimentation, e.g. first try locking the calls to both save and genfromtxt and check if that helps. It might also make sense to add some print statements (or better, use logging), e.g. to check that the file names don't collide.
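For instance, the locking experiment suggested above might look roughly like this (a sketch only; the lock placement, the imports for token_hex and genfromtxt, and the log message are assumptions):
import logging
import threading
from secrets import token_hex   # assumed source of token_hex in the question
from numpy import genfromtxt    # assumed source of genfromtxt in the question

logging.basicConfig(level=logging.INFO)
FILE_LOCK = threading.Lock()

def execute_analysis(form):
    filename = token_hex(8) + '.csv'
    logging.info("worker thread using file %s", filename)  # check for name collisions
    with FILE_LOCK:  # serialize the save and the read while experimenting
        form.dataset.data.save('myapp/datasets/' + filename)
        dataset = genfromtxt('myapp/datasets/' + filename, delimiter=',')
    # analyze 'dataset'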

Python - Handling concurrent POST requests with HTTPServer and saving to files

I have the code below that I would like to use to receive POST requests with data that I will then use to print labels on a label printer. In order to print the labels I will need to write a file with print commands and then do a lp command via the command line to copy the file to the label printer.
The problem I have is that multiple people could be printing labels at the same time. So my question is: do I have to change the code below to use ThreadingMixIn in order to handle concurrent POST requests, or can I leave the code as is and accept only a slight delay for secondary requests in a concurrent scenario (that is, any further requests will be queued and not lost)?
If I have to go the threaded way how does that impact the writing of the file and subsequent command line call to lp if there are now multiple threads trying to write to the same file?
Note that there are multiple label printers that are being accessed through print queues (CUPS).
import json
from http.server import HTTPServer, BaseHTTPRequestHandler
from io import BytesIO

class SimpleHTTPRequestHandler(BaseHTTPRequestHandler):

    def do_GET(self):
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b'Hello, world!')

    def do_POST(self):
        content_length = int(self.headers['Content-Length'])
        body = self.rfile.read(content_length)
        try:
            result = json.loads(body, encoding='utf-8')
            self.send_response(200)
            self.end_headers()
            response = BytesIO()
            response.write(b'This is POST request. ')
            response.write(b'Received: ')
            response.write(body)
            self.wfile.write(response.getvalue())
        except Exception as exc:
            self.wfile.write(b'Request has failed to process. Error: ' + str(exc).encode())

httpd = HTTPServer(('localhost', 8000), SimpleHTTPRequestHandler)
httpd.serve_forever()
Why not try to use a unique file name?
In this way you are sure that there will be no name clashes.
Have a look at https://docs.python.org/2/library/tempfile.html and consider the NamedTemporaryFile function. You should use delete=False, otherwise the file is deleted immediately after close().
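For example, writing the label commands to a uniquely named temporary file and handing it to lp could look roughly like this (a sketch; the queue name, file suffix, and helper name are assumptions):
import subprocess
import tempfile

def print_label(label_commands, printer_queue='labels'):
    """Sketch: write printer commands to a unique temp file and hand it to CUPS via lp."""
    with tempfile.NamedTemporaryFile(mode='w', suffix='.lbl', delete=False) as tmp:
        tmp.write(label_commands)
        path = tmp.name  # unique per call, so concurrent requests never share a file
    subprocess.run(['lp', '-d', printer_queue, path], check=True)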
From your question, what I understand is that you have one label printer and multiple producers trying to print labels on it.
So even if you switch to a multithreading option, you will have to synchronize threads in order to avoid deadlocks and infinite waiting.
So my best take is to go with the built-in Python data structure in the queue module.
According to the docs:
The queue module implements multi-producer, multi-consumer queues. It is especially useful in threaded programming when information must be exchanged safely between multiple threads. The Queue class in this module implements all the required locking semantics. It depends on the availability of thread support in Python.
Even though it's a multi-consumer, multi-producer queue, I suppose it will still work for you like a charm.
So here is what you need to do:
Your server receives a request to print a label.
Do the necessary processing/cleanup and put the job in the queue.
A worker thread pops items from the queue and executes the task.
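A minimal sketch of that worker pattern, assuming a print_label helper like the one sketched above:
import queue
import threading

print_jobs = queue.Queue()  # thread-safe, multi-producer / multi-consumer

def print_worker():
    """Pop label jobs off the queue one at a time and print them."""
    while True:
        label_commands = print_jobs.get()
        try:
            print_label(label_commands)  # whatever writes the file and calls lp
        finally:
            print_jobs.task_done()

threading.Thread(target=print_worker, daemon=True).start()

# In do_POST, instead of printing directly, enqueue the job:
# print_jobs.put(label_commands)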
Or, if you expect the system to be big enough, here are some links, but the steps would be the same as above:
RabbitMQ - a scalable message broker (simply put, a queue)
Celery - a Python package that pops items from a message broker such as RabbitMQ and executes them

Producing content indefinitely in a separate thread for all connections?

I have a Twisted project which seeks to essentially rebroadcast collected data over TCP in JSON. I essentially have a USB library which I need to subscribe to and synchronously read in a while loop indefinitely like so:
while True:
    for line in usbDevice.streamData():
        data = MyBrandSpankingNewUSBDeviceData(line)
        # parse the data, convert to JSON
        output = convertDataToJSON(data)
        # broadcast the data
        ...
The problem, of course, is the .... Essentially, I need to start this process as soon as the server starts and end it when the server ends (Protocol.doStart and Protocol.doStop) and have it constantly running and broadcasting a output to every connected transport.
How can I do this in Twisted? Obviously, I'd need to have the while loop run in its own thread, but how can I "subscribe" clients to listen to output? It's also important that the USB data collection only be running once, as it could seriously mess things up to have it running more than once.
In a nutshell, here's my architecture:
Server has a USB hub which is streaming data all the time. Server is constantly subscribed to this USB hub and is constantly reading data.
Clients will come and go, connecting and disconnecting at will.
We want to send data to all connected clients whenever it is available. How can I do this in Twisted?
One thing you probably want to do is try to extend the common protocol/transport independence. Even though you need a thread with a long-running loop, you can hide this from the protocol. The benefit is the same as usual: the protocol becomes easier to test, and if you ever manage to have a non-threaded implementation of reading the USB events, you can just change the transport without changing the protocol.
from threading import Thread

class USBThingy(Thread):
    def __init__(self, reactor, device, protocol):
        Thread.__init__(self)
        self._reactor = reactor
        self._device = device
        self._protocol = protocol

    def run(self):
        while True:
            for line in self._device.streamData():
                self._reactor.callFromThread(
                    self._protocol.usbStreamLineReceived, line)
The use of callFromThread is part of what makes this solution usable. It makes sure the usbStreamLineReceived method gets called in the reactor thread rather than in the thread that's reading from the USB device. So from the perspective of that protocol object, there's nothing special going on with respect to threading: it just has its method called once in a while when there's some data to process.
Your protocol then just needs to implement usbStreamLineReceived somehow, and implement your other application-specific logic, like keeping a list of observers:
class SomeUSBProtocol(object):
    def __init__(self):
        self.observers = []

    def usbStreamLineReceived(self, line):
        data = MyBrandSpankingNewUSBDeviceData(line)
        # broadcast the data to every registered observer
        for obs in self.observers[:]:
            obs(data)
And then observers can register themselves with an instance of this class and do whatever they want with the data:
class USBObserverThing(Protocol):
    def connectionMade(self):
        self.factory.usbProto.observers.append(self.emit)

    def connectionLost(self, reason):
        self.factory.usbProto.observers.remove(self.emit)

    def emit(self, data):
        # parse the data, convert to JSON
        output = convertDataToJSON(data)
        self.transport.write(output)
Hook it all together:
usbDevice = ...
usbProto = SomeUSBProtocol()
thingy = USBThingy(reactor, usbDevice, usbProto)
thingy.start()
factory = ServerFactory()
factory.protocol = USBObserverThing
factory.usbProto = usbProto
reactor.listenTCP(12345, factory)
reactor.run()
You can imagine a better observer register/unregister API (like one using actual methods instead of direct access to that list). You could also imagine giving the USBThingy a method for shutting down so SomeUSBProtocol could control when it stops running (so your process will actually be able to exit).
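For example, a stoppable variant of the thread might look roughly like this (a sketch; the Event-based stop mechanism is an assumption, not part of the answer above):
import threading

class StoppableUSBThingy(threading.Thread):
    """Like USBThingy above, but with a way to ask the loop to stop."""
    def __init__(self, reactor, device, protocol):
        threading.Thread.__init__(self)
        self._reactor = reactor
        self._device = device
        self._protocol = protocol
        self._stopped = threading.Event()

    def run(self):
        while not self._stopped.is_set():
            for line in self._device.streamData():
                if self._stopped.is_set():
                    break
                self._reactor.callFromThread(
                    self._protocol.usbStreamLineReceived, line)

    def stop(self):
        self._stopped.set()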
