I'm trying to build a system that collects data from some sources using I/O (HDD, network...).
For this, I have a class (controller) that launches the collectors.
Each collector is an infinite loop with a classic ETL process (extract, transform and load).
I want to send some commands to the collectors (stop, reload settings...) from an interface (CLI, web...) and I'm not sure how to do it.
For example, this is the skeleton for a collector:
class Collector(object):
    def __init__(self):
        self.reload_settings()

    def reload_settings(self):
        # Get the settings
        # Set the settings as attributes
        pass

    def process_data(self, data):
        # Do something
        pass

    def run(self):
        while True:
            data = retrieve_data()
            self.process_data(data)
And this is the skeleton for the controller:
class Controller(object):
    def __init__(self, collectors):
        self.collectors = collectors

    def run(self):
        for collector in self.collectors:
            collector.run()

    def reload_settings(self):
        pass  # ??

    def stop(self):
        pass  # ??
Is there a classic design pattern that solves this problem (Publish–subscribe, event loop, reactor...)? What is the best way to solve this problem?
PS: Obviously, this will be a multiprocess application and will run on a single machine.
There are multiple choices here, but they boil down to two major kinds: cooperative (event loop/reactor/coroutine/explicit greenlet), or preemptive (implicit greenlet/thread/multiprocess).
The first requires a lot more restructuring of your collectors. It can be a nice way to make the nondeterminism explicit, or to achieve massive concurrency, but neither of those seems relevant here. The second just requires sticking the collectors on threads, and using some synchronization mechanism for both communication and shared data. It seems like you have no shared data, and your communication is trivial and not highly time-sensitive. So, I'd go with threads.
Assuming you want threads in the general sense, and given that your collectors are I/O-bound and you don't have dozens of them, I'd go with actual threads.
So, here's one way you can write it:
import threading


class Collector(threading.Thread):
    def __init__(self):
        super(Collector, self).__init__()
        # Create the events first: _reload_settings() clears _need_reload.
        self._need_reload = threading.Event()
        self._need_stop = threading.Event()
        self._reload_settings()

    def _reload_settings(self):
        # Get the settings
        # Set the settings as attributes
        self._need_reload.clear()

    def reload_settings(self):
        self._need_reload.set()

    def stop(self):
        self._need_stop.set()

    def process_data(self, data):
        # Do something
        pass

    def run(self):
        while not self._need_stop.is_set():
            if self._need_reload.is_set():
                self._reload_settings()
            data = retrieve_data()
            self.process_data(data)


class Controller(object):
    def __init__(self, collectors):
        self.collectors = collectors

    def run(self):
        for collector in self.collectors:
            collector.start()

    def reload_settings(self):
        for collector in self.collectors:
            collector.reload_settings()

    def stop(self):
        # Ask every collector to stop, then wait for all of them to finish.
        for collector in self.collectors:
            collector.stop()
        for collector in self.collectors:
            collector.join()
(Although I'd call the Controller.run method start, because it fits in better with the naming used not only by Thread, but also by the stdlib server classes and other similar things.)
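A minimal, self-contained sketch of how such collectors might be driven from the controlling thread. DemoCollector, its interval parameter, and the ticks and reloads counters are hypothetical stand-ins for a real collector. The one change in technique is that the loop pauses with Event.wait(timeout) rather than a plain sleep, so a stop request cuts the pause short.

```python
import threading
import time


class DemoCollector(threading.Thread):
    """Hypothetical collector that counts iterations instead of doing real I/O."""

    def __init__(self, interval=0.01):
        super().__init__()
        self.interval = interval
        self.ticks = 0
        self.reloads = 0
        self._need_reload = threading.Event()
        self._need_stop = threading.Event()

    def reload_settings(self):
        self._need_reload.set()

    def stop(self):
        self._need_stop.set()

    def run(self):
        while not self._need_stop.is_set():
            if self._need_reload.is_set():
                self._need_reload.clear()
                self.reloads += 1   # stand-in for re-reading the settings
            self.ticks += 1         # stand-in for extract/transform/load
            # Pause between iterations, but wake up at once if stop() is called.
            self._need_stop.wait(self.interval)


def demo():
    collectors = [DemoCollector() for _ in range(3)]
    for c in collectors:
        c.start()
    for c in collectors:
        c.reload_settings()
    # Wait until every loop has noticed the reload request.
    while any(c._need_reload.is_set() for c in collectors):
        time.sleep(0.001)
    for c in collectors:
        c.stop()
    for c in collectors:
        c.join()
    return collectors
```

Calling demo() starts three collectors, asks each one to reload once, and then stops and joins them. A command only takes effect between iterations, so a collector stuck in a long blocking read will not react until that read returns.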
I'd look at adapting your case to a socket-based client-server architecture, where the Controller would instantiate the required number of Collectors, each listening on its own port and handling received data more elegantly through the handle() method of the server's request handler. The fact that data comes from various I/O sources speaks even more for this solution: you could use the client part of this architecture to standardize the DataSource -> Collector protocol.
https://docs.python.org/2/library/socketserver.html
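As a rough sketch of that idea, assuming Python 3 (where the module is named socketserver) and a made-up line-based protocol: CollectorHandler, CollectorServer, send_records and the one-line "OK" acknowledgement are all hypothetical names and conventions, not something the question defines.

```python
import socket
import socketserver
import threading


class CollectorHandler(socketserver.StreamRequestHandler):
    """Handles one data-source connection; each line is one record."""

    def handle(self):
        for raw in self.rfile:
            record = raw.decode("utf-8").strip()
            # Stand-in for the transform/load steps of the collector.
            self.server.records.append(record.upper())
            # Acknowledge the record so the source knows it was stored.
            self.wfile.write(b"OK\n")


class CollectorServer(socketserver.ThreadingTCPServer):
    """One collector, listening on its own port."""

    def __init__(self, address):
        self.records = []
        super().__init__(address, CollectorHandler)


def send_records(address, lines):
    """Client side: a data source pushing newline-terminated records."""
    with socket.create_connection(address) as conn, conn.makefile("rb") as replies:
        for line in lines:
            conn.sendall(line.encode("utf-8") + b"\n")
            replies.readline()  # block until the collector acknowledges
```

A controller could create one CollectorServer per source, run each serve_forever() in its own thread, and later call shutdown() and server_close() to stop it. Binding to port 0 lets the operating system pick a free port, which is then available as server.server_address.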
My issue is as follows: I have a main GUI that manages different connections with an instrument and processes the data coming from it according to the user's choices. I designed a class InstrumentController that manages all the methods to speak with the instrument (connect, disconnect, set commands and read commands).
Obviously, I'd like the instrument management to work in parallel with the GUI application. I've already explored QThread, and in particular the moveToThread option widely detailed on the Internet. However, though it works, I don't like this strategy for several reasons:
I don't want my object to be a thread (subclass QThread). I'd like to maintain the modularity and generality of my class.
...even if it has to be, it doesn't solve the next point
QThread, obviously, works on a single callback base. Thus, I have the extra work of either creating a thread for each InstrumentController method or reconfiguring a single thread each time a method is called (I'm not expecting the methods of the object to work concurrently!)
As a consequence, I'm seeking a solution that lets the InstrumentController entity work like a separate program (daemon?) that is nonetheless strongly linked to the main GUI (it has to continuously communicate back and forth), so I need signals from the GUI to be visible to this object and vice versa. I was exploring some solutions, namely:
Create an extra event loop (QEventLoop) that works in parallel with the main loop, but the official docs are very slim and I found little more on the Internet. Therefore I don't even know if it is practicable.
Create a separate process (another Qt application) and search for an effective protocol of communication.
Aware that venturing into one of these solutions might be time-consuming and possibly time-wasting, I'd like to ask for any effective, efficient and practicable suggestion that might help with my problem.
The first thing to consider is that a QThread is only a wrapper around an OS thread.
moveToThread() does not move an object to the QThread object, but to the thread that it refers to; in fact, a QThread might have its own thread() property (as Qt documentation reports, it's "the thread in which the object lives").
With that in mind, moveToThread() is not the same as creating a QThread, and, most importantly, a QThread does not work "on a single callback base". What's important is what is executed in the thread that the QThread refers to.
When a QThread is started, whatever is executed in the threaded function (aka, run()) is actually executed in that thread.
Connecting a function to the started signal results in executing that function in the OS thread the QThread refers to.
Calling a function from any of those functions (including the basic run()) results in running that function in the other thread.
If you want to execute functions in that thread, those functions must be called from there, so a possible solution is to use a Queue to pass the function reference, which ensures that a command is actually executed in the other thread. So, you can run a function on the other thread, as long as it's called (not just referenced) from that thread.
Here's a basic example:
import sys
from queue import Queue
from random import randrange

from PyQt5 import QtCore, QtWidgets


class Worker(QtCore.QThread):
    log = QtCore.pyqtSignal(object)

    def __init__(self):
        super().__init__()
        self.queue = Queue()

    def run(self):
        count = 0
        self.keepRunning = True
        while self.keepRunning:
            wait = self.queue.get()
            if wait is None:
                self.keepRunning = False
                continue
            count += 1
            self.log.emit('Process {} started ({} seconds)'.format(count, wait))
            self.sleep(wait)
            self.log.emit('Process {} finished after {} seconds'.format(count, wait))
        self.log.emit('Thread finished after {} processes ({} left unprocessed)'.format(
            count, self.queue.qsize()))

    def _queueCommand(self, wait=0):
        self.queue.put(wait)

    def shortCommand(self):
        self._queueCommand(randrange(1, 5))

    def longCommand(self):
        self._queueCommand(randrange(5, 10))

    def stop(self):
        if self.keepRunning:
            self.queue.put(None)
            self.keepRunning = False


class Test(QtWidgets.QWidget):
    def __init__(self):
        super().__init__()
        self.startShort = QtWidgets.QPushButton('Start short command')
        self.startLong = QtWidgets.QPushButton('Start long command')
        self.stop = QtWidgets.QPushButton('Stop thread')
        self.log = QtWidgets.QTextEdit(readOnly=True)
        layout = QtWidgets.QVBoxLayout(self)
        layout.addWidget(self.startShort)
        layout.addWidget(self.startLong)
        layout.addWidget(self.stop)
        layout.addWidget(self.log)

        self.worker = Worker()
        self.worker.log.connect(self.log.append)
        self.startShort.clicked.connect(self.worker.shortCommand)
        self.startLong.clicked.connect(self.worker.longCommand)
        self.stop.clicked.connect(self.worker.stop)
        self.worker.finished.connect(lambda: [
            w.setEnabled(False) for w in (self.startShort, self.startLong, self.stop)
        ])
        self.worker.start()


app = QtWidgets.QApplication(sys.argv)
test = Test()
test.show()
app.exec()
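The same idea can be shown without Qt. The sketch below uses only the standard library and passes function references through a queue, so that the call itself happens in the worker thread. CommandWorker, submit and where_am_i are hypothetical names chosen for illustration.

```python
import queue
import threading


class CommandWorker(threading.Thread):
    """Runs submitted callables one at a time in its own thread."""

    def __init__(self):
        super().__init__()
        self._commands = queue.Queue()

    def submit(self, func, *args):
        # Called from any thread: this only stores a reference to the function.
        self._commands.put((func, args))

    def stop(self):
        # Sentinel: finish the commands already queued, then exit.
        self._commands.put(None)

    def run(self):
        while True:
            item = self._commands.get()
            if item is None:
                break
            func, args = item
            func(*args)  # the call itself happens in the worker thread


def where_am_i(log, label):
    """Record which thread actually executed the command."""
    log.append((label, threading.current_thread().name))
```

Submitting where_am_i from the main thread and inspecting the log afterwards shows the worker's thread name, not the caller's. Because commands are processed in order and one at a time, the controller's methods never run concurrently with each other.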
I can best explain this with example code first:
import threading


class reciever(threading.Thread, simple_server):
    def __init__(self, callback):
        threading.Thread.__init__(self)
        self.callback = callback

    def run(self):
        self.serve_forever(self.callback)


class sender(threading.Thread):
    def __init__(self):
        threading.Thread.__init__(self)
        self.parameter = 50

    def run(self):
        while True:
            # do some processing in general
            # ....
            # send some udp messages derived from self.parameter
            send_message(self.parameter)


if __name__ == '__main__':
    osc_send = sender()
    osc_send.start()

    def update_parameter(val):
        osc_send.parameter = val

    osc_recv = reciever(update_parameter)
    osc_recv.start()
The pieces I have left out are hopefully self-explanatory from the code that's there.
My question is, is this a safe way to use a server running in a thread to update the attributes on a separate thread that could be reading the value at any time?
The way you're updating that parameter is actually thread-safe already, because of the Global Interpreter Lock (GIL). The GIL means that CPython only allows one thread to execute byte-code at a time, so it is impossible for one thread to be reading from parameter at the same time another thread is writing to it. Reading from and setting an attribute are both single, atomic byte-code operations; one will always start and complete before the other can happen. You would only need to introduce synchronization primitives if you needed to do operations that take more than one byte-code operation from more than one thread (e.g. incrementing parameter from multiple threads).
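For the multi-step case mentioned above, a lock around the read-modify-write is the usual fix. This is a minimal sketch; SharedParameter and hammer are hypothetical names, not part of the question's code.

```python
import threading


class SharedParameter:
    """A value that several threads may read, replace, or increment."""

    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def set(self, value):
        with self._lock:
            self._value = value

    def get(self):
        with self._lock:
            return self._value

    def increment(self, amount=1):
        # Read-modify-write: hold the lock so no other thread can slip in
        # between reading the old value and storing the new one.
        with self._lock:
            self._value += amount
            return self._value


def hammer(param, times):
    """Increment the shared parameter repeatedly from one thread."""
    for _ in range(times):
        param.increment()
```

With the lock in place, running hammer from several threads at once yields exactly the sum of all increments. For a plain assignment like the one in the question the lock is not strictly needed, but it costs little and keeps the code correct if the update later becomes more than a single store.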
Assume a class WorkerThread that implements a field running which indicates whether the thread should continue to work after it was started or not.
class WorkerThread(threading.Thread):
    running = False

    def run(self):
        self.running = True
        while self.running:
            # .. do some important stuff
            pass


def main():
    t = WorkerThread()
    t.start()
    # .. do other important stuff
    t.running = False
    t.join()
Is there something that could possibly go wrong when modifying t.running from the main thread, without locking the read and write operations to this field? What is it?
The main thread and the worker thread could run on cores that do not share cache. Because of the absence of synchronization, the write to t.running might never be shared from the main thread's cache to the worker thread's cache.
What synchronization means is not just "I want exclusive access". It also means, "I want to share my writes to other threads, and see the writes from other threads". No synchronization means that you do not need those things. Not synchronizing doesn't prevent them happening (and on some systems/architectures they will happen with more frequency than others), it just fails to guarantee they will happen.
In practice you might find that provided CPython is taking the GIL at regular intervals, these things sort themselves out even on architectures that, unlike Intel, do not have coherent caches.
For your requirements, use a threading.Event() object instead of a flag.
class WorkerThread(threading.Thread):
    def __init__(self):
        super(WorkerThread, self).__init__()
        self.running = threading.Event()

    def run(self):
        self.running.set()
        while self.running.is_set():
            # .. do some important stuff
            pass

    def halt(self):
        self.running.clear()


def main():
    t = WorkerThread()
    t.start()
    # .. do other important stuff
    t.halt()
    t.join()
To check whether the thread is still running, call t.is_alive().
The field "running" is shared state, so you need to guard it with some kind of monitor. Without synchronized access to this shared state, its visibility semantics are difficult to reason about and you may get unexpected results.
I have a Twisted project which seeks to essentially rebroadcast collected data over TCP in JSON. I essentially have a USB library which I need to subscribe to and synchronously read in a while loop indefinitely like so:
while True:
    for line in usbDevice.streamData():
        data = MyBrandSpankingNewUSBDeviceData(line)
        # parse the data, convert to JSON
        output = convertDataToJSON(data)
        # broadcast the data
        ...
The problem, of course, is the "..." part. Essentially, I need to start this process as soon as the server starts and end it when the server ends (Protocol.doStart and Protocol.doStop), and have it constantly running and broadcasting output to every connected transport.
How can I do this in Twisted? Obviously, I'd need to have the while loop run in its own thread, but how can I "subscribe" clients to listen to output? It's also important that the USB data collection only be running once, as it could seriously mess things up to have it running more than once.
In a nutshell, here's my architecture:
Server has a USB hub which is streaming data all the time. Server is constantly subscribed to this USB hub and is constantly reading data.
Clients will come and go, connecting and disconnecting at will.
We want to send data to all connected clients whenever it is available. How can I do this in Twisted?
One thing you probably want to do is try to extend the common protocol/transport independence. Even though you need a thread with a long-running loop, you can hide this from the protocol. The benefit is the same as usual: the protocol becomes easier to test, and if you ever manage to have a non-threaded implementation of reading the USB events, you can just change the transport without changing the protocol.
from threading import Thread


class USBThingy(Thread):
    def __init__(self, reactor, device, protocol):
        Thread.__init__(self)
        self._reactor = reactor
        self._device = device
        self._protocol = protocol

    def run(self):
        while True:
            for line in self._device.streamData():
                # Hand each line over to the reactor thread.
                self._reactor.callFromThread(self._protocol.usbStreamLineReceived, line)
The use of callFromThread is part of what makes this solution usable. It makes sure the usbStreamLineReceived method gets called in the reactor thread rather than in the thread that's reading from the USB device. So from the perspective of that protocol object, there's nothing special going on with respect to threading: it just has its method called once in a while when there's some data to process.
Your protocol then just needs to implement usbStreamLineReceived somehow, and implement your other application-specific logic, like keeping a list of observers:
class SomeUSBProtocol(object):
    def __init__(self):
        self.observers = []

    def usbStreamLineReceived(self, line):
        data = MyBrandSpankingNewUSBDeviceData(line)
        # broadcast the data
        for obs in self.observers[:]:
            obs(data)
And then observers can register themselves with an instance of this class and do whatever they want with the data:
from twisted.internet.protocol import Protocol


class USBObserverThing(Protocol):
    def connectionMade(self):
        self.factory.usbProto.observers.append(self.emit)

    def connectionLost(self, reason):
        self.factory.usbProto.observers.remove(self.emit)

    def emit(self, data):
        # parse the data, convert to JSON
        output = convertDataToJSON(data)
        self.transport.write(output)
Hook it all together:
from twisted.internet import reactor
from twisted.internet.protocol import ServerFactory

usbDevice = ...
usbProto = SomeUSBProtocol()
thingy = USBThingy(reactor, usbDevice, usbProto)
thingy.start()
factory = ServerFactory()
factory.protocol = USBObserverThing
factory.usbProto = usbProto
reactor.listenTCP(12345, factory)
reactor.run()
You can imagine a better observer register/unregister API (like one using actual methods instead of direct access to that list). You could also imagine giving the USBThingy a method for shutting down so SomeUSBProtocol could control when it stops running (so your process will actually be able to exit).
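One possible shape for those two improvements, written without Twisted so it runs on its own. ObserverRegistry and StoppableReader are hypothetical names; the reader takes any iterable of lines and a deliver callable, which in the Twisted setup could be a small function that forwards the line through reactor.callFromThread.

```python
import threading


class ObserverRegistry:
    """Keeps the observer list private behind register/unregister methods."""

    def __init__(self):
        self._observers = []

    def register(self, callback):
        self._observers.append(callback)

    def unregister(self, callback):
        # Ignore callbacks that were never registered or are already gone.
        if callback in self._observers:
            self._observers.remove(callback)

    def broadcast(self, data):
        # Iterate over a copy so observers may unregister during the loop.
        for callback in list(self._observers):
            callback(data)


class StoppableReader(threading.Thread):
    """Hypothetical stand-in for USBThingy with a shutdown method."""

    def __init__(self, lines, deliver):
        super().__init__()
        self._lines = lines      # any iterable of lines (assumption)
        self._deliver = deliver  # called once for every line that is read
        self._stopping = threading.Event()

    def shutdown(self):
        self._stopping.set()

    def run(self):
        for line in self._lines:
            if self._stopping.is_set():
                break
            self._deliver(line)
```

The stop request is only checked between lines, so a reader blocked inside the device's stream will not exit until the next line arrives. Whether that is acceptable depends on how the USB library behaves, which the question does not say.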
I have a single Python process that uses a serial port (a unique resource), which is managed through an instance of a class A. There exist two different threads, initialized using instances of classes B and C, which constantly use the serial port resource through the object already created.
import threading


class A(threading.Thread):
    # Zigbee serial port handler
    def __init__(self, dev):
        threading.Thread.__init__(self)
        # something here that initializes the serial port

    def run(self):
        while True:
            # listening to the serial interface
            pass

    def pack(self):
        # something
        pass

    def checksum(self):
        # something
        pass

    def write(self):
        # something
        pass


class B(threading.Thread):
    def __init__(self, SerialPortHandler):
        threading.Thread.__init__(self)
        self.serialporthandler = SerialPortHandler

    def run(self):
        while True:
            # something that uses self.serialporthandler
            pass


class C(threading.Thread):
    def __init__(self, SerialPortHandler):
        threading.Thread.__init__(self)
        self.serialporthandler = SerialPortHandler

    def run(self):
        while True:
            # something that uses self.serialporthandler
            pass


def main():
    a = A('/dev/ttyUSB1')
    b = B(a)
    b.start()
    c = C(a)
    c.start()


if __name__ == '__main__':
    main()
The problem is that both threads are trying to access the serial resource at the same time. I could use several instances of the same class A, attaching Lock.acquire() and Lock.release() in the sensitive parts.
Could some of you point me to the right way?
Thank you in advance.
While you could share the serial port using appropriate locking, I wouldn't recommend it. I've written several multi-threaded applications that communicate on the serial port in Python, and in my experience the following approach is better:
Have a single class, in a single thread, manage the actual serial port communication, via a Queue object or two:
Stuff read from the port is placed into the queue
Commands to send to the port are placed into the queue and the "Serial thread" sends them
Have the other threads implement logic by placing things into the queue and taking things out
Using Queue objects will greatly simplify your code and make it more robust.
This approach opens a lot of possibilities for you in terms of design. You can, for example, register events (callbacks) with the serial thread manager and have it call them (in a synchronized way) when interesting events occur, etc.
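A minimal sketch of that layout, under stated assumptions: FakePort is a hypothetical stand-in for a real serial port object, and the "ACK:" reply it produces is invented so the example can run without hardware. SerialManager is the only thread that touches the port; the other threads only use its queues.

```python
import queue
import threading


class FakePort:
    """Hypothetical stand-in for a serial port; records what is written."""

    def __init__(self):
        self.written = []

    def write(self, payload):
        self.written.append(payload)

    def read(self):
        # Pretend the device acknowledges the most recent write.
        return b"ACK:" + self.written[-1]


class SerialManager(threading.Thread):
    """The only thread that ever touches the port."""

    def __init__(self, port):
        super().__init__()
        self._port = port
        self.outgoing = queue.Queue()  # commands queued by other threads
        self.incoming = queue.Queue()  # data read back from the port

    def send(self, payload):
        # Safe to call from any thread: it only enqueues the command.
        self.outgoing.put(payload)

    def stop(self):
        self.outgoing.put(None)  # sentinel: drain the queue, then exit

    def run(self):
        while True:
            payload = self.outgoing.get()
            if payload is None:
                break
            self._port.write(payload)
            self.incoming.put(self._port.read())
```

Threads such as B and C would call send() and take replies from incoming, so no lock around the port is needed. With a real device the manager would also have to read unsolicited data, for example by polling the port with a timeout between commands.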
Add a threading.Lock() to class A and make it acquire the lock when using it:
# Methods of class A
def __init__(self, dev):
    self.lock = threading.Lock()

def read(self):
    # "with" releases the lock even if reading raises an exception
    with self.lock:
        data = ...  # whatever you need to do here to read
    return data

def write(self, data):
    with self.lock:
        # whatever you need to do to write
        pass