Python multithreading with shared variable

I'm trying to parallelise my job, but I'm new to multithreading, so I feel confused about the concrete implementation.
I have a socket listener that saves data to a buffer. When the buffer reaches its capacity, I need to save its data to a database.
On one thread I want to start the socket listener, while a parallel task checks the buffer status.
BufferQueue is just an extension of a Python list, with a method that allows checking whether the list has reached the specified size.
SocketManager is a streaming data provider for the STREAM_URL I'm listening to. It uses a callback function to handle messages.
But since I use callbacks to retrieve data, I'm not sure that a shared variable is the right and optimal choice here:
buffer = BufferQueue(buffer_size=10000)

def start_listening_to_socket(client):
    s = SocketManager(client)
    s.start_socket(cb_new)
    s.start()

def cb_new(message):
    print("New message")
    global buffer
    for m in message:
        buffer.append(m)  # save data to buffer

def is_buffer_ready():
    global buffer
    print("Buffer state")
    if buffer.ready():
        pass  # save buffer data to db
I'd appreciate it if you can help me with this case.

I think all you’re looking for is the queue module.
A queue.Queue is a self-synchronized queue designed specifically for passing objects between threads.
By default, calling get on a queue will block until an object is available, which is usually what you want: the point of using threads for concurrency in a network app is that your threads all look like normal synchronous code, but spend most of their time waiting on a socket, a file, a queue, or whatever when they have nothing to do. You can also check without blocking by passing block=False, or put a timeout on the wait.
You can also specify a maxsize when you construct the queue. Then, by default, put will block until the queue isn’t too full to accept the new object. But, again, you can use block or timeout to try and fail if it’s too full.
All synchronization is taken care of internally inside get and put, so you don’t need a Lock to guarantee thread safety or a Condition to signal waiters.
A queue can even take care of shutdown for you. The producer can just put a special value that tells the consumer to quit when it sees it on a get.
For graceful shutdown where the producer then needs to wait until the consumer has finished, you can call the optional task_done method after the consumer has finished processing each queued object, and have the producer block on the join method. But if you don't need this, or have another way to wait for shutdown (e.g., joining the consumer thread), you can skip this part.
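A minimal sketch of that pattern applied to this case (the flush threshold and the save_to_db stub are assumptions for illustration; your real socket callback and database code would replace them):

import queue
import threading

q = queue.Queue(maxsize=10000)
STOP = object()  # sentinel: the producer puts this to tell the consumer to quit

def save_to_db(batch):
    # placeholder: replace with a real database write
    print("saving %d items" % len(batch))

def cb_new(message):
    # producer side: called by the socket library for each incoming message
    for m in message:
        q.put(m)  # blocks if the queue is full

def consumer():
    batch = []
    while True:
        item = q.get()  # blocks until an item is available
        if item is STOP:
            break
        batch.append(item)
        if len(batch) >= 1000:  # assumed flush threshold
            save_to_db(batch)
            batch = []

threading.Thread(target=consumer, daemon=True).start()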

Multithreading gives you shared state between threads (variables). Instead of using globals, just pass the buffer as an argument to your functions, and read from / write to it there.
You still need to control access to the buffer, so the two threads are not reading and writing at the same time. You can achieve that using Lock from the threading module:
import threading

lock = threading.Lock()

def cb_new(buffer_, lock_, message):
    print("New message")
    with lock_:  # acquire the lock; it is released automatically on exit
        for m in message:
            buffer_.add(m)  # save data to buffer

def is_buffer_ready(buffer_, lock_):
    print("Buffer state")
    with lock_:
        if buffer_.ready():
            pass  # save buffer data to db
Note that if you are working with multiprocessing instead of threads, this solution won't work.
By the way, as @abarnert commented, there are better mechanisms to check whether the buffer is ready (has data to read / has free space to write) than calling a function that checks it. Check out select.select(), which blocks until the buffer is actually ready.
When working with select, you put the calls inside a while True loop and check whether the buffer is ready for reading. You can start this function in a thread, passing a flag variable and the buffer. To stop the thread, clear the flag you passed. For the buffer object, use Queue.Queue() or a similar data structure.
import select

BUFFSIZE = 4096  # assumed read size

def read_select(flag, sock):
    # 'flag' is a threading.Event here (a plain bool argument could not be
    # changed from outside the thread); 'sock' is a socket or any object
    # with a file descriptor that select() can wait on.
    while flag.is_set():
        r, _, _ = select.select([sock], [], [], 1.0)  # 1s timeout so the flag is re-checked
        if r:
            data = sock.recv(BUFFSIZE)
            # process data
P.S. - select also works with sockets. You can pass a socket object instead of a buffer, and it will check whether the socket's receive buffer is ready for reading.

Related

Python multiprocessing queue makes code hang with large data

I'm using Python's multiprocessing to analyse some large texts. After some days trying to figure out why my code was hanging (i.e. the processes didn't end), I was able to recreate the problem with the following simple code:
import multiprocessing as mp

for y in range(65500, 65600):
    print(y)

    def func(output):
        output.put("a" * y)

    if __name__ == "__main__":
        output = mp.Queue()
        process = mp.Process(target=func, args=(output,))
        process.start()
        process.join()
As you can see, if the item to put in the queue gets too large, the process just hangs.
It doesn't freeze: if I write more code after output.put(), it will run; but still, the process never stops.
This starts happening when the string reaches 65500 characters, though the exact point may vary depending on your interpreter.
I was aware that mp.Queue has a maxsize argument, but doing some search I found out it is about the Queue's size in number of items, not the size of the items themselves.
Is there a way around this?
The data I need to put inside the Queue in my original code is very very large...
Your queue fills up with no consumer to empty it.
From the definition of Queue.put:
If the optional argument block is True (the default) and timeout is None (the default), block if necessary until a free slot is available.
Assuming there is no deadlock possible between producer and consumer (and assuming your original code does have a consumer, since your sample doesn't), eventually the producers should be unblocked and terminate. Check the code of your consumer (or add it to the question, so we can have a look).
Update
"This is not the problem, because the queue has not been given a maxsize, so put should succeed until you run out of memory."
That, however, is not the behavior of Queue. As elaborated in this ticket, the part blocking here is not the queue itself, but the underlying pipe. From the linked resource (insertions between "[]" are mine):
A queue works like this:
- when you call queue.put(data), the data is added to a deque, which can grow and shrink forever
- then a thread pops elements from the deque, and sends them so that the other process can receive them through a pipe or a Unix socket (created via socketpair). But, and that's the important point, both pipes and Unix sockets have a limited capacity (used to be 4k - pagesize - on older Linux kernels for pipes, now it's 64k, and between 64k-120k for Unix sockets, depending on tunable sysctls).
- when you do queue.get(), you just do a read on the pipe/socket
[..] when size [becomes too big] the writing thread blocks on the write syscall.
And since a join is performed before dequeing the item [note: that's your process.join], you just deadlock, since the join waits for the sending thread to complete, and the write can't complete since the pipe/socket is full!
If you dequeue the item before waiting for the submitter process, everything works fine.
Update 2
"I understand. But I don't actually have a consumer (if it is what I'm thinking it is); I will only get the results from the queue when the process has finished putting items into it."
Yeah, this is the problem. The multiprocessing.Queue is not a storage container. You should use it exclusively for passing data between "producers" (the processes that generate the data that enters the queue) and "consumers" (the processes that "use" that data). As you now know, leaving the data there is a bad idea.
How can I get an item from the queue if I cannot even put it there first?
put and get hide away the problem of putting together the data if it fills up the pipe, so you only need to set up a loop in your "main" process to get items out of the queue and, for example, append them to a list. The list is in the memory space of the main process and does not clog the pipe.
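A minimal sketch of that fix applied to the example from the question; the only change is draining the queue before joining:

import multiprocessing as mp

def func(output):
    output.put("a" * 65600)  # large enough to fill the underlying pipe

if __name__ == "__main__":
    output = mp.Queue()
    process = mp.Process(target=func, args=(output,))
    process.start()
    data = output.get()  # drain the queue first, so the feeder thread can flush the pipe
    process.join()       # now the child process can actually exit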

How to sniff a network interface with Twisted?

I need to receive raw packets from a network interface within Twisted code. The packets will not have the correct IP or MAC address, nor valid headers, so I need the raw thing.
I have tried looking into twisted.pair, but I was not able to figure out how to use it to get at the raw interface.
Normally, I would use scapy.all.sniff. However, that is blocking, so I can't just use it with Twisted. (I also cannot use scapy.all.sniff with a timeout and busy-loop, because I don't want to lose packets.)
A possible solution would be to run scapy.all.sniff in a thread and somehow call back into Twisted when I get a packet. This seems a bit inelegant (and also, I don't know how to do it because I am a Twisted beginner), but I might settle for that if I don't find anything better.
You could run a distributed system and pass the data through a central queuing system. Take the Unix philosophy and create a single application that does a few tasks and does them well. Create one application that sniffs the packets (you can use scapy here, since it won't really matter if you block anything), then sends them to a queue (RabbitMQ, Redis, SQS, etc.), and have another application process the packets from the queue. This method should give you the least amount of headache; a sketch follows.
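For illustration, a sketch of that split using Redis as the queue (Redis itself, the key name, and the interface name are assumptions, not part of the answer):

# sniffer.py: blocking is fine here, since scapy owns the whole process
import redis
from scapy.all import sniff

r = redis.Redis()  # assumes a Redis server on localhost

def on_packet(pkt):
    r.rpush('raw_packets', bytes(pkt))  # 'raw_packets' is an assumed key name

sniff(iface='eth0', prn=on_packet, store=False)  # iface is an assumption

# processor.py: a separate process that consumes from the queue
import redis

r = redis.Redis()
while True:
    _key, raw = r.blpop('raw_packets')  # blocks until a packet arrives
    # process the raw packet bytes here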
If you need to run everything in a single application, then threads/multiprocessing is the only option. But there are some design patterns you'll want to follow. You can also break up the following code into separate functions and use a dedicated queuing system.
from threading import Thread
from time import sleep
from twisted.internet import defer, reactor

class Sniffer(Thread):
    def __init__(self, _reactor, shared_queue):
        super().__init__()
        self.reactor = _reactor
        self.shared_queue = shared_queue

    def run(self):
        """
        Sniffer logic here
        """
        while True:
            self.reactor.callFromThread(self.shared_queue.put, 'hello world')
            sleep(5)

@defer.inlineCallbacks
def consume_from_queue(_id, _reactor, shared_queue):
    item = yield shared_queue.get()
    print(str(_id), item)
    _reactor.callLater(0, consume_from_queue, _id, _reactor, shared_queue)

def main():
    shared_queue = defer.DeferredQueue()
    sniffer = Sniffer(reactor, shared_queue)
    sniffer.daemon = True
    sniffer.start()

    workers = 4
    for i in range(workers):
        consume_from_queue(i+1, reactor, shared_queue)

    reactor.run()

main()
The Sniffer class starts outside of Twisted's control. Notice sniffer.daemon = True; this is so that the thread stops when the main thread stops. If it were set to False (the default), the application would exit only once all the threads had come to an end. Depending on the task at hand, that may or may not always be possible. If you can take breaks from sniffing to check a thread event, then you might be able to stop the thread in a safer way.
self.reactor.callFromThread(self.shared_queue.put, 'hello world') is necessary so that the item is put into the queue in the main reactor thread, as opposed to the thread the Sniffer executes in. The main benefit is that the messages coming from the threads get some sort of synchronization (assuming you plan to scale to sniffing multiple interfaces). Also, I wasn't sure whether DeferredQueue objects are thread-safe :) so I treated them as if they were not.
Since Twisted isn't managing the threads in this case, it's vital that the developer does. Notice the worker loop and consume_from_queue(i+1, reactor, shared_queue). This loop ensures only the desired number of workers are handling tasks. Inside the consume_from_queue() function, shared_queue.get() waits (without blocking) until an item is put into the queue, prints the item, then schedules another consume_from_queue().

Handling endless data stream with multiprocessing and Queues

I want to use the Python 2.7 multiprocessing package to operate on an endless stream of data. A subprocess will constantly receive data via TCP/IP or UDP packets and immediately place the data in a multiprocessing.Queue. However, at certain intervals, say, every 500ms, I only want to operate on a user specified slice of this data. Let's say, the last 200 data packets.
I know I can put() and get() on the Queue, but how can I create that slice of data without a) Backing up the queue and b) Keeping things threadsafe?
I'm thinking I have to constantly get() from the Queue with another subprocess to prevent the Queue from getting full. Then I have to store the data in another data structure (such as a list) to build the user specified slice. But the data structure would probably not be thread safe, so it does not sound like a good solution.
Is there some programming paradigm that achieves what I am trying to do easily? I looked at the multiprocessing.Manager class, but wasn't sure it would work.
You can do this as follows:
Use an instance of the threading.Lock class. Call acquire to claim exclusive access to your queue from a certain thread, and call release to grant other threads access.
Since you want to keep gathering your input, copying the whole queue would probably be too expensive. Probably the fastest way is to first collect data in one queue, then swap it for another, and have a different thread read the data from the old one into your application. Protect the swap with a Lock instance, so you can be sure that whenever the writer acquires the lock, the current 'listener' queue is ready to receive data (see the sketch below).
If only recent data is important, use two circular buffers instead of queues, allowing old data to be overwritten.
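A minimal sketch of the swap technique, assuming threads within one process (the class and names are illustrative, not from the answer):

import threading
from collections import deque

class SwappingBuffer(object):
    """Writer appends to the active buffer; the reader atomically swaps it out."""
    def __init__(self):
        self.lock = threading.Lock()
        self.active = deque()

    def append(self, packet):
        # called by the receiving thread for every incoming packet
        with self.lock:
            self.active.append(packet)

    def drain(self):
        # called periodically (e.g. every 500ms) by the consumer: swap in a
        # fresh buffer and return the old one, holding the lock only briefly
        with self.lock:
            old, self.active = self.active, deque()
        return old

For the "last 200 packets" case, collections.deque(maxlen=200) behaves as a circular buffer: once full, each append silently drops the oldest entry.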

Access a Python Code That is Running

I apologize if this isn't the correct way to word it, but I'm not sure where to start. If this question needs to be reworded, I will definitely do that.
I have just finished writing a piece of code that is collecting data from a variety of servers. It is currently running, and I would like to be able to start writing other pieces of code that can access the data being collected. Obviously I can do this by dumping the data into files, and have my data analysis code read the files stored on disk. However, for some forms of my analysis I would like to have something closer to real time data. Is there a way for me to access the class from my data collection piece of code without explicitly instantiating it? I mean, can I set up one piece of code to start the data collection, and then write other pieces of code later that are able to access the data collection class without stopping and restarting the data collection piece of code?
I hope that makes sense. I realize the data can just be stored to disk, and I could do things like just have my data analysis code search directories for changes. However, I am just curious to know if something like this can be done.
This seems like a Producer-Consumer problem.
The producer's job is to generate a piece of data, put it into the buffer and start again. At the same time, the consumer is consuming the data (i.e., removing it from the buffer) one piece at a time.
The catch here is "At the same time". So, producer and consumer need to run concurrently. Hence we need separate threads for the Producer and the Consumer.
I am taking code from the above link; you should go through it for extra details.
from threading import Thread
import time
import random
from Queue import Queue

queue = Queue(10)

class ProducerThread(Thread):
    def run(self):
        nums = range(5)
        global queue
        while True:
            num = random.choice(nums)
            queue.put(num)
            print "Produced", num
            time.sleep(random.random())

class ConsumerThread(Thread):
    def run(self):
        global queue
        while True:
            num = queue.get()
            queue.task_done()
            print "Consumed", num
            time.sleep(random.random())

ProducerThread().start()
ConsumerThread().start()
Explanation:
We are using a Queue instance (hereafter queue). Queue has a Condition, and that condition has its own lock. You don't need to bother about Condition and Lock if you use Queue.
The producer uses put, available on queue, to insert data into the queue. put() has the logic to acquire the lock before inserting data into the queue.
put() also checks whether the queue is full. If it is, it calls wait() internally, and so the producer starts waiting.
The consumer uses get.
get() acquires the lock before removing data from the queue.
get() checks whether the queue is empty. If it is, it puts the consumer into a waiting state.

Pattern for a background Twisted server that fills an incoming message queue and empties an outgoing message queue?

I'd like to do something like this:
twistedServer.start() # This would be a nonblocking call
while True:
    while twistedServer.haveMessage():
        message = twistedServer.getMessage()
        response = handleMessage(message)
        twistedServer.sendResponse(response)
    doSomeOtherLogic()
The key thing I want to do is run the server in a background thread. I'm hoping to do this with a thread instead of through multiprocessing/queue because I already have one layer of messaging for my app and I'd like to avoid two. I'm bringing this up because I can already see how to do this in a separate process, but what I'd like to know is how to do it in a thread, or if I can. Or if perhaps there is some other pattern I can use that accomplishes this same thing, like perhaps writing my own reactor.run method. Thanks for any help.
:)
The key thing I want to do is run the server in a background thread.
You don't explain why this is key, though. Generally, things like "use threads" are implementation details. Perhaps threads are appropriate, perhaps not, but the actual goal is agnostic on the point. What is your goal? To handle multiple clients concurrently? To handle messages of this sort simultaneously with events from another source (for example, a web server)? Without knowing the ultimate goal, there's no way to know if an implementation strategy I suggest will work or not.
With that in mind, here are two possibilities.
First, you could forget about threads. This would entail defining your event handling logic above as only the event handling parts. The part that tries to get an event would be delegated to another part of the application, probably something ultimately based on one of the reactor APIs (for example, you might set up a TCP server which accepts messages and turns them into the events you're processing, in which case you would start off with a call to reactor.listenTCP of some sort).
So your example might turn into something like this (with some added specificity to try to increase the instructive value):
from twisted.internet import reactor

class MessageReverser(object):
    """
    Accept messages, reverse them, and send them onwards.
    """
    def __init__(self, server):
        self.server = server

    def messageReceived(self, message):
        """
        Callback invoked whenever a message is received. This implementation
        will reverse and re-send the message.
        """
        self.server.sendMessage(message[::-1])
        doSomeOtherLogic()

def main():
    twistedServer = ...
    twistedServer.start(MessageReverser(twistedServer))
    reactor.run()

main()
Several points to note about this example:
I'm not sure how your twistedServer is defined. I'm imagining that it interfaces with the network in some way. Your version of the code would have had it receiving messages and buffering them until they were removed from the buffer by your loop for processing. This version would probably have no buffer, but instead just call the messageReceived method of the object passed to start as soon as a message arrives. You could still add buffering of some sort if you want, by putting it into the messageReceived method.
There is now a call to reactor.run which will block. You might instead write this code as a twistd plugin or a .tac file, in which case you wouldn't be directly responsible for starting the reactor. However, someone must start the reactor, or most APIs from Twisted won't do anything. reactor.run blocks, of course, until someone calls reactor.stop.
There are no threads used by this approach. Twisted's cooperative multitasking approach to concurrency means you can still do multiple things at once, as long as you're mindful to cooperate (which usually means returning to the reactor once in a while).
The exact timing of calls to the doSomeOtherLogic function changes slightly, because there's no notion of "the buffer is empty for now" separate from "I just handled a message". You could change this so that the function is called once a second, or after every N messages, or whatever is appropriate (see the sketch after this list).
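For the once-a-second variant, a sketch using Twisted's task.LoopingCall (the body of doSomeOtherLogic is a placeholder):

from twisted.internet import reactor, task

def doSomeOtherLogic():
    print("periodic work")  # placeholder for the real logic

# Schedule doSomeOtherLogic cooperatively on the reactor, once a second.
loop = task.LoopingCall(doSomeOtherLogic)
loop.start(1.0)  # interval in seconds; fires immediately, then every second

reactor.run()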
The second possibility would be to really use threads. This might look very similar to the previous example, but you would call reactor.run in another thread, rather than the main thread. For example,
from Queue import Queue
from threading import Thread

class MessageQueuer(object):
    def __init__(self, queue):
        self.queue = queue

    def messageReceived(self, message):
        self.queue.put(message)

def main():
    queue = Queue()
    twistedServer = ...
    twistedServer.start(MessageQueuer(queue))

    Thread(target=reactor.run, args=(False,)).start()

    while True:
        message = queue.get()
        response = handleMessage(message)
        reactor.callFromThread(twistedServer.sendResponse, response)

main()
This version assumes a twistedServer which works similarly, but uses a thread to let you have the while True: loop. Note:
You must invoke reactor.run(False) if you use a thread, to prevent Twisted from trying to install any signal handlers, which Python only allows to be installed in the main thread. This means the Ctrl-C handling will be disabled and reactor.spawnProcess won't work reliably.
MessageQueuer has the same interface as MessageReverser, only its implementation of messageReceived is different. It uses the threadsafe Queue object to communicate between the reactor thread (in which it will be called) and your main thread where the while True: loop is running.
You must use reactor.callFromThread to send the message back to the reactor thread (assuming twistedServer.sendResponse is actually based on Twisted APIs). Twisted APIs are typically not threadsafe and must be called in the reactor thread. This is what reactor.callFromThread does for you.
You'll want to implement some way to stop the loop and the reactor, one supposes. The Python process won't exit cleanly until after you call reactor.stop (one way is sketched below).
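One possibility, as a sketch (the sentinel is an assumption, not part of the answer): have the producer put a special marker on the queue; the main loop breaks on it and stops the reactor from outside its thread.

SENTINEL = None  # hypothetical shutdown marker put on the queue by the producer

def main_loop(queue, twistedServer):
    while True:
        message = queue.get()
        if message is SENTINEL:
            break  # producer signalled shutdown
        response = handleMessage(message)
        reactor.callFromThread(twistedServer.sendResponse, response)
    # reactor APIs are not thread-safe, so stop it via callFromThread too
    reactor.callFromThread(reactor.stop)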
Note that while the threaded version gives you the familiar, desired while True loop, it doesn't actually do anything much better than the non-threaded version. It's just more complicated. So, consider whether you actually need threads, or if they're merely an implementation technique that can be exchanged for something else.
