Access Python Code That Is Running

I apologize if this isn't the correct way to word it, but I'm not sure where to start. If this question needs to be reworded, I will definitely do that.
I have just finished writing a piece of code that collects data from a variety of servers. It is currently running, and I would like to start writing other pieces of code that can access the data being collected. Obviously I could do this by dumping the data into files and having my data analysis code read those files from disk. However, for some of my analysis I would like something closer to real-time data. Is there a way for me to access the class from my data collection code without explicitly instantiating it? I mean, can I set up one piece of code to start the data collection, and then later write other pieces of code that can access the data collection class without stopping and restarting the collection process?
I hope that makes sense. I realize the data can just be stored to disk, and my analysis code could simply watch directories for changes. However, I am curious to know whether something like this can be done.

This looks like a Producer-Consumer problem.
The producer's job is to generate a piece of data, put it into the buffer, and start again. At the same time, the consumer is consuming the data (i.e., removing it from the buffer) one piece at a time.
The catch here is "at the same time": the producer and consumer need to run concurrently, hence we need separate threads for the Producer and the Consumer.
I am taking the code below from the above link; you should go through it for extra details.
from threading import Thread
import time
import random
from queue import Queue

queue = Queue(10)  # bounded buffer: holds at most 10 items

class ProducerThread(Thread):
    def run(self):
        nums = range(5)
        while True:
            num = random.choice(nums)
            queue.put(num)            # blocks if the queue is full
            print("Produced", num)
            time.sleep(random.random())

class ConsumerThread(Thread):
    def run(self):
        while True:
            num = queue.get()         # blocks if the queue is empty
            queue.task_done()
            print("Consumed", num)
            time.sleep(random.random())

ProducerThread().start()
ConsumerThread().start()
Explanation:
We are using a Queue instance (hereafter queue). Queue has a Condition, and that condition has its own lock; you don't need to bother with the Condition and Lock if you use Queue.
The producer uses put(), available on queue, to insert data into the queue. put() has the logic to acquire the lock before inserting data into the queue.
put() also checks whether the queue is full. If it is, it calls wait() internally, so the producer starts waiting.
The consumer uses get().
get() acquires the lock before removing data from the queue, and checks whether the queue is empty. If it is, it puts the consumer in a waiting state.
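To see that blocking behaviour at the boundaries, note that put() and get() also accept block and timeout arguments, so callers can opt out of waiting. A minimal standalone sketch (not part of the answer's code above):

import queue

q = queue.Queue(maxsize=2)
q.put(1)
q.put(2)

try:
    q.put(3, block=False)    # queue is full: raises queue.Full immediately
except queue.Full:
    print("queue full, dropping item")

print(q.get(timeout=1.0))    # an item is available, so this returns 1 at once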

Related

How to process a list of tasks while limiting the number of threads started simultaneously, by first checking whether a session is available

I am currently working on a test system that uses Selenium Grid for WhatsApp automation.
WhatsApp requires a QR code scan to log in, but once the code has been scanned, the session persists as long as the cookies remain saved in the browser's user data directory.
I would like to run a series of tests concurrently while making sure that every session is only used by one thread at any given time.
I would also like to be able to add additional tests to the queue while tests are being run.
So far I have considered using the ThreadPoolExecutor context manager in order to limit the maximum available workers to the maximum number of sessions. Something like this:
import queue
import concurrent.futures
from concurrent.futures import ThreadPoolExecutor

number_of_sessions = 4  # the maximum number of logged-in WhatsApp sessions

def make_queue(questions):
    q = queue.Queue()
    for question in questions:
        q.put(question)
    return q

def test_conversation(q):
    item = q.get()
    # WhatsApp test happens here
    q.task_done()

def run_tests(questions):
    q = make_queue(questions)
    with ThreadPoolExecutor(max_workers=number_of_sessions) as executor:
        futures = []
        while not q.empty():
            futures.append(executor.submit(test_conversation, q))
        for f in concurrent.futures.as_completed(futures):
            pass  # save results somewhere
It does not include a way to make sure that every thread gets its own session, though, and as far as I know I can only send one parameter to the function that the executor calls.
I could make some complicated checkout system that works like borrowing books from a library, so that every session can be checked out by only one thread at any given time, but I'm not confident I could make something that is thread safe and works in all cases, even the ones I can't think of until they happen.
I am also not sure how I would keep the whole thing going while adding items to the queue without locking up my entire application. Would I have to run run_tests() in its own thread?
Is there an established way to do this? Any help would be much appreciated.
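For what it's worth, the "library checkout" idea described above maps quite naturally onto a queue.Queue used as a session pool: each get() hands a session to exactly one thread, and put() checks it back in, with all the locking handled by the queue. A minimal sketch (the session names and test data here are made up):

import queue
from concurrent.futures import ThreadPoolExecutor

# Hypothetical session identifiers (e.g., browser user-data directories).
session_pool = queue.Queue()
for name in ("session-1", "session-2", "session-3"):
    session_pool.put(name)

def test_conversation(question):
    session = session_pool.get()       # blocks until a session is free
    try:
        # WhatsApp test would happen here, using `session`
        return question, session
    finally:
        session_pool.put(session)      # always check the session back in

with ThreadPoolExecutor(max_workers=3) as executor:
    futures = [executor.submit(test_conversation, q)
               for q in ("hello?", "how are you?", "bye")]
    for f in futures:
        print(f.result())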

Python multiprocessing queue makes code hang with large data

I'm using Python's multiprocessing to analyse some large texts. After some days of trying to figure out why my code was hanging (i.e., the processes didn't end), I was able to reproduce the problem with the following simple code:
import multiprocessing as mp

for y in range(65500, 65600):
    print(y)

    def func(output):
        output.put("a" * y)

    if __name__ == "__main__":
        output = mp.Queue()
        process = mp.Process(target=func, args=(output,))
        process.start()
        process.join()
As you can see, if the item to put in the queue gets too large, the process just hangs.
It doesn't freeze; if I write more code after output.put() it will run, but the process still never stops.
This starts happening when the string reaches about 65500 characters, though the exact threshold may vary with your interpreter.
I was aware that mp.Queue has a maxsize argument, but after some searching I found out that it refers to the queue's size in number of items, not the size of the items themselves.
Is there a way around this?
The data I need to put into the Queue in my original code is very, very large...
Your queue fills up with no consumer to empty it.
From the definition of Queue.put:
If the optional argument block is True (the default) and timeout is None (the default), block if necessary until a free slot is available.
Assuming there is no deadlock possible between producer and consumer (and assuming your original code does have a consumer, since your sample doesn't), eventually the producers should be unblocked and terminate. Check the code of your consumer (or add it to the question, so we can have a look).
Update
"This is not the problem, because the queue has not been given a maxsize, so put should succeed until you run out of memory."
This is not the behavior of Queue. As elaborated in this ticket, the part blocking here is not the queue itself, but the underlying pipe. From the linked resource (inserts between "[]" are mine):
A queue works like this:
- when you call queue.put(data), the data is added to a deque, which can grow and shrink forever
- then a thread pops elements from the deque and sends them so that the other process can receive them through a pipe or a Unix socket (created via socketpair). But, and that's the important point, both pipes and Unix sockets have a limited capacity (it used to be 4k, the page size, on older Linux kernels for pipes; now it's 64k, and between 64k-120k for Unix sockets, depending on tunable sysctls).
- when you do queue.get(), you just do a read on the pipe/socket
[..] when the size [becomes too big], the writing thread blocks on the write syscall.
And since a join is performed before dequeuing the item [note: that's your process.join], you just deadlock, since the join waits for the sending thread to complete, and the write can't complete since the pipe/socket is full!
If you dequeue the item before joining the submitter process, everything works fine.
Update 2
"I understand. But I don't actually have a consumer (if it is what I'm thinking it is); I will only get the results from the queue when the process has finished putting them into the queue."
Yeah, this is the problem. multiprocessing.Queue is not a storage container. You should use it exclusively for passing data between "producers" (the processes that generate the data that enters the queue) and "consumers" (the processes that use that data). As you now know, leaving the data there is a bad idea.
"How can I get an item from the queue if I cannot even put it there first?"
put and get hide away the problem of reassembling the data if it fills up the pipe, so you only need to set up a loop in your "main" process to get items out of the queue and, for example, append them to a list. The list lives in the memory space of the main process and does not clog the pipe.
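A minimal sketch of that fix against the reproduction above: drain the queue in the parent before join, so the child's feeder thread can finish writing to the pipe.

import multiprocessing as mp

def func(output):
    output.put("a" * 70000)       # large enough to fill the pipe buffer

if __name__ == "__main__":
    output = mp.Queue()
    process = mp.Process(target=func, args=(output,))
    process.start()
    result = output.get()         # drain first: unblocks the child's writer
    process.join()                # now the child can exit cleanly
    print(len(result))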

How to not get stuck in an infinitely blocking function?

I'm not sure what the correct terminology is, but my example should clear it up.
I want to listen to a Reddit comment stream.
This stream receives a comment in real-time as it is posted to reddit (/r/askReddit, and /r/worldNews), so I don't have to poll the server.
However, this function blocks, so I need to run it in separate threads.
Here's what I have so far:
#!/usr/bin/env python3
from multiprocessing.dummy import Pool   # a thread pool with the multiprocessing API
import praw

def process_item(stream):
    # Display each comment as it arrives
    for comment in stream:
        print(comment.permalink)

def get_streams(reddit):
    # Listen for comments from these two subreddits:
    streams = [
        reddit.subreddit('AskReddit').stream.comments(skip_existing=True),
        reddit.subreddit('worldnews').stream.comments(skip_existing=True)
    ]
    pool = Pool(4)
    print('waiting for comments...')
    results = pool.map(process_item, streams)
    # But I want to do tons of other things down here or in `main()`.
    # The code will never reach down here because it's always listening for comments.
The only workaround I can see is to put my entire program logic into process_item(), but that seems really stupid.
I think I want process_item to keep adding comments to a list in the background, so that I can process those comments as I see fit; I just need to not get stuck inside process_item().
In other words: while the program is doing other things, a list keeps being filled with jobs to work through.
Possible? If so, could you give me some tips as to the pattern?
I'm brand new to threading.
Read more about the pub/sub pattern.
multiprocessing creates OS processes; processes and threads are different things. If you want threads, use the threading module (and when processing the data, think about the GIL).
To do what you describe:
- start some threads that read data from the stream and put each message into a shared queue
- start some other threads that take messages off that queue and process your data
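A minimal sketch of that layout, reusing the praw streams from the question (the listen helper and the queue name are made up):

import threading
import queue

comments = queue.Queue()

def listen(stream):
    # producer: blocks on the stream forever, feeding the shared queue
    for comment in stream:
        comments.put(comment)

def start_listeners(streams):
    for stream in streams:
        # daemon threads die together with the main program
        threading.Thread(target=listen, args=(stream,), daemon=True).start()

# In main: call start_listeners(streams), then do other work, e.g.:
# while True:
#     comment = comments.get()    # or get(timeout=...) to stay responsive
#     print(comment.permalink)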

Python multithreading with shared variable

I'm trying to parallelise my job, but I'm new to multithreading, so I feel confused about the concrete implementation.
I have a socket listener that saves data to a buffer. When the buffer reaches its capacity, I need to save its data to a database.
On one thread I want to start the socket listener, while on a parallel task I want to check the buffer status.
BufferQueue is just an extension of a Python list, with a method that allows checking whether the list has reached the specified size.
SocketManager is a streaming data provider for a STREAM_URL I'm listening to. It uses a callback function to handle messages.
But as I use callbacks to retrieve data, I'm not sure that using a shared variable is the right and optimal decision for that.
buffer = BufferQueue(buffer_size=10000)

def start_listening_to_socket(client):
    s = SocketManager(client)
    s.start_socket(cb_new)
    s.start()

def cb_new(message):
    print("New message")
    for m in message:
        # save data to buffer
        buffer.add(m)

def is_buffer_ready():
    print("Buffer state")
    if buffer.ready():
        # save buffer data to db
        ...
I'd appreciate it if you could help me with this case.
I think all you’re looking for is the queue module.
A queue.Queue is a self-synchronized queue designed specifically for passing objects between threads.
By default, calling get on a queue will block until an object is available, which is what you usually want to do—the point of using threads for concurrency in a network app is that your threads all look like normal synchronous code, but spend most of their time waiting on a socket, a file, a queue, or whatever when they have nothing to do. But you can check without blocking by using block=False, or put a timeout on the wait.
You can also specify a maxsize when you construct the queue. Then, by default, put will block until the queue isn’t too full to accept the new object. But, again, you can use block or timeout to try and fail if it’s too full.
All synchronization is taken care of internally inside get and put, so you don’t need a Lock to guarantee thread safety or a Condition to signal waiters.
A queue can even take care of shutdown for you. The producer can just put a special value that tells the consumer to quit when it sees it on a get.
For a graceful shutdown where the producer then needs to wait until the consumer has finished, you can call the optional task_done method after the consumer has finished processing each queued object, and have the producer block on the join method. But if you don't need this, or you have another way to wait for shutdown (e.g., joining the consumer thread), you can skip this part.
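A minimal sketch of that shape, using None as the shutdown sentinel (the names here are made up):

import threading
import queue

q = queue.Queue(maxsize=100)    # put() blocks when 100 items are pending

def consumer():
    while True:
        item = q.get()
        if item is None:        # sentinel: the producer is done
            q.task_done()
            break
        print("processing", item)
        q.task_done()

t = threading.Thread(target=consumer)
t.start()
for i in range(5):
    q.put(i)
q.put(None)                     # ask the consumer to quit
q.join()                        # blocks until every item was task_done()'d
t.join()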
Multithreading gives you shared state of resources (variables). Instead of using globals, just pass the buffer as an argument to your functions, and read from and write to it there.
You still need to control access to the buffer so that the two threads are not reading and writing at the same time. You can achieve that with a Lock from the threading module:
import threading

lock = threading.Lock()

def cb_new(buffer_, lock_, message):
    print("New message")
    with lock_:                  # acquire the lock for the duration of the block
        for m in message:
            # save data to buffer
            buffer_.add(m)

def is_buffer_ready(buffer_, lock_):
    print("Buffer state")
    with lock_:
        if buffer_.ready():
            # save buffer data to db
            ...
Note that if you are working with multiprocessing instead of threads, this solution won't work.
By the way, as @abarnert commented, there are better mechanisms for checking whether the buffer is ready (has data to read / has free space to write) than calling a function that checks it. Check out select.select(), which blocks until the buffer is actually ready.
When working with select, you put the call inside a while True loop and check whether the buffer is ready for reading. You can start this function in a thread, passing it a flag variable and the buffer; to stop the thread, change the flag you passed to False. For the buffer object, use queue.Queue() or a similar data structure.
import select

BUFFSIZE = 4096  # read chunk size; the original did not specify a value

def read_select(flag, buff):
    # `buff` must expose a file descriptor (a socket or file-like object)
    while flag:
        r, _, _ = select.select([buff], [], [])
        if r:
            data = buff.read(BUFFSIZE)
            # process data

P.S. select also works with sockets: you can pass a socket object instead of a buffer, and it will check whether the data on the socket is ready to be read.

How to know if a particular task inside a queue is complete?

I have a doubt with respect to Python queues.
I have written a threaded class whose run() method processes tasks from the queue.
import threading
import queue

class AThread(threading.Thread):
    def __init__(self, arg1):
        self.file_resource = arg1
        threading.Thread.__init__(self)
        self.queue = queue.Queue()

    def __myTask(self):
        '''Method that will access a common resource.
        Needs to be synchronized.
        Returns a Boolean based on the outcome.
        '''
        self.file_resource.write()

    def run(self):
        while True:
            cmd = self.queue.get()
            # cmd is actually a call to a method; __myTask is name-mangled,
            # so the exec'd string needs the _AThread prefix
            exec("self._AThread__" + cmd)
            self.queue.task_done()

# The problem I have here is while invoking the thread:
a = AThread(file_resource)   # file_resource: any object with a write() method
a.start()
a.queue.put("myTask()")
print("Hai")
The same instance of AThread (a = AThread()) will have tasks loaded onto its queue from different locations.
Hence the print statement at the bottom should wait for the task added to the queue above, wait only for a definite period, and also receive the value returned by executing the task.
Is there a simple way to achieve this? I have searched a lot regarding this; kindly review the code and provide suggestions.
Also, why are Python's acquire and release locks not per instance of the class? In the scenario mentioned, instances a and b of AThread need not be synchronized with each other, but myTask runs synchronized across both a and b when acquire and release locks are applied.
Kindly provide suggestions.
There are lots of approaches you could take, depending on the particular contours of your problem.
If your print("Hai") just needs to happen after myTask completes, you could package it as a task itself and have myTask put that task on the queue when it finishes. (If you're a CS-theory sort of person, you can think of this as analogous to continuation-passing style.)
If your print("Hai") has a more elaborate dependency on multiple tasks, you might look into futures or promises.
You could take a step into the world of actor-based concurrency, in which case there would probably be a synchronous message-send method that does more or less what you want.
If you don't want to use futures or promises, you can achieve a similar thing manually by introducing a condition variable. Set the condition variable before myTask starts and pass it to myTask, then wait for it to be cleared. You'll have to be very careful as your program grows, and constantly rethink your locking strategy to make sure it stays simple and comprehensible; this is the stuff of which difficult concurrency bugs are made.
The smallest sensible step to get what you want is probably to provide a blocking version of Queue.put() which does the condition-variable thing. Make sure you think about whether you want to block until the queue is empty, until the thing you put on the queue is removed from the queue, or until the thing you put on the queue has finished processing. And then make sure you implement the thing you decided to implement when you were thinking about it.
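As one concrete option along those lines, here is a minimal sketch that pairs each queued task with a threading.Event, so a caller can block until that particular task has been processed (the worker and put_and_wait names are made up):

import threading
import queue

task_queue = queue.Queue()

def worker():
    while True:
        func, done = task_queue.get()
        func()                   # run the task
        done.set()               # signal whoever is waiting on this task
        task_queue.task_done()

threading.Thread(target=worker, daemon=True).start()

def put_and_wait(func, timeout=None):
    done = threading.Event()
    task_queue.put((func, done))
    return done.wait(timeout)    # True if the task finished within the timeout

put_and_wait(lambda: print("myTask ran"))
print("Hai")                     # only reached after the task has completed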
