I am having a synchronization problem while threading with CPython. I have two files; I parse them and return the desired result. However, the code below acts strangely: it returns three times instead of two, and it doesn't return in the order I put the jobs into the queue. Here's the code:
import Queue
import threading
from HtmlDoc import Document
OUT_LIST = []
class Threader(threading.Thread):
"""
Start threading
"""
def __init__(self, queue, out_queue):
threading.Thread.__init__(self)
self.queue = queue
self.out_queue = out_queue
def run(self):
while True:
if self.queue.qsize() == 0: break
path, host = self.queue.get()
f = open(path, "r")
source = f.read()
f.close()
self.out_queue.put((source, host))
self.queue.task_done()
class Processor(threading.Thread):
"""
Process threading
"""
def __init__(self, out_queue):
self.out_queue = out_queue
self.l_first = []
self.f_append = self.l_first.append
self.l_second = []
self.s_append = self.l_second.append
threading.Thread.__init__(self)
    def first(self, doc):
        # some code to retrieve the desired text; this works 100%, I tested it manually
        pass
    def second(self, doc):
        # some code to retrieve the desired text; this works 100%, I tested it manually
        pass
def run(self):
while True:
if self.out_queue.qsize() == 0: break
doc, host = self.out_queue.get()
if host == "first":
self.first(doc)
elif host == "second":
self.second(doc)
OUT_LIST.extend(self.l_first + self.l_second)
self.out_queue.task_done()
def main():
queue = Queue.Queue()
out_queue = Queue.Queue()
queue.put(("...first.html", "first"))
queue.put(("...second.html", "second"))
qsize = queue.qsize()
for i in range(qsize):
t = Threader(queue, out_queue)
t.setDaemon(True)
t.start()
for i in range(qsize):
dt = Processor(out_queue)
dt.setDaemon(True)
dt.start()
queue.join()
out_queue.join()
print '<br />'.join(OUT_LIST)
main()
Now, when I print, I'd like to print the content of "first" first of all and then the content of "second". Can anyone help me?
NOTE: I am threading because I will actually have to connect to more than 10 places at a time and retrieve their results. I believe that threading is the most appropriate way to accomplish such a task.
Threading is actually one of the most error-prone ways to manage multiple concurrent connections. A more powerful, more debuggable approach is to use event-driven asynchronous networking, such as implemented by Twisted. If you're interested in using this model, you might want to check out this introduction.
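Twisted's own APIs aren't shown in this answer. Purely to illustrate the event-driven model, here is a minimal sketch using the standard library's asyncio instead (a different library from the one recommended above; fetch and the host names are placeholders, not part of the question):

import asyncio

async def fetch(host):
    # Simulated I/O-bound fetch; a real client would await a network read here.
    await asyncio.sleep(0.1)
    return host, "<html>...</html>"

async def main():
    # gather() returns results in the order requested, not completion order.
    results = await asyncio.gather(*(fetch(h) for h in ("first", "second")))
    for host, body in results:
        print(host, len(body))

asyncio.run(main())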
I don't share the opinion that threading is the best way to do this (IMO some event/select mechanism would be better), but the problem with your code could be in the variables t and dt. You make the assignments inside the loop and the object instances are not stored anywhere, so it may be that your new Thread/Processor instance gets discarded at the end of each loop iteration.
It would be clearer if you showed us the precise output of this code.
1) You cannot control the order of job completion; it depends on execution time. To return results in the order you want, you can create a global dictionary keyed by job name, like job_results = {'first': None, 'second': None}, store each result there, and then fetch the data in the desired order (see the sketch after this list).
2) self.l_first and self.l_second should be cleared after each processed doc, or you will get duplicates in OUT_LIST.
3) You could also use multiprocessing via the subprocess module, write all the result data to CSV files, for example, and then sort them as you wish.
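For point 1, a minimal sketch of the idea (Python 3 here; job_results, the worker function, and the file names are illustrative, not taken from the question):

import queue
import threading

job_results = {"first": None, "second": None}  # one slot per job name
results_lock = threading.Lock()

def worker(jobs):
    while True:
        try:
            path, host = jobs.get_nowait()
        except queue.Empty:
            break
        with open(path) as f:
            source = f.read()
        with results_lock:
            job_results[host] = source  # keyed by job name, not arrival order
        jobs.task_done()

jobs = queue.Queue()
jobs.put(("first.html", "first"))
jobs.put(("second.html", "second"))
for _ in range(2):
    threading.Thread(target=worker, args=(jobs,), daemon=True).start()
jobs.join()

# Fetch the results in the desired order, regardless of completion order:
print(job_results["first"])
print(job_results["second"])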
I have a script to traverse an AWS S3 bucket to do some aggregation at the file level.
from threading import Semaphore, Thread
class Spider:
def __init__(self):
self.sem = Semaphore(120)
self.threads = list()
def crawl(self, root_url):
self.recursive_harvest_subroutine(root_url)
for thread in self.threads:
thread.join()
def recursive_harvest_subroutine(self, url):
children = get_direct_subdirs(url)
self.sem.acquire()
if len(children) == 0:
queue_url_to_do_something_later(url) # Done
else:
            for child_url in children:
                thread = Thread(target=self.recursive_harvest_subroutine, args=(child_url,))
                self.threads.append(thread)
                thread.start()
        self.sem.release()
This used to run okay, until I encountered a bucket with several TB of data and hundreds of thousands of sub-directories. The number of Thread objects in self.threads increases very fast, and soon the server reported:
RuntimeError: can't start new thread
There is some extra processing I have to do in the script, so I can't just list all the files in the bucket.
Currently I'm requiring a depth of at least 2 before the script goes parallel, but that's just a workaround. Any suggestion is appreciated.
The way the original piece of code worked was BFS, which created a lot of waiting threads in the queue. I changed it to DFS and everything is working fine. Pseudo code, in case someone needs this in the future:
def __init__(self):
self.sem = Semaphore(120)
self.urls = list()
self.mutex = Lock()
def crawl(self, root_url):
self.recursive_harvest_subroutine(root_url)
while not is_done():
self.sem.acquire()
url = self.urls.pop(0)
thread = Thread(target=self.recursive_harvest_subroutine, args=(url,))
thread.start()
self.sem.release()
def recursive_harvest_subroutine(self, url):
children = get_direct_subdirs(url)
if len(children) == 0:
queue_url_to_do_something_later(url) # Done
else:
self.mutex.acquire()
for child_url in children:
self.urls.insert(0, child_url)
self.mutex.release()
There is no join(), so I implemented my own is_done() check, sketched below.
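The is_done() check isn't shown above. One way it could work is to count outstanding URLs: increment when child URLs are queued, decrement when a URL finishes. A minimal sketch, with PendingCounter as a hypothetical helper that is not part of the original code:

from threading import Lock

class PendingCounter:
    """Hypothetical helper: tracks URLs queued but not yet fully processed."""
    def __init__(self):
        self._pending = 1  # the root URL counts as outstanding work
        self._lock = Lock()

    def added(self, n=1):   # call when child URLs are inserted into self.urls
        with self._lock:
            self._pending += n

    def finished(self):     # call when one URL's subroutine completes
        with self._lock:
            self._pending -= 1

    def is_done(self):
        with self._lock:
            return self._pending == 0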
I'm currently working on a project that involves three components:
an observer that checks for changes in a directory, a worker, and a command-line interface.
What I want to achieve is:
The observer, when a change happens, sends a string to the worker (adds a job to the worker's queue).
The worker has a queue of jobs and works on its queue forever.
Now I want the possibility to run a Python script to check the status of the worker (number of active jobs, errors, and so on).
I don't know how to achieve this with Python in terms of which components to use and how to link the three together.
I thought of a singleton worker where the observer adds a job to a queue, but 1) I was not able to write working code and 2) how can I fit the checker in?
Another solution I thought of is multiple child processes spawned from a parent that holds the queue, but I'm a bit lost...
Thanks for any advice.
I'd use some kind of observer pattern or publish-subscribe pattern. For the former you can use, for example, the Python version of ReactiveX. But for a more basic example, let's stay with the Python core. Parts of your program can subscribe to the worker and receive updates from it, via queues for example.
import itertools as it
from queue import Queue
from threading import Thread
import time
class Observable(Thread):
    """A thread that can push updates to any number of subscribed queues."""
    def __init__(self):
        super().__init__()
        self._observers = []
    def notify(self, msg):
        # Fan the message out to every subscriber's queue.
        for obs in self._observers:
            obs.put(msg)
    def subscribe(self, obs):
        self._observers.append(obs)

class Observer(Thread):
    """A thread that receives updates through its own queue."""
    def __init__(self):
        super().__init__()
        self.updates = Queue()

class Watcher(Observable):
    """Stands in for the directory observer: emits an event every second."""
    def run(self):
        for i in it.count():
            self.notify(i)
            time.sleep(1)

class Worker(Observable, Observer):
    """Consumes jobs from its own queue and publishes status updates."""
    def run(self):
        while True:
            task = self.updates.get()
            self.notify((str(task), 'start'))
            time.sleep(1)  # simulate doing the job
            self.notify((str(task), 'stop'))

class Supervisor(Observer):
    """The status checker: tracks which jobs are currently active."""
    def __init__(self):
        super().__init__()
        self._statuses = {}
    def run(self):
        while True:
            status = self.updates.get()
            print(status)
            self._statuses[status[0]] = status[1]
            # Do something based on status updates.
            if status[1] == 'stop':
                del self._statuses[status[0]]
watcher = Watcher()
worker = Worker()
supervisor = Supervisor()
watcher.subscribe(worker.updates)
worker.subscribe(supervisor.updates)
supervisor.start()
worker.start()
watcher.start()
However, many variations are possible, and you can look into the various patterns to see which suits you best.
I have two classes: MessageProducer and MessageConsumer.
MessageConsumer does the following:
1. receives messages and puts them in its message list "_unprocessed_msg_q"
2. on a separate worker thread, moves the messages to an internal list "_in_process_msg_q"
3. on the worker thread, processes messages from "_in_process_msg_q"
In my development environment, I'm facing an issue with #2 above: after a message is added in step #1, when the worker thread checks the length of "_unprocessed_msg_q", it gets zero.
When step #1 is repeated, the list properly shows 2 items on the thread where the item was added. But in step #2, on the worker thread, len(_unprocessed_msg_q) again returns zero.
Not sure why this is happening. I would really appreciate any help on this.
I'm using Ubuntu 16.04 with Python 2.7.12.
Below is the sample source code. Please let me know if more information is required.
import logging
import threading
import time

# Logger used by the consumer below
logging.basicConfig(level=logging.INFO, format='%(levelname)s: %(message)s')
LOG = logging.getLogger(__name__)
class MessageConsumerThread(threading.Thread):
def __init__(self):
super(MessageConsumerThread, self).__init__()
self._unprocessed_msg_q = []
self._in_process_msg_q = []
self._lock = threading.Lock()
self._stop_processing = False
def start_msg_processing_thread(self):
self._stop_processing = False
self.start()
def stop_msg_processing_thread(self):
self._stop_processing = True
def receive_msg(self, msg):
with self._lock:
LOG.info("Before: MessageConsumerThread::receive_msg: "
"len(self._unprocessed_msg_q)=%s" %
len(self._unprocessed_msg_q))
self._unprocessed_msg_q.append(msg)
LOG.info("After: MessageConsumerThread::receive_msg: "
"len(self._unprocessed_msg_q)=%s" %
len(self._unprocessed_msg_q))
def _queue_unprocessed_msgs(self):
with self._lock:
LOG.info("MessageConsumerThread::_queue_unprocessed_msgs: "
"len(self._unprocessed_msg_q)=%s" %
len(self._unprocessed_msg_q))
if self._unprocessed_msg_q:
LOG.info("Moving messages from unprocessed to in_process queue")
self._in_process_msg_q += self._unprocessed_msg_q
self._unprocessed_msg_q = []
LOG.info("Moved messages from unprocessed to in_process queue")
def run(self):
while not self._stop_processing:
# Allow other threads to add messages to message queue
time.sleep(1)
# Move unprocessed listeners to in-process listener queue
self._queue_unprocessed_msgs()
# If nothing to process continue the loop
if not self._in_process_msg_q:
continue
for msg in self._in_process_msg_q:
self.consume_message(msg)
# Clean up processed messages
del self._in_process_msg_q[:]
def consume_message(self, msg):
print(msg)
class MessageProducerThread(threading.Thread):
def __init__(self, producer_id, msg_receiver):
super(MessageProducerThread, self).__init__()
self._producer_id = producer_id
self._msg_receiver = msg_receiver
def start_producing_msgs(self):
self.start()
def run(self):
for i in range(1,10):
msg = "From: %s; Message:%s" %(self._producer_id, i)
self._msg_receiver.receive_msg(msg)
def main():
msg_receiver_thread = MessageConsumerThread()
msg_receiver_thread.start_msg_processing_thread()
msg_producer_thread = MessageProducerThread(producer_id='Producer-01',
msg_receiver=msg_receiver_thread)
msg_producer_thread.start_producing_msgs()
msg_producer_thread.join()
msg_receiver_thread.stop_msg_processing_thread()
msg_receiver_thread.join()
if __name__ == '__main__':
main()
Following is the log that I get:
INFO: MessageConsumerThread::_queue_unprocessed_msgs: len(self._unprocessed_msg_q)=0
INFO: Before: MessageConsumerThread::receive_msg: len(self._unprocessed_msg_q)=0
INFO: After: MessageConsumerThread::receive_msg: **len(self._unprocessed_msg_q)=1**
INFO: MessageConsumerThread::_queue_unprocessed_msgs: **len(self._unprocessed_msg_q)=0**
INFO: MessageConsumerThread::_queue_unprocessed_msgs: len(self._unprocessed_msg_q)=0
INFO: Before: MessageConsumerThread::receive_msg: len(self._unprocessed_msg_q)=1
INFO: After: MessageConsumerThread::receive_msg: **len(self._unprocessed_msg_q)=2**
INFO: MessageConsumerThread::_queue_unprocessed_msgs: **len(self._unprocessed_msg_q)=0**
This is not a good design for your application.
I spent some time trying to debug this, but threading code is naturally complicated, so we should try to simplify it instead of making it even more confusing.
When I see threading code in Python, I usually see it written in a procedural form: a normal function that is passed to threading.Thread as the target argument drives each thread. That way, you don't need to write code for a new class that will only ever have a single instance.
Another thing is that, although Python's global interpreter lock itself guarantees that lists won't get corrupted if they are modified from two separate threads, lists are not a recommended data structure for passing data between threads. You should probably look at Queue.Queue (queue in Python 3) for that, as in the small sketch below.
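A minimal sketch of that hand-off (Python 3 names; the None sentinel is just one common shutdown convention):

from queue import Queue  # "from Queue import Queue" on Python 2
from threading import Thread

q = Queue()

def consumer():
    while True:
        msg = q.get()       # blocks until a message arrives; no lock needed
        if msg is None:     # sentinel value tells the consumer to stop
            break
        print("got", msg)

t = Thread(target=consumer)
t.start()
for i in range(3):
    q.put(i)
q.put(None)
t.join()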
The thing that is wrong in this code at first sight is probably not the cause of your problem, because of your use of locks, but it might be. Instead of
self._unprocessed_msg_q = []
which creates a new list object that the other thread momentarily holds no reference to (so it might write data to the old list), you should do:
self._unprocessed_msg_q[:] = []
Or just the del-slice thing you do in the other method.
But to be on the safer side, and to have more maintainable and less surprising code, you really should change to a procedural approach there, assuming Python threading: treat the function each Thread runs as the "final" thing that does its work, and then use Queues around it:
# coding: utf-8
from __future__ import print_function
from __future__ import unicode_literals
from threading import Thread
try:
from queue import Queue, Empty
except ImportError:
from Queue import Queue, Empty
import time
import random
TERMINATE_SENTINEL = object()
NO_DATA_SENTINEL = object()
class Receiver(object):
def __init__(self, queue):
self.queue = queue
self.in_process = []
def receive_data(self, data):
self.in_process.append(data)
def consume_data(self):
print("received data:", self.in_process)
del self.in_process[:]
def receiver_loop(self):
queue = self.queue
        while True:
            try:
                data = queue.get(block=False)
            except Empty:
                print("got no data from queue")
                data = NO_DATA_SENTINEL
            if data is TERMINATE_SENTINEL:
                print("Got sentinel: exiting receiver loop")
                break
            if data is not NO_DATA_SENTINEL:
                self.receive_data(data)
            time.sleep(random.uniform(0, 0.3))
if queue.empty():
# Only process data if we have nothing to receive right now:
self.consume_data()
print("sleeping receiver")
time.sleep(1)
if self.in_process:
self.consume_data()
def producer_loop(queue):
for i in range(10):
time.sleep(random.uniform(0.05, 0.4))
print("putting {0} in queue".format(i))
queue.put(i)
def main():
msg_queue = Queue()
msg_receiver_thread = Thread(target=Receiver(msg_queue).receiver_loop)
time.sleep(0.1)
msg_producer_thread = Thread(target=producer_loop, args=(msg_queue,))
msg_receiver_thread.start()
msg_producer_thread.start()
msg_producer_thread.join()
msg_queue.put(TERMINATE_SENTINEL)
msg_receiver_thread.join()
if __name__ == '__main__':
main()
Note that since you want multiple methods in the receiver thread to do things with the data, I used a class, but it does not inherit from Thread and does not have to worry about its workings. All its methods are called within the same thread: no need for locks, and no worries about race conditions within the receiver class itself. For communicating outside the class, the Queue class is structured to handle any race conditions for us.
The producer loop, as it is just a dummy producer, has no need at all to be written in class form. But it would look just the same if it had more methods.
(The random sleeps help visualize what would happen in "real world" message receiving.)
Also, you might want to take a look at something like:
https://www.thoughtworks.com/insights/blog/composition-vs-inheritance-how-choose
Finally I was able to solve the issue. In the actual code, I have a Manager class that instantiates MessageConsumerThread as the last thing in its initializer:
class Manager(object):
def __init__(self):
...
...
self._consumer = MessageConsumerThread(self)
self._consumer.start_msg_processing_thread()
The problem seems to be passing 'self' to the MessageConsumerThread initializer while Manager is still executing its own initializer (even though those are the last two steps). The moment I moved the creation of the consumer out of the initializer, the consumer thread was able to see the elements in "_unprocessed_msg_q". A minimal sketch of the rearrangement follows below.
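A minimal sketch of that rearrangement (start() is a hypothetical name; the point is the two-step construction):

class Manager(object):
    def __init__(self):
        ...  # other setup, but no consumer thread created here
        self._consumer = None

    def start(self):
        # Create and start the consumer only after __init__ has completed,
        # so the thread never sees a half-initialized Manager.
        self._consumer = MessageConsumerThread(self)
        self._consumer.start_msg_processing_thread()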
Please note that the issue is still not reproducible with the above sample code; it manifests itself in the production environment only. Without the above fix, I tried a queue and a dictionary as well, but observed the same issue. After the fix, I tried with a queue and a list and was able to execute the code successfully.
I really appreciate and thank @jsbueno and @ivan_pozdeev for their time and help! The Stack Overflow community is very helpful!
I want to use the Python gevent library to implement a one-producer, multiple-consumers server. Here is my attempt:
import json

from gevent.pool import Pool
from gevent.queue import Queue
from gevent.server import StreamServer
from gevent.threadpool import ThreadPool

class EmailValidationServer(object):
def __init__(self):
self.queue = Queue()
def worker(self):
while True:
json = self.queue.get()
def handler(self,socket,address):
fileobj = socket.makefile()
content = fileobj.read(max_read)
contents = json.loads(content)
for content in contents:
self.queue.put(content)
def daemon(self,addr='127.0.0.1',num_thread=5):
pool = Pool(1000)
server = StreamServer((addr, 6000),self.handler,spawn=pool) # run
pool = ThreadPool(num_thread)
for _ in range(num_thread):
pool.spawn(self.worker)
server.serve_forever()
if __name__ == "__main__":
email_server = EmailValidationServer()
email_server.daemon()
I used the queue from gevent.queue.Queue, and it gives me this error:
LoopExit: This operation would block forever
(<ThreadPool at 0x7f08c80eef50 0/4/5>,
<bound method EmailValidationServer.worker of <__main__.EmailValidationServer instance at 0x7f08c8dcd998>>) failed with LoopExit
Problem: when I change the Queue from gevent's implementation to the Python built-in library's, it works. I don't know the reason; I assume there is a difference between their implementations, but I don't understand why gevent does not allow an infinite wait. Can anyone give an explanation? Thanks in advance.
I suggest that you use gevent.queue.JoinableQueue() instead of Python's built-in Queue(). You can refer to the official queue guide for API usage (http://www.gevent.org/gevent.queue.html). As for the error: gevent raises LoopExit when a greenlet blocks (here, on get()) while the event loop has nothing else scheduled that could ever wake it up, so the wait really would block forever; the built-in Queue blocks the OS thread instead, which is why it appears to work.
import gevent
from gevent.queue import JoinableQueue

def do_work(item):
    print(item)       # placeholder for the real validation work

def source():
    return range(10)  # placeholder producer of work items

num_worker_threads = 5
q = JoinableQueue()

def worker():
    while True:
        item = q.get()
        try:
            do_work(item)
        finally:
            q.task_done()

for i in range(num_worker_threads):
    gevent.spawn(worker)

for item in source():
    q.put(item)

q.join()  # block until all tasks are done
If you meet the exception again, it's worth fully understanding the principles of gevent's coroutine control flow. Once you get the point, it's not a big deal. :)
I have an idle background process to process data in a queue, which I've implemented in the following way. The data passed in this example is just an integer, but I will be passing lists with up to 1000 integers and putting up to 100 lists on the queue per second. Is this the correct approach, or should I be looking at more elaborate RPC and server methods?
import multiprocessing
import Queue
import time
class MyProcess(multiprocessing.Process):
def __init__(self, queue, cmds):
multiprocessing.Process.__init__(self)
self.q = queue
self.cmds = cmds
def run(self):
exit_flag = False
while True:
try:
obj = self.q.get(False)
print obj
except Queue.Empty:
if exit_flag:
break
else:
pass
if not exit_flag and self.cmds.poll():
cmd = self.cmds.recv()
if cmd == -1:
exit_flag = True
time.sleep(.01)
if __name__ == '__main__':
queue = multiprocessing.Queue()
proc2main, main2proc = multiprocessing.Pipe(duplex=False)
p = MyProcess(queue, proc2main)
p.start()
for i in range(5):
queue.put(i)
main2proc.send(-1)
proc2main.close()
main2proc.close()
# Wait for the worker to finish
queue.close()
queue.join_thread()
p.join()
It depends on how long it takes to process the data. I can't tell because I don't have a sample of the data, but in general it is better to move to more elaborate RPC and server methods when you need things like load balancing, guaranteed uptime, or scalability. Just remember that these things add complexity, which may make your application harder to deploy, debug, and maintain. They will also increase the latency of processing a task (which may or may not be a concern for you).
I would test it with some sample data and determine whether you need the scalability that multiple servers provide; a quick way to do that is sketched below.
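For example, a quick way to check whether a single consumer keeps up with the stated load of 100 lists of 1000 integers per second (a sketch; process_list stands in for the real per-list work):

import time

def process_list(data):
    return sum(data)  # stand-in for the real processing of one list

payload = list(range(1000))   # one list of 1000 integers
start = time.time()
for _ in range(100):          # one second's worth of queue traffic
    process_list(payload)
elapsed = time.time() - start
print("processed 1 second of load in %.3f s" % elapsed)
# If elapsed is well below 1.0, a single worker process should keep up
# and the simple queue approach is fine.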