Too many threads in python threading - Recursive traversal - python

I have a script to traverse an AWS S3 bucket to do some aggregation at the file level.
from threading import Semaphore, Thread

class Spider:
    def __init__(self):
        self.sem = Semaphore(120)
        self.threads = list()

    def crawl(self, root_url):
        self.recursive_harvest_subroutine(root_url)
        for thread in self.threads:
            thread.join()

    def recursive_harvest_subroutine(self, url):
        children = get_direct_subdirs(url)
        self.sem.acquire()
        try:
            if len(children) == 0:
                queue_url_to_do_something_later(url)  # Done
            else:
                for child_url in children:
                    # one new thread per child directory
                    thread = Thread(target=self.recursive_harvest_subroutine, args=(child_url,))
                    self.threads.append(thread)
                    thread.start()
        finally:
            self.sem.release()
This used to run okay until I encountered a bucket with several TB of data and hundreds of thousands of subdirectories. The number of Thread objects in self.threads grows very fast, and soon the server reported:
RuntimeError: can't start new thread
There is some extra processing I have to do in the script, so I can't just get all files from the bucket.
Currently I require a depth of at least 2 before the script starts parallelizing, but that is just a workaround. Any suggestions are appreciated.

The original code effectively worked breadth-first (BFS), which created a lot of threads waiting on the semaphore. I changed it to depth-first (DFS) and everything is working fine. Pseudocode in case someone needs this in the future:
class Spider:
    def __init__(self):
        self.sem = Semaphore(120)
        self.urls = list()
        self.mutex = Lock()

    def crawl(self, root_url):
        self.recursive_harvest_subroutine(root_url)
        while not is_done():
            self.sem.acquire()
            url = self.urls.pop(0)
            thread = Thread(target=self.recursive_harvest_subroutine, args=(url,))
            thread.start()
            self.sem.release()

    def recursive_harvest_subroutine(self, url):
        children = get_direct_subdirs(url)
        if len(children) == 0:
            queue_url_to_do_something_later(url)  # Done
        else:
            self.mutex.acquire()
            for child_url in children:
                self.urls.insert(0, child_url)
            self.mutex.release()
There is no join(), so I implemented my own is_done() check.
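Another way to keep the thread count bounded is a fixed pool of worker threads that pull URLs from a shared queue, so the number of threads never depends on the size of the bucket. The following is only a minimal sketch under stated assumptions: TREE and this get_direct_subdirs are hypothetical stand-ins for the real S3 listing, and crawl simply returns the leaf URLs instead of queueing them for later processing. A LifoQueue is used so that the most recently discovered directories are handled first, which gives a roughly depth-first order.

```python
import queue
import threading

# Hypothetical stand-in for the S3 listing call: an in-memory directory tree.
TREE = {
    "root": ["root/a", "root/b"],
    "root/a": ["root/a/1", "root/a/2"],
    "root/b": [],
    "root/a/1": [],
    "root/a/2": [],
}

def get_direct_subdirs(url):
    return TREE.get(url, [])

def crawl(root_url, num_workers=4):
    """Traverse the tree with a fixed number of threads; return the leaf URLs."""
    pending = queue.LifoQueue()   # LIFO gives a roughly depth-first order
    leaves = []
    leaves_lock = threading.Lock()

    def worker():
        while True:
            url = pending.get()
            if url is None:           # sentinel: no more work
                pending.task_done()
                return
            try:
                children = get_direct_subdirs(url)
                if children:
                    for child in children:
                        pending.put(child)
                else:
                    with leaves_lock:
                        leaves.append(url)
            finally:
                # Children are queued before the parent is marked done,
                # so join() cannot return while work is still outstanding.
                pending.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    pending.put(root_url)
    pending.join()                    # returns once every queued URL is processed
    for _ in threads:
        pending.put(None)             # one sentinel per worker
    for t in threads:
        t.join()
    return sorted(leaves)

print(crawl("root"))
```

Here queue.join() plays the role of the hand-written is_done() check, and the threads can be joined normally once the sentinels are delivered.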

Related

python design pattern queue with workers

I'm currently working on a project that involves three components:
an observer that checks for changes in a directory, a worker, and a command-line interface.
What I want to achieve is:
The observer, when a change happens, sends a string to the worker (adds a job to the worker's queue).
The worker has a queue of jobs and works on its queue forever.
Now I want to be able to run a Python script to check the status of the worker (number of active jobs, errors, and so on).
I don't know how to achieve this in Python in terms of which components to use and how to link the three of them.
I thought of a singleton worker where the observer adds a job to a queue, but 1) I was not able to write working code, and 2) how can I fit the checker in?
Another solution I thought of may be multiple child processes spawned by a parent that owns the queue, but I'm a bit lost...
Thanks for any advice.
I'd use some kind of observer pattern or publish-subscribe pattern. For the former you can use for example the Python version of ReactiveX. But for a more basic example let's stay with the Python core. Parts of your program can subscribe to the worker and receive updates from the process via queues for example.
import itertools as it
from queue import Queue
from threading import Thread
import time


class Observable(Thread):
    def __init__(self):
        super().__init__()
        self._observers = []

    def notify(self, msg):
        for obs in self._observers:
            obs.put(msg)

    def subscribe(self, obs):
        self._observers.append(obs)


class Observer(Thread):
    def __init__(self):
        super().__init__()
        self.updates = Queue()


class Watcher(Observable):
    def run(self):
        for i in it.count():
            self.notify(i)
            time.sleep(1)


class Worker(Observable, Observer):
    def run(self):
        while True:
            task = self.updates.get()
            self.notify((str(task), 'start'))
            time.sleep(1)
            self.notify((str(task), 'stop'))


class Supervisor(Observer):
    def __init__(self):
        super().__init__()
        self._statuses = {}

    def run(self):
        while True:
            status = self.updates.get()
            print(status)
            self._statuses[status[0]] = status[1]
            # Do something based on status updates.
            if status[1] == 'stop':
                del self._statuses[status[0]]


watcher = Watcher()
worker = Worker()
supervisor = Supervisor()
watcher.subscribe(worker.updates)
worker.subscribe(supervisor.updates)
supervisor.start()
worker.start()
watcher.start()
However, many variations are possible, and you can look at the various patterns to see which suits you best.
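For the status check asked about in the question, one possible variation is a worker that keeps its own counters behind a lock and exposes a snapshot method. This is only a sketch: StatusWorker, handler, and status() are hypothetical names, and it covers the in-process case only. A separate command-line script would still need some inter-process channel (a socket or a file, for instance) to reach this method.

```python
import queue
import threading

class StatusWorker(threading.Thread):
    """Worker thread that keeps counters another component can read safely."""

    def __init__(self, handler):
        super().__init__(daemon=True)
        self.jobs = queue.Queue()
        self._handler = handler          # callable applied to each job
        self._lock = threading.Lock()
        self._done = 0
        self._errors = []

    def run(self):
        while True:
            job = self.jobs.get()
            try:
                self._handler(job)
            except Exception as exc:     # record the failure, keep working
                with self._lock:
                    self._errors.append((job, repr(exc)))
            finally:
                with self._lock:
                    self._done += 1
                self.jobs.task_done()

    def status(self):
        """Return a snapshot: jobs waiting, jobs processed, errors so far."""
        with self._lock:
            return {
                "queued": self.jobs.qsize(),
                "processed": self._done,
                "errors": list(self._errors),
            }
```

An observer would call worker.jobs.put(...) when a change is detected, and a supervisor or checker would call worker.status() whenever it wants a report.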

canonical example of worker process with PySide or PyQt

I was looking for some good example of managing worker process from Qt GUI created in Python. I need this to be as complete as possible, including reporting progress from the process, including aborting the process, including handling of possible errors coming from the process.
I only found some semi-finished examples which only did part of work but when I tried to make them complete I failed. My current design comes in three layers:
1) there is the main thread in which resides the GUI and ProcessScheduler which controls that only one instance of worker process is running and can abort it
2) there is another thread in which I have ProcessObserver which actually runs the process and understands the stuff coming from queue (which is used for inter-process communication), this must be in non-GUI thread to keep GUI responsive
3) there is the actual worker process which executes a given piece of code (my future intention is to replace multiprocessing with multiprocess or pathos or something else that can pickle function objects, but this is not my current issue) and reports progress or the result to the queue
Currently I have this snippet (the print functions in the code are just for debugging and will be deleted eventually):
import multiprocessing
from PySide import QtCore, QtGui

QtWidgets = QtGui
N = 10000000

# I would like this to be a function object
# but multiprocessing cannot pickle it :(
# so I will use multiprocess in the future
CODE = """
# calculates sum of numbers from 0 to n-1
# reports percent progress of finished work
sum = 0
progress = -1
for i in range(n):
    sum += i
    p = i * 100 // n
    if p > progress:
        queue.put(["progress", p])
        progress = p
queue.put(["result", sum])
"""


class EvalProcess(multiprocessing.Process):
    def __init__(self, code, symbols):
        super(EvalProcess, self).__init__()
        self.code = code
        self.symbols = symbols  # symbols must contain 'queue'

    def run(self):
        print("EvalProcess started")
        exec(self.code, self.symbols)
        print("EvalProcess finished")


class ProcessObserver(QtCore.QObject):
    """Resides in worker thread. Its role is to understand
    to what is received from the process via the queue."""
    progressChanged = QtCore.Signal(float)
    finished = QtCore.Signal(object)

    def __init__(self, process, queue):
        super(ProcessObserver, self).__init__()
        self.process = process
        self.queue = queue

    def run(self):
        print("ProcessObserver started")
        self.process.start()
        try:
            while True:
                # this loop keeps running and listening to the queue
                # even if the process is aborted
                result = self.queue.get()
                print("received from queue:", result)
                if result[0] == "progress":
                    self.progressChanged.emit(result[1])
                elif result[0] == "result":
                    self.finished.emit(result[1])
                    break
        except Exception as e:
            print(e)  # QUESTION: WHAT HAPPENS WHEN THE PROCESS FAILS?
        self.process.join()  # QUESTION: DO I NEED THIS LINE?
        print("ProcessObserver finished")


class ProcessScheduler(QtCore.QObject):
    """Resides in the main thread."""
    sendText = QtCore.Signal(str)

    def __init__(self):
        super(ProcessScheduler, self).__init__()
        self.observer = None
        self.thread = None
        self.process = None
        self.queue = None

    def start(self):
        if self.process:  # Q: IS THIS OK?
            # should kill current process and start a new one
            self.abort()
        self.queue = multiprocessing.Queue()
        self.process = EvalProcess(CODE, {"n": N, "queue": self.queue})
        self.thread = QtCore.QThread()
        self.observer = ProcessObserver(self.process, self.queue)
        self.observer.moveToThread(self.thread)
        self.observer.progressChanged.connect(self.onProgressChanged)
        self.observer.finished.connect(self.onResultReceived)
        self.thread.started.connect(self.observer.run)
        self.thread.finished.connect(self.onThreadFinished)
        self.thread.start()
        self.sendText.emit("Calculation started")

    def abort(self):
        self.process.terminate()
        self.sendText.emit("Aborted.")
        self.onThreadFinished()

    def onProgressChanged(self, percent):
        self.sendText.emit("Progress={}%".format(percent))

    def onResultReceived(self, result):
        print("onResultReceived called")
        self.sendText.emit("Result={}".format(result))
        self.thread.quit()

    def onThreadFinished(self):
        print("onThreadFinished called")
        self.thread.deleteLater()  # QUESTION: DO I NEED THIS LINE?
        self.thread = None
        self.observer = None
        self.process = None
        self.queue = None


if __name__ == '__main__':
    app = QtWidgets.QApplication([])
    scheduler = ProcessScheduler()
    window = QtWidgets.QWidget()
    layout = QtWidgets.QVBoxLayout(window)
    startButton = QtWidgets.QPushButton("sum(range({}))".format(N))
    startButton.pressed.connect(scheduler.start)
    layout.addWidget(startButton)
    abortButton = QtWidgets.QPushButton("Abort")
    abortButton.pressed.connect(scheduler.abort)
    layout.addWidget(abortButton)
    console = QtWidgets.QPlainTextEdit()
    scheduler.sendText.connect(console.appendPlainText)
    layout.addWidget(console)
    window.show()
    app.exec_()
It works reasonably well, but it still lacks proper error handling and aborting of the process. I am struggling with the aborting in particular. The main problem is that the worker thread keeps running (in the loop listening to the queue) even if the process has been aborted/terminated in the middle of the calculation (or at least it prints this error in the console: QThread: Destroyed while thread is still running). Is there a way to solve this? Or any alternative approach? Or, if possible, any real-life and complete example of such a task fulfilling all the requirements mentioned above? Any comment would be much appreciated.
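One possible way to keep the listening loop from blocking forever after the process is terminated is to poll the queue with a timeout and check whether the process is still alive whenever nothing arrives. The sketch below is not a complete Qt solution: watch_process, on_message, and poll_interval are hypothetical names, and the function only relies on the object passed as process having an is_alive() method, which multiprocessing.Process provides. A multiprocessing.Queue raises queue.Empty from get(timeout=...) in the same way as queue.Queue.

```python
import queue

def watch_process(process, msg_queue, on_message, poll_interval=0.2):
    """Forward queue messages to on_message until a result arrives or the process dies.

    Returns True if a ["result", ...] message was seen, False if the
    process stopped (for example after terminate()) without sending one.
    """
    while True:
        try:
            message = msg_queue.get(timeout=poll_interval)
        except queue.Empty:
            # Nothing arrived within the timeout: stop only if the producer is gone.
            if not process.is_alive():
                return False
            continue
        on_message(message)
        if message[0] == "result":
            return True
```

In the design above, a loop of this shape could stand in for the blocking self.queue.get() call in ProcessObserver.run, with a False return value used to report the abort and let the thread quit.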

python multithreading synchronization

I am having a synchronization problem while threading with CPython. I have two files; I parse them and return the desired result. However, the code below behaves strangely: it returns three results instead of two, and it doesn't return them in the order I put them into the queue. Here's the code:
import Queue
import threading
from HtmlDoc import Document

OUT_LIST = []


class Threader(threading.Thread):
    """
    Start threading
    """
    def __init__(self, queue, out_queue):
        threading.Thread.__init__(self)
        self.queue = queue
        self.out_queue = out_queue

    def run(self):
        while True:
            if self.queue.qsize() == 0: break
            path, host = self.queue.get()
            f = open(path, "r")
            source = f.read()
            f.close()
            self.out_queue.put((source, host))
            self.queue.task_done()


class Processor(threading.Thread):
    """
    Process threading
    """
    def __init__(self, out_queue):
        self.out_queue = out_queue
        self.l_first = []
        self.f_append = self.l_first.append
        self.l_second = []
        self.s_append = self.l_second.append
        threading.Thread.__init__(self)

    def first(self, doc):
        # some code to retrieve the desired text; this works 100%, I tested it manually
        pass

    def second(self, doc):
        # some code to retrieve the desired text; this works 100%, I tested it manually
        pass

    def run(self):
        while True:
            if self.out_queue.qsize() == 0: break
            doc, host = self.out_queue.get()
            if host == "first":
                self.first(doc)
            elif host == "second":
                self.second(doc)
            OUT_LIST.extend(self.l_first + self.l_second)
            self.out_queue.task_done()


def main():
    queue = Queue.Queue()
    out_queue = Queue.Queue()
    queue.put(("...first.html", "first"))
    queue.put(("...second.html", "second"))
    qsize = queue.qsize()
    for i in range(qsize):
        t = Threader(queue, out_queue)
        t.setDaemon(True)
        t.start()
    for i in range(qsize):
        dt = Processor(out_queue)
        dt.setDaemon(True)
        dt.start()
    queue.join()
    out_queue.join()
    print '<br />'.join(OUT_LIST)

main()
Now, when I print, I'd like to print the content of the "first" first of all and then the content of the "second". Can anyone help me?
NOTE: I am threading because I will actually have to connect to more than 10 places at a time and retrieve their results. I believe that threading is the most appropriate way to accomplish such a task.
"I am threading because I will actually have to connect to more than 10 places at a time and retrieve their results. I believe that threading is the most appropriate way to accomplish such a task."
Threading is actually one of the most error-prone ways to manage multiple concurrent connections. A more powerful, more debuggable approach is to use event-driven asynchronous networking, such as that implemented by Twisted. If you're interested in using this model, you might want to check out this introduction.
I don't share the opinion that threading is the best way to do this (IMO some event/select mechanism would be better), but the problem with your code could be in the variables t and dt. You assign them inside the loop and the object instances are not stored anywhere, so it is possible that each new Thread/Processor instance gets deleted at the end of each iteration.
It would be clearer if you showed us the precise output of this code.
1) You cannot control the order of job completion. It depends on execution time, so to return results in the order you want, you can create a global dictionary of job results, like job_results = {'first': None, 'second': None}, store the results there, and then fetch the data in the desired order.
2) self.l_first and self.l_second should be cleared after each processed doc, or else you will have duplicates in OUT_LIST.
3) You may use multiprocessing with the subprocess module and put all result data into CSV files, for example, and then sort them as you wish.
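The ordering point in 1) can be illustrated with the standard library's concurrent.futures module (Python 3, whereas the question's code is Python 2). This is a sketch under assumptions: fetch and its upper-casing are hypothetical stand-ins for reading and parsing one document. ThreadPoolExecutor.map hands results back in the order of the inputs, whatever order the jobs actually finish in.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(job):
    """Hypothetical stand-in for reading and parsing one document."""
    name, text = job
    return name, text.upper()

def run_in_order(jobs, max_workers=10):
    """Run fetch() concurrently; results come back in the order of `jobs`."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in input order, regardless of completion order.
        return list(pool.map(fetch, jobs))

results = run_in_order([("first", "alpha"), ("second", "beta")])
print(results)
```

Because each call returns its own result, there are no shared lists to clear between documents, which also sidesteps the duplicate problem from 2).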

end daemon processes with multiprocessing module

I include an example usage of multiprocessing below. This is a process pool model. It is not as simple as it might be, but is relatively close in structure to the code I'm actually using. It also uses sqlalchemy, sorry.
My question is this: I currently have a relatively long-running Python script which executes a number of functions, each of which looks like the code below, so the parent process is the same in all cases. In other words, multiple pools are created by one Python script. (I don't have to do it this way, I suppose, but the alternative is to use something like os.system and subprocess.) The problem is that these processes hang around and hold on to memory. The docs say these daemon processes are supposed to stick around until the parent process exits, but what if the parent process then goes on to create another pool of processes and doesn't exit immediately?
Calling terminate() works, but this doesn't seem terribly polite. Is there a good way to ask the processes to terminate nicely? I.e. clean up after yourself and go away now, I need to start up the next pool?
I also tried calling join() on the processes. According to the documentation this means wait for the processes to terminate. What if they don't plan to terminate? What actually happens is that the process hangs.
Thanks in advance.
Regards, Faheem.
import multiprocessing, time


class Worker(multiprocessing.Process):
    """Process executing tasks from a given tasks queue"""
    def __init__(self, queue, num):
        multiprocessing.Process.__init__(self)
        self.num = num
        self.queue = queue
        self.daemon = True

    def run(self):
        import traceback
        while True:
            func, args, kargs = self.queue.get()
            try:
                print "trying %s with args %s"%(func.__name__, args)
                func(*args, **kargs)
            except:
                traceback.print_exc()
            self.queue.task_done()


class ProcessPool:
    """Pool of threads consuming tasks from a queue"""
    def __init__(self, num_threads):
        self.queue = multiprocessing.JoinableQueue()
        self.workerlist = []
        self.num = num_threads
        for i in range(num_threads):
            self.workerlist.append(Worker(self.queue, i))

    def add_task(self, func, *args, **kargs):
        """Add a task to the queue"""
        self.queue.put((func, args, kargs))

    def start(self):
        for w in self.workerlist:
            w.start()

    def wait_completion(self):
        """Wait for completion of all the tasks in the queue"""
        self.queue.join()
        for worker in self.workerlist:
            print worker.__dict__
            #worker.terminate() <--- terminate used here
            worker.join()  # <--- join used here


start = time.time()
from sqlalchemy import *
from sqlalchemy.orm import *

dbuser = ''
password = ''
dbname = ''
dbstring = "postgres://%s:%s@localhost:5432/%s"%(dbuser, password, dbname)
db = create_engine(dbstring, echo=True)
m = MetaData(db)

def make_foo(i):
    t1 = Table('foo%s'%i, m, Column('a', Integer, primary_key=True))

conn = db.connect()
for i in range(10):
    conn.execute("DROP TABLE IF EXISTS foo%s"%i)
conn.close()
for i in range(10):
    make_foo(i)
m.create_all()

def do(i, dbstring):
    dbstring = "postgres://%s:%s@localhost:5432/%s"%(dbuser, password, dbname)
    db = create_engine(dbstring, echo=True)
    Session = scoped_session(sessionmaker())
    Session.configure(bind=db)
    Session.execute("ALTER TABLE foo%s SET ( autovacuum_enabled = false );"%i)
    Session.execute("ALTER TABLE foo%s SET ( autovacuum_enabled = true );"%i)
    Session.commit()

pool = ProcessPool(5)
for i in range(10):
    pool.add_task(do, i, dbstring)
pool.start()
pool.wait_completion()
My way of dealing with this was:
import multiprocessing

for prc in multiprocessing.active_children():
    prc.terminate()
I like this more so I don't have to pollute the worker function with some if clause.
You know multiprocessing already has classes for worker pools, right?
The standard way is to send your threads a quit signal:
queue.put(("QUIT", None, None))
Then check for it:
if func == "QUIT":
    return
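Put together, the quit-signal idea might look like the sketch below. To keep it self-contained it uses threads and queue.Queue rather than processes; multiprocessing.JoinableQueue offers the same put/get/task_done calls, so the loop shape is assumed to carry over, although results would then need their own queue because a plain list is not shared between processes. QUIT, worker, and run_pool are hypothetical names.

```python
import queue
import threading

QUIT = ("QUIT", None, None)   # sentinel tuple, same shape as a normal task

def worker(tasks, results):
    """Run (func, args, kwargs) tasks until the QUIT sentinel arrives."""
    while True:
        func, args, kwargs = tasks.get()
        try:
            if func == "QUIT":
                return              # leave the loop so join() can succeed
            results.append(func(*args, **kwargs))
        finally:
            tasks.task_done()

def run_pool(jobs, num_workers=3):
    tasks = queue.Queue()
    results = []
    workers = [threading.Thread(target=worker, args=(tasks, results))
               for _ in range(num_workers)]
    for w in workers:
        w.start()
    for job in jobs:
        tasks.put(job)
    for _ in workers:
        tasks.put(QUIT)             # one sentinel per worker
    for w in workers:
        w.join()                    # returns because every worker exits
    return sorted(results)

print(run_pool([(pow, (2, n), {}) for n in range(5)]))
```

Each worker consumes exactly one sentinel and then returns, which is why one sentinel per worker is queued and why join() no longer hangs.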

Checking on a thread / remove from list

I have a thread class which extends Thread. The code looks a little like this:
class MyThread(Thread):
    def run(self):
        # Do stuff
        pass

my_threads = []
while has_jobs() and len(my_threads) < 5:
    new_thread = MyThread(next_job_details())
    new_thread.start()
    my_threads.append(new_thread)
for my_thread in my_threads:
    my_thread.join()
    # Do stuff
So in my pseudocode I check whether there are any jobs (from a DB, etc.), and if there are jobs and fewer than 5 threads running, I create new threads.
From here, I then check over my threads, and this is where I get stuck. I can use .join(), but my understanding is that it waits until the thread has finished, so if the first thread it checks is still in progress, it waits until that one is done, even if the other threads have already finished.
So is there a way to check whether a thread is done, and remove it if so?
e.g.
for my_thread in my_threads:
    if my_thread.done():
        # process results
        del (my_threads[my_thread])  # ?? will that work...
As TokenMacGuy says, you should use thread.is_alive() to check if a thread is still running. To remove no longer running threads from your list you can use a list comprehension:
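for t in my_threads:
    if not t.is_alive():
        # get results from thread
        t.handled = True
# getattr() covers threads that are still alive and have no `handled` attribute yet
my_threads = [t for t in my_threads if not getattr(t, "handled", False)]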
This avoids the problem of removing items from a list while iterating over it.
mythreads = threading.enumerate()
Enumerate returns a list of all Thread objects still alive.
https://docs.python.org/3.6/library/threading.html
You need to call thread.is_alive() to find out if the thread is still running (isAlive() is the older spelling, removed in Python 3.9).
The answer has been covered, but for simplicity...
# To filter out finished threads
threads = [t for t in threads if t.is_alive()]
# Same thing but for QThreads (if you are using PyQt)
threads = [t for t in threads if t.isRunning()]
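If the goal is to handle each result as soon as its thread finishes while never running more than 5 at once, the standard library's concurrent.futures module can take over the bookkeeping. The following is only a sketch: do_job and its (label, delay) input are hypothetical stand-ins for the real job.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def do_job(details):
    """Hypothetical job: wait for the given delay, then return the label upper-cased."""
    label, delay = details
    time.sleep(delay)
    return label.upper()

def run_jobs(job_details, max_threads=5):
    """Run at most max_threads jobs at once; collect results as each finishes."""
    finished = []
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        futures = [pool.submit(do_job, d) for d in job_details]
        for future in as_completed(futures):
            # as_completed yields each future once it is done, in any order.
            finished.append(future.result())
    return finished

print(sorted(run_jobs([("a", 0.03), ("b", 0.01), ("c", 0.02)])))
```

The pool removes finished work from consideration itself, so there is no list of threads to filter by hand.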
A better way is to use the Queue class:
http://docs.python.org/library/queue.html
Look at the example code at the bottom of the documentation page:
from queue import Queue
from threading import Thread

def worker():
    while True:
        item = q.get()
        do_work(item)
        q.task_done()

q = Queue()
for i in range(num_worker_threads):
    t = Thread(target=worker)
    t.daemon = True
    t.start()
for item in source():
    q.put(item)
q.join()  # block until all tasks are done
An easy solution to check whether a thread has finished. It is thread safe.
Install pyrvsignal:
pip install pyrvsignal
Example:
import time
from threading import Thread
from pyrvsignal import Signal


class MyThread(Thread):
    started = Signal()
    finished = Signal()

    def __init__(self, target, args):
        self.target = target
        self.args = args
        Thread.__init__(self)

    def run(self) -> None:
        self.started.emit()
        self.target(*self.args)
        self.finished.emit()


def do_my_work(details):
    print(f"Doing work: {details}")
    time.sleep(10)

def started_work():
    print("Started work")

def finished_work():
    print("Work finished")

thread = MyThread(target=do_my_work, args=("testing",))
thread.started.connect(started_work)
thread.finished.connect(finished_work)
thread.start()
