A user visits http://example.com/url/, which invokes page_parser from views.py. page_parser creates an instance of class Foo from script.py.
Each time http://example.com/url/ is visited, I see memory usage go up and up. I guess the garbage collector doesn't collect the Foo instances. Any ideas why?
Here is the code:
views.py:
from django.http import HttpResponse
from script import Foo
from script import urls

# When user visits http://example.com/url/ I run `page_parser`
def page_parser(request):
    Foo(urls)
    return HttpResponse("alldone")
script.py:
import requests
from queue import Queue
from threading import Thread

class Newthread(Thread):
    def __init__(self, queue, result):
        Thread.__init__(self)
        self.queue = queue
        self.result = result

    def run(self):
        while True:
            url = self.queue.get()
            data = requests.get(url) # Download image at url
            self.result.append(data)
            self.queue.task_done()

class Foo:
    def __init__(self, urls):
        self.result = list()
        self.queue = Queue()
        self.startthreads()
        for url in urls:
            self.queue.put(url)
        self.queue.join()

    def startthreads(self):
        for x in range(3):
            worker = Newthread(queue=self.queue, result=self.result)
            worker.daemon = True
            worker.start()

urls = [
"https://static.pexels.com/photos/106399/pexels-photo-106399.jpeg",
"https://static.pexels.com/photos/164516/pexels-photo-164516.jpeg",
"https://static.pexels.com/photos/206172/pexels-photo-206172.jpeg",
"https://static.pexels.com/photos/32870/pexels-photo.jpg",
"https://static.pexels.com/photos/106399/pexels-photo-106399.jpeg",
"https://static.pexels.com/photos/164516/pexels-photo-164516.jpeg",
"https://static.pexels.com/photos/206172/pexels-photo-206172.jpeg",
"https://static.pexels.com/photos/32870/pexels-photo.jpg",
"https://static.pexels.com/photos/32870/pexels-photo.jpg",
"https://static.pexels.com/photos/106399/pexels-photo-106399.jpeg",
"https://static.pexels.com/photos/164516/pexels-photo-164516.jpeg",
"https://static.pexels.com/photos/206172/pexels-photo-206172.jpeg",
"https://static.pexels.com/photos/32870/pexels-photo.jpg"]
There are several moving parts involved, but what I think happens is the following:
WSGI processes are not killed after each request, so objects may persist between requests.
You create 3 new threads, but never let them finish and join the main thread again, for example when the queue is empty.
Since the reference count of the queue created by Foo never reaches zero (the threads are still alive, waiting for new queue items), it cannot be garbage collected. The same holds for the result list, which the threads also reference.
So you keep creating new threads and new Foo instances, and the queues and results their threads hold on to can never be freed.
I'm not an expert on queue.Queue, but you can verify my theory by watching the number of threads in the WSGI process go up by 3 with each request (for example using top(1)).
As a side note, this is a side effect of your class design. You do everything in __init__, which should really only assign instance attributes.
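One way to let the workers finish is to send each of them a stop marker once all URLs are queued, then join them. The following is only a minimal sketch of that idea, not the original poster's code: `fetch`, `_STOP`, `Worker` and `download_all` are hypothetical names, and `fetch` stands in for `requests.get` so the example runs without network access.

```python
import threading
from queue import Queue

_STOP = object()  # hypothetical sentinel telling a worker to exit


def fetch(url):
    # Stand-in for requests.get(url); an assumption made for this sketch.
    return "data for " + url


class Worker(threading.Thread):
    def __init__(self, queue, result):
        super().__init__(daemon=True)
        self.queue = queue
        self.result = result

    def run(self):
        while True:
            url = self.queue.get()
            try:
                if url is _STOP:
                    return  # leave the loop so the thread can end
                self.result.append(fetch(url))
            finally:
                self.queue.task_done()


def download_all(urls, num_workers=3):
    queue = Queue()
    result = []
    workers = [Worker(queue, result) for _ in range(num_workers)]
    for worker in workers:
        worker.start()
    for url in urls:
        queue.put(url)
    for _ in workers:
        queue.put(_STOP)  # one stop marker per worker
    for worker in workers:
        worker.join()  # threads end here, releasing queue and result
    return result
```

Because every thread has exited by the time `download_all` returns, nothing keeps the queue or the downloaded data alive after the view has finished with them.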
I have a class Sender with a function add_task(task) that adds the task to a thread pool. The task needs to be handled by one of the 10 worker threads. Each thread needs its own unique piece of information, say its own address (loaded from the config file), to do the job.
Currently I am thinking of using ThreadPoolExecutor and threading.local to store the address for each worker thread. Something like this:
import threading
from concurrent.futures import ThreadPoolExecutor

data = threading.local()  # per-thread storage shared by initialize() and task()

class Sender:
    # addresses has length 10
    def __init__(self, addresses):
        # initargs must be a tuple, hence the trailing comma
        self.executor = ThreadPoolExecutor(max_workers=10, initializer=initialize, initargs=(addresses,))

    def add_task(self, task):
        fut = self.executor.submit(task)
        fut.add_done_callback(dummy_callback)

def initialize(addresses):
    # ... thread name is <pool prefix>_i
    # retrieve that i
    data.address = addresses[i]

def task():
    # do something with data.address
    print(data.address)

def dummy_callback(future):
    # done callbacks receive the finished future
    pass
I am not sure if I am going in the right direction. Could you give any hints on that?
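One possible variation, offered only as a sketch under stated assumptions, avoids parsing the thread name: the initializer takes the next free address from a queue and stores it in a module-level `threading.local`. The names `_local`, `_init_worker`, `current_address` and `close` are hypothetical, and the pool size is taken from the number of addresses so each worker thread always finds one.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor

_local = threading.local()  # each worker thread sees its own attributes


def _init_worker(address_queue):
    # Runs once in every new worker thread: claim one unused address.
    _local.address = address_queue.get_nowait()


def current_address():
    # Call from inside a task to read the address of the running worker.
    return _local.address


class Sender:
    def __init__(self, addresses):
        address_queue = queue.Queue()
        for address in addresses:
            address_queue.put(address)
        self.executor = ThreadPoolExecutor(
            max_workers=len(addresses),  # never more workers than addresses
            initializer=_init_worker,
            initargs=(address_queue,),
        )

    def add_task(self, task, *args):
        return self.executor.submit(task, *args)

    def close(self):
        self.executor.shutdown(wait=True)
```

The executor starts worker threads lazily, so fewer than `len(addresses)` threads may exist at any time, but each thread that does exist holds exactly one address for its whole lifetime.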
I have been struggling with a pickle error that is driving me crazy. I have the following master Engine class:
import eventlet
import socketio
import multiprocessing
from multiprocessing import Queue
from multi import SIOSerever

class masterEngine:

    if __name__ == '__main__':
        serverObj = SIOSerever()
        try:
            receiveData = multiprocessing.Process(target=serverObj.run)
            receiveData.start()
            receiveProcess = multiprocessing.Process(target=serverObj.fetchFromQueue)
            receiveProcess.start()
            receiveData.join()
            receiveProcess.join()
        except Exception as error:
            print(error)
I also have another file called multi, which looks like this:
import multiprocessing
from multiprocessing import Queue
import eventlet
import socketio

class SIOSerever:
    def __init__(self):
        self.cycletimeQueue = Queue()
        self.sio = socketio.Server(cors_allowed_origins='*',logger=False)
        self.app = socketio.WSGIApp(self.sio, static_files={'/': 'index.html',})
        self.ws_server = eventlet.listen(('0.0.0.0', 5000))

        @self.sio.on('production')
        def p_message(sid, message):
            self.cycletimeQueue.put(message)
            print("I logged : "+str(message))

    def run(self):
        eventlet.wsgi.server(self.ws_server, self.app)

    def fetchFromQueue(self):
        while True:
            cycle = self.cycletimeQueue.get()
            print(cycle)
As you can see, I am trying to create two processes, one for run and one for fetchFromQueue, which I want to run independently.
My run function starts the python-socketio server, to which I am sending some data from an HTML web page (this runs perfectly without multiprocessing). I am then trying to push the received data into a Queue so that my other function can retrieve it and work with it.
I have a set of time-consuming operations to carry out on the data received from the socket, which is why I am pushing it all into a Queue.
On running the master Engine class I receive the following:
Can't pickle <class 'threading.Thread'>: it's not the same object as threading.Thread
I ended!
[Finished in 0.5s]
Can you please help with what I am doing wrong?
From multiprocessing programming guidelines:
Explicitly pass resources to child processes
On Unix using the fork start method, a child process can make use of a shared resource created in a parent process using a global resource. However, it is better to pass the object as an argument to the constructor for the child process.
Apart from making the code (potentially) compatible with Windows and the other start methods this also ensures that as long as the child process is still alive the object will not be garbage collected in the parent process. This might be important if some resource is freed when the object is garbage collected in the parent process.
Therefore, I slightly modified your example by removing everything unnecessary, but showing an approach where the shared queue is explicitly passed to all processes that use it:
import multiprocessing

MAX = 5

class SIOSerever:
    def __init__(self, queue):
        self.cycletimeQueue = queue

    def run(self):
        for i in range(MAX):
            self.cycletimeQueue.put(i)

    @staticmethod
    def fetchFromQueue(cycletimeQueue):
        while True:
            cycle = cycletimeQueue.get()
            print(cycle)
            if cycle >= MAX - 1:
                break

def start_server(queue):
    server = SIOSerever(queue)
    server.run()

if __name__ == '__main__':
    try:
        queue = multiprocessing.Queue()
        receiveData = multiprocessing.Process(target=start_server, args=(queue,))
        receiveData.start()
        receiveProcess = multiprocessing.Process(target=SIOSerever.fetchFromQueue, args=(queue,))
        receiveProcess.start()
        receiveData.join()
        receiveProcess.join()
    except Exception as error:
        print(error)

0
1
...
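As background on why a bound method such as serverObj.run can be a problematic process target: with the spawn start method the target is pickled to reach the child, and pickling a bound method means pickling the instance it belongs to, including every resource stored on it. The sketch below illustrates only that general mechanism. It does not reproduce the exact error message from the question, and whether this is the sole cause there is an assumption. `Server` and `try_pickle` are hypothetical names, and the lock stands in for a socket or similar resource.

```python
import pickle
import threading


class Server:
    def __init__(self):
        # Stand-in for a listening socket or another OS-level resource.
        self.lock = threading.Lock()

    def run(self):
        return "running"


def try_pickle(obj):
    """Return None if obj pickles cleanly, else the exception type name."""
    try:
        pickle.dumps(obj)
        return None
    except Exception as exc:
        return type(exc).__name__


server = Server()
bound_method_error = try_pickle(server.run)   # drags the whole instance along
plain_data_error = try_pickle(("cycle", 42))  # simple data pickles fine
```

Passing a module-level function plus a multiprocessing.Queue, as in the answer above, keeps the unpicklable state out of what has to be sent to the child.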
from multiprocessing.dummy import Pool as ThreadPool

class TSNew:
    def __init__(self):
        self.redis_client = redis.StrictRedis(host="172.17.31.147", port=4401, db=0)
        self.global_switch = 0
        self.pool = ThreadPool(40) # init pool
        self.dnn_model = None
        self.nnf = None
        self.md5sum_nnf = "initialize"
        self.thread = threading.Thread(target=self.load_model_item)
        self.ts_picked_ids = None
        self.thread.start()
        self.memory = deque(maxlen=3000)
        self.process = threading.Thread(target=self.process_user_dict)
        self.process.start()

    def load_model_item(self):
        '''
        code
        '''

    def predict_memcache(self,user_dict):
        '''
        code
        '''

    def process_user_dict(self):
        while True:
            '''
            code to generate user_dicts which is a list
            '''
            results = self.pool.map(self.predict_memcache, user_dicts)
            '''
            code
            '''

TSNew_ = TSNew()

def get_user_result():
    logging.info("----------------come in ------------------")
    if request.method == 'POST':
        user_dict_json = request.get_data()# userid
        if user_dict_json == '' or user_dict_json is None:
            logging.info("----------------user_dict_json is ''------------------")
            return ''
        try:
            user_dict = json.loads(user_dict_json)
        except:
            logging.info("json load error, pass")
            return ''
        TSNew_.memory.append(user_dict)
        logging.info('add to deque TSNew_.memory size: %d PID: %d', len(TSNew_.memory), os.getpid())
        logging.info("add to deque userid: %s, nation: %s \n",user_dict['user_id'], user_dict['user_country'])
        return 'SUCCESS\n'

@app.route('/', methods=['POST'])
def get_ts_gbdt_id():
    return get_user_result()

from werkzeug.contrib.fixers import ProxyFix
app.wsgi_app = ProxyFix(app.wsgi_app)

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=4444)

I create a thread pool in the class's __init__ and use self.pool to map the function predict_memcache.
I have two questions:
(a) Should I initialize the pool in __init__, or create it right before
results = self.pool.map(self.predict_memcache, user_dicts)
(b) Since the pool is a multi-threaded operation and it is executed in the thread running process_user_dict, is there any hidden error?
Thanks.
Question (a):
It depends. If you need to run process_user_dict more than once, then it makes sense to start the pool in the constructor and keep it running. Creating a thread pool always comes with some overhead and by keeping the pool alive between calls to process_user_dict you would avoid that additional overhead.
If you just want to process one set of input, you can as well create your pool right inside process_user_dict. But probably not right before results = self.pool.map(self.predict_memcache, user_dicts) because that would create a pool for every iteration of your surrounding while loop.
In your specific case, it does not make any difference. You create your TSNew_ object on module-level, so that it remains alive (and with it the thread pool) while your app is running; the same thread pool from the same TSNew instance is used to process all the requests during the lifetime of app.run().
Since you seem to be using that construct with self.process = threading.Thread(target=self.process_user_dict) as some sort of listener on self.memory, creating the pool in the constructor is functionally equivalent to creating the pool inside of process_user_dict (but outside the loop).
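To illustrate the "create once, reuse many times" option, here is a minimal sketch with a pool built in the constructor and reused by every call. `Predictor`, `predict`, `process` and `close` are hypothetical names, and the doubling in `predict` is a placeholder for whatever predict_memcache really does.

```python
from multiprocessing.dummy import Pool as ThreadPool


class Predictor:
    def __init__(self, workers=4):
        # Created once; every later call reuses the same worker threads.
        self.pool = ThreadPool(workers)

    def predict(self, user_dict):
        # Placeholder for the real per-user work.
        return user_dict["user_id"] * 2

    def process(self, user_dicts):
        # No pool start-up cost here, only task dispatch.
        return self.pool.map(self.predict, user_dicts)

    def close(self):
        self.pool.close()
        self.pool.join()


predictor = Predictor()
first = predictor.process([{"user_id": 1}, {"user_id": 2}])
second = predictor.process([{"user_id": 3}])
predictor.close()
```

Creating the pool inside `process` instead would work as well, but would pay the thread start-up cost on every call.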
Question (b):
Technically, there is no hidden error by default when creating a thread inside a thread. In the end, any additional thread's ultimate parent is always the MainThread, that is implicitly created for every instance of a Python interpreter. Basically, every time you create a thread inside a Python program, you create a thread in a thread.
Actually, your code does not even create a thread inside a thread. Your self.pool is created inside the MainThread. When the pool is instantiated via self.pool = ThreadPool(40) it creates the desired number (40) of worker threads, plus one worker handler thread, one task handler thread and one result handler thread. All of these are child threads of the MainThread. All you do with regards to your pool inside your thread under self.process is calling its map method to assign tasks to it.
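A small experiment along these lines can make the point visible. It is only a sketch, with `which_thread` and the thread name "caller" chosen for illustration: the pool is created in the main thread, and a second thread merely calls `map` on it.

```python
import threading
from multiprocessing.dummy import Pool as ThreadPool

threads_before = threading.active_count()
pool = ThreadPool(4)  # worker and handler threads are started here
threads_created = threading.active_count() - threads_before


def which_thread(_item):
    # Report the name of the thread that actually runs the task.
    return threading.current_thread().name


names = []
caller = threading.Thread(
    target=lambda: names.extend(pool.map(which_thread, range(8))),
    name="caller",
)
caller.start()
caller.join()
pool.close()
pool.join()
```

The tasks run on the pool's own worker threads, so "caller" never shows up in `names`; the calling thread only hands the work over and waits for the results.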
However, I do not really see the point of what you are doing with that self.process here.
Making a guess, I would say that you want the loop in process_user_dict to act as a kind of listener on self.memory, so that the pool starts processing user_dicts as soon as they show up in the deque in self.memory. From what I see you doing in get_user_result, you seem to get one user_dict per request. I understand that you might have concurrent user sessions passing in these dicts, but do you really see a benefit from process_user_dict running in an infinite loop over simply calling TSNew_.process_user_dict() after TSNew_.memory.append(user_dict)? You could even omit self.memory completely and pass the dict directly to process_user_dict, unless I am missing something you did not show us.
I'm currently working on a project that involves three components:
an observer that checks for changes in a directory, a worker, and a command-line interface.
What I want to achieve is:
The observer, when a change happens, sends a string to the worker (adds a job to the worker's queue).
The worker has a queue of jobs and works through its queue forever.
I also want to be able to run a Python script to check the status of the worker (number of active jobs, errors and so on).
I don't know how to achieve this in Python, in terms of which components to use and how to link the three of them.
I thought of a singleton worker to whose queue the observer adds jobs, but 1) I was not able to write working code and 2) how would the checker fit in?
Another solution I thought of is multiple child processes spawned by a parent that owns the queue, but I'm a bit lost...
Thanks for any advice.
I'd use some kind of observer pattern or publish-subscribe pattern. For the former you can use for example the Python version of ReactiveX. But for a more basic example let's stay with the Python core. Parts of your program can subscribe to the worker and receive updates from the process via queues for example.
import itertools as it
from queue import Queue
from threading import Thread
import time

class Observable(Thread):
    def __init__(self):
        super().__init__()
        self._observers = []

    def notify(self, msg):
        for obs in self._observers:
            obs.put(msg)

    def subscribe(self, obs):
        self._observers.append(obs)

class Observer(Thread):
    def __init__(self):
        super().__init__()
        self.updates = Queue()

class Watcher(Observable):
    def run(self):
        for i in it.count():
            self.notify(i)
            time.sleep(1)

class Worker(Observable, Observer):
    def run(self):
        while True:
            task = self.updates.get()
            self.notify((str(task), 'start'))
            time.sleep(1)
            self.notify((str(task), 'stop'))

class Supervisor(Observer):
    def __init__(self):
        super().__init__()
        self._statuses = {}

    def run(self):
        while True:
            status = self.updates.get()
            print(status)
            self._statuses[status[0]] = status[1]
            # Do something based on status updates.
            if status[1] == 'stop':
                del self._statuses[status[0]]

watcher = Watcher()
worker = Worker()
supervisor = Supervisor()
watcher.subscribe(worker.updates)
worker.subscribe(supervisor.updates)
supervisor.start()
worker.start()
watcher.start()

However, many variations are possible, and you can look through the various patterns to see which suits you best.
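For the status check itself, one possible building block is a small thread-safe record that the supervisor updates and that other code can read at any time. This is only a sketch: `StatusBoard`, `update` and `snapshot` are hypothetical names, and how a separate command-line script would reach it (a socket, a file, an HTTP endpoint) is left open.

```python
import threading


class StatusBoard:
    """Thread-safe record of which tasks are running and how many finished."""

    def __init__(self):
        self._lock = threading.Lock()
        self._active = set()
        self._finished = 0

    def update(self, task, state):
        # Intended to be fed with the (task, 'start' | 'stop') tuples
        # that the worker publishes.
        with self._lock:
            if state == 'start':
                self._active.add(task)
            elif state == 'stop':
                self._active.discard(task)
                self._finished += 1

    def snapshot(self):
        # Return a copy so callers never see a half-updated state.
        with self._lock:
            return {'active': sorted(self._active), 'finished': self._finished}


board = StatusBoard()
board.update('0', 'start')
board.update('1', 'start')
board.update('0', 'stop')
report = board.snapshot()
```

The supervisor's loop could call `board.update(*status)` for each message it receives, while a checker calls `board.snapshot()` whenever it wants a report.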
I am trying to write a module that needs to crawl some URLs concurrently. Since this is a network I/O-bound operation rather than a CPU-heavy one, I am using ThreadPoolExecutor.
In my code, multiple functions add tasks to the shared thread pool.
My issue is that the main thread gets suspended before all future objects are done processing in the callback functions.
I am a beginner with futures and ThreadPoolExecutor. Any help would be appreciated.
import settings
from concurrent.futures import ThreadPoolExecutor
import concurrent.futures

class Test(Base):
    WORKER_THREADS = settings.WORKER_THREADS

    def __init__(self, urls):
        super(Test, self).__init__()
        self.urls = urls
        self.worker_pool = ThreadPoolExecutor(max_workers=Test.WORKER_THREADS)

    def add_to_worker_queue(self, task, callback, **kwargs):
        self.logger.info("Adding task %s to worker pool.", task.func_name)
        self.worker_pool.submit(task, **kwargs).add_done_callback(callback)
        return

    def load_url(self, url):
        response = self.make_requests(urls=url) # make_requests is in Base class (it just makes a HTTP req)
        # response is a generator, so to get the data out of it need to iterate through it.
        for res in response:
            return res

    def handle_response(self, response):
        # do some stuff with response and add it again to the worker queue for further parallel processing
        self.add_to_worker_queue(some_task, callback_func, data=response)
        return

    def start(self):
        for url in self.urls:
            self.add_to_worker_queue(self.load_url, self.handle_response, url=[url])
        return

    def stop(self):
        self.worker_pool.shutdown(wait=True)
        return

if __name__ == "__main__":
    start_urls = [ 'http://stackoverflow.com/'
                 , 'https://docs.python.org/3.3/library/concurrent.futures.html'
                 ]
    test = Test(urls=start_urls)
    test.start()
    test.stop()

PS: I tried using the executor in a "with" statement, following this example: https://docs.python.org/3.3/library/concurrent.futures.html#threadpoolexecutor-example
However, I submit tasks to the pool one by one, and the example above waits for the future objects to complete, which defeats my purpose.
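One thing worth knowing here is that once shutdown() has been called, any later submit() raises RuntimeError, so callbacks that try to queue follow-up work after stop() will fail. A possible way around this, sketched below under assumptions rather than as a definitive fix, is to count outstanding tasks and call shutdown only when the count reaches zero. `TrackedPool`, `wait_and_shutdown`, `load`, `store` and `handle` are hypothetical names that are not part of the code above.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class TrackedPool:
    def __init__(self, max_workers=4):
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._cond = threading.Condition()
        self._pending = 0  # tasks submitted whose callback has not finished

    def submit(self, task, callback=None, **kwargs):
        with self._cond:
            self._pending += 1

        def _on_done(future):
            try:
                if callback is not None:
                    callback(future)  # may submit follow-up tasks
            finally:
                # Decrement only after the callback ran, so follow-ups
                # are counted before this task stops being counted.
                with self._cond:
                    self._pending -= 1
                    if self._pending == 0:
                        self._cond.notify_all()

        future = self._pool.submit(task, **kwargs)
        future.add_done_callback(_on_done)
        return future

    def wait_and_shutdown(self):
        with self._cond:
            self._cond.wait_for(lambda: self._pending == 0)
        self._pool.shutdown(wait=True)


results = []
results_lock = threading.Lock()
pool = TrackedPool(max_workers=2)


def load(n):
    return n * 2


def store(value):
    with results_lock:
        results.append(value)


def handle(future):
    # Second stage: queue more work from inside the callback.
    pool.submit(store, value=future.result())


for n in range(5):
    pool.submit(load, callback=handle, n=n)
pool.wait_and_shutdown()
```

Running the user callback and the counter update inside a single done-callback keeps their order fixed, so the count cannot drop to zero while a callback is still about to submit more work.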