Handle blocking operations efficiently in Python

I'm using Python and OpenCV to get video from an RTSP stream. I'm grabbing single frames from the stream and saving them to the file system.
I wrote a StreamingWorker which handles getting and saving frames. There is also a StreamPool that holds all of the streaming objects. My thinking was that since a StreamingWorker is always running, there should only be one per core in order to use as much processing power as possible. The StreamPool then provides the VideoCapture objects to the available StreamingWorkers.
The problem is that most of the time the script is running, it is blocking:
import os
import time
import threading
import cv2 as cv

class StreamingWorker(object):

    def __init__(self, stream_pool):
        self.stream_pool = stream_pool
        self.start_loop()

    def start_loop(self):
        while True:
            try:
                # getting a stream from the read_strategy
                stream_object = self.stream_pool.next()

                # getting an image from the stream
                _, frame = stream_object['stream'].read()

                # saving image to file system
                cv.imwrite(os.path.join('result', stream_object['feed'], '{}.jpg'.format(time.time())), frame)

            except ValueError as e:
                print('[error] {}'.format(e))

class StreamPool(object):

    def __init__(self, streams):
        self.streams = [{'feed': stream, 'stream': cv.VideoCapture(stream)} for stream in streams]
        self.current_stream = 0
        self.lock = threading.RLock()

    def next(self):
        self.lock.acquire()
        if self.current_stream + 1 >= len(self.streams):
            self.current_stream = 0
        else:
            self.current_stream += 1
        result = self.streams[self.current_stream]
        self.lock.release()
        return result

def get_cores():
    # This function returns the number of available cores
    import multiprocessing
    return multiprocessing.cpu_count()

def start(stream_pool):
    StreamingWorker(stream_pool)

def divide_list(input_list, amount):
    # This function divides the whole list into a list of lists
    result = [[] for _ in range(amount)]
    for i in range(len(input_list)):
        result[i % len(result)].append(input_list[i])
    return result

if __name__ == '__main__':
    stream_list = ['rtsp://some/stream1', 'rtsp://some/stream2', 'rtsp://some/stream3']
    num_cores = get_cores()
    divided_streams = divide_list(stream_list, num_cores)
    for streams in divided_streams:
        stream_pool = StreamPool(streams)
        thread = threading.Thread(target=start, args=(stream_pool,))
        thread.start()
When I designed this, I didn't take into account that most of the operations would be blocking, like:

# Getting a frame blocks
_, frame = stream_object['stream'].read()

# Writing to the file system blocks
cv.imwrite(os.path.join('result', stream_object['feed'], '{}.jpg'.format(time.time())), frame)
The problem with spending so much time blocking is that most of the processing power is wasted. I thought of using futures with a ThreadPoolExecutor, but I can't seem to reach my goal of using the maximum number of processing cores possible. Maybe I'm not setting enough threads.
Is there a standard way of handling blocking operations in order to make the best use of the cores' processing power? I'm fine with a language-agnostic answer.

I ended up using a ThreadPoolExecutor with its add_done_callback(fn) function.
from concurrent.futures import ThreadPoolExecutor

class StreamingWorker(object):

    def __init__(self, stream_pool):
        self.stream_pool = stream_pool
        self.thread_pool = ThreadPoolExecutor(10)
        self.start_loop()

    def start_loop(self):
        def done(fn):
            print('[info] future done')

        def save_image(stream):
            # getting an image from the stream
            _, frame = stream['stream'].read()
            # saving image to file system
            cv.imwrite(os.path.join('result', stream['feed'], '{}.jpg'.format(time.time())), frame)

        while True:
            try:
                # getting a stream from the read_strategy
                stream_object = self.stream_pool.next()

                # scheduling the work on the thread pool
                self.thread_pool.submit(save_image, stream_object).add_done_callback(done)

            except ValueError as e:
                print('[error] {}'.format(e))
I didn't actually want to do anything after a future finished, but if I used result() then the while True loop would block on it, which would defeat the purpose of using the thread pool in the first place.
Side note: I had to add a threading.RLock() around the call to self.stream_pool.next(), because apparently OpenCV can't handle calls from multiple threads.
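A minimal sketch of that side note (the helper name and the exact placement of the lock are illustrative, not the exact change I made; the point is just that the pool rotation, and conservatively the capture read, happen while one shared lock is held):

import threading

# One lock shared by all worker threads: only one thread at a time rotates the
# pool and reads from the VideoCapture. Drop read() from the locked block if
# your OpenCV build turns out to handle concurrent reads fine.
opencv_lock = threading.RLock()

def next_frame(stream_pool):
    with opencv_lock:
        stream_object = stream_pool.next()
        ok, frame = stream_object['stream'].read()
    return stream_object, (frame if ok else None)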

Related

How can I create multiple processes using python multiprocessing inside a function?

In my web application (which uses the FastAPI framework), I have 2 CPU-intensive functions that are needed to create the final response.
These functions are not dependent on each other, so I planned on using processes to make the job faster, for which I have written the code below:
import concurrent.futures
from concurrent.futures import as_completed

class SomeService:

    def __init__(self, ml_model1, ml_model2):
        self.ml_model1 = ml_model1
        self.ml_model2 = ml_model2

    def handle_request(self, request: QUSRequest):
        with concurrent.futures.ProcessPoolExecutor() as pool:
            futures = {pool.submit(self.some_cpu_intensive_task1, request): "some_cpu_intensive_task1",
                       pool.submit(self.some_cpu_intensive_task2, request): "some_cpu_intensive_task2"}
            response1 = None
            response2 = None
            for future in as_completed(futures):
                if futures[future] == "some_cpu_intensive_task1":
                    response1 = future.result()
                elif futures[future] == "some_cpu_intensive_task2":
                    response2 = future.result()
            response = APIResponse(response_a=response1, response_b=response2)
            return response

    def some_cpu_intensive_task1(self, request):
        ### some task
        return something

    def some_cpu_intensive_task2(self, request):
        ### some task
        return something
But this setup slows down everything else in my application and takes a huge amount of time, whereas a simple version without processes takes about 6-7 ms.
How can I correctly run processes inside a function (handle_request)?
Your tasks, some_cpu_intensive_task1 and some_cpu_intensive_task2, are not particularly long-running functions relative to the amount of time it takes just to create a process pool. So unless handle_request is called multiple times, amortizing the cost of creating the process pool across those invocations, you will not gain anything by using multiprocessing. And even then you must ensure that you create the process pool only once and reuse it for all handle_request calls.
There is also some overhead in passing arguments and results from one process's address space to another that you would not incur with a straight function call, which further cuts into the possible performance gains. The bottom line is that the less CPU work your "worker" functions do, the smaller the gains from multiprocessing. That said, the following changes are what you would need in order to see any gains (instead of losses) over multiple invocations:
import concurrent.futures

class SomeService:
    # Must be a class variable or we can get: TypeError: cannot pickle 'weakref' object
    # We only need 2 workers (assuming there are no concurrent calls to handle_request):
    _pool = concurrent.futures.ProcessPoolExecutor(max_workers=2)

    def handle_request(self, request):
        # Since we are not proceeding until both tasks complete,
        # nothing is really being gained by using as_completed:
        future1 = self._pool.submit(self.some_cpu_intensive_task1, request)
        future2 = self._pool.submit(self.some_cpu_intensive_task2, request)
        return APIResponse(response_a=future1.result(), response_b=future2.result())

    def some_cpu_intensive_task1(self, request):
        ### some task
        return something

    def some_cpu_intensive_task2(self, request):
        ### some task
        return something
Update
Here is a concrete example with REPETITIONS set to 10:
import concurrent.futures

REPETITIONS = 10

class SomeService:
    # Must be a class variable or we can get: TypeError: cannot pickle 'weakref' object
    # We only need 2 workers (assuming there are no concurrent calls to handle_request):
    _pool = concurrent.futures.ProcessPoolExecutor(max_workers=2)

    def handle_request(self, request):
        # Since we are not proceeding until both tasks complete,
        # nothing is really being gained by using as_completed:
        future1 = self._pool.submit(self.some_cpu_intensive_task1, request)
        future2 = self._pool.submit(self.some_cpu_intensive_task2, request)
        return (future1.result(), future2.result())

    def some_cpu_intensive_task1(self, request):
        sum = 0
        for _ in range(REPETITIONS):
            sum += request ** 2
        return sum

    def some_cpu_intensive_task2(self, request):
        sum = 0
        for _ in range(REPETITIONS):
            sum += request ** 3
        return sum

if __name__ == '__main__':
    s = SomeService()
    import time

    t = time.time()
    for _ in range(100):
        result = s.handle_request(4)
    print('Multiprocessing:', time.time() - t, result)

    t = time.time()
    for _ in range(100):
        result = s.some_cpu_intensive_task1(4), s.some_cpu_intensive_task2(4)
    print('Serial processing:', time.time() - t, result)
Prints:
Multiprocessing: 0.21735835075378418 (160, 640)
Serial processing: 0.0010030269622802734 (160, 640)
Multiprocessing degrades performance due to the overhead of passing arguments to and getting results back from another process.
But when we re-run with REPETITIONS set to 100_000, so that the worker functions some_cpu_intensive_task1 and some_cpu_intensive_task2 take considerably more time to execute, this is the new output:
Multiprocessing: 2.8213891983032227 (1600000, 6400000)
Serial processing: 4.49717116355896 (1600000, 6400000)

OpenCV RTSP camera buffer lag

I'm struggling to understand why I can't get a "LIVE" feed from my IP camera.
It appears that there is a buffer, and it causes the frames to build up if they're not being read - and as each iteration of my code takes some time, there is a backlog and it ends up being almost slow motion compared to what's actually happening.
I found the code below, which triggers a thread to read from the camera in a loop to try to avoid this. But now I'm getting a "LIVE" feed for around 5 frames, then it stalls and shows the same image for another few.
import time
import threading
from threading import Lock
import cv2

##camera class - this stops the RTSP feed getting caught in the buffer
class Camera:

    def __init__(self, rtsp_link):
        #init last ready and last frame
        self.last_frame = None
        self.last_ready = None
        self.lock = Lock()

        #set capture device
        capture = cv2.VideoCapture(rtsp_link, apiPreference=cv2.CAP_FFMPEG)

        #set thread to clear buffer
        thread = threading.Thread(target=self.rtsp_cam_buffer, args=(capture,), name="rtsp_read_thread")
        thread.daemon = True
        thread.start()

        #delay start of next step to avoid errors
        time.sleep(2)

    def rtsp_cam_buffer(self, capture):
        #loop forever
        while True:
            with self.lock:
                capture.grab()
                self.last_ready, self.last_frame = capture.retrieve()

    def getFrame(self):
        #get last frame
        if (self.last_ready is not None) and (self.last_frame is not None):
            return self.last_frame.copy()
        else:
            return None
What's the correct thing to do in this situation? Is there a way around this?
OR
Should I use something like GStreamer or FFmpeg to get the camera feed? If so, which is better and why? Any advice or pages with Python examples of getting it working? I couldn't find much that made sense to me.
Thanks
After searching online through multiple resources, the suggestion of using threads to remove frames from the buffer came up a lot. And although it seemed to work for a while, it caused me issues with duplicate frames being displayed, for a reason that I could not work out.
I then tried to build OpenCV from source with GStreamer support, but even once it was compiled correctly it still didn't seem to like interfacing with GStreamer.
Eventually I thought the best bet was to go back down the threading route, but again I couldn't get it working. So I gave multiprocessing a shot.
I wrote the below class to handle the camera connection:
import cv2
import time
import multiprocessing as mp

class Camera():

    def __init__(self, rtsp_url):
        #load pipe for data transmission to the process
        self.parent_conn, child_conn = mp.Pipe()

        #load process
        self.p = mp.Process(target=self.update, args=(child_conn, rtsp_url))

        #start process
        self.p.daemon = True
        self.p.start()

    def end(self):
        #send closure request to process
        self.parent_conn.send(2)

    def update(self, conn, rtsp_url):
        #load cam into separate process
        print("Cam Loading...")
        cap = cv2.VideoCapture(rtsp_url, cv2.CAP_FFMPEG)
        print("Cam Loaded...")
        run = True

        while run:
            #grab frames from the buffer
            cap.grab()

            #receive input data
            rec_dat = conn.recv()

            if rec_dat == 1:
                #if frame requested
                ret, frame = cap.read()
                conn.send(frame)

            elif rec_dat == 2:
                #if close requested
                cap.release()
                run = False

        print("Camera Connection Closed")
        conn.close()

    def get_frame(self, resize=None):
        ###used to grab frames from the cam connection process
        ##[resize] param : % of size reduction or increase i.e. 0.65 for 35% reduction or 1.5 for a 50% increase
        #send request
        self.parent_conn.send(1)
        frame = self.parent_conn.recv()

        #reset request
        self.parent_conn.send(0)

        #resize if needed
        if resize == None:
            return frame
        else:
            return self.rescale_frame(frame, resize)

    def rescale_frame(self, frame, percent=65):
        return cv2.resize(frame, None, fx=percent, fy=percent)
Displaying the frames can be done as below
cam = Camera("rtsp://admin:[somepassword]#192.168.0.40/h264Preview_01_main")

print(f"Camera is alive?: {cam.p.is_alive()}")

while(1):
    frame = cam.get_frame(0.65)

    cv2.imshow("Feed", frame)

    key = cv2.waitKey(1)
    if key == 13:  #13 is the Enter Key
        break

cv2.destroyAllWindows()
cam.end()
This solution has resolved all my issues of buffer lag and repeated frames.
Hopefully it will help anyone else in the same situation.
Lewis's solution was helpful in reducing the lag, but there was still some lag in my case, and I found this gist, which is a bit faster:
import os
import sys
import time
import threading
import numpy as np
import cv2 as cv

# also acts (partly) like a cv.VideoCapture
class FreshestFrame(threading.Thread):

    def __init__(self, capture, name='FreshestFrame'):
        self.capture = capture
        assert self.capture.isOpened()

        # this lets the read() method block until there's a new frame
        self.cond = threading.Condition()

        # this allows us to stop the thread gracefully
        self.running = False

        # keeping the newest frame around
        self.frame = None

        # passing a sequence number allows read() to NOT block
        # if the currently available one is exactly the one you ask for
        self.latestnum = 0

        # this is just for demo purposes
        self.callback = None

        super().__init__(name=name)
        self.start()

    def start(self):
        self.running = True
        super().start()

    def release(self, timeout=None):
        self.running = False
        self.join(timeout=timeout)
        self.capture.release()

    def run(self):
        counter = 0
        while self.running:
            # block for fresh frame
            (rv, img) = self.capture.read()
            assert rv
            counter += 1

            # publish the frame
            with self.cond:  # lock the condition for this operation
                self.frame = img if rv else None
                self.latestnum = counter
                self.cond.notify_all()

            if self.callback:
                self.callback(img)

    def read(self, wait=True, seqnumber=None, timeout=None):
        # with no arguments (wait=True), it always blocks for a fresh frame
        # with wait=False it returns the current frame immediately (polling)
        # with a seqnumber, it blocks until that frame is available (or no wait at all)
        # with timeout argument, may return an earlier frame;
        #   may even be (0, None) if nothing was received yet
        with self.cond:
            if wait:
                if seqnumber is None:
                    seqnumber = self.latestnum + 1
                if seqnumber < 1:
                    seqnumber = 1

                rv = self.cond.wait_for(lambda: self.latestnum >= seqnumber, timeout=timeout)
                if not rv:
                    return (self.latestnum, self.frame)

            return (self.latestnum, self.frame)
And then you use it like:
# open some camera
cap = cv.VideoCapture('rtsp://URL')
cap.set(cv.CAP_PROP_FPS, 30)
# wrap it
fresh = FreshestFrame(cap)
Use fresh to deal with the open camera.
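For example, a minimal consumer loop could look like the following (my own sketch, not part of the class quoted above; asking read() for seqnumber=cnt + 1 makes each iteration block until a frame newer than the last one processed arrives):

cnt = 0
while True:
    # block until a frame newer than the last processed one is available
    cnt, img = fresh.read(seqnumber=cnt + 1)
    if img is None:
        continue
    cv.imshow('frame', img)
    if cv.waitKey(1) == 27:  # ESC to quit
        break

fresh.release()
cv.destroyAllWindows()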

Using the Python multiprocessing package inside QGIS plugin code

I spent quite a bit of time looking at how to use the multiprocessing package, but couldn't find anything on how to use it inside a plugin in QGIS. I am developing a plugin that does some optimization for several elements, and I would like to parallelize it.
I found a useful link on multi-threading inside a Python plugin (http://snorf.net/blog/2013/12/07/multithreading-in-qgis-python-plugins/), but nothing on using the multiprocessing module, which might be easier?
I have been trying with a very basic example. I am only showing the run function from the plugin here:
def run(self):
    """Run method that performs all the real work"""
    # show the dialog
    self.dlg.show()
    # Run the dialog event loop
    result = self.dlg.exec_()
    # See if OK was pressed and run code
    if result:
        #Get number of cores
        nProcs = mp.cpu_count()

        #Start a Process
        p = mp.Pool(nProcs)

        #Define function
        def cube(x):
            return x**3

        #Run parallel
        results = p.map(cube, range(1, 7))
When I run this code from the plugin in QGIS, it opens several QGIS windows, which then return errors (can't load layers, etc.). What am I missing? Do I need to start a worker first on another thread and then use multiprocessing there? Or would we use another function from multiprocessing?
Please let me know if the question needs edits. I am working under Windows 7, using QGIS 2.10.
Thanks,
UPDATE
I created a worker class to implement the function and sent it to a new thread, but I get the same problem when I use multiprocessing in that thread.
The class I created is as follows:
class Worker(QObject):
    '''Example worker'''

    def __init__(self, result_queue, f, attr=[], repet=None, nbCores=None):
        QObject.__init__(self)
        if not hasattr(f, '__call__'):
            #Check if not a function
            raise TypeError('Worker expected a function as second argument')
        if not isinstance(attr, list) and not repet == None:
            #Check if not a list if there is a repet command
            raise TypeError('Input problem:\nThe arguments for the function should be in a list if repet is provided')
        if not all(isinstance(elem, list) for elem in attr) and repet == None and len(inspect.getargspec(f).args) > 1:
            #Check if not a list of lists if there isn't a repet command
            raise TypeError('Input problem:\nThe arguments for the function should be a list of lists if repet is not provided')
        if not repet == None and (not isinstance(repet, int) or repet == 0):
            #Check that provided an integer greater than 0
            raise TypeError('If provided, repet should be None or a strictly positive integer')

        self.result_queue = result_queue
        self.f = f
        self.attr = attr
        self.repet = repet
        self.nbCores = nbCores
        if self.nbCores == None:
            self.nbCores = mp.cpu_count() - 1

    def fStar(self, arg):
        """Convert the function to taking a list as arguments"""
        return self.f(*arg)

    def run(self):
        ret = None
        try:
            if self.repet == 1:
                # estimates the function based on provided arguments
                ret = self.f(*self.attr)  #The star unpacks the list into attributes
            else:
                pool = mp.Pool(processes=self.nbCores)
                if self.repet > 1:
                    ret = pool.map(self.fStar, itools.repeat(self.attr, self.repet))
                elif self.repet == None:
                    ret = pool.map(self.fStar, self.attr)
                pool.close()
                pool.join()
        except Exception, e:
            #I can't pass an exception, it makes qgis bug
            pass
        self.result_queue.put(ret)  #Pass the result to the queue

    finished = pyqtSignal(object)
    error = pyqtSignal(Exception, basestring)
I start the worker and send it to a new thread using the following function:
def startWorker(f, attr, repet=None, nbCores=None):
    #Create a result queue
    result_queue = queue.Queue()

    # create a new worker instance
    worker = Worker(result_queue, f, attr, repet, nbCores)

    # start the worker in a new thread
    thread = QThread()
    worker.moveToThread(thread)
    thread.started.connect(worker.run)
    thread.start()

    #Clean up when the thread is finished
    worker.deleteLater()
    thread.quit()
    thread.wait()
    thread.deleteLater()

    #Export the result to the queue
    res = []
    while not result_queue.empty():
        r = result_queue.get()
        if r is None:
            continue
        res.append(r)
    return res
As in my initial question, I just replaced results = p.map(cube, range(1,7)) with a call to the startWorker function.
Please let me know if you have any idea how to make this work. I implemented the work in multiple threads, but it would be much faster to use several cores...

Python: 'before' and 'after' for multiprocessing workers

Update: Here is a more specific example.
Suppose I want to compile some statistical data from a sizable set of files. I can make a generator (line for line in fileinput.input(files)) and some processor:
from collections import defaultdict

scores = defaultdict(int)

def process(line):
    if 'Result' in line:
        res = line.split('\"')[1].split('-')[0]
        scores[res] += 1
The question is how to handle this when one gets to multiprocessing.Pool.
Of course it's possible to define a multiprocessing.sharedctypes as well as a custom struct instead of a defaultdict, but this seems rather painful. On the other hand I can't think of a Pythonic way to instantiate something before each process runs, or to return something to the main thread after a generator has run out.
So you basically create a histogram. This can easily be parallelized, because histograms can be merged without complication. One might want to say that this problem is trivially parallelizable or "embarrassingly parallel". That is, you do not need to worry about communication among workers.
Just split your data set into multiple chunks, let your workers work on these chunks independently, collect the histogram of each worker, and then merge the histograms.
In practice, this problem is best handled by letting each worker process read its own file. That is, a "task" could be a file name. You should not start pickling file contents and sending them around between processes through pipes. Let each worker process retrieve the bulk data directly from files. Otherwise your architecture spends too much time on inter-process communication, instead of doing real work.
Do you need an example or can you figure this out yourself?
Edit: example implementation
I have a number of data files with file names in this format: data0.txt, data1.txt, ... .
Example contents:
wolf
wolf
cat
blume
eisenbahn
The goal is to create a histogram over the words contained in the data files. This is the code:
from multiprocessing import Pool
from collections import Counter
import glob

def build_histogram(filepath):
    """This function is run by a worker process.
    The `filepath` argument is communicated to the worker
    through a pipe. The return value of this function is
    communicated to the manager through a pipe.
    """
    hist = Counter()
    with open(filepath) as f:
        for line in f:
            hist[line.strip()] += 1
    return hist

def main():
    """This function runs in the manager (main) process."""
    # Collect paths to data files.
    datafile_paths = glob.glob("data*.txt")

    # Create a pool of worker processes and distribute work.
    # The input to worker processes (function argument) as well
    # as the output by worker processes is transmitted through
    # pipes, behind the scenes.
    pool = Pool(processes=3)
    histograms = pool.map(build_histogram, datafile_paths)

    # Properly shut down the pool of worker processes, and
    # wait until all of them have finished.
    pool.close()
    pool.join()

    # Merge sub-histograms. Do not create too many intermediate
    # objects: update the first sub-histogram with the others.
    # Relevant docs: collections.Counter.update
    merged_hist = histograms[0]
    for h in histograms[1:]:
        merged_hist.update(h)

    for word, count in merged_hist.items():
        print "%s: %s" % (word, count)

if __name__ == "__main__":
    main()
Test output:
python countwords.py
eisenbahn: 12
auto: 6
cat: 1
katze: 10
stadt: 1
wolf: 3
zug: 4
blume: 5
herbert: 14
destruction: 4
I had to modify the original pool.py (the trouble was that worker is defined as a method without any inheritance) to get what I want, but it's not so bad, and probably better than writing a new pool entirely.
class worker(object):

    def __init__(self, inqueue, outqueue, initializer=None, initargs=(), maxtasks=None,
                 wrap_exception=False, finalizer=None, finargs=()):
        assert maxtasks is None or (type(maxtasks) == int and maxtasks > 0)
        put = outqueue.put
        get = inqueue.get
        self.completed = 0
        if hasattr(inqueue, '_writer'):
            inqueue._writer.close()
            outqueue._reader.close()

        if initializer is not None:
            initializer(self, *initargs)

        def run(self):
            while maxtasks is None or (maxtasks and self.completed < maxtasks):
                try:
                    task = get()
                except (EOFError, OSError):
                    util.debug('worker got EOFError or OSError -- exiting')
                    break

                if task is None:
                    util.debug('worker got sentinel -- exiting')
                    break

                job, i, func, args, kwds = task
                try:
                    result = (True, func(*args, **kwds))
                except Exception as e:
                    if wrap_exception:
                        e = ExceptionWithTraceback(e, e.__traceback__)
                    result = (False, e)
                try:
                    put((job, i, result))
                except Exception as e:
                    wrapped = MaybeEncodingError(e, result[1])
                    util.debug("Possible encoding error while sending result: %s" % (
                        wrapped))
                    put((job, i, (False, wrapped)))

                self.completed += 1

            if finalizer:
                finalizer(self, *finargs)
            util.debug('worker exiting after %d tasks' % self.completed)

        run(self)
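The point of the extra initializer/finalizer hooks is exactly the "before" and "after" from the question. A sketch of what the two hooks could look like for the histogram case (how the surrounding Pool changes pass finalizer/finargs down to this worker class isn't shown above, so take this as illustrative rather than exact):

from collections import defaultdict

def before(worker_self):
    # 'before': runs once in each worker process, ahead of the task loop
    # (matches the initializer(self, *initargs) call in __init__ above)
    worker_self.scores = defaultdict(int)

def after(worker_self, result_queue):
    # 'after': runs once per worker when its task loop ends
    # (matches the finalizer(self, *finargs) call at the end of run above)
    result_queue.put(dict(worker_self.scores))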

How to properly set up multiprocessing proxy objects for objects that already exist

I'm trying to share an existing object across multiple processes using the proxy methods described here. My multiprocessing idiom is the worker/queue setup, modeled after the 4th example here.
The code needs to do some calculations on data that are stored in rather large files on disk. I have a class that encapsulates all the I/O interactions, and once it has read a file from disk, it saves the data in memory for the next time a task needs to use the same data (which happens often).
I thought I had everything working from reading the examples linked to above. Here is a mock-up of the code that just uses numpy random arrays to model the disk I/O:
import numpy
from multiprocessing import Process, Queue, current_process, Lock
from multiprocessing.managers import BaseManager

nfiles = 200
njobs = 1000

class BigFiles:

    def __init__(self, nfiles):
        # Start out with nothing read in.
        self.data = [ None for i in range(nfiles) ]
        # Use a lock to make sure only one process is reading from disk at a time.
        self.lock = Lock()

    def access(self, i):
        # Get the data for a particular file
        # In my real application, this function reads in files from disk.
        # Here I mock it up with random numpy arrays.
        if self.data[i] is None:
            with self.lock:
                self.data[i] = numpy.random.rand(1024,1024)
        return self.data[i]

    def summary(self):
        return 'BigFiles: %d, %d Storing %d of %d files in memory'%(
            id(self),id(self.data),
            (len(self.data) - self.data.count(None)),
            len(self.data) )

# I'm using a worker/queue setup for the multiprocessing:
def worker(input, output):
    proc = current_process().name
    for job in iter(input.get, 'STOP'):
        (big_files, i, ifile) = job
        data = big_files.access(ifile)

        # Do some calculations on the data
        answer = numpy.var(data)

        msg = '%s, job %d'%(proc, i)
        msg += '\n Answer for file %d = %f'%(ifile, answer)
        msg += '\n ' + big_files.summary()
        output.put(msg)

# A class that returns an existing file when called.
# This is my attempted workaround for the fact that Manager.register needs a callable.
class ObjectGetter:
    def __init__(self, obj):
        self.obj = obj
    def __call__(self):
        return self.obj

def main():
    # Prior to the place where I want to do the multiprocessing,
    # I already have a BigFiles object, which might have some data already read in.
    # (Here I start it out empty.)
    big_files = BigFiles(nfiles)
    print 'Initial big_files.summary = ',big_files.summary()

    # My attempt at making a proxy class to pass big_files to the workers
    class BigFileManager(BaseManager):
        pass
    getter = ObjectGetter(big_files)
    BigFileManager.register('big_files', callable = getter)
    manager = BigFileManager()
    manager.start()

    # Set up the jobs:
    task_queue = Queue()
    for i in range(njobs):
        ifile = numpy.random.randint(0, nfiles)
        big_files_proxy = manager.big_files()
        task_queue.put( (big_files_proxy, i, ifile) )

    # Set up the workers
    nproc = 12
    done_queue = Queue()
    process_list = []
    for j in range(nproc):
        p = Process(target=worker, args=(task_queue, done_queue))
        p.start()
        process_list.append(p)
        task_queue.put('STOP')

    # Log the results
    for i in range(njobs):
        msg = done_queue.get()
        print msg

    print 'Finished all jobs'
    print 'big_files.summary = ',big_files.summary()

    # Shut down the workers
    for j in range(nproc):
        process_list[j].join()
    task_queue.close()
    done_queue.close()

main()
This works in the sense that it calculates everything correctly, and it is caching the data that is read along the way. The only problem I'm having is that at the end, the big_files object doesn't have any of the files loaded. The final msg returned is:
Process-2, job 999. Answer for file 198 = 0.083406
BigFiles: 4303246400, 4314056248 Storing 198 of 200 files in memory
But then after it's all done, we have:
Finished all jobs
big_files.summary = BigFiles: 4303246400, 4314056248 Storing 0 of 200 files in memory
So my question is: what happened to all the stored data? It claims to be using the same self.data according to id(self.data), but it's empty now.
I want the end state of big_files to have all the saved data that it accumulated along the way, since I actually have to repeat this entire process many times, and I don't want to have to redo all the (slow) I/O each time.
I'm assuming it must have something to do with my ObjectGetter class. The examples for using BaseManager only show how to make a new object that will be shared, not how to share an existing one. So am I doing something wrong with the way I get the existing big_files object? Can anyone suggest a better way to do this step?
Thanks much!
