the missing example | pre-fetch and pre-process data using threads - python

It seems there are many open questions about the usage of TensorFlow out there, and some TensorFlow developers are active here on Stack Overflow. Here is another question. I want to generate training data on the fly in other thread(s) using numpy or something else that does not belong to TensorFlow, but I do not want to go through re-compiling the entire TensorFlow source again and again. I am simply waiting for another way. "tf.py_func" seems to be a workaround. But the
This is related to [how-to-prefetch-data-using-a-custom-python-function-in-tensorflow][1]
Here is my MnWE (minimal-not-working-example):
Update (now there is output, but a race condition, too):
import numpy as np
import tensorflow as tf
import threading
import os
import glob
import random
import matplotlib.pyplot as plt
IMAGE_ROOT = "/graphics/projects/data/mscoco2014/data/images/"
files = ["train/COCO_train2014_000000178763.jpg",
"train/COCO_train2014_000000543841.jpg",
"train/COCO_train2014_000000364433.jpg",
"train/COCO_train2014_000000091123.jpg",
"train/COCO_train2014_000000498916.jpg",
"train/COCO_train2014_000000429865.jpg",
"train/COCO_train2014_000000400199.jpg",
"train/COCO_train2014_000000230367.jpg",
"train/COCO_train2014_000000281214.jpg",
"train/COCO_train2014_000000041920.jpg"];
# --------------------------------------------------------------------------------
def pre_process(data):
    """Pre-process an image with arbitrary functions;
    uses not only tf functions but arbitrary Python code.
    """
    # here is the place to do some fancy stuff
    # which might be out of the scope of tf
    return data[0:81, 0, 0].flatten()
def populate_queue(sess, thread_pool, qData_enqueue_op):
    """Put stuff into the data queue;
    responsible for ensuring there is always data for
    TensorFlow to process.
    """
    # until somebody tells me I can stop ...
    while not thread_pool.should_stop():
        # get a random image from MS COCO
        idx = random.randint(0, len(files) - 1)
        data = np.array(plt.imread(os.path.join(IMAGE_ROOT, files[idx])))
        data = pre_process(data)
        # put it into the queue
        sess.run(qData_enqueue_op, feed_dict={data_input: data})
# a simple queue for gathering data (just to keep it simple for now)
qData = tf.FIFOQueue(100, [tf.float32], shapes=[[9, 9]])
data_input = tf.placeholder(tf.float32)
qData_enqueue_op = qData.enqueue([tf.reshape(data_input, [9, 9])])
qData_dequeue_op = qData.dequeue()
init_op = tf.initialize_all_variables()

with tf.Session() as sess:
    # init all variables
    sess.run(init_op)
    # coordinator for the pool of threads
    thread_pool = tf.train.Coordinator()
    # start filling in data
    t = threading.Thread(target=populate_queue, args=(sess, thread_pool, qData_enqueue_op))
    t.start()
    # Can I use "tf.train.start_queue_runners" here?
    # How to use multiple threads?
    try:
        while not thread_pool.should_stop():
            print("iter")
            # HERE THE SILENCE BEGIN !!!!!!!!!!!
            batch = sess.run([qData_dequeue_op])
            print(batch)
    except tf.errors.OutOfRangeError:
        print('Done training -- no more data')
    finally:
        # When done, ask the threads to stop ...
        thread_pool.request_stop()
        # ... now they should definitely stop
        thread_pool.request_stop()
        thread_pool.join([t])
I basically have three questions:
1. What's wrong with this code? It runs into an endless loop (which is not debuggable). See the line "HERE THE SILENCE BEGIN ...".
2. How can I extend this code to use more threads?
3. Is it worth converting large datasets, or data which can be generated on the fly, to tf.Record?

You have a mistake on this line:
t = threading.Thread(target=populate_queue, args=(sess, thread_pool, qData))
It should be qData_enqueue_op instead of qData. Otherwise your enqueue operations fail, and you get stuck trying to dequeue from a queue of size 0. I saw this when trying to run your code, getting:
TypeError: Fetch argument <google3.third_party.tensorflow.python.ops.data_flow_ops.FIFOQueue object at 0x4bc1f10> of <google3.third_party.tensorflow.python.ops.data_flow_ops.FIFOQueue object at 0x4bc1f10> has invalid type <class 'google3.third_party.tensorflow.python.ops.data_flow_ops.FIFOQueue'>, must be a string or Tensor. (Can not convert a FIFOQueue into a Tensor or Operation.)
Regarding your other questions:
You don't need to start queue runners in this example, because you don't have any. Queue runners are created by input producers like string_input_producer, which is essentially a FIFO queue plus logic to launch threads. You are replicating 50% of queue-runner functionality by launching your own threads that run enqueue ops (the other 50% is closing the queue).
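That also covers the multi-threading question. A minimal sketch (my addition, not from the answer, reusing the queue and ops defined in the question; TF 1.x API): launch several threads running the same populate_queue target, and close the queue once the coordinator stops, which supplies the half of QueueRunner's job that is still missing here.

threads = [threading.Thread(target=populate_queue,
                            args=(sess, thread_pool, qData_enqueue_op))
           for _ in range(4)]  # 4 threads is an arbitrary choice
for t in threads:
    t.start()
# ... training loop as in the question ...
thread_pool.request_stop()
# closing the queue unblocks any consumer stuck in dequeue
sess.run(qData.close(cancel_pending_enqueues=True))
thread_pool.join(threads)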
RE: converting to tf.record -- Python has this thing called the Global Interpreter Lock, which means that two bits of Python code can't execute concurrently. In practice that's mitigated by the fact that a lot of the time is spent in numpy C++ code or in IO ops (which release the GIL). So I think it's a matter of checking whether you are able to achieve the required parallelism using Python pre-processing pipelines.
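If you do decide to pre-materialize the data, here is a minimal sketch of writing pre-processed arrays to a TFRecord file (my illustration, TF 1.x API; patches is a hypothetical iterable of arrays like the ones pre_process returns):

writer = tf.python_io.TFRecordWriter("train.tfrecords")
for patch in patches:  # hypothetical: pre-processed float arrays
    example = tf.train.Example(features=tf.train.Features(feature={
        "data": tf.train.Feature(
            float_list=tf.train.FloatList(value=patch.flatten().tolist()))
    }))
    writer.write(example.SerializeToString())
writer.close()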


Avoiding IO time delay in a loop using multiprocessing

I am running prediction using a trained TensorFlow model, generating data with it on images coming from a simulator. The issue is that I also need to save an image for each prediction I make, which creates a delay in the loop and sometimes causes issues in the simulator. Is there any way to use Python's multiprocessing module to create a producer-consumer architecture that avoids the IO cost in the loop?
for data in data_arr:
    speed = float(data['speed'])
    image = Image.open(BytesIO(base64.b64decode(data['image'])))
    image = np.asarray(image)
    img_c = image.copy()
    image = img_preprocess(image)
    image = np.array([image])
    steering_angle = float(model_steer.predict(image))
    #throttle = float(model_thr.predict(image))
    throttle = 1.0 - speed/speed_limit
    save_image(img_c, steering_angle)
    print('{} {} {}'.format(steering_angle, throttle, speed))
    send_control(steering_angle, throttle)
I tried to experiment with a similar concept for converting images from color to grayscale, but instead of decreasing, the total time increased from 0.1 s to 17 s.
import numpy as np
import cv2
import os
import time
from multiprocessing import Pool, RawArray
import ctypes

files_path = os.listdir('./imgs/')
files_path = list(map(lambda x: './imgs/' + x, files_path))
temp_img = np.zeros((160, 320))
var_dict = {}

def init_worker(X, h, w):
    # Using a dictionary is not strictly necessary. You can also
    # use global variables.
    var_dict['X'] = X
    var_dict['h'] = h
    var_dict['w'] = w

def worker_func(idx):
    # Reconstructs the shared image from the buffer and writes it to disk
    X_np = np.frombuffer(var_dict['X'], dtype=np.uint8)
    X_np = X_np.reshape(var_dict['h'], var_dict['w'])
    cv2.imwrite('./out/' + str(idx) + '.jpg', X_np)

if __name__ == '__main__':
    start_time = time.time()
    for idx, filepath in enumerate(files_path):
        img = cv2.imread(filepath)
        img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        h, w = img.shape[:2]
        mulproc_array = RawArray(ctypes.c_uint8, 160*320)
        X_np = np.frombuffer(mulproc_array, dtype=np.uint8).reshape(160, 320)
        np.copyto(X_np, img)
        #cv2.imwrite('./out/'+str(idx)+'.jpg', img)
        with Pool(processes=1, initializer=init_worker, initargs=(mulproc_array, h, w)) as pool:
            pool.map(worker_func, [idx])
    end_time = time.time()
    print('Time taken=', (end_time - start_time))
There is no reason to use RawArray here: multiprocessing already uses pickle for object transfer, which is approximately the same size as the numpy array, and RawArray is intended for a different use case than yours.
You don't need to wait for the saving function to finish; you can run it asynchronously.
You shouldn't be creating and closing the pool inside the loop; keep it open until you are done with everything, as creating a worker takes a very long time (on the order of 10-100 ms).
def worker_func(img, idx):
    cv2.imwrite('./out/' + str(idx) + '.jpg', img)

if __name__ == '__main__':
    start_time = time.time()
    with Pool(processes=1) as pool:
        results = []
        for idx, filepath in enumerate(files_path):
            img = cv2.imread(filepath)
            img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)  # do other work here
            # next line converts the image to uint8 before sending it, to reduce its size
            results.append(pool.apply_async(worker_func, args=(img.astype(np.uint8), idx)))
        end_time = time.time()  # technically the transfer is done at this line
        for res in results:
            res.get()  # call this before closing the pool to make sure all images are saved
    print('Time taken=', (end_time - start_time))
You might want to experiment with threading instead of multiprocessing, to avoid the data copy altogether, since writing to disk releases the GIL, but the results are not guaranteed to be faster.
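A minimal sketch of that threading variant (my illustration, reusing worker_func and files_path from above; ThreadPoolExecutor is from the standard library):

from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(max_workers=2) as pool:  # 2 writer threads is an arbitrary choice
    futures = []
    for idx, filepath in enumerate(files_path):
        img = cv2.cvtColor(cv2.imread(filepath), cv2.COLOR_BGR2GRAY)
        # threads share memory, so no copy of the image is made
        futures.append(pool.submit(worker_func, img, idx))
    for f in futures:
        f.result()  # surfaces any exceptions raised in the writer threads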

How to post-process multiple datasets while reading new data and updating a graph

I have the following situation:
1. Datasets are generated by an external device at varying intervals (between 0.1 s and 90 s). The code sleeps between acquisitions.
2. Each dataset needs to be post-processed (which is CPU-bound, single-threaded, and takes 10 s to 20 s). Post-processing should not block (1).
3. Acquisition and post-processing should work asynchronously, and whenever one dataset is done I want to update a pyplot graph in a Jupyter notebook (currently using ipython widgets) with the data from the post-processing. The plotting should also not block (1).
Doing (1) and (2) serially is easy: I acquire all datasets, store them in a list, then process each item, then display.
I don't know how to set this up in a parallel way or where to start. Do I use callback functions? Do callbacks work across processes? How do I set up the correct number of processes (acquisition in one, processing and plotting spread over the rest of the cores)? Can all processes modify the same list of all datasets? Is there a better data structure to use? Can it be done in Python?
This is a general outline of the classes you need and how to put them together, along the lines of (more or less) what I described in my comment. There are other approaches, but I think this is the easiest to understand. There are also more "industrial strength" products that implement message queueing, but with even steeper learning curves.
from multiprocessing import Process, Queue, cpu_count

def acquirer_process(post_process_queue):
    while True:
        # get the next file and put it on the post-processing queue
        info_about_file_just_acquired = acquire_next_file()
        post_process_queue.put(info_about_file_just_acquired)

def post_process_process(post_process_queue, plotting_queue):
    while True:
        info_about_file_just_acquired = post_process_queue.get()
        # post-process this file:
        info_about_post_processed_file = post_process(info_about_file_just_acquired)
        plotting_queue.put(info_about_post_processed_file)

def plotting_process(plotting_queue):
    while True:
        # Get plotting info for the next post-processed file:
        info_about_post_processed_file = plotting_queue.get()
        # Plot it:
        plot(info_about_post_processed_file)

def main():
    """
    The main program.
    """
    n_processors = cpu_count()
    # We need one acquirer process.
    # We need one plotting process, since the assumption is
    # that only a single process (thread) can be plotting at a time.
    # That leaves n_processors - 2 free to post-process acquired files in parallel:
    post_process_queue = Queue()
    plotting_queue = Queue()
    processes = []
    # All the processes that follow are "daemon" processes and will automatically
    # terminate when the main process terminates:
    processes.append(Process(target=acquirer_process, args=(post_process_queue,), daemon=True))
    processes.append(Process(target=plotting_process, args=(plotting_queue,), daemon=True))
    for _ in range(n_processors - 2):
        processes.append(Process(target=post_process_process, args=(post_process_queue, plotting_queue), daemon=True))
    # Start the processes:
    for process in processes:
        process.start()
    # Pause the main process:
    input('Hit enter to terminate:')

# Required for Windows:
if __name__ == '__main__':
    main()
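One caveat on this outline: daemon processes are killed abruptly, so any dataset still sitting in a queue at exit is lost. If a clean drain matters, a common pattern (a sketch, not part of the answer above) is to push one sentinel per worker and let the loops exit on it:

def post_process_process(post_process_queue, plotting_queue):
    while True:
        info = post_process_queue.get()
        if info is None:  # sentinel: no more work is coming
            plotting_queue.put(None)
            break
        plotting_queue.put(post_process(info))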

Timers cannot be stopped from another thread (short example with nidaqmx-python and callbacks)

I have seen other questions on this topic on this forum, but none has helped me understand how to deal with this. Most of them also seem to be about quite intricate and long code. I believe I am doing, or would like to do, something rather simple. I hope someone can help! Below are extensive explanations and then my current code.
NOTE: please do not delete this question. I have given the following a lot of thought and have carefully gone through related threads, but to no avail. I also believe it makes sense to post this because it is partly related to a more generic question: how to plot in real time while callbacks are running in the background (see the summary at the end), which could be said to be my general goal.
Setup and goal: a National Instruments acquisition module (this matters little), an NI cDAQ9178, interfaced via nidaqmx-python, a package maintained by NI with documentation here. Some analog signal is input there, and the goal is to acquire it continuously (until I decide to stop the acquisition) at a certain sampling rate (approximately 1000 Hz) while plotting the signal in real time. The plotting need not be refreshed nearly so often (a 10 Hz refresh rate would even be fine). I'm using Windows 10 with Python 3.7 in a conda virtual environment, and editing is done in PyCharm. Ideally things should work both in PyCharm and in any terminal.
Situation: nidaqmx-python provides high-level functions that allow one to register callbacks (which one defines as one wishes), which are called every time a certain number of samples (in my case 100, but that's not strict) fills the PC buffer. The idea is that the callback, defined below, reads the buffer at that point and does something (in my case some low-pass filtering, which I have taken out for conciseness, some storing into a global variable data, and maybe plotting - see below).
Problem: I have been fooling around with including whatever plots the data in real time in the callback, but with matplotlib that is a nightmare, because the callback runs on threads other than the main one, and matplotlib does not like to be called from anywhere outside the main thread. I've googled the heck out of other libraries optimized for real-time plotting (which, I was hoping, would be thread-safe), but it's not so easy: I cannot get vispy to work and I cannot get pyqtgraph to even install, just to give you some examples. Then I saw several posts on the internet of people actually managing pretty decent real-time animations with matplotlib, despite it having been developed with publication in mind rather than these applications, so I thought I'd give it a go.
My take: since I could not have matplotlib do the work from inside the callback, I did the following (which is the code you see below): after the callback is registered and the task is started with task.start() (that's specific to nidaqmx-python), I just create a while loop which plots the global variable buffer. I thought it was a nice trick: see, buffer is updated by the callback every 0.1 seconds or so (the exact interval does not matter) and, on the side, the while loop plots the buffer variable over and over, erasing before each plot, effectively yielding a real-time-like plot.
NOTE: I am perfectly aware the plotting part is not nearly as good as it could be (I probably should use the ax API of matplotlib and subplots, not to mention animation), but I do not care for the moment. I'll deal with that later and refine it to make it more efficient.
What I want: this actually does what I want ... except that, in order to stop it, I introduced the try: and except: statements around the while loop, as you see in the code below. Naturally, pressing CTRL+C does break the loop ... but it then also breaks the whole running script and leaves me with the following error in PyCharm: forrtl: error (200): program aborting due to control-C event, with these details when run from a terminal:
Image PC Routine Line Source
libifcoremd.dll 00007FFECF413B58 Unknown Unknown Unknown
KERNELBASE.dll 00007FFF219F60A3 Unknown Unknown Unknown
KERNEL32.DLL 00007FFF23847BD4 Unknown Unknown Unknown
ntdll.dll 00007FFF240CCED1 Unknown Unknown Unknown
QObject::~QObject: Timers cannot be stopped from another thread
The inconvenience is that I then have no choice but to close the Python shell (in PyCharm again), and I no longer have access to my precious variable data, containing ... well, my data.
Guess: obviously, the callback does not like to be stopped in this fashion. The nidaqmx-python task should be stopped with task.stop(). I tried putting task.stop() right after the except KeyboardInterrupt:, but it does not help, since CTRL+C stops the script on top of, or instead of, breaking the while loop. I believe some more sophisticated method of stopping my task is required. I have been thinking about this for days but can't figure out a way of having both things: a task I can stop, and at the same time real-time plotting. Note that, without the plotting, it is easy to stop the task upon an ENTER keypress: one simply writes at the end
input('Press ENTER to stop task')
task.stop()
But of course simply doing the above does not allow me to include the real-time plotting part.
Summary: I could not call matplotlib from the callback which reads the data continuously, so I wrote a while loop for real-time plotting in a separate block, but then I see no way of stopping that while loop without getting the above error (which complains that the callback was stopped from a different thread, I think).
I hope I am being clear; if not, please do ask!
Code: I've cleaned it up to get as close as I can to an MWE that shows the problem, although of course I realize most of you don't have an NI DAQ to play around with and connect, so as to be able to run this. Anyway ... here it is:
import matplotlib.pyplot as plt
import numpy as np
import nidaqmx
from nidaqmx import stream_readers
from nidaqmx import constants

sfreq = 1000
bufsize = 100

with nidaqmx.Task() as task:
    # Here we set up the task ... nevermind
    task.ai_channels.add_ai_voltage_chan("cDAQ2Mod1/ai1")
    task.timing.cfg_samp_clk_timing(rate=sfreq, sample_mode=constants.AcquisitionType.CONTINUOUS,
                                    samps_per_chan=bufsize)

    # Here we define a stream to be read continuously
    stream = stream_readers.AnalogMultiChannelReader(task.in_stream)

    data = np.zeros((1, 0))  # initializing an empty numpy array for my total data
    buffer = np.zeros((1, bufsize))  # defined so that the global buffer can be written to by the callback

    # This is my callback for reading data continuously
    def reading_task_callback(task_idx, event_type, num_samples, callback_data):  # bufsize is passed to num_samples when this is called
        global data
        global buffer
        buffer = np.zeros((1, num_samples))
        # This is the reading part
        stream.read_many_sample(buffer, num_samples, timeout=constants.WAIT_INFINITELY)
        data = np.append(data, buffer, axis=1)  # appends buffered data to the variable data
        return 0  # Absolutely needed for this callback to be well defined (see nidaqmx doc).

    # Here is the heavy lifting I believe: the above callback is registered
    task.register_every_n_samples_acquired_into_buffer_event(bufsize, reading_task_callback)
    task.start()  # The task is started (callback called periodically)

    print('Acquiring sensor data. Press CTRL+C to stop the run.\n')  # This should work ...

    fig = plt.figure()
    try:
        while True:
            # Poor man's plot updating
            plt.clf()
            plt.plot(buffer.T)
            plt.show()
            plt.pause(0.01)  # 100 Hz refresh rate
    except KeyboardInterrupt:  # stop the loop with CTRL+C ... or so I thought :-(
        plt.close(fig)
        pass

    task.stop()  # I believe I never get to this part after pressing CTRL+C ...

    # Some prints at the end ... nevermind
    print('Total number of acquired samples: ', len(data.T), '\n')
    print('Sampling frequency: ', sfreq, 'Hz\n')
    print('Buffer size: ', bufsize, '\n')
    print('Acquisition duration: ', len(data.T)/sfreq, 's\n')
Any input would be appreciated. Thank you in advance folks!
EDIT: after the accepted answer below, I rewrote the code above and came up with the following, which now works as intended (sorry, this time I have not cleaned it up, and some lines are irrelevant to the present question):
# Stream read from a task that is set up to read continuously
import matplotlib.pyplot as plt
import numpy as np
import nidaqmx
from nidaqmx import stream_readers
from nidaqmx import constants
from scipy import signal
import threading

running = True
sfreq = 1000
bufsize = 100
bufsizeb = 100

global task  # note: 'global' at module level is a no-op

def askUser():  # it might be better to put this outside of the task
    global running
    input("Press return to stop.")
    running = False

def main():
    global running
    global data
    global buffer
    global data_filt
    global buffer_filt
    global b
    global z

    print('Acquiring sensor data...')

    with nidaqmx.Task() as task:  # maybe we can use target as above
        thread = threading.Thread(target=askUser)
        thread.start()

        task.ai_channels.add_ai_voltage_chan("cDAQ2Mod1/ai1")
        task.timing.cfg_samp_clk_timing(rate=sfreq, sample_mode=constants.AcquisitionType.CONTINUOUS,
                                        samps_per_chan=bufsize)
        # unclear whether samps_per_chan is needed here above, or why it would differ from bufsize

        stream = stream_readers.AnalogMultiChannelReader(task.in_stream)

        data = np.zeros((1, 0))  # probably not the most elegant way of initializing an empty numpy array
        buffer = np.zeros((1, bufsizeb))  # defined so that the global buffer can be written in the callback
        data_filt = np.zeros((1, 0))
        buffer_filt = np.zeros((1, bufsizeb))
        b = signal.firwin(150, 0.004)
        z = signal.lfilter_zi(b, 1)

        def reading_task_callback(task_idx, event_type, num_samples, callback_data):  # bufsizeb is passed to num_samples
            global data
            global buffer
            global data_filt
            global buffer_filt
            global z
            global b
            if running:
                # It may be wiser to read slightly more than num_samples here, to make sure one does not miss any samples,
                # see: https://documentation.help/NI-DAQmx-Key-Concepts/contCAcqGen.html
                buffer = np.zeros((1, num_samples))
                stream.read_many_sample(buffer, num_samples, timeout=constants.WAIT_INFINITELY)
                data = np.append(data, buffer, axis=1)  # appends buffered data to the variable data

                # FIR filtering, low-pass
                buffer_filt = np.zeros((1, num_samples))
                for i, x in enumerate(np.squeeze(buffer)):  # squeeze required for x to be just a scalar (which lfilter likes)
                    buffer_filt[0, i], z = signal.lfilter(b, 1, [x], zi=z)
                data_filt = np.append(data_filt, buffer_filt, axis=1)  # appends buffered filtered data to data_filt
            return 0  # Absolutely needed for this callback to be well defined (see nidaqmx doc).

        task.register_every_n_samples_acquired_into_buffer_event(bufsizeb, reading_task_callback)  # bufsizeb instead
        task.start()

        while running:  # this is perfect: it "pauses" the console just like sleep, in a way that the task does not stop
            plt.clf()
            plt.plot(buffer.T)
            plt.draw()
            plt.pause(0.01)  # 100 Hz refresh rate
        # plt.close(fig)  # maybe no need to close it for now
        # task.join()  # this is for threads I guess ... (seems useless in my case?)

    # Some prints at the end ...
    print('Total number of acquired samples:', len(data.T))
    print('Sampling frequency:', sfreq, 'Hz')
    print('Buffer size:', bufsize)
    print('Acquisition duration:', len(data.T)/sfreq, 's')

if __name__ == '__main__':
    main()
Note that I do not need task.stop() after all: the way continuous acquisition tasks work with this package, reaching any line of code after task.start() that is not a sleep or something like it makes the task stop (well, that's my understanding at least).
The first thing I did was get rid of the keyboard-interrupt loop. I replaced it with a global variable running, and another thread that sets the variable to False when it returns.
def askUser():
    global running
    input("Press return to stop.")
    running = False
Then, before the while loop, I created a new thread that will execute this function.
askUserThread = threading.Thread(target=askUser)
askUserThread.start()
And for the while loop, I got rid of the try/except statement:

while running:
    plt.clf()
    plt.plot(buffer.T)
    plt.draw()  # Note: this got changed because .show wasn't working.
    plt.pause(0.01)
This still didn't work for me, because I had to close the plot window for a new one to show up. So, following this answer, I changed it from .show to .draw.
My end code was a little different (since I sampled random data), but here it is:
# sampling.py
# by Preston Hager

import matplotlib.pyplot as plt
import numpy as np
import threading

sfreq = 1000
bufsize = 100
running = True

data = np.zeros((1, 0))  # initializing an empty numpy array for my total data
buffer = np.zeros((1, bufsize))  # defined so that the global buffer can be written to by the callback

def askUser():
    global running
    input("Press return to stop.")
    running = False

def readingTask():
    global data
    global buffer
    while running:
        buffer = np.random.rand(1, bufsize)
        # This is the reading part
        data = np.append(data, buffer, axis=1)  # appends buffered data to the variable data

def main():
    global running
    print('Acquiring sensor data.')
    thread = threading.Thread(target=askUser)
    thread.start()
    task = threading.Thread(target=readingTask)
    task.start()
    fig = plt.figure()
    while running:
        # Poor man's plot updating
        plt.clf()
        plt.plot(buffer.T)
        plt.draw()
        plt.pause(0.01)  # 100 Hz refresh rate
    plt.close(fig)
    task.join()
    # Some prints at the end ... nevermind
    print('Total number of acquired samples:', len(data.T))
    print('Sampling frequency:', sfreq, 'Hz')
    print('Buffer size:', bufsize)
    print('Acquisition duration:', len(data.T)/sfreq, 's')

if __name__ == '__main__':
    main()

mxnet: how to set up a custom mx.io.DataIter to prefetch?

My mxnet script is likely limited by the i/o of loading data onto the GPU, and I am trying to speed this up by prefetching. The trouble is that I can't figure out how to prefetch with a custom data iterator.
My first hypothesis/hope was that it would be enough to set the values of self.preprocess_threads and self.prefetch_buffer, as I had seen done here for iterators such as mxnet.io.ImageRecordUInt8Iter. However, when I did this I saw no performance change relative to the script before I set these variables, so clearly that did not work.
Then I noticed the existence of a class mx.io.PrefetchingIter, in addition to the base class mx.io.DataIter for which I had implemented a child class. I found this documentation, but I have not been able to find any examples, and I am a little confused about what needs to happen where and when. For example, I see that in addition to next() it has an iter_next() method, which simply says "move to the next batch". What does this mean exactly? What does it mean to "move" to the next batch without producing it? I found the source code for this class, and based on a brief reading, it seems that it takes multiple iterators and creates one thread per iterator. This likely would not work for my current design, as I really want multiple threads prefetching from the same iterator.
Here is what I am trying to do via a custom data iterator:
I maintain a global multiprocessing.Queue onto which data is put as it becomes available.
I produce that data by running (via multiprocessing) a command-line script that executes a c++ binary, which produces a numpy file.
I open the numpy file, load its contents into memory, process them, and put the processed bits on the global multiprocessing.Queue.
My custom iterator pulls off this queue and also kicks off more jobs to produce more data when the queue is empty.
Here is my code:
def launchJobForDate(date_str):
    ### this is a function that gets called via multiprocessing
    ### to produce new data by calling a c++ binary
    ### whenever the data queue is empty, so that we produce more data
    try:
        f = "testdata/data%s.npy" % date_str
        if not os.path.isfile(f):
            cmd = CMD % (date_str, JSON_FILE, date_str, date_str, date_str)
            while True:
                try:
                    output = subprocess.check_output(cmd, shell=True)
                    break
                except:
                    pass
        while True:
            try:
                d = np.load(f)
                break
            except:
                pass
        data_queue.put((d, date_str))
    except Exception as ex:
        print("launchJobForDate: ERROR ", ex)

class ProduceDataIter(mx.io.DataIter):
    @staticmethod
    def processData(d, time_steps, num_inputs):
        try:
            ...processes data...
            return [z for z in zip(bigX, bigY, bigEvalY, dates)]
        except Exception as ex:
            print("processData: ERROR ", ex)

    def __init__(self, num_mgrs, end_date_str):
        ## iter stuff
        self.preprocess_threads = 4
        self.prefetch_buffer = 1
        ## set up internal data to preserve state
        ## and make a list of dates for which to run the binary

    @property
    def provide_data(self):
        return [mx.io.DataDesc(name='seq_var',
                               shape=(args_batch_size * GPU_COUNT,
                                      self.time_steps,
                                      self.num_inputs),
                               layout='NTC')]

    @property
    def provide_label(self):
        return [mx.io.DataDesc(name='bd_return',
                               shape=(args_batch_size * GPU_COUNT)),
                mx.io.DataDesc(name='bd_return',
                               shape=(args_batch_size * GPU_COUNT, num_y_cols)),
                mx.io.DataDesc(name='date',
                               shape=(args_batch_size * GPU_COUNT))]

    def __next__(self):
        try:
            z = self.z.pop(0)
            data = z[0:1]
            label = z[1:]
            return mx.io.DataBatch(data, label)
        except Exception as ex:
            ### if self.z (a list) has no elements to pop we need
            ### to get more data off the queue, process it, and put it
            ### on self.z so it's ready for calls to __next__()
            while True:
                try:
                    d = data_queue.get_nowait()
                    processedData = ProduceDataIter.processData(d,
                                                                self.time_steps,
                                                                self.num_inputs)
                    self.z.extend(processedData)
                    counter_queue.put(counter_queue.get() - 1)
                    z = self.z.pop(0)
                    data = z[0:1]
                    label = z[1:]
                    return mx.io.DataBatch(data, label)
                except queue.Empty:
                    ...this is where new jobs to produce new data and put them
                    ...on the queue would happen if nothing is left on the queue
I then tried making one of these iterators, as well as a prefetching iterator, like so:
mgr = ProcessMgr(2, end_date_str)
mgrOuter = mx.io.PrefetchingIter([mgr])
The problem is that mgrOuter immediately throws a StopIteration as soon as __next__() is called the first time, without invoking mgr.__next__() as I thought it might.
Finally, I also noticed that gluon has a DataLoader object which seems like it might handle prefetching; however, in this case it also seems to assume that the underlying data comes from a Dataset with a finite and unchanging layout (based on the fact that it is implemented in terms of __getitem__, which takes an index). So I have not pursued this option, as it seems unpromising given the dynamic, queue-like nature of the data I am generating as training input.
My questions are:
How do I need to modify my code above so that there will be prefetching for my custom iterator?
Where might I find an example or more detailed documentation of how mx.io.PrefetchingIter works?
Are there other strategies I should be aware of for getting more performance out of my GPUs via a custom iterator? Right now they are only operating at around 50% capacity, and upping (or lowering) the batch size doesn't change this. What other knobs might I be able to turn to increase GPU use efficiency?
Thanks for any feedback and advice.
As you already mentioned, the gluon DataLoader provides prefetching. In your custom DataIterator you are using numpy arrays as input, so you could do the following:
f = "testdata/data%s.npy"%date_str
data = np.load(f)
train = gluon.data.ArrayDataset(mx.nd.array(data))
train_iter = gluon.data.DataLoader(train, shuffle=True, num_workers=4, batch_size=batch_size, last_batch='rollover')
Since you are creating your data dynamically, you could try resetting the DataLoader in every epoch and loading a new numpy array.
If GPU utilization is still low, try increasing the batch_size and the num_workers. Another factor is the size of your dataset: resetting the DataLoader costs performance, so a larger dataset means longer epochs relative to that fixed cost, and as such better overall performance.
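A rough sketch of that per-epoch reset (my illustration; date_strings, batch_size, and the file naming are assumptions carried over from the question):

for epoch, date_str in enumerate(date_strings):
    # hypothetical: one freshly generated numpy file per epoch
    data = np.load("testdata/data%s.npy" % date_str)
    train = gluon.data.ArrayDataset(mx.nd.array(data))
    train_iter = gluon.data.DataLoader(train, shuffle=True, num_workers=4,
                                       batch_size=batch_size, last_batch='rollover')
    for batch in train_iter:
        pass  # run the usual training step on batch here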

Clean, pythonic way for concurrent data loaders?

Python 3
I would like to know what a really clean, pythonic concurrent data loader should look like. I need this for a project of mine that does heavy computations on data that is too big to fit entirely into memory. Hence, I implemented data loaders that run concurrently and store data in a queue, so that the main process can work while (in the meantime) the next data is being loaded and prepared. Of course, the queue should block when it is empty (the main process trying to consume more items -> the queue should wait for new data) or full (the worker process should wait until the main process consumes data from the queue, to prevent out-of-memory errors).
I have written a class to fulfill this need using Python's multiprocessing module (multiprocessing.Queue and multiprocessing.Process). The crucial parts of the class are implemented as follows:
import multiprocessing as mp
from itertools import cycle

class ConcurrentLoader:
    def __init__(self, path_to_data, queue_size, batch_size):
        self._batch_size = batch_size
        self._path = path_to_data
        filenames = ...  # filenames for path 'path_to_data',
                         # get loaded using glob
        self._files = cycle(filenames)
        self._q = mp.Queue(queue_size)
        ...
        self._worker = mp.Process(target=self._worker_func, daemon=True)
        self._worker.start()  # only started, never stopped

    def _worker_func(self):
        while True:
            buffer = list()
            for i in range(self._batch_size):
                f = next(self._files)
                ...  # load f and do some pre-processing with NumPy
                ...  # add it to buffer
            self._q.put(np.array(buffer).astype(np.float32))

    def get_batch_data(self):
        return self._q.get()
The class has some more methods, but they are all "convenience functionality". For example, it counts in a dict how often each file was loaded, how often the whole data set was loaded, and so on; these are easy to implement in Python and do not waste much computation time (sets, dicts, ...).
The data-loading part itself, on the other hand, can take seconds due to I/O and pre-processing. That is the reason why I want it to happen concurrently.
ConcurrentLoader should:
block the main process if get_batch_data is called but the queue is empty
block the worker process if the queue is full, to prevent out-of-memory errors and to stop while True from wasting resources
be "transparent" to any class that uses ConcurrentLoader: it should just supply the path to the data and use get_batch_data, without noticing that this actually works concurrently ("hassle-free usage")
terminate its worker when the main process dies, to free resources again
Considering these goals (have I forgotten anything?), what should I do to enhance the current implementation? Is it thread-safe and deadlock-safe? Is there a more "pythonic" way to implement it? Can I make it cleaner? Does it waste resources somehow?
Any class that uses ConcurrentLoader would roughly follow this setup:
class Foo:
    ...

    def do_something(self):
        ...
        data1 = ConcurrentLoader("path/to/data1", 64, 8)
        data2 = ConcurrentLoader("path/to/data2", 256, 16)
        ...
        sample1 = data1.get_batch_data()
        sample2 = data2.get_batch_data()
        ...  # heavy computations with data contained in 'sample1' & 'sample2'
             # go *here*
Please either point out mistakes of any kind in order to improve my approach, or supply your own, cleaner, more pythonic approach.
Blocking when a multiprocessing.Queue is empty/full and get()/put() is called on it happens automatically, and this behavior is transparent to calling functions.
Set self._worker.daemon = True before self._worker.start() so the worker(s) will automatically be killed when the main process exits (your constructor already does this via the daemon=True argument).
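A tiny self-contained demonstration of that automatic blocking (my illustration, not from the answer):

import multiprocessing as mp

q = mp.Queue(2)    # bounded queue, like the one in ConcurrentLoader
q.put("batch-1")
q.put("batch-2")   # a third put() would now block until a get() frees a slot
print(q.get())     # and get() on an empty queue blocks until data arrives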
