Avoiding IO time delay in a loop using multiprocessing - python

I am running predictions with a trained TensorFlow model on images coming from a simulator and generating data from them. The problem is that I also need to save the image for each prediction, which adds a delay inside the loop and sometimes causes issues in the simulator. Is there a way to use Python's multiprocessing module to build a producer-consumer architecture that keeps the I/O cost out of the loop?
for data in data_arr:
    speed=float(data['speed'])
    image=Image.open(BytesIO(base64.b64decode(data['image'])))
    image=np.asarray(image)
    img_c=image.copy()
    image=img_preprocess(image)
    image=np.array([image])
    steering_angle=float(model_steer.predict(image))
    #throttle=float(model_thr.predict(image))
    throttle=1.0-speed/speed_limit
    save_image(img_c,steering_angle)
    print('{} {} {}'.format(steering_angle,throttle,speed))
    send_control(steering_angle,throttle)
I tried a similar experiment that converts images from color to grayscale, but instead of decreasing, the total time increased from 0.1 s to 17 s.
import numpy as np
import cv2
import os
import time
from multiprocessing import Pool,RawArray
import ctypes
files_path=os.listdir('./imgs/')
files_path=list(map(lambda x:'./imgs/'+x,files_path))
temp_img=np.zeros((160,320))
var_dict = {}
def init_worker(X, h,w):
    # Using a dictionary is not strictly necessary. You can also
    # use global variables.
    var_dict['X']=X
    var_dict['h'] = h
    var_dict['w'] = w
def worker_func(idx):
    # Rebuild the image from the shared buffer and write it to disk
    X_np = np.frombuffer(var_dict['X'], dtype=np.uint8)
    X_np=X_np.reshape(var_dict['h'],var_dict['w'])
    cv2.imwrite('./out/'+str(idx)+'.jpg',X_np)
if __name__=='__main__':
    start_time=time.time()
    for idx,filepath in enumerate(files_path):
        img=cv2.imread(filepath)
        img=cv2.cvtColor(img,cv2.COLOR_BGR2GRAY)
        h,w=img.shape[:2]
        mulproc_array=RawArray(ctypes.c_uint8,160*320)
        X_np = np.frombuffer(mulproc_array, dtype=np.uint8).reshape(160,320)
        np.copyto(X_np,img)
        #cv2.imwrite('./out/'+str(idx)+'.jpg',img)
        with Pool(processes=1, initializer=init_worker, initargs=(mulproc_array, h,w)) as pool:
            pool.map(worker_func,[idx])
    end_time=time.time()
    print('Time taken=',(end_time-start_time))

There is no reason to use RawArray here: multiprocessing already pickles objects to transfer them, and the pickled data is approximately the same size as the numpy array. RawArray is meant for a different use case than yours.
You don't need to wait for the saving function to finish; you can run it asynchronously.
You shouldn't close the pool until you are done with everything, because creating a worker takes a very long time (on the order of 10-100 ms).
import os
import time
import cv2
import numpy as np
from multiprocessing import Pool
files_path=['./imgs/'+x for x in os.listdir('./imgs/')]
def worker_func(img,idx):
    cv2.imwrite('./out/'+str(idx)+'.jpg',img)
if __name__=='__main__':
    start_time=time.time()
    with Pool(processes=1) as pool:
        results = []
        for idx,filepath in enumerate(files_path):
            img=cv2.imread(filepath)
            img=cv2.cvtColor(img,cv2.COLOR_BGR2GRAY) # do other work here
            # next line converts image to uint8 before sending it to reduce its size
            results.append(pool.apply_async(worker_func,args=(img.astype(np.uint8),idx)))
        end_time=time.time() # technically the transfer is done, at this line.
        for res in results:
            res.get() # call this before closing the pool to make sure all images are saved.
    print('Time taken=',(end_time-start_time))
You might want to experiment with threading instead of multiprocessing to avoid copying the data altogether, since writing to disk releases the GIL, but the result is not guaranteed to be faster.
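The threading variant mentioned above could look roughly like the following sketch. It is not taken from the question or the answer: the names writer_loop, run_loop and max_pending are made up for illustration, raw bytes stand in for images, and Path.write_bytes stands in for cv2.imwrite. The loop only pays for a queue.put, while a single background thread does the writing.

```python
import queue
import tempfile
import threading
from pathlib import Path

_STOP = object()  # sentinel telling the writer thread to exit


def writer_loop(jobs):
    """Consumer: pull (path, payload) pairs off the queue and write them to disk."""
    while True:
        job = jobs.get()
        if job is _STOP:
            break
        path, payload = job
        Path(path).write_bytes(payload)  # stand-in for cv2.imwrite (assumption)


def run_loop(frames, out_dir, max_pending=64):
    """Producer: hand each frame to the writer thread and carry on without waiting."""
    jobs = queue.Queue(maxsize=max_pending)  # bounded, so memory use stays limited
    writer = threading.Thread(target=writer_loop, args=(jobs,), daemon=True)
    writer.start()
    count = 0
    for idx, frame in enumerate(frames):
        # ... prediction and send_control would happen here ...
        jobs.put((Path(out_dir) / f"{idx}.bin", frame))
        count += 1
    jobs.put(_STOP)  # no more work
    writer.join()    # wait until everything queued has been written
    return count


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        n = run_loop([b"frame-%d" % i for i in range(5)], tmp)
        print(n, sorted(p.name for p in Path(tmp).iterdir()))
```

Note that put blocks once max_pending items are waiting, so if the disk is consistently slower than the loop, the delay comes back; whether this beats the process pool has to be measured.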

Related

Make good use of CPUs within one function or run it parallel?

Here is the problem. I have thousands of formulas to evaluate and access, such as 'rank(sqrt(v1))' or 'corr(v1, square(v2))'. All values (v1, v2, ...) are already in shared memory for multiprocessing. I have one function, and all computations within it use only numpy or scipy functions.
from scipy.stats import rankdata
import multiprocessing as mp
import numpy as np
def myeval(formula):
    ... # read from shared_memory
    ... # compute
    return value
def eval_access(formula):
    factor = myeval(formula) # consumes 1GB memory
    factor_performance1 = rankdata(factor, axis=1) # another Gigabyte
    factor_performance2 = np.full_like(factor, np.nan) # another one
    ... # some computation
    pickle_part # some I/O to record performance (just some float number)
    return
with mp.Pool(30) as p:
    p.map(eval_access, all_formulas)
Running this function in parallel makes good use of all CPUs (almost 100% usage every second), but each process will consume at most 3GB, so 30 processes may consume at most 60GB of memory simultaneously. It also seems that numpy operations slow down under multiprocessing. To avoid these problems, I want to use all CPUs within the eval_access function, as follows.
from scipy.stats import rankdata
import multiprocessing as mp
import numpy as np
def myeval(ind1, ind2):
    ... # alter value[ind1:ind2, :] or value[:, ind1:ind2]
def myeval_multiprocess(formula):
    ... # read from shared_memory
    value = np.zeros(shape) # prepare return value's space
    ... # share_memory value
    ... # split value's index into 30 parts
    with mp.Pool(30) as p:
        p.map(myeval, ind_list)
    return value
def eval_access(formula):
    factor = myeval_multiprocess(formula) # consumes 1GB memory
    factor_performance1 = rankdata(factor, axis=1) # another Gigabyte
    factor_performance2 = np.full_like(factor, np.nan) # another one
    ... # some computation
    pickle_part # some I/O to record performance (just some float number)
    return
for formula in all_formulas:
    eval_access(formula)
In the ideal case these two versions would take the same time and the latter would need less memory at any given moment. However, the problem is that sharing value in memory within myeval_multiprocess costs a lot of time, and not all functions in eval_access can make good use of all CPUs unless I wrap every function I use so that it splits the whole problem into subsets.
So, are there any suggestions on which version is better or how to speed things up, and can the latter one reach the ideal case? Also, is there a method that makes it easy to apply multiprocessing to every computation?

skimage.util.apply_parallel not behaving as expected

I'm trying to tile an image and apply calculations to each tile in parallel. But it's not behaving like I expect. I made it print 'here' whenever the function is executed, and many are printed quickly, which indicates they're being launched simultaneously. But my cpu load in my task manager never gets above 100%, and it takes a long time to execute. Can someone please advise? This is my first time using skimage.util.apply_parallel, which uses Dask.
from numpy import random, ones, zeros_like
from skimage.util import apply_parallel
def f3(im):
    print('here')
    for _ in range(10000):
        u=random.random(100000)
    return zeros_like(im)
if __name__=='__main__':
    im=ones((2,4))
    f = lambda img: f3(img)
    im2=apply_parallel(f,im,chunks=1)
I looked at the source code, and apply_parallel relies on this Dask command:
res = darr.map_overlap(wrapped_func, depth, boundary=mode, dtype=dtype)
But I found that it needs .compute(scheduler='processes') at the end to guarantee that multiple CPUs are used. So now I'm just using Dask itself:
import dask.array as da
im2 = da.from_array(im,chunks=2)
proc = im2.map_overlap(f, depth=0).compute(scheduler='processes')
Then the cpu usage really jumps!

Python Multiprocessing: Writing to file every k iterations

I am using the multiprocessing module in python 3.7 to call a function repeatedly in parallel. I would like to write the results out to a file every k iterations. (It can be a different file each time.)
Below is my first attempt, which basically loops over sets of function arguments, running each set in parallel and writing the results to a file before moving onto the next set. This is obviously very inefficient. In practice, the time it takes for my function to run is much longer and varies depending on the input values, so many processors sit idle between iterations of the loop.
Is there a more efficient way to achieve this?
import multiprocessing as mp
import numpy as np
import pandas as pd
def myfunction(x): # toy example function
    return(x**2)
for start in np.arange(0,500,100):
    with mp.Pool(mp.cpu_count()) as pool:
        out = pool.map(myfunction, np.arange(start, start+100))
    pd.DataFrame(out).to_csv('filename_'+str(start//100+1)+'.csv', header=False, index=False)
My first comment is that if myfunction is as trivial as the one you have shown, then your performance will be worse using multiprocessing, because there is overhead in creating the process pool (which, by the way, you are unnecessarily creating over and over in each loop iteration) and in passing arguments from one process to another.
Assuming that myfunction is pure CPU, there is an opportunity, after map has returned 100 values, to overlap the writing of the CSV files with further computation, and you are not taking advantage of it. (It's not clear how much concurrent disk writing will improve performance; it depends on the type of drive you have, head movement, etc.) In that case a combination of multithreading and multiprocessing could be the solution. The number of processes in your processing pool will be limited to the number of CPU cores, on the assumption that myfunction is 100% CPU and does not release the Global Interpreter Lock, and therefore cannot take advantage of a pool size greater than the number of CPUs you have. Anyway, that is my assumption. If you are going to use certain numpy functions, for example, then that assumption is erroneous. On the other hand, numpy is known to run some of its own processing in parallel internally, in which case combining numpy with your own multiprocessing could result in worse performance.
Your current code uses numpy only for generating ranges, which seems a bit of overkill as there are other means of generating ranges. I have taken the liberty of generating the ranges in a slightly different fashion: I define RANGE_START and RANGE_STOP values and N_SPLITS, the number of equal (or as nearly equal as possible) divisions of this range, and generate tuples of start and stop values that can be converted into ranges. I hope this is not too confusing, but it seemed a more flexible approach.
In the following code both a thread pool and a processing pool are created. The tasks are submitted to the thread pool with one of the arguments being the processing pool, which is used by the worker to do the CPU-intensive calculations; once the results have been assembled, the worker writes out the CSV file.
from multiprocessing.pool import Pool, ThreadPool
from multiprocessing import cpu_count
import pandas as pd
def worker(process_pool, index, split_range):
    out = process_pool.map(myfunction, range(*split_range))
    pd.DataFrame(out).to_csv(f'filename_{index}.csv', header=False, index=False)
def myfunction(x): # toy example function
    return(x ** 2)
def split(start, stop, n):
    k, m = divmod(stop - start, n)
    # offset by start so that the split also works when the range does not begin at 0
    return [(start + i * k + min(i, m), start + (i + 1) * k + min(i + 1, m)) for i in range(n)]
def main():
    RANGE_START = 0
    RANGE_STOP = 500
    N_SPLITS = 5
    n_processes = min(N_SPLITS, cpu_count())
    split_ranges = split(RANGE_START, RANGE_STOP, N_SPLITS) # [(0, 100), (100, 200), ... (400, 500)]
    process_pool = Pool(n_processes)
    thread_pool = ThreadPool(N_SPLITS)
    for index, split_range in enumerate(split_ranges):
        thread_pool.apply_async(worker, args=(process_pool, index, split_range))
    # wait for all threading tasks to complete:
    thread_pool.close()
    thread_pool.join()
# required for Windows:
if __name__ == '__main__':
    main()
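As a possible alternative to the two-pool design, here is a minimal sketch that keeps one process pool alive for the whole run and writes a file every k results as they stream back from imap. This is not the answer's method: the helpers batched and write_batch are hypothetical, and the standard csv module replaces pandas so the sketch depends only on the standard library.

```python
import csv
import tempfile
from itertools import islice
from multiprocessing import Pool
from pathlib import Path


def myfunction(x):  # toy example function from the question
    return x ** 2


def batched(iterable, k):
    """Yield successive lists of at most k items from iterable."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, k))
        if not batch:
            return
        yield batch


def write_batch(path, values):
    """Write one value per row, with no header and no index column."""
    with open(path, "w", newline="") as fh:
        csv.writer(fh, lineterminator="\n").writerows([v] for v in values)


def main(out_dir, start=0, stop=500, k=100):
    with Pool() as pool:
        # imap yields results in input order as they become available; the workers
        # keep computing later items while the parent process is writing a file.
        results = pool.imap(myfunction, range(start, stop), chunksize=10)
        for index, batch in enumerate(batched(results, k), start=1):
            write_batch(Path(out_dir) / f"filename_{index}.csv", batch)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        main(tmp)
        print(sorted(p.name for p in Path(tmp).iterdir()))
```

Because the pool is created once and tasks are not grouped into barriers of 100, a slow input only delays the file it belongs to rather than idling every worker.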

Why this multiprocessing code is slower than the serial one?

I tried the following Python programs, a sequential and a parallel version, on a cluster computing facility. I could clearly see (using the top command) more processes starting for the parallel program. But when I time it, the parallel version takes more time. What could be the reason? The code and the timing information are attached below.
#parallel.py
from multiprocessing import Pool
import numpy
def sqrt(x):
    return numpy.sqrt(x)
pool = Pool()
results = pool.map(sqrt, range(100000), chunksize=10)
#seq.py
import numpy
def sqrt(x):
    return numpy.sqrt(x)
results = [sqrt(x) for x in range(100000)]
user#domain$ time python parallel.py > parallel.txt
real 0m1.323s
user 0m2.238s
sys 0m0.243s
user#domain$ time python seq.py > seq.txt
real 0m0.348s
user 0m0.324s
sys 0m0.024s
The amount of work per task is far too small to compensate for the work-distribution overhead. First, you should increase the chunksize, but even then a single square root operation is too short to compensate for the cost of sending the data between processes. You can see an effective speedup from something like this:
def sqrt(x):
    for _ in range(100):
        x = numpy.sqrt(x)
    return x
results = pool.map(sqrt, range(10000), chunksize=100)
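Another way to raise the amount of work per task is to send each worker a whole block of inputs instead of single numbers. The sketch below assumes that approach; sqrt_block and make_blocks are illustrative names, and math.sqrt replaces numpy.sqrt to keep the example dependency-free. It only shows the structure, and whether it actually beats the serial loop still has to be timed, since the results are pickled back to the parent.

```python
import math
from multiprocessing import Pool


def sqrt_block(bounds):
    """Compute square roots for every integer in [start, stop) inside one task."""
    start, stop = bounds
    return [math.sqrt(x) for x in range(start, stop)]


def make_blocks(n, n_blocks):
    """Split range(n) into n_blocks contiguous (start, stop) pairs of near-equal size."""
    step, extra = divmod(n, n_blocks)
    blocks, start = [], 0
    for i in range(n_blocks):
        stop = start + step + (1 if i < extra else 0)
        blocks.append((start, stop))
        start = stop
    return blocks


if __name__ == "__main__":
    with Pool(4) as pool:
        # only 4 tasks cross the process boundary instead of 100000
        parts = pool.map(sqrt_block, make_blocks(100000, 4))
    results = [y for part in parts for y in part]
    print(len(results), results[:3])
```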

the missing example | pre-fetch and pre-process data using threads

It seems there are many open questions about the usage of TensorFlow out there, and some TensorFlow developers are active here on Stack Overflow. Here is another question. I want to generate training data on the fly in other thread(s) using numpy or something else that does not belong to TensorFlow. But I do not want to go through recompiling the entire TensorFlow source again and again; I am simply looking for another way. "tf.py_func" seems to be a workaround. But the
This is related to [how-to-prefetch-data-using-a-custom-python-function-in-tensorflow][1]
Here is my MnWE (minimal-not-working-example):
Update (now there is an output but a race-condition, too):
import numpy as np
import tensorflow as tf
import threading
import os
import glob
import random
import matplotlib.pyplot as plt
IMAGE_ROOT = "/graphics/projects/data/mscoco2014/data/images/"
files = ["train/COCO_train2014_000000178763.jpg",
         "train/COCO_train2014_000000543841.jpg",
         "train/COCO_train2014_000000364433.jpg",
         "train/COCO_train2014_000000091123.jpg",
         "train/COCO_train2014_000000498916.jpg",
         "train/COCO_train2014_000000429865.jpg",
         "train/COCO_train2014_000000400199.jpg",
         "train/COCO_train2014_000000230367.jpg",
         "train/COCO_train2014_000000281214.jpg",
         "train/COCO_train2014_000000041920.jpg"]
# --------------------------------------------------------------------------------
def pre_process(data):
    """Pre-process image with arbitrary functions
    does not only use tf.functions, but arbitrary
    """
    # here is the place to do some fancy stuff
    # which might be out of the scope of tf
    return data[0:81,0,0].flatten()
def populate_queue(sess, thread_pool, qData_enqueue_op ):
    """Put stuff into the data queue
    is responsible such that there is always data to process
    for tensorflow
    """
    # until somebody tell me I can stop ...
    while not thread_pool.should_stop():
        # get a random image from MS COCO (randint includes both end points)
        idx = random.randint(0,len(files)-1)
        data = np.array(plt.imread(os.path.join(IMAGE_ROOT,files[idx])))
        data = pre_process(data)
        # put into the queue
        sess.run(qData_enqueue_op, feed_dict={data_input: data})
# a simple queue for gather data (just to keep it currently simple)
qData = tf.FIFOQueue(100, [tf.float32], shapes=[[9,9]])
data_input = tf.placeholder(tf.float32)
qData_enqueue_op = qData.enqueue([tf.reshape(data_input,[9,9])])
qData_dequeue_op = qData.dequeue()
init_op = tf.initialize_all_variables()
with tf.Session() as sess:
    # init all variables
    sess.run(init_op)
    # coordinate of pool of threads
    thread_pool = tf.train.Coordinator()
    # start fill in data
    t = threading.Thread(target=populate_queue, args=(sess, thread_pool, qData_enqueue_op))
    t.start()
    # Can I use "tf.train.start_queue_runners" here
    # How to use multiple threads?
    try:
        while not thread_pool.should_stop():
            print("iter")
            # HERE THE SILENCE BEGIN !!!!!!!!!!!
            batch = sess.run([qData_dequeue_op])
            print(batch)
    except tf.errors.OutOfRangeError:
        print('Done training -- no more data')
    finally:
        # When done, ask the threads to stop.
        thread_pool.request_stop()
    # now they should definitely stop
    thread_pool.request_stop()
    thread_pool.join([t])
I basically have three questions:
What's wrong with this code? It runs into an endless loop (which is not debuggable). See the line "HERE THE SILENCE BEGIN ...".
How to extend this code to use more threads?
Is it worth converting large datasets, or data that can be generated on the fly, to tf.Record?
You have a mistake on this line:
t = threading.Thread(target=populate_queue, args=(sess, thread_pool, qData))
It should be qData_enqueue_op instead of qData. Otherwise your enqueue operations fail, and you get stuck trying to dequeue from queue of size 0. I saw this when trying to run your code and getting
TypeError: Fetch argument <google3.third_party.tensorflow.python.ops.data_flow_ops.FIFOQueue object at 0x4bc1f10> of <google3.third_party.tensorflow.python.ops.data_flow_ops.FIFOQueue object at 0x4bc1f10> has invalid type <class 'google3.third_party.tensorflow.python.ops.data_flow_ops.FIFOQueue'>, must be a string or Tensor. (Can not convert a FIFOQueue into a Tensor or Operation.)
Regarding other questions:
You don't need to start queue runners in this example because you don't have any. Queue runners are created by input producers like string_input_producer which is essentially FIFO queue + logic to launch threads. You are replicating 50% of queue runner functionality by launching your own threads that do enqueue ops. (the other 50% is closing the queue)
RE: converting to tf.record -- Python has this thing called Global Interpreter Lock which means that two bits of Python code can't execute concurrently. In practice that's mitigated by the fact that a lot of the time is spent in numpy C++ code or IO ops (which release GIL). So I think it's a matter of checking if you are able to achieve required parallelism using Python pre-processing pipelines.
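On the question of using more threads, the general pattern can be sketched with the standard library alone. This is not TensorFlow code: queue.Queue plays the role of the FIFOQueue, threading.Event plays the role of Coordinator.request_stop(), and pre_process, producer and prefetch are hypothetical stand-ins. Several producer threads fill one bounded buffer that the consumer drains.

```python
import queue
import threading


def pre_process(item):
    """Stand-in for arbitrary Python/numpy pre-processing (hypothetical)."""
    return item * 2


def producer(source, out_q, stop_event):
    """Take raw items from a shared source queue, process them, enqueue the results."""
    while not stop_event.is_set():
        try:
            item = source.get_nowait()
        except queue.Empty:
            return  # nothing left to load
        result = pre_process(item)
        while not stop_event.is_set():
            try:
                out_q.put(result, timeout=0.1)  # waits while the buffer is full
                break
            except queue.Full:
                continue  # re-check the stop flag, then try again


def prefetch(items, n_threads=4, capacity=100):
    """Yield processed items (given as a list) produced by n_threads threads.

    With more than one thread the output order is not guaranteed.
    """
    source = queue.Queue()
    for item in items:
        source.put(item)
    out_q = queue.Queue(maxsize=capacity)
    stop_event = threading.Event()
    threads = [
        threading.Thread(target=producer, args=(source, out_q, stop_event), daemon=True)
        for _ in range(n_threads)
    ]
    for t in threads:
        t.start()
    try:
        for _ in range(len(items)):
            yield out_q.get()
    finally:
        stop_event.set()  # ask the producers to stop
        for t in threads:
            t.join()


if __name__ == "__main__":
    print(sorted(prefetch(list(range(10)), n_threads=3, capacity=2)))
```

As the answer notes, the threads only run in parallel while they are inside code that releases the GIL (I/O or native numpy routines), so adding threads helps only when the pre-processing spends its time there.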
