I'm trying to speed up my processing of a PIL.Image: I divide the image into small parts, search a database for the most similar image part, and then replace the original small part of the image with the found image.
This is the function in question:
def work_image(img, lenx, leny, neigh, split_dict, img_train_rot):
    constructed_img = Image.new(mode='L', size=img.size)
    for x in range(0, img.size[0], lenx):
        for y in range(0, img.size[1], leny):
            box = (x, y, x+lenx, y+leny)
            split_img = img.crop(box)
            res = neigh.kneighbors(np.asarray(split_img).ravel().reshape((1, -1)))
            # look up the found image part in img_train_rot and define the position as new_box
            constructed_img.paste(img_train_rot[i].crop(new_box), (x, y))
    return constructed_img
Now I wanted to parallelize this function, since, for example, each row of such image parts could be processed entirely on its own.
I came up with this approach using multiprocessing.Pool:
def work_image_parallel(leny, neigh, split_dict, img_train_rot, img_slice):
    constructed_img_slice = Image.new(mode='L', size=img_slice.size)
    for y in range(0, img_slice.size[1], leny):
        box = (0, y, img_slice.size[0], y+leny)
        img_part = img_slice.crop(box)
        res = neigh.kneighbors(np.asarray(img_part).ravel().reshape((1, -1)))
        # look up the found image part in img_train_rot and define the position as new_box
        constructed_img_slice.paste(img_train_rot[i].crop(new_box), (0, y))
    return constructed_img_slice
if __name__ == '__main__':
    lenx, leny = 16, 16
    # define my image database and so on
    neigh = setup_nearest_neighbour(train_imgs, n_neighbors=1)
    test_img = test_imgs[0]
    func = partial(work_image_parallel, leny, neigh, split_dict, img_train_rot)
    pool = multiprocessing.Pool()
    try:
        res = pool.map(func, [test_img.crop((x, 0, x+lenx, test_img.size[1]))
                              for x in range(0, test_img.size[0], lenx)])
    finally:
        pool.close()
        pool.join()
    test_result2 = Image.new(mode='L', size=test_img.size)
    for i in range(len(res)):
        test_result2.paste(res[i], box=(i*lenx, 0, i*lenx + lenx, test_result2.size[1]))
However, this parallelized version isn't any faster than the normal version, and if I decrease the size of the image parts, it throws an AssertionError (other posts suggest this happens when the data to be sent between the processes becomes too big).
Hence my questions: did I do something wrong? Is multiprocessing not the right approach here? And why doesn't multiprocessing decrease the computation time, given that the workload per image slice should be big enough to offset the time needed to create processes?
Any help would be appreciated.
Disclaimer: I am not that familiar with PIL, so you should take a close look at the PIL method calls, which may need some "adjustment" on your part, since there is no way that I can actually test this.
First, I observe that you will probably be making a lot of repeated invocations of your worker function work_image_parallel, and that some of the arguments being passed to that function might be quite large (all of this depends, of course, on how large your images are). Rather than repeatedly passing such potentially large arguments, I would prefer to copy them once to each process in your pool and instantiate them as global variables. This is accomplished with a pool initializer function.
Second, I have attempted to keep the modified work_image_parallel function as close as possible to your original work_image function, except that it now deals with just a single x, y coordinate pair that is passed to it. That way, more of the work is being done by your subprocesses. I have also tried to reduce the number of pasting operations required (if I have correctly understood what is going on).
Third, because the images may be quite large, I am using a generator expression to create the arguments for imap_unordered instead of map. This is because the number of x, y pairs can be very large for a very large image, and map requires an iterable whose length can be determined up front so that an efficient chunksize value can be computed (see the docs). With imap_unordered, we should specify an explicit chunksize for efficiency (the default is 1 if unspecified) if we expect the iterable to be large. If you know that you are dealing with relatively small images, so that storing x_y_args as a list would not be unreasonably memory-inefficient, you could just use map with the default chunksize value of None and let the pool compute the value for you. The advantage of imap_unordered is that results do not have to be returned in order, so processing could be faster.
def init_pool(the_img, the_img_train_rot, the_neigh, the_split_dict):
    global img, img_train_rot, neigh, split_dict
    img = the_img
    img_train_rot = the_img_train_rot
    neigh = the_neigh
    split_dict = the_split_dict

def work_image_parallel(lenx, leny, t):
    x, y = t
    box = (x, y, x+lenx, y+leny)
    split_img = img.crop(box)
    res = neigh.kneighbors(np.asarray(split_img).ravel().reshape((1, -1)))
    # look up the found image part in img_train_rot and define the position as new_box
    # return the original x, y values used:
    return x, y, img_train_rot[i].crop(new_box)

def compute_chunksize(iterable_size, pool_size):
    chunksize, remainder = divmod(iterable_size, 4 * pool_size)
    if remainder:
        chunksize += 1
    return chunksize
if __name__ == '__main__':
    lenx, leny = 16, 16
    # define my image database and so on
    neigh = setup_nearest_neighbour(train_imgs, n_neighbors=1)
    test_img = test_imgs[0]
    func = partial(work_image_parallel, lenx, leny)
    # in case this is a very large image, use a generator expression
    x_y_args = ((x, y) for x in range(0, test_img.size[0], lenx)
                       for y in range(0, test_img.size[1], leny))
    # approximate size of x_y_args:
    iterable_size = (test_img.size[0] // lenx) * (test_img.size[1] // leny)
    pool_size = multiprocessing.cpu_count()
    chunksize = compute_chunksize(iterable_size, pool_size)
    pool = multiprocessing.Pool(pool_size, initializer=init_pool,
                                initargs=(test_img, img_train_rot, neigh, split_dict))
    test_result2 = Image.new(mode='L', size=test_img.size)
    try:
        # use imap or imap_unordered when the iterable is a generator to avoid
        # converting the iterable to a list, but specify a suitable chunksize
        # for efficiency in case the iterable is very large:
        for x, y, res in pool.imap_unordered(func, x_y_args, chunksize=chunksize):
            test_result2.paste(res, (x, y))
    finally:
        pool.close()
        pool.join()
Update (break up image into bigger slices)
def init_pool(the_img, the_img_train_rot, the_neigh, the_split_dict):
    global img, img_train_rot, neigh, split_dict
    img = the_img
    img_train_rot = the_img_train_rot
    neigh = the_neigh
    split_dict = the_split_dict

def work_image_parallel(lenx, leny, x):
    img_slice = img.crop((x, 0, x+lenx, img.size[1]))
    constructed_img_slice = Image.new(mode='L', size=img_slice.size)
    for y in range(0, img_slice.size[1], leny):
        box = (0, y, img_slice.size[0], y+leny)
        img_part = img_slice.crop(box)
        res = neigh.kneighbors(np.asarray(img_part).ravel().reshape((1, -1)))
        # look up the found image part in img_train_rot and define the position as new_box
        constructed_img_slice.paste(img_train_rot[i].crop(new_box), (0, y))
    return constructed_img_slice

if __name__ == '__main__':
    lenx, leny = 16, 16
    # define my image database and so on
    neigh = setup_nearest_neighbour(train_imgs, n_neighbors=1)
    test_img = test_imgs[0]
    pool_size = multiprocessing.cpu_count()
    pool = multiprocessing.Pool(pool_size, initializer=init_pool,
                                initargs=(test_img, img_train_rot, neigh, split_dict))
    func = partial(work_image_parallel, lenx, leny)
    try:
        test_result2 = Image.new(mode='L', size=test_img.size)
        x = 0
        for res in pool.map(func, range(0, test_img.size[0], lenx)):
            test_result2.paste(res, box=(x, 0, x + lenx, test_result2.size[1]))
            x += lenx
    finally:
        pool.close()
        pool.join()
Related
I have tried multiprocessing.dummy.Pool and multiprocessing.Pool in multiple deep learning projects. I have a hard time understanding multiprocessing.Queue, though; I don't see what it is needed for. Is there a particular situation where it is useful?
As an example I have following target function:
def process_detection(det_, dims, classes):
    W = dims[0]
    H = dims[1]
    boxes = []
    confidences = []
    classIDs = []
    classes_pred = []
    for detection in det_:
        xcenter, ycenter, width, height = np.asarray([W, H, W, H]) * detection[0:4]
        confidence_encoded = detection[5:]  # (80,) array
        index_class = np.argmax(confidence_encoded)  # index of max confidence
        confidence = confidence_encoded[index_class]  # float value of confidence (probability)
        class_predicted = classes[index_class]  # class predicted
        if confidence > 0.5:
            if class_predicted == "person":
                print("{} , {:.2f}".format(class_predicted, confidence))
                # continue
            topX = int(xcenter - width/2.)
            topY = int(ycenter - height/2.)
            width = int(width)
            height = int(height)
            confidence = float(confidence)
            bbox = [topX, topY, width, height]
            boxes.append(bbox)
            confidences.append(confidence)
            classIDs.append(index_class)
            classes_pred.append(class_predicted)
    return [boxes, confidences, classIDs, classes_pred]
I am using multiprocessing.Pool.starmap to process a list of bounding boxes predicted by YOLOv3. The relevant function is below:
def main():
    pool = Pool(processes=os.cpu_count())  # make a process pool for multiprocessing
    path = Path("..")
    classes = open(str(path.joinpath("coco.names")), "r").read().strip().split("\n")
    colors_array = np.random.randint(0, 255, (len(classes), 3), dtype="uint8")
    colors = {cls_: clr for cls_, clr in zip(classes, colors_array)}
    # reading the video
    cap = cv2.VideoCapture(str(path.joinpath("video_.mp4")))
    _, frame = cap.read()
    if frame is None:
        print("FRAME IS NOT READ")
    else:
        # frame = resize(frame, width=500)
        H, W = frame.shape[0:2]
    # <model>
    configPath = path.joinpath("yolov3.cfg")
    weightsPath = path.joinpath("yolov3.weights")
    net = cv2.dnn.readNetFromDarknet(str(configPath), str(weightsPath))
    ln = net.getLayerNames()
    ln = [ln[i[0] - 1] for i in net.getUnconnectedOutLayers()]
    writer = None
    boxes = []
    confidences = []
    classIDs = []
    classes_pred = []
    fps_ = FPS().start()
    i = 0
    while True:
        # pool = Pool(processes=os.cpu_count())  # make a process pool for multiprocessing
        try:
            if writer is None:
                writer = cv2.VideoWriter("./detections.avi", cv2.VideoWriter_fourcc(*"MJPG"),
                                         int(cap.get(cv2.CAP_PROP_FPS)), (W, H))
                # after this, writer will not be None
            _, frame = cap.read()  # reading the frame
            # frame = resize(frame, width=W)  # resizing the frame
            blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (416, 416),
                                         swapRB=True, crop=False)  # yolov3 version
            net.setInput(blob)
            start = time()
            detections = net.forward(ln)
            end = time()
            print(f"{(end-start):.2f} seconds taken for detection")
            # MULTIPROCESSING
            results = pool.starmap_async(process_detection,
                                         zip(detections, repeat((W, H)), repeat(classes)))
            boxes, confidences, classIDs, classes_pred = results.get()[1]
            cleaned_indices = cv2.dnn.NMSBoxes(boxes, confidences, 0.5, 0.3)  # 0.3 --> nms threshold
            print(f"TOTAL INDICES CLEANED ARE {len(cleaned_indices):00d}")
            if len(cleaned_indices) > 0:
                for cleaned_idx in cleaned_indices.flatten():
                    BOX = boxes[cleaned_idx]
                    color = [int(i) for i in colors[classes_pred[cleaned_idx]]]
                    cv2.rectangle(frame, (BOX[0], BOX[1]), (BOX[0]+BOX[2], BOX[1]+BOX[3]),
                                  color, 1, cv2.LINE_AA)
                    text = f"{classes_pred[cleaned_idx]} : {confidences[cleaned_idx]:.2f}"
                    cv2.putText(frame, text, (BOX[0], BOX[1] - 5), cv2.FONT_HERSHEY_SIMPLEX,
                                0.5, color, 2)
            writer.write(frame)
(The pool is closed outside the while loop.)
When is the need to use multiprocessing.Queue?
Can I make this code more efficient using multiprocessing.Queue?
In general it is not necessary (nor useful) to use a Pool and a Queue together.
The way a Pool is most useful is to run the same code with different data in parallel on multiple cores to increase throughput. That is, using the map method and its variants. This is useful for situations where the calculation done on each data-item is independent of all the others.
Mechanisms like Queue and Pipe are for communicating between different processes.
If you need a Queue or a Pipe in a pool worker, then the calculations done by that pool worker are by definition not independent. At best, that reduces the performance of the Pool because the pool workers might have to wait for data to become available. At worst, it might stall the Pool completely if all the workers are busy waiting for data to appear from a Queue.
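To make this concrete, here is a minimal sketch (with illustrative names, not taken from your code) of the producer/consumer situation where a Queue is the right tool, using plain Process objects rather than a Pool:

import multiprocessing

def producer(queue):
    # hand work items to the consumer as they become available
    for item in range(5):
        queue.put(item)
    queue.put(None)  # sentinel: tells the consumer to stop

def consumer(queue):
    while True:
        item = queue.get()
        if item is None:
            break
        print("processing", item)

if __name__ == '__main__':
    queue = multiprocessing.Queue()
    p = multiprocessing.Process(target=producer, args=(queue,))
    c = multiprocessing.Process(target=consumer, args=(queue,))
    p.start()
    c.start()
    p.join()
    c.join()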
How to use a Pool
If you expect that all the calculations will take approximately the same time, just use the map method. This will return when all calculations are finished. And the returned values are guaranteed to be in the same order as the submitted data.
(Hint: there is little point in using the _async methods when the next thing you do is to call the get method on the result object.)
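For example, these two calls behave identically (a toy worker, purely for illustration):

from multiprocessing import Pool

def square(x):
    return x * x

if __name__ == '__main__':
    with Pool() as pool:
        a = pool.map(square, range(8))              # blocks until all results are in
        b = pool.map_async(square, range(8)).get()  # same behaviour, more typing
        assert a == b == [0, 1, 4, 9, 16, 25, 36, 49]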
If some calculations take (much) longer than others, I would suggest using imap_unordered. This will return an iterator that will start yielding results as soon as they are ready. The results will be in the order that they finished, not in the order they were submitted, so you should add some identifier to the result to enable you to tell to which input data the result belongs.
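A small sketch of that pattern with a stand-in worker: each input is tagged with its index so the results can be put back into submission order afterwards:

from multiprocessing import Pool

def worker(arg):
    index, data = arg
    return index, data * 2  # stand-in for the real calculation

if __name__ == '__main__':
    inputs = [10, 20, 30, 40]
    results = [None] * len(inputs)
    with Pool() as pool:
        for index, value in pool.imap_unordered(worker, enumerate(inputs)):
            results[index] = value  # restore submission order
    print(results)  # [20, 40, 60, 80]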
I am using the code below for an image processing related study. The code works fine functionally, but it is too slow: one step takes up to 10 seconds.
I need a faster processing speed to reach the aim.
import numpy
import glob, os
import cv2

input = cv2.imread(path)

def nothing(x):  # for trackbar
    pass

windowName = "Image"
cv2.namedWindow(windowName)
cv2.createTrackbar("coef", windowName, 0, 25000, nothing)
condition = True

while condition:
    coef = cv2.getTrackbarPos("coef", windowName)
    temp_img = input
    row = temp_img.shape[0]
    col = temp_img.shape[1]
    print(coef)
    red = []
    green = []
    for i in range(row):
        for y in range(col):
            # temp_img[i][y][0] = 0
            temp_img[i][y][1] = temp_img[i][y][1] * (coef / 100)
            temp_img[i][y][1] = temp_img[i][y][2] * (1 - (coef / 100))
            # relative_diff = value_g - value_r
    # temp = cv2.resize(temp, (1000, 800))
    cv2.imshow(windowName, temp_img)
    # cv2.imwrite("output2.jpg", temp)
    print("fin")
    # cv2.waitKey(0)
    if cv2.waitKey(30) >= 0:
        condition = False

cv2.destroyAllWindows()
Does anybody have an idea for getting a faster result?
It's not entirely clear to me what object temp_img is exactly, but if it behaves like a numpy array, you could replace your loop by
temp_img[:,:,1] = temp_img[:,:,1] * (coef / 100)
temp_img[:,:,2] = temp_img[:,:,2] * (1 - coef / 100)
which should result in a significant speed up if your array is large. The implementation of such operations on arrays are optimised very well, whereas python loops are generally quite slow.
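As a rough, self-contained check of that claim (the image size is made up, and absolute timings will vary by machine):

import time
import numpy as np

img = np.random.randint(0, 256, (540, 960, 3), dtype=np.uint8)
coef = 50

start = time.time()
loop_img = img.copy()
for i in range(loop_img.shape[0]):
    for y in range(loop_img.shape[1]):
        loop_img[i][y][1] = loop_img[i][y][1] * (coef / 100)
        loop_img[i][y][2] = loop_img[i][y][2] * (1 - coef / 100)
print("python loop: {:.2f} s".format(time.time() - start))

start = time.time()
vec_img = img.copy()
vec_img[:,:,1] = vec_img[:,:,1] * (coef / 100)
vec_img[:,:,2] = vec_img[:,:,2] * (1 - coef / 100)
print("vectorised:  {:.4f} s".format(time.time() - start))

assert np.array_equal(loop_img, vec_img)  # same result, orders of magnitude faster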
Edit based on comments:
Since you're working with large images, and some expensive operations need the unscaled version but only have to be executed once, your code could be structured along the following lines:
import cv2
import numpy as np
# ... all your other imports

def nothing(x):  # for trackbar
    pass

def expensive_operations(image, *args, **kwargs):
    # do all your expensive operations, like object detection
    ...

def scale_image(image, scale):
    # create a scaled version of image
    ...

def cheap_operations(scaled_image, windowName):
    # perform the cheap operations, e.g.
    coef = cv2.getTrackbarPos("coef", windowName)
    temp_img = np.copy(scaled_image)
    temp_img[:,:,1] = temp_img[:,:,1] * (coef / 100)
    temp_img[:,:,2] = temp_img[:,:,2] * (1 - (coef / 100))
    cv2.imshow(windowName, temp_img)

input = cv2.imread(path)
windowName = "Image"
cv2.namedWindow(windowName)
cv2.createTrackbar("coef", windowName, 0, 25000, nothing)
condition = True

expensive_results = expensive_operations(input)  # possibly with some more args and keyword args
scaled_image = scale_image(input, 0.5)  # pick whatever scale factor you need
while condition:
    cheap_operations(scaled_image, windowName)
    if cv2.waitKey(30) >= 0:
        condition = False
cv2.destroyAllWindows()
I do this kind of thing in nip2. It's an image processing spreadsheet that can manipulate huge images quickly. It has no problems doing this kind of operation on any size image at 60fps.
I made you an example workspace: http://www.rollthepotato.net/~john/coeff.ws
Here's what it looks like working on a 1GB starfield image:
You can drag the slider to change coeff. The processed image updates instantly as you drag. You can zoom and pan around the processed image to check details and adjust coeff.
The underlying image processing library is libvips, which has a Python binding, pyvips. In pyvips, your program would be:
import pyvips

def adjust(image, coeff):
    return image * [1, coeff / 100, 1 - coeff / 100]
Though that's without the GUI elements, of course.
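Wired up to actual file loading and saving, it might look like this sketch (the file names are placeholders):

import pyvips

def adjust(image, coeff):
    # multiply the three bands by per-band constants
    return image * [1, coeff / 100, 1 - coeff / 100]

image = pyvips.Image.new_from_file("starfield.jpg", access="sequential")
adjust(image, 50).write_to_file("starfield_adjusted.jpg")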
I'm trying to run a loop that iterates through an image folder and returns two numpy arrays: x, which stores the images as numpy arrays, and y, which stores the labels.
A folder can easily have over 40,000 RGB images with dimensions (224, 224).
I have around 12 GB of memory, but after some iterations the used memory just spikes up and everything stops.
What can I do to fix this issue?
def create_set(path, quality):
    x_file = glob.glob(path + '*')
    x = []
    for i, img in enumerate(x_file):
        image = cv2.imread(img, cv2.IMREAD_COLOR)
        x.append(np.asarray(image))
        if i % 50 == 0:
            print('{} - {} images processed'.format(path, i))
    x = np.asarray(x)
    x = x / 255
    y = np.zeros((x.shape[0], 2))
    if quality == 0:
        y[:, 0] = 1
    else:
        y[:, 1] = 1
    return x, y
You just can't load that many images into memory. You're trying to load every file in a given path to memory, by appending them to x.
Try processing them in batches, or if you're doing this for a tensorflow application try writing them to .tfrecords first.
If you want to save some memory, leave the images as np.uint8 rather than casting them to float (which happens automatically when you normalise them with x = x/255).
You also don't need np.asarray in your x.append(np.asarray(image)) line. image is already an array. np.asarray is for converting lists, tuples, etc to arrays.
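To illustrate the uint8 point, a minimal sketch (the sizes are just for illustration):

import numpy as np

def normalise_batch(batch_uint8):
    # cast to float32 only for the batch that is about to be fed to the model;
    # everything else stays uint8 (1 byte per value instead of 8 for float64)
    return batch_uint8.astype(np.float32) / 255.0

# 1,000 images here; at 40,000 images this is ~6 GB as uint8 vs ~48 GB as float64
x = np.zeros((1000, 224, 224, 3), dtype=np.uint8)
batch = normalise_batch(x[:32])
print(batch.dtype, batch.shape)  # float32 (32, 224, 224, 3)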
Edit:
A very rough batching example:
def batching_function(imlist, batchsize):
    ims = []
    batch = imlist[:batchsize]
    for image in batch:
        ims.append(image)
        other_processing()
    new_imlist = imlist[batchsize:]
    return ims, new_imlist

def main():
    imlist = all_the_globbing_here()
    num_batches = len(imlist) // batchsize
    for i in range(num_batches):
        ims, imlist = batching_function(imlist, batchsize)
        process_images(ims)
I need some help: I have been trying for two days and I don't know how to do this. I have a function compute_desc that takes multiple arguments (5, to be exact), and I would like to run it in parallel.
I have this for now:
def compute_desc(coord, radius, coords, feat, verbose):
    # Compute here my descriptors
    return my_desc  # numpy array (1x10 dimensions)

def main():
    points = np.random.rand(1000000, 4)
    coords = points[:, 0:3]
    feat = points[:, 3]
    all_features = np.empty((1000000, 10))
    all_features[:] = np.NAN
    scales = [0.5, 1, 2]
    for radius in scales:
        for index, coord in enumerate(coords):
            all_features[index, :] = compute_desc(coord,
                                                  radius,
                                                  coords,
                                                  feat,
                                                  False)
I would like to parallelize this. I have seen several solutions using a Pool, but I don't understand how it works.
I tried pool.map(), but I could only send one argument to the function.
Here is my solution (it doesn't work):
all_features = [pool.map(compute_desc, zip(point, repeat([radius,
                                                          coords,
                                                          feat,
                                                          False])))]
but I doubt it can work with a numpy array.
EDIT
This is my minimum code with a pool (it works now):
import numpy as np
from multiprocessing import Pool
from itertools import repeat

def compute_desc(coord, radius, coords, feat, verbose):
    # Compute here my descriptors
    my_desc = np.random.rand(1, 10)
    return my_desc

def compute_desc_pool(args):
    coord, radius, coords, feat, verbose = args
    return compute_desc(coord, radius, coords, feat, verbose)

def main():
    points = np.random.rand(1000000, 4)
    coords = points[:, 0:3]
    feat = points[:, 3]
    scales = [0.5, 1, 2]
    for radius in scales:
        with Pool() as pool:
            args = zip(points, repeat(radius),
                       repeat(coords),
                       repeat(feat),
                       repeat(False))
            feat_one_scale = pool.map(compute_desc_pool, args)
        feat_one_scale = np.array(feat_one_scale)
        if radius == scales[0]:
            all_features = feat_one_scale
        else:
            all_features = np.hstack([all_features, feat_one_scale])
    # Other stuff
The generic solution is to pass to Pool.map a sequence of tuples, each tuple holding one set of arguments for your worker function, and then to unpack the tuple in the worker function.
So, just change your function to accept only one argument, a tuple of your arguments, which you already prepared with zip and passed to Pool.map. Then simply unpack args to variables:
def compute_desc(args):
    coord, radius, coords, feat, verbose = args
    # Compute here my descriptors
Also, Pool.map should work with numpy types too, since after all, they are valid Python types.
Just be sure to properly zip 5 sequences, so your function receives a 5-tuple. You don't need to iterate over point in coords, zip will do that for you:
args = zip(coords, repeat(radius), repeat(coords), repeat(feat), repeat(False))
# args yields tuples: (coords[0], radius, coords, feat, False), (coords[1], ...), ...
(if you do, and give point as a first sequence to zip, the zip will iterate over that point, which is in this case a 3-element array).
Your Pool.map line should look like:
for radius in scales:
    args = zip(coords, repeat(radius), repeat(coords), repeat(feat), repeat(False))
    feat_one_scale = pool.map(compute_desc_pool, args)
    # other stuff
A solution specific to your case, where all arguments except one are fixed could be to use functools.partial (as the other answer suggests). Furthermore, you don't even need to unpack coords in the first argument, just pass the index [0..n] in coords, since each invocation of your worker function already receives the complete coords array.
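A sketch of that index-based variant, where compute_desc_by_index is a hypothetical wrapper around your compute_desc:

from functools import partial
from multiprocessing import Pool

import numpy as np

def compute_desc_by_index(radius, coords, feat, verbose, index):
    # the bound arguments come first; only the integer index varies per call
    return compute_desc(coords[index], radius, coords, feat, verbose)

if __name__ == '__main__':
    points = np.random.rand(1000000, 4)
    coords = points[:, 0:3]
    feat = points[:, 3]
    with Pool() as pool:
        func = partial(compute_desc_by_index, 0.5, coords, feat, False)
        feat_one_scale = pool.map(func, range(len(coords)))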
I assume from your example that four of those five arguments would be constant to all calls to compute_desc_pool. If so, then you can use partial to do this.
from functools import partial
....
def compute_desc_pool(radius, coords, feat, verbose, coord):
    # the bound arguments come first so that partial works; coord is supplied by map
    return compute_desc(coord, radius, coords, feat, verbose)

def main():
    points = np.random.rand(1000000, 4)
    coords = points[:, 0:3]
    feat = points[:, 3]
    scales = [0.5, 1, 2]
    pool = Pool()
    for radius in scales:
        feat_one_scale = pool.map(partial(compute_desc_pool, radius, coords,
                                          feat, False), coords)
I've devised a recursive function to handle a specific problem within the deep learning community. It seems to work quickly and well for most cases, but then takes ~20 minutes for other cases for seemingly no reason. The function, in the simplest case, can be abstracted as simply numpy's "repeat" function on two axes. Here's the code I used to test this function:
def recursive_upsample(fMap, index, dims):
    if index == 0:
        return fMap
    else:
        start = time.time()
        upscale = np.zeros((dims[index-1][0], dims[index-1][1], fMap.shape[-1]))
        if dims[index-1][0] % 2 == 1 and dims[index-1][1] % 2 == 1:
            crop = fMap[:fMap.shape[0]-1, :fMap.shape[1]-1]
            consX = fMap[-1,:][:-1]
            consY = fMap[:,-1][:-1]
            corner = fMap[-1,-1]
            crop = crop.repeat(2, axis=0).repeat(2, axis=1)
            upscale[:crop.shape[0], :crop.shape[1]] = crop
            upscale[-1,:][:-1] = consX.repeat(2, axis=0)
            upscale[:,-1][:-1] = consY.repeat(2, axis=0)
            upscale[-1,-1] = corner
        elif dims[index-1][0] % 2 == 1:
            crop = fMap[:fMap.shape[0]-1]
            consX = fMap[-1:,]
            crop = crop.repeat(2, axis=0).repeat(2, axis=1)
            upscale[:crop.shape[0]] = crop
            upscale[-1:,] = consX.repeat(2, axis=1)
        elif dims[index-1][1] % 2 == 1:
            crop = fMap[:, :fMap.shape[1]-1]
            consY = fMap[:,-1]
            crop = crop.repeat(2, axis=0).repeat(2, axis=1)
            upscale[:, :crop.shape[1]] = crop
            upscale[:,-1] = consY.repeat(2, axis=0)
        else:
            upscale = fMap.repeat(2, axis=0).repeat(2, axis=1)
        print('Upscaling from {} to {} took {} seconds'.format(fMap.shape, upscale.shape, time.time() - start))
        fMap = upscale
        return recursive_upsample(fMap, index-1, dims)

if __name__ == '__main__':
    dims = [(634,1020,64), (317,510,128), (159,255,256), (80,128,512), (40,64,512)]
    images = []
    for dim in dims:
        image = np.random.rand(dim[0], dim[1], dim[2])
        images.append(image)
    start = time.time()
    upsampled = []
    for index, image in enumerate(images):
        upsampled.append(recursive_upsample(image, index, dims))
    print('Upsampling took {} seconds'.format(time.time() - start))
For some odd reason, the point in the recursion where the feature map of shape (40,64,512) is upsampled from shape (317,510,512) to (634,1020,512) takes an egregious 941 seconds! I'm starting to rewrite this code with Theano, but is there some underlying problem with my code that I should be looking for? My reasoning right now is that computing this on the CPU is unwieldy, but I'm not sure what the hold-up is with such a simple function. Any tips on how to make this function faster would also be appreciated!
There's no need to do the recursion. E.g. for the (40,64,512) image you can directly do:
upsampled = image.repeat(16, axis=0).repeat(16, axis=1)[:634,:1020]
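More generally, each recursion level doubles both axes, so the total factor is 2**index and the whole recursion collapses to a single repeat-and-crop (a sketch using the dims list from the question):

import numpy as np

dims = [(634,1020,64), (317,510,128), (159,255,256), (80,128,512), (40,64,512)]

def direct_upsample(image, index, dims):
    factor = 2 ** index  # each recursion level doubled both axes
    return image.repeat(factor, axis=0).repeat(factor, axis=1)[:dims[0][0], :dims[0][1]]

image = np.random.rand(40, 64, 512)
print(direct_upsample(image, 4, dims).shape)  # (634, 1020, 512)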