I've devised a recursive function to handle a specific problem within the deep learning community. It seems to work quickly and well for most cases, but then takes ~20 minutes for other cases for seemingly no reason. The function, in the simplest case, can be abstracted as simply numpy's "repeat" function on two axes. Here's the code I used to test this function:
import time
import numpy as np

def recursive_upsample(fMap, index, dims):
    if index == 0:
        return fMap
    else:
        start = time.time()
        upscale = np.zeros((dims[index-1][0],dims[index-1][1],fMap.shape[-1]))
        if dims[index-1][0] % 2 == 1 and dims[index-1][1] % 2 == 1:
            crop = fMap[:fMap.shape[0]-1,:fMap.shape[1]-1]
            consX = fMap[-1,:][:-1]
            consY = fMap[:,-1][:-1]
            corner = fMap[-1,-1]
            crop = crop.repeat(2, axis=0).repeat(2, axis=1)
            upscale[:crop.shape[0],:crop.shape[1]] = crop
            upscale[-1,:][:-1] = consX.repeat(2,axis=0)
            upscale[:,-1][:-1] = consY.repeat(2,axis=0)
            upscale[-1,-1] = corner
        elif dims[index-1][0] % 2 == 1:
            crop = fMap[:fMap.shape[0]-1]
            consX = fMap[-1:,]
            crop = crop.repeat(2, axis=0).repeat(2, axis=1)
            upscale[:crop.shape[0]] = crop
            upscale[-1:,] = consX.repeat(2,axis=1)
        elif dims[index-1][1] % 2 == 1:
            crop = fMap[:,:fMap.shape[1]-1]
            consY = fMap[:,-1]
            crop = crop.repeat(2, axis=0).repeat(2, axis=1)
            upscale[:,:crop.shape[1]] = crop
            upscale[:,-1] = consY.repeat(2,axis=0)
        else:
            upscale = fMap.repeat(2, axis=0).repeat(2, axis=1)
        print('Upscaling from {} to {} took {} seconds'.format(fMap.shape,upscale.shape,time.time() - start))
        fMap = upscale
        return recursive_upsample(fMap,index-1,dims)
if __name__ == '__main__':
    dims = [(634,1020,64),(317,510,128),(159,255,256),(80,128,512),(40,64,512)]
    images = []
    for dim in dims:
        image = np.random.rand(dim[0],dim[1],dim[2])
        images.append(image)
    start = time.time()
    upsampled = []
    for index,image in enumerate(images):
        upsampled.append(recursive_upsample(image,index,dims))
    print('Upsampling took {} seconds'.format(time.time() - start))
For some odd reason, the step in the recursion where the feature map of shape (40,64,512) is upsampled from shape (317,510,512) to (634,1020,512) takes an egregious 941 seconds! I'm starting to rewrite this code with Theano, but should I be looking for some underlying problem with my code? My reasoning right now is that computing this on the CPU is unwieldy, but I'm not sure what the hold-up is with such a simple function. Also, any tips on how to make this function faster would be appreciated!
There's no need to do the recursion. E.g. for the (40,64,512) image you can directly do:
upsampled = image.repeat(16, axis=0).repeat(16, axis=1)[:634,:1020]
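More generally, the whole recursion could collapse into a single repeat followed by a crop. A minimal sketch (the helper name is just for illustration, and it assumes that cropping away the extra rows/columns is an acceptable substitute for the odd-dimension edge handling in the original):

import numpy as np

def direct_upsample(fMap, index, dims):
    # repeat by 2**index in one shot, then crop to the final spatial size dims[0]
    factor = 2 ** index
    up = fMap.repeat(factor, axis=0).repeat(factor, axis=1)
    return up[:dims[0][0], :dims[0][1]]

This makes one pass over the data instead of building a chain of progressively larger intermediate arrays.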
I'm looking at the following image stitching example in the OpenCV documentation: https://raw.githubusercontent.com/opencv/opencv/4.x/samples/python/stitching_detailed.py, trying to wrap my head around how to use bundle adjustment to estimate homography and warp images. I'm having a hard time following what exactly is going on, partially because I can't seem to find the docs for many of the functions they are using. A snippet of the code I think I am particularly interested in is below.
estimator = ESTIMATOR_CHOICES[args.estimator]()
b, cameras = estimator.apply(features, p, None)
if not b:
    print("Homography estimation failed.")
    exit()
for cam in cameras:
    cam.R = cam.R.astype(np.float32)

adjuster = BA_COST_CHOICES[args.ba]()
adjuster.setConfThresh(1)
refine_mask = np.zeros((3, 3), np.uint8)
if ba_refine_mask[0] == 'x':
    refine_mask[0, 0] = 1
if ba_refine_mask[1] == 'x':
    refine_mask[0, 1] = 1
if ba_refine_mask[2] == 'x':
    refine_mask[0, 2] = 1
if ba_refine_mask[3] == 'x':
    refine_mask[1, 1] = 1
if ba_refine_mask[4] == 'x':
    refine_mask[1, 2] = 1
adjuster.setRefinementMask(refine_mask)
b, cameras = adjuster.apply(features, p, cameras)
if not b:
    print("Camera parameters adjusting failed.")
    exit()

focals = []
for cam in cameras:
    focals.append(cam.focal)
focals.sort()
if len(focals) % 2 == 1:
    warped_image_scale = focals[len(focals) // 2]
else:
    warped_image_scale = (focals[len(focals) // 2] + focals[len(focals) // 2 - 1]) / 2

if wave_correct is not None:
    rmats = []
    for cam in cameras:
        rmats.append(np.copy(cam.R))
    rmats = cv.detail.waveCorrect(rmats, wave_correct)
    for idx, cam in enumerate(cameras):
        cam.R = rmats[idx]

corners = []
masks_warped = []
images_warped = []
sizes = []
masks = []
for i in range(0, num_images):
    um = cv.UMat(255 * np.ones((images[i].shape[0], images[i].shape[1]), np.uint8))
    masks.append(um)

warper = cv.PyRotationWarper(warp_type, warped_image_scale * seam_work_aspect)  # warper could be nullptr?
for idx in range(0, num_images):
    K = cameras[idx].K().astype(np.float32)
    swa = seam_work_aspect
    K[0, 0] *= swa
    K[0, 2] *= swa
    K[1, 1] *= swa
    K[1, 2] *= swa
    corner, image_wp = warper.warp(images[idx], K, cameras[idx].R, cv.INTER_LINEAR, cv.BORDER_REFLECT)
    corners.append(corner)
    sizes.append((image_wp.shape[1], image_wp.shape[0]))
    images_warped.append(image_wp)
    p, mask_wp = warper.warp(masks[idx], K, cameras[idx].R, cv.INTER_NEAREST, cv.BORDER_CONSTANT)
    masks_warped.append(mask_wp.get())
There are several key things I can't seem to find.
estimator.apply(): I can't find the docs for this, so I don't fully understand what the function expects as arguments nor what it returns. (Estimator I'm looking at: https://docs.opencv.org/4.x/df/d15/classcv_1_1detail_1_1Estimator.html)
What is the camera object in for cam in cameras: cam.R = cam.R.astype(np.float32)? Is this the correct documentation to look at: https://docs.opencv.org/4.x/dc/d3a/classcv_1_1viz_1_1Camera.html?
adjuster.apply() also doesn't seem to be a member function of any of the classes: BundleAdjusterBase, BundleAdjusterReproj, or others... (maybe I just don't understand C++; the adjuster I'm looking at: https://docs.opencv.org/4.x/d5/d56/classcv_1_1detail_1_1BundleAdjusterBase.html)
The PyRotationWarper class reference states that PyRotationWarper.warp() takes the camera intrinsics and rotation as parameters. Would I be correct in assuming this is performing the bundle adjustment step, warping images based on 3D points?
Does this snippet more or less represent a minimal working example of image mosaicing using bundle adjustment? I'm not sure what I'm doing. If someone would be willing to provide an example of stitching 4 or 5 images and applying bundle adjustment, I would be eternally grateful.
PS. I'm not using createStitcher because I want to learn to do it from scratch and eventually use deep learning to estimate camera params and pose / match feature points.
I have a binary segmentation map as output from a neural network (NIfTI format) and want to keep only the biggest island, to get rid of unwanted false positives.
I am able to achieve this with:
import nibabel as nib
import numpy as np
from scipy.ndimage import label
vol = 'PATH_TO_VOLUME'
elements_in_biggest_island = 0
biggest_index = 0
aNii = nib.load(vol)
a = aNii.get_fdata()
s = np.ones((3,3,3), dtype = 'uint8')
labelled_array, num_features = label(a, structure=s)
for i in range(1, num_features + 1):
    tempArray = labelled_array
    if (np.count_nonzero(tempArray == i) > elements_in_biggest_island):
        elements_in_biggest_island = np.count_nonzero(tempArray == i)
        biggest_index = i

print("Biggest Island was at index ", biggest_index, " with a total of ", elements_in_biggest_island, " members.")

labelled_array[labelled_array == biggest_index] = 1.0
labelled_array[labelled_array < biggest_index] = 0.0
labelled_array[labelled_array > biggest_index] = 0.0
ni_img = nib.Nifti1Image(labelled_array, aNii.affine)
nib.save(ni_img, f'PATH_TO_PROCESSED_VOL')
But the "thresholding" is very inefficient. In another application I work with numpy.where(), which generates a good speedup compared to the shown way of thresholding.
My approach was to replace the array[array > i] = x lines with:
labelled_array = np.where(labelled_array==biggest_index, 1, 0)
This exact line works perfectly in another application, but here I only get a black 3D volume, which does not work for me.
Is anybody able to point out the mistake that I have made?
I'm trying to speed up my processing of a PIL.Image, where I divide the image into small parts, search for the most similar image inside a database and then replace the original small part of the image with this found image.
This is the described function:
def work_image(img, lenx, leny, neigh, split_dict, img_train_rot):
    constructed_img = Image.new(mode='L', size=img.size)
    for x in range(0,img.size[0],lenx):
        for y in range(0,img.size[1],leny):
            box = (x,y,x+lenx,y+leny)
            split_img = img.crop(box)
            res = neigh.kneighbors(np.asarray(split_img).ravel().reshape((1,-1)))
            #look up the found image part in img_train_rot and define the position as new_box
            constructed_img.paste(img_train_rot[i].crop(new_box), (x,y))
    return constructed_img
Now I wanted to parallelize this function, since e.g. each row of such image parts could be dealt with entirely on its own.
I came up with this approach using multiprocessing.Pool:
def work_image_parallel(leny, neigh, split_dict, img_train_rot, img_slice):
    constructed_img_slice = Image.new(mode='L', size=img_slice.size)
    for y in range(0, img_slice.size[1], leny):
        box = (0, y, img_slice.size[0], y+leny)
        img_part = img_slice.crop(box)
        res = neigh.kneighbors(np.asarray(img_part).ravel().reshape((1,-1)))
        #look up the found image part in img_train_rot and define the position as new_box
        constructed_img_slice.paste(img_train_rot[i].crop(new_box), (0,y))
    return constructed_img_slice

if __name__ == '__main__':
    lenx, leny = 16, 16
    #define my image database and so on
    neigh = setup_nearest_neighbour(train_imgs, n_neighbors=1)
    test_img = test_imgs[0]
    func = partial(work_image_parallel, leny, neigh, split_dict, img_train_rot)
    pool = multiprocessing.Pool()
    try:
        res = pool.map(func, map(lambda x: x, [test_img.crop((x, 0, x+lenx, test_img.size[1])) for x in range(0, test_img.size[0], lenx)]))
    finally:
        pool.close()
        pool.join()
    test_result2 = Image.new(mode='L', size = test_img.size)
    for i in range(len(res)):
        test_result2.paste(res[i], box=(i*lenx, 0, i*lenx + lenx, test_result2.size[1]))
However, this parallelized version isn't exactly faster than the normal version, and if I decrease the size of my image division, the parallelized version throws an AssertionError (other posts said this might be because the data size to be sent between the processes becomes too big).
Therefore my question: did I maybe do something wrong? Is multiprocessing maybe not the right approach here? And why doesn't the multiprocessing decrease the computation time, since the workload per image slice should be big enough to offset the time needed to create processes, etc.?
Any help would be appreciated.
Disclaimer: I am not that familiar with PIL, so you should take a close look at the PIL method calls, which may need some "adjustment" on your part since there is no way that I can actually test this.
First, I observe that you will probably be making a lot of repeated invocations of your worker function work_image_parallel and that some of those arguments being passed to that function might be quite large (all of this depends, of course, on how large your images are). Rather than repeatedly passing such potentially large arguments, I would prefer to copy these arguments once to each process in your pool and instantiate them as global variables. This is accomplished with a pool initializer function.
Second, I have attempted to modify your work_image_parallel function to be as close as possible to your original work_image function, except that it now deals with just a single x, y coordinate pair that is passed to it. In that way more of the work is being done by your subprocesses. I have also tried to reduce the number of pasting operations required (if I have correctly understood what is going on).
Third, because the images may be quite large, I am using a generator expression to create the arguments to be used with imap_unordered instead of map. This is because the number of x, y pairs can be quite large in a very large image, and map requires that its iterable argument be such that its length can be computed so that an efficient chunksize value can be computed (see the docs). With imap_unordered, we should specify an explicit chunksize value to be efficient (the default is 1 if unspecified) if we expect that the iterable could be large. If you know that you are dealing with relatively small images, so that the size of the x_y_args iterable would not be unreasonably memory-inefficient if stored as a list, then you could just use map with the default chunksize value of None and have the pool compute the value for you. The advantage of using imap_unordered is that results do not have to be returned in order, so processing could be faster.
def init_pool(the_img, the_img_train_rot, the_neigh, the_split_dict):
    global img, img_train_rot, neigh, split_dict
    img = the_img
    img_train_rot = the_img_train_rot
    neigh = the_neigh
    split_dict = the_split_dict

def work_image_parallel(lenx, leny, t):
    x, y = t
    box = (x,y,x+lenx,y+leny)
    split_img = img.crop(box)
    res = neigh.kneighbors(np.asarray(split_img).ravel().reshape((1,-1)))
    #look up the found image part in img_train_rot and define the position as new_box
    # return original x, y values used:
    return x, y, img_train_rot[i].crop(new_box)

def compute_chunksize(iterable_size, pool_size):
    chunksize, remainder = divmod(iterable_size, 4 * pool_size)
    if remainder:
        chunksize += 1
    return chunksize

if __name__ == '__main__':
    lenx, leny = 16, 16
    #define my image database and so on
    neigh = setup_nearest_neighbour(train_imgs, n_neighbors=1)
    test_img = test_imgs[0]
    func = partial(work_image_parallel, lenx, leny)
    # in case this is a very large image, use a generator expression
    x_y_args = ((x, y) for x in range(0, test_img.size[0], lenx) for y in range(0, test_img.size[1], leny))
    # approximate size of x_y_args:
    iterable_size = (test_img.size[0] // lenx) * (test_img.size[1] // leny)
    pool_size = multiprocessing.cpu_count()
    chunksize = compute_chunksize(iterable_size, pool_size)
    pool = multiprocessing.Pool(pool_size, initializer=init_pool, initargs=(test_img, img_train_rot, neigh, split_dict))
    test_result2 = Image.new(mode='L', size = test_img.size)
    try:
        # use imap or imap_unordered when the iterable is a generator to avoid conversion of the iterable to a list,
        # but specify a suitable chunksize for efficiency in case the iterable is very large:
        for x, y, res in pool.imap_unordered(func, x_y_args, chunksize=chunksize):
            test_result2.paste(res, (x, y))
    finally:
        pool.close()
        pool.join()
Update (break up image into bigger slices)
def init_pool(the_img, the_img_train_rot, the_neigh, the_split_dict):
    global img, img_train_rot, neigh, split_dict
    img = the_img
    img_train_rot = the_img_train_rot
    neigh = the_neigh
    split_dict = the_split_dict

def work_image_parallel(lenx, leny, x):
    img_slice = img.crop((x, 0, x+lenx, img.size[1]))
    constructed_img_slice = Image.new(mode='L', size=img_slice.size)
    for y in range(0, img_slice.size[1], leny):
        box = (0, y, img_slice.size[0], y+leny)
        img_part = img_slice.crop(box)
        res = neigh.kneighbors(np.asarray(img_part).ravel().reshape((1,-1)))
        #look up the found image part in img_train_rot and define the position as new_box
        constructed_img_slice.paste(img_train_rot[i].crop(new_box), (0,y))
    return constructed_img_slice

if __name__ == '__main__':
    lenx, leny = 16, 16
    #define my image database and so on
    neigh = setup_nearest_neighbour(train_imgs, n_neighbors=1)
    test_img = test_imgs[0]
    pool_size = multiprocessing.cpu_count()
    pool = multiprocessing.Pool(pool_size, initializer=init_pool, initargs=(test_img, img_train_rot, neigh, split_dict))
    func = partial(work_image_parallel, lenx, leny)
    try:
        test_result2 = Image.new(mode='L', size = test_img.size)
        x = 0
        for res in pool.map(func, [x for x in range(0, test_img.size[0], lenx)]):
            test_result2.paste(res, box=(x, 0, x + lenx, test_result2.size[1]))
            x += lenx
    finally:
        pool.close()
        pool.join()
I am trying to implement the Matlab function bwmorph(bw,'remove') in Python. This function removes interior pixels by setting a pixel to 0 if all of its 4-connected neighbor pixels are 1, so the resulting image should contain only the boundary pixels. I've written some code but I'm not sure if this is how to do it.
# neighbors() function returns the values of the 4-connected neighbors
# bwmorph() function returns the input image with only the boundary pixels
def neighbors(input_matrix,input_array):
    indexRow = input_array[0]
    indexCol = input_array[1]
    output_array = []
    output_array.append(input_matrix[indexRow - 1,indexCol])
    output_array.append(input_matrix[indexRow,indexCol + 1])
    output_array.append(input_matrix[indexRow + 1,indexCol])
    output_array.append(input_matrix[indexRow,indexCol - 1])
    return output_array
def bwmorph(input_matrix):
    output_matrix = input_matrix.copy()
    nRows,nCols = input_matrix.shape
    for indexRow in range(0,nRows):
        for indexCol in range(0,nCols):
            center_pixel = [indexRow,indexCol]
            neighbor_array = neighbors(output_matrix,center_pixel)
            if neighbor_array == [1,1,1,1]:
                output_matrix[indexRow,indexCol] = 0
    return output_matrix
Since you are using NumPy arrays, one suggestion I have is to change the if statement to use numpy.all to check whether all of the neighbour values are nonzero. In addition, you should make sure that your input is a single-channel image. Because grayscale images stored in colour share the same values in all channels, just extract the first channel; your comments indicate a colour image, so make sure you do this. You are also checking against the output matrix, which is being modified inside the loop. You need to check against an unmodified version; this is also why you're getting a blank output.
def bwmorph(input_matrix):
    output_matrix = input_matrix.copy()
    # Change. Ensure single channel
    if len(output_matrix.shape) == 3:
        output_matrix = output_matrix[:, :, 0]
    nRows,nCols = output_matrix.shape # Change
    orig = output_matrix.copy() # Need another one for checking
    for indexRow in range(0,nRows):
        for indexCol in range(0,nCols):
            center_pixel = [indexRow,indexCol]
            neighbor_array = neighbors(orig, center_pixel) # Change to use unmodified image
            if np.all(neighbor_array): # Change
                output_matrix[indexRow,indexCol] = 0
    return output_matrix
In addition, a small grievance I have with your code is that you don't check for out-of-boundary conditions when determining the four neighbours. The test image you provided does not throw an error as you don't have any border pixels that are white. If you have a pixel along any of the borders, it isn't possible to check all four neighbours. However, one way to mitigate this would be to perhaps wrap around by using the modulo operator:
def neighbors(input_matrix,input_array):
    (rows, cols) = input_matrix.shape[:2] # New
    indexRow = input_array[0]
    indexCol = input_array[1]
    output_array = [0] * 4 # New - I like pre-allocating
    # Edit
    output_array[0] = input_matrix[(indexRow - 1) % rows,indexCol]
    output_array[1] = input_matrix[indexRow,(indexCol + 1) % cols]
    output_array[2] = input_matrix[(indexRow + 1) % rows,indexCol]
    output_array[3] = input_matrix[indexRow,(indexCol - 1) % cols]
    return output_array
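If it helps, here is a small self-contained check of the two functions above on a made-up test array (a hypothetical 7x7 example, not your data): the 3x3 interior of the white block should be removed and its one-pixel boundary kept.

import numpy as np

# assumes the neighbors() and bwmorph() definitions from above are in scope
test = np.zeros((7, 7), dtype=np.uint8)
test[1:6, 1:6] = 1          # a 5x5 white block
print(bwmorph(test))        # expected: only the boundary of the block remains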
I am currently normalizing a NumPy array in Python, created by slicing an image into windows with a stride, which creates about 20K patches. The current normalization implementation is a big pain point in my runtime, and I'm trying to replace it with the same functionality done maybe in a C extension or something. I am looking to see what advice the community has to get this done easily and simply.
Current runtime is about 0.34s just for the normalization part, and I'm trying to get it below 0.1s or better. You can see that creating patches is extremely efficient with view_as_windows, and I am looking for something similar for normalization. Note you can simply comment/uncomment the lines labeled "# ---- Normalization" to see the runtimes yourself for the different implementations.
Here is the current implementation:
import gc
import os
import numpy
import cv2, time
from libraries import GCN
from skimage.util.shape import view_as_windows

def create_imageArray(patch_list):
    returnImageArray = numpy.zeros(shape=(len(patch_list), 1, 40, 60))
    idx = 0
    for patch, name, coords in patch_list:
        imgArray = numpy.asarray(patch[:,:], dtype=numpy.float32)
        imgArray = imgArray[numpy.newaxis, ...]
        returnImageArray[idx] = imgArray
        idx += 1
    return returnImageArray
# print "normImgArray[0]:",normImgArray[0]
def NormalizeData(imageArray):
    tempImageArray = imageArray
    # Normalize the data in batches
    batchSize = 25000
    dataSize = tempImageArray.shape[0]
    imageChannels = tempImageArray.shape[1]
    imageHeight = tempImageArray.shape[2]
    imageWidth = tempImageArray.shape[3]
    for i in xrange(0, dataSize, batchSize):
        stop = i + batchSize
        print("Normalizing data [{0} to {1}]...".format(i, stop))
        dataTemp = tempImageArray[i:stop]
        dataTemp = dataTemp.reshape(dataTemp.shape[0], imageChannels * imageHeight * imageWidth)
        #print("Performing GCN [{0} to {1}]...".format(i, stop))
        dataTemp = GCN(dataTemp)
        #print("Reshaping data again [{0} to {1}]...".format(i, stop))
        dataTemp = dataTemp.reshape(dataTemp.shape[0], imageChannels, imageHeight, imageWidth)
        #print("Updating data with new values [{0} to {1}]...".format(i, stop))
        tempImageArray[i:stop] = dataTemp
        del dataTemp
        gc.collect()
    return tempImageArray
start_time = time.time()

img1_path = "777628-1032-0048.jpg"
img_list = ["images/1.jpg", "images/2.jpg", "images/3.jpg", "images/4.jpg", "images/5.jpg"]
patchWidth = 60
patchHeight = 40
channels = 1
stride = patchWidth/6
multiplier = 1.31

finalImgArray = []
vaw_time = 0
norm_time = 0
array_time = 0
for im_path in img_list:
    start = time.time()
    baseFileWithExt = os.path.basename(im_path)
    baseFile = os.path.splitext(baseFileWithExt)[0]
    img = cv2.imread(im_path, cv2.IMREAD_GRAYSCALE)
    nxtWidth = 800
    nxtHeight = 1200
    patchesList = []
    for i in xrange(7):
        img = cv2.resize(img, (nxtWidth, nxtHeight))
        nxtWidth = int(nxtWidth//multiplier)
        nxtHeight = int(nxtHeight//multiplier)
        patches = view_as_windows(img, (patchHeight, patchWidth), stride)
        cols = patches.shape[0]
        rows = patches.shape[1]
        patchCount = cols*rows
        print "patchCount:",patchCount, " patches.shape:",patches.shape
        returnImageArray = numpy.zeros(shape=(patchCount, channels, patchHeight, patchWidth))
        idx = 0
        for col in xrange(cols):
            for row in xrange(rows):
                patch = patches[col][row]
                imageName = "{0}-patch{1}-{2}.jpg".format(baseFile, i, idx)
                patchCoordinates = (0, 1, 2, 3) # don't need these for example
                patchesList.append((patch, imageName, patchCoordinates))
                # ---- Normalization inside 7 iterations <> Part 1
                # imgArray = numpy.asarray(patch[:,:], dtype=numpy.float32)
                # imgArray = patch.astype(numpy.float32)
                # imgArray = imgArray[numpy.newaxis, ...] # Add a new axis for channel so goes from shape [40,60] to [1,40,60]
                # returnImageArray[idx] = imgArray
                idx += 1
        # if i == 0: finalImgArray = returnImageArray
        # else: finalImgArray = numpy.concatenate((finalImgArray, returnImageArray), axis=0)
    vaw_time += time.time() - start

    # ---- Normalization inside 7 iterations <> Part 2
    # start = time.time()
    # normImageArray = NormalizeData(finalImgArray)
    # norm_time += time.time() - start
    # print "returnImageArray.shape:", finalImgArray.shape

    # ---- Normalization outside 7 iterations
    start = time.time()
    imgArray = create_imageArray(patchesList)
    array_time += time.time() - start

    start = time.time()
    normImgArray = NormalizeData(imgArray)
    norm_time += time.time() - start

    print "len(patchesList):",len(patchesList)

total_time = (time.time() - start_time)/len(img_list)
print "\npatches_time per img: {0:.3f} s".format(vaw_time/len(img_list))
print "create imgArray per img: {0:.3f} s".format(array_time/len(img_list))
print "normalization_time per img: {0:.3f} s".format(norm_time/len(img_list))
print "total time per image: {0:.3f} s \n".format(total_time)
Here is the GCN code in case you need to download it to use it: http://pastebin.com/RdVMD2P3
Details on code inside GCN
I am calling GCN using the default params.
At a high level, it is taking the average of all of the pixels and dividing all the pixels by that average. So if there's an image array that looks like this [1 2 3], then the average is 2. Therefore we divide each number by 2 and get [0.5, 1, 1.5]. That's what the normalization does. I forgot to highlight in the image above the mean = X.mean(axis=1).
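Based on that description (this is only a sketch of my understanding of what the default GCN call reduces to, not the actual pastebin code), the core of the normalization is a per-row mean division:

import numpy

# rough sketch, not the actual GCN implementation
X = numpy.array([[1., 2., 3.],
                 [2., 4., 6.]])           # each row is one flattened patch
mean = X.mean(axis=1)                     # per-patch mean: [2., 4.]
X_norm = X / mean[:, numpy.newaxis]       # divide each patch by its own mean
print X_norm                              # [[0.5 1. 1.5], [0.5 1. 1.5]]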
Notes:
If you are wondering why I am re-iterating and creating a new imgArray to normalize instead of doing it in the original patch creation, it is to keep data transfer to a minimum. I am implementing this with the multiprocess library, and serializing data takes a LOOONG time, so I am trying to keep the data serialization to a minimum (meaning pass as little data back from the process as possible). I have measured the difference between doing this inside the 7 loops or outside, and the numbers are below so I can deal with that. However, if you know of a faster implementation, do let me know.
Runtimes for creating imageArray inside 7 loops:
patches_time per img: 0.560 s
normalization_time per img: 0.336 s
total time per image: 0.896 s
Runtimes for creating imageArray and normalizing outside of 7 iterations:
patches_time per img: 0.040 s
create imgArray per img: 0.146 s
normalization_time per img: 0.339 s
total time per image: 0.524 s
I didn't see this before, but it seems creating the array is also taking some time.
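One idea I have not benchmarked yet (so this is only a sketch, and it assumes every patch has the same (40, 60) shape) is to build the array in a single vectorized call instead of the per-patch loop in create_imageArray:

def create_imageArray_stacked(patch_list):
    # stack all patches at once and add the channel axis, instead of copying one patch at a time
    # note: this returns float32 rather than the float64 array the original produced
    patches = numpy.stack([patch for patch, name, coords in patch_list]).astype(numpy.float32)
    return patches[:, numpy.newaxis, :, :]   # shape: (num_patches, 1, 40, 60)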