python numpy: array of arrays - python

I'm trying to build a numpy array of arrays of arrays with the following code below.
Which gives me a
ValueError: setting an array element with a sequence.
My guess is that in numpy I need to declare the arrays as multi-dimensional from the beginning, but I'm not sure..
How can I fix the the code below so that I can build array of array of arrays?
from PIL import Image
import pickle
import os
import numpy
indir1 = 'PositiveResize'
trainimage = numpy.empty(2)
trainpixels = numpy.empty(80000)
trainlabels = numpy.empty(80000)
validimage = numpy.empty(2)
validpixels = numpy.empty(10000)
validlabels = numpy.empty(10000)
testimage = numpy.empty(2)
testpixels = numpy.empty(10408)
testlabels = numpy.empty(10408)
i=0
tr=0
va=0
te=0
for (root, dirs, filenames) in os.walk(indir1):
print 'hello'
for f in filenames:
try:
im = Image.open(os.path.join(root,f))
Imv=im.load()
x,y=im.size
pixelv = numpy.empty(6400)
ind=0
for i in range(x):
for j in range(y):
temp=float(Imv[j,i])
temp=float(temp/255.0)
pixelv[ind]=temp
ind+=1
if i<40000:
trainpixels[tr]=pixelv
tr+=1
elif i<45000:
validpixels[va]=pixelv
va+=1
else:
testpixels[te]=pixelv
te+=1
print str(i)+'\t'+str(f)
i+=1
except IOError:
continue
trainimage[0]=trainpixels
trainimage[1]=trainlabels
validimage[0]=validpixels
validimage[1]=validlabels
testimage[0]=testpixels
testimage[1]=testlabels

Don't try to smash your entire object into a numpy array. If you have distinct things, use a numpy array for each one then use an appropriate data structure to hold them together.
For instance, if you want to do computations across images then you probably want to just store the pixels and labels in separate arrays.
trainpixels = np.empty([10000, 80, 80])
trainlabels = np.empty(10000)
for i in range(10000):
trainpixels[i] = ...
trainlabels[i] = ...
To access an individual image's data:
imagepixels = trainpixels[253]
imagelabel = trainlabels[253]
And you can easily do stuff like compute summary statistics over the images.
meanimage = np.mean(trainpixels, axis=0)
meanlabel = np.mean(trainlabels)
If you really want all the data to be in the same object, you should probably use a struct array as Eelco Hoogendoorn suggests. Some example usage:
# Construction and assignment
trainimages = np.empty(10000, dtype=[('label', np.int), ('pixel', np.int, (80,80))])
for i in range(10000):
trainimages['label'][i] = ...
trainimages['pixel'][i] = ...
# Summary statistics
meanimage = np.mean(trainimages['pixel'], axis=0)
meanlabel = np.mean(trainimages['label'])
# Accessing a single image
image = trainimages[253]
imagepixels, imagelabel = trainimages[['pixel', 'label']][253]
Alternatively, if you want to process each one separately, you could store each image's data in separate arrays and bind them together in a tuple or dictionary, then store all of that in a list.
trainimages = []
for i in range(10000):
pixels = ...
label = ...
image = (pixels, label)
trainimages.append(image)
Now to access a single images data:
imagepixels, imagelabel = trainimages[253]
This makes it more intuitive to access a single image, but because all the data is not in one big numpy array you don't get easy access to functions that work across images.

Refer to the examples in numpy.empty:
>>> np.empty([2, 2])
array([[ -9.74499359e+001, 6.69583040e-309],
[ 2.13182611e-314, 3.06959433e-309]]) #random
Give your images a shape with the N dimensions:
testpixels = numpy.empty([96, 96])

Related

How to manipulate image bands as arrays of numbers

I'm new to Python, and I'm trying to deconstruct image bands as arrays of numbers by applying the Singular Value Decomposition (SVD) to them and then putting them back together with matplotlib.image and the Image module from PIL. An SVD may also be written as a sum of dyads s1u1v1T + ... + sKuKvKT, and the point in decomposing it in this way is that a near-perfect approximation of the image can be made from just a few of those dyads, so less data is required.
There must be something wrong with the calculation, though because result_r, result_g, and result_b look like this when converted to Images, and new_image looks like this.
For an example of what this should look like, here are the first dyads of the layers of this image. The image that I'm using (April23.jpg) is this.
import matplotlib.image as image
import numpy.linalg as la
import numpy as np
from PIL import Image
def getcolumn(j, m):
col = []
for i in range(len(m)):
col.append(m[i][j])
return col
def extractCols(U):
Ucols = []
for j in range(len(U[0])):
Ucols.append(getcolumn(j, U))
return np.asarray(Ucols)
def vectorMultiply(u, v):
matrix = []
for i in range(len(u)):
newVec = []
for j in range(len(v)):
newVec.append(u[i] * v[j])
matrix.append(newVec)
return np.asarray(matrix)
im = Image.open('C:/Users/<user>/Desktop/img/April23.jpg')
im.load()
sim = Image.Image.split(im)
rsim = sim[0].save("rsim.jpg") # image bands as images
gsim = sim[1].save("gsim.jpg")
bsim = sim[2].save("bsim.jpg")
# image bands as arrays of numbers
arsim = image.imread('C:/Users/<user>/Desktop/img/rsim.jpg')
agsim = image.imread('C:/Users/<user>/Desktop/img/gsim.jpg')
absim = image.imread('C:/Users/<user>/Desktop/img/bsim.jpg')
ur, sr, vhr = la.svd(arsim, False) # SVD on each band
ug, sg, vhg = la.svd(agsim, False)
ub, sb, vhb = la.svd(absim, False)
urcols = extractCols(ur)
ugcols = extractCols(ug)
ubcols = extractCols(ub)
# calculating the first dyads
result_r = np.multiply(sr[0], vectorMultiply(urcols[0], vhr[0]))
result_g = np.multiply(sg[0], vectorMultiply(ugcols[0], vhg[0]))
result_b = np.multiply(sb[0], vectorMultiply(ubcols[0], vhb[0]))
r = Image.fromarray(result_r, "L")
g = Image.fromarray(result_g, "L")
b = Image.fromarray(result_b, "L")
new_image = Image.merge("RGB", (r, g, b))
What am I missing, here? It seems to be something with the calculations. I figured for a matrix one would have to extract the columns, say the column [1, 2, 3] from a matrix [[1,...], [2,...], [3,...]], since each element of the matrix is a row. So, I wrote extractCols() for that. numpy's matrix add and multiply seem to be fine. I wrote vectorMultiply because np.dot(), np.multiply(), and np.matmul() didn't seem to realize that u was a column and kept saying the dimensions didn't match up. I tested it and it seemed to do what I wanted it to. I was also thinking that maybe the "rows" of U are actually the columns already and don't need to be extracted, but that didn't work either. I've also tried not using np.asarray() without any luck.
Any advice is appreciated.

Convert mha 3d images into 2d images in Python (2015 brats challenge dataset)

I want to use SimpleITK or wedpy to convert the 3d images into 2d images.
Or i want to get a three-dimensional matrix, and then i divide the three-dimensional matrix into some two-dimensional matrices.
import SimpleITK as ITK
import numpy as np
#from medpy.io import load
url=r'G:\path\to\my.mha'
image = ITK.ReadImage(url)
frame_num, width, height = image_array.shape
print(frame_num,width,height)
Then only get it:155 240 240
but i want to get [[1,5,2,3,1...],[54,1,3,5...],[5,8,9,6....]]
Just to add to Dave Chen's answer, as it is unclear if you want to get a set of 2D SimpleITK images or numpy arrays. The following code covers all three available options:
import SimpleITK as sitk
import numpy as np
url = "my_file.mha"
image = sitk.ReadImage(url)
max_index = image.GetDepth() # or image.GetWidth() or image.GetHeight() depending on the axis along which you want to extract
# As list of 2D SimpleITK images
list_of_2D_images = [image[:,:,i] for i in range(max_index)]
# As list of 2D numpy arrays which cannot be modified (no data copied)
list_of_2D_images_np_view = [sitk.GetArrayViewFromImage(image[:,:,i]) for i in range(max_index)]
# As list of 2D numpy arrays (data copied to numpy array)
list_of_2D_images_np = [sitk.GetArrayFromImage(image[:,:,i]) for i in range(max_index)]
Also, if you really want to work with URLs and not local files I would suggest looking at the remote download approach used in the SimpleITK notebooks repository, the relevant file is downloaddata.py.
That's not a big deal.
CT images have originally all numbers in int16 type so you don't need to handle float numbers.. In this case, we can think that we can easily change from int16 to uint16 only removing negative values in the image (CT images have some negative numbers as pixel values). Note that we really need uint16, or uint8 type so that OpenCV can handle it... as we have a lot of values in the CT image array, the best choice is uint16, so that we don't lose too much precision.
Ok, now you just need to do as follows:
import SimpleITK as sitk
import numpy as np
import cv2
mha = sitk.ReadImage('/mha/directory') #Importing mha file
array = sitk.GetArrayFromImage(mha) #Converting to array int16 (default)
#Translating each slice to the positive side
for m in range(array.shape[0]):
array[m] = array[m] + abs(np.min(array[m]))
array = np.around(array, decimals=0) #remove any float numbers if exists.. probably not
array = np.asarray(array, dtype='uint16') #From int16 to uint16
After these steps the array is just ready to be saved as png images using opencv.imwrite module:
for image in array:
cv2.imwrite('/dir/to/save/'+'name_image.png', image)
Note that by default SimpleITK handles .mha files by the axial view. I really don't know how to change it because I've never needed it before. Anyway, in this case with some searches you can find something.
I'm not sure exactly what you want to get. But it's easy to extract a 2d slice from a 3d image in SimpleITK.
To get a Z slice where Z=100 you can do this:
zslice = image[100]
To get a Y slice for Y=100:
yslice = image[:, 100]
And a X slice for X=100:
xslice = image[:, :, 100]
#zivy#Dave Chen
I've solved my problem.In fact, running this code will give you 150 240*240 PNG pictures.It's i want to get.
# -*- coding:utf-8 -*-
import numpy as np
import subprocess
import random
import progressbar
from glob import glob
from skimage import io
np.random.seed(5) # for reproducibility
progress = progressbar.ProgressBar(widgets=[progressbar.Bar('*', '[', ']'), progressbar.Percentage(), ' '])
class BrainPipeline(object):
'''
A class for processing brain scans for one patient
INPUT: (1) filepath 'path': path to directory of one patient. Contains following mha files:
flair, t1, t1c, t2, ground truth (gt)
(2) bool 'n4itk': True to use n4itk normed t1 scans (defaults to True)
(3) bool 'n4itk_apply': True to apply and save n4itk filter to t1 and t1c scans for given patient. This will only work if the
'''
def __init__(self, path, n4itk = True, n4itk_apply = False):
self.path = path
self.n4itk = n4itk
self.n4itk_apply = n4itk_apply
self.modes = ['flair', 't1', 't1c', 't2', 'gt']
# slices=[[flair x 155], [t1], [t1c], [t2], [gt]], 155 per modality
self.slices_by_mode, n = self.read_scans()
# [ [slice1 x 5], [slice2 x 5], ..., [slice155 x 5]]
self.slices_by_slice = n
self.normed_slices = self.norm_slices()
def read_scans(self):
'''
goes into each modality in patient directory and loads individual scans.
transforms scans of same slice into strip of 5 images
'''
print('Loading scans...')
slices_by_mode = np.zeros((5, 155, 240, 240))
slices_by_slice = np.zeros((155, 5, 240, 240))
flair = glob(self.path + '/*Flair*/*.mha')
t2 = glob(self.path + '/*_T2*/*.mha')
gt = glob(self.path + '/*more*/*.mha')
t1s = glob(self.path + '/**/*T1*.mha')
t1_n4 = glob(self.path + '/*T1*/*_n.mha')
t1 = [scan for scan in t1s if scan not in t1_n4]
scans = [flair[0], t1[0], t1[1], t2[0], gt[0]] # directories to each image (5 total)
if self.n4itk_apply:
print('-> Applyling bias correction...')
for t1_path in t1:
self.n4itk_norm(t1_path) # normalize files
scans = [flair[0], t1_n4[0], t1_n4[1], t2[0], gt[0]]
elif self.n4itk:
scans = [flair[0], t1_n4[0], t1_n4[1], t2[0], gt[0]]
for scan_idx in xrange(5):
# read each image directory, save to self.slices
slices_by_mode[scan_idx] = io.imread(scans[scan_idx], plugin='simpleitk').astype(float)
for mode_ix in xrange(slices_by_mode.shape[0]): # modes 1 thru 5
for slice_ix in xrange(slices_by_mode.shape[1]): # slices 1 thru 155
slices_by_slice[slice_ix][mode_ix] = slices_by_mode[mode_ix][slice_ix] # reshape by slice
return slices_by_mode, slices_by_slice
def norm_slices(self):
'''
normalizes each slice in self.slices_by_slice, excluding gt
subtracts mean and div by std dev for each slice
clips top and bottom one percent of pixel intensities
if n4itk == True, will apply n4itk bias correction to T1 and T1c images
'''
print('Normalizing slices...')
normed_slices = np.zeros((155, 5, 240, 240))
for slice_ix in xrange(155):
normed_slices[slice_ix][-1] = self.slices_by_slice[slice_ix][-1]
for mode_ix in xrange(4):
normed_slices[slice_ix][mode_ix] = self._normalize(self.slices_by_slice[slice_ix][mode_ix])
print('Done.')
return normed_slices
def _normalize(self, slice):
'''
INPUT: (1) a single slice of any given modality (excluding gt)
(2) index of modality assoc with slice (0=flair, 1=t1, 2=t1c, 3=t2)
OUTPUT: normalized slice
'''
b, t = np.percentile(slice, (0.5,99.5))
slice = np.clip(slice, b, t)
if np.std(slice) == 0:
return slice
else:
return (slice - np.mean(slice)) / np.std(slice)
def save_patient(self, reg_norm_n4, patient_num):
'''
INPUT: (1) int 'patient_num': unique identifier for each patient
(2) string 'reg_norm_n4': 'reg' for original images, 'norm' normalized images, 'n4' for n4 normalized images
OUTPUT: saves png in Norm_PNG directory for normed, Training_PNG for reg
'''
print('Saving scans for patient {}...'.format(patient_num))
progress.currval = 0
if reg_norm_n4 == 'norm': #saved normed slices
for slice_ix in progress(xrange(155)): # reshape to strip
strip = self.normed_slices[slice_ix].reshape(1200, 240)
if np.max(strip) != 0: # set values < 1
strip /= np.max(strip)
if np.min(strip) <= -1: # set values > -1
strip /= abs(np.min(strip))
# save as patient_slice.png
io.imsave('Norm_PNG/{}_{}.png'.format(patient_num, slice_ix), strip)
elif reg_norm_n4 == 'reg':
for slice_ix in progress(xrange(155)):
strip = self.slices_by_slice[slice_ix].reshape(1200, 240)
if np.max(strip) != 0:
strip /= np.max(strip)
io.imsave('Training_PNG/{}_{}.png'.format(patient_num, slice_ix), strip)
else:
for slice_ix in progress(xrange(155)): # reshape to strip
strip = self.normed_slices[slice_ix].reshape(1200, 240)
if np.max(strip) != 0: # set values < 1
strip /= np.max(strip)
if np.min(strip) <= -1: # set values > -1
strip /= abs(np.min(strip))
# save as patient_slice.png
io.imsave('n4_PNG/{}_{}.png'.format(patient_num, slice_ix), strip)
def n4itk_norm(self, path, n_dims=3, n_iters='[20,20,10,5]'):
'''
INPUT: (1) filepath 'path': path to mha T1 or T1c file
(2) directory 'parent_dir': parent directory to mha file
OUTPUT: writes n4itk normalized image to parent_dir under orig_filename_n.mha
'''
output_fn = path[:-4] + '_n.mha'
# run n4_bias_correction.py path n_dim n_iters output_fn
subprocess.call('python n4_bias_correction.py ' + path + ' ' + str(n_dims) + ' ' + n_iters + ' ' + output_fn, shell = True)
def save_patient_slices(patients, type):
'''
INPUT (1) list 'patients': paths to any directories of patients to save. for example- glob("Training/HGG/**")
(2) string 'type': options = reg (non-normalized), norm (normalized, but no bias correction), n4 (bias corrected and normalized)
saves strips of patient slices to approriate directory (Training_PNG/, Norm_PNG/ or n4_PNG/) as patient-num_slice-num
'''
for patient_num, path in enumerate(patients):
a = BrainPipeline(path)
a.save_patient(type, patient_num)
def s3_dump(directory, bucket):
'''
dump files from a given directory to an s3 bucket
INPUT (1) string 'directory': directory containing files to save
(2) string 'bucket': name od s3 bucket to dump files
'''
subprocess.call('aws s3 cp' + ' ' + directory + ' ' + 's3://' + bucket + ' ' + '--recursive')
def save_labels(fns):
'''
INPUT list 'fns': filepaths to all labels
'''
progress.currval = 0
for label_idx in progress(range(len(labels))):
slices = io.imread(labels[label_idx], plugin = 'simpleitk')
for slice_idx in range(len(slices)):
io.imsave(r'{}_{}L.png'.format(label_idx, slice_idx), slices[slice_idx])
if __name__ == '__main__':
url = r'G:\work\deeplearning\BRATS2015_Training\HGG\brats_2013_pat0005_1\VSD.Brain.XX.O.MR_T1.54537\VSD.Brain.XX.O.MR_T1.54537.mha'
labels = glob(url)
save_labels(labels)
# patients = glob('Training/HGG/**')
# save_patient_slices(patients, 'reg')
# save_patient_slices(patients, 'norm')
# save_patient_slices(patients, 'n4')
# s3_dump('Graveyard/Training_PNG/', 'orig-training-png')

How to vectorize a code with python numpy.bincount, using apply along axis

I'm trying to vectorize a code with numpy, to run it using multiprocessing, but i can't understand how numpy.apply_along_axis works. This is an example of the code, vectorized using map
import numpy
from scipy import sparse
import multiprocessing
from matplotlib import pyplot
#first i build a matrix of some x positions vs time datas in a sparse format
matrix = numpy.random.randint(2, size = 100).astype(float).reshape(10,10)
x = numpy.nonzero(matrix)[0]
times = numpy.nonzero(matrix)[1]
weights = numpy.random.rand(x.size)
#then i define an array of y positions
nStepsY = 5
y = numpy.arange(1,nStepsY+1)
#now i build an image using x-y-times coordinates and x-times weights
def mapIt(ithStep):
ncolumns = 80
image = numpy.zeros(ncolumns)
yTimed = y[ithStep]*times
positions = (numpy.round(x-yTimed)+50).astype(int)
values = numpy.bincount(positions,weights)
values = values[numpy.nonzero(values)]
positions = numpy.unique(positions)
image[positions] = values
return image
image = list(map(mapIt, range(nStepsY)))
image = numpy.array(image)
a = pyplot.imshow(image, aspect = 10)
Here the output plot
I tried to use numpy.apply_along_axis, but this function allows me to iterate only along the rows of image, while i need to iterate along the ithStep index too. E.g.:
#now i build an image using x-y-times coordinates and x-times weights
nrows = nStepsY
ncolumns = 80
matrix = numpy.zeros(nrows*ncolumns).reshape(nrows,ncolumns)
def applyIt(image):
image = numpy.zeros(ncolumns)
yTimed = y[ithStep]*times
positions = (numpy.round(x-yTimed)+50).astype(int)
values = numpy.bincount(positions,weights)
values = values[numpy.nonzero(values)]
positions = numpy.unique(positions)
image[positions] = values
return image
imageApplied = numpy.apply_along_axis(applyIt,1,matrix)
a = pyplot.imshow(imageApplied, aspect = 10)
It obviously return only the firs row nrows times, since nothing iterates ithStep:
And here the wrong plot
There is a way to iterate an index, or to use an index while numpy.apply_along_axis iterates?
Here the code with only matricial operations: it's quite faster than map or apply_along_axis but uses so much memory.
(in this function i use a trick with scipy.sparse, which works more intuitively than numpy arrays when you try to sum numbers on a same element)
def fullmatrix(nRows, nColumns):
y = numpy.arange(1,nStepsY+1)
image = numpy.zeros((nRows, nColumns))
yTimed = numpy.outer(y,times)
x3d = numpy.outer(numpy.ones(nStepsY),x)
weights3d = numpy.outer(numpy.ones(nStepsY),weights)
y3d = numpy.outer(y,numpy.ones(x.size))
positions = (numpy.round(x3d-yTimed)+50).astype(int)
matrix = sparse.coo_matrix((numpy.ravel(weights3d), (numpy.ravel(y3d), numpy.ravel(positions)))).todense()
return matrix
image = fullmatrix(nStepsY, 80)
a = pyplot.imshow(image, aspect = 10)
This way is simplier and very fast! Thank you so much.
nStepsY = 5
nRows = nStepsY
nColumns = 80
y = numpy.arange(1,nStepsY+1)
image = numpy.zeros((nRows, nColumns))
fakeRow = numpy.zeros(positions.size)
def itermatrix(ithStep):
yTimed = y[ithStep]*times
positions = (numpy.round(x-yTimed)+50).astype(int)
matrix = sparse.coo_matrix((weights, (fakeRow, positions))).todense()
matrix = numpy.ravel(matrix)
missColumns = (nColumns-matrix.size)
zeros = numpy.zeros(missColumns)
matrix = numpy.concatenate((matrix, zeros))
return matrix
for i in numpy.arange(nStepsY):
image[i] = itermatrix(i)
#or, without initialization of image:
imageMapped = list(map(itermatrix, range(nStepsY)))
imageMapped = numpy.array(imageMapped)
It feels like attempting to use map or apply_along_axis is obscuring the essentially iteration of the problem.
I rewrote your code as an explicit loop on y:
nStepsY = 5
y = numpy.arange(1,nStepsY+1)
image = numpy.zeros((nStepsY, 80))
for i, yi in enumerate(y):
yTimed = yi*times
positions = (numpy.round(x-yTimed)+50).astype(int)
values = numpy.bincount(positions,weights)
values = values[numpy.nonzero(values)]
positions = numpy.unique(positions)
image[i, positions] = values
a = pyplot.imshow(image, aspect = 10)
pyplot.show()
Looking at the code, I think I could calculate positions for all y values making a (y.shape[0],times.shape[0]) array. But the rest, the bincount and unique still have to work row by row.
apply_along_axis when working with a 2d array, and axis=1 essentially does:
res = np.zeros_like(arr)
for i in range....:
res[i,:] = func1d(arr[i,:])
If the input array has more dimensions it constructs a more elaborate indexing object [i,j,k,:]. And it can handle cases where func1d returns a different size array than the input. But in any case it is just a generalized iteration tool.
Moving the initial positions creation outside the loop:
yTimed = y[:,None]*times
positions = (numpy.round(x-yTimed)+50).astype(int)
image = numpy.zeros((positions.shape[0], 80))
for i, pos in enumerate(positions):
values = numpy.bincount(pos,weights)
values = values[numpy.nonzero(values)]
pos = numpy.unique(pos)
image[i, pos] = values
Now I can cast this as an apply_along_axis problem, with an applyIt that takes a positions vector (with all the yTimed information) rather than blank image vector.
def applyIt(pos, size, weights):
acolumn = numpy.zeros(size)
values = numpy.bincount(pos,weights)
values = values[numpy.nonzero(values)]
pos = numpy.unique(pos)
acolumn[pos] = values
return acolumn
image = numpy.apply_along_axis(applyIt, 1, positions, 80, weights)
Timing wise I expect it's a bit slower than my explicit iteration. It has to do more setup work, including a test call applyIt(positions[0,:],...) to determine the size of its return array (i.e image has different shape than positions.)
def csrmatrix(y, times, x, weights):
yTimed = numpy.outer(y,times)
n=y.shape[0]
x3d = numpy.outer(numpy.ones(n),x)
weights3d = numpy.outer(numpy.ones(n),weights)
y3d = numpy.outer(y,numpy.ones(x.size))
positions = (numpy.round(x3d-yTimed)+50).astype(int)
#print(y.shape, weights3d.shape, y3d.shape, positions.shape)
matrix = sparse.csr_matrix((numpy.ravel(weights3d), (numpy.ravel(y3d), numpy.ravel(positions))))
#print(repr(matrix))
return matrix
# one call
image = csrmatrix(y, times, x, weights)
# iterative call
alist = []
for yi in numpy.arange(1,nStepsY+1):
alist.append(csrmatrix(numpy.array([yi]), times, x, weights))
def mystack(alist):
# concatenate without offset
row, col, data = [],[],[]
for A in alist:
A = A.tocoo()
row.extend(A.row)
col.extend(A.col)
data.extend(A.data)
print(len(row),len(col),len(data))
return sparse.csr_matrix((data, (row, col)))
vimage = mystack(alist)

Write to HDF5 and shuffle big arrays of data

I have downloaded Caltech101. Its structure is:
#Caltech101 dir
#class1 dir
#images of class1 jpgs
#class2 dir
#images of class2 jpgs
...
#class100 dir
#images of class100 jpgs
My problem is that I can't keep in memory two np arrays x and y of shape (9144, 240, 180, 3) and (9144). So my solution is to overallocate a h5py dataset, load them in 2 chunks and write them to file one after the other. Precisely:
from __future__ import print_function
import os
import glob
from scipy.misc import imread, imresize
from sklearn.utils import shuffle
import numpy as np
import h5py
from time import time
def load_chunk(images_dset, labels_dset, chunk_of_classes, counter, type_key, prev_chunk_length):
# getting images and processing
xtmp = []
ytmp = []
for label in chunk_of_classes:
img_list = sorted(glob.glob(os.path.join(dir_name, label, "*.jpg")))
for img in img_list:
img = imread(img, mode='RGB')
img = imresize(img, (240, 180))
xtmp.append(img)
ytmp.append(label)
print(label, 'done')
x = np.concatenate([arr[np.newaxis] for arr in xtmp])
y = np.array(ytmp, dtype=type_key)
print('x: ', type(x), np.shape(x), 'y: ', type(y), np.shape(y))
# writing to dataset
a = time()
images_dset[prev_chunk_length:prev_chunk_length+x.shape[0], :, :, :] = x
print(labels_dset.shape)
print(y.shape, y.shape[0])
print(type(y), y.dtype)
print(prev_chunk_length)
labels_dset[prev_chunk_length:prev_chunk_length+y.shape[0]] = y
b = time()
print('Chunk', counter, 'written in', b-a, 'seconds')
return prev_chunk_length+x.shape[0]
def write_to_file(remove_DS_Store):
if os.path.isfile('caltech101.h5'):
print('File exists already')
return
else:
# the name of each dir is the name of a class
classes = os.listdir(dir_name)
if remove_DS_Store:
classes.pop(0) # removes .DS_Store - may not be used on other terminals
# need the dtype of y in order to initialize h5 dataset
s = ''
key_type_y = s.join(['S', str(len(max(classes, key=len)))])
classes = np.array(classes, dtype=key_type_y)
# number of chunks in which the dataset must be divided
nb_chunks = 2
nb_chunks_loaded = 0
prev_chunk_length = 0
# open file and allocating a dataset
f = h5py.File('caltech101.h5', 'a')
imgs = f.create_dataset('images', shape=(9144, 240, 180, 3), dtype='uint8')
labels = f.create_dataset('labels', shape=(9144,), dtype=key_type_y)
for class_sublist in np.array_split(classes, nb_chunks):
# loading chunk by chunk in a function to avoid memory overhead
prev_chunk_length = load_chunk(imgs, labels, class_sublist, nb_chunks_loaded, key_type_y, prev_chunk_length)
nb_chunks_loaded += 1
f.close()
print('Images and labels saved to \'caltech101.h5\'')
return
dir_name = '../Datasets/Caltech101'
write_to_file(remove_DS_Store=True)
This works quite well, and also reading is actually fast enough. The problem is that I need to shuffle the dataset.
Observations:
Shuffling the dataset objects: obviously veeeery slow because they're on disk.
Creating an array of shuffled indices and use advanced numpy indexing. This means slower reading from file.
Shuffling before writing to file would be nice, problem: I have only about half of the dataset in memory each time. I would get an improper shuffling.
Can you think of a way to shuffle before writing? I'm open also to solutions which rethink the writing process, as long as it doesn't use a lot of memory.
You could shuffle the file paths before reading the image data.
Instead of shuffling the image data in memory, create a list of all file paths that belong to the dataset. Then shuffle the list of file paths. Now you can create your HDF5 database as before.
You could for example use glob to create the list of files for shuffling:
import glob
import random
files = glob.glob('../Datasets/Caltech101/*/*.jpg')
shuffeled_files = random.shuffle(files)
You could then retrieve the class label and image name from the path:
import os
for file_path in shuffeled_files:
label = os.path.basename(os.path.dirname(file_path))
image_id = os.path.splitext(os.path.basename(file_path))[0]

Numpy compare 2 array shape, if different, append 0 to match shape

I am comparing 2 numpy arrays, and want to add them together. but, before doing so, i need to make sure they are the same size. If the size are not same, then take the smaller sized one and fill the last rows with zero to match the shape.
Both array have 16 columns and N rows. I am assuming it should be pretty straight forward, but I can't get my head around it. So far I am able to compare the 2 array shape.
import csv
import numpy as np
import sys
data = np.genfromtxt('./test1.csv', dtype=float, delimiter=',')
data_sys = np.genfromtxt('./test2.csv', dtype=float, delimiter=',')
print data.shape
print data_sys.shape
if data.shape != data_sys.shape:
print "we have an error"
This is the output I got:
=============New file.csv============
(603, 16)
(604, 16)
we have an error
I want the fill the last row of "data" array with 0 so that I can add the 2 arrays.
Thanks for your help.
You can use vstack(array1, array2) from numpy which stacks arrays vertically. For example:
A = np.random.randint(2, size = (2, 16))
B = np.random.randint(2, size = (5, 16))
print A.shape
print B.shape
if A.shape[0] < B.shape[0]:
A = np.vstack((A, np.zeros((B.shape[0] - A.shape[0], 16))))
elif A.shape[0] > B.shape[0]:
B = np.vstack((B, np.zeros((A.shape[0] - B.shape[0], 16))))
print A.shape
print A
In your case:
if data.shape[0] < data_sys.shape[0]:
data = np.vstack((data, np.zeros((data_sys.shape[0] - data.shape[0], 16))))
elif data.shape[0] > data_sys.shape[0]:
data_sys = np.vstack((data_sys, np.zeros((data.shape[0] - data_sys.shape[0], 16))))
I assume that your matrices have always the same number of columns, if not you can similarly use hstack to stack them horizontally.
If you have only two files, and their shapes differ in just the 0th dimension, a simple check and copy is probably easiest, though it lacks generality:
import numpy as np
data = np.genfromtxt('./test1.csv', dtype=float, delimiter=',')
data_sys = np.genfromtxt('./test2.csv', dtype=float, delimiter=',')
fill_value = 0 # could be np.nan or something else instead
if data.shape[0]>data_sys.shape[0]:
temp = data_sys
data_sys = np.ones(data.shape)*fill_value
data_sys[:temp.shape[0],:] = temp
elif data.shape[0]<data_sys.shape[0]:
temp = data
data = np.ones(data_sys.shape)*fill_value
data[:temp.shape[0],:] = temp
print 'Using conditional:'
print data.shape
print data_sys.shape
if data.shape != data_sys.shape:
print "we have an error"
A much more general solution is a custom class--overkill for your two files but much easier if you have lots of files to handle. The basic idea is that static class variables sx and sy keep track of the largest widths and heights, and are used when get_data is called, to output a standard shape array. This is pre-filled with your desired fill value, and the actual data from the corresponding file are copied into the upper left corner of the standard shape array:
import numpy as np
class IsomorphicArray:
sy = 0 # static class variable
sx = 0 # static class variable
fill_value = 0.0
def __init__(self,csv_filename):
self.data = np.genfromtxt(csv_filename,dtype=float,delimiter=',')
self.instance_sy,self.instance_sx = self.data.shape
if self.instance_sy>IsomorphicArray.sy:
IsomorphicArray.sy = self.instance_sy
if self.instance_sx>IsomorphicArray.sx:
IsomorphicArray.sx = self.instance_sx
def get_data(self):
out = np.ones((IsomorphicArray.sy,IsomorphicArray.sx))*self.fill_value
out[:self.instance_sy,:self.instance_sx] = self.data
return out
isomorphic_array_list = []
for filename in ['./test1.csv','./test2.csv']:
isomorphic_array_list.append(IsomorphicArray(filename))
numpy_array_list = []
for isomorphic_array in isomorphic_array_list:
numpy_array_list.append(isomorphic_array.get_data())
print 'Using custom class:'
for numpy_array in numpy_array_list:
print numpy_array.shape
Assuming both arrays have 16 columns
len1=len(data)
len2=len(data_sys)
if len1<len2:
data=np.append(data, np.zeros((len2-len1, 16)),axis=0)
elif len2<len1:
data_sys=np.append(data_sys, np.zeros((len1-len2, 16)),axis=0)
print data.shape
print data_sys.shape
if data.shape != data_sys.shape:
print "we have an error"
else:
print "we r good"
Numpy provides an append function to add values to an array: see here for details. In multi-dimensional arrays you can define how the values should be added. As you have already the information which of your arrays is the smaller one, just add the desired number of zeroes with creating a zero filled array first by numpy.zeroes and then append it to your target array.
It might be necessary to flatten your array first and then to reshape it.
I had a similar situation. Two arrays of sizes mask_in:(n1,m1) and mask_ot:(n2,m2)that were generated through a mask of a 2D image of size (N,M) where A2 is larger than A1 and both share a common center (X0,Y0). I followed the approach suggested by #AniaG using vstack and hstack. I simply obtained the shapes of both arrays, size difference and finally account the number of missing elements at both ends.
Here is what I got:
mask_in = np.random.randint(2, size = (2, 8))
mask_ot = np.random.randint(2, size = (6, 16))
mask_in_amp = mask_in
dif_row = mask_ot.shape[0]-mask_in_amp.shape[0]
dif_col = mask_ot.shape[1]-mask_in_amp.shape[1]
complete_row = dif_row / 2
complete_col = dif_col / 2
mask_in_amp = np.vstack((mask_in_amp, np.zeros((complete_row, mask_in_amp.shape[1]))))
mask_in_amp = np.vstack((np.zeros((complete_row, mask_in_amp.data.shape[1])), mask_in_amp))
mask_in_amp = np.hstack((mask_in_amp, np.zeros((mask_in_amp.shape[0],complete_col))))
mask_in_amp = np.hstack((np.zeros((mask_in_amp.shape[0],complete_col)), mask_in_amp))
If you don't care about the exact shapes of two arrays you can also do the following:
if data.size == datasys.size:
print ('arrays have the same number of elements, and possibly shape')
else:
print ('arrays do not have the same shape for sure')

Categories