Python (numpy) crashes system with large number of array elements

I'm trying to build a basic character recognition model using the many classifiers that scikit provides. The dataset being used is a standard handwritten set of alphanumeric samples (Chars74K image dataset taken from this source: EnglishHnd.tgz).
There are 55 samples of each character (62 alphanumeric characters in all), each sample being 900x1200 pixels. I'm converting each image to grayscale and flattening the matrix into a 1x1080000 array (each element representing a feature).
for sample in sample_images:  # sample_images is the list of the .png files
    img = imread(sample)
    img_gray = rgb2gray(img)
    if n == 0 and m == 0:  # n and m are global variables
        n, m = np.shape(img_gray)
    img_gray = np.reshape(img_gray, n*m)
    img_gray = np.append(img_gray, sample_id)  # sample_id stores the label of the training sample
    if len(samples) == 0:  # samples is the final numpy ndarray
        samples = np.append(samples, img_gray)
        samples = np.reshape(samples, [1, n*m + 1])
    else:
        samples = np.append(samples, [img_gray], axis=0)
So the final data structure should have 55x62 arrays, where each array holds 1,080,000 elements (plus the appended label). Only the final structure is being stored (the scope of the intermediate matrices is local).
The amount of data being stored to learn the model is pretty large (I guess), because the program isn't really progressing beyond a point, and it crashed my system to the extent that the BIOS had to be repaired!
Up to this point, the program is only gathering the data to send to the classifier ... the classification hasn't even been introduced into the code yet.
Any suggestions as to what can be done to handle the data more efficiently?
Note: I'm using numpy to store the final structure of flattened matrices.
Also, the system has 8 GB of RAM.

This seems like a case of stack overflow. You have 3,682,800,000 array elements, if I understand your question. What is the element type? If it is one byte, that is about 3 gigabytes of data, easily enough to fill up your stack size (usually about 1 megabyte). Even with one bit per element, you are still at about 500 MB. Try using heap memory (up to 8 GB on your machine).
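For a rough sense of scale, here is a quick back-of-the-envelope sketch. The sample counts come from the question; the dtypes are just assumptions about how the pixels might be stored (note that skimage's rgb2gray returns float64):

n_samples = 55 * 62            # 3410 images
n_features = 900 * 1200 + 1    # flattened pixels plus the appended label

for dtype_name, nbytes in [("uint8", 1), ("float32", 4), ("float64", 8)]:
    total_gib = n_samples * n_features * nbytes / 1024**3
    print(f"{dtype_name}: {total_gib:.1f} GiB")

# uint8:   3.4 GiB
# float32: 13.7 GiB
# float64: 27.4 GiB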

I was encouraged to post this as a solution, although the comments above are probably more enlightening.
The issue with the user's program is twofold: really, it's just overwhelming the stack.
Much more common, especially with image processing in things like computer graphics or computer vision, is to process the images one at a time. This could work well with sklearn, where you could just be updating your model as you read in each image (a minimal sketch of that incremental approach follows at the end of this answer).
You could use this bit of code, adapted from another Stack Overflow answer:
import os

rootdir = '/path/to/my/pictures'

for subdir, dirs, files in os.walk(rootdir):
    for file in files:
        if file[-3:] == 'png':  # or whatever your file type is / some check
            # do your training here
            img = imread(os.path.join(subdir, file))  # join with subdir so nested files resolve
            img_gray = rgb2gray(img)
            if n == 0 and m == 0:  # n and m are global variables
                n, m = np.shape(img_gray)
            img_gray = np.reshape(img_gray, n*m)
            # sample_id stores the label of the training sample
            img_gray = np.append(img_gray, sample_id)
            # samples is the final numpy ndarray
            if len(samples) == 0:
                samples = np.append(samples, img_gray)
                samples = np.reshape(samples, [1, n*m + 1])
            else:
                samples = np.append(samples, [img_gray], axis=0)
This is more pseudocode than a finished solution, but the general flow should give the right idea. Let me know if there's anything else I can do! Also check out OpenCV if you're interested in some cool computer vision and deep learning algorithms. There's a bunch of cool stuff there, and images make for great sample data.
Hope this helps.
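To make the incremental idea concrete, here is a minimal sketch using scikit-learn's SGDClassifier, which supports out-of-core learning through partial_fit, so only one image needs to be in memory at a time. The path and the label_for() helper are placeholders for however you map files to labels:

import os
import numpy as np
from skimage.io import imread
from skimage.color import rgb2gray
from sklearn.linear_model import SGDClassifier

rootdir = '/path/to/my/pictures'       # placeholder path
all_labels = list(range(62))           # the full label set must be known up front
clf = SGDClassifier()                  # any estimator with partial_fit would do

for subdir, dirs, files in os.walk(rootdir):
    for fname in files:
        if not fname.endswith('.png'):
            continue
        img_gray = rgb2gray(imread(os.path.join(subdir, fname)))
        x = img_gray.reshape(1, -1)    # one sample, n*m features
        y = [label_for(fname)]         # hypothetical helper: file name -> label
        # the full class list must be passed on (at least) the first call
        clf.partial_fit(x, y, classes=all_labels)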

Related

(Python/h5py) Memory-efficient processing of HDF5 dataset slices

I'm working in python with a large dataset of images, each 3600x1800. There are ~360000 images in total. Each image is added to the stack one by one, since an initial processing step is run on each image. H5py has proven effective at building the stack, image by image, without filling the entire memory.
The analysis I am running is calculated on the grid cells - so on 1x1x360000 slices of the stack. Since the analysis of each slice depends on the max and min values within that slice, I think it is necessary to hold the 360000-long array in memory. I have a fair bit of RAM to work with (~100GB) but not enough to hold the entire stack of 3600x1800x360000 in memory at once.
This means I need (or I think I need) a time-efficient way of accessing the 360000-long arrays. While h5py is efficient at adding each image to the stack, it seems that slicing perpendicular to the images is much, much slower (hours or more).
Am I missing an obvious method to slice the data perpendicular to the images?
Code below is a timing benchmark for 2 different slice directions:
import time
import h5py

file = "file/path/to/large/stack.h5"

t0 = time.time()
with h5py.File(file, 'r') as f:
    dat = f['Merged_liqprec'][:, :, 1]
print('Time = ' + str(time.time() - t0))

t1 = time.time()
with h5py.File(file, 'r') as f:
    dat = f['Merged_liqprec'][500, 500, :]
print('Time = ' + str(time.time() - t1))
Output:
## time to read an image slice, e.g. [:,:,1]:
Time = 0.0701
## time to read a slice through the image stack, e.g. [500,500,:]:
Time = multiple hours, server went offline for maintenance while running
You have the right approach. Using numpy slicing notation to read only the slice of interest reduces the memory footprint.
However, with this large dataset, I suspect I/O performance is going to depend on chunking: 1) Did you define chunks= when you created the dataset, and 2) if so, what is the chunk size? The chunk is the shape of the data block used when reading or writing. When an element in a chunk is accessed, the entire chunk is read from disk. As a result, you will not be able to optimize the shape for both writing images (3600x1800x1) and reading stacked slices (1x1x360000). The optimal shape for writing an image is (shape[0], shape[1], 1) and for reading a stacked slice is (1, 1, shape[2]).
Tuning the chunk shape is not trivial. h5py docs recommend the chunk size should be between 10 KiB and 1 MiB, larger for larger datasets. You could start with chunks=True (h5py determines the best shape) and see if that helps.
Assuming you will create this file once, and read many times, I suggest optimizing for reading. However, creating the file may take a long time. I wrote a simple example that you can use to "tinker" with the chunk shape on a small file to observe the behavior before you work on the large file. The table below shows the effect of different chunk shapes on 2 different file sizes. The code follows the table.
Dataset shape=(36,18,360)

chunks        Writing (s)   Reading (s)
None          0.349         0.102
(36,18,1)     0.187         0.436
(1,1,360)     2.347         0.095

Dataset shape=(144,72,1440) (4x4x4 larger)

chunks               Writing (s)   Reading (s)
None                 59.963        1.248
(9, 9, 180) / True   11.334        1.588
(144, 72, 100)       79.844        2.637
(10, 10, 1440)       56.945        1.464
Code below:
import time
import numpy as np
import h5py

f = 4
a0, a1, n = f*36, f*18, f*360
c0, c1, cn = a0, a1, 100
# c0, c1, cn = 10, 10, n

arr = np.random.randint(256, size=(a0, a1))

print(f'Writing dataset shape=({a0},{a1},{n})')
start = time.time()
with h5py.File('SO_73791464.h5', 'w') as h5f:
    # Use chunks=True for the default size or chunks=(c0,c1,cn) for a user-defined size
    ds = h5f.create_dataset('images', dtype=int, shape=(a0, a1, n), chunks=True)
    print(f'chunks={ds.chunks}')
    for i in range(n):
        ds[:, :, i] = arr
print(f"Time to create file:{(time.time()-start): .3f}")

start = time.time()
with h5py.File('SO_73791464.h5', 'r') as h5f:
    ds = h5f['images']
    for i in range(ds.shape[0]):
        for j in range(ds.shape[1]):
            dat = ds[i, j, :]
print(f"Time to read file:{(time.time()-start): .3f}")
HDF5 chunked storage improves I/O time, but reading small slices (e.g., [1,1,360000]) can still be very slow. To further improve performance, you need to read larger slices into an array, as described by @Jérôme Richard in the comments below your question. You can then quickly access any single slice from that array (because it is in memory).
This answer combines the 2 techniques: 1) HDF5 chunked storage (from my first answer), and 2) reading large slices into an array and then reading a single [i,j] slice from that array. The code to create the file is very similar to the first answer. It is set up to create a dataset of shape [3600, 1800, 100] with the default chunk size ([113, 57, 7] on my system). You can increase n to test with larger datasets.
When reading the file, the large slice [0] and [1] dimensions are set equal to the associated chunk shape (so I only access each chunk once). As a result, the process to read the file is "slightly more complicated" (but worth it). There are 2 loops: the 1st loop reads a large slice into an array 'dat_chunk', and the 2nd loop reads a [1,1,:] slice from 'dat_chunk' into a second array 'dat'.
The difference in timing for 100 images is dramatic. It takes 8 seconds to read all of the data using the method below. It took 74 minutes with the first answer (reading each [i,j] pair directly). Clearly that is too slow. Just for fun, I increased the dataset size to 1000 images (shape=[3600, 1800, 1000]) in my test file and reran. It takes 4:33 (m:ss) to read all slices with this method. I didn't even try with the previous method (for obvious reasons). Note: my computer is pretty old and slow with an HDD, so your timing data should be faster.
Code to create the file:
import time
import numpy as np
import h5py

a0, a1, n = 3600, 1800, 100

print(f'Writing dataset shape=({a0},{a1},{n})')
start = time.time()
with h5py.File('SO_73791464.h5', 'w') as h5f:
    ds = h5f.create_dataset('images', dtype=int, shape=(a0, a1, n), chunks=True)
    print(f'chunks={ds.chunks}')
    for i in range(n):
        arr = np.random.randint(256, size=(a0, a1))
        ds[:, :, i] = arr
print(f"Time to create file:{(time.time()-start): .3f}")
Code to read the file using large array slices:
import time
import h5py

start = time.time()
with h5py.File('SO_73791464.h5', 'r') as h5f:
    ds = h5f['images']
    print(f'shape={ds.shape}')
    print(f'chunks={ds.chunks}')
    ds_i_max, ds_j_max, ds_k_max = ds.shape
    ch_i_max, ch_j_max, ch_k_max = ds.chunks
    i = 0
    while i < ds_i_max:
        i_stop = min(i + ch_i_max, ds_i_max)
        print(f'i_range: {i}:{i_stop}')
        j = 0
        while j < ds_j_max:
            j_stop = min(j + ch_j_max, ds_j_max)
            print(f'  j_range: {j}:{j_stop}')
            # read one chunk-aligned block into memory...
            dat_chunk = ds[i:i_stop, j:j_stop, :]
            # ...then pull the individual [1,1,:] slices from the in-memory block
            for ic in range(dat_chunk.shape[0]):
                for jc in range(dat_chunk.shape[1]):
                    dat = dat_chunk[ic, jc, :]
            j = j_stop
        i = i_stop
print(f"Time to read file:{(time.time()-start): .3f}")

Running out of memory when building features (converting images into derived features [numpy arrays])?

I copied some image data to an instance on Google Cloud (8 vCPU's, 64GB memory, Tesla K80 GPU) and am running into memory problems when converting the raw data into features, and changing the data structure of the output. Eventually I'd like to use the derived features in Keras/Tensorflow neural net.
Process
After copying the data to a storage bucket, I run a build_features.py function to convert the raw data into processed data for the neural network. In this pipeline, I first take each raw image and put it into a list x (which stores the derived features).
Since I'm working with a large number of images (tens of thousands of images that are type float32 and have dimensions 250x500x3), the list x becomes quite large. Each element of x is a numpy array that stores the image in shape 250x500x3.
Problem 1 - reduced memory as list x grows
I took two screenshots that show available memory decreasing as x grows. I'm eventually able to complete this step, but I'm only left with a few GB of memory, so I definitely want to fix this (in the future I want to work with larger data sets). How can I build features in a way where I'm not limited by the size of x?
Problem 2 - Memory error when converting x into numpy array
The step where the instance actually fails is the following:
x = np.array(x)
The failure message is:
Traceback (most recent call last):
  File "build_features.py", line 149, in <module>
    build_features(pipeline='9_11_2017_fan_3_lights')
  File "build_features.py", line 122, in build_features
    x = np.array(x)
MemoryError
How can I adjust this step so that I don't run out of memory?
Your code has two copies of every image - one in the list, and one in the array:
images = []
for i in range(many):
    images.append(load_img(i))  # here's the first copy of each image
x = np.array(images)            # join them all together into a second copy
Just load the images straight into a preallocated array:
x = np.zeros((many, 250, 500, 3))
for i in range(many):
    x[i] = load_img(i)
This means that you only hold a temporary copy of one image at a time alongside the full array.
If you don't know the size or dtype of the image ahead of time, or don't want to hard code it, you can use:
x0 = load_img(0)
x = np.zeros((many,) + x0.shape, x0.dtype)
x[0] = x0
for i in range(1, many):
    x[i] = load_img(i)
Having said that, you're on a tricky path here. If you don't have enough room to store your dataset twice in memory, you also don't have room to compute y = x + 1.
You might want to consider using np.float16 to buy more storage, at the cost of precision.
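As a rough sketch of that trade-off (reusing the hypothetical load_img helper and the many count from the snippets above), you can preallocate the array in half precision and let each image be downcast on assignment:

import numpy as np

# float16 halves the footprint versus float32 (2 bytes per value instead of 4)
x = np.zeros((many, 250, 500, 3), dtype=np.float16)
for i in range(many):
    x[i] = load_img(i)         # values are cast to float16 as they are assigned

print(x.nbytes / 1024**3)      # roughly 7 GiB for many = 10000 images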

Moving/running window of a Multi-dimensional image array

I am trying to work on an efficient numpy solution to perform a running average of an array of color images across the 4th dimension. A set of color images in a directory is read in a loop, and I would like to average them in subsets of 3, i.e. if there are n = 5 color images in the directory I would like to average [1,2,3], [2,3,4], [3,4,5], [4,5,1], and [5,1,2], thus writing 5 output average images.
from os import listdir
from os.path import isfile, join
import numpy as np
import cv2
from matplotlib import pyplot as plt

mypath = 'C:/path/to/5_image/dir'
onlyfiles = [f for f in listdir(mypath) if isfile(join(mypath, f))]

img = np.empty(len(onlyfiles), dtype=object)
temp = np.zeros((960, 1280, 3, 3), dtype='uint8')
temp_avg = np.zeros((960, 1280, 3), dtype='uint8')

for n in range(0, len(onlyfiles)):
    img[n] = cv2.imread(join(mypath, onlyfiles[n]))

for n in range(0, len(img)):
    if (n+2) < len(img)-1:
        temp[:, :, :, 0] = img[n]
        temp[:, :, :, 1] = img[n + 1]
        temp[:, :, :, 2] = img[n + 2]
        temp_avg = np.mean(temp, axis=3)
        plt.imshow(temp_avg)
        plt.show()
    else:
        break
This script is in no way complete or elegant. The issue I am having is that while plotting the averaged images, the color space seems distorted and appears like CMYK. I am also not accounting for the last two moving windows, [4,5,1] and [5,1,2]. Critique and suggestions welcome.
For performing local operations (such as a running average) across the pixels of an image (or across multiple images), convolution with a kernel is usually a good approach.
Here's how this could be done in your case.
Generating Some Example Data
I used the following to generate 10 images containing random noise to work with:
import numpy as np
import cv2

for i in range(10):
    # uint8 so that cv2.imwrite gets a supported image depth
    an_img = np.random.randint(0, 256, (960, 1280, 3), dtype=np.uint8)
    cv2.imwrite("img_" + str(i) + ".png", an_img)
Preparing the Images
This is how I load the images back in:
import os
from os import listdir
from os.path import join

import numpy as np
import cv2

# Get target file names
mypath = os.getcwd()  # or whatever path you like
fnames = [f for f in listdir(mypath) if f.endswith('.png')]

# Create an array to hold all the images
first_img = cv2.imread(join(mypath, fnames[0]))
y, x, c = first_img.shape
all_imgs = np.empty((len(fnames), y, x, c), dtype=np.uint8)

# Load all the images
for i, fname in enumerate(fnames):
    all_imgs[i, ...] = cv2.imread(join(mypath, fname))
Some notes:
I use f.endswith('.png') to be a bit more specific with how I generate the list of filenames, allowing other files to be in the same directory without causing problems.
I place all of the images in a single 4D uint8 array of shape (image,y,x,c) instead of the object array you were using. This is necessary to employ the convolution approach below.
I use the first image to get the dimensions of the images, which makes the code just a little bit more general.
Performing Local Averaging by Kernel Convolution
This is all it takes.
from scipy.ndimage import uniform_filter
done = uniform_filter(all_imgs, size=(3,0,0,0), origin=-1, mode='wrap')
Some notes:
I am using scipy.ndimage because it readily allows for its convolution filters to be applied to images with many dimensions (4 in your case). For cv2, I am only aware of cv2.filter2D, which does not have that functionality as far as I know. However, I am not very familiar with cv2, so I may be wrong about this (will edit if someone corrects me in a comment).
The size kwarg specifies the size of the kernel to use along each dimension of the array. By using (3,0,0,0), I make sure that only the first dimension (=the different images) is used for the averaging.
By default, the running window (or rather the kernel) is used to compute the value of its central pixel. To match this more closely with your code, I used origin=-1, so the kernel computes the value of the pixel one to the left of its center.
By default, the edge cases (the two last images in this case) are handled by padding with a reflection. Your question suggests that what you want is to use the first images again instead. This is done using mode='wrap'.
By default, the filter returns the result in the same dtype as the input, here np.uint8. This is probably desirable, but your example code produces floats, so perhaps you want the filter to return floats as well, which you can do by simply changing the dtype of the input, i.e. done = uniform_filter(all_imgs.astype(np.float64), size....
As for the distorted color space when you plot your averages: I cannot reproduce that. Your approach seems to produce the correct output for my random noise example images (after correction of the issue I pointed out in my comment to your question). Perhaps you could try plt.imshow(temp_avg, interpolation='none') to avoid possible artefacting from imshow's interpolation?

How to pass Pillow image data to scikit-learn?

I am trying to train an image classifier in scikit-learn. I have a bunch of input images and I am using Pillow to process them. My question is about what shape to give the Pillow data to scikit-learn.
This is my code now:
import glob

import numpy as np
from PIL import Image, ImageFilter
from sklearn import svm

training = glob.glob('./img/training/*/*.bmp')
data = []
classes = []
for imagefile in training:
    edges = Image.open(imagefile).filter(ImageFilter.FIND_EDGES).convert("L")
    in_data = np.asarray(edges, dtype=np.uint8)
    data.append(in_data[0])
    if 'class1' in imagefile:
        classes.append('class1')
    else:
        classes.append('class2')

clf = svm.SVC(gamma=0.001, C=100.)
clf.fit(data, classes)
This runs without errors, but I have put the code together fairly crudely and I am not sure it is correct.
In particular, I'm not sure whether I should be using in_data[0]. I just did this because using in_data gives me an error: ValueError: Found array with dim 3. Estimator expected <= 2.
Unless you want just the first row of the image matrix (in_data[0] returns the first row) of each image, you probably want to use flattening.
Flattening takes each row of the image matrix and puts the rows one after another into a one-dimensional vector.
So it becomes data.append(in_data.flatten())
You could resize your image to a smaller format first, to reduce the number of columns of your data matrix.
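As a minimal sketch of both suggestions combined (the 64x64 target size here is arbitrary, just to illustrate), the body of your loop would become:

from PIL import Image, ImageFilter
import numpy as np

edges = Image.open(imagefile).filter(ImageFilter.FIND_EDGES).convert("L")
edges = edges.resize((64, 64))          # smaller image means fewer columns per sample
in_data = np.asarray(edges, dtype=np.uint8)
data.append(in_data.flatten())          # one row of 64*64 = 4096 features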

Tips on processing a lot of images in python

I have been trying to process two huge files containing around 40000-50000 images in Python. But whenever I try to convert my datasets into a numpy array I get a MemoryError. I only have about 8GB of RAM, which isn't very much, but because I lack experience in Python, I wonder if there is any way I can resolve this issue by using some Python library I don't know about, or maybe by optimizing my code. I would like to hear your opinion on this matter.
My image processing code:
from sklearn.cluster import MiniBatchKMeans
import numpy as np
import glob
import os
from PIL import Image
from sklearn.decomposition import PCA

image_dir1 = "C:/Users/Ai/Desktop/KAGA FOLDER/C/train"
image_dir2 = "C:/Users/Ai/Desktop/KAGA FOLDER/C/test1"
Standard_size = (300, 200)
pca = PCA(n_components=10)
file_open = lambda x, y: glob.glob(os.path.join(x, y))

def matrix_image(image):
    """Opens an image and converts it to an m*n matrix."""
    image = Image.open(image)
    print("changing size from %s to %s" % (str(image.size), str(Standard_size)))
    image = image.resize(Standard_size)
    image = list(image.getdata())
    image = list(map(list, image))  # wrap in list() so this also works on Python 3
    image = np.array(image)
    return image

def flatten_image(image):
    """
    Takes in an n*m numpy array and flattens it to
    an array of size (1, m*n).
    """
    s = image.shape[0] * image.shape[1]
    image_wide = image.reshape(1, s)
    return image_wide[0]

if __name__ == "__main__":
    train_images = file_open(image_dir1, "*.jpg")
    test_images = file_open(image_dir2, "*.jpg")
    train_set = []
    test_set = []

    # Loop over all images in the folders and modify them
    train_set = [flatten_image(matrix_image(image)) for image in train_images]
    test_set = [flatten_image(matrix_image(image)) for image in test_images]
    train_set = np.array(train_set)  # This is where the MemoryError occurs
    test_set = np.array(test_set)
Small edit: I'm using 64-bit python
Assuming a 4-byte integer for each pixel, you are trying to hold about 11.2 GB of data in memory (4*300*200*50000 / (1024)**3). Half that for a 2-byte integer.
You have a few options:
Reduce the number or size of images you are trying to hold in memory
Use a file or database to hold the data instead of memory (may be too slow for some applications)
Use the memory you have more effectively...
Instead of copying from list to numpy, which will temporarily use twice the amount of memory, as you do here:
test_set = [flatten_image(matrix_image(image))for image in test_images]
test_set = np.array(test_set)
Do this:
n = len(test_images)
test_set = np.zeros((n, 300*200), dtype=int)
for i in range(n):
    test_set[i] = flatten_image(matrix_image(test_images[i]))
Since your files are JPEGs and you have 300x200 images, for a 24-bit color image you're looking at approximately 1.4 MB per file and at least a whopping 40.2 GB overall:
In [4]: import humanize # `pip install humanize` if you need it
In [5]: humanize.naturalsize(300*200*24, binary=True)
Out[5]: '1.4 MiB'
In [6]: humanize.naturalsize(300*200*24*30000, binary=True)
Out[6]: '40.2 GiB'
If you have grayscale, you likely have 8-bit images which rings in at 13.4 GB:
In [7]: humanize.naturalsize(300*200*8, binary=True)
Out[7]: '468.8 KiB'
In [8]: humanize.naturalsize(300*200*8*30000, binary=True)
Out[8]: '13.4 GiB'
This is only for one copy too. Depending on the operations, this could get much bigger.
Going bigger
You could always rent some time on a server with more memory.
AWS - Up to 224GB
Rackspace - Up to 120GB
DigitalOcean - Up to 96 GB
Azure - Up to 56 GB
Looking at these in terms of amount of RAM isn't the only way to think about which servers are best for your workload. There are other differences among providers including IOPS, number of cores, type of CPU, etc.
Test After you Train
After you train your model, you don't need the full set of training data. Delete what you can out of memory. Here in Python land that means not keeping references to the data. Strange beast, yes.
What this likely means is setting up your training data and creating your model within a function that only returns what you need.
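A minimal sketch of that pattern, assuming a hypothetical load_training_data() helper and a scikit-learn estimator: once train() returns, the arrays it built are no longer referenced and their memory can be reclaimed before you load the test set.

from sklearn.svm import SVC

def train():
    # hypothetical helper that builds the (n_samples, n_features) array and labels
    x, y = load_training_data()
    clf = SVC(gamma=0.001, C=100.)
    clf.fit(x, y)
    return clf          # only the fitted model escapes this scope

model = train()
# x and y were local to train(), so the garbage collector can free them here,
# before you load the test set and call model.predict(...)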
Reducing your memory footprint
Let's imagine for a moment that you could store it all in memory. One improvement you can make here is to convert directly from a PIL Image to a numpy array. Existing arrays are not copied, it's a view of the original data. However, it looks like you need to flatten as well into your vector space.
image = Image.open(image)
print("changing size from %s to %s" % (str(image.size), str(Standard_size)))
image = image.resize(Standard_size)
np_image = np.asarray(image).flatten()
EDIT: Actually, this helps your code's maintainability but doesn't help performance. You do this operation on each image in a function individually. The garbage collector will toss the old stuff. Move along, nothing to see here.
