I have been trying to process two huge files containing around 40000-50000 images in python. But whenever I try to convert my datasets into a numpy array I get a Memory error. I only have about 8GB RAM which isn't very much, but, because I lack experience in python, I wonder if there is any way that I can resolve this issue by using some python library I don't know about, or maybe by optimizing my code? I would like to hear your opinion on this matter.
My image processing code:
from sklearn.cluster import MiniBatchKMeans
import numpy as np
import glob
import os
from PIL import Image
from sklearn.decomposition import PCA
image_dir1 = "C:/Users/Ai/Desktop/KAGA FOLDER/C/train"
image_dir2 = "C:/Users/Ai/Desktop/KAGA FOLDER/C/test1"
Standard_size = (300,200)
pca = PCA(n_components = 10)
file_open = lambda x,y: glob.glob(os.path.join(x,y))
def matrix_image(image):
    "opens image and converts it to a m*n matrix"
    image = Image.open(image)
    print("changing size from %s to %s" % (str(image.size), str(Standard_size)))
    image = image.resize(Standard_size)
    image = list(image.getdata())
    image = map(list,image)
    image = np.array(image)
    return image
def flatten_image(image):
    """
    takes in a n*m numpy array and flattens it to
    an array of the size (1,m*n)
    """
    s = image.shape[0] * image.shape[1]
    image_wide = image.reshape(1,s)
    return image_wide[0]
if __name__ == "__main__":
    train_images = file_open(image_dir1,"*.jpg")
    test_images = file_open(image_dir2,"*.jpg")
    train_set = []
    test_set = []
    "Loop over all images in files and modify them"
    train_set = [flatten_image(matrix_image(image))for image in train_images]
    test_set = [flatten_image(matrix_image(image))for image in test_images]
    train_set = np.array(train_set) #This is where the Memory Error occurs
    test_set = np.array(test_set)
Small edit: I'm using 64-bit python
Assuming a 4-byte integer for each pixel, you are trying to hold about 11.2 GB of data in memory (4*300*200*50000 / 1024**3). Halve that for a 2-byte integer.
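The estimate above is easy to reproduce for other dtypes. A minimal sketch, assuming 50,000 images of 300x200 with one value per pixel; the helper name `array_gib` is made up for illustration:

```python
import numpy as np

def array_gib(n_images, width, height, dtype, channels=1):
    """Return the size in GiB of an array holding all images with the given dtype."""
    n_values = n_images * width * height * channels
    return n_values * np.dtype(dtype).itemsize / 1024**3

# One value per pixel, as in the estimate above
for dtype in (np.int32, np.int16, np.uint8):
    print(np.dtype(dtype).name, round(array_gib(50000, 300, 200, dtype), 1), "GiB")
```

Passing `channels=3` gives the figure for RGB data stored without flattening the color channels away.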
You have a few options:
Reduce the number or size of images you are trying to hold in memory
Use a file or database to hold the data instead of memory (may be too slow for some applications)
Use the memory you have more effectively...
Instead of copying from list to numpy, which will temporarily use twice the amount of memory, as you do here:
test_set = [flatten_image(matrix_image(image))for image in test_images]
test_set = np.array(test_set)
Do this:
n = len(test_images)
test_set = np.zeros((n, 300*200), dtype=int)
for i in range(n):
    test_set[i] = flatten_image(matrix_image(test_images[i]))
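The second option in the list above (holding the data in a file instead of memory) can be sketched with `np.memmap`, which exposes a file on disk as an array. This is only an illustration: `fill_memmap` and `fake_loader` are hypothetical names, and the loader stands in for `flatten_image(matrix_image(...))`.

```python
import os
import tempfile
import numpy as np

def fill_memmap(path, n_images, n_features, load_row):
    """Write one flattened image per row into a disk-backed uint8 array."""
    data = np.memmap(path, dtype=np.uint8, mode="w+", shape=(n_images, n_features))
    for i in range(n_images):
        data[i] = load_row(i)  # only this row needs to be in memory
    data.flush()               # make sure everything is written to disk
    return data

def fake_loader(i):
    # Stand-in for flatten_image(matrix_image(test_images[i]))
    return np.full(300 * 200, i % 256, dtype=np.uint8)

path = os.path.join(tempfile.mkdtemp(), "test_set.dat")
test_set = fill_memmap(path, 5, 300 * 200, fake_loader)

# Later, reopen the same file read-only without loading it all at once
reopened = np.memmap(path, dtype=np.uint8, mode="r", shape=(5, 300 * 200))
print(reopened[3, :5])
```

Access is slower than a plain in-memory array, which is the trade-off mentioned in the list.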
Since your files are JPEGs and you have 300x200 images, a 24-bit color image takes 3 bytes per pixel once decoded. Stored as uint8, that is approximately 176 KB per image and about 5 GB overall for 30,000 images:
In [4]: import humanize # `pip install humanize` if you need it
In [5]: humanize.naturalsize(300*200*3, binary=True)
Out[5]: '175.8 KiB'
In [6]: humanize.naturalsize(300*200*3*30000, binary=True)
Out[6]: '5.0 GiB'
If you have grayscale, you likely have 8-bit images (1 byte per pixel), which comes to about 1.7 GB:
In [7]: humanize.naturalsize(300*200, binary=True)
Out[7]: '58.6 KiB'
In [8]: humanize.naturalsize(300*200*30000, binary=True)
Out[8]: '1.7 GiB'
This is only for one copy too. Depending on the operations, this could get much bigger.
Going bigger
You could always rent some time on a server with more memory.
AWS - Up to 224GB
Rackspace - Up to 120GB
DigitalOcean - Up to 96 GB
Azure - Up to 56 GB
Looking at these in terms of amount of RAM isn't the only way to think about which servers are best for your workload. There are other differences among providers including IOPS, number of cores, type of CPU, etc.
Test After you Train
After you train your model, you don't need the full set of training data. Delete what you can out of memory. Here in Python land that means not keeping references to the data. Strange beast, yes.
What this likely means is setting up your training data and creating your model within a function that only returns what you need.
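A minimal sketch of that idea, with made-up names: `load_training_data` stands in for reading and flattening the images, and the "model" is just per-feature statistics rather than a real estimator. The point is only that the large array is local to the function.

```python
import numpy as np

def load_training_data(n_samples=1000, n_features=50, seed=0):
    """Hypothetical loader; stands in for reading and flattening the images."""
    rng = np.random.default_rng(seed)
    return rng.random((n_samples, n_features), dtype=np.float32)

def train_model():
    """Build the training set, fit a model, and return only the model."""
    train_set = load_training_data()
    # Stand-in "model": per-feature mean and standard deviation (new, small arrays)
    model = {"mean": train_set.mean(axis=0), "std": train_set.std(axis=0)}
    return model  # train_set has no remaining references after this returns

model = train_model()
print(model["mean"].shape)
```

This only works if the returned object does not keep a view into the training array; a slice of `train_set` would keep the whole buffer alive.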
Reducing your memory footprint
Let's imagine for a moment that you could store it all in memory. One improvement you can make here is to convert directly from a PIL Image to a numpy array. np.asarray does not copy existing arrays; it returns a view of the original data. However, it looks like you also need to flatten each image into your vector space.
image = Image.open(image)
print("changing size from %s to %s" % (str(image.size), str(Standard_size)))
image = image.resize(Standard_size)
np_image = np.asarray(image).flatten()
EDIT: Actually, this helps your code's maintainability but doesn't help performance. You do this operation on each image in a function individually. The garbage collector will toss the old stuff. Move along, nothing to see here.
Related
I have a C++ program which writes pixel data for a 2D grid to a binary file. Each binary file may have numerous grid states back-to-back, and there may be multiple of these binary files.
e.g. I could have 10 binary files bin0, bin1, bin2... bin9 each holding data for 10 grid states for a total of 100 grid states to animate.
I'm looking for a fast way to create an animation from the grid states in these binary files.
My best attempt used python and PIL.Image to create a gif:
import glob
import numpy as np
from PIL import Image
from functools import partial
def create_images():
    paths = glob.glob('./outfiles/dump*')
    imgs = []
    for path in sorted(paths, key=lambda x: int(x.split("p")[1])):
        with open(path, 'rb') as ifile:
            for dat in iter(partial(ifile.read, rows*cols), b''):
                mem = memoryview(dat).cast('B', shape=[rows,cols])
                arr = np.asarray(mem)
                img = Image.fromarray(arr*255)
                imgs.append(img)
    return imgs
rows = 128
cols = 128
imgs = create_images()
imgs[0].save('./animation.gif', save_all=True, append_images=imgs[1:], loop=0)
but the final line where I actually write the gif can take a long time if I have potentially thousands of images, each with thousands of pixels. The rendering of the gif is also poor quality when the images are large.
Looking forward to suggestions for how to make this run fast using a different library from Pillow. Not bothered about sticking to Python if there is a better alternative using C/C++ (or other languages, but Python or C/C++ preferred), nor does the animation have to be a gif.
In my case I'm working with grid data which is either 0 or 1 (the context is Conway's Game of Life), so optimizations which take advantage of this would be welcome. Note, however, that currently each 1 or 0 occupies a whole byte in the binary file, therefore are not packed into bits.
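To make the byte-per-cell layout concrete, here is a small sketch of what packing into bits would look like with NumPy. It assumes a 128x128 grid as in the code above, and the frame is random data standing in for one grid state read from a file.

```python
import numpy as np

rows, cols = 128, 128

# Stand-in for one grid state read from a binary file: one byte per cell, values 0 or 1
frame_bytes = np.random.randint(2, size=rows * cols, dtype=np.uint8).tobytes()

frame = np.frombuffer(frame_bytes, dtype=np.uint8).reshape(rows, cols)
packed = np.packbits(frame, axis=1)       # 8 cells per byte along each row
restored = np.unpackbits(packed, axis=1)  # back to one byte per cell

print(frame.nbytes, packed.nbytes)
```

Packing reduces storage by a factor of eight, but it does not by itself speed up the GIF encoding step.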
EDIT
Just adding a helper python script to generate binary files as I described above for anyone who gives it a go.
import numpy as np
rows = 128
cols = 128
img_per_file = 10
num_files = 3
filesize = rows*cols*img_per_file
for i in range(num_files):
    filename = 'bin'+str(i)
    data = np.random.randint(2, size=filesize, dtype=np.uint8).tobytes()
    f = open(filename, 'wb')
    f.write(data)
    f.close()
I have the following snippet:
from aicspylibczi import CziFile
from pathlib import Path
pth = Path('/Volumes/USB/20x_HE.czi')
czi = CziFile(pth)
image, shp = czi.read_image(C=0, M=0) # very slow
The parameters C and M are there to slice the big array into small numpy pieces.
The file is 3.4 GB, and reading it takes so long (on a MacBook with 8 GB of RAM) that I always abort it.
I don't think that should be necessary, because I only want the first slice of the array, not the whole matrix.
You can try slideio python package (http://slideio.com). It makes use of internal image pyramids. You can read the image partially with high resolution or the whole image with low resolution.
The code below rescales the image so that the width of the delivered raster will be 500 pixels (the height is computed to keep the image size ratio).
import slideio
slide = slideio.open_slide(file_path="/data/a.czi", driver_id="CZI")
scene = slide.get_scene(0)
block = scene.read_block(size=(500,0))
By slice do you mean the first slice of a z-stack? The package you are using, aicspylibczi, allows you to specify a z coordinate e.g. to read the first z-slice:
image, shp = czi.read_image(C=0, M=0, Z=0)
I have a list of about 15,000 images, each 120x90 pixels. I'm trying to convert them into a NumPy array, but when I do, my computer runs out of memory (8 GB of RAM + 12 GB swap). Once this is done, I'll save the result to a file for future machine learning training.
dataSet = genDataSet()
for image in dataSet:
    pixelImages.append([imageToRGB(image[0], True),image[1]])
def imageToRGB(inputFile, normalise = False):
    os.chdir("/home/spchee/CodeProjects/School Project/images")
    img = Image.open(inputFile) #Opens File
    pixels = np.asarray(img) #Converts it to a numpy array
    pixels = np.rint(pixels)
    if normalise: #This normalises it between the values of 0 and 1
        pixels= pixels/255
    img.close()
    return pixels
The function genDataSet() returns a list in the form of [[filepath1, genre], [filepath2, genre]...].
When running this code it runs out of memory so my computer almost totally freezes and I'm forced to force stop it.
I copied some image data to an instance on Google Cloud (8 vCPU's, 64GB memory, Tesla K80 GPU) and am running into memory problems when converting the raw data into features, and changing the data structure of the output. Eventually I'd like to use the derived features in Keras/Tensorflow neural net.
Process
After copying the data to a storage bucket, I run a build_features.py function to convert the raw data into processed data for the neural network. In this pipeline, I first take each raw image and put it into a list x (which stores the derived features).
Since I'm working with a large number of images (tens of thousands of images that are type float32 and have dimensions 250x500x3) the list x becomes quite large. Each element of x is numpy array that stores the image in shape 250x500x3.
Problem 1 - reduced memory as list x grows
I took 2 screenshots that show available memory decreasing as x grows (below). I'm eventually able to complete this step but I'm only left with a few GB of memory so I definitely want to fix this (in the future I want to work with larger data sets). How can I build features in a way where I'm not limited by the size of x?
Problem 2 - Memory error when converting x into numpy array
The step where the instance actually fails is the following:
x = np.array(x)
The failure message is:
Traceback (most recent call last):
File "build_features.py", line 149, in <module>
build_features(pipeline='9_11_2017_fan_3_lights')
File "build_features.py", line 122, in build_features
x = np.array(x)
MemoryError
How can I adjust this step so that I don't run out of memory?
Your code has two copies of every image - one in the list, and one in the array:
images = []
for i in range(many):
    images.append(load_img(i))  # here's the first copy of each image
x = np.array(images)  # join them all together into a second copy
Just load the images straight into the array:
x = np.zeros((many, 250, 500, 3), dtype=np.float32)
for i in range(many):
    x[i] = load_img(i)
This means that, besides the array itself, you only hold one extra image at a time.
If you don't know the size or dtype of the image ahead of time, or don't want to hard code it, you can use:
x0 = load_img(0)
x = np.zeros((many,) + x0.shape, x0.dtype)
x[0] = x0  # keep the image that was loaded to find the shape
for i in range(1, many):
    x[i] = load_img(i)
Having said that, you're on a tricky path here. If you don't have enough room to store your dataset twice in memory, you also don't have room to compute y = x + 1.
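When a second full-size array will not fit, in-place operations are one way around the `y = x + 1` problem, at the cost of overwriting the original values. A small sketch, with a much smaller array standing in for the real dataset:

```python
import numpy as np

x = np.zeros((10, 250, 500, 3), dtype=np.float32)  # small stand-in for the full dataset
address_before = x.ctypes.data                     # where x's buffer lives

# y = x + 1 would allocate a second array of the same size.
# These forms write the result back into x's existing buffer instead:
x += 1
np.multiply(x, 0.5, out=x)

print(x.ctypes.data == address_before)  # True: same buffer, no second copy
```

Any step that needs both the original and the transformed data at once still requires room for two arrays, or processing in chunks.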
You might want to consider using np.float16 to buy more storage, at the cost of precision.
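A quick way to see the trade-off, using a reduced number of images (100 here is an arbitrary stand-in for `many`):

```python
import numpy as np

shape = (100, 250, 500, 3)
for dtype in (np.float32, np.float16):
    n_bytes = int(np.prod(shape)) * np.dtype(dtype).itemsize
    print(np.dtype(dtype).name, n_bytes / 1024**2, "MiB")

# float16 keeps roughly three significant decimal digits
value = np.float32(0.123456)
print(value, np.float16(value))
```

Halving the storage this way is usually acceptable for pixel data that started as 8-bit values, but it is worth checking that later computations tolerate the reduced precision.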
I'm trying to build a basic character recognition model using the many classifiers that scikit provides. The dataset being used is a standard handwritten set of alphanumeric samples (Chars74K image dataset taken from this source: EnglishHnd.tgz).
There are 55 samples of each character (62 alphanumeric characters in all), each being 900x1200 pixels. I'm flattening the matrix (first converting to grayscale) into a 1x1080000 array (each representing a feature).
for sample in sample_images: # sample images is the list of the .png files
    img = imread(sample)
    img_gray = rgb2gray(img)
    if n == 0 and m == 0: # n and m are global variables
        n, m = np.shape(img_gray)
    img_gray = np.reshape(img_gray, n*m)
    img_gray = np.append(img_gray, sample_id) # sample id stores the label of the training sample
    if len(samples) == 0: # samples is the final numpy ndarray
        samples = np.append(samples, img_gray)
        samples = np.reshape(samples, [1, n*m + 1])
    else:
        samples = np.append(samples, [img_gray], axis=0)
So the final data structure should have 55x62 arrays, where each array is 1080000 elements in capacity. Only the final structure is being stored (the scope of the intermediate matrices is local).
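Two things about the loop above can be checked in isolation. `np.append` never grows an array in place; it allocates a new array and copies the old contents on every call. Separately, the full structure is large: assuming 8-byte floats (what a float64 grayscale conversion would produce), it comes to roughly 27 GiB. The sketch below uses small stand-in sizes and a hypothetical `fake_sample` helper.

```python
import numpy as np

n_samples, n_features = 20, 1000  # small stand-ins for 55*62 samples of 900*1200 pixels

def fake_sample(i):
    """Hypothetical stand-in for a flattened grayscale image plus its label."""
    return np.full(n_features + 1, i, dtype=np.float64)

# Growing with np.append: every call allocates a new array and copies all earlier rows
grown = np.empty((0, n_features + 1))
for i in range(n_samples):
    grown = np.append(grown, [fake_sample(i)], axis=0)

# Preallocating: one allocation, each row written in place
samples = np.empty((n_samples, n_features + 1), dtype=np.float64)
for i in range(n_samples):
    samples[i] = fake_sample(i)

# Size of the full dataset from the question if stored as float64
full_gib = 55 * 62 * (900 * 1200 + 1) * 8 / 1024**3
print(round(full_gib, 1), "GiB")
```

Both loops produce the same array; the difference is the amount of copying and the peak memory during construction.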
The amount of data being stored to learn the model is pretty large (I guess), because the program isn't really progressing beyond a point, and crashed my system to the extent that the BIOS had to be repaired!
Up to this point, the program is only gathering the data to send to the classifier ... the classification hasn't even been introduced into the code yet.
Any suggestions as to what can be done to handle the data more efficiently?
Note: I'm using numpy to store the final structure of flattened matrices.
Also, the system has 8 GB of RAM.
This seems like a case of stack overflow. If I understand your question, you have 3,682,800,000 array elements. What is the element type? If it is one byte, that is about 3.7 GB of data, easily enough to fill up your stack (usually about 1 megabyte). Even at one bit per element, you are still at about 460 MB. Try using heap memory (up to 8 GB on your machine).
I was encouraged to post this as a solution, although the comments above are probably more enlightening.
The issue with the user's program is twofold. Really, it's just overwhelming the stack.
Much more common, especially with image processing in things like computer graphics or computer vision, is to process the images one at a time. This could work well with sklearn where you could just be updating your models as you read in the image.
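The one-batch-at-a-time pattern can be sketched without any model at all. Here a running per-feature mean is updated as batches arrive; `image_batches` is a hypothetical generator standing in for reading and flattening a few images at a time. scikit-learn estimators that offer `partial_fit` (for example `MiniBatchKMeans` or `IncrementalPCA`) are fed batches in the same kind of loop.

```python
import numpy as np

def image_batches(n_batches=5, batch_size=4, n_features=100, seed=0):
    """Hypothetical generator; stands in for reading and flattening a few images at a time."""
    rng = np.random.default_rng(seed)
    for _ in range(n_batches):
        yield rng.random((batch_size, n_features))

count = 0
total = None
for batch in image_batches():
    if total is None:
        total = np.zeros(batch.shape[1])
    total += batch.sum(axis=0)  # update the running statistic
    count += batch.shape[0]     # the batch itself can be discarded after this

running_mean = total / count
print(running_mean.shape, count)
```

Only one batch is held in memory at any time, so the peak footprint depends on the batch size rather than on the size of the dataset.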
You could use this bit of code found from this stack article:
import os
rootdir = '/path/to/my/pictures'
for subdir, dirs, files in os.walk(rootdir):
    for file in files:
        if file.endswith('.png'):  # or whatever your file type is / some check
            # do your training here
            img = imread(os.path.join(subdir, file))  # os.walk yields bare file names
            img_gray = rgb2gray(img)
            if n == 0 and m == 0:  # n and m are global variables
                n, m = np.shape(img_gray)
            img_gray = np.reshape(img_gray, n*m)
            # sample id stores the label of the training sample
            img_gray = np.append(img_gray, sample_id)
            # samples is the final numpy ndarray
            if len(samples) == 0:
                samples = np.append(samples, img_gray)
                samples = np.reshape(samples, [1, n*m + 1])
            else:
                samples = np.append(samples, [img_gray], axis=0)
This is more of pseudocode, but the general flow should convey the right idea. Let me know if there's anything else I can do! Also check out OpenCV if you're interested in some cool deep learning algorithms. There's a bunch of cool stuff there, and images make for great sample data.
Hope this helps.