Let's say I have images organized like below:
root
|___dog
| |___img1.jpg
| |___img2.jpg
| |___...
|
|___cat
|___...
I want to convert the image files into an HDF5 file with h5py.
First, I tried to read all the image files and write them into an h5 file:
import os
import numpy as np
import h5py
import PIL.Image as Image

data_x = []
data_y = []

label_list = os.listdir('root')
for i, label in enumerate(label_list):
    files = os.listdir(os.path.join('root', label))
    for filename in files:
        img = Image.open(os.path.join('root', label, filename))
        ow, oh = 128, 128
        img = img.resize((ow, oh), Image.BILINEAR)
        data_x.append(np.array(img).tolist())
        data_y.append(i)

datafile = h5py.File(data_path, 'w')
datafile.create_dataset("data_image", dtype='uint8', data=data_x)
datafile.create_dataset("data_label", dtype='int64', data=data_y)
But I can't do it this way because of memory constraints (each folder has more than 200,000 images of size 224x224).
So, what is the best way to write these images to an h5 file?
The HDF5/h5py dataset objects have a much smaller memory footprint than the same size NumPy array. (That's one advantage to using HDF5.) You can create the HDF5 file and allocate the datasets BEFORE you start looping on the image files. Then you can operate on the images one at a time (read, resize, and write image 0, then image 1, etc).
The code below creates the necessary datasets, presized for 200,000 images. The code logic is rearranged to work as I described. The img_cnt variable is used to position new image data in the existing datasets. (Note: I think this works as written. However, without the data I couldn't test it, so it may need minor tweaking.) If you want to adjust the dataset sizes in the future, you can add the maxshape=() parameter to the create_dataset() function; see the sketch after the code below.
# Open HDF5 and create datasets in advance
datafile = h5py.File(data_path, 'w')
# Dataset shape must match the resized image shape below.
# (For RGB images, add a channel axis, e.g. (200000, 128, 128, 3).)
datafile.create_dataset("data_image", (200000, 128, 128), dtype='uint8')
datafile.create_dataset("data_label", (200000,), dtype='int64')

label_list = os.listdir('root')
img_cnt = 0
for i, label in enumerate(label_list):
    files = os.listdir(os.path.join('root', label))
    for filename in files:
        img = Image.open(os.path.join('root', label, filename))
        ow, oh = 128, 128
        img = img.resize((ow, oh), Image.BILINEAR)
        datafile["data_image"][img_cnt, :, :] = np.array(img)
        datafile["data_label"][img_cnt] = i
        img_cnt += 1
datafile.close()
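To illustrate the maxshape=() idea, here is a minimal sketch (not part of the original answer; the 128x128 shape and chunk size are assumptions) of a dataset created resizable and grown one image at a time:

import h5py
import numpy as np

with h5py.File('resizable_example.h5', 'w') as f:
    # None along axis 0 means the dataset can grow without limit
    dset = f.create_dataset("data_image", shape=(0, 128, 128),
                            maxshape=(None, 128, 128),
                            dtype='uint8', chunks=(64, 128, 128))
    # inside the image loop you would do:
    dset.resize(dset.shape[0] + 1, axis=0)                # grow by one image
    dset[-1, :, :] = np.zeros((128, 128), dtype='uint8')  # stand-in for real image data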
I am trying to group batches of images (.jpeg) and create a multipage TIFF file for each batch using Python and Pillow.
Using the append_images argument in save(), the files I get are much larger than the originals.
Considering a batch of 15 jpeg images (total size 643kB), the resulting TIFF is 6.29MB with no compression.
Is there a way to reduce the file size, possibly without using compression, and get a TIFF file with size similar to the one of all the original files?
import os
from PIL import Image

sourcedir = os.getcwd()
savedir = os.path.join(sourcedir, 'TIFF')

batch = ['Im-00', 'Im-01', 'Im-02', 'Im-03', 'Im-04', 'Im-05', 'Im-06', 'Im-07',
         'Im-08', 'Im-09', 'Im-10', 'Im-11', 'Im-12', 'Im-13', 'Im-14']
batch_counter = 0

imlistA = []
filenameA = ["A-" + s + ".jpg" for s in batch]
for fileA in filenameA:
    filepath = os.path.join(sourcedir, fileA)
    imlistA.append(Image.open(filepath))

TIFFnameA = 'A-batch-0' + str(batch_counter) + '.tiff'
TIFFdirA = os.path.join(savedir, TIFFnameA)
imlistA[0].save(TIFFdirA, compression=None, save_all=True,
                append_images=imlistA[1:])
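An uncompressed TIFF stores the decoded pixels, so it generally cannot match the combined size of the JPEG sources without some form of compression. One untested possibility, sketched here as an assumption (it requires a Pillow build with libtiff support), is to keep JPEG compression inside the multipage TIFF:

# Sketch: same multipage save, but with TIFF-internal JPEG compression.
# Assumes Pillow was built with libtiff (needed for compression="jpeg").
imlistA[0].save(TIFFdirA, compression="jpeg", save_all=True,
                append_images=imlistA[1:])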
I'm converting image files to hdf5 files as follows:
import h5py
import io
import os
import cv2
import numpy as np
from PIL import Image

def convertJpgtoH5(input_dir, filename, output_dir):
    filepath = input_dir + '/' + filename
    print('image size: %d bytes' % os.path.getsize(filepath))
    img_f = open(filepath, 'rb')
    binary_data = img_f.read()
    binary_data_np = np.asarray(binary_data)
    new_filepath = output_dir + '/' + filename[:-4] + '.hdf5'  # note: assumes a 3-character extension
    f = h5py.File(new_filepath, 'w')
    dset = f.create_dataset('image', data=binary_data_np)
    f.close()
    print('hdf5 file size: %d bytes' % os.path.getsize(new_filepath))

pathImg = '/path/to/images'
pathH5 = '/path/to/hdf5/files'
ext = [".jpg", ".jpeg", ".png", ".bmp", ".tiff", ".tif"]

for img in os.listdir(pathImg):
    if img.endswith(tuple(ext)):
        convertJpgtoH5(pathImg, img, pathH5)
I later read these hdf5 files as follows:
for hf in os.listdir(pathH5):
    if hf.endswith(".hdf5"):
        hf = h5py.File(f"{pathH5}/{hf}", "r")
        key = list(hf.keys())[0]
        data = np.array(hf[key])
        img = Image.open(io.BytesIO(data))
        image = cv2.cvtColor(np.float32(img), cv2.COLOR_BGR2RGB)
        hf.close()
Is there a more efficient way to read the hdf5 files rather than converting to numpy array, opening with Pillow before using with OpenCV?
Ideally this should be closed as a duplicate because most of what you want to do is explained in the answers I referenced in my comments above. I am including those links here:
How do I process a large dataset of images in python?
Convert a folder comprising jpeg images to hdf5
There is one difference: my examples load all the image data into 1 HDF5 file, and you are creating 1 HDF5 file for each image. Frankly, I don't think there is much value in doing that. You wind up with twice as many files and nothing is gained. If you are still interested in doing that, here are 2 more answers that might help (and I updated your code at the end):
How to split a big HDF5 file into multiple small HDF5 dataset
Extracting datasets from 1 HDF5 file to multiple files
In the interest of addressing your specific question, I modified your code to use cv2 only (no need for PIL). I resized the images and saved them as 1 dataset in 1 file. If you are using the images for training and testing a CNN model, you need to do this anyway (it needs arrays of a consistent shape). Also, I think you can save the data as uint8 -- no need for floats. See below.
import h5py
import glob
import os
import cv2
import numpy as np

def convertImagetoH5(imgfilename):
    print('image size: %d bytes' % os.path.getsize(imgfilename))
    img = cv2.imread(imgfilename)  # reads as BGR; use cv2.cvtColor afterward if RGB order matters
    img_resize = cv2.resize(img, (IMG_WIDTH, IMG_HEIGHT))
    return img_resize

pathImg = '/path/to/images'
pathH5 = '/path/to/hdf5file'
ext_list = [".ppm", ".jpg", ".jpeg", ".png", ".bmp", ".tiff", ".tif"]
IMG_WIDTH = 120
IMG_HEIGHT = 120

# get list of all images and number of images
all_images = []
for ext in ext_list:
    all_images.extend(glob.glob(pathImg + "/*" + ext, recursive=True))
n_images = len(all_images)

# rows = height, cols = width
ds_img_arr = np.zeros((n_images, IMG_HEIGHT, IMG_WIDTH, 3), dtype=np.uint8)

for cnt, img in enumerate(all_images):
    img_arr = convertImagetoH5(img)
    ds_img_arr[cnt] = img_arr[:]

h5_filepath = pathH5 + '/all_image_data.hdf5'
with h5py.File(h5_filepath, 'w') as h5f:
    dset = h5f.create_dataset('images', data=ds_img_arr)
print('hdf5 file size: %d bytes' % os.path.getsize(h5_filepath))

with h5py.File(h5_filepath, "r") as h5r:
    key = list(h5r.keys())[0]
    print(key, h5r[key].shape, h5r[key].dtype)
If you really want 1 HDF5 for each image, the code from your question is updated below. Again, only cv2 is used -- no need for PIL. Images are not resized. This is for completeness only (to demonstrate the process). It's not how you should manage your image data.
import h5py
import os
import cv2
import numpy as np

def convertImagetoH5(input_dir, filename, output_dir):
    filepath = input_dir + '/' + filename
    print('image size: %d bytes' % os.path.getsize(filepath))
    img = cv2.imread(filepath)  # reads as BGR
    new_filepath = output_dir + '/' + filename[:-4] + '.hdf5'  # note: assumes a 3-character extension
    with h5py.File(new_filepath, 'w') as h5f:
        h5f.create_dataset('image', data=img)
    print('hdf5 file size: %d bytes' % os.path.getsize(new_filepath))

pathImg = '/path/to/images'
pathH5 = '/path/to/hdf5file'
ext = [".ppm", ".jpg", ".jpeg", ".png", ".bmp", ".tiff", ".tif"]

# Loop thru image files and create a matching HDF5 file
for img in os.listdir(pathImg):
    if img.endswith(tuple(ext)):
        convertImagetoH5(pathImg, img, pathH5)

# Loop thru HDF5 files and read image dataset (as an array)
for h5name in os.listdir(pathH5):
    if h5name.endswith(".hdf5"):
        with h5py.File(f"{pathH5}/{h5name}", "r") as h5f:
            key = list(h5f.keys())[0]
            image = h5f[key][:]
            print(f'{h5name}: {image.shape}, {image.dtype}')
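If you keep your original approach of storing the raw encoded JPEG bytes in each HDF5 file, you can also skip PIL on the read side: cv2.imdecode decodes the bytes directly. A minimal sketch, assuming a file written by your convertJpgtoH5 (the file name here is a hypothetical example):

import h5py
import numpy as np
import cv2

with h5py.File('/path/to/hdf5/files/example.hdf5', 'r') as h5f:
    raw = h5f['image'][()]                       # the stored binary file contents
    buf = np.frombuffer(raw, dtype=np.uint8)     # 1-D uint8 view of the bytes
    image = cv2.imdecode(buf, cv2.IMREAD_COLOR)  # decoded BGR array, no PIL needed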
I would like to save some tiff images I have into a new npy file.
My data are saved in 5 different folders (tiff format). I want to access each one of them, convert the images to ndarrays, and then save them in a new npy file (for deep learning classification).
import numpy as np
from PIL import Image
import os

Data_dir = r"C:\Desktop\Université_2019_2020\CoursS2_Mosef\Stage\Data\Grand_Leez\shp\imagettes"
Categories = ["Bouleau_tiff", "Chene_tiff", "Erable_tiff", "Frene_tiff", "Peuplier_tiff"]

for categorie in Categories:
    path = os.path.join(Data_dir, categorie)  # path for each species
    for img in os.listdir(path):
        path_img = os.path.join(path, img)
        im = Image.open(path_img)  # load an image file
        imarray = np.array(im)  # convert it to a matrix
        imarray = np.delete(imarray, 3, axis=2)  # drop the alpha channel
        np.save(Data_dir, imarray)
Problem: it only returns me the last observation of my last category, "Peuplier_tiff", and it is saved under the name "imagettes"; I don't know why.
Last but not least, I have a doubt about my targets: how can I be sure that my categories are correctly assigned to the corresponding arrays?
A lot of questions,
thanks in advance for your help.
S.V
Thanks for your response. It's working with this code:
import numpy as np
from PIL import Image
import os

new_dir = "dta_npy"
directory = r"C:\Desktop\Université_2019_2020\CoursS2_Mosef\Stage\Data\Grand_Leez\shp\imagettes"
Data_dir = os.path.join(directory, new_dir)
os.makedirs(Data_dir)
print("Directory '%s' created" % Data_dir)

Categories = ["Bouleau_tif", "Chene_tif", "Erable_tif", "Frene_tif", "Peuplier_tif"]

for categorie in Categories:
    path = os.path.join(directory, categorie)  # path for each species
    for img in os.listdir(path):
        im = Image.open(os.path.join(path, img))  # load an image file
        imarray = np.array(im)  # convert it to a matrix
        imarray = np.delete(imarray, 3, axis=2)  # drop the alpha channel
        unique_name = img.split(".")[0]  # file name without extension
        np.save(Data_dir + "/" + unique_name, imarray)
Now my objective is to format my data, for each of my classes, in this way (see the link):
format goal
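On the labeling doubt from the original question, here is a minimal sketch (my own suggestion, not from the post) that keeps targets aligned with the saved arrays by recording the category index and file name for every image; it reuses the variables from the code above:

import numpy as np

labels = []        # category index per saved image
saved_names = []   # matching file names, for a later sanity check
# inside the loop over Categories / images, after np.save(...):
#     labels.append(Categories.index(categorie))
#     saved_names.append(unique_name)
np.save(Data_dir + "/labels", np.array(labels, dtype=np.int64))
np.save(Data_dir + "/file_names", np.array(saved_names))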
I've faced this problem while figuring out how to export external images in a Blender script. But I guess this is no longer strictly related to Blender; it is more about numpy and how to handle arrays. Here is a post about the first problem.
So the problem is that when saving the numpy array to an image, the image is distorted and contains multiple copies of the same tile. Look at the image below for a better understanding.
The goal is to figure out how to make this work with numpy and Python using Blender's own pixel data, avoiding libraries like PIL or cv2 that are not included in Blender's Python.
Saving works correctly when the images are all already the final size. But when trying to merge 4 smaller pieces into the final larger image, it is not exported correctly.
I've done example script with python in blender to demonstrate the problem:
# Example script to show how to merge external images in Blender
# using numpy. In this example we use 4 images (2x2) that should
# be merged to one actual final image.
# Regular (not cropped render borders) seems to work fine but
# how to merge cropped images properly???
#
# Usage: Just run script and it will export image named "MERGED_IMAGE"
# to root of this project folder and you'll see what's the problem.

import bpy, os
import numpy as np

ctx = bpy.context
scn = ctx.scene

print('START')

# Get all image files
def get_files_in_folder(path):
    path = bpy.path.abspath(path)
    render_files = []
    for root, dirs, files in os.walk(path):
        for file in files:
            if file.lower().endswith(('.png', '.jpg', '.jpeg', '.tiff', '.bmp', '.gif')):
                render_files.append(file)
    return render_files

def merge_images(image_files, image_cropped=True):
    image_pixels = []
    final_image_pixels = 0
    print(image_files)
    for file in image_files:
        if image_cropped is True:
            filepath = bpy.path.abspath('//Cropped\\' + file)
        else:
            filepath = bpy.path.abspath('//Regular\\' + file)
        loaded_pixels = bpy.data.images.load(filepath, check_existing=True).pixels
        image_pixels.append(loaded_pixels)
    np_array = np.array(image_pixels)
    # Merge images
    if image_cropped:
        final_image_pixels = np_array
        # HOW MERGE PROPERLY WHEN USING CROPPED IMAGES???
    else:
        for arr in np_array:
            final_image_pixels += arr
    # Save output image
    output_image = bpy.data.images.new('MERGED_IMAGE', alpha=True, width=256, height=256)
    output_image.file_format = 'PNG'
    output_image.alpha_mode = 'STRAIGHT'
    output_image.pixels = final_image_pixels.ravel()
    output_image.filepath_raw = bpy.path.abspath("//MERGED_IMAGE.png")
    output_image.save()

images_cropped = get_files_in_folder("//Cropped")
images_regular = get_files_in_folder('//Regular')

# Change between these to get different example
merge_images(images_cropped)
#merge_images(images_regular, False)

print('END')
So I guess the problem is related to how image pixel data and arrays are handled with numpy.
Here is the project folder as a zip file containing a working test script example, so you can test how this works in Blender: https://drive.google.com/file/d/1R4G_fubEzFWbHZMLtAAES-QsRhKyLKWb/view?usp=sharing
Since all of your images are the same dimension of 128x128, and since OpenCV images are Numpy arrays, here are three methods. You can save the image using cv2.imwrite.
Input images:
Method #1: np.hstack + np.vstack
hstack1 = np.hstack((image1, image2))
hstack2 = np.hstack((image3, image4))
hstack_result = np.vstack((hstack1, hstack2))
Method #2: np.concatenate
concatenate1 = np.concatenate((image1, image2), axis=1)
concatenate2 = np.concatenate((image3, image4), axis=1)
concatenate_result = np.concatenate((concatenate1, concatenate2), axis=0)
Method #3: cv2.hconcat + cv2.vconcat
hconcat1 = cv2.hconcat([image1, image2])
hconcat2 = cv2.hconcat([image3, image4])
hconcat_result = cv2.vconcat([hconcat1, hconcat2])
The result should be the same for all methods.
Full code
import cv2
import numpy as np
# Load images
image1 = cv2.imread('Fart_1_2.png')
image2 = cv2.imread('Fart_2_2.png')
image3 = cv2.imread('Fart_1_1.png')
image4 = cv2.imread('Fart_2_1.png')
# Method #1
hstack1 = np.hstack((image1, image2))
hstack2 = np.hstack((image3, image4))
hstack_result = np.vstack((hstack1, hstack2))
# Method #2
concatenate1 = np.concatenate((image1, image2), axis=1)
concatenate2 = np.concatenate((image3, image4), axis=1)
concatenate_result = np.concatenate((concatenate1, concatenate2), axis=0)
# Method #3
hconcat1 = cv2.hconcat([image1, image2])
hconcat2 = cv2.hconcat([image3, image4])
hconcat_result = cv2.vconcat([hconcat1, hconcat2])
# Display
cv2.imshow('concatenate_result', concatenate_result)
cv2.imshow('hstack_result', hstack_result)
cv2.imshow('hconcat_result', hconcat_result)
cv2.waitKey()
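Since the question wanted to avoid cv2 inside Blender, here is a hedged numpy-only sketch of the same 2x2 assembly working directly on Blender pixel buffers. It assumes 128x128 RGBA tiles and relies on Blender's convention that Image.pixels is a flat RGBA float buffer stored bottom row first; tile1 through tile4 are hypothetical bpy image objects:

import numpy as np

def to_array(img, w=128, h=128):
    # Blender stores pixels as a flat RGBA float list, bottom row first
    return np.array(img.pixels[:]).reshape(h, w, 4)

# assumed positions: bottom-left, bottom-right, top-left, top-right
bl, br, tl, tr = (to_array(t) for t in (tile1, tile2, tile3, tile4))

bottom = np.hstack((bl, br))
top = np.hstack((tl, tr))
merged = np.vstack((bottom, top))  # bottom rows first, matching Blender's layout

flat = merged.ravel()  # ready to assign to output_image.pixels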
I'm loading a tiff file from http://oceancolor.gsfc.nasa.gov/DOCS/DistFromCoast/
from PIL import Image
im = Image.open('GMT_intermediate_coast_distance_01d.tif')
The data is large (im.size = (36000, 18000), 1.3GB) and conventional conversion doesn't work; i.e., imarray.shape returns ()
import numpy as np
imarray=np.zeros(im.size)
imarray=np.array(im)
How can I convert this tiff file to a numpy.array?
You may not have enough RAM for this image. You'll need at least a bit more than 1.3GB of free memory.
I don't know what you're doing with the image, and you're reading the whole thing into memory, but I recommend you read it bit by bit if possible, to avoid blowing up your computer.
You can use Image.getdata(), which returns one pixel at a time.
Also, read more about Image.open at this link:
http://www.pythonware.com/library/pil/handbook/
So far I have tested many alternatives, but only gdal always worked, even with huge 16-bit images.
You can open an image with something like this:
from osgeo import gdal
import numpy as np
ds = gdal.Open("name.tif")
channel = np.array(ds.GetRasterBand(1).ReadAsArray())
I had huge tif files between 1 and 3 GB and managed to finally open them with Image.open() after manually changing the value of MAX_IMAGE_PIXELS inside the Image.py source code to an arbitrarily large number:
from PIL import Image
import numpy as np

im = np.asarray(Image.open("location/image.tif"))
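As an aside, Pillow exposes this limit as a documented module attribute, so here is a sketch of the same thing without editing the source (the path is the same placeholder as above):

from PIL import Image
import numpy as np

# Raise (or disable) the decompression-bomb limit at runtime
Image.MAX_IMAGE_PIXELS = None  # or a sufficiently large integer
im = np.asarray(Image.open("location/image.tif"))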
For 32-bit Python 2.7, you are limited by the number of bytes you can allocate at a given time. One option is to read the image in parts, resize the individual chunks, and reassemble them into an image that requires less RAM.
I recommend using the packages libtiff and opencv for that.
import os
os.environ["PATH"] += os.pathsep + "C:\\Program Files (x86)\\GnuWin32\\bin"

import numpy as np
import libtiff
import cv2

tif = libtiff.TIFF.open("HUGETIFFILE.tif", 'r')
width = tif.GetField("ImageWidth")
height = tif.GetField("ImageLength")
bits = tif.GetField('BitsPerSample')
sample_format = tif.GetField('SampleFormat')

ResizeFactor = 10  # reduce image size by 10
Chunks = 8  # read image in 8 chunks to prevent memory errors (can be increased for bigger files)

ReadStrip = tif.ReadEncodedStrip
typ = tif.get_numpy_type(bits, sample_format)

# Read and resize the image strip by strip (// keeps integer division on Python 3 too)
newarr = np.zeros((1, width // ResizeFactor), typ)
for ii in range(0, Chunks):
    pos = 0
    arr = np.empty((height // Chunks, width), typ)
    size = arr.nbytes
    for strip in range((ii * tif.NumberOfStrips() // Chunks),
                       ((ii + 1) * tif.NumberOfStrips() // Chunks)):
        elem = ReadStrip(strip, arr.ctypes.data + pos, max(size - pos, 0))
        pos = pos + elem
    resized = cv2.resize(arr, (0, 0), fx=1.0 / ResizeFactor, fy=1.0 / ResizeFactor)
    # Now remove the large array to free up memory for the next chunk
    del arr
    # Finally recombine the individual resized chunks into the final resized image
    newarr = np.vstack((newarr, resized))

newarr = np.delete(newarr, (0), axis=0)
cv2.imwrite('resized.tif', newarr)
You can try the dask library:
import dask_image.imread
ds = dask_image.imread.imread('name.tif')
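imread here returns a lazy dask array rather than a NumPy array; a minimal usage sketch (same placeholder file name as above), materializing it only when needed:

import dask_image.imread

ds = dask_image.imread.imread('name.tif')  # builds a lazy dask array
print(ds.shape, ds.dtype)                  # shape/dtype metadata without materializing the full array
arr = ds.compute()                         # convert to a NumPy array when needed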