Randomly extracting files from subfolders in main folder - python

I want to read `.mat` files from the main folder, which contains seven subfolders. Each subfolder is named with its class number.
import glob
import os
import hdf5storage
import numpy as np
DATASET_PATH = "D:/Dataset/Multi-resolution_data/Visual/High/"
files = glob.glob(DATASET_PATH + "**/*.mat", recursive=True)
class_labels = [i.split(os.sep)[-2] for i in files]

for label in range(0, len(class_labels)):
    class_labels[label] = int(class_labels[label])
The files variable contains the list of file paths, and class_labels contains the corresponding class numbers (output not shown here).
I want to ask a couple of things:
1) When I read the .mat files, each one comes back as a dict, and each dict contains a different variable name. How can I read the key and assign its value to an array?
array_store = []
for f in files:
    mat = hdf5storage.loadmat(f)
    arrays = np.array(mat.keys())
    array_store.append(arrays)
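If each file holds a single variable under an unknown name, one way to get at the array itself is to take the first entry of the dict (a minimal sketch, not the original code; it reuses the files list from above):

import numpy as np
import hdf5storage

array_store = []
for f in files:
    mat = hdf5storage.loadmat(f)               # dict: {variable_name: ndarray}
    key = next(iter(mat))                      # the (unknown) variable name
    array_store.append(np.asarray(mat[key]))   # the array stored under that key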
2) With files = glob.glob(DATASET_PATH + "**/*.mat", recursive=True), is it possible to randomly read a specific amount of files from each folder inside the main folder, e.g. 60% for training and 40% for testing?
UPDATE
I have tried what @vopsea suggested in the answer below.
The output for the train variable looks like this (output not shown here).
How do I make the final array of images for each file of keys 1 - 7 (an array of shape 256 x 256 x 11 x total number of images) and the labels (total number of images x 1)? The labels will be the same as the key values; for example, all the files associated with key 1 (188 files) will have label 1 (188 x 1).
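One way to get there (a sketch, not from the original post; it assumes a dict mapping each class key to a list of loaded 256 x 256 x 11 arrays, like the train/test structures built below) is to stack the per-file arrays along a new last axis and build the labels from the keys:

import numpy as np

all_images = []
all_labels = []
for class_key, arrays in train.items():       # train: {class key: [256x256x11 arrays]}
    for arr in arrays:
        all_images.append(arr)
        all_labels.append(int(class_key))     # label = the key value

images = np.stack(all_images, axis=-1)        # shape: 256 x 256 x 11 x N
labels = np.array(all_labels).reshape(-1, 1)  # shape: N x 1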
UPDATE
Resolving the issue of making labels and accessing the key without knowing the key name:
import os
import random
import hdf5storage
import numpy as np

DATASET_PATH = "D:/Dataset/Multi-resolution_data/Visual/High/"
train_images = []
test_images = []
train_label = list()
test_label = list()
percent_train = 0.4

class_folders = next(os.walk(DATASET_PATH))[1]
for x in class_folders:
    files = os.listdir(os.path.join(DATASET_PATH, x))
    random.shuffle(files)
    n = int(len(files) * percent_train)
    train_i = []
    test_i = []
    for i, f in enumerate(files):
        abs_path = os.path.join(DATASET_PATH, x, f)
        mat = hdf5storage.loadmat(abs_path)
        if i < n:
            train_i.append(mat.values())
            train_label.append(x)
        else:
            test_i.append(mat.values())
            test_label.append(x)
    train_images.append(train_i)
    test_images.append(test_i)

1) Could you explain a bit more what you want in question 1? What is being appended? I might be misunderstanding, but it's easy to read unknown key, value pairs
for key, value in mat.items():
    print(key, value)
2) I did this without glob. Shuffle the class files and slice them into two lists according to the training percentage. It's probably best to have the same number of files for each class (or close to it) so training doesn't favor one class especially.
import os
import random

DATASET_PATH = "D:/Dataset/Multi-resolution_data/Visual/High/"
train = {}
test = {}
percent_train = 0.4

class_folders = next(os.walk(DATASET_PATH))[1]
for x in class_folders:
    files = os.listdir(os.path.join(DATASET_PATH, x))
    random.shuffle(files)
    n = int(len(files) * percent_train)
    train[x] = files[:n]
    test[x] = files[n:]
EDIT 2:
Is this what you mean?
import os
import random
import hdf5storage
import numpy as np

DATASET_PATH = "D:/Dataset/Multi-resolution_data/Visual/High/"
train_images = []
test_images = []
train_label = []
test_label = []
percent_train = 0.4

class_folders = next(os.walk(DATASET_PATH))[1]
for x in class_folders:
    files = os.listdir(os.path.join(DATASET_PATH, x))
    random.shuffle(files)
    n = int(len(files) * percent_train)
    for i, f in enumerate(files):
        abs_path = os.path.join(DATASET_PATH, x, f)
        mat = hdf5storage.loadmat(abs_path)
        if i < n:
            train_images.append(mat.values())
            train_label.append(x)
        else:
            test_images.append(mat.values())
            test_label.append(x)
EDIT 3: Using dict for simplicity
Notice how simple it is to run through the images at the end. The alternative is storing two lists (data and labels), one of which will have many duplicate items; you then have to go through them both at the same time.
Although depending on what you're doing with this later, two lists could be the right choice.
import os
import random
import hdf5storage
import numpy as np

DATASET_PATH = "D:/Dataset/Multi-resolution_data/Visual/High/"
train_images = {}
test_images = {}
percent_train = 0.4

class_folders = next(os.walk(DATASET_PATH))[1]
for x in class_folders:
    files = os.listdir(os.path.join(DATASET_PATH, x))
    random.shuffle(files)
    n = int(len(files) * percent_train)
    for i, f in enumerate(files):
        abs_path = os.path.join(DATASET_PATH, x, f)
        mat = hdf5storage.loadmat(abs_path)
        if i < n:
            train_images[x] = mat.values()
        else:
            test_images[x] = mat.values()

for img_class, img_data in train_images.items():
    print(img_class, img_data)

Related

How to read multiple DICOM files from a folder?

I have the following code in which I am loading a single DICOM file and checking whether sagittal and coronal views are present or not.
I want to modify this to read all DICOM files from the folder, and to print that there is no sagittal or coronal view if the sag_aspect or cor_aspect value is zero.
How do I do this?
import pydicom
import numpy as np
import matplotlib.pyplot as plt
import sys
import glob

# load the DICOM files
files = []
print('glob: {}'.format(sys.argv[1]))
for fname in glob.glob('dicom/3.dcm', recursive=False):
    print("loading: {}".format(fname))
    files.append(pydicom.dcmread(fname))
print("file count: {}".format(len(files)))

# skip files with no SliceLocation (eg scout views)
slices = []
skipcount = 0
for f in files:
    if hasattr(f, 'SliceLocation'):
        slices.append(f)
    else:
        skipcount = skipcount + 1
print("skipped, no SliceLocation: {}".format(skipcount))

# ensure they are in the correct order
slices = sorted(slices, key=lambda s: s.SliceLocation)

# pixel aspects, assuming all slices are the same
ps = slices[0].PixelSpacing
ss = slices[0].SliceThickness
ax_aspect = ps[1]/ps[0]
sag_aspect = ps[1]/ss
cor_aspect = ss/ps[0]

# create 3D array
img_shape = list(slices[0].pixel_array.shape)
img_shape.append(len(slices))
img3d = np.zeros(img_shape)

# fill 3D array with the images from the files
for i, s in enumerate(slices):
    img2d = s.pixel_array
    img3d[:, :, i] = img2d

# plot 3 orthogonal slices
print(img3d.shape)
print(img_shape)
a1 = plt.subplot(2, 2, 1)
plt.imshow(img3d[:, :, img_shape[2]//2])
a1.set_title("transverse view")
a1.set_aspect(ax_aspect)

a2 = plt.subplot(2, 2, 2)
#print(img3d[:, img_shape[1]//2, :].shape)
plt.imshow(img3d[:, img_shape[1]//2, :])
a2.set_title("sagittal view")
a2.set_aspect(sag_aspect)

a3 = plt.subplot(2, 2, 3)
plt.imshow(img3d[img_shape[0]//2, :, :].T)
a3.set_title("coronal view")
a3.set_aspect(cor_aspect)

plt.show()
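As for the second part of the question, a simple guard right after the aspect ratios are computed could print the requested message (a sketch, not part of the original code):

# warn if either derived ratio came out as zero (e.g. zero PixelSpacing or SliceThickness)
if sag_aspect == 0 or cor_aspect == 0:
    print("there is no sagittal and coronal view")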
For reading multiple DICOM files from a folder you can use the code below.
import os
from pathlib import Path
import pydicom

dir_path = r"path\to\dicom\files"
dicom_set = []
for root, _, filenames in os.walk(dir_path):
    for filename in filenames:
        dcm_path = Path(root, filename)
        if dcm_path.suffix == ".dcm":
            try:
                dicom = pydicom.dcmread(dcm_path, force=True)
            except IOError as e:
                print(f"Can't import {dcm_path.stem}")
            else:
                dicom_set.append(dicom)
I have leveraged the pathlib library, which I strongly suggest using whenever dealing with folder/file paths. I have also added an exception handler, but you can modify it to meet your needs.
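To plug this into the question's script, dicom_set can simply take the place of the files list before the SliceLocation filtering (a sketch, assuming the variables from the question's code above):

# reuse the question's pipeline with the images read from the folder
files = dicom_set
slices = sorted((f for f in files if hasattr(f, 'SliceLocation')),
                key=lambda s: s.SliceLocation)
print("usable slices: {}".format(len(slices)))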

Vectorise nested for loops

I'm wondering if it is possible to vectorise the nested for loop in my code to speed up the code. I am basically trying to split (or trim) an image into smaller bits.
import numpy as np
import os
from PIL import Image
image_path = 'sample.png'
number_of_rows = 4
input_np_arr_image = np.asarray(Image.open(image_path))
height, width = input_np_arr_image.shape[0:2]
height_trimmed = (height // number_of_rows) * number_of_rows
width_trimmed = (width // number_of_rows) * number_of_rows
trimmed_image = input_np_arr_image[:height_trimmed, :width_trimmed]
image_pieces = [np.hsplit(segment, 4)
                for segment in np.vsplit(trimmed_image, 4)]
I then try to save the result using this nested for loop
for row in range(len(image_pieces)):
    for col in range(len(image_pieces[row])):
        # create output image name
        segment_name = f"{image_path[:len(image_path)-4]}_{row}_{col}{image_path[-4:]}"
        # convert array to image
        label_image = Image.fromarray(image_pieces[row][col])
        # make output directory for saving
        output_directory = "split_images"
        os.makedirs(output_directory, exist_ok=True)
        label_image.save(output_directory + "/" + segment_name)
I'm curious if it's possible to do this without using the nested for loops. Thanks
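One common alternative (a sketch, not from the original thread) is to produce all the pieces with a single reshape/transpose and then iterate over them in one flat loop; the per-file saves themselves still have to happen one at a time:

import os
import numpy as np
from PIL import Image

n = number_of_rows                       # 4 in the question
h = trimmed_image.shape[0] // n
w = trimmed_image.shape[1] // n
# split both axes in one step: (rows, h, cols, w, channels) -> (rows*cols, h, w, channels)
pieces = (trimmed_image
          .reshape(n, h, n, w, -1)
          .swapaxes(1, 2)
          .reshape(n * n, h, w, -1))

output_directory = "split_images"
os.makedirs(output_directory, exist_ok=True)
stem, ext = os.path.splitext(image_path)
for idx, piece in enumerate(pieces):
    row, col = divmod(idx, n)
    Image.fromarray(piece.squeeze()).save(f"{output_directory}/{stem}_{row}_{col}{ext}")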

Reading the huge image data for training classifiers

I am new to Python and machine learning. I have a huge image dataset of cars with more than 27000 images and labels. I am trying to create a dataset so I can use it in my training classifier, but of course handling this amount of data is a real pain for memory, and that's where I am stuck. At first I was trying to do something like this.
import os
import matplotlib.pyplot as plt
import matplotlib.image as mpg
import cv2
import gc
import numpy as np
from sklearn.preprocessing import normalize
import gc
import resource
import h5py
bbox = "/run/media/fdai5182/LAMAMADAN/Morethan4000samples/data/labels"
imagepath = "/run/media/fdai5182/LAMAMADAN/Morethan4000samples/data/image"
training_data = []
training_labels = []
count = 0
for root, _, files in os.walk(bbox):
    cdp = os.path.abspath(root)
    for rootImage, _, fileImage in os.walk(imagepath):
        cdpimg = os.path.abspath(rootImage)
        for f in files:
            ct = 0
            name, ext = os.path.splitext(f)
            for fI in fileImage:
                n, e = os.path.splitext(fI)
                if name == n and ext == ".txt" and e == ".jpg":
                    cip = os.path.join(cdp, f)
                    cipimg = os.path.join(cdpimg, fI)
                    txt = open(cip, "r")
                    for q in txt:
                        ct = ct + 1
                        if ct == 3:
                            x1 = int(q.rsplit(' ')[0])
                            y1 = int(q.rsplit(' ')[1])
                            x2 = int(q.rsplit(' ')[2])
                            y2 = int(q.rsplit(' ')[3])
                            try:
                                read_img = mpg.imread(cipimg)
                                read_img = read_img.astype('float32')
                                read_img_bbox = read_img[y1:y2, x1:x2, :]
                                resize_img = cv2.cv2.resize(read_img_bbox, (300, 300))
                                resize_img /= 255.0
                                training_labels.append(int(cipimg.split('\\')[4]))
                                training_data.append(resize_img)
                                print("len Of Training_data", len(training_data))
                                training_labels.append(int(cipimg.split('/')[8]))
                                del resize_img
                                print("len Of Training Labels", len(training_labels))
                                gc.collect()
                            except Exception as e:
                                print("Error", str(e), cip)
                                count = count + 1
                                print(count)
                    txt.flush()
                    txt.close()

np.save('/run/media/fdai5182/LAMA MADAN/Training_Data_4000Samples', training_data)
np.save('/run/media/fdai5182/LAMA MADAN/Training_Labels_4000Samples', training_labels)
print("DONE")
But it always gives me a huge memory error after reading the images, even on 32 GB of RAM.
So I want to take a different approach that may use less memory and get this working.
The steps I want to do are as follows (a rough sketch follows the list):
1) Allocate an np array X of shape (N, 150, 150, 3) or (N, 300, 300, 3) of type float32 (not via astype).
2) Iterate through the images and fill each row of array X with the 150 x 150 x 3 image pixels.
3) Normalize in-place: X /= 255.
4) Write to file (.npy format).
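A rough sketch of those steps (not the original code; the glob pattern and output path are placeholders, and imagepath is the directory from the code above):

import glob
import os
import cv2
import numpy as np

image_files = glob.glob(os.path.join(imagepath, "**", "*.jpg"), recursive=True)
N = len(image_files)
X = np.zeros((N, 150, 150, 3), dtype='float32')    # allocated once, directly as float32
for i, path in enumerate(image_files):
    img = cv2.imread(path)                          # H x W x 3, uint8
    X[i] = cv2.resize(img, (150, 150))              # fill row i of the big array
X /= 255.0                                          # in-place normalisation
np.save('training_images.npy', X)                   # write to .npy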
What I have done till now is:
import os
import cv2
import matplotlib.pyplot as plt
import matplotlib.image as mpg
import numpy as np

bbox = "/run/media/fdai5182/LAMAMADAN/Morethan4000samples/data/labels"
imagepath = "/run/media/fdai5182/LAMAMADAN/Morethan4000samples/data/image"

for root, _, files in os.walk(bbox):
    cdp = os.path.abspath(root)
    for rootImage, _, fileImage in os.walk(imagepath):
        cdpimg = os.path.abspath(rootImage)
        for f in files:
            ct = 0
            name, ext = os.path.splitext(f)
            for fI in fileImage:
                n, e = os.path.splitext(fI)
                if name == n and ext == ".txt" and e == ".jpg":
                    nparrayX = np.zeros((150, 150, 3), dtype='float32')  # allocate as float32
                    cip = os.path.join(cdp, f)
                    cipImg = os.path.join(cdpimg, fI)
                    read_image = mpg.imread(cipImg)                      # read the image file
                    resize_image = cv2.cv2.resize(read_image, (150, 150))
Am I on the right path?
Also, how can I fill each row of the array with the 150 x 150 x 3 image pixels? I don't want to use lists anymore, as they take more memory and are time consuming.
Please help me through this.
Also, as a new member, if the question is not obeying the rules and regulations of Stack Overflow, please tell me and I will edit it.
Thank you,
Both tensorflow/keras and pytorch provide dataset/generator classes, which you can use to construct memory-efficient data loaders.
For tensorflow/keras there is an excellent tutorial created by Stanford's Shervine Amidi.
For pytorch you can find a good tutorial on the project's main page.
I would strongly suggest making use of these frameworks for your implementation, since they allow you to avoid writing boilerplate code and make your training scalable.
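For illustration only (a sketch, not from the original answer; the paths and labels are placeholders), a minimal pytorch-style dataset that loads and resizes one image per item, so only a batch lives in memory at a time, could look like this:

import cv2
import torch
from torch.utils.data import Dataset, DataLoader

class CarDataset(Dataset):
    """Loads one image per __getitem__, so only a batch is held in memory."""
    def __init__(self, image_paths, labels, size=(300, 300)):
        self.image_paths = image_paths     # placeholder: list of .jpg paths
        self.labels = labels               # placeholder: list of int labels
        self.size = size

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        img = cv2.imread(self.image_paths[idx]).astype('float32') / 255.0
        img = cv2.resize(img, self.size)
        return torch.from_numpy(img).permute(2, 0, 1), self.labels[idx]

# loader = DataLoader(CarDataset(paths, labels), batch_size=32, shuffle=True)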
Thank you for your help. But I wanted to do it manually to check how we can do it without using generators. Below is my code.
import cv2
import matplotlib.pyplot as plt
import matplotlib.image as mpg
import numpy as np
import os

N = 0
training_labels = []
bbox = "D:/Morethan4000samples/data/labels"
imagepath = "D:/Morethan4000samples/data/image/"

# first pass: count the images so the array can be preallocated
for root, _, files in os.walk(imagepath):
    cdp = os.path.abspath(root)
    for f in files:
        name, ext = os.path.splitext(f)
        if ext == ".jpg":
            cip = os.path.join(cdp, f)
            N += 1
print(N)

imageX = np.zeros((N, 227, 227, 3), dtype='float32')
i = 0

# second pass: crop, resize and fill each row of the preallocated array
for root, _, files in os.walk(imagepath):
    cdp = os.path.abspath(root)
    print(cdp)
    for f in files:
        ct = 0
        name, ext = os.path.splitext(f)
        if ext == ".jpg":
            cip = os.path.join(cdp, f)
            read = mpg.imread(cip)
            cipLabel = cip.replace('image', 'labels')
            cipLabel = cipLabel.replace('.jpg', '.txt')
            nameL, extL = os.path.splitext(cipLabel)
            if extL == '.txt':
                boxes = open(cipLabel, 'r')
                for q in boxes:
                    ct = ct + 1
                    if ct == 3:
                        x1 = int(q.rsplit(' ')[0])
                        y1 = int(q.rsplit(' ')[1])
                        x2 = int(q.rsplit(' ')[2])
                        y2 = int(q.rsplit(' ')[3])
                        readimage = read[y1:y2, x1:x2]
                        resize = cv2.cv2.resize(readimage, (227, 227))
                        resize = cv2.cv2.GaussianBlur(resize, (5, 5), 0)
                        imageX[i] = resize
                        #training_labels.append(int(cip.split('\\')[4]))
                        training_labels.append(int(cip.split('/')[8]))
                        print(len(training_labels), len(imageX))
                        i += 1
print(i)

imageX /= 255.0
plt.imshow(imageX[10])
plt.show()
print(imageX.shape)
print(len(training_labels))

np.save("/run/media/fdai5182/LAMA MADAN/Morethan4000samples/227227/training_images", imageX)
np.save("/run/media/fdai5182/LAMA MADAN/Morethan4000samples/227227/trainin_labels", training_labels)
Saving each of your images as a row of a matrix of the same dimensions is the most efficient way to do that.

Insert images in a folder into dataframe

I'm trying to read images from folders into a dataframe, where each row in the dataframe holds all the images for one folder:
import cv2
import os, glob
import matplotlib.pylab as plt
from os import listdir, makedirs
from os.path import isfile, join
import pandas as pd
import PIL
import numpy as np
from scipy.ndimage import imread

pth = 'C:/Users/Documents/myfolder/'
folders = os.listdir(pth)
videos = pd.DataFrame()

for folder in folders:
    pth_upd = pth + folder + '/'
    allfiles = os.listdir(pth_upd)
    files = []
    columns = ['data']
    index = [folders]
    for file in allfiles:
        files.append(file) if ('.bmp' in file) else None
    samples = np.empty((0, 64, 64))
    for file in files:
        img = cv2.imread(os.path.join(pth_upd, file), cv2.IMREAD_GRAYSCALE)
        img = img.reshape(1, 64, 64)
        samples = np.append(samples, img, axis=0)
    result = pd.DataFrame([samples], index=[folder], columns=['videos'])
    videos = videos.append(result)
After reading all the images in each folder into the samples array, how can I insert the images for each folder into a dataframe row? I get this error:
ValueError                                Traceback (most recent call last)
in
     17     samples = np.append(samples, img, axis=0)
     18
---> 19     result = pd.DataFrame([samples], index=[folder], columns=['videos'])
     20     videos = videos.append(result)

ValueError: Must pass 2-d input
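One way around the ValueError itself (a sketch, not the answer below) is to store each folder's 3-D samples array as a single object cell instead of passing it as 2-D frame data:

# wrap the 3-D array in a one-element list so pandas keeps it as one object cell
result = pd.DataFrame({'videos': [samples]}, index=[folder])
videos = pd.concat([videos, result])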
It's certainly possible to put strings of the resized images into pandas, but there are much better ways to accomplish CNN training. I adapted your image processing code to show how you could do what you asked:
import io
import pandas as pd
import numpy as np
import sklearn
import requests
import tempfile
import os
import cv2

# Image processing for the df
def process_imgfile(x):
    img = cv2.imread(os.path.join(
        x.Folder, x.image), cv2.IMREAD_GRAYSCALE)
    img = cv2.resize(img, (64, 64))
    img = str(img)
    return img

# Simulate folders with images in them
with tempfile.TemporaryDirectory() as f:
    f1 = os.path.join(f, "Folder1")
    f2 = os.path.join(f, "Folder2")
    os.mkdir(f1)
    os.mkdir(f2)
    for x in range(20):
        with open(os.path.join(f1, "f1-{}.jpg".format(x)), "wb") as file1, open(
                os.path.join(f2, "f2-{}.jpg".format(x)), "wb") as file2:
            r = requests.get(
                'https://upload.wikimedia.org/wikipedia/en/a/a9/Example.jpg',
                stream=True)
            print(r.status_code)
            for chunk in r.iter_content(16):  # File writing...
                file1.write(chunk)
                file2.write(chunk)
    result = [x for x in os.walk(f)]
    folder1 = result[1][2]
    folder2 = result[2][2]

    # Generate dataframe data
    j = {"Folder": [], "image": []}
    for x in folder1:
        j["Folder"].append(result[1][0])
        j["image"].append(x)
    for x in folder2:
        j["Folder"].append(result[2][0])
        j["image"].append(x)

    # Use the process_imgfile function to append image data
    df = pd.DataFrame(j)
    df["imgdata"] = df.apply(process_imgfile, axis=1)
But on a large set of images this is not going to work. Instead, check out ImageDataGenerator which can let you load images at train and test time. It can also help you apply augmentation or synthesize data.
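For example, a minimal flow_from_directory setup (a sketch; it assumes the usual layout of one subfolder per class under the top folder) might look like this:

from tensorflow.keras.preprocessing.image import ImageDataGenerator

# stream grayscale 64x64 images from class subfolders in batches
datagen = ImageDataGenerator(rescale=1./255, validation_split=0.2)
train_gen = datagen.flow_from_directory(
    'C:/Users/Documents/myfolder/',
    target_size=(64, 64),
    color_mode='grayscale',
    batch_size=32,
    subset='training')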

Write to HDF5 and shuffle big arrays of data

I have downloaded Caltech101. Its structure is:
#Caltech101 dir
#    class1 dir
#        images of class1 jpgs
#    class2 dir
#        images of class2 jpgs
#    ...
#    class100 dir
#        images of class100 jpgs
My problem is that I can't keep in memory two np arrays x and y of shape (9144, 240, 180, 3) and (9144,). So my solution is to preallocate an h5py dataset, load the data in 2 chunks and write them to file one after the other. Precisely:
from __future__ import print_function
import os
import glob
from scipy.misc import imread, imresize
from sklearn.utils import shuffle
import numpy as np
import h5py
from time import time


def load_chunk(images_dset, labels_dset, chunk_of_classes, counter, type_key, prev_chunk_length):
    # getting images and processing
    xtmp = []
    ytmp = []
    for label in chunk_of_classes:
        img_list = sorted(glob.glob(os.path.join(dir_name, label, "*.jpg")))
        for img in img_list:
            img = imread(img, mode='RGB')
            img = imresize(img, (240, 180))
            xtmp.append(img)
            ytmp.append(label)
        print(label, 'done')

    x = np.concatenate([arr[np.newaxis] for arr in xtmp])
    y = np.array(ytmp, dtype=type_key)
    print('x: ', type(x), np.shape(x), 'y: ', type(y), np.shape(y))

    # writing to dataset
    a = time()
    images_dset[prev_chunk_length:prev_chunk_length+x.shape[0], :, :, :] = x
    print(labels_dset.shape)
    print(y.shape, y.shape[0])
    print(type(y), y.dtype)
    print(prev_chunk_length)
    labels_dset[prev_chunk_length:prev_chunk_length+y.shape[0]] = y
    b = time()
    print('Chunk', counter, 'written in', b-a, 'seconds')

    return prev_chunk_length+x.shape[0]


def write_to_file(remove_DS_Store):
    if os.path.isfile('caltech101.h5'):
        print('File exists already')
        return
    else:
        # the name of each dir is the name of a class
        classes = os.listdir(dir_name)
        if remove_DS_Store:
            classes.pop(0)  # removes .DS_Store - may not be used on other terminals

        # need the dtype of y in order to initialize h5 dataset
        s = ''
        key_type_y = s.join(['S', str(len(max(classes, key=len)))])
        classes = np.array(classes, dtype=key_type_y)

        # number of chunks in which the dataset must be divided
        nb_chunks = 2
        nb_chunks_loaded = 0
        prev_chunk_length = 0

        # open file and allocating a dataset
        f = h5py.File('caltech101.h5', 'a')
        imgs = f.create_dataset('images', shape=(9144, 240, 180, 3), dtype='uint8')
        labels = f.create_dataset('labels', shape=(9144,), dtype=key_type_y)

        for class_sublist in np.array_split(classes, nb_chunks):
            # loading chunk by chunk in a function to avoid memory overhead
            prev_chunk_length = load_chunk(imgs, labels, class_sublist, nb_chunks_loaded, key_type_y, prev_chunk_length)
            nb_chunks_loaded += 1

        f.close()
        print('Images and labels saved to \'caltech101.h5\'')
        return

dir_name = '../Datasets/Caltech101'
write_to_file(remove_DS_Store=True)
This works quite well, and also reading is actually fast enough. The problem is that I need to shuffle the dataset.
Observations:
Shuffling the dataset objects: obviously very slow, because they're on disk.
Creating an array of shuffled indices and using advanced numpy indexing: this means slower reading from the file.
Shuffling before writing to file would be nice, but the problem is that I only have about half of the dataset in memory at a time, so I would get an improper shuffle.
Can you think of a way to shuffle before writing? I'm also open to solutions which rethink the writing process, as long as they don't use a lot of memory.
You could shuffle the file paths before reading the image data.
Instead of shuffling the image data in memory, create a list of all file paths that belong to the dataset. Then shuffle the list of file paths. Now you can create your HDF5 database as before.
You could for example use glob to create the list of files for shuffling:
import glob
import random

files = glob.glob('../Datasets/Caltech101/*/*.jpg')
random.shuffle(files)  # shuffles in place and returns None
shuffled_files = files
You could then retrieve the class label and image name from the path:
import os

for file_path in shuffled_files:
    label = os.path.basename(os.path.dirname(file_path))
    image_id = os.path.splitext(os.path.basename(file_path))[0]
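To tie this back to the original writing code, the shuffled flat file list can then be written in fixed-size chunks instead of per-class chunks (a sketch reusing the question's imread/imresize and h5py setup; not part of the original answer):

import os
import numpy as np
import h5py
from scipy.misc import imread, imresize   # as in the question's code

with h5py.File('caltech101_shuffled.h5', 'w') as f:
    n = len(shuffled_files)
    imgs = f.create_dataset('images', shape=(n, 240, 180, 3), dtype='uint8')
    labels = f.create_dataset('labels', shape=(n,), dtype='S64')
    chunk = 1000                           # number of images held in memory at a time
    for start in range(0, n, chunk):
        batch = shuffled_files[start:start + chunk]
        x = np.stack([imresize(imread(p, mode='RGB'), (240, 180)) for p in batch])
        y = np.array([os.path.basename(os.path.dirname(p)) for p in batch], dtype='S64')
        imgs[start:start + len(batch)] = x
        labels[start:start + len(batch)] = y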
