I am using GridSearchCV() and its fit() method to build a model. I currently have this working, but would like to improve the model's accuracy by supplying more images to train on. Right now, fit() takes over an hour to complete with 500 images, and processing time grows exponentially as the number of images doubles. Ultimately, I'd like to train on several thousand images and even include additional categories besides the two in my proof of concept. I have tried several ways to improve performance and can't resolve it. The only thing that reduces processing time is to significantly lower train_size/test_size in train_test_split(), but doing this defeats the purpose of a larger training set. I'm a little stumped on this one. Below is the code I'm using for reference. Thank you.
import os
import numpy as np
import pandas as pd
from skimage.io import imread           # imports assumed from context;
from skimage.transform import resize    # imread/resize match the scikit-image signatures used below
from sklearn import svm
from sklearn.model_selection import GridSearchCV, train_test_split

categories = ['Cat', 'Dog']
flat_data_arr = []
target_arr = []
datadir = 'C:\\Users\\Name\\Python\\images'
for i in categories:
    path = os.path.join(datadir, i)
    for image in os.listdir(path):
        image_array = imread(os.path.join(path, image))
        image_resized = resize(image_array, (150, 150, 3))
        flat_data_arr.append(image_resized.flatten())
        target_arr.append(categories.index(i))
flat_data = np.array(flat_data_arr)
target = np.array(target_arr)
df = pd.DataFrame(flat_data)
df['Target'] = target
x = df.iloc[:, :-1]
y = df.iloc[:, -1]
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.75, test_size=0.25, shuffle=True, stratify=y)
param_grid = {'C': [0.1, 1, 10, 100], 'gamma': [0.0001, 0.001, 0.1, 1], 'kernel': ['rbf', 'poly']}
svc = svm.SVC(probability=True)
model = GridSearchCV(svc, param_grid)
model.fit(x_train, y_train)  # this takes hours depending on the number of images
Try HalvingGridSearchCV: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.HalvingGridSearchCV.html
Also...
It is probably best to use TensorFlow, Keras, or PyTorch for computer vision; with GPUs on top, this will run in milliseconds, and even without a GPU you will see a significant speed-up.
However, if you decide to continue with scikit-learn, you could try the following (basically reducing dimensions and adding features):
# supporting libraries
from PIL import Image
import numpy as np
from skimage.feature import hog
from skimage.color import rgb2gray  # named rgb2grey in older scikit-image versions
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
I am not sure why you jump from np.array to pandas; you should be able to work directly with the NumPy matrix.
Make sure you are using all cores/processors: the parameter n_jobs=-1 in your grid-search call should do it.
Then you can also reduce the size of your images even further, say 100 x 100 instead of 150 x 150.
Additionally, you could convert the images to grayscale (making each image a 2-D matrix instead of a 3-D one):
grey_scaled = rgb2gray(imread(os.path.join(path, image)))
If you are interested in experimenting, you could then compute HOG features of your grey_scaled image from the previous step via
hog_features = hog(grey_scaled, block_norm='L2-Hys', pixels_per_cell=(10, 10))
You could even stack the original image and the HOG features together:
color_features = imread(os.path.join(path, image)).flatten()
final_features = np.hstack((color_features, hog_features))
Loop over all your images, append the output of this pipeline to a list, say final_features_list, and convert that list to a matrix with np.array(final_features_list).
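Putting those steps together, a rough sketch of that loop, reusing the directory layout and categories from the question (the 100 x 100 size and the HOG parameters are just the suggestions above, not tuned values):

# hedged sketch: per-image feature pipeline (raw pixels stacked with HOG features)
final_features_list = []
target_arr = []
for i in categories:
    path = os.path.join(datadir, i)
    for image in os.listdir(path):
        img = resize(imread(os.path.join(path, image)), (100, 100, 3))  # smaller images
        grey_scaled = rgb2gray(img)                                     # 2-D grayscale
        hog_features = hog(grey_scaled, block_norm='L2-Hys', pixels_per_cell=(10, 10))
        color_features = img.flatten()
        final_features_list.append(np.hstack((color_features, hog_features)))
        target_arr.append(categories.index(i))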
With so many features you can probably reduce dimensionality. So standard-scale the data and apply PCA.
standard_sc = StandardScaler()
matrix_scaled = standard_sc.fit_transform(np.array(final_features_list))
# read up on how to select the number of components;
# there are methods to help you with that
pca = PCA(n_components=300)
matrix_scaled_pca = pca.fit_transform(matrix_scaled)
Then try to run your grid search again using the matrix_scaled_pca matrix; it should go much faster. You could try RandomizedSearchCV or, better yet, something that should be about 10x faster than GridSearchCV: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.HalvingGridSearchCV.html
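As a minimal sketch of that last suggestion (note that HalvingGridSearchCV still requires scikit-learn's explicit experimental enable import, and probability=True is itself expensive, so drop it unless you need predict_proba):

# hedged sketch: successive halving over the same parameter grid, using all cores
from sklearn.experimental import enable_halving_search_cv  # noqa: F401 (required)
from sklearn.model_selection import HalvingGridSearchCV

y = np.array(target_arr)
x_train, x_test, y_train, y_test = train_test_split(matrix_scaled_pca, y, train_size=0.75, stratify=y)
model = HalvingGridSearchCV(svm.SVC(probability=True), param_grid, n_jobs=-1)
model.fit(x_train, y_train)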
Best of luck,
My question is close to this thread, but the difference is that I want my training and test datasets to be spatially disjoint, so that no two samples come from the same geographical region (you can define the region by county, state, or a random geographical grid you create for your own dataset, among others). An example of my dataset is like THIS, which is an instance segmentation task for satellite imagery.
I know pytorch has this capability for random splitting:
train_size = int(0.75 * len(full_dataset))
test_size = len(full_dataset) - train_size
train_dataset, test_dataset = torch.utils.data.random_split(full_dataset, [train_size, test_size])
However, perhaps what I want is a spatially_random_splitting functionality.
The picture below also illustrates the question; in my case, each point is an image with associated labels.
I am not completely sure what your dataset and labels look like, but from what I see: why not cut the image into predefined chunk sizes as here - https://stackoverflow.com/a/63815878/4471672 - and, say, save each chunk in a different folder according to its location? Then sample randomly from whichever set you need, knowing the sets are spatially disjoint.
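A small sketch of that idea, assuming (hypothetically) that each chunk's filename encodes its grid cell, e.g. tile_r03_c07.png, so every chunk from the same cell lands in the same split:

# hedged sketch: split chunks into train/test folders by geographic grid cell
import os
import random
import shutil

random.seed(0)
tiles_dir = 'tiles'  # hypothetical folder of pre-cut chunks
cells = {tuple(f.split('_')[1:3]) for f in os.listdir(tiles_dir)}
test_cells = set(random.sample(sorted(cells), k=len(cells) // 4))  # ~25% of cells go to test

for f in os.listdir(tiles_dir):
    split = 'test' if tuple(f.split('_')[1:3]) in test_cells else 'train'
    os.makedirs(split, exist_ok=True)
    shutil.copy(os.path.join(tiles_dir, f), os.path.join(split, f))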
I found the answer via the TorchGEO library. Thank you all.
from torch.utils.data import DataLoader      # imports implied by the snippet
from torchgeo.datasets import stack_samples
from torchgeo.samplers import RandomGeoSampler

sampler = RandomGeoSampler(dataset, size=256, length=10000)
dataloader = DataLoader(dataset, batch_size=128, sampler=sampler, collate_fn=stack_samples)
I'm trying to use sklearn with pyspark, but I'm having some performance issues. Let's say I have a dataset that has already gone through a pipeline where features were vectorized and normalized. In order to use sklearn, I'll have to provide the algorithm either an array or a pandas dataframe. Considering the size of my dataset (2.8M+ rows, and it should grow bigger soon), converting the train set to pandas is painstakingly slow, so I'm using a numpy approach:
train = np.array(train.select('features').collect()).squeeze()
This is relatively slow, as I need to use collect to push the data back to the driver. Is there any other approach that would be faster and better? Additionally, due to the nature of the problem, I'm currently handling my scoring function in a non-standard way:
def score(fitmodel, test):
    predY = fitmodel.score_samples(test)
    return np.full((1, len(predY)), np.mean(predY)).transpose()
The idea is to compute the average of the predictions and then return an array where this result is replicated as many times as the number of records tested. E.g. if my test set has 450 records, I'll return an array shaped (450, 1) where all 450 records have the same value (the average of the predictions). Although slow, so far so good, and everything works as intended. My problem is that I need to proceed with the testing by doing this multiple times (changing the test set) and append the results to a single array in order to evaluate the model's performance later. My code:
for _ in tqdm(range(450, 800)):
    # Get group #X: a different "chunk" of the dataset each iteration
    _test = df.where(col('index') == _)
    _test = _test.coalesce(2)  # coalesce returns a new DataFrame; it does not modify in place
    # Apply pipeline with transforms
    test = pipelineModel.transform(_test)
    y_test = np.array(test.select('label').collect())
    x_test = np.array(test.select('features').collect()).squeeze()
    pred = score(model, x_test)  # call matched to the score() defined above
    if _ == 450:  # First run
        trueY = y_test
        predY = pred
    else:
        trueY = np.append(trueY, y_test)
        predY = np.append(predY, pred)
Briefly, I grab a specific portion of the dataset, test it, and then append both the predictions and the "true" labels for later evaluation. My main problem here is that doing two collects and an np.append per iteration takes a long time, and I need to find an alternative. Testing the entire test set (about 400k entries) takes me ~1 min, but with all of this in place it increases to 2h20min.
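For reference, np.append copies the entire array on every call, so a sketch of the loop that collects the parts in Python lists and concatenates once at the end should at least cut that cost:

# sketch: accumulate in lists, concatenate once (avoids repeated full-array copies)
trueY_parts, predY_parts = [], []
for _ in tqdm(range(450, 800)):
    test = pipelineModel.transform(df.where(col('index') == _))
    trueY_parts.append(np.array(test.select('label').collect()))
    x_test = np.array(test.select('features').collect()).squeeze()
    predY_parts.append(score(model, x_test))
trueY = np.concatenate(trueY_parts).ravel()
predY = np.concatenate(predY_parts).ravel()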
On top of this, I have to convert the array back to a pyspark dataframe to use the mllib evaluation functions, which adds a bit more time to the process.
With all this being said, could anyone point me in a direction where I could accomplish this more efficiently? Maybe there is another way to use Spark and sklearn together?
When I append my labels, I end up with 20580 for the length of y, when what I'm hoping for is 120, which is the number of categories. How can I append the categories to my labels?
import numpy as np
import matplotlib.pyplot as plt
import os
import cv2
import random as rand
import time
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Conv2D, Activation, Dropout, Flatten, MaxPooling2D
from tensorflow.keras.callbacks import TensorBoard
from tensorflow.keras.optimizers import Adam
config = tf.compat.v1.ConfigProto(gpu_options=tf.compat.v1.GPUOptions(allow_growth=True))
sess = tf.compat.v1.Session(config=config)
DATADIR = "C:/Users/samue/Documents/Datasets/DogBreeds/images/Images"
CATEGORIES = os.listdir(DATADIR)
IMG_SIZE = 100
training_data = []
def create_training_data():
    for category in CATEGORIES:
        path = os.path.join(DATADIR, category)
        class_num = CATEGORIES.index(category)
        for img in os.listdir(path):
            try:
                img_array = cv2.imread(os.path.join(path, img), cv2.IMREAD_GRAYSCALE)
                new_array = cv2.resize(img_array, (IMG_SIZE, IMG_SIZE))
                training_data.append([new_array, class_num])
            except Exception as e:
                pass

create_training_data()
rand.shuffle(training_data)
X = []
y = []
for features, label in training_data:
    X.append(features)
    y.append(label)
X = np.array(X).reshape(-1, IMG_SIZE, IMG_SIZE, 1)
y = np.array(y).reshape(-1,)
print(len(CATEGORIES))
print(len(X))
print(len(y))
The outputs I get at the end are:
120
20580
20580
I think you should step back a little from the implementation details, or even from this specific problem, to try to understand what is going on. In image classification, the objective is to classify the input, a 2D tensor (or a 3D tensor if it's a multichannel image), by assigning it a label. The number of labels is finite; you can only classify into a certain number of classes.
To give an example, let's take the MNIST database. It is a well-known toy dataset used for image classification tasks. In the training set, there are 60,000 1x28x28 images representing handwritten digits. Generally speaking, the goal with this dataset is to classify each image properly into a total of 10 labels. The labels correspond to the numbers "0", "1", "2", and so on up to "9". So the question in this particular case is: given image X, my model needs to predict a class for this image, either "0", "1", ..., or "9"; there are only 10 possibilities. In supervised learning, we use labels to train the model. For any given input, we need to know the ground truth, i.e. the real class this input belongs to. So in turn you end up with as many labels as there are inputs, because each input is assigned its own label, regardless of the number of unique possible labels.
In your use case, you are working with a total of 120 classes and 20,580 images. That's 20,580 unique data inputs. Remember, we need a corresponding ground truth for each one of those images: the real class the image belongs to. So naturally you end up with a total of 20,580 labels as well.
This might have been the source of your confusion. In my own terms, a label is different from a class: a class set is a unique set of entities (animals, digits, ...), while a label refers to a particular class from that class set, assigned to a particular input.
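If what you were actually after is a target whose width is 120, the usual move is to one-hot encode the 20,580 integer labels, which gives a (20580, 120) array, still one row per image, rather than 120 labels. A minimal sketch:

# sketch: one label per image, one-hot encoded across the 120 classes
from tensorflow.keras.utils import to_categorical

y_onehot = to_categorical(y, num_classes=len(CATEGORIES))
print(y_onehot.shape)  # (20580, 120): one row per image, one column per class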
I think you are a bit confused. You should have a dataset consisting of 120 classes.
For each of those classes you need images characteristic of that class. For example, assume you are building a classifier to distinguish between images of dogs and images of cats, so you have 2 classes. You can structure your directories as follows:
source_dir
----------cats_dir
------------------cats first image
------------------cats second image
------------------cats nth image
----------dogs_dir
------------------dogs first image
------------------dogs second image
------------------dogs mth image
For your case you will have 120 subdirectories (class directories) below source_dir, and each should contain the images associated with its class. It appears you have a total of 20,580 images; if they are evenly distributed, that is roughly 171 images per class. Now you want to use these images to train a CNN. You can do it the way you were proceeding, but I recommend against it, because you would load all 20,580 100 x 100 images into memory at once. This takes a very large amount of memory, and you are likely to get an OOM (out of memory) error. The way to solve that is to feed the data to your model in batches, for example 32 images at a time. Keras has useful functions to assist you in doing that. If you have the directory structure shown above, you can use ImageDataGenerator.flow_from_directory to feed your images to the model in batches. Documentation is here. This function also enables you to use image augmentation to help expand the diversity of your dataset. Below is the code I recommend for the dog/cat classification example mentioned above.
from tensorflow.keras.preprocessing.image import ImageDataGenerator  # import implied by the snippet

source_dir = r'c:\temp\cats_and_dogs'
v_split = .2  # set this to determine the percentage of data to allocate to the validation set
data_gen = ImageDataGenerator(rescale=1/255, validation_split=v_split)
train_gen = data_gen.flow_from_directory(source_dir, target_size=(100, 100),
                                         class_mode='categorical', batch_size=32,
                                         subset='training', color_mode='grayscale')
valid_gen = data_gen.flow_from_directory(source_dir, target_size=(100, 100),
                                         class_mode='categorical', batch_size=32,
                                         subset='validation', color_mode='grayscale',
                                         shuffle=False)
When you compile your model, set loss='categorical_crossentropy'.
You can use the two generators above as inputs to model.fit.
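A minimal sketch of that wiring (the model itself is assumed to be already defined for 100 x 100 x 1 inputs and 120 output classes):

# sketch: compile and fit with the generators defined above
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
history = model.fit(train_gen, validation_data=valid_gen, epochs=10)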
To train a neural network, I modified code I found on YouTube. It looks as follows:
def data_generator(samples, batch_size, shuffle_data=True, resize=224):
    num_samples = len(samples)
    while True:
        random.shuffle(samples)
        for offset in range(0, num_samples, batch_size):
            batch_samples = samples[offset: offset + batch_size]
            X_train = []
            y_train = []
            for batch_sample in batch_samples:
                img_name = batch_sample[0]
                label = batch_sample[1]
                img = cv2.imread(os.path.join(root_dir, img_name))
                #img, label = preprocessing(img, label, new_height=224, new_width=224, num_classes=37)
                img = preprocessing(img, new_height=224, new_width=224)
                label = my_onehot_encoded(label)
                X_train.append(img)
                y_train.append(label)
            X_train = np.array(X_train)
            y_train = np.array(y_train)
            yield X_train, y_train
Now, I tried to train a neural network using this code. The training sample size is 105,000 image files, each containing 8 characters out of 37 possibilities (A-Z, 0-9, and blank space).
I used a relatively small batch size (32; I think that is already too small) to make it more efficient, but nevertheless it took forever to train even one quarter of the first epoch (I had 826 steps per epoch, and it took 90 minutes for 199 steps; steps_per_epoch = num_train_samples // batch_size).
The following functions are included in the data generator:
def shuffle_data(data):
    random.shuffle(data)  # random.shuffle shuffles in place and returns None
    return data
I don't think we can make this function any more efficient or exclude it from the generator.
def preprocessing(img, new_height, new_width):
    # note: cv2.resize expects (width, height); with square targets the order makes no difference
    img = cv2.resize(img, (new_width, new_height))
    img = img / 255
    return img
For preprocessing/resizing, I use this code to get the images to a uniform size of e.g. (224, 224, 3). I think this part of the generator takes the most time, but I don't see a possibility to exclude it from the generator (since my memory would be full if we resized the images outside the batches).
# One-hot encoding of the labels
def my_onehot_encoded(label):
    # define the universe of possible input values
    characters = '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ '
    # define a mapping of chars to integers
    char_to_int = dict((c, i) for i, c in enumerate(characters))
    # integer encode the input data
    integer_encoded = [char_to_int[char] for char in label]
    # one-hot encode
    onehot_encoded = list()
    for value in integer_encoded:
        character = [0 for _ in range(len(characters))]
        character[value] = 1
        onehot_encoded.append(character)
    return onehot_encoded
I think this part offers one approach to making it more efficient. I am thinking about excluding this code from the generator and producing the array y_train outside the generator, so that the generator does not have to one-hot encode the labels every time.
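Something like this sketch, where each sample's label is encoded once, before training starts:

# sketch: encode every label once, up front; the generator then just reads it
samples = [(img_name, np.array(my_onehot_encoded(label)))
           for img_name, label in samples]
# inside the generator, `label = batch_sample[1]` is then already one-hot encoded,
# so the per-batch my_onehot_encoded(label) call can be dropped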
What do you think? Or should I maybe go for a completely different approach?
I have found your question very intriguing, because you give only clues. So here is my investigation.
Using your snippets, I found the GitHub repository and the 3-part video tutorial on YouTube that mainly focus on the benefits of using generator functions in Python.
The data is based on this Kaggle problem (I would recommend checking out the different kernels for it, to compare the approach you have already tried with other CNN networks and to review the APIs in use).
You do not need to write a data generator from scratch; though it is not hard, reinventing the wheel is not productive.
Keras has the ImageDataGenerator class.
Plus here is a more generic example for DataGenerator.
Tensorflow offers very neat pipelines with their tf.data.Dataset.
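For illustration, a rough tf.data sketch, under the assumption that the file paths and one-hot labels already exist as Python lists (paths and onehot_labels are hypothetical names, not from the original code):

import tensorflow as tf

def load(path, label):
    # decode and resize one image; labels pass through untouched
    img = tf.io.decode_image(tf.io.read_file(path), channels=3, expand_animations=False)
    img = tf.image.resize(img, (224, 224)) / 255.0
    return img, label

ds = (tf.data.Dataset.from_tensor_slices((paths, onehot_labels))
        .shuffle(1000)
        .map(load, num_parallel_calls=tf.data.AUTOTUNE)
        .batch(32)
        .prefetch(tf.data.AUTOTUNE))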
Nevertheless, to solve the Kaggle task, the model needs to perceive single images only, hence the model is a simple deep CNN. But as I understand it, you are combining 8 random characters (classes) into one image in order to recognize multiple classes at once. For that task, you need an R-CNN or YOLO as your model. I just recently discovered YOLO v4 for myself, and it is possible to make it work for a specific task really quickly.
General advice about your design and code.
Make sure the library uses the GPU; it saves a lot of time. (Even though I repeated the flowers experiment from the repository quite quickly on CPU, about 10 minutes, the resulting predictions were no better than a random guess. So full training requires a lot of time on the CPU.)
Compare different versions to find the bottleneck. Try a dataset with 48 images (1 per class), increase the number of images per class, and compare. Reduce the image size, change the model structure, etc.
Test brand-new models on small, artificial data to prove the idea, or use an iterative process: start from projects that can be converted to your task (handwriting recognition?).
Context to what I'm trying to achieve:
I have a problem regarding image classification using scikit. I have CIFAR-10 data: training and testing images. There are 10,000 training images and 1,000 testing images. Each test/train image is stored in a test/train npy file as a 4-D matrix (height, width, rgb, sample). I also have test/train labels. I have a computeFeatures method that uses the Histogram of Oriented Gradients to represent image-domain features as a vector. I am trying to iterate this method over both the training and testing data so that I can create an array of features to be used later to classify the images. I have tried creating a for loop over i and storing the results in a numpy array. I must then continue to apply PCA/LDA and do image classification with SVC, CNN, etc. (any method of image classification).
import numpy as np
import skimage.feature
from sklearn.decomposition import PCA
from sklearn.svm import SVC

trnImages = np.load('trnImage.npy')
tstImages = np.load('tstImage.npy')
trnLabels = np.load('trnLabel.npy')
tstLabels = np.load('tstLabel.npy')

def computeFeatures(image):
    hog_feature, hog_as_image = skimage.feature.hog(image, visualize=True, block_norm='L2-Hys')
    return hog_feature

trnArray = np.zeros([10000, 324])
tstArray = np.zeros([1000, 324])

for i in range(0, 10000):
    trnFeatures = computeFeatures(trnImages[:, :, :, i])
    trnArray[i, :] = trnFeatures

for i in range(0, 1000):
    tstFeatures = computeFeatures(tstImages[:, :, :, i])
    tstArray[i, :] = tstFeatures

pca = PCA(n_components=2)
trnModel = pca.fit_transform(trnArray)
pca = PCA(n_components=2)
tstModel = pca.fit_transform(tstArray)

# Divide the dataset into the two sets.
test_data = tstModel
test_labels = tstLabels
train_data = trnModel
train_labels = trnLabels

C = 1
model = SVC(kernel='linear', C=C)
model.fit(train_data, train_labels.ravel())
y_pred = model.predict(test_data)

accuracy = np.sum(np.equal(test_labels, y_pred)) / test_labels.shape[0]
print('Percentage accuracy on testing set is: {0:.2f}%'.format(accuracy * 100))
The accuracy prints out as 100%. I'm pretty sure this is wrong, but I'm not sure why.
First of all,
pca = PCA(n_components=2)
tstModel = pca.fit_transform(tstArray)
this is wrong. You have to use:
tstModel = pca.transform(tstArray)
Secondly, how did you select the dimensionality of the PCA? Why 2? Why not 25 or 100? 2 principal components may be too few for images. Also, as far as I can see, the datasets are not scaled prior to PCA.
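For reference, a sketch of the conventional pattern, with the scaler and PCA fitted on the training set only (the 100 components here are an arbitrary placeholder to be tuned):

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

scaler = StandardScaler().fit(trnArray)                # fit on training data only
pca = PCA(n_components=100).fit(scaler.transform(trnArray))

trnModel = pca.transform(scaler.transform(trnArray))
tstModel = pca.transform(scaler.transform(tstArray))   # reuse the fitted transforms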
Just for interest, check the balance of classes.
Regarding 'shall we use PCA before SVM or not': it highly depends on the data. Try both cases and then decide. SVC can be pretty slow to compute, so PCA (or another dimensionality reduction technique) may speed it up a little. But you need to check both cases.
The immediate concern in this sort of situation is that the model is over-fitted. Any professional reviewer would immediately return this to the investigator. In this case, I suspect it is a result of the statistical approach used.
I don't work with images, but I would question why PCA is being stacked onto SVM. In common terms, you are using two successive methods that reduce/collapse hyper-dimensional space. This would very likely lead to a foregone outcome. If you collapse high-level dimensionality once, why repeat it?
The PCA is standard for images, but should be followed by something very simple such as K-means.
The other approach instead of PCA is, of course, NMF and I would recommend it if you feel PCA is not providing the resolution sought.
Otherwise the calculation looks fine.
accuracy = np.sum(np.equal(test_labels, y_pred)) / test_labels.shape[0]
On second thoughts, the accuracy index might not be a matter of over-fitting IF (that's a grammatical-emphasis 'IF') test_labels contained a prediction of the image (of which ~50% are incorrect).
I'm just guessing that this is what the "test_labels" data is, however, and we have no idea how that prediction was derived. So I'm not sure there's enough information to answer the question.
BTW, could someone explain shape[0], please? Is it needed?
One obvious problem with your approach is that you apply PCA in a rather peculiar way. You should typically estimate only one transform, on the training data, and then use it to transform any evaluation set as well.
The way you have it, you kind of... implement SVM with whitening batch-norm, which sounds cool but is at least rather unusual, so it would need much care. E.g. this way, you cannot classify a single sample. Still, it may work as an unsupervised adaptation technique.
Apart from that, it's hard to tell without access to your data. Are you sure that the test and train sets are disjoint?