Context to what I'm trying to achieve:
I have a problem regarding image classification using scikit-learn. I have CIFAR-10 data: 10000 training images and 1000 testing images. Each train/test set is stored in an npy file as a 4-d matrix (height, width, rgb, sample), and I also have the train/test labels. I have a computeFeatures method that uses the Histogram of Oriented Gradients (HOG) method to represent the image-domain features as a vector. I am trying to iterate this method over both the training and testing data so that I can create an array of features that can be used later to classify the images. I have tried creating a for loop using i and storing the results in a numpy array. I must then continue to apply PCA/LDA and do image classification with SVC, CNN, etc. (any method of image classification).
import numpy as np
import skimage.feature
from sklearn.decomposition import PCA
trnImages = np.load('trnImage.npy')
tstImages = np.load('tstImage.npy')
trnLabels = np.load('trnLabel.npy')
tstLabels = np.load('tstLabel.npy')
from sklearn.svm import SVC
def computeFeatures(image):
    hog_feature, hog_as_image = skimage.feature.hog(image, visualize=True, block_norm='L2-Hys')
    return hog_feature

trnArray = np.zeros([10000, 324])
tstArray = np.zeros([1000, 324])

for i in range(0, 10000):
    trnFeatures = computeFeatures(trnImages[:, :, :, i])
    trnArray[i, :] = trnFeatures

for i in range(0, 1000):
    tstFeatures = computeFeatures(tstImages[:, :, :, i])
    tstArray[i, :] = tstFeatures
pca = PCA(n_components = 2)
trnModel = pca.fit_transform(trnArray)
pca = PCA(n_components = 2)
tstModel = pca.fit_transform(tstArray)
# Divide the dataset into the two sets.
test_data = tstModel
test_labels = tstLabels
train_data = trnModel
train_labels = trnLabels
C = 1
model = SVC(kernel='linear', C=C)
model.fit(train_data, train_labels.ravel())
y_pred = model.predict(test_data)
accuracy = np.sum(np.equal(test_labels, y_pred)) / test_labels.shape[0]
print('Percentage accuracy on testing set is: {0:.2f}%'.format(accuracy))
Accuracy prints out as 100%; I'm pretty sure this is wrong, but I'm not sure why.
First of all,
pca = PCA(n_components = 2)
tstModel = pca.fit_transform(tstArray)
this is wrong: the PCA should be fitted once, on the training data, and then only applied to the test set. You have to use:
tstModel = pca.transform(tstArray)
Secondly, how did you select the dimension of the PCA? Why 2? Why not 25 or 100? Two principal components may be too few for images. Also, as I understand it, the datasets are not scaled prior to PCA.
Just for interest, check the balance of classes.
Regarding 'shall we use PCA before SVM or not': it highly depends on the data. Try both cases and then decide. SVC may be pretty slow to compute, so PCA (or another dimensionality-reduction technique) may speed it up a little, but you need to check both cases.
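For illustration, a minimal sketch of the intended flow, with the PCA (and an optional scaler) fitted on the training set only; the number of components is a placeholder to tune, not a recommendation:

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC

scaler = StandardScaler()
trnScaled = scaler.fit_transform(trnArray)
tstScaled = scaler.transform(tstArray)      # reuse the training-set statistics

pca = PCA(n_components=50)                  # placeholder value; try different component counts
trnModel = pca.fit_transform(trnScaled)
tstModel = pca.transform(tstScaled)         # same transform, no refitting on the test set

model = SVC(kernel='linear', C=1)
model.fit(trnModel, trnLabels.ravel())
accuracy = model.score(tstModel, tstLabels.ravel())
print('Percentage accuracy on testing set is: {0:.2f}%'.format(accuracy * 100))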
The immediate concern in this sort of situation is that the model is over-fitted. Any professional reviewer would immediately return this to the investigator. In this case, I suspect it is a result of the statistical approach used.
I don't work with images, but I would question why PCA was being stacked onto SVM. In plain terms, you are using two successive methods that reduce/collapse the high-dimensional space. This would very likely lead to a definite outcome. If you collapse the high dimensionality once, why repeat it?
The PCA is standard for images, but should be followed by something very simple such as K-means.
The other approach instead of PCA is, of course, NMF and I would recommend it if you feel PCA is not providing the resolution sought.
Otherwise the calculation looks fine.
accuracy = np.sum(np.equal(test_labels, y_pred)) / test_labels.shape[0]
On second thoughts, the accuracy index might not be a matter of over-fitting IF (that's a grammatical-emphasis type of 'IF') test_labels contained a prediction of the image (of which ~50% are incorrect).
I'm just guessing that this is what the "test_labels" data is, however, and we have no idea how that prediction was derived. So I'm not sure there's enough information to answer the question.
BTW, could someone explain "shape[0]", please? Is it needed?
One obvious problem with your approach is that you apply PCA in a rather peculiar way. You should typically only estimate one transform -- on the training data -- and then use it to transform any evaluation set as well.
This way, you kind of... implement SVM with whitening batch-norm, which sounds cool, but is at least rather unusual. So it would need much care. E.g. this way, you cannot classify a single sample. Still, it may work as an unsupervised adaptation technique.
Apart from that, it's hard to tell without access to your data. Are you sure that the test and train sets are disjoint?
Related
I'm using pytorch to implement a simple linear regression model.
The code works perfectly for randomly created datasets, but when it comes to the dataset I wanted to train, it gives significantly wrong results.
Here is the code:
import torch

x = torch.linspace(1, 100, steps=100)
learn_rate = 0.000001

x_train = x[:100]
x_test = x[100:]
y_train = data[:100]
y_test = data[100:]
# y_train = -0.01*x_train + torch.randn(100)*10  # Code for generating random data.

w = torch.rand(1, requires_grad=True)
b = torch.rand(1, requires_grad=True)

for i in range(1000):
    loss = torch.mean((y_train - (w*x_train + b))**2)
    if i % 100 == 0:
        print(loss)
    loss.backward()
    w.data.add_(-w.grad.data*learn_rate)
    b.data.add_(-b.grad.data*learn_rate)
    w.grad.data.zero_()
    b.grad.data.zero_()
The result it gives makes no sense.
However, when I used a randomly generated dataset, it works perfectly.
The datasets actually look similar, so I am not sure of the reason for the inaccuracy of this model.
Code for plotting data:
import matplotlib.pyplot as plt

plt.plot(x_train.numpy(), y_train.numpy())
plt.plot(x_train.numpy(), (w*x_train + b).data.numpy())
plt.show()
--
Now the problem seems to be that the weight converges much faster than the bias. At the current learning rate, the bias will not converge to the optimum. However, if I increase the learning rate just a little, the weight simply diverges. It seems I have to set two learning rates.
However, I'm wondering whether setting different learning rates is the best solution for a simple model like this, because I've found that not many models actually use different learning rates for different parameters.
Your code seems to be correct, but your model converges slower when there is a large bias in your data (because it now has to update the bias parameter many times before it reaches the correct value).
You could try running it for more iterations or increasing the learning rate.
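If it helps, here is a minimal sketch of the same loop written with torch.optim.SGD; it also shows how you could give the bias its own learning rate via parameter groups (the values below are placeholders to experiment with, not recommendations):

import torch

w = torch.rand(1, requires_grad=True)
b = torch.rand(1, requires_grad=True)

# One parameter group per tensor lets the bias take larger steps than the weight.
optimizer = torch.optim.SGD([
    {'params': [w], 'lr': 1e-6},
    {'params': [b], 'lr': 1e-2},  # placeholder, tune it
])

for i in range(10000):
    loss = torch.mean((y_train - (w * x_train + b)) ** 2)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if i % 1000 == 0:
        print(loss.item())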
I am using GridSearchCV() and its fit() method to build a model. I currently have this working, but would like to improve the accuracy of the model by supplying more images to train on. Right now, fit() takes over an hour to complete with 500 images, and processing time grows exponentially as the number of images doubles. Ultimately, I'd like to train on several thousand images and even include additional categories besides the two in my proof of concept. I have tried several ways to improve performance and can't resolve it. The only thing that reduces processing time is to significantly lower train_size/test_size in train_test_split(), but doing this defeats the purpose of a larger dataset to train from. I'm a little stumped on this one. Below is the code I'm using for reference. Thank you.
categories = ['Cat', 'Dog']
flat_data_arr = []
target_arr = []
datadir = 'C:\\Users\\Name\\Python\\images'

for i in categories:
    path = os.path.join(datadir, i)
    for image in os.listdir(path):
        image_array = imread(os.path.join(path, image))
        image_resized = resize(image_array, (150, 150, 3))
        flat_data_arr.append(image_resized.flatten())
        target_arr.append(categories.index(i))

flat_data = np.array(flat_data_arr)
target = np.array(target_arr)
df = pd.DataFrame(flat_data)
df['Target'] = target
x = df.iloc[:, :-1]
y = df.iloc[:, -1]

x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.75, test_size=0.25, shuffle=True, stratify=y)

param_grid = {'C': [0.1, 1, 10, 100], 'gamma': [0.0001, 0.001, 0.1, 1], 'kernel': ['rbf', 'poly']}
svc = svm.SVC(probability=True)
model = GridSearchCV(svc, param_grid)
model.fit(x_train, y_train)  # this takes hours depending on number of images
Try - https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.HalvingGridSearchCV.html
Also...
It's probably best to use TensorFlow, Keras or PyTorch for computer vision; with GPUs on top, this will run in milliseconds, and
even without a GPU you will see a significant speed-up.
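For example, a minimal CNN sketch in Keras might look like the following (layer sizes and the training call are placeholders, not tuned values; it assumes the images are loaded as (n, 150, 150, 3) float arrays with integer labels):

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation='relu', input_shape=(150, 150, 3)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, activation='relu'),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(len(categories), activation='softmax'),
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
# model.fit(images, labels, epochs=10)  # images: (n, 150, 150, 3) floats, labels: integer class ids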
However, if you decide to continue with this approach, you could try the following (basically reducing dimensions & adding features):
Support libraries:
from PIL import Image
import numpy as np
from skimage.feature import hog
from skimage.color import rgb2gray
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
I'm not sure why, but I see you jump from np.array to pandas; I think you should be able to work directly with the numpy matrix.
Make sure you are using all cores/processors; the parameter n_jobs=-1 in your grid-search call should do it.
Then you can also reduce the size of your images even further, say 100 x 100 instead of 150 x 150
Additionally, you could convert the images to grayscale (reducing 3 colour channels to 1):
grey_scaled = rgb2gray(imread(os.path.join(path, image)))
If you're interested in experimenting, you could then try to use HOG features of your grey_scaled image from the previous step via
hog_features = hog(grey_scaled, block_norm='L2-Hys', pixels_per_cell=(10,10))
You could even then try to stack the original image and the HOG features together:
color_features = imread(os.path.join(path, image)).flatten()
final_features = np.hstack((color_features, hog_features))
Loop over all your images, append the result of this pipeline to a "final_features_list" list (as sketched below), and convert that list to a matrix with np.array(final_features_list).
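Put together, the per-image loop might look roughly like this (a sketch reusing categories, datadir, imread and resize from your question, plus rgb2gray and hog from the imports above; the 100x100 size is just the suggestion from the earlier step):

final_features_list = []
target_arr = []
for i in categories:
    path = os.path.join(datadir, i)
    for image in os.listdir(path):
        image_resized = resize(imread(os.path.join(path, image)), (100, 100, 3))
        grey_scaled = rgb2gray(image_resized)                      # 1 channel instead of 3
        hog_features = hog(grey_scaled, block_norm='L2-Hys', pixels_per_cell=(10, 10))
        color_features = image_resized.flatten()                   # raw pixel values
        final_features_list.append(np.hstack((color_features, hog_features)))
        target_arr.append(categories.index(i))

final_features = np.array(final_features_list)
target = np.array(target_arr)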
With so many features you can probably reduce dimensionality, so standard-scale and do PCA.
standard_sc = StandardScaler()
matrix_scaled = standard_sc.fit_transform(np.array(final_features_list))
### read up on how to select # of components
### there are methods to help you with that
pca = PCA(n_components=300)
matrix_scaled_pca = pca.fit_transform(matrix_scaled)
Then try to run your grid search again using the matrix_scaled_pca matrix; it should go much faster. You could try RandomizedSearchCV, or better yet something that should be faster than that (about 10x faster than GridSearch): https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.HalvingGridSearchCV.html
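A hedged sketch of the successive-halving search (note that in current scikit-learn versions it still has to be enabled via an experimental import; x and y below stand for your PCA-reduced training features and the matching labels):

from sklearn.experimental import enable_halving_search_cv  # noqa - enables HalvingGridSearchCV
from sklearn.model_selection import HalvingGridSearchCV
from sklearn import svm

param_grid = {'C': [0.1, 1, 10, 100], 'gamma': [0.0001, 0.001, 0.1, 1], 'kernel': ['rbf', 'poly']}
svc = svm.SVC(probability=True)
model = HalvingGridSearchCV(svc, param_grid, n_jobs=-1)  # n_jobs=-1 uses all cores
model.fit(x, y)  # e.g. x = the training rows of matrix_scaled_pca, y = their labels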
Best of luck,
I am getting into machine learning and recently I have studied classification of linearly separable data using Linear Discriminant Analysis. To do so I have used the scikit-learn package and the function
.discriminant_analysis.LinearDiscriminantAnalysis
on data from the MNIST database of handwritten digits. I have used the database to fit the model and do predictions on test data like this:
lda = LinearDiscriminantAnalysis(n_components=2)
lda.fit(data, labels)
lda.predict(testdata)
Which works just fine. I get a nice accuracy rate of 95%. However, the predict function uses data from all 784 dimensions (corresponding to images of 28x28 pixels). I don't understand why all dimensions are used for the prediction.
I thought the purpose of Linear Discriminant Analysis was to find a projection onto a low-dimensional space that maximizes class separation, such that ideally the data is linearly separable and classification is easy.
What's the point of LDA and determining the projection matrix if all 784 dimensions are used for prediction anyway?
From documentation:
discriminant_analysis.LinearDiscriminantAnalysis can be used to perform supervised dimensionality reduction, by projecting the input data to a linear subspace consisting of the directions which maximize the separation between classes (in a precise sense discussed in the mathematics section below). The dimension of the output is necessarily less than the number of classes, so this is, in general, a rather strong dimensionality reduction, and only makes sense in a multiclass setting.
This is implemented in discriminant_analysis.LinearDiscriminantAnalysis.transform. The desired dimensionality can be set using the n_components constructor parameter. This parameter has no influence on discriminant_analysis.LinearDiscriminantAnalysis.fit or discriminant_analysis.LinearDiscriminantAnalysis.predict.
Meaning n_components is used only for transform or fit_transform. You can use dimensionality reduction for removing noise from your data or for visualization.
The low dimension you mentioned is effectively n_classes in terms of classification.
If you use this as a dimensionality-reduction technique you can choose n_components dimensions, if you have specified it (it must be < n_classes). This has no impact on prediction, as mentioned in the documentation.
Hence, once you give it input data, it transforms the data into an n_classes-dimensional space and then uses this space for training/prediction. Reference: _decision_function() is used for prediction.
You can use transform(X) to view the new lower-dimensional space learned by the model.
Applying LDA on mnist data with reduced dimensions:
import matplotlib.pyplot as plt
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

lda = LinearDiscriminantAnalysis(n_components=2)
X_r2 = lda.fit(data_1000, labels_1000).transform(data_1000)

# LDA before tsne
colors = ['brown', 'black', 'deepskyblue', 'red', 'yellow', 'darkslategrey', 'navy', 'darkorange', 'deeppink', 'lawngreen']
target_names = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
lw = 2
y = labels_1000

plt.figure()
for color, i, target_name in zip(colors, [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], target_names):
    plt.scatter(X_r2[y == i, 0], X_r2[y == i, 1], alpha=.8, color=color,
                label=target_name)
plt.legend(loc='best', shadow=False, scatterpoints=1)
plt.title('LDA of MNIST dataset before TSNE')
plt.show()
I have an array with shape 57159x924 which I will use as training data. 896 of these 924 columns are features and the remaining are labels. I want to use logistic regression on this, but when I use the fit function from logistic regression I get a memory error. I guess it is because it's too much data for my computer's memory to handle. Is there any way to get around this problem?
The code I want to use is
lr = LogisticRegression(random_state=1)
lr.fit(train_set, train_label)
lr.predict_proba(x_test)
And the following is the error
line 21, in main
lr.fit(train_set, train_label)
....
return array(a, dtype, copy=False, order=order)
MemoryError
You haven't given enough details to really understand the problem or give a definite answer, but here are a few options I hope will help:
The amount of memory available might be configurable.
Training over all the data at the same time would raise OOM problems in many contexts, which is why the common practice is to use SGD (stochastic gradient descent) by training over batches, i.e. introducing only subsets of the data every iteration and getting a global optimization solution in a stochastic sense. If I'm guessing correctly, you're using sklearn.linear_model.LogisticRegression, which has different "solvers". Maybe the saga solver will handle your situation better.
There are other implementations out there, and some of them definitely have batching options built-in in a highly configurable way. And if worst comes to worst, implementing a logistic-regression model is fairly simple, and then you can batch easy as pie.
Edit (due to discussion in comments):
Here's a practical way to go about it, with a very very simple (and easy) example -
from sklearn.linear_model import SGDClassifier
import numpy as np
import random

X1 = np.random.multivariate_normal(mean=[10, 5], cov=np.diag([3, 8]), size=1000)   # diagonal covariance for simplicity
Y1 = np.zeros((1000, 1))
X2 = np.random.multivariate_normal(mean=[-4, 55], cov=np.diag([5, 1]), size=1000)  # diagonal covariance for simplicity
Y2 = np.ones((1000, 1))
X = np.vstack([X1, X2])
Y = np.vstack([Y1, Y2]).reshape([2000, ])

sgd = SGDClassifier(loss='log', warm_start=True)  # as mentioned in answer. note that shuffle is defaulted to True.
sgd.partial_fit(X, Y, classes=[0, 1])  # first time you need to say what your classes are

for k in range(1000):
    batch_indexs = random.sample(range(2000), 20)
    sgd.partial_fit(X[batch_indexs, :], Y[batch_indexs])
In practice you should be looking at the loss and accuracy and using a suitable while instead of for, but that much is left for the reader ;-)
Note that you can control more than I've shown (like the number of iterations etc.), so you should read the documentation of SGDClassifier properly.
Another thing to note is that there are different practices of batching. I just took a random subset every iteration, but some prefer to make sure every point in the data has been seen an equal number of times (e.g. shuffle the data and then take batches in order of index, or something like that).
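For example, an epoch-style variant of the loop above (shuffle once per pass, then take contiguous batches) could look like this:

n_samples, batch_size = X.shape[0], 20
for epoch in range(10):                        # a few passes over the data; in practice watch the loss instead
    order = np.random.permutation(n_samples)   # fresh shuffle each epoch
    for start in range(0, n_samples, batch_size):
        idx = order[start:start + batch_size]
        sgd.partial_fit(X[idx, :], Y[idx])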
I'm trying to build a NN with Keras and Tensorflow to predict the final chart position of a song, given a set of 5 features.
After playing around with it for a few days I realised that although my MAE was getting lower, this was because the model had just learned to predict the mean value of my training set for all input, and this was the optimal solution. (This is illustrated in the scatter plot below)
This is a random sample of 50 data points from my testing set vs what the network thinks they should be
At first I thought this was probably because my network was too complicated. I had one input layer with shape (5,) and a single node in the output layer, but then 3 hidden layers with over 32 nodes each.
I then stripped back the excess layers and moved to just a single hidden layer with a couple nodes, as shown here:
self.model = keras.Sequential([
    keras.layers.Dense(4,
                       activation='relu',
                       input_dim=num_features,
                       kernel_initializer='random_uniform',
                       bias_initializer='random_uniform'
                       ),
    keras.layers.Dense(1)
])
Training this with a gradient descent optimiser still results in exactly the same prediction being made the whole time.
Then it occurred to me that perhaps the actual problem I'm trying to solve isn't hard enough for the network, that maybe it's linearly separable. Since this would respond better to not having a hidden layer at all, essentially just doing regular linear regression, I tried that. I changed my model to:
inp = keras.Input(shape=(num_features,))
out = keras.layers.Dense(1, activation='relu')(inp)
self.model = keras.Model(inp,out)
This also changed nothing. My MAE and the predicted values are all still the same.
I've tried so many different things, different permutations of optimisation functions, learning rates, network configurations, and nothing can help. I'm pretty sure the data is good, but I've included a sample of it just in case.
chartposition,tagcount,dow,artistscore,timeinchart,finalpos
121,3925,5,35128,7,227
131,4453,3,85545,25,130
69,2583,4,17594,24,523
145,1165,3,292874,151,187
96,1679,5,102593,111,540
134,3494,5,1252058,37,370
6,34895,7,6824048,22,5
A sample of my dataset, finalpos is the value I'm trying to predict. Dataset contains ~40,000 records, split 80/20 - training/testing
def __init__(self, validation_split, num_features, should_log):
    self.should_log = should_log
    self.validation_split = validation_split
    inp = keras.Input(shape=(num_features,))
    out = keras.layers.Dense(1, activation='relu')(inp)
    self.model = keras.Model(inp, out)
    optimizer = tf.train.GradientDescentOptimizer(0.01)
    self.model.compile(loss='mae',
                       optimizer=optimizer,
                       metrics=['mae'])

def train(self, data, labels, plot=False):
    early_stop = keras.callbacks.EarlyStopping(monitor='val_loss', patience=20)
    history = self.model.fit(data,
                             labels,
                             epochs=self.epochs,
                             validation_split=self.validation_split,
                             verbose=0,
                             callbacks=[PrintDot(), early_stop])
    if plot: self.plot_history(history)
All the code relevant to constructing and training the network.
def normalise_dataset(df, mini, maxi):
    return (df - mini) / (maxi - mini)
Normalisation of the input data. Both my testing and training data are normalised to the max and min of the testing set
Graph of my loss vs validation curves for the one-hidden-layer network with an Adam optimiser, learning rate 0.01.
Same graph but with linear regression and a gradient descent optimiser.
So I am pretty sure that your normalization is the issue: you are not normalizing by feature (as is the de-facto industry standard), but across all data.
That means that two different features with very different orders of magnitude/ranges (in your case, compare timeinchart with artistscore) end up on the same scale only by accident.
Instead, you might want to normalize using something like scikit-learn's StandardScaler. Not only does this normalize per column (so you can pass all features at once), but it also gives unit variance (which is some assumption about your data, but can potentially help, too).
To transform your data, use something along these lines
from sklearn.preprocessing import StandardScaler
import numpy as np
raw_data = np.array([[1,40], [2, 80]])
scaler = StandardScaler()
processed_data = scaler.fit_transform(raw_data)
# fit() calculates mean etc, transform() puts it to the new range.
print(processed_data) # returns [[-1, -1], [1,1]]
Note that you have two possibilities to normalize/standardize your training data:
either scale the test data together with your training data and then split afterwards,
or fit the scaler on the training data only, and then use the same scaler to transform your test data.
Never fit_transform your test set separately from the training data!
Since you would potentially have different mean/min/max values, you could end up with totally wrong predictions! In a sense, the StandardScaler is your definition of your "data source distribution", which is inherently still the same for your test set, even though it might be a subset not exactly following the same properties (due to small sample size etc.).
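In code, the second option might look like this (a sketch, assuming x_train and x_test hold your feature matrices):

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)  # learn mean/variance from the training data only
x_test_scaled = scaler.transform(x_test)        # reuse the same statistics, never refit on test data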
Additionally, you might want to use a more advanced optimizer, like Adam, or specify some momentum (0.9 is a good choice in practice, as a rule of thumb) for your SGD.
Turns out the error was a really stupid and easy-to-miss bug.
When I was importing my dataset, I shuffled it; however, I was accidentally applying the shuffling only to the labels, not to the dataset as a whole.
As a result, each label was being assigned to a completely random feature set, and of course the model didn't know what to do with this.
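For anyone who runs into the same thing, a minimal way to keep features and labels aligned is to shuffle both with a single permutation (a sketch assuming numpy arrays named features and labels):

import numpy as np

perm = np.random.permutation(len(features))  # one permutation shared by both arrays
features_shuffled = features[perm]
labels_shuffled = labels[perm]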
Thanks to @dennlinger for suggesting that I look in the place where I eventually found this bug.