I am following along with the tutorial here:
https://blog.hyperiondev.com/index.php/2019/02/18/machine-learning/
I have the exact same code the author uses, but I will still share it below...
import scipy.io
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle

train_data = scipy.io.loadmat('train_32x32.mat')
X = train_data['X']
y = train_data['y']
img_index = 24
X = X.reshape(X.shape[0]*X.shape[1]*X.shape[2], X.shape[3]).T
y = y.reshape(y.shape[0],)
X, y = shuffle(X, y, random_state=42)
clf = RandomForestClassifier(n_estimators=10, n_jobs=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
clf.fit(X_train, y_train)  # <----------- MemoryError raised here
preds = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, preds))
The dataset I am using is basically a dictionary of digit labels and pictures of digits. Every time I reach the line I marked above, I receive a MemoryError. The full error traceback is below:
Traceback (most recent call last):
File "C:/Users/jack.walsh/Projects/img_recog/main.py", line 22, in <module>
clf.fit(X_train, y_train)
File "C:\Users\jack.walsh\AppData\Local\Programs\Python\Python37-32\lib\site-packages\sklearn\ensemble\forest.py", line 249, in fit
X = check_array(X, accept_sparse="csc", dtype=DTYPE)
File "C:\Users\jack.walsh\AppData\Local\Programs\Python\Python37-32\lib\site-packages\sklearn\utils\validation.py", line 496, in check_array
array = np.asarray(array, dtype=dtype, order=order)
File "C:\Users\jack.walsh\AppData\Local\Programs\Python\Python37-32\lib\site-packages\numpy\core\numeric.py", line 538, in asarray
return array(a, dtype, copy=False, order=order)
MemoryError
I ran Resource Monitor side-by-side with it and realized my used memory never goes above 30%. Let me know how I can get around this without altering the results!
X.shape = (73257, 3072)
X_train.shape = (51279, 3072)
I have 16GB RAM on this machine.
Given that your dataset has 3072 columns (reasonable for image data), I think it is simply too large for a random forest, especially when you apply no regularization to the classifier. The machine simply doesn't have enough memory to allocate for such a gigantic model.
Something that I would do in this situation:
Reduce the number of features before training. This is difficult here because your data is images and each column is just a pixel value, but you could resize the images to be smaller.
Add regularization to your random forest classifier. For example, set max_depth to a smaller value, or set max_features so that not all 3072 features are considered at every split (see the sketch at the end of this answer). Here's the full list of parameters that you can tune: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
According to Scikit Learn RandomForest Memory Error, setting n_jobs=1 might help as well.
Lastly, I would personally not use a random forest for image classification. I would choose a classifier like an SVM, or go deep with deep learning models.
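As a rough illustration of the regularization suggestion above, here is a minimal sketch, reusing X_train and y_train from the question; the particular max_depth and max_features values are placeholders to tune, not recommendations:
from sklearn.ensemble import RandomForestClassifier

# Illustrative values only: max_depth and max_features bound how large each
# tree can grow, which in turn bounds the memory the fitted forest needs.
clf = RandomForestClassifier(
    n_estimators=10,
    max_depth=15,         # stop trees from growing until every leaf is pure
    max_features='sqrt',  # consider only ~sqrt(3072), i.e. about 55 features per split
    n_jobs=1,
    random_state=42,
)
clf.fit(X_train, y_train)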
Dear Stack Overflow users,
I use sklearn to train a multi-label SVM (with probabilities) for text mining. For each entry, I do not have a single target label, but a list. These targets are transformed via a MultiLabelBinarizer:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import SVC

vect = TfidfVectorizer()
x_train = vect.fit_transform(training_texts)
mlb = MultiLabelBinarizer()
training_targets_mlb = mlb.fit_transform(training_targets)
clf = OneVsRestClassifier(SVC(kernel='linear', probability=True))
clf.fit(x_train, training_targets_mlb)
After upgrading sklearn to the latest version 0.19, the above code still works, but the following code for cross validation, which worked in the previous version (0.18 I think), now raises an error:
import sklearn.model_selection
from sklearn.model_selection import cross_val_predict

cv_scores = cross_val_predict(estimator=clf,
                              X=vect.fit_transform(texts),
                              y=mlb.fit_transform(targets),
                              cv=sklearn.model_selection.KFold(shuffle=True, n_splits=5),
                              method='predict_proba')
Error:
File "/usr/local/lib/python2.7/dist-packages/sklearn/model_selection/_validation.py", line 647, in cross_val_predict
y = le.fit_transform(y)
File "/usr/local/lib/python2.7/dist-packages/sklearn/preprocessing/label.py", line 111, in fit_transform
y = column_or_1d(y, warn=True)
File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 583, in column_or_1d
raise ValueError("bad input shape {0}".format(shape))
ValueError: bad input shape (18165, 25)
Is this intended behaviour, i.e. am I not supposed to pass an array here? If so, how would I then perform cross validation?
Thanks for your help!
I am working on unbalanced data. Using undersampling, I have made both classes the same proportion:
X_undersample: dataframe of shape (984, 28)
y_undersample: dataframe of shape (984, 1)
I am using a RandomForestClassifier, and in order to find the best n_estimators parameter I am using cross-validation. Here is the code:
from sklearn.cross_validation import cross_val_score
from sklearn.ensemble import RandomForestClassifier

j_shout = range(1, 300)
j_acc = []
for j in j_shout:
    lr = RandomForestClassifier(n_estimators=j, criterion='entropy', random_state=0)
    score = cross_val_score(lr, X_undersample, y_undersample, cv=10, scoring='accuracy')
    print('iteration', j, ':cross_validation accuracy=', score)
    j_acc.append(score.mean())
Now when I run this, I get the following error:
File "<ipython-input-43-954a9717dcea>", line 5, in <module>
score=cross_val_score(lr,X_undersample,y_undersample,cv=10,scoring='accuracy')
File "D:\installations\AC\lib\site-packages\sklearn\cross_validation.py", line 1562, in cross_val_score
cv = check_cv(cv, X, y, classifier=is_classifier(estimator))
File "D:\installations\AC\lib\site-packages\sklearn\cross_validation.py", line 1823, in check_cv
cv = StratifiedKFold(y, cv)
File "D:\installations\AC\lib\site-packages\sklearn\cross_validation.py", line 569, in __init__
label_test_folds = test_folds[y == label]
IndexError: too many indices for array
I tried changing n_estimators to smaller values, but it still shows the same error.
According to your traceback and the scikit-learn documentation for the StratifiedKFold iterator, it seems that StratifiedKFold expects y as a flattened array. In your case, you pass a dataframe with shape (984, 1). Your part of the code should look like this:
score = cross_val_score(estimator=lr,
                        X=X_undersample.values,
                        y=y_undersample.values.ravel(),
                        cv=10,
                        scoring='accuracy')
Given below is my code
import numpy as np
from sklearn import tree
from sklearn.cross_validation import train_test_split

dataset = np.genfromtxt('train_py.csv', dtype=float, delimiter=",")
X_train, X_test, y_train, y_test = train_test_split(dataset[:, :-1], dataset[:, -1], test_size=0.2, random_state=0)
model = tree.DecisionTreeClassifier(criterion='gini')
#y_train = y_train.tolist()
#X_train = X_train.tolist()
model.fit(X_train, y_train)
model.score(X_train, y_train)
predicted = model.predict(X_test)  # note: x_test was undefined; the split above creates X_test
I am trying to use the decision tree classifier on a custom dataset imported using numpy, but I get the ValueError below when I try to fit the model. I tried using both numpy arrays and plain lists, but I still can't figure out what is causing the error. Any help appreciated.
Traceback (most recent call last):
File "tree.py", line 19, in <module>
model.fit(X_train, y_train)
File "/usr/local/lib/python2.7/dist-packages/sklearn/tree/tree.py", line 177, in fit
check_classification_targets(y)
File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/multiclass.py", line 173, in check_classification_targets
raise ValueError("Unknown label type: %r" % y)
ValueError: Unknown label type: array([[ 252.3352],....<until end of array>
scikit-learn expects you to pass something label-like as classification targets: integers, strings, etc. Floats are not a typical encoding of a finite label space; they are used for regression.
From the documentation of fit, regarding X_train: "The training input samples. Internally, it will be converted to dtype=np.float32 and if a sparse matrix is provided to a sparse csc_matrix."
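To make that concrete, here is a minimal sketch under the assumption that the last CSV column is meant to be a class label that merely got stored as a float; if values like 252.3352 are genuinely continuous, the regressor branch is the one you want. It reuses X_train and y_train from the question.
import numpy as np
from sklearn import tree

# Case 1: the targets are discrete classes stored as floats.
# Casting them to integers gives labels the classifier accepts.
y_train_labels = y_train.astype(int)
clf = tree.DecisionTreeClassifier(criterion='gini')
clf.fit(X_train, y_train_labels)

# Case 2: the targets really are continuous, so classification is the
# wrong tool and a regression tree should be used instead.
reg = tree.DecisionTreeRegressor()
reg.fit(X_train, y_train)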
I have some code that tries to use a non-linear SVM (RBF kernel).
import numpy as np
from sklearn import svm

raw_data1 = open("/Users/prateek/Desktop/Programs/ML/Dataset.csv")
raw_data2 = open("/Users/prateek/Desktop/Programs/ML/Result.csv")
dataset1 = np.loadtxt(raw_data1, delimiter=",")
result1 = np.loadtxt(raw_data2, delimiter=",")
clf = svm.NuSVC(kernel='rbf')
clf.fit(dataset1, result1)
However, when I try to fit, I get the error
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/prateek/Desktop/Programs/ML/lib/python2.7/site-packages/sklearn/svm/base.py", line 193, in fit
fit(X, y, sample_weight, solver_type, kernel, random_seed=seed)
File "/Users/prateek/Desktop/Programs/ML/lib/python2.7/site-packages/sklearn/svm/base.py", line 251, in _dense_fit
max_iter=self.max_iter, random_seed=random_seed)
File "sklearn/svm/libsvm.pyx", line 187, in sklearn.svm.libsvm.fit (sklearn/svm/libsvm.c:2098)
ValueError: specified nu is infeasible
Link for Results.csv
Link for dataset
What is the reason for such an error?
The nu parameter is, as pointed out in the documentation, "An upper bound on the fraction of training errors and a lower bound of the fraction of support vectors".
So, whenever you try to fit your data and this bound cannot be satisfied, the optimization problem becomes infeasible; hence your error.
As a matter of fact, I looped from 1.0 down to 0.1 (in steps of 0.1) and still got the error, then tried 0.01 and no complaints arose. But of course, you should check the results of fitting your model with that value and verify that the accuracy of its predictions is acceptable.
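If you want to automate that search, here is a minimal sketch (assuming dataset1 and result1 are loaded as in the question) that simply lowers nu until libsvm stops rejecting it:
import numpy as np
from sklearn import svm

# Try progressively smaller nu values until the fit is feasible.
for nu in np.arange(1.0, 0.0, -0.01):
    try:
        clf = svm.NuSVC(kernel='rbf', nu=nu)
        clf.fit(dataset1, result1)
        print("feasible nu:", nu)
        break
    except ValueError:
        continue  # "specified nu is infeasible" -- keep lowering nu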
Update: I was actually curious and split your dataset to validate; the output was 69% accuracy (also, I think your training set might be quite small).
Just for reproducibility, here is the quick test I performed:
from sklearn import svm
import numpy as np
from sklearn.cross_validation import train_test_split
from sklearn.metrics import accuracy_score
raw_data1 = open("Dataset.csv")
raw_data2 = open("Result.csv")
dataset1 = np.loadtxt(raw_data1,delimiter=",")
result1 = np.loadtxt(raw_data2,delimiter=",")
clf = svm.NuSVC(kernel='rbf',nu=0.01)
X_train, X_test, y_train, y_test = train_test_split(dataset1,result1, test_size=0.25, random_state=42)
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)
accuracy_score(y_test, y_pred, normalize=True, sample_weight=None)
I am brand new to sklearn. I am using Pipeline to combine a vectorizer and a classifier in a text-mining problem. Here is my code:
from sklearn import cross_validation
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline

def create_ngram_model():
    tfidf_ngrams = TfidfVectorizer(ngram_range=(1, 3),
                                   analyzer="word", binary=False)
    clf = GaussianNB()
    pipeline = Pipeline([('vect', tfidf_ngrams), ('clf', clf)])
    return pipeline

def get_trains():
    data = open('../cleaning data/cleaning the sentences/cleaned_comments.csv', 'r').readlines()[1:]
    lines = len(data)
    features_train = []
    labels_train = []
    for i in range(lines):
        l = data[i].split(',')
        labels_train += [int(l[0])]
        a = l[2]
        features_train += [a]
    return features_train, labels_train

def train_model(clf_factory, features_train, labels_train):
    features_train, labels_train = get_trains()
    features_train, features_test, labels_train, labels_test = cross_validation.train_test_split(features_train, labels_train, test_size=0.1, random_state=42)
    clf = clf_factory()
    clf.fit(features_train, labels_train)
    pred = clf.predict(features_test)
    accuracy = accuracy_score(pred, labels_test)
    return accuracy

X, Y = get_trains()
print train_model(create_ngram_model, X, Y)
The features returned from get_trains() are strings.
I am getting this error.
clf.fit(features_train,labels_train)
File "C:\Python27\lib\site-packages\sklearn\pipeline.py", line 130, in fit
self.steps[-1][-1].fit(Xt, y, **fit_params)
File "C:\Python27\lib\site-packages\sklearn\naive_bayes.py", line 149, in fit
X, y = check_arrays(X, y, sparse_format='dense')
File "C:\Python27\lib\site-packages\sklearn\utils\validation.py", line 263, in check_arrays
raise TypeError('A sparse matrix was passed, but dense '
TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.
I have come across this error many times before; then I would just change the features to features_transformed.toarray(). But since I am using a pipeline here, I cannot do that, as the transformed features are passed along automatically. I also tried writing a new class that returns features_transformed.toarray(), but that threw the same error.
I have searched a lot but haven't figured it out. Please help!
There are 2 options:
Use a classifier that supports sparse data. For example, the documentation says that Bernoulli Naive Bayes and Multinomial Naive Bayes support sparse input for fit (see the sketch at the end of this answer).
Add a "densifier" to the Pipeline. Apparently your attempt got something wrong; this one worked for me (when I needed to densify my sparse data along the way):
class Densifier(object):
    def fit(self, X, y=None):
        return self  # returning self keeps the transformer usable on its own

    def fit_transform(self, X, y=None):
        return self.transform(X)

    def transform(self, X, y=None):
        return X.toarray()
Make sure to put it into the pipeline right before the classifier.
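For concreteness, here is a minimal sketch of both options, assuming the TfidfVectorizer setup from the question and the Densifier class above; the step names are arbitrary:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.pipeline import Pipeline

# Option 1: switch to a classifier that accepts sparse input.
sparse_friendly = Pipeline([('vect', TfidfVectorizer(ngram_range=(1, 3))),
                            ('clf', MultinomialNB())])

# Option 2: keep GaussianNB and densify right before the classifier.
dense_pipeline = Pipeline([('vect', TfidfVectorizer(ngram_range=(1, 3))),
                           ('to_dense', Densifier()),
                           ('clf', GaussianNB())])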