Error when using non-linear SVM (RBF kernel) in scikit-learn - python

I have some code that tries to use a non-linear SVM (RBF kernel):
import numpy as np
from sklearn import svm
raw_data1 = open("/Users/prateek/Desktop/Programs/ML/Dataset.csv")
raw_data2 = open("/Users/prateek/Desktop/Programs/ML/Result.csv")
dataset1 = np.loadtxt(raw_data1, delimiter=",")
result1 = np.loadtxt(raw_data2, delimiter=",")
clf = svm.NuSVC(kernel='rbf')
clf.fit(dataset1, result1)
However, when I try to fit, I get this error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/prateek/Desktop/Programs/ML/lib/python2.7/site-packages/sklearn/svm/base.py", line 193, in fit
fit(X, y, sample_weight, solver_type, kernel, random_seed=seed)
File "/Users/prateek/Desktop/Programs/ML/lib/python2.7/site-packages/sklearn/svm/base.py", line 251, in _dense_fit
max_iter=self.max_iter, random_seed=random_seed)
File "sklearn/svm/libsvm.pyx", line 187, in sklearn.svm.libsvm.fit (sklearn/svm/libsvm.c:2098)
ValueError: specified nu is infeasible
Link for Results.csv
Link for dataset
What is the reason for such an error?

The nu parameter is, as pointed out in the documentation, "An upper bound on the fraction of training errors and a lower bound of the fraction of support vectors".
So whenever you fit your data and this bound cannot be satisfied, the optimization problem becomes infeasible, hence the error.
As a matter of fact, I looped nu from 1.0 down to 0.1 in steps of 0.1 and still got the error; I then tried 0.01 and no complaints arose. Of course, you should check the results of fitting your model with that value and verify that the prediction accuracy is acceptable.
Update: I was curious and split your dataset for validation; the output was 69% accuracy (also, I think your training set might be quite small).
Just for reproducibility, here is the quick test I performed:
from sklearn import svm
import numpy as np
from sklearn.cross_validation import train_test_split
from sklearn.metrics import accuracy_score
raw_data1 = open("Dataset.csv")
raw_data2 = open("Result.csv")
dataset1 = np.loadtxt(raw_data1, delimiter=",")
result1 = np.loadtxt(raw_data2, delimiter=",")
clf = svm.NuSVC(kernel='rbf', nu=0.01)
X_train, X_test, y_train, y_test = train_test_split(dataset1, result1, test_size=0.25, random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
accuracy_score(y_test, y_pred, normalize=True, sample_weight=None)
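If you want to automate the hunt for a feasible nu, here is a minimal sketch of the loop described above (the candidate values are my own choice, not from the original post):
import numpy as np
from sklearn import svm
dataset1 = np.loadtxt("Dataset.csv", delimiter=",")
result1 = np.loadtxt("Result.csv", delimiter=",")
# Try progressively smaller nu values until libsvm accepts one
for nu in (0.5, 0.1, 0.05, 0.01, 0.005):
    clf = svm.NuSVC(kernel='rbf', nu=nu)
    try:
        clf.fit(dataset1, result1)
        print("nu =", nu, "is feasible")
        break
    except ValueError:
        # "specified nu is infeasible" -- keep shrinking nu
        print("nu =", nu, "is infeasible")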

Related

Scikit-learn MemoryError with RandomForestClassifier

I am following along with the tutorial here:
https://blog.hyperiondev.com/index.php/2019/02/18/machine-learning/
I have the exact same code the author uses, but I will still share it below...
import scipy.io
from sklearn.utils import shuffle
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
train_data = scipy.io.loadmat('train_32x32.mat')
X = train_data['X']
y = train_data['y']
img_index = 24
X = X.reshape(X.shape[0]*X.shape[1]*X.shape[2], X.shape[3]).T
y = y.reshape(y.shape[0],)
X, y = shuffle(X, y, random_state=42)
clf = RandomForestClassifier(n_estimators=10, n_jobs=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
clf.fit(X_train, y_train)  # <-- MemoryError raised here
preds = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, preds))
The dataset I am using is basically a dictionary of numbers and pictures of numbers. Every time I get to the line I pointed out above, I receive a MemoryError. The full error traceback is below:
Traceback (most recent call last):
File "C:/Users/jack.walsh/Projects/img_recog/main.py", line 22, in <module>
clf.fit(X_train, y_train)
File "C:\Users\jack.walsh\AppData\Local\Programs\Python\Python37-32\lib\site-packages\sklearn\ensemble\forest.py", line 249, in fit
X = check_array(X, accept_sparse="csc", dtype=DTYPE)
File "C:\Users\jack.walsh\AppData\Local\Programs\Python\Python37-32\lib\site-packages\sklearn\utils\validation.py", line 496, in check_array
array = np.asarray(array, dtype=dtype, order=order)
File "C:\Users\jack.walsh\AppData\Local\Programs\Python\Python37-32\lib\site-packages\numpy\core\numeric.py", line 538, in asarray
return array(a, dtype, copy=False, order=order)
MemoryError
I ran Resource Monitor side-by-side with it and realized my used memory never goes above 30%. Let me know how I can get around this without altering the results!
X.shape = (73257, 3072)
X_train.shape = (51279, 3072)
I have 16GB RAM on this machine.
Given that your dataset has 3072 columns (reasonable for images), I think it is simply too much for a random forest, especially with no regularization applied to the classifier. The machine simply doesn't have enough memory to allocate for such a gigantic model.
Something that I would do in this situation:
Reduce the number of features before training. This is difficult to do since your data is images and each column is just a pixel value, but maybe you can resize the images to be smaller.
Add regularization to your random forest classifier, for example, set max_depth to be smaller or set max_features so that not all 3072 features are considered at every split. Here's the full list of parameters that you can tune: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
According to Scikit Learn RandomForest Memory Error, setting n_jobs=1 might help as well.
Lastly, I would personally not use a random forest for image classification; I would choose a classifier like SVM, or go deep with deep learning models.
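As a rough illustration of the regularization suggestion, a minimal sketch (the max_depth and max_features values here are illustrative assumptions, not tuned):
from sklearn.ensemble import RandomForestClassifier
# Shallower trees and fewer candidate features per split keep each
# tree, and therefore the whole forest, much smaller in memory.
clf = RandomForestClassifier(n_estimators=10,
                             max_depth=20,         # cap tree depth instead of growing fully
                             max_features='sqrt',  # ~sqrt(3072) = 55 features tried per split
                             n_jobs=1,
                             random_state=42)
clf.fit(X_train, y_train)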

ValueError: cannot use sparse input in 'SVC' trained on dense data

I'm trying to run my classifier, but I get this error:
import pandas
import numpy as np
import pandas as pd
from sklearn import svm
from sklearn.svm import SVC
from sklearn import cross_validation
from sklearn.metrics import confusion_matrix
from sklearn.multiclass import OneVsOneClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_fscore_support as score
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
dataset = pd.read_csv('all_topics_limpo.csv', encoding = 'utf-8')
data = pandas.get_dummies(dataset['verbatim_corrige'])
labels = dataset['label']
X_train, X_test, y_train, y_test = train_test_split(data, labels, test_size = 0.2, random_state = 0)
count_vector = CountVectorizer()
tfidf = TfidfTransformer()
classifier = OneVsOneClassifier(SVC(kernel = 'linear', random_state = 100))
#classifier = LogisticRegression()
train_counts = count_vector.fit_transform(X_train)
train_tfidf = tfidf.fit_transform(train_counts)
classifier.fit(X_train, y_train)
test_counts = count_vector.transform(X_test)
test_tfidf = tfidf.transform(test_counts)
predicted = classifier.predict(test_tfidf)
predicted = classifier.predict(X_test)
print("confusion matrix")
print(confusion_matrix(y_test, predicted, labels = labels))
print("F-score")
print(f1_score(y_test, predicted))
print(precision_score(y_test, predicted))
print(recall_score(y_test, predicted))
print("cross validation")
test_counts = count_vector.fit_transform(data)
test_tfidf = tfidf.fit_transform(test_counts)
scores = cross_validation.cross_val_score(classifier, test_tfidf, labels, cv = 10)
print(scores)
print("Accuracy: {} +/- {}".format(scores.mean(), scores.std() * 2))
My output error:
ValueError: cannot use sparse input in 'SVC' trained on dense data
I cannot run my code because of this problem, and I do not understand what is happening.
Full error output:
Traceback (most recent call last):
File "classification.py", line 42, in
predicted = classifier.predict(test_tfidf)
File "/usr/lib/python3/dist-packages/sklearn/multiclass.py", line 584, in predict
Y = self.decision_function(X)
File "/usr/lib/python3/dist-packages/sklearn/multiclass.py", line 614, in decision_function
for est, Xi in zip(self.estimators_, Xs)]).T
File "/usr/lib/python3/dist-packages/sklearn/multiclass.py", line 614, in
for est, Xi in zip(self.estimators_, Xs)]).T
File "/usr/lib/python3/dist-packages/sklearn/svm/base.py", line 548, in predict
y = super(BaseSVC, self).predict(X)
File "/usr/lib/python3/dist-packages/sklearn/svm/base.py", line 308, in predict
X = self._validate_for_predict(X)
File "/usr/lib/python3/dist-packages/sklearn/svm/base.py", line 448, in _validate_for_predict
% type(self).__name__)
ValueError: cannot use sparse input in 'SVC' trained on dense data
You get this error because your training & test data are not of the same kind: while you train on your initial X_train set:
classifier.fit(X_train, y_train)
you are trying to get predictions from a dataset which has undergone count vectorization & tf-idf transformations first:
predicted = classifier.predict(test_tfidf)
It is puzzling why you chose to do so, why you nevertheless compute train_counts and train_tfidf (you don't seem to actually use them anywhere), and why you also try to redefine predicted as classifier.predict(X_test) immediately afterwards. Normally, changing your training line to
classifier.fit(train_tfidf, y_train)
and getting rid of your second predicted definition should work OK...
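Putting it together, a minimal sketch of the corrected flow, reusing the variable names from the question (a scikit-learn Pipeline would be a tidier alternative):
# Fit the vectorizers on the training data, then train and predict
# on the SAME tf-idf representation
train_counts = count_vector.fit_transform(X_train)
train_tfidf = tfidf.fit_transform(train_counts)
classifier.fit(train_tfidf, y_train)           # train on tf-idf features
test_counts = count_vector.transform(X_test)   # transform only, no re-fit
test_tfidf = tfidf.transform(test_counts)
predicted = classifier.predict(test_tfidf)     # predict on tf-idf features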
Alternatively, you can densify the sparse matrix before predicting:
test_tfidf = tfidf.transform(test_counts).toarray()
and then:
predicted = classifier.predict(test_tfidf)
That simple change avoids the sparse/dense mismatch.

Why is an error generated when tuning n_estimators for RandomForestClassifier using cross_val_score?

I am working with unbalanced data; using undersampling, I have brought both classes to the same proportion.
X_undersample: dataframe (984, 28)
y_undersample: dataframe (984, 1)
I am using a random forest classifier, and in order to find the best value of the parameter n_estimators I am using cross-validation. Here is the code:
j_shout = range(1, 300)
j_acc = []
for j in j_shout:
    lr = RandomForestClassifier(n_estimators=j, criterion='entropy', random_state=0)
    score = cross_val_score(lr, X_undersample, y_undersample, cv=10, scoring='accuracy')
    print('iteration', j, ':cross_validation accuracy=', score)
    j_acc.append(score.mean())
Now when I run this, I get the following error:
File "<ipython-input-43-954a9717dcea>", line 5, in <module>
score=cross_val_score(lr,X_undersample,y_undersample,cv=10,scoring='accuracy')
File "D:\installations\AC\lib\site-packages\sklearn\cross_validation.py", line 1562, in cross_val_score
cv = check_cv(cv, X, y, classifier=is_classifier(estimator))
File "D:\installations\AC\lib\site-packages\sklearn\cross_validation.py", line 1823, in check_cv
cv = StratifiedKFold(y, cv)
File "D:\installations\AC\lib\site-packages\sklearn\cross_validation.py", line 569, in __init__
label_test_folds = test_folds[y == label]
IndexError: too many indices for array
I tried changing n_estimators to smaller values, but it still shows the same error.
According to your traceback and the scikit-learn documentation of the StratifiedKFold iterator, it seems that StratifiedKFold expects y as a flattened array. In your case, you pass a dataframe of shape (984, 1). Your code should look like this:
score = cross_val_score(estimator=lr,
                        X=X_undersample.values,
                        y=y_undersample.values.ravel(),
                        cv=10,
                        scoring='accuracy')
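For reference, the whole tuning loop with that fix applied might look like this (a sketch; the import paths match the old sklearn.cross_validation module visible in your traceback):
from sklearn.ensemble import RandomForestClassifier
from sklearn.cross_validation import cross_val_score  # sklearn.model_selection in newer versions
j_acc = []
for j in range(1, 300):
    lr = RandomForestClassifier(n_estimators=j, criterion='entropy', random_state=0)
    score = cross_val_score(estimator=lr,
                            X=X_undersample.values,
                            y=y_undersample.values.ravel(),  # flatten (984, 1) -> (984,)
                            cv=10,
                            scoring='accuracy')
    j_acc.append(score.mean())
best_j = j_acc.index(max(j_acc)) + 1  # +1 because n_estimators started at 1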

RandomForest score method ValueError

I am trying to find the score of a given data set with respect to some training data. I have written the following code:
from sklearn.ensemble import RandomForestClassifier
import numpy as np
randomForest = RandomForestClassifier(n_estimators = 200)
li_train1 = [[1,2,3,4,5,6,7,8,9],[1,2,3,4,5,6,7,8,9]]
li_train2 = [[1,2,3,4,5,6,7,8,9],[1,2,3,4,5,6,7,8,9]]
li_text1 = [[10,20,30,40,50,60,70,80,90], [10,20,30,40,50,60,70,80,90]]
li_text2 = [[1,2,3,4,5,6,7,8,9],[1,2,3,4,5,6,7,8,9]]
randomForest.fit(li_train1, li_train2)
output = randomForest.score(li_train1, li_text1)
When I run the program, I get the error:
Traceback (most recent call last):
File "trial.py", line 16, in <module>
output = randomForest.score(li_train1, li_text1)
File "/usr/local/lib/python2.7/dist-packages/sklearn/base.py", line 349, in score
return accuracy_score(y, self.predict(X), sample_weight=sample_weight)
File "/usr/local/lib/python2.7/dist-packages/sklearn/metrics/classification.py", line 172, in accuracy_score
y_type, y_true, y_pred = _check_targets(y_true, y_pred)
File "/usr/local/lib/python2.7/dist-packages/sklearn/metrics/classification.py", line 89, in _check_targets
raise ValueError("{0} is not supported".format(y_type))
ValueError: multiclass-multioutput is not supported
On checking the documentation related to the score method it says:
score(X, y, sample_weight=None)
X : array-like, shape = (n_samples, n_features)
    Test samples.
y : array-like, shape = (n_samples) or (n_samples, n_outputs)
    True labels for X.
Both X and y in my case are 2D arrays.
I also went through this question, but I couldn't understand where I am going wrong.
EDIT
So as per the answer and the comments that follow, I have edited the program as follows:
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
import numpy as np
randomForest = RandomForestClassifier(n_estimators = 200)
mlb = MultiLabelBinarizer()
li_train1 = [[1,2,3,4,5,6,7,8,9],[1,2,3,4,5,6,7,8,9]]
li_train2 = [[1,2,3,4,5,6,7,8,9],[1,2,3,4,5,6,7,8,9]]
li_text1 = [100,200]
li_text2 = [[1,2,3,4,5,6,7,8,9],[1,2,3,4,5,6,7,8,9]]
randomForest.fit(li_train1, li_train2)
output = randomForest.score(li_train1, li_text1)
After this edit I am getting the error:
Traceback (most recent call last):
File "trial.py", line 19, in <module>
output = randomForest.score(li_train1, li_text1)
File "/usr/local/lib/python2.7/dist-packages/sklearn/base.py", line 349, in score
return accuracy_score(y, self.predict(X), sample_weight=sample_weight)
File "/usr/local/lib/python2.7/dist-packages/sklearn/metrics/classification.py", line 172, in accuracy_score
y_type, y_true, y_pred = _check_targets(y_true, y_pred)
File "/usr/local/lib/python2.7/dist-packages/sklearn/metrics/classification.py", line 82, in _check_targets
"".format(type_true, type_pred))
ValueError: Can't handle mix of binary and multiclass-multioutput
According to the documentation:
Warning: At present, no metric in sklearn.metrics supports the multioutput-multiclass classification task.
The score method invokes sklearn's accuracy metric but this isn't supported for the multi-class, multi-output classification problem you've defined.
It's not clear from your question if you really intend to solve a multi-class, multi-output problem. If that's not your intention, then you should restructure your input arrays.
If on the other hand you really want to solve this kind of problem, you'll simply need to define your own scoring function.
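For example, a minimal sketch of such a scoring function, an exact-match accuracy for multi-output predictions (my own helper, not part of sklearn.metrics):
import numpy as np
def multioutput_accuracy(clf, X, y):
    # Fraction of samples for which ALL outputs are predicted correctly
    y_pred = np.asarray(clf.predict(X))
    return np.mean(np.all(y_pred == np.asarray(y), axis=1))
# usage: multioutput_accuracy(randomForest, li_train1, li_text1)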
UPDATE
Since you are not solving a multi-class, multi-label problem, you should restructure your data so that it looks something like this:
from sklearn.ensemble import RandomForestClassifier
randomForest = RandomForestClassifier(n_estimators=200)
# training data
X = [
    [1,2,3,4,5,6,7,8,9],
    [1,2,3,4,5,6,7,8,9]
]
y = [0,1]
# fit the model
randomForest.fit(X, y)
# test data
Xtest = [
    [1,2,0,4,5,6,0,8,9],
    [1,1,3,1,5,0,7,8,9]
]
ytest = [0,1]
output = randomForest.score(Xtest, ytest)
print(output) # 0.5

ValueError: Unknown label type: array when using DecisionTreeClassifier on a custom dataset

Given below is my code
import numpy as np
from sklearn import tree
from sklearn.model_selection import train_test_split
dataset = np.genfromtxt('train_py.csv', dtype=float, delimiter=",")
X_train, X_test, y_train, y_test = train_test_split(dataset[:,:-1], dataset[:,-1], test_size=0.2, random_state=0)
model = tree.DecisionTreeClassifier(criterion='gini')
#y_train = y_train.tolist()
#X_train = X_train.tolist()
model.fit(X_train, y_train)
model.score(X_train, y_train)
predicted = model.predict(X_test)
I am trying to use the decision tree classifier on a custom dataset imported with numpy. But when I try to fit the model, I get the ValueError shown below. I tried using both numpy arrays and plain lists, but I still can't figure out what is causing the error. Any help appreciated.
Traceback (most recent call last):
File "tree.py", line 19, in <module>
model.fit(X_train, y_train)
File "/usr/local/lib/python2.7/dist-packages/sklearn/tree/tree.py", line 177, in fit
check_classification_targets(y)
File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/multiclass.py", line 173, in check_classification_targets
raise ValueError("Unknown label type: %r" % y)
ValueError: Unknown label type: array([[ 252.3352],....<until end of array>
scikit-learn expects you to pass something label-like as y: integers, strings, etc. Floats are not a typical encoding of a finite label space; they are used for regression.
From the documentation of fit:
X : The training input samples. Internally, it will be converted to dtype=np.float32 and if a sparse matrix is provided to a sparse csc_matrix.
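A minimal sketch of the two usual fixes (assuming the last column of train_py.csv is the target, as in the question):
import numpy as np
from sklearn import tree
dataset = np.genfromtxt('train_py.csv', dtype=float, delimiter=",")
X, y = dataset[:, :-1], dataset[:, -1]
# Option 1: the targets are really discrete classes stored as floats,
# so cast them to integer labels before fitting a classifier
clf = tree.DecisionTreeClassifier(criterion='gini')
clf.fit(X, y.astype(int))
# Option 2: the targets are continuous values, so this is regression
reg = tree.DecisionTreeRegressor()
reg.fit(X, y)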
