ValueError: Found arrays with inconsistent numbers of samples [1,299] - python

The data files are here and here; you can download them by clicking on the links. I am using Pandas, NumPy and Python 3.
Here is my code:
import pandas as pa
import numpy as nu
from sklearn.linear_model import Perceptron
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
def get_accuracy(X_train, y_train, X_test, y_test):
    perceptron = Perceptron()
    perceptron.fit(X_train, y_train)
    perceptron.transform(X_train)
    prediction = perceptron.predict(X_test)
    result = accuracy_score(y_test, prediction)
    return result
test_data = pa.read_csv("C:/Users/Roman/Downloads/perceptron-test.csv")
test_data.columns = ["class", "f1", "f2"]
train_data = pa.read_csv("C:/Users/Roman/Downloads/perceptron-train.csv")
train_data.columns = ["class", "f1", "f2"]
scaler = StandardScaler()
scaler.fit_transform(train_data[train_data.columns[1:]]).reshape(-1,1)
X_train = scaler.transform(train_data[train_data.columns[1:]])
scaler.fit_transform(train_data[train_data.columns[0]])
y_train = scaler.transform(train_data[train_data.columns[0]])
scaler.fit_transform(test_data[test_data.columns[1:]])
X_test = scaler.transform(test_data[test_data.columns[1:]])
scaler.fit_transform(test_data[test_data.columns[0]])
y_test = scaler.transform(test_data[test_data.columns[0]])
scaled_accuracy = get_accuracy(nu.ravel(X_train), nu.ravel(y_train), nu.ravel(X_test), nu.ravel(y_test))
print(scaled_accuracy)
And here is the error that I get:
Traceback (most recent call last):
File "C:/Users/Roman/PycharmProjects/data_project-1/lecture_2_perceptron.py", line 33, in <module>
scaled_accuracy = get_accuracy(nu.ravel(X_train), nu.ravel(y_train), nu.ravel(X_test), nu.ravel(y_test))
File "C:/Users/Roman/PycharmProjects/data_project-1/lecture_2_perceptron.py", line 9, in get_accuracy
perceptron.fit(X_train, y_train)
File "C:\Users\Roman\AppData\Roaming\Python\Python35\site-packages\sklearn\linear_model\stochastic_gradient.py", line 545, in fit
sample_weight=sample_weight)
File "C:\Users\Roman\AppData\Roaming\Python\Python35\site-packages\sklearn\linear_model\stochastic_gradient.py", line 389, in _fit
X, y = check_X_y(X, y, 'csr', dtype=np.float64, order="C")
File "C:\Users\Roman\AppData\Roaming\Python\Python35\site-packages\sklearn\utils\validation.py", line 520, in check_X_y
check_consistent_length(X, y)
File "C:\Users\Roman\AppData\Roaming\Python\Python35\site-packages\sklearn\utils\validation.py", line 176, in check_consistent_length
"%s" % str(uniques))
ValueError: Found arrays with inconsistent numbers of samples: [ 1 299]
Without scaling the data everything works fine, but after scaling it does not.

You should not call fit_transform each time you use the scaler. Fit it once, on the training data, and afterwards only call transform; otherwise you get different representations for the training and test data. scikit-learn estimators also expect X as a 2D array of shape (n_samples, n_features), so X should not be flattened with ravel. Finally, there is no point in scaling the labels.
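The advice above can be sketched as follows. This is a minimal illustration, not the asker's actual pipeline: the random arrays are a hypothetical stand-in for the two CSV files, and the seeds are arbitrary. The scaler is fitted on the training features only, the test features are only transformed, X stays two-dimensional, and the labels are left unscaled.

```python
import numpy as np
from sklearn.linear_model import Perceptron
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for perceptron-train.csv / perceptron-test.csv:
# two features on very different scales, labels in {-1, 1}.
rng = np.random.RandomState(0)
X_train = rng.normal(size=(300, 2)) * [1.0, 100.0]
y_train = np.where(X_train[:, 0] > 0, 1, -1)
X_test = rng.normal(size=(200, 2)) * [1.0, 100.0]
y_test = np.where(X_test[:, 0] > 0, 1, -1)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std on the training data only
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics

perceptron = Perceptron(random_state=0)
perceptron.fit(X_train_scaled, y_train)         # X is 2D, y is 1D and unscaled
scaled_accuracy = accuracy_score(y_test, perceptron.predict(X_test_scaled))
print(scaled_accuracy)
```

With real data, the same pattern applies to train_data[["f1", "f2"]] and test_data[["f1", "f2"]], with the "class" column passed directly as y.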

Related

Tf-IDF vectorized data won't work with naive bayes classifier

I have the following Python code that I am using after preprocessing the data, where the data has two columns: one is the label, either positive or negative, and the other holds the tweet texts.
X_train, X_test, y_train, y_test = train_test_split(data['Tweet'], data['Label'], test_size=0.20, random_state=0)
tf_idf = TfidfVectorizer()
x_traintf = tf_idf.fit_transform(X_train)
x_traintf = tf_idf.transform(X_train)
x_testtf = tf_idf.fit_transform(X_test)
x_testtf = tf_idf.transform(X_test)
naive_bayes_classifier = MultinomialNB()
naive_bayes_classifier.fit(x_traintf, y_train)
y_pred = naive_bayes_classifier.predict(x_testtf)
print(metrics.classification_report(y_test, y_pred, target_names=['pos', 'neg']))
Here is the full error:
Traceback (most recent call last):
File "C:\Users\Lenovo\PycharmProjects\pythonProject1\main.py", line 72, in <module>
naive_bayes_classifier.fit(x_traintf, y_train)
File "C:\Users\Lenovo\PycharmProjects\pythonProject1\venv\Lib\site-packages\sklearn\naive_bayes.py", line 749, in fit
X, y = self._check_X_y(X, y)
^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\Lenovo\PycharmProjects\pythonProject1\venv\Lib\site-packages\sklearn\naive_bayes.py", line 583, in _check_X_y
return self._validate_data(X, y, accept_sparse="csr", reset=reset)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\Lenovo\PycharmProjects\pythonProject1\venv\Lib\site-packages\sklearn\base.py", line 565, in _validate_data
X, y = check_X_y(X, y, **check_params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\Lenovo\PycharmProjects\pythonProject1\venv\Lib\site-packages\sklearn\utils\validation.py", line 1122, in check_X_y
y = _check_y(y, multi_output=multi_output, y_numeric=y_numeric, estimator=estimator)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\Lenovo\PycharmProjects\pythonProject1\venv\Lib\site-packages\sklearn\utils\validation.py", line 1144, in _check_y
_assert_all_finite(y, input_name="y", estimator_name=estimator_name)
File "C:\Users\Lenovo\PycharmProjects\pythonProject1\venv\Lib\site-packages\sklearn\utils\validation.py", line 111, in _assert_all_finite
raise ValueError("Input contains NaN")
ValueError: Input contains NaN
I've tried this as well but got similar results:
X_train, X_test, y_train, y_test = train_test_split(all_X, all_y, test_size=0.2, random_state=42)
vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train)
clf = MultinomialNB()
clf.fit(X_train_tfidf, y_train)
X_test_tfidf = vectorizer.transform(X_test)
y_pred = clf.predict(X_test_tfidf)
print(metrics.accuracy_score(y_test, y_pred))
edit: I've used data.dropna(inplace=True) and it appears to think that my strings are null because they are in Arabic.
OK, so first of all, there seems to be some confusion about what the methods do, because they are being used incorrectly. There is a chance that this is the issue.
The .fit_transform() method fits to the data and returns the transformed data; it is equivalent to calling fit() and then transform(). Have a look at the documentation for more clarity: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer.fit_transform
Your code, for the transformation of the data, should be something like the following:
tf_idf = TfidfVectorizer()
x_traintf = tf_idf.fit_transform(X_train)
x_testtf = tf_idf.transform(X_test)
You fit (learn) only on the train data, and transform both the train and the test data with what was learned. Your original code transformed the train data twice, then refitted on the test data and transformed that twice as well.
Do let me know if this is the issue. It might be that this double transformation caused issues. Please print the output after using the transformer, to see the output of the train and test data (features).
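As a rough end-to-end sketch of the above: the column names Tweet and Label come from the question, while the toy rows are hypothetical. Because the traceback reports NaN in y, the sketch also assumes that rows with a missing label should be dropped before splitting; whether that is appropriate depends on why the labels are missing.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data; the last row has a missing label.
data = pd.DataFrame({
    "Tweet": ["good day", "bad day", "great movie", "awful movie",
              "good food", "bad food", "great trip", "awful trip",
              "no label here"],
    "Label": ["pos", "neg", "pos", "neg", "pos", "neg", "pos", "neg", None],
})
# Assumption: rows without a tweet or a label are not usable for training.
data = data.dropna(subset=["Tweet", "Label"])

X_train, X_test, y_train, y_test = train_test_split(
    data["Tweet"], data["Label"], test_size=0.25, random_state=0)

tf_idf = TfidfVectorizer()
x_traintf = tf_idf.fit_transform(X_train)  # learn the vocabulary on the train split
x_testtf = tf_idf.transform(X_test)        # reuse that vocabulary for the test split

naive_bayes_classifier = MultinomialNB()
naive_bayes_classifier.fit(x_traintf, y_train)
y_pred = naive_bayes_classifier.predict(x_testtf)
print(x_traintf.shape, x_testtf.shape)
```

Both matrices end up with the same number of columns, which is what the classifier needs at prediction time.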

cross_val_score giving an error - Why is this? [duplicate]

I am trying to follow a machine-learning tutorial listed here: https://machinelearningmastery.com/machine-learning-in-python-step-by-step/, but I am encountering an issue. I was able to run the following code on my MacBook Air; however, it did not work on my Windows machine. I checked other questions with similar titles, none of which seem to fit my problem.
Why is this happening? How can it be fixed?
My entire code:
# Python version
import sys
print('Python: {}'.format(sys.version))
# scipy
import scipy
print('scipy: {}'.format(scipy.__version__))
# numpy
import numpy
print('numpy: {}'.format(numpy.__version__))
# matplotlib
import matplotlib
print('matplotlib: {}'.format(matplotlib.__version__))
# pandas
import pandas
print('pandas: {}'.format(pandas.__version__))
# scikit-learn
import sklearn
print('sklearn: {}'.format(sklearn.__version__))
# compare algorithms
from pandas import read_csv
from matplotlib import pyplot
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
# Load dataset
url = "energyFormatted.csv"
names = ['TOTAL', 'PURCHASED', 'NUCLEAR', 'SOLAR', 'WIND', 'NATURAL_GAS', 'COAL', 'OIL']
dataset = read_csv(url, names=names)
print(dataset.shape)
# Split-out validation dataset
array = dataset.values
X = array[:, 0:4]
y = array[:, 4]
X_train, X_validation, Y_train, Y_validation = train_test_split(X, y, test_size=0.20, random_state=1, shuffle=True)
# Spot Check Algorithms
models = []
models.append(('LR', LogisticRegression(solver='liblinear', multi_class='ovr')))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC(gamma='auto')))
# evaluate each model in turn
results = []
names = []
for name, model in models:
    kfold = StratifiedKFold(n_splits=10, random_state=1, shuffle=True)
    cv_results = cross_val_score(model, X_train, Y_train, cv=kfold, scoring='accuracy')
    results.append(cv_results)
    names.append(name)
    print('%s: %f (%f)' % (name, cv_results.mean(), cv_results.std()))
The line that's giving me an error:
cv_results = cross_val_score(model, X_train, Y_train, cv=kfold, scoring='accuracy')
The error itself:
Traceback (most recent call last):
File "D:\Applications\pythonProject\venv\lib\site-packages\joblib\parallel.py", line 862, in dispatch_one_batch
tasks = self._ready_batches.get(block=False)
File "C:\Users\danie\AppData\Local\Programs\Python\Python39\lib\queue.py", line 168, in get
raise Empty
_queue.Empty
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\danie\AppData\Roaming\JetBrains\PyCharmCE2022.2\scratches\FY23 SCI FAIR\main.py", line 63, in <module>
cv_results = cross_val_score(model, X_train, Y_train, cv=kfold, scoring=None)
File "D:\Applications\pythonProject\venv\lib\site-packages\sklearn\model_selection\_validation.py", line 515, in cross_val_score
cv_results = cross_validate(
File "D:\Applications\pythonProject\venv\lib\site-packages\sklearn\model_selection\_validation.py", line 266, in cross_validate
results = parallel(
File "D:\Applications\pythonProject\venv\lib\site-packages\joblib\parallel.py", line 1085, in __call__
if self.dispatch_one_batch(iterator):
File "D:\Applications\pythonProject\venv\lib\site-packages\joblib\parallel.py", line 873, in dispatch_one_batch
islice = list(itertools.islice(iterator, big_batch_size))
File "D:\Applications\pythonProject\venv\lib\site-packages\sklearn\model_selection\_validation.py", line 266, in <genexpr>
results = parallel(
File "D:\Applications\pythonProject\venv\lib\site-packages\sklearn\model_selection\_split.py", line 340, in split
for train, test in super().split(X, y, groups):
File "D:\Applications\pythonProject\venv\lib\site-packages\sklearn\model_selection\_split.py", line 86, in split
for test_index in self._iter_test_masks(X, y, groups):
File "D:\Applications\pythonProject\venv\lib\site-packages\sklearn\model_selection\_split.py", line 717, in _iter_test_masks
test_folds = self._make_test_folds(X, y)
File "D:\Applications\pythonProject\venv\lib\site-packages\sklearn\model_selection\_split.py", line 660, in _make_test_folds
raise ValueError(
ValueError: Supported target types are: ('binary', 'multiclass'). Got 'continuous' instead.
CSV:
28564,0,6284.08,1713.84,19.9948,19994.8,19.9948,19.9948
28411,0,6250.42,852.33,0,20740.03,568.22,0
27515,0,6053.3,550.3,0,20361.1,550.3,0
24586,491.72,5408.92,245.86,0,17947.78,491.72,0
26653,533.06,6130.19,0,0,18923.63,1066.12,0
26836,805.08,6172.28,0,0,18785.2,1073.44,0
26073,1303.65,5736.06,0,0,17990.37,1042.92,0
27055,1352.75,6222.65,0,0,18397.4,1082.2,0
26236,1311.8,6034.28,0,0,17578.12,1311.8,0
26020,1821.4,3903,0,0,18994.6,1040.8,260.2
26538,0,4246.08,265.38,13799.76,6369.12,0,1326.9
25800,3354,5160,0,0,14964,1290,1032
26682,3468.66,5603.22,0,0,14941.92,1600.92,1067.28
24997,3499.58,5499.34,0,0,13248.41,1499.82,1249.85
25100,3765,4769,0,0,13052,1506,2008
24651,4190.67,4930.2,0,0,12325.5,1232.55,1972.08
12053,0,1084.77,0,3133.78,6508.62,0,723.18
11500,2070,2415,0,0,4255,690,2070
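Accuracy doesn't make sense for scoring a continuous variable. The error indicates that your y values are continuous (for example, floats with fractional parts); the traceback shows it being raised by StratifiedKFold, which needs discrete class labels to stratify on. For a continuous target, use regression models with a plain KFold and a regression metric such as sklearn.metrics.mean_squared_error instead of accuracy.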
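A minimal sketch of what that could look like, assuming the goal really is regression. The random data is a hypothetical stand-in for the energy CSV, and LinearRegression is just one possible model choice, not something the tutorial prescribes. type_of_target is the scikit-learn helper that reports how a target array is classified.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.utils.multiclass import type_of_target

# Hypothetical continuous data standing in for energyFormatted.csv.
rng = np.random.RandomState(1)
X = rng.uniform(0, 1000, size=(50, 4))
y = X @ np.array([0.5, 1.0, -0.3, 2.0]) + rng.normal(scale=5.0, size=50)

# Reports how scikit-learn sees the target; StratifiedKFold rejects 'continuous'.
print(type_of_target(y))

# Plain KFold does not need class labels, and the scoring is a regression metric.
kfold = KFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(LinearRegression(), X, y, cv=kfold,
                         scoring="neg_mean_squared_error")
print("MSE: %f (%f)" % (-scores.mean(), scores.std()))
```

scikit-learn negates error metrics so that higher is always better, which is why the scoring string is neg_mean_squared_error and the sign is flipped when printing.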

Fix ValueError in Logistic Regression in Python

I am following Müller & Guido's Machine Learning with Python book, and I am trying to run classifications on this dataset.
So far my code looks like this:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
# Read the Churn data into a dataset (pandas) from the CSV file
dataset = pd.read_csv(r'C:\Users\Amalie\IdeaProjects\INFO284\src\Lab2.csv')
# Select the columns of interest (scikit-learn expects 2D data for the features)
dataframe = dataset[['SeniorCitizen', 'Partner', 'Dependents', 'tenure', 'PhoneService', 'MultipleLines',
                     'InternetService', 'OnlineSecurity', 'Churn']]
y = dataframe['Churn']  # Target
X = dataframe.drop('Churn', axis=1)  # Features (all other than target column 'Churn')
# Logistic Regression
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=20) # Split into test/training sets
logReg = LogisticRegression(max_iter=100000).fit(X_train, y_train)
print("Training set score: {:.3f}".format(logReg.score(X_train, y_train)))
print("Test set score: {:.3f}".format(logReg.score(X_test, y_test)))
When I run it, I get this error:
Traceback (most recent call last):
File "C:/Users/Amalie/IdeaProjects/INFO284/src/Lab5.py", line 19, in <module>
logReg = LogisticRegression(max_iter=100000).fit(X_train, y_train)
File "C:\Users\Amalie\IdeaProjects\INFO284\venv\lib\site-packages\sklearn\linear_model\_logistic.py", line 1514, in fit
accept_large_sparse=solver not in ["liblinear", "sag", "saga"],
File "C:\Users\Amalie\IdeaProjects\INFO284\venv\lib\site-packages\sklearn\base.py", line 581, in _validate_data
X, y = check_X_y(X, y, **check_params)
File "C:\Users\Amalie\IdeaProjects\INFO284\venv\lib\site-packages\sklearn\utils\validation.py", line 976, in check_X_y
estimator=estimator,
File "C:\Users\Amalie\IdeaProjects\INFO284\venv\lib\site-packages\sklearn\utils\validation.py", line 746, in check_array
array = np.asarray(array, order=order, dtype=dtype)
File "C:\Users\Amalie\IdeaProjects\INFO284\venv\lib\site-packages\pandas\core\generic.py", line 1993, in __array__
return np.asarray(self._values, dtype=dtype)
ValueError: could not convert string to float: 'No'
Process finished with exit code 1
It says that the problem is with this line
logReg = LogisticRegression(max_iter=100000).fit(X_train, y_train)
I have used the fit()-method before when running other classification problems, but I've never come across this issue before. What am I doing wrong?
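The message "could not convert string to float: 'No'" suggests that some feature columns still hold strings when fit() is called. One common way to handle that, sketched here under the assumption that those columns are categorical, is to one-hot encode them with pandas.get_dummies before fitting. The rows below are hypothetical and only reuse a few of the column names from the question.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical rows using a few of the question's column names.
dataframe = pd.DataFrame({
    "SeniorCitizen": [0, 1, 0, 1, 0, 0],
    "Partner": ["Yes", "No", "No", "Yes", "No", "Yes"],
    "tenure": [1, 34, 2, 45, 8, 22],
    "InternetService": ["DSL", "Fiber optic", "DSL", "No", "Fiber optic", "DSL"],
    "Churn": ["No", "No", "Yes", "No", "Yes", "No"],
})

y = dataframe["Churn"]  # string class labels are fine for the target
# One-hot encode the string feature columns; numeric columns pass through unchanged.
X = pd.get_dummies(dataframe.drop("Churn", axis=1), dtype=float)

logReg = LogisticRegression(max_iter=100000).fit(X, y)
print(X.columns.tolist())
print("Training set score: {:.3f}".format(logReg.score(X, y)))
```

If the encoding is done before train_test_split, the train and test sets share the same columns; encoding them separately can produce mismatched feature sets.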

ValueError: Number of labels=19 does not match number of samples=1

This is the code that I used. I'm trying to use a RandomForestClassifier to classify the activity based on learner and dominant subject.
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn import linear_model
from sklearn.ensemble import RandomForestClassifier
from sklearn import cross_validation
from sklearn.metrics import accuracy_score
from sklearn.cross_validation import train_test_split
from sklearn.datasets import make_multilabel_classification
Data = pd.read_excel("F:\VIT material\Master thesis\DATASET.xlsx",names=['learner','Dominant_Subject','Activity'])
print(Data)
print(Data.columns)
Data.reshape(Data.columns.values)
print(Data)
number= LabelEncoder()
Data['learner']= number.fit_transform(Data['learner'].astype('str'))
Data['Dominant_Subject']= number.fit_transform(Data['Dominant_Subject'].astype('str'))
Data['Activity']= number.fit_transform(Data['Activity'].astype('str'))
print(Data)
print(Data.shape)
X = Data['learner']
print(X)
print(X.shape)
Y = Data['Dominant_Subject']
print(Y)
print(Y.shape)
print(len(X))
print(len(Y))
X_train = X[:-5]
X_test = X[-5:]
Y_train = Y[:-10]
Y_test = Y[-10:]
X_train, X_test, Y_train, Y_test=train_test_split(X,test_size=0.2,random_state=20)
print(X_train,X_test,Y_train,Y_test)
model = linear_model.LinearRegression()
model.fit(X_train,Y_train)
print(model.fit())
clf = RandomForestClassifier(n_estimators=100, min_samples_split=2)
clf.fit(X_train,Y_train)
print(clf.fit())
predicted = clf.predict(X)
print(accuracy_score(predicted,Y))
The number of samples and the number of labels are equal; however, I'm still getting the error that the number of labels does not match the number of samples.
Traceback of the error:
File "C:/Users/RAJIV MISHRA/PycharmProjects/mltutorialpractice/13.py", line 38, in <module>
clf.fit(X_train,Y_train)
File "C:\Users\RAJIV MISHRA\Anaconda3\lib\site-packages\sklearn\ensemble\forest.py", line 326, in fit
for i, t in enumerate(trees))
File "C:\Users\RAJIV MISHRA\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py", line 758, in __call__
while self.dispatch_one_batch(iterator):
File "C:\Users\RAJIV MISHRA\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py", line 608, in dispatch_one_batch
self._dispatch(tasks)
File "C:\Users\RAJIV MISHRA\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py", line 571, in _dispatch
job = self._backend.apply_async(batch, callback=cb)
File "C:\Users\RAJIV MISHRA\Anaconda3\lib\site-packages\sklearn\externals\joblib\_parallel_backends.py", line 109, in apply_async
result = ImmediateResult(func)
File "C:\Users\RAJIV MISHRA\Anaconda3\lib\site-packages\sklearn\externals\joblib\_parallel_backends.py", line 326, in __init__
self.results = batch()
File "C:\Users\RAJIV MISHRA\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py", line 131, in __call__
return [func(*args, **kwargs) for func, args, kwargs in self.items]
File "C:\Users\RAJIV MISHRA\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py", line 131, in <listcomp>
return [func(*args, **kwargs) for func, args, kwargs in self.items]
File "C:\Users\RAJIV MISHRA\Anaconda3\lib\site-packages\sklearn\ensemble\forest.py", line 120, in _parallel_build_trees
tree.fit(X, y, sample_weight=curr_sample_weight, check_input=False)
File "C:\Users\RAJIV MISHRA\Anaconda3\lib\site-packages\sklearn\tree\tree.py", line 739, in fit
X_idx_sorted=X_idx_sorted)
File "C:\Users\RAJIV MISHRA\Anaconda3\lib\site-packages\sklearn\tree\tree.py", line 240, in fit
"number of samples=%d" % (len(y), n_samples))
ValueError: Number of labels=19 does not match number of samples=1
There are some issues that could be fixed in this code. Assuming X.shape[0] == Y.shape[0]:
1. The following code is unnecessary if you are using train_test_split
X_train = X[:-5]
X_test = X[-5:]
Y_train = Y[:-10]
Y_test = Y[-10:]
The code has another problem as well: the sample indexes do not match the label indexes. Maybe the following can be used to fix this:
X_train = X[:-5]
X_test = X[-5:]
Y_train = Y[:-5]
Y_test = Y[-5:]
2. If you are using train_test_split for splitting the dataset into train and test sets, you should pass both the samples and the labels.
X_train, X_test, Y_train, Y_test=train_test_split(X,Y,test_size=0.2,random_state=20)
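Putting the two points together, a minimal sketch might look like this. The rows are hypothetical, already label-encoded data; the choice of learner and Dominant_Subject as features and Activity as target is an assumption taken from the question's stated goal rather than from its code. The import uses sklearn.model_selection, since the sklearn.cross_validation module was removed in later scikit-learn releases. Selecting the features with double brackets keeps X two-dimensional, which the estimators expect.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical, already label-encoded data with the question's column names.
Data = pd.DataFrame({
    "learner": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] * 2,
    "Dominant_Subject": [0, 1] * 10,
    "Activity": [0, 1, 1, 0] * 5,
})

X = Data[["learner", "Dominant_Subject"]]  # double brackets keep X 2D: (n_samples, n_features)
Y = Data["Activity"]                       # 1D target, one label per sample

# Pass samples and labels together so the split keeps them aligned.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=20)

clf = RandomForestClassifier(n_estimators=100, min_samples_split=2, random_state=20)
clf.fit(X_train, Y_train)
print(accuracy_score(Y_test, clf.predict(X_test)))
```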

RandomForest score method ValueError

I am trying to find the score of a given data set with respect to some training data. I have written the following code:
from sklearn.ensemble import RandomForestClassifier
import numpy as np
randomForest = RandomForestClassifier(n_estimators = 200)
li_train1 = [[1,2,3,4,5,6,7,8,9],[1,2,3,4,5,6,7,8,9]]
li_train2 = [[1,2,3,4,5,6,7,8,9],[1,2,3,4,5,6,7,8,9]]
li_text1 = [[10,20,30,40,50,60,70,80,90], [10,20,30,40,50,60,70,80,90]]
li_text2 = [[1,2,3,4,5,6,7,8,9],[1,2,3,4,5,6,7,8,9]]
randomForest.fit(li_train1, li_train2)
output = randomForest.score(li_train1, li_text1)
When I try to run the program I get the error:
Traceback (most recent call last):
File "trial.py", line 16, in <module>
output = randomForest.score(li_train1, li_text1)
File "/usr/local/lib/python2.7/dist-packages/sklearn/base.py", line 349, in score
return accuracy_score(y, self.predict(X), sample_weight=sample_weight)
File "/usr/local/lib/python2.7/dist-packages/sklearn/metrics/classification.py", line 172, in accuracy_score
y_type, y_true, y_pred = _check_targets(y_true, y_pred)
File "/usr/local/lib/python2.7/dist-packages/sklearn/metrics/classification.py", line 89, in _check_targets
raise ValueError("{0} is not supported".format(y_type))
ValueError: multiclass-multioutput is not supported
On checking the documentation related to the score method it says:
score(X, y, sample_weight=None)
X : array-like, shape = (n_samples, n_features)
Test samples.
y : array-like, shape = (n_samples) or (n_samples, n_outputs)
True labels for X.
Both X and y in my case are arrays, 2d arrays.
I also went through this question but I couldn't understand where I am going wrong.
EDIT
So as per the answer and the comments that follow, I have edited the program as follows:
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
import numpy as np
randomForest = RandomForestClassifier(n_estimators = 200)
mlb = MultiLabelBinarizer()
li_train1 = [[1,2,3,4,5,6,7,8,9],[1,2,3,4,5,6,7,8,9]]
li_train2 = [[1,2,3,4,5,6,7,8,9],[1,2,3,4,5,6,7,8,9]]
li_text1 = [100,200]
li_text2 = [[1,2,3,4,5,6,7,8,9],[1,2,3,4,5,6,7,8,9]]
randomForest.fit(li_train1, li_train2)
output = randomForest.score(li_train1, li_text1)
After this edit I am getting the error:
Traceback (most recent call last):
File "trial.py", line 19, in <module>
output = randomForest.score(li_train1, li_text1)
File "/usr/local/lib/python2.7/dist-packages/sklearn/base.py", line 349, in score
return accuracy_score(y, self.predict(X), sample_weight=sample_weight)
File "/usr/local/lib/python2.7/dist-packages/sklearn/metrics/classification.py", line 172, in accuracy_score
y_type, y_true, y_pred = _check_targets(y_true, y_pred)
File "/usr/local/lib/python2.7/dist-packages/sklearn/metrics/classification.py", line 82, in _check_targets
"".format(type_true, type_pred))
ValueError: Can't handle mix of binary and multiclass-multioutput
According to the documentation:
Warning: At present, no metric in sklearn.metrics supports the multioutput-multiclass classification task.
The score method invokes sklearn's accuracy metric but this isn't supported for the multi-class, multi-output classification problem you've defined.
It's not clear from your question if you really intend to solve a multi-class, multi-output problem. If that's not your intention, then you should restructure your input arrays.
If on the other hand you really want to solve this kind of problem, you'll simply need to define your own scoring function.
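As an illustration of what such a scoring function could look like, here is a sketch of two hypothetical helpers (the names per_output_accuracy and exact_match_accuracy are made up for this example and are not part of scikit-learn). The first averages correctness over every individual output, the second only counts a sample as correct when all of its outputs match. Which definition is appropriate depends on the problem.

```python
import numpy as np

def per_output_accuracy(y_true, y_pred):
    """Fraction of individual outputs predicted correctly (hypothetical helper)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    if y_true.shape != y_pred.shape:
        raise ValueError("y_true and y_pred must have the same shape")
    return float(np.mean(y_true == y_pred))

def exact_match_accuracy(y_true, y_pred):
    """Fraction of samples whose outputs are all predicted correctly (hypothetical helper)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    if y_true.shape != y_pred.shape:
        raise ValueError("y_true and y_pred must have the same shape")
    return float(np.mean(np.all(y_true == y_pred, axis=1)))

# Two samples with three outputs each; one output of the first sample is wrong.
y_true = [[1, 2, 3], [4, 5, 6]]
y_pred = [[1, 2, 0], [4, 5, 6]]
print(per_output_accuracy(y_true, y_pred))   # 5 of 6 outputs match
print(exact_match_accuracy(y_true, y_pred))  # 1 of 2 samples matches fully
```

Either function could be called on randomForest.predict(X) and the true 2D label array in place of randomForest.score(X, y).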
UPDATE
Since you are not solving a multi-class, multi-label problem you should restructure your data so that it looks something like this:
from sklearn.ensemble import RandomForestClassifier
randomForest = RandomForestClassifier(n_estimators=200)
# training data: one row per sample, one label per row
X = [
    [1,2,3,4,5,6,7,8,9],
    [1,2,3,4,5,6,7,8,9]
]
y = [0,1]
# fit the model
randomForest.fit(X,y)
# test data
Xtest = [
    [1,2,0,4,5,6,0,8,9],
    [1,1,3,1,5,0,7,8,9]
]
ytest = [0,1]
output = randomForest.score(Xtest,ytest)
print(output) # 0.5
