KNeighborsClassifier .predict() function doesn't work - python

I am working with the KNeighborsClassifier algorithm from the scikit-learn library in Python. I followed the basic steps: split my data and labels into training and test sets, then trained my model on the training data. Now I am trying to measure the accuracy on the test data, but I get an error. Here is my code:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cross_validation import train_test_split
from sklearn.metrics import accuracy_score
data_train, data_test, label_train, label_test = train_test_split(
    df, labels, test_size=0.2, random_state=7)
mod = KNeighborsClassifier(n_neighbors=4)
mod.fit(data_train, label_train)
predictions = mod.predict(data_test)
print accuracy_score(label_train, predictions)
The error I get:
ValueError: Found arrays with inconsistent numbers of samples: [140 558]
140 is the size of the test set and 558 the training set, based on test_size=0.2 (my data set has 698 samples). I verified that the data and labels are both of size 698, yet I get this error, which is essentially complaining that I am comparing the test set against the training set.
Does anyone know what is wrong here? What should I train my model on, and what should I use to score the predictions?
Thanks!

You should calculate the accuracy_score with label_test, not label_train. You want to compare the actual labels of the test set, label_test, to the predictions from your model, predictions, for the test set.
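Concretely, the last line should read:
print accuracy_score(label_test, predictions)
As an aside, sklearn.cross_validation was deprecated in favor of sklearn.model_selection, so on recent scikit-learn versions import train_test_split from there.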

Did you try to solve your issue via the following question?
sklearn: Found arrays with inconsistent numbers of samples when calling LinearRegression.fit()

Related

How to build re-usable scikit-learn pipeline for Random Forest Classifier?

I am trying to understand how scikit-learn pipelines work, and I am trying to fit a Random Forest model to the iris data. Here is some code:
from sklearn import datasets
from sklearn import svm
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
import joblib
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
# Load the Iris dataset
iris = datasets.load_iris()
Divide the data into train and test sets and create a pipeline with two steps:
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=0)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
pipeline = Pipeline([('feature_selection', SelectKBest(chi2, k=2)), ('classification', RandomForestClassifier()) ])
print(type(pipeline))
(112, 4) (38, 4) (112,) (38,)
<class 'sklearn.pipeline.Pipeline'>
But when I execute pipeline.fit_transform(X_train, y_train), I get an error saying AttributeError: 'RandomForestClassifier' object has no attribute 'transform'.
However, pipeline.fit(X_train, y_train) works fine.
In a normal scenario, without any pipeline code, what I have usually done is take an ML model, apply fit_transform() to my training dataset, and apply transform to my unseen dataset to generate predictions.
How do I set up something similar using pipelines in sklearn? I want to SAVE my pipeline and then perform scoring by LOADING it back. Can I do it using pickle?
Another thing is regarding the RF model itself: I can get a model summary using the RF model's methods, but I don't see any method on my pipeline for printing a model summary.
But when I execute pipeline.fit_transform(X_train, y_train), I get an error saying AttributeError: 'RandomForestClassifier' object has no attribute 'transform'.
Indeed, RandomForestClassifier does not transform data because it is a model, not a transformer. Pipelines implement either transform or predict (and its variants) depending on whether the last estimator is a transformer or a model.
So, generally, you'll want to call just pipeline.fit(X_train, y_train); then in testing or production you'll call pipeline.predict(X_test) (or predict_proba, or ...), which internally transforms with the first step(s) and predicts with the last step.
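A minimal sketch of that workflow, reusing the pipeline from the question:
pipeline.fit(X_train, y_train)          # fits SelectKBest, then the forest
y_pred = pipeline.predict(X_test)       # transforms X_test, then predicts
print(pipeline.score(X_test, y_test))   # mean accuracy on the test set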
How do I set up something similar using pipelines in sklearn? I want to SAVE my pipeline and then perform scoring by LOADING it back. Can I do it using pickle?
Yes; see sklearn Model Persistence for more details and recommendations.
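For example, with joblib (the filename is just an illustration):
import joblib

joblib.dump(pipeline, 'pipeline.joblib')   # save the fitted pipeline
loaded = joblib.load('pipeline.joblib')    # load it back later
loaded.predict(X_test)                     # then score/predict as usual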
Another thing is regarding the RF model itself: I can get a model summary using the RF model's methods, but I don't see any method on my pipeline for printing a model summary.
You can access individual steps of a pipeline in a few ways; see sklearn Pipeline accessing steps:
pipeline.named_steps.classification
pipeline['classification']
pipeline[-1]
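For example, after fitting you can pull out the forest and inspect it:
rf = pipeline.named_steps['classification']   # the fitted RandomForestClassifier
print(rf)                                     # the model's parameters
print(rf.feature_importances_)                # available once the pipeline is fitted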

Can we predict a rating based on text, using NLP?

I've used regression and classification in the past to train, test, and make predictions. Now I am looking at some NLP sample code, and everything runs fine, but at the end I was hoping to predict a 'rating' score based on what is contained in a 'text' field. Maybe NLP can't do this, but it seems like it should be doable. Here is the code I am testing:
from sklearn.feature_extraction.text import TfidfVectorizer
tf = TfidfVectorizer()
text_tf = tf.fit_transform(df['review_text'])
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(text_tf, df['reviews.rating'], test_size=0.3, random_state=123)
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
# Model Generation Using Multinomial Naive Bayes
clf = MultinomialNB().fit(X_train, y_train)
predicted = clf.predict(X_test)
print("MultinomialNB Accuracy:", metrics.accuracy_score(y_test, predicted))
# around 7% accurate...
Now, based on specific text, I want to predict the rating a customer will give.
y_predicted = clf.predict(text_tf["Didnt know how much i'd use a kindle so went for the lower end. im happy with it, even if its a little dark"])
Then I get this error: IndexError: Index dimension must be <= 2
The actual rating for this review is 4, so I was expecting y_predicted to show me a 4. Maybe there is some other library for this kind of thing. Again, I think it should be doable. Thoughts? Suggestions?
I think the issue is what you're asking it to predict on.
text_tf is a matrix of shape (n_samples, n_features); it is what you trained your model on, and it doesn't contain any text anymore. What you want is to transform your test sample the same way you transformed your training samples, using the TfidfVectorizer. Try the following:
y_predicted = clf.predict(tf.transform(["Didnt know how much i'd use a kindle so went for the lower end. im happy with it, even if its a little dark"]))
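Note that TfidfVectorizer.transform expects an iterable of documents, which is why the review is wrapped in a list; a bare string would be iterated character by character.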

Is there a classify function for Scikit learn classifiers?

I've been using NLTK classifiers to train datasets and classify single records.
For training the records I use this function,
nltk.NaiveBayesClassifier.train(train_set)
For classifying a single record,
nltk.NaiveBayesClassifier.classify(record)
where, "record" is the variable name.
In scikit-learn classifiers, for training on a dataset, the function used is:
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier()
classifier.fit(X_train, y_train)
What is the function to classify a single record in scikit-learn classifiers? That is, is there something like classifier.classify()?
The predict method classifies a whole test set converted into a feature matrix, like this:
y_pred = classifier.predict(X_test)
I couldn't classify for a single record; I get this error :
File "C:\Users\HSR\Anaconda2\lib\site-packages\sklearn\utils\validation.py",
line 433, in check_array array = np.array(array, dtype=dtype, order=order,
copy=copy) ValueError: could not convert string to float: This is a bot
If predict can be used to classify a single record, then how to do it?
If you are looking for a method that predicts which class your data falls in, I believe the following would help:
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier()
classifier.fit(X_train, y_train)
classifier.predict(record)
To know more about the available APIs, please follow this link to the documentation.
It looks like you are looking for a text classifier. Here is a scikit-learn example of a text classifier; the page gives a thorough introduction to working with text data in scikit-learn.
You need to apply all of the same preprocessing that you applied to your training data; sklearn classifiers don't know what you did to turn your text into training data. This can, however, be done using sklearn's pipelines. predict also expects an array, but you can pass it an array of one sample.
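A minimal sketch of that idea, with hypothetical data names (train_texts is a list of raw strings, train_labels the matching classes); the pipeline bundles the vectorizer and the classifier so a single raw record can be passed straight to predict:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier

text_clf = Pipeline([
    ('tfidf', TfidfVectorizer()),      # same preprocessing at train and predict time
    ('clf', RandomForestClassifier()),
])
text_clf.fit(train_texts, train_labels)     # hypothetical training data

# predict expects an array-like, so wrap the single record in a list:
print(text_clf.predict(["This is a bot"]))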

Predicting multiple outputs based on multiple inputs like month and fixed-value columns

I have data like that shown in the image; it is about 25,000 rows. The data contains details for 12 months over the past 4 years. I want to predict the client and positions opened for a particular month and job title.
from sklearn.cross_validation import train_test_split
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
df_final['Clientname_numeric'] = le.fit_transform(df_final['ClientName'])
X = df_final[['MONTH','JobTitleID']]
y = df_final[['PositionsOpened','Clientname_numeric']]
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.05)
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
clf = RandomForestClassifier()
clf.fit(x_train, y_train)
predictions = clf.predict(x_test)
predictions = predictions.astype(int)
accuracy = accuracy_score(y_test,predictions)
I am using the above code and getting this error:
ValueError: multiclass-multioutput is not supported
You could use the scikit-learn package and its random forest classifier. I should point out that I only have very superficial knowledge of machine learning, so this might just be the wrong choice for your specific case. The RandomForestClassifier, however, allows predicting multiple outputs at once.
In general, given your data, you would approach it like this (using scikit-learn):
Split the tables into input columns and output columns. This could probably be done most easily using the pandas package. Then split those into training and test subsets; scikit-learn offers an off-the-shelf solution for this.
Create an instance of a classifier like RandomForestClassifier and train it using the input and output data from your training set (classifier.fit(inputs_train, outputs_train)).
Given the inputs of your test data, predict the outputs (classifier.predict(inputs_predict)). Decide whether you are satisfied with the predictive quality of your classifier.
For classifying multiple outputs, sklearn has the MultiOutputClassifier wrapper in its multioutput module; it expects a base estimator like random forests, gradient boosting, etc. The module supports both multi-output classification and regression.
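A minimal sketch of that wrapper, assuming x_train/y_train are split as in the question (MultiOutputClassifier fits one classifier per target column):
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier

multi_clf = MultiOutputClassifier(RandomForestClassifier())
multi_clf.fit(x_train, y_train)           # y_train holds both target columns
predictions = multi_clf.predict(x_test)   # one predicted column per target
Note that accuracy_score does not support multiclass-multioutput targets (which is most likely where the error in the question comes from), so evaluate each output column separately.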
Hope this helps!

Why does k-fold cross-validation need fit first?

I get the following error in the code below unless I first fit the SVC:
This SVC instance is not fitted yet. Call 'fit' with appropriate arguments before using this method.
That is, unless I do this:
clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
Why do I need to fit before doing cross-validation?
import numpy as np
from sklearn import cross_validation
from sklearn import datasets
from sklearn import svm
iris = datasets.load_iris()
# Split the iris data into train/test data sets with 40% reserved for testing
X_train, X_test, y_train, y_test = cross_validation.train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=0)
# Build an SVC model for predicting iris classifications using training data
clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
# Now measure its performance with the test data
clf.score(X_test, y_test)
# We give cross_val_score a model, the entire data set and its "real" values, and the number of folds:
scores = cross_validation.cross_val_score(clf, iris.data, iris.target, cv=5)
You don't. Your cross_val_score runs fine without the fit.
You do need to fit before running score.
The reason you are seeing that error is because you are asking your estimator (clf) to compute the accuracy of its classifications (with the clf.score method) before it actually knows how to do the classification. To teach clf how to do the classification you have to train it by calling the fit method. This is what the error message is trying to tell you.
score in the above sense has nothing to do with cross-validation, only accuracy. The cross_val_score helper you use can take an untrained estimator and compute a cross-validated score for your data. This helper trains the estimator for you, which is why you don't have to call fit before using it.
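A minimal sketch contrasting the two, reusing the question's imports:
# cross_val_score clones and fits the estimator inside every fold,
# so an unfitted estimator is fine here:
clf = svm.SVC(kernel='linear', C=1)
scores = cross_validation.cross_val_score(clf, iris.data, iris.target, cv=5)

# score() evaluates an already-fitted model, so fit must come first:
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))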
See the documentation for cross-validation for more information.
