Is there a classify function for Scikit learn classifiers? - python

I've been using NLTK classifiers to train on datasets and classify a single record.
For training the records I use this function,
nltk.NaiveBayesClassifier.train(train_set)
For classifying a single record,
nltk.NaiveBayesClassifier.classify(record)
where "record" is the variable name.
With scikit-learn classifiers, the function used for training is:
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier()
classifier.fit(X_train, y_train)
What is the function to classify a single record with scikit-learn classifiers? That is, is there something like classifier.classify()?
The predict method seems to classify only a whole test set converted into a sparse matrix, like:
y_pred = classifier.predict(X_test)
I couldn't classify a single record; I get this error:
File "C:\Users\HSR\Anaconda2\lib\site-packages\sklearn\utils\validation.py", line 433, in check_array
array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: could not convert string to float: This is a bot
If predict can be used to classify a single record, how do I do it?

If you are looking for a method that helps you to predict which class your data would fall in, I believe,
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier()
classifier.fit(X_train, y_train)
classifier.predict([record])  # predict expects a 2D array: one row per sample
would help. To learn more about the available APIs, see the scikit-learn documentation.
It looks like you are looking for a text classifier. The scikit-learn documentation has an example of a text classifier, which gives a thorough introduction to working with text data in scikit-learn.

You need to apply all of the same preprocessing that you applied to your training data; sklearn classifiers don't know what you did to turn your text into training data. This can be done using sklearn's pipelines. predict does expect an array, but you can pass it an array containing one sample.
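As a minimal sketch of that advice, a Pipeline can bundle the text vectorizer with the classifier so that a single raw string can be classified. The toy texts, the labels, and the choice of TfidfVectorizer are assumptions made for illustration; the original question does not say how the text was vectorized.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Hypothetical toy training data; replace with your own texts and labels.
train_texts = [
    "This is a bot",
    "I am a bot account",
    "Hello, how are you today?",
    "Nice to meet you",
]
train_labels = ["bot", "bot", "human", "human"]

# The pipeline stores the fitted vectorizer together with the classifier,
# so the same preprocessing is applied automatically at prediction time.
classifier = make_pipeline(
    TfidfVectorizer(),
    RandomForestClassifier(random_state=0),
)
classifier.fit(train_texts, train_labels)

# predict expects a sequence of samples, so wrap the single record in a list.
record = "This is a bot"
prediction = classifier.predict([record])
print(prediction[0])
```

Passing the raw string directly to a classifier trained on numeric features is what triggers "could not convert string to float"; here the vectorizer step converts it first.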

Related

IsolationForest is always predicting 1

I am working on a project to detect out-of-domain text input with the help of IsolationForest and tf-idf features. The following summarizes my work:
TRAINING
On tfidf:
Fit and transform the in-domain dataset using CountVectorizer().
Fit a TfidfTransformer() on the output of this CountVectorizer() and save the transformer (to use it at test time).
Then transform the training data using the TfidfTransformer().
Save both the CountVectorizer()'s vocabulary_ and the TfidfTransformer() object using pickle for test-time usage.
On IsolationForest:
Collect the transformed in-domain dataset and train an IsolationForest() novelty detector.
Save the model using joblib.
TESTING:
Load all of the saved models.
Get the tf-idf transformed features of the current out-of-domain input text by replicating the training steps (transformations only).
Predict if it is out-of-domain or not, using the saved IsolationForest model.
But I have found that even though the tf-idf features are quite different for each of my test inputs, the IsolationForest always predicts 1.
What is probably going wrong?
NB: I also tried feeding dummy vectors that mimic the output of the tf-idf transformer to the IsolationForest model, to check whether the tf-idf module is responsible, but no matter which random vector I provide, I always get 1 as the output. Also note that tf-idf produces a lot of features (tokens); in my case the count is 48015.
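The workflow described above can be sketched roughly as follows. This is not a diagnosis of the problem, only a compact reproduction of the steps; the toy sentences are hypothetical, and wrapping everything in a single Pipeline (rather than pickling each part separately) is an assumption made to keep the example short. Printing decision_function alongside predict is one way to check whether the underlying scores differ between inputs even when the predicted label does not.

```python
from sklearn.ensemble import IsolationForest
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline

# Hypothetical in-domain sentences; replace with the real training corpus.
in_domain_texts = [
    "book a flight to london",
    "book a flight to paris",
    "cancel my flight booking",
    "change my flight to monday",
    "find a cheap flight to rome",
    "is my flight delayed",
]

detector = Pipeline([
    ("counts", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("forest", IsolationForest(random_state=0)),
])
detector.fit(in_domain_texts)

# The fitted pipeline could be saved in one call, e.g.
# joblib.dump(detector, "detector.joblib"), instead of saving each part.

test_texts = ["book a flight to berlin", "how do I bake sourdough bread"]
labels = detector.predict(test_texts)            # 1 = inlier, -1 = outlier
scores = detector.decision_function(test_texts)  # lower = more anomalous
for text, label, score in zip(test_texts, labels, scores):
    print(int(label), round(float(score), 3), text)
```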

How to make naive bayes multinomial with TF-idf from scratch in python?

I know there is a library in python
from sklearn.naive_bayes import MultinomialNB
but I want to know how to create one from scratch without using libraries like TfidfVectorizer and MultinomialNB?
Here is a step-by-step guide to making a simple MNB classifier with TF-IDF:
First, import TfidfVectorizer to tokenize the terms in the dataset, MultinomialNB as the classifier, and train_test_split for splitting the dataset. (All three are available in sklearn.)
Split the dataset into train and test sets.
Initialize TfidfVectorizer, then vectorize the train set with the method fit_transform.
Vectorize the test set with the method transform (not fit, so that the vocabulary learned from the train set is reused).
Initialize the classifier by calling the constructor MultinomialNB().
model = MultinomialNB() # with default hyperparameters
Train the classifier with the train set.
model.fit(X_train, y_train)
Test/validate the classifier with the test set.
model.score(X_test, y_test)  # mean accuracy; use model.predict(X_test) for the predicted labels
Those 7 steps above are the basic steps. You can also add text preprocessing and model evaluation.
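Put together, the seven steps might look like the sketch below. Note that it uses the scikit-learn classes rather than a from-scratch implementation, and the tiny dataset is hypothetical, made up only so that the example runs on its own.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy dataset; replace with your own documents and labels.
texts = [
    "win a free prize now", "free money click here",
    "claim your free reward", "cheap pills free shipping",
    "meeting at noon tomorrow", "please review the report",
    "lunch with the team today", "notes from the project call",
]
labels = ["spam"] * 4 + ["ham"] * 4

# Split the raw texts into train and test sets.
X_train_text, X_test_text, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels
)

# Learn the vocabulary and IDF weights on the train set only,
# then reuse them (transform, not fit) on the test set.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train_text)
X_test = vectorizer.transform(X_test_text)

model = MultinomialNB()  # default hyperparameters
model.fit(X_train, y_train)

y_pred = model.predict(X_test)         # predicted labels
accuracy = model.score(X_test, y_test)  # mean accuracy on the test set
print(y_pred, accuracy)
```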

Permutation importance using a Pipeline in SciKit-Learn

I am using the exact example from the scikit-learn documentation, which compares permutation_importance with tree feature_importances.
As you can see, a Pipeline is used:
rf = Pipeline([
('preprocess', preprocessing),
('classifier', RandomForestClassifier(random_state=42))
])
rf.fit(X_train, y_train)
permutation_importance:
Now, when you fit a Pipeline, it fits all the transforms one after the other and transforms the data, then fits the final estimator on the transformed data.
Later in the example, they used the permutation_importance on the fitted model:
result = permutation_importance(rf, X_test, y_test, n_repeats=10,
random_state=42, n_jobs=2)
Problem: What I don't understand is that the features in the result are still the original non-transformed features. Why is this the case? Is this working correctly? What is the purpose of the Pipeline then?
tree feature_importance:
In the same example, when they use the feature_importance, the results are transformed:
tree_feature_importances = (
rf.named_steps['classifier'].feature_importances_)
I can obviously transform my features and then use permutation_importance, but it seems that the steps presented in the examples are intentional, and there should be a reason why permutation_importance does not transform the features.
This is the expected behavior. Permutation importance works by shuffling one column of the input data at a time and passing the result to the pipeline (or to the model, if that is what you want). In fact, if you want to understand how the initial input data affects the model, then you should apply it to the pipeline.
If you are interested in the importance of each additional feature generated by your preprocessing steps, then you should generate the preprocessed dataset with column names and apply that data to the model directly (using permutation importance) instead of to the pipeline.
In most cases people are not interested in learning the impact of the secondary features that the pipeline generates. That is why they use the pipeline here to encompass the preprocessing and modeling steps.
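The difference between the two approaches can be seen in a small sketch. The synthetic two-column dataset and the one-hot preprocessing are assumptions chosen for illustration, not the data from the scikit-learn example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: one categorical and one numeric input column.
rng = np.random.RandomState(42)
X = pd.DataFrame({
    "color": rng.choice(["red", "green", "blue"], size=200),
    "size": rng.normal(size=200),
})
y = ((X["color"] == "red") | (X["size"] > 0.5)).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

preprocessing = ColumnTransformer(
    [("cat", OneHotEncoder(handle_unknown="ignore"), ["color"])],
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array
)
rf = Pipeline([
    ("preprocess", preprocessing),
    ("classifier", RandomForestClassifier(random_state=42)),
])
rf.fit(X_train, y_train)

# Pipeline as the estimator: one importance per ORIGINAL column.
on_pipeline = permutation_importance(
    rf, X_test, y_test, n_repeats=5, random_state=42
)

# Classifier alone on preprocessed data: one importance per GENERATED feature.
X_test_transformed = rf.named_steps["preprocess"].transform(X_test)
on_model = permutation_importance(
    rf.named_steps["classifier"], X_test_transformed, y_test,
    n_repeats=5, random_state=42,
)

print(on_pipeline.importances_mean.shape)  # (2,): color, size
print(on_model.importances_mean.shape)     # (4,): 3 one-hot columns + size
```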

Why do we need to fit the model again in order to get score?

I'm testing the embedded methods for feature selection.
I understand (maybe I misunderstand) that with embedded methods we can get the best features (based on feature importance) while training the model.
If so, I want to get the score of the trained model (which was trained to select features).
I'm testing this on a classification problem with the Lasso method.
When I try to get the score, I get an error saying that I need to fit the model again.
Why do I need to do that? (It seems like a waste of time if the model was already fitted during feature selection.)
Why can't we do both (select features and get the model score) in one shot?
If we are using an embedded method, why do we need two phases? Why can't we train the model while choosing the best features in one fit?
from sklearn.linear_model import Lasso, LogisticRegression
from sklearn.feature_selection import SelectFromModel
estimator = LogisticRegression(C=1, penalty='l1', solver='liblinear')
selection = SelectFromModel(estimator)
selection.fit(x_train, y_train)
print(estimator.score(x_test, y_test))
Error:
sklearn.exceptions.NotFittedError: This LogisticRegression instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator.
The fitted estimator is returned as selection.estimator_ (see the docs); so, after fitting selection, you can simply do:
selection.estimator_.score(x_test, y_test)
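A self-contained sketch of why the original call fails: SelectFromModel fits a clone of the estimator it is given, so the object passed in stays unfitted, while the fitted clone is exposed as estimator_. The synthetic dataset from make_classification is an assumption used only to make the example runnable.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical synthetic data standing in for the real dataset.
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=0
)
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=0)

estimator = LogisticRegression(C=1, penalty="l1", solver="liblinear")
selection = SelectFromModel(estimator)
selection.fit(x_train, y_train)

# The original object was cloned, not fitted in place.
original_is_fitted = hasattr(estimator, "coef_")
print(original_is_fitted)

# The fitted clone was trained on all features, so it can be scored directly.
score = selection.estimator_.score(x_test, y_test)
print(score)

# Number of features kept by the selector.
n_selected = int(selection.get_support().sum())
print(n_selected)
```

Note that this score belongs to the model trained on all features; a model trained only on the selected features would still need its own fit on selection.transform(x_train).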

Predicting Multiple output based on multiple input like Month and Fixed values column

I have data like that shown in the image. It is about 25,000 rows. The data contains details about 12 months for the past 4 years. I want to predict Client and Positions Opened for a particular month and a particular job title.
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
df_final['Clientname_numeric'] = le.fit_transform(df_final['ClientName'])
X = df_final[['MONTH','JobTitleID']]
y = df_final[['PositionsOpened','Clientname_numeric']]
x_train,x_test,y_train,y_test = train_test_split(X,y,test_size = 0.05 )
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
clf = RandomForestClassifier()
clf.fit(x_train, y_train)
predictions = clf.predict(x_test)
predictions = predictions.astype(int)
accuracy = accuracy_score(y_test,predictions)
I am using the above code and getting this error:
ValueError: multiclass-multioutput is not supported
You could use the scikit-learn package and its random forest classifier. I should point out that I only have very superficial knowledge of machine learning, so this might just be the wrong one for your specific case. The RandomForestClassifier, however, can predict multiple outputs at once.
In general, given your data, you would approach it like this (using Scikit Learn):
Split the table into input columns and output columns. This can probably be done most easily using the pandas package. Then split those into training and test subsets. Scikit-learn offers an off-the-shelf solution for this.
Create an instance of a classifier like RandomForestClassifier and train it using the input and output data from your training set (classifier.fit(inputs_train, outputs_train)).
Given the inputs of your test data, predict the outputs (classifier.predict(inputs_predict)). Decide whether you are satisfied with the predictive quality of your classifier.
For classifying multiple outputs, sklearn has this library; it expects a base estimator such as random forests, gradient boosting, etc.
The library allows multiple output regression and classification.
Hope this helps!
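A minimal sketch of the multi-output approach follows. The link behind "this library" did not survive, so the use of sklearn.multioutput.MultiOutputClassifier here is an assumption about what was meant; the random data and its value ranges are also hypothetical. The ValueError in the question comes from accuracy_score, which does not accept a multiclass-multioutput target, so the sketch scores each output column separately.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputClassifier

# Hypothetical stand-in for df_final: two inputs and two targets per row.
rng = np.random.RandomState(0)
X = np.column_stack([
    rng.randint(1, 13, size=500),   # MONTH
    rng.randint(0, 20, size=500),   # JobTitleID
])
y = np.column_stack([
    rng.randint(0, 5, size=500),    # PositionsOpened
    rng.randint(0, 4, size=500),    # Clientname_numeric
])
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# One random forest is fitted per output column.
clf = MultiOutputClassifier(RandomForestClassifier(random_state=0))
clf.fit(x_train, y_train)
predictions = clf.predict(x_test)  # shape (n_samples, 2)

# Score each output column on its own instead of passing the 2D target.
accuracies = {}
for i, name in enumerate(["PositionsOpened", "Clientname_numeric"]):
    accuracies[name] = accuracy_score(y_test[:, i], predictions[:, i])
print(accuracies)
```

RandomForestClassifier also accepts a two-column y directly in fit, so the wrapper is optional for this estimator; the per-column scoring is the part that avoids the error.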
