What does clf mean in machine learning? - python

When doing fitting, I always come across code like
clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
(from http://scikit-learn.org/stable/modules/cross_validation.html#k-fold)
What does clf stand for? I googled around but didn't find any clues.

In the scikit-learn tutorial, it's short for classifier:
We call our estimator instance clf, as it is a classifier.

In the link you provided, clf refers to classifier.

You can write svm_model or any other descriptive name in place of clf for better readability.
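For illustration, a minimal sketch (using the iris dataset as a stand-in for your own data):
from sklearn import svm
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# "clf" is only a naming convention for "classifier"; any name behaves identically
svm_model = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
print(svm_model.score(X_test, y_test))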

Related

How to use GridSearchCV, cross_val_score and a model

I need to find the best hyperparameters for an ANN and then run predictions with the best model. I use KerasRegressor. I find conflicting examples and advice. Please help me understand the right sequence and which parameters to use when.
1. I split my data into Train and Test datasets.
2. I look for the best hyperparameters using GridSearchCV on the Train dataset:
GridSearchCV.fit(X_Train, Y_Train)
3. I take GridSearchCV.best_estimator_ and use it in cross_val_score on the Test dataset, i.e.
cross_val_score(model.best_estimator_, X_Test, Y_Test, scoring='r2')
I'm not sure if I need this step. In theory, it should show r2 scores similar to what GridSearchCV reported for this best_estimator_, shouldn't it?
4. I use model.best_estimator_.predict(X_Test) on the Test data to predict the results, i.e. I pass best_estimator_ from GridSearchCV to run the actual prediction. Is this correct? Do I need to fit model.best_estimator_ again on the Train data before doing a prediction, or does it keep all the weights found during GridSearchCV?
5. Do I need to save the weights to be able to reuse them later?
Usually, when you use GridSearchCV on your training set, you end up with an object that contains the best trained model with the best parameters:
gs = GridSearchCV(estimator, param_grid)
gs.fit(X_train, y_train)
This is also evident from running gs.best_params_, which prints out the best parameters of the model after cross-validation. Now you can make predictions on your test set directly by running gs.predict(X_test), which uses the best selected model to predict on your test set.
For question 3, you don't need to use cross_val_score again, as it is a helper function that performs cross-validation on your dataset and returns the score of each fold of the data split.
For question 4, I believe this answer is quite explanatory: https://stats.stackexchange.com/a/26535
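Putting that workflow together, a minimal sketch (using RandomForestRegressor as a stand-in for KerasRegressor; the param_grid is illustrative):
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_regression(n_samples=200, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

param_grid = {'n_estimators': [50, 100]}
gs = GridSearchCV(RandomForestRegressor(random_state=0), param_grid, scoring='r2')
gs.fit(X_train, y_train)           # refit=True (default): best model is retrained on all of X_train

print(gs.best_params_)             # best hyperparameters found by cross-validation
print(gs.score(X_test, y_test))    # r2 of the best model on the held-out test set
y_pred = gs.predict(X_test)        # predictions from the best model; no extra refit needed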

Can we predict a rating based on text, using NLP?

I've used regression and classification in the past to train, test, and make predictions. Now, I am looking at some NLP sample code and everything is running fine, but at the end, I was hoping to make a prediction of a 'rating' score based on what is contained in a 'text' field. Maybe NLP can't do this, but it seems like it should be doable. Here is the code that I am testing.
from sklearn.feature_extraction.text import TfidfVectorizer
tf = TfidfVectorizer()
text_tf = tf.fit_transform(df['review_text'])
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(text_tf, df['reviews.rating'], test_size=0.3, random_state=123)
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
# Model Generation Using Multinomial Naive Bayes
clf = MultinomialNB().fit(X_train, y_train)
predicted = clf.predict(X_test)
print("MultinomialNB Accuracy:", metrics.accuracy_score(y_test, predicted))
# around 7% accurate...
Now, based on specific text, I want to predict the rating a customer will give.
y_predicted = clf.predict(text_tf["Didnt know how much i'd use a kindle so went for the lower end. im happy with it, even if its a little dark"])
Then I get this error: IndexError: Index dimension must be <= 2
The actual rating for this review is 4. I was expecting y_predicted to show me a 4. Maybe there is some other library for this kind of thing. Again, I think it should be doable. Thoughts? Suggestions?
I think the issue is what you're asking it to predict on.
text_tf is a matrix of size (n_samples, n_features). This is what you trained your model on; it doesn't contain any text anymore. What you want is to transform your test sample the same way you transformed your training samples, using the TfidfVectorizer. Try the following (note the list: transform expects an iterable of documents, not a single string):
y_predicted = clf.predict(tf.transform(["Didnt know how much i'd use a kindle so went for the lower end. im happy with it, even if its a little dark"]))
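A quick sanity check, reusing the tf and clf objects fitted in the question (the exact predicted value depends on your training data):
sample = ["Didnt know how much i'd use a kindle so went for the lower end. im happy with it, even if its a little dark"]
features = tf.transform(sample)   # same (1, n_features) shape as the training matrix
print(clf.predict(features))      # e.g. array([4]) if the model predicts a 4-star rating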

Is it possible to tune the linear regression (hyper)parameter in sklearn

I'm starting to learn a bit of scikit-learn and ML in general, and I'm running into a problem.
I've created a model using linear regression.
The .score is good (above 0.8), but I want to get it better (perhaps to 0.9).
I've searched the sklearn documentation and googled this question, but I cannot seem to find the answer.
My question is: is it possible to tune the LinearRegression model?
And if so, where can I find it?
#----- Forecast in hours -----#
forecast_out = 48
#----- Import and prep data -----#
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
# ... use pandas to create X and y ...
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
#----- Linear Regression -----#
lr = LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
lr.fit(x_train, y_train)
lr_confidence = lr.score(x_test, y_test)
print("lr confidence: ", lr_confidence)
x_forecast = np.array(data.drop(['Prediction'], axis=1))[-forecast_out:]
lr_prediction = lr.predict(x_forecast)
There is always room for improvement. The LinearRegression model does have parameters: use .get_params() to find their names and default values, and then use .set_params(**params) to set values from a dictionary.
GridSearchCV and RandomizedSearchCV can help you tune them better and quicker than you can by hand.
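For example, a minimal sketch of those two calls:
from sklearn.linear_model import LinearRegression

lr = LinearRegression()
print(lr.get_params())              # e.g. {'copy_X': True, 'fit_intercept': True, ...}
lr.set_params(fit_intercept=False)  # equivalently: lr.set_params(**{'fit_intercept': False})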
This is a very open-ended question and you should just look up the documentation. It's all there, really, trust me - I've looked. Just Google LinearRegression documentation.
http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
It seems that sklearn.linear_model.LinearRegression does not have hyperparameters that can be tuned. So instead, please use sklearn.linear_model.SGDRegressor, which provides many possibilities for tuning hyperparameters.
Its documentation can be found here: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDRegressor.html
No, it is not possible.
For hyperparameter tuning of linear regression, try Lasso, Ridge or ElasticNet.
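A minimal sketch of that suggestion, tuning Ridge's regularization strength with GridSearchCV (the alpha grid is illustrative):
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=200, n_features=10, noise=10, random_state=0)
gs = GridSearchCV(Ridge(), {'alpha': [0.01, 0.1, 1.0, 10.0]}, scoring='r2')
gs.fit(X, y)
print(gs.best_params_, gs.best_score_)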

Multi-target regression using scikit-learn

I am solving the classic regression problem using the Python language and the scikit-learn library. It's simple:
ml_model = GradientBoostingRegressor()
ml_params = {}
ml_model.fit(X_train, y_train)
where y_train is a one-dimensional array-like object.
Now I would like to expand the functionality of the task to get not a single target value, but a set of them. The training set of samples X_train will remain the same.
An intuitive solution to the problem is to train several models, where X_train for all of them will be the same but y_train for each model will be specific. This would definitely work, but it seems to me an inefficient solution.
When searching for alternatives, I came across the concept of Multi-Target Regression. As I understand it, such functionality is not implemented in scikit-learn.
How do I solve a multi-target regression problem in Python in an efficient way? Thanks)
It depends on the problem you are solving, the training data you have, and the algorithm you choose. It's really hard to suggest anything without knowing all the details. You could try a random forest as a starting point. It's a very powerful and robust algorithm that is resistant to overfitting when you don't have much data, and it can also be used for multi-target regression. Here is a working example:
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
X, y = make_regression(n_targets=2)
print('Feature vector:', X.shape)
print('Target vector:', y.shape)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8)
print('Build and fit a regressor model...')
model = RandomForestRegressor()
model.fit(X_train, y_train)
score = model.score(X_test, y_test)
print('Done. Score', score)
Output:
Feature vector: (100, 100)
Target vector: (100, 2)
Build and fit a regressor model...
Done. Score 0.4405974071273537
This algorithm natively supports multi-target regression. For those that don't, you can use MultiOutputRegressor, which simply fits one regressor per target, as sketched below.
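For instance, a minimal sketch wrapping the question's GradientBoostingRegressor (which is single-target) in MultiOutputRegressor:
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.multioutput import MultiOutputRegressor

X, y = make_regression(n_targets=3)
# fits one independent GradientBoostingRegressor per target column of y
model = MultiOutputRegressor(GradientBoostingRegressor())
model.fit(X, y)
print(model.predict(X[:2]).shape)  # (2, 3): one prediction per target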
Another alternative to the random forest approach would be to use an adapted version of Support Vector Regression that fits multi-target regression problems. The advantage over fitting SVR with MultiOutputRegressor is that this method takes the underlying correlations between the multiple targets into account and hence should perform better.
A working implementation with a paper reference can be found here.

An SVM that has the capabilities to perform "online learning" and give probability to a prediction using hinge loss

There is a bit of a dilemma with sklearn.
Using SVC, I can use the predict_proba method to calculate how likely a prediction is.
Using SGDClassifier, I can perform "online/incremental" learning via the partial_fit method, but predict_proba doesn't work with the 'hinge' loss.
Is there any way to have both?
As someone asked about this before on GitHub, the authors then added a mention of it to the docs.
So the doc entry to go for is scikit-learn's Probability calibration.
A basic example was given in the github link as:
from sklearn.calibration import CalibratedClassifierCV
from sklearn.linear_model import SGDClassifier
clf = SGDClassifier(loss='hinge')
calibrated_clf = CalibratedClassifierCV(clf, cv=5, method='sigmoid')
calibrated_clf.fit(X, y)
But this probably won't work with partial_fit, as mentioned here. So for your task it's important to know whether the following works for you:
no, you can't partial_fit the calibration, but you can partial_fit the underlying classifier
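A minimal sketch of that workaround: keep calling partial_fit on the underlying SGDClassifier, then calibrate the already-fitted model on held-out data (cv='prefit' here; newer scikit-learn versions replace it with FrozenEstimator):
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=1000, random_state=0)
X_stream, y_stream = X[:800], y[:800]      # data arriving incrementally
X_calib, y_calib = X[800:], y[800:]        # held-out data for calibration

clf = SGDClassifier(loss='hinge')
for i in range(0, len(X_stream), 100):     # simulate online learning in batches
    clf.partial_fit(X_stream[i:i + 100], y_stream[i:i + 100], classes=[0, 1])

calibrated_clf = CalibratedClassifierCV(clf, cv='prefit', method='sigmoid')
calibrated_clf.fit(X_calib, y_calib)       # fits only the calibration, not clf itself
print(calibrated_clf.predict_proba(X_calib[:3]))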
