I'm starting to learn a bit of scikit-learn and ML in general, and I'm running into a problem.
I've created a model using linear regression.
The .score is good (above 0.8), but I want to improve it (perhaps to 0.9).
I've searched the sklearn documentation and googled this question, but I cannot seem to find the answer.
My question is: is it possible to tune the LinearRegression model?
And if so, where can I find out how?
#----- Forecast in hours -----#
forecast_out = 48
#----- Import and prep data -----#
# (using pandas to create X and y)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
#----- Linear Regression-----#
lr = LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None)  # 'normalize' was removed in scikit-learn 1.2
lr.fit(x_train, y_train)
lr_confidence = lr.score(x_test, y_test)
print("lr confidence: ", lr_confidence)
x_forecast = np.array(data.drop(columns=['Prediction']))[-forecast_out:]
lr_prediction = lr.predict(x_forecast)
There is always room for improvement. The LinearRegression model does take parameters: use .get_params() to list their names and current values, and .set_params(**params) to set them from a dictionary.
GridSearchCV and RandomizedSearchCV can search over them faster and more systematically than you can by hand.
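As a rough sketch of what that could look like (the synthetic data from make_regression and the particular grid are assumptions for illustration, not taken from the question):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the asker's X and y (assumption for illustration).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

lr = LinearRegression()
print(lr.get_params())  # constructor parameters and their current values

# LinearRegression has very few knobs; fit_intercept and positive are two of them.
param_grid = {"fit_intercept": [True, False], "positive": [True, False]}
search = GridSearchCV(lr, param_grid, cv=5, scoring="r2")
search.fit(x_train, y_train)

best_params = search.best_params_
test_score = search.score(x_test, y_test)  # R^2 of the best model on held-out data
print(best_params, test_score)
```

With so few parameters to vary, the gain from this kind of search on plain LinearRegression is usually small; better features or a different model tend to matter more.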
This is a very open-ended question and you should just look up the documentation. It's all there, really, trust me - I've looked. Just Google LinearRegression documentation.
http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
It seems that sklearn.linear_model.LinearRegression does not have hyperparameters worth tuning. Instead, you can use sklearn.linear_model.SGDRegressor, which provides many possibilities for tuning hyperparameters.
Its documentation can be found here: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDRegressor.html
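A minimal sketch of tuning SGDRegressor, assuming synthetic data and an arbitrary small grid. The StandardScaler step is my addition, since SGD is sensitive to feature scaling:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the real X and y (assumption).
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Scale the features first, then fit the SGD-based linear model.
pipe = make_pipeline(StandardScaler(), SGDRegressor(max_iter=2000, random_state=0))

# make_pipeline names each step after its lowercased class name.
param_grid = {
    "sgdregressor__alpha": [1e-5, 1e-4, 1e-3, 1e-2],
    "sgdregressor__penalty": ["l2", "l1", "elasticnet"],
}
sgd_search = GridSearchCV(pipe, param_grid, cv=5, scoring="r2")
sgd_search.fit(x_train, y_train)

sgd_best = sgd_search.best_params_
sgd_score = sgd_search.score(x_test, y_test)
print(sgd_best, sgd_score)
```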
No, it is not possible.
If you want a linear regression with hyperparameters to tune, try Lasso, Ridge or ElasticNet.
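For instance, the regularization strength alpha of Ridge can be searched with cross-validation. This is only a sketch: the data is synthetic and the list of alpha values is an arbitrary assumption.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data standing in for the real X and y (assumption).
X, y = make_regression(n_samples=200, n_features=20, noise=15.0, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Candidate regularization strengths (arbitrary choice for illustration).
alphas = [0.01, 0.1, 1.0, 10.0, 100.0]
ridge_search = GridSearchCV(Ridge(), {"alpha": alphas}, cv=5, scoring="r2")
ridge_search.fit(x_train, y_train)

best_alpha = ridge_search.best_params_["alpha"]
ridge_score = ridge_search.score(x_test, y_test)
print(best_alpha, ridge_score)
```

The same pattern works for Lasso (alpha) and ElasticNet (alpha and l1_ratio).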
I am trying to train a logistic regression model in scikit-learn and it is taking very long to train, around 2 hours. The size of the dataset is 21613 x 19. I am new to scikit-learn, so I don't know whether my code is wrong or whether it just takes that long to train. Any suggestions on how to improve the training speed would be very much appreciated!
The code used to train is below:
# get the LogisticRegression estimator
from sklearn.linear_model import LogisticRegression
# training the model
# apply algorithm to data using fit()
clf = LogisticRegression(solver='newton-cg',multi_class='multinomial')
clf.fit(X_train,y_train)
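If you have a specific reason for using this solver, one thing you can try is setting the n_jobs=-1 argument. Note, though, that according to the documentation n_jobs parallelizes over classes in one-versus-rest mode, so it may not help with multi_class='multinomial'.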
If you're open to using other solvers, you can use faster solvers with a one-versus-rest strategy. For instance:
clf = LogisticRegression(solver='liblinear', multi_class='ovr')
It's all in the documentation, which can help you guide your choice of solver:
solver : {‘newton-cg’, ‘lbfgs’, ‘liblinear’, ‘sag’, ‘saga’}, default=’lbfgs’
Algorithm to use in the optimization problem.
For small datasets, ‘liblinear’ is a good choice, whereas ‘sag’ and
‘saga’ are faster for large ones.
For multiclass problems, only ‘newton-cg’, ‘sag’, ‘saga’ and ‘lbfgs’
handle multinomial loss; ‘liblinear’ is limited to one-versus-rest
schemes.
‘newton-cg’, ‘lbfgs’, ‘sag’ and ‘saga’ handle L2 or no penalty
‘liblinear’ and ‘saga’ also handle L1 penalty
‘saga’ also supports ‘elasticnet’ penalty
‘liblinear’ does not support setting penalty='none'
It's probably slow because of the solver you have chosen. newton-cg is a Newton method: it relies on second-derivative (Hessian) information, which becomes expensive on large datasets. Try a different solver such as sag or saga; they are fast on big datasets.
You might want to change your solver. The documentation says that scikit-learn has 5 different solvers: 'sag', 'saga', 'newton-cg', 'lbfgs', and 'liblinear' (not suitable for multinomial).
import time
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
# Set training and validation sets
X, y = make_classification(n_samples=1000000, n_features=19, n_classes = 8, n_informative=8)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1000)
# Solvers
solvers = ['newton-cg', 'sag', 'saga', 'lbfgs']
for sol in solvers:
    start = time.time()
    logreg = LogisticRegression(solver=sol, multi_class='multinomial')
    logreg.fit(X_train, y_train)
    end = time.time()
    print(sol + " Fit Time: ", end - start)
Output (from 16GB RAM 8 Core Macbook):
Choosing the right solver for a problem can save a lot of time (code adapted from here). To determine which solver is right for your problem, you can check out the table from the documentation to learn more (notice that 'newton-cg' is not faster for large datasets).
I've used regression and classification in the past to train, test, and make predictions. Now, I am looking at some NLP sample code and everything is running fine, but at the end, I was hoping to make a prediction of a 'rating' score based on what is contained in a 'text' field. Maybe NLP can't do this, but it seems like it should be doable. Here is the code that I am testing.
from sklearn.feature_extraction.text import TfidfVectorizer
tf=TfidfVectorizer()
text_tf= tf.fit_transform(df['review_text'])
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(text_tf, df['reviews.rating'], test_size=0.3, random_state=123)
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
# Model Generation Using Multinomial Naive Bayes
clf = MultinomialNB().fit(X_train, y_train)
predicted= clf.predict(X_test)
print("MultinomialNB Accuracy:",metrics.accuracy_score(y_test, predicted))
# around 7% accurate...
Now, based on specific text, I want to predict the rating a customer will give.
y_predicted = clf.predict(text_tf["Didnt know how much i'd use a kindle so went for the lower end. im happy with it, even if its a little dark"])
Then I get this error: IndexError: Index dimension must be <= 2
The actual rating for this actual review is 4. I was expecting 'y_predicted' to show me a 4. Maybe there is some other library for this kind of thing. Again, I think it should be doable. Thoughts? Suggestions?
I think the issue is what you're asking it to predict on.
text_tf is a matrix of size (n_samples, n_features). This is what you trained your model on, and it no longer contains any text, so you cannot index it with a string. What you want is to transform your new sample the same way you transformed your training samples, using the fitted TfidfVectorizer. Note that transform expects an iterable of documents, so wrap the string in a list. Try the following:
y_predicted = clf.predict(tf.transform(["Didnt know how much i'd use a kindle so went for the lower end. im happy with it, even if its a little dark"]))
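Here is a small self-contained sketch of the same idea. The four reviews and their ratings are made up for illustration and stand in for df['review_text'] and df['reviews.rating']:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny made-up corpus (hypothetical data, not from the question).
reviews = [
    "love this kindle, works great",
    "great screen and battery, very happy",
    "terrible device, stopped working",
    "awful battery, very disappointed",
]
ratings = [5, 5, 1, 1]

tf = TfidfVectorizer()
text_tf = tf.fit_transform(reviews)   # learns the vocabulary from the training text
clf = MultinomialNB().fit(text_tf, ratings)

# New text must go through the SAME fitted vectorizer, as a list of documents.
new_review = ["im happy with it, works great"]
new_tf = tf.transform(new_review)     # transform only, no re-fitting
y_predicted = clf.predict(new_tf)
print(y_predicted)
```

Calling tf.transform rather than tf.fit_transform on the new text keeps the feature columns aligned with the ones the classifier was trained on.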
I am solving the classic regression problem using the python language and the scikit-learn library. It's simple:
ml_model = GradientBoostingRegressor()
ml_params = {}
ml_model.fit(X_train, y_train)
where y_train is a one-dimensional array-like object.
Now I would like to extend the task to predict not a single target value but a set of them. The training samples X_train will remain the same.
An intuitive solution is to train several models, where X_train is the same for all of them but y_train is specific to each model. This definitely works, but it seems inefficient to me.
When searching for alternatives, I came across the concept of Multi-Target Regression. As I understand it, such functionality is not implemented in scikit-learn.
How can I solve a Multi-Target Regression problem in Python in an efficient way? Thanks!
It depends on the problem you are solving, the training data you have, and the algorithm you choose. It's really hard to suggest anything without knowing all the details. You could try a random forest as a starting point. It's a powerful and robust algorithm that is fairly resistant to overfitting when you don't have much data, and it can also be used for multi-target regression. Here is a working example:
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
X, y = make_regression(n_targets=2)
print('Feature vector:', X.shape)
print('Target vector:', y.shape)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8)
print('Build and fit a regressor model...')
model = RandomForestRegressor()
model.fit(X_train, y_train)
score = model.score(X_test, y_test)
print('Done. Score', score)
Output:
Feature vector: (100, 100)
Target vector: (100, 2)
Build and fit a regressor model...
Done. Score 0.4405974071273537
This algorithm natively supports multi-target regression. For those that don't, you can use MultiOutputRegressor (in sklearn.multioutput), which simply fits one regressor per target.
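As a sketch of that wrapper applied to the GradientBoostingRegressor from the question (the synthetic two-target data is an assumption for illustration):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputRegressor

# Synthetic data with two targets per sample (assumption for illustration).
X, y = make_regression(n_samples=200, n_features=10, n_targets=2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=0)

# GradientBoostingRegressor handles a single target, so the wrapper
# fits one independent copy of it per column of y_train.
model = MultiOutputRegressor(GradientBoostingRegressor(random_state=0))
model.fit(X_train, y_train)

pred = model.predict(X_test)   # one column of predictions per target
print(pred.shape, len(model.estimators_))
```

This is the "several models" approach from the question behind a single fit/predict interface; it does not share information between targets.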
Another alternative to the random forest approach would be to use an adapted version of Support Vector Regression, that fits multi-target regression problems. The advantage over fitting SVR with MultiOutputRegressor is that this method takes the underlying correlations between the multiple targets into account and hence should perform better.
A working implementation with a paper reference can be found here
When doing fitting, I always come across code like
clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
(from http://scikit-learn.org/stable/modules/cross_validation.html#k-fold)
What does clf stand for? I googled around but didn't find any clues.
In the scikit-learn tutorial, it's short for classifier:
We call our estimator instance clf, as it is a classifier.
In the link you provided, clf refers to classifier.
You can write svm_model or any other descriptive name in place of clf for better understanding.
I am trying to build decision trees and regression trees with Python. I am using scikit-learn, but am open to alternatives.
What I don't understand about this library is whether a training and a validation subset can be provided, so that the library builds the model on the training subset, tests it on the validation and stops splitting based on some rules (typically when additional splits don't result in better performance on the validation subset- this prevents overfitting).
For example, this is what the JMP software does (http://www.jmp.com/support/help/Validation_2.shtml#1016975).
I found no mention of how to use a validation subset in the official website (http://scikit-learn.org/stable/modules/tree.html), nor on the internet.
Any help would be most welcome! Thanks!
There is a fairly rich set of cross-validation routines and examples in the cross-validation section of the scikit-learn user guide.
Note that lots of progress seems to have been made in cross validation between SK-Learn version 0.14 and 0.15, so I recommend upgrading to 0.15 if you haven't already.
As jme notes in his comment, some of the cross validation capability has also been incorporated into the grid search and pipeline capabilities of SK-learn.
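A sketch of the grid search route applied to a decision tree, assuming the iris data and an arbitrary grid over max_depth and min_samples_leaf. It does not stop splitting on a validation set during growth; instead it uses cross-validation to pick the tree-size limits that generalize best:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=0)

# Candidate limits on tree growth (arbitrary values for illustration).
param_grid = {
    "max_depth": [1, 2, 3, 4, 5, None],
    "min_samples_leaf": [1, 5, 10],
}
tree_search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
tree_search.fit(X_train, y_train)   # 5-fold CV on the training subset only

tree_best = tree_search.best_params_
test_accuracy = tree_search.score(X_test, y_test)  # untouched held-out data
print(tree_best, test_accuracy)
```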
For completeness in answering your question, here is a simple example, but more advanced schemes (k-fold, leave-one-out, leave-p-out, shuffle-split, etc.) are all available:
>>> import numpy as np
>>> from sklearn import datasets
>>> from sklearn import svm
>>> # train_test_split now lives in sklearn.model_selection
>>> # (the old sklearn.cross_validation module has been removed)
>>> from sklearn.model_selection import train_test_split
>>> iris = datasets.load_iris()
>>> iris.data.shape, iris.target.shape
((150, 4), (150,))
>>> X_train, X_test, y_train, y_test = train_test_split(
...     iris.data, iris.target, test_size=0.4, random_state=0)
>>> X_train.shape, y_train.shape
((90, 4), (90,))
>>> X_test.shape, y_test.shape
((60, 4), (60,))
>>> clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
>>> clf.score(X_test, y_test)
0.96...
I hope this helps... Good luck!