I am trying out scikit-learn for the first time, for a Multi-Output Multi-Class text classification problem. I am attempting to use GridSearchCV to optimize the parameters of MLPClassifier for this purpose.
I will admit that I am shooting in the dark here, having no prior experience. Please let me know if this makes sense.
Below is what I currently have:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score
df = pd.read_csv('data.csv')
df.fillna('', inplace=True)  # Replace NaNs with "" (an empty string is treated as a valid label in this multi-output setting)
x_features = df['input_text']
y_labels = df[['output_text_label_1', 'output_text_label_2']]
x_train, x_test, y_train, y_test = train_test_split(x_features, y_labels, test_size=0.3, random_state=7)
pipe = Pipeline(steps=[('cv', CountVectorizer()),
                       ('mlpc', MultiOutputClassifier(MLPClassifier()))])
pipe.fit(x_train, y_train)
pipe.score(x_test, y_test)
pipe.score gives a score of ~0.837, which seems to suggest that the above code is doing something. Running pipe.predict() on some test strings seems to yield relatively adequate output results.
However, even after looking at plenty of examples, I don't understand how to implement GridSearchCV for this Pipeline. (Additionally, I would like advice on which parameters to search.)
I doubt it makes sense to post my attempts with GridSearchCV since they have been varied and all unsuccessful. But a brief example from a Stack Overflow answer could be:
grid = [
    {
        'activation': ['identity', 'logistic', 'tanh', 'relu'],
        'solver': ['lbfgs', 'sgd', 'adam'],
        'hidden_layer_sizes': [(100,), (200,)]
    }
]
grid_search = GridSearchCV(pipe, grid, scoring='accuracy', n_jobs=-1)
grid_search.fit(x_train, y_train)
This gives the error:
ValueError: Invalid parameter activation for estimator
Pipeline(steps=[('cv', CountVectorizer()),
('mlpc', MultiOutputClassifier(estimator=MLPClassifier()))]). Check the list of
available parameters with estimator.get_params().keys().
I'm not sure what causes this, nor exactly how to utilize estimator.get_params().keys() to figure out which parameters are faulty.
Perhaps my use of ('cv', CountVectorizer()) or ('mlpc', MultiOutputClassifier(estimator=MLPClassifier())) is incorrect in relation to the grid parameters.
I believe I need to use CountVectorizer() here because my inputs (and desired label outputs) are all strings.
I would very much appreciate an example of how GridSearchCV should be used for a Pipeline presumably utilizing CountVectorizer() and MLPClassifier in the correct way, and which grid parameters may be advisable to search.
TL;DR Try something like this:
import numpy as np
from sklearn.preprocessing import StandardScaler

mlpc = MLPClassifier(solver='adam',
                     learning_rate_init=0.01,
                     max_iter=300,
                     activation='relu',
                     early_stopping=True)
pipe = Pipeline(steps=[('cv', CountVectorizer(ngram_range=(1, 1))),
                       # CountVectorizer outputs a sparse matrix, so centering must be disabled
                       ('scale', StandardScaler(with_mean=False)),
                       ('mlpc', MultiOutputClassifier(mlpc))])
search_space = {
    'cv__max_df': (0.9, 0.95, 0.99),
    'cv__min_df': (0.01, 0.05, 0.1),
    'mlpc__estimator__alpha': 10.0 ** -np.arange(1, 5),
    'mlpc__estimator__hidden_layer_sizes': ((64, 32), (128, 64),
                                            (64, 32, 16), (128, 64, 32)),
    'mlpc__estimator__tol': (1e-3, 5e-3, 1e-4),
}
Discussion:
[Edit] MLPClassifier supports multi-output classification natively (though only for multi-output binary targets). With interrelated outputs, I wouldn't recommend using MultiOutputClassifier, as it trains a separate MLPClassifier per output without taking the relationships between outputs into account. Training only one MLPClassifier is faster, cheaper, and usually more accurate.
The ValueError is due to improper parameter grid names. See Nested parameters.
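In a Pipeline, every hyperparameter has to be addressed through its step name (and, inside MultiOutputClassifier, through estimator__), joined with double underscores. A quick sketch of how to list the valid names and how the grid from your attempt would have to be spelled (values are just your originals):
# List every parameter name the pipeline accepts in a grid
print(sorted(pipe.get_params().keys()))
# The MLP parameters from your grid then look like this
grid = [{
    'mlpc__estimator__activation': ['identity', 'logistic', 'tanh', 'relu'],
    'mlpc__estimator__solver': ['lbfgs', 'sgd', 'adam'],
    'mlpc__estimator__hidden_layer_sizes': [(100,), (200,)],
}]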
With a modest workstation and/or large training data, set solver='adam' to use a cheaper, first-order method, as opposed to the second-order 'lbfgs'. Alternatively, try solver='sgd' (even cheaper to compute), but then also tune momentum. I anticipate that your data will be sparse and on different scales after CountVectorizer, and momentum / solver='adam' is a way to tackle such varying gradients.
Insert one of the standardization transformers after CountVectorizer, as MLPs are sensitive to feature scaling (I guess StandardScaler will work best; note that it needs with_mean=False on CountVectorizer's sparse output). That said, solver='adam' would probably handle an imbalanced bag of words reasonably well; still, I believe it won't hurt to standardize your data.
I think tuning activation is needless. Set activation='relu'.
Use early_stopping=True, specify a large enough max_iter, and tune tol to prevent overfitting.
Definitely tune learning_rate_init with solver='sgd'; for solver='adam', I assume higher learning rates will be OK and adam won't require comprehensive learning-rate tuning.
Prefer deeper nets to wider ones (e.g., hidden_layer_sizes=(128, 64, 32) to hidden_layer_sizes=(256, 192)).
Always tune alpha.
Optimal hidden_layer_sizes may depend on the dimensionality of the document-term matrix.
Try setting a higher batch_size, but take the computational expense into account.
If you wish to optimize CountVectorizer, tune max_df and min_df but not ngram_range; I believe at least a two-layer MLP will handle the relationships between unigrams in its hidden layers itself, without the need to process n-grams explicitly.
Optimize the hyperparameters in the code sample above first. But note that the remaining hyperparameters can also affect both computational performance and predictive power.
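Putting it together, a minimal sketch of the search itself (cv=3 and n_jobs=-1 are just example settings, not requirements):
grid_search = GridSearchCV(pipe, search_space, cv=3, n_jobs=-1, verbose=1)
grid_search.fit(x_train, y_train)
print(grid_search.best_params_)
print(grid_search.best_score_)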
Disclaimer: Most of the remarks are based on my (insubstantial🤔) assumptions about your data and pertain only to scikit-learn's MLPs. Refer to docs to learn more about neural networks and experiment with other tips. And remember, There is No Free Lunch.
Related
I have a data set with some float column features (X_train) and a continuous target (y_train).
I want to run KNN regression on the data set, and I want to (1) do a grid search for hyperparameter tuning and (2) run cross validation on the training.
I wrote this code:
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.neighbors import KNeighborsRegressor

X_train, X_test, y_train, y_test = train_test_split(scaled_df, target, test_size=0.2)

cv_method = RepeatedStratifiedKFold(n_splits=5,
                                    n_repeats=3,
                                    random_state=999)

# Define our candidate hyperparameters
hp_candidates = [{'n_neighbors': [2,3,4,5,6,7,8,9,10,11,12,13,14,15],
                  'weights': ['uniform','distance'],
                  'p': [1,2,5]}]

# Search for best hyperparameters
grid = GridSearchCV(estimator=KNeighborsRegressor(),
                    param_grid=hp_candidates,
                    cv=cv_method,
                    verbose=1,
                    scoring='accuracy',
                    return_train_score=True)
grid.fit(X_train,y_train)
The error I get is:
Supported target types are: ('binary', 'multiclass'). Got 'continuous' instead.
I understand the error, namely that I can only use this method for KNN classification, not regression.
But what I can't find is how to edit this code to make it suitable for KNN regression? Can someone explain to me how this could be done?
(So the ultimate aim is: I have a data set, I want to tune the parameters, do cross-validation, output the best model based on the above, and get back some scores, ideally scores that are comparable across other algorithms and not specific to KNN, so I can compare accuracy.)
Also, just to mention, this is my first attempt at KNN in scikit-learn, so all comments/critique are welcome.
Yes, you can use GridSearchCV with KNeighborsRegressor.
As this is a question of metric choice, you can read the metrics documentation here: https://scikit-learn.org/stable/modules/model_evaluation.html
The metrics appropriate for a regression problem are different from those for classification problems; the appropriate regression scorers are:
‘explained_variance’
‘max_error’
‘neg_mean_absolute_error’
‘neg_mean_squared_error’
‘neg_root_mean_squared_error’
‘neg_mean_squared_log_error’
‘neg_median_absolute_error’
‘r2’
‘neg_mean_poisson_deviance’
‘neg_mean_gamma_deviance’
‘neg_mean_absolute_percentage_error’
So you can choose one of these to replace "accuracy" and test it.
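A sketch of the revised search under those changes (assuming the same X_train/y_train as above; note that stratified CV requires discrete targets, so plain RepeatedKFold is used here, and 'accuracy' is swapped for a regression scorer):
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import GridSearchCV, RepeatedKFold

cv_method = RepeatedKFold(n_splits=5, n_repeats=3, random_state=999)
hp_candidates = [{'n_neighbors': list(range(2, 16)),
                  'weights': ['uniform', 'distance'],
                  'p': [1, 2, 5]}]
grid = GridSearchCV(estimator=KNeighborsRegressor(),
                    param_grid=hp_candidates,
                    cv=cv_method,
                    verbose=1,
                    scoring='neg_mean_absolute_error',
                    return_train_score=True)
grid.fit(X_train, y_train)
print(grid.best_params_)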
I am working on a model which predicts car prices from their attributes. I have noticed that the predictions of a LinearRegression model differ depending on the type of input (numpy.ndarray vs. scipy.sparse.csr.csr_matrix).
My data consists of a few numerical and categorical attributes; there are no NaNs.
This is a simple data preparation code (it is common for every case I describe later):
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.metrics import r2_score

# Splitting to test and train
X = data_orig.drop("price", axis=1)
y = data_orig[["price"]]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Numerical attributes pipeline
num_pipeline = Pipeline([("scaler", StandardScaler())])

# Categorical attributes pipeline
cat_pipeline = Pipeline([("encoder", OneHotEncoder(handle_unknown="ignore"))])

# Complete pipeline
full_pipeline = ColumnTransformer([
    ("cat", cat_pipeline, ["model", "transmission", "fuelType"]),
    ("num", num_pipeline, ["year", "mileage", "tax", "mpg", "engineSize"]),
])
Let's build a LinearRegression model (X_train and X_test will be instances of scipy.sparse.csr.csr_matrix):
...
X_train = full_pipeline.fit_transform(X_train)
X_test = full_pipeline.transform(X_test)
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression().fit(X_train, y_train)
pred = lin_reg.predict(X_test)
r2_score(y_test, pred) # 0.896044623680753 OK
If I convert X_test and X_train to the numpy.ndarray, the predictions of the model are completely incorrect:
...
X_train = full_pipeline.fit_transform(X_train).toarray() # Here
X_test = full_pipeline.transform(X_test).toarray() # And here
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression().fit(X_train, y_train)
pred = lin_reg.predict(X_test)
r2_score(y_test, pred) # -7.919935999010152e+19 Something is wrong
I also tested DecisionTreeRegressor, RandomForestRegressor and SVR but the problem occurs only with LinearRegression.
In the source code you can see that if the input is a sparse matrix, LinearRegression does some centering and then calls the sparse least-squares solver (scipy's lsqr); if the array is dense, it calls the standard dense least-squares routine.
However, the larger issue with this example is that before you perform one-hot encoding, you should check whether any of the categorical levels have only one entry:
data_orig.select_dtypes(['object']).apply(lambda x:min(pd.Series.value_counts(x)))
model 1
transmission 2708
fuelType 28
And if we check model:
data_orig['model'].value_counts().tail()
SQ7 8
S8 4
S5 3
RS7 1
A2 1
So if RS7 and A2 are in your test set but not in your training set, then their coefficients will be total rubbish because all of their values are zeros. If we try another seed to split the data, you can see that both fits are similar:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=22)
A function to fit using sparse/dense data differently:
import matplotlib.pyplot as plt

def fit(X_tr, X_ts, y_tr, y_ts, is_sparse=True):
    data_train = full_pipeline.fit_transform(X_tr)
    data_test = full_pipeline.transform(X_ts)
    if not is_sparse:
        data_train = data_train.toarray()
        data_test = data_test.toarray()
    lin_reg = LinearRegression().fit(data_train, y_tr)
    pred = lin_reg.predict(data_test)
    plt.scatter(y_ts, pred)
    return {'r2_train': r2_score(y_tr, lin_reg.predict(data_train)),
            'r2_test': r2_score(y_ts, pred),
            'pred': pred}
We can see the r2 for training and test:
sparse_pred = fit(X_train,X_test,y_train,y_test,is_sparse=True)
[sparse_pred['r2_train'],sparse_pred['r2_test']]
[0.8896333645670668, 0.898030271986993]
dense_pred = fit(X_train,X_test,y_train,y_test,is_sparse=False)
[dense_pred['r2_train'],dense_pred['r2_test']]
[0.8896302753422759, 0.8980115229388697]
You can try the above with the seed (42) in your example and you will see the r^2 for training is similar. It's the prediction that goes haywire.
So if you use a sparse matrix, the least-squares solver will most likely return a less nonsensical coefficient for an all-zero column (most likely what @piterbarg was pointing to).
Still, I think what makes sense is to check the data for such missing factors between test and train before plugging it into the pipeline; a sketch of such a check follows below. For this dataset it is most likely not over-determined, so sparse vs. dense alone should not be the difference.
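A minimal sketch of that check (assuming the raw, untransformed X_train/X_test from the split above):
# List categorical levels that occur in the test split but never in the training split
for col in ["model", "transmission", "fuelType"]:
    unseen = set(X_test[col]) - set(X_train[col])
    if unseen:
        print(f"{col}: levels only in the test set -> {unseen}")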
Linear regression involves numerical inversion of certain matrices or, more precisely, solving certain linear equations, see here. In some cases these matrices are either singular or nearly so ("poorly conditioned"), which is often the case when the "features" of your model are either linearly dependent or nearly so.
In this case straightforward inversion can lead to catastrophic error accumulation, with the resulting inversion/solution basically blowing up. This is more prevalent for a larger number of dimensions, as you have here (36, as far as I can tell).
When a matrix is nearly singular, its inversion depends strongly on subtle differences in representations of numbers involved (their base), roundoff errors, the precise order of calculations, etc. In your case, I believe, this is the reason why the two answers are dramatically different. It seems the calculation path involving sparse-matrix representation can handle nearly-singular matrices somewhat better than the numpy representation. Why is it so I cannot say for sure beyond what I have written above.
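A quick way to check this claim is to look at the condition number of the design matrix (a sketch, assuming X_train here is the dense array from the failing case):
import numpy as np
# Ratio of the largest to the smallest singular value of the design matrix;
# a huge value (say 1e10 or more) signals a nearly singular, poorly conditioned problem.
print(np.linalg.cond(X_train))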
It is easy enough to test my assertion by regularizing the problem a bit. For example, one can use Tikhonov regularization, aka Ridge regression, instead of plain linear regression. Tikhonov regularization adds a (small) term to the matrix that makes nearly singular matrices behave much better under inversion.
It is a one-line change in your code:
from sklearn.linear_model import Ridge
lin_reg = Ridge(alpha=1e-4).fit(X_train, y_train)
Now both cases should produce (nearly) identical answers. Note the alpha parameter, which specifies the strength of the regularization, and the small value I used there.
I trained a model using logistic regression to predict whether a name field and description field belong to the profile of a male, female, or brand. My train accuracy is around 99% while my test accuracy is around 83%. I have tried implementing regularization by tuning the C parameter, but the improvements were barely noticeable. I have around 5,000 examples in my training set. Is this an instance where I just need more data, or is there something else I can do in scikit-learn to get my test accuracy higher?
Overfitting is a multifaceted problem. It could be your train/test/validate split (anything from 50/40/10 to 90/9/1 could change things). You might need to shuffle your input. Try an ensemble method, or reduce the number of features. You might have outliers throwing things off.
Then again, it could be none of these, or all of these, or some combination of these.
For starters, try to plot the test score as a function of the test split size and see what you get; a sketch is below.
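A rough sketch of that last suggestion (assuming X and y are your already vectorized features and labels; the split sizes are just examples):
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

sizes = [0.1, 0.2, 0.3, 0.4, 0.5]
scores = []
for size in sizes:
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=size, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    scores.append(clf.score(X_te, y_te))

plt.plot(sizes, scores, marker="o")
plt.xlabel("test split size")
plt.ylabel("test accuracy")
plt.show()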
# The 'C' value in Logistic Regression works very similarly to the one in the Support
# Vector Machine (SVM) algorithm. When I use SVM I like to use GridSearch
# to find the best possible values for 'C' and 'gamma';
# maybe this can give you some light:

# For SVC you could also search gamma and kernel; remove those keys
# if you only care about 'C':
# param_grid = {'C': [0.1, 1, 10, 100, 1000],
#               'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
#               'kernel': ['rbf']}
param_grid = {'C': [0.1, 1, 10, 100, 1000]}
import numpy as np
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report, confusion_matrix
# Train and fit your model to see initial values
X_train, X_test, y_train, y_test = train_test_split(df_feat, np.ravel(df_target), test_size=0.30, random_state=101)
model = SVC()
model.fit(X_train,y_train)
predictions = model.predict(X_test)
print(confusion_matrix(y_test,predictions))
print(classification_report(y_test,predictions))
# Find the best 'C' value
grid = GridSearchCV(SVC(), param_grid, refit=True, verbose=3)
grid.fit(X_train, y_train)  # the grid must be fitted before best_params_ is available
grid.best_params_
c_val = grid.best_estimator_.C
# Then you can re-run predictions on this grid object just like you would with a normal model
grid_predictions = grid.predict(X_test)
print(confusion_matrix(y_test, grid_predictions))
print(classification_report(y_test, grid_predictions))

# Use the best 'C' value found by GridSearch and reload your LogisticRegression module
logmodel = LogisticRegression(C=c_val)
logmodel.fit(X_train, y_train)
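Alternatively, a minimal sketch of searching C on the logistic regression model itself, rather than transferring the value found for an SVC (same X_train/y_train as above; the grid values are just examples):
log_grid = GridSearchCV(LogisticRegression(max_iter=1000),
                        {'C': [0.01, 0.1, 1, 10, 100]},
                        cv=5)
log_grid.fit(X_train, y_train)
print(log_grid.best_params_, log_grid.best_score_)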
I use scikit-learn's SVM like so:
clf = svm.SVC()
clf.fit(td_X, td_y)
My question is: when I use the classifier to predict the class of a member of the training set, could the classifier ever be wrong, even in scikit-learn's implementation? (e.g. clf.predict(td_X[a]) == td_y[a])
Yes definitely, run this code for example:
from sklearn import svm
import numpy as np
clf = svm.SVC()
np.random.seed(seed=42)
x=np.random.normal(loc=0.0, scale=1.0, size=[100,2])
y=np.random.randint(2,size=100)
clf.fit(x,y)
print(clf.score(x,y))
The score is 0.61, so nearly 40% of the training data is misclassified. Part of the reason is that even though the default kernel is 'rbf' (which in theory should be able to classify perfectly any training data set, as long as you don't have two identical training points with different labels), there is also regularization to reduce overfitting. The default regularizer is C=1.0.
If you run the same code as above but switch clf = svm.SVC() to clf = svm.SVC(C=200000), you'll get an accuracy of 0.94.
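For completeness, the weaker-regularization variant mentioned above (same toy data as before):
clf = svm.SVC(C=200000)
clf.fit(x, y)
print(clf.score(x, y))  # about 0.94 on the training set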
I'm trying to create a regression model that predicts an author's age. I'm using (Nguyen et al., 2011) as my basis.
Using a bag-of-words model, I count the occurrences of words per document (which are posts from boards) and create the vector for every post.
I limit the size of each vector by using as features the top-k (k = some number) most frequently used words (stopwords are not used).
Vectorexample_with_k_8 = [0,0,0,1,0,3,0,0]
My data is generally sparse like in the Example.
When I test the model on my test data I get a very low r² score (0.00-0.1), sometimes even a negative score. The model always predicts the same age, which happens to be the average age of my dataset, as can be seen in the age distribution of my data (age vs. count).
I used different regression models: LinearRegression, Lasso, and SGDRegressor from scikit-learn, with no improvement.
So the questions are:
1. How do I improve the r² score?
2. Do I have to change the data to fit the regression better? If yes, with what method?
3. Which regressor/methods should I use for text classification?
To my knowledge, bag-of-words models usually use Naive Bayes as the classifier to fit the document-by-term sparse matrix.
None of your regressors can handle a large sparse matrix well. Lasso may work well if you have groups of highly correlated features.
I think for your problem, Latent Semantic Analysis may provide better results: essentially, use TfidfVectorizer to normalize the word count matrix, then use TruncatedSVD to reduce the dimensionality and retain the first N components which capture the major variance. Most regressors should work well with the matrix in this lower dimension. In my experience SVM works pretty well for this problem.
Here I show an example script:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn import svm
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search is long deprecated

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('svd', TruncatedSVD()),
    ('clf', svm.SVR())
])

# You can tune hyperparameters using grid search
params = {
    'tfidf__max_df': (0.5, 0.75, 1.0),
    'tfidf__ngram_range': ((1, 1), (1, 2)),
    'svd__n_components': (50, 100, 150, 200),
    'clf__C': (0.1, 1, 10),
}
grid_search = GridSearchCV(pipeline, params, scoring='r2',
                           n_jobs=-1, verbose=10)

# Fit your documents (should be a list/array of strings)
grid_search.fit(documents, y)

print("Best score: %0.3f" % grid_search.best_score_)
print("Best parameters set:")
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(params.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))