Validation but not cross-validation in Python sklearn Lasso

I need to train a LASSO model using sklearn. I am given a pair of specifically designed training and validation datasets.
The goal is to let the algorithm autogenerate a sequence of alphas (the L1 penalty strength), and for each alpha, fit a model with the training data, and then evaluate the model on the validation data. Finally, select the model that performs the best on the validation data.
What is the most efficient way to achieve this?
I attempted sklearn.linear_model.LassoCV() by concatenating the training and validation data and forcing a single-split "CV" by supplying an iterator to the cv argument, but fit() ultimately refits on the optimized alpha using the entire merged data to produce the final model. I could of course take the optimized alpha and call sklearn.linear_model.Lasso() again myself, but this seems too troublesome:
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.datasets import make_regression
X, y = make_regression(noise = 4, random_state = 0)
Nrow, Ncol = len(X), len(X[0])
Ntrain = int(np.round(Nrow * 0.7))
Nvalid = Nrow - Ntrain
trainInd = np.asarray([i for i in range(Ntrain)])
validInd = np.asarray([i for i in range(Ntrain, Nrow)])
trainValidInd = [(trainInd, validInd)]
cvIter = iter(trainValidInd)
reg = LassoCV(cv = cvIter, verbose = True).fit(X, y)
'''
But .fit() will use the optimized alpha and the entire merged data to
train the model.
'''
I also attempted sklearn.linear_model.lasso_path(), but how do I apply its output to a new dataset (the validation set) to make predictions? It also doesn't return the intercept term; how can I recover it?
Thanks!
Came up with a "smart" workaround:
sampleW = np.asarray([1.0 for i in range(Ntrain)] + \
[1e-200 for i in range(Nvalid)])
reg = LassoCV(cv = cvIter, verbose = True).fit(X, y, sampleW)
By lowering the sample weights on the validation portion to almost 0, the validation data is effectively excluded from training. Tests confirm this works, but it looks ridiculous. It shouldn't be this hard to achieve what I need.
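For reference, the brute-force alternative I was trying to avoid would be to loop over an alpha grid manually, fitting on the training rows only and scoring on the validation rows. A rough sketch (the alpha grid here is arbitrary), reusing the indices defined above:
from sklearn.linear_model import Lasso
alphas = np.logspace(-3, 1, 100)  # arbitrary grid of L1 strengths
best_alpha, best_mse = None, np.inf
for a in alphas:
    m = Lasso(alpha=a).fit(X[trainInd], y[trainInd])
    mse = np.mean((m.predict(X[validInd]) - y[validInd]) ** 2)
    if mse < best_mse:
        best_alpha, best_mse = a, mse
final_model = Lasso(alpha=best_alpha).fit(X[trainInd], y[trainInd])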

This may be too basic for what you're looking for, but I would focus on the problem that you've already identified: finding the optimal alpha value. The first thing that comes to mind is to use a scipy optimizer, something like this:
import numpy as np
from scipy.optimize import minimize_scalar
from sklearn.linear_model import Lasso
def cost(alpha):
    model = Lasso(alpha=alpha)
    model.fit(training_X, training_y)
    return np.linalg.norm(model.predict(validation_X) - validation_y)
res = minimize_scalar(cost)
print('Optimal alpha', res.x, 'yields error', res.fun)
Since you're trying to find the best lasso as a function of the alpha value alone, you only need to minimize a scalar-input, scalar-output cost function (see the minimize_scalar docs).
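One caveat (my addition, not part of the original suggestion): an unconstrained minimize_scalar may probe negative alphas, which are not meaningful for Lasso, so the bounded variant can be safer; the bounds here are just an example:
res = minimize_scalar(cost, bounds=(1e-6, 10.0), method='bounded')
print('Optimal alpha', res.x, 'yields error', res.fun)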

sklearn to pmml pipeline: how to apply a postprocessing linear transformation

I'm having a tough time trying to apply a postprocessing step with the sklearn2pmml package. What I'm trying to do is apply a linear transformation after the predict_proba method within the PMMLPipeline class of the sklearn2pmml package. Any idea how to do this?
Even a solution outside this package but automatable would help me (like modifying automatically the XML from the PMML).
Here's an example so you can get a deeper understanding of what I'm trying to do:
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
import pandas as pd
import numpy as np
from sklearn2pmml.pipeline import PMMLPipeline
from sklearn2pmml import make_pmml_pipeline, sklearn2pmml
# FORGET ABOUT TRAIN TEST SPLIT; we only care if the PMML pipeline works for now
BIRTHDAY_SEED = 1995
nrows, cols = 1000, 5
X, y = make_classification(n_samples=nrows, n_features=cols, n_informative=2, n_redundant=3, n_classes=2, shuffle=True, random_state=BIRTHDAY_SEED)
X, y = pd.DataFrame(X), pd.Series(y)
model = DecisionTreeClassifier()
model.fit(X,y)
def postprocessing_linear_transformation(probabilities, a, b):
    """Multiply probabilities by a and add b."""
    return probabilities * a + b
# the pipeline should look like this
# first predict probabilities
probabilities = model.predict_proba(X)[:, 0]
# then scale them (apply linear transformation)
probabilities_scaled = postprocessing_linear_transformation(probabilities, a=1000, b=100)
# of course it does not work,
pmml_pipeline = PMMLPipeline([
    # here we should place the category preprocessor; I know it does not work, but you can get the idea
    ('decisiontree', model),
    ('postprocessing_apply_linear_transformation', postprocessing_linear_transformation)
])
sklearn2pmml(pmml_pipeline, "example_pipeline_pmml.pmml", with_repr = True)
On second thought, you don't need a full-blown LinearRegression step to perform a deterministic a * x + b probability scaling operation. A simple ExpressionTransformer step is more than adequate:
from sklearn2pmml.preprocessing import ExpressionTransformer
pipeline = PMMLPipeline([
    ("decisiontree", model)
], predict_proba_transformer=ExpressionTransformer("X[0] * 1000 + 100"))
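For completeness, a short sketch of how this pipeline could then be fitted and exported, reusing the sklearn2pmml call and file name from the question (refitting the already-trained tree inside the pipeline is harmless here):
pipeline.fit(X, y)
sklearn2pmml(pipeline, "example_pipeline_pmml.pmml", with_repr=True)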
I'm having a tough time trying to apply a postprocessing step with the sklearn2pmml packages.
Don't blame the SkLearn2PMML package for your troubles. It is the Scikit-Learn framework that prohibits you from inserting two estimator objects into a single Pipeline object.
In the current case, you should rephrase your problem. What you're really trying to do is build a "chain of two models" (the first model feeding into the second model). The SkLearn2PMML package provides a sklearn2pmml.ensemble.EstimatorChain estimator type, which allows you to accomplish exactly that.

Multiclass classification using Gaussian Mixture Models with scikit learn

I am trying to use sklearn.mixture.GaussianMixture for classification of pixels in a hyperspectral image. There are 15 classes (1-15). I tried the approach from http://scikit-learn.org/stable/auto_examples/mixture/plot_gmm_covariances.html, in which the means are initialized with means_init. I tried this as well, but my accuracy is poor (about 10%). I also tried changing the covariance type, the threshold, the maximum number of iterations, and the number of initializations, but the results are the same.
Am I doing this correctly? Please provide inputs.
import numpy as np
from sklearn.mixture import GaussianMixture
import scipy.io as sio
from sklearn.model_selection import train_test_split
uh_data =sio.loadmat('/Net/hico/data/users/nikhil/contest_uh_casi.mat')
data = uh_data['contest_uh_casi']
uh_labels = sio.loadmat('/Net/hico/data/users/nikhil/contest_gt_tr.mat')
labels = uh_labels['contest_gt_tr']
reshaped_data = np.reshape(data, (data.shape[0] * data.shape[1], data.shape[2]))
print('reshaped data :', reshaped_data.shape)
reshaped_label = np.reshape(labels, (labels.shape[0] * labels.shape[1], -1))
print('reshaped label :', reshaped_label.shape)
con_data = np.hstack((reshaped_data, reshaped_label))
pre_data = con_data[con_data[:, 144] > 0]
total_data = pre_data[:, 0:144]
total_label = pre_data[:, 144]
train_data, test_data, train_label, test_label = train_test_split(total_data, total_label, test_size=0.30, random_state=42)
classifier = GaussianMixture(n_components=15, covariance_type='diag', max_iter=100, random_state=42, tol=0.1, n_init=1)
classifier.means_init = np.array([train_data[train_label == i].mean(axis=0)
                                  for i in range(1, 16)])
classifier.fit(train_data)
pred_lab_train = classifier.predict(train_data)
train_accuracy = np.mean(pred_lab_train.ravel() == train_label.ravel()) * 100
print('train accuracy:', train_accuracy)
pred_lab_test = classifier.predict(test_data)
test_accuracy = np.mean(pred_lab_test.ravel() == test_label.ravel()) * 100
print('test accuracy:', test_accuracy)
My data has 66485 pixels with 144 features each. I also tried applying some feature-reduction techniques like PCA, LDA, KPCA, etc. first, but the results are still the same.
Gaussian mixture is not a classifier. It is a density-estimation method, and expecting its components to magically align with your classes is not a good idea. You should try actual supervised techniques, since you clearly have access to labels. Scikit-learn offers lots of these, including Random Forest, KNN, SVM, ... pick your favourite. GMM simply tries to fit a mixture of Gaussians to your data, but nothing forces it to place the components according to the labeling (which is not even provided in the fit call). From time to time this will work, but only for trivial problems where the classes are so well separated that even Naive Bayes would work; in general, it is simply an invalid tool for the problem.
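For example, a minimal supervised baseline on the same arrays (assuming train_data, train_label, test_data, test_label from the question; the forest size is an arbitrary choice):
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(train_data, train_label)
print('test accuracy:', rf.score(test_data, test_label) * 100)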
GMM is not a classifier but a generative model. You can use it for a classification problem by applying Bayes' theorem. It is not true that classification based on GMMs works only for trivial problems; however, since it is based on a mixture of Gaussian components, it works best on problems with high-level features.
Your code incorrectly uses GMM as a classifier. You should use GMM to model the class-conditional distributions, one GMM per class.
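A rough sketch of that idea: fit one GaussianMixture per class and classify each pixel by the highest log-prior plus log-likelihood. Variable names follow the question; the number of mixture components per class is an arbitrary assumption:
import numpy as np
from sklearn.mixture import GaussianMixture

classes = np.unique(train_label)
models, log_priors = [], []
for c in classes:
    Xc = train_data[train_label == c]
    gmm = GaussianMixture(n_components=3, covariance_type='diag', random_state=42)  # 3 components per class: assumption
    gmm.fit(Xc)
    models.append(gmm)
    log_priors.append(np.log(len(Xc) / len(train_data)))

# log-posterior (up to a constant) = log prior + log likelihood; pick the argmax class
log_post = np.column_stack([lp + m.score_samples(test_data) for m, lp in zip(models, log_priors)])
pred = classes[np.argmax(log_post, axis=1)]
print('test accuracy:', np.mean(pred == test_label) * 100)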

Feature selection with sklearn - ValueError: X has a different shape than during fitting

:) Very sorry in advance if my code looks like something a total newbie would write. Below is a portion of my Python code. I am fiddling with sklearn and machine-learning techniques.
I trained several Naive Bayes models based on different datasets and stored them in trained_models.
Prior to this step I created an object chi_squared of the SelectPercentile class, using the chi2 function, for feature selection. From my understanding, I should write data_feature_reduced = chi_squared.transform(some_data) and then use data_feature_reduced at training time, i.e. nb.fit(data_feature_reduced, data.target).
This is what I did, and I stored the resulting nb objects (and some other information) in the list trained_models.
I am now attempting to apply these models to a different set of data (actually from the same source, if that matters to the question):
for name, model, intra_result, dev, training_data, chi_squarer in trained_models:
    cross_results = []
    new_vect = StemmedVectorizer(ngram_range=(1, 4), stop_words='english', max_df=0.90, min_df=2)
    for data in demframes:
        data_name = data[0]
        X_test_data = new_vect.fit_transform(data[1].values.astype('U'))
        Y_test_data = data[2]
        chi_squared_test_data = chi_squarer.transform(X_test_data)
        final_results.append((name, "applied to", data[0], model.score(X_test_data, Y_test_data)))
I have to admit that I am a bit of stranger to the feature selection part.
Here is the error that i get :
ValueError: X has a different shape than during fitting.
at line chi_squared_test_data = chi_squarer.transform(X_test_data)
I am assuming I am doing feature selection in an incorrect manner. Where did I go wrong?
Thanks to everyone for their help!
I will just paste the comment from @Vivek-Kumar that helped me solve my problem:
This error is due to the line new_vect.fit_transform(). Like your trained models, you should use the same StemmedVectorizer that was used at training time.
The same StemmedVectorizer object will transform X_test_data to the same shape it had during training. Currently, you are creating a different object and fitting on it (fit_transform is fit plus transform), hence the shape is different. Hence the error.
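A minimal illustration of the fix (my sketch, assuming the vectorizer fitted at training time is kept alongside each model, e.g. as trained_vect): call transform, not fit_transform, on the new data, so the feature space stays identical to the one the selector and model were fitted on.
# prediction time: reuse the fitted objects and only transform
X_test_data = trained_vect.transform(data[1].values.astype('U'))
chi_squared_test_data = chi_squarer.transform(X_test_data)
score = model.score(chi_squared_test_data, Y_test_data)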
Why not use a pipeline to make it simple? That way you don't have to transform twice and take care of the shapes.
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
chi_squarer = SelectKBest(chi2, k=100) # change accordingly
lr = LogisticRegression() # or naive bayes
clf = Pipeline([('chi_sq', chi_squarer), ('model', lr)])
# for training:
clf.fit(training_data, targets)
# for predictions:
clf.predict(test_data)
You can also add the new_vect (the vectorizer) to the pipeline.
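A rough sketch of that, using a plain TfidfVectorizer as a stand-in for the question's custom StemmedVectorizer and placeholder names for the raw text and targets:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB

clf = Pipeline([
    ('vect', TfidfVectorizer(ngram_range=(1, 4), stop_words='english', max_df=0.90, min_df=2)),
    ('chi_sq', SelectKBest(chi2, k=100)),
    ('model', MultinomialNB()),
])
clf.fit(train_texts, train_targets)  # raw text goes in; vectorizing and selection happen inside
clf.predict(test_texts)              # the fitted vectorizer and selector are reused automatically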

scikit-learn classification on soft labels

According to the documentation it is possible to specify different loss functions for SGDClassifier. And as far as I understand, log loss is a cross-entropy loss function which theoretically can handle soft labels, i.e. labels given as probabilities in [0, 1].
The question is: is it possible to use SGDClassifier with log loss function out the box for classification problems with soft labels? And if not - how this task (linear classification on soft labels) can be solved using scikit-learn?
UPDATE:
Because of the way the target is labeled and the nature of the problem, hard labels don't give good results. But it is still a classification problem (not regression) and I want to keep a probabilistic interpretation of the prediction, so regression doesn't work out of the box either. A cross-entropy loss function can handle soft labels in the target naturally. It seems that all loss functions for linear classifiers in scikit-learn can only handle hard labels.
So the question is probably:
How can I specify my own loss function for SGDClassifier, for example? It seems scikit-learn doesn't stick to the modular approach here, and changes would need to be made somewhere inside its sources.
I recently had this problem and came up with a nice fix that seems to work.
Basically, transform your targets to log-odds-ratio space using the inverse sigmoid function. Then fit a linear regression. Then, to do inference, take the sigmoid of the predictions from the linear regression model.
So say we have soft targets/labels y ∈ (0, 1) (make sure to clamp the targets to say [1e-8, 1 - 1e-8] to avoid instability issues when we take logs).
We take the inverse sigmoid, then we fit a linear regression (assuming predictor variables are in matrix X):
y = np.clip(y, 1e-8, 1 - 1e-8) # numerical stability
inv_sig_y = np.log(y / (1 - y)) # transform to log-odds-ratio space
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(X, inv_sig_y)
Then to make predictions:
def sigmoid(x):
    ex = np.exp(x)
    return ex / (1 + ex)
preds = sigmoid(lr.predict(X_new))
This seems to work, at least for my use case. My guess is that it's not far off what happens behind the scenes for LogisticRegression anyway.
Bonus: this also seems to work with other regression models in sklearn, e.g. RandomForestRegressor.
According to the docs,
The ‘log’ loss gives logistic regression, a probabilistic classifier.
In general a loss function has the form Loss(prediction, target), where prediction is the model's output and target is the ground-truth value. In the case of logistic regression, the prediction is a value in (0, 1) (i.e. a "soft label"), while the target is 0 or 1 (i.e. a "hard label").
So in answer to your question, it depends on whether you are referring to the prediction or the target. Generally speaking, the form of the labels ("hard" or "soft") is determined by the algorithm chosen for the prediction and by the data on hand for the target.
If your data has "hard" labels, and you desire a "soft" label output by your model (which can be thresholded to give a "hard" label), then yes, logistic regression is in this category.
If your data has "soft" labels, then you would have to choose a threshold to convert them to "hard" labels before using typical classification methods (i.e., logistic regression). Otherwise, you could use a regression method where the model is fit to predict the "soft" target. In this latter approach, your model could give values outside of (0,1), and this would have to be handled.
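A tiny sketch of those two options (hypothetical names X, soft_y, X_new; the 0.5 threshold is an arbitrary choice):
import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression

# Option 1: threshold the soft labels into hard ones, then classify
hard_y = (soft_y >= 0.5).astype(int)
clf = LogisticRegression().fit(X, hard_y)
p_clf = clf.predict_proba(X_new)[:, 1]          # soft output in (0, 1)

# Option 2: regress directly on the soft labels, then clip the output
reg = LinearRegression().fit(X, soft_y)
p_reg = np.clip(reg.predict(X_new), 0.0, 1.0)   # plain regression can go outside (0, 1)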
For those interested, I've implemented a custom class that behaves like a normal classifier but takes any regressor in the constructor to perform the transformation suggested by @nlml:
from sklearn.base import BaseEstimator, ClassifierMixin, clone
from sklearn.utils.validation import check_array
from scipy.special import softmax
import numpy as np

def _log_odds_ratio_scale(X):
    X = np.clip(X, 1e-8, 1 - 1e-8)  # numerical stability
    X = np.log(X / (1 - X))         # transform to log-odds-ratio space
    return X

class FuzzyTargetClassifier(ClassifierMixin, BaseEstimator):

    def __init__(self, regressor):
        '''
        Fits a regressor in the log-odds-ratio space (inverse cross-entropy) of the target variable.
        During prediction, rescales back to probability space with the softmax function.

        Parameters
        ----------
        regressor : sklearn regressor
            Base regressor to fit in log-odds-ratio space. Any valid sklearn regressor can be used here.
        '''
        self.regressor = regressor
        return

    def fit(self, X, y=None, **kwargs):
        # ensure the passed y is one-hot-encoded-like (one column of soft probabilities per class)
        y = check_array(y, accept_sparse=True, dtype='numeric', ensure_min_features=1)
        self.regressors_ = [clone(self.regressor) for _ in range(y.shape[1])]
        for i in range(y.shape[1]):
            self._fit_single_regressor(self.regressors_[i], X, y[:, i], **kwargs)
        return self

    def _fit_single_regressor(self, regressor, X, ysub, **kwargs):
        ysub = _log_odds_ratio_scale(ysub)
        regressor.fit(X, ysub, **kwargs)
        return regressor

    def decision_function(self, X):
        all_results = []
        for reg in self.regressors_:
            results = reg.predict(X)
            if results.ndim < 2:
                results = results.reshape(-1, 1)
            all_results.append(results)
        results = np.hstack(all_results)
        return results

    def predict_proba(self, X):
        results = self.decision_function(X)
        results = softmax(results, axis=1)
        return results

    def predict(self, X):
        results = self.decision_function(X)
        results = results.argmax(1)
        return results
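A short usage sketch (my example, not from the original answer), using a Ridge regressor and soft labels given as per-class probabilities:
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.datasets import make_classification

X, y_hard = make_classification(n_samples=200, n_classes=3, n_informative=5, random_state=0)
# fake soft labels: a smoothed one-hot encoding of the hard labels
soft_y = np.full((len(y_hard), 3), 0.05)
soft_y[np.arange(len(y_hard)), y_hard] = 0.9

clf = FuzzyTargetClassifier(Ridge(alpha=1.0))
clf.fit(X, soft_y)
print(clf.predict_proba(X[:5]))  # rows sum to 1 via softmax
print(clf.predict(X[:5]))        # argmax class indices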

Using ranking data in Logistic Regression

I will be putting the max bounty on this as I am struggling to learn these concepts! I am trying to use some ranking data in a logistic regression. I want to use machine learning to make a simple classifier as to whether a webpage is "good" or not. It's just a learning exercise so I don't expect great results; just hoping to learn the "process" and coding techniques.
I have put my data in a .csv as follows :
URL WebsiteText AlexaRank GooglePageRank
In my Test CSV we have :
URL WebsiteText AlexaRank GooglePageRank Label
Label is a binary classification indicating "good" with 1 or "bad" with 0.
I currently have my LR running using only the website text, on which I run TF-IDF.
I have two questions which I need help with. I'll be putting a max bounty on this question and awarding it to the best answer, as this is something I'd like some good help with so I, and others, may learn.
How can I normalize my ranking data for AlexaRank? I have a set of 10,000 webpages for which I have the Alexa rank of all of them; however, they aren't ranked 1-10,000. They are ranked against the entire Internet, so while http://www.google.com may be ranked #1, http://www.notasite.com may be ranked #83904803289480. How do I normalize this in scikit-learn in order to get the best possible results from my data?
I am running my Logistic Regression in this way; I am nearly sure I have done this incorrectly. I am trying to run TF-IDF on the website text, then add the two other relevant columns and fit the Logistic Regression. I'd appreciate it if someone could quickly verify that I am taking in the three columns I want to use in my LR correctly. Any and all feedback on how I can improve would also be appreciated here.
loadData = lambda f: np.genfromtxt(open(f, 'r'), delimiter=' ')
print("loading data..")
traindata = list(np.array(p.read_table('train.tsv'))[:, 2])  # reading WebsiteText column for TF-IDF
testdata = list(np.array(p.read_table('test.tsv'))[:, 2])
y = np.array(p.read_table('train.tsv'))[:, -1]  # reading label
tfv = TfidfVectorizer(min_df=3, max_features=None, strip_accents='unicode', analyzer='word',
                      token_pattern=r'\w{1,}', ngram_range=(1, 2), use_idf=1, smooth_idf=1, sublinear_tf=1)
rd = lm.LogisticRegression(penalty='l2', dual=True, tol=0.0001, C=1, fit_intercept=True,
                           intercept_scaling=1.0, class_weight=None, random_state=None)
X_all = traindata + testdata
lentrain = len(traindata)
print("fitting pipeline")
tfv.fit(X_all)
print("transforming data")
X_all = tfv.transform(X_all)
X = X_all[:lentrain]
X_test = X_all[lentrain:]
print("20 Fold CV Score: ", np.mean(cross_validation.cross_val_score(rd, X, y, cv=20, scoring='roc_auc')))
# Add two integer columns
AlexaAndGoogleTrainData = list(np.array(p.read_table('train.tsv'))[2:, 3])  # not sure if I am doing this correctly; expecting it to contain the AlexaRank and GooglePageRank columns
AlexaAndGoogleTestData = list(np.array(p.read_table('test.tsv'))[2:, 3])
AllAlexaAndGoogleInfo = AlexaAndGoogleTestData + AlexaAndGoogleTrainData
# Add two columns to X.
X = np.append(X, AllAlexaAndGoogleInfo, 1)  # think I have done this incorrectly
print("training on full data")
rd.fit(X, y)
pred = rd.predict_proba(X_test)[:, 1]
testfile = p.read_csv('test.tsv', sep="\t", na_values=['?'], index_col=1)
pred_df = p.DataFrame(pred, index=testfile.index, columns=['label'])
pred_df.to_csv('benchmark.csv')
print("submission file created..")
Thank you very much for all feedback - please post if you need any further information!
I guess sklearn.preprocessing.StandardScaler would be the first thing you want to try. StandardScaler transforms all of your features into Mean-0-Std-1 features.
This definitely gets rid of your first problem. AlexaRank will be guaranteed to be spread around 0 and bounded. (Yes, even massive AlexaRank values like 83904803289480 are transformed into small floating-point numbers.) Of course, the results will not be integers between 1 and 10000, but they will maintain the same order as the original ranks. And in this case, keeping the ranks bounded and normalized will help solve your second problem, as follows.
In order to understand why normalization helps in LR, let's revisit the logit formulation of LR:
log(p / (1 - p)) = b0 + b1*X1 + b2*X2 + b3*X3 + b4*X4 + b5*X5
In your case, X1, X2, X3 are three TF-IDF features and X4, X5 are the Alexa/Google rank related features. The linear form of the equation suggests that each coefficient represents the change in the logit of y for a one-unit change in the corresponding variable. Think about what happens when X4 is kept fixed at a massive rank value, say 83904803289480. In that case, the Alexa Rank variable dominates your LR fit and a small change in a TF-IDF value has almost no effect on the fit. One might think that the coefficients should be able to adjust to small/large values to account for the differences between these features, but not in this case: it is not only the magnitude of the variables that matters but also their range. Alexa Rank definitely has a large range and will dominate your LR fit here. Therefore, I guess normalizing all variables using StandardScaler to adjust their ranges will improve the fit.
Here is how you can scale the X matrix.
sc = preprocessing.StandardScaler().fit(X)
X = sc.transform(X)
Don't forget to use the same scaler to transform X_test.
X_test = sc.transform(X_test)
Now you can use the fitting procedure etc.
rd.fit(X, y)
rd.predict_proba(X_test)
Check this out for more on sklearn preprocessing: http://scikit-learn.org/stable/modules/preprocessing.html
Edit: The parsing and column-merging part can easily be done using pandas, i.e. there is no need to convert the matrices into lists and then append them. Moreover, pandas DataFrames can be directly indexed by their column names.
AlexaAndGoogleTrainData = p.read_table('train.tsv', header=0)[["AlexaRank", "GooglePageRank"]]
AlexaAndGoogleTestData = p.read_table('test.tsv', header=0)[["AlexaRank", "GooglePageRank"]]
AllAlexaAndGoogleInfo = AlexaAndGoogleTestData.append(AlexaAndGoogleTrainData)
Note that we pass the header=0 argument to read_table to keep the original header names from the tsv file, and note how we can index using the entire set of columns. Finally, you can stack this new matrix with X using numpy.hstack.
X = np.hstack((X, AllAlexaAndGoogleInfo))
hstack horizontally combines two multi-dimensional array-like structures, provided their lengths are the same.
Regarding normalizing the numeric ranks, either scikit-learn's StandardScaler or a logarithmic transform (or both) should work well enough.
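For instance, a rough sketch of the log-transform option on the two rank columns (column names follow the question's CSV; train is the pandas DataFrame loaded later in this answer):
import numpy as np

# compress the huge dynamic range of the ranks before feeding them to the scaler/model
train["AlexaRank"] = np.log1p(train["AlexaRank"])
train["GooglePageRank"] = np.log1p(train["GooglePageRank"])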
For building up a working pipeline, I find my sanity greatly benefits from using the pandas package and the sklearn.pipeline utilities. Here is a simple script that should do what you need.
First, a couple of utility classes I always seem to need. It would be nice to have something like these in sklearn.pipeline or sklearn.utilities.
from sklearn import base

class Columns(base.TransformerMixin, base.BaseEstimator):
    def __init__(self, columns):
        super(Columns, self).__init__()
        self.columns_ = columns

    def fit(self, *args, **kwargs):
        return self

    def transform(self, X, *args, **kwargs):
        return X[self.columns_]

class Text(base.TransformerMixin, base.BaseEstimator):
    def fit(self, *args, **kwargs):
        return self

    def transform(self, X, *args, **kwargs):
        return X.apply("\t".join, axis=1, raw=False)
Now set up the pipeline.
I used the SGDClassifier implementation of logistic regression, since it tends to be more efficient for high-dimensional data like text. I also usually find that hinge loss gives better results than logistic regression anyway.
from sklearn import linear_model as lin
from sklearn import metrics
from sklearn.feature_extraction import text as txt
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn import preprocessing as prep
import numpy as np
from pandas.io import parsers
import pandas as pd

pipe = Pipeline([
    ('feat', FeatureUnion([
        ('txt', Pipeline([
            ('txtcols', Columns(["WebsiteText"])),
            ('totxt', Text()),
            ('vect', txt.TfidfVectorizer()),
        ])),
        ('num', Pipeline([
            ('numcols', Columns(["AlexaRank", "GooglePageRank"])),
            ('scale', prep.StandardScaler()),
        ])),
    ])),
    ('clf', lin.SGDClassifier(loss="log")),
])
Next train the model:
train = parsers.read_csv("train.csv")
pipe.fit(train, train.Label)
Finally evaluate on test data:
test = parsers.read_csv("test.csv")
tstlbl = np.array(test.Label)

print(pipe.score(test, tstlbl))

pred = pipe.predict(test)
print(metrics.confusion_matrix(tstlbl, pred))
print(metrics.classification_report(tstlbl, pred))
print(metrics.f1_score(tstlbl, pred))

prob = pipe.decision_function(test)
print(metrics.roc_auc_score(tstlbl, prob))
print(metrics.average_precision_score(tstlbl, prob))
You will probably not get very good results with everything at default settings like this, but it should give you a working baseline to build from. I can suggest some parameter settings that usually work for me if you like.
