Not enough Folds on CrossValidation from Scikit Learn - python

I am trying to create a prediction model with Scikit Learn in Python. I have a dataframe with about 850k rows and 17 columns. The last column is my label and the others are my features.
from sklearn import cross_validation
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
predictors = [a list of my predictors columns]
alg = RandomForestClassifier(random_state=1, n_estimators=150, min_samples_split=8, min_samples_leaf=4)
scores = cross_validation.cross_val_score(alg, train[predictors],train["Sales"], cv=5)
However, when I run the code, I have the following warning:
/Users/.../anaconda/lib/python2.7/site-packages/sklearn/ Warning: The least populated class in y has only 1 members, which is too few. The minimum number of labels for any class cannot be less than n_folds=5.
% (min_labels, self.n_folds)), Warning)
I am not sure if I understood the warning message. I thought I would have it only on small samples.

As suggested by David in the comments, it sounds like your output is continuous instead of categorical. If this is the case, you almost certainly want to be performing regression rather than classification.
The warning is stemming from the fact that (at least) one of the values in your targets, which are treated as categorical, is underrepresented. If you actually do want to perform classification, a good thing to do first would be to count the number of occurrences of each class in your entire training set.
When you do k-fold cross-validation and min_labels < k, at least one run of the cross-validation is guaranteed not to see any examples of the class with min_labels members, either at train or test time (it happens more frequently at test time because the test sets are smaller). If you have only 1 instance of a specific class, you are further guaranteed to get a training run that doesn't see any examples of that class (it lies in one of the folds, which will be used as the test set once).
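A minimal sketch of the counting step suggested above, assuming the labels are available as a pandas Series. The values here are made up and only stand in for train["Sales"]:

```python
import pandas as pd

# Hypothetical stand-in for train["Sales"]; the values are made up.
y = pd.Series([10, 10, 10, 10, 10, 20, 20, 20, 20, 20, 35])

n_folds = 5
class_counts = y.value_counts()

# Classes with fewer members than folds are the ones the warning complains about.
rare = class_counts[class_counts < n_folds]
print(class_counts)
print("classes with fewer than", n_folds, "members:", list(rare.index))
```

If most classes turn out to have only a handful of members, that is a strong hint the target is really continuous.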


How can I split my dataset into training and validation sets without using or splitting off a test set?

I know this is wrong for splitting training and validation sets, but you can see here what I really need. I want to use just a training set and a validation set. I don't need any test set.
#Data Split
from sklearn.model_selection import train_test_split
The test is the validation;
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0)
x_test and y_test are your validation set or test set; they are the same thing here. They form a small slice of the total x, y samples, used to validate your model on data it hasn't been trained on.
By using random_state you get reproducible results. In other words, you get the same sets each time you run the script.
The terms validation set and test set are sometimes used interchangeably, and sometimes to mean slightly different things. @Sy Ker's point is correct: the sklearn module you're using does provide you with a validation set, though the term used in the module is test. Effectively, what you're doing is getting data for training and data for evaluation, regardless of the term used. I'm adding this answer to point out that you might, in fact, need a form of test set.
Using train_test_split will give you a pair of sets that allow you to train and evaluate a model (with the held-out proportion specified by the test_size argument -- which, generally, should be something like 10-25% to ensure that it's a representative subsample). But I would suggest thinking of the process a little more broadly.
Splitting data for use in testing and model evaluation can be done simply (and, likely, incorrectly) by just using some y% of the rows from the bottom of a dataset. If normalization/standardization is being done, make sure to fit it on the training set and then apply it to the evaluation set, so that the same treatment is applied to both.
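As a rough sketch of that normalization point, assuming StandardScaler is the scaler in use (the original does not name one) and using synthetic data purely for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data, used only for illustration.
rng = np.random.RandomState(0)
x = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y = rng.randint(0, 2, size=200)

x_train, x_val, y_train, y_val = train_test_split(
    x, y, test_size=0.2, random_state=0
)

scaler = StandardScaler()
# The mean and standard deviation are learned from the training set only...
x_train_scaled = scaler.fit_transform(x_train)
# ...and the same transformation is then applied to the validation set.
x_val_scaled = scaler.transform(x_val)
```

The validation set is never passed to fit, so none of its statistics leak into the preprocessing.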
sklearn and others have also made it possible to do cross-validation very simply, and in this case "validation" sets should be thought of a little differently. Cross-validation will take a portion of your data and subdivide it into smaller groups for repeated testing-and-training passes. In this case, you might start with a split of data like that from train_test_split, and keep the "test" set as a total holdout -- meaning that the cross-validation procedure never uses (or "sees") the data during its test/train process.
The test set that you got from the train_test_split process can then serve as a good set of data for testing how the model performs against data it has never seen. You might see this referred to as a "holdout" set, or again as some version of "test" and/or "validation".
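One way the holdout-plus-cross-validation workflow described above could look. This is only a sketch: the synthetic dataset and the choice of RandomForestClassifier are assumptions, not something the question specifies.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic data, used only for illustration.
x, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Set the holdout aside first; cross-validation never sees it.
x_train, x_holdout, y_train, y_holdout = train_test_split(
    x, y, test_size=0.2, random_state=0
)

model = RandomForestClassifier(n_estimators=50, random_state=0)

# Cross-validation subdivides the training portion only.
cv_scores = cross_val_score(model, x_train, y_train, cv=5)
print("cross-validation accuracy per fold:", cv_scores)

# Final check against data the model has never seen.
model.fit(x_train, y_train)
holdout_score = model.score(x_holdout, y_holdout)
print("holdout accuracy:", holdout_score)
```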
This link has a quick, but intuitive, description of cross-validation and holdout sets.

Do I give cross_val_score() the entire dataset or just the training-set?

The documentation of the function is not very clear, and I don't understand what values I should give it.
cross_val_score(estimator, X, y=None)
This is my code:
clf = LinearSVC(random_state=seed, **params)
cvscore = cross_val_score(clf, features, labels)
I am not sure if this is correct or if I need to give X_train and y_train instead of features and labels.
It is always a good idea to separate the test set and training set, even while using cross_val_score. The reason behind this is knowledge leaking. It basically means that when you use both training and test sets, you are leaking information from the test set into your model, thereby making your model biased and leading to incorrect predictions.
Here is detailed blog post on the same issue.
Reddit post on cross-validation
Cross_val_Score example showing correct way of using it
A similar question on stats.stackexchange
I assume you were referring to the below documentation:
The purpose of cross-validation is to check that your model doesn't have particularly high variance, giving a good fit in one instance but a poor fit in another. It is generally used for model validation. With this in mind, you should be passing the training set (X_train, y_train) and seeing how your model performs.
Your question was focused on:
"Can I pass in the whole data-set into cross validation?"
The answer is, yes. This is conditional and based off whether or not you are satisfied with your ML output. Say for example, I have the following below:
I have used a Random forest model and am happy with my overall model fit and score.
In this case, I have a hold-out set.
Once I remove this hold-out set and give my model the whole data-set, we would get a plot with an even higher score, as I am giving my model more information (and as such, your CV scores will correspondingly be higher).
An example of calling the method might be as such:
probablistic_scores = cross_val_score(model, X_train, y_train, cv=5)
Generally a 5 Fold Cross validation is preferred.
If you wish to go higher than 5 folds, please note that as the number of folds increases, the computational resources required also increase and the run takes longer.
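To make the cost of extra folds concrete, here is a small sketch that passes only the training portion to cross_val_score and compares 5 and 10 folds. The synthetic data and the max_iter setting are assumptions added for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import LinearSVC

# Synthetic stand-ins for the question's features and labels.
features, labels = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.25, random_state=0
)

clf = LinearSVC(random_state=0, max_iter=5000)

for n_folds in (5, 10):
    scores = cross_val_score(clf, X_train, y_train, cv=n_folds)
    # One model is fitted per fold, so the array has n_folds entries.
    print("%d folds: mean=%.3f std=%.3f" % (n_folds, scores.mean(), scores.std()))
```

The spread of the per-fold scores is what tells you how much the fit varies from one split to another.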

Linear regression: Good results for training data, horrible for test data

I am working with a dataset of about 400.000 x 250.
I have a problem with the model yielding a very good R^2 score when tested on the training set, but an extremely poor one when used on the test set. Initially, this sounds like overfitting. But the data is split into training/test sets at random and the data set is pretty big, so I feel like there has to be something else.
Any suggestions?
Splitting dataset into training set and test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df.drop(['SalePrice'],
axis=1), df.SalePrice, test_size = 0.3)
Sklearn's Linear Regression estimator
from sklearn import linear_model
linReg = linear_model.LinearRegression() # Create linear regression object
linReg.fit(X_train, y_train) # Train the model using the training sets
# Predict from training set
y_train_linreg = linReg.predict(X_train)
# Predict from test set
y_pred_linreg = linReg.predict(X_test)
Metric calculation
from sklearn import metrics
metrics.r2_score(y_train, y_train_linreg)
metrics.r2_score(y_test, y_pred_linreg)
R^2 score when testing on training set: 0,64
R^2 score when testing on testing set: -10^23 (approximately)
While I agree with Mihai that your problem definitely looks like overfitting, I don't necessarily agree on his answer that neural network would solve your problem; at least, not out of the box. By themselves, neural networks overfit more, not less, than linear models. You need somehow to take care of your data, hardly any model can do that for you. A few options that you might consider (apologies, I cannot be more precise without looking at the dataset):
Easiest thing, use regularization. 400k rows is a lot, but with 250 dimensions you can overfit almost whatever you like. So try replacing LinearRegression with Ridge or Lasso (or Elastic Net or whatever). Lasso has the advantage of discarding features for you; see the next point.
Especially if you want to go outside of linear models (and you probably should), it's advisable to first reduce the dimension of the problem, as I said 250 is a lot. Try using some of the Feature selection techniques here:
Probably more important than anything else, you should consider adapting your input data. The very first thing I'd try, assuming you are really trying to predict a price as your code implies, is to replace it by its logarithm, or log(1+x). Otherwise linear regression will try very, very hard to fit that single object that was sold for $1 million while ignoring everything below $1k. Just as important, check whether you have any non-numeric (categorical) columns and keep them only if you need them, if necessary reducing them to macro-categories: a categorical column with 1000 possible values will increase your problem dimension by 1000, making it an assured overfit. A single column with a unique categorical value for each input (e.g. buyer name) will lead you straight to perfect overfitting.
After all this (cleaning data, reducing dimension via either one of the methods above or just Lasso regression until you get to certainly less than dim 100, possibly less than 20 - and remember that this includes any categorical data!), you should consider non-linear methods to further improve your results - but that's useless until your linear model provides you at least some mildly positive R^2 value on test data. sklearn provides a lot of them: is the easiest to use out-of-the-box (also does regularization), but it might be too slow to use in your case (you should first try this, and any of the following, on a subset of your data, say 1000 rows once you've selected only 10 or 20 features and see how slow that is). have many different flavours, but I think all but the linear one would be too slow. Sticking to linear things, is probably the fastest, and would be how I'd train a linear model on this many samples. Going truly out of linear, the easiest techniques would probably include some kind of trees, either directly (but that's an almost-certain overfit) or, better, using some ensemble technique (random forests are the typical go-to algorithm, gradient boosting sometimes works better). Finally, state-of-the-art results are indeed generally obtained via neural networks, see e.g. but for these methods sklearn is generally not the right answer and you should take a look at dedicated environments (TensorFlow, Caffe, PyTorch, etc.)... however if you're not familiar with those it is certainly not worth the trouble!
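A minimal sketch combining two of the suggestions above: regularization via Ridge and predicting log(1+x) of the price. It assumes synthetic, strictly positive price data and uses sklearn's TransformedTargetRegressor to apply the log transform; neither appears in the original answer.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Synthetic, strictly positive "price" data, used only for illustration.
rng = np.random.RandomState(0)
X = rng.normal(size=(1000, 20))
price = np.exp(1.0 + 0.5 * X[:, 0] + 0.1 * rng.normal(size=1000))

X_train, X_test, y_train, y_test = train_test_split(
    X, price, test_size=0.3, random_state=0
)

# Ridge adds L2 regularization. The wrapper fits the regressor on log(1 + y)
# and maps predictions back to the original scale with exp(y) - 1.
model = TransformedTargetRegressor(
    regressor=Ridge(alpha=1.0), func=np.log1p, inverse_func=np.expm1
)
model.fit(X_train, y_train)

print("R^2 on training set:", model.score(X_train, y_train))
print("R^2 on test set:", model.score(X_test, y_test))
```

Swapping Ridge for Lasso in the same wrapper would additionally zero out some coefficients, which is the feature-discarding behaviour mentioned above.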

How to do linear regression using Python and Scikit learn using one hot encoding?

I am trying to use linear regression in combination with python and scikitlearn to answer the question "can user session lengths be predicted given user demographic information?"
I am using linear regression because the user session lengths are in milliseconds, which is continuous. I one hot encoded all of my categorical variables including gender, country, and age range.
I am not sure how to take into account my one hot encoding, or if I even need to.
Input Data:
I tried reading here:
I understand that the main inputs are whether to fit an intercept, whether to normalize, whether to copy X (all boolean), and then n_jobs.
I'm not sure what factors to take into account when deciding on these inputs. I'm also concerned whether my one hot encoding of the variables makes an impact.
You can do it like this:
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression
# X is a numpy array with your features
# y is the label array
# (on scikit-learn 1.2 and later, the argument is named sparse_output instead of sparse)
enc = OneHotEncoder(sparse=False)
X_transform = enc.fit_transform(X)
# apply your linear regression as you want
model = LinearRegression().fit(X_transform, y)
print("Mean squared error: %.2f" % np.mean((model.predict(X_transform) - y) ** 2))
Please note that in this example I am training and testing on the same dataset! This gives an optimistic error estimate and can hide overfitting in your model. You should avoid that by splitting the data or doing cross-validation.
I just wanted to fit a linear regression with sklearn, which I use as a benchmark for other non-linear approaches, such as MLPRegressor, but also for variations of linear regression, such as Ridge, Lasso and ElasticNet (see here for an introduction to this group).
Doing it the same way as described by @silviomoreto (which worked for all other models) actually resulted in an erroneous model for me (very high errors). This is most likely due to the so-called dummy variable trap, which occurs due to multicollinearity in the variables when you include one dummy variable per category for categorical variables -- which is exactly what OneHotEncoder does! See also the following discussion on statsexchange:
To avoid this, I wrote a simple wrapper that excludes one variable, which then acts as the default.
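from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import OneHotEncoder
class DummyEncoder(BaseEstimator, TransformerMixin):
    # n_values and sparse are OneHotEncoder arguments from older scikit-learn releases
    def __init__(self, n_values='auto'):
        self.n_values = n_values
    def transform(self, X):
        ohe = OneHotEncoder(sparse=False, n_values=self.n_values)
        # drop the last encoded column, which then acts as the default
        return ohe.fit_transform(X)[:,:-1]
    def fit(self, X, y=None, **fit_params):
        return self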
So building on the code of @silviomoreto, you would change the line that creates the encoder:
enc = DummyEncoder()
This solved the problem for me. Note that OneHotEncoder worked fine (and better) for all other models, such as Ridge, Lasso and ANN.
I chose this way, because I wanted to include it in my feature pipeline. But you seem to have the data already encoded. Here, you would have to drop one column per category (e.g. for male/female only include one). So if you for example used pandas.get_dummies(...), this can be done with the parameter drop_first=True.
Last but not least, if you really need to go deeper into linear regression in Python, and not use it just as a benchmark, I would recommend statsmodels over scikit-learn, as it provides better model statistics, e.g. p-values per variable.
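For data that is already in a pandas DataFrame, the drop_first option mentioned above can be sketched like this. The column names and values are made up for illustration:

```python
import pandas as pd

# Made-up demographic data, for illustration only.
df = pd.DataFrame({
    "gender": ["male", "female", "female", "male"],
    "country": ["DE", "US", "FR", "US"],
})

# One dummy column per category: 2 for gender + 3 for country.
full = pd.get_dummies(df)
# The first category of each variable is dropped and acts as the default.
reduced = pd.get_dummies(df, drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

Unlike slicing off the last column of the whole encoded matrix, this drops one column per categorical variable.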
how to prepare data for sklearn LinearRegression
OneHotEncoder should only be used on the intended columns: those with categorical variables or strings, or integers that are essentially levels rather than numeric.
DO NOT apply OneHotEncoder to your entire dataset, including numerical variables or Booleans.
To prepare the data for sklearn LinearRegression, the numerical and categorical should be separately handled.
numerical columns: standardize if your model contains interactions or polynomial terms
categorical columns: apply one-hot encoding either through sklearn or pd.get_dummies. pd.get_dummies is more flexible, while OneHotEncoder is more consistent with the sklearn API.
As of version 0.22, OneHotEncoder in sklearn has drop option. For example OneHotEncoder(drop='first').fit(X), which is similar to
use regularized linear regression
If you use regularized linear regression such as Lasso, multicollinear variables will be penalized and shrunk.
limitation of p-value statistics
The p-value in OLS is only valid when the OLS assumptions are more or less true. While there are methods to deal with situations when p-values cannot be trusted, one potential solution is to use cross validation or leave-one-out for gaining confidence on the model.
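The separate handling of numerical and categorical columns described in this answer could be wired together roughly as follows. This is a sketch that assumes scikit-learn 0.22 or later (for the drop option) and uses ColumnTransformer and Pipeline, which the answer does not mention; the data and column names are hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up data; the column names are hypothetical.
df = pd.DataFrame({
    "gender": ["male", "female", "female", "male", "female", "male"],
    "country": ["DE", "US", "FR", "US", "DE", "FR"],
    "visits": [3, 10, 7, 1, 4, 8],
    "session_ms": [1200.0, 5300.0, 4100.0, 800.0, 2500.0, 3900.0],
})
X = df[["gender", "country", "visits"]]
y = df["session_ms"]

preprocess = ColumnTransformer([
    # One-hot encode only the categorical columns, dropping one level each.
    ("categorical", OneHotEncoder(drop="first"), ["gender", "country"]),
    # Standardize only the numerical column.
    ("numerical", StandardScaler(), ["visits"]),
])

model = Pipeline([("preprocess", preprocess), ("regression", LinearRegression())])
model.fit(X, y)

# 1 gender column + 2 country columns + 1 numerical column.
X_encoded = model.named_steps["preprocess"].transform(X)
print(X_encoded.shape)
```

Keeping the preprocessing inside the pipeline also means it is refitted on each training fold if the model is later passed to cross-validation.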

Break up Random forest classification fit into pieces in python?

I have almost 900,000 rows of information that I want to run through scikit-learn's Random Forest Classifier algorithm. Problem is, when I try to create the model my computer freezes completely, so what I want to try is running the model every 50,000 rows but I'm not sure if this is possible.
So the code I have now is
# This code freezes my computer
clf.fit(X, Y)
# what I want is
model = clf.fit(X.ix[0:50000], Y.ix[0:50000])
model = clf.fit(X.ix[0:100000], Y.ix[0:100000])
model = clf.fit(X.ix[0:150000], Y.ix[0:150000])
# ... and so on
Feel free to correct me if I'm wrong, but I assume you're not using the most current version of scikit-learn (0.16.1 as of writing this), that you're on a Windows machine and using n_jobs=-1 (or a combination of all three). So my suggestion would be to first upgrade scikit-learn or set n_jobs=1 and try fitting on the whole dataset.
If that fails, take a look at the warm_start parameter. By setting it to True and gradually incrementing n_estimators you can fit additional trees on subsets of your data:
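# First build 100 trees on the first chunk
# (.ix was removed in pandas 1.0; use .iloc there)
clf = RandomForestClassifier(n_estimators=100, warm_start=True)
clf.fit(X.ix[0:50000], Y.ix[0:50000])
# add another 100 estimators on chunk 2
clf.set_params(n_estimators=200)
clf.fit(X.ix[50000:100000], Y.ix[50000:100000])
# and so forth...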
Another possibility is to fit a new classifier on each chunk and then simply average the predictions from all classifiers, or to merge the trees into one big random forest as described here.
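A rough sketch of the first of those options, fitting one forest per chunk and averaging the predicted probabilities. The synthetic data, chunk size and number of trees are assumptions chosen to keep the example small:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the real 900,000-row dataset.
X, Y = make_classification(n_samples=3000, n_features=10, random_state=0)

chunk_size = 1000
forests = []
for start in range(0, len(X), chunk_size):
    stop = start + chunk_size
    clf = RandomForestClassifier(n_estimators=20, random_state=0)
    clf.fit(X[start:stop], Y[start:stop])
    forests.append(clf)

# Averaging assumes every chunk contained every class,
# so that the probability columns line up across forests.
mean_proba = np.mean([clf.predict_proba(X) for clf in forests], axis=0)
predictions = forests[0].classes_[np.argmax(mean_proba, axis=1)]
print("accuracy:", np.mean(predictions == Y))
```

Only one chunk needs to be in memory while its forest is being fitted, which is the point of the approach.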
Another method similar to the one linked in Andreus' answer is to grow the trees in the forest individually.
I did this a while back: basically I trained a number of DecisionTreeClassifier's one at a time on different partitions of the training data. I saved each model via pickling, and afterwards I loaded them into a list which was assigned to the estimators_ attribute of a RandomForestClassifier object. You also have to take care to set the rest of the RandomForestClassifier attributes appropriately.
I ran into memory issues when I built all the trees in a single python script. If you use this method and run into that issue, there's a work-around, I posted in the linked question.
from sklearn.datasets import load_iris
iris = load_iris()
X, y = iris.data, iris.target
### RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier
# iris is sorted by class, so each chunk takes every third row
# to make sure all three classes appear in every call to fit
rfc = RandomForestClassifier(n_estimators=10, warm_start=True)
rfc.fit(X[0::3], y[0::3])
print(rfc.score(X, y))
rfc.n_estimators += 10
rfc.fit(X[1::3], y[1::3])
print(rfc.score(X, y))
rfc.n_estimators += 10
rfc.fit(X[2::3], y[2::3])
print(rfc.score(X, y))
Below is the difference between warm_start and partial_fit.
When fitting an estimator repeatedly on the same dataset, but for multiple parameter values (such as to find the value maximizing performance as in grid search), it may be possible to reuse aspects of the model learnt from the previous parameter value, saving time. When warm_start is true, the existing fitted model attributes are used to initialise the new model in a subsequent call to fit.
Note that this is only applicable for some models and some parameters, and even some orders of parameter values. For example, warm_start may be used when building random forests to add more trees to the forest (increasing n_estimators) but not to reduce their number.
partial_fit also retains the model between calls, but differs: with warm_start the parameters change and the data is (more-or-less) constant across calls to fit; with partial_fit, the mini-batch of data changes and model parameters stay fixed.
There are cases where you want to use warm_start to fit on different, but closely related data. For example, one may initially fit to a subset of the data, then fine-tune the parameter search on the full dataset. For classification, all data in a sequence of warm_start calls to fit must include samples from each class.
Some algorithms in scikit-learn implement 'partial_fit()' methods, which is what you are looking for. There are random forest algorithms that do this, however, I believe the scikit-learn algorithm is not such an algorithm.
However, this question and answer may have a workaround that would work for you. You can train forests on different subsets, and assemble a really big forest at the end:
Combining random forest models in scikit learn
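For comparison, this is roughly what the partial_fit pattern mentioned above looks like on an estimator that supports it. SGDClassifier is a linear model swapped in here only to show the mechanics; it is not a random forest, and the synthetic data is an assumption.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Synthetic stand-in for data that would be read chunk by chunk.
X, Y = make_classification(n_samples=3000, n_features=10, random_state=0)
classes = np.unique(Y)

clf = SGDClassifier(random_state=0)
for start in range(0, len(X), 1000):
    stop = start + 1000
    # The model parameters stay fixed; only the mini-batch changes.
    # The full list of classes has to be given on the first call.
    clf.partial_fit(X[start:stop], Y[start:stop], classes=classes)

print("accuracy:", clf.score(X, Y))
```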
