Could you briefly describe me what the below lines of code mean. This is the code of logistic regression in Python.
What means size =0.25 and random_state = 0 ? And what is train_test_split ? What was done in this line of code ?
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.25,random_state=0)
And what was done in these lines of code ?
logistic_regression= LogisticRegression()
logistic_regression.fit(X_train,y_train)
y_pred=logistic_regression.predict(X_test)
Have a look at the description of the function here:
random_state sets the seed for the random number generator to give you the same result with each run, especially useful in education settings to give everyone an identical result.
test_size refers to the proportion used in the test split, here 75% of the data is used for training, 25% is used for testing the model.
The other lines simply run the logistic regression on the training dataset. You then use the test dataset to check the goodness of the fitted regression.
What means size =0.25 and random_state = 0 ?
test_size=0.25 -> 25% split of training and test data.
random_state = 0 -> for reproducible results this can be any number.
What was done in this line of code ?
Splits X and y into X_train, X_test, y_train, y_test
And what was done in these lines of code ?
Trains the logistic regression model through the fit(X_train, y_train) and then makes predictions on the test set X_test.
Later you probably compare y_pred to y_test to see what the accuracy of the model is.
Based on the documentation:
test_size : float, int or None, optional (default=None)
If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples. If None, the value is set to the complement of the train size. If train_size is also None, it will be set to 0.25.
This gives you the split between your train data and test data, if you have in total 1000 data points, a test_size=0.25 would mean that you have:
750 data points for train
250 data points for test
The perfect size is still under discussions, for large datasets (1.000.000+ ) I currently prefer to set it to 0.1. And even before I have another validation dataset, which I will keep completly out until I decided to run the algorithm.
random_state : int, RandomState instance or None, optional
(default=None)
If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.
For machine learning you should set this to a value, if you set it, you will have the chance to open your programm on another day and still produce the same results, normally random_state is also in all classifiers/regression models avaiable, so that you can start working and tuning, and have it reproducible,
To comment your regression:
logistic_regression= LogisticRegression()
logistic_regression.fit(X_train,y_train)
y_pred=logistic_regression.predict(X_test)
Will load your Regression, for python this is only to name it
Will fit your logistic regression based on your training set, in this example it will use 750 datsets to train the regression. Training means, that the weights of logistic regression will be minimized with the 750 entries, that the estimat for your y_train fits
This will use the learned weights of step 2 to do an estimation for y_pred with the X_test
After that you can test your results, you now have a y_pred which you calculated and the real y_test, you can know calculate some accuracy scores and the how good the regression was trained.
This line line:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.25,random_state=0)
divides your source into train and test set, 0.25 shows 25% of the source will be used for test and remaining will be used for training.
For, random_state = 0, here is a brief discussion.
A part from above link:
if you use random_state=some_number, then you can guarantee that the
output of Run 1 will be equal to the output of Run 2,
logistic_regression= LogisticRegression() #Creates logistic regressor
Calculates some values for your source. Recommended read
logistic_regression.fit(X_train,y_train)
A part from above link:
Here the fit method, when applied to the training dataset,learns the
model parameters (for example, mean and standard deviation)
....
It doesn't matter what the actual random_state number is 42, 0, 21, ... The important thing is that everytime you use 42, you will always get the same output the first time you make the split. This is useful if you want reproducible results,
Perform prediction on test set based on the learning from training set.
y_pred=logistic_regression.predict(X_test)
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.25,random_state=0)
Above line splits your data into training and testing data randomly
X is your dataset minus output variable
y is your output variable
test_size=0.25 means you are dividing data into 75%-25% where 25% is your testing dataset
random_state is used for generating same sample again when you run the code
Refer train-test-split documentation
Related
For some reasons, I have base dataframes of the following structure
print(df1.shape)
display(df1.head())
print(df2.shape)
display(df2.head())
Where the top dataframe is my features set and my bottom is the output set. To turn this into a problem that is amenable to data modeling I first do:
x_train, x_test, y_train, y_test = train_test_split(df1, df2, train_size = 0.8)
I then have a split for 80% training and 20% testing.
Since the output set (df2; y_test/y_train) is individual measurements with no inherent meaning on their own, I calculate pairwise distances between the labels to generate a single output value denoting the pairwise distances between observations using (the distances are computed after z-scoring; the z-scoring code isn't described here but it is done):
y_train = pdist(y_train, 'euclidean')
y_test = pdist(y_test, 'euclidean')
Similarly I then apply this strategy to the features set to generate pairwise distances between individual observations of each of the instances of each feature.
def feature_distances(input_vector):
modified_vector = np.array(input_vector).reshape(-1,1)
vector_distances = pdist(modified_vector, 'euclidean')
vector_distances = pd.Series(vector_distances)
return vector_distances
x_train = x_train.apply(feature_distances, axis = 0)
x_test = x_test.apply(feature_distances, axis = 0)
I then proceed to train & test all of my models.
For now I am trying linear regression , random forest, xgboost.
Is there any easy way to implement a cross validation scheme in my dataset?
Since my problem requires calculating pairwise distances between observations, I am struggling to identify an easy way to do cross validation schemes to optimize parameter tuning.
GridsearchCV doesn't quite work here since in each instance of the test/train split, distances have to be recomputed to avoid contamination of test with train.
Hope it's clear!
First, what I understood from the shape of your data frames that you have 42 samples and 1643 features in the input, and each output vector consists of 392 values.
Huge Input: In case, you are sure that your problem has 1643 features, you might need to use PCA to reduce the dimensionality instead of pairwise distance. You should collect more samples instead of 42 samples to avoid overfitting because it is not enough data to train and test your model.
Huge Output: you could use sampled_softmax_loss to speed up the training process as mentioned in TensorFlow documentation . You could also read this here. In case, you do not want to follow this approach, you can continue training with this output but it takes some time.
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.8, random_state=n)
here X is independent feature, y is dependent feature means what you actually want to predict - it could be label or continuous value. We used train_test_split on train dataset and we are using (x_train, y_train) to train model and (x_test, y_test) to test model to ensure performance of model on unknown data(x_test, y_test). In your case you have given y as df2 which is wrong just figure out your target feature and give it as y and there is no need to split test data.
I have 15 features with a binary response variable and I am interested in predicting probabilities than 0 or 1 class labels. When I trained and tested the RF model with 500 trees, CV, balanced class weight, and balanced samples in the data frame, I achieved a good amount of accuracy and also good Brier score. As you can see in the image, the predicted probabilities values of class 1 on test data are in between 0 to 1.
Here is the Histogram of predicted probabilities on test data:
with majority values at 0 - 0.2 and 0.9 to 1, which is much accurate.
But when I try to predict the probability values for unseen data or let's say all data points for which value of 0 or 1 is unknown, the predicted probabilities values are between 0 to 0.5 only for class 1. Why is that so? Aren't the values should be from 0.5 to 1?
Here is the histogram of predicted probabilities on unseen data:
I am using sklearn RandomforestClassifier in python. The code is below:
#Read the CSV
df=pd.read_csv('path/df_all.csv')
#Change the type of the variable as needed
df=df.astype({'probabilities': 'int32', 'CPZ_CI_new.tif' : 'category'})
#Response variable is between 0 and 1 having actual probabilities values
y = df['probabilities']
# Separate majority and minority classes
df_majority = df[y == 0]
df_minority = df[y == 1]
# Upsample minority class
df_minority_upsampled = resample(df_minority,
replace=True, # sample with replacement
n_samples=100387, # to match majority class
random_state=42) # reproducible results
# Combine majority class with upsampled minority class
df1 = pd.concat([df_majority, df_minority_upsampled])
y = df1['probabilities']
X = df1.iloc[:,1:138]
#Change interfere values to category
y_01=y.astype('category')
#Split training and testing
X_train, X_valid, y_train, y_valid = train_test_split(X, y_01, test_size = 0.30, random_state = 42,stratify=y)
#Model
model=RandomForestClassifier(n_estimators = 500,
max_features= 'sqrt',
n_jobs = -1,
oob_score = True,
bootstrap = True,
random_state=0,class_weight='balanced',)
#I had 137 variable, to select the optimum one, I used RFECV
rfecv = RFECV(model, step=1, min_features_to_select=1, cv=10, scoring='neg_brier_score')
rfecv.fit(X_train, y_train)
#Retrained the model with only 15 variables selected
rf=RandomForestClassifier(n_estimators = 500,
max_features= 'sqrt',
n_jobs = -1,
oob_score = True,
bootstrap = True,
random_state=0,class_weight='balanced',)
#X1_train is same dataframe with but with only 15 varible
rf.fit(X1_train,y_train)
#Printed ROC metric
print('roc_auc_score_testing:', metrics.roc_auc_score(y_valid,rf.predict(X1_valid)))
#Predicted probabilties on test data
predv=rf.predict_proba(X1_valid)
predv = predv[:, 1]
print('brier_score_training:', metrics.brier_score_loss(y_train, predt))
print('brier_score_testing:', metrics.brier_score_loss(y_valid, predv))
#Output is,
roc_auc_score_testing: 0.9832652130944419
brier_score_training: 0.002380976369884945
brier_score_testing: 0.01669848089917487
#Later, I have images of that 15 variables, I created a data frame out(sample_img) of it and use the same function to predict probabilities.
IMG_pred=rf.predict_proba(sample_img)
IMG_pred=IMG_pred[:,1]
The results shown for your test data are not valid; you perform a mistaken procedure that has two serious consequences, which invalidate them.
The mistake here is that you perform the minority class upsampling before splitting to train & test sets, which should not be the case; you should first split into training and test sets, and then perform the upsampling only to the training data and not to the test ones.
The first reason why such a procedure is invalid is that, this way, some of the duplicates due to upsampling will end up both to the training and the test splits; the result being that the algorithm is tested with some samples that have already been seen during training, which invalidates the very fundamental requirement of a test set. For more details, see own answer in Process for oversampling data for imbalanced binary classification; quoting from there:
I once witnessed a case where the modeller was struggling to understand why he was getting a ~ 100% test accuracy, much higher than his training one; turned out his initial dataset was full of duplicates -no class imbalance here, but the idea is similar- and several of these duplicates naturally ended up in his test set after the split, without of course being new or unseen data...
The second reason is that this procedure shows biased performance measures in a test set that is no longer representative of reality: remember, we want our test set to be representative of the real unseen data, which of course will be imbalanced; artificially balancing our test set and claiming that it has X% accuracy when a great part of this accuracy will be due to the artificially upsampled minority class makes no sense, and gives misleading impressions. For details, see own answer in Balance classes in cross validation (the rationale is identical for the case of train-test split, as here).
The second reason is why your procedure would still be wrong even if you had not performed the first mistake, and you had proceeded to upsample the training and test sets separately after splitting.
I short, you should remedy the procedure, so that you first split into training & test sets, and then upsample your training set only.
I created python code for ridge regression.For that I used cross validation and grid-search technique in together. i got output result. I want check whether my regression model building steps correct or not? can some one explain it?
from sklearn.linear_model import Ridge
ridge_reg = Ridge()
from sklearn.model_selection import GridSearchCV
params_Ridge = {'alpha': [1,0.1,0.01,0.001,0.0001,0] , "fit_intercept": [True, False], "solver": ['svd', 'cholesky', 'lsqr', 'sparse_cg', 'sag', 'saga']}
Ridge_GS = GridSearchCV(ridge_reg, param_grid=params_Ridge, n_jobs=-1)
Ridge_GS.fit(x_train,y_train)
Ridge_GS.best_params_
output - {'alpha': 1, 'fit_intercept': True, 'solver': 'cholesky'}
Ridgeregression = Ridge(random_state=3, **Ridge_GS.best_params_)
from sklearn.model_selection import cross_val_score
all_accuracies = cross_val_score(estimator=Ridgeregression, X=x_train, y=y_train, cv=5)
all_accuracies
output - array([0.93335508, 0.8984485 , 0.91529146, 0.89309012, 0.90829416])
print(all_accuracies.mean())
output - 0.909695864130532
Ridgeregression.fit(x_train,y_train)
Ridgeregression.score(x_test,y_test)
output - 0.9113458623386644
Is 0.9113458623386644 my ridge regression accuracy(R squred) ?
if it is, then what is meaning of 0.909695864130532 value.
Yes the score method from Ridge regression returns your R-squared value (docs).
In case you are not aware how the CV method works it splits your data into 5 equal chunks. Then for each combination of parameters it fits the model five times using each chunk once as evaluation set, while using the remainder of the data as the training set. The best parameter set is chosen to be the set which gives the highest average score.
Your main question seems to be why the average of your CV score is less than the score from the full training evaluated on the test set. This is not necessarily surprising, since the full training set will be larger than any of CV samples which are used for the all_accuracies values. More training data will generally get you a more accurate model.
The test set score (i.e. your second 'score', 0.91...) is most likely to represent how your model will generalize to unseen data. This is what you should quote as the 'score' of your model. The performance on CV set is biased, since this is the data on which you based your parameter choices.
In general your method looks correct. The step where you refit ridge regression using cross_val_score seems necessary. Once you have found your best parameters from GridSearchCV I would go straight to fitting on the full training dataset (as you do at the end).
I have been trying to use XGBregressor in python. It is by far one of the best ML techniques I have used.However, in some data sets I have very high training R-squared, but it performs really poor in prediction or testing. I have tried playing with gamma, depth, and subsampling to reduce the complexity of the model or to make sure its not overfitted but still there is a huge difference between training and testing. I was wondering if someone could help me with this:
Below is the code I am using:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30,random_state=100)
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaler.fit(X_train)
xgb = xgboost.XGBRegressor(colsample_bytree=0.7,
gamma=0,
learning_rate=0.01,
max_depth=1,
min_child_weight=1.5,
n_estimators=100000,
reg_alpha=0.75,
reg_lambda=0.45,
subsample=0.8,
seed=1000)
Here is the performance in training vs testing:
Training :
MAE: 0.10 R^2: 0.99
Testing:
MAE: 1.47 R^2: -0.89
XGBoost tends to overfit the data , so reduce the n_estimators and n_depth and use that particular iteration where the train loss and val loss does not have much difference between them.
The issue here is overfitting. You need to tune some of the parameters(Source).
set n_estimators to 80-200 if the size of data is high (of the order of lakh), 800-1200 is if it is medium-low
learning_rate: between 0.1 and 0.01
subsample: between 0.8 and 1
colsample_bytree: number of columns used by each tree. Values from 0.3 to 0.8 if you have many feature vectors or columns , or 0.8 to 1 if you only few feature vectors or columns.
gamma: Either 0, 1 or 5
Since max_depth you have already taken very low, so you can try to tune above parameters. Also, if your dataset is very small then the difference in training and test is expected. You need to check whether within training and test data a good split of data is there or not. For example, in test data whether you have almost equal percentage of Yes and No for the output column.
You need to try various option. certainly xgboost and random forest will give overfit model for less data. You can try:-
1.Naive bayes. Its good for less data set but it considers the weigtage of all feature vector same.
Logistic Regression - try to tune the regularisation parameter and see where your recall score max. Other things in this are calsss weight = balanced.
Logistic Regression with Cross Validation - this is good for small data as well. Last thing which I told earlier also, check your data and see its not biased towards one kind of result. Like if the result is yes in 50 cases out of 70, it is highly biased and you may not get high accuracy.
I was working on a knearest neighbours problem set. I couldn't understand why are they performing K fold cross validation on test set?? Cant we directly test how well our best parameter K performed on the entire test data? rather than doing a cross validation?
iris = sklearn.datasets.load_iris()
X = iris.data
Y = iris.target
X_train, X_test, Y_train, Y_test = sklearn.cross_validation.train_test_split(
X, Y, test_size=0.33, random_state=42)
k = np.arange(20)+1
parameters = {'n_neighbors': k}
knn = sklearn.neighbors.KNeighborsClassifier()
clf = sklearn.grid_search.GridSearchCV(knn, parameters, cv=10)
clf.fit(X_train, Y_train)
def computeTestScores(test_x, test_y, clf, cv):
kFolds = sklearn.cross_validation.KFold(test_x.shape[0], n_folds=cv)
scores = []
for _, test_index in kFolds:
test_data = test_x[test_index]
test_labels = test_y[test_index]
scores.append(sklearn.metrics.accuracy_score(test_labels, clf.predict(test_data)))
return scores
scores = computeTestScores(test_x = X_test, test_y = Y_test, clf=clf, cv=5)
TL;DR
Did you ever have a science teacher who said, 'any measurement without error bounds is meaningless?'
You might worry that the score on using your fitted, hyperparameter optimized, estimator on your test set is a fluke. By doing a number of tests on a randomly chosen subsample of the test set you get a range of scores; you can report their mean and standard deviation etc. This is, hopefully, a better proxy for how the estimator will perform on new data from the wild.
The following conceptual model may not apply to all estimators but it is a useful to bear in mind. You end up needing 3 subsets of your data. You can skip to the final paragraph if the numbered points are things you are already happy with.
Training your estimator will fit some internal parameters that you need not ever see directly. You optimize these by training on the training set.
Most estimators also have hyperparameters (number of neighbours, alpha for Ridge, ...). Hyperparameters also need to be optimized. You need to fit them to a different subset of your data; call it the validation set.
Finally, when you are happy with the fit of both the estimator's internal parameters and the hyperparmeters, you want to see how well the fitted estimator predicts on new data. You need a final subset (the test set) of your data to figure out how well the training and hyperparameter optimization went.
In lots of cases the partitioning your data into 3 means you don't have enough samples in each subset. One way around this is to randomly split the training set a number of times, fit hyperparameters and aggregate the results. This also helps stop your hyperparameters being over-fit to a particular validation set. K-fold cross-validation is one strategy.
Another use for this splitting a data set at random is to get a range of results for how your final estimator did. By splitting the test set and computing the score you get a range of answers to 'how might we do on new data'. The hope is that this is more representative of what you might see as real-world novel data performance. You can also get a standard deviation for you final score. This appears to be what the Harvard cs109 gist is doing.
If you make a program that adapts to input, then it will be optimal for the input you adapted it to.
This leads to a problem known as overfitting.
In order to see if you have made a good or a bad model, you need to test it on some other data that is not what you used to make the model. This is why you separate your data into 2 parts.