I'm using StandardScaler() and the lin_reg.coef_ attribute in the following context:
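from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

for i in range(100):
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=i)
    # Fit the scaler on the training split only, then apply it to both splits
    scaler = StandardScaler().fit(x_train)
    x_train = scaler.transform(x_train)
    x_test = scaler.transform(x_test)
    lin_reg = LinearRegression().fit(x_train, y_train)
    # Print the coefficients of the first two splits only
    if i in (0, 1):
        print(lin_reg.coef_)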
This leads to the following output:
Code Output
As expected, coef_ holds the coefficients for the 22 features I pass into the linear regression. However, in the second output some of the coefficients are far too large (e.g. 1.61e+14). I am fairly sure that the scaling with StandardScaler() works as it should, yet if I do not scale the training data before fitting, I do not get these huge coefficients. One important detail: the last 13 features are binary, whereas the first 9 are continuous (such as age). I suspect the problem is related to this, although the coefficient of the first binary feature is computed properly; only the last 12 binary features have excessively large coefficients.
Standardization is appropriate when the data are roughly Gaussian. Applying StandardScaler() to binary data doesn't make sense.
You should scale only the first 9 variables and then pass all 22 features to the linear regression.
https://www.atoti.io/when-to-perform-a-feature-scaling/
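A minimal sketch of that advice, assuming the 9 continuous features sit in the first 9 columns and the 13 binary ones follow. The data here is synthetic and only stands in for the asker's x and y; ColumnTransformer with remainder="passthrough" scales the listed columns and leaves the others untouched.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 9 continuous columns followed by 13 binary columns
rng = np.random.default_rng(0)
n_samples = 200
X_cont = rng.normal(50, 10, size=(n_samples, 9))
X_bin = rng.integers(0, 2, size=(n_samples, 13)).astype(float)
X = np.hstack([X_cont, X_bin])
y = 0.5 * X_cont[:, 0] + 2.0 * X_bin[:, 0] + rng.normal(0, 1, n_samples)

# Scale only the continuous columns; binary columns pass through unchanged
continuous_cols = list(range(9))  # assumption about the column order
preprocess = ColumnTransformer(
    transformers=[("scale", StandardScaler(), continuous_cols)],
    remainder="passthrough",
)
model = make_pipeline(preprocess, LinearRegression())

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
model.fit(X_train, y_train)  # the scaler is fitted on the training split only
coefs = model.named_steps["linearregression"].coef_
print(coefs.shape)
```

Putting the scaler inside the pipeline also means that calling model.predict(X_test) applies the training-set scaling automatically.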
Avoid scaling binary columns in scikit-learn StandardScaler
For some reasons, I have base dataframes of the following structure
print(df1.shape)
display(df1.head())
print(df2.shape)
display(df2.head())
Where the top dataframe is my features set and my bottom is the output set. To turn this into a problem that is amenable to data modeling I first do:
x_train, x_test, y_train, y_test = train_test_split(df1, df2, train_size = 0.8)
I then have a split for 80% training and 20% testing.
Since the output set (df2; y_train/y_test) consists of individual measurements with no inherent meaning on their own, I compute pairwise distances between the labels so that each pair of observations gets a single output value. The distances are computed after z-scoring (the z-scoring code is not shown here, but it is done):
y_train = pdist(y_train, 'euclidean')
y_test = pdist(y_test, 'euclidean')
Similarly I then apply this strategy to the features set to generate pairwise distances between individual observations of each of the instances of each feature.
def feature_distances(input_vector):
    modified_vector = np.array(input_vector).reshape(-1, 1)
    vector_distances = pdist(modified_vector, 'euclidean')
    vector_distances = pd.Series(vector_distances)
    return vector_distances
x_train = x_train.apply(feature_distances, axis = 0)
x_test = x_test.apply(feature_distances, axis = 0)
I then proceed to train & test all of my models.
For now I am trying linear regression, random forest and XGBoost.
Is there an easy way to implement a cross-validation scheme for my dataset?
Since my problem requires calculating pairwise distances between observations, I am struggling to find an easy way to do cross-validation for parameter tuning.
GridSearchCV doesn't quite work here, since for each train/test split the distances have to be recomputed to avoid contaminating the test set with training data.
Hope it's clear!
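One way to do what the question asks is a hand-written loop: split the raw observations with KFold and recompute the pairwise distances inside each fold. The sketch below is only an illustration under assumptions: the data is random, Ridge and its alpha grid are placeholders for whichever model is being tuned, the helper names are made up, and the z-scoring step is omitted.

```python
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, ParameterGrid


def pairwise_features(X):
    """Per-feature pairwise distances: one row per pair of observations."""
    X = np.asarray(X, dtype=float)
    return np.column_stack([pdist(X[:, [j]], "euclidean") for j in range(X.shape[1])])


def pairwise_target(Y):
    """Euclidean distance between the output vectors of each pair of observations."""
    return pdist(np.asarray(Y, dtype=float), "euclidean")


def cv_score(X, Y, params, n_splits=5, seed=0):
    """Mean validation MSE; distances are recomputed separately in every fold."""
    scores = []
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, val_idx in folds.split(X):
        # Pairs are formed only within the training rows or within the validation rows
        model = Ridge(**params)
        model.fit(pairwise_features(X[train_idx]), pairwise_target(Y[train_idx]))
        pred = model.predict(pairwise_features(X[val_idx]))
        scores.append(mean_squared_error(pairwise_target(Y[val_idx]), pred))
    return float(np.mean(scores))


# Random stand-in data: 42 observations, 20 features, 5 output measurements
rng = np.random.default_rng(0)
X = rng.normal(size=(42, 20))
Y = rng.normal(size=(42, 5))

results = {p["alpha"]: cv_score(X, Y, p) for p in ParameterGrid({"alpha": [0.1, 1.0, 10.0]})}
best_alpha = min(results, key=results.get)
print(best_alpha)
```

Because the split happens on observations before any distance is computed, no training/validation pair ever mixes rows from both sides.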
First, from the shape of your data frames I understand that you have 42 samples and 1643 features in the input, and that each output vector consists of 392 values.
Huge input: if you are sure that your problem has 1643 features, you might use PCA to reduce the dimensionality instead of pairwise distances. You should also collect more than 42 samples to avoid overfitting, because that is not enough data to train and test your model.
Huge output: you could use sampled_softmax_loss to speed up training, as mentioned in the TensorFlow documentation. You could also read this here. If you do not want to follow this approach, you can keep training with this output, but it will take some time.
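To illustrate the PCA suggestion, here is a rough sketch that puts PCA inside a scikit-learn pipeline so that it is refitted on the training folds of every cross-validation split. The data is random, the single target y is an assumption (the question has 392 outputs), and n_components=10 is an arbitrary choice.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Random stand-in with the same shape as the question's input: 42 samples, 1643 features
rng = np.random.default_rng(0)
X = rng.normal(size=(42, 1643))
y = X[:, :3].sum(axis=1) + rng.normal(scale=0.1, size=42)  # hypothetical single target

# Scaling and PCA are fitted on the training folds only, which avoids leakage
model = make_pipeline(StandardScaler(), PCA(n_components=10), LinearRegression())
scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
print(scores.mean())
```

The number of components cannot exceed the number of training samples in a fold, so with 42 samples it has to stay small.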
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=n)
Here X holds the independent features and y is the dependent variable, i.e. what you actually want to predict; it can be a label or a continuous value. train_test_split is applied to the full dataset: (x_train, y_train) is used to train the model and (x_test, y_test) to check its performance on unseen data. In your case you passed df2 as y, which is wrong: work out which column is your target and pass that as y. There is no need to split the test data any further.
I have a dataset with 1400 obs and 19 columns. The Target variable has values 1 (value that I am most interested in) and 0. The distribution of classes shows imbalance (70:30).
Using the code below I am getting suspicious values (all 1s). I can't figure out whether this is due to overfitting, to the class imbalance, or to feature selection (I used Pearson correlation since all values are numeric/boolean).
I suspect that the steps I followed are wrong.
import math

import numpy as np
from sklearn.metrics import classification_report, f1_score
from sklearn.tree import DecisionTreeClassifier

y = df['Label']
X = df.drop('Label', axis=1)

def create_cv(X, y):
    # Work on NumPy arrays so the boolean masks below index by position
    if type(X) != np.ndarray:
        X = X.values
        y = y.values
    test_size = 1/5
    proportion_of_true = y[y == 1].shape[0] / y.shape[0]
    num_test_samples = math.ceil(y.shape[0] * test_size)
    num_test_true_labels = math.floor(num_test_samples * proportion_of_true)
    num_test_false_labels = math.floor(num_test_samples - num_test_true_labels)
    # The first rows of each class go to the test set, the rest to the training set (no shuffling)
    y_test = np.concatenate([y[y == 0][:num_test_false_labels], y[y == 1][:num_test_true_labels]])
    y_train = np.concatenate([y[y == 0][num_test_false_labels:], y[y == 1][num_test_true_labels:]])
    X_test = np.concatenate([X[y == 0][:num_test_false_labels], X[y == 1][:num_test_true_labels]], axis=0)
    X_train = np.concatenate([X[y == 0][num_test_false_labels:], X[y == 1][num_test_true_labels:]], axis=0)
    return X_train, X_test, y_train, y_test

X_train, X_test, y_train, y_test = create_cv(X, y)
# Split the training set again to hold out a validation set
X_train, X_crossv, y_train, y_crossv = create_cv(X_train, y_train)

tree = DecisionTreeClassifier(max_depth=5)
tree.fit(X_train, y_train)
y_predict_test = tree.predict(X_test)
print(classification_report(y_test, y_predict_test))
f1_score(y_test, y_predict_test)
Output:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        24
           1       1.00      1.00      1.00        70

    accuracy                           1.00        94
   macro avg       1.00      1.00      1.00        94
weighted avg       1.00      1.00      1.00        94
Has anyone experienced similar issues in building a classifier when data has imbalance, using CV and/or under sampling? Happy to share the whole dataset, in case you might want to replicate the output.
What I am asking for is a clear answer that shows me the right steps and what I am doing wrong.
I know that there are methods to reduce overfitting and to work with imbalanced data, such as random sampling (over/under), SMOTE and CV. My idea is:
1. Split the data into train/test sets, taking the imbalance into account
2. Perform CV on the train set
3. Apply undersampling only on a test fold
4. After the model has been chosen with the help of CV, undersample the train set and train the classifier
5. Estimate the performance (f1-score) on the untouched test set
as also outlined in this question: CV and under sampling on a test fold.
I think the steps above should make sense, but happy to receive any feedback that you might have on this.
When you have imbalanced data you have to perform stratification. The usual way is to oversample the class that has fewer samples.
Another option is to train your algorithm with less data. If you have a good dataset that should not be a problem. In this case you first take the samples from the less represented class, then use the size of that set to compute how many samples to take from the other class.
This code may help you split your dataset that way:
import random

import pandas as pd

def split_dataset(dataset: pd.DataFrame, train_share=0.8):
    """Splits the dataset into training and test sets"""
    all_idx = range(len(dataset))
    train_count = int(len(all_idx) * train_share)
    train_idx = random.sample(all_idx, train_count)
    test_idx = list(set(all_idx).difference(set(train_idx)))
    train = dataset.iloc[train_idx]
    test = dataset.iloc[test_idx]
    return train, test

def split_dataset_stratified(dataset, target_attr, positive_class, train_share=0.8):
    """Splits the dataset as in `split_dataset` but with stratification"""
    data_pos = dataset[dataset[target_attr] == positive_class]
    data_neg = dataset[dataset[target_attr] != positive_class]
    if len(data_pos) < len(data_neg):
        train_pos, test_pos = split_dataset(data_pos, train_share)
        train_neg, test_neg = split_dataset(data_neg, len(train_pos)/len(data_neg))
        # set.difference makes the test set larger
        test_neg = test_neg.iloc[0:len(test_pos)]
    else:
        train_neg, test_neg = split_dataset(data_neg, train_share)
        train_pos, test_pos = split_dataset(data_pos, len(train_neg)/len(data_pos))
        # set.difference makes the test set larger
        test_pos = test_pos.iloc[0:len(test_neg)]
    # DataFrame.append was removed in pandas 2.0, so concatenate with pd.concat instead
    train = pd.concat([train_pos, train_neg]).sample(frac=1).reset_index(drop=True)
    test = pd.concat([test_pos, test_neg]).sample(frac=1).reset_index(drop=True)
    return train, test
Usage:
train_ds, test_ds = split_dataset_stratified(data, target_attr, positive_class)
You can now perform cross validation on train_ds and evaluate your model in test_ds.
There is another solution at the model level: using models that support sample weights, such as gradient boosted trees. Of those, CatBoost is usually the best, as its training method leads to less leakage (as described in their article).
Example code:
from catboost import CatBoostClassifier
y = df['Label']
X = df.drop('Label',axis=1)
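# scale_pos_weight multiplies the weight of class 1, so use negatives / positives to balance the classes
label_ratio = (y==0).sum() / (y==1).sum()
model = CatBoostClassifier(scale_pos_weight = label_ratio)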
model.fit(X, y)
And so forth.
This works because CatBoost assigns a weight to each sample, so you can set the class weights in advance (scale_pos_weight).
This is better than downsampling, since no data is discarded, and it has much the same effect as oversampling by duplication (but requires less memory).
Also, a major part of treating imbalanced data is making sure your metrics are weighted as well, or at least well-defined, depending on whether you want equal or skewed performance across the classes.
And if you want a more visual output than sklearn's classification_report, you can use one of the Deepchecks built-in checks (disclosure - I'm one of the maintainers):
from deepchecks.checks import PerformanceReport
from deepchecks import Dataset
PerformanceReport().run(Dataset(train_df, label='Label'), Dataset(test_df, label='Label'), model)
Your implementation of the stratified train/test creation is not optimal, as it lacks randomness. Data very often comes in batches, so it is not good practice to take sequences of data as they are, without shuffling.
As @sturgemeister mentioned, a class ratio of 3:7 is not critical, so you should not worry too much about class imbalance. If you artificially change the class balance in training, you will need to compensate for it by multiplying by the prior for some algorithms.
As for your "perfect" results: either your model is overtrained or it really does classify the data perfectly. Use a different train/test split to check this.
Another point: your test set has only 94 data points. That is definitely not 1/5 of 1400. Check your numbers.
To get realistic estimates you need lots of test data. This is the reason to apply a cross-validation strategy.
As a general strategy for 5-fold CV I suggest the following:
1. Split your data into 5 folds with respect to the labels (this is called a stratified split, and you can use StratifiedKFold).
2. Take 4 folds and train your model. If you want to use under/oversampling, modify the data in those 4 training folds only.
3. Apply the model to the remaining fold. Do not under/oversample the data in the test fold. This way you get a realistic performance estimate. Save the results.
4. Repeat 2. and 3. for all test folds (5 times in total). Important: do not change the parameters of the model (e.g. tree depth) between folds; they should be the same for all of them.
5. Now all your data points have been tested without having been trained on. This is the core idea of cross-validation. Concatenate all the saved results and estimate the performance.
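The five steps above can be sketched roughly as follows. This is an illustration on synthetic data with an approximate 30:70 class ratio, not the asker's dataset; undersample is a hypothetical helper written for this example, and the tree depth is the fixed setting taken from the question.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier


def undersample(X, y, rng):
    """Randomly drop rows so that every class keeps as many rows as the rarest class."""
    classes, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()
    keep = np.concatenate(
        [rng.choice(np.flatnonzero(y == c), size=n_keep, replace=False) for c in classes]
    )
    rng.shuffle(keep)
    return X[keep], y[keep]


# Synthetic stand-in: 1400 rows, 18 features, roughly 30% class 0 and 70% class 1
X, y = make_classification(n_samples=1400, n_features=18, weights=[0.3, 0.7], random_state=0)

rng = np.random.default_rng(0)
oof_pred = np.empty_like(y)  # out-of-fold predictions, one per data point
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in folds.split(X, y):
    # Resample the training folds only; the test fold keeps its natural class ratio
    X_bal, y_bal = undersample(X[train_idx], y[train_idx], rng)
    clf = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_bal, y_bal)
    oof_pred[test_idx] = clf.predict(X[test_idx])

# Every point was predicted by a model that never saw it during training
score = f1_score(y, oof_pred)
print(round(score, 3))
```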
Cross-validation or held-out set
First of all, you are not doing cross-validation. You are splitting your data in a train/validation/test set, which is good, and often sufficient when the number of training samples is large (say, >2e4). However, when the number of samples is small, which is your case, cross-validation becomes useful.
It is explained in depth in scikit-learn's documentation. You start by taking a test set out of your data, as your create_cv function does. Then you split the remaining training data into e.g. 3 folds. Then, for i in {1, 2, 3}: train on the folds j != i and evaluate on fold i. The documentation explains it with nice, colorful figures; you should have a look! It can be quite cumbersome to implement, but fortunately scikit-learn does it out of the box.
As for the dataset being unbalanced, it is a very good idea to keep the same ratio of labels in each set. But again, you can let scikit handle it for you!
Purpose
Also, the purpose of cross-validation is to choose the right values for the hyper-parameters. You want the right amount of regularization: not too much (under-fitting) and not too little (over-fitting). If you're using a decision tree, the maximum depth (or the minimum number of samples per leaf) is the hyper-parameter that controls the regularization of your method.
Conclusion
Simply use GridSearchCV. You will have cross-validation and label balance done for you.
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# stratify=y keeps the class ratio the same in the train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/5, stratify=y)
tree = DecisionTreeClassifier()
parameters = {'min_samples_leaf': [1, 5, 10]}
# For a classifier, an integer cv uses StratifiedKFold, see documentation
clf = GridSearchCV(tree, parameters, cv=5)
clf.fit(X_train, y_train)
sorted(clf.cv_results_.keys())
You can also replace the cv argument with a fancier splitter, such as StratifiedGroupKFold (no intersection between groups).
I would also advise looking at random forests, which are less interpretable but are said to perform better in practice.
I just wanted to add thresholding and cost-sensitive learning to the list of possible approaches mentioned by the others. The former is well described here and consists of finding a new threshold for classifying positive vs negative classes (it is generally 0.5, but it can be treated as a hyper-parameter). The latter consists of weighting the classes to cope with their imbalance. This article was really useful to me for understanding how to deal with unbalanced data sets. In it you can also find cost-sensitive learning explained specifically for a decision tree model. All the other approaches are also nicely reviewed, including adaptive synthetic sampling, informed undersampling, etc.
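As a small illustration of both ideas, the sketch below combines class weighting (class_weight="balanced" in scikit-learn) with a threshold chosen on a validation set. The synthetic data, the logistic regression model and the threshold grid are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: roughly 70% class 0 and 30% class 1
X, y = make_classification(n_samples=1400, n_features=18, weights=[0.7, 0.3], random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Cost-sensitive learning: errors on the rarer class weigh more during training
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_train, y_train)

# Thresholding: treat the decision threshold as a hyper-parameter tuned on validation data
proba = clf.predict_proba(X_val)[:, 1]
thresholds = np.linspace(0.1, 0.9, 17)
scores = [f1_score(y_val, (proba >= t).astype(int)) for t in thresholds]
best_threshold = float(thresholds[int(np.argmax(scores))])
print(best_threshold)
```

The chosen threshold should then be checked on a separate test set, since it was tuned on the validation data.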
Could you briefly explain what the lines of code below mean? This is the code for logistic regression in Python.
What do test_size=0.25 and random_state=0 mean? And what is train_test_split? What was done in this line of code?
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.25,random_state=0)
And what was done in these lines of code?
logistic_regression= LogisticRegression()
logistic_regression.fit(X_train,y_train)
y_pred=logistic_regression.predict(X_test)
Have a look at the description of the function here:
random_state sets the seed for the random number generator to give you the same result with each run, especially useful in education settings to give everyone an identical result.
test_size refers to the proportion used in the test split, here 75% of the data is used for training, 25% is used for testing the model.
The other lines simply run the logistic regression on the training dataset. You then use the test dataset to check the goodness of the fitted regression.
What do test_size=0.25 and random_state=0 mean?
test_size=0.25 -> 25% of the data goes to the test set and the remaining 75% to the training set.
random_state = 0 -> fixes the seed for reproducible results; this can be any number.
What was done in this line of code?
Splits X and y into X_train, X_test, y_train, y_test
And what was done in these lines of code?
Trains the logistic regression model through the fit(X_train, y_train) and then makes predictions on the test set X_test.
Later you probably compare y_pred to y_test to see what the accuracy of the model is.
Based on the documentation:
test_size : float, int or None, optional (default=None)
If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples. If None, the value is set to the complement of the train size. If train_size is also None, it will be set to 0.25.
This gives you the split between your train data and test data, if you have in total 1000 data points, a test_size=0.25 would mean that you have:
750 data points for train
250 data points for test
The ideal size is still debated. For large datasets (1,000,000+) I currently prefer to set it to 0.1. Even before that I set aside another validation dataset, which I keep completely out until I decide to run the algorithm.
random_state : int, RandomState instance or None, optional
(default=None)
If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.
For machine learning you should set this to a fixed value. If you do, you can open your program on another day and still produce the same results. random_state is normally also available in all classifiers and regression models, so that you can work, tune, and keep everything reproducible.
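A quick way to see both parameters in action is the following sketch, which uses a made-up dataset of 1000 rows (X and y here are placeholders, not the asker's data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up dataset with 1000 rows and 2 columns
X = np.arange(2000).reshape(1000, 2)
y = np.arange(1000) % 2

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
print(X_train.shape, X_test.shape)  # (750, 2) (250, 2)

# Same seed: the split is identical from run to run
_, X_test_again, _, _ = train_test_split(X, y, test_size=0.25, random_state=0)
same_split = np.array_equal(X_test, X_test_again)

# Different seed: a different random split
_, X_test_other, _, _ = train_test_split(X, y, test_size=0.25, random_state=1)
other_split_differs = not np.array_equal(X_test, X_test_other)
print(same_split, other_split_differs)
```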
To comment your regression:
logistic_regression= LogisticRegression()
logistic_regression.fit(X_train,y_train)
y_pred=logistic_regression.predict(X_test)
1. Creates your regression model; in Python this only instantiates it and gives it a name.
2. Fits your logistic regression on your training set; in this example it uses 750 data points to train the regression. Training means that the weights of the logistic regression are adjusted to minimize the loss on the 750 entries, so that the estimates fit y_train.
3. Uses the weights learned in step 2 to produce the estimate y_pred from X_test.
After that you can test your results: you now have the y_pred you calculated and the real y_test, so you can calculate accuracy scores and see how well the regression was trained.
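That last comparison could look like the sketch below. It runs the same three lines on synthetic data (an assumption, since the original X and y are not shown) and then scores the predictions with accuracy_score and confusion_matrix from scikit-learn.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic binary classification data standing in for X and y
X, y = make_classification(n_samples=1000, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

logistic_regression = LogisticRegression()      # 1. create the model
logistic_regression.fit(X_train, y_train)       # 2. learn the weights from the training set
y_pred = logistic_regression.predict(X_test)    # 3. predict labels for the unseen test set

# Compare the predictions with the true test labels
accuracy = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)
print(accuracy)
print(cm)
```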
This line:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.25,random_state=0)
divides your source into a train and a test set; 0.25 means that 25% of the source is used for testing and the rest for training.
For random_state = 0, here is a brief discussion.
An excerpt from the link above:
if you use random_state=some_number, then you can guarantee that the
output of Run 1 will be equal to the output of Run 2,
logistic_regression= LogisticRegression() #Creates logistic regressor
Calculates some values for your source. Recommended read
logistic_regression.fit(X_train,y_train)
An excerpt from the link above:
Here the fit method, when applied to the training dataset,learns the
model parameters (for example, mean and standard deviation)
....
It doesn't matter what the actual random_state number is: 42, 0, 21, ... The important thing is that every time you use 42, you will always get the same output the first time you make the split. This is useful if you want reproducible results.
Perform prediction on test set based on the learning from training set.
y_pred=logistic_regression.predict(X_test)
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.25,random_state=0)
Above line splits your data into training and testing data randomly
X is your dataset minus output variable
y is your output variable
test_size=0.25 means you are dividing data into 75%-25% where 25% is your testing dataset
random_state is used for generating same sample again when you run the code
Refer to the train_test_split documentation.
I have been trying to use XGBRegressor in Python. It is by far one of the best ML techniques I have used. However, on some data sets I get a very high training R-squared, but the model performs really poorly in prediction or testing. I have tried playing with gamma, depth, and subsampling to reduce the complexity of the model and to make sure it is not overfitted, but there is still a huge difference between training and testing. I was wondering if someone could help me with this:
Below is the code I am using:
import xgboost
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=100)

# Note: the scaler is fitted here but never applied to X_train or X_test
scaler = MinMaxScaler()
scaler.fit(X_train)

xgb = xgboost.XGBRegressor(colsample_bytree=0.7,
                           gamma=0,
                           learning_rate=0.01,
                           max_depth=1,
                           min_child_weight=1.5,
                           n_estimators=100000,
                           reg_alpha=0.75,
                           reg_lambda=0.45,
                           subsample=0.8,
                           seed=1000)
Here is the performance in training vs testing:
Training :
MAE: 0.10 R^2: 0.99
Testing:
MAE: 1.47 R^2: -0.89
XGBoost tends to overfit the data, so reduce n_estimators and max_depth, and keep the iteration at which the train loss and the validation loss do not differ much.
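The idea of picking the boosting iteration from a validation curve can be sketched as follows. To keep the example free of version-specific XGBoost arguments, it uses scikit-learn's GradientBoostingRegressor as a stand-in, whose staged_predict yields the prediction after each boosting round; the data and all parameter values are assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Small synthetic regression problem, where overfitting shows up quickly
X, y = make_regression(n_samples=300, n_features=20, n_informative=5, noise=20.0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=100)

model = GradientBoostingRegressor(
    n_estimators=500, learning_rate=0.05, max_depth=2, subsample=0.8, random_state=0
)
model.fit(X_train, y_train)

# Error after each boosting round, on the training and on the validation data
train_mae = [mean_absolute_error(y_train, p) for p in model.staged_predict(X_train)]
val_mae = [mean_absolute_error(y_val, p) for p in model.staged_predict(X_val)]

# Keep the number of trees at which the validation error is lowest
best_n_estimators = int(np.argmin(val_mae)) + 1
print(best_n_estimators, val_mae[best_n_estimators - 1])
```

The training error usually keeps falling as trees are added, while the validation error flattens or rises; the gap between the two curves is the overfitting described in the question.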
The issue here is overfitting. You need to tune some of the parameters (Source).
n_estimators: 80-200 if the dataset is large (on the order of 100,000 rows, i.e. a lakh), 800-1200 if it is medium or small
learning_rate: between 0.01 and 0.1
subsample: between 0.8 and 1
colsample_bytree: fraction of columns used by each tree. Values from 0.3 to 0.8 if you have many features or columns, or 0.8 to 1 if you have only a few.
gamma: either 0, 1 or 5
Since you have already set max_depth very low, you can try tuning the parameters above. Also, if your dataset is very small, a gap between training and test performance is expected. You need to check whether the data is split well between training and test. For example, check whether the test data has roughly equal percentages of Yes and No in the output column.
You need to try various options. XGBoost and random forest will certainly give an overfitted model on small datasets. You can try:
1. Naive Bayes. It is good for small datasets, but it gives all features the same weight.
2. Logistic regression: try tuning the regularisation parameter and see where your recall score peaks. Another thing to try here is class_weight='balanced'.
3. Logistic regression with cross-validation: this is good for small data as well. Finally, as I said earlier, check your data and make sure it is not biased towards one kind of result. For example, if the result is Yes in 50 cases out of 70, it is highly biased and you may not get high accuracy.