I am trying to solve a classification problem on a given dataset, through logistic regression (and this is not the problem). To avoid overfitting I'm trying to implement it through cross-validation (and here's the problem): there's something that I'm missing to complete the program. My purpose here is to determine accuracy.
But let me be specific. This is what I've done:
I split the set into train set and test set
I defined the logistic regression model to be used
I used the cross_val_predict method (in sklearn.cross_validation) to make predictions
Lastly, I measured accuracy
Here is the code:
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.cross_validation import train_test_split
from sklearn import metrics, cross_validation
from sklearn.linear_model import LogisticRegression
# read training data in pandas dataframe
data = pd.read_csv("./dataset.csv", delimiter=';')
# last column is target, store in array t
t = data['TARGET']
# list of features, including target
features = data.columns
# item feature matrix in X
X = data[features[:-1]].as_matrix()
# remove first column because it is not necessary in the analysis
X = np.delete(X,0,axis=1)
# divide in training and test set
X_train, X_test, t_train, t_test = train_test_split(X, t, test_size=0.2, random_state=0)
# define method
logreg=LogisticRegression()
# cross validation prediction
predicted = cross_validation.cross_val_predict(logreg, X_train, t_train, cv=10)
print(metrics.accuracy_score(t_train, predicted))
My problems:
From what I understand, the test set should not be touched until the very end, and cross-validation should be performed on the training set. That's why I passed X_train and t_train to cross_val_predict. Though, I get an error saying:
ValueError: Found input variables with inconsistent numbers of samples: [6016, 4812]
where 6016 is the number of samples in the whole dataset, and 4812 is the number of samples in the training set after the dataset has been split
After this, I don't know what to do. I mean: when do the X_test and t_test come into play? I don't get how I should use them after cross-validating and how to get the final accuracy.
Bonus question: I'd also like to perform scaling and reduction of dimensionality (through feature selection or PCA) within each step of the cross-validation. How can I do this? I've seen that defining a pipeline can help with scaling, but I don't know how to apply this to the second problem.
I'd really appreciate any help :-)
Here is working code tested on a sample dataframe. The first issue in your code is that the target array is not an np.array. You also shouldn't have the target data in your features. Below I illustrate how to manually split the training and testing data using train_test_split, and how to use the wrapper cross_val_score to automatically split, fit, and score.
import string

import numpy as np
import pandas as pd
from sklearn import linear_model, model_selection

np.random.seed(42)  # seed numpy's RNG, which generates the data below
# Create example df with alphabetic col names.
alphabet_cols = list(string.ascii_uppercase)[:26]
df = pd.DataFrame(np.random.randint(1000, size=(1000, 26)),
                  columns=alphabet_cols)
df['Target'] = df['A']
df.drop(['A'], axis=1, inplace=True)
print(df.head())
y = df.Target.values  # df['Target'] is not an np.array.
feature_cols = [i for i in list(df.columns) if i != 'Target']
X = df.loc[:, feature_cols].values  # .ix and .as_matrix are deprecated
# Illustrated here for manual splitting of training and testing data.
X_train, X_test, y_train, y_test = \
    model_selection.train_test_split(X, y, test_size=0.2, random_state=0)
# Initialize model.
logreg = linear_model.LinearRegression()
# Use cross_val_score to automatically split, fit, and score.
scores = model_selection.cross_val_score(logreg, X, y, cv=10)
print(scores)
print('average score: {}'.format(scores.mean()))
Output
B C D E F G H I J K ... Target
0 20 33 451 0 420 657 954 156 200 935 ... 253
1 427 533 801 183 894 822 303 623 455 668 ... 421
2 148 681 339 450 376 482 834 90 82 684 ... 903
3 289 612 472 105 515 845 752 389 532 306 ... 639
4 556 103 132 823 149 974 161 632 153 782 ... 347
[5 rows x 26 columns]
[-0.0367 -0.0874 -0.0094 -0.0469 -0.0279 -0.0694 -0.1002 -0.0399 0.0328
-0.0409]
average score: -0.04258093018969249
Helpful references:
Convert from pandas to numpy
Select all but subset of columns of dataframe
sklearn.model_selection.train_test_split
sklearn.model_selection.cross_val_score
Please look at the scikit-learn documentation on cross-validation to understand it better.
Also, you are using cross_val_predict incorrectly. Internally it will use the cv you supplied (cv=10) to split the supplied data (i.e. X_train, t_train in your case) into train and test again, fit the estimator on the train part, and predict on the part that remains as test.
Now for the usage of your X_test, t_test: you should first fit your estimator on the train data (cross_val_predict will not fit) and then use it to predict on the test data, and then calculate accuracy.
Simple code snippet describing the above (borrowing from your code; do read the comments and ask if you don't understand anything):
# item feature matrix in X
X = data[features[:-1]].as_matrix()
# remove first column because it is not necessary in the analysis
X = np.delete(X,0,axis=1)
# divide in training and test set
X_train, X_test, t_train, t_test = train_test_split(X, t, test_size=0.2, random_state=0)
# Until here everything is good
# You keep away 20% of data for testing (test_size=0.2)
# This test data should be unseen by any of the below methods
# define method
logreg=LogisticRegression()
# Ideally what you are doing here should be correct, unless you did something wrong in the dataframe operations (which apparently has been solved)
# cross validation prediction
# This cross-validation prediction will print the predicted values of 't_train'
predicted = cross_validation.cross_val_predict(logreg, X_train, t_train, cv=10)
# internal working of cross_val_predict:
#1. Get the data and estimator (logreg, X_train, t_train)
#2. From here on, we will use X_train as X_cv and t_train as t_cv (because cross_val_predict doesn't know that it's our training data) - Doubts??
#3. Split X_cv, t_cv into X_cv_train, X_cv_test, t_cv_train, t_cv_test by using its internal cv
#4. Use X_cv_train, t_cv_train for fitting 'logreg'
#5. Predict on X_cv_test (No use of t_cv_test)
#6. Repeat steps 3 to 5 for cv=10 iterations, each time using different data for training and different data for testing.
# So here you are correctly comparing 'predicted' and 't_train'
print(metrics.accuracy_score(t_train, predicted))
# The above metric shows how our estimator 'logreg' performs on the 'X_train' data. If the accuracies are very high, it may be because of overfitting.
# Now what to do about the X_test and t_test above.
# Actually the correct pairing for the final metric is X_test and t_test
# If you are satisfied with the accuracies on the training data, you should fit the estimator on the entire training data and then predict on X_test
logreg.fit(X_train, t_train)
t_pred = logreg.predict(X_test)
# Here is the final accuracy
print(metrics.accuracy_score(t_test, t_pred))
# If this accuracy is good, then your model is good.
If you have little data or don't want to split it into training and testing sets, then you should use the approach suggested by @fuzzyhedge:
# Use cross_val_score on all of your data
scores = model_selection.cross_val_score(logreg, X, y, cv=10)
# 'cross_val_score' works almost the same as steps 1 to 4 above
#5. t_cv_pred = logreg.predict(X_cv_test) and calculate accuracy with t_cv_test.
#6. Repeat steps 1 to 5 for cv_iterations = 10
#7. Return the array of accuracies calculated in step 5.
# Average the returned accuracies to see the model performance
avg_score = scores.mean()
Note - cross-validation is best used together with grid search to find the estimator parameters that perform best for the given data.
For example, LogisticRegression has many parameters defined. But
logreg = LogisticRegression()
will initialize the model with only the default parameters. A different parameter value, e.g.
logreg = LogisticRegression(penalty='l1', solver='liblinear')
may perform better for your data. This search for better parameters is grid search.
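As a minimal sketch of such a search (assuming the X_train and t_train from above; the grid values are arbitrary examples of mine), GridSearchCV cross-validates every parameter combination and keeps the best one:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Example grid; tune the values for your own data.
param_grid = {'C': [0.01, 0.1, 1, 10], 'penalty': ['l1', 'l2']}
# liblinear supports both the l1 and l2 penalties listed in the grid
grid = GridSearchCV(LogisticRegression(solver='liblinear'),
                    param_grid, cv=10, scoring='accuracy')
grid.fit(X_train, t_train)
print(grid.best_params_, grid.best_score_)
grid.best_estimator_ then holds the refitted model with the winning parameters.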
Now, as for the second part of your question about scaling, dimensionality reduction etc. using a pipeline, you can refer to the pipeline documentation and the following examples:
http://scikit-learn.org/stable/auto_examples/feature_stacker.html#sphx-glr-auto-examples-feature-stacker-py
http://scikit-learn.org/stable/auto_examples/plot_digits_pipe.html#sphx-glr-auto-examples-plot-digits-pipe-py
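As a minimal sketch of the idea (n_components=10 is just a placeholder I picked): put the scaler, the PCA step and the classifier into a single Pipeline and pass that pipeline to cross_val_score, so scaling and PCA are re-fit inside every fold:
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Each fold re-fits the scaler and PCA on that fold's training part only,
# so no information leaks from the held-out part of the fold.
pipe = Pipeline([('scale', StandardScaler()),
                 ('pca', PCA(n_components=10)),  # placeholder value
                 ('logreg', LogisticRegression())])
scores = cross_val_score(pipe, X_train, t_train, cv=10)
print(scores.mean())
The same pipe object can later be fit on the full training set and used to predict on X_test, exactly like the plain logreg above.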
Feel free to contact me if you need any help.
Related
I created a very simple function to test XGBoost.
X is an array of 1000 values evenly spaced between 0 and "7*np.pi".
Y is simply "1 + 0.5*np.sin(x)"
I split the dataset into 800 training and 200 testing rows. Shuffle MUST be False to simulate future occurrences, making sure the last 200 rows are reserved for testing.
import numpy as np
from sklearn.model_selection import train_test_split
from matplotlib import pyplot as plt
from sklearn.metrics import mean_squared_error as MSE
from xgboost import XGBRegressor
N = 1000 # 1000 rows
x = np.linspace(0, 7*np.pi, N) # Simple function
y = 1 + 0.5*np.sin(x) # Generate simple function sin(x) as y
# Train-test split, intentionally use shuffle=False to simulate time series
X = x.reshape(-1,1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, shuffle=False)
### Interestingly, the model generalizes well if shuffle=True
#X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, shuffle=True)
XGB_reg = XGBRegressor(random_state=42)
XGB_reg.fit(X_train,y_train)
# EVALUATE ON TRAIN DATA
yXGBPredicted = XGB_reg.predict(X_train)
rmse = np.sqrt(MSE(y_train, yXGBPredicted))
print("RMSE TRAIN XGB: % f" %(rmse))
# EVALUATE ON TEST DATA
yXGBPredicted = XGB_reg.predict(X_test)
# XGB metrics
rmse = np.sqrt(MSE(y_test, yXGBPredicted))
print("RMSE TEST XGB: % f" %(rmse))
# Predict full dataset
yXGB = XGB_reg.predict(X)
# Plot and compare
plt.style.use('fivethirtyeight')
plt.rcParams.update({'font.size': 16})
fig, ax = plt.subplots(figsize=(10,5))
plt.plot(x, y)
plt.plot(x, yXGB)
plt.ylim(0,2)
plt.xlabel("x")
plt.ylabel("y")
plt.show()
I trained the model on the first 800 rows and then predicted the next 200 rows.
I was expecting the testing data to also have a good RMSE, but that did not happen.
I was surprised to see that XGBoost simply repeated the last value of the training set on all rows of the predictions (see chart).
Any ideas why this doesn't work?
You're asking your model to "extrapolate": to make predictions for x values that are greater than the x values in the training dataset. Extrapolation works with some model types (such as linear models), but it typically does not work with decision tree models and their ensembles (such as XGBoost).
If you switch from XGBoost to LightGBM, then you can train extrapolation-capable decision tree ensembles using the "linear tree" approach:
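A rough sketch (assuming LightGBM >= 3.0 and the X_train/y_train from the question; the variable names are mine):
from lightgbm import LGBMRegressor

# linear_tree=True fits a linear model in each leaf instead of a constant,
# so predictions can continue a trend beyond the training range of x.
lgbm_reg = LGBMRegressor(linear_tree=True, random_state=42)
lgbm_reg.fit(X_train, y_train)
y_lgbm = lgbm_reg.predict(X_test)  # no longer a flat line past the training range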
Any ideas why this doesn't work?
Your XGBRegressor is probably over-fitted (has n_estimators = 100 and max_depth = 6). If you decrease those parameter values, then the red line will appear more jagged, and it will be easier for you to see it "working".
Right now, if you ask your over-fitted XGBRegressor to extrapolate, then it basically functions as a giant look-up table. When extrapolating towards +Inf, then the "closest match" is at x = 17.5; when extrapolating towards -Inf, then the "closest match" is at x = 0.0.
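For illustration only, a sketch with deliberately small, made-up parameter values (not the asker's setup):
from xgboost import XGBRegressor

# Few, shallow trees make the piecewise-constant "look-up table" behaviour
# of the ensemble clearly visible when plotted.
small_reg = XGBRegressor(n_estimators=10, max_depth=2, random_state=42)
small_reg.fit(X_train, y_train)
y_small = small_reg.predict(X)  # still flat outside the training range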
For some reason, I have base dataframes of the following structure:
print(df1.shape)
display(df1.head())
print(df2.shape)
display(df2.head())
The top dataframe is my feature set and the bottom one is my output set. To turn this into a problem amenable to data modeling, I first do:
x_train, x_test, y_train, y_test = train_test_split(df1, df2, train_size = 0.8)
I then have a split for 80% training and 20% testing.
Since the output set (df2; y_test/y_train) consists of individual measurements with no inherent meaning on their own, I calculate pairwise distances between the labels to generate a single output value denoting the pairwise distance between observations (the distances are computed after z-scoring; the z-scoring code isn't shown here, but it is done):
y_train = pdist(y_train, 'euclidean')
y_test = pdist(y_test, 'euclidean')
I then apply the same strategy to the feature set, generating pairwise distances between the individual observations for each feature.
def feature_distances(input_vector):
    modified_vector = np.array(input_vector).reshape(-1,1)
    vector_distances = pdist(modified_vector, 'euclidean')
    vector_distances = pd.Series(vector_distances)
    return vector_distances
x_train = x_train.apply(feature_distances, axis = 0)
x_test = x_test.apply(feature_distances, axis = 0)
I then proceed to train & test all of my models.
For now I am trying linear regression, random forest, and xgboost.
Is there an easy way to implement a cross-validation scheme for my dataset?
Since my problem requires calculating pairwise distances between observations, I am struggling to find an easy way to set up a cross-validation scheme for parameter tuning.
GridSearchCV doesn't quite work here, since for each train/test split the distances have to be recomputed to avoid contamination of the test set by the training set. For concreteness, a sketch of the kind of manual scheme I have in mind follows below.
Hope it's clear!
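The manual scheme I imagine (a sketch assuming the df1, df2 and feature_distances above; per-fold z-scoring omitted for brevity) would use KFold and recompute the pairwise distances inside each fold, so nothing crosses the train/validation boundary:
from scipy.spatial.distance import pdist
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, val_idx in kf.split(df1):
    # Distances are computed separately per fold, so training observations
    # are never paired with validation observations.
    y_tr = pdist(df2.iloc[train_idx], 'euclidean')
    y_val = pdist(df2.iloc[val_idx], 'euclidean')
    X_tr = df1.iloc[train_idx].apply(feature_distances, axis=0)
    X_val = df1.iloc[val_idx].apply(feature_distances, axis=0)
    model = LinearRegression().fit(X_tr, y_tr)
    print(r2_score(y_val, model.predict(X_val)))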
First, what I understood from the shapes of your data frames is that you have 42 samples and 1643 features in the input, and each output vector consists of 392 values.
Huge input: if your problem really has 1643 features, you might want to use PCA to reduce the dimensionality instead of pairwise distances. You should also collect more than 42 samples, because that is not enough data to train and test your model without overfitting.
Huge output: you could use sampled_softmax_loss to speed up the training process, as mentioned in the TensorFlow documentation. If you do not want to follow this approach, you can continue training with this output, but it will take some time.
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Here X is the independent features and y is the dependent feature, i.e. what you actually want to predict - it can be a label or a continuous value. We use train_test_split on the dataset, train the model on (x_train, y_train), and then test it on (x_test, y_test) to measure the model's performance on unseen data. In your case you have given df2 as y, which is wrong; figure out your target feature and pass it as y, and there is no need to split the test data separately.
I have a dataframe with 36540 rows. The objective is to predict y (HITS_DAY).
#data
https://github.com/soufMiashs/Predict_Hits
I am trying to train a non-linear regression model, but the model doesn't seem to learn much.
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.20, random_state=42)
data_dmatrix = xgb.DMatrix(data=x,label=y)
xg_reg = xgb.XGBRegressor(learning_rate=0.1, objectif='reg:linear', max_depth=5,
                          n_estimators=1000)
xg_reg.fit(X_train,y_train)
preds = xg_reg.predict(X_test)
df=pd.DataFrame({'ACTUAL':y_test, 'PREDICTED':preds})
what am I doing wrong?
You're not doing anything wrong in particular (except maybe the objectif parameter for xgboost, which doesn't exist); however, you have to consider how xgboost works. It will try to create "trees", and trees have splits based on the values of the features. From the plot you show here, it looks like there are very few samples that go above 0, so making the train/test split random will likely result in a test set with virtually no samples with a value above 0 (hence a horizontal line).
Other than that, it seems you want to fit a linear model on non-linear data. Selecting a different objective function is likely to help with this.
Finally, how do you know that your model is not learning anything? I don't see any evaluation metrics to confirm this. Try to think of meaningful evaluation metrics for your model and show them. This will help you determine if your model is "good enough".
To summarize:
Fix the imbalance in your dataset (or at least take it into consideration)
Select an appropriate objective function
Check evaluation metrics that make sense for your model
From this example it looks like your model is indeed learning something, even without parameter tuning (which you should do!).
import pandas
import xgboost
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
# Read the data
df = pandas.read_excel("./data.xlsx")
# Split in X and y
X = df.drop(columns=["HITS_DAY"])
y = df["HITS_DAY"]
# Show the values of the full dataset in a plot
y.sort_values().reset_index()["HITS_DAY"].plot()
# Split in test and train, use stratification to make sure the 2 groups look similar
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=[element > 1 for element in y.values]
)
# Show the plots of the test and train set (make sure they look similar!)
y_train.sort_values().reset_index()["HITS_DAY"].plot()
y_test.sort_values().reset_index()["HITS_DAY"].plot()
# Create the regressor
estimator = xgboost.XGBRegressor(objective="reg:squaredlogerror")
# Fit the regressor
estimator.fit(X_train, y_train)
# Predict on the test set
predictions = estimator.predict(X_test)
df = pandas.DataFrame({"ACTUAL": y_test, "PREDICTED": predictions})
# Show the actual vs predicted
df.sort_values("ACTUAL").reset_index()[["ACTUAL", "PREDICTED"]].plot()
# Show some evaluation metrics
print(f"Mean squared error: {mean_squared_error(y_test.values, predictions)}")
print(f"R2 score: {r2_score(y_test.values, predictions)}")
Output:
Mean squared error: 0.01525351142868279
R2 score: 0.07857787102063485
I need to generate a random n-dimensional dataset having m tuples. The first four dimensions are expected to be correlated with the ground truth vector y and the remaining ones are to be arbitrarily generated. I will use the dataset for my regression task using Scikit-learn. How can I generate this data?
for example:
A dataset where
tuple size(m)=10000
and
dimension size(n)=100
After that, I need to split the dataset such that randomly selected 70% tuples are used for training while 30% tuples are used for testing.
PS: I have found this code in scikit-learn, but I am not sure if I can use it. How can I translate it to my problem?
x, y, coef = datasets.make_regression(n_samples=100,    # number of samples
                                      n_features=1,     # number of features
                                      n_informative=1,  # number of useful features
                                      noise=10,         # standard deviation of the gaussian noise
                                      coef=True,        # also return the coefficients used to generate the data
                                      random_state=0)   # same data points for each run
Indeed, make_regression will do your job:
from sklearn import datasets
# your example values:
m = 10000
n = 100
random_seed = 42 # for reproducibility, exact value does not matter
X, y = datasets.make_regression(n_samples=m, n_features=n, n_informative=4, random_state = random_seed)
And train_test_split for splitting:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=random_seed) # does not need to be the same random_seed
In both methods, random_seed is set only for reproducibility - its exact value does not matter, nor does it need to be the same value in both.
I am asking the question here, even though I hesitated to post it on CrossValidated (or DataScience) StackExchange. I have a dataset of 60 labeled objects (to be used for training) and 150 unlabeled objects (for test). The aim of the problem is to predict the labels of the 150 objects (this used to be given as a homework problem). For each object, I computed 258 features. Considering each object as a sample, I have X_train : (60,258), y_train : (60,) (labels of the objects used for training) and X_test : (150,258). Since the solution of the homework problem was given, I also have the true labels of the 150 objects, in y_test : (150,).
In order to predict the labels of the 150 objects, I choose to use a LogisticRegression (the Scikit-learn implementation). The classifier is trained on (X_train, y_train), after the data has been normalized, and used to make predictions for the 150 objects. Those predictions are compared to y_test to assess the performance of the model. For reproducibility, I copy the code I have used.
from sklearn import metrics
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score, cross_val_predict
# Fit classifier
LogReg = LogisticRegression(C=1, class_weight='balanced')
scaler = StandardScaler()
clf = make_pipeline(StandardScaler(), LogReg)
LogReg.fit(X_train, y_train)
# Performance on training data
CV_score = cross_val_score(clf, X_train, y_train, cv=10, scoring='roc_auc')
print(CV_score)
# Performance on test data
probas = LogReg.predict_proba(X_test)[:, 1]
AUC = metrics.roc_auc_score(y_test, probas)
print(AUC)
The matrices X_train,y_train,X_test and y_test are saved in a .mat file available at this link. My problem is the following :
Using this approach, I get good performance on the training data (CV_score = 0.8), but the performance on the test data is much worse: AUC = 0.54 for C=1 in LogReg and AUC = 0.40 for C=0.01. How can I get AUC < 0.5 if a naive classifier should score AUC = 0.5? Is this due to the fact that I have a small number of samples for training?
I have noticed that the performance on the test data improves if I change the code to:
y_pred = cross_val_predict(clf, X_test, y_test, cv=5)
AUC = metrics.roc_auc_score(y_test, y_pred)
print(AUC)
Indeed, AUC=0.87 for C=1 and 0.9 for C=0.01. Why is the AUC score so much better when using cross-validation predictions? Is it because cross-validation makes predictions on subsets of the test data which do not contain the objects/samples that decrease the AUC?
Looks like you are encountering an overfitting problem: the classifier trained on the training data is overfitting to it and has poor generalization ability. That is why the performance on the testing dataset isn't good.
cross_val_predict is actually training the classifier on part of your testing data and then predicting on the rest. That is why the performance is so much better.
Overall, there seems to be quite some difference between your training and testing datasets, so the classifier with the highest training accuracy doesn't work well on your testing set.
Another point, not directly related to your question: since the number of training samples is much smaller than the feature dimension, it may help to perform dimensionality reduction before feeding the data to the classifier.
It looks like your training and test processes are inconsistent. Although your code shows you intend to standardize your data, you fail to do so during testing. What I mean:
clf = make_pipeline(StandardScaler(), LogReg)
LogReg.fit(X_train, y_train)
Although you define a pipeline, you do not fit the pipeline (clf.fit) but only the logistic regression. This matters, because your cross-validated score is calculated with the pipeline (CV_score = cross_val_score(clf, X_train, y_train, cv=10, scoring='roc_auc')), but at test time, instead of using the pipeline to predict as expected, you use only LogReg; hence the test data are not standardized.
The second point you raise is different. In y_pred = cross_val_predict(clf, X_test, y_test, cv=5)
you get predictions by doing cross-validation on the test data while ignoring the train data. Here you do standardize the data, since you use clf, and thus your score is high; this is evidence that the standardization step is important.
To summarize: standardizing the test data should, I believe, improve your test score.
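Concretely, a minimal sketch of the fix, reusing the clf pipeline already defined in the question:
# Fit the whole pipeline, so the scaler fitted on the training data
# is also applied to the test data before predicting.
clf.fit(X_train, y_train)
probas = clf.predict_proba(X_test)[:, 1]
print(metrics.roc_auc_score(y_test, probas))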
Firstly, it makes no sense to have 258 features for 60 training items. Secondly, cv=10 with 60 items means you split the data into 10 train/test sets, each with only 6 items in the test set. So whatever results you obtain will be useless. You need more training data and fewer features.