I am running a random forest regression model, but my results are not that great. Someone recommended checking for interaction effects.
Surprisingly, I do not see many questions about this, and the one I found did not help me. I am also not sure how to incorporate sklearn.preprocessing.PolynomialFeatures into my code.
My data is very simple: the target is total_amount, and the predictors include income, age, and male (gender).
My code:
# Imports used below
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error as MSE

# Split data: y is the target (first column), X is all other columns
y = starbucks_log.iloc[:, 0]
X = starbucks_log.drop('total_amount', axis=1)

# Set seed for reproducibility
SEED = 1

# Split dataset into 80% train and 20% test
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    random_state=SEED)

# Instantiate a random forest regressor 'rf' with 400 estimators
rf = RandomForestRegressor(n_estimators=400,
                           min_samples_leaf=1,
                           random_state=SEED)

# Fit 'rf' to the training set
rf.fit(X_train, y_train)

# Predict the test and training set labels
y_pred = rf.predict(X_test)
y_pred_train = rf.predict(X_train)

# Evaluate the RMSE on both sets
rmse_test = MSE(y_test, y_pred) ** 0.5
rmse_train = MSE(y_train, y_pred_train) ** 0.5

# Print the RMSE
print('Test set RMSE of rf: {:.5f}'.format(rmse_test))
print('Train set RMSE of rf: {:.5f}'.format(rmse_train))
I would like to add all possible interaction effects of income, age, and male (gender); it would be easier to drop some of them later.
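From what I can tell, sklearn.preprocessing.PolynomialFeatures with interaction_only=True might be one way to generate these terms. Here is a minimal sketch of what I think it would look like (column names assumed from my description above; get_feature_names_out needs a recent scikit-learn):

from sklearn.preprocessing import PolynomialFeatures
import pandas as pd

# Columns I want interactions for
inter_cols = ['income', 'age', 'male']

# degree=2 with interaction_only=True keeps the original columns plus all pairwise products
# (degree=3 would also add the three-way income*age*male term)
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
inter = poly.fit_transform(X[inter_cols])

# Wrap back into a DataFrame so individual interaction terms can be dropped later
inter_df = pd.DataFrame(inter,
                        columns=poly.get_feature_names_out(inter_cols),
                        index=X.index)

# Replace the original three columns with the expanded set
X = pd.concat([X.drop(columns=inter_cols), inter_df], axis=1)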
Thanks!
Related
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics

# Split dataset into features and target variable
feature_cols = ['RIAGENDR_0', 'RIDAGEYR', 'RIDRETH3_2', 'RIDRETH3_3', 'RIDRETH3_4', 'RIDRETH3_6', 'RIDRETH3_7', 'INDFMPIR', 'DMDMARTZ_1.0', 'DMDMARTZ_2.0', 'DMDMARTZ_3.0', 'DMDMARTZ_4.0', 'DMDMARTZ_6.0', 'DMDEDUC2', 'RFXT010', 'BMXWT', 'BMXBMI', 'URXUMA', 'LBDHDD', 'LBXFER', 'LBXGH', 'LBXBPB', 'LBXBCD', 'LBXBSE', 'LBXBMN', 'URXUBA', 'URXUCD', 'URXUCO', 'URXUCS', 'URXUMO', 'URXUMN', 'URXUPB', 'URXUSB', 'URXUSN', 'URXUTL', 'URXUTU']
X = data[feature_cols]  # Features
scale = StandardScaler()
X = scale.fit_transform(X)
y = data['depre_score']  # Target variable

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)  # 70% training and 30% test

clf = DecisionTreeClassifier()
clf = clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(y_test)
print(y_pred)

confusion = metrics.confusion_matrix(y_test, y_pred)
print(confusion)
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))

recall_sensitivity = metrics.recall_score(y_test, y_pred, pos_label=1)
recall_specificity = metrics.recall_score(y_test, y_pred, pos_label=0)
print(recall_sensitivity, recall_specificity)
Why do you think you are doing something wrong? Perhaps your data are such that you can achieve a perfect classification; e.g., see this mushroom classification example.
Having said that, it is also possible that there is some leakage in your data, as mentioned by @gtomer: an exact point that is present in your training set may also appear in your test set. You can run a K-fold test on your data and see how the accuracy holds up. Secondly, try different classifiers too (Random Forests generally do better than a single Decision Tree).
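A rough sketch of that K-fold check, reusing the X and y defined above and a random forest as the alternative classifier (this is only an illustration, not part of the original code):

from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

# 5-fold cross-validation: if every fold scores near 1.0, the problem may simply
# be easy to classify; large swings between folds can hint at leakage or an unlucky split
rf_clf = RandomForestClassifier(random_state=1)
scores = cross_val_score(rf_clf, X, y, cv=5, scoring='accuracy')
print(scores, scores.mean())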
I am trying to learn about machine learning, and I am having trouble understanding when and how to use the validation set. I understand that it is used to evaluate candidate models before the final check against the test set, but I don't understand how to write this properly in code. Take for example this code I am working on:
# Imports used in this snippet
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt

# Split the set into train, validation, and test set (70:15:15 for train:valid:test)
X_train, X_rem, y_train, y_rem = train_test_split(X, y, train_size=0.7)  # split into training and remaining set
X_valid, X_test, y_valid, y_test = train_test_split(X_rem, y_rem, test_size=0.5)  # split the remaining data 50/50 into validation and test set
print("Properties (shapes):\nTraining set: {}\nValidation set: {}\nTest set: {}".format(X_train.shape, X_valid.shape, X_test.shape))
import warnings  # suppress warnings
warnings.filterwarnings('ignore')
# SCALING
std = StandardScaler()
minmax = MinMaxScaler()
rob = RobustScaler()
# Transforming the TRAINING set
X_train_Standard = std.fit_transform(X_train)  # Standardization: each feature has mean = 0 and std = 1
X_train_MinMax = minmax.fit_transform(X_train)  # Normalization: each feature is scaled to the range [0, 1]
X_train_Robust = rob.fit_transform(X_train)  # RobustScaler centers on the median and scales by the IQR, so it is less affected by outliers
# Transforming the TEST set (transform only, reusing the scalers fitted on the training data)
X_test_Standard = std.transform(X_test)
X_test_MinMax = minmax.transform(X_test)
X_test_Robust = rob.transform(X_test)
# Test scalers for the decision tree regressor
treeStd = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_train_Standard, y_train)
treeMinMax = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_train_MinMax, y_train)
treeRobust = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_train_Robust, y_train)
print("Decision tree with standard scaler:\nTraining set score: {:.4f}\nTest set score: {:.4f}\n".format(treeStd.score(X_train_Standard, y_train), treeStd.score(X_test_Standard, y_test)))
print("Decision tree with min/max scaler:\nTraining set score: {:.4f}\nTest set score: {:.4f}\n".format(treeMinMax.score(X_train_MinMax, y_train), treeMinMax.score(X_test_MinMax, y_test)))
print("Decision tree with robust scaler:\nTraining set score: {:.4f}\nTest set score: {:.4f}\n".format(treeRobust.score(X_train_Robust, y_train), treeRobust.score(X_test_Robust, y_test)))
# Now we train our model for different values of `max_depth`, ranging from 1 to 29.
max_depths = range(1, 30)
training_error = []
for max_depth in max_depths:
    model_1 = DecisionTreeRegressor(max_depth=max_depth)
    model_1.fit(X, y)
    training_error.append(mean_squared_error(y, model_1.predict(X)))

testing_error = []
for max_depth in max_depths:
    model_2 = DecisionTreeRegressor(max_depth=max_depth)
    model_2.fit(X, y)
    testing_error.append(mean_squared_error(y_test, model_2.predict(X_test)))

plt.plot(max_depths, training_error, color='blue', label='Training error')
plt.plot(max_depths, testing_error, color='green', label='Testing error')
plt.xlabel('Tree depth')
plt.axvline(x=25, color='orange', linestyle='--')
plt.annotate('optimum = 25', xy=(20, 20), color='red')
plt.ylabel('Mean squared error')
plt.title('Hyperparameter tuning', pad=20, size=30)
plt.legend()
Where would I run the tests on the validation set? How do I incorporate it into the code?
First of all, make sure to create only one model and keep using that one model. Currently you create a new model in every training step and overwrite the old one, so your model never improves.
Secondly, the idea behind the validation set is to evaluate the progress of your training, i.e. to see how your model performs on data it hasn't seen before, so you need to incorporate it into your training process.
In your case it would look like this:
model = DecisionTreeRegressor(max_depth=max_depth)  # here we create the one model we want to keep using

training_error = []
val_error = []
for max_depth in max_depths:
    model.fit(X_train, y_train)                                                  # here we train the model
    training_error.append(mean_squared_error(y_train, model.predict(X_train)))   # here we calculate the training error
    val_error.append(mean_squared_error(y_valid, model.predict(X_valid)))        # here we calculate the validation error

test_error = mean_squared_error(y_test, model.predict(X_test))                   # here we calculate the test error (once, at the very end)
Make sure that you only train on your training data, never on your validation or test data.
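If the goal is specifically to choose max_depth, a common variant (a sketch using the X_valid/y_valid split from the question, not necessarily the only way to do it) is to fit one candidate tree per depth on the training set, keep the depth with the lowest validation error, and only then evaluate once on the test set:

from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

val_errors = {}
for depth in range(1, 30):
    candidate = DecisionTreeRegressor(max_depth=depth, random_state=0)
    candidate.fit(X_train, y_train)
    val_errors[depth] = mean_squared_error(y_valid, candidate.predict(X_valid))

# Pick the depth with the lowest validation error, then touch the test set exactly once
best_depth = min(val_errors, key=val_errors.get)
best_model = DecisionTreeRegressor(max_depth=best_depth, random_state=0).fit(X_train, y_train)
print(best_depth, mean_squared_error(y_test, best_model.predict(X_test)))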
I am using ElasticNet to fit my data. To determine the hyperparameters (alpha, l1_ratio), I am using ElasticNetCV. With the obtained hyperparameters, I refit the model to the whole dataset for production use. I am unsure whether this is correct, both from the machine-learning standpoint and, if so, in how I implement it. The code "works" and presumably does what it should, but I wanted to be certain that it is also correct.
My procedure is:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import ElasticNet, ElasticNetCV

X_tr, X_te, y_tr, y_te = train_test_split(X, y)

optimizer = ElasticNetCV(l1_ratio=[.1, .5, .7, .9, .99, 1], n_alphas=400, cv=5, normalize=True)
optimizer.fit(X_tr, y_tr)

best = ElasticNet(alpha=optimizer.alpha_, l1_ratio=optimizer.l1_ratio_, normalize=True)
best.fit(X, y)
Thank you in advance
I am a beginner at this, but I would love to share my approach to ElasticNet hyperparameter tuning. I would suggest using RandomizedSearchCV instead. Here is part of the code I am currently writing:
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import mean_squared_error

def tune_elastic_net(X_train, X_test, Y_train, Y_test):
    #-----------------------------------------------
    # Input:
    #   X_train, X_test, Y_train, Y_test: datasets
    # Returns:
    #   R² score, RMSE score, and the best parameters found
    #-----------------------------------------------
    # Standardize the data first
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)

    # Define the grid
    params = dict()
    # Values for alpha: 100 values between 10^-5 and 10^5
    params['alpha'] = np.logspace(-5, 5, 100, endpoint=True)
    # Values for l1_ratio: 100 values between 0 and 0.99
    params['l1_ratio'] = np.arange(0, 1, 0.01)
    # Warning: the grid contains 100 x 100 = 10,000 possible combinations,
    # although RandomizedSearchCV only samples n_iter of them.

    # Create an instance of the Elastic Net regressor
    regressor = ElasticNet()

    # Call the randomized search with cross-validation using the chosen regressor
    rs_cv = RandomizedSearchCV(regressor, params, n_iter=100, scoring=None, cv=5, verbose=0, refit=True)
    rs_cv.fit(X_train, Y_train.values.ravel())

    # Results
    Y_pred = rs_cv.predict(X_test)
    R2_score = rs_cv.score(X_test, Y_test)
    RMSE_score = np.sqrt(mean_squared_error(Y_test, Y_pred))
    return R2_score, RMSE_score, rs_cv.best_params_
The advantage of RandomizedSearchCV is that the number of iterations can be fixed in advance. The points to be tested are chosen at random, which makes it much faster (90% faster in some cases) than GridSearchCV, which tests all possible combinations.
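For comparison, a sketch of the exhaustive equivalent over the same grid (illustrative only; it assumes X_train and Y_train as in the function above):

import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV

params = {'alpha': np.logspace(-5, 5, 100), 'l1_ratio': np.arange(0, 1, 0.01)}
gs_cv = GridSearchCV(ElasticNet(), params, cv=5)  # evaluates all 10,000 combinations
gs_cv.fit(X_train, Y_train.values.ravel())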
I am using this same approach for other regressors such as RandomForest and GradientBoosting, whose parameter grids are far more complicated and demand much more compute to search.
As I said at the beginning, I am new to this field, so any constructive comment is welcome.
Johnny
How am I supposed to implement Gaussian Naive Bayes when the data has to be split into two sets (training and test) by id?
I need:
Create a training set by selecting the rows with id <= 160
Train a Gaussian Naive-Bayes classifier as we saw in class to determine if a campaign will be successful, given the amounts used in each marketing channel
Calculate the fraction of the training set that is correctly classified.
and:
Create a test set by selecting the rows with id > 160
Evaluate the performance of the classifier as follows:
What percentage of the test set was classified correctly (correct answers over the total)? It is desirable that this number reaches at least 80%
What is the ratio of false positives to false negatives?
Successful marketing campaign:
successful_marketing_campaign = (dataset['sales'] > 15) | (dataset['total_invested'] < 20)
And my code:
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix

X = dataset.iloc[:, [0, 3]].values.astype('int')
y = dataset.iloc[:, [4]].values.astype('int')

# Manual split by row position: first 160 rows for training, the rest for testing
X_train = dataset.iloc[0:160, [0, 3]].values.astype('int')
y_train = dataset.iloc[0:160, 4].values.astype('int')
X_test = dataset.iloc[160:, [0, 3]].values.astype('int')
y_test = dataset.iloc[160:, 4].values.astype('int')

# Note: this random split overwrites the manual id-based split above
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = GaussianNB()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

matrix = confusion_matrix(y_test, y_pred)
print(matrix)
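For reference, a minimal sketch of how the requested numbers could be computed from the fitted classifier above (for binary labels, sklearn's confusion matrix is laid out as [[TN, FP], [FN, TP]]):

# Fraction of the training set classified correctly
train_accuracy = clf.score(X_train, y_train)

# Test-set accuracy and the false positive / false negative ratio
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
test_accuracy = (tp + tn) / (tp + tn + fp + fn)
fp_to_fn_ratio = fp / fn if fn else float('inf')

print(train_accuracy, test_accuracy, fp_to_fn_ratio)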
I have the following code that attempts to value stocks using non-price-based features.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

price = df.loc[:, 'regularMarketPrice']
features = df.loc[:, feature_list]
#
X_train, X_test, y_train, y_test = train_test_split(features, price, test_size=0.15, random_state=1)
if len(X_train.shape) < 2:
    X_train = np.array(X_train).reshape(-1, 1)
    X_test = np.array(X_test).reshape(-1, 1)
#
model = LinearRegression()
model.fit(X_train, y_train)
#
print('Train Score:', model.score(X_train, y_train))
print('Test Score:', model.score(X_test, y_test))
#
y_predicted = model.predict(X_test)
In my df (which is very large), there is never an instance where 'regularMarketPrice' is less than 0. However, I occasionally receive a value less than 0 for some points in y_predicted.
Is there a way in scikit-learn to say that anything less than 0 is an invalid prediction? I am hoping this would make my model more accurate.
Please comment if there is a need for further explanation.
To keep predictions from going below 0, you should not use linear regression. You should consider a generalized linear model (GLM), such as Poisson regression.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import PoissonRegressor

price = df.loc[:, 'regularMarketPrice']
features = df.loc[:, feature_list]
#
X_train, X_test, y_train, y_test = train_test_split(features, price, test_size=0.15, random_state=1)
if len(X_train.shape) < 2:
    X_train = np.array(X_train).reshape(-1, 1)
    X_test = np.array(X_test).reshape(-1, 1)
#
model = PoissonRegressor()
model.fit(X_train, y_train)
#
print('Train Score:', model.score(X_train, y_train))
print('Test Score:', model.score(X_test, y_test))
#
y_predicted = model.predict(X_test)
All predictions are greater than or equal to 0.
Consider using something other than a Gaussian response. Plot your y-values as a histogram; if the data are right-skewed, consider modeling with a GLM using a gamma distribution and a log link.
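A sketch of that suggestion using scikit-learn's GammaRegressor, which is a gamma GLM with a log link (illustrative only; it reuses X_train, y_train, and X_test from the code above and requires strictly positive targets):

from sklearn.linear_model import GammaRegressor

# Gamma GLM with log link; predictions are always positive
model = GammaRegressor()
model.fit(X_train, y_train)
y_predicted = model.predict(X_test)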
Alternatively, you could clip the output of the linear model, setting each predicted value to the maximum of the model's prediction and 0.
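A sketch of that post-hoc fix:

import numpy as np

# Replace any negative predictions with 0
y_predicted = np.clip(model.predict(X_test), 0, None)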