I am working on a lecture assignment, but I don't yet understand how to change the feature transformation from linear to Gaussian. I have searched for information but haven't found the right approach, and I'm still new to Python. Below is a Python script I wrote with linear regression; I want to transform the features with a Gaussian transformation. Please help me write the script for the feature transformation.
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.utils import shuffle

# `data` is assumed to be a DataFrame, loaded earlier, with 'time' and 'Cases' columns
df1 = data[['time', 'Cases']]
df1 = shuffle(df1, random_state=0)
X = np.array(df1[['time']])
y = np.array(df1[['Cases']])
# Hold out the last 10 rows as the test set
X_train = X[:-10]
X_test = X[-10:]
y_train = y[:-10]
y_test = y[-10:]
regr = linear_model.LinearRegression()
regr.fit(X_train, y_train)
y_pred = regr.predict(X_test)
print('Coefficients: \n', regr.coef_)
# The mean squared error
print('Mean squared error: %.2f'
      % mean_squared_error(y_test, y_pred))
# The coefficient of determination: 1 is perfect prediction
print('Coefficient of determination: %.2f'
      % r2_score(y_test, y_pred))
# Plot outputs
plt.scatter(X_test, y_test, color='black')
plt.scatter(X_train, y_train, color='black')
plt.plot(X_test, y_pred, color='blue', linewidth=3)
plt.xticks(())
plt.yticks(())
plt.show()
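One common approach, sketched below under the assumption that the goal is Gaussian (RBF) basis functions of the time feature, is a small custom transformer used in a pipeline with LinearRegression. The class GaussianFeatures and its parameters n_centers and width_factor are illustrative names, not scikit-learn API; scikit-learn's RBFSampler or Nystroem from sklearn.kernel_approximation are built-in alternatives. It reuses X_train, X_test from the script above.

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

class GaussianFeatures(BaseEstimator, TransformerMixin):
    """Expand a single feature into Gaussian basis functions."""

    def __init__(self, n_centers=20, width_factor=2.0):
        self.n_centers = n_centers
        self.width_factor = width_factor

    def fit(self, X, y=None):
        # Spread the centers uniformly over the observed range of X
        self.centers_ = np.linspace(X.min(), X.max(), self.n_centers)
        self.width_ = self.width_factor * (self.centers_[1] - self.centers_[0])
        return self

    def transform(self, X):
        # One column per center: exp(-(x - c)^2 / (2 * width^2))
        diff = X.reshape(-1, 1) - self.centers_.reshape(1, -1)
        return np.exp(-0.5 * (diff / self.width_) ** 2)

# Drop-in replacement for the plain LinearRegression above
model = make_pipeline(GaussianFeatures(n_centers=20), LinearRegression())
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

The width controls how smooth the fit is: a few wide Gaussians behave almost like a linear fit, while many narrow ones can overfit.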
I have been trying to build a model that would use an independent variable such as "age" to predict an individual's level of efficacy. I have achieved an R-squared score of 0.049 using polynomial features with a decision tree regressor, but that is still really low. What would be the best way to tackle this issue?
Here is the code:
# Load data into a pandas DataFrame
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error, r2_score

df = pd.read_csv("workers.csv")
# Split data into training and testing sets
X = df[['sub_age']]
y = df["actual_efficacy_h"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# Create polynomial features, fitting the transformer on the training data only
poly = PolynomialFeatures(degree=4)
X_train_trans = poly.fit_transform(X_train)
X_test_trans = poly.transform(X_test)
# Train a decision tree regressor on the transformed training data
model = DecisionTreeRegressor()
model.fit(X_train_trans, y_train)
# Make predictions on the test data
y_pred = model.predict(X_test_trans)
# Evaluate the model's performance
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("Mean squared error: ", mse)
print("R-squared score: ", r2)
X_new = np.linspace(18, 65, 200).reshape(200, 1)
X_new_poly = poly.transform(X_new)
y_new = model.predict(X_new_poly)
plt.plot(X_new, y_new, "r-", linewidth=2, label="Predictions")
plt.plot(X_train, y_train, "b.",label='Training points')
plt.plot(X_test, y_test, "g.",label='Testing points')
plt.xlabel("Age")
plt.ylabel("Degree of Efficacy")
plt.legend()
plt.show()
This code achieves the following:
Mean squared error: 0.15122979391134947
R-squared score: 0.04171398183828967
And plots a graph like this: [plot of the prediction curve over the training and testing points]
I would appreciate any kind of help or guidance.
Thank you
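One quick sanity check, sketched below and not part of the post (it reuses the question's file and column names): cross-validate a simple depth-limited tree on the raw feature. If the cross-validated R-squared stays near zero, sub_age alone carries little signal for efficacy, and no feature transformation will fix that; more informative features would be needed.

import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

df = pd.read_csv("workers.csv")
X = df[['sub_age']]
y = df["actual_efficacy_h"]

# Cross-validated R^2; values near zero mean age alone explains little
scores = cross_val_score(DecisionTreeRegressor(max_depth=3), X, y,
                         cv=5, scoring='r2')
print(scores.mean(), scores.std())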
I fine-tuned a LightGBM model and now want to apply calibration, but I'm having trouble doing so.
I have 1) train, 2) valid, 3) test data.
I trained and fine-tuned LGBM using 1) train data and 2) valid data.
Then I obtained the best parameters for the LGBM model.
After that, I want to calibrate the model so that its output can be directly interpreted as a confidence level. But I'm confused about using scikit-learn's CalibratedClassifierCV.
In my situation, should I use cv='prefit' or cv=5? Also, should I use the train data or the valid data when fitting CalibratedClassifierCV?
1) uncalibrated_clf but after training
clf = lgb.LGBMClassifier()
clf.fit(X_train, y_train, eval_set=[(X_valid, y_valid)], verbose=True, early_stopping_rounds=20)
2-1) Calibrated_clf
cal_clf = CalibratedClassifierCV(clf, cv='prefit', method='isotonic')
cal_clf.fit(X_valid, y_valid)
2-2) Calibrated_clf
cal_clf = CalibratedClassifierCV(clf, cv=5, method='isotonic')
cal_clf.fit(X_train, y_train)
2-3) Calibrated_clf
cal_clf = CalibratedClassifierCV(clf, cv=5, method='isotonic')
cal_clf.fit(X_valid, y_valid)
Which one is right? Are all of them right, or only one or two?
Below is code.
import numpy as np
import matplotlib.pyplot as plt
import lightgbm as lgb
from sklearn.datasets import make_classification
from sklearn.calibration import calibration_curve, CalibratedClassifierCV
np.random.seed(0)
n_samples = 10000
X, y = make_classification(
    n_samples=3 * n_samples, n_features=20, n_informative=2,
    n_classes=2, n_redundant=2, random_state=32)
X_train, y_train = X[:n_samples], y[:n_samples]
X_valid, y_valid = X[n_samples:2*n_samples], y[n_samples:2*n_samples]
X_test, y_test = X[2*n_samples:], y[2*n_samples:]
plt.figure(figsize=(12, 9))
plt.plot([0, 1], [0, 1], '--', color='gray')
# 1) Uncalibrated_clf but fine-tuned on training data
clf = lgb.LGBMClassifier()
clf.fit(X_train, y_train, eval_set=[(X_valid, y_valid)], verbose=True, early_stopping_rounds=20)
y_prob = clf.predict_proba(X_test)[:, 1]
fraction_of_positives, mean_predicted_value = calibration_curve(y_test, y_prob, n_bins=10)
# Mean predicted value on x, observed fraction of positives on y
plt.plot(
    mean_predicted_value,
    fraction_of_positives,
    'o-', label='uncalibrated_clf')
# 2-1) Calibrated_clf
cal_clf = CalibratedClassifierCV(clf, cv='prefit', method='isotonic')
cal_clf.fit(X_valid, y_valid)
y_prob1 = cal_clf.predict_proba(X_test)[:, 1]
fraction_of_positives1, mean_predicted_value1 = calibration_curve(y_test, y_prob1, n_bins=10)
plt.plot(
    mean_predicted_value1,
    fraction_of_positives1,
    'o-', label='calibrated_clf1')
# 2-2) Calibrated_clf
cal_clf = CalibratedClassifierCV(clf, cv=5, method='isotonic')
cal_clf.fit(X_train, y_train)
y_prob2 = cal_clf.predict_proba(X_test)[:, 1]
fraction_of_positives2, mean_predicted_value2 = calibration_curve(y_test, y_prob2, n_bins=10)
plt.plot(
    mean_predicted_value2,
    fraction_of_positives2,
    'o-', label='calibrated_clf2')
# 2-3) Calibrated_clf
cal_clf = CalibratedClassifierCV(clf, cv=5, method='isotonic')
cal_clf.fit(X_valid, y_valid)
y_prob3 = cal_clf.predict_proba(X_test)[:, 1]
fraction_of_positives3, mean_predicted_value3 = calibration_curve(y_test, y_prob3, n_bins=10)
plt.plot(
    mean_predicted_value3,
    fraction_of_positives3,
    'o-', label='calibrated_clf3')
plt.legend()
The way to go about this is:
a) Fit the model and calibrate on the hold-out set:
model.fit(X_train, y_train)
calibrated = CalibratedClassifierCV(model, cv='prefit').fit(X_val, y_val)
y_pred = calibrated.predict(X_test)
(This is actually the meaning of prefit here: the model is already fitted; now take a new, relevant set and calibrate the output.)
b) Fit the model and calibrate with cross-validation on the training set:
model.fit(X_train, y_train)
calibrated = CalibratedClassifierCV(model, cv=5).fit(X_train, y_train)
y_pred_val = calibrated.predict(X_val)
As is usually the case, the number of cross-validation folds and the method (isotonic regression vs. Platt scaling, called 'sigmoid' in scikit-learn's jargon) depend critically on your data and your setup. Therefore, I'd suggest putting both in a grid search and seeing what produces the best results.
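A minimal sketch of that grid search, assuming the training split from the question; note that tuning cv and method this way nests cross-validation inside cross-validation, so it is slow but straightforward:

import lightgbm as lgb
from sklearn.calibration import CalibratedClassifierCV
from sklearn.model_selection import GridSearchCV

# Search the calibration method and fold count, scored by the (negated) Brier score
param_grid = {'method': ['isotonic', 'sigmoid'], 'cv': [3, 5]}
search = GridSearchCV(
    CalibratedClassifierCV(lgb.LGBMClassifier()),
    param_grid,
    scoring='neg_brier_score',
)
search.fit(X_train, y_train)
print(search.best_params_)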
Finally, a deeper dive can be found here:
https://machinelearningmastery.com/calibrated-classification-model-in-scikit-learn/
I am trying to play around to see how outliers in a dataset might affect a linear regression model. The issue I'm having is that I don't know exactly how to add outliers to a dataset; I've only found loads of articles online about how to detect and remove them.
This is the code I have so far:
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression
# Generate regression dataset
X, y = make_regression(
    n_samples=1000,
    n_features=1,
    noise=0.0,
    bias=0.0,
    random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
regressor = LinearRegression()
regressor.fit(X_train, y_train) # Training the algorithm
y_pred = regressor.predict(X_test)
print("R2 Score:", metrics.r2_score(y_test, y_pred))
print("Mean Absolute Error:", metrics.mean_absolute_error(y_test, y_pred))
print("Mean Squared Error:", metrics.mean_squared_error(y_test, y_pred))
print("Root Mean Squared Error:", np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
plt.scatter(X_test, y_test)
plt.plot(X_test, y_pred, color="red", linewidth=1)
plt.show()
And this is the output: [plot of the test data with the fitted regression line]
My question is how can I add outliers to this clean dataset in order to see the effects outliers will have on the resulting model?
Any help would be appreciated, thanks!
You can add values directly to X and y. Appending randomly re-paired X and y values breaks the linear relationship, and since the slope is steep enough, the new points land far from the line, i.e., they become outliers. You could use any method you want, really.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression
# Generate regression dataset
X, y = make_regression(
    n_samples=1000,
    n_features=1,
    noise=0.0,
    bias=0.0,
    random_state=42,
)
# Append 20 randomly re-paired (x, y) points; because the pairing is random,
# most of them fall far from the true line and act as outliers
for x in range(20):
    X = np.append(X, np.random.choice(X.flatten()))
    y = np.append(y, np.random.choice(y.flatten()))
X = X.reshape(-1, 1)
y = y.reshape(-1, 1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
regressor = LinearRegression()
regressor.fit(X_train, y_train) # Training the algorithm
y_pred = regressor.predict(X_test)
print("R2 Score:", metrics.r2_score(y_test, y_pred))
print("Mean Absolute Error:", metrics.mean_absolute_error(y_test, y_pred))
print("Mean Squared Error:", metrics.mean_squared_error(y_test, y_pred))
print("Root Mean Squared Error:", np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
plt.scatter(X_test, y_test)
plt.plot(X_test, y_pred, color="red", linewidth=1)
plt.show()
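Another option, sketched here rather than taken from the answer above, is to corrupt a few existing targets with a large offset; this gives direct control over how extreme the outliers are:

import numpy as np
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=1000, n_features=1, random_state=42)

rng = np.random.default_rng(0)
# Shift 20 random targets by 5-10 standard deviations in a random direction
idx = rng.choice(len(y), size=20, replace=False)
y[idx] += rng.choice([-1, 1], size=20) * rng.uniform(5, 10, size=20) * y.std()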
I'd like to evaluate my machine learning model. I computed the area under the ROC curve with roc_auc_score() and plotted the ROC curve with the plot_roc_curve() function of sklearn. The second function also computes the AUC and shows it in the plot. Now my problem is that I get different results for the two AUCs.
Here's the reproducible code with sample dataset:
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import plot_roc_curve
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import MinMaxScaler
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
scaler = MinMaxScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
model = MLPClassifier(random_state=42)
model.fit(X_train, y_train)
yPred = model.predict(X_test)
print(roc_auc_score(y_test, yPred))
plot_roc_curve(model, X_test, y_test)
plt.show()
The roc_auc_score function gives me 0.979 and the plot shows 1.00.
Even though the second function takes the model as an argument and computes the predictions again, the outcome should not differ. It is not a rounding error: if I decrease the training iterations to get a bad predictor, the values still differ.
With my real dataset I "achieved" a difference of 0.1 between the two methods.
Where does this discrepancy come from?
You should pass the prediction probabilities to roc_auc_score, and not the predicted classes. Like this:
yPred_p = model.predict_proba(X_test)[:,1]
print(roc_auc_score(y_test, yPred_p))
# output: 0.9983354140657512
When you pass the predicted classes, roc_curve only sees the hard 0/1 labels, so the "curve" consists of just three points, (0, 0), (FPR, TPR), and (1, 1), and the AUC is computed for that two-segment curve instead of the real ROC curve (which is wrong):
Code to regenerate:
from sklearn.metrics import roc_curve, auc
fpr, tpr, _ = roc_curve(y_test, yPred)
roc_auc = auc(fpr, tpr)
plt.plot(fpr, tpr, label='AUC = ' + str(round(roc_auc, 2)))
plt.legend(loc='lower right')
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
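To see both numbers side by side (a short sketch reusing the fitted model and test split from the question):

from sklearn.metrics import roc_auc_score

# AUC from hard class predictions (what the question computed) vs.
# AUC from predicted probabilities (what plot_roc_curve uses)
print('classes:      ', roc_auc_score(y_test, model.predict(X_test)))
print('probabilities:', roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))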
I have a function that imports a random forest classifier from scikit-learn; I fit it with data, and finally I want to display the accuracy, kappa, and confusion matrix. Everything works except printing the confusion matrix: I do not get any error, but the confusion matrix does not print.
I have tried calling print(cm), and that works, but it does not print in the usual pandas DataFrame style, which is what I am looking for.
Here's the code
def rf_clf(X, y, test_size=0.3, random_state=42):
    """Split the data into train and test sets, fit a random forest classifier
    to the data provided, and report its errors (accuracy and kappa). As this is
    classification, the function also outputs a confusion matrix."""
    # Split data into train and test, as well as predictors (X) and targets (y)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state, stratify=y)
    # Instantiate the random forest classifier
    base_model = RandomForestClassifier(random_state=random_state)
    # Train the model
    base_model.fit(X_train, y_train)
    # Make predictions on the test set
    y_pred = base_model.predict(X_test)
    # Print accuracy and kappa
    print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
    print("Kappa:", metrics.cohen_kappa_score(y_test, y_pred))
    # Create the confusion matrix
    labs = [y_test[i][0] for i in range(len(y_test))]
    cm = pd.DataFrame(confusion_matrix(labs, y_pred))
    cm  # here is the issue; kind of works with print(cm)
A bare cm as the last line of a function does nothing: an expression is only echoed automatically when it is the last line of a notebook cell, never inside a function body, so you have to print the matrix (or return it).
Import metrics from sklearn at the beginning.
from sklearn import metrics
Use this when you want to show the confusion matrix:
# Get and show the confusion matrix
cm = metrics.confusion_matrix(y_test, y_pred)
print(cm)
With this you should see the confusion matrix as raw text.
If you want to show the confusion matrix with colours, do it this other way:
Import:
from sklearn.metrics import confusion_matrix
import pandas as pd
import seaborn as sns; sns.set()
Use it this way (class_names here is your list of class labels):
cm = confusion_matrix(y_test, y_pred)
cmat_df = pd.DataFrame(cm, index=class_names, columns=class_names)
ax = sns.heatmap(cmat_df, square=True, annot=True, cbar=False)
ax.set_xlabel('Predicted')
ax.set_ylabel('Actual')
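And if you want the usual pandas DataFrame rendering that the question asks about, one sketch is to return the matrix as a labeled DataFrame; the function name confusion_df below is illustrative:

import pandas as pd
from sklearn.metrics import confusion_matrix

def confusion_df(y_test, y_pred, class_names=None):
    """Return the confusion matrix as a labeled DataFrame."""
    return pd.DataFrame(confusion_matrix(y_test, y_pred),
                        index=class_names, columns=class_names)

# As the last expression of a notebook cell this renders as a table:
# confusion_df(y_test, y_pred, class_names=['neg', 'pos'])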
Hope for the best!