I am experimenting to see how outliers in a dataset affect a Linear Regression model. The issue is that I don't know how to add outliers to a dataset; I've only found plenty of articles online about how to detect and remove them.
This is the code I have so far:
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression
# Generate regression dataset
X, y = make_regression(
    n_samples=1000,
    n_features=1,
    noise=0.0,
    bias=0.0,
    random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
regressor = LinearRegression()
regressor.fit(X_train, y_train) # Training the algorithm
y_pred = regressor.predict(X_test)
print("R2 Score:", metrics.r2_score(y_test, y_pred))
print("Mean Absolute Error:", metrics.mean_absolute_error(y_test, y_pred))
print("Mean Squared Error:", metrics.mean_squared_error(y_test, y_pred))
print("Root Mean Squared Error:", np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
plt.scatter(X_test, y_test)
plt.plot(X_test, y_pred, color="red", linewidth=1)
plt.show()
And this is the output:
My question is how can I add outliers to this clean dataset in order to see the effects outliers will have on the resulting model?
Any help would be appreciated, thanks!
You can add values directly to X and y. Since the slope of the underlying line is steep enough, randomly re-pairing existing X and y values gives points that fall far from it, i.e. outliers. You could use any method you want, really.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression
# Generate regression dataset
X, y = make_regression(
    n_samples=1000,
    n_features=1,
    noise=0.0,
    bias=0.0,
    random_state=42,
)
# Append 20 randomly re-paired (x, y) values; since they don't follow the
# underlying linear relationship, they act as outliers.
for _ in range(20):
    X = np.append(X, np.random.choice(X.flatten()))
    y = np.append(y, np.random.choice(y.flatten()))
X = X.reshape(-1, 1)
y = y.reshape(-1, 1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
regressor = LinearRegression()
regressor.fit(X_train, y_train) # Training the algorithm
y_pred = regressor.predict(X_test)
print("R2 Score:", metrics.r2_score(y_test, y_pred))
print("Mean Absolute Error:", metrics.mean_absolute_error(y_test, y_pred))
print("Mean Squared Error:", metrics.mean_squared_error(y_test, y_pred))
print("Root Mean Squared Error:", np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
plt.scatter(X_test, y_test)
plt.plot(X_test, y_pred, color="red", linewidth=1)
plt.show()
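If you want more control over how extreme the outliers are, another option (a minimal sketch, not part of the original answer) is to replace the append loop: starting from the X and y returned by make_regression, keep a few existing X values and shift their targets by a large, arbitrary offset.
rng = np.random.default_rng(42)
# Pick a handful of existing points and push their targets far away from the line
n_outliers = 10                                                # arbitrary number of outliers
idx = rng.choice(len(X), size=n_outliers, replace=False)
X_out = X[idx]
y_out = y[idx] + rng.choice([-1, 1], size=n_outliers) * 300    # arbitrary vertical offset
X = np.vstack([X, X_out])
y = np.append(y, y_out)
The rest of the split, fit, and plotting code stays the same; the larger the offset, the further the fitted line and the error metrics are pulled away from the clean fit.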
Related
I am currently attempting to train a neural network that predicts a 1kHz sine wave.
While the model's score (R²) on the training data is 0.89, it does not accurately predict my test data.
import matplotlib.pyplot as plt
import numpy as np
import sklearn as sk
from sklearn.model_selection import TimeSeriesSplit
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPRegressor
#Generate 1kHz sine wave
pi = np.pi
X = np.arange(0,2*pi,0.05)
y = np.sin(1000*X)
tscv = TimeSeriesSplit(n_splits=3, test_size=30)
for train_index, test_index in tscv.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
plt.plot(X_train, y_train)
plt.plot(X_test, y_test)
#Training using train samples
sc = StandardScaler()
X_train = sc.fit_transform(X_train.reshape(-1,1))
X_test = sc.fit_transform(X_test.reshape(-1,1))
regr = MLPRegressor(random_state=1, max_iter=1000,hidden_layer_sizes=(32, 32)).fit(X_train, y_train)
regr.fit(X_train, y_train)
plt.plot(X_train, regr.predict(X_train), color = 'red')
plt.scatter(X_train, y_train)
regr.score(X_train, y_train)
Result of training
plt.scatter(X_test, y_test, color = 'red')
plt.plot(X_test, regr.predict(X_test))
Result of test
As you can see, the test data is far less periodic than the ML model. Why is this the case?
I'd like to evaluate my machine learning model. I computed the area under the ROC curve with roc_auc_score() and plotted the ROC curve with plot_roc_curve() functions of sklearn. In the second function the AUC is also computed and shown in the plot. Now my problem is, that I get different results for the two AUC.
Here's the reproducible code with sample dataset:
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import plot_roc_curve
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import MinMaxScaler
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
scaler = MinMaxScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
model = MLPClassifier(random_state=42)
model.fit(X_train, y_train)
yPred = model.predict(X_test)
print(roc_auc_score(y_test, yPred))
plot_roc_curve(model, X_test, y_test)
plt.show()
The roc_auc_score function gives me 0.979 and the plot shows 1.00.
Even though the second function takes the model as an argument and computes the predictions again itself, the outcome should not differ. It is not a rounding error: if I decrease the training iterations to get a worse predictor, the values still differ.
With my real dataset I "achieved" a difference of 0.1 between the two methods.
Where does this discrepancy come from?
You should pass the prediction probabilities to roc_auc_score, and not the predicted classes. Like this:
yPred_p = model.predict_proba(X_test)[:,1]
print(roc_auc_score(y_test, yPred_p))
# output: 0.9983354140657512
When you pass the predicted classes, this is actually the curve for which AUC is being calculated (which is wrong):
Code to regenerate:
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc
fpr, tpr, _ = roc_curve(y_test, yPred)
roc_auc = auc(fpr, tpr)
plt.plot(fpr, tpr, label='AUC = ' + str(round(roc_auc, 2)))
plt.legend(loc='lower right')
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()
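For comparison, here is a minimal sketch (reusing yPred_p, y_test, and the imports from the snippets above) of the same plot built from the predicted probabilities, which reproduces the high-AUC curve that plot_roc_curve shows:
# Same plot, but built from the predicted probabilities instead of the hard classes
fpr_p, tpr_p, _ = roc_curve(y_test, yPred_p)
plt.plot(fpr_p, tpr_p, label='AUC = ' + str(round(auc(fpr_p, tpr_p), 2)))
plt.legend(loc='lower right')
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()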
[ASK]
I am working through a lecture, but I don't yet understand how to change the feature transformation from linear to Gaussian. I have searched for information but haven't found the right explanation, and I'm still new to Python. Below is a Python script I wrote using linear regression; I want to transform the features with a Gaussian transformation. Please help me write the script for the feature transformation.
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.linear_model import LinearRegression
# 'data' is assumed to be a DataFrame that has already been loaded elsewhere
df1 = data[['time', 'Cases']]
from sklearn.utils import shuffle
df1 = shuffle(df1,random_state=0)
X=np.array(df1[['time']])
y=np.array(df1[['Cases']])
X_train = X[:-10]
X_test = X[-10:]
y_train = y[:-10]
y_test = y[-10:]
regr = linear_model.LinearRegression()
regr.fit(X_train, y_train)
y_pred = regr.predict(X_test)
print('Coefficients: \n', regr.coef_)
# The mean squared error
print('Mean squared error: %.2f'
% mean_squared_error(y_test, y_pred))
# The coefficient of determination: 1 is perfect prediction
print('Coefficient of determination: %.2f'
% r2_score(y_test, y_pred))
# Plot outputs
plt.scatter(X_test, y_test, color='black')
plt.scatter(X_train, y_train, color='black')
plt.plot(X_test, y_pred, color='blue', linewidth=3)
plt.xticks(())
plt.yticks(())
plt.show()
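One common reading of a "Gaussian feature transformation" is a radial basis function (RBF) expansion of the input before fitting the linear model. A minimal sketch of that idea, assuming the X_train, X_test, y_train, y_test defined above (the number of centres and the width are arbitrary choices):
def gaussian_features(X, centers, width):
    # One column per centre c: exp(-(x - c)^2 / (2 * width^2))
    return np.exp(-((X - centers.reshape(1, -1)) ** 2) / (2 * width ** 2))
centers = np.linspace(X.min(), X.max(), 10)   # spread 10 centres over the input range
width = (X.max() - X.min()) / 10              # arbitrary bandwidth
X_train_g = gaussian_features(X_train, centers, width)
X_test_g = gaussian_features(X_test, centers, width)
regr_g = LinearRegression().fit(X_train_g, y_train)
print('R2 with Gaussian features: %.2f' % regr_g.score(X_test_g, y_test))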
This is my code:
from sklearn.datasets import load_boston
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict
from sklearn.model_selection import train_test_split
%matplotlib inline
boston_properties = load_boston()
l_distance = boston_properties['data'][:, np.newaxis, 7]
linreg = LinearRegression()
X_train, X_test, y_train, y_test = train_test_split(l_distance, boston_properties['target'], test_size = 0.3)
y_pred = cross_val_predict(linreg, l_distance, boston_properties.target, cv=5)
plt.scatter(X_test, y_test, color='black')
plt.plot(X_test, y_pred, color='blue', linewidth=2)
plt.show()
print(y_pred.shape)
The error which I'm receiving is the following:
ValueError: x and y must have same first dimension, but have shapes (152, 1) and (506,)
How can I make this work?
You made a train_test_split, but you're not using it to train the model. You then predict on the entire dataset (506 samples) and plot those predictions against X_test and y_test (152 samples), which is exactly where the shape mismatch comes from. Use these lines instead:
l_distance = boston_properties['data'][:, np.newaxis, 7]
linreg = LinearRegression()
X_train, X_test, y_train, y_test = train_test_split(l_distance,
                                                    boston_properties['target'],
                                                    test_size=0.3)  # now you have a train/test set
y_pred = cross_val_predict(linreg, X_train, y_train, cv=5)
plt.scatter(X_train, y_train, color='black')
plt.plot(X_train, y_pred, color='blue', linewidth=2)
plt.show()
Edit: You can also use this line to make a straight line through your points:
plt.scatter(X_train, y_train, color='black')
plt.plot([X_train[np.argmin(X_train)], X_train[np.argmax(X_train)]],
[y_pred[np.argmin(X_train)], y_pred[np.argmax(X_train)]],
color='blue')
plt.show()
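If you also want to make use of the held-out split, a minimal follow-up sketch (not part of the original answer) fits on the training data and then evaluates and plots on the test set:
# Fit on the training split, evaluate on the held-out test split
linreg.fit(X_train, y_train)
print('Test R^2:', linreg.score(X_test, y_test))
plt.scatter(X_test, y_test, color='black')
plt.plot([X_test.min(), X_test.max()],
         linreg.predict([[X_test.min()], [X_test.max()]]),
         color='blue', linewidth=2)
plt.show()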
I've trained a simple decision tree and a bagging classifier (n_estimators = 100). Is it possible to plot the history of accuracy for the BaggingClassifier? And how can I calculate the variance of the accuracy across the 100 samples?
I've just printed the accuracy value for both algorithms:
# DecisionTree
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.90)
clf2 = tree.DecisionTreeClassifier()
clf2.fit(X_train, y_train)
pred2 = clf2.predict(X_test)
acc2 = clf2.score(X_test, y_test)
acc2 # 0.6983930778739185
# Bagging
clf3 = BaggingClassifier(tree.DecisionTreeClassifier(), max_samples=0.5, max_features=0.5,
                         n_estimators=100, verbose=2)
clf3.fit(X_train, y_train)
pred3 = clf3.predict(X_test)
acc3 = clf3.score(X_test, y_test)
acc3 # 0.911619283065513
I don't think that you can get this information from the fitted BaggingClassifier. But you can create such a plot by fitting for different n_estimators:
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn import datasets
from sklearn.model_selection import train_test_split
iris = datasets.load_iris()
X, X_test, y, y_test = train_test_split(iris.data,
                                        iris.target,
                                        test_size=0.20)
estimators = list(range(1, 20))
accuracy = []
for n_estimators in estimators:
    clf = BaggingClassifier(DecisionTreeClassifier(max_depth=1),
                            max_samples=0.2,
                            n_estimators=n_estimators)
    clf.fit(X, y)
    acc = clf.score(X_test, y_test)
    accuracy.append(acc)
plt.plot(estimators, accuracy)
plt.xlabel("Number of estimators")
plt.ylabel("Accuracy")
plt.show()
(Of course, the iris dataset is easily fit with just a single DecisionTreeClassifier, so I set max_depth=1 in this example.)
For a statistically meaningful result, you should fit a BaggingClassifier multiple times for each n_estimators and take the average of the obtained accuracies.
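A minimal sketch of that idea, reusing the variables from the snippet above (the number of repeats is an arbitrary choice); the standard deviation over the repeats also addresses the variance part of the question:
import numpy as np
n_repeats = 10
mean_accuracy = []
std_accuracy = []
for n_estimators in estimators:
    scores = []
    for _ in range(n_repeats):
        clf = BaggingClassifier(DecisionTreeClassifier(max_depth=1),
                                max_samples=0.2,
                                n_estimators=n_estimators)
        clf.fit(X, y)
        scores.append(clf.score(X_test, y_test))
    mean_accuracy.append(np.mean(scores))
    std_accuracy.append(np.std(scores))   # spread of accuracy across repeats
plt.errorbar(estimators, mean_accuracy, yerr=std_accuracy)
plt.xlabel("Number of estimators")
plt.ylabel("Mean accuracy over %d repeats" % n_repeats)
plt.show()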