Why is the confusion_matrix different when I execute it again? - python

I wonder why the confusion_matrix changes when I execute the code a second time, and whether this is avoidable. To be more exact, I got [[53445 597] [958 5000]] the first time, but [[52556 1486] [805 5153]] when I execute it again.
# get the data from dataset and split into training-set and test-set
mnist = fetch_openml('mnist_784')
X, y = mnist['data'], mnist['target']
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]
# make the data random
shuffle_index = np.random.permutation(60000)
X_train, y_train = X_train[shuffle_index], y_train[shuffle_index]
# true for all y_train='2', false for all others
y_train_2 = (y_train == '2')
y_test_2 = (y_test == '2')
# train the classifier on True/False labels depending on whether the digit is 2
# I use the random_state as 0, so it will not change, am I right?
sgd_clf = SGDClassifier(random_state=0)
sgd_clf.fit(X_train, y_train_2)
# get the confusion_matrix
y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_2, cv=3)
print('confusion_matrix is', confusion_matrix(y_train_2, y_train_pred))

You are using different data on each run (shuffle_index comes from np.random.permutation with no fixed seed), so there is no reason for the fit and the resulting confusion matrix to be exactly the same, though the results should be close if the algorithm is doing a good job.
To get rid of the randomness either specify indices:
shuffle_index = np.arange(60000) #Rather "not_shuffled_index"
Or use the same seed:
np.random.seed(1) #Or any number
shuffle_index = np.random.permutation(60000) #Will be the same for a given seed
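Another option (a minimal sketch, assuming the same X_train/y_train as in the question) is sklearn.utils.shuffle, which shuffles both arrays consistently under a fixed random_state:

from sklearn.utils import shuffle

# The same fixed seed gives the same permutation on every run,
# so the classifier and cross_val_predict see identical data each time.
X_train, y_train = shuffle(X_train, y_train, random_state=42)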

Related

Using TimeSeriesSplit within cross_val_score

I'm fitting a time series model, so I'm trying to cross-validate using TimeSeriesSplit. I believe the easiest way to apply it is through the cross_val_score function, via the cv argument.
The question is simple: is the way I am passing the cv argument correct? Should I call split(scaled_train), split(X_train), or split(input_data)? Or should I cross-validate in another way?
This is the code I am writing:
def fit_model1(data: pd.DataFrame):
    df = data
    scores_fit_model1 = []
    for sizes in test_sizes:
        # Generate test design
        input_data = df.drop('next_count', axis=1)
        output_data = df[['next_count']]
        X_train, X_test, y_train, y_test = train_test_split(input_data, output_data, test_size=sizes, random_state=0, shuffle=False)
        # Scaling
        scaler = MinMaxScaler()
        scaled_train = scaler.fit_transform(X_train)
        scaled_test = scaler.transform(X_test)
        # Build model
        lr = LinearRegression()
        lr.fit(scaled_train, y_train.values.ravel())
        predictions = lr.predict(scaled_test)
        # Cross-validation definition
        time_split = TimeSeriesSplit(n_splits=10)
        # Performance metric
        r2 = cross_val_score(lr, scaled_train, y_train.values.ravel(), cv=time_split.split(scaled_train), scoring='r2', n_jobs=1).mean()
        scores_fit_model1.append(r2)
    return scores_fit_model1
TimeSeriesSplit is simply a splitter that yields growing windows of sequential folds. Therefore, you can pass it as is to cv, or you can pass time_split.split(scaled_train), which amounts to the same thing: making splits over an array of the same length as your training data (which cross_val_score takes as its second positional argument). It doesn't matter whether TimeSeriesSplit gets the scaled or the original data, as long as cross_val_score gets the scaled data.
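For illustration, here is a small sketch on a toy array (not the asker's data) showing the growing windows TimeSeriesSplit yields:

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

toy = np.arange(6)  # six sequential observations
for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(toy):
    print('train:', train_idx, 'test:', test_idx)
# train: [0 1 2]     test: [3]
# train: [0 1 2 3]   test: [4]
# train: [0 1 2 3 4] test: [5]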
I also made some minor simplifications to your code: scaling before the train_test_split, and making the output data a Series (so you don't need values.ravel):
def fit_model1(data: pd.DataFrame):
    df = data
    scores_fit_model1 = []
    for sizes in test_sizes:
        # Generate test design
        input_data = df.drop('next_count', axis=1)
        output_data = df['next_count']
        scaler = MinMaxScaler()
        scaled_input = scaler.fit_transform(input_data)
        X_train, X_test, y_train, y_test = train_test_split(scaled_input, output_data, test_size=sizes, random_state=0, shuffle=False)
        # Build model
        lr = LinearRegression()
        lr.fit(X_train, y_train)
        predictions = lr.predict(X_test)
        # Cross-validation definition
        time_split = TimeSeriesSplit(n_splits=10)
        # Performance metric
        r2 = cross_val_score(lr, X_train, y_train, cv=time_split, scoring='r2', n_jobs=1).mean()
        scores_fit_model1.append(r2)
    return scores_fit_model1

Retrieving same output for different instances for XGBoost regression algorithm

I have the following code, which uses the XGBoost regression algorithm to perform prediction. The problem, however, is that the regressor predicts the same output for any input, and I'm not really sure why.
data = pd.read_csv("depthwise_data.csv", delimiter=',', header=None, skiprows=1,
                   names=['input_size', 'input_channels', 'conv_kernel', 'conv_strides', 'running_time'])
X = data[['input_size', 'input_channels', 'conv_kernel', 'conv_strides']]
Y = data[["running_time"]]

X_train, X_test, y_train, y_test = train_test_split(
    np.array(X), np.array(Y), test_size=0.2, random_state=42)

y_train_log = np.log(y_train)
y_test_log = np.log(y_test)

xgb_depth_conv = xgb.XGBRegressor(objective='reg:squarederror',
                                  n_estimators=1000,
                                  seed=123,
                                  tree_method='hist',
                                  max_depth=10)
xgb_depth_conv.fit(X_train, y_train_log)
y_pred_train = xgb_depth_conv.predict(X_train)
#y_pred_test = xgb_depth_conv.predict(X_test)

X_data = [[8, 576, 3, 2]]  # instance
X_test = np.log(X_data)
y_pred_test = xgb_depth_conv.predict(X_test)
print(np.exp(y_pred_test))

MSE_test, MSE_train = mse(y_test_log, y_pred_test), mse(y_train_log, y_pred_train)
R_squared = r2_score(y_pred_test, y_test_log)
print("MSE-Train = {}".format(MSE_train))
print("MSE-Test = {}".format(MSE_test))
print("R-Squared: ", np.round(R_squared, 2))
Output for first instance
X_data=[[8,576,3,2]]
print(np.exp(y_pred_test))
[0.7050679]
Output for second instance
X_data=[[4,960,3,1]]
print(np.exp(y_pred_test))
[0.7050679]
Your problem stems from this line: X_test = np.log(X_data)
Why are you applying the log to the test instances when you have not applied it to the training samples?
If you take away np.log completely, even from the target (y), you get really good results. I tested it myself with the data you provided.
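A minimal sketch of the corrected prediction step, following the answer's suggestion of dropping np.log entirely (variable names as in the question):

# Fit on the untransformed target.
xgb_depth_conv.fit(X_train, y_train)

# Predict new instances on the original feature scale -- no np.log on the inputs.
X_new = np.array([[8, 576, 3, 2], [4, 960, 3, 1]])
print(xgb_depth_conv.predict(X_new))  # instances are now on the scale the model was trained on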

Why does my PCA change every time I run the code in python?

I imputed the missing values in my dataframe with the median of each feature and scaled using StandardScaler(). I ran regular KNeighbors with n=3 and the accuracy stays consistent.
Now I am supposed to run PCA on the resulting dataset with n_components=4 and apply KNeighbors with 3 neighbors. However, the PCA dataset and the KNeighbors accuracy change every time I run the program, even though the master dataset itself doesn't change. I even tried using only the first 4 features of the dataset when applying KNeighbors, and even that is inconsistent.
data = pd.read_csv('dataset.csv')
y = merged['Life expectancy at birth (years)']
X_train, X_test, y_train, y_test = train_test_split(data,
                                                    y,
                                                    train_size=0.7,
                                                    test_size=0.3,
                                                    random_state=200)
for i in range(len(features)):
    featuredata = X_train.iloc[:, i]
    fulldata = data.iloc[:, i]
    fulldata.fillna(featuredata.median(), inplace=True)
    data.iloc[:, i] = fulldata

scaler = preprocessing.StandardScaler().fit(X_train)
data = scaler.transform(data)
If I apply KNeighbors here, it runs fine, and my accuracy score remains the same.
pcatest = PCA(n_components=4)
pca_data = pcatest.fit_transform(data)
X_train, X_test, y_train, y_test = train_test_split(pca_data,
                                                    y,
                                                    train_size=0.7,
                                                    test_size=0.3)
pca = neighbors.KNeighborsClassifier(n_neighbors=3)
pca.fit(X_train, y_train)
y_pred_pca = pca.predict(X_test)
pca_accuracy = accuracy_score(y_test, y_pred_pca)
However, my pca_accuracy score changes every time I run the code. What can I do to make it fixed and consistent?
first4_data = data[:, :4]
X_train, X_test, y_train, y_test = train_test_split(first4_data,
                                                    y,
                                                    train_size=0.7,
                                                    test_size=0.3)
first4 = neighbors.KNeighborsClassifier(n_neighbors=3)
first4.fit(X_train, y_train)
y_pred_first4 = first4.predict(X_test)
first4_accuracy = accuracy_score(y_test, y_pred_first4)
I am only taking the first 4 features/columns, and the data should remain the same, but for some reason the accuracy score changes every time I run it.
You need to give random_state a value in train_test_split; otherwise, every time you run it without specifying random_state you will get a different result. What happens is that every time you split your data, you do it in a different way unless you fix the random state. It's the equivalent of set.seed() in R.
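For example, in the PCA split above (the value 200 is reused from the first split purely for illustration):

X_train, X_test, y_train, y_test = train_test_split(pca_data,
                                                    y,
                                                    train_size=0.7,
                                                    test_size=0.3,
                                                    random_state=200)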

How is train_test_split with test_size=0 affecting the data?

I was using train_test_split in my code and then wanted to change it to cross-validation, but something strange is happening.
train, test = train_test_split(data, test_size=0)
x_train = train.drop('CRO', axis=1)
y_train = train['CRO']

scaler = MinMaxScaler(feature_range=(0, 1))
x_train_scaled = scaler.fit_transform(x_train)
x_train = pd.DataFrame(x_train_scaled)

for k in range(1, 5):
    knn = neighbors.KNeighborsRegressor(n_neighbors=k, weights='uniform')
    scores = model_selection.cross_val_score(knn, x_train, y_train, cv=5)
    print(scores.mean(), 'score for k = ', k)
This code gives scores around 0.8, but when I delete that first line and replace 'train' with 'data' in the 2nd and 3rd lines, the score drops to 0.2, which is strange because I even set test_size to 0, so train should be equal to the whole data.
What is happening?
One thing to be aware of is the implicit arguments passed to train_test_split.
By default, shuffle=True, so train_test_split reorders the rows of your training data, whereas passing the data in directly keeps its original order, which may introduce a different pattern into the model (for example, ordered rows falling into consecutive, unrepresentative CV folds).
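One way to check this (a sketch, assuming data is the same DataFrame with a 'CRO' column) is to shuffle the rows yourself with a fixed seed and rerun the cross-validation; if the scores come back up to around 0.8, the shuffling is indeed what made the difference:

from sklearn.utils import shuffle

# Reorder the rows with a fixed seed, mimicking the default shuffle in train_test_split.
data_shuffled = shuffle(data, random_state=0)
x_train = data_shuffled.drop('CRO', axis=1)
y_train = data_shuffled['CRO']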

Different values each time I run the code even with random_state

Each time I run this code, I get a different value from the print statement. I'm confused about why that happens, because I specifically included the random_state parameter for the train/test split. (On a side note, I hope I'm supposed to encode the data; it was giving "ValueError: could not convert string to float" otherwise.)
df = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data',
                 names=['buying', 'maint', 'doors', 'persons',
                        'lug_boot', 'safety', 'acceptability'])

# turns variables into numbers (algorithms won't let you do it otherwise)
df = df.apply(LabelEncoder().fit_transform)
print(df)

X = df.reindex(columns=['buying', 'maint', 'doors', 'persons',
                        'lug_boot', 'safety'])
y = df['acceptability']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
print(X_train)

# decision tree classification
clf = tree.DecisionTreeClassifier(criterion='entropy')
clf = clf.fit(X_train, y_train)
y_true = y_test
y_pred = clf.predict(X_test)
print(math.sqrt(mean_squared_error(y_true, y_pred)))
DecisionTreeClassifier also takes a random_state param: http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
All you did was ensure that the train/test splits are repeatable, but the classifier also needs its own seed to be the same on each run.
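For example:

# Fix the classifier's own seed so repeated runs build the same tree.
clf = tree.DecisionTreeClassifier(criterion='entropy', random_state=0)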
Update
Thanks to @Chester VonWinchester for pointing out https://github.com/scikit-learn/scikit-learn/issues/8443: due to sklearn's implementation choice, the tree can be non-deterministic even with max_features=None, although that setting should mean all features are considered.
There is further information and discussion in the link above.
