Each time I run this code, I get a different value from the print statement. I'm confused why, because I specifically included the random_state parameter in the train/test split. (On a side note, I hope I'm supposed to encode the data; without encoding it was raising "ValueError: could not convert string to float".)
import math

import pandas as pd
from sklearn import tree
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data',
                 names=['buying', 'maint', 'doors', 'persons',
                        'lug_boot', 'safety', 'acceptability'])
# encode the categorical strings as integers (the algorithm won't accept strings otherwise)
df = df.apply(LabelEncoder().fit_transform)
print(df)
X = df.reindex(columns=['buying', 'maint', 'doors', 'persons',
                        'lug_boot', 'safety'])
y = df['acceptability']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
print(X_train)
# decision trees classification
clf = tree.DecisionTreeClassifier(criterion='entropy')
clf = clf.fit(X_train, y_train)
y_true = y_test
y_pred = clf.predict(X_test)
print(math.sqrt(mean_squared_error(y_true, y_pred)))
DecisionTreeClassifier also takes a random_state param: http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
All you did was ensure that the train/test split is repeatable, but the classifier also needs its own seed to be fixed on each run.
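For example (a minimal sketch; the seed value 0 is arbitrary, any fixed integer works):

clf = tree.DecisionTreeClassifier(criterion='entropy', random_state=0)  # seed the tree as well as the split
clf = clf.fit(X_train, y_train)

With both train_test_split and the classifier seeded, repeated runs produce identical predictions and therefore an identical printed score.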
Update
Thanks to @Chester VonWinchester for pointing out https://github.com/scikit-learn/scikit-learn/issues/8443: due to sklearn's implementation choice, tree construction can be non-deterministic even with max_features=None, although that setting should mean that all features are considered.
There is further information and discussion in the link above.
Related
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# split dataset into features and target variable (assumes data is already loaded as a DataFrame)
feature_cols = ['RIAGENDR_0', 'RIDAGEYR', 'RIDRETH3_2', 'RIDRETH3_3', 'RIDRETH3_4', 'RIDRETH3_6', 'RIDRETH3_7', 'INDFMPIR', 'DMDMARTZ_1.0', 'DMDMARTZ_2.0', 'DMDMARTZ_3.0', 'DMDMARTZ_4.0', 'DMDMARTZ_6.0', 'DMDEDUC2', 'RFXT010', 'BMXWT', 'BMXBMI', 'URXUMA', 'LBDHDD', 'LBXFER', 'LBXGH', 'LBXBPB', 'LBXBCD', 'LBXBSE', 'LBXBMN', 'URXUBA', 'URXUCD', 'URXUCO', 'URXUCS', 'URXUMO', 'URXUMN', 'URXUPB', 'URXUSB', 'URXUSN', 'URXUTL', 'URXUTU']
X = data[feature_cols] # Features
scale = StandardScaler()
X = scale.fit_transform(X)
y = data['depre_score'] # Target variable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1) # 70% training and 30% test
clf = DecisionTreeClassifier()
clf = clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)
print(y_test)
print(y_pred)
confusion = metrics.confusion_matrix(y_test, y_pred)
print(confusion)
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
recall_sensitivity = metrics.recall_score(y_test, y_pred, pos_label=1)
recall_specificity = metrics.recall_score(y_test, y_pred, pos_label=0)
print(recall_sensitivity, recall_specificity)
Why do you think you are doing something wrong? Perhaps your data are such that you can achieve a perfect classification... e.g., see this mushroom classification.
Having said that, it is also possible that there is some leakage in your data, as noted by @gtomer. That means an exact data point present in the training set is also available in your test set. You can run a K-fold test on your data and see how the accuracy holds up. Secondly, try different classifiers too (Random Forests generally do better than a single Decision Tree).
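For example, a minimal sketch of both suggestions (the cv=5 setting and the default RandomForestClassifier parameters are illustrative, not tuned):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# cross-validation reports a score per fold instead of one lucky split,
# which makes leakage or an overly easy dataset easier to spot
rf = RandomForestClassifier(random_state=1)
scores = cross_val_score(rf, X, y, cv=5)
print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())

If every fold is near 100%, the task may genuinely be that easy; if only some folds are, look for leakage.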
I have the following code, using the XGBoost regression algorithm to perform a prediction. The problem, however, is that the regressor predicts the same output for any input, and I'm not really sure why.
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.metrics import mean_squared_error as mse, r2_score
from sklearn.model_selection import train_test_split

data = pd.read_csv("depthwise_data.csv", delimiter=',', header=None, skiprows=1,
                   names=['input_size', 'input_channels', 'conv_kernel', 'conv_strides', 'running_time'])
X = data[['input_size', 'input_channels','conv_kernel', 'conv_strides']]
Y = data[["running_time"]]
X_train, X_test, y_train, y_test = train_test_split(
    np.array(X), np.array(Y), test_size=0.2, random_state=42)
y_train_log = np.log(y_train)
y_test_log = np.log(y_test)
xgb_depth_conv = xgb.XGBRegressor(objective='reg:squarederror',
                                  n_estimators=1000,
                                  seed=123,
                                  tree_method='hist',
                                  max_depth=10)
xgb_depth_conv.fit(X_train, y_train_log)
y_pred_train = xgb_depth_conv.predict(X_train)
#y_pred_test = xgb_depth_conv.predict(X_test)
X_data=[[8,576,3,2]] #instance
X_test=np.log(X_data)
y_pred_test=xgb_depth_conv.predict(X_test)
print(np.exp(y_pred_test))
MSE_test, MSE_train = mse(y_test_log,y_pred_test), mse(y_train_log, y_pred_train)
R_squared = r2_score(y_pred_test,y_test_log)
print("MSE-Train = {}".format(MSE_train))
print("MSE-Test = {}".format(MSE_test))
print("R-Squared: ", np.round(R_squared, 2))
Output for first instance
X_data=[[8,576,3,2]]
print(np.exp(y_pred_test))
[0.7050679]
Output for second instance
X_data=[[4,960,3,1]]
print(np.exp(y_pred_test))
[0.7050679]
Your problem stems from this line: X_test = np.log(X_data).
Why are you applying a log transform to the test features when you did not apply it to the training features?
If you take the np.log away from the features completely (you can even drop it from the target y), you get really good results. I tested it myself with the data you provided.
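Something like this sketch should behave as expected (it keeps the log transform on the target only, exactly as in your training code):

X_data = [[8, 576, 3, 2]]  # raw features, on the same scale as X_train
y_pred_test = xgb_depth_conv.predict(np.array(X_data))  # no np.log on the features
print(np.exp(y_pred_test))  # invert the log that was applied to the target

Because the model was fit on raw features, log-transformed test features land far outside the training distribution, so every input falls into the same leaves of the trees and yields the same prediction.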
I imputed the missing values in my dataframe with the median of each feature and scaled the data using StandardScaler(). I ran a regular KNeighbors classifier with n_neighbors=3, and the accuracy stays consistent.
Now I am supposed to run PCA on the resulting dataset with n_components=4 and apply KNeighbors with 3 neighbors. However, every time I run my code, the PCA output and the KNeighbors accuracy change, even though the master dataset itself doesn't change. I even tried using just the first 4 features of the dataset for KNeighbors, and even that is inconsistent.
import pandas as pd
from sklearn import neighbors, preprocessing
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

data = pd.read_csv('dataset.csv')
y = data['Life expectancy at birth (years)']
X_train, X_test, y_train, y_test = train_test_split(data,
                                                    y,
                                                    train_size=0.7,
                                                    test_size=0.3,
                                                    random_state=200)
# impute each column with the median of the corresponding training column
for i in range(data.shape[1]):
    featuredata = X_train.iloc[:, i]
    fulldata = data.iloc[:, i]
    fulldata.fillna(featuredata.median(), inplace=True)
    data.iloc[:, i] = fulldata
scaler = preprocessing.StandardScaler().fit(X_train)
data = scaler.transform(data)
If I apply KNeighbors here, it runs fine, and my accuracy score remains the same.
pcatest = PCA(n_components=4)
pca_data = pcatest.fit_transform(data)
X_train, X_test, y_train, y_test = train_test_split(pca_data,
                                                    y,
                                                    train_size=0.7,
                                                    test_size=0.3)
pca = neighbors.KNeighborsClassifier(n_neighbors=3)
pca.fit(X_train, y_train)
y_pred_pca = pca.predict(X_test)
pca_accuracy = accuracy_score(y_test, y_pred_pca)
However, my pca_accuracy score changes every time I run the code. What can I do to make it fixed and consistent?
first4_data = data[:, :4]
X_train, X_test, y_train, y_test = train_test_split(first4_data,
                                                    y,
                                                    train_size=0.7,
                                                    test_size=0.3)
first4 = neighbors.KNeighborsClassifier(n_neighbors=3)
first4.fit(X_train, y_train)
y_pred_first4 = first4.predict(X_test)
first4_accuracy = accuracy_score(y_test, y_pred_first4)
I am only taking the first 4 features/columns, and the data should remain the same, but for some reason the accuracy score changes every time I run it.
You need to give random_state a value in train_test_split; otherwise, every time you run it without specifying random_state, you will get a different result. What happens is that every time you split your data, you split it in a different way, unless you specify a random state. It's the equivalent of set.seed() in R.
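For example (the seed 200 just matches the one you already use for the first split; any fixed integer works):

X_train, X_test, y_train, y_test = train_test_split(pca_data,
                                                    y,
                                                    train_size=0.7,
                                                    test_size=0.3,
                                                    random_state=200)  # fixed seed -> identical split every run

The same change applies to the first4_data split. PCA with the default full SVD solver is deterministic for a fixed input, so once the splits are seeded, the accuracy scores should stop changing.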
I am working with variable-length time series. In particular, I am using the tslearn tool.
I transformed the data so that it fits the format tslearn expects:
from sklearn.model_selection import train_test_split
from tslearn.utils import to_time_series_dataset

X_train, X_test, y_train, y_test = train_test_split(data, Y, test_size=0.3, random_state=42, stratify=Y)
X_train = to_time_series_dataset(X_train)
X_test = to_time_series_dataset(X_test)
then I run:
from tslearn.shapelets import LearningShapelets
clf = LearningShapelets(n_shapelets_per_size={3: 1})
clf.fit(X_train, y_train)
Everything runs smoothly, but when I run:
y_pred = clf.predict_proba(X_test)
I get a y_pred that contains only nan values and no predictions.
Does anyone have a suggestion about why this is happening?
I wonder why the confusion_matrix changes when I execute the code a second time, and whether that is avoidable. To be more exact, I got [[53445 597] [958 5000]] the first time, but [[52556 1486] [805 5153]] when I executed it again.
import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict

# get the data from the dataset and split into training set and test set
# (as_frame=False so X and y are numpy arrays that accept the integer-array indexing below)
mnist = fetch_openml('mnist_784', as_frame=False)
X, y = mnist['data'], mnist['target']
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]
# make the data random
shuffle_index = np.random.permutation(60000)
X_train, y_train = X_train[shuffle_index], y_train[shuffle_index]
# true for all y_train='2', false for all others
y_train_2 = (y_train == '2')
y_test_2 = (y_test == '2')
# train the data with a label of T/F depends on whether the data is 2
# I use the random_state as 0, so it will not change, am I right?
sgd_clf = SGDClassifier(random_state=0)
sgd_clf.fit(X_train, y_train_2)
# get the confusion_matrix
y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_2, cv=3)
print('confusion_matrix is', confusion_matrix(y_train_2, y_train_pred))
You are using different data on each run (a new shuffle_index every time), so there is no reason for the ML run and the resulting confusion matrix to be exactly the same; the results should be close, though, if the algorithm is doing a good job.
To get rid of the randomness either specify indices:
shuffle_index = np.arange(60000) #Rather "not_shuffled_index"
Or use the same seed:
np.random.seed(1) #Or any number
shuffle_index = np.random.permutation(60000) #Will be the same for a given seed
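Putting both fixes together, a minimal sketch (the seed values are arbitrary):

np.random.seed(1)  # fix the shuffle
shuffle_index = np.random.permutation(60000)
X_train, y_train = X_train[shuffle_index], y_train[shuffle_index]
y_train_2 = (y_train == '2')

sgd_clf = SGDClassifier(random_state=0)  # the classifier's own seed, as you already have
y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_2, cv=3)
print('confusion_matrix is', confusion_matrix(y_train_2, y_train_pred))  # identical on every run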