I am going through the Kaggle Digit Recognizer Tutorial and I'm trying to understand how all of this works. I would like to validate a predicted value. Basically, I have a prediction that's wrong, but I want to see what the actual value of that prediction was. I think I am way off:
...
df = pd.read_csv('data/train.csv')
labels = df['label'].values
x_train = df.drop(columns=['label']).values / 255
# trying to produce a crappy dataset for train/test
x_train, x_test, y_train, y_test = train_test_split(x_train, labels, test_size=0.95)
# Purposely trying to get a crappy model so I can learn about validation
model = tf.keras.models.Sequential()
# model.add(tf.keras.layers.Flatten())
# model.add(tf.keras.layers.Dense(128, activation=tf.nn.relu))
model.add(tf.keras.layers.Dense(10, activation=tf.nn.softmax))
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.fit(x_train, y_train, epochs=1)
predictions = model.predict([x_test])
index_to_predict = 0
print('Prediction: ', np.argmax(predictions[index_to_predict]))
print('Actual: ', predictions.argmax(axis=-1)[index_to_predict])
print(predictions.shape)
vals = x_test[index_to_predict].reshape(28, 28)
plt.imshow(vals)
This yields the following:
How can I get a true "here's the prediction" and "here's the actual" breakdown? My logic for getting the actual value is definitely off.
The true labels (also sometimes called target values, or ground-truth labels) are stored in y_train and y_test for training and test set respectively. Therefore, you can easily just print that to find the true label:
print('Actual:', y_test[index_to_predict])
y_test[index_to_predict] will have the actual label, and predictions[index_to_predict] should have the predicted probability values for each of your classes.
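A minimal sketch of the prediction-versus-actual breakdown, assuming predictions is the 2-D array of class probabilities returned by model.predict and y_test holds integer labels. The small arrays below are made-up stand-ins (three classes instead of ten), not real model output:

```python
import numpy as np

# Made-up stand-ins for model.predict(x_test) and y_test (3 classes for brevity).
predictions = np.array([
    [0.1, 0.7, 0.2],
    [0.6, 0.3, 0.1],
    [0.2, 0.2, 0.6],
    [0.5, 0.4, 0.1],
])
y_test = np.array([1, 2, 2, 0])

# Predicted class = index of the largest probability in each row.
predicted_labels = predictions.argmax(axis=-1)

# Indices where the prediction disagrees with the true label.
wrong = np.flatnonzero(predicted_labels != y_test)

for i in wrong:
    print(f"index {i}: predicted {predicted_labels[i]}, actual {y_test[i]}")

# Fraction of samples where prediction and true label agree.
accuracy = (predicted_labels == y_test).mean()
print("accuracy:", accuracy)
```

Looping over wrong is one way to pick an index_to_predict that is known to be misclassified before plotting the image.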
I have built a multi-class classification model with Keras, and now that the model is trained I would like to predict the value for one of my test inputs.
This is the part where I scaled features:
x = dataframe.drop("workTime", axis = 1)
x = dataframe.drop("creation", axis = 1)
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x = pd.DataFrame(sc.fit_transform(x))
y = dataframe["workTime"]
import seaborn as sb
corr = dataframe.corr()
sb.heatmap(corr, cmap="Blues", annot=True)
print("Scaled features:", x.head(3))
Then I did:
y_cat = to_categorical(y)
x_train, x_test, y_train, y_test = train_test_split(x.values, y_cat, test_size=0.2)
And built model:
model = Sequential()
model.add(Dense(16, input_shape = (9,), activation = "relu"))
model.add(Dense(8, activation = "relu"))
model.add(Dropout(0.5))
model.add(Dense(6, activation = "softmax"))
model.compile(Adam(lr = 0.0001), "categorical_crossentropy", metrics = ["categorical_accuracy"])
model.summary()
model.fit(x_train, y_train, verbose=1, batch_size = 8, epochs=100, shuffle=True)
After training finished, I wanted to take the first element from the test data and predict its value (classify it).
print(x_test.shape, x_train.shape)  # (1550, 9) (6196, 9)
firstTest = x_test[:1]
# [[ 2.76473141  1.21064165  0.18816548 -0.94077449 -0.30981017 -0.37723917
#   -0.44471711 -1.44141792  0.20222467]]
prediction = model.predict(firstTest)
print(prediction)
# [[7.5265622e-01 2.4710520e-01 2.3643016e-04 2.1405797e-06 3.8411264e-19
#   9.4137732e-23]]
print(prediction[0])
# [7.5265622e-01 2.4710520e-01 2.3643016e-04 2.1405797e-06 3.8411264e-19
#  9.4137732e-23]
unscaled = sc.inverse_transform(prediction)
print("prediction", unscaled)
This raises:
ValueError: operands could not be broadcast together with shapes (1,6) (9,) (1,6)
I think it may be related to my scalers.
Please correct me if I am wrong, but what I want to achieve here is either a single output value that tells me how this entry was classified, or an array of probabilities, one for each classification label.
Thank you for any hints.
Your StandardScaler was used to scale the input features; you can't apply it (or its inverse) to the outputs!
If you are looking for the probabilities of the test sample being in each class, you already have it in prediction[0].
If you want the final class predicted, just take the one with the largest probability with argmax: tf.math.argmax(prediction[0]).
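As a small sketch of that last step, here is the same idea with NumPy's argmax, used as a stand-in for the tf.math.argmax call above. The probabilities are copied from the question's printed output; the one-hot row is a hypothetical example of what a row of y_test looks like after to_categorical:

```python
import numpy as np

# Probabilities copied from the question's printed output (one sample, six classes).
prediction = np.array([[7.5265622e-01, 2.4710520e-01, 2.3643016e-04,
                        2.1405797e-06, 3.8411264e-19, 9.4137732e-23]])

# Index of the largest probability is the predicted class.
predicted_class = int(np.argmax(prediction[0]))
confidence = float(prediction[0][predicted_class])

# Hypothetical one-hot row, shaped like one row of y_test after to_categorical.
y_true_one_hot = np.array([0.0, 1.0, 0.0, 0.0, 0.0, 0.0])
actual_class = int(np.argmax(y_true_one_hot))

print("predicted:", predicted_class, "confidence:", round(confidence, 3))
print("actual:", actual_class)
```

Because the labels were one-hot encoded, the same argmax call recovers the true class index for comparison.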
I am using TensorFlow and Keras for a binary classification problem.
I have a training set of only 81 samples (test set size 21), but ~1900 features. I know that is too few samples and too many features, but it is a biological problem (gene-expression data), so I have to deal with it.
My model looks like this: (using different neurons per layer, different number of hidden layers, regularization and dropout to deal with the high dimensional data)
model = Sequential()
model.add(Input((input_shape,)))
for i in range(num_hidden):
    model.add(Dense(n_neurons, activation="relu", kernel_regularizer=keras.regularizers.l1_l2(l1_reg, l2_reg)))
    model.add(Dropout(dropout_rate))
model.add(Dense(1, activation="sigmoid"))
ann_optimizer= keras.optimizers.Adam()
model.compile(loss="binary_crossentropy",
optimizer=ann_optimizer, metrics=['accuracy'])
I am using a 10 fold nested cross validation and grid search in the inner fold like this:
# fit and evaluate the model
# configure the inner cross-validation procedure (5 fold, 80 inner training dataset, 20 inner test dataset)
cv_inner = ShuffleSplit(n_splits=5, test_size=0.2, random_state=1)
# define the model
ann = KerasRegressor(build_fn=regressionModel_sequential, input_shape=X_train.shape[1],
                     batch_size=batch_size)
# use a pipeline to prevent leaky preprocessing (StandardScaler is fit only on the 80% inner-training dataset)
pipe = Pipeline(steps=[('scaler', StandardScaler()), ('ann', ann)])
# define the grid search with the inner CV to find good parameters
grid_search_result = GridSearchCV(
    pipe, param_grid, n_jobs=-1, cv=cv_inner, refit=True, verbose=0)
# refit=True refits a final model on the entire inner-training dataset
# execute search
grid_search_result.fit(X_train, y_train, ann__verbose=0)
logger.info('>>>>> est=%.3f, params=%s' % (grid_search_result.best_score_, grid_search_result.best_params_))
# to get loss curve
ann_val = regressionModel_sequential(input_shape=X_train.shape[1],
n_neurons=grid_search_result.best_params_['ann__n_neurons'],
l1_reg=grid_search_result.best_params_['ann__l1_reg'],
l2_reg=grid_search_result.best_params_['ann__l2_reg'],
num_hidden=grid_search_result.best_params_['ann__num_hidden'],
dropout_rate=grid_search_result.best_params_['ann__dropout_rate'])
# Validation with outer 20 %
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
history = ann_val.fit(X_train, y_train, batch_size=batch_size, verbose=0,
validation_split=0.25, shuffle=True, epochs=grid_search_result.best_params_['ann__epochs'])
plot_history(history, directory, i)
# use the best grid search result for predicting on the outer test dataset
y_predicted = ann_val.predict(X_test)
# print predicted
logger.info(y_predicted[:5])
logger.info(y_test[:5])
rmse = np.sqrt(metrics.mean_squared_error(y_test, y_predicted))
# MAE needs mean_absolute_error; the original called mean_squared_error here
mae = metrics.mean_absolute_error(y_test, y_predicted)
r_squared = metrics.r2_score(y_test, y_predicted)
My loss seems good (loss plot).
But accuracy is very bad (accuracy plot, example from one outer fold).
Does anyone have suggestions on what I could do to improve my results?
I also know that the underlying biological question is very hard and may not be possible to solve.
import pandas as pd
df=pd.read_csv('final sheet for project.csv')
features=['moisture','volatile matter','fixed carbon','calorific value','carbon %','oxygen%']
train_data=df[features]
target_data=df.pop('Activation energy')
X_train, X_test, y_train, y_test = train_test_split(train_data,target_data, test_size=0.09375, random_state=1)
standard_X_train=pd.DataFrame(StandardScaler().fit_transform(X_train))
standard_X_test=pd.DataFrame(StandardScaler().fit_transform(X_test))
y_train=y_train.values
y_train = y_train.reshape((-1, 1))
scaler = MinMaxScaler(feature_range=(0, 1))
scaler = scaler.fit(y_train)
normalized_y_train = scaler.transform(y_train)
y_test=y_test.values
y_test = y_test.reshape((-1, 1))
scaler = MinMaxScaler(feature_range=(0, 1))
scaler = scaler.fit(y_test)
normalized_y_test = scaler.transform(y_test)
model=keras.Sequential([layers.Dense(units=20,input_shape=[6,]),layers.Dense(units=1,activation='tanh')])
model.compile(
optimizer='adam',
loss='mae',
)
history = model.fit(standard_X_train,normalized_y_train, validation_data=(standard_X_test,normalized_y_test),epochs=200)
I wish to create a model to predict activation energy from some features. I am getting a training loss of 0.0629 and a val_loss of 0.4213.
But when I try to predict the activation energies of other unseen data, I get bizarre results. I am a beginner in ML.
Can someone please advise what changes should be made to the code? (I want to build a model with one hidden layer of 20 units that uses the tanh activation function.)
You should not use fit_transform on the test data. Use fit_transform on the training data and apply only transform to the test data, so that the parameters learned from the training data are also used for the test data.
So, the transformation part of your code should change like this:
scaler_x = StandardScaler()
standard_X_train = pd.DataFrame(scaler_x.fit_transform(X_train))
standard_X_test = pd.DataFrame(scaler_x.transform(X_test))
y_train=y_train.values
y_train = y_train.reshape((-1, 1))
y_test=y_test.values
y_test = y_test.reshape((-1, 1))
scaler_y = MinMaxScaler(feature_range=(0, 1))
normalized_y_train = scaler_y.fit_transform(y_train)
normalized_y_test = scaler_y.transform(y_test)
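To illustrate why refitting on the test set is a problem, here is a rough NumPy sketch of min-max scaling done both ways. The helper functions and the target values are made up for illustration; they mimic what a min-max scaler learns but are not the scikit-learn implementation:

```python
import numpy as np

def fit_min_max(values):
    """Return the (min, max) pair a min-max scaler would learn from `values`."""
    return values.min(), values.max()

def transform(values, params):
    """Map `values` linearly so that params[0] -> 0 and params[1] -> 1."""
    lo, hi = params
    return (values - lo) / (hi - lo)

# Made-up targets: the test set covers only part of the training range.
y_train = np.array([10.0, 60.0, 110.0])
y_test = np.array([50.0, 60.0])

train_params = fit_min_max(y_train)

# Consistent: reuse the training parameters for the test set.
consistent = transform(y_test, train_params)

# Inconsistent: refit on the test set, so 60.0 no longer maps to the
# same scaled value it had during training.
refit = transform(y_test, fit_min_max(y_test))

print("train parameters reused:", consistent)
print("refit on test:          ", refit)
```

With the training parameters, 60.0 maps to 0.5 in both sets; after refitting on the test set it maps to 1.0, so the model is evaluated against targets on a different scale from the one it was trained on.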
Furthermore, since you are scaling your data, you should do the same thing for any prediction. So, your prediction line should be something like:
preds = scaler_y.inverse_transform(
model.predict(scaler_x.transform(pred_input)) #if it is standard_X_test you don't need to transform again, since you already did it.
)
Additionally, since you are transforming your labels into the range 0 to 1, you may need to change your last layer's activation function from tanh to sigmoid. If you still get poor results after the above modifications, it may also help to use an activation function such as relu in your first layer.
model=keras.Sequential([
layers.Dense(units=20,input_shape=[6,],activation='relu'),
layers.Dense(units=1,activation='sigmoid')
])
Dataset:
The PV Yield (kWh) is my output. My model is supposed to predict this.
This is what I have done. I have attached an image of the dataset. The columns from AirTemp to Zenith are my X, and Y is the PV Yield (kWh).
df=pd.read_csv("Data1.csv")
X=df.drop(['Date-PrimaryKey','output-PV Yield (kWh)'],axis=1)
Y=df['output-PV Yield (kWh)']
pca = PCA(n_components=9)
pca.fit(X_train)
X_train = pca.transform(X_train)
pca.fit(X_test)
X_test = pca.transform(X_test)
#normalizing the input values to fall in -1 to 1
X_train = X_train/180000000.0
X_test = X_test/180000000.0
#Creating Model
model = Sequential()
model.add(Dense(15, input_shape=(9,)))
model.add(Activation('tanh'))
model.add(Dense(11))
model.add(Activation('tanh'))
model.add(Dense(1))
model.summary()
sgd = optimizers.SGD(lr=0.1,momentum=0.2)
model.compile(loss='mean_absolute_error',optimizer=sgd,metrics=['accuracy'])
#Training
model.fit(X_train, train_y, epochs=20, batch_size = 50, validation_data=(X_test, test_y))
My weights are not getting updated. Accuracy is zero in all epochs.
The model seems OK, but there are two problems I can spot quickly:
pca = PCA(n_components=9)
pca.fit(X_train)
X_train = pca.transform(X_train)
pca.fit(X_test)
X_test = pca.transform(X_test)
Anything used to transform the data must not be fit on the test data. You fit it on the training samples and then use it to transform both the train and test parts. You should assume that you know nothing about the data you will be predicting on in production, e.g. you know nothing about tomorrow's weather, the results of sports matches a month from now, etc. You won't be able to fit on that data then, so you can't do so during training. The correct way:
pca = PCA(n_components=9)
pca.fit(X_train)
X_train = pca.transform(X_train)
X_test = pca.transform(X_test)
The second, and very incorrect, part is here:
#normalizing the input values to fall in -1 to 1
X_train = X_train/180000000.0
X_test = X_test/180000000.0
Of course you want to normalize your data, but this way you will end up with extremely small values in columns where the values are low, e.g. the AlbedoDaily column, and much larger values in columns where the values are high, such as SurfacePressure. For this kind of scaling you can use existing classes such as StandardScaler. The code is very simple, and each column is treated independently:
from sklearn.preprocessing import StandardScaler
transformer = StandardScaler().fit(X_train)
X_train = transformer.transform(X_train)
X_test = transformer.transform(X_test)
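As a rough illustration of the difference, the NumPy sketch below compares dividing by one global constant with per-column standardization using training statistics. The two columns are made-up values on very different scales, not the real dataset, and the manual mean/std arithmetic is only meant to mirror what a standard scaler does per column:

```python
import numpy as np

# Made-up columns on very different scales (a small-valued and a large-valued feature).
X_train = np.array([[0.10, 100000.0],
                    [0.20, 101000.0],
                    [0.30, 102000.0]])
X_test = np.array([[0.25, 100500.0]])

# Dividing by one global constant leaves the columns on wildly different scales.
globally_scaled = X_train / 180000000.0

# Per-column statistics, computed on the training rows only.
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)

X_train_std = (X_train - mean) / std
X_test_std = (X_test - mean) / std  # same parameters reused for the test rows

print("global divide, column maxima:", globally_scaled.max(axis=0))
print("standardized train mean:", X_train_std.mean(axis=0))
print("standardized train std: ", X_train_std.std(axis=0))
print("standardized test row:  ", X_test_std)
```

After the global divide the first column is vanishingly small compared with the second, while the standardized columns both end up with zero mean and unit variance on the training rows.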
You have not provided or explained what your target variable is or where you get it, so there could be other problems in your code that I cannot see right now.
I want to start trying my hand at neural networks and found Keras to be syntactically very simple. My setup: X_train is an array of shape (3516, 6)
and y_train is of shape (3516,).
X_train looks like this:
[[ 888. 900.5 855. 879.311 877.00266667
893.5008 ]
[ 875. 878.5 840. 880.026 874.56933333
890.7948 ]
[ 860. 870. 839.5 880.746 870.54333333
887.6428 ]....]
It is an input of 6 pieces of financial data used to predict one output. I know it's not going to be accurate, but it is meant to get me going on something before I move on to RNNs.
My problem is that the loss at every epoch shows nan, accuracy shows 0%, and validation accuracy shows 0%, as if the data isn't even being passed through the model. Even if it's a poor model with poor inputs, that should show up as a large loss, right? Here is the model (see below):
Anyway, I am sure that I am doing something wrong and would really appreciate your input.
Many thanks,
S
EDIT: FULL WORKING CODE:
def load_data(keyword):
    df = pd.read_csv('%s_x.csv' % keyword)
    df2 = pd.read_csv('%s_y.csv' % keyword)
    df2 = df2['label']
    try:
        df.drop('Unnamed: 0', axis=1, inplace=True)
    except:
        print('wouldnt let drop unnamed column')
    X = df.as_matrix()
    y = df2.as_matrix()
    X_len = len(X)
    test_size = 0.2
    test_split = int(test_size * X_len)
    # chronological split: the last 20% of rows become the test set
    X_train = X[:-test_split]
    y_train = y[:-test_split]
    X_test = X[-test_split:]
    y_test = y[-test_split:]
    # returned so that training() can unpack the four arrays
    return X_train, X_test, y_train, y_test
def keras():
    model = Sequential([
        Dense(input_dim=3, output_dim=3),
        Dense(output_dim=60, activation='linear'),
        core.Dropout(p=0.1),
        Dense(60, activation='linear'),
        core.Dropout(p=0.1),
        Dense(1, activation='linear')
    ])
    return model
def training(epoch):
    # start the program off by loading some data into it
    X_train, X_test, y_train, y_test = load_data('admiral')
    y_train = y_train.reshape(len(y_train), 1)
    y_test = y_test.reshape(len(y_test), 1)
    model = keras()
    # optimizer will go into the compile function
    # RMSprop is apparently a pretty decent choice for recurrent neural networks, although we will start it on a simple nn too.
    rms = optimizers.RMSprop(lr=0.001, rho=0.9, epsilon=1e-08)
    # loss name must not contain a trailing space
    model.compile(optimizer=rms, loss='mean_squared_error', metrics=['accuracy'])
    model.fit(X_train, y_train, nb_epoch=epoch, batch_size=500, validation_split=0.01)
    score = model.evaluate(X_test, y_test, batch_size=50)
    print(score)
training(300)
Accuracy is really low because it doesn't make sense to report accuracy for a regression problem; it is a metric for classification.
Data was being passed through; the value was just so low that it showed up as NaN. Question answered.
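A small sketch of why accuracy is uninformative for regression, using made-up continuous targets and predictions. This illustrates the idea only and is not how Keras computes its accuracy metric internally:

```python
import numpy as np

# Made-up continuous targets and predictions that are close but never identical.
y_true = np.array([879.3, 880.0, 880.7, 881.2])
y_pred = np.array([879.1, 880.4, 880.5, 881.9])

# An accuracy-style metric counts exact matches, which essentially never
# happen for real-valued outputs.
exact_match_accuracy = float(np.mean(y_pred == y_true))

# Regression metrics measure how far off the predictions are instead.
mae = float(np.mean(np.abs(y_pred - y_true)))
rmse = float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

print("exact-match accuracy:", exact_match_accuracy)
print("MAE:", mae)
print("RMSE:", rmse)
```

The predictions here are all within one unit of the target, yet the exact-match score is zero, which is why an error metric such as MAE or RMSE is the more useful thing to monitor.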