I am new to the TensorFlow framework and I am trying to use TensorFlow to predict survivors based on the Titanic dataset: https://www.kaggle.com/c/titanic/data.
import tensorflow as tf
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
#%%
titanictrain = pd.read_csv('train.csv')
titanictest = pd.read_csv('test.csv')
df = pd.concat([titanictrain,titanictest],join='outer',keys='PassengerId',sort=False,ignore_index=True).drop(['Name'],axis=1)
#%%
def preprocess(df):
    df['Fare'].fillna(value=df.groupby('Pclass')['Fare'].transform('median'),inplace=True)
    df['Fare'] = df['Fare'].map(lambda x: np.log(x) if x>0 else 0)
    df['Embarked'].fillna(value=df['Embarked'].mode()[0],inplace=True)
    df['CabinAlphabet'] = df['Cabin'].str[0]
    categories_to_one_hot = ['Pclass','Sex','Embarked','CabinAlphabet']
    df = pd.get_dummies(df,columns=categories_to_one_hot,drop_first=True)
    return df
df = preprocess(df)
df = df.drop(['PassengerId','Ticket','Cabin','Survived'],axis=1)
titanic_trainandval = df.iloc[:len(titanictrain)]
titanic_test = df.iloc[len(titanictrain):] #test after preprocessing
titanic_test.head()
# split train into training and validation set
labels = titanictrain['Survived']
y = labels.values
test = titanic_test.copy() # real test sets
print(len(test), 'test examples')
Here I apply the following preprocessing to the data:
1. Drop the Name column and one-hot encode both the train and test sets.
2. Drop ['PassengerId','Ticket','Cabin','Survived'] for simplicity.
3. Split train and test following the original order.
Here is a picture showing what the training set looks like.
"""# model training"""
from tensorflow.keras.layers import Input, Dense, Activation,Dropout
from tensorflow.keras.models import Model
X = titanic_trainandval.copy()
input_layer = Input(shape=(X.shape[1],))
dense_layer_1 = Dense(10, activation='relu')(input_layer)
dense_layer_2 = Dense(5, activation='relu')(dense_layer_1)
output = Dense(1, activation='softmax',name = 'predictions')(dense_layer_2)
model = Model(inputs=input_layer, outputs=output)
base_learning_rate = 0.0001
model.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True), optimizer=tf.keras.optimizers.Adam(learning_rate=base_learning_rate), metrics=['acc'])
history = model.fit(X, y, batch_size=5, epochs=20, verbose=2, validation_split=0.1,shuffle = False)
submission = pd.DataFrame()
submission['PassengerId'] = titanictest['PassengerId']
Then I put the training set X into the model to get the result. However, history shows the following result:
No matter how I change the learning rate and batch size, the result does not change, and the loss is always 'nan', and the prediction based on the test set is always 'nan' as well.
Could anybody explain where the problem is and give some possible solutions?
At first glance, there are two major problems in your code:
1. Your output layer must be Dense(2, activation='softmax'). This is a binary classification problem, and if you use softmax to generate probabilities, the output dimension must equal the number of classes. (Alternatively, you can use a single output unit with sigmoid activation.)
2. You have to change your loss function. With softmax and integer-encoded targets, use sparse_categorical_crossentropy. (With sigmoid, you can use binary_crossentropy with from_logits=False, which is the default.)
PS: Make sure to remove all NaNs from your original data before calling fit.
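To illustrate the first point, here is a minimal NumPy sketch (not Keras itself; the `softmax` and `sigmoid` helpers and the sample values are made up for illustration). It shows that softmax over a single unit can only ever output 1.0, whereas a sigmoid unit or a two-unit softmax produces usable probabilities.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = z - z.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical pre-activation outputs of a single-unit layer for three samples.
single_unit = np.array([[-3.0], [0.0], [5.0]])
softmax_out = softmax(single_unit)   # every row normalizes to exactly 1.0
sigmoid_out = sigmoid(single_unit)   # values vary with the input

# Hypothetical outputs of a two-unit layer for the same three samples.
two_units = np.array([[-3.0, 1.0], [0.0, 0.0], [5.0, 2.0]])
two_class_probs = softmax(two_units)  # each row sums to 1 across two classes
```

Because the single-unit softmax output is constant, the network cannot express "class 0" at all, which is why the choice is between Dense(2, softmax) and Dense(1, sigmoid).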
Marco Cerliani is right about points 1 and 2.
The real reason you get NaNs is that you feed NaNs into your model. If you look carefully, even in your third photo, the 888th example in the Age column contains a NaN.
This is why you have NaNs. Solve this, apply Marco Cerliani's suggestions, and you're good to go.
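As a rough illustration of why a single missing value is enough, the NumPy sketch below (toy numbers and a made-up weight vector, not the Titanic data) shows a NaN in one feature propagating through a dense-layer style dot product, and one possible fix, median imputation. Whether the median is the right fill value for Age is an assumption you should check against your data.

```python
import numpy as np

# Hypothetical feature matrix: columns are Age and Fare, one Age value is missing.
X = np.array([[22.0, 7.25],
              [np.nan, 8.05],
              [35.0, 53.1]])

# Hypothetical weights of a single dense unit.
w = np.array([0.1, -0.2])
raw_output = X @ w  # the row with the missing Age yields nan

# Count missing values per column, then fill each with that column's median.
nan_counts = np.isnan(X).sum(axis=0)
col_medians = np.nanmedian(X, axis=0)
X_filled = np.where(np.isnan(X), col_medians, X)
clean_output = X_filled @ w
```

On a pandas DataFrame, the equivalent check is df.isna().sum(), and a column can be filled with df['Age'].fillna(df['Age'].median()).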
In addition to the answers above, one more thing: whenever you want to use from_logits=True for a classification problem, the last layer must use a linear activation, i.e. activation='linear'. This is what a Dense layer uses by default when no activation is specified (activation=None).
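The relationship between the two settings can be sketched in plain NumPy (the helper functions and sample values below are illustrative, not the Keras implementation). Computing the loss from raw linear outputs matches computing it from sigmoid probabilities, while passing probabilities to a loss that expects logits gives a different number.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_from_probs(y, p):
    """Binary cross-entropy when the model already outputs probabilities."""
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def bce_from_logits(y, z):
    """Binary cross-entropy computed directly from raw (linear) outputs."""
    # Numerically stable form: max(z, 0) - z*y + log(1 + exp(-|z|))
    return np.mean(np.maximum(z, 0) - z * y + np.log1p(np.exp(-np.abs(z))))

y = np.array([1.0, 0.0, 1.0, 0.0])
logits = np.array([2.0, -1.0, 0.5, 3.0])

loss_a = bce_from_logits(y, logits)           # linear output + from_logits=True
loss_b = bce_from_probs(y, sigmoid(logits))   # sigmoid output + from_logits=False
# Treating probabilities as if they were logits applies the sigmoid twice.
loss_wrong = bce_from_logits(y, sigmoid(logits))
```

Here loss_a and loss_b agree, while loss_wrong does not, which is the mismatch you get when an activated output layer is combined with from_logits=True.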
I'm trying to use a time-series data set with 30 different features, and I want to predict the future values of 3 of those features. Is there any way I can specify which features should be used as outputs, and how many outputs there are, using TensorFlow and scikit-learn? Or is that just done when I create the x_train, y_train, etc. sets? I want to predict the heat index, temperature, and humidity based on various meteorological factors (air pressure, HDD, CDD, pollution, etc.). The 3 factors I wish to predict are part of the 30 total features.
I am using TensorFlow's RNN tutorial: https://www.tensorflow.org/tutorials/structured_data/time_series
univariate_past_history = 30
univariate_future_target = 0
x_train_uni, y_train_uni = univariate_data(uni_data, 0, 1930,
                                           univariate_past_history,
                                           univariate_future_target)
x_val_uni, y_val_uni = univariate_data(uni_data, 1930, None,
                                       univariate_past_history,
                                       univariate_future_target)
My data is given daily so I wanted to predict the next day using the last 30 days for example here.
and this is my implementation of the training of the model:
BATCH_SIZE = 256
BUFFER_SIZE = 10000
train_univariate = tf.data.Dataset.from_tensor_slices((x_train_uni, y_train_uni))
train_univariate = train_univariate.cache().shuffle(BUFFER_SIZE).batch(BATCH_SIZE).repeat()
val_univariate = tf.data.Dataset.from_tensor_slices((x_val_uni, y_val_uni))
val_univariate = val_univariate.batch(BATCH_SIZE).repeat()
simple_lstm_model = tf.keras.models.Sequential([
    tf.keras.layers.LSTM(8, input_shape=x_train_uni.shape[-2:]),
    tf.keras.layers.Dense(1)
])
simple_lstm_model.compile(optimizer='adam', loss='mae')
for x, y in val_univariate.take(1):
    print(simple_lstm_model.predict(x).shape)
EVALUATION_INTERVAL = 200
EPOCHS = 30
simple_lstm_model.fit(train_univariate, epochs=EPOCHS,
                      steps_per_epoch=EVALUATION_INTERVAL,
                      validation_data=val_univariate, validation_steps=50)
EDIT: I understand that to increase the number of outputs I have to increase the Dense(1) value. What I want to understand is how to specify which features to output/predict.
You need to give the model.fit call the variables you want to learn from, in a shape compatible with an LSTM layer.
So for example, without any code a model like yours might take as input:
[batchsize, n_timestamps, n_features]
and output:
[batchsize, n_timestamps, m_features]
where n is input and m output.
So then you need to give the model the truth data of the same shape as the model output in order for the model to calculate a loss.
So the model.fit call should be:
model.fit(x_train, y_train, ....) where y_train are the truth vectors of the same shape as the model output.
You have to design a model architecture that fits your needs and matches the outputs you expect. I made a toy example, but I have never really worked with this type of NN so no idea if it makes sense for the problem.
import tensorflow as tf
from tensorflow.keras.layers import LSTM, Dense, InputLayer, Reshape
ni_feats = 10
no_feats = 3
ndays = 30
model = tf.keras.Sequential([
    InputLayer((ndays, ni_feats)),
    LSTM(10),
    Dense(int(no_feats * ndays)),
    Reshape((ndays, no_feats))
])
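For the setup in the question (30 past days of all features in, the next day's 3 features out), choosing which features to predict can be done when the label array is built. The sketch below is one possible way to do that with NumPy; `make_windows`, the random data, and the column indices 0, 1, 2 are hypothetical and only loosely modeled on the windowing helper in the linked tutorial.

```python
import numpy as np

def make_windows(data, target_cols, past_history, future_target=0):
    """Build (x, y) pairs from a 2D array of shape (n_days, n_features).

    x[i] holds `past_history` consecutive days of all features.
    y[i] holds only `target_cols`, taken `future_target` days after the
    day that follows the window.
    """
    x, y = [], []
    for start in range(len(data) - past_history - future_target):
        end = start + past_history
        x.append(data[start:end])                         # all features as input
        y.append(data[end + future_target, target_cols])  # chosen features as label
    return np.array(x), np.array(y)

# Hypothetical data: 100 days, 30 features; columns 0, 1, 2 stand in for
# heat index, temperature, and humidity.
rng = np.random.default_rng(0)
data = rng.normal(size=(100, 30))
x_train, y_train = make_windows(data, target_cols=[0, 1, 2], past_history=30)
```

With labels shaped (samples, 3), a model whose last layer is Dense(3) produces outputs of the matching shape, so model.fit(x_train, y_train, ...) can compute a loss over exactly those three features.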
In my TensorFlow model I have some data that I feed into a stack of CNNs before it goes into a few fully connected layers. I have implemented that with Keras' Sequential model. However, I now have some data that should not go into the CNN and instead be fed directly into the first fully connected layer because that data contains some values and labels that are part of the input data but that data should not undergo convolutions as it is not image data.
Is such a thing possible with tensorflow.keras, or should I do that with tensorflow.nn instead? As far as I understand, in Keras' Sequential models the input goes in one end and comes out the other, with no special wiring in the middle.
Am I correct that to do this I have to use tensorflow.concat on the data from the last CNN layer and the data that bypasses the CNNs before feeding it into the first fully connected layer?
Here is a simple example in which the operation is to sum the activations from different subnets:
import keras
import numpy as np
import tensorflow as tf
from keras.layers import Input, Dense, Activation
tf.compat.v1.reset_default_graph()  # tf.reset_default_graph() in TensorFlow 1.x
# this represents your cnn model
def nn_model(input_x):
    feature_maker = Dense(10, activation='relu')(input_x)
    feature_maker = Dense(20, activation='relu')(feature_maker)
    feature_maker = Dense(1, activation='linear')(feature_maker)
    return feature_maker
# a list of input layers, of course the input shapes can be different
input_layers = [Input(shape=(3, )) for _ in range(2)]
coupled_feature = [nn_model(input_x) for input_x in input_layers]
# assume you take the sum of the outputs
coupled_feature = keras.layers.Add()(coupled_feature)
prediction = Dense(1, activation='relu')(coupled_feature)
model = keras.models.Model(inputs=input_layers, outputs=prediction)
model.compile(loss='mse', optimizer='adam')
# example training set
x_1 = np.linspace(1, 90, 270).reshape(90, 3)
x_2 = np.linspace(1, 90, 270).reshape(90, 3)
y = np.random.rand(90)
inputs_x = [x_1, x_2]
model.fit(inputs_x, y, batch_size=32, epochs=10)
You can actually plot the model to gain more intuition
from keras.utils import plot_model
plot_model(model, show_shapes=True)
The model of the above code looks like this
With a little remodeling and the functional API you can:
#create the CNN - it can also be a sequential
cnn_input = Input(image_shape)
cnn_output = Conv2D(...)(cnn_input)
cnn_output = Conv2D(...)(cnn_output)
cnn_output = MaxPooling2D()(cnn_output)
....
cnn_model = Model(cnn_input, cnn_output)
#create the FC model - can also be a sequential
fc_input = Input(fc_input_shape)
fc_output = Dense(...)(fc_input)
fc_output = Dense(...)(fc_output)
fc_model = Model(fc_input, fc_output)
There is a lot of room for creativity; this is just one of the ways.
#create the full model
full_input = Input(image_shape)
full_output = cnn_model(full_input)
full_output = fc_model(full_output)
full_model = Model(full_input, full_output)
You can use any of the three models in any way you want. They share the layers and the weights, so internally they are the same.
Saving and loading the full model might be quirky. I'd probably save the other two separately and when loading create the full model again.
Notice also that if you save two models that share the same layers, after loading they will probably not share these layers anymore. (Another reason for saving/loading only fc_model and cnn_model, while creating full_model again from code)
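Regarding the concatenation asked about in the question, the data flow can be sketched in NumPy as follows. This is only a shape-level illustration with made-up sizes (64 CNN features, 5 bypass values, 16 dense units) and random weights, not a trainable model.

```python
import numpy as np

rng = np.random.default_rng(0)
batch = 4

# Stand-in for the flattened output of the last CNN layer (hypothetical size 64).
cnn_features = rng.normal(size=(batch, 64))
# Non-image inputs that bypass the convolutions (hypothetical size 5).
side_inputs = rng.normal(size=(batch, 5))

# Join both along the feature axis before the first fully connected layer.
merged = np.concatenate([cnn_features, side_inputs], axis=-1)

# A dense layer with ReLU applied to the merged vector; its weight matrix
# must have as many rows as the merged feature count.
weights = rng.normal(size=(merged.shape[1], 16))
bias = np.zeros(16)
fc_output = np.maximum(merged @ weights + bias, 0.0)
```

In the Keras functional API, the same wiring would use two Input layers (one for the images, one for the extra data) and a keras.layers.Concatenate layer applied to the flattened CNN output and the second input, followed by the Dense layers.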
BatchNormalization (BN) operates slightly differently when in training and in inference. In training, it uses the average and variance of the current mini-batch to scale its inputs; this means that the exact result of the application of batch normalization depends not only on the current input, but also on all other elements of the mini-batch. This is clearly not desirable when in inference mode, where we want a deterministic result. Therefore, in that case, a fixed statistic of the global average and variance over the entire training set is used.
In Tensorflow, this behavior is controlled by a boolean switch training that needs to be specified when calling the layer, see https://www.tensorflow.org/api_docs/python/tf/keras/layers/BatchNormalization. How do I deal with this switch when using Keras high-level API? Am I correct in assuming that it is dealt with automatically, depending whether we are using model.fit(x, ...) or model.predict(x, ...)?
To test this, I have written this example. We start with a random distribution and we want to classify whether the input is positive or negative. However, we also have a test dataset coming from a different distribution where the inputs are displaced by 2 (and consequently the labels check whether x>2).
import numpy as np
from math import ceil
import tensorflow as tf
from tensorflow.keras import Input, Model
from tensorflow.keras.layers import Dense, BatchNormalization
Dataset = tf.data.Dataset  # public API instead of the private tensorflow.python modules
np.random.seed(18)
xt = np.random.randn(10_000, 1)
yt = np.array([[int(x > 0)] for x in xt])
train_data = Dataset.from_tensor_slices((xt, yt)).shuffle(10_000).repeat().batch(32).prefetch(2)
xv = np.random.randn(100, 1)
yv = np.array([[int(x > 0)] for x in xv])
valid_data = Dataset.from_tensor_slices((xv, yv)).repeat().batch(32).prefetch(2)
xs = np.random.randn(100, 1) + 2
ys = np.array([[int(x > 2)] for x in xs])
test_data = Dataset.from_tensor_slices((xs, ys)).repeat().batch(32).prefetch(2)
x = Input(shape=(1,))
a = BatchNormalization()(x)
a = Dense(8, activation='sigmoid')(a)
a = BatchNormalization()(a)
y = Dense(1, activation='sigmoid')(a)
model = Model(inputs=x, outputs=y, )
model.summary()
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(train_data, epochs=10, steps_per_epoch=ceil(10_000 / 32), validation_data=valid_data,
validation_steps=ceil(100 / 32))
zs = model.predict(test_data, steps=ceil(100 / 32))
print(sum([ys[i] == int(zs[i] > 0.5) for i in range(100)]))
Running the code prints the value 0.5, meaning that half the examples are labeled properly. This is what I would expect if the system was using the global statistics on the training set to implement BN.
If we change the BN layers to read
x = Input(shape=(1,))
a = BatchNormalization()(x, training=True)
a = Dense(8, activation='sigmoid')(a)
a = BatchNormalization()(a, training=True)
y = Dense(1, activation='sigmoid')(a)
and run the code again, we find 0.87. With the training state always forced, the percentage of correct predictions has changed. This is consistent with the idea that model.predict(x, ...) is now using the statistics of the mini-batch to implement BN, and is therefore able to slightly "correct" the mismatch in the source distributions between training and test data.
Is that correct?
If I'm understanding your question correctly, then yes, keras does automatically manage training vs inference behavior based on fit vs predict/evaluate. The flag is called learning_phase, and it determines the behavior of batch norm, dropout, and potentially other things. The current learning phase can be seen with keras.backend.learning_phase(), and set with keras.backend.set_learning_phase().
https://keras.io/backend/#learning_phase
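The difference between the two modes can be sketched in NumPy. This is a simplified stand-in for the layer (the `batch_norm` helper is hypothetical, and the learned scale and offset are omitted), assuming stored statistics of mean 0 and variance 1 from training data and a test batch shifted by 2 as in the question.

```python
import numpy as np

def batch_norm(x, moving_mean, moving_var, training, eps=1e-3):
    """Normalize x with batch statistics (training) or stored statistics (inference)."""
    if training:
        mean, var = x.mean(axis=0), x.var(axis=0)
    else:
        mean, var = moving_mean, moving_var
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(18)
# Statistics assumed to have been accumulated on training data drawn from N(0, 1).
moving_mean, moving_var = np.array([0.0]), np.array([1.0])

# Test batch from a shifted distribution, as in the question (inputs displaced by 2).
x_test = rng.normal(size=(32, 1)) + 2.0

inference_out = batch_norm(x_test, moving_mean, moving_var, training=False)
training_out = batch_norm(x_test, moving_mean, moving_var, training=True)
```

In inference mode the shift of 2 survives normalization, so the output stays centered near 2. In training mode the batch mean is subtracted, so the output is re-centered at 0. That re-centering is the "correction" observed in the question when training=True is forced.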
I'm training my Keras model to predict, from the provided data parameters, whether a shot will be made or not, encoded so that 0 means no and 1 means yes. However, when I try to predict, I get floating-point values.
I've tried using data that is exactly the same as the training data to get 1, but it does not work.
I used the data below to try out the one-hot encoding.
https://github.com/eijaz1/Deep-Learning-in-Keras-Tutorial/blob/master/keras_tutorial.ipynb
import pandas as pd
from keras.utils import to_categorical
from keras.models import load_model
from keras.models import Sequential
from keras.layers import Dense
from keras.callbacks import EarlyStopping
#read in training data
train_df_2 = pd.read_csv('diabetes_data.csv')
#view data structure
train_df_2.head()
#create a dataframe with all training data except the target column
train_X_2 = train_df_2.drop(columns=['diabetes'])
#check that the target variable has been removed
train_X_2.head()
#one-hot encode target column
train_y_2 = to_categorical(train_df_2.diabetes)
#check that target column has been converted
train_y_2[0:5]
#create model
model_2 = Sequential()
#get number of columns in training data
n_cols_2 = train_X_2.shape[1]
#add layers to model
model_2.add(Dense(250, activation='relu', input_shape=(n_cols_2,)))
model_2.add(Dense(250, activation='relu'))
model_2.add(Dense(250, activation='relu'))
model_2.add(Dense(2, activation='softmax'))
#compile model using accuracy to measure model performance
model_2.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
early_stopping_monitor = EarlyStopping(patience=3)
model_2.fit(train_X_2, train_y_2, epochs=30, validation_split=0.2, callbacks=[early_stopping_monitor])
train_dft = pd.read_csv('diabetes_data - Copy.csv')
train_dft.head()
test_y_predictions = model_2.predict(train_dft)
print(test_y_predictions)
I wanted to get
[[0,1]
[1,0]]
However, I am getting
[[0.8544417 0.14555828]
[0.9312985 0.06870154]]
Additionally, can anyone explain to me what the value 0.8544417 means?
Actually, you may interpret the output of a model with a softmax classifier at the top as the confidence scores or probabilities of classes (because the softmax function normalizes the values such that they would be positive and have a sum of 1). So, when you provide the model with a true label of [1, 0] this means that this sample belongs to class 1 with probability of 1, and it belongs to class 2 with probability of zero. Therefore, during training the optimization process tries to get as close as possible to that label, but it would never exactly reach [1,0] (actually due to softmax it might get as close as [0.999999, 0.000001], but never [1, 0]).
But that is not a problem, because we are interested to get just close enough and know the class with the highest probability and consider that as the prediction of the model. And you can easily do that by finding the index of the class with maximum probability:
import numpy as np
preds = model.predict(some_data)
class_preds = np.argmax(preds, axis=-1) # e.g. for [max,min] it gives 0, for [min,max] it gives 1
Further, if you are interested to convert predictions to either [0,1] or [1,0] for any reason, you can just round the values:
import numpy as np
preds = model.predict(some_data)
round_preds = np.around(preds) # this would convert [0.87, 0.13] to [1., 0.]
Note: rounding only works properly with two classes, and not when you have more than two classes (e.g. [0.3, 0.4, 0.3] would become [0, 0, 0] after rounding).
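Note 2: Since you are creating the model using the Sequential API of Keras, as an alternative to the argmax approach described above you can directly use model.predict_classes(some_data), which gives you the exact same output. (predict_classes exists only in older Keras releases; it was removed in TensorFlow 2.6, so on newer versions use the argmax approach.)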
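If the one-hot form is what you need, a variant that also works with more than two classes is to build it from the argmax instead of rounding. The sketch below uses the two probability rows from the question plus a made-up three-class row; `to_one_hot` is a hypothetical helper name.

```python
import numpy as np

# Probabilities from the question plus a hypothetical three-class row.
two_class = np.array([[0.8544417, 0.14555828],
                      [0.9312985, 0.06870154]])
three_class = np.array([[0.3, 0.4, 0.3]])

def to_one_hot(probs):
    """Return a one-hot row for the most probable class of each sample."""
    class_ids = np.argmax(probs, axis=-1)
    return np.eye(probs.shape[-1], dtype=int)[class_ids]

two_class_hot = to_one_hot(two_class)
three_class_hot = to_one_hot(three_class)
three_class_rounded = np.around(three_class)  # rounding loses the prediction
```

For the question's output this gives [[1, 0], [1, 0]], i.e. both samples are assigned to the first class, while the three-class row becomes [0, 1, 0] rather than the all-zero result produced by rounding.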
I have a sklearn pipeline performing feature engineering on heterogeneous data types (boolean, categorical, numeric, text) and wanted to try a neural network as my learning algorithm to fit the model. I am running into some problems with the shape of the input data.
I am wondering whether what I am trying to do is even possible, or whether I should try a different approach.
I have tried a couple different methods but am receiving these errors:
Error when checking input: expected dense_22_input to have shape (11,) but got array with shape (30513,) => I have 11 input features ... so I then tried converting my X and y to arrays and now get this error
ValueError: Specifying the columns using strings is only supported for pandas DataFrames => which I think is because of the ColumnTransformer() where I specify column names
print(X_train_OS.shape)
print(y_train_OS.shape)
(22354, 11)
(22354,)
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier
from keras.utils import to_categorical # OHE
X_train_predictors = df_train_OS.drop("label", axis=1)
X_train_predictors = X_train_predictors.values
y_train_target = to_categorical(df_train_OS["label"])
y_test_predictors = test_set.drop("label", axis=1)
y_test_predictors = y_test_predictors.values
y_test_target = to_categorical(test_set["label"])
print(X_train_predictors.shape)
print(y_train_target.shape)
(22354, 11)
(22354, 2)
def keras_classifier_wrapper():
    clf = Sequential()
    clf.add(Dense(32, input_dim=11, activation='relu'))
    clf.add(Dense(2, activation='softmax'))
    clf.compile(loss='categorical_crossentropy', optimizer='adam', metrics=["accuracy"])
    return clf
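from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.feature_extraction.text import CountVectorizer
TOKENS_ALPHANUMERIC_HYPHEN = "[A-Za-z0-9\-]+(?=\\s+)"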
boolTransformer = Pipeline(steps=[
('bool', PandasDataFrameSelector(BOOL_FEATURES))])
catTransformer = Pipeline(steps=[
('cat_imputer', SimpleImputer(strategy='constant', fill_value='missing')),
('cat_ohe', OneHotEncoder(handle_unknown='ignore'))])
numTransformer = Pipeline(steps=[
('num_imputer', SimpleImputer(strategy='constant', fill_value=0)),
('num_scaler', StandardScaler())])
textTransformer_0 = Pipeline(steps=[
('text_bow', CountVectorizer(lowercase=True,\
token_pattern=TOKENS_ALPHANUMERIC_HYPHEN,\
stop_words=stopwords))])
textTransformer_1 = Pipeline(steps=[
('text_bow', CountVectorizer(lowercase=True,\
token_pattern=TOKENS_ALPHANUMERIC_HYPHEN,\
stop_words=stopwords))])
FE = ColumnTransformer(
transformers=[
('bool', boolTransformer, BOOL_FEATURES),
('cat', catTransformer, CAT_FEATURES),
('num', numTransformer, NUM_FEATURES),
('text0', textTransformer_0, TEXT_FEATURES[0]),
('text1', textTransformer_1, TEXT_FEATURES[1])])
clf = KerasClassifier(keras_classifier_wrapper, epochs=100, batch_size=500, verbose=0)
PL = Pipeline(steps=[('feature_engineer', FE),
('keras_clf', clf)])
PL.fit(X_train_predictors, y_train_target)
#PL.fit(X_train_OS, y_train_OS)
I think I understand the problem here, but I am not sure how to solve it. If it is not possible to integrate an sklearn ColumnTransformer + Pipeline with a Keras model, does Keras have a good way of handling feature engineering for fixed data types? Thank you!
It looks like you are passing your 11 columns of original data through your various column transformers and the number of dimensions is expanding to 30,513 (after count vectorizing your text, one hot encoding etc). Your neural network architecture is set up to accept only 11 input features but is being passed your (now transformed) 30,513 features, which is what error 1 is explaining.
You therefore need to amend the input_dim of your neural network to match the number of features being created in the feature extraction pipeline.
One thing you could do is add an intermediate step between them with something like SelectKBest and set that to something like 20,000 so that you know exactly how many features will eventually be passed to the classifier.
This is a good guide and flowchart on the Google machine learning website - link - look at the flow chart - here you can see they have a 'select top k features' step in the pipeline before training a model.
So, try updating these parts of your code to:
def keras_classifier_wrapper():
    clf = Sequential()
    clf.add(Dense(32, input_dim=20000, activation='relu'))
    clf.add(Dense(2, activation='softmax'))
    clf.compile(loss='categorical_crossentropy', optimizer='adam', metrics=["accuracy"])
    return clf
and
from sklearn.feature_selection import SelectKBest
select_best_features = SelectKBest(k=20000)
PL = Pipeline(steps=[('feature_engineer', FE),
('select_k_best', select_best_features),
('keras_clf', clf)])
I think using sklearn Pipelines with the Keras scikit-learn wrappers is a standard way of dealing with your problem, and ColumnTransformer lets you handle each feature differently (whether it is boolean, numerical, or categorical).
To debug your code, I would suggest unit testing each step of your Pipeline, especially textTransformer_0 and textTransformer_1.
For instance (using the original DataFrame so that each text column can be selected by name):
textTransformer_0.fit_transform(X_train_OS[TEXT_FEATURES[0]]).shape # shape[1] is the number of columns produced
textTransformer_1.fit_transform(X_train_OS[TEXT_FEATURES[1]]).shape # shape[1] is the number of columns produced
Do the same for the one-hot encoder to understand what your final feature dimension will be.
sklearn Pipelines pass 2D arrays from one step to the next, and CountVectorizer creates a number of columns that depends on the data.
The total number of columns after all transformers must be used as input_dim in the first Keras Dense layer.
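To see why 11 original columns can turn into tens of thousands of features, here is a small hand-rolled sketch using only the standard library. It imitates what one-hot encoding and a bag-of-words vectorizer do to the column count; it is not sklearn itself, and the toy rows are made up.

```python
# Toy rows: one numeric, one categorical, and one text column (all hypothetical).
rows = [
    (1.5, "red", "fast delivery"),
    (0.2, "blue", "slow delivery late"),
    (3.1, "red", "fast"),
]

numeric_width = 1                                   # scaling keeps one column
categorical_width = len({r[1] for r in rows})       # one column per category
vocabulary = {tok for r in rows for tok in r[2].split()}
text_width = len(vocabulary)                        # one column per distinct token

# Width of the matrix the classifier would actually receive.
input_dim = numeric_width + categorical_width + text_width
```

Three original columns become seven features here. With the real pipeline, the same number can be read off as FE.fit_transform(X).shape[1] on the fitted ColumnTransformer, and that is the value the first Dense layer has to accept.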