I have a sklearn pipeline that performs feature engineering on heterogeneous data types (boolean, categorical, numeric, text), and I wanted to try a neural network as the learning algorithm. I am running into some problems with the shape of the input data.
I am wondering if what I am trying to do is even possible, or whether I should try a different approach.
I have tried a couple different methods but am receiving these errors:
Error when checking input: expected dense_22_input to have shape (11,) but got array with shape (30513,) => I have 11 input features, so I then tried converting my X and y to arrays, and now I get this error:
ValueError: Specifying the columns using strings is only supported for pandas DataFrames => which I think is caused by the ColumnTransformer(), where I specify column names.
print(X_train_OS.shape)
print(y_train_OS.shape)
(22354, 11)
(22354,)
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier
from keras.utils import to_categorical # OHE
X_train_predictors = df_train_OS.drop("label", axis=1)
X_train_predictors = X_train_predictors.values
y_train_target = to_categorical(df_train_OS["label"])

X_test_predictors = test_set.drop("label", axis=1)  # these are features, not targets
X_test_predictors = X_test_predictors.values
y_test_target = to_categorical(test_set["label"])
print(X_train_predictors.shape)
print(y_train_target.shape)
(22354, 11)
(22354, 2)
def keras_classifier_wrapper():
    clf = Sequential()
    clf.add(Dense(32, input_dim=11, activation='relu'))
    clf.add(Dense(2, activation='softmax'))
    clf.compile(loss='categorical_crossentropy', optimizer='adam', metrics=["accuracy"])
    return clf
TOKENS_ALPHANUMERIC_HYPHEN = r"[A-Za-z0-9\-]+(?=\s+)"
boolTransformer = Pipeline(steps=[
    ('bool', PandasDataFrameSelector(BOOL_FEATURES))])

catTransformer = Pipeline(steps=[
    ('cat_imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('cat_ohe', OneHotEncoder(handle_unknown='ignore'))])

numTransformer = Pipeline(steps=[
    ('num_imputer', SimpleImputer(strategy='constant', fill_value=0)),
    ('num_scaler', StandardScaler())])

textTransformer_0 = Pipeline(steps=[
    ('text_bow', CountVectorizer(lowercase=True,
                                 token_pattern=TOKENS_ALPHANUMERIC_HYPHEN,
                                 stop_words=stopwords))])

textTransformer_1 = Pipeline(steps=[
    ('text_bow', CountVectorizer(lowercase=True,
                                 token_pattern=TOKENS_ALPHANUMERIC_HYPHEN,
                                 stop_words=stopwords))])
FE = ColumnTransformer(
    transformers=[
        ('bool', boolTransformer, BOOL_FEATURES),
        ('cat', catTransformer, CAT_FEATURES),
        ('num', numTransformer, NUM_FEATURES),
        ('text0', textTransformer_0, TEXT_FEATURES[0]),
        ('text1', textTransformer_1, TEXT_FEATURES[1])])
clf = KerasClassifier(keras_classifier_wrapper, epochs=100, batch_size=500, verbose=0)
PL = Pipeline(steps=[('feature_engineer', FE),
                     ('keras_clf', clf)])
PL.fit(X_train_predictors, y_train_target)
#PL.fit(X_train_OS, y_train_OS)
I think I understand the problem here, but I am not sure how to solve it. If it is not possible to integrate a sklearn ColumnTransformer + Pipeline with a Keras model, does Keras have a good way of feature-engineering mixed data types? Thank you!
It looks like you are passing your 11 columns of original data through your various column transformers, and the number of dimensions is expanding to 30,513 (after count-vectorizing your text, one-hot encoding, etc.). Your neural network architecture is set up to accept only 11 input features but is being passed the (now transformed) 30,513 features, which is what the first error is telling you.
You therefore need to amend the input_dim of your neural network to match the number of features being created in the feature extraction pipeline.
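If you want to verify that number directly, you can fit the feature-engineering step on its own and read off the transformed width; a quick sketch reusing the names from your code (note: pass the DataFrame, not .values, because the ColumnTransformer selects columns by name):

X_df = df_train_OS.drop("label", axis=1)  # keep it a DataFrame for column selection
n_transformed_features = FE.fit_transform(X_df).shape[1]
print(n_transformed_features)  # 30513 in your case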
One thing you could do is add an intermediate step between them with something like SelectKBest and set that to something like 20,000 so that you know exactly how many features will eventually be passed to the classifier.
There is a good guide and flowchart on the Google machine learning website (link); in the flowchart you can see a 'select top k features' step in the pipeline before training a model.
So, try updating these parts of your code to:
def keras_classifier_wrapper():
    clf = Sequential()
    clf.add(Dense(32, input_dim=20000, activation='relu'))
    clf.add(Dense(2, activation='softmax'))
    clf.compile(loss='categorical_crossentropy', optimizer='adam', metrics=["accuracy"])
    return clf
and
from sklearn.feature_selection import SelectKBest

select_best_features = SelectKBest(k=20000)

PL = Pipeline(steps=[('feature_engineer', FE),
                     ('select_k_best', select_best_features),
                     ('keras_clf', clf)])
I think using sklearn Pipelines with the Keras scikit-learn wrappers is a standard way of dealing with your problem, and ColumnTransformer lets you manage each feature differently (whether it is boolean, numerical, categorical or text).
To debug your code, I would suggest unit-testing each step of your Pipeline, especially textTransformer_0 and textTransformer_1.
For instance (selecting each text column from the original DataFrame, since CountVectorizer expects a 1-D sequence of strings):

textTransformer_0.fit_transform(df_train_OS[TEXT_FEATURES[0]]).shape  # check shape[1]
textTransformer_1.fit_transform(df_train_OS[TEXT_FEATURES[1]]).shape  # check shape[1]

Do the same for the one-hot encoder, to understand what your final feature dimension will be.
sklearn Pipelines are built to pass 2-D np.ndarrays along, and CountVectorizer creates a number of columns that depends on the data; that total is the value that must be given as input_dim to the first keras Dense layer.
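As a sketch of that last point, you could make input_dim a parameter of the wrapper, so the value you measure from the pipeline is set in one place (illustrative; 30513 is the figure from your error message):

def keras_classifier_wrapper(input_dim=30513):
    clf = Sequential()
    clf.add(Dense(32, input_dim=input_dim, activation='relu'))
    clf.add(Dense(2, activation='softmax'))
    clf.compile(loss='categorical_crossentropy', optimizer='adam', metrics=["accuracy"])
    return clf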
I'm trying to implement a small prototype of the GCN model using the Stellargraph library. My StellarGraph graph object is ready, and I'm trying to solve a multi-class multi-label classification problem: I am predicting more than one column (19, exactly), each encoded as 0 or 1.
Here is what I've done:
from sklearn.model_selection import train_test_split
from stellargraph.mapper import FullBatchNodeGenerator
from stellargraph.layer import GCN
from tensorflow.keras import layers, optimizers, losses, metrics, Model
from tensorflow.keras.metrics import Precision
from tensorflow.keras.callbacks import EarlyStopping

train_subjects, test_subjects = train_test_split(nodelist, test_size=.25)
generator = FullBatchNodeGenerator(graph, method="gcn")

train_gen = generator.flow(train_subjects['ID'], train_subjects.drop(['ID'], axis=1))
gcn = GCN(layer_sizes=[16, 16], activations=["relu", "relu"], generator=generator, dropout=0.5)

x_inp, x_out = gcn.in_out_tensors()
predictions = layers.Dense(units=1, activation="sigmoid")(x_out)

model = Model(inputs=x_inp, outputs=predictions)
model.compile(
    optimizer=optimizers.Adam(learning_rate=0.01),
    loss=losses.categorical_crossentropy,
    metrics=[Precision()])

val_gen = generator.flow(test_subjects['ID'], test_subjects.drop(['ID'], axis=1))
es_callback = EarlyStopping(monitor="val_precision", patience=200, restore_best_weights=True)

history = model.fit(
    train_gen,
    epochs=200,
    validation_data=val_gen,
    verbose=2,
    shuffle=False,
    callbacks=[es_callback])
I have 271045 edges and 16354 nodes in total, including 12265 training nodes. The issue I'm getting is a shape mismatch from Keras, shown below. I suspect it's due to passing multiple columns as target columns; I tried the model with only one column (class) and it worked perfectly.
InvalidArgumentError: Incompatible shapes: [1,12265] vs. [1,233035]
[[node LogicalAnd_1 (defined at tmp/ipykernel_52/2745570431.py:7) ]] [Op:__inference_train_function_1405]
It's worth mentioning that 233035 = 12265 (the number of training nodes) × 19 (the number of classes). Any idea what is going wrong here?
I figured out the problem.
It was a newbie mistake: I had initialized the final Dense classification layer with 1 unit instead of 19 (the number of classes).
I just needed to fix that line to:
predictions = layers.Dense(units=19, activation="sigmoid")(x_out)
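One more note: since the 19 targets are independent 0/1 labels (multi-label rather than one-hot multi-class), sigmoid outputs are usually paired with binary_crossentropy; categorical_crossentropy assumes mutually exclusive classes. Roughly:

predictions = layers.Dense(units=19, activation="sigmoid")(x_out)
model = Model(inputs=x_inp, outputs=predictions)
model.compile(
    optimizer=optimizers.Adam(learning_rate=0.01),
    loss=losses.binary_crossentropy,
    metrics=[Precision()])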
Have a nice day!
I'm writing here hoping to solve a problem I had with a neural network, developed in Python using Keras.
I'm new to the deep learning world; I'm studying the theory and trying to implement some code.
Goal: develop a network that allows me to recognize 2 different spoken words (commands); in the future they will be used to drive a small robot car. For now, I only want to achieve identification of "yes"/"no".
Implementation: I'm trying to implement a binary classification network.
Here is my code:
- I used librosa to convert the audio training and test sets into input matrices with 193 features (a sketch of this step follows the list).
- To overcome possible normalization problems, I scaled the data using the preprocessing package (I saw that this effectively improves performance); I noticed that if I don't normalize the training, test, and to-be-analyzed data with the same normalization, it doesn't work.
- I read that Keras accepts numpy arrays as input, so I converted the target y to numpy.
- I proceeded with building, training, and testing the model (I know that some of the methods I'm using are deprecated).
- I use one audio clip to perform one more test, because in the future I assume the network will receive (and judge) one audio clip at a time.
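For context, here is a hypothetical sketch of a typical 193-feature librosa extractor, assuming the common recipe of 40 MFCC + 12 chroma + 128 mel + 7 spectral contrast + 6 tonnetz values per clip (my utilNP helper may differ in details):

import librosa
import numpy as np

def np_array_features(file_name):
    # Hypothetical 193-feature extractor: 40 MFCC + 12 chroma + 128 mel
    # + 7 spectral contrast + 6 tonnetz, each averaged over time
    signal, sr = librosa.load(file_name)
    stft = np.abs(librosa.stft(signal))
    mfccs = np.mean(librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=40).T, axis=0)
    chroma = np.mean(librosa.feature.chroma_stft(S=stft, sr=sr).T, axis=0)
    mel = np.mean(librosa.feature.melspectrogram(y=signal, sr=sr).T, axis=0)
    contrast = np.mean(librosa.feature.spectral_contrast(S=stft, sr=sr).T, axis=0)
    tonnetz = np.mean(librosa.feature.tonnetz(y=librosa.effects.harmonic(signal), sr=sr).T, axis=0)
    return np.hstack([mfccs, chroma, mel, contrast, tonnetz])  # 193 values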
import support.myutilities as utilNP
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from sklearn.preprocessing import StandardScaler
#READ AND BUILD INPUT
X_si = utilNP.np_array_features_dir('DIRECTORY PATH')
X_no = utilNP.np_array_features_dir('DIRECTORY PATH')
X_tot = np.concatenate((X_si, X_no), axis=0)
# Scale the train set
scaler = StandardScaler().fit(X_tot)
X_train = scaler.transform(X_tot)
#0 = si, 1 = no
y = []
for i in range(len(X_si)):
    y.append(0)
for i in range(len(X_no)):
    y.append(1)
Y = np.array(y)
#READ AND BUILD TEST SET
X_si_test = utilNP.np_array_features_dir('DIRECTORY PATH')
X_no_test = utilNP.np_array_features_dir('DIRECTORY PATH')
X_tot_test = np.concatenate((X_si_test, X_no_test), axis=0)

# Scale the test set with the same scaler fitted on the training data
X_test = scaler.transform(X_tot_test)
y_test = []
for i in range(len(X_si_test)):
    y_test.append(0)
for i in range(len(X_no_test)):
    y_test.append(1)
Y_test = np.array(y_test)
###### BUILD MODEL
model = Sequential()
model.add(Dense(100, input_dim=len(X_tot[0]), activation='relu')) #193 features as input
model.add(Dense(50, activation='relu'))
model.add(Dense(1, activation='sigmoid')) #1 output
# compile the keras model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['acc'])
model.fit(X_train, Y, epochs=300, verbose=1)
#test
scores = model.evaluate(X_test, Y_test, verbose=0)
print('Accuracy on test data: {}% \n Error on test data: {}'.format(scores[1], 1 - scores[1]))

predictions = model.predict(X_test)
for i in range(len(predictions)):
    print('=> %d (expected %d)' % (predictions[i], y_test[i]))
#TEST WITH A PRACTICAL NEW SOUND: supposed to be newly acquired
file_name = 'PATH AUDIO'
X = utilNP.np_array_features(file_name)

# Normalize with the same scaler fitted on the training data
X_analyze = scaler.transform(X)

y_analysis = [1]  # I assume the audio is the one that should return 1

pred_test = model.predict(X_analyze)
scores2 = model.evaluate(X_analyze, np.array(y_analysis), verbose=0)
print('Accuracy on the new audio: {}% \n Error: {}'.format(scores2[1], 1 - scores2[1]))
Problems:
- Accuracy reaches 100% within very few epochs. It's true that the training set is not very big (300 samples in total, 40 for test), but this result is clearly wrong. That said, if I use more than 100 epochs, the network works well and does its job (the single new case is recognized correctly).
- If the number of epochs is low (20, for example), accuracy still reaches 100% after a few iterations, but the training results contain errors (why are some samples not recognized?) and the final prediction is wrong. That is not normal: I would expect a low accuracy to justify the wrong answers, but it stays at 100%.
I have tried many solutions, from setting training=True/False to reading a lot of answers here and on Stack Exchange, but I haven't solved anything.
Is there something wrong in my code?
Thanks in advance.
I am new to the TensorFlow framework and I am trying to apply TensorFlow to predict survivors in this Titanic dataset: https://www.kaggle.com/c/titanic/data.
import tensorflow as tf
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
#%%
titanictrain = pd.read_csv('train.csv')
titanictest = pd.read_csv('test.csv')
df = pd.concat([titanictrain,titanictest],join='outer',keys='PassengerId',sort=False,ignore_index=True).drop(['Name'],1)
#%%
def preprocess(df):
    df['Fare'].fillna(value=df.groupby('Pclass')['Fare'].transform('median'), inplace=True)
    df['Fare'] = df['Fare'].map(lambda x: np.log(x) if x > 0 else 0)
    df['Embarked'].fillna(value=df['Embarked'].mode()[0], inplace=True)
    df['CabinAlphabet'] = df['Cabin'].str[0]
    categories_to_one_hot = ['Pclass', 'Sex', 'Embarked', 'CabinAlphabet']
    df = pd.get_dummies(df, columns=categories_to_one_hot, drop_first=True)
    return df
df = preprocess(df)
df = df.drop(['PassengerId','Ticket','Cabin','Survived'],1)
titanic_trainandval = df.iloc[:len(titanictrain)]
titanic_test = df.iloc[len(titanictrain):] #test after preprocessing
titanic_test.head()
# split train into training and validation set
labels = titanictrain['Survived']
y = labels.values
test = titanic_test.copy() # real test sets
print(len(test), 'test examples')
Here I am applying preprocessing to the data:
1. Drop the Name column and one-hot encode categorical columns, on both the train and test sets.
2. Drop ['PassengerId','Ticket','Cabin','Survived'] for simplicity.
3. Split train and test following the original order.
Here is a picture showing what the training set looks like.
"""# model training"""
from tensorflow.keras.layers import Input, Dense, Activation,Dropout
from tensorflow.keras.models import Model
X = titanic_trainandval.copy()
input_layer = Input(shape=(X.shape[1],))
dense_layer_1 = Dense(10, activation='relu')(input_layer)
dense_layer_2 = Dense(5, activation='relu')(dense_layer_1)
output = Dense(1, activation='softmax',name = 'predictions')(dense_layer_2)
model = Model(inputs=input_layer, outputs=output)
base_learning_rate = 0.0001
model.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True), optimizer=tf.keras.optimizers.Adam(lr=base_learning_rate), metrics=['acc'])
history = model.fit(X, y, batch_size=5, epochs=20, verbose=2, validation_split=0.1,shuffle = False)
submission = pd.DataFrame()
submission['PassengerId'] = titanictest['PassengerId']
Then I put the training set X into the model to get the result. However, the history shows the following: no matter how I change the learning rate and batch size, the result does not change; the loss is always nan, and the predictions on the test set are all nan as well.
Could anybody explain where the problem is and give some possible solutions?
At first glance there are 2 major problems in your code:
1. Your output layer must be Dense(2, activation='softmax'). This is because yours is a binary classification problem, and if you use softmax to generate probabilities the output dimension must equal the number of classes. (Alternatively, you can use a single output dimension with a sigmoid activation.)
2. You have to change your loss function: with softmax and numerically encoded targets, use sparse_categorical_crossentropy. (With sigmoid you can use binary_crossentropy, with from_logits=False as the default.)
PS: make sure to remove all NaNs from your original data before starting the fit.
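Putting points 1 and 2 together, a minimal sketch of the two consistent pairings (names reused from your code):

# Option A: two-unit softmax with integer 0/1 labels
output = Dense(2, activation='softmax', name='predictions')(dense_layer_2)
model = Model(inputs=input_layer, outputs=output)
model.compile(loss='sparse_categorical_crossentropy',
              optimizer=tf.keras.optimizers.Adam(lr=base_learning_rate),
              metrics=['acc'])

# Option B: one sigmoid unit with binary labels
output = Dense(1, activation='sigmoid', name='predictions')(dense_layer_2)
model = Model(inputs=input_layer, outputs=output)
model.compile(loss='binary_crossentropy',
              optimizer=tf.keras.optimizers.Adam(lr=base_learning_rate),
              metrics=['acc'])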
Marco Cerliani is right about points 1 and 2.
The real reason you have NaNs is that you are feeding NaNs into the model. If you look carefully, even in your third picture, the 888th example contains a NaN in the Age column.
That is why you have NaNs. Solve this, apply Marco Cerliani's suggestions, and you're good to go.
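A quick sketch to confirm and fix it (Age is the column visible in your screenshot):

print(X.isna().sum())  # per-column NaN counts; Age shows up here
X['Age'] = X['Age'].fillna(X['Age'].median())  # e.g. median-impute Age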
Apart from the above answers, one more thing I would like to add: whenever you want to use from_logits=True for classification problems, use a linear activation function, i.e. activation='linear', which is the default activation of a Dense layer.
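In other words, if you keep from_logits=True, the last layer should output raw logits; a sketch with the names from the question:

# Linear (i.e. no) activation: the loss applies the sigmoid internally,
# which pairs with BinaryCrossentropy(from_logits=True)
output = Dense(1, activation='linear', name='predictions')(dense_layer_2)
model.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              optimizer=tf.keras.optimizers.Adam(lr=base_learning_rate),
              metrics=['acc'])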
I am trying to practice my machine learning skills with Tensorflow/Keras but I am having trouble fitting the model. Let me explain what I've done and where I'm at.
I am using the dataset from Kaggle's Costa Rican Household Poverty Level Prediction Challenge
Since I am just trying to get familiar with the Tensorflow workflow, I cleaned the dataset by removing a few columns that had a lot of missing data and then filled in the other columns with their mean. So there are no missing values in my dataset.
Next I loaded the new, cleaned, csv in using make_csv_dataset from TF.
batch_size = 32

train_dataset = tf.data.experimental.make_csv_dataset(
    'clean_train.csv',
    batch_size,
    column_names=column_names,
    label_name=label_name,
    num_epochs=1)
I set up a function to return my compiled model like so:
f1_macro = tfa.metrics.F1Score(num_classes=4, average='macro')

def get_compiled_model():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(512, activation=tf.nn.relu, input_shape=(137,)),  # input shape required
        tf.keras.layers.Dense(256, activation=tf.nn.relu),
        tf.keras.layers.Dense(4, activation=tf.nn.softmax)
    ])
    model.compile(optimizer='adam',
                  loss='binary_crossentropy',
                  metrics=[f1_macro, 'accuracy'])
    return model

model = get_compiled_model()
model.fit(train_dataset, epochs=15)
Below is the result of that
A link to my notebook is Here
I should mention that I strongly based my implementation on Tensorflow's iris data walkthrough
Thank you!
After a while, I was able to find the issues with your code. They are listed below in order of importance (the first is the most important):
1. You are doing multi-class classification, not binary classification, so your loss should be categorical_crossentropy.
2. You are not one-hot encoding your labels. Using binary_crossentropy with labels kept as numerical IDs is definitely not the way forward. Instead, you should one-hot encode your labels and treat this as a multi-class classification problem. Here's how you do that:
def pack_features_vector(features, labels):
    """Pack the features into a single array and one-hot encode the labels."""
    features = tf.stack(list(features.values()), axis=1)
    return features, tf.one_hot(tf.cast(labels - 1, tf.int32), depth=4)
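Then map this function over the dataset before fitting (as in the TensorFlow iris walkthrough you based your implementation on):

train_dataset = train_dataset.map(pack_features_vector)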
3. Normalize your data. If you look at your training data, it is not normalized and the values are all over the place, so you should consider normalizing it with something like the snippet below. This is just for demonstration purposes; read about Scalers in scikit-learn and choose what's best for you.
x = train_df[feature_names].values  # returns a numpy array
scaler = preprocessing.StandardScaler()
x_scaled = scaler.fit_transform(x)
train_df = pd.DataFrame(x_scaled, columns=feature_names)
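With the one-hot labels from point 2 in place, the compile call from point 1 would then look roughly like:

model.compile(optimizer='adam',
              loss='categorical_crossentropy',  # one-hot labels from pack_features_vector
              metrics=[f1_macro, 'accuracy'])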
These issues should set your model straight.
I am using sel = SelectFromModel(ExtraTreesClassifier(10), threshold='mean') to select the most important features in my data set.
Then I want to feed these selected features to my Keras classifier. But my Keras-based neural network classifier needs the number of important features selected in the first step. The code for the Keras classifier is below; the variable X_new is the numpy array of newly selected features.
def create_model(dropout=0.2):
    n_x_new = X_new.shape[1]
    np.random.seed(6000)
    model_new = Sequential()
    model_new.add(Dense(n_x_new, input_dim=n_x_new, kernel_initializer='glorot_uniform', activation='sigmoid'))
    model_new.add(Dense(10, kernel_initializer='glorot_uniform', activation='sigmoid'))
    model_new.add(Dropout(dropout))  # use the dropout argument so GridSearchCV can tune it
    model_new.add(Dense(1, kernel_initializer='glorot_uniform', activation='sigmoid'))
    model_new.compile(loss='binary_crossentropy', optimizer='adam', metrics=['binary_crossentropy'])
    return model_new
seed = 7
np.random.seed(seed)
clf=KerasClassifier(build_fn=create_model, epochs=10, batch_size=1000, verbose=0)
param_grid = {'clf__dropout':[0.1,0.2]}
model = Pipeline([('sel', sel),('clf', clf),])
grid = GridSearchCV(estimator=model, param_grid=param_grid,scoring='roc_auc', n_jobs=1)
grid_result = grid.fit(np.concatenate((train_x_upsampled, cross_val_x_upsampled), axis=0), np.concatenate((train_y_upsampled, cross_val_y_upsampled), axis=0))
As I am using a Pipeline with grid search, I don't understand how my neural network will get the important features selected in the first step. I want to get those selected features into an array X_new.
Do I need to implement a custom estimator between sel and the Keras model?
If yes, how would I implement one? I know the generic code for a custom estimator but I am unable to adapt it to my requirement. The generic code is as under:
class new_features(TransformerMixin):
    def transform(self, X):
        X_new = sel.transform(X)
        return X_new
But this is not working. Is there any way I can solve this problem without using a custom estimator in between?
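From what I have read, a custom transformer generally needs both fit and transform, and inheriting from BaseEstimator gives GridSearchCV the get_params support it needs; here is a sketch of that pattern (I have not verified it against my pipeline):

from sklearn.base import BaseEstimator, TransformerMixin

class NewFeatures(BaseEstimator, TransformerMixin):
    # Sketch: wrap an already-configured selector so it can sit inside a Pipeline
    def __init__(self, selector):
        self.selector = selector

    def fit(self, X, y=None):
        self.selector.fit(X, y)
        return self

    def transform(self, X):
        return self.selector.transform(X)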