TensorFlow Extended data_accessor.tf_dataset_factory() shape discrepancies

I am facing a perplexing issue while attempting to convert a vanilla tensorflow/keras workflow into a tensorflow extended pipeline.
In short: the datasets generated using tfx’s ExampleGen component have different shapes from those created manually using tf.data.Dataset.from_tensor_slices() from the same data, and cannot be fed into a keras model.
Reproducible example
1. Data generation
Let’s assume we create a sample dataset using:
import pandas as pd
import random
df = pd.DataFrame({
    'a': [float(x) for x in range(100)],
    'b': [float(x + 1) for x in range(100)],
    'c': [float(x**2) for x in range(100)],
    'target': [random.randint(0, 2) for _ in range(100)],
})
df.to_parquet({my_path})
2. Model generation
Let's use a dummy dense model for simplicity's sake.
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import SGD
def build_model():
    model = Sequential()
    model.add(Dense(8, input_shape=(3,), activation='relu'))
    model.add(Dense(3, activation='softmax'))
    model.compile(
        optimizer=SGD(),
        loss="sparse_categorical_crossentropy",
        metrics=["sparse_categorical_accuracy"],
    )
    return model
3. What works: manual dataset creation
This parquet file can then be loaded back into a pandas df and converted into a tensorflow dataset using:
import tensorflow as tf
_BATCH_SIZE = 4
dataset = tf.data.Dataset.from_tensor_slices((
    tf.cast(df[['a', 'b', 'c']].values, tf.float32),
    tf.cast(df['target'].values, tf.int32),
)).batch(_BATCH_SIZE, drop_remainder=True)
This gives a dataset with cardinality() = <tf.Tensor: shape=(), dtype=int64, numpy=25>, which can be fed to the toy model above.
4. What doesn't work: making a tensorflow extended pipeline
I have tried to replicate those results by applying a slightly modified tfx starter pipeline:
# tensorflow and schema_pb2 are needed for the return annotation and schema parsing below
import tensorflow as tf
from tensorflow_metadata.proto.v0 import schema_pb2
from tfx_bsl.tfxio import dataset_options
from tfx.components import SchemaGen
from tfx.components import StatisticsGen
from tfx.components import Trainer
from tfx.dsl.components.base import executor_spec
from tfx.components.example_gen.component import FileBasedExampleGen
from tfx.components.example_gen.custom_executors import parquet_executor
from tfx.components.trainer.executor import GenericExecutor
from tfx.orchestration import metadata
from tfx.orchestration import pipeline
from tfx.proto import trainer_pb2
from tfx.proto import example_gen_pb2
from tfx.utils.io_utils import parse_pbtxt_file
_BATCH_SIZE = 4
_LABEL_KEY = 'target'
_EPOCHS = 10
def _input_fn(file_pattern, data_accessor, schema) -> tf.data.Dataset:
    dataset = data_accessor.tf_dataset_factory(
        file_pattern,
        dataset_options.TensorFlowDatasetOptions(
            batch_size=_BATCH_SIZE,
            label_key=_LABEL_KEY,
            num_epochs=_EPOCHS,
        ),
        schema,
    )
    return dataset
def build_model():
    """Same as above"""
    ...
    return model
def run_fn(fn_args):
    schema = parse_pbtxt_file(fn_args.schema_file, schema_pb2.Schema())
    train_dataset = _input_fn(
        fn_args.train_files,
        fn_args.data_accessor,
        schema,
    )
    eval_dataset = _input_fn(
        fn_args.eval_files,
        fn_args.data_accessor,
        schema,
    )
    model = build_model()
    model.fit(
        train_dataset,
        steps_per_epoch=fn_args.train_steps,
        validation_data=eval_dataset,
        validation_steps=fn_args.eval_steps,
        epochs=_EPOCHS,
    )
    model.save(fn_args.serving_model_dir, save_format='tf')
def _create_pipeline(
    pipeline_name: str,
    pipeline_root: str,
    data_root: str,
    module_file: str,
    metadata_path: str,
    split: dict,
) -> pipeline.Pipeline:
    split_config = example_gen_pb2.SplitConfig(
        splits=[
            example_gen_pb2.SplitConfig.Split(name=name, hash_buckets=buckets)
            for name, buckets in split.items()
        ]
    )
    example_gen = FileBasedExampleGen(
        input_base=data_root,
        custom_executor_spec=executor_spec.ExecutorClassSpec(parquet_executor.Executor),
        output_config=example_gen_pb2.Output(split_config=split_config),
    )
    statistics_gen = StatisticsGen(examples=example_gen.outputs['examples'])
    infer_schema = SchemaGen(statistics=statistics_gen.outputs['statistics'])
    trainer = Trainer(
        module_file=module_file,
        custom_executor_spec=executor_spec.ExecutorClassSpec(GenericExecutor),
        examples=example_gen.outputs['examples'],
        schema=infer_schema.outputs['schema'],
        train_args=trainer_pb2.TrainArgs(),
        eval_args=trainer_pb2.EvalArgs()
    )
    components = [example_gen, statistics_gen, infer_schema, trainer]
    metadata_config = metadata.sqlite_metadata_connection_config(metadata_path)
    _pipeline = pipeline.Pipeline(
        pipeline_name=pipeline_name,
        pipeline_root=pipeline_root,
        components=components,
        metadata_connection_config=metadata_config,
    )
    return _pipeline
However, the dataset generated by ExampleGen has cardinality tf.Tensor(-2, shape=(), dtype=int64), and gives the following error message when fed to the same model:
ValueError: Layer sequential expects 1 inputs, but it received 3 input tensors. Inputs received: [<tensorflow.python.framework.sparse_tensor.SparseTensor object at 0x7f40353373d0>, <tensorflow.python.framework.sparse_tensor.SparseTensor object at 0x7f4035337710>, <tensorflow.python.framework.sparse_tensor.SparseTensor object at 0x7f40352e3190>]
Importantly: the problem persists even when the data are stored as a csv file and read using CsvExampleGen, which makes the issue very unlikely to arise from the data themselves.
Also, batching the tfx output dataset has no effect on the results.
I’ve tried everything I could think of to no benefit. The relative obscurity of what's happening under tfx's hood doesn't help with the debugging either. Has anyone any idea of what the problem is?
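One reading of the error message, offered as a sketch rather than a confirmed diagnosis: tf_dataset_factory() yields (features, label) pairs in which features is a dict keyed by column name, and here each column arrives as a SparseTensor, whereas the Sequential model expects a single dense [batch, 3] tensor. The helper below densifies each column and concatenates them. The names to_dense_matrix and FEATURE_KEYS are illustrative and not part of TFX, and the code assumes every example carries exactly one value per feature.

```python
import tensorflow as tf

FEATURE_KEYS = ["a", "b", "c"]  # column names from the sample DataFrame


def to_dense_matrix(features, label):
    """Convert a dict of per-column (Sparse)Tensors into one [batch, 3] float tensor."""
    columns = []
    for key in FEATURE_KEYS:
        value = features[key]
        if isinstance(value, tf.SparseTensor):
            value = tf.sparse.to_dense(value)
        # Assumption: one value per example, so each column reshapes to [batch, 1].
        columns.append(tf.cast(tf.reshape(value, [-1, 1]), tf.float32))
    if isinstance(label, tf.SparseTensor):
        label = tf.sparse.to_dense(label)
    return tf.concat(columns, axis=1), tf.reshape(label, [-1])


# Stand-in batch that mimics the structure reported in the error message.
demo_features = {
    key: tf.sparse.from_dense(tf.constant([[1.0], [2.0]])) for key in FEATURE_KEYS
}
demo_labels = tf.constant([0, 1], dtype=tf.int64)
dense_x, dense_y = to_dense_matrix(demo_features, demo_labels)
print(dense_x.shape, dense_y.shape)
```

Inside _input_fn this would be applied as data_accessor.tf_dataset_factory(...).map(to_dense_matrix) before returning the dataset.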
Edit 1
Two points have come to my attention since writing this question:
data_accessor.tf_dataset_factory() doesn't actually output a tensorflow.python.data.ops.dataset_ops.TensorSliceDataset, but a tensorflow.python.data.ops.dataset_ops.PrefetchDataset instead.
There's actually a small bunch of as yet unanswered questions that look somewhat related to my problem discussing the pains of working with PrefetchDatasets:
TFDS Audio Preprocessing PrefetchDataset Problems
How to feed tf.prefetch dataset into LSTM?
Change PrefetchDataset shapes
Considering none of those questions have found an answer, and that the crux of the problem seems to be the lack of documentation regarding PrefetchDatasets and how to use them, I'll open an issue on tfx's board and see how it goes if this doesn't get answered here within a few days.
Edit 2: version and environment details
As requested by TensorFlow Support, here are the details regarding the versions of all my TensorFlow-related installs:
Core components:
tensorflow==2.3.0
tfx==0.25.0
tfx-bsl==0.25.0
TensorFlow-related stuff:
tensorflow-cloud==0.1.7
tensorflow-data-validation==0.25.0
tensorflow-datasets==3.0.0
tensorflow-estimator==2.3.0
tensorflow-hub==0.9.0
tensorflow-io==0.15.0
tensorflow-metadata==0.25.0
tensorflow-model-analysis==0.25.0
tensorflow-probability==0.11.0
tensorflow-serving-api==2.3.0
tensorflow-transform==0.25.0
Environment and other miscellaneous details:
Python version: 3.7.9
OS: Debian GNU/Linux 10 (buster)
Running from an N1 GCP instance


Using NumPy arrays with Keras (2.4.3) and Tensorflow (2.4.1)

I want to implement a lambda layer or custom layer in Keras that passes an input tensor's values to an external library that accepts and returns Numpy arrays. This library performs a physical simulation which I cannot reasonably reproduce using the tensorflow.experimental.numpy or keras.backend modules.
I see that this is a common issue with many questions and answers on StackOverflow and GitHub. However, I've not been able to find a solution that works with the versions of Keras and Tensorflow that I have installed (which are currently the most recent versions on PyPI).
So far, I have ensured that eager execution is enabled, tried using Keras custom layers instead of lambda layers and experimented with different methods of performing the conversion (e.g. tf.make_ndarray and tf.convert_to_tensor). In each case, I trigger an error indicating that the function is not being executed eagerly, for instance:
AttributeError: 'Tensor' object has no attribute 'tensor_shape'
I have observed that tf.executing_eagerly() returns false when called from within a function being used in a lambda layer. I understand that the issue stems from the parent class of the Keras layers not being included in eager execution - but I'm not sure where to go from there.
Below is a minimal reproducible example. The code executes if line 31 is removed.
import numpy as np
import tensorflow as tf
tf.config.run_functions_eagerly(True)
print(tf.executing_eagerly())
def function(input_tensor):
    print(tf.executing_eagerly())
    array = tf.keras.backend.eval(input_tensor)
    array = array + 0.2
    out = np.empty([1,sample_dim])
    out[0,:] = array[:]
    return tf.keras.backend.variable(out)
samples = 20
sample_dim = 10
x = np.empty([20,10], dtype = np.float32)
y = np.empty([20,10], dtype = np.float32)
for i in range(samples):
    x[i,:] = np.array([j + i for j in range(10)], dtype = np.float32)/10
    y[i,:] = x[i,:] + 0.1
model = tf.keras.Sequential()
model.add(tf.keras.Input(shape=(sample_dim)))
model.add(tf.keras.layers.Dense(sample_dim, activation = "relu"))
model.add(tf.keras.layers.Lambda(function, output_shape = (sample_dim)))
model.add(tf.keras.layers.Dense(sample_dim, activation = "relu"))
model.build()
model.compile(optimizer='sgd',loss="mse", run_eagerly=True)
model.fit(x, y, batch_size=10, epochs=10)
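A possible direction, sketched under assumptions: the function passed to a Lambda layer is also called on symbolic tensors while the model is being built, which may be why tf.executing_eagerly() reports False there. tf.numpy_function wraps a NumPy-only callable so that it receives concrete arrays at execution time. Here external_simulation is a hypothetical stand-in for the external library, and numpy_step is an illustrative name. Note that no gradients flow through tf.numpy_function, so layers placed before this step would not be trained through it.

```python
import numpy as np
import tensorflow as tf


def external_simulation(array):
    """Hypothetical stand-in for the NumPy-only external library."""
    return (array + 0.2).astype(np.float32)


def numpy_step(input_tensor):
    """Run the NumPy callable on the tensor's values and return a float32 tensor."""
    out = tf.numpy_function(external_simulation, [input_tensor], tf.float32)
    # numpy_function drops static shape information, so restore it for later layers.
    out.set_shape(input_tensor.shape)
    return out


result = numpy_step(tf.constant([[0.1, 0.3]], dtype=tf.float32))
print(result.numpy())
```

In the model above, tf.keras.layers.Lambda(numpy_step) would take the place of the Lambda(function, ...) layer.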

If Keras results are not reproducible, what's the best practice for comparing models and choosing hyper parameters?

UPDATE: This question was for Tensorflow 1.x. I upgraded to 2.0 and (at least on the simple code below) the reproducibility issue seems fixed on 2.0. So that solves my problem; but I'm still curious about what "best practices" were used for this issue on 1.x.
Training the exact same model/parameters/data on keras/tensorflow does not give reproducible results, and the loss is significantly different each time you train the model. There are many stackoverflow questions about that (e.g., How to get reproducible results in keras), but the recommended workarounds don't seem to work for me or for many other people on StackOverflow. OK, it is what it is.
But given that limitation of non-reproducibility with keras on tensorflow -- what's the best practice for comparing models and choosing hyper parameters? I'm testing different architectures and activations, but since the loss estimate is different each time, I'm never sure if one model is better than the other. Is there any best practice for dealing with this?
I don't think the issue has anything to do with my code, but just in case it helps; here's a sample program:
import os
#stackoverflow says turning off the GPU helps reproducibility, but it doesn't help for me
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = ""
os.environ['PYTHONHASHSEED']=str(1)
import tensorflow as tf
import tensorflow.keras as keras
import tensorflow.keras.layers
import random
import pandas as pd
import numpy as np
#StackOverflow says this is needed for reproducibility but it doesn't help for me
from tensorflow.keras import backend as K
config = tf.ConfigProto(intra_op_parallelism_threads=1,inter_op_parallelism_threads=1)
sess = tf.Session(graph=tf.get_default_graph(), config=config)
K.set_session(sess)
#make some random data
NUM_ROWS = 1000
NUM_FEATURES = 10
random_data = np.random.normal(size=(NUM_ROWS, NUM_FEATURES))
df = pd.DataFrame(data=random_data, columns=['x_' + str(ii) for ii in range(NUM_FEATURES)])
y = df.sum(axis=1) + np.random.normal(size=(NUM_ROWS))
def run(x, y):
    #StackOverflow says you have to set the seeds but it doesn't help for me
    tf.set_random_seed(1)
    np.random.seed(1)
    random.seed(1)
    os.environ['PYTHONHASHSEED']=str(1)
    model = keras.Sequential([
        keras.layers.Dense(40, input_dim=df.shape[1], activation='relu'),
        keras.layers.Dense(20, activation='relu'),
        keras.layers.Dense(10, activation='relu'),
        keras.layers.Dense(1, activation='linear')
    ])
    NUM_EPOCHS = 500
    model.compile(optimizer='adam', loss='mean_squared_error')
    model.fit(x, y, epochs=NUM_EPOCHS, verbose=0)
    predictions = model.predict(x).flatten()
    loss = model.evaluate(x, y) #This prints out the loss by side-effect
#Each time we run it gives a wildly different loss. :-(
run(df, y)
run(df, y)
run(df, y)
Given the non-reproducibility, how can I evaluate whether changes in my hyper-parameters and architecture are helping or not?
It's sneaky, but your code does, in fact, lack a step for better reproducibility: resetting the Keras & TensorFlow graphs before each run. Without this, tf.set_random_seed() won't work properly - see correct approach below.
I'd exhaust every option before throwing in the towel on reproducibility; I'm currently aware of only one case where it cannot be achieved, and that is likely a bug. Nonetheless, you may still get notably different results even after following all the steps. In that case, see "If nothing works" below, but neither of those options is very productive, so it's best to focus on attaining reproducibility first:
Definitive improvements:
Use reset_seeds(K) below
Increase numeric precision: K.set_floatx('float64')
Set PYTHONHASHSEED before the Python kernel starts - e.g. from terminal
Upgrade to TF 2, which includes some reproducibility bug fixes, but mind performance
Run CPU on a single thread (painfully slow)
Do not import from tf.python.keras - see here
Ensure all imports are consistent (i.e. don't do from keras.layers import ... and from tensorflow.keras.optimizers import ...)
Use a superior CPU - for example, Google Colab, even if using GPU, is much more robust against numeric imprecision - see this SO
Also see related SO on reproducibility
If nothing works:
Rerun X times w/ exact same hyperparameters & seeds, average results
K-Fold Cross-Validation w/ exact same hyperparameters & seeds, average results - superior option, but more work involved
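The "rerun X times and average" fallback can be sketched in plain Python. Here train_and_evaluate is a hypothetical stand-in that returns a noisy loss instead of actually fitting a Keras model, and the loss formula inside it is invented purely for illustration. The point is the comparison pattern: evaluate each configuration over the same set of seeds and compare means against the spread.

```python
import random
import statistics


def train_and_evaluate(hidden_units, seed):
    """Hypothetical stand-in for building, fitting and evaluating a model."""
    rng = random.Random(seed)
    base_loss = 1.0 / hidden_units  # invented relationship, for illustration only
    return base_loss + rng.gauss(0, 0.01)


def summarize(hidden_units, seeds):
    """Return mean and sample standard deviation of the loss across seeds."""
    losses = [train_and_evaluate(hidden_units, seed) for seed in seeds]
    return statistics.mean(losses), statistics.stdev(losses)


seeds = list(range(5))
mean_a, std_a = summarize(20, seeds)
mean_b, std_b = summarize(40, seeds)
print(f"20 units: {mean_a:.4f} +/- {std_a:.4f}")
print(f"40 units: {mean_b:.4f} +/- {std_b:.4f}")
```

A difference between two configurations is only convincing when it is clearly larger than the run-to-run standard deviation.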
Correct reset method:
def reset_seeds(reset_graph_with_backend=None):
    if reset_graph_with_backend is not None:
        K = reset_graph_with_backend
        K.clear_session()
        tf.compat.v1.reset_default_graph()
        print("KERAS AND TENSORFLOW GRAPHS RESET") # optional
    np.random.seed(1)
    random.seed(2)
    tf.compat.v1.set_random_seed(3)
    print("RANDOM SEEDS RESET") # optional
Running TF on single CPU thread: (code for TF1-only)
session_conf = tf.ConfigProto(
    intra_op_parallelism_threads=1,
    inter_op_parallelism_threads=1)
sess = tf.Session(config=session_conf)
The problem appears to be solved in Tensorflow 2.0 (at least on simple models)! Here is a code snippet that seems to yield repeatable results.
import os
####*IMPORTANT*: Have to do this line *before* importing tensorflow
os.environ['PYTHONHASHSEED']=str(1)
import tensorflow as tf
import tensorflow.keras as keras
import tensorflow.keras.layers
import random
import pandas as pd
import numpy as np
def reset_random_seeds():
    os.environ['PYTHONHASHSEED']=str(1)
    tf.random.set_seed(1)
    np.random.seed(1)
    random.seed(1)
#make some random data
reset_random_seeds()
NUM_ROWS = 1000
NUM_FEATURES = 10
random_data = np.random.normal(size=(NUM_ROWS, NUM_FEATURES))
df = pd.DataFrame(data=random_data, columns=['x_' + str(ii) for ii in range(NUM_FEATURES)])
y = df.sum(axis=1) + np.random.normal(size=(NUM_ROWS))
def run(x, y):
    reset_random_seeds()
    model = keras.Sequential([
        keras.layers.Dense(40, input_dim=df.shape[1], activation='relu'),
        keras.layers.Dense(20, activation='relu'),
        keras.layers.Dense(10, activation='relu'),
        keras.layers.Dense(1, activation='linear')
    ])
    NUM_EPOCHS = 500
    model.compile(optimizer='adam', loss='mean_squared_error')
    model.fit(x, y, epochs=NUM_EPOCHS, verbose=0)
    predictions = model.predict(x).flatten()
    loss = model.evaluate(x, y) #This prints out the loss by side-effect
#With Tensorflow 2.0 this is now reproducible!
run(df, y)
run(df, y)
run(df, y)
Using only the code below works. The key point, and it is very important, is to call reset_seeds() every time before running the model. Doing so gives reproducible results, as I verified in Google Colab.
import numpy as np
import tensorflow as tf
import random as python_random
def reset_seeds():
    np.random.seed(123)
    python_random.seed(123)
    tf.random.set_seed(1234)
reset_seeds()
You have a couple of options for stabilizing performance:
1) Set the seed for your initializers so they are always initialized to the same values.
2) More data generally results in more stable convergence.
3) Lower learning rates and bigger batch sizes also make learning more predictable.
4) Train for a fixed number of epochs instead of using callbacks to modify hyperparameters during training.
5) Use K-fold validation to train on different subsets. The average over these folds should give a fairly predictable metric.
6) You also have the option of simply training multiple times and taking the average.
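Option 5 can be sketched without any library. The generator below, whose name k_fold_indices is illustrative, shuffles the sample indices once with a fixed seed and yields train/validation index lists for each fold; the model would be trained on each train list and scored on the matching validation list, and the scores averaged.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, val_indices) for each of k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the folds identical across runs
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        val = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, val


for fold_number, (train_idx, val_idx) in enumerate(k_fold_indices(10, 5)):
    print(fold_number, sorted(val_idx))
```

Keeping the fold assignment fixed while comparing hyperparameters means every candidate is scored on the same splits.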

keras K.function error for layer output extraction

I currently have a modified resnet 50 architecture that takes two inputs. Building the model and training the model works fine, but when I’m trying to extract layer outputs using the backend function, I encounter errors.
I would prefer to extract layers using the backend function, rather than creating a new truncated model with just my layer of interest as the output.
The following snippet is self contained, and should be able to run and give the error I’ve been seeing.
I've tried reformatting the function in a few ways, such as K.function( [ mymodel.input[0],mymodel.input[1] ] , [mymodel.layers[-1].layers[-6].output])
or
K.function( [ mymodel.layers[0].input,mymodel.layers[1].input ] , [mymodel.layers[-1].layers[-6].output])
but nothing seems to fix the issue
##imports
from keras.applications.resnet50 import ResNet50
from keras.layers import Input
from keras.layers import Lambda
from keras.models import Model
from keras.optimizers import Adam
import keras
import keras.backend as K
import numpy as np
#pop off the input
res = ResNet50(weights=None,include_top=True,classes=2)
res.layers.pop(0)
#add two inputs
auxinput= Input(batch_shape=(None,224,224,1), name='aux_input')
main_input = Input(batch_shape=(None,224,224,3), name='main_input')
#use a lambda function to return just our main input (avoids errors from our auxiliary input not being used in the resnet50 component)
l_output = Lambda(lambda x: x[0])([main_input, auxinput])
#feed our main layer to resnet50
data_passed_thru = res(l_output)
#assemble the model with our two inputs, and output
mymodel = Model(inputs=[main_input, auxinput], outputs=[data_passed_thru])
mymodel.compile(optimizer=Adam(lr=0.001), loss= keras.losses.poisson, metrics=[ 'accuracy'])
print("my model summary:")
mymodel.summary()
##generate some fake data for testing
fake_aux= np.zeros((224,224))
fake_aux=fake_aux[None,...]
fake_aux=fake_aux[...,None]
print('fake aux input shape:', fake_aux.shape)
fake_main= np.zeros((224,224,3))
fake_main=fake_main[None,...]
print('fake main input shape:', fake_main.shape)
#check our model inputs and target layer
print("inputs:", mymodel.input)
print("layer output I'm trying to extract:", mymodel.layers[-1].layers[-6])
#create function to feed inputs, get our desired layer outputs
get_output_func = K.function( mymodel.input , [mymodel.layers[-1].layers[-6].output])
##this is the line that fails
X= [fake_main,fake_aux]
preds=get_output_func(X)
The error message I get is
InvalidArgumentError: You must feed a value for placeholder tensor 'input_1' with dtype float and shape [?,224,224,3]
[[{{node input_1}}]]
I managed to fix it by accessing the Resnet50 inputs directly for the function, rather than just the whole model's initial inputs. The K.function that works is
get_output_func = K.function( [mymodel.layers[-1].get_input_at(0)] , [mymodel.layers[-1].layers[-6].output])
X= [fake_main]
preds=get_output_func(X)
It only works because my architecture depends on just the one input passing through ResNet50, so I'm not sure what the solution would be in other situations, but it works for my case.

Integrate Keras to SKLearn Pipeline?

I have a sklearn pipeline performing feature engineering on heterogeneous data types (boolean, categorical, numeric, text) and wanted to try a neural network as my learning algorithm to fit the model. I am running into some problems with the shape of the input data.
I am wondering if what I am trying to do is even possible and or if I should try a different approach?
I have tried a couple different methods but am receiving these errors:
Error when checking input: expected dense_22_input to have shape (11,) but got array with shape (30513,) => I have 11 input features ... so I then tried converting my X and y to arrays and now get this error
ValueError: Specifying the columns using strings is only supported for pandas DataFrames => which I think is because of the ColumnTransformer() where I specify column names
print(X_train_OS.shape)
print(y_train_OS.shape)
(22354, 11)
(22354,)
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier
from keras.utils import to_categorical # OHE
X_train_predictors = df_train_OS.drop("label", axis=1)
X_train_predictors = X_train_predictors.values
y_train_target = to_categorical(df_train_OS["label"])
y_test_predictors = test_set.drop("label", axis=1)
y_test_predictors = y_test_predictors.values
y_test_target = to_categorical(test_set["label"])
print(X_train_predictors.shape)
print(y_train_target.shape)
(22354, 11)
(22354, 2)
def keras_classifier_wrapper():
    clf = Sequential()
    clf.add(Dense(32, input_dim=11, activation='relu'))
    clf.add(Dense(2, activation='softmax'))
    clf.compile(loss='categorical_crossentropy', optimizer='adam', metrics=["accuracy"])
    return clf
TOKENS_ALPHANUMERIC_HYPHEN = "[A-Za-z0-9\-]+(?=\\s+)"
boolTransformer = Pipeline(steps=[
    ('bool', PandasDataFrameSelector(BOOL_FEATURES))])
catTransformer = Pipeline(steps=[
    ('cat_imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('cat_ohe', OneHotEncoder(handle_unknown='ignore'))])
numTransformer = Pipeline(steps=[
    ('num_imputer', SimpleImputer(strategy='constant', fill_value=0)),
    ('num_scaler', StandardScaler())])
textTransformer_0 = Pipeline(steps=[
    ('text_bow', CountVectorizer(lowercase=True,
                                 token_pattern=TOKENS_ALPHANUMERIC_HYPHEN,
                                 stop_words=stopwords))])
textTransformer_1 = Pipeline(steps=[
    ('text_bow', CountVectorizer(lowercase=True,
                                 token_pattern=TOKENS_ALPHANUMERIC_HYPHEN,
                                 stop_words=stopwords))])
FE = ColumnTransformer(
    transformers=[
        ('bool', boolTransformer, BOOL_FEATURES),
        ('cat', catTransformer, CAT_FEATURES),
        ('num', numTransformer, NUM_FEATURES),
        ('text0', textTransformer_0, TEXT_FEATURES[0]),
        ('text1', textTransformer_1, TEXT_FEATURES[1])])
clf = KerasClassifier(keras_classifier_wrapper, epochs=100, batch_size=500, verbose=0)
PL = Pipeline(steps=[('feature_engineer', FE),
                     ('keras_clf', clf)])
PL.fit(X_train_predictors, y_train_target)
#PL.fit(X_train_OS, y_train_OS)
I think I understand the problem here, but I'm not sure how to solve it. If it is not possible to integrate an sklearn ColumnTransformer + Pipeline with a Keras model, does Keras have a good way of dealing with fixed data types for feature engineering? Thank you!
It looks like you are passing your 11 columns of original data through your various column transformers and the number of dimensions is expanding to 30,513 (after count vectorizing your text, one hot encoding etc). Your neural network architecture is set up to accept only 11 input features but is being passed your (now transformed) 30,513 features, which is what error 1 is explaining.
You therefore need to amend the input_dim of your neural network to match the number of features being created in the feature extraction pipeline.
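To find the value input_dim needs, one option is to fit the feature-engineering step on its own and read off the width of its output. The sketch below uses a tiny made-up DataFrame and a simplified ColumnTransformer (not the one from the question) to show how three raw columns expand once one-hot encoding and count vectorizing are applied.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up data: one categorical, one numeric and one text column.
df = pd.DataFrame({
    "color": ["red", "blue", "red", "green"],
    "size": [1.0, 2.0, 3.0, 4.0],
    "text": ["small red box", "large blue crate", "red crate", "green box"],
})

fe = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["color"]),
    ("num", StandardScaler(), ["size"]),
    # CountVectorizer expects a 1D column, so the selector is a plain string.
    ("text", CountVectorizer(), "text"),
])

# 3 one-hot columns + 1 numeric column + 7 vocabulary terms
n_features = fe.fit_transform(df).shape[1]
print(n_features)
```

The resulting n_features is the number to pass as input_dim to the first Dense layer, unless a later step such as SelectKBest changes the width again.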
One thing you could do is add an intermediate step between them with something like SelectKBest and set that to something like 20,000 so that you know exactly how many features will eventually be passed to the classifier.
There is a good guide and flowchart on the Google machine learning website (link); in the flowchart you can see a 'select top k features' step in the pipeline before training a model.
So, try updating these parts of your code to:
def keras_classifier_wrapper():
    clf = Sequential()
    clf.add(Dense(32, input_dim=20000, activation='relu'))
    clf.add(Dense(2, activation='softmax'))
    clf.compile(loss='categorical_crossentropy', optimizer='adam', metrics=["accuracy"])
    return clf
and
from sklearn.feature_selection import SelectKBest
select_best_features = SelectKBest(k=20000)
PL = Pipeline(steps=[('feature_engineer', FE),
                     ('select_k_best', select_best_features),
                     ('keras_clf', clf)])
I think using sklearn Pipelines with the Keras scikit-learn wrappers is a standard way of dealing with your problem, and ColumnTransformer lets you handle each feature differently (whether it is boolean, numerical or categorical).
To debug your code, I would suggest unit testing each step of your Pipeline, especially textTransformer_0 and textTransformer_1.
For instance:
textTransformer_0.fit_transform(X_train_predictors).shape # shape[1]
textTransformer_1.fit_transform(X_train_predictors).shape # shape[1]
And so on for the one-hot encoder, to understand what your final feature dimension will be.
sklearn Pipelines conventionally work with 2D np.ndarray inputs, so CountVectorizer will create a number of columns that depends on the data, and that total must be passed as input_dim to the first keras Dense layer.

How to get the output of a keras model as numerical values, rather than a Tensor object?

This is probably quite a basic tensorflow/keras question, but I can't seem to find the answer in the docs. I'm looking to retrieve the output of the hidden layer as numerical values for use in subsequent calculations. Here's the model:
import os  # needed for os.path.join below
from io import StringIO
import pandas as pd
import numpy as np
import keras
data_str = """
ti,z1,z2
0.0,1.000,0.000
0.1,0.606,0.373
0.2,0.368,0.564
0.3,0.223,0.647
0.4,0.135,0.669
0.5,0.082,0.656
0.6,0.050,0.624
0.7,0.030,0.583
0.8,0.018,0.539
0.9,0.011,0.494
1.0,0.007,0.451"""
data = pd.read_csv(StringIO(data_str), sep=',')
wd = r'/path/to/working/directory'
model_filename = os.path.join(wd, 'example1_with_keras.h5')
RUN = True
if RUN:
    model = keras.Sequential()
    model.add(keras.layers.Dense(3, activation='tanh', input_shape=(1, )))
    model.add(keras.layers.Dense(2, activation='tanh'))
    model.compile(optimizer='adam', loss='mean_squared_error')
    model.fit(data['ti'].values, data[['z1', 'z2']].values, epochs=30000)
    model.save(filepath=model_filename)
else:
    model = keras.models.load_model(model_filename)
outputs = model.layers[1].output
print(outputs)
This prints the following:
>>> Tensor("dense_2/Tanh:0", shape=(?, 2), dtype=float32)
How can I get the output as a np.array rather than a Tensor object?
Using model.layers[1].output does not produce an output; it simply returns the tensor definition of the output. In order to actually produce an output, you need to run your data through the model and specify model.layers[1].output as the output.
You can do this by using tf.keras.backend.function (documentation), which will return Numpy arrays. A similar question to yours can be found here.
The following should work for your example if you only want the output from model.layers[1].output and if you convert your data to a Numpy array for input:
from keras import backend as K
outputs = [model.layers[1].output]
functor = K.function([model.input, K.learning_phase()], outputs)
layer_outs = functor([data, 1.])
What you want is just:
model.predict(inputs)
Which will do a forward pass of the model, given an input, and produce numerical outputs.
As Luke already mentioned you seem to confuse what keras is actually doing here for you.
There are two phases in libraries like Keras, tensorflow or PyTorch.
1. Computational Graph inference
2. Computation using sessions
You are in phase 1 where you create a static computational graph. This does not do any computation yet but it is helpful because you know beforehand how to run data forward and backward through your graph thus making it faster than computing it every time you pass data.
If you actually want to get an output in form of a numpy-array you will have to pass data to the inputs of your graph. In Tensorflow this has to be done using sessions, but Keras hides this from you and lets you input data freely.
In keras you usually do something like scores = model.predict(X)
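To get one specific layer's activations as a NumPy array, an option besides K.function is a second model that shares the same layers and ends at the layer of interest; calling predict on it returns an ndarray. The sketch below uses tf.keras with the functional API and untrained weights, and the variable names are illustrative rather than taken from the question.

```python
import numpy as np
import tensorflow as tf

# Same layer sizes as in the question, built with the functional API.
inputs = tf.keras.Input(shape=(1,))
hidden = tf.keras.layers.Dense(3, activation="tanh")(inputs)
outputs = tf.keras.layers.Dense(2, activation="tanh")(hidden)
model = tf.keras.Model(inputs, outputs)

# A second model that reuses the same layers but stops at the hidden layer.
hidden_model = tf.keras.Model(inputs=model.input, outputs=hidden)

ti = np.linspace(0.0, 1.0, 11, dtype=np.float32).reshape(-1, 1)
hidden_values = hidden_model.predict(ti, verbose=0)
print(type(hidden_values), hidden_values.shape)
```

Because both models share layers, training model also updates what hidden_model returns.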
