`flow_from_dataframe()` custom pre-processing - python

I'm trying to use the keras ImageDataGenerator.flow_from_dataframe() method to generate image data on the fly as the dataset I'm working on is too large to load into memory in one go.
The source image data files are DICOM files, which are not supported by the flow_from_dataframe() method.
Is it possible to (easily) extend flow_from_dataframe() to handle DICOM (or other unsupported) images/input?
Perhaps a custom pre-processing function could be run on each unsupported file, returning a normalised (windowed/photometric corrected) numpy array, then allowing the ImageDataGenerator instance to proceed.
I could edit the source in my own installation, but a general solution that works with vanilla Keras is preferred, to ensure portability to other platforms (especially Kaggle).

A solution to this can be found on the Keras github issue tracker/feature requests: https://github.com/keras-team/keras/issues/13665
A custom data generator can be created with keras.utils.Sequence as the superclass.
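A minimal sketch of that approach (not taken from the linked issue), assuming pydicom for reading the files and a dataframe with hypothetical "filename"/"label" columns; the min-max normalisation stands in for the windowing/photometric correction mentioned in the question:

# Sketch: a keras.utils.Sequence subclass that reads DICOM files with pydicom
# and yields normalised numpy batches.
import numpy as np
import pydicom
from tensorflow.keras.utils import Sequence

class DicomSequence(Sequence):
    def __init__(self, dataframe, batch_size=32):
        self.df = dataframe.reset_index(drop=True)
        self.batch_size = batch_size

    def __len__(self):
        return int(np.ceil(len(self.df) / self.batch_size))

    def __getitem__(self, idx):
        batch = self.df.iloc[idx * self.batch_size:(idx + 1) * self.batch_size]
        images, labels = [], []
        for _, row in batch.iterrows():
            ds = pydicom.dcmread(row["filename"])
            img = ds.pixel_array.astype("float32")
            img = (img - img.min()) / (img.max() - img.min() + 1e-6)  # crude normalisation
            images.append(img[..., np.newaxis])  # add a channel axis
            labels.append(row["label"])
        return np.stack(images), np.array(labels)  # images must share one shape

# model.fit(DicomSequence(train_df), epochs=10)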

Related

Best way to load images from directory if numpy and cv2 preprocessing is required

I'm currently trying to use tf.keras.preprocessing.image_dataset_from_directory to create a dataset, but I'm running into a couple of problems:
I can't insert a numpy/cv2 preprocessing step before conversion to tensors. I can use tf.data.Dataset.map, but that means I have to unbatch, convert to a numpy array, do the preprocessing ... and so on. I tried this, but I keep running into small bugs that are fixable; I'm getting the feeling it wasn't designed to be used this way.
I don't have much control over the way the train/val split is done. I want to do a particular KFold split.
FYI I'm used to doing this in PyTorch via custom Datasets and DataLoaders.
You may use the preprocessing_function argument of tf.keras.preprocessing.image.ImageDataGenerator to pass your own preprocessing function to the generator, so that it is applied to each sample.
Reference: Image Data Generator
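A minimal sketch, assuming a hypothetical cv2 step (a Gaussian blur here) as the custom preprocessing; preprocessing_function receives each image as a rank-3 numpy array after resizing and must return an array of the same shape:

import cv2
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

def custom_preprocess(img):
    # img arrives as a float32 numpy array of shape (height, width, channels)
    img = cv2.GaussianBlur(img.astype("uint8"), (3, 3), 0)  # stand-in for the real cv2 logic
    return img.astype("float32") / 255.0

datagen = ImageDataGenerator(preprocessing_function=custom_preprocess)
train_gen = datagen.flow_from_directory("data/train",            # illustrative path
                                        target_size=(224, 224),
                                        batch_size=32,
                                        class_mode="binary")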

Is there a format like ONNX for persisting data pre-processing / feature-engineering pipelines without serializing?

I'm looking for a way to save my sklearn Pipeline for data pre-processing so that I can re-load it to make predictions.
So far I've only seen options like pickle or joblib, which will serialize arbitrary python objects, but the resulting file
is opaque if I wanted to store the pipeline in version control,
will serialize any python object and therefore might not be safe to deserialize, and
may run into issues with different Python or library versions.
It seems like ONNX is a great way to save models in a safe & interoperable way -- Is there any alternative for data pre-processing pipelines?
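For what it's worth, the skl2onnx (sklearn-onnx) converter can export many sklearn pipelines, including common preprocessing transformers, to ONNX; a minimal sketch with an illustrative two-step pipeline (coverage depends on every transformer in the pipeline having an ONNX converter):

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType

# Illustrative pipeline and feature count
pipe = Pipeline([("scale", StandardScaler()), ("pca", PCA(n_components=2))])
pipe.fit(np.random.rand(100, 4).astype(np.float32))

onx = convert_sklearn(pipe, initial_types=[("input", FloatTensorType([None, 4]))])
with open("preprocessing.onnx", "wb") as f:
    f.write(onx.SerializeToString())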

Project organization with Tensorflow.keras. Should one subclass tf.keras.Model?

I'm using Tensorflow 1.14 and the tf.keras API to build a number (>10) of different neural networks. (I'm also interested in the answers to this question using Tensorflow 2.) I'm wondering how I should organize my project.
I convert the keras models into estimators using tf.keras.estimator.model_to_estimator and Tensorboard for visualization. I'm also sometimes using model.summary(). Each of my models has a number (>20) of hyperparameters and takes as input one of three types of input data. I sometimes use hyperparameter optimization, such that I often manually delete models and use tf.keras.backend.clear_session() before trying the next set of hyperparameters.
Currently I'm using functions that take hyperparameters as arguments and return the respective compiled keras model to be turned into an estimator. I use three different "Main_Datatype.py" scripts to train models for the three different input data types. All data is loaded from .tfrecord files and there is an input function for each data type, which is used by all estimators taking that type of data as input. I switch between models (i.e. functions returning a model) in the Main scripts. I also have some building blocks that are part of more than one model, for which I use helper functions returning them, piecing together the final result using the Keras functional API.
The slight incompatibilities of the different models are beginning to confuse me, and I've decided to organise the project using classes. I'm planning to make a class for each model that keeps track of hyperparameters and the correct naming of each model and its model directory. However, I'm wondering if there are established or recommended ways to do this in Tensorflow.
Question: Should I be subclassing tf.keras.Model instead of using functions to build models or python classes that encapsulate them? Would subclassing keras.Model break (or require much work to enable) any of the functionality that I use with keras estimators and tensorboard? I've seen many issues people have with using custom Model classes and am somewhat reluctant to put in the work only to find that it doesn't work for me. Do you have other suggestions how to better organize my project?
Thank you very much in advance.
Subclass only if you absolutely need to. I personally prefer the following order of implementation. If the complexity of the model you are designing cannot be achieved using the first two options, then of course subclassing is the only option left.
tf.keras Sequential API
tf.keras Functional API
Subclass tf.keras.Model
Seems like a reasonable thing to do; see the guides: https://www.tensorflow.org/guide/keras/custom_layers_and_models and https://www.tensorflow.org/api_docs/python/tf/keras/Model
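If subclassing does turn out to be necessary, a minimal sketch of a tf.keras.Model subclass looks like this (layer sizes and hyperparameters are illustrative, not from the question):

import tensorflow as tf

class SmallClassifier(tf.keras.Model):
    # A toy two-layer classifier, purely to illustrate the subclassing option
    def __init__(self, hidden_units=64, num_classes=10):
        super().__init__()
        self.dense1 = tf.keras.layers.Dense(hidden_units, activation="relu")
        self.dense2 = tf.keras.layers.Dense(num_classes, activation="softmax")

    def call(self, inputs):
        return self.dense2(self.dense1(inputs))

model = SmallClassifier()
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# Note: model.summary() only works after the model has been built, e.g. after
# model.build((None, 784)) or a first call on real data.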

Sagemaker to use processed pickled ndarray instead of csv files from S3

I understand that you can pass a CSV file from S3 into a Sagemaker XGBoost container using the following code
train_channel = sagemaker.session.s3_input(train_data, content_type='text/csv')
valid_channel = sagemaker.session.s3_input(validation_data, content_type='text/csv')
data_channels = {'train': train_channel, 'validation': valid_channel}
xgb_model.fit(inputs=data_channels, logs=True)
But I have an ndarray stored in an S3 bucket. These are processed, label-encoded, feature-engineered arrays. I want to pass these into the container instead of the CSV. I do understand I can always convert my ndarray into CSV files before saving it to S3. Just checking if there is an array option.
There are multiple options for algorithms in SageMaker:
Built-in algorithms, like the SageMaker XGBoost you mention
Custom, user-created algorithm code, which can be:
Written for a pre-built docker image, available for Sklearn, TensorFlow, Pytorch, MXNet
Written in your own container
When you use built-ins (option 1), your choice of data formats is limited to what the built-ins support, which is only csv and libsvm in the case of the built-in XGBoost. If you want to use custom data formats and pre-processing logic before XGBoost, it is absolutely possible with your own script leveraging the open-source XGBoost. You can get inspiration from the Random Forest demo to see how to create custom models in pre-built containers.
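As a minimal sketch of that custom-script route (file names and hyperparameters are illustrative), a training script in a pre-built container could load pre-processed .npy arrays straight from the training channel:

import os
import numpy as np
import xgboost as xgb

if __name__ == "__main__":
    # SageMaker copies the S3 channel contents into SM_CHANNEL_TRAIN
    train_dir = os.environ.get("SM_CHANNEL_TRAIN", "data/train")
    X = np.load(os.path.join(train_dir, "features.npy"))   # illustrative file names
    y = np.load(os.path.join(train_dir, "labels.npy"))

    dtrain = xgb.DMatrix(X, label=y)
    booster = xgb.train({"objective": "binary:logistic", "max_depth": 5},
                        dtrain, num_boost_round=100)

    model_dir = os.environ.get("SM_MODEL_DIR", "model")
    os.makedirs(model_dir, exist_ok=True)
    booster.save_model(os.path.join(model_dir, "xgboost-model"))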

How to share tensorflow weights with other users

Is it possible to share tensorflow checkpoint files with other users (platform & CPU/GPU independent)? I had shared a tensorflow implementation of the DeconvNet and now I want to provide the trained weights. Can I simply upload the saved model, or is there another tf way? I'm asking because I read a tutorial where the weights were stored using numpy.savetxt and then restored during the weight initialization. But this method was used for the MNIST example, which uses a very small net.
Thanks!
You could save the metagraph and provide code to restore and run your model --
http://tensorflow.org/how_tos/meta_graph
One downside of this is that it doesn't provide annotations of which tensors to feed/fetch, so you need to provide some code showing how to use it.
SavedModel is the next iteration of TensorFlow checkpoint format that takes care of that, but it doesn't have much documentation yet.
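A minimal sketch of the SavedModel route in TF 1.x style (assuming a session-based model; the toy graph and tensor names are illustrative, not the actual DeconvNet):

import tensorflow as tf

with tf.Session(graph=tf.Graph()) as sess:
    # toy graph standing in for the real model
    x = tf.placeholder(tf.float32, [None, 784], name="x")
    logits = tf.layers.dense(x, 10, name="logits")
    sess.run(tf.global_variables_initializer())

    # writes graph + weights + named inputs/outputs into one shareable directory
    tf.saved_model.simple_save(sess, "export/deconvnet",
                               inputs={"x": x}, outputs={"logits": logits})

# The recipient can reload it without the original build code:
#   with tf.Session(graph=tf.Graph()) as sess:
#       tf.saved_model.loader.load(sess, ["serve"], "export/deconvnet")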
I use pickle, in binary mode, to dump and load big numpy matrices and it works quite well.
