Creating a Training/Validation LMDB for NVIDIA DIGITS in Python

I'm trying to make a Training/Validation LMDB set for use with NVIDIA Digits, but I can't find any good examples/tutorials.
I understand how to create an LMDB database, but I'm uncertain how to correctly format the data. I get how to create an image using the caffe_pb2 Datum by setting channels/width/height/data and saving it.
But, how do I create the Labels LMDB? Do I still use a Caffe Datum? If so, what do I set the channels/width/height to? Will it work if I have a single value label?
Thanks

DIGITS only really supports data in LMDBs for now. Each value in the LMDB key/val store must be a Caffe Datum, which limits the number of dimensions to 3.
Even though Caffe Datums allow for a single numeric label (datum.label), when uploading a prebuilt LMDB to DIGITS you need to specify a separate database for the labels. That's inefficient if you only have a single numeric label (since you could have done it all in one DB), but it's more generic and scalable to other label types.
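If you do go the single-value route, here's a minimal sketch of building such a labels LMDB. One possible layout is a 1x1x1 Datum per label; the database path and key format are made up, and the keys just have to match those of the corresponding data LMDB:

import lmdb
from caffe.proto import caffe_pb2

labels = [3, 1, 7]  # hypothetical single-value labels, one per image

env = lmdb.open('labels_lmdb', map_size=1 << 30)
with env.begin(write=True) as txn:
    for i, label in enumerate(labels):
        datum = caffe_pb2.Datum()
        # Treat each label as a 1x1x1 "image" whose only value is the label.
        datum.channels, datum.height, datum.width = 1, 1, 1
        datum.float_data.append(float(label))
        # Keys must line up with the keys in the data LMDB.
        txn.put('{:08d}'.format(i).encode('ascii'), datum.SerializeToString())
env.close()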
Sorry, you're right that this isn't documented very well right now. Here are some source files you can browse for inspiration if you're so inclined:
Data are images, labels are (gradientX, gradientY)
Data are image pairs, labels are 1 or 0
Data are text snippets, labels are numeric classes
Generic script for creating LMDBs from any data blobs coming from extensions/plugins

Related

Keras custom fit_generator for numeric dataframe

I have several CSV files which are placed in a directory. What I want to do is to create a flow from this directory where each file is taken, preprocessed (null-value filling, outlier treatment, etc.), and then each data point is passed to a Keras model, and this process should repeat itself for every file placed in the directory. Any suggestions on how to create a data flow similar to the one Keras provides for image data? Also, this should happen in Python :)
Thanks in advance!
I don't think that Keras natively supplies such functionality.
You should make your own converter, using something like glob to go over each file, send it to preprocessing functions, and finally save it as a format readily usable by Keras, such as a numpy array.
You might want to have a look here for an example of inputting multiple files (although in this case they are already numpy arrays, not csv files) to use in the training of a model.
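Since the question mentions fit_generator, here is a rough sketch of a generator that globs and preprocesses the CSVs on the fly. The 'target' column name and the fill strategy are placeholders for your own preprocessing:

import glob
import pandas as pd

def csv_batch_generator(directory, batch_size=32):
    # Keras generators are expected to loop forever.
    while True:
        for path in glob.glob(directory + '/*.csv'):
            df = pd.read_csv(path)
            df = df.fillna(df.mean())  # placeholder null-value fill
            X = df.drop('target', axis=1).values
            y = df['target'].values
            for start in range(0, len(X), batch_size):
                yield X[start:start + batch_size], y[start:start + batch_size]

# model.fit_generator(csv_batch_generator('data/'), steps_per_epoch=..., epochs=...)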

Representing time sequence input/output in tensorflow

I've been working through the TensorFlow documentation (still learning), and I can't figure out how to represent input/output sequence data. My inputs are sequences of 20 8-entry vectors, making an 8x20xN matrix, where N is the number of instances. I'd like to eventually pass these through an LSTM for sequence-to-sequence learning. I know I need a 3D vector, but I'm unsure which dimensions are which.
RTFMs with pointers to the correct documentation greatly appreciated. I feel like this is obvious and I'm just missing it.
As described in the excellent blog post by WildML, the proper way is to save your examples in a TFRecord using the format tf.SequenceExample(). Using TFRecords for this provides the following advantages:
You can split your data into many files and load each of them on a different GPU.
You can use TensorFlow utilities for loading the data (for example, using Queues to load your data on demand).
Your model code will be separate from your dataset processing (this is a good habit to have).
You can bring new data to your model just by putting it into this format.
TFRecords use protocol buffers (protobuf) to format your data; the documentation can be found here. The basic idea is that you define a format for your data (in this case tf.SequenceExample), save it to a TFRecord, and load it using the same data definition. Code for this pattern can be found in this IPython notebook.
As my answer is mostly summarizing the WildML blog post on this topic, I suggest you check that out, again found here.
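For illustration, a minimal sketch of writing one sequence (20 steps of 8 values, per the question) as a tf.SequenceExample in a TFRecord. The feature names are made up, and the TF 1.x queue-era API is assumed:

import numpy as np
import tensorflow as tf

def make_sequence_example(sequence):
    ex = tf.train.SequenceExample()
    # Non-sequential info goes in the context.
    ex.context.feature['length'].int64_list.value.append(len(sequence))
    # One FeatureList entry per time step; 'inputs' is a made-up name.
    inputs = ex.feature_lists.feature_list['inputs']
    for step in sequence:
        inputs.feature.add().float_list.value.extend(step)
    return ex

seq = np.random.rand(20, 8)  # dummy sequence in [time, features] order
with tf.python_io.TFRecordWriter('sequences.tfrecord') as writer:
    writer.write(make_sequence_example(seq).SerializeToString())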

Proper way of loading large amounts of image data

For a Deep Learning application I am building, I have a dataset of about 50k grayscale images, ranging from about 300*2k to 300*10k pixels. Loading all of this data into memory is not possible, so I am looking for a proper way to handle reading in random batches of data. One extra complication is that I need to know the width of each image before building my Deep Learning model, to define different size buckets within the data (for example: [2k-4k, 4k-6k, 6k-8k, 8k-10k]).
Currently, I am working with a smaller dataset and just load each image from a PNG file, bucket them by size, and start learning. When I want to scale up, this is no longer possible.
To train the model, each batch of the data should (ideally) be fully random from a random bucket. A naive way of doing this would be to save the sizes of the images beforehand and just load each random batch when it is needed. However, this would result in a lot of extra data loading and not very efficient memory management.
Does anyone have a suggestion how to handle this problem efficiently?
Cheers!
Why not add a preprocessing step, where you would either (a) physically move the images to folders associated with their bucket and/or rename them, or (b) first scan through all images (headers only) to build an in-memory table of image filenames and their sizes/buckets? The random sampling step would then be quite simple to implement.
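A sketch of option (b): PIL's Image.open only reads the file header, so scanning sizes is cheap. The bucket bounds and the '*.png' pattern are assumptions taken from the question:

import glob
import random
from collections import defaultdict
from PIL import Image

BUCKETS = [(2000, 4000), (4000, 6000), (6000, 8000), (8000, 10000)]

def build_bucket_table(directory):
    # Image.open is lazy: .size comes from the header, pixels are not decoded.
    table = defaultdict(list)
    for path in glob.glob(directory + '/*.png'):
        width, _ = Image.open(path).size
        for lo, hi in BUCKETS:
            if lo <= width < hi:
                table[(lo, hi)].append(path)
                break
    return table

def random_batch(table, batch_size=16):
    # Pick a random bucket, then load only the images for this batch.
    bucket = random.choice(list(table))
    paths = random.sample(table[bucket], min(batch_size, len(table[bucket])))
    return [Image.open(p) for p in paths]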

HDF5 Links to Events in Dataset

I'm trying to use HDF5 to store time-series EEG data. These files can be quite large and consist of many channels, and I like the features of the HDF5 file format (lazy I/O, dynamic compression, mpi, etc).
One common thing to do with EEG data is to mark sections of data as 'interesting'. I'm struggling with a good way to store these marks in the file. I see soft/hard links supported for linking the same dataset to other groups, etc -- but I do not see any way to link to sections of the dataset.
For example, let's assume I have a dataset called EEG containing sleep data. Let's say I run an algorithm that takes a while to process the data and generates indices corresponding to periods of REM sleep. What is the best way to store these index ranges in an HDF5 file?
The best I can think of right now is to create a dataset with three columns -- the first column is a string and contains a label for the event ("REM1"), and the second/third columns contain the start/end index respectively. The only reason I don't like this solution is that HDF5 datasets are pretty fixed in size -- if I decide later that a period of REM sleep was mis-identified and I need to add/remove that event, the dataset size would need to change (and deleting the dataset and recreating it with a new size is suboptimal). Compound this with the fact that I may have MANY events (imagine marking eyeblink events), and this becomes more of a problem.
I'm more curious to find out if there's functionality in the HDF5 file that I'm just not aware of, because this seems like a pretty common thing that one would want to do.
I think what you want is a Region Reference — essentially, a way to store a reference to a slice of your data. In h5py, you create them with the regionref property and numpy slicing syntax, so if you have a dataset called ds and the start and end indexes of your REM period, you can do:
rem_ref = ds.regionref[start:end]
ds.attrs['REM1'] = rem_ref
ds[ds.attrs['REM1']] # Will be a 1-d set of values
You can store regionrefs pretty naturally — they can be attributes on a dataset, objects in a group, or you can create a regionref-type dataset and store them in there.
In your case, I might create a group ("REM_periods" or something) and store the references in there. Creating a "REM_periods" dataset and storing the regionrefs there is reasonable too, but you run into the whole "datasets tend not to be variable-length very well" thing.
Storing them as attrs on the dataset might be OK, too, but it'd get awkward if you wanted to have more than one event type.
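A minimal h5py sketch of the group approach (the file name, group name, and indices are all made up):

import h5py
import numpy as np

with h5py.File('sleep.h5', 'w') as f:
    ds = f.create_dataset('EEG', data=np.random.rand(100000))
    rem = f.create_group('REM_periods')

    # One named region reference per event.
    rem.attrs['REM1'] = ds.regionref[1000:5000]
    rem.attrs['REM2'] = ds.regionref[20000:26000]

    # A mis-identified event is just an attribute deletion, no resizing needed.
    del rem.attrs['REM2']

    rem1_data = ds[rem.attrs['REM1']]  # 1-d array of the referenced samples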

Storing images and metadata with PyTables

I'm using PyTables to store some images as Array and CArray data types. For each of these images, I also want to store some basic metadata (e.g., EXIF data).
I can imagine a number of approaches to storing both of these data formats, ranging from storing the metadata with the AttributeSet class for each Array/CArray to using a Table for all of the metadata.
My question is: What is the best approach if I want to be able to efficiently query and extract images from the ultimate hdf5 file for processing? For example, I'd like to be able to extract images taken at certain times (12-3pm) and process that subset of the data and then insert copies into the database or replace the existing arrays.
Many thanks for the help.
Best,
Nick
[Edit (clarification): I'm currently processing these images as NumPy arrays, and I hope to preserve that functionality]
My understanding of the PyTables docs suggests the following.
Create a table, with one column for each piece of metadata you are interested in. If your images are all the same size, and this is known at table-creation time, add a column of arrays and store the images there. If the image sizes vary, create a column with a unique identifier for each image (the functional equivalent of a filename), then create a new group and create one array/carray per image, named identically to the identifiers in the aforementioned table.
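To make the varying-size layout concrete, here is a rough PyTables sketch; the column names, timestamp encoding, and query condition are all hypothetical:

import numpy as np
import tables

class ImageMeta(tables.IsDescription):
    name = tables.StringCol(64)     # links a row to its image array
    timestamp = tables.Time64Col()  # e.g. EXIF capture time (epoch seconds)
    exposure = tables.Float32Col()

with tables.open_file('images.h5', 'w') as h5:
    meta = h5.create_table('/', 'metadata', ImageMeta)
    images = h5.create_group('/', 'images')

    # Store one image and its metadata row.
    h5.create_carray(images, 'img_0001', obj=np.zeros((480, 640), np.uint8))
    row = meta.row
    row['name'], row['timestamp'], row['exposure'] = 'img_0001', 1.4e9, 0.01
    row.append()
    meta.flush()

    # Query by metadata, then load only the matching arrays as NumPy.
    t0, t1 = 1.39e9, 1.41e9
    names = [r['name'].decode() for r in
             meta.where('(timestamp >= t0) & (timestamp < t1)')]
    arrays = [h5.get_node('/images', n).read() for n in names]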
The other option would be to use a lightweight RDBMS (even SQLite) to store the metadata table, which would allow for easy querying/sorting/etc., but keep the actual image arrays in the h5 file.
