I'm trying to use the TFRecord format to record data from C++ and then consume it in Python to feed a TensorFlow model.
TL;DR: Simply serializing proto messages into a stream doesn't satisfy the .tfrecord format requirements of the Python TFRecordDataset class. Is there an equivalent of the Python TFRecordWriter in C++ (either in TensorFlow or in the Google Protobuf libraries) to generate proper .tfrecord data?
Details:
The simplified C++ code looks like this:
tensorflow::Example sample;
sample.mutable_features()->mutable_feature()->operator[]("a").mutable_float_list()->add_value(1.0);
std::ofstream out;
out.open("cpp_example.tfrecord", std::ios::out | std::ios::binary);
sample.SerializeToOstream(&out);
In Python, to create a TensorFlow dataset I'm trying to use TFRecordDataset, but apparently it expects extra header/footer information in the .tfrecord file (rather than a simple list of serialized proto messages):
import tensorflow as tf
tfrecord_dataset = tf.data.TFRecordDataset(filenames="cpp_example.tfrecord")
next(tfrecord_dataset.as_numpy_iterator())
output:
tensorflow.python.framework.errors_impl.DataLossError: corrupted record at 0 [Op:IteratorGetNext]
Note that there is nothing wrong with the recorded binary file itself, as the following code prints a valid output:
import tensorflow as tf
p = open("cpp_example.tfrecord", "rb")
example = tf.train.Example.FromString(p.read())
output:
features {
feature {
key: "a"
value {
float_list {
value: 1.0
}
}
}
}
By comparing the binary output generated by my C++ example with the output generated by the Python TFRecordWriter, I observed additional header and footer bytes around the message content. Unfortunately, what these extra bytes represent is an implementation detail (probably compression type and some extra info), and I couldn't track it deeper than some class in the Python libraries which just exposes the interface from _pywrap_tfe.so.
There was some advice saying that .tfrecord is just normal Google Protobuf data. Maybe I'm missing the knowledge of where to find a Protobuf data writer (beyond serializing proto messages into an output stream)?
It turns out tensorflow::io::RecordWriter class of TensorFlow C++ library does the job.
#include <tensorflow/core/lib/io/record_writer.h>
#include <tensorflow/core/platform/default/posix_file_system.h>
#include <tensorflow/core/example/example.pb.h>
// ...
// Create WritableFile and instantiate RecordWriter.
tensorflow::PosixFileSystem posixFileSystem;
std::unique_ptr<tensorflow::WritableFile> writableFile;
posixFileSystem.NewWritableFile("cpp_example.tfrecord", &writableFile);
tensorflow::io::RecordWriter recordWriter(writableFile.get(), tensorflow::io::RecordWriterOptions::CreateRecordWriterOptions(""));
// ...
tensorflow::Example sample;
// ...
// Serialize proto message into a buffer and record in tfrecord format.
std::string buffer;
sample.SerializeToString(&buffer);
recordWriter.WriteRecord(buffer);
It would be helpful if this class were referenced somewhere in the TFRecord documentation.
I'm trying to write a script to simplify my everyday life in the lab. I operate one ThermoFisher / FEI scanning electron microscope and I save all my pictures in the TIFF format.
The microscope software adds an extensive custom TiffTag (code 34682) containing all the microscope/image parameters.
In my script, I would like to open an image, perform some manipulations and then save the data in a new file, including the original FEI metadata. To do so, I would like to use a python script using the tifffile module.
I can open the image file and perform the needed manipulations without problems. Retrieving the FEI metadata from the input file is also working fine.
I was thinking of using the imwrite function to save the output file, passing the extratags optional argument to transfer the original FEI metadata to the output file.
This is an extract of the tifffile documentation about the extratags:
extratags : sequence of tuples
Additional tags as [(code, dtype, count, value, writeonce)].
code : int
The TIFF tag Id.
dtype : int or str
Data type of items in 'value'. One of TIFF.DATATYPES.
count : int
Number of data values. Not used for string or bytes values.
value : sequence
'Count' values compatible with 'dtype'.
Bytes must contain count values of dtype packed as binary data.
writeonce : bool
If True, the tag is written to the first page of a series only.
Here is a snippet of my code.
my_extratags = [(input_tags['FEI_HELIOS'].code,
                 input_tags['FEI_HELIOS'].dtype,
                 input_tags['FEI_HELIOS'].count,
                 input_tags['FEI_HELIOS'].value,
                 True)]
tifffile.imwrite('output.tif', data, extratags=my_extratags)
This code is not working; it complains that the value of the extra tag should be ASCII 7-bit encoded. This already looks very strange to me because I haven't touched the metadata and am just copying it to the output file.
If I convert the metadata tag value in a string as below:
my_extratags = [(input_tags['FEI_HELIOS'].code,
                 input_tags['FEI_HELIOS'].dtype,
                 input_tags['FEI_HELIOS'].count,
                 str(input_tags['FEI_HELIOS'].value),
                 True)]
tifffile.imwrite('output.tif', data, extratags=my_extratags)
the code works, the image is saved, and the metadata tag corresponding to 'FEI_HELIOS' is created, but it is empty!
Can you help me find out what I am doing wrong?
I don't need to use tifffile, but I would prefer to use Python rather than ImageJ because I already have several other Python scripts and would like to integrate this new one with them.
Thanks a lot in advance!
toto
ps. I'm a frequent user of stackoverflow, but this is actually my first question!
In principle the approach is correct. However, tifffile parses the raw values of certain tags, including FEI_HELIOS, into dictionaries or other Python types. To get the raw tag value back for rewriting, it needs to be read from the file again. In these cases, use the internal TiffTag._astuple method to get an extratags-compatible tuple for the tag, e.g.:
import tifffile
with tifffile.TiffFile('FEI_SEM.tif') as tif:
    assert tif.is_fei
    page = tif.pages[0]
    image = page.asarray()
    ...  # process image
    with tifffile.TiffWriter('copy1.tif') as out:
        out.write(
            image,
            photometric=page.photometric,
            compression=page.compression,
            planarconfig=page.planarconfig,
            rowsperstrip=page.rowsperstrip,
            resolution=(
                page.tags['XResolution'].value,
                page.tags['YResolution'].value,
                page.tags['ResolutionUnit'].value,
            ),
            extratags=[page.tags['FEI_HELIOS']._astuple()],
        )
This approach does not preserve Exif metadata, which tifffile cannot write.
Another approach, since FEI files seem to be written uncompressed, is to memory-map the image data in the file directly to a numpy array and manipulate that array:
import shutil
import tifffile
shutil.copyfile('FEI_SEM.tif', 'copy2.tif')
image = tifffile.memmap('copy2.tif')
... # process image
image.flush()
Finally, consider tifftools for rewriting TIFF files where tifffile currently fails, e.g. on Exif metadata.
I have created an LSTM network using Keras for next-word prediction based on the context of the previous words in a sentence. I have written the code in Python, but I have to deploy it with existing C++ code. How can I translate this chunk of code to C++? I am new to C++ and have been relying on built-in Python functions for all of this.
I have been trying to convert the Python code to C++ line by line, but that is consuming a lot of time, and I am not even able to read the hdf5 and pkl files in C++.
Python code:
from pickle import load
from keras.models import load_model
from keras.preprocessing.sequence import pad_sequences

def generate_seq(model, tokenizer, seq_length, seed_text, no_next_words):
    result = []
    in_text = seed_text
    for _ in range(no_next_words):
        # encode the text as integers
        encoded = tokenizer.texts_to_sequences([in_text])[0]
        # convert sequences to a fixed length
        encoded = pad_sequences([encoded], maxlen=seq_length, truncating='pre')
        # predict probabilities for each word
        yhat = model.predict_classes(encoded, verbose=0)
        # map predicted word index to word
        out_word = ''
        for word, index in tokenizer.word_index.items():
            if index == yhat:
                out_word = word
                break
        # append to input
        in_text += ' ' + out_word
        result.append(out_word)
    return ' '.join(result)

# load the model
model = load_model('model_chat2_3.h5')
# load the tokenizer
tokenizer = load(open('tokenizer_chat2_3.pkl', 'rb'))
# generate new text
while True:
    inp = input("Enter:")
    generated = generate_seq(model, tokenizer, 2, inp, 1)
    print(generated)
C++ code :
#include <iostream>
#include <fstream>
#include <typeinfo>
#include <string>

using namespace std;

int main(int argc, const char * argv[]) {
    ifstream myReadFile, myReadFile2;
    myReadFile.open("/Users/Apple/New word_prediction/word_prediction/data/colab models/chat2_2/tokenizer_chat2_2.pkl");
    myReadFile2.open("/Users/Apple/New word_prediction/word_prediction/data/colab models/chat2_2/model_chat2_2.h5");
    char output[1200];
    if (myReadFile.is_open()) {
        while (!myReadFile.eof()) {
            myReadFile >> output;
            cout << output << endl;
        }
    }
    myReadFile.close();
    myReadFile2.close();
    return 0;
}
The output of the C++ program comes out as some encoded (binary) string.
I tried referring to the previous answers but all of them are built very specifically to their use-cases.
For e.g :
Convert Keras model to C++.
Basically, I built my Keras model using LSTM hidden layers, while in the above link the model is a CNN with a completely different architecture. If I follow the above link, I will have to completely change the code.
Also, the other similar repositories work with CNNs for image processing, but in my use case I have to predict the next word based on the context of previous words, through an LSTM network.
This model has good accuracy on the console, but it has to be tested on the real-time keyboard of mobile applications.
So my question is whether there is a method or tool through which I can convert some of the code to C++, or at least how I can read the Keras model built on an LSTM network (model_chat2_3.h5) and the tokenizer_chat2_3.pkl file in C++.
It would also help if someone could guide me on directly deploying the Keras model on a mobile application, to work as a keyboard. Again, all existing repositories are made to deploy CNN-based models on mobile apps for image-related tasks.
"Keras LSTM model to android" would also help, but no one has answered it yet.
I am new to C++, so I am not able to write the complete code for it, and I have a time constraint. Any sort of help is appreciated!
I do not understand whether sitk.ReadImage can read a list of images or not. I did not manage to find an example showing how a list of images should be passed to the function.
But the function documentation says:
ReadImage(VectorString fileNames, itk::simple::PixelIDValueEnum outputPixelType) -> Image
ReadImage(std::string const & filename, itk::simple::PixelIDValueEnum outputPixelType) -> Image
ReadImage is a procedural interface to the ImageSeriesReader class which is convenient for most image reading tasks.
Note that when reading a series of images that have meta-data
associated with them (e.g. a DICOM series) the resulting image will
have an empty meta-data dictionary. It is possible to programmatically
add a meta-data dictionary to the compounded image by reading in one
or more images from the series using the ImageFileReader class,
analyzing the meta-dictionary associated with each of those images and
creating one that is relevant for the compounded image.
So it seems from the documentation that it is possible. Can someone show me a simple example?
EDIT:
I tried the following:
sitk.ReadImage(['volume00001.mhd','volume00002.mhd'])
but this is the error that I get:
RuntimeErrorTraceback (most recent call last)
<ipython-input-42-85abf82c3afa> in <module>()
1 files = [f for f in os.listdir('.') if 'mhd' in f]
2 print(sorted_files[1:25])
----> 3 sitk.ReadImage(['volume00001.mhd','volume00002.mhd'])
/gpfs/bbp.cscs.ch/home/amsalem/anaconda2/lib/python2.7/site-packages/SimpleITK/SimpleITK.pyc in ReadImage(*args)
8330
8331 """
-> 8332 return _SimpleITK.ReadImage(*args)
8333 class HashImageFilter(ProcessObject):
8334 """
RuntimeError: Exception thrown in SimpleITK ReadImage: /tmp/SimpleITK/Code/IO/src/sitkImageSeriesReader.cxx:145:
sitk::ERROR: The file in the series have unsupported 3 dimensions.
Thanks.
SimpleITK uses SWIG to wrap a C++ interface to the Insight Segmentation and Registration Toolkit (ITK). As such, the inline Python documentation should be supplemented with the C++ Doxygen documentation. There is a mapping of C++ types to Python types, with robust implicit conversion between them. You can find the documentation for the sitk::ReadImage methods here:
https://itk.org/SimpleITKDoxygen/html/namespaceitk_1_1simple.html#ae3b678b5b043c5a8c93aa616d5ee574c
Notice there are two ReadImage methods, and the Python docstring you listed appears to be one of them.
From the SimpleITK examples here is a snippet to read a DICOM series:
print( "Reading Dicom directory:", sys.argv[1] )
reader = sitk.ImageSeriesReader()
dicom_names = reader.GetGDCMSeriesFileNames( sys.argv[1] )
reader.SetFileNames(dicom_names)
image = reader.Execute()
This uses the class interface as opposed to the procedural one, which would simply be:
image = sitk.ReadImage(dicom_names)
or generically with a list of string filenames:
image = sitk.ReadImage(["image1.png", "image2.png"...])
Many common array-like types of strings will be implicitly converted to the SWIG VectorString type.
I have written the Abalone estimator in Python as described in https://www.tensorflow.org/versions/r0.11/tutorials/estimators/. I wish to save the state of the estimator, then load it in C++ and use it to make predictions.
To save it from Python, I use the model_dir parameter in the tf.contrib.learn.Estimator constructor, which creates a (text) protobuf file and several checkpoint files. I then use the freeze_graph.py tool (https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/tools/freeze_graph.py) to combine the checkpoint and the protobuf file into a standalone GraphDef file.
I load this file using the C++ API, load some input values into a Tensor, then run the session. The input node in the protobuf file is called 'input' and the output node 'output', and both are placeholder nodes.
// ...
std::vector<std::pair<string, tensorflow::Tensor>> inputs =
{
{"input", inputTensor}
};
std::vector<tensorflow::Tensor> outputs;
status = pSession->Run(inputs, {"output"}, {}, &outputs);
However, since the output node is a placeholder node, this fails because it needs to be fed a value. And since you cannot both feed and fetch the same node, I cannot access the output of the estimator. Why is the output node a placeholder node?
What is the best way to save a trained estimator from Python and load it for prediction in C++?
I am trying to use _pickle to save data to disk, but when calling _pickle.dump I get the error:
OverflowError: cannot serialize a bytes object larger than 4 GiB
Is this a hard limitation of _pickle? (cPickle for Python 2)
Not anymore in Python 3.4, which implements PEP 3154 and pickle protocol 4:
https://www.python.org/dev/peps/pep-3154/
But you need to say you want to use version 4 of the protocol:
https://docs.python.org/3/library/pickle.html
pickle.dump(d, open("file", 'wb'), protocol=4)
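A complete round trip looks like this (the file must be opened in binary mode; protocol 4 is what lifts the 4 GiB limit, demonstrated here on a small payload for brevity):

```python
import os
import pickle
import tempfile

# A small payload stands in for the multi-GiB object.
data = {"blob": b"x" * 1024}
path = os.path.join(tempfile.mkdtemp(), "file.pkl")

with open(path, "wb") as f:  # binary mode is required for pickle
    pickle.dump(data, f, protocol=4)

with open(path, "rb") as f:
    restored = pickle.load(f)

assert restored == data
```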
Yes, this is a hard-coded limit; from the save_bytes function in CPython's _pickle.c:
else if (size <= 0xffffffffL) {
    // ...
}
else {
    PyErr_SetString(PyExc_OverflowError,
                    "cannot serialize a bytes object larger than 4 GiB");
    return -1;   /* string too large */
}
The protocol uses 4 bytes to write the size of the object to disk, which means it can only represent sizes up to 2**32 bytes == 4 GiB.
If you can break up the bytes object into multiple objects, each smaller than 4GB, you can still save the data to a pickle, of course.
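A hypothetical chunking helper along those lines might look like this (the chunk size is an assumption; anything under 4 GiB works):

```python
import pickle

CHUNK = 2**31  # 2 GiB per piece, safely below the 4 GiB cap

def dump_big_bytes(data: bytes, f) -> None:
    # Split the bytes object into sub-4GiB chunks and pickle the list,
    # so each pickled item stays under the protocol's size field limit.
    chunks = [data[i:i + CHUNK] for i in range(0, len(data), CHUNK)]
    pickle.dump(chunks, f, protocol=2)

def load_big_bytes(f) -> bytes:
    # Reassemble the original bytes object from its chunks.
    return b"".join(pickle.load(f))
```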
There are great answers above explaining why pickle doesn't work.
But it still doesn't work for Python 2.7, which is a problem if you are still on Python 2.7 and want to support large files, especially NumPy (NumPy arrays over 4 GiB fail).
You can use OC serialization, which has been updated to work with data over 4 GiB. There is a Python C extension module available from:
http://www.picklingtools.com/Downloads
Take a look at the Documentation:
http://www.picklingtools.com/html/faq.html#python-c-extension-modules-new-as-of-picklingtools-1-6-0-and-1-3-3
But here's a quick summary: there are ocdumps and ocloads, very much like pickle's dumps and loads:
from pyocser import ocdumps, ocloads
ser = ocdumps(pyobject)    # serialize pyobject into string ser
pyobject = ocloads(ser)    # deserialize from string ser into pyobject
OC serialization is 1.5-2x faster and also works with C++ (if you are mixing languages). It works with all built-in types, but not classes (partly because it is cross-language, and it's hard to build C++ classes from Python).