How to store numpy arrays as TFRecord? - python
I am trying to create a dataset in TFRecord format from numpy arrays. I want to store 2D and 3D coordinates.
The 2D coordinates are a numpy array of shape (2,10) and type float64.
The 3D coordinates are a numpy array of shape (3,10) and type float64.
This is my code:
def _floats_feature(value):
    return tf.train.Feature(float_list=tf.train.FloatList(value=value))

train_filename = 'train.tfrecords'  # address to save the TFRecords file
writer = tf.python_io.TFRecordWriter(train_filename)
for c in range(0, 1000):
    # get 2d and 3d coordinates and save in c2d and c3d
    feature = {'train/coord2d': _floats_feature(c2d),
               'train/coord3d': _floats_feature(c3d)}
    sample = tf.train.Example(features=tf.train.Features(feature=feature))
    writer.write(sample.SerializeToString())
writer.close()
When I run this I get the following error:
  feature = {'train/coord2d': _floats_feature(c2d),
  File "genData.py", line 19, in _floats_feature
    return tf.train.Feature(float_list=tf.train.FloatList(value=value))
  File "C:\Users\User\AppData\Local\Programs\Python\Python36\lib\site-packages\google\protobuf\internal\python_message.py", line 510, in init
    copy.extend(field_value)
  File "C:\Users\User\AppData\Local\Programs\Python\Python36\lib\site-packages\google\protobuf\internal\containers.py", line 275, in extend
    new_values = [self._type_checker.CheckValue(elem) for elem in elem_seq_iter]
  File "C:\Users\User\AppData\Local\Programs\Python\Python36\lib\site-packages\google\protobuf\internal\containers.py", line 275, in <listcomp>
    new_values = [self._type_checker.CheckValue(elem) for elem in elem_seq_iter]
  File "C:\Users\User\AppData\Local\Programs\Python\Python36\lib\site-packages\google\protobuf\internal\type_checkers.py", line 109, in CheckValue
    raise TypeError(message)
TypeError: array([-163.685, 240.818, -114.05 , -518.554, 107.968, 427.184,
       157.418, -161.798, 87.102, 406.318]) has type <class 'numpy.ndarray'>, but expected one of: ((<class 'numbers.Real'>,),)
I don't know how to fix this. Should I store the features as int64 or bytes? I have no clue how to go about this since I am completely new to TensorFlow. Any help would be great! Thanks.
The _float_feature function described in the TensorFlow guide expects a scalar (either float32 or float64) as input:
def _float_feature(value):
    """Returns a float_list from a float / double."""
    return tf.train.Feature(float_list=tf.train.FloatList(value=[value]))
As you can see, the input scalar is wrapped in a list (value=[value]), which is then passed to tf.train.FloatList. tf.train.FloatList expects an iterable that yields a float in each iteration (as the list does).
If your feature is not a scalar but a vector, _float_feature can be rewritten to pass the iterable directly to tf.train.FloatList (instead of wrapping it in a list first):
def _float_array_feature(value):
    return tf.train.Feature(float_list=tf.train.FloatList(value=value))
However, if your feature has two or more dimensions, this solution no longer works. As @mmry described in his answer, flattening your feature or splitting it into several one-dimensional features would be a solution in this case. The disadvantage of these two approaches is that information about the actual shape of the feature is lost unless further effort is invested.
Another way to write an example message for a higher-dimensional array is to convert the array into a byte string and then use the _bytes_feature function described in the TensorFlow guide to build the example message. The example message is then serialized and written to a TFRecord file.
import tensorflow as tf
import numpy as np

def _bytes_feature(value):
    """Returns a bytes_list from a string / byte."""
    if isinstance(value, type(tf.constant(0))):  # if value is a tensor
        value = value.numpy()  # get the value of the tensor
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def serialize_array(array):
    array = tf.io.serialize_tensor(array)
    return array

#----------------------------------------------------------------------------------
# Create example data
array_blueprint = np.arange(4, dtype='float64').reshape(2, 2)
arrays = [array_blueprint + 1, array_blueprint + 2, array_blueprint + 3]

#----------------------------------------------------------------------------------
# Write TFRecord file
file_path = 'data.tfrecords'
with tf.io.TFRecordWriter(file_path) as writer:
    for array in arrays:
        serialized_array = serialize_array(array)
        feature = {'b_feature': _bytes_feature(serialized_array)}
        example_message = tf.train.Example(features=tf.train.Features(feature=feature))
        writer.write(example_message.SerializeToString())
The serialized example messages stored in the TFRecord file can be accessed via tf.data.TFRecordDataset. After the example messages have been parsed, the original array needs to be restored from the byte string it was converted to. This is possible via tf.io.parse_tensor.
# Read TFRecord file
def _parse_tfr_element(element):
    parse_dic = {
        'b_feature': tf.io.FixedLenFeature([], tf.string),  # note that it is tf.string, not tf.float32
    }
    example_message = tf.io.parse_single_example(element, parse_dic)

    b_feature = example_message['b_feature']  # get the byte string
    feature = tf.io.parse_tensor(b_feature, out_type=tf.float64)  # restore the 2D array from the byte string
    return feature

tfr_dataset = tf.data.TFRecordDataset('data.tfrecords')
for serialized_instance in tfr_dataset:
    print(serialized_instance)  # print serialized example messages

dataset = tfr_dataset.map(_parse_tfr_element)
for instance in dataset:
    print()
    print(instance)  # print parsed example messages with restored arrays
The tf.train.Feature class only supports lists (or 1-D arrays) when using the float_list argument. Depending on your data, you might try one of the following approaches:
Flatten the data in your array before passing it to tf.train.Feature:
def _floats_feature(value):
    return tf.train.Feature(float_list=tf.train.FloatList(value=value.reshape(-1)))
Note that you might need to add another feature to indicate how this data should be reshaped when you parse it again (and you could use an int64_list feature for that purpose).
Split the multidimensional feature into multiple 1-D features. For example, if c2d contains an N * 2 array of x- and y-coordinates, you could split that feature into separate train/coord2d/x and train/coord2d/y features, each containing the x- and y-coordinate data, respectively.
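A minimal sketch of both approaches, assuming c2d is the (2,10) float64 array from the question, where row 0 holds the x-coordinates and row 1 the y-coordinates (the extra feature names below, such as train/coord2d/shape, are illustrative, not prescribed):
import numpy as np
import tensorflow as tf

c2d = np.random.rand(2, 10)  # stand-in for the real 2D coordinates

# Approach 1: flatten the array and store its shape in a separate int64_list feature
feature_flat = {
    'train/coord2d': tf.train.Feature(float_list=tf.train.FloatList(value=c2d.reshape(-1))),
    'train/coord2d/shape': tf.train.Feature(int64_list=tf.train.Int64List(value=c2d.shape)),
}

# Approach 2: split the array into separate 1-D features, one per axis
feature_split = {
    'train/coord2d/x': tf.train.Feature(float_list=tf.train.FloatList(value=c2d[0])),
    'train/coord2d/y': tf.train.Feature(float_list=tf.train.FloatList(value=c2d[1])),
}

example = tf.train.Example(features=tf.train.Features(feature=feature_flat))
When parsing, the stored shape feature can be used to reshape the flattened floats back into a (2, 10) array.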
The documentation about TFRecord recommends using serialize_tensor:
TFRecord and tf.train.Example
Note: To stay simple, this example only uses scalar inputs. The simplest way to handle non-scalar features is to use tf.io.serialize_tensor to convert tensors to binary-strings. Strings are scalars in tensorflow. Use tf.io.parse_tensor to convert the binary-string back to a tensor.
Two lines of code do the trick for me:
tensor = tf.convert_to_tensor(array)
result = tf.io.serialize_tensor(tensor)
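For completeness, a minimal round-trip sketch (assuming a float64 array like the coordinates in the question); the restore step uses tf.io.parse_tensor, as the quoted documentation suggests:
import numpy as np
import tensorflow as tf

array = np.random.rand(3, 10)  # stand-in for the 3D coordinates

tensor = tf.convert_to_tensor(array)
serialized = tf.io.serialize_tensor(tensor)  # scalar string tensor
# serialized.numpy() is the byte string to store in a BytesList feature, as shown earlier

restored = tf.io.parse_tensor(serialized, out_type=tf.float64)  # back to shape (3, 10)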
Related
Error when trying to simulate one FMU with 4 inputs from csv files [duplicate]
I have an FMU created in GT-SUITE. I am trying to work with it in Python using the PyFMI package. My code:

from pyfmi import load_fmu
import numpy as np

model = load_fmu('AHUPIv2b.fmu')
t = np.linspace(0., 100., 100)
u = np.linspace(3.5, 4.5, 100)
v = np.linspace(900, 1000, 100)
u_traj = np.transpose(np.vstack((t, u)))
v_traj = np.transpose(np.vstack((t, v)))
input_object = (('InputVarI', 'InputVarP'), (u_traj, v_traj))
res = model.simulate(final_time=500, input=input_object, options={'ncp': 500})
res = model.simulate(final_time=10)

model.simulate takes input as one of its parameters. The documentation says:

input -- Input signal for the simulation. The input should be a 2-tuple consisting of first the names of the input variable(s) and then the data matrix.

'InputVarI' and 'InputVarP' are the input variables and u_traj, v_traj are data matrices. My code gives the error:

TypeError: tuple indices must be integers or slices, not tuple

Is the input_object created wrong? Can someone help with how to create the input tuples correctly as per the documentation?
The input object is created incorrectly. The second element of the input tuple should be a single data matrix, not two data matrices. The correct input should be:

data = np.transpose(np.vstack((t, u, v)))
input_object = (['InputVarI', 'InputVarP'], data)

See also pyFMI parameter change don't change the simulation output
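For context, here is a minimal sketch of the corrected setup end to end, reusing the FMU file name and variable names from the question (assumptions, not verified against the actual model):
import numpy as np
from pyfmi import load_fmu

model = load_fmu('AHUPIv2b.fmu')  # FMU file name taken from the question

t = np.linspace(0., 100., 100)    # time column
u = np.linspace(3.5, 4.5, 100)    # values for InputVarI
v = np.linspace(900, 1000, 100)   # values for InputVarP

# One data matrix: first column is time, remaining columns follow the order of the variable names
data = np.transpose(np.vstack((t, u, v)))
input_object = (['InputVarI', 'InputVarP'], data)

res = model.simulate(final_time=500, input=input_object, options={'ncp': 500})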
produce vector output from a dask array
I have a large dask array (labeled_arr) that is actually a labeled raster image (dtype is int64). I want to use rasterio to turn the labeled regions into polygons and combine them into a single list of polygons (or geoseries with just a geometry column). This is a straightforward task on a single array, but I'm having trouble figuring out how to tell dask that I want it to do this operation on each chunk and return something that is not an array.

Function to apply to each chunk:

def get_polys(labeled_blocks):
    polys = list(poly[0]['coordinates'][0] for poly in rasterio.features.shapes(
        labeled_blocks.astype('int32'), transform=trans))[:-1]
    # Note: rasterio.features.shapes returns an iterator, hence the conversion to a list here
    return polys

Line of code trying to get dask to do this:

test_polygons = da.blockwise(get_polys, '', labeled_arr, 'ij')
test_polygons.compute()

where labeled_arr is the input chunked dask array. Running as is returns an error saying I have to specify a dtype for da.blockwise. Specifying a dtype returns an AttributeError since the output list type does not have a dtype attribute. I discovered the meta keyword, but still have been unable to get the right syntax to turn my output into a Series or list.

I'm not attached to the above approach, but my overarching goal is: take a labeled, chunked dask dataarray (which does not all fit in memory), extract a list based on computations for each chunk, and generate a concatenated list (or pandas data object) with the outputs from all the chunks in my original chunked array.
This might work:

import dask
import dask.array as da

# we expect to see 4 blocks here
test_array = da.random.random((4, 4), chunks=(2, 2))

@dask.delayed
def my_func(block):
    # do something fancy
    return list(block)

results = dask.compute([my_func(x) for x in test_array.to_delayed().ravel()])

As you noted, the problem is that list has no dtype. A way around this would be to convert the list into a np.array, but I'm not sure if this will work with all geometry objects (it should be OK for Points, but polygons might be problematic due to varying length). Since you are not interested in forcing these geometries into an array, it's best to treat individual blocks as delayed objects, feeding them into your function one at a time (but scaled across workers/processes).
Here's the solution I ended up with initially, though it still requires a lot of RAM given the concatenate=True kwarg.

poss_list = []

def get_polys(labeled_blocks):
    polys = list(poly[0]['coordinates'][0] for poly in rasterio.features.shapes(
        labeled_blocks.astype('int32'), transform=trans))[:-1]
    poss_list.append(polys)

da.blockwise(get_polys, '', labeled_arr, 'ij',
             meta=pd.DataFrame({'c': []}), concatenate=True).compute()

If I'm interpreting correctly, this doesn't feed the chunks into my function across workers/processes though (which it seems I can get away with for now).

Update - improved answer using dask.delayed, building on the accepted answer by @SultanOrazbayev

import dask

# onedem = original_xarray_dataarray
poss_list = []

@dask.delayed
def get_bergs(labeled_blocks, pointer, chunk0, chunk1):
    # Note: I'm using this in a CRS (polar stereo) with negative y coordinates - it hasn't been tested for other CRSs

    def getpx(chunkid, chunksz):
        amin = chunkid[0] * chunksz[0][0]
        amax = amin + chunksz[0][0]
        bmin = chunkid[1] * chunksz[1][0]
        bmax = bmin + chunksz[1][0]
        return (amin, amax, bmin, bmax)

    # order of all inputs (and outputs) should be y, x when axis order is used
    chunksz = (onedem.chunks['y'], onedem.chunks['x'])
    ymini, ymaxi, xmini, xmaxi = getpx((chunk0, chunk1), chunksz)

    # use rasterio Windows and rioxarray to construct transform
    # https://rasterio.readthedocs.io/en/latest/topics/windowed-rw.html#window-transforms
    chwindow = rasterio.windows.Window(xmini, ymini, xmaxi - xmini, ymaxi - ymini)  # .from_slices([ymini, ymaxi], [xmini, xmaxi])
    trans = onedem.rio.isel_window(chwindow).rio.transform(recalc=True)

    return list(poly[0]['coordinates'][0] for poly in rasterio.features.shapes(
        labeled_blocks.astype('int32'), transform=trans))[:-1]

for __, obj in enumerate(labeled_arr.to_delayed()):
    for bl in obj:
        piece = dask.delayed(get_bergs)(bl, *bl.key)
        poss_list.append(piece)

poss_list = dask.compute(*poss_list)

# unnest the list of polygons returned by using dask to polygonize
concat_list = [item for sublist in poss_list for item in sublist if len(item) != 0]
Converting a Numpy file to TFRecord where each row contains a number, and a variable length list
This is a follow up to these two SO questions:

Select random value from row in a TF.record array, with limits on what the value can be?
Numpy array to TFrecord

The first one mentioned that tfrecords can handle variable length data using tf.VarLenFeature(). I am still having trouble figuring out how to convert my numpy array to a tfrecord file though. Here's what the first 15 rows look like:

print(my_data[0:15])
[[1446549 list([491827, 30085, 1417541, 799563, 879302, 1997973, 1373049, 1460602, 2240973, 1172992, 1186011, 147536, 1958456, 3095889, 319954, 2191582, 1113354, 302626, 1985611, 1186704, 2231212, 2642148, 386962, 3072993, 1131255, 15085, 2714264, 1363205])]
 [406529 list([900479, 660976, 1270383, 1287181])]
 [1350274 list([207721, 676951, 1311781, 2712019, 1719660, 2969693, 37187, 2284531, 1253304, 1274866, 2815382, 1513583, 1339084, 1624616, 2967307, 1702118, 585261, 426595, 1444507, 1982792])]
 [1163243 list([324383, 81509, 322474, 406941, 768416, 109067, 173425, 1478467, 573723, 1009159, 313463, 313924, 627680, 1072293, 1025620, 2325337, 2457705, 1505115, 2812547, 922812, 2152425, 2524196, 182325, 2912690, 1388620, 1484514, 1481728, 2616639, 2180765, 1544586, 1987272, 1557441, 453182, 892217, 1462085, 1892770, 1646735, 2521186, 2814552, 2983691, 3037096, 832554, 2807250, 2253333, 2595688, 2650475, 2525317, 2716592, 2573244, 2666514, 256757, 1135836, 1856208, 2605537, 1851963, 2381938, 1716883, 773842, 1877852, 2504806, 2208699, 1076111, 3058991, 3024546, 2010887, 2630915])]
 [2491621 list([877803, 546802, 2855232, 2950610, 1378514, 285536])]
 [2465968 list([626040, 1151291, 560715, 1153787, 893941, 3094902, 1239392, 2081948, 1677321, 1193880, 2326117, 2805797, 1715983, 1213177, 1476995, 2620772, 1242804, 2942330, 588938, 2338375, 2805378, 169015, 1766962, 562485, 1210404, 772334, 415148, 1293624, 527245, 587088, 665484, 449673, 315509])]
 [886255 list([694445, 796232, 1151072, 2312348, 1773175, 1898319, 1696093, 91310, 719379, 2080422, 1352695, 2364846, 845154, 2476191, 537059, 1216854, 1529449, 284855, 1215830, 3041789, 1625939])]
 [451113 list([2805707, 2727647, 742706, 1727139, 2585759, 822759, 1099617])]
 [1529600 list([1755946, 2110553, 1056110, 426876, 2448684, 396996, 1498300, 756831, 2181288, 1159493])]
 [596550 list([1610895, 2579387, 3081786, 2000733, 2142308])]
 [1631548 list([1576412, 849908, 2705650, 2291675, 751733, 1911747, 1496204])]
 [3001784 list([327334, 1197547, 2515733])]
 [308747 list([8344, 80684, 996504, 2250076, 1905654, 863587, 2235560, 2676079, 1826, 685487, 1481871, 588465, 1126662, 2458841, 2481927])]
 [731288 list([2793620, 1115724, 1406934])]
 [1523219 list([12825, 1128776, 1761080, 1486798, 2689369, 1040645, 3012606])]]

Might be a little tricky to read, but here's an instance of a smaller row:

[406529 list([900479, 660976, 1270383, 1287181])]

Each row contains a number and a list of numbers, and that list varies in length. I am having trouble figuring out how exactly I can convert this to a tfrecord file. Any help/hints would be greatly appreciated.
I assume you want to add a numbers feature and a list feature respectively here.

import tensorflow as tf

writer = tf.python_io.TFRecordWriter('test.tfrecords')
for index in range(my_data.shape[0]):
    example = tf.train.Example(features=tf.train.Features(feature={
        'num_value': tf.train.Feature(int64_list=tf.train.Int64List(value=[my_data[index][0]])),
        'list_value': tf.train.Feature(int64_list=tf.train.Int64List(value=my_data[index][1]))
    }))
    writer.write(example.SerializeToString())
writer.close()

# read data from tfrecords
record_iterator = tf.python_io.tf_record_iterator('test.tfrecords')
for _ in range(2):
    seralized_img_example = next(record_iterator)
    example = tf.train.Example()
    example.ParseFromString(seralized_img_example)
    num_value = example.features.feature['num_value'].int64_list.value[0]
    list_value = example.features.feature['list_value'].int64_list.value
    print(num_value, list_value)

# prints:
# 1446549 [491827, 30085, 1417541, 799563, 879302, 1997973, 1373049, 1460602, 2240973, 1172992, 1186011, 147536, 1958456, 3095889, 319954, 2191582, 1113354, 302626, 1985611, 1186704, 2231212, 2642148, 386962, 3072993, 1131255, 15085, 2714264, 1363205]
# 406529 [900479, 660976, 1270383, 1287181]
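If you later want to read these records back through the tf.data API instead of the record iterator, a minimal sketch (assuming the 'num_value' and 'list_value' feature names above) can use tf.io.VarLenFeature for the variable-length list:
import tensorflow as tf

def _parse(serialized):
    features = {
        'num_value': tf.io.FixedLenFeature([], tf.int64),
        'list_value': tf.io.VarLenFeature(tf.int64),  # variable-length integer list
    }
    parsed = tf.io.parse_single_example(serialized, features)
    # VarLenFeature yields a SparseTensor; convert it to a dense 1-D tensor
    return parsed['num_value'], tf.sparse.to_dense(parsed['list_value'])

dataset = tf.data.TFRecordDataset('test.tfrecords').map(_parse)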
Convert a tensorflow tf.data.Dataset FlatMapDataset to TensorSliceDataset
I want to pass a list of tf.Strings to the .map(_parse_function) function.

def _parse_function(self, img_path):
    img_str = tf.read_file(img_path)
    img_decode = tf.image.decode_jpeg(img_str, channels=3)
    img_decode = tf.divide(tf.cast(img_decode, tf.float32), 255)
    return img_decode

When the tf.data.Dataset is of type TensorSliceDataset,

dataset_from_slices = tf.data.Dataset.from_tensor_slices((tensor_with_filenames))

I can simply do dataset_from_slices.map(_parse_function), which works.

However,

dataset_from_generator = tf.data.Dataset.from_generator(...)

returns a Dataset which is an instance of FlatMapDataset type, and dataset_from_generator.map(_parse_function) gives the following error:

InvalidArgumentError: Input filename tensor must be scalar, but had shape: [32]

If I change the first line to:

img_str = tf.read_file(img_path[0])

that also works but then I only get the first image, which is not what I am looking for. Any suggestions?
It sounds like the elements of your dataset_from_generator are batched. The simplest remedy is to use tf.contrib.data.unbatch() to convert them back into individual elements:

# Each element is a vector of strings.
dataset_from_generator = tf.data.Dataset.from_generator(...)

# Converts each vector of strings into multiple individual elements.
dataset = dataset_from_generator.apply(tf.contrib.data.unbatch())

dataset = dataset.map(_parse_function)
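Note that tf.contrib was removed in TensorFlow 2.x; the same idea is available there as the Dataset.unbatch() method. A minimal sketch, assuming the same batched dataset_from_generator as above:

# TensorFlow 2.x equivalent of tf.contrib.data.unbatch()
dataset = dataset_from_generator.unbatch()
dataset = dataset.map(_parse_function)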
How to generate/read sparse sequence labels for CTC loss within Tensorflow?
From a list of word images with their transcriptions, I am trying to create and read sparse sequence labels (for tf.nn.ctc_loss) using a tf.train.slice_input_producer, avoiding

- serializing pre-packaged training data to disk in TFRecord format,
- the apparent limitations of tf.py_func,
- any unnecessary or premature padding, and
- reading the entire data set to RAM.

The main issue seems to be converting a string to the sequence of labels (a SparseTensor) needed for tf.nn.ctc_loss.

For example, with the character set in the (ordered) range [A-Z], I'd want to convert the text label string "BAD" to the sequence label class list [1,0,3].

Each example image I want to read contains the text as part of the filename, so it's straightforward to extract and do the conversion in straight up python. (If there's a way to do it within TensorFlow computations, I haven't found it.)

Several previous questions glance at these issues, but I haven't been able to integrate them successfully. For example,

- Tensorflow read images with labels shows a straightforward framework with discrete, categorical labels, which I've begun with as a model.
- How to load sparse data with TensorFlow? nicely explains an approach for loading sparse data, but assumes pre-packaging tf.train.Examples.

Is there a way to integrate these approaches?

Another example (SO question #38012743) shows how I might delay the conversion from string to list until after dequeuing the filename for decoding, but it relies on tf.py_func, which has caveats. (Should I worry about them?)

I recognize that "SparseTensors don't play well with queues" (per the tf docs), so it might be necessary to do some voodoo on the result (serialization?) before batching, or even rework where the computation happens; I'm open to that.

Following MarvMind's outline, here is a basic framework with the computations I want (iterate over lines containing example filenames, extract each label string and convert to a sequence), but I have not successfully determined the "Tensorflow" way to do it.

Thank you for the right "tweak", a more appropriate strategy for my goals, or an indication that tf.py_func won't wreck training efficiency or something else downstream (e.g., loading trained models for future use).

EDIT (+7 hours) I found the missing ops to patch things up. While I still need to verify this connects with CTC_Loss downstream, I have checked that the edited version below correctly batches and reads in the images and sparse tensors.

out_charset = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

def input_pipeline(data_filename):
    filenames, seq_labels = _get_image_filenames_labels(data_filename)
    data_queue = tf.train.slice_input_producer([filenames, seq_labels])
    image, label = _read_data_format(data_queue)
    image, label = tf.train.batch([image, label], batch_size=2, dynamic_pad=True)
    label = tf.deserialize_many_sparse(label, tf.int32)
    return image, label

def _get_image_filenames_labels(data_filename):
    filenames = []
    labels = []
    with open(data_filename) as f:
        for line in f:
            # Carve out the ground truth string and file path from
            # lines formatted like:
            # ./241/7/158_NETWORK_51375.jpg 51375
            filename = line.split(' ', 1)[0][2:]  # split off "./" and number
            # Extract label string embedded within image filename
            # between underscores, e.g. NETWORK
            text = os.path.basename(filename).split('_', 2)[1]
            # Transform string text to sequence of indices using charset, e.g.,
            # NETWORK -> [13, 4, 19, 22, 14, 17, 10]
            indices = [[i] for i in range(0, len(text))]
            values = [out_charset.index(c) for c in list(text)]
            shape = [len(text)]
            label = tf.SparseTensorValue(indices, values, shape)
            label = tf.convert_to_tensor_or_sparse_tensor(label)
            label = tf.serialize_sparse(label)  # needed for batching
            # Add data to lists for conversion
            filenames.append(filename)
            labels.append(label)
    filenames = tf.convert_to_tensor(filenames)
    labels = tf.convert_to_tensor_or_sparse_tensor(labels)
    return filenames, labels

def _read_data_format(data_queue):
    label = data_queue[1]
    raw_image = tf.read_file(data_queue[0])
    image = tf.image.decode_jpeg(raw_image, channels=1)
    return image, label
The key ideas seem to be creating a SparseTensorValue from the data wanted, passing it through tf.convert_to_tensor_or_sparse_tensor and then (if you want to batch the data) serializing it with tf.serialize_sparse. After batching, you can restore the values with tf.deserialize_many_sparse.

Here's the outline. Create the sparse values, convert to tensor, and serialize:

indices = [[i] for i in range(0, len(text))]
values = [out_charset.index(c) for c in list(text)]
shape = [len(text)]
label = tf.SparseTensorValue(indices, values, shape)
label = tf.convert_to_tensor_or_sparse_tensor(label)
label = tf.serialize_sparse(label)  # needed for batching

Then, you can do the batching and deserialize:

image, label = tf.train.batch([image, label], dynamic_pad=True)
label = tf.deserialize_many_sparse(label, tf.int32)