Convert a tensorflow tf.data.Dataset FlatMapDataset to TensorSliceDataset - python

I want to pass a list of tf.Strings to the .map(_parse_function) function.
def _parse_function(self, img_path):
    img_str = tf.read_file(img_path)
    img_decode = tf.image.decode_jpeg(img_str, channels=3)
    img_decode = tf.divide(tf.cast(img_decode, tf.float32), 255)
    return img_decode
When the tf.data.Dataset is of type TensorSliceDataset,
dataset_from_slices = tf.data.Dataset.from_tensor_slices((tensor_with_filenames))
I can simply do
dataset_from_slices.map(_parse_function), which works.
However, dataset_from_generator = tf.data.Dataset.from_generator(...) returns a Dataset which is an instance of FlatMapDataset type and dataset_from_generator.map(_parse_function) gives the following error:
InvalidArgumentError: Input filename tensor must be scalar, but had shape: [32]
If I change the first line to:
img_str = tf.read_file(img_path[0])
that also works but then I only get the first image, which is not what I am looking for. Any suggestions?

It sounds like the elements of your dataset_from_generator are batched. The simplest remedy is to use tf.contrib.data.unbatch() to convert them back into individual elements:
# Each element is a vector of strings.
dataset_from_generator = tf.data.Dataset.from_generator(...)
# Converts each vector of strings into multiple individual elements.
dataset = dataset_from_generator.apply(tf.contrib.data.unbatch())
dataset = dataset.map(_parse_function)
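In TensorFlow 2.x, where tf.contrib no longer exists, the same idea can be expressed with the Dataset.unbatch() method. The following is only a minimal sketch under that assumption: the generator batched_paths and its file names are hypothetical stand-ins for the real generator, and the output_signature argument assumes TF 2.4 or later.

```python
import tensorflow as tf

# Hypothetical generator: each element is a batch (vector) of file names.
def batched_paths():
    yield [b"a.jpg", b"b.jpg"]
    yield [b"c.jpg", b"d.jpg"]

batched = tf.data.Dataset.from_generator(
    batched_paths,
    output_signature=tf.TensorSpec(shape=(2,), dtype=tf.string))

# unbatch() splits every vector into scalar elements, so a map function
# written for a single path (such as _parse_function) sees one string at a time.
flat = batched.unbatch()

paths = [p.numpy() for p in flat]
print(paths)
```

After unbatch(), calling flat.map(...) with a function that reads a single file should no longer trigger the "must be scalar" error, since every element is a scalar string.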

Related

Tensorflow transform each element of a string tensor

I have a tensor of strings. Some example strings are as follows.
com.abc.display,com.abc.backend,com.xyz.forte,blah
com.pqr,npr.goog
I want to do some preprocessing that splits each CSV string into its parts, then splits each part at the dots, and then creates multiple strings where one string is a prefix of another. Also, all blahs have to be dropped.
For example, given the first string com.abc.display,com.abc.backend,com.xyz.forte, it is transformed into an array/list of the following strings.
['com', 'com.abc', 'com.abc.display', 'com.abc.backend', 'com.xyz', 'com.xyz.forte']
The resulting list has no duplicates (that is why the prefixed strings for com.abc.backend didn't show up as those were already included - com and com.abc).
I wrote the following python function that would do the above given a single CSV string example.
def expand_meta(meta):
    expanded_subparts = []
    meta_parts = set([x for x in meta.split(',') if x != 'blah'])
    for part in meta_parts:
        subparts = part.split('.')
        for i in range(len(subparts)+1):
            expanded = '.'.join(subparts[:i])
            if expanded:
                expanded_subparts.append(expanded)
    return list(set(expanded_subparts))
Calling this method on the first example
expand_meta('com.abc.display,com.abc.backend,com.xyz.forte,blah')
returns
['com.abc.display',
'com.abc',
'com.xyz',
'com.xyz.forte',
'com.abc.backend',
'com']
I know that tensorflow has this map_fn method. I was hoping to use that to transform each element of the tensor. However, I am getting the following error.
  File "mypreprocess.py", line 152, in expand_meta
    meta_parts = set([x for x in meta.split(',') if x != 'blah'])
AttributeError: 'Tensor' object has no attribute 'split'
So, it seems like I can't use a regular python function with map_fn since it expects the elements to be tensors. How can I do what I intend to do here?
(My Tensorflow version is 1.11.0)
I think this does what you want:
import tensorflow as tf

# Function to process a single string
def make_splits(s):
    s = tf.convert_to_tensor(s)
    # Split by comma
    split1 = tf.strings.split([s], ',').values
    # Remove blahs
    split1 = tf.boolean_mask(split1, tf.not_equal(split1, 'blah'))
    # Split by period
    split2 = tf.string_split(split1, '.')
    # Get dense split tensor
    split2_dense = tf.sparse.to_dense(split2, default_value='')
    # Accumulated concatenations
    concats = tf.scan(lambda a, b: tf.string_join([a, b], '.'),
                      tf.transpose(split2_dense))
    # Get relevant concatenations
    out = tf.gather_nd(tf.transpose(concats), split2.indices)
    # Remove duplicates
    return tf.unique(out)[0]

# Test
with tf.Graph().as_default(), tf.Session() as sess:
    # Individual examples
    print(make_splits('com.abc.display,com.abc.backend,com.xyz.forte,blah').eval())
    # [b'com' b'com.abc' b'com.abc.display' b'com.abc.backend' b'com.xyz'
    #  b'com.xyz.forte']
    print(make_splits('com.pqr,npr.goog').eval())
    # [b'com' b'com.pqr' b'npr' b'npr.goog']
    # Apply to multiple strings with a loop
    data = tf.constant([
        'com.abc.display,com.abc.backend,com.xyz.forte,blah',
        'com.pqr,npr.goog'])
    ta = tf.TensorArray(size=data.shape[0], dtype=tf.string,
                        infer_shape=False, element_shape=[None])
    _, ta = tf.while_loop(
        lambda i, ta: i < tf.shape(data)[0],
        lambda i, ta: (i + 1, ta.write(i, make_splits(data[i]))),
        [0, ta])
    out = ta.concat()
    print(out.eval())
    # [b'com' b'com.abc' b'com.abc.display' b'com.abc.backend' b'com.xyz'
    #  b'com.xyz.forte' b'com' b'com.pqr' b'npr' b'npr.goog']
I'm not sure if you want the total results concatenated like that, or maybe you want to apply tf.unique to the global result, but in any case the idea is the same.
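To check the graph version against plain Python, it can help to have a reference that returns the prefixes in first-seen order, since the question's expand_meta goes through set() and so returns them in arbitrary order. The sketch below is one possible way to do that; expand_meta_ordered and its drop parameter are hypothetical names, not part of the question or the answer.

```python
def expand_meta_ordered(meta, drop=("blah",)):
    """Return all dotted prefixes of each CSV part, de-duplicated in first-seen order."""
    seen = {}  # dicts keep insertion order, so this doubles as an ordered set
    for part in meta.split(","):
        if part in drop:
            continue
        subparts = part.split(".")
        for i in range(1, len(subparts) + 1):
            prefix = ".".join(subparts[:i])
            if prefix:
                seen.setdefault(prefix, None)
    return list(seen)


first = expand_meta_ordered("com.abc.display,com.abc.backend,com.xyz.forte,blah")
second = expand_meta_ordered("com.pqr,npr.goog")
print(first)
print(second)
```

The two results should list the same strings, in the same order, as the byte strings printed by make_splits for the individual examples.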

How can I filter tf.data.Dataset by specific values?

I create a dataset by reading TFRecords and map the values. I want to filter the dataset for specific values, but since the result is a dict of tensors, I am not able to get the actual value of a tensor or to check it with tf.cond() / tf.equal. How can I do that?
def mapping_func(serialized_example):
    feature = { 'label': tf.FixedLenFeature([1], tf.string) }
    features = tf.parse_single_example(serialized_example, features=feature)
    return features

def filter_func(features):
    # this doesn't work
    #result = features['label'] == 'some_label_value'
    # neither this
    result = tf.reshape(tf.equal(features['label'], 'some_label_value'), [])
    return result

def main():
    file_names = ["/var/data/file1.tfrecord", "/var/data/file2.tfrecord"]
    dataset = tf.contrib.data.TFRecordDataset(file_names)
    dataset = dataset.map(mapping_func)
    dataset = dataset.shuffle(buffer_size=10000)
    dataset = dataset.filter(filter_func)
    dataset = dataset.repeat()
    iterator = dataset.make_one_shot_iterator()
    sample = iterator.get_next()
I am answering my own question. I found the issue!
What I needed to do is tf.unstack() the label like this:
label = tf.unstack(features['label'])
label = label[0]
before I give it to tf.equal():
result = tf.reshape(tf.equal(label, 'some_label_value'), [])
I suppose the problem was that the label is defined as an array with one string element, tf.FixedLenFeature([1], tf.string). To get that single element I had to unpack it (which creates a list) and then take the element at index 0. Correct me if I'm wrong.
I think you don't need to make label a 1-dimensional array in the first place.
with:
feature = {'label': tf.FixedLenFeature((), tf.string)}
you won't need to unstack the label in your filter_func
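A minimal sketch of that suggestion, written against the TF 2.x API (tf.io.FixedLenFeature, eager iteration) instead of the TF 1.x calls in the question. The helper make_example and the labels b"cat" and b"dog" are made up for illustration, and the records are built in memory so that no file is needed.

```python
import tensorflow as tf

# Hypothetical helper that builds a serialized Example with one string label.
def make_example(label):
    feature = {"label": tf.train.Feature(bytes_list=tf.train.BytesList(value=[label]))}
    return tf.train.Example(features=tf.train.Features(feature=feature)).SerializeToString()

serialized = [make_example(b"cat"), make_example(b"dog"), make_example(b"cat")]
dataset = tf.data.Dataset.from_tensor_slices(serialized)

def mapping_func(serialized_example):
    # Shape () makes the parsed label a scalar string, so nothing needs unstacking.
    feature = {"label": tf.io.FixedLenFeature((), tf.string)}
    return tf.io.parse_single_example(serialized_example, feature)

def filter_func(features):
    # Comparing two scalars yields the scalar boolean that filter() expects.
    return tf.equal(features["label"], "cat")

filtered = dataset.map(mapping_func).filter(filter_func)
labels = [f["label"].numpy() for f in filtered]
print(labels)
```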
Reading and filtering a dataset is easy, and there is no need to unstack anything.
To read the dataset:
print(my_dataset, '\n\n')
## let us print the first 3 records
for record in my_dataset.take(3):
    ## the output below could be large in case of an image
    print(record)
    ## let us print a specific key
    print(record['key2'])
To filter is equally simple:
my_filtereddataset = my_dataset.filter(_filtcond1)
where you define _filtcond1 however you want. Let us say there is a 'true' 'false' boolean flag in your dataset, then:
@tf.function
def _filtcond1(x):
    return x['key_bool'] == 1
or even a lambda function:
my_filtereddataset = my_dataset.filter(lambda x: x['key_int']>13)
If you are reading a dataset which you haven't created, or you are unaware of the keys (as seems to be the OP's case), you can use this to get an idea of the keys and structure first:
import json
from google.protobuf.json_format import MessageToJson

for raw_record in noidea_dataset.take(1):
    example = tf.train.Example()
    example.ParseFromString(raw_record.numpy())
    ## print(example)  ## for an image this will be very long
    m = json.loads(MessageToJson(example))
    print(m['features']['feature'].keys())
Now you can proceed with the filtering.
You should try the apply function described in the tf.data.TFRecordDataset section of the TensorFlow documentation.
Otherwise, read the article "TFRecords for humans" to get a better understanding of TFRecords.
Most likely, though, you can neither access nor modify a TFRecord in place; there is a request on GitHub about this topic ("TFRecords request").
My advice is to keep things as simple as you can. Keep in mind that you are working with graphs and sessions.
In any case, if everything else fails, run the part of the code that does not work inside a TensorFlow session, in the simplest form you can. All of these operations probably need to run while the tf.Session is active.

how to store numpy arrays as tfrecord?

I am trying to create a dataset in tfrecord format from numpy arrays. I am trying to store 2d and 3d coordinates.
2d coordinates are numpy array of shape (2,10) of type float64
3d coordinates are numpy array of shape (3,10) of type float64
this is my code:
def _floats_feature(value):
    return tf.train.Feature(float_list=tf.train.FloatList(value=value))

train_filename = 'train.tfrecords'  # address to save the TFRecords file
writer = tf.python_io.TFRecordWriter(train_filename)
for c in range(0,1000):
    #get 2d and 3d coordinates and save in c2d and c3d
    feature = {'train/coord2d': _floats_feature(c2d),
               'train/coord3d': _floats_feature(c3d)}
    sample = tf.train.Example(features=tf.train.Features(feature=feature))
    writer.write(sample.SerializeToString())
writer.close()
when i run this i get the error:
    feature = {'train/coord2d': _floats_feature(c2d),
  File "genData.py", line 19, in _floats_feature
    return tf.train.Feature(float_list=tf.train.FloatList(value=value))
  File "C:\Users\User\AppData\Local\Programs\Python\Python36\lib\site-packages\google\protobuf\internal\python_message.py", line 510, in __init__
    copy.extend(field_value)
  File "C:\Users\User\AppData\Local\Programs\Python\Python36\lib\site-packages\google\protobuf\internal\containers.py", line 275, in extend
    new_values = [self._type_checker.CheckValue(elem) for elem in elem_seq_iter]
  File "C:\Users\User\AppData\Local\Programs\Python\Python36\lib\site-packages\google\protobuf\internal\containers.py", line 275, in <listcomp>
    new_values = [self._type_checker.CheckValue(elem) for elem in elem_seq_iter]
  File "C:\Users\User\AppData\Local\Programs\Python\Python36\lib\site-packages\google\protobuf\internal\type_checkers.py", line 109, in CheckValue
    raise TypeError(message)
TypeError: array([-163.685, 240.818, -114.05 , -518.554, 107.968, 427.184,
       157.418, -161.798, 87.102, 406.318]) has type <class 'numpy.ndarray'>, but expected one of: ((<class 'numbers.Real'>,),)
I don't know how to fix this. Should I store the features as int64 or bytes? I have no clue how to go about this since I am completely new to TensorFlow. Any help would be great! Thanks.
The function _float_feature described in the TensorFlow guide expects a scalar (either float32 or float64) as input.
def _float_feature(value):
    """Returns a float_list from a float / double."""
    return tf.train.Feature(float_list=tf.train.FloatList(value=[value]))
As you can see, the input scalar is wrapped in a list (value=[value]), which is then passed to tf.train.FloatList. tf.train.FloatList expects an iterable that yields one float per iteration (as a list does).
If your feature is not a scalar but a vector, _float_feature can be rewritten to pass the iterable directly to tf.train.FloatList (instead of putting it into a list first).
def _float_array_feature(value):
    return tf.train.Feature(float_list=tf.train.FloatList(value=value))
However, if your feature has two or more dimensions, this solution no longer works. As @mmry described in his answer, flattening the feature or splitting it into several one-dimensional features would be a solution in this case. The disadvantage of both approaches is that the information about the actual shape of the feature is lost unless extra effort is invested, for example by storing the shape separately.
Another possibility to write an example message for a higher dimensional array is to convert the array into a byte string and then use the _bytes_feature function described in the Tensorflow-Guide to write the example message for it. The example message is then serialized and written into a TFRecord file.
import tensorflow as tf
import numpy as np

def _bytes_feature(value):
    """Returns a bytes_list from a string / byte."""
    if isinstance(value, type(tf.constant(0))):  # if value is a tensor
        value = value.numpy()  # get value of tensor
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def serialize_array(array):
    array = tf.io.serialize_tensor(array)
    return array

#----------------------------------------------------------------------------------
# Create example data
array_blueprint = np.arange(4, dtype='float64').reshape(2,2)
arrays = [array_blueprint+1, array_blueprint+2, array_blueprint+3]

#----------------------------------------------------------------------------------
# Write TFRecord file
file_path = 'data.tfrecords'
with tf.io.TFRecordWriter(file_path) as writer:
    for array in arrays:
        serialized_array = serialize_array(array)
        feature = {'b_feature': _bytes_feature(serialized_array)}
        example_message = tf.train.Example(features=tf.train.Features(feature=feature))
        writer.write(example_message.SerializeToString())
The serialized example messages stored in the TFRecord file can be accessed via tf.data.TFRecordDataset. After the example messages have been parsed, the original array needs to be restored from the byte string it was converted to. This is possible via tf.io.parse_tensor.
# Read TFRecord file
def _parse_tfr_element(element):
    parse_dic = {
        'b_feature': tf.io.FixedLenFeature([], tf.string),  # Note that it is tf.string, not tf.float32
    }
    example_message = tf.io.parse_single_example(element, parse_dic)
    b_feature = example_message['b_feature']  # get byte string
    feature = tf.io.parse_tensor(b_feature, out_type=tf.float64)  # restore 2D array from byte string
    return feature

tfr_dataset = tf.data.TFRecordDataset('data.tfrecords')
for serialized_instance in tfr_dataset:
    print(serialized_instance)  # print serialized example messages

dataset = tfr_dataset.map(_parse_tfr_element)
for instance in dataset:
    print()
    print(instance)  # print parsed example messages with restored arrays
The tf.train.Feature class only supports lists (or 1-D arrays) when using the float_list argument. Depending on your data, you might try one of the following approaches:
Flatten the data in your array before passing it to tf.train.Feature:
def _floats_feature(value):
    return tf.train.Feature(float_list=tf.train.FloatList(value=value.reshape(-1)))
Note that you might need to add another feature to indicate how this data should be reshaped when you parse it again (and you could use an int64_list feature for that purpose).
Split the multidimensional feature into multiple 1-D features. For example, if c2d contains an N * 2 array of x- and y-coordinates, you could split that feature into separate train/coord2d/x and train/coord2d/y features, each containing the x- and y-coordinate data, respectively.
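The first approach (flatten, plus a separate int64_list feature for the shape) could look roughly like the sketch below. The feature keys "values" and "shape" and the helper names array_to_example and example_to_array are assumptions made for illustration. Note that float_list stores 32-bit floats, so float64 input loses precision on the way through.

```python
import numpy as np
import tensorflow as tf

def array_to_example(arr):
    """Store a flattened array together with its shape (hypothetical helper)."""
    feature = {
        "values": tf.train.Feature(
            float_list=tf.train.FloatList(value=arr.reshape(-1).tolist())),
        "shape": tf.train.Feature(
            int64_list=tf.train.Int64List(value=list(arr.shape))),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))

def example_to_array(serialized):
    """Rebuild the array from the flat values and the stored shape."""
    example = tf.train.Example.FromString(serialized)
    values = list(example.features.feature["values"].float_list.value)
    shape = list(example.features.feature["shape"].int64_list.value)
    return np.array(values, dtype=np.float32).reshape(shape)

# Stand-in for the (2, 10) coordinate array from the question.
c2d = np.arange(20, dtype=np.float64).reshape(2, 10)
restored = example_to_array(array_to_example(c2d).SerializeToString())
print(restored.shape)
```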
The TFRecord documentation recommends using tf.io.serialize_tensor:
TFRecord and tf.train.Example
Note: To stay simple, this example only uses scalar inputs. The simplest way to handle non-scalar features is to use tf.io.serialize_tensor to convert tensors to binary-strings. Strings are scalars in tensorflow. Use tf.io.parse_tensor to convert the binary-string back to a tensor.
Two lines of code do the trick for me:
tensor = tf.convert_to_tensor(array)
result = tf.io.serialize_tensor(tensor)

Tensorflow Split Using Feed Dict Input Dimension

I'm trying to tf.split a tensor based on the dimension of an input fed in using feed_dict (dimension of input changes with each batch). Currently I keep getting an error saying that a tensor cannot be split with a "Dimension". Is there a way to get the value of the dimension and split using it?
Thanks!
input_d = tf.placeholder(tf.int32, [None, None], name="input_d")
# toy feed dict
feed = {
    input_d: [[20,30,40,50,60],[2,3,4,5,-1]] # document
}
W_embeddings = tf.get_variable(shape=[vocab_size, embedding_dim], \
                               initializer=tf.random_uniform_initializer(-0.01, 0.01),\
                               name="W_embeddings")
document_embedding = tf.gather(W_embeddings, input_d)
timesteps_d = document_embedding.get_shape()[1]
doc_input = tf.split(1, timesteps_d, document_embedding)
tf.split takes a Python integer for the num_split argument. However, document_embedding.get_shape() returns a TensorShape, and document_embedding.get_shape()[1] gives a Dimension instance, hence the error saying it "can't split with a Dimension".
Try timesteps_d = document_embedding.get_shape().as_list()[1]. This statement should give you a Python integer, provided the dimension is statically known; for a placeholder dimension declared as None it yields None.
See the documentation for tf.split and tf.Tensor.get_shape for details.
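As a small illustration of the difference between a static shape and a Python integer, here is a sketch written against TF 2.x eager mode, which is an assumption since the question uses placeholders. The shapes are made up. The argument order tf.split(value, num_or_size_splits, axis) is the one used since TensorFlow 1.0 and differs from the older order in the question.

```python
import tensorflow as tf

# as_list() turns a TensorShape into plain Python values; unknown dims become None.
static_shape = tf.TensorShape([None, 5, 8])
dims = static_shape.as_list()
timesteps = dims[1]  # a plain Python int, because this dimension is known

# Made-up tensor standing in for document_embedding: (batch, timesteps, embedding_dim).
document_embedding = tf.zeros([2, timesteps, 8])
doc_input = tf.split(document_embedding, num_or_size_splits=timesteps, axis=1)
print(len(doc_input), doc_input[0].shape)
```

If the time dimension is only known at run time (None in the static shape), a Python integer is not available and tf.split with a fixed count cannot be used this way.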

List and Numpy array in Python

I was trying to one-hot encode some data.
data is a list of length vocabulary_size = 17005207.
To one-hot encode it, I build a list of inputs with num_labels = 100.
The following code:
inputs = []
for i in range(vocabulary_size):
    inputs.append(np.arange(num_labels) == data[i]).astype(np.float32)
throws an error:
AttributeError: 'NoneType' object has no attribute 'astype'
I also tried passing dtype=np.float32 inside the append call, but that failed as well.
When I try this :
inputs = []
for i in range(vocabulary_size):
    inputs.append(np.arange(num_labels) == data[i])
inputs = np.array(inputs,dtype=np.float32)
I get the correct answer: a one-hot encoded input sequence of shape vocabulary_size x num_labels.
Is there an alternative one-line solution without using numpy?
Solved: Can it be done directly using a numpy array (inputs) with the list (data)?
Info about data : data = np.ndarray(len(words), dtype=np.int32)
Reformat function:
def reformat(data):
    num_labels = vocabulary_size
    print (type(data))
    data = (np.arange(num_labels) == data[:,None]).astype(np.int32)
    return data
print (data,len(data))
return data
New question: The shape of data is (vocabulary_size,). How can I convert data, using ravel or reshape, to shape (1, vocabulary_size)?
Not sure whether I've understood correctly what you're asking for, but if what you want is a one-liner, you could transform your already working code into this:
inputs = np.array([np.arange(num_labels) == data[i] for i in range(vocabulary_size)], dtype=np.float32)
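For comparison, the broadcasting form from the question's reformat function and an np.eye lookup give the same result without a Python-level loop. The sketch below uses a tiny made-up data array and num_labels = 3 in place of the real values, and also shows one way to get the (1, n) shape asked about in the follow-up question.

```python
import numpy as np

# Made-up stand-ins for the question's data and num_labels.
data = np.array([0, 2, 1, 2], dtype=np.int32)
num_labels = 3

# Broadcasting: compare a (num_labels,) row against a (len(data), 1) column.
one_hot = (np.arange(num_labels) == data[:, None]).astype(np.float32)

# Alternative: pick rows of the identity matrix by label.
one_hot_eye = np.eye(num_labels, dtype=np.float32)[data]

# Follow-up question: turn shape (n,) into shape (1, n).
row = data.reshape(1, -1)

print(one_hot)
print(row.shape)
```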
