TensorFlow - Extract string from Tensor - Python

I'm trying to follow the "Load using tf.data" part of this tutorial. In the tutorial they can get away with working only with string Tensors, but I need to extract the string representation of the filename, since I have to look up extra data from a dictionary. I can't seem to extract the string part of a Tensor. I was fairly sure the .name attribute of a Tensor would return the string, but I keep getting an error saying KeyError: 'strided_slice_1:0', so somehow the slicing is doing something weird?
I'm loading the dataset using:
dataset_list = tf.data.Dataset.list_files(str(DATASET_DIR / "data/*"))
and then process it using:
def process(t):
    return dataset.process_image_path(t, param_data, param_min_max)

dataset_labeled = dataset_list.map(process, num_parallel_calls=AUTOTUNE)
where param_data and param_min_max are two dictionaries I've loaded that contain the extra data needed to construct the label.
These are the three functions that I use to process the data Tensors (from my dataset.py):
import os
import numpy as np
import tensorflow as tf

def process_image_path(image_path, param_data_file, param_max_min_file):
    label = path_to_label(image_path, param_data_file, param_max_min_file)
    img = tf.io.read_file(image_path)
    img = decode_img(img)
    return (img, label)

def decode_img(img):
    """Converts an image to a 3D uint8 tensor"""
    img = tf.image.decode_jpeg(img, channels=3)
    img = tf.image.convert_image_dtype(img, tf.float32)
    return img

def path_to_label(image_path, param_data_file, param_max_min_file):
    """Returns the NORMALIZED label (set of parameter values) of an image."""
    parts = tf.strings.split(image_path, os.path.sep)
    filename = parts[-1]  # Extract filename with extension
    filename = tf.strings.split(filename, ".")[0].name  # Extract filename
    param_data = param_data_file[filename]  # ERROR! .name above doesn't seem to return just the filename
    P = len(param_max_min_file)
    label = np.zeros(P)
    i = 0
    while i < P:
        param = param_max_min_file[i]
        umin = param["user_min"]
        umax = param["user_max"]
        sub_index = param["sub_index"]
        identifier = param["identifier"]
        node = param["node_name"]
        value = param_data[node][identifier]
        label[i] = _normalize(value[sub_index])
        i += 1
    return label
I have verified that filename = tf.strings.split(filename, ".")[0] in path_to_label() does return the correct Tensor, but I need it as a string. The whole thing is also proving difficult to debug, as I can't inspect attributes while debugging (I get AttributeError: Tensor.name is meaningless when eager execution is enabled.).

The name field is a name for the tensor itself, not the content of the tensor.
To do a regular Python dictionary lookup, wrap your parsing function in tf.py_func.
import tensorflow as tf
tf.enable_eager_execution()

d = {"a": 1, "b": 3, "c": 10}
dataset = tf.data.Dataset.from_tensor_slices(["a", "b", "c"])

def parse(s):
    # tf.py_func hands parse a numpy bytes object in Python 3,
    # so decode it before using it as a dictionary key
    return s, d[s.decode()]

dataset = dataset.map(lambda s: tf.py_func(parse, [s], (tf.string, tf.int64)))

for element in dataset:
    print(element[1].numpy())  # prints 1, 3, 10
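If you are on TF 2.x, where tf.enable_eager_execution() is gone and tf.py_func is deprecated, a minimal sketch of the same lookup with its successor tf.py_function (same toy dictionary, an assumption for illustration) looks like this:

import tensorflow as tf

d = {"a": 1, "b": 3, "c": 10}
dataset = tf.data.Dataset.from_tensor_slices(["a", "b", "c"])

def parse(s):
    # tf.py_function passes an eager tensor, so recover the Python string first
    key = s.numpy().decode("utf-8")
    return s, d[key]

dataset = dataset.map(lambda s: tf.py_function(parse, [s], (tf.string, tf.int64)))

for element in dataset:
    print(element[1].numpy())  # prints 1, 3, 10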

Related

How to read element in list item in Python?

I have the following output from a function and I need to read shape, labels, and domain from this stream.
[Annotation(shape=Rectangle(x=0.0, y=0.0, width=1.0, height=1.0), labels=[ScoredLabel(62282a1dc79ed6743e731b36, name=GOOD, probability=0.5143796801567078, domain=CLASSIFICATION, color=Color(red=233, green=97, blue=21, alpha=255), hotkey=ctrl+3)], id=622cc4d962f051a8f41ddf35)]
I need them as follows
shp = Annotation.shape
lbl = Annotation.labels
dmn = domain
It seems simple, but I have not been able to figure it out yet.
Given output as a list of Annotation objects:
output = [Annotation(...)]
you ought to be able to simply do:
shp = output[0].shape
lbl = output[0].labels
dmn = lbl[0].domain  # domain lives on each ScoredLabel, not on the Annotation
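If you need this for every annotation in the list rather than just the first, a minimal sketch (assuming output is the list shown above) is:

for ann in output:
    shp = ann.shape
    lbl = ann.labels
    dmn = lbl[0].domain  # domain is an attribute of each ScoredLabel
    print(shp, lbl, dmn)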

Tensorflow transform each element of a string tensor

I have a tensor of strings. Some example strings are as follows.
com.abc.display,com.abc.backend,com.xyz.forte,blah
com.pqr,npr.goog
I want to do some preprocessing that splits the CSV into its parts, then splits each part at the dots, and then creates multiple strings where one string is a prefix of another. Also, all blahs have to be dropped.
For example, given the first string com.abc.display,com.abc.backend,com.xyz.forte, it is transformed into an array/list of the following strings.
['com', 'com.abc', 'com.abc.display', 'com.abc.backend', 'com.xyz', 'com.xyz.forte']
The resulting list has no duplicates (that is why the prefixed strings for com.abc.backend didn't show up as those were already included - com and com.abc).
I wrote the following python function that would do the above given a single CSV string example.
def expand_meta(meta):
    expanded_subparts = []
    meta_parts = set([x for x in meta.split(',') if x != 'blah'])
    for part in meta_parts:
        subparts = part.split('.')
        for i in range(len(subparts) + 1):
            expanded = '.'.join(subparts[:i])
            if expanded:
                expanded_subparts.append(expanded)
    return list(set(expanded_subparts))
Calling this method on the first example
expand_meta('com.abc.display,com.abc.backend,com.xyz.forte,blah')
returns
['com.abc.display',
 'com.abc',
 'com.xyz',
 'com.xyz.forte',
 'com.abc.backend',
 'com']
I know that TensorFlow has this map_fn method. I was hoping to use that to transform each element of the tensor. However, I am getting the following error.
File "mypreprocess.py", line 152, in expand_meta
    meta_parts = set([x for x in meta.split(',') if x != 'blah'])
AttributeError: 'Tensor' object has no attribute 'split'
So it seems like I can't use a regular Python function with map_fn, since it expects the elements to be tensors. How can I do what I intend to do here?
(My Tensorflow version is 1.11.0)
I think this does what you want:
import tensorflow as tf

# Function to process a single string
def make_splits(s):
    s = tf.convert_to_tensor(s)
    # Split by comma
    split1 = tf.strings.split([s], ',').values
    # Remove blahs
    split1 = tf.boolean_mask(split1, tf.not_equal(split1, 'blah'))
    # Split by period
    split2 = tf.string_split(split1, '.')
    # Get dense split tensor
    split2_dense = tf.sparse.to_dense(split2, default_value='')
    # Accumulated concatenations
    concats = tf.scan(lambda a, b: tf.string_join([a, b], '.'),
                      tf.transpose(split2_dense))
    # Get relevant concatenations
    out = tf.gather_nd(tf.transpose(concats), split2.indices)
    # Remove duplicates
    return tf.unique(out)[0]

# Test
with tf.Graph().as_default(), tf.Session() as sess:
    # Individual examples
    print(make_splits('com.abc.display,com.abc.backend,com.xyz.forte,blah').eval())
    # [b'com' b'com.abc' b'com.abc.display' b'com.abc.backend' b'com.xyz'
    #  b'com.xyz.forte']
    print(make_splits('com.pqr,npr.goog').eval())
    # [b'com' b'com.pqr' b'npr' b'npr.goog']

    # Apply to multiple strings with a loop
    data = tf.constant([
        'com.abc.display,com.abc.backend,com.xyz.forte,blah',
        'com.pqr,npr.goog'])
    ta = tf.TensorArray(size=data.shape[0], dtype=tf.string,
                        infer_shape=False, element_shape=[None])
    _, ta = tf.while_loop(
        lambda i, ta: i < tf.shape(data)[0],
        lambda i, ta: (i + 1, ta.write(i, make_splits(data[i]))),
        [0, ta])
    out = ta.concat()
    print(out.eval())
    # [b'com' b'com.abc' b'com.abc.display' b'com.abc.backend' b'com.xyz'
    #  b'com.xyz.forte' b'com' b'com.pqr' b'npr' b'npr.goog']
I'm not sure if you want the total results concatenated like that, or maybe you want to apply tf.unique to the global result, but in any case the idea is the same.
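For reference, global deduplication would be one more line on the concatenated result (reusing out and the session from the sketch above); tf.unique keeps first occurrences in order:

global_unique = tf.unique(out)[0]
print(global_unique.eval())
# [b'com' b'com.abc' b'com.abc.display' b'com.abc.backend' b'com.xyz'
#  b'com.xyz.forte' b'com.pqr' b'npr' b'npr.goog']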

How to use TensorFlow to ingest sharded CSVs

This is a problem I am working on in Google Cloud Platform with TensorFlow v1.15.
I am working on this notebook
In this section, I am supposed to return a function that feeds model.train()
CSV_COLUMNS = ['fare_amount', 'pickuplon', 'pickuplat', 'dropofflon', 'dropofflat', 'passengers', 'key']
DEFAULTS = [[0.0], [-74.0], [40.0], [-74.0], [40.7], [1.0], ['nokey']]

# TODO: Create an appropriate input function read_dataset
def read_dataset(filename, mode):
    # TODO: Add CSV decoder function and dataset creation and methods
    return dataset

def get_train_input_fn():
    return read_dataset('./taxi-train.csv', mode = tf.estimator.ModeKeys.TRAIN)

def get_valid_input_fn():
    return read_dataset('./taxi-valid.csv', mode = tf.estimator.ModeKeys.EVAL)
I think it should be like this:
def read_dataset(filename, mode, batch_size = 512):
    def fn():
        def decode_csv(value_column):
            columns = tf.decode_csv(value_column, record_defaults = DEFAULTS)
            features = dict(zip(CSV_COLUMNS, columns))
            label = features.pop(LABEL_COLUMN)
            return features, label

        # Create list of file names that match "glob" pattern (i.e. data_file_*.csv)
        filenames_dataset = tf.data.Dataset.list_files(filename)
        # Read lines from text files
        textlines_dataset = filenames_dataset.flat_map(tf.data.TextLineDataset)
        # Parse text lines as comma-separated values (CSV)
        dataset = textlines_dataset.map(decode_csv)

        if mode == tf.estimator.ModeKeys.TRAIN:
            num_epochs = None # indefinitely
            dataset = dataset.shuffle(buffer_size = 10 * batch_size)
        else:
            num_epochs = 1 # end-of-input after this

        dataset = dataset.repeat(num_epochs).batch(batch_size)
        return dataset
    return fn
That is actually reflective of code in the video recap that accompanies this notebook, and very similar to my own attempts before I saw that recap. It is also similar to the next notebook, but that code also unfortunately fails.
With the above code, I am getting this error:
UnimplementedError: Cast string to float is not supported
[[node linear/head/ToFloat (defined at /usr/local/lib/python3.5/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
So I'm not sure how to transform the data to match the expected datatype. I cannot cast the data in decode_csv like:
features = {CSV_COLUMNS[i]: float(cols[i]) for i in range(1, len(CSV_COLUMNS) - 1)}
because the error happens before that line is reached.
Investigating the data I note:
import csv
with open('./taxi-train.csv') as f:
    reader = csv.reader(f)
    print(next(reader))
['12.0', '-73.987625', '40.750617', '-73.971163', '40.78518', '1', '0']
That looks like the raw data might actually be strings... am I correct? How can I solve this?
Edit: I have located the CSV file; it is not raw string data. Why is the TensorFlow import bringing it in as text?
The training-data-analyst repository you mentioned also has the solutions to all the notebooks.
From analysing the provided solution, it looks like the def fn() part is redundant; the read_dataset function should simply return a tf.data.Dataset:
def read_dataset(filename, mode, batch_size = 512):
    def decode_csv(row):
        columns = tf.decode_csv(row, record_defaults = DEFAULTS)
        features = dict(zip(CSV_COLUMNS, columns))
        features.pop('key') # discard, not a real feature
        label = features.pop('fare_amount') # remove label from features and store
        return features, label

    # Create list of file names that match "glob" pattern (i.e. data_file_*.csv)
    filenames_dataset = tf.data.Dataset.list_files(filename, shuffle=False)
    # Read lines from text files
    textlines_dataset = filenames_dataset.flat_map(tf.data.TextLineDataset)
    # Parse text lines as comma-separated values (CSV)
    dataset = textlines_dataset.map(decode_csv)

    # Note:
    # use tf.data.Dataset.flat_map to apply one to many transformations (here: filename -> text lines)
    # use tf.data.Dataset.map to apply one to one transformations (here: text line -> feature list)

    if mode == tf.estimator.ModeKeys.TRAIN:
        num_epochs = None # loop indefinitely
        dataset = dataset.shuffle(buffer_size = 10 * batch_size, seed=2)
    else:
        num_epochs = 1 # end-of-input after this

    dataset = dataset.repeat(num_epochs).batch(batch_size)
    return dataset
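With read_dataset returning the dataset directly, the input functions just need to wrap it in a lambda; a hypothetical wiring into an Estimator (the feature columns and model below are assumptions for illustration, not part of the notebook) could look like:

featcols = [tf.feature_column.numeric_column(c) for c in CSV_COLUMNS[1:-1]]
model = tf.estimator.LinearRegressor(feature_columns = featcols, model_dir = 'taxi_trained')
model.train(
    input_fn = lambda: read_dataset('./taxi-train.csv', tf.estimator.ModeKeys.TRAIN),
    steps = 100)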
The solutions are located in the same directory as the labs. So, for example, the solution for
training-data-analyst/courses/machine_learning/deepdive/03_tensorflow/labs/c_dataset.ipynb
is located at
training-data-analyst/courses/machine_learning/deepdive/03_tensorflow/c_dataset.ipynb

How to create variables for Facial_Recognition from database

I'm trying to pull data from a database with a name and an image file name, then feed it into a face_recognition Python program. However, in the code I'm using, the program learns faces through individually named variables.
How can I create variables based on the amount of data that I have in the database?
What could be a better approach to solve this problem?
first_image = face_recognition.load_image_file("first.jpg")
first_face_encoding = face_recognition.face_encodings(first_image)[0]

second_image = face_recognition.load_image_file("second.jpg")
second_face_encoding = face_recognition.face_encodings(second_image)[0]
You can use arrays instead of storing each image/encoding in an individual variable, and fill the arrays from a for loop.
Assuming you can change the filenames from first.jpg, second.jpg... to 1.jpg, 2.jpg... you can do this:
numberofimages = 10  # change this to the total number of images
images = [None] * (numberofimages + 1)     # create an array to store all the images
encodings = [None] * (numberofimages + 1)  # create an array to store all the encodings

for i in range(1, numberofimages + 1):
    filename = str(i) + ".jpg"  # generate image file name (eg. 1.jpg, 2.jpg...)
    # load the image and store it in the array
    images[i] = face_recognition.load_image_file(filename)
    # store the encoding
    encodings[i] = face_recognition.face_encodings(images[i])[0]
You can then access eg. the 3rd image and 3rd encoding like this:
images[3]
encodings[3]
If changing image file names is not an option, you can store them in a dictionary and do this:
numberofimages = 3  # change this to the total number of images
images = [None] * (numberofimages + 1)     # create an array to store all the images
encodings = [None] * (numberofimages + 1)  # create an array to store all the encodings
filenames = {
    1: "first",
    2: "second",
    3: "third"
}

for i in range(1, numberofimages + 1):
    filename = filenames[i] + ".jpg"  # generate file name (eg. first.jpg, second.jpg...)
    print(filename)
    # load the image and store it in the array
    images[i] = face_recognition.load_image_file(filename)
    # store the encoding
    encodings[i] = face_recognition.face_encodings(images[i])[0]
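Since the question mentions a database, a better long-term approach might be to drive the same loop from the database rows instead of a hand-maintained dictionary. A minimal sketch, assuming a SQLite table faces(name, filename) (the database file, table, and column names are hypothetical):

import sqlite3
import face_recognition

conn = sqlite3.connect("faces.db")  # hypothetical database file
rows = conn.execute("SELECT name, filename FROM faces").fetchall()

names = []
encodings = []
for name, filename in rows:
    image = face_recognition.load_image_file(filename)
    faces = face_recognition.face_encodings(image)
    if faces:  # skip rows whose image contains no detectable face
        names.append(name)
        encodings.append(faces[0])

This way the number of "variables" scales with the rows in the table, and names[i] always matches encodings[i].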

Convert a tensorflow tf.data.Dataset FlatMapDataset to TensorSliceDataset

I want to pass a list of tf.string tensors to the .map(_parse_function) function.
def _parse_function(self, img_path):
    img_str = tf.read_file(img_path)
    img_decode = tf.image.decode_jpeg(img_str, channels=3)
    img_decode = tf.divide(tf.cast(img_decode, tf.float32), 255)
    return img_decode
When the tf.data.Dataset is of type TensorSliceDataset,
dataset_from_slices = tf.data.Dataset.from_tensor_slices((tensor_with_filenames))
I can simply do
dataset_from_slices.map(_parse_function), which works.
However, dataset_from_generator = tf.data.Dataset.from_generator(...) returns a Dataset which is an instance of FlatMapDataset type and dataset_from_generator.map(_parse_function) gives the following error:
InvalidArgumentError: Input filename tensor must be scalar, but had shape: [32]
If I change the first line to:
img_str = tf.read_file(img_path[0])
that also works but then I only get the first image, which is not what I am looking for. Any suggestions?
It sounds like the elements of your dataset_from_generator are batched. The simplest remedy is to use tf.contrib.data.unbatch() to convert them back into individual elements:
# Each element is a vector of strings.
dataset_from_generator = tf.data.Dataset.from_generator(...)
# Converts each vector of strings into multiple individual elements.
dataset = dataset_from_generator.apply(tf.contrib.data.unbatch())
dataset = dataset.map(_parse_function)
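Note that tf.contrib was removed in TF 2.x; in recent TensorFlow versions the same operation is a method on the dataset itself (a sketch under that assumption):

# Modern equivalent of tf.contrib.data.unbatch()
dataset = dataset_from_generator.unbatch()
dataset = dataset.map(_parse_function)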
