This is a follow-up to these two SO questions:
Select random value from row in a TF.record array, with limits on what the value can be?
Numpy array to TFrecord
The first one mentioned that TFRecords can handle variable-length data using tf.VarLenFeature().
I am still having trouble figuring out how to convert my numpy array to a TFRecord file, though. Here's what the first 15 rows look like:
print(my_data[0:15])
[[1446549
list([491827, 30085, 1417541, 799563, 879302, 1997973, 1373049, 1460602, 2240973, 1172992, 1186011, 147536, 1958456, 3095889, 319954, 2191582, 1113354, 302626, 1985611, 1186704, 2231212, 2642148, 386962, 3072993, 1131255, 15085, 2714264, 1363205])]
[406529 list([900479, 660976, 1270383, 1287181])]
[1350274
list([207721, 676951, 1311781, 2712019, 1719660, 2969693, 37187, 2284531, 1253304, 1274866, 2815382, 1513583, 1339084, 1624616, 2967307, 1702118, 585261, 426595, 1444507, 1982792])]
[1163243
list([324383, 81509, 322474, 406941, 768416, 109067, 173425, 1478467, 573723, 1009159, 313463, 313924, 627680, 1072293, 1025620, 2325337, 2457705, 1505115, 2812547, 922812, 2152425, 2524196, 182325, 2912690, 1388620, 1484514, 1481728, 2616639, 2180765, 1544586, 1987272, 1557441, 453182, 892217, 1462085, 1892770, 1646735, 2521186, 2814552, 2983691, 3037096, 832554, 2807250, 2253333, 2595688, 2650475, 2525317, 2716592, 2573244, 2666514, 256757, 1135836, 1856208, 2605537, 1851963, 2381938, 1716883, 773842, 1877852, 2504806, 2208699, 1076111, 3058991, 3024546, 2010887, 2630915])]
[2491621 list([877803, 546802, 2855232, 2950610, 1378514, 285536])]
[2465968
list([626040, 1151291, 560715, 1153787, 893941, 3094902, 1239392, 2081948, 1677321, 1193880, 2326117, 2805797, 1715983, 1213177, 1476995, 2620772, 1242804, 2942330, 588938, 2338375, 2805378, 169015, 1766962, 562485, 1210404, 772334, 415148, 1293624, 527245, 587088, 665484, 449673, 315509])]
[886255
list([694445, 796232, 1151072, 2312348, 1773175, 1898319, 1696093, 91310, 719379, 2080422, 1352695, 2364846, 845154, 2476191, 537059, 1216854, 1529449, 284855, 1215830, 3041789, 1625939])]
[451113
list([2805707, 2727647, 742706, 1727139, 2585759, 822759, 1099617])]
[1529600
list([1755946, 2110553, 1056110, 426876, 2448684, 396996, 1498300, 756831, 2181288, 1159493])]
[596550 list([1610895, 2579387, 3081786, 2000733, 2142308])]
[1631548
list([1576412, 849908, 2705650, 2291675, 751733, 1911747, 1496204])]
[3001784 list([327334, 1197547, 2515733])]
[308747
list([8344, 80684, 996504, 2250076, 1905654, 863587, 2235560, 2676079, 1826, 685487, 1481871, 588465, 1126662, 2458841, 2481927])]
[731288 list([2793620, 1115724, 1406934])]
[1523219
list([12825, 1128776, 1761080, 1486798, 2689369, 1040645, 3012606])]]
It might be a little tricky to read, but here's an instance of a smaller row:
[406529 list([900479, 660976, 1270383, 1287181])]
Each row contains a number and a list of numbers, and that list varies in length.
I am having trouble figuring out how exactly I can convert this to a TFRecord file. Any help/hints would be greatly appreciated.
I assume you want to store the number and the list as two separate features here.
import tensorflow as tf

writer = tf.python_io.TFRecordWriter('test.tfrecords')
for index in range(my_data.shape[0]):
    example = tf.train.Example(features=tf.train.Features(feature={
        'num_value': tf.train.Feature(int64_list=tf.train.Int64List(value=[my_data[index][0]])),
        'list_value': tf.train.Feature(int64_list=tf.train.Int64List(value=my_data[index][1]))
    }))
    writer.write(example.SerializeToString())
writer.close()
# read data from tfrecords
record_iterator = tf.python_io.tf_record_iterator('test.tfrecords')
for _ in range(2):
    serialized_example = next(record_iterator)
    example = tf.train.Example()
    example.ParseFromString(serialized_example)
    num_value = example.features.feature['num_value'].int64_list.value[0]
    list_value = example.features.feature['list_value'].int64_list.value
    print(num_value, list_value)
# output
1446549 [491827, 30085, 1417541, 799563, 879302, 1997973, 1373049, 1460602, 2240973, 1172992, 1186011, 147536, 1958456, 3095889, 319954, 2191582, 1113354, 302626, 1985611, 1186704, 2231212, 2642148, 386962, 3072993, 1131255, 15085, 2714264, 1363205]
406529 [900479, 660976, 1270383, 1287181]
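Since the first question mentioned tf.VarLenFeature(), here is a minimal sketch (same TF 1.x API as above) of parsing these records back in a tf.data pipeline; the variable-length 'list_value' comes back as a SparseTensor:

def parse_fn(serialized_example):
    features = tf.parse_single_example(serialized_example, features={
        'num_value': tf.FixedLenFeature([], tf.int64),
        'list_value': tf.VarLenFeature(tf.int64)  # variable length -> SparseTensor
    })
    # Densify the variable-length list if downstream ops need a dense tensor
    features['list_value'] = tf.sparse_tensor_to_dense(features['list_value'])
    return features

dataset = tf.data.TFRecordDataset('test.tfrecords').map(parse_fn)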
I create a dataset by reading the TFRecords, map the values, and want to filter the dataset for specific values. However, since the result is a dict of tensors, I am not able to get the actual value of a tensor or check it with tf.cond() / tf.equal(). How can I do that?
def mapping_func(serialized_example):
    feature = {'label': tf.FixedLenFeature([1], tf.string)}
    features = tf.parse_single_example(serialized_example, features=feature)
    return features

def filter_func(features):
    # this doesn't work
    #result = features['label'] == 'some_label_value'
    # neither does this
    result = tf.reshape(tf.equal(features['label'], 'some_label_value'), [])
    return result

def main():
    file_names = ["/var/data/file1.tfrecord", "/var/data/file2.tfrecord"]
    dataset = tf.contrib.data.TFRecordDataset(file_names)
    dataset = dataset.map(mapping_func)
    dataset = dataset.shuffle(buffer_size=10000)
    dataset = dataset.filter(filter_func)
    dataset = dataset.repeat()
    iterator = dataset.make_one_shot_iterator()
    sample = iterator.get_next()
I am answering my own question. I found the issue!
What I needed to do was tf.unstack() the label, like this:
label = tf.unstack(features['label'])
label = label[0]
before I give it to tf.equal():
result = tf.reshape(tf.equal(label, 'some_label_value'), [])
I suppose the problem was that the label is defined as an array with one element of type string (tf.FixedLenFeature([1], tf.string)), so in order to get the first and only element I had to unpack it (which creates a list) and then get the element at index 0. Correct me if I'm wrong.
I think you don't need to make label a 1-dimensional array in the first place.
with:
feature = {'label': tf.FixedLenFeature((), tf.string)}
you won't need to unstack the label in your filter_func.
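A minimal sketch of the revised functions with the scalar shape (untested, but this is the idea):

def mapping_func(serialized_example):
    feature = {'label': tf.FixedLenFeature((), tf.string)}
    return tf.parse_single_example(serialized_example, features=feature)

def filter_func(features):
    # 'label' is now a scalar string tensor, so the comparison already
    # yields the scalar boolean that dataset.filter expects
    return tf.equal(features['label'], 'some_label_value')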
Reading and filtering a dataset is very easy, and there is no need to unstack anything.
To read the dataset:
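(The snippet below assumes my_dataset has already been parsed into dicts of features. A minimal sketch of that setup, with hypothetical keys key2, key_bool, and key_int standing in for whatever your records actually contain:)

# Hypothetical setup (TF 2.x): parse each serialized record into a dict of features
feature_spec = {
    'key2': tf.io.FixedLenFeature([], tf.string),
    'key_bool': tf.io.FixedLenFeature([], tf.int64),
    'key_int': tf.io.FixedLenFeature([], tf.int64),
}
my_dataset = tf.data.TFRecordDataset(['my_file.tfrecords']).map(
    lambda rec: tf.io.parse_single_example(rec, feature_spec))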
print(my_dataset, '\n\n')
## let us print the first 3 records
for record in my_dataset.take(3):
    ## below could be large in case of image
    print(record)
    ## let us print a specific key
    print(record['key2'])
Filtering is equally simple:
my_filtereddataset = my_dataset.filter(_filtcond1)
where you define _filtcond1 however you want. Say, for example, there is a true/false boolean flag in your dataset; then:
@tf.function
def _filtcond1(x):
    return x['key_bool'] == 1
or even a lambda function:
my_filtereddataset = my_dataset.filter(lambda x: x['key_int']>13)
If you are reading a dataset which you haven't created, or you are unaware of the keys (as seems to be the OP's case), you can use this to get an idea of the keys and structure first:
import json
from google.protobuf.json_format import MessageToJson

for raw_record in noidea_dataset.take(1):
    example = tf.train.Example()
    example.ParseFromString(raw_record.numpy())
    ##print(example)  ## if it's an image this will be too long
    m = json.loads(MessageToJson(example))
    print(m['features']['feature'].keys())
Now you can proceed with the filtering.
You should try to use the apply function of tf.data.TFRecordDataset (tensorflow documentation).
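For instance, apply() takes a function that maps a Dataset to a Dataset, which lets you bundle several transformations into one call; a minimal sketch (the transformation itself is just an example):

def _prepare(ds):
    return ds.shuffle(1000).repeat().batch(32)

dataset = tf.data.TFRecordDataset(["file1.tfrecord"]).apply(_prepare)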
Otherwise, read this article about TFRecords to get a better understanding of them: TFRecords for humans.
But the most likely situation is that you can neither access nor modify a TFRecord; there is a request on GitHub about this topic: TFRecords request.
My advice is to keep things as easy as you can; you have to know that you are working with graphs and sessions.
In any case, if everything fails, run the part of the code that does not work inside a TensorFlow session, as simply as you can; probably all of these operations need to happen while a tf.Session is running.
I'm playing around with tensorflow, but am a bit confused about the input pipeline. The data I'm working on is in a large csv file with 307 columns, of which the first is a string representing a date and the rest are floats.
I'm running into some problems with preprocessing my data. I want to replace the date string with a couple of features derived from it (specifically, a sine and a cosine representing the time). I also want to group the next 120 values in the CSV row together as one feature, the 96 after that as another feature, and base my label on the remaining values in the CSV.
This is my code for generating the datasets for now:
import tensorflow as tf

defaults = []
defaults.append([""])
for i in range(0, 306):
    defaults.append([1.0])

def dataset(train_fraction=0.8):
    path = "training_examples_shuffled.csv"

    # Define how the lines of the file should be parsed
    def decode_line(line):
        items = tf.decode_csv(line, record_defaults=defaults)
        datetimeString = items[0]
        minuteFeatures = items[1:121]
        halfHourFeatures = items[121:217]
        labelFeatures = items[217:]

        ## Do something to convert datetimeString to timeSine and timeCosine

        features_dict = {
            'timeSine': timeSine,
            'timeCosine': timeCosine,
            'minuteFeatures': minuteFeatures,
            'halfHourFeatures': halfHourFeatures
        }
        # Placeholder. I seem to need some python logic here, but I'm
        # not sure how to apply that to data in tensor format.
        label = [1]
        return features_dict, label

    def in_training_set(line):
        """Returns a boolean tensor, true if the line is in the training set."""
        num_buckets = 1000000
        bucket_id = tf.string_to_hash_bucket_fast(line, num_buckets)
        # Use the hash bucket id as a random number that's deterministic per example
        return bucket_id < int(train_fraction * num_buckets)

    def in_test_set(line):
        """Returns a boolean tensor, true if the line is in the test set."""
        return ~in_training_set(line)

    base_dataset = (tf.data
                    # Get the lines from the file.
                    .TextLineDataset(path))

    train = (base_dataset
             # Take only the training-set lines.
             .filter(in_training_set)
             # Decode each line into a (features_dict, label) pair.
             .map(decode_line))

    # Do the same for the test-set.
    test = (base_dataset.filter(in_test_set).map(decode_line))

    return train, test
My question now is: how can I access the string in the datetimeString Tensor to convert it to a datetime object? Or is this the wrong place to be doing this? I'd like to use the time and the day of the week as input features.
And secondly: pretty much the same for the label, based on the remaining values of the CSV. Can I just use standard Python code for this in some way, or should I be using basic tensorflow ops to achieve what I want, if possible?
Finally, any comments on whether this is a decent way of handling my inputs? Tensorflow is a bit confusing, with old tutorials spread around the internet using deprecated ways of handling inputs.
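(For what it's worth, one possible way to derive the sine/cosine features with plain TF ops, assuming the string starts with a fixed-width "YYYY-MM-DD HH:MM" timestamp; the substring offsets are assumptions about that format:)

import math

# Sketch (TF 1.x ops): slice fixed-width hour/minute substrings out of the date string
hour = tf.string_to_number(tf.substr(datetimeString, 11, 2), out_type=tf.float32)
minute = tf.string_to_number(tf.substr(datetimeString, 14, 2), out_type=tf.float32)
day_fraction = (hour * 60.0 + minute) / (24.0 * 60.0)
timeSine = tf.sin(2.0 * math.pi * day_fraction)
timeCosine = tf.cos(2.0 * math.pi * day_fraction)

Deriving the day of the week from the date part is harder with pure TF ops; tf.py_func is the usual escape hatch for that kind of logic.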
From a list of word images with their transcriptions, I am trying to create and read sparse sequence labels (for tf.nn.ctc_loss) using a tf.train.slice_input_producer, avoiding:
- serializing pre-packaged training data to disk in TFRecord format,
- the apparent limitations of tf.py_func,
- any unnecessary or premature padding, and
- reading the entire data set to RAM.
The main issue seems to be converting a string to the sequence of labels (a SparseTensor) needed for tf.nn.ctc_loss.
For example, with the character set in the (ordered) range [A-Z], I'd want to convert the text label string "BAD" to the sequence label class list [1,0,3].
Each example image I want to read contains the text as part of the filename, so it's straightforward to extract and do the conversion in straight up python. (If there's a way to do it within TensorFlow computations, I haven't found it.)
Several previous questions touch on these issues, but I haven't been able to integrate them successfully. For example,
Tensorflow read images with labels
shows a straightforward framework with discrete, categorical labels,
which I've begun with as a model.
How to load sparse data with TensorFlow?
nicely explains an approach for loading sparse data, but assumes
pre-packaging tf.train.Examples.
Is there a way to integrate these approaches?
Another example (SO question #38012743) shows how I might delay the conversion from string to list until after dequeuing the filename for decoding, but it relies on tf.py_func, which has caveats. (Should I worry about them?)
I recognize that "SparseTensors don't play well with queues" (per the tf docs), so it might be necessary to do some voodoo on the result (serialization?) before batching, or even rework where the computation happens; I'm open to that.
Following MarvMind's outline, here is a basic framework with the computations I want (iterate over lines containing example filenames, extract each label string and convert to sequence), but I have not successfully determined the "Tensorflow" way to do it.
Thank you for the right "tweak", a more appropriate strategy for my goals, or an indication that tf.py_func won't wreck training efficiency or something else downstream (e.g., loading trained models for future use).
EDIT (+7 hours): I found the missing ops to patch things up. While I still need to verify that this connects with ctc_loss downstream, I have checked that the edited version below correctly batches and reads in the images and sparse tensors.
import os
import tensorflow as tf

out_charset = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

def input_pipeline(data_filename):
    filenames, seq_labels = _get_image_filenames_labels(data_filename)
    data_queue = tf.train.slice_input_producer([filenames, seq_labels])
    image, label = _read_data_format(data_queue)
    image, label = tf.train.batch([image, label], batch_size=2, dynamic_pad=True)
    label = tf.deserialize_many_sparse(label, tf.int32)
    return image, label

def _get_image_filenames_labels(data_filename):
    filenames = []
    labels = []
    with open(data_filename) as f:
        for line in f:
            # Carve out the ground truth string and file path from
            # lines formatted like:
            # ./241/7/158_NETWORK_51375.jpg 51375
            filename = line.split(' ', 1)[0][2:]  # split off "./" and number
            # Extract label string embedded within image filename
            # between underscores, e.g. NETWORK
            text = os.path.basename(filename).split('_', 2)[1]
            # Transform string text to sequence of indices using charset, e.g.,
            # NETWORK -> [13, 4, 19, 22, 14, 17, 10]
            indices = [[i] for i in range(0, len(text))]
            values = [out_charset.index(c) for c in list(text)]
            shape = [len(text)]
            label = tf.SparseTensorValue(indices, values, shape)
            label = tf.convert_to_tensor_or_sparse_tensor(label)
            label = tf.serialize_sparse(label)  # needed for batching
            # Add data to lists for conversion
            filenames.append(filename)
            labels.append(label)
    filenames = tf.convert_to_tensor(filenames)
    labels = tf.convert_to_tensor_or_sparse_tensor(labels)
    return filenames, labels

def _read_data_format(data_queue):
    label = data_queue[1]
    raw_image = tf.read_file(data_queue[0])
    image = tf.image.decode_jpeg(raw_image, channels=1)
    return image, label
The key ideas seem to be: create a SparseTensorValue from the data wanted, pass it through tf.convert_to_tensor_or_sparse_tensor, and then (if you want to batch the data) serialize it with tf.serialize_sparse. After batching, you can restore the values with tf.deserialize_many_sparse.
Here's the outline. Create the sparse values, convert to tensor, and serialize:
indices = [[i] for i in range(0, len(text))]
values = [out_charset.index(c) for c in list(text)]
shape = [len(text)]
label = tf.SparseTensorValue(indices, values, shape)
label = tf.convert_to_tensor_or_sparse_tensor(label)
label = tf.serialize_sparse(label)  # needed for batching
Then, you can do the batching and deserialize:
image, label = tf.train.batch([image, label], batch_size=2, dynamic_pad=True)
label = tf.deserialize_many_sparse(label, tf.int32)
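To sketch how this would connect to the loss (still unverified, per the edit above): the deserialized sparse label is the labels argument that tf.nn.ctc_loss expects, assuming the model elsewhere produces logits of shape [max_time, batch_size, num_classes] and per-example sequence lengths (both hypothetical names here):

# Hedged sketch: 'logits' and 'seq_lengths' are assumed to come from the model
loss = tf.nn.ctc_loss(labels=label, inputs=logits, sequence_length=seq_lengths)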