Preserve coherence while shuffling queues in Tensorflow - python

I have 3 queues: one is a WholeFileReader fed by a string_input_producer, and the other two are slice_input_producers fed by a vector of int32 and a matrix of int32 respectively. They are all ordered so that, when read in sequence, they yield an image, a question, and an answer that together form one example.
What I want to do is shuffle them, while preserving the relations between them.
I've tried using shuffle_batch, but it does not preserve the relations between the queues, which makes it useless here.
My current code (the relevant bits):
def load_images(self, images, q_name):
    filename_queue = tf.train.string_input_producer(images, shuffle=False, name=q_name)
    reader = tf.WholeFileReader()
    key, value = reader.read(filename_queue)
    imagedata = tf.image.decode_png(value)
    imagedata = tf.cast(tf.image.resize_images(imagedata, [224, 224],
                                               tf.image.ResizeMethod.NEAREST_NEIGHBOR), tf.float32)
    imagedata = tf.div(imagedata, tf.reduce_max(tf.abs(imagedata)))
    imagedata.set_shape([224, 224, 3])
    return key, imagedata
keys[testfile], imagedata[testfile] = self.load_images(imagefiles[testfile], 'test')
keys[trainfile], imagedata[trainfile] = self.load_images(imagefiles[trainfile], 'train')
s_train_answer_batch, s_train_question_batch, s_train_image_batch = tf.train.batch(
    [tf.train.slice_input_producer([answers[trainfile]], shuffle=False)[0],
     tf.train.slice_input_producer([questions[trainfile]], shuffle=False)[0],
     imagedata[trainfile]],
    batch_size=batch_size, capacity=batch_size*2, enqueue_many=False)
feed_dict = {self.x_image: s_train_image_batch.eval(),
             self.x_question: s_train_question_batch.eval(),
             self.y_: s_train_answer_batch.eval(),
             self.keep_prob: keep_prob}
_, ce, summary, image_summary, accuracy = sess.run(
    [self.train_step, self.cross_entropy, self.summary_op,
     self.image_summary_op, self.accuracy],
    feed_dict=feed_dict)
So, to be absolutely clear: if the image, question, and answer matrices were just vectors of the numbers one to ten, I'd want the feed dictionary to look like:
q:[4,1,8,2,3,9,6,5,7],a:[4,1,8,2,3,9,6,5,7],i:[4,1,8,2,3,9,6,5,7]
but currently they'd look like:
q:[4,1,8,2,3,9,6,5,7],a:[7,3,1,5,6,2,4,9,8],i:[9,8,3,5,4,6,7,1,2]

I solved it! Don't use .eval() to get the outputs; call sess.run([image_batch, question_batch, answer_batch]) instead. This preserves the ordering and still does the shuffling. The reason appears to be that each .eval() call is a separate graph execution that dequeues its own batch, so the three batches come from different positions in the queues, whereas a single sess.run dequeues all three tensors in one step.
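A minimal sketch of the fix, using the batch tensors from above:

    # Fetch all three batches in a single run so they come from the same dequeue:
    img_b, q_b, a_b = sess.run([s_train_image_batch, s_train_question_batch,
                                s_train_answer_batch])
    feed_dict = {self.x_image: img_b, self.x_question: q_b,
                 self.y_: a_b, self.keep_prob: keep_prob}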

Why not use a single queue, such as tf.train.range_input_producer with shuffle=True, to produce an integer index?
In other words, you use a single queue to extract an integer that you then use to index all three of your data structures, so the three reads stay aligned.
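A minimal sketch of that approach (num_examples, image_filenames, questions_tensor, and answers_tensor are assumed stand-ins for your data):

    # One shuffled index queue keeps the three data sources aligned:
    index = tf.train.range_input_producer(num_examples, shuffle=True).dequeue()
    image_file = tf.gather(image_filenames, index)
    question = tf.gather(questions_tensor, index)
    answer = tf.gather(answers_tensor, index)
    image = tf.image.decode_png(tf.read_file(image_file))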

Related

Why would using a database (redis, SQL) help when loading big data and RAM is running out of memory?

I need to take 100 000 images from a directory and put them all in one big dictionary where the keys are the ids of the pictures and the values are the numpy arrays of the pixels of the images. Creating this dict takes 19 GB of my RAM and I have 24 GB in total. Then I need to order the dictionary with respect to the key, and at the end take only the values of this ordered dictionary and save it as one big numpy array. I need this big numpy array because I want to send it to the train_test_split sklearn function and split the whole data into train and test sets with respect to their labels. I run out of RAM in the step where, after creating the 19 GB dictionary, I try to sort it. I found a question with the same problem, How to sort a LARGE dictionary, where people suggest using a database.
# Presumed imports (the question does not show them):
import os
import numpy as np
from numpy.lib.format import open_memmap
from keras.preprocessing import image
from keras.preprocessing.image import load_img

def save_all_images_as_one_numpy_array():
    data_dict = {}
    for img in os.listdir('images'):
        id_img = img.split('_')[1]
        loadimg = load_img(os.path.join('images', img))
        x = image.img_to_array(loadimg)
        data_dict[id_img] = x
    data_dict = np.stack([v for k, v in sorted(data_dict.items(), key=lambda x: int(x[0]))])
    mmamfile = open_memmap('trythismmapfile.npy', dtype=np.float32, mode='w+', shape=data_dict.shape)
    mmamfile[:] = data_dict[:]

def load_numpy_array_with_images():
    a = open_memmap('trythismmapfile.npy', dtype=np.float32, mode='r')
When using np.stack I am stacking each numpy array into a new array, and this is where I run out of RAM. I can't afford to buy more RAM. I thought I could use redis in a docker container, but I don't understand why and how using a database would solve my problem.
The reason using a DB helps is that the DB library stores data on the hard disk rather than in memory. If you look at the documentation for the library the linked answer suggests, you'll see that the first argument is filename, demonstrating that the hard disk is used.
https://docs.python.org/2/library/bsddb.html#bsddb.hashopen
However, the linked question is talking about sorting by value, not key. Sorting by key will be much less memory intensive, although you'll likely still have memory issues when training your model. I'd suggest trying something along the lines of:
# Get the list of file names
imgs = os.listdir('images')
# Create a mapping of ID to file name
# This will allow us to sort the IDs then load the files in order
img_ids = {int(img.split('_')[1]): img for img in imgs}
# Get the list of file names sorted by ID
sorted_imgs = [v for k, v in sorted(img_ids.items(), key=lambda x: x[0])]
# Define a function for loading a named img
# (renamed so it doesn't shadow, and recursively call, the Keras load_img it uses)
def load_img_array(img):
    loadimg = load_img(os.path.join('images', img))
    return image.img_to_array(loadimg)
# Iterate through the sorted file names and stack the results
data_array = np.stack([load_img_array(img) for img in sorted_imgs])
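If the final np.stack itself exhausts RAM, a further sketch along the same lines (using the helper above, and assuming every image shares one shape) is to write each image directly into the memmap, so only one image is held in memory at a time:

    from numpy.lib.format import open_memmap

    n = len(sorted_imgs)
    first = load_img_array(sorted_imgs[0])  # probe the common image shape
    out = open_memmap('trythismmapfile.npy', dtype=np.float32,
                      mode='w+', shape=(n,) + first.shape)
    for i, img in enumerate(sorted_imgs):
        out[i] = load_img_array(img)  # one image in RAM at a time
    out.flush()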

Converting a Numpy file to TFRecord where each row contains a number, and a variable length list

This is a follow up to these two SO questions
Select random value from row in a TF.record array, with limits on what the value can be?
Numpy array to TFrecord
The first one mentioned that tfrecords can handle variable length data using tf.VarLenFeature()
I am still having trouble figuring out how to convert my number array to a tfrecord file, though. Here's what the first 15 rows look like:
print(my_data[0:15])
[[1446549
list([491827, 30085, 1417541, 799563, 879302, 1997973, 1373049, 1460602, 2240973, 1172992, 1186011, 147536, 1958456, 3095889, 319954, 2191582, 1113354, 302626, 1985611, 1186704, 2231212, 2642148, 386962, 3072993, 1131255, 15085, 2714264, 1363205])]
[406529 list([900479, 660976, 1270383, 1287181])]
[1350274
list([207721, 676951, 1311781, 2712019, 1719660, 2969693, 37187, 2284531, 1253304, 1274866, 2815382, 1513583, 1339084, 1624616, 2967307, 1702118, 585261, 426595, 1444507, 1982792])]
[1163243
list([324383, 81509, 322474, 406941, 768416, 109067, 173425, 1478467, 573723, 1009159, 313463, 313924, 627680, 1072293, 1025620, 2325337, 2457705, 1505115, 2812547, 922812, 2152425, 2524196, 182325, 2912690, 1388620, 1484514, 1481728, 2616639, 2180765, 1544586, 1987272, 1557441, 453182, 892217, 1462085, 1892770, 1646735, 2521186, 2814552, 2983691, 3037096, 832554, 2807250, 2253333, 2595688, 2650475, 2525317, 2716592, 2573244, 2666514, 256757, 1135836, 1856208, 2605537, 1851963, 2381938, 1716883, 773842, 1877852, 2504806, 2208699, 1076111, 3058991, 3024546, 2010887, 2630915])]
[2491621 list([877803, 546802, 2855232, 2950610, 1378514, 285536])]
[2465968
list([626040, 1151291, 560715, 1153787, 893941, 3094902, 1239392, 2081948, 1677321, 1193880, 2326117, 2805797, 1715983, 1213177, 1476995, 2620772, 1242804, 2942330, 588938, 2338375, 2805378, 169015, 1766962, 562485, 1210404, 772334, 415148, 1293624, 527245, 587088, 665484, 449673, 315509])]
[886255
list([694445, 796232, 1151072, 2312348, 1773175, 1898319, 1696093, 91310, 719379, 2080422, 1352695, 2364846, 845154, 2476191, 537059, 1216854, 1529449, 284855, 1215830, 3041789, 1625939])]
[451113
list([2805707, 2727647, 742706, 1727139, 2585759, 822759, 1099617])]
[1529600
list([1755946, 2110553, 1056110, 426876, 2448684, 396996, 1498300, 756831, 2181288, 1159493])]
[596550 list([1610895, 2579387, 3081786, 2000733, 2142308])]
[1631548
list([1576412, 849908, 2705650, 2291675, 751733, 1911747, 1496204])]
[3001784 list([327334, 1197547, 2515733])]
[308747
list([8344, 80684, 996504, 2250076, 1905654, 863587, 2235560, 2676079, 1826, 685487, 1481871, 588465, 1126662, 2458841, 2481927])]
[731288 list([2793620, 1115724, 1406934])]
[1523219
list([12825, 1128776, 1761080, 1486798, 2689369, 1040645, 3012606])]]
Might be a little tricky to read, but here's an instance of a smaller row:
[406529 list([900479, 660976, 1270383, 1287181])]
Each row contains a number and a list of numbers, and that list varies in length.
I am having trouble figuring out how exactly I can convert this to a tfrecord file. Any help/hints would be greatly appreciated.
I assume you want the number and the list stored as separate features here.
import tensorflow as tf

writer = tf.python_io.TFRecordWriter('test.tfrecords')
for index in range(my_data.shape[0]):
    example = tf.train.Example(features=tf.train.Features(feature={
        'num_value': tf.train.Feature(int64_list=tf.train.Int64List(value=[my_data[index][0]])),
        'list_value': tf.train.Feature(int64_list=tf.train.Int64List(value=my_data[index][1]))
    }))
    writer.write(example.SerializeToString())
writer.close()
# read data back from the tfrecords file
record_iterator = tf.python_io.tf_record_iterator('test.tfrecords')
for _ in range(2):
    serialized_img_example = next(record_iterator)
    example = tf.train.Example()
    example.ParseFromString(serialized_img_example)
    num_value = example.features.feature['num_value'].int64_list.value[0]
    list_value = example.features.feature['list_value'].int64_list.value
    print(num_value, list_value)

# printed output:
# 1446549 [491827, 30085, 1417541, 799563, 879302, 1997973, 1373049, 1460602, 2240973, 1172992, 1186011, 147536, 1958456, 3095889, 319954, 2191582, 1113354, 302626, 1985611, 1186704, 2231212, 2642148, 386962, 3072993, 1131255, 15085, 2714264, 1363205]
# 406529 [900479, 660976, 1270383, 1287181]
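Since the linked questions mention tf.VarLenFeature(), here is a minimal sketch of parsing these records back inside a graph, with the same feature names as the writer above (TF 1.x API):

    def parse_fn(serialized):
        features = tf.parse_single_example(serialized, features={
            'num_value': tf.FixedLenFeature([], tf.int64),
            'list_value': tf.VarLenFeature(tf.int64),  # handles the variable length
        })
        # VarLenFeature yields a SparseTensor; densify it if you need a plain tensor
        features['list_value'] = tf.sparse_tensor_to_dense(features['list_value'])
        return features

    dataset = tf.data.TFRecordDataset('test.tfrecords').map(parse_fn)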

How can I filter tf.data.Dataset by specific values?

I create a dataset by reading TFRecords, map the values, and want to filter the dataset for specific values; but since the result is a dict with tensors, I am not able to get the actual value of a tensor or to check it with tf.cond() / tf.equal. How can I do that?
def mapping_func(serialized_example):
    feature = {'label': tf.FixedLenFeature([1], tf.string)}
    features = tf.parse_single_example(serialized_example, features=feature)
    return features

def filter_func(features):
    # this doesn't work
    #result = features['label'] == 'some_label_value'
    # neither does this
    result = tf.reshape(tf.equal(features['label'], 'some_label_value'), [])
    return result

def main():
    file_names = ["/var/data/file1.tfrecord", "/var/data/file2.tfrecord"]
    dataset = tf.contrib.data.TFRecordDataset(file_names)
    dataset = dataset.map(mapping_func)
    dataset = dataset.shuffle(buffer_size=10000)
    dataset = dataset.filter(filter_func)
    dataset = dataset.repeat()
    iterator = dataset.make_one_shot_iterator()
    sample = iterator.get_next()
I am answering my own question. I found the issue!
What I needed to do is tf.unstack() the label like this:
label = tf.unstack(features['label'])
label = label[0]
before I give it to tf.equal():
result = tf.reshape(tf.equal(label, 'some_label_value'), [])
I suppose the problem was that the label is defined as an array with one element of type string, tf.FixedLenFeature([1], tf.string), so in order to get the first and only element I had to unpack it (which creates a list) and then get the element at index 0. Correct me if I'm wrong.
I think you don't need to make the label a 1-dimensional array in the first place. With:
    feature = {'label': tf.FixedLenFeature((), tf.string)}
you won't need to unstack the label in your filter_func.
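A sketch of the two functions after that change (same TF 1.x API as the question):

    def mapping_func(serialized_example):
        # A () shape parses the label straight to a scalar string tensor
        feature = {'label': tf.FixedLenFeature((), tf.string)}
        return tf.parse_single_example(serialized_example, features=feature)

    def filter_func(features):
        # tf.equal on a scalar already yields the boolean scalar that filter() expects
        return tf.equal(features['label'], 'some_label_value')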
Reading and filtering a dataset is very easy, and there is no need to unstack anything.
To read the dataset:
print(my_dataset, '\n\n')
## let us print the first 3 records
for record in my_dataset.take(3):
    ## below could be large in case of image
    print(record)
    ## let us print a specific key
    print(record['key2'])
To filter is equally simple:
my_filtereddataset = my_dataset.filter(_filtcond1)
where you define _filtcond1 however you want. Say there is a true/false boolean flag in your dataset; then:
    @tf.function
    def _filtcond1(x):
        return x['key_bool'] == 1
or even a lambda function:
my_filtereddataset = my_dataset.filter(lambda x: x['key_int']>13)
If you are reading a dataset which you haven't created, or you are unaware of the keys (as seems to be the OP's case), you can use this to get an idea of the keys and structure first:
import json
from google.protobuf.json_format import MessageToJson

for raw_record in noidea_dataset.take(1):
    example = tf.train.Example()
    example.ParseFromString(raw_record.numpy())
    ##print(example)  ## if image it will be too long
    m = json.loads(MessageToJson(example))
    print(m['features']['feature'].keys())
Now you can proceed with the filtering.
You should try the apply function of tf.data.TFRecordDataset (see the tensorflow documentation).
Otherwise, read this article to get a better knowledge of TFRecords: TFRecords for humans.
But the most likely situation is that you can neither access nor modify a TFRecord; there is a request on github about this topic: TFRecords request.
My advice is to keep things as easy as you can; you have to remember that you are working with graphs and sessions.
In any case, if everything fails, try the part of the code that does not work inside a tensorflow session, as simply as you can; these operations probably all need to happen while a tf.Session is running.
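For what it's worth, a minimal sketch of the apply route (the 'label' key matches the question's mapping_func):

    def keep_label(target):
        # Package the filter as a whole-dataset transformation for Dataset.apply()
        def _transform(ds):
            return ds.filter(lambda x: tf.equal(x['label'], target))
        return _transform

    dataset = dataset.apply(keep_label('some_label_value'))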

Preprocessing CSV data using tensorflow DataSet API

I'm playing around a bit with tensorflow, but am confused about the input pipeline. The data I'm working on is in a large csv file with 307 columns, of which the first is a string representing a date and the rest are floats.
I'm running into some problems with preprocessing my data. I want to replace the date string with a couple of features based on it (specifically, a sine and a cosine representing the time). I also want to group the next 120 values in the CSV row together as one feature, the 96 after that as one feature, and base my label on the remaining values in the CSV.
This is my code for generating the datasets for now:
import tensorflow as tf

defaults = []
defaults.append([""])
for i in range(0, 306):
    defaults.append([1.0])
def dataset(train_fraction=0.8):
    path = "training_examples_shuffled.csv"

    # Define how the lines of the file should be parsed
    def decode_line(line):
        items = tf.decode_csv(line, record_defaults=defaults)
        datetimeString = items[0]
        minuteFeatures = items[1:121]
        halfHourFeatures = items[121:217]
        labelFeatures = items[217:]
        ## Do something to convert datetimeString to timeSine and timeCosine
        features_dict = {
            'timeSine': timeSine,
            'timeCosine': timeCosine,
            'minuteFeatures': minuteFeatures,
            'halfHourFeatures': halfHourFeatures
        }
        label = [1]  # placeholder. I seem to need some python logic here, but I'm
                     # not sure how to apply that to data in tensor format.
        return features_dict, label
    def in_training_set(line):
        """Returns a boolean tensor, true if the line is in the training set."""
        num_buckets = 1000000
        bucket_id = tf.string_to_hash_bucket_fast(line, num_buckets)
        # Use the hash bucket id as a random number that's deterministic per example
        return bucket_id < int(train_fraction * num_buckets)

    def in_test_set(line):
        """Returns a boolean tensor, true if the line is in the test set."""
        return ~in_training_set(line)

    base_dataset = (tf.data
                    # Get the lines from the file.
                    .TextLineDataset(path))

    train = (base_dataset
             # Take only the training-set lines.
             .filter(in_training_set)
             # Decode each line into a (features_dict, label) pair.
             .map(decode_line))

    # Do the same for the test-set.
    test = (base_dataset.filter(in_test_set).map(decode_line))
    return train, test
My question now is: how can I access the string in the datetimeString Tensor to convert it to a datetime object? Or is this the wrong place to be doing this? I'd like to use the time and the day of the week as input features.
And secondly: Pretty much the same for the label based on the remaining values of the CSV. Can I just use standard python code for this in some way, or should I be using basic tensorflow ops to achieve what I want, if possible?
Finally, any comments on whether this is a decent way of handling my inputs? Tensorflow is a bit confusing, with old tutorials spread around the internet using deprecated ways of handling inputs.
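For the date conversion specifically, a minimal sketch of one possible approach inside decode_line, using tf.py_func (the '%Y-%m-%d %H:%M:%S' format string is an assumption about the data, and py_func carries the usual portability caveats):

    import datetime
    import math
    import numpy as np

    def _time_feats(s):
        # Runs as ordinary python on the raw bytes of the date string
        dt = datetime.datetime.strptime(s.decode('utf-8'), '%Y-%m-%d %H:%M:%S')
        frac = (dt.hour * 3600 + dt.minute * 60 + dt.second) / 86400.0
        return (np.float32(math.sin(2 * math.pi * frac)),
                np.float32(math.cos(2 * math.pi * frac)))

    timeSine, timeCosine = tf.py_func(_time_feats, [datetimeString],
                                      [tf.float32, tf.float32])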

How to generate/read sparse sequence labels for CTC loss within Tensorflow?

From a list of word images with their transcriptions, I am trying to create and read sparse sequence labels (for tf.nn.ctc_loss) using a tf.train.slice_input_producer, avoiding:
- serializing pre-packaged training data to disk in TFRecord format,
- the apparent limitations of tf.py_func,
- any unnecessary or premature padding, and
- reading the entire data set to RAM.
The main issue seems to be converting a string to the sequence of labels (a SparseTensor) needed for tf.nn.ctc_loss.
For example, with the character set in the (ordered) range [A-Z], I'd want to convert the text label string "BAD" to the sequence label class list [1,0,3].
Each example image I want to read contains the text as part of the filename, so it's straightforward to extract and do the conversion in straight up python. (If there's a way to do it within TensorFlow computations, I haven't found it.)
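For instance, the plain-python side of that conversion is a one-liner:

    out_charset = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    labels = [out_charset.index(c) for c in "BAD"]  # -> [1, 0, 3]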
Several previous questions touch on these issues, but I haven't been able to integrate them successfully. For example:
Tensorflow read images with labels shows a straightforward framework with discrete, categorical labels, which I've begun with as a model.
How to load sparse data with TensorFlow? nicely explains an approach for loading sparse data, but assumes pre-packaging tf.train.Examples.
Is there a way to integrate these approaches?
Another example (SO question #38012743) shows how I might delay the conversion from string to list until after dequeuing the filename for decoding, but it relies on tf.py_func, which has caveats. (Should I worry about them?)
I recognize that "SparseTensors don't play well with queues" (per the tf docs), so it might be necessary to do some voodoo on the result (serialization?) before batching, or even rework where the computation happens; I'm open to that.
Following MarvMind's outline, here is a basic framework with the computations I want (iterate over lines containing example filenames, extract each label string and convert to sequence), but I have not successfully determined the "Tensorflow" way to do it.
Thank you for the right "tweak", a more appropriate strategy for my goals, or an indication that tf.py_func won't wreck training efficiency or something else downstream (e.g., loading trained models for future use).
EDIT (+7 hours) I found the missing ops to patch things up. While I still need to verify that this connects with ctc_loss downstream, I have checked that the edited version below correctly batches and reads in the images and sparse tensors.
out_charset = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

def input_pipeline(data_filename):
    filenames, seq_labels = _get_image_filenames_labels(data_filename)
    data_queue = tf.train.slice_input_producer([filenames, seq_labels])
    image, label = _read_data_format(data_queue)
    image, label = tf.train.batch([image, label], batch_size=2, dynamic_pad=True)
    label = tf.deserialize_many_sparse(label, tf.int32)
    return image, label

def _get_image_filenames_labels(data_filename):
    filenames = []
    labels = []
    with open(data_filename) as f:
        for line in f:
            # Carve out the ground truth string and file path from
            # lines formatted like:
            # ./241/7/158_NETWORK_51375.jpg 51375
            filename = line.split(' ', 1)[0][2:]  # split off "./" and number
            # Extract label string embedded within image filename
            # between underscores, e.g. NETWORK
            text = os.path.basename(filename).split('_', 2)[1]
            # Transform string text to sequence of indices using charset, e.g.,
            # NETWORK -> [13, 4, 19, 22, 14, 17, 10]
            indices = [[i] for i in range(0, len(text))]
            values = [out_charset.index(c) for c in list(text)]
            shape = [len(text)]
            label = tf.SparseTensorValue(indices, values, shape)
            label = tf.convert_to_tensor_or_sparse_tensor(label)
            label = tf.serialize_sparse(label)  # needed for batching
            # Add data to lists for conversion
            filenames.append(filename)
            labels.append(label)
    filenames = tf.convert_to_tensor(filenames)
    labels = tf.convert_to_tensor_or_sparse_tensor(labels)
    return filenames, labels

def _read_data_format(data_queue):
    label = data_queue[1]
    raw_image = tf.read_file(data_queue[0])
    image = tf.image.decode_jpeg(raw_image, channels=1)
    return image, label
The key ideas seem to be: create a SparseTensorValue from the data you want, pass it through tf.convert_to_tensor_or_sparse_tensor, and then (if you want to batch the data) serialize it with tf.serialize_sparse. After batching, you can restore the values with tf.deserialize_many_sparse.
Here's the outline. Create the sparse values, convert to tensor, and serialize:
indices = [[i] for i in range(0, len(text))]
values = [out_charset.index(c) for c in list(text)]
shape = [len(text)]
label = tf.SparseTensorValue(indices, values, shape)
label = tf.convert_to_tensor_or_sparse_tensor(label)
label = tf.serialize_sparse(label)  # needed for batching
Then, you can do the batching and deserializing (tf.train.batch requires a batch_size):
image, label = tf.train.batch([image, label], batch_size=batch_size, dynamic_pad=True)
label = tf.deserialize_many_sparse(label, tf.int32)
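To connect this with CTC downstream (a hedged sketch: logits and seq_lengths are assumed to come from your model, with logits shaped [max_time, batch_size, num_classes] and label being the deserialized sparse tensor from above):

    loss = tf.nn.ctc_loss(labels=label, inputs=logits,
                          sequence_length=seq_lengths)
    total_loss = tf.reduce_mean(loss)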
