This is a problem I am working on in Google Cloud Platform with TensorFlow v1.15.
I am working on this notebook
In this section, I am supposed to return a function that feeds model.train()
CSV_COLUMNS = ['fare_amount', 'pickuplon','pickuplat','dropofflon','dropofflat','passengers', 'key']
DEFAULTS = [[0.0], [-74.0], [40.0], [-74.0], [40.7], [1.0], ['nokey']]

# TODO: Create an appropriate input function read_dataset
def read_dataset(filename, mode):
    # TODO: Add CSV decoder function and dataset creation and methods
    return dataset

def get_train_input_fn():
    return read_dataset('./taxi-train.csv', mode = tf.estimator.ModeKeys.TRAIN)

def get_valid_input_fn():
    return read_dataset('./taxi-valid.csv', mode = tf.estimator.ModeKeys.EVAL)
I think it should be like this:
def read_dataset(filename, mode, batch_size = 512):
    def fn():
        def decode_csv(value_column):
            columns = tf.decode_csv(value_column, record_defaults = DEFAULTS)
            features = dict(zip(CSV_COLUMNS, columns))
            label = features.pop(LABEL_COLUMN)
            return features, label

        # Create list of file names that match "glob" pattern (i.e. data_file_*.csv)
        filenames_dataset = tf.data.Dataset.list_files(filename)
        # Read lines from text files
        textlines_dataset = filenames_dataset.flat_map(tf.data.TextLineDataset)
        # Parse text lines as comma-separated values (CSV)
        dataset = textlines_dataset.map(decode_csv)

        if mode == tf.estimator.ModeKeys.TRAIN:
            num_epochs = None # indefinitely
            dataset = dataset.shuffle(buffer_size = 10 * batch_size)
        else:
            num_epochs = 1 # end-of-input after this

        dataset = dataset.repeat(num_epochs).batch(batch_size)
        return dataset
    return fn
That is reflective of the code in the video recap that accompanies this notebook, and very similar to my own attempts before I saw that recap. It is also similar to the code in the next notebook, but unfortunately that code fails as well.
With the above code, I am getting this error:
UnimplementedError: Cast string to float is not supported
[[node linear/head/ToFloat (defined at /usr/local/lib/python3.5/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
So I'm not sure how to transform the data to match the expected datatype. I cannot cast the data in decode_csv like this:
features = {CSV_COLUMNS[i]: float(cols[i]) for i in range(1, len(CSV_COLUMNS) - 1)}
because the error occurs before that line is even reached.
Investigating the data, I note:
import csv

with open('./taxi-train.csv') as f:
    reader = csv.reader(f)
    print(next(reader))
['12.0', '-73.987625', '40.750617', '-73.971163', '40.78518', '1', '0']
That looks like the raw data might actually be strings. Am I correct? How can I solve this?
Edit: I have located the CSV file, and it is not raw string data. Why is the TensorFlow import bringing it in as text?
The training-data-analyst repository you mentioned also has the solutions to all the notebooks.
From analysing the provided solution, it looks like the def fn() part is redundant: the read_dataset function should simply return a tf.data.Dataset:
def read_dataset(filename, mode, batch_size = 512):
    def decode_csv(row):
        columns = tf.decode_csv(row, record_defaults = DEFAULTS)
        features = dict(zip(CSV_COLUMNS, columns))
        features.pop('key') # discard, not a real feature
        label = features.pop('fare_amount') # remove label from features and store
        return features, label

    # Create list of file names that match "glob" pattern (i.e. data_file_*.csv)
    filenames_dataset = tf.data.Dataset.list_files(filename, shuffle=False)
    # Read lines from text files
    textlines_dataset = filenames_dataset.flat_map(tf.data.TextLineDataset)
    # Parse text lines as comma-separated values (CSV)
    dataset = textlines_dataset.map(decode_csv)

    # Note:
    # use tf.data.Dataset.flat_map to apply one to many transformations (here: filename -> text lines)
    # use tf.data.Dataset.map to apply one to one transformations (here: text line -> feature list)

    if mode == tf.estimator.ModeKeys.TRAIN:
        num_epochs = None # loop indefinitely
        dataset = dataset.shuffle(buffer_size = 10 * batch_size, seed=2)
    else:
        num_epochs = 1 # end-of-input after this

    dataset = dataset.repeat(num_epochs).batch(batch_size)
    return dataset
The solutions are located in the same directory as the labs. So, for example, the solution for
training-data-analyst/courses/machine_learning/deepdive/03_tensorflow/labs/c_dataset.ipynb
is located at
training-data-analyst/courses/machine_learning/deepdive/03_tensorflow/c_dataset.ipynb
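For completeness, here is a minimal sketch of how read_dataset and the wrapper input functions above would plug into the Estimator API. The LinearRegressor and the feature columns below are assumptions based on the rest of this lab, not part of the provided solution:

import tensorflow as tf

# Hypothetical feature columns: every CSV column except the label and the key
INPUT_COLUMNS = ['pickuplon', 'pickuplat', 'dropofflon', 'dropofflat', 'passengers']
feature_cols = [tf.feature_column.numeric_column(c) for c in INPUT_COLUMNS]

model = tf.estimator.LinearRegressor(feature_columns=feature_cols, model_dir='taxi_trained')

# get_train_input_fn / get_valid_input_fn are no-argument callables that return a
# tf.data.Dataset, which is exactly what Estimator.train / Estimator.evaluate expect.
model.train(input_fn=get_train_input_fn, max_steps=500)
metrics = model.evaluate(input_fn=get_valid_input_fn)
print(metrics)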
Trying to tokenize and encode data to feed to a neural network.
I only have 25GB of RAM, and every time I try to run the code below my Google Colab crashes. Any idea how to prevent this from happening? “Your session crashed after using all available RAM”
I thought tokenizing/encoding chunks of 50,000 sentences would work, but unfortunately it does not.
The code works on a dataset with length 1.3 million. The current dataset has a length of 5 million.
max_q_len = 128
max_a_len = 64
trainq_list = train_q.tolist()
batch_size = 50000

def batch_encode(text, max_seq_len):
    for i in range(0, len(trainq_list), batch_size):
        encoded_sent = tokenizer.batch_encode_plus(
            text,
            max_length = max_seq_len,
            pad_to_max_length=True,
            truncation=True,
            return_token_type_ids=False
        )
    return encoded_sent

# tokenize and encode sequences in the training set
tokensq_train = batch_encode(trainq_list, max_q_len)
The tokenizer comes from HuggingFace:
tokenizer = BertTokenizerFast.from_pretrained('bert-base-multilingual-uncased')
You should use generators and pass data to tokenizer.batch_encode_plus, no matter the size.
Conceptually, something like this:
Training list
This one probably holds a list of sentences, which is read from some file(s). If this is a single large file, you could follow this answer to read parts of the input lazily (preferably batch_size lines at once):
def read_in_chunks(file_object, chunk_size=1024):
    """Lazy function (generator) to read a file piece by piece.
    Default chunk size: 1k."""
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data
Otherwise, open a single file at a time (each much smaller than memory, because the data will become much larger after encoding with BERT), something like this:
import pathlib

def read_in_chunks(directory: pathlib.Path):
    # Use "*.txt" or any other extension your files might have
    for file in directory.glob("*"):
        with open(file, "r") as f:
            yield f.readlines()
Encoding
The encoder should take this generator and yield back encoded parts, something like this:
# Generator should create lists useful for encoding
def batch_encode(generator, max_seq_len):
    tokenizer = BertTokenizerFast.from_pretrained("bert-base-multilingual-uncased")
    for text in generator:
        yield tokenizer.batch_encode_plus(
            text,
            max_length=max_seq_len,
            pad_to_max_length=True,
            truncation=True,
            return_token_type_ids=False,
        )
Saving encoded files
As the files will be too large to fit in RAM, you should save them to disk (or use them somehow as they are generated).
Something along these lines:
import numpy as np

# I assume np.arrays are created, adjust to PyTorch Tensors or anything if needed
def save(encoding_generator):
    for i, encoded in enumerate(encoding_generator):
        np.save(str(i), encoded)
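Wiring the pieces together might look like the sketch below. The data/ directory and the sequence length are illustrative assumptions, not details from the question:

import pathlib

sentence_chunks = read_in_chunks(pathlib.Path("data"))            # yields lists of sentences
encoded_chunks = batch_encode(sentence_chunks, max_seq_len=128)   # yields encoded batches lazily
save(encoded_chunks)                                              # writes 0.npy, 1.npy, ... to disk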
I'm trying to follow the "Load using tf.data" part of this tutorial. In the tutorial, they can get away with only working with string Tensors, however, I need to extract the string representation of the filename, as I need to look up extra data from a dictionary. I can't seem to extract the string part of a Tensor. I'm pretty sure the .name attribute of a Tensor should return the string, but I keep getting an error message saying KeyError: 'strided_slice_1:0' so somehow, the slicing is doing something weird?
I'm loading the dataset using:
dataset_list = tf.data.Dataset.list_files(str(DATASET_DIR / "data/*"))
and then process it using:
def process(t):
    return dataset.process_image_path(t, param_data, param_min_max)

dataset_labeled = dataset_list.map(
    process,
    num_parallel_calls=AUTOTUNE)
where param_data and param_min_max are two dictionaries I've loaded that contain extra data needed to construct the label.
These are the three functions that I use to process the data Tensors (from my dataset.py):
def process_image_path(image_path, param_data_file, param_max_min_file):
    label = path_to_label(image_path, param_data_file, param_max_min_file)
    img = tf.io.read_file(image_path)
    img = decode_img(img)
    return (img, label)

def decode_img(img):
    """Converts an image to a 3D uint8 tensor"""
    img = tf.image.decode_jpeg(img, channels=3)
    img = tf.image.convert_image_dtype(img, tf.float32)
    return img

def path_to_label(image_path, param_data_file, param_max_min_file):
    """Returns the NORMALIZED label (set of parameter values) of an image."""
    parts = tf.strings.split(image_path, os.path.sep)
    filename = parts[-1]  # Extract filename with extension
    filename = tf.strings.split(filename, ".")[0].name  # Extract filename
    param_data = param_data_file[filename]  # ERROR! .name above doesn't seem to return just the filename

    P = len(param_max_min_file)
    label = np.zeros(P)
    i = 0
    while i < P:
        param = param_max_min_file[i]
        umin = param["user_min"]
        umax = param["user_max"]
        sub_index = param["sub_index"]
        identifier = param["identifier"]
        node = param["node_name"]
        value = param_data[node][identifier]
        label[i] = _normalize(value[sub_index])
        i += 1
    return label
I have verified that filename = tf.strings.split(filename, ".")[0] in path_to_label() does return the correct Tensor, but I need it as a string. The whole thing is proving difficult to debug as well, as I can't access attributes when debugging (I get errors saying AttributeError: Tensor.name is meaningless when eager execution is enabled.).
The name field is a name for the tensor itself, not the content of the tensor.
To do a regular python dictionary lookup, wrap your parsing function in tf.py_func.
import tensorflow as tf
tf.enable_eager_execution()

d = {"a": 1, "b": 3, "c": 10}
dataset = tf.data.Dataset.from_tensor_slices(["a", "b", "c"])

def parse(s):
    # py_func hands the string in as bytes (in Python 3), so decode it before the dict lookup
    return s, d[s.decode()]

dataset = dataset.map(lambda s: tf.py_func(parse, [s], (tf.string, tf.int64)))

for element in dataset:
    print(element[1].numpy())  # prints 1, 3, 10
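In TF 2.x the same idea works with tf.py_function, which hands the mapping function eager tensors; a minimal sketch under that assumption:

import tensorflow as tf

d = {"a": 1, "b": 3, "c": 10}
dataset = tf.data.Dataset.from_tensor_slices(["a", "b", "c"])

def parse(s):
    # s is an EagerTensor here; .numpy() gives the bytes for a plain-Python dict lookup
    return s, d[s.numpy().decode()]

dataset = dataset.map(lambda s: tf.py_function(parse, [s], (tf.string, tf.int64)))

for element in dataset:
    print(element[1].numpy())  # 1, 3, 10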
I create a dataset by reading the TFRecords, map the values, and want to filter the dataset for specific values; but since the result is a dict of tensors, I am not able to get the actual value of a tensor or to check it with tf.cond() / tf.equal(). How can I do that?
def mapping_func(serialized_example):
    feature = { 'label': tf.FixedLenFeature([1], tf.string) }
    features = tf.parse_single_example(serialized_example, features=feature)
    return features

def filter_func(features):
    # this doesn't work
    #result = features['label'] == 'some_label_value'
    # neither this
    result = tf.reshape(tf.equal(features['label'], 'some_label_value'), [])
    return result

def main():
    file_names = ["/var/data/file1.tfrecord", "/var/data/file2.tfrecord"]
    dataset = tf.contrib.data.TFRecordDataset(file_names)
    dataset = dataset.map(mapping_func)
    dataset = dataset.shuffle(buffer_size=10000)
    dataset = dataset.filter(filter_func)
    dataset = dataset.repeat()
    iterator = dataset.make_one_shot_iterator()
    sample = iterator.get_next()
I am answering my own question. I found the issue!
What I needed to do is tf.unstack() the label like this:
label = tf.unstack(features['label'])
label = label[0]
before I give it to tf.equal():
result = tf.reshape(tf.equal(label, 'some_label_value'), [])
I suppose the problem was that the label is defined as an array with one element of type string, tf.FixedLenFeature([1], tf.string), so in order to get the first and single element I had to unpack it (which creates a list) and then get the element with index 0. Correct me if I'm wrong.
I think you don't need to make the label a 1-dimensional array in the first place. With:
feature = {'label': tf.FixedLenFeature((), tf.string)}
you won't need to unstack the label in your filter_func.
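A minimal sketch of how that simplifies the pipeline from the question (keeping the same function names); since the feature is now a scalar, the filter is a direct comparison:

def mapping_func(serialized_example):
    # Scalar string feature instead of a length-1 array
    feature = {'label': tf.FixedLenFeature((), tf.string)}
    return tf.parse_single_example(serialized_example, features=feature)

def filter_func(features):
    # No unstack/reshape needed: features['label'] is already a scalar tensor
    return tf.equal(features['label'], 'some_label_value')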
Reading and filtering a dataset is very easy, and there is no need to unstack anything.
To read the dataset:
print(my_dataset, '\n\n')

## let us print the first 3 records
for record in my_dataset.take(3):
    ## below could be large in case of image
    print(record)
    ## let us print a specific key
    print(record['key2'])
To filter is equally simple:
my_filtereddataset = my_dataset.filter(_filtcond1)
where you define _filtcond1 however you want. Let's say there is a true/false boolean flag in your dataset; then:
@tf.function
def _filtcond1(x):
    return x['key_bool'] == 1
or even a lambda function:
my_filtereddataset = my_dataset.filter(lambda x: x['key_int']>13)
If you are reading a dataset which you haven't created, or you are unaware of the keys (as seems to be the OP's case), you can use this to get an idea of the keys and structure first:
import json
from google.protobuf.json_format import MessageToJson

for raw_record in noidea_dataset.take(1):
    example = tf.train.Example()
    example.ParseFromString(raw_record.numpy())
    ##print(example) ##if image it will be too long
    m = json.loads(MessageToJson(example))
    print(m['features']['feature'].keys())
Now you can proceed with the filtering
You should try to use the apply function of tf.data.TFRecordDataset (see the tensorflow documentation).
Otherwise, read this article about TFRecords to get a better understanding of them: TFRecords for humans.
But the most likely situation is that you can neither access nor modify a TFRecord directly; there is a request on GitHub about this topic: TFRecords request.
My advice is to keep things as simple as you can; keep in mind that you are working with graphs and sessions.
In any case, if everything fails, try running the part of the code that does not work inside a TensorFlow session, reduced to the simplest form you can; these operations probably all need to be done while a tf.Session is running.
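For reference, here is a minimal sketch of what Dataset.apply looks like; the transformation shown is illustrative, not something taken from the answer above:

import tensorflow as tf

# Dataset.apply takes a whole-dataset transformation: a function from Dataset to Dataset.
def keep_some_label(ds):
    return ds.filter(lambda features: tf.reshape(
        tf.equal(features['label'], 'some_label_value'), []))

# Equivalent to keep_some_label(dataset), but chains nicely with other pipeline ops:
# dataset = dataset.apply(keep_some_label)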
I'm playing around a bit with tensorflow, but am a bit confused about the input pipeline. The data I'm working on is in a large csv file, with 307 columns, of which the first is a string representing a date, and the rest are floats.
I'm running into some problems with preprocessing my data. I want to add a couple of features that replace, but are based on, the date string (specifically, a sine and a cosine representing the time). I also want to group the next 120 values in the CSV row together as one feature, the 96 values after that as another feature, and base my label off the remaining values in the CSV.
This is my code for generating the datasets for now:
import tensorflow as tf

defaults = []
defaults.append([""])
for i in range(0, 306):
    defaults.append([1.0])

def dataset(train_fraction=0.8):
    path = "training_examples_shuffled.csv"

    # Define how the lines of the file should be parsed
    def decode_line(line):
        items = tf.decode_csv(line, record_defaults=defaults)
        datetimeString = items[0]
        minuteFeatures = items[1:121]
        halfHourFeatures = items[121:217]
        labelFeatures = items[217:]

        ## Do something to convert datetimeString to timeSine and timeCosine

        features_dict = {
            'timeSine': timeSine,
            'timeCosine': timeCosine,
            'minuteFeatures': minuteFeatures,
            'halfHourFeatures': halfHourFeatures
        }

        label = [1]  # placeholder. I seem to need some python logic here,
                     # but I'm not sure how to apply that to data in tensor format.
        return features_dict, label

    def in_training_set(line):
        """Returns a boolean tensor, true if the line is in the training set."""
        num_buckets = 1000000
        bucket_id = tf.string_to_hash_bucket_fast(line, num_buckets)
        # Use the hash bucket id as a random number that's deterministic per example
        return bucket_id < int(train_fraction * num_buckets)

    def in_test_set(line):
        """Returns a boolean tensor, true if the line is in the test set."""
        return ~in_training_set(line)

    base_dataset = (tf.data
                    # Get the lines from the file.
                    .TextLineDataset(path))

    train = (base_dataset
             # Take only the training-set lines.
             .filter(in_training_set)
             # Decode each line into a (features_dict, label) pair.
             .map(decode_line))

    # Do the same for the test-set.
    test = (base_dataset.filter(in_test_set).map(decode_line))

    return train, test
My question now is: how can I access the string in the datetimeString Tensor to convert it to a datetime object? Or is this the wrong place to be doing this? I'd like to use the time and the day of the week as input features.
And secondly: Pretty much the same for the label based on the remaining values of the CSV. Can I just use standard python code for this in some way, or should I be using basic tensorflow ops to achieve what I want, if possible?
Finally, any comments on whether this is a decent way of handling my inputs? Tensorflow is a bit confusing, with old tutorials spread around the internet using deprecated ways of handling inputs.
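One possible approach for the datetime part, mirroring the tf.py_func answer earlier in this document, is to do the parsing in plain Python inside the pipeline. This is only a sketch; the date format string and the helper name are assumptions:

import datetime
import math
import numpy as np
import tensorflow as tf

def _time_features(date_bytes):
    # Runs as plain Python via tf.py_func; adjust the format string to the actual data
    dt = datetime.datetime.strptime(date_bytes.decode(), "%Y-%m-%d %H:%M:%S")
    seconds = dt.hour * 3600 + dt.minute * 60 + dt.second
    angle = 2.0 * math.pi * seconds / 86400.0
    return np.float32(math.sin(angle)), np.float32(math.cos(angle))

# Inside decode_line:
# timeSine, timeCosine = tf.py_func(_time_features, [datetimeString], (tf.float32, tf.float32))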
From a list of word images with their transcriptions, I am trying to create and read sparse sequence labels (for tf.nn.ctc_loss) using a tf.train.slice_input_producer, avoiding:
serializing pre-packaged training data to disk in TFRecord format,
the apparent limitations of tf.py_func,
any unnecessary or premature padding, and
reading the entire data set to RAM.
The main issue seems to be converting a string to the sequence of labels (a SparseTensor) needed for tf.nn.ctc_loss.
For example, with the character set in the (ordered) range [A-Z], I'd want to convert the text label string "BAD" to the sequence label class list [1,0,3].
Each example image I want to read contains the text as part of the filename, so it's straightforward to extract and do the conversion in straight up python. (If there's a way to do it within TensorFlow computations, I haven't found it.)
Several previous questions glance at these issues, but I haven't been able to integrate them successfully. For example,
Tensorflow read images with labels
shows a straightforward framework with discrete, categorical labels, which I've begun with as a model.
How to load sparse data with TensorFlow?
nicely explains an approach for loading sparse data, but assumes pre-packaging tf.train.Examples.
Is there a way to integrate these approaches?
Another example (SO question #38012743) shows how I might delay the conversion from string to list until after dequeuing the filename for decoding, but it relies on tf.py_func, which has caveats. (Should I worry about them?)
I recognize that "SparseTensors don't play well with queues" (per the tf docs), so it might be necessary to do some voodoo on the result (serialization?) before batching, or even rework where the computation happens; I'm open to that.
Following MarvMind's outline, here is a basic framework with the computations I want (iterate over lines containing example filenames, extract each label string and convert to sequence), but I have not successfully determined the "Tensorflow" way to do it.
Thank you for the right "tweak", a more appropriate strategy for my goals, or an indication that tf.py_func won't wreck training efficiency or something else downstream (e.g., loading trained models for future use).
EDIT (+7 hours): I found the missing ops to patch things up. While I still need to verify this connects with CTC loss downstream, I have checked that the edited version below correctly batches and reads in the images and sparse tensors.
out_charset = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

def input_pipeline(data_filename):
    filenames, seq_labels = _get_image_filenames_labels(data_filename)
    data_queue = tf.train.slice_input_producer([filenames, seq_labels])
    image, label = _read_data_format(data_queue)
    image, label = tf.train.batch([image, label], batch_size=2, dynamic_pad=True)
    label = tf.deserialize_many_sparse(label, tf.int32)
    return image, label

def _get_image_filenames_labels(data_filename):
    filenames = []
    labels = []
    with open(data_filename) as f:
        for line in f:
            # Carve out the ground truth string and file path from
            # lines formatted like:
            # ./241/7/158_NETWORK_51375.jpg 51375
            filename = line.split(' ', 1)[0][2:]  # split off "./" and number
            # Extract label string embedded within image filename
            # between underscores, e.g. NETWORK
            text = os.path.basename(filename).split('_', 2)[1]
            # Transform string text to sequence of indices using charset, e.g.,
            # NETWORK -> [13, 4, 19, 22, 14, 17, 10]
            indices = [[i] for i in range(0, len(text))]
            values = [out_charset.index(c) for c in list(text)]
            shape = [len(text)]
            label = tf.SparseTensorValue(indices, values, shape)
            label = tf.convert_to_tensor_or_sparse_tensor(label)
            label = tf.serialize_sparse(label)  # needed for batching
            # Add data to lists for conversion
            filenames.append(filename)
            labels.append(label)
    filenames = tf.convert_to_tensor(filenames)
    labels = tf.convert_to_tensor_or_sparse_tensor(labels)
    return filenames, labels

def _read_data_format(data_queue):
    label = data_queue[1]
    raw_image = tf.read_file(data_queue[0])
    image = tf.image.decode_jpeg(raw_image, channels=1)
    return image, label
The key ideas seem to be creating a SparseTensorValue from the data you want, passing it through tf.convert_to_tensor_or_sparse_tensor, and then (if you want to batch the data) serializing it with tf.serialize_sparse. After batching, you can restore the values with tf.deserialize_many_sparse.
Here's the outline. Create the sparse values, convert to tensor, and serialize:
indices = [[i] for i in range(0,len(text))]
values = [out_charset.index(c) for c in list(text)]
shape = [len(text)]
label = tf.SparseTensorValue(indices,values,shape)
label = tf.convert_to_tensor_or_sparse_tensor(label)
label = tf.serialize_sparse(label) # needed for batching
Then, you can do the batching and deserialize:
image, label = tf.train.batch([image, label], batch_size=2, dynamic_pad=True)
label = tf.deserialize_many_sparse(label, tf.int32)