Tokenizing & encoding dataset uses too much RAM - python

Trying to tokenize and encode data to feed to a neural network.
I only have 25GB RAM and every time I try to run the code below my Google Colab crashes with "Your session crashed after using all available RAM". Any idea how to prevent this from happening?
I thought tokenizing/encoding chunks of 50,000 sentences would work, but unfortunately not.
The code works on a dataset of length 1.3 million. The current dataset has a length of 5 million.
max_q_len = 128
max_a_len = 64

trainq_list = train_q.tolist()
batch_size = 50000

def batch_encode(text, max_seq_len):
    for i in range(0, len(trainq_list), batch_size):
        encoded_sent = tokenizer.batch_encode_plus(
            text,
            max_length=max_seq_len,
            pad_to_max_length=True,
            truncation=True,
            return_token_type_ids=False
        )
    return encoded_sent
# tokenize and encode sequences in the training set
tokensq_train = batch_encode(trainq_list, max_q_len)
The tokenizer comes from HuggingFace:
tokenizer = BertTokenizerFast.from_pretrained('bert-base-multilingual-uncased')

You should use generators and pass data to tokenizer.batch_encode_plus, no matter the size.
Conceptually, something like this:
Training list
This one probably holds a list of sentences, which is read from some file(s). If this is a single large file, you could follow this answer to lazily read parts of the input (preferably batch_size lines at once):
def read_in_chunks(file_object, chunk_size=1024):
    """Lazy function (generator) to read a file piece by piece.
    Default chunk size: 1k."""
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data
Otherwise open the files one at a time (each much smaller than memory, because the data will be way larger after encoding using BERT), something like this:
import pathlib

def read_in_chunks(directory: pathlib.Path):
    # Use "*.txt" or any other extension your files might have
    for file in directory.glob("*"):
        with open(file, "r") as f:
            yield f.readlines()
Encoding
Encoder should take this generator and yield back encoded parts, something like this:
# Generator should create lists useful for encoding
def batch_encode(generator, max_seq_len):
    tokenizer = BertTokenizerFast.from_pretrained("bert-base-multilingual-uncased")
    for text in generator:
        yield tokenizer.batch_encode_plus(
            text,
            max_length=max_seq_len,
            pad_to_max_length=True,
            truncation=True,
            return_token_type_ids=False,
        )
Saving encoded files
As the files will be too large to fit in RAM, you should save them to disk (or use them somehow as they are generated).
Something along those lines:
import numpy as np

# I assume np.arrays are created; adjust to PyTorch Tensors or anything else if needed
def save(encoding_generator):
    for i, encoded in enumerate(encoding_generator):
        np.save(str(i), encoded)
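Putting it together, a minimal usage sketch (assuming the directory-based reader above and a hypothetical data/ folder of text files):

import pathlib

# Stream files from "data/" (hypothetical path), encode batch by batch, save each batch to disk
generator = read_in_chunks(pathlib.Path("data"))
encoded_generator = batch_encode(generator, max_seq_len=128)
save(encoded_generator)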

Related

I/O Issues in Loading Several Large H5PY Files (Pytorch)

I've run into a problem!
Recently I hit an I/O issue. The target and input data are stored in h5py files. Each target file is 2.6GB while each input file is 10.2GB. I have 5 input datasets and 5 target datasets in total.
I created a custom Dataset class for each h5py file and then use the data.ConcatDataset class to link all the datasets. The custom Dataset class is:
class MydataSet(Dataset):
    def __init__(self, indx=1, root_path='./xxx', tar_size=128, data_aug=True, train=True):
        self.train = train
        if self.train:
            self.in_file = pth.join(root_path, 'train', 'train_noisy_%d.h5' % indx)
            self.tar_file = pth.join(root_path, 'train', 'train_clean_%d.h5' % indx)
        else:
            self.in_file = pth.join(root_path, 'test', 'test_noisy.h5')
            self.tar_file = pth.join(root_path, 'test', 'test_clean.h5')
        self.h5f_n = h5py.File(self.in_file, 'r', driver='core')
        self.h5f_c = h5py.File(self.tar_file, 'r')
        self.keys_n = list(self.h5f_n.keys())
        self.keys_c = list(self.h5f_c.keys())
        # h5f_n.close()
        # h5f_c.close()
        self.tar_size = tar_size
        self.data_aug = data_aug

    def __len__(self):
        return len(self.keys_n)

    def __del__(self):
        self.h5f_n.close()
        self.h5f_c.close()

    def __getitem__(self, index):
        keyn = self.keys_n[index]
        keyc = self.keys_c[index]
        datan = np.array(self.h5f_n[keyn])
        datac = np.array(self.h5f_c[keyc])
        datan_tensor = torch.from_numpy(datan).unsqueeze(0)
        datac_tensor = torch.from_numpy(datac)
        if self.data_aug and np.random.randint(2, size=1)[0] == 1:  # horizontal flip
            datan_tensor = torch.flip(datan_tensor, dims=[2])  # c h w
            datac_tensor = torch.flip(datac_tensor, dims=[2])
        # return the noisy/clean pair
        return datan_tensor, datac_tensor
Then I use dataset_train = data.ConcatDataset([MydataSet(indx=index, train=True) for index in range(1, 6)]) for training. When only 2-3 h5py files are used, the I/O speed is normal and everything goes right. However, when 5 files are used, the training speed gradually decreases (from 5 iterations/s to 1 iteration/s). I changed num_workers and the problem still exists.
Could anyone give me a solution? Should I merge several h5py files into a bigger one? Or are there other methods? Thanks in advance!
Improving performance requires timing benchmarks. To do that you need to identify potential bottlenecks and associated scenarios. You said "with 2-3 files the I/O speed is normal" and "when 5 files are used, the training speed gradually decreases". So, is your performance issue I/O speed, or training speed? Or do you know? If you don't know, you need to isolate and compare I/O performance and training performance separately for the 2 scenarios.
In other words, to measure I/O performance (only) you need to run the following tests:
Time to read and concatenate 2-3 files,
Time to read and concatenate 5 files,
Copy the 5 files into 1, and time the read from the merged file,
Or, link the 5 files to 1 file, and time.
And to measure training speed (only) you need to compare performance for the following tests:
Merge 2-3 files, then read and train from the merged file.
Merge all 5 files, then read and train from merged file.
Or, link the 5 files to 1 file, then read and train from linked file.
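To isolate raw read performance for these comparisons, a minimal timing sketch along these lines could be used (the file names below are placeholders; adjust to your train_noisy/train_clean files and the merged or linked file):

import time
import h5py
import numpy as np

def time_full_read(fname):
    """Read every dataset in the file into memory and report the elapsed time."""
    start = time.perf_counter()
    with h5py.File(fname, 'r') as f:
        for ds in f.keys():
            _ = np.array(f[ds])
    print(f'{fname}: {time.perf_counter() - start:.2f} s')

# Compare individual files against a merged (or linked) file, e.g.:
for name in ['train_noisy_1.h5', 'merge_5.h5']:  # placeholder names
    time_full_read(name)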
As noted in my comment, merging (or linking) multiple HDF5 files into one is easy if all datasets are at the root level and all dataset names are unique. I added the external link method because it might provide the same performance, without duplicating large data files.
Below is the code that shows both methods. Substitute your file names in the fnames list, and it should be ready to run. If your dataset names aren't unique, you will need to create unique names, and assign in h5fr.copy() -- like this: h5fr.copy(h5fr[ds],h5fw,'unique_dataset_name')
Code to merge -or- link files (comment/uncomment lines as appropriate):
import h5py

fnames = ['file_1.h5', 'file_2.h5', 'file_3.h5']

# consider changing filename to 'linked_' when using links:
with h5py.File(f'merge_{len(fnames)}.h5', 'w') as h5fw:
    for fname in fnames:
        with h5py.File(fname, 'r') as h5fr:
            for ds in h5fr.keys():
                # To copy datasets into 1 file use:
                h5fr.copy(h5fr[ds], h5fw)
                # To link datasets to 1 file use:
                # h5fw[ds] = h5py.ExternalLink(fname, ds)

Batch-train word2vec in gensim with support of multiple workers

Context
There exist several questions about how to train Word2Vec using gensim with streamed data. However, these questions don't deal with the issue that streaming cannot use multiple workers, since there is no array to split between threads.
Hence I wanted to create a generator providing such functionality for gensim. My result looks like:
import threading

import numpy as np
from gensim.models import Word2Vec as w2v

# The data is stored in a python list and unsplit.
# It's too much data to store it split, so I have to do the split while streaming.
data = ['this is document one', 'this is document two', ...]

# Now the generator class
class dataGenerator:
    """
    Generator for batch-tokenization.
    """

    def __init__(self, data: list, batch_size: int = 40):
        """Initialize generator and pass data."""
        self.data = data
        self.batch_size = batch_size
        self.lock = threading.Lock()

    def __len__(self):
        """Get total number of batches."""
        return int(np.ceil(len(self.data) / float(self.batch_size)))

    def __iter__(self) -> list:
        """
        Iterator-wrapper for generator-functionality (since generators cannot be used directly).
        Allows for data-streaming.
        """
        for idx in range(len(self)):
            yield self[idx]

    def __getitem__(self, idx):
        # Make multithreading thread-safe
        with self.lock:
            # Return the current batch by slicing the data.
            return [arr.split(" ") for arr in self.data[idx * self.batch_size : (idx + 1) * self.batch_size]]

# And now do the training
model = w2v(
    sentences=dataGenerator(data),
    size=300,
    window=5,
    min_count=1,
    workers=4
)
This results in the error
TypeError: unhashable type: 'list'
Since dataGenerator(data) would work if I'd just yield a single split document, I assume that gensim's word2vec wraps the generator within an extra list. In this case the __iter__ would look like:
def __iter__(self) -> list:
    """
    Iterator-wrapper for generator-functionality (since generators cannot be used directly).
    Allows for data-streaming.
    """
    for text in self.data:
        yield text.split(" ")
Hence, my batch would also be wrapped, resulting in something like [[['this', '...'], ['this', '...']], [[...], [...]]] (a list of lists of lists), which cannot be processed by gensim.
My questions:
Can I "stream"-pass batches in order to use multiple workers?
How can I change my code accordingly?
It seems I was too impatient. I ran the streaming function written above, which processes only one document at a time instead of a batch:
def __iter__(self) -> list:
    """
    Iterator-wrapper for generator-functionality (since generators cannot be used directly).
    Allows for data-streaming.
    """
    for text in self.data:
        yield text.split(" ")
After starting the w2v function it took about ten minutes until all cores were working correctly.
It seems that building the vocabulary does not support multiple cores, and hence only one was used for this task. Presumably it took so long because of the corpus size. After gensim built the vocab, all cores were used for the training.
So if you are running into this issue as well, maybe some patience will already help :)
Just want to reiterate that @gojomo's comment is the way to go: with a large corpus and multiple CPUs, it's much faster to train gensim word2vec using the corpus_file parameter instead of sentences, as mentioned in the docs:
corpus_file (str, optional) – Path to a corpus file in LineSentence format. You may use this argument instead of sentences to get performance boost. Only one of sentences or corpus_file arguments need to be passed (or none of them, in that case, the model is left uninitialized).
LineSentence format is basically just one sentence per line, with words space-separated. Plain text, .bz2 or gz.
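A minimal sketch of that approach (assuming the corpus has already been written to disk as one space-separated sentence per line; note that gensim 4.x names the dimensionality parameter vector_size, where older versions used size as in the code above):

from gensim.models import Word2Vec

# corpus.txt is a placeholder path: plain text, one sentence per line, words space-separated
model = Word2Vec(
    corpus_file="corpus.txt",
    vector_size=300,   # called `size` in gensim < 4.0
    window=5,
    min_count=1,
    workers=4,
)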

Preprocessing CSV data using tensorflow DataSet API

I'm playing around a bit with tensorflow, but am a bit confused about the input pipeline. The data I'm working on is in a large csv file, with 307 columns, of which the first is a string representing a date, and the rest are floats.
I'm running into some problems with preprocessing my data. I want to add a couple of features that replace, but are based on, the date string (specifically, a sine and a cosine representing the time). I also want to group the next 120 values in the CSV row together as one feature, the 96 after that as another feature, and base my label on the remaining values in the CSV.
This is my code for generating the datasets for now:
import tensorflow as tf

defaults = []
defaults.append([""])
for i in range(0, 306):
    defaults.append([1.0])

def dataset(train_fraction=0.8):
    path = "training_examples_shuffled.csv"

    # Define how the lines of the file should be parsed
    def decode_line(line):
        items = tf.decode_csv(line, record_defaults=defaults)
        datetimeString = items[0]
        minuteFeatures = items[1:121]
        halfHourFeatures = items[121:217]
        labelFeatures = items[217:]

        ## Do something to convert datetimeString to timeSine and timeCosine

        features_dict = {
            'timeSine': timeSine,
            'timeCosine': timeCosine,
            'minuteFeatures': minuteFeatures,
            'halfHourFeatures': halfHourFeatures
        }

        label = [1]  # placeholder. I seem to need some python logic here,
                     # but I'm not sure how to apply that to data in tensor format.
        return features_dict, label

    def in_training_set(line):
        """Returns a boolean tensor, true if the line is in the training set."""
        num_buckets = 1000000
        bucket_id = tf.string_to_hash_bucket_fast(line, num_buckets)
        # Use the hash bucket id as a random number that's deterministic per example
        return bucket_id < int(train_fraction * num_buckets)

    def in_test_set(line):
        """Returns a boolean tensor, true if the line is in the test set."""
        return ~in_training_set(line)

    base_dataset = (tf.data
                    # Get the lines from the file.
                    .TextLineDataset(path))

    train = (base_dataset
             # Take only the training-set lines.
             .filter(in_training_set)
             # Decode each line into a (features_dict, label) pair.
             .map(decode_line))

    # Do the same for the test-set.
    test = (base_dataset.filter(in_test_set).map(decode_line))

    return train, test
My question now is: how can I access the string in the datetimeString Tensor to convert it to a datetime object? Or is this the wrong place to be doing this? I'd like to use the time and the day of the week as input features.
And secondly: Pretty much the same for the label based on the remaining values of the CSV. Can I just use standard python code for this in some way, or should I be using basic tensorflow ops to achieve what I want, if possible?
Finally, any comments on whether this is a decent way of handling my inputs? Tensorflow is a bit confusing, with old tutorials spread around the internet using deprecated ways of handling inputs.
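For the datetime part specifically, one common route in the TF 1.x Dataset API is tf.py_func, which lets ordinary Python run inside the map step; the sketch below is only a hedged illustration (the date format and the wiring into decode_line are assumptions, not part of the question's code):

import datetime
import numpy as np
import tensorflow as tf

def _time_features(date_bytes):
    # Plain-Python conversion of the date string to sine/cosine of the time of day.
    # The format string here is an assumption; adjust it to the actual CSV contents.
    dt = datetime.datetime.strptime(date_bytes.decode(), "%Y-%m-%d %H:%M")
    fraction_of_day = (dt.hour * 60 + dt.minute) / (24.0 * 60.0)
    angle = 2.0 * np.pi * fraction_of_day
    return np.float32(np.sin(angle)), np.float32(np.cos(angle))

# Inside decode_line, the string tensor could then be routed through tf.py_func:
# timeSine, timeCosine = tf.py_func(_time_features, [datetimeString], [tf.float32, tf.float32])
# timeSine.set_shape([]); timeCosine.set_shape([])

The usual caveat applies: tf.py_func runs as ordinary Python in the same process and is not serialized with the graph.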

Batch reading and writing, from textfile to HDF5 in python

Goal is to feed large datasets to Tensorflow. I came to the following implementation. However, while I/O of HDF5 is supposed to be very fast, my implementation is slow. Is this due to not using the chunks function? I do not seem to get the dimensions right for the chunks; should I see this as a third dimension, like (4096, 7, 1000) for chunk size 1000?
Please note, I could have simplified my code below more by finding a solution for a single generator. However, I think the data/label combination is very common and useful for others.
I use the following function to create two generators, one for the data and one for the corresponding labels.
def read_chunks(file, dim, batch_size=batch_size):
    chunk = np.empty(dim,)
    current_size = 1
    # read input file line by line
    for line in file:
        current_size += 1
        # build chunk
        chunk = np.vstack((chunk, np.genfromtxt(io.BytesIO(line.encode()))))
        # reaches batch size
        if current_size == batch_size:
            yield chunk
            # reset counters
            current_size = 1
            chunk = np.empty(dim,)
Then I wish to move the data and labels produced by these generators to HDF5.
def write_h5(data_gen, label_gen, out_file, batch_size, h5_batch_size, data_dtype, label_dtype):
    # remove existing file
    if os.path.isfile(out_file):
        os.remove(out_file)

    with h5py.File(out_file, 'a') as f:
        # create a dataset and labelset in the same file
        d = f.create_dataset('data', (batch_size, data_dim), maxshape=(None, data_dim), dtype=data_dtype)
        l = f.create_dataset('label', (batch_size, label_dim), maxshape=(None, label_dim), dtype=label_dtype)
        # use generators to fill both sets
        for data in data_gen:
            d.resize(d.shape[0] + batch_size, axis=0)
            d[-batch_size:] = data
            l.resize(l.shape[0] + batch_size, axis=0)
            l[-batch_size:] = next(label_gen)
With the following constants I combined both functions like so:
batch_size = 4096
h5_batch_size = 1000
data_dim = 7    # [NUM_POINT, 9]
label_dim = 1   # [NUM_POINT]
data_dtype = 'float32'
label_dtype = 'uint8'

for data_file, label_file in data_label_files:
    print(data_file)
    with open(data_file, 'r') as data_f, open(label_file, 'r') as label_f:
        data_gen = read_chunks(data_f, dim=data_dim)
        label_gen = read_chunks(label_f, dim=label_dim)
        out_file = data_file[:-4] + '.h5'
        write_h5(data_gen, label_gen, out_file, batch_size, h5_batch_size, data_dtype, label_dtype)
The problem is not that HDF5 is slow. The problem is that you are reading a single line at a time using a Python loop, calling genfromtxt() once per line! That function is meant to read entire files. And then you use the anti-pattern of array = np.vstack((array, newstuff)) inside the same loop.
In short, your performance problem starts here:
chunk = np.vstack((chunk, np.genfromtxt(io.BytesIO(line.encode()))))
You should just read the entire file at once. If you can't do that, read half of it (you can set a max number of lines to read each time, such as 1 million).
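A minimal sketch of that idea, assuming a plain whitespace-separated text file: read a large block of lines at a time and hand the whole block to genfromtxt, instead of parsing one line per call (the block size of 1 million lines follows the suggestion above):

import itertools
import numpy as np

def read_blocks(path, max_lines=1_000_000):
    # Yield the file as large 2-D numpy blocks instead of parsing one line at a time.
    with open(path, 'r') as f:
        while True:
            lines = list(itertools.islice(f, max_lines))
            if not lines:
                break
            yield np.atleast_2d(np.genfromtxt(lines))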

How to generate/read sparse sequence labels for CTC loss within Tensorflow?

From a list of word images with their transcriptions, I am trying to create and read sparse sequence labels (for tf.nn.ctc_loss) using a tf.train.slice_input_producer, avoiding:
serializing pre-packaged training data to disk in TFRecord format,
the apparent limitations of tf.py_func,
any unnecessary or premature padding, and
reading the entire data set to RAM.
The main issue seems to be converting a string to the sequence of labels (a SparseTensor) needed for tf.nn.ctc_loss.
For example, with the character set in the (ordered) range [A-Z], I'd want to convert the text label string "BAD" to the sequence label class list [1,0,3].
Each example image I want to read contains the text as part of the filename, so it's straightforward to extract and do the conversion in straight up python. (If there's a way to do it within TensorFlow computations, I haven't found it.)
Several previous questions glance at these issues, but I haven't been able to integrate them successfully. For example:
Tensorflow read images with labels shows a straightforward framework with discrete, categorical labels, which I've begun with as a model.
How to load sparse data with TensorFlow? nicely explains an approach for loading sparse data, but assumes pre-packaging tf.train.Examples.
Is there a way to integrate these approaches?
Another example (SO question #38012743) shows how I might delay the conversion from string to list until after dequeuing the filename for decoding, but it relies on tf.py_func, which has caveats. (Should I worry about them?)
I recognize that "SparseTensors don't play well with queues" (per the tf docs), so it might be necessary to do some voodoo on the result (serialization?) before batching, or even rework where the computation happens; I'm open to that.
Following MarvMind's outline, here is a basic framework with the computations I want (iterate over lines containing example filenames, extract each label string and convert to sequence), but I have not successfully determined the "Tensorflow" way to do it.
Thank you for the right "tweak", a more appropriate strategy for my goals, or an indication that tf.py_func won't wreck training efficiency or something else downstream (e.g., loading trained models for future use).
EDIT (+7 hours): I found the missing ops to patch things up. While I still need to verify this connects with CTC_Loss downstream, I have checked that the edited version below correctly batches and reads in the images and sparse tensors.
out_charset="ABCDEFGHIJKLMNOPQRSTUVWXYZ"
def input_pipeline(data_filename):
filenames,seq_labels = _get_image_filenames_labels(data_filename)
data_queue = tf.train.slice_input_producer([filenames, seq_labels])
image,label = _read_data_format(data_queue)
image,label = tf.train.batch([image,label],batch_size=2,dynamic_pad=True)
label = tf.deserialize_many_sparse(label,tf.int32)
return image,label
def _get_image_filenames_labels(data_filename):
filenames = []
labels = []
with open(data_filename)) as f:
for line in f:
# Carve out the ground truth string and file path from
# lines formatted like:
# ./241/7/158_NETWORK_51375.jpg 51375
filename = line.split(' ',1)[0][2:] # split off "./" and number
# Extract label string embedded within image filename
# between underscores, e.g. NETWORK
text = os.path.basename(filename).split('_',2)[1]
# Transform string text to sequence of indices using charset, e.g.,
# NETWORK -> [13, 4, 19, 22, 14, 17, 10]
indices = [[i] for i in range(0,len(text))]
values = [out_charset.index(c) for c in list(text)]
shape = [len(text)]
label = tf.SparseTensorValue(indices,values,shape)
label = tf.convert_to_tensor_or_sparse_tensor(label)
label = tf.serialize_sparse(label) # needed for batching
# Add data to lists for conversion
filenames.append(filename)
labels.append(label)
filenames = tf.convert_to_tensor(filenames)
labels = tf.convert_to_tensor_or_sparse_tensor(labels)
return filenames, labels
def _read_data_format(data_queue):
label = data_queue[1]
raw_image = tf.read_file(data_queue[0])
image = tf.image.decode_jpeg(raw_image,channels=1)
return image,label
The key ideas seem to be creating a SparseTensorValue from the data wanted, passing it through tf.convert_to_tensor_or_sparse_tensor, and then (if you want to batch the data) serializing it with tf.serialize_sparse. After batching, you can restore the values with tf.deserialize_many_sparse.
Here's the outline. Create the sparse values, convert to tensor, and serialize:
indices = [[i] for i in range(0,len(text))]
values = [out_charset.index(c) for c in list(text)]
shape = [len(text)]
label = tf.SparseTensorValue(indices,values,shape)
label = tf.convert_to_tensor_or_sparse_tensor(label)
label = tf.serialize_sparse(label) # needed for batching
Then, you can do the batching and deserialize:
image,label = tf.train.batch([image,label],dynamic_pad=True)
label = tf.deserialize_many_sparse(label,tf.int32)
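For reference, the deserialized sparse label is the form tf.nn.ctc_loss expects for its labels argument; a hedged sketch of the intended downstream connection (logits and seq_len are placeholders not defined in the question):

# image, label come from input_pipeline(); logits and seq_len are placeholders:
#   logits:  [max_time, batch_size, num_classes] output of some RNN/CNN stack
#   seq_len: [batch_size] vector of logit sequence lengths
loss = tf.nn.ctc_loss(labels=label, inputs=logits, sequence_length=seq_len)
cost = tf.reduce_mean(loss)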
