Using Huggingface zero-shot text classification with large data set - python

I'm trying to use Hugging Face zero-shot text classification with 12 labels on a large data set (57K sentences) read from a CSV file as follows:
csv_file = tf.keras.utils.get_file('batch.csv', filename)
df = pd.read_csv(csv_file)
classifier = pipeline('zero-shot-classification')
results = classifier(df['description'].to_list(), labels, multi_class=True)
This keeps crashing as python runs out of memory.
I tried to create a dataset instead as follows:
dataset = load_dataset('csv', data_files=filename)
But I'm not sure how to use it with Hugging Face's classifier. What is the best way to batch-process the classification?
I would eventually like to feed it over 1M sentences for classification.

The problem isn't that your dataset is too big to fit into RAM, but that you're trying to pass the whole thing through a large transformer model at once. Hugging Face's pipelines don't do any mini-batching under the hood at the moment, so pass the sequences one by one or in small subgroups instead:
results = [classifier(desc, labels, multi_class=True) for desc in df['description']]
If you're using a GPU, you'll get the best speed by using as many sequences at each pass as will fit into the GPU's memory, so you could try the following:
batch_size = 4 # see how big you can make this number before OOM
classifier = pipeline('zero-shot-classification', device=0) # to utilize GPU
sequences = df['description'].to_list()
results = []
for i in range(0, len(sequences), batch_size):
    results += classifier(sequences[i:i+batch_size], labels, multi_class=True)
and see how large you can make batch_size before you get OOM errors.
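To use the dataset you created with load_dataset instead of a DataFrame, a minimal sketch (the 'description' column name and the candidate labels list are taken from the question; device=0 assumes a GPU is available) could slice the 'train' split in the same batched loop:
from datasets import load_dataset
from transformers import pipeline

dataset = load_dataset('csv', data_files=filename)['train']
classifier = pipeline('zero-shot-classification', device=0)

batch_size = 4  # tune this to your GPU memory
results = []
for i in range(0, len(dataset), batch_size):
    # slicing a Dataset returns a dict of columns, so pull out the list of texts
    batch = dataset[i:i + batch_size]['description']
    results += classifier(batch, labels, multi_class=True)
For the eventual 1M-sentence run, writing each batch of results to disk instead of accumulating everything in the results list keeps memory usage flat.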

The zero-shot-classification model takes one input at a time, and it is a very heavy model to run, so as recommended above, run it on a GPU.
A very simple approach is to classify each row's text individually and collect the results into a list of dictionaries:
df = pd.read_csv(csv_file)
classifier = pipeline('zero-shot-classification')
filter_keys = ['labels']
output = []
for index, row in df.iterrows():
    d = {}
    sequence = row['description']
    result = classifier(sequence, labels, multi_class=True)
    temp = list(map(result.get, filter_keys))
    d['description'] = row['description']
    d['label'] = temp[0][0]  # top-ranked label
    output.append(d)
# convert the list of dictionaries into a pandas DataFrame
new = pd.DataFrame(output)
new.head()

Related

Batching large input file into MLlib model

Is there any way to batch a large input file (111 MB) made of 22 million cells (222 rows by 110k columns) in MLlib, similar to this Keras batching tutorial?
The file contains the features extracted from 222 images using the above tutorial, but instead of using a Keras model I would like to replicate that code using PySpark and MLlib.
Unfortunately I don't have enough resources to handle such a big file in memory, and the computation fails with a Java heap space error.
The file structure is as follows: each row represents an image, with column "_c0" holding the 0/1 label and columns "_c1" through "_c100353" holding the extracted features.
Here's my code. I don't care about precision or accuracy; I'm just interested in running the model to collect resource usage metrics.
sql,sc = init_spark()
df = sql.read.option("maxColumns", 100400).load(file3,format="csv",inferSchema="true",sep=',',header="false")
labelIndexer = StringIndexer(inputCol="_c0", outputCol="indexedLabel").fit(df)
cols = df.columns
cols.remove("_c0")
assembler = VectorAssembler(inputCols=cols,outputCol="features")
data = assembler.transform(df)
featureIndexer = \
    VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=100).fit(data)
(trainingData, testData) = data.randomSplit([0.7, 0.3])
dt = DecisionTreeClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures")
# Chain indexers and tree in a Pipeline
pipeline = Pipeline(stages=[labelIndexer, featureIndexer, dt])
# Train model. This also runs the indexers.
model = pipeline.fit(trainingData)
# Make predictions.
predictions = model.transform(testData)
# Select example rows to display.
predictions.select("prediction", "indexedLabel", "features").show(100)
predictions.printSchema()
evaluator = MulticlassClassificationEvaluator(
    labelCol="indexedLabel", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Accuracy = %g " % accuracy)
Please don't suggest using the sparkdl library for feature extraction with DeepImageFeaturizer, because it's completely broken.

What is the best way to load data with tf.data.Dataset in memory efficient way

I'm trying to load data to optimize a model for object detection + instance segmentation. However, using tf.data.Dataset is giving me a bit of a headache with loading the instance segmentation masks. tf.data.Dataset is using all the memory on the server (more than 128 GB) even with a small dataset.
Is there a way to load the data in a more memory-efficient way? Right now we are using this code:
train_dataset, train_examples = dataset.load_train_datasets()
ds = (
    train_dataset.shuffle(min(100, train_examples), reshuffle_each_iteration=True)
    .map(dataset.decode, num_parallel_calls=args.num_parallel_calls)
    .map(train_processing.prepare_for_batch, num_parallel_calls=args.num_parallel_calls)
    .batch(args.batch_size)
    .map(train_processing.preprocess_batch, num_parallel_calls=args.num_parallel_calls)
    .prefetch(AUTOTUNE)
)
The problem is that the second map call with train_processing.prepare_for_batch (which takes a single element) and the third with train_processing.preprocess_batch (which takes a batch of elements) create a lot of binary segmentation masks, which use all the memory.
Is there a way to reorganize the mapping functions to save memory? I was thinking of something like: 1. take the first 100 samples, 2. decode the samples, 3. prepare the masks and bounding boxes for one sample, 4. take a batch of them, 5. do the final per-batch preparation of the data, 6. fit one step/one batch of data, 7. clear the data from memory.
Manually
First make a list of all the filenames in the dataset and a list of all the labels in the dataset.
filenames = ['abc.png', 'def.png', ...]
labels = [0, 1, ...]
Then create a dataset from tensor slices:
dataset = tf.data.Dataset.from_tensor_slices((filenames, labels))
dataset = dataset.shuffle(len(filenames))
dataset = dataset.map(PARSE_FUNCTION, num_parallel_calls=PARALLEL_CALLS)
dataset = dataset.batch(BATCH_SIZE)
dataset = dataset.prefetch(1)
Through a function
def dataset(csv, parse):
    filenames = []
    labels = []
    for i, row in csv.iterrows():
        filename = row[0]
        filenames.append(filename)
        label = row[1]
        labels.append(label)
    encoder = LabelEncoder()
    labels = encoder.fit_transform(labels)
    labels = np_utils.to_categorical(labels)
    dataset = tf.data.Dataset.from_tensor_slices((filenames, labels))
    dataset = dataset.shuffle(len(filenames))
    dataset = dataset.map(PARSE_FUNCTION, num_parallel_calls=PARALLEL_CALLS)
    dataset = dataset.batch(BATCH_SIZE)
    dataset = dataset.prefetch(1)
    return dataset
Disclaimer: this method assumes the csv is in (filename, label) format.
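PARSE_FUNCTION is not defined above; a minimal sketch of what it might look like for PNG image files (IMG_SIZE and the decode/normalization steps are assumptions, adjust to your data):
import tensorflow as tf

IMG_SIZE = 224  # placeholder input size

def PARSE_FUNCTION(filename, label):
    # the file is only read and decoded when the element is actually consumed,
    # which is what keeps this pipeline memory-friendly
    image = tf.io.read_file(filename)
    image = tf.io.decode_png(image, channels=3)
    image = tf.image.resize(image, (IMG_SIZE, IMG_SIZE))
    image = tf.cast(image, tf.float32) / 255.0
    return image, label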

How to build a dataset for language modeling with the datasets library as with the old TextDataset from the transformers library

I am trying to load a custom dataset that I will then use for language modeling. The dataset consists of a text file that has a whole document in each line, meaning that each line exceeds the usual 512-token limit of most tokenizers.
I would like to understand the process for building a text dataset that tokenizes each line, after first splitting the documents in the dataset into lines of a "tokenizable" size, as the old TextDataset class would do. With that class you only had to do the following, and a tokenized dataset without text loss would be available to pass to a DataCollator:
model_checkpoint = 'distilbert-base-uncased'
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
from transformers import TextDataset
dataset = TextDataset(
    tokenizer=tokenizer,
    file_path="path/to/text_file.txt",
    block_size=512,
)
Instead of this approach, which is to be deprecated soon, I would like to use the datasets library. For now, what I have is the following, which, of course, throws an error because each line is longer than the maximum block size in the tokenizer:
import datasets
dataset = datasets.load_dataset('text', data_files='path/to/text_file.txt')
model_checkpoint = 'distilbert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
def tokenize_function(examples):
    return tokenizer(examples["text"])
tokenized_datasets = dataset.map(tokenize_function, batched=True, num_proc=4, remove_columns=["text"])
So what would be the "standard" way of creating a dataset in the way it was done before but with the datasets lib?
Thank you very much for the help :))
I received an answer to this question on the Hugging Face Datasets forum from @lhoestq:
Hi !
If you want to tokenize line by line, you can use this:
max_seq_length = 512
num_proc = 4
def tokenize_function(examples):
    # Remove empty lines
    examples["text"] = [line for line in examples["text"] if len(line) > 0 and not line.isspace()]
    return tokenizer(
        examples["text"],
        truncation=True,
        max_length=max_seq_length,
    )

tokenized_dataset = dataset.map(
    tokenize_function,
    batched=True,
    num_proc=num_proc,
    remove_columns=["text"],
)
Though the TextDataset was doing a different processing by concatenating all the texts and building blocks of size 512. If you need this behavior, then you must apply an additional map function after the tokenization:
# Main data processing function that will concatenate all texts from
# our dataset and generate chunks of max_seq_length.
def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder; we could add padding if the model supported it instead of this drop,
    # you can customize this part to your needs.
    total_length = (total_length // max_seq_length) * max_seq_length
    # Split by chunks of max_len.
    result = {
        k: [t[i : i + max_seq_length] for i in range(0, total_length, max_seq_length)]
        for k, t in concatenated_examples.items()
    }
    return result

# Note that with `batched=True`, this map processes 1,000 texts together,
# so group_texts throws away a remainder for each of those groups of 1,000 texts.
# You can adjust that batch_size here but a higher value might be slower to preprocess.
tokenized_dataset = tokenized_dataset.map(
    group_texts,
    batched=True,
    num_proc=num_proc,
)
This code comes from the processing in the run_mlm.py example script of transformers.
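Since the original goal was to feed the result to a DataCollator, here is a minimal sketch of how the tokenized dataset could then be wired into masked language modeling training (the mlm_probability value and the output_dir are placeholders, not taken from the question):
from transformers import (AutoModelForMaskedLM, DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

# dynamic token masking for MLM; 0.15 is just the common default
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir='./mlm_output'),  # placeholder output dir
    train_dataset=tokenized_dataset['train'],
    data_collator=data_collator,
)
trainer.train()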

Is there a simple function that can exclude the training set from a dataset in Python?

I have a question about splitting a dataset in Python.
If I already have a subset of the dataset as the training set, is there some function in Python that can exclude the training set from the dataset and return the rest of it directly?
Something like:
testing_set = numpy.exclude(dataset, training_set)
For example, if there are 10 rows in the dataset and I have taken rows 2, 4, 7, 9 as the training set, how can I get the rest of the dataset easily?
In detail, this is how I build my training dataset:
for i in range(0,5):
    Test_data = dataset[ratio*i:ratio*(i+1),:]
    Train_data = dataset[0:ratio*i&ratio*(i+1):-1,:]
My code didn't work because there is no such & operation defined for slicing.
If you already know the indices of the training set rows, you can just exclude them to get the indices of the remaining rows:
training_rows_ix = [2,4,7,9]
non_training_rows = [i for i in dataset.index if i not in training_rows_ix]
test_set = dataset.loc[non_training_rows]
Or using set operations instead of list comprehension:
non_training_rows = sorted(set(dataset.index) - set(training_rows_ix))
Also, for a more robust solution to splitting datasets into train and test sets, look into scikit-learn's train_test_split.
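Since the slicing in the question suggests dataset is a NumPy array rather than a DataFrame, here is a short sketch of the same idea with np.setdiff1d, plus the scikit-learn split mentioned above (the test_size value is just an example):
import numpy as np
from sklearn.model_selection import train_test_split

# rows already used for training, from the example in the question
training_rows_ix = [2, 4, 7, 9]

# every row index that is NOT in the training set
remaining_ix = np.setdiff1d(np.arange(len(dataset)), training_rows_ix)
test_set = dataset[remaining_ix]

# or let scikit-learn produce a random split for you
train_set, test_set = train_test_split(dataset, test_size=0.2, random_state=42)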

10 fold cross validation python

There is a deep learning based model using transfer learning and LSTM in this article; the author used 10-fold cross validation (as explained in Table 3) and took the average of the results.
I am familiar with 10-fold cross validation, in that we need to divide the data and pass it to the model; however, in this code (here) I can't figure out how to partition the data and pass it.
There are two train/test/dev datasets (one for emotion analysis and one for sentiment analysis; we use both for transfer learning, but my focus is on emotion analysis). The raw data is in a couple of files in txt format, and after running the model, it produces two new txt files, one for predicted labels and one for true labels.
There is a line of code in the main file:
model = BiLstm(args, data, ckpt_path='./' + args.data_name + '_output/')
if args.mode == 'train':
    model.train(data)
    sess = model.restore_last_session()
    model.predict(data, sess)
if args.mode == 'test':
    sess = model.restore_last_session()
    model.predict(data, sess)
in which 'data' is an instance of the Data class (code) that includes the test/train/dev datasets, and which is where I think I need to pass the partitioned data. If I am right, how can I do the partitioning and perform 10-fold cross validation?
data = Data('./data/'+args.data_name+'data_sample.bin', './data/'+args.data_name+'vocab_sample.bin',
            './data/'+args.data_name+'word_embed_weight_sample.bin', args.batch_size)

class Data(object):
    def __init__(self, data_path, vocab_path, pretrained, batch_size):
        self.batch_size = batch_size
        data, vocab, pretrained = self.load_vocab_data(data_path, vocab_path, pretrained)
        self.train = data['train']
        self.valid = data['valid']
        self.test = data['test']
        self.train2 = data['train2']
        self.valid2 = data['valid2']
        self.test2 = data['test2']
        self.word_size = len(vocab['word2id']) + 1
        self.max_sent_len = vocab['max_sent_len']
        self.max_topic_len = vocab['max_topic_len']
        self.word2id = vocab['word2id']
        word2id = vocab['word2id']
        # self.id2word = dict((v, k) for k, v in word2id.iteritems())
        self.id2word = {}
        for k, v in six.iteritems(word2id):
            self.id2word[v] = k
        self.pretrained = pretrained
By the look of it, it seems the train method can take a session and continue training from an existing model: def train(self, data, sess=None).
So with very minimal changes to the existing code and libraries you can do something like the following.
First load all the data and build the model:
data = Data('./data/'+args.data_name+'data_sample.bin', './data/'+args.data_name+'vocab_sample.bin',
            './data/'+args.data_name+'word_embed_weight_sample.bin', args.batch_size)
model = BiLstm(args, data, ckpt_path='./' + args.data_name + '_output/')
Then create the cross validation datasets, something like:
def get_new_data_object():
    return Data('./data/'+args.data_name+'data_sample.bin', './data/'+args.data_name+'vocab_sample.bin',
                './data/'+args.data_name+'word_embed_weight_sample.bin', args.batch_size)

cross_validation = []
for i in range(10):
    tmp_data = get_new_data_object()
    tmp_data.train = ...   # get 90% of tmp_data.train
    tmp_data.valid = ...   # get 90% of tmp_data.valid
    tmp_data.test = ...    # get 90% of tmp_data.test
    tmp_data.train2 = ...  # get 90% of tmp_data.train2
    tmp_data.valid2 = ...  # get 90% of tmp_data.valid2
    tmp_data.test2 = ...   # get 90% of tmp_data.test2
    cross_validation.append(tmp_data)
Then run the model n times (10 for 10-fold cross validation):
sess = None
for data in cross_validation:
    model.train(data, sess)
    sess = model.restore_last_session()
Keep in mind to pay attention to some key ideas:
I don't know how your data is structured exactly, but that affects the way you split it into test, train and (in your case) valid sets.
The splitting of the data has to be the exact same split for each triple of test, train and valid; it can be done randomly or by taking a different part every time, as long as it is consistent.
You can train the model n times with cross validation, or create n models and pick the best, to avoid overfitting.
This code is just a draft; you can implement it however you like. There are some great libraries that already implement such functionality (see the KFold sketch below), and of course it can be optimized (e.g. not reading the whole data files each time).
One more consideration is to separate the model creation from the data, especially the data arg of the model constructor; from a quick look it seems to only use the dimensions of the data, so it's good practice not to pass the whole object.
Moreover, if the model integrates other properties of the data object into its state (when created), like the data itself, my code might not work and a more surgical approach may be needed.
Hope this helps and points you in the right direction.
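For the "get 90%" placeholders above, one possible way to produce consistent 90/10 splits (assuming each of data.train, data.valid, etc. is an indexable sequence of examples; this is a sketch, not the article's exact protocol) is scikit-learn's KFold:
from sklearn.model_selection import KFold

kf = KFold(n_splits=10, shuffle=True, random_state=42)

cross_validation = []
# kf.split only needs the length of the sequence; it yields index arrays
for train_idx, holdout_idx in kf.split(data.train):
    tmp_data = get_new_data_object()
    # keep the 90% side of the fold; repeat the same pattern for
    # valid, test, train2, valid2 and test2
    tmp_data.train = [data.train[i] for i in train_idx]
    cross_validation.append(tmp_data)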
