I have issues combining a DataLoader and a DataCollator. The following code with DataCollatorWithPadding results in a ValueError when I iterate through the batches: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length.
from torch.utils.data.dataloader import DataLoader
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer)
train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=16,
                              collate_fn=data_collator)
eval_dataloader = DataLoader(eval_dataset, batch_size=16, collate_fn=data_collator)

for epoch in range(2):
    model.train()
    for step, batch in enumerate(train_dataloader):
        outputs = model(**batch)
        loss = outputs.loss
However, I found another approach where I changed the collate_fn to lambda x: x. Then it gives me a TypeError: DistilBertForSequenceClassification object argument after ** must be a mapping, not list
from torch.utils.data.dataloader import DataLoader

train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=16, collate_fn=lambda x: x)
eval_dataloader = DataLoader(eval_dataset, batch_size=16, collate_fn=lambda x: x)

for epoch in range(2):
    model.train()
    for step, batch in enumerate(train_dataloader):
        outputs = model(**batch)
        loss = outputs.loss
For reproducibility, and for the rest of the code, I provide a Jupyter Notebook on Google Colab. You will find the errors at the bottom of the notebook.
Link to Colab Notebook
If you take a look at the train_dataset object from your notebook:
print(train_dataset)
Output:
Dataset({
    features: ['text', 'label', 'input_ids', 'attention_mask'],
    num_rows: 25000
})
DataCollatorWithPadding doesn't know how to pad the text column because it's just a string.
Since you've already tokenized the dataset, you can simply remove the text column like so:
train_dataset = train_dataset.remove_columns("text")
The other three columns are all tensors and so can be padded by the data collator. Your first training loop will then run as expected.
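Putting it together, a minimal sketch of the fixed pipeline (assuming the tokenizer, model, and tokenized datasets from the notebook):

# drop the raw string column so the collator only sees padded tensor columns
train_dataset = train_dataset.remove_columns("text")
eval_dataset = eval_dataset.remove_columns("text")

data_collator = DataCollatorWithPadding(tokenizer)
train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=16,
                              collate_fn=data_collator)

for step, batch in enumerate(train_dataloader):
    outputs = model(**batch)  # batch is now a dict of equal-length tensors
    loss = outputs.loss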
I am feeding the MNIST dataset to my neural network for training in the following manner:
import torch
import torch.utils.data as data_utils
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

indices = torch.arange(60000)
dataset = datasets.MNIST(root="dataset/", transform=transforms.ToTensor(), download=True)
datasetsmall = data_utils.Subset(dataset, indices)
loader = DataLoader(datasetsmall, batch_size=batch_size, shuffle=True)
However, since training is taking a huge amount of time to complete, I have decided to train the model on only one specific digit from the MNIST dataset, for example the digit 4. How can I extract just the digit 4 and feed it to my neural network in the same way? The loop to train the neural network is like
for batch_idx, (real, _) in enumerate(loader):
Now I want only the digit 4 in the loader. How should I proceed in that case?
Does this code solve your problem?
import torch
from torchvision import datasets
from torch.utils.data import DataLoader
from torchvision.transforms import ToTensor

cls = 4  # needed class
batch_size = 32

dataset = datasets.MNIST(root="dataset/", download=True, transform=ToTensor())
dataset = list(filter(lambda i: i[1] == cls, dataset))  # keep only samples labelled cls
loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

s = 0
for i in loader:
    s += 1
print(f'We\'ve got {s} batches with batch_size {batch_size} only for class {cls}')
# print(i)  # uncomment this line if you want to examine the last batch yourself
Result:
We've got 183 batches with batch_size 32 only for class 4
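If materializing the filtered samples as a Python list uses too much memory, an alternative sketch (assuming standard torchvision MNIST, which exposes its labels as a targets tensor) selects indices up front and reuses Subset, as in your original code:

import torch
from torch.utils.data import DataLoader, Subset
from torchvision import datasets
from torchvision.transforms import ToTensor

cls = 4
dataset = datasets.MNIST(root="dataset/", download=True, transform=ToTensor())
# indices of all samples whose label equals cls
indices = (dataset.targets == cls).nonzero(as_tuple=True)[0]
loader = DataLoader(Subset(dataset, indices), batch_size=32, shuffle=True)

This decodes images lazily instead of holding every (image, label) tuple in RAM.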
Hi, I created a little function to prepare my dataset before training, shown below, which also reshuffles. But the reshuffling happens every time I take output from the dataset, without me running the function again.
def create_ds(x, y, shuffle=True, batch_size=512):
    # Renaming columns in the input df
    x = MyPackage.rename_df(x)
    ds = tf.data.Dataset.from_tensor_slices((dict(x), y))
    if shuffle:
        ds = ds.shuffle(len(x))
    ds = ds.batch(batch_size)
    return ds
Then I use this to create TensorFlow datasets for train and test.
train_ds = create_ds(
    train_df,
    train_df['target'].values,
    shuffle=True,
    batch_size=batch_size,
)
test_ds = create_ds(
    test_df,
    test_df['target'].values,
    shuffle=False,
    batch_size=batch_size_test,
)
--- here I compile the model (not included) ---
model.fit(
    train_ds,
    epochs=epochs,
    callbacks=[
        tf.keras.callbacks.TensorBoard(log_dir=os.path.join(output_path, "logs")),
        tf.keras.callbacks.EarlyStopping(monitor="val_auc", patience=10),
    ],
    validation_data=test_ds,
    verbose=2,
)

# predict
pred_1 = model.predict(train_ds)
pred_2 = model.predict(train_ds)
But pred_1 != pred_2. This is because of shuffle=True when building train_ds: the rows still get shuffled again on each pass. I thought the data would be shuffled once, so why is it reshuffled every time I iterate over the dataset?
For example - I predicted on 3 rows and got for pred_1:
array([[0.51523584],
       [0.50634336],
       [0.51264596]], dtype=float32)
and pred_2:
array([[0.50634336],
       [0.51523584],
       [0.51264596]], dtype=float32)
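For reference, tf.data's shuffle() reshuffles on every iteration over the dataset by default, so each predict() or fit() pass sees a new order. A minimal sketch of pinning the order inside create_ds, assuming the reshuffling is the only source of the mismatch:

ds = tf.data.Dataset.from_tensor_slices((dict(x), y))
if shuffle:
    # shuffle once and keep the same order on every subsequent iteration
    ds = ds.shuffle(len(x), reshuffle_each_iteration=False)
ds = ds.batch(batch_size)

Alternatively, predict on a dataset built with shuffle=False so the row order matches the input dataframe.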
I am trying to use local binary data to train a network to perform regression inference.
Each local binary record has the following layout: 2 float32 header values, a 20x20 float32 grid, and 1 float32 target, i.e. 403 float32 values (403*4 bytes) per record.
The whole dataset consists of several *.bin files with this layout; each file has a variable number of records of 403*4 bytes. I was able to read one of those files using the following code:
import tensorflow as tf

RAW_N = 2 + 20*20 + 1  # float32 values per record

def convert_binary_to_float_array(register):
    return tf.io.decode_raw(register, out_type=tf.float32)

raw_dataset = tf.data.FixedLengthRecordDataset(filenames=['mydata.bin'], record_bytes=RAW_N*4)
raw_dataset = raw_dataset.map(map_func=convert_binary_to_float_array)
Now, I need to create 4 datasets train_data, train_labels, test_data, test_labels as follows:
train_data, train_labels, test_data, test_labels = prepare_ds(raw_dataset, 0.8)
and use them to train & evaluate:
model = build_model()
history = model.fit(train_data, train_labels, ...)
loss, mse = model.evaluate(test_data, test_labels)
My question is: how to implement function prepare_ds(dataset, frac)?
def prepare_ds(dataset, frac):
    ...
I have tried to use tf.shape, tf.reshape, tf.slice, and subscripting with [:], with no success. I realized that those functions don't work as I expected because after the map() call raw_dataset is a MapDataset (as a result of eager-execution concerns).
If the meta-data is supposed to be part of your inputs, which I am assuming, you could try something like this:
import random
import struct
import tensorflow as tf
import numpy as np

RAW_N = 2 + 20*20 + 1

# write some fake binary data for demonstration purposes
bytess = random.sample(range(1, 5000), RAW_N*4)
with open('mydata.bin', 'wb') as f:
    f.write(struct.pack('1612i', *bytess))

def decode_and_prepare(register):
    register = tf.io.decode_raw(register, out_type=tf.float32)
    inputs = register[:402]
    label = register[402:]
    return inputs, label

total_data_entries = 8
raw_dataset = tf.data.FixedLengthRecordDataset(filenames=['/content/mydata.bin', '/content/mydata.bin'], record_bytes=RAW_N*4)
raw_dataset = raw_dataset.map(decode_and_prepare)
raw_dataset = raw_dataset.shuffle(buffer_size=total_data_entries)

train_ds_size = int(0.8 * total_data_entries)
test_ds_size = int(0.2 * total_data_entries)

train_ds = raw_dataset.take(train_ds_size)
remaining_data = raw_dataset.skip(train_ds_size)
test_ds = remaining_data.take(test_ds_size)
Note that I am using the same bin file twice for demonstration purposes. After running that code snippet, you could feed the datasets to your model like this:
model = build_model()
history = model.fit(train_ds, ...)
loss, mse = model.evaluate(test_ds)
as each dataset contains the inputs and the corresponding labels.
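If you want the prepare_ds(dataset, frac) signature from the question, a minimal wrapper around the same take/skip idea (note it returns two (inputs, label) datasets rather than four arrays, and the added total_entries parameter is assumed to be known, since tf.data cannot cheaply count records):

def prepare_ds(dataset, frac, total_entries):
    train_size = int(frac * total_entries)
    # reshuffle_each_iteration=False keeps the take/skip splits disjoint across epochs
    dataset = dataset.shuffle(buffer_size=total_entries, reshuffle_each_iteration=False)
    train_ds = dataset.take(train_size)
    test_ds = dataset.skip(train_size)
    return train_ds, test_ds

train_ds, test_ds = prepare_ds(raw_dataset, 0.8, total_data_entries)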
I'm trying to build and train a simple MLP model using keras.Sequential().
However, I'm having issues when, after each training epoch, I try to evaluate the current status of the model on the train and test data.
I'm having this problem on a couple of different datasets; one of them is the "CAR DETAILS FROM CAR DEKHO" dataset, which you can find here
This is what I'm doing so far:
import numpy as np
import tensorflow as tf
import pandas as pd
from sklearn.model_selection import train_test_split  # used in split_data()
def main():
    ## read, preprocess and split data
    df_data = pd.read_csv('car_data_CAR_DEKHO.csv')
    df_data = pre_process(df_data)
    X_train, y_train, X_test, y_test = split_data(df_data)  ## -> these are PANDAS DATAFRAMES!
    train(X_train, X_test, y_train, y_test)
def train(X_train, X_test, y_train, y_test):
    ##--------------------
    ## building model
    ##--------------------
    batch = 5000
    epochs = 500
    lr = 0.001
    data_iter = load_array((X_train, y_train), batch)
    initializer = tf.initializers.RandomNormal(stddev=0.01)
    net = tf.keras.Sequential()
    net.add(tf.keras.layers.Dense(1, kernel_initializer=initializer))
    loss = tf.keras.losses.MeanSquaredError()
    trainer = tf.keras.optimizers.SGD(learning_rate=lr)
    ##--------------#
    ## training     #
    ##--------------#
    for epoch in range(1, epochs + 1):
        for X_batch, Y_batch in data_iter:
            with tf.GradientTape() as tape:
                l = loss(net(X_batch, training=True), Y_batch)
            grads = tape.gradient(l, net.trainable_variables)
            trainer.apply_gradients(zip(grads, net.trainable_variables))
        # test on train set after epoch
        y_pred_train = net(X_train)  ## ERROR HERE!!!
        l_train = loss(y_pred_train, y_train)
        y_pred_test = net(X_test)
        l_test = loss(y_pred_test, y_test)
def load_array(data_arrays, batch_size, is_train=True):
    """Construct a TensorFlow data iterator."""
    dataset = tf.data.Dataset.from_tensor_slices(data_arrays)
    if is_train:
        dataset = dataset.shuffle(buffer_size=1000)
    dataset = dataset.batch(batch_size)
    return dataset
def split_data(df_data):
    X = df_data.copy()
    y = X.pop('selling_price')
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    return X_train, y_train, X_test, y_test
def pre_process(df_data):
    ## check NaNs and drop rows if any
    print(df_data.isnull().sum())
    df_data.dropna(inplace=True)
    ## drop weird outlier, turns out it has 1 km_driven
    df_data.drop([1312], inplace=True)
    ## features engineering
    df_data['name'] = df_data['name'].map(lambda x: x.split(' ')[0])
    df_data['owner'] = df_data['owner'].map(lambda x: x.split(' ')[0])
    df_data['selling_price'] = df_data['selling_price']/1000
    df_data_dummies = pd.get_dummies(df_data, drop_first=True)
    df_data_dummies = normalize(df_data_dummies)  ## simple min-max scaling; I do it manually but you can use sklearn or something similar
    return df_data_dummies
def normalize(df):
    print('Data normalization:')
    result = df.copy()
    for feature_name in df.columns:
        if feature_name == 'selling_price':  # skip the target column
            pass
        else:
            max_value = df[feature_name].max()
            min_value = df[feature_name].min()
            result[feature_name] = (df[feature_name] - min_value) / (max_value - min_value)
            if result[feature_name].isnull().values.any():
                result.drop([feature_name], axis=1, inplace=True)
                print(f'Something wrong in {feature_name}, dropped.')
                print(f'now shape is {len(result)}, {len(result.columns)}')
    print(f'\treturning {len(result)}, {len(result.columns)}')
    return result
and I'm getting the error:
File "/home/lews/anaconda3/envs/tf/lib/python3.8/site-packages/tensorflow/python/keras/engine/input_spec.py", line 232, in assert_input_compatibility
ndim = x.shape.rank
AttributeError: 'tuple' object has no attribute 'rank'
I guess the error is due to me passing X_train (which is a dataframe) directly to net.
I also tried using
y_pred_train = net(tf.data.Dataset.from_tensor_slices(X_train))
as when creating the training batches, but it returns another error:
File "/home/lews/anaconda3/envs/tf/lib/python3.8/site-packages/tensorflow/python/keras/engine/input_spec.py", line 201, in assert_input_compatibility
raise TypeError('Inputs to a layer should be tensors. Got: %s' % (x,))
TypeError: Inputs to a layer should be tensors. Got: <TensorSliceDataset shapes: (19,), types: tf.float64>
Finally, I tried using:
y_pred_train = net.predict(X_train)
the weird thing in this case is that I got an OOM error, referring to a tensor with shape[76571,76571]:
File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[76571,76571] and type double on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu [Op:SquaredDifference]
but the X_train dataframe has shape (76571, 19), so I don't understand what is happening.
What is the correct way to do this?
Your code mostly looks OK; the issue must be with the data that you pass.
Check the content and datatypes of the data that you feed in.
Try converting the pandas slices into np.arrays, re-check their dimensions, and then feed the np.arrays to load_array().
Also try smaller batches, like 64 (not 5000).
UPDATE:
Apparently, when you pass X_batch to the model you pass a tf.Tensor, but later, when you pass the whole X_train or X_test, you pass pd.DataFrames and the model gets confused.
You should change just 2 lines:
y_pred_train = net(tf.constant(X_train)) # pass TF.tensor - best
#alternative:
y_pred_train = net(X_train.values) # pass np.array - also good
y_pred_test = net(tf.constant(X_test)) # make similar change here
The issue looks like it is related to the data (as Poe Dator says). What I believe is going on is that your network infers its input shape from the batches of data it receives. Then, when you try to predict with or call your network on other data (which also recomputes shapes, since it calls the build() function), it tries to fit that data into the shape it expects. Specifically, I think it is expecting a shape of (batch, 1, 19), but with your data in shape (76571, 19) it is not finding the correct shape.
A couple of easy steps to work on this would be:
- Call net.summary() to see what shapes it believes it is getting, before and after training.
- Provide the input shape to the first layer: net.add(tf.keras.layers.Dense(1, kernel_initializer=initializer, input_shape=(1, 19))).
- Slice your X data into the same shape as your training data.
- Add a dimension to your data so it is (76571, 1, 19), to shape it explicitly as well.
Also, as noted above, smaller batch sizes would be best. I would also recommend using the model.fit() method instead of handling gradients yourself if you don't have a lot of experience with TensorFlow. This saves you code and makes it easier to ensure you are handling your model correctly during training.
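As a sketch of that last suggestion, the manual GradientTape loop could be replaced by compile() and fit() (assuming the same net, data, and hyperparameters as above; the flat (19,) input shape and the tf.constant() conversion follow the other answer rather than the (1, 19) variant):

net = tf.keras.Sequential([
    tf.keras.layers.Dense(1, kernel_initializer=initializer, input_shape=(19,))
])
net.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=lr),
            loss=tf.keras.losses.MeanSquaredError())
history = net.fit(tf.constant(X_train), tf.constant(y_train),
                  batch_size=64, epochs=epochs,
                  validation_data=(tf.constant(X_test), tf.constant(y_test)))
# per-epoch train and validation losses end up in history.history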
Keras fit() and fit_generator() give different results. I implemented both methods, keeping all the other parameters the same. I have attached my data generator and model below. The model is taken from this site: https://machinelearningmastery.com/
In data generator, I am loading the data from the hard drive. Each X_train file contains a matrix of size (3,1). For example, if the batch size is 2, the size of X_batch will be (2, 3, 1).
def generator(list_xtrain, list_ytrain, batch_size):
    samples_per_epoch = len(list_xtrain)
    number_of_batches = samples_per_epoch / batch_size
    counter = 0
    X_batch = np.empty((batch_size, 3, 1))
    y_batch = np.empty((batch_size))
    while 1:
        temp_listx = list_xtrain[batch_size*counter:batch_size*(counter+1)]
        temp_listy = list_ytrain[batch_size*counter:batch_size*(counter+1)]
        for i, ID in enumerate(temp_listx):
            X_batch[i,] = np.load('F:/Air_passenger_data_gen/' + ID)
        for j, ID in enumerate(temp_listy):
            # Store class
            y_batch[j] = np.load('F:/Air_passenger_data_gen/' + ID)
        counter += 1
        yield X_batch, y_batch
        # restart counter to yield data in the next epoch as well
        if counter >= number_of_batches:
            counter = 0
# using fit_generator()
batch_size = 2
model.fit_generator(generator=generator(list_xtrain, list_ytrain, batch_size),
                    epochs=100,
                    steps_per_epoch=len(list_xtrain)/batch_size,
                    verbose=2,
                    use_multiprocessing=False,
                    workers=4)
# using fit()
model.fit(trainX, trainY, epochs=100, batch_size=2)
I expect the output to be the same as that from fit(), but fit_generator() gives a crazy loss value of 41781.00, whereas with fit() the loss is 0.0020.
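One thing worth checking: a plain Python generator is not safe to share across workers, and with workers=4 Keras may pull batches from it concurrently. A tf.keras.utils.Sequence subclass is the supported way to get indexed, worker-safe batching. A minimal sketch, assuming the same file layout as the generator above (AirPassengerSequence is a hypothetical name):

import numpy as np
import tensorflow as tf

class AirPassengerSequence(tf.keras.utils.Sequence):
    # worker-safe replacement for the plain generator above
    def __init__(self, list_xtrain, list_ytrain, batch_size):
        self.list_xtrain = list_xtrain
        self.list_ytrain = list_ytrain
        self.batch_size = batch_size

    def __len__(self):
        # number of batches per epoch
        return len(self.list_xtrain) // self.batch_size

    def __getitem__(self, idx):
        ids_x = self.list_xtrain[idx*self.batch_size:(idx+1)*self.batch_size]
        ids_y = self.list_ytrain[idx*self.batch_size:(idx+1)*self.batch_size]
        X_batch = np.array([np.load('F:/Air_passenger_data_gen/' + ID) for ID in ids_x])
        y_batch = np.array([np.load('F:/Air_passenger_data_gen/' + ID) for ID in ids_y])
        return X_batch, y_batch

model.fit_generator(AirPassengerSequence(list_xtrain, list_ytrain, batch_size=2),
                    epochs=100, verbose=2, workers=4)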