I am using the TensorFlow CNN getting-started example and adapting the parameters to my own data, but since my input is large (244 * 244 features) I get an OutOfMemory error.
I am running the training on Ubuntu 14.04 with 4 CPUs and 16 GB of RAM.
Is there a way to shrink my data so I don't get this OOM error?
My code looks like this:
# Create the Estimator
mnist_classifier = tf.estimator.Estimator(
model_fn=cnn_model_fn, model_dir="path/to/model")
# Load the data
train_input_fn = tf.estimator.inputs.numpy_input_fn(
x={"x": np.array(training_set.data)},
y=np.array(training_set.target),
num_epochs=None,
batch_size=5,
shuffle=True)
# Train the model
mnist_classifier.train(
input_fn=train_input_fn,
steps=100,
hooks=[logging_hook])
Is there a way to shrink my data so I don't get this OOM error?
You can slice your training_set to obtain just a portion of the dataset. Something like:
x={"x": np.array(training_set.data)[:(len(training_set) // 2)]},
y=np.array(training_set.target)[:(len(training_set) // 2)],
In this example you take the first half of the dataset (you can choose up to which point of the dataset you want to load).
Edit: Another way to do this is to take a random subset of your training dataset, which you can achieve by masking elements of the dataset array. For example:
import numpy as np
from random import random as rn
#obtain boolean mask to filter out some elements
#here you can define your sample %
r = 0.5 #say filter half the elements
mask = [rn() >= r for _ in range(len(training_set))]
#finally, apply the mask to the dataset;
#the result keeps roughly (1 - r) of the original elements
reduced_ds = training_set[mask]
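To actually train on the reduced data, here is a minimal sketch of wiring it into the question's numpy_input_fn, assuming training_set.data and training_set.target behave like arrays of equal length (the names x_small and y_small are just illustrative):
import numpy as np
r = 0.5  # keep roughly half of the samples
mask = np.random.rand(len(training_set.data)) >= r
x_small = np.array(training_set.data)[mask]
y_small = np.array(training_set.target)[mask]
train_input_fn = tf.estimator.inputs.numpy_input_fn(
    x={"x": x_small},
    y=y_small,
    num_epochs=None,
    batch_size=5,
    shuffle=True)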
Related
I'm trying to manually split my training data into separate batches so that I can easily access them by index, rather than relying on DataLoader to split them up for me (in which case I can't access the individual batches by indexing). So I attempted the following:
train_data = datasets.ANY(root='data', transform=T_train, download=True)
BS = 200
num_batches = len(train_data) // BS
sequence = list(range(len(train_data)))
np.random.shuffle(sequence) # To shuffle the training data
subsets = [Subset(train_data, sequence[i * BS: (i + 1) * BS]) for i in range(num_batches)]
train_loader = [DataLoader(sub, batch_size=BS) for sub in subsets] # Create multiple batches, each with BS number of samples
Which works during training just fine.
However, when I attempted another way to manually split the training data I got different end results, even with all the same parameters and the following settings ensured:
device = torch.device('cuda')
torch.manual_seed(0)
np.random.seed(0)
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True
torch.cuda.empty_cache()
I only split the training data the following way this time:
train_data = list(datasets.ANY(root='data', transform=T_train, download=True)) # Cast into a list
BS = 200
num_batches = len(train_data) // BS
np.random.shuffle(train_data) # To shuffle the training data
train_loader = [DataLoader(train_data[i*BS: (i+1)*BS], batch_size=BS) for i in range(num_batches)]
But this gives me different results than the first approach, even though (I believe) both approaches split the training data into batches identically. I even tried not shuffling at all and loading the data just as it is, but I still got different results (85.2% vs. 81.98% accuracy). I also manually checked that the images loaded from the batches match and are the same with both methods.
The training scheme used in both ways:
for e in trange(epochs):
for loader in train_loader:
for x, y in loader:
x, y = x.to(device, non_blocking=True), y.to(device, non_blocking=True)
loss = F.cross_entropy(m1(x), y)
loss.backward()
optim.step()
scheduler.step()
optim.zero_grad()
Can somebody please explain to me why these differences arise (and if there's a better way)?
UPDATE:
The T_train transformation contains some random transformations (H_flip, crop). When used with the first train_loader, training took 24.79 s/it, while the second train_loader took 10.88 s/it (even though both perform exactly the same number of parameter updates/steps). So I decided to remove the random transformations from T_train; the first train_loader then took 16.99 s/it, while the second still took 10.87 s/it. Somehow the second train_loader takes the same time with or without the random transformations. I therefore visualized the image outputs from the second train_loader to make sure the transformations were applied, and indeed they were! So this is really confusing, and I'm not quite sure why the two approaches give different results.
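For reference, a minimal sketch of how the two loader setups can be compared batch-by-batch under the same seed (loader_a and loader_b are placeholder names for the two train_loader lists built above):
import torch

def batch_checksums(loaders):
    # Sum each batch's pixels and labels as a cheap fingerprint
    sums = []
    for loader in loaders:
        for x, y in loader:
            sums.append((x.double().sum().item(), y.double().sum().item()))
    return sums

torch.manual_seed(0)
sums_a = batch_checksums(loader_a)  # Subset-based list of DataLoaders (first approach)
torch.manual_seed(0)
sums_b = batch_checksums(loader_b)  # list-slice-based list of DataLoaders (second approach)
print(all(abs(xa - xb) < 1e-6 and ya == yb
          for (xa, ya), (xb, yb) in zip(sums_a, sums_b)))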
I want to run inference on many inputs from an ONNX model using onnxruntime in Python. One way is to use a for loop, but that seems like a naive and slow method. Is there a way to do it the same way as in sklearn?
Single prediction on onnxruntime:
import onnxruntime as ort
sess = ort.InferenceSession("xxxxx.onnx")
input_name = sess.get_inputs()
label_name = sess.get_outputs()[0].name
pred_onnx= sess.run([label_name], {
input_name[0].name: np.array([[40]]).astype(np.int64),
input_name[1].name: np.array([[0]]).astype(np.int64),
input_name[2].name: np.array([[0]]).astype(np.int64)
})
pred_onnx
>> Output: [array([[23]], dtype=float32)]
Single/multiple prediction in sklearn (depending on the size of x_test):
test_predictions = model.predict(x_test)
The best way is for the ONNX model to support batches. Based on the input you're providing, it may already do that: your 3 inputs appear to have shape [1,1] and your output has shape [1,1], which may mean the first dimension is the batch size. An example input with shape [2,1] (batch of 2, one element per sample) would look like [[40],[50]].
I'm guessing that if you provide a batch of two inputs you'd get two outputs, so something like this:
pred_onnx= sess.run([label_name], {
input_name[0].name: np.array([[40],[40]]).astype(np.int64),
input_name[1].name: np.array([[0],[0]]).astype(np.int64),
input_name[2].name: np.array([[0],[0]]).astype(np.int64)
})
May give an output of:
[array([[23],[23]], dtype=float32)]
Here is a small working example using batch inference on a sklearn model exported to ONNX.
from sklearn import datasets, model_selection, linear_model, pipeline, preprocessing
import numpy as np
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType
import onnxruntime
import pandas as pd
# load toy dataset, define sklearn pipeline and fit model
dataset = datasets.load_diabetes()
X, y = dataset.data, dataset.target
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y)
regr = pipeline.Pipeline(
[("std", preprocessing.StandardScaler()), ("reg", linear_model.LinearRegression())]
)
regr.fit(X_train, y_train)
# export model to onnx
initial_type = list(
zip(
dataset.feature_names,
[FloatTensorType([None, 1]) for _ in range(len(dataset.feature_names))],
)
)
onx = convert_sklearn(regr, initial_types=initial_type)
with open("model.onnx", "wb") as f:
f.write(onx.SerializeToString())
# load model in onnx runtime and make batch inference
df_test = pd.DataFrame(X_test, columns=dataset.feature_names)
sess = onnxruntime.InferenceSession("model.onnx")
inputs = {
f: df_test[f].astype(np.float32).values.reshape(-1, 1)
for f in dataset.feature_names
}
label_name = sess.get_outputs()[0].name
pred_onx = sess.run([label_name], inputs)[0]
# compare results
regr.predict(X_test)
pred_onx.flatten()
I think the trickiest part is to get the input shape right for inference.
Since we specified FloatTensorType([None, 1]), each individual input array must have shape (x, 1), where x is the number of samples in the batch. Thus we need to reshape column values of shape (x,) into (x, 1).
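For instance, a minimal sketch of that reshape (the values are arbitrary):
import numpy as np
col = np.array([0.1, 0.2, 0.3], dtype=np.float32)  # shape (3,)
col_2d = col.reshape(-1, 1)                        # shape (3, 1), matching FloatTensorType([None, 1])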
When I tried to add validation_split to my LSTM model, I got this error:
ValueError: `validation_split` is only supported for Tensors or NumPy arrays, found: (<tensorflow.python.keras.preprocessing.sequence.TimeseriesGenerator object)
This is the code
from keras.preprocessing.sequence import TimeseriesGenerator
train_generator = TimeseriesGenerator(df_scaled, df_scaled, length=n_timestamp, batch_size=1)
model.fit(train_generator, epochs=50,verbose=2,callbacks=[tensorboard_callback], validation_split=0.1)
----------
ValueError: `validation_split` is only supported for Tensors or NumPy arrays, found: (<tensorflow.python.keras.preprocessing.sequence.TimeseriesGenerator object)
One reason I can think of: to use validation_split, a tensor or NumPy array is expected, as mentioned in the error; however, passing the training data through TimeseriesGenerator changes its dimensions into a 3D array.
And since TimeseriesGenerator seems mandatory when using an LSTM, does this mean that for LSTMs we can't use validation_split?
Your first intuition is right: you can't use validation_split with a dataset generator.
You have to understand how a dataset generator works. The model.fit API does not know how many records or batches your dataset contains during the first epoch, because the data is generated and supplied one batch at a time for training. So there is no way for the API to know up front how many records there are and to carve a validation set out of them. That is why you cannot use validation_split with a dataset generator. You can read about it in the documentation:
Float between 0 and 1. Fraction of the training data to be used as
validation data. The model will set apart this fraction of the
training data, will not train on it, and will evaluate the loss and
any model metrics on this data at the end of each epoch. The
validation data is selected from the last samples in the x and y data
provided, before shuffling. This argument is not supported when x is a
dataset, generator or keras.utils.Sequence instance.
The last two lines say that it is not supported for dataset generators.
What you can do instead is split the dataset with the following code. You can read about it in detail here; I am only reproducing the important part from the link.
# Splitting the dataset for training and testing.
def is_test(x, _):
return x % 4 == 0
def is_train(x, y):
return not is_test(x, y)
recover = lambda x, y: y
# Split the dataset for training.
test_dataset = dataset.enumerate() \
.filter(is_test) \
.map(recover)
# Split the dataset for testing/validation.
train_dataset = dataset.enumerate() \
.filter(is_train) \
.map(recover)
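For the question's TimeseriesGenerator setup specifically, a simpler alternative is to split the source array yourself and pass a second generator as validation_data. A minimal sketch, assuming df_scaled is a NumPy array as in the question (the 90/10 split point and the val_generator name are just illustrative):
split = int(len(df_scaled) * 0.9)
train_generator = TimeseriesGenerator(df_scaled[:split], df_scaled[:split],
                                      length=n_timestamp, batch_size=1)
val_generator = TimeseriesGenerator(df_scaled[split:], df_scaled[split:],
                                    length=n_timestamp, batch_size=1)
model.fit(train_generator, epochs=50, verbose=2,
          callbacks=[tensorboard_callback], validation_data=val_generator)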
I hope my answer helps you.
y = np.array(y)
This fixed it for me.
The error says it only supports numpy arrays, so turn it into an array.
I need to train a model on a dataset that requires more memory than my GPU has. What is the best practice for feeding the dataset to the model?
Here are my steps:
First of all, I load the dataset with a batch size:
BATCH_SIZE=32
builder = tfds.builder('mnist')
builder.download_and_prepare()
datasets = builder.as_dataset(batch_size=BATCH_SIZE)
In the second step I prepare the data:
raw_train_ds = datasets['train']
for record in raw_train_ds.take(1):
train_images, train_labels = record['image'], record['label']
print(train_images.shape)
train_images = train_images.numpy().astype(np.float32) / 255.0
train_labels = tf.keras.utils.to_categorical(train_labels)
and then I feed the data to the model:
history = model.fit(train_images,train_labels, epochs=NUM_EPOCHS, validation_split=0.2)
But at step 2 I only prepared the data for the first batch and missed the rest of the batches, because the model.fit call is outside the loop scope (and, as I understand it, therefore works on the first batch only).
On the other hand, I can't simply remove take(1) and move model.fit inside the loop: that way I would handle all the batches, but model.fit would then be called at the end of every iteration, which also won't work properly.
So how should I change my code to work properly with a big dataset using model.fit? Could you point me to an article, any documentation, or just advise how to deal with this? Thanks.
Update
In my post below (Approach 1) I describe one way to solve the problem. Are there any better approaches, or is this the only way?
You can pass the whole dataset to fit for training. As you can see in the documentation, one of the possible values of the first parameter is:
A tf.data dataset. Should return a tuple of either (inputs, targets) or (inputs, targets, sample_weights).
So you just need to convert your dataset to that format (a tuple with input and target) and pass it to fit:
BATCH_SIZE=32
builder = tfds.builder('mnist')
builder.download_and_prepare()
datasets = builder.as_dataset(batch_size=BATCH_SIZE)
raw_train_ds = datasets['train']
train_dataset_fit = raw_train_ds.map(
lambda x: (tf.cast(x['image'], tf.float32) / 255.0, x['label']))
history = model.fit(train_dataset_fit, epochs=NUM_EPOCHS)
One problem with this is that fit on a dataset does not support the validation_split parameter, but, as shown in this guide, tfds already gives you the functionality to split the data. So you would just need to get the test split, transform it as above, and pass it as validation_data to fit.
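A minimal sketch of that suggestion, reusing the builder and train_dataset_fit defined above (raw_test_ds and val_dataset_fit are just illustrative names):
raw_test_ds = datasets['test']
val_dataset_fit = raw_test_ds.map(
    lambda x: (tf.cast(x['image'], tf.float32) / 255.0, x['label']))
history = model.fit(train_dataset_fit, epochs=NUM_EPOCHS,
                    validation_data=val_dataset_fit)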
Approach 1
Thanks to @jdehesa, I changed my code:
Load the dataset. In reality, it doesn't load the data into memory until the first 'next' call on the dataset iterator, and even then I think the iterator only loads a portion of the data (one batch) of size BATCH_SIZE:
raw_train_ds, raw_validation_ds = builder.as_dataset(split=["train[:90%]", "train[90%:]"], batch_size=BATCH_SIZE)
Collected all the required transformations into one method:
def prepare_data(x):
train_images, train_labels = x['image'], x['label']
# TODO: resize image
train_images = tf.cast(train_images,tf.float32)/ 255.0
# train_labels = tf.keras.utils.to_categorical(train_labels,num_classes=NUM_CLASSES)
train_labels = tf.one_hot(train_labels,NUM_CLASSES)
return (train_images, train_labels)
Applied these transformations to each element (batch) in the dataset using tf.data.Dataset.map:
train_dataset_fit = raw_train_ds.map(prepare_data)
And then fed this dataset into model.fit; as I understand it, model.fit will iterate through all the batches in the dataset:
history = model.fit(train_dataset_fit, epochs=NUM_EPOCHS)
How do I get a single random example from a PyTorch DataLoader?
If my DataLoader gives minibatches of multiple images and labels, how do I get a single random image and label?
Note that I don't want a single image and label per minibatch, I want a total of one example.
If your DataLoader is something like this:
test_loader = DataLoader(image_datasets['val'], batch_size=batch_size, shuffle=True)
it is giving you a batch of size batch_size, and you can pick out a single random example by directly indexing the batch:
for test_images, test_labels in test_loader:
sample_image = test_images[0] # Reshape them according to your needs.
sample_label = test_labels[0]
Alternative solutions
You can use RandomSampler to obtain random samples (see the sketch after these alternatives).
Use a batch_size of 1 in your DataLoader.
Directly take samples from your DataSet like so:
mnist_test = datasets.MNIST('../MNIST/', train=False, transform=transform)
Now use this dataset to take samples:
for image, label in mnist_test:
# do something with image and other attributes
(Probably the best) See here:
inputs, classes = next(iter(dataloader))
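A minimal sketch of the RandomSampler option mentioned above, reusing image_datasets['val'] from the first example:
from torch.utils.data import DataLoader, RandomSampler

sampler = RandomSampler(image_datasets['val'], replacement=True, num_samples=1)
loader = DataLoader(image_datasets['val'], batch_size=1, sampler=sampler)
sample_image, sample_label = next(iter(loader))  # tensors with a leading batch dimension of 1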
If you want to choose specific images from your train loader / test loader, you should check out the Subset class from torch.utils.data:
Here's an example how to use it:
testset = ImageFolderWithPaths(root="path/to/your/Image_Data/Test/", transform=transform)
subset_indices = [0] # select your indices here as a list
subset = torch.utils.data.Subset(testset, subset_indices)
testloader_subset = torch.utils.data.DataLoader(subset, batch_size=1, num_workers=0, shuffle=False)
This way you can use exactly one image and label. However, you can of course use more than just one index in your subset_indices.
If you want to use a specific image from your ImageFolder dataset, you can use dataset.samples and build a dictionary to look up the index of the image you want to use.
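A minimal sketch of that lookup, assuming an ImageFolder-style dataset whose .samples attribute holds (path, class_index) pairs; the file path below is purely illustrative:
import torchvision

dataset = torchvision.datasets.ImageFolder(root="path/to/your/Image_Data/Test/", transform=transform)
# Map each file path to its index in the dataset
path_to_index = {path: i for i, (path, _) in enumerate(dataset.samples)}
idx = path_to_index["path/to/your/Image_Data/Test/class_x/img_001.png"]  # hypothetical file path
image, label = dataset[idx]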
(This answer is to supplement Alternative 3 of @parthagar's answer.)
Iterating through the dataset does not return "random" examples; you should instead use:
import numpy
# Recovers the original `dataset` from the `dataloader`
dataset = dataloader.dataset
n_samples = len(dataset)
# Get a random sample
random_index = int(numpy.random.random()*n_samples)
single_example = dataset[random_index]
TL;DR:
The general form to get a single example from a DataLoader is:
example = [x[0] for x in next(iter(trainloader))]
In particular, for the question asked, where minibatches of images and labels are returned:
image, label = [x[0] for x in next(iter(trainloader))]
Possibly interesting information:
To get a single minibatch from the DataLoader, use:
next(iter(trainloader))
When running something like for images, labels in dataloader:, what happens under the hood is that an iterator is created via iter(dataloader), and then next() is called on that iterator on each loop iteration.
To get a single image from a DataLoader, which returns images and labels use:
image = next(iter(trainloader))[0][0]
This is the same as doing:
images, labels = next(iter(trainloader))
image = images[0]
Random sample from DataLoader
Assuming DataLoader(shuffle=True) was used in its construction, a single random example can be drawn from the DataLoader with:
example = next(iter(dataloader))[0]
Random sample from Dataset
If that is not the case, you can draw a single random example from the Dataset with:
idx = torch.randint(len(dataset), (1,)).item()
example = dataset[idx]
The key to getting a random sample is to set shuffle=True for the DataLoader, and the key to getting a single image is to set the batch size to 1.
Here is an example after loading the MNIST dataset.
from torch.utils.data import DataLoader, Dataset, TensorDataset
bs = 1
train_ds = TensorDataset(x_train, y_train)
train_dl = DataLoader(train_ds, batch_size=bs, shuffle=True)
for xb, yb in train_dl:
print(xb.shape)
x = xb.view(28,28)
print(x.shape)
print(yb)
break #just once
from matplotlib import pyplot as plt
plt.imshow(x, cmap="gray")