tf.data.Dataset.map() for datasets made from multiple slices

tf.data.Dataset.map() for datasets made from multiple slices - python

The tf.data.Dataset.map() for a dataset created from a single slice looks like dataset.map(lambda x: x/2). What would it look like if the dataset was created from two slices? See, for example, the following code. The map() function in the last line of the code will work for a dataset created from a single slice, but causes an error for my two-slice case.
import tensorflow as tf, numpy as np # tensorflow 2.0
from tensorflow import keras as kr
dataset = tf.data.Dataset.from_tensor_slices((features_int8, labels_int8)) # features, labels are numpy arrays
model = kr.Sequential()
model.add(kr.layers.InputLayer(6)
model.add(kr.layers.Dense( 8, activation=tf.nn.tanh))
model.add(kr.layers.Dense( 3, activation=tf.nn.tanh))
model.compile(optimizer = kr.optimizers.RMSprop(), loss = kr.losses.MeanSquaredError())
model.fit(dataset.batch(64).map(lambda x: x/9), epochs = 10)

Pass the lambda function in a separate function as shown
def map_fn(x, y):
return x / 9, y
model.fit(dataset.batch(64).map(map_fn), epochs = 10)

Related

Random split a PyTorch dataset of type TensorDataset

I have a custom dataset loaded into python in numpy: (20640x8) matrix of inputs, (20640x1) vector of labels.
I am trying to prepare the data for training in a PyTorch machine learning model, which requires a training set and test set split. In my attempt, the random_split() function reports an error:
TypeError: randperm() received an invalid combination of arguments.
I couldn't figure out how to split the dataset. Here is the code I wrote:
import numpy as np
import torch
from torch.utils.data import TensorDataset, random_split
x_numpy = # (20640x8) matrix of floats
y_numpy = # (20640x1) vector of floats
x = torch.from_numpy(x_numpy.astype(np.float32))
y = torch.from_numpy(y_numpy.astype(np.float32))
dataset = TensorDataset(x, y)
trainSet, testSet = random_split(dataset, [0.6*len(dataset), 0.4*len(dataset)])
Thanks in advance for the help!

The dataset length must be integers:
>>> split = int(0.6*len(dataset))
>>> trainSet, testSet = random_split(dataset, [split, len(dataset)-split])

Merge two tensorflow datasets into one dataset with inputs and lables

I have two tensorflow datasets that are generated using timeseries_dataset_from_array (docs). One corresponds to the input of my network and the other one to the output. I guess we can call them the inputs dataset and the targets dataset, which are both the same shape (a timeseries window of a fixed size).
The code I'm using to generate these datasets goes like this:
train_x = timeseries_dataset_from_array(
df_train['x'],
None,
sequence_length,
sequence_stride=sequence_stride,
batch_size=batch_size
)
train_y = timeseries_dataset_from_array(
df_train['y'],
None,
sequence_length,
sequence_stride=sequence_stride,
batch_size=batch_size
)
The problem is that when calling model.fit, tf.keras expects that if a tf.data.Dataset is given in the x argument, it has to provide both the inputs and targets. That is why I need to combine these two datasets into one, setting one as inputs and the other one as targets.

Simplest way would be to use tf.data.Dataset.zip:
import tensorflow as tf
import numpy as np
X = np.arange(100)
Y = X*2
sample_length = 20
input_dataset = tf.keras.preprocessing.timeseries_dataset_from_array(
X, None, sequence_length=sample_length, sequence_stride=sample_length)
target_dataset = tf.keras.preprocessing.timeseries_dataset_from_array(
Y, None, sequence_length=sample_length, sequence_stride=sample_length)
dataset = tf.data.Dataset.zip((input_dataset, target_dataset))
for x, y in dataset:
print(x.shape, y.shape)
(5, 20) (5, 20)
You can then feed dataset directly to your model.

How to do multiple inferencing on onnx(onnxruntime) similar to sklearn

I want to infer outputs against many inputs from an onnx model using onnxruntime in python. One way is to use the for loop but it seems a very trivial and a slow method. Is there a way to do the same way as sklearn?
Single prediction on onnxruntime:
import onnxruntime as ort
sess = ort.InferenceSession("xxxxx.onnx")
input_name = sess.get_inputs()
label_name = sess.get_outputs()[0].name
pred_onnx= sess.run([label_name], {
input_name[0].name: np.array([[40]]).astype(np.int64),
input_name[1].name: np.array([[0]]).astype(np.int64),
input_name[2].name: np.array([[0]]).astype(np.int64)
})
pred_onnx
>> Output: [array([[23]], dtype=float32)]
Single/Multiple prediction in sklearn(depending on the size of x_test):
test_predictions = model.predict(x_test)

Best way is for the ONNX model to support batches. Based on the input you're providing it may already do that. Your 3 inputs appear to have shape [1,1] and your output has shape [1,1], which may mean the first dimension is the batch size. Example input with shape [2,1] (2 batches, 1 element per batch) would look like [[40],[50]].
I'm guessing if you provide two batches would of input you'd get two outputs, so something like this
pred_onnx= sess.run([label_name], {
input_name[0].name: np.array([[40],[40]]).astype(np.int64),
input_name[1].name: np.array([[0],[0]]).astype(np.int64),
input_name[2].name: np.array([[0],[0]]).astype(np.int64)
})
May give output of
[array([[23],[23]], dtype=float32)]

Here is a small working example using batch inference on a sklearn model exported to ONNX.
from sklearn import datasets, model_selection, linear_model, pipeline, preprocessing
import numpy as np
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType
import onnxruntime
import pandas as pd
# load toy dataset, define sklearn pipeline and fit model
dataset = datasets.load_diabetes()
X, y = dataset.data, dataset.target
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y)
regr = pipeline.Pipeline(
[("std", preprocessing.StandardScaler()), ("reg", linear_model.LinearRegression())]
)
regr.fit(X_train, y_train)
# export model to onnx
initial_type = list(
zip(
dataset.feature_names,
[FloatTensorType([None, 1]) for _ in range(len(dataset.feature_names))],
)
)
onx = convert_sklearn(regr, initial_types=initial_type)
with open("model.onnx", "wb") as f:
f.write(onx.SerializeToString())
# load model in onnx runtime and make batch inference
df_test = pd.DataFrame(X_test, columns=dataset.feature_names)
sess = onnxruntime.InferenceSession("model.onnx")
inputs = {
f: df_test[f].astype(np.float32).values.reshape(-1, 1)
for f in dataset.feature_names
}
label_name = sess.get_outputs()[0].name
pred_onx = sess.run([label_name], inputs)[0]
# compare results
regr.predict(X_test)
pred_onx.flatten()
I think the trickiest part is to get the input shape right for inference.
Since we specified FloatTensorType([None, 1]) the shape of the single input arrays must be of shape (x,1) where x is the number of batches. Thus we need to reshape column values of shape (x,) into (x,1).

Use Generator to input datasets but get IndexError

model.fit(x,y, epochs=10000, batch_size=1)
The above codes works fine. When I use a function to feed the data in the model, something went wrong.
model.fit(GData(), epochs=10000, batch_size=1)
per_sample_losses = loss_fn.call(targets[i], outs[i])
IndexError: list index out of range
The GData() function is given below:
def GData():
return (x,y)
x is a numpy array with dimension (2, 63, 85)
y is a numpy array with dimension (2, 63, 41000)
This is the whole codes:
import os
import tensorflow as tf
import numpy as np
def MSE( y_true, y_pred):
error = tf.math.reduce_mean(tf.math.square(y_true-y_pred))
return error
data = np.load("Data.npz")
x = data['x'] # (2,63, 85)
y = data['y'] # (2,63,41000)
frame = x.shape[1]
InSize = x.shape[2]
OutSize = y.shape[2]
def GData():
return (x,y)
model = tf.keras.Sequential()
model.add(tf.keras.layers.GRU(1000, return_sequences=True, input_shape=(frame,InSize)))
model.add(tf.keras.layers.Dense(OutSize))
model.compile(optimizer='adam',
loss=MSE)#'mean_squared_error')
model.fit(GData(), epochs=10000, batch_size=1)

First, your function GData is not actually a generator as it is returning a value rather than yielding a value. Regardless, we should take a look at the fit() method and its documentation which you can find here.
From this, we see that the first two arguments to fit() are x and y. Going further, we see that x is limited to a few types. Namely, generators, numpy arrays, tf.data.Datasets, and a few others. An important thing to note in the documentation is that if x is a generator, it must be A generator or keras.utils.Sequence returning (inputs, targets). I am assuming this is what you are looking for. If this is the case, you will need to modify your GData function so that it is actually a generator. This can be done as such
batch_size = 1
EPOCHS = 10000
def GData():
for _ in range(EPOCHS): # Iterate through epochs. Note that this can be changed to be while True so that the generator yields indefinitely. The model will stop training after the amount of epochs you specify in the fit method.
for i in range(0, len(x), batch_size): # Iterate through batches
yield (x[i:batch_size], y[i:batch_size]) # Yield batches for training
Then, you have to specify the amount of steps per epoch in your fit() call so your model knows when to stop at each epoch.
model.fit(GData(), epochs=EPOCHS, steps_per_epoch=x.shape[0]//batch_size)

PyTorch DataLoader returns the batch as a list with the batch as the only entry. How is the best way to get a tensor from my DataLoader

I currently have the following situation where I want to use DataLoader to batch a numpy array:
import numpy as np
import torch
import torch.utils.data as data_utils
# Create toy data
x = np.linspace(start=1, stop=10, num=10)
x = np.array([np.random.normal(size=len(x)) for i in range(100)])
print(x.shape)
# >> (100,10)
# Create DataLoader
input_as_tensor = torch.from_numpy(x).float()
dataset = data_utils.TensorDataset(input_as_tensor)
dataloader = data_utils.DataLoader(dataset,
batch_size=100,
)
batch = next(iter(dataloader))
print(type(batch))
# >> <class 'list'>
print(len(batch))
# >> 1
print(type(batch[0]))
# >> class 'torch.Tensor'>
I expect the batchto be already a torch.Tensor. As of now I index the batch like so, batch[0] to get a Tensor but I feel this is not really pretty and makes the code harder to read.
I found that the DataLoader takes a batch processing function called collate_fn. However, setting data_utils.DataLoader(..., collage_fn=lambda batch: batch[0]) only changes the list to a tuple (tensor([ 0.8454, ..., -0.5863]),) where the only entry is the batch as a Tensor.
You would help me a lot by helping me finding out how to elegantly transform the batch to a tensor (even if this would include telling me that indexing the single entry in batch is okay).

Sorry for inconvenience with my answer.
Actually, you don't have to create Dataset from your tensor, you can pass torch.Tensor directly as it implements __getitem__ and __len__, so this is sufficient:
import numpy as np
import torch
import torch.utils.data as data_utils
# Create toy data
x = np.linspace(start=1, stop=10, num=10)
x = np.array([np.random.normal(size=len(x)) for i in range(100)])
# Create DataLoader
dataset = torch.from_numpy(x).float()
dataloader = data_utils.DataLoader(dataset, batch_size=100)
batch = next(iter(dataloader))

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

tf.data.Dataset.map() for datasets made from multiple slices - python

Pass the lambda function in a separate function as shown def map_fn(x, y): return x / 9, y model.fit(dataset.batch(64).map(map_fn), epochs = 10)

Related

Random split a PyTorch dataset of type TensorDataset

Merge two tensorflow datasets into one dataset with inputs and lables

How to do multiple inferencing on onnx(onnxruntime) similar to sklearn

Use Generator to input datasets but get IndexError

PyTorch DataLoader returns the batch as a list with the batch as the only entry. How is the best way to get a tensor from my DataLoader

Categories

Resources