I have two TensorFlow datasets that are generated using timeseries_dataset_from_array (docs). One corresponds to the input of my network and the other to its output; call them the inputs dataset and the targets dataset. Both have the same shape (a time-series window of a fixed size).
The code I'm using to generate these datasets goes like this:
from tensorflow.keras.preprocessing import timeseries_dataset_from_array

train_x = timeseries_dataset_from_array(
    df_train['x'],
    None,
    sequence_length,
    sequence_stride=sequence_stride,
    batch_size=batch_size
)
train_y = timeseries_dataset_from_array(
    df_train['y'],
    None,
    sequence_length,
    sequence_stride=sequence_stride,
    batch_size=batch_size
)
The problem is that when a tf.data.Dataset is given as the x argument of model.fit, tf.keras expects it to provide both the inputs and the targets. That is why I need to combine these two datasets into one, with one serving as the inputs and the other as the targets.
The simplest way is to use tf.data.Dataset.zip:
import tensorflow as tf
import numpy as np
X = np.arange(100)
Y = X*2
sample_length = 20
input_dataset = tf.keras.preprocessing.timeseries_dataset_from_array(
    X, None, sequence_length=sample_length, sequence_stride=sample_length)
target_dataset = tf.keras.preprocessing.timeseries_dataset_from_array(
    Y, None, sequence_length=sample_length, sequence_stride=sample_length)
dataset = tf.data.Dataset.zip((input_dataset, target_dataset))
for x, y in dataset:
    print(x.shape, y.shape)
# (5, 20) (5, 20)
You can then feed dataset directly to your model.
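For instance, a minimal sketch (the model architecture here is an arbitrary placeholder, not from the original post):

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(sample_length,)),
    tf.keras.layers.Dense(sample_length),  # predict one full target window
])
model.compile(optimizer="adam", loss="mse")
model.fit(dataset, epochs=10)  # dataset yields (inputs, targets) pairs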
Related
I'm using TensorFlow 2.9.1. I have a test_dataset object of class tf.data.Dataset that stores both inputs and labels. The inputs are 4-dimensional Tensors and the labels are 3-dimensional Tensors:
print(test_dataset)
# <PrefetchDataset element_spec=(TensorSpec(shape=(64, 5, 548, 1), dtype=tf.float64, name=None), TensorSpec(shape=(64, 1, 1), dtype=tf.float64, name=None))>
The first dimension is the minibatch size. I need to convert this Tensorflow Dataset to two NumPy arrays, X_test containing the inputs, and y_test containing the labels, ordered in the same way. In other words, (X_test[0], y_test[0]) must correspond to the first sample from test_dataset. Since the first dimension of my tensors is the minibatch size, I want to concatenate the results along that first dimension.
How can I do that? I've seen two approaches:
np.concatenate
X_test = np.concatenate([x for x, _ in test_dataset], axis=0)
y_test = np.concatenate([y for _, y in test_dataset], axis=0)
But I don't like this approach for two reasons:

- it seems wasteful to iterate twice over the same dataset
- X_test and y_test are probably not ordered in the same way. If I run
X_test = np.concatenate([x for x, _ in test_dataset], axis=0)
X_test2 = np.concatenate([x for x, _ in test_dataset], axis=0)
X_test and X_test2 are different arrays, though of identical shape. I suspect the dataset is being shuffled after I iterate through it once. However, this implies that X_test and y_test in my snippet above also won't be ordered in the same way. How can I fix that?
tfds.as_numpy
tfds.as_numpy can be used to convert a TensorFlow Dataset to an iterable of NumPy arrays:
import tensorflow_datasets as tfds
np_test_dataset = tfds.as_numpy(test_dataset)
print(np_test_dataset)
<generator object _eager_dataset_iterator at 0x7fee81fd8b30>
However, I don't know how to proceed from here: how do I convert this iterable of NumPy arrays, to two NumPy arrays of the right shapes?
Instead of iterating over the dataset twice, you can unpack the dataset and concatenate the arrays inside the resulting tuples to get the final result.
zip(*ds) separates the dataset into two sequences (the X's and the y's). X and y each become a tuple of arrays, and you then concatenate those arrays. You can read more about how zip(*iterables) works here.
Here is an example with mnist data:
import numpy as np
import tensorflow as tf
import tensorflow_datasets as tfds
ds = tfds.load('mnist', split='train', as_supervised=True)
print(ds)
# <PrefetchDataset element_spec=(TensorSpec(shape=(28, 28, 1), dtype=tf.uint8, name=None), TensorSpec(shape=(), dtype=tf.int64, name=None))>
X, y = zip(*ds)
print(type(X), type(y))
# <class 'tuple'> <class 'tuple'>
print(len(X), len(y))
# 60000 60000
X_arr = np.concatenate(X)
print(X_arr.shape)
# (1680000, 28, 1)
You would do the same concatenation with your y's. I am not showing it here because this dataset has a different dimensionality. np.concatenate is used since you want to join the arrays along the first existing axis.
If needed, unpacking could also be done on the iterable created by the tfds.as_numpy() method:
X, y = zip(*tfds.as_numpy(ds))
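Applied to your test_dataset (assuming it is not reshuffled between passes), a single unpacking keeps inputs and labels aligned:

X_batches, y_batches = zip(*test_dataset)      # one pass over the dataset
X_test = np.concatenate(X_batches)             # shape (num_batches * 64, 5, 548, 1)
y_test = np.concatenate(y_batches)             # shape (num_batches * 64, 1, 1)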
I have a dataset which is a big matrix of shape (100 000, 2 000).
I would like to train my Tensorflow neural network with all the possible sliding windows/submatrices of shape (16, 2000) of this big matrix.
I use:
from skimage.util.shape import view_as_windows
A.shape # (100000, 2000), i.e. a 100k x 2k matrix
X = view_as_windows(A, (16, 2000)).reshape((-1, 16, 2000, 1))
X.shape # (99985, 16, 2000, 1)
...
model.fit(X, Y, batch_size=4, epochs=8)
Unfortunately, this leads to a memory problem:
W tensorflow/core/framework/allocator.cc:122] Allocation of ... exceeds 10% of system memory.
This is normal, since X has ~ 100k * 16 * 2k coefficients, i.e. more than 3 billion coefficients!
But in fact, it is a waste of memory to load X in memory because it is highly redundant: it is made of sliding windows of shape (16, 2000) over A.
Question: how to train a neural network with input being all sliding windows of width 16 over a 100k x 2k matrix, without wasting memory?
The documentation of skimage.util.view_as_windows states indeed that it's costly in memory:
One should be very careful with rolling views when it comes to memory usage. Indeed, although a ‘view’ has the same memory footprint as its base array, the actual array that emerges when this ‘view’ is used in a computation is generally a (much) larger array than the original, especially for 2-dimensional arrays and above.
For example, let us consider a 3-dimensional array of size (100, 100, 100) of float64. [...] the hypothetical size of the rolling view (if one was to reshape the view for example) would be 8 * (100 - 3 + 1)**3 * 3**3, which is about 203 MB! The scaling becomes even worse as the dimension of the input array becomes larger.
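The same back-of-the-envelope arithmetic for the matrix in question:

# materialized window array: 99985 windows of 16 x 2000 float64 values, 8 bytes each
n_windows = 100000 - 16 + 1              # 99985
print(n_windows * 16 * 2000 * 8 / 1e9)   # ~25.6 GB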
Edit: timeseries_dataset_from_array is exactly what I'm looking for except that it works only for 1D sequences:
import tensorflow
import tensorflow.keras.preprocessing
x = list(range(100))
x2 = tensorflow.keras.preprocessing.timeseries_dataset_from_array(x, None, 10, sequence_stride=1, sampling_rate=1, batch_size=128, shuffle=False, seed=None, start_index=None, end_index=None)
for b in x2:
    print(b)
and it doesn't work for 2D arrays:
import numpy as np

x = np.array(range(90)).reshape(6, 15)
print(x)
x2 = tensorflow.keras.preprocessing.timeseries_dataset_from_array(x, None, (6, 3), sequence_stride=1, sampling_rate=1, batch_size=128, shuffle=False, seed=None, start_index=None, end_index=None)
# does not work: sequence_length must be an int, not a tuple of window dimensions
If you are using TensorFlow, you can use a tf.data.Dataset and map a preprocessing function over the data like so:
import tensorflow as tf

A.shape  # (100000, 2000)
A_t = tf.convert_to_tensor(A)  # convert once so slicing works inside the traced map function

def get_window(starting_idx):
    """Extract a window of A of shape (16, 2000, 1) as a tf.Tensor."""
    window = A_t[starting_idx : starting_idx + 16]
    return tf.expand_dims(window, -1)  # add the trailing channel axis

# Make dataset of window start indices (100000 - 16 + 1 = 99985 windows)
data_ds = tf.data.Dataset.range(A.shape[0] - 16 + 1)
data_ds = data_ds.map(get_window)

# Make dataset for labels
label_ds = tf.data.Dataset.from_tensor_slices(Y)

# Zip them into one dataset
ds = tf.data.Dataset.zip((data_ds, label_ds))

# Pre-batch the dataset
ds = ds.batch(4)

# Sanity check for batch size
for batch, label in ds:
    print(batch.shape)  # (4, 16, 2000, 1)
    break

# Now call .fit() without batch size
model.fit(ds, epochs=8)
Defining a function for extracting each window and mapping this over an existing dataset should solve your memory problem, as it should allow the windows to be formed only when needed.
This is in general one of the best ways to handle data when working with Tensorflow, and you can handle large amounts of data this way.
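For instance, the usual tf.data performance options slot straight into the pipeline above (the shuffle buffer size below is an arbitrary choice, not from the original answer):

# same pipeline with common performance options added
data_ds = tf.data.Dataset.range(A.shape[0] - 16 + 1)
data_ds = data_ds.map(get_window, num_parallel_calls=tf.data.AUTOTUNE)
ds = tf.data.Dataset.zip((data_ds, label_ds))
ds = ds.shuffle(10000)                       # shuffles (window, label) pairs together
ds = ds.batch(4).prefetch(tf.data.AUTOTUNE)  # overlap input pipeline with training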
For more information, see tf.data.Dataset and tensorflow.org/guide/data.
You can use a generator to yield examples on the fly instead of storing them in memory.
You can either write a custom generator or use a generator provided by Keras, like timeseries_dataset_from_array (docs), which can also yield windows with the help of options like sequence_stride.
For a custom generator, you can do something like:

def generator_custom(df3):
    for index, row in df3.iterrows():
        # some preprocessing
        yield X, y
And then you can use tf.data to take batches of 128/64/32, as in:

train_dataset = tf.data.Dataset.from_generator(
    lambda: generator_custom(df_train),
    output_signature=(tf.TensorSpec(shape=(None,), dtype=tf.float32),  # adjust to your X
                      tf.TensorSpec(shape=(), dtype=tf.float32)))      # adjust to your y
train_dataset = train_dataset.batch(128, drop_remainder=True)
Replying to your comment about two dimensions:
Just an example (I scaled 100000 x 2000 down to 1000 x 200; feel free to change the numbers):
import numpy as np
import tensorflow

x = np.array(range(200000)).reshape(1000, 200)
x2 = tensorflow.keras.preprocessing.timeseries_dataset_from_array(x, None, 16, sequence_stride=1, sampling_rate=1, batch_size=128)
That gives you something like
shapes (128, 16, 200)
shapes (128, 16, 200)
Which is what you want (16 x 2000), right? (Remember we use 200 here just for demonstration purposes.)
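Scaled back up to the shapes in the question, the same call should yield (16, 2000) windows directly from A; if the model expects a trailing channel axis, a map can add it (a sketch, assuming A and the batch size from the question):

ds = tensorflow.keras.preprocessing.timeseries_dataset_from_array(
    A, None, sequence_length=16, sequence_stride=1, batch_size=4)
ds = ds.map(lambda w: tensorflow.expand_dims(w, -1))  # (4, 16, 2000) -> (4, 16, 2000, 1)

Targets can then be attached with tf.data.Dataset.zip, as in the other answer.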
I want to infer outputs against many inputs from an ONNX model using onnxruntime in Python. One way is to use a for loop, but that seems like a naive and slow method. Is there a way to do it the way sklearn does?
Single prediction on onnxruntime:
import onnxruntime as ort
import numpy as np

sess = ort.InferenceSession("xxxxx.onnx")
input_name = sess.get_inputs()
label_name = sess.get_outputs()[0].name
pred_onnx = sess.run([label_name], {
    input_name[0].name: np.array([[40]]).astype(np.int64),
    input_name[1].name: np.array([[0]]).astype(np.int64),
    input_name[2].name: np.array([[0]]).astype(np.int64)
})
pred_onnx
>> Output: [array([[23]], dtype=float32)]
Single/Multiple prediction in sklearn(depending on the size of x_test):
test_predictions = model.predict(x_test)
The best way is for the ONNX model to support batches. Based on the input you're providing, it may already do that. Your 3 inputs appear to have shape [1,1] and your output has shape [1,1], which may mean the first dimension is the batch size. An example input with shape [2,1] (a batch of 2, with 1 element per sample) would look like [[40],[50]].
I'm guessing that if you provide a batch of two inputs you'd get two outputs, so something like this
pred_onnx = sess.run([label_name], {
    input_name[0].name: np.array([[40],[40]]).astype(np.int64),
    input_name[1].name: np.array([[0],[0]]).astype(np.int64),
    input_name[2].name: np.array([[0],[0]]).astype(np.int64)
})
may give an output of
[array([[23],[23]], dtype=float32)]
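One way to check whether the model actually supports a free batch dimension is to inspect the shapes onnxruntime reports; a None or a symbolic name in the first position usually indicates a dynamic batch axis:

for inp in sess.get_inputs():
    print(inp.name, inp.shape)  # e.g. ['batch_size', 1] or [None, 1] for a dynamic batch axis
out = sess.get_outputs()[0]
print(out.name, out.shape)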
Here is a small working example using batch inference on a sklearn model exported to ONNX.
from sklearn import datasets, model_selection, linear_model, pipeline, preprocessing
import numpy as np
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType
import onnxruntime
import pandas as pd

# load toy dataset, define sklearn pipeline and fit model
dataset = datasets.load_diabetes()
X, y = dataset.data, dataset.target
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y)
regr = pipeline.Pipeline(
    [("std", preprocessing.StandardScaler()), ("reg", linear_model.LinearRegression())]
)
regr.fit(X_train, y_train)

# export model to onnx
initial_type = list(
    zip(
        dataset.feature_names,
        [FloatTensorType([None, 1]) for _ in range(len(dataset.feature_names))],
    )
)
onx = convert_sklearn(regr, initial_types=initial_type)
with open("model.onnx", "wb") as f:
    f.write(onx.SerializeToString())

# load model in onnx runtime and make batch inference
df_test = pd.DataFrame(X_test, columns=dataset.feature_names)
sess = onnxruntime.InferenceSession("model.onnx")
inputs = {
    f: df_test[f].astype(np.float32).values.reshape(-1, 1)
    for f in dataset.feature_names
}
label_name = sess.get_outputs()[0].name
pred_onx = sess.run([label_name], inputs)[0]
# compare results
print(regr.predict(X_test))
print(pred_onx.flatten())
I think the trickiest part is getting the input shape right for inference.
Since we specified FloatTensorType([None, 1]), each single-feature input array must have shape (x, 1), where x is the number of samples in the batch. We therefore reshape column values of shape (x,) into (x, 1).
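For a single feature column, say the diabetes feature 'age', the reshape looks like:

col = df_test["age"].astype(np.float32).values  # shape (x,)
col = col.reshape(-1, 1)                        # shape (x, 1), matching FloatTensorType([None, 1])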
The tf.data.Dataset.map() for a dataset created from a single slice looks like dataset.map(lambda x: x/2). What would it look like if the dataset was created from two slices? See, for example, the following code. The map() function in the last line of the code will work for a dataset created from a single slice, but causes an error for my two-slice case.
import tensorflow as tf, numpy as np # tensorflow 2.0
from tensorflow import keras as kr
dataset = tf.data.Dataset.from_tensor_slices((features_int8, labels_int8)) # features, labels are numpy arrays
model = kr.Sequential()
model.add(kr.layers.InputLayer(input_shape=(6,)))
model.add(kr.layers.Dense(8, activation=tf.nn.tanh))
model.add(kr.layers.Dense(3, activation=tf.nn.tanh))
model.compile(optimizer = kr.optimizers.RMSprop(), loss = kr.losses.MeanSquaredError())
model.fit(dataset.batch(64).map(lambda x: x/9), epochs = 10)
Instead of the lambda, define a mapping function that accepts both components of the dataset, as shown:

def map_fn(x, y):
    return x / 9, y

model.fit(dataset.batch(64).map(map_fn), epochs=10)
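Alternatively, since the dataset elements are (features, labels) pairs, map also accepts a two-argument lambda directly:

model.fit(dataset.batch(64).map(lambda x, y: (x / 9, y)), epochs=10)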
I was implementing a simple k-nearest-neighbor algorithm from sklearn on Google Cloud Platform ML Engine. I use a custom metric to calculate the distance between two input vectors, defined as the weighted sum of the element-wise squared differences between them. The code is below:
import os.path
from sklearn import neighbors
import numpy as np
from six.moves import cPickle as pickle
import tensorflow as tf
from tensorflow.python.lib.io import file_io
flags = tf.app.flags
FLAGS = flags.FLAGS
flags.DEFINE_string('input_dir', 'input', 'Input Directory.')
flags.DEFINE_string('input_train_data','train_data','Input Training Data File Name.')
pickle_file = os.path.join(FLAGS.input_dir, FLAGS.input_train_data)
def mydist(x, y):
    return np.dot((x - y) ** 2, weight)

with file_io.FileIO(pickle_file, 'r') as f:
    save = pickle.load(f)

train_dataset, train_labels, valid_dataset, valid_labels = (
    save['train_dataset'], save['train_labels'],
    save['valid_dataset'], save['valid_labels'])
train_data = train_dataset[:1000]
train_label = train_labels[:1000]
test_data = valid_dataset[:100]
weight = [1.0]* len(train_dataset[1])
knn = neighbors.KNeighborsRegressor(weights='distance', n_neighbors=20, metric=lambda x, y: mydist(x, y))
knn.fit(train_data, train_label)
predict = knn.predict(test_data)
print(predict)
train_dataset is a numpy array of shape (86667,13) and valid_dataset has shape (8000,13). Train_labels has shape (86667,1) and valid_labels (8000,1). For some reason I got a dimension mismatch:
line 15, in mydist
    return np.dot((x - y) ** 2, weight)
ValueError: shapes (10,) and (13,) not aligned: 10 (dim 0) != 13 (dim 0)
Both x and y in the input of the custom metric should have size 13 yet somehow they have size 10. Can anyone explain what is wrong here?
You are taking the distance between the wrong terms. You cannot take the distance between the labels and the train features; these have different dimensions. You need to calculate the distance between two feature points, say x1 and x2, not between a label and its feature point (say x1 and y1). Secondly, when declaring the KNeighborsRegressor object, you have specified the metric parameter incorrectly: metric takes either a string or a DistanceMetric object. You first have to construct a distance metric object and then pass it as the metric. So, this is the correct way of calling your function:
from sklearn.neighbors import DistanceMetric

my_metric = DistanceMetric.get_metric('pyfunc', func=mydist)
knn = neighbors.KNeighborsRegressor(weights='distance', n_neighbors=20, metric=my_metric)
Sklearn will itself take care of how the parameters are passed to the distance function. I am assuming that the weight variable is global, so that your code functions correctly.
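If your sklearn version does not accept a DistanceMetric object for metric, a sketch of the equivalent call using the 'pyfunc' identifier with metric_params (this route requires a tree-based algorithm such as ball_tree):

knn = neighbors.KNeighborsRegressor(
    weights='distance',
    n_neighbors=20,
    algorithm='ball_tree',           # 'pyfunc' is only handled by the tree implementations
    metric='pyfunc',
    metric_params={'func': mydist},  # your custom distance function
)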