TensorFlow tf.data.Dataset and bucketing - python

For an LSTM network, I've seen great improvements with bucketing.
I've come across the bucketing section in the TensorFlow docs (under tf.contrib).
In my network, however, I am using the tf.data.Dataset API with TFRecords, so my input pipeline looks something like this:
dataset = tf.data.TFRecordDataset(TFRECORDS_PATH)
dataset = dataset.map(_parse_function)
dataset = dataset.map(_scale_function)
dataset = dataset.shuffle(buffer_size=10000)
dataset = dataset.padded_batch(batch_size, padded_shapes={.....})
How can I incorporate the bucketing method into the tf.data.Dataset pipeline?
If it matters, in every record in the TFRecords file I have the sequence length saved as an integer.

Various bucketing use cases using Dataset API are explained well here.
bucket_by_sequence_length() example:
def elements_gen():
    text = [[1, 2, 3], [3, 4, 5, 6, 7], [1, 2], [8, 9, 0, 2]]
    label = [1, 2, 1, 2]
    for x, y in zip(text, label):
        yield (x, y)

def element_length_fn(x, y):
    return tf.shape(x)[0]

dataset = tf.data.Dataset.from_generator(generator=elements_gen,
                                         output_shapes=([None], []),
                                         output_types=(tf.int32, tf.int32))
dataset = dataset.apply(tf.contrib.data.bucket_by_sequence_length(element_length_func=element_length_fn,
                                                                  bucket_batch_sizes=[2, 2, 2],
                                                                  bucket_boundaries=[0, 8]))
batch = dataset.make_one_shot_iterator().get_next()
with tf.Session() as sess:
    for _ in range(2):
        print('Get_next:')
        print(sess.run(batch))
Output:
Get_next:
(array([[1, 2, 3, 0, 0],
       [3, 4, 5, 6, 7]], dtype=int32), array([1, 2], dtype=int32))
Get_next:
(array([[1, 2, 0, 0],
       [8, 9, 0, 2]], dtype=int32), array([1, 2], dtype=int32))
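To see what the transformation is doing to the elements, here is a rough pure-Python sketch of the same idea: pick a bucket from the sequence length, collect elements per bucket, and pad each batch only to the longest sequence inside it. The function name bucket_by_length and the pad_value parameter are made up for illustration, and flushing leftover elements at the end is a simplifying assumption; this is not TensorFlow's implementation. Because the transformation batches and pads by itself, in a pipeline like the one in the question it would presumably take the place of the padded_batch call.

```python
from bisect import bisect_right


def bucket_by_length(elements, bucket_boundaries, bucket_batch_sizes, pad_value=0):
    """Group (sequence, label) pairs into padded batches of similar length."""
    # N boundaries define N + 1 buckets: (-inf, b0), [b0, b1), ..., [b_last, +inf)
    assert len(bucket_batch_sizes) == len(bucket_boundaries) + 1
    pending = [[] for _ in bucket_batch_sizes]

    def emit(items):
        # Pad every sequence to the longest one in this batch only
        width = max(len(seq) for seq, _ in items)
        padded = [list(seq) + [pad_value] * (width - len(seq)) for seq, _ in items]
        labels = [label for _, label in items]
        return padded, labels

    for seq, label in elements:
        bucket_id = bisect_right(bucket_boundaries, len(seq))
        pending[bucket_id].append((seq, label))
        # A bucket is emitted as soon as it holds a full batch
        if len(pending[bucket_id]) == bucket_batch_sizes[bucket_id]:
            yield emit(pending[bucket_id])
            pending[bucket_id] = []

    # Assumption: partially filled buckets are flushed once the input ends
    for items in pending:
        if items:
            yield emit(items)


text = [[1, 2, 3], [3, 4, 5, 6, 7], [1, 2], [8, 9, 0, 2]]
label = [1, 2, 1, 2]
batches = list(bucket_by_length(zip(text, label), [0, 8], [2, 2, 2]))
for padded, labels in batches:
    print(padded, labels)
```

With the boundaries [0, 8] all four sequences fall into the middle bucket, so the two batches follow input order and are padded to lengths 5 and 4, the same grouping as in the output above.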

Related

MinMaxScaler: Normalise a 4D array of input

I have a 4D array of input that I would like to normalise using MinMaxScaler. For simplicity, I give an example with the following array:
A = np.array([
[[[0, 1, 2, 3],
[3, 0, 1, 2],
[2, 3, 0, 1],
[1, 3, 2, 1],
[1, 2, 3, 0]]],
[[[9, 8, 7, 6],
[5, 4, 3, 2],
[0, 9, 8, 3],
[1, 9, 2, 3],
[1, 0, -1, 2]]],
[[[0, 7, 1, 2],
[1, 2, 1, 0],
[0, 2, 0, 7],
[-1, 3, 0, 1],
[1, 0, 1, 0]]]
])
A.shape
(3,1,5,4)
In the given example, the array contains 3 input samples, where each sample has the shape (1,5,4). Each column of the input represents 1 variable (feature), so each sample has 4 features.
I would like to normalise the input data, but MinMaxScaler expects a 2D array of shape (n_samples, n_features), like a DataFrame.
How then do I use it to normalise this input data?
Vectorize the data
import numpy as np
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
A_sq = np.squeeze(A)  # drop the singleton axis
print(A_sq.shape)
# (3, 5, 4)
scaler.fit(A_sq.reshape(3, -1))  # reshape to (3, 20): one row per sample
# MinMaxScaler()
You can use the code below to normalize a 4D array.
import numpy as np
from sklearn.preprocessing import MinMaxScaler
def norm(arr):
    arrays_list = []
    objects_list = []
    for i in range(arr.shape[0]):
        temp_arr = arr[i][0]  # the (5, 4) slice of sample i
        # Fit a fresh scaler per sample so every list entry keeps its own statistics
        scaler = MinMaxScaler(feature_range=(0, 1))
        temp_arr = scaler.fit_transform(temp_arr)
        objects_list.append(scaler)
        arrays_list.append([temp_arr])
    return objects_list, np.array(arrays_list)
Pass the array to the function like this:
objects, array = norm(A)
It returns a list of fitted MinMaxScaler objects and a copy of your array with normalized values.
Output:
" If you want a scaler for each channel, you can reshape each channel of the data to be of shape (10000, 5*5). Each channel (which was previously 5x5) is now a length 25 vector, and the scaler will work. You'll have to transform your evaluation data in the same way with the scalers in channel_scalers."
Maybe this will help, not sure if this is what you're looking for exactly, but...
Python scaling with 4D data
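If the goal is to keep the 4 columns as the features, as the question describes, another option is to flatten every axis except the last one, scale per column, and restore the shape. The sketch below does the min-max arithmetic by hand on nested lists so it runs without any libraries; the function name minmax_scale_last_axis is invented here. With NumPy the flattening step corresponds to A.reshape(-1, A.shape[-1]), and MinMaxScaler could then be fitted on that 2D view, though that variant is not shown or tested here.

```python
def minmax_scale_last_axis(samples):
    """Scale a nested list of shape (n, 1, rows, features) to [0, 1] per feature."""
    # Flatten everything except the last axis: one row per (sample, row) pair
    rows = [row for sample in samples for channel in sample for row in channel]
    n_features = len(rows[0])
    lows = [min(r[j] for r in rows) for j in range(n_features)]
    highs = [max(r[j] for r in rows) for j in range(n_features)]

    def scale(row):
        # A constant feature (hi == lo) is mapped to 0.0 to avoid dividing by zero
        return [(v - lo) / (hi - lo) if hi > lo else 0.0
                for v, lo, hi in zip(row, lows, highs)]

    # Rebuild the original nesting with the scaled rows
    return [[[scale(row) for row in channel] for channel in sample]
            for sample in samples]


A = [
    [[[0, 1, 2, 3], [3, 0, 1, 2], [2, 3, 0, 1], [1, 3, 2, 1], [1, 2, 3, 0]]],
    [[[9, 8, 7, 6], [5, 4, 3, 2], [0, 9, 8, 3], [1, 9, 2, 3], [1, 0, -1, 2]]],
    [[[0, 7, 1, 2], [1, 2, 1, 0], [0, 2, 0, 7], [-1, 3, 0, 1], [1, 0, 1, 0]]],
]
scaled = minmax_scale_last_axis(A)
print(scaled[1][0][0])  # first row of the second sample, scaled per feature
```

Unlike the (3, 20) reshape, this treats each of the 15 rows across all samples as one observation of the 4 features, so a single minimum and maximum per column is shared by the whole array.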

Generating segment labels for a Tensor given a value indicating segment boundaries

Does anyone know of a way to generate a 'segment label' for a Tensor, given a unique value that represents segment boundaries within the Tensor?
For example, given a 1D input tensor where the value 1 represents a segment boundary,
x = torch.Tensor([5, 4, 1, 3, 6, 2])
the resulting segment label Tensor should have the same shape with values representing the two segments:
segment_label = torch.Tensor([1, 1, 1, 2, 2, 2])
Likewise, for a batch of inputs, e.g. batch size = 3,
x = torch.Tensor([
[5, 4, 1, 3, 6, 2],
[9, 4, 5, 1, 8, 10],
[10, 1, 5, 4, 8, 9]
])
the resulting segment label Tensor (using 1 as the segment separator) should look something like this:
segment_label = torch.Tensor([
[1, 1, 1, 2, 2, 2],
[1, 1, 1, 1, 2, 2],
[1, 1, 2, 2, 2, 2]
])
Context: I'm currently working with Fairseq's Transformer implementation in PyTorch for a seq2seq NLP task. I am looking for a way to incorporate BERT-like segment embeddings in the Transformer during the encoder's forward pass, rather than modifying an existing dataset used for translation tasks, such as language_pair_dataset.
Thanks in advance!
You can use torch.cumsum to do the trick:
mask = (x == 1).to(x) # mask with only the boundaries
segment_label = mask.cumsum(dim=-1) - mask + 1
This yields the desired segment_label.
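For anyone who wants to check the arithmetic without PyTorch, here is a small pure-Python sketch of the same cumulative-sum idea, using itertools.accumulate in place of torch.cumsum. The helper name segment_labels is made up for this example.

```python
from itertools import accumulate


def segment_labels(row, boundary=1):
    """Label each position with its segment number; the boundary closes a segment."""
    mask = [1 if value == boundary else 0 for value in row]
    # Running count of boundaries seen so far, including the current position
    running = list(accumulate(mask))
    # Subtracting the mask keeps the boundary itself inside the segment it ends
    return [count - m + 1 for count, m in zip(running, mask)]


batch = [
    [5, 4, 1, 3, 6, 2],
    [9, 4, 5, 1, 8, 10],
    [10, 1, 5, 4, 8, 9],
]
labels = [segment_labels(row) for row in batch]
for row_labels in labels:
    print(row_labels)
```

Without the "- mask" term the boundary token would already carry the label of the next segment, which is why the answer subtracts it before adding 1.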

Using tf.nn.embedding_lookup with multiple columns

I'm having a problem building a forecasting (LSTM) model with TensorFlow.
I'm trying to use embedding_lookup with multiple categorical columns.
This is my data (shape [100, 12, 16]):
[[[1, 1, 1, 7, 1, -1, 26, 9],
[1, 5, 2, -5, 2, 20, 25, 8],
[1, 4, 3, 1, 1, 32, 1, 7],
...]]
For forecasting, each row has multiple features, and I converted these features to categorical data (0, 1, 2, ...). After that, I'd like to use an embedding layer for each column and then concatenate them all to use as the input to the LSTM.
How can I apply an embedding layer to each column and concatenate the results?
This is my code:
def get_var(self, name='', shape=None, dtype=tf.float32):
    return tf.get_variable(name, shape, dtype=dtype, initializer=self.initializer)
def get_embedding(self, data='', name='', shape=[], dtype=tf.float32):
    return tf.nn.embedding_lookup(self.get_var(name=name, shape=shape), data)
emb_concat = self.get_embedding(tf.cast(data[:,:, 0], 'int32'), name='emb_ap2id'+type, shape=[3, 2])
emb = self.get_embedding(tf.cast(data[:,:, 1], 'int32'), name='emb_month'+type, shape=[12, 11])
emb_concat = tf.concat(emb_concat, emb, axis=2)
It isn't working, and I get an error message.
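The error text is not shown, so the following is only a guess at the cause. tf.concat takes the tensors to join as a single list, followed by the axis, so the last line of the snippet would normally be written as tf.concat([emb_concat, emb], axis=2). Embedding ids also have to lie between 0 and the table size minus 1, so negative values such as -1 or -5 would need remapping first. The pure-Python sketch below mimics the per-column lookup and the concatenation along the last axis on nested lists; the table contents, the embedding widths (2 and 3), and both helper names are hypothetical.

```python
def embedding_lookup(table, ids):
    """Replace every integer id in a (batch, time) grid with its row from the table."""
    return [[list(table[i]) for i in sequence] for sequence in ids]


def concat_last_axis(tensors):
    """Concatenate a LIST of (batch, time, dim) nested lists along the last axis."""
    first = tensors[0]
    return [
        [sum((t[b][s] for t in tensors), []) for s in range(len(first[b]))]
        for b in range(len(first))
    ]


# Hypothetical embedding tables: 3 ids -> 2 dims, 12 ids -> 3 dims
ap2id_table = [[float(i), float(i) + 0.5] for i in range(3)]
month_table = [[i * 10.0, i * 10.0 + 1, i * 10.0 + 2] for i in range(12)]

# One sequence of three time steps, using the first two columns of the sample data
data = [[[1, 1], [1, 5], [1, 4]]]
col0 = [[step[0] for step in seq] for seq in data]  # analogous to data[:, :, 0]
col1 = [[step[1] for step in seq] for seq in data]  # analogous to data[:, :, 1]

emb0 = embedding_lookup(ap2id_table, col0)
emb1 = embedding_lookup(month_table, col1)
combined = concat_last_axis([emb0, emb1])  # the inputs are passed as one list
print(combined[0][1])
```

Each time step ends up with a vector whose width is the sum of the individual embedding widths (2 + 3 = 5 here), which is the shape the LSTM input would then need to expect.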

How to explain the output of tf.rank in TensorFlow

I am new to TensorFlow and have a question about the tf.rank method.
In the doc https://www.tensorflow.org/api_docs/python/tf/rank there is a simple example about the tf.rank:
# shape of tensor 't' is [2, 2, 3]
t = tf.constant([[[1, 1, 1], [2, 2, 2]], [[3, 3, 3], [4, 4, 4]]])
tf.rank(t) # 3
But when I run the code below:
t = tf.constant([[[1, 1, 1], [2, 2, 2]], [[3, 3, 3], [4, 4, 4]]])
print(tf.rank(t)) # 3
I get output like:
Tensor("Rank:0", shape=(), dtype=int32)
Why can't I get the output "3"?
As I said in the comments of this question, tf.rank(t) creates a tensor in charge of evaluating the rank of tensor t. If you use the python print() function, it just prints information about the tensor itself.
Let's assign the tf.rank(t) tensor to a variable rank (as suggested by @Picnix_) and evaluate its value under a tf.Session():
import tensorflow as tf
t = tf.constant([[[1, 1, 1], [2, 2, 2]], [[3, 3, 3], [4, 4, 4]]])
rank = tf.rank(t)
with tf.Session() as sess:
    rank_value = sess.run(rank)
    print(rank_value) # Outputs --> 3
So rank_value is the variable containing the value of the tensor rank, and, as the documentation suggests, its value is 3. Hope this sheds some light on how TensorFlow works.

How do I make a ragged batch in Tensorflow 2.0?

I'm trying to create a data input pipeline from a Tensorflow Dataset that consists of 1d tensors of numerical data. I would like to create batches of ragged tensors; I do not want to pad the data.
For instance, if my data is of the form:
[
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
[0, 1, 2, 3, 4]
...
]
I would like my dataset to consist of batches of the form:
<tf.Tensor [
<tf.RaggedTensor [
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
[0, 1, 2, 3, 4],
...]>,
<tf.RaggedTensor [
[ ... ],
...]>
]>
I've tried creating a RaggedTensor using a map but I can't seem to do it on one dimensional data.
I think this can be achieved with a little work before and after the batch.
# First, you can expand along the 0 axis for each data point
dataset = dataset.map(lambda x: tf.expand_dims(x, 0))
# Then create a RaggedTensor with a ragged rank of 1
dataset = dataset.map(lambda x: tf.RaggedTensor.from_tensor(x))
# Create batches
dataset = dataset.batch(BATCH_SIZE)
# Squeeze the extra dimension from the created batches
dataset = dataset.map(lambda x: tf.squeeze(x, axis=1))
Then the final output will be of the form:
<tf.RaggedTensor [
<tf.Tensor [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]>,
<tf.Tensor [0, 1, 2, 3]>,
...
]>
for each batch.
