Using tf.nn.embedding_lookup with multiple columns

I am having trouble building a forecasting (LSTM) model with TensorFlow.
I am trying to use embedding_lookup with multiple categorical columns.
This is my data (shape [100, 12, 16]):
[[[1, 1, 1, 7, 1, -1, 26, 9],
[1, 5, 2, -5, 2, 20, 25, 8],
[1, 4, 3, 1, 1, 32, 1, 7],
...]]
For forecasting, each row has multiple features, and I converted these features to categorical data (0, 1, 2, ...). Next, I'd like to use an embedding layer for each column and then concatenate the results to use as input to an LSTM.
How can I apply an embedding layer per column and concatenate the outputs?
This is my code:
def get_var(self, name='', shape=None, dtype=tf.float32):
    return tf.get_variable(name, shape, dtype=dtype, initializer=self.initializer)
def get_embedding(self, data='', name='', shape=[], dtype=tf.float32):
    return tf.nn.embedding_lookup(self.get_var(name=name, shape=shape), data)
emb_concat = self.get_embedding(tf.cast(data[:,:, 0], 'int32'), name='emb_ap2id'+type, shape=[3, 2])
emb = self.get_embedding(tf.cast(data[:,:, 1], 'int32'), name='emb_month'+type, shape=[12, 11])
emb_concat = tf.concat(emb_concat, emb, axis=2)
This does not work, and I get an error message.
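The error output itself is not available here, but one likely cause is the last line of the code: tf.concat expects a list of tensors as its first argument, so tf.concat(emb_concat, emb, axis=2) passes emb where the axis is expected. A minimal sketch of the per-column lookup followed by a concat is shown below. It assumes TensorFlow 2.x with eager execution, uses random stand-in data for the two columns, and replaces the get_var helper with plain tf.Variable tables; the table names are only illustrative.

```python
import numpy as np
import tensorflow as tf

# Hypothetical stand-in for the question's data: 100 sequences, 12 time steps,
# and two categorical columns (ids in [0, 3) and month ids in [0, 12)).
rng = np.random.default_rng(0)
data = np.stack(
    [rng.integers(0, 3, size=(100, 12)), rng.integers(0, 12, size=(100, 12))],
    axis=-1,
).astype(np.float32)

# One embedding table per column, shaped [number of categories, embedding width].
ap2id_table = tf.Variable(tf.random.uniform([3, 2], -1.0, 1.0), name="emb_ap2id")
month_table = tf.Variable(tf.random.uniform([12, 11], -1.0, 1.0), name="emb_month")

# Each lookup maps [100, 12] integer ids to [100, 12, embedding width].
emb_ap2id = tf.nn.embedding_lookup(ap2id_table, tf.cast(data[:, :, 0], tf.int32))
emb_month = tf.nn.embedding_lookup(month_table, tf.cast(data[:, :, 1], tf.int32))

# tf.concat takes a list of tensors as its first argument.
emb_concat = tf.concat([emb_ap2id, emb_month], axis=2)
print(emb_concat.shape)  # (100, 12, 13)
```

The concatenated width is the sum of the per-column embedding widths (2 + 11 here). Every id must lie in [0, number of categories) for its table, so negative raw values such as the -1 and -5 in the sample data have to be remapped to non-negative category ids before the lookup.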


Embedding Index out of range:

I understand why this error usually occurs: some input index is >= embedding_dim.
However, in my case torch.max(inputs) = embedding_dim - 1.
print('inputs: ', src_seq)
print('input_shape: ', src_seq.shape)
print(self.src_word_emb)
inputs: tensor([[10, 6, 2, 4, 9, 14, 6, 2, 5, 0],
[12, 6, 3, 8, 13, 2, 0, 1, 1, 1],
[13, 8, 12, 7, 2, 4, 0, 1, 1, 1]])
input_shape: [3, 10]
Embedding(15, 512, padding_idx=1)
emb = self.src_word_emb(src_seq)
I am trying to get a transformer model to work, and for some reason the encoder embedding only accepts inputs < embedding_dim_decoder, which does not make sense, right?
Found the source of the error! In this transformer model, the encoder and decoder can be set up to share the same embedding weights. However, I had a translation task with one embedding for the decoder and a separate one for the encoder. The code initializes the weights via:
if emb_src_trg_weight_sharing:
    self.encoder.src_word_emb.weight = self.decoder.trg_word_emb.weight
Setting emb_src_trg_weight_sharing to false solved the issue!
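The effect described above can be reproduced in a few lines. The sketch below assumes a decoder vocabulary of 10 entries (the real size is not given in the post) and shrinks the embedding width to 8 to keep things small. It shows why the printed module still reads Embedding(15, ...) even though lookups fail: the limit that matters is the number of rows in the shared weight table (num_embeddings in PyTorch terms), not the value stored on the module.

```python
import torch
import torch.nn as nn

# Source vocabulary of 15 entries, as in the printed Embedding(15, 512, padding_idx=1);
# the width is reduced to 8 here to keep the example small.
src_word_emb = nn.Embedding(15, 8, padding_idx=1)
# Hypothetical target vocabulary of 10 entries (the real size is not given).
trg_word_emb = nn.Embedding(10, 8, padding_idx=1)

src_seq = torch.tensor([[10, 6, 2, 4, 9, 14, 6, 2, 5, 0]])
print(src_word_emb(src_seq).shape)  # torch.Size([1, 10, 8])

# Weight sharing as in the quoted snippet: the encoder now reads the decoder's table.
src_word_emb.weight = trg_word_emb.weight
print(src_word_emb)               # the repr still reports 15 entries
print(src_word_emb.weight.shape)  # but the table now has only 10 rows

try:
    src_word_emb(src_seq)  # ids 10 and 14 are now out of range
    lookup_failed = False
except IndexError:
    lookup_failed = True
print(lookup_failed)
```

When debugging this kind of error, checking src_word_emb.weight.shape[0] against torch.max(inputs) is more reliable than reading the module's printed representation.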

Generating segment labels for a Tensor given a value indicating segment boundaries

Does anyone know of a way to generate a 'segment label' for a Tensor, given a unique value that represents segment boundaries within the Tensor?
For example, given a 1D input tensor where the value 1 represents a segment boundary,
x = torch.Tensor([5, 4, 1, 3, 6, 2])
the resulting segment label Tensor should have the same shape with values representing the two segments:
segment_label = torch.Tensor([1, 1, 1, 2, 2, 2])
Likewise, for a batch of inputs, e.g. batch size = 3,
x = torch.Tensor([
[5, 4, 1, 3, 6, 2],
[9, 4, 5, 1, 8, 10],
[10, 1, 5, 4, 8, 9]
])
the resulting segment label Tensor (using 1 as the segment separator) should look something like this:
segment_label = torch.Tensor([
[1, 1, 1, 2, 2, 2],
[1, 1, 1, 1, 2, 2],
[1, 1, 2, 2, 2, 2]
])
Context: I'm currently working with Fairseq's Transformer implementation in PyTorch for a seq2seq NLP task. I am looking for a way to incorporate BERT-like segment embeddings in the Transformer during the encoder's forward pass, rather than modifying an existing dataset used for translation tasks such as language_pair_dataset.
Thanks in advance!
You can use torch.cumsum to do the trick:
mask = (x == 1).to(x) # mask with only the boundaries
segment_label = mask.cumsum(dim=-1) - mask + 1
This yields the desired segment_label.
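To check the cumsum approach against the batched example from the question, here is a small self-contained sketch. The function name segment_labels and its boundary parameter are only illustrative; the input is built with torch.tensor from integers so that the labels come out as int64, which is the dtype an nn.Embedding lookup would need.

```python
import torch

def segment_labels(x, boundary=1):
    # 1 at each boundary position, 0 elsewhere, in the same dtype as x.
    mask = (x == boundary).to(x)
    # The running count of boundaries increases at the boundary itself;
    # subtracting the mask keeps the boundary token in the segment it closes.
    return mask.cumsum(dim=-1) - mask + 1

x = torch.tensor([
    [5, 4, 1, 3, 6, 2],
    [9, 4, 5, 1, 8, 10],
    [10, 1, 5, 4, 8, 9],
])
labels = segment_labels(x)
print(labels)
# tensor([[1, 1, 1, 2, 2, 2],
#         [1, 1, 1, 1, 2, 2],
#         [1, 1, 2, 2, 2, 2]])
```

Because the labels start at 1, a segment embedding table indexed by them needs at least one more row than the maximum number of segments, or the labels can be shifted down by 1 first.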

keras-gcn fit model ValueError

I'm using this library to create a model to learn graphs. Here is the code (from repository):
import numpy as np
from keras_gcn.backend import keras
from keras_gcn import GraphConv
# feature matrix
input_data = np.array([[[0, 1, 2],
[2, 3, 4],
[4, 5, 6],
[7, 7, 8]]])
# adjacency matrix
input_edge = np.array([[[1, 1, 1, 0],
[1, 1, 0, 0],
[1, 0, 1, 0],
[0, 0, 0, 1]]])
labels = np.array([[[1],
[0],
[1],
[0]]])
data_layer = keras.layers.Input(shape=(None, 3), name='Input-Data')
edge_layer = keras.layers.Input(shape=(None, None), dtype='int32', name='Input-Edge')
conv_layer = GraphConv(units=4, step_num=1, kernel_initializer='ones',
                       bias_initializer='ones', name='GraphConv')([data_layer, edge_layer])
model = keras.models.Model(inputs=[data_layer, edge_layer], outputs=conv_layer)
model.compile(optimizer='adam', loss='mae', metrics=['mae'])
model.fit([input_data, input_edge], labels)
However, when I run the code I get the following error:
ValueError: Error when checking target: expected GraphConv to have 3 dimensions, but got array with shape (4, 1)
while the shape of labels is (1, 4, 1)
You should one-hot encode your labels, something like the following:
labels = np.array([[[0, 1],
                    [1, 0],
                    [0, 1],
                    [1, 0]]])
Also, the number of units in the GraphConv layer should equal the number of unique labels, which is 2 in your case.
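Rather than typing the one-hot array by hand, it can be derived from the original integer labels. The sketch below uses plain NumPy indexing into an identity matrix; num_classes is an assumed name for the number of unique labels.

```python
import numpy as np

# Original integer labels from the question, shape (1, 4, 1).
labels = np.array([[[1],
                    [0],
                    [1],
                    [0]]])

num_classes = 2
# Row i of the identity matrix is the one-hot vector for class i.
one_hot = np.eye(num_classes, dtype=int)[labels[..., 0]]
print(one_hot.shape)  # (1, 4, 2)
print(one_hot)
# [[[0 1]
#   [1 0]
#   [0 1]
#   [1 0]]]
```

The last dimension of one_hot (2) is what the units argument of the final layer would need to match under this suggestion.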
I think the issue is a mismatch between the shapes of your edge_layer and data_layer.
When you call keras.layers.Input, you give data_layer shape=(None, 3) and edge_layer shape=(None, None).
Match the shapes and let me know how it goes.

TensorFlow tf.data.Dataset and bucketing

For an LSTM network, I've seen great improvements with bucketing.
I've come across the bucketing section in the TensorFlow docs (under tf.contrib).
However, my network uses the tf.data.Dataset API, specifically with TFRecords, so my input pipeline looks something like this:
dataset = tf.data.TFRecordDataset(TFRECORDS_PATH)
dataset = dataset.map(_parse_function)
dataset = dataset.map(_scale_function)
dataset = dataset.shuffle(buffer_size=10000)
dataset = dataset.padded_batch(batch_size, padded_shapes={.....})
How can I incorporate the bucketing method into the tf.data.Dataset pipeline?
If it matters, in every record in the TFRecords file I have the sequence length saved as an integer.
Various bucketing use cases using Dataset API are explained well here.
bucket_by_sequence_length() example:
import tensorflow as tf  # TensorFlow 1.x (uses tf.contrib)
def elements_gen():
    text = [[1, 2, 3], [3, 4, 5, 6, 7], [1, 2], [8, 9, 0, 2]]
    label = [1, 2, 1, 2]
    for x, y in zip(text, label):
        yield (x, y)
def element_length_fn(x, y):
    # Length of the sequence part of each (sequence, label) element
    return tf.shape(x)[0]
dataset = tf.data.Dataset.from_generator(generator=elements_gen,
                                         output_shapes=([None], []),
                                         output_types=(tf.int32, tf.int32))
dataset = dataset.apply(tf.contrib.data.bucket_by_sequence_length(element_length_func=element_length_fn,
                                                                  bucket_batch_sizes=[2, 2, 2],
                                                                  bucket_boundaries=[0, 8]))
batch = dataset.make_one_shot_iterator().get_next()
with tf.Session() as sess:
    for _ in range(2):
        print('Get_next:')
        print(sess.run(batch))
Output:
Get_next:
(array([[1, 2, 3, 0, 0],
[3, 4, 5, 6, 7]], dtype=int32), array([1, 2], dtype=int32))
Get_next:
(array([[1, 2, 0, 0],
[8, 9, 0, 2]], dtype=int32), array([1, 2], dtype=int32))
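To make the roles of bucket_boundaries and bucket_batch_sizes more concrete, here is a framework-free sketch of the grouping logic as I understand it from the documented behaviour: bucket i collects sequences whose length lies between the previous boundary (inclusive) and the next boundary (exclusive), a batch is emitted once a bucket holds its batch size, and each batch is padded only to its own longest sequence. The helper names pad_batch and bucket_batches are hypothetical and this is not TensorFlow's implementation.

```python
from bisect import bisect_right
from collections import defaultdict

def pad_batch(batch, pad_value):
    # Pad every sequence to the longest one in this batch only.
    width = max(len(seq) for seq in batch)
    return [seq + [pad_value] * (width - len(seq)) for seq in batch]

def bucket_batches(sequences, bucket_boundaries, bucket_batch_sizes, pad_value=0):
    """Yield padded batches of sequences whose lengths fall in the same bucket."""
    pending = defaultdict(list)
    for seq in sequences:
        # A length equal to a boundary goes to the bucket above that boundary.
        bucket = bisect_right(bucket_boundaries, len(seq))
        pending[bucket].append(seq)
        if len(pending[bucket]) == bucket_batch_sizes[bucket]:
            yield pad_batch(pending.pop(bucket), pad_value)
    # Emit whatever is left over as smaller batches.
    for bucket in sorted(pending):
        yield pad_batch(pending[bucket], pad_value)

text = [[1, 2, 3], [3, 4, 5, 6, 7], [1, 2], [8, 9, 0, 2]]
for batch in bucket_batches(text, bucket_boundaries=[0, 8], bucket_batch_sizes=[2, 2, 2]):
    print(batch)
# [[1, 2, 3, 0, 0], [3, 4, 5, 6, 7]]
# [[1, 2, 0, 0], [8, 9, 0, 2]]
```

With boundaries [0, 8] all four sequences land in the same bucket, which is why the batches simply follow input order, as in the output above. Tighter boundaries such as [3, 5] would group the length-3 and length-4 sequences together and reduce padding.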

How do I make a ragged batch in Tensorflow 2.0?

I'm trying to create a data input pipeline from a Tensorflow Dataset that consists of 1d tensors of numerical data. I would like to create batches of ragged tensors; I do not want to pad the data.
For instance, if my data is of the form:
[
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
[0, 1, 2, 3, 4]
...
]
I would like my dataset to consist of batches of the form:
<tf.Tensor [
<tf.RaggedTensor [
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
[0, 1, 2, 3, 4],
...]>,
<tf.RaggedTensor [
[ ... ],
...]>
]>
I've tried creating a RaggedTensor using a map but I can't seem to do it on one dimensional data.
I think this can be achieved with a little work before and after the batch.
# First, you can expand along the 0 axis for each data point
dataset = dataset.map(lambda x: tf.expand_dims(x, 0))
# Then create a RaggedTensor with a ragged rank of 1
dataset = dataset.map(lambda x: tf.RaggedTensor.from_tensor(x))
# Create batches
dataset = dataset.batch(BATCH_SIZE)
# Squeeze the extra dimension from the created batches
dataset = dataset.map(lambda x: tf.squeeze(x, axis=1))
Then the final output will be of the form:
<tf.RaggedTensor [
<tf.Tensor [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]>,
<tf.Tensor [0, 1, 2, 3, 4]>,
...
]>
for each batch.
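As an alternative to the expand/batch/squeeze steps, some TensorFlow 2.x releases ship a transformation that batches variable-length dense tensors directly into a RaggedTensor. The sketch below assumes a release that provides both tf.data.experimental.dense_to_ragged_batch and the output_signature argument of from_generator; the rows list is stand-in data modelled on the question.

```python
import tensorflow as tf

# Stand-in data: 1-D sequences of different lengths.
rows = [[0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [0, 1, 2, 3, 4], [7, 8]]

dataset = tf.data.Dataset.from_generator(
    lambda: rows,
    output_signature=tf.TensorSpec(shape=(None,), dtype=tf.int32),
)
# Each batch becomes one RaggedTensor; no padding values are added.
dataset = dataset.apply(tf.data.experimental.dense_to_ragged_batch(batch_size=2))

batches = list(dataset)
for batch in batches:
    print(batch)
```

Each element of batches is a tf.RaggedTensor whose rows keep their original lengths; the last batch is smaller because drop_remainder defaults to False.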
