I'm trying to use the pre-trained BERT models on TensorFlow Hub to do some simple NLP. I'm on a 2021 MacBook Pro (Apple Silicon) with Python 3.9.13 and TensorFlow v2.9.2. However, preprocessing any amount of text returns a "NotFoundError" that I can't seem to resolve. The link to the preprocessing model is here: (https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3) and I have pasted my code/error messages below. Does anyone know why this is happening and how I can fix it? Thanks in advance.
Code
bert_preprocess = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3")
bert_encoder = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4")
print(bert_preprocess(["test"]))
Output
---------------------------------------------------------------------------
NotFoundError Traceback (most recent call last)
Cell In [42], line 3
1 bert_preprocess = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3")
2 bert_encoder = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4")
----> 3 print(bert_preprocess(["test"]))
File ~/miniforge3/envs/tfenv/lib/python3.9/site-packages/keras/utils/traceback_utils.py:67, in filter_traceback.<locals>.error_handler(*args, **kwargs)
65 except Exception as e: # pylint: disable=broad-except
66 filtered_tb = _process_traceback_frames(e.__traceback__)
---> 67 raise e.with_traceback(filtered_tb) from None
68 finally:
69 del filtered_tb
File ~/miniforge3/envs/tfenv/lib/python3.9/site-packages/tensorflow_hub/keras_layer.py:237, in KerasLayer.call(self, inputs, training)
234 else:
235 # Behave like BatchNormalization. (Dropout is different, b/181839368.)
236 training = False
--> 237 result = smart_cond.smart_cond(training,
238 lambda: f(training=True),
239 lambda: f(training=False))
241 # Unwrap dicts returned by signatures.
242 if self._output_key:
File ~/miniforge3/envs/tfenv/lib/python3.9/site-packages/tensorflow_hub/keras_layer.py:239, in KerasLayer.call.<locals>.<lambda>()
...
[[StatefulPartitionedCall/StatefulPartitionedCall/bert_pack_inputs/PartitionedCall/RaggedConcat/ArithmeticOptimizer/AddOpsRewrite_Leaf_0_add_2]] [Op:__inference_restored_function_body_209194]
Call arguments received by layer "keras_layer_6" (type KerasLayer):
• inputs=["'test'"]
• training=None
Update: When using BERT preprocessing from TF Hub, the TensorFlow and tensorflow_text versions must be the same, so please make sure both installed versions match. The error happens because you're using the latest version of tensorflow_text together with different versions of Python and TensorFlow; tensorflow_text has an internal dependency on TensorFlow, and the two versions need to be the same.
!pip install -U tensorflow
!pip install -U tensorflow-text
import tensorflow as tf
import tensorflow_text as text
# Or install with a specific Version
!pip install -U tensorflow==2.11.*
!pip install -U tensorflow-text==2.11.*
import tensorflow as tf
import tensorflow_text as text
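If you want to double-check that the installed versions actually match, a quick sanity check of my own (not strictly required) is to print both version strings; the major.minor parts should be identical:
import tensorflow as tf
import tensorflow_text as text

# The major.minor parts (e.g. 2.11) should be the same for both packages.
print("tensorflow:", tf.__version__)
print("tensorflow_text:", text.__version__)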
I have executed the lines of code below in Google Colab and they work fine:
bert_preprocess = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3")
bert_encoder = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4")
print(bert_preprocess(["test"]))
Here is the output:
{'input_type_ids': <tf.Tensor: shape=(1, 128), dtype=int32, numpy=
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
dtype=int32)>, 'input_mask': <tf.Tensor: shape=(1, 128), dtype=int32, numpy=
array([[1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
dtype=int32)>, 'input_word_ids': <tf.Tensor: shape=(1, 128), dtype=int32, numpy=
array([[ 101, 3231, 102, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0]], dtype=int32)>}
I hope this helps you resolve your issue. Thank you!
I have a model in Keras:
import tensorflow as tf
import numpy as np
import pandas as pd
from tensorflow import keras
from tensorflow.keras import layers
from sklearn.model_selection import train_test_split
import random
df = pd.read_csv('/home/Datasets/creditcard.csv')
output = df['Class']
features = df.drop('Class', axis=1)
train_features, test_features, train_labels, test_labels = train_test_split(df, output, test_size = 0.2, random_state = 42)
train_features = train_features.to_numpy()
test_features = test_features.to_numpy()
train_labels = train_labels.to_numpy()
test_labels = test_labels.to_numpy()
model = tf.keras.Sequential()
num_nodes = [1]
act_functions = [tf.nn.relu]
optimizers = ['SGD']
loss_functions = ['categorical_crossentropy']
epochs_count = ['10']
batch_sizes = ['500']
act = random.choice(act_functions)
opt = random.choice(optimizers)
ep = random.choice(epochs_count)
batch = random.choice(batch_sizes)
loss = random.choice(loss_functions)
count = random.choice(num_nodes)
model.add(tf.keras.layers.Dense(31, activation = act, input_shape=(31,)))
model.add(tf.keras.layers.Dense(count, activation = act))
model.add(tf.keras.layers.Dense(1, activation = act))
model.compile(loss = loss,
optimizer = opt,
metrics = ['accuracy'])
epochs = int(ep)
batch_size = int(batch)
model.fit(train_features, train_labels, epochs=epochs, batch_size=batch_size)
The train labels are binary:
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
But, the output of:
z = model.predict(test_features)
is:
array([[ 4574.6 ],
[ 4896.158 ],
[ 3867.8225],
...,
[15516.117 ],
[ 6441.43 ],
[ 5453.437 ]], dtype=float32)
Why would it be predicting these values?
Thanks
Use a sigmoid activation on your last layer:
model.add(tf.keras.layers.Dense(1, activation='sigmoid'))
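Here is a minimal sketch of how the output layer and compile call could look for 0/1 labels; note that switching the loss to binary_crossentropy is my suggestion on top of the sigmoid change, not something in your original code:
# Sigmoid squashes the single output into (0, 1), so it can be read as a probability.
model.add(tf.keras.layers.Dense(1, activation='sigmoid'))

# Suggestion: with one sigmoid output and 0/1 labels, binary_crossentropy is the
# matching loss; categorical_crossentropy expects one-hot encoded targets instead.
model.compile(loss='binary_crossentropy', optimizer='SGD', metrics=['accuracy'])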
For a text classification task I applied BERT (fine-tuning) and the output that I got is below:
Why is input_mask all 1?
#to_feature_map is a function.
to_feature_map("hi how are you doing",0)
({'input_mask': <tf.Tensor: shape=(64,), dtype=int32, numpy=
array([1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
dtype=int32)>,
'input_type_ids': <tf.Tensor: shape=(64,), dtype=int32, numpy=
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
dtype=int32)>,
'input_word_ids': <tf.Tensor: shape=(64,), dtype=int32, numpy=
array([ 101, 7632, 2129, 2024, 2017, 2725, 102, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int32)>},
<tf.Tensor: shape=(), dtype=int32, numpy=0>)
The input mask allows the model to cleanly differentiate between the content and the padding. The mask has the same shape as the input ids, and contains a 1 anywhere the input ids are not padding.
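As a rough sketch (my own illustration, assuming the padding token id is 0 as in the output above), the mask can be derived directly from the word ids:
import tensorflow as tf

input_word_ids = tf.constant([[101, 7632, 2129, 2024, 2017, 2725, 102, 0, 0, 0]])

# 1 wherever there is a real token, 0 wherever the sequence is padded with id 0.
input_mask = tf.cast(tf.not_equal(input_word_ids, 0), tf.int32)
print(input_mask)  # [[1 1 1 1 1 1 1 0 0 0]]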
Now I'm working on a space environment model that predicts tomorrow's maximum Kp index using the last 3 days of coronal hole information. (The total amount of data is around 4300 days.)
For the input, 3 arrays of 136 elements each are used (one array per day, so 3 days of data). For example,
inputArray_day1 = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
inputArray_day2 = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0]
inputArray_day3 = [0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
The output is a single one-hot vector of length 28 which indicates the maximum Kp index of day 4. I use the dictionaries below to convert between the Kp index and the one-hot vector easily (a small conversion example follows the dictionaries).
kp2idx = {0.0:0, 0.3:1, 0.7:2, 1.0:3, 1.3:4, 1.7:5, 2.0:6, 2.3:7, 2.7:8, 3.0:9, 3.3:10, 3.7:11, 4.0:12, 4.3:13,
4.7:14, 5.0:15, 5.3:16, 5.7:17, 6.0:18, 6.3:19, 6.7:20, 7.0:21, 7.3:22, 7.7:23, 8.0:24, 8.3:25, 8.7:26, 9.0:27}
idx2kp = {0:0.0, 1:0.3, 2:0.7, 3:1.0, 4:1.3, 5:1.7, 6:2.0, 7:2.3, 8:2.7, 9:3.0, 10:3.3, 11:3.7, 12:4.0, 13:4.3,
14:4.7, 15:5.0, 16:5.3, 17:5.7, 18:6.0, 19:6.3, 20:6.7, 21:7.0, 22:7.3, 23:7.7, 24:8.0, 25:8.3, 26:8.7, 27:9.0}
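For example, a minimal sketch of the conversion (just an illustration using tf.one_hot, not the exact code from my pipeline):
import numpy as np
import tensorflow as tf

kp = 4.3
one_hot = tf.one_hot(kp2idx[kp], depth=28)   # length-28 vector with a 1 at index 13
recovered = idx2kp[int(np.argmax(one_hot))]  # back to 4.3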
The model contains two LSTM layers with dropout.
def fit_lstm2(X, Y, Xv, Yv, n_batch, nb_epoch, n_neu1, n_neu2, dropout):
    model = tf.keras.Sequential()
    model.add(tf.keras.layers.LSTM(n_neu1, batch_input_shape=(n_batch, X.shape[1], X.shape[2]), return_sequences=True))
    model.add(tf.keras.layers.Dropout(dropout))
    model.add(tf.keras.layers.LSTM(n_neu2))
    model.add(tf.keras.layers.Dropout(dropout))
    model.add(tf.keras.layers.Dense(28, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='Adam', metrics=['accuracy', 'mse'])
    for i in range(nb_epoch):
        print('epochs : ' + str(i))
        model.fit(X, Y, epochs=1, batch_size=n_batch, verbose=1, shuffle=False, callbacks=[custom_hist], validation_data=(Xv, Yv))
        model.reset_states()
    return model
I tried various neuron numbers and dropout rates, such as
n_batch = 1
nb_epochs = 100
n_neu1 = [128,64,32,16]
n_neu2 = [64,32,16,8]
n_dropout = [0.2,0.4,0.6,0.8]
for dropout in n_dropout:
    for i in range(len(n_neu1)):
        model = fit_lstm2(x_train, y_train, x_val, y_val, n_batch, nb_epochs, n_neu1[i], n_neu2[i], dropout)
The problem is that the prediction accuracy never goes above 10%, and over-fitting starts pretty soon after training begins.
Here are some images of the training histories. (Sorry for the location of the legends)
n_neu1,n_neu2,dropout=(64,32,0.2)
n_neu1,n_neu2,dropout=(32,16,0.2)
n_neu1,n_neu2,dropout=(16,8,0.2)
Honestly, I have no idea why the validation accuracy never goes up and the over-fitting starts so quickly. Is there a better way to use the input data? I mean, should I normalize or standardize the input?
Please help me, any comments and suggestions will be greatly appreciated.
I have coded a sequence-to-sequence learning LSTM in Keras myself, using knowledge gained from web tutorials and my own intuition. I converted my sample text to sequences and then padded them using the pad_sequences function in Keras.
from keras.preprocessing.text import Tokenizer,base_filter
from keras.preprocessing.sequence import pad_sequences
def shift(seq, n):
    n = n % len(seq)
    return seq[n:] + seq[:n]
txt="abcdefghijklmn"*100
tk = Tokenizer(nb_words=2000, filters=base_filter(), lower=True, split=" ")
tk.fit_on_texts(txt)
x = tk.texts_to_sequences(txt)
#shifing to left
y = shift(x,1)
#padding sequence
max_len = 100
max_features=len(tk.word_counts)
X = pad_sequences(x, maxlen=max_len)
Y = pad_sequences(y, maxlen=max_len)
After careful inspection I found that my padded sequences look like this:
>>> X[0:6]
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7]], dtype=int32)
>>> X
array([[ 0, 0, 0, ..., 0, 0, 1],
[ 0, 0, 0, ..., 0, 0, 3],
[ 0, 0, 0, ..., 0, 0, 2],
...,
[ 0, 0, 0, ..., 0, 0, 13],
[ 0, 0, 0, ..., 0, 0, 12],
[ 0, 0, 0, ..., 0, 0, 14]], dtype=int32)
Is the padded sequence supposed to look like this? Except for the last column in the array, the rest are all zeros. I think I made some mistake in padding the text to sequences; if so, can you tell me where I made the error?
If you want to tokenize by char, you can do it manually, it's not too complex:
First build a vocabulary for your characters:
txt="abcdefghijklmn"*100
vocab_char = {k: (v+1) for k, v in zip(set(txt), range(len(set(txt))))}
vocab_char['<PAD>'] = 0
This will associate a distinct number with every character in your txt. The character with index 0 should be reserved for the padding.
Having the reverse vocabulary will be useful to decode the output.
rvocab = {v: k for k, v in vocab_char.items()}
Once you have this, you can split your text into sequences; say you want to have sequences of length seq_len = 13:
[[vocab_char[char] for char in txt[i:(i+seq_len)]] for i in range(0,len(txt),seq_len)]
Your output will look like:
[[9, 12, 6, 10, 8, 7, 2, 1, 5, 13, 11, 4, 3],
[14, 9, 12, 6, 10, 8, 7, 2, 1, 5, 13, 11, 4],
...,
[2, 1, 5, 13, 11, 4, 3, 14, 9, 12, 6, 10, 8],
[7, 2, 1, 5, 13, 11, 4, 3, 14]]
Note that the last sequence doesn't have the same length; you can discard it or pad your sequences to max_len = 13, which will add 0's to it.
You can build your targets Y the same way, by shifting everything by 1. :-)
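For instance, a minimal sketch of that shift (my own illustration; appending <PAD> to keep the lengths equal is an assumption):
# seqs: the list of character-id sequences built above.
seqs = [[vocab_char[char] for char in txt[i:(i + seq_len)]]
        for i in range(0, len(txt), seq_len)]

# Targets: the same sequences shifted left by one character, with 0 (<PAD>)
# appended so each target keeps the same length as its input.
Y = [seq[1:] + [vocab_char['<PAD>']] for seq in seqs]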
I hope this helps.
The problem is in this line:
tk = Tokenizer(nb_words=2000, filters=base_filter(), lower=True, split=" ")
When you set such a split (by " "), due to the nature of your data, you'll get each sequence consisting of a single word. That's why your padded sequences have only one non-zero element. To change that, try:
txt="a b c d e f g h i j k l m n "*100
The padding argument controls whether padding is added before or after each sequence. Use it like this:
X = pad_sequences(x, maxlen=max_len, padding='post')
Y = pad_sequences(y, maxlen=max_len, padding='post')
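For example, on a toy input (just a quick illustration of the difference):
from keras.preprocessing.sequence import pad_sequences

seqs = [[1, 2, 3], [4, 5]]
print(pad_sequences(seqs, maxlen=5))                  # default padding='pre': zeros in front
# [[0 0 1 2 3]
#  [0 0 0 4 5]]
print(pad_sequences(seqs, maxlen=5, padding='post'))  # zeros appended at the end
# [[1 2 3 0 0]
#  [4 5 0 0 0]]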