PipelineException: No mask_token ([MASK]) found on the input - python

I am getting this error "PipelineException: No mask_token ([MASK]) found on the input"
when I run this line.
fill_mask("Auto Car .")
I am running it on Colab.
My Code:
from transformers import BertTokenizer, BertForMaskedLM
from pathlib import Path
from tokenizers import ByteLevelBPETokenizer
from transformers import BertTokenizer, BertForMaskedLM
paths = [str(x) for x in Path(".").glob("**/*.txt")]
print(paths)
bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
from transformers import BertModel, BertConfig
configuration = BertConfig()
model = BertModel(configuration)
configuration = model.config
print(configuration)
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
from transformers import LineByLineTextDataset
dataset = LineByLineTextDataset(
tokenizer=bert_tokenizer,
file_path="./kant.txt",
block_size=128,
)
from transformers import DataCollatorForLanguageModeling
data_collator = DataCollatorForLanguageModeling(
tokenizer=bert_tokenizer, mlm=True, mlm_probability=0.15
)
from transformers import Trainer, TrainingArguments
training_args = TrainingArguments(
output_dir="./KantaiBERT",
overwrite_output_dir=True,
num_train_epochs=1,
per_device_train_batch_size=64,
save_steps=10_000,
save_total_limit=2,
)
trainer = Trainer(
model=model,
args=training_args,
data_collator=data_collator,
train_dataset=dataset,
)
trainer.train()
from transformers import pipeline
fill_mask = pipeline(
"fill-mask",
model=model,
tokenizer=bert_tokenizer,
device=0,
)
fill_mask("Auto Car <mask>."). # This line is giving me the error...
The last line is giving me the error mentioned above. Please let me know what I am doing wrong or what I have to do in order to remove this error.
Complete error: "f"No mask_token ({self.tokenizer.mask_token}) found on the input","

Even if you have already found the error, a recommendation to avoid it in the future. Instead of calling
fill_mask("Auto Car <mask>.")
you can do the following to be more flexible when you use different models:
MASK_TOKEN = tokenizer.mask_token
fill_mask("Auto Car {}.".format(MASK_TOKEN))

If the model implementation changes the token to be identified (some identify , some [mask]), then you get into trouble. It is better to use f strings and pass the argument. The advantage of using an f-string is that it is intuitive to understand.
The following code works for me -
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
mask_fill = pipeline("fill-mask", model="bert-base-uncased")
mask_fill(f"The gaming laptop is {tokenizer.mask_token} and I have loved playing games on it.", top_k=2)

Related

I'd like to get the episodic rewards in csv format in stable baselines 3

I want to retrieve the data after every episode, I've read the documentation that you can use, stable_baselines3.common.monitor.ResultsWriter but I don't know how to implement it in my code.
import gym
import numpy as np
import Neural_Traffic_Env
import stable_baselines3
from stable_baselines3.common.callbacks import EvalCallback, CheckpointCallback, CallbackList, StopTrainingOnMaxEpisodes, EveryNTimesteps
from stable_baselines3 import DDPG
from stable_baselines3.common.noise import NormalActionNoise, OrnsteinUhlenbeckActionNoise
from stable_baselines3.common.monitor import Monitor, ResultsWriter
env = gym.make('NeuralTraffic-v1')
env = Monitor(env, filename="Monitor")
eval_callback = EvalCallback(env, best_model_save_path='./logs/best_model', log_path='./logs/', eval_freq=500)
checkpoint_callback = CheckpointCallback(save_freq=100, save_path='./saves/')
callback_max_episodes = StopTrainingOnMaxEpisodes(max_episodes=1000, verbose=1)
callback = CallbackList([callback_max_episodes, checkpoint_callback, eval_callback])
n_actions = env.action_space.shape[-1]
action_noise = NormalActionNoise(mean=np.zeros(n_actions), sigma=0.1 * np.ones(n_actions))
model = DDPG("MlpPolicy", env, action_noise=action_noise, verbose=1)
model.learn(total_timesteps=1e6, log_interval=1, callback=callback)
model.save("ddpg")
env = model.get_env()
also is there a stable baselines forum I can direct my question too?
from stable_baselines3.common.logger import configure
from stable_baselines3.common.monitor import Monitor
tmp_path = "./tmp/sb3_log/"
# set up logger
new_logger = configure(tmp_path, ["stdout", "csv", "tensorboard"])
model = PPO('MlpPolicy', env, verbose=1)
model.set_logger(new_logger)

Using custom trained Keras model with Sagemaker endpoint results in "Session was not created with a graph before Run()" error while prediction

I have a trained a BERT text classification model using keras on spam vs ham dataset. I have deployed the model and got a Sagemaker endpoint. I want to use it for any prediction.
I am using a ml.t2.medium Sagemaker instance and my tensorflow version is 2.6.2 in the Sagemaker notebook
I am getting an error while using the Sagemaker endpoint for prediction. The error is Session was not created with a graph before Run()
This is my code for training the classifier
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text as text
# In[2]:
import pandas as pd
df = pd.read_csv("spam.csv")
df.head(5)
# In[3]:
df.groupby('Category').describe()
# In[4]:
df['Category'].value_counts()
# In[5]:
df_spam = df[df['Category']=='spam']
df_spam.shape
# In[6]:
df_ham = df[df['Category']=='ham']
df_ham.shape
# In[7]:
df_ham_downsampled = df_ham.sample(df_spam.shape[0])
df_ham_downsampled.shape
# In[8]:
df_balanced = pd.concat([df_ham_downsampled, df_spam])
df_balanced.shape
# In[9]:
df_balanced['Category'].value_counts()
# In[10]:
df_balanced['spam']=df_balanced['Category'].apply(lambda x: 1 if x=='spam' else 0)
df_balanced.sample(5)
# In[11]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df_balanced['Message'],df_balanced['spam'], stratify=df_balanced['spam'])
# In[12]:
bert_preprocess = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3")
bert_encoder = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4")
# In[13]:
def get_sentence_embeding(sentences):
preprocessed_text = bert_preprocess(sentences)
return bert_encoder(preprocessed_text)['pooled_output']
get_sentence_embeding([
"500$ discount. hurry up",
"Bhavin, are you up for a volleybal game tomorrow?"]
)
# In[14]:
# Bert layers
text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name='text')
preprocessed_text = bert_preprocess(text_input)
outputs = bert_encoder(preprocessed_text)
# Neural network layers
l = tf.keras.layers.Dropout(0.1, name="dropout")(outputs['pooled_output'])
l = tf.keras.layers.Dense(1, activation='sigmoid', name="output")(l)
# Use inputs and outputs to construct a final model
model = tf.keras.Model(inputs=[text_input], outputs = [l])
# In[15]:
model.summary()
# In[16]:
METRICS = [
tf.keras.metrics.BinaryAccuracy(name='accuracy'),
tf.keras.metrics.Precision(name='precision'),
tf.keras.metrics.Recall(name='recall')
]
model.compile(optimizer='adam',
loss='binary_crossentropy',
metrics=METRICS)
# In[17]:
model.fit(X_train, y_train, epochs=1)
AND THIS PART IS USED FOR DEPLOYING THE MODEL
# In[18]:
model.save('saved_model/28dec1')
# In[3]:
model = tf.keras.models.load_model('saved_model/28dec1')
model.predict(["who is the spammer on here"])
array([[0.08218178]], dtype=float32)
# Check its architecture
model.summary()
# In[18]:
tf.compat.v1.enable_eager_execution()
print("pass")
# In[5]:
def convert_h5_to_aws(loaded_model):
"""
given a pre-trained keras model, this function converts it to a TF protobuf format
and saves it in the file structure which aws expects
"""
from tensorflow.python.saved_model import builder
from tensorflow.python.saved_model.signature_def_utils import predict_signature_def
from tensorflow.python.saved_model import tag_constants
# This is the file structure which AWS expects. Cannot be changed.
model_version = '1'
export_dir = 'export/Servo/' + model_version
# Build the Protocol Buffer SavedModel at 'export_dir'
builder = builder.SavedModelBuilder(export_dir)
# Create prediction signature to be used by TensorFlow Serving Predict API
signature = predict_signature_def(
inputs={"inputs": loaded_model.input}, outputs={"score": loaded_model.output})
from keras import backend as K
with K.get_session() as sess:
# Save the meta graph and variables
builder.add_meta_graph_and_variables(
sess=sess, tags=[tag_constants.SERVING], signature_def_map={"serving_default": signature})
builder.save()
#create a tarball/tar file and zip it
import tarfile
with tarfile.open('model.tar.gz', mode='w:gz') as archive:
archive.add('export', recursive=True)
convert_h5_to_aws(model)
# In[3]:
import sagemaker
sagemaker_session = sagemaker.Session()
inputs = sagemaker_session.upload_data(path='model.tar.gz', key_prefix='model')
# In[7]:
# where did it upload to?
print("Bucket name is:")
sagemaker_session.default_bucket()
# In[9]:
import boto3, re
from sagemaker import get_execution_role
# the (default) IAM role you created when creating this notebook
role = get_execution_role()
# Create a Sagemaker model (see AWS console>SageMaker>Models)
from sagemaker.tensorflow.model import TensorFlowModel
sagemaker_model = TensorFlowModel(model_data = 's3://' + sagemaker_session.default_bucket() + '/model/model.tar.gz',
role = role,
framework_version = '1.12',
entry_point = 'train.py')
# In[10]:
# Deploy a SageMaker to an endpoint
predictor = sagemaker_model.deploy(initial_instance_count=1,
instance_type='ml.m4.xlarge')
# In[5]:
import numpy as np
import sagemaker
from sagemaker.tensorflow.model import TensorFlowModel
endpoint = 'sagemaker-tensorflow-serving-2021-10-28-11-18-34-001' #get endpoint name from SageMaker > endpoints
predictor=sagemaker.tensorflow.model.TensorFlowPredictor(endpoint, sagemaker_session)
# .predict send the data to our endpoint
#data = np.asarray(["what the shit"]) #<-- update this to have inputs for your model
predictor.predict(["this is not a spam"])
And I am getting this error
ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received client error (400) from primary with message "{ "error": "Session was not created with a graph before Run()!" }
Can someone please help me.
Instead of saving model as h5 use below method.
model.save("export/Servo/1/")
for some reason it expects exact format. Also, please remove if any hidden file is there in this folder.
This will save the model already in protobuf format.

SpaCy: how do you add custom NER labels to a pre-trained model?

I am new to SpaCy and NLP. I am using SpaCy v 3.1 and Python 3.9.7 64-bit.
My objective: to use a pre-trained SpaCy model (en_core_web_sm) and add a set of custom labels to the existing NER labels (GPE, PERSON, MONEY, etc.) so that the model can recognize both the default AND the custom entities.
I've looked at the SpaCy documentation and what I need seems to be an EntityRecogniser, specifically a new pipe.
However, it is not really clear to me at what point in my workflow I should add this new pipe, since in SpaCy 3 the training happens in CLI, and from the docs it's not even clear to me where the pre-trained model is called.
Any tutorials or pointers you might have are highly appreciated.
This is what I think should be done, but I am not sure how:
import spacy
from spacy import displacy
from spacy_langdetect import LanguageDetector
from spacy.language import Language
from spacy.pipeline import EntityRecognizer
# Load model
nlp = spacy.load("en_core_web_sm")
# Register custom component and turn a simple function into a pipeline component
#Language.factory('new-ner')
def create_bespoke_ner(nlp, name):
# Train the new pipeline with custom labels here??
return LanguageDetector()
# Add custom pipe
custom = nlp.add_pipe("new-ner")
This is what my config file looks like so far. I suspect my new pipe needs to go next to "tok2vec" and "ner".
[paths]
train = null
dev = null
vectors = null
init_tok2vec = null
[system]
gpu_allocator = null
seed = 0
[nlp]
lang = "en"
pipeline = ["tok2vec","ner"]
batch_size = 1000
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
tokenizer = {"#tokenizers":"spacy.Tokenizer.v1"}
[components]
[components.ner]
factory = "ner"
incorrect_spans_key = null
moves = null
update_with_oracle_cut_size = 100
For Spacy 3.2 I did it this way:
import spacy
import random
from spacy import util
from spacy.tokens import Doc
from spacy.training import Example
from spacy.language import Language
def print_doc_entities(_doc: Doc):
if _doc.ents:
for _ent in _doc.ents:
print(f" {_ent.text} {_ent.label_}")
else:
print(" NONE")
def customizing_pipeline_component(nlp: Language):
# NOTE: Starting from Spacy 3.0, training via Python API was changed. For information see - https://spacy.io/usage/v3#migrating-training-python
train_data = [
('We need to deliver it to Festy.', [(25, 30, 'DISTRICT')]),
('I like red oranges', [])
]
# Result before training
print(f"\nResult BEFORE training:")
doc = nlp(u'I need a taxi to Festy.')
print_doc_entities(doc)
# Disable all pipe components except 'ner'
disabled_pipes = []
for pipe_name in nlp.pipe_names:
if pipe_name != 'ner':
nlp.disable_pipes(pipe_name)
disabled_pipes.append(pipe_name)
print(" Training ...")
optimizer = nlp.create_optimizer()
for _ in range(25):
random.shuffle(train_data)
for raw_text, entity_offsets in train_data:
doc = nlp.make_doc(raw_text)
example = Example.from_dict(doc, {"entities": entity_offsets})
nlp.update([example], sgd=optimizer)
# Enable all previously disabled pipe components
for pipe_name in disabled_pipes:
nlp.enable_pipe(pipe_name)
# Result after training
print(f"Result AFTER training:")
doc = nlp(u'I need a taxi to Festy.')
print_doc_entities(doc)
def main():
nlp = spacy.load('en_core_web_sm')
customizing_pipeline_component(nlp)
if __name__ == '__main__':
main()

Python: PyCUDA ERROR: The context stack was not empty upon module cleanup

I have created a Streamlit App to as a demo of a project on Multilingual Text Classification using mBERT in PyTorch. When I run the app with the command python app.py it works fine but when I try to use Streamlit with the command streamlit run app.py it throws a PyCUDA Error.
Following is the code present in app.py:
import torch
from typing import Text
import streamlit as st
import pandas as pd
from textblob import TextBlob
from inference.inference_onnx import run_onnx_inference
from inference.inference_tensorRT import run_trt_inference
from googletrans import Translator
st.title("LinClass: Multilingual Text Classifier")
input_text = st.text_input('Text:')
####################
# Google Translate API
####################
translator = Translator()
input_text = translator.translate(
input_text,
dest= "en"
)
input_text = input_text.text
####################
#Select Precision and Inference Method
####################
df = pd.DataFrame()
df["lang"] = ["en"]
precision = st.sidebar.selectbox("Select Precision:",
("16 Bit", "32 Bit")
)
inference = st.sidebar.selectbox("Inference Method:",
("ONNX", "TensorRT")
)
if st.button('Show Selected Configuration'):
st.subheader("Selected Configuration:")
st.write("Precision: ", precision)
st.write("Inference: ", inference)
st.subheader("Results")
def result(x):
"""
Function to classify the comment toxicity based on the probability and given threshold
params: x(float) - Probability of Toxicity
"""
if x >= 0.4:
st.write("Toxic")
else:
st.write("Non Toxic")
####################
# Implement Selected Configuration
####################
if precision=="16 Bit":
if inference=="ONNX":
df["comment_text"] = [input_text]
predictions = run_onnx_inference(
onnx_model_path = "/workspace/data/multilingual-text-classifier/output models/mBERT_lightning_fp16_2GPU.onnx",
stage="inference",
df_test = df
)
predictions = torch.sigmoid(torch.tensor(predictions))
st.write(input_text)
st.write(predictions)
result(predictions)
if inference=="TensorRT":
df["content"] = [input_text]
predictions = run_trt_inference(
trt_model_path = "/workspace/data/multilingual-text-classifier/output models/mBERT_lightning_fp16_bs16.engine",
stage="inference",
df_test = df
)
predictions = predictions.astype("float32")
predictions = torch.sigmoid(torch.tensor(predictions))
st.write(input_text)
st.write(predictions)
result(predictions)
if precision=="32 Bit":
if inference=="ONNX":
df["comment_text"] = [input_text]
predictions = run_onnx_inference(
onnx_model_path = "/workspace/data/multilingual-text-classifier/output models/mBERT_fp32.onnx",
stage="inference",
df_test = df
)
predictions = torch.sigmoid(torch.tensor(predictions))
st.write(input_text)
st.write(predictions)
result(predictions)
if inference=="TensorRT":
df["content"] = [input_text]
predictions = run_trt_inference(
trt_model_path = "/workspace/data/multilingual-text-classifier/output models/mBERT_fp32.engine",
stage="inference",
df_test = df
)
predictions = predictions.astype("float32")
predictions = torch.sigmoid(torch.tensor(predictions))
st.write(input_text)
st.write(predictions)
result(predictions)
####################
# Take Feedback
####################
st.subheader("Feedback:")
feedback = st.radio(
"Are you satisfied with the results?",
('Yes', 'No'))
st.write("Thanks for the Feedback!")
Error
-------------------------------------------------------------------
PyCUDA ERROR: The context stack was not empty upon module cleanup.
-------------------------------------------------------------------
A context was still active when the context stack was being
cleaned up. At this point in our execution, CUDA may already
have been deinitialized, so there is no way we can finish
cleanly. The program will be aborted now.
Use Context.pop() to avoid this problem.
-------------------------------------------------------------------
Aborted (core dumped)

Sklearn classifier and flask issues

I have been trying to self host with apache an sklearn classifier that I put together, and I ended up using joblib to serialize the saved model, then load it in a flask app. Now, this app worked perfectly when running flask's built in development server, but when I set this up with a debian 9 apache server, I get a 500 error. Delving into apache's error.log, I get:
AttributeError: module '__main__' has no attribute 'tokenize'
Now, this is funny to me because while I did write my own tokenizer, the web app gave me no problems when I was running it locally. Furthermore, the saved model that I used was trained on the webserver, so slightly different library versions should not be a problem.
My code for the web app is:
import re
import sys
from flask import Flask, request, render_template
from nltk import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer
from sklearn.externals import joblib
app = Flask(__name__)
def tokenize(text):
# text = text.translate(str.maketrans('','',string.punctuation))
text = re.sub(r'\W+', ' ', text)
tokens = word_tokenize(text)
lemas = []
for item in tokens:
lemas.append(WordNetLemmatizer().lemmatize(item))
return lemas
#app.route('/')
def home():
return render_template('home.html')
#app.route('/analyze',methods=['POST','GET'])
def analyze():
if request.method=='POST':
result=request.form
input_text = result['input_text']
clf = joblib.load("model.pkl.z")
parameters = clf.named_steps['clf'].get_params()
predicted = clf.predict([input_text])
# print(predicted)
certainty = clf.decision_function([input_text])
# Is it bonkers?
if predicted[0]:
verdict = "Not too nuts!"
else:
verdict = "Bonkers!"
return render_template('result.html',prediction=[input_text, verdict, float(certainty), parameters])
if __name__ == '__main__':
#app.debug = True
app.run()
With the .wsgi file being:
import sys
sys.path.append('/var/www/mysite')
from conspiracydetector import app as application
Furthermore, I trained the model with this code:
import logging
import pprint # Pretty stuff
import re
import sys # For command line arguments
from time import time # to show progress
import numpy as np
from nltk import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer
from sklearn import metrics
from sklearn.datasets import load_files
from sklearn.externals import joblib # In order to save
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
# Tokenizer that does stemming and strips punctuation
def tokenize(text):
# text = text.translate(str.maketrans('','',string.punctuation))
text = re.sub(r'\W+', ' ', text)
tokens = word_tokenize(text)
lemas = []
for item in tokens:
lemas.append(WordNetLemmatizer().lemmatize(item))
return lemas
if __name__ == "__main__":
# NOTE: we put the following in a 'if __name__ == "__main__"' protected
# block to be able to use a multi-core grid search that also works under
# Windows, see: http://docs.python.org/library/multiprocessing.html#windows
# The multiprocessing module is used as the backend of joblib.Parallel
# that is used when n_jobs != 1 in GridSearchCV
# Display progress logs on stdout
print("Initializing...")
# Command line arguments
save = sys.argv[1]
training_directory = sys.argv[2]
logging.basicConfig(level=logging.INFO,
format='%(asctime)s %(levelname)s %(message)s')
dataset = load_files(training_directory, shuffle=False)
print("n_samples: %d" % len(dataset.data))
# split the dataset in training and test set:
print("Splitting the dataset in training and test set...")
docs_train, docs_test, y_train, y_test = train_test_split(
dataset.data, dataset.target, test_size=0.25, random_state=None)
# Build a vectorizer / classifier pipeline that filters out tokens
# that are too rare or too frequent
# Also remove stop words
print("Loading list of stop words...")
with open('stopwords.txt', 'r') as f:
words = [line.strip() for line in f]
print("Stop words list loaded...")
print("Setting up pipeline...")
pipeline = Pipeline(
[
# ('vect', TfidfVectorizer(stop_words=words, min_df=0.001, max_df=0.5, ngram_range=(1,1))),
('vect',
TfidfVectorizer(tokenizer=tokenize, stop_words=words, min_df=0.001, max_df=0.5, ngram_range=(1, 1))),
('clf', LinearSVC(C=5000)),
])
print("Pipeline:", [name for name, _ in pipeline.steps])
# Build a grid search to find out whether unigrams or bigrams are
# more useful.
# Fit the pipeline on the training set using grid search for the parameters
print("Initializing grid search...")
# uncommenting more parameters will give better exploring power but will
# increase processing time in a combinatorial way
parameters = {
# 'vect__ngram_range': [(1, 1), (1, 2)],
# 'vect__min_df': (0.0005, 0.001),
# 'vect__max_df': (0.25, 0.5),
# 'clf__C': (10, 15, 20),
}
print("Parameters:")
pprint.pprint(parameters)
grid_search = GridSearchCV(
pipeline,
parameters,
n_jobs=-1,
verbose=True)
print("Training and performing grid search...\n")
t0 = time()
grid_search.fit(docs_train, y_train)
print("\nDone in %0.3fs!\n" % (time() - t0))
# Print the mean and std for each candidate along with the parameter
# settings for all the candidates explored by grid search.
n_candidates = len(grid_search.cv_results_['params'])
for i in range(n_candidates):
print(i, 'params - %s; mean - %0.2f; std - %0.2f'
% (grid_search.cv_results_['params'][i],
grid_search.cv_results_['mean_test_score'][i],
grid_search.cv_results_['std_test_score'][i]))
# Predict the outcome on the testing set and store it in a variable
# named y_predicted
print("\nRunning against testing set...\n")
y_predicted = grid_search.predict(docs_test)
# Save model
print("\nSaving model to", save, "...")
joblib.dump(grid_search.best_estimator_, save)
print("Model Saved! \nPrepare for some awesome stats!")
I must confess that I am pretty stumped, and after tinkering around, searching, and making sure that my server is configured correctly, I felt that perhaps someone here might be able to help.
Any help is appreciated, and if there is any more information that I need to provide, please let me know and I will be happy to.
Also, I am running:
python 3.5.3 with nltk and sklearn.
I solved this problem, although imperfectly, by removing my custom tokenizer and falling back on one of sklearn's.
However, I am still in the dark on how to integrate my own tokenizer.

Categories