I am following this tutorial here: https://huggingface.co/transformers/training.html - however, I am running into an error, and I think the tutorial is missing an import, but I do not know which one.
These are my current imports:
# Transformers installation
! pip install transformers
# To install from source instead of the last release, comment the command above and uncomment the following one.
# ! pip install git+https://github.com/huggingface/transformers.git
! pip install datasets transformers
from transformers import pipeline
Current code:
from datasets import load_dataset
raw_datasets = load_dataset("imdb")
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
inputs = tokenizer(sentences, padding="max_length", truncation=True)
The error:
NameError Traceback (most recent call last)
<ipython-input-9-5a234f114e2e> in <module>()
----> 1 inputs = tokenizer(sentences, padding="max_length", truncation=True)
NameError: name 'sentences' is not defined
This error occurs because you have not declared sentences. You can either access the raw data you loaded:
k = raw_datasets['train']
sentences = k['text']
or create a variable with some example sentences yourself:
sentences = ["Hello I'm a single sentence",
             "And another sentence",
             "And the very very last one"]
The tutorial itself hints at this: "As we saw in Preprocessing data, we can prepare the text inputs for the model with the following command (this is an example, not a command you can execute)".
The error states that you do not have a variable called sentences in scope. I believe the tutorial presumes you already have a list of sentences and are tokenizing it.
Have a look at the documentation: the first argument can be a string, a list of strings, or a list of lists of strings.
__call__(text: Union[str, List[str], List[List[str]]], ...)
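Putting the two answers together, a minimal sketch that defines sentences from the IMDB training split before tokenizing (using only the objects already created above):
from datasets import load_dataset
from transformers import AutoTokenizer

raw_datasets = load_dataset("imdb")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# The "text" column of the train split is a list of strings,
# which matches the tokenizer's __call__ signature above
sentences = raw_datasets["train"]["text"]
inputs = tokenizer(sentences, padding="max_length", truncation=True)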
Related
Trying to use tuner007/pegasus_paraphrase. Followed the examples in Pegasus.
The Pegasus model was proposed in PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization by Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu on Dec 18, 2019.
Problem:
PegasusTokenizer cannot be instantiated, as PegasusTokenizer.from_pretrained(model_name) returns None. Using 'google/pegasus-xsum' as the model name caused the same error.
from transformers import PegasusForConditionalGeneration, PegasusTokenizer
model_name = 'tuner007/pegasus_paraphrase'
tokenizer = PegasusTokenizer.from_pretrained(model_name)
type(tokenizer)
---
NoneType
Please suggest how to work it around.
You need to install the sentencepiece library, which the tokenizer needs to work properly. To install it, run:
pip install sentencepiece
The error occurred because you imported the tokenizer before installing sentencepiece, and after receiving the error you installed the library without restarting the session.
Make sure you install sentencepiece before importing the tokenizer.
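In a notebook, the working order is roughly this (restart the runtime first if the failing import has already been run):
!pip install sentencepiece

# Import only after sentencepiece is installed; otherwise
# from_pretrained returns None instead of a tokenizer
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

model_name = 'tuner007/pegasus_paraphrase'
tokenizer = PegasusTokenizer.from_pretrained(model_name)
print(type(tokenizer))  # should now be PegasusTokenizer, not NoneType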
As a programming noob, I am trying to find similar sentences in several hundred newspaper articles. I tried my code on a smaller text sample, which worked brilliantly. Now, with a larger text file (using the same code), I get the error "[E1002] Span index out of range".
This is my code so far:
!pip install spacy
import spacy

# Load the small English pipeline and raise the character limit
# so the whole article file can be processed in one pass
nlp = spacy.load('en_core_web_sm')
nlp.max_length = 2000000

with open('/content/BSE.txt', 'r', encoding="utf-8", errors="ignore") as f:
    sentences_articles = f.read()

# Split the articles into sentences
about_doc = nlp(sentences_articles)
sentences = list(about_doc.sents)
len(sentences)
sentences[:10]

!pip install -U sentence-transformers
from sentence_transformers import SentenceTransformer, util
import torch

# Embed every sentence with a sentence-transformers model
embedder = SentenceTransformer('all-mpnet-base-v2')
corpus = sentences
corpus_embeddings = embedder.encode(corpus, show_progress_bar=True, batch_size=128)
The progress bar stops at 94% with the error "[E1002] Span index out of range". I have tried the .readlines() function, which ran without error, but because of the nature of my text data it produced unusable results. I limited the number of words in each sentence, but that didn't help either. I tried several text files (different lengths, different content), but without success.
Any suggestions on how to fix this?
I had a similar problem with the same error, and for me it was solved by changing sentences from a list[Span] to a list[str], since that is what .encode() expects. Instead of sentences = list(about_doc.sents), write sentences = [sent.text for sent in about_doc.sents].
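Applied to the code above, the fix is a one-line change:
# Convert spaCy Span objects to plain strings before embedding
sentences = [sent.text for sent in about_doc.sents]

corpus = sentences
corpus_embeddings = embedder.encode(corpus, show_progress_bar=True, batch_size=128)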
I just recently started looking into the Hugging Face transformers library.
When I tried to get started using the model card code for a community model, e.g.:
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
model = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
However, I got the following error:
Traceback (most recent call last):
File "test.py", line 2, in <module>
tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
File "/Users/Lukas/miniconda3/envs/nlp/lib/python3.7/site-packages/transformers/tokenization_auto.py", line 124, in from_pretrained
"'xlm', 'roberta', 'ctrl'".format(pretrained_model_name_or_path))
ValueError: Unrecognized model identifier in emilyalsentzer/Bio_ClinicalBERT. Should contains one of 'bert', 'openai-gpt', 'gpt2', 'transfo-xl', 'xlnet', 'xlm', 'roberta', 'ctrl'
If I try a different tokenizer such as "baykenney/bert-base-gpt2detector-topp92" I get the following error:
OSError: Model name 'baykenney/bert-base-gpt2detector-topp92' was not found in tokenizers model name list (bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese, bert-base-german-cased, bert-large-uncased-whole-word-masking, bert-large-cased-whole-word-masking, bert-large-uncased-whole-word-masking-finetuned-squad, bert-large-cased-whole-word-masking-finetuned-squad, bert-base-cased-finetuned-mrpc, bert-base-german-dbmdz-cased, bert-base-german-dbmdz-uncased). We assumed 'baykenney/bert-base-gpt2detector-topp92' was a path or url to a directory containing vocabulary files named ['vocab.txt'] but couldn't find such vocabulary files at this path or url.
Did I miss anything to get started? I feel like the model cards indicate that these three lines of code should be enough.
I am using Python 3.7, transformers library version 2.1.1, and PyTorch 1.5.
Please update your transformers library to at least 2.4.0; versions as old as 2.1.1 inferred the architecture from the model name (as the error message shows) and cannot resolve user-namespaced community models like emilyalsentzer/Bio_ClinicalBERT. You should create a new conda environment and install all your packages directly from PyPI with pip to get the most recent version (currently 2.11.0).
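In the fresh environment, the fix is just the upgrade, after which the model card code runs unchanged, e.g.:
pip install --upgrade transformers

from transformers import AutoTokenizer, AutoModel

# Works on transformers >= 2.4.0, which can resolve
# user-namespaced community models from the model hub
tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
model = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")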
I've tried to load pre-trained FastText vectors from fastText - wiki word vectors.
My code is below, and it works well.
from gensim.models import FastText
model = FastText.load_fasttext_format('./wiki.en/wiki.en.bin')
but, the warning message is a little annoying.
gensim_fasttext_pretrained_vector.py:13: DeprecationWarning: Call to deprecated `load_fasttext_format` (use load_facebook_vectors (to use pretrained embeddings)
The message says that load_fasttext_format will be deprecated, and that it is better to use load_facebook_vectors.
So I decided to change the code. My changed code is below:
from gensim.models import FastText
model = FastText.load_facebook_vectors('./wiki.en/wiki.en.bin')
But an error occurred; the error message is:
Traceback (most recent call last):
File "gensim_fasttext_pretrained_vector.py", line 13, in <module>
model = FastText.load_facebook_vectors('./wiki.en/wiki.en.bin')
AttributeError: type object 'FastText' has no attribute 'load_facebook_vectors'
I can't understand why this happens. I just changed what the message said to change, but it doesn't work.
If you know anything about this, please let me know.
As always, thanks for your help.
You're almost there; you need to change two things.
First of all, load_facebook_vectors lives on the fasttext module (all lowercase letters), not on the FastText class.
Second of all, to use load_facebook_vectors, you first need to create a datapath object and pass it in.
So you should do it like so:
from gensim.models import fasttext
from gensim.test.utils import datapath
wv = fasttext.load_facebook_vectors(datapath("./wiki.en/wiki.en.bin"))
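Once loaded, wv is a set of keyed vectors; a quick sanity check (the word used here is just an example) could look like:
# FastText can also produce vectors for out-of-vocabulary
# words via subword n-grams
print(wv['hello'].shape)
print(wv.most_similar('hello', topn=3))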
Please forgive my newbness here, but fasttext is not working for me in Python. I am using Anaconda running Python 3.6. My code is as follows (just an example):
import fasttext
model = fasttext.load_model('/home/sproc/share/fastText/model.bin')
print(model.words)
This returns the following error:
Traceback (most recent call last):
File "/media/sf_VBoxShare/LiClipseWorkspace/test/testpack/fasttext.py", line 1, in <module>
import fasttext
File "/media/sf_VBoxShare/LiClipseWorkspace/test/testpack/fasttext.py", line 3, in <module>
model = fasttext.load_model('/home/sproc/share/fastText/model.bin')
AttributeError: module 'fasttext' has no attribute 'load_model'
It does the same thing with cbow and skipgram when trying to create word vectors. I checked the __init__.py file in the .../site-packages/fasttext directory, and it imports those attributes, but they are not part of the model.py module. I'm guessing this has something to do with the shared object file, but I am not sure. Any help is greatly appreciated.
Here is a solution that worked for me when I got the error you are getting.
Import FastText:
from gensim.models.wrappers import FastText
Load the binary:
model = FastText.load_fasttext_format('wiki.simple.bin')
Rename your Python file.
Don't name it fasttext.py: if you name it like this, what you get from "import fasttext" will be your own file, as the traceback above shows (the failing import resolves to your own fasttext.py).
You can rename it 'fast_text.py' or something else.
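A quick way to check which module "import fasttext" is actually picking up (a minimal diagnostic sketch):
import fasttext
# If this prints the path of your own script rather than a
# site-packages directory, a local file is shadowing the library
print(fasttext.__file__)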
If you install the fastText package instead of the old fasttext, then
import fastText
model = fastText.load_model('/home/sproc/share/fastText/model.bin')
should work as expected.
@spencerktm30 I recommend using pyfasttext instead of fasttext, which is no longer active and has a lot of bugs: link to pyfasttext.
Actually, I faced a similar issue when trying to load a C++ pre-trained model, and I had to switch to pyfasttext to get it to work.
So this should hopefully work for you:
>>> from pyfasttext import FastText
>>> model = FastText('/home/sproc/share/fastText/model.bin')
Rename the file from fasttext.py to another name; it will work.
Apparently there are different fasttext python libraries out there!
fasttext != fasttext-win