I searched on Google and on the Gensim support forum, but I cannot find a good answer.
Basically, I am implementing online learning for Doc2Vec using Gensim, but Gensim keeps throwing me a seemingly random error: "Segmentation fault".
Please take a look at my sample code:
from gensim.models import Doc2Vec
from gensim.models.doc2vec import LabeledSentence
import random
import logging
if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    sentence1 = "this is a test"
    sentence2 = "test test 123 test"
    sentence3 = "qqq zzz"
    sentence4 = "ppp"
    sentences = [
        LabeledSentence(sentence1.split(), ["p1"]),
        LabeledSentence(sentence2.split(), ["p2"])
    ]
    model = Doc2Vec(min_count=1, window=5, size=400, sample=1e-4, negative=5, workers=1)
    model.build_vocab(sentences)
    for a in range(2):
        random.shuffle(sentences)
        print([s.tags[0] for s in sentences])
        model.train(sentences)
    model.save("test.d2v")
    new_model = Doc2Vec.load("test.d2v")
    new_sentences = [
        LabeledSentence(sentence1.split(), ["n1"]),
        LabeledSentence(sentence3.split(), ["n2"])
    ]
    new_model.build_vocab(new_sentences, update=True)
    for a in range(4):
        random.shuffle(new_sentences)
        print([s.tags[0] for s in new_sentences])
        new_model.train(new_sentences)
Here is my error:
INFO:gensim.models.word2vec:training model with 1 workers on 7 vocabulary and 400 features, using sg=0 hs=0 sample=0.0001 negative=5 window=5
INFO:gensim.models.word2vec:expecting 2 sentences, matching count from corpus used for vocabulary survey
Segmentation fault
Can anyone explain why this happens, and how to solve it?
Thanks
A segmentation fault – that is, an illegal memory access – should be nearly impossible to trigger from your Python code. That suggests this could be a problem specific to your installation/configuration – OS, Python, gensim, support-libraries – even a corrupted file.
Try clearing and reinstalling the Python environment and support libraries (like NumPy and SciPy), and confirm that the examples bundled with gensim run without a segmentation fault - for example, the notebook in docs/notebooks/doc2vec-lee.ipynb. If you still get such faults with either the bundled examples or your own code, turn on debug logging, capture all output, and report the problem with full details on your OS/Python/gensim/etc. versions.
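As a minimal sketch of that advice (the reinstall command targets gensim's usual numeric dependencies, and the logging line just raises the question's own logging setup to DEBUG; adapt both to your environment):
# shell: force a clean reinstall of gensim and its numeric support libraries
pip install --force-reinstall numpy scipy gensim
# Python: raise logging to DEBUG before build_vocab/train so a crash report carries full context
import logging
logging.basicConfig(level=logging.DEBUG)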
I am trying to use tuner007/pegasus_paraphrase and followed the examples in the Pegasus documentation.
The Pegasus model was proposed in PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization by Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu on Dec 18, 2019.
Problem:
PegasusTokenizer cannot be instantiated, as PegasusTokenizer.from_pretrained(model_name) returns None. Using 'google/pegasus-xsum' as the model name causes the same problem.
from transformers import PegasusForConditionalGeneration, PegasusTokenizer
model_name = 'tuner007/pegasus_paraphrase'
tokenizer = PegasusTokenizer.from_pretrained(model_name)
type(tokenizer)
---
NoneType
Please suggest how to work it around.
You need to install the SentencePiece library, which the tokenizer needs in order to work properly. To install it, run:
pip install sentencepiece
The error most likely occurred because you imported the tokenizer before installing sentencepiece, and after receiving the error you installed it without restarting the session.
Make sure you install sentencepiece before importing the tokenizer (or restart the runtime after installing it).
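A minimal sketch of the working order, reusing the model name from the question (run the install, restart the session, then import):
# pip install sentencepiece   <- run this first, then restart the interpreter/runtime
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

model_name = 'tuner007/pegasus_paraphrase'
tokenizer = PegasusTokenizer.from_pretrained(model_name)   # now returns a PegasusTokenizer instead of None
model = PegasusForConditionalGeneration.from_pretrained(model_name)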
As a programming noob, I am trying to find similar sentences in several hundred newspaper articles. I tried my code with a smaller text sample, and it worked brilliantly. Now, with a larger text file (using the same code), I get the error "[E1002] Span index out of range".
This is my code so far:
!pip install spacy
import spacy
nlp = spacy.load('en_core_web_sm')
nlp.max_length = 2000000
with open('/content/BSE.txt', 'r', encoding="utf-8", errors="ignore") as f:
    sentences_articles = f.read()
about_doc = nlp(sentences_articles)
sentences = list(about_doc.sents)
len(sentences)
sentences[:10]
!pip install -U sentence-transformers
from sentence_transformers import SentenceTransformer, util
import torch
embedder = SentenceTransformer('all-mpnet-base-v2')
corpus = sentences
corpus_embeddings = embedder.encode(corpus, show_progress_bar=True, batch_size = 128)
The progress bar stops at 94% with the error "[E1002] Span index out of range". I also tried the .readlines() function, which ran without error, but because of the nature of my text data it produced unusable results. I limited the number of words in each sentence, but that didn't help either. I tried several text files (different lengths, different content), without success.
Any suggestions on how to fix this?
I had a similar problem with the same error, and for me it was solved by changing sentences from a list[Span] to a list[str], since that is what .encode() expects. Instead of sentences = list(about_doc.sents), write sentences = list(sent.text for sent in about_doc.sents).
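A minimal sketch of the corrected step, reusing the variable names from the question:
# convert the spaCy Span objects to plain strings before encoding
sentences = [sent.text for sent in about_doc.sents]
corpus_embeddings = embedder.encode(sentences, show_progress_bar=True, batch_size=128)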
I've tried to load the pre-trained FastText vectors from fastText's Wiki word vectors.
My code is below, and it works well:
from gensim.models import FastText
model = FastText.load_fasttext_format('./wiki.en/wiki.en.bin')
However, the warning message is a little annoying:
gensim_fasttext_pretrained_vector.py:13: DeprecationWarning: Call to deprecated `load_fasttext_format` (use load_facebook_vectors (to use pretrained embeddings)
The message says load_fasttext_format is deprecated, so it would be better to use load_facebook_vectors.
So I decided to change the code; my changed code is below:
from gensim.models import FastText
model = FastText.load_facebook_vectors('./wiki.en/wiki.en.bin')
But an error occurred; the error message looks like this:
Traceback (most recent call last):
  File "gensim_fasttext_pretrained_vector.py", line 13, in <module>
    model = FastText.load_facebook_vectors('./wiki.en/wiki.en.bin')
AttributeError: type object 'FastText' has no attribute 'load_facebook_vectors'
I can't understand why this happens.
I just changed what the message said, but it doesn't work.
If you know anything about this, please let me know.
As always, thanks for your help.
You're almost there; you need to change two things:
First, load_facebook_vectors lives on the fasttext module (all lowercase letters), not on the FastText class.
Second, to use load_facebook_vectors, you first need to create a datapath object and pass that in.
So you should do it like this:
from gensim.models import fasttext
from gensim.test.utils import datapath
wv = fasttext.load_facebook_vectors(datapath("./wiki.en/wiki.en.bin"))
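As a quick sanity check on the loaded vectors (a hedged example; 'king' is only an illustrative query word):
# load_facebook_vectors returns a KeyedVectors-style object, so the usual similarity queries work
print(wv.most_similar('king', topn=5))   # 'king' is just an example query word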
I have installed the NLTK package and other dependencies and set the environment variables as follows:
STANFORD_MODELS=/mnt/d/stanford-ner/stanford-ner-2018-10-16/classifiers/english.all.3class.distsim.crf.ser.gz:/mnt/d/stanford-ner/stanford-ner-2018-10-16/classifiers/english.muc.7class.distsim.crf.ser.gz:/mnt/d/stanford-ner/stanford-ner-2018-10-16/classifiers/english.conll.4class.distsim.crf.ser.gz
CLASSPATH=/mnt/d/stanford-ner/stanford-ner-2018-10-16/stanford-ner.jar
When I try to access the classifier like below:
stanford_classifier = os.environ.get('STANFORD_MODELS').split(':')[0]
stanford_ner_path = os.environ.get('CLASSPATH').split(':')[0]
st = StanfordNERTagger(stanford_classifier, stanford_ner_path, encoding='utf-8')
I get the following error, but I don't understand what is causing it.
Error: Could not find or load main class edu.stanford.nlp.ie.crf.CRFClassifier
OSError: Java command failed : ['/mnt/c/Program Files (x86)/Common Files/Oracle/Java/javapath_target_1133041234/java.exe', '-mx1000m', '-cp', '/mnt/d/stanford-ner/stanford-ner-2018-10-16/stanford-ner.jar', 'edu.stanford.nlp.ie.crf.CRFClassifier', '-loadClassifier', '/mnt/d/stanford-ner/stanford-ner-2018-10-16/classifiers/english.all.3class.distsim.crf.ser.gz', '-textFile', '/tmp/tmpaiqclf_d', '-outputFormat', 'slashTags', '-tokenizerFactory', 'edu.stanford.nlp.process.WhitespaceTokenizer', '-tokenizerOptions', '"tokenizeNLs=false"', '-encoding', 'utf8']
I found the answer to this issue. I am using NLTK == 3.4. From NLTK 3.3 onwards, the Stanford NLP tools (POS tagger, NER tagger, tokenizer) are not loaded through nltk.tag but through nltk.parse.corenlp.CoreNLPParser. The Stack Overflow answer is available at stackoverflow.com/questions/13883277/stanford-parser-and-nltk/… and the official documentation is at github.com/nltk/nltk/wiki/Stanford-CoreNLP-API-in-NLTK.
Additional information: if you are facing timeout issues from the NER tagger or any other parser of the CoreNLP API, increase the timeout limit as described in https://github.com/nltk/nltk/wiki/Stanford-CoreNLP-API-in-NLTK/_compare/3d64e56bede5e6d93502360f2fcd286b633cbdb9...f33be8b06094dae21f1437a6cb634f86ad7d83f7 by dimazest.
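A hedged sketch of the CoreNLPParser route described in the linked wiki; it assumes a CoreNLP server has already been started locally on port 9000, and the example sentence is only illustrative:
from nltk.parse.corenlp import CoreNLPParser

# assumes a Stanford CoreNLP server is running locally, e.g. started with:
#   java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000
ner_tagger = CoreNLPParser(url='http://localhost:9000', tagtype='ner')
print(list(ner_tagger.tag('Rami Eid is studying at Stony Brook University in NY'.split())))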
I'm trying to use word2vec, but it gives an error when I try to do anything with any word. It seems to be an encoding issue; here is what I did.
Init word2vec:
import gensim, logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
model = gensim.models.Word2Vec.load_word2vec_format('freebase-vectors-skipgram1000/knowledge-vectors-skipgram1000.bin', binary=True)
model.init_sims(replace=True)
Test it a bit:
print(model)
# prints: Word2Vec(vocab=1422903, size=1000, alpha=0.025)
print(model.index2word[0])
# prints: u'/m/0dgps15'
# I would expect a readable word, how to fix that?
The error:
print(model.similarity('word', 'sound'))
# An error happen: KeyError: 'word'
I also tried to load the model with binary=False, but that raises an error while loading.
There is nothing wrong with your word2vec usage. The file format is binary (and it can be converted to text with a conversion utility).
You have downloaded a pre-trained "entity" vector file: its keys are Freebase machine IDs like /m/0dgps15, not plain words, which is why lookups such as 'word' fail. I recommend using the word or phrase vectors trained on Google News instead (also available on the word2vec website).
import gensim
model = gensim.models.Word2Vec.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
print(model.similarity('word', 'sound'))
0.152615140536
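If you do want to keep the Freebase entity vectors instead, you can query them by their machine-ID keys rather than by plain words (a hedged example, reusing the key shown in the question's output):
# the Freebase file is keyed by entity IDs such as '/m/0dgps15' (taken from index2word[0] above)
freebase_model = gensim.models.Word2Vec.load_word2vec_format('freebase-vectors-skipgram1000/knowledge-vectors-skipgram1000.bin', binary=True)
print(freebase_model['/m/0dgps15'][:10])   # first ten dimensions of that entity's 1000-d vector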