I have installed the NLTK package and other dependencies and set the environment variables as follows:
STANFORD_MODELS=/mnt/d/stanford-ner/stanford-ner-2018-10-16/classifiers/english.all.3class.distsim.crf.ser.gz:/mnt/d/stanford-ner/stanford-ner-2018-10-16/classifiers/english.muc.7class.distsim.crf.ser.gz:/mnt/d/stanford-ner/stanford-ner-2018-10-16/classifiers/english.conll.4class.distsim.crf.ser.gz
CLASSPATH=/mnt/d/stanford-ner/stanford-ner-2018-10-16/stanford-ner.jar
When I try to access the classifier as shown below:
stanford_classifier = os.environ.get('STANFORD_MODELS').split(':')[0]
stanford_ner_path = os.environ.get('CLASSPATH').split(':')[0]
st = StanfordNERTagger(stanford_classifier, stanford_ner_path, encoding='utf-8')
I get the following error, but I don't understand what is causing it.
Error: Could not find or load main class edu.stanford.nlp.ie.crf.CRFClassifier
OSError: Java command failed : ['/mnt/c/Program Files (x86)/Common
Files/Oracle/Java/javapath_target_1133041234/java.exe', '-mx1000m', '-cp', '/mnt/d/stanford-ner/stanford-ner-2018-10-16/stanford-ner.jar', 'edu.stanford.nlp.ie.crf.CRFClassifier', '-loadClassifier', '/mnt/d/stanford-ner/stanford-ner-2018-10-16/classifiers/english.all.3class.distsim.crf.ser.gz', '-textFile', '/tmp/tmpaiqclf_d', '-outputFormat', 'slashTags', '-tokenizerFactory', 'edu.stanford.nlp.process.WhitespaceTokenizer', '-tokenizerOptions', '"tokenizeNLs=false"', '-encoding', 'utf8']
I found the answer to this issue. I am using NLTK == 3.4. From NLTK 3.3 onwards, the Stanford NLP tools (POS tagger, NER tagger, tokenizer) are no longer loaded through nltk.tag but through nltk.parse.corenlp.CoreNLPParser. The Stack Overflow answer is available at stackoverflow.com/questions/13883277/stanford-parser-and-nltk/… and the official documentation is at github.com/nltk/nltk/wiki/Stanford-CoreNLP-API-in-NLTK.
Additional information: if you are facing a timeout issue with the NER tagger or any other parser of the CoreNLP API, increase the timeout limit as described in https://github.com/nltk/nltk/wiki/Stanford-CoreNLP-API-in-NLTK/_compare/3d64e56bede5e6d93502360f2fcd286b633cbdb9...f33be8b06094dae21f1437a6cb634f86ad7d83f7 by dimazest.
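For reference, here is a minimal sketch of that newer workflow, following the NLTK wiki linked above. It assumes you have downloaded the full Stanford CoreNLP distribution (not just stanford-ner.jar) and that port 9000 is free; the example sentence is arbitrary.
# From the directory containing the CoreNLP jars, start the server first:
# java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000

from nltk.parse.corenlp import CoreNLPParser

ner_tagger = CoreNLPParser(url='http://localhost:9000', tagtype='ner')
print(list(ner_tagger.tag('Rami Eid is studying at Stony Brook University in NY'.split())))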
I am trying to use tuner007/pegasus_paraphrase and followed the Pegasus examples.
The Pegasus model was proposed in PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization by Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu on Dec 18, 2019.
Problem:
PegasusTokenizer cannot be instantiated, as PegasusTokenizer.from_pretrained(model_name) returns None. Using 'google/pegasus-xsum' as the model name causes the same problem.
from transformers import PegasusForConditionalGeneration, PegasusTokenizer
model_name = 'tuner007/pegasus_paraphrase'
tokenizer = PegasusTokenizer.from_pretrained(model_name)
type(tokenizer)
---
NoneType
Please suggest how to work around this.
You need to install the sentencepiece library, which the tokenizer needs in order to work properly. To install it, run:
pip install sentencepiece
The error actually occurred because you imported the tokenizer before installing sentencepiece, and after receiving the error you installed sentencepiece without restarting the session.
Make sure you install sentencepiece before importing the tokenizer, or restart the session after installing it.
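As a quick sanity check of that order (the model name is taken from the question; restart the kernel or session first if transformers was already imported):
# pip install sentencepiece transformers

from transformers import PegasusTokenizer

tokenizer = PegasusTokenizer.from_pretrained('tuner007/pegasus_paraphrase')
print(type(tokenizer))  # should be a PegasusTokenizer instance, not NoneType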
I'm trying to load the English model for StanfordNLP (Python) from my local machine, but I am unable to find the proper import statements to do so. What commands can be used? Is there a pip installation available to load the English model?
I have tried using the download command to do so; however, my machine requires all files to be added locally. I downloaded the English jar files from https://stanfordnlp.github.io/CoreNLP/ but am unsure whether I need both the English and the English KBP versions.
The directory set for model download is /home/sf.
pip install stanfordnlp # install stanfordnlp
import stanfordnlp
stanfordnlp.download("en") # after confirming with 'Y', you can set a custom directory path here
local_dir_store_model = "/home/sf"
english_model_dir = "/home/sf/en_ewt_models"
tokenizer_en_pt_file = "/home/sf/en_ewt_models/en_ewt_tokenizer.pt"
nlp = stanfordnlp.Pipeline(models_dir=local_dir_store_model,processors = 'tokenize,mwt,lemma,pos')
doc = nlp("""One of the most wonderful things in life is to wake up and enjoy a cuddle with somebody; unless you are in prison""")
doc.sentences[0].print_tokens()
I am unclear what you want to do.
If you want to run the all-Python pipeline, you can download the files and run them in Python code by specifying the paths for each annotator as in this example.
import stanfordnlp
config = {
    'processors': 'tokenize,mwt,pos,lemma,depparse',  # Comma-separated list of processors to use
    'lang': 'fr',  # Language code for the language to build the Pipeline in
    'tokenize_model_path': './fr_gsd_models/fr_gsd_tokenizer.pt',  # Processor-specific arguments are set with keys "{processor_name}_{argument_name}"
    'mwt_model_path': './fr_gsd_models/fr_gsd_mwt_expander.pt',
    'pos_model_path': './fr_gsd_models/fr_gsd_tagger.pt',
    'pos_pretrain_path': './fr_gsd_models/fr_gsd.pretrain.pt',
    'lemma_model_path': './fr_gsd_models/fr_gsd_lemmatizer.pt',
    'depparse_model_path': './fr_gsd_models/fr_gsd_parser.pt',
    'depparse_pretrain_path': './fr_gsd_models/fr_gsd.pretrain.pt'
}
nlp = stanfordnlp.Pipeline(**config) # Initialize the pipeline using a configuration dict
doc = nlp("Van Gogh grandit au sein d'une famille de l'ancienne bourgeoisie.") # Run the pipeline on input text
doc.sentences[0].print_tokens()
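For the English model downloaded to /home/sf, as in the question, the same pattern would look roughly like the sketch below. The en_ewt_* file names other than the tokenizer are assumptions based on the tokenizer path given in the question, and the mwt processor is omitted because English does not normally use it.
import stanfordnlp

config = {
    'processors': 'tokenize,pos,lemma',  # no 'mwt' for English
    'lang': 'en',
    'tokenize_model_path': '/home/sf/en_ewt_models/en_ewt_tokenizer.pt',
    'pos_model_path': '/home/sf/en_ewt_models/en_ewt_tagger.pt',       # assumed file name
    'pos_pretrain_path': '/home/sf/en_ewt_models/en_ewt.pretrain.pt',  # assumed file name
    'lemma_model_path': '/home/sf/en_ewt_models/en_ewt_lemmatizer.pt'  # assumed file name
}

nlp = stanfordnlp.Pipeline(**config)
doc = nlp("One of the most wonderful things in life is to wake up and enjoy a cuddle with somebody; unless you are in prison")
doc.sentences[0].print_tokens()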
If you want to run the Java server with the Python interface, you need to download the Java jar files and start the server. Full info here: https://stanfordnlp.github.io/CoreNLP/corenlp-server.html
Then you can access the server with the Python interface. Full info here: https://stanfordnlp.github.io/stanfordnlp/corenlp_client.html
But just to be clear, the jar files should not be used with the pure Python pipeline. Those are for running the Java server.
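If you do go the server route, a minimal sketch with the Python client that ships in stanfordnlp is below. It assumes the CORENLP_HOME environment variable points at the unzipped CoreNLP distribution; the annotator list, memory setting and example sentence are placeholders.
from stanfordnlp.server import CoreNLPClient

# CORENLP_HOME must point at the unzipped CoreNLP jars, e.g.
# export CORENLP_HOME=/path/to/stanford-corenlp-full-2018-10-05

with CoreNLPClient(annotators=['tokenize', 'ssplit', 'pos', 'ner'],
                   timeout=60000, memory='4G') as client:
    ann = client.annotate('Chris Manning teaches at Stanford University.')
    for token in ann.sentence[0].token:
        print(token.word, token.pos, token.ner)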
I have the following code, and I made sure the file's name and extension are correct. However, I still get the error shown below.
I did see another person ask a similar question here on Stack Overflow ("Failed to load a .bin.gz pre-trained word2vec"), and read the answer, but it did not help me.
Any suggestions on how to fix this?
Input:
import gensim
word2vec_path = "GoogleNews-vectors-negative300.bin.gz"
word2vec = gensim.models.KeyedVectors.load_word2vec_format(word2vec_path, binary=True)
Output:
OSError: Not a gzipped file (b've')
The problem is that the file you've downloaded is not a gzip file. If you check its size, it may be only a few KB (that is what happened to me when I downloaded it from the GitHub link, because it needed git-lfs).
Here is an alternate solution to resolve this issue:
Download the model using the command below in your terminal:
wget -c "https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz"
Then, load the model as you would using gensim (the path matches the .bin.gz file downloaded above, which gensim can read directly):
from gensim import models

w = models.KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin.gz', binary=True)
Hope this helps you!!
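As a quick sanity check (the word choices are arbitrary examples, and w is the KeyedVectors object loaded above), you can confirm that the archive really is gzipped and that the vectors behave sensibly:
import gzip

# Raises OSError('Not a gzipped file ...') if the download is actually an HTML page or a git-lfs pointer
with gzip.open('GoogleNews-vectors-negative300.bin.gz', 'rb') as f:
    f.read(16)

print(w['king'].shape)                 # (300,)
print(w.most_similar('king', topn=3))  # a few nearest neighbours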
Try this:
from gensim import models

word2vec_path = 'https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz'
# if loading straight from the URL fails, download the file first (e.g. with wget) and pass the local path
word2vec = models.KeyedVectors.load_word2vec_format(word2vec_path, binary=True)
I did research on Google and also on the Gensim support forum, but I cannot find a good answer.
Basically, I am implementing online learning for Doc2Vec using Gensim, but Gensim keeps throwing me a random error: "Segmentation fault".
Please take a look at my sample code:
from gensim.models import Doc2Vec
from gensim.models.doc2vec import LabeledSentence
import random
import logging
if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)

    sentence1 = "this is a test"
    sentence2 = "test test 123 test"
    sentence3 = "qqq zzz"
    sentence4 = "ppp"

    sentences = [
        LabeledSentence(sentence1.split(), ["p1"]),
        LabeledSentence(sentence2.split(), ["p2"])
    ]

    model = Doc2Vec(min_count=1, window=5, size=400, sample=1e-4, negative=5, workers=1)
    model.build_vocab(sentences)

    for a in range(2):
        random.shuffle(sentences)
        print([s.tags[0] for s in sentences])
        model.train(sentences)

    model.save("test.d2v")

    new_model = Doc2Vec.load("test.d2v")
    new_sentences = [
        LabeledSentence(sentence1.split(), ["n1"]),
        LabeledSentence(sentence3.split(), ["n2"])
    ]

    new_model.build_vocab(new_sentences, update=True)

    for a in range(4):
        random.shuffle(new_sentences)
        print([s.tags[0] for s in new_sentences])
        new_model.train(new_sentences)
Here is my error
INFO:gensim.models.word2vec:training model with 1 workers on 7 vocabulary and 400 features, using sg=0 hs=0 sample=0.0001 negative=5 window=5
INFO:gensim.models.word2vec:expecting 2 sentences, matching count from corpus used for vocabulary survey
Segmentation fault
Can anyone explain why this happens, and how to solve it?
Thanks
A segmentation fault – that is, an illegal memory access – should be nearly impossible to trigger from your Python code. That suggests this could be a problem specific to your installation/configuration – OS, Python, gensim, support-libraries – even a corrupted file.
Try clearing & reinstalling the Python environment & support libraries (like NumPy and SciPy), and confirming that some of the examples bundled with gensim run without a segmentation fault, for example the notebook in docs/notebooks/doc2vec-lee.ipynb. If you're still getting such faults with either the bundled examples or your own code, turn on debug logging, capture all output, and report the problem with full details on your OS/Python/gensim/etc. versions.
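For the debug-logging step, a minimal snippet (the log format string is just a common convention from the gensim documentation) would be:
import logging

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.DEBUG)
# then re-run the training code that crashes and capture the full output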
I have installed nlpnet (http://nilc.icmc.usp.br/nlpnet/), but I can't locate the metadata_pos.pickle file it needs to run a part-of-speech tagger. This file does not appear to be on my machine and is not included in the current GitHub repository.
Any suggestions?
You need to download the nlpnet data (models for PoS, SRL and dependency parsing). It is available at http://nilc.icmc.usp.br/nlpnet/models.html. The PoS model file metadata_pos.pickle is included in http://nilc.icmc.usp.br/nlpnet/data/pos-pt.tgz.
You need to download the models from this page: http://nilc.icmc.usp.br/nlpnet/models.html (either POS or SRL).
Decompress the file into some folder, say '/Users/Downloads', then point nlpnet at it in your code like this:
import nlpnet
nlpnet.set_data_dir('/Users/Downloads/pos-pt')
# Now you can start using it
tagger = nlpnet.POSTagger()
op = tagger.tag('texto em portugues')
To train the model, you'll need examples with one sentence per line, with each token and its tag joined by an underscore character:
This_DT is_VBZ an_DT example_NN
Using this command with your corpus, you'll generate the data needed to use the POS tagger (including metadata_pos.pickle):
nlpnet-train.py pos --gold /path/to/training-data.txt
If you want to use an already trained model, they have one here. It was trained and evaluated on the Mac-Morpho corpus, a Brazilian Portuguese news corpus, so it probably won't work for other languages.