I need to extract, from a list of sentences (strings), all sentences that contain two specific entities and store them in a new list. The code I tried looks like this, but unfortunately it doesn't work. I'm using Python and spaCy.
sents_required = []
for s in sentences:
    if token.ent_type_=='SPECIES' in s and token.ent_type_=='KEYWORD' in s:
        sents_required.append(s)
I am grateful for any help.
The way you're declaring the condition is SQL-like, but that doesn't work in Python: you need to iterate over the tokens and access the data yourself. There are many ways to do this, but here's one.
for s in sentences:
    etypes = [tok.ent_type_ for tok in s]
    if "SPECIES" in etypes and "KEYWORD" in etypes:
        sents_required.append(s)
This code works for me. Thanks for the help!
sents_required = []
for s in sentences:
    token_types = [token.ent_type_ for token in s]
    if ('SPECIES' in token_types) and ('KEYWORD' in token_types):
        sents_required.append(s)
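If you prefer it compact, the same check can be written as a list comprehension (a sketch, assuming sentences holds spaCy Span or Doc objects):
sents_required = [
    s for s in sentences
    # keep s only if its token entity types include both labels
    if {'SPECIES', 'KEYWORD'} <= {tok.ent_type_ for tok in s}
]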
I have a list of words and I'm trying to turn plural words into singular ones in Python, then remove the duplicates. This is how I do it:
import spacy

nlp = spacy.load('fr_core_news_md')
words = ['animaux', 'poule', 'adresse', 'animal', 'janvier', 'poules']
clean_words = []
for word in words:
    doc = nlp(word)
    for token in doc:
        clean_words.append(token.lemma_)
clean_words = list(set(clean_words))
This is the output:
['animal', 'janvier', 'poule', 'adresse']
It works well, but my problem is that 'fr_core_news_md' takes a little too long to load, so I was wondering if there is another way to do this?
The task you're trying to do is called lemmatization, and it does more than just convert plural to singular: it removes inflections and returns the canonical form of a word (the infinitive of a verb, for example).
If you want to use spaCy, you can make it load more quickly by using the disable parameter.
For example: spacy.load('fr_core_news_md', disable=['parser', 'textcat', 'ner', 'tagger']).
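A minimal sketch of the question's loop with those components disabled (one assumption to verify: in some model versions the lemmatizer relies on the tagger, so check that the lemmas are still good):
import spacy

# Load with heavy pipeline components disabled to speed up start-up.
# Caveat (assumption): disabling 'tagger' may degrade lemmas in models
# whose lemmatizer depends on POS tags.
nlp = spacy.load('fr_core_news_md', disable=['parser', 'textcat', 'ner', 'tagger'])

words = ['animaux', 'poule', 'adresse', 'animal', 'janvier', 'poules']
clean_words = list({token.lemma_ for word in words for token in nlp(word)})
print(clean_words)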
Alternatively, you can use TreeTagger, which is somewhat hard to install but works well.
Or the FrenchLefffLemmatizer.
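For reference, a sketch of FrenchLefffLemmatizer usage based on its README (the import path and default behaviour are assumptions to verify against the version you install):
from french_lefff_lemmatizer.french_lefff_lemmatizer import FrenchLefffLemmatizer

lemmatizer = FrenchLefffLemmatizer()
print(lemmatizer.lemmatize('animaux'))  # expected: 'animal'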
Very new to the PyDictionary library, and have had some trouble finding proper documentation for it. So, I've come here to ask:
A) Does anybody know how to check if a word (in English) exists using PyDictionary?
B) Does anybody know of some more full documentation for PyDictionary?
If you read the code here, in theory there is this:
meaning(term, disable_errors=False)
so you should be able to pass True to avoid printing the error in case the word is not in the dictionary. I tried, but I guess the version I installed via pip does not contain that code...
To further expound on what @Daniele stated, you can just pass True as the second argument to the meaning function. If the word is not found in the dictionary, the function returns None.
from PyDictionary import PyDictionary

def check_if_word_in_dictionary(word):
    dictionary = PyDictionary()
    if dictionary.meaning(word, True) is None:
        print(f"It appears '{word}' is NOT a word found in the dictionary.")
    else:
        print(f"You're in luck, '{word}' IS found in the dictionary!")

check_if_word_in_dictionary("fingle")
Output: It appears 'fingle' is NOT a word found in the dictionary.
check_if_word_in_dictionary("finger")
Output: You're in luck, 'finger' IS found in the dictionary!
I have trained a word2vec model on a corpus of documents with Gensim. Once the model is trained, I write the following piece of code to get the raw feature vector of a word, say "view":
myModel["view"]
However, I get a KeyError for the word, which is probably because it doesn't exist as a key in the list of keys indexed by word2vec. How can I check whether a key exists in the index before trying to get the raw feature vector?
Word2Vec also provides a 'vocab' member, which you can access directly.
Using a Pythonic approach:
if word in w2v_model.vocab:
    # Do something
EDIT: Since gensim release 2.0, the API for Word2Vec changed. To access the vocabulary, you should now use this:
if word in w2v_model.wv.vocab:
    # Do something
EDIT 2: The attribute 'wv' is being deprecated and will be completely removed in gensim 4.0.0. Now it's back to the original answer by the OP:
if word in w2v_model.vocab:
    # Do something
Convert the model into word vectors with
word_vectors = model.wv
then we can use
if 'word' in word_vectors.vocab:
    # Do something
The vocab attribute was removed from KeyedVectors in Gensim 4.0.0. Try using this:
if 'word' in model.wv.key_to_index:
    # do something
https://github.com/RaRe-Technologies/gensim/wiki/Migrating-from-Gensim-3.x-to-4#4-vocab-dict-became-key_to_index-for-looking-up-a-keys-integer-index-or-get_vecattr-and-set_vecattr-for-other-per-key-attributes
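Putting the gensim 4 pieces together, a short sketch (assuming model is a trained Word2Vec instance):
word = "view"
if word in model.wv.key_to_index:
    vector = model.wv[word]  # raw feature vector for the word
else:
    print(f"'{word}' is not in the vocabulary")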
Answering my own question here.
Word2Vec provides a __contains__ method, so 'view' in model returns True or False based on whether the corresponding word has been indexed or not.
I generally use a filter:
for doc in labeled_corpus:
    words = filter(lambda x: x in model.vocab, doc.words)
This is one simple method for getting past the KeyError on unseen words.
Hey, I know I'm getting to this post late, but here is a piece of code that can handle this issue well. I use it myself in my code and it works like a charm :)
size = 300    # word vector size
word = 'food' # word token

try:
    wordVector = model[word].reshape((1, size))
except KeyError:
    print("not found!", word)
NOTE:
I am using the Python Gensim library for word2vec models.
As @quemeful has mentioned, you could do something like:
if "view" in model.wv.key_to_index.keys():
# do something
To check whether a word exists in your model, you can build a lookup dict:
word2vec_pretrained_dict = dict(zip(w2v_model.key_to_index.keys(), w2v_model.vectors))
where w2v_model.key_to_index gives you a dictionary mapping each word to its index,
and w2v_model.vectors returns the vector for each word.
I'm fairly new to Python and NLTK. I am working on an application that can perform spell checks (replacing an incorrectly spelled word with the correct one).
I'm currently using the Enchant library on Python 2.7, PyEnchant and the NLTK library. The code below is a class that handles the correction/replacement.
import enchant
from nltk.metrics import edit_distance

class SpellingReplacer:
    def __init__(self, dict_name='en_GB', max_dist=2):
        self.spell_dict = enchant.Dict(dict_name)
        self.max_dist = max_dist

    def replace(self, word):
        if self.spell_dict.check(word):
            return word
        suggestions = self.spell_dict.suggest(word)
        if suggestions and edit_distance(word, suggestions[0]) <= self.max_dist:
            return suggestions[0]
        else:
            return word
I have written a function that takes in a list of words and executes replace() on each word and then returns a list of those words, but spelled correctly.
def spell_check(word_list):
    checked_list = []
    for item in word_list:
        replacer = SpellingReplacer()
        r = replacer.replace(item)
        checked_list.append(r)
    return checked_list
>>> word_list = ['car', 'colour']
>>> spell_check(words)
['car', 'color']
Now, I don't really like this because it isn't very accurate, and I'm looking for a better way to perform spelling checks and replacements on words. I also need something that can pick up spelling mistakes like "caaaar". Are there better ways to perform spelling checks out there? If so, what are they? How does Google do it? Their spelling suggester is very good.
Any suggestions?
You can use the autocorrect library to spell-check in Python.
Example Usage:
from autocorrect import Speller
spell = Speller(lang='en')
print(spell('caaaar'))
print(spell('mussage'))
print(spell('survice'))
print(spell('hte'))
Result:
caesar
message
service
the
I'd recommend starting by carefully reading this post by Peter Norvig. (I had to do something similar, and I found it extremely useful.)
The following function, in particular, contains the ideas you now need to make your spell checker more sophisticated: splitting, deleting, transposing, and inserting characters in the irregular words to 'correct' them.
alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits for c in alphabet if b]
    inserts = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)
Note: The above is one snippet from Norvig's spelling corrector
And the good news is that you can incrementally add to and keep improving your spell-checker.
Hope that helps.
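A quick sanity check of edits1 (the example words here are purely illustrative):
candidates = edits1('speling')
print('spelling' in candidates)  # True: 'spelling' is one insertion away
print(len(candidates))           # a few hundred candidates at edit distance 1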
The best ways to do spell checking in Python are SymSpell, BK-Tree, or Peter Norvig's method.
The fastest one is SymSpell.
Method 1: pyspellchecker (reference link: pyspellchecker)
This library is based on Peter Norvig's implementation.
pip install pyspellchecker
from spellchecker import SpellChecker

spell = SpellChecker()

# find those words that may be misspelled
misspelled = spell.unknown(['something', 'is', 'hapenning', 'here'])

for word in misspelled:
    # Get the one `most likely` answer
    print(spell.correction(word))
    # Get a list of `likely` options
    print(spell.candidates(word))
Method 2: SymSpell Python
pip install -U symspellpy
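A minimal symspellpy sketch based on its documentation; the bundled dictionary filename is an assumption that may vary across versions:
import pkg_resources
from symspellpy import SymSpell, Verbosity

sym_spell = SymSpell(max_dictionary_edit_distance=2, prefix_length=7)
# Frequency dictionary shipped with symspellpy (one "term count" pair per line).
dictionary_path = pkg_resources.resource_filename(
    "symspellpy", "frequency_dictionary_en_82_765.txt")
sym_spell.load_dictionary(dictionary_path, term_index=0, count_index=1)

for suggestion in sym_spell.lookup("hapenning", Verbosity.CLOSEST, max_edit_distance=2):
    print(suggestion.term, suggestion.distance, suggestion.count)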
Maybe it is too late, but I am answering for future searches.
To perform spelling-mistake correction, you first need to make sure the word is not absurd or slang with repeated letters, like 'caaaar' or 'amazzzing'. So we first need to get rid of those extra letters. As we know, English words usually have at most two consecutive repeated letters (e.g., 'hello'), so we remove the extra repetitions first and then check the spelling.
For removing the extra letters, you can use Python's regular expression module (see the sketch after this answer).
Once this is done, use the pyspellchecker library to correct the spellings.
For a full implementation, visit this link: https://rustyonrampage.github.io/text-mining/2017/11/28/spelling-correction-with-python-and-nltk.html
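A minimal sketch of that two-step approach; squeeze_repeats is an illustrative helper, not part of any library:
import re
from spellchecker import SpellChecker  # pyspellchecker

def squeeze_repeats(word):
    # Collapse runs of 3+ identical letters down to 2, since English words
    # rarely repeat a letter more than twice in a row.
    return re.sub(r'(.)\1{2,}', r'\1\1', word)

spell = SpellChecker()
word = squeeze_repeats('caaaar')  # -> 'caar'
print(spell.correction(word))     # likely 'car'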
Try jamspell - it works pretty well for automatic spelling correction:
import jamspell
corrector = jamspell.TSpellCorrector()
corrector.LoadLangModel('en.bin')
corrector.FixFragment('Some sentnec with error')
# u'Some sentence with error'
corrector.GetCandidates(['Some', 'sentnec', 'with', 'error'], 1)
# ('sentence', 'senate', 'scented', 'sentinel')
In the terminal:
pip install gingerit
In your code:
from gingerit.gingerit import GingerIt

text = input("Enter text to be corrected: ")
result = GingerIt().parse(text)
corrections = result['corrections']
correctText = result['result']

print("Correct Text:", correctText)
print()
print("CORRECTIONS")
for d in corrections:
    print("________________")
    print("Previous:", d['text'])
    print("Correction:", d['correct'])
    print("Definition:", d['definition'])
You can also try:
pip install textblob
from textblob import TextBlob

txt = "machne learnig"
b = TextBlob(txt)
print("after spell correction: " + str(b.correct()))
after spell correction: machine learning
spell corrector->
you need to import a corpus on to your desktop if you store elsewhere change the path in the code i have added a few graphics as well using tkinter and this is only to tackle non word errors!!
def min_edit_dist(word1, word2):
    len_1 = len(word1)
    len_2 = len(word2)
    # the matrix whose last element is the edit distance
    x = [[0] * (len_2 + 1) for _ in range(len_1 + 1)]
    # initialization of base case values
    for i in range(0, len_1 + 1):
        x[i][0] = i
    for j in range(0, len_2 + 1):
        x[0][j] = j
    for i in range(1, len_1 + 1):
        for j in range(1, len_2 + 1):
            if word1[i - 1] == word2[j - 1]:
                x[i][j] = x[i - 1][j - 1]
            else:
                x[i][j] = min(x[i][j - 1], x[i - 1][j], x[i - 1][j - 1]) + 1
    return x[len_1][len_2]
from Tkinter import *

def retrieve_text():
    global word1
    word1 = app_entry.get()
    path = r"C:\Documents and Settings\Owner\Desktop\Dictionary.txt"
    ffile = open(path, 'r')
    lines = ffile.readlines()
    distance_list = []
    print "Suggestions coming right up, count till 10"
    for line in lines:
        # strip the trailing newline before computing the distance
        dist = min_edit_dist(word1, line.strip())
        distance_list.append(dist)
    for j in range(len(lines)):
        if distance_list[j] <= 2:
            print lines[j]
    print " "
    ffile.close()
if __name__ == "__main__":
    app_win = Tk()
    app_win.title("spell")
    app_label = Label(app_win, text="Enter the incorrect word")
    app_label.pack()
    app_entry = Entry(app_win)
    app_entry.pack()
    app_button = Button(app_win, text="Get Suggestions", command=retrieve_text)
    app_button.pack()
    # Initialize GUI loop
    app_win.mainloop()
pyspellchecker is one of the best solutions for this problem. The pyspellchecker library is based on Peter Norvig's blog post.
It uses a Levenshtein Distance algorithm to find permutations within an edit distance of 2 from the original word.
There are two ways to install this library. The official documentation highly recommends using the pipenv package.
Install using pip:
pip install pyspellchecker
Install from source:
git clone https://github.com/barrust/pyspellchecker.git
cd pyspellchecker
python setup.py install
The following code is the example provided in the documentation:
from spellchecker import SpellChecker

spell = SpellChecker()

# find those words that may be misspelled
misspelled = spell.unknown(['something', 'is', 'hapenning', 'here'])

for word in misspelled:
    # Get the one `most likely` answer
    print(spell.correction(word))
    # Get a list of `likely` options
    print(spell.candidates(word))
For this you need to install the autocorrect package (Anaconda is preferred). Note that it only works for words, not sentences, so that's a limitation you are going to face.
from autocorrect import spell
print(spell('intrerpreter'))
# output: interpreter
pip install scuse
from scuse import scuse
obj = scuse()
checkedspell = obj.wordf("spelling you want to check")
print(checkedspell)
Spark NLP is another option that I used, and it works excellently. A simple tutorial can be found here: https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/annotation/english/spell-check-ml-pipeline/Pretrained-SpellCheckML-Pipeline.ipynb