I have a list of paragraphs and would like to check whether the words in them are valid English words. Sometimes, due to external issues, I might not get valid English words in these paragraphs. I am aware of libraries like pyenchant and nltk, which ship with dictionaries and give some level of accuracy, but both have a few drawbacks. I wonder if there is another library or procedure that can do this with the highest accuracy possible.
It depends greatly on what you mean by valid English words. Is ECG, Thor or Loki a valid English word? If your definition of valid words differs from what standard dictionaries cover, you might need to create your own language model.
Anyway, besides the obvious use of pyEnchant or nltk, I would suggest the fasttext library. It has multiple pre-built word vector models, and you can check your paragraphs for rare or out-of-vocabulary words. Essentially, you want to check whether the embedding of a suspected "non-valid" word is close to only a few other words, or to none.
You can use fasttext directly from python
pip install fasttext
or you can use gensim library (which will provide you some additional algorithms as well, such as Word2Vec which can be useful for your case as well)
pip install --upgrade gensim
Or for conda
conda install -c conda-forge gensim
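Assuming you use gensim with a pre-trained fastText model:
from gensim.models.fasttext import load_facebook_model
from gensim.test.utils import datapath
# datapath() points into gensim's test data directory; for your own model, pass its file path directly
cap_path = datapath("fasttext-model.bin")
fb_model = load_facebook_model(cap_path)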
Now you can perform several tasks to achieve your goal:
1. Check for out-of-vocabulary words
'mybizarreword' in fb_model.wv.key_to_index  # gensim 4.x; on gensim 3.x use fb_model.wv.vocab
2. Check similarity
fb_model.wv.most_similar("man")
For a rare word you will get low scores, and by setting a threshold you can decide which words are not 'valid'.
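The two checks above can be combined into a single filter. The following is a minimal sketch, not a definitive implementation: `flag_suspect_words`, the 0.5 threshold and the toy stand-ins are hypothetical. In real use, `vocab` would be something like `fb_model.wv.key_to_index` and `most_similar` would be `fb_model.wv.most_similar`.

```python
def flag_suspect_words(words, vocab, most_similar, threshold=0.5):
    """Return words that are out of vocabulary or have only weak neighbours.

    vocab        -- any container supporting `in` (assumed: fb_model.wv.key_to_index)
    most_similar -- callable returning [(word, score), ...], best match first
    threshold    -- hypothetical cut-off; tune it on your own data
    """
    suspects = []
    for word in words:
        if word not in vocab:
            suspects.append(word)  # not present in the vocabulary at all
            continue
        neighbours = most_similar(word)
        best_score = neighbours[0][1] if neighbours else 0.0
        if best_score < threshold:
            suspects.append(word)  # in vocabulary, but no close neighbours
    return suspects


# Toy stand-ins (invented values) so the sketch runs without a trained model.
toy_vocab = {"man": 0, "xqzt": 1}
toy_neighbours = {"man": [("woman", 0.77)], "xqzt": [("qzx", 0.21)]}

print(flag_suspect_words(["man", "xqzt", "mybizarreword"], toy_vocab, toy_neighbours.get))
```

The threshold is the part that needs care: a value that works for one pre-trained model may be too strict or too lenient for another.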
Linux and Mac OS X ship with a list of words which you can use directly; otherwise you can download a list of English words.
You can use it as follows:
d = {}
fname = "/usr/share/dict/words"
with open(fname) as f:
    content = f.readlines()
for w in content:
    d[w.strip()] = True
p ="""I have a list of paragraphs, I would like to check if these words are valid English words or not. Sometimes, due to some external issues, i might not get valid English words in these paragraphs. I am aware of libraries like pyenchant and nltk which have a set of dictionaries and provide accuracy of some level but both of these have few drawbacks. I wonder if there exists another library or procedure that can provide me with what I am looking for with at-most accuracy possible."""
lw = []
for w in p.split():
    if len(w) < 4:
        continue
    if d.get(w, False):
        lw.append(w)
print(len(lw))
print(lw)
#43
#['have', 'list', 'would', 'like', 'check', 'these', 'words', 'valid', 'English', 'words', 'some', 'external', 'might', 'valid', 'English', 'words', 'these', 'aware', 'libraries', 'like', 'which', 'have', 'dictionaries', 'provide', 'accuracy', 'some', 'level', 'both', 'these', 'have', 'wonder', 'there', 'exists', 'another', 'library', 'procedure', 'that', 'provide', 'with', 'what', 'looking', 'with', 'accuracy']
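The lookup above compares raw whitespace-separated tokens, so a word with attached punctuation such as "paragraphs," is not found. A variant that normalises tokens first might look like the sketch below. `load_words` and `unknown_words` are hypothetical helper names, and the regular expression is just one simple assumption about what counts as a word.

```python
import re


def load_words(path="/usr/share/dict/words"):
    """Read a one-word-per-line file into a set for fast membership tests."""
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}


def unknown_words(text, known):
    """Return tokens that are in `known` neither as written nor lowercased."""
    # Letters, optionally joined by an apostrophe or hyphen (e.g. "at-most").
    tokens = re.findall(r"[A-Za-z]+(?:['-][A-Za-z]+)*", text)
    return [t for t in tokens if t not in known and t.lower() not in known]


# Tiny in-memory word set so the example runs anywhere;
# in practice you would use: known = load_words()
known = {"I", "have", "a", "list", "of", "paragraphs", "English"}
print(unknown_words("I have a list of paragraphs, maybe in Englsh.", known))
```

This version reports the words that fail the lookup rather than the ones that pass, which is usually what you want when hunting for corrupted text.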
Related
How do I add a custom tokenization rule to spacy for the case of wanting a number and a symbol or word to be tokenized together. E.g. the following sentence:
"I 100% like apples. I like 500g of apples"
is tokenized as follows:
['I', '100', '%', 'like', 'apples', '.', 'I', 'like', '500', 'g', 'of', 'apples']
It would be preferable if it was tokenized like this:
['I', '100%', 'like', 'apples', '.', 'I', 'like', '500g', 'of', 'apples']
The following code was used to generate this:
import spacy
nlp = spacy.load("en_core_web_sm")
text = "I 100% like apples. I like 500g of apples"
print([token.text for token in nlp(text)])
So normally you can modify the tokenizer by adding special rules or something, but in this particular case it's trickier than that. spaCy actually has a lot of code to make sure that suffixes like those in your example become separate tokens. So what you have to do is remove the relevant rules.
In this example code I just look for the set of rules that contain '%' and remove it; it just so happens that rule also contains unit suffixes like "g". So this does what you want:
import spacy
nlp = spacy.blank("en")
text = "I 100% like apples. I like 500g of apples"
# remove the entry with units and %
suffixes = [ss for ss in nlp.Defaults.suffixes if '%' not in ss]
suffix_regex = spacy.util.compile_suffix_regex(suffixes)
nlp.tokenizer.suffix_search = suffix_regex.search
print(list(nlp(text)))
You can see the list of rule definitions here.
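To see why dropping a single rule is enough, here is a toy model of suffix splitting written with the standard `re` module. It is not spaCy's implementation: `RULES`, `compile_suffix_search` and `tokenize` are invented for illustration, and the real suffix list is much longer.

```python
import re


def compile_suffix_search(patterns):
    """Build a search function that matches any pattern at the end of a string."""
    regex = re.compile("|".join(f"(?:{p})$" for p in patterns))
    return regex.search


def tokenize(text, suffix_search):
    """Split on whitespace, then peel matching suffixes off each chunk."""
    tokens = []
    for chunk in text.split():
        trailing = []
        while len(chunk) > 1:
            match = suffix_search(chunk)
            if match is None or match.start() == 0:
                break
            trailing.insert(0, chunk[match.start():])
            chunk = chunk[:match.start()]
        tokens.append(chunk)
        tokens.extend(trailing)
    return tokens


# Toy rules: a sentence-final period, plus one rule that covers both "%"
# and unit suffixes after a digit (mirroring the combined rule described above).
RULES = [r"\.", r"(?<=[0-9])(?:%|g|kg)"]
text = "I 100% like apples. I like 500g of apples"

print(tokenize(text, compile_suffix_search(RULES)))

# Drop every rule that mentions "%"; the unit suffixes go with it.
kept = [rule for rule in RULES if "%" not in rule]
print(tokenize(text, compile_suffix_search(kept)))
```

Because "%" and the units live in the same rule in this toy setup, filtering on "%" alone is enough to keep both "100%" and "500g" intact, while the period is still split off.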
I understand you mean to give a simple example but there are a couple of things here that are of concern.
Typically, stopwords and punctuation are removed first as, particularly with topic modeling, they take up quite a bit of processing power but add very little.
If you read through the documentation, you'll see that part-of-speech analysis is a fairly central feature. While you may not intend to use it, you should understand that you're going against the grain here: you're looking to conjoin things (e.g. a QUANTMOD, or quantifier phrase modifier, with the NUM, or number, it modifies) rather than tease separate concepts out of a term (spaCy's example is 'Gimme' --> 'gim' (or give) and 'me').
But if you're really bent on going down this path, SpaCy documentation will get you there.
Say I have a corpus of annotated text where a sentence looks something like:
txt = 'red foxes <emotion>scare</emotion> me.'
is it possible to tokenize this using word_tokenize in such a way that we get:
['red', 'foxes', '<emotion>scare</emotion>', 'me', '.']
We could use an alternative annotation scheme say:
txt = 'red foxes scare_EMOTION me'
Is it possible to do this with NLTK -- currently I'm parsing out the annotations and then tracking them out of band and it is very cumbersome.
To achieve the desired result you don't need nltk.
Just run txt.split()
If you insist on using nltk, check out the different tokenizers.
PunktWordTokenizer and WhitespaceTokenizer fit.
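Note that plain `split()` leaves the final period attached ('me.'). If the period has to be a separate token, one alternative is a regular expression that treats a whole annotated span as a single token. This sketch uses the standard `re` module rather than NLTK; the pattern and the name `tokenize_annotated` are assumptions, and it expects well-formed, non-nested tags.

```python
import re

# One annotated span, or a run of word characters, or a single punctuation mark.
TOKEN_RE = re.compile(r"<(\w+)>.*?</\1>|\w+|[^\w\s]")


def tokenize_annotated(text):
    """Split text into tokens, keeping <tag>...</tag> spans in one piece."""
    return [m.group(0) for m in TOKEN_RE.finditer(text)]


print(tokenize_annotated("red foxes <emotion>scare</emotion> me."))
print(tokenize_annotated("red foxes scare_EMOTION me"))
```

The underscore scheme needs no special handling here because `\w` already matches underscores, so `scare_EMOTION` stays one token.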
I am trying to lemmatize a text using spaCy 2.0.12 with the French model fr_core_news_sm. Moreover, I want to replace people's names with an arbitrary sequence of characters, detecting such names using token.ent_type_ == 'PER'. An example outcome would be "Pierre aime les chiens" -> "~PER~ aimer chien".
The problem is I can't find a way to do both. I only have these two partial options:
I can feed the pipeline with the original text: doc = nlp(text). Then, the NER will recognize most people names but the lemmas of words starting with a capital won't be correct. For example, the lemmas of the simple question "Pouvons-nous faire ça?" would be ['Pouvons', '-', 'se', 'faire', 'ça', '?'], where "Pouvons" is still an inflected form.
I can feed the pipeline with the lower case text: doc = nlp(text.lower()). Then my previous example would correctly display ['pouvoir', '-', 'se', 'faire', 'ça', '?'], but most people names wouldn't be recognized as entities by the NER, as I guess a starting capital is a useful indicator for finding entities.
My idea would be to perform the standard pipeline (tagger, parser, NER), then lowercase, and then lemmatize only at the end.
However, lemmatization doesn't seem to have its own pipeline component, and the documentation doesn't explain how and where it is performed. This answer seems to imply that lemmatization is performed independently of any pipeline component and possibly at different stages of it.
So my question is: how to choose when to perform the lemmatization and which input to give to it?
If you can, use the most recent version of spaCy instead. The French lemmatizer has been improved a lot in 2.1.
If you have to use 2.0, consider using an alternate lemmatizer like this one: https://spacy.io/universe/project/spacy-lefff
Given a trained word2vec model, is there a way to check which word in its vocabulary is the most "related" to a whole sentence?
I was looking for something similar to
model.wv.most_similar("the dog is on the table")
which could result in ["dog","table"]
The most_similar() method can take multiple words as input, ideally as the named parameter positive. (That's as in, "positive examples", to be contrasted with "negative examples" which can also be provided via the negative parameter, and are useful when asking most_similar() to solve analogy-problems.)
When it receives multiple words, it returns results that are closest to the average of all words provided. That might be somewhat related to a whole sentence, but such an average-of-all-word-vectors is a fairly weak way of summarizing a sentence.
The multiple words should be provided as a list of strings, not a raw string of space-delimited words. So, for example:
sims = model.wv.most_similar(positive=['the', 'dog', 'is', 'on', 'the', 'table'])
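To make the "closest to the average" idea concrete, here is a toy re-implementation in plain Python. It is a sketch of the idea, not gensim's code: `closest_to_mean` is a hypothetical helper and the two-dimensional vectors are invented. Like `most_similar()`, it leaves the input words out of the results.

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return tuple(x / norm for x in v)


def closest_to_mean(positive, vectors, topn=2):
    """Rank vocabulary words by cosine similarity to the mean of `positive`."""
    units = [unit(vectors[w]) for w in positive]
    dim = len(units[0])
    mean = unit(tuple(sum(u[i] for u in units) / len(units) for i in range(dim)))
    scores = []
    for word, vec in vectors.items():
        if word in positive:
            continue  # input words are not returned
        cosine = sum(a * b for a, b in zip(mean, unit(vec)))
        scores.append((word, cosine))
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:topn]


# Invented toy embeddings, for illustration only.
toy_vectors = {
    "dog": (0.0, 1.0),
    "table": (1.0, 0.0),
    "cat": (0.2, 0.9),
    "chair": (0.9, 0.1),
    "sky": (-1.0, -1.0),
}

print(closest_to_mean(["dog", "table"], toy_vectors))
```

Since the input words are excluded, this style of query cannot return "dog" or "table" themselves; it returns other vocabulary words near the sentence average.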
I use the gensim library to create a word2vec model. It contains the function predict_output_word(), which I understand as follows:
For example, I have a model that is trained with the sentence: "Anarchism does not offer a fixed body of doctrine from a single particular world view instead fluxing and flowing as a philosophy."
and then I use
model.predict_output_word(context_words_list=['Anarchism', 'does', 'not', 'offer', 'a', 'fixed', 'body', 'of', 'from', 'a', 'single', 'particular', 'world', 'view', 'instead', 'fluxing'], topn=10)
In this situation, could I get/predict the correct word or the omitted word 'doctrine'?
Is this the right way? Please explain this function in detail.
I am wondering if you have seen the documentation of predict_output_word?
Report the probability distribution of the center word given the
context words as input to the trained model.
To answer your specific question about the word 'doctrine': it depends on whether 'doctrine' is among the 10 most probable words for the context words you listed. For that, it must occur relatively frequently in the corpus you use for training the model. Also, since 'doctrine' is not a very common word, there is a high chance that other words will have a higher probability of appearing in that context. Therefore, if you rely only on the returned probabilities of the words given the context, you may fail to predict 'doctrine' in this case.
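One way to check this in practice is to look up where the target word lands in the returned list. The helper below is a small sketch: `rank_of` is a hypothetical name, and the probabilities are invented stand-ins for the (word, probability) pairs that `predict_output_word` returns. Treating `None` as "no prediction" is an assumption made to stay on the safe side.

```python
def rank_of(target, predictions):
    """Return the 1-based rank and probability of `target`, or None if absent.

    predictions -- [(word, probability), ...], most probable first;
                   None is treated as "no prediction available".
    """
    for rank, (word, prob) in enumerate(predictions or [], start=1):
        if word == target:
            return rank, prob
    return None


# Invented values standing in for:
# model.predict_output_word(context_words, topn=10)
fake_predictions = [("a", 0.12), ("of", 0.09), ("doctrine", 0.04)]

print(rank_of("doctrine", fake_predictions))
print(rank_of("philosophy", fake_predictions))
```

Raising `topn` when calling the model widens the list, which helps tell "ranked low" apart from "not predicted at all".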