How do I add a custom tokenization rule to spaCy so that a number and the symbol or word following it are tokenized together? E.g. the following sentence:
"I 100% like apples. I like 500g of apples"
is tokenized as follows:
['I', '100', '%', 'like', 'apples', '.', 'I', 'like', '500', 'g', 'of', 'apples']
It would be preferable if it was tokenized like this:
['I', '100%', 'like', 'apples', '.', 'I', 'like', '500g', 'of', 'apples']
The following code was used to generate this:
import spacy
nlp = spacy.load("en_core_web_sm")
text = "I 100% like apples. I like 500g of apples"
print([token.text for token in nlp(text)])
Normally you can modify the tokenizer by adding special-case rules, but in this particular case it's trickier than that: spaCy actually has a lot of code devoted to making sure that suffixes like those in your example become separate tokens. So what you have to do is remove the relevant rules.
In this example code I just look for the suffix rule that contains '%' and remove it; it just so happens that the same rule also contains unit suffixes like "g". So this does what you want:
import spacy
nlp = spacy.blank("en")
text = "I 100% like apples. I like 500g of apples"
# remove the entry with units and %
suffixes = [ss for ss in nlp.Defaults.suffixes if '%' not in ss]
suffix_regex = spacy.util.compile_suffix_regex(suffixes)
nlp.tokenizer.suffix_search = suffix_regex.search
print(list(nlp(text)))
You can see the list of rule definitions here.
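If you want to keep the pretrained pipeline from your question rather than a blank one, the same suffix tweak should work there too; a minimal sketch, assuming en_core_web_sm is installed:
import spacy

nlp = spacy.load("en_core_web_sm")
text = "I 100% like apples. I like 500g of apples"

# drop the suffix rule that splits '%' and unit suffixes like 'g' off numbers
suffixes = [ss for ss in nlp.Defaults.suffixes if '%' not in ss]
nlp.tokenizer.suffix_search = spacy.util.compile_suffix_regex(suffixes).search

print([token.text for token in nlp(text)])
# expected: ['I', '100%', 'like', 'apples', '.', 'I', 'like', '500g', 'of', 'apples']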
I understand you mean to give a simple example but there are a couple of things here that are of concern.
Typically, stopwords and punctuation are removed first as, particularly with topic modeling, they take up quite a bit of processing power but add very little.
If you read through the documentation, you'll see that part-of-speech analysis is a fairly central feature. While you may not be intending to use that, you should understand that you're going against the grain here: you're looking to conjoin things (e.g. a QUANTMOD, or quantifier phrase modifier, with the NUM, or number, it modifies) rather than tease concepts out of terms (spaCy's example is 'Gimme' --> 'gim' (i.e. give) and 'me').
But if you're really bent on going down this path, the spaCy documentation will get you there.
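For reference, a minimal sketch of that kind of pre-filtering using spaCy's built-in token flags (is_stop and is_punct are standard token attributes; the rest is illustrative):
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I 100% like apples. I like 500g of apples")

# keep only tokens that are neither stop words nor punctuation
content_tokens = [t.text for t in doc if not t.is_stop and not t.is_punct]
print(content_tokens)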
Say I have a corpus of annotated text where a sentence looks something like:
txt = 'red foxes <emotion>scare</emotion> me.'
is it possible to tokenize this using word_tokenize in such as way that we get:
['red', 'foxes', '<emotion>scare</emotion>', 'me', '.']
We could use an alternative annotation scheme say:
txt = 'red foxes scare_EMOTION me'
Is it possible to do this with NLTK? Currently I'm parsing out the annotations and then tracking them out of band, which is very cumbersome.
To achieve the desired result you don't need nltk.
Just run txt.split()
If you insist on using nltk, check out the different tokenizers.
Both PunktWordTokenizer and WhitespaceTokenizer fit the bill.
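For completeness, a quick sketch with WhitespaceTokenizer (a standard NLTK class); note that pure whitespace splitting leaves the final period attached to 'me.':
from nltk.tokenize import WhitespaceTokenizer

txt = 'red foxes <emotion>scare</emotion> me.'
print(WhitespaceTokenizer().tokenize(txt))
# ['red', 'foxes', '<emotion>scare</emotion>', 'me.']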
I have a list of paragraphs and I would like to check whether the words in them are valid English words or not. Sometimes, due to some external issues, I might not get valid English words in these paragraphs. I am aware of libraries like pyenchant and nltk which have a set of dictionaries and provide some level of accuracy, but both of these have a few drawbacks. I wonder if there exists another library or procedure that can give me what I am looking for with the best possible accuracy.
It depends greatly on what you mean by valid English words. Are ECG, Thor or Loki valid English words? If your definition of valid words is different, you might need to create your own language model.
Anyway, besides the obvious use of pyEnchant or nltk, I would suggest the fasttext library. It has multiple pre-built word vector models, and you can check your paragraph for rare or out-of-vocabulary words. What you essentially want to check is whether the word embedding for a specific "non-valid" word is close to only a small number of (or zero) other words.
You can use fasttext directly from Python:
pip install fasttext
or you can use the gensim library (which also provides some additional algorithms, such as Word2Vec, that can be useful for your case):
pip install --upgrade gensim
Or for conda
conda install -c conda-forge gensim
Assuming you use gensim with a pre-trained fasttext model:
from gensim.models.fasttext import load_facebook_model
from gensim.test.utils import datapath

cap_path = datapath("fasttext-model.bin")
fb_model = load_facebook_model(cap_path)
Now you can perform several tasks to achieve your goal:
1. Check out-of-vocabulary
'mybizarreword' in fb_model.wv.vocab
2. Check similarity
fb_model.wv.most_similar("man")
For rare words you will get low scores, and by setting a threshold you can decide which words are not 'valid'.
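A rough sketch tying the two checks together over a whole paragraph, assuming the gensim 3.x API used above (in gensim 4 the vocabulary lives in model.wv.key_to_index instead of model.wv.vocab); the function name and threshold are illustrative:
def suspicious_words(paragraph, model, min_similarity=0.5):
    # flag words that are out of vocabulary or have no reasonably similar neighbours
    flagged = []
    for word in paragraph.split():
        word = word.strip('.,!?').lower()
        if word not in model.wv.vocab:  # out-of-vocabulary check
            flagged.append(word)
            continue
        neighbour, score = model.wv.most_similar(word, topn=1)[0]
        if score < min_similarity:  # rare/odd words tend to have weak neighbours
            flagged.append(word)
    return flagged

print(suspicious_words("I like mybizarreword and apples", fb_model))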
Linux and Mac OS X have a list of words which you can use directly, otherwise you can download a list of English words.
You can use it as follows:
d = {}
fname = "/usr/share/dict/words"
with open(fname) as f:
    content = f.readlines()
for w in content:
    d[w.strip()] = True

p = """I have a list of paragraphs, I would like to check if these words are valid English words or not. Sometimes, due to some external issues, i might not get valid English words in these paragraphs. I am aware of libraries like pyenchant and nltk which have a set of dictionaries and provide accuracy of some level but both of these have few drawbacks. I wonder if there exists another library or procedure that can provide me with what I am looking for with at-most accuracy possible."""

lw = []
for w in p.split():
    if len(w) < 4:
        continue
    if d.get(w, False):
        lw.append(w)

print(len(lw))
print(lw)
#43
#['have', 'list', 'would', 'like', 'check', 'these', 'words', 'valid', 'English', 'words', 'some', 'external', 'might', 'valid', 'English', 'words', 'these', 'aware', 'libraries', 'like', 'which', 'have', 'dictionaries', 'provide', 'accuracy', 'some', 'level', 'both', 'these', 'have', 'wonder', 'there', 'exists', 'another', 'library', 'procedure', 'that', 'provide', 'with', 'what', 'looking', 'with', 'accuracy']
I am trying to lemmatize a text using spaCy 2.0.12 with the French model fr_core_news_sm. Moreover, I want to replace people's names with an arbitrary sequence of characters, detecting such names using token.ent_type_ == 'PER'. An example outcome would be "Pierre aime les chiens" -> "~PER~ aimer chien".
The problem is I can't find a way to do both. I only have these two partial options:
I can feed the pipeline with the original text: doc = nlp(text). Then, the NER will recognize most people names but the lemmas of words starting with a capital won't be correct. For example, the lemmas of the simple question "Pouvons-nous faire ça?" would be ['Pouvons', '-', 'se', 'faire', 'ça', '?'], where "Pouvons" is still an inflected form.
I can feed the pipeline with the lower case text: doc = nlp(text.lower()). Then my previous example would correctly display ['pouvoir', '-', 'se', 'faire', 'ça', '?'], but most people names wouldn't be recognized as entities by the NER, as I guess a starting capital is a useful indicator for finding entities.
My idea would be to perform the standard pipeline (tagger, parser, NER), then lowercase, and then lemmatize only at the end.
However, lemmatization doesn't seem to have its own pipeline component and the documentation doesn't explain how and where it is performed. This answer seems to imply that lemmatization is performed independently of any pipeline component, and possibly at different stages of it.
So my question is: how to choose when to perform the lemmatization and which input to give to it?
If you can, use the most recent version of spacy instead. The French lemmatizer has been improved a lot in 2.1.
If you have to use 2.0, consider using an alternate lemmatizer like this one: https://spacy.io/universe/project/spacy-lefff
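A minimal sketch of plugging spacy-lefff into a spaCy 2.x pipeline, based on that project's README (LefffLemmatizer and the token._.lefff_lemma extension are that package's API, so double-check against its docs):
import spacy
from spacy_lefff import LefffLemmatizer

nlp = spacy.load('fr_core_news_sm')
french_lemmatizer = LefffLemmatizer()
nlp.add_pipe(french_lemmatizer, name='lefff')  # spaCy 2.x style add_pipe

doc = nlp("Pouvons-nous faire ça ?")
print([(t.text, t.lemma_, t._.lefff_lemma) for t in doc])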
I use the gensim library to create a word2vec model. It contains the function predict_output_word(), which I understand as follows:
For example, I have a model that is trained with the sentence: "Anarchism does not offer a fixed body of doctrine from a single particular world view instead fluxing and flowing as a philosophy."
and then I use
model.predict_output_word(context_words_list=['Anarchism', 'does', 'not', 'offer', 'a', 'fixed', 'body', 'of', 'from', 'a', 'single', 'particular', 'world', 'view', 'instead', 'fluxing'], topn=10).
In this situation, could I get/predict the correct word or the omitted word 'doctrine'?
Is this the right way? Please explain this function in detail.
I am wondering if you have seen the documentation of predict_output_word?
Report the probability distribution of the center word given the
context words as input to the trained model.
To answer your specific question about the word 'doctrine': it strongly depends on whether, for the context words you listed, 'doctrine' is among the 10 most probable words. That means it must occur relatively frequently in the corpus you use for training the model. Also, since 'doctrine' does not seem to be a very frequently used word, there is a high chance that other words will have a higher probability of appearing in that context. Therefore, if you rely only on the returned probabilities of words given the context, you may end up failing to predict 'doctrine' in this case.
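For illustration, a small sketch of the call with gensim's Word2Vec; note that predict_output_word only works for models trained with negative sampling (the default), and a one-sentence corpus will not give meaningful probabilities, this only shows the API:
from gensim.models import Word2Vec

sentence = ("anarchism does not offer a fixed body of doctrine from a single "
            "particular world view instead fluxing and flowing as a philosophy").split()

# tiny toy model; a real corpus is needed for sensible predictions
model = Word2Vec([sentence], min_count=1)

context = ['anarchism', 'does', 'not', 'offer', 'a', 'fixed', 'body', 'of',
           'from', 'a', 'single', 'particular', 'world', 'view', 'instead', 'fluxing']
print(model.predict_output_word(context, topn=10))  # list of (word, probability) pairs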
"First thing we do, let's kill all the lawyers." - William Shakespeare
Given the quote above, I would like to pull out "kill" and "lawyers" as the two prominent keywords to describe the overall meaning of the sentence. I have extracted the following noun/verb POS tags:
[["First", "NNP"], ["thing", "NN"], ["do", "VBP"], ["lets", "NNS"], ["kill", "VB"], ["lawyers", "NNS"]]
The more general problem I am trying to solve is to distill a sentence to the "most important"* words/tags to summarise the overall "meaning"* of a sentence.
*note the scare quotes. I acknowledge this is a very hard problem and there is most likely no perfect solution at this point in time. Nonetheless, I am interested to see attempts at solving the specific problem (extracting "kill" and "lawyers") and the general problem (summarising the overall meaning of a sentence in keywords/tags)
I don't think there's any perfect answer to this question because there is no gold set of input/output mappings that everybody will agree upon. You think the most important words for that sentence are ('kill', 'lawyers'); someone else might argue the correct answer should be ('first', 'kill', 'lawyers'). If you are able to describe very precisely and completely unambiguously exactly what you want your system to do, your problem will be more than half solved.
Until then, I can suggest some additional heuristics to help you get what you want.
Build an idf dictionary using your data, i.e. build a mapping from every word to a number that correlates with how rare that word is. Bonus points for doing it for larger n-grams as well.
By combining the idf values of each word in your input sentence with their POS tags, you can answer questions of the form 'What is the rarest verb in this sentence?', 'What is the rarest noun in this sentence?', etc. In any reasonable corpus, 'kill' should be rarer than 'do', and 'lawyers' rarer than 'thing', so maybe finding the rarest noun and rarest verb in a sentence and returning just those two will do the trick for most of your intended use cases. If not, you can always make your algorithm a little more complicated and see if that seems to do the job better.
Ways to expand this include trying to identify larger phrases using n-gram idfs, building a full parse tree of the sentence (using, say, the Stanford parser) and identifying patterns within those trees that help you figure out where the important parts of the sentence tend to sit, etc.
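A rough sketch of the rarest-noun/rarest-verb heuristic described above, using a toy document-frequency table (the corpus here is obviously illustrative):
import math
from collections import Counter

# toy corpus: each document is a list of lowercased tokens
corpus = [
    ["first", "thing", "lets", "do", "this"],
    ["lets", "do", "the", "first", "thing"],
    ["lawyers", "kill", "the", "first", "deal"],
]

doc_freq = Counter(w for doc in corpus for w in set(doc))
n_docs = len(corpus)

def idf(word):
    # rarer words get higher scores; unseen words score highest
    return math.log(n_docs / (1 + doc_freq[word.lower()]))

pos_tags = [["First", "NNP"], ["thing", "NN"], ["do", "VBP"],
            ["lets", "NNS"], ["kill", "VB"], ["lawyers", "NNS"]]

nouns = [w for w, t in pos_tags if t.startswith("NN")]
verbs = [w for w, t in pos_tags if t.startswith("VB")]
print(max(nouns, key=idf), max(verbs, key=idf))  # -> lawyers kill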
One simple approach would be to keep stop word lists for NN, VB etc. These would be high frequency words that usually don't add much semantic content to a sentence.
The snippet below shows distinct lists for each type of word token, but you could just as well employ a single stop word list for both verbs and nouns (such as this one).
stop_words = dict(
    NNP=['first', 'second'],
    NN=['thing'],
    VBP=['do', 'done'],
    VB=[],
    NNS=['lets', 'things'],
)

def filter_stop_words(pos_list):
    return [[token, token_type]
            for token, token_type in pos_list
            if token.lower() not in stop_words[token_type]]
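Applied to the POS list from the question, this keeps just the two tokens you were after:
pos_list = [["First", "NNP"], ["thing", "NN"], ["do", "VBP"],
            ["lets", "NNS"], ["kill", "VB"], ["lawyers", "NNS"]]
print(filter_stop_words(pos_list))
# [['kill', 'VB'], ['lawyers', 'NNS']]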
In your case, you can simply use the Rake package for Python (thanks to Fabian) to get what you need:
>>> import RAKE
>>> path = #your path
>>> r = RAKE.Rake(path)
>>> r.run("First thing we do, let's kill all the lawyers")
[('lawyers', 1.0), ('kill', 1.0), ('thing', 1.0)]
The path can, for example, point to this file (a stop word list).
But in general, you would be better off using the NLTK package for NLP tasks.