Python: Tokenizing with phrases

I have blocks of text I want to tokenize, but I don't want to tokenize on whitespace and punctuation, as seems to be the standard with tools like NLTK. There are particular phrases that I want to be tokenized as a single token, instead of the regular tokenization.
For example, given the sentence "The West Wing is an American television serial drama created by Aaron Sorkin that was originally broadcast on NBC from September 22, 1999 to May 14, 2006", and adding the phrase "the west wing" to the tokenizer, the resulting tokens would be:
the west wing
is
an
american
...
What's the best way to accomplish this? I'd prefer to stay within the bounds of tools like NLTK.

You can use NLTK's Multi-Word Expression Tokenizer, MWETokenizer:
from nltk.tokenize import MWETokenizer
tokenizer = MWETokenizer()
tokenizer.add_mwe(('the', 'west', 'wing'))
tokenizer.tokenize('Something about the west wing'.split())
You will get:
['Something', 'about', 'the_west_wing']
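To reproduce the lowercase, punctuation-free output shown in the question, you can normalize the text before the MWE pass. A rough sketch, assuming lowercasing plus a simple regex is an acceptable pre-tokenization step (that choice, and separator=' ', are mine, not part of the answer above):
import re
from nltk.tokenize import MWETokenizer

tokenizer = MWETokenizer([('the', 'west', 'wing')], separator=' ')

sentence = ("The West Wing is an American television serial drama created by "
            "Aaron Sorkin that was originally broadcast on NBC from "
            "September 22, 1999 to May 14, 2006")
# lowercase and keep only alphanumeric runs before handing tokens to the MWE pass
words = re.findall(r"[a-z0-9]+", sentence.lower())
print(tokenizer.tokenize(words)[:4])
# ['the west wing', 'is', 'an', 'american']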

If you have a fixed set of phrases that you're looking for, then the simple solution is to tokenize your input and "reassemble" the multi-word tokens. Alternatively, do a regexp search & replace before tokenizing that turns The West Wing into The_West_Wing.
For more advanced options, use regexp_tokenize or see chapter 7 of the NLTK book.
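For illustration, a minimal sketch of the search-and-replace idea (my own code, not the answer's): join each known phrase with underscores before the regular tokenization step. It assumes NLTK's punkt data is available for word_tokenize.
import re
from nltk.tokenize import word_tokenize

phrases = ["The West Wing"]  # the fixed set of phrases you are looking for
text = "The West Wing is an American television serial drama"

for phrase in phrases:
    # case-insensitive replace of the phrase with a single underscored token
    text = re.sub(re.escape(phrase), phrase.replace(" ", "_"), text, flags=re.IGNORECASE)

print(word_tokenize(text))
# ['The_West_Wing', 'is', 'an', 'American', 'television', 'serial', 'drama']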

If you don't know the particular phrases in advance, you could possibly use scikit-learn's CountVectorizer() class. It has the option to specify larger n-gram ranges (ngram_range) and then ignore any words that do not appear in enough documents (min_df). You might identify a few phrases that you had not realized were common, but you might also find some that are meaningless. It also has the option to filter out English stop words (meaningless words like 'is') using the stop_words parameter.
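As a rough illustration of that approach (the toy documents and parameter values are my own, and get_feature_names_out assumes scikit-learn >= 1.0):
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "The West Wing is an American television drama",
    "The West Wing was broadcast on NBC",
    "Aaron Sorkin created The West Wing",
]
# collect 2- and 3-grams, drop English stop words, and keep only n-grams
# that occur in at least two documents
vectorizer = CountVectorizer(ngram_range=(2, 3), min_df=2, stop_words='english')
vectorizer.fit(docs)
print(vectorizer.get_feature_names_out())
# n-grams such as 'west wing' should survive the min_df filter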

Related

How to exclude certain names and terms from stemming (Python NLTK SnowballStemmer (Porter2))

I am newly getting into NLP, Python, and posting on Stack Overflow all at the same time, so please be patient with me if I seem ignorant :).
I am using SnowballStemmer in Python's NLTK in order to stem words for textual analysis. While lemmatization seems to understem my tokens, the snowball porter2 stemmer, which I read is mostly preferred to the basic porter stemmer, overstems my tokens. I am analyzing tweets including many names and probably also places and other words which should not be stemmed, like: hillary, hannity, president, which are now reduced to hillari, hanniti, and presid (you probably guessed already whose tweets I am analyzing).
Is there an easy way to exclude certain terms from stemming? Alternatively, I could merely lemmatize tokens and include a rule for common suffixes like -ed, -s, …. Another idea might be to stem only verbs and adjectives, as well as nouns ending in s. That might also be close enough…
I am using the code below as of now:
# LEMMATIZE AND STEM WORDS
import nltk
from nltk.stem.snowball import EnglishStemmer

lemmatizer = nltk.stem.WordNetLemmatizer()
snowball = EnglishStemmer()

def lemmatize_text(text):
    return [lemmatizer.lemmatize(w) for w in text]

def snowball_stemmer(text):
    return [snowball.stem(w) for w in text]

# APPLY FUNCTIONS
tweets['text_snowball'] = tweets.text_processed.apply(snowball_stemmer)
tweets['text_lemma'] = tweets.text_processed.apply(lemmatize_text)
I hope someone can help… Contrary to my past experience with all kinds of issues, I have not been able to find adequate help for my issue online so far.
Thanks!
Do you know about NER? It stands for named entity recognition. You can preprocess your text and locate all named entities, which you then exclude from stemming. After stemming, you can merge the data again.
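A rough sketch of that idea (my own illustration, not the answerer's code), using NLTK's built-in NE chunker to collect tokens to protect; it assumes the relevant NLTK data (tokenizer, POS tagger, and NE chunker models) has been downloaded:
import nltk
from nltk.stem.snowball import EnglishStemmer

snowball = EnglishStemmer()

def stem_except_entities(sentence):
    tokens = nltk.word_tokenize(sentence)
    # named-entity subtrees (PERSON, GPE, ...) hold the tokens to keep intact
    tree = nltk.ne_chunk(nltk.pos_tag(tokens))
    protected = {leaf[0].lower()
                 for subtree in tree.subtrees()
                 if subtree.label() != 'S'
                 for leaf in subtree.leaves()}
    return [w if w.lower() in protected else snowball.stem(w) for w in tokens]

print(stem_except_entities("Hannity interviewed Hillary about the presidency"))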

How to perform NER on true case, then lemmatization on lower case, with spaCy

I am trying to lemmatize a text using spaCy 2.0.12 with the French model fr_core_news_sm. Moreover, I want to replace people's names with an arbitrary sequence of characters, detecting such names using token.ent_type_ == 'PER'. An example outcome would be "Pierre aime les chiens" -> "~PER~ aimer chien".
The problem is I can't find a way to do both. I only have these two partial options:
I can feed the pipeline with the original text: doc = nlp(text). Then, the NER will recognize most people names but the lemmas of words starting with a capital won't be correct. For example, the lemmas of the simple question "Pouvons-nous faire ça?" would be ['Pouvons', '-', 'se', 'faire', 'ça', '?'], where "Pouvons" is still an inflected form.
I can feed the pipeline with the lower case text: doc = nlp(text.lower()). Then my previous example would correctly display ['pouvoir', '-', 'se', 'faire', 'ça', '?'], but most people names wouldn't be recognized as entities by the NER, as I guess a starting capital is a useful indicator for finding entities.
My idea would be to perform the standard pipeline (tagger, parser, NER), then lowercase, and then lemmatize only at the end.
However, lemmatization doesn't seem to have its own pipeline component, and the documentation doesn't explain how and where it is performed. This answer seems to imply that lemmatization is performed independently of any pipeline component, possibly at different stages of it.
So my question is: how to choose when to perform the lemmatization and which input to give to it?
If you can, use the most recent version of spacy instead. The French lemmatizer has been improved a lot in 2.1.
If you have to use 2.0, consider using an alternate lemmatizer like this one: https://spacy.io/universe/project/spacy-lefff
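If you are stuck on 2.0, one hedged workaround along the lines of the question's own idea (my sketch, not from the answer): run the pipeline twice, once on the true-cased text for NER and once on the lowercased text for lemmas, then align tokens by position. This assumes both runs tokenize the text identically, which can break on edge cases.
import spacy

nlp = spacy.load("fr_core_news_sm")

def anonymised_lemmas(text):
    doc_true = nlp(text)           # reliable NER on the original casing
    doc_lower = nlp(text.lower())  # better lemmas on the lowercased text
    out = []
    for tok_true, tok_lower in zip(doc_true, doc_lower):
        if tok_true.ent_type_ == "PER":
            out.append("~PER~")
        else:
            out.append(tok_lower.lemma_)
    return out

print(anonymised_lemmas("Pierre aime les chiens"))
# e.g. ['~PER~', 'aimer', 'le', 'chien'] (exact lemmas depend on the model version)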

How to get all noun phrases in spaCy (Python)

I would like to extract "all" the noun phrases from a sentence. I'm wondering how I can do it. I have the following code:
doc2 = nlp("what is the capital of Bangladesh?")
for chunk in doc2.noun_chunks:
print(chunk)
Output:
1. what
2. the capital
3. bangladesh
Expected:
the capital of Bangladesh
I have tried answers from the spaCy docs and Stack Overflow. Nothing worked. It seems only cTAKES and Stanford CoreNLP can give such complex NPs.
Any help is appreciated.
spaCy clearly defines a noun chunk as:
"A base noun phrase, or "NP chunk", is a noun phrase that does not permit other NPs to be nested within it – so no NP-level coordination, no prepositional phrases, and no relative clauses." (https://spacy.io/api/doc#noun_chunks)
If you process the dependency parse differently, allowing prepositional modifiers and nested phrases/chunks, then you can end up with what you're looking for.
I bet you could modify the existing spacy code fairly easily to do what you want:
https://github.com/explosion/spaCy/blob/06c6dc6fbcb8fbb78a61a2e42c1b782974bd43bd/spacy/lang/en/syntax_iterators.py
For those who are still looking for this answer:
noun_phrases = set()
for nc in doc.noun_chunks:
    for np in [nc, doc[nc.root.left_edge.i:nc.root.right_edge.i + 1]]:
        noun_phrases.add(np)
This is how I get all the complex noun phrases.
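For reference, a self-contained version of the snippet above applied to the question's sentence (assumes the en_core_web_sm model is installed; exact chunks depend on the model version):
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("what is the capital of Bangladesh?")

noun_phrases = set()
for nc in doc.noun_chunks:
    # expand each base chunk to the full subtree of its root
    for np in [nc, doc[nc.root.left_edge.i:nc.root.right_edge.i + 1]]:
        noun_phrases.add(np.text)

print(noun_phrases)
# expected to include 'the capital of Bangladesh' alongside the base chunks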

NLTK words lemmatizing

I am trying to do lemmatization on words with NLTK.
What I can find now is that I can use the stem package to get some results like transform "cars" to "car" and "women" to "woman", however I cannot do lemmatization on some words with affixes like "acknowledgement".
When using WordNetLemmatizer() on "acknowledgement", it returns "acknowledgement", and using PorterStemmer(), it returns "acknowledg" rather than "acknowledge".
Can anyone tell me how to eliminate the affixes of words?
Say, when input is "acknowledgement", the output to be "acknowledge"
Lemmatization does not (and should not) return "acknowledge" for "acknowledgement". The former is a verb, while the latter is a noun. Porter's stemming algorithm, on the other hand, simply uses a fixed set of rules. So, your only way there is to change the rules at source. (NOT the right way to fix your problem).
What you are looking for is the derivationally related form of "acknowledgement", and for this, your best source is WordNet. You can check this online on WordNet.
There are quite a few WordNet-based libraries that you can use for this (e.g. JWNL in Java). In Python, NLTK should be able to get the derivationally related form you saw online:
from nltk.corpus import wordnet as wn
acknowledgment_synset = wn.synset('acknowledgement.n.01')
acknowledgment_lemma = acknowledgment_synset.lemmas()[1]
print(acknowledgment_lemma.derivationally_related_forms())
# [Lemma('admit.v.01.acknowledge'), Lemma('acknowledge.v.06.acknowledge')]

Performing stemming outputs gibberish/concatenated words

I am experimenting with the Python library NLTK for Natural Language Processing.
My problem: I'm trying to perform stemming, reducing words to their normalised form, but it's not producing correct words. Am I using the stemming class correctly? And how can I get the results I am attempting to get?
I want to normalise the following words:
words = ["forgot","forgotten","there's","myself","remuneration"]
...into this:
words = ["forgot","forgot","there","myself","remunerate"]
My code:
from nltk import stem

# the question never shows how the stemmer was created; PorterStemmer is an
# assumption that matches the reported output
stemmer = stem.PorterStemmer()
words = ["forgot", "forgotten", "there's", "myself", "remuneration"]
for word in words:
    print(stemmer.stem(word))
# output is:
# forgot forgotten there' myself remuner
There are two types of normalization you can do at a word level.
Stemming - a quick and dirty hack to convert words into some token which is not guaranteed to be an actual word, but generally different forms of the same word should map to the same stemmed token
Lemmatization - converting a word into some base form (singular, present tense, etc) which is always a legitimate word on its own. This can obviously be slower and more complicated and is generally not required for a lot of NLP tasks.
You seem to be looking for a lemmatizer instead of a stemmer. Searching Stack Overflow for 'lemmatization' should give you plenty of clues about how to set one of those up. I have played with this one called morpha and have found it to be pretty useful and cool.
Like adi92, I too believe you're looking for lemmatization. Since you're using NLTK you could probably use its WordNet interface.
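A small sketch of that WordNet route (assumes the WordNet data is available via nltk.download('wordnet')). Note that the part of speech matters, verbs need pos='v', and the results are dictionary lemmas, which will not match every form listed in the question (e.g. "remuneration" stays a noun):
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
words = ["forgot", "forgotten", "there's", "myself", "remuneration"]
print([lemmatizer.lemmatize(w, pos='v') for w in words])
# e.g. ['forget', 'forget', "there's", 'myself', 'remuneration']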
