using NLTK methods such as tokenize on annotated text - python

Say I have a corpus of annotated text where a sentence looks something like:
txt = 'red foxes <emotion>scare</emotion> me.'
Is it possible to tokenize this using word_tokenize in such a way that we get:
['red', 'foxes', '<emotion>scare</emotion>', 'me', '.']
We could use an alternative annotation scheme say:
txt = 'red foxes scare_EMOTION me'
Is it possible to do this with NLTK? Currently I'm parsing out the annotations and tracking them out of band, which is very cumbersome.

To achieve the desired result you don't need nltk.
Just run txt.split()
If you insist on using NLTK, check out its different tokenizers; PunktWordTokenizer and WhitespaceTokenizer would both fit here.
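For illustration, a minimal sketch of both suggestions (note that whitespace-based splitting keeps the trailing period attached to 'me', unlike the exact output asked for):
from nltk.tokenize import WhitespaceTokenizer
txt = 'red foxes <emotion>scare</emotion> me.'
# plain Python: split on whitespace, annotations stay attached to the word
print(txt.split())
# ['red', 'foxes', '<emotion>scare</emotion>', 'me.']
# the NLTK equivalent
print(WhitespaceTokenizer().tokenize(txt))
# ['red', 'foxes', '<emotion>scare</emotion>', 'me.']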

Related

Custom tokenization rule spacy

How do I add a custom tokenization rule to spaCy for the case where I want a number and a following symbol or word to be tokenized together? E.g. the following sentence:
"I 100% like apples. I like 500g of apples"
is tokenized as follows:
['I', '100', '%', 'like', 'apples', '.', 'I', 'like', '500', 'g', 'of', 'apples']
It would be preferable if it was tokenized like this:
['I', '100%', 'like', 'apples', '.', 'I', 'like', '500g', 'of', 'apples']
The following code was used to generate this:
import spacy
nlp = spacy.load("en_core_web_sm")
text = "I 100% like apples. I like 500g of apples"
print([token.text for token in nlp(text)])
So normally you can modify the tokenizer by adding special rules or something, but in this particular case it's trickier than that. spaCy actually has a lot of code to make sure that suffixes like those in your example become separate tokens. So what you have to do is remove the relevant rules.
In this example code I just filter out the suffix rule that contains '%'; it just so happens that the same rule also contains unit suffixes like "g". So this does what you want:
import spacy
nlp = spacy.blank("en")
text = "I 100% like apples. I like 500g of apples"
# remove the entry with units and %
suffixes = [ss for ss in nlp.Defaults.suffixes if '%' not in ss]
suffix_regex = spacy.util.compile_suffix_regex(suffixes)
nlp.tokenizer.suffix_search = suffix_regex.search
print(list(nlp(text)))
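# with the '%' suffix rule removed, this should print something like:
# [I, 100%, like, apples, ., I, like, 500g, of, apples]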
You can see the list of rule definitions here.
I understand you mean to give a simple example but there are a couple of things here that are of concern.
Typically, stopwords and punctuation are removed first; particularly with topic modeling, they take up quite a bit of processing power while adding very little.
If you read through the documentation, you'll see that part-of-speech analysis is a fairly central feature. While you may not intend to use it, you should understand that you're going against the grain here: you're looking to conjoin things (e.g. a QUANTMOD, or quantifier phrase modifier, with the NUM, or number, it modifies) rather than tease concepts apart from a term (spaCy's example is 'Gimme' --> 'gim' (or 'give') and 'me').
But if you're really bent on going down this path, the spaCy documentation will get you there.

How can I count specific bigram words?

I want to find and count specific bigrams such as "red apple" in a text file.
I have already turned the text file into a list of words, so I can't use a regex to count the whole phrase, i.e. the bigram. (Or can I?)
How can I count a specific bigram in the text file without using NLTK or another module? Could a regex be the solution?
Why have you made the text file into a list? It's also not memory efficient.
Instead, you can use the file.read() method and search the text directly.
import re
text = 'I like red apples and green apples but I like red apples more.'
bigram = ['red apples', 'green apples']
for i in bigram:
    print('Found', i, len(re.findall(i, text)))
out:
Found red apples 2
Found green apples 1
Are you looking only for specific bigrams, or might you need to extend the search to detect any bigrams common in your text? In the latter case, have a look at the NLTK collocations module. You say you want to do this without NLTK or other modules, but in practice that's a very bad idea: you'll miss what you're looking for when the text contains, e.g., 'red apple' rather than 'red apples'. NLTK, on the other hand, provides useful tools for lemmatization, calculating all kinds of statistics, and so on.
And think about why and how you turned the lines into a list of words. Not only is this inefficient, but depending on exactly how you did it you may have lost word order, mishandled punctuation, messed up uppercase/lowercase, or made any of a million other mistakes. Which, again, is why NLTK is what you need.
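For illustration, a minimal sketch of that NLTK route (it assumes the punkt and wordnet data packages are installed; the text is taken from the other answer):
from collections import Counter
import nltk
from nltk.stem import WordNetLemmatizer
text = 'I like red apples and green apples but I like red apples more.'
# tokenize and lemmatize so that 'apples' also matches 'apple'
lemmatizer = WordNetLemmatizer()
tokens = [lemmatizer.lemmatize(t.lower()) for t in nltk.word_tokenize(text)]
# count every adjacent pair of tokens, then look up the ones we care about
bigram_counts = Counter(nltk.bigrams(tokens))
print(bigram_counts[('red', 'apple')])    # 2
print(bigram_counts[('green', 'apple')])  # 1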

Finding the common words between two text corpus in NLTK

I am very new to NLTK and am trying to do something.
What would be the best way to find the common words between two bodies of text? Basically, I have one long text file say text1, and another say text2. I want to find the common words that appear in both the files using NLTK.
Is there a direct way to do so? What would be the best approach?
Thanks!
It seems to me that unless you need to do something special with regard to language processing, you don't need NLTK:
words1 = "This is a simple test of set intersection".lower().split()
words2 = "Intersection of sets is easy using Python".lower().split()
intersection = set(words1) & set(words2)
print(intersection)
# {'of', 'is', 'intersection'}
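If the two bodies of text live in files and you do want NLTK's tokenizer rather than str.split(), the same idea looks roughly like this (it needs the punkt data package; 'text1.txt' and 'text2.txt' are placeholder file names):
import nltk
def words_in(path):
    # lowercased set of word tokens for one file
    with open(path, encoding='utf-8') as f:
        return set(w.lower() for w in nltk.word_tokenize(f.read()))
common = words_in('text1.txt') & words_in('text2.txt')  # placeholder file names
print(sorted(common))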

Performing stemming outputs gibberish/concatenated words

I am experimenting with the python library NLTK for Natural Language Processing.
My problem: I'm trying to perform stemming, i.e. reduce words to their normalised form, but it's not producing correct words. Am I using the stemming class correctly? And how can I get the results I am attempting to get?
I want to normalise the following words:
words = ["forgot","forgotten","there's","myself","remuneration"]
...into this:
words = ["forgot","forgot","there","myself","remunerate"]
My code:
from nltk import stem
# (the question does not show how the stemmer was constructed; a PorterStemmer
# reproduces the quoted output)
stemmer = stem.PorterStemmer()
words = ["forgot", "forgotten", "there's", "myself", "remuneration"]
for word in words:
    print(stemmer.stem(word))
# output is:
# forgot forgotten there' myself remuner
There are two types of normalization you can do at a word level.
Stemming - a quick and dirty hack to convert words into some token which is not guaranteed to be an actual word, but generally different forms of the same word should map to the same stemmed token
Lemmatization - converting a word into some base form (singular, present tense, etc) which is always a legitimate word on its own. This can obviously be slower and more complicated and is generally not required for a lot of NLP tasks.
You seem to be looking for a lemmatizer instead of a stemmer. Searching Stack Overflow for 'lemmatization' should give you plenty of clues about how to set one of those up. I have played with this one called morpha and have found it to be pretty useful and cool.
Like adi92, I too believe you're looking for lemmatization. Since you're using NLTK you could probably use its WordNet interface.
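For example, a minimal sketch of the WordNet route (it requires the wordnet data package). Note that the lemmatizer works per part of speech, and that it handles inflection ('forgotten' with pos='v') but not derivational forms like 'remuneration' -> 'remunerate':
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("forgotten", pos="v"))  # forget
print(lemmatizer.lemmatize("forgot", pos="v"))     # forget
print(lemmatizer.lemmatize("remuneration"))        # remuneration (unchanged)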

Python: Tokenizing with phrases

I have blocks of text I want to tokenize, but I don't want to tokenize on whitespace and punctuation, as seems to be the standard with tools like NLTK. There are particular phrases that I want to be tokenized as a single token, instead of the regular tokenization.
For example, given the sentence "The West Wing is an American television serial drama created by Aaron Sorkin that was originally broadcast on NBC from September 22, 1999 to May 14, 2006," and adding the phrase to the tokenizer "the west wing," the resulting tokens would be:
the west wing
is
an
american
...
What's the best way to accomplish this? I'd prefer to stay within the bounds of tools like NLTK.
You can use NLTK's multi-word expression tokenizer, MWETokenizer:
from nltk.tokenize import MWETokenizer
tokenizer = MWETokenizer()
tokenizer.add_mwe(('the', 'west', 'wing'))
print(tokenizer.tokenize('Something about the west wing'.split()))
You will get:
['Something', 'about', 'the_west_wing']
If you have a fixed set of phrases that you're looking for, then the simple solution is to tokenize your input and "reassemble" the multi-word tokens. Alternatively, do a regexp search & replace before tokenizing that turns The West Wing into The_West_Wing.
For more advanced options, use regexp_tokenize or see chapter 7 of the NLTK book.
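A quick sketch of that search-and-replace idea, tokenizing afterwards with word_tokenize (needs the punkt data package):
import re
from nltk.tokenize import word_tokenize
text = ("The West Wing is an American television serial drama "
        "created by Aaron Sorkin")
# join the phrase into a single token before tokenizing
joined = re.sub(r'\bThe West Wing\b', 'The_West_Wing', text)
print(word_tokenize(joined))
# ['The_West_Wing', 'is', 'an', 'American', 'television', 'serial', 'drama',
#  'created', 'by', 'Aaron', 'Sorkin']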
If you don't know the particular phrases in advance, you could use scikit-learn's CountVectorizer class. It has an option to specify larger n-gram ranges (ngram_range) and to ignore any n-grams that do not appear in enough documents (min_df). You might identify a few phrases that you had not realized were common, but you might also find some that are meaningless. It also has an option to filter out English stopwords (meaningless words like 'is') via the stop_words parameter.
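A rough sketch of that scikit-learn approach; the three tiny documents are stand-ins for a real corpus:
from sklearn.feature_extraction.text import CountVectorizer
docs = [
    "The West Wing is an American television serial drama",
    "The West Wing was originally broadcast on NBC",
    "Aaron Sorkin created the series",
]
# count unigrams to trigrams, keep only n-grams seen in at least two
# documents, and drop English stopwords
vectorizer = CountVectorizer(ngram_range=(1, 3), min_df=2, stop_words='english')
counts = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())
# something like: ['west' 'west wing' 'wing']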
