I'm just starting to use NLTK and I don't quite understand how to get a list of words from text. If I use nltk.word_tokenize(), I get a list of words and punctuation, but I need only the words. How can I get rid of the punctuation? Also, word_tokenize doesn't seem to work with multiple sentences: dots get attached to the last word.
Take a look at the other tokenizing options that nltk provides here. For example, you can define a tokenizer that picks out sequences of alphanumeric characters as tokens and drops everything else:
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')
tokenizer.tokenize('Eighty-seven miles to go, yet. Onward!')
Output:
['Eighty', 'seven', 'miles', 'to', 'go', 'yet', 'Onward']
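If you also want to keep contractions such as "can't" together rather than splitting at the apostrophe, a slightly different pattern works; the pattern below is an assumption on my part, not part of the original answer:
from nltk.tokenize import RegexpTokenizer

# \w+ optionally followed by apostrophe-suffixes, so "can't" stays one token
tokenizer = RegexpTokenizer(r"\w+(?:'\w+)*")
print(tokenizer.tokenize("She can't go eighty-seven miles, yet. Onward!"))
# ['She', "can't", 'go', 'eighty', 'seven', 'miles', 'yet', 'Onward']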
You do not really need NLTK to remove punctuation; plain Python can do it. For Python 2 byte strings:
import string
s = '... some string with punctuation ...'
s = s.translate(None, string.punctuation)  # Python 2 str.translate signature
Or for unicode strings (and Python 3 str):
import string
translate_table = dict((ord(char), None) for char in string.punctuation)
s = s.translate(translate_table)
and then use this string in your tokenizer.
P.S. The string module has some other sets of characters that can be removed as well (such as string.digits).
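For example, a Python 3 sketch (my own addition, not from the original answer) that strips both punctuation and digits in one pass before tokenizing:
import string
from nltk.tokenize import word_tokenize

s = "Flight AB123 departs at 9:45, right?"
# one translation table that drops both punctuation and digits
table = str.maketrans('', '', string.punctuation + string.digits)
print(word_tokenize(s.translate(table)))
# ['Flight', 'AB', 'departs', 'at', 'right']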
The code below removes all punctuation marks as well as non-alphabetic tokens. Adapted from the NLTK book:
http://www.nltk.org/book/ch01.html
import nltk
s = "I can't do this now, because I'm so tired. Please give me some time. # sd 4 232"
words = nltk.word_tokenize(s)
words = [word.lower() for word in words if word.isalpha()]
print(words)
Output:
['i', 'ca', 'do', 'this', 'now', 'because', 'i', 'so', 'tired', 'please', 'give', 'me', 'some', 'time', 'sd']
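Note that word.isalpha() also throws away contraction pieces like "n't" and "'m" as well as anything containing a digit. If you would rather keep any token that contains at least one letter, a small variation (my suggestion, not part of the original answer) is:
import nltk

s = "I can't do this now, because I'm so tired. Please give me some time. # sd 4 232"
tokens = nltk.word_tokenize(s)
# keep any token containing at least one letter; still drops ',', '.', '#', '4', '232'
words = [t.lower() for t in tokens if any(ch.isalpha() for ch in t)]
print(words)
# ['i', 'ca', "n't", 'do', 'this', 'now', 'because', 'i', "'m", 'so', 'tired', 'please', 'give', 'me', 'some', 'time', 'sd']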
As noted in the comments, start with sent_tokenize(), because word_tokenize() works on one sentence at a time. You can then filter out punctuation with filter(). And if you have unicode strings, make sure they are unicode objects (not 'str' objects encoded with some encoding like 'utf-8').
from nltk.tokenize import word_tokenize, sent_tokenize
text = '''It is a blue, small, and extraordinary ball. Like no other'''
tokens = [word for sent in sent_tokenize(text) for word in word_tokenize(sent)]
print filter(lambda word: word not in ',-', tokens)
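In Python 3, print is a function and filter() returns a lazy iterator, so an equivalent version (my adaptation, using string.punctuation for a fuller set of punctuation marks) would be:
import string
from nltk.tokenize import word_tokenize, sent_tokenize

text = '''It is a blue, small, and extraordinary ball. Like no other'''
tokens = [word for sent in sent_tokenize(text) for word in word_tokenize(sent)]
print([word for word in tokens if word not in string.punctuation])
# ['It', 'is', 'a', 'blue', 'small', 'and', 'extraordinary', 'ball', 'Like', 'no', 'other']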
I just used the following code, which removed all the punctuation:
tokens = nltk.wordpunct_tokenize(raw)
type(tokens)
text = nltk.Text(tokens)
type(text)
words = [w.lower() for w in text if w.isalpha()]
Seriously, what is a word? If you assume that a word consists of alphabetic characters only, you are wrong: words such as can't will be destroyed into pieces (such as can and t) if you remove punctuation before tokenisation, which is very likely to affect your program negatively.
Hence the solution is to tokenise and then remove punctuation tokens.
import string
from nltk.tokenize import word_tokenize
tokens = word_tokenize("I'm a southern salesman.")
# ['I', "'m", 'a', 'southern', 'salesman', '.']
tokens = list(filter(lambda token: token not in string.punctuation, tokens))
# ['I', "'m", 'a', 'southern', 'salesman']
...and then if you wish, you can replace certain tokens such as 'm with am.
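A small sketch of that last step, with a hypothetical expansion table that you would extend to suit your data:
# hypothetical contraction-expansion table; extend as needed
EXPANSIONS = {"'m": "am", "n't": "not", "'re": "are", "'ve": "have", "'ll": "will"}
tokens = ['I', "'m", 'a', 'southern', 'salesman']
tokens = [EXPANSIONS.get(t, t) for t in tokens]
# ['I', 'am', 'a', 'southern', 'salesman']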
I think you need some sort of regular expression matching (the following code is in Python 3):
import string
import re
import nltk
s = "I can't do this now, because I'm so tired. Please give me some time."
l = nltk.word_tokenize(s)
ll = [x for x in l if not re.fullmatch('[' + string.punctuation + ']+', x)]
print(l)
print(ll)
Output:
['I', 'ca', "n't", 'do', 'this', 'now', ',', 'because', 'I', "'m", 'so', 'tired', '.', 'Please', 'give', 'me', 'some', 'time', '.']
['I', 'ca', "n't", 'do', 'this', 'now', 'because', 'I', "'m", 'so', 'tired', 'Please', 'give', 'me', 'some', 'time']
This should work well in most cases, since it removes punctuation while preserving tokens like "n't", which can't be obtained from regex tokenizers such as wordpunct_tokenize.
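One caveat: dropping string.punctuation into a character class unescaped happens to work here, but it is fragile. A slightly more defensive variant (my suggestion, same behaviour) escapes the set and precompiles the pattern for reuse:
import re
import string
import nltk

punct_re = re.compile('[' + re.escape(string.punctuation) + ']+')
s = "I can't do this now, because I'm so tired. Please give me some time."
tokens = [t for t in nltk.word_tokenize(s) if not punct_re.fullmatch(t)]
print(tokens)
# ['I', 'ca', "n't", 'do', 'this', 'now', 'because', 'I', "'m", 'so', 'tired', 'Please', 'give', 'me', 'some', 'time']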
I use this code to remove punctuation:
import nltk
def getTerms(sentences):
    tokens = nltk.word_tokenize(sentences)
    words = [w.lower() for w in tokens if w.isalnum()]
    print(tokens)
    print(words)

getTerms("hh, hh3h. wo shi 2 4 A . fdffdf. A&&B ")
And if you want to check whether a token is a valid English word or not, you may want PyEnchant.
Tutorial:
import enchant

d = enchant.Dict("en_US")
d.check("Hello")   # True
d.check("Helo")    # False
d.suggest("Helo")  # returns a list of suggestions, e.g. including 'Hello' (exact list varies by dictionary)
You can do it in one line without nltk (python 3.x).
import string
string_text = string_text.translate(str.maketrans('', '', string.punctuation))
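A quick usage example (mine, not from the original answer); note that this also collapses contractions, turning "It's" into "Its", which echoes the caveat about contractions above:
import string

s = "It's 5 o'clock; let's go!"
print(s.translate(str.maketrans('', '', string.punctuation)))
# Its 5 oclock lets go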
Just adding to the solution by @rmalouf: \w+ also matches digits and the underscore, since it is equivalent to [a-zA-Z0-9_]. If you want letters only, use a letters-only character class (note the +; without it you get single characters back):
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'[a-zA-Z]+')
tokenizer.tokenize('Eighty-seven miles to go, yet. Onward!')
# ['Eighty', 'seven', 'miles', 'to', 'go', 'yet', 'Onward']
Remove punctuation (this removes '.' as well, as part of the punctuation handling) using the code below:
import sys
import unicodedata
from nltk.tokenize import word_tokenize

# translation table covering every Unicode punctuation character
tbl = dict.fromkeys(i for i in range(sys.maxunicode) if unicodedata.category(chr(i)).startswith('P'))
text_string = text_string.translate(tbl)  # text_string no longer has punctuation
w = word_tokenize(text_string)            # now tokenize the string
Sample Input/Output:
direct flat in oberoi esquire. 3 bhk 2195 saleable 1330 carpet. rate of 14500 final plus 1% floor rise. tax approx 9% only. flat cost with parking 3.89 cr plus taxes plus possession charger. middle floor. north door. arey and oberoi woods facing. 53% paymemt due. 1% transfer charge with buyer. total cost around 4.20 cr approx plus possession charges. rahul soni
['direct', 'flat', 'oberoi', 'esquire', '3', 'bhk', '2195', 'saleable', '1330', 'carpet', 'rate', '14500', 'final', 'plus', '1', 'floor', 'rise', 'tax', 'approx', '9', 'flat', 'cost', 'parking', '389', 'cr', 'plus', 'taxes', 'plus', 'possession', 'charger', 'middle', 'floor', 'north', 'door', 'arey', 'oberoi', 'woods', 'facing', '53', 'paymemt', 'due', '1', 'transfer', 'charge', 'buyer', 'total', 'cost', 'around', '420', 'cr', 'approx', 'plus', 'possession', 'charges', 'rahul', 'soni']
I am using the combined tagger described in the NLTK book, chapter 5.
Here is the code:
t0 = nltk.DefaultTagger('NN')
t1 = nltk.UnigramTagger(train_sents, backoff=t0)
t2 = nltk.BigramTagger(train_sents, backoff=t1)
Since the default tagger tags every token as NN, every token that falls back to t0 will be tagged NN. The book says this can be resolved with the following method:
Our approach to tagging unknown words still uses backoff to a regular-expression tagger or a default tagger. These are unable to make use of context. Thus, if our tagger encountered the word blog, not seen during training, it would assign it the same tag, regardless of whether this word appeared in the context the blog or to blog. How can we do better with these unknown words, or out-of-vocabulary items?
A useful method to tag unknown words based on context is to limit the vocabulary of a tagger to the most frequent n words, and to replace every other word with a special word UNK using the method shown in 3. During training, a unigram tagger will probably learn that UNK is usually a noun. However, the n-gram taggers will detect contexts in which it has some other tag. For example, if the preceding word is to (tagged TO), then UNK will probably be tagged as a verb.
I have written the method shown in section 3 of that chapter, which maps every word outside the 1000 most frequent ones to UNK:
>>> alice = nltk.corpus.gutenberg.words('carroll-alice.txt')
>>> vocab = nltk.FreqDist(alice)
>>> v1000 = [word for (word, _) in vocab.most_common(1000)]
>>> mapping = defaultdict(lambda: 'UNK')
>>> for v in v1000:
... mapping[v] = v
...
>>> alice2 = [mapping[v] for v in alice]
>>> alice2[:100]
['UNK', 'Alice', "'", 's', 'UNK', 'in', 'UNK', 'by', 'UNK', 'UNK', 'UNK',
'UNK', 'CHAPTER', 'I', '.', 'UNK', 'the', 'Rabbit', '-', 'UNK', 'Alice',
'was', 'beginning', 'to', 'get', 'very', 'tired', 'of', 'sitting', 'by',
'her', 'sister', 'on', 'the', 'UNK', ',', 'and', 'of', 'having', 'nothing',
'to', 'do', ':', 'once', 'or', 'twice', 'she', 'had', 'UNK', 'into', 'the',
'book', 'her', 'sister', 'was', 'UNK', ',', 'but', 'it', 'had', 'no',
'pictures', 'or', 'UNK', 'in', 'it', ',', "'", 'and', 'what', 'is', 'the',
'use', 'of', 'a', 'book', ",'", 'thought', 'Alice', "'", 'without',
'pictures', 'or', 'conversation', "?'" ...]
>>> len(set(alice2))
1001
My question is: how do I implement this method in the combined taggers? Where do I put the new mapping dictionary (mapping in this example) in the combined taggers?
You should be replacing the tags, not the words themselves. Based on the code you shared, something like:
mapped_unk = [(w[0], 'UNK') if i%2 else (w[0], w[1]) for i, w in enumerate(tagged)]
where I assume tagged is a tagged version of alice such that each input word has been mapped to a tuple (word, tag) like [('Alice', 'NN'), ('tagged', 'VT'), ('her', 'PRON'), ('corpus', 'N'), ('.', 'PUNC')] (with some guesswork around your tagging conventions).
nltk has routines for tagging a piece of text, but it's better if you can find a human-vetted tagged text to train on; the included corpora feature a number of such texts. The tagged training set should obviously use the tagging conventions (tag set, tokenization, etc) you desire for the trained tagger to learn; clearly the availability of tagged corpora to train on will constrain your choices if you can't manually produce a large enough tagged corpus for your preferred conventions.
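To illustrate that last point, here is a minimal sketch (my own, not part of the answer) of training the same backoff chain on one of NLTK's included human-tagged corpora:
import nltk
from nltk.corpus import brown

# use a tagged corpus shipped with NLTK as training data
tagged_sents = brown.tagged_sents(categories='news')
split = int(len(tagged_sents) * 0.9)
train_sents, test_sents = tagged_sents[:split], tagged_sents[split:]

t0 = nltk.DefaultTagger('NN')
t1 = nltk.UnigramTagger(train_sents, backoff=t0)
t2 = nltk.BigramTagger(train_sents, backoff=t1)
print(t2.evaluate(test_sents))  # about 0.84 in the NLTK book; newer NLTK versions call this accuracy()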
I'm trying to use regular expressions to split text on punctuation, but only when the punctuation follows a word and precedes a space or the end of the string.
I've tried ([a-zA-Z])([,;.-])(\s|$)
But when I split in Python with this pattern, it also captures and splits off the last character of the word.
I want to split it like this:
text = 'Mr.Smith is a professor at Harvard, and is a great guy.'
splits = ['Mr.Smith', 'is', 'a', 'professor', 'at', 'Harvard', ',', 'and', 'is', 'a', 'great', 'guy', '.']
Any help would be greatly appreciated!
It seems you want to tokenize. Try NLTK:
http://text-processing.com/demo/tokenize/
from nltk.tokenize import TreebankWordTokenizer
splits = TreebankWordTokenizer().tokenize(text)
You may use
re.findall(r'\w+(?:\.\w+)*|[^\w\s]', s)
See the regex demo.
Details
\w+(?:\.\w+)* - 1+ word chars followed by 0 or more occurrences of a dot followed by 1+ word chars
| - or
[^\w\s] - any char other than a word char or a whitespace char.
Python demo:
import re
rx = r"\w+(?:\.\w+)*|[^\w\s]"
s = "Mr.Smith is a professor at Harvard, and is a great guy."
print(re.findall(rx, s))
Output: ['Mr.Smith', 'is', 'a', 'professor', 'at', 'Harvard', ',', 'and', 'is', 'a', 'great', 'guy', '.'].
This approach can be refined further, e.g. to tokenize numbers (with an optional sign and decimal point) and letter-only words (possibly joined by dots) separately, while treating underscores as punctuation:
re.findall(r'[+-]?\d*\.?\d+|[^\W\d_]+(?:\.[^\W\d_]+)*|[^\w\s]|_', s)
See the regex demo
You can first split on ([.,](?=\s)|\s) and then filter out empty or blank strings:
In [16]: filter(lambda s: not re.match(r'\s*$', s), re.split(r'([.,](?=\s)|\s)', 'Mr.Smith is a professor at Harvard, and is a great guy.'))
Out[16]:
['Mr.Smith',
'is',
'a',
'professor',
'at',
'Harvard',
',',
'and',
'is',
'a',
'great',
'guy.']
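In Python 3, filter() returns an iterator rather than a list, so a stand-alone equivalent (my adaptation of the same idea) would be:
import re

s = 'Mr.Smith is a professor at Harvard, and is a great guy.'
parts = re.split(r'([.,](?=\s)|\s)', s)
print([p for p in parts if p and not p.isspace()])
# ['Mr.Smith', 'is', 'a', 'professor', 'at', 'Harvard', ',', 'and', 'is', 'a', 'great', 'guy.']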
What I want to do is to split a text into its ultimate elements.
For example:
from nltk.tokenize import *
txt = "A sample sentences with digits like 2.119,99 or 2,99 are awesome."
regexp_tokenize(txt, pattern='(?:(?!\d)\w)+|\S+')
['A','sample','sentences','with','digits','like','2.119,99','or','2,99','are','awesome','.']
You can see it works fine. My problem is: what happens if the number is at the end of the text?
txt = "Today it's 07.May 2011. Or 2.999."
regexp_tokenize(txt, pattern='(?:(?!\d)\w)+|\S+')
['Today', 'it', "'s", '07.May', '2011.', 'Or', '2.999.']
The result should be:
['Today', 'it', "'s", '07.May', '2011','.', 'Or', '2.999','.']
What I have to do to get the result above?
I created a pattern that keeps periods and commas occurring inside words and numbers, while splitting off trailing punctuation (using a non-capturing group, since RegexpTokenizer patterns should not contain capturing parentheses). Hope this helps:
txt = "Today it's 07.May 2011. Or 2.999."
regexp_tokenize(txt, pattern=r'\w+(?:[.,]\w+)*|\S+')
['Today', 'it', "'s", '07.May', '2011', '.', 'Or', '2.999', '.']
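For what it's worth, the same pattern also appears to handle the original example sentence from the question:
from nltk.tokenize import regexp_tokenize

txt = "A sample sentences with digits like 2.119,99 or 2,99 are awesome."
print(regexp_tokenize(txt, pattern=r'\w+(?:[.,]\w+)*|\S+'))
# ['A', 'sample', 'sentences', 'with', 'digits', 'like', '2.119,99', 'or', '2,99', 'are', 'awesome', '.']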