I'm just starting to use NLTK and I don't quite understand how to get a list of words from text. If I use nltk.word_tokenize(), I get a list of words and punctuation, but I need only the words. How can I get rid of the punctuation? Also, word_tokenize doesn't seem to work with multiple sentences: dots get attached to the last word.
Take a look at the other tokenizing options that nltk provides here. For example, you can define a tokenizer that picks out sequences of alphanumeric characters as tokens and drops everything else:
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')
tokenizer.tokenize('Eighty-seven miles to go, yet. Onward!')
Output:
['Eighty', 'seven', 'miles', 'to', 'go', 'yet', 'Onward']
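If you would rather keep contractions such as can't together as single tokens, a pattern along these lines may work (my own variant, not part of the original answer):
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r"\w+'\w+|\w+")
tokenizer.tokenize("Eighty-seven miles to go, yet. Don't stop!")
# ['Eighty', 'seven', 'miles', 'to', 'go', 'yet', "Don't", 'stop']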
You do not really need NLTK to remove punctuation; you can do it with plain Python. For Python 2 byte strings:
import string
s = '... some string with punctuation ...'
s = s.translate(None, string.punctuation)
Or for unicode strings (this is also the form that works with Python 3's str):
import string
translate_table = dict((ord(char), None) for char in string.punctuation)
s = s.translate(translate_table)
and then use this string in your tokenizer.
P.S. The string module has some other sets of characters that can be stripped the same way (such as string.digits).
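For example, a small sketch in the same translate-table style that drops digits as well as punctuation (the sample string is my own):
import string
s = "Call me at 555-1234, maybe?"
# map both punctuation and digit code points to None
translate_table = dict((ord(char), None) for char in string.punctuation + string.digits)
print(s.translate(translate_table))  # 'Call me at  maybe'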
The code below removes all punctuation marks as well as non-alphabetic tokens. Adapted from the NLTK book:
http://www.nltk.org/book/ch01.html
import nltk
s = "I can't do this now, because I'm so tired. Please give me some time. # sd 4 232"
words = nltk.word_tokenize(s)
words=[word.lower() for word in words if word.isalpha()]
print(words)
Output:
['i', 'ca', 'do', 'this', 'now', 'because', 'i', 'so', 'tired', 'please', 'give', 'me', 'some', 'time', 'sd']
As noted in the comments, start with sent_tokenize(), because word_tokenize() works only on a single sentence. You can filter out punctuation with filter(). And if you have unicode strings, make sure they are unicode objects (not 'str' encoded with some encoding like 'utf-8').
from nltk.tokenize import word_tokenize, sent_tokenize
text = '''It is a blue, small, and extraordinary ball. Like no other'''
tokens = [word for sent in sent_tokenize(text) for word in word_tokenize(sent)]
print(list(filter(lambda word: word not in ',-', tokens)))
I just used the following code, which removed all the punctuation:
import nltk
raw = "Some raw text, with punctuation!"  # assuming raw holds your input text
tokens = nltk.wordpunct_tokenize(raw)
type(tokens)
text = nltk.Text(tokens)
type(text)
words = [w.lower() for w in text if w.isalpha()]
Sincerely asking: what is a word? If your assumption is that a word consists of alphabetic characters only, you are wrong, since words such as can't will be broken into pieces (such as can and t) if you remove punctuation before tokenisation, which is very likely to affect your program negatively.
Hence the solution is to tokenise and then remove punctuation tokens.
import string
from nltk.tokenize import word_tokenize
tokens = word_tokenize("I'm a southern salesman.")
# ['I', "'m", 'a', 'southern', 'salesman', '.']
tokens = list(filter(lambda token: token not in string.punctuation, tokens))
# ['I', "'m", 'a', 'southern', 'salesman']
...and then if you wish, you can replace certain tokens such as 'm with am.
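For instance, a minimal sketch of that last replacement step (the mapping dict below is my own, not something NLTK provides):
import string
from nltk.tokenize import word_tokenize

# hypothetical map from clitic tokens to full forms
EXPAND = {"'m": 'am', "'ll": 'will', "'ve": 'have', "'re": 'are'}

tokens = word_tokenize("I'm a southern salesman.")
tokens = [t for t in tokens if t not in string.punctuation]
tokens = [EXPAND.get(t, t) for t in tokens]
# ['I', 'am', 'a', 'southern', 'salesman']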
I think you need some sort of regular expression matching (the following code is in Python 3):
import string
import re
import nltk
s = "I can't do this now, because I'm so tired. Please give me some time."
l = nltk.word_tokenize(s)
ll = [x for x in l if not re.fullmatch('[' + string.punctuation + ']+', x)]
print(l)
print(ll)
Output:
['I', 'ca', "n't", 'do', 'this', 'now', ',', 'because', 'I', "'m", 'so', 'tired', '.', 'Please', 'give', 'me', 'some', 'time', '.']
['I', 'ca', "n't", 'do', 'this', 'now', 'because', 'I', "'m", 'so', 'tired', 'Please', 'give', 'me', 'some', 'time']
Should work well in most cases since it removes punctuation while preserving tokens like "n't", which can't be obtained from regex tokenizers such as wordpunct_tokenize.
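For comparison (a quick check; the sample sentence is my own), wordpunct_tokenize splits on the apostrophe itself, so the "n't" clitic is lost:
from nltk.tokenize import wordpunct_tokenize
print(wordpunct_tokenize("I can't do this now."))
# ['I', 'can', "'", 't', 'do', 'this', 'now', '.']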
I use this code to remove punctuation:
import nltk
def getTerms(sentences):
    tokens = nltk.word_tokenize(sentences)
    words = [w.lower() for w in tokens if w.isalnum()]
    print(tokens)
    print(words)

getTerms("hh, hh3h. wo shi 2 4 A . fdffdf. A&&B ")
And if you want to check whether a token is a valid English word or not, you may need PyEnchant.
Tutorial:
import enchant
d = enchant.Dict("en_US")
d.check("Hello")
d.check("Helo")
d.suggest("Helo")
You can do it in one line without NLTK (Python 3.x):
import string
string_text = "some text, with punctuation!"  # your input string
string_text = string_text.translate(str.maketrans('', '', string.punctuation))
Just adding to @rmalouf's solution: this variant will not include any numbers, since \w+ is equivalent to [a-zA-Z0-9_] and therefore also matches digits:
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'[a-zA-Z]+')
tokenizer.tokenize('Eighty-seven miles to go, yet. Onward!')
Remove punctuation (this strips '.' as well, as part of the general Unicode punctuation handling), then tokenize:
import sys, unicodedata
from nltk.tokenize import word_tokenize
# text_string holds your input text (see the sample input below)
tbl = dict.fromkeys(i for i in range(sys.maxunicode) if unicodedata.category(chr(i)).startswith('P'))
text_string = text_string.translate(tbl)  # text_string no longer contains punctuation
w = word_tokenize(text_string)  # now tokenize the string
Sample Input/Output:
direct flat in oberoi esquire. 3 bhk 2195 saleable 1330 carpet. rate of 14500 final plus 1% floor rise. tax approx 9% only. flat cost with parking 3.89 cr plus taxes plus possession charger. middle floor. north door. arey and oberoi woods facing. 53% paymemt due. 1% transfer charge with buyer. total cost around 4.20 cr approx plus possession charges. rahul soni
['direct', 'flat', 'oberoi', 'esquire', '3', 'bhk', '2195', 'saleable', '1330', 'carpet', 'rate', '14500', 'final', 'plus', '1', 'floor', 'rise', 'tax', 'approx', '9', 'flat', 'cost', 'parking', '389', 'cr', 'plus', 'taxes', 'plus', 'possession', 'charger', 'middle', 'floor', 'north', 'door', 'arey', 'oberoi', 'woods', 'facing', '53', 'paymemt', 'due', '1', 'transfer', 'charge', 'buyer', 'total', 'cost', 'around', '420', 'cr', 'approx', 'plus', 'possession', 'charges', 'rahul', 'soni']
I am relatively new to NLP, so please be gentle. I have a complete list of the text from Trump's tweets since taking office and I am tokenizing the text to analyze the content.
I am using the TweetTokenizer from the nltk library in python and I'm trying to get everything tokenized except for numbers and punctuation. The problem is that my code removes all the tokens except one.
I have tried using the .isalpha() method, but this did not work as I expected, since it should only be True for strings composed of alphabetic characters.
#Create a content from the tweets
text= non_re['text']
#Make all text in lowercase
low_txt= [l.lower() for l in text]
#Iteratively tokenize the tweets
TokTweet= TweetTokenizer()
tokens = [TokTweet.tokenize(t) for t in low_txt
          if t.isalpha()]
My output from this is just one token.
If I remove the if t.isalpha() condition then I get all of the tokens, including numbers and punctuation, suggesting that isalpha() is to blame for the over-trimming.
What I would like, is a way to get the tokens from the tweet text without punctuation and numbers.
Thanks for your help!
Try something like the following:
import string
import re
import nltk
from nltk.tokenize import TweetTokenizer
tweet = "first think another Disney movie, might good, it's kids movie. watch it, can't help enjoy it. ages love movie. first saw movie 10 8 years later still love it! Danny Glover superb could play"
def clean_text(text):
    # remove numbers
    text_nonum = re.sub(r'\d+', '', text)
    # remove punctuation and convert characters to lower case
    text_nopunct = "".join([char.lower() for char in text_nonum if char not in string.punctuation])
    # substitute multiple whitespace with single whitespace
    # also removes leading and trailing whitespace
    text_no_doublespace = re.sub(r'\s+', ' ', text_nopunct).strip()
    return text_no_doublespace
cleaned_tweet = clean_text(tweet)
tt = TweetTokenizer()
print(tt.tokenize(cleaned_tweet))
Output:
['first', 'think', 'another', 'disney', 'movie', 'might', 'good', 'its', 'kids', 'movie', 'watch', 'it', 'cant', 'help', 'enjoy', 'it', 'ages', 'love', 'movie', 'first', 'saw', 'movie', 'years', 'later', 'still', 'love', 'it', 'danny', 'glover', 'superb', 'could', 'play']
# Function for removing punctuation from a text file; it also reports the total number of punctuation marks removed.
# Input: the function takes an existing file name and a new file name as strings, i.e. 'existingFileName.txt' and 'newFileName.txt'.
# Return: it returns two things, the punctuation-free file opened in read mode and a punctuation count.
def removePunctuation(tokenizeSampleText, newFileName):
    from nltk.tokenize import word_tokenize
    existingFile = open(tokenizeSampleText, 'r')
    read_existingFile = existingFile.read()
    tokenize_existingFile = word_tokenize(read_existingFile)
    puncRemovedFile = open(newFileName, 'w+')
    import string
    stringPun = list(string.punctuation)
    count_pun = 0
    for word in tokenize_existingFile:
        if word in stringPun:
            count_pun += 1
        else:
            word = word + ' '
            puncRemovedFile.write(''.join(word))
    existingFile.close()
    puncRemovedFile.close()
    return open(newFileName, 'r'), count_pun
punRemoved, punCount = removePunctuation('Macbeth.txt', 'Macbeth-punctuationRemoved.txt')
print(f'Total Punctuation : {punCount}')
punRemoved.read()
I would like to tokenize a list of sentences, but keep negated verbs as single words.
t = """As aren't good. Bs are good"""
print(word_tokenize(t))
['As', 'are', "n't", 'good', '.', 'Bs', 'are', 'good']
I would like "aren't" to come out as its own token, distinct from "are", but with word_tokenize I get "are" plus "n't". The same goes for other negated forms (couldn't, didn't, etc.).
How can I do it?
Thanks in advance
If you want to extract individual words from a space-separated sentence, use Python's split() method.
t = "As aren't good. Bs are good"
print (t.split())
['As', "aren't", 'good.', 'Bs', 'are', 'good']
You can specify other delimiters in the split() method as well. For example, if you wanted to tokenize your string based on a full-stop, you could do something like this:
print (t.split("."))
["As aren't good", ' Bs are good']
Read the documentation here.
Use re.split() from the re module: https://docs.python.org/2/library/re.html
import re
t = "As aren't good. Bs are good"
print(list(filter(None, re.split(r"[\s+.]", t))))
Output:
['As', "aren't", 'good', 'Bs', 'are', 'good']
I am trying to parse strings in such a way as to separate out all word components, even those that have been contracted. For example the tokenization of "shouldn't" would be ["should", "n't"].
The nltk module does not seem to be up to the task however as:
"I wouldn't've done that."
tokenizes as:
['I', "wouldn't", "'ve", 'done', 'that', '.']
where the desired tokenization of "wouldn't've" was: ['would', "n't", "'ve"]
After examining common English contractions, I am trying to write a regex to do the job but I am having a hard time figuring out how to match "'ve" only once. For example, the following tokens can all terminate a contraction:
n't, 've, 'd, 'll, 's, 'm, 're
But the token "'ve" can also follow other contractions such as:
'd've, n't've, and (conceivably) 'll've
At the moment, I am trying to wrangle this regex:
\b[a-zA-Z]+(?:('d|'ll|n't)('ve)?)|('s|'m|'re|'ve)\b
However, this pattern also matches the badly formed:
"wouldn't've've"
It seems the problem is that the third apostrophe qualifies as a word boundary so that the final "'ve" token matches the whole regex.
I have been unable to think of a way to differentiate a word boundary from an apostrophe and, failing that, I am open to advice for alternative strategies.
Also, I am curious if there is any way to include the word boundary special character in a character class. According to the Python documentation, \b in a character class matches a backspace and there doesn't seem to be a way around this.
EDIT:
Here's the output:
>>>pattern = re.compile(r"\b[a-zA-Z]+(?:('d|'ll|n't)('ve)?)|('s|'m|'re|'ve)\b")
>>>matches = pattern.findall("She'll wish she hadn't've done that.")
>>>print matches
[("'ll", '', ''), ("n't", "'ve", ''), ('', '', "'ve")]
I can't figure out the third match. In particular, I just realized that if the third apostrophe were matching the leading \b, then I don't know what would be matching the character class [a-zA-Z]+.
You can use the following set of regexes:
import re
patterns_list = [r'\s',r'(n\'t)',r'\'m',r'(\'ll)',r'(\'ve)',r'(\'s)',r'(\'re)',r'(\'d)']
pattern=re.compile('|'.join(patterns_list))
s="I wouldn't've done that."
print([i for i in pattern.split(s) if i])
Result:
['I', 'would', "n't", "'ve", 'done', 'that.']
You can try this regex:
(?<!['"\w])(['"])?([a-zA-Z]+(?:('d|'ll|n't)('ve)?|('s|'m|'re|'ve)))(?(1)\1|(?!\1))(?!['"\w])
EDIT: \2 is the match, \3 is the first group, \4 the second and \5 the third.
You can use this regex to tokenize the text:
(?:(?!.')\w)+|\w?'\w+|[^\s\w]
Usage:
>>> re.findall(r"(?:(?!.')\w)+|\w?'\w+|[^\s\w]", "I wouldn't've done that.")
['I', 'would', "n't", "'ve", 'done', 'that', '.']
>>> import nltk
>>> nltk.word_tokenize("I wouldn't've done that.")
['I', "wouldn't", "'ve", 'done', 'that', '.']
so:
>>> from itertools import chain
>>> [nltk.word_tokenize(i) for i in nltk.word_tokenize("I wouldn't've done that.")]
[['I'], ['would', "n't"], ["'ve"], ['done'], ['that'], ['.']]
>>> list(chain(*[nltk.word_tokenize(i) for i in nltk.word_tokenize("I wouldn't've done that.")]))
['I', 'would', "n't", "'ve", 'done', 'that', '.']
Here's a simple one:
text = ' ' + text.lower() + ' '
text = text.replace(" won't ", ' will not ').replace("n't ", ' not ') \
.replace("'s ", ' is ').replace("'m ", ' am ') \
.replace("'ll ", ' will ').replace("'d ", ' would ') \
.replace("'re ", ' are ').replace("'ve ", ' have ')
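For example, applied to a short sentence and then tokenized (the sample text, the print call, and the use of word_tokenize here are my own additions):
from nltk.tokenize import word_tokenize

text = "She'll win. They've gone."
text = ' ' + text.lower() + ' '
text = text.replace("'ll ", ' will ').replace("'ve ", ' have ')  # the full chain above covers the other forms
print(word_tokenize(text))
# ['she', 'will', 'win', '.', 'they', 'have', 'gone', '.']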