tweaking and creating meaningful spans with spacy - python

I am trying to create a spaCy doc in which the following is tweaked:
import spacy
nlp = spacy.load('en_core_web_sm')
text = """In the drawing is possible to see table 10. B including only 3 legs"""
doc = nlp(text)
print(f'There are {len(list(doc.sents))} sentences')
This gives two sentences.
But I don't want sentences to be split in cases like these:
10. B or 10-C or 10_C.
That is, any number followed by a period and a single letter should never mark a sentence boundary, nor a token boundary: '10. B' and '10-C' are two examples of tokens I want kept whole.
How do I achieve that?
With regex infixes. When that regex matches in the text, I don't want any token boundary there.
import re
from spacy.tokenizer import Tokenizer
from spacy.util import compile_prefix_regex, compile_infix_regex, compile_suffix_regex
text = 'In the drawing is possible to see table 10. B including only 3 legs.'
nlp1 = spacy.load('en_core_web_sm')
nlp2 = spacy.load('en_core_web_sm')
doc1 = nlp1(text)
print(f'There are {len(list(doc1.sents))} sentences')
print([token.text for token in doc1])
def custom_tokenizer(nlp):
    # any number of digits followed by "." and a single letter
    infix_re = re.compile(r'''(\d+)(\.\s|\s\.|\.)([A-Za-z]\b)''')
    return Tokenizer(nlp.vocab, infix_finditer=infix_re.finditer,
                     token_match=None)
nlp2.tokenizer = custom_tokenizer(nlp1)
doc2 = nlp2(text)
print([token.text for token in doc2])
This is the result:
['In', 'the', 'drawing', 'is', 'possible', 'to', 'see', 'table', '10', '.', 'B', 'including', 'only', '3', 'legs', '.']
['In', 'the', 'drawing', 'is', 'possible', 'to', 'see', 'table', '10.', 'B', 'including', 'only', '3', 'legs.']
Indeed it does not join the letter 'B', i.e. it never generates a single '10.B' token.
Why?
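For what it's worth, the infix (and prefix/suffix) patterns only control where the tokenizer splits; spaCy splits on whitespace first, so no infix rule can merge '10.' and 'B' across the space. Below is a minimal sketch of one possible workaround, merging the span after tokenization with Doc.retokenize (the regex is my assumption about what such a reference looks like, not part of the question):
import re
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp('In the drawing is possible to see table 10. B including only 3 legs.')
# Assumed pattern: digits, then '.', '-' or '_', an optional space, and a single letter.
ref_re = re.compile(r'\d+[.\-_]\s?[A-Za-z]\b')
with doc.retokenize() as retokenizer:
    for match in ref_re.finditer(doc.text):
        span = doc.char_span(match.start(), match.end())
        if span is not None and len(span) > 1:  # only merge spans that line up with token boundaries
            retokenizer.merge(span)
print([token.text for token in doc])
# e.g. [..., 'table', '10. B', 'including', ...]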

Related

How to remove punctuations using NLTK? [duplicate]

I'm just starting to use NLTK and I don't quite understand how to get a list of words from text. If I use nltk.word_tokenize(), I get a list of words and punctuation. I need only the words instead. How can I get rid of punctuation? Also word_tokenize doesn't work with multiple sentences: dots are added to the last word.
Take a look at the other tokenizing options that nltk provides here. For example, you can define a tokenizer that picks out sequences of alphanumeric characters as tokens and drops everything else:
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')
tokenizer.tokenize('Eighty-seven miles to go, yet. Onward!')
Output:
['Eighty', 'seven', 'miles', 'to', 'go', 'yet', 'Onward']
You do not really need NLTK to remove punctuation. You can remove it with plain Python. For (byte) strings in Python 2:
import string
s = '... some string with punctuation ...'
s = s.translate(None, string.punctuation)
Or for Unicode strings (this form also works in Python 3):
import string
translate_table = dict((ord(char), None) for char in string.punctuation)
s.translate(translate_table)
and then use this string in your tokenizer.
P.S. The string module has some other sets of characters that can be removed as well (such as digits).
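For instance, a quick sketch of feeding the cleaned string into an NLTK tokenizer (Python 3, using the str.maketrans form of the same idea):
import string
from nltk.tokenize import word_tokenize
s = 'Some string, with punctuation!'
clean = s.translate(str.maketrans('', '', string.punctuation))
print(word_tokenize(clean))  # ['Some', 'string', 'with', 'punctuation']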
The code below will remove all punctuation marks as well as non-alphabetic characters. Adapted from the NLTK book:
http://www.nltk.org/book/ch01.html
import nltk
s = "I can't do this now, because I'm so tired. Please give me some time. # sd 4 232"
words = nltk.word_tokenize(s)
words=[word.lower() for word in words if word.isalpha()]
print(words)
Output:
['i', 'ca', 'do', 'this', 'now', 'because', 'i', 'so', 'tired', 'please', 'give', 'me', 'some', 'time', 'sd']
As noted in the comments, start with sent_tokenize(), because word_tokenize() works only on a single sentence. You can filter out punctuation with filter(). And if you have Unicode strings, make sure they are unicode objects (not 'str' encoded with an encoding like 'utf-8').
from nltk.tokenize import word_tokenize, sent_tokenize
text = '''It is a blue, small, and extraordinary ball. Like no other'''
tokens = [word for sent in sent_tokenize(text) for word in word_tokenize(sent)]
print(list(filter(lambda word: word not in ',-', tokens)))
I just used the following code, which removed all the punctuation:
tokens = nltk.wordpunct_tokenize(raw)
type(tokens)
text = nltk.Text(tokens)
type(text)
words = [w.lower() for w in text if w.isalpha()]
Sincerely asking, what is a word? If your assumption is that a word consists of alphabetic characters only, you are wrong, since words such as can't will be split into pieces (such as can and t) if you remove punctuation before tokenisation, which is very likely to affect your program negatively.
Hence the solution is to tokenise and then remove punctuation tokens.
import string
from nltk.tokenize import word_tokenize
tokens = word_tokenize("I'm a southern salesman.")
# ['I', "'m", 'a', 'southern', 'salesman', '.']
tokens = list(filter(lambda token: token not in string.punctuation, tokens))
# ['I', "'m", 'a', 'southern', 'salesman']
...and then if you wish, you can replace certain tokens such as 'm with am.
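For instance, a small sketch of that optional replacement step (the replacements mapping is just a hypothetical example):
import string
from nltk.tokenize import word_tokenize
tokens = word_tokenize("I'm a southern salesman.")
tokens = list(filter(lambda token: token not in string.punctuation, tokens))
# Hypothetical mapping for expanding the contraction fragments left by word_tokenize.
replacements = {"'m": 'am', "n't": 'not', "'re": 'are'}
tokens = [replacements.get(token, token) for token in tokens]
print(tokens)  # ['I', 'am', 'a', 'southern', 'salesman']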
I think you need some sort of regular expression matching (the following code is in Python 3):
import string
import re
import nltk
s = "I can't do this now, because I'm so tired. Please give me some time."
l = nltk.word_tokenize(s)
ll = [x for x in l if not re.fullmatch('[' + string.punctuation + ']+', x)]
print(l)
print(ll)
Output:
['I', 'ca', "n't", 'do', 'this', 'now', ',', 'because', 'I', "'m", 'so', 'tired', '.', 'Please', 'give', 'me', 'some', 'time', '.']
['I', 'ca', "n't", 'do', 'this', 'now', 'because', 'I', "'m", 'so', 'tired', 'Please', 'give', 'me', 'some', 'time']
Should work well in most cases since it removes punctuation while preserving tokens like "n't", which can't be obtained from regex tokenizers such as wordpunct_tokenize.
I use this code to remove punctuation:
import nltk
def getTerms(sentences):
    tokens = nltk.word_tokenize(sentences)
    words = [w.lower() for w in tokens if w.isalnum()]
    print(tokens)
    print(words)
getTerms("hh, hh3h. wo shi 2 4 A . fdffdf. A&&B ")
And if you want to check whether a token is a valid English word or not, you may need PyEnchant.
Tutorial:
import enchant
d = enchant.Dict("en_US")
d.check("Hello")
d.check("Helo")
d.suggest("Helo")
You can do it in one line without NLTK (Python 3.x).
import string
string_text = string_text.translate(str.maketrans('', '', string.punctuation))
Just adding to the solution by @rmalouf: the version below will not include any numbers, because \w+ is equivalent to [a-zA-Z0-9_].
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'[a-zA-Z]+')
tokenizer.tokenize('Eighty-seven miles to go, yet. Onward!')
Remove punctuation (this will remove '.' as well, as part of the punctuation handling in the code below):
import sys
import unicodedata
from nltk.tokenize import word_tokenize
tbl = dict.fromkeys(i for i in range(sys.maxunicode) if unicodedata.category(chr(i)).startswith('P'))
text_string = text_string.translate(tbl)  # text_string now has no punctuation
w = word_tokenize(text_string)  # now tokenize the string
Sample Input/Output:
direct flat in oberoi esquire. 3 bhk 2195 saleable 1330 carpet. rate of 14500 final plus 1% floor rise. tax approx 9% only. flat cost with parking 3.89 cr plus taxes plus possession charger. middle floor. north door. arey and oberoi woods facing. 53% paymemt due. 1% transfer charge with buyer. total cost around 4.20 cr approx plus possession charges. rahul soni
['direct', 'flat', 'oberoi', 'esquire', '3', 'bhk', '2195', 'saleable', '1330', 'carpet', 'rate', '14500', 'final', 'plus', '1', 'floor', 'rise', 'tax', 'approx', '9', 'flat', 'cost', 'parking', '389', 'cr', 'plus', 'taxes', 'plus', 'possession', 'charger', 'middle', 'floor', 'north', 'door', 'arey', 'oberoi', 'woods', 'facing', '53', 'paymemt', 'due', '1', 'transfer', 'charge', 'buyer', 'total', 'cost', 'around', '420', 'cr', 'approx', 'plus', 'possession', 'charges', 'rahul', 'soni']

Keeping punctuation as its own unit in Preprocessed Text

What is the code to split a sentence into a list of its constituent words AND punctuation? Most text preprocessing programs tend to remove punctuation.
For example, if I enter this:
"Punctuations to be included as its own unit."
The desired output would be:
result = ['Punctuations', 'to', 'be', 'included', 'as', 'its', 'own', 'unit', '.']
many thanks!
You might want to consider using the Natural Language Toolkit, or NLTK.
Try this:
import nltk
sentence = "Punctuations to be included as its own unit."
tokens = nltk.word_tokenize(sentence)
print(tokens)
Output: ['Punctuations', 'to', 'be', 'included', 'as', 'its', 'own', 'unit', '.']
The following snippet uses a regular expression to separate the words and punctuation into a list.
import string
import re
punctuations = string.punctuation
regularExpression = r"[\w]+|[" + re.escape(punctuations) + "]"
content="Punctuations to be included as its own unit."
splittedWords_Puncs = re.findall(regularExpression, content)
print(splittedWords_Puncs)
Output: ['Punctuations', 'to', 'be', 'included', 'as', 'its', 'own', 'unit', '.']

How to remove ' in strings with RegexpTokenizer

from nltk.tokenize import RegexpTokenizer
text="That's some text, you know!"
tokens=[]
tokenizer = RegexpTokenizer(r'\w+')
tokens+=tokenizer.tokenize(text.lower())
Currently returns: text = ['that', 's', 'some', 'text', 'you', 'know']
I need it to return: text = ['thats', 'some', 'text', 'you', 'know'] (with "thats" as one word).
There are 2 solutions. Either you want to preprocess your text variable with:
text = text.replace("'", "")
or you want to match "that's" as a single word with this modification:
tokenizer = RegexpTokenizer(r'[\w\']+')

Only words or numbers re pattern. Tokenize with CountVectorizer

I'm using scikit-learn's CountVectorizer in Python to tokenize sentences and at the same time filter out non-existent words like "1s2".
Which re pattern should I use to select only English words and numbers? The following regex pattern gets me pretty close:
pattern = '(?u)(?:\\b[a-zA-Z]+\\b)*(?:\\b[\d]+\\b)*'
vectorizer = CountVectorizer(ngram_range=(1, 1),
                             stop_words=None,
                             token_pattern=pattern)
tokenize = vectorizer.build_tokenizer()
tokenize('this is a test test1 and 12.')
['this', '', 'is', '', 'a', '', 'test', '', '', '', '', '', '', '', '', 'and', '', '12', '', '']
but I can't understand why it gives me so many empty list items ('').
Also, how can I keep the punctuation? In the end I would like a result like this:
tokenize('this is a test test1 and 12.')
['this','is','a','test','and','12','.']
I do not know whether sklearn's CountVectorizer can do it in one step (token_pattern is overwritten by tokenizer, I think), but you can do the following (based on this answer):
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import TreebankWordTokenizer
import re
vectorizer = CountVectorizer(ngram_range=(1, 1), stop_words=None,
                             tokenizer=TreebankWordTokenizer().tokenize)
tokenize = vectorizer.build_tokenizer()
tokenList = tokenize('this is a test test1 and 12.')
# ['this', 'is', 'a', 'test', 'test1', 'and', '12', '.']
# Remove any token that (i) does not consist of letters or (ii) is a punctuation mark
tokenList = [token for token in tokenList if re.match(r'^([a-zA-Z]+|\d+|\W)$', token)]
# ['this', 'is', 'a', 'test', 'and', '12', '.']
EDIT:
I forgot to tell you why your pattern doesn't work.
"The default regexp select tokens of 2 or more alphanumeric characters (punctuation is completely ignored and always treated as a token separator)." (this is how sklearn's token_pattern works). So punctuation marks are completely ignored.
Your pattern (?u)(?:\\b[a-zA-Z]+\\b)*(?:\\b[\d]+\\b)* is actually saying: 'interpret as Unicode, word boundaries with letters in between (or not, because of the first *), and word boundaries with digits in between (or not, because of the second *)'. Because of all those 'or not's, the empty string '' also matches, which is where all the empty list items come from.
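For what it's worth, a hedged one-step sketch: the empty strings above suggest that build_tokenizer simply runs re.findall with token_pattern, so a pattern built from plain alternatives (rather than optional groups) appears to give the desired list directly; treat this as an assumption to verify rather than documented CountVectorizer behaviour.
from sklearn.feature_extraction.text import CountVectorizer
# Letters-only words, digits-only numbers, or single punctuation characters.
pattern = r'(?u)\b[a-zA-Z]+\b|\b\d+\b|[^\w\s]'
vectorizer = CountVectorizer(ngram_range=(1, 1), stop_words=None,
                             token_pattern=pattern)
tokenize = vectorizer.build_tokenizer()
print(tokenize('this is a test test1 and 12.'))
# ['this', 'is', 'a', 'test', 'and', '12', '.']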
