Python regex: tokenizing English contractions

I am trying to parse strings in such a way as to separate out all word components, even those that have been contracted. For example, the tokenization of "shouldn't" would be ["should", "n't"].
The nltk module does not seem to be up to the task, however, as:
"I wouldn't've done that."
tokenizes as:
['I', "wouldn't", "'ve", 'done', 'that', '.']
whereas the desired tokenization of "wouldn't've" is: ['would', "n't", "'ve"]
After examining common English contractions, I am trying to write a regex to do the job but I am having a hard time figuring out how to match "'ve" only once. For example, the following tokens can all terminate a contraction:
n't, 've, 'd, 'll, 's, 'm, 're
But the token "'ve" can also follow other contractions such as:
'd've, n't've, and (conceivably) 'll've
At the moment, I am trying to wrangle this regex:
\b[a-zA-Z]+(?:('d|'ll|n't)('ve)?)|('s|'m|'re|'ve)\b
However, this pattern also matches the badly formed:
"wouldn't've've"
It seems the problem is that the third apostrophe qualifies as a word boundary so that the final "'ve" token matches the whole regex.
I have been unable to think of a way to differentiate a word boundary from an apostrophe and, failing that, I am open to advice for alternative strategies.
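(For what it's worth, a quick zero-width check confirms that \b does fire right next to an apostrophe; the match between the apostrophe and the "v" is what lets a stray "'ve" qualify:)
>>> import re
>>> [m.span() for m in re.finditer(r'\b', "'ve")]
[(1, 1), (3, 3)]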
Also, I am curious if there is any way to include the word boundary special character in a character class. According to the Python documentation, \b in a character class matches a backspace and there doesn't seem to be a way around this.
EDIT:
Here's the output:
>>> pattern = re.compile(r"\b[a-zA-Z]+(?:('d|'ll|n't)('ve)?)|('s|'m|'re|'ve)\b")
>>> matches = pattern.findall("She'll wish she hadn't've done that.")
>>> print matches
[("'ll", '', ''), ("n't", "'ve", ''), ('', '', "'ve")]
I can't figure out the third match. In particular, I just realized that if the third apostrophe were matching the leading \b, then I don't know what would be matching the character class [a-zA-Z]+.

You can use the following list of regexes joined into a single pattern. Note that each contraction needs its own capture group so that re.split keeps it in the output:
import re
patterns_list = [r'\s', r"(n't)", r"('m)", r"('ll)", r"('ve)", r"('s)", r"('re)", r"('d)"]
pattern = re.compile('|'.join(patterns_list))
s = "I wouldn't've done that."
print([i for i in pattern.split(s) if i])
Result:
['I', 'would', "n't", "'ve", 'done', 'that.']
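Note that the trailing period stays glued on as 'that.' because no punctuation pattern is listed. One possible tweak (my addition, not part of the original answer) is to append a punctuation group:
patterns_list = [r'\s', r"(n't)", r"('m)", r"('ll)", r"('ve)", r"('s)", r"('re)", r"('d)", r'([.!?])']
pattern = re.compile('|'.join(patterns_list))
print([i for i in pattern.split("I wouldn't've done that.") if i])
# ['I', 'would', "n't", "'ve", 'done', 'that', '.']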

This pattern anchors the contraction between non-word, non-quote contexts and optionally allows a matching surrounding quote:
(?<!['"\w])(['"])?([a-zA-Z]+(?:('d|'ll|n't)('ve)?|('s|'m|'re|'ve)))(?(1)\1|(?!\1))(?!['"\w])
EDIT: group 2 is the whole match, group 3 is the first suffix group, group 4 the second and group 5 the third.
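A quick sanity check (my own test, not from the original answer); note that plain, uncontracted words are deliberately not matched by this pattern:
>>> import re
>>> p = re.compile(r"(?<!['\"\w])(['\"])?([a-zA-Z]+(?:('d|'ll|n't)('ve)?|('s|'m|'re|'ve)))(?(1)\1|(?!\1))(?!['\"\w])")
>>> [(m.group(2), m.group(3), m.group(4)) for m in p.finditer("She'll wish she hadn't've done that.")]
[("She'll", "'ll", None), ("hadn't've", "n't", "'ve")]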

You can use this regex to tokenize the text:
(?:(?!.')\w)+|\w?'\w+|[^\s\w]
The first alternative matches a run of word characters but stops one character short of any apostrophe; the second alternative then picks up that leftover character together with the apostrophe and the letters after it (which is what yields "n't"); the third matches any single punctuation character.
Usage:
>>> re.findall(r"(?:(?!.')\w)+|\w?'\w+|[^\s\w]", "I wouldn't've done that.")
['I', 'would', "n't", "'ve", 'done', 'that', '.']

>>> import nltk
>>> nltk.word_tokenize("I wouldn't've done that.")
['I', "wouldn't", "'ve", 'done', 'that', '.']
So a simple workaround is to tokenize twice and flatten the result:
>>> from itertools import chain
>>> [nltk.word_tokenize(i) for i in nltk.word_tokenize("I wouldn't've done that.")]
[['I'], ['would', "n't"], ["'ve"], ['done'], ['that'], ['.']]
>>> list(chain(*[nltk.word_tokenize(i) for i in nltk.word_tokenize("I wouldn't've done that.")]))
['I', 'would', "n't", "'ve", 'done', 'that', '.']

Here's a simple one. Note that it expands contractions into full words rather than splitting them off, and that "'s" is ambiguous (possessive vs. "is"):
text = ' ' + text.lower() + ' '
text = text.replace(" won't ", ' will not ').replace("n't ", ' not ') \
    .replace("'s ", ' is ').replace("'m ", ' am ') \
    .replace("'ll ", ' will ').replace("'d ", ' would ') \
    .replace("'re ", ' are ').replace("'ve ", ' have ')

Related

How to check for words that are not immediately followed by a keyword, and what about words not surrounded by the keyword?

I am trying to look for words that do not immediately come before 'the'.
I performed a positive lookbehind to get the words that come after the keyword 'the' ((?<=the\W)). However, I am unable to capture 'people' and 'that', as the above logic does not apply to those cases.
I am unable to take care of the words that do not have the keyword 'the' before and after them (for example, 'that' and 'people' in the sentence).
p = re.compile(r'(?<=the\W)\w+')
m = p.findall('the part of the fair that attracts the most people is the fireworks')
print(m)
The current output I am getting is:
['part', 'fair', 'most', 'fireworks']
Edit:
Thank you for all the help below. Using the suggestions in the comments, I managed to update my code.
p = re.compile(r"\b(?!the)(\w+)(\W\w+\Wthe)?")
m = p.findall('the part of the fair that attracts the most people is the fireworks')
This brings me closer to the output I need to get.
Updated Output:
[('part', ' of the'), ('fair', ''),
('that', ' attracts the'), ('most', ''),
('people', ' is the'), ('fireworks', '')]
I just need the strings ('part','fair','that','most','people','fireworks').
Any advice?
I am trying to look for words that do not immediately come before 'the'.
Note that the code below does not use re.
words = 'the part of the fair that attracts the most people is the fireworks'
words_list = words.split()
words_not_before_the = []
for idx, w in enumerate(words_list):
    if idx < len(words_list)-1 and words_list[idx + 1] != 'the':
        words_not_before_the.append(w)
words_not_before_the.append(words_list[-1])
print(words_not_before_the)
output
['the', 'part', 'the', 'fair', 'that', 'the', 'most', 'people', 'the', 'fireworks']
using regex:
import re
m = re.sub(r'\b(\w+)\b the', 'the', 'the part of the fair that attracts the most people is the fireworks')
print([word for word in m.split(' ') if not word.isspace() and word])
output:
['the', 'part', 'the', 'fair', 'that', 'the', 'most', 'people', 'the', 'fireworks']
I am trying to look for words that do not immediately come before 'the'.
Try this:
import re
# The capture group (\w+) matches a word that is followed by another word and then the word "the"
p = re.compile(r'(\w+)\W\w+\Wthe')
m = p.findall('the part of the fair that attracts the most people is the fireworks')
print(m)
Output:
['part', 'that', 'people']
Try to spin it around: instead of finding the words that do not immediately come before 'the', eliminate every word that does come immediately before 'the' (along with 'the' itself):
import re
test = "the part of the fair that attracts the most people is the fireworks"
pattern = r"\s\w*\sthe|the\s"
print(re.sub(pattern, "", test))
output: part fair that most people fireworks
I have finally solved the question. Thank you all!
p = re.compile(r"\b(?!the)(\w+)(?:\W\w+\Wthe)?")
m = p.findall('the part of the fair that attracts the most people is the fireworks')
print(m)
Adding '?:' makes the last group non-capturing, so findall returns only the word captured by the first group.
Output:
['part', 'fair', 'that', 'most', 'people', 'fireworks']
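For comparison, a negative lookahead can replace the optional group entirely (my own variation, not from the thread): skip 'the' itself and any word directly followed by 'the'.
>>> import re
>>> re.findall(r'\b(?!the\b)(\w+)\b(?!\s+the\b)', 'the part of the fair that attracts the most people is the fireworks')
['part', 'fair', 'that', 'most', 'people', 'fireworks']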

Find and split on certain characters that follow words

I'm trying to use regular expressions to split text on punctuation, but only when the punctuation follows a word and precedes a space or the end of the string.
I've tried ([a-zA-Z])([,;.-])(\s|$)
But when I split with it in Python, the captured letter swallows the last character of the word.
I want to split it like this:
text = 'Mr.Smith is a professor at Harvard, and is a great guy.'
splits = ['Mr.Smith', 'is', 'a', 'professor', 'at', 'Harvard', ',', 'and', 'is', 'a', 'great', 'guy', '.']
Any help would be greatly appreciated!
It seems you want to tokenize. Try nltk:
http://text-processing.com/demo/tokenize/
from nltk.tokenize import TreebankWordTokenizer
text = 'Mr.Smith is a professor at Harvard, and is a great guy.'
splits = TreebankWordTokenizer().tokenize(text)
You may use
re.findall(r'\w+(?:\.\w+)*|[^\w\s]', s)
Details
\w+(?:\.\w+)* - one or more word chars, followed by zero or more occurrences of a dot and one or more word chars
| - or
[^\w\s] - any char other than a word char or a whitespace char.
Python demo:
import re
rx = r"\w+(?:\.\w+)*|[^\w\s]"
s = "Mr.Smith is a professor at Harvard, and is a great guy."
print(re.findall(rx, s))
Output: ['Mr.Smith', 'is', 'a', 'professor', 'at', 'Harvard', ',', 'and', 'is', 'a', 'great', 'guy', '.'].
This approach can be refined further, e.g. to tokenize numbers and letter-only words, and to treat underscores as punctuation:
re.findall(r'[+-]?\d*\.?\d+|[^\W\d_]+(?:\.[^\W\d_]+)*|[^\w\s]|_', s)
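A quick check of that refined pattern (the sample string is my own, chosen to exercise numbers, punctuation and underscores):
>>> import re
>>> re.findall(r'[+-]?\d*\.?\d+|[^\W\d_]+(?:\.[^\W\d_]+)*|[^\w\s]|_', 'Version 2.5 costs $3, file_name')
['Version', '2.5', 'costs', '$', '3', ',', 'file', '_', 'name']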
You can first split on ([.,](?=\s)|\s) and then filter out empty or blank strings:
In [16]: filter(lambda s: not re.match(r'\s*$', s), re.split(r'([.,](?=\s)|\s)', 'Mr.Smith is a professor at Harvard, and is a great guy.'))
Out[16]:
['Mr.Smith',
'is',
'a',
'professor',
'at',
'Harvard',
',',
'and',
'is',
'a',
'great',
'guy.']
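Note this is a Python 2 session; in Python 3, filter returns an iterator, so wrap it in list() or use a comprehension. A minimal Python 3 sketch of the same idea:
import re
s = 'Mr.Smith is a professor at Harvard, and is a great guy.'
print([t for t in re.split(r'([.,](?=\s)|\s)', s) if t and not t.isspace()])
# ['Mr.Smith', 'is', 'a', 'professor', 'at', 'Harvard', ',', 'and', 'is', 'a', 'great', 'guy.']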

python re.split(): how to save some of the delimiters (instead of all of them, as with brackets)

For the sentence:
"I am very hungry, so mum brings me a cake!"
I want it split by delimiters, and I want all the delimiters except spaces to be kept as well. So the expected output is:
"I" "am" "very" "hungry" "," "so" "mum" "brings" "me" "a" "cake" "!" "\n"
What I am currently doing is re.split(r'([!:''".,(\s+)\n])', text), which splits the whole sentence but also keeps a lot of space characters which I don't want. I've also tried the regular expression \s|([!:''".,(\s+)\n]), which somehow gives me a lot of None values.
search or findall might be more appropriate here than split:
import re
s = "I am very hungry, so mum brings me a !#$## cake!"
print(re.findall(r'[^\w\s]+|\w+', s))
# ['I', 'am', 'very', 'hungry', ',', 'so', 'mum', 'brings', 'me', 'a', '!#$##', 'cake', '!']
The pattern [^\w\s]+|\w+ means: a sequence of symbols which are neither alphanumeric nor whitespace OR a sequence of alphanumerics (that is, a word)
That is because your regular expression contains a capture group; because of that capture group, the matched delimiters are included in the result as well. But this is likely what you want.
The only challenge is to filter out the None values (which appear whenever the ungrouped space alternative matched) and other values with truthiness False; we can do this with:
import re

def tokenize(text):
    return filter(None, re.split(r'[ ]+|([!:''".,\s\n])', text))
For your given sample text, this produces:
>>> list(tokenize("I am very hungry, so mum brings me a cake!\n"))
['I', 'am', 'very', 'hungry', ',', 'so', 'mum', 'brings', 'me', 'a', 'cake', '!', '\n']
One approach is to surround the special characters (, ! . \n) with spaces and then split on space:
import re
def tokenize(t, pattern="([,!.\n])"):
    return [e for e in re.sub(pattern, r" \1 ", t).split(' ') if e]
s = "I am very hungry, so mum brings me a cake!\n"
print(tokenize(s))
Output
['I', 'am', 'very', 'hungry', ',', 'so', 'mum', 'brings', 'me', 'a', 'cake', '!', '\n']

matching quoted strings and unquoted words

I am trying to write a regular expression to match either strings surrounded by double quotes (") or words separated by spaces, and collect them in a list in Python.
I don't really understand the output of my code, can anybody give me a hint or explain what my regular expression is doing exactly?
here is my code:
import re
regex = re.compile('(\"[^\"]*\")|( [^ ]* )')
test = '"hello world." here are some words. "and more"'
print(regex.split(test))
I expect an output like this:
['"hello world."', ' here ', ' are ', ' some ', ' words. ', '"and more"']
but I get the following:
['', '"hello world."', None, '', None, ' here ', 'are', None, ' some ', 'words.', None, ' "and ', 'more"']
Where do the empty strings and the Nones come from?
And why does it match "hello world." but not "and more"?
Thanks for your help, and a happy new year for those who celebrate it today!
EDIT:
To be precise: I don't need the surrounding spaces, but I do need the surrounding quotes. This output would be fine too:
['"hello world."', 'here', 'are', 'some', 'words.', '"and more"']
EDIT2:
I ended up using shlex.split() like @PadraicCunningham suggested because it did exactly what I need, and imho it is much more readable than regular expressions.
I still keep @TigerhawkT3's answer as the accepted one because it solves the problem in the way I asked it (with regular expressions).
Include the quoted match first so it prioritizes that, and then non-whitespace characters:
>>> s = '"hello world." here are some words. "and more"'
>>> re.findall(r'"[^"]*"|\S+', s)
['"hello world."', 'here', 'are', 'some', 'words.', '"and more"']
You can get the same result with a non-greedy repeating pattern instead of the character set negation:
>>> re.findall(r'".*?"|\S+', s)
['"hello world."', 'here', 'are', 'some', 'words.', '"and more"']
shlex.split with posix=False will do it for you:
import shlex
test = '"hello world." here are some words. "and more"'
print(shlex.split(test,posix=False))
['"hello world."', 'here', 'are', 'some', 'words.', '"and more"']
if you did not want the quotes, you would leave posix as True:
print(shlex.split(test))
['hello world.', 'here', 'are', 'some', 'words.', 'and more']
Looks like CSV, so use the appropriate tools:
import csv
lines = ['"hello world." here are some words. "and more"']
list(csv.reader(lines, delimiter=' ', quotechar='"'))
returns
[['hello world.', 'here', 'are', 'some', 'words.', 'and more']]
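Since csv.reader yields one row per input line, take the first row if you want a flat token list (a small follow-up on the same data):
import csv
lines = ['"hello world." here are some words. "and more"']
row = next(csv.reader(lines, delimiter=' ', quotechar='"'))
print(row)
# ['hello world.', 'here', 'are', 'some', 'words.', 'and more']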

Python title() with apostrophes

Is there a way to use .title() to get the correct output from a title with apostrophes? For example:
"john's school".title() --> "John'S School"
How would I get the correct title here, "John's School" ?
If your titles do not contain several whitespace characters in a row (which would be collapsed), you can use string.capwords() instead:
>>> import string
>>> string.capwords("john's school")
"John's School"
EDIT: As Chris Morgan rightfully says below, you can alleviate the whitespace collapsing issue by specifying " " in the sep argument:
>>> string.capwords("john's school", " ")
"John's School"
This is difficult in the general case, because some single apostrophes are legitimately followed by an uppercase character, such as Irish names starting with "O'". string.capwords() will work in many cases, but ignores anything in quotes. string.capwords("john's principal says,'no'") will not return the result you may be expecting.
>>> capwords("John's School")
"John's School"
>>> capwords("john's principal says,'no'")
"John's Principal Says,'no'"
>>> capwords("John O'brien's School")
"John O'brien's School"
A more annoying issue is that title itself does not produce the proper results. For example, in American English usage, articles and prepositions are generally not capitalized in titles or headlines (Chicago Manual of Style).
>>> capwords("John clears school of spiders")
'John Clears School Of Spiders'
>>> "John clears school of spiders".title()
'John Clears School Of Spiders'
You can install the titlecase module (easy_install titlecase, or these days pip install titlecase), which will be much more useful to you, and does what you like, without capwords's issues. There are still many edge cases, of course, but you'll get much further without worrying too much about a personally-written version.
>>> titlecase("John clears school of spiders")
'John Clears School of Spiders'
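The function has to be imported first; a minimal sketch (assuming the package's usual entry point):
>>> from titlecase import titlecase
>>> titlecase("john's school")
"John's School"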
I think that can be tricky with title(). Let's try something different:
def titlize(s):
    b = []
    for temp in s.split(' '):
        b.append(temp.capitalize())
    return ' '.join(b)

titlize("john's school")
# You get: John's School
Hope that helps!
Although the other answers are helpful and more concise, you may run into some problems with them, for example if there are newlines or tabs in your string. Also, hyphenated words (whether with regular or non-breaking hyphens) may be a problem in some instances, as may words that begin with apostrophes. However, using regular expressions (with a function as the replacement argument) you can solve these problems:
import re

def title_capitalize(match):
    text = match.group()
    i = 0
    new_text = ""
    capitalized = False
    while i < len(text):
        if text[i] not in {"’", "'"} and capitalized == False:
            new_text += text[i].upper()
            capitalized = True
        else:
            new_text += text[i].lower()
        i += 1
    return new_text

def title(the_string):
    return re.sub(r"[\w'’‑-]+", title_capitalize, the_string)
s="here's an apostrophe es. this string has multiple spaces\nnew\n\nlines\nhyphenated words: and non-breaking   spaces, and a non‑breaking hyphen, as well as 'ords that begin with ’strophies; it\teven\thas\t\ttabs."
print(title(s))
Anyway, you can edit this to compensate for any further problems, such as backticks and what-have-you, if needed.
If you're of the opinion that title casing should keep prepositions, conjunctions and articles lowercase unless they're at the beginning or end of the title, you can try code such as the following (but there are a few ambiguous words that you'll have to figure out by context, such as "when"):
import re

lowers = {'this', 'upon', 'altogether', 'whereunto', 'across', 'between', 'and', 'if', 'as', 'over', 'above', 'afore', 'inside', 'like', 'besides', 'on', 'atop', 'about', 'toward', 'by', 'these', 'for', 'into', 'beforehand', 'unlike', 'until', 'in', 'aft', 'onto', 'to', 'vs', 'amid', 'towards', 'afterwards', 'notwithstanding', 'unto', 'while', 'next', 'including', 'thru', 'a', 'down', 'after', 'with', 'afterward', 'or', 'those', 'but', 'whereas', 'versus', 'without', 'off', 'among', 'because', 'some', 'against', 'before', 'around', 'of', 'under', 'that', 'except', 'at', 'beneath', 'out', 'amongst', 'the', 'from', 'per', 'mid', 'behind', 'along', 'outside', 'beyond', 'up', 'past', 'through', 'beside', 'below', 'during'}

def title_capitalize(match, use_lowers=True):
    text = match.group()
    lower = text.lower()
    if lower in lowers and use_lowers == True:
        return lower
    else:
        i = 0
        new_text = ""
        capitalized = False
        while i < len(text):
            if text[i] not in {"’", "'"} and capitalized == False:
                new_text += text[i].upper()
                capitalized = True
            else:
                new_text += text[i].lower()
            i += 1
        return new_text

def title(the_string):
    first = re.sub(r"[\w'’‑-]+", title_capitalize, the_string)
    return re.sub(r"(^[\w'’‑-]+)|([\w'’‑-]+$)", lambda match: title_capitalize(match, use_lowers=False), first)
IMHO, the best answer is @Frédéric's. But if you already have your string separated into words, and you know how string.capwords is implemented, then you can avoid the unneeded join/split round trip:
def capwords(s, sep=None):
    return (sep or ' ').join(
        x.capitalize() for x in s.split(sep)
    )
As a result, you can just do this:
# here my_words == ['word1', 'word2', ...]
s = ' '.join(word.capitalize() for word in my_words)
If you have to cater for dashes then use:
import string
" ".join(
string.capwords(word, sep="-")
for word in string.capwords(
"john's school at bel-red"
).split()
)
# "John's School At Bel-Red"
