Regular expressions (regexp): why does this work? - python

So, I am trying to get my head around regexps. The first pattern doesn't give me the result I want, but the second one does, and I can't make sense of why that is.
I am trying to tokenize the sentence,
from nltk.tokenize import RegexpTokenizer

text = 'The interest does not exceed 8.25%.'
pattern = r'\w+|\d+\.\d+\%|[^\w+\s]+'
tokenizer = RegexpTokenizer(pattern)
tokenizer.tokenize(text)
This gives me
['The', 'interest', 'does', 'not', 'exceed', '8', '.', '25', '%']
And I want
['The', 'interest', 'does', 'not', 'exceed', '8.25%']
I get my result with,
pattern = r'\d+\.\d+\%|\w+|[^\w+\s]+'
Why does the second pattern work when the first doesn't? Shouldn't both work?

The issue is that \w matches letters, digits, and underscores. Since \w+ comes first among your OR'd alternatives, it takes priority.
['The', 'interest', 'does', 'not', 'exceed', '8', '.', '25', '%']
\w+ \w+ \w+ \w+ \w+ \w+ [^\w\s]+ \w+ [^\w\s]+
The second alternative never gets a chance to match, because its input is partly consumed by the first one. Reorder the alternatives so the most specific one comes first:
r'\d+\.\d+\%|\w+|[^\w\s]+'
Just a quick test with the basic re module:
text = 'The interest does not exceed 8.25%.'
pattern = r'\d+\.\d+%|\w+|[^\w\s]+'
print(re.findall(pattern,text))
prints:
['The', 'interest', 'does', 'not', 'exceed', '8.25%', '.']
(note that you don't have to escape %)
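To see the ordering effect directly, here is a quick side-by-side with the plain re module (RegexpTokenizer, which wraps re, should behave the same way); this uses the simplified [^\w\s] class from the corrected pattern:

```python
import re

text = 'The interest does not exceed 8.25%.'

# \w+ listed first: it consumes the "8" before the decimal
# alternative ever gets a chance to try
broad_first = re.findall(r'\w+|\d+\.\d+%|[^\w\s]+', text)
print(broad_first)
# ['The', 'interest', 'does', 'not', 'exceed', '8', '.', '25', '%.']
# (the final '%.' comes out as one token here because both
# characters fall into the [^\w\s] class)

# decimal alternative listed first: "8.25%" matches as one token
specific_first = re.findall(r'\d+\.\d+%|\w+|[^\w\s]+', text)
print(specific_first)
# ['The', 'interest', 'does', 'not', 'exceed', '8.25%', '.']
```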

Related

How do I tokenize to separate apostrophes?

I have a function to tokenize a given text; however, the problem is that it separates "isn't", "o'brian", and similar words at the apostrophe. This is the code:
from typing import List
import re
def tokenize(text: str) -> List[str]:
    return re.sub(r'(\W+)', r' \1 ', text.lower()).split()
When I put in something like
"He said 'Isn't O'Brian the best?'"
it'll come out as
['he', 'said', "'", 'isn', "'", 't', 'o', "'", 'brian', 'the', 'best', "?'"]
when what I want is
['he', 'said', "'", "isn't", "o'brian", 'the', 'best', '?', "'"]
I'm really lost; I've tried more than just re.sub to separate the words, and tried splitting them, but it seemingly is not working.
You need a better definition of "word", for example:
word_re = r"\b[A-Za-z]+(?:'[A-Za-z]+)?\b"
that is, a word boundary, then some letters, then optionally an apostrophe, followed by letters. Once you've got it, use findall to extract words:
words = [w.lower() for w in re.findall(word_re, text)]
If the apostrophes aren't important to the meaning of the word, i.e. "obrien" and "o'brien" are the same for your purposes, you could do some basic text pre-processing to remove all the apostrophes with text.replace("'", "").
If they do matter, you could swap them for a placeholder before tokenizing and translate them back afterwards, e.g. text.replace("'", "U+0027") before you tokenize (U+0027 being the Unicode code point for '), and text.replace("U+0027", "'") after.
Hope that helps, good luck!
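For reference, running the word_re suggestion above on the example sentence looks like this (note that findall extracts only the words, so the standalone quote and question mark from the desired output are dropped):

```python
import re

word_re = r"\b[A-Za-z]+(?:'[A-Za-z]+)?\b"
text = "He said 'Isn't O'Brian the best?'"

# findall keeps the apostrophes inside words intact
words = [w.lower() for w in re.findall(word_re, text)]
print(words)
# ['he', 'said', "isn't", "o'brian", 'the', 'best']
```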

Error when creating a simple custom dynamic tokenizer in Python

I am trying to create a dynamic tokenizer, but it does not work as intended.
Below is my code:
import re

def tokenize(sent):
    splitter = re.findall(r"\W", sent)
    splitter = list(set(splitter))
    for i in sent:
        if i in splitter:
            sent.replace(i, "<SPLIT>" + i + "<SPLIT>")
    sent.split('<SPLIT>')
    return sent
sent = "Who's kid are you? my ph. is +1-6466461022.Bye!"
tokens = tokenize(sent)
print(tokens)
This does not work!
I expected it to return the below list:
["Who", "'s", "kid", "are", "you", "?", "my", "ph", ".", "is", "+", "1", "-", "6466461022", ".", "Bye", "!"]
This would be pretty trivial if it weren't for the special treatment of the '. I'm assuming you're doing NLP, so you want to take into account which "side" the ' belongs to. For instance, "tryin'" should not be split and neither should "'tis" (it is).
import re

def tokenize(sent):
    split_pattern = r"(\w+')(?:\W+|$)|('\w+)|(?:\s+)|(\W)"
    return [word for word in re.split(split_pattern, sent) if word]
sent = (
    "Who's kid are you? my ph. is +1-6466461022.Bye!",
    "Tryin' to show how the single quote can belong to either side",
    "'tis but a regex thing + don't forget EOL testin'",
    "You've got to love regex",
)
for item in sent:
    print(tokenize(item))
The Python re library evaluates patterns containing | from left to right, and alternation is not greedy: it stops as soon as one alternative matches, even if a later alternative would have produced a longer match.
Furthermore, a feature of the re.split() function is that you can use match groups to retain the patterns/matches you're splitting at (otherwise the string is split and the matches where the splits happen are dropped).
Pattern breakdown:
(\w+')(?:\W+|$) - words followed by a ' with no word characters immediately following it. E.g., "tryin'", "testin'". Don't capture the non-word characters.
('\w+) - ' followed by at least one word character. Will match "'t" and "'ve" in "don't" and "they've", respectively.
(?:\s+) - split on any whitespace, but discard the whitespace itself
(\W) - split on all non-word characters (no need to bother finding the subset that's present in the string itself)
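The group-retention feature of re.split mentioned above can be seen in a minimal case:

```python
import re

# without a capture group, the separators are thrown away
no_group = re.split(r"\W+", "don't stop")
print(no_group)    # ['don', 't', 'stop']

# with a capture group, the separators are kept in the result
with_group = re.split(r"(\W+)", "don't stop")
print(with_group)  # ['don', "'", 't', ' ', 'stop']
```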
You can use
[x for x in re.split(r"([^'\w\s]|'(?![^\W\d_])|(?<![^\W\d_])')|(?='(?<=[^\W\d_]')(?=[^\W\d_]))|\s+", sent) if x]
See the regex demo. The pattern matches
( - start of Group 1 (because these texts are captured into a group, the matches appear in the resulting list):
[^'\w\s] - any char other than ', word and whitespace char
| - or
'(?![^\W\d_]) - a ' not immediately followed with a letter ([^\W\d_] matches any Unicode letter)
| - or
(?<![^\W\d_])' - a ' not immediately preceded with a letter
) - end of the group
| - or
(?='(?<=[^\W\d_]')(?=[^\W\d_])) - a location right before a ' char that is enclosed with letters
| - or
\s+ - one or more whitespace chars.
See the Python demo:
import re
sents = ["Who's kid are you? my ph. is +1-6466461022.Bye!", "Who's kid are you? my ph. is +1-6466461022.'Bye!'"]
for sent in sents:
    print( [x for x in re.split(r"([^'\w\s]|'(?![^\W\d_])|(?<![^\W\d_])')|(?='(?<=[^\W\d_]')(?=[^\W\d_]))|\s+", sent) if x] )
# => ['Who', "'s", 'kid', 'are', 'you', '?', 'my', 'ph', '.', 'is', '+', '1', '-', '6466461022', '.', 'Bye', '!']
# => ['Who', "'s", 'kid', 'are', 'you', '?', 'my', 'ph', '.', 'is', '+', '1', '-', '6466461022', '.', "'", 'Bye', '!', "'"]

Regex to parse queries with quoted substrings and return nested lists of individual words

I'm trying to write a regex that takes a string of words containing quoted substrings, like "green lizards" like to sit "in the sun", and tokenizes it into words and quoted substrings (using either single or double quotes) separated by spaces. It should return a list like [['green', 'lizards'], 'like', 'to', 'sit', ['in', 'the', 'sun']], where the items are either single words or nested lists of words wherever a quoted substring was encountered.
I am new to regex, and was able to find a solution that captures the quoted parts: re.findall('"([^"]*)"', '"green lizards" like to sit "in the sun"') ... which returns: ['green lizards', 'in the sun']
But this doesn't capture the individual words, and also doesn't tokenize the quoted parts (it returns a single string instead of a list of words, so I have to split() each one separately).
How would I make a regex that correctly returns the type of list I'm wanting? Also, I'm open to better methods/tools than regex for parsing these sorts of strings if anyone has suggestions.
Thanks!
With re.findall() function and built-in str methods:
import re
s = '"green lizards" like to sit "in the sun"'
result = [i.replace('"', "").split() if i.startswith('"') else i
          for i in re.findall(r'"[^"]+"|\S+', s)]
print(result)
The output:
[['green', 'lizards'], 'like', 'to', 'sit', ['in', 'the', 'sun']]
Another approach (supporting both single and double quotes):
import re
sentence = """"green lizards" like to sit "in the sun" and 'single quotes' remain alone"""
rx = re.compile(r"""(['"])(.*?)\1|\S+""")
tokens = [m.group(2).split()
          if m.group(2) else m.group(0)
          for m in rx.finditer(sentence)]
print(tokens)
Yielding
[['green', 'lizards'], 'like', 'to', 'sit', ['in', 'the', 'sun'], 'and', ['single', 'quotes'], 'remain', 'alone']
The idea here is:
(['"]) # capture a single or a double quote
(.*?) # 0+ characters lazily
\1 # up to the same type of quote previously captured
| # ...or...
\S+ # not a whitespace
In the list comprehension we check which condition was met.
You can use re.split and then a final str.split:
import re
s = '"green lizards" like to sit "in the sun"'
new_s = [[i[1:-1].split()] if i.startswith('"') else i.split()
         for i in re.split(r'(?<=")\s|\s(?=")', s)]
last_result = [i for b in new_s for i in b]
print(last_result)
Output:
[['green', 'lizards'], 'like', 'to', 'sit', ['in', 'the', 'sun']]
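Since the question is open to non-regex tools: the standard library's shlex module does quote-aware splitting out of the box. One caveat: it discards the quotes, so a quoted single word can no longer be distinguished from a plain word; this sketch therefore nests only multi-word phrases:

```python
import shlex

s = '"green lizards" like to sit "in the sun"'

# shlex keeps quoted phrases together as single items (quotes removed)
parts = shlex.split(s)
print(parts)   # ['green lizards', 'like', 'to', 'sit', 'in the sun']

# turn the multi-word phrases into nested lists
result = [p.split() if ' ' in p else p for p in parts]
print(result)  # [['green', 'lizards'], 'like', 'to', 'sit', ['in', 'the', 'sun']]
```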

Regex in Python to match words with special characters

I have this code
import re
str1 = "These should be counted as a single-word, b**m !?"
match_pattern = re.findall(r'\w{1,15}', str1)
print(match_pattern)
I want the output to be:
['These', 'should', 'be', 'counted', 'as', 'a', 'single-word', 'b**m']
The output should exclude non-words such as "!?". What other validation should I use to match and achieve the desired output?
I would use word boundaries (\b) around one or more non-space characters:
match_pattern = re.findall(r'\b\S+\b', str1)
result:
['These', 'should', 'be', 'counted', 'as', 'a', 'single-word', 'b**m']
"!?" is skipped thanks to word-boundary magic: \b doesn't consider it part of a word at all.
Probably you want something like [^\s.!?] instead of \w, but exactly what you want is not evident from a single example. [^...] matches a single character that is not one of those between the brackets, and \s matches whitespace characters (space, tab, newline, etc.).
You can also achieve a similar result without regex:
string = "These should be counted as a single-word, b**m !?"
replacements = ['.', ',', '?', '!']
for replacement in replacements:
    if replacement in string:
        string = string.replace(replacement, "")
print(string.split())
# ['These', 'should', 'be', 'counted', 'as', 'a', 'single-word', 'b**m']
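The same character stripping can also be done in a single pass with str.maketrans and str.translate, which avoids the loop:

```python
s = "These should be counted as a single-word, b**m !?"

# the third argument to str.maketrans lists characters to delete
table = str.maketrans('', '', '.,?!')
print(s.translate(table).split())
# ['These', 'should', 'be', 'counted', 'as', 'a', 'single-word', 'b**m']
```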

python: splitting strings where numbers and letters meet (1234abcd-->1234, abcd)

I have a string composed of numbers and letters, string = 'this1234is5678it', and I would like to split it into the list ['this', '1234', 'is', '5678', 'it'], i.e. at every point where numbers and letters meet. Is there an easy way to do this?
You can use Regex for this.
import re
s = 'this1234is5678it'
print(re.split(r'(\d+)', s))
Running example http://ideone.com/JsSScE
Outputs ['this', '1234', 'is', '5678', 'it']
Update
Steve Rumbalski mentioned in a comment the importance of the parentheses in the regex. He quotes from the documentation:

"If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list."

Without the parentheses the result would be ['this', 'is', 'it'].
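The difference the capturing parentheses make can be checked directly:

```python
import re

s = 'this1234is5678it'

# no capture group: the digit runs used as separators are dropped
print(re.split(r'\d+', s))    # ['this', 'is', 'it']

# capture group: the digit runs are kept in the result
print(re.split(r'(\d+)', s))  # ['this', '1234', 'is', '5678', 'it']
```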
