Why inconsistent regular expression "\bpattern\b" behavior in Python? - python

I am using Python 3 to demonstrate. There is an example string:
a = "learning is learn and elearn"
s = "#wen is # and wen#"
I want to do exact match of "learn" and "#", i.e., not extracting learning (or #wen) or elearn (or wen#). Therefore, I should get 'learn' and '#'.
re.findall(r'\blearn\b', a) # works
['learn']
or
re.sub(r'\blearn\b', 'z', a) # works
'learning is z and elearn'
re.findall(r'\b#\b', s) # not working
[]
or
re.sub(r'\b#\b', 'z', s) # not working
'#wen is # and wen#'

From the docs:
\b Matches the empty string, but only at the beginning or end of a word. A word is defined as a sequence of alphanumeric or underscore characters, so the end of a word is indicated by whitespace or a non-alphanumeric, non-underscore character. Note that formally, \b is defined as the boundary between a \w and a \W character (or vice versa), or between \w and the beginning/end of the string
In your example, # is a nonalphanumeric (and non-underscore) character surrounded by other nonalphanumeric characters. Because there are no word characters, there is no word boundary, so \b will not match.

Related

Vowels not at the end or start of the words in string

I am trying to find the words in string not starting or ending with letters 'aıoueəiöü'. But regex fails to find words when I use this code:
txt = "Nasa has fixed a problem with malfunctioning equipment on a new rocket designed to take astronauts to the Moon."
re.findall(r"\b[^aıoueəiöü]\w+[^aıoueəiöü]\b",txt)
Instead, it works fine when whitespace character \s is added in negation part:
re.findall(r"\b[^aıoueəiöü\s]\w+[^aıoueəiöü\s]\b",txt)
I cannot understand the issue in first example of code, why should I specify whitespace characters too?
Note that [^aıoueəiöü] matches any char other than a, ı, o, u, e, ə, i, ö and ü. It can match a whitespace, a digit, punctuation, etc.
Also, you regex matches strings of at least three chars, you need to adjust it to match one and two char strings, too.
You do not have to rely on excluding whitespace from the pattern. Since you only want to match word chars other than vowels, add \W rather than \s:
\b[^\Waıoueəiöü](?:\w*[^\Waıoueəiöü])?\b
See the regex demo.
Details:
\b - a word boundary
[^\Waıoueəiöü] - any word char except a letter from the aıoueəiöü set
(?:\w*[^\Waıoueəiöü])? - an optional occurrence of
\w* - any zero or more word chars
[^\Waıoueəiöü] - any word char except a letter from the aıoueəiöü set
\b - a word boundary

Python regex: pattern with re.ASCII can still match unicode characters?

I am new to Python regex and am trying to match non-white space ASCII characters in Python.
The following is my code:
impore re
p = re.compile(r"[\S]{2,3}", re.ASCII)
p.search('1234') # have some result
p.search('你好吗') # also have result, but Why?
I have specified ASCII mode in re.compile, but p.search('你好吗') still have result. I wonder what I am doing wrong here?
The re.A flag only affects what shorthand character classes match.
In Python 3.x, shorthand character classes are Unicode aware, the Python 2.x re.UNICODE/re.U is ON by default. That means:
\d: Matches any Unicode decimal digit (that is, any character in Unicode character category [Nd])
\D: Matches any character which is not a decimal digit. (So, all characters other than those in the Nd Unicode category).
\w - Matches Unicode word characters; this includes most characters that can be part of a word in any language, as well as numbers and the underscore. (So, \w+ matches each word in a My name is Виктор string)
\W - Matches any character which is not a word character. This is the opposite of \w. (So, it will not match any Unicode letter or digit.)
\s - Matches Unicode whitespace characters (it will match NEL, hard spaces, etc.)
\S - Matches any character which is not a whitespace character. (So, no match for NEL, hard space, etc.)
\b - word boundaries match locations between Unicode letters/digits and non-letters/digits or start/end of string.
\B - non-word boundaries match locations between two Unicode letters/digits, two non-letters/digits or between a Unicode non-letter/digit and start/end of string.
If you want to disable this behavior, you use re.A or re.ASCII:
Make \w, \W, \b, \B, \d, \D, \s and \S perform ASCII-only matching instead of full Unicode matching. This is only meaningful for Unicode patterns, and is ignored for byte patterns. Corresponds to the inline flag (?a).
That means that:
\d = [0-9] - and no longer matches Hindi, Bengali, etc. digits
\D = [^0-9] - and matches any characters other than ASCII digits (i.e. it acts as (?u)(?![0-9])\d now)
\w = [A-Za-z0-9_] - and it only matches ASCII words now, Wiktor is matched with \w+, but Виктор does not
\W = [^A-Za-z0-9_] - it matches any char but ASCII letters/digits/_ (i.e. it matches 你好吗, Виктор, etc.
\s = [ \t\n\r\f\v] - matches a regular space, tab, linefeed, carriage return, form feed and a vertical tab
\S = [^ \t\n\r\f\v] - matches any char other than a space, tab, linefeed, carriage return, form feed and a vertical tab, so it matches all Unicode letters, digits and punctuation and Unicode (non-ASCII) whitespace. E.g., re.sub(r'\S+', r'{\g<0>}', '\xA0 ', flags=re.A) will return '{ } ', as you see, the \S now matches hard spaces.

Python regex: removing all special characters and numbers NOT attached to words

I am trying to remove all special characters and numbers in python, except numbers that are directly attached to words.
I have succeded in doing this for all cases of special characters and numbers attached and not attached to words, how to do it in such a way that numbers attached are not matched.
Here's what I did:
import regex as re
string = "win32 backdoor guid:64664646 DNS-lookup h0lla"
re.findall(r'[^\p{P}\p{S}\s\d]+', string.lower())
I get as output
win backdoor guid DNS lookup h lla
But I want to get:
win32 backdoor guid DNS lookup h0lla
demo: https://regex101.com/r/x4HrGo/1
To match alphanumeric strings or only letter words you may use the following pattern with re:
import re
# ...
re.findall(r'(?:[^\W\d_]+\d|\d+[^\W\d_])[^\W_]*|[^\W\d_]+', text.lower())
See the regex demo.
Details
(?:[^\W\d_]+\d|\d+[^\W\d_])[^\W_]*- either 1+ letters followed with a digit, or 1+ digits followed with a letter, and then 0+ letters/digits
| - or
[^\W\d_]+ - either any 1+ Unicode letters
NOTE It is equivalent to \d*[^\W\d_][^\W_]* pattern posted by PJProudhon, that matches any 1+ alphanumeric character chunks with at least 1 letter in them.
You could give a try to \b\d*[^\W\d_][^\W_]*\b
Decomposition:
\b # word boundary
/d* # zero or more digits
[^\W\d_] # one alphabetic character
[^\W_]* # zero or more alphanumeric characters
\b # word boundary
For beginners:
[^\W] is typical double negated construct. Here you want to match any character which is not alphanumeric or _ (\W is the negation of \w, which matches any alphanumeric character plus _ - common equivalent [a-zA-Z0-9_]).
It reveals useful here to compose:
Any alphanumeric character = [^\W_] matches any character which is not non-[alphanumeric or _] and is not _.
Any alphabetic character = [^\W\d_] matches any character which is not non-[alphanumeric or _] and is not digit (\d) and is not _.
Some further reading here.
Edit:
When _ is also considered a word delimiter, just skip the word boundaries, which toggle on that character, and use \d*[^\W\d_][^\W_]*.
Default greediness of star operator will ensure all relevant characters are actually matched.
Demo.
Try this RegEx instead:
([A-Za-z]+(\d)*[A-Za-z]*)
You can expand it from here, for example flipping the * and + on the first and last sets to capture string like "win32" and "01ex" equally.

Matching an apostrophe only within a word or string

I'm looking for a Python regex that can match 'didn't' and returns only the character that is immediately preceded by an apostrophe, like 't, but not the 'd or t' at the beginning and end.
I have tried (?=.*\w)^(\w|')+$ but it only matches the apostrophe at the beginning.
Some more examples:
'I'm' should only match 'm and not 'I
'Erick's' should only return 's and not 'E
The text will always start and end with an apostrophe and can include apostrophes within the text.
To match an apostrophe inside a whole string = match it anwyhere but at the start/end of the string:
(?!^)'(?!$)
See the regex demo.
Often, the apostophe is searched only inside a word (but in fact, a pair of words where the second one is shortened), then you may use
\b'\b
See this regex demo. Here, the ' is preceded and followed with a word boundary, so that ' could be preceded with any word, letter or _ char. Yes, _ char and digits are allowed to be on both sides.
If you need to match a ' only between two letters, use
(?<=[A-Za-z])'(?=[A-Za-z]) # ASCII only
(?<=[^\W\d_])'(?=[^\W\d_]) # Any Unicode letters
See this regex demo.
As for this current question, here is a bunch of possible solutions:
import re
s = "'didn't'"
print(s.strip("'")[s.strip("'").find("'")+1])
print(re.search(r'\b\'(\w)', s).group(1))
print(re.search(r'\b\'([^\W\d_])', s).group(1))
print(re.search(r'\b\'([a-z])', s, flags=re.I).group(1))
print(re.findall(r'\b\'([a-z])', "'didn't know I'm a student'", flags=re.I))
The s.strip("'")[s.strip("'").find("'")+1] gets the character after the first ' after stripping the leading/trailing apostrophes.
The re.search(r'\b\'(\w)', s).group(1) solution gets the word (i.e. [a-zA-Z0-9_], can be adjusted from here) char after a ' that is preceded with a word char (due to the \b word boundary).
The re.search(r'\b\'([^\W\d_])', s).group(1) is almost identical to the above solution, it only fetches a letter character as [^\W\d_] matches any char other than a non-word, digit and _.
Note that the re.search(r'\b\'([a-z])', s, flags=re.I).group(1) solution is next to identical to the above one, but you cannot make it Unicode aware with re.UNICODE.
The last re.findall(r'\b\'([a-z])', "'didn't know I'm a student'", flags=re.I) just shows how to fetch multiple letter chars from a string input.

regex match exact pattern within string

if I have the following string 'some numbers 66666666666666666667867866 and serial 151283917503423 and 8888888' and I want to find 15 digit numbers (so only 151283917503423) how do I make it so that it doesn't match the bigger number and also deal with the possibility that the string can just be '151283917503423' therefore I cannot identify it by it possibly containing spaces on both sides?
serial = re.compile('[0-9]{15}')
serial.findall('some numbers 66666666666666666667867866 and serial 151283917503423 and 8888888')
this returns both 66666666666666666667867866 and 151283917503423 but I only want the latter
Use word boundaries:
serial = re.compile(r'\b[0-9]{15}\b')
\b Matches the empty string, but only at the beginning or end of a
word. A word is defined as a sequence of alphanumeric or underscore
characters, so the end of a word is indicated by whitespace or a
non-alphanumeric, non-underscore character. Note that formally, \b is
defined as the boundary between a \w and a \W character (or vice
versa), or between \w and the beginning/end of the string, so the
precise set of characters deemed to be alphanumeric depends on the
values of the UNICODE and LOCALE flags. For example, r'\bfoo\b'
matches 'foo', 'foo.', '(foo)', 'bar foo baz' but not 'foobar' or
'foo3'. Inside a character range, \b represents the backspace
character, for compatibility with Python’s string literals.
You need to use word boundaries to ensure you don't match unwanted text on either side of your match:
>>> serial = re.compile(r'\b\d{15}\b')
>>> serial.findall('some numbers 66666666666666666667867866 and serial 151283917503423 and 8888888')
['151283917503423']
Include word boundaries. Let s be your string. You can use
>>> re.findall(r'\b\d{15}\b' ,s)
['151283917503423']
where \b asserts a word boundary (^\w|\w$|\W\w|\w\W)
Since word boundaries \b contain 2 assertions each, I would use a single assertion
instead.
(?<![0-9])[0-9]{15}(?![0-9])
should be quicker?

Categories