Python regex: pattern with re.ASCII can still match unicode characters?

I am new to Python regex and am trying to match non-white space ASCII characters in Python.
The following is my code:
import re
p = re.compile(r"[\S]{2,3}", re.ASCII)
p.search('1234') # returns a match
p.search('你好吗') # also returns a match, but why?
I have specified ASCII mode in re.compile, but p.search('你好吗') still returns a match. What am I doing wrong here?

The re.A flag only affects what shorthand character classes match.
In Python 3.x, shorthand character classes are Unicode-aware for str patterns: the behavior of the Python 2.x re.UNICODE/re.U flag is ON by default. That means:
\d - Matches any Unicode decimal digit (that is, any character in Unicode character category [Nd])
\D - Matches any character which is not a decimal digit. (So, all characters other than those in the Nd Unicode category).
\w - Matches Unicode word characters; this includes most characters that can be part of a word in any language, as well as numbers and the underscore. (So, \w+ matches each word in the string "My name is Виктор")
\W - Matches any character which is not a word character. This is the opposite of \w. (So, it will not match any Unicode letter or digit.)
\s - Matches Unicode whitespace characters (it will match NEL, hard spaces, etc.)
\S - Matches any character which is not a whitespace character. (So, no match for NEL, hard space, etc.)
\b - word boundaries match locations between Unicode letters/digits and non-letters/digits or start/end of string.
\B - non-word boundaries match locations between two Unicode letters/digits, two non-letters/digits or between a Unicode non-letter/digit and start/end of string.
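The Unicode-aware defaults listed above can be checked with a minimal sketch like the following. It assumes Python 3 and str (not bytes) patterns; the sample strings and variable names are illustrative only.

```python
import re

# Devanagari digits belong to the Unicode Nd category, so \d matches them.
digits = re.findall(r"\d+", "12 \u0967\u0968")  # second token is "१२"

# \w+ picks up Cyrillic words as well as ASCII ones.
words = re.findall(r"\w+", "My name is Виктор")

# U+00A0 (no-break space, a "hard space") counts as whitespace for \s.
nbsp_is_space = re.fullmatch(r"\s", "\u00a0") is not None

print(digits, words, nbsp_is_space)
```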
If you want to disable this behavior, you use re.A or re.ASCII:
Make \w, \W, \b, \B, \d, \D, \s and \S perform ASCII-only matching instead of full Unicode matching. This is only meaningful for Unicode patterns, and is ignored for byte patterns. Corresponds to the inline flag (?a).
That means that:
\d = [0-9] - and no longer matches Hindi, Bengali, etc. digits
\D = [^0-9] - and matches any character other than an ASCII digit, so it now also matches non-ASCII digits (the characters that (?u)(?![0-9])\d matches)
\w = [A-Za-z0-9_] - and it only matches ASCII words now, Wiktor is matched with \w+, but Виктор does not
\W = [^A-Za-z0-9_] - it matches any char but ASCII letters/digits/_ (i.e. it matches 你好吗, Виктор, etc.)
\s = [ \t\n\r\f\v] - matches a regular space, tab, linefeed, carriage return, form feed and a vertical tab
\S = [^ \t\n\r\f\v] - matches any char other than a space, tab, linefeed, carriage return, form feed and a vertical tab, so it matches all Unicode letters, digits and punctuation, as well as Unicode (non-ASCII) whitespace. E.g., re.sub(r'\S+', r'{\g<0>}', '\xA0 ', flags=re.A) will return '{\xa0} ': as you see, \S now matches the hard space. This is why [\S]{2,3} with re.ASCII still matches 你好吗.
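To make the consequence concrete, here is a small sketch contrasting \S under re.ASCII with an explicit ASCII range. The range [\x21-\x7E] (printable ASCII without the space) is an assumption about what "non-whitespace ASCII characters" means in the question, not something the answer above prescribes.

```python
import re

flagged = re.compile(r"\S{2,3}", re.ASCII)
# Chinese characters are not ASCII whitespace, so \S still accepts them.
cjk_hit = flagged.search("你好吗")

# Explicit range: "!" (0x21) through "~" (0x7E), printable ASCII minus space.
ascii_only = re.compile(r"[\x21-\x7E]{2,3}")
ascii_hit = ascii_only.search("1234")
cjk_miss = ascii_only.search("你好吗")

# Under re.ASCII, \d and \w stop matching non-ASCII digits and letters.
digits = re.findall(r"\d+", "12 \u0967\u0968", flags=re.ASCII)
words = re.findall(r"\w+", "Wiktor Виктор", flags=re.ASCII)

print(cjk_hit, ascii_hit, cjk_miss, digits, words)
```

The explicit range states the allowed characters positively, so nothing outside ASCII can slip in through a negated class.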

Related

Python regex: removing all special characters and numbers NOT attached to words

I am trying to remove all special characters and numbers in python, except numbers that are directly attached to words.
I have succeeded in doing this for all special characters and numbers, whether or not they are attached to words. How can I do it in such a way that numbers attached to words are not matched?
Here's what I did:
import regex as re
string = "win32 backdoor guid:64664646 DNS-lookup h0lla"
re.findall(r'[^\p{P}\p{S}\s\d]+', string.lower())
I get as output
win backdoor guid DNS lookup h lla
But I want to get:
win32 backdoor guid DNS lookup h0lla
demo: https://regex101.com/r/x4HrGo/1
To match alphanumeric strings or only letter words you may use the following pattern with re:
import re
# ...
re.findall(r'(?:[^\W\d_]+\d|\d+[^\W\d_])[^\W_]*|[^\W\d_]+', text.lower())
See the regex demo.
Details
(?:[^\W\d_]+\d|\d+[^\W\d_])[^\W_]*- either 1+ letters followed with a digit, or 1+ digits followed with a letter, and then 0+ letters/digits
| - or
[^\W\d_]+ - any 1+ Unicode letters
NOTE It is equivalent to \d*[^\W\d_][^\W_]* pattern posted by PJProudhon, that matches any 1+ alphanumeric character chunks with at least 1 letter in them.
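As a quick check of the alternation pattern described above, here is a sketch that runs it on the sample string from the question. The variable names are illustrative.

```python
import re

text = "win32 backdoor guid:64664646 DNS-lookup h0lla"

# Letters followed by a digit (or digits followed by a letter) plus any
# trailing letters/digits, or else a plain run of letters.
pattern = r"(?:[^\W\d_]+\d|\d+[^\W\d_])[^\W_]*|[^\W\d_]+"

tokens = re.findall(pattern, text.lower())
print(" ".join(tokens))
```

The standalone number 64664646 yields no match because neither branch can succeed without at least one letter.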
You could give a try to \b\d*[^\W\d_][^\W_]*\b
Decomposition:
\b # word boundary
\d* # zero or more digits
[^\W\d_] # one alphabetic character
[^\W_]* # zero or more alphanumeric characters
\b # word boundary
For beginners:
[^\W] is a typical double-negated construct: it matches any character that is not a non-word character, i.e. any word character (\W is the negation of \w, which matches any alphanumeric character plus _; the common ASCII equivalent of \w is [a-zA-Z0-9_]).
It proves useful here because further exclusions can be added to the same class:
Any alphanumeric character = [^\W_] matches any character which is not non-[alphanumeric or _] and is not _.
Any alphabetic character = [^\W\d_] matches any character which is not non-[alphanumeric or _] and is not digit (\d) and is not _.
Some further reading here.
Edit:
When _ is also considered a word delimiter, just skip the word boundaries, which toggle on that character, and use \d*[^\W\d_][^\W_]*.
Default greediness of star operator will ensure all relevant characters are actually matched.
Demo.
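The decomposition above maps directly onto a re.VERBOSE pattern. The sketch below also illustrates the point of the edit note: with word boundaries, a token glued to an underscore is not matched, while the boundary-free variant splits on it. The sample inputs are illustrative assumptions.

```python
import re

with_boundaries = re.compile(r"""
    \b          # word boundary
    \d*         # zero or more digits
    [^\W\d_]    # one alphabetic character
    [^\W_]*     # zero or more alphanumeric characters
    \b          # word boundary
""", re.VERBOSE)

without_boundaries = re.compile(r"\d*[^\W\d_][^\W_]*")

text = "win32 backdoor guid:64664646 dns-lookup h0lla"
tokens = with_boundaries.findall(text)

# "_" is a word character, so there is no \b between "foo" and "_".
underscore_strict = with_boundaries.findall("foo_bar")
underscore_loose = without_boundaries.findall("foo_bar")

print(tokens, underscore_strict, underscore_loose)
```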
Try this RegEx instead:
([A-Za-z]+(\d)*[A-Za-z]*)
You can expand it from here, for example by flipping the * and + on the first and last sets to capture strings like "win32" and "01ex" equally.
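A short sketch of how this ASCII-only pattern behaves on the question's sample string. Because the pattern contains two capturing groups, re.findall returns tuples, so the sketch uses finditer and group(0) to get whole matches; that choice is an assumption about how the pattern would be used, not part of the answer.

```python
import re

pattern = re.compile(r"([A-Za-z]+(\d)*[A-Za-z]*)")
text = "win32 backdoor guid:64664646 DNS-lookup h0lla"

# group(0) is the whole match, regardless of the capturing groups.
tokens = [m.group(0) for m in pattern.finditer(text)]

# findall returns one tuple per match: (outer group, last digit captured).
pairs = pattern.findall("win32")

# As written, leading digits are not captured.
leading_digits = [m.group(0) for m in pattern.finditer("01ex")]

print(tokens, pairs, leading_digits)
```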

how to remove trailing non-alpha characters

import re
s = 'Sarah Ruthers#6'
output = re.sub("[^\\w]", "", s)
print(output)
The above removes ALL non-alphanumeric characters; I simply want to remove any characters after the last alpha (letter-type) character, i.e. everything trailing the last letter.
i.e. Sarah Ruthers#6
to output simply:
Sarah Ruthers
My regex above outputs SarahRuthers (removing the space).
Anchor your pattern at the end, and use a correct character class:
output = re.sub(r"[\W\d_]+$", "", s)
That'll remove a single run of all non-letter characters at the end of the string; the $ anchor limits the range, and [\W\d_] properly matches non-letters, not just non-word characters (word characters include digits and the underscore character).
I also made the regex a raw string (which you should always do anyway for regex patterns), removing the need to double the backslashes.
Note that while [^a-zA-Z] could replace [\W\d_] for your specific case, I strongly recommend [\W\d_] over [^a-zA-Z] because the former is Unicode friendly, while the latter is not; for example if your text is 'résumé', using [^a-zA-Z] will strip the trailing é, [\W\d_] won't.
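The difference between the two character classes can be seen in a small sketch, assuming Python 3 str input. The 'résumé#6' sample is an illustrative variation on the example above.

```python
import re

s = "Sarah Ruthers#6"
trimmed = re.sub(r"[\W\d_]+$", "", s)

accented = "r\u00e9sum\u00e9#6"  # "résumé#6"

# Unicode-aware: é is a word character, so only "#6" is removed.
unicode_friendly = re.sub(r"[\W\d_]+$", "", accented)

# ASCII-only: the trailing é is treated as a non-letter and stripped too.
ascii_only = re.sub(r"[^a-zA-Z]+$", "", accented)

print(trimmed, unicode_friendly, ascii_only)
```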
output = re.sub("[^a-zA-Z]+$", "", s)
\w is "word character" which includes alphanumeric (letters, numbers) plus underscore (_).
Say that you only need to retain uppercase and lowercase letters towards the end:
output = re.sub("[^A-Za-z ]+$", "", s)

Why inconsistent regular expression "\bpattern\b" behavior in Python?

I am using Python 3 to demonstrate. There is an example string:
a = "learning is learn and elearn"
s = "#wen is # and wen#"
I want to do exact match of "learn" and "#", i.e., not extracting learning (or #wen) or elearn (or wen#). Therefore, I should get 'learn' and '#'.
re.findall(r'\blearn\b', a) # works
['learn']
or
re.sub(r'\blearn\b', 'z', a) # works
'learning is z and elearn'
re.findall(r'\b#\b', s) # not working
[]
or
re.sub(r'\b#\b', 'z', s) # not working
'#wen is # and wen#'
From the docs:
\b Matches the empty string, but only at the beginning or end of a word. A word is defined as a sequence of alphanumeric or underscore characters, so the end of a word is indicated by whitespace or a non-alphanumeric, non-underscore character. Note that formally, \b is defined as the boundary between a \w and a \W character (or vice versa), or between \w and the beginning/end of the string
In your example, # is a nonalphanumeric (and non-underscore) character surrounded by other nonalphanumeric characters. Because there are no word characters, there is no word boundary, so \b will not match.
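The sketch below reproduces the behavior and shows one possible workaround. The lookaround pattern (?<!\w)#(?!\w) is an assumption about what "exact match" should mean here (a # not touching any word character); it is not taken from the answer above.

```python
import re

s = "#wen is # and wen#"

# No word character sits next to the lone "#", so \b never holds around it.
with_boundaries = re.findall(r"\b#\b", s)

# \b#\b only matches when "#" sits between two word characters.
between_words = re.findall(r"\b#\b", "a#b")

# Alternative: require that no word character touches "#" on either side.
standalone = re.findall(r"(?<!\w)#(?!\w)", s)
replaced = re.sub(r"(?<!\w)#(?!\w)", "z", s)

print(with_boundaries, between_words, standalone, replaced)
```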

regex match exact pattern within string

Suppose I have the following string: 'some numbers 66666666666666666667867866 and serial 151283917503423 and 8888888'. I want to find 15-digit numbers (so only 151283917503423). How do I make sure the pattern doesn't match the bigger number, while also handling the possibility that the string is just '151283917503423', so that I cannot rely on the number having spaces on both sides?
serial = re.compile('[0-9]{15}')
serial.findall('some numbers 66666666666666666667867866 and serial 151283917503423 and 8888888')
This matches part of 66666666666666666667867866 (its first 15 digits) as well as 151283917503423, but I only want the latter.
Use word boundaries:
serial = re.compile(r'\b[0-9]{15}\b')
\b Matches the empty string, but only at the beginning or end of a
word. A word is defined as a sequence of alphanumeric or underscore
characters, so the end of a word is indicated by whitespace or a
non-alphanumeric, non-underscore character. Note that formally, \b is
defined as the boundary between a \w and a \W character (or vice
versa), or between \w and the beginning/end of the string, so the
precise set of characters deemed to be alphanumeric depends on the
values of the UNICODE and LOCALE flags. For example, r'\bfoo\b'
matches 'foo', 'foo.', '(foo)', 'bar foo baz' but not 'foobar' or
'foo3'. Inside a character range, \b represents the backspace
character, for compatibility with Python’s string literals.
You need to use word boundaries to ensure you don't match unwanted text on either side of your match:
>>> serial = re.compile(r'\b\d{15}\b')
>>> serial.findall('some numbers 66666666666666666667867866 and serial 151283917503423 and 8888888')
['151283917503423']
Include word boundaries. Let s be your string. You can use
>>> re.findall(r'\b\d{15}\b' ,s)
['151283917503423']
where \b asserts a word boundary (^\w|\w$|\W\w|\w\W)
Since word boundaries \b contain 2 assertions each, I would use a single assertion instead.
(?<![0-9])[0-9]{15}(?![0-9])
should be quicker?
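The two approaches from the answers above can be compared side by side. This is only a behavioral sketch; it makes no claim about which is faster. The 'sn' prefix case is an illustrative assumption that shows where the two patterns differ.

```python
import re

text = ("some numbers 66666666666666666667867866 and serial "
        "151283917503423 and 8888888")

loose = re.findall(r"[0-9]{15}", text)  # also grabs part of the long number

boundary = re.compile(r"\b[0-9]{15}\b")
lookaround = re.compile(r"(?<![0-9])[0-9]{15}(?![0-9])")

by_boundary = boundary.findall(text)
by_lookaround = lookaround.findall(text)

# Both handle a string that is nothing but the serial number.
bare = boundary.findall("151283917503423")

# They differ when a letter touches the digits: letters are word characters,
# so \b fails there, while the digit-only lookarounds still match.
glued_boundary = boundary.findall("sn151283917503423")
glued_lookaround = lookaround.findall("sn151283917503423")

print(loose, by_boundary, by_lookaround, bare, glued_boundary, glued_lookaround)
```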

Regex to include alphanumeric and _

I'm trying to create a regular expression to match alphanumeric characters and the underscore _. This is my regex: "\w_*[^-$\s\]" and my impression is that this regex means any alphanumeric character \w, an underscore _, and no -,$, or whitespace. Is this correct?
Regular expressions are read as patterns which actually match characters in a string, left to right, so your pattern actually matches an alphanumeric, THEN an underscore (0 or more), THEN at least one character that is not a hyphen, dollar, or whitespace.
Since you're trying to alternate on character types, just use a character class to show what characters you're allowing:
[\w_]
This checks that ANY part of the string matches it, so let's anchor it to the beginning and end of the string:
^[\w_]$
And now we see that the character class lacks a quantifier, so we are matching on exactly ONE character. We can fix that using + (if you want one or more characters, no empty strings) or * (if you want to allow empty strings). I'll use + here.
^[\w_]+$
As it turns out, the \w character class already includes the underscore, so we can remove the redundant underscore from the pattern:
^[\w]+$
And now we have only one item in the character class, so we no longer need the character class brackets at all:
^\w+$
And that's all you need, unless I'm missing something about your requirements.
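The progression above can be traced with a short sketch. The test strings are illustrative assumptions, and the first pattern is the question's regex with the closing bracket left unescaped.

```python
import re

# The original pattern reads left to right: \w, then "_"*, then one
# character that is not "-", "$" or whitespace, so "a!" is accepted.
original = re.search(r"\w_*[^-$\s]", "a!")

# An unanchored class matches if ANY single character qualifies.
unanchored = re.search(r"[\w_]", "a-b")

# Anchored without a quantifier: exactly one character.
one_char_ok = re.search(r"^[\w_]$", "a")
one_char_fail = re.search(r"^[\w_]$", "ab")

# Final form: one or more word characters and nothing else.
final_ok = re.search(r"^\w+$", "ab_1")
final_fail = re.search(r"^\w+$", "a-b")

print(original, unanchored, one_char_ok, one_char_fail, final_ok, final_fail)
```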
Yes, you are partly correct, provided the closing bracket is not escaped and you adjust your regex a bit. Also, the token \w already matches the underscore, so you do not need to repeat that character. Your regular expression says:
\w # word characters (a-z, A-Z, 0-9, _)
_* # '_' (0 or more times)
[^-$\s] # any character except: '-', '$', whitespace (\n, \r, \t, \f, \v, and " ")
You could simply write your entire regex as follows to match word characters:
\w+ # word characters ( a-z, A-Z, 0-9, _ ) (1 or more times)
If you want to match an entire string, be sure to anchor your expression.
^\w+$
Explanation:
^ # the beginning of the string
\w+ # word characters ( a-z, A-Z, 0-9, _ ) (1 or more times)
$ # before an optional \n, and the end of the string
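The note that $ matches "before an optional \n" has a practical consequence in Python, sketched below. Using re.fullmatch as the stricter alternative is a suggestion added here, not part of the answer above.

```python
import re

anchored = re.compile(r"^\w+$")

plain = anchored.match("abc_123")
with_newline = anchored.match("abc_123\n")  # $ matches before the final \n
with_hyphen = anchored.match("abc-123")

# fullmatch requires the whole string, trailing newline included, to match.
strict = re.fullmatch(r"\w+", "abc_123\n")

print(plain, with_newline, with_hyphen, strict)
```

If a trailing newline must be rejected, re.fullmatch (or \Z in place of $) avoids the surprise.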
