Regex to include alphanumeric and _ - python

I'm trying to create a regular expression to match alphanumeric characters and the underscore _. This is my regex: "\w_*[^-$\s\]" and my impression is that this regex means any alphanumeric character \w, an underscore _, and no -,$, or whitespace. Is this correct?

Regular expressions are read as patterns which actually match characters in a string, left to right, so your pattern actually matches an alphanumeric, THEN an underscore (0 or more), THEN at least one character that is not a hyphen, dollar, or whitespace.
Since you're trying to alternate on character types, just use a character class to show what characters you're allowing:
[\w_]
This checks that ANY part of the string matches it, so let's anchor it to the beginning and end of the string:
^[\w_]$
And now we see that the character class lacks a quantifier, so we are matching on exactly ONE character. We can fix that using + (if you want one or more characters, no empty strings) or * (if you want to allow empty strings). I'll use + here.
^[\w_]+$
As it turns out, the \w character class already includes the underscore, so we can remove the redundant underscore from the pattern:
^[\w]+$
And now we have only one character in the character class, so we no longer need the character class brackets at all:
^\w+$
And that's all you need, unless I'm missing something about your requirements.
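A quick way to check the final pattern is a minimal sketch like the one below. The helper name is_word and the sample inputs are made up for illustration; only ^\w+$ comes from the answer.

```python
import re

# Anchored pattern from the answer: one or more word characters, nothing else.
WORD_RE = re.compile(r"^\w+$")

def is_word(text):
    """Return True if text consists only of letters, digits and underscores."""
    return WORD_RE.match(text) is not None

print(is_word("snake_case_42"))  # letters, digits and underscore only
print(is_word("kebab-case"))     # contains a hyphen
print(is_word("two words"))      # contains whitespace
print(is_word(""))               # empty: + requires at least one character
```

Only the first call should report a match; the hyphen, the space and the empty string are all rejected.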

You are partly correct, provided the closing bracket is not escaped and you edit your regex a bit. Also, the token \w already matches the underscore, so you do not need to repeat this character. Your regular expression says:
\w # word characters (a-z, A-Z, 0-9, _)
_* # '_' (0 or more times)
[^-$\s] # any character except: '-', '$', whitespace (\n, \r, \t, \f, and " ")
You could simply write your entire regex as follows to match word characters:
\w+ # word characters ( a-z, A-Z, 0-9, _ ) (1 or more times)
If you want to match an entire string, be sure to anchor your expression.
^\w+$
Explanation:
^ # the beginning of the string
\w+ # word characters ( a-z, A-Z, 0-9, _ ) (1 or more times)
$ # before an optional \n, and the end of the string
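The detail that $ also matches just before a trailing \n can matter when validating input. Here is a small sketch, assuming you want to reject that newline; re.fullmatch and \Z are standard features of the re module, and the sample string is made up.

```python
import re

with_newline = "user_name\n"

# $ matches before a final newline, so the anchored pattern accepts this string.
dollar_ok = re.match(r"^\w+$", with_newline) is not None

# re.fullmatch requires the whole string to match, newline included.
fullmatch_ok = re.fullmatch(r"\w+", with_newline) is not None

# \Z only matches at the very end of the string.
z_ok = re.match(r"^\w+\Z", with_newline) is not None

print(dollar_ok, fullmatch_ok, z_ok)
```

If a trailing newline should not be accepted, re.fullmatch(r"\w+", s) is probably the simplest option.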

Related

Python regex: pattern with re.ASCII can still match unicode characters?

I am new to Python regex and am trying to match non-white space ASCII characters in Python.
The following is my code:
import re
p = re.compile(r"[\S]{2,3}", re.ASCII)
p.search('1234') # have some result
p.search('你好吗') # also have result, but Why?
I have specified ASCII mode in re.compile, but p.search('你好吗') still returns a match. What am I doing wrong here?
The re.A flag only affects what shorthand character classes match.
In Python 3.x, shorthand character classes are Unicode aware: the behavior of the Python 2.x re.UNICODE/re.U flag is ON by default. That means:
\d: Matches any Unicode decimal digit (that is, any character in Unicode character category [Nd])
\D: Matches any character which is not a decimal digit. (So, all characters other than those in the Nd Unicode category).
\w - Matches Unicode word characters; this includes most characters that can be part of a word in any language, as well as numbers and the underscore. (So, \w+ matches each word in a My name is Виктор string)
\W - Matches any character which is not a word character. This is the opposite of \w. (So, it will not match any Unicode letter or digit.)
\s - Matches Unicode whitespace characters (it will match NEL, hard spaces, etc.)
\S - Matches any character which is not a whitespace character. (So, no match for NEL, hard space, etc.)
\b - word boundaries match locations between Unicode letters/digits and non-letters/digits or start/end of string.
\B - non-word boundaries match locations between two Unicode letters/digits, two non-letters/digits or between a Unicode non-letter/digit and start/end of string.
If you want to disable this behavior, you use re.A or re.ASCII:
Make \w, \W, \b, \B, \d, \D, \s and \S perform ASCII-only matching instead of full Unicode matching. This is only meaningful for Unicode patterns, and is ignored for byte patterns. Corresponds to the inline flag (?a).
That means that:
\d = [0-9] - and no longer matches Hindi, Bengali, etc. digits
\D = [^0-9] - and matches any character other than an ASCII digit (so it now also matches Hindi, Bengali, etc. digits)
\w = [A-Za-z0-9_] - and it only matches ASCII words now, Wiktor is matched with \w+, but Виктор does not
\W = [^A-Za-z0-9_] - it matches any char but ASCII letters/digits/_ (i.e. it matches 你好吗, Виктор, etc.)
\s = [ \t\n\r\f\v] - matches a regular space, tab, linefeed, carriage return, form feed and a vertical tab
\S = [^ \t\n\r\f\v] - matches any char other than a space, tab, linefeed, carriage return, form feed and a vertical tab, so it matches all Unicode letters, digits and punctuation and Unicode (non-ASCII) whitespace. E.g., re.sub(r'\S+', r'{\g<0>}', '\xA0 ', flags=re.A) will return '{ } ', as you see, the \S now matches hard spaces.
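The effect of re.ASCII on the individual shorthand classes can be checked with a short sketch like this one. The variable names are made up, and the final pattern [!-~] is only one possible reading of "non-whitespace ASCII" (the printable range from '!' to '~'), not something the question specifies.

```python
import re

cjk = "你好吗"
cyrillic = "Виктор"
devanagari_five = "\u096b"  # DEVANAGARI DIGIT FIVE, Unicode category Nd

# \S under re.ASCII is [^ \t\n\r\f\v], so non-ASCII text still matches.
s_match = re.search(r"\S{2,3}", cjk, re.ASCII)

# \w and \d, on the other hand, do become ASCII-only under re.ASCII.
w_default = re.fullmatch(r"\w+", cyrillic)
w_ascii = re.fullmatch(r"\w+", cyrillic, re.ASCII)
d_default = re.fullmatch(r"\d", devanagari_five)
d_ascii = re.fullmatch(r"\d", devanagari_five, re.ASCII)

# Assumption: "non-whitespace ASCII" means the printable range '!' to '~'.
ascii_only = re.compile(r"[!-~]{2,3}")

print(s_match is not None, w_default is not None, w_ascii is None)
print(d_default is not None, d_ascii is None)
print(ascii_only.search(cjk), ascii_only.search("1234"))
```

With an explicit ASCII range the flag is no longer needed, because the character class itself excludes everything outside ASCII.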

Python regex: removing all special characters and numbers NOT attached to words

I am trying to remove all special characters and numbers in python, except numbers that are directly attached to words.
I have succeeded in doing this for all special characters and numbers, whether attached to words or not. How can I do it in such a way that numbers attached to words are not matched?
Here's what I did:
import regex as re
string = "win32 backdoor guid:64664646 DNS-lookup h0lla"
re.findall(r'[^\p{P}\p{S}\s\d]+', string.lower())
I get as output
win backdoor guid DNS lookup h lla
But I want to get:
win32 backdoor guid DNS lookup h0lla
demo: https://regex101.com/r/x4HrGo/1
To match alphanumeric strings or only letter words you may use the following pattern with re:
import re
# ...
re.findall(r'(?:[^\W\d_]+\d|\d+[^\W\d_])[^\W_]*|[^\W\d_]+', text.lower())
See the regex demo.
Details
(?:[^\W\d_]+\d|\d+[^\W\d_])[^\W_]*- either 1+ letters followed with a digit, or 1+ digits followed with a letter, and then 0+ letters/digits
| - or
[^\W\d_]+ - any 1+ Unicode letters
NOTE It is equivalent to \d*[^\W\d_][^\W_]* pattern posted by PJProudhon, that matches any 1+ alphanumeric character chunks with at least 1 letter in them.
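As a rough check, both patterns can be run against the sample string from the question. This sketch uses only the standard re module; the variable names are made up.

```python
import re

text = "win32 backdoor guid:64664646 DNS-lookup h0lla"

# Pattern from the answer: chunks of letters/digits containing at least one letter.
long_pattern = r"(?:[^\W\d_]+\d|\d+[^\W\d_])[^\W_]*|[^\W\d_]+"
# Shorter pattern mentioned in the note.
short_pattern = r"\d*[^\W\d_][^\W_]*"

long_result = re.findall(long_pattern, text.lower())
short_result = re.findall(short_pattern, text.lower())

print(long_result)
print(" ".join(short_result))
```

On this input both patterns keep win32 and h0lla intact and skip the standalone number 64664646.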
You could give a try to \b\d*[^\W\d_][^\W_]*\b
Decomposition:
\b # word boundary
\d* # zero or more digits
[^\W\d_] # one alphabetic character
[^\W_]* # zero or more alphanumeric characters
\b # word boundary
For beginners:
[^\W] is a typical double-negated construct. \W matches any character which is not alphanumeric or _ (it is the negation of \w, which matches any alphanumeric character plus _ - common equivalent [a-zA-Z0-9_]), so [^\W] matches the same characters as \w.
It proves useful here for composing narrower classes:
Any alphanumeric character = [^\W_] matches any character which is not non-[alphanumeric or _] and is not _.
Any alphabetic character = [^\W\d_] matches any character which is not non-[alphanumeric or _] and is not digit (\d) and is not _.
Some further reading here.
Edit:
When _ is also considered a word delimiter, just skip the word boundaries, which toggle on that character, and use \d*[^\W\d_][^\W_]*.
Default greediness of star operator will ensure all relevant characters are actually matched.
Demo.
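The difference between the two variants shows up on input where words are joined by underscores. The sample string below is made up for illustration.

```python
import re

sample = "win32_backdoor h0lla_64"

# With \b: an underscore is a word character, so there is no boundary next to it.
with_boundaries = re.findall(r"\b\d*[^\W\d_][^\W_]*\b", sample)

# Without \b: the underscore simply ends one chunk and starts the next.
without_boundaries = re.findall(r"\d*[^\W\d_][^\W_]*", sample)

print(with_boundaries)
print(without_boundaries)
```

Every chunk in this sample touches an underscore on one side, so the \b version finds nothing, while the version without boundaries returns win32, backdoor and h0lla.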
Try this RegEx instead:
([A-Za-z]+(\d)*[A-Za-z]*)
You can expand it from here, for example by flipping the * and + on the first and last sets to capture strings like "win32" and "01ex" equally.
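One thing to watch for when using this pattern with re.findall: because it contains capturing groups, findall returns tuples rather than plain strings. A small sketch, where the group-free variant is my own rewrite and not part of the answer:

```python
import re

text = "win32 backdoor guid:64664646 DNS-lookup h0lla"

# With capturing groups, findall returns one tuple per match.
grouped = re.findall(r"([A-Za-z]+(\d)*[A-Za-z]*)", text)

# Without groups, findall returns the matched strings directly.
plain = re.findall(r"[A-Za-z]+\d*[A-Za-z]*", text)

print(grouped[0])
print(plain)
```

The first element of each tuple is the whole match; the second is only the last digit captured by (\d)*.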

how to remove trailing non-alpha characters

import re
s = 'Sarah Ruthers#6'
output = re.sub("[^\\w]", "", s)
print(output)
The above removes ALL non-alpha characters; I simply want to remove any characters after the last alpha (letter type character), i.e. anything trailing the last alpha character.
i.e. Sarah Ruthers#6
to output simply:
Sarah Ruthers
My regex above outputs SarahRuthers6 (removing the space and the #, but keeping the digit, since digits are word characters)
Anchor your pattern at the end, and use a correct character class:
output = re.sub(r"[\W\d_]+$", "", s)
That'll remove a single run of all non-letter characters at the end of the string; the $ anchor limits the range, and [\W\d_] properly matches non-letters, not just non-word characters (word characters include digits and the underscore character).
I also made the regex a raw string (which you should always do anyway for regex patterns), removing the need to double the backslashes.
Note that while [^a-zA-Z] could replace [\W\d_] for your specific case, I strongly recommend [\W\d_] over [^a-zA-Z] because the former is Unicode friendly, while the latter is not; for example if your text is 'résumé', using [^a-zA-Z] will strip the trailing é, [\W\d_] won't.
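The difference is easy to see with a short sketch. The sample string is made up; the é is written as an escape so that it is the precomposed character U+00E9.

```python
import re

word = "r\u00e9sum\u00e9 #1"  # "résumé #1" with precomposed é (U+00E9)

# Unicode-friendly: strips only the trailing " #1".
unicode_friendly = re.sub(r"[\W\d_]+$", "", word)

# ASCII-only: é is not in a-zA-Z, so it is stripped as well.
ascii_only = re.sub(r"[^a-zA-Z]+$", "", word)

print(unicode_friendly)
print(ascii_only)
```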
output = re.sub("[^a-zA-Z]+$", "", s)
\w is "word character" which includes alphanumeric (letters, numbers) plus underscore (_).
Say that you only need to retain uppercase and lowercase letters towards the end:
output = re.sub("[^A-Za-z ]+$", "", s)

Cannot understand the code for removing words with numbers [duplicate]

I want to remove words with numbers. After research I understood that
>>> import re
>>> s = "ABCD abcd AB55 55CD A55D 5555"
>>> re.sub(r"\S*\d\S*", "", s).strip()
'ABCD abcd'
This code works to solve my situation
However, I am not able to understand how this code works. I know about regex and I know individually \d recognizes all the numbers [0-9]. \S is for white spaces. and * is 0 or more occurrences of the pattern to its left
"\S*\d\S*"
This part I am not able to understand
But I am not sure I understand how this code identifies AB55.
Can anyone please explain to me? Thanks
This replaces every digit, together with any non-space characters around it, with the empty string "".
Because \S* is greedy, it first grabs as much as it can and then gives back just enough for \d to match, so \d ends up on the last digit of each word:
AB55 : AB5 is \S*, 5 is \d, the empty string is \S*
55CD : 5 is \S*, 5 is \d, CD is \S*
A55D : A5 is \S*, 5 is \d, D is \S*
5555 : 555 is \S*, 5 is \d, the empty string is \S*
The re.sub("\S*\d\S*", "", s) call replaces all these substrings with the empty string "", and .strip() then removes the whitespace left at the beginning and end of the result.
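One way to check which part of a word each piece of the pattern consumes is to wrap the three pieces in capturing groups. This is only an inspection aid; the groups and variable names are not part of the original code.

```python
import re

words = ["AB55", "55CD", "A55D", "5555"]

# Same pattern as in the question, with each piece in its own group.
pieces = re.compile(r"(\S*)(\d)(\S*)")

breakdown = {word: pieces.match(word).groups() for word in words}

for word, groups in breakdown.items():
    print(word, groups)
```

Because the first \S* is greedy, it keeps everything up to the last digit, and \d is left with that last digit.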
You misunderstand the code. \S is the opposite of \s: it matches with everything except whitespace.
Since the Kleene star (*) is greedy, the pattern matches as many non-space characters as possible, followed by a digit, followed by as many non-space characters as possible. It will thus match a full word in which at least one character is a digit.
All these matches are then replaced by the empty string, and therefore removed from the original string.
Your code first matches 0+ non-whitespace chars with \S* (it is \s* that matches whitespace chars) and will match all the way to the end of the "word". Then it backtracks to match a digit and again matches 0+ non-whitespace chars.
The pattern will for example also match a single digit.
You could slightly optimize the pattern by first matching characters that are neither whitespace nor digits, using the negated character class [^\s\d]*, which prevents the first part from running past the digit to the end of the word and then backtracking.
[^\s\d]*\d\S*
Regex demo
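A minimal sketch comparing the original and the optimized pattern on the string from the question (variable names are made up):

```python
import re

s = "ABCD abcd AB55 55CD A55D 5555"

original = r"\S*\d\S*"
optimized = r"[^\s\d]*\d\S*"

# Both patterns find the same words; they differ only in how much backtracking is needed.
matches_original = re.findall(original, s)
matches_optimized = re.findall(optimized, s)

# Removing the matches leaves trailing spaces behind, which strip() cleans up.
cleaned = re.sub(original, "", s).strip()

print(matches_original)
print(matches_optimized)
print(repr(cleaned))
```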
You mention that \S is for white spaces, but it is not; it is the opposite.
This is what the Python documentation says about \s and \S:
\s
Matches any whitespace character; this is equivalent to the class [ \t\n\r\f\v].
\S
Matches any non-whitespace character; this is equivalent to the class [^ \t\n\r\f\v].
If you use \s, which is for whitespace characters, instead of \S,
you'll get output like this:
>>> import re
>>>
>>> s = "ABCD abcd AB55 55CD A55D 5555"
>>> re.sub("\s*\d\s*", "", s).strip()
'ABCD abcd ABCD AD'

Regex to get non-alphanumeric strings between alphanumeric strings

Let say I have this string:
Alpha+*&Numeric%$^String%%$
I want to get the non-alphanumeric characters that are between alphanumeric characters:
+*& %$^
I have this regex: [^0-9a-zA-Z]+ but it's giving me
+*& %$^ %%$
which includes the trailing non-alphanumeric characters which I do not want. I have also tried [0-9a-zA-Z]([^0-9a-zA-Z])+[0-9a-zA-Z] but it's giving me
a+*&N c%$^S
which include the characters a, N, c and S
If you don't mind including the _ character as alpha-numeric data, you can extract all your non-alpha-numeric-data with this:
some_string = "A+*&N%$^S%%$"
import re
result = re.findall(r'\b\W+\b', some_string) # sets result to: ['+*&', '%$^']
Note my use of \b instead of something like \w or [^\W].
\w and [^\W] each match one character, so if your alpha-numeric string (between the text you want) is exactly one character, then what you think should be the next match won't match.
But since \b is a zero-width "word boundary," it doesn't care how many alpha-numeric characters there are, as long as there is at least one.
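The point about single-character alphanumeric runs can be seen by running both approaches on the same string. This is a sketch for comparison; the \w(\W+)\w pattern stands in for "something like \w" and the variable names are made up.

```python
import re

some_string = "A+*&N%$^S%%$"

# \b is zero-width, so the N between the two runs is not consumed.
with_boundaries = re.findall(r"\b\W+\b", some_string)

# \w consumes the N as the end of the first match, so it cannot
# also serve as the start of the second one.
with_consuming = re.findall(r"\w(\W+)\w", some_string)

print(with_boundaries)
print(with_consuming)
```

The consuming version misses the second run because N has already been used up by the first match.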
The only problem with your second attempt is the location of the + quantifier: it should be inside the parentheses. You can also use the word character class \w and its inverse \W to pull out these items, which is the same as your second regex but includes underscores _ as parts of words:
import re
s = "Alpha+*&Numeric%$^String%%$"
print(re.findall(r"\w(\W+)\w", s)) # adds _ character
print(re.findall(r"[0-9a-zA-Z]([^0-9a-zA-Z]+)[0-9a-zA-Z]", s)) # your version fixed
print(re.findall(r"(?i)[0-9A-Z]([^0-9A-Z]+)[0-9A-Z]", s)) # same as above
Output:
['+*&', '%$^']
['+*&', '%$^']
['+*&', '%$^']
