Python regular expressions: disallowing characters with exceptions - python

In my actual project I want to get some input from a user. It cannot contain whitespace at the beginning, at the end or twice or more in a row and it also cannot contain anything that is not a letter or a number.
However I have 2 problems:
\W disallows whitespaces. I want to allow one but not in the beginning, the end or more than once.
I want to allow dots but \W disallows it. I can't replace it with [^a-zA-Z0-9\s\s.] because it disallows characters like 'äöüß' which I want to allow. How can I change my code to disallow everything that is in the pattern below except for single whitespaces and dots?
import re
input_text = input("input: ")
if re.search(r"(^\s|\s{2,}|\s$|\W)", input_text):
    print("invalid input")

You may use either of the two:
^[^\W_]+(?:[\s.][^\W_]+)*\Z
^(?:[^\W_]|\.)+(?:\s(?:[^\W_]|\.)+)*\Z
The first one will match single dots and whitespace only in between letters. The second one allows dots to be anywhere in the string and in any consecutive quantities.
See regex #1 demo and regex #2 demo.
Details
^ - start of string
[^\W_]+ - one or more Unicode letters or digits
(?:[^\W_]|\.)+ - a non-capturing group matching one or more Unicode letters/digits or a dot
(?:[\s.][^\W_]+)* - zero or more repetitions of
[\s.] - a single whitespace or dot
[^\W_]+ - one or more Unicode letters or digits
\Z - the very end of string.
In Python, use something like
if re.search(r'^[^\W_]+(?:[\s.][^\W_]+)*\Z', input_text):
    print('Valid')
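To see how the two patterns differ in practice, here is a minimal sketch that runs both against a few sample inputs. The names STRICT and LOOSE and the sample strings are made up for illustration; the patterns themselves are the two from this answer.

```python
import re

# Pattern 1: a single dot or whitespace is allowed only between runs of letters/digits.
STRICT = re.compile(r'^[^\W_]+(?:[\s.][^\W_]+)*\Z')
# Pattern 2: dots may appear anywhere; single whitespace only between non-space runs.
LOOSE = re.compile(r'^(?:[^\W_]|\.)+(?:\s(?:[^\W_]|\.)+)*\Z')

# Hypothetical sample inputs chosen to exercise each rule.
samples = ["Jörg Müller", "a.b c", "two  spaces", " leading",
           "trailing ", "under_score", "..a.."]

for s in samples:
    print(f"{s!r:16} strict={bool(STRICT.search(s))} loose={bool(LOOSE.search(s))}")
```

Only the last sample should give different results: "..a.." is rejected by the first pattern and accepted by the second.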

Related

Combining positive and negative lookahead in python

I'm trying to extract tokens that satisfy many conditions out of which, I'm using lookahead to implement the following two conditions:
The tokens must be either numeric/alphanumeric (i.e, they must have at least one digit). They can contain few special characters like - '-','/','\','.','_' etc.,
I want to match strings like: 165271, agya678, yah#123, kj*12-
The tokens can't have consecutive special characters like: ajh12-&
I don't want to match strings like: ajh12-&, 671%&i^
I'm using a positive lookahead for the first condition: (?=\w*\d\w*) and a negative lookahead for the second condition: (?!=[\_\.\:\;\-\\\/\#\+]{2})
I'm not sure how to combine these two look-ahead conditions.
Any suggestions would be helpful. Thanks in advance.
Edit 1 :
I would like to extract complete tokens that are part of a larger string too (i.e., They may be present in middle of the string).
I would like to match all the tokens in the string:
165271 agya678 yah#123 kj*12-
and none of the tokens (not even a part of a token) in the string: ajh12-& 671%&i^
In order to force the regex to consider the whole string I've also used \b in the above regexs : (?=\b\w*\d\w*\b) and (?!=\b[\_\.\:\;\-\\\/\#\+]{2}\b)
You can use
^(?!.*[_.:;\-\\\/#+*]{2})(?=[^\d\n]*\d)[\w.:;\-\\\/#+*]+$
Regex demo
The positive lookahead (?=[^\d\n]*\d) uses a negated character class to match any characters other than a digit or a newline, and then matches a digit.
Note that you also have to add * and that most characters don't have to be escaped in the character class.
Using contrast, you could also turn the first .* into a negated character class to prevent some backtracking:
^(?![^_.:;\-\\\/#+*\n]*[_.:;\-\\\/#+*]{2})(?=[^\d\n]*\d)[\w.:;\-\\\/#+*]+$
Edit
Without the anchors, you can use whitespace boundaries to the left (?<!\S) and to the right (?!\S)
(?<!\S)(?!\S*[_.:;\-\\\/#+*]{2})(?=[^\d\s]*\d)[\w.:;\-\\\/#+*]+(?!\S)
Regex demo
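As a rough check of the whitespace-boundary version, the sketch below extracts tokens from the two sample strings in the question. It assumes the negative lookahead is written as (?! with no = after the exclamation mark, since (?!= would only reject a literal = character. The name TOKEN and the third test string are made up for illustration.

```python
import re

TOKEN = re.compile(
    r'(?<!\S)'                      # left whitespace boundary
    r'(?!\S*[_.:;\-\\\/#+*]{2})'    # no two listed special characters in a row
    r'(?=[^\d\s]*\d)'               # at least one digit somewhere in the token
    r'[\w.:;\-\\\/#+*]+'            # the token itself
    r'(?!\S)'                       # right whitespace boundary
)

print(TOKEN.findall("165271 agya678 yah#123 kj*12-"))
# ['165271', 'agya678', 'yah#123', 'kj*12-']
print(TOKEN.findall("ajh12-& 671%&i^"))
# []
print(TOKEN.findall("ab--12 abc 123"))
# ['123']
```

In the last string, "ab--12" is dropped because of the doubled hyphen and "abc" because it has no digit.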
You can use multiple lookahead assertions to match only strings that:
(?!.*(?:\W|_){2,}.*) - don't have consecutive special characters, and
(?=.*\d.*) - have at least 1 digit
^(?!.*(?:\W|_){2,}.*)(?=.*\d.*).*$
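A quick way to try this anchored variant is to test each candidate token on its own. This is only a sketch: the name WHOLE is made up, and note that this pattern treats every non-word character (including whitespace) as "special" rather than a fixed list.

```python
import re

WHOLE = re.compile(r'^(?!.*(?:\W|_){2,}.*)(?=.*\d.*).*$')

# Tokens taken from the question, plus "abc" as a no-digit case.
for token in ["165271", "agya678", "yah#123", "kj*12-", "ajh12-&", "671%&i^", "abc"]:
    print(f"{token!r:12} {bool(WHOLE.match(token))}")
```

The first four tokens should match; the last three should not.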

Regex to match (French) numbers

I'm trying to find a simple (not perfect) pattern to recognise French numbers in a French text. French numbers use comma for the Anglo-Saxon decimal, and use dot or space for the thousand separator. \u00A0 is non-breaking space, also often used in French documents for the thousand separator.
So my first attempt is:
number_pattern = re.compile(r'\d[\d\., \u00A0]*\d', flags=re.UNICODE)
... but the trouble is that this doesn't then match a single digit.
But if I do this
number_pattern = re.compile(r'\d[\d\., \u00A0]*\d?', flags=re.UNICODE)
it then picks up trailing space (or NBS) characters (or for that matter a trailing comma or full stop).
The thing is, the pattern must both START and END with a digit, but it is possible that these may be the SAME character.
How might I achieve this? I considered a two-stage process where you try to see whether this is in fact a single-digit number... but that in itself is not trivial: if followed by a space, NBS, comma or dot, you then have to see whether the character after that, if there is one, is or is not a digit.
Obviously I'm hoping to find a solution which involves only one regex: if there is only one regex, it is then possible to do something like:
doubled_dollars_plain_text = plain_text.replace('$', '$$')
substituted_plain_text = re.sub(number_pattern, '$number', doubled_dollars_plain_text)
... having to use a two-stage process would make this much more lengthy and fiddly.
Edit
I tried to see whether I could implement ThierryLathuille's idea, so I tried:
re.compile(r'(\d(?:[\d\., \u00A0]*\d)?)', flags=re.UNICODE)
... this seems to work pretty well. Unlike JvdV's solution it doesn't attempt to check that thousand separators are followed by 3 digits, and for that matter you could have a succession of commas and spaces in the middle and it would still pass, which is quite problematic when you have a list of numbers separated by ", ". But it's acceptable for certain purposes... until something more sophisticated can be found.
I wonder if there's a way of saying "any non-digit in this pattern must be on its own" (i.e. must be bracketed between two digits)?
What about:
\d{1,3}(?:[\s.]?\d{3})*(?:,\d+)?(?!\d)
See an online demo
\d{1,3} - 1-3 digits.
(?: - Open 1st non-capture group:
[\s.]? - An optional whitespace or literal dot. Note that with unicode \s should match \p{Z} to include the non-breaking whitespace.
\d{3} - Three digits.
)* - Close 1st non-capture group and match 0+ times.
(?:,\d+)? - A 2nd optional non-capture group to match a comma followed by at least 1 digit.
(?!\d) - A negative lookahead to prevent trailing digits.
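Here is a small sketch of that pattern applied with re.findall. The sample sentence and the name NUMBER are made up; in Python 3, \s on a str pattern also matches the non-breaking space, which is why the second number is found.

```python
import re

NUMBER = re.compile(r'\d{1,3}(?:[\s.]?\d{3})*(?:,\d+)?(?!\d)')

# Hypothetical French text; \u00A0 is a non-breaking space used as thousand separator.
text = "Prix : 1 234,56 €, stock : 2\u00A0500 unités, total : 1.234.567, quantité : 7."

print(NUMBER.findall(text))
# ['1 234,56', '2\xa0500', '1.234.567', '7']
```

The trailing comma after 1.234.567 and the final full stop after 7 are not included, because the optional groups only match when digits follow.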
Very much inspired by JvdV's answer, I suggest this:
number_pattern = re.compile(r'(\d{1,3}(?:(?:[. \u00A0])?\d{3})*(?:,\d+)?(?!\d))', flags=re.UNICODE)
... this makes the thousand separator optional, and also makes thousand groups optional. It restricts the thousand-separator to 3 possible characters: dot, space and NBS, which is necessary for French numbers as found in practice.
PS I just found today that in fact Swiss French-speakers appear sometimes to use an apostrophe (of which there are several candidates in the vastness of Unicode) as a thousand separator.
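For completeness, this is roughly how the pattern slots into the substitution from the question. The sample text is made up, and the final string.Template step is only a guess at why the dollar signs were doubled in the first place.

```python
import re
from string import Template

number_pattern = re.compile(r'(\d{1,3}(?:(?:[. \u00A0])?\d{3})*(?:,\d+)?(?!\d))')

plain_text = "Total : 12 500,75 $ pour 3 articles"   # hypothetical input

# Escape existing dollar signs, then replace every number with a placeholder.
doubled_dollars_plain_text = plain_text.replace('$', '$$')
substituted_plain_text = number_pattern.sub('$number', doubled_dollars_plain_text)
print(substituted_plain_text)
# Total : $number $$ pour $number articles

# Assumed follow-up: fill the placeholder with string.Template.
print(Template(substituted_plain_text).substitute(number='<NUM>'))
# Total : <NUM> $ pour <NUM> articles
```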

Regex that matches newlines literally and passively

I have to construct a regex that matches client codes that look like:
XXX/X{3,6}
XXX.X{3,6}
XXX.X{3,6}/XXX
With X a number between 0 and 9.
The regex needs to be strong enough so we don't extract codes that are within another string. The use of word boundaries was my first idea.
The regex looks like this: \b\d{3}[\.\/]\d{3,6}(?:\/\d{3})?\b
The problem with word boundaries is that it also matches dots. So a number like "123/456.12" would match "123/456" as the client number. So then I came up with the following regex: (?<!\S)\d{3}[\.\/]\d{3,6}(?:\/\d{3})?(?!\S). It uses lookbehind and lookahead and checks if that character is a white space. This matches most of the client codes correctly.
But there is still one last issue. We are extracting the codes from Google OCR text. This means that a valid code can be found in the text like 123/456\n, \n123/456, \n123/456\n, etc. Checking whether the previous and/or next characters are white space doesn't work because the literal "\n" is not included in this. If I do something like (?<!\S|\\n) as the word boundary it also includes a back and/or forward slash for some reason. Currently I came up with the following regex (?<![^\r\n\t\f\v n])\d{3}[\.\/]\d{3,6}(?:\/\d{3})?(?![^\r\n\t\f\v \\]), but that only checks if the previous character is an "n" or white space and the next a backslash or white space. So strings like "lorem\123/456" would still find a match. I need some way to include the "\n" in the white space characters without breaking the lookahead/lookbehind.
Do you guys have any idea how to solve this issue? All input is appreciated. Thx!
It seems you want to subtract \n from the whitespace boundaries. You can use
re.findall(r'(?<![^\s\n])\d{3}[./]\d{3,6}(?:/\d{3})?(?![^\s\n])', text)
See the Python demo and this regex demo.
If the \n are combinations of \ and n chars, you need to make sure the \S in the lookarounds does not match those:
import re
text = r'Codes like 123/456\n \n123/3456 \n123/23456\n etc are correct \n333.3333/333\n'
print( re.findall(r'(?<!\S(?<!\\n))\d{3}[./]\d{3,6}(?:/\d{3})?(?!(?!\\n)\S)', text) )
# => ['123/456', '123/3456', '123/23456', '333.3333/333']
See this Python demo.
Details:
(?<![^\s\n]) - a negative lookbehind that matches a location that is not immediately preceded with a char other than whitespace and an LF char
(?<!\S(?<!\\n)) - a left whitespace boundary that does not trigger if the non-whitespace is the n from the \n char combination
\d{3} - three digits
[./] - a . or /
\d{3,6} - three to six digits
(?:/\d{3})? - an optional sequence of / and three digits
(?![^\s\n]) - a negative lookahead that requires no char other than whitespace and LF immediately to the right of the current location.
(?!(?!\\n)\S) - a right whitespace boundary that does not trigger if the non-whitespace is the \ char followed with n.
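To make the difference between the two variants concrete, here is a sketch that runs each one on the kind of text it is meant for. The names and sample strings are made up; the second sample is a raw string, so its \n sequences are a backslash followed by n, not real line breaks.

```python
import re

# For text containing real line-feed characters.
REAL_NEWLINES = re.compile(r'(?<![^\s\n])\d{3}[./]\d{3,6}(?:/\d{3})?(?![^\s\n])')
# For text where "\n" is two characters: a backslash and the letter n.
LITERAL_BACKSLASH_N = re.compile(r'(?<!\S(?<!\\n))\d{3}[./]\d{3,6}(?:/\d{3})?(?!(?!\\n)\S)')

real = "code 123/456\n333.3333/333\nnot 123/456.12"
literal = r"code 123/456\n \n333.3333/333\n lorem\123/456"

print(REAL_NEWLINES.findall(real))
# ['123/456', '333.3333/333']
print(LITERAL_BACKSLASH_N.findall(literal))
# ['123/456', '333.3333/333']
```

In the second sample, "lorem\123/456" is skipped because the backslash before the digits is not part of a \n pair.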

Regular Expression in Python strings

I want to validate a string that satisfies the below three conditions using regular expression
The special characters allowed are (. , _ , - ).
Should contain only lower-case characters.
Should not start or end with special character.
To satisfy the above conditions, I have created a format as below
^[^\W_][a-z\.,_-]+
This pattern works fine up to the second character. However, it fails for the 3rd and subsequent characters if those contain any special or upper-case characters.
Example:
The pattern works for the string S#yanthan but not for Sa#yanthan. I am expecting that pattern to pass even if the third and subsequent characters contain any special characters or upper-case characters. Can you tell me where this pattern goes wrong? Below is a snippet of the code.
import re
a = "Sayanthan"
exp = re.search(r"^[^\W_][a-z\.,_-]+", a)
if exp:
    print(True)
else:
    print(False)
Based on your initial rules I'd go with:
^[a-z](?:[.,_-]*[a-z])*$
See the online demo.
However, you mentioned in the comments:
"Also the third condition is "should not start with Special character" instead of "should not start or end with Special character""
In that case you could use:
^[a-z][-.,_a-z]*$
See the online demo
The pattern that you tried ^[^\W_][a-z.,_-]+ starts with [^\W_] which will match any word char except an underscore, so it could also be an uppercase char.
Then [a-z.,_-]+ will match 1+ times any of the listed, which means the string can also end with a comma for example.
Looking at the conditions listed, you could use:
^[a-z](?:[a-z.,_-]*[a-z])?\Z
^ Start of string
[a-z] Match a lower case char a-z
(?: Non capture group
[a-z.,_-]*[a-z] Match 0+ occurrences of the listed ending with a-z
)? Close group and make it optional
\Z End of string
Regex demo
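A short sketch comparing the "no special character at either end" pattern with the relaxed "no special character at the start" pattern from the earlier answer. The names and sample strings are made up for illustration.

```python
import re

# Must start and end with a lower-case letter.
NO_SPECIAL_AT_EDGES = re.compile(r'^[a-z](?:[a-z.,_-]*[a-z])?\Z')
# Only the first character is restricted to a lower-case letter.
NO_SPECIAL_AT_START = re.compile(r'^[a-z][-.,_a-z]*$')

samples = ["sayanthan", "s", "sa.ya_n-than", "sayanthan.",
           ".sayanthan", "Sayanthan", "sa#yanthan"]

for s in samples:
    edges = bool(NO_SPECIAL_AT_EDGES.search(s))
    start = bool(NO_SPECIAL_AT_START.search(s))
    print(f"{s!r:16} edges={edges} start={start}")
```

The two patterns should disagree only on "sayanthan.", which ends with a special character.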

Python regex: using or statement

I may not be saying this right (I'm a total regex newbie). Here's the code I currently have:
bugs.append(re.compile("^(\d+)").match(line).group(1))
I'd like to add to the regex so it looks at either '\d+' (starts with digits) or that it starts with 2 capital letters and contains a '-' before the first whitespace. I have the regex for the capital letters:
^[A-Z]{2,}
but I'm not sure how to add the '-' and then make an OR with the \d+. Does this make sense? Thanks!
The way to do an OR in regexps is with the "alternation" or "pipe" operator, |.
For example, to match either one or more digits, or two or more capital letters:
^(\d+|[A-Z]{2,})
Debuggex Demo
You may or may not sometimes need to add/remove/move parentheses to get the precedence right. The way I've written it, you've got one group that captures either the digit string or the capitals. While you're learning the rules (in fact, even after you've learned the rules) it's helpful to look at a regular expression visualizer/debugger like the one I used.
Your rule is slightly more complicated: you want 2 or more capital letters, and a hyphen before the first space. That's a bit hard to write as is, but if you change it to two or more capital letters, zero or more non-space characters, and a hyphen, that's easy:
^(\d+|[A-Z]{2,}\S*?-)
Debuggex Demo
(Notice the \S*?—that means we're going to match as few characters as possible, instead of as many as possible, so we'll only match up to the first hyphen in THIS-IS-A-TEST instead of up to the last. If you want the other one, just drop the ?.)
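Here is a minimal sketch of that alternation in use. The helper extract_id and the sample lines are made up; the None check is there because chaining .match(line).group(1), as in the question, raises AttributeError when a line does not match.

```python
import re

BUG_ID = re.compile(r'^(\d+|[A-Z]{2,}\S*?-)')

def extract_id(line):
    """Return the leading digits or capital-letter prefix, or None if absent."""
    m = BUG_ID.match(line)
    return m.group(1) if m else None

print(extract_id("12345 crash on start"))   # 12345
print(extract_id("AB-123 fix login"))       # AB-
print(extract_id("THIS-IS-A-TEST now"))     # THIS-
print(extract_id("ab-12 lower"))            # None
print(extract_id("AB 12-3"))                # None
```

The last line returns None because the hyphen comes after the first space, and \S*? cannot cross whitespace.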
Write | for "or". For a sequence of zero or more non-whitespace characters, write \S*.
re.compile(r'^(\d+|[A-Z][A-Z]\S*-\s)')
re.compile(r"""
    ^                 # beginning of the line
    (?:               # non-capturing group; do not return this group in .group()
        (\d+)         # one or more digits, captured as a group
        |             # or
        [A-Z]{2}      # exactly two uppercase letters
        \S*           # any number of non-whitespace characters
        -             # the dash you wanted
    )                 # end of the non-capturing group
    """,
    re.X)             # enable comments in the regex
