Regex that matches newlines literally and passively - python

I have to construct a regex that matches client codes that look like:
XXX/X{3,6}
XXX.X{3,6}
XXX.X{3,6}/XXX
With X a number between 0 and 9.
The regex needs to be strong enough so we don't extract codes that are within another string. The use of word boundaries was my first idea.
The regex looks like this: \b\d{3}[\.\/]\d{3,6}(?:\/\d{3})?\b
The problem with word boundaries is that a boundary also occurs between a digit and a dot. So a number like "123/456.12" would match "123/456" as the client number. So then I came up with the following regex: (?<!\S)\d{3}[\.\/]\d{3,6}(?:\/\d{3})?(?!\S). It uses a lookbehind and a lookahead to check that the neighbouring character, if any, is white space. This matches most of the client codes correctly.
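To make the difference concrete, here is a minimal sketch comparing the two patterns described above. The sample string is made up for illustration.

```python
import re

# The two patterns from the question; only the boundary checks differ.
word_boundary = re.compile(r'\b\d{3}[./]\d{3,6}(?:/\d{3})?\b')
whitespace_boundary = re.compile(r'(?<!\S)\d{3}[./]\d{3,6}(?:/\d{3})?(?!\S)')

# Hypothetical sample: the first number is not a client code, the second is.
sample = "ref 123/456.12 and 123/4567 here"

wb_hits = word_boundary.findall(sample)        # \b also sits between "6" and "."
ws_hits = whitespace_boundary.findall(sample)  # needs whitespace or a string edge

print(wb_hits)  # ['123/456', '123/4567']
print(ws_hits)  # ['123/4567']
```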
But there is still one last issue. We are extracting the codes from Google OCR text. This means that a valid code can be found in the text like 123/456\n, \n123/456, \n123/456\n, etc. Checking if the previous and/or next characters are white space doesn't work because the literal "\n" is not included in this. If I do something like (?<!\S|\\n) as the word boundary it also includes a back and/or forward slash for some reason. Currently I came up with the following regex (?<![^\r\n\t\f\v n])\d{3}[\.\/]\d{3,6}(?:\/\d{3})?(?![^\r\n\t\f\v \\]), but that only checks if the previous character is an "n" or white space and the next a backslash or white space. So strings like "lorem\123/456" would still find a match. I need some way to include the "\n" in the white space characters without breaking the lookahead/lookbehind.
Do you guys have any idea how to solve this issue? All input is appreciated. Thx!

It seems you want to subtract \n from the whitespace boundaries. You can use
re.findall(r'(?<![^\s\n])\d{3}[./]\d{3,6}(?:/\d{3})?(?![^\s\n])', text)
See the Python demo and this regex demo.
If the \n are combinations of \ and n chars, you need to make sure the \S in the lookarounds does not match those:
import re
text = r'Codes like 123/456\n \n123/3456 \n123/23456\n etc are correct \n333.3333/333\n'
print( re.findall(r'(?<!\S(?<!\\n))\d{3}[./]\d{3,6}(?:/\d{3})?(?!(?!\\n)\S)', text) )
# => ['123/456', '123/3456', '123/23456', '333.3333/333']
See this Python demo.
Details:
(?<![^\s\n]) - a negative lookbehind that matches a location that is not immediately preceded with a char other than whitespace and an LF char
(?<!\S(?<!\\n)) - a left whitespace boundary that does not trigger if the non-whitespace is the n from the \n char combination
\d{3} - three digits
[./] - a . or /
\d{3,6} - three to six digits
(?:/\d{3})? - an optional sequence of / and three digits
(?![^\s\n]) - a negative lookahead that requires no char other than whitespace and LF immediately to the right of the current location.
(?!(?!\\n)\S) - a right whitespace boundary that does not trigger if the non-whitespace is the \ char followed with n.
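As a quick check of the literal-\n variant, the sketch below runs it against a few made-up inputs, including the "lorem\123/456" case from the question. The sample strings are assumptions chosen for illustration, not real OCR output.

```python
import re

# Pattern from the answer above, for text where "\n" is a backslash followed by "n".
code_re = re.compile(r'(?<!\S(?<!\\n))\d{3}[./]\d{3,6}(?:/\d{3})?(?!(?!\\n)\S)')

# Hypothetical inputs; raw strings keep the backslashes literal.
samples = [
    r'\n123/456\n',    # wrapped in literal \n on both sides
    r'lorem\123/456',  # backslash that is not part of \n
    r'lorem123/456',   # glued to a word
    r'123/456.12',     # followed by more of a number
]

results = {s: code_re.findall(s) for s in samples}
for s, found in results.items():
    print(repr(s), found)
# Only the first sample yields a match: ['123/456']
```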

Related

Regex to match (French) numbers

I'm trying to find a simple (not perfect) pattern to recognise French numbers in a French text. French numbers use comma for the Anglo-Saxon decimal, and use dot or space for the thousand separator. \u00A0 is non-breaking space, also often used in French documents for the thousand separator.
So my first attempt is:
number_pattern = re.compile(r'\d[\d\., \u00A0]*\d', flags=re.UNICODE)
... but the trouble is that this doesn't then match a single digit.
But if I do this
number_pattern = re.compile(r'\d[\d\., \u00A0]*\d?', flags=re.UNICODE)
it then picks up trailing space (or NBS) characters (or for that matter a trailing comma or full stop).
The thing is, the pattern must both START and END with a digit, but it is possible that these may be the SAME character.
How might I achieve this? I considered a two-stage process where you try to see whether this is in fact a single-digit number... but that in itself is not trivial: if followed by a space, NBS, comma or dot, you then have to see whether the character after that, if there is one, is or is not a digit.
Obviously I'm hoping to find a solution which involves only one regex: if there is only one regex, it is then possible to do something like:
doubled_dollars_plain_text = plain_text.replace('$', '$$')
substituted_plain_text = re.sub(number_pattern, '$number', doubled_dollars_plain_text)
... having to use a two-stage process would make this much more lengthy and fiddly.
Edit
I tried to see whether I could implement ThierryLathuille's idea, so I tried:
re.compile(r'(\d(?:[\d\., \u00A0]*\d)?)', flags=re.UNICODE)
... this seems to work pretty well. Unlike JvdV's solution it doesn't attempt to check that thousand separators are followed by 3 digits, and for that matter you could have a succession of commas and spaces in the middle and it would still pass, which is quite problematic when you have a list of numbers separated by ", ". But it's acceptable for certain purposes... until something more sophisticated can be found.
I wonder if there's a way of saying "any non-digit in this pattern must be on its own" (i.e. must be bracketed between two digits)?
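A small sketch of the behaviour described in the Edit above, using a made-up sample: the pattern handles a single digit and a decimal number, but swallows a comma-separated list as one "number".

```python
import re

# Pattern from the Edit above: a digit, optionally followed by a run of
# digits/separators that must itself end in a digit.
loose_number = re.compile(r'(\d(?:[\d\., \u00A0]*\d)?)', flags=re.UNICODE)

# Hypothetical sample with a single digit, a decimal, and a list of numbers.
sample = "7 pommes, 1 234,5 kg et 1, 2, 3"

loose_hits = loose_number.findall(sample)
print(loose_hits)  # ['7', '1 234,5', '1, 2, 3']
```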
What about:
\d{1,3}(?:[\s.]?\d{3})*(?:,\d+)?(?!\d)
See an online demo
\d{1,3} - 1-3 digits.
(?: - Open 1st non-capture group:
[\s.]? - An optional whitespace or literal dot. Note that with unicode \s should match \p{Z} to include the non-breaking whitespace.
\d{3} - Three digits.
)* - Close 1st non-capture group and match 0+ times.
(?:,\d+)? - A 2nd optional non-capture group to match a comma followed by at least 1 digit.
(?!\d) - A negative lookahead to prevent trailing digits.
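The breakdown above can be tried out with a short sketch. The sample sentence is invented, and it only exercises the cases discussed in this thread (spaced thousands, dotted thousands, a decimal comma, and a ", "-separated list).

```python
import re

# Pattern from the answer above, unchanged.
number_re = re.compile(r'\d{1,3}(?:[\s.]?\d{3})*(?:,\d+)?(?!\d)')

# Hypothetical sample text.
sample = "1 234,56 euros, 7 articles, 12.000 pièces, puis 1, 2, 3"

hits = number_re.findall(sample)
print(hits)  # ['1 234,56', '7', '12.000', '1', '2', '3']
```

Because a separator must be followed by exactly three digits, the list "1, 2, 3" is now split into three separate numbers.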
Very much inspired by JvdV's answer, I suggest this:
number_pattern = re.compile(r'(\d{1,3}(?:(?:[. \u00A0])?\d{3})*(?:,\d+)?(?!\d))', flags=re.UNICODE)
... this makes the thousand separator optional, and also makes thousand groups optional. It restricts the thousand-separator to 3 possible characters: dot, space and NBS, which is necessary for French numbers as found in practice.
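Putting this pattern together with the substitution step from the question gives roughly the following sketch. The input text is made up, and the non-breaking space is written as "\u00a0" so it is visible in the source.

```python
import re

# Pattern from the answer above.
number_pattern = re.compile(
    r'(\d{1,3}(?:(?:[. \u00A0])?\d{3})*(?:,\d+)?(?!\d))', flags=re.UNICODE)

# Hypothetical text; "\u00a0" is a non-breaking space used as thousand separator.
plain_text = "Coût en $ : 1\u00a0234,5 pour 3 articles"

# Same two steps as in the question: double the dollars, then replace numbers.
doubled_dollars_plain_text = plain_text.replace('$', '$$')
substituted_plain_text = number_pattern.sub('$number', doubled_dollars_plain_text)

print(substituted_plain_text)  # Coût en $$ : $number pour $number articles
```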
PS I just found today that in fact Swiss French-speakers appear sometimes to use an apostrophe (of which there are several candidates in the vastness of Unicode) as a thousand separator.

Regular Expression in Python strings

I want to validate a string that satisfies the below three conditions using regular expression
The special characters allowed are (. , _ , - ).
Should contain only lower-case characters.
Should not start or end with special character.
To satisfy the above conditions, I have created a format as below
^[^\W_][a-z\.,_-]+
This pattern works fine up to the second character. However, it fails for the third and subsequent characters if those contain any special characters or upper-case characters.
Example:
Pattern Works for the string S#yanthan but not for Sa#yanthan. I am expecting that pattern to pass even if the third and subsequent characters contains any special characters or upper case characters. Can you suggest me where this pattern goes wrong please? Below is the snippet of the code.
import re
a = "Sayanthan"
# Raw string so that \W reaches the regex engine unchanged
exp = re.search(r"^[^\W_][a-z\.,_-]+", a)
if exp:
    print(True)
else:
    print(False)
Based on your initial rules I'd go with:
^[a-z](?:[.,_-]*[a-z])*$
See the online demo.
However, you mentioned in the comments:
"Also the third condition is "should not start with Special character" instead of "should not start or end with Special character""
In that case you could use:
^[a-z][-.,_a-z]*$
See the online demo
The pattern that you tried ^[^\W_][a-z.,_-]+ starts with [^\W_] which will match any word char except an underscore, so it could also be an uppercase char.
Then [a-z.,_-]+ will match 1+ times any of the listed, which means the string can also end with a comma for example.
Looking at the conditions listed, you could use:
^[a-z](?:[a-z.,_-]*[a-z])?\Z
^ Start of string
[a-z] Match a lower case char a-z
(?: Non capture group
[a-z.,_-]*[a-z] Match 0+ occurrences of the listed ending with a-z
)? Close group and make it optional
\Z End of string
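A minimal sketch comparing the strict pattern just described with the start-only variant from the earlier answer. The candidate strings are made up for illustration.

```python
import re

# Must start and end with a-z (pattern broken down above).
strict = re.compile(r'^[a-z](?:[a-z.,_-]*[a-z])?\Z')
# Only the first character is constrained (variant from the earlier answer).
start_only = re.compile(r'^[a-z][-.,_a-z]*$')

# Hypothetical candidates.
candidates = ["sayanthan", "say.an_than", "a", "Sayanthan", "sa#yanthan", "sayanthan."]

strict_ok = [s for s in candidates if strict.search(s)]
start_only_ok = [s for s in candidates if start_only.search(s)]

print(strict_ok)      # ['sayanthan', 'say.an_than', 'a']
print(start_only_ok)  # ['sayanthan', 'say.an_than', 'a', 'sayanthan.']
```

One difference worth knowing: in Python, $ also matches just before a trailing newline, while \Z matches only at the very end of the string, so "abc\n" passes the $-anchored pattern but not the \Z-anchored one.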

Python regex specific word with single quote at end

Searching a large syslog repo and need to get a specific word to match with a certain condition.
I'm using regex to compile a search for this word. I've read the python docs on regex characters and I understand how to specify each criteria separately but somehow missing how to concatenate all together for my specific search. This is what I have so far but not working...
p = re.compile("^'[A-Z]\w+'$")
match = re.search(p, syslogline, )
the word is a username that can be alphanum, always beginning with an uppercase character (preceded by blank space), can contain chars or nums, is 3-12 in length and ends with single quote.
an example would be: Epresley01' or J98473'
Brief
Based on your requirements (also stated below), your regex doesn't work because:
^' Asserts the position at the start of the line and ensures a ' is the first character of that line.
$ Asserts the position at the end of the line.
Having said that, you specify that it's preceded by a space character (which isn't present in your pattern). Your pattern also checks for ' which isn't the first character of the username. Given that you haven't actually given us a sample of your file I can't confirm nor deny that your string starts before the username and ends after it, but if that's not the case the anchors ^$ are also not helping you here.
Requirements
The requirements below are simply copied from the OP's question (rewritten) to outline the username format. The username:
Is preceded by a space character.
Starts with an uppercase letter.
Contains chars or nums. I'm assuming here that chars actually means letters and that all letters in the username (including the uppercase starting character) are ASCII.
Is 3-12 characters in length (excluding the preceding space and the end character stated below).
Ends with an apostrophe character '.
Code
See regex in use here
(?<= )[A-Z][^\W_]{2,11}'
Explanation
(?<= ) Positive lookbehind ensuring what precedes is a space character
[A-Z] Match any uppercase ASCII letter
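[^\W_]{2,11} Match any word character except underscore _, between 2 and 11 times (equivalent to a-zA-Z0-9 for ASCII input; in Python 3, \W is Unicode-aware for str patterns unless re.ASCII is set)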
This appears a little confusing because it's actually a double-negative. It's saying match anything that's not in the set. The \W matches any non-word character. Since it's a double-negative, it's like saying don't match non-word characters. Adding _ to the set negates it.
' Match the apostrophe character ' literally
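Here is a small sketch of the pattern explained above. The log line is invented, since the question gives no sample, and the re.ASCII flag is an addition of this sketch to enforce the answer's ASCII-only assumption.

```python
import re

# Pattern from the answer; re.ASCII is added here (not in the original answer)
# so that [^\W_] means exactly a-zA-Z0-9.
username_re = re.compile(r"(?<= )[A-Z][^\W_]{2,11}'", flags=re.ASCII)

# Hypothetical syslog-like line; the real format is not shown in the question.
syslogline = "sshd[1042]: session opened for Epresley01' and J98473' but not x1' or Ab'"

usernames = username_re.findall(syslogline)
print(usernames)  # ["Epresley01'", "J98473'"]
```

Here x1' is rejected because it starts with a lowercase letter, and Ab' because it is shorter than three characters.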
I think you can do it like this:
(Updated after the comment from @ctwheels)
See regex in use here
[A-Z][a-zA-Z0-9]{1,10}'
Explanation
Match a whitespace
Match an uppercase character [A-Z]
Match [a-zA-Z0-9]{1,10} (one to ten letters or digits)
Match an apostrophe '

Negative lookahead not working after character range with plus quantifier

I am trying to implement a regex which includes all the strings which have any number of words but cannot be followed by a : and ignore the match if it does. I decided to use a negative look ahead for it.
/([a-zA-Z]+)(?!:)/gm
string: lame:joker
Since I am using a character range, it is matching one character at a time and only ignoring the last character before the : .
How do I ignore the entire match in this case?
Link to regex101: https://regex101.com/r/DlEmC9/1
The issue is related to backtracking: once your [a-zA-Z]+ comes to a :, the engine steps back from the failing position, re-checks the lookahead and finds a match whenever there are at least two letters before a colon, returning the part that is not immediately followed by :. See your regex demo: c in c:real is not matched as there is no position to backtrack to, and rea in real:c is matched because a is not immediately followed with :.
Adding implicit requirement to the negative lookahead
Since you only need to match a sequence of letters not followed with a colon, you can explicitly add one more condition that is implied: and not followed with another letter:
[A-Za-z]+(?![A-Za-z]|:)
[A-Za-z]+(?![A-Za-z:])
See the regex demo. Since both [A-Za-z] and : match a single character, it makes sense to put them into a single character class, so, [A-Za-z]+(?![A-Za-z:]) is better.
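The backtracking effect and the fix can be reproduced in Python with the string from the question. This is a minimal sketch using re.findall in place of the /g flag.

```python
import re

text = "lame:joker"

# Original pattern: the engine backtracks from "lame" to "lam", which is
# followed by "e" rather than ":", so the lookahead passes.
original = re.findall(r'[a-zA-Z]+(?!:)', text)

# Fixed pattern: a match may not be followed by another letter either,
# so backtracking into the middle of a word no longer helps.
fixed = re.findall(r'[A-Za-z]+(?![A-Za-z:])', text)

print(original)  # ['lam', 'joker']
print(fixed)     # ['joker']
```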
Preventing backtracking into a word-like pattern by using a word boundary
As @scnerd suggests, word boundaries can also help in these situations, but there is always a catch: word boundary meaning is context dependent (see a number of ifs in the word boundary explanation).
[A-Za-z]+\b(?!:)
is a valid solution here, because the input implies the words end with non-word chars (i.e. end of string, or chars other than letters, digits and underscore). See the regex demo.
When does a word boundary fail?
\b will not be the right choice when the main consuming pattern is supposed to match even if glued to other word chars. The most common example is matching numbers:
\d+\b(?!:) matches 12 in 12, but finds no match in 12:, 12c or 12_
\d+(?![\d:]) matches 12 in 12, 12c and 12_, and finds no match only in 12:
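The two number patterns above can be compared side by side. This is a minimal sketch over the four sample strings from the list.

```python
import re

with_boundary = re.compile(r'\d+\b(?!:)')  # word boundary, then "no colon"
with_class = re.compile(r'\d+(?![\d:])')   # "no digit and no colon" in one class

samples = ["12,", "12:", "12c", "12_"]

boundary_hits = {s: with_boundary.findall(s) for s in samples}
class_hits = {s: with_class.findall(s) for s in samples}

print(boundary_hits)  # {'12,': ['12'], '12:': [], '12c': [], '12_': []}
print(class_hits)     # {'12,': ['12'], '12:': [], '12c': ['12'], '12_': ['12']}
```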
Do a word boundary check \b after the + to require it to get to the end of the word.
([a-zA-Z]+\b)(?!:)
Here's an example run.

Python - How to remove spaces between Chinese characters while keeping the spaces between a character and a number?

The real issue may be more complicated, but for now I'm trying to accomplish something a bit easier. I'm trying to remove the space between two Chinese/Japanese characters while keeping the space between a number and a character. An example below:
text = "今天特别 热,但是我买了 3 个西瓜。"
The output I want to get is
text = "今天特别热,但是我买了 3 个西瓜。"
I tried to use Python script and regular expression:
import re
text = re.sub(r'\s(?=[^A-z0-9])', '', text)
However, the result is
text = '今天特别热,但是我买了 3个西瓜。'
So I'm struggling about how I can maintain the space between a character and a number at all time? And I don't want to use a method of adding a space between "3" and "个".
I'll continue to think about it, but let me know if you have ideas...Thank you so much in advance!
I understand the spaces you need to remove reside in between letters.
Use
re.sub(r'(?<=[^\W\d_])\s+(?=[^\W\d_])', '', text)
Details:
(?<=[^\W\d_]) - a positive lookbehind requiring a Unicode letter immediately to the left of the current location
\s+ - 1+ whitespaces (remove + if only one is expected)
(?=[^\W\d_]) - a positive lookahead that requires a Unicode letter immediately to the right of the current location.
You do not need re.U flag since it is on by default in Python 3. You need it in Python 2 though.
You may also use capturing groups:
re.sub(r'([^\W\d_])\s+([^\W\d_])', r'\1\2', text)
where the non-consuming lookarounds are turned into consuming capturing groups ((...)). The \1 and \2 in the replacement pattern are backreferences to the capturing group values.
See a Python 3 online demo:
import re
text = "今天特别 热,但是我买了 3 个西瓜。"
print(re.sub(r'(?<=[^\W\d_])\s+(?=[^\W\d_])', '', text))
# => 今天特别热,但是我买了 3 个西瓜。
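The lookaround and capturing-group variants above are not fully interchangeable. The sketch below, on a made-up string with single characters separated by spaces, shows where they differ: a capturing group consumes the letter on the right, so that letter cannot also serve as the left-hand letter of the next match.

```python
import re

lookaround = re.compile(r'(?<=[^\W\d_])\s+(?=[^\W\d_])')
capturing = re.compile(r'([^\W\d_])\s+([^\W\d_])')

# Hypothetical input with back-to-back single characters separated by spaces.
text = "今 天 特别 热"

by_lookaround = lookaround.sub('', text)
by_capturing = capturing.sub(r'\1\2', text)

print(by_lookaround)  # 今天特别热
print(by_capturing)   # 今天 特别热
```

When such runs can occur in the input, the lookaround version is the safer choice of the two.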
