Regex for matching number ranges with specific units - python

I need help to complete a regex pattern. I need a pattern to match a range of numbers including unit.
Examples:
The car drives 50,5 - 80 km/10min on the road.
The car drives 50,5 - 80 km / 10min on the road.
The car drives 40,5-80 km/h on the road.
The car drives 30-50 km/h on the road.
The car drives 40 - 60.8 km/ h on the road.
The car drives 40.90-60,8 km/h on the road.
I need to match the entire ranges. Good would also be (?:km/10min|km / 10min|km/h|km/ h) to simplify this part so that this does not have to be listed multiple times. So also here the blanks taken into account.
([,.\d]+)\s*(?:km/10min|km / 10min|km/h|km/ h)
https://regex101.com/r/Ey792V/1
Currently, unfortunately, only the first number is matched. Thanks in advance for the help.

You could make the pattern a bit more specific and optionally match whitespace chars instead of hard coding all the possible spaces variations
\b\d+(?:[.,]\d+)?(?:\s*-\s*\d+(?:[.,]\d+)?)?\s*km\s*/\s*(?:h|10min)\b
Explanation
\b A word boundary
\d+(?:[.,]\d+)? Match 1+ digits with an optional decimal part
(?: Non capture group
\s*-\s* Match - between optional whitespace chars
\d+(?:[.,]\d+)? Match 1+ digits with an optional decimal part
)? Close the non capture group and make it optional
\s*km\s*/\s* Match km/ surrounded with optional whitespace chars to match different variations
(?:h|10min) Match either h or 10min (Or use \d+min to match 1+ digits)
\b A word boundary
See a regex demo.

Your question is not entirely clear as you framed it in terms of examples. To be precise you need to state the question in words, then use the examples for illustration. To take one example, the question does not make clear whether
"The car drives 40,5- 80 km /h on the road."
is to be matched.
Expressing a question in words is not always easy but it is a skill that you need to acquire in order write clear code specifications. A by-product is that it makes the code easier to write, as that amounts to merely translating the words into code.
Let's give it a try.
Match a string comprised by six successive substrings:
One or more digits that are not preceded by a comma or period, optionally followed by a comma, hyphen or period, which, if present, is followed by one or more digits.
A hyphen, optionally preceded and/or followed by a space.
One or more digits, optionally followed by a comma or period, the comma or period, if present, being followed by one or more digits.
The literal " km".
A forward slash, optionally preceded and/or followed by a space.
The literal "h" or one or more digits followed by "min", followed by a word boundary.
I cannot be sure that this is what you want but you should be able to easily modify these requirements as necessary.
Now let's translate these requirements into a regular expression.
1. One or more digits that are not preceded by a comma or period, optionally followed by a comma, hyphen or period, which, if present, is followed by one or more digits.
(?<![,.])\d+(?:[,.-]\d+)?
(?<![,.]) is a negative lookbehind. It is needed to avoid matching, for example, the indicated part of the following string.
"The car drives 1,500.5 - 80 km/10min on the road."
^^^^^^^^^^^^^^^
2. A hyphen, optionally preceded and/or followed by a space.
?- ?
(The first question mark is preceded by a space.)
3. One or more digits, optionally followed by a comma or period, the comma or period, if present, being followed by one or more digits.
\d+(?:[,.]\d+)?
4. The literal " km".
km
5. A forward slash, optionally preceded and/or followed by a space.
?\/ ?
(The first question mark is preceded by a space.)
6. The literal "h" or one or more digits followed by "min", followed by a word boundary.
(?:h|\d+min)\b
Now we can simply join these pieces to form the regular expression.
\d+(?:[,.-]\d+)? ?- ?\d+(?:[,.]\d+)?km ?\/ ?(?:h|\d+min)\b
Demo

\d.+(?:h\b|min\b|s\b)
Would also work. Demo

Related

Regex to match (French) numbers

I'm trying to find a simple (not perfect) pattern to recognise French numbers in a French text. French numbers use comma for the Anglo-Saxon decimal, and use dot or space for the thousand separator. \u00A0 is non-breaking space, also often used in French documents for the thousand separator.
So my first attempt is:
number_pattern = re.compile(r'\d[\d\., \u00A0]*\d', flags=re.UNICODE)
... but the trouble is that this doesn't then match a single digit.
But if I do this
number_pattern = re.compile(r'\d[\d\., \u00A0]*\d?', flags=re.UNICODE)
it then picks up trailing space (or NBS) characters (or for that matter a trailing comma or full stop).
The thing is, the pattern must both START and END with a digit, but it is possible that these may be the SAME character.
How might I achieve this? I considered a two-stage process where you try to see whether this is in fact a single-digit number... but that in itself is not trivial: if followed by a space, NBS, comma or dot, you then have to see whether the character after that, if there is one, is or is not a digit.
Obviously I'm hoping to find a solution which involves only one regex: if there is only one regex, it is then possible to do something like:
doubled_dollars_plain_text = plain_text.replace('$', '$$')
substituted_plain_text = re.sub(number_pattern, '$number', doubled_dollars_plain_text)
... having to use a two-stage process would make this much more lengthy and fiddly.
Edit
I tried to see whether I could implement ThierryLathuille's idea, so I tried:
re.compile(r'(\d(?:[\d\., \u00A0]*\d)?)', flags=re.UNICODE)
... this seems to work pretty well. Unlike JvdV's solution it doesn't attempt to check that thousand separators are followed by 3 digits, and for that matter you could have a succession of commas and spaces in the middle and it would still pass, which is quite problematic when you have a list of numbers separated by ", ". But it's acceptable for certain purposes... until something more sophisticated can be found.
I wonder if there's a way of saying "any non-digit in this pattern must be on its own" (i.e. must be bracketed between two digits)?
What about:
\d{1,3}(?:[\s.]?\d{3})*(?:,\d+)?(?!\d)
See an online demo
\d{1,3} - 1-3 digits.
(?: - Open 1st non-capture group:
[\s.]? - An optional whitespace or literal dot. Note that with unicode \s should match \p{Z} to include the non-breaking whitespace.
\d{3} - Three digits.
)* - Close 1st non-capture group and match 0+ times.
(?:,\d+)? - A 2nd optional non-capture group to match a comma followed by at least 1 digit.
(?!\d) - A negative lookahead to prevent trailing digits.
Very much inspired by JvdV's answer, I suggest this:
number_pattern = re.compile(r'(\d{1,3}(?:(?:[. \u00A0])?\d{3})*(?:,\d+)?(?!\d))', flags=re.UNICODE)
... this makes the thousand separator optional, and also makes thousand groups optional. It restricts the thousand-separator to 3 possible characters: dot, space and NBS, which is necessary for French numbers as found in practice.
PS I just found today that in fact Swiss French-speakers appear sometimes to use an apostrophe (of which there are several candidates in the vastness of Unicode) as a thousand separator.

Regex that matches newlines literally and passively

I have to construct a regex that matches client codes that look like:
XXX/X{3,6}
XXX.X{3,6}
XXX.X{3,6}/XXX
With X a number between 0 and 9.
The regex needs to be strong enough so we don't extract codes that are within another string. The use of word boundaries was my first idea.
The regex looks like this: \b\d{3}[\.\/]\d{3,6}(?:\/\d{3})?\b
The problem with word boundaries is that it also matches dots. So a number like "123/456.12" would match "123/456" as the client number. So then I came up with the following regex: (?<!\S)\d{3}[\.\/]\d{3,6}(?:\/\d{3})?(?!\S). It uses lookbehind and lookahead and checks if that character is a white space. This matches most of the client codes correctly.
But there is still one last issue. We are using a Google OCR text to extract the codes from. This means that a valid code can be found in the text like 123/456\n, \n123/456, \n123/456\n, etc. Checking if the previous and or next characters are white space doesn't work because the literal "\n" is not included in this. If I do something like (?<!\S|\\n) as the word boundary it also includes a back and/or forward slash for some reason. Currently I came up with the following regex (?<![^\r\n\t\f\v n])\d{3}[\.\/]\d{3,6}(?:\/\d{3})?(?![^\r\n\t\f\v \\]), but that only checks if the previous character is a "n" or white space and the next a backslash or white space. So strings like "lorem\123/456" would still find a match. I need some way to include the "\n" in the white space characters without breaking the lookahead/lookbehind.
Do you guys have any idea how to solve this issue? All input is appreciated. Thx!
It seems you want to subtract \n from the whitespace boundaries. You can use
re.findall(r'(?<![^\s\n])\d{3}[./]\d{3,6}(?:/\d{3})?(?![^\s\n])', text)
See the Python demo and this regex demo.
If the \n are combinations of \ and n chars, you need to make sure the \S in the lookarounds does not match those:
import re
text = r'Codes like 123/456\n \n123/3456 \n123/23456\n etc are correct \n333.3333/333\n'
print( re.findall(r'(?<!\S(?<!\\n))\d{3}[./]\d{3,6}(?:/\d{3})?(?!(?!\\n)\S)', text) )
# => ['123/456', '123/3456', '123/23456', '333.3333/333']
See this Python demo.
Details:
(?<![^\s\n]) - a negative lookbehind that matches a location that is not immediately preceded with a char other than whitespace and an LF char
(?<!\S(?<!\\n)) - a left whitespace boundary that does not trigger if the non-whitespace is the n from the \n char combination
\d{3} - theree digits
[./] - a . or /
\d{3,6} - three to six digits
(?:/\d{3})? - an optional sequence of / and three digits
(?![^\s\n]) - a negative lookahead that requires no char other than whitespace and LF immediately to the right of the current location.
(?!(?!\\n)\S) - a right whitespace boundary that does not trigger if the non-whitespace is the \ char followed with n.

Negative lookahead not working after character range with plus quantifier

I am trying to implement a regex which includes all the strings which have any number of words but cannot be followed by a : and ignore the match if it does. I decided to use a negative look ahead for it.
/([a-zA-Z]+)(?!:)/gm
string: lame:joker
since i am using a character range it is matching one character at a time and only ignoring the last character before the : .
How do i ignore the entire match in this case?
Link to regex101: https://regex101.com/r/DlEmC9/1
The issue is related to backtracking: once your [a-zA-Z]+ comes to a :, the engine steps back from the failing position, re-checks the lookahead match and finds a match whenver there are at least two letters before a colon, returning the one that is not immediately followed by :. See your regex demo: c in c:real is not matched as there is no position to backtrack to, and rea in real:c is matched because a is not immediately followed with :.
Adding implicit requirement to the negative lookahead
Since you only need to match a sequence of letters not followed with a colon, you can explicitly add one more condition that is implied: and not followed with another letter:
[A-Za-z]+(?![A-Za-z]|:)
[A-Za-z]+(?![A-Za-z:])
See the regex demo. Since both [A-Za-z] and : match a single character, it makes sense to put them into a single character class, so, [A-Za-z]+(?![A-Za-z:]) is better.
Preventing backtracking into a word-like pattern by using a word boundary
As #scnerd suggests, word boundaries can also help in these situations, but there is always a catch: word boundary meaning is context dependent (see a number of ifs in the word boundary explanation).
[A-Za-z]+\b(?!:)
is a valid solution here, because the input implies the words end with non-word chars (i.e. end of string, or chars other than letter, digits and underscore). See the regex demo.
When does a word boundary fail?
\b will not be the right choice when the main consuming pattern is supposed to match even if glued to other word chars. The most common example is matching numbers:
\d+\b(?!:) matches 12 in 12,, but not in 12:, and also 12c and 12_
\d+(?![\d:]) matches 12 in 12, and 12c and 12_, not in 12: only.
Do a word boundary check \b after the + to require it to get to the end of the word.
([a-zA-Z]+\b)(?!:)
Here's an example run.

python regex: don't allow a specific character to repeat

I have a regex
^[a-z][a-z0-9\-]{6,10}[a-z0-9]$
Which matches the following rules:
8-12 characters in length
first character is lowercase alpha
last characters lowercase alpha or digit
internal characters can contain a hyphen
it's re-used a lot in a module, always alongside some other rules and regexes
while writing out some unit tests, i noticed that it's always used in conjunction with another specific rule.
hyphens may not repeat
i can't wrap my head around integrating that rule into this one. i've tried a few dozen approaches with lookbehinds and lookaheads, but have had no luck on isolating to the specific character AND keeping the length requirement.
No repeating hyphen ^[a-z](?:[a-z0-9]|-(?!-)){6,10}[a-z0-9]$
Explained
^ [a-z]
(?:
[a-z0-9] # alnum
| # or
- (?! - ) # hyphen if not followed by hyphen
){6,10}
[a-z0-9] $

Python regex: using or statement

I may not being saying this right (I'm a total regex newbie). Here's the code I currently have:
bugs.append(re.compile("^(\d+)").match(line).group(1))
I'd like to add to the regex so it looks at either '\d+' (starts with digits) or that it starts with 2 capital letters and contains a '-' before the first whitespace. I have the regex for the capital letters:
^[A-Z]{2,}
but I'm not sure how to add the '-' and the make an OR with the \d+. Does this make sense? Thanks!
The way to do an OR in regexps is with the "alternation" or "pipe" operator, |.
For example, to match either one or more digits, or two or more capital letter:
^(\d+|[A-Z]{2,})
Debuggex Demo
You may or may not sometimes need to add/remove/move parentheses to get the precedence right. The way I've written it, you've got one group that captures either the digit string or the capitals. While you're learning the rules (in fact, even after you've learned the rules) it's helpful to look at a regular expression visualizer/debugger like the one I used.
Your rule is slightly more complicated: you want 2 or more capital letters, and a hyphen before the first space. That's a bit hard to write as is, but if you change it to two or more capital letters, zero or more non-space characters, and a hyphen, that's easy:
^(\d+|[A-Z]{2,}\S*?-)
Debuggex Demo
(Notice the \S*?—that means we're going to match as few characters as possible, instead of as many as possible, so we'll only match up to the first hyphen in THIS-IS-A-TEST instead of up to the last. If you want the other one, just drop the ?.)
Write | for "or". For a sequence of zero or more non-whitespace characters, write \S*.
re.compile('^(\d+|[A-Z][A-Z]\S*-\s)')
re.compile(r"""
^ # beginning of the line
(?: # non-capturing group; do not return this group in .group()
(\d+) # one or more digits, captured as a group
| # Or
[A-Z]{2} # Exactly two uppercase letters
\S* # Any number of non-whitespace characters
- # the dash you wanted
) # end of the non-capturing group
""",
re.X) # enable comments in the regex

Categories