I'm trying to find a simple (not perfect) pattern to recognise French numbers in a French text. French numbers use a comma as the decimal separator (where English uses a point), and a dot or space as the thousands separator. \u00A0 is the non-breaking space, which is also often used in French documents as the thousands separator.
So my first attempt is:
number_pattern = re.compile(r'\d[\d\., \u00A0]*\d', flags=re.UNICODE)
... but the trouble is that this doesn't then match a single digit.
But if I do this
number_pattern = re.compile(r'\d[\d\., \u00A0]*\d?', flags=re.UNICODE)
it then picks up trailing space (or NBS) characters (or for that matter a trailing comma or full stop).
The thing is, the pattern must both START and END with a digit, but it is possible that these may be the SAME character.
How might I achieve this? I considered a two-stage process where you try to see whether this is in fact a single-digit number... but that in itself is not trivial: if followed by a space, NBS, comma or dot, you then have to see whether the character after that, if there is one, is or is not a digit.
Obviously I'm hoping to find a solution which involves only one regex: if there is only one regex, it is then possible to do something like:
doubled_dollars_plain_text = plain_text.replace('$', '$$')
substituted_plain_text = re.sub(number_pattern, '$number', doubled_dollars_plain_text)
... having to use a two-stage process would make this much more lengthy and fiddly.
Edit
I tried to see whether I could implement ThierryLathuille's idea, so I tried:
re.compile(r'(\d(?:[\d\., \u00A0]*\d)?)', flags=re.UNICODE)
... this seems to work pretty well. Unlike JvdV's solution it doesn't attempt to check that thousand separators are followed by 3 digits, and for that matter you could have a succession of commas and spaces in the middle and it would still pass, which is quite problematic when you have a list of numbers separated by ", ". But it's acceptable for certain purposes... until something more sophisticated can be found.
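To make the behaviour described above concrete, here is a minimal sketch that runs the pattern from the edit over a few short strings. The helper name find_numbers and the sample sentences are made up for illustration.

```python
import re

# Pattern from the edit above: one digit, optionally followed by any run of
# digits/separators, provided that run ends in a digit.
number_pattern = re.compile(r'(\d(?:[\d\., \u00A0]*\d)?)')


def find_numbers(text):
    """Return every substring the pattern treats as a number (hypothetical helper)."""
    return number_pattern.findall(text)


print(find_numbers("J'ai 5 pommes."))    # ['5']  a single digit is matched
print(find_numbers("Total : 1 234,5."))  # ['1 234,5']  no trailing full stop
print(find_numbers("Lots 1, 2, 3"))      # ['1, 2, 3']  the list problem
```

The last line shows the weakness mentioned above: a list separated by ", " is swallowed as one number.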
I wonder if there's a way of saying "any non-digit in this pattern must be on its own" (i.e. must be bracketed between two digits)?
What about:
\d{1,3}(?:[\s.]?\d{3})*(?:,\d+)?(?!\d)
See an online demo
\d{1,3} - 1-3 digits.
(?: - Open 1st non-capture group:
[\s.]? - An optional whitespace character or literal dot. Note that in Python 3 \s is Unicode-aware for str patterns, so it also matches the non-breaking space (\u00A0); Python's re module does not support \p{Z}.
\d{3} - Three digits.
)* - Close 1st non-capture group and match 0+ times.
(?:,\d+)? - A 2nd optional non-capture group to match a comma followed by at least 1 digit.
(?!\d) - A negative lookahead to prevent trailing digits.
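As a rough check of the breakdown above, the pattern can be tried in Python like this. The helper find_grouped and the sample strings are assumptions made for the example, not part of the answer.

```python
import re

# The pattern explained above, unchanged.
grouped_number = re.compile(r'\d{1,3}(?:[\s.]?\d{3})*(?:,\d+)?(?!\d)')


def find_grouped(text):
    """Return all numbers found in text (hypothetical helper)."""
    return grouped_number.findall(text)


print(find_grouped("Il y a 1 234 567,89 euros"))          # ['1 234 567,89']
print(find_grouped("Population : 67\u00A0000\u00A0000"))  # one match, NBSP kept
print(find_grouped("Lots 1, 2, 3"))                       # ['1', '2', '3']
print(find_grouped("Code 12345"))                         # ['12345']
```

Because a comma must be followed directly by a digit, a list such as "1, 2, 3" now yields three separate numbers.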
Very much inspired by JvdV's answer, I suggest this:
number_pattern = re.compile(r'(\d{1,3}(?:(?:[. \u00A0])?\d{3})*(?:,\d+)?(?!\d))', flags=re.UNICODE)
... this makes the thousand separator optional, and also makes thousand groups optional. It restricts the thousand-separator to 3 possible characters: dot, space and NBS, which is necessary for French numbers as found in practice.
PS I just found today that in fact Swiss French-speakers appear sometimes to use an apostrophe (of which there are several candidates in the vastness of Unicode) as a thousand separator.
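Here is a small sketch of how this pattern might be used for the substitution described in the question. The function mask_numbers and the sample text are invented for illustration, and the Swiss variant is only an assumption: it adds two apostrophe candidates (' and \u2019) to the separator class, which may not cover everything found in real documents.

```python
import re

# Pattern from this answer: dot, space or NBSP as optional thousand separator.
number_pattern = re.compile(
    r'(\d{1,3}(?:(?:[. \u00A0])?\d{3})*(?:,\d+)?(?!\d))')

# Assumption: a Swiss-style variant that also accepts ' and U+2019.
swiss_pattern = re.compile(
    r"(\d{1,3}(?:(?:[. \u00A0'\u2019])?\d{3})*(?:,\d+)?(?!\d))")


def mask_numbers(plain_text, pattern=number_pattern):
    """Double existing dollars, then replace each number with '$number'."""
    doubled = plain_text.replace('$', '$$')
    return pattern.sub('$number', doubled)


print(mask_numbers("Prix : 1.234,50 $ pour 3 articles"))
# Prix : $number $$ pour $number articles
print(mask_numbers("1'234'567 habitants", swiss_pattern))
# $number habitants
```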
I need help to complete a regex pattern. I need a pattern to match a range of numbers including unit.
Examples:
The car drives 50,5 - 80 km/10min on the road.
The car drives 50,5 - 80 km / 10min on the road.
The car drives 40,5-80 km/h on the road.
The car drives 30-50 km/h on the road.
The car drives 40 - 60.8 km/ h on the road.
The car drives 40.90-60,8 km/h on the road.
I need to match the entire range. It would also be good to simplify the (?:km/10min|km / 10min|km/h|km/ h) part, so that the variants with and without spaces do not each have to be listed separately.
([,.\d]+)\s*(?:km/10min|km / 10min|km/h|km/ h)
https://regex101.com/r/Ey792V/1
Currently, unfortunately, only the first number is matched. Thanks in advance for the help.
You could make the pattern a bit more specific and optionally match whitespace chars instead of hard coding all the possible spaces variations
\b\d+(?:[.,]\d+)?(?:\s*-\s*\d+(?:[.,]\d+)?)?\s*km\s*/\s*(?:h|10min)\b
Explanation
\b A word boundary
\d+(?:[.,]\d+)? Match 1+ digits with an optional decimal part
(?: Non capture group
\s*-\s* Match - between optional whitespace chars
\d+(?:[.,]\d+)? Match 1+ digits with an optional decimal part
)? Close the non capture group and make it optional
\s*km\s*/\s* Match km/ surrounded with optional whitespace chars to match different variations
(?:h|10min) Match either h or 10min (Or use \d+min to match 1+ digits)
\b A word boundary
See a regex demo.
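For a quick check outside the online demo, the pattern can be run against the six example sentences from the question. This is only a sketch using Python's re module; the variable names are illustrative.

```python
import re

# The pattern explained above, unchanged.
range_pattern = re.compile(
    r'\b\d+(?:[.,]\d+)?(?:\s*-\s*\d+(?:[.,]\d+)?)?\s*km\s*/\s*(?:h|10min)\b')

sentences = [
    "The car drives 50,5 - 80 km/10min on the road.",
    "The car drives 50,5 - 80 km / 10min on the road.",
    "The car drives 40,5-80 km/h on the road.",
    "The car drives 30-50 km/h on the road.",
    "The car drives 40 - 60.8 km/ h on the road.",
    "The car drives 40.90-60,8 km/h on the road.",
]

# Collect the full match for each sentence.
matches = [range_pattern.search(s).group() for s in sentences]
for m in matches:
    print(m)
```

Each sentence yields the whole range including the unit, for example "50,5 - 80 km/10min" for the first one.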
Your question is not entirely clear as you framed it in terms of examples. To be precise you need to state the question in words, then use the examples for illustration. To take one example, the question does not make clear whether
"The car drives 40,5- 80 km /h on the road."
is to be matched.
Expressing a question in words is not always easy, but it is a skill that you need to acquire in order to write clear code specifications. A by-product is that it makes the code easier to write, as that amounts to merely translating the words into code.
Let's give it a try.
Match a string composed of six successive substrings:
One or more digits that are not preceded by a comma or period, optionally followed by a comma, hyphen or period, which, if present, is followed by one or more digits.
A hyphen, optionally preceded and/or followed by a space.
One or more digits, optionally followed by a comma or period, the comma or period, if present, being followed by one or more digits.
The literal " km".
A forward slash, optionally preceded and/or followed by a space.
The literal "h" or one or more digits followed by "min", followed by a word boundary.
I cannot be sure that this is what you want but you should be able to easily modify these requirements as necessary.
Now let's translate these requirements into a regular expression.
1. One or more digits that are not preceded by a comma or period, optionally followed by a comma, hyphen or period, which, if present, is followed by one or more digits.
(?<![,.])\d+(?:[,.-]\d+)?
(?<![,.]) is a negative lookbehind. It is needed to avoid matching, for example, the indicated part of the following string.
"The car drives 1,500.5 - 80 km/10min on the road."
^^^^^^^^^^^^^^^
2. A hyphen, optionally preceded and/or followed by a space.
?- ?
(The first question mark is preceded by a space.)
3. One or more digits, optionally followed by a comma or period, the comma or period, if present, being followed by one or more digits.
\d+(?:[,.]\d+)?
4. The literal " km".
km
5. A forward slash, optionally preceded and/or followed by a space.
?\/ ?
(The first question mark is preceded by a space.)
6. The literal "h" or one or more digits followed by "min", followed by a word boundary.
(?:h|\d+min)\b
Now we can simply join these pieces to form the regular expression.
(?<![,.])\d+(?:[,.-]\d+)? ?- ?\d+(?:[,.]\d+)? km ?\/ ?(?:h|\d+min)\b
Demo
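The "translate each requirement, then join" approach can be sketched in Python as follows. The list name pieces and the test sentences are illustrative; the six fragments are the ones given for requirements 1 to 6, with the leading spaces written out explicitly.

```python
import re

# One fragment per requirement, in order.
pieces = [
    r'(?<![,.])\d+(?:[,.-]\d+)?',  # 1. first number, not preceded by , or .
    r' ?- ?',                      # 2. hyphen with optional single spaces
    r'\d+(?:[,.]\d+)?',            # 3. second number
    r' km',                        # 4. the literal " km"
    r' ?/ ?',                      # 5. slash with optional single spaces
    r'(?:h|\d+min)\b',             # 6. "h" or digits + "min", word boundary
]
range_regex = re.compile(''.join(pieces))

m = range_regex.search("The car drives 40.90-60,8 km/h on the road.")
print(m.group())  # 40.90-60,8 km/h

# The borderline example raised earlier is matched by these requirements.
print(range_regex.search("The car drives 40,5- 80 km /h on the road.").group())

# A single value without a range is not matched, since the hyphen is required.
print(range_regex.search("The car drives 80 km/h on the road."))  # None
```

Keeping the fragments in a list makes it easy to change one requirement without touching the others.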
\d.+(?:h\b|min\b|s\b)
Would also work. Demo
I'm writing a simple Sublime Text plugin to trim extra, unnecessary, spaces between words but without touching the leading spaces not to mess up Python formatting.
I have:
[spaces*******are********here]if****not***regions***and**default_to_all:
and want to get:
[spaces***are***still****here]if not regions and default_to_all:
Thinking about
regions = view.find_all('\w\s{2,}\w')
view.erase(edit, region)
but it cuts out the first and the last letter too.
Leaving the leading spaces untouched implies that you want to match runs of spaces that follow a non-space character and replace each run with a single space, so you can replace (?<=\S) +(?=\S) with a single space " ".
Explanation:
(?<=\S) +(?=\S)
(?<=  Positive look-behind, which means preceded by...
\S  a non-space character
)  end of look-behind group
+  one or more of the preceding space character
(?=  Positive look-ahead, which means followed by...
\S  a non-space character
)  end of look-ahead group
That should be straight-forward to understand. You may need to tweak it a bit for trailing space handling though.
See "regular expressions 101" for more information.
However, just as a side note regarding your intention:
This is not going to be a reliable way to reformat code. Apart from leading spaces, there are still many cases of multiple-spaces that are significant. The most obvious one is spaces within string literal.
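The lookaround replacement described above can be tried outside Sublime Text with Python's re module. This is a sketch only: squeeze_inner_spaces is a made-up name and the Sublime plugin API is not involved.

```python
import re

# One or more spaces that have a non-space character on both sides.
inner_spaces = re.compile(r'(?<=\S) +(?=\S)')


def squeeze_inner_spaces(text):
    """Collapse runs of spaces between words; leading indentation is kept."""
    return inner_spaces.sub(' ', text)


source = "    if    not   regions  and default_to_all:\n        return   None"
print(squeeze_inner_spaces(source))

# The caveat about string literals: spaces inside quotes are collapsed too.
print(squeeze_inner_spaces('x = "a   b"'))  # x = "a b"
```

The indentation on both lines survives because a newline or another space before the run does not satisfy (?<=\S).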
If I understand correctly, this should work:
>>> r = re.compile(r'( *[\S]*)(?: +)(\n)?')
>>> s = ' if not regions and default_to_all:\n foo'
>>> r.sub(' ', s)
if not regions and default_to_all:
foo
I may not be saying this right (I'm a total regex newbie). Here's the code I currently have:
bugs.append(re.compile("^(\d+)").match(line).group(1))
I'd like to add to the regex so it looks at either '\d+' (starts with digits) or that it starts with 2 capital letters and contains a '-' before the first whitespace. I have the regex for the capital letters:
^[A-Z]{2,}
but I'm not sure how to add the '-' and the make an OR with the \d+. Does this make sense? Thanks!
The way to do an OR in regexps is with the "alternation" or "pipe" operator, |.
For example, to match either one or more digits, or two or more capital letters:
^(\d+|[A-Z]{2,})
Debuggex Demo
You may or may not sometimes need to add/remove/move parentheses to get the precedence right. The way I've written it, you've got one group that captures either the digit string or the capitals. While you're learning the rules (in fact, even after you've learned the rules) it's helpful to look at a regular expression visualizer/debugger like the one I used.
Your rule is slightly more complicated: you want 2 or more capital letters, and a hyphen before the first space. That's a bit hard to write as is, but if you change it to two or more capital letters, zero or more non-space characters, and a hyphen, that's easy:
^(\d+|[A-Z]{2,}\S*?-)
Debuggex Demo
(Notice the \S*?—that means we're going to match as few characters as possible, instead of as many as possible, so we'll only match up to the first hyphen in THIS-IS-A-TEST instead of up to the last. If you want the other one, just drop the ?.)
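To see the difference between the lazy and greedy forms, here is a short sketch. The helper bug_prefix and the sample lines are assumptions for illustration.

```python
import re

# Lazy version from above: stop at the first hyphen.
lazy = re.compile(r'^(\d+|[A-Z]{2,}\S*?-)')
# Greedy variant (the ? dropped): run up to the last hyphen before whitespace.
greedy = re.compile(r'^(\d+|[A-Z]{2,}\S*-)')


def bug_prefix(line, pattern=lazy):
    """Return the captured prefix, or None if the line does not match."""
    m = pattern.match(line)
    return m.group(1) if m else None


print(bug_prefix("12345 fix crash"))             # 12345
print(bug_prefix("THIS-IS-A-TEST now"))          # THIS-
print(bug_prefix("THIS-IS-A-TEST now", greedy))  # THIS-IS-A-
print(bug_prefix("AB 12-3"))                     # None (hyphen is after the space)
```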
Write | for "or". For a sequence of zero or more non-whitespace characters, write \S*.
re.compile(r'^(\d+|[A-Z][A-Z]\S*-\s)')
re.compile(r"""
^ # beginning of the line
(?: # non-capturing group; do not return this group in .group()
(\d+) # one or more digits, captured as a group
| # Or
[A-Z]{2} # Exactly two uppercase letters
\S* # Any number of non-whitespace characters
- # the dash you wanted
) # end of the non-capturing group
""",
re.X) # enable comments in the regex
I'm just starting to figure out regex and would love some help trying to understand it. I've been using this to help me get started, but am still having some trouble figuring it out.
What I am trying to do is take this text:
<td>8.54/10 over 190 reviews</td>
And pull out the "8.54", so basically anything in between the first ">" and the "/"
Using my noob skills, I came up with this: [0-9].[0-9][0-9], which WILL match that 8.54, and will work for everything BUT 10.00, which I do need to account for.
Can anyone help me refine my expression to apply to that last case as well?
Use quantifiers.
You want one or more digits, followed by a dot, followed by one or more digits. A digit can also be written \d, and the "one or more" quantifier is +.
The dot needs to be escaped as it is a regex metacharacter which means "any character". Your regex therefore should be:
\d+\.\d+
Now, beware that a quantifier applies to atoms only. Character classes ([...]), complemented character classes ([^...]) and special character classes (\d, \w...) are atoms. However, if you want to apply a quantifier to more than a simple atom, you'll need to group these atoms using the grouping operator, (). I.e., (ab)+ will look for one or more of ab.
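A brief sketch of both points in Python, assuming the cell text from the question (the second string is a made-up variant with a 10.00 rating):

```python
import re

# One or more digits, an escaped dot, one or more digits.
rating = re.compile(r'\d+\.\d+')

first = rating.search('<td>8.54/10 over 190 reviews</td>').group()
second = rating.search('<td>10.00/10 over 190 reviews</td>').group()
print(first, second)  # 8.54 10.00

# A quantifier applies to the preceding atom only, unless atoms are grouped.
print(bool(re.fullmatch(r'(ab)+', 'ababab')))  # True: the group repeats
print(bool(re.fullmatch(r'ab+', 'ababab')))    # False: only b repeats
print(bool(re.fullmatch(r'ab+', 'abbb')))      # True
```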
Maybe answered my own question. Found this:
[0-9]+(?:.[0-9]*)
It seems to work, does anyone have any changes to this?
\d is often used instead of [0-9] (mnemonically, “digit”) and it's necessary to remember that sometimes fractional numbers are written without any digits before the decimal point. Thus:
(?<=>)(?:\d+(?:\.\d*)?|\.\d+)(?=/)
OK, that's a bit of a complex RE. Here's how it breaks down (in extended form).
(?<= > ) # With a “>” before (but not matched)…
(?: # … match either this
\d+ # at least one digit, followed by…
(?: # …match
\. \d* # a dot followed by any number of digits
) ? # optionally
| # … or this
\. \d+ # a dot followed by at least one digit
) #
(?= / ) # … and with a “/” afterwards (but not matched)
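The same expression can be written with Python's re.VERBOSE flag, which allows the comments to live inside the pattern. This is a sketch; extract_score and the extra sample cells are invented for the example.

```python
import re

score = re.compile(r"""
    (?<=>)          # preceded by ">" (not part of the match)
    (?:
        \d+         # at least one digit...
        (?:\.\d*)?  # ...optionally a dot and any number of digits
      |
        \.\d+       # or a dot followed by at least one digit
    )
    (?=/)           # followed by "/" (not part of the match)
""", re.VERBOSE)


def extract_score(cell):
    """Return the score text from a table cell, or None if absent."""
    m = score.search(cell)
    return m.group() if m else None


print(extract_score('<td>8.54/10 over 190 reviews</td>'))   # 8.54
print(extract_score('<td>10.00/10 over 190 reviews</td>'))  # 10.00
print(extract_score('<td>.5/10</td>'))                      # .5
```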
This might work:
\>(.*?)/
# (.*?) is a "non-greedy" group which matches as few characters as possible
Then access the actual value using
m.group(1)
where m is the match object returned by re.search or re.finditer
If you want to access the value directly (re.findall), use
(?<=>)(.*?)(?=/)
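A short sketch of both ways of getting at the value, using the cell from the question. The findall variant is written with a lookbehind, (?<=>), so that the ">" and "/" stay out of the match.

```python
import re

html = '<td>8.54/10 over 190 reviews</td>'

# Variant 1: search, then read the captured group.
m = re.search(r'>(.*?)/', html)
first = m.group(1)
print(first)  # 8.54

# Variant 2: findall returns the group contents directly.
values = re.findall(r'(?<=>)(.*?)(?=/)', html)
print(values)  # ['8.54']
```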
consider this string
prison break: proof of innocence (2006) {abduction (#1.10)}
I just want to know whether (# floating point value )} occurs in the string or not.
I tried a few regular expressions like
re.search('\(\#+\f+\)\}',xyz)
and
re.search('\(\#+(\d\.\d)+\)\}',xyz)
Nothing worked, though. Can someone suggest something?
Try r'\(#\d+\.\d+\)\}'
The (, ), ., and } are all special metacharacters, that's why they're preceded by \, so they're matched literally instead.
You also need to apply the + repetition at the right element. Here it's attached to the \d -- the shorthand for digit character class -- to mean that only the digits can appear one-or-more times.
The use of r'raw string literals' makes it easier to work with regex patterns because you don't have to escape backslashes excessively.
See also
What exactly do u and r string flags in Python do, and what are raw string literals?
Variations
For instructional purposes, let's consider a few variations. This will show a few basic features of regex. Let's first consider one of the attempted patterns:
\(\#+(\d\.\d)+\)\}
Let's space out the parts for readability:
\( \#+ ( \d \. \d )+ \) \}
\__________/
this is one group, repeated with +
So this pattern matches:
A literal (, followed by one-or-more #
Followed by one-or-more of:
A digit, a literal dot, and a digit
Followed by a literal )}
Thus, the pattern will match e.g. (###1.23.45.6)} (as seen on rubular.com). Obviously this is not the pattern we want.
Now let's try to modify the solution pattern and say that perhaps we also want to allow just a sequence of digits, without the subsequent period and following digits. We can do this by grouping that part (…), and making it optional with ?.
BEFORE
\(#\d+\.\d+\)\}
\___/
let's make this optional! (…)?
AFTER
\(#\d+(\.\d+)?\)\}
Now the pattern matches e.g. (#1.23)} as well as e.g. (#666)} (as seen on rubular.com).
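The three patterns discussed in this answer can be compared side by side in Python. This is a sketch only; the variable names are illustrative and the test strings are the ones mentioned above.

```python
import re

attempted = re.compile(r'\(\#+(\d\.\d)+\)\}')          # the attempted pattern
strict = re.compile(r'\(#\d+\.\d+\)\}')                # the suggested pattern
optional_fraction = re.compile(r'\(#\d+(\.\d+)?\)\}')  # fraction made optional

title = 'prison break: proof of innocence (2006) {abduction (#1.10)}'

print(attempted.search(title))                      # None: "1.10" is not d.d repeated
print(bool(attempted.search('(###1.23.45.6)}')))    # True: not what we want
print(strict.search(title).group())                 # (#1.10)}
print(bool(strict.search('(#666)}')))               # False: the dot is required
print(bool(optional_fraction.search('(#666)}')))    # True
print(bool(optional_fraction.search('(#1.23)}')))   # True
```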
References
regular-expressions.info - Optional, Brackets for Grouping
"Escape everything" and use raw-literal syntax for safety:
>>> s='prison break: proof of innocence (2006) {abduction (#1.10)}'
>>> re.search(r'\(\#\d+\.\d+\)\}', s)
<_sre.SRE_Match object at 0xec950>
>>> _.group()
'(#1.10)}'
>>>
This assumes that by "floating point value" you mean "one or more digits, a dot, one or more digits", and is not tolerant of other floating point syntax variations, multiple hashes (which you appear from your RE patterns to want to support but don't mention in your Q's text), arbitrary whitespace among the relevant parts (again, unclear from your Q whether you need it), ... -- some issues can be adjusted pretty easily, others "not so much" (it's particularly hard to guess what gamut of FP syntax variations you want to support, for example).