Some Regex stuff - python

I'm just starting to figure out regex and would love some help trying to understand it. I've been using this to help me get started, but am still having some trouble figuring it out.
What I am trying to do is take this text:
<td>8.54/10 over 190 reviews</td>
And pull out the "8.54", so basically anything in between the first ">" and the "/"
Using my noob skills, I came up with this: [0-9].[0-9][0-9], which WILL match that 8.54, and will work for everything BUT 10.00, which I do need to account for.
Can anyone help me refine my expression to apply to that last case as well?

Use quantifiers.
You want one or more digits, followed by a dot, followed by one or more digits. A digit can also be written \d, and the "one or more" quantifier is +.
The dot needs to be escaped as it is a regex metacharacter which means "any character". Your regex therefore should be:
\d+\.\d+
Now, beware that a quantifier applies to atoms only. Character classes ([...]), complemented character classes ([^...]) and special character classes (\d, \w...) are atoms, however if you want to apply a quantifier to more than a simple atom, you'll need to group these atoms using the grouping operator, (). Ie, (ab)+ will look for one or more of ab.

Maybe answered my own question. Found this:
[0-9]+(?:.[0-9]*)
It seems to work, does anyone have any changes to this?

\d is often used instead of [0-9] (mnemonically, “digit”) and it's necessary to remember that sometimes fractional numbers are written without any digits before the decimal point. Thus:
(?<=>)(?:\d+(?:\.\d*)?|\.\d+)(?=/)
OK, that's a bit of a complex RE. Here's how it breaks down (in extended form).
(?<= > ) # With a “>” before (but not matched)…
(?: # … match either this
\d+ # at least one digit, followed by…
(?: # …match
\. \d* # a dot followed by any number of digits
) ? # optionally
| # … or this
\. \d+ # a dot followed by at least one digit
) #
(?= / ) # … and with a “/” afterwards (but not matched)

This might work:
\>(.*?)/
# (.*?) is a "non-greedy" group which maches as few characters as possible
Then access the actual value using
m.group(1)
where m is the match object returned by re.search or re.finditer
If you want to access the value directly (re.findall), use
(?>=\>)(.*?)(?=/)

Related

Catch multiple pattern with regex [duplicate]

This is an example string:
123456#p654321
Currently, I am using this match to capture 123456 and 654321 in to two different groups:
([0-9].*)#p([0-9].*)
But on occasions, the #p654321 part of the string will not be there, so I will only want to capture the first group. I tried to make the second group "optional" by appending ? to it, which works, but only as long as there is a #p at the end of the remaining string.
What would be the best way to solve this problem?
You have the #p outside of the capturing group, which makes it a required piece of the result. You are also using the dot character (.) improperly. Dot (in most reg-ex variants) will match any character. Change it to:
([0-9]*)(?:#p([0-9]*))?
The (?:) syntax is how you get a non-capturing group. We then capture just the digits that you're interested in. Finally, we make the whole thing optional.
Also, most reg-ex variants have a \d character class for digits. So you could simplify even further:
(\d*)(?:#p(\d*))?
As another person has pointed out, the * operator could potentially match zero digits. To prevent this, use the + operator instead:
(\d+)(?:#p(\d+))?
Your regex will actually match no digits, because you've used * instead of +.
This is what (I think) you want:
(\d+)(?:#p(\d+))?

Regex to match (French) numbers

I'm trying to find a simple (not perfect) pattern to recognise French numbers in a French text. French numbers use comma for the Anglo-Saxon decimal, and use dot or space for the thousand separator. \u00A0 is non-breaking space, also often used in French documents for the thousand separator.
So my first attempt is:
number_pattern = re.compile(r'\d[\d\., \u00A0]*\d', flags=re.UNICODE)
... but the trouble is that this doesn't then match a single digit.
But if I do this
number_pattern = re.compile(r'\d[\d\., \u00A0]*\d?', flags=re.UNICODE)
it then picks up trailing space (or NBS) characters (or for that matter a trailing comma or full stop).
The thing is, the pattern must both START and END with a digit, but it is possible that these may be the SAME character.
How might I achieve this? I considered a two-stage process where you try to see whether this is in fact a single-digit number... but that in itself is not trivial: if followed by a space, NBS, comma or dot, you then have to see whether the character after that, if there is one, is or is not a digit.
Obviously I'm hoping to find a solution which involves only one regex: if there is only one regex, it is then possible to do something like:
doubled_dollars_plain_text = plain_text.replace('$', '$$')
substituted_plain_text = re.sub(number_pattern, '$number', doubled_dollars_plain_text)
... having to use a two-stage process would make this much more lengthy and fiddly.
Edit
I tried to see whether I could implement ThierryLathuille's idea, so I tried:
re.compile(r'(\d(?:[\d\., \u00A0]*\d)?)', flags=re.UNICODE)
... this seems to work pretty well. Unlike JvdV's solution it doesn't attempt to check that thousand separators are followed by 3 digits, and for that matter you could have a succession of commas and spaces in the middle and it would still pass, which is quite problematic when you have a list of numbers separated by ", ". But it's acceptable for certain purposes... until something more sophisticated can be found.
I wonder if there's a way of saying "any non-digit in this pattern must be on its own" (i.e. must be bracketed between two digits)?
What about:
\d{1,3}(?:[\s.]?\d{3})*(?:,\d+)?(?!\d)
See an online demo
\d{1,3} - 1-3 digits.
(?: - Open 1st non-capture group:
[\s.]? - An optional whitespace or literal dot. Note that with unicode \s should match \p{Z} to include the non-breaking whitespace.
\d{3} - Three digits.
)* - Close 1st non-capture group and match 0+ times.
(?:,\d+)? - A 2nd optional non-capture group to match a comma followed by at least 1 digit.
(?!\d) - A negative lookahead to prevent trailing digits.
Very much inspired by JvdV's answer, I suggest this:
number_pattern = re.compile(r'(\d{1,3}(?:(?:[. \u00A0])?\d{3})*(?:,\d+)?(?!\d))', flags=re.UNICODE)
... this makes the thousand separator optional, and also makes thousand groups optional. It restricts the thousand-separator to 3 possible characters: dot, space and NBS, which is necessary for French numbers as found in practice.
PS I just found today that in fact Swiss French-speakers appear sometimes to use an apostrophe (of which there are several candidates in the vastness of Unicode) as a thousand separator.

replacement pattern in python(re.sub())

The question
Can someone please explain the process of the following re.sub() to me.
I am thinking the process is as following:
look for a "." then look for a digit then look for another digit that is between 1 and 9. Now I am lost. What is the question mark for? What does the \d* do? Why do we need to use raw string regex in this case?
If you want to understand the process, I can simply explain it to you. I don't know if this regular expression is doing what you want or not..
At first, the . is a special character in regex which means any character. But, we here want to use the dot character. In regex, this can be done by using escaping character \ like so \.. So, using . means any character and using \. means a dot.
The \d represents any digit and acts exactly like [0-9]
When you used [1-9], by then you specified to get the numbers from 1 till 9 which means that zero is excluded.
We can use the asterisk * to choose zero or more characters. Unlike + which is used to choose one or more characters. So, using \d* means any consecutive digits from [0-9] or none.
The ? is used to indicate using just one character or none. So, using [1-9]? means try to find just one digit between 1 and 9 IF FOUND.
The Parenthesis () is used for grouping the whole regular expression in one output.
If you want to know more about regular expression, here is an awesome cheat sheet.
NOTE:
I think the regex you have written in the question is not correct. I think it should be as follows (\d*\.\d\d[1-9]?) to obtain the same result. I will try to explain this regular expression using this number 3.141500012. \d*\. means find any number of digits that could be found before the dot which would match the 3.. then after that \d\d matches two digits after the dot which are 14. Finally, the [1-9]? matches any digit between 1 and 9 if found which matches 1 in our example.

Python Regex Behaviour

I'm trying to parse a text document with data in the following format: 24036 -977. I need to separate the numbers into separate values, and the way I've done that is with the following steps.
values = re.search("(.*?)\s(.*)")
x = values.group(1)
y = values.gropu(2)
This does the job, however I was curious about why using (.*?) in the second group causes the regex to fail? I tested it in the online regex tester(https://regex101.com/r/bM2nK1/1), and adding the ? in causes the second group to return nothing. Now as far as I know .*? means to take any value unlimited times, as few times as possible, and the .* is just the greedy version of that. What I'm confused about is why the non greedy version.*? takes that definition to mean capturing nothing?
Because it means to match the previous token, the *, as few times as possible, which is 0 times. If you would it to extend to the end of the string, add a $, which matches the end of string. If you would like it to match at least one, use + instead of *.
The reason the first group .*? matches 24036 is because you have the \s token after it, so the fewest amount of characters the .*? could match and be followed by a \s is 24036.
#iobender has pointed out the answer to your question.
But I think it's worth mentioning that if the numbers are separated by space, you can just use split:
>>> '24036 -977'.split()
['24036', '-977']
This is simpler, easier to understand and often faster than regex.

python regular expresssion for a string

consider this string
prison break: proof of innocence (2006) {abduction (#1.10)}
i just want to know whether there is (# floating point value )} in the string or not
i tried few regular expressions like
re.search('\(\#+\f+\)\}',xyz)
and
re.search('\(\#+(\d\.\d)+\)\}',xyz)
nothing worked though...can someone suggest me something here
Try r'\(#\d+\.\d+\)\}'
The (, ), ., and } are all special metacharacters, that's why they're preceded by \, so they're matched literally instead.
You also need to apply the + repetition at the right element. Here it's attached to the \d -- the shorthand for digit character class -- to mean that only the digits can appear one-or-more times.
The use of r'raw string literals' makes it easier to work with regex patterns because you don't have to escape backslashes excessively.
See also
What exactly do u and r string flags in Python do, and what are raw string literals?
Variations
For instructional purposes, let's consider a few variations. This will show a few basic features of regex. Let's first consider one of the attempted patterns:
\(\#+(\d\.\d)+\)\}
Let's space out the parts for readability:
\( \#+ ( \d \. \d )+ \) \}
\__________/
this is one group, repeated with +
So this pattern matches:
A literal (, followed by one-or-more #
Followed by one-or-more of:
A digit, a literal dot, and a digit
Followed by a literal )}
Thus, the pattern will match e.g. (###1.23.45.6)} (as seen on rubular.com). Obviously this is not the pattern we want.
Now let's try to modify the solution pattern and say that perhaps we also want to allow just a sequence of digits, without the subsequent period and following digits. We can do this by grouping that part (…), and making it optional with ?.
BEFORE
\(#\d+\.\d+\)\}
\___/
let's make this optional! (…)?
AFTER
\(#\d+(\.\d+)?\)\}
Now the pattern matches e.g. (#1.23)} as well as e.g. (#666)} (as seen on rubular.com).
References
regular-expressions.info - Optional, Brackets for Grouping
"Escape everything" and use raw-literal syntax for safety:
>>> s='prison break: proof of innocence (2006) {abduction (#1.10)}'
>>> re.search(r'\(\#\d+\.\d+\)\}', s)
<_sre.SRE_Match object at 0xec950>
>>> _.group()
'(#1.10)}'
>>>
This assumes that by "floating point value" you mean "one or more digits, a dot, one or more digits", and is not tolerant of other floating point syntax variations, multiple hashes (which you appear from your RE patterns to want to support but don't mention in your Q's text), arbitrary whitespace among the relevant parts (again, unclear from your Q whether you need it), ... -- some issues can be adjusted pretty easily, others "not so much" (it's particularly hard to guess what gamut of FP syntax variations you want to support, for example).

Categories