optional groups in regex to match different lines - python

I have two files:
/c/desktop/test.txt#edit
/c/desktop/test.txt
I am using regex: (.*desktop.*)(?:#.*)?
It should match everything before and after desktop but leave out anything from the # onward, which may or may not exist in that line.
But it's either matching everything or nothing.

One way of achieving what you want is by using the non-greedy operator *? in conjunction with the end of line operator: (.*desktop.*?)(?:$|#.*)
.*? says match as few characters as possible
$|#.* says match either the end of line or a # followed by characters. This way, the .* from the first group does not match past the # because it is possible to match the pattern with fewer characters if the second group takes it.
Tested here: https://regex101.com/r/7l1CQi/1
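A minimal sketch of the difference in Python, using the two sample lines from the question. The variable names are only illustrative, and re.match is assumed here since each line is tested from its start:

```python
import re

lines = ["/c/desktop/test.txt#edit", "/c/desktop/test.txt"]

# Original pattern: the greedy .* swallows the "#edit" tail,
# and the optional group is then satisfied by matching nothing.
greedy = re.compile(r"(.*desktop.*)(?:#.*)?")
greedy_paths = [greedy.match(line).group(1) for line in lines]

# Pattern from the answer: the lazy .*? stops as soon as the
# rest of the pattern (end of line, or "#" plus a tail) can match.
lazy = re.compile(r"(.*desktop.*?)(?:$|#.*)")
lazy_paths = [lazy.match(line).group(1) for line in lines]

print(greedy_paths)  # ['/c/desktop/test.txt#edit', '/c/desktop/test.txt']
print(lazy_paths)    # ['/c/desktop/test.txt', '/c/desktop/test.txt']
```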


Improving the efficiency of a regex

Given a string such as this:
upstream-status=502; upstream-scheme=http; upstream-host=dfsdf-dsfsd88.dsfsdf99.sdfsdf.dfdf.in.sdfsf; upstream-url=%2FWebObjects%2Fdsdf.woa;
The regex that I wrote for matching and extracting the upstream-host is:
upstream-host=(?P<hostname>\S+(?=;))*
The ?P<hostname> allows me to create a named group.
The \S+ matches the actual hostname.
The ?=; says don't include the ; in the named group.
The last * says I don't care what comes after.
I have a nagging feeling that there is a better way to write this regex.
You can omit the lookahead and match the ; outside of the group, as the \S+ first captures all non whitespace chars and then you also match the last ; instead of asserting it.
Also, you can omit the * quantifier after the group: repeating the group zero or more times means the pattern can also match an empty hostname.
upstream-host=(?P<hostname>\S+);
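As a quick check, here is the simplified pattern applied to the sample string from the question. The variable names are made up for the example, and the second pattern (a negated character class) is an additional variant not taken from the answer; it should give the same result here without needing to backtrack off the trailing ;

```python
import re

log = (
    "upstream-status=502; upstream-scheme=http; "
    "upstream-host=dfsdf-dsfsd88.dsfsdf99.sdfsdf.dfdf.in.sdfsf; "
    "upstream-url=%2FWebObjects%2Fdsdf.woa;"
)

# Pattern from the answer: capture non-whitespace, match the ";" outside the group.
match = re.search(r"upstream-host=(?P<hostname>\S+);", log)
hostname = match.group("hostname") if match else None

# Assumed alternative: stop at the first ";" or whitespace character.
alt = re.search(r"upstream-host=(?P<hostname>[^;\s]+)", log)
alt_hostname = alt.group("hostname") if alt else None

print(hostname)      # dfsdf-dsfsd88.dsfsdf99.sdfsdf.dfdf.in.sdfsf
print(alt_hostname)  # dfsdf-dsfsd88.dsfsdf99.sdfsdf.dfdf.in.sdfsf
```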

Python path regex optional match

I have path strings like these:
tree/bee.horse_2021/moose/loo.se
bee.horse_2021/moose/loo.se
bee.horse_2021/mo.ose/loo.se
The path can be arbitrarily long after moose. Sometimes the first part of the path such as tree/ is missing, sometimes not. I want to capture tree in the first group if it exists and bee.horse in the second.
I came up with this regex, but it doesn't work:
path_regex = r'^(?:(.*)/)?([a-zA-Z]+\.[a-zA-Z]+).+$'
What am I missing here?
You can restrict the characters to be matched in the first capture group.
For example, you could match any character except / or . using a negated character class [^/\n.]+
^(?:([^/\n.]+)/)?([a-zA-Z]+\.[a-zA-Z]+).*$
Or you can restrict the characters to match word characters \w+ only
^(?:(\w+)/)?([a-zA-Z]+\.[a-zA-Z]+).*$
Note that in your pattern, the .+ at the end matches at least a single character. If you want to make that part optional, you can change it to .*
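A small sketch running the \w+ variant against the three sample paths from the question. The names path_regex and groups are just for illustration:

```python
import re

# \w+ cannot cross a "." or "/", so the optional first group
# only matches a plain leading directory such as "tree".
path_regex = re.compile(r"^(?:(\w+)/)?([a-zA-Z]+\.[a-zA-Z]+).*$")

paths = [
    "tree/bee.horse_2021/moose/loo.se",
    "bee.horse_2021/moose/loo.se",
    "bee.horse_2021/mo.ose/loo.se",
]

groups = [path_regex.match(p).groups() for p in paths]
for p, g in zip(paths, groups):
    print(p, "->", g)
# tree/bee.horse_2021/moose/loo.se -> ('tree', 'bee.horse')
# bee.horse_2021/moose/loo.se -> (None, 'bee.horse')
# bee.horse_2021/mo.ose/loo.se -> (None, 'bee.horse')
```

When the leading directory is missing, the first group is None rather than an empty string, so check for that before using it.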

Python regex: Line can't start with certain words, can only contain certain characters

I am reading in lines from a file, and I want to remove lines that only contain letters, colon, parentheses, underscores, spaces and backslashes. This regex was working fine to find those lines...
[^A-Za-z0-9:()_\s\\]
...as passed to re.search() as a raw string.
Now, I need to add to it that the lines cannot start with THEN or ELSE; otherwise they should not match and thus be exempted from being removed.
I tried just taking the ^ out of the brackets and adding a negative lookahead before the bracketed expression, like so...
r'^(?!(ELSE|THEN))[A-Za-z0-9:()_\s\\]'
...but now it just matches every line. What am I missing?
^(?:(?:.*[^A-Za-z0-9:()_\s\\])|(?:THEN|ELSE)).*$
Broken down
^(?: ).*$ # Starts with
(?: )|(?: ) # Either
.*[^A-Za-z0-9:()_\s\\] # Anything that contains a character outside the allowed set
THEN|ELSE # THEN/ELSE
See the example on regex101.com
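To illustrate how this pattern could be used for filtering, here is a rough sketch. The sample lines are invented for the example, and it assumes the lines have already had their trailing newlines stripped. A line that matches is kept; a line that does not match is one that only contains allowed characters and does not start with THEN or ELSE, so it is dropped:

```python
import re

# Matches lines that contain a character outside the allowed set,
# or that start with THEN or ELSE.
keep = re.compile(r"^(?:(?:.*[^A-Za-z0-9:()_\s\\])|(?:THEN|ELSE)).*$")

# Hypothetical sample input, not from the question.
lines = [
    "THEN do_something",
    "ELSE (fallback)",
    "plain_label: (abc)",
    "x = y + 1",
]

kept = [line for line in lines if keep.match(line)]
print(kept)  # ['THEN do_something', 'ELSE (fallback)', 'x = y + 1']
```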
Just use an alternation:
^(?:THEN|ELSE|[A-Za-z0-9:()_\s\\]*$)
and remove the lines that don't match the pattern.

Python Regex Behaviour

I'm trying to parse a text document with data in the following format: 24036 -977. I need to separate the numbers into separate values, and the way I've done that is with the following steps.
import re
text = "24036 -977"
values = re.search(r"(.*?)\s(.*)", text)
x = values.group(1)
y = values.group(2)
This does the job, but I was curious why using (.*?) for the second group causes the regex to fail. I tested it in an online regex tester (https://regex101.com/r/bM2nK1/1), and adding the ? makes the second group return nothing. As far as I know, .*? means match any character any number of times, as few times as possible, and .* is just the greedy version of that. What confuses me is why the non-greedy version .*? takes that definition to mean capturing nothing.
Because *? means to match the previous token, the ., as few times as possible, and with nothing after the group that is 0 times. If you would like it to extend to the end of the string, add a $, which matches the end of the string. If you would like it to match at least one character, use + instead of *.
The reason the first group .*? matches 24036 is that it is followed by the \s token, so the smallest number of characters .*? can match while still being followed by a \s is 24036.
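The three variants described above can be compared side by side. This is only a sketch on the sample value from the question, and the variable names are illustrative:

```python
import re

text = "24036 -977"

# Lazy second group with nothing after it: it matches zero characters.
lazy = re.search(r"(.*?)\s(.*?)", text)

# Adding $ forces the lazy group to extend to the end of the string.
anchored = re.search(r"(.*?)\s(.*?)$", text)

# +? must match at least one character, so it takes exactly one.
at_least_one = re.search(r"(.*?)\s(.+?)", text)

print(lazy.groups())          # ('24036', '')
print(anchored.groups())      # ('24036', '-977')
print(at_least_one.groups())  # ('24036', '-')
```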
@iobender has pointed out the answer to your question.
But I think it's worth mentioning that if the numbers are separated by space, you can just use split:
>>> '24036 -977'.split()
['24036', '-977']
This is simpler, easier to understand and often faster than regex.

Python regular expression question mark operator not working?

import re
s = 'abc defg'  # avoid shadowing the built-in str
m1 = re.match(".*(def)?", s)
m2 = re.match(".*(def)", s)
print((m1.group(1), m2.group(1)))
The output of the above is:
(None, 'def')
What is going on? Even with a non-greedy repetition operator, the optional capture group (def)? is not matched.
Here's what happens when the regex engine tries to match .*(def) against abc defg:
First, the engine starts trying to match the regex at the beginning of the string.
The greedy subpattern .* initially tries to match as many times as it can, matching the entire string.
Since this causes the rest of the match to fail, the regex engine backtracks until it finds a way to match the (def), which happens when the .* matches only abc .
However, if we change the regex to .*(def)?, the following happens instead:
First, the regex engine again starts at the beginning of the string.
Next, it again tries to match .* as many times as possible, matching the entire string.
But at that point, since all the rest of the regex is optional, it has found a match for the entire regex! Since (def)? is greedy, the engine would prefer to match it if it could, but it's not going to backtrack earlier subpatterns just to see if it can. Instead, it just lets the .* gobble up the entire string, leaving nothing for (def)?.
Something similar happens with .*?(def) and .*?(def)?:
Again, the engine starts at the beginning of the string.
The ungreedy subpattern .*? tries to match as few times as it can, i.e. not at all.
At that point, (def) cannot match, but (def)? can. Thus, for (def) the regex engine has to go back and consider longer matches for .*? until it finds one that lets the full pattern match, whereas for (def)? it doesn't have to do that, and so it doesn't.
For more information, see the "Combining RE Pieces" section of the Perl regular expressions manual (which matches the behavior of Python's "Perl-compatible" regular expressions).
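The four cases walked through above can be reproduced with a short script. This is a sketch only; the results dictionary is an illustrative name, and it stores the whole match alongside the captured group so you can see how much the .* or .*? consumed:

```python
import re

text = "abc defg"
patterns = [r".*(def)", r".*(def)?", r".*?(def)", r".*?(def)?"]

# Map each pattern to (whole match, captured group).
results = {}
for p in patterns:
    m = re.match(p, text)
    results[p] = (m.group(0), m.group(1))
    print(p, "->", results[p])
# .*(def) -> ('abc def', 'def')
# .*(def)? -> ('abc defg', None)
# .*?(def) -> ('abc def', 'def')
# .*?(def)? -> ('', None)
```

Note that the last pattern succeeds with an empty match: both pieces are allowed to match nothing, so the engine never has a reason to look further.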
