Improving the efficiency of a regex - python

Given a string such as this:
upstream-status=502; upstream-scheme=http; upstream-host=dfsdf-dsfsd88.dsfsdf99.sdfsdf.dfdf.in.sdfsf; upstream-url=%2FWebObjects%2Fdsdf.woa;
The regex that I wrote for matching and extracting the upstream-host is:
upstream-host=(?P<hostname>\S+(?=;))*
The ?P<hostname> allows me to create a named group.
The \S+ matches the actual hostname.
The ?=; says don't include the ; in the named group.
The last * says I don't care what comes after.
I have a nagging feeling that there is a better way to write this regex.

You can drop the lookahead and match the ; outside of the group: \S+ first consumes all the non-whitespace characters, then backtracks until the final ; can be matched literally instead of merely asserted.
You can also drop the * quantifier after the group. It does not mean "I don't care what comes after"; it repeats the group zero or more times, so the pattern could also match upstream-host= on its own, with no hostname captured.
upstream-host=(?P<hostname>\S+);
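As a minimal sketch of how the simplified pattern could be used from Python (the names LINE and HOST_RE are made up for this illustration; the sample line is the one from the question):

```python
import re

# Sample line from the question.
LINE = (
    "upstream-status=502; upstream-scheme=http; "
    "upstream-host=dfsdf-dsfsd88.dsfsdf99.sdfsdf.dfdf.in.sdfsf; "
    "upstream-url=%2FWebObjects%2Fdsdf.woa;"
)

# Compile once if the pattern is applied to many lines.
HOST_RE = re.compile(r"upstream-host=(?P<hostname>\S+);")

match = HOST_RE.search(LINE)
# group("hostname") reads the named group; the trailing ; is matched but not captured.
hostname = match.group("hostname") if match else None
print(hostname)
```

If the hostname can never contain a ; itself, [^;\s]+ in place of \S+ would stop at the first ; without needing to backtrack.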

Related

Catch multiple pattern with regex [duplicate]

This is an example string:
123456#p654321
Currently, I am using this pattern to capture 123456 and 654321 into two different groups:
([0-9].*)#p([0-9].*)
But on occasions, the #p654321 part of the string will not be there, so I will only want to capture the first group. I tried to make the second group "optional" by appending ? to it, which works, but only as long as there is a #p at the end of the remaining string.
What would be the best way to solve this problem?

You have the #p outside of the capturing group, which makes it a required piece of the result. You are also using the dot character (.) improperly. Dot (in most reg-ex variants) will match any character. Change it to:
([0-9]*)(?:#p([0-9]*))?
The (?:) syntax is how you get a non-capturing group. We then capture just the digits that you're interested in. Finally, we make the whole thing optional.
Also, most reg-ex variants have a \d character class for digits. So you could simplify even further:
(\d*)(?:#p(\d*))?
As another person has pointed out, the * operator could potentially match zero digits. To prevent this, use the + operator instead:
(\d+)(?:#p(\d+))?
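A short sketch of how the optional part behaves in Python (PAIR_RE is a name chosen for this example; fullmatch is used here on the assumption that the whole string should consist of the number and the optional #p part):

```python
import re

# Group 2 sits inside an optional non-capturing group,
# so it is None when the "#p..." part is absent.
PAIR_RE = re.compile(r"(\d+)(?:#p(\d+))?")

both = PAIR_RE.fullmatch("123456#p654321").groups()
first_only = PAIR_RE.fullmatch("123456").groups()

print(both)        # ('123456', '654321')
print(first_only)  # ('123456', None)
```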

Your regex can actually match zero digits, because it uses * instead of +.
This is what (I think) you want:
(\d+)(?:#p(\d+))?

Python path regex optional match

I have path strings like these:
tree/bee.horse_2021/moose/loo.se
bee.horse_2021/moose/loo.se
bee.horse_2021/mo.ose/loo.se
The path can be arbitrarily long after moose. Sometimes the first part of the path such as tree/ is missing, sometimes not. I want to capture tree in the first group if it exists and bee.horse in the second.
I came up with this regex, but it doesn't work:
path_regex = r'^(?:(.*)/)?([a-zA-Z]+\.[a-zA-Z]+).+$'
What am I missing here?

You can restrict the characters to be matched in the first capture group.
For example, you could match any character except /, . or a newline using the negated character class [^/\n.]+:
^(?:([^/\n.]+)/)?([a-zA-Z]+\.[a-zA-Z]+).*$
Or you can restrict the characters to match word characters \w+ only
^(?:(\w+)/)?([a-zA-Z]+\.[a-zA-Z]+).*$
Note that in your pattern, the .+ at the end matches at least a single character. If you want to make that part optional, you can change it to .*
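To check the negated-character-class version against the three paths from the question, a small sketch along these lines could be used (PATH_RE, paths and results are illustrative names only):

```python
import re

# Group 1: optional leading directory (no "/", "." or newline allowed in it).
# Group 2: the "name.name" part that follows.
PATH_RE = re.compile(r"^(?:([^/\n.]+)/)?([a-zA-Z]+\.[a-zA-Z]+).*$")

paths = [
    "tree/bee.horse_2021/moose/loo.se",
    "bee.horse_2021/moose/loo.se",
    "bee.horse_2021/mo.ose/loo.se",
]

results = [PATH_RE.match(p).groups() for p in paths]
for path, groups in zip(paths, results):
    print(path, "->", groups)
```

Group 1 is None whenever the leading directory is missing, so callers need to handle that case.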

Update regex to extract company register number [duplicate]

Suppose I have the following regex that matches a string with a semicolon at the end:
\".+\";
It will match any string except an empty one, like the one below:
"";
I tried using this:
\".+?\";
But that didn't work.
My question is, how can I make the .+ part optional, so the user doesn't have to put any characters in the string?

To make the .+ optional, you could do:
\"(?:.+)?\";
(?:..) is called a non-capturing group. It only does the matching operation and it won't capture anything. Adding ? after the non-capturing group makes the whole non-capturing group optional.
Alternatively, you could do:
\".*?\";
.* would match any character zero or more times greedily. Adding ? after the * makes it lazy, so the regex engine goes for the shortest possible match.
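The variants can be compared side by side in Python. This is only a sketch: the dictionary keys and the two sample strings are invented for the comparison, and the backslashes before the quotes are dropped because they are not needed inside a single-quoted raw string.

```python
import re

samples = ['"";', '"abc";']

patterns = {
    "original": r'".+";',          # needs at least one character between the quotes
    "lazy_plus": r'".+?";',        # lazy, but still needs at least one character
    "optional_group": r'"(?:.+)?";',
    "lazy_star": r'".*?";',
}

# For each pattern, record which samples it matches in full.
matched = {
    name: [s for s in samples if re.fullmatch(pattern, s)]
    for name, pattern in patterns.items()
}

for name, hits in matched.items():
    print(name, hits)
```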

As an alternative:
\".*\";
Try it here: https://regex101.com/r/hbA01X/1

optional groups in regex to match different lines

I have two files:
/c/desktop/test.txt#edit
/c/desktop/test.txt
I am using regex: (.*desktop.*)(?:#.*)?
It should match everything before and after desktop but leave out anything from the # onward, which may or may not exist in the line.
But it's either matching everything or nothing.

One way of achieving what you want is to combine the non-greedy quantifier *? with the end-of-line anchor $: (.*desktop.*?)(?:$|#.*)
.*? says match as few characters as possible.
$|#.* says match either the end of the line or a # followed by any characters. This way, the .*? in the first group does not run past the #, because the pattern can already match with fewer characters once the second group takes the # and everything after it.
Tested here: https://regex101.com/r/7l1CQi/1
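A quick way to confirm this in Python, using the two lines from the question (LINE_RE, lines and captured are names chosen for this sketch):

```python
import re

LINE_RE = re.compile(r"(.*desktop.*?)(?:$|#.*)")

lines = ["/c/desktop/test.txt#edit", "/c/desktop/test.txt"]

# Group 1 should hold the path without the optional "#..." suffix.
captured = [LINE_RE.match(line).group(1) for line in lines]
print(captured)
```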

How to say "match anything until a specific character, then work your way backwards"?

I am often faced with patterns where the part which is interesting is delimited by a specific character, the rest does not matter. A typical example:
/dev/sda1 472437724 231650856 216764652 52% /
I would like to extract 52 (which can also be 9, or 100 - so 1 to 3 digits) by saying "match anything, then when you get to % (which is unique in that line), see before for the matches to extract".
I tried to code this as .*(\d*)%.* but the group is not matched:
.* match anything, any number of times
% ... until you get to the literal % (the \d is also matched by .* but my understanding is that once % is matched, the regex engine will work backwards, since it now has an "anchor" on which to analyze what was before -- please tell me if this reasoning is incorrect, thank you)
(\d*) ... and now before that % you had a (\d*) to match and group
.* ... and the rest does not matter (match everything)

Your regex does not work because .* matches too much, and the group matches too little. Because of the * quantifier, the group \d* can match nothing at all, leaving everything to be matched by the .*.
And your description of .* is somewhat incorrect. It actually matches everything up to the end of the line first, and then backtracks until the rest of the pattern ((\d*)%.*) can match.
In fact, I think your text can be matched simply by:
(\d{1,3})%
And getting group 1.
The logic of "keep looking until you find..." is kind of baked into the regex engine, so you don't need to explicitly say .* unless you want it in the match. In this case you just want the number before the % right?
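The difference can be seen directly in Python. In this sketch, line is the example from the question, while greedy, found and usage_percent are names introduced only for illustration:

```python
import re

line = "/dev/sda1 472437724 231650856 216764652 52% /"

# The question's pattern does match, but the greedy .* takes the digits,
# so the (\d*) group is left with an empty string.
greedy = re.match(r".*(\d*)%.*", line)
print(repr(greedy.group(1)))

# Searching for the digits directly lets the engine find the position itself.
found = re.search(r"(\d{1,3})%", line)
usage_percent = int(found.group(1)) if found else None
print(usage_percent)
```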

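If you just want to extract the number, I would use:
import re
# \d+ rather than \d*: with \d*, findall would also return an empty match at the % itself
pattern = r"\d+(?=%)"
string = "/dev/sda1 472437724 231650856 216764652 52% /"
returnedMatches = re.findall(pattern, string)  # ['52']
The pattern uses a positive lookahead, (?=%), so the digits only match when a % follows them, and the % itself is not part of the match.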

In your pattern, the .* part first matches until the end of the string. Then it backtracks, giving up as little as possible, until the rest of the pattern (zero or more digits followed by a %) can match.
The % is matched because matching zero digits is allowed. Then .* matches again until the end of the string. The capturing group does take part in the match, only it is empty.
What you might do is add a word boundary or a space before the digits:
.* (\d{1,3})%.* or .*\b(\d{1,3})%.*
Note that with a greedy .* you will get the last occurrence of the digits followed by the % sign.
If you make it non-greedy, you will match the first occurrence:
.*?(\d{1,3})%.*
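To see the first-versus-last behaviour, here is a sketch on a made-up line that contains two percentages (the line and the variable names are hypothetical, not taken from the question):

```python
import re

line = "cpu 12% mem 52%"  # hypothetical input with two percentages

# Greedy .* with a word boundary: backtracks from the end, so the last number wins.
last = re.match(r".*\b(\d{1,3})%.*", line).group(1)

# Lazy .*?: expands from the start, so the first number wins.
first = re.match(r".*?(\d{1,3})%.*", line).group(1)

# Greedy .* without the boundary gives up only one digit to the group.
no_boundary = re.match(r".*(\d{1,3})%.*", line).group(1)

print(last, first, no_boundary)
```

The last result shows why the space or \b before the digits matters when the leading .* is greedy.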

By default regex matches as greedily as possible. The initial .* in your regex sequence is matching everything up to the %:
"/dev/sda1 472437724 231650856 216764652 52"
This is acceptable for the regex, because it just chooses to have the next pattern, (\d*), match 0 characters.
In this scenario a couple of options could work for you. I would most recommend to use the previous spaces to define a sequence which "starts with a single space, contains any number of digits in the middle, and ends with a percentage symbol":
' (\d*)%'

Try this:
.*(\b\d{1,3}(?=\%)).*
