Regular Expression for matching a string with different combinations - python

I'm trying to match a string with the following different combinations using Python
(here the x's are digits, and each xxxx is 4 digits long)
W|MON-FRI|xxxx-xxxx
W|mon-fri|xxxx-xxxx
W|MON-THU,SAT|xxxx-xxxx
W|mon-thu,sat|xxxx-xxxx
W|MON|xxxx-xxxx
Here the first and last parts are static; the second part can be any of the combinations shown above, with the days sometimes separated by ',' and sometimes by '-'.
I'm a newbie to regular expressions. I googled how regular expressions work, and I am able to write the RE for bits and pieces of the above expressions, like matching the last part with re.compile('(\d{4})-(\d{4})$') and the first part with re.compile('[w|W]').
I tried to match the 2nd part but couldn't succeed with
new_patt = re.compile('(([a-zA-Z]{3}))([,-]?)(([a-zA-Z]{3})?)')
How can I achieve this?

Here is a regular expression that should work:
pat = re.compile(r'^W\|(mon|tue|wed|thu|fri|sat|sun)(-(mon|tue|wed|thu|fri|sat|sun))?(,(mon|tue|wed|thu|fri|sat|sun)(-(mon|tue|wed|thu|fri|sat|sun))?)?\|\d{4}-\d{4}$', re.IGNORECASE)
Note first how you can ignore case to handle both lower and upper case. In addition to the static text at the beginning and the numbers at the end, this regex matches a day of the week, followed by an optional dash plus day of the week, followed by an optional sequence consisting of a comma and the same day/optional dash-day sequence again.
"^W\|(mon|tue|wed|thu|fri|sat|sun)(-(mon|tue|wed|thu|fri|sat|sun))?(,(mon|tue|wed|thu|fri|sat|sun)(-(mon|tue|wed|thu|fri|sat|sun))?)?\|\d{4}-\d{4}$"i
^ assert position at start of the string
W matches the character W literally (case insensitive)
\| matches the character | literally
1st Capturing group (mon|tue|wed|thu|fri|sat|sun)
2nd Capturing group (-(mon|tue|wed|thu|fri|sat|sun))?
Quantifier: ? Between zero and one time, as many times as possible, giving back as needed [greedy]
Note: A repeated capturing group will only capture the last iteration. Put a capturing group around the repeated group to capture all iterations or use a non-capturing group instead if you're not interested in the data
- matches the character - literally
3rd Capturing group (mon|tue|wed|thu|fri|sat|sun)
4th Capturing group (,(mon|tue|wed|thu|fri|sat|sun)(-(mon|tue|wed|thu|fri|sat|sun))?)?
Quantifier: ? Between zero and one time, as many times as possible, giving back as needed [greedy]
Note: A repeated capturing group will only capture the last iteration. Put a capturing group around the repeated group to capture all iterations or use a non-capturing group instead if you're not interested in the data
, matches the character , literally
5th Capturing group (mon|tue|wed|thu|fri|sat|sun)
6th Capturing group (-(mon|tue|wed|thu|fri|sat|sun))?
Quantifier: ? Between zero and one time, as many times as possible, giving back as needed [greedy]
Note: A repeated capturing group will only capture the last iteration. Put a capturing group around the repeated group to capture all iterations or use a non-capturing group instead if you're not interested in the data
- matches the character - literally
7th Capturing group (mon|tue|wed|thu|fri|sat|sun)
\| matches the character | literally
\d{4} match a digit [0-9]
Quantifier: {4} Exactly 4 times
- matches the character - literally
\d{4} match a digit [0-9]
Quantifier: {4} Exactly 4 times
$ assert position at end of the string
i modifier: insensitive. Case insensitive match (ignores case of [a-zA-Z])
https://regex101.com/r/dW4dQ7/1
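As a quick check, the pattern can be run against the sample strings from the question. This is only a sketch: the DAY helper and the concrete digits are made up for illustration, and the inner groups are written as non-capturing because only a yes/no answer is needed here.

```python
import re

# Hypothetical helper: the seven day abbreviations used in the pattern above.
DAY = r"(?:mon|tue|wed|thu|fri|sat|sun)"

# day, optional "-day", optionally followed by ",day" with its own optional "-day"
pat = re.compile(
    rf"^W\|{DAY}(?:-{DAY})?(?:,{DAY}(?:-{DAY})?)?\|\d{{4}}-\d{{4}}$",
    re.IGNORECASE,
)

samples = [
    "W|MON-FRI|0900-1700",
    "W|mon-fri|0900-1700",
    "W|MON-THU,SAT|0900-1700",
    "W|mon-thu,sat|0900-1700",
    "W|MON|0900-1700",
]
results = [bool(pat.match(s)) for s in samples]
rejected = bool(pat.match("W|FOO-BAR|0900-1700"))  # not day names
print(results, rejected)
```

Spelling out the day names is what makes the last input fail; a generic three-letter class would accept it.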

You can get everything in one go:
^W\|(?:\w{3}[-,]){0,2}\w{3}\|(?:\d{4}[-]?){2}$
With Live Demo
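Worth knowing before adopting the compact version: it is looser than the spelled-out one. A small sketch, with made-up sample digits, of what it accepts:

```python
import re

compact = re.compile(r"^W\|(?:\w{3}[-,]){0,2}\w{3}\|(?:\d{4}[-]?){2}$")

accepted = [
    "W|MON-FRI|0900-1700",
    "W|mon-thu,sat|0900-1700",
    "W|MON|0900-1700",
]
# Inputs that are probably unintended but still match this looser pattern.
also_accepted = [
    "W|abc|0900-1700",  # \w{3} accepts any three word characters
    "W|MON|09001700",   # the hyphen between the two times is optional
]
compact_ok = [bool(compact.match(s)) for s in accepted]
compact_loose = [bool(compact.match(s)) for s in also_accepted]
# Without re.IGNORECASE the leading W must be uppercase.
lower_w = compact.match("w|mon|0900-1700")
print(compact_ok, compact_loose, lower_w)
```

Whether that looseness matters depends on how much you trust the input.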

Thanks for your posts and comments,
At last I am able to satisfy my requirement with regular expressions
here it is
"^[w|W]\|(mon|sun|fri|thu|sat|wed|tue|[0-6])(-(mon|fri|sat|sun|wed|thu|tue|[0-6]))?(,(mon|fri|sat|sun|wed|thu|tue|[0-6]))*?\|(\d{4}-\d{4})$"img
I just tweaked the answer posted by Julien Spronck
Once again thanks all
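One small thing about [w|W]: inside a character class the | is not an "or", it is a literal pipe character, so the class also accepts a | in first position. A minimal sketch with a made-up input line:

```python
import re

with_pipe = re.compile(r"^[w|W]\|")    # class contains 'w', '|' and 'W'
without_pipe = re.compile(r"^[wW]\|")  # class contains only 'w' and 'W'

line = "||mon|0900-1700"  # made-up input that starts with a stray '|'
pipe_matches = bool(with_pipe.match(line))
plain_matches = bool(without_pipe.match(line))
print(pipe_matches, plain_matches)
```

Writing [wW] (or using the case-insensitive flag with a plain W) avoids that.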

Related

Improving the efficiency of a regex

Given a string such as this:
upstream-status=502; upstream-scheme=http; upstream-host=dfsdf-dsfsd88.dsfsdf99.sdfsdf.dfdf.in.sdfsf; upstream-url=%2FWebObjects%2Fdsdf.woa;
The regex that I wrote for matching and extracting the upstream-host is:
upstream-host=(?P<hostname>\S+(?=;))*
The ?P<hostname> allows me to create a named group.
The \S+ matches the actual hostname.
The ?=; says don't include the ; in the named group.
The last * says I don't care what comes after.
I have a nagging feeling that there is a better way to write this regex.
You can omit the lookahead and match the ; outside of the group: \S+ first captures all non-whitespace characters, and then the pattern matches the last ; instead of only asserting it.
Also, you can omit the * quantifier from the group, because a group repeated zero or more times can also match an empty string.
upstream-host=(?P<hostname>\S+);
Regex demo
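In Python, the simplified pattern could be used like this. The second pattern, with [^;\s]+, is my own assumed variant and not part of the answer above; it stops at the first ; without needing to backtrack.

```python
import re

line = ("upstream-status=502; upstream-scheme=http; "
        "upstream-host=dfsdf-dsfsd88.dsfsdf99.sdfsdf.dfdf.in.sdfsf; "
        "upstream-url=%2FWebObjects%2Fdsdf.woa;")

# Pattern from the answer: capture non-whitespace, then match the ';'.
m = re.search(r"upstream-host=(?P<hostname>\S+);", line)
hostname = m.group("hostname") if m else None

# Assumed variant: exclude ';' and whitespace from the captured characters.
m2 = re.search(r"upstream-host=(?P<hostname>[^;\s]+)", line)
hostname2 = m2.group("hostname") if m2 else None
print(hostname, hostname2)
```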

How to extract substring with regex

I have SKUs of like the following:
SBC225SLB32
SBA2161BRB30
PBA632AS32
Where the first 3-4 characters are A-Z, which must be extracted, and the following 3-4 numbers are [0-9], and also have to be extracted.
For the first, I tried \D{3,4} and for the second, I tried \d{3,4}.
But when using pandas' .str.extract('\D{3,4}'), I got a pattern contains no capture groups error.
Is there a better way to do this?
The regex pattern you pass to Series.str.extract contains no capturing groups, while the method expects at least one.
In your case, it is more convenient to grab both values at once with the help of two capturing groups. You can use
df[['Code1', 'Code2']] = df['SKU'].str.extract(r'^([A-Z]{3,4})([0-9]{3,4})', expand=False)
See the regex demo. Pattern details:
^ - start of string
([A-Z]{3,4}) - Capturing group 1: three to four uppercase ASCII letters
([0-9]{3,4}) - Capturing group 2: three to four ASCII digits.
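To see what the two groups capture without involving pandas, the same pattern can be tried with the standard re module on the SKUs from the question. This is just a sanity-check sketch; the variable names are made up.

```python
import re

sku_pattern = re.compile(r"^([A-Z]{3,4})([0-9]{3,4})")

skus = ["SBC225SLB32", "SBA2161BRB30", "PBA632AS32"]
# groups() returns one tuple per SKU: (letters, digits)
parts = [sku_pattern.match(s).groups() for s in skus]
print(parts)
```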

How to say "match anything until a specific character, then work your way backwards"?

I am often faced with patterns where the part which is interesting is delimited by a specific character, the rest does not matter. A typical example:
/dev/sda1 472437724 231650856 216764652 52% /
I would like to extract 52 (which can also be 9, or 100 - so 1 to 3 digits) by saying "match anything, then when you get to % (which is unique in that line), see before for the matches to extract".
I tried to code this as .*(\d*)%.* but the group is not matched:
.* match anything, any number of times
% ... until you get to the literal % (the \d is also matched by .* but my understanding is that once % is matched, the regex engine will work backwards, since it now has an "anchor" on which to analyze what was before -- please tell me if this reasoning is incorrect, thank you)
(\d*) ... and now before that % you had a (\d*) to match and group
.* ... and the rest does not matter (match everything)
Your regex does not work because .* matches too much, and the group matches too little. The group \d* can match nothing at all because of the * quantifier, leaving everything to be matched by the .*.
And your description of .* is somewhat incorrect. It actually matches everything until the end, and then moves backwards until the thing after it ((\d*)%.*) matches. For more info, see here.
In fact, I think your text can be matched simply by:
(\d{1,3})%
And getting group 1.
The logic of "keep looking until you find..." is kind of baked into the regex engine, so you don't need to explicitly say .* unless you want it in the match. In this case you just want the number before the % right?
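A short sketch of the difference, using the line from the question (variable names are made up):

```python
import re

line = "/dev/sda1 472437724 231650856 216764652 52% /"

# Original attempt: the greedy .* keeps the digits, so the group is empty.
original = re.match(r".*(\d*)%.*", line).group(1)

# Suggested pattern: search for 1 to 3 digits directly before '%'.
usage = re.search(r"(\d{1,3})%", line).group(1)
print(repr(original), usage)
```

re.search scans the string for the first position where the pattern fits, which is the "keep looking until you find" part.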
If you are looking to extract just the number, then I would use:
import re
pattern = r"\d+(?=%)"  # \d+ rather than \d*, so an empty match right before % is not returned
string = "/dev/sda1 472437724 231650856 216764652 52% /"
returnedMatches = re.findall(pattern, string)
The regular expression uses a positive lookahead for the % character, so the % must follow the digits but is not part of the match.
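One detail about the quantifier in front of the lookahead: a sketch, on the line from the question, of why requiring at least one digit is the safer choice with re.findall.

```python
import re

line = "/dev/sda1 472437724 231650856 216764652 52% /"

# \d+ requires at least one digit, so only the number before '%' is returned.
numbers = re.findall(r"\d+(?=%)", line)

# \d* may also match the empty string right before '%', which can add an
# empty entry to the result, so the list is filtered here.
non_empty = [m for m in re.findall(r"\d*(?=%)", line) if m]
print(numbers, non_empty)
```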
In your pattern this part .* matches until the end of the string. Then it backtracks, giving up as little as possible, until it can match zero or more digits followed by a %.
The % is matched because matching 0+ digits is ok. Then you match again .* till the end of the string. There is a capturing group, only it is empty.
What you might do is add a word boundary or a space before the digits:
.* (\d{1,3})%.* or .*\b(\d{1,3})%.*
Regex demo 1 Or regex demo 2
Note that using .* (greedy) you will get the last instance of the digits and the % sign.
If you make it non-greedy, you match the first occurrence:
.*?(\d{1,3})%.*
Regex demo
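The greedy versus non-greedy difference only shows up when there is more than one percentage in the line, so here is a sketch with a made-up line that contains two:

```python
import re

line = "disk 10% used, inode 20% used"  # made-up line with two percentages

# Greedy .* backtracks from the end, so it finds the last number before a '%'.
greedy = re.match(r".*\b(\d{1,3})%.*", line).group(1)

# Lazy .*? grows from the start, so it finds the first number before a '%'.
lazy = re.match(r".*?(\d{1,3})%.*", line).group(1)
print(greedy, lazy)
```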
By default regex matches as greedily as possible. The initial .* in your regex sequence is matching everything up to the %:
"/dev/sda1 472437724 231650856 216764652 52"
This is acceptable for the regex, because it just chooses to have the next pattern, (\d*), match 0 characters.
In this scenario a couple of options could work for you. I would most recommend to use the previous spaces to define a sequence which "starts with a single space, contains any number of digits in the middle, and ends with a percentage symbol":
' (\d*)%'
Try this:
.*(\b\d{1,3}(?=\%)).*
demo

Python regex: How to make a group of words/character optional?

I am trying to make regex that can match all of them:
word
word-hyphen
word-hyphen-again
That is, the -\w+ part can occur any number of times, depending on how many words are in a term. How can I make it optional?
What I have so far is here: https://regex101.com/r/Atpwze/1
Try using
\w+(-\w+)* for matching 0 or more hyphenated words after first word
\w+(-\w+){0,} same as first case
based on your exact requirement.
In order to eliminate some extreme cases like a-+-+---, you could use \w+(-\w+)*[^\W]
\W matches any non-word character, and the ^ inside the brackets negates it, so [^\W] matches a word character (the same as \w)
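A sketch of the first suggestion in Python, using re.fullmatch so the whole term has to fit the pattern. The group is written as non-capturing here, which is my own choice, and the last two candidates are made-up negative examples.

```python
import re

# A word, followed by zero or more "-word" parts.
term = re.compile(r"\w+(?:-\w+)*")

candidates = ["word", "word-hyphen", "word-hyphen-again", "a-+-+---", "word-"]
valid = [c for c in candidates if term.fullmatch(c)]
print(valid)
```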
To catch all of your examples, I think you could use:
^\w+(?:\w+\-?|\-\w+)+$
Beginning of the string ^
Match a word character one or more times \w+
Start a non capturing group (?:
Match a word character one or more times with an optional hyphen \w+\-?
Or |
A hyphen with one or more word characters \-\w+
Close the non capturing group )
End of the string $
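Trying this pattern on a few inputs shows two edge cases that may or may not matter for you (the extra inputs are made up):

```python
import re

alt = re.compile(r"^\w+(?:\w+\-?|\-\w+)+$")

examples = ["word", "word-hyphen", "word-hyphen-again"]
all_match = all(alt.match(e) for e in examples)

# The leading \w+ and the group each need at least one character.
single_char = bool(alt.match("a"))
# The \w+\-? alternative allows the term to end in a hyphen.
trailing_hyphen = bool(alt.match("word-"))
print(all_match, single_char, trailing_hyphen)
```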

Python regex: using or statement

I may not be saying this right (I'm a total regex newbie). Here's the code I currently have:
bugs.append(re.compile("^(\d+)").match(line).group(1))
I'd like to add to the regex so it looks at either '\d+' (starts with digits) or that it starts with 2 capital letters and contains a '-' before the first whitespace. I have the regex for the capital letters:
^[A-Z]{2,}
but I'm not sure how to add the '-' and then make an OR with the \d+. Does this make sense? Thanks!
The way to do an OR in regexps is with the "alternation" or "pipe" operator, |.
For example, to match either one or more digits, or two or more capital letters:
^(\d+|[A-Z]{2,})
Debuggex Demo
You may or may not sometimes need to add/remove/move parentheses to get the precedence right. The way I've written it, you've got one group that captures either the digit string or the capitals. While you're learning the rules (in fact, even after you've learned the rules) it's helpful to look at a regular expression visualizer/debugger like the one I used.
Your rule is slightly more complicated: you want 2 or more capital letters, and a hyphen before the first space. That's a bit hard to write as is, but if you change it to two or more capital letters, zero or more non-space characters, and a hyphen, that's easy:
^(\d+|[A-Z]{2,}\S*?-)
Debuggex Demo
(Notice the \S*? there. It means we're going to match as few characters as possible, instead of as many as possible, so we'll only match up to the first hyphen in THIS-IS-A-TEST instead of up to the last. If you want the other one, just drop the ?.)
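Here is how the two variants behave in Python. The input lines are made up for illustration.

```python
import re

lazy = re.compile(r"^(\d+|[A-Z]{2,}\S*?-)")
greedy = re.compile(r"^(\d+|[A-Z]{2,}\S*-)")

digits = lazy.match("12345 fix crash").group(1)
first_hyphen = lazy.match("THIS-IS-A-TEST line").group(1)
last_hyphen = greedy.match("THIS-IS-A-TEST line").group(1)
no_match = lazy.match("lowercase line")  # neither alternative fits
print(digits, first_hyphen, last_hyphen, no_match)
```

Since match() returns None when nothing fits, check the result before calling .group(1) on real data.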
Write | for "or". For a sequence of zero or more non-whitespace characters, write \S*.
re.compile('^(\d+|[A-Z][A-Z]\S*-\s)')
re.compile(r"""
    ^              # beginning of the line
    (?:            # non-capturing group; do not return this group in .group()
        (\d+)      # one or more digits, captured as a group
        |          # Or
        [A-Z]{2}   # Exactly two uppercase letters
        \S*        # Any number of non-whitespace characters
        -          # the dash you wanted
    )              # end of the non-capturing group
    """,
    re.X)          # enable comments in the regex
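One consequence of this layout is that only the digits are captured, so group(1) is None when the letters branch matches; group(0) still gives the whole matched prefix. A sketch with the compiled pattern bound to a made-up name, bug_id, and made-up input lines:

```python
import re

bug_id = re.compile(r"""
    ^               # beginning of the line
    (?:             # non-capturing group
        (\d+)       # one or more digits, captured as group 1
      |             # or
        [A-Z]{2}    # exactly two uppercase letters
        \S*         # any number of non-whitespace characters
        -           # a literal hyphen
    )
""", re.X)

m_digits = bug_id.match("12345 fix crash")
m_letters = bug_id.match("AB-123 fix crash")

numeric = m_digits.group(1)   # the captured digits
prefix = m_letters.group(0)   # whole match for the letters branch
unset = m_letters.group(1)    # group 1 did not take part in this match
print(numeric, prefix, unset)
```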
