Python Regex no groups after alternation operator - python

I wrote a regex pattern in Python, but re.match() does not capture the groups after the | alternation operator.
Here is the pattern:
pattern = r"00([1-9]\d) ([1-9]\d) ([1-9]\d{5})|\+([1-9]\d) ([1-9]\d) ([1-9]\d{5})"
I feed the pattern with a qualified string: "+12 34 567890":
strng = "+12 34 567890"
pattern = r"00([1-9]\d) ([1-9]\d) ([1-9]\d{5})|\+([1-9]\d) ([1-9]\d) ([1-9]\d{5})"
m = re.match(pattern, strng)
print(m.group(1))
None is printed.
But if I delete the part before the | alternation operator:
strng = "+12 34 567890"
pattern = r"\+([1-9]\d) ([1-9]\d) ([1-9]\d{5})"
m = re.match(pattern, strng)
print(m.group(1))
It can capture all 3 groups:
12
34
567890
Thanks so much for your thoughts!

'|' has nothing to do with group indexing: groups are always numbered from left to right in the regex itself, in the order of their opening parentheses.
In your original regex, there are 6 groups:
In [270]: m.groups()
Out[270]: (None, None, None, '12', '34', '567890')
The second alternative is the one that matched, so you need:
In [271]: m.group(4)
Out[271]: '12'
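If you would rather not track which alternative matched, one option is to keep only the groups that actually took part in the match. This is a minimal sketch that assumes the pattern from the question; the helper name matched_groups is made up for illustration.

```python
import re

# Pattern from the question: six groups, three per alternative.
pattern = r"00([1-9]\d) ([1-9]\d) ([1-9]\d{5})|\+([1-9]\d) ([1-9]\d) ([1-9]\d{5})"


def matched_groups(text):
    """Return only the groups that participated in the match (hypothetical helper)."""
    m = re.match(pattern, text)
    if m is None:
        return None
    # Groups from the alternative that did not match are None; drop them.
    return tuple(g for g in m.groups() if g is not None)


print(matched_groups("+12 34 567890"))   # ('12', '34', '567890')
print(matched_groups("0012 34 567890"))  # ('12', '34', '567890')
```

This keeps the original pattern untouched, at the cost of losing the information about which prefix was used.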

You want to support two different patterns, one with 00 and the other with + at the start. You may merge the alternatives using a non-capturing group:
import re
strng = "+12 34 567890"
pattern = r"(?:00|\+)([1-9]\d) ([1-9]\d) ([1-9]\d{5})$"
m = re.match(pattern, strng)
if m:
    print(m.group(1))
    print(m.group(2))
    print(m.group(3))
See the regex demo and the Python demo yielding
12
34
567890
The regex at the regex testing site is prepended with ^ (start of string) because re.match only matches at the start of the string. The whole pattern now matches:
^ - start of string (implicit in re.match)
(?:00|\+) - either a 00 or a + substring
([1-9]\d) - Capturing group 1: a digit from 1 to 9 and then any digit
(space) - a literal space (replace with \s to match any single whitespace char)
([1-9]\d) - Capturing group 2: a digit from 1 to 9 and then any digit
(space) - a literal space (replace with \s to match any single whitespace char)
([1-9]\d{5}) - Capturing group 3: a digit from 1 to 9 and then any 5 digits
$ - end of string.
Remove $ if you do not need to match the end of the string right after the number.
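As an alternative to re.match plus a trailing $, re.fullmatch requires the whole string to match, so the pattern needs no anchors. A small sketch, where parse_phone is a hypothetical helper name and not part of the answer above:

```python
import re

# Same merged pattern as above, without the trailing $.
PHONE = re.compile(r"(?:00|\+)([1-9]\d) ([1-9]\d) ([1-9]\d{5})")


def parse_phone(text):
    """Return the three captured parts, or None if the whole string does not match."""
    m = PHONE.fullmatch(text)
    return m.groups() if m else None


print(parse_phone("+12 34 567890"))    # ('12', '34', '567890')
print(parse_phone("0012 34 567890"))   # ('12', '34', '567890')
print(parse_phone("+12 34 5678901"))   # None (one digit too many)
print(parse_phone("+12 34 567890\n"))  # None
```

One difference worth knowing: $ also matches just before a trailing newline, so re.match with $ accepts "+12 34 567890\n", while fullmatch rejects it.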

Related

Python regex match a pattern for multiple times

I've got a list of strings.
input=['XX=BB|3|3|1|1|PLP|KLWE|9999|9999', 'XX=BB|3|3|1|1|2|PLP|KPOK|99999|99999', '999|999|999|9999|999', ....]
This type '999|999|999|9999|999' remains unchanged.
I need to replace 9999|9999 with 12|21
I wrote (?<=BB\|\d\|\d\|\d\|\d\|\S{3}\|\S{4}\|)9{2,9}\|9{2,9} to match the 9999|9999 part. However, there are 4 to 6 \|\d in the middle. So how can I match the \|\d pattern multiple times?
Desired result:
['XX=BB|3|3|1|1|PLP|KLWE|12|21', 'XX=BB|3|3|1|1|2|PLP|KPOK|12|21', '999|999|999|9999|999'...]
thanks
You can use
re.sub(r'(BB(?:\|\d){4,6}\|[^\s|]{3}\|[^\s|]{4}\|)9{2,9}\|9{2,9}(?!\d)', r'\g<1>12|21', text)
See the regex demo.
Details:
(BB(?:\|\d){4,6}\|[^\s|]{3}\|[^\s|]{4}\|) - Capturing group 1:
BB - a BB string
(?:\|\d){4,6} - four, five or six repetitions of | followed by a single digit
\| - a | char
[^\s|]{3} - three chars other than whitespace and a pipe
\|[^\s|]{4}\| - a |, four chars other than whitespace and a pipe, and then a pipe char
9{2,9}\|9{2,9} - two to nine 9 chars, | and again two to nine 9 chars...
(?!\d) - not followed with another digit (note you may remove this if you do not need to check for the digit boundary here. You may also use (?![^|]) instead if you need to check if there is a | char or end of string immediately on the right).
The \g<1>12|21 replacement includes an unambiguous backreference to Group 1 (\g<1>) and a 12|21 substring appended to it.
See the Python demo:
import re
texts=['XX=BB|3|3|1|1|PLP|KLWE|9999|9999', 'XX=BB|3|3|1|1|2|PLP|KPOK|99999|99999', '999|999|999|9999|999']
pattern = r'(BB(?:\|\d){4,6}\|[^\s|]{3}\|[^\s|]{4}\|)9{2,9}\|9{2,9}(?!\d)'
repl = r'\g<1>12|21'
for text in texts:
    print(re.sub(pattern, repl, text))
Output:
XX=BB|3|3|1|1|PLP|KLWE|12|21
XX=BB|3|3|1|1|2|PLP|KPOK|12|21
999|999|999|9999|999
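The details above mention (?![^|]) as a stricter right-hand boundary than (?!\d). The sketch below compares the two on an extra input with a trailing letter; that input and the function name replace_nines are invented for illustration.

```python
import re

PREFIX = r'(BB(?:\|\d){4,6}\|[^\s|]{3}\|[^\s|]{4}\|)9{2,9}\|9{2,9}'
DIGIT_BOUNDARY = re.compile(PREFIX + r'(?!\d)')    # only forbids a following digit
PIPE_BOUNDARY = re.compile(PREFIX + r'(?![^|])')   # requires | or end of string next


def replace_nines(text, pattern=DIGIT_BOUNDARY):
    """Replace the trailing run of nines with 12|21, keeping Group 1."""
    return pattern.sub(r'\g<1>12|21', text)


sample = 'XX=BB|3|3|1|1|PLP|KLWE|9999|9999x'  # made-up input with a trailing letter
print(replace_nines(sample))                 # XX=BB|3|3|1|1|PLP|KLWE|12|21x
print(replace_nines(sample, PIPE_BOUNDARY))  # unchanged: 'x' follows the nines
```

Which boundary is right depends on whether non-digit characters may legitimately follow the last field.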
I would just use re.sub here and search for the pattern \b9{2,9}\|9{2,9}\b:
import re
inp = ["XX=BB|3|3|1|1|PLP|KLWE|9999|9999", "XX=BB|3|3|1|1|2|PLP|KPOK|99999|99999"]
output = [re.sub(r'\b9{2,9}\|9{2,9}\b', '12|21', i) for i in inp]
print(output)
# ['XX=BB|3|3|1|1|PLP|KLWE|12|21', 'XX=BB|3|3|1|1|2|PLP|KPOK|12|21']
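One caveat with the shorter \b-based pattern: it has no BB prefix check, so it also rewrites strings of the '999|999|999|9999|999' type that the question wants left unchanged. A quick check, with replace_simple as a made-up wrapper name:

```python
import re


def replace_simple(text):
    """Apply the \\b-based substitution without any prefix check."""
    return re.sub(r'\b9{2,9}\|9{2,9}\b', '12|21', text)


print(replace_simple('XX=BB|3|3|1|1|PLP|KLWE|9999|9999'))  # XX=BB|3|3|1|1|PLP|KLWE|12|21
print(replace_simple('999|999|999|9999|999'))              # 12|21|12|21|999
```

If such strings can occur in the input, filter them out first or use a pattern that anchors on the BB prefix.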

Pattern to extract, expand and form a sentence based on a certain delimiter

I was trying to solve a regex problem:
There is an input sentence in one of these forms: Number1,2,3 or Number1/2/3 or Number1-2-3. These are the 3 delimiters: , / -
The expected output is: Number1,Number2,Number3
Pattern I've tried so far:
(?<=,)[^,]+(?=,)
but this misses out on the edge cases i.e. 1st element and last element. I am also not able to generate for '/'.
You could separate out the key from values, then use a list comprehension to build the output you want.
import re
inp = "Number1,2,3"
matches = re.search(r'(\D+)(.*)', inp)
# Split on any of the three delimiters: , / -
output = [matches[1] + x for x in re.split(r'[,/-]', matches[2])]
print(output) # ['Number1', 'Number2', 'Number3']
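The same split-and-prefix idea can be wrapped in a function that handles all three delimiters and rejects input that does not fit the expected shape. This is a sketch only; the name expand and the choice to return None on bad input are assumptions.

```python
import re


def expand(text):
    """Turn 'Number1,2,3' (or / or - separated) into 'Number1,Number2,Number3'."""
    # Non-digit prefix, then digits separated by , / or -
    m = re.fullmatch(r'(\D+)(\d+(?:[,/-]\d+)*)', text)
    if m is None:
        return None
    prefix, numbers = m.groups()
    return ','.join(prefix + n for n in re.split(r'[,/-]', numbers))


for sample in ("Number1,2,3", "Number1/2/3", "Number1-2-3"):
    print(expand(sample))  # Number1,Number2,Number3 each time
```

Note that this version does not check that the same delimiter is used throughout.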
You can do it in several steps: 1) validate the string to match your pattern, and once validated 2) add the first non-digit chunk to the numbers while replacing - and / separator chars with commas:
import re
texts = ['Number1,2,3', 'Number1/2/3', 'Number1-2-3']
for text in texts:
    m = re.search(r'^(\D+)(\d+(?=([,/-]))(?:\3\d+)*)$', text)
    if m:
        print( re.sub(r'(?<=,)(?=\d)', m.group(1).replace('\\', '\\\\'), text.replace('/',',').replace('-',',')) )
    else:
        print(f"NO MATCH in '{text}'")
See this Python demo.
Output:
Number1,Number2,Number3
Number1,Number2,Number3
Number1,Number2,Number3
The ^(\D+)(\d+(?=([,/-]))(?:\3\d+)*)$ regex validates your three types of input:
^ - start of string
(\D+) - Group 1: one or more non-digits
(\d+(?=([,/-]))(?:\3\d+)*) - Group 2: one or more digits, and then zero or more repetitions of ,, / or - and one or more digits (and the separator chars should be consistent due to the capture used in the positive lookahead and the \3 backreference to that value used in the non-capturing group)
$ - end of string.
The re.sub pattern, (?<=,)(?=\d), matches a location between a comma and a digit, the Group 1 value is placed there (note the .replace('\\', '\\\\') is necessary since the replacement is dynamic).
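A possible variation on the dynamic replacement: when a function is passed as the replacement, re.sub inserts its return value literally, so no backslash escaping is needed. The sketch below reuses the validation regex from above; the function name expand_consistent is hypothetical.

```python
import re

VALID = re.compile(r'^(\D+)(\d+(?=([,/-]))(?:\3\d+)*)$')


def expand_consistent(text):
    """Expand only when one separator is used consistently; otherwise return None."""
    m = VALID.search(text)
    if not m:
        return None
    prefix = m.group(1)
    normalized = text.replace('/', ',').replace('-', ',')
    # A callable replacement is inserted as-is, with no escape processing.
    return re.sub(r'(?<=,)(?=\d)', lambda _: prefix, normalized)


print(expand_consistent('Number1/2/3'))  # Number1,Number2,Number3
print(expand_consistent('Number1/2,3'))  # None (mixed separators)
```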
import re
for text in ("Number1,2,3", "Number1-2-3", "Number1/2/3"):
    print(re.sub(r"(\D+)(\d+)[/,-](\d+)[/,-](\d+)", r"\1\2,\1\3,\1\4", text))
\D+ matches "Number" or any other non-number text
\d+ matches a number (or more than one)
[/,-] matches any of /, ,, -
The rest is copy paste 3 times.
The substitution consists of backreferences to the matched "Number" string (\1) and then each group of the (\d+)s.
This works if you're sure that it's always three numbers divided by that separator. This does not ensure that it's the same separator between each number. But it's short.
Output:
Number1,Number2,Number3
Number1,Number2,Number3
Number1,Number2,Number3
If you can make use of the pypi regex module you can use the captures collection with a named capture group.
([^\d\s,/]+)(?<num>\d+)([,/-])(?<num>\d+)(?:\3(?<num>\d+))*(?!\S)
([^\d\s,/]+) Capture group 1, match 1+ chars other than the listed
(?<num>\d+) Named capture group num matching 1+ digits
([,/-]) Capture either , / - in group 3
(?<num>\d+) Named capture group num matching 1+ digits
(?:\3(?<num>\d+))* Optionally repeat a backreference to group 3 to keep the separators the same and match 1+ digits in group num
(?!\S) Assert a whitespace boundary to the right to prevent a partial match
Regex demo | Python demo
import regex as re
pattern = r"([^\d\s,/]+)(?<num>\d+)([,/-])(?<num>\d+)(?:\3(?<num>\d+))*(?!\S)"
s = "Number1,2,3 or Number4/5/6 but not Number7/8,9"
matches = re.finditer(pattern, s)
for m in matches:
    print(','.join([m.group(1) + c for c in m.captures("num")]))
Output
Number1,Number2,Number3
Number4,Number5,Number6
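If installing the third-party regex module is not an option, a roughly similar result seems achievable with the standard re module by capturing the whole number run and splitting it on the captured separator. This is a sketch under that assumption, not an equivalent of captures(); the pattern is a variation of the one above (it also excludes - from the prefix), and expand_all is a made-up name.

```python
import re

# Group 1: prefix, group 2: the whole number run, group 3: the separator.
PATTERN = re.compile(r"([^\d\s,/-]+)(\d+([,/-])\d+(?:\3\d+)*)(?!\S)")


def expand_all(text):
    """Return one expanded string per match found in the text."""
    results = []
    for m in PATTERN.finditer(text):
        prefix, numbers, sep = m.groups()
        results.append(",".join(prefix + n for n in numbers.split(sep)))
    return results


s = "Number1,2,3 or Number4/5/6 but not Number7/8,9"
print(expand_all(s))  # ['Number1,Number2,Number3', 'Number4,Number5,Number6']
```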

How to extract filename from path using regex

I would like to extract a filename from a path using regular expression:
mysting = '/content/drive/My Drive/data/happy (463).jpg'
How do I extract 'happy.jpg'?
I have tried '[^/]*$', but the result still includes the number in parentheses, which I do not want: 'happy (463).jpg'
How could I improve it?
You could use 2 capturing groups. Match everything up to the last / and capture 1+ word chars in group 1.
Then match optional whitespace and 1+ digits between parentheses, and capture .jpg in group 2, asserting the end of the string.
^.*/(\w+)\s*\(\d+\)(\.jpg)$
In parts that will match
^.*/ Match until last /
(\w+) Capture group 1, match 1+ word chars
\s* Match 0+ whitespace chars
\(\d+\) Match 1+ digits between parentheses
(\.jpg) Capture group 2, match .jpg
$ End of string
Regex demo | Python demo
Then use group 1 and group 2 in the replacement to get happy.jpg
import re
regex = r"^.*/(\w+)\s*\(\d+\)(\.jpg)$"
test_str = "/content/drive/My Drive/data/happy (463).jpg"
result = re.sub(regex, r"\1\2", test_str, count=1)
if result:
    print(result)
Output
happy.jpg
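Another way to combine the pieces is to let pathlib isolate the last path component and use a small regex only for the parenthesized number. A sketch, assuming POSIX-style paths as in the question; clean_filename is a hypothetical name and the pattern accepts any extension, not just .jpg.

```python
import re
from pathlib import PurePosixPath


def clean_filename(path):
    """Return the file name without a ' (digits)' part right before the extension."""
    name = PurePosixPath(path).name  # e.g. 'happy (463).jpg'
    # Remove optional whitespace plus '(digits)' only when the extension follows.
    return re.sub(r'\s*\(\d+\)(?=\.[^.]+$)', '', name)


print(clean_filename('/content/drive/My Drive/data/happy (463).jpg'))  # happy.jpg
print(clean_filename('/content/drive/My Drive/data/plain.png'))        # plain.png
```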
Without Regex; str methods (str.partition and str.rpartition):
In [185]: filename = mysting.rpartition('/')[-1]
In [186]: filename
Out[186]: 'happy (463).jpg'
In [187]: f"{filename.partition(' ')[0]}.{filename.rpartition('.')[-1]}"
Out[187]: 'happy.jpg'
With Regex; re.sub:
re.sub(r'.*/(?!.*/)([^\s]+)[^.]+(\..*)', r'\1\2', mysting)
.*/ greedily matches up to the last /
The zero-width negative lookahead (?!.*/) ensures there is no / anywhere further on
([^\s]+) matches up to the next whitespace and stores it as the first captured group
[^.]+ matches up to the next .
(\..*) matches a literal . followed by any number of characters and stores it as the second captured group; if you want to match more conservatively, such as 3 characters or even a literal .jpg, you can do that as well
in the replacement, only the captured groups are used
Example:
In [183]: mysting = '/content/drive/My Drive/data/happy (463).jpg'
In [184]: re.sub(r'.*/(?!.*/)([^\s]+)[^.]+(\..*)', r'\1\2', mysting)
Out[184]: 'happy.jpg'
I use JavaScript. In JavaScript, you can do this:
const myString="happy (463).jpg";
const result=myString.replace(/\s\(\d*\)/,'');
Split the path on the slash separator first,
then apply this code to the last segment.

python regexp 3 the same pairs of [0-9A-Fa-f]

I need a regex for some color which can be described like this:
starts with #
then three identical pairs of hex characters (0-9, a-f, A-F), compared case-insensitively, so aA and Aa count as the same pair
Now I have #(([0-9A-Fa-f]){2}){3}
How can I write a regex that requires the SAME pair of hex characters three times?
Some examples of the matching strings:
"#FFFFFF",
"#000000",
"#aAAaaA",
"#050505",
"###93#0b0B0b1B34"
Strings like "#000100" shouldn't match
With re.search() function:
import re
s = '#aAAaaA'
match = re.search(r'#([0-9a-f]{2})\1\1', s, re.I)  # a-f, not a-z: only hex digits
result = match.group() if match else None
print(result)
\1 - points to the 1st parenthesized group (...)
re.I - IGNORECASE regex flag
You may use the following regex with a capturing group and a backreference:
#([0-9A-Fa-f]{2})\1{2}
See the regex demo
Details
# - a #
([0-9A-Fa-f]{2}) - Group 1: 2 hex chars
\1{2} - 2 consecutive occurrences of the same value as captured in Group 1.
NOTE: the case insensitive flag is required to make the \1 backreference match Group 1 contents in a case insensitive way. Bear in mind we need to use a raw string literal to define the regex to avoid overescaping the backreferences.
See the Python demo:
import re
strs = ["#FFFFFF","#000000","#aAAaaA","#050505","###93#0b0B0b1B34", "#000100"]
for s in strs:
    m = re.search(r'#([0-9A-Fa-f]{2})\1{2}', s, flags=re.I)
    if m:
        print("{} MATCHED".format(s))
    else:
        print("{} DID NOT MATCH".format(s))
Results:
#FFFFFF MATCHED
#000000 MATCHED
#aAAaaA MATCHED
#050505 MATCHED
###93#0b0B0b1B34 MATCHED
#000100 DID NOT MATCH
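To see the effect of the flag from the NOTE above, and the difference between searching inside a string and validating the whole string, here is a small sketch. Both helper names are invented for illustration.

```python
import re

PATTERN = r'#([0-9A-Fa-f]{2})\1{2}'


def has_triple_pair(s, ignore_case=True):
    """True if the string contains # followed by the same hex pair three times."""
    flags = re.IGNORECASE if ignore_case else 0
    return re.search(PATTERN, s, flags) is not None


def is_triple_pair_color(s):
    """True only if the whole string is such a color."""
    return re.fullmatch(PATTERN, s, re.IGNORECASE) is not None


print(has_triple_pair('#aAAaaA'))                     # True
print(has_triple_pair('#aAAaaA', ignore_case=False))  # False: \1 must repeat 'aA' exactly
print(has_triple_pair('###93#0b0B0b1B34'))            # True: matches '#0b0B0b' inside
print(is_triple_pair_color('###93#0b0B0b1B34'))       # False: not the whole string
```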

Add optional part in python regular expression

I want to add an optional part to my python expression:
myExp = re.compile("(.*)_(\d+)\.(\w+)")
so that
if my string is abc_34.txt, result.group(2) is 34
if my string is abc_2034.txt, result.group(2) is still 34
I tried myExp = re.compile("(.*)_[20](\d+)\.(\w+)")
but my result.group(2) is 034 for the case of abc_2034.txt
Thanks F.J.
But I want to expand your solution and add a suffix.
so that if I put abc_203422.txt, results.group(2) is still 34
I tried "(.*)_(?:20)?(\d+)(?:22)?.(\w+)")
but I get 3422 instead of 34
strings = [
"abc_34.txt",
"abc_2034.txt",
]
for string in strings:
    first_part, ext = string.split(".")
    prefix, number = first_part.split("_")
    print(prefix, number[-2:], ext)
--output:--
abc 34 txt
abc 34 txt
import re
strings = [
"abc_34.txt",
"abc_2034.txt",
]
pattern = r"""
([^_]*) #Match not an underscore, 0 or more times, captured in group 1
_ #followed by an underscore
\d* #followed by a digit, 0 or more times, greedy
(\d{2}) #followed by a digit, twice, captured in group 2
[.] #followed by a period
(.*) #followed by any character, 0 or more times, captured in group 3
"""
regex = re.compile(pattern, flags=re.X) #ignore whitespace and comments in regex
for string in strings:
    md = regex.match(string)
    if md:
        print(md.group(1), md.group(2), md.group(3))
--output:--
abc 34 txt
abc 34 txt
myExp = re.compile(r"(.*)_(?:20)?(\d+)\.(\w+)")
The ?: at the beginning of the group containing 20 makes this a non-capturing group, the ? after that group makes it optional. So (?:20)? means "optionally match 20".
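For the follow-up with an optional 22 suffix, the attempt in the question returns 3422 because the greedy \d+ consumes the suffix before (?:22)? gets a chance. One possible fix, offered as a sketch, is to make the digit group lazy (\d+?) and escape the dot. The names BASIC, WITH_SUFFIX and number_part are made up for this example.

```python
import re

BASIC = re.compile(r"(.*)_(?:20)?(\d+)\.(\w+)")
# Lazy \d+? lets the optional (?:22)? claim a trailing 22 when it is present.
WITH_SUFFIX = re.compile(r"(.*)_(?:20)?(\d+?)(?:22)?\.(\w+)")


def number_part(pattern, name):
    """Return group 2 of the match, or None when the name does not match."""
    m = pattern.match(name)
    return m.group(2) if m else None


print(number_part(BASIC, "abc_2034.txt"))          # 34
print(number_part(BASIC, "abc_203422.txt"))        # 3422
print(number_part(WITH_SUFFIX, "abc_203422.txt"))  # 34
print(number_part(WITH_SUFFIX, "abc_34.txt"))      # 34
```

Names whose number itself is 20 or 22 are inherently ambiguous with this scheme, so check such cases against your real data.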
Not sure if this is what you're looking for, but ? is the regex quantifier for 0 or 1 occurrences. Alternatively, {0,2} allows up to two optional [0-9] digits, which is a bit hacky.
