Python Regex: OR statement does not work in regex module - python

Hi, I want to apply the following expression to count substitutions, insertions, and deletions. However, the OR operator (|) does not seem to work: the regex only checks the first alternative in the parentheses.
For example:
correct_string = "20181201"
regex_pattern = r"((20[0-9]{2})(0[1-9]|1[0-2])(0[1-9]|1[0-9]|2[0-9]|3[0-1])){e}"
regex.fullmatch(regex_pattern, correct_string)
Output:
<regex.Match object; span=(0, 8), match='20181201', fuzzy_counts=(1, 0, 0)>
It reports one substitution because of the 5th digit, even though that digit is allowed by the second alternative of the OR (1[0-2]).
Another example:
correct_string = "20180201"
regex_pattern = r"((20[0-9]{2})(0[1-9]|1[0-2])(0[1-9]|1[0-9]|2[0-9]|3[0-1])){e}"
regex.fullmatch(regex_pattern, correct_string)
Output:
<regex.Match object; span=(0, 8), match='20180201'>
In this case it reports no substitutions, which is correct because the month matches the first alternative of the OR.
How can I solve this? Thank you.

You need to use regex.ENHANCEMATCH:
By default, fuzzy matching searches for the first match that meets the given constraints. The ENHANCEMATCH flag will cause it to attempt to improve the fit (i.e. reduce the number of errors) of the match that it has found.
Python demo:
import regex
correct_string = "20181201"
regex_pattern = r"((20[0-9]{2})(0[1-9]|1[0-2])(0[1-9]|1[0-9]|2[0-9]|3[0-1])){e}"
print(regex.fullmatch(regex_pattern, correct_string, regex.ENHANCEMATCH))
# => <regex.Match object; span=(0, 8), match='20181201'>

Related

Match integer values from end of string until second dot

I have the following string GA1.2.4451363243.9414195136 and I want to match 4451363243.9414195136 using a regular expression in Python.
I have tried the following, which is not working: ([\d].[\d])$
Where am I going wrong here?
A few ideas (string operations or regex):
s = 'GA1.2.4451363243.9414195136'
out = '.'.join(s.rsplit('.', 2)[-2:])
# '4451363243.9414195136'
import re
out = re.search(r'[^.]*\.[^.]*$', s)
# <re.Match object; span=(6, 27), match='4451363243.9414195136'>
NB. to ensure matching digits, you can replace [^.] (any character but .) with \d.
For an arbitrary N:
N = 3
out = '.'.join(s.rsplit('.', N)[-N:])
# '2.4451363243.9414195136'
out = re.search(fr'[^.]*(?:\.[^.]*){{{N-1}}}$', s)
# <re.Match object; span=(4, 27), match='2.4451363243.9414195136'>
This can be done in pure Python, but if you want to use a regex, here is the code:
regex:
(?:\w*\.){2}(.*)
python:
import re
s = 'GA1.2.4451363243.9414195136'
re.match(r'(?:\w*\.){2}(.*)', s).group(1) # output: '4451363243.9414195136'
Or just use plain Python:
s.split('.', 2)[-1] # output: '4451363243.9414195136'
The following regex ([0-9]+\.[0-9]+)$ matches the expected part of the example (the dot is escaped so that it matches only a literal dot). Note that more specific solutions may arise as you provide more details, restrictions, etc. regarding the part to be matched:
>>> import re
>>> data = "GA1.2.4451363243.941419513"
>>> re.findall(r"([0-9]+\.[0-9]+)$", data)
['4451363243.941419513']
It requests the matched part to be made of:
digit(s)
dot
digit(s)
end of line.
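One detail worth checking in a pattern like this is whether the dot is escaped. A minimal sketch, using a made-up second input string, of how an unescaped . (any character) differs from \. (a literal dot only):

```python
import re

loose = re.compile(r"([0-9]+.[0-9]+)$")    # "." matches any character
strict = re.compile(r"([0-9]+\.[0-9]+)$")  # "\." matches only a literal dot

good = "GA1.2.4451363243.9414195136"
odd = "GA1.2.4451363243x9414195136"  # hypothetical input: "x" where a dot is expected

good_strict = strict.findall(good)  # the two trailing numeric fields
odd_loose = loose.findall(odd)      # still matches, because "." accepts "x"
odd_strict = strict.findall(odd)    # no match: there is no dot between the digit runs

print(good_strict, odd_loose, odd_strict)
```

On well-formed input both patterns behave the same; the escaped version only matters when the separator might be something other than a dot.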

Find string between two patterns with an AND condition in Python

I would like to identify the string of characters that lies between two patterns (lettre/ and " in the example). In addition, the identified string should not correspond to a third pattern (somth?other in the example).
Python 3.7 running on macOS 10.13
import re
strings = ['lettre/abc"','lettre/somth?other"','lettre/acc"','lettre/edf"de','lettre/nhy"','lettre/somth?other"']
res0_1 = re.search('lettre/.*?\"', strings[0])
res1_1 = re.search('lettre/.*?\"', strings[1])
print(res0_1)
<re.Match object; span=(0, 11), match='lettre/abc"'>
print(res1_1)
<re.Match object; span=(0, 19), match='lettre/somth?other"'>
res0_2 = re.search('lettre/(.*?\"&^[somth\?other])', strings[0])
res1_2 = re.search('lettre/(.*?\"&^[somth\?other])', strings[1])
print(res0_2)
None
print(res1_2)
None
I would like to get res0_1 for strings[0] and res1_2 for strings[1].
As I understand it, you can try this:
import re
strings = ['lettre/abc"','lettre/somth?other"','lettre/acc"','lettre/edf"de','lettre/nhy"','lettre/somth?other"']
res0_1 = re.findall('lettre/(.*)\"', strings[0])
res1_2 = re.findall('lettre/(.*)\"', strings[1])
print(res0_1)
print(res1_2)
Hope it helps
I think the code below gives you what you asked for in the question.
import re
strings = ['lettre/abc"','lettre/somth?other"','lettre/acc"','lettre/edf"de','lettre/nhy"','lettre/somth?other"']
for i in strings:
    # skip entries whose part after the slash contains the excluded text
    if 'somth?other' not in i.split('/')[1]:
        # print the text between the slash and the first double quote
        print(i.split('/')[1].split('"')[0])
Since you do not want to get a match if there is somth?other to the right of / you may use
r'lettre/(?!somth\?other)[^"]*"'
Details
lettre/ - a literal substring
(?!somth\?other) - no somth?other substring allowed immediately to the right of the current location
[^"]* - 0+ chars other than "
" - a double quotation mark.
Try using this site instead of trial and error.
https://regex101.com/
In [7]: import re
   ...: strings = ['lettre/abc"','lettre/somth?other"','lettre/acc"','lettre/edf"de','lettre/nhy"','lettre/somth?other"']
   ...:
In [8]: c = re.compile('(?=lettre/.*?\")(^((?!.*somth\?other.*).)*$)')
In [9]: for string in strings:
   ...:     print(c.match(string))
   ...:
<re.Match object; span=(0, 11), match='lettre/abc"'>
None
<re.Match object; span=(0, 11), match='lettre/acc"'>
<re.Match object; span=(0, 13), match='lettre/edf"de'>
<re.Match object; span=(0, 11), match='lettre/nhy"'>
None
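For comparison, the negative-lookahead pattern r'lettre/(?!somth\?other)[^"]*"' suggested above can be run over the same list. This is a minimal sketch, not a definitive solution; the results list is only an illustrative name:

```python
import re

strings = ['lettre/abc"', 'lettre/somth?other"', 'lettre/acc"',
           'lettre/edf"de', 'lettre/nhy"', 'lettre/somth?other"']

# "lettre/", not immediately followed by "somth?other",
# then everything up to and including the next double quote.
pattern = re.compile(r'lettre/(?!somth\?other)[^"]*"')

results = []
for s in strings:
    m = pattern.search(s)
    results.append(m.group(0) if m else None)

print(results)
```

Unlike the whole-string pattern, this one stops at the closing quote, so 'lettre/edf"de' yields 'lettre/edf"' rather than the full string.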

Regex python expression

I'm trying to read a binary file.
My objective is to find all the matches of "10, 10, [any hex value exactly one time], [either EE or DD]".
Thought I could do it like this:
pattern = (b"\x10\x10\[0-9a-fA-F]?\[xDD|xEE]")
This is clearly not working. The error seems to come from the third part: I tried dissecting the statement, and x10 and x11 work, but the rest just won't.
My understanding is that "[0-9a-fA-F]?" matches the range in the brackets 0 or 1 times, and that the last part means "xDD or xEE". Am I wrong?
Any ideas?
Use the regex
b'\x10\x10.[\xdd\xee]'
A single . matches any one byte a single time (except the newline byte \x0a, unless the re.DOTALL flag is passed), and a single [ab] matches a or b a single time.
>>> re.match(b'\x10\x10.[\xdd\xee]', b'\x10\x10\x00\xee')
<_sre.SRE_Match object; span=(0, 4), match=b'\x10\x10\x00\xee'>
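Since the goal is to find all matches in a binary file, a sketch with re.finditer and re.DOTALL may help. The sample bytes below are made up for illustration; in practice data would come from reading the file in binary mode:

```python
import re

# Hypothetical sample data standing in for the file contents.
data = b'\x00\x10\x10\x0a\xee\x10\x10\x41\xdd\x10\x10\x42\xcc'

# re.DOTALL lets "." match every byte value, including the newline byte 0x0a.
pattern = re.compile(rb'\x10\x10.[\xdd\xee]', re.DOTALL)
offsets = [m.start() for m in pattern.finditer(data)]

# Without re.DOTALL the first hit is missed, because its third byte is 0x0a.
plain = re.compile(rb'\x10\x10.[\xdd\xee]')
offsets_plain = [m.start() for m in plain.finditer(data)]

print(offsets, offsets_plain)
```

The last group of bytes ends in \xcc, so it is not reported by either pattern.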

Python regular expression with or and re.search

Say I have two types of strings:
str1 = 'NUM-140 A Thing: Foobar Analysis NUM-140'
str2 = 'NUM-140 Foobar Analysis NUM-140'
For both of these, I want to match 'Foobar' (which could be anything). I have tried the following:
m = re.compile('((?<=Thing: ).+(?= Analysis))|((?<=\d ).+(?= Analysis))')
ind1 = m.search(str1).span()
match1 = str1[ind1[0]:ind1[1]]
ind2 = m.search(str2).span()
match2 = str2[ind2[0]:ind2[1]]
However, match1 comes out to 'A Thing: Foobar', which seems to be the match for the second pattern, not the first. Applied individually, (pattern 1 to str1 and pattern 2 to str2, without the |), both patterns match 'Foobar'. I expected this, then, to stop when matched by the first pattern. This doesn't seem to be the case. What am I missing?
According to the documentation,
As the target string is scanned, REs separated by '|' are tried from left to right. When one pattern completely matches, that branch is accepted. This means that once A matches, B will not be tested further, even if it would produce a longer overall match. In other words, the '|' operator is never greedy.
But the alternation is tried at each starting position in turn, and re.search returns the leftmost match:
import re
THING = r'(?<=Thing: )(?P<THING>.+)(?= Analysis)'
NUM = r'(?<=\d )(?P<NUM>.+)(?= Analysis)'
MIXED = THING + '|' + NUM
str1 = 'NUM-140 A Thing: Foobar Analysis NUM-140'
str2 = 'NUM-140 Foobar Analysis NUM-140'
print(re.search(THING, str1))
# <... match='Foobar'>
print(re.search(NUM, str1))
# <... match='A Thing: Foobar'>
print(re.search(MIXED, str1))
# <... match='A Thing: Foobar'>
On its own, THING first matches at index 17, but NUM already matches at index 8. The scan reaches index 8 first; there THING fails and NUM succeeds, so MIXED returns 'A Thing: Foobar' and never gets to index 17. This is consistent with the documentation: the left-to-right order of '|' only decides between branches at the same starting position.
To give THING priority over the whole string, rely on Python's or short-circuiting:
print(re.search(THING, str1) or re.search(NUM, str1))
# <_sre.SRE_Match object; span=(17, 23), match='Foobar'>
print(re.search(THING, str2) or re.search(NUM, str2))
# <_sre.SRE_Match object; span=(8, 14), match='Foobar'>
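The short-circuit idea above extends to any number of patterns tried in priority order. A minimal sketch; first_match is a hypothetical helper name, not part of the re module:

```python
import re

def first_match(patterns, text):
    """Return the match of the first pattern that matches anywhere in text, or None."""
    for pattern in patterns:
        m = re.search(pattern, text)
        if m:
            return m
    return None

THING = r'(?<=Thing: ).+(?= Analysis)'
NUM = r'(?<=\d ).+(?= Analysis)'

str1 = 'NUM-140 A Thing: Foobar Analysis NUM-140'
str2 = 'NUM-140 Foobar Analysis NUM-140'

# THING is searched over the whole string before NUM is even considered.
match1 = first_match([THING, NUM], str1).group(0)
match2 = first_match([THING, NUM], str2).group(0)

print(match1, match2)
```

Each pattern scans the entire string before the next one is tried, which is the priority that a single alternation does not give.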
If you use named groups, e.g. (?P<name>...), debugging becomes easier. But note the docs for span:
https://docs.python.org/2/library/re.html#re.MatchObject.span
span([group]) For MatchObject m, return the 2-tuple (m.start(group),
m.end(group)). Note that if group did not contribute to the match,
this is (-1, -1). group defaults to zero, the entire match.
You're not passing in the group number.
Why are you using span anyway? Just use m.search(str1).groups() or similar
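As a sketch of how named groups help with debugging here, the match object can report which branch of the alternation matched. The group names THING and NUM are illustrative choices:

```python
import re

MIXED = (r'(?<=Thing: )(?P<THING>.+)(?= Analysis)'
         r'|(?<=\d )(?P<NUM>.+)(?= Analysis)')
str1 = 'NUM-140 A Thing: Foobar Analysis NUM-140'

m = re.search(MIXED, str1)

branch = m.lastgroup          # name of the group that took part in the match
text = m.group(branch)        # the text captured by that group
thing_text = m.group('THING') # None, because this branch did not match
thing_span = m.span('THING')  # (-1, -1) for a group that did not contribute

print(branch, text, thing_text, thing_span)
```

Here the NUM branch is the one that matched, which explains the longer result.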

Python Regex expression

Trying to write a Regex expression in Python to match strings.
I want to match input that starts with first, such as first or first?21313, but not first. (with a trailing period).
So basically, I don't want to match to anything that has . the period character.
I've tried word.startswith(('first[^.]?+')) but that doesn't work. I've also tried word.startswith(('first.?+')) but that hasn't worked either. Pretty stumped here
import re
def check(word):
    # "first" followed by one or more characters, none of which is a dot
    regexp = re.compile(r'^first([^.])+$')
    return regexp.match(word)
The pattern on its own:
^first([^.])+$
(first followed by one or more characters other than a dot, so first cannot stand alone.)
You really don't need regex for this at all.
word.startswith('first') and word.find('.') == -1
But if you really want to take the regex route:
>>> import re
>>> re.match(r'first[^.]*$', 'first')
<_sre.SRE_Match object; span=(0, 5), match='first'>
>>> re.match(r'first[^.]*$', 'first.') is None
True
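The same check can be wrapped in a small function and tried on a few inputs. A minimal sketch, assuming the goal is "starts with first and contains no dot anywhere"; the function name and sample words are illustrative:

```python
import re

# "first" followed by zero or more characters that are not a dot.
PATTERN = re.compile(r'first[^.]*')

def starts_with_first_no_dot(word):
    # fullmatch requires the whole word to fit the pattern
    return PATTERN.fullmatch(word) is not None

samples = ['first', 'first?21313', 'first.', 'first.abc', 'second']
results = {w: starts_with_first_no_dot(w) for w in samples}

print(results)
```

Using fullmatch avoids the need for explicit ^ and $ anchors.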
