I have a string:
s = '`re.``search`(*pattern*, *string*, *flags=0*)'
Using re.sub, I can easily produce the result I want:
In [100]: re.sub(r'[`*]','',s)
Out[100]: 're.search(pattern, string, flags=0)'
I'd like to refactor it by writing a single regex pattern that matches the result directly, instead of substituting.
In [101]: re.search(r'[^`*]+',s)
Out[101]: <_sre.SRE_Match object; span=(1, 4), match='re.'>
It stops at the first match, 're.', while I intend to retrieve the complete string.
How can I accomplish this?
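One possible approach, offered as a sketch rather than a definitive answer: re.search returns only the first run of allowed characters, so a single pattern cannot skip over the backticks and asterisks in between. Collecting every run with re.findall and joining them gives the same result as the re.sub call. The names pieces and result are only illustrative.

```python
import re

s = '`re.``search`(*pattern*, *string*, *flags=0*)'

# findall returns every non-overlapping run of characters
# that are neither a backtick nor an asterisk.
pieces = re.findall(r'[^`*]+', s)

# Joining the runs rebuilds the cleaned string.
result = ''.join(pieces)
print(result)
```

In practice the re.sub version is probably the simpler of the two; this variant is mainly useful when the individual pieces are needed as well.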
Related
I have the following string, GA1.2.4451363243.9414195136, and I want to match 4451363243.9414195136 using a regular expression in Python.
I have tried the following, which does not work: ([\d].[\d])$
Where am I going wrong here?
A few ideas (string operations or regex):
s = 'GA1.2.4451363243.9414195136'
out = '.'.join(s.rsplit('.', 2)[-2:])
# '4451363243.9414195136'
import re
out = re.search(r'[^.]*\.[^.]*$', s)
# <re.Match object; span=(6, 27), match='4451363243.9414195136'>
NB. to ensure matching digits, you can replace [^.] (any character but .) with \d.
For an arbitrary N:
N = 3
out = '.'.join(s.rsplit('.', N)[-N:])
# '2.4451363243.9414195136'
out = re.search(fr'[^.]*(?:\.[^.]*){{{N-1}}}$', s)
# <re.Match object; span=(4, 27), match='2.4451363243.9414195136'>
This can be done in pure Python, but if you want to use a regex, here is the code:
regex:
(?:\w*\.){2}(.*)
python:
import re
s = 'GA1.2.4451363243.9414195136'
# \w already includes digits, and the dot is escaped so it matches only a literal '.'
re.match(r'(?:\w*\.){2}(.*)', s).groups()[0] # output: '4451363243.9414195136'
OR
Just use Python:
s.split('.', 2)[-1] # output: '4451363243.9414195136'
The regex ([0-9]+\.[0-9]+)$ matches the expected part of the example (the dot is escaped so that it matches only a literal dot). Note that more specific solutions are possible if you provide more details or restrictions regarding the part to be matched:
>>> import re
>>> data = "GA1.2.4451363243.941419513"
>>> re.findall(r"([0-9]+\.[0-9]+)$", data)
['4451363243.941419513']
It requires the matched part to consist of:
digit(s)
dot
digit(s)
end of line.
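A point worth checking with this kind of pattern: an unescaped . matches any character, so the dot has to be written \. to require a literal dot. The sketch below compares the two spellings; the names loose and strict and the input "12345" are made up for illustration.

```python
import re

loose = r"([0-9]+.[0-9]+)$"    # "." matches any character, including a digit
strict = r"([0-9]+\.[0-9]+)$"  # "\." matches only a literal dot

# A string with no dot at all still satisfies the loose pattern.
print(re.findall(loose, "12345"))   # ['12345']
print(re.findall(strict, "12345"))  # []

# The strict pattern still finds the expected part of the example.
print(re.findall(strict, "GA1.2.4451363243.941419513"))
# ['4451363243.941419513']
```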
Hi, I want to apply the following expression to check substitution, insertion and deletion counts. However, the OR (|) alternation does not seem to work: the regex checks only the first alternative in the parentheses.
For example:
correct_string = "20181201"
regex_pattern = r"((20[0-9]{2})(0[1-9]|1[0-2])(0[1-9]|1[0-9]|2[0-9]|3[0-1])){e}"
regex.fullmatch(regex_pattern, correct_string)
Output:
<regex.Match object; span=(0, 8), match='20181201', fuzzy_counts=(1, 0, 0)>
It says there is one substitution because of the 5th digit, even though that digit is allowed by the second alternative of the OR.
Another example:
correct_string = "20180201"
regex_pattern = r"((20[0-9]{2})(0[1-9]|1[0-2])(0[1-9]|1[0-9]|2[0-9]|3[0-1])){e}"
regex.fullmatch(regex_pattern, correct_string)
Output:
<regex.Match object; span=(0, 8), match='20180201'>
In this case it says there are no substitutions, which is correct according to the first alternative of the OR.
How can I solve this? Thank you.
You need to use regex.ENHANCEMATCH:
By default, fuzzy matching searches for the first match that meets the given constraints. The ENHANCEMATCH flag will cause it to attempt to improve the fit (i.e. reduce the number of errors) of the match that it has found.
Python demo:
import regex
correct_string = "20181201"
regex_pattern = r"((20[0-9]{2})(0[1-9]|1[0-2])(0[1-9]|1[0-9]|2[0-9]|3[0-1])){e}"
print(regex.fullmatch(regex_pattern, correct_string, regex.ENHANCEMATCH))
# => <regex.Match object; span=(0, 8), match='20181201'>
See the online Python demo.
I'm trying to read a binary file.
My objective is to find all the matches of "10, 10, [any hex value exactly one time], [either EE or DD]".
I thought I could do it like this:
pattern = (b"\x10\x10\[0-9a-fA-F]?\[xDD|xEE]")
This clearly does not work. The error seems to come from the third part: I tried dissecting the pattern, and the x10 and x11 parts work, but the rest does not.
My understanding is that "[0-9a-fA-F]?" matches one character from the bracketed range 0 or 1 times, and that the last part matches "xDD or xEE". Am I wrong?
Any ideas?
Use the regex
b'\x10\x10.[\xdd\xee]'
A single . matches any one byte exactly once (except the newline byte \x0a, unless re.DOTALL is set), and a single [ab] matches a or b exactly once.
>>> re.match(b'\x10\x10.[\xdd\xee]', b'\x10\x10\x00\xee')
<_sre.SRE_Match object; span=(0, 4), match=b'\x10\x10\x00\xee'>
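Since the goal is to find all matches in a binary file, a sketch along the following lines may help. It assumes the file contents are already loaded into a bytes object; find_records and sample are hypothetical names, and the sample bytes are made up. re.DOTALL is added because, without it, . does not match the newline byte 0x0A, which can easily occur in binary data.

```python
import re

# re.DOTALL lets "." match every byte, including b"\n" (0x0A).
PATTERN = re.compile(rb"\x10\x10.[\xdd\xee]", re.DOTALL)

def find_records(data: bytes):
    """Return (offset, matched_bytes) for every non-overlapping match."""
    return [(m.start(), m.group(0)) for m in PATTERN.finditer(data)]

# Hypothetical sample data, not taken from any real file.
sample = b"\x00\x10\x10\x0a\xee\xff\x10\x10\x41\xdd\x10\x10\x41\x00"
print(find_records(sample))
# [(1, b'\x10\x10\n\xee'), (6, b'\x10\x10A\xdd')]
```

Without re.DOTALL, the first record in this sample would be missed because its third byte is 0x0A.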
Say I have two types of strings:
str1 = 'NUM-140 A Thing: Foobar Analysis NUM-140'
str2 = 'NUM-140 Foobar Analysis NUM-140'
For both of these, I want to match 'Foobar' (which could be anything). I have tried the following:
m = re.compile(r'((?<=Thing: ).+(?= Analysis))|((?<=\d ).+(?= Analysis))')
ind1 = m.search(str1).span()
match1 = str1[ind1[0]:ind1[1]]
ind2 = m.search(str2).span()
match2 = str2[ind2[0]:ind2[1]]
However, match1 comes out to 'A Thing: Foobar', which seems to be the match for the second pattern, not the first. Applied individually, (pattern 1 to str1 and pattern 2 to str2, without the |), both patterns match 'Foobar'. I expected this, then, to stop when matched by the first pattern. This doesn't seem to be the case. What am I missing?
According to the documentation,
As the target string is scanned, REs separated by '|' are tried from left to right. When one pattern completely matches, that branch is accepted. This means that once A matches, B will not be tested further, even if it would produce a longer overall match. In other words, the '|' operator is never greedy.
But the behavior seems to be different:
import re
THING = r'(?<=Thing: )(?P<THING>.+)(?= Analysis)'
NUM = r'(?<=\d )(?P<NUM>.+)(?= Analysis)'
MIXED = THING + '|' + NUM
str1 = 'NUM-140 A Thing: Foobar Analysis NUM-140'
str2 = 'NUM-140 Foobar Analysis NUM-140'
# re.search is used here: re.match anchors at position 0, where neither lookbehind can succeed
print(re.search(THING, str1))
# <... match='Foobar'>
print(re.search(NUM, str1))
# <... match='A Thing: Foobar'>
print(re.search(MIXED, str1))
# <... match='A Thing: Foobar'>
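We might expect that because THING matches 'Foobar', the MIXED pattern would return that 'Foobar' and stop searching. However, the left-to-right rule in the documentation applies at each starting position: the engine scans the string from the left, and at position 8 only NUM can match, so it wins before the scan ever reaches position 17, where THING would match.
To give THING priority regardless of position, the solution can rely on Python's or short-circuiting: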
print(re.search(THING, str1) or re.search(NUM, str1))
# <_sre.SRE_Match object; span=(17, 23), match='Foobar'>
print(re.search(THING, str2) or re.search(NUM, str2))
# <_sre.SRE_Match object; span=(8, 14), match='Foobar'>
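The same idea generalises to any number of patterns. The following is a minimal sketch, with first_match as a hypothetical helper name: each pattern is searched over the whole string in priority order, and the first one that matches anywhere wins.

```python
import re

def first_match(patterns, text):
    """Return the match of the first pattern (in priority order) found in text, or None."""
    for pattern in patterns:
        m = re.search(pattern, text)
        if m:
            return m
    return None

THING = r'(?<=Thing: ).+(?= Analysis)'
NUM = r'(?<=\d ).+(?= Analysis)'

for s in ('NUM-140 A Thing: Foobar Analysis NUM-140',
          'NUM-140 Foobar Analysis NUM-140'):
    # Prints 'Foobar' for both strings.
    print(first_match([THING, NUM], s).group(0))
```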
If you use named groups, e.g. (?P<name>...), debugging becomes easier. But note the docs for span:
https://docs.python.org/2/library/re.html#re.MatchObject.span
span([group]) For MatchObject m, return the 2-tuple (m.start(group),
m.end(group)). Note that if group did not contribute to the match,
this is (-1, -1). group defaults to zero, the entire match.
You're not passing in the group number.
Why are you using span anyway? Just use m.search(str1).groups() or similar
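As a sketch of how named groups help with debugging here, the snippet below reuses the THING and NUM names from the other answer (any names would do) and inspects which branch actually produced the match.

```python
import re

THING = r'(?<=Thing: )(?P<THING>.+)(?= Analysis)'
NUM = r'(?<=\d )(?P<NUM>.+)(?= Analysis)'
MIXED = re.compile(THING + '|' + NUM)

m = MIXED.search('NUM-140 A Thing: Foobar Analysis NUM-140')

print(m.lastgroup)      # 'NUM': the named group that took part in the match
print(m.groupdict())    # {'THING': None, 'NUM': 'A Thing: Foobar'}
print(m.span('THING'))  # (-1, -1): THING did not contribute to the match
print(m.span('NUM'))    # (8, 23)
```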
There is a cool native .split() in python, returning the list of substrings. Is there a native or, at least, not very nasty way to split by multiple delimiters and automatically get substring coordinates? Something like this:
"abc? !cde".some_smart_split("!?") -> [("abc", 0, 2), (" ", 4, 4), ("cde", 6, 8)]
Of course, I can write some naive code myself. But my use case is much more complicated, it would be great to find something concise.
Using re.finditer:
>>> import re
>>> [(match.group(0), match.start(), match.end())
for match in re.finditer(r'[^!?]+', 'abc? !cde')]
[('abc', 0, 3), (' ', 4, 5), ('cde', 6, 9)]
The pattern [^!?]+ matches runs of characters that are neither ! nor ?.
match.group(0) returns the matched string.
match.start() and match.end() return the indices of the matched part (the end index is exclusive).
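If inclusive end coordinates are wanted, as in the question's example, the approach can be wrapped in a small function. This is a sketch under stated assumptions: split_with_coords is a hypothetical name, delimiters is assumed to be a non-empty string of single-character delimiters, and re.escape is used so that characters such as ] or ^ are safe inside the character class.

```python
import re

def split_with_coords(text, delimiters):
    """Split text on any character in delimiters.

    Returns a list of (substring, start, end) tuples, where end is the
    inclusive index of the last character of the substring.
    Assumes delimiters is a non-empty string.
    """
    pattern = '[^' + re.escape(delimiters) + ']+'
    return [(m.group(0), m.start(), m.end() - 1)
            for m in re.finditer(pattern, text)]

print(split_with_coords("abc? !cde", "!?"))
# [('abc', 0, 2), (' ', 4, 4), ('cde', 6, 8)]
```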