Python regex pattern building


I'm trying to incrementally build the following regex pattern in Python from reusable pattern components. I'd expect the pattern p to match the text in lines completely, but it ends up matching only the first line.
import re
nbr = re.compile(r'\d+')
string = re.compile(r'(\w+[ \t]+)*(\w+)')
p1 = re.compile(rf"{string.pattern}\s+{nbr.pattern}\s+{string.pattern}")
p2 = re.compile(rf"{nbr.pattern}\s+{string.pattern}")
p1orp2 = re.compile(rf"{p1.pattern}|{p2.pattern}")
p = re.compile(rf"({p1orp2.pattern}\n)+")
lines = (f"aaaa 1284 aaaa\n"
f"aaaa 365870 bbbb\n"
f"757166 cccc\n"
f"111054 cccc\n"
f"999657 dddd\n"
f"999 eeee\n"
f"2955 ffff\n")
match = p.search(lines)
print(match)
print(match.group(0))
Here's what gets printed:
<re.Match object; span=(0, 14), match='aaaa 1284 aaaa'>
aaaa 1284 aaaa

The problem is here:
p1orp2 = re.compile(rf"{p1.pattern}|{p2.pattern}")
p = re.compile(rf"({p1orp2.pattern}\n)+")
In p, the \n is appended to p1orp2, but | has the lowest precedence in a regex, so the added \n belongs only to the second alternative, not to the first. It is the same as if you had attached that \n in the definition of p1orp2:
p1orp2 = re.compile(rf"{p1.pattern}|{p2.pattern}\n")
p = re.compile(rf"({p1orp2.pattern})+")
...while you really want to allow the p1 pattern to be followed by \n as well:
p1orp2 = re.compile(rf"{p1.pattern}\n|{p2.pattern}\n")
p = re.compile(rf"({p1orp2.pattern})+")
To achieve that with the \n where it was, you could use parentheses in the definition of p1orp2 so it limits the scope of the | operator:
p1orp2 = re.compile(rf"({p1.pattern}|{p2.pattern})")
p = re.compile(rf"({p1orp2.pattern}\n)+")
With this change it will work as you intended.
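To see the effect, here is a minimal sketch that builds both variants from plain pattern strings. It assumes a shortened sample text and uses a non-capturing group (?:...) rather than the capturing parentheses above; the names alternation, buggy and fixed are only for illustration.

```python
import re

# Reusable components kept as plain strings (non-capturing, so no stray groups)
nbr = r"\d+"
string = r"(?:\w+[ \t]+)*\w+"
p1 = rf"{string}\s+{nbr}\s+{string}"
p2 = rf"{nbr}\s+{string}"
alternation = rf"{p1}|{p2}"

# Here the \n is glued onto the last alternative only: p1 | p2\n
buggy = re.compile(rf"(?:{alternation}\n)+")

# Wrapping the alternation first makes the \n apply to both: (p1 | p2)\n
fixed = re.compile(rf"(?:(?:{alternation})\n)+")

# Assumed sample text, shortened from the question
lines = "aaaa 1284 aaaa\naaaa 3650 bbbb\n757 cccc\n2955 ffff\n"

buggy_match = buggy.search(lines).group(0)
fixed_match = fixed.search(lines).group(0)
print(repr(buggy_match))
print(fixed_match == lines)
```

Keeping the components as strings and grouping each one at the point where it is combined avoids this kind of precedence surprise.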

A separate point about string: its repeated capturing group (\w+[ \t]+)* keeps only the last repetition it matched, and every reuse of the component adds more numbered groups to the final pattern. That does not change which text is matched, but it makes the components harder to reuse, and a non-capturing group avoids it. Note that the alternation still has to be wrapped in a group so that the trailing \n applies to both branches.
Here's an updated version of your code:
import re
word_sequence = re.compile(r"\w+(?:[ \t]+\w+)*")
nbr = re.compile(r"\d+")
p1 = re.compile(rf"{word_sequence.pattern}\s+{nbr.pattern}\s+{word_sequence.pattern}")
p2 = re.compile(rf"{nbr.pattern}\s+{word_sequence.pattern}")
# Group the alternation so the \n added below applies to p1 as well as p2
p1orp2 = re.compile(rf"(?:{p1.pattern}|{p2.pattern})")
p = re.compile(rf"({p1orp2.pattern}\n)+")
lines = (
f"aaaa 1284 aaaa\n"
f"aaaa 3650 bbbb\n"
f"75071 cccc\n"
f"111872214054 cccc\n"
f"999 dddd\n"
f"999 eeee\n"
f"295255 ffff\n"
)
match = p.search(lines)
print(match)
print(match.group(0))

Related

Python Regex, how to substitute multiple occurrences with a single pattern?

I'm trying to make a fuzzy autocomplete suggestion box that highlights searched characters with HTML tags <b></b>
For example, if the user types 'ldi' and one of the suggestions is "Leonardo DiCaprio", then the desired outcome is "<b>L</b>eonar<b>d</b>o D<b>i</b>Caprio". The first occurrence of each character is highlighted in order of appearance.
What I'm doing right now is:
import re
def prototype_finding_chars_in_string():
    test_string_list = ["Leonardo DiCaprio", "Brad Pitt", "Claire Danes", "Tobey Maguire"]
    comp_string = "ldi"  # chars to highlight
    # results in .*?(l).*?(d).*?(i).*?
    regex = ".*?" + ".*?".join([f"({x})" for x in comp_string]) + ".*?"
    regex_compiled = re.compile(regex, re.IGNORECASE)
    for x in test_string_list:
        # correctly filters the test list to include only entries that feature the search chars in order
        re_search_result = re.search(regex_compiled, x)
        if re_search_result:
            print(f"char combination {comp_string} are in {x} result group: {re_search_result.groups()}")
results in
char combination ldi are in Leonardo DiCaprio result group: ('L', 'D', 'i')
Now I want to replace each occurrence in the result groups with <b>[whatever in the result]</b> and I'm not sure how to do it.
What I'm currently doing is looping over the result and using the built-in str.replace method to replace the occurrences:
def replace_with_bold(result_groups, original_string):
    output_string: str = original_string
    for result in result_groups:
        output_string = output_string.replace(result, f"<b>{result}</b>", 1)
    return output_string
This results in:
Highlighted string: <b>L</b>eonar<b>d</b>o D<b>i</b>Caprio
But I think looping over the results like this when I already have the match groups is wasteful. Furthermore, it's not even correct, because it searches the string from the beginning on every iteration. So for the input 'ooo' this is the result:
char combination ooo are in Leonardo DiCaprio result group: ('o', 'o', 'o')
Highlighted string: Le<b><b><b>o</b></b></b>nardo DiCaprio
When it should be Le<b>o</b>nard<b>o</b> DiCapri<b>o</b>
Is there a way to simplify this? Maybe regex here is overkill?

A way using re.split:
import re
test_string_list = ["Leonardo DiCaprio", "Brad Pitt", "Claire Danes", "Tobey Maguire"]
def filter_and_highlight(strings, letters):
    # Escape each letter so that regex metacharacters in the query are matched literally
    pat = re.compile('(' + ')(.*?)('.join(map(re.escape, letters)) + ')', re.I)
    results = []
    for s in strings:
        parts = pat.split(s, 1)
        if len(parts) == 1:
            continue  # no match: the letters do not occur in order in s
        res = ''
        for i, p in enumerate(parts):
            if i & 1:  # odd indexes hold the searched letters
                p = '<b>' + p + '</b>'
            res += p
        results.append(res)
    return results
filter_and_highlight(test_string_list, 'lir')
A particularity of re.split is that captured groups are included as parts in the result by default. Also, even if the first capture matches at the start of the string, an empty part is returned before it. This means that the searched letters are always at odd indexes in the list of substrings.
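A small sketch of that behaviour, using a hard-coded pattern for the letters 'lir' (the variable names are illustrative):

```python
import re

# Same shape as the pattern built in filter_and_highlight for 'lir'
pat = re.compile("(l)(.*?)(i)(.*?)(r)", re.I)

# maxsplit=1: one split, with every captured group inserted into the result
parts = pat.split("Leonardo DiCaprio", 1)
print(parts)

# The searched letters sit at the odd indexes, the text between them at even ones
letters_found = parts[1::2]
print(letters_found)
```

The leading empty string is what keeps the odd/even alignment stable even when the first letter matches at position 0.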

This should work:
for result in result_groups:
    output_string = re.sub(fr'(.*?(?!<b>))({result})((?!</b>).*)',
                           r'\1<b>\2</b>\3',
                           output_string,
                           flags=re.IGNORECASE)
On each iteration, the first occurrence of result is replaced by <b>result</b>, provided it is not already enclosed by a tag. The ? makes .* lazy, which is what selects the first occurrence; the (?!<b>) and (?!</b>) lookaheads take care of the tag check; and \1, \2 and \3 refer to the first, second and third groups. The IGNORECASE flag makes the match case-insensitive.
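Another option, sketched here under the assumption that the pattern from the question is kept, is to build the highlighted string in a single pass from the positions of the match groups (match.span(i)), so that nothing is searched twice. The function name highlight and the use of re.escape are illustrative choices, not part of the answers above.

```python
import re

def highlight(candidate, letters):
    """Wrap the first in-order occurrence of each letter in <b></b>.

    Returns None when the letters do not occur in order in candidate.
    """
    # One capturing group per letter, lazily separated: (l).*?(d).*?(i)
    pattern = ".*?".join(f"({re.escape(ch)})" for ch in letters)
    m = re.search(pattern, candidate, re.IGNORECASE)
    if m is None:
        return None
    out = []
    pos = 0
    for i in range(1, len(letters) + 1):
        start, end = m.span(i)            # position of the i-th matched letter
        out.append(candidate[pos:start])  # untouched text before it
        out.append(f"<b>{candidate[start:end]}</b>")
        pos = end
    out.append(candidate[pos:])           # remainder after the last letter
    return "".join(out)

print(highlight("Leonardo DiCaprio", "ldi"))
print(highlight("Leonardo DiCaprio", "ooo"))
print(highlight("Brad Pitt", "ldi"))
```

Because each group's span is taken from the same match, repeated letters such as 'ooo' are highlighted at three different positions instead of being nested.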

When using "re.search", how do I search for the second instance rather than the first?

re.search looks for the first instance of something. In the following code, "\t" appears twice. Is there a way to make it skip forward to the second instance?
import re
code = '69.22\t82.62\t134.549\n'
list = []
text = code
m = re.search('\t(.+?)\n', text)
if m:
    found = m.group(1)
    list.append(found)
result:
list = ['82.62\t134.549']
expected:
list = ['134.549']

This modified version of your expression does return the desired output:
import re
code = '69.22\t82.62\t134.549\n'
print(re.findall(r'.*\t(.+?)\n', code))
Output
['134.549']
Though I'm guessing that you might want to design an expression somewhat similar to:
(?<=[\t])(.+?)(?=[\n])

There is only one general solution for getting what comes after the "second" tab.
You can do it like this:
^(?:[^\t]*\t){2}(.*?)\n
Explained
^ # BOS
(?: # Cluster
[^\t]* # Many not tab characters
\t # A tab
){2} # End cluster, do 2 times
( .*? ) # (1), anything up to
\n # first newline
Python code
>>> import re
>>> text = '69.22\t82.62\t134.549\n'
>>> m = re.search('^(?:[^\t]*\t){2}(.*?)\n', text)
>>> if m:
...     print(m.group(1))
...
134.549
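The same pattern can be parameterized on the number of tabs to skip. This is only a sketch: the helper name field_after_tabs and its return-None-on-failure convention are assumptions made for illustration.

```python
import re

def field_after_tabs(text, n):
    """Return what follows the n-th tab, up to the first newline after it.

    Hypothetical helper: returns None if text has fewer than n tabs
    or no newline after them.
    """
    # {{{n}}} renders as the quantifier {n}, e.g. {2}
    pattern = rf"^(?:[^\t]*\t){{{n}}}(.*?)\n"
    m = re.search(pattern, text)
    return m.group(1) if m else None

text = '69.22\t82.62\t134.549\n'
print(field_after_tabs(text, 2))
print(repr(field_after_tabs(text, 1)))
print(field_after_tabs(text, 3))
```

With n=1 the result still contains the second tab, which is exactly the behaviour the question ran into.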

How to fix 'replace' keyword when it is not working in python

I am writing code that needs to get four individual values, and one of the values has a newline character plus an extra apostrophe and bracket, like so: 11\n']. I only need the 11 and have been able to strip the '], but I am unable to remove the newline character.
I have tried various combinations of strip and replace, and neither removes it.
with open('gil200110raw.txt', 'r') as qcfile:
    txt = qcfile.readlines()
line1 = txt[1:2]
line2 = txt[2:3]
line1 = str(line1)
line2 = str(line2)
sptline1 = line1.split(' ')
sptline2 = line2.split(' ')
totalobs = sptline1[39]
qccalc1 = sptline2[2]
qccalc2 = sptline2[9]
qccalc3 = sptline2[16]
qccalc4 = sptline2[22]
qccalc4 = qccalc4.strip("\n']")
qccalc4 = qccalc4.replace("\n", "")
I did not get an error, but the output of print(qccalc4) is 11\n. I expect the output to be 11.

Use rstrip instead!
>>> 'test string\n'.rstrip()
'test string'
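A likely reason why strip and replace appear to do nothing, judging from the code in the question: txt[1:2] is a list, and str() of a list produces its repr, in which the newline is written as the two characters backslash and n. The sketch below uses a made-up two-line stand-in for the file contents; indexing the line (txt[1]) instead of slicing it avoids the problem.

```python
# Hypothetical stand-in for txt = qcfile.readlines()
txt = ["header\n", "a b 11\n"]

# What the question does: slice (gives a list), then str() the list
as_text = str(txt[1:2])        # the repr of the list, not the line itself
last = as_text.split(" ")[-1]  # ends with backslash, n, quote, bracket
cleaned_wrong = last.strip("\n']").replace("\n", "")
print(cleaned_wrong)           # still shows 11\n: a backslash and an n, no real newline

# Indexing gives the line as a str; split() with no argument also drops the newline
line = txt[1]
value = line.split()[-1]
print(value)
```

Here value is a clean string, so no strip or replace call is needed afterwards.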

You can use regex to match the outputs you're looking for.
From your description, I assume the values are all integers. Consider the following snippet:
import re
p = re.compile('[0-9]+')
sample = '11\n\'] dwqed 12 444'
results = p.findall(sample)
results would now contain the list ['11', '12', '444'].
re is Python's regex package, and p is the pattern we would like to find in the text. The pattern [0-9]+ simply means: match one or more characters from 0 to 9.
See the documentation of the re module for details.

How to get the first number from span=(2494, 2516) here?

I want to cut a text from the point where my regex expression is found to the end of the text. The position may vary, so I need that number as a variable.
The position can already be seen in the result of studentnrRegex.search(text):
>>> studentnrRegex = re.compile(r'(Studentnr = 18\d\d\d\d\d\d\d\d)')
>>> start = studentnrRegex.search(text)
>>> start
<_sre.SRE_Match object; span=(2494, 2516), match='Studentnr = 1825010243'>
>>> myText = text[2494:]
>>> myText
'Studentnr = 1825010243\nTEXT = blablabla
Can I get the start position as a variable directly from my variable start, in this case 2494?

The match object returned by calling .search() has .start() and .end() methods that return the starting and ending positions of the match.
studentnrRegex = re.compile(r'(Studentnr = 18\d\d\d\d\d\d\d\d)')
m = studentnrRegex.search(text)
start = m.start()
print(text[start:])
You can accomplish the same thing with a different regex that matches the student number and everything after it. This will save you the trouble of doing the slice:
studentnrRegex = re.compile(r'(Studentnr = 18\d{8}).*', re.DOTALL)
m = studentnrRegex.search(text)
print(m.group())
The {8} matches 8 repeats of the \d and the .* matches all remaining characters until the end of the string (including newlines) as long as the re.DOTALL flag is specified. The full match is group 0, which is the default value for the .group() method of the match object. You can access the student number as m.group(1).
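A self-contained sketch of the position methods, using a short made-up text in place of the real one from the question:

```python
import re

# Hypothetical sample; the real text in the question is much longer
text = "header line\nStudentnr = 1825010243\nTEXT = blablabla\n"

studentnr_regex = re.compile(r"Studentnr = 18\d{8}")
m = studentnr_regex.search(text)
if m is not None:               # search() returns None when nothing matches
    start, end = m.span()       # same values as m.start() and m.end()
    tail = text[start:]         # from the match to the end of the text
    print(start, end)
    print(tail)
```

Checking for None before calling .start() avoids an AttributeError when the student number is missing.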

Finding the index of the second match of a regular expression in python

So I am trying to rename files to match the naming convention for Plex Media Server (SxxEyy).
Now I have a ton of files that use e.g. 411 for S04E11. I have written a little function that searches for an occurrence of this pattern and replaces it with the correct convention, like this:
import re
pattern1 = re.compile(r'[Ss]\d+[Ee]\d+')
pattern2 = re.compile(r'[.-]\d{3,4}')
def plexify_name(string):
    # If the file matches the pattern we want, don't change it
    if pattern1.search(string):
        return string
    elif pattern2.search(string):
        piece_to_change = pattern2.search(string)
        endpos = piece_to_change.end()
        startpos = piece_to_change.start()
        # Cut out the digits to change (skip the leading separator)
        cut = string[startpos+1:endpos]
        if len(cut) == 4:
            cut = 'S' + cut[0:2] + 'E' + cut[2:4]
        elif len(cut) == 3:
            cut = 'S0' + cut[0:1] + 'E' + cut[1:3]
        return string[0:startpos+1] + cut + string[endpos:]
    # Nothing to change
    return string
And this works very well. But it turns out that some of the filenames have a year in them, e.g. the.flash.2014.118.mp4, in which case it will change the 2014.
I tried using
pattern2.findall(string)
This does return a list of strings like ['.2014', '.118'], but what I want is a list of match objects so I can check whether there are two and, in that case, use the start/end of the second. I can't seem to find anything to do this in the re documentation. Am I missing something, or do I need to take a totally different approach?

You could try anchoring the match to the file extension:
pattern2 = re.compile(r'[.-]\d{3,4}(?=[.]mp4$)')
Here, (?= ... ) is a look-ahead assertion, meaning that the thing has to be there for the regex to match, but it's not part of the match:
>>> pattern2.findall('test.118.mp4')
['.118']
>>> pattern2.findall('test.2014.118.mp4')
['.118']
>>> pattern2.findall('test.123.mp4.118.mp4')
['.118']
Of course, you want it to work with all possible extensions:
>>> p2 = re.compile(r'[.-]\d{3,4}(?=[.][^.]+$)')
>>> p2.findall('test.2014.118.avi')
['.118']
>>> p2.findall('test.2014.118.mov')
['.118']
If there is more stuff between the episode number and the extension, regexes for matching that start to get tricky, so I would suggest a non-regex approach for dealing with that:
>>> f = 'test.123.castle.2014.118.x264.mp4'
>>> [p for p in f.split('.') if p.isdigit()][-1]
'118'
Or, alternatively, you can get match objects for all matches by using finditer and expanding the iterator by converting it to a list:
>>> p2 = re.compile(r'[.-]\d{3,4}')
>>> f = 'test.2014.712.x264.mp4'
>>> matches = list(p2.finditer(f))
>>> matches[-1].group(0)
'.712'
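Putting the finditer idea back into the renaming function could look like the sketch below. It assumes that the last 3- or 4-digit block in the name is the episode number, so a name that contains only a year, or a trailing token such as .1080p, would be rewritten incorrectly; the names already_ok and numeric are illustrative.

```python
import re

already_ok = re.compile(r"[Ss]\d+[Ee]\d+")
numeric = re.compile(r"[.-]\d{3,4}")

def plexify_name(name):
    # Leave names that already follow the SxxEyy convention alone
    if already_ok.search(name):
        return name
    matches = list(numeric.finditer(name))
    if not matches:
        return name
    m = matches[-1]                  # assumption: the last numeric block is the episode
    digits = m.group(0)[1:]          # drop the leading '.' or '-'
    season, episode = digits[:-2], digits[-2:]
    tag = f"S{int(season):02d}E{episode}"
    # Keep the separator, swap the digits for the tag, keep the rest
    return name[:m.start() + 1] + tag + name[m.end():]

print(plexify_name("the.flash.2014.118.mp4"))
print(plexify_name("show.1011.avi"))
```

Working with match objects rather than the strings from findall is what makes m.start() and m.end() available for the final slice.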

