Python regex re.finditer weird behavior with match.end()

I am trying to find the positions of patterns in a file using Python regex. When I run the code below, several start positions are printed but only one end position, the one corresponding to the last start position. The bottom print statement is also only printed once. Why isn't there a match.end() value for every match.start() value? The file is a .obj file.
import re
import binascii
def findByte(b, file):
    f = open(file, "rb").read()
    f = binascii.hexlify(f)
    regex = b + "(?=(?:[\\da-fA-F]{2})*$)"
    for match in re.finditer(regex, f):
        s = match.start()
        print("S" + str(s))
        e = match.end()
        print("E" + str(e))
        g = match.group()
        print(g)
        print('String match "%s" at %d:%d' % (g, s, e))

findByte("ca", "demo.obj")
When I run it, the following output is printed.
S0
S64
S184
S252
E254
ca
String match "ca" at 252:254
If I instead write
def findByte(b, file):
    f = open(file, "rb").read()
    f = binascii.hexlify(f)
    regex = b + "(?=(?:[\\da-fA-F]{2})*$)"
    m = re.findall(regex, f)
    print(m)

findByte("ca", "demo.obj")
The printed value is
['ca', 'ca', 'ca', 'ca']

Checking the actual code you pasted, it's clear you've committed the cardinal Python sin of mixing tabs and spaces, and it's biting you. (Try selecting the leading whitespace in your own code on this page; you'll see some of it is selected in blocks of four spaces at a time, while other parts have single-space granularity.) Your editor is showing you tabs as equivalent to four spaces, but in the code you pasted, you have purely tab-based indentation up through print("S" + str(s)), then four spaces followed by a tab as the indentation for subsequent lines.
Most such mixed indentation is rejected by Python 3, but Python 2 is more flexible (it gives you the rope to hang yourself), which may be what is happening here (Python 2 is end of life as of the beginning of this year, so if you're writing new code, I'd suggest switching for this and many other reasons). Your code looks like it's all in the for loop, but most of it isn't.
Replace all your tabs with four-space indents, and reconfigure your editor to always expand tabs to spaces so you aren't bitten by this again; Python style is consistent four-space indents with no tabs for a reason.
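For reference, here is the function with consistent four-space indentation so the whole body actually sits inside the loop. This is a sketch adapted for Python 3 rather than the OP's Python 2: hexlify() returns bytes there, so the pattern and the search argument are both written as bytes:
import re
import binascii

def findByte(b, file):
    # hexlify() returns bytes, so everything downstream works on bytes
    f = binascii.hexlify(open(file, "rb").read())
    # on Python 3 the pattern must also be bytes to match a bytes string
    regex = b + b"(?=(?:[\\da-fA-F]{2})*$)"
    for match in re.finditer(regex, f):
        s = match.start()
        e = match.end()
        g = match.group().decode()  # decode for clean printing
        # every match now prints a start, an end, and the matched text
        print('String match "%s" at %d:%d' % (g, s, e))

findByte(b"ca", "demo.obj")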

Related

Python 2.7: Matching subtitle events in VTT subtitles using a regular expression

I'm writing a python script to parse VTT subtitle files.
I am using a regular expression to match and extract specific elements:
'in timecode'
'out timecode'
'other info' (mostly alignment information, like align:middle or line:-1)
subtitle content (the actual text)
I am using Python's 're' module from the standard library, and I am looking for a regular expression that will match all (5) of the below 'subtitle events':
WEBVTT

00:00:00.440 --> 00:00:02.320 align:middle line:-1
Hi.

00:00:03.440 --> 00:00:07.520 align:middle line:-1
This subtitle has one line.

00:00:09.240 --> 00:00:11.080 align:middle line:-2
This subtitle has
two lines.

00:00:15.240 --> 00:00:23.960 align:middle line:-4
Now...
Let's try
four...
lines...

00:00:24.080 --> 00:00:27.080 align:middle
PS: Note that Stack Overflow doesn't allow me to add an empty line at the end of the code block. Normally the last 'empty' line would exist because of a line break (\r\n or \n) after: 00:00:24.080 --> 00:00:27.080 align:middle
Below is my code. My problem is that I can't figure out a regular expression that will match all of the 'subtitle events' (including the one with an empty line as 'subtitle content').
import re
import io

webvttFileObject = io.open("C:\Users\john.doe\Documents\subtitle_sample.vtt", 'r', encoding='utf-8')  # opens WebVTT file forcing UTF-8 encoding
textBuffer = webvttFileObject.read()
regex = re.compile(r"""(^[0-9]{2}[:][0-9]{2}[:][0-9]{2}[.,][0-9]{3})  # match TC-IN in group1
                       [ ]-->[ ]                                      # VTT/SRT style TC-IN --> TC-OUT separator
                       ([0-9]{2}[:][0-9]{2}[:][0-9]{2}[.,][0-9]{3})   # match TC-OUT in group2
                       (.*)?\n                                        # additional VTT info (like alignment)
                       (^.+\n)+\n?                                    # subtitle_content
                    """, re.MULTILINE | re.VERBOSE)
subtitle_match_count = 0
for match in regex.finditer(textBuffer):
    subtitle_match_count += 1
    group1, group2, group3, group4 = match.groups()
    tc_in = group1.strip()
    tc_out = group2.strip()
    vtt_extra_info = group3
    subtitle_content = group4
    print "*** subtitle match count: %d ***" % subtitle_match_count
    print "TIMECODE IN".ljust(20), tc_in
    print "TIMECODE OUT".ljust(20), tc_out
    print "ALIGN".ljust(20), vtt_extra_info.strip()
    print "SUBTITLE CONTENT".ljust(20), subtitle_content
    print
I've tried several variations of the regex in the code. All without success. What is also very strange to me is that if I put regex groups in a variable and print them, like I'm doing with this code, I only get the last line as SUBTITLE CONTENT. But I must be doing something wrong (right?). Any help is greatly appreciated.
Thanks in advance.
The reason why your regex doesn't match the last subtitle is here:
(^.+\n)+\n?
The ^.+\n is looking for a line with 1 or more characters. But the last line in the file is empty, so it doesn't match.
The reason why subtitle_content only contains the last line is also there. You're matching each line one by one with (^.+\n)+, i.e. the capture group only ever captures a single line. With each matched line, the capture group's previous value is discarded, so in the end all you're left with is the last line. If you want to capture all lines, you have to match them all in one go inside of the capture group, for example like this:
((?:^.+\n)+)
In order to make the regex work correctly, I've slightly changed the last two lines:
(^[0-9]{2}[:][0-9]{2}[:][0-9]{2}[.,][0-9]{3})
[ ]-->[ ]
([0-9]{2}[:][0-9]{2}[:][0-9]{2}[.,][0-9]{3})
([^\n]*)?\n # replaced `.*` with `[^\n]*` here because of the S-modifier
(.*?)(?:\n\n|\Z) # this now captures everything up to 2 consecutive
# newlines or the end of the string
This regex requires the modifiers m (multiline), s (single-line) and of course x (verbose).
See it in action here.
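Putting it together, here is a sketch of the corrected pattern dropped into the OP's loop, rewritten for Python 3 and run against an inline sample instead of the original file:
import re

text = ("WEBVTT\n"
        "\n"
        "00:00:00.440 --> 00:00:02.320 align:middle line:-1\n"
        "Hi.\n"
        "\n"
        "00:00:09.240 --> 00:00:11.080 align:middle line:-2\n"
        "This subtitle has\n"
        "two lines.\n")

regex = re.compile(r"""(^[0-9]{2}[:][0-9]{2}[:][0-9]{2}[.,][0-9]{3})  # TC-IN in group 1
                       [ ]-->[ ]
                       ([0-9]{2}[:][0-9]{2}[:][0-9]{2}[.,][0-9]{3})   # TC-OUT in group 2
                       ([^\n]*)?\n                                    # extra VTT info in group 3
                       (.*?)(?:\n\n|\Z)                               # subtitle content in group 4
                    """, re.MULTILINE | re.DOTALL | re.VERBOSE)

for count, match in enumerate(regex.finditer(text), 1):
    tc_in, tc_out, vtt_extra_info, subtitle_content = match.groups()
    print("*** subtitle match count: %d ***" % count)
    print("TIMECODE IN".ljust(20), tc_in)
    print("TIMECODE OUT".ljust(20), tc_out)
    print("ALIGN".ljust(20), vtt_extra_info.strip())
    print("SUBTITLE CONTENT".ljust(20), subtitle_content)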

Regex search up to first instance Python

I know that there are a bunch of other similar questions to this, but I have built off other answers with no success.
I've dug here, here, here, here, and here
but this question is closest to what I'm trying to do; however, it's in PHP and I'm using Python 3.
My goal is to extract a substring from a body text.
The body is formatted:
**Header1**
thing1
thing2
thing3
thing4
**Header2**
dsfgs
sdgsg
rrrrrr
**Hello Dolly**
abider
abcder
ffffff
etc.
Formatting on SO is tough. But in the actual text there are no spaces, just a newline after each line.
I want what's under Header2, so currently I have:
found = re.search("\*\*Header2\*\*\n[^*]+", body)
if found:
    list = found.group(0)
    list = list[11:]
    list = list.split('\n')
    print(list)
But that's returning "None". Various other regex I've tried also haven't worked, or grabbed too much (all of the remaining headers).
For what it's worth I've also tried:
\*\*Header2\*\*.+?^\**$
\*\*Header2\*\*[^*\s\S]+\*\* and about 10 other permutations of those.
Brief
Your pattern \*\*Header2\*\*\n[^*]+ isn't matching because your line **Header2** includes trailing spaces before the newline character. Adding " *" (an optional run of spaces before \n) should suffice, but I've added other options below as well.
Code
See regex in use here
\*{2}Header2\*{2} *\n([^*]+)
Alternatively, you can use the following regex, which also allows you to capture lines containing * so long as they don't match the format of your header ^\*{2}[^*]*\*{2}. It also neatly removes whitespace from the last element under the header, and uses the im flags:
See regex in use here
^\*{2}Header2\*{2} *\n((?:(?!^\*{2}[^*]*\*{2}).)*?)(?=\s*^\*{2}[^*]*\*{2}|\s*\Z)
Usage
See code in use here
import re

regex = r"\*{2}Header2\*{2}\s*([^*]+)\s*"
test_str = ("**Header1** \n"
            "thing1 \n"
            "thing2 \n"
            "thing3 \n"
            "thing4 \n\n"
            "**Header2** \n"
            "dsfgs \n"
            "sdgsg \n"
            "rrrrrr \n\n"
            "**Hello Dolly** \n"
            "abider \n"
            "abcder \n"
            "ffffff")
print(re.search(regex, test_str).group(1))
Explanation
The pattern is practically identical to the OP's original pattern. I made minor changes to help it perform better and to get the result the OP is expecting.
\*\* changed to \*{2}: Very minor adjustment for performance
\n changed to *\n: Takes additional spaces at the end of a line into account before the newline character
([^*]+): Captures the contents the OP is expecting into capture group 1
You could use
^\*\*Header2\*\*.*[\n\r]
(?P<content>(?:.+[\n\r])+)
with the multiline and verbose modifier, see a demo on regex101.com.
Afterwards, just grab what is inside content (i.e. using re.finditer()).
Broken down this says:
^\*\*Header2\*\*.*[\n\r]    # match **Header2** at the start of a line, the rest
                            # of the line, and the newline character(s)
(?P<content>(?:.+[\n\r])+)  # afterwards match as many non-empty lines as possible
In Python:
import re
rx = re.compile(r'''
    ^\*\*Header2\*\*.*[\n\r]
    (?P<content>(?:.+[\n\r])+)
    ''', re.MULTILINE | re.VERBOSE)

for match in rx.finditer(your_string_here):
    print(match.group('content'))
I have the feeling that you even want to allow empty lines between paragraphs. If so, change the expression to
^\*\*Header2\*\*.*[\n\r]
(?P<content>[\s\S]+?)
(?=^\*\*)
See a demo for the latter on regex101.com as well.
You can try this:
import re
s = """
**Header1**
thing1
thing2
thing3
thing4
**Header2**
dsfgs
sdgsg
rrrrrr
**Hello Dolly**
abider
abcder
ffffff
"""
new_contents = re.findall('(?<=\*\*Header2\*\*)[\n\sa-zA-Z0-9]+', s)
Output:
[' \ndsfgs \nsdgsg \nrrrrrr \n\n']
If you want to remove special characters from the output, you can try this:
final_data = filter(None, re.split('\s+', re.sub('\n+', '', new_contents[0])))
Output:
['dsfgs', 'sdgsg', 'rrrrrr']
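One caveat: on Python 3, which the OP says they're using, filter() returns a lazy iterator rather than a list, so wrap the call in list() to get the output shown above:
final_data = list(filter(None, re.split('\s+', re.sub('\n+', '', new_contents[0]))))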

Python regex: re.search() is extremely slow on large text files

My code does the following:
Take a large text file (i.e. a legal document that is 300 pages as a PDF).
Find a certain keyword (e.g. "small").
Return n words to the left and n words to the right of the keyword.
NOTE: In this context, a "word" is any string of non-space characters. "$cow123" would be a word, but "health care" would be two words.
Here is my problem:
The code takes an extremely long time to run on the 300 pages, and that time tends to increase very quickly as n increases.
Here is my code:
import re

fileHandle = open('test_pdf.txt', mode='r')
document = fileHandle.read()

def search(searchText, doc, n):
    # Searches for text, and retrieves n words either side of the text, which are returned separately
    surround = r"\s*(\S*)\s*"
    groups = re.search(r'{}{}{}'.format(surround * n, searchText, surround * n), doc).groups()
    return groups[:n], groups[n:]
Here is the nasty culprit:
print search("\$27.5 million", document, 10)
Here's how you can test this code:
Copy the function definition from the code block above and run the following:
t = "The world is a small place, we $.205% try to take care of it."
print search("\$.205", t, 3)
I suspect that I have a nasty case of catastrophic backtracking, but I'm too new to regex to point my finger on the problem.
How do I speed up my code?
How about using re.search (or even string.find if you're only searching for fixed strings) to find the string, without any surrounding capturing groups. Then you use the position and length of the match (.start and .end on a re matchobject, or the return value of find plus the length of the search string). Get the substring before the match and do /\s*(\S*)\s*\z/ etc. on it, and get the substring after the match and do /\A\s*(\S*)\s*/ etc. on it.
Also, for help with your backtracking: you can use a pattern like \s+\S+\s+ instead of \s*\S*\s* (two chunks of whitespace have to be separated by a non-zero amount of non-whitespace, or else they wouldn't be two chunks), and you shouldn't butt two consecutive \s*s up against each other like you do. I think r'\S+'.join([r'\s+'] * n) would give the right pattern for capturing n previous words (but my Python is rusty, so check that).
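Here is a minimal sketch of that approach: locate the keyword first with a bare search, then pull words out of the slices on either side. For brevity it uses re.findall(r'\S+', ...) on the slices rather than the per-word anchored patterns described above, and n_words_around is a made-up name:
import re

def n_words_around(searchText, doc, n):
    # find the keyword alone, with no surrounding groups to backtrack over
    m = re.search(searchText, doc)
    if m is None:
        return None
    before, after = doc[:m.start()], doc[m.end():]
    # up to n whitespace-delimited words on each side
    return re.findall(r'\S+', before)[-n:], re.findall(r'\S+', after)[:n]

t = "The world is a small place, we $.205% try to take care of it."
print(n_words_around(r"\$.205", t, 3))
# (['small', 'place,', 'we'], ['%', 'try', 'to'])
# note the '%': the keyword matched inside the "word" $.205%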
I see several problems here. The first, and probably worst, is that everything in your "surround" regex is not just optional but independently optional. Given this string:
"Lorem ipsum tritani impedit civibus ei pri"
...when searchText = "tritani" and n = 1, this is what it has to go through before it finds the first match:
regex: \s* \S* \s* tritani
offset 0: '' 'Lorem' ' ' FAIL
'' 'Lorem' '' FAIL
'' 'Lore' '' FAIL
'' 'Lor' '' FAIL
'' 'Lo' '' FAIL
'' 'L' '' FAIL
'' '' '' FAIL
...then it bumps ahead one position and starts over:
offset 1: '' 'orem' ' ' FAIL
'' 'orem' '' FAIL
'' 'ore' '' FAIL
'' 'or' '' FAIL
'' 'o' '' FAIL
'' '' '' FAIL
... and so on. According to RegexBuddy's debugger, it takes almost 150 steps to reach the offset where it can make the first match:
position 5: ' ' 'ipsum' ' ' 'tritani'
And that's with just one word to skip over, and with n=1. If you set n=2 you end up with this:
\s*(\S*)\s*\s*(\S*)\s*tritani\s*(\S*)\s*\s*(\S*)\s*
I'm sure you can see where this is going. Note especially that when I change it to this:
(?:\s+)(\S+)(?:\s+)(\S+)(?:\s+)tritani(?:\s+)(\S+)(?:\s+)(\S+)(?:\s+)
...it finds the first match in a little over 20 steps. This is one of the most common regex anti-patterns: using * when you should be using +. In other words, if it's not optional, don't treat it as optional.
Finally, you may have noticed the \s*\s* in the auto-generated regex: two adjacent, identical quantifiers give the engine many redundant ways to split the same whitespace, which compounds the backtracking.
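Building on that, here is a sketch of the OP's search() rebuilt with + quantifiers. Note the trade-off: unlike the \s*\S* original, this requires n whitespace-separated words on each side of the keyword, so it won't match at the very start or end of the document:
import re

def search(searchText, doc, n):
    # nothing here is optional: each side is exactly n whitespace-separated words
    before = r"(\S+)\s+" * n
    after = r"\s+(\S+)" * n
    groups = re.search(before + searchText + after, doc).groups()
    return groups[:n], groups[n:]

t = "Lorem ipsum tritani impedit civibus ei pri"
print(search("tritani", t, 1))
# (('ipsum',), ('impedit',))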
You could try using mmap and appropriate regex flags, eg (untested):
import re
import mmap
with open('your file') as fin:
    mf = mmap.mmap(fin.fileno(), 0, access=mmap.ACCESS_READ)
    for match in re.finditer(your_re, mf, flags=re.DOTALL):
        print match.group()  # do something with your match
This'll only keep memory usage lower though...
The alternative is to have a sliding window of words (simple example of just single word before and after)...:
import re
import mmap
from itertools import islice, tee, izip_longest
with open('testingdata.txt') as fin:
    mf = mmap.mmap(fin.fileno(), 0, access=mmap.ACCESS_READ)
    words = (m.group() for m in re.finditer('\w+', mf, flags=re.DOTALL))
    grouped = [islice(el, idx, None) for idx, el in enumerate(tee(words, 3))]
    for group in izip_longest(*grouped, fillvalue=''):
        if group[1] == 'something':  # check criteria for group
            print group
I think you are going about this completely backwards (I'm a little confused as to what you are doing in the first place!)
I would recommend checking out the re_search function I developed in the textools module of my cloud toolbox
with re_search you could solve this problem with something like:
from cloudtb import textools
data_list = textools.re_search('my match', pdf_text_str)  # search for character objects
# you now have a list of strings and RegPart objects. Parse through them:
for i, regpart in enumerate(data_list):
    if isinstance(regpart, basestring):
        words = textools.re_search('\w+', regpart)
        # do stuff with words
    else:
        pass  # I think you are ignoring these? Not totally sure
Here is a link on how to use and how it works:
http://cloudformdesign.com/?p=183
In addition to this, your regular expressions would also be printed out in a more readable format.
You might also want to check out my tool Search The Sky or the similar tool Kiki to help you build and understand your regular expressions.

Editing lines and removing lines from file

I have a file of accession numbers and 16S rRNA sequences, and what I'm trying to do is remove all lines of RNA and keep only the lines with the accession numbers and the species name (and remove all the junk in between). My input file looks like this (there is a > in front of each accession number):
> D50541 1 1409 1409bp rna Abiotrophia defectiva Aerococcaceae
CUGGCGGCGU GCCUAAUACA UGCAAGUCGA ACCGAAGCAU CUUCGGAUGC UUAGUGGCGA ACGGGUGAGU AACACGUAGA
UAACCUACCC UAGACUCGAG GAUAACUCCG GGAAACUGGA GCUAAUACUG GAUAGGAUAU AGAGAUAAUU UCUUUAUAUU
(... and many more lines)
> AY538167 1 1411 1411bp rna Acholeplasma hippikon Acholeplasmataceae
CUGGCGGCGU GCCUAAUACA UGCAAGUCGA ACGCUCUAUA GCAAUAUAGG GAGUGGCGAA CGGGUGAGUA ACACGUAGAU
AACCUACCCU UACUUCGAGG AUAACUUCGG GAAACUGGAG CUAAUACUGG AUAGGACAUA UUGAGGCAUC UUAAUAUGUU
...
I want my output to look like this:
>D50541 Abiotrophia defectiva Aerococcaceae
>AY538167 Acholeplasma hippikon Acholeplasmataceae
The code I wrote does what I want... for most of the lines. It looks like this:
#!/usr/bin/env python
# take LTPs111.compressed fasta and reduce to accession numbers with names.
import re

infilename = 'LTPs111.compressed.fasta'
outfilename = 'acs.fasta'

regex = re.compile(r'(>)\s(\w+).+[rna]\s+([A-Z].+)')

# remove extra letters and spaces
with open(infilename, 'r') as infile, open(outfilename, 'w') as outfile:
    for line in infile:
        x = regex.sub(r'\1\2 \3', line)
        # remove rna sequences
        for line in x:
            if '>' in line:
                outfile.write(x)
Sometimes the code seems to skip over some of the names. For example, for the first accession number above, I only got back:
>D50541 Aerococcaceae
Why might my code be doing this? The input for each accession number looks identical, and the spacing between 'rna' and the first name is the same for each line (5 spaces).
Thank you to anyone who might have some ideas!
I still haven't been able to run your code to get the claimed results, but I think I know what the problem is:
>>> line = '> AY538167 1 1411 1411bp rna Acholeplasma hippikon Acholeplasmataceae'
>>> regex = re.compile(r'(>)\s(\w+).+[rna]\s+([A-Z].+)')
>>> regex.findall(line)
[('>', 'AY538167', 'Acholeplasmataceae')]
The problem is that [rna]\s+ matches any one of the characters r, n, or a at the end of a word. And, because all of the matches are greedy, with no lookahead or anything else to prevent it, this means that it matches the n at the end of hippikon.
The simple solution is to remove the brackets, so it matches the string rna:
>>> regex = re.compile(r'(>)\s(\w+).+rna\s+([A-Z].+)')
That won't work if any of your species or genera can end with that string. Are there any such names? If so, you need to come up with a better way to describe the cutoff between the 1409bp part and the rna part. The simplest may be to just look for rna surrounded by spaces:
>>> regex = re.compile(r'(>)\s(\w+).+\s+rna\s+([A-Z].+)')
Whether this is actually correct or not, I can't say without knowing more about the format, but hopefully you understand what I'm doing well enough to verify that it's correct (or at least to ask smarter questions than I can ask).
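As a quick sanity check of the fixed pattern against the problem line from above:
>>> line = '> AY538167 1 1411 1411bp rna Acholeplasma hippikon Acholeplasmataceae'
>>> regex = re.compile(r'(>)\s(\w+).+\s+rna\s+([A-Z].+)')
>>> regex.findall(line)
[('>', 'AY538167', 'Acholeplasma hippikon Acholeplasmataceae')]
>>> regex.sub(r'\1\2 \3', line)
'>AY538167 Acholeplasma hippikon Acholeplasmataceae'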
It may help debug things to add capture groups. For example, instead of this:
(>)\s(\w+).+[rna]\s+([A-Z].+)
… search for this:
(>)(\s)(\w+)(.+[rna]\s+)([A-Z].+)
Obviously your desired capture groups are now \1\3 \5 instead of \1\2 \3… but the big thing is that you can see what got matched in \4:
[('>', ' ', 'AY538167', ' 1 1411 1411bp Acholeplasma hippikon ', 'Acholeplasmataceae')]
So, now the question is "Why did .+[rna]\s+ match '1 1411 1411bp Acholeplasma hippikon '? Sometimes the context matters, but in this case, it doesn't. You don't want that group to match that string in any context, and yet it will always match it, so that's the part you have to debug.
Also, a visual regexp explorer often helps a lot. The best ones can color parts of the expression and the matched text, etc., to show you how and why the regexp is doing what it does.
Of course you're limited to those that run on your platform or online and work with Python syntax. If you're careful and/or only use simple features (as in your example), Perl/PCRE syntax is very close to Python's, and JavaScript/ActionScript is also pretty close (the one big difference to keep in mind is that replace/sub uses $1 instead of \1).
I don't have a good online one to strongly recommend, but from a quick glance Debuggex looks pretty cool.
Items between brackets form a character class, so by writing "[rna]" you are asking for a single character that is r, n, or a, not the literal string rna.
Further, if the lines you want all contain the pattern "bp rna", I'd use that to yank those lines. Reading the file in line by line, the following quick-and-dirty line-yanker worked for me, for instance:
regex = re.compile(r'^[\w\s]+bp rna .*$')
But, again, if it's as simple as finding lines with "bp rna" in them, you could read the file line by line and forego regex entirely:
for line in file:
    if "bp rna" in line:
        print(line)
EDIT: I blew it by not reading the request carefully enough. Maybe a capture-and-replace regex would help?
for line in file:
    if "bp rna" in line:
        subreg = re.sub(r'^(>[\w]+)\s[\d\s]+bp\srna\s([\w\s]+$)', r"\1 \2", line)
        print(subreg)
OUTPUT:
>AY538166 Acholeplasma granularum Acholeplasmataceae
>AY538167 Acholeplasma hippikon Acholeplasmataceae
This should match any whitespace (tabs or spaces) between the things you want.
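Putting the substitution back into a whole script shaped like the OP's, a minimal sketch (same file names as the original; the pattern is adjusted with \s* after > to allow the space the sample data shows):
import re

infilename = 'LTPs111.compressed.fasta'
outfilename = 'acs.fasta'
# header lines look like "> ACC 1 1409 1409bp rna Genus species Family"
regex = re.compile(r'^>\s*(\w+)\s[\d\s]+bp\srna\s+([\w\s]+)$')

with open(infilename) as infile, open(outfilename, 'w') as outfile:
    for line in infile:
        if 'bp rna' in line:  # keep header lines, drop the RNA sequence lines
            outfile.write(regex.sub(r'>\1 \2', line.rstrip()) + '\n')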

python substitute a substring with one character less

I need to process lines having a syntax similar to markdown http://daringfireball.net/projects/markdown/syntax, where header lines in my case are something like:
=== a sample header ===
===== a deeper header =====
and I need to change their depth, i.e. reduce it (or increase it) so:
== a sample header ==
==== a deeper header ====
My small knowledge of Python regexes is not enough to work out how to replace a run of n '=' signs with (n-1) '=' signs.
You could use backreferences and two negative lookarounds to find two corresponding sets of = characters.
output = re.sub(r'(?<!=)=(=+)(.*?)=\1(?!=)', r'\1\2\1', input)
That will also work if you have a longer string that contains multiple headers (and will change all of them).
What does the regex do?
(?<!=) # make sure there is no preceding =
= # match a literal =
( # start capturing group 1
=+ # match one or more =
) # end capturing group 1
( # start capturing group 2
.*? # match zero or more characters, but as few as possible (due to ?)
) # end capturing group 2
= # match a =
\1 # match exactly what was matched with group 1 (i.e. the same amount of =)
(?!=) # make sure there is no trailing =
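Applied to the sample headers from the question, for instance:
import re

for line in ['=== a sample header ===', '===== a deeper header =====']:
    # one "=" is stripped from each side; the backreference keeps the rest
    print(re.sub(r'(?<!=)=(=+)(.*?)=\1(?!=)', r'\1\2\1', line))
# == a sample header ==
# ==== a deeper header ====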
No need for regexes. I would go very simple and direct:
import sys

for line in sys.stdin:
    trimmed = line.strip()
    if len(trimmed) >= 2 and trimmed[0] == '=' and trimmed[-1] == '=':
        print(trimmed[1:-1])
    else:
        print(line.rstrip())
The initial strip is useful because in Markdown people sometimes leave blank spaces at the end of a line (and maybe the beginning). Adjust accordingly to meet your requirements.
Here is a live demo.
I think it can be as simple as replacing '=(=+)' with '\1'.
Is there any reason not to do so?
How about a simple solution?
lines = ['=== a sample header ===', '===== a deeper header =====']
new_lines = []
for line in lines:
    if line.startswith('==') and line.endswith('=='):
        new_lines.append(line[1:-1])
results:
['== a sample header ==', '==== a deeper header ====']
or in one line:
new_lines = [line[1:-1] for line in lines if line.startswith('==') and line.endswith('==')]
The logic here is that if a line starts and ends with '==' then it must have at least that many, so when we trim one character from each side we are left with at least one '=' on each side.
This works as long as each line starts and ends with its run of '=' characters; if you are using these as headers, they will, as long as you strip the newlines off.
For either the first header or the second, you can just use string replace like this:
s = "=== a sample header ==="
s = s.replace("= ", " ")
s = s.replace(" =", " ")
(Note that replace returns a new string; it doesn't modify s in place.) You can deal with the second header in the same way.
By the way, you could also use the sub function of the re module, but it's not necessary.
