Python regex capturing extra newline - python

So, I've been cooking some regex, and it seems the regex library is capturing an extra new line when I use ((.|\s)*) to capture multi-line text.. [\S\s]* works for some reason:
If you see below, the first regex produces an additional \n group, why??:
>>> s = """
... #pragma whatever
... #pr
... asdfsadf
... #pragma START-SomeThing-USERCODE
... this is the code
... this is more
... #pragma END-SomeThing-USERCODE
... asd
... asdf
... sadf
... sdaf
... """
>>> r = r"(#pragma START-(.*)-USERCODE\s*\n)((.|\s)*)(#pragma END-(.*)-USERCODE)"
>>> re.findall(r, s) [('#pragma START-SomeThing-USERCODE\n', 'SomeThing', 'this is the code\nthis is more\n', '\n', '#pragma END-SomeThing-USERCODE', 'SomeThing')]
>>> r = r"(#pragma START-(.*)-USERCODE\s*\n)([\S\s]*)(#pragma END-(.*)-USERCODE)"
>>> re.findall(r, s) [('#pragma START-SomeThing-USERCODE\n', 'SomeThing', 'this is the code\nthis is more\n', '#pragma END-SomeThing-USERCODE', 'SomeThing')]

The subregex
((.|\s)*)
matches "this is the code\nthis is more\n". The outer parentheses capture this entire string.
The inner parentheses capture one character at a time (either any character besides newlines, or a space (including newline)). Since that group is repeated, the contents of the group are overwritten with each repetition. At the end of the match, the last character that was matched (\n) is kept in that group.
So, if you want to avoid that, either make the inner group non-capturing:
((?:.|\s)*)
or use the ([\s\S]*) idiom for matching truly any character. It might be a good idea to use ([\s\S]*?), though, to make sure that the smallest possible number of characters are matched.

This expression produces nested group
((.|\s)*)
Because you use nested braces. For single-character OR square braces is a proper choice; this syntax is suitable when you want to chose between 2 words
(treat|trick)

Related

Python regex extract string between two braces, including new lines [duplicate]

I'm having a bit of trouble getting a Python regex to work when matching against text that spans multiple lines. The example text is ('\n' is a newline)
some Varying TEXT\n
\n
DSJFKDAFJKDAFJDSAKFJADSFLKDLAFKDSAF\n
[more of the above, ending with a newline]\n
[yep, there is a variable number of lines here]\n
\n
(repeat the above a few hundred times).
I'd like to capture two things: the 'some_Varying_TEXT' part, and all of the lines of uppercase text that comes two lines below it in one capture (i can strip out the newline characters later).
I've tried with a few approaches:
re.compile(r"^>(\w+)$$([.$]+)^$", re.MULTILINE) # try to capture both parts
re.compile(r"(^[^>][\w\s]+)$", re.MULTILINE|re.DOTALL) # just textlines
and a lot of variations hereof with no luck. The last one seems to match the lines of text one by one, which is not what I really want. I can catch the first part, no problem, but I can't seem to catch the 4-5 lines of uppercase text.
I'd like match.group(1) to be some_Varying_Text and group(2) to be line1+line2+line3+etc until the empty line is encountered.
If anyone's curious, its supposed to be a sequence of aminoacids that make up a protein.
Try this:
re.compile(r"^(.+)\n((?:\n.+)+)", re.MULTILINE)
I think your biggest problem is that you're expecting the ^ and $ anchors to match linefeeds, but they don't. In multiline mode, ^ matches the position immediately following a newline and $ matches the position immediately preceding a newline.
Be aware, too, that a newline can consist of a linefeed (\n), a carriage-return (\r), or a carriage-return+linefeed (\r\n). If you aren't certain that your target text uses only linefeeds, you should use this more inclusive version of the regex:
re.compile(r"^(.+)(?:\n|\r\n?)((?:(?:\n|\r\n?).+)+)", re.MULTILINE)
BTW, you don't want to use the DOTALL modifier here; you're relying on the fact that the dot matches everything except newlines.
This will work:
>>> import re
>>> rx_sequence=re.compile(r"^(.+?)\n\n((?:[A-Z]+\n)+)",re.MULTILINE)
>>> rx_blanks=re.compile(r"\W+") # to remove blanks and newlines
>>> text="""Some varying text1
...
... AAABBBBBBCCCCCCDDDDDDD
... EEEEEEEFFFFFFFFGGGGGGG
... HHHHHHIIIIIJJJJJJJKKKK
...
... Some varying text 2
...
... LLLLLMMMMMMNNNNNNNOOOO
... PPPPPPPQQQQQQRRRRRRSSS
... TTTTTUUUUUVVVVVVWWWWWW
... """
>>> for match in rx_sequence.finditer(text):
... title, sequence = match.groups()
... title = title.strip()
... sequence = rx_blanks.sub("",sequence)
... print "Title:",title
... print "Sequence:",sequence
... print
...
Title: Some varying text1
Sequence: AAABBBBBBCCCCCCDDDDDDDEEEEEEEFFFFFFFFGGGGGGGHHHHHHIIIIIJJJJJJJKKKK
Title: Some varying text 2
Sequence: LLLLLMMMMMMNNNNNNNOOOOPPPPPPPQQQQQQRRRRRRSSSTTTTTUUUUUVVVVVVWWWWWW
Some explanation about this regular expression might be useful: ^(.+?)\n\n((?:[A-Z]+\n)+)
The first character (^) means "starting at the beginning of a line". Be aware that it does not match the newline itself (same for $: it means "just before a newline", but it does not match the newline itself).
Then (.+?)\n\n means "match as few characters as possible (all characters are allowed) until you reach two newlines". The result (without the newlines) is put in the first group.
[A-Z]+\n means "match as many upper case letters as possible until you reach a newline. This defines what I will call a textline.
((?:textline)+) means match one or more textlines but do not put each line in a group. Instead, put all the textlines in one group.
You could add a final \n in the regular expression if you want to enforce a double newline at the end.
Also, if you are not sure about what type of newline you will get (\n or \r or \r\n) then just fix the regular expression by replacing every occurrence of \n by (?:\n|\r\n?).
The following is a regular expression matching a multiline block of text:
import re
result = re.findall('(startText)(.+)((?:\n.+)+)(endText)',input)
If each file only has one sequence of aminoacids, I wouldn't use regular expressions at all. Just something like this:
def read_amino_acid_sequence(path):
with open(path) as sequence_file:
title = sequence_file.readline() # read 1st line
aminoacid_sequence = sequence_file.read() # read the rest
# some cleanup, if necessary
title = title.strip() # remove trailing white spaces and newline
aminoacid_sequence = aminoacid_sequence.replace(" ","").replace("\n","")
return title, aminoacid_sequence
find:
^>([^\n\r]+)[\n\r]([A-Z\n\r]+)
\1 = some_varying_text
\2 = lines of all CAPS
Edit (proof that this works):
text = """> some_Varying_TEXT
DSJFKDAFJKDAFJDSAKFJADSFLKDLAFKDSAF
GATACAACATAGGATACA
GGGGGAAAAAAAATTTTTTTTT
CCCCAAAA
> some_Varying_TEXT2
DJASDFHKJFHKSDHF
HHASGDFTERYTERE
GAGAGAGAGAG
PPPPPAAAAAAAAAAAAAAAP
"""
import re
regex = re.compile(r'^>([^\n\r]+)[\n\r]([A-Z\n\r]+)', re.MULTILINE)
matches = [m.groups() for m in regex.finditer(text)]
#NOTE can be sorter with matches = re.findall(pattern, text, re.MULTILINE)
for m in matches:
print 'Name: %s\nSequence:%s' % (m[0], m[1])
It can sometimes be comfortable to specify the flag directly inside the string, as an inline-flag:
"(?m)^A complete line$".
For example in unit tests, with assertRaisesRegex. That way, you don't need to import re, or compile your regex before calling the assert.
My preference.
lineIter= iter(aFile)
for line in lineIter:
if line.startswith( ">" ):
someVaryingText= line
break
assert len( lineIter.next().strip() ) == 0
acids= []
for line in lineIter:
if len(line.strip()) == 0:
break
acids.append( line )
At this point you have someVaryingText as a string, and the acids as a list of strings.
You can do "".join( acids ) to make a single string.
I find this less frustrating (and more flexible) than multiline regexes.

Replace pairs of characters at start of string with a single character

I only want this done at the start of the sting. Some examples (I want to replace "--" with "-"):
"--foo" -> "-foo"
"-----foo" -> "---foo"
"foo--bar" -> "foo--bar"
I can't simply use s.replace("--", "-") because of the third case. I also tried a regex, but I can't get it to work specifically with replacing pairs. I get as far as trying to replace r"^(?:(-){2})+" with r"\1", but that tries to replace the full block of dashes at the start, and I can't figure how to get it to replace only pairs within that block.
Final regex was:
re.sub(r'^(-+)\1', r'\1', "------foo--bar")
^ - match start
(-+) - match at least one -, but...
\1 - an equal number must exist outside the capture group.
and finally, replace with that number of hyphens, effectively cutting the number of hyphens in half.
import re
print re.sub(r'\--', '',"--foo")
print re.sub(r'\--', '',"-----foo")
Output:
foo
-foo
EDIT this answer is for the OP before it was completely edited and changed.
Here's it all written out for anyone else who comes this way.
>>> foo = '---foo'
>>> bar = '-----foo'
>>> foobar = 'foo--bar'
>>> foobaz = '-----foo--bar'
>>> re.sub('^(-+)\\1', '-', foo)
'-foo'
>>> re.sub('^(-+)\\1', '-', bar)
'---foo'
>>> re.sub('^(-+)\\1', '-', foobar)
'foo--bar'
>>> re.sub('^(-+)\\1', '-', foobaz)
'--foo--bar'
The pattern for re.sub() is:
re.sub(pattern, replacement, string)
therefore in this case we want to replace -- with -. HOWEVER, the issue comes when we have -- that we don't want to replace, given by some circumstances.
In this case we only want to match -- at the beginning of a string. In regular expressions for python, the ^ character, when used in the pattern string, will only match the given pattern at the beginning of the string - just what we were looking for!
Note that the ^ character behaves differently when used within square brackets.
Square brackets can be used to indicate a set of chars, so [abc] matches 'a' or 'b' or 'c'... An up-hat (^) at the start of a square-bracket set inverts it, so [^ab] means any char except 'a' or 'b'.
Getting back to what we were talking about. The parenthesis in the pattern represent a "group," this group can then be referenced with the \\1, meaning the first group. If there was a second set of parenthesis, we could then reference that sub-pattern with \\2. The extra \ is to escape the next slash. This pattern can also be written with re.sub(r'^(-+)\1', '-', foo) forcing python to interpret the string as a raw string, as denoted with the r preceding the pattern, thereby eliminating the need to escape special characters.
Now that the pattern is all set up, you just make the replacement whatever you want to replace the pattern with, and put in the string that you are searching through.
A link that I keep handy when dealing with regular expressions, is Google's developer's notes on them.

Match regex with \\n in it

I have the following string:
>>> repr(s)
" NBCUniversal\\n63 VOLGAFILM, INC VOLGAFILMINC\\n64 Video Service Corp
I want to match the string before the \\n -- everything before a whitespace character. The output should be:
['NBCUniversal', 'VOLGAFILMINC']
Here is what I have so far:
re.findall(r'[^s].+\\n\d{1,2}', s)
What would be the correct regex for this?
EDIT: sorry I haven't read carefully your question
If you want to find all groups of letters immediatly before a literal \n, re.findall is appropriate. You can obtain the result you want with:
>>> import re
>>> s = " NBCUniversal\\n63 VOLGAFILM, INC VOLGAFILMINC\\n64 Video Service Corp "
>>> re.findall(r'(?i)[a-z]+(?=\\n)', s)
['NBCUniversal', 'VOLGAFILMINC']
OLD ANSWER:
re.findall is not the appropriate method since you only need one result (that is a pair of strings). Here the re.search method is more appropriate:
>>> import re
>>> s = " NBCUniversal\\n63 VOLGAFILM, INC VOLGAFILMINC\\n64 Video Service Corp "
>>> res = re.search(r'^(?i)[^a-z\\]*([a-z]+)\\n[^a-z\\]*([a-z]+)', s)
>>> res.groups()
('NBCUniversal', 'VOLGAFILM')
Note: I have assumed that there are no other characters between the first word and the literal \n, but if it isn't the case, you can add [^a-z\\]* before the \\n in the pattern.
If you want to fix your existing code instead of replace it, you're on the right track, you've just got a few minor problems.
Let's start with your pattern:
>>> re.findall(r'[^s].+\\n\d{1,2}', s)
[' NBCUniversal\\n63 VOLGAFILM, INC VOLGAFILMINC\\n64']
The first problem is that .+ will match everything that it can, all the way up to the very last \\n\d{1,2}, rather than just to the next \\n\d{1,2}. To fix that, add a ? to make it non-greedy:
>>> re.findall(r'[^s].+?\\n\d{1,2}', s)
[' NBCUniversal\\n63', ' VOLGAFILM, INC VOLGAFILMINC\\n64']
Notice that we now have two strings, as we should. The problem is, those strings don't just have whatever matched the .+?, they have whatever matched the entire pattern. To fix that, wrap the part you want to capture in () to make it a capturing group:
>>> re.findall(r'[^s](.+?)\\n\d{1,2}', s)
[' NBCUniversal', ' VOLGAFILM, INC VOLGAFILMINC']
That's nicer, but it still has a bunch of extra stuff on the left end. Why? Well, you're capturing everything after [^s]. That means any character except the letter s. You almost certainly meant [\s], meaning any character in the whitespace class. (Note that \s is already the whitespace class, so [\s], meaning the class consisting of the whitespace class, is unnecessary.) That's better, but that's still only going to match one space, not all the spaces. And it will match the earliest space it can that still leaves .+? something to match, not the latest. So if you want to suck all all the excess spaces, you need to repeat it:
re.findall(r'\s+(.+?)\\n\d{1,2}', s)
['NBCUniversal', 'VOLGAFILM, INC VOLGAFILMINC']
Getting closer… but the .+? matches anything, including the space between VOLGAFILM and VOLGAFILMINC, and again, the \s+ is going to match the first run of spaces it can, leaving the .+? to match everything after that.
You could fiddle with the prefix , but there's an easier solution. If you don't want spaces in your capture group, just capture a run of nonspaces instead of a run of anything, using \S:
re.findall(r'\s+(\S+?)\\n\d{1,2}', s)
['NBCUniversal', 'VOLGAFILMINC']
And notice that once you've done that, the \s+ isn't really doing anything anymore, so let's just drop it:
re.findall(r'(\S+?)\\n\d{1,2}', s)
['NBCUniversal', 'VOLGAFILMINC']
I've obviously made some assumptions above that are correct for your sample input, but may not be correct for real data. For example, if you had a string like Weyland-Yutani\\n…, I'm assuming you want Weyland-Yutani, not just Yutani. If you have a different rule, like only letters, just change the part in parentheses to whatever fits that rule, like (\w+?) or ([A-Za-z]+?).
Assuming that the input actually has the sequence \n (backslash followed by letter 'n') and not a newline, this will work:
>>> re.findall(r'(\S+)\\n', s)
['NBCUniversal', 'VOLGAFILMINC']
If the string actually contains newlines then replace \\n with \n in the regular expression.

REGEX (python) match or return a string after '?', but in a new line, til the end of that line

Here us what I'm trying to do... I have a string structured like this:
stringparts.bst? (carriage return)
765945559287eghc1bg60aa26e4c9ccf8ac425725622f65a6lsa6ahskchksyttsutcuan99 (carriage return)
SPAM /198975/
I need it to match or return this:
765945559287eghc1bg60aa26e4c9ccf8ac425725622f65a6lsa6ahskchksyttsutcuan99
What RegEx will do the trick?
I have tried this, but to no avail :(
bst\?(.*)\n
Thanks in advc
I tried this. Assuming the newline is only one character.
>>> s
'stringparts.bst?\n765945559287eghc1bg60aa26e4c9ccf8ac425725622f65a6lsa6ahskchks
yttsutcuan99\nSPAM /198975/'
>>> m = re.match('.*bst\?\s(.+)\s', s)
>>> print m.group(1)
765945559287eghc1bg60aa26e4c9ccf8ac425725622f65a6lsa6ahskchksyttsutcuan99
Your regex will match everything between the bst? and the first newline which is nothing. I think you want to match everything between the first two newlines.
bst\?\n(.*)\n
will work, but you could also use
\n(.*)\n
although it may not work for some other more specific cases
This is more robust against different kinds of line breaks, and works if you have a whole list of such strings. The $ and ^ represent the beginning and end of a line, but not the actual line break character (hence the \s+ sequence).
import re
BST_RE = re.compile(
r"bst\?.*$\s+^(.*)$",
re.MULTILINE
)
INPUT_STR = r"""
stringparts.bst?
765945559287eghc1bg60aa26e4c9ccf8ac425725622f65a6lsa6ahskchksyttsutcuan99
SPAM /198975/
stringparts.bst?
another
SPAM /.../
"""
occurrences = BST_RE.findall(INPUT_STR)
for occurrence in occurrences:
print occurrence
This pattern allows additional whitespace before the \n:
r'bst\?\s*\n(.*?)\s*\n'
If you don't expect any whitespace within the string to be captured, you could use a simpler one, where \s+ consumes whitespace, including the \n, and (\S+) captures all the consecutive non-whitespace:
r'bst\?\s+(\S+)'

Using regex to get passage between two strings in Python

I want to parse all of the functions inside of a .txt file. It looks like this:
def
test
end
def
hello
end
def
world
end
So, I would get the following returned: [test, hello, world]
Here is what I have tried, but I do not get anything back:
r = re.findall('def(.*?)end', doc)
print r
You have to use the re.DOTALL flag which will allow . to match newlines too (since your doc is multi-line).
You could additionally use '^def' and '^end' in the regex if you only wanted the outer def/end blocks (ie ignore indented ones), in which case you would also need to use the re.MULTILINE flag, which allows '^' and '$' to match start/end of line (as opposed to start/end of string).
re.findall('^def(.*?)^end',doc,re.DOTALL|re.MULTILINE)
r = re.findall('def(.*?)end', doc, re.S)
You need to enable re.MULTILINE flag to match multiple lines in a single regular expression.
Also, ^ and $ do NOT match linefeeds (\n)
>>> re.findall(r"^def$\n(.*)\n^end$", doc, re.MULTILINE)
[' test', ' hello', ' world']
If you don't want to match the whitespace in the beginning of the blocks, add \W+:
>>> re.findall(r"^def$\n\W*(.*)\n^end$", text, re.MULTILINE)
['test', 'hello', 'world']

Categories