Python regex: parsing Newick format

I have a string like:
(A\2009_2009-01-04:0.2,(A\name2\human\2007_2007:0.3,A\chicken\ird16\2016_20016:0.4)A\name3\epi66321\2001_2001-04-04:0.5)A\name_with_space\2014_2014:0.1)A\name4\66036-8a\2004_2004-12-05;
In this tree, names are bounded on the left by an opening bracket "(", a closing bracket ")", or a comma, and on the right by a colon ":". That is, the substrings "A\2009_2009-01-04", "A\name2\human\2007_2007" and "A\name3\epi66321\2001_2001-04-04" are names. (This is actually a tree in Newick format.)
I'd like to find a regex pattern which finds all names, with as little restriction on namespace as possible. Think of names as variables, like this example from Wikipedia:
(A:0.1,B:0.2,(C:0.3,D:0.4)E:0.5)F;
Where A, B, C etc. can be any string. The only restriction on namespace is that names cannot contain rounded or square brackets, '&', ',' or ':', because these are special characters that define the tree format, the same way that the comma defines a csv format.
Bonus: sometimes, internal nodes within the tree aren't labelled:
(A:0.1,B:0.2,(C:0.3,D:0.4):0.5);
In which case, a regex that correctly returns a string of length zero would be great.

It seems you want to extract substrings that follow one or more (, ) or , characters and consist of one or more characters other than : and ;, as many as possible, stopping at a word boundary.
Use
r'[(),]+([^;:]+)\b'
See the regex demo.
Pattern details
[(),]+ - one or more characters in the character class: (, ) or ,
([^;:]+) - Group 1: one or more chars other than ; and :, as many as possible
\b - a word boundary
Python demo:
import re
rx = r'[(),]+([^;:]+)\b'
s = "(A:0.1,B:0.2,(C:0.3,D:0.4)E:0.5)F;((A\\2009_2009-01-04:0.2,(A\\name2\\human\\2007_2007:0.3,A\\chicken\\ird16\\2016_20016:0.4)A\\name3\\epi66321\\2001_2001-04-04:0.5)A\\name_with_space\\2014_2014:0.1)A\\name4\\66036-8a\\2004_2004-12-05;"
res = re.findall(rx, s)
for val in res:
    print(val)
Output:
A
B
C
D
E
F
A\2009_2009-01-04
A\name2\human\2007_2007
A\chicken\ird16\2016_20016
A\name3\epi66321\2001_2001-04-04
A\name_with_space\2014_2014
A\name4\66036-8a\2004_2004-12-05
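The pattern above needs at least one character per name, so it does not report unlabelled nodes. The following is a minimal sketch of a more general variant, assuming that every name is followed by ":" (or by ";" for the root) and that names may not contain brackets, "&", ",", ":" or ";", as the question states. The names NAME and node_names are made up for illustration.

```python
import re

# Assumption: a name follows "(", ")" or "," and is terminated by ":" or ";".
# The excluded characters are the reserved ones listed in the question, plus ";".
NAME = re.compile(r"[(),]([^()\[\]&,:;]*)(?=[:;])")

def node_names(newick):
    """Return every node name in order; unlabelled nodes yield ''."""
    return NAME.findall(newick)

print(node_names("(A:0.1,B:0.2,(C:0.3,D:0.4)E:0.5)F;"))
# ['A', 'B', 'C', 'D', 'E', 'F']
print(node_names("(A:0.1,B:0.2,(C:0.3,D:0.4):0.5);"))
# ['A', 'B', 'C', 'D', '', '']
```

In the second tree both the inner node and the root are unlabelled, which is why two empty strings come back. Trees written without branch lengths would need a different terminator in the lookahead.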

You can use the regex
(\w+)(?=:|;)
See the sample code:
import re
regex = r"(\w+)(?=:|;)"
test_str = "((B:0.2,(C:0.3,D:0.4)E:0.5)F:0.1)A;"
matches = re.finditer(regex, test_str)
for matchNum, match in enumerate(matches, start=1):
    print("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum=matchNum, start=match.start(), end=match.end(), match=match.group()))
The output is
Match 1 was found at 2-3: B
Match 2 was found at 9-10: C
Match 3 was found at 15-16: D
Match 4 was found at 21-22: E
Match 5 was found at 27-28: F
Match 6 was found at 33-34: A
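Note that \w only matches letters, digits and underscores, so this pattern truncates names that contain other characters. A small check on a fragment built from the names in the question (the variable names are illustrative only):

```python
import re

regex = r"(\w+)(?=:|;)"
# A fragment using names from the question, which contain "\" and "-".
sample = r"(A\name2\human\2007_2007:0.3)A\name4\66036-8a\2004_2004-12-05;"
found = re.findall(regex, sample)
print(found)  # ['2007_2007', '05']
```

Only the last run of word characters before each ":" or ";" is returned, so the pattern fits single-word labels such as A, B, C but not the longer names.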

A working solution:
[(),]([A-E])(?!;)
See live demo. One mistake you made was escaping characters inside the character class, where they have no special meaning.
I also took care of selecting against a trailing semicolon.

pattern = re.compile(r'[(),]A/[\S]*?:')
Not the most elegant, because I made use of the fact that all my names start with "A/". This will not be true for future use cases, just this current one. Will leave this question open if someone can find a more generalizable solution.

Related

Python Regex: Capture overlapping parts

Given a string s = "<foo>abcaaa<bar>a<foo>cbacba<foo>c" I'm trying to write a regular expression which will extract portions of: angle brackets with the text inside and the surrounding text. Like this:
<foo>abcaaa
abcaaa<bar>a
a<foo>cbacba
cbacba<foo>c
So expected output should look like this:
["<foo>abcaaa", "abcaaa<bar>a", "a<foo>cbacba", "cbacba<foo>c"]
I found this question How to find overlapping matches with a regexp? which brought me little bit closer to the desired result but still my regex doesn't work.
regex = r"(?=([a-c]*)\<(\w+)\>([a-c]*))"
Any ideas how to solve this problem?
You can match overlapping content with standard regex syntax by using capturing groups inside lookaround assertions, since those may match parts of the string without consuming the matched substring and hence precluding it from further matches. In this specific example, we match either the beginning of the string or a > as anchor for the lookahead assertion which captures our actual targets:
(?:\A|>)(?=([a-c]*<\w+>[a-c]*))
See regex demo.
In python we then use the property of re.findall() to only return matches captured in groups when capturing groups are present in the expression:
import re
text = '<foo>abcaaa<bar>a<foo>cbacba<foo>c'
expr = r'(?:\A|>)(?=([a-c]*<\w+>[a-c]*))'
captures = re.findall(expr, text)
print(captures)
Output:
['<foo>abcaaa', 'abcaaa<bar>a', 'a<foo>cbacba', 'cbacba<foo>c']
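To see why the lookahead is needed, it may help to compare it with the same sub-pattern used without one. The sketch below reuses the text from the question; a plain pattern consumes each match, so text shared between neighbouring matches can only be reported once.

```python
import re

text = '<foo>abcaaa<bar>a<foo>cbacba<foo>c'

# Consuming version: each character can belong to one match only.
consumed = re.findall(r'[a-c]*<\w+>[a-c]*', text)

# Lookahead version: only the anchor (string start or ">") is consumed,
# so the captured group may overlap with the next match.
overlapping = re.findall(r'(?:\A|>)(?=([a-c]*<\w+>[a-c]*))', text)

print(consumed)     # ['<foo>abcaaa', '<bar>a', '<foo>cbacba', '<foo>c']
print(overlapping)  # ['<foo>abcaaa', 'abcaaa<bar>a', 'a<foo>cbacba', 'cbacba<foo>c']
```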
You need to set the left- and right-hand boundaries to < or > chars or start/end of string.
Use
import re
text = "<foo>abcaaa<bar>a<foo>cbacba<foo>c"
print( re.findall(r'(?=(?<![^<>])([a-c]*<\w+>[a-c]*)(?![^<>]))', text) )
# => ['<foo>abcaaa', 'abcaaa<bar>a', 'a<foo>cbacba', 'cbacba<foo>c']
See the Python demo online and the regex demo.
Pattern details
(?= - start of a positive lookahead to enable overlapping matches
(?<![^<>]) - start of string, < or >
([a-c]*<\w+>[a-c]*) - Group 1 (the value extracted): 0+ a, b or c chars, then <, 1+ word chars, > and again 0+ a, b or c chars
(?![^<>]) - end of string, < or > must follow immediately
) - end of the lookahead.
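The double negatives (?<![^<>]) and (?![^<>]) can be hard to read: "not preceded by a character that is neither < nor >" holds at the start of the string or right after < or >, and the lookahead works the same way on the right-hand side. A small sketch of that boundary behaviour on made-up input (the pattern name is hypothetical):

```python
import re

# Hypothetical helper pattern: a run of word characters bounded on both
# sides by "<", ">" or the edge of the string.
bounded = re.compile(r'(?<![^<>])\w+(?![^<>])')

print(bounded.findall('ab<cd>ef'))  # ['ab', 'cd', 'ef']
print(bounded.findall('ab cd'))     # []
```

In the second call the space is neither an angle bracket nor a string edge, so neither word qualifies.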
You may use this regex code in python:
>>> s = '<foo>abcaaa<bar>a<foo>cbacba<foo>c'
>>> reg = r'([^<>]*<[^>]*>)(?=([^<>]*))'
>>> print ( [''.join(i) for i in re.findall(reg, s)] )
['<foo>abcaaa', 'abcaaa<bar>a', 'a<foo>cbacba', 'cbacba<foo>c']
RegEx Demo
RegEx Details:
([^<>]*<[^>]*>): Capture group #1 to match 0 or more characters that are not < and > followed by <...> string.
(?=([^<>]*)): Lookahead to assert that we have 0 or more non-<> characters ahead of current position. We have capture group #2 inside this lookahead.

A way to match a SSHA hash using a regular expression

I'm trying to match four hashes that look like this:
{SHA}qUqP5cyxm6YcTAhz05Hph5gvu9M=
{SSHA}QhikpbGFa5NAckbjcZ_K_WoJNh4=
{SSHA}5_DNVWsyofo-oIEzHnhv30rSN7c=
{MD5}5/DNVWwyafo-pIEaHNhv39sSN7c=
I've successfully matched the first two with this regular expression: \D{5,}[a-zA-Z0-9]\w+\(?= however I am unable to get a full match on the third or the fourth one. What is a better regular expression to match the given hashes?
Note that \D{5,} matches 5 or more non-digit chars, then [a-zA-Z0-9] matches an ASCII letter or digit, and \w+ matches 1+ letters/digits/_. So, if the string contains - or /, or if any of the first 5 chars is a digit, the pattern will not match in full.
I suggest the following pattern:
\{[^{}]*}[a-zA-Z0-9][\w/-]+=?
See the regex demo.
It matches:
\{[^{}]*} - a {, then 0+ chars other than { and } and then } (note that you can make it more precise: \{\w+} matches {, 1 or more letters/digits/_, and then }, and \{(?:SS?HA|MD5)} matches only SHA, SSHA or MD5 enclosed in {...})
[a-zA-Z0-9] - an ASCII letter or digit
[\w/-]+ - 1 or more word chars (letters, digits or _), / or - chars
=? - an optional = symbol (1 or 0 occurrences; the greedy ? quantifier makes it match a = if one is present).
Python demo:
import re
s = """
TEXT {SHA}qUqP5cyxm6YcTAhz05Hph5gvu9M=
{SSHA}QhikpbGFa5NAckbjcZ_K_WoJNh4= and some more
{SSHA}5_DNVWsyofo-oIEzHnhv30rSN7c text here
{MD5}5/DNVWwyafo-oIEzHnhv30rSN7c= maybe."""
rx = r"\{[^{}]*}[a-zA-Z0-9][\w/-]+=?"
print(re.findall(rx, s))
# => ['{SHA}qUqP5cyxm6YcTAhz05Hph5gvu9M=', '{SSHA}QhikpbGFa5NAckbjcZ_K_WoJNh4=', '{SSHA}5_DNVWsyofo-oIEzHnhv30rSN7c', '{MD5}5/DNVWwyafo-oIEzHnhv30rSN7c=']
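If the scheme and the digest are needed separately, the tighter \{(?:SS?HA|MD5)} variant mentioned above can be combined with capturing groups. This is only a sketch, assuming that the three schemes from the question are the only ones of interest; HASH and split_hashes are illustrative names.

```python
import re

# Assumption: only the three schemes seen in the question are of interest.
HASH = re.compile(r"\{(SS?HA|MD5)\}([a-zA-Z0-9][\w/-]+=?)")

def split_hashes(text):
    """Return (scheme, digest) pairs for every hash found in text."""
    return HASH.findall(text)

sample = ("TEXT {SHA}qUqP5cyxm6YcTAhz05Hph5gvu9M= and "
          "{MD5}5/DNVWwyafo-pIEaHNhv39sSN7c= {WORD}stuff=")
print(split_hashes(sample))
# [('SHA', 'qUqP5cyxm6YcTAhz05Hph5gvu9M='), ('MD5', '5/DNVWwyafo-pIEaHNhv39sSN7c=')]
```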
I would suggest something along these lines:
\{[SHAMD5]{3,4}\}[^=]+=?
It will match a {, then 3 or 4 characters drawn from the letters and digits that appear in the scheme names you listed, then a }, then one or more non-= characters, ending with an optional = character. You can change the class to [A-Z0-9] to broaden it, but I like to keep it tighter to start. Here is my Python demo:
import re
textlist = [
    "{SHA}qUqP5cyxm6YcTAhz05Hph5gvu9M=",
    "{SSHA}QhikpbGFa5NAckbjcZ_K_WoJNh4=",
    "{SSHA}5_DNVWsyofo-oIEzHnhv30rSN7c=",
    "{MD5}5/DNVWwyafo-pIEaHNhv39sSN7c=",
    "{MD5}5/DNVWwyafo-pIEaHNhv39sSN7c",
    "test for break below",
    "{WORD}stuff=",
    "{MD55/DNVWwyafo-pIEaHNhv39sSN7c=",
    "MD5}5/DNVWwyafo-pIEaHNhv39sSN7c=",
]
for text in textlist:
    # Raw string so the backslashes reach the regex engine unchanged.
    if re.search(r"\{[SHAMD5]{3,4}\}[^=]+=?", text):
        print("match")
    else:
        print("no soup for you")
Note the end of the list has a few tests to make sure the regex doesn't just succeed on anything random.
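One caveat worth checking: [SHAMD5]{3,4} is a character class, so it accepts any 3 or 4 of those characters in any order, not just the three scheme names. A quick sketch with made-up inputs:

```python
import re

pattern = re.compile(r"\{[SHAMD5]{3,4}\}[^=]+=?")

accepts_sha = bool(pattern.search("{SHA}abc="))    # a real scheme name
accepts_dam = bool(pattern.search("{DAM}abc="))    # not a scheme, same letters
accepts_word = bool(pattern.search("{WORD}abc="))  # letters outside the class

print(accepts_sha, accepts_dam, accepts_word)  # True True False
```

Whether that looseness matters depends on how trusted the input is.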

Find all substrings wrapped in double quotes satisfying several constraints with a Python regular expression

I want to find all the substrings wrapped in the double quotes satisfying the following two constraints:
The shortest substring starting with "http"
End with ".bmp" or ".jpg"
My code is below:
import re
pat = r'"(http.+?\.(jpg|bmp))"' # I don't know how to modify this pattern
reg = re.compile(pat)
aa = '"http:afd/aa.bmp" :tt: "kkkk" ++, "http--test--http:kk/bb.jpg"'
print(reg.findall(aa))
My expected outputs are
['http:afd/aa.bmp', 'http:kk/bb.jpg']
But the execution results are
[('http:afd/aa.bmp', 'bmp'), ('http--test--http:kk/bb.jpg', 'jpg')]
I have already tried several kinds of patterns but I still can't get what I want.
How should I modify my code to get the results I expect? Thanks!
Use a [^"]* negated character class after the first " to stay within the double-quoted substring and reach the last http, then add it at the end, too, to reach the trailing " (note that this only works if the string contains no escape sequences).
import re
pat = r'"[^"]*(http.*?\.(?:jpg|bmp))[^"]*"'
reg = re.compile(pat)
aa = '"http:afd/aa.bmp" :tt: "kkkk" ++, "http--test--http:kk/bb.jpg"'
print(reg.findall(aa))
# => ['http:afd/aa.bmp', 'http:kk/bb.jpg']
See the Python demo online.
Pattern details:
" - a literal double quote
[^"]* - 0+ chars other than a double quote, as many as possible, since * is a greedy quantifier
(http.*?\.(?:jpg|bmp)) - Group 1 (extracted with re.findall) that matches:
http - a literal substring http
.*? - any 0+ chars, as few as possible (as *? is a lazy quantifier)
\. - a literal dot
(?:jpg|bmp) - a non-capturing group (so that the text it matches could not be output with re.findall) matching either jpg or bmp substring
[^"]* - 0+ chars other than a double quote, as many as possible
" - a literal double quote

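If the single regex above becomes hard to maintain, a two-step alternative is possible: pull out each double-quoted chunk first, then trim it with ordinary string methods. This is a sketch under the assumptions that the image extension sits directly before the closing quote and that the text contains no escaped quotes; image_urls is a made-up name.

```python
import re

def image_urls(text):
    """Return the shortest 'http...' tail of each quoted .bmp/.jpg chunk."""
    urls = []
    for chunk in re.findall(r'"([^"]*)"', text):
        if chunk.endswith(('.bmp', '.jpg')):
            start = chunk.rfind('http')  # last "http" gives the shortest tail
            if start != -1:
                urls.append(chunk[start:])
    return urls

aa = '"http:afd/aa.bmp" :tt: "kkkk" ++, "http--test--http:kk/bb.jpg"'
print(image_urls(aa))  # ['http:afd/aa.bmp', 'http:kk/bb.jpg']
```
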
Regular expressions: replace comma in string, Python

Somehow puzzled by the way regular expressions work in python, I am looking to replace all commas inside strings that are preceded by a letter and followed either by a letter or a whitespace. For example:
2015,1674,240/09,PEOPLE V. MICHAEL JORDAN,15,15
2015,2135,602832/09,DOYLE V ICON, LLC,15,15
The first line has effectively 6 columns, while the second line has 7 columns. Thus I am trying to replace the comma between (N, L) in the second line by a whitespace (N L) as so:
2015,2135,602832/09,DOYLE V ICON LLC,15,15
This is what I have tried so far, without success however:
new_text = re.sub(r'([\w],[\s\w|\w])', "", text)
Any ideas where I am wrong?
Help would be much appreciated!
The pattern you use, ([\w],[\s\w|\w]), is consuming a word char (= an alphanumeric or an underscore, [\w]) before a ,, then matches the comma, and then matches (and again, consumes) 1 character - a whitespace, a word character, or a literal | (as inside the character class, the pipe character is considered a literal pipe symbol, not alternation operator).
So, the main problem is that \w matches both letters and digits.
You can actually leverage lookarounds:
(?<=[a-zA-Z]),(?=[a-zA-Z\s])
See the regex demo
The (?<=[a-zA-Z]) is a positive lookbehind that requires a letter to be right before the , and (?=[a-zA-Z\s]) is a positive lookahead that requires a letter or whitespace to be present right after the comma.
Here is a Python demo:
import re
p = re.compile(r'(?<=[a-zA-Z]),(?=[a-zA-Z\s])')
test_str = "2015,1674,240/09,PEOPLE V. MICHAEL JORDAN,15,15\n2015,2135,602832/09,DOYLE V ICON, LLC,15,15"
result = p.sub("", test_str)
print(result)
If you still want to use \w, you can exclude digits and underscore from it using an opposite class \W inside a negated character class:
(?<=[^\W\d_]),(?=[^\W\d_]|\s)
See another regex demo
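One practical difference between the two patterns is that [^\W\d_] also covers non-ASCII letters when matching Python 3 str values, while [a-zA-Z] does not. A short sketch on a made-up row:

```python
import re

ascii_only = re.compile(r'(?<=[a-zA-Z]),(?=[a-zA-Z\s])')
any_letter = re.compile(r'(?<=[^\W\d_]),(?=[^\W\d_]|\s)')

row = "2015,1,CAF\u00c9, LLC,15"  # made-up row; \u00c9 is a capital E with acute

ascii_result = ascii_only.sub("", row)   # comma after the accented letter stays
letter_result = any_letter.sub("", row)  # comma after the accented letter goes

print(ascii_result)
print(letter_result)
```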
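\w matches a-z, A-Z, 0-9 and _, so your regex also matches commas that sit between digits, and replacing with an empty string deletes the neighbouring characters as well. You could try the following regex, and replace with \1\2.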
([a-zA-Z]),(\s|[a-zA-Z])
Here is the DEMO.
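One difference between the capturing-group approach and the lookaround approach shows up when a single letter sits between two commas: the groups consume that letter, so it cannot serve as the left-hand context of the next comma. A minimal sketch on made-up input:

```python
import re

consuming = re.compile(r'([a-zA-Z]),(\s|[a-zA-Z])')
lookaround = re.compile(r'(?<=[a-zA-Z]),(?=[a-zA-Z\s])')

text = "a,b,c"  # made-up input where one letter sits between two commas

consumed_result = consuming.sub(r'\1\2', text)
lookaround_result = lookaround.sub('', text)

print(consumed_result)    # ab,c
print(lookaround_result)  # abc
```

For the sample rows in the question both approaches give the same result, so this only matters for inputs with such short fields.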

Anchor to End of Last Match

In the process of working on this answer I stumbled on an anomaly with Python's repeating regexes.
Say I'm given a CSV string with an arbitrary number of quoted and unquoted elements:
21, 2, '23.5R25 ETADT', 'description, with a comma'
I want to replace all the ','s outside quotes with '\t'. So I'd like an output of:
21\t2\t'23.5R25 ETADT'\t'description, with a comma'
Since there will be multiple matches in the string naturally I'll use the g regex modifier. The regex I'll use will match characters outside quotes or a quoted string followed by a ',':
('[^']*'|[^',]*),\s*
And I'll replace with:
\1\t
Now the problem is the regex is searching not matching so it can choose to skip characters until it can match. So rather than my desired output I get:
21\t2\t'23.5R25 ETADT'\t'description\twith a comma'
You can see a live example of this behavior here: https://regex101.com/r/sG9hT3/2
Q. Is there a way to anchor a g modified regex to begin matching at the character after the previous match?
For those familiar with Perl's mighty regexes, Perl provides \G, which anchors a match at the position where the previous match ended. So in Perl I could accomplish what I'm asking for with the regex:
\G('[^']*'|[^',]*),\s*
This would force a mismatch within the final quoted element. Because rather than allowing the regex implementation to find a point where the regex matched the \G would force it to begin matching at the first character of:
'description, with a comma'
You can use the following regex with re.search:
,?\s*([^',]*(?:'[^']*'[^',]*)*)
See regex demo (I change it to ,?[ ]*([^',\n]*(?:'[^'\n]*'[^',\n]*)*) since it is a multiline demo)
Here, the regex matches (in a regex meaning of the word)...
,? - 1 or 0 comma
\s* - 0 or more whitespace
([^',]*(?:'[^']*'[^',]*)*) - Group 1 storing a captured text that consists of...
[^',]* - 0 or more characters other than , and '
(?:'[^']*'[^',]*)* - 0 or more sequences of ...
'[^']*' - a 'string'-like substring containing no apostrophes
[^',]* - 0 or more characters other than , and '.
If you want to use re.match and store the captured texts inside capturing groups, that is not possible, since the Python regex engine does not keep all the captures in a stack the way the .NET regex engine does with CaptureCollection.
Also, Python's re module does not support the \G operator, so you cannot anchor a subpattern at the end of the previous successful match here.
As an alternative/workaround, you can use the following Python code to return successive matches and then the rest of the string:
import re
def successive_matches(pattern, text, pos=0):
    ptrn = re.compile(pattern)
    match = ptrn.match(text, pos)
    while match:
        yield match.group()
        if match.end() == pos:
            break  # zero-width match: stop to avoid an infinite loop
        pos = match.end()
        match = ptrn.match(text, pos)
    if pos < len(text):
        yield text[pos:]  # whatever the pattern could not match

for matched_text in successive_matches(r"('[^']*'|[^',]*),\s*", "21, 2, '23.5R25 ETADT', 'description, with a comma'"):
    print(matched_text)
See IDEONE demo, the output is
21,
2,
'23.5R25 ETADT',
'description, with a comma'
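For the original goal of turning the separators into tabs, another common workaround is to let one alternation branch match whole quoted fields and put them back unchanged, so that only commas outside quotes reach the second branch. This is a sketch rather than the approach from the answer above, and it assumes balanced single quotes with no escaped quotes inside fields; commas_to_tabs is a made-up name.

```python
import re

def commas_to_tabs(line):
    """Replace separators outside single-quoted fields with a tab."""
    # Branch 1 captures a whole quoted field, so its commas are skipped;
    # branch 2 matches a separator (a comma plus any following whitespace).
    pattern = re.compile(r"('[^']*')|,\s*")
    return pattern.sub(lambda m: m.group(1) or "\t", line)

line = "21, 2, '23.5R25 ETADT', 'description, with a comma'"
print(repr(commas_to_tabs(line)))
```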
