Use Python to extract Branch Lengths from Newick Format

I have a list in Python consisting of one item, which is a tree written in Newick format, as below:
['(BMNH833953:0.16529463651919140688,(((BMNH833883:0.22945757727367316336,(BMNH724182a:0.18028180766761139897,(BMNH724182b:0.21469677818346077913,BMNH724082:0.54350916483644962085):0.00654573856803835914):0.04530853441176059537):0.02416511342888815264,(((BMNH794142:0.21236619242575086042,(BMNH743008:0.13421900772403019819,BMNH724591:0.14957653992840658219):0.02592135486124686958):0.02477670174791116522,BMNH703458a:0.22983459269245612444):0.00000328449424529074,BMNH703458b:0.29776257618061197086):0.09881729077887969892):0.02257522897558370684,BMNH833928:0.21599133163597591945):0.02365043128986757739,BMNH724053:0.16069861523756587274):0.0;']
In tree format this appears as below:
I am trying to write some code that will look through the list item and return the IDs (BMNHxxxxxx) which are joined by branch length of 0 (or <0.001 for example) (highlighted in red). I thought about using regex such as:
JustTree = []
with JustTree as f:
    for match in re.finditer(r"(?<=Item\sA)(?:(?!Item\sB).){50,}", subject, re.I):
        f.extend(match.group()+"\n")
This is taken from another StackOverflow answer, where item A would be a ':' (branch lengths always appear after a ':') and item B would be either a ',', a ')' or a ';', as these are the three characters that delimit it. But I'm not experienced enough in regex to do this.
Using a branch length of 0 in this case, I want the code to output ['BMNH703458a', 'BMNH703458b']. If I could alter this to also include IDs joined by a user-defined branch length of, say, 0.01, this would be highly useful.
If anyone has any input, or can point me to a useful answer I would highly appreciate it.

Okay, here's a regex to extract only numbers (with potential decimals):
\b[0-9]+(?:\.[0-9]+)?\b
The \bs make sure that no other digit, letter or underscore sits directly next to the number. \b is called a word boundary.
[0-9]+ matches multiple digits.
(?:\.[0-9]+)? is an optional group, meaning that it may or may not match. If there is a dot and digits after the first [0-9]+, then it will match those. Otherwise, it won't. The group itself matches a dot, and at least 1 digit.
You can use it with re.findall to put all the matches in a list:
import re
NewickTree = ['(BMNH833953:0.16529463651919140688,(((BMNH833883:0.22945757727367316336,(BMNH724182a:0.18028180766761139897,(BMNH724182b:0.21469677818346077913,BMNH724082:0.54350916483644962085):0.00654573856803835914):0.04530853441176059537):0.02416511342888815264,(((BMNH794142:0.21236619242575086042,(BMNH743008:0.13421900772403019819,BMNH724591:0.14957653992840658219):0.02592135486124686958):0.02477670174791116522,BMNH703458a:0.22983459269245612444):0.00000328449424529074,BMNH703458b:0.29776257618061197086):0.09881729077887969892):0.02257522897558370684,BMNH833928:0.21599133163597591945):0.02365043128986757739,BMNH724053:0.16069861523756587274):0.0;']
pattern = re.compile(r"\b[0-9]+(?:\.[0-9]+)?\b")
for tree in NewickTree:
    branch_lengths = pattern.findall(tree)
    # Do stuff to the list branch_lengths
    print(branch_lengths)
For this list, you get this printed:
['0.16529463651919140688', '0.22945757727367316336', '0.18028180766761139897',
'0.21469677818346077913', '0.54350916483644962085', '0.00654573856803835914',
'0.04530853441176059537', '0.02416511342888815264', '0.21236619242575086042',
'0.13421900772403019819', '0.14957653992840658219', '0.02592135486124686958',
'0.02477670174791116522', '0.22983459269245612444', '0.00000328449424529074',
'0.29776257618061197086', '0.09881729077887969892', '0.02257522897558370684',
'0.21599133163597591945', '0.02365043128986757739', '0.16069861523756587274',
'0.0']
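If the values are needed as floats rather than strings, and only numbers that follow a colon should count, a variation along these lines may help. This is a minimal sketch rather than part of the answer above: the `BRANCH_LENGTH` pattern, the `branch_lengths` helper, the toy tree and the 0.001 cut-off are all assumptions chosen for illustration.

```python
import re

# Assumed pattern: a colon, then a decimal number with an optional exponent.
BRANCH_LENGTH = re.compile(r":([0-9]+(?:\.[0-9]+)?(?:[eE][-+]?[0-9]+)?)")

def branch_lengths(newick):
    """Return every branch length in a Newick string as a float."""
    return [float(value) for value in BRANCH_LENGTH.findall(newick)]

# Hypothetical toy tree, not the one from the question.
toy_tree = "(A:0.1,(B:0.00000328,C:2e-3):0.0);"
lengths = branch_lengths(toy_tree)

# Keep only the lengths below a user-defined threshold.
threshold = 0.001
short = [length for length in lengths if length < threshold]
print(short)
```

Anchoring on the colon means a purely numeric leaf label would not be mistaken for a branch length, and the optional exponent covers scientific notation. It still only yields the numbers, not the IDs they belong to.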

I know your question has been answered, but if you ever want your data as a nested list instead of a flat string:
import re
import pprint
a="(BMNH833953:0.16529463651919140688,(((BMNH833883:0.22945757727367316336,(BMNH724182a:0.18028180766761139897,(BMNH724182b:0.21469677818346077913,BMNH724082:0.54350916483644962085):0.00654573856803835914):0.04530853441176059537):0.02416511342888815264,(((BMNH794142:0.21236619242575086042,(BMNH743008:0.13421900772403019819,BMNH724591:0.14957653992840658219):0.02592135486124686958):0.02477670174791116522,BMNH703458a:0.22983459269245612444):0.00000328449424529074,BMNH703458b:0.29776257618061197086):0.09881729077887969892):0.02257522897558370684,BMNH833928:0.21599133163597591945):0.02365043128986757739,BMNH724053:0.16069861523756587274):0.0;"
def tokenize(text):
    # yield "(", ")" or a "name:length" / ":length" token
    for m in re.finditer(r"\(|\)|[\w.:]+", text):
        yield m.group()

def make_nested_list(tok, L=None):
    if L is None:
        L = []
    while True:
        try:
            t = next(tok)  # next(tok) works in Python 2.6+ and 3
        except StopIteration:
            break
        if t == "(":
            L.append(make_nested_list(tok))
        elif t == ")":
            break
        else:
            i = t.find(":")
            assert i != -1
            if i == 0:
                # ":length" after a closing bracket: the clade's own branch
                L.append(float(t[1:]))
            else:
                L.append([t[:i], float(t[i+1:])])
    return L

L = make_nested_list(tokenize(a))
pprint.pprint(L)
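Going one step further towards what the question asks for, the sketch below parses the tree into nested dicts and then reports, for each short internal branch, the leaves attached directly at either end of it. That reading of "joined by a short branch" is an assumption, as are the helper names `parse_newick` and `leaves_joined_by_short_branches`. The parser assumes a well-formed string without quoted labels or comments.

```python
import re

def parse_newick(text):
    """Parse a Newick string into nested dicts with name, length and children."""
    def new_node():
        return {"name": "", "length": None, "children": []}

    root = new_node()
    current = root
    stack = []
    # Tokens are the structural characters or a run of anything else.
    for tok in re.findall(r"[(),;]|[^(),;]+", text.strip()):
        if tok == "(":            # first child of the current node
            child = new_node()
            current["children"].append(child)
            stack.append(current)
            current = child
        elif tok == ",":          # next sibling under the same parent
            sibling = new_node()
            stack[-1]["children"].append(sibling)
            current = sibling
        elif tok == ")":          # back up to the parent
            current = stack.pop()
        elif tok == ";":
            break
        else:                     # "name:length", "name" or ":length"
            name, _, length = tok.partition(":")
            current["name"] = name.strip()
            if length.strip():
                current["length"] = float(length)
    return root

def leaves_joined_by_short_branches(root, threshold):
    """For each short internal branch, list the leaves attached at either end."""
    found = []

    def visit(node, siblings):
        short = node["length"] is not None and node["length"] < threshold
        # The root has no siblings, so its (often 0.0) length is skipped.
        if node["children"] and siblings is not None and short:
            below = [c["name"] for c in node["children"] if not c["children"]]
            beside = [s["name"] for s in siblings
                      if s is not node and not s["children"]]
            found.append(below + beside)
        for child in node["children"]:
            visit(child, node["children"])

    visit(root, None)
    return found

# Hypothetical toy tree: the clade (B, C) hangs on a very short branch.
toy = "(A:1,(B:2,C:3):0.0001,D:4):0.0;"
print(leaves_joined_by_short_branches(parse_newick(toy), 0.001))
```

Under this reading, the question's tree with a threshold of 0.001 should yield the single pair BMNH703458a and BMNH703458b. A larger threshold such as 0.01 would add further groups.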

There are several Python libraries that support the Newick format. The ETE toolkit lets you read Newick strings and operate on trees as Python objects:
from ete2 import Tree
tree = Tree(newickFile)
print(tree)
Several Newick subformats can be chosen, and branch distances are parsed even if they are expressed in scientific notation.
from ete2 import Tree
tree = Tree("(A:3.4, (B:0.15E-10,C:0.0001):1.5E-234);")

Related

How to search for patterns in a list of strings without 100% similarity?

So I have a list of strings:
input_list=["ACTGATCTTATCGAGTCAGCTAGTCGATCGATCGACGCGCGATCGTGATG","TGCATCGATCGATGCTAGTCGATATACGCGATATGTACG","CATCGGATCGATCGATCAGCTCATAGTCAGTC","CATCGATCATATATCGAGCGACAGTCAGTCGATCAGTCATCAGGTAGC","CATCATATCGAGCAGTCATCGTAGTCATGATCGATCGAT","ACATGAATCGATCGATAATCGATCGCGATTCGATG"]
And a list of tuples containing patterns to be looked for in the strings.
list_patterns=[("AGTG","TCGC"),("TATC","ATGT"),("GCAT","TCAG")]
I have this function that, for each string, looks for (any) one of the pairs of patterns from "list_patterns". The first element of each tuple in "list_patterns" is searched for from the beginning of the string, and the second element of each tuple is searched for from the end of the string.
Subsequently, the function trims the string, appending the trimmed string to an empty list (if none of the pairs of patterns is found, it just appends the original untrimmed string).
trimmed_list = []
for el in input_list:
    for pat in list_patterns:
        beg = el.find(pat[0])
        end = el.rfind(pat[1])
        if beg != -1 and end != -1:
            trimmed_list.append(el[beg+len(pat[0]):end])
            break
    else:
        # no pair of patterns was found: keep the untrimmed string
        trimmed_list.append(el)
The thing is, I want to trim and find the patterns, but not necessarily match only the ones that have 100% similarity. I want to also find the patterns that are somewhat similar (by a user-defined threshold) and trim the strings accordingly.
I found this function that retrieves the ratio of similarity between two strings, but I'm not able to implement it into my original function:
from difflib import SequenceMatcher
def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()
As an example, let's say I had a string:
"ATGCATCGTACGTACGTACG"
And a tuple of patterns which may be slightly different from the original ones:
("AYTG","TARYCG")
Even though the string does not contain exactly those patterns (but contains similar ones), I would still like to trim it (ATG | CATCGTACGTACG | TACG), and have an output:
"CATCGTACGTACG"
Is there an easy way to add the "SequenceMatcher" function to my user-defined function?
Thank you so much in advance for any answers.

You can create a custom find function and call it instead of the str.find or str.rfind methods. This function only compares substrings that have the same length as the pattern. If substrings of a different length should also be able to match, it has to be extended, but it's hard to predict by how much.
def find_similar(needle, haystack, backwards = False, min_diff_ratio = 0.75):
    n_len = len(needle)
    # create a range depending on direction
    if backwards:
        r = range(len(haystack) - n_len, -1, -1)
    else:
        # + 1 so the final window is checked too
        r = range(len(haystack) - n_len + 1)
    for i in r:
        # create a substring with same length as search string
        # at index
        substr = haystack[i:i + n_len]
        # Here we check for similarity using your function
        if similar(needle, substr) >= min_diff_ratio:
            return i
    return -1
Update your loop to this:
trimmed_list = []
for el in input_list:
    for pat in list_patterns:
        beg = find_similar(pat[0], el)
        end = find_similar(pat[1], el, True)
        if beg != -1 and end != -1:
            trimmed_list.append(el[beg+len(pat[0]):end])
            break
    else:
        trimmed_list.append(el)
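On the point about matches of a different length, one possible extension is to try windows slightly shorter and longer than the pattern and keep the best-scoring one. This is only a sketch of that idea: the function name `find_similar_span`, its `slack` parameter and the test strings are hypothetical. Unlike the function above, it scans every window instead of stopping at the first hit.

```python
from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

def find_similar_span(needle, haystack, min_ratio=0.75, slack=1):
    """Return (start, end) of the best-scoring window, or None if none reaches min_ratio."""
    best_span = None
    best_ratio = 0.0
    n = len(needle)
    # Try window lengths from n - slack up to n + slack.
    for length in range(max(1, n - slack), n + slack + 1):
        for start in range(len(haystack) - length + 1):
            ratio = similar(needle, haystack[start:start + length])
            # Strict ">" keeps the first window found among equal scores.
            if ratio >= min_ratio and ratio > best_ratio:
                best_ratio = ratio
                best_span = (start, start + length)
    return best_span

print(find_similar_span("ABCD", "xxABCDxx"))  # (2, 6)
print(find_similar_span("ABCD", "zzzzzzzz"))  # None
```

Returning the end index as well lets the caller slice with `el[end_of_first:start_of_second]` without assuming the match has the same length as the pattern. The cost is more comparisons per string.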

Change string for defined pattern (Python)

I'm learning Python and came across a demanding beginner's exercise.
Let's say you have a string constituted by "blocks" of characters separated by ';'. An example would be:
cdk;2(c)3(i)s;c
And you have to return a new string based on old one but in accordance to a certain pattern (which is also a string), for example:
c?*
This pattern means that each block must start with a 'c', the '?' character must be replaced by some other letter, and finally '*' by an arbitrary number of letters.
So when the pattern is applied you return something like:
cdk;cciiis
Another example:
string: 2(a)bxaxb;ab
pattern: a?*b
result: aabxaxb
My very crude attempt resulted in this:
def switch(string,pattern):
    d = []
    for v in range(0,string):
        r = float("inf")
        for m in range (0,pattern):
            if pattern[m] == string[v]:
                d.append(pattern[m])
            elif string[m]==';':
                d.append(pattern[m])
            elif (pattern[m]=='?' & Character.isLetter(string.charAt(v))):
                d.append(pattern[m])
    return d
Tips?

To split a string you can use the split() method.
For pattern detection in strings you can use regular expressions (regex) with the re library.
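Putting those two tips together might look like the sketch below. The rules are inferred from the two examples in the question, so they are assumptions: `n(x)` expands to `x` repeated `n` times, `?` stands for exactly one letter, `*` for zero or more letters, and blocks that do not fit the pattern are dropped. The helper names `expand` and `pattern_to_regex` are made up for illustration.

```python
import re

def expand(block):
    """Expand run-length notation, e.g. '2(c)3(i)s' becomes 'cciiis'."""
    return re.sub(r"(\d+)\(([A-Za-z]+)\)",
                  lambda m: m.group(2) * int(m.group(1)),
                  block)

def pattern_to_regex(pattern):
    """Translate the exercise's '?' and '*' wildcards into a compiled regex."""
    parts = []
    for ch in pattern:
        if ch == "?":
            parts.append("[A-Za-z]")     # exactly one letter
        elif ch == "*":
            parts.append("[A-Za-z]*")    # any number of letters
        else:
            parts.append(re.escape(ch))  # literal character
    return re.compile("".join(parts))

def switch(string, pattern):
    """Expand each ';'-separated block and keep those matching the pattern."""
    rx = pattern_to_regex(pattern)
    blocks = (expand(block) for block in string.split(";"))
    return ";".join(block for block in blocks if rx.fullmatch(block))

print(switch("cdk;2(c)3(i)s;c", "c?*"))   # cdk;cciiis
print(switch("2(a)bxaxb;ab", "a?*b"))     # aabxaxb
```

`fullmatch` is used so that a block must fit the whole pattern rather than merely start with it.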

Finding patterns in HEX data with regex but getting duplicates

I have a regex python script to go over Hex data and find patterns which looks like this
r"(.{6,}?)\1{2,}"
All it does is look for hex strings at least 6 characters long that repeat, with at least two repetitions. My issue is that it also finds substrings inside larger strings it has already found.
For example, given "a00b00a00b00a00b00a00b00a00b00a00b00" it would find 2 instances of "a00b00a00b00a00b00" and 6 instances of "a00b00". How could I keep only the longest patterns found and skip looking for shorter ones, without adding more hardcoded parameters?

#!/usr/bin/python
import fnmatch

pattern_string = "abcdefabcdef"

def print_pattern(pattern, num):
    n = num
    # split the string into chunks of n characters (6 in this case)
    new_pat = [pattern[i:i+n] for i in range(0, len(pattern), n)]
    # hit counter for matches
    match = 0
    # stores the most recent chunk
    new_match = ""
    # loop through the chunks to see whether the first one repeats
    for new in new_pat:
        new_match = new
        print(new)
        # if the chunk matches the first chunk, count it
        if fnmatch.fnmatch(new, new_pat[0]):
            match += 1
    if match:
        print("Count: %d\nPattern:%s" % (match, new_match))
    # return the last chunk examined
    return new_match

print_pattern(pattern_string, 6)
A regex is better, but this was more fun to write.
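For the regex route, one option worth trying is to make the quantifier greedy, so that the engine prefers the longest repeating unit at each position instead of the shortest. This is a hedged sketch, not a full answer to the question: the name `longest_repeats` is hypothetical, and the minimum unit length of 6 and the `\1{2,}` repeat count are kept from the question's pattern.

```python
import re

# Greedy variant of the question's pattern: ".{6,}" instead of ".{6,}?".
LONGEST_REPEAT = re.compile(r"(.{6,})\1{2,}")

def longest_repeats(hex_string):
    """Return (unit, copies) for each non-overlapping run, left to right."""
    results = []
    for m in LONGEST_REPEAT.finditer(hex_string):
        unit = m.group(1)
        results.append((unit, len(m.group(0)) // len(unit)))
    return results

data = "a00b00" * 6
print(longest_repeats(data))  # [('a00b00a00b00', 3)]
```

Because `finditer` returns non-overlapping matches, shorter units inside a run that has already been consumed are not reported again. Note that `\1{2,}` requires at least three copies in total (the group plus two repeats), and greedy backtracking may be slow on long inputs.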

Replacing all numeric value to formatted string

What I am trying to do is:
Find out all the numeric values in a string.
input_string = "高露潔光感白輕悅薄荷牙膏100 79.80"
numbers = re.finditer(r'[-+]?[0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?',input_string)
for number in numbers:
    print ("{} start > {}, end > {}".format(number.group(), number.start(0), number.end(0)))
'''Output'''
>>100 start > 12, end > 15
>>79.80 start > 18, end > 23
And then I want to replace all the integer and float value to a certain format:
INT_(number of digits) and FLT_(number of decimal places)
eg. 100 -> INT_3 // 79.80 -> FLT_2
Thus, the expected output string is like this:
"高露潔光感白輕悅薄荷牙膏INT_3 FLT_2"
But the string replace method in Python is kind of weird, and I can't achieve what I want with it.
So I am trying to use the substring append substring methods
string[:number.start(0)] + "INT_%s"%len(number.group()) +.....
which looks stupid and most importantly I still can't make it work.
Can anyone give me some advice on this problem?

Use re.sub and a callback method inside where you can perform various manipulations on the match:
import re
def repl(match):
    chunks = match.group(1).split(".")
    if len(chunks) == 2:
        return "FLT_{}".format(len(chunks[1]))
    else:
        return "INT_{}".format(len(chunks[0]))
input_string = "高露潔光感白輕悅薄荷牙膏100 79.80"
result = re.sub(r'[-+]?([0-9]*\.?[0-9]+)(?:[eE][-+]?[0-9]+)?',repl,input_string)
print(result)
Details:
The regex now has a capturing group around the number part (([0-9]*\.?[0-9]+)); its contents are analyzed inside the repl method.
Inside the repl method, the contents of Group 1 are split on . to see whether we have a float/double. If so, we return the length of the fractional part; otherwise, the length of the integer.

You need to group the parts of your regex possibly like this
import re
def repl(m):
    if m.group(1) is None: #int
        return ("INT_%i"%len(m.group(2)))
    else: #float
        return ("FLT_%i"%(len(m.group(2))))
input_string = "高露潔光感白輕悅薄荷牙膏100 79.80"
numbers = re.sub(r'[-+]?([0-9]*\.)?([0-9]+)([eE][-+]?[0-9]+)?',repl,input_string)
print(numbers)
group 0 is the whole string that was matched (can be used for putting into float or int)
group 1 is any digits before the . and the . itself if exists else it is None
group 2 is all digits after the . if it exists else it is just all digits
group 3 is the exponential part if existing else None
You can get a python-number from it with
def parse(m):
    s=m.group(0)
    if m.group(1) is not None or m.group(3) is not None: # if there is a dot or an exponential part it must be a float
        return float(s)
    else:
        return int(s)
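As a usage sketch, `parse` can be applied to every match from `re.finditer` to collect the numbers as Python values. The compiled `NUMBER` name and the `values` list are introduced here only for illustration.

```python
import re

# Same grouped pattern as in the answer above, compiled once.
NUMBER = re.compile(r'[-+]?([0-9]*\.)?([0-9]+)([eE][-+]?[0-9]+)?')

def parse(m):
    s = m.group(0)
    # A dot (group 1) or an exponent (group 3) means the value is a float.
    if m.group(1) is not None or m.group(3) is not None:
        return float(s)
    return int(s)

input_string = "高露潔光感白輕悅薄荷牙膏100 79.80"
values = [parse(m) for m in NUMBER.finditer(input_string)]
print(values)  # [100, 79.8]
```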

You are probably looking for something like the code below (of course there are other ways to do it). It starts from what you were doing and shows how it can be done.
import re
input_string = u"高露潔光感白輕悅薄荷牙膏100 79.80"
numbers = re.finditer(r'[-+]?[0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?',input_string)
s = input_string
for m in list(numbers)[::-1]:
    num = m.group(0)
    if '.' in num:
        s = "%sFLT_%s%s" % (s[:m.start(0)],str(len(num)-num.index('.')-1),s[m.end(0):])
    else:
        s = "%sINT_%s%s" % (s[:m.start(0)],str(len(num)), s[m.end(0):])
print(s)
This may look a bit complicated because there are really several simple problems to solve.
For instance, your initial regex finds both ints and floats, but you wish to apply totally different replacements afterward. This would be much more straightforward if you were doing only one thing at a time. But as parts of floats may look like an int, doing everything at once may not be such a bad idea; you just have to understand that this will lead to a secondary check to discriminate both cases.
Another, more fundamental issue is that you really can't replace anything in a Python string. Python strings are immutable objects, hence you have to make a copy. This is fine anyway, because the format change may need insertion or removal of characters, and an in-place replacement wouldn't be efficient.
The last trouble to take into account is that replacement must be made backward, because if you change the beginning of the string the match position would also change and the next replacement wouldn't be at the right place. If we do it backward, all is fine.
Of course I agree that using re.sub() is much simpler.
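An alternative to replacing backward is to walk the matches forward, collect the untouched text between them together with the replacements, and join everything once at the end. The sketch below shows that variant. The function name `tag_numbers` is made up, and the INT_/FLT_ rules are the same as in the loop above.

```python
import re

NUMBER = re.compile(r'[-+]?[0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?')

def tag_numbers(text):
    """Replace each number with INT_<digits> or FLT_<decimal places>."""
    pieces = []
    last = 0  # end of the previous match in the ORIGINAL string
    for m in NUMBER.finditer(text):
        pieces.append(text[last:m.start()])  # text between matches
        num = m.group(0)
        if '.' in num:
            pieces.append("FLT_%d" % (len(num) - num.index('.') - 1))
        else:
            pieces.append("INT_%d" % len(num))
        last = m.end()
    pieces.append(text[last:])  # tail after the final match
    return "".join(pieces)

print(tag_numbers("高露潔光感白輕悅薄荷牙膏100 79.80"))
```

Since all positions refer to the original string, which is never modified, the offsets stay valid regardless of how long each replacement is.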

Python: How to produce only in sequence combinations from a list of string parts, with use being optional

I would like to know how I can produce only in sequence combinations from a list of string parts, with use being optional. I need to do this in Python.
For example:
Charol(l)ais (cattle) is my complete string, with the parts in brackets being optional.
From this I would like to produce the following output as an iterable:
Charolais
Charollais
Charolais cattle
Charollais cattle
Was looking at Python's itertools module, since it has combinations; but couldn't figure out how to use this for my scenario.

You will need to convert the string into a more sensible format. For example, a tuple of all of the options for each part:
words = [("Charol",), ("l", ""), ("ais ",), ("cattle", "")]
And you can easily put them back together:
for p in itertools.product(*words):
    print("".join(p))
To create the list, parse the string, e.g.:
base = "Charol(l)ais (cattle)"
words = []
start = 0
for i, c in enumerate(base):
    if c == "(":
        words.append((base[start:i],))
        start = i + 1
    elif c == ")":
        words.append((base[start:i], ""))
        start = i + 1
if start < len(base):
    words.append((base[start:],))
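The parsing and the product step can also be combined into one generator. The sketch below uses `re.split` with a capturing group to build the same kind of options list. The function name `expand_optional` is hypothetical, and it assumes the parentheses are balanced and not nested.

```python
import itertools
import re

def expand_optional(text):
    """Yield every variant of text with each parenthesised part kept or dropped."""
    # re.split with a capturing group alternates fixed text (even indexes)
    # and the contents of each pair of parentheses (odd indexes).
    parts = re.split(r"\(([^)]*)\)", text)
    options = [(p, "") if i % 2 else (p,) for i, p in enumerate(parts)]
    for combo in itertools.product(*options):
        # strip() removes the trailing space left when "(cattle)" is dropped
        yield "".join(combo).strip()

for variant in expand_optional("Charol(l)ais (cattle)"):
    print(variant)
```

Only whitespace at the ends is stripped, so an optional part in the middle of a phrase could still leave a double space behind.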

You could use the permutations from itertools and denote your optional strings with a special character. Then, you can replace those either with the correct character or an empty string. Or carry on from this idea depending on the exact semantics of your task at hand.
