Constructing a phylogentic tree

Constructing a phylogentic tree - python

I have a list of list of lists like this
matches = [[['rootrank', 'Root'], ['domain', 'Bacteria'], ['phylum', 'Firmicutes'], ['class', 'Clostridia'], ['order', 'Clostridiales'], ['family', 'Lachnospiraceae'], ['genus', 'Lachnospira']],
[['rootrank', 'Root'], ['domain', 'Bacteria'], ['phylum', '"Proteobacteria"'], ['class', 'Gammaproteobacteria'], ['order', '"Vibrionales"'], ['family', 'Vibrionaceae'], ['genus', 'Catenococcus']],
[['rootrank', 'Root'], ['domain', 'Archaea'], ['phylum', '"Euryarchaeota"'], ['class', '"Methanomicrobia"'], ['order', 'Methanomicrobiales'], ['family', 'Methanomicrobiaceae'], ['genus', 'Methanoplanus']]]
And I want to construct a phylogenetic tree from them. I wrote a node class like so (based partially on this code):
class Node(object):
"""Generic n-ary tree node object
Children are additive; no provision for deleting them."""
def __init__(self, parent, category=None, name=None):
self.parent = parent
self.category = category
self.name = name
self.childList = []
if parent is None:
self.birthOrder = 0
else:
self.birthOrder = len(parent.childList)
parent.childList.append(self)
def fullPath(self):
"""Returns a list of children from root to self"""
result = []
parent = self.parent
kid = self
while parent:
result.insert(0, kid)
parent, kid = parent.parent, parent
return result
def ID(self):
return '{0}|{1}'.format(self.category, self.name)
And then I try to construct my tree like this:
node = None
for match in matches:
for branch in match:
category, name = branch
node = Node(node, category, name)
print [n.ID() for n in node.fullPath()]
This works for the first match, but when I start with the second match it is appended at the end of the tree instead of starting again at the top. How would I do that? I tried some variations on searching for the ID, but I can't get it to work.

I would highly recommend using a phylogenetics library like Dendropy.
The 'standard way of writing phylogenetic trees is with the Newick format (parenthetical statements like ((A,B),C)). If you use Dendropy, reading that tree would be as simple as
>>> import dendropy
>>> tree1 = dendropy.Tree.get_from_string("((A,B),(C,D))", schema="newick")
or to read from a stream
>>> tree1 = dendropy.Tree(stream=open("mle.tre"), schema="newick")
The creator of the library maintains a nice tutorial too.

The issue is that node is always the bottommost node in the tree, and you are always appending to that node. You need to store the root node. Since ['rootrank', 'Root'] appears at the beginning of each of the lists, I'd recommend pulling that out and using it as the root. So you can do something like:
rootnode = Node(None, 'rootrank', 'Root')
for match in matches:
node = rootnode
for branch in match:
category, name = branch
node = Node(node, category, name)
print [n.ID() for n in node.fullPath()]
This will make the matches list more readable, and gives the expected output.

Do yourself a favor and don't reinvent the wheel. Python-graph (a.k.a. pygraph) does all that you ask here and most of the things that you'll ask next.

Related

Modifying Instance Attribute Unexpected Behavior

I'm trying to implement a Graph/Node data structure to store some connections between synsets in WordNet.
import nltk
from nltk.corpus import wordnet as wn
# a dict of [wn.synset, Node]
syns_node_dict = []
# gets a node from a wn.synset
def nodeFromSyn(syn):
for row in syns_node_dict:
if row[0] == syn:
return row[1]
return False
def display_dict():
for row in syns_node_dict:
curr_nde = row[1]
display([curr_nde._value, curr_nde._parents, curr_nde._children])
# Graph Node data struct
class Node:
# value is a wn.synset
# parents, children are both lists of wn.synset
def __init__(self, value, parents=[], children=[]):
curr = nodeFromSyn(value)
# if a Node for value already exists, merge attributes
if curr:
for p in parents:
if p not in curr._parents : curr._parents.append(p)
for c in children:
if c not in curr._children : curr._children.append(c)
self = curr
# if a Node for value does not exist, create new Node for value
else:
syns_node_dict.append([value, self])
self._value = value
self._parents = parents
self._children = children
# Create a Node for each of self's parents if it does not already exist
# and add self as a child
for parent in self._parents:
parent_node = nodeFromSyn(parent)
if parent_node:
if value not in parent_node._children:
parent_node._children.append(value)
else:
parent_node = Node(parent, children=[value])
# Create a Node for each of self's children if it does not already exist
# and add self as a parent
for child in self._children:
child_node = nodeFromSyn(child)
if child_node:
if value not in child_node._parents:
child_node._parents.append(value)
else:
child_node = Node(child, parents=[value])
However, I'm seeing some strange behavior when I attempt to run my code:
Node(wn.synset('condition.n.01'), [], children=[wn.synset('difficulty.n.03')])
display_dict()
>[Synset('condition.n.01'), [], [Synset('difficulty.n.03')]]
>[Synset('difficulty.n.03'), [Synset('condition.n.01')], []]
Node(wn.synset('state.n.02'), children=[wn.synset('condition.n.01')])
display_dict()
>[Synset('condition.n.01'), [Synset('state.n.02')], [Synset('difficulty.n.03')]]
>[Synset('difficulty.n.03'), [Synset('condition.n.01')], []]
>[Synset('state.n.02'), [], [Synset('condition.n.01')]]
When I run the following lines of code I expect the last line to be
[Synset('attribute.n.02'), [], [Synset('state.n.02')]].
Why does the Node for wn.synset('attribute.n.02') contain itself among its parents?
Node(wn.synset('attribute.n.02'), children=[wn.synset('state.n.02')])
display_dict()
>[Synset('condition.n.01'), [Synset('state.n.02')], [Synset('difficulty.n.03')]]
>[Synset('difficulty.n.03'), [Synset('condition.n.01')], []]
>[Synset('state.n.02'), [Synset('attribute.n.02')], [Synset('condition.n.01')]]
>[Synset('attribute.n.02'), [Synset('attribute.n.02')], [Synset('state.n.02')]]
I isolated the issue to be within the block below but I can't figure out what's causing this behavior.
# Create a Node for each of self's children if it does not already exist
# and add self as a parent
for child in self._children:
child_node = nodeFromSyn(child)
if child_node:
if value not in child_node._parents:
child_node._parents.append(value)
else:
child_node = Node(child, parents=[value])

The reason this is happening is because default parameters are evaluated once when the function is defined, not each time it is called. Because of this, all the nodes you create with the parents (or children) parameter blank are actually sharing the same list for their parent list. For the first two nodes this isn't a problem because you specify an empty list for the first node, and the first node creates a list for the second, but all the nodes after that are sharing the same list (which becomes more obvious if you keep creating more nodes).
A way to solve this problem outlined here is to make the default parameters None, then check if the parameter is none and create the empty list if so:
# modify the init function so that the default parameters are None
def __init__(self, value, parents=None, children=None):
# if they are left as None, set them to an empty list
if parents is None:
parents = []
if children is None:
children = []
# The rest of your code here...

defining multi branching nested tree in python recursively

I am trying to implement some basic recursive structure in Python, but without great success. I have a tree represented in form of nested lists like this:
ex = ['A',
['A1',
['A11', 'tag'],
['A12', 'tag'],
['A13',
['A131', 'tag'],
['A132',
['A1321', 'tag'],
['A1322', 'tag']]]],
['A2', 'tag'],
['A3',
['A31',
['A311', 'tag'],
['A312', 'tag']],
['A32', 'tag'],
['A33',
['A331',
['A3311', 'tag'],
['A3312', 'tag']]],
['A34', 'tag'],
['A35',
['A351', 'tag'],
['A352',
['A3521', 'tag'],
['A3522', 'tag']]]],
['A4', 'tag']]
and I have defined a Node class that allows specifying a tag 'A', 'A1', ... and adding children. Terminal nodes can be retrieved by noticing that children is not a list.
class Node(object):
def __init__(self, tag, parent=None, children=[]):
self.tag = tag
self.parent = parent
self.children = children
self.is_terminal = False if isinstance(children, list) else True
def add_child(self, node):
if not isinstance(node, Node):
raise ValueError("Cannot append node of type: [%s]" % type(node))
if self.is_terminal:
raise ValueError("Cannot append node to terminal")
else:
self.children.append(node)
Now I am having trouble implementing a function that would recursively transform the list-based tree into a Node-based one:
tree = Node(tag='A',
children=[Node(tag='A1',
children=[Node(tag='A11',
children='tag'),
Node(tag='A12',
children='tag'),
...]),
...])
This is my attempt so far based on the idea that at each position in the nested list, we might have either a terminal node, in which case we just add it to the root, or a non-terminal, in which case we extract the respective root tag and iterate over children recursively. When list is empty we return control to caller.
My feeling is that the coding style is perhaps not the best suitable for Python, but I would like to know what I am missing more concretely.
def is_terminal(e):
return len(e) == 2 and type(e[0]) == str and type(e[1]) == str
def from_list(lst, root):
lst = list(lst) # avoid mutating input list
if not lst:
return
for e in lst:
if is_terminal(e):
tag, children = e
print "terminal", tag, "with root", root.tag
root.add_child(Node(tag=tag, children=children, parent=root))
else:
e = list(e)
tag, children = e.pop(0), e
print "non terminal", tag, "with root", root.tag
root = Node(tag=tag, parent=root)
from_list(children, root=root)
It has a number of problems. For instance, it looses track of the highest root 'A' -i.e. A2 gets A1 as root. It also flattens out the tree to a Node with 16 children, one per terminal node, and goes into infinite recursion.
I'd appreciate any type of hints.

I finally found out the problem, which turned out to be partially a missing point in the algorithm and partially a misunderstanding in the way Python lists work.
The algorithm was loosing track of the highest root, because I wasn't adding the sub root in the else statement. This new version solves the issue.
else:
e = list(e)
tag, children = e.pop(0), e
print "non terminal", tag, "with root", root.tag
subroot = Node(tag=tag, parent=root)
root.add_child(subroot) #
from_list(children, root=subroot)
The problem with the flattening was actually the fact that I was using [] as default argument in the Node class definition. As explained here, the default empty list is being created only the first function call (or class instantiation in this case) and not every time the function is called.
Therefore, the children list of the root was getting added all sub children (and hence the flattening effect), and the children list of all sub children was getting modified every time the root children list was modified - hence the infinite recursion.
It turns out to be more of a Python-gotcha, rather than an algorithm definition problem.
For the record, the complete corrected version of the code:
class Node(object):
def __init__(self, tag, parent=None, children=None):
self.tag = tag
self.parent = parent
self.children = [] if not children else children
self.is_terminal = False if isinstance(self.children, list) else True
def add_child(self, node):
if not isinstance(node, Node):
raise ValueError("Cannot append node of type: [%s]" % type(node))
if self.is_terminal:
raise ValueError("Cannot append node to terminal")
else:
self.children.append(node)
def is_terminal(e):
return len(e) == 2 and type(e[0]) == str and type(e[1]) == str
def from_list(lst, root):
lst = list(lst)
if not lst:
return
for e in lst:
if is_terminal(e):
tag, children = e
print "terminal", tag, "with root", root.tag
root.add_child(Node(tag=tag, children=children, parent=root))
else:
e = list(e)
tag, children = e.pop(0), e
print "non terminal", tag, "with root", root.tag
newroot = Node(tag=tag, parent=root)
root.add_child(newroot)
from_list(children, root=newroot)
And here is how you call it:
root = Node(tag=ex[0])
from_list(ex[1:], root=root)

Implementing a depth-first tree iterator in Python

I'm trying to implement an iterator class for not-necessarily-binary trees in Python. After the iterator is constructed with a tree's root node, its next() function can be called repeatedly to traverse the tree in depth-first order (e.g., this order), finally returning None when there are no nodes left.
Here is the basic Node class for a tree:
class Node(object):
def __init__(self, title, children=None):
self.title = title
self.children = children or []
self.visited = False
def __str__(self):
return self.title
As you can see above, I introduced a visited property to the nodes for my first approach, since I didn't see a way around it. With that extra measure of state, the Iterator class looks like this:
class Iterator(object):
def __init__(self, root):
self.stack = []
self.current = root
def next(self):
if self.current is None:
return None
self.stack.append(self.current)
self.current.visited = True
# Root case
if len(self.stack) == 1:
return self.current
while self.stack:
self.current = self.stack[-1]
for child in self.current.children:
if not child.visited:
self.current = child
return child
self.stack.pop()
This is all well and good, but I want to get rid of the need for the visited property, without resorting to recursion or any other alterations to the Node class.
All the state I need should be taken care of in the iterator, but I'm at a loss about how that can be done. Keeping a visited list for the whole tree is non-scalable and out of the question, so there must be a clever way to use the stack.
What especially confuses me is this--since the next() function, of course, returns, how can I remember where I've been without marking anything or using excess storage? Intuitively, I think of looping over children, but that logic is broken/forgotten when the next() function returns!
UPDATE - Here is a small test:
tree = Node(
'A', [
Node('B', [
Node('C', [
Node('D')
]),
Node('E'),
]),
Node('F'),
Node('G'),
])
iter = Iterator(tree)
out = object()
while out:
out = iter.next()
print out

If you really must avoid recursion, this iterator works:
from collections import deque
def node_depth_first_iter(node):
stack = deque([node])
while stack:
# Pop out the first element in the stack
node = stack.popleft()
yield node
# push children onto the front of the stack.
# Note that with a deque.extendleft, the first on in is the last
# one out, so we need to push them in reverse order.
stack.extendleft(reversed(node.children))
With that said, I think that you're thinking about this too hard. A good-ole' (recursive) generator also does the trick:
class Node(object):
def __init__(self, title, children=None):
self.title = title
self.children = children or []
def __str__(self):
return self.title
def __iter__(self):
yield self
for child in self.children:
for node in child:
yield node
both of these pass your tests:
expected = ['A', 'B', 'C', 'D', 'E', 'F', 'G']
# Test recursive generator using Node.__iter__
assert [str(n) for n in tree] == expected
# test non-recursive Iterator
assert [str(n) for n in node_depth_first_iter(tree)] == expected
and you can easily make Node.__iter__ use the non-recursive form if you prefer:
def __iter__(self):
return node_depth_first_iter(self)

That could still potentially hold every label, though. I want the
iterator to keep only a subset of the tree at a time.
But you already are holding everything. Remember that an object is essentially a dictionary with an entry for each attribute. Having self.visited = False in the __init__ of Node means you are storing a redundant "visited" key and False value for every single Node object no matter what. A set, at least, also has the potential of not holding every single node ID. Try this:
class Iterator(object):
def __init__(self, root):
self.visited_ids = set()
...
def next(self):
...
#self.current.visited = True
self.visited_ids.add(id(self.current))
...
#if not child.visited:
if id(child) not in self.visited_ids:
Looking up the ID in the set should be just as fast as accessing a node's attribute. The only way this can be more wasteful than your solution is the overhead of the set object itself (not its elements), which is only a concern if you have multiple concurrent iterators (which you obviously don't, otherwise the node visited attribute couldn't be useful to you).

My recursive function (populates a tree structure) is adding to the root node during every loop/call

I have an algorithm to populate a tree like structure (class: Scan_instance_tree), but unfortunately, during each call, it is incorrectly adding to the root node's children, as well as to the new child nodes created further down in the tree.
As a clue, I saw another thread...
Persistent objects in recursive python functions
...where this problem was mentioned briefly, and it was suggested that the parameters passed had to be mutable. Is that the answer, and how would I do this, in this example???
Here is my current code:
class Field_node(object):
field_phenotype_id = -1
field_name = ''
field_parent_id = -1
child_nodes = []
class Scan_instance_tree(object):
root_node = None
def __init__(self, a_db):
self.root_node = Field_node()
scan_field_values = self.create_scan_field_values(a_db) # This just creates a temporary user-friendly version of a database table
self.build_tree(scan_field_values)
def build_tree(self, a_scan_field_values):
self.root_node.field_name = 'ROOT'
self.add_child_nodes(a_scan_field_values, self.root_node)
def add_child_nodes(self, a_scan_field_values, a_parent_node):
i = 0
while i < len(a_scan_field_values):
if a_scan_field_values[i]['field_parent_dependancy'] == a_parent_node.field_phenotype_id:
#highest_level_children.append(a_scan_field_values.pop(a_scan_field_values.index(scan_field)))
child_node = Field_node()
child_node.field_phenotype_id = a_scan_field_values[i]['field_phenotype_id']
child_node.field_name = a_scan_field_values[i]['field_name']
child_node.field_parent_dependancy = a_scan_field_values[i]['field_parent_dependancy']
a_parent_node.child_nodes.append(child_node)
a_scan_field_values.remove(a_scan_field_values[i])
# RECURSION: get the child nodes
self.add_child_nodes(a_scan_field_values, child_node)
else:
i = i+1
If I remove the recursive call to self.add_child_nodes(...), the root's children are added correctly, ie they only consist of those nodes where the field_parent_dependancy = -1
If I allow the recursive call, the root's children contain all the nodes, regardless of the field_parent_dependancy value.
Best regards
Ann

When you define your Field_node class, the line
child_nodes = []
is actually instantiating a single list as a class attribute, rather than an instance attribute, that will be shared by all instances of the class.
What you should do instead is create instance attributes in __init__, e.g.:
class Field_node(object):
def __init__(self):
self.field_phenotype_id = -1
self.field_name = ''
self.field_parent_id = -1
self.child_nodes = []

Python For...loop iteration

Alright,
I have this program to sparse code in Newick Format, which extracts both a name, and a distance for use in a phylogenetic tree diagram.
What my problem is, in this branch of code, as the program reads through the newickNode function, it assigns the name and distance to the 'node' variable, then returns it back into the 'Node' class to be printed, but it seems to only print the first node 'A', and skips the other 3.
Is there anyway to finish the for loop in newickNode to read the other 3 nodes and print them accordingly with the first?
class Node:
def __init__(self, name, distance, parent=None):
self.name = name
self.distance = distance
self.children = []
self.parent = parent
def displayNode(self):
print "Name:",self.name,",Distance:",self.distance,",Children:",self.children,",Parent:",self.parent
def newickNode(newickString, parent=None):
String = newickString[1:-1].split(',')
for x in String:
splitString = x.split(':')
nodeName = splitString[0]
nodeDistance = float(splitString[1])
node = Node(nodeName, nodeDistance, parent)
return node
Node1 = newickNode('(A:0.1,B:0.2,C:0.3,D:0.4)')
Node1.displayNode()
Thanks!

You could make it a generator:
def newickNode(newickString, parent=None):
String = newickString[1:-1].split(',')
for x in String:
splitString = x.split(':')
nodeName = splitString[0]
nodeDistance = float(splitString[1])
node = Node(nodeName, nodeDistance, parent)
yield node
for node in newickNode('(A:0.1,B:0.2,C:0.3,D:0.4)'):
node.displayNode()
The generator will return one node at a time and pause within the function, and then resume when you want the next one.
Or just save them up and return them
def newickNode(newickString, parent=None):
String = newickString[1:-1].split(',')
nodes = []
for x in String:
splitString = x.split(':')
nodeName = splitString[0]
nodeDistance = float(splitString[1])
node = Node(nodeName, nodeDistance, parent)
nodes.append(node)
return nodes

Your newickNode() function should accumulate a list of nodes and return that, rather than returning the first node created. If you're going to do that, why have a loop to begin with?
def newickNodes(newickString, parent=None):
nodes = []
for node in newickString[1:-1].split(','):
nodeName, nodeDistance = node.split(':')
nodes.append(Node(nodeName, nodeDistance, parent))
return nodes
Alternatively, you could write it as a generator that yields the nodes one at a time. This would allow you to easily iterate over them or convert them to a list depending on your needs.
def newickNodes(newickString, parent=None):
for node in newickString[1:-1].split(','):
nodeName, nodeDistance = node.split(':')
yield Node(nodeName, nodeDistance, parent)
Also, from a object-oriented design POV, this should probably be a class method on your Node class named parseNewickString() or similar.

Alternatively, your newickNode() function could immediately call node.displayNode() on the new node each time through the loop.

To keep this more flexible - I would use pyparsing to process the Newick text and networkx so I had all the graph functionality I could desire - recommend to easy_install/pip those modules. It's also nice that someone has written a parser with node and tree creation already (although it looks like it lacks some features, it'll work for your case):
http://code.google.com/p/phylopy/source/browse/trunk/src/phylopy/newick.py?r=66

The first time through your for: loop, you return a node, which stops the function executing.
If you want to return a list of nodes, create the list at the top of the function, append to it each time through the loop, and return the list when you're done.
It may make more sense to move the loop outside of the newickNode function, and have that function only return a single node as its name suggests.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Constructing a phylogentic tree - python

Do yourself a favor and don't reinvent the wheel. Python-graph (a.k.a. pygraph) does all that you ask here and most of the things that you'll ask next.

Related

Modifying Instance Attribute Unexpected Behavior

defining multi branching nested tree in python recursively

Implementing a depth-first tree iterator in Python

My recursive function (populates a tree structure) is adding to the root node during every loop/call

Python For...loop iteration

Categories

Resources