Follow-up on iterating over a graph using XML minidom

Follow-up on iterating over a graph using XML minidom - python

This is a follow-up to the question (Link)
What I intend on doing is using the XML to create a graph using NetworkX. Looking at the DOM structure below, all nodes within the same node should have an edge between them, and all nodes that have attended the same conference should have a node to that conference. To summarize, all authors that worked together on a paper should be connected to each other, and all authors who have attended a particular conference should be connected to that conference.
<conference name="CONF 2009">
<paper>
<author>Yih-Chun Hu(UIUC)</author>
<author>David McGrew(Cisco Systems)</author>
<author>Adrian Perrig(CMU)</author>
<author>Brian Weis(Cisco Systems)</author>
<author>Dan Wendlandt(CMU)</author>
</paper>
<paper>
<author>Dan Wendlandt(CMU)</author>
<author>Ioannis Avramopoulos(Princeton)</author>
<author>David G. Andersen(CMU)</author>
<author>Jennifer Rexford(Princeton)</author>
</paper>
</conference>
I've figured out how to connect authors to conferences, but I'm unsure about how to connect authors to each other. The thing that I'm having difficulty with is how to iterate over the authors that have worked on the same paper and connect them together.
dom = parse(filepath)
conference=dom.getElementsByTagName('conference')
for node in conference:
conf_name=node.getAttribute('name')
print conf_name
G.add_node(conf_name)
#The nodeValue is split in order to get the name of the author
#and to exclude the university they are part of
plist=node.getElementsByTagName('paper')
for p in plist:
author=str(p.childNodes[0].nodeValue)
author= author.split("(")
#Figure out a way to create edges between authors in the same <paper> </paper>
alist=node.getElementsByTagName('author')
for a in alist:
authortext= str(a.childNodes[0].nodeValue).split("(")
if authortext[0] in dict:
edgeQuantity=dict[authortext[0]]
edgeQuantity+=1
dict[authortext[0]]=edgeQuantity
G.add_edge(authortext[0],conf_name)
#Otherwise, add it to the dictionary and create an edge to the conference.
else:
dict[authortext[0]]= 1
G.add_node(authortext[0])
G.add_edge(authortext[0],conf_name)
i+=1

I'm unsure about how to connect authors to each other.
You need to generate (author, otherauthor) pairs so you can add them as edges. The typical way to do that would be a nested iteration:
for thing in things:
for otherthing in things:
add_edge(thing, otherthing)
This is a naïve implementation that includes self-loops (giving an author an edge connecting himself to himself), which you may or may not want; it also includes both (1,2) and (2,1), which if you're doing an undirected graph is redundant. (In Python 2.6, the built-in permutations generator also does this.) Here's a generator that fixes these things:
def pairs(l):
for i in range(len(l)-1):
for j in range(i+1, len(l)):
yield l[i], l[j]
I've not used NetworkX, but looking at the doc it seems to say you can call add_node on the same node twice (with nothing happening the second time). If so, you can discard the dict you were using to try to keep track of what nodes you'd inserted. Also, it seems to say that if you add an edge to an unknown node, it'll add that node for you automatically. So it should be possible to make the code much shorter:
for conference in dom.getElementsByTagName('conference'):
var conf_name= node.getAttribute('name')
for paper in conference.getElementsByTagName('paper'):
authors= paper.getElementsByTagName('author')
auth_names= [author.firstChild.data.split('(')[0] for author in authors]
# Note author's conference attendance
#
for auth_name in auth_names:
G.add_edge(auth_name, conf_name)
# Note combinations of authors working on same paper
#
for auth_name, other_name in pairs(auth_names):
G.add_edge(auth_name, otherauth_name)

im not entirely sure what you're looking for, but based on your description i threw together a graph which I think encapsulates the relationships you describe.
http://imgur.com/o2HvT.png
i used openfst to do this. i find it much easier to clearly layout the graphical relationships before plunging into the code for something like this.
also, do you actually need to generate an explicit edge between authors? this seems like a traversal issue.

Related

pyosmium - Build a Geojson linestring based on OSM Relation

I have a python script to analyse OSM data, and the objective is to build a GeoJson with specific data issued from OSM relation.
I'm currently focusing on OSM relation that represents 'hiking' trail like this one.
According to the document:
members
(read-only) Ordered list of relation members. See osmium.osm.RelationMemberList.
the relation object has an attribute members which collects all members of the relation.
Hence The first part of the script manages to extract all relation that have a tag sac_scale=hiking and collects all its ways.
The following script is on purpose focusing only on 1 specific relation : r104369
class HikingWaysfromRelations(osmium.SimpleHandler):
def __init__(self):
super(HikingWaysfromRelations, self).__init__()
self.dict = {}
def _getWays(self, elem, elem_type):
# tag
if 'sac_scale' in elem.tags and elem.tags['sac_scale']=='hiking' and elem.id==104369:
list=[]
for mem in elem.members:
if mem.type=="w":
list.append(str(mem.ref))
self.dict["r"+str(elem.id)]=list
else:
pass
def relation(self,r):
self._getWays(r, "relation")
ml = HikingWaysfromRelations()
ml.apply_file('../pbf/new-zealand.osm.pbf')
The result is a dictionary containing the expected relation as the only key, and its ways:
{"r104369": ["191668175", "765285136", "765285135", "765285138", "765285139", "191668225", "765542429", "765542430", "765542432", "765542431", "765542435", "765542436", "765542434", "765542433", "765542437", "765542438", "765542439", "765542440", "765542441", "765542442", "765548983", "271277386", "765548985", "765548984", "684295241", "684295239", "464603363", "464603364", "464607430", "299788481", "178920047", "155711655", "155711646", "684294192", "259362037", "684294189", "259362038", "259362041", "259362036", "259362043", "259362039", "259362040"]}
Now the question is: How to build a GeoJson containing a single Feature MultiLineString that connects all those ways and rebuild the expected hiking trail?
Based on what I've found on the net, I should re-run a simpleHander on the full .pbf file, and each time I encounter a way I'm looking for - based on the values of the above dictionary - I could reconstruct a LineString with:
import shapely.wkb as wkblib
wkbfab = osmium.geom.WKBFactory()
def getRelationGeometry(elem):
wkb=wkbfab.create_linestring(elem)
return wkb
The issue is that it looks like some ways have only 1 node, hence triggering following error:
RuntimeError: need at least two points for linestring (way_id=155711655)
So what would be the solution to re-build a GeoJson feature - multiLineString - of multiple ways, to be able to plot the result on https://geojson.io/#map=2/20.0/0.0 ?
How for instance openstreetmap manages to re-build the track of a relation when I hit link if not by connecting all nodes (from all ways) issued from the relation ?
Thanks a lot for your help
I know there is way with bash, where you first filter the initial pbf by keeping only relation with the tag sac_scale=hiking, and then transforming this filtered result to GeoJson - but I really want to be able to generate the same with python to understand how OSM data are stored. I just can't figure out an easy way to do so, knowing pyosmium is equivalent (supposedly) to osmium, I believe there should be an easy way there too
osmium export output/output_food-drinks.pbf -f geojson

Looking at the way with the id shown in the error in your post (155711655), it has two nodes, not one. Visible here as of the time of this answer.
Knowing that, I can think of two reasons why you would get that error:
You're not passing in the argument location=True to the apply_file method as suggested by the documentation:
Because of the way that OSM data is structured, osmium needs to internally cache node geometries, when the handler wants to process the geometries of ways and areas. The apply_file() method cannot deduce by itself if this cache is needed. Therefore locations need to be explicitly enabled by setting the locations parameter to True:
h.apply_file("test.osm.pbf", locations=True, idx='flex_mem')
Looking at your code above, the apply_file method only has the input file as an argument, so I think this is likely your problem.
The way may be referencing a node that is missing in your pbf extract. This is simple to verify with the osmium cli tool:
osmium check-refs <your pbf file>
This is the result I get from running that on a valid pbf of my own
There are 6702885 nodes, 431747 ways, and 2474 relations in this file.
Nodes in ways missing: 0
Note the Nodes in ways missing: 0.

Python friends network visualization

I have hundreds of lists (each list corresponds to 1 person). Each list contains 100 strings, which are the 100 friends of that person.
I want to 3D visualize this people network based on the number of common friends they have. Considering any 2 lists, the more same strings they have, the closer they should appear together in this 3D graph. I wanted to show each list as a dot on the 3D graph without nodes/connections between the dots.
For brevity, I have included only 3 people here.
person1 = ['mike', 'alex', 'arker','locke','dave','david','ross','rachel','anna','ann','darl','carl','karle']
person2 = ['mika', 'adlex', 'parker','ocke','ave','david','rosse','rachel','anna','ann','darla','carla','karle']
person3 = ['mika', 'alex', 'parker','ocke','ave','david','rosse','ross','anna','ann','darla','carla','karle', 'sasha', 'daria']

Gephi Setup steps:
Install Gephi and then start it
You probably want to upgrade all the plugins now, see the button in the lower right corner.
Now create a new project.
Make sure the current workspace is Workspace1
Enable Graph Streaming plugin
In Streaming tab that then appears configure server to use http and port 8080
start the server (it will then have a green dot underneath it instead of a red dot).
Python steps:
install gephistreamer package (pip install gephistreamer)
Copy the following python cod to something like friends.py:
from gephistreamer import graph
from gephistreamer import streamer
import random as rn
stream = streamer.Streamer(streamer.GephiWS(hostname="localhost",port=8080,workspace="workspace1"))
szfak = 100 # this scales up everything - somehow it is needed
cdfak = 3000
nodedict = {}
def addfnode(fname):
# grab the node out of the dictionary if it is there, otherwise make a newone
if (fname in nodedict):
nnode = nodedict[fname]
else:
nnode = graph.Node(fname,size=szfak,x=cdfak*rn.random(),y=cdfak*rn.random(),color="#8080ff",type="f")
nodedict[fname] = nnode # new node into the dictionary
return nnode
def addnodes(pname,fnodenamelist):
pnode = graph.Node(pname,size=szfak,x=cdfak*rn.random(),y=cdfak*rn.random(),color="#ff8080",type="p")
stream.add_node(pnode)
for fname in fnodenamelist:
print(pname+"-"+fname)
fnode = addfnode(fname)
stream.add_node(fnode)
pfedge = graph.Edge(pnode,fnode,weight=rn.random())
stream.add_edge(pfedge)
person1friends = ['mike','alex','arker','locke','dave','david','ross','rachel','anna','ann','darl','carl','karle']
person2friends = ['mika','adlex','parker','ocke','ave','david','rosse','rachel','anna','ann','darla','carla','karle']
person3friends = ['mika','alex','parker','ocke','ave','david','rosse','ross','anna','ann','darla','carla','karle','sasha','daria']
addnodes("p1",person1friends)
addnodes("p2",person2friends)
addnodes("p3",person3friends)
Run it with the command python friends.py
You should see all the nodes appear. There are then lots of ways you can lay it out to make it look better, I am using the Force Atlas layouter here and you can see the parameters I am using on the left.
Some notes:
you can get the labels to show or disappear by clicking on the T on the bottom status/control bar.
View the data in the nodes and edges by opening Window/Data Table.
It is a very rich program, there are more options than you can shake a stick at.
You can set more properties on your nodes and edges in the python code and then they will show up in the data table view and can be used to filter, etc.
You want to pay attention to that update button in the bottom right corner of Gephi, there are a lot of bugs to fix.
This will get you started (as you asked), but for your particular problem:
you will also need to calculate weights for your persons (the "p" nodes), and link them to each other with those weights
Then you need to find a layouter and paramters that positions those nodes the way you want them based on the new weights.
So you don't really need to show the type="f" nodes, you need just the "p" nodes.
The weight between to "p" nodes should be based on the intersection of the sets of the friend names.
There are also Gephi plugins that can then display this in 3D, but that is actually a completely separate issue, you probably want to get it working in 2D first.
This is running on Windows 10 using Anaconda 4.4.1 and Python 3.5.2 and Gephi 0.9.1.

Django finding paths between two vertexes in a graph

This is mostly a logical question, but the context is done in Django.
In our Database we have Vertex and Line Classes, these form a (neural)network, but it is unordered and I can't change it, it's a Legacy Database
class Vertex(models.Model)
code = models.AutoField(primary_key=True)
lines = models.ManyToManyField('Line', through='Vertex_Line')
class Line(models.Model)
code = models.AutoField(primary_key=True)
class Vertex_Line(models.Model)
line = models.ForeignKey(Line, on_delete=models.CASCADE)
vertex = models.ForeignKey(Vertex, on_delete=models.CASCADE)
Now, in the application, the user will be able to visually select TWO vertexes (the green circles below)
The javascript will then send the pk of these two Vertexes to Django, and it has to find the Line classes that satisfy a route between them, in this case, the following 4 red Lines :
Business Logic:
A Vertex can have 1-4 Lines related to it
A Line can have 1-2 Vertexes related to it
There will only be one possible route between two Vertexes
What I have so far:
I understand that the answer probably includes recursion
The path must be found by trying every path from one Vertex untill the other is find, it can't be directly found
Since there are four and three-way junctions, all the routes being tried must be saved throughout the recursion(unsure of this one)
I know the basic logic is looping through all the lines of each Vertex, and then get the other Vertex of these lines, and keep walking recursively, but I really don't know where to start on this one.
This is as far as I could get, but it probably does not help (views.py) :
def findRoute(request):
data = json.loads(request.body.decode("utf-8"))
v1 = Vertex.objects.get(pk=data.get('v1_pk'))
v2 = Vertex.objects.get(pk=data.get('v2_pk'))
lines = v1.lines.all()
routes = []
for line in lines:
starting_line = line
#Trying a new route
this_route_index = len(routes)
routes[this_route_index] = [starting_line.pk]
other_vertex = line.vertex__set.all().exclude(pk=v1.pk)
#There are cases with dead-ends
if other_vertex.length > 0:
#Mind block...

As you has pointed, this is not a Django/Python related question, but a logical/algorithmic matter.
To find paths between two vertexes in a graph you can use lot of algorithms: Dijkstra, A*, DFS, BFS, Floyd–Warshall etc.. You can choose depending on what you need: shortest/minimum path, all paths...
How to implement this in Django? I suggest to don't apply the algorithm over the models itself, since this could be expensive (in term of time, db queries, etc...) specially for large graphs; instead, I'd rather to map the graph in an in-memory data structure and execute the algorithm over it.
You can take a look to this Networkx, which is a very complete (data structure + algorithms) and well documented library; python-graph, which provides a suitable data structure and a whole set of important algorithms (including some of the mentioned above). More options at Python Graph Library

maya component id data

Does anyone know how to obtain the data in maya called the component ID of the vertices.
I know how to get the vert number but the component ID on the vert is something that changes as the model has been changed.
It seems that there is data in the vertex but just can't find any command to extract it. Any help would be helpful.
I even tried using the maya api but this also just seem to give me the vertices index number and not the actual ID (which is not a sequence as the vertices indexes)
Thanks

try that,
import maya.OpenMaya as om
sel = om.MSelectionList()
om.MGlobal.getActiveSelectionList(sel)
dag = om.MDagPath()
comp = om.MObject()
sel.getDagPath(0, dag, comp)
itr = om.MItMeshFaceVertex(dag, comp)
print '| %-15s| %-15s| %-15s' % ('Face ID', 'Object VertID', 'Face-relative VertId')
while not itr.isDone():
print '| %-15s| %-15s| %-15s' % (itr.faceId(), itr.vertId(), itr.faceVertId())
itr.next()
there are many solutions, i find this one.... src: Link

Maya components do not have persistent identities; the 'vertex id' is just the index of one entry in the vertex table (or the tables for faces, normals, etc). That's why its so easy to mess up a model with construction history if you go 'upstream' and change things that affect the model's component count or topology.
You can attach persistent data to a vertex using the PolyBlindData system, which attaches arbitrary info to faces, vertices or edges. You could attach data to a particular vertex and the data would probably survive, though the same considerations which can mess up things like vert colors or UVs when construction history changes upstream will also mess with blind data.

What is the correct graph data structure to differentiate between nodes with the same name?

I'm learning about graphs(they seem super useful) and was wondering if I could get some advice on a possible way to structure my graphs.
Simply, Lets say I get purchase order data everyday and some days its the same as the day before and on others its different. For example, yesterday I had an order of pencils and erasers, I create the two nodes to represent them and then today I get an order for an eraser and a marker, and so on. After each day, my program also looks to see who ordered what, and if Bob ordered a pencil yesterday and then an eraser today, it creates a directed edge. My logic for this is I can see who bought what on each day and I can track the purchase behaviour of Bob(and maybe use it to infer patterns with himself or other users).
My problem is, I'm using networkx(python) and creating a node 'pencil' for yesterday and then another node 'pencil' for day2 and I can't differentiate them.
I thought(and have been) naming it day2-pencil and then scanning the entire graph and stripping out the 'day2-' to track pencil orders. This seems wrong to me(not to mention expensive on the processor). I think the key would be if I can somehow mark each day as its own subgraph so when I want to study a specific day or a few days, I don't have to scan the entire graph.
As my test data gets larger, its getting more and more confusing so I am wondering what the best practice is? Any generate suggestions would be great(as networkx seems pretty full featured so they probably have a way of doing it).
Thanks in advance!
Update: Still no luck, but this maybe helpful:
import networkx as nx
G=nx.Graph()
G.add_node('pencil', day='1/1/12', colour='blue')
G.add_node('eraser', day='1/1/12', colour='rubberish colour. I know thats not a real colour')
G.add_node('pencil', day='1/2/12', colour='blue')
The result I get typing the following command G.node is:
{'pencil': {'colour': 'blue', 'day': '1/2/12'}, 'eraser': {'colour': 'rubberish colour. I know thats not a real colour', 'day': '1/1/12'}}
Its obviously overwriting the pencil from 1/1/12 with 1/2/12 one, not sure if I can make a distint one.

This is mostly depending on your goal actually. What you want to analyze is the definitive factor in your graph design. But, looking at your structure, a general structure would be nodes for Customers and Products, that are connected by Days (I don't know if this would help you any better but this is in fact a bipartite graph).
So your structure would be something like this:
node(Person) --- edge(Day) ---> node(Product)
Let's say, Bob buys a pencil on 1/1/12:
node(Bob) --- 1/1/12 ---> node(Pencil)
Ok, now Bob goes and buys another pencil on 1/2/12:
-- 1/1/12 --
/ \
node(Bob) > node(Pencil)
\ /
-- 1/2/12 --
so on...
This is actually possible with networkx. Since you have multiple edges between nodes, you have to choose between MultiGraphMor MultiDiGraph depending on the directed-ness of your edges.
In : g = networkx.MultiDiGraph()
In : g.add_node("Bob")
In : g.add_node("Alice")
In : g.add_node("Pencil")
In : g.add_edge("Bob","Pencil",key="1/1/12")
In : g.add_edge("Bob","Pencil",key="1/2/12")
In : g.add_edge("Alice","Pencil",key="1/3/12")
In : g.add_edge("Alice","Pencil",key="1/2/12")
In : g.edges(keys=True)
Out:
[('Bob', 'Pencil', '1/2/12'),
('Bob', 'Pencil', '1/1/12'),
('Alice', 'Pencil', '1/3/12'),
('Alice', 'Pencil', '1/2/12')]
so far, not bad. You can actually query things like "Did Alice buy a Pencil on 1/1/12?".
In : g.has_edge("Alice","Pencil","1/1/12")
Out: False
In : g.has_edge("Alice","Pencil","1/2/12")
Out: True
Things might get bad if you want all orders on specific days. By bad, I don't mean code-wise, but computation-wise. Code-wise it is rather simple:
In : [(from_node, to_node) for from_node, to_node, key in g.edges(keys=True) if key=="1/2/12"]
Out: [('Bob', 'Pencil'), ('Alice', 'Pencil')]
But this scans all the edges in the network and filters the ones you want. I don't think networkx has any better way.

Graphs are not the best approach for this. A relational database such as MySQL is the right tool for storing this data and performing such queries as who bought what when.

Try this:
Give each node a unique integer ID. Then, create a dictionary, nodes, such that:
nodes['pencil'] = [1,4,...] <- where all of these correspond to a node with the pencil attribute.
Replace 'pencil' with whatever other attributes you're interested in.
Just make sure that when you add a node with 'pencil', you update the dictionary:
node['pencil'].append(new_node_id). Likewise with node deletion.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Follow-up on iterating over a graph using XML minidom - python

Related

pyosmium - Build a Geojson linestring based on OSM Relation

Python friends network visualization

Django finding paths between two vertexes in a graph

maya component id data

What is the correct graph data structure to differentiate between nodes with the same name?

Categories

Resources