Finding and fixing label islands in networkx - python

I have a graph where each node has an integer label. If the graph is well-behaved, labeled regions will be continuous. I'd like to write something in Python networkx to "fix" bad graphs. For example, in a bad graph like the one pictured, I'd like to:
1) identify bad nodes (the ones in dotted blue lines); then
2) remove their label and "fill" with the correct value
My graph vocabulary is weak; are there networkx functions that can do this?
Note: not sure if it makes a difference, but all nodes have a degree of 3, and the graph is always a topological sphere.

1) For each label, make a subgraph of your original graph containing all nodes with that label (networkx.subgraph).
2) For each subgraph, find the connected components with networkx.connected_components, which returns a generator of node sets, one for each component.
3) For each component that is not the largest component of its class, find the neighbours of each node with networkx.neighbors; determine their labels and assign the most common label (that is not also the label of the component) to the nodes in that component.
This procedure may fail if large mislabeled "islands" are adjacent to each other but should work for the example shown.
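A minimal sketch of that procedure, assuming each node's label is stored in a node attribute (here called "label"):

import networkx as nx
from collections import Counter

def fix_label_islands(G, label_attr="label"):
    labels = set(nx.get_node_attributes(G, label_attr).values())
    for lab in labels:
        nodes = [n for n, d in G.nodes(data=True) if d[label_attr] == lab]
        # components of the subgraph induced by this label, largest first
        comps = sorted(nx.connected_components(G.subgraph(nodes)),
                       key=len, reverse=True)
        for comp in comps[1:]:  # every island except the largest component
            # count the labels of neighbours outside the island
            outside = Counter(G.nodes[m][label_attr]
                              for n in comp for m in G.neighbors(n)
                              if G.nodes[m][label_attr] != lab)
            if outside:
                new_label = outside.most_common(1)[0][0]
                for n in comp:
                    G.nodes[n][label_attr] = new_label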

Related

Divide a region into parts efficiently Python

I have a square grid with some points marked off as being the centers of the subparts of the grid. I'd like to be able to assign each location within the grid to the correct subpart. For example, if the subparts of the region were centered on the black dots, I'd like to be able to assign the red dot to the region in the lower right, as it is the closest black dot.
Currently, I do this by iterating over each possible red dot, and comparing its distance to each of the black dots. However, the width, length, and number of black dots in the grid is very high, so I'd like to know if there's a more efficient algorithm.
My particular data is formatted as such, where the numbers are just placeholders to correspond with the given example:
black_dots = [(38, 8), (42, 39), (5, 14), (6, 49)]
grid = [[0 for i in range(0, 50)] for j in range(0, 50)]
For reference, in the sample case, I hope to be able to fill grid up with the integers 1, 2, 3, 4, depending on whether they are closest to the 1st, 2nd, 3rd, or 4th entry in black_dots, to end up with something similar to the following picture, where each integer corresponds to a color (dots are left on for show).
To summarize, is there / what is the more efficient way to do this?
You can use a breadth-first traversal to solve this problem.
Create a first-in, first-out queue. (A queue makes a traversal breadth-first.)
Create a Visited mask indicating whether a cell in your grid has been added to the queue or not. Set the mask to false.
Create a Parent mask indicating what black dot the cell ultimately belongs to.
Place all the black dots into the queue, flag them in the Visited mask, and assign them unique ids in the Parent mask.
Begin popping cells from the queue one by one. For each cell, iterate over the cell's neighbours. Place each neighbour that is not yet flagged in Visited into the queue, flag it in Visited, and set its value in Parent to be equal to that of the cell you just popped.
Continue until the queue is empty.
The breadth-first traversal makes a wave which expands outward from each source cell (black dot). Since the waves all travel at the same speed across your grid, each wave gobbles up those cells closest to its source.
This solves the problem in O(N) time.
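A sketch of this multi-source BFS, assuming 4-connectivity and (row, col) coordinates; note that the wave then measures 4-connected (Manhattan) grid distance, which approximates Euclidean closeness:

from collections import deque

def label_regions(rows, cols, black_dots):
    parent = [[-1] * cols for _ in range(rows)]  # -1 doubles as the Visited mask
    queue = deque()
    for region_id, (r, c) in enumerate(black_dots, start=1):
        parent[r][c] = region_id
        queue.append((r, c))
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and parent[nr][nc] == -1:
                parent[nr][nc] = parent[r][c]  # inherit the popped cell's region
                queue.append((nr, nc))
    return parent

grid = label_regions(50, 50, [(38, 8), (42, 39), (5, 14), (6, 49)])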
If I understand correctly, what you really need is to construct a Voronoi diagram of your centers:
https://en.m.wikipedia.org/wiki/Voronoi_diagram
It can be constructed very efficiently, with computational complexity similar to that of computing a convex hull.
The Voronoi diagram gives you the optimal polygons surrounding your centers which delimit the regions closest to each center.
Given the Voronoi diagram, the task reduces to detecting which polygon each red dot lies in. Since the Voronoi cells are convex, you need an algorithm to decide whether a point is inside a convex polygon; however, traversing all polygons has complexity O(n).
There are several algorithms to accelerate the point location so it can be done in O(log n):
https://en.m.wikipedia.org/wiki/Point_location
See also
Nearest Neighbor Searching using Voronoi Diagrams
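As a practical note (my own addition, not part of the answer above): for a pure nearest-center query, scipy's cKDTree avoids constructing the diagram explicitly and answers each query in O(log n). A sketch using the question's data:

import numpy as np
from scipy.spatial import cKDTree

black_dots = np.array([(38, 8), (42, 39), (5, 14), (6, 49)])
tree = cKDTree(black_dots)  # spatial index over the centers

# query every grid cell for the index of its nearest center
rr, cc = np.mgrid[0:50, 0:50]
cells = np.column_stack([rr.ravel(), cc.ravel()])
_, nearest = tree.query(cells)
grid = (nearest + 1).reshape(50, 50)  # 1-based region ids, as in the question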
The "8-way" Voronoi diagram can be constructed efficiently (in linear time wrt the number of pixels) by a two-passes scanline process. (8-way means that distances are evaluated as the length of the shortest 8-connected path between two pixels.)
Assign every center a distinct color and create an array of distances of the same size as the image, initialized with 0 at the centers and "infinity" elsewhere.
In a top-down/left-right pass, update the distances of all pixels as being the minimum of the distances of the four neighbors W, NW, N and NE plus one, and assign the current pixel the color of the neighbor that achieves the minimum.
In a bottom-up/right-left pass, update the distances of all pixels as being the minimum of the current distance and the distances of the four neighbors E, SE, S, SW plus one, and assign the current pixel the color of the neighbor that achieves the minimum (or keep the current color).
It is also possible to compute the Euclidean Voronoi diagram efficiently (in linear time), but this requires a more sophisticated algorithm. It can be based on the wonderful paper "A General Algorithm for Computing Distance Transforms in Linear Time" by A. Meijster, J.B.T.M. Roerdink and W.H. Hesselink, which must be enhanced with some accounting of the neighbor that causes the smallest distance.
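A sketch of the two-pass scanline process described above, assuming centers is a list of (row, col) tuples and region labels are 1-based ids:

import numpy as np

def voronoi_8way(rows, cols, centers):
    INF = 10**9
    dist = np.full((rows, cols), INF, dtype=np.int64)
    color = np.zeros((rows, cols), dtype=np.int64)
    for k, (r, c) in enumerate(centers, start=1):
        dist[r, c] = 0
        color[r, c] = k
    # top-down/left-right pass: look at W, NW, N, NE
    for r in range(rows):
        for c in range(cols):
            for dr, dc in ((0, -1), (-1, -1), (-1, 0), (-1, 1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols and dist[nr, nc] + 1 < dist[r, c]:
                    dist[r, c] = dist[nr, nc] + 1
                    color[r, c] = color[nr, nc]
    # bottom-up/right-left pass: look at E, SE, S, SW
    for r in range(rows - 1, -1, -1):
        for c in range(cols - 1, -1, -1):
            for dr, dc in ((0, 1), (1, 1), (1, 0), (1, -1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols and dist[nr, nc] + 1 < dist[r, c]:
                    dist[r, c] = dist[nr, nc] + 1
                    color[r, c] = color[nr, nc]
    return color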

Keeping scale free graph's degree distribution after perturbation - python

Given a scale free graph G ( a graph whose degree distribution is a power law), and the following procedure:
for i in range(C):
    coint = randint(0, 1)
    if coint == 0:
        delete_random_edges(G)
    else:
        add_random_edge(G)
(C is a constant)
So, when C is large, the degree distribution after the procedure would be more like G(n,p). I am interested in preserving the power law distribution, i.e. - I want the graph to be scale free after this procedure, even for large C.
My idea is to write the procedures "delete_random_edges" and "add_random_edge" in a way that gives edges connected to high-degree nodes a small probability of being deleted (and when adding a new edge, makes it more likely to attach to a high-degree node).
I use networkx to represent the graph, and all I found were procedures that delete or add a specific edge. Any idea how I can implement the above?
Here are two algorithms:
Algorithm 1
This algorithm does not preserve the degree exactly, rather it preserves the expected degree.
Save each node's initial degree. Then delete edges at random. Whenever you create an edge, do so by randomly choosing two nodes, each with probability proportional to the initial degree of those nodes.
After a long period of time, the expected degree of each node 'u' is its initial degree (but it might be a bit higher or lower).
Basically, this will create what is called a Chung-Lu random graph. Networkx has a built in algorithm for creating them.
Note - this will allow the degree distribution to vary.
Algorithm 1a
Here is an efficient networkx implementation, skipping over the edge deleting and adding and going straight to the final result (assuming a networkx graph G):
import networkx as nx

degree_list = [d for _, d in G.degree()]  # G.degree() yields (node, degree) pairs in networkx 2.x
H = nx.expected_degree_graph(degree_list)
Here's the documentation
Algorithm 2
This algorithm preserves the degrees exactly.
Choose a set of edges and break them. Create a list, with each node appearing equal to the number of broken edges it was in. Shuffle this list. Create new edges between nodes that appear next to each other in this list.
Check to make sure you never join a node to itself or to a node which is already a neighbor. If this would occur you'll want to think of a custom way to avoid it. One option is to simply reshuffle the list. Another is to set those nodes aside and include them in the list you create next time you do this.
edit
There is a built-in networkx command double_edge_swap to swap two edges at a time (documentation).
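A minimal usage sketch, assuming an existing graph G (the swap counts here are arbitrary):

import networkx as nx

H = G.copy()
# each swap rewires two edges while preserving every node's degree
nx.double_edge_swap(H, nswap=100, max_tries=1000)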
Although you have already accepted the answer from @abdallah-sobehy, meaning that it works, I would suggest a simpler approach, in case it helps you or anybody around.
What you are trying to do is sometimes called preferential attachment (well, at least when you add nodes), and for that there is a random model developed quite some time ago; see the Barabasi-Albert model, which leads to a power-law degree distribution with exponent 3 (P(k) ~ k^-3).
Basically you have to add edges with probability equal to the degree of the node divided by the sum of the degrees of all the nodes. You can use scipy.stats for defining the probability distribution with code like this:
import scipy.stats as stats

x = list(Gx.nodes())  # assumes integer node labels, as rv_discrete requires
sum_degrees = sum(d for _, d in Gx.degree())
p = [Gx.degree(n) / sum_degrees for n in Gx]
custm = stats.rv_discrete(name='custm', values=(x, p))
Then you just pick 2 nodes following that distribution, and those are the 2 nodes you add an edge between:
custm.rvs(size=2)
As for deleting edges, I haven't tried that myself. But I guess you could use something like this,
sum_inv_degrees = sum(1 / d for _, d in Gx.degree())  # assumes no isolated nodes
p = [1 / (Gx.degree(n) * sum_inv_degrees) for n in Gx]
although honestly I am not completely sure; it is no longer the random model that I link to above...
Hope it helps anyway.
UPDATE after comments
Indeed by using this method for adding nodes to an existing graph, you could get 2 undesired outcomes:
duplicated links
self links
You could remove those, although it will make the results deviate from the expected distribution.
Anyhow, you should take into account that you are deviating already from the preferential attachment model, since the algorithm studied by Barabasi-Albert works adding new nodes and links to the existing graph,
The network begins with an initial connected network of m_0 nodes.
New nodes are added to the network one at a time. Each new node is connected to m ≤ m_0 existing nodes with a probability that is proportional to the number
...
(see here)
If you want to get an exact distribution (instead of growing an existing network and keeping its properties), you're probably better off with the answer from @joel.
Hope it helps.
I am not sure to what extent this will preserve the scale-free property, but this can be a way to implement your idea:
In order to add an edge you need to specify 2 nodes in networkx. So, you can choose one node with a probability proportional to its degree, and the other node uniformly (without any preference). Choosing a highly connected node can be achieved as follows:
For a graph G where nodes are [0,1,2,...,n]
1) Create a list of floats (limits) between 0 and 1 to specify for each node a probability to be chosen according to its degree. For example: limits[1] - limits[0] is the probability of choosing node 0, limits[2] - limits[1] is the probability of choosing node 1, etc.
# limits is a list of floats between 0 and 1 which defines
# the probability of choosing a certain node depending on its degree
limits = [0.0]
# total number of edges of the graph; the summation of all degrees is 2*num_edges
num_edges = G.number_of_edges()
# iterate over nodes, accumulating each node's probability mass
for i in G:
    limits.append(G.degree(i) / (2 * num_edges) + limits[-1])
2) Randomly generate a number between 0 and 1 then compare it to the limits, and choose the node to add an edge to accordingly:
import numpy as np

rnd = np.random.random()
# compare the random number to the limits and choose the node accordingly
nodes = list(G)
for j in range(len(limits) - 1):
    if limits[j] <= rnd < limits[j + 1]:
        chosen_node = nodes[j]
3) Choose another node uniformly, by generating a random integer in [0, n].
4) Add an edge between the two chosen nodes.
5) Similarly, for deleting an edge, you can choose a node according to (1/degree) instead of degree, then uniformly delete any of its edges.
It would be interesting to know whether this approach preserves the scale-free property and at which C the property is lost, so let us know if it worked or not.
EDIT: As suggested by @joel, the selection of the node to add an edge to should be proportional to degree rather than degree^2. I have edited step 1 accordingly.
EDIT2: This might help you judge whether the scale-free graph lost its property after edge additions and removals. Simply compute the preferential attachment score before and after the changes. You can find the documentation here.
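For instance, a small sketch of that score (the candidate node pairs here are arbitrary placeholders):

import networkx as nx

# the score for a pair (u, v) is deg(u) * deg(v)
for u, v, score in nx.preferential_attachment(G, [(0, 1), (2, 3)]):
    print(u, v, score)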

How to read in weighted edgelist with igraph in Python (not in R)?

What I aim to do is create a graph of the nodes in the first 2 columns, that have edge lengths that are proportional to the values in the 3rd column. My input data looks like:
E06.1644.1 A01.908.1 0.5
E06.1643.1 A01.908.1 0.02
E06.1644.1 A01.2060.1 0.7
I am currently importing it like this:
g = Graph.Read_Ncol("igraph.test.txt", names=True, directed=False, weights=True)
igraph.plot(g, "igraph.pdf", layout="kamada_kawai")
When I print the names or the weights (which I intend to be the edge lengths), they print out fine with:
print(g.vs["name"])
print(g.es["weight"])
However, the vertices are blank, and the lengths do not seem to be proportional to their values. Also, there are too many nodes (A01.908.1 is duplicated).
What am I doing wrong?
Thanks in advance....
The vertices are blank because igraph does not use the name attribute as vertex labels automatically. If you want to use the names as labels, you have two options:
Copy the name vertex attribute to the label attribute: g.vs["label"] = g.vs["name"]
Tell plot explicitly that you want it to use the names as labels: plot(g, "igraph.pdf", layout="kamada_kawai", vertex_label=g.vs["name"])
I guess the same applies to the weights; igraph does not use the weights automatically to determine the thickness of each edge. If you want to do this, rescale the weight vector to a meaningful thickness range (say, from 0.5 to 3) and then set the rescaled vector as the width edge attribute:
>>> g.es["width"] = rescale(g.es["weight"], out_range=(0.5, 3))
Alternatively, you can also use the edge_width keyword argument in the plot() call:
plot(g, ..., edge_width=rescale(g.es["weight"], out_range=(0.5, 3)))
See help(Graph.__plot__) for more details about the keyword arguments that you can pass to plot().
As for the duplicated node, I strongly suspect that there is a typo in your input file and the two names are not equivalent; one could have a space at the end for instance. Inspect g.vs["name"] carefully to see if this is the case.
Update: if you want the lengths of the edges to be proportional to the prescribed weights, I'm afraid that this cannot be done exactly in the general case - it is easy to come up with a graph where the prescribed lengths cannot be achieved in 2D space. There is a technique called multidimensional scaling (MDS) which could reconstruct the positions of the nodes from a distance matrix - but this requires that a distance is specified for each pair of nodes (i.e. also for disconnected pairs).
The Kamada-Kawai layout algorithm that you have used is able to take edge weights into account to some extent (it is likely to get stuck in local minima so you probably won't get an exact result), but it interprets the weights as similarities, not distances, therefore the larger the weight is, the closer the endpoints will be. However, you still have to tell igraph to use the weights when calculating the layout, like this:
>>> similarities = [some_transformation(weight) for weight in g.es["weight"]]
>>> layout = g.layout_kamada_kawai(weights=similarities)
>>> plot(g, layout=layout, ...)
where some_transformation() is a "reasonable" transformation from distance to similarity. This requires some trial-and-error; I usually use a transformation based on a sigmoid function that transforms the median distance to a similarity of 0.5, the (median + 2 sd) distance to 0.1 and the (median - 2 sd) distance to 0.9 (where sd is the standard deviation of the distance distribution) - but this is not guaranteed to work in all cases.
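A sketch of one such sigmoid transformation (my own illustration of the idea above, not guaranteed to fit every dataset): it maps the median distance to 0.5, median + 2 sd to roughly 0.1, and median - 2 sd to roughly 0.9:

import math
from statistics import median, stdev

def make_transformation(weights):
    m, sd = median(weights), stdev(weights)
    k = math.log(9) / (2 * sd)  # chosen so m +/- 2*sd map to ~0.1 and ~0.9
    return lambda d: 1 / (1 + math.exp(k * (d - m)))

some_transformation = make_transformation(g.es["weight"])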

NetworkX graph: creating nodes with ordered list

I am completely new to graphs. I have a 213 x 213 distance matrix. I have been trying to visualize the distance matrix using networkx; my idea is that far-apart nodes will appear as separate clusters when the graph is plotted. So I am creating a graph with nodes representing column indices. I need to keep track of nodes and their labels, both to label them afterwards and to add edges in a certain order.
Here is the code:
import networkx as nx

G = nx.Graph()
G.add_nodes_from(time_pres)  # time_pres is the list of labels that I want specific nodes to have
edges = []
for i in range(212):
    for j in range(i + 1, 212):
        color = 'green' if j == i + 1 else 'red'
        # this requires allocating distances as per the order in the dist matrix
        edges.append((i, j, dist[i, j], color))
        G.add_edge(i, j, dist=dist[i, j], color=color)
The way I am doing right now, it is allocating nodes with id as a number which is not as per the index of labels in time_pres.
I can answer the question you seem to be asking, but this won't be the end of your troubles. Specifically, I'll show you where you go wrong.
So, we assume that the variable time_pres is defined as follows
time_pres = [('person1', '1878'), ('person2', '1879'), etc)]
Then,
G.add_nodes_from(time_pres)
Creates the nodes with labels ('person1', '1878'), ('person2', '1879'), etc. These nodes are held in a dictionary, with keys the label of the nodes and values any additional attributes related to each node. In your case, you have no attributes. You can also see this from the manual online, or if you type help(G.add_nodes_from).
You can even see the label of the nodes by typing either of the following lines.
G.nodes()        # either this
list(G.nodes())  # or this, as a plain list
This will print a list of the labels, but since they come from a dictionary, they may not be in the same order as time_pres. You can refer to the nodes by their labels. They don't have any additional id numbers, or anything else.
Now, for adding an edge. The manual says that any of the two nodes will be added if they are not already in the graph. So, when you do
G.add_edge(i, j, dist = dist[i,j], color = 'green')
where i and j are numbers, they are added to the graph since they don't already exist among the graph's labels. So, you end up adding the nodes i and j and the edge between them. Instead, you want to do
G.add_edge(time_pres[i], time_pres[j], dist = dist[i,j], color = 'green')
This will add an edge between the nodes time_pres[i] and time_pres[j]. As far as I understand, this is your aim.
However, you seem to expect that when you draw the graph, the distance between nodes time_pres[i] and time_pres[j] will be decided by the attribute dist=dist[i,j] in G.add_edge(). In fact, the position of a node is decided by a tuple holding the x and y positions of the node. From the manual for nx.draw():
pos : dictionary, optional
A dictionary with nodes as keys and positions as values. If not specified a spring layout positioning will be computed. See networkx.layout for functions that compute node positions.
If you don't define the node positions, they will be generated randomly. In your case, you would need a dictionary like
pos = {('person1', '1878'): (23, 10),
('person2', '1879'): (18, 11),
etc}
Then, the coordinates of the nodes i and j would result in a distance equal to dist[i,j]. You would have to figure out these coordinates yourself, but since you haven't made it clear exactly how you derived the matrix dist, I can't say anything more about it.
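One hedged possibility (my own sketch, not part of the answer above; it assumes dist is a symmetric numpy array with a zero diagonal): multidimensional scaling can embed the distance matrix in 2D so that plotted distances approximate the prescribed ones.

import networkx as nx
from sklearn.manifold import MDS

# embed the 213 x 213 distance matrix into 2D coordinates
coords = MDS(n_components=2, dissimilarity='precomputed').fit_transform(dist)
pos = {label: coords[k] for k, label in enumerate(time_pres)}
nx.draw(G, pos=pos, with_labels=True)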

Networkx graph clustering

In networkx, how can I cluster nodes based on node color? E.g., I have 100 nodes; some of them are close to black, while others are close to white. In the graph layout, I want nodes with similar colors to stay close to each other, and nodes with very different colors to stay away from each other. How can I do that? Basically, how does the edge weight influence the layout of spring_layout? If networkx cannot do that, is there any other tool that can help calculate the layout?
Thanks
OK, let's build an adjacency matrix W for that graph following this simple procedure:
if two adjacent vertices, the i-th and the j-th, are of the same color, then the weight of the edge between them, W_{i,j}, is a big number (which you will tune in your experiments later); otherwise it is some small number, which you will figure out analogously.
Now, let's write the Laplacian of the matrix as
L = D - W, where D is a diagonal matrix with elements d_{i,i} equal to the sum of the i-th row of W.
Now, one can easily show that the value of
f L f^T, where f is some arbitrary row vector, is small if vertices with huge adjacency weights have close values of f. You may think of this as a way to set a coordinate system for the graph, with the i-th vertex having coordinate f_i in a 1D space.
Now, let's choose some number of such vectors f^k, which give us a representation of the graph as a set of points in some Euclidean space in which, for example, k-means works: now the i-th vertex of the initial graph has coordinates f^1_i, f^2_i, ..., and adjacent vertices of the same color in the initial graph will be close in this new coordinate space.
The question of how to choose the vectors f is a simple one: just take a couple of eigenvectors of the matrix L that correspond to small but nonzero eigenvalues.
This is a well known method called spectral clustering.
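A minimal sketch of the whole pipeline, assuming W is the symmetric weight matrix built as described above (a numpy array) and that two clusters are wanted:

import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

D = np.diag(W.sum(axis=1))
L = D - W
vals, vecs = eigh(L)  # eigenvalues come back in ascending order
f = vecs[:, 1:3]      # skip the trivial constant eigenvector
labels = KMeans(n_clusters=2, n_init=10).fit_predict(f)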
Further reading:
The Elements of Statistical Learning: Data Mining, Inference, and Prediction by Trevor Hastie, Robert Tibshirani and Jerome Friedman, which is available for free from the authors' page: http://www-stat.stanford.edu/~tibs/ElemStatLearn/
