Find cliques of length k in a graph - python

I'm working with graphs of ~200 nodes and ~3500 edges. I need to find all cliques of this graph. Using networkx's enumerate_all_cliques() works fine with smaller graphs of up to 100 nodes, but runs out of memory for bigger ones.
"This algorithm however, hopefully, does not run out of memory
since it only keeps candidate sublists in memory and
continuously removes exhausted sublists."source code for enumerate_all_cliques()
Is there maybe a way to return a generator of all cliques of length k, instead of all cliques, in order to save memory?

It seems that your priority is to save memory rather than to get all cliques. In that case, networkx.find_cliques(G) is a satisfactory solution, as it yields only the maximal cliques (the largest complete subgraph containing a given node) instead of all cliques.
I compared the number of lists (subgraphs) returned by the two functions:
import networkx as nx

G = nx.erdos_renyi_graph(300, 0.08)
print('All:', len(list(nx.enumerate_all_cliques(G))))
print('Maximal', len(list(nx.find_cliques(G))))
All: 6087
Maximal 2522
And as the number of edges in the graph increases, the difference between the two counts grows wider.
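If you specifically need a generator of cliques of length k, one possible sketch (my own suggestion, not a networkx feature): every k-clique is contained in at least one maximal clique, so you can iterate the maximal cliques from find_cliques() and yield their size-k subsets, de-duplicating as you go. Note that the de-duplication set still holds every k-clique seen so far, so the saving over enumerate_all_cliques() is only that cliques of other sizes are never stored.

from itertools import combinations
import networkx as nx

def cliques_of_size_k(G, k):
    """Lazily yield each clique of exactly k nodes, derived from the maximal cliques."""
    seen = set()
    for maximal in nx.find_cliques(G):          # generator of maximal cliques
        if len(maximal) < k:
            continue
        for nodes in combinations(maximal, k):  # every k-subset of a clique is a clique
            clique = frozenset(nodes)
            if clique not in seen:
                seen.add(clique)
                yield clique

G = nx.erdos_renyi_graph(200, 0.08)
print(sum(1 for _ in cliques_of_size_k(G, 4)))  # number of 4-cliques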

Related

Find intersection of 2 layers containing multiple lines (in python)

I have 2 layers with links and nodes: layer A (yellow) and layer B (blue).
I would like to get the places where the lines of layer A intersect with the lines of layer B (red nodes), directly in python.
I have the coordinates for all nodes in both layers (the nodes of layers A and B are hidden in the image below).
I saw this option to find line intersections in Python, but since layer A has approx. 23,000 lines and layer B 50,000, it would be too computationally intensive to use it:
from shapely.geometry import LineString
line1 = LineString([(x1,y1), (x2,y2), (x3,y3)])
line2 = LineString([(x4,y4), (x5,y5)])
output = line1.intersection(line2)
Does anyone know a better (faster) way to get these intersection nodes?
Thanks a lot!
Okay, I see what you mean. I will try to explain a method to do this. You could easily do it by brute force, but as you mentioned that is time consuming, especially when there are thousands of nodes and edges. I can suggest a less time-consuming method.
Say there are N nodes in layer 1 and M nodes in layer 2. Then your method has a time complexity of O(N*M).
My method is moderately complex. I can't implement the code here, so I will describe the steps as best I can and you have to figure out how to implement them in code. The con: it may miss intersections.
We use localization in the graph. In other words, we select (relatively) small windows from the layers and perform the same thing you have done, using shapely.
Okay, first we have to determine the window size we are going to use.
Determine the size of the rectangle which contains all the nodes of both layers.
Divide it into small squares of the same size (like a square grid).
Build a zero matrix equivalent to the squares.
Get the count of nodes in each square and assign it to the related element of the matrix. This has a time complexity of O(N+M).
Now you have the density of the nodes.
If the density of a tile is high, the window size is small (3x3 tiles would be enough), with the selected tile in the middle. Since the nodes are close together, there is little chance of missing intersections. Perform your method on the nodes inside the selected window; this has a time complexity of O(n*m), where n and m are the numbers of nodes inside the window.
If the density of a tile is low, the window size is large (5x5, 7x7, 9x9... you can decide), with the selected tile in the middle. Since the nodes are far apart, the larger window keeps the chance of missing intersections low. Again, perform your method on the nodes inside the selected window, with time complexity O(n*m) for the n and m nodes inside it.
How does this reduce the time? It avoids comparing nodes that are far away from the selected lines. In terms of time complexity this is much more efficient than your previous brute-force approach; it is a little less accurate, but faster.
ATTENTION: This is my method for selecting a smaller number of nodes for comparison. There may be methods that are much faster and more accurate than my solution, but most of them will be very complex. If you are worried about accuracy, use your method or look for a faster and more accurate one; otherwise you can use mine.
IN ADDITION: You can determine the window size from the line length instead of the node density, with no need to draw a grid: select a large window around a line if the line is long, and a small window if it is short. This is also fast, and I think it will be more accurate than my previous method.
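As a rough sketch of the localization idea (my own illustration, using a fixed cell size rather than the adaptive windows described above; the names lines_a, lines_b and the cell size are assumptions): bucket one layer's lines into grid cells by bounding box, then run shapely's intersection test only on pairs of lines that share at least one cell.

from collections import defaultdict
from shapely.geometry import LineString

def cells_for(line, cell_size):
    """Grid-cell indices overlapped by the line's bounding box."""
    minx, miny, maxx, maxy = line.bounds
    return [(i, j)
            for i in range(int(minx // cell_size), int(maxx // cell_size) + 1)
            for j in range(int(miny // cell_size), int(maxy // cell_size) + 1)]

def grid_intersections(lines_a, lines_b, cell_size=100.0):
    # Bucket the indices of the layer-B lines by the grid cells they touch.
    buckets = defaultdict(set)
    for idx, line_b in enumerate(lines_b):
        for cell in cells_for(line_b, cell_size):
            buckets[cell].add(idx)
    # For each layer-A line, only test the layer-B lines sharing a cell with it.
    points = []
    for line_a in lines_a:
        candidates = set()
        for cell in cells_for(line_a, cell_size):
            candidates |= buckets[cell]
        for idx in candidates:
            crossing = line_a.intersection(lines_b[idx])
            if not crossing.is_empty:
                points.append(crossing)
    return points

lines_a = [LineString([(0, 0), (10, 10)])]            # hypothetical tiny layers
lines_b = [LineString([(0, 10), (10, 0)])]
print(grid_intersections(lines_a, lines_b, cell_size=5.0))  # one POINT (5 5)

Because candidates are selected by bounding-box overlap with whole grid cells, this variant does not skip genuinely intersecting pairs; shapely also ships an STRtree spatial index that serves the same purpose with less code.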

find k-partition of a graph that satisfies the condition

Given the adjacency matrix of a graph, I want to divide the nodes into k parts such that [the sum of the edges inside each part] minus [the sum of the edges between these k parts] is maximal.
Here is my attempt:
I generated all partitions of the nodes and kept those consisting of exactly k subsets. Then, considering all pairs of nodes, I calculated the edges inside each part, did the same for the edges between every pair of parts, and looked for the maximum.
The problem is that my approach is not efficient in terms of time and memory. So I am looking for a faster method.
I know that I shouldn’t store all partitions in memory and have to do the calculations while generating each one. But I don’t know how to implement this.
Any help is appreciated.
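A minimal sketch of doing that brute force lazily (my own illustration, assuming an undirected graph given as a symmetric numpy adjacency matrix A with a zero diagonal): the k-partitions are produced by a generator and scored one at a time, so none of them need to be stored.

import numpy as np

def partitions_into_k(nodes, k):
    """Lazily yield every way of splitting `nodes` into exactly k non-empty parts."""
    if k == 1:
        yield [list(nodes)]
        return
    if len(nodes) < k:
        return
    first, rest = nodes[0], nodes[1:]
    # Either `first` joins one part of a k-partition of the remaining nodes...
    for parts in partitions_into_k(rest, k):
        for i in range(len(parts)):
            yield parts[:i] + [[first] + parts[i]] + parts[i + 1:]
    # ...or `first` forms its own part next to a (k-1)-partition of the rest.
    for parts in partitions_into_k(rest, k - 1):
        yield [[first]] + parts

def score(parts, A):
    """Sum of edges inside the parts minus sum of edges between the parts."""
    inside = sum(A[np.ix_(p, p)].sum() for p in parts) / 2  # each internal edge counted twice
    total = A.sum() / 2                                     # every edge of the graph
    return inside - (total - inside)

def best_k_partition(A, k):
    nodes = list(range(A.shape[0]))
    # max() consumes the generator one partition at a time, so nothing is stored.
    return max(partitions_into_k(nodes, k), key=lambda parts: score(parts, A))

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]])
print(best_k_partition(A, 2))  # [[0, 1, 2], [3]]

This is still exponential in the number of nodes (there are Stirling-number-many partitions); it only removes the memory problem.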

Add extra cost depending on length of path

I have a graph/network that consists of some nodes and some edges. Each edge has a weight attached to it, or in this case a cost. Each edge also has a distance attached to it AND a type. So basically the weight/cost is pre-calculated from the distance of the edge along with some other metrics, for both types of edges.
However, in my case I would like some additional cost to be added for, let's say, every 100 distance or so, but only for one type of edge. I'm not even certain whether it is possible, in algorithms such as Dijkstra's, to add extra cost/distance depending on the sum of the previous steps in the path.
I know I could just divide the cost over each distance unit and thus get a rough estimate. The problem there would be the edge cases, where the cost would be almost double at distance 199 compared to adding the cost at exactly every 100 distance, i.e. adding cost at 100 and 200.
But maybe there are other ways to get around this?
I think you cannot implement this using Dijkstra, because you would violate the invariant that is needed for correctness (see e.g. Wikipedia). In each step, Dijkstra builds on this invariant, which more or less states that all "already found paths" are optimal, i.e. shortest. To show that it does not hold in your case of "additional cost by edge type and covered distance", let's look at a counterexample:
Counterexample against Usage of Dijkstra
Assume we have two types of edges, a first type (->) and a second type (=>). The second type has an additional cost of 10 once a total distance of 10 has been covered on edges of that type. Now take the following graph, with the following edges:
start -1-> u_1
u_1 -1-> u_2
u_2 -1-> u_3
...
u_6 -1-> u_7
u_7 -1-> v
start =7=> v
v =4=> end
When we play that through with Dijkstra (I skip all intermediate steps), with start as the start node and end as the target, we will first settle the path start =7=> v. This path has a length of 7, which is shorter than the "detour" start -1-> u_1 -1-> ... -1-> u_7 -1-> v, which has a length of 8. However, in the next step we have to take the edge v =4=> end, which brings the first path to a total of 21 (11 original + 10 penalty). But the detour now becomes the shorter one, with a length of 12 = 8 + 4 (no penalty).
In short, Dijkstra is not applicable, even if you modify the algorithm to take the "already found path" into account when retrieving the cost of the next edges.
Alternative?
Maybe you can build your algorithm around a variant of Dijkstra which retrieves multiple (suboptimal) solutions. First, you would need to extend Dijkstra so that it takes the already found path into account (in this function, replace cost = weight(v, u, e) with cost = weight(v, u, e, paths[v]) and write a suitable function that calculates the penalty based on the previous path and the considered edge). Afterwards, remove edges from your original optimal solution and iterate the procedure to find a new alternative shortest path. However, I see no easy way of selecting which edge to remove from the graph (besides those of your penalty type), and the runtime complexity is probably awful.
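As a rough sketch of what such a path-aware weight function could look like (my own illustration, under one possible reading of the question in which only the distance covered on the penalized edge type counts towards the threshold; the attribute names and constants are assumptions):

# Hypothetical edge attributes: 'weight', 'distance', 'type'.
PENALTY = 10              # extra cost added each time another STEP of distance is completed
PENALIZED_TYPE = 'second'
STEP = 100

def weight(v, u, e, path_so_far):
    """Cost of traversing edge e from v to u, given the list of edges already walked."""
    cost = e['weight']
    if e['type'] != PENALIZED_TYPE:
        return cost
    # Distance already covered on penalized edges before this step.
    covered = sum(prev['distance'] for prev in path_so_far
                  if prev['type'] == PENALIZED_TYPE)
    # Number of STEP thresholds crossed by appending this edge.
    crossings = (covered + e['distance']) // STEP - covered // STEP
    return cost + crossings * PENALTY

Note that, as argued above, plugging such a function into plain Dijkstra does not make the result correct; it only makes the cost computable along a given path.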

What's the fastest way of finding the longest chain of strings with common first/last characters?

This one seems pretty straightforward but I'm having trouble implementing it in Python.
Say you have an array of strings, all of length 2: ["ab","bc","cd","za","ef","fg"]
There are two chains here: "zabcd" and "efg"
What's the most efficient way of finding the longest one? (in this case it would be "zabcd").
Chains can be cyclical too... for instance ["ab","bc","ca"] (in this particular case the length of the chain would be 3).
This is clearly a graph problem, with the characters being the vertices and the pairs being unweighted directed edges.
Without allowing cycles in the solution, this is the longest path problem, which is NP-hard, so an "efficient" solution is probably out of the question, even if cycles are allowed (to get rid of cycles in the solution, split each vertex in two, one copy for incoming edges and one for outgoing edges, with an edge in between). According to Wikipedia, no good approximation scheme is known.
If the graph is acyclic, then you can do it in linear time, as the Wikipedia article mentions:
A longest path between two given vertices s and t in a weighted graph G is the same thing as a shortest path in a graph −G derived from G by changing every weight to its negation. Therefore, if shortest paths can be found in −G, then longest paths can also be found in G.[4]
For most graphs, this transformation is not useful because it creates cycles of negative length in −G. But if G is a directed acyclic graph, then no negative cycles can be created, and a longest path in G can be found in linear time by applying a linear time algorithm for shortest paths in −G, which is also a directed acyclic graph.[4] For instance, for each vertex v in a given DAG, the length of the longest path ending at v may be obtained by the following steps:
1. Find a topological ordering of the given DAG.
2. For each vertex v of the DAG, in the topological ordering, compute the length of the longest path ending at v by looking at its incoming neighbors and adding one to the maximum length recorded for those neighbors. If v has no incoming neighbors, set the length of the longest path ending at v to zero. In either case, record this number so that later steps of the algorithm can access it.
3. Once this has been done, the longest path in the whole DAG may be obtained by starting at the vertex v with the largest recorded value, then repeatedly stepping backwards to its incoming neighbor with the largest recorded value, and reversing the sequence of vertices found in this way.
There are other special cases where there are efficient algorithms available, notably trees, but since you allow cycles that probably doesn't apply to you.
I didn't give an algorithm here for you, but that should give you the right direction for your research. The problem itself may be straightforward, but an efficient solution is NOT.
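For the DAG case the quoted steps describe, here is a minimal sketch (my own illustration, assuming the input pairs happen to form an acyclic letter graph); networkx's dag_longest_path works on a topological ordering, much as described above:

import networkx as nx

pairs = ["ab", "bc", "cd", "za", "ef", "fg"]
G = nx.DiGraph([(a, b) for a, b in pairs])   # characters as vertices, pairs as directed edges

if nx.is_directed_acyclic_graph(G):
    path = nx.dag_longest_path(G)            # ['z', 'a', 'b', 'c', 'd']
    print("".join(path))                     # zabcd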

Computing many shortest paths in graph

I have a large (weighted, directed) graph (>100,000 nodes) and I want to compute a large number of random shortest paths in that graph. So I want to randomly select two nodes (let's say k times) and compute the shortest path. One way to do this is using either the networkx or the igraph module and doing a for loop as in
import numpy as np

pairs = np.random.choice(np.arange(0, len(graph.nodes)), [k, 2])
for pair in pairs:
    graph.get_shortest_paths(pair[0], pair[1], weights='weight')
This works, but it takes a long time, especially compared to computing all paths for a particular source node. Essentially, in every iteration the process loads the graph again and starts from scratch. So is there a way to benefit from loading the graph structure into memory once and not redoing this in each iteration, without computing all shortest paths (which would take too long, given that those would be n*(n-1) paths)?
Phrased differently, can I compute a random subset of all shortest paths in an efficient way?
AFAIK, the operations are independent of each other, so running them in parallel could work (pseudocode):
import dask
import numpy as np

@dask.delayed
def short_path(graph, pair):
    return graph.get_shortest_paths(pair[0], pair[1], weights='weight')

pairs = np.random.choice(np.arange(0, len(graph.nodes)), [k, 2])
results = dask.compute(*[short_path(graph, pair) for pair in pairs])
