Add extra cost depending on length of path - python

I have a graph/network that consists of some nodes and some edges. Each edge has a weight attached to it, or in this case a cost. Each edge also has a distance attached to it AND a type. So basically the weight/cost is pre-calculated from the distance of the edge along with some other metrics for both types of edges.
However, in my case I would like some additional cost to be added for, let's say, every 100 units of distance or so, but only for one type of edge. I'm not even certain whether it is possible to add additional cost/distance depending on the sum of the previous steps in the path in algorithms such as Dijkstra's.
I know I could just spread the cost over each distance unit and thus get a rough estimate. The problem there would be the edge cases, where the cost would be almost double at distance 199 compared to adding the cost at exactly each 100 units of distance, i.e. adding cost at 100 and 200.
But maybe there are other ways to get around this?

I think you cannot implement this using Dijkstra, because you would violate the invariant that is needed for correctness (see e.g. Wikipedia). In each step, Dijkstra builds on this invariant, which more or less states that all "already found paths" are optimal, i.e. shortest. To show that it does not hold in your case of "additional cost by edge type and covered distance", let's have a look at a counterexample:
Counterexample against Usage of Dijkstra
Assume we have two types of edges, first type (->) and second type (=>). The second type has an additional cost of 10 after a total distance of 10. Now, we take the following graph, with the following edges
start -1-> u_1
u_1 -1-> u_2
u_2 -1-> u_3
...
u_6 -1-> u_7
u_7 -1-> v
start =7=> v
v =4=> end
When we play this through with Dijkstra (I skip all intermediate steps) with start as the start node and end as the target, we first retrieve the path start=7=>v. This path has a length of 7, which is shorter than the "detour" start-1->u_1-1->...-1->u_7-1->v, which has a length of 8. However, in the next step we have to choose the edge v=4=>end, which brings the first path to a total of 21 (11 original + 10 penalty). The detour path is now shorter, with a length of 12 = 8 + 4 (no penalty, because it covers only a distance of 4 on edges of the second type).
In short, Dijkstra is not applicable - even if you modify the algorithm to take the "already found path" into account for retrieving the cost of next edges.
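The arithmetic of the counterexample can be checked with a small sketch. The names path_cost, PENALTY and THRESHOLD are made up for illustration, and the rule "add 10 once the distance covered on second-type edges exceeds 10" is my reading of the example above, not a definition taken from the question.

```python
PENALTY = 10     # extra cost from the example (assumption: charged once)
THRESHOLD = 10   # distance on second-type edges after which the penalty applies


def path_cost(edges):
    """Cost of a path given as a list of (length, edge_type) pairs.

    edge_type is "plain" (->) or "penalised" (=>); both labels are hypothetical.
    """
    base = sum(length for length, _ in edges)
    penalised = sum(length for length, kind in edges if kind == "penalised")
    return base + (PENALTY if penalised > THRESHOLD else 0)


direct = [(7, "penalised"), (4, "penalised")]      # start =7=> v =4=> end
detour = [(1, "plain")] * 8 + [(4, "penalised")]   # start -> u_1 ... u_7 -> v => end

# Up to v the direct prefix looks better, so Dijkstra settles v with it ...
print(path_cost(direct[:1]), path_cost(detour[:8]))
# ... but after the last edge the detour is the cheaper complete path.
print(path_cost(direct), path_cost(detour))
```

The prefix that is optimal at v is not part of the optimal path to end, which is exactly the property Dijkstra relies on.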
Alternative?
Maybe you can build your algorithm around a variant of Dijkstra that retrieves multiple (suboptimal) solutions. First, you would need to extend Dijkstra so that it takes the already found path into account. (In this function replace cost = weight(v, u, e) with cost = weight(v, u, e, paths[v]) and write a suitable function to calculate the penalty based on the previous path and the considered edge.) Afterwards, remove edges from your original optimal solution and iterate the procedure to find a new alternative shortest path. However, I see no easy way of selecting which edge to remove from the graph, besides those of your penalty type, and the runtime complexity is probably awful.

Related

What's the fastest way of finding the longest chain of strings with common first/last characters?

This one seems pretty straightforward but I'm having trouble implementing it in Python.
Say you have an array of strings, all of length 2: ["ab","bc","cd","za","ef","fg"]
There are two chains here: "zabcd" and "efg"
What's the most efficient way of finding the longest one? (in this case it would be "zabcd").
Chains can be cyclical too... for instance ["ab","bc","ca"] (in this particular case the length of the chain would be 3).
This is clearly a graph problem, with the characters being the vertices and the pairs being unweighted directed edges.
If cycles are not allowed in the solution, this is the longest path problem, which is NP-hard, so "efficient" is probably out of the window. That remains true even if cycles are allowed (to get rid of the solution cycles, split the vertices in two, one for incoming edges and one for outgoing edges, with an edge in between). According to Wikipedia, no good approximation scheme is known.
If the graph is acyclic, then you can do it in linear time, as the Wikipedia article mentions:
A longest path between two given vertices s and t in a weighted graph
G is the same thing as a shortest path in a graph −G derived from G by
changing every weight to its negation. Therefore, if shortest paths
can be found in −G, then longest paths can also be found in G.[4]
For most graphs, this transformation is not useful because it creates
cycles of negative length in −G. But if G is a directed acyclic graph,
then no negative cycles can be created, and a longest path in G can be
found in linear time by applying a linear time algorithm for shortest
paths in −G, which is also a directed acyclic graph.[4] For instance,
for each vertex v in a given DAG, the length of the longest path
ending at v may be obtained by the following steps:
Find a topological ordering of the given DAG. For each vertex v of the
DAG, in the topological ordering, compute the length of the longest
path ending at v by looking at its incoming neighbors and adding one
to the maximum length recorded for those neighbors. If v has no
incoming neighbors, set the length of the longest path ending at v to
zero. In either case, record this number so that later steps of the
algorithm can access it. Once this has been done, the longest path in
the whole DAG may be obtained by starting at the vertex v with the
largest recorded value, then repeatedly stepping backwards to its
incoming neighbor with the largest recorded value, and reversing the
sequence of vertices found in this way.
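The quoted steps can be sketched as follows for the acyclic case. This is a minimal illustration, not a solution for inputs with cycles; the function name longest_path_dag is made up, and each two-character string is treated as a directed edge from its first to its second character, as described in the question.

```python
from collections import defaultdict, deque


def longest_path_dag(edges):
    """Return the vertices of a longest path in a DAG given as (u, v) pairs."""
    succ, pred, nodes = defaultdict(list), defaultdict(list), set()
    for u, v in edges:
        succ[u].append(v)
        pred[v].append(u)
        nodes.update((u, v))

    # Topological ordering (Kahn's algorithm).
    indegree = {n: len(pred[n]) for n in nodes}
    queue = deque(n for n in nodes if indegree[n] == 0)
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in succ[u]:
            indegree[v] -= 1
            if indegree[v] == 0:
                queue.append(v)
    if len(order) != len(nodes):
        raise ValueError("graph has a cycle; this sketch only handles DAGs")

    # Longest path ending at v: one more than the best incoming neighbour,
    # or zero if v has no incoming neighbours.
    length = {}
    for v in order:
        length[v] = max((length[u] + 1 for u in pred[v]), default=0)

    # Walk backwards from the vertex with the largest recorded value.
    v = max(order, key=length.get)
    path = [v]
    while pred[v]:
        v = max(pred[v], key=length.get)
        path.append(v)
    return path[::-1]


pairs = ["ab", "bc", "cd", "za", "ef", "fg"]
chain = "".join(longest_path_dag([tuple(p) for p in pairs]))
print(chain)
```

For the cyclic example ["ab", "bc", "ca"] the function raises ValueError, which reflects the point made above: once cycles are allowed, this linear-time approach no longer applies.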
There are other special cases where there are efficient algorithms available, notably trees, but since you allow cycles that probably doesn't apply to you.
I didn't give an algorithm here for you, but that should give you the right direction for your research. The problem itself may be straightforward, but an efficient solution is NOT.

Check if path exists that contains a set of N nodes

Given a graph g and a set of N nodes my_nodes = [n1, n2, n3, ...], how can I check if there's a path that contains all N nodes?
Checking among all_simple_paths for paths that contain all nodes in my_nodes becomes computationally cumbersome as the graph grows.
The search above can be limited to paths between pairs of nodes from my_nodes. This reduces complexity only to a small degree. Plus it requires a lot of Python looping, which is quite slow.
Is there a faster solution to the problem?
You could try a greedy algorithm here: start the path search from all the nodes you need to find, and explore your graph step by step. I can't provide a real sample, but the pseudo-code should be something like this:
Start n path stubs from all your n nodes to find
For all these path stubs adjust them by all the neighbors which weren't checked before
If you have some intersection between path stubs, then you got a new one, which does contain more of your needed nodes than before
If after merging the stub paths you have the one which covers all needed nodes, you're done
If there are still some additional nodes to add to the path, you continue with second step again
If there are no nodes left in the graph, the path doesn't exist
This algorithm has complexity O(E + N), because you're visiting the edges and nodes in non-recursive fashion.
However, in the case of a directed graph the "merge" is a bit more complicated. It can still be done, but the worst case may take a lot of time.
Update:
As you say that the graph is directed, the above approach wouldn't work well. In this case you may simplify your task like this:
Find the strongly connected components in the graph (I suggest you implement it yourself, e.g., with Kosaraju's algorithm). The complexity is O(E + N). You can use a NetworkX method for this if you want an out-of-the-box solution.
Create the condensation of graph, based on step 1 information, with saving the information about which component can be visited from other. Again, there is a NetworkX method for this.
Now you can easily say, which nodes from your set are in the same component, so a path containing all of them definitely exists.
After that all you need to check is a connectivity between different components for your nodes. For example, you can get the topological sort of condensation and do check in linear time again.
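The four steps can be sketched with the standard library only. The function names are made up, the graph is assumed to be a dict mapping each node to a list of successors, and the result answers whether a walk through all nodes exists (nodes may be revisited), which is what the component argument above establishes; it does not guarantee a simple path.

```python
from collections import defaultdict, deque


def scc_index(graph):
    """Kosaraju: map node -> component index, indices in topological order."""
    nodes = set(graph) | {v for vs in graph.values() for v in vs}
    reverse = defaultdict(list)
    for u, vs in graph.items():
        for v in vs:
            reverse[v].append(u)
    finished, seen = [], set()
    for root in nodes:                      # pass 1: finish order on the graph
        if root in seen:
            continue
        seen.add(root)
        stack = [(root, iter(graph.get(root, ())))]
        while stack:
            node, successors = stack[-1]
            for nxt in successors:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(graph.get(nxt, ()))))
                    break
            else:                           # all successors explored
                finished.append(node)
                stack.pop()
    component, index = {}, 0
    for root in reversed(finished):         # pass 2: reversed graph
        if root in component:
            continue
        component[root] = index
        stack = [root]
        while stack:
            node = stack.pop()
            for nxt in reverse[node]:
                if nxt not in component:
                    component[nxt] = index
                    stack.append(nxt)
        index += 1
    return component


def reachable(graph, source, target):
    """Breadth-first search from source until target is found."""
    seen, queue = {source}, deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return True
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def walk_through_all(graph, my_nodes):
    """True if some walk in the directed graph visits every node in my_nodes."""
    component = scc_index(graph)
    representative = {}
    for n in my_nodes:                      # one node per component is enough
        representative.setdefault(component[n], n)
    ordered = [representative[c] for c in sorted(representative)]
    return all(reachable(graph, a, b) for a, b in zip(ordered, ordered[1:]))


g = {"a": ["b"], "b": ["a", "c"], "c": ["d"], "d": [], "e": ["c"]}
print(walk_through_all(g, ["d", "a", "b"]), walk_through_all(g, ["a", "e"]))
```

Sorting the components by index works because Kosaraju's second pass discovers components in a topological order of the condensation, so only consecutive components need a reachability check.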

Any good algorithm to find a "best" path as defined below in a weighted graph?

Currently I'm learning graph theory, during which I was stuck in the following graph-related problem. Let's start with an example. This may sound like a pure math problem.
Eg: A weighted graph to be implemented:
A simplified graph is shown above (it was extracted from a more complex network).
There are 5 simple paths, 14 edges with their respective weights, and 11 nodes (the blue ones in different paths can be the same, but in a single path the nodes are totally different). I'm trying to find the best path between node s and node t (both colored as black) which satisfies the following conditions:
the path should be as short as possible;
the weights of each edge along the path should be as large as possible.
Or I can even rank all the simple paths based on some methods?
More specifically, let's consider the nodes to be the users in a social network, and the edges the associations between each pair of users. The weights are in proportion to the reliability of association (10 is the most reliable one while 1 is the least).
So, is there a good way to define and calculate the weights of indirect associations (the paths between node s and node t in the example)? As we know the more connections (edges) a single path has between s and t, the reliability of it tends to decline; moreover, the decrease in the reliability of each connection of this single path will also lead to the decline in its reliability. That's why the aforementioned conditions desire a shorter path, and larger weights in each edge of the path.
Thank you for your time, guys!
Because the problem defines only vaguely how favorable each possible path is, an answer to this question should first define such an ordering.
Based on your problem definition, more specifically from your example about the social network relations, I think we can derive the factors in favor and against how favorable a path is.
We know that each edge is in favor of the reliability of the path in an amount that is directly proportional to its cost, or value, in this case. Intuitively, there seems to be a factor in favor of the reliability of the path which is directly proportional to the average cost of the edges on it. You also mentioned the length to be a second factor that affects things, but this time in the other direction. (i.e. against the reliability of a path)
Considering those two factors, a formula such as the following may be derived, and used to rank the reliability of each path.
As you can observe, there is a summation expression where c_ei represents the cost of each edge e_i on the path, and n indicates the number of edges on the path. The entire summation divided by n is essentially the first factor I mentioned above (i.e. the average cost of edges on the path), while the second n in the expression n^2 in the denominator is the second factor, the length of the path, which works against the reliability of the path.
I also introduced 3 constants so that you can update this formula based on how you plan to make use of it. C2 is a factor indicating how effective the length of the path is in decreasing the reliability of the entire path. Similarly, C1 is a factor indicating how effective an increased average of edge costs is in making that path more reliable. And finally, C3 can be an optional factor which can be equal to either the minimum or the maximum edge cost on the path.
While C1 and C2 are relatively more intuitive to understand, here's an example case where C3 may come in handy. Suppose you have paths A and B with edge costs [3, 7, 8] and [5, 6, 7], respectively. As their path length and sum of the edge costs are the same, it is not possible to identify which is the more favorable path here. This is why we need a factor such as C3 in this case, and based on your need you can consider it to be equal to the minimum edge or the maximum edge for each path. If your problem definition chooses the former and assigns C3 the minimum edge cost of each path, then path B is considered the better as its minimum edge cost is higher. If the latter is chosen, however, path A is more favorable.
I am aware of the fact that not defining the constants in my answer may, in a way, make one feel that the answer is incomplete. I believe an assignment as given below should work for the time being.
C1 = 1
C2 = 1
C3 = min(c_ei)
Still, I believe different variations of this problem may require different values for these constants, which is why I refrain from stating that these values would hold for all variations of the problem.
The graph is essentially a reliability block diagram (see Wikipedia).
A reliability block diagram (RBD) is a diagrammatic method for showing
how component reliability contributes to the success or failure of a
complex system. RBD is also known as a dependence diagram (DD).
According to reliawiki.org, a wiki specialized in reliability theory:
... for a pure series system, the system reliability is
equal to the product of the reliabilities of its constituent
components.
reliawiki.org
The OP gave examples in the comments of two paths with equal number of edges and equal sum of the weights/reliability
Path [8, 7] is more reliable than Path [6, 9]
The path in a social network is a chain of dependency: each edge represents an association between two people, and the weight indicates how reliable that association is. A chain is only as reliable as its weakest link. That's why in the example Path [6, 9] is less reliable than Path [8, 7]: it has a weaker link, an edge with reliability 6, which puts an upper bound on the reliability of the path/chain. That's also why the formula indicated by reliawiki.org for a chain (series system) is the product of the individual reliabilities. Each factor is a reliability estimate, R, such that 0 <= R < 1, so adding another segment can only decrease the final product, and the lower the reliability R is, the lower the upper bound on the final product.
Effect of Component Reliability in a Series System--
In a series configuration, the component with the least reliability
has the biggest effect on the system's reliability. There is a saying
that a chain is only as strong as its weakest link. This is a good
example of the effect of a component in a series system. In a chain,
all the rings are in series and if any of the rings break, the system
fails. In addition, the weakest link in the chain is the one that will
break first. The weakest link dictates the strength of the chain in
the same way that the weakest component/subsystem dictates the
reliability of a series system. As a result, the reliability of a
series system is always less than the reliability of the least
reliable component.
reliawiki.org
Reliability is expressed as a number between 0 (completely unreliable) to 1 (completely reliable).
A first approximation is to say the reliability of a path is the product of the weights divided by the maximum weight (10). This allows R = 1.0, let's try it. Consider the example
Path A with Weights[10,10,10] and Path B with Weights[1,1],
Path A is apparently more desirable.
For that example
RA = 10/10 * 10/10 * 10/10
RA = 100%
RB = 1/10 * 1/10
RB = 1%
RA > RB therefore Path A is more reliable than Path B.
I said that's a 1st approximation, because if you add a third path
Path C with Weights[10,10]
RC = 10/10 * 10/10
RC = 100%
RA = RC yet we know Path C is more reliable (fewer edges). Conclusion: R < 1.0 is a requirement. Let's add a fudge factor to the denominator to make it greater than the maximum weight, 10+1 = 11, this ensures R < 1.0. Now you have
RA = 10/11 * 10/11 * 10/11
RA = 75%
RB = 1/11 * 1/11
RB = 0.83%
RC = 10/11 * 10/11
RC = 83%
The most reliable is Path C, at 83%.
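The numbers above can be reproduced with a short sketch. The function name path_reliability and its parameters are made up for illustration; the denominator max_weight + fudge corresponds to the 10 + 1 = 11 used above.

```python
from math import prod


def path_reliability(weights, max_weight=10, fudge=1):
    """Series reliability: product of the per-edge reliabilities.

    Each weight is scaled by (max_weight + fudge) so that every factor is
    strictly below 1 and longer paths are always penalised.
    """
    return prod(w / (max_weight + fudge) for w in weights)


paths = {"A": [10, 10, 10], "B": [1, 1], "C": [10, 10]}
ranking = sorted(paths, key=lambda name: path_reliability(paths[name]), reverse=True)
for name in ranking:
    print(name, f"{path_reliability(paths[name]):.2%}")
```

Ranking all simple paths between s and t then reduces to sorting them by this value, under the independence assumption discussed in the side note.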
Side Note
That Wikipedia article points out
In order to evaluate RBD, closed form solution are available in the
case of statistical independence among blocks or components. In the
case the statistical independence assumption is not satisfied,
specific formalisms and solution tools, such as dynamic RBD, have
to be considered.
In other words, that simple formula for series reliability where the total reliability is the product of the individual reliabilities is valid only if there is no correlation between them.
The financial crisis in 2008 was due in part to an incorrect assumption that the risk of default of individual mortgages was not highly correlated with other mortgages in other parts of the country... that assumption was wrong, and as they say the rest is history.

Networkx spring layout edge weights

I was wondering how spring_layout takes edge weight into account. From wikipedia,
'An alternative model considers a spring-like force for every pair of nodes (i,j) where the ideal length \delta_{ij} of each spring is proportional to the graph-theoretic distance between nodes i and j, without using a separate repulsive force. Minimizing the difference (usually the squared difference) between Euclidean and ideal distances between nodes is then equivalent to a metric multidimensional scaling problem.'
How is edge weight factored in, specifically?
This isn't a great answer, but it gives the basics. Someone else may come by who actually knows the Fruchterman-Reingold algorithm and can describe it. I'm giving an explanation based on what I can find in the code.
From the documentation,
weight : string or None optional (default=’weight’)
The edge attribute that holds the numerical value used for the edge weight. If None, then all edge weights are 1.
But that doesn't tell you what it does with the weight, which is your question.
You can find the source code. If you send in weighted edges, it will create an adjacency matrix A with those weights and pass A to _fruchterman_reingold.
Looking at the code there, the meat of it is in this line
displacement=np.transpose(np.transpose(delta)*\
(k*k/distance**2-A*distance/k)).sum(axis=1)
The A*distance is calculating how strong of a spring force is acting on the node. A larger value in the corresponding A entry means that there is a relatively stronger attractive force between those two nodes (or if they are very close together, a weaker repulsive force). Then the algorithm moves the nodes according to the direction and strength of the forces. It then repeats (50 times by default). Interestingly, if you look at the source code you'll notice a t and dt. It appears that at each iteration, the force is multiplied by a smaller and smaller factor, so the steps get smaller.
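To see the effect of the weight in isolation, the scalar factor from the quoted line can be evaluated for a single pair of nodes. This is only a sketch of that one expression, not the NetworkX implementation; the function name pair_factor is made up, and the sign interpretation assumes that delta points from the other node towards the node being moved.

```python
def pair_factor(distance, k, a):
    """Scalar that multiplies delta for one pair of nodes.

    k * k / distance**2 is the repulsive term, a * distance / k the attractive
    (spring) term, where a is the adjacency-matrix entry, i.e. the edge weight.
    Positive: the nodes are pushed apart. Negative: they are pulled together.
    """
    return k * k / distance**2 - a * distance / k


for weight in (0, 1, 3):
    print(weight, pair_factor(distance=1.0, k=1.0, a=weight))
```

With no edge (weight 0) only the repulsion remains; as the weight grows, the factor decreases and eventually turns negative, so heavily weighted pairs end up closer together in the layout.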
Here is a link to the paper describing the algorithm, which unfortunately is behind a paywall. Here is a link to the paper on the author's webpage

How do I find the path with the biggest sum of weights in a weighted graph?

I have a bunch of objects with level, weight and 0 or more connections to objects of the next levels. I want to know how do I get the "heaviest" path (with the biggest sum of weights).
I'd also love to know of course, what books teach me how to deal with graphs in a practical way.
Your graph is acyclic, right? (I presume so, since a node always points to a node on the next level.) If your graph can have arbitrary cycles, the problem of finding the longest path becomes NP-complete and some form of exhaustive search is essentially the only option.
Back to the problem: you can solve this by finding, for each node, the heaviest path that leads up to it. Since you already have a topological sort of your DAG (the levels themselves), it is straightforward to find the paths:
For each node, store the cost of the heaviest path that leads to it and the last node before that on the said path. Initially, this is always empty (but a sentinel value, like a negative number for the cost, might simplify code later)
For nodes in the first level, you already know the cost of the heaviest path that ends in them - it is zero (and the parent node is None)
For each level, propagate the path info to the next level - this is similar to a normal algo for shortest distance:
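for level in range(nlevels):
    for node in nodes[level]:
        cost = cost_to[node]  # cost of the heaviest path found so far that ends at node
        for neighbour_vertex, edge_cost in edges[node]:
            alt_cost = cost + edge_cost
            # ">" rather than "<": we keep the heavier path, not the shorter one
            if alt_cost > cost_to[neighbour_vertex]:
                cost_to[neighbour_vertex] = alt_cost
                parent[neighbour_vertex] = node  # remember how we got here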
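A self-contained version of this propagation, including the path reconstruction, might look like the sketch below. The data layout (a list of levels plus a dict of weighted edges) and the name heaviest_path are assumptions for illustration; the question does not say how the objects are stored.

```python
def heaviest_path(levels, edges):
    """Return (cost, path) of the heaviest path starting in the first level.

    levels: list of lists of nodes, first level first.
    edges:  {node: [(neighbour, edge_cost), ...]}, pointing only to later levels.
    """
    unreached = float("-inf")               # sentinel: no path found yet
    cost_to = {n: unreached for level in levels for n in level}
    parent = {n: None for n in cost_to}
    for n in levels[0]:
        cost_to[n] = 0                      # paths start in the first level

    for level in levels:
        for node in level:
            if cost_to[node] == unreached:
                continue                    # node cannot be reached at all
            for neighbour, edge_cost in edges.get(node, ()):
                alt_cost = cost_to[node] + edge_cost
                if alt_cost > cost_to[neighbour]:
                    cost_to[neighbour] = alt_cost
                    parent[neighbour] = node

    end = max(cost_to, key=cost_to.get)     # heaviest endpoint overall
    best = cost_to[end]
    path = []
    while end is not None:                  # follow parents back to the start
        path.append(end)
        end = parent[end]
    return best, path[::-1]


levels = [["a"], ["b", "c"], ["d"]]
edges = {"a": [("b", 1), ("c", 5)], "b": [("d", 10)], "c": [("d", 2)]}
print(heaviest_path(levels, edges))
```

Note that the locally heavier edge a -> c (5) is not on the heaviest path; the propagation finds a -> b -> d with total 11.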
My book recommendation is Steve Skiena's "Algorithm Design Manual". There's a nice chapter on graphs.
I assume that you can only go down to a lower level in the graph.
Notice how the graph forms a tree. Then you can solve this using recursion:
heaviest_path(node n) = value[n] + max(heaviest_path(children[n][0]), heaviest_path(children[n][1]), etc)
This can easily be optimized by using dynamic programming instead.
Start with the children with the lowest level. Their heaviest_path is just their own value. Keep track of this in an array. Then calculate the heaviest_path for the next level up. Then the next level up, etc.
The method which I generally use to find the 'heaviest' path is to negate the weights and then find the shortest path. There are good algorithms (http://en.wikipedia.org/wiki/Shortest_path_problem) to find the shortest path. But this method holds good only as long as you do not have a positive-weight cycle in your original graph.
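The recursion with dynamic programming can be sketched with memoisation, which gives the same effect as filling the array level by level. The names heaviest_from, value and children are illustrative, and node weights are stored in a dict here as an assumption about the data.

```python
from functools import lru_cache


def heaviest_from(root, value, children):
    """Weight of the heaviest downward path starting at root.

    value:    {node: weight of that node}
    children: {node: [child, ...]}; leaves may be missing from the dict.
    """
    @lru_cache(maxsize=None)                # each node is evaluated only once
    def best(node):
        below = max((best(c) for c in children.get(node, ())), default=0)
        return value[node] + below

    return best(root)


value = {"a": 1, "b": 2, "c": 10, "d": 3}
children = {"a": ["b", "c"], "b": ["d"]}
print(heaviest_from("a", value, children))
```

Because results are cached per node, the sketch also works when several parents share a child, i.e. when the structure is a DAG rather than a strict tree.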
For graphs having positive-weight cycles the problem of finding the 'heaviest' path is NP-complete and your algorithm to find the heaviest path will have non-polynomial time complexity.
