I have 2 layers with links and nodes: layer A (yellow) and layer B (blue).
I would like to get the places where the lines of layer A intersect with the lines of layer B (red nodes), directly in python.
I have the coordinates for all nodes in both layers (the nodes of layers A and B are hidden in the image below).
I saw this option to find line intersections in Python, but since layer A has approx. 23,000 lines and layer B 50,000, it would be too computationally intensive to use it:
from shapely.geometry import LineString
line1 = LineString([(x1,y1), (x2,y2), (x3,y3)])
line2 = LineString([(x4,y4), (x5,y5)])
output = line1.intersection(line2)
Does anyone know a better (faster) way to get these intersection nodes?
Thanks a lot!
Okay, I can see what you mean. I will try to explain a method to do this. You could easily do this by brute force, but as you mentioned it is time consuming, especially when there are thousands of nodes and edges. I can suggest a less time-consuming method.
Say there are N nodes in layer A and M nodes in layer B. Then your method has time complexity O(N*M).
My method is moderately complex. I can't implement the code here, so I will describe the steps as best I can; you will have to figure out how to implement them in code. The con: it may miss some intersections.
We use localization in the graph. In other words, we select (relatively) small windows from the layers and perform the same intersection test you have done, using shapely.
First, we have to determine the window size we are going to use:
Determine the size of the rectangle that contains all the nodes in both layers.
Divide it into small squares of the same size (like a square grid).
Build a zero matrix with one element per square.
Count the nodes in each square and assign that count to the corresponding matrix element. This has time complexity O(N+M).
Now you have the density of nodes.
If the density of a tile is high, use a small window (3x3 would be enough), with the selected tile in the middle. Since the nodes are close together, there is little chance of missing intersections. Then perform your method on the nodes inside the selected window. This has time complexity O(n*m), where n and m are the numbers of nodes inside the window.
If the density of a tile is low, use a large window (5x5, 7x7, 9x9, ... you can decide), with the selected tile in the middle. Since the nodes are far apart, the larger window keeps the chance of missing intersections low. Then perform your method on the nodes inside the selected window. This again has time complexity O(n*m), where n and m are the numbers of nodes inside the window.
How does this reduce the time? It avoids comparing nodes that are far away from the selected lines. If you have learned about time complexity, you will see this is much more efficient than your previous brute-force approach. It is slightly less accurate than your approach, but faster.
ATTENTION: This is my method for selecting fewer nodes for comparison. There may be methods that are much faster and more accurate than my solution, but most of them will be very complex. If you are worried about accuracy, use your method or look for a faster and more accurate one; otherwise you can use mine.
IN ADDITION: You can determine the window size from the line length instead of from the node density, with no need to draw a grid: select a large window around a line if the line is long, and a small window if it is short. This is also fast, and I think it will be more accurate than my previous method.
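For what it's worth, here is a minimal sketch of the windowing idea in Python, simplified to a fixed cell size and line bounding boxes instead of the node-density rule (the function and parameter names are my own, and lines_a / lines_b are assumed to be lists of shapely LineStrings):

from collections import defaultdict

def covered_cells(bounds, cell_size):
    # Grid cells touched by a bounding box (minx, miny, maxx, maxy).
    minx, miny, maxx, maxy = bounds
    for i in range(int(minx // cell_size), int(maxx // cell_size) + 1):
        for j in range(int(miny // cell_size), int(maxy // cell_size) + 1):
            yield (i, j)

def grid_intersections(lines_a, lines_b, cell_size):
    # Index layer B lines by the grid cells their bounding boxes touch.
    grid = defaultdict(set)
    for idx, line in enumerate(lines_b):
        for cell in covered_cells(line.bounds, cell_size):
            grid[cell].add(idx)
    points = []
    for line in lines_a:
        # Only test against layer B lines sharing at least one cell (the "window").
        candidates = set()
        for cell in covered_cells(line.bounds, cell_size):
            candidates |= grid[cell]
        for idx in candidates:
            inter = line.intersection(lines_b[idx])
            if not inter.is_empty:
                points.append(inter)
    return points

Because candidates are chosen by bounding box rather than by node density, this variant should not miss intersections, but its speed still depends on picking a sensible cell_size.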
I have a problem that is rather simple to define, but I have not found a simple answer so far.
I have two graphs (i.e. sets of vertices and edges) which are identical. Each of them has independently labelled vertices. Look at the example below:
How can the computer detect, without prior knowledge of it, that 1 is identical to 9, 2 to 10 and so on?
Note that in the case of symmetry, there may be several possible one-to-one pairings which give complete equivalence, but finding just one of them is sufficient for me.
This is in the context of a Python implementation. Does someone have a pointer to a simple algorithm publicly available on the Internet? The problem sounds simple, but I lack the mathematical knowledge to come up with it myself or to find the proper keywords to search for.
EDIT: Note that I also have atom types (i.e. labels) for each graph, as well as the full distance matrix for the two graphs to align. However, the positions may be similar but not exactly equal.
This is known as the graph isomorphism problem, and it is probably very hard, although the exact details of how hard are still a subject of research.
(But things look better if your graphs are planar.)
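If you just want a mapping in Python rather than theory, NetworkX ships a VF2-based matcher that handles exactly this; a minimal sketch (the toy graphs and the "element" attribute name are illustrative assumptions, not your data):

import networkx as nx
from networkx.algorithms import isomorphism

# Two small illustrative graphs with an "element" label on each vertex.
G1 = nx.Graph()
G1.add_nodes_from([(1, {"element": "C"}), (2, {"element": "C"}), (3, {"element": "O"})])
G1.add_edges_from([(1, 2), (2, 3)])

G2 = nx.Graph()
G2.add_nodes_from([(9, {"element": "C"}), (10, {"element": "C"}), (11, {"element": "O"})])
G2.add_edges_from([(9, 10), (10, 11)])

# Only allow pairings between vertices carrying the same label (atom type).
gm = isomorphism.GraphMatcher(
    G1, G2, node_match=isomorphism.categorical_node_match("element", None))
if gm.is_isomorphic():
    print(gm.mapping)   # e.g. {1: 9, 2: 10, 3: 11}

Restricting node_match to the atom type prunes the search considerably, which is the same idea as the "identity card" used in the answer below.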
So, after searching for it a bit, I think I found a solution that works most of the time for a moderate computational cost. It is a kind of genetic algorithm that uses a bit of randomness, but it seems practical enough for my purposes. I have not had any aberrant configurations with my samples so far, even though it is theoretically possible.
Here is how I proceeded:
1. Determine the complete set of 2-paths, 3-paths and 4-paths.
2. Determine vertex types using both the atom type and the surrounding topology, creating an "identity card" for each vertex.
3. Do the following ten times:
4. Start with a random candidate set of pairings complying with the allowed vertex types.
5. Evaluate how many of the 2-paths, 3-paths and 4-paths correspond between the two graphs under the candidate pairing, scoring one point for each corresponding vertex (also using the atom type as an additional descriptor).
6. Evaluate all other shortlisted candidates for a given vertex by permuting the pairings for this candidate with its other positions in the same way.
7. Sort the scores in descending order.
8. For each score, check whether the configuration is among the excluded configurations; if it is not, take it as the new configuration and add it to the excluded configurations.
9. If the score is perfect (i.e. all of the 2-paths, 3-paths and 4-paths correspond), stop the loop and calculate the sum of absolute differences between the distance matrices of the two graphs using the selected pairing; otherwise go back to step 4.
10. Stop this process after it has been done 10 times.
11. Check the differences between the distance matrices and take the pairing associated with the minimal sum of absolute differences between the distance matrices.
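For the last step, the sum of absolute differences between the distance matrices under a candidate pairing is short to write with NumPy; a minimal sketch (the pairing dict and the two matrices are assumed inputs, and the function name is mine):

import numpy as np

def pairing_cost(D1, D2, pairing):
    # D1, D2: full distance matrices of the two graphs (same shape).
    # pairing: dict mapping vertex indices of graph 1 to vertex indices of graph 2.
    idx = np.array([pairing[i] for i in range(D1.shape[0])])
    # Reorder D2's rows and columns according to the candidate pairing, then compare.
    return np.abs(D1 - D2[np.ix_(idx, idx)]).sum()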
Recently, I've become interested in data analysis.
So I researched how to do a machine-learning project and tried one myself.
I learned that scaling is important when handling features.
So I scaled every feature while using tree models like Decision Tree or LightGBM.
Then the results were worse when I scaled.
I searched on the Internet, but all I learned is that tree and ensemble algorithms are not sensitive to the variance of the data.
I also bought the book "Hands-On Machine Learning" from O'Reilly, but I couldn't find a sufficient explanation there.
Can I get a more detailed explanation of this?
Though I don't know the exact notation and equations, the answer has to do with the Big O notation of the algorithms.
Big O notation is a way of expressing the theoretical worst-case time for an algorithm to complete over extremely large data sets. For example, a simple loop that goes over every item in a one-dimensional array of size n has O(n) run time, which is to say that its running time is always proportional to the size of the array.
Say you have a two-dimensional array of X,Y coordinates and you loop over every potential combination of x/y locations, where x has size n and y has size m; your Big O would be O(mn),
and so on. Big O is used to compare the relative speed of different algorithms in the abstract, so that you can try to determine which one is better to use.
If you graph O(n) over the different potential sizes of n, you end up with a straight 45-degree line.
As you get into more complex algorithms you can end up with O(n^2), O(log n), or even more complex expressions. Generally, though, most algorithms fall into O(n), O(n^(some exponent)), O(log n) or O(sqrt(n)); there are obviously others, but most fall into these, with some coefficient in front or behind that shifts where they sit on the graph. If you graph each of those curves, you'll very quickly see which ones are better for extremely large data sets.
It would entirely depend on how well your algorithm is coded, but it might look something like this (don't trust me on this math; I started working it out and then just googled it):
Fitting a decision tree of depth m:
Naïve analysis: 2^m - 1 trees -> O((2^m - 1) n d log(n)).
With each object appearing only once at a given depth: O(m n d log(n)).
And a log n curve ... well, it pretty much doesn't change at all even for very large n, does it?
So it doesn't matter how big your data set is: these algorithms are very efficient in what they do, and their cost barely grows, because of the nature of a log curve (the worst increase in time for +1 n is at the very beginning; after that it levels off, with only extremely minor increases as n gets larger and larger).
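If you want to see how flat the log curve really is, a throwaway comparison of the common growth rates makes it obvious (illustrative values only):

import math

# How the common complexity classes grow with n: the log column is almost flat.
for n in (10, 1000, 100000, 10000000):
    print(n, round(math.log2(n), 1), round(n * math.log2(n)), n ** 2)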
Do not confuse trees with ensembles (which may consist of models that do need scaling).
Trees do not need scaled features, because at each node the entire set of observations is split on the value of one feature: relatively speaking, everything less than a certain value goes to the left and everything greater goes to the right. What difference does the chosen scale make, then?
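A quick way to convince yourself is to fit the same tree on raw and on standardized features and compare the predictions; with a fixed random_state they should come out identical (up to tie-breaking between equally good splits). A sketch using scikit-learn and the iris data purely for illustration:

from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

raw = DecisionTreeClassifier(random_state=0).fit(X, y)
scaled = DecisionTreeClassifier(random_state=0).fit(X_scaled, y)

# The split thresholds differ numerically, but the partitions (and hence the
# predictions) are the same, because only the ordering of each feature matters.
print((raw.predict(X) == scaled.predict(X_scaled)).all())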
The Problem
I've been doing a bit of research on Particle Swarm Optimization, so I said I'd put it to the test.
The problem I'm trying to solve is the Balanced Partition Problem - or reduced simply to the Subset Sum Problem (where the sum is half of all the numbers).
It seems the generic formula for updating particle velocities is the standard one,
v_i = w*v_i + c1*r1*(pbest_i - x_i) + c2*r2*(gbest - x_i),
but I won't go into too much detail for this question.
Since there's no PSO attempt online for the Subset Sum Problem, I looked at the Travelling Salesman Problem instead.
Their approach to updating velocities involved taking sets of visited towns, subtracting one from another and doing some manipulation on that.
I saw no relation between that and the formula above.
My Approach
So I scrapped the formula and tried my own approach to the Subset Sum Problem.
I basically used gbest and pbest to determine the probability of removing or adding a particular element to the subset.
i.e. if my problem space is [1,2,3,4,5] (the target is 7 or 8), my current particle (subset) is [1,None,3,None,None], and the gbest is [None,2,3,None,None], then there is a higher probability of keeping 3, adding 2 and removing 1, based on gbest.
I can post code but don't think it's necessary; you get the idea (I'm using Python, by the way - hence the None).
So basically, this worked to an extent; I got decent solutions out, but it was very slow on larger data sets and values.
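Roughly sketched (this is not my actual code, and p_follow is just an illustrative parameter), the update looks something like this:

import random

def update_particle(particle, pbest, gbest, p_follow=0.7):
    # particle, pbest, gbest are lists like [1, None, 3, None, None].
    new = []
    for cur, pb, gb in zip(particle, pbest, gbest):
        # Pick a guide (gbest or pbest), then with probability p_follow copy
        # that guide's choice for this element; otherwise keep the current one.
        guide = gb if random.random() < 0.5 else pb
        new.append(guide if random.random() < p_follow else cur)
    return new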
My Question
Am I encoding the problem and updating the particle "velocities" in a smart way?
Is there a way to determine if this will converge correctly?
Is there a resource I can use to learn how to create convergent "update" formulas for specific problem spaces?
Thanks a lot in advance!
Encoding
Yes, you're encoding this correctly: each of your bit-maps (that's effectively what your 5-element lists are) is a particle.
Concept
Your conceptual problem with the equation is that your problem space is a discrete lattice graph, which doesn't lend itself immediately to the update step. For instance, if you wanted finer granularity by adjusting your learning rate, you'd generally reduce it by some small factor (say, 3). In this space, what does it mean to take steps only 1/3 as large? That's why you have problems.
The main possibility I see is to create 3x as many particles, but then have the transition probabilities all divided by 3. This still isn't very satisfying, but it does simulate the process somewhat decently.
Discrete Steps
If you have a very large graph, where a high velocity could give you dozens of transitions in one step, you can utilize a smoother distance (loss or error) function to guide your model. With something this small, where you have no more than 5 steps between any two positions, it's hard to work with such a concept.
Instead, you utilize an error function based on the estimated distance to the solution. An easy one is to subtract the particle's total from the nearer of 7 or 8. A harder one is to estimate the distance based on that difference and the particle elements "in play".
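A minimal version of that easy error function might look like this (targets 7 and 8, as in your example):

def error(particle, targets=(7, 8)):
    # Distance from the particle's subset sum to the nearest allowed target.
    total = sum(x for x in particle if x is not None)
    return min(abs(total - t) for t in targets)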
Proof of Convergence
Yes, there is a way to do it, but it requires some functional analysis. In general, you want to demonstrate that the error function is convex over the particle space. In other words, you'd have to prove that your error function is a reliable distance metric, at least as far as relative placement goes (i.e. prove that a lower error does imply you're closer to a solution).
Creating update formulae
No, this is a heuristic field, based on the shape of the problem space as defined by the particle coordinates, the error function, and the movement characteristics.
Extra recommendation
Your current allowable transitions are "add element" and "delete element".
Add "swap elements" to these: trade one present member for an absent one. This will allow the trivial error function to define a convex space for you, and you'll converge in very little time.
A picture is worth a thousand words, so:
My input is the matrix on the left, and what I need to find is the sets of nodes that are at most one step away from each other (not diagonally). A node that is more than one up/down/left/right step away would be in a separate set.
So my plan was to run a BFS from every node I find, return the set it traversed, and remove that set from the original set, iterating this process until I'm done. But then I had the wild idea of looking for graph analysis tools, and I found NetworkX. Is there an easy way (an algorithm?) to achieve this without manually writing a BFS and traversing the whole matrix?
Thanks
What you are trying to do is search for "connected components", and
NetworkX has a method for doing exactly that, as can be seen in the first example on this documentation page and as others have already pointed out in the comments.
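A minimal sketch of that route, assuming a 0/1 occupancy matrix like the one in your picture (the example matrix is made up):

import numpy as np
import networkx as nx

# Made-up 0/1 occupancy matrix: 1 means there is a node in that cell.
matrix = np.array([[1, 1, 0, 0],
                   [0, 1, 0, 1],
                   [0, 0, 0, 1],
                   [1, 0, 0, 0]])

G = nx.Graph()
rows, cols = matrix.shape
for r in range(rows):
    for c in range(cols):
        if matrix[r, c]:
            G.add_node((r, c))
            # Link to the already-seen neighbours above and to the left (4-connectivity).
            if r > 0 and matrix[r - 1, c]:
                G.add_edge((r, c), (r - 1, c))
            if c > 0 and matrix[r, c - 1]:
                G.add_edge((r, c), (r, c - 1))

print(list(nx.connected_components(G)))   # one set of (row, col) cells per group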
Reading your question, it seems that your nodes are on a discrete grid and that the concept of connectedness you describe is the same one used for the pixels of an image.
Connected components algorithms are available for graphs and for images also.
If performance is important in your case, I would suggest you go for the image version of connected components.
This comes from the fact that images (grids of pixels) are a specific class of graphs, so the connected-components algorithms that deal with grids of nodes are built knowing the topology of the graph itself (i.e. the graph is planar and the maximum vertex degree is four). A general algorithm has to be able to work on arbitrary graphs (i.e. possibly non-planar, with multiple edges between some nodes), so it has to do more work because it can't assume much about the properties of the input graph.
Since connected components can be found on graphs in linear time, I am not saying the image version would be orders of magnitude faster; there will only be a constant factor between the two.
For this reason you should also take into account the data structure that holds your input data and how much time will be spent creating the input structures required by each version of the algorithm.
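If your input is already a NumPy array, the image version is essentially one call to scipy.ndimage.label, whose default structuring element is exactly the 4-connectivity you describe (same made-up matrix as above):

import numpy as np
from scipy import ndimage

matrix = np.array([[1, 1, 0, 0],
                   [0, 1, 0, 1],
                   [0, 0, 0, 1],
                   [1, 0, 0, 0]])

labels, num_components = ndimage.label(matrix)   # default structure = 4-connectivity
print(num_components)   # 3 for this matrix
print(labels)           # same shape as the input, one integer label per component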
I was wondering how spring_layout takes edge weight into account. From wikipedia,
'An alternative model considers a spring-like force for every pair of nodes (i,j) where the ideal length δ_ij of each spring is proportional to the graph-theoretic distance between nodes i and j, without using a separate repulsive force. Minimizing the difference (usually the squared difference) between Euclidean and ideal distances between nodes is then equivalent to a metric multidimensional scaling problem.'
How is edge weight factored in, specifically?
This isn't a great answer, but it gives the basics. Someone else may come by who actually knows the Fruchterman-Reingold algorithm and can describe it. I'm giving an explanation based on what I can find in the code.
From the documentation,
weight : string or None optional (default=’weight’)
The edge attribute that holds the numerical value used for the edge weight. If None, then all edge weights are 1.
But that doesn't tell you what it does with the weight, which is your question.
You can find the source code. If you send in weighted edges, it will create an adjacency matrix A with those weights and pass A to _fruchterman_reingold.
Looking at the code there, the meat of it is in this line
displacement=np.transpose(np.transpose(delta)*\
(k*k/distance**2-A*distance/k)).sum(axis=1)
The A*distance term calculates how strong a spring force is acting on the node. A larger value in the corresponding entry of A means there is a relatively stronger attractive force between those two nodes (or, if they are very close together, a weaker repulsive force). The algorithm then moves the nodes according to the direction and strength of the forces, and repeats (50 times by default). Interestingly, if you look at the source code you'll notice a t and a dt: at each iteration the force is multiplied by a smaller and smaller factor, so the steps get smaller.
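In practice, then, larger weights pull node pairs closer together in the layout; a minimal sketch (weight="weight" is the default attribute name, and the seed argument is available in recent NetworkX versions):

import networkx as nx

G = nx.Graph()
G.add_edge("a", "b", weight=4.0)   # heavier weight -> stronger attraction -> drawn closer
G.add_edge("b", "c", weight=0.5)   # lighter weight -> weaker attraction -> pushed further apart
G.add_edge("a", "c", weight=1.0)

# weight="weight" is the default; pass weight=None to ignore the attribute entirely.
pos = nx.spring_layout(G, weight="weight", iterations=50, seed=42)
print(pos)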
Here is a link to the paper describing the algorithm, which unfortunately is behind a paywall. Here is a link to the paper on the author's webpage