Python NumPy vectorization

I'm trying to code what is known as the List Right Heuristic for the unweighted vertex cover problem. The background is as follows:
Vertex Cover Problem: In the vertex cover problem, we are given an undirected graph G = (V, E) where V is the set of vertices and E is the set of Edges. We need to find the smallest set V' which is a subset of V such that V' covers G. A set V' is said to cover a graph G if all the edges in the graph have at least one vertex in V'.
List Right Heuristic: The algorithm is very simple. Given a list of vertices V = [v1, v2, ... vn], where n is the number of vertices in G, vi is said to be a right neighbor of vj if i > j and vi and vj are connected by an edge in G. We initialize a cover C = {} (the empty set) and scan V from right to left. Say the vertex currently being scanned is u. If u has at least one right neighbor not in C, then u is added to C. V is scanned only once.
I'm solving this for multiple graphs (with same vertices but different edges) at once.
I coded the List Right Heuristic in python. I was able to vectorize it to solve multiple graphs at once, but I was unable to vectorize the original for loop. I'm representing the graph using an Adjacency matrix. I was wondering if it can be further vectorized. Here's my code:
import numpy as np
import numpy.matlib  # needed so that np.matlib.repmat is available

def list_right_heuristic(population: np.ndarray, adj_matrix: np.ndarray):
    adj_matrices = np.matlib.repmat(adj_matrix, population.shape[0], 1).reshape((population.shape[0], *adj_matrix.shape))
    for i in range(population.shape[0]):
        # Remove covered vertices from the graph. Delete corresponding edges
        adj_matrices[i, np.outer(population[i], population[i]).astype(bool)] = 0
    vertex_covers = np.zeros(shape=population.shape, dtype=population.dtype)
    for index in range(population.shape[-1] - 1, -1, -1):
        # Get num of intersecting elements (for each row) in right neighbors and vertex_covers
        inclusion_rows = np.sum(((1 - vertex_covers) * adj_matrices[..., index])[..., index + 1:], axis=-1).astype(bool)
        # Only add vertices to cover for rows which have at least one right neighbor not in vertex cover
        vertex_covers[inclusion_rows, index] = 1
    return vertex_covers
I have p graphs that I'm trying to solve simultaneously, where p = population.shape[0]. Each graph has the same vertices but different edges. The population array is a 2D array in which each row marks the vertices of the graph G that are already in the cover. I only want to find the vertices that are not yet in the cover. For this reason, I set all rows and columns of the vertices in the cover to 0, i.e., I delete the corresponding edges. In theory, the heuristic should then only return vertices that are not already in the cover.
So in the first for loop, I set the corresponding rows and columns in the adjacency matrix to 0 (all elements in those rows and columns will be zero). Next, I go through the 2D array of vertices from right to left and count, for each row, the right neighbors that are not in vertex_covers. To do this, I first find the vertices not in the cover (1 - vertex_covers) and multiply that by the corresponding column of adj_matrices (or row, since the adjacency matrix is symmetric) to get the neighbors of the vertex being scanned. Then I sum all elements to the right of it. If this value is greater than 0, there's at least one right neighbor not in vertex_covers.
First, am I doing this correctly?
And is there any way to vectorize the second for loop (or the first, for that matter), or to speed up the code in general? I'm calling this function thousands of times from other code, on large graphs (1000+ vertices). Any help would be appreciated.

You can use np.einsum to perform many complex operations between indices. In your case, the first loop can be performed this way:
adj_matrices[np.einsum('ij, ik->ijk', population, population).astype(bool)] = 0
It took me some time to understand how einsum works. I found this SO question very helpful.
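As a quick sanity check, here is a minimal sketch comparing the einsum expression with the per-row np.outer loop from the question. The toy population array is made up for illustration, and the broadcasting variant is just an equivalent alternative, not something the question's code uses.

```python
import numpy as np

# Hypothetical toy input: 3 graphs, 4 vertices each (1 = vertex already in the cover).
population = np.array([[1, 0, 1, 0],
                       [0, 0, 1, 1],
                       [1, 1, 0, 0]])

# Loop version from the question: one outer product per row of `population`.
looped = np.stack([np.outer(row, row) for row in population]).astype(bool)

# einsum version: 'ij,ik->ijk' builds all the outer products in one call.
batched = np.einsum('ij,ik->ijk', population, population).astype(bool)

# Plain broadcasting produces the same mask.
broadcast = (population[:, :, None] * population[:, None, :]).astype(bool)

print(np.array_equal(looped, batched), np.array_equal(looped, broadcast))
```

The resulting boolean mask has shape (p, n, n), so it can index adj_matrices directly, as in the one-liner above.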
BTW, your code gave me the following syntax error (Python versions before 3.5 don't allow a starred expression inside a tuple):
SyntaxError: can use starred expression only as assignment target
so I had to rewrite the first line of the function as:
adj_matrices = np.matlib.repmat(adj_matrix, population.shape[0], 1).reshape((population.shape[0],) + adj_matrix.shape)

Related

How to get the K most distant points, given their coordinates?

We have a boring CSV with 10000 rows of ages (float), titles (enum/int), scores (float), ....
We have N columns each with int/float values in a table.
You can think of the rows as points in ND space.
We want to pick K points that maximise the distances between each other.
So if we have 100 points in a tightly packed cluster and one point in the distance we would get something like this for three points:
or this
For 4 points it will become more interesting and pick some point in the middle.
So how do we select the K most distant rows (points) out of N (with any complexity)? It looks like an ND point cloud "triangulation" with a given resolution, just not for 3D points.
I'm looking for a reasonably fast approach (approximate; no precise solution needed) for K=200, N=100000 and ND=6 (probably multigrid, ANN on a KD-tree, SOM or triangulation based). Does anyone know one?

From past experience with a pretty similar problem, a simple solution of computing the mean Euclidean distance of all pairs within each group of K points and then taking the largest mean, works very well. As someone noted above, it's probably hard to avoid a loop on all combinations (not on all pairs). So a possible implementation of all this can be as follows:
import itertools
import numpy as np
from scipy.spatial.distance import pdist

Npoints = 3  # or 4 or 5...
# making up some data (a plain array instead of the deprecated np.matrix):
data = np.array([[3,2,4,3,4],[23,25,30,21,27],[6,7,8,7,9],[5,5,6,6,7],[0,1,2,0,2],[3,9,1,6,5],[0,0,12,2,7]])
# finding row indices of all combinations:
c = [list(x) for x in itertools.combinations(range(len(data)), Npoints)]
distances = []
for i in c:
    distances.append(np.mean(pdist(data[i, :])))  # pdist: a method of computing all pairwise Euclidean distances in a condensed way.
ind = distances.index(max(distances))  # finding the index of the max mean distance
rows = c[ind]  # these are the points in question

I propose an approximate solution. The idea is to start from a set of K points chosen in a way I'll explain below, and repeatedly loop through these points replacing the current one with the point, among the N-K+1 points not belonging to the set but including the current one, that maximizes the sum of the distances from the points of the set. This procedure leads to a set of K points where the replacement of any single point would cause the sum of the distances among the points of the set to decrease.
To start the process we take the K points that are closest to the mean of all points. This way we have good chances that on the first loop the set of K points will be spread out close to its optimum. Subsequent iterations will make adjustments to the set of K points towards a maximum of the sum of distances, which for the current values of N, K and ND appears to be reachable in just a few seconds. In order to prevent excessive looping in edge cases, we limit the number of loops nonetheless.
We stop iterating when an iteration does not improve the total distance among the K points. Of course, this is a local maximum. Other local maxima will be reached for different initial conditions, or by allowing more than one replacement at a time, but I don't think it would be worthwhile.
The data must be adjusted in order for unit displacements in each dimension to have the same significance, i.e., in order for Euclidean distances to be meaningful. E.g., if your dimensions are salary and number of children, unadjusted, the algorithm will probably yield results concentrated in the extreme salary regions, ignoring that person with 10 kids. To get a more realistic output you could divide salary and number of children by their standard deviation, or by some other estimate that makes differences in salary comparable to differences in number of children.
To be able to plot the output for a random Gaussian distribution, I have set ND = 2 in the code, but setting ND = 6, as per your request, is no problem (except you cannot plot it).
import matplotlib.pyplot as plt
import numpy as np
import scipy.spatial as spatial
N, K, ND = 100000, 200, 2
MAX_LOOPS = 20
SIGMA, SEED = 40, 1234
rng = np.random.default_rng(seed=SEED)
means, variances = [0] * ND, [SIGMA**2] * ND
data = rng.multivariate_normal(means, np.diag(variances), N)
def distances(ndarray_0, ndarray_1):
    if (ndarray_0.ndim, ndarray_1.ndim) not in ((1, 2), (2, 1)):
        raise ValueError("bad ndarray dimensions combination")
    return np.linalg.norm(ndarray_0 - ndarray_1, axis=1)
# start with the K points closest to the mean
# (the copy() is only to avoid a view into an otherwise unused array)
indices = np.argsort(distances(data, data.mean(0)))[:K].copy()
# distsums is, for all N points, the sum of the distances from the K points
distsums = spatial.distance.cdist(data, data[indices]).sum(1)
# but the K points themselves should not be considered
# (the trick is that -np.inf ± a finite quantity always yields -np.inf)
distsums[indices] = -np.inf
prev_sum = 0.0
for loop in range(MAX_LOOPS):
    for i in range(K):
        # remove this point from the K points
        old_index = indices[i]
        # calculate its sum of distances from the K points
        distsums[old_index] = distances(data[indices], data[old_index]).sum()
        # update the sums of distances of all points from the K-1 points
        distsums -= distances(data, data[old_index])
        # choose the point with the greatest sum of distances from the K-1 points
        new_index = np.argmax(distsums)
        # add it to the K points replacing the old_index
        indices[i] = new_index
        # don't consider it any more in distsums
        distsums[new_index] = -np.inf
        # update the sums of distances of all points from the K points
        distsums += distances(data, data[new_index])
    # sum all mutual distances of the K points
    curr_sum = spatial.distance.pdist(data[indices]).sum()
    # break if the sum hasn't changed
    if curr_sum == prev_sum:
        break
    prev_sum = curr_sum
if ND == 2:
    X, Y = data.T
    marker_size = 4
    plt.scatter(X, Y, s=marker_size)
    plt.scatter(X[indices], Y[indices], s=marker_size)
    plt.grid(True)
    plt.gca().set_aspect('equal', adjustable='box')
    plt.show()
Output:
Splitting the data into 3 equidistant Gaussian distributions the output is this:

Assuming you read your CSV file with N (10000) rows and D dimensions (or features) into an N*D matrix X, you can calculate the distance between each pair of points and store it in a distance matrix as follows:
import numpy as np
X = np.asarray(X)  ### convert to numpy array
distance_matrix = np.zeros((X.shape[0], X.shape[0]))
for i in range(X.shape[0]):
    for j in range(i + 1, X.shape[0]):
        ## We compute the triangle matrix and copy the rest. Distance from point A to point B and distance from point B to point A are the same.
        distance_matrix[i][j] = np.linalg.norm(X[i] - X[j])  ## Here I am calculating the Euclidean distance. Other distance measures can also be used.
#distance_matrix = distance_matrix + distance_matrix.T - np.diag(np.diag(distance_matrix)) ## This syntax can be used to get the lower triangle of the distance matrix, which is not really required in your case.
K = 5  ## Number of points that you want to pick
indexes = np.unravel_index(np.argsort(distance_matrix.ravel())[-1*K:], distance_matrix.shape)
print(indexes)

Bottom Line Up Front: Dealing with multiple equally distant points and the Curse of Dimensionality are going to be larger problems than just finding the points. Spoiler alert: There's a surprise ending.
I think this is an interesting question, but I'm bewildered by some of the answers. I think this is due, in part, to the sketches provided. You've no doubt noticed that the answers look similar (2D, with clusters) even though you indicated a wider scope was needed. Because others will eventually see this, I'm going to step through my thinking a bit slowly, so bear with me for the early part.
It makes sense to start with a simplified example to see if we can generalize a solution with data that's easy to grasp and a linear 2D model is easiest of the easy.
We don't need to calculate all the distances though. We just need the ones at the extremes. So we can then take the top and bottom few values:
right = lin_2_D.nlargest(8, ['x'])
left = lin_2_D.nsmallest(8, ['x'])
graph = sns.scatterplot(x="x", y="y", data=lin_2_D, color = 'gray', marker = '+', alpha = .4)
sns.scatterplot(x = right['x'], y = right['y'], color = 'red')
sns.scatterplot(x = left['x'], y = left['y'], color = 'green')
fig = graph.figure
fig.set_size_inches(8,3)
What we have so far: Of 100 points, we've eliminated the need to calculate the distance between 84 of them. Of what's left we can further drop this by ordering the results on one side and checking the distance against the others.
You can imagine a case where you have a couple of data points way off the trend line that could be captured by taking the greatest or least y values, and all that starts to look like Walter Tross's top diagram. Add in a couple of extra clusters and you get what looks like his bottom diagram, and it appears that we're sort of making the same point.
The problem with stopping here is the requirement you mentioned is that you need a solution that works for any number of dimensions.
The unfortunate part is that we run into four challenges:
Challenge 1: As you increase the dimensions, you can run into a large number of cases where you have multiple solutions when seeking midpoints. So you're looking for the k furthest points but have a large number of equally valid possible solutions and no way of prioritizing them. Here are two super easy examples to illustrate this:
A) Here we have just four points and in only two dimensions. You really can't get any easier than this, right? The distance from red to green is trivial. But try to find the next furthest point and you'll see both of the black points are equidistant from both the red and green points. Imagine you wanted the furthest six points using the first graphs, you might have 20 or more points that are all equidistant.
edit: I just noticed the red and green dots are at the edges of their circles rather than at the center, I'll update later but the point is the same.
B) This is super easy to imagine: Think of a D&D 4 sided die. Four points of data in a three-dimensional space, all equidistant so it's known as a triangle-based pyramid. If you're looking for the closest two points, which two? You have 4 choose 2 (aka, 6) combinations possible. Getting rid of valid solutions can be a bit of a problem because invariably you face questions such as "why did we get rid of these and not this one?"
Challenge 2: The Curse of Dimensionality. Nuff Said.
Challenge 3: Revenge of the Curse of Dimensionality. Because you're looking for the most distant points, you have to have x, y, z ... n coordinates for each point, or you have to impute them. Now your data set is much larger and slower.
Challenge 4: Because you're looking for the most distant points, dimension reduction techniques such as ridge and lasso are not going to be useful.
So, what to do about this?
Nothing.
Wait. What?!?
Not truly, exactly, and literally nothing. But nothing crazy. Instead, rely on a simple heuristic that is understandable and computationally easy. Paul C. Kainen puts it well:
Intuitively, when a situation is sufficiently complex or uncertain,
only the simplest methods are valid. Surprisingly, however,
common-sense heuristics based on these robustly applicable techniques
can yield results which are almost surely optimal.
In this case, you have not the Curse of Dimensionality but rather the Blessing of Dimensionality. It's true you have a lot of points and they'll scale linearly as you seek other equidistant points (k), but the total dimensional volume of space will increase to the power of the dimensions. The number k of furthest points you're seeking is insignificant compared to the total number of points. Hell, even k^2 becomes insignificant as the number of dimensions increases.
Now, if you had a low dimensionality, I would go with the other answers as a solution (except the ones that use nested for loops ... in NumPy or Pandas).
If I were in your position, I'd be thinking that I've got code in these other answers that I could use as a basis, and maybe wondering why I should trust this answer other than that it lays out a framework for how to think through the topic. Certainly, there should be some math and maybe somebody important saying the same thing.
Let me refer to chapter 18 of Computer Intensive Methods in Control and Signal Processing and an expanded argument by analogy with some heavy(-ish) math. You can see from the above (the graph with the colored dots at the edges) that the center is removed, particularly if you followed the idea of removing the extreme y values. It's as though you put a balloon in a box. You could do this with a sphere in a cube too. Raise that into multiple dimensions and you have a hypersphere in a hypercube. You can read more about that relationship here.
Finally, let's get to a heuristic:
Select the points that have the most max or min values per dimension. When/if you run out of them, pick ones that are close to those values if there isn't one at the min/max. Essentially, you're choosing the corners of a box. For a 2D graph you have four points; for a 3D graph you have the 8 corners of the box (2^3).
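One possible reading of this heuristic, as a rough sketch only: take the axis-aligned bounding box of the data and, for each of its 2^D corners, pick the data point nearest to that corner. The function name nearest_to_corners and the toy data are made up for illustration, and Euclidean distance is an assumption.

```python
import itertools
import numpy as np

def nearest_to_corners(data):
    """For each corner of the axis-aligned bounding box, return the index of the closest row."""
    lo, hi = data.min(axis=0), data.max(axis=0)
    picked = []
    # zip(lo, hi) pairs the min and max of each dimension; product enumerates all 2**D corners.
    for corner in itertools.product(*zip(lo, hi)):
        dists = np.linalg.norm(data - np.asarray(corner), axis=1)
        idx = int(np.argmin(dists))
        if idx not in picked:  # one point may be nearest to several corners
            picked.append(idx)
    return picked

# Toy 2D example: four points at the corners of a square plus two interior points.
toy = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0], [10.0, 10.0], [5.0, 5.0], [4.0, 6.0]])
corner_rows = nearest_to_corners(toy)
print(corner_rows)
```

For D = 6 this enumerates only 64 corners, so the cost is 64 passes over the data. The number of corners doubles with every added dimension, so this literal version is only practical for small D.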
More accurately this would be a 4d or 5d (depending on how you might assign the marker shape and color) projected down to 3d. But you can easily see how this data cloud gives you the full range of dimensions.
Here is a quick check on learning; for purposes of ease, ignore the color/shape aspect: It's easy to graphically intuit that you have no problem with up to k points short of deciding what might be slightly closer. And you can see how you might need to randomize your selection if you have a k < 2D. And if you added another point you can see it (k +1) would be in a centroid. So here is the check: If you had more points, where would they be? I guess I have to put this at the bottom -- limitation of markdown.
So for a 6D data cloud, the values of k less than 64 (really 65 as we'll see in just a moment) points are pretty easy. But...
If you don't have a data cloud but instead have data that has a linear relationship, you'll have 2^(D-1) points. So, for that linear 2D space, you have a line; for linear 3D space, you'd have a plane. Then a rhomboid, etc. This is true even if your shape is curved. Rather than do this graph myself, I'm using the one from an excellent post by Inversion Labs on Best-fit Surfaces for 3D Data.
If the number of points, k, is less than 2^D you need a process to decide what you don't use. Linear discriminant analysis should be on your shortlist. That said, you can probably satisfice the solution by randomly picking one.
For a single additional point (k = 1 + 2^D), you're looking for one that is as close to the center of the bounding space.
When k > 2^D, the possible solutions will scale not geometrically but factorially. That may not seem intuitive so let's go back to the two circles. For 2D you have just two points that could be a candidate for being equidistant. But if that were 3D space and rotate the points about the line, any point in what is now a ring would suffice as a solution for k. For a 3D example, they would be a sphere. Hyperspheres (n-spheres) from thereon. Again, 2^D scaling.
One last thing: You should seriously look at xarray if you're not already familiar with it.
Hope all this helps and I also hope you'll read through the links. It'll be worth the time.
*It would be the same shape, centrally located, with the vertices at the 1/3 mark. So like having 27 six-sided dice shaped like a giant cube. Each vertex (or the point nearest it) would fix the solution. Your original k+1 would have to be relocated too. So you would select 2 of the 8 vertices. Final question: would it be worth calculating the distances of those points against each other (remember the diagonal is slightly longer than the edge) and then comparing them to the original 2^D points? Bluntly, no. Satisfice the solution.

If you're interested in getting the most distant points you can take advantage of all of the methods that were developed for nearest neighbors, you just have to give a different "metric".
For example, using scikit-learn's nearest neighbors and distance metrics tools you can do something like this
import numpy as np
from sklearn.neighbors import BallTree
# NOTE: this module path is only importable in older scikit-learn releases
from sklearn.neighbors.dist_metrics import PyFuncDistance
from sklearn.datasets import make_blobs
from matplotlib import pyplot as plt
def inverted_euclidean(x1, x2):
    # You can speed this up using cython like scikit-learn does or numba
    dist = np.sum((x1 - x2) ** 2)
    # We invert the euclidean distance and set nearby points to the biggest possible
    # positive float that isn't inf
    inverted_dist = np.where(dist == 0, np.nextafter(np.inf, 0), 1 / dist)
    return inverted_dist
# Make up some fake data
n_samples = 100000
n_features = 200
X, _ = make_blobs(n_samples=n_samples, centers=3, n_features=n_features, random_state=0)
# We exploit the BallTree algorithm to get the most distant points
ball_tree = BallTree(X, leaf_size=50, metric=PyFuncDistance(inverted_euclidean))
# Some made up query, you can also provide a stack of points to query against
test_point = np.zeros((1, n_features))
distance, distant_points_inds = ball_tree.query(X=test_point, k=10, return_distance=True)
distant_points = X[distant_points_inds[0]]
# We can try to visualize the query results
plt.plot(X[:, 0], X[:, 1], ".b", alpha=0.1)
plt.plot(test_point[:, 0], test_point[:, 1], "*r", markersize=9)
plt.plot(distant_points[:, 0], distant_points[:, 1], "sg", markersize=5, alpha=0.8)
plt.show()
Which will plot something like:
There are many points that you can improve on:
I implemented the inverted_euclidean distance function with numpy, but you can try to do what the folks of scikit-learn do with their distance functions and implement them in cython. You could also try to jit compile them with numba.
Maybe the euclidean distance isn't the metric you would like to use to find the furthest points, so you're free to implement your own or simply roll with what scikit-learn provides.
The nice thing about using the Ball Tree algorithm (or the KdTree algorithm) is that for each queried point you have to do log(N) comparisons to find the furthest point in the training set. Building the Ball Tree itself, I think also requires log(N) comparison, so in the end if you want to find the k furthest points for every point in the ball tree training set (X), it will have almost O(D N log(N)) complexity (where D is the number of features), which will increase up to O(D N^2) with the increasing k.

Divide a region into parts efficiently Python

I have a square grid with some points marked off as being the centers of the subparts of the grid. I'd like to be able to assign each location within the grid to the correct subpart. For example, if the subparts of the region were centered on the black dots, I'd like to be able to assign the red dot to the region in the lower right, as it is the closest black dot.
Currently, I do this by iterating over each possible red dot, and comparing its distance to each of the black dots. However, the width, length, and number of black dots in the grid is very high, so I'd like to know if there's a more efficient algorithm.
My particular data is formatted as such, where the numbers are just placeholders to correspond with the given example:
black_dots = [(38, 8), (42, 39), (5, 14), (6, 49)]
grid = [[0 for i in range(0, 50)] for j in range(0, 50)]
For reference, in the sample case, I hope to be able to fill grid up with integers 1, 2, 3, 4, depending on whether they are closest to the 1st, 2nd, 3rd, or 4th entry in black_dots to end up with something that would allow me to create something similar to the following picture where each integer correspond to a color (dots are left on for show).
To summarize, is there / what is the more efficient way to do this?

You can use a breadth-first traversal to solve this problem.
Create a first-in, first-out queue. (A queue makes a traversal breadth-first.)
Create a Visited mask indicating whether a cell in your grid has been added to the queue or not. Set the mask to false.
Create a Parent mask indicating what black dot the cell ultimately belongs to.
Place all the black dots into the queue, flag them in the Visited mask, and assign them unique ids in the Parent mask.
Begin popping cells from the queue one by one. For each cell, iterate over the cell's neighbours. Place each neighbour that is not yet flagged in Visited into the queue, flag it in Visited, and set its value in Parent to be equal to that of the cell you just popped.
Continue until the queue is empty.
The breadth-first traversal makes a wave which expands outward from each source cell (black dot). Since the waves all travel at the same speed across your grid, each wave gobbles up those cells closest to its source.
This solves the problem in O(N) time, where N is the number of cells in the grid.
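The steps above can be sketched as follows. This is a minimal illustration, not a tuned implementation: it assumes each black dot is an (x, y) pair indexed as grid[y][x], uses 4-connected neighbours (so "closest" means fewest grid steps rather than Euclidean distance), and lets the Parent mask double as the Visited mask. The function name label_by_bfs is made up.

```python
from collections import deque

def label_by_bfs(width, height, sources):
    """Multi-source breadth-first fill.

    parent[y][x] ends up holding the 1-based id of the source whose wave reached it first.
    """
    parent = [[0] * width for _ in range(height)]  # 0 means "not visited yet"
    queue = deque()
    for source_id, (x, y) in enumerate(sources, start=1):
        parent[y][x] = source_id
        queue.append((x, y))
    while queue:
        x, y = queue.popleft()
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nx, ny = x + dx, y + dy
            if 0 <= nx < width and 0 <= ny < height and parent[ny][nx] == 0:
                parent[ny][nx] = parent[y][x]  # inherit the id of the cell just popped
                queue.append((nx, ny))
    return parent

black_dots = [(38, 8), (42, 39), (5, 14), (6, 49)]
grid = label_by_bfs(50, 50, black_dots)
```

Each cell is enqueued at most once, which is where the linear running time comes from. Ties between equally distant sources are broken by queue order.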

If I understand correctly what you really need is to construct a Voronoi diagram of your centers:
https://en.m.wikipedia.org/wiki/Voronoi_diagram
It can be constructed very efficiently, with a computational complexity similar to that of calculating a convex hull.
The Voronoi diagram gives you the optimal polygons surrounding your centers, which delimit the regions closest to each center.
Once you have the Voronoi diagram, the task is reduced to detecting which polygon the red dot lies in. Since the Voronoi cells are convex, you need an algorithm to decide whether a point is inside a convex polygon. However, traversing all polygons has complexity O(n).
There are several algorithms to accelerate the point location so it can be done in O(log n):
https://en.m.wikipedia.org/wiki/Point_location
See also
Nearest Neighbor Searching using Voronoi Diagrams

The "8-way" Voronoi diagram can be constructed efficiently (in linear time with respect to the number of pixels) by a two-pass scanline process. (8-way means that distances are evaluated as the length of the shortest 8-connected path between two pixels.)
Assign every center a distinct color and create an array of distances of the same size as the image, initialized with 0 at the centers and "infinity" elsewhere.
In a top-down/left-right pass, update the distances of all pixels as being the minimum of the current distance and the distances of the four neighbors W, NW, N and NE plus one, and assign the current pixel the color of the neighbor that achieves the minimum (or keep the current color).
In a bottom-up/right-left pass, update the distances of all pixels as being the minimum of the current distance and the distances of the four neighbors E, SE, S, SW plus one, and assign the current pixel the color of the neighbor that achieves the minimum (or keep the current color).
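A minimal sketch of the two passes described above, in plain Python. It assumes centers are (x, y) pairs, that y grows downward (so N is y - 1), and that a pixel keeps its current distance and color unless a neighbor offers a strictly smaller distance. The name two_pass_labels is made up for illustration.

```python
def two_pass_labels(width, height, centers):
    """Two-pass 8-way distance/label propagation; returns (dist, label) as lists of rows."""
    INF = float("inf")
    dist = [[INF] * width for _ in range(height)]
    label = [[0] * width for _ in range(height)]
    for center_id, (x, y) in enumerate(centers, start=1):
        dist[y][x] = 0
        label[y][x] = center_id

    def relax(x, y, offsets):
        # Take over distance + 1 and color from any listed neighbor that improves on the current value.
        for dx, dy in offsets:
            nx, ny = x + dx, y + dy
            if 0 <= nx < width and 0 <= ny < height and dist[ny][nx] + 1 < dist[y][x]:
                dist[y][x] = dist[ny][nx] + 1
                label[y][x] = label[ny][nx]

    forward = ((-1, 0), (-1, -1), (0, -1), (1, -1))   # W, NW, N, NE
    backward = ((1, 0), (1, 1), (0, 1), (-1, 1))      # E, SE, S, SW
    for y in range(height):                            # top-down, left-right
        for x in range(width):
            relax(x, y, forward)
    for y in reversed(range(height)):                  # bottom-up, right-left
        for x in reversed(range(width)):
            relax(x, y, backward)
    return dist, label

centers = [(38, 8), (42, 39), (5, 14), (6, 49)]
dist, label = two_pass_labels(50, 50, centers)
```

With unit steps in all eight directions, the distance computed here is the chessboard distance max(|dx|, |dy|), which is what "8-way" means above.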
It is also possible to compute the Euclidean Voronoi diagram efficiently (in linear time), but this requires a more sophisticated algorithm. It can be based on the wonderful paper "A General Algorithm for Computing Distance Transforms in Linear Time" by A. Meijster, J.B.T.M. Roerdink and W.H. Hesselink, which must be enhanced with some accounting of the neighbor that causes the smallest distance.

Re-arrange arrays containing endpoints to create a closed polygon in Python

I want to plot a polygon defined by several latitude/longitude pairs that serve as its endpoints (vertices).
The example data looks like this:
fig=plt.figure()
ax = plt.gca()
x_map1, x_map2 = 114.166,114.996
y_map1, y_map2 = 37.798,38.378
map = Basemap(llcrnrlon=x_map1,llcrnrlat=y_map1,urcrnrlon=x_map2,urcrnrlat=y_map2)
map.drawparallels(np.arange(y_map1+0.102,y_map2,0.2),labels=[1,0,0,1],size=14,linewidth=0,color= '#FFFFFF')
map.drawmeridians(np.arange(x_map1+0.134,x_map2,0.2),labels=[1,0,0,1],size=14,linewidth=0)
bo_x = [114.4390022, 114.3754847, 114.3054522, 114.3038236, 114.2802081, 114.2867228, 114.3378847, 114.3888619, \
114.6288783, 114.6848733, 114.7206292, 114.7341219]
bo_y = [38.16671389, 38.14472722, 38.14309861, 38.10156778, 38.08853833, 38.06980889, 38.03587472, 37.96409056, \
37.84975278, 37.84840333, 37.9017, 38.16683306]
x, y = map( bo_x, bo_y )
xy = list(zip(x, y))  # a list, because zip() returns an iterator on Python 3
poly = Polygon( xy, facecolor='red', alpha=0.4 )
plt.gca().add_patch(poly)
The figure looks like this:
But when the longitude and latitude arrays are not in anticlockwise order, and the arrays contain too many items to adjust manually, the output polygon may be malformed.
Here, I shuffle bo_x and bo_y to create a hypothetical example.
bo_x_adjust = [114.4390022, 114.3754847, 114.3054522, 114.3038236, 114.6288783, 114.6848733, 114.7206292, 114.7341219,
114.2802081, 114.2867228, 114.3378847, 114.3888619, ]
bo_y_adjust = [38.16671389, 38.14472722, 38.14309861, 38.10156778, 37.84975278, 37.84840333, 37.9017, 38.16683306,
38.08853833, 38.06980889, 38.03587472, 37.96409056, ]
The figure looks like this:
So, here is my question. Sometimes the original endpoints are not in an order that produces a closed polygon, so the arrays need to be reordered beforehand.
I think reordering arrays like bo_x and bo_y must follow two principles:
The elements of the two arrays should be reordered together, so that the endpoint pairs (X~Y) are not broken.
The new arrays should trace the outline in clockwise or anticlockwise order in 2-D space.
Any advice or guidelines would be appreciated.

Not an answer yet, but I needed the ability to attach images.
The problem may be ill defined. For example, these two legitimate polygons have the same vertices.
Do you want to get either one?

Here is a way to solve what you want by linear algebra. Sorry but I am writing just the general guidelines. Nonetheless it should work.
Write a function that accepts two edge numbers j and k and checks whether the edges intersect. Note that you need to correctly handle the edge from the last vertex back to the first. You also need to make sure you return 'False' when adjacent edges are passed in, since these always intersect by definition (they share a vertex).
Now, the way to know whether two edges intersect is to follow a little algebra. Extract from each edge its straight-line parameters a and b in y = a*x + b. Then solve for the intersection x of the two edges by equating a1*x+b1 == a2*x+b2. If the intersection x lies between the x's of the vertices of both edges, then the two edges indeed intersect. (Vertical edges, where a is undefined, and parallel edges, where a1 == a2, need to be handled as special cases.)
Write a function that goes over all pairs of edges and tests for intersections. Only when no intersection exists is the polygon legitimate.
Next you can go in two approaches:
Comprehensive approach - Go over all possible permutations of the vertices. Test each permutation's polygon for intersections. Note that when permuting you need to permute x and y together. Note also that there are a lot of permutations, so this could be very time consuming.
Greedy approach - As long as there are still intersections, go over the combinations of edge pairs, and whenever there is an intersection simply switch the two last edge coordinates (unwind the intersection). Then restart going over all the edge pairs again. Repeat this until there are no more intersections. This should work pretty fast but will not give the best polygon (e.g., it will not maximize the polygon area).
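A rough sketch of the greedy approach follows, with two substitutions that should be stated plainly. The intersection test uses cross-product orientation signs instead of the y = a*x + b form, which sidesteps vertical edges. "Unwinding" is interpreted as reversing the run of vertices between the two crossing edges (the classic 2-opt move). All function names are made up, and touching or collinear edges are not treated as crossings.

```python
def _orient(p, q, r):
    """Sign of the cross product (q - p) x (r - p): 1 left turn, -1 right turn, 0 collinear."""
    value = (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])
    return (value > 0) - (value < 0)

def edges_cross(a, b, c, d):
    """True if segments ab and cd cross at a single interior point."""
    return (_orient(a, b, c) * _orient(a, b, d) < 0
            and _orient(c, d, a) * _orient(c, d, b) < 0)

def first_crossing(points):
    """Return (j, k) for the first pair of crossing, non-adjacent edges, or None."""
    n = len(points)
    for j in range(n):
        for k in range(j + 2, n):
            if j == 0 and k == n - 1:
                continue  # first and last edge are adjacent through the wrap-around
            if edges_cross(points[j], points[(j + 1) % n],
                           points[k], points[(k + 1) % n]):
                return j, k
    return None

def untangle(points, max_passes=10_000):
    """Repeatedly unwind crossings; x and y stay paired because each point is one tuple."""
    points = list(points)
    for _ in range(max_passes):
        hit = first_crossing(points)
        if hit is None:
            return points
        j, k = hit
        # Reversing vertices j+1..k replaces the two crossing edges with two non-crossing ones.
        points[j + 1:k + 1] = points[j + 1:k + 1][::-1]
    raise RuntimeError("gave up before all crossings were removed")

bo_x_adjust = [114.4390022, 114.3754847, 114.3054522, 114.3038236, 114.6288783, 114.6848733,
               114.7206292, 114.7341219, 114.2802081, 114.2867228, 114.3378847, 114.3888619]
bo_y_adjust = [38.16671389, 38.14472722, 38.14309861, 38.10156778, 37.84975278, 37.84840333,
               37.9017, 38.16683306, 38.08853833, 38.06980889, 38.03587472, 37.96409056]
scrambled = list(zip(bo_x_adjust, bo_y_adjust))
fixed = untangle(scrambled)
fixed_x, fixed_y = zip(*fixed)
```

Each reversal shortens the perimeter, so the loop cannot cycle. As noted above, the result is some non-self-intersecting polygon through the points, not necessarily the intended one.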
Hope this helps...

Creating fixed set of nodes using networkx in python

I have a problem concerning graph diagrams. I have 30 nodes (points). I want to construct an adjacency matrix in such a way that each set of ten nodes sits at one vertex of a triangle. So let's say there is a group of 10 nodes at each of the vertices A, B and C of a triangle ABC.
Two of the vertex sets should have only 10 edges (basically, each node within a cluster is connected to another one). Let's say the groups at A and B have 10 edges within the group. The third vertex set should have 11 edges (10 for the nodes, plus one node connecting with two nodes, so 11 edges in that group). Let's say the one at C has 11 edges in it.
These three clusters would each have one edge between them to form a triangle. That is, connect the group at A with the group at B with one edge, B with C with one edge, and C with A with one edge.
Later on I would add one more edge between B and C. Represented as dotted line in the attached figure. The point at a vertex can be in a circle or any other formation as long as they represent a group.
How do I create an adjacency matrix for such a thing. I actually know how to create the adjacency matrix for such a matrix as it is just binary symmetric matrix(undirected graph) but the problem is when I try to plot that adjacency matrix it would bring the one node from other group closer to the group to which that node is connected. So lets say I connected one node at Vertex A with one node at Vertex B by connecting an edge between the two. This edge would depict the side AB of the triangle. But when I depict it using networkx then those two nodes which are connected from these two different groups would eventually come closer and look like part of one group. How do I keep it as separate group. ?
Pls note I am making use of networkx lib of python which helps plot the adjacency matrix.
EDIT:
The code I am trying to use, after the inspiration below:
import networkx as nx
G = nx.Graph()
# Creating three separate groups of nodes (10 nodes each)
node_clusters = [range(1, 11), range(11, 21), range(21, 31)]
# Adding edges between each set of nodes in each group.
for x in node_clusters:
    for y in x:
        if y != x[-1]:
            G.add_edge(y, y + 1, len=2)
        else:
            G.add_edge(y, x[0], len=2)
# Adding three inter group edges separately:
for x in range(len(node_clusters)):
    if x < 2:
        G.add_edge(node_clusters[x][-1], node_clusters[x + 1][0], len=8)
    else:
        G.add_edge(node_clusters[x][-1], node_clusters[0][0], len=8)
nx.draw_graphviz(G, prog='neato')
Gives the following error:
--> 260 '(not available for Python3)')
261 if root is not None:
262 args+="-Groot=%s"%root
ImportError: ('requires pygraphviz ', 'http://networkx.lanl.gov/pygraphviz ', '(not available for Python3)')
My Python version is not 3, it's 2, and I am using the Anaconda distribution.
EDIT2:
I used Marius's code but instead used the following to plot:
graph_pos=nx.spring_layout(G,k=0.20,iterations=50)
nx.draw_networkx(G,graph_pos)
It has completely destroyed the whole graph and shows this:

I was able to get something going fairly quickly just by hacking away at this. All you need to do is put together tuples representing each edge; you can also set some arbitrary lengths on the edges to get a decent approximation of your desired layout:
import networkx
import string
all_nodes = string.ascii_letters[:30]
a_nodes = all_nodes[:10]
b_nodes = all_nodes[10:20]
c_nodes = all_nodes[20:]
all_edges = []
for node_set in [a_nodes, b_nodes, c_nodes]:
    # Link each node to the next
    for i, node in enumerate(node_set[:-1]):
        all_edges.append((node, node_set[i + 1], 2))
    # Finish off the circle
    all_edges.append((node_set[0], node_set[-1], 2))
joins = [(a_nodes[0], b_nodes[0], 8), (b_nodes[-1], c_nodes[0], 8), (c_nodes[-1], a_nodes[-1], 8)]
all_edges += joins
# One extra edge for C:
all_edges.append((c_nodes[0], c_nodes[5], 5))
G = networkx.Graph()
for edge in all_edges:
    G.add_edge(edge[0], edge[1], len=edge[2])
networkx.draw_graphviz(G, prog='neato')
Try something like networkx.to_numpy_matrix(G) if you then want to export the graph as an adjacency matrix (in NetworkX 3.x, where to_numpy_matrix is no longer available, networkx.to_numpy_array(G) plays that role).
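If Graphviz is not available, as with the ImportError in the question, one alternative is to skip automatic layout and place every node by hand, so that an edge between two groups cannot pull them together. The sketch below puts each group on a small circle around one corner of a triangle. The function name cluster_positions, the corner coordinates and the radius are arbitrary choices for illustration. It uses the integer node labels from the question.

```python
import math


def cluster_positions(clusters, centers, radius=1.0):
    """Place the nodes of each cluster evenly on a circle around its center."""
    pos = {}
    for nodes, (cx, cy) in zip(clusters, centers):
        for i, node in enumerate(nodes):
            angle = 2 * math.pi * i / len(nodes)
            pos[node] = (cx + radius * math.cos(angle),
                         cy + radius * math.sin(angle))
    return pos


# Three groups of ten nodes, labelled 1..30 as in the question
clusters = [list(range(1, 11)), list(range(11, 21)), list(range(21, 31))]
# Assumed corner coordinates for the triangle ABC
centers = [(0.0, 0.0), (8.0, 0.0), (4.0, 7.0)]
pos = cluster_positions(clusters, centers)
```

The resulting dictionary maps each node to an (x, y) pair and can be passed as the pos argument of nx.draw_networkx(G, pos), which draws the nodes at exactly those coordinates instead of computing a layout.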

find the min distances from one polygon to other polygons in a layer?

I am trying to figure out how to find the minimum distances from one polygon to the other polygons in a layer (a layer consists of many polygons) in ArcGIS. More specifically, I was wondering if it is possible to run a loop in Python that finds the minimum distance from each polygon to the others?
Thanks,
Rajib

If you've got the center coordinates of your polygons, it's fairly easy to do this on your own. First you need a function to find the distance between two points of the same dimension:
def euclid(pt1, pt2):
    pairs = zip(pt1, pt2)  # Form pairs in corresponding dimensions
    sum_sq_diffs = sum((a - b)**2 for a, b in pairs)  # Find sum of squared diff
    return (sum_sq_diffs)**(float(1)/2)  # Take sqrt to get euclidean distance
Then you can make a function to find the closest point among a vector (list or whatever) of points. I would simply apply the min() function with a quick custom key-function:
# Returns the point in vec with minimum euclidean distance to pt
def closest_pt(pt, vec):
    return min(vec, key=lambda x: euclid(pt, x))
If you have the vertices of the polygons, this is a couple of steps more complicated, but easy to figure out if you take it step by step. Your outermost loop should iterate through the points in your "base" polygon (the one you are trying to find the minimum distance from). The loop nested within this should take you to each of the other polygons in your comparison vector. From here you can just call the closest_pt() function to compare your basis point to all the points in this other polygon, finding the closest one:
def closest_poly(basis, vec):
    closest = []
    for pt in basis:
        closer = []
        for poly in vec:
            # Closest vertex of this polygon to the current basis vertex
            closer.append(closest_pt(pt, poly))
        # Closest vertex over all polygons to the current basis vertex
        closest.append(closest_pt(pt, closer))
    best = min(enumerate(closest), key=lambda x: euclid(basis[x[0]], x[1]))
    return (best[0], best[1], [best[1] in poly for poly in vec])
It may be slightly redundant structurally, but I think it will work and it provides pretty transparent logic. The function returns a tuple (vertex, close_pt, polys), where: vertex is the index of the vertex in your basis which was found to be closest to another polygon; close_pt is the vertex of another polygon that was found to be closest to it; and polys is a list of Boolean values corresponding to the polygons in your vec, such that polys[i] == True if and only if close_pt is a vertex of vec[i].
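To cover the "each polygon to the others" loop from the question, the same idea can be wrapped in one more loop. This is a minimal sketch that assumes each polygon is a plain list of 2D (x, y) vertices rather than an ArcGIS geometry object, and the function names are made up. Because it only compares vertices, the result is an upper bound on the true boundary-to-boundary distance, which may be smaller when an edge passes closer than any vertex.

```python
import math


def min_vertex_distance(poly_a, poly_b):
    """Smallest distance between any vertex of poly_a and any vertex of poly_b."""
    return min(math.hypot(p[0] - q[0], p[1] - q[1])
               for p in poly_a for q in poly_b)


def nearest_neighbours(polygons):
    """Map each polygon index to (distance, index) of its nearest other polygon."""
    result = {}
    for i, poly in enumerate(polygons):
        candidates = [(min_vertex_distance(poly, other), j)
                      for j, other in enumerate(polygons) if j != i]
        result[i] = min(candidates)  # tuples compare by distance first
    return result


# Three unit squares placed along the x axis (made-up sample data)
squares = [
    [(0, 0), (1, 0), (1, 1), (0, 1)],
    [(3, 0), (4, 0), (4, 1), (3, 1)],
    [(10, 0), (11, 0), (11, 1), (10, 1)],
]
nearest = nearest_neighbours(squares)
```

Here nearest[0] is (2.0, 1): the first square is closest to the second one, at a vertex-to-vertex distance of 2.0.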
Hope this is helpful.

There is a tool in the ArcGIS toolbox for this:
http://help.arcgis.com/en/arcgisdesktop/10.0/help/index.html#//00080000001q000000.htm
