Indexing in a Python quadtree

I have a set of points with their coordinates (latitude, longitude), and I also have a region (a box). All the points are inside the box.
Now I want to distribute the points into small cells (rectangles) according to the density of points. Basically, a cell containing many points should be subdivided into smaller ones, while cells with few points stay large.
I have checked this question, which describes almost the same problem as mine, but I couldn't find a good answer. I think I should use a quadtree, but none of the implementations I found provides what I need.
For example, this library allows making a tree associated with a box, and then we can insert boxes as follows:
spindex = Index(bbox=(0, 0, 100, 100))
for item in items:
    spindex.insert(item, item.bbox)
But it doesn't allow inserting points. Moreover, I'll need to get the IDs (or names) of cells and check whether a point belongs to a given cell.
Here is another lib I found. It does allow inserting points, but it doesn't give me the cell's ID (or name), so I cannot check whether a point belongs to a certain cell. Note that the tree should do the decomposition automatically.
Could you please suggest a solution?
Thanks.

Finally, I ended up using Google's s2-geometry-library with its Python wrapper. Each cell created by this library is not exactly a rectangle (it's a projection), but it satisfies my need. The library already divides the earth's surface into cells at different levels (a quadtree). Given a point (lat, lng), I can easily get the corresponding cell at leaf level. From these leaf cells, I go up and merge cells based on what I need (the number of points in a cell).
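For reference, here is roughly what that lookup looks like with s2sphere, a pure-Python port of the S2 library (a different package than the wrapper above, but the same cell model; the coordinates and level 12 are just illustrative):

import s2sphere  # pip install s2sphere

# Leaf cell (level 30) containing a (lat, lng) point
p = s2sphere.LatLng.from_degrees(48.8566, 2.3522)
leaf = s2sphere.CellId.from_lat_lng(p)

# Coarser ancestor cell at a chosen level; its token names the cell
cell = leaf.parent(12)
print(cell.level(), cell.to_token())

# A point belongs to a cell iff its leaf's ancestor at that level is that cell
print(leaf.parent(cell.level()) == cell)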
This tutorial explains everything in detail.
Here is my result:

Related

Get n largest regions from binary image

I am given a large binary image (every pixel is either 1 or 0).
I know that in that image there are multiple regions (a region is defined as a set of neighboring 1s which are enclosed by 0s).
The goal is to find the largest region (in terms of pixel count or enclosed area; either would work for me for now).
My current planned approach is to:
start with an array of arrays of coordinates of the 1s (or 0s, whatever represents a 'hit')
until no more steps can be made:
    for the current region (which is a set of coordinates) do:
        see if any region touches the current region; if yes, merge them together, if no, continue with the next iteration
My question is: is there a more efficient way of doing this, and are there already tested (bonus points for parallel or GPU-accelerated) implementations out there (in any of the big libraries) ?
You could flood fill every region with a unique ID, mapping the ID to the size of the region.
You want to use connected component analysis (a.k.a. labeling). It is more or less what you suggest doing, but there are extremely efficient algorithms out there. Answers to this question explain some of the algorithms. See also connected-components.
This library collects different efficient algorithms and compares them.
From within Python, you probably want to use OpenCV. cv.connectedComponentsWithStats does connected component analysis and outputs statistics, among other things the area of each connected component.
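As a minimal sketch (the random img is just a stand-in for your data; the variable names are mine):

import cv2
import numpy as np

# img must be a single-channel 8-bit image of 0s and non-zeros
img = np.random.randint(0, 2, (512, 512), dtype=np.uint8)

n, labels, stats, centroids = cv2.connectedComponentsWithStats(img, connectivity=8)

# stats[i, cv2.CC_STAT_AREA] is the pixel count of component i;
# label 0 is the background, so skip it when ranking.
areas = stats[1:, cv2.CC_STAT_AREA]
largest = np.argsort(areas)[::-1][:5] + 1   # labels of the 5 largest regions
mask_of_biggest = (labels == largest[0]).astype(np.uint8)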
With regard to your suggestion: using coordinates of pixels rather than the original image matrix directly is highly inefficient. Looking for neighboring pixels in an image is trivial; looking for the same in a list of coordinates requires expensive searches.

obtaining boundary of polygon embedded in a matrix and other post-processing techniques

I have a 100x100 Boolean matrix called mat. All cells have the value false except a contiguous patch of polygonal area. I can read the cells belonging to this polygon by running through each cell of the matrix and finding the true cells.
region_of_interest = false(size(mat));
for i = 1:size(mat, 1)
    for j = 1:size(mat, 2)
        if mat(i, j)
            region_of_interest(i, j) = true;
        end
    end
end
Now I want to do further processing on this polygon, such as storing only the boundary cells. How can I do this? I tried visiting each polygonal cell and checking whether all four of its neighbors are in the polygon, but that did not seem very efficient. Are there better algorithms out there?
If there are other post-processing methods that apply in this scenario, please suggest them. Suggestions outside of MATLAB are also welcome.
This is neither Python nor a convex-hull problem.
However, I would point out that boundary cells inside the polygon are true and have at least one neighbor which is false, and boundary cells outside the polygon are false and have at least one neighbor which is true.
It's up to you to decide whether you want inner or outer boundary cells, and what a "neighbor" is. For example, are the neighbors of a cell the four neighbors in the cardinal directions, or do they include the four diagonal neighbors as well, for eight in total? If it's the former, then you'll still have to search through the eight neighbors to use the algorithm I describe below to get from cell 1 to cell 2 in cases like the following:
...XX
...2X
XX1XX
XXXXX
That being said, how you'd want to write the algorithm depends on what your data is like. If you know that there is a single contiguous block with no holes in it, which seems to be what your question implies, then once you find a boundary cell you simply need to walk along the boundary until you reach that first cell again. So search the neighbors until you find the next boundary cell, then repeat.
The problem is finding that first boundary cell.
One method would be to search randomly until you find a cell within the patch, and then walk in any direction until you reach a boundary.
Another would be to do some sort of uniform search, perhaps on a grid, until you find that first cell as described above. If you know that your patch is always at least x cells large, you could adjust your grid spacing based on that.
Without more specific information, these ideas should at least help you reach a solution. Good luck!
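Since suggestions outside of MATLAB were welcome: the "true cell with a false neighbor" definition above is one morphological erosion away in numpy/scipy (the example patch is hypothetical; in MATLAB itself, bwperim from the Image Processing Toolbox computes the same thing):

import numpy as np
from scipy.ndimage import binary_erosion

# mat: 2D boolean array, True inside the patch
mat = np.zeros((100, 100), dtype=bool)
mat[40:60, 30:70] = True  # stand-in for your polygon

# Erosion (with the default cross-shaped element, i.e. 4-connectivity)
# removes every true cell that has a false neighbor, so the difference
# is exactly the set of inner boundary cells.
inner_boundary = mat & ~binary_erosion(mat)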

Linear shift between 2 sets of coordinates

My problem is the following:
For my work I need to compare images of scanned photographic plates with a catalogue of a sample of known stars within the general area of the sky the plates cover (I call it the master catalogue). To that end I extract information about the objects in the images, such as their brightness and their position in the sky, and save it in tables. I then use Python to create a polynomial fit for the calibration of the magnitudes of the stars in the image.
That works pretty well up to a certain accuracy, but unfortunately not well enough, since there is a small shift between the coordinates an object has on the photographic plates and in the master catalogue.
Here the green circles indicate the positions (centers of the circles) of objects in the master catalogue. As you can see, the actual stars are always situated to the upper left of the objects in the master catalogue.
I have looked a little into the comparison of images (i.e. How to detect a shift between images), but I'm a little at a loss now, because I'm not actually comparing images but arrays with the coordinates of the objects. An additional problem is that (as you can see in the image) there are objects in the master catalogue that are not visible on the plates, and not all plates have the same depth (meaning some show more stars than others).
What I would like to know is a way to find and correct the linear shift between the two differently sized arrays of coordinates in Python. There shouldn't be any rotation, so it is just a shift in the x and y directions. The arrays are normal numpy recarrays.
I would change @OphirYoktan's suggestion slightly. You have these circles. I assume you know the radius, and that you have that radius value for a reason.
Instead of randomly choosing points, filter the master catalogue for entries with x, y within that radius of your sample point. Then compute the vectors to all master catalogue entries within range of the sample. Do this repeatedly, and collect a histogram of the vectors. Presumably a small number of them will occur repeatedly; those are the likely true translations. (Ideally, "small number" == 1.)
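That might look something like this (the names and the rounding tolerance are mine, and plate and catalog are assumed to be (N, 2) and (M, 2) numpy arrays):

import numpy as np
from collections import Counter

def most_common_shift(plate, catalog, radius, decimals=2):
    # For each plate object, collect the shift vectors to all catalogue
    # entries within `radius`, then vote: the most frequent (rounded)
    # vector is the likely true translation.
    votes = Counter()
    for a in plate:
        d = catalog - a
        near = d[np.linalg.norm(d, axis=1) < radius]
        for v in near:
            votes[tuple(np.round(v, decimals))] += 1
    return votes.most_common(1)[0] if votes else None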
There are several possible solutions.
Note: these are high-level pointers; you'll need some work to turn them into working code.
The original solution (cross-correlation) can be adapted to the current data structure and should work.
I believe that RANSAC will be better in your case.
Basically it means:
create a model based on a small number of data points (the minimal number required to define a relevant model), and verify its correctness using the full data set.
Specifically, if you have only translation to consider (and not scale), the steps are as follows (a minimal sketch follows the list):
select one of your points
match it to a random point in the catalogue [you may make "educated guesses" here if you have some prior about which translations are more likely]
this matching gives you a candidate translation
verify that this translation matches the rest of your points
repeat until you find a good match
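A rough Python sketch of that loop, assuming plate and catalog are (N, 2) and (M, 2) numpy arrays (the names, tolerance, and iteration count are illustrative, and the brute-force distance matrix assumes both sets fit in memory):

import numpy as np

def estimate_shift_ransac(plate, catalog, tol=1e-3, n_iter=1000, seed=None):
    rng = np.random.default_rng(seed)
    best_v, best_inliers = None, -1
    for _ in range(n_iter):
        # Hypothesize: one plate point corresponds to one catalogue point
        a = plate[rng.integers(len(plate))]
        b = catalog[rng.integers(len(catalog))]
        v = b - a  # candidate translation
        # Verify: count plate points that land near some catalogue point
        d = np.linalg.norm((plate + v)[:, None, :] - catalog[None, :, :], axis=2)
        inliers = int((d.min(axis=1) < tol).sum())
        if inliers > best_inliers:
            best_v, best_inliers = v, inliers
    return best_v, best_inliers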
I'm assuming here that the objects aren't necessarily in the same order in both the photo plate and the master catalogue.
Consider the set of position vectors, A, of the objects in the photo plate, and the set of position vectors, B, of the objects in the master catalogue. You're looking for a vector, v, such that for each a in A, a + v is approximately some element of B.
The most obvious algorithm to me would be: for each a and each b, let v = b - a. Now, for each element of A, check that there is a corresponding element of B that is sufficiently close (within some distance e that you choose) to that element plus v. Once you find a v that meets this condition, v is your shift.
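A direct transcription of that idea in numpy (function and argument names are mine; it tries O(|A|·|B|) candidate vectors and checks each against the full set):

import numpy as np

def find_shift(A, B, eps):
    # Try v = b - a for every pair; accept the first v under which
    # every point of A has a match in B within eps.
    for a in A:
        for b in B:
            v = b - a
            d = np.linalg.norm((A + v)[:, None, :] - B[None, :, :], axis=2)
            if (d.min(axis=1) < eps).all():
                return v
    return None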

How to export a network and its attributes to GraphML with networkx?

Hi, I'm quite new to networks and I've been trying to write a program that reads all the .edges and .nodes files in a folder and generates a GraphML file so I can visualize it in other software. But I also need to add some colours to my nodes, and when I tried that I got: KeyError: 29.
I was running a loop over my array of nodes to add the colour of each node.
Here's the part of the code where I try to add the colour attribute. The nodes will be coloured with 4 different colours: the best fitness, the worst, the top 10% best fitness, and the 10% worst.
for i in range(len(nodes)):
    if nodes[i]==top:
    NetGraph.node[i]['color']='r'
Hope you can help me! Cheers
If you are trying to "merge" relationship data that are stored in a number of different .nodes and .edges files into one graph, then it is possible that, as the files are read from disk, you come across a node that has not yet been added to the graph.
In general, I feel that more information is required to provide a more meaningful answer to this question. For example: what is the format of the .nodes and .edges files? What is in the top variable? (Is it a list, or a single number representing a threshold?)
However, based on what is mentioned so far in this question, here are a few pointers:
Try to build the graph first and colour it later. This might appear inefficient if you already have the fitness data, but it will be the easiest way to get to a working piece of code.
Make sure that your node IDs are indeed integers. That is, each node is known in the graph by its index value, for example 2, 3, 5, etc., instead of "Paris", "London", "Berlin", etc. (i.e. string node IDs). If it is the latter, then the for loop would be better written as for aNode in G.nodes(data=True):. This returns an iterator over each node's ID along with a dictionary of all of that node's existing data.
If top is a single variable, then it doesn't make sense to compare the node's ID with the top threshold. It would be like saying: if 22 (which is a node ID) is equal to 89 (which is some measure of fitness), then apply the red colour to the node. If top is a list that contains all the nodes considered top nodes, then the condition should have been: if nodes[i] in top:.
You seem to have skipped an indentation below the if (?). For the statement that assigns the colour to the node to run only when the condition is True, it needs to be indented one more set of 4 spaces to the right.
The expression to assign the colour is correct.
Please note that networkx will attempt to write every node and edge attribute it comes across in a graph in the appropriate format. For more information on this, please see the response to this question. Therefore, once you are satisfied with the structure of a given graph (G), you can simply call networkx.write_graphml(G, 'mygraph.graphml') to save it to disk (where networkx is the name of the module). The networkx.write_* functions will export a complete version of your graph (G) to a number of different formats, or raise an exception if a data type cannot be serialised properly.
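Putting the pointers above together, a minimal end-to-end sketch, assuming top is a list of node IDs and a networkx 2.x install where node attributes are accessed as G.nodes[n] (the graph and the lists here are made up):

import networkx as nx

G = nx.Graph()
G.add_edges_from([(1, 2), (2, 3), (3, 1)])

top = [1]      # hypothetical: IDs of the best-fitness nodes
worst = [3]    # hypothetical: IDs of the worst-fitness nodes

for n in G.nodes():
    if n in top:
        G.nodes[n]['color'] = 'r'
    elif n in worst:
        G.nodes[n]['color'] = 'b'

nx.write_graphml(G, 'mygraph.graphml')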
I hope this helps. Happy to amend the response if more details are provided later on.

Finding intersections

Given a scenario where there are millions of potentially overlapping bounding boxes of variable sizes, each less than 5 km in width:
Create a fast function with the arguments findIntersections(Longitude,Latitude,Radius) whose output is a list of the IDs of those bounding boxes whose origin is inside the perimeter defined by the function arguments.
How do I solve this problem elegantly?
This is normally done using an R-tree data structure.
Databases like MySQL or PostgreSQL have GIS modules that use an R-tree under the hood to quickly retrieve locations within a certain proximity to a point on a map.
From http://en.wikipedia.org/wiki/R-tree:
R-trees are tree data structures that are similar to B-trees, but are used for spatial access methods, i.e., for indexing multi-dimensional information; for example, the (X, Y) coordinates of geographical data. A common real-world usage for an R-tree might be: "Find all museums within 2 kilometres (1.2 mi) of my current location".
The data structure splits space with hierarchically nested, and possibly overlapping, minimum bounding rectangles (MBRs, otherwise known as bounding boxes, i.e. "rectangle", which is what the "R" in R-tree stands for).
The Priority R-tree (PR-tree) is a variant with a worst-case query time of "O((N/B)^(1-1/d) + T/B) I/Os, where N is the number of d-dimensional (hyper-)rectangles stored in the R-tree, B is the disk block size, and T is the output size."
In practice most real-world queries will have a much quicker average case run time.
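In Python, the rtree package (a wrapper around libspatialindex) makes this easy to prototype. A sketch under the assumptions that "origin" means a box's lower-left corner and that the radius is in the same units as the coordinates (the boxes mapping is hypothetical):

from rtree import index

# boxes: hypothetical mapping of id -> (minx, miny, maxx, maxy)
boxes = {0: (0.0, 0.0, 1.0, 1.0), 1: (5.0, 5.0, 6.0, 6.5)}

# Index each box by its origin, stored as a degenerate (point) box
idx = index.Index()
for box_id, (minx, miny, maxx, maxy) in boxes.items():
    idx.insert(box_id, (minx, miny, minx, miny))

def findIntersections(longitude, latitude, radius):
    # Coarse pass: R-tree query on the circle's bounding square,
    # then an exact point-in-circle check on the candidates.
    square = (longitude - radius, latitude - radius,
              longitude + radius, latitude + radius)
    return [i for i in idx.intersection(square)
            if (boxes[i][0] - longitude) ** 2
             + (boxes[i][1] - latitude) ** 2 <= radius ** 2]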
FYI, in addition to the other great answers posted, there's some cool stuff like SpatiaLite and the SQLite R-tree module.
PostGIS is an open-source GIS extension for PostgreSQL.
It has ST_Intersects and ST_Intersection functions available.
If you're interested, you can dig around and see how they're implemented there:
http://svn.osgeo.org/postgis/trunk/postgis/
GiST seems like a better, more general approach:
http://en.wikipedia.org/wiki/GiST
