Thiessen-like polygons out of pre-labeled points - Python

I have a list of coordinate points that are already clustered. Each point is available to me as a row in a CSV file, with one of the fields being the "zone ID": the ID of the cluster to which the point belongs. I was wondering if there is a way, given the latitude, longitude, and zone ID of each point, to draw polygons similar to Voronoi cells, such that:
each cluster is entirely contained within a polygon
each polygon contains points belonging to only one cluster
the union of the polygons is a contiguous polygon that contains all the points, with no holes: the polygons must border each other except at the outer edges. A fun extension would be to supply the "holes" (water bodies, for example) as part of the input.
I realise the problem is very abstract and could be very resource intensive, but I am curious to hear of any approaches. I am open to solutions using a variety or combination of tools, such as GIS software, Python, R, etc. I am also open to implementations that would be integrated into the clustering process.
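One possible approach, sketched below: compute an ordinary Voronoi diagram over all points, then dissolve the cells by zone ID. Since each Voronoi cell contains exactly its generating point, each dissolved polygon contains points of only one cluster, and interior cells tile the plane with no gaps. This is a minimal sketch assuming scipy, shapely, and pandas; the file name and the lon/lat/zone_id column names are placeholders for your CSV.

```python
import pandas as pd
from scipy.spatial import Voronoi
from shapely.geometry import Polygon
from shapely.ops import unary_union

df = pd.read_csv("points.csv")                    # hypothetical file name
vor = Voronoi(df[["lon", "lat"]].to_numpy())      # one Voronoi cell per point

cells = {}                                        # zone_id -> list of cell polygons
for pt_idx, region_idx in enumerate(vor.point_region):
    region = vor.regions[region_idx]
    if not region or -1 in region:                # unbounded cell on the hull:
        continue                                  # clip against a boundary in practice
    zone = df["zone_id"].iloc[pt_idx]
    cells.setdefault(zone, []).append(Polygon(vor.vertices[region]))

# Dissolve: one (possibly multi-part) polygon per zone.
zone_polys = {z: unary_union(polys) for z, polys in cells.items()}
```

Clipping the unbounded hull cells against a study-area outline is also the natural place to handle the extension: supplied holes such as water bodies could be subtracted from the zone polygons with shapely's difference.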

Related

Points in Polygons. How can I match them spatially with given coordinates?

I have a dataset of georeferenced flickr posts (ca. 35k, pictured below) and an unrelated dataset of georeferenced polygons (ca. 40k, pictured below); both are currently pandas DataFrames. The polygons do not cover the entire area where flickr posts are possible. I am having trouble working out how to assign many different points to many different polygons (or check whether they are close). In the end I want a map with the points from the flickr data placed in polygons and coloured by an attribute (Tag). I am trying to do this in Python. Do you have any ideas or recommendations?
[Figures: point dataframe; polygon dataframe]
Since you don't have any sample data to load and play with, my answer will be descriptive in nature, trying to explain some possible strategies to approach the problem you are trying to solve.
I assume that these polygons represent addresses, and that you essentially want to assign each geolocated flickr post to the nearest best match among the polygons.
First of all, you need to identify or acquire information on the precision of those flickr geolocations: how far off could they be, given the numerous sources of error? (The reason behind those errors is not your concern, but the amount of error is.) This will give you an idea of a circle of confusion (2D) or, more likely, a sphere of confusion (3D). Why 3D? Well, you might have a flickr post taken at some elevation in a high-rise apartment, so all of (x: latitude, y: longitude, z: altitude) may need to be considered. You have to study the data and any other information available to you to determine the best option here (a 2D or 3D space of confusion).
Once you have figured out the type of space of confusion, you will need a distance metric (typically just the distance between two points) and an error radius -- call it sigma. Just to be on the safe side, find all the addresses (geopolygons) within a radius of 1 sigma, and additionally within 2 sigma -- these are your candidate target addresses. For each of these addresses, calculate the distances from the flickr geolocation to its centroid and to the four corners of its rectangular outer bounding box.
You will then want to rank these addresses for each flickr geolocation, based on the distances to all five points. You will need a way of distinguishing a flickr point that is far from a big building's centroid (the distance to the centroid could be far greater than the distance to the corners) but close to its edges, from one near a property with a smaller area footprint.
Each flickr point thus gets multiple predictions of which polygon it belongs to, each with a different probability (convert the distance-based scores into probabilities).
Thus, if you choose any flickr location, you should be able to show the top-k geopolygons that location could belong to, with probabilities.
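A rough sketch of this ranking, assuming a geopandas GeoDataFrame `addresses`, a shapely Point `post`, and a `sigma` in the same (projected) units as the coordinates; the softmax used here to turn distances into probabilities is just one simple choice:

```python
import numpy as np
import geopandas as gpd
from shapely.geometry import Point

def candidate_probs(post: Point, addresses: gpd.GeoDataFrame, sigma: float):
    """Rank candidate polygons for one flickr point by the five-distance score."""
    nearby = addresses[addresses.distance(post) <= 2 * sigma]
    if nearby.empty:
        return nearby
    scores = []
    for poly in nearby.geometry:
        minx, miny, maxx, maxy = poly.bounds
        corners = [Point(x, y) for x in (minx, maxx) for y in (miny, maxy)]
        # Five distances: to the centroid and to the four bounding-box corners.
        five = [post.distance(poly.centroid)] + [post.distance(c) for c in corners]
        scores.append(np.mean(five))
    # Softmax over negative distances: nearer candidates get higher probability.
    p = np.exp(-np.asarray(scores) / sigma)
    p /= p.sum()
    return nearby.assign(prob=p).sort_values("prob", ascending=False)
```

Calling `candidate_probs(post, addresses, sigma).head(k)` then gives the top-k geopolygons with their probabilities for one flickr location.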
For visualizations, I would suggest using holoviews with datashader, as that should handle the sheer volume of points in your data. Also, please take a look at leafmap (or geemap).
References
holoviews: https://holoviews.org/
datashader: https://datashader.org/
leafmap: https://leafmap.org/
geemap: https://geemap.org/

Select a smaller sample of "uniformly" distributed co-ordinates, out of a larger population of co-ordinates

I have a set of co-ordinates (latitudes and longitudes) of different buildings in a city. The sample size is around 16,000. I plan to use these co-ordinates as the central points of their locality/neighbourhood and do some analysis on the different neighbourhoods of the city. The "radius/size" of each neighbourhood is still undecided as of now.
However, a lot of these co-ordinates are too close to each other. So, many of them actually represent the same locality/neighbourhood.
As a result, I want to select a smaller sample (say, 3-6k) of co-ordinates that will be more evenly spread out.
For example: if two of the co-ordinates represent two neighbouring buildings, I don't want to include both, as they pretty much represent the same area; we must select only one of them.
This way, I was hoping to reduce the population to a smaller size, while at the same time being able to cover most of the city through the remaining co-ordinates.
One way I was imagining the solution is to plot these co-ordinates on a 2D graph (for visualisation). Then we could try different values of the "radius" to see how many co-ordinates would remain. But I do not know how to implement such a "graph".
I am doing this analysis in Python. Is there a way I can obtain such a sample of these co-ordinates that are evenly distributed with minimal overlap?
Thanks for your help,
It seems like for your use case, you might need clustering instead of sampling to reduce your analysis set.
Given that you'd want to reduce your "houses" data to "neighbourhoods" data, I'd suggest exploring geospatial clustering to group houses that are close together, and then taking your ~3-4K clusters as your data set to begin with.
That being said, if your objective still is to remove houses that are too close together, you can create an N*N matrix of the geospatial distances between houses and, for every pair whose distance falls within (0, X], where X is your threshold, drop one of the two points.
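The full N*N matrix is avoidable, though: a greedy pass with a spatial index keeps a point only if no already-kept point lies within X metres. A sketch, assuming scikit-learn and (lat, lon) input in degrees; the function name, threshold, and Earth-radius constant are illustrative:

```python
import numpy as np
from sklearn.neighbors import BallTree

EARTH_RADIUS_M = 6_371_000

def thin(coords_deg: np.ndarray, threshold_m: float) -> np.ndarray:
    """Keep a point only if no earlier kept point lies within threshold_m."""
    rad = np.radians(coords_deg)                  # haversine expects radians
    tree = BallTree(rad, metric="haversine")
    keep = np.ones(len(rad), dtype=bool)
    for i in range(len(rad)):
        if not keep[i]:
            continue
        # Suppress every later point within the threshold of kept point i.
        (near,) = tree.query_radius(rad[i : i + 1], r=threshold_m / EARTH_RADIUS_M)
        keep[near[near > i]] = False
    return coords_deg[keep]
```

Sweeping threshold_m and checking how many points survive is exactly the "try different radii and see how many co-ordinates remain" experiment from the question.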

Create Geographical Grid based on data using K-D Trees (Python)

For my research, I need to divide the geographical area of a city (e.g. Chicago or New York) using a grid. Later, I have data points consisting of GPS longitude and latitude locations that I want to associate with their corresponding cells in the grid.
The simplest way to do this is to divide the space into square cells of the same size. However, this leads to cells with very few points in sparsely populated (rural) areas and cells with a high number of points in the city centre. For a fairer relation between the number of points and the cell size, an adaptive grid that sizes cells according to data density would be a better option.
I came across this paper, which utilises a k-d tree to partition the space and retrieve the cells from the nodes. However, I cannot find any implementation (in Python) that does that. Many of the implementations out there only index data points in the tree to perform nearest-neighbour search; they do not provide code to extract the rectangular cells that the k-d tree generates.
For example, given the following image:
My resulting grid will contain 5 cells (node1 to node5) where each cell contains the associated data points.
Any idea on how to do that?
Does anyone know of an implementation?
Many thanks,
David
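For what it's worth, the leaf rectangles are easy to extract if you write the short k-d recursion yourself rather than reusing a nearest-neighbour library. A minimal sketch, with leaf_size (the maximum number of points per cell) and all names being illustrative:

```python
import numpy as np

def kd_cells(points, bounds, depth=0, leaf_size=100):
    """Yield (xmin, ymin, xmax, ymax, points_in_cell) for every leaf."""
    xmin, ymin, xmax, ymax = bounds
    if len(points) <= leaf_size:
        yield (xmin, ymin, xmax, ymax, points)
        return
    axis = depth % 2                              # alternate x / y splits
    split = np.median(points[:, axis])
    left = points[points[:, axis] <= split]
    right = points[points[:, axis] > split]
    if len(left) == 0 or len(right) == 0:         # degenerate split: stop here
        yield (xmin, ymin, xmax, ymax, points)
        return
    if axis == 0:
        yield from kd_cells(left, (xmin, ymin, split, ymax), depth + 1, leaf_size)
        yield from kd_cells(right, (split, ymin, xmax, ymax), depth + 1, leaf_size)
    else:
        yield from kd_cells(left, (xmin, ymin, xmax, split), depth + 1, leaf_size)
        yield from kd_cells(right, (xmin, split, xmax, ymax), depth + 1, leaf_size)
```

Calling list(kd_cells(pts, (pts[:, 0].min(), pts[:, 1].min(), pts[:, 0].max(), pts[:, 1].max()))) returns the adaptive grid: dense areas end up with many small cells, sparse areas with a few large ones, and each leaf already carries its associated points.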

Segmentation of 2D data with additionally categorical variable

There's a segmentation problem that I would like some help with.
I am trying to segment a large set of points in 2-dimensional space that also has one categorical variable. The primary segmentation should be done by clustering the spatial data; if necessary, the clusters should then be further divided based on the categorical variable.
Here's an example:
Let's say we have a dataset of geographical coordinates of houses in a city. In addition to the location of each house we also know which colour it has. If we would plot the location of the houses and the colour they have we would get the image below. You can see there are three neighbourhoods in this town, two of which are geographically difficult to separate but are clearly distinct based on their looks.
The above example would be difficult to segment with a clustering algorithm like DBSCAN or k-means, which would not take the categorical variable into account. In addition, we cannot separate the purple and orange houses, because both can be found in the same neighbourhood. It would also be difficult to cluster in multidimensional space with Gower's distance, because that might lead to houses with odd colours being assigned to clusters outside their geographical confines.
What would be a good approach to this problem? Are there any python (or R) implementations of clustering algorithms that could deal with this sort of problem? Or would a computer vision approach be more fitting?
Any input would be welcome.
Another approach would be to run, say, DBSCAN separately for all subsets of colours, obtaining a large set of putative neighbourhoods, and then to take the minimal elements of this set, with respect to set inclusion.
In the above example: green gives you the south-west; blue, red, and blue+red give you the north and the south-east; green+red/blue/both give you the north and the south (a merge of two correct neighbourhoods), and then you filter out the south by taking the minimal elements.
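A sketch of this subset-and-filter idea, assuming scikit-learn's DBSCAN; `xy` is an (n, 2) coordinate array, `colours` an aligned label array, and the eps/min_samples defaults are placeholders to tune:

```python
from itertools import combinations
import numpy as np
from sklearn.cluster import DBSCAN

def minimal_neighbourhoods(xy, colours, eps=0.5, min_samples=10):
    """Cluster every colour subset, then keep minimal clusters (set inclusion)."""
    palette = sorted(set(colours))
    clusters = set()
    for r in range(1, len(palette) + 1):
        for subset in combinations(palette, r):
            idx = np.flatnonzero(np.isin(colours, subset))
            if len(idx) < min_samples:
                continue
            labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(xy[idx])
            for lab in set(labels) - {-1}:        # -1 marks DBSCAN noise
                clusters.add(frozenset(idx[labels == lab]))
    # Keep a putative neighbourhood only if no other cluster sits strictly inside it.
    return [c for c in clusters if not any(o < c for o in clusters)]
```

Enumerating all colour subsets costs 2^k - 1 DBSCAN runs for k colours, which is fine for a handful of colours like the example.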

Mapping points to polygons in ESRI shape file in Python

I have an ESRI shape file with many geographic areas, represented as non-overlapping polygons. I'm trying to map arbitrary points to the polygons which they belong to in Python.
I've looked into storing the polygons in a SQLite database as R-trees, but I think this only works if the shapes are rectangles (or if I approximate the polygons by their minimum bounding rectangles).
Is there any way to do this exact calculation with R-trees (or a similar module provided by SQLite)? That way I could store the information in a SQLite database, making it very easy to perform this calculation cross-platform.
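For reference, the usual Python pattern is a two-stage test: an R-tree over the polygons' bounding boxes prunes candidates, then an exact point-in-polygon check confirms. A sketch assuming the rtree, geopandas, and shapely packages (the shapefile name is a placeholder); the same bbox-filter-then-exact-test idea carries over to SQLite's R-tree module.

```python
import geopandas as gpd
from rtree import index
from shapely.geometry import Point

gdf = gpd.read_file("areas.shp")                  # placeholder file name

# R-tree over bounding boxes only: fast, but approximate.
rtree_idx = index.Index(
    (i, geom.bounds, None) for i, geom in enumerate(gdf.geometry)
)

def locate(lon: float, lat: float):
    """Return the index of the polygon containing the point, or None."""
    pt = Point(lon, lat)
    for i in rtree_idx.intersection((lon, lat, lon, lat)):
        if gdf.geometry.iloc[i].contains(pt):     # exact test on candidates only
            return gdf.index[i]
    return None
```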
