Segmentation of 2D data with an additional categorical variable - python

There's a segmentation problem that I would like some help with.
I am trying to segment a large set of points in two-dimensional space that also have one categorical variable. The primary segmentation should be done by clustering the spatial data and, if necessary, the clusters should then be further divided based on the categorical variable.
Here's an example:
Let's say we have a dataset of geographical coordinates of houses in a city. In addition to the location of each house, we also know its colour. If we plotted the locations of the houses and their colours, we would get the image below. You can see there are three neighbourhoods in this town, two of which are geographically difficult to separate but are clearly distinct based on their looks.
The above example would be difficult to segment with a clustering algorithm like DBSCAN or k-means, which would not take the categorical variable into account. In addition, we cannot simply separate the purple and orange houses, because both can be found in the same neighbourhood. It would also be difficult to cluster in multidimensional space with Gower's distance, because that might lead to houses with odd colours being assigned to clusters outside of their geographical confines.
What would be a good approach to this problem? Are there any python (or R) implementations of clustering algorithms that could deal with this sort of problem? Or would a computer vision approach be more fitting?
Any input would be welcome.

Another approach would be to run, say, DBSCAN separately for all subsets of colours, obtaining a large set of putative neighbourhoods, and then to take the minimal elements of this set, with respect to set inclusion.
In the above example:
- green gives you the south-west,
- blue, red, and blue+red give you the north and the south-east,
- green+red/blue/both give you the north and the south (a merge of two correct neighbourhoods), and then you filter out the south by taking the minimal elements.
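A rough sketch of this subset-DBSCAN idea (scikit-learn assumed; `eps` and `min_samples` are illustrative, not tuned):

```python
# Run DBSCAN on every non-empty subset of colours, collect the resulting
# clusters as sets of point indices, then keep only the minimal elements
# under set inclusion.
from itertools import combinations

import numpy as np
from sklearn.cluster import DBSCAN


def candidate_neighbourhoods(xy, colours, eps=0.5, min_samples=5):
    """xy: (N, 2) coordinates; colours: (N,) categorical labels.
    Returns a set of frozensets of point indices (putative neighbourhoods)."""
    clusters = []
    palette = sorted(set(colours))
    for r in range(1, len(palette) + 1):
        for subset in combinations(palette, r):
            idx = np.flatnonzero(np.isin(colours, subset))
            if len(idx) < min_samples:
                continue
            labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(xy[idx])
            for lab in set(labels) - {-1}:  # -1 is DBSCAN's noise label
                clusters.append(frozenset(idx[labels == lab]))
    return set(clusters)


def minimal_elements(clusters):
    """Keep only clusters that contain no other candidate as a proper subset."""
    return [c for c in clusters if not any(o < c for o in clusters)]
```

With the merged "north + south" candidate present, `minimal_elements` drops it because the correct, smaller neighbourhoods are proper subsets of it.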

Related

On Creating Random Walkers in 3D elements of 3 different 4D arrays

I am currently working with 2 different 4D arrays (x, y, z, l) containing voxel data of a brain scan (x, y, z) and a label (l). Every voxel holds the probability that the voxel contains that label.
The first array represents the brain tissue in 3D space and has 3 labels: white matter, grey matter, and CSF. My only area of interest is grey matter.
The second array represents a predefined probability distribution of 360 labels, all of which correspond to a portion of the grey matter.
The "shapes" of the probability distributions in 3D are similar, but not exactly the same. They are already aligned in 3D space with a combination of rigid and affine transformations, and it is "as similar as possible".
What I want to do is map these 360 3D elements onto the grey-matter element of the first array and get a best fit.
I am not an expert in machine learning, so I tried to come up with my own idea:
Pick every voxel from the grey matter probability map.
Generate 1000 random walks from each point.
Which territory is walked most? Label the point accordingly.
I haven't been able to code what I explained above yet, so I am unable to offer any code.
I am still trying to create "the random walking sampler", but I quickly realized I need a lot of nested for loops working on a lot of different 3D arrays (1363 to be exact).
Is there a better way to do this?
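The walk-and-vote steps described above can be sketched without most of the nested loops by advancing all walks from one seed voxel in lock-step with NumPy (the shapes, the step count, and the uniform step distribution below are illustrative assumptions, not a definitive design):

```python
import numpy as np


def label_by_random_walks(seed, label_probs, n_walks=1000, n_steps=50, rng=None):
    """seed: (3,) voxel index; label_probs: (X, Y, Z, L) probability map.
    Returns the index of the label territory visited most across all walks."""
    rng = np.random.default_rng(rng)
    upper = np.array(label_probs.shape[:3]) - 1
    pos = np.tile(np.asarray(seed), (n_walks, 1))   # (n_walks, 3), in lock-step
    votes = np.zeros(label_probs.shape[3], dtype=np.int64)
    for _ in range(n_steps):
        step = rng.integers(-1, 2, size=pos.shape)  # uniform moves in {-1, 0, 1}
        pos = np.clip(pos + step, 0, upper)         # clamp walks to the volume
        # every visited voxel votes for its most probable label there
        labels = label_probs[pos[:, 0], pos[:, 1], pos[:, 2]].argmax(axis=1)
        votes += np.bincount(labels, minlength=votes.size)
    return int(votes.argmax())
```

Only the outer loop over seed voxels remains; each call replaces 1000 per-walk inner loops with a handful of array operations.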

Points in Polygons. How can I match them spatially with given coordinates?

I have a dataset of georeferenced flickr posts (ca. 35k, picture below) and an unrelated dataset of georeferenced polygons (ca. 40k, picture below); both are currently pandas dataframes. The polygons do not cover the entire area where flickr posts are possible. I am having trouble understanding how to sort the many different points into the many different polygons (or check whether they are close). In the end I want a map with the points from the flickr data placed in polygons and coloured by an attribute (Tag). I am trying to do this in Python. Do you have any ideas or recommendations?
[Images: previews of the point dataframe and the polygon dataframe]
Since you don't have any sample data to load and play with, my answer will be descriptive in nature, trying to explain some possible strategies for the problem you are trying to solve.
I assume that these polygons are probably some addresses, and that you essentially want to place each geolocated flickr post in the nearest best-match polygon.
First of all, you need to identify or acquire information on the precision of those flickr geolocations: how far off could they be, given the numerous sources of error (the reason behind those errors is not your concern, but the amount of error is)? This will give you an idea of a circle of confusion (2D) or, more likely, a sphere of confusion (3D). Why 3D? Well, you might have a flickr post from a certain elevation in a high-rise apartment, so (x: latitude, y: longitude, z: altitude) may all be necessary to consider. You have to study the data and any other information available to you to determine the best option here (2D/3D space of confusion).
Once you have figured out the type of N-D space of confusion, you will need a distance metric (typically just a distance between two points) -- call this sigma. Just to be on the safe side, find all the addresses (geopolygons) within a radius of 1 sigma and additionally within 2 sigma -- these are your possible set of target addresses. For each of these addresses, compute the distances from the flickr geolocation to the polygon's centroid and to the four corners of its rectangular outer bounding box.
You will then want to rank these addresses for each flickr geolocation, based on the distances to all five points. You will need a way of identifying a flickr point that is far from a big building's centre (the distance from the centroid could be much more than the distance from the corners) but close to its edges, vs. a different property with a smaller area footprint.
Each flickr point would thus have multiple predictions with different probabilities (convert the distance-based scores into probabilities) for which polygon it belongs to.
Thus, if you choose any flickr location, you should be able to show top-k geopolygons that flickr location could belong to (with probabilities).
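The matching step itself can be sketched with plain Shapely (all names, shapes, and the `max_dist` threshold below are made up for illustration; GeoPandas' spatial joins do the same thing at scale with a spatial index):

```python
from shapely.geometry import Point, box


def match(point, polygons, max_dist=1.0):
    """Return (polygon id, distance): the containing polygon if any,
    else the nearest polygon within max_dist, else (None, distance)."""
    for pid, poly in polygons.items():
        if poly.contains(point):
            return pid, 0.0
    pid, d = min(((pid, poly.distance(point)) for pid, poly in polygons.items()),
                 key=lambda t: t[1])
    return (pid, d) if d <= max_dist else (None, d)


# toy data: two rectangular "addresses" and two flickr points
polygons = {"A": box(0, 0, 1, 1), "B": box(2, 2, 3, 3)}
posts = {"cafe": Point(0.5, 0.5), "park": Point(3.5, 3.0)}
```

The distances this returns are exactly what you would convert into the probability scores described above; note they are in coordinate units, so project lat/lon to a metric CRS first if you need metres.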
For visualizations, I would suggest using holoviews with datashader, as that should be able to take care of the curse of dimensionality in your data. Also, please take a look at leafmap (or geemap).
References
holoviews: https://holoviews.org/
datashader: https://datashader.org/
leafmap: https://leafmap.org/
geemap: https://geemap.org/

Select a smaller sample of "uniformly" distributed co-ordinates, out of a larger population of co-ordinates

I have a set of co-ordinates (latitudes and longitudes) of different buildings in a city. The sample size is around 16,000. I plan to use these co-ordinates as the central points of their localities/neighbourhoods and do some analysis on the different neighbourhoods of the city. The "radius/size" of each neighbourhood is still undecided as of now.
However, a lot of these co-ordinates are too close to each other, so many of them actually represent the same locality/neighbourhood.
As a result, I want to select a smaller sample (say, 3-6k) of co-ordinates that will be more evenly spread out.
Example: if two of the co-ordinates represent two neighbouring buildings, I don't want to include both, as they pretty much represent the same area; we must select only one of them.
This way, I was hoping to reduce the population to a smaller size while still being able to cover most of the city with the remaining co-ordinates.
One way I was imagining the solution is to plot these co-ordinates on a 2D graph (for visualisation). Then we could try different values of the "radius" to see how many co-ordinates would remain. But I do not know how to implement such a "graph".
I am doing this analysis in Python. Is there a way I can obtain such a sample of these co-ordinates that are evenly distributed with minimal overlap?
Thanks for your help,
It seems like for your use case, you might need clustering instead of sampling to reduce your analysis set.
Given that you'd want to reduce your "houses" data to "neighborhoods" data, I'd suggest exploring geospatial clustering to cluster houses that are closer together and then take your ~3-4K clusters as your data set to begin with.
That being said, if your objective still is to remove houses that are closer together, you can obviously create an N*N matrix of the geospatial distances between each pair of houses and, for every pair whose distance falls within (0, X], where X is your threshold, drop one of the two points.
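A sketch of that thresholding idea that avoids materialising the full N*N matrix, using a k-d tree for the radius queries (the radius is in the same units as the coordinates, so for raw lat/lon you would first project to a metric CRS or switch to a haversine distance):

```python
import numpy as np
from scipy.spatial import cKDTree


def thin_points(coords, radius):
    """Greedily keep a subset of coords (N x 2) such that no two kept
    points are within `radius` of each other. Returns kept indices."""
    tree = cKDTree(coords)
    keep = np.ones(len(coords), dtype=bool)
    for i in range(len(coords)):
        if not keep[i]:
            continue  # already dropped by an earlier kept point
        for j in tree.query_ball_point(coords[i], radius):
            if j > i:  # earlier neighbours were already decided
                keep[j] = False
    return np.flatnonzero(keep)
```

Sweeping `radius` and checking `len(thin_points(coords, radius))` gives exactly the "how many co-ordinates remain at this radius" curve the question describes.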

Calculating object labelling consensus area

Scenario: four users are annotating images with one of four labels each. These are stored in a fairly complex format - either as polygons or as centre-radius circles. I'm interested in quantifying, for each class, the area of agreement between individual raters – in other words, I'm looking to get an m x m matrix, where M_i,j will be some metric, such as the IoU (intersection over union), between rater i's and rater j's annotations (with ones on the diagonal, obviously). There are two problems I'm facing.
One, I don't know what works best in Python for this. Shapely doesn't implement circles too well, for instance.
Two, is there a more efficient way for this than comparing it annotator-by-annotator?
IMO the simplest is to fill the shapes using polygon filling / circle filling (this is simple, you can roll your own) / path filling (from a seed). Then finding the area of overlap is an easy matter.
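For what it's worth, Shapely can also stay in vector space here: `Point(...).buffer(r)` gives a polygonal approximation of a centre-radius circle, and IoU is then a one-liner (the shapes below are made up for illustration):

```python
from shapely.geometry import Point, Polygon


def iou(a, b):
    """Intersection over union of two Shapely geometries."""
    union = a.union(b).area
    return a.intersection(b).area / union if union else 0.0


# one rater drew a circle, another drew a square around the same object
circle = Point(0, 0).buffer(1.0)  # polygonal approximation of a unit circle
square = Polygon([(-1, -1), (1, -1), (1, 1), (-1, 1)])

# pairwise m x m agreement matrix over the raters' shapes
raters = {"rater1": circle, "rater2": square}
names = list(raters)
matrix = [[iou(raters[a], raters[b]) for b in names] for a in names]
```

On the efficiency question: since the matrix is symmetric with a unit diagonal, only the m(m-1)/2 upper-triangle pairs actually need computing.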

Create Geographical Grid based on data using K-D Trees (python)

For my research, I need to divide the geographical area of a city (e.g. Chicago or New York) using a grid. Later, I have data points consisting of GPS longitude and latitude locations that I want to associate with their corresponding cells in the grid.
The simplest way to do this is to divide the space into square cells of the same size. However, this leads to cells with very few points in non-populated (rural) areas and cells with a high number of points (city centre). For a fairer representation of the relation between the number of points and cell size, an adaptive grid that creates cells sized according to data density would be a better option.
I came across this paper, which utilises a k-d tree to partition the space and retrieves the cells from the nodes. However, I cannot find any implementation (in Python) that does that. Many of the implementations out there only index data points in the tree to perform nearest-neighbour search; they do not provide code to extract the rectangular cells that the k-d tree generates.
For example, given the following image:
My resulting grid will contain 5 cells (node1 to node5) where each cell contains the associated data points.
Any idea on how to do that?
Does anyone know of an implementation?
Many thanks,
David
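For what it's worth, a median-split k-d tree that stops at a chosen leaf size and reports each leaf's bounding rectangle is short enough to roll yourself; a sketch (the `leaf_size` knob is illustrative, not taken from the paper):

```python
import numpy as np


def kdtree_cells(points, idx=None, bounds=None, leaf_size=32, depth=0):
    """Median-split k-d tree over 2D `points` (N x 2). Returns a list of
    ((xmin, ymin, xmax, ymax), point_indices) leaf cells covering the data."""
    if idx is None:
        idx = np.arange(len(points))
    if bounds is None:
        bounds = (points[:, 0].min(), points[:, 1].min(),
                  points[:, 0].max(), points[:, 1].max())
    if len(idx) <= leaf_size:
        return [(bounds, idx)]
    axis = depth % 2                       # alternate x / y splits
    vals = points[idx, axis]
    med = float(np.median(vals))
    left, right = idx[vals <= med], idx[vals > med]
    if len(left) == 0 or len(right) == 0:  # all points tied at the median
        return [(bounds, idx)]
    xmin, ymin, xmax, ymax = bounds
    if axis == 0:
        b_left, b_right = (xmin, ymin, med, ymax), (med, ymin, xmax, ymax)
    else:
        b_left, b_right = (xmin, ymin, xmax, med), (xmin, med, xmax, ymax)
    return (kdtree_cells(points, left, b_left, leaf_size, depth + 1)
            + kdtree_cells(points, right, b_right, leaf_size, depth + 1))
```

Dense areas are split more times than sparse ones, so cell size adapts to data density; assigning a new GPS point to a cell is then a simple bounds check over the returned rectangles.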
