I have two geodatasets: one is a set of points (centroids derived from a different polygon layer, let's name it point_data) and the other is a polygon layer of a whole country (let's name it polygon_data). What I'm trying to do is get attributes from polygon_data and attach them to point_data. The problem is that the two layers do not overlap.
For context, the country is an archipelago, and the points fall outside the country's polygons (that's why they don't overlap).
Some solutions that I've tried are:
1.) Buffering polygon_data so that it touches point_data. Unfortunately this caused problems because polygons that are not along the shoreline were buffered as well.
2.) Using the original polygons that point_data was derived from and doing a spatial join (intersects), but some points still came back with null values and duplicate rows also occurred.
I want to make the process as seamless and easy as possible. Any ideas?
I'm proficient with both geopandas and QGIS, but I would prefer a geopandas solution as much as possible.
Thank you to whoever will be able to help. :)
I guess you can try to join your data based on the distance between the points and the polygon(s). That way you can fetch the index of the nearest polygon feature for each of your points, then use this index to do the join.
To replicate your problem, I generated a layer of points and a layer of polygons (they have an attribute name that I want to put on the point layer).
One (naive) way to do so could be the following:
import geopandas as gpd

# read the polygon layer and the point layer
poly_data = gpd.read_file('poly_data.geojson')
pt_data = gpd.read_file('pt_data.geojson')

# Create the field to store the index
# of the nearest polygon feature
pt_data['join_field'] = 0

for idx, geom in pt_data['geometry'].items():
    # Compute the distance between this point and each polygon
    distances = [
        (idx_to_join, geom.distance(geom_poly))
        for idx_to_join, geom_poly in poly_data['geometry'].items()]
    # Sort the distances...
    distances.sort(key=lambda d: d[1])
    # ... and store the index of the nearest polygon feature
    pt_data.loc[idx, 'join_field'] = distances[0][0]

# make the join between pt_data and poly_data (except its geometry column)
# based on the value of 'join_field'
result = pt_data.join(
    poly_data[poly_data.columns.difference(['geometry'])],
    on='join_field')

# remove the `join_field` if needed
result.drop('join_field', axis=1, inplace=True)
Result (the value in the name column comes from the polygons):
id geometry name
0 1 POINT (-0.07109 0.40284) A
1 2 POINT (0.04739 0.49763) A
2 3 POINT (0.05450 0.29858) A
3 4 POINT (0.06635 0.11848) A
4 5 POINT (0.63744 0.73934) B
5 6 POINT (0.61611 0.53555) B
6 7 POINT (0.76540 0.44787) B
7 8 POINT (0.84597 0.36256) B
8 9 POINT (0.67062 -0.36493) C
9 10 POINT (0.54028 -0.37204) C
10 11 POINT (0.69194 -0.60900) C
11 12 POINT (0.62085 -0.65166) C
12 13 POINT (0.31043 -0.48578) C
13 14 POINT (0.36967 -0.81280) C
Depending on the size of your dataset you may want to consider more efficient methods (e.g. defining a maximum search radius around each point to avoid having to iterate across all polygons).
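For instance, geopandas ships a nearest-neighbour join (sjoin_nearest, available since geopandas 0.10) that does the same thing with a spatial index. A minimal sketch, assuming the same poly_data / pt_data layers as above; the max_distance value is just a placeholder in the layers' CRS units:

import geopandas as gpd

poly_data = gpd.read_file('poly_data.geojson')
pt_data = gpd.read_file('pt_data.geojson')

# Attach the attributes of the nearest polygon to each point.
# max_distance limits the search radius; distance_col stores the
# computed distance between each point and its matched polygon.
result = gpd.sjoin_nearest(
    pt_data, poly_data,
    how='left',
    max_distance=0.5,
    distance_col='dist')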
Related
I have two GeoDataFrames.
gdf_point:
Unnamed: 0 latitude longitude geometry
0 0 50.410203 7.236583 POINT (7.23658 50.41020)
1 1 51.303545 7.263082 POINT (7.26308 51.30354)
2 2 50.114965 8.672785 POINT (8.67278 50.11496)
and gdf_poly:
Unnamed: 0 Id geometry
0 0 301286 POLYGON ((9.67079 49.86762, 9.67079 49.86987, ...
1 1 302258 POLYGON ((9.67137 54.75650, 9.67137 54.75874, ...
2 2 302548 POLYGON ((9.66808 48.21535, 9.66808 48.21760, ...
I want to check whether a point from gdf_point is contained by any of the polygons of gdf_poly; if so, I want the Id of that polygon to be added to the corresponding row of gdf_point.
Here is my current code:
COUNTER = 0

def f(x, gdf_poly, df_new_point):
    global COUNTER
    for row in gdf_poly.itertuples():
        geom = getattr(row, 'geometry')
        id = getattr(row, 'Id')
        if geom.contains(x):
            print('True')
            df_new_point.loc[COUNTER, 'Id'] = id
    COUNTER = COUNTER + 1

df_new_point = gdf_point
gdf_point['geometry'].apply(lambda x: f(x, gdf_poly, df_new_point))
This works and does what I want it to do. But the problem is that it's way too slow: it takes about 50 min for 10k rows (multithreading is a future option), and I want it to be able to handle several million rows. There must be a better and faster way to do this. Thanks for your help.
To merge two dataframes on their geometries (not on column or index values), use one of geopandas's spatial joins. They have a whole section of the docs about it - it's great - give it a read!
There are two workhorse spatial join functions in geopandas:
GeoDataFrame.sjoin joins two dataframes based on a binary predicate performed on all combinations of geometries, one of intersects, contains, within, touches, crosses, or overlaps. You can specify whether you want a left, right, or inner join with the how keyword argument.
GeoDataFrame.sjoin_nearest joins two dataframes based on which geometry in one dataframe is closest to each element in the other. Similarly, the how argument gives left, right, and inner options. Additionally, there are two arguments to sjoin_nearest not available on sjoin:
max_distance: The max_distance argument specifies a maximum search radius for matching geometries. This can have a considerable performance impact in some cases. If you can, it is highly recommended that you use this parameter.
distance_col: If set, the resultant GeoDataFrame will include a column with this name containing the computed distances between an input geometry and the nearest geometry.
You can either use the top-level geopandas.sjoin and geopandas.sjoin_nearest functions, or use the methods geopandas.GeoDataFrame.sjoin and geopandas.GeoDataFrame.sjoin_nearest. Note, however, that the docs include a warning that the top-level functions may be deprecated at some point in the future, and recommend using the GeoDataFrame methods.
So in your case:
merged = gdf_poly.sjoin(gdf_point, predicate="contains")
will do the trick, though if you want to match polygons where the point falls exactly on the boundary, you may want to consider predicate="intersects".
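If you want the Id to end up as a column on gdf_point itself (keeping points that fall outside every polygon), here is a minimal sketch of one way to do it, joining from the point side with a left join (geopandas >= 0.10 assumed):

# Join from the point side so every point is kept; unmatched points get NaN.
joined = gdf_point.sjoin(
    gdf_poly[['Id', 'geometry']],
    predicate='within',   # point within polygon == polygon contains point
    how='left')

# If a point could fall in several polygons, keep the first match only,
# then copy the Id column back onto gdf_point.
joined = joined[~joined.index.duplicated(keep='first')]
gdf_point['Id'] = joined['Id']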
I have a dataframe similar to this one, but a lot bigger (3000x3000):
   A  B  C  D  E
W  3  1  8  3  4
X  2  2  9  1  1
Y  5  7  1  3  7
Z  6  8  5  8  9
where [A, B, C, D, E] are the column names and [W, X, Y, Z] are the row indices.
I want to compare every cell with its surrounding cells. If the cell has a greater value than its neighbor cell value, create a directed edge (using networkX package) from that cell to its smaller value neighbor cell. For example:
examining cell (X,B), we should add the following:
G.add_edge((X,B), (W,B)) and G.add_edge((X,B), (Y,C)) and so on for every cell in the dataframe.
Currently I am doing it with two nested loops. However, this takes hours to finish and uses a lot of resources (RAM).
Is there any more efficient way to do it?
If you want to have edges in a networkx graph, then you will not be able to avoid the nested for loop.
The comparison is actually easy to optimize. You could make four copies of your matrix and shift each one by one step in each direction. You can then vectorize the comparison with a simple df > df_copy for every direction.
Nevertheless, when it comes to creating the edges in your graph, it is necessary for you to iterate over both axes.
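For illustration, a minimal sketch of that shift-and-compare idea on the example above (only the four axis-aligned directions are shown; the edge creation still needs a loop, but it only visits the cells where the comparison is True):

import networkx as nx
import numpy as np
import pandas as pd

df = pd.DataFrame([[3, 1, 8, 3, 4],
                   [2, 2, 9, 1, 1],
                   [5, 7, 1, 3, 7],
                   [6, 8, 5, 8, 9]],
                  index=list('WXYZ'), columns=list('ABCDE'))

values = df.to_numpy()
n_rows, n_cols = values.shape
G = nx.DiGraph()

# The four axis-aligned shifts; add (-1, -1), (-1, 1), (1, -1), (1, 1)
# if diagonal neighbours should be compared as well.
for dr, dc in [(-1, 0), (1, 0), (0, -1), (0, 1)]:
    # src[i, j] and dst[i, j] are a cell and its neighbour shifted by
    # (dr, dc); the comparison itself is fully vectorized.
    src = values[max(0, -dr):n_rows - max(0, dr), max(0, -dc):n_cols - max(0, dc)]
    dst = values[max(0, dr):n_rows - max(0, -dr), max(0, dc):n_cols - max(0, -dc)]
    for r, c in zip(*np.nonzero(src > dst)):
        i, j = r + max(0, -dr), c + max(0, -dc)
        G.add_edge((df.index[i], df.columns[j]),
                   (df.index[i + dr], df.columns[j + dc]))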
My recommendation is to write the data preparation part in Cython. Also have a look at graph-tool, which at its core is written in C++. With that many edges you will probably also run into performance issues in networkx itself.
I have a set of 100 data points in a column. I would like to pair the data points that are within a distance threshold of 5 of each other, and then label each pair with a letter, say A, B, C, and so on.
For example, let us assume x is a variable that holds the 100 data points, a sample of which is as follows:
x=[2,5,10,13,20,25]
If I group and label them based on the defined threshold distance (here ±5), then I should get output something like this:
[(2,5),(10,13),(5,10),(20,25)] and so on
A=(2,5); B=(10,13), and so on
The data points may include decimals and I would like to do this in an efficient way. Any idea how to do this? Thanks in advance.
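For illustration, a minimal brute-force sketch of this pairing (checking all pairwise distances is cheap for 100 points; the letter labelling here is just one possible choice, not necessarily the grouping the questioner has in mind):

from string import ascii_uppercase
import itertools

x = [2, 5, 10, 13, 20, 25]
threshold = 5

# Keep every pair of points whose distance is within the threshold.
pairs = [(a, b) for a, b in itertools.combinations(sorted(x), 2)
         if abs(a - b) <= threshold]
# -> [(2, 5), (5, 10), (10, 13), (20, 25)]

# Assign one letter per pair.
labels = dict(zip(ascii_uppercase, pairs))
# -> {'A': (2, 5), 'B': (5, 10), 'C': (10, 13), 'D': (20, 25)}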
I am trying to create a representation of Amsterdam's canals based on a very large data set of coordinates sent through AIS. As the AIS is sometimes calibrated incorrectly, some coordinates are not on the actual canal but rather on urban structures. Luckily, this happens relatively rarely, and as a result these data points are not in close proximity to other data points / data point clusters. So I want to exclude the data points that do not have a 'neighbour' within a given margin (say 5 meters in real life) in the most pythonic way. Would anyone know how to approach this problem? My data is a simple pandas dataframe:
lng lat
0 4.962218 52.362260
1 4.882198 52.406013
2 4.918583 52.335535
3 4.908185 52.381353
4 5.020983 52.277188
... ... ...
2249835 4.979960 52.352660
2249836 4.914533 52.334980
2249837 4.856630 52.401977
2249838 4.971418 52.357525
2249839 5.042353 52.402142
[2211095 rows x 2 columns]
The map currently looks as follows; I have marked examples of coordinates I want to filter out / exclude:
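For illustration, a minimal sketch of one way to express "has no neighbour within ~5 m" with a k-d tree (scipy assumed; the degree-based radius is only a rough approximation, reproject to a metric CRS such as EPSG:28992 if you need exact metres):

import numpy as np
from scipy.spatial import cKDTree

# Rough degree equivalent of ~5 m (based on latitude degrees; longitude
# degrees are shorter at Amsterdam's latitude, so this is approximate).
radius = 5 / 111_320

coords = df[['lng', 'lat']].to_numpy()
tree = cKDTree(coords)

# k=2: the nearest neighbour of each point is the point itself (distance 0),
# so the second column holds the distance to the nearest *other* point.
dists, _ = tree.query(coords, k=2)
df_filtered = df[dists[:, 1] <= radius]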
I have a pandas DataFrame that holds the data for some objects, among which the position of some parts of the object (Left, Top, Right, Bottom).
For example:
ObjectID  Left  Right  Top  Bottom
1         0     0      0    0
2         20    15     5    5
3         3     2      0    0
How can I cluster the objects based on these 4 attributes?
Is there a clustering algorithm/technique that you recommend me?
Almost all clustering algorithms are multivariate and can be used here. So your question is too broad.
It may be worth looking at appropriate distance measures first.
It is hard to make a sound recommendation, because we don't know how your data is distributed.
Depending upon the data type and final objective you can try k-means, k-modes, or k-prototypes. If your data has a mix of categorical and continuous variables then you can try the partition around medoids (PAM) algorithm. However, as stated earlier by another user, can you give more information about the type of data and its variance?
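For illustration, a minimal k-means sketch on the four position columns (scikit-learn assumed; the number of clusters and whether to standardize first are choices that depend on your data):

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({'ObjectID': [1, 2, 3],
                   'Left': [0, 20, 3], 'Right': [0, 15, 2],
                   'Top': [0, 5, 0], 'Bottom': [0, 5, 0]})

# Standardize the four attributes so no single one dominates the distance.
X = StandardScaler().fit_transform(df[['Left', 'Right', 'Top', 'Bottom']])
df['cluster'] = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)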