Efficiently merge GeoDataFrames if Polygon from one contains Point from second - python

I have two GeoDataFrames.
gdf_point:
Unnamed: 0 latitude longitude geometry
0 0 50.410203 7.236583 POINT (7.23658 50.41020)
1 1 51.303545 7.263082 POINT (7.26308 51.30354)
2 2 50.114965 8.672785 POINT (8.67278 50.11496)
and gdf_poly:
Unnamed: 0 Id geometry
0 0 301286 POLYGON ((9.67079 49.86762, 9.67079 49.86987, ...
1 1 302258 POLYGON ((9.67137 54.75650, 9.67137 54.75874, ...
2 2 302548 POLYGON ((9.66808 48.21535, 9.66808 48.21760, ...
I want to check whether each point from gdf_point is contained by any of the polygons of gdf_poly; if so, I want the Id of that polygon to be added to the corresponding row of gdf_point.
Here is my current code:
COUNTER = 0

def f(x, gdf_poly, df_new_point):
    global COUNTER
    for row in gdf_poly.itertuples():
        geom = getattr(row, 'geometry')
        id = getattr(row, 'Id')
        if geom.contains(x):
            print('True')
            df_new_point.loc[COUNTER, 'Id'] = id
    COUNTER = COUNTER + 1

df_new_point = gdf_point
gdf_point['geometry'].apply(lambda x: f(x, gdf_poly, df_new_point))
This works and does what I want it to do. The problem is that it's far too slow: it takes about 50 minutes for 10k rows (multithreading is a future option), and I want it to handle several million rows. There must be a better and faster way to do this. Thanks for your help.

To merge two dataframes on their geometries (not on column or index values), use one of geopandas's spatial joins. They have a whole section of the docs about it - it's great - give it a read!
There are two workhorse spatial join functions in geopandas:
GeoDataFrame.sjoin joins two dataframes based on a binary predicate performed on all combinations of geometries, one of intersects, contains, within, touches, crosses, or overlaps. You can specify whether you want a left, right, or inner join based on the how keyword argument
GeoDataFrame.sjoin_nearest joins two dataframes based on which geometry in one dataframe is closest to each element in the other. Similarly, the how argument gives left, right, and inner options. Additionally, there are two arguments to sjoin_nearest not available on sjoin:
max_distance: The max_distance argument specifies a maximum search radius for matching geometries. This can have a considerable performance impact in some cases. If you can, it is highly recommended that you use this parameter.
distance_col: If set, the resultant GeoDataFrame will include a column with this name containing the computed distances between an input geometry and the nearest geometry.
You can optionally use these global geopandas.sjoin and geopandas.sjoin_nearest functions, or use the methods geopandas.GeoDataFrame.sjoin and geopandas.GeoDataFrame.sjoin_nearest. Note, however, that the docs include a warning that the root-level functions may be deprecated at some point in the future, and recommend the use of the GeoDataFrame methods.
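As a quick illustration (a minimal sketch, not from the original answer; the toy layers, column names, and the 10-unit search radius are invented for the example), sjoin_nearest with both extra arguments looks like this:

import geopandas as gpd
from shapely.geometry import Point

# Tiny illustrative layers in a projected CRS (distances in map units)
pts = gpd.GeoDataFrame({"name": ["a", "b"]},
                       geometry=[Point(0, 0), Point(5, 5)], crs="EPSG:3857")
targets = gpd.GeoDataFrame({"label": ["near", "far"]},
                           geometry=[Point(1, 1), Point(100, 100)], crs="EPSG:3857")

# Each point picks up the attributes of its nearest target, searching at most
# 10 units away; the computed distance is stored in the "dist" column.
joined = pts.sjoin_nearest(targets, max_distance=10, distance_col="dist")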
So in your case:
merged = gdf_poly.sjoin(gdf_point, predicate="contains")
will do the trick, though if you want to match polygons where the point falls exactly on the boundary, you may want to consider predicate="intersects".
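If you prefer to keep gdf_point as the left-hand frame so that every point row simply gains the matching polygon's Id, a hedged equivalent (the same join, just flipped) is:

# Left join from the point side: every point keeps its row, and points that
# fall inside a polygon pick up that polygon's Id (others get NaN).
matched = gdf_point.sjoin(gdf_poly[["Id", "geometry"]], how="left", predicate="within")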

Related

Get attributes from non-overlapping points and polygons

I have two geodatasets -- one is in points (centroids from a different polygon, let's name it point_data) and the other one is a polygon of a whole country (let's name it polygon_data). What I'm trying to do now is to get attributes from polygon_data and put them in point_data. But the problem is that they are not overlapping with each other.
To better understand the context, the country is archipelagic by nature, and the points are outside the country (that's why they're not overlapping).
Some solutions that I've tried are:
1.) Buffer up polygon_data so that it touches point_data. Unfortunately, this caused problems because shapes that are not on the shoreline also got buffered.
2.) Used the original polygon of point_data and did a spatial join (intersects), but the problem is that some points still returned null values and duplicate rows also occurred.
I want to make the process as seamless and easy as possible. Any ideas?
I'm both proficient with geopandas and qgis, but I would prefer it in geopandas as much as possible.
Thank you to whoever will be able to help. :)
I guess you can try to join your data depending on the distance between the points and the polygon(s). By doing so, you can fetch the index of the nearest polygon feature for each of your points, then use this index to do the join.
To replicate your problem, I generated a layer of points and a layer of polygons (they have an attribute name that I want to put on the point layer).
One (naive) way to do so could be the following:
import geopandas as gpd

# read the polygon layer and the point layer
poly_data = gpd.read_file('poly_data.geojson')
pt_data = gpd.read_file('pt_data.geojson')

# Create the field to store the index
# of the nearest polygon feature
pt_data['join_field'] = 0

for idx, geom in pt_data['geometry'].items():
    # Compute the distance between this point and each polygon
    distances = [
        (idx_to_join, geom.distance(geom_poly))
        for idx_to_join, geom_poly in poly_data['geometry'].items()]
    # Sort the distances...
    distances.sort(key=lambda d: d[1])
    # ... and store the index of the nearest polygon feature
    pt_data.loc[idx, 'join_field'] = distances[0][0]

# make the join between pt_data and poly_data (except its geometry column)
# based on the value of 'join_field'
result = pt_data.join(
    poly_data[poly_data.columns.difference(['geometry'])],
    on='join_field')

# remove the `join_field` if needed
result.drop('join_field', axis=1, inplace=True)
Result: (the value in the name column is coming from the polygons)
id geometry name
0 1 POINT (-0.07109 0.40284) A
1 2 POINT (0.04739 0.49763) A
2 3 POINT (0.05450 0.29858) A
3 4 POINT (0.06635 0.11848) A
4 5 POINT (0.63744 0.73934) B
5 6 POINT (0.61611 0.53555) B
6 7 POINT (0.76540 0.44787) B
7 8 POINT (0.84597 0.36256) B
8 9 POINT (0.67062 -0.36493) C
9 10 POINT (0.54028 -0.37204) C
10 11 POINT (0.69194 -0.60900) C
11 12 POINT (0.62085 -0.65166) C
12 13 POINT (0.31043 -0.48578) C
13 14 POINT (0.36967 -0.81280) C
Depending on the size of your dataset you may want to consider more efficient methods (e.g. defining a maximum search radius around each point to avoid having to iterate across all polygons).
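As one such option (assuming geopandas 0.10 or newer, which the answer above does not require), the whole nearest-polygon lookup collapses into a single vectorised call:

import geopandas as gpd

poly_data = gpd.read_file('poly_data.geojson')
pt_data = gpd.read_file('pt_data.geojson')

# Each point is joined to its nearest polygon in one call; all polygon
# attributes (e.g. the 'name' column) come across automatically.
result = pt_data.sjoin_nearest(poly_data, how='left', distance_col='dist_to_poly')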

Returning a unique row if condition returns multiple rows

I am writing an algorithm for a Vehicle Routing Problem. When the driving time from some seed city to some other city is minimal, it should be added to my route. To check this, I use
new_point_row_df = emte_df[emte_df["seed_time"] == emte_df.seed_time.min()]
Which gives me the entire row of the new city. Then I use
lat = new_point_row_df["Lat"].squeeze()
long = new_point_row_df["Long"].squeeze()
Which gives me the coordinates of that city. Then I pass these coordinates, alongside my "home" coordinates (the coordinates of where the distribution centre lies), to a function, "time_to_home",
which calculates, no surprises here, the time to home.
time_to_home((lat,long), veghel_coordinates)
In most cases this works perfectly fine. However, in the rare case that the filter on emte_df.seed_time.min() returns multiple rows (so two cities lie at exactly the same driving time from my seed point), lat & long are pandas Series instead of numpy floats. When I pass Series to my time_to_home function, it breaks down and throws
TypeError: cannot convert the series to <class 'float'>, which makes perfect sense.
My question is: if the filter on emte_df.seed_time.min() returns multiple rows, how can I make it so that I select only one to pass into my lat/long coordinates and subsequently the time_to_home function? It doesn't matter which row is selected, as long as it is only one so my time_to_home function doesn't break down.
Simply do:
new_point_row_df = emte_df[emte_df["seed_time"] == emte_df.seed_time.min()].iloc[0]
This will just select the first row every time.
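A common alternative idiom (not from the original answer) is to ask for the label of the first minimum directly with idxmin, which skips the boolean mask entirely:

# idxmin returns the index label of the first minimum, so ties are resolved
# automatically and lat/long stay scalar values.
new_point_row = emte_df.loc[emte_df["seed_time"].idxmin()]
lat = new_point_row["Lat"]
long = new_point_row["Long"]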

How to cluster data based on a subset of attributes (4 attributes)?

I have a pandas DataFrame that holds the data for some objects, including the positions of some parts of each object (Left, Top, Right, Bottom).
For example:
ObjectID  Left  Right  Top  Bottom
1 0 0 0 0
2 20 15 5 5
3 3 2 0 0
How can I cluster the objects based on these 4 attributes?
Is there a clustering algorithm/technique that you recommend me?
Almost all clustering algorithms are multivariate and can be used here. So your question is too broad.
It may be worth looking at appropriate distance measures first.
No recommendation would really be sound, because we don't know how your data is distributed.
Depending on the data type and the final objective you can try k-means, k-modes, or k-prototypes. If your data has a mix of categorical and continuous variables, then you can try the Partitioning Around Medoids (PAM) algorithm. However, as stated earlier by another user, can you give more information about the type of data and its variance?
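Purely as an illustrative sketch (scikit-learn and the choice of two clusters are assumptions, not from the answers), a basic k-means run on the four position columns might look like this:

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({'ObjectID': [1, 2, 3],
                   'Left': [0, 20, 3], 'Right': [0, 15, 2],
                   'Top': [0, 5, 0], 'Bottom': [0, 5, 0]})

# Scale the four attributes so no single coordinate dominates the distance,
# then cluster; n_clusters=2 is an arbitrary choice for the example.
X = StandardScaler().fit_transform(df[['Left', 'Right', 'Top', 'Bottom']])
df['cluster'] = KMeans(n_clusters=2, n_init=10).fit_predict(X)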

Return subset/slice of Pandas dataframe based on matching column of other dataframe, for each element in column?

So I think this is a relatively simple question:
I have a Pandas data frame (A) that has a key column (which is not unique/will have repeats of the key)
I have another Pandas data frame (B) that has a key column, which may have many matching entries/repeats.
So what I'd like is a bunch of data frames (a list, or a bunch of slice parameters, etc.), one for each key in A (regardless of whether it's unique or not)
In [bad] pseudocode:
for each key in A:
    resultDF[] = Rows in B where B.key = key
I can easily do this iteratively with loops, but I've read that you're supposed to slice/merge/join data frames holistically, so I'm trying to see if I can find a better way of doing this.
A join will give me all the stuff that matches, but that's not exactly what I'm looking for, since I need a resulting dataframe for each key (i.e. for every row) in A.
Thanks!
EDIT:
I was trying to be brief, but here are some more details:
Eventually, what I need to do is generate some simple statistical metrics for elements in the columns of each row.
In other words, I have a DF, call it A, and it has a r rows, with c columns, one of which is a key. There may be repeats on the key.
I want to "match" that key with another [set of?] dataframe, returning however many rows match the key. Then, for that set of rows, I want to, say, determine the min and max of certain element (and std. dev, variance, etc.) and then determine if the corresponding element in A falls within that range.
You're absolutely right that it's possible that if row 1 and row 3 of DF A have the same key -- but potentially DIFFERENT elements -- they'd be checked against the same result set (the ranges of which obviously won't change). That's fine. These won't likely ever be big enough to make that an issue (but if there's the better way of doing it, that's great).
The point is that I need to be able to do the "in range" and stat summary computation for EACH key in A.
Again, I can easily do all of this iteratively. But this seems like the sort of thing pandas could do well, and I'm just getting into using it.
Thanks again!
FURTHER EDIT
The DF looks like this:
df = pd.DataFrame([[1,2,3,4,1,2,3,4], [28,15,13,11,12,23,21,15],['keyA','keyB','keyC','keyD', 'keyA','keyB','keyC','keyD']]).T
df.columns = ['SEQ','VAL','KEY']
SEQ VAL KEY
0 1 28 keyA
1 2 15 keyB
2 3 13 keyC
3 4 11 keyD
4 1 12 keyA
5 2 23 keyB
6 3 21 keyC
7 4 15 keyD
Both DF's A and B are of this format.
I can iteratively get the resultant sets by:
loop_iter = len(A) // max(A['SEQ_NUM'])
for start in range(0, loop_iter):
    matchA = A.iloc[start::loop_iter, :]['KEY']
That's simple. But I guess I'm wondering if I can do this "inline". Also, if for some reason the numeric ordering breaks (i.e. the SEQ values get out of order), this won't work. There seems to be no reason NOT to do it explicitly splitting on the keys, right? So perhaps I have TWO questions: 1). How to split on keys, iteratively (i.e. accessing a DF one row at a time), and 2). How to match a DF and do summary statistics, etc., on a DF that matches on the key.
So, once again:
1). Iterate through DF A, going one at a time, and grabbing a key.
2). Match the key to the SET (matchB) of keys in B that match
3). Do some stats on "values" of matchB, check to see if val.A is in range, etc.
4). Profit!
Ok, from what I understand, the problem at its most simple is that you have a pd.Series of values (i.e. a["key"], which let's just call keys), which correspond to the rows of a pd.DataFrame (the df called b), such that set(b["key"]).issuperset(set(keys)). You then want to apply some function to each group of rows in b where the b["key"] is one of the values in keys.
I'm purposefully disregarding the other df -- a -- that you mention in your prompt, because it doesn't seem to bear any significance to the problem, other than being the source of keys.
Anyway, this is a fairly standard sort of operation -- it's a groupby-apply.
def descriptive_func(df):
    """
    Takes a df where key is always equal and returns some summary.
    :type df: pd.DataFrame
    :rtype: pd.Series|pd.DataFrame
    """
    pass

# filter down to those rows we're interested in
valid_rows = b[b["key"].isin(set(keys))]

# this groups by the value and applies the descriptive func to each sub df in turn
summary = valid_rows.groupby("key").apply(descriptive_func)
There are a few built-in methods on the groupby object that are useful. For example, check out valid_rows.groupby("key").sum() or valid_rows.groupby("key").describe(). Under the covers, these are really similar uses of apply. The shape of the returned summary is determined by the applied function. The unique grouped-by values (those of b["key"]) always constitute the index, but if the applied function returns a scalar, summary is a Series; if the applied function returns a Series, then summary has the returned Series as its rows; and if the applied function returns a DataFrame, then the result is a multi-index DataFrame. This is a core pattern in pandas, and there's a whole, whole lot to explore here.
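To make this concrete with the example frame from the question (the range check at the end is one possible reading of the "in range" requirement, so treat it as a sketch):

import pandas as pd

a = pd.DataFrame({'SEQ': [1, 2, 3, 4, 1, 2, 3, 4],
                  'VAL': [28, 15, 13, 11, 12, 23, 21, 15],
                  'KEY': ['keyA', 'keyB', 'keyC', 'keyD',
                          'keyA', 'keyB', 'keyC', 'keyD']})
b = a.copy()  # pretend B has the same layout as A

# Per-key summary statistics of B's values
stats = b.groupby('KEY')['VAL'].agg(['min', 'max', 'std'])

# For every row of A, check whether its VAL falls inside B's per-key range
merged = a.join(stats, on='KEY')
merged['in_range'] = merged['VAL'].between(merged['min'], merged['max'])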

Getting counts of intersections between geometries in GeoPandas

Is it possible to get counts of intersections between two geometries using GeoPandas objects? That is, I want to count up the number of polygons or line strings in one GeoDataFrame that intersect with each polygon in another GeoDataFrame. I did not see an easy way of doing this while browsing the GeoPandas docs, but wanted to check before moving on to lower-level tools.
You want a spatial join: geopandas.tools.sjoin().
There's an example in this Jupyter Notebook — look at the section called Spatial join. This is counting a set of points (midpoints) into a set of polygons (bins). Both geometries define a GeoDataFrame.
At the time of writing, tools.sjoin() is not in the current release of geopandas. I couldn't get geopandas.tools to build in any of their branches, but I fixed it — for me anyway — in my fork. My fix is an open PR.
I don't know about a built-in tool to do this but I'm not an expert. At the same time it's easily done with some pandas magic:
import numpy as np
import geopandas as gpd
from shapely.geometry import Point, Polygon

p1 = Point(.5, .5)
p2 = Point(.5, 1)
p3 = Point(1, 1)
poly = Polygon([(0, 0), (0, 2), (2, 2), (2, 0)])

df1 = gpd.GeoSeries([p1, p2, p3])
df2 = gpd.GeoDataFrame([poly, p3], columns=['geometries'])

# count how many geometries in df1 intersect each geometry in df2
f = lambda x: np.sum(df1.intersects(x))
df2['geometries'].apply(f)
Should return:
0 3
1 1
Name: geometries, dtype: int64
Let's consider two geometries (points and polygons) which intersect at least once.
1.) Spatial join of your layers. You should write something like this:
pointsInPolygon = gpd.sjoin(points, polygons, how="inner", op='intersects')
2.) Add a field with 1 as a constant value. You should write something like this: pointsInPolygon['const'] = 1
3.) Group by the column by which you want to aggregate the data. You should write something like this: pointsInPolygon.groupby(['field']).sum()
The column [const] will give you the count of intersections between your two geometries.
If you want to see other columns as well, just type something like this: df = pointsInPolygon.groupby('field').agg({'columnA': 'first', 'columnB': 'first', 'const': 'sum'}).reset_index()
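For what it's worth, with a recent geopandas the same count can be obtained without the constant column, using the index_right column that sjoin attaches to every matched row. A small self-contained sketch (the toy geometries are invented for the example):

import geopandas as gpd
from shapely.geometry import Point, Polygon

points = gpd.GeoDataFrame(geometry=[Point(0.5, 0.5), Point(0.5, 1), Point(5, 5)])
polygons = gpd.GeoDataFrame(geometry=[Polygon([(0, 0), (0, 2), (2, 2), (2, 0)])])

# sjoin attaches the matching polygon's index as "index_right" on every
# matched point row; counting rows per polygon gives the intersection count.
joined = gpd.sjoin(points, polygons, how="inner", predicate="intersects")
counts = joined.groupby("index_right").size()   # polygon 0 -> 2 points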
