I want to get the geometries of the districts bordering a given district.
districts
d0 = districts[0]
gpd.sjoin(d0, districts, op='intersects')
This gives the geometry of d0 in each row. But I want the geometry of the right table in each row. Is it possible to get both left and right table geometries?
You could use join to get the geometry from the right table after your sjoin:
gdf = gpd.sjoin(d0, districts, op='intersects')
gdf will have a column/series called index_right, which we can leverage:
gdf.join(districts['geometry'], on='index_right', lsuffix='', rsuffix='_districts')
Not sure how geopandas will handle two geometries; I'm guessing all operations will leverage the original one from d0.
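For what it's worth, here is a hedged sketch of how you could control that explicitly, using set_geometry to pick which of the two geometry columns is the active one:
import geopandas as gpd

# Attach the right-hand geometries via index_right (as above), then choose
# which geometry column subsequent operations should use.
gdf = gpd.sjoin(d0, districts, op='intersects')
both = gdf.join(districts['geometry'], on='index_right',
                lsuffix='', rsuffix='_districts')
both = both.set_geometry('geometry_districts')  # operate on the neighbours' geometry instead of d0's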
I have two GeoDataFrames.
gdf_point:
Unnamed: 0 latitude longitude geometry
0 0 50.410203 7.236583 POINT (7.23658 50.41020)
1 1 51.303545 7.263082 POINT (7.26308 51.30354)
2 2 50.114965 8.672785 POINT (8.67278 50.11496)
and gdf_poly:
Unnamed: 0 Id geometry
0 0 301286 POLYGON ((9.67079 49.86762, 9.67079 49.86987, ...
1 1 302258 POLYGON ((9.67137 54.75650, 9.67137 54.75874, ...
2 2 302548 POLYGON ((9.66808 48.21535, 9.66808 48.21760, ...
I want to check whether a point from gdf_point is contained by any of the polygons of gdf_poly; if so, I want the Id of that polygon to be added to the corresponding row of gdf_point.
Here is my current code:
COUNTER = 0

def f(x, gdf_poly, df_new_point):
    global COUNTER
    for row in gdf_poly.itertuples():
        geom = getattr(row, 'geometry')
        id = getattr(row, 'Id')
        if geom.contains(x):
            print('True')
            df_new_point.loc[COUNTER, 'Id'] = id
    COUNTER = COUNTER + 1

df_new_point = gdf_point
gdf_point['geometry'].apply(lambda x: f(x, gdf_poly, df_new_point))
This works and does what I want it to do. The problem is that it's way too slow: it takes about 50 minutes for 10k rows (multithreading is a future option), and I want it to be able to handle several million rows. There must be a better and faster way to do this. Thanks for your help.
To merge two dataframes on their geometries (not on column or index values), use one of geopandas's spatial joins. They have a whole section of the docs about it - it's great - give it a read!
There are two workhorse spatial join functions in geopandas:
GeoDataFrame.sjoin joins two dataframes based on a binary predicate performed on all combinations of geometries: one of intersects, contains, within, touches, crosses, or overlaps. You can specify whether you want a left, right, or inner join via the how keyword argument.
GeoDataFrame.sjoin_nearest joins two dataframes based on which geometry in one dataframe is closest to each element in the other. Similarly, the how argument gives left, right, and inner options. Additionally, there are two arguments to sjoin_nearest not available on sjoin:
max_distance: The max_distance argument specifies a maximum search radius for matching geometries. This can have a considerable performance impact in some cases. If you can, it is highly recommended that you use this parameter.
distance_col: If set, the resultant GeoDataFrame will include a column with this name containing the computed distances between an input geometry and the nearest geometry.
You can optionally use these global geopandas.sjoin and geopandas.sjoin_nearest functions, or use the methods geopandas.GeoDataFrame.sjoin and geopandas.GeoDataFrame.sjoin_nearest. Note, however, that the docs include a warning that the root-level functions may be deprecated at some point in the future, and recommend the use of the GeoDataFrame methods.
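For instance, here is a minimal sjoin_nearest sketch using both extra arguments; the stops and stations GeoDataFrames are hypothetical, and max_distance is interpreted in the units of their CRS (metres for a projected one):
# Hypothetical GeoDataFrames in a projected CRS (units of metres)
joined = stops.sjoin_nearest(
    stations,
    how="left",
    max_distance=500,                # ignore candidates farther than 500 m
    distance_col="dist_to_station",  # record the computed distance
)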
So in your case:
merged = gdf_poly.sjoin(gdf_point, predicate="contains")
will do the trick, though if you want to match polygons where the point falls exactly on the boundary, you may want to consider predicate="intersects".
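If you specifically want the polygon Id attached to each point row, you can also join from the point side; this is a sketch built on the frames in your question:
# Keep every point row; points not contained by any polygon get NaN for Id
matched = gdf_point.sjoin(gdf_poly[['Id', 'geometry']], how="left", predicate="within")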
I have a dataframe (df3)
df3 = pd.DataFrame({
    'Origin': ['DEL', 'BOM', 'AMD'],
    'Destination': ['BOM', 'AMD', 'DEL']})
comprising travel data that contains Origin/Destination columns, and I'm trying to map latitude and longitude for the origin and destination airports using 3-letter city codes (df_s3):
df_s3 = pd.DataFrame({
    'iata_code': ['AMD', 'BOM', 'DEL'],
    'Lat': ['72.6346969603999', '72.8678970337', '77.103104'],
    'Lon': ['23.0771999359', '19.0886993408', '28.5665']})
I've tried mapping them one at a time, i.e.
df4=pd.merge(left=df3,right=df_s3,how='left',left_on=['Origin'],right_on=['iata_code'],suffixes=['_origin','_origin'])
df5=pd.merge(left=df4,right=df_s3,how='left',left_on=['Destination'],right_on=['iata_code'],suffixes=['_destination','_destination'])
This updates the values in the dataframe, but the columns corresponding to the origin lat/long end up with the '_destination' suffix.
I've even taken an aspirational long shot by combining the two, i.e.
df4=pd.merge(left=df3,right=df_s3,how='left',left_on=['Origin','Destination'],right_on=['iata_code','iata_code'],suffixes=['_origin','_destination'])
Both of these don't seem to be working. Any suggestions on how to make this work on a larger dataset while keeping the processing time low?
Your solution was almost correct, but you need to specify both suffixes in the second merge (and none in the first):
df4 = pd.merge(left=df3,
               right=df_s3, how='left',
               left_on=['Origin'],
               right_on=['iata_code'])

df5 = pd.merge(left=df4,
               right=df_s3, how='left',
               left_on=['Destination'],
               right_on=['iata_code'],
               suffixes=['_origin', '_destination'])
In the first merge you don't need to specify any suffixes, as there is no column overlap. In the second merge you need to specify suffixes for both sides: the left side carries the longitude and latitude from the origin (brought in by the first merge), and the right side contributes the destination values.
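If you don't want the duplicated join keys in the result, you can optionally drop them afterwards (the suffixed names below come from the second merge):
df5 = df5.drop(columns=['iata_code_origin', 'iata_code_destination'])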
You can try applying a function like this one to each column:
def from_place_to_coord(place: str):
    if place in df_s3['iata_code'].to_list():
        Lat = df_s3[df_s3['iata_code'] == place]['Lat'].values[0]
        Lon = df_s3[df_s3['iata_code'] == place]['Lon'].values[0]
        return Lat, Lon
    else:
        print('Not found')
and then:
df3['origin_loc'] = df3['Origin'].apply(from_place_to_coord)
df3['destination_loc'] = df3['Destination'].apply(from_place_to_coord)
It will give you two more columns, each holding a (Lat, Lon) tuple for the corresponding location.
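If you then need separate numeric columns rather than tuples, one possible follow-up (assuming every code was found; the new column names are just suggestions) is:
import pandas as pd

# Expand the (Lat, Lon) tuples into separate columns
df3[['origin_lat', 'origin_lon']] = pd.DataFrame(df3['origin_loc'].tolist(), index=df3.index)
df3[['dest_lat', 'dest_lon']] = pd.DataFrame(df3['destination_loc'].tolist(), index=df3.index)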
I have over 1 million rows of latitude/longitude positions. My goal is to check each of these rows against a data set of about 43,000 zip codes, each with a central latitude/longitude.
I want to calculate the haversine distance between each row and the large zip code list, then take the closest match and return that distance (or the corresponding zip code) to the left-most frame, in essence giving the closest zip code to each latitude/longitude in the large frame.
I have tried several things, including vectorized haversine functions and looping through each row, calculating and moving on to the next, but I can't quite get them to work. Given the large size of my data, I know that simply looping through each row and calculating won't work. I need a new solution; I think it might involve vectorization.
Here are some sample frames of my data. df is the large frame I am trying to calculate the smallest distance from the zip_list and return the corresponding zip code to the large frame.
df = pd.DataFrame(np.array([[42.801104, -76.827879],
                            [38.187102, -83.433917],
                            [35.973115, -83.955932]]),
                  columns=['Lat', 'Long'])

zip_list = pd.DataFrame(np.array([[49544, 42.999561, -85.75371],
                                  [49648, 45.000254, -85.3651],
                                  [49654, 45.023384, -85.75697],
                                  [50265, 41.570916, -93.73568]]),
                        columns=['ZipCode', 'Latitude', 'Longitude'])
I would like to return the minimum distance zip code to the corresponding row in the df frame.
Any ideas would be great. I am a beginner with vectorization and numpy/pandas.
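One way to vectorize the nearest-neighbour search, as a sketch (not from the original thread) using scikit-learn's BallTree with the haversine metric and the column names above:
import numpy as np
from sklearn.neighbors import BallTree

# BallTree with metric='haversine' expects (lat, lon) pairs in radians
tree = BallTree(np.radians(zip_list[['Latitude', 'Longitude']].values),
                metric='haversine')
dist, idx = tree.query(np.radians(df[['Lat', 'Long']].values), k=1)

df['ZipCode'] = zip_list['ZipCode'].values[idx.ravel()]
df['dist_km'] = dist.ravel() * 6371  # haversine distance is in radians; scale by Earth's radius in km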
I'm learning python and am currently trying to parse out the longitude and latitude from a "Location" column and assign them to the 'lat' and 'lon' columns. I currently have the following code:
def getlatlong(cell):
    dd['lat'] = cell.split('\n')[2].split(',')[0][1:]
    dd['lon'] = cell.split('\n')[2].split(',')[1][1:-1]

dd['Location'] = dd['Location'].apply(getlatlong)
dd.head()
The splitting portion of the code works. The problem is that this code copies the lat and lon from the last cell in the dataframe to all of the 'lat' and 'lon' rows. I want it to split the current row it is iterating through, assign the 'lat' and 'lon' values for that row, and then do the same on every subsequent row.
I get that assigning dd['lat'] to the split value assigns it to the whole column, but I don't know how to assign to just the row currently being iterated over.
Data sample upon request:
Index,Location
0,"1554 FIRST ST\nLIVERMORE, CA 94550\n(37.67930642, -121.7765857)"
1,"1554 FIRST ST\nLIVERMORE, CA 94550\n(37.67931141, -121.7765988)"
2,"138 14TH ST\nOAKLAND, CA 94612\n(37.80140803, -122.26369831)"
3,"4014 MACARTHUR BLVD\nOAKLAND, CA 94619\n(37.78968061, -122.19690846)"
4,"4014 MACARTHUR BLVD\nOAKLAND, CA 94619\n(37.78968557, -122.19692165)"
Please see my approach below. It is based on creating a DataFrame with lat and lon columns and then adding it to the existing dataframe.
def getlatlong(x):
    return pd.Series([x.split('\n')[2].split(',')[0][1:],
                      x.split('\n')[2].split(',')[1][1:-1]],
                     index=["lat", "lon"])

df = pd.concat((df, df.Location.apply(getlatlong)), axis=1)
This shows another technique you can use to get the answer, but it isn't the exact code you need. If you add sample data, I can tailor it.
Using pandas's built-in str methods you can save yourself some headache, as follows:
temp_df = df['Location'].str.split('\n').str.split().apply(pd.Series)
The above splits the Location column on spaces, and then turns the split values into columns. You can then assign just the Latitude and Longitude columns to the original df.
df[['Latitude', 'Longitude']] = temp_df[[<selection1>, <selection2>]]
str.split() also has an expand parameter so that you can write .str.split("char", expand=True) to spread out the columns without the apply.
Update
Given your example, this works for your specific case:
df = pd.DataFrame({"Location": ["1554 FIRST ST\nLIVERMORE, CA 94550\n(37.67930642, -121.7765857)"]})
df[["Latitude", "Longitude"]] = (df['Location']
.str.split('\n')
.apply(pd.Series)[2] # Column 2 has the str (lat, long)
.str[1:-1] # Strip the ()
.str.split(",", expand=True) # Expand latitude and longitude into two columns
.astype(float)) # Make sure latitude and longitude are floats
Out:
Location Latitude Longitude
0 1554 FIRST ST\nLIVERMORE, CA 94550\n(37.679306... 37.679306 -121.776586
Update #2
@Abhishek Mishra's answer is faster (it takes only about 55% of the time, since it goes through the data fewer times). Worth noting that the output from that example has strings in each column, so you might want to modify it to convert the values back to floats.
for ind, row in dd.iterrows():
    dd['lat'].loc[ind] = dd['Location'].loc[ind].split(',')[0][1:]
    dd['lon'].loc[ind] = dd['Location'].loc[ind].split(',')[1][1:-1]
PS: iterrows() is slow.
Is it possible to get counts of intersections between two geometries using GeoPandas objects? That is, I want to count up the number of polygons or line strings in one GeoDataFrame that intersect with each polygon in another GeoDataFrame. I did not see an easy way of doing this while browsing the GeoPandas docs, but wanted to check before moving on to lower-level tools.
You want a spatial join: geopandas.tools.sjoin().
There's an example in this Jupyter Notebook — look at the section called Spatial join. This is counting a set of points (midpoints) into a set of polygons (bins). Both geometries define a GeoDataFrame.
At the time of writing, tools.sjoin() is not in the current release of geopandas. I couldn't get geopandas.tools to build in any of their branches, but I fixed it — for me anyway — in my fork. My fix is an open PR.
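Once sjoin is available to you, the counting itself is short. Here is a sketch with hypothetical midpoints and bins GeoDataFrames, in the spirit of that notebook's example:
import geopandas as gpd

# Spatially join points into polygons, then count matches per polygon
joined = gpd.sjoin(midpoints, bins, op='intersects')
counts = joined.groupby('index_right').size()  # number of midpoints per bin, keyed by bins' index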
I don't know about a built-in tool to do this but I'm not an expert. At the same time it's easily done with some pandas magic:
import geopandas as gpd
import numpy as np
from shapely.geometry import Point, Polygon

p1 = Point(.5, .5)
p2 = Point(.5, 1)
p3 = Point(1, 1)
poly = Polygon([(0, 0), (0, 2), (2, 2), (2, 0)])

df1 = gpd.GeoSeries([p1, p2, p3])
df2 = gpd.GeoDataFrame([poly, p3], columns=['geometries'])

# For each geometry in df2, count how many geometries in df1 intersect it
f = lambda x: np.sum(df1.intersects(x))
df2['geometries'].apply(f)
Should return:
0 3
1 1
Name: geometries, dtype: int64
Let's consider two geometry layers (points and polygons) that intersect at least once.
Spatially join your layers.
You should write something like this:
pointsInPolygon = gpd.sjoin(points, polygons, how="inner", op='intersects')
Then add a field with 1 as a constant value.
You should write something like this: pointsInPolygon['const'] = 1
Then group by the column on which you want to aggregate the data.
You should write something like this: pointsInPolygon.groupby(['field']).sum()
The [const] column will give you the count of intersections between your two geometries.
If you want to see other columns as well, just type something like this: df = pointsInPolygon.groupby('field').agg({'columnA': 'first', 'columnB': 'first', 'const': 'sum'}).reset_index()