How to optimize Shapely and Sklearn code?

How to optimize Shapely and Sklearn code? - python

I am working with a dataset of 4.2 millions points and my codes is already taking a while to process, however below code is taking several hours to process (the code was provided in other public question and basically it takes the nearest linestring to a point, finds the nearest point from that line string and calculus the distance)
The codes actually does an awesome job, but takes too long for its purposes, How I can optimize or do the same thing in a shortest time?
import geopandas as gpd
import numpy as np
from shapely.geometry import Point, LineString
from shapely.ops import nearest_points
from sklearn.neighbors import DistanceMetric
EARTH_RADIUS_IN_MILES = 3440.1 #NAUTICAL MILES
panama = gpd.read_file("/Users/Danilo/Documents/Python/panama_coastline/panama_coastline.shp")
for c in range(b):
#p = Point(-77.65325423107359,9.222038196656131)
p=Point(data['longitude'][c],data['latitude'][c])
def closest_line(point, linestrings):
return np.argmin( [p.distance(linestring) for linestring in panama.geometry] )
closest_linestring = panama.geometry[ closest_line(p, panama.geometry) ]
closest_linestring
closest_point = nearest_points(p, closest_linestring)
dist = DistanceMetric.get_metric('haversine')
points_as_floats = [ np.array([p.x, p.y]) for p in closest_point ]
haversine_distances = dist.pairwise(np.radians(points_as_floats), np.radians(points_as_floats) )
haversine_distances *= EARTH_RADIUS_IN_MILES
dtc1=haversine_distances[0][1]
dtc.append(dtc1)

Edit: Simplify to single calculation with BallTree
Imports
import pandas as pd
import geopandas as gpd
import numpy as np
from shapely.geometry import Point, LineString
from shapely.ops import nearest_points
Read Panama
panama = gpd.read_file("panama_coastline/panama_coastline.shp")
Get all points, long,lat format:
def get_points_as_numpy(geom):
work_list = []
for g in geom:
work_list.append( np.array(g.coords) )
return np.concatenate(work_list)
all_coastline_points = get_points_as_numpy(panama.geometry)
Create Balltree
from sklearn.neighbors import BallTree
import numpy as np
panama_radians = np.radians(np.flip(all_coastline_points,axis=1))
tree = BallTree(panama_radians, leaf_size=12, metric='haversine')
Create 1M random points:
mean = [8.5,-80]
cov = [[1,0],[0,5]] # diagonal covariance, points lie on x or y-axis
random_gps = np.random.multivariate_normal(mean,cov,(10**6))
random_points = pd.DataFrame( {'lat' : random_gps[:,0], 'long' : random_gps[:,1]})
random_points.head()
Calculate closest coast point (<30 Seconds on my machine)
distances, index = tree.query( np.radians(random_gps), k=1)
Put results in DataFrame
EARTH_RADIUS_IN_MILES = 3440.1
random_points['distance_to_coast'] = distances * EARTH_RADIUS_IN_MILES
random_points['closest_lat'] = all_coastline_points[index][:,0,1]
random_points['closest_long'] = all_coastline_points[index][:,0,0]

Related

Is there a better way than Pandas apply?

I'm trying to find the distance from each point to the nearest shoreline.
I have two data.
Latitude, longitude information for each point
About the shoreline
ex) sample_Data (Point Data) =
위도 경도
0 36.648365 127.486831
1 36.648365 127.486831
2 37.569615 126.819528
3 37.569615 126.819528
....
gdf =
0 LINESTRING (127.45000 34.45696, 127.44999 34.4...
1 LINESTRING (127.49172 34.87526, 127.49173 34.8...
2 LINESTRING (129.06340 37.61434, 129.06326 37.6...
...
def min_distance(x,y):
sreach_point = Point(x,y)
a = gdf.swifter.progress_bar(enable=True).apply(lambda x : geod.geometry_length(LineString(nearest_points(x['geometry'], sreach_point))),axis = 1)
return a.min()
sample_Data['거리']= sample_Data.apply(lambda x : min_distance(x['경도'],x['위도']),axis =1 ,result_type='expand')
This code takes longer than I thought, so I'm looking for a better way.
If I cross join both data frames, will the speed increase?
It takes about 6 hours to proceed with the above code.

you can use shapely for distance like :
import numpy as np
import pandas as pd
from shapely.geometry import Point, LineString
def min_distance(row, gdf):
sreach_point = Point(row['경도'], row['위도'])
a = gdf.swifter.progress_bar(enable=True).apply(
lambda x: geod.geometry_length(LineString(nearest_points(x['geometry'], sreach_point))), axis=1
)
return a.min()
sample_Data['거리'] = gdf.apply(min_distance, axis=1, args=(sample_Data,))

How to generate a 3D contour plot using data for torsion angles as numpy arrays

I am trying to generate a 3D contour plot using data stored as lists for two angles phi2 and theta in degrees. I have in total 88 datapoints. I am trying to generate the joint multivariate normal DPF using the scipy stats multivariate_normal and then plot the graph. But the attached code does not work it gives me errors refered to that z is 1D and has to be 2D.
Could anybody be so kind of direct me on how to get a decent density surface and/or contour with the data I have and fix this code? Thank you in advance
This is my code:
phi2 = [68.74428813, 73.81435267, 66.13791645, 178.54309657, 179.52273055, 161.05564169,
157.29079181, 191.92405566, 91.49774385, 96.19566795, 70.59561195, 119.9603657,
120.22305924, 98.52577754, 102.37894512, 100.12088791, 150.21004667, 139.18249739,
139.09246089, 89.51031839, 88.39689092, 136.47397506, 286.26056406, 283.74464006,
290.17913953, 286.74459786, 284.86706369, 328.13937238, 275.44219073, 303.47499211,
260.52134486, 259.35788745, 306.90146741, 11.20622691, 10.78220574, 19.15446087,
12.15462016, 13.58160662, 3.83673279, 0.12494051, 17.73139875, 8.53784067, 16.50118845,
2.53838974, 233.88019465, 234.93195189, 229.57996459, 233.07447083, 233.59862002,
231.18392245, 207.88397566, 237.31741345, 183.95293031, 179.42872881, 213.32271268,
140.7533708, 150.16895446, 130.61256041, 130.89734197, 128.63260154, 12.06830893,
200.28087782, 189.90378416, 62.39275508, 58.30936802, 205.64840358, 277.30394295,
287.76441089, 284.93518941, 265.89041707, 265.04884345, 343.86712163, 9.14315768,
341.43239609, 259.68283323, 260.00152679, 319.65245694, 341.08153124, 328.45596486,
336.02665804, 334.51276135, 334.8480636, 14.23480894, 12.53212715, 326.89899848,
42.62591188, 45.9396189, 335.39967741]
theta = [162.30883091334002, 162.38681345640427, 159.9615814653753, 174.16782637794842,
174.2151437560705, 176.40150466571572, 172.99139904772483, 175.92043493594562,
170.54952038009057, 172.72436078059172, 157.8929621077973, 168.98842698365024,
171.98480108104968, 157.1025039563731, 158.00939405227624, 157.85195861050553,
171.7970456599733, 173.88542939027778, 174.13060483554227, 157.06302225640127,
156.68490146086768, 174.10583004433656, 12.057892850177469, 22.707446760473047,
10.351988334104147, 10.029845365897357, 9.685493520484972, 7.903767103756965,
2.4881977395826027, 5.95349444674959, 30.507921155263, 30.63344201861564,
12.408566633469452, 3.9720259901877712, 4.65662142520097, 4.638183341072918,
4.106777084823232, 4.080743212101051, 4.747614837690929, 5.50356343278645,
3.5832926179292923, 3.495358074328152, 2.980060059242138, 5.785575733164003,
172.46901133841854, 172.2062576963548, 173.0410300278859, 174.06303865166896,
174.21162725364357, 170.0470319897294, 174.10752252682713, 171.23903792872886,
172.86412623832285, 174.4850965754363, 172.82274147050111, 176.9008741019669,
177.0080169547876, 171.90883294152246, 173.22247813491, 173.4304905772758,
89.63634206258786, 175.70086864635368, 175.71009499829492, 162.5980851129683,
162.16583875715634, 175.35616287818408, 4.416907543506939, 4.249480386717373,
5.265265803392446, 21.091392446454336, 21.573883985068303, 7.135649687137961,
5.332884425609576, 1.4184699545284118, 24.487533963462965, 25.63021267148377,
5.005913657707176, 7.562769691801299, 7.52664594699765, 7.898159135060811,
7.167861631741688, 7.018092266267269, 5.939275995893341, 5.975608665369072,
7.138904478798905, 9.93153808410636, 9.415946863231648, 7.154298332687937]
import sys, os
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from numpy import loadtxt
import matplotlib
from matplotlib.mlab import bivariate_normal
import math
from scipy.stats import multivariate_normal
from astropy.stats import circcorrcoef
from astropy import units as u
from scipy.stats import circvar
from scipy.stats import circmean
phi2_vacuum = np.array(phi2_vacuum)
theta_vacuum = np.array(theta_vacuum)
angle1 = np.radians(phi2_vacuum)
angle2 = np.radians(theta_vacuum)
# Obtain the circular variance
var_angle1 = circvar(angle1)
var_angle2 = circvar(angle2)
# Obtain circular mean from scipy
mean_angle1 = circmean(angle1)
mean_angle2 = circmean(angle2)
# Obtain circular covar between both angles in degrees
corr = circcorrcoef(angle1, angle2)
covar = corr * np.sqrt(var_angle1*var_angle2)
# Create the covar matrix
covar_matrix = np.array([[var_angle1, covar], [covar, var_angle2]])
# Obtain circular prob
delta = covar / (var_angle1 * var_angle2)
S = ((angle1-mean_angle1)/var_angle1) + ((angle2-mean_angle2)/var_angle2) - ((2*delta*
(angle1-mean_angle1)*(angle2-mean_angle2))/(var_angle1*var_angle2))
# Obtain exponential of PDF
exp = -1 * S / (2 * (1 - delta**2))
# Calculate the PDF
#prob = (1/(2*np.pi*var_angle1*var_angle2*np.sqrt(1-(delta**2)))) * np.e**exp
prob = multivariate_normal([mean_angle1, mean_angle2], covar_matrix)
# Create the stacking
pos = np.dstack((angle1, angle2))
fig2 = plt.figure()
ax2 = fig2.add_subplot(111)
ax2.contourf(angle1, angle2, pdf.pdf(pos))

Convert DataFrame to a multi polygon DataFrame, multiple data point - python

I have a DataFrame as below, I want to convert data to a multi polygon DataFrame, because I want to plot each multi polygon on a map.
I know how to convert if I have two data point, but with 6 data point, I don't know how to convert it. can anyone help me please.
geometry = [Point(xy) for xy in zip(neightrip_counts_.lan0, neightrip_counts_.long0)]
geometry
#neightrip_counts_.lan1, neightrip_counts_.long1,neightrip_counts_.lan2, neightrip_counts_.long2
lan0 long0 lan1 long1 lan2 long2
0 59.915667 10.777567 59.916738 10.779916 59.914943 10.773977
1 59.929853 10.711515 59.929435 10.713682 59.927596 10.710033
2 59.939230 10.759170 59.937205 10.760581 59.943750 10.760306
3 59.912520 10.762240 59.911594 10.761774 59.912347 10.763815
4 59.929634 10.732839 59.927140 10.730981 59.931081 10.736003

Let me rename the dataframe neightrip_counts_ as df for brevity. Here is the relevant code that will create a polygon for each row of dataframe.
df['geometry'] = [Polygon([(z[0],z[1]), (z[2],z[3]), (z[4],z[5])]) for z in zip(df.long0, df.lan0, df.long1, df.lan1, df.long2, df.lan2)]
gpdf = df.set_geometry("geometry", drop=True)
gpdf.plot()
By the way, you must be careful about the sequence of (long, lat).
start_coords = [ gdf.centroid[0].x, gdf.centroid[0].y] # is wrong
Use this in stead.
start_coords = [ gdf.centroid[0].y, gdf.centroid[0].x]
Edit
For the benefits of the readers, here is the complete runnable code:
import pandas as pd
import geopandas as gpd
from io import StringIO
from shapely.geometry import Polygon, Point, LineString
import numpy as np
import folium
data1 = """index lan0 long0 lan1 long1 lan2 long2
0 59.915667 10.777567 59.916738 10.779916 59.914943 10.773977
1 59.929853 10.711515 59.929435 10.713682 59.927596 10.710033
2 59.939230 10.759170 59.937205 10.760581 59.943750 10.760306
3 59.912520 10.762240 59.911594 10.761774 59.912347 10.763815
4 59.929634 10.732839 59.927140 10.730981 59.931081 10.736003"""
# read/parse data into dataframe
df0 = pd.read_csv(StringIO(data1), sep='\s+', index_col='index')
# create `geometry` column
df0['geometry'] = [Polygon([(xy[0],xy[1]), (xy[2],xy[3]), (xy[4],xy[5])]) \
for xy in zip(df0.long0, df0.lan0, df0.long1, df0.lan1, df0.long2, df0.lan2)]
# set geometry
gpdf = df0.set_geometry("geometry", drop=True)
# do check plot. (uncomment next line)
#gpdf.plot()
# make geojson
center_pt = gpdf.centroid[0].y, gpdf.centroid[0].x
gdf_json = gpdf.to_json()
# plot the geojson on the folium webmap
webmap = folium.Map(location = center_pt, zoom_start = 13, min_zoom = 3)
folium.GeoJson(gdf_json, name='data_layer_1').add_to(webmap)
# this opens the webmap
webmap
Output screen capture (of interactive webmap):

Try this, assuming the 'lan' is latitude.
import geopandas as gpd
from shapely.geometry import Polygon
import numpy as np
import pandas as pd
import folium
# ....
def addpolygeom(row):
row_array = np.array(row)
# split dataframe row to a list of tuples (lat, lon)
coords = [tuple(i)[::-1] for i in np.split(row_array, range(2, row_array.shape[0], 2))]
polygon = Polygon(coords)
return polygon
# Convert points to shapely geometry
neightrip_counts_['geometry'] = neightrip_counts_.apply(lambda x: addpolygeom(x), axis=1)
# Create a GeoDataFrame
gdf = gpd.GeoDataFrame(neightrip_counts_, geometry='geometry')
start_coords = [ gdf.centroid[0].y, gdf.centroid[0].x]
gdf_json = gdf.to_json()
map = folium.Map(start_coords, zoom_start=4)
folium.GeoJson(gdf_json, name='mypolygons').add_to(map)

How to find an "x" amount of closest elements to a centroid

I am working on a dataset that is very high dimensional and have performed k-means clustering on it. I am trying to find the 20 closest points to each centroid. The dimensions of the dataset (X_emb) is 10 x 2816. Provided is code that I used to find the single-most closest point to each centroid. The commented out code is a potential solution that I found, but I was not able to make it accurately work.
import numpy as np
import pickle as pkl
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min
from sklearn.neighbors import NearestNeighbors
from visualization.make_video_v2 import make_video_from_numpy
from scipy.spatial import cKDTree
n_s_train = 10000
df = pkl.load(open('cluster_data/mixed_finetuning_data.pkl', 'rb'))
N = len(df)
X = []
X_emb = []
for i in range(N):
play = df.iloc[i]
if df.iloc[i].label == 1:
X_emb.append(play['embedding'])
X.append(play['input'])
X_emb = np.array(X_emb)
kmeans = KMeans(n_clusters=10)
kmeans.fit(X_emb)
results = kmeans.cluster_centers_
closest, _ = pairwise_distances_argmin_min(kmeans.cluster_centers_, X)
# def find_k_closest(centroids, data, k=1, distance_norm=2):
# kdtree = cKDTree(data, leafsize=30)
# distances, indices = kdtree.query(centroids, k, p=distance_norm)
# if k > 1:
# indices = indices[:,-1]
# values = data[indices]
# return indices, values
# indices, values = find_k_closest(results, X_emb)

You can use the pairwise distances to calculate the distances for every point with the centroids with every point in X_emb, then using numpy finding the index of the min 20 elements and finally geting them from X_emb
from sklearn.metrics import pairwise_distances
distances = pairwise_distances(centroids, X_emb, metric='euclidean')
ind = [np.argpartition(i, 20)[:20] for i in distances]
closest = [X_emb[indexes] for indexes in ind]
The shape of closest will be (num of centroids x 20)

You can the NearestNeighbors class from sklearn this way:
from sklearn.neighbors import NearestNeighbors
def find_k_closest(centroids, data):
nns = {}
neighbors = NearesNieghbors(n_neighbors=20).fit(data)
for center in centroids:
nns[center] = neighbors.kneighbors(center, return_distance=false)
return nns
the nns dictionary should contain the centers as key and the list of neighbors as value

Igraph shortest path gives an infinite value

I am trying to calculate the distance between a node and two targets, afterwards I compare the lengths of the routes calculated and save the smallest in a list. I know I can use networkx.shortest_path() however this solution takes a long time and the code takes too long to run. For this reason I opted to use Igraph. Here is the code:
import networkx as nx
import matplotlib.pyplot as plt
import osmnx as ox
import pandas as pd
from shapely.wkt import loads as load_wkt
import numpy as np
import matplotlib.cm as cm
import igraph as ig
import matplotlib as mpl
import random as rd
ox.config(log_console=True, use_cache=True)
city = 'Portugal, Lisbon'
G = ox.graph_from_place(city, network_type='drive')
G_nx = nx.relabel.convert_node_labels_to_integers(G)
ox.speed.add_edge_speeds(G_nx, hwy_speeds=20, fallback=20)
ox.speed.add_edge_travel_times(G_nx)
weight = 'travel_time'
coord_1 = (38.74817825481225, -9.160815118526642) # Coordenada Hospital Santa Maria
coord_2 = (38.74110711410615, -9.152159572392323) # Coordenada Hopstial Curry Cabral
coord_3 = (38.7287248180068, -9.139114834357233) # Hospital Dona Estefania
coord_4 = (38.71814053423293, -9.137885476529883) # Hospital Sao Jose
target_1 = ox.get_nearest_node(G_nx, coord_1)
target_2 = ox.get_nearest_node(G_nx, coord_2)
target_3 = ox.get_nearest_node(G_nx, coord_3)
target_4 = ox.get_nearest_node(G_nx, coord_4)
G_ig = ig.Graph(directed=True)
G_ig.add_vertices(list(G_nx.nodes()))
G_ig.add_edges(list(G_nx.edges()))
G_ig.vs['osmid'] = list(nx.get_node_attributes(G_nx, 'osmid').values())
G_ig.es[weight] = list(nx.get_edge_attributes(G_nx, weight).values())
assert len(G_nx.nodes()) == G_ig.vcount()
assert len(G_nx.edges()) == G_ig.ecount()
route_length=[]
list_nodes=[]
for node in G_nx.nodes:
length_1 = G_ig.shortest_paths(source=node, target=target_1, weights=weight)[0][0]
length_2 = G_ig.shortest_paths(source=node, target=target_2, weights=weight)[0][0]
if length_1<length_2:
route_length.append(length_1)
else:
route_length.append(length_2)
list_nodes.append(node)
If you print the list with the lengths of the routes some values will be 'inf' which obviously doesn't make sense. Can anyone help me understand why the length would be inf?

As Vincent Traag said the distance between two disconnected nodes in inf. So it means that for such results, the node and the source are not connected

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to optimize Shapely and Sklearn code? - python

Related

Is there a better way than Pandas apply?

How to generate a 3D contour plot using data for torsion angles as numpy arrays

Convert DataFrame to a multi polygon DataFrame, multiple data point - python

How to find an "x" amount of closest elements to a centroid

Igraph shortest path gives an infinite value

Categories

Resources