I have a data frame of individuals, where each individual has X and Y coordinates, and I have a .shp file that contains a number of polygons.
The individuals data frame looks like:
ind_ID  x_coordinates  y_coordinates
1       2.333          6.572711
2       3.4444         6.57273
The .shp file looks like:
Code   shape_length  shape_area
222    .22           .5432
2322   .54322        .4342
122    .65656        .43
2122   .5445         .5678
What I want to do is add a new column to the data frame that labels each coordinate with the Code of the .shp polygon it falls inside.
To do so, I built this code:
from shapely.geometry import Point, Polygon, shape  # shape() converts geo objects through the interface
import pandas as pd
import shapefile  # pyshp

Individual = pd.read_csv("dataframe.csv")
sf = shapefile.Reader('path to the shape file.shp')
len(sf.shapes())  # number of polygons in the shapefile
# function to read the shapefile
def read_shapefile(sf):
    """
    Read a shapefile into a pandas DataFrame with a 'coords'
    column holding the geometry information. This uses the pyshp
    package.
    """
    fields = [x[0] for x in sf.fields][1:]
    records = sf.records()
    shps = [s.points for s in sf.shapes()]
    df = pd.DataFrame(columns=fields, data=records)
    df = df.assign(coords=shps)
    return df
df = read_shapefile(sf)
df.shape
I used the read_shapefile function to get the x,y points of each feature; the output DataFrame looks like:
Code   shape_length  shape_area  coords
222    .22           .5432       3.23232,2.72323,3.931226,2.543,3.435534 ...
2322   .54322        .4342       3.23232,2.72322,3.111226,2.343,3.12312 ...
122    .65656        .43         3.2323,2.23325,3.1212,2.1221,3.12321 ...
2122   .5445         .5678       3.9232,2.23232,2.931226,1.2123,3.213 ...
The next step is to check, for each individual, whether it falls inside any of the coded polygons; if yes, add a new column to the Individual df containing the corresponding Code from the .shp file.
I need help with this part.
I started by checking whether each x, y appears in the shapefile's coords:
Individual.["X","Y"].isin(sf .["coords"]).astype(int)
I couldn't run this check because it raises an error.
The output I need is an individuals data frame that looks like:
ind_ID  x_coordinates  y_coordinates  Code
1       2.333          6.572711       222
2       3.4444         6.57273        122
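A minimal sketch of the point-in-polygon step (my own suggestion, not part of the original post; it assumes the df and Individual frames built above, with the column names shown): build a shapely Polygon from each row's coords and test each individual's Point against it.

# Hypothetical sketch: label each individual with the Code of the first
# polygon whose boundary points contain it; None if no polygon matches.
from shapely.geometry import Point, Polygon

def find_code(x, y, poly_df):
    pt = Point(x, y)
    for _, row in poly_df.iterrows():
        if Polygon(row['coords']).contains(pt):
            return row['Code']
    return None  # the point falls outside every polygon

Individual['Code'] = [
    find_code(x, y, df)
    for x, y in zip(Individual['x_coordinates'], Individual['y_coordinates'])
]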
I have a list of points (longitude and latitude), as well as their associated point geometries, in a GeoDataFrame. All of the points should be able to be subdivided into individual polygons, as the points are generally clustered in several areas. What I would like is an algorithm that loops over the points and checks the distance between the previous and current point. If the distance is sufficiently small, it groups those points together; this continues until the current point is too far away. It then makes a polygon out of those close points and repeats the process with the next group of points.
gdf
longitude latitude geometry
0 -76.575249 21.157229 POINT (-76.57525 21.15723)
1 -76.575035 21.157453 POINT (-76.57503 21.15745)
2 -76.575255 21.157678 POINT (-76.57526 21.15768)
3 -76.575470 21.157454 POINT (-76.57547 21.15745)
5 -112.973177 31.317333 POINT (-112.97318 31.31733)
... ... ... ...
2222 -113.492501 47.645914 POINT (-113.49250 47.64591)
2223 -113.492996 47.643609 POINT (-113.49300 47.64361)
2225 -113.492379 47.643557 POINT (-113.49238 47.64356)
2227 -113.487443 47.643142 POINT (-113.48744 47.64314)
2230 -105.022627 48.585669 POINT (-105.02263 48.58567)
So in the data above, the first 4 points would be grouped together and turned into a polygon. Then, it would move onto the next group, and so forth. Each group of points is not evenly spaced, i.e., the next group might be 7 pairs of points, and the following could be 3. Ideally, the final output would be another geodataframe that is just a bunch of polygons.
You can try DBSCAN clustering: it determines the number of clusters automatically, and you can specify a maximum distance between points (ε).
Using your example, the algorithm identifies two clusters.
import pandas as pd
from sklearn.cluster import DBSCAN

df = pd.DataFrame(
    [
        [-76.575249, 21.157229, (-76.57525, 21.15723)],
        [-76.575035, 21.157453, (-76.57503, 21.15745)],
        [-76.575255, 21.157678, (-76.57526, 21.15768)],
        [-76.575470, 21.157454, (-76.57547, 21.15745)],
        [-112.973177, 31.317333, (-112.97318, 31.31733)],
        [-113.492501, 47.645914, (-113.49250, 47.64591)],
        [-113.492996, 47.643609, (-113.49300, 47.64361)],
        [-113.492379, 47.643557, (-113.49238, 47.64356)],
        [-113.487443, 47.643142, (-113.48744, 47.64314)],
        [-105.022627, 48.585669, (-105.02263, 48.58567)],
    ], columns=["longitude", "latitude", "geometry"])

# eps is the maximum distance between two samples for them to be neighbours
clustering = DBSCAN(eps=0.3, min_samples=4).fit(df[['longitude', 'latitude']].values)
gdf = pd.concat([df, pd.Series(clustering.labels_, name='label')], axis=1)
print(gdf)
gdf.plot.scatter(x='longitude', y='latitude', c='label')
longitude latitude geometry label
0 -76.575249 21.157229 (-76.57525, 21.15723) 0
1 -76.575035 21.157453 (-76.57503, 21.15745) 0
2 -76.575255 21.157678 (-76.57526, 21.15768) 0
3 -76.575470 21.157454 (-76.57547, 21.15745) 0
4 -112.973177 31.317333 (-112.97318, 31.31733) -1 # not in cluster
5 -113.492501 47.645914 (-113.4925, 47.64591) 1
6 -113.492996 47.643609 (-113.493, 47.64361) 1
7 -113.492379 47.643557 (-113.49238, 47.64356) 1
8 -113.487443 47.643142 (-113.48744, 47.64314) 1
9 -105.022627 48.585669 (-105.02263, 48.58567) -1 # not in cluster
If we add random data to your data set, run the clustering algorithm, and filter out those data points not in clusters, you get a clearer idea of how it's working.
import numpy as np
rng = np.random.default_rng(seed=42)
arr2 = pd.DataFrame(rng.random((3000, 2)) * 100, columns=['latitude', 'longitude'])
randdf = pd.concat([df[['latitude', 'longitude']], arr2]).reset_index(drop=True)
clustering = DBSCAN(eps=1, min_samples=4).fit(randdf[['longitude','latitude']].values)
labels = pd.Series(clustering.labels_, name='label')
gdf = pd.concat([randdf[['latitude', 'longitude']], labels], axis=1)
subgdf = gdf[gdf['label'] > -1].copy()  # copy() so columns can be added later without warnings
subgdf.plot.scatter(x='longitude', y='latitude', c='label', colormap='viridis', figsize=(20,10))
print(gdf['label'].value_counts())
-1 2527
16 10
3 8
10 8
50 8
...
57 4
64 4
61 4
17 4
0 4
Name: label, Length: 99, dtype: int64
Getting the clustered points from this dataframe would be relatively simple. Something like this:
subgdf['point'] = subgdf.apply(lambda x: (x['latitude'], x['longitude']), axis=1)
subgdf.groupby(['label'])['point'].apply(list)
label
0 [(21.157229, -76.575249), (21.157453, -76.5750...
1 [(47.645914, -113.492501), (47.643609, -113.49...
2 [(46.67210037270342, 4.380376578722878), (46.5...
3 [(85.34030732681661, 23.393948586534073), (86....
4 [(81.40203846660347, 16.697291990770392), (82....
...
93 [(61.419880354359925, 23.25522624430636), (61....
94 [(50.893415175135424, 90.70863269095085), (52....
95 [(88.80586950148697, 81.17523712192651), (88.6...
96 [(34.23624333000541, 40.8156668231013), (35.86...
97 [(16.10456828199399, 67.41443008931344), (15.9...
Name: point, Length: 98, dtype: object
Although you'd probably need to do some kind of sorting to make sure you were connecting the closest points when drawing the polygons.
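One hedged way to skip that sorting entirely (my own addition, assuming the subgdf with its 'point' column built above) is to take the convex hull of each cluster with shapely:

# Hypothetical sketch: one polygon per DBSCAN cluster via convex hulls.
from shapely.geometry import MultiPoint

hulls = (
    subgdf.groupby('label')['point']
          .apply(lambda pts: MultiPoint(list(pts)).convex_hull)
)
# hulls is a Series of shapely geometries, one per cluster label; note that
# a cluster with fewer than 3 points yields a Point or LineString instead.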
Similar SO question
DBSCAN from sklearn
Haversine Formula in Python (Bearing and Distance between two GPS points)
https://gis.stackexchange.com/questions/121256/creating-a-circle-with-radius-in-metres
You may be able to use the haversine formula to group points within a distance. Create a buffer polygon for each point with the function below, then filter the points that fall inside it out of the master list, and repeat until there are no more points.
# import modules
import geopandas as gpd
import pyproj
from functools import partial
from shapely.geometry import Point
from shapely.ops import transform
# function to create a circular polygon of a given radius (metres) around a lat/lon point
def polycir(lat, lon, radius):
    local_azimuthal_projection = (
        "+proj=aeqd +R=6371000 +units=m +lat_0={} +lon_0={}".format(lat, lon)
    )
    wgs84_to_aeqd = partial(
        pyproj.transform,
        pyproj.Proj("+proj=longlat +datum=WGS84 +no_defs"),
        pyproj.Proj(local_azimuthal_projection),
    )
    aeqd_to_wgs84 = partial(
        pyproj.transform,
        pyproj.Proj(local_azimuthal_projection),
        pyproj.Proj("+proj=longlat +datum=WGS84 +no_defs"),
    )
    center = Point(float(lon), float(lat))
    point_transformed = transform(wgs84_to_aeqd, center)
    buffer = point_transformed.buffer(radius)
    # Get the polygon with lat/lon coordinates
    circle_poly = transform(aeqd_to_wgs84, buffer)
    return circle_poly
# Convert df to gdf
gdf = gpd.GeoDataFrame(df, geometry=gpd.points_from_xy(df.longitude, df.latitude))

# Create circle polygons column
gdf['polycir'] = [polycir(x, y, <'Radius in Meters'>) for x, y in
                  zip(gdf.latitude, gdf.longitude)]
gdf.set_geometry('polycir', inplace=True)

# You should be able to loop through the polygons and find the geometries
# that overlap, e.g.
# gdf_filtered = gdf[gdf.polycir.within(gdf.iloc[0,4])]
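A minimal sketch of the "filter and repeat" loop described above (my own addition; it assumes the gdf just built, with point geometries in the 'geometry' column and buffers in 'polycir', and a radius filled in):

# Hypothetical sketch: group points by repeatedly taking the first remaining
# point's buffer and pulling in everything that falls inside it.
groups = []
remaining = gdf.copy()
while not remaining.empty:
    seed_buffer = remaining['polycir'].iloc[0]
    mask = remaining['geometry'].within(seed_buffer)
    groups.append(remaining[mask])
    remaining = remaining[~mask]
# each entry in groups can then be turned into a polygon of your choosing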
Looks like a job for k-means clustering.
You may need to be careful about how you define your distance (actual distance "through" the earth, or shortest path around it?).
Turning each cluster into a polygon depends on what you want to do... just chain the points, or look for their convex envelope...
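A minimal sketch of that idea (my own addition, assuming the gdf from the question and a known or estimated cluster count k, which k-means, unlike DBSCAN, requires up front):

# Hypothetical sketch: k-means on raw lon/lat, then a convex hull per cluster.
from sklearn.cluster import KMeans
from shapely.geometry import MultiPoint

k = 3  # assumed cluster count
labels = KMeans(n_clusters=k, n_init=10).fit_predict(gdf[['longitude', 'latitude']])
polygons = (
    gdf.assign(label=labels)
       .groupby('label')['geometry']
       .apply(lambda pts: MultiPoint(list(pts)).convex_hull)
)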
I need to start by saying that my code runs without any error messages, but I don't understand some of the results.
I create a graph in networkx from a pandas data frame that has 398595 integer IDs.
# Create Graph
G = nx.Graph()
G.name = "Graph from Pandas"
# Add Nodes to Graph
G.add_nodes_from(test_df['ID'].tolist())
print(nx.info(G))
The output from nx.info(G) is as follows, which is correct and what I expected:
Type: Graph
Number of nodes: 398595
Number of edges: 0
Average degree: 0.0000
Then I load a second pandas data frame; it contains 5556353 entries and has three columns:
ID1 ID2 weight
3 198 0.601002
3 183 0.618057
Each ID in ID1 or ID2 also exists in the first pandas dataframe, so I load the edges as follows:
# Add data to Graph
G = nx.from_pandas_edgelist(df,source='ID1',target='ID2', edge_attr='weight')
print(nx.info(G))
However, here is what I don't understand: the output from nx.info(G) now returns:
Type: Graph
Number of nodes: 29348
Number of edges: 4371353
Average degree: 297.8978
Now my questions are: (1) why are there fewer nodes in this graph than before, and (2) why are there considerably fewer edges in this graph than in the data frame?
There are probably fewer unique IDs across ID1 and ID2 of df than there are in the ID column of test_df. Note that nx.from_pandas_edgelist builds a brand-new graph from the edge list alone, so the 398595 nodes you added earlier are discarded. The first thing I would check is whether the number of unique IDs across ID1 and ID2 in df equals the number of nodes displayed: len(pd.unique(df[['ID1','ID2']].values.ravel())) should equal 29348.
One reason there are fewer edges is if there are directed or duplicate edges in the dataframe. The default value for the create_using parameter of nx.from_pandas_edgelist is nx.Graph(), so edges are treated as undirected and multiple edges between the same pair of nodes are collapsed into one. If you want directed edges, multiple edges, or both, try passing nx.DiGraph, nx.MultiGraph, or nx.MultiDiGraph respectively to the create_using parameter.
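A quick way to verify both points (a sketch, assuming df as in the question):

# Hypothetical check: count unique IDs across both endpoint columns, and count
# duplicate undirected edges that a plain nx.Graph would collapse.
import numpy as np
import pandas as pd

print(len(pd.unique(df[['ID1', 'ID2']].values.ravel())))  # expect 29348

und = pd.DataFrame(np.sort(df[['ID1', 'ID2']].values, axis=1), columns=['a', 'b'])
print(len(und) - len(und.drop_duplicates()))  # edges lost to de-duplication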
For context:
I am making a visual graph for a protein-protein interaction network. A node here corresponds to a protein and an edge would indicate interaction between two nodes.
Here is my code:
First I import all the modules and files that I need:
import networkx as nx
import matplotlib.pyplot as plt
import pandas as pd
interactome_edges = pd.read_csv("*a_directory*", delimiter = "\t", header = None)
interactome_nodes = pd.read_csv("*a_directory*", delimiter = "\t", header = None)
# A few adjustments for the dataframes
interactome_nodes = interactome_nodes.drop(columns = [0])
interactome_edges.columns = ["node1","node2"]
Dataframe for nodes looks like this:
1
0 MET3
1 IMD3
2 OLE1
3 MUP1
4 PIS1
...
Dataframe for edges looks like this:
node1 node2
0 MET3 MET3
1 IMD3 IMD4
2 OLE1 OLE1
3 MUP1 MUP1
4 PIS1 PIS1
...
Basically, the edge goes from node1 to node2.
Now I iterate through each row of the node dataframe and the edge dataframe and use them as networkx nodes and edges.
interactome = nx.Graph()

# Adding Nodes to Graph
for index, row in interactome_nodes.iterrows():
    interactome.add_nodes_from(row)

# Adding Edges to Graph
for index, row in interactome_edges.iterrows():
    interactome.add_edges_from(row["node1", "node2"])  #### Here is the problem
My problem is at the adding Edges part.
I am currently getting the following error:
KeyError: ('node1', 'node2')
I have also tried:

for index, row in interactome_edges.iterrows():
    interactome.add_edges_from((row["node1"], row["node2"]))

and:

for index, row in interactome_edges.iterrows():
    interactome.add_edges_from(row["node1"], row["node2"])

and also simply:

for index, row in interactome_edges.iterrows():
    interactome.add_edges_from(row)

All of which give me some form of error.
How can I use my node to node dataframe as edges for a networkx graph?
In [9]: import networkx as nx
In [10]: import pandas as pd
In [11]: df = pd.read_csv("a.csv")
In [12]: df
Out[12]:
node1 node2
0 MET3 MET3
1 IMD3 IMD4
2 OLE1 OLE1
3 MUP1 MUP1
4 PIS1 PIS1
In [13]: G=nx.from_pandas_edgelist(df, "node1", "node2")
In [14]: [e for e in G.edges]
Out[14]:
[('MET3', 'MET3'),
('IMD3', 'IMD4'),
('OLE1', 'OLE1'),
('MUP1', 'MUP1'),
('PIS1', 'PIS1')]
Networkx has methods to read a graph directly from a pandas dataframe. I have used the edge dataframe provided; here I'm using the from_pandas_edgelist method to read the dataframe of edges. (For the record, the KeyError in your loop comes from row["node1", "node2"], which looks up the single tuple key ('node1', 'node2') rather than two columns; add_edges_from also expects an iterable of edge tuples, so adding a single edge per row would need add_edge(row["node1"], row["node2"]).)
The graph can then be plotted and saved:

nx.draw_planar(G, with_labels=True)
plt.savefig("filename2.png")
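If you also want proteins with no interactions to appear, the node dataframe can be added as well (a sketch, assuming the interactome_nodes frame from the question, whose remaining column is labelled 1):

# Hypothetical follow-up: add nodes from the node dataframe so isolated
# proteins are kept even when they have no edges.
G.add_nodes_from(interactome_nodes[1])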
I have a dataset like the following, where the first and second columns indicate node connections (from, to):
fromNode toNode
0 1
0 2
0 31
0 73
1 3
1 56
2 10
...
I want to generate the Laplacian matrix from this dataset. I use the following code to do so, but it complains because the dataset itself is not a square matrix. Is there a function that accepts this type of dataset and generates the matrix?
from numpy import genfromtxt
from scipy.sparse import csgraph

G = genfromtxt('./data.csv', delimiter='\t').astype(int)
dataset = csgraph.laplacian(G, normed=False)
Rather than find a function that will accept your data, process your data into the correct format.
The fake data f below simulates a file object (io.StringIO on Python 3):
import io

data = '''0 1
0 2
0 31
0 73
1 3
1 56
2 10'''
f = io.StringIO(data)
Read each line of the data and process it into a list of edges of the form (node1, node2).
edges = []
for line in f:
    line = line.strip()
    (node1, node2) = map(int, line.split())
    edges.append((node1, node2))
Find the highest node number and create a square numpy ndarray based on it. You need to be aware of your node numbering: is it zero-based?
import numpy as np

N = max(x for edge in edges for x in edge)
G = np.zeros((N + 1, N + 1), dtype=np.int64)
Iterate over the edges and assign each edge weight to the graph:
for row, column in edges:
    G[row, column] = 1
Here is a solution making use of numpy integer array indexing.
f = io.StringIO(data)  # fresh file object; the loop above consumed the first one
z = np.genfromtxt(f, dtype=np.int64)
n = z.max() + 1
g = np.zeros((n, n), dtype=np.int64)
rows, columns = z.T
g[rows, columns] = 1
Of course both of those assume all edge weights are equal.
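With either square adjacency matrix in hand, the call from your question should then work unchanged (untested on my end, as noted below):

# Reusing the question's own csgraph call on the adjacency matrix built above.
from scipy.sparse import csgraph
L = csgraph.laplacian(G, normed=False)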
See Graph Representations in the scipy docs. I couldn't try this graph to see whether the result is valid; I'm getting an import error for csgraph locally (probably need to update scipy).