I have a pandas DataFrame that holds data for some objects, including the positions of some parts of each object (Left, Top, Right, Bottom).
For example:
ObjectID  Left  Right  Top  Bottom
1         0     0      0    0
2         20    15     5    5
3         3     2      0    0
How can I cluster the objects based on these 4 attributes?
Is there a clustering algorithm/technique that you would recommend?
Almost all clustering algorithms are multivariate and can be used here. So your question is too broad.
It may be worth looking at appropriate distance measures first.
No recommendation would be sound, because we don't know how your data is distributed.
Depending on the data type and the final objective, you can try k-means, k-modes, or k-prototypes. If your data has a mix of categorical and continuous variables, you can try the Partitioning Around Medoids (PAM) algorithm. However, as another user stated earlier, can you give more information about the type of data and its variance?
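For purely numeric attributes like the four positions above, a minimal k-means sketch could look like the following; the DataFrame construction and the choice of two clusters are assumptions for illustration only:
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Rebuild the small example frame from the question.
df = pd.DataFrame({
    'ObjectID': [1, 2, 3],
    'Left': [0, 20, 3],
    'Right': [0, 15, 2],
    'Top': [0, 5, 0],
    'Bottom': [0, 5, 0]})

# Scale the four position attributes so no single one dominates the distance.
X = StandardScaler().fit_transform(df[['Left', 'Right', 'Top', 'Bottom']])

# Assign each object to one of two clusters.
df['cluster'] = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(df)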
I have two GeoDataFrames:
gdf_point:
Unnamed: 0 latitude longitude geometry
0 0 50.410203 7.236583 POINT (7.23658 50.41020)
1 1 51.303545 7.263082 POINT (7.26308 51.30354)
2 2 50.114965 8.672785 POINT (8.67278 50.11496)
and gdf_poly:
Unnamed: 0 Id geometry
0 0 301286 POLYGON ((9.67079 49.86762, 9.67079 49.86987, ...
1 1 302258 POLYGON ((9.67137 54.75650, 9.67137 54.75874, ...
2 2 302548 POLYGON ((9.66808 48.21535, 9.66808 48.21760, ...
I want to check whether a point from gdf_point is contained by any of the polygons of gdf_poly; if so, I want the Id of that polygon to be added to the corresponding row of gdf_point.
Here is my current code:
COUNTER = 0

def f(x, gdf_poly, df_new_point):
    global COUNTER
    # check this point against every polygon
    for row in gdf_poly.itertuples():
        geom = getattr(row, 'geometry')
        id = getattr(row, 'Id')
        if geom.contains(x):
            print('True')
            df_new_point.loc[COUNTER, 'Id'] = id
    # move on to the next point's row
    COUNTER = COUNTER + 1

df_new_point = gdf_point
gdf_point['geometry'].apply(lambda x: f(x, gdf_poly, df_new_point))
This works and does what I want it to do. But the problem is that it's way too slow: it takes about 50 min for 10k rows (multithreading is a future option), and I want it to be able to handle multiple million rows. There must be a better and faster way to do this. Thanks for your help.
To merge two dataframes on their geometries (not on column or index values), use one of geopandas's spatial joins. They have a whole section of the docs about it - it's great - give it a read!
There are two workhorse spatial join functions in geopandas:
GeoDataFrame.sjoin joins two dataframes based on a binary predicate performed on all combinations of geometries: one of intersects, contains, within, touches, crosses, or overlaps. You can specify whether you want a left, right, or inner join with the how keyword argument.
GeoDataFrame.sjoin_nearest joins two dataframes based on which geometry in one dataframe is closest to each element in the other. Similarly, the how argument gives left, right, and inner options. Additionally, there are two arguments to sjoin_nearest not available on sjoin:
max_distance: The max_distance argument specifies a maximum search radius for matching geometries. This can have a considerable performance impact in some cases. If you can, it is highly recommended that you use this parameter.
distance_col: If set, the resultant GeoDataFrame will include a column with this name containing the computed distances between an input geometry and the nearest geometry.
You can use either the top-level geopandas.sjoin and geopandas.sjoin_nearest functions or the methods geopandas.GeoDataFrame.sjoin and geopandas.GeoDataFrame.sjoin_nearest. Note, however, that the docs include a warning that the top-level functions may be deprecated at some point in the future, and recommend the use of the GeoDataFrame methods.
So in your case:
merged = gdf_poly.sjoin(gdf_point, predicate="contains")
will do the trick, though if you want to match polygons where the point falls exactly on the boundary, you may want to consider predicate="intersects".
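If you also want the matched Id written back onto gdf_point, as in the original loop, here is a minimal sketch, assuming the polygons do not overlap so each point matches at most one polygon:
# Join each point to the polygon that contains it; unmatched points get NaN.
joined = gdf_point.sjoin(gdf_poly[['Id', 'geometry']], how='left', predicate='within')

# The left index is preserved by the join, so the Id lines up row-for-row.
gdf_point['Id'] = joined['Id']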
I have two geo-datasets: one is points (centroids from a different polygon layer; let's name it point_data) and the other is a polygon of a whole country (let's name it polygon_data). What I'm trying to do now is get attributes from polygon_data and put them in point_data. But the problem is that they do not overlap with each other.
To better understand the context, the country is archipelagic by nature, and the points are outside the country (that's why they're not overlapping).
Some solutions that I've tried are:
1.) Buffering up polygon_data so that it would touch point_data. Unfortunately this caused problems because shapes that are not on the shoreline were also buffered.
2.) Used the original polygon of point_data and did a spatial join (intersects), but the problem is that some points still returned with null values, and duplicate rows also occurred.
I want to make the process as seamless and easy as possible. Any ideas?
I'm both proficient with geopandas and qgis, but I would prefer it in geopandas as much as possible.
Thank you to whoever will be able to help. :)
I guess you can try to join your data based on the distance between the points and the polygon(s). By doing so, you can fetch the index of the nearest polygon feature for each of your points, then use this index to perform the join.
To replicate your problem, I generated a layer of points and a layer of polygons (they have an attribute name that I want to put on the point layer).
One (naive) way to do so could be the following:
import geopandas as gpd

# read the polygon layer and the point layer
poly_data = gpd.read_file('poly_data.geojson')
pt_data = gpd.read_file('pt_data.geojson')

# Create the field to store the index
# of the nearest polygon feature
pt_data['join_field'] = 0

for idx, geom in pt_data['geometry'].items():
    # Compute the distance between this point and each polygon
    distances = [
        (idx_to_join, geom.distance(geom_poly))
        for idx_to_join, geom_poly in poly_data['geometry'].items()]
    # Sort the distances...
    distances.sort(key=lambda d: d[1])
    # ... and store the index of the nearest polygon feature
    pt_data.loc[idx, 'join_field'] = distances[0][0]

# make the join between pt_data and poly_data (except its geometry column)
# based on the value of 'join_field'
result = pt_data.join(
    poly_data[poly_data.columns.difference(['geometry'])],
    on='join_field')

# remove the `join_field` if needed
result.drop('join_field', axis=1, inplace=True)
Result: (the value in the name column is coming from the polygons)
id geometry name
0 1 POINT (-0.07109 0.40284) A
1 2 POINT (0.04739 0.49763) A
2 3 POINT (0.05450 0.29858) A
3 4 POINT (0.06635 0.11848) A
4 5 POINT (0.63744 0.73934) B
5 6 POINT (0.61611 0.53555) B
6 7 POINT (0.76540 0.44787) B
7 8 POINT (0.84597 0.36256) B
8 9 POINT (0.67062 -0.36493) C
9 10 POINT (0.54028 -0.37204) C
10 11 POINT (0.69194 -0.60900) C
11 12 POINT (0.62085 -0.65166) C
12 13 POINT (0.31043 -0.48578) C
13 14 POINT (0.36967 -0.81280) C
Depending on the size of your dataset you may want to consider more efficient methods (e.g. defining a maximum search radius around each point to avoid having to iterate across all polygons).
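For larger datasets, geopandas' built-in nearest spatial join (available in geopandas >= 0.10) does this in a single call. A sketch, assuming the same layer names as above:
import geopandas as gpd

poly_data = gpd.read_file('poly_data.geojson')
pt_data = gpd.read_file('pt_data.geojson')

# Attach the attributes of the nearest polygon to each point;
# distance_col is optional and reports the computed distance.
result = gpd.sjoin_nearest(pt_data, poly_data, how='left', distance_col='dist_to_poly')

# sjoin_nearest also carries over the polygon index as 'index_right'; drop it if unwanted.
result = result.drop(columns='index_right')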
I want to represent relationships between nodes in Python using a pandas.DataFrame.
Each relationship has a weight, so I used a dataframe like this:
       nodeA  nodeB  nodeC
nodeA      0      5      1
nodeB      5      0      4
nodeC      1      4      0
But I think this is an improper way to express relationships because the dataframe is symmetric and contains duplicated data.
Is there a more proper way than using a dataframe to represent a graph in Python?
(Sorry for my bad English)
This seems like an acceptable way to represent a graph, and it is in fact compatible with, say, networkx. For example, you can recover a networkx graph object as follows:
import networkx as nx
g = nx.from_pandas_adjacency(df)
print(g.edges)
# [('nodeA', 'nodeB'), ('nodeA', 'nodeC'), ('nodeB', 'nodeC')]
print(g.get_edge_data('nodeA', 'nodeB'))
# {'weight': 5}
If your graph is sparse, you may want to store it as an edge list instead, e.g. as discussed here.
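A minimal sketch of such an edge-list conversion, assuming df is the symmetric adjacency DataFrame shown above:
import pandas as pd

df = pd.DataFrame(
    [[0, 5, 1], [5, 0, 4], [1, 4, 0]],
    index=['nodeA', 'nodeB', 'nodeC'],
    columns=['nodeA', 'nodeB', 'nodeC'])

# Flatten the adjacency matrix into (source, target, weight) rows.
edges = df.stack().rename_axis(['source', 'target']).reset_index(name='weight')

# Keep each undirected edge once and drop zero-weight entries.
edges = edges[(edges['source'] < edges['target']) & (edges['weight'] != 0)]
print(edges)
#   source target  weight
# 1  nodeA  nodeB       5
# 2  nodeA  nodeC       1
# 5  nodeB  nodeC       4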
I am trying to create a representation of Amsterdam's canals based on a very large data set of coordinates sent through AIS. As the AIS is sometimes calibrated wrong, some coordinates are not on the actual canal, but rather on urban structures. Luckily, this happens relatively few times. As a result, these data points are not in close proximity to other data points / data point clusters. As such, I want to exclude the data points which do not have a 'neighbour' within a margin (say 5 meters in real life), in the most pythonic way. Would anyone know how to approach this problem? My data is a simple pandas dataframe:
lng lat
0 4.962218 52.362260
1 4.882198 52.406013
2 4.918583 52.335535
3 4.908185 52.381353
4 5.020983 52.277188
... ... ...
2249835 4.979960 52.352660
2249836 4.914533 52.334980
2249837 4.856630 52.401977
2249838 4.971418 52.357525
2249839 5.042353 52.402142
[2211095 rows x 2 columns]
The map currently looks as follows; I have marked examples of the coordinates I want to filter out / exclude:
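One possible approach, sketched under the assumption that the frame above is named df and that scikit-learn is available: build a haversine BallTree over the coordinates and drop every point that has no other point within 5 m.
import numpy as np
from sklearn.neighbors import BallTree

EARTH_RADIUS_M = 6371000  # haversine distances are in radians; scale by Earth's radius

# df is assumed to be the frame shown above, with 'lat' and 'lng' columns.
coords = np.radians(df[['lat', 'lng']].to_numpy())
tree = BallTree(coords, metric='haversine')

# Count neighbours within 5 m; every point finds at least itself, so keep counts > 1.
counts = tree.query_radius(coords, r=5 / EARTH_RADIUS_M, count_only=True)
df_filtered = df[counts > 1]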
I have a data set that contains both numeric and categorical data, like this:
subject_id hour_measure heart rate blood_pressure urine color
3 4 60
4 2 70 60 red
6 1 30 yellow
I tried various methods to handle missing data, such as the following code:
f = lambda x: x.mean() if np.issubdtype(x.dtype, np.number) else next(iter(x.mode()), None)
df[cols] = df[cols].fillna(df[cols].transform(f))
df= df.fillna(method='ffill')
but these techniques didn't give me the result I want. I tried to use hot-deck imputation; I already understand the concept of the technique, as it is a suitable way to handle both numeric and categorical data.
If you are using your data as input for machine learning, you can convert the columns containing text to numbers (e.g. via a lookup table, or by converting the colors to their corresponding RGB values).
Regarding the second part of your question : could you be more specific about what results you are expecting and what your current code produces?
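A small sketch of that lookup-table idea, assuming df is the frame above; the color-to-code mapping itself is illustrative:
import pandas as pd

# Hypothetical lookup table for the 'urine color' column; extend as needed.
color_codes = {'red': 0, 'yellow': 1}
df['urine color'] = df['urine color'].map(color_codes)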
The hot-deck method is defined in the literature as a method that replaces missing values with randomly selected values from the dataset on hand. So I tried a hot-deck approach to handle missing data, such as the following code:
import random
import pandas as pd

def hotdeck_imputation(data):
    for c in data.columns:
        donors = data[c].dropna().tolist()  # donor pool of observed values
        # pd.isna (unlike np.isnan) also works for categorical/text columns
        data[c] = [random.choice(donors) if pd.isna(i) else i for i in data[c]]
    return data
I hope it helps with your problem.