Excluding data points based on proximity in a scatterplot


I am trying to create a representation of Amsterdam's channels based on a very large data set of coordinates sent through AIS. As the AIS is sometimes calibrated incorrectly, some coordinates are not on an actual channel but on urban structures. Luckily, this happens relatively rarely, so these data points are not in close proximity to other data points or clusters of data points. I therefore want to exclude the data points that do not have a 'neighbour' within a margin (say 5 meters in real life), in the most pythonic way. Would anyone know how to approach this problem? My data is a simple pandas dataframe:
              lng        lat
0        4.962218  52.362260
1        4.882198  52.406013
2        4.918583  52.335535
3        4.908185  52.381353
4        5.020983  52.277188
...           ...        ...
2249835  4.979960  52.352660
2249836  4.914533  52.334980
2249837  4.856630  52.401977
2249838  4.971418  52.357525
2249839  5.042353  52.402142
[2211095 rows x 2 columns]
The map currently looks as follows; I have marked examples of coordinates I want to filter out / exclude:
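One possible approach, sketched here under a few assumptions, is to project the coordinates to metres and ask a KD-tree for each point's nearest neighbour. The function name `has_neighbour_within`, the spherical Earth radius and the local equirectangular projection are illustrative choices, not something from the question; the projection should be adequate for a city-sized area but is not exact.

```python
import numpy as np
from scipy.spatial import cKDTree

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius (spherical approximation, an assumption)


def has_neighbour_within(lng, lat, radius_m=5.0):
    """Return a boolean mask: True where another point lies within radius_m."""
    lng = np.asarray(lng, dtype=float)
    lat = np.asarray(lat, dtype=float)
    # Local equirectangular projection to metres around the mean latitude.
    lat0 = np.radians(lat.mean())
    x = EARTH_RADIUS_M * np.radians(lng) * np.cos(lat0)
    y = EARTH_RADIUS_M * np.radians(lat)
    xy = np.column_stack([x, y])
    tree = cKDTree(xy)
    # k=2: the first hit is the point itself (distance 0),
    # the second is its closest other point (inf if there is none).
    dist, _ = tree.query(xy, k=2)
    return dist[:, 1] <= radius_m


# Tiny illustration: the first two points are about 3 m apart, the third is isolated.
lng = [4.9622, 4.9622, 4.8800]
lat = [52.362260, 52.362287, 52.400000]
mask = has_neighbour_within(lng, lat, radius_m=5.0)
```

With the dataframe from the question, something like `df[has_neighbour_within(df["lng"], df["lat"])]` would keep only the points that have a neighbour. Note that exact duplicate coordinates count as neighbours of each other in this sketch.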

Related

Grouping data points based on defined threshold values

I have a set of 100 points column-wise. I would like to pair the data points that are within a distance threshold of -5 to 5 of each other, and then label the pairs with letters: A, B, C, and so on.
For example, let us assume x is a variable which records 100 data points, a sample of which is as follows:
x=[2,5,10,13,20,25]
If I group and label them based on a defined threshold distance (here either +5 or -5), I should get output something like this:
[(2,5),(10,13),(5,10),(20,25)] and so on
A=(2,5); B=(10,13), and so on
The data points may include decimals, and I would like to do this in an efficient way. Any idea how to do this? Thanks in advance.
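A minimal sketch of one way to do this, assuming that "within the threshold" means an absolute difference of at most 5 and that pairs are labelled in sorted order (both are assumptions, as is the helper name `pairs_within`). Sorting first lets the inner loop stop as soon as the gap exceeds the threshold.

```python
from string import ascii_uppercase


def pairs_within(values, threshold=5):
    """Return all pairs (a, b) with a <= b and b - a <= threshold."""
    xs = sorted(values)
    pairs = []
    for i, a in enumerate(xs):
        for b in xs[i + 1:]:
            if b - a > threshold:
                break  # xs is sorted, so every later value is even further away
            pairs.append((a, b))
    return pairs


x = [2, 5, 10, 13, 20, 25]
pairs = pairs_within(x, threshold=5)
# Label the pairs A, B, C, ... (zip stops after 26 labels in this simple sketch).
labels = dict(zip(ascii_uppercase, pairs))
```

The same function works for decimal values, since it only relies on sorting and subtraction.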

Get attributes from non-overlapping points and polygons

I have two geodatasets -- one is in points (centroids from a different polygon, let's name it point_data) and the other one is a polygon of a whole country (let's name it polygon_data). What I'm trying to do now is to get attributes from polygon_data and put them in point_data. But the problem is that they are not overlapping with each other.
To better understand the context, the country is archipelagic by nature, and the points are outside the country (that's why they're not overlapping).
Some solutions that I've tried are:
1.) Buffer up polygon_data so that it would touch point_data. Unfortunately this caused problems because the shapes that are not on the shoreline were also buffered up.
2.) Used the original polygon of point_data and did a spatial join (intersects), but the problem is that some points still returned null values and duplicate rows also occurred.
I want to make the process as seamless and easy as possible. Any ideas?
I'm proficient with both geopandas and qgis, but I would prefer to do it in geopandas as much as possible.
Thank you to whoever will be able to help. :)

I guess you can try to join your data depending on the distance between the points and the polygon(s). By doing so, you can fetch the index of the nearest polygon feature for each of your points, then use this index to do the join.
To replicate your problem, I generated a layer of points and a layer of polygons (they have an attribute name that I want to put on the point layer).
One (naive) way to do so could be the following:
import geopandas as gpd

# read the polygon layer and the point layer
poly_data = gpd.read_file('poly_data.geojson')
pt_data = gpd.read_file('pt_data.geojson')

# Create the field to store the index
# of the nearest polygon feature
pt_data['join_field'] = 0

# Series.items() yields (index, geometry) pairs
for idx, geom in pt_data['geometry'].items():
    # Compute the distance between this point and each polygon
    distances = [
        (idx_to_join, geom.distance(geom_poly))
        for idx_to_join, geom_poly in poly_data['geometry'].items()]
    # Sort the distances...
    distances.sort(key=lambda d: d[1])
    # ... and store the index of the nearest polygon feature
    pt_data.loc[idx, 'join_field'] = distances[0][0]

# make the join between pt_data and poly_data (except its geometry column)
# based on the value of 'join_field'
result = pt_data.join(
    poly_data[poly_data.columns.difference(['geometry'])],
    on='join_field')
# remove the `join_field` if needed
result.drop('join_field', axis=1, inplace=True)
Result: (the value in the name column is coming from the polygons)
id geometry name
0 1 POINT (-0.07109 0.40284) A
1 2 POINT (0.04739 0.49763) A
2 3 POINT (0.05450 0.29858) A
3 4 POINT (0.06635 0.11848) A
4 5 POINT (0.63744 0.73934) B
5 6 POINT (0.61611 0.53555) B
6 7 POINT (0.76540 0.44787) B
7 8 POINT (0.84597 0.36256) B
8 9 POINT (0.67062 -0.36493) C
9 10 POINT (0.54028 -0.37204) C
10 11 POINT (0.69194 -0.60900) C
11 12 POINT (0.62085 -0.65166) C
12 13 POINT (0.31043 -0.48578) C
13 14 POINT (0.36967 -0.81280) C
Depending on the size of your dataset you may want to consider more efficient methods (e.g. defining a maximum search radius around each point to avoid having to iterate across all polygons).
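As a sketch of the "maximum search radius" idea, recent geopandas versions provide `sjoin_nearest`, which uses a spatial index and accepts a `max_distance` argument. The layers below are made-up stand-ins for `poly_data` and `pt_data`, and the radius of 2.0 is an arbitrary illustrative value in the layers' coordinate units.

```python
import geopandas as gpd
from shapely.geometry import Point, Polygon

# Hypothetical stand-ins for the two layers (no CRS set, plain planar units).
poly_data = gpd.GeoDataFrame(
    {"name": ["A", "B"]},
    geometry=[
        Polygon([(0, 0), (1, 0), (1, 1), (0, 1)]),
        Polygon([(5, 0), (6, 0), (6, 1), (5, 1)]),
    ],
)
pt_data = gpd.GeoDataFrame(
    {"id": [1, 2, 3]},
    geometry=[Point(-0.5, 0.5), Point(6.5, 0.5), Point(50, 50)],
)

# Attach the attributes of the nearest polygon to each point.
# Points with no polygon within max_distance keep NaN in the joined columns.
result = gpd.sjoin_nearest(
    pt_data, poly_data, how="left", max_distance=2.0, distance_col="dist"
)
```

For real data, the layers would normally be reprojected to a projected CRS first so that `max_distance` and the reported distances are in metres rather than degrees.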

How to cluster data based on a subset of attributes (4 attributes)?

I have a pandas DataFrame that holds the data for some objects, among which the position of some parts of the object (Left, Top, Right, Bottom).
For example:
ObjectID  Left  Right  Top  Bottom
1         0     0      0    0
2         20    15     5    5
3         3     2      0    0
How can I cluster the objects based on these 4 attributes?
Is there a clustering algorithm/technique that you would recommend?

Almost all clustering algorithms are multivariate and can be used here. So your question is too broad.
It may be worth looking at appropriate distance measures first.
Any specific recommendation would be unfounded, because we don't know how your data is distributed.

Depending upon the data type and final objective, you can try k-means, k-modes or k-prototypes. If your data has a mix of categorical and continuous variables, you can try the partitioning around medoids algorithm. However, as stated earlier by another user, can you give more information about the type of data and its variance?
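As one illustration of clustering on a subset of columns, here is a small k-means sketch with scikit-learn. The choice of k-means, the number of clusters (2) and the unscaled Euclidean distance are assumptions made for the example; whether they suit the real data depends on how it is distributed.

```python
import pandas as pd
from sklearn.cluster import KMeans

# The sample rows from the question.
df = pd.DataFrame({
    "ObjectID": [1, 2, 3],
    "Left": [0, 20, 3],
    "Right": [0, 15, 2],
    "Top": [0, 5, 0],
    "Bottom": [0, 5, 0],
})

# Cluster on the four position attributes only; ObjectID is left out.
features = df[["Left", "Right", "Top", "Bottom"]]
model = KMeans(n_clusters=2, n_init=10, random_state=0)
df["cluster"] = model.fit_predict(features)
```

The cluster ids themselves are arbitrary; only which rows share an id is meaningful.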

Removing points which deviate too much from adjacent point in Pandas

So I'm doing some time series analysis in Pandas and have a peculiar pattern of outliers which I'd like to remove. The plot below is based on a dataframe with the date as the first column and the data as the second column.
As you can see, the points of similar values that are interspersed and look like lines are likely instrument quirks and should be removed. I've tried using rolling_mean, median, and removal based on standard deviation, to no avail. For an idea of density, it's daily measurements from 1984 to the present. Any ideas?
import pandas as pd
gauge = pd.read_csv('GaugeData.csv', parse_dates=[0], header=None)
gauge.columns = ['Date', 'Gauge']
gauge = gauge.set_index(['Date'])
gauge['1990':'1995'].plot(style='*')
And the result of applying rolling median
gauge = gauge.rolling(5, center=True).mean()  # gauge.diff()
gauge['1990':'1995'].plot(style='*')
After rolling median

You can require that each data point has at least N "nearby" data points within a certain distance D.
N can be 2 or more.
"Nearby" for element gauge[i] can mean a pair like gauge[i-1] and gauge[i+1], but since some points only have neighbours on one side, you can instead ask for at least two elements whose index (date) distance is at most 2. So, let's say at least 2 of {gauge[i-2], gauge[i-1], gauge[i+1], gauge[i+2]} should satisfy: Distance(gauge[i], gauge[ix]) < D
D: you can decide this based on how close you expect the real data points to be.
It won't be perfect, but it should get most of the noise out of the dataset.
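The rule above could be sketched in pandas roughly as follows. The function name `keep_supported_points` and its parameters are hypothetical, Distance is taken to be the absolute difference in value, and the series is assumed to have one row per time step so that shifting by k rows means shifting by k days.

```python
import pandas as pd


def keep_supported_points(series, max_diff, min_neighbours=2, window=2):
    """Keep values that have at least min_neighbours values within max_diff
    among the `window` rows before and after them."""
    offsets = [k for k in range(-window, window + 1) if k != 0]
    # For each offset, flag rows whose shifted neighbour is close in value.
    # Comparisons against missing neighbours (NaN at the edges) count as False.
    close_count = sum(
        ((series - series.shift(k)).abs() < max_diff).astype(int)
        for k in offsets
    )
    return series[close_count >= min_neighbours]


# Made-up values: 9.0 is far from all four of its neighbours and gets dropped.
s = pd.Series([1.0, 1.1, 1.2, 9.0, 1.3, 1.25, 1.35])
kept = keep_supported_points(s, max_diff=0.5)
```

Runs of glitch values that sit close to each other would support one another under this rule, which is one reason the result will not be perfect.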

Calculate (road travel) distance between postcodes/zipcodes python

I have a csv file with start and end postcodes (the UK equivalent of US zipcodes) and would like to compute simple distance, road travel distance and travel time between the two. I guess the way to go would be to use Google Maps in one way or another. I first tried using a spreadsheet and the following url http://maps.google.com/maps?saddr="&B2&"&daddr="&A2&" but:
I do not know how to retrieve the resulting distance from Google Maps
I would like to know a more pythonic way to work this out

The distance between postal codes can be obtained with the pgeocode library. Unlike the above response, it does not query a web API, and is therefore more suitable for processing large amounts of data:
>>> import pgeocode
>>> dist = pgeocode.GeoDistance('GB')
>>> dist.query_postal_code('WC2N', 'EH53')
536.5 # returned distance in km
More information about these postal codes, including latitude and longitude, can be queried with:
>>> nomi = pgeocode.Nominatim('GB')
>>> nomi.query_postal_code(['WC2N', 'EH53'])
postal_code country code place_name \
0 WC2N GB London
1 EH53 GB Pumpherston, Mid Calder, East Calder, Oakbank
state_name state_code county_name county_code community_name \
0 England ENG Greater London 11609024 NaN
1 Scotland SCT West Lothian WLN NaN
community_code latitude longitude accuracy
0 NaN 51.5085 -0.125700 4.0
1 NaN 55.9082 -3.479025 4.0
This uses the GeoNames postal code dataset to get the GPS coordinates, then computes the Haversine (great circle) distance on those. Most countries are supported.
In the particular case of Great Britain, only the outward codes are included in the GB dataset. The full dataset is also available as GB_full, but it is currently not supported in pgeocode.

The main issue with finding a distance between 2 postcodes is that they aren't designed for it.
For the purposes of directing mail, the United Kingdom is divided by
Royal Mail into postcode areas. -Wikipedia
A postcode by itself provides no useful information, so you are correct you need help from an external source. The Google maps service at http://maps.google.com is of no use, as it's not designed for you to retrieve information like this.
Option 1 - Google Maps API
The Google Maps API is feature packed and will provide you with a lot of options. The link above is to the Distance Matrix API, which will help with working out distances between 2 points. The results from this will be based on travel (so driving distance), which may or may not be what you want.
Example
Python 3
import urllib.request
import json
res = urllib.request.urlopen("https://maps.googleapis.com/maps/api/distancematrix/json?units=imperial&origins=SE1%208XX&destinations=B2%205NY").read()
data = json.loads(res.decode())
print(data["rows"][0]["elements"][0]["distance"])
# {'text': '127 mi', 'value': 204914}
Note: Google Maps API is subject to usage limits.
Option 2 - Do it yourself with postcodes.io
postcodes.io has a nice API backed by a public data set. Example lookup. Results are in JSON which can be mapped to a Python dictionary using the json module. The downside here is it provides no way to check distance, so you will have to do it yourself using the Longitude and Latitude returned.
Example
Python 3
import urllib.request
import json
res = urllib.request.urlopen("http://api.postcodes.io/postcodes/SE18XX").read()
data = json.loads(res)
print(data["result"]["longitude"], data["result"]["latitude"])
# -0.116825494204512 51.5057668390097
Calculating distance
I don't want to get too much into this because it's a big topic and varies greatly depending on what you're trying to achieve, but a good starting point would be the Haversine Formula, which takes into account the curvature of the Earth. However, it assumes the Earth is a perfect sphere (which it's not).
The haversine formula determines the great-circle distance between two
points on a sphere given their longitudes and latitudes. Important in
navigation, it is a special case of a more general formula in
spherical trigonometry, the law of haversines, that relates the sides
and angles of spherical triangles.
Here is an example of it implemented in Python: https://stackoverflow.com/a/4913653/7220776
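For reference, a minimal standard-library sketch of the haversine formula, assuming a spherical Earth with a mean radius of 6371 km. The function name `haversine_km` is illustrative; the sample coordinates are the WC2N and EH53 values shown in the pgeocode output earlier on this page.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; the spherical model is an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (latitude, longitude) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# WC2N (London) to EH53 (West Lothian), coordinates from the pgeocode output.
distance = haversine_km(51.5085, -0.1257, 55.9082, -3.479025)
```

The result should land close to the 536.5 km reported by pgeocode for the same pair, since that library also computes a haversine distance on these coordinates.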

This looks like the perfect resource for you (they provide lat and long values for each postcode in the UK, in various formats): https://www.freemaptools.com/download-uk-postcode-lat-lng.htm
and in particular this CSV file (linked in the same page):
https://www.freemaptools.com/download/full-postcodes/ukpostcodes.zip
Once you match geographical coordinates to each postcode you have (out of the scope of this question), say you'll have a table with 4 columns (i.e. 2 (lat, long) values per postcode).
You can compute the distances using numpy. Here's an example:
import numpy as np
latlong = np.random.random((3,4))
# Dummy table containing 3 records, will look like this:
# array([[ 0.258906 , 0.66073909, 0.25845113, 0.87433443],
# [ 0.7657047 , 0.48898144, 0.39812762, 0.66054291],
# [ 0.2839561 , 0.04679014, 0.40685189, 0.09550362]])
# The following will produce a numpy array with as many elements as your records
# (It's the Euclidean distance between the points, in the same units as the
# coordinates, so for raw lat/long values it is in degrees rather than km)
distances = np.sqrt((latlong[:, 3] - latlong[:, 1])**2 + (latlong[:, 2] - latlong[:, 0])**2)
# and it looks like this:
# array([ 0.21359582, 0.405643 , 0.13219825])

The simplest way to calculate the distance between two UK postcodes is not to use latitude and longitude but to use easting and northing instead.
Once you have easting and northing you can just use Pythagoras's theorem to calculate the distance, making the maths much simpler.
Get the easting and northing for the postcodes. You can use Open Postcode Geo for this.
Use the below formula to find the distance:
sqrt(pow(abs(easting1 - easting2),2) + pow(abs(northing1 - northing2),2))
This example is from MySQL but you should be able to find similar functions in both Excel and Python:
sqrt(): Find the square root.
pow(): Raise to the power of.
abs(): Absolute value (ignore sign).
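In Python, the same Pythagoras calculation can be sketched with `math.hypot`. The function name and the easting/northing values below are made up for illustration; real values would come from a source such as Open Postcode Geo, and the result is in the grid's units (metres for British National Grid coordinates).

```python
import math


def grid_distance(easting1, northing1, easting2, northing2):
    """Straight-line distance between two grid references, in grid units."""
    return math.hypot(easting1 - easting2, northing1 - northing2)


# Hypothetical easting/northing pairs, for illustration only.
d = grid_distance(530000, 180000, 530300, 180400)
```

Squaring removes the sign, so the `abs()` calls from the MySQL version are not needed here.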
