How to add random points in between the given points? - python

I have data points as dataframe just like represented in figure1
sample data
df=
74 34
74.5 34.5
75 34.5
75 34
74.5 34
74 34.5
76 34
76 34.5
74.5 34
74 34.5
75.5 34.5
75.5 34
75 34
75 34.5
I want to add random points in between those points but keep the shape of the initial points.
Desired output will be somehow like in figure 2 (black dots represent the random points.And the red line represent the boundary)
~Any suggestions? I am looking for a general solution since the geometry of the outer boundary will change in problem

interpolation might be worth looking into:
import numpy as np
# lets suppose y = 2x and x[i], y[i] is a data point
x = [1, 5, 16, 20, 60]
y = [2, 10, 32, 40, 120]
interp_here = [7, 8, 9, 21] # the points where we are going to interpolate values.
print(np.interp(interp_here, x, y)) ## [14. 16. 18. 42.]
If you want random points, then you could use the above as a guide line and then for each point add/subtract some delta.

If the shape is convex it is pretty simple:
def get_random_point(points):
point_selectors = np.random.randint(0, len(points), 2)
scale = np.random.rand()#value between 0 and 1
start_point = points[point_selectors[0]]
end_point = points[point_selectors[1]]
return start_point + (end_point - start_point) * scale
The shape you have specified is not convex. But without you additionally specifying which points make up the exterior of your shape or additional constraints like e.g. you only want to allow for lines to go parallel to either x or y axis the shape you see is mathematically not sufficiently specified.
Final remark: There are algorithms which can check whether a point is within a polygon (Point in polygon).
You can then 1) specify bounding polygon 2) generate a point within the bounding rectangle of your shape and 3) test whether the point lies within the shape of your polygon.

Related

Scatter Plot from pandas table in Python

New student to python and struggling with a task at the moment. I'm trying to publish a scatter plot from some data in a pandas table and can't seem to work it out.
Here is a sample of my data set:
import pandas as pd
data = {'housing_age': [14, 11, 3, 4],
'total_rooms': [25135, 32627, 39320, 37937],
'total_bedrooms': [4819, 6445, 6210, 5471],
'population': [35682, 28566, 16305, 16122]}
df = pd.DataFrame(data)
I'm trying to draw a scatter plot on the data in housing_age, but having some difficult figuring it out.
Initially tried for x axis to be 'housing_data' and the y axis to be a count of housing data, but couldn't get the code to work. Then read somewhere that x-axis should be variable, and y-axis should be constant, so tried this code:
x='housing_data'
y=[0,5,10,15,20,25,30,35,40,45,50,55]
plt.scatter(x,y)
ax.set_xlabel("Number of buildings")
ax.set_ylabel("Age of buildings")
but get this error:
ValueError: x and y must be the same size
Note - the data in 'housing_data' ranges from 1-53 years.
I imagine this should be a pretty easy thing, but for some reason I can't figure it out.
Does anyone have any tips?
I understand you are starting so confusion is common. Please bear with me.
From your description, it looks like you swapped x and y:
# x is the categories: 0-5 yrs, 5-10 yrs, ...
x = [0,5,10,15,20,25,30,35,40,45,50,55]
# y is the number of observations in each category
# I just assigned it some random numbers
y = [78, 31, 7, 58, 88, 43, 47, 87, 91, 87, 36, 78]
plt.scatter(x,y)
plt.set_title('Housing Data')
Generally, if you have a list of observations and you want to count them across a number of categories, it's called a histogram. pandas has many convenient functions to give you a quick look at the data. The one of interest for this question is hist - create a histogram:
# A series of 100 random buildings whose age is between 1 and 55 (inclusive)
buildings = pd.Series(np.random.randint(1, 55, 100))
# Make a histogram with 10 bins
buildings.hist(bins=10)
# The edges of those bins were determined automatically so they appear a bit weird:
pd.cut(buildings, bins=10)
0 (22.8, 28.0]
1 (7.2, 12.4]
2 (33.2, 38.4]
3 (38.4, 43.6]
4 (48.8, 54.0]
...
95 (48.8, 54.0]
96 (22.8, 28.0]
97 (12.4, 17.6]
98 (43.6, 48.8]
99 (1.948, 7.2]
You can also set the bins explicitly: 0-5, 5-10, ..., 50-55
buildings.hist(bins=range(0,60,5))
# Then the edges are no longer random
pd.cut(buildings, bins=range(0,60,5))
0 (25, 30]
1 (5, 10]
2 (30, 35]
3 (40, 45]
4 (45, 50]
...
95 (45, 50]
96 (25, 30]
97 (15, 20]
98 (40, 45]
99 (0, 5]

How to find clusters in a matrix of values

I have a 640x480 matrix that contains temperature values (coming from thermal images);
each element of the matrix represents the temperature of a single pixel like so:
[[31.2 30.4 32.5 ... 31.3 31.6 31.7]
[30.0 37.4 40.5 ... 51.5 52.6 52.7]
...
[28.9 28.8 28.1 ... 31.2 32.4 32.3]]
I want to find clusters in this matrix taking into consideration:
temperature difference between two elements;
positional distance between two elements;
I tried to do this by using DBSCAN clustering algorithm on an array containing coordinates and values of the elements like so:
coord = [[0 0 31.2]
[1 0 30.4]
[2 0 32.5]
...
[638 479 32.4]
[639 479 32.3]]
This is the code:
X=np.array(coord)
db = DBSCAN(eps=2, min_samples=2, metric='manhattan').fit(X)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
n_noise_ = list(labels).count(-1)
cluster_dict = {i: X[labels==i] for i in range(n_clusters_)}
The problem is that I get a large number of clusters and the clustering is not precise (also with different values ​​of eps and min_samples).
I would like to know if there is a more efficient method of doing this.
Thank you all

numpy: take multiple range subsets of the same of size

What I'm looking for
# I have an array
x = np.arange(0, 100)
# I have a size n
n = 10
# I have a random set of numbers
indexes = np.random.randint(n, 100, 10)
# What I want is a matrix where every row i is the i-th element of indexes plus the previous n elements
res = np.empty((len(indexes), n), int)
for (i, v) in np.ndenumerate(indexes):
res[i] = x[v-n:v]
To reformulate, as I wrote in the title what am looking for is a way to take multiple subsets (of the same size) of an initial array.
Just to add a detail this loopy version works, I want just to know if there is a numpyish way to achieve this in a more elegant way.
The following does what you are asking for. It uses numpy.lib.stride_tricks.as_strided to create a special view on the data which can be indexed in the desired way.
import numpy as np
from numpy.lib import stride_tricks
x = np.arange(100)
k = 10
i = np.random.randint(k, len(x)+1, size=(5,))
xx = stride_tricks.as_strided(x, strides=np.repeat(x.strides, 2), shape=(len(x)-k+1, k))
print(i)
print(xx[i-k])
Sample output:
[ 69 85 100 37 54]
[[59 60 61 62 63 64 65 66 67 68]
[75 76 77 78 79 80 81 82 83 84]
[90 91 92 93 94 95 96 97 98 99]
[27 28 29 30 31 32 33 34 35 36]
[44 45 46 47 48 49 50 51 52 53]]
A bit of explanation. Arrays store not only data but also a small "header" with layout information. Amongst this are the strides which tell how to translate linear memory to nd. There is a stride for each dimension which is just the offset at which the next element along that dimension can be found. So the strides for a 2d array are (row offset, element offset). as_strided permits to directly manipulate an array's strides; by setting row offsets to the same as element offsets we create a view that looks like
0 1 2 ...
1 2 3 ...
2 3 4
. .
. .
. .
Note that no data are copied at this stage; for exasmple, all the 2s refer to the same memory location in the original array. Which is why this solution should be quite efficient.

DBSCAN for clustering of geographic location data

I have a dataframe with latitude and longitude pairs.
Here is my dataframe look like.
order_lat order_long
0 19.111841 72.910729
1 19.111342 72.908387
2 19.111342 72.908387
3 19.137815 72.914085
4 19.119677 72.905081
5 19.119677 72.905081
6 19.119677 72.905081
7 19.120217 72.907121
8 19.120217 72.907121
9 19.119677 72.905081
10 19.119677 72.905081
11 19.119677 72.905081
12 19.111860 72.911346
13 19.111860 72.911346
14 19.119677 72.905081
15 19.119677 72.905081
16 19.119677 72.905081
17 19.137815 72.914085
18 19.115380 72.909144
19 19.115380 72.909144
20 19.116168 72.909573
21 19.119677 72.905081
22 19.137815 72.914085
23 19.137815 72.914085
24 19.112955 72.910102
25 19.112955 72.910102
26 19.112955 72.910102
27 19.119677 72.905081
28 19.119677 72.905081
29 19.115380 72.909144
30 19.119677 72.905081
31 19.119677 72.905081
32 19.119677 72.905081
33 19.119677 72.905081
34 19.119677 72.905081
35 19.111860 72.911346
36 19.111841 72.910729
37 19.131674 72.918510
38 19.119677 72.905081
39 19.111860 72.911346
40 19.111860 72.911346
41 19.111841 72.910729
42 19.111841 72.910729
43 19.111841 72.910729
44 19.115380 72.909144
45 19.116625 72.909185
46 19.115671 72.908985
47 19.119677 72.905081
48 19.119677 72.905081
49 19.119677 72.905081
50 19.116183 72.909646
51 19.113827 72.893833
52 19.119677 72.905081
53 19.114100 72.894985
54 19.107491 72.901760
55 19.119677 72.905081
I want to cluster this points which are nearest to each other(200 meters distance) following is my distance matrix.
from scipy.spatial.distance import pdist, squareform
distance_matrix = squareform(pdist(X, (lambda u,v: haversine(u,v))))
array([[ 0. , 0.2522482 , 0.2522482 , ..., 1.67313071,
1.05925366, 1.05420922],
[ 0.2522482 , 0. , 0. , ..., 1.44111548,
0.81742536, 0.98978355],
[ 0.2522482 , 0. , 0. , ..., 1.44111548,
0.81742536, 0.98978355],
...,
[ 1.67313071, 1.44111548, 1.44111548, ..., 0. ,
1.02310118, 1.22871515],
[ 1.05925366, 0.81742536, 0.81742536, ..., 1.02310118,
0. , 1.39923529],
[ 1.05420922, 0.98978355, 0.98978355, ..., 1.22871515,
1.39923529, 0. ]])
Then I am applying DBSCAN clustering algorithm on distance matrix.
from sklearn.cluster import DBSCAN
db = DBSCAN(eps=2,min_samples=5)
y_db = db.fit_predict(distance_matrix)
I don't know how to choose eps & min_samples value. It clusters the points which are way too far, in one cluster.(approx 2 km in distance) Is it because it calculates euclidean distance while clustering? please help.
You can cluster spatial latitude-longitude data with scikit-learn's DBSCAN without precomputing a distance matrix.
db = DBSCAN(eps=2/6371., min_samples=5, algorithm='ball_tree', metric='haversine').fit(np.radians(coordinates))
This comes from this tutorial on clustering spatial data with scikit-learn DBSCAN. In particular, notice that the eps value is still 2km, but it's divided by 6371 to convert it to radians. Also, notice that .fit() takes the coordinates in radian units for the haversine metric.
DBSCAN is meant to be used on the raw data, with a spatial index for acceleration. The only tool I know with acceleration for geo distances is ELKI (Java) - scikit-learn unfortunately only supports this for a few distances like Euclidean distance (see sklearn.neighbors.NearestNeighbors).
But apparently, you can affort to precompute pairwise distances, so this is not (yet) an issue.
However, you did not read the documentation carefully enough, and your assumption that DBSCAN uses a distance matrix is wrong:
from sklearn.cluster import DBSCAN
db = DBSCAN(eps=2,min_samples=5)
db.fit_predict(distance_matrix)
uses Euclidean distance on the distance matrix rows, which obviously does not make any sense.
See the documentation of DBSCAN (emphasis added):
class sklearn.cluster.DBSCAN(eps=0.5, min_samples=5, metric='euclidean', algorithm='auto', leaf_size=30, p=None, random_state=None)
metric : string, or callable
The metric to use when calculating distance between instances in a feature array. If metric is a string or callable, it must be one of the options allowed by metrics.pairwise.calculate_distance for its metric parameter. If metric is “precomputed”, X is assumed to be a distance matrix and must be square. X may be a sparse matrix, in which case only “nonzero” elements may be considered neighbors for DBSCAN.
similar for fit_predict:
X : array or sparse (CSR) matrix of shape (n_samples, n_features), or array of shape (n_samples, n_samples)
A feature array, or array of distances between samples if metric='precomputed'.
In other words, you need to do
db = DBSCAN(eps=2, min_samples=5, metric="precomputed")
I don't know what implementation of haversine you're using but it looks like it returns results in km so eps should be 0.2, not 2 for 200 m.
For the min_samples parameter, that depends on what your expected output is. Here are a couple of examples. My outputs are using an implementation of haversine based on this answer which gives a distance matrix similar, but not identical to yours.
This is with db = DBSCAN(eps=0.2, min_samples=5)
[ 0 -1 -1 -1 1 1 1 -1 -1 1 1 1 2 2 1 1 1 -1 -1 -1 -1 1 -1 -1 -1 -1 -1 1 1 -1 1 1 1 1 1 2 0 -1 1 2 2 0 0 0 -1 -1 -1 1 1 1 -1 -1 1 -1 -1 1]
This creates three clusters, 0, 1 and 2, and a lot of the samples don't fall into a cluster with at least 5 members and so are not assigned to a cluster (shown as -1).
Trying again with a smaller min_samples value:
db = DBSCAN(eps=0.2, min_samples=2)
[ 0 1 1 2 3 3 3 4 4 3 3 3 5 5 3 3 3 2 6 6 7 3 2 2 8
8 8 3 3 6 3 3 3 3 3 5 0 -1 3 5 5 0 0 0 6 -1 -1 3 3 3
7 -1 3 -1 -1 3]
Here most of the samples are within 200m of at least one other sample and so fall into one of eight clusters 0 to 7.
Edited to add
It looks like #Anony-Mousse is right, though I didn't see anything wrong in my results. For the sake of contributing something, here's the code I was using to see the clusters:
from math import radians, cos, sin, asin, sqrt
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import DBSCAN
import matplotlib.pyplot as plt
import pandas as pd
def haversine(lonlat1, lonlat2):
"""
Calculate the great circle distance between two points
on the earth (specified in decimal degrees)
"""
# convert decimal degrees to radians
lat1, lon1 = lonlat1
lat2, lon2 = lonlat2
lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
# haversine formula
dlon = lon2 - lon1
dlat = lat2 - lat1
a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
c = 2 * asin(sqrt(a))
r = 6371 # Radius of earth in kilometers. Use 3956 for miles
return c * r
X = pd.read_csv('dbscan_test.csv')
distance_matrix = squareform(pdist(X, (lambda u,v: haversine(u,v))))
db = DBSCAN(eps=0.2, min_samples=2, metric='precomputed') # using "precomputed" as recommended by #Anony-Mousse
y_db = db.fit_predict(distance_matrix)
X['cluster'] = y_db
plt.scatter(X['lat'], X['lng'], c=X['cluster'])
plt.show()
#eos Gives the best answer I think - as well as making use of Haversine distance (the most relevant distance measure in this case), it avoids the need to generate a precomputed distance matrix. If you create a distance matrix then you need to calculate the pairwise distances for every combination of points (although you can obviously save a bit of time by taking advantage of the fact that your distance metric is symmetric).
If you just supply DBSCAN with a distance measure and use the ball_tree algorithm though, it can avoid the need to calculate every possible distance. This is because the ball tree algorithm can use the triangular inequality theorem to reduce the number of candidates that need to be checked to find the nearest neighbours of a data point (this is the biggest job in DBSCAN).
The triangular inequality theorem states:
|x+y| <= |x| + |y|
...so if a point p is distance x from its neighbour n, and another point q is a distance y from p, if x+y is greater than our nearest neighbour radius, we know that q must be too far away from n to be considered a neighbour, so we don't need to calculate its distance.
Read more about how ball trees work in the scikit-learn documentation
There are three different things you can do to use DBSCAN with GPS data. The first is that you can use the eps parameter to specify the maximum distance between data points that you will consider to create a cluster, as specified in other answers you need to take into account the scale of the distance metric you are using a pick a value that makes sense. Then you can use the min_samples this can be used as a way to filtering out data points while moving. Last the metric will allow you to use whatever distance you want.
As an example, in a particular research project I'm working on I want to extract significant locations from a subject's GPS data locations collected from their smartphone. I'm not interested on how the subject navigates through the city and also I'm more comfortable dealing with distances in meters then I can do the next:
from geopy import distance
def mydist(p1, p2):
return distance.great_circle((p1[0],p1[1],100),(p2[0],p2[1],100)).meters
DBSCAN(eps=50,min_samples=50,n_jobs=-1,metric=mydist)
Here eps as per the DBSCAN documentation "The maximum distance between two samples for one to be considered as in the neighborhood of the other."
While min samples is "The number of samples (or total weight) in a neighborhood for a point to be considered as a core point." Basically with eps you control how close data points in a cluster should be, in the example above I selected 100 meters. Min samples is just a way to control for density, in the example above the data was captured at about one sample per second, because I'm not interested in when people are moving around but instead stationary locations I want to make sure I get at least the equivalent of 60 seconds of GPS data from the same location.
If this still does not make sense take a look at this DBSCAN animation.

Find multiple maximum values in a 2d array fast

The situation is as follows:
I have a 2D numpy array. Its shape is (1002, 1004). Each element contains a value between 0 and Inf. What I now want to do is determine the first 1000 maximum values and store the corresponding indices in to a list named x and a list named y. This is because I want to plot the maximum values and the indices actually correspond to real time x and y position of the value.
What I have so far is:
x = numpy.zeros(500)
y = numpy.zeros(500)
for idx in range(500):
x[idx] = numpy.unravel_index(full.argmax(), full.shape)[0]
y[idx] = numpy.unravel_index(full.argmax(), full.shape)[1]
full[full == full.max()] = 0.
print os.times()
Here full is my 2D numpy array. As can be seen from the for loop, I only determine the first 500 maximum values at the moment. This however already takes about 5 s. For the first 1000 maximum values, the user time should actually be around 0.5 s. I've noticed that a very time consuming part is setting the previous maximum value to 0 each time. How can I speed things up?
Thank you so much!
If you have numpy 1.8, you can use the argpartition function or method.
Here's a script that calculates x and y:
import numpy as np
# Create an array to work with.
np.random.seed(123)
full = np.random.randint(1, 99, size=(8, 8))
# Get the indices for the largest `num_largest` values.
num_largest = 8
indices = (-full).argpartition(num_largest, axis=None)[:num_largest]
# OR, if you want to avoid the temporary array created by `-full`:
# indices = full.argpartition(full.size - num_largest, axis=None)[-num_largest:]
x, y = np.unravel_index(indices, full.shape)
print("full:")
print(full)
print("x =", x)
print("y =", y)
print("Largest values:", full[x, y])
print("Compare to: ", np.sort(full, axis=None)[-num_largest:])
Output:
full:
[[67 93 18 84 58 87 98 97]
[48 74 33 47 97 26 84 79]
[37 97 81 69 50 56 68 3]
[85 40 67 85 48 62 49 8]
[93 53 98 86 95 28 35 98]
[77 41 4 70 65 76 35 59]
[11 23 78 19 16 28 31 53]
[71 27 81 7 15 76 55 72]]
x = [0 2 4 4 0 1 4 0]
y = [6 1 7 2 7 4 4 1]
Largest values: [98 97 98 98 97 97 95 93]
Compare to: [93 95 97 97 97 98 98 98]
You could loop through the array as #Inspired suggests, but looping through NumPy arrays item-by-item tends to lead to slower-performing code than code which uses NumPy functions since the NumPy functions are written in C/Fortran, while the item-by-item loop tends to use Python functions.
So, although sorting is O(n log n), it may be quicker than a Python-based one-pass O(n) solution. Below np.unique performs the sort:
import numpy as np
def nlargest_indices(arr, n):
uniques = np.unique(arr)
threshold = uniques[-n]
return np.where(arr >= threshold)
full = np.random.random((1002,1004))
x, y = nlargest_indices(full, 10)
print(full[x, y])
print(x)
# [ 2 7 217 267 299 683 775 825 853]
print(y)
# [645 621 132 242 556 439 621 884 367]
Here is a timeit benchmark comparing nlargest_indices (above) to
def nlargest_indices_orig(full, n):
full = full.copy()
x = np.zeros(n)
y = np.zeros(n)
for idx in range(n):
x[idx] = np.unravel_index(full.argmax(), full.shape)[0]
y[idx] = np.unravel_index(full.argmax(), full.shape)[1]
full[full == full.max()] = 0.
return x, y
In [97]: %timeit nlargest_indices_orig(full, 500)
1 loops, best of 3: 5 s per loop
In [98]: %timeit nlargest_indices(full, 500)
10 loops, best of 3: 133 ms per loop
For timeit purposes I needed to copy the array inside nlargest_indices_orig, lest full get mutated by the timing loop.
Benchmarking the copying operation:
def base(full, n):
full = full.copy()
In [102]: %timeit base(full, 500)
100 loops, best of 3: 4.11 ms per loop
shows this added about 4ms to the 5s benchmark for nlargest_indices_orig.
Warning: nlargest_indices and nlargest_indices_orig may return different results if arr contains repeated values.
nlargest_indices finds the n largest values in arr and then returns the x and y indices corresponding to the locations of those values.
nlargest_indices_orig finds the n largest values in arr and then returns one x and y index for each large value. If there is more than one x and y corresponding to the same large value, then some locations where large values occur may be missed.
They also return indices in a different order, but I suppose that does not matter for your purpose of plotting.
If you want to know the indices of the n max/min values in the 2d array, my solution (for largest is)
indx = divmod((-full).argpartition(num_largest,axis=None)[:3],full.shape[0])
This finds the indices of the largest values from the flattened array and then determines the index in the 2d array based on the remainder and mod.
Nevermind. Benchmarking shows the unravel method is twice as fast at least for num_largest = 3.
I'm afraid that the most time-consuming part is recalculating maximum. In fact, you have to calculate maximum of 1002*1004 numbers 500 times which gives you 500 million comparisons.
Probably you should write your own algorithm to find the solution in one pass: keep only 1000 greatest numbers (or their indices) somewhere while scanning your 2D array (without modifying the source array). I think that some sort of a binary heap (have a look at heapq) would suit for the storage.

Categories