Extracting clusters in the form of lines from a dataset - python

I have a dataset similar to the one shown below, which clearly forms lines from my point of view. Instead of drawing markers, I want to connect the markers within each curve by a line. Which type of clustering algorithm would be a good fit in this case?
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(42)
#Generate (x,y) data
x = np.linspace(0.1,0.9,50)
y = x%1
x += np.sin(2*x%1)
y = y%0.2
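#Shuffle (x,y) data
ns = list(range(len(x)))
np.random.shuffle(ns)
x = x[ns]
y = y[ns]
#Plot markers (left) and naively connected points (right)
fig, axs = plt.subplots(1,2)
axs[0].scatter(x,y)
axs[1].plot(x,y)
plt.savefig("markers vs lines.pdf")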
Figure - Left: Markers, Right: Data points connected by lines.

Since you asked for a clustering algorithm, you may want to look at DBSCAN.
It has two main parameters: eps (the neighborhood radius) and min_samples (the number of points needed to form a cluster).
Here is some code to get you started:
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(42)
# %matplotlib inline  (uncomment in a Jupyter notebook)
#Generate (x,y) data
x = np.linspace(0.1,0.9,50)
y = x%1
x += np.sin(2*x%1)
y = y%0.2
#Shuffle (x,y) data
ns = list(range(len(x)))
np.random.shuffle(ns)
x = x[ns]
y = y[ns]
# Fit the data
X = np.column_stack((x, y))
X = StandardScaler().fit_transform(X)
# Compute the DBSCAN
db = DBSCAN(eps=0.5, min_samples=1).fit(X)
labels = db.labels_
# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
# Plot the clusters
d= dict(zip(set(labels),['red','green','blue','yellow','purple','grey']))
d[-1] = "black"
plt.scatter(x,y,color=[ d[i] for i in labels])
The result:
Inspired by: http://scikit-learn.org/stable/auto_examples/cluster/plot_dbscan.html
More about the parameters of DBSCAN here: http://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html#sklearn.cluster.DBSCAN
Hope this helps.
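To get from cluster labels to the connected lines the question asks for, one option is to sort the points of each cluster along x and plot each cluster as its own line. The sketch below is only an illustration: `cluster_polylines` is a made-up helper name, and sorting by x assumes every curve is single-valued in x, which may not hold for other datasets.

```python
import numpy as np

def cluster_polylines(x, y, labels):
    """Return {label: (xs, ys)} with each cluster's points sorted by x."""
    x, y, labels = (np.asarray(a) for a in (x, y, labels))
    lines = {}
    for lab in np.unique(labels):
        if lab == -1:  # DBSCAN marks noise points with -1; skip them
            continue
        mask = labels == lab
        order = np.argsort(x[mask])
        lines[int(lab)] = (x[mask][order], y[mask][order])
    return lines

# Tiny hand-made example: cluster 0 has three points, cluster 1 has one,
# and the last point is noise.
demo = cluster_polylines([3, 1, 2, 5, 4], [30, 10, 20, 50, 40], [0, 0, 0, 1, -1])
# To draw the result: for xs, ys in demo.values(): plt.plot(xs, ys)
```

With the answer's variables this would be called as `cluster_polylines(x, y, labels)`.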

Such data is common in image analysis, for example in pictures of architecture.
In order to infer perspective, people have used the Hough transform to identify lines of points.
That is probably the best method to use here.
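For reference, here is a rough NumPy-only sketch of a Hough transform on a point set. Every point votes for all lines `rho = x*cos(theta) + y*sin(theta)` passing through it, and peaks in the accumulator correspond to lines supported by many points. The function name `hough_lines` and the grid sizes are arbitrary choices for this illustration, and the method looks for straight lines, so curved branches would only be matched approximately.

```python
import numpy as np

def hough_lines(points, n_theta=180, n_rho=200):
    """Vote in (theta, rho) space; return the accumulator and both axes."""
    pts = np.asarray(points, dtype=float)
    thetas = np.linspace(0.0, np.pi, n_theta, endpoint=False)
    # rho for every (point, angle) pair, shape (n_points, n_theta)
    rhos = pts[:, [0]] * np.cos(thetas) + pts[:, [1]] * np.sin(thetas)
    rho_max = np.abs(rhos).max()
    edges = np.linspace(-rho_max, rho_max, n_rho + 1)
    acc = np.zeros((n_theta, n_rho), dtype=int)
    for t in range(n_theta):
        acc[t], _ = np.histogram(rhos[:, t], bins=edges)
    return acc, thetas, edges

# Demo: 20 points on the horizontal line y = 1
xs = np.linspace(0.0, 1.0, 20)
demo_points = np.column_stack((xs, np.ones_like(xs)))
acc, thetas, edges = hough_lines(demo_points)
t, r = np.unravel_index(acc.argmax(), acc.shape)
best_theta = thetas[t]                       # angle of the line's normal
best_rho = 0.5 * (edges[r] + edges[r + 1])   # distance of the line from the origin
```

For several lines you would pick several accumulator peaks instead of only the largest one.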


Finding k nearest neighbors in 3d numpy array

So I'm trying to find the k nearest neighbors in a pyvista numpy array from an example mesh. With the neighbors received, I want to implement some region growing in my 3d model.
But unfortunately I get some weird output, which you can see in the following picture.
It seems like I'm missing something on the KDTree implementation. I was following the answer on a similar question: https://stackoverflow.com/a/2486341/9812286
import numpy as np
from sklearn.neighbors import KDTree
import pyvista as pv
from pyvista import examples
# Example dataset with normals
mesh = examples.load_random_hills()
smooth = mesh
NDIM = 3
X = smooth.points
point = X[5000]
tree = KDTree(X, leaf_size=X.shape[0]+1)
# ind = tree.query_radius([point], r=10) # indices of neighbors within distance 0.3
distances, ind = tree.query([point], k=1000)
p = pv.Plotter()
ids = np.arange(smooth.n_points)[ind[0]]
top = smooth.extract_cells(ids)
random_color = np.random.random(3)
p.add_mesh(top, color=random_color)
You're almost there :) The problem is that you are using the points in the mesh to build the tree, but then extracting cells. Of course these are unrelated in the sense that indices for points will give you nonsense when applied as indices of cells.
Either you have to extract_points:
import numpy as np
from sklearn.neighbors import KDTree
import pyvista as pv
from pyvista import examples
# Example dataset with normals
mesh = examples.load_random_hills()
smooth = mesh
NDIM = 3
X = smooth.points
point = X[5000]
tree = KDTree(X, leaf_size=X.shape[0]+1)
# ind = tree.query_radius([point], r=10) # indices of neighbors within distance 0.3
distances, ind = tree.query([point], k=1000)
p = pv.Plotter()
ids = np.arange(smooth.n_points)[ind[0]]
top = smooth.extract_points(ids) # changed here!
random_color = np.random.random(3)
p.add_mesh(top, color=random_color)
Or you have to work with cell centers to begin with:
import numpy as np
from sklearn.neighbors import KDTree
import pyvista as pv
from pyvista import examples
# Example dataset with normals
mesh = examples.load_random_hills()
smooth = mesh
NDIM = 3
X = smooth.cell_centers().points # changed here!
point = X[5000]
tree = KDTree(X, leaf_size=X.shape[0]+1)
# ind = tree.query_radius([point], r=10) # indices of neighbors within distance 0.3
distances, ind = tree.query([point], k=1000)
p = pv.Plotter()
ids = np.arange(smooth.n_cells)[ind[0]] # these are cell indices now
top = smooth.extract_cells(ids)
random_color = np.random.random(3)
p.add_mesh(top, color=random_color)
As you can see, the two results differ, since index 5000 (which we used for the reference point) means something else when indexing points or when indexing cells.
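The same point can be shown without PyVista. In the toy sketch below (a made-up four-point "mesh" and a brute-force neighbour search instead of a KDTree), the indices returned by the search only make sense for the array that was searched: a valid point index can be out of range as a cell index.

```python
import numpy as np

def knn_indices(coords, query, k):
    """Row indices of the k entries of coords closest to query (brute force)."""
    dist = np.linalg.norm(np.asarray(coords) - np.asarray(query), axis=1)
    return np.argsort(dist)[:k]

# Hypothetical toy mesh: 4 points on a line, 3 cells joining neighbours
points = np.array([[0.0, 0, 0], [1.0, 0, 0], [2.0, 0, 0], [3.0, 0, 0]])
cells = np.array([[0, 1], [1, 2], [2, 3]])   # point ids of each cell
centers = points[cells].mean(axis=1)         # one row per cell

point_ids = knn_indices(points, points[3], k=2)    # indices into points
cell_ids = knn_indices(centers, centers[2], k=2)   # indices into cells
```

Here `point_ids` contains 3, which is a valid point but not a valid cell, since there are only three cells.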

How could I use a dynamic epsilon in a DBSCAN?

Today I'm working on a dataset from Kaggle: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data. I would like to segment my dataset by beds, baths and neighborhood, and use DBSCAN to cluster by price within each segment. The problem is that each segment is different, so I don't want to use the same epsilon for the whole dataset but the best epsilon for each segment. Do you know an efficient way to do this?
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
# pdf is the pandas DataFrame holding the Kaggle data
Clus_dataSet = pdf[['beds','baths','neighborhood','price']]
Clus_dataSet = np.nan_to_num(Clus_dataSet)
Clus_dataSet = StandardScaler().fit_transform(Clus_dataSet)
# Compute DBSCAN
db = DBSCAN(eps=0.3, min_samples=6).fit(Clus_dataSet)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
realClusterNum=len(set(labels)) - (1 if -1 in labels else 0)
clusterNum = len(set(labels))
Thank you.
A heuristic for setting the Epsilon and MinPts parameters was proposed in the original DBSCAN paper.
Once the MinPts value is set (e.g. 2 * number of features), the partitioning result depends strongly on Epsilon. The heuristic suggests inferring Epsilon through a visual analysis of the k-dist plot.
A toy example of the procedure with two Gaussian distributions is shown below.
from sklearn.neighbors import NearestNeighbors
from matplotlib import pyplot as plt
from sklearn.datasets import make_biclusters
data,lab,_ = make_biclusters((200,2), 2, noise=0.1, minval=0, maxval=1)
minpts = 4
nbrs = NearestNeighbors(n_neighbors=minpts, algorithm='ball_tree').fit(data)
distances, indices = nbrs.kneighbors(data)
k_dist = sorted(x[-1] for x in distances)  # distance to the k-th neighbor, sorted
f,ax = plt.subplots(1,2,figsize = (10,5))
ax[0].plot(k_dist)
ax[0].set_title('k-dist plot for k = minpts = 4')
ax[0].set_xlabel('object index after sorting by k-distance')
ax[0].set_ylabel('k-distance')
ax[1].set_title('original data')
ax[1].scatter(data[:,0],data[:,1],c = lab[0])
In the resulting k-dist plot, the "elbow" theoretically separates noise objects from cluster objects and indicates a plausible range of values for Epsilon (tailored to the dataset in combination with the selected value of MinPts). In this toy example, I would say between 0.05 and 0.075.
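To get a different Epsilon per segment, the same k-dist idea can be applied to each segment separately. The sketch below is only one possible automation and rests on assumptions: the segment names and prices are invented, the k-distances are computed by brute force, and the elbow is picked as the point lying farthest below the straight line joining the first and last sorted k-distance, which is a crude stand-in for reading the plot by eye.

```python
import numpy as np

def k_distances(values, k):
    """Sorted distance from every point to its k-th nearest other point."""
    pts = np.asarray(values, dtype=float).reshape(len(values), -1)
    diff = pts[:, None, :] - pts[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))   # brute-force pairwise distances
    dist.sort(axis=1)                          # column 0 is the point itself
    return np.sort(dist[:, k])

def elbow_value(sorted_vals):
    """k-distance lying farthest below the chord from first to last value."""
    vals = np.asarray(sorted_vals, dtype=float)
    chord = np.linspace(vals[0], vals[-1], len(vals))
    return vals[np.argmax(chord - vals)]

# Hypothetical price lists, one per (beds, baths, neighborhood) segment
segments = {
    "2 beds / 1 bath": [100, 102, 104, 106, 180],
    "4 beds / 3 baths": [400, 420, 440, 460, 900],
}
eps_by_segment = {
    name: elbow_value(k_distances(prices, k=2))
    for name, prices in segments.items()
}
```

Each segment could then be clustered with its own `DBSCAN(eps=eps_by_segment[name], ...)`. Note that the resulting Epsilon is in the units of whatever data was passed in, so it has to be computed on the same (scaled or unscaled) values that DBSCAN will see.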

Fourier Transformation of 2D Matrix in Python

So, I have a matrix with 72x72 values, each corresponding to some energy on a triangular lattice with 72x72 sites. I'm trying to Fourier transform the values, but I'm not understanding how to do that with np.fft.fftn .
To illustrate my problem I have written the following basic code with some random values. The triangular function gives the x, y coordinates of the lattice.
import numpy as np
import matplotlib.pyplot as plt
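def triangular(nsize):
    X = np.zeros((nsize, nsize))
    Y = np.zeros((nsize, nsize))
    for i in range(nsize):
        for j in range(nsize):
            # assumed lattice vectors a1 = (1, 0), a2 = (1/2, sqrt(3)/2)
            X[i, j] = i + j / 2
            Y[i, j] = np.sqrt(3) / 2 * j
    return X, Y
xx = triangular(72)[0]
yy = triangular(72)[1]
plt.pcolormesh(xx, yy, np.reshape(np.random.rand(72**2),(72,72)))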
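My real data is not random, but I didn't want to make the example too complicated. In fact, I see the same plot every time when I use the following FFT:
# Stack x coordinates, y coordinates and energies into one (3, 72, 72) array
matrix = [xx, yy, np.reshape(np.random.rand(72**2),(72,72))]
spectrum_3d = np.fft.fftn(matrix) # Fourier transform along x, y, energy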
kx = np.linspace(-4*np.pi/3,4*np.pi/3,72) #this is the range I want to plot
ky = np.linspace(-2*np.pi/np.sqrt(3),2*np.pi/np.sqrt(3),72)
Ky, Kx = np.meshgrid(ky, kx, indexing='ij') #making a grid
psd = plt.pcolormesh(Kx, Ky, abs(spectrum_3d[2])**2)
cbar = plt.colorbar(psd)
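My result always looks the same and I don't know what went wrong. Even for my correlated values, which are highly symmetric, the plot looks the same.
You can't 'see' the spectrum because of the DC dominance: the zero-frequency component, which is the sum of all values, dwarfs everything else. Subtract the mean before transforming.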
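import numpy as np
import matplotlib.pyplot as p
# %matplotlib inline  (Jupyter only)
n = 72
x = np.arange(n)  # assumed sample grid
y = np.arange(n)
X,Y= np.meshgrid(x,y)
data = np.random.rand(n,n)  # random stand-in for the energies
data_wo_DC= data- np.mean(data)
spectrum = np.fft.fftshift(np.fft.fft2(data))
spectrum_wo_DC = np.fft.fftshift(np.fft.fft2(data_wo_DC))
freqx=np.fft.fftshift(np.fft.fftfreq(n,1)) #fftfreq(n, d=1.0)
freqy=np.fft.fftshift(np.fft.fftfreq(n,1))
fX,fY= np.meshgrid(freqx,freqy)
p.figure(figsize=(20,6))
p.subplot(131)
p.pcolormesh(X,Y, data)
p.subplot(132)
p.pcolormesh(fX,fY,np.abs(spectrum))
p.title('most data is in the DC')
p.subplot(133)
p.pcolormesh(fX,fY,np.abs(spectrum_wo_DC))
p.title('wo DC we can see the structure');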
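To put numbers on the DC dominance, here is a small check without any plotting. It uses random values as a stand-in for the energies (an assumption, as in the question) and compares the zero-frequency magnitude with the largest remaining component after the mean has been removed.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.random((72, 72))          # stand-in for the 72x72 energies

spectrum = np.fft.fftshift(np.fft.fft2(data))
spectrum_wo_dc = np.fft.fftshift(np.fft.fft2(data - data.mean()))

center = (72 // 2, 72 // 2)          # fftshift moves zero frequency here
dc = abs(spectrum[center])           # equals the sum of all samples
residual_dc = abs(spectrum_wo_dc[center])
largest_other = np.abs(spectrum_wo_dc).max()

print(f"DC: {dc:.1f}, DC after mean removal: {residual_dc:.2e}, "
      f"largest other component: {largest_other:.1f}")
```

On a linear color scale, a single bin that is far larger than all others makes the rest of the plot look uniformly dark, which is why the picture seems to be the same every time.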

Cutting Dendrogram/Clustering Tree from SciPy at distance height

I'm trying to learn how to use dendrograms in Python using SciPy . I want to get clusters and be able to visualize them; I heard hierarchical clustering and dendrograms are the best way.
How can I "cut" the tree at a specific distance?
In this example, I just want to cut it at distance 1.6
I looked up a tutorial on https://joernhees.de/blog/2015/08/26/scipy-hierarchical-clustering-and-dendrogram-tutorial/#Inconsistency-Method but the guy did some really confusing wrapper function using **kwargs (he calls his threshold max_d)
Here is my code and plot below; I tried annotating it as best as I could for reproducibility:
from __future__ import print_function
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import dendrogram,linkage,fcluster
from scipy.spatial import distance
np.random.seed(424173239) #43984
n,m = 20,7
#DataFrame: rows = Samples, cols = Attributes
attributes = ["a" + str(j) for j in range(m)]
DF_data = pd.DataFrame(np.random.random((n, m)), columns = attributes)
A_dist = distance.cdist(DF_data.to_numpy().T, DF_data.to_numpy().T) # as_matrix() was removed in pandas 1.0
#(i) . Do the labels stay in place from DF_data for me to do this?
DF_dist = pd.DataFrame(A_dist, index = attributes, columns = attributes)
#Create dendrogram
fig, ax = plt.subplots()
Z = linkage(distance.squareform(DF_dist.to_numpy()), method="average")
D_dendro = dendrogram(Z, labels = attributes, ax=ax) #create dendrogram dictionary
threshold = 1.6 #for hline
ax.axhline(y=threshold, c='k')
#(ii) How can I "cut" the tree by giving it a distance threshold?
#i.e. If I cut at 1.6 it would make (a5 : cluster_1 or not in a cluster), (a2,a3 : cluster_2), (a0,a1 : cluster_3), and (a4,a6 : cluster_4)
#link_1 says use fcluster
#This -> fcluster(Z, t=1.5, criterion='inconsistent', depth=2, R=None, monocrit=None)
#gives me -> array([1, 1, 1, 1, 1, 1, 1], dtype=int32)
print(
    len(set(D_dendro["color_list"])), "^ # of colors from dendrogram",
    len(D_dendro["ivl"]), "^ # of labels",sep="\n")
#^ # of colors from dendrogram it should be 4 since clearly (a6, a4) and a5 are in different clusters
#^ # of labels
link_1 : How to compute cluster assignments from linkage/distance matrices in scipy in Python?
color_threshold is the dendrogram argument I was looking for. It doesn't help much, though, when the color palette is too small for the number of clusters being generated. I moved the next step to "Bigger color-palette in matplotlib for SciPy's dendrogram (Python)" if anyone can help.
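If you also need the cluster assignments themselves and not only the colors, `fcluster` with `criterion="distance"` cuts the tree at a given height. The sketch below uses four made-up one-dimensional points rather than the data from the question, so the labels are for illustration only.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Two tight pairs that are far apart from each other
points = np.array([[0.0], [0.1], [5.0], [5.1]])
Z = linkage(points, method="average")

# Flat clusters: observations stay together only if they merge below t
labels = fcluster(Z, t=1.6, criterion="distance")
n_clusters = len(set(labels.tolist()))
```

Passing the same height as `color_threshold` to `dendrogram` colors the links below the cut, so the plot and the labels from `fcluster` should agree.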
For a bigger color palette this should work:
import numpy as np
from scipy.cluster import hierarchy as hc
import matplotlib.cm as cm
import matplotlib.colors as col
#get a color spectrum "gist_ncar" from matplotlib cm.
#When you have a spectrum it begins with 0 and ends with 1.
#make tinier steps if you need more than 10 colors
colors = cm.gist_ncar(np.arange(0, 1, 0.1))
colorlst=[]# empty list where you will put your colors
for i in range(len(colors)): #convert each color from rgba to hex
    colorlst.append(col.to_hex(colors[i]))
hc.set_link_color_palette(colorlst) #sets the colors to use.
Put all of that in front of your code and it should work.

Interpolation of curve

I have some code where a curve is generated from random values, with a horizontal line running through it. The code is as follows.
import numpy as np
import matplotlib.pylab as pl
data = np.random.uniform(low=-1600, high=-550, size=(288,))
line = [-1290] * 288
pl.figure(figsize = (10,5))
pl.plot(data)
pl.plot(line)
Now I need to find the coordinates of all the points where the curve (data) intersects the line. The curve is made of linear segments that join neighboring points, and there are a lot of points where the curve meets the line. Any help would be appreciated. Thank you!
I like the Shapely answer because Shapely is awesome, but you might not want that dependency. Here's a version of some code I use in signal processing, adapted from this Gist by @endolith. It basically implements kazemakase's suggestion.
import numpy as np
def find_crossings(a, value):
    # Normalize the 'signal' to zero.
    sig = np.asarray(a) - value
    # Find all indices right before any crossing.
    # (np.flatnonzero replaces mlab.find, which was removed from matplotlib.)
    indices = np.flatnonzero((sig[1:] >= 0) & (sig[:-1] < 0) | (sig[1:] < 0) & (sig[:-1] >= 0))
    # Use linear interpolation to find intersample crossings.
    return [i - sig[i] / (sig[i+1] - sig[i]) for i in indices]
This returns the indices (your x values) of where the curve crosses the value (-1290 in your case). You would call it like this:
find_crossings(data, -1290)
Here's what I get for 100 points:
import matplotlib.pyplot as plt
x = find_crossings(data, -1290)
plt.scatter(x, [-1290 for p in x], color='red')
I think the curve, as you interpret it, does in fact follow an equation. In particular, it is made of linear segments that join neighboring points.
Here is what you can do:
find all pairs of neighbors where one lies above and the other below the line
for each pair find the intersection of the horizontal line with the line joining the points
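The two steps above can be sketched with NumPy alone. This is an illustration under the assumption that the x coordinate of each data point is simply its index; `crossing_points` is a made-up name, and a point lying exactly on the line is treated as being above it.

```python
import numpy as np

def crossing_points(values, level):
    """(x, y) points where the polyline through (i, values[i]) meets y = level."""
    sig = np.asarray(values, dtype=float) - level
    below = sig < 0
    # Step 1: neighbouring pairs whose endpoints lie on different sides
    idx = np.flatnonzero(below[:-1] != below[1:])
    # Step 2: linear interpolation along each of those segments
    xs = idx - sig[idx] / (sig[idx + 1] - sig[idx])
    return np.column_stack((xs, np.full(len(xs), float(level))))

# Small check: a triangle 0 -> 2 -> 0 crosses y = 1 halfway up and halfway down
demo = crossing_points([0, 2, 0], 1)
none = crossing_points([5, 6, 7], 1)   # curve stays above the line
```

For the data in the question the call would be `crossing_points(data, -1290)`.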
Here is a solution that uses Shapely:
import numpy as np
import matplotlib.pylab as pl
data = np.random.uniform(low=-1600, high=-550, size=(50,))
line = [-1290] * len(data)
pl.figure(figsize = (10,5))
from shapely import geometry
line = geometry.LineString(np.c_[np.arange(len(data)), data])
hline = geometry.LineString([[-100, -1290], [1000, -1290]])
points = line.intersection(hline)
# With several crossings the result is a MultiPoint; iterate over its parts
x = [p.x for p in points.geoms]
y = [p.y for p in points.geoms]
pl.plot(x, y, "o")
the output:
