How to calculate the difference between sparse histograms in Python

I am using a 3D histogram to compare superpixel regions in two images taken from a video sequence. Based on a histogram threshold, I want to classify regions as similar or dissimilar.
I was using the chi-square distance to compare the two histograms, but I have read that the chi-square distance should only be used for dense histograms.
My histogram is sparse, with many bins containing zero entries.
Can you suggest the best way to compare these histograms in Python?
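One option that copes well with empty bins is the Bhattacharyya distance, which only accumulates contributions from bins that are non-zero in both histograms. A minimal NumPy sketch; the normalisation scheme and the flattening of the 3D histogram are assumptions, not part of the question:
import numpy as np

def bhattacharyya_distance(h1, h2):
    # normalise each histogram so it sums to 1
    p = h1.ravel() / h1.sum()
    q = h2.ravel() / h2.sum()
    # Bhattacharyya coefficient: bins that are empty in either histogram
    # contribute exactly zero, so sparse histograms are handled gracefully
    bc = np.sum(np.sqrt(p * q))
    return np.sqrt(max(0.0, 1.0 - bc))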

Related

Clustering on evenly spaced grid points

I have a 50 by 50 grid of evenly spaced (x, y) points. Each of these points has a third scalar value. This can be visualized using a contour plot, which I have added. I am interested in the regions indicated by the red circles. These regions of low "Z-values" are what I want to extract from this data.
2D contour plot of 50 x 50 evenly spaced grid points:
I want to do this by using clustering (machine learning), which can be lightning quick when applied correctly. The problem, however, is that the points are evenly spaced, and therefore the density of the entire dataset is the same everywhere.
I have tried using a DBSCAN algorithm with a custom distance metric which takes the Z value of each point into account. I have defined the distance between two points as follows:
import numpy as np

def custom_distance(point1, point2):
    # average of the two scalar (Z) values
    average_Z = (point1[2] + point2[2]) / 2
    # Euclidean distance in the x-y plane
    distance = np.sqrt(np.square(point1[0] - point2[0]) + np.square(point1[1] - point2[1]))
    # weight the planar distance by the average Z value
    distance = distance * average_Z
    return distance
This essentially computes the Euclidean distance between two points in the x-y plane and multiplies it by the average of the Z values of the two points. In the picture below I have tested this distance function in a DBSCAN algorithm. Each point in this 50 by 50 grid has a Z value of 1, except for four clusters that I have placed randomly; those points each have a Z value of 10. The algorithm is able to find the clusters in the data based on their Z value, as can be seen below.
DBSCAN clustering result using scalar value distance determination:
Encouraged by these results, I tried to apply the method to my actual data, only to be disappointed by the outcome. Since the x and y values of my data are very large, I have simply rescaled them to the range 0 to 49. The Z values I have left untouched. The results of the clustering can be seen in the image below:
Clustering result on original data:
This does not come close to what I want and what I was expecting. For some reason the clusters that are found are rectangular in shape, and the light regions of low Z values that I am interested in are not extracted with this approach.
Is there any way I can make the DBSCAN algorithm work in this way? I suspect that the reason it is currently not working has something to do with the differences in scale between the x, y and Z values. I am also open to tips or recommendations on other approaches for defining and finding the lighter regions in the data.
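Since the question itself points at a scale mismatch, one hedged direction is to bring Z onto a range comparable to x and y before running DBSCAN with the custom metric. A minimal sketch; the input file, eps and min_samples values are placeholders, not taken from the question:
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import MinMaxScaler

# hypothetical input: one row per grid point, columns x, y, Z
points = np.loadtxt("grid_points.txt")

# rescale Z to [0, 49] so it is on the same footing as the rescaled x and y
points[:, 2] = MinMaxScaler((0, 49)).fit_transform(points[:, 2:3]).ravel()

def custom_distance(p1, p2):
    planar = np.hypot(p1[0] - p2[0], p1[1] - p2[1])
    return planar * (p1[2] + p2[2]) / 2  # weight by the average Z value

# a callable metric forces pairwise evaluation, so this is slow but flexible
labels = DBSCAN(eps=2.0, min_samples=5, metric=custom_distance).fit_predict(points)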

How to check if a vector's histogram correlates with a uniform distribution?

I have a vector of floats V with values from 0 to 1. I want to create a histogram with some bin width, say A == 0.01, and check how close the resulting histogram is to a uniform distribution, getting a single value from zero to one, where 0 means correlating perfectly and 1 means not correlating at all. For me, correlation here first of all means histogram shape.
How would one do such a thing in Python with NumPy?
You can create the histogram with np.histogram. Then you can generate the uniform reference histogram from the average of the previously retrieved histogram with np.mean. Finally, you can compare the two with a statistical test such as the Pearson coefficient, via scipy.stats.pearsonr.
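A small caveat with the Pearson route: a perfectly uniform reference histogram is a constant vector, against which the correlation coefficient is undefined. As a hedged alternative that still produces the requested 0-to-1 shape score, here is a sketch using the total variation distance between the observed histogram and the flat reference (the random input vector is just a placeholder):
import numpy as np

V = np.random.rand(10000)  # placeholder for the real vector of floats in [0, 1]
A = 0.01                   # bin width from the question
hist, _ = np.histogram(V, bins=int(1 / A), range=(0.0, 1.0))

p = hist / hist.sum()               # observed histogram as probabilities
q = np.full_like(p, 1.0 / p.size)   # flat (uniform) reference histogram

# total variation distance: 0 for a perfectly flat histogram,
# approaching 1 as all the mass concentrates in a single bin
score = 0.5 * np.abs(p - q).sum()
print(score)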

Visualize documents embeddings and clustering

I have the following dataframe:
print(df)
document embeddings
1 [-1.1132643 , 0.793635 , 0.8664889]
2 [-1.1132643 , 0.793635 , 0.8664889]
3 [-0.19276126, -0.48233205, 0.17549737]
4 [0.2080252 , 0.01567003, 0.0717131]
I want to cluster and visualize them to see the similarities between the documents. What are the best methods/steps to do this?
This is just a small dataframe; the original dataframe has more than 20k documents.
Document vectors in your case reside in a 768-dimensional Euclidean space, meaning that in a 768-dimensional coordinate space, each point represents a document. Assuming these have been trained correctly, it is safe to imagine that contextually similar documents should be closer to each other in this space than dissimilar ones. This may allow you to apply a clustering method to group similar documents together.
For clustering, you can use multiple clustering techniques, such as -
K-means (clusters based on Euclidean distances)
DBSCAN (clusters based on a notion of density)
Gaussian mixtures (clusters based on a mixture of k Gaussians)
You can use the silhouette score to find the number of clusters for which the clustering algorithm best separates the data (a sketch of this step follows below).
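As a sketch of that step, assuming the embeddings sit in a (20k, 768) NumPy array called embeddings, you could scan a range of cluster counts and keep the one with the highest silhouette score; the candidate range and K-means are assumptions for illustration:
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

embeddings = np.random.rand(20000, 768)  # placeholder for the real document vectors

scores = {}
for k in range(2, 11):                   # candidate numbers of clusters (assumption)
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(embeddings)
    scores[k] = silhouette_score(embeddings, labels)

best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])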
For visualization, you can ONLY visualize in 2D or 3D space. This means you will have to use a dimensionality reduction method to reduce the 768 dimensions down to 3 or 2.
This can be achieved with the following algorithms, set to 2 or 3 components -
PCA
T-SNE
LDA (requires labels)
Once you have clustered the data AND reduced its dimensionality separately, you can use matplotlib to plot each of the points in 2D/3D space and color each point based on its cluster label (0 to k-1) to visualize the documents and their clusters.
# process flow
(20k,768) -> K-clusters (20k,1) ------|
                                      |--- Visualize (3 axes, k colors)
(20k,768) -> Dim reduction (20k,3)----|
Here is an example of the goal you are trying to achieve -
Here you see the first 2 components of the data from T-SNE, and each color represents one of the clusters you have created with your clustering method of choice (with the number of clusters decided using the silhouette score).
EDIT: You can also apply dimensionality reduction first, to project your 768-dimensional data into a 3D or 2D space, and THEN cluster with a clustering method. This reduces the amount of computation you have to handle, since you are now clustering on only 3 dimensions instead of 768, but at the cost of information that might have helped you discriminate clusters better.
# process flow
                                     |------------------------|
(20k,768) -> Dim reduction (20k,3)---|                        |--- Visualize
                                     |--- K-Clusters (20k,1)--|
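A minimal sketch of the first process flow (cluster on the full 768 dimensions, reduce separately for plotting); the array name, the choice of K-means and T-SNE, and k = 8 are assumptions for illustration:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

embeddings = np.random.rand(20000, 768)   # placeholder for the real (20k, 768) matrix

# cluster in the original 768-dimensional space
labels = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(embeddings)

# reduce to 2 components purely for visualization
coords = TSNE(n_components=2, random_state=0).fit_transform(embeddings)

plt.scatter(coords[:, 0], coords[:, 1], c=labels, s=3, cmap="tab10")
plt.title("Document embeddings, colored by cluster")
plt.show()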

Interpolation of a huge 2D array in python

I've just plotted the following colormap from a 35x800 numpy array:
As you can see, the map appears crenellated: this is because some cells contain probability = 0 (artefacts produced by the model simulation method). I need to interpolate the data to (i) make a smooth and elegant colormap and (ii) obtain the full matrix for follow-up computations. However, I don't know how to proceed or which interpolation method I should use. Any ideas?
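One hedged approach is to treat the zero cells as missing values and fill them from the surrounding cells with scipy.interpolate.griddata. A minimal sketch; the file name and the choice of linear interpolation are assumptions:
import numpy as np
from scipy.interpolate import griddata

Z = np.load("probabilities.npy")   # hypothetical 35x800 array
rows, cols = np.mgrid[0:Z.shape[0], 0:Z.shape[1]]

valid = Z > 0                      # keep only the trusted (non-artefact) cells
filled = griddata(
    (rows[valid], cols[valid]),    # coordinates of the trusted cells
    Z[valid],                      # their values
    (rows, cols),                  # interpolate back onto the full grid
    method="linear",               # "cubic" would give a smoother map
)
Cells outside the convex hull of the valid points come back as NaN with the linear method, so they may need a nearest-neighbour pass or an explicit fill_value.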

Two dimensional least squares fitting

I have a two dimensional data set, of some fixed dimensions (xLen and yLen), which contains a sine curve.
I've already determined the frequency of the sine curve, and I've generated my own sine data with the frequency using
SineData = math.sin((2*math.pi*freqX)/xLen + (2*math.pi*freqY)/yLen)
where freqX and freqY are the oscillation frequencies of the curve in the X and Y directions.
But now I'd like to do a linear least-squares fit (or something similar), so that I can fit the right amplitude. As far as I know, a linear least-squares fit is the right way to go, but if there's another way that's fine as well.
The leastsq function in SciPy doesn't do a multidimensional fit. Is there a Python implementation of a 2-dimensional (or multidimensional) least-squares fitting algorithm?
Edit: I found the two-dimensional frequency of the sine wave from a 2D FFT. The data contains a 2D sine plus noise, so I only picked the largest peak of the 2D FFT and took its inverse transform. Now I have a sine curve, but with an amplitude that's off. Is there a way to do a two-dimensional least-squares fit (or similar) and fit the amplitude?
You might also consider a 2D Finite/Discrete Fourier Transform (FFT/DFT) if your data is well served by using trig functions.
NumPy has a DFT solution built in (np.fft).
There are lots of places to help you get started; Google found this one.
Start with your original data. The transform will tell you if your frequency solution is correct and if there are other frequencies that are also significant.
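As a sketch of that check, assuming the data sits in a 2D array, the dominant frequency pair can be read off the magnitude of the 2D DFT (the file name is a placeholder):
import numpy as np

data = np.load("sine_data.npy")          # hypothetical 2D array of shape (yLen, xLen)
spectrum = np.abs(np.fft.fft2(data))
spectrum[0, 0] = 0.0                     # ignore the DC (mean) component

# index of the strongest peak gives the dominant (freqY, freqX) pair;
# any other tall peaks indicate additional significant frequencies
freqY, freqX = np.unravel_index(np.argmax(spectrum), spectrum.shape)
print(freqX, freqY)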
In least-squares fitting, one minimizes a residual function, for example a chi-square. Since this involves summing estimates of the squared difference between model and data at each point, the number of dimensions is "forgotten" when building the residual. Thus all the values in the 2D difference array can be copied into a 1D array and returned by the residual function supplied to, for example, leastsq. An example for complex-to-real rather than 2D-to-1D is given in my answer to this question: Least Squares Minimization Complex Numbers
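A minimal sketch of that idea, fitting only the amplitude of the 2D sine; the file name and the placeholder frequencies stand in for the values you obtained from the 2D FFT:
import numpy as np
from scipy.optimize import leastsq

data = np.load("sine_data.npy")            # hypothetical 2D array of shape (yLen, xLen)
yLen, xLen = data.shape
y, x = np.mgrid[0:yLen, 0:xLen]
freqX, freqY = 3, 5                        # placeholder frequencies from the 2D FFT

def residual(params):
    amplitude = params[0]
    model = amplitude * np.sin(2 * np.pi * freqX * x / xLen + 2 * np.pi * freqY * y / yLen)
    # flatten the 2D difference array: leastsq only needs a 1D residual vector
    return (model - data).ravel()

best, _ = leastsq(residual, x0=[1.0])
print(best[0])                             # fitted amplitude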
