Scale and computer cosine similarity of co-occurrence matrix - python

I have a co-occurrence symmetric matrix (1877 x 1877).
I treat columns as features and compute the cosine distance between them. Before that, I scale the matrix (center to the mean and component wise scale to unit variance).
from sklearn import preprocessing
from sklearn.metrics import pairwise_distances
X_scaled = preprocessing.scale(mymatrix)
dist = pairwise_distances(X_scaled,metric="cosine")
My questions:
Should I scale the co-occurrence data before computing the cosine
distance/sim? The figure below shows the histograms of the actual matrix. The x-axis represents co-occurrence values in the matrix, and y-axis indicates the number of times they appear in the matrix.
The code above returns distance > 1 and distance < 0. How can I ensure that the cosine distance values between 0 and 1? Should I apply min max scaler over the dist matrix?

Related

Chi-square value from xarray polyfit

How can we derive the chi square values of the polynomial fit using xarray.Dataset.polyfit.
From what I can see in he documentation, it returns residuals and covariance matrix.
for example, I can calculate the 2nd order polynomial fit across dim1 of my dataarray
fit_param = ds.polyfit(dim='dim1', deg=2, cov=True)

Discretize normal distribution to get prob of a random variable

Suppose I draw randomly from a normal distribution with mean zero and standard deviation represented by a vector of, say, dimension 3 with
scale_rng=np.array([1,2,3])
eps=np.random.normal(0,scale_rng)
I need to compute a weighted average based on some simulations for which I draw the above mentioned eps. The weights of this average are "the probability of eps" (hence I will have a vector with 3 weights). For weighted average I simply mean an arithmetic sum wehere each component is multiplied by a weight, i.e. a number between 0 and 1 and where all the weights should sum up to one.
Such weighted average shall be calculated as follows: I have a time series of observations for one variable, x. I calculate an expanding rolling standard deviation of x (say this is the values in scale). Then, I extract a random variable eps from a normal distribution as explained above for each time-observation in x and I add it to it, say obtaining y=x+eps. Finally, I need to compute the weighted average of y where each value of y is weighted by the "probability of drawing each value of eps from a normal distribution with mean zero and standard deviation equal to scale.
Now, I know that I cannot think of this being the points on the pdf corresponding to the values randomly drawn because a normal random variable is continuous and as such the pdf at a certain point is zero. Hence, the only solution I Found out is to discretize a normal distribution with a certain number of bins and then find the probability that a value extracted with the code of above is actually drawn. How could I do this in Python?
EDIT: the solution I found is to use
norm.cdf(eps_it+0.5, loc=0, scale=scale_rng)-norm.cdf(eps_it-0.5, loc=0, scale=scale_rng)
which is not really based on the discretization but at least it seems feasible to me "probability-wise".
here's an example leaving everything continuous.
import numpy as np
from scipy import stats
# some function we want a monte carlo estimate of
def fn(eps):
return np.sum(np.abs(eps), axis=1)
# define distribution of eps
sd = np.array([1,2,3])
d_eps = stats.norm(0, sd)
# draw uniform samples so we don't double apply the normal density
eps = np.random.uniform(-6*sd, 6*sd, size=(10000, 3))
# calculate weights (working with log-likelihood is better for numerical stability)
w = np.prod(d_eps.pdf(eps), axis=1)
# normalise so weights sum to 1
w /= np.sum(w)
# get estimate
np.sum(fn(eps) * w)
which gives me 4.71, 4.74, 4.70 4.78 if I run it a few times. we can verify this is correct by just using a mean when eps is drawn from a normal directly:
np.mean(fn(d_eps.rvs(size=(10000, 3))))
which gives me essentially the same values, but with expected lower variance. e.g. 4.79, 4.76, 4.77, 4.82, 4.80.

Covariance matrix for circular variables?

In my current project, I have a collection of three-dimensional samples such as [-0.5,-0.1,0.2]*pi, [0.8,-0.1,-0.4]*pi. These variables are circular/periodic, with their values ranging from -pi to +pi. It is my goal to calculate a 3-by-3 covariance matrix for these circular variables.
Python has an in-built function to calculate circular standard deviations, which I can use to calculate the standard deviations along each dimension, then use them to create a diagonal covariance matrix (i.e., without any correlation). Ideally, however, I would like to consider correlations between the parameters as well. Is there a way to calculate correlations between circular variables, or to directly compute the covariance matrix between them?
import numpy as np
import scipy.stats
# A collection of N circular samples
samples = np.asarray(
[[0.384917, 1.28862, -2.034],
[0.384917, 1.28862, -2.034],
[0.759245, 1.16033, -2.57942],
[0.45797, 1.31103, 2.9846],
[0.898047, 1.20955, -3.02987],
[1.25694, 1.74957, 2.46946],
[1.02173, 1.26477, 1.83757],
[1.22435, 1.62939, 1.99264]])
# Calculate the circular standard deviations
stds = scipy.stats.circstd(samples, high = np.pi, low = -np.pi, axis = 0)
# Create a diagonal covariance matrix
cov = np.identity(3)
np.fill_diagonal(cov,stds**2)

Does sklearn DBSCAN suppose distances are normalized

I'm learning about DBSCAN and apparently the most important hyperparameter is eps, from sklearn documentation:
eps float, default=0.5
The maximum distance between two samples for one to be considered as in the neighborhood of the other.
This is not a maximum bound on the distances of points within a cluster.
This is the most important DBSCAN parameter to choose appropriately for your data set and distance function.
I notice that the number 0.5 doesn't take in fact the range of the distances of our data, in other words, if I use distances from 1 to 100 will it still work the same way if I scale up those distances by a factor of x100? Or scale down by x10? Or this parameter is supposed to be used in normalized distances (max_distance = 1)?

How to define a range of values for the eps parameter of sklearn.cluster.DBSCAN?

I want to use DBSCAN with the metric sklearn.metrics.pairwise.cosine_similarity to cluster points that have cosine similarity close to 1 (i.e. whose vectors (from "the" origin) are parallel or almost parallel).
The issue:
eps is the maximum distance between two samples for them to be considered as in the same neighbourhood by DBSCAN - meaning that if the distance between two points is lower than or equal to eps, these points are considered neighbours;
but
sklearn.metrics.pairwise.cosine_similarity spits out values between -1 and 1 and I want DBSCAN to consider two points to be neighbours if the distance between them is, say, between 0.75 and 1 - i.e. greater than or equal to 0.75.
I see two possible solutions:
pass a range of values to the eps parameter of DBSCAN e.g. eps=[0.75,1]
Pass the value eps=-0.75 to DBSCAN but (somehow) force it to use the negative of the cosine similarities matrix that is spit out by sklearn.metrics.pairwise.cosine_similarity
I do not know how to implement either of these.
Any guidance would be appreciated!
DBSCAN has a metric keyword argument. Docstring:
metric : string, or callable
The metric to use when calculating distance between instances in a
feature array. If metric is a string or callable, it must be one of
the options allowed by metrics.pairwise.calculate_distance for its
metric parameter.
If metric is "precomputed", X is assumed to be a distance matrix and
must be square. X may be a sparse matrix, in which case only "nonzero"
elements may be considered neighbors for DBSCAN.
So probably the easiest thing to do is to precompute a distance matrix using cosine similarity as your distance metric, preprocess the distance matrix such that it fits your bespoke distance criterion (probably something like D = np.abs(np.abs(CD) -1), where CD is your cosine distance matrix), and then set metric to precomputed, and pass the precomputed distance matrix D in for X, i.e. the data.
For example:
#!/usr/bin/env python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import DBSCAN
total_samples = 1000
dimensionality = 3
points = np.random.rand(total_samples, dimensionality)
cosine_distance = cosine_similarity(points)
# option 1) vectors are close to each other if they are parallel
bespoke_distance = np.abs(np.abs(cosine_distance) -1)
# option 2) vectors are close to each other if they point in the same direction
bespoke_distance = np.abs(cosine_distance - 1)
results = DBSCAN(metric='precomputed', eps=0.25).fit(bespoke_distance)
A) check out Generalized DBSCAN which works fine with similarities too. With cosine, sklearn will supposedly be slow anyway.
B) you can trivially use: cosine distance = 1 - cosine similarity. But that may well cause the sklearn implementation to run in O(n²).
C) you supposedly can even pass -cosinesimilarity as precomputed distance matrix and use -0.75 as eps.
d) just make a binary distance matrix (in O(n²) memory, though, so slow), where distance = 0 of the cosine similarity is larger than your threshold, and 0 otherwise. Then use DBSCAN with eps=0.5. it is trivial to show that distance < eps if and only if similarity > threshold.
A few options:
dist = np.abs(cos_sim - 1) accepted answer here
dist = np.arccos(cos_sim) / np.pi https://math.stackexchange.com/a/3385463/816178
dist = 1 - (sim + 1) / 2 https://math.stackexchange.com/q/3241174/816178
I've found they all work the same in practice for this application (precomputed distances in hierarchical clustering; I've hit the snag too). As I understand #2 is the more mathematically-correct approach; preserving angular distance.

Categories