I want to apply chi-square distance on a categorical dataset (219 x 55).
As I understand, categorical data must be encoded first before applying the chi-square formula (reference, P.10).
The formula for chi-square distance is as follows:
$$d_{\chi^{2}}(i, j) = \sqrt{\sum_{k} \frac{1}{x_{+k}/x_{++}} \left( \frac{x_{ik}}{x_{i+}} - \frac{x_{jk}}{x_{j+}} \right)^{2}}$$
where the row totals are denoted $x_{i+} = \sum_{k} x_{ik}$, the column totals are $x_{+k} = \sum_{i} x_{ik}$, and $x_{++}$ is the grand total.
I am struggling to understand what sort of output I will get from applying this formula to my dataset. Is it a matrix of distances between rows that is symmetric across the diagonal (similar to the one found in the reference)?
Or is it a matrix with the same dimensions as my dataset, where each value is replaced by a distance?
Finally, is there a method for chi-square distance in python?
I couldn't find a Python package implementing the $\chi^2$ distance, but the TraMineR package in R implements it (via its seqdist function with method = "CHI2"). That function takes a sequence object built from an m x n matrix and returns an m x m distance matrix that is symmetric across the diagonal:
library(TraMineR)
data(biofam)
biofam.seq <- seqdef(biofam[501:600, 10:25])
dim(biofam.seq)
## [1] 100 16
biofam.chi.full <- seqdist(biofam.seq, method = "CHI2",
                           step = max(seqlength(biofam.seq)))
dim(biofam.chi.full)
## [1] 100 100
isSymmetric(biofam.chi.full)
## [1] TRUE
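If you want to stay in Python, here is a minimal NumPy sketch (a hypothetical helper, not an existing library function) implementing the formula above, treating rows as profiles and weighting columns by their masses:

import numpy as np

def chi2_distance_matrix(X):
    """Pairwise chi-square distances between the rows of a non-negative matrix X."""
    X = np.asarray(X, dtype=float)
    row_profiles = X / X.sum(axis=1, keepdims=True)   # rescale each row to sum to 1
    col_masses = X.sum(axis=0) / X.sum()              # column totals as proportions
    # squared differences of row profiles, weighted by 1 / column mass
    diff = row_profiles[:, None, :] - row_profiles[None, :, :]
    return np.sqrt((diff ** 2 / col_masses).sum(axis=-1))

# toy example standing in for an encoded categorical dataset
X = np.random.randint(1, 5, size=(10, 6))
D = chi2_distance_matrix(X)
print(D.shape)                # (10, 10): one distance per pair of rows
print(np.allclose(D, D.T))    # True: symmetric across the diagonal

So the output is an m x m matrix of row-to-row distances, symmetric across the diagonal, not a matrix with the same dimensions as your dataset.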
I want to use DBSCAN with the metric sklearn.metrics.pairwise.cosine_similarity to cluster points that have cosine similarity close to 1 (i.e. whose vectors, measured from the origin, are parallel or almost parallel).
The issue:
eps is the maximum distance between two samples for them to be considered as in the same neighbourhood by DBSCAN - meaning that if the distance between two points is lower than or equal to eps, these points are considered neighbours;
but
sklearn.metrics.pairwise.cosine_similarity spits out values between -1 and 1, and I want DBSCAN to consider two points to be neighbours if the cosine similarity between them is, say, between 0.75 and 1 - i.e. greater than or equal to 0.75.
I see two possible solutions:
Pass a range of values to the eps parameter of DBSCAN, e.g. eps=[0.75, 1];
Pass the value eps=-0.75 to DBSCAN but (somehow) force it to use the negative of the cosine similarity matrix produced by sklearn.metrics.pairwise.cosine_similarity.
I do not know how to implement either of these.
Any guidance would be appreciated!
DBSCAN has a metric keyword argument. Docstring:
metric : string, or callable
The metric to use when calculating distance between instances in a
feature array. If metric is a string or callable, it must be one of
the options allowed by metrics.pairwise.calculate_distance for its
metric parameter.
If metric is "precomputed", X is assumed to be a distance matrix and
must be square. X may be a sparse matrix, in which case only "nonzero"
elements may be considered neighbors for DBSCAN.
So probably the easiest thing to do is to precompute a matrix of cosine similarities, transform it so that it fits your bespoke distance criterion (probably something like D = np.abs(np.abs(CD) - 1), where CD is the matrix returned by cosine_similarity), then set metric to 'precomputed' and pass the precomputed distance matrix D in for X, i.e. the data.
For example:
#!/usr/bin/env python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import DBSCAN
total_samples = 1000
dimensionality = 3
points = np.random.rand(total_samples, dimensionality)
# cosine_similarity returns a matrix of similarities in [-1, 1], not distances
cosine_distance = cosine_similarity(points)
# pick ONE of the two conversions below:
# option 1) vectors are close to each other if they are parallel or anti-parallel
bespoke_distance = np.abs(np.abs(cosine_distance) - 1)
# option 2) vectors are close to each other if they point in the same direction
bespoke_distance = np.abs(cosine_distance - 1)
# with option 2, eps=0.25 corresponds to a cosine similarity of at least 0.75
results = DBSCAN(metric='precomputed', eps=0.25).fit(bespoke_distance)
A) Check out Generalized DBSCAN, which works fine with similarities too. With cosine, sklearn will supposedly be slow anyway.
B) You can trivially use cosine distance = 1 - cosine similarity, but that may well cause the sklearn implementation to run in O(n²).
C) You supposedly can even pass the negated cosine similarity matrix as a precomputed distance matrix and use eps=-0.75.
D) Just make a binary distance matrix (in O(n²) memory, though, so slow), where distance = 0 if the cosine similarity is larger than your threshold, and 1 otherwise. Then use DBSCAN with eps=0.5; it is trivial to show that distance < eps if and only if similarity > threshold (see the sketch below).
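For example, option D) could look like the following sketch (the variable names are illustrative, and the similarity matrix comes from sklearn.metrics.pairwise.cosine_similarity):

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import DBSCAN

points = np.random.rand(100, 3)
threshold = 0.75

sim = cosine_similarity(points)
# binary "distance": 0 where similarity exceeds the threshold, 1 otherwise
binary_distance = np.where(sim > threshold, 0.0, 1.0)

labels = DBSCAN(metric="precomputed", eps=0.5).fit_predict(binary_distance)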
A few options:
1. dist = np.abs(cos_sim - 1) (the accepted answer here)
2. dist = np.arccos(cos_sim) / np.pi (https://math.stackexchange.com/a/3385463/816178)
3. dist = 1 - (sim + 1) / 2 (https://math.stackexchange.com/q/3241174/816178)
I've found that they all work the same in practice for this application (precomputed distances in hierarchical clustering; I've hit this snag too). As I understand it, #2 is the more mathematically correct approach, since it preserves angular distance.
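For instance, option #2 can be plugged into scipy's hierarchical clustering as a precomputed distance (a sketch under the assumption that the similarities come from sklearn's cosine_similarity; the threshold t=0.2 is arbitrary):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform
from sklearn.metrics.pairwise import cosine_similarity

points = np.random.rand(50, 3)
cos_sim = np.clip(cosine_similarity(points), -1.0, 1.0)   # clip floating-point noise

dist = np.arccos(cos_sim) / np.pi      # angular distance in [0, 1]
dist = (dist + dist.T) / 2.0           # enforce exact symmetry for squareform
np.fill_diagonal(dist, 0.0)            # exact zeros on the diagonal

Z = linkage(squareform(dist), method="average")
labels = fcluster(Z, t=0.2, criterion="distance")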
This is the code I found online:
import numpy as np
import pandas as pd

d0 = pd.read_csv('./mnist_train.csv')
labels = d0.label.head(15000)
data = d0.drop('label', axis=1).head(15000)

from sklearn.preprocessing import StandardScaler
standardized_data = StandardScaler().fit_transform(data)

# find the covariance matrix, which is: (A^T * A) / n
sample_data = standardized_data

# matrix multiplication using numpy
covar_matrix = np.matmul(sample_data.T, sample_data) / len(sample_data)
How does multiplying the data matrix by its own transpose, np.matmul(sample_data.T, sample_data), give the covariance matrix? What is the covariance matrix according to this tutorial I found online? The last step is what I don't understand.
This might be a better question for the math or stats stack exchange, but I'll answer here for now.
This comes from the definition of covariance. The Wikipedia page (linked) gives a whole lot of detail, but covariance is defined as (in pseudo-code)
cov = E[dot((x - E[x]), (x - E[x]).T)]
for column vectors, but in your case you probably have row vectors, which is why the first element in your dot-product is transposed, not the second. The E[...] means expected value, which is the mean for Gaussian-distributed data. When you perform StandardScaler().fit_transform(data), you are basically subtracting out the mean of the data, so that's why you don't explicitly do so in your dot product.
Note that StandardScaler() is also dividing by the variance, so it's normalizing everything to unit variance. This is going to affect your covariance! So if you need the actual covariance of the data without normalization, just calculate it with something like np.cov() from the numpy module.
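As a quick sanity check of both points, here is a sketch with synthetic data (the variable names are illustrative) comparing the standardized product against numpy's built-in correlation and covariance routines:

import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 5))

# covariance of the standardized data, computed as (A^T * A) / n
standardized = StandardScaler().fit_transform(data)
covar_standardized = standardized.T @ standardized / len(standardized)

# because of the scaling, this matches the *correlation* matrix of the raw data
print(np.allclose(covar_standardized, np.corrcoef(data, rowvar=False)))  # True

# the actual covariance of the raw (unstandardized) data
covar_raw = np.cov(data, rowvar=False)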
Let's build towards the covariance matrix step by step; first, let's define variance.
The variance of some random variable X is a measure of how much values in the distribution vary on average with respect to the mean.
Now we have to define covariance.
Covariance is a measure of the joint variability of two random variables. It describes how the two variables change together. Read here.
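In symbols, the standard definitions (stated here for reference) are
$$\operatorname{Var}(X) = \mathbb{E}\left[(X - \mathbb{E}[X])^{2}\right], \qquad \operatorname{Cov}(X, Y) = \mathbb{E}\left[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])\right],$$
so the variance is just the covariance of a variable with itself.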
So, armed with that, you can understand that the covariance matrix is a matrix which shows how each feature varies with changes in the other features. For mean-centred data $X$ with $n$ samples it can be calculated as
$$\Sigma = \frac{1}{n} X^{\top} X,$$
and there you can see the equation that you are confused about: it is exactly the np.matmul(sample_data.T, sample_data) / len(sample_data) expression in your snippet. If you have any further queries, comment below.
(Definitions adapted from Wikipedia.)
I was doing clustering with categorical data. I came across the k-modes algorithm and found it to be perfect for my requirements. Now, I want to measure the dissimilarity within a cluster for all the clusters. I am thinking of measuring the dissimilarity within each cluster and reducing it as much as possible. Is there any way to do that?
Alternatively, is there any way to check how efficiently my data has been clustered?
Since my data is categorical, methods that rely on numeric distance metrics might not be helpful.
To measure the dissimilarity within a cluster you need to come up with some kind of a metric. For categorical data, one of the possible ways of calculating dissimilarity could be the following:
d(i, j) = (p - m) / p
where:
p is the number of categorical features (attributes) in your data
m is the number of matches you have between samples i and j
For example, if your data has 3 categorical features and the samples i and j are as follows:

   Feature1  Feature2  Feature3
i     x         y         z
j     x         w         z

So here we have 3 categorical features, so p = 3, and two of these three features have the same values for samples i and j, so m = 2. Therefore
d(i,j) = (3 - 2) / 3
d(i,j) = 0.33
Another alternative is to convert your categorical variables to one-hot-encoded features and then compute the Jaccard similarity.
So, in order to measure the dissimilarity within a cluster you could calculate pairwise dissimilarity between each object in your cluster and then take the average of that.
Based on these measures you may also use the silhouette score for evaluating the quality of your clustering (but take it with a grain of salt: sometimes the score can be good while the clustering might not be what you expected).
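A minimal sketch of both ideas, the (p - m) / p matching dissimilarity and a silhouette score on a precomputed dissimilarity matrix (the toy data and cluster labels below are hypothetical; in practice the labels would come from k-modes):

import numpy as np
from sklearn.metrics import silhouette_score

def matching_dissimilarity_matrix(X):
    """Pairwise (p - m) / p dissimilarity for a 2-D array of categorical values."""
    X = np.asarray(X, dtype=object)
    n, p = X.shape
    D = np.zeros((n, n))
    for i in range(n):
        matches = (X == X[i]).sum(axis=1)     # m for every other sample
        D[i] = (p - matches) / p
    return D

X = np.array([["x", "y", "z"],
              ["x", "w", "z"],
              ["a", "b", "c"],
              ["a", "b", "z"]])
labels = np.array([0, 0, 1, 1])

D = matching_dissimilarity_matrix(X)

# average pairwise dissimilarity within each cluster
for k in np.unique(labels):
    idx = np.where(labels == k)[0]
    block = D[np.ix_(idx, idx)]
    print(k, block[np.triu_indices_from(block, k=1)].mean())

# silhouette score on the precomputed dissimilarity matrix
print(silhouette_score(D, labels, metric="precomputed"))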
By consulting the scikit-learn manual, I found this:
http://scikit-learn.org/stable/modules/generated/sklearn.cross_decomposition.CCA.html
From googling, I found this question about getting the first canonical correlation: How to get the first canonical correlation from sklearn's CCA module?
Does anybody have any idea how to calculate the canonical correlation coefficient with scikit? What about the first order canonical correlation, second order canonical correlation, etc.?
PS: Apparently, CCA hasn't been updated for a while (https://www.mail-archive.com/scikit-learn-general@lists.sourceforge.net/msg06029.html). Does anybody know its status?
Let X, Y be n x d1 and n x d2 matrices, where n is the number of observations. In order to get the different orders of canonical correlation, you need to initialize the CCA object using the following:
import numpy as np
from sklearn.cross_decomposition import CCA

n_components = 3
cca = CCA(n_components)
cca.fit(X, Y)
U, V = cca.transform(X, Y)
U and V are n x n_components (3 in this example) matrices. Each column of U and V is a different order of correlation. In order to find the canonical correlation you need to do:
for i in range(n_components):
    corr = np.corrcoef(U[:, i], V[:, i])[0, 1]
    print(np.round(corr, 4))
I tried this method and it produced the same results as the Canonical Correlation Analysis package in R.
In short:
cca = CCA(n_components=3)
cca.fit(X, Y)
cc_corr = np.corrcoef(cca.x_scores_, cca.y_scores_, rowvar=False).diagonal(offset=cca.n_components)
Details:
This is now answered for correlations of any CC pairs in How to get the first canonical correlation from sklearn's CCA module?.
Vartholomeos Argiris's answer is correct, but the loop is not needed (although it might be faster in some cases; I'm not sure, since the approach here does not use the .transform computation).
But I would like to put a clear answer explaining how to get the CC correlations directly from the fitted CCA instance, and to give some explanations about what is happening.
If you use the same matrices X and Y to fit and to get the correlations of CC, you do not have to transform them! Indeed (following OP's notations) U and V are simply stored in cca.x_scores_ and cca.y_scores_ respectively.
Then we want to get the correlation coefficient between each pair of columns in U and V. The output of np.corrcoef(U, V, rowvar=False) is (with a slight abuse of notation, where I denote by UtU the correlation matrix of U, not a covariance matrix, and not the product of those two matrices either):
| UtU UtV |
| |
| VtU VtV |
The above will be of size 2*n_components x 2*n_components, where:
UtU (and VtV) is the correlation matrix of U (so of each column of U), of size n_components x n_components
UtV is the correlation matrix between columns of U and columns of V, on the diagonal of that block you will find correlations between matched column pairs (1st column of U with 1st column of V, 2nd column with 2nd column, and so on...) also of size n_components x n_components
All in all, you just need to take the diagonal of the correlation matrix between CCA's scores (U and V) with an offset of n_components to pick up the diagonal of the UtV block:
cc_corr = np.corrcoef(cca.x_scores_, cca.y_scores_, rowvar=False).diagonal(offset=cca.n_components)
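A self-contained sketch (random, purely illustrative X and Y) showing that the loop and the diagonal trick agree; transform is used here for portability, and on the training data it returns the same scores that are stored in cca.x_scores_ and cca.y_scores_ (assuming your sklearn version still exposes those attributes):

import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
Y = rng.normal(size=(200, 4))

cca = CCA(n_components=3).fit(X, Y)
U, V = cca.transform(X, Y)

# loop over matched component pairs (the first answer's approach)
loop_corrs = [np.corrcoef(U[:, i], V[:, i])[0, 1] for i in range(cca.n_components)]

# diagonal of the U-vs-V block of the full correlation matrix (this answer's approach)
direct_corrs = np.corrcoef(U, V, rowvar=False).diagonal(offset=cca.n_components)

print(np.allclose(loop_corrs, direct_corrs))  # expected: True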
I'm trying to calculate the Pearson correlation coefficient of two variables. These variables are meant to determine whether there is a relationship between the number of postal codes and a range of distances. So I want to see whether the number of postal codes increases or decreases as the distance range changes.
I'll have one list which will count the number of postal codes within a distance range and the other list will have the actual ranges.
Is it OK to have a list that contains a range of distances? Or would it be better to have a list like [50, 100, 500, 1000], where each element then represents a range up to that amount? So, for example, the list represents up to 50 km, then from 50 km to 100 km, and so on.
Use scipy:
scipy.stats.pearsonr(x, y)
Calculates a Pearson correlation coefficient and the p-value for testing non-correlation.
The Pearson correlation coefficient measures the linear relationship between two datasets. Strictly speaking, Pearson’s correlation requires that each dataset be normally distributed. Like other correlation coefficients, this one varies between -1 and +1 with 0 implying no correlation. Correlations of -1 or +1 imply an exact linear relationship. Positive correlations imply that as x increases, so does y. Negative correlations imply that as x increases, y decreases.
The p-value roughly indicates the probability of an uncorrelated system producing datasets that have a Pearson correlation at least as extreme as the one computed from these datasets. The p-values are not entirely reliable but are probably reasonable for datasets larger than 500 or so.
Parameters:
    x : 1D array
    y : 1D array, the same length as x
Returns:
    (Pearson's correlation coefficient, 2-tailed p-value)
You can also use numpy:
numpy.corrcoef(x, y)
which would give you a correlation matrix that looks like:
[[1 correlation(x, y)]
[correlation(y, x) 1]]
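For the postal-code example, a sketch with hypothetical bin upper bounds and counts (the numbers are made up) showing that the two routines agree:

import numpy as np
from scipy import stats

# hypothetical distance bin upper bounds (km) and postal-code counts per bin
distance_bins = [50, 100, 500, 1000]
postal_code_counts = [120, 85, 40, 12]

r, p_value = stats.pearsonr(distance_bins, postal_code_counts)
print(r, p_value)

# the same coefficient sits off the diagonal of numpy's correlation matrix
print(np.corrcoef(distance_bins, postal_code_counts)[0, 1])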
With pandas (here Top15 is an example DataFrame), try this; note that calling .rank() first gives a rank-based (Spearman-style) correlation rather than a plain Pearson correlation on the raw values:
val = Top15[['Energy Supply per Capita', 'Citable docs per Capita']].rank().corr(method='pearson')
In Python 3.10, a correlation() function was added to the statistics module of the Python standard library; it can be used directly on two equal-length sequences after importing the statistics module:
import statistics
statistics.correlation(words, views)