Variance of each component after FastICA - python

After performing independent component analysis with FastICA, how can I calculate the variance captured by individual components (or by all components together)?
For PCA it is very straightforward: the variance explained by each component equals the corresponding eigenvalue of the covariance matrix of X. But for ICA, how should I proceed?

Related

How to find outliers in multivariate data with weak covariance

I have used PCA and Mahalanobis distance to find outliers. But in both cases, only the highest or lowest values are detected as outliers. I am looking for a way that any data point that does not follow a certain correlation between output and 3 inputs can be identified as an outlier.

How to compute variance decomposition proportions of correlation matrix in python?

I'm looking to identify collinear variables in my input matrix X. I'm able to get some metrics like VIF scores, condition number, condition indices, but unable to get variance decomposition proportions. Can someone please help me on how to compute variance decomposition proportions of correlation matrix in python?

Does sklearn DBSCAN suppose distances are normalized

I'm learning about DBSCAN and apparently the most important hyperparameter is eps, from sklearn documentation:
eps float, default=0.5
The maximum distance between two samples for one to be considered as in the neighborhood of the other.
This is not a maximum bound on the distances of points within a cluster.
This is the most important DBSCAN parameter to choose appropriately for your data set and distance function.
I notice that the default of 0.5 doesn't take the range of the distances in my data into account. In other words, if my distances range from 1 to 100, will it still work the same way if I scale those distances up by a factor of 100? Or down by a factor of 10? Or is this parameter meant to be used with normalized distances (max_distance = 1)?

covariance matrix using np.matmul(data.T, data)

This is the code I've found online
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
d0 = pd.read_csv('./mnist_train.csv')
labels = d0['label'].head(15000)
# drop the label column (columns=..., since drop() removes rows by default)
data = d0.drop(columns='label').head(15000)
standardized_data = StandardScaler().fit_transform(data)
# find the covariance matrix, which is (A^T * A) / n
sample_data = standardized_data
# matrix multiplication using numpy
covar_matrix = np.matmul(sample_data.T, sample_data) / len(sample_data)
How does multiplying the data by itself, np.matmul(sample_data.T, sample_data), give the covariance matrix? What is the covariance matrix according to this tutorial I found online? The last step is what I don't understand.
This might be a better question for the math or stats stack exchange, but I'll answer here for now.
This comes from the definition of covariance. The Wikipedia page (linked) gives a whole lot of detail, but covariance is defined as (in pseudo-code)
cov = E[dot((x - E[x]), (x - E[x]).T)]
for column vectors, but in your case the samples are stored as rows, which is why the first factor in your product is transposed, not the second. E[...] denotes the expected value, which in practice is estimated by the sample mean (hence the division by len(sample_data)). When you perform StandardScaler().fit_transform(data), you subtract the mean of each column, so that's why you don't explicitly do so in your product.
Note that StandardScaler() also divides each column by its standard deviation, so every feature is scaled to unit variance. This is going to affect your covariance: what you compute on the standardized data is really the correlation matrix of the original data. So if you need the actual covariance of the data without normalization, just calculate it with something like np.cov() from the numpy module (with rowvar=False when the samples are in rows).
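As a quick numerical check of the explanation above, here is a minimal sketch on synthetic data. The toy matrix X and the names Z, matches_cov and matches_corr are made up for illustration, and the standardization is written out in plain NumPy under the assumption that it mirrors what StandardScaler does (subtract the column mean, divide by the population standard deviation).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (hypothetical): 500 samples in rows, 3 correlated features in columns.
mixing = np.array([[2.0, 0.5, 0.0],
                   [0.0, 1.0, 0.3],
                   [0.0, 0.0, 0.7]])
X = rng.normal(size=(500, 3)) @ mixing

# Standardize each column: subtract its mean, divide by its population std (ddof=0).
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# The formula from the question: (A^T * A) / n on the standardized data.
covar_matrix = np.matmul(Z.T, Z) / len(Z)

# It equals the (biased, divide-by-n) covariance of the standardized data ...
matches_cov = np.allclose(covar_matrix, np.cov(Z, rowvar=False, bias=True))
# ... which is the correlation matrix of the original data.
matches_corr = np.allclose(covar_matrix, np.corrcoef(X, rowvar=False))

# Covariance of the raw, unnormalized data, for comparison.
raw_cov = np.cov(X, rowvar=False)

print(matches_cov, matches_corr)
print(raw_cov)
```

The diagonal of covar_matrix is all ones, because every standardized feature has unit variance, while raw_cov keeps the original scales.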
Let's build towards the covariance matrix step by step. First, let's define variance.
The variance of a random variable X measures how far, on average, its values deviate from the mean (it is the expected squared deviation from the mean).
Now we have to define covariance.
Covariance is a measure of the joint variability of two random variables. It describes how the two variables change together. Read here.
So now, armed with that, you can understand that the covariance matrix is a matrix which shows how each feature varies with changes in the other features. Which can be calculated as
and there you can see the equation that you are confused about formed at the bottom. If you have any further queries, comment down.
Image Source: Wikipedia.
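In symbols, the definitions above can be written as follows. This is a sketch under the assumption that A is the n-by-d data matrix from the question with samples in rows and each column already centered (which StandardScaler takes care of); the symbols a_{ij} and \Sigma are introduced here only for illustration.

```latex
\begin{align}
\operatorname{Var}(X) &= E\big[(X - E[X])^2\big] \\
\operatorname{Cov}(X, Y) &= E\big[(X - E[X])\,(Y - E[Y])\big] \\
\Sigma_{jk} &= \frac{1}{n} \sum_{i=1}^{n} a_{ij}\, a_{ik}
  \quad\Longleftrightarrow\quad
  \Sigma = \frac{1}{n} A^{\mathsf{T}} A
\end{align}
```

The last line is the (A^T * A)/n expression from the question: entry (j, k) is the average product of centered features j and k, and the diagonal entries are the variances.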

Generating random value for given cdf

Based on a sample of values of a random variable, I create a cumulative density function using kernel density estimation.
cdf = gaussian_kde(sample)
What I need is to generate sample values of a random variable whose density function is equal to the constructed cdf. I know about the approach of inverting the probability distribution function, but since I cannot do it analytically, it requires fairly complicated preparations. Is there a built-in solution, or maybe another way to accomplish the task?
If you're using a kernel density estimator (KDE) with Gaussian kernels, your density estimate is a Gaussian mixture model. This means that the density function is a weighted sum of 'mixture components', where each mixture component is a Gaussian distribution. In a typical KDE, there's a mixture component centered over each data point, and each component is a copy of the kernel. This distribution is easy to sample from without using the inverse CDF method. The procedure looks like this:
1. Setup
   - Let mu be a vector where mu[i] is the mean of mixture component i. In a KDE, this will just be the locations of the original data points.
   - Let sigma be a vector where sigma[i] is the standard deviation of mixture component i. In typical KDEs, this will be the kernel bandwidth, which is shared for all points (but variable-bandwidth variants do exist).
   - Let w be a vector where w[i] contains the weight of mixture component i. The weights must be positive and sum to 1. In a typical, unweighted KDE, all weights will be 1/(number of data points) (but weighted variants do exist).
2. Choose the number of random points to sample, n_total.
3. Determine how many points will be drawn from each mixture component.
   - Let n be a vector where n[i] contains the number of points to sample from mixture component i.
   - Draw n from a multinomial distribution with "number of trials" equal to n_total and "success probabilities" equal to w. This means the number of points to draw from each mixture component will be randomly chosen, proportional to the component weights.
4. Draw random values.
   - For each mixture component i, draw n[i] values from a normal distribution with mean mu[i] and standard deviation sigma[i].
5. Shuffle the list of random values, so they have random order.
This procedure is relatively straightforward because random number generators (RNGs) for multinomial and normal distributions are widely available. If your kernels aren't Gaussian but some other probability distribution, you can replicate this strategy, replacing the normal RNG in step 4 with an RNG for that distribution (if it's available). You can also use this procedure to sample from mixture models in general, not just KDEs.
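The procedure above can be sketched in NumPy roughly as follows. This is a minimal illustration, not a reference implementation: the function name sample_from_gaussian_kde, its parameters, and the fixed bandwidth of 0.5 are assumptions made for the example, and it covers only the one-dimensional, unweighted, shared-bandwidth case.

```python
import numpy as np

def sample_from_gaussian_kde(data, bandwidth, n_total, rng=None):
    """Draw n_total values from a 1-D Gaussian KDE built on `data`.

    Hypothetical helper: `bandwidth` is the kernel standard deviation,
    shared by all mixture components.
    """
    rng = np.random.default_rng() if rng is None else rng

    # Step 1: one mixture component per data point.
    mu = np.asarray(data, dtype=float)      # component means
    sigma = np.full(mu.shape, bandwidth)    # component standard deviations
    w = np.full(mu.shape, 1.0 / mu.size)    # equal weights summing to 1

    # Step 3: number of draws per component (step 2 is the n_total argument).
    n = rng.multinomial(n_total, w)

    # Step 4: draw n[i] normal values around mu[i] with spread sigma[i].
    draws = rng.normal(np.repeat(mu, n), np.repeat(sigma, n))

    # Step 5: shuffle so the values are not grouped by component.
    rng.shuffle(draws)
    return draws

# Usage on synthetic data (made up for the example).
rng = np.random.default_rng(42)
sample = rng.normal(loc=5.0, scale=2.0, size=200)
new_values = sample_from_gaussian_kde(sample, bandwidth=0.5, n_total=10_000, rng=rng)
print(new_values.shape, new_values.mean(), sample.mean())
```

If the estimate was built with scipy.stats.gaussian_kde as in the question, that object also has a resample method that draws from the fitted estimate directly, which may be all that is needed.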
