I have used PCA and Mahalanobis distance to find outliers, but in both cases only the highest or lowest values are detected as outliers. I am looking for a way to flag as an outlier any data point that does not follow the relationship between the output and the 3 inputs.
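For reference, a minimal sketch of the Mahalanobis-distance approach described above (the data, layout, and number of flagged points are illustrative assumptions, not from the question):

import numpy as np

# Illustrative data: 3 inputs and 1 output stacked column-wise
X = np.random.default_rng(0).normal(size=(200, 4))

# Squared Mahalanobis distance of each row from the sample mean
mean = X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
diff = X - mean
d2 = np.einsum('ij,jk,ik->i', diff, cov_inv, diff)

# Flag the most extreme points (the cutoff here is an arbitrary choice)
outliers = np.argsort(d2)[-5:]

As the question notes, this only picks up points that are far from the bulk of the data overall; it does not specifically target points that break the input-output relationship.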
I am working with time-series in which peaks can be observed that are (often) approximately Gaussian shaped. To each of these peaks, I fit a Gaussian curve and want to assess how well this Gaussian curve fits the actual data. For this assessment, I want to determine the coefficient of determination.
On Stack Overflow I've come across many topics on how to implement the coefficient of determination. However, it seems to me that all of the answers describe how to implement the square of the Pearson correlation coefficient r^2, which to my understanding is notably different from the coefficient of determination R^2. As explained in this article, by definition the Pearson correlation coefficient r depends on the distance between the points and the best-fit line, whereas the coefficient of determination R^2 depends on the distance between the points and the 1:1 line. Therefore r^2 is not equal to R^2.
My question is: is the definition in the above article correct, i.e. that R^2 depends on the distance between points and the 1:1 line? If so, are there implementations of R^2 (note: not the square of the Pearson correlation coefficient) in Python?
Right now I've implemented the R^2 using the definition shown below, but I've noticed that this is the square of the Pearson correlation coefficient and not the coefficient of determination.
import numpy as np

def coeffofdetermination(x, y):  # squared Pearson correlation, not R^2 against the 1:1 line
    Rsqrd = np.corrcoef(x, y)[0, 1] ** 2
    return Rsqrd
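For comparison, a minimal sketch of the coefficient of determination computed from residuals against the fitted values (so against the 1:1 line in a measured-vs-fitted plot) rather than as a squared correlation; here y would be the measured data and y_fit the fitted Gaussian evaluated at the same points (names are illustrative):

import numpy as np
from sklearn.metrics import r2_score

def coefficient_of_determination(y, y_fit):
    # R^2 = 1 - SS_res / SS_tot
    ss_res = np.sum((y - y_fit) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot

# sklearn's r2_score(y, y_fit) computes the same quantity.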
Thanks a lot in advance for replying.
Rik
After performing independent component analysis through FastICA, how can I calculate the variance captured by individual components (or all components)?
For PCA it is very straightforward: the variances explained by the components are the eigenvalues of the covariance matrix of X. But for ICA, how should I proceed?
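For the PCA case mentioned above, a minimal sketch of that calculation (the data here is purely illustrative):

import numpy as np

X = np.random.default_rng(0).normal(size=(500, 5))

# Eigenvalues of the covariance matrix give the variance explained by each PC
eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]  # descending order
explained_ratio = eigvals / eigvals.sum()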
I'm looking to identify collinear variables in my input matrix X. I'm able to get some metrics like VIF scores, condition number, condition indices, but unable to get variance decomposition proportions. Can someone please help me on how to compute variance decomposition proportions of correlation matrix in python?
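One common way to obtain variance decomposition proportions is Belsley's collinearity diagnostics, computed from the SVD of the column-scaled design matrix. A minimal sketch under that assumption (not a specific library API):

import numpy as np

def variance_decomposition_proportions(X):
    # Scale each column to unit length (Belsley, Kuh & Welsch convention)
    Xs = X / np.linalg.norm(X, axis=0)
    _, s, Vt = np.linalg.svd(Xs, full_matrices=False)
    # phi[j, k]: contribution of singular value k to the variance of coefficient j
    phi = (Vt.T ** 2) / (s ** 2)
    # Each row (coefficient) is normalized to sum to 1 across singular values
    return phi / phi.sum(axis=1, keepdims=True)

The condition indices come from the same SVD as max(s) / s, so both diagnostics can be read off one decomposition.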
I'm learning about DBSCAN and apparently the most important hyperparameter is eps, from sklearn documentation:
eps float, default=0.5
The maximum distance between two samples for one to be considered as in the neighborhood of the other.
This is not a maximum bound on the distances of points within a cluster.
This is the most important DBSCAN parameter to choose appropriately for your data set and distance function.
I notice that the default of 0.5 does not take into account the range of distances in the data. In other words, if my distances range from 1 to 100, will DBSCAN still behave the same way if I scale those distances up by a factor of 100, or down by a factor of 10? Or is this parameter supposed to be used with normalized distances (max_distance = 1)?
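One way to see the effect is to compare clusterings on the original and rescaled data; a minimal sketch (with the default Euclidean metric, rescaling eps by the same factor as the data should reproduce the same labels, while leaving eps fixed generally will not):

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import DBSCAN

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

labels = DBSCAN(eps=0.5).fit_predict(X)
labels_scaled = DBSCAN(eps=0.5 * 100).fit_predict(X * 100)  # matches labels
labels_fixed_eps = DBSCAN(eps=0.5).fit_predict(X * 100)     # eps now far too small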
I have n binary vectors (of length 1500) from set A and m binary vectors from set B.
I need a metric (some kind of distance metric) that can say how similar the n vectors are to one another, and likewise for the m vectors.
The output should be total_distance_of_n_vectors and total_distance_of_m_vectors.
And if total_distance_of_n_vectors > total_distance_of_m_vectors, that means set B has more similar vectors than set A.
Which metric should I use? I thought of Jaccard similarity, but I am not able to put it in this context. Should I compute the distance of each vector to every other vector to get the total distance, or something else?
There are two concepts relevant to your question, which you should consider separately.
Similarity Measure:
Independent of your scoring mechanism, you should find a similarity measure which suits your data best. It could be a Euclidean distance (not well suited to a 1500-dimensional space), a cosine (dot-product-based) distance, or a Hamming distance (which treats your input features as completely independent, which is rarely the case).
A lot can go on in your distance function, and you should find one which makes sense for your data.
Scoring Mechanism:
You mention total_distance_of_vectors in your question, which probably is not what you want. If n >> m, almost certainly the total sum of distances for n vectors is more than the total distance for m vectors.
What you're looking for is most probably an average of the distances between the members of each set. Then, depending on whether you want your average to be sensitive to outliers or not, you can go for the average of the distances or the average of the squared distances.
If you want to dig deeper, you can also get the mean and variance of the distances within the two sets and compare the distributions.
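As a minimal sketch of this scoring, using pairwise Jaccard (or Hamming) distances from scipy, which suit binary vectors; the metric choice and the array names set_A_vectors / set_B_vectors are illustrative assumptions:

import numpy as np
from scipy.spatial.distance import pdist

def set_similarity_stats(vectors, metric="jaccard"):
    # Pairwise distances between all members of one set
    d = pdist(np.asarray(vectors), metric=metric)
    return d.mean(), d.var()

# set_A_vectors: (n, 1500) 0/1 array, set_B_vectors: (m, 1500) 0/1 array
mean_A, var_A = set_similarity_stats(set_A_vectors)
mean_B, var_B = set_similarity_stats(set_B_vectors)
# A smaller mean distance indicates a more internally similar set.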