Setting precision while scaling vectors using preprocessing in scikit learn - python

I have to calculate the Euclidean distance between two vectors, and I have to scale them before computing the distance.
import numpy as np
sample_A= np.array([1,1,1,0,0,1,0,0,1,1,0,0,0,0,0,0.008624,-0.002894,0.006471,0.000961,0.007407,-0.004442,-0.00966,-0.003026,0.010202,0.008907,-0.003031,-0.002724,0.002302,0.002171,-0.011219,0.006802,0.004588,0.030068,0.016608,0.021235,0.015706,0.102711,0.053489,0.006902,-0.010042,0.002647,0.036403,-0.010567,0.040207,0.065626,-0.010786,-0.010131,0.080007,-0.046524,-0.08577,0.120587,0.159285,0.058588,0.112184,0.011561])
sample_B = np.array([18,1,1,0,0,1,0,0,1,0,1,0,0,0,0,1.921413,-1.350259,-0.549294,-0.829648,-0.271365,-2.267258,-0.043207,-0.127863,0.46472,0.106202,-0.363018,-0.863932,-1.041068,0.944935,-0.269358,-0.705195,-0.505604,-0.721329,0.603105,-0.619679,-0.461518,0.595048,-0.097054,-1.602379,-0.373747,-0.253988,-0.476779,1.108103,1.428308,1.12896,1.296803,-0.086155,-0.555077,0.347556,0.202161,0.289031,0.676664,-0.318146,0.193779,0.841483])
The expected distance between these two points, as per the requirement, is 7.296226771.
from scipy.spatial import distance
from sklearn import preprocessing

A_scaled = preprocessing.scale(sample_A)
B_scaled = preprocessing.scale(sample_B)
distance.euclidean(A_scaled, B_scaled)
The value I got was 7.713635264892224.
My understanding is that this is because of the higher precision used when calculating the standard deviation and mean. Is there any way to pass the desired precision to the scaling function as input, or do I have to write a custom scale function?
If so, how can I write a custom scale function that applies to the entire numpy array?
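If the reference value was produced with a rounded mean and standard deviation, one way to approach it is a small custom scaler that rounds those statistics before applying them. This is only a sketch under that assumption; the function name and the decimals parameter are illustrative, and the right number of decimals depends on how the expected value 7.296226771 was actually computed.

import numpy as np
from scipy.spatial import distance

def scale_with_precision(x, decimals=6):
    # Standardize like preprocessing.scale (zero mean, unit population std),
    # but round the mean and std to a fixed number of decimals first.
    # `decimals` is a hypothetical knob; match it to your requirement.
    mean = np.round(np.mean(x), decimals)
    std = np.round(np.std(x), decimals)  # ddof=0, same as preprocessing.scale
    return (x - mean) / std

A_scaled = scale_with_precision(sample_A)
B_scaled = scale_with_precision(sample_B)
print(distance.euclidean(A_scaled, B_scaled))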

Related

Does sklearn DBSCAN assume distances are normalized?

I'm learning about DBSCAN, and apparently the most important hyperparameter is eps. From the sklearn documentation:
eps float, default=0.5
The maximum distance between two samples for one to be considered as in the neighborhood of the other.
This is not a maximum bound on the distances of points within a cluster.
This is the most important DBSCAN parameter to choose appropriately for your data set and distance function.
I notice that the default of 0.5 doesn't take into account the range of the distances in the data. In other words, if my distances run from 1 to 100, will it still work the same way if I scale those distances up by a factor of 100, or down by a factor of 10? Or is this parameter supposed to be used with normalized distances (max_distance = 1)?
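For a concrete check, here is a minimal sketch (assuming the default Euclidean metric and default min_samples): eps is expressed in the same units as the input features, not on a normalized scale, so scaling the data by some factor means scaling eps by the same factor to recover the same clustering.

import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))

# eps lives in the same units as the data: multiplying the features by 100
# and eps by 100 should give back the same labels.
labels_small = DBSCAN(eps=0.5).fit_predict(X)
labels_scaled = DBSCAN(eps=50.0).fit_predict(X * 100)
print((labels_small == labels_scaled).all())  # expected: True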

How to normalize a non-normal distribution?

I have the above distribution, with a mean of -0.02, a standard deviation of 0.09, and a sample size of 13905.
I am just not sure why the distribution is left-skewed given the large sample size. In the bins from -2.0 to -0.5 there are only 10 samples (outliers), which explains the shape.
I am wondering whether it is possible to transform it into a smoother, more 'normal' distribution. The purpose is to feed it into a model while reducing the standard error of the predictor.
You have two options here: a Box-Cox transform or a Yeo-Johnson transform. The issue with the Box-Cox transform is that it applies only to positive numbers. To use it here, you would have to exponentiate the data, apply the Box-Cox transform, and then take the log to get the data back on the original scale. The Box-Cox transform is available in scipy.stats.
You can avoid those steps and simply use the Yeo-Johnson transform; sklearn provides an API for it:
from matplotlib import pyplot as plt
from scipy.stats import normaltest
import numpy as np
from sklearn.preprocessing import PowerTransformer
data=np.array([-0.35714286,-0.28571429,-0.00257143,-0.00271429,-0.00142857,0.,0.,0.,0.00142857,0.00285714,0.00714286,0.00714286,0.01,0.01428571,0.01428571,0.01428571,0.01428571,0.01428571,0.01428571,0.02142857,0.07142857])
pt = PowerTransformer(method='yeo-johnson')
data = data.reshape(-1, 1)
pt.fit(data)
transformed_data = pt.transform(data)
We have transformed our data but we need a way to measure and see if we have moved in the right direction. Since our goal was to move towards being a normal distribution, we will use a normality test.
k2, p = normaltest(data)
transformed_k2, transformed_p = normaltest(transformed_data)
The test returns two values, k2 and p. The value of p is of interest here.
If p is greater than some threshold (e.g. 0.001 or so), we can reject the hypothesis that the data comes from a normal distribution.
In the example above, you'll see that p is greater than 0.001 while transformed_p is less than this threshold, indicating that we are moving in the right direction.
I agree with the top answer, except the last 2 paragraphs, because the interpretation of normaltest's output is flipped. These paragraphs should instead read:
"The test returns two values k2 and p. The value of p is of our interest here.
if p is greater less than some threshold (ex 0.001 or so), we can say reject the null hypothesis that data comes from a normal distribution.
In the example above, you'll see that p is greater less than 0.001 while transformed_p is less greater than this threshold indicating that we are moving in the right direction."
Source: normaltest documentation.
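Putting the corrected interpretation into code, here is a small hypothetical helper (the function name and threshold are illustrative) that compares the normaltest p-values before and after a transform; a larger p after transforming means the test has less evidence against normality of the transformed data.

from scipy.stats import normaltest

def more_normal_after_transform(raw, transformed, alpha=0.001):
    # Small p => reject normality at level alpha; we hope p grows after transforming.
    _, p_raw = normaltest(raw, axis=None)
    _, p_new = normaltest(transformed, axis=None)
    return (p_raw < alpha) and (p_new > alpha), p_raw, p_new

# Example with the arrays from the answer above:
# improved, p_raw, p_new = more_normal_after_transform(data, transformed_data)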

Using cosine distance with scikit learn KNeighborsClassifier

Is it possible to use something like 1 - cosine similarity with scikit-learn's KNeighborsClassifier?
This answer says no, but the documentation for KNeighborsClassifier says that the metrics listed in DistanceMetric are available. Those don't include an explicit cosine distance, probably because it's not really a distance, but supposedly it's possible to pass a function as the metric. I tried passing scikit-learn's linear kernel into KNeighborsClassifier, but it gives me an error saying the function needs two arrays as arguments. Has anyone else tried this?
The cosine similarity is generally defined as x^T y / (||x|| * ||y||); it outputs 1 if the two vectors are the same and goes to -1 if they are completely different. This definition is not technically a metric, so you can't use accelerating structures like ball trees and k-d trees with it. If you force scikit-learn to use the brute-force approach, you should be able to use it as a distance if you pass it your own custom distance metric object. There are methods of transforming the cosine similarity into a valid distance metric if you would like to use ball trees (you can find one in the JSAT library).
Notice, though, that x^T y / (||x|| * ||y||) = (x/||x||)^T (y/||y||). The Euclidean distance can equivalently be written as sqrt(x^T x + y^T y - 2 x^T y). If we normalize every data point before giving it to the KNeighborsClassifier, then x^T x = 1 for all x, so the Euclidean distance reduces to sqrt(2 - 2 x^T y). For identical inputs we get sqrt(2 - 2*1) = 0, and for complete opposites sqrt(2 - 2*(-1)) = 2. This is a monotonic transformation of the cosine similarity, so you can get the same ordering as the cosine distance by normalizing your data and then using the Euclidean distance. As long as you use the uniform weights option, the results will be identical to having used a correct cosine distance.
The KNN family of class constructors has a parameter called metric; you can switch between the different distance metrics you want to use in the nearest-neighbour model.
A list of available distance metrics can be found here.
If you want to use the cosine metric for ranking and classification problems, you can use the Euclidean (norm 2) distance on normalized feature vectors, which gives you the same ranking/classification results (predictions made by argmax or argmin operations).
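As a rough sketch of both suggestions (assuming a reasonably recent scikit-learn where metric='cosine' is accepted together with the brute-force algorithm), the cosine-metric route and the Euclidean-on-normalized-vectors route give the same neighbour ordering:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import normalize

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Route 1: cosine distance directly, brute force (no ball/kd tree acceleration).
knn_cos = KNeighborsClassifier(metric='cosine', algorithm='brute', weights='uniform')
knn_cos.fit(X, y)

# Route 2: L2-normalize the rows and keep the default Euclidean metric.
Xn = normalize(X)
knn_euc = KNeighborsClassifier(weights='uniform')
knn_euc.fit(Xn, y)

# Ties among equidistant neighbours aside, the two should agree.
print((knn_cos.predict(X) == knn_euc.predict(Xn)).mean())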

Finding the sigma of a Gaussian array without using a fit

I have an array, called gaussian_array, which is made of a series of numbers that, once plotted, form a Gaussian, to a good approximation.
I need to estimate the sigma of this Gaussian, but I am not allowed to use a fit of any kind. What I have tried so far is to find the peak of the Gaussian, which is given by the first element of the array (the Gaussian is centred around the origin), gaussian_array[0], and then I thought it could be useful to use the FWHM and the well-known relation between sigma and the FWHM.
However, I do not know exactly how to implement this in Python. I thought I could write something like
for i in range(len(gaussian_array)):
    if gaussian_array[i] == FWHM:
        sigma = gaussian_array[i] / (2. * np.sqrt(2. * np.log(2)))
but I don't think that's a reliable procedure, because a given element of gaussian_array will not always EXACTLY coincide with the calculated FWHM. I cannot even calculate the standard deviation from the sum of the squared differences between the values and the origin.
So, how could I estimate the sigma of this gaussian_array?
I am confused why you would go to such great lengths to calculate a standard deviation. In your post it seems you are trying to get sigma from the relation FWHM = 2 * sqrt(2 * ln(2)) * sigma.
If you are trying to obtain the standard deviation, just use numpy:
import numpy as np
# method 1 - use np.std() on a python data structure
sigma = np.std(gaussian_array)
# method 2 - convert to numpy array and use .std() method
gaussian_array = np.asarray(gaussian_array)
sigma = gaussian_array.std()
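If you do want to pursue the FWHM idea from the question instead, here is a rough sketch. It assumes gaussian_array holds the y-values of one half of the curve, sampled on a uniform grid with the peak at index 0 (the np.arange grid below is a hypothetical stand-in; replace it with your actual x values), and it interpolates the half-maximum crossing rather than looking for an exact match:

import numpy as np

x = np.arange(len(gaussian_array))            # hypothetical x grid, peak at x = 0
half_max = gaussian_array[0] / 2.0            # half of the peak value
below = np.argmax(gaussian_array < half_max)  # first sample below the half maximum
# Linearly interpolate between the two samples that bracket the crossing.
x_half = np.interp(half_max,
                   [gaussian_array[below], gaussian_array[below - 1]],
                   [x[below], x[below - 1]])
fwhm = 2.0 * x_half                           # curve is symmetric about the peak
sigma = fwhm / (2.0 * np.sqrt(2.0 * np.log(2.0)))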

Is there a way to automatically estimate the best degree of freedoms for a t-distribution using Scipy?

A lot of the functions simply ask you to input the degrees of freedom in order to fit a distribution and return other quantities. However, I would like a function like R's fitdistr that estimates the mean, the scale parameter, and the df. My ultimate goal is to obtain the t-score using that best estimated df.
Using Scipy, you can use the fit function associated with the t-distribution class to estimate the degrees of freedom, location and scale (see here and here for more details). This estimates the parameters by maximum likelihood, like the fitdistr function from MASS in R. E.g.
from scipy import stats
import numpy as np
np.random.seed(2015)
x = [ stats.t.rvs(9) for i in range(250)]
stats.t.fit(x)
This gives estimates of df = 5.63, location = 0.00 and scale = 0.85
A point of caution: the estimate of the degrees of freedom may not be great if you are also estimating the scale and location. You may only hit a local optimum when maximising the likelihood function, so it may help to standardise your data first.
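On that point, scipy's fit method accepts floc and fscale keyword arguments to pin the location and scale, so that only the degrees of freedom are estimated. A brief sketch under the same setup as above (variable names are illustrative):

from scipy import stats
import numpy as np

np.random.seed(2015)
x = stats.t.rvs(9, size=250)

# Free fit: df, loc and scale all estimated by maximum likelihood.
df_free, loc_free, scale_free = stats.t.fit(x)

# Constrained fit: loc and scale pinned, only df is estimated.
df_fixed, loc_fixed, scale_fixed = stats.t.fit(x, floc=0, fscale=1)

print(df_free, df_fixed)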
