norm parameters in sklearn.preprocessing.normalize - python

The sklearn documentation says that "norm" can be either
norm : ‘l1’, ‘l2’, or ‘max’, optional (‘l2’ by default)
The norm to use to normalize each non zero sample (or each non-zero feature if axis is 0).
The documentation about normalization doesn't clearly state how ‘l1’, ‘l2’, or ‘max’ are calculated.
Can anyone clarify this?

Informally speaking, the norm is a generalization of the concept of (vector) length; from the Wikipedia entry:
In linear algebra, functional analysis, and related areas of mathematics, a norm is a function that assigns a strictly positive length or size to each vector in a vector space.
The L2-norm is the usual Euclidean length, i.e. the square root of the sum of the squared vector elements.
The L1-norm is the sum of the absolute values of the vector elements.
The max-norm (sometimes also called infinity norm) is simply the maximum absolute vector element.
As the docs say, normalization here means making our vectors (i.e. data samples) have unit length, so specifying which length (i.e. which norm) is used is also required.
You can easily verify the above adapting the examples from the docs:
from sklearn import preprocessing
import numpy as np
X = [[ 1., -1.,  2.],
     [ 2.,  0.,  0.],
     [ 0.,  1., -1.]]
X_l1 = preprocessing.normalize(X, norm='l1')
X_l1
# array([[ 0.25, -0.25,  0.5 ],
#        [ 1.  ,  0.  ,  0.  ],
#        [ 0.  ,  0.5 , -0.5 ]])
You can verify by simple visual inspection that the absolute values of the elements in each row of X_l1 sum up to 1.
X_l2 = preprocessing.normalize(X, norm='l2')
X_l2
# array([[ 0.40824829, -0.40824829,  0.81649658],
#        [ 1.        ,  0.        ,  0.        ],
#        [ 0.        ,  0.70710678, -0.70710678]])
np.sqrt(np.sum(X_l2**2, axis=1)) # verify that L2-norm is indeed 1
# array([ 1., 1., 1.])
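A minimal sketch along the same lines, verifying the L1 case numerically and showing the max norm as well (expected outputs shown as comments):
np.sum(np.abs(X_l1), axis=1)   # absolute values of each row sum to 1
# array([ 1.,  1.,  1.])
X_max = preprocessing.normalize(X, norm='max')
X_max
# array([[ 0.5, -0.5,  1. ],
#        [ 1. ,  0. ,  0. ],
#        [ 0. ,  1. , -1. ]])
np.max(np.abs(X_max), axis=1)  # the maximum absolute element of each row is 1
# array([ 1.,  1.,  1.])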

Related

Efficient Implementation of Gaussian Elimination in Python

Is there somewhere in the cosmos of scipy/numpy/... a standard method for Gauss-elimination of a matrix?
One finds many snippets via google, but I would prefer to use "trusted" modules if possible.
I finally found that it can be done using LU decomposition. Here the U matrix represents the reduced form of the linear system.
from numpy import array
from scipy.linalg import lu
a = array([[2.,4.,4.,4.],[1.,2.,3.,3.],[1.,2.,2.,2.],[1.,4.,3.,4.]])
pl, u = lu(a, permute_l=True)
Then u reads
array([[ 2.,  4.,  4.,  4.],
       [ 0.,  2.,  1.,  2.],
       [ 0.,  0.,  1.,  1.],
       [ 0.,  0.,  0.,  0.]])
Depending on the solvability of the system this matrix has an upper triangular or trapezoidal structure. In the above case a line of zeros arises, as the matrix has only rank 3.
One function worth checking is scipy.optimize._remove_redundancy, if you wish to remove repeated or redundant equations:
import numpy as np
import scipy.optimize
a = np.array([[1., 1., 1., 1.],
              [0., 0., 0., 1.],
              [0., 0., 0., 2.],
              [0., 0., 0., 3.]])
print(scipy.optimize._remove_redundancy._remove_redundancy(a, np.zeros_like(a[:, 0]))[0])
which gives:
[[1. 1. 1. 1.]
 [0. 0. 0. 3.]]
As a note to @flonk's answer, using an LU decomposition might not always give the desired reduced row matrix. Example:
import numpy as np
import scipy.linalg
a = np.array([[1., 1., 1., 1.],
              [0., 0., 0., 1.],
              [0., 0., 0., 2.],
              [0., 0., 0., 3.]])
_, _, u = scipy.linalg.lu(a)
print(u)
gives the same matrix:
[[1. 1. 1. 1.]
 [0. 0. 0. 1.]
 [0. 0. 0. 2.]
 [0. 0. 0. 3.]]
even though the last 3 rows are linearly dependent.
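A quick sanity check of that dependence, reusing the same matrix a and numpy's rank computation:
print(np.linalg.matrix_rank(a))   # prints 2: only two rows are linearly independent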
You can use the symbolic mathematics Python library sympy:
import sympy as sp
m = sp.Matrix([[ 1,  2, 1],
               [-2, -3, 1],
               [ 3,  5, 0]])
m_rref, pivots = m.rref() # Compute reduced row echelon form (rref).
print(m_rref, pivots)
This will output the matrix in reduced row echelon form, as well as a tuple of the pivot column indices:
Matrix([[1, 0, -5],
        [0, 1,  3],
        [0, 0,  0]])
(0, 1)
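If you need the result back as a numpy array for further numerical work, one possible conversion (a sketch, assuming plain floats are acceptable) is:
import numpy as np
m_rref_np = np.array(m_rref.tolist(), dtype=float)   # sympy Matrix -> numpy array
print(m_rref_np)
# [[ 1.  0. -5.]
#  [ 0.  1.  3.]
#  [ 0.  0.  0.]]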

I'd like to know how to calculate the similarity(numerical accuracy) of the two numpy array types in Python

I'm a student who just started deep learning with Python.
First of all, my native language is not English, so please excuse any awkward wording from the translation.
I used time series data to train a deep learning model that predicts the likelihood of certain situations in the future, and I have already visualized the results with graphs.
But rather than judging it visually from the graphs, I want to quantify the similarity between the train data and the test data, i.e. the numerical accuracy.
The two data are in the following format:
In [51]: train_r
Out[51]: array([[0., 0., 0., ..., 0., 0., 0.],
                [0., 0., 0., ..., 0., 0., 0.],
                [0., 0., 0., ..., 0., 0., 0.],
Note: This data is composed of 0 and 1.
In [52]: test_r
Out[52]: array([[0.        , 0.        , 0.        , ..., 0.03657577, 0.06709877, 0.0569071 ],
                [0.        , 0.        , 0.        , ..., 0.04707848, 0.07826   , 0.0819832 ],
                [0.        , 0.        , 0.        , ..., 0.04467918, 0.07355513, 0.08117414],
I used the cosine similarity method to measure how close these two arrays are, but an error occurred:
from numpy import dot
from numpy.linalg import norm
cos_sim = dot(train_r, test_r)/(norm(train_r)*norm(test_r))
ValueError: shapes (100,24) and (100,24) not aligned: 24 (dim 1) != 100 (dim 0)
So I searched the Internet for a different way, but it didn't help, because most of the results were about string analysis.
How can I calculate the similarity between the two arrays and express it as a number?
Found the cause.
The error occurs because train_r and test_r are 2-D arrays with 24 columns each, so np.dot tries to matrix-multiply two (100, 24) arrays and the shapes do not align.
I was trying to process all 24 columns at once, which is what raised the error.
The solution is simple: pick a single column from train_r and test_r and apply the cosine similarity formula to that pair of vectors.
train_c = train_r[:,12]
test_c = test_r[:,12]
from numpy import dot
from numpy.linalg import norm
a = train_c
b = test_c
cos_sim = (dot(a, b)/(norm(a)*norm(b))) * 100
print(cos_sim)
95.18094658851624
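If you want a similarity value for every column at once instead of picking a single one, a minimal sketch (the function name is mine, and it assumes no column is entirely zero) could be:
import numpy as np
from numpy.linalg import norm

def columnwise_cosine(a, b):
    # cosine similarity of matching columns; assumes no all-zero columns
    num = np.sum(a * b, axis=0)
    den = norm(a, axis=0) * norm(b, axis=0)
    return num / den

sims = columnwise_cosine(train_r, test_r)   # shape (24,), one value per column
print(sims.mean() * 100)                    # e.g. report the average as a percentage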

Different results from scipy.stats.spearmanr depending on how data is produced

I'm having a weird problem using spearmanr from scipy.stats. I'm using the values of a polynomial to get some correlations that are a bit more interesting to work with, but if I manually enter the values (as a list, converted to a numpy array) I get a different correlation from what I get if I calculate the values using a function. The code below should demonstrate what I mean:
import numpy as np
from scipy.stats import spearmanr
data = np.array([ 0.4, 1.2, 1. , 0.4, 0. , 0.4, 2.2, 6. , 12.4, 22. ])
axis = np.arange(0, 10, dtype=np.float64)
print(spearmanr(axis, data))  # gives a correlation of 0.693...
# Use this polynomial
poly = lambda x: 0.1*(x - 3.0)**3 + 0.1*(x - 1.0)**2 - x + 3.0
data2 = poly(axis)
print(data2) # It is the same as data
print(spearmanr(axis, data2))  # gives a correlation of 0.729...
I did notice that the arrays are subtly different (i.e. data - data2 is not exactly zero for all elements), but the difference is tiny - order of 1e-16.
Is such a tiny difference enough to throw off spearmanr by this much?
Is such a tiny difference enough to throw off spearmanr by this much?
Yes, because Spearman's r is based on the sample rank. Such tiny differences can change the rank of values that would otherwise be equal:
from scipy.stats import rankdata
rankdata(data)
# array([ 3.,  6.,  5.,  3.,  1.,  3.,  7.,  8.,  9., 10.])
# Note that all three values of 0.4 get the same rank 3.
rankdata(data2)
# array([ 2.5,  6. ,  5. ,  2.5,  1. ,  4. ,  7. ,  8. ,  9. , 10. ])
# Note that two of the 0.4 values get rank 2.5 and the third gets rank 4.
If you add a small gradient (larger than the numerical difference you observe) to break such ties, you will get the same result:
print(spearmanr(axis, data + np.arange(10)*1e-12))
# SpearmanrResult(correlation=0.74545454545454537, pvalue=0.013330146315440047)
print(spearmanr(axis, data2 + np.arange(10)*1e-12))
# SpearmanrResult(correlation=0.74545454545454537, pvalue=0.013330146315440047)
This, however, will break any ties that may be intentional and can lead to over- or underestimating the correlation. numpy.round may be the preferable solution if the data is expected to have discrete values.
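For example, a minimal sketch of the rounding approach on the data above (assuming one decimal place is enough to absorb the ~1e-16 noise):
data2_rounded = np.round(data2, 1)          # collapses the tiny floating-point differences
print(np.array_equal(data2_rounded, data))  # expected to be True here
print(spearmanr(axis, data2_rounded))       # should now match spearmanr(axis, data)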

implementing euclidean distance based formula using numpy

I am trying to implement this formula in Python using numpy:
fitness = sum over every row xi of X of ( min over every row Ci of C of dist(Ci, xi) )
Here X is a numpy matrix and each xi is a vector with n dimensions, C is also a numpy matrix and each Ci is a vector with n dimensions too, and dist(Ci, xi) is the Euclidean distance between these two vectors.
I implemented it in Python like this:
import math
import numpy as np

value = 0
for i in range(X.shape[0]):
    min_value = math.inf
    # this inner loop iterates k times (once per row of C)
    for j in range(C.shape[0]):
        distance = np.dot(X[i] - C[j], X[i] - C[j]) ** .5
        min_value = min(min_value, distance)
    value += min_value
fitnessValue = value
But the performance of my code is not good enough, and I am looking for something faster. Is there a faster way to calculate this formula in Python? Any idea would be appreciated.
Generally, loops that run a large number of times should be avoided when possible in Python.
Here, there exists a scipy function, scipy.spatial.distance.cdist(C, X), which computes the pairwise distance matrix between C and X. That is to say, if you call distance_matrix = scipy.spatial.distance.cdist(C, X), you have distance_matrix[i, j] = dist(C_i, X_j).
Then, for each j, you want to compute the minimum of dist(C_i, X_j) over all i. You do not need a loop for this either! The function numpy.min does it for you if you pass an axis argument.
And finally, the summation of all these minima is done by calling the numpy.sum function.
This gives code much more readable and faster:
import scipy.spatial.distance
import numpy as np

def your_function(C, X):
    distance_matrix = scipy.spatial.distance.cdist(C, X)
    minimum = np.min(distance_matrix, axis=0)
    return np.sum(minimum)
Which returns the same results as your function :)
Hope this helps!
einsum can also be called into play. Here is a small example of a pairwise distance calculation for a small set, useful if you don't have scipy installed and/or wish to use numpy only.
>>> a
array([[ 0.,  0.],
       [ 1.,  1.],
       [ 2.,  2.],
       [ 3.,  3.],
       [ 4.,  4.]])
>>> b = a.reshape(np.prod(a.shape[:-1]), 1, a.shape[-1])
>>> b
array([[[ 0.,  0.]],
       [[ 1.,  1.]],
       [[ 2.,  2.]],
       [[ 3.,  3.]],
       [[ 4.,  4.]]])
>>> diff = a - b; dist_arr = np.sqrt(np.einsum('ijk,ijk->ij', diff, diff)).squeeze()
>>> dist_arr
array([[ 0.     ,  1.41421,  2.82843,  4.24264,  5.65685],
       [ 1.41421,  0.     ,  1.41421,  2.82843,  4.24264],
       [ 2.82843,  1.41421,  0.     ,  1.41421,  2.82843],
       [ 4.24264,  2.82843,  1.41421,  0.     ,  1.41421],
       [ 5.65685,  4.24264,  2.82843,  1.41421,  0.     ]])
Array 'a' is a simple 2-D array (shape (5, 2)); 'b' is just 'a' reshaped to (5, 1, 2) to facilitate the difference calculation for the cdist-style array. The terms are written verbosely since they are extracted from other code. The 'diff' variable is the difference array, and the dist_arr shown is the Euclidean distance. Should you need the squared Euclidean distance for 'closest' determinations, simply remove the np.sqrt term; the final squeeze just removes any dimensions of size 1 from the shape.
cdist is faster for much larger arrays (in the order of 1000s of origins and destinations) but einsum is a nice alternative and well documented by others on this site.
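Putting the two answers together, a minimal pure-numpy sketch of the original fitness computation (the function name and variable names here are mine) could look like:
import numpy as np

def fitness_numpy(C, X):
    # pairwise differences via broadcasting: shape (n_samples, k, n_dims)
    diff = X[:, None, :] - C[None, :, :]
    # pairwise Euclidean distances, shape (n_samples, k)
    dists = np.sqrt(np.einsum('ijk,ijk->ij', diff, diff))
    # distance to the nearest centre for each sample, summed over all samples
    return dists.min(axis=1).sum()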

DBSCAN in Python: Unexpected result

I'm trying to understand the DBSCAN implementation by scikit-learn, but I'm having trouble. Here is my data sample:
X = [[0,0],[0,1],[1,1],[1,2],[2,2],[5,0],[5,1],[5,2],[8,0],[10,0]]
Then I calculate D as in the example provided
D = distance.squareform(distance.pdist(X))
D returns a matrix with the distance between each point and all others. The diagonal is thus always 0.
Then I run DBSCAN as:
db = DBSCAN(eps=1.1, min_samples=2).fit(D)
eps = 1.1 means, if I understood the documentation correctly, that points within a distance of 1.1 or less will be considered part of a cluster (core).
D[1] returns the following:
>>> D[1]
array([  1.        ,   0.        ,   1.        ,   1.41421356,
         2.23606798,   5.09901951,   5.        ,   5.09901951,
         8.06225775,  10.04987562])
which means the second point is at a distance of 1 from the first and the third. So I expect them to form a cluster, but ...
>>> db.core_sample_indices_
[]
which means no cores found, right? Here are the other 2 outputs.
>>> db.components_
array([], shape=(0, 10), dtype=float64)
>>> db.labels_
array([-1., -1., -1., -1., -1., -1., -1., -1., -1., -1.])
Why isn't there any cluster?
I figure the implementation might just be treating your distance matrix as the data itself (i.e. as a feature array).
See: usually you wouldn't compute the full distance matrix for DBSCAN, but use a data index for faster neighbor search.
Judging from a 1 minute Google, consider adding metric="precomputed", since:
fit(X)
X: Array of distances between samples, or a feature array. The array is treated as a feature array unless the metric is given as ‘precomputed’.
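A minimal sketch of that fix on the example above (I have not re-run it, but with a precomputed metric the nearby points should come out as cores):
from sklearn.cluster import DBSCAN
db = DBSCAN(eps=1.1, min_samples=2, metric='precomputed').fit(D)
print(db.core_sample_indices_)   # indices of the core samples found
print(db.labels_)                # cluster labels per point, -1 for noise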
