I am trying to implement this formula in python using numpy
As you can see in picture above X is numpy matrix and each xi is a vector with n dimensions and C is also a numpy matrix and each Ci is vector with n dimensions too, dist(Ci,xi) is euclidean distance between these two vectors.
I implement a code in python:
value = 0
for i in range(X.shape[0]):
min_value = math.inf
#this for loop iterate k times
for j in range(C.shape[0]):
distance = (np.dot(X[i] - C[j],
X[i] - C[j])) ** .5
min_value = min(min_value, distance)
value += min_value
fitnessValue = value
But my code performance is not good enough I'am looking for faster,is there any faster way to calculate that formula in python any idea would be thankful.
Generally, loops running an important number of times should be avoided when possible in python.
Here, there exists a scipy function, scipy.spatial.distance.cdist(C, X), which computes the pairwise distance matrix between C and X. That is to say, if you call distance_matrix = scipy.spatial.distance.cdist(C, X), you have distance_matrix[i, j] = dist(C_i, X_j).
Then, for each j, you want to compute the minimum of the dist(C_i, X_j) over all i. You do not either need a loop to compute this! The function numpy.minimum does it for you, if you pass an axis argument.
And finally, the summation of all these minimum is done by calling the numpy.sum function.
This gives code much more readable and faster:
import scipy.spatial.distance
import numpy as np
def your_function(C, X):
distance_matrix = scipy.spatial.distance.cdist(C, X)
minimum = np.min(distance_matrix, axis=0)
return np.sum(minimum)
Which returns the same results as your function :)
Hope this helps!
einsum can also be called into play. Here is a simple small example of a pairwise distance calculation for a small set. Useful if you don't have scipy installed and/or wish to use numpy solely.
>>> a
array([[ 0., 0.],
[ 1., 1.],
[ 2., 2.],
[ 3., 3.],
[ 4., 4.]])
>>> b = a.reshape(np.prod(a.shape[:-1]),1,a.shape[-1])
>>> b
array([[[ 0., 0.]],
[[ 1., 1.]],
[[ 2., 2.]],
[[ 3., 3.]],
[[ 4., 4.]]])
>>> diff = a - b; dist_arr = np.sqrt(np.einsum('ijk,ijk->ij', diff, diff)).squeeze()
>>> dist_arr
array([[ 0. , 1.41421, 2.82843, 4.24264, 5.65685],
[ 1.41421, 0. , 1.41421, 2.82843, 4.24264],
[ 2.82843, 1.41421, 0. , 1.41421, 2.82843],
[ 4.24264, 2.82843, 1.41421, 0. , 1.41421],
[ 5.65685, 4.24264, 2.82843, 1.41421, 0. ]])
Array 'a' is a simple 2d (shape=(5,2), 'b' is just 'a' reshaped to facilitate (5, 1, 2) the difference calculations for the cdist style array. The terms are written verbosely since they are extracted from other code. the 'diff' variable is the difference array and the dist_arr shown is for the 'euclidean' distance. Should you need euclideansq (square distance) for 'closest' determinations, simply remove the np.sqrt term and finally squeeze, just removes and 1 terms in the shape.
cdist is faster for much larger arrays (in the order of 1000s of origins and destinations) but einsum is a nice alternative and well documented by others on this site.
Related
I have a 150x4 matrix X which I created from a pandas dataframe using the following code:
X = df_new.as_matrix()
I have to normalize it using this function:
I know that Uj is the mean val of j, and that σ j is the standard deviation of j, but I don't understand what j is. I'm having a little trouble understanding what the bar on X is, and I'm confused by the commas in the equation (I don't know if they have any significance or not).
Can anyone help me understand what this equation means so I can then write the normalization using sklearn?
You don't actually need to write code for the normalization yourself - it comes ready with sklearn.preprocessing.scale.
Here is an example from the docs:
>>> from sklearn import preprocessing
>>> import numpy as np
>>> X_train = np.array([[ 1., -1., 2.],
... [ 2., 0., 0.],
... [ 0., 1., -1.]])
>>> X_scaled = preprocessing.scale(X_train)
>>> X_scaled
array([[ 0. ..., -1.22..., 1.33...],
[ 1.22..., 0. ..., -0.26...],
[-1.22..., 1.22..., -1.06...]])
When used with the default setting axis=0, the mormalization happens column-wise (i.e. for each column j, as in your equestion). As a result, it is easy to confirm that scaled data has zero mean and unit variance:
>>> X_scaled.mean(axis=0)
array([ 0., 0., 0.])
>>> X_scaled.std(axis=0)
array([ 1., 1., 1.])
The indexes for matrix X are row (i) and column (j). Hence, X,j means column j of matrix X. I.e. normalize each column of matrix X to z-scores.
You can do that using pandas:
df_new_zscores = (df_new - df_new.mean()) / df_new.std()
I do not know pandas but I think that the equation means that the normalized matrix is given by
You subtract the empirical mean and devide by the empirical standard deviation per column.
You sometimes use this for Principal Component Analysis.
In sklearn documentation says "norm" can be either
norm : ‘l1’, ‘l2’, or ‘max’, optional (‘l2’ by default)
The norm to use to normalize each non zero sample (or each non-zero feature if axis is 0).
The documentation about normalization isn't clearly stating how ‘l1’, ‘l2’, or ‘max’ are calculated.
Can anyone clear these?
Informally speaking, the norm is a generalization of the concept of (vector) length; from the Wikipedia entry:
In linear algebra, functional analysis, and related areas of mathematics, a norm is a function that assigns a strictly positive length or size to each vector in a vector space.
The L2-norm is the usual Euclidean length, i.e. the square root of the sum of the squared vector elements.
The L1-norm is the sum of the absolute values of the vector elements.
The max-norm (sometimes also called infinity norm) is simply the maximum absolute vector element.
As the docs say, normalization here means making our vectors (i.e. data samples) having unit length, so specifying which length (i.e. which norm) is also required.
You can easily verify the above adapting the examples from the docs:
from sklearn import preprocessing
import numpy as np
X = [[ 1., -1., 2.],
[ 2., 0., 0.],
[ 0., 1., -1.]]
X_l1 = preprocessing.normalize(X, norm='l1')
X_l1
# array([[ 0.25, -0.25, 0.5 ],
# [ 1. , 0. , 0. ],
# [ 0. , 0.5 , -0.5 ]])
You can verify by simple visual inspection that the absolute values of the elements of X_l1 sum up to 1.
X_l2 = preprocessing.normalize(X, norm='l2')
X_l2
# array([[ 0.40824829, -0.40824829, 0.81649658],
# [ 1. , 0. , 0. ],
# [ 0. , 0.70710678, -0.70710678]])
np.sqrt(np.sum(X_l2**2, axis=1)) # verify that L2-norm is indeed 1
# array([ 1., 1., 1.])
Most probably somebody else already asked this but I couldn't find it. The question is how can I assign values to a 2D array from two 1D arrays. For example:
import numpy as np
#a is the 2D array. b is the 1D array and should be assigned
#to second coordinate. In this exaple the first coordinate is 1.
a=np.zeros((3,2))
b=np.asarray([1,2,3])
c=np.ones(3)
a=np.vstack((c,b)).T
output:
[[ 1. 1.]
[ 1. 2.]
[ 1. 3.]]
I know the way I am doing it so naive, but I am sure there should be a one line way of doing this.
P.S. In real case that I am dealing with, this is a subarray of an array, and therefore I cannot set the first coordinate from the beginning to one. The whole array's first coordinate are different, but after applying np.where they become constant.
How about 2 lines?
>>> c = np.ones((3, 2))
>>> c[:, 1] = [1, 2, 3]
And the proof it works:
>>> c
array([[ 1., 1.],
[ 1., 2.],
[ 1., 3.]])
Or, perhaps you want np.column_stack:
>>> np.column_stack(([1.,1,1],[1,2,3]))
array([[ 1., 1.],
[ 1., 2.],
[ 1., 3.]])
First, there's absolutely no reason to create the original zeros array that you stick in a, never reference, and replace with a completely different array with the same name.
Second, if you want to create an array the same shape and dtype as b but with all ones, use ones_like.
So:
b = np.array([1,2,3])
c = np.ones_like(b)
d = np.vstack((c, b).T
You could of course expand b to a 3x1-array instead of a 3-array, in which case you can use hstack instead of needing to vstack then transpose… but I don't think that's any simpler:
b = np.array([1,2,3])
b = np.expand_dims(b, 1)
c = np.ones_like(b)
d = np.hstack((c, b))
If you insist on 1 line, use fancy indexing:
>>> a[:,0],a[:,1]=[1,1,1],[1,2,3]
I'm trying to understand the DBSCAN implementation by scikit-learn, but I'm having trouble. Here is my data sample:
X = [[0,0],[0,1],[1,1],[1,2],[2,2],[5,0],[5,1],[5,2],[8,0],[10,0]]
Then I calculate D as in the example provided
D = distance.squareform(distance.pdist(X))
D returns a matrix with the distance between each point and all others. The diagonal is thus always 0.
Then I run DBSCAN as:
db = DBSCAN(eps=1.1, min_samples=2).fit(D)
eps = 1.1 means, if I understood the documentation well, that points with a distance of smaller or equal 1.1 will be considered in a cluster (core).
D[1] returns the following:
>>> D[1]
array([ 1. , 0. , 1. , 1.41421356,
2.23606798, 5.09901951, 5. , 5.09901951,
8.06225775, 10.04987562])
which means the second point has a distance of 1 to the first and the third. So I expect them to build a cluster, but ...
>>> db.core_sample_indices_
[]
which means no cores found, right? Here are the other 2 outputs.
>>> db.components_
array([], shape=(0, 10), dtype=float64)
>>> db.labels_
array([-1., -1., -1., -1., -1., -1., -1., -1., -1., -1.])
Why is there any cluster?
I figure the implementation might just assume your distance matrix is the data itself.
See: usually you wouldn't compute the full distance matrix for DBSCAN, but use a data index for faster neighbor search.
Judging from a 1 minute Google, consider adding metric="precomputed", since:
fit(X)
X: Array of distances between samples, or a feature array. The array is treated as a feature array unless the metric is given as ‘precomputed’.
In numpy if you want to calculate the sinus of each entry of a matrix (elementise) then
a = numpy.arange(0,27,3).reshape(3,3)
numpy.sin(a)
will get the job done! If you want the power let's say to 2 of each entry
a**2
will do it.
But if you have a sparse matrix things seem more difficult. At least I haven't figured a way to do that besides iterating over each entry of a lil_matrix format and operate on it.
I've found this question on SO and tried to adapt this answer but I was not succesful.
The Goal is to calculate elementwise the squareroot (or the power to 1/2) of a scipy.sparse matrix of CSR format.
What would you suggest?
The following trick works for any operation which maps zero to zero, and only for those operations, because it only touches the non-zero elements. I.e., it will work for sin and sqrt but not for cos.
Let X be some CSR matrix...
>>> from scipy.sparse import csr_matrix
>>> X = csr_matrix(np.arange(10).reshape(2, 5), dtype=np.float)
>>> X.A
array([[ 0., 1., 2., 3., 4.],
[ 5., 6., 7., 8., 9.]])
The non-zero elements' values are X.data:
>>> X.data
array([ 1., 2., 3., 4., 5., 6., 7., 8., 9.])
which you can update in-place:
>>> X.data[:] = np.sqrt(X.data)
>>> X.A
array([[ 0. , 1. , 1.41421356, 1.73205081, 2. ],
[ 2.23606798, 2.44948974, 2.64575131, 2.82842712, 3. ]])
Update In recent versions of SciPy, you can do things like X.sqrt() where X is a sparse matrix to get a new copy with the square roots of elements in X.