I've been working with dendrograms to determine the optimal number of clusters for hierarchical clustering.
Suppose I have this dendrogram:
[Figure: dendrogram]
It is defined by the following linkage array:
[[ 1. 2. 5.83095189 2. ]
[ 3. 10. 9.21954446 3. ]
[ 6. 7. 11.18033989 2. ]
[ 0. 11. 13. 4. ]
[ 9. 12. 14.2126704 3. ]
[ 5. 14. 17.20465053 4. ]
[ 4. 13. 20.88061302 5. ]
[ 8. 15. 21.21320344 5. ]
[16. 17. 47.16990566 10. ]]
Comparing those values to the graph, the value at index 2 of each row is the y value (the merge distance) and the value at index 3 is the number of expansions. To determine the optimal number of clusters, I need to find the maximum distance between a merge and the merges below it.
How could I do this, knowing that I would need to subtract 21.21 from 47.1699 and 20.88 from 47.1699 (subtracting in descending order of ramifications)?
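One possible way to compute those differences is sketched below. It assumes the array follows the SciPy linkage convention, where row i creates cluster n + i and n is the number of original observations (10 here). The helper name `child_gaps` is hypothetical, and the largest-gap rule is only one common heuristic for picking a cut, not a definitive criterion.

```python
import numpy as np

# Linkage array from the question (columns: child, child, distance, size).
Z = np.array([
    [1, 2, 5.83095189, 2],
    [3, 10, 9.21954446, 3],
    [6, 7, 11.18033989, 2],
    [0, 11, 13.0, 4],
    [9, 12, 14.2126704, 3],
    [5, 14, 17.20465053, 4],
    [4, 13, 20.88061302, 5],
    [8, 15, 21.21320344, 5],
    [16, 17, 47.16990566, 10],
])

def child_gaps(Z):
    """Return (parent, child, parent height - child height) for merged children."""
    n = Z.shape[0] + 1  # number of original observations
    gaps = []
    for i, (left, right, height, _size) in enumerate(Z):
        for child in (int(left), int(right)):
            if child >= n:  # skip original points, which have no merge height
                gaps.append((n + i, child, height - Z[child - n, 2]))
    return gaps

gaps = child_gaps(Z)
largest = max(gaps, key=lambda g: g[2])
print(largest)  # parent cluster, child cluster, and the gap between them

# Simpler alternative: largest jump between consecutive merge distances.
jumps = np.diff(Z[:, 2])
k = Z.shape[0] - int(np.argmax(jumps))  # clusters left when cutting inside that jump
print(k)
```

For this array both views agree: the biggest gap sits just under the final merge, which suggests cutting into 2 clusters.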
I'm trying to write a recursive function in Python. I have this matrix W:
[[ 13. 14. ]
[ 12. 15. ]
[ 0. 4. ]
[ 3. 6. ]
[ 7. 8. ]
[ 11. 18. ]
[ 10. 17. ]
[ 2. 23. ]
[ 5. 22. ]
[ 16. 19. ]
[ 1. 27. ]
[ 9. 21. ]
[ 25. 29. ]
[ 24. 28. ]
[ 20. 26. ]
[ 31. 32. ]
[ 30. 33. ]
[ 34. 35. ]
[ 36. 37. ]]
The principle is this: for each row, I take the values of the two columns. If a value is < 20 I return it; otherwise I take it modulo 20 and go to that row, repeating until I reach a value lower than 20. For example, 35 is > 20, so 35 % 20 = 15 and I go to row 15. If I find, say, 11 there, I return 11; if I find 23, I take the modulo again, 23 % 20 = 3, go to row 3, and so on. This is my code:
def modulo(entier):
    if entier < 20:
        return(entier)
    else:
        c = (entier % 20)
        if int(W[c,0]) < 20:
            return(int(W[c,0]))
        else:
            a = modulo(int(W[c,0]))
            return(a)
        if int(W[c,1]) < 20:
            return(int(W[c,1]))
        else:
            e = modulo(int(W[c,1]))
            return(e)
i = 12
print(modulo(int(W[i,0])), modulo(int(W[i,1])))
Here I tried row 12 of the matrix, which holds the values 25 and 29. Following the principle, the function should return 11 and 18 for 25, and 16 and 19 for 29. But when I run it, the program only prints two values, 11 and 16. So I have the impression that it only follows the first column of the matrix and never reaches the second if condition. I hope I explained the problem well and that there is a solution. Thank you.
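The behaviour described here is consistent with how return works: both branches of the first if/else return, so the second if is never reached. A minimal sketch of one possible fix follows. It recurses into both columns and returns a list of every value below 20. The name `modulo_all` is hypothetical, and W is assumed to be the matrix above stored as a NumPy array.

```python
import numpy as np

# The matrix W from the question.
W = np.array([
    [13, 14], [12, 15], [0, 4], [3, 6], [7, 8], [11, 18], [10, 17],
    [2, 23], [5, 22], [16, 19], [1, 27], [9, 21], [25, 29], [24, 28],
    [20, 26], [31, 32], [30, 33], [34, 35], [36, 37],
])

def modulo_all(entier):
    """Return all values below 20 reachable from entier through the rows of W."""
    if entier < 20:
        return [entier]
    c = entier % 20
    # Follow both columns of row c and concatenate the results, instead of
    # returning as soon as the first column has been handled.
    return modulo_all(int(W[c, 0])) + modulo_all(int(W[c, 1]))

i = 12
print(modulo_all(int(W[i, 0])), modulo_all(int(W[i, 1])))
# prints [11, 18] [16, 19]
```

Returning a list rather than a single integer is the design choice that lets one call carry the results of both columns.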
I have the following arrays:
from mxnet import nd
A=nd.array([[1,1,1,1],[2,2,2,2]])
B=nd.array([[11,11,11,11],[22,22,22,22]])
Y=nd.array([[91,91,91,91],[92,92,92,92]])
Imagine that each list within each array corresponds to a client.
So [1,1,1,1] is the result of operation A for client 1 and [2,2,2,2] is the result of operation A for client 2.
Then I have another array, B, holding the result of a different operation applied to all the clients: [11,11,11,11] is the result of operation B for client 1, and so on.
And I need to get the following result:
D=nd.array( [ [[1,1,1,1],[11,11,11,11]],[[2,2,2,2],[22,22,22,22]] ])
list([D,Y])
This returns:
[
[[[ 1. 1. 1. 1.]
[11. 11. 11. 11.]]
[[ 2. 2. 2. 2.]
[22. 22. 22. 22.]]]
<NDArray 2x2x4 #cpu(0)>,
[[91. 91. 91. 91.]
[92. 92. 92. 92.]]
<NDArray 2x4 #cpu(0)>]
As you can see, the operations (A and B) are grouped for each client.
I tried:
list([list(zip(A,B)),Y])
And I get:
[[(
[1. 1. 1. 1.]
<NDArray 4 #cpu(0)>,
[11. 11. 11. 11.]
<NDArray 4 #cpu(0)>), (
[2. 2. 2. 2.]
<NDArray 4 #cpu(0)>,
[22. 22. 22. 22.]
<NDArray 4 #cpu(0)>)],
[[91. 91. 91. 91.]
[92. 92. 92. 92.]]
<NDArray 2x4 #cpu(0)>]
Which is not what I need. Plus, arrays A and B are really big, so I don't want to use a loop or something that will take too long.
Thanks.
This is typically something you can do with mxnet.ndarray.concat, but you need to expand the dimensions of the arrays before concatenating them so that each client's rows stay in separate sub-arrays.
This command will get exactly the output you ask for:
C = nd.concat(A.expand_dims(axis=1), B.expand_dims(axis=1), dim=1)
print(C)
which returns:
[[[ 1. 1. 1. 1.]
[11. 11. 11. 11.]]
[[ 2. 2. 2. 2.]
[22. 22. 22. 22.]]]
<NDArray 2x2x4 #cpu(0)>
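To see why expanding the dimension first gives the grouped layout, here is the same idea sketched with NumPy rather than MXNet. This is an analogy under the assumption that the shape logic carries over; it is not the MXNet API itself. Each (2, 4) array becomes (2, 1, 4), and joining along axis 1 yields (2, 2, 4), with one pair of rows per client.

```python
import numpy as np

A = np.array([[1, 1, 1, 1], [2, 2, 2, 2]], dtype=float)
B = np.array([[11, 11, 11, 11], [22, 22, 22, 22]], dtype=float)

# Insert a new axis at position 1, then join along that axis.
C = np.concatenate([np.expand_dims(A, axis=1), np.expand_dims(B, axis=1)], axis=1)

# np.stack performs the same two steps in a single call.
S = np.stack([A, B], axis=1)

print(C.shape)  # (2, 2, 4)
print(C[0])     # rows of A and B for client 1
```

Neither variant loops over clients in Python, so the cost stays in vectorized array code.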
I want to sort the rows of a 2D array based on the elements of the first column, in Python 3. For example, if
x = array([[ 5. , 9. , 2. , 6. ],
[ 7. , 12. , 3.5, 8. ],
[ 2. , 6. , 7. , 9. ]])
then I need the sorted array to be
x = array([[ 2. , 6. , 7. , 9. ],
[ 5. , 9. , 2. , 6. ],
[ 7. , 12. , 3.5, 8. ]])
How can I do that? A similar question was asked and answered here, but it does not work for me.
The following should work:
import numpy as np
x = np.array([[ 5. , 9. , 2. , 6. ],
[ 7. , 12. , 3.5, 8. ],
[ 2. , 6. , 7. , 9. ]])
x[x[:, 0].argsort()]
Out[2]:
array([[ 2. , 6. , 7. , 9. ],
[ 5. , 9. , 2. , 6. ],
[ 7. , 12. , 3.5, 8. ]])
Documentation : numpy.argsort
#using sorted
x = ([[5.,9.,2.,6. ], [7.,12.,3.5,8.], [2.,6.,7.,9.]])
x = sorted(x, key=lambda i: i[0]) #1st col
print(x)
[[ 208.47 26. ]
[ 202.84 17. ]
[ 143.37 10. ]
...,
[ 45.99 3. ]
[ 159.31 10. ]
[ 34.12 4. ]]
[[ 58.64 1. ]
[ 44.31 19. ]
[ 37.89 14. ]
...,
[ 46.86 4. ]
[ 60.73 5. ]
[ 41.91 6. ]]
[[ 36.6 4. ]
[ 219.29 17. ]
[ 64.77 5. ]
...,
[ 51.85 37. ]
[ 161.26 10. ]
[ 53.63 20. ]]
[[ 52.97 32. ]
[ 51.32 3. ]
[ 196.23 4. ]
...,
[ 41.39 8. ]
[ 47.49 5. ]
[ 34.34 3. ]]
I have this numpy array entering my function:
def initialize_centroids(points, k):
    """returns k centroids from the initial points"""
    centroids = points.copy()
    np.random.shuffle(centroids)  # shuffles the rows in place
    print(centroids)
    return centroids[:k]
Currently the function shuffles the points and returns the first k of them. Instead, I want to generate random centroids whose first column lies between 0 and 300 and whose second column lies between 0 and 100. How would I do this?
This is part of my work on building a K-Means algorithm using Python.
As @kazemakase has commented, the answer is simply to use:
np.random.rand(k, 2) * [300, 100]
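As a sketch of how that one-liner could replace the shuffling version: np.random.rand(k, 2) draws values in [0, 1), and multiplying by [300, 100] scales each column through broadcasting. The `upper` parameter below is a hypothetical addition for illustration and is not part of the original function.

```python
import numpy as np

def initialize_centroids(k, upper=(300, 100)):
    """Return k random 2-D centroids, column j drawn uniformly from [0, upper[j])."""
    # rand gives shape (k, 2) in [0, 1); broadcasting scales each column separately.
    return np.random.rand(k, 2) * np.asarray(upper)

centroids = initialize_centroids(5)
print(centroids.shape)  # (5, 2)
```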
I'm trying to learn how to use scipy.cluster.hierarchy.inconsistent. I know from the documentation and this one that the first and second columns of its output are the mean and the standard deviation respectively, the third is the number of links, and the fourth is the inconsistency coefficient.
However, what I don't understand is that:
What does "all the links included in the calculation" really mean?
What does the d parameter of scipy.cluster.hierarchy.inconsistent(Z, d=2) really do?
For example, let's assume that we have the matrix X as follows:
[[2], [8], [0], [4], [1], [9], [9], [0]]
Then, get the Z value by
Z = linkage(X, 'single')
And, we get
[[ 2. 7. 0. 2.]
[ 5. 6. 0. 2.]
[ 0. 4. 1. 2.]
[ 8. 10. 1. 4.]
[ 1. 9. 1. 3.]
[ 3. 11. 2. 5.]
[ 12. 13. 4. 8.]]
Finally, get the inconsistency
inconsistent(Z)
The output is
[[ 0. 0. 1. 0. ]
[ 0. 0. 1. 0. ]
[ 1. 0. 1. 0. ]
[ 0.66667 0.57735 3. 0.57735]
[ 0.5 0.70711 2. 0.70711]
[ 1.5 0.70711 2. 0.70711]
[ 2.33333 1.52753 3. 1.09109]]
For the fourth row, which three links are used to calculate the mean and standard deviation to get the value of 0.66667 and 0.57735 exactly?
[ 0.66667 0.57735 3. 0.57735]
First you have to understand the Z matrix:
[[ 2. 7. 0. 2.] <== x[2] is linked with x[7], forming cluster x[8] = {x[2], x[7]}
[ 5. 6. 0. 2.]
[ 0. 4. 1. 2.] <== x[10] = {x[0], x[4]}
[ 8. 10. 1. 4.] <== x[11] = {x[8], x[10]} = {x[2], x[7], x[0], x[4]}
[ 1. 9. 1. 3.]
[ 3. 11. 2. 5.]
[ 12. 13. 4. 8.]]
There are 3 links included in the calculation of the fourth row:
| Link | Height/Distance |
| x[2] - x[7] | Z[0,2] = 0 |
| x[0] - x[4] | Z[2,2] = 1 |
| x[8] - x[10] | Z[3,2] = 1 |
The mean value of (1,1,0) is R[3,0] = 0.66667, and the standard deviation is R[3,1] = 0.57735 (normalized by N-1, not N). The inconsistent value is computed as:
R[i,3] = (Z[i,2] - R[i,0]) / R[i,1] = (1 - 0.66667) / 0.57735 = 0.57735    (here i = 3)
What does "all the links included in the calculation" really mean?
What does the d parameter of scipy.cluster.hierarchy.inconsistent(Z, d=2) really do?
For a cluster C, all the links below the cluster C, up to depth d, are considered to compute statistics (mean and std). In the above example, d=2 means we look at the link that created x[11] (depth 1), and the links below x[8] and x[10] (depth 2).
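The depth rule can be checked by hand with a short NumPy sketch. It hard-codes the Z matrix from the question and uses two hypothetical helpers, `link_heights` and `inconsistency_row`, to gather the link heights down to depth d and compute the statistics. It is meant to mirror the explanation above and is not SciPy's own implementation.

```python
import numpy as np

# Z matrix from the question (single linkage on X).
Z = np.array([
    [2, 7, 0, 2],
    [5, 6, 0, 2],
    [0, 4, 1, 2],
    [8, 10, 1, 4],
    [1, 9, 1, 3],
    [3, 11, 2, 5],
    [12, 13, 4, 8],
], dtype=float)

def link_heights(Z, row, d):
    """Heights of the link at `row` and of the links below it, down to depth d."""
    n = Z.shape[0] + 1  # number of original observations
    heights = [Z[row, 2]]
    if d > 1:
        for child in (int(Z[row, 0]), int(Z[row, 1])):
            if child >= n:  # child is a merged cluster, not an original point
                heights += link_heights(Z, child - n, d - 1)
    return heights

def inconsistency_row(Z, row, d=2):
    """Return (mean, std, number of links, inconsistency coefficient) for one row."""
    h = np.array(link_heights(Z, row, d))
    mean = h.mean()
    std = h.std(ddof=1) if len(h) > 1 else 0.0  # normalized by N-1
    coeff = (Z[row, 2] - mean) / std if std > 0 else 0.0
    return mean, std, len(h), coeff

print(inconsistency_row(Z, 3))  # fourth row: links with heights 1, 0 and 1
```

With d=2 this matches the fourth row of the output shown in the question (0.66667, 0.57735, 3, 0.57735). Raising d would pull in links further down the tree.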