I am following a Machine Learning course with only basic knowledge of Python. While working through a Towards Data Science example on K-means clustering, I came across a way of indexing that I did not get to ask the professor about during the lecture.
Source
It's in the part where the graph is plotted with the centroids; the author uses indexing like this:
plt.scatter(
X[y_km == 2, 0], X[y_km == 2, 1],
s=50, c='lightblue',
marker='v', edgecolor='black',
label='cluster 3'
)
Does anybody know how this works?
I've tried running the indexing outside of plt.scatter, but that didn't tell me more than I already knew.
Here is an article that can help you understand ndarray indexing better: Indexing on ndarrays
So in your example X is a 2-dimensional ndarray with n rows and 2 columns: feature1 and feature2.
Simple example:
x = np.arange(20).reshape(10, 2)
array([[ 0, 1],
[ 2, 3],
[ 4, 5],
[ 6, 7],
[ 8, 9],
[10, 11],
[12, 13],
[14, 15],
[16, 17],
[18, 19]])
and a simple example of y, an array of class labels:
y = np.array([1, 2] * 5)
array([1, 2, 1, 2, 1, 2, 1, 2, 1, 2])
Let's consider you want to get all rows from X array which correspond to class 1.
You can simply do this using boolean array indexing like this:
x[y == 1]
array([[ 0, 1],
[ 4, 5],
[ 8, 9],
[12, 13],
[16, 17]])
But if you want only one column of those rows, you have to index both dimensions at once:
x[y == 1, 0] # all rows of feature1 (0 index) corresponding to class 1
array([ 0, 4, 8, 12, 16])
So here y == 1 selects the rows and 0 is the index of the column you are interested in.
X is an array of 2 columns. You can think of them as x and y coordinates.
By printing the first 10 rows, you see:
print(X[0:10])
[[ 2.60509732 1.22529553]
[ 0.5323772 3.31338909]
[ 0.802314 4.38196181]
[ 0.5285368 4.49723858]
[ 2.61858548 0.35769791]
[ 1.59141542 4.90497725]
[ 1.74265969 5.03846671]
[ 2.37533328 0.08918564]
[-2.12133364 2.66447408]
[ 1.72039618 5.25173192]]
y_km is the classification of these coordinates.
In the example, they are either classified as 0, 1, or 2
print(y_km[0:10])
[1 0 0 0 1 0 0 1 2 0]
But when you write y_km == 1, the comparison produces an array of Booleans, one per row:
print((y_km==1)[0:10])
[ True False False False True False False True False False]
So when you call
X[y_km == 1 , 1]
Essentially, you are asking for the rows of X whose label in y_km equals 1, and from those rows only the value in column 1. It grabs only the rows for which y_km == 1 is True, and only the value from the column specified (i.e. 1).
And
X[y_km == 2, 0]
This selects the rows of X for which y_km equals 2 and returns their values from column 0.
So the first number relates to the classification group that you want to gather, and the second number relates to the column of the X array that you want to retrieve from.
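To make this concrete, here is a minimal sketch with made-up data (the small X and y_km below are invented for illustration and are not the arrays from the article). It suggests that X[mask, col] gives the same result as filtering the rows first and picking the column afterwards.

```python
import numpy as np

# Toy stand-ins for the article's data (values are made up).
X = np.array([[0.0, 1.0],
              [2.0, 3.0],
              [4.0, 5.0],
              [6.0, 7.0]])
y_km = np.array([2, 0, 2, 1])

mask = y_km == 2      # boolean array with one entry per row of X
xs = X[mask, 0]       # column 0 of the rows where mask is True
ys = X[mask, 1]       # column 1 of the same rows

# The same selection in two steps: filter the rows, then pick a column.
xs_two_step = X[mask][:, 0]

print(xs)  # [0. 4.]
print(ys)  # [1. 5.]
```

In the plt.scatter call, the first argument is therefore the x coordinates of one cluster and the second argument the y coordinates of the same cluster.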
I'm working in python using numpy (could be a pandas series too) and am trying to make the following calculation:
Let's say I have an array corresponding to points on the x axis:
2, 9, 5, 6, 55, 8
For each element in this array I would like to get the distance to the closest element so the output would look like the following:
3, 1, 1, 1, 46, 1
I am trying to find a solution that can scale to 2D (distance to nearest XY point) and ideally would avoid a for loop. Is that possible?
There seems to be a theme with O(N^2) solutions here. For 1D, it's quite simple to get O(N log N):
x = np.array([2, 9, 5, 6, 55, 8])
i = np.argsort(x)
dist = np.diff(x[i])
min_dist = np.r_[dist[0], np.minimum(dist[1:], dist[:-1]), dist[-1]]
min_dist = min_dist[np.argsort(i)]
This clearly won't scale well to multiple dimensions, so use scipy.spatial.KDTree instead. Assuming your data is N-dimensional and has shape (M, N), you can do
from scipy.spatial import KDTree
k = KDTree(data)
dist = k.query(data, k=2)[0][:, -1]
Scipy has a Cython implementation of KDTree, cKDTree. Sklearn has a sklearn.neighbors.KDTree with a similar interface as well.
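As a rough illustration of the 2-D case, here is a small sketch with four made-up points (the coordinates and variable names are my own, not from the question). It asks for k=2 neighbours because the closest match of every point is the point itself.

```python
import numpy as np
from scipy.spatial import KDTree

# Four made-up 2-D points, shape (M, N) = (4, 2).
data = np.array([[0.0, 0.0],
                 [3.0, 4.0],
                 [3.0, 5.0],
                 [10.0, 10.0]])

tree = KDTree(data)
# Column 0 of the result is each point matched with itself (distance 0);
# column 1 holds the nearest *other* point.
dist, idx = tree.query(data, k=2)
nearest_dist = dist[:, 1]
nearest_idx = idx[:, 1]
```

If the data contains duplicated points, the second neighbour can be at distance 0 as well, so those would need to be handled separately.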
Approach 1
You can use broadcasting in order to get matrix of distances:
>>> data = np.array([2,9,5,6,55,8])
>>> dst_matrix = data - data[:, None]
>>> dst_matrix
array([[ 0, 7, 3, 4, 53, 6],
[ -7, 0, -4, -3, 46, -1],
[ -3, 4, 0, 1, 50, 3],
[ -4, 3, -1, 0, 49, 2],
[-53, -46, -50, -49, 0, -47],
[ -6, 1, -3, -2, 47, 0]])
Then we can eliminate diagonal as proposed in this post:
dst_matrix = dst_matrix[~np.eye(dst_matrix.shape[0],dtype=bool)].reshape(dst_matrix.shape[0],-1)
>>> dst_matrix
array([[ 7, 3, 4, 53, 6],
[ -7, -4, -3, 46, -1],
[ -3, 4, 1, 50, 3],
[ -4, 3, -1, 49, 2],
[-53, -46, -50, -49, -47],
[ -6, 1, -3, -2, 47]])
Finally, the minimum absolute value of each row can be found:
>>> np.min(np.abs(dst_matrix), axis=1)
array([ 3, 1, 1, 1, 46, 1])
Approach 2
If you're looking for a time- and memory-efficient solution, the best option is scipy.spatial.cKDTree, which packs points (of any dimension) into a data structure that is optimized for querying the closest points. It can also be extended to 2D or 3D.
import numpy as np
import scipy.spatial
data = np.array([2,9,5,6,55,8])
ckdtree = scipy.spatial.cKDTree(data[:,None])
distances, idx = ckdtree.query(data[:,None], k=2)
output = distances[:,1] #distances to not coincident points
Querying the two closest points for each point is required here because the first of them is expected to be the point itself. This is the only solution I found among all the proposed answers that doesn't take ages (the average performance is 4secs for 1M points). Warning: you need to filter duplicated points before applying this method.
There are many ways of achieving it. Some readable and generalizable ways, with a being the input array, are:
Approach 1:
dist = np.abs(a[:,None]-a)
np.min(dist, where=~np.eye(len(a),dtype=bool), initial=dist.max(), axis=1)
#[ 3 1 1 1 46 1]
Approach 2:
dist = np.abs(np.subtract.outer(a,a))
np.min(dist, where=~np.eye(len(a),dtype=bool), initial=dist.max(), axis=1)
For a 2-D case approach 1 (assumes Euclidean distance. Any other is also possible):
from scipy.spatial.distance import cdist
dist = cdist(a,a)
np.min(dist, where=~np.eye(len(a),dtype=bool), initial=dist.max(), axis=1)
For a 2-D case approach 2 with numpy only:
dist=np.sqrt(((a[:,None]-a)**2).sum(-1))
np.min(dist, where=~np.eye(len(a),dtype=bool), initial=dist.max(), axis=1)
You can achieve a faster distance calculation by using np.dot.
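One way the np.dot idea could look, as a sketch only: it relies on the identity |p - q|^2 = |p|^2 + |q|^2 - 2 p.q, and the points in a are made up for illustration. I have not benchmarked it against the versions above.

```python
import numpy as np

# Made-up 2-D points, one per row.
a = np.array([[0.0, 0.0],
              [3.0, 4.0],
              [3.0, 5.0],
              [10.0, 10.0]])

sq_norms = (a * a).sum(axis=1)        # squared length of every row
gram = np.dot(a, a.T)                 # all pairwise dot products
sq_dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * gram
sq_dist = np.maximum(sq_dist, 0.0)    # clip tiny negatives caused by rounding
dist = np.sqrt(sq_dist)

# Skip the diagonal (each point against itself) when taking the minimum.
off_diag = ~np.eye(len(a), dtype=bool)
nearest = np.min(dist, where=off_diag, initial=dist.max(), axis=1)
```

This still builds the full N x N matrix, so memory use stays quadratic in the number of points.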
You can do some list comprehension on a pandas series:
s = pd.Series([2,9,5,6,55,8])
s.apply(lambda x: min([abs(x - s[y]) for y in s.index if s[y] != x]))
Out[1]:
0 3
1 1
2 1
3 1
4 46
5 1
Then you can just add .to_list() or .to_numpy() to the end to get rid of the series index:
s.apply(lambda x: min([abs(x - s[y]) for y in s.index if s[y] != x])).to_numpy()
array([ 3, 1, 1, 1, 46, 1], dtype=int64)
I have created a vector of zeros called Qc_vector (18 rows x 1 column).
I have created another vector called s_vector (6 rows x 1 column) that is generated each time by a for loop within the range ingreso_datos, that is, for this example it is generated 5 times.
I have also created a list called indices that is generated for each iteration of the loop, these indices tell me the row number to which I should index the values from s_vector to Qc_vector
PROBLEM
When trying to do this I get the following error: ValueError: shape mismatch: value array of shape (6,) could not be broadcast to indexing result of shape (6,1)
For element 6 of the matrix ingreso_datos, the indices are: [1,2,3,4,5,6]
For the end of the loop, that is, for element number 5 s_vector it looks like this:
[Image: s_vector for element 5]
[Image: Qc_vector indexed, how it should look]
import numpy as np
# Element 1(i) 2(i) 3(i) 1(j) 2(j) 3(j) x(i) y(i) x(j) y(j) | W(kg/m) Axis(kg/m)
# [Col0] [Col1] [Col2] [Col3] [Col4] [Col5] [Col6] [Col7] [Col8] [Col9] [Col10] | [Col11] [Col12]
ingreso_datos = [[ 1, 13, 14, 15, 7, 8, 9, 0, 0, 0, 2.5, 0, 0],
                 [ 2, 16, 17, 18, 10, 11, 12, 4.5, 0, 4.5, 2.5, 0, 0],
                 [ 3, 7, 8, 9, 1, 2, 3, 4.5, 0, 4.5, 2.5, 0, 0],
                 [ 4, 10, 11, 12, 4, 5, 6, 4.5, 0, 4.5, 2.5, 0, 0],
                 [ 5, 7, 8, 9, 10, 11, 12, 4.5, 0, 4.5, 2.5, -2200, 0]]
Qc_vector = np.zeros((18,1)) # Vector of zeros (18 rows x 1 column)
for i in range(len(ingreso_datos)):
    indices = []
    indices.append([ingreso_datos[i][0], ingreso_datos[i][1], ingreso_datos[i][2], ingreso_datos[i][3],
                    ingreso_datos[i][4], ingreso_datos[i][5], ingreso_datos[i][6]])
    for row in indices:
        indices = np.array(row[1:])
    L = np.sqrt((ingreso_datos[i][9]-ingreso_datos[i][7])**2+(ingreso_datos[i][10]-ingreso_datos[i][8])**2)
    lx = (ingreso_datos[i][9]-ingreso_datos[i][7])/L
    ly = (ingreso_datos[i][10]-ingreso_datos[i][8])/L
    w = ingreso_datos[i][11]
    ad = ingreso_datos[i][12]
    s_vector = np.array([ad*L/2, w*L/2, (w*L**2)/12, ad*L/2, w*L/2, (-w*L**2)/12]) # s_vector
    Qc_vector[np.ix_(indices)] = s_vector # Indexing (this line raises the ValueError)
Qc_vector is (18,1).
indices = [[ingreso_datos[i][0], ingreso_datos[i][1], ingreso_datos[i][2], ingreso_datos[i][3], ingreso_datos[i][4], ingreso_datos[i][5], ingreso_datos[i][6]]]
or simply (if ingreso_datos is a NumPy array rather than a list of lists):
indices = [ingreso_datos[i,[0,1,2,3,4,5,6]]]
followed by:
for row in indices:
indices = np.array(row[1:])
which is just
ingreso_datos[i,[1,2,3,4,5,6]]
s_vector is a 6 element array, shape (6,)
In:
Qc_vector[np.ix_(indices)] = s_vector
you don't need ix_. In my previous answer I suggested:
master_matrix[np.ix_(indices,indices)] = little_matrix
as a way of doing the indexing for all rows, not just one at a time.
I think your assignment can be simplified to
Qc_vector[indices, 0] = s_vector
That way there's a shape (6,) array on both sides.
Or define Qc_vector with shape (18,) rather than (18,1).
I have a feeling you are still trying to write this code by copying other people's code, without understanding what is happening, or why they suggest things.
A quick fix, if you don't want to bother too much, would be to use numpy.reshape() so that s_vector has the same (6,1) shape as the indexing result.
This way you avoid the shape mismatch.
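A minimal sketch of the reshape idea, under two assumptions of mine: the numbers in ingreso_datos are 1-based row labels (so 1 is subtracted to get NumPy's 0-based rows), and s_vector holds placeholder values rather than the real loads.

```python
import numpy as np

Qc_vector = np.zeros((18, 1))                        # column vector, as in the question
indices = np.array([13, 14, 15, 7, 8, 9]) - 1        # assumed 1-based labels -> 0-based rows
s_vector = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])  # placeholder values

# Qc_vector[indices] has shape (6, 1), so give s_vector the same shape.
Qc_vector[indices] = s_vector.reshape(-1, 1)

# Equivalent: select column 0 explicitly so both sides have shape (6,).
check = np.zeros((18, 1))
check[indices, 0] = s_vector
```

Note that plain assignment overwrites earlier values; if several elements should contribute to the same row, something like np.add.at(Qc_vector, (indices, 0), s_vector) may be closer to what is wanted, but that depends on the intended result.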
I can't wrap my head around csr_matrix examples in scipy documentation: https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html
Can someone explain how this example works?
>>> row = np.array([0, 0, 1, 2, 2, 2])
>>> col = np.array([0, 2, 2, 0, 1, 2])
>>> data = np.array([1, 2, 3, 4, 5, 6])
>>> csr_matrix((data, (row, col)), shape=(3, 3)).toarray()
array([[1, 0, 2],
[0, 0, 3],
[4, 5, 6]])
I believe this is following this format.
csr_matrix((data, (row_ind, col_ind)), [shape=(M, N)])
where data, row_ind and col_ind satisfy the relationship a[row_ind[k], col_ind[k]] = data[k].
What is a here?
row = np.array([0, 0, 1, 2, 2, 2])
col = np.array([0, 2, 2, 0, 1, 2])
data = np.array([1, 2, 3, 4, 5, 6])
Here a is the resulting dense 3 x 3 matrix. From the above arrays, for k = 0, ..., 5:
a[row[k], col[k]] = data[k]
which gives
a[row[0], col[0]] = a[0, 0] = 1 (from data[0])
a[row[1], col[1]] = a[0, 2] = 2 (from data[1])
a[row[2], col[2]] = a[1, 2] = 3 (from data[2])
a[row[3], col[3]] = a[2, 0] = 4 (from data[3])
a[row[4], col[4]] = a[2, 1] = 5 (from data[4])
a[row[5], col[5]] = a[2, 2] = 6 (from data[5])
So let's arrange matrix 'a' with shape (3 x 3); every position that was not assigned stays 0:
a
0 1 2
0 [1, 0, 2]
1 [0, 0, 3]
2 [4, 5, 6]
This is a sparse matrix, so it stores explicit indices and the values at those indices. For example, the first entries of the three arrays are row=0, col=0 and data=1, hence the [0,0] entry of the matrix is 1. And so on.
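The relationship a[row_ind[k], col_ind[k]] = data[k] can be sketched by filling a dense array by hand and comparing it with what csr_matrix produces. The variable names a and sparse are my own choice.

```python
import numpy as np
from scipy.sparse import csr_matrix

row = np.array([0, 0, 1, 2, 2, 2])
col = np.array([0, 2, 2, 0, 1, 2])
data = np.array([1, 2, 3, 4, 5, 6])

# Build the dense matrix by hand: a[row[k], col[k]] = data[k].
a = np.zeros((3, 3), dtype=int)
for k in range(len(data)):
    a[row[k], col[k]] = data[k]

# The same triplets passed to csr_matrix describe the same matrix.
sparse = csr_matrix((data, (row, col)), shape=(3, 3))
```

One difference worth knowing: if the same (row, col) pair appears more than once, csr_matrix sums the values, whereas the loop above would overwrite them.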
Represent the "data" in a 4 X 4 Matrix:
data = np.array([10,0,5,99,25,9,3,90,12,87,20,38,1,8])
indices = np.array([0,1,2,3,0,2,3,0,1,2,3,1,2,3])
indptr = np.array([0,4,7,11,14])
'indptr' (index pointers) is an array of offsets into 'indices' and 'data': the column indices of row i are indices[indptr[i]:indptr[i+1]], and the matching values are data[indptr[i]:indptr[i+1]].
The last entry, 14, is the length of data, len(data), so another way of writing 'indptr' is:
indptr = np.array([0,4,7,11,len(data)])
0,4 --> indices[0:4] = 0,1,2,3 (columns of row 0)
4,7 --> indices[4:7] = 0,2,3 (columns of row 1)
7,11 --> indices[7:11] = 0,1,2,3 (columns of row 2)
11,14 --> indices[11:14] = 1,2,3 (columns of row 3)
# Representing the data in a 4,4 matrix
a = csr_matrix((data,indices,indptr),shape=(4,4),dtype=int)
a.todense()
matrix([[10, 0, 5, 99],
[25, 0, 9, 3],
[90, 12, 87, 20],
[ 0, 38, 1, 8]])
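To see how indptr slices the other two arrays, here is a small sketch that decodes the three arrays row by row using NumPy only. The name dense and the loop are mine and are not part of scipy.

```python
import numpy as np

data = np.array([10, 0, 5, 99, 25, 9, 3, 90, 12, 87, 20, 38, 1, 8])
indices = np.array([0, 1, 2, 3, 0, 2, 3, 0, 1, 2, 3, 1, 2, 3])
indptr = np.array([0, 4, 7, 11, 14])

n_rows = len(indptr) - 1
dense = np.zeros((n_rows, 4), dtype=int)
for i in range(n_rows):
    start, stop = indptr[i], indptr[i + 1]
    cols = indices[start:stop]          # column indices stored for row i
    dense[i, cols] = data[start:stop]   # matching values for row i

print(dense)
```

Row 3 only has entries for columns 1, 2 and 3, which is why its first element stays 0.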
Another Stackoverflow explanation
As far as I understand, the row and col arrays hold the indices of the non-zero values in the matrix: a[0, 0] = 1, a[0, 2] = 2, a[1, 2] = 3 and so on. As there are no indices for a[0, 1], a[1, 0] and a[1, 1], the corresponding values in the matrix are equal to 0.
Also, maybe this little intro will be helpful for you:
https://www.youtube.com/watch?v=Lhef_jxzqCg
@Rohit Pandey stated it correctly; I just want to add an example.
When most of the elements of a matrix are 0, we call it a sparse matrix. The sparse representation leaves out the zero elements and thus saves memory space and computing time. We only store the non-zero items with their respective row and column index. For example:
0 3 0 4
0 5 7 0
0 0 0 0
0 2 6 0
We calculate the sparse matrix by putting non-zero items row index first, then column index, and finally non-zero values like the following:
Row:    0 0 1 1 3 3
Column: 1 3 1 2 1 2
Value:  3 4 5 7 2 6
By reversing the process we get the simple matrix form from the sparse form.
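Both directions can be sketched with plain NumPy. This is only an illustration of the idea, not how scipy stores things internally, and the names rows, cols, values and rebuilt are my own.

```python
import numpy as np

dense = np.array([[0, 3, 0, 4],
                  [0, 5, 7, 0],
                  [0, 0, 0, 0],
                  [0, 2, 6, 0]])

# Dense -> triplets: positions and values of the non-zero items.
rows, cols = np.nonzero(dense)
values = dense[rows, cols]

# Triplets -> dense: start from zeros and write the values back.
rebuilt = np.zeros_like(dense)
rebuilt[rows, cols] = values

print(rows)    # [0 0 1 1 3 3]
print(cols)    # [1 3 1 2 1 2]
print(values)  # [3 4 5 7 2 6]
```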
import numpy
square = numpy.reshape(range(0,16),(4,4))
square
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11],
[12, 13, 14, 15]])
In the above array, how do I access the primary diagonal and secondary diagonal of any given element? For example 9.
by primary diagonal, I mean - [4,9,14],
by secondary diagonal, I mean - [3,6,9,12]
I can't use numpy.diag() because it returns the diagonal of the entire array.
Based on your description, you can use np.where, np.diagonal and np.fliplr:
import numpy as np
x,y=np.where(square==9)
np.diagonal(square, offset=int(y[0]-x[0])) # offset must be a plain integer
Out[382]: array([ 4, 9, 14])
x,y=np.where(np.fliplr(square)==9)
np.diagonal(np.fliplr(square), offset=int(y[0]-x[0]))
Out[396]: array([ 3, 6, 9, 12])
For the first diagonal, use the fact that both x_coordinate and y_coordinate increase by 1 each step:
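def first_diagonal(x, y, length_array):
    # x is the column index, y the row index; yields (row, column) pairs
    d = y - x  # row - column is constant along this diagonal
    if d >= 0:
        return zip(range(d, length_array), range(length_array - d))
    else:
        return zip(range(length_array + d), range(-d, length_array))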
For the secondary diagonal, use the fact that the x_coordinate + y_coordinate = constant.
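def second_diagonal(x, y, length_array):
    tot = x + y
    start = max(0, tot - length_array + 1)  # keep both indices inside the matrix
    stop = min(tot, length_array - 1)
    return zip(range(start, stop + 1), range(tot - start, tot - stop - 1, -1))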
This gives you two sequences of index pairs you can use to access your matrix.
Of course, if you have a non-square matrix these functions will have to be adapted a bit.
To illustrate how to get the desired output:
a = np.reshape(range(0,16),(4,4))
first = first_diagonal(1, 2, len(a))
second = second_diagonal(1,2, len(a))
primary_diagonal = [a[i[0]][i[1]] for i in first]
secondary_diagonal = [a[i[0]][i[1]] for i in second]
print(primary_diagonal)
print(secondary_diagonal)
this outputs:
[4, 9, 14]
[3, 6, 9, 12]
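For the non-square case, one possible sketch is to let np.diagonal do the bookkeeping. The helper name diagonals_through and the 3 x 4 test array are my own; r and c are the row and column of the element.

```python
import numpy as np

def diagonals_through(a, r, c):
    """Return (primary, secondary) diagonals of the 2-D array a through a[r, c]."""
    n_cols = a.shape[1]
    primary = np.diagonal(a, offset=c - r)
    # Flipping left-right turns anti-diagonals into ordinary diagonals;
    # column c moves to column n_cols - 1 - c.
    secondary = np.diagonal(np.fliplr(a), offset=(n_cols - 1 - c) - r)
    return primary, secondary

rect = np.arange(12).reshape(3, 4)
primary, secondary = diagonals_through(rect, 2, 1)   # element 9 sits at row 2, column 1

print(primary)    # [4 9]
print(secondary)  # [3 6 9]
```

On the 4 x 4 array from the question, the same call with r=2, c=1 returns [4, 9, 14] and [3, 6, 9, 12].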
What is the best way to create a 2D list (or numpy array) in python, in which the diagonal is set to -1 and the remaining values are increasing from 0 by 1, for different values of n. For example, if n = 3 the array would look like:
[[-1,0,1]
[2,-1,3]
[4,5,-1]]
or for n = 4:
[[-1,0,1,2]
[3,-1,4,5]
[6,7,-1,8]
[9,10,11,-1]]
etc.
I know I can create an array with zeros and with the diagonal set to -1 with:
a = numpy.zeros((n,n))
numpy.fill_diagonal(a,-1)
And so if n = 3 this would give:
[[-1,0,0]
[0,-1,0]
[0,0,-1]]
But how would I then set the 0's to be increasing numbers, as shown in the example above? Would I need to iterate through and set the values through a loop? Or is there a better way to approach this?
Thanks in advance.
One approach -
def set_matrix(n):
    out = np.full((n,n),-1)
    off_diag_mask = ~np.eye(n,dtype=bool)
    out[off_diag_mask] = np.arange(n*n-n)
    return out
Sample runs -
In [23]: set_matrix(3)
Out[23]:
array([[-1, 0, 1],
[ 2, -1, 3],
[ 4, 5, -1]])
In [24]: set_matrix(4)
Out[24]:
array([[-1, 0, 1, 2],
[ 3, -1, 4, 5],
[ 6, 7, -1, 8],
[ 9, 10, 11, -1]])
Here is an arithmetic way:
m=np.arange(n*n).reshape(n,n)*n//(n+1)
m.flat[::n+1]=-1
for n=5 :
[[-1 0 1 2 3]
[ 4 -1 5 6 7]
[ 8 9 -1 10 11]
[12 13 14 -1 15]
[16 17 18 19 -1]]
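The arithmetic version works because, at flat position k, the number of off-diagonal cells before it is k*n//(n+1). A quick sketch to check that both answers agree for a few sizes (the function names by_mask and by_arithmetic are mine):

```python
import numpy as np

def by_mask(n):
    # Fill everything with -1, then write 0, 1, 2, ... into the off-diagonal cells.
    out = np.full((n, n), -1)
    out[~np.eye(n, dtype=bool)] = np.arange(n * n - n)
    return out

def by_arithmetic(n):
    # Scale the flat index so that each diagonal cell "uses up" no number.
    m = np.arange(n * n).reshape(n, n) * n // (n + 1)
    m.flat[::n + 1] = -1   # every (n+1)-th flat position lies on the diagonal
    return m

agree = all(np.array_equal(by_mask(n), by_arithmetic(n)) for n in range(1, 8))
print(agree)  # True
```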