Related
I'm currenlty teaching myself python for data science and stumbled upon a chapter that I have been looking at for hours but I don't understand. I hope you can help me understand it. In the example they want to code the k-nearest neighbors. The code looks like this:
X = np.random.rand(10,2)
dist_sq = np.sum((X[:, np.newaxis,:] - X[np.newaxis,:,:])** 2, axis = -1)
nearest = np.argsort(dist_sq, axis=1)
K = 2
nearest_partition = np.argpartition(dist_sq, K + 1, axis=1)
plt.scatter(X[:, 0], X[:, 1], s=100)
# draw lines from each point to its two nearest neighbors
K=2
for i in range(X.shape[0]):
for j in nearest_partition[i, :K+1]:
plt.plot(*zip(X[j], X[i]), color='black')
I do understand the premise of calculating the eucledian distance, but it is very abstract for me to understand the 3D array etc. Thank you guys for dumbing it down for me. I appreciate every answert! Thanks!
fyi: book Im learning from is python data science handbook page 89.
Array broadcasting in 3 dimensions is pretty tricky to wrap your head around.
lets start with 2 dimensions:
X = np.arange(5)
X[np.newaxis,:] + 10*X[:,np.newaxis]
array([[ 0, 1, 2, 3, 4],
[10, 11, 12, 13, 14],
[20, 21, 22, 23, 24],
[30, 31, 32, 33, 34],
[40, 41, 42, 43, 44]])
As you can see, when we add a (1 x N) row vector to a (N x 1) column vector, we get a (N x N) matrix. Right before adding, the row vector becomes an (N x N) matrix where every row is the same. Similarly, the column vector becomes an (N x N) matrix where every column is the same. In some sense, this is a shorthand way of doing the following operation.
X1 = np.array([[0., 1., 2., 3., 4.],
[0., 1., 2., 3., 4.],
[0., 1., 2., 3., 4.],
[0., 1., 2., 3., 4.],
[0., 1., 2., 3., 4.]])
X2 = np.array([[ 0., 0., 0., 0., 0.],
[10., 10., 10., 10., 10.],
[20., 20., 20., 20., 20.],
[30., 30., 30., 30., 30.],
[40., 40., 40., 40., 40.]])
Clearly, X1 + X2 will get us the same answer we got before.
So how does this work in 3 dimensions? Very much the same. Before we repeated the 1st dimesion across the 2nd dimension (and vice versa).
X1 = X[:, np.newaxis,:]
X2 = X[np.newaxis,:,:]
difference = X1 - X2
Right before subtracting, X1's 1st and 3rd dimensions are repeated for every slice in the 2nd dimension. X2's 2nd and 3rd dimensions are repeated for every 1st dimension slice. Lets observe with some easier-to-read matricies.
X1 = np.array([[1.,2.],
[3.,4.],
[5.,6.]])
X2 = np.array([[10.,20.],
[30.,40.],
[50.,60.]])
X1[:,np.newaxis,:] + X2[np.newaxis,:,:]
array([[[11., 22.],
[31., 42.],
[51., 62.]],
[[13., 24.],
[33., 44.],
[53., 64.]],
[[15., 26.],
[35., 46.],
[55., 66.]]])
Its easiest visually to see X2 repeated across the 1st Dimension (the blocks). In each block, we see the 10s digit is the same. Perhaps its easier to read this as a for loop of 2D broadcasting
first_dimension = []
for i_row in X1.shape[0]:
first_dimension.append(X2 + X1[i_row,:])
Hopefully its clear now that
X1 = X[:, np.newaxis,:]
X2 = X[np.newaxis,:,:]
difference = X1 - X2
sq_diff = difference ** 2
sq_diff is a 3D tensor, where each slice of the 3rd dimension is the squared difference between one column of X2 and one column of X1.
ssq_diff = np.sum(sq_diff, axis = -1)
then just sums across the 3rd dimension (axis = -1 just means the last dimension in the array). Now ssq_diff is a 2D matrix, where each element is the Euclidean distance between two of the data points. For row i and column j, ssq_diff[i,j] is the euclidean distance between the ith and jth row in X.
I have an array n×m, where n = 217000 and m = 3 (some data from telescope).
I need to calculate the distances between 2 points in 3D (according to my x, y, z coordinates in columns).
When I try to use sklearn tools the result is:
ValueError: array is too big; `arr.size * arr.dtype.itemsize` is larger than the maximum possible size.
What tool can I use in this situation and what max possible size for this tools?
What tool can I use in this situation...?
You could implement the euclidean distance function on your own using the approach suggested by #Saksow. Assuming that a and b are one-dimensional NumPy arrays, you could also use any of the methods proposed in this thread:
import numpy as np
np.linalg.norm(a-b)
np.sqrt(np.sum((a-b)**2))
np.sqrt(np.dot(a-b, a-b))
If you wish to compute in one go the pairwise distance (not necessarily the euclidean distance) between all the points in your array, the module scipy.spatial.distance is your friend.
Demo:
In [79]: from scipy.spatial.distance import squareform, pdist
In [80]: arr = np.asarray([[0, 0, 0],
...: [1, 0, 0],
...: [0, 2, 0],
...: [0, 0, 3]], dtype='float')
...:
In [81]: squareform(pdist(arr, 'euclidean'))
Out[81]:
array([[ 0. , 1. , 2. , 3. ],
[ 1. , 0. , 2.23606798, 3.16227766],
[ 2. , 2.23606798, 0. , 3.60555128],
[ 3. , 3.16227766, 3.60555128, 0. ]])
In [82]: squareform(pdist(arr, 'cityblock'))
Out[82]:
array([[ 0., 1., 2., 3.],
[ 1., 0., 3., 4.],
[ 2., 3., 0., 5.],
[ 3., 4., 5., 0.]])
Notice that the number of points in the mock data array used in this toy example is and the resulting pairwise distance array has elements.
...and what max possible size for this tools?
If you try to apply the approach above using your data () you get an error:
In [105]: data = np.random.random(size=(217000, 3))
In [106]: squareform(pdist(data, 'euclidean'))
Traceback (most recent call last):
File "<ipython-input-106-fd273331a6fe>", line 1, in <module>
squareform(pdist(data, 'euclidean'))
File "C:\Users\CPU 2353\Anaconda2\lib\site-packages\scipy\spatial\distance.py", line 1220, in pdist
dm = np.zeros((m * (m - 1)) // 2, dtype=np.double)
MemoryError
The issue is you are running out of RAM. To perform such computation you would need more than 350TB! The required amount of memory result from multiplying the number of elements of the distance matrix (2170002) by the number of bytes of each element of that matrix (8), and dividing this product by the apropriate factor (10243) to express the result in gigabytes:
In [107]: round(data.shape[0]**2 * data.dtype.itemsize / 1024.**3)
Out[107]: 350.8
So the maximum allowed size for your data is determined by the amount of available RAM (take a look at this thread for further details).
Using only Python and Euclidean distance formula for 3 dimensions:
import math
distance = math.sqrt((x1 - x2) ** 2 + (y1 - y2) ** 2 + (z1 - z2) ** 2)
I have a three-dimensional array of the following structure:
x = np.array([[[1,2],
[3,4]],
[[5,6],
[7,8]]], dtype=np.double)
Additionally, I have an index array
idx = np.array([[0,1],[1,3]], dtype=np.int)
Each row of idx defines the row/column indices for the placement of each sub-array along the 0 axis in x into a two-dimensional array K that is initialized as
K = np.zeros((4,4), dtype=np.double)
I would like to use fancy indexing/broadcasting to performing the indexing without a for loop. I currently do it this way:
for i, id in enumerate(idx):
idx_grid = np.ix_(id,id)
K[idx_grid] += x[i]
Such that the result is:
>>> K = array([[ 1., 2., 0., 0.],
[ 3., 9., 0., 6.],
[ 0., 0., 0., 0.],
[ 0., 7., 0., 8.]])
Is this possible to do with fancy indexing?
Here's one alternative way. With x, idx and K defined as in your question:
indices = (idx[:,None] + K.shape[1]*idx).ravel('f')
np.add.at(K.ravel(), indices, x.ravel())
Then we have:
>>> K
array([[ 1., 2., 0., 0.],
[ 3., 9., 0., 6.],
[ 0., 0., 0., 0.],
[ 0., 7., 0., 8.]])
To perform unbuffered inplace addition on NumPy arrays you need to use np.add.at (to avoid using += in a for loop).
However, it's slightly probelmatic to pass a list of 2D index arrays, and corresponding arrays to add at these indices, to np.add.at. This is because the function interprets these lists of arrays as higher-dimensional arrays and IndexErrors are raised.
It's much simpler to pass in 1D arrays. You can temporarily ravel K and x to give you a 1D array of zeros and a 1D array of values to add to those zeros. The only fiddly part is constructing a corresponding 1D array of indices from idx at which to add the values. This can be done via broadcasting with arithmetical operators and then ravelling, as shown above.
The intended operation is one of an accumulation of values from x into places indexed by idx. You could think of those idx places as bins of a histogram data and the x values as the weights that you need to accumulate for those bins. Now, to perform such a binning operation, np.bincount could be used. Here's one such implementation with it -
# Get size info of expected output
N = idx.max()+1
# Extend idx to cover two axes, equivalent to `np.ix_`
idx1 = idx[:,None,:] + N*idx[:,:,None]
# "Accumulate" values from x into places indexed by idx1
K = np.bincount(idx1.ravel(),x.ravel()).reshape(N,N)
Runtime tests -
1) Create inputs:
In [361]: # Create x and idx, with idx having unique elements in each row of idx,
...: # as otherwise the intended operation is not clear
...:
...: nrows = 100
...: max_idx = 100
...: ncols_idx = 2
...:
...: x = np.random.rand(nrows,ncols_idx,ncols_idx)
...: idx = np.random.randint(0,max_idx,(nrows,ncols_idx))
...:
...: valid_mask = ~np.any(np.diff(np.sort(idx,axis=1),axis=1)==0,axis=1)
...:
...: x = x[valid_mask]
...: idx = idx[valid_mask]
...:
2) Define functions:
In [362]: # Define the original and proposed (bincount based) approaches
...:
...: def org_approach(x,idx):
...: N = idx.max()+1
...: K = np.zeros((N,N), dtype=np.double)
...: for i, id in enumerate(idx):
...: idx_grid = np.ix_(id,id)
...: K[idx_grid] += x[i]
...: return K
...:
...:
...: def bincount_approach(x,idx):
...: N = idx.max()+1
...: idx1 = idx[:,None,:] + N*idx[:,:,None]
...: return np.bincount(idx1.ravel(),x.ravel()).reshape(N,N)
...:
3) Finally time them:
In [363]: %timeit org_approach(x,idx)
100 loops, best of 3: 2.13 ms per loop
In [364]: %timeit bincount_approach(x,idx)
10000 loops, best of 3: 32 µs per loop
I do not think it is efficiently possible, since you have += in the loop. This means, you would have to "blow up" your array idx by one dimension and reduce it again by utilizing np.sum(x[...], axis=...).
A minor optimization would be:
import numpy as np
xx = np.array([[[1, 2],
[3, 4]],
[[5, 6],
[7, 8]]], dtype=np.double)
idx = np.array([[0, 1], [1, 3]], dtype=np.int)
K0, K1 = np.zeros((4, 4), dtype=np.double), np.zeros((4, 4), dtype=np.double)
for k, i in enumerate(idx):
idx_grid = np.ix_(i, i)
K0[idx_grid] += xx[k]
for x, i in zip(xx, idx):
K1[np.ix_(i, i)] += x
print("K1 == K0:", np.allclose(K1, K0)) # prints: K1 == K0: True
PS: Do not use id as a variable name, since it is a Python keyword.
I have a numpy array with three columns of the form:
x1 y1 f1
x2 y2 f2
...
xn yn fn
The (x,y) pairs may repeat. I would need another array such that each (x,y) pair appears once and the corresponding third column is the sum of all the f values that appeared next to (x,y).
For example, the array
1 2 4.0
1 1 5.0
1 2 3.0
0 1 9.0
would give
0 1 9.0
1 1 5.0
1 2 7.0
The order of rows is not relevant. What is the fastest way to do this in Python?
Thank you!
This would be one approach to solve it -
import numpy as np
# Input array
A = np.array([[1,2,4.0],
[1,1,5.0],
[1,2,3.0],
[0,1,9.0]])
# Extract xy columns
xy = A[:,0:2]
# Perform lex sort and get the sorted indices and xy pairs
sorted_idx = np.lexsort(xy.T)
sorted_xy = xy[sorted_idx,:]
# Differentiation along rows for sorted array
df1 = np.diff(sorted_xy,axis=0)
df2 = np.append([True],np.any(df1!=0,1),0)
# OR df2 = np.append([True],np.logical_or(df1[:,0]!=0,df1[:,1]!=0),0)
# OR df2 = np.append([True],np.dot(df1!=0,[True,True]),0)
# Get unique sorted labels
sorted_labels = df2.cumsum(0)-1
# Get labels
labels = np.zeros_like(sorted_idx)
labels[sorted_idx] = sorted_labels
# Get unique indices
unq_idx = sorted_idx[df2]
# Get counts and unique rows and setup output array
counts = np.bincount(labels, weights=A[:,2])
unq_rows = xy[unq_idx,:]
out = np.append(unq_rows,counts.ravel()[:,None],1)
Input & Output -
In [169]: A
Out[169]:
array([[ 1., 2., 4.],
[ 1., 1., 5.],
[ 1., 2., 3.],
[ 0., 1., 9.]])
In [170]: out
Out[170]:
array([[ 0., 1., 9.],
[ 1., 1., 5.],
[ 1., 2., 7.]])
Thanks to #hpaulj, finally found the simplest solution. If d contains the 3-column data:
ind =d[0:2].astype(int)
x = zeros(shape=(N,N))
add.at(x,list(ind),d[2])
This solution assumes that the (x,y) indices in the first two columns are integer and smaller than N. This is what I need and should have mentioned in the post.
Edit: Note that the above solution produces a sparse matrix with the sum values at position (x,y) within the matrix.
Certainly easily done in Python:
arr = np.array([[1,2,4.0],
[1,1,5.0],
[1,2,3.0],
[0,1,9.0]])
d={}
for x, y, z in arr:
d.setdefault((x,y), 0)
d[x,y]+=z
>>> d
{(1.0, 2.0): 7.0, (0.0, 1.0): 9.0, (1.0, 1.0): 5.0}
Then translate back to numpy:
>>> np.array([[x,y,d[(x,y)]] for x,y in d.keys()])
array([[ 1., 2., 7.],
[ 0., 1., 9.],
[ 1., 1., 5.]])
If you have scipy, the sparse module does this kind of addition - again for an array where the 1st 2 columns are integers - ie. indexes.
from scipy import sparse
M = sparse.csr_matrix((d[:,0], (d[:,1],d[:,2])))
M = M.tocoo() # there may be a short cut to this csr coo round trip
x = np.column_stack([M.row, M.col, M.data]) # needs testing
For convenience in constructing certain kinds of linear algebra matrices, the csr sparse array format sums values with duplicate indices. It's implemented in compiled code so should be fairly fast. But putting the data into M and taking it back out might slow it down.
(ps. I haven't tested this script since I'm writing this on a machine without scipy).
I have a large scipy sparse symmetric matrix which I need to condense by taking the sum of blocks to make a new smaller matrix.
For example, for a 4x4 sparse matrix A I will like to make a 2x2 matrix B in which B[i,j] = sum(A[i:i+2,j:j+2]).
Currently, I just go block by block to recreate the condensed matrix but this is slow. Any ideas on how to optimize this?
Update: Here is an example code that works fine, but is slow for a sparse matrix of 50.000x50.000 that I want to condense in a 10.000x10.000:
>>> A = (rand(4,4)<0.3)*rand(4,4)
>>> A = scipy.sparse.lil_matrix(A + A.T) # make the matrix symmetric
>>> B = scipy.sparse.lil_matrix((2,2))
>>> for i in range(B.shape[0]):
... for j in range(B.shape[0]):
... B[i,j] = A[i:i+2,j:j+2].sum()
First of all, lil matrix for the one your are summing up is probably really bad, I would try COO or maybe CSR/CSS (I don't know which will be better, but lil is probably inherently slower for many of these operations, even the slicing might be much slower, though I did not test). (Unless you know that for example dia fits perfectly)
Based on COO I could imagine doing some tricking around. Since COO has row and col arrays to give the exact positions:
matrix = A.tocoo()
new_row = matrix.row // 5
new_col = matrix.col // 5
bin = (matrix.shape[0] // 5) * new_col + new_row
# Now do a little dance because this is sparse,
# and most of the possible bin should not be in new_row/new_col
# also need to group the bins:
unique, bin = np.unique(bin, return_inverse=True)
sum = np.bincount(bin, weights=matrix.data)
new_col = unique // (matrix.shape[0] // 5)
new_row = unique - new_col * (matrix.shape[0] // 5)
result = scipy.sparse.coo_matrix((sum, (new_row, new_col)))
(I won't guarantee that I didn't confuse row and column somewhere and this only works for square matrices...)
Given a square matrix of size N and a split size of d (so matrix will be partitioned into N/d * N/d sub-matrices of size d), could you use numpy.split a couple times to build a collection of those sub-matrices, sum each of them, and put them back together?
This should be treated more as pseudocode than an efficient implementation, but it expresses my idea:
def chunk(matrix, size):
row_wise = []
for hchunk in np.split(matrix, size):
row_wise.append(np.split(hchunk, size, 1))
return row_wise
def sum_chunks(chunks):
sum_rows = []
for row in chunks:
sum_rows.append([np.sum(col) for col in row])
return np.array(sum_rows)
Or more compactly as
def sum_in_place(matrix, size):
return np.array([[np.sum(vchunk) for vchunk in np.split(hchunk, size, 1)]
for hchunk in np.split(matrix, size)])
This gives you something like the following:
In [16]: a
Out[16]:
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11],
[12, 13, 14, 15]])
In [17]: chunk.sum_in_place(a, 2)
Out[17]:
array([[10, 18],
[42, 50]])
For a 4x4 example you can do the following:
In [43]: a = np.arange(16.).reshape((4, 4))
In [44]: a
Out[44]:
array([[ 0., 1., 2., 3.],
[ 4., 5., 6., 7.],
[ 8., 9., 10., 11.],
[ 12., 13., 14., 15.]])
In [45]: u = np.array([a[:2, :2], a[:2, 2:], a[2:,:2], a[2:, 2:]])
In [46]: u
Out[46]:
array([[[ 0., 1.],
[ 4., 5.]],
[[ 2., 3.],
[ 6., 7.]],
[[ 8., 9.],
[ 12., 13.]],
[[ 10., 11.],
[ 14., 15.]]])
In [47]: u.sum(1).sum(1).reshape(2, 2)
Out[47]:
array([[ 10., 18.],
[ 42., 50.]])
Using something like itertools it should be possible to automate and generalise an expression for u.