finding the k nearest neighbours - python

I'm currently teaching myself Python for data science and stumbled upon a chapter that I have been staring at for hours without understanding it. I hope you can help me. The example implements k-nearest neighbors. The code looks like this:
X = np.random.rand(10,2)
dist_sq = np.sum((X[:, np.newaxis,:] - X[np.newaxis,:,:])** 2, axis = -1)
nearest = np.argsort(dist_sq, axis=1)
K = 2
nearest_partition = np.argpartition(dist_sq, K + 1, axis=1)
plt.scatter(X[:, 0], X[:, 1], s=100)
# draw lines from each point to its two nearest neighbors
K=2
for i in range(X.shape[0]):
    for j in nearest_partition[i, :K+1]:
        plt.plot(*zip(X[j], X[i]), color='black')
I do understand the premise of calculating the Euclidean distance, but the 3D array and the rest are too abstract for me. Thank you for dumbing it down for me. I appreciate every answer!
FYI: the book I'm learning from is the Python Data Science Handbook, page 89.

Array broadcasting in 3 dimensions is pretty tricky to wrap your head around.
Let's start with 2 dimensions:
X = np.arange(5)
X[np.newaxis,:] + 10*X[:,np.newaxis]
array([[ 0,  1,  2,  3,  4],
       [10, 11, 12, 13, 14],
       [20, 21, 22, 23, 24],
       [30, 31, 32, 33, 34],
       [40, 41, 42, 43, 44]])
As you can see, when we add a (1 x N) row vector to a (N x 1) column vector, we get a (N x N) matrix. Right before adding, the row vector becomes an (N x N) matrix where every row is the same. Similarly, the column vector becomes an (N x N) matrix where every column is the same. In some sense, this is a shorthand way of doing the following operation.
X1 = np.array([[0., 1., 2., 3., 4.],
               [0., 1., 2., 3., 4.],
               [0., 1., 2., 3., 4.],
               [0., 1., 2., 3., 4.],
               [0., 1., 2., 3., 4.]])
X2 = np.array([[ 0.,  0.,  0.,  0.,  0.],
               [10., 10., 10., 10., 10.],
               [20., 20., 20., 20., 20.],
               [30., 30., 30., 30., 30.],
               [40., 40., 40., 40., 40.]])
Clearly, X1 + X2 will get us the same answer we got before.
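One way to convince yourself of this is to let NumPy materialise the expanded operands. The sketch below uses np.broadcast_arrays; the names row and col are only illustrative, and X1 and X2 mirror the hand-written matrices above.

```python
import numpy as np

X = np.arange(5)
row = X[np.newaxis, :]        # shape (1, 5)
col = 10 * X[:, np.newaxis]   # shape (5, 1)

# broadcast_arrays expands both operands to the common (5, 5) shape
X1, X2 = np.broadcast_arrays(row, col)

print(X1.shape, X2.shape)                   # (5, 5) (5, 5)
print(np.array_equal(X1 + X2, row + col))   # True
```

Every row of X1 is the same and every column of X2 is the same, which is exactly the repetition described above.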
So how does this work in 3 dimensions? Very much the same. Before, we repeated the 1st dimension across the 2nd dimension (and vice versa).
X1 = X[:, np.newaxis,:]
X2 = X[np.newaxis,:,:]
difference = X1 - X2
Right before subtracting, X1's 1st and 3rd dimensions are repeated for every slice in the 2nd dimension. X2's 2nd and 3rd dimensions are repeated for every 1st dimension slice. Let's observe with some easier-to-read matrices.
X1 = np.array([[1., 2.],
               [3., 4.],
               [5., 6.]])
X2 = np.array([[10., 20.],
               [30., 40.],
               [50., 60.]])
X1[:, np.newaxis, :] + X2[np.newaxis, :, :]
array([[[11., 22.],
        [31., 42.],
        [51., 62.]],
       [[13., 24.],
        [33., 44.],
        [53., 64.]],
       [[15., 26.],
        [35., 46.],
        [55., 66.]]])
It's easiest to see X2 repeated across the 1st dimension (the blocks). Every block shows the same pattern of tens digits, which come from X2, while the units digits, which come from one row of X1, change from block to block. Perhaps it's easier to read this as a for loop of 2D broadcasting:
first_dimension = []
for i_row in range(X1.shape[0]):
    first_dimension.append(X2 + X1[i_row, :])
Hopefully it's clear now that
X1 = X[:, np.newaxis,:]
X2 = X[np.newaxis,:,:]
difference = X1 - X2
sq_diff = difference ** 2
sq_diff is a 3D array of shape (N, N, D): sq_diff[i, j, k] is the squared difference between the kth coordinate of point i and the kth coordinate of point j.
ssq_diff = np.sum(sq_diff, axis = -1)
then just sums across the 3rd dimension (axis = -1 just means the last dimension in the array). Now ssq_diff is a 2D matrix, where each element is the squared Euclidean distance between two of the data points. For row i and column j, ssq_diff[i,j] is the squared Euclidean distance between the ith and jth rows of X. The square root is not needed for finding nearest neighbours, because taking it does not change the ordering.
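If the broadcasting still feels abstract, it may help to check it against an explicit double loop. This is only a sketch: the seeded generator and the name dist_sq_loop are my own choices, not the book's.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((10, 2))

# Broadcasting version: (10, 1, 2) - (1, 10, 2) -> (10, 10, 2)
diff = X[:, np.newaxis, :] - X[np.newaxis, :, :]
dist_sq = np.sum(diff ** 2, axis=-1)

# Explicit double loop computing the same quantity
n = X.shape[0]
dist_sq_loop = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        dist_sq_loop[i, j] = np.sum((X[i] - X[j]) ** 2)

print(np.allclose(dist_sq, dist_sq_loop))   # True

# Column 0 of the sorted indices is the point itself (distance 0),
# so the K nearest neighbours are the next K columns.
K = 2
nearest = np.argsort(dist_sq, axis=1)[:, 1:K + 1]
print(nearest.shape)                        # (10, 2)
```

The loop makes it explicit that entry [i, j] depends only on rows i and j of X; broadcasting just computes all of those pairs at once.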

Related

Numpy add smaller matrix to a bigger one

I have big 3D matrices indicating the position of agents in a 3D space. A value of the matrix is 0 if there is no agent at that position and 1 if there is one.
My problem is that I want the agents to 'grow', in the sense that I want each one to be represented by, let's say, a cube (3x3x3) of ones. I've already found a way to do it, but I'm having trouble when the agent is close to the borders.
For example, I have a matrix of positions 100x100x100, if I know my agent is at position (x, y, z) I will do:
positions_matrix = numpy.zeros((100, 100, 100))
positions_matrix[x - 1: x + 2, y - 1: y + 2, z - 1: z + 2] += numpy.ones((3, 3, 3))
Of course, in my real code I'm looping over more positions, but this is basically it. This works, but the problem comes when the agent is too close to the border: the sum can't be computed because the slice of the matrix is smaller than the ones matrix.
Any idea how to solve this, or whether numpy or any other package has an implementation for it? I couldn't find one, although I'm pretty sure I'm not the first to face this.
A slightly more programmatic way of solving the problem:
import numpy as np
m = np.zeros((100, 100, 100))
slicing = tuple(
    # slice stops are exclusive, so the upper bound is d, not d - 1
    slice(max(0, x_i - 1), min(x_i + 2, d))
    for x_i, d in zip((x, y, z), m.shape))
ones_shape = tuple(s.stop - s.start for s in slicing)
m[slicing] += np.ones(ones_shape)
But it is otherwise the same as the accepted answer.
You should cut at the lower and upper bounds, using something like:
import numpy as np
m = np.zeros((100, 100, 100))
# slice stops are exclusive, so clip the upper bound at the axis length
x_min, x_max = np.max([0, x-1]), np.min([x+2, m.shape[0]])
y_min, y_max = np.max([0, y-1]), np.min([y+2, m.shape[1]])
z_min, z_max = np.max([0, z-1]), np.min([z+2, m.shape[2]])
m[x_min:x_max, y_min:y_max, z_min:z_max] += np.ones((x_max-x_min, y_max-y_min, z_max-z_min))
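The clipping idea can be wrapped in a small helper and checked at a corner. This is a sketch only: add_cube and its radius parameter are hypothetical names, and it adds a scalar (which broadcasts) instead of building a ones array of the right shape.

```python
import numpy as np

def add_cube(m, center, radius=1):
    """Add 1 to the cube around `center`, clipped to the bounds of `m`."""
    slicing = tuple(
        # stop is exclusive, so it may be as large as the axis length d
        slice(max(0, c - radius), min(c + radius + 1, d))
        for c, d in zip(center, m.shape)
    )
    m[slicing] += 1  # the scalar broadcasts over whatever shape survives the clipping
    return m

m = np.zeros((100, 100, 100))
add_cube(m, (50, 50, 50))   # interior agent: full 3x3x3 cube, 27 cells
add_cube(m, (0, 0, 99))     # corner agent: only a 2x2x2 cube fits, 8 cells
print(m.sum())              # 35.0
```

Because the increment is a scalar, there is no shape mismatch to worry about near the border.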
There is a solution using np.put, and its 'clip' option.
It just requires a little gymnastics because the function requires indices in the flattened matrix; fortunately, the function np.ravel_multi_index does the job:
import itertools
import numpy as np
x, y, z = 2, 0, 4
positions_matrix = np.zeros((100,100,100))
indices = np.array( list( itertools.product( (x-1, x, x+1), (y-1, y, y+1), (z-1, z, z+1)) ))
flat_indices = np.ravel_multi_index(indices.T, positions_matrix.shape, mode='clip')
positions_matrix.put(flat_indices, 1+positions_matrix.take(flat_indices))
# positions_matrix[2,1,4] is now 1.0
The nice thing about this solution is that you can play with other modes, for instance 'wrap' (if your agents live on a donut ;-) or in a periodic space).
I'll explain how it works on a smaller 2D matrix:
import itertools
import numpy as np
positions_matrix = np.zeros((8,8))
ones = np.ones((3,3))
x, y = 0, 4
indices = np.array( list( itertools.product( (x-1, x, x+1), (y-1, y, y+1) )))
# array([[-1, 3],
# [-1, 4],
# [-1, 5],
# [ 0, 3],
# [ 0, 4],
# [ 0, 5],
# [ 1, 3],
# [ 1, 4],
# [ 1, 5]])
flat_indices = np.ravel_multi_index(indices.T, positions_matrix.shape, mode='clip')
# array([ 3, 4, 5, 3, 4, 5, 11, 12, 13])
positions_matrix.put(flat_indices, ones, mode='clip')
# positions_matrix is now:
# array([[0., 0., 0., 1., 1., 1., 0., 0.],
# [0., 0., 0., 1., 1., 1., 0., 0.],
# [0., 0., 0., 0., 0., 0., 0., 0.],
# [ ...
By the way, in this case mode='clip' was redundant for put.
Well, I just cheated: put does an assignment. The += 1 requires both take and put:
positions_matrix.put(flat_indices, ones.flat + positions_matrix.take(flat_indices))
# notice that ones has to be flattened, or alternatively the result of take could be reshaped (3,3)
# positions_matrix is now:
# array([[0., 0., 0., 2., 2., 2., 0., 0.],
# [0., 0., 0., 2., 2., 2., 0., 0.],
# [0., 0., 0., 0., 0., 0., 0., 0.],
# [ ...
There is one important difference in this solution compared to the others: the ones matrix is always (3,3),
which may or may not be an advantage.
The trick is in this flat_indices list, which has repeated entries (a result of clip).
It may thus require some precautions if you add a non-constant sub-matrix at the max indices:
x, y = 1, 7
values = 1 + np.arange(9)
indices = np.array( list( itertools.product( (x-1, x, x+1), (y-1, y, y+1) )))
flat_indices = np.ravel_multi_index(indices.T, positions_matrix.shape, mode='clip')
positions_matrix.put(flat_indices, values, mode='clip')
# positions_matrix is now:
# array([[0., 0., 0., 2., 2., 2., 1., 3.],
# [0., 0., 0., 2., 2., 2., 4., 6.],
# [0., 0., 0., 0., 0., 0., 7., 9.],
... you were probably expecting the last column to be 2 5 8.
Currently, you could work on flat_indices, for example by putting -1 in the out-of-bounds locations.
But it'd all be easier if np.put accepted non-flat indices, or if there was a clip mode='ignore'.
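One way to get the "ignore" behaviour wished for above is to drop the out-of-bounds index pairs first and then accumulate with np.add.at. Note that this is a different technique from np.put, sketched here on the same 8x8 example starting from a fresh matrix; the names in_bounds, rows and cols are only illustrative.

```python
import itertools
import numpy as np

positions_matrix = np.zeros((8, 8))
x, y = 1, 7
values = 1 + np.arange(9)

indices = np.array(list(itertools.product((x - 1, x, x + 1), (y - 1, y, y + 1))))

# Keep only the index pairs that fall inside the matrix
shape = np.array(positions_matrix.shape)
in_bounds = np.all((indices >= 0) & (indices < shape), axis=1)

rows, cols = indices[in_bounds].T
np.add.at(positions_matrix, (rows, cols), values[in_bounds])

print(positions_matrix[:3, 6:])
# [[1. 2.]
#  [4. 5.]
#  [7. 8.]]
```

The last column now holds 2, 5, 8, because the values that would have landed outside the matrix are discarded instead of being folded onto the edge.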

Adding Numpy arrays like Counters

Since collections.Counter is so slow, I am pursuing a faster method of summing mapped values in Python 2.7. It seems like a simple concept and I'm kind of disappointed in the built-in Counter method.
Basically, I need to be able to take arrays like this:
array([[ 0., 2.],
[ 2., 2.],
[ 3., 1.]])
array([[ 0., 3.],
[ 1., 1.],
[ 2., 5.]])
And then "add" them so they look like this:
array([[ 0., 5.],
[ 1., 1.],
[ 2., 7.],
[ 3., 1.]])
If there isn't a good way to do this quickly and efficiently, I'm open to any other ideas that will allow me to do something similar to this, and I'm open to modules other than Numpy.
Thanks!
Edit: Ready for some speedtests?
Intel win 64bit machine. All of the following values are in seconds; 20000 loops.
collections.Counter results:
2.131000, 2.125000, 2.125000
Divakar's union1d + masking results:
1.641000, 1.633000, 1.625000
Divakar's union1d + indexing results:
0.625000, 0.625000, 0.641000
Histogram results:
1.844000, 1.938000, 1.858000
Pandas results:
16.659000, 16.686000, 16.885000
Conclusions: union1d + indexing wins, the array size is too small for Pandas to be effective, and the histogram approach blew my mind with its simplicity but I'm guessing it takes too much overhead to create. All of the responses I received were very good, though. This is what I used to get the numbers. Thanks again!
Edit: And it should be mentioned that using Counter1.update(Counter2.elements()) is terrible despite doing the same exact thing (65.671000 sec).
Later Edit: I've been thinking about this a lot, and I've come to realize that, with Numpy, it might be more effective to fill each array with zeros so that the first column isn't even needed since we can just use the index, and that would also make it much easier to add multiple arrays together as well as do other functions. Additionally, Pandas makes more sense than Numpy since there would be no need to 0-fill, and it would definitely be more effective with large data sets (however, Numpy has the advantage of being compatible on more platforms, like GAE, if that matters at all). Lastly, the answer I checked was definitely the best answer for the exact question I asked--adding the two arrays in the way I showed--but I think what I needed was a change in perspective.
Here's one approach with np.union1d and masking -
def app1(a,b):
    c0 = np.union1d(a[:,0],b[:,0])
    out = np.zeros((len(c0),2))
    out[:,0] = c0
    mask1 = np.in1d(c0,a[:,0])
    out[mask1,1] = a[:,1]
    mask2 = np.in1d(c0,b[:,0])
    out[mask2,1] += b[:,1]
    return out
Sample run -
In [174]: a
Out[174]:
array([[ 0., 2.],
[ 12., 2.],
[ 23., 1.]])
In [175]: b
Out[175]:
array([[ 0., 3.],
[ 1., 1.],
[ 12., 5.]])
In [176]: app1(a,b)
Out[176]:
array([[ 0., 5.],
[ 1., 1.],
[ 12., 7.],
[ 23., 1.]])
Here's another with np.union1d and indexing -
def app2(a,b):
    n = np.maximum(a[:,0].max(), b[:,0].max())+1
    c0 = np.union1d(a[:,0],b[:,0])
    out0 = np.zeros((int(n), 2))
    out0[a[:,0].astype(int),1] = a[:,1]
    out0[b[:,0].astype(int),1] += b[:,1]
    out = out0[c0.astype(int)]
    out[:,0] = c0
    return out
For the case where all indices are covered by the first column values in a and b -
def app2_specific(a,b):
    c0 = np.union1d(a[:,0],b[:,0])
    n = c0[-1]+1
    out0 = np.zeros((int(n), 2))
    out0[a[:,0].astype(int),1] = a[:,1]
    out0[b[:,0].astype(int),1] += b[:,1]
    out0[:,0] = c0
    return out0
Sample run -
In [234]: a
Out[234]:
array([[ 0., 2.],
[ 2., 2.],
[ 3., 1.]])
In [235]: b
Out[235]:
array([[ 0., 3.],
[ 1., 1.],
[ 2., 5.]])
In [236]: app2_specific(a,b)
Out[236]:
array([[ 0., 5.],
[ 1., 1.],
[ 2., 7.],
[ 3., 1.]])
If you know the number of fields, use np.bincount.
c = np.vstack([a, b])
# bincount needs integer keys, so cast the first column
counts = np.bincount(c[:, 0].astype(int), weights = c[:, 1], minlength = numFields)
out = np.vstack([np.arange(numFields), counts]).T
This works if you're getting all your data at once. Make a list of your arrays and vstack them. If you're getting data chunks sequentially, you can use np.add.at to do the same thing.
out = np.zeros((numFields, 2))
out[:, 0] = np.arange(numFields)
# index arrays must be integers
np.add.at(out[:, 1], a[:, 0].astype(int), a[:, 1])
np.add.at(out[:, 1], b[:, 0].astype(int), b[:, 1])
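Both variants can be run end to end on the sample arrays from the question. This is a sketch that assumes the number of fields is known to be 4 (keys 0 to 3); num_fields, out_bincount and out_add_at are illustrative names.

```python
import numpy as np

a = np.array([[0., 2.], [2., 2.], [3., 1.]])
b = np.array([[0., 3.], [1., 1.], [2., 5.]])
num_fields = 4  # assumption: keys are known to lie in 0..3

# All data at once: stack, then let bincount sum the weights per key
c = np.vstack([a, b])
counts = np.bincount(c[:, 0].astype(int), weights=c[:, 1], minlength=num_fields)
out_bincount = np.column_stack([np.arange(num_fields), counts])

# Chunk by chunk: accumulate into a preallocated array
out_add_at = np.zeros((num_fields, 2))
out_add_at[:, 0] = np.arange(num_fields)
for chunk in (a, b):
    # out_add_at[:, 1] is a view, so add.at updates out_add_at in place
    np.add.at(out_add_at[:, 1], chunk[:, 0].astype(int), chunk[:, 1])

print(out_bincount)
# [[0. 5.]
#  [1. 1.]
#  [2. 7.]
#  [3. 1.]]
```

Both routes give the same table; the np.add.at version is the one to reach for when the arrays arrive one at a time.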
You can use a basic histogram, this will deal with gaps, too. You can filter out zero-count entries if need be.
import numpy as np
x = np.array([[ 0., 2.],
[ 2., 2.],
[ 3., 1.]])
y = np.array([[ 0., 3.],
[ 1., 1.],
[ 2., 5.],
[ 5., 3.]])
c, w = np.vstack((x,y)).T
h, b = np.histogram(c, weights=w,
bins=np.arange(c.min(),c.max()+2))
r = np.vstack((b[:-1], h)).T
print(r)
# [[ 0. 5.]
# [ 1. 1.]
# [ 2. 7.]
# [ 3. 1.]
# [ 4. 0.]
# [ 5. 3.]]
r_nonzero = r[r[:,1]!=0]
Pandas has some functions that do exactly what you intend:
import pandas as pd
pda = pd.DataFrame(a).set_index(0)
pdb = pd.DataFrame(b).set_index(0)
result = pd.concat([pda, pdb], axis=1).fillna(0).sum(axis=1)
Edit: If you actually need the data back in numpy format, just do
array_res = result.reset_index(name=1).values
This is a quintessential grouping problem, which numpy_indexed (disclaimer: I am its author) was created to solve elegantly and efficiently:
import numpy_indexed as npi
C = np.concatenate([A, B], axis=0)
labels, sums = npi.group_by(C[:, 0]).sum(C[:, 1])
Note: it's cleaner to maintain your label arrays as a separate int array; floats are finicky when it comes to labeling things, with positive and negative zeros, and printed values not relaying all binary state. Better to use ints for that.

Joining Array In Python

Hi, I want to join multiple arrays in Python, using numpy, to form a multidimensional array. This happens inside a for loop; here is some pseudocode:
import numpy as np
h = np.zeros(4)
for x in range(3):
    x1 = some array of length of 4 returned from a previous function (3,5,6,7)
    h = np.concatenate((h,x1), axis =0)
The first iteration goes fine, but during the second iteration on the for loop I get the following error,
ValueError: all the input arrays must have same number of dimensions
The output array should look something like this
[[0,0,0,0],[3,5,6,7],[6,3,6,7]]
etc
So how can I join the arrays?
Thanks
You need to use vstack. It allows you to stack arrays. You take a sequence of arrays and stack them vertically to make a single array
import numpy as np
h = np.zeros(4)
for x in range(3):
    x1 = [3,5,6,7]
    h = np.vstack((h,x1))
    # not h = np.concatenate((h,x1), axis =0)
print(h)
Output:
[[ 0. 0. 0. 0.]
[ 3. 5. 6. 7.]
[ 3. 5. 6. 7.]
[ 3. 5. 6. 7.]]
more edits later.
If you want to use concatenate only, you can do it the following way as well:
import numpy as np
h1 = np.zeros(4)
for x in range(3):
    x1 = np.array([3,5,6,7])
    h1= np.concatenate([h1,x1.T], axis =0)
print(h1.shape)
print(h1.reshape(4,4))
Output:
(16,)
[[ 0. 0. 0. 0.]
[ 3. 5. 6. 7.]
[ 3. 5. 6. 7.]
[ 3. 5. 6. 7.]]
Both have different applications. You can choose according to your need.
There are multiple ways of doing this. I'll list a few examples:
First, we import numpy and define a function that generates those arrays of length 4.
import numpy as np
def previous_function_returning_array_of_length_4(x):
    return np.array(range(4)) + x
The first way involves creating a list of arrays, then calling numpy.array() to convert the list to a 2D array.
h0 = np.zeros(4)
arrays = [h0]
for x in range(3):
    x1 = previous_function_returning_array_of_length_4(x)
    arrays.append(x1)
h = np.array(arrays)
You can do the same with np.vstack():
h0 = np.zeros(4)
arrays = [h0]
for x in range(3):
    x1 = previous_function_returning_array_of_length_4(x)
    arrays.append(x1)
h = np.vstack(arrays)
Alternatively, if you know how many arrays you are going to create, you can create the 2D array first and fill in the values:
h = np.zeros((4, 4))
for ii in range(3):
    x1 = previous_function_returning_array_of_length_4(ii)
    h[ii + 1, ...] = x1
There are more ways, but hopefully, this will give you an idea of what to do.
It is best to collect values in a list, and perform the concatenate or array creation once, at the end.
h = [np.zeros(4)]
for x in range(3):
    x1 = np.array([3, 5, 6, 7])  # stand-in for the length-4 array returned by a previous function
    h.append(x1)  # list.append works in place and returns None, so do not reassign h
h = np.array(h)
# or h = np.vstack(h)
All the concatenate/stack/array functions take a list of multiple items. It is faster to append to a list than to do a concatenate of 2 items.
======================
Let's try your approach step by step:
In [189]: h=np.zeros(4)
In [190]: h
Out[190]: array([ 0., 0., 0., 0.]) # 1d array (4,) shape
In [191]: x1=np.array([3,5,6,7]) # another 1d
In [192]: h1=np.concatenate((h,x1),axis=0)
In [193]: h1
Out[193]: array([ 0., 0., 0., 0., 3., 5., 6., 7.])
In [194]: h1.shape
Out[194]: (8,) # also a 1d array, but with 8 items
In [195]: x1=np.array([6,3,6,7])
In [196]: h1=np.concatenate((h1,x1),axis=0)
In [197]: h1
Out[197]: array([ 0., 0., 0., 0., 3., 5., 6., 7., 6., 3., 6., 7.])
In this case I'm adding (4,) arrays one after the other, still getting a 1d array.
If I go back and create x1 as 2d (1,4):
In [198]: h=np.zeros(4)
In [199]: x1=np.array([[6,3,6,7]])
In [200]: h1=np.concatenate((h,x1),axis=0)
...
ValueError: all the input arrays must have same number of dimensions
I get this dimension error right away.
The fact that you get the error on the 2nd iteration suggests that the 1st x1 is (4,), but the 2nd is 2d.
When you have dimensions errors like this, check the shapes.
vstack adds dimensions to the inputs, as needed, so you can build 2d arrays:
In [207]: h=np.zeros(4)
In [208]: x1=np.array([3,5,6,7])
In [209]: h=np.vstack((h,x1))
In [210]: h
Out[210]:
array([[ 0., 0., 0., 0.],
[ 3., 5., 6., 7.]])
In [211]: x1=np.array([6,3,6,7])
In [212]: h=np.vstack((h,x1))
In [213]: h
Out[213]:
array([[ 0., 0., 0., 0.],
[ 3., 5., 6., 7.],
[ 6., 3., 6., 7.]])

How to efficiently make new matrix from sum of blocks from bigger sparse matrix

I have a large scipy sparse symmetric matrix which I need to condense by taking the sum of blocks to make a new smaller matrix.
For example, for a 4x4 sparse matrix A I would like to make a 2x2 matrix B in which B[i,j] = sum(A[2*i:2*i+2, 2*j:2*j+2]).
Currently, I just go block by block to recreate the condensed matrix but this is slow. Any ideas on how to optimize this?
Update: Here is some example code that works fine, but is slow for a sparse matrix of 50,000 x 50,000 that I want to condense into a 10,000 x 10,000 one:
>>> A = (rand(4,4)<0.3)*rand(4,4)
>>> A = scipy.sparse.lil_matrix(A + A.T) # make the matrix symmetric
>>> B = scipy.sparse.lil_matrix((2,2))
>>> for i in range(B.shape[0]):
...     for j in range(B.shape[1]):
...         B[i,j] = A[2*i:2*i+2, 2*j:2*j+2].sum()
First of all, a LIL matrix is probably a really bad format for the matrix you are summing over. I would try COO, or maybe CSR/CSC (I don't know which will be better, but LIL is probably inherently slower for many of these operations; even the slicing might be much slower, though I did not test it), unless you know that, for example, DIA fits perfectly.
Based on COO, I could imagine doing some trickery, since COO has row and col arrays that give the exact positions:
matrix = A.tocoo()
new_row = matrix.row // 5
new_col = matrix.col // 5
bin = (matrix.shape[0] // 5) * new_col + new_row
# Now do a little dance because this is sparse,
# and most of the possible bin should not be in new_row/new_col
# also need to group the bins:
unique, bin = np.unique(bin, return_inverse=True)
sum = np.bincount(bin, weights=matrix.data)
new_col = unique // (matrix.shape[0] // 5)
new_row = unique - new_col * (matrix.shape[0] // 5)
result = scipy.sparse.coo_matrix((sum, (new_row, new_col)))
(I won't guarantee that I didn't confuse row and column somewhere and this only works for square matrices...)
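A shorter route, which is a different technique from the bincount approach above, relies on the fact that SciPy sums duplicate (row, col) entries when a COO matrix is converted to CSR. The sketch below assumes the shape is divisible by the block size; block_sum is a hypothetical helper name.

```python
import numpy as np
import scipy.sparse as sp

def block_sum(A, d):
    """Sum non-overlapping d x d blocks of a sparse matrix (hypothetical helper)."""
    n_rows, n_cols = A.shape
    assert n_rows % d == 0 and n_cols % d == 0, "shape must be divisible by d"
    coo = A.tocoo()
    # Every entry of a block is mapped to the same (row // d, col // d) cell;
    # the resulting duplicate entries are summed on conversion to CSR.
    B = sp.coo_matrix(
        (coo.data, (coo.row // d, coo.col // d)),
        shape=(n_rows // d, n_cols // d),
    )
    return B.tocsr()

A = sp.csr_matrix(np.arange(16.0).reshape(4, 4))
print(block_sum(A, 2).toarray())
# [[10. 18.]
#  [42. 50.]]
```

Passing the shape explicitly keeps the result the right size even when the last blocks are entirely empty, and it works for non-square matrices as well.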
Given a square matrix of size N and a split size of d (so matrix will be partitioned into N/d * N/d sub-matrices of size d), could you use numpy.split a couple times to build a collection of those sub-matrices, sum each of them, and put them back together?
This should be treated more as pseudocode than an efficient implementation, but it expresses my idea:
def chunk(matrix, size):
    # Note: an integer second argument to np.split is the number of equal
    # sections, not the length of each section.
    row_wise = []
    for hchunk in np.split(matrix, size):
        row_wise.append(np.split(hchunk, size, 1))
    return row_wise
def sum_chunks(chunks):
    sum_rows = []
    for row in chunks:
        sum_rows.append([np.sum(col) for col in row])
    return np.array(sum_rows)
Or more compactly as
def sum_in_place(matrix, size):
    return np.array([[np.sum(vchunk) for vchunk in np.split(hchunk, size, 1)]
                     for hchunk in np.split(matrix, size)])
This gives you something like the following:
In [16]: a
Out[16]:
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11],
[12, 13, 14, 15]])
In [17]: chunk.sum_in_place(a, 2)
Out[17]:
array([[10, 18],
[42, 50]])
For a 4x4 example you can do the following:
In [43]: a = np.arange(16.).reshape((4, 4))
In [44]: a
Out[44]:
array([[ 0., 1., 2., 3.],
[ 4., 5., 6., 7.],
[ 8., 9., 10., 11.],
[ 12., 13., 14., 15.]])
In [45]: u = np.array([a[:2, :2], a[:2, 2:], a[2:,:2], a[2:, 2:]])
In [46]: u
Out[46]:
array([[[ 0., 1.],
[ 4., 5.]],
[[ 2., 3.],
[ 6., 7.]],
[[ 8., 9.],
[ 12., 13.]],
[[ 10., 11.],
[ 14., 15.]]])
In [47]: u.sum(1).sum(1).reshape(2, 2)
Out[47]:
array([[ 10., 18.],
[ 42., 50.]])
Using something like itertools it should be possible to automate and generalise an expression for u.
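For a dense array, one way to generalise this without itertools is a reshape followed by a sum over the two block axes. This is a sketch for dense NumPy arrays only (it does not apply to scipy sparse matrices directly), and block_sum_dense is a hypothetical name.

```python
import numpy as np

def block_sum_dense(a, d):
    """Sum non-overlapping d x d blocks of a dense 2D array."""
    n_rows, n_cols = a.shape
    # Split each axis into (block index, position inside block),
    # then sum over the two "position inside block" axes.
    return a.reshape(n_rows // d, d, n_cols // d, d).sum(axis=(1, 3))

a = np.arange(16.0).reshape(4, 4)
print(block_sum_dense(a, 2))
# [[10. 18.]
#  [42. 50.]]
```

After the reshape, element [i, r, j, c] is a[d*i + r, d*j + c], so summing over r and c gives the total of block (i, j).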

assign index dependant value to each index in numpy array

I want to center multi-dimensional data in an n x m matrix (<class 'numpy.matrixlib.defmatrix.matrix'>), let's say X. I defined a new array ones(645), let's say centVector, to hold the mean of every row in matrix X. Now I want to iterate over every row in X, compute the mean and assign this value to the corresponding index in centVector. Isn't this possible in a single line in scipy/numpy? I am not used to this language and am thinking of something like:
centVector = ones(645)
for key, val in X:
    centVector[key] = centVector[key] * (val.sum/val.size)
Afterwards I just need to subtract the mean from every row:
X = X - centVector
How can I simplify this?
EDIT: And besides, the above code is not actually working - for a key-value loop I need something like enumerate(X). And I am not sure if X - centVector is returning the proper solution.
First, some example data:
>>> import numpy as np
>>> X = np.matrix(np.arange(25).reshape((5,5)))
>>> print(X)
[[ 0 1 2 3 4]
[ 5 6 7 8 9]
[10 11 12 13 14]
[15 16 17 18 19]
[20 21 22 23 24]]
numpy conveniently has a mean function. By default however, it'll give you the mean over all the values in the array. Since you want the mean of each row, you need to specify the axis of the operation:
>>> np.mean(X, axis=1)
matrix([[ 2.],
[ 7.],
[ 12.],
[ 17.],
[ 22.]])
Note that axis=1 says: find the mean along the columns (for each row), where 0 = rows and 1 = columns (and so on). Now, you can subtract this mean from your X, as you did originally.
Unsolicited advice
Usually, it's best to avoid the matrix class (see docs). If you remove the np.matrix call from the example data, then you get a normal numpy array.
Unfortunately, in this particular case, using an array slightly complicates things because np.mean will return a 1D array:
>>> X = np.arange(25).reshape((5,5))
>>> r_means = np.mean(X, axis=1)
>>> print(r_means)
[ 2. 7. 12. 17. 22.]
If you try to subtract this from X, r_means gets broadcast to a row vector, instead of a column vector:
>>> X - r_means
array([[ -2., -6., -10., -14., -18.],
[ 3., -1., -5., -9., -13.],
[ 8., 4., 0., -4., -8.],
[ 13., 9., 5., 1., -3.],
[ 18., 14., 10., 6., 2.]])
So, you'll have to reshape the 1D array into an N x 1 column vector:
>>> X - r_means.reshape((-1, 1))
array([[-2., -1., 0., 1., 2.],
[-2., -1., 0., 1., 2.],
[-2., -1., 0., 1., 2.],
[-2., -1., 0., 1., 2.],
[-2., -1., 0., 1., 2.]])
The -1 passed to reshape tells numpy to figure out this dimension based on the original array shape and the rest of the dimensions of the new array. Alternatively, you could have reshaped the array using r_means[:, np.newaxis].
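Another option that avoids the reshape altogether is the keepdims argument of mean, which leaves the reduced axis in place with length 1. A minimal sketch, where centered is just an illustrative name:

```python
import numpy as np

X = np.arange(25).reshape((5, 5))

# keepdims=True returns shape (5, 1) instead of (5,), so it broadcasts as a column
r_means = X.mean(axis=1, keepdims=True)
centered = X - r_means

print(r_means.shape)            # (5, 1)
print(centered.mean(axis=1))    # [0. 0. 0. 0. 0.]
```

Each row of centered now has zero mean, which is a quick sanity check that the subtraction went along the intended axis.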
