First n elements of row in numpy array - python

I'm trying to implement a k-nearest neighbour classifier in Python, and so I want to calculate the Euclidean distance. I have a dataset that I have converted into a big numpy array
[[ 0. 0. 4. ..., 1. 0. 1.]
[ 0. 0. 5. ..., 0. 0. 1.]
[ 0. 0. 14. ..., 16. 9. 1.]
...,
[ 0. 0. 3. ..., 2. 0. 3.]
[ 0. 1. 7. ..., 0. 0. 3.]
[ 0. 2. 10. ..., 0. 0. 3.]]
where the last element of each row indicates the class. So when calculating the Euclidean distance, I obviously don't want to include the last element. I thought I could do the following
for row in dataset:
    distance = euclidean_distance(vector, row[:dataset.shape[1] - 1])
but that still includes the last element
print row
>>> [[ 0. 0. 4. ..., 1. 0. 1.]]
print row[:dataset.shape[1] - 1]
>>> [[ 0. 0. 4. ..., 1. 0. 1.]]
as you can see both are the same.

You can subset the data using numpy slicing. If you find yourself iterating over a numpy array, stop and try to find a method that takes advantage of the vectorized nature of numpy operations.
Assuming your array is called arr:
data_points = arr[:,:-1]
classes = arr[:,-1]
For distance to vector calculations:
To find the distance between a 1d array and every row of a 2d array, you can use the following. It assumes the 1d array is v and the 2d array is arr.
dist = np.sqrt(np.power(arr - v, 2).sum(axis=1))
dist will be a 1d array of Euclidean distances.
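For the k-NN use case in the question, here is a minimal sketch of how that distance vector might be used (it assumes data_points and classes from above, a query vector v, and a hypothetical neighbour count k):
import numpy as np

k = 3  # hypothetical number of neighbours
dist = np.sqrt(np.power(data_points - v, 2).sum(axis=1))
nearest = np.argsort(dist)[:k]  # indices of the k closest rows
# majority vote over the classes of the k nearest neighbours
predicted = np.bincount(classes[nearest].astype(int)).argmax()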
For pairwise calculations:
The following function takes a 2d array of numbers and returns the upper-triangular matrix of pairwise distances under the given L-norm (the Euclidean distance is the L = 2 case).
def pairwise_distance(arr, L=2):
    d = arr.shape[0]
    out = np.zeros((d, d))  # square matrix so the superdiagonal trick below works
    for f in range(1, d):
        # Fill the f-th superdiagonal: distances between rows i and i + f.
        out[:-f].ravel()[f::d + 1] = np.power(np.abs(arr[:-f] - arr[f:]), L).sum(axis=1)
    return np.power(out, 1.0 / L)
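A quick usage sketch (with a made-up 3x2 array of points, so the expected distances are easy to verify by hand):
import numpy as np

pts = np.array([[0.0, 0.0],
                [3.0, 4.0],
                [6.0, 8.0]])
print(pairwise_distance(pts))
# [[ 0.  5. 10.]
#  [ 0.  0.  5.]
#  [ 0.  0.  0.]]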

Related

Adjacency Matrix from Numpy array using Euclidean Distance

Can someone help me generate a weighted adjacency matrix from a numpy array, based on the Euclidean distance between all pairs of rows, i.e. rows 0 and 1, 0 and 2, ..., 1 and 2, and so on?
Given the following example with an input matrix of shape (5, 4):
matrix = [[ 2, 10,  9,  6],
          [ 5,  1,  4,  7],
          [ 3,  2,  1,  0],
          [10, 20,  1,  4],
          [17,  3,  5, 18]]
I would like to obtain a weighted (5, 5) adjacency matrix containing the minimum distances between nodes, i.e. if dist(row0, row1) = 10.77 and dist(row0, row2) = 12.84, then entry (0, 1) of the output matrix will be 10.77 and entry (0, 2) will be 12.84.
I have already solved the first part, the generation of the full adjacency matrix, with the following code:
from scipy.spatial.distance import cdist
dist = cdist(matrix, matrix, metric='euclidean')
and I get the following result:
array([[ 0.        , 10.77032961, 12.84523258, 15.23154621, 20.83266666],
       [10.77032961,  0.        ,  7.93725393, 20.09975124, 16.43167673],
       [12.84523258,  7.93725393,  0.        , 19.72308292, 23.17326045],
       [15.23154621, 20.09975124, 19.72308292,  0.        , 23.4520788 ],
       [20.83266666, 16.43167673, 23.17326045, 23.4520788 ,  0.        ]])
But I don't yet know how to restrict the selection to a given number of neighbors per node. For example, if we set N = 2, then for each row we keep only the two neighbors with the two minimum distances, and the result would be:
[[ 0.        , 10.77032961, 12.84523258, 0, 0],
 [10.77032961,  0.        ,  7.93725393, 0, 0],
 [12.84523258,  7.93725393,  0.        , 0, 0],
 [15.23154621, 0, 19.72308292,  0.        , 0],
 [20.83266666, 16.43167673, 0, 0,  0.        ]]
You can use this cleaner approach to keep the n smallest entries per row of a matrix: dist.argsort(1).argsort(1) creates a rank order along axis=1 (smallest rank is 0, largest is 4), the comparison <= 2 keeps the three lowest ranks in each row (the zero diagonal always ranks 0, so this leaves exactly the two nearest neighbors), and np.where keeps those entries while replacing everything else with 0. Try the following:
np.where(dist.argsort(1).argsort(1) <= 2, dist, 0)
array([[ 0.        , 10.77032961, 12.84523258,  0.        ,  0.        ],
       [10.77032961,  0.        ,  7.93725393,  0.        ,  0.        ],
       [12.84523258,  7.93725393,  0.        ,  0.        ,  0.        ],
       [15.23154621,  0.        , 19.72308292,  0.        ,  0.        ],
       [20.83266666, 16.43167673,  0.        ,  0.        ,  0.        ]])
The same pattern works along any axis, and for the n largest values from a matrix as well.
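For example, keeping the two largest values per row instead just flips the comparison (a sketch, reusing dist from above):
np.where(dist.argsort(1).argsort(1) >= dist.shape[1] - 2, dist, 0)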
Assuming a is your Euclidean distance matrix, you can use np.argpartition to keep the n min/max values per row. Keep in mind that the diagonal is always 0 and Euclidean distances are non-negative, so to keep the two closest points in each row you need to keep the three smallest values per row (including the 0 on the diagonal). This does not hold if you want the maximum values, however.
a[np.arange(a.shape[0])[:, None], np.argpartition(a, 3, axis=1)[:, 3:]] = 0
output:
array([[ 0.        , 10.77032961, 12.84523258,  0.        ,  0.        ],
       [10.77032961,  0.        ,  7.93725393,  0.        ,  0.        ],
       [12.84523258,  7.93725393,  0.        ,  0.        ,  0.        ],
       [15.23154621,  0.        , 19.72308292,  0.        ,  0.        ],
       [20.83266666, 16.43167673,  0.        ,  0.        ,  0.        ]])
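If this needs to be reusable, one way to package the np.argpartition idea is sketched below (knn_adjacency is a made-up name). Note that the result is not necessarily symmetric: B being among A's two closest nodes does not imply the converse.
import numpy as np

def knn_adjacency(dist, n):
    # Keep the n smallest off-diagonal distances per row, zero the rest.
    # n + 1 values survive per row because the zero diagonal always ranks first.
    out = dist.copy()
    rows = np.arange(out.shape[0])[:, None]
    drop = np.argpartition(out, n + 1, axis=1)[:, n + 1:]
    out[rows, drop] = 0
    return out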

How can I iterate through arrays of two different sizes and perform different operations on each one in python

I'm trying to implement the Gillespie algorithm in python 3.7.
I have two arrays:
popul_num = np.array([100, 200, 0, 0])
and
LHS = np.array([[1,1,0,0], [0,0,1,0], [0,0,1,0]])
In the first array each element represents the discrete number of molecules of each reactant in the system, and in the second array each row represents a reaction, where each element of the row is the stoichiometric coefficient of the corresponding reactant.
I need to loop through both arrays and calculate the binomial coefficient using the number of discrete molecules from the first array and the corresponding stoichiometries of the reactants in each reaction of the second array. Each row of the LHS array should be matched position-by-position against the popul_num array.
I haven't got very far:
from scipy.special import binom

for i in LHS[0:3]:
    for j in popul_num:
        aj = binom(j, i)
        print(aj)
but the output for this is:
[100. 100. 1. 1.]
[200. 200. 1. 1.]
[0. 0. 1. 1.]
[0. 0. 1. 1.]
[ 1. 1. 100. 1.]
[ 1. 1. 200. 1.]
[1. 1. 0. 1.]
[1. 1. 0. 1.]
[ 1. 1. 100. 1.]
[ 1. 1. 200. 1.]
[1. 1. 0. 1.]
[1. 1. 0. 1.]
which pairs every element of each row of LHS with every element of popul_num. I want each element of each row of LHS to be used with the positionally corresponding element of popul_num exactly once.
Any ideas?
Sorry for the long post.
Cheers
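Since the two arrays broadcast against each other, one vectorized possibility is sketched below (it assumes scipy.special.binom, which the question appears to use, and that each row of LHS lines up element-by-element with popul_num):
import numpy as np
from scipy.special import binom

popul_num = np.array([100, 200, 0, 0])
LHS = np.array([[1, 1, 0, 0], [0, 0, 1, 0], [0, 0, 1, 0]])

# popul_num (shape (4,)) broadcasts against LHS (shape (3, 4)), pairing each
# stoichiometric coefficient with its positionally corresponding molecule count.
aj = binom(popul_num, LHS)
print(aj)
# [[ 100.  200.    1.    1.]
#  [   1.    1.    0.    1.]
#  [   1.    1.    0.    1.]]
# The combinatorial factor of each reaction would then be the row product,
# e.g. aj.prod(axis=1).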

Create 2 matrices based on a list of index values

I have a scipy sparse matrix, titles, and a Python list, index. The list contains integers which correspond to rows in the titles matrix. From this I wish to create 2 new scipy sparse matrices:
One should contain all of the rows of titles whose row number is not in index
The other should contain all of the rows of titles whose row number is in index
Eg.
import numpy as np
from scipy import sparse
titles = sparse.csr_matrix(np.ones((5,5)))
index = [3,2]
Where the desired output for print(matrix1.todense()) is:
[[ 1. 1. 1. 1. 1.]
[ 1. 1. 1. 1. 1.]
[ 1. 1. 1. 1. 1.]]
and the desired output for print(matrix2.todense()) is:
[[ 1. 1. 1. 1. 1.]
[ 1. 1. 1. 1. 1.]]
You can use np.setdiff1d to find the complementary indices and just index titles appropriately.
idx1 = [3, 2]
idx2 = np.setdiff1d(np.arange(titles.shape[0]), idx1)
matrix1 = titles[idx2].todense()
matrix2 = titles[idx1].todense()
print(matrix1)
[[ 1. 1. 1. 1. 1.]
[ 1. 1. 1. 1. 1.]
[ 1. 1. 1. 1. 1.]]
print(matrix2)
[[ 1. 1. 1. 1. 1.]
[ 1. 1. 1. 1. 1.]]
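An equivalent alternative (a sketch, assuming a SciPy version whose CSR matrices accept a boolean row mask) is to build the mask with np.isin; note this keeps the selected rows in their original order, whereas fancy indexing with idx1 follows the order of the list:
import numpy as np

mask = np.isin(np.arange(titles.shape[0]), index)
matrix1 = titles[~mask]  # rows whose number is not in index
matrix2 = titles[mask]   # rows whose number is in index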

Update 3rd and 4th dimension elements of a numpy array

I have a numpy array of shape (12, 8, 5, 5). I want to modify the values along the 3rd and 4th dimensions for each element.
For e.g.
import numpy as np
x = np.zeros((12, 8, 5, 5))
print(x[0,0,:,:])
Output:
[[ 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0.]]
Modify values:
y = np.ones((5,5))
x[0,0,:,:] = y
print(x[0,0,:,:])
Output:
[[ 1. 1. 1. 1. 1.]
[ 1. 1. 1. 1. 1.]
[ 1. 1. 1. 1. 1.]
[ 1. 1. 1. 1. 1.]
[ 1. 1. 1. 1. 1.]]
I can modify every x[i, j, :, :] using two for loops, but I was wondering if there is a pythonic way to do it without running the two loops. Just curious to know :)
UPDATE
Actual use case:
dict_weights = copy.deepcopy(combined_weights)
for i in range(0, len(combined_weights[each_layer][:, 0, 0, 0])):
    for j in range(0, len(combined_weights[each_layer][0, :, 0, 0])):
        # Extract the 5x5 block
        trans_weight = combined_weights[each_layer][i, j]
        trans_weight = np.fliplr(np.flipud(trans_weight))
        # Update
        dict_weights[each_layer][i, j] = trans_weight
NOTE: The dimensions i, j of combined_weights can vary. There are around 200 elements in this list with varied i and j dimensions, but the 3rd and 4th dimensions are always the same (i.e. 5x5).
I just want to know if I can update the 5x5 sub-arrays combined_weights[i, j, :, :] with the flipped values without running 2 for loops.
Thanks.
Simply do -
dict_weights[each_layer] = combined_weights[each_layer][...,::-1,::-1]
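A quick sanity check (a sketch) that the reversed slices match the per-block fliplr/flipud of the loop version:
import numpy as np

w = np.random.rand(12, 8, 5, 5)
flipped = w[..., ::-1, ::-1]

# ::-1 on the last axis flips columns (fliplr per block) and ::-1 on the
# second-to-last axis flips rows (flipud per block).
assert np.array_equal(flipped[3, 4], np.fliplr(np.flipud(w[3, 4])))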

using indices with multiple values, how to get the smallest one

I have an index used to place values into another array, but sometimes the index has repeated entries... in that case I would like to keep the smaller of the corresponding values. Is that possible?
index = [0,3,5,5]
dist = [1,1,1,3]
arr = np.zeros(6)
arr[index] = dist
print arr
what I get:
[ 1. 0. 0. 1. 0. 3.]
what I would like to get:
[ 1. 0. 0. 1. 0. 1.]
Addendum
Actually I have a third array with the (vector) values to be inserted. So the problem is to insert rows from values into arr at the positions given by index, as in the following; however, when several entries share the same index, I want to keep the row corresponding to the minimum dist.
index = [0,3,5,5]
dist = [1,1,1,3]
values = np.arange(8).reshape(4,2)
arr = np.zeros((6,2))
arr[index] = values
print arr
I get:
[[ 0. 1.]
[ 0. 0.]
[ 0. 0.]
[ 2. 3.]
[ 0. 0.]
[ 6. 7.]]
I would like to get:
[[ 0. 1.]
[ 0. 0.]
[ 0. 0.]
[ 2. 3.]
[ 0. 0.]
[ 4. 5.]]
Use groupby in pandas:
import pandas as pd
index = [0,3,5,5]
dist = [1,1,1,3]
s = pd.Series(dist).groupby(index).min()
arr = np.zeros(6)
arr[s.index] = s.values
print arr
If index is sorted, then itertools.groupby can be used to group that list.
import itertools

np.array([(g[0], min(x[1] for x in g[1]))
          for g in itertools.groupby(zip(index, dist), lambda x: x[0])])
produces
array([[0, 1],
       [3, 1],
       [5, 1]])
This is about 8x slower than a version using np.unique, so for N = 1000 it is similar to the Pandas version (I'm guessing, since something is screwy with my Pandas import). For larger N the Pandas version is better; it looks like the Pandas approach has a substantial startup cost, which limits its speed for small N.
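For the vector-valued addendum, a pure-numpy sketch is to sort by (index, dist) with np.lexsort and keep only the first entry for each index value:
import numpy as np

index = np.array([0, 3, 5, 5])
dist = np.array([1, 1, 1, 3])
values = np.arange(8).reshape(4, 2)

# lexsort sorts by index first, then by dist, so within each duplicated
# index the entry with the smallest dist comes first.
order = np.lexsort((dist, index))
idx_sorted = index[order]
# Keep only the first occurrence of each index value.
keep = np.concatenate(([True], idx_sorted[1:] != idx_sorted[:-1]))

arr = np.zeros((6, 2))
arr[idx_sorted[keep]] = values[order][keep]
print(arr)  # row 5 gets [4, 5], the entry with the smaller dist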
