Here's the question and the example given:
You are given a 2-D array A of size NxN containing floating-point
numbers. The array represents the pairwise correlation between N elements,
with A[i,j] = A[j,i] = corr(i,j) and A[i,i] = 1.
Write a Python program using NumPy to find the index of the highest
correlated element for each element and finally print the sum of all
these indexes.
Example: the array A = [[1, 0.3, 0.4], [0.4,1,0.5],[0.1,0.6,1]]. Then the indexes of the highest correlated elements for each element
are [3, 3, 2]. The sum of these indexes is 8.
I'm having trouble understanding the question, and the example makes my confusion worse. With each array inside A having only 3 values, and A itself containing only three arrays, how can any "index of the highest correlated element" be greater than 2 if NumPy is zero-indexed?
Does anyone understand the question?
To reiterate, the example is wrong in multiple ways.
Correlation matrices are by definition symmetric, yet the example is not:
array([[1. , 0.3, 0.4],
       [0.4, 1. , 0.5],
       [0.1, 0.6, 1. ]])
Also you are right, numpy arrays (like everything else I know in Python that supports indexing) are zero-indexed. So the solution is off by one.
The exercise wants you to find the index j of the random variable with the greatest correlation for each random variable with index i. Obviously excluding itself (the correlation coefficient of 1 on the diagonal).
Here is one way to do that given your numpy array a:
np.where(a != 1, a, 0).argmax(axis=1)
Here np.where produces an array identical to a except we replace the ones with zeroes. This is based on the assumption that if i != j, the correlation is always < 1. If that does not hold, the solution will obviously be wrong.
Then argmax gives the indices of the greatest values in each row. Although, in an actual correlation matrix, axis=0 would work just as well, since it would be... you know... symmetrical.
The result is array([2, 2, 1]). To get the sum, you just add a .sum() at the end.
EDIT:
Now that I think about it, the assumption is too strong. Here is a better way:
b = a.copy()
np.fill_diagonal(b, -1)
b.argmax(axis=1)
Now the only assumption is that no off-diagonal correlation is exactly -1. Since correlation coefficients lie in [-1, 1], filling the diagonal with -1 can only go wrong if some pair is perfectly anti-correlated. If you don't care about mutating the original array, you can of course omit the copy and fill the diagonal of a with -1 directly.
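For completeness, here is the whole thing run end to end on the asker's example matrix exactly as given (non-symmetric and all); it reproduces the array([2, 2, 1]) mentioned above, for a zero-based index sum of 5:

```python
import numpy as np

a = np.array([[1.0, 0.3, 0.4],
              [0.4, 1.0, 0.5],
              [0.1, 0.6, 1.0]])  # the asker's example matrix, as given

b = a.copy()
np.fill_diagonal(b, -1)   # exclude self-correlation from the argmax
best = b.argmax(axis=1)   # per-row index of the highest correlation
print(best, best.sum())   # -> [2 2 1] 5
```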
Related
I know how to access elements in a vector by indices doing:
test = numpy.array([1,2,3,4,5,6])
indices = list([1,3,5])
print(test[indices])
which gives the correct answer : [2 4 6]
But I am trying to do the same thing using a 2D matrix, something like:
currentGrid = numpy.array([[0, 0.1],
                           [0.9, 0.9],
                           [0.1, 0.1]])
indices = list([(0,0),(1,1)])
print(currentGrid[indices])
This should display "[0.0 0.9]": the value at (0,0) and the one at (1,1) in the matrix. But instead it displays "[0.1 0.1]". Also, if I try to use 3 indices with:
indices = list([(0,0),(1,1),(0,2)])
I now get the following error:
Traceback (most recent call last):
File "main.py", line 43, in <module>
print(currentGrid[indices])
IndexError: too many indices for array
I ultimately need to apply a simple max() operation on all the elements at these indices and need the fastest way to do that for optimization purposes.
What am I doing wrong? How can I access specific elements in a matrix to do some operation on them in a very efficient way (not using a list comprehension or a loop)?
The problem is the arrangement of the indices you're passing to the array. If your array is two-dimensional, your indices must be two lists, one containing the vertical indices and the other one the horizontal ones. For instance:
idx_i, idx_j = zip(*[(0, 0), (1, 1), (0, 2)])
print(currentGrid[idx_j, idx_i])
# [0.  0.9 0.1]
Note that NumPy indexes along rows first, i.e. the order is (row, column). I assume you defined your points as (x, y), i.e. (column, row); otherwise the pair (0, 2) would raise an IndexError, since the grid only has two columns.
There are already some great answers to your problem. Here just a quick and dirty solution for your particular code:
for i in indices:
    print(currentGrid[i[0], i[1]])
Edit:
If you do not want to use a for loop you need to do the following:
Assume you want to access 3 values of your 2D matrix (with the dimensions x1 and x2). The values have the "coordinates" (indices) V1(x11|x21), V2(x12|x22), V3(x13|x23). Then, for each dimension of your matrix (2 in your case), you need to create a list with the indices of your points along that dimension. In this example, you would create one list with the x1 indices, [x11,x12,x13], and one list with the x2 indices, [x21,x22,x23]. Then you combine these lists and use them as the index for the matrix:
indices = [[x11,x12,x13],[x21,x22,x23]]
or how you write it:
indices = list([(x11,x12,x13),(x21,x22,x23)])
Now with the points that you used ((0,0),(1,1),(2,0)) - please note you need to use (2,0) instead of (0,2), because it would be out of range otherwise:
indices = list([(0,1,2),(0,1,0)])
print(currentGrid[indices])
This will give you 0, 0.9, 0.1. And on this list you can then apply the max() command if you like (just to consider your whole question):
maxValue = max(currentGrid[indices])
Edit2:
Here an example how you can transform your original index list to get it into the correct shape:
originalIndices = [(0,0),(1,1),(2,0)]
x1 = []
x2 = []
for i in originalIndices:
    x1.append(i[0])
    x2.append(i[1])
newIndices = [x1,x2]
print(currentGrid[newIndices])
Edit3:
I don't know if you can apply max(x, 0.5) to a NumPy array without using a loop. But you could use Pandas instead: cast your values into a pandas Series and then apply a lambda function:
import pandas as pd
maxValues = pd.Series(currentGrid[newIndices]).apply(lambda x: max(x,0.5))
This will give you a pandas Series containing 0.5, 0.9, 0.5, which you can simply cast back to a list with maxValues = list(maxValues).
Just one note: in the background there will always be some kind of loop running, even with this command, so I doubt you will get much better performance from it. If you really want to boost performance, use a for loop together with numba (you simply add a decorator to your function) and execute it in parallel, or use the multiprocessing library and its Pool function, see here. Just to give you some inspiration.
Edit4:
Accidentally I saw this page today, which lets you do exactly what you want with NumPy. The solution (considering the newIndices vector from my Edit2) to your problem is:
maxfunction = numpy.vectorize(lambda i: max(i,0.5))
print(maxfunction(currentGrid[newIndices]))
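As an aside, NumPy can do this clipping directly, without np.vectorize: np.maximum broadcasts a scalar against an array elementwise, which is typically much faster than applying a Python lambda per element. A small sketch using the same currentGrid and newIndices from Edit2:

```python
import numpy as np

currentGrid = np.array([[0, 0.1],
                        [0.9, 0.9],
                        [0.1, 0.1]])
newIndices = [[0, 1, 2], [0, 1, 0]]   # row indices, then column indices

# Elementwise maximum against the scalar 0.5 -- no Python-level loop.
clipped = np.maximum(currentGrid[tuple(newIndices)], 0.5)
print(clipped)   # -> [0.5 0.9 0.5]
```

The tuple() call avoids the deprecation warning recent NumPy versions emit when indexing with a plain list of index sequences.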
2D indices have to be accessed like this (note that indices must be a NumPy array of shape (n, 2) here, not a list of tuples):
print(currentGrid[indices[:,0], indices[:,1]])
The row indices and the column indices are to be passed separately as lists.
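Concretely, starting from the asker's list of (row, column) pairs, converting it to an array first makes this work, and the max() the asker ultimately wants is then a single call:

```python
import numpy as np

currentGrid = np.array([[0, 0.1],
                        [0.9, 0.9],
                        [0.1, 0.1]])
# (row, col) pairs; note (2, 0) rather than (0, 2), which would be out of range
indices = np.array([(0, 0), (1, 1), (2, 0)])

values = currentGrid[indices[:, 0], indices[:, 1]]
print(values)        # -> [0.  0.9 0.1]
print(values.max())  # -> 0.9
```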
I'm dealing with correlation matrices and I want to rearrange the rows and columns so that the column with the highest average correlation is in the middle, the second best is one index above that, the third is one index below the middle, and so on and so forth.
In an example, this is the original matrix
[[ 1. , -0.85240671, 0.93335528, 0.75431679, 0.81586527],
[-0.85240671, 1. , -0.874545 , -0.68551567, -0.8594703 ],
[ 0.93335528, -0.874545 , 1. , 0.7103762 , 0.86104527],
[ 0.75431679, -0.68551567, 0.7103762 , 1. , 0.73345121],
[ 0.81586527, -0.8594703 , 0.86104527, 0.73345121, 1. ]]
Ideally the new column/row order (using python indexing) is 3, 1, 2, 0, 4. So it would look like
[[1,-.686,.710,.754,.733],
[-.686,1,-.875,-.852,-.859],
[.710,-.875,1,.933,.861],
[.754,-.852,.933,1,.816],
[.733,-.859,.861,.816,1]]
None of the sorting algorithms I know seem to be able to deal with my goal of "symmetry". I'm using numpy for my matrices.
Some of the matrices will not have odd dimensions so I also want a way to deal with matrices with even numbers for their dimensions if possible. Any help would be awesome.
I'm not sure about the "determine the order of largest correlation" part, but that's not really the core of your question.
I thought that, assuming your array is called arr, determining the order of descending correlation can be done by
corrs = arr.sum(axis=0)
corr_order = corrs.argsort()[::-1]
But the main part of your issue is filling up your matrix in this specific "largest in the middle" order. There has to be a more elegant way, but this is what I did to obtain the column order once you have your columns sorted decreasing:
ndim = arr.shape[0]
inds_orig = list(range(ndim))
inds = []
for _ in range(ndim):
    inds.append(inds_orig[(len(inds_orig) - 1) // 2])
    del inds_orig[(len(inds_orig) - 1) // 2]
inds = np.array(inds)
Now, the above for ndim=5 will give us
array([2, 1, 3, 0, 4])
which seems to be exactly what you want: the first (largest) column in the middle, then each subsequent item on alternating sides.
Now we need to combine these two arrays to get a sorted+rearranged version of your original array. There's a slight subtlety: indexing your 2d array with two index arrays directly would trigger pointwise fancy indexing (picking individual elements), when we really want the cross product of rows and columns. So we use np.ix_ to convert our indices into an open mesh that selects whole rows and columns:
res = np.empty_like(arr)
res[np.ix_(inds,inds)] = arr[np.ix_(corr_order,corr_order)]
the result of which is
array([[ 1. , 0.7103762 , 0.75431679, 0.73345121, -0.68551567],
[ 0.7103762 , 1. , 0.93335528, 0.86104527, -0.874545 ],
[ 0.75431679, 0.93335528, 1. , 0.81586527, -0.85240671],
[ 0.73345121, 0.86104527, 0.81586527, 1. , -0.8594703 ],
[-0.68551567, -0.874545 , -0.85240671, -0.8594703 , 1. ]])
To check that this matrix is correct within my definition of "largest correlation":
>>> print(res.sum(axis=0))
[ 2.51262853 2.63023175 2.65113063 2.55089145 -2.27193768]
As you can see: largest in the middle, then one to the left, then one to the right, then the first, then the last.
Unless I'm mistaken, the other option would've been to invert the sorting permutation on the left-hand-side, and only index on the right-hand-side by indexing with one index array into the other. I'm not sure that would've been any clearer than this approach, so I stuck with this one.
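To make the steps above easier to reuse, here is the same approach collected into one small function, under the same assumptions (square symmetric input, "largest" measured by column sum); run on the question's matrix it reproduces the column sums shown above:

```python
import numpy as np

def middle_out(arr):
    """Reorder a correlation matrix so the column with the largest sum sits
    in the middle, with subsequent columns alternating around it."""
    ndim = arr.shape[0]
    corr_order = arr.sum(axis=0).argsort()[::-1]  # columns by descending sum

    # Middle-out target positions: repeatedly take the current middle slot.
    inds_orig = list(range(ndim))
    inds = [inds_orig.pop((len(inds_orig) - 1) // 2) for _ in range(ndim)]

    res = np.empty_like(arr)
    res[np.ix_(inds, inds)] = arr[np.ix_(corr_order, corr_order)]
    return res

arr = np.array([[ 1.        , -0.85240671,  0.93335528,  0.75431679,  0.81586527],
                [-0.85240671,  1.        , -0.874545  , -0.68551567, -0.8594703 ],
                [ 0.93335528, -0.874545  ,  1.        ,  0.7103762 ,  0.86104527],
                [ 0.75431679, -0.68551567,  0.7103762 ,  1.        ,  0.73345121],
                [ 0.81586527, -0.8594703 ,  0.86104527,  0.73345121,  1.        ]])
res = middle_out(arr)
print(res.sum(axis=0))  # the largest column sum lands in the middle
```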
I wanna print the index of the row containing the minimum element of the matrix
my matrix is matrix = [[22,33,44,55],[22,3,4,12],[34,6,4,5,8,2]]
and the code
matrix = [[22,33,44,55],[22,3,4,12],[34,6,4,5,8,2]]
a = np.array(matrix)
buff_min = a.argmin(axis = 0)
print(buff_min) #index of the row containing the minimum element
min = np.array(matrix[buff_min])
print(str(min.min(axis=0))) #print the minium of that row
print(min.argmin(axis = 0)) #index of the minimum
print(matrix[buff_min]) # print all row containing the minimum
after running, my result is
1
3
1
[22, 3, 4, 12]
The first number should be 2, because the minimum, 2, is in the third list ([34,6,4,5,8,2]), but it returns 1. And it returns 3 as the minimum of the matrix.
What's the error?
I am not sure which version of Python you are using; I tested it for Python 2.7 and 3.2. As mentioned, your syntax for argmin is not correct; it should be in the format
import numpy as np
np.argmin(array_name,axis)
Next, although NumPy knows about arrays of arbitrary objects, it is optimized for homogeneous arrays of numbers with fixed dimensions. If you really need arrays of arrays, better use a nested list. But depending on the intended use of your data, different data structures might be even better, e.g. a masked array if you have some invalid data points.
If you really want flexible Numpy arrays, use something like this:
np.array([[22,33,44,55],[22,3,4,12],[34,6,4,5,8,2]], dtype=object)
However this will create a one-dimensional array that stores references to lists, which means that you will lose most of the benefits of Numpy (vector processing, locality, slicing, etc.).
Also, if you can resize your rows so that the matrix is rectangular, things might work; I haven't tested it, but conceptually that should be an easy solution. Still, I would prefer a nested list for this kind of input matrix.
Does this work?
np.where(a == a.min())[0][0]
Note that all rows of the matrix need to contain the same number of elements for this to work.
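To illustrate on a rectangular version of the asker's matrix (the last row trimmed to four elements, since ragged rows won't form a proper 2-D array):

```python
import numpy as np

a = np.array([[22, 33, 44, 55],
              [22,  3,  4, 12],
              [34,  6,  4,  2]])  # last row shortened to keep the array rectangular

row = np.where(a == a.min())[0][0]
print(row)  # -> 2 (the minimum, 2, sits in the third row)

# Equivalent, and avoids scanning the array twice:
row2, col2 = np.unravel_index(a.argmin(), a.shape)
print(row2, col2)  # -> 2 3
```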
I'm fairly new to Python/Numpy. What I have here is a standard array and I have a function which I have vectorized appropriately.
import numpy as np

def f(i):
    return np.random.choice(2, 1, p=[0.7, 0.3]) * 9

f = np.vectorize(f)
Defining an example array:
array = np.array([[1,1,0],[0,1,0],[0,0,1]])
With the vectorized function, f, I would like to evaluate f on each cell on the array with a value of 0.
I am trying to leave for loops as a last resort. My arrays will eventually be larger than 100 by 100, so running each cell individually to look and evaluate f might take too long.
I have tried:
print(f(array[array==0]))
Unfortunately, this gives me a row array consisting of 5 elements (the zeroes in my original array).
Alternatively I have tried,
array[array==0] = f(1)
But as expected, this just turns every single zero element of array into 0's or 9's.
What I'm looking for is somehow to give me my original array with the zero elements replaced individually. Ideally, 30% of my original zero elements will become 9 and the array structure is conserved.
Thanks
The reason your first try doesn't work is because the vectorized function handle, let's call it f_v to distinguish it from the original f, is performing the operation for exactly 5 elements: the 5 elements that are returned by the boolean indexing operation array[array==0]. That returns 5 values, it doesn't set those 5 items to the returned values. Your analysis of why the 2nd form fails is spot-on.
If you wanted to solve it you could combine your second approach with adding the size option to np.random.choice:
array = np.array([[1,1,0],[0,1,0],[0,0,1]])
mask = array==0
array[mask] = np.random.choice([18,9], size=mask.sum(), p=[0.7, 0.3])
# example output:
# array([[ 1, 1, 9],
# [18, 1, 9],
# [ 9, 18, 1]])
There was no need for np.vectorize: the size option takes care of that already.
how can I change the values of the diagonal of a matrix in numpy?
I checked Numpy modify ndarray diagonal, but the function there is not implemented in numpy v 1.3.0.
Let's say we have a np.array X, and I want to set all values of the diagonal to 0.
Did you try numpy.fill_diagonal? See the following answer and this discussion. Or the following from the documentation (although currently broken):
http://docs.scipy.org/doc/numpy/reference/generated/numpy.fill_diagonal.html
If you're using a version of numpy that doesn't have fill_diagonal (the right way to set the diagonal to a constant) or diag_indices_from, you can do this pretty easily with array slicing:
# assuming a 2d square array
n = mat.shape[0]
mat[range(n), range(n)] = 0
This is much faster than an explicit loop in Python, because the looping happens in C and is potentially vectorized.
One nice thing about this is that you can also fill a diagonal with a list of elements, rather than a constant value (like diagflat, but for modifying an existing matrix rather than making a new one). For example, this will set the diagonal of your matrix to 0, 1, 2, ...:
# again assuming 2d square array
n = mat.shape[0]
mat[range(n), range(n)] = range(n)
If you need to support more array shapes, this is more complicated (which is why fill_diagonal is nice...):

idx = np.arange(min(m.shape))
m[(idx,) * m.ndim] = 0

(The index must be a tuple of per-axis index arrays; in recent NumPy versions, indexing with a plain list of index sequences is deprecated.)
You can use numpy.diag_indices_from() to get the indices of the diagonal elements of your array. Then set the value of those indices.
X[np.diag_indices_from(X)] = 0.
Example:
>>> import numpy as np
>>> X = np.random.rand(5, 5)
>>> print(X)
[[0.59480384 0.20133725 0.59147423 0.22640441 0.40898203]
[0.65230581 0.57055258 0.97009881 0.58535275 0.32036626]
[0.71524332 0.73424734 0.92461381 0.38704119 0.08147428]
[0.18931865 0.97366736 0.11482649 0.82793141 0.13347333]
[0.47402986 0.73329347 0.18892479 0.11883424 0.78718883]]
>>> X[np.diag_indices_from(X)] = 0
>>> print(X)
[[0. 0.20133725 0.59147423 0.22640441 0.40898203]
[0.65230581 0. 0.97009881 0.58535275 0.32036626]
[0.71524332 0.73424734 0. 0.38704119 0.08147428]
[0.18931865 0.97366736 0.11482649 0. 0.13347333]
[0.47402986 0.73329347 0.18892479 0.11883424 0. ]]
Here's another good way to do this. If you want a one-dimensional view of the array's main diagonal use:
A.ravel()[:A.shape[1]**2:A.shape[1]+1]
For the i'th superdiagonal use:
A.ravel()[i:max(0,A.shape[1]-i)*A.shape[1]:A.shape[1]+1]
For the i'th subdiagonal use:
A.ravel()[A.shape[1]*i:A.shape[1]*(i+A.shape[1]):A.shape[1]+1]
Or in general, for the i'th diagonal where the main diagonal is 0, the subdiagonals are negative and the superdiagonals are positive, use:
A.ravel()[max(i,-A.shape[1]*i):max(0,(A.shape[1]-i))*A.shape[1]:A.shape[1]+1]
These are views and not copies, so they will run faster for extracting a diagonal, but any changes made to the new array object will apply to the original array.
On my machine these run faster than the fill_diagonal function when setting the main diagonal to a constant, but that may not always be the case. They can also be used to assign an array of values to a diagonal instead of just a constant.
Notes: for small arrays it may be faster to use the flat attribute of the NumPy array.
If speed is a major issue it could be worth it to make A.shape[1] a local variable.
Also, if the array is not contiguous, ravel() will return a copy, so, in order to assign values to a strided slice, it will be necessary to creatively slice the original array used to generate the strided slice (if it is contiguous) or to use the flat attribute.
Also, it was originally planned that in NumPy 1.10 and later the 'diagonal' method of arrays will return a view instead of a copy.
That change hasn't yet been made though, but hopefully at some point this trick to get a view will no longer be necessary.
See http://docs.scipy.org/doc/numpy-dev/reference/generated/numpy.diagonal.html
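To see that these slices really are views (assuming a contiguous square array), a quick check:

```python
import numpy as np

A = np.arange(16).reshape(4, 4)               # contiguous, so ravel() is a view
d = A.ravel()[:A.shape[1]**2:A.shape[1] + 1]  # main diagonal as a view

d[:] = -1           # writing through the view ...
print(np.diag(A))   # ... modifies A itself: [-1 -1 -1 -1]
```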
def replaceDiagonal(matrix, replacementList):
    for i in range(len(replacementList)):
        matrix[i][i] = replacementList[i]
where replacementList has length n for an n x n matrix.
>>> a = numpy.random.rand(2,2)
>>> a
array([[ 0.41668355, 0.07982691],
[ 0.60790982, 0.0314224 ]])
>>> a - numpy.diag(numpy.diag(a))
array([[ 0. , 0.07982691],
[ 0.60790982, 0. ]])
You can do the following.
Assuming your matrix is a 4 x 4 matrix:
indices_diagonal = np.diag_indices(4)
yourarray[indices_diagonal] = Val