element-wise operations on a sparse matrix - python

If you have a sparse matrix X:
>> print type(X)
<class 'scipy.sparse.csr.csr_matrix'>
...How can you sum the squares of each element in each row, and save them into a list? For example:
>>print X.todense()
[[0 2 0 2]
[0 2 0 1]]
How can you turn that into a list of sum of squares of each row:
[[0²+2²+0²+2²]
[0²+2²+0²+1²]]
or:
[8, 5]

First of all, the csr matrix has a .sum method (relying on the dot product) which works well, so what you need is the squaring. The simplest solution is to create a copy of the sparse matrix, square its data and then sum it:
squared_X = X.copy()
# now square the data in squared_X
squared_X.data **= 2
# and sum each row:
squared_sum = squared_X.sum(1)
# and delete the squared_X:
del squared_X
If you really must save the space, I guess you could overwrite .data in place and then restore it, something along these lines:
X.sum_duplicates() # make sure duplicates are merged; not sure if this is needed with normal usage.
old_data = X.data.copy()
X.data **= 2
squared_sum = X.sum(1)
X.data = old_data
EDIT: There is actually another nice way, as the csr matrix has a .multiply method for elementwise multiplication:
squared_sum = X.multiply(X).sum(1)
Addition:
Elementwise operations are thus easily done by accessing csr.data which stores the values for all nonzero elements. NOTE: I guess .sum_duplicates() may be necessary, I am not sure what kind of operations would make it necessary.
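As a minimal end-to-end sketch of the .multiply approach on the example matrix from the question (the variable names row_sums and squared_sum_list are only illustrative), the row sums come back as a column of shape (2, 1) and need flattening to become a plain list:

```python
import numpy as np
from scipy.sparse import csr_matrix

# The example matrix from the question, stored as CSR.
X = csr_matrix(np.array([[0, 2, 0, 2],
                         [0, 2, 0, 1]]))

# Element-wise square via .multiply, then sum along each row (axis=1).
row_sums = X.multiply(X).sum(axis=1)

# Flatten the (2, 1) result into a plain Python list.
squared_sum_list = np.asarray(row_sums).ravel().tolist()
print(squared_sum_list)
```

This leaves X itself unchanged, at the cost of a temporary sparse matrix for the squared values.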

Related

Python: general sum over numpy rows

I want to sum all the rows of a matrix; hence, if I have an n x 2 matrix, the result should be a 1 x 2 vector with all rows summed. I can do something like that with np.sum( arg, axis=1 ) but I get an error if I supply a vector as argument. Is there any more general sum function which doesn't throw an error when a vector is supplied? Note: This was never a problem in MATLAB.
Background: I wrote a function which calculates some stuff and sums over all rows of the matrix. Depending on the number of inputs, the matrix has a different number of rows and the number of rows is >= 1
According to numpy.sum documentation, you cannot specify axis=1 for vectors as you would get a numpy AxisError saying axis 1 is out of bounds for array of dimension 1.
A possible workaround could be, for example, writing a dedicated function that checks the size before performing the sum. Please find below a possible implementation:
import numpy as np
M = np.array([[1, 4],
              [2, 3]])
v = np.array([1, 4])
def sum_over_columns(input_arr):
    # Only 2D (or higher) inputs have an axis 1 to sum along
    if len(input_arr.shape) > 1:
        return input_arr.sum(axis=1)
    return input_arr.sum()
print(sum_over_columns(M))
print(sum_over_columns(v))
In a more pythonic way (not necessarily more readable):
def oneliner_sum(input_arr):
    return input_arr.sum(axis=(1 if len(input_arr.shape) > 1 else None))
You can do
np.sum(np.atleast_2d(x), axis=1)
This will first convert vectors to singleton-dimensional 2D matrices if necessary.
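To see what the atleast_2d variant returns in each case, here is a small sketch that reuses the M and v arrays from the answer above (sum_M and sum_v are just illustrative names):

```python
import numpy as np

M = np.array([[1, 4],
              [2, 3]])
v = np.array([1, 4])

# atleast_2d leaves M alone and turns v (shape (2,)) into shape (1, 2),
# so axis=1 exists in both cases.
sum_M = np.sum(np.atleast_2d(M), axis=1)
sum_v = np.sum(np.atleast_2d(v), axis=1)

print(sum_M)  # one entry per row of M
print(sum_v)  # a length-1 array holding the total of v
```

Note that the vector case yields a one-element array rather than a scalar, which keeps the return type the same for both inputs.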

finding the occurrence of vector v (1,k) inside a matrix M (m,k)

I want to find the number of occurrences of vector v in matrix M.
What I have is a matrix the size (60K, 10)
and I initialised a test vector v (1,10):
tester = np.zeros((1, 10))
Now I want to count how many times that vector appears as a complete row of the matrix.
I did it iteratively and it works, but because the matrix is very large, performance suffers, and I'm trying to find a faster, more elegant way.
I would appreciate some help.
Thanks.
You can do the following:
temp = np.where((prediction == tester).all(axis=1))
len(temp[0])
When np.where() is called with only a condition (no x and y arguments), it returns the indices at which the condition is True. Here those are the indices of the rows that match tester entirely, so the length of the first index array is the number of occurrences.
This will certainly lower your running time, and to me it is much more elegant than looping through the matrix.
You can check the np.where API:
https://docs.scipy.org/doc/numpy/reference/generated/numpy.where.html
Just compare and use all, so each row will result in a True value only if all its elements compare equal to the reference array. Then, you can simply sum the result, since int(True) == 1.
Example:
np.random.seed(0)
data = np.random.randint(0, 2, size=(50, 3))
to_match = np.random.randint(0, 2, size=(1, 3))
print(to_match)
print((data == to_match).all(axis=1).sum())
Output:
[[0 0 0]]
4
...which means that there are 4 instances of [0, 0, 0] in data.
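For a version whose result can be checked by eye, here is a small sketch on a hand-written array (the data values and the names row_matches, count and matching_rows are made up for illustration). It also recovers which rows matched, not just how many:

```python
import numpy as np

data = np.array([[0, 0, 0],
                 [1, 0, 1],
                 [0, 0, 0],
                 [0, 1, 0]])
to_match = np.zeros((1, 3), dtype=int)

# Broadcasting compares to_match against every row; .all(axis=1) keeps
# only the rows where every element is equal.
row_matches = (data == to_match).all(axis=1)

count = int(row_matches.sum())               # number of matching rows
matching_rows = np.flatnonzero(row_matches)  # indices of those rows
print(count, matching_rows)
```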

Numpy: function that creates block matrices

Say I have a dimension k. What I'm looking for is a function that takes k as an input and returns the following block matrix.
Let I be a k-dimensional identity matrix and 0 be k-dimensional square matrix of zeros
That is:
def function(k):
    ...
    return matrix
function(2) -> np.array([I, 0])
function(3) -> np.array([[I,0,0],
                         [0,I,0]])
function(4) -> np.array([[I,0,0,0],
                         [0,I,0,0],
                         [0,0,I,0]])
function(5) -> np.array([[I,0,0,0,0],
                         [0,I,0,0,0],
                         [0,0,I,0,0],
                         [0,0,0,I,0]])
That is, the output is a (k-1,k) matrix where identity matrices are on the diagonal elements and zero matrices elsewhere.
What I've tried:
I know how to create any individual row, I just can't think of a way to put it into a function so that it takes a dimension, k, and spits out the matrix I need.
e.g.
np.block([[np.eye(3),np.zeros((3, 3)),np.zeros((3, 3))],
[np.zeros((3, 3)),np.eye(3),np.zeros((3, 3))]])
Would be the desired output for k=3
scipy.linalg.block_diag seems like it might be on the right track...
IMO, np.eye already has everything you need, as you can define number of rows and columns separately.
So your function should simply look like
def fct(k):
    return np.eye(k**2-k, k**2)
If I understand you correctly, this should work:
a = np.concatenate((np.eye((k-1)*k),np.zeros([(k-1)*k,k])), axis=1)
(at least, when I set k=3 and compare with the np.block(...) expression you gave, both results are identical)
IIUC, you can also try np.fill_diagonal such that you create the right shape of matrices and then fill in the diagonal parts.
def make_block(k):
    arr = np.zeros(((k-1)*k, k*k))
    np.fill_diagonal(arr, 1)
    return arr
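If you want the block structure to stay visible in the code, one more option (not taken from the answers above, so treat it as a sketch) is a Kronecker product: np.kron places a copy of the k x k identity wherever the (k-1) x k "layout" matrix has a 1. The helper name block_identity is hypothetical. The check at the end confirms that it agrees with the plain np.eye version for k = 3:

```python
import numpy as np

def block_identity(k):
    """(k-1) x k grid of k x k blocks: identity on the diagonal, zeros elsewhere."""
    layout = np.eye(k - 1, k)          # which blocks are identity blocks
    return np.kron(layout, np.eye(k))  # expand every layout entry into a k x k block

k = 3
via_kron = block_identity(k)
via_eye = np.eye(k * (k - 1), k * k)

print(via_kron.shape)
print(np.array_equal(via_kron, via_eye))
```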
There are two interpretations of your question. One is where you are basically creating a matrix of the form [[1, 0, 0], [0, 1, 0]], which can be mathematically represented as [I 0], and another where each element contains its own numpy array entirely (which limits what you can compute with it, but might be what you want).
The former:
np.append(np.eye(k-1), np.zeros((k-1, 1)), axis=1)
The latter (a bit more complicated):
m = k # Whatever block dimension you want
I = np.eye(m) # identity block
Z = np.zeros((m, m)) # zero block of the same size
arr = np.empty((k-1, k), dtype=object) # object array, so each cell can hold a whole block
for i in range(k-1):
    for j in range(k):
        if i == j:
            arr[i,j] = I
        else:
            arr[i,j] = Z
I really have no idea how the second one would be useful, so I think you might be a bit confused about the fundamental structure of a block matrix if that's what you think you want. [A b], for example, with A being a matrix and b being a vector, is generally thought of as representing a single matrix, with block notation existing only for simplicity's sake. Hope this helps!
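To make the "[A b] is a single matrix" point concrete, here is a small sketch using np.block with made-up values for A and b; the result is one ordinary 2D array, not an array of arrays:

```python
import numpy as np

A = np.eye(2)                  # 2 x 2 matrix
b = np.array([[5.0], [6.0]])   # 2 x 1 column vector

# Block notation [A b]: the pieces are laid side by side in one matrix.
augmented = np.block([A, b])

print(augmented.shape)
print(augmented)
```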

Walk through each column in a numpy matrix efficiently in Python

I have a very big two-dimensional array in Python, using the numpy library. I want to walk through each column efficiently and count the non-zero elements in every column.
Suppose I have the following matrix.
M = array([[1,2], [3,4]])
The following code enables us to walk through each row efficiently, for example (it is not what I intend to do of course!):
for row_idx, row in enumerate(M):
    print("row_idx", row_idx, "row", row)
    for col_idx, element in enumerate(row):
        print("col_idx", col_idx, "element", element)
        # update the matrix M: square each element
        M[row_idx, col_idx] = element ** 2
However, in my case I want to walk through each column efficiently, since I have a very big matrix.
I've heard that there is a very efficient way to achieve this using numpy, instead of my current code:
curr_col, curr_row = 0, 0
while (curr_col < numb_colonnes):
    result = 0
    while (curr_row < numb_rows):
        # If different from 0
        if (M[curr_row][curr_col] != 0):
            result += 1
        curr_row += 1
    # .... using result value ...
    curr_col += 1
    curr_row = 0
Thanks in advance!
In the code you showed us, you treat numpy arrays as lists and, as you can see, it works. But arrays are not lists, and if you only ever treat them as lists, there is little point in using arrays, or even numpy, at all.
To really exploit the usefulness of numpy you have to operate directly on arrays, writing, e.g.,
M = M*M
when you want to square the elements of an array, and use the rich set of numpy functions to operate directly on arrays.
That said, I'll try to get a bit closer to your problem...
If your intent is to count the elements of an array that are different from zero, you can use the numpy function sum.
Using sum, you can obtain the sum of all the elements in an array, or you can sum across a particular axis.
import numpy as np
a = np.array(((3,4),(5,6)))
print(np.sum(a)) # 18
print(np.sum(a, axis=0)) # [8, 10]
print(np.sum(a, axis=1)) # [7, 11]
Now you may protest: I don't want to sum the elements, I want to count the non-zero elements. But
if you write a logical test on an array, you obtain an array of booleans. For example, to test which elements of a are even:
print(a%2==0)
# [[False True]
# [False True]]
False counts as zero and True as one, at least when we sum them...
print(np.sum(a%2==0)) # 2
or, if you want to sum down each column, i.e., the index that changes is the 0-th
print(np.sum(a%2==0, axis=0)) # [0 2]
or sum across each row
print(np.sum(a%2==0, axis=1)) # [1 1]
To summarize, for your particular use case
by_col = np.sum(M!=0, axis=0)
# use the counts of non-zero terms in each column, stored in an array
...
# if you need the grand total, use sum again
total = np.sum(by_col)
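NumPy also has a dedicated function for this, np.count_nonzero, which accepts an axis argument in reasonably recent versions. A small sketch on a made-up matrix, checked against the boolean-sum approach from above:

```python
import numpy as np

M = np.array([[1, 0, 3],
              [0, 0, 4],
              [5, 0, 0]])

# Count non-zero entries down each column (axis=0).
by_col = np.count_nonzero(M, axis=0)

# Same result via the boolean-sum approach.
by_col_sum = np.sum(M != 0, axis=0)

total = int(by_col.sum())
print(by_col, total)
```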

function to get number of columns in a NumPy array that returns 1 if it is a 1D array

I have defined operations on 3xN NumPy arrays, and I want to loop over each column of the array.
I tried:
for i in range(nparray.shape[1]):
However, if nparray.ndim == 1, this fails.
Is there a clean way to ascertain the number of columns of a NumPy array, for example, to get 1 if it is a 1D array (like MATLAB's size operation does)?
Otherwise, I have implemented:
if nparray.ndim == 1:
    num_points = 1
else:
    num_points = nparray.shape[1]
for i in range(num_points):
If you're just looking for something less verbose, you could do this:
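num_points = np.atleast_2d(nparray.T).T.shape[1]
np.atleast_2d prepends the new axis (a length-3 vector becomes shape (1, 3)), so the two transposes are what make a 1D array count as a single column. That will, of course, make new temporary arrays just to take a shape, which is a little silly… but it'll be pretty cheap, because they're just views of the same memory.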
However, I think your explicit code is more readable, except that I might do it with a try:
try:
    num_points = nparray.shape[1]
except IndexError:
    num_points = 1
If you're doing this repeatedly, whatever you do, you should wrap it in a function. For example:
def num_points(arr, axis):
    try:
        return arr.shape[axis]
    except IndexError:
        return 1
Then all you have to write is:
for i in range(num_points(nparray, 1)):
And of course it means you can change things everywhere by just editing one place, e.g.:
def num_points(arr, axis):
    # appends a trailing axis, so a 1D array reports 1 along axis 1
    return arr[:,...,np.newaxis].shape[1]
If you want to keep the one-liner, how about using conditional expressions:
for i in range(nparray.shape[1] if nparray.ndim > 1 else 1):
    pass
By default, to iterate a np.array means to iterate over the rows. If you have to iterate over columns, just iterate through the transposed array:
>>> a2 = array(range(12)).reshape((3,4))
>>> for col in a2.T:
...     print(col)
...
[0 4 8]
[1 5 9]
[ 2  6 10]
[ 3  7 11]
What is the intended behavior for an array such as array([1,2,3]): should it be treated as having one column or three? Since you mention that the arrays are all 3xN, it should presumably be treated as having just one column:
>>> a1 = array(range(3))
>>> for col in a1.reshape((3,-1)).T:
...     print(col)
...
[0 1 2]
So, a general solution: for col in your_array.reshape((3,-1)).T: #do something
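A slightly more general sketch of the same idea, not tied to exactly three rows: reshape to (number of rows, -1), so that a 1D array becomes a single column and a 2D array is left as it is. The helper names column_count and as_columns are hypothetical:

```python
import numpy as np

def column_count(arr):
    """Number of columns, treating a 1D array as a single column."""
    return arr.shape[1] if arr.ndim > 1 else 1

def as_columns(arr):
    """Reshape a 1D array of length n to (n, 1); 2D arrays keep their shape."""
    return arr.reshape(arr.shape[0], -1)

single = np.array([1.0, 2.0, 3.0])      # one point, shape (3,)
many = np.arange(12.0).reshape(3, 4)    # four points, shape (3, 4)

print(column_count(single), column_count(many))

# Iterating over the transpose yields one column per step in both cases.
for col in as_columns(single).T:
    print(col)
```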
I think the easiest way is to use the len function:
for i in range(len(nparray)):
...
Why? Because if nparray is a one-dimensional vector, len returns the number of elements, which in your case is the number of columns.
nparray = numpy.ones(10)
print(len(nparray))
Out: 10
If nparray is a matrix, len returns the length of the first axis, i.e. the number of rows; use nparray.shape[1] if you need the number of columns.
nparray = numpy.ones((10, 5))
print(len(nparray))
Out: 10
If you have a list of numpy arrays with different sizes, just use len inside a loop based on your list.
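A quick check of what len reports next to shape, for a 1D and a 2D array (vec and mat are illustrative names), may help when deciding which one to use:

```python
import numpy as np

vec = np.ones(10)        # 1D, shape (10,)
mat = np.ones((10, 5))   # 2D, shape (10, 5)

# len() always reports the size of the first axis.
print(len(vec), vec.shape)
print(len(mat), mat.shape)

# The second axis of a 2D array is available through shape[1].
print(mat.shape[1])
```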
