Efficiently Subtract Vector from Matrix (Scipy) - python

I've got a large matrix stored as a scipy.sparse.csc_matrix and want to subtract a column vector from each one of the columns in the large matrix. This is a pretty common task when you're doing things like normalization/standardization, but I can't seem to find the proper way to do this efficiently.
Here's an example to demonstrate:
# mat is a 3x3 matrix
mat = scipy.sparse.csc_matrix([[1, 2, 3],
[2, 3, 4],
[3, 4, 5]])
#vec is a 3x1 matrix (or a column vector)
vec = scipy.sparse.csc_matrix([1,2,3]).T
"""
I want to subtract `vec` from each of the columns in `mat` yielding...
[[0, 1, 2],
[0, 1, 2],
[0, 1, 2]]
"""
One way to accomplish what I want is to hstack vec to itself 3 times, yielding a 3x3 matrix where each column is vec and then subtract that from mat. But again, I'm looking for a way to do this efficiently, and the hstacked matrix takes a long time to create. I'm sure there's some magical way to do this with slicing and broadcasting, but it eludes me.
Thanks!
EDIT: Removed the 'in-place' constraint, because sparsity structure would be constantly changing in an in-place assignment scenario.

For a start what would we do with dense arrays?
mat-vec.A # taking advantage of broadcasting
mat-vec.A[:,[0]*3] # explicit broadcasting
mat-vec[:,[0,0,0]] # that also works with csr matrix
In https://codereview.stackexchange.com/questions/32664/numpy-scipy-optimization/33566
we found that using as_strided on the mat.indptr vector is the most efficient way of stepping through the rows of a sparse matrix. (The x.rows, x.cols of an lil_matrix are nearly as good. getrow is slow.) This function implements such an iteration.
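from numpy.lib.stride_tricks import as_strided

def subtract_rows(X, v):
    # X must be CSR: consecutive indptr pairs bound each row's slice of X.data
    rows, cols = X.shape
    row_start_stop = as_strided(X.indptr, shape=(rows, 2),
                                strides=2*X.indptr.strides)
    for row, (start, stop) in enumerate(row_start_stop):
        data = X.data[start:stop]  # a view, so the update happens in place
        data -= v[row]

mat = mat.tocsr()  # the question's mat is CSC; row iteration needs CSR
subtract_rows(mat, vec.A)
print(mat.A)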
I'm using vec.A for simplicity. If we keep vec sparse we'd have to add a test for nonzero value at row. Also this type of iteration only modifies the nonzero elements of mat. 0's are unchanged.
I suspect the time advantages will depend a lot on the sparsity of the matrix and the vector. If vec has lots of zeros, then it makes sense to iterate, modifying only those rows of mat where vec is nonzero. But if vec is nearly dense, as in this example, it may be hard to beat mat-vec.A.

Summary
So in short, if you use CSR instead of CSC, it's a one-liner:
mat.data -= numpy.repeat(vec.toarray().ravel(), numpy.diff(mat.indptr))
Explanation
As you may have noticed, this is better done row-wise, since we subtract the same number from every element of a row. In your example: subtract 1 from the first row, 2 from the second row, and 3 from the third row.
I actually encountered this in a real-life application where I wanted to classify documents, each represented as a row in the matrix, while the columns represent words. Each document has a score by which the score of each word in that document should be multiplied. Using the row representation of the sparse matrix, I did something similar to this (I modified my code to answer your question):
import numpy
import scipy.sparse

mat = scipy.sparse.csc_matrix([[1, 2, 3],
                               [2, 3, 4],
                               [3, 4, 5]])
# vec is a 3x1 matrix (or a column vector)
vec = scipy.sparse.csc_matrix([1, 2, 3]).T
# Use the row version
mat_row = mat.tocsr()
vec_row = vec.T
# mat_row.data holds the stored values in a 1d array, traversed row by row from top left to bottom right.
# mat_row.indptr (an n+1 element array) holds the offset in mat_row.data where each row starts, plus a final entry equal to len(mat_row.data).
# numpy.diff(mat_row.indptr) is therefore the number of stored elements per row, so numpy.repeat copies each element of the row vector once per stored element in its row.
mat_row.data -= numpy.repeat(vec_row.toarray()[0], numpy.diff(mat_row.indptr))
print(mat_row.todense())
Which results in:
[[0 1 2]
[0 1 2]
[0 1 2]]
The visualization is something like this:
>>> mat_row.data
[1 2 3 2 3 4 3 4 5]
>>> mat_row.indptr
[0 3 6 9]
>>> numpy.diff(mat_row.indptr)
[3 3 3]
>>> numpy.repeat(vec_row.toarray()[0],numpy.diff(mat_row.indptr))
[1 1 1 2 2 2 3 3 3]
>>> mat_row.data -= numpy.repeat(vec_row.toarray()[0],numpy.diff(mat_row.indptr))
[0 1 2 0 1 2 0 1 2]
>>> mat_row.todense()
[[0 1 2]
[0 1 2]
[0 1 2]]
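The one-liner can be wrapped in a small helper, sketched here under a few assumptions: the name subtract_from_stored is hypothetical, vec is taken as a dense 1-D sequence with one entry per row, and the input is copied instead of modified. As in the answer above, only stored (nonzero) entries change. Implicit zeros stay zero, so the result differs from a dense subtraction wherever mat has zeros.

```python
import numpy as np
import scipy.sparse as sp

def subtract_from_stored(mat, vec):
    """Subtract vec[i] from every stored entry in row i; return a new CSR matrix."""
    vec = np.asarray(vec).ravel()
    # astype copies by default, so the caller's matrix is left untouched
    out = mat.tocsr().astype(np.result_type(mat.dtype, vec.dtype))
    # np.diff(indptr) is the number of stored entries in each row
    out.data -= np.repeat(vec, np.diff(out.indptr))
    return out

# Illustrative matrix with some implicit zeros
mat = sp.csc_matrix([[1, 0, 3],
                     [0, 3, 4],
                     [3, 4, 0]])
result = subtract_from_stored(mat, [1, 2, 3])
print(result.toarray())
```

The positions that were zero in mat are still zero in the output, which is what keeps the sparsity structure fixed.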

You can introduce fake dimensions by altering the strides of your vector. You can, without additional allocation, "convert" your vector to a 3 x 3 matrix using np.lib.stride_tricks.as_strided. This page has an example and a bit of a discussion about it, along with some discussion of related topics (like views). Search the page for "Example: fake dimensions with strides."
There are also quite a few examples on SO about this... but my searching skills are failing me now.
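A minimal sketch of the fake-dimension idea, assuming a dense 1-D NumPy vector (the names v, fake and safe are illustrative and not taken from the linked page). Setting the stride of the second axis to 0 makes every column reuse the same memory. np.broadcast_to builds an equivalent read-only view and is harder to misuse. Note that this applies to dense arrays; it does not by itself keep a sparse result sparse.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

v = np.array([1, 2, 3])

# Stride 0 along axis 1: fake[i, j] reads v[i] for every j, no data is copied
fake = as_strided(v, shape=(3, 3), strides=(v.strides[0], 0))

# Same view through the safer public helper
safe = np.broadcast_to(v[:, None], (3, 3))

dense = np.array([[1, 2, 3],
                  [2, 3, 4],
                  [3, 4, 5]])
result = dense - fake
print(result)
print(np.shares_memory(fake, v))
```

Writing into an as_strided view like this one would touch the same memory through several indices, so it is best treated as read-only.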

Related

Difficulties to understand np.nditer

I am very new to python. I want to clearly understand the below code, if there's anyone who can help me.
Code:
import numpy as np
arr = np.array([[1, 2, 3, 4, 99, 11, 22], [5, 6, 7, 8, 43, 54, 22]])
for x in np.nditer(arr[0:, ::4]):
    print(x)
My understanding:
This 2D array has two 1D arrays.
np.nditer(arr[0:,::4]) will give all value from 0 indexed array to upto last array, ::4 means the gap between printed arrays will be 4.
Question:
Is my understanding for no 2 above correct?
How can I get the index for the print(x)? Because of the step difference of 4 e.g [0:,::4] or any gap [0:,::x] I want to find out the exact index that it is printing. But how?
Addressing your questions below
Yes, I think your understanding is correct. It might help to first print what arr[0:,::4] returns though:
iter_array = arr[0:,::4]
print(iter_array)
>>> [[ 1 99]
>>> [ 5 43]]
The slicing keeps every 4th column of the original array (here, column indices 0 and 4). All nditer does is iterate through these values in order. (Quick FYI: arr[0:] and arr[:] are equivalent, since the starting point is 0 by default.)
As you pointed out, to get the index for these you need to keep track of the slicing that you did, i.e. arr[0:, ::x]. Remember, nditer has nothing to do with how you sliced your array. I'm not sure how to best get the indices of your slicing, but this is what I came up with:
import numpy as np
ls = [
    [1, 2, 3, 4, 99, 11, 22],
    [5, 6, 7, 8, 43, 54, 22]
]
arr = np.array(ls)
inds = np.array([
    [(ctr1, ctr2) for ctr2, _ in enumerate(l)] for ctr1, l in enumerate(ls)
])  # same layout as arr, but each element is its own (row, col) index
step = 4
iter_array = arr[0:, ::step]
iter_inds = inds[0:, ::step]
print(iter_array)
>>> [[ 1 99]
>>> [ 5 43]]
print(iter_inds)
>>> [[[0 0]
>>> [0 4]]
>>>
>>> [[1 0]
>>> [1 4]]]
All that I added here was an inds array. This array has elements equal to their own index. Then, when you slice both arrays in the same way, you get your indices. Hopefully this helps!
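An alternative sketch that avoids building a second array, assuming the slice is always of the form arr[:, ::step]: nditer can report each element's position inside the sliced view through its multi_index flag, and multiplying the column position by step maps it back to the original array. The variable names (view, found, orig) are illustrative.

```python
import numpy as np

arr = np.array([[1, 2, 3, 4, 99, 11, 22],
                [5, 6, 7, 8, 43, 54, 22]])
step = 4
view = arr[:, ::step]

found = []
it = np.nditer(view, flags=["multi_index"])
for x in it:
    row, col = it.multi_index      # position inside the sliced view
    orig = (row, col * step)       # position in the original array
    found.append((orig, int(x)))
    print(orig, int(x))
```

Each recorded index can be checked directly, since arr[orig] returns the value that was printed next to it.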

array[row][col] vs array[row,col] in Python

What is the difference between indexing a 2D array row/col with [row][col] vs [row, col] in numpy/pandas? Are there any implications of using one or the other?
For example:
import numpy as np
arr = np.array([[1, 2], [3, 4]])
print(arr[1][0])
print(arr[1, 0])
Both give 3.
Single-element indexing
For single-element indexing as in your example, the result is indeed the same. However, as stated in the docs:
So note that x[0,2] = x[0][2] though the second case is more
inefficient as a new temporary array is created after the first index
that is subsequently indexed by 2.
emphasis mine
Array indexing
In this case, double-indexing is not only less efficient, it simply gives different results. Let's look at an example:
>>> arr = np.array([[1, 2], [3, 4], [5, 6]])
>>> arr[1:][0]
[3 4]
>>> arr[1:, 0]
[3 5]
In the first case, we create a new array after the first index which is all rows from index 1 onwards:
>>> arr[1:]
[[3 4]
[5 6]]
Then we simply take the first element of that new array which is [3 4].
In the second case, we use numpy indexing which doesn't index the elements but indexes the dimensions. So instead of taking the first row, it is actually taking the first column - [3 5].
Using [row][col] takes one more function call than using [row, col]. When you index an array (or any object, for that matter), you are calling obj.__getitem__ under the hood. Since Python wraps the comma-separated indices in a tuple, obj[row][col] is equivalent to obj.__getitem__(row).__getitem__(col), whereas obj[row, col] is simply obj.__getitem__((row, col)). Therefore, indexing with [row, col] is more efficient because it makes one fewer function call (plus some namespace lookups, but those can normally be ignored).
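The difference in calls can be made visible with a toy class. This is only an illustration: Recorder is a hypothetical helper that logs every key passed to __getitem__ and returns itself so that chained indexing keeps working.

```python
class Recorder:
    """Hypothetical helper: remembers every key passed to __getitem__."""

    def __init__(self, log):
        self.log = log

    def __getitem__(self, key):
        self.log.append(key)
        return self  # allow chained indexing such as r[1][0]


log_chained = []
Recorder(log_chained)[1][0]   # two __getitem__ calls, one key each

log_tuple = []
Recorder(log_tuple)[1, 0]     # one __getitem__ call with a tuple key

print(log_chained)
print(log_tuple)
```

The first log holds two separate keys, the second a single tuple, which matches the one-call versus two-call description above.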

How to calculate two different numpy array's values then put the result in a third array

I have two numpy arrays that I need to calculate to get the needed behaviour for the third array.
To start, here is the first two arrays:
[[2 0 1 3 0 1]
[1 2 1 2 1 2] # ARRAY 1
[2 1 2 1 0 1]
[0 2 0 2 2 3]
[0 3 3 3 1 4]
[2 3 2 3 1 3]]
[[0.60961197 0.29067687 0.20701799 0.79897639 0.74822711 0.21928105]
[0.67683562 0.14261662 0.74655501 0.21529103 0.14347939 0.42190162]
[0.21116134 0.98618323 0.93882545 0.51422862 0.12715579 0.18808092] # ARRAY 2
[0.48570863 0.32068082 0.32335023 0.62634641 0.37418013 0.44860968]
[0.12498966 0.56458377 0.24902924 0.12992352 0.76903935 0.68230202]
[0.90349626 0.75727838 0.14188677 0.63082553 0.96360265 0.28694261]]
Where array1[0][0] will be subtracted from the input value array3[0][0], and then array2[0][0] will be used to multiply the result, giving the new value of array3[1][0] (in other words, these calculations WILL produce array3).
So for example, lets say the starting values of array3[0] are:
[[20,22,24,40,42,10],
....
For array3[0][0] (20), it needs to subtract 2 (coming from array1[0][0]), leaving the value with 18. The value 18 is then NOW multiplied by 0.60961197 (array2[0][0]) leaving a NEW VALUE of 10.97. 10.97 is now the NEW value of array3[1][0].
If you were to move onto the next column, the process would be the same. You would take 22-0 = 22, then take 22 * 0.29067687 to create the new value for array3[1][1].
To provide a visual example, the completed process of this array for the first two lines would look something like this:
[[20 22 24 40 42 10],
[10.97 19.65 7.44 10.58 7.03],
....
I am trying to continue this process for the entire length of the first array (and, I guess, the second, because they are the same size). So for the next row, you would take (10.97 - 1) * 0.6768... = 6.74.. and so on for each index until it reaches the end.
I'm quite stuck on what to do here. I tried a for loop, but I feel like there may be a much more efficient way of doing this in numpy.
I sincerely appreciate the help, I know this isn't easy (or maybe it will be!). This will kick start what will be a fairly lengthy project for me.
Thank you very much!
Note: If numpy arrays are not a good way to solve this problem and lets say lists are better, I am more than willing to go that route. I'm just assuming with most of numpy's functions this will be easier.
If I understood correctly you could do something like this:
import numpy as np
np.random.seed(42)
arr1 = np.array([[2, 0, 1, 3, 0, 1],
[1, 2, 1, 2, 1, 2],
[2, 1, 2, 1, 0, 1],
[0, 2, 0, 2, 2, 3],
[0, 3, 3, 3, 1, 4],
[2, 3, 2, 3, 1, 3]])
arr2 = np.array([[0.60961197, 0.29067687, 0.20701799, 0.79897639, 0.74822711, 0.21928105],
[0.67683562, 0.14261662, 0.74655501, 0.21529103, 0.14347939, 0.42190162],
[0.21116134, 0.98618323, 0.93882545, 0.51422862, 0.12715579, 0.18808092],
[0.48570863, 0.32068082, 0.32335023, 0.62634641, 0.37418013, 0.44860968],
[0.12498966, 0.56458377, 0.24902924, 0.12992352, 0.76903935, 0.68230202],
[0.90349626, 0.75727838, 0.14188677, 0.63082553, 0.96360265, 0.28694261]])
arr3 = np.random.randint(5, 30, size=(6, 6))
result = (arr3 - arr1) * arr2
print(result)
Output
[[ 5.48650773 6.97624488 3.72632382 9.58771668 8.97872532 5.2627452 ]
[ 6.7683562 2.99494902 19.41043026 2.79878339 2.00871146 10.96944212]
[ 4.85671082 6.90328261 9.3882545 13.88417274 0.89009053 4.702023 ]
[12.14271575 1.28272328 9.05380644 8.76884974 2.99344104 1.34582904]
[ 3.1247415 1.12916754 3.23738012 2.98824096 11.53559025 17.0575505 ]
[17.16642894 8.33006218 2.55396186 10.09320848 17.3448477 5.7388522 ]]
If applied to the data from your example, you get:
arr3 = np.array([20, 22, 24, 40, 42, 10])
result = (arr3 - arr1[0]) * arr2[0]
print(result)
Output
[10.97301546 6.39489114 4.76141377 29.56212643 31.42553862 1.97352945]
Note that in the second example I just use the first row from arr1 and arr2.
Just expanding my comment to a full answer. The question is talking about two kinds of "repeat":
Doable-in-parallel ones (column-wise, broadcast-able)
Not-doable-in-parallel ones (row-wise, iterative)
numpy handles broadcast (i.e. column direction) nicely, so just use a for loop in row direction:
for i in range(len(array1)):
    array3[i+1] = (array3[i] - array1[i]) * array2[i]
Note that array3 must have at least one more row than array1 and array2, otherwise this doesn't make sense.
EDIT
Oops, I didn't see that you want to avoid a for loop. Technically you can solve this problem without a for loop, but you need to work through the algebra yourself:
For convenience, name array1 as a, array2 as b, and the first row of array3 as c. The rows of array3 would then be:
c
(c-a0)*b0 = c*b0-a0*b0
((c-a0)*b0-a1)*b1 = c*b0*b1-a0*b0*b1-a1*b1
...
The final line of array3 can then be computed as
B = b[::-1].cumprod(0)[::-1]
final_c = c * B[0] - (B * a).sum(0)
If you want the whole array3, that's not really trivial to do without a for loop. You might be able to write it, but it would be painful both to read and to write. The performance is also questionable.
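As a sanity check, here is a small sketch comparing the row-by-row loop with the closed-form last row. The names a, b and c follow the answer; the 3x3 values are made up purely for illustration and are not the question's data.

```python
import numpy as np

# Made-up illustration values (not the arrays from the question)
a = np.array([[2.0, 0.0, 1.0],
              [1.0, 2.0, 1.0],
              [2.0, 1.0, 2.0]])
b = np.array([[0.5, 0.25, 0.2],
              [0.6, 0.1, 0.7],
              [0.2, 0.9, 0.9]])
c = np.array([20.0, 22.0, 24.0])

# Iterative version: each row depends on the previous one
rows = np.empty((len(a) + 1, c.size))
rows[0] = c
for i in range(len(a)):
    rows[i + 1] = (rows[i] - a[i]) * b[i]

# Closed form for the last row: B[i] is the product of b[i], b[i+1], ..., b[-1]
B = b[::-1].cumprod(0)[::-1]
final_c = c * B[0] - (B * a).sum(0)

print(rows[-1])
print(final_c)
```

Both printed rows should agree up to floating-point rounding, while only the loop gives the intermediate rows.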

Python: Hierarchical Slicing

Is there a more pythonic/numpythonic way to do some sort of nested/hierarchical slicing, i.e. a prettier version of this:
_sum = 0
for i in np.arange(n):
    _sum += someFunc(A[i,:])
Basically I would like to map someFunc (which takes arrays of any shape and returns a number) over the rows and then sum the results.
I have been thinking about np.sum(someFunc(A[:,:])), but according to my understanding this will just map someFunc over the whole array.
If I understood correctly, you could use a list comprehension like this:
sum([someFunc(A[i, :]) for i in np.arange(n)])
Define a function to count 1's in an array:
def foo(x):
    return (x==1).sum()
and a 2d array:
In [431]: X=np.array([[1,0,2],[3,1,1],[0,2,3]])
I can apply it iteratively to rows
In [432]: [foo(i) for i in X] # iterate on 1st dimension
Out[432]: [1, 2, 0]
In [433]: [foo(X[i,:]) for i in range(3)]
Out[433]: [1, 2, 0]
and get the total count with sum (here the Python sum)
In [434]: sum([foo(X[i,:]) for i in range(3)])
Out[434]: 3
As written, foo gives the same result when applied to the whole array
In [435]: foo(X)
Out[435]: 3
and for row counts, use the np.sum axis control:
In [440]: np.sum(X==1, axis=1)
Out[440]: array([1, 2, 0])
apply_along_axis can do the same sort of row iteration:
In [438]: np.apply_along_axis(foo,1,X)
Out[438]: array([1, 2, 0])
but for this it is overkill. It's more useful with 3d or larger arrays where it is awkward to iterate over all dimensions except the nth one. It's never faster than doing your own iteration.
It's clearly best if you can write the function to work on the whole array. But if you must iterate on rows, there aren't any magical solutions. vectorize and frompyfunc wrap functions that work with scalar values, not 1d arrays. Some row problems are solved by casting the rows as larger dtype objects (e.g. unique rows).

What is the meaning of single quote(') in matlab, and how to change it to python

grad = (1/m * (h-y)' * X) + lambda * [0;theta(2:end)]'/m;
cost = 1/(m) * sum(-y .* log(h) - (1-y) .* log(1-h)) + lambda/m/2*sum(theta(2:end).^2);
How do I translate these two lines to Python? I tried to use zip to do the same job as ', but it raises an error.
Short answer:
The ' operator in MATLAB is the matrix (conjugate) transpose operator. It flips the matrix across its dimensions and takes the complex conjugate of the matrix (the second part being what trips people up). The short answer is that the equivalent of a' in Python is np.atleast_2d(a).T.conj().
Slightly longer answer:
Don't use ' in MATLAB unless you really know what you are doing. Use .', which is the ordinary transpose. It is the equivalent of np.atleast_2d(a).T in Python (no conjugate). If you are sure that the a.ndim >= 2 in python, then you can just use a.T. If you are sure that a.ndim == 1 in Python, you can use a[None].T. If you are sure that a.ndim == 0 in Python then transposing is pointless so just do whatever you want.
Very Long Answer:
The basic idea of a transpose is that it flips an array or matrix around one dimension. So consider this:
>> a=[1,2,3,4,5,6]
a =
1 2 3 4 5 6
>> a'
ans =
1
2
3
4
5
6
>> b=[1,2,3;4,5,6]
b =
1 2 3
4 5 6
>> b'
ans =
1 4
2 5
3 6
So it seems pretty clear, ' does a transpose. But that is deceiving:
c=[1j,2j,3j,4j,5j,6j]
c =
Columns 1 through 3
0.000000000000000 + 1.000000000000000i 0.000000000000000 + 2.000000000000000i 0.000000000000000 + 3.000000000000000i
Columns 4 through 6
0.000000000000000 + 4.000000000000000i 0.000000000000000 + 5.000000000000000i 0.000000000000000 + 6.000000000000000i
>> c'
ans =
0.000000000000000 - 1.000000000000000i
0.000000000000000 - 2.000000000000000i
0.000000000000000 - 3.000000000000000i
0.000000000000000 - 4.000000000000000i
0.000000000000000 - 5.000000000000000i
0.000000000000000 - 6.000000000000000i
Where did all those negatives come from? They weren't in the original array. The reason for this is described in the documentation. The ' operator in MATLAB isn't a normal transpose operator, the normal transpose operator is .'. The ' operator does a complex conjugate transpose. It does the transpose of the matrix and does the complex conjugate of that matrix.
The problem is that this is almost never what you actually want. It will result in code that seems to work as expected, but silently changes your FFT data, for example. So unless you are absolutely, positively sure your algorithm requires a complex conjugate transpose, use .'.
As for Python, the Python transpose operator is .T. So you consider this:
>>> a = np.array([[1, 2, 3, 4, 5, 6]])
>>> print(a)
[[1 2 3 4 5 6]]
>>> print(a.T)
[[1]
[2]
[3]
[4]
[5]
[6]]
>>> b = np.array([[1j, 2j, 3j, 4j, 5j, 6j]])
>>> print(b)
[[ 0.+1.j 0.+2.j 0.+3.j 0.+4.j 0.+5.j 0.+6.j]]
>>> print(b.T)
[[ 0.+1.j]
[ 0.+2.j]
[ 0.+3.j]
[ 0.+4.j]
[ 0.+5.j]
[ 0.+6.j]]
Notice the lack of any negatives for the imaginary part. If you want to get the complex conjugate transpose, you need to use np.conj(a) or a.conj() to get the complex conjugate (either before or after doing the transpose). However, numpy has its own transpose pitfall:
>>> c = np.array([1, 2, 3, 4, 5, 6])
>>> print(c)
[1 2 3 4 5 6]
>>> print(c.T)
[1 2 3 4 5 6]
Huh? It didn't do anything. The reason is that np.array([1, 2, 3, 4, 5, 6]) creates a 1D array. A transpose is flipping the array along a particular dimension. That is meaningless when there is only one dimension, so the transpose doesn't do anything.
"But," you might object, "didn't a transpose of the 1D MATLAB matrix work?" The reason is more fundamental to how MATLAB and numpy store data. Consider Python:
>>> np.array([[1, 2, 3], [4, 5, 6]]).ndim
2
>>> np.array([1, 2, 3, 4, 5, 6]).ndim
1
>>> np.array(1).ndim
0
That seems reasonable. A 2D array has two dimensions, a 1D array has one dimension, and a scalar has zero dimensions. But try the same thing in MATLAB:
>> ndims([1,2,3;4,5,6])
ans =
2
>> ndims([1,2,3,4,5,6])
ans =
2
>> ndims(1)
ans =
2
Everything has 2 dimensions! MATLAB has no 1D or 0D data structures; everything in MATLAB must have at least 2 dimensions (although it may be possible to create your own effectively 1D or 0D class in MATLAB). So taking the transpose of your "1D" data structure in MATLAB worked because it wasn't actually 1D.
Both the conjugate transpose and the 1D transpose issues come down to the basic data type MATLAB and numpy use. MATLAB uses matrices, which inherently are at least 2D. numpy, on the other hand, uses arrays, which can have any number of dimensions. MATLAB matrices use matrix mathematics as their normal operations (so a * b in MATLAB is a matrix product) while Python arrays use element-by-element mathematics as their normal operators (so a * b is an element-by-element product, equivalent to a .* b in MATLAB). MATLAB has element-by-element operators, and numpy arrays have matrix operators (although no matrix transpose yet, though adding one is being considered), so this mostly applies to the default operations.
To avoid this issue in Python, there are several ways to get around it. Indexing with None in Python inserts additional dimensions. So for a 1D array a, a[None] will be a 2D array where the first dimension has a length of 1. If you don't know ahead of time what the dimensionality of your array is, you can use np.atleast_2d(a), which will make sure a has at least two dimensions. So 0D becomes 2D, 1D becomes 2D, 2D stays 2D, 3D stays 3D, 4D stays 4D, etc.
That being said, numpy has a matrix class that works the same as MATLAB's in all these regards (it even has a conjugate transpose operator, .H). Don't use it. The python community has standardized around arrays, since in practice that is almost always what you want. That means that most Python tools expect arrays, and many will either malfunction if given matrices or will convert them to arrays. So just use arrays.
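To make the two translations concrete, here is a minimal sketch for inputs with at most two dimensions. The helper names ctranspose and transpose are hypothetical (chosen to echo the MATLAB function names); they only wrap the np.atleast_2d(a).T.conj() and np.atleast_2d(a).T expressions given in the short answer.

```python
import numpy as np

def ctranspose(a):
    """Rough analogue of MATLAB's a' for inputs with at most 2 dimensions."""
    return np.atleast_2d(a).T.conj()

def transpose(a):
    """Rough analogue of MATLAB's a.' for inputs with at most 2 dimensions."""
    return np.atleast_2d(a).T

row = np.array([1j, 2j, 3j])   # 1D in numpy; MATLAB would treat it as 1x3

col_conj = ctranspose(row)     # column vector with the signs of the imaginary parts flipped
col_plain = transpose(row)     # column vector, values unchanged

print(col_conj.shape, col_plain.shape)
```

For real-valued data the two helpers return the same values, which is why the difference tends to go unnoticed until complex numbers appear.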
The " ' " in MATLAB is the 'transpose' of a matrix. The numpy package is the fundamental package for scientific computing in Python. numpy.transpose can be used to carry out the same task.
import numpy as np
matrix = np.arange(6).reshape((2,3))
This is going to create a matrix with two rows and three columns as follows:
>>> array([[0, 1, 2],[3, 4, 5]])
Then the transpose is given as:
np.transpose (matrix)
>>> array([[0, 3],[1, 4],[2, 5]])
I hope it helps
