Copy keyword breaks numpy's copy/view philosophy - python

I have noticed that none of the methods used to convert between types of sparse matrices actually use the copy kwarg supplied to the method. Even though copying does happen in most cases, the data array (where it is valid) always has a base set, which makes it look like a view in code, even though a copy has de facto been made.
Is this intentional behavior?
For instance, here are examples with csr and csc arrays. As you can see, all of them have bases, no matter what.
In [1]: import numpy as np
...: from scipy import sparse
...:
...: a = np.arange(20).reshape(4, 5)
...: csr = sparse.csr_array(a, copy=True)
...: print('csr.data.base', id(csr.data.base) if csr.data.base is not None else None)
...:
...: csr_copy = csr.copy()
...: print('csr_copy.data.base', id(csr_copy.data.base) if csr_copy.data.base is not None else None)
...:
...: csc_copy = csr.tocsc(copy=True)
...: print('csc_copy.data.base', id(csc_copy.data.base) if csc_copy.data.base is not None else None)
...:
...: csc_copy_2 = csr.tocsc()
...: print('csc_copy_2.data.base', id(csc_copy_2.data.base) if csc_copy_2.data.base is not None else None)
csr.data.base 4392865488
csr_copy.data.base 4392866448
csc_copy.data.base 4392866640
csc_copy_2.data.base 4392867120
While it makes sense for csr_copy to have the same base as csr.data, I don't understand why any of the other objects have a base attribute set on data to begin with.
In particular, this behavior prevents the user from manipulating the data and indices attributes of the array directly. For instance, it becomes impossible to extend a csr matrix by adding rows to it with the in-place resize method:
In [2]: old_nnz = csr.nnz
...: row = [1, 2, 3, 4, 5] # Lets append row of 5 elements to csr
...:
...: csr.resize(5, 5)
...:
...: print(id(csr.data))
...: print(csr.data)
...:
...: print(id(csr.data.base))
...: print(csr.data.base)
...:
...: csr.data.resize((old_nnz + len(row),), refcheck=True)
4757413808
[ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19]
4757413520
[ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19]
Traceback (most recent call last):
  File "/opt/homebrew/Caskroom/miniforge/base/envs/dev/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3433, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-34-c52e3457494e>", line 12, in <module>
    csr.data.resize((old_nnz + len(row),), refcheck=True)
ValueError: cannot resize this array: it does not own its data
While using np.resize might work, I am not sure whether it is really in place:
In [3]: old_nnz = csr.nnz
...: row = [1, 2, 3, 4, 5] # Let's append row of 5 elements to csr
...:
...: csr.resize(5, 5)
...:
...: print('Data')
...: print(id(csr.data))
...: print(csr.data)
...:
...: print("Data's Base")
...: print(id(csr.data.base))
...: print(csr.data.base)
...:
...: print('New Data')
...: new_data = np.resize(csr.data, (old_nnz + len(row),))
...: print(id(new_data))
...: print(new_data)
...:
...: print("New Data's Base")
...: print(id(new_data.base))
...: print(new_data.base)
...:
...: new_indices = np.resize(csr.indices, (old_nnz + len(row),))
Data
5256251600
[ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19]
Data's Base
5256250736
[ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19]
New Data
5256250928
[ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 1 2 3 4 5]
New Data's Base
5256253040
[ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 1 2 3 4 5
6 7 8 9 10 11 12 13 14 15 16 17 18 19]
I have been reading the source code of these functions, and I don't even see copy used in some of them. For instance, in _csr.py:
def tocsc(self, copy=False):
    idx_dtype = get_index_dtype((self.indptr, self.indices),
                                maxval=max(self.nnz, self.shape[0]))
    indptr = np.empty(self.shape[1] + 1, dtype=idx_dtype)
    indices = np.empty(self.nnz, dtype=idx_dtype)
    data = np.empty(self.nnz, dtype=upcast(self.dtype))

    csr_tocsc(self.shape[0], self.shape[1],
              self.indptr.astype(idx_dtype),
              self.indices.astype(idx_dtype),
              self.data,
              indptr,
              indices,
              data)

    A = self._csc_container((data, indices, indptr), shape=self.shape)
    A.has_sorted_indices = True
    return A
Even though I can see that a new array (data) is created here, somewhere down the line, perhaps at the C/Python interface, it ends up with a base set.
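For what it's worth, a more direct check than inspecting .base is np.shares_memory, which compares the actual buffers. A small diagnostic sketch reusing the session above (expected results in comments, not verified on every scipy version):

import numpy as np
from scipy import sparse

a = np.arange(20).reshape(4, 5)
csr = sparse.csr_array(a, copy=True)
csc_copy = csr.tocsc(copy=True)

# Having a base does not imply the buffers are shared with the source arrays.
print(np.shares_memory(csr.data, a))              # expected False: the constructor copied
print(np.shares_memory(csc_copy.data, csr.data))  # expected False: tocsc built new arrays
print(csr.data.flags.owndata)                     # False: data has a base, as observed above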

I only have scipy v1.7.3, so I don't have access to the major rewrite of the sparse module in 1.8 (e.g. no csr_array or _data.py file).
Whether something has a base or not is not a reliable measure of whether a copy was made. Take your first example:
In [74]: a = np.arange(20).reshape(4, 5)
...: csr = sparse.csr_matrix(a, copy=True)
In [75]: a
Out[75]:
array([[ 0, 1, 2, 3, 4],
[ 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14],
[15, 16, 17, 18, 19]])
In [76]: a.base
Out[76]:
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19])
a is a view of the 1d array produced by the arange. That array is not accessible - except as the base.
In [77]: csr
Out[77]:
<4x5 sparse matrix of type '<class 'numpy.intc'>'
with 19 stored elements in Compressed Sparse Row format>
The data attribute has a base that looks the same as data itself, though its id is different. We'd have to study the code to see how data was derived from its base.
In [78]: csr.data
Out[78]:
array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
18, 19], dtype=int32)
In [79]: csr.data.base
Out[79]:
array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
18, 19], dtype=int32)
It isn't a view of a or a.base, as we can prove by modifying an element.
In [82]: csr.data[0] = 100
In [83]: csr.A
Out[83]:
array([[ 0, 100, 2, 3, 4],
[ 5, 6, 7, 8, 9],
[ 10, 11, 12, 13, 14],
[ 15, 16, 17, 18, 19]], dtype=int32)
In [84]: a
Out[84]:
array([[ 0, 1, 2, 3, 4],
[ 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14],
[15, 16, 17, 18, 19]])
The copy parameter makes most sense when keeping the same format. Changing format can involve reordering the data (csr to csc), or summing duplicates (coo to csr), etc.
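For example, here is a small sketch (with coo_matrix, chosen for illustration) of why converted data cannot simply be a view: duplicate entries are summed during coo-to-csr conversion, so new arrays have to be built.

row  = np.array([0, 0, 1])
col  = np.array([0, 0, 2])
vals = np.array([1.0, 2.0, 3.0])
coo = sparse.coo_matrix((vals, (row, col)), shape=(2, 3))
print(coo.data)          # [1. 2. 3.]
print(coo.tocsr().data)  # [3. 3.]  -- the two entries at (0, 0) were summed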
Let's try making a new csr:
In [87]: csr1 = sparse.csr_matrix(csr, copy=False)
In [88]: csr2 = sparse.csr_matrix(csr, copy=True)
As with csr, both of these have data.base and different ids. But if I modify an element of csr, that change only appears in csr1. csr2 is indeed a copy.
In [93]: csr.data[1] = 200
In [97]: csr1.data[1]
Out[97]: 200
In [98]: csr2.data[1]
Out[98]: 2
resize
I haven't used resize before for sparse, and rarely use it for numpy. But playing with the csr, it's evident it's quite a different operation.
csr.resize(5,5) appears to just change the indptr (and shape), without change to data or indices.
csr.resize(5,6) just seems to change the shape. I don't see a change in main attributes. Neither adds nonzero values, so "padding" with 0s doesn't change much.
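A self-contained sketch verifying the (5,5) case (assuming the numpy/scipy imports from earlier; expected results in comments):

m = sparse.csr_matrix(np.arange(20).reshape(4, 5))
before = (m.data.copy(), m.indices.copy(), m.indptr.copy())
m.resize(5, 5)
print(np.array_equal(before[0], m.data))     # expected True: data untouched
print(np.array_equal(before[1], m.indices))  # expected True: indices untouched
print(before[2], '->', m.indptr)             # indptr gains one trailing entry for the empty row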
You don't want to do csr.data.resize(...). Such a change would also require changing indices and indptr (to maintain a consistent csr format). data can have 0s, but it should be cleaned up with a call to eliminate_zeros.
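If the real goal from the question is to append a row, one safe (if not strictly in-place) approach is to rebuild the three arrays and construct a new matrix. A sketch, where append_row is just an illustrative helper, not a scipy API:

def append_row(m, new_cols, new_vals):
    # Return a new CSR matrix with one extra row appended (not in-place).
    data    = np.concatenate([m.data, new_vals])
    indices = np.concatenate([m.indices, new_cols])
    indptr  = np.concatenate([m.indptr, [m.indptr[-1] + len(new_vals)]])
    return sparse.csr_matrix((data, indices, indptr),
                             shape=(m.shape[0] + 1, m.shape[1]))

m = sparse.csr_matrix(np.arange(20).reshape(4, 5))
m2 = append_row(m, np.arange(5), np.array([1, 2, 3, 4, 5]))
print(m2.shape, m2.nnz)   # (5, 5) 24

(In practice sparse.vstack([m, sparse.csr_matrix([[1, 2, 3, 4, 5]])]) achieves the same result.)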
ravel
The sparse code could be doing something as harmless as ravel.
In [129]: x = np.array([1,2,3]).ravel()
In [130]: x.base
Out[130]: array([1, 2, 3])
In [131]: x.resize(4)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Input In [131], in <cell line: 1>()
----> 1 x.resize(4)
ValueError: cannot resize this array: it does not own its data
Another example where base is not a reliable indicator of copy/view is when indexing columns (of a 2d array):
Make a 2d array, which has its own data:
In [144]: arr = np.array([[1,2],[3,4]])
In [145]: arr.base # None
A row selection also has a None base:
In [146]: arr[[1]].base
But a column selection does not - even though it is a copy:
In [147]: arr[:,[1]].base
Out[147]: array([[2, 4]])
In [148]: arr[:,[1]]
Out[148]:
array([[2],
[4]])
Evidently the indexing operation selects a (1,2) array, which is then reshaped to (2,1). Actually, looking at the strides, I think it's doing a transpose: arr[:,[1]].base.T.
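A short sketch that makes the same point without guessing from the repr (expected results in comments):

col = arr[:, [1]]
print(np.shares_memory(col, arr))      # expected False: fancy indexing copies
print(col.base)                        # the (1, 2) array it was built from
print(col.strides, col.base.strides)   # swapped strides, consistent with col being base.T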

Related

Is there a method of vectorizing the printing of elements in a Numpy array?

I have a numpy array named "a":
a = numpy.array([
    [[1, 2, 3], [11, 22, 33]],
    [[4, 5, 6], [44, 55, 66]],
])
I want to print the following (in this exact format):
1 2 3
11 22 33
4 5 6
44 55 66
To accomplish this, I wrote the following:
for i in range(len(a)):
    inner = a[i]
    for j in range(len(inner)):
        a1 = inner[j][0]
        a2 = inner[j][1]
        a3 = inner[j][2]
        print(a1, a2, a3)
The output is:
1 2 3
11 22 33
4 5 6
44 55 66
I would like to vectorize my solution (if possible) and get rid of the for loops. I understand that this problem might not benefit from vectorization. In reality (for work-related purposes), the array "a" has 52 elements, and each element contains hundreds of arrays. I'd like to solve a basic/trivial case first and then move on to a more advanced, realistic case.
Also, I know that Numpy arrays were not meant to be iterated through.
I could have used Python lists to accomplish this, but I really want to vectorize it (if possible, of course).
You could use np.apply_along_axis, which maps a function over the array along an arbitrary axis. Apply it on axis=2 to get the desired result.
Using print directly as the callback:
>>> np.apply_along_axis(print, 2, a)
[1 2 3]
[11 22 33]
[4 5 6]
[44 55 66]
Or with a lambda wrapper:
>>> np.apply_along_axis(lambda r: print(' '.join([str(x) for x in r])), 2, a)
1 2 3
11 22 33
4 5 6
44 55 66
In [146]: a = numpy.array([
...: [[1, 2, 3], [11, 22, 33]],
...: [[4, 5, 6], [44, 55, 66]],
...: ])
...:
In [147]: a
Out[147]:
array([[[ 1, 2, 3],
[11, 22, 33]],
[[ 4, 5, 6],
[44, 55, 66]]])
A proper "vectorized" numpy output is:
In [148]: a.reshape(-1,3)
Out[148]:
array([[ 1, 2, 3],
[11, 22, 33],
[ 4, 5, 6],
[44, 55, 66]])
You could also convert that to a list of lists:
In [149]: a.reshape(-1,3).tolist()
Out[149]: [[1, 2, 3], [11, 22, 33], [4, 5, 6], [44, 55, 66]]
But you want a print without the standard numpy formatting (or the list formatting).
Still, this iteration is easy:
In [150]: for row in a.reshape(-1,3):
...: print(*row)
...:
1 2 3
11 22 33
4 5 6
44 55 66
Since your desired output is a print, or at least "unformatted" strings, there's no "vectorized", i.e. whole-array, option. You have to iterate on each line!
np.savetxt creates a csv output by iterating on rows and writing each one with a format string, e.g. f.write(fmt % tuple(row)).
In [155]: np.savetxt('test', a.reshape(-1,3), fmt='%d')
In [156]: cat test
1 2 3
11 22 33
4 5 6
44 55 66
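savetxt accepts any file-like object, so the same trick can print directly instead of going through a file; a small sketch writing to sys.stdout:

import sys
np.savetxt(sys.stdout, a.reshape(-1,3), fmt='%d')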
To get that exact output without iterating, try this:
print(str(a.tolist()).replace('], [', '\n').replace('[', '').replace(']', '').replace(',', ''))

Numpy argsort - what is happening?

I have a numpy array called arr1 defined like following.
arr1 = np.array([1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8,9,9])
print(arr1.argsort())
array([ 0, 1, 2, 3, 4, 5, 6, 7, 9, 8, 10, 11, 12, 13, 14, 15, 16,
17], dtype=int64)
I expected all the indices of the array to be in numeric order but indices 8 and 9 seems to have flipped.
Can someone help on why this is happening?
np.argsort by default uses the quicksort algorithm which is not stable. You can specify kind = "stable" to perform a stable sort, which will preserve the order of equal elements:
import numpy as np
arr1 = np.array([1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8,9,9])
print(arr1.argsort(kind="stable"))
It gives:
[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17]
Because it sorts according to the quicksort algorithm; if you follow the steps of that algorithm you will see why they are flipped. https://numpy.org/doc/stable/reference/generated/numpy.argsort.html
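Either way the result is a valid sort order; only the tie-breaking among equal values differs. A quick sanity-check sketch:

import numpy as np

arr1 = np.array([1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8,9,9])
order = arr1.argsort()                             # default kind='quicksort'
print(np.array_equal(arr1[order], np.sort(arr1)))  # True: equal values, possibly reordered indices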

How can I iterate through numpy 3d array

So I have an array:
array([[[27, 27, 28],
        [27, 14, 28]],
       [[14,  5,  4],
        [ 5,  6, 14]]])
How can I iterate through it and on each iteration get the [a, b, c] values, I try like that:
for v in np.nditer(a):
    print(v)
but it just prints
27
27
28
27
14
28
14
5
4
5
6
I need:
[27 27 28]
[27 14 28]...
b = a.reshape(-1, 3)
for triplet in b:
    ...
Apparently you want to iterate on the first 2 dimensions of the array, returning the 3rd (as 1d array).
In [242]: y = np.array([[[27, 27, 28],
...: [27, 14, 28]],
...:
...: [[14, 5, 4],
...: [ 5, 6, 14]]])
Double loops are fine, as is reshaping to a (4,2) and iterating.
nditer isn't usually needed, or encouraged as an iteration mechanism (its documentation needs a stronger disclaimer). It's really meant for C level code. It isn't used much in Python level code. One exception is the np.ndindex function, which can be useful in this case:
In [244]: for ij in np.ndindex(y.shape[:2]):
...: print(ij, y[ij])
...:
(0, 0) [27 27 28]
(0, 1) [27 14 28]
(1, 0) [14 5 4]
(1, 1) [ 5 6 14]
ndindex uses nditer in multi_index mode on a temp array of the specified shape.
Where possible try to work without iteration. Iteration, with any of these tricks, is relatively slow.
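For instance, if the per-triplet work is itself a numpy operation, it can usually be applied along the last axis instead of looping; a sketch using a sum as a stand-in for whatever is done with each triplet:

print(y.sum(axis=-1))           # shape (2, 2): one result per triplet
print(y.reshape(-1, 3).sum(1))  # same numbers, flattened to match the loop order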
You could do something ugly as
for i in range(len(your_array)):
    for j in range(len(your_array[i])):
        print(your_array[i][j])
Think of it as having arrays within an array. So within array v you have array a, which in turn contains the triplets b:
import numpy as np

na = np.array
v = na([[[27, 27, 28], [27, 14, 28]], [[14, 5, 4], [5, 6, 14]]])
for a in v:
    for b in a:
        print(b)
Output:
[27 27 28]
[27 14 28]
[14 5 4]
[ 5 6 14]
Alternatively you could do the following,
v2 = [b for a in v for b in a]
Now all your triplets are stored in v2
[array([27, 27, 28]),
array([27, 14, 28]),
array([14, 5, 4]),
array([ 5, 6, 14])]
..and you can access them like a 1D array eg
print(v2[0])
gives..
array([27, 27, 28])
Another alternative (useful for arbitrary dimensionality of the array containing the n-tuples):
a_flat = a.ravel()
n = 3
m = len(a_flat) // n
[a_flat[i*n:i*n + n] for i in range(m)]
or in one line (slower):
[a.ravel()[i*n:i*n + n] for i in range(len(a.ravel()) // n)]
or for further usage within a loop:
for i in range(len(a.ravel()) // n):
    print(a.ravel()[i*n:i*n + n])
Reshape the array A (whose shape is (n1, n2, 3)) to array B (whose shape is (n1 * n2, 3)), and iterate through B. Note that B is just a view of A: A and B share the same data block in memory, but they have different array headers, which record their shapes, and changing values in B will also change A's values. The code below:
a = np.array([[[27, 27, 28], [27, 14, 28]],
              [[14,  5,  4], [ 5,  6, 14]]])
b = a.reshape((-1, 3))
for last_d in b:
    x, y, z = last_d
    # do something with last_d or x, y, z
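A quick check of the view claim (a sketch; expected output in comments):

print(np.shares_memory(a, b))   # True: reshape of a contiguous array returns a view
b[0, 0] = 99
print(a[0, 0, 0])               # 99: the write through b is visible in a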

extract all vertical slices from numpy array

I want to extract a complete slice from a 3D numpy array using ndeumerate or something similar.
arr = np.random.rand(4, 3, 3)
I want to extract all possible arr[:, x, y] where x, y range from 0 to 2
ndindex is a convenient way of generating the indices corresponding to a shape:
In [33]: arr = np.arange(36).reshape(4,3,3)
In [34]: for xy in np.ndindex((3,3)):
...: print(xy, arr[:,xy[0],xy[1]])
...:
(0, 0) [ 0 9 18 27]
(0, 1) [ 1 10 19 28]
(0, 2) [ 2 11 20 29]
(1, 0) [ 3 12 21 30]
(1, 1) [ 4 13 22 31]
(1, 2) [ 5 14 23 32]
(2, 0) [ 6 15 24 33]
(2, 1) [ 7 16 25 34]
(2, 2) [ 8 17 26 35]
It uses nditer, but doesn't have any speed advantages over a nested pair of for loops.
In [35]: for x in range(3):
...: for y in range(3):
...: print((x,y), arr[:,x,y])
ndenumerate uses arr.flat as the iterator, but it can be used in the same way:
In [38]: for xy, _ in np.ndenumerate(arr[0,:,:]):
...: print(xy, arr[:,xy[0],xy[1]])
This does the same thing, iterating on the elements of a 3x3 subarray. As with ndindex, it generates the indices. The enumerated element won't be the size-4 array that you want, so I ignored it.
A different approach is to flatten the later axes, transpose, and then just iterate on the (new) first axis:
In [43]: list(arr.reshape(4,-1).T)
Out[43]:
[array([ 0, 9, 18, 27]),
array([ 1, 10, 19, 28]),
array([ 2, 11, 20, 29]),
array([ 3, 12, 21, 30]),
array([ 4, 13, 22, 31]),
array([ 5, 14, 23, 32]),
array([ 6, 15, 24, 33]),
array([ 7, 16, 25, 34]),
array([ 8, 17, 26, 35])]
or with the print as before:
In [45]: for a in arr.reshape(4,-1).T:print(a)
Why not just
[arr[:, x, y] for x in range(3) for y in range(3)]

Pythonic way to get both diagonals passing through a matrix entry (i,j)

What is the Pythonic way to get a list of diagonal elements in a matrix passing through entry (i,j)?
For e.g., given a matrix like:
[1 2 3 4 5]
[6 7 8 9 10]
[11 12 13 14 15]
[16 17 18 19 20]
[21 22 23 24 25]
and an entry, say, (1,3) (representing element 9) how can I get the elements in the diagonals passing through 9 in a Pythonic way? Basically, [3,9,15] and [5,9,13,17,21] both.
Using np.diagonal with a little offset logic.
import numpy as np
lst = np.array([[ 1,  2,  3,  4,  5],
                [ 6,  7,  8,  9, 10],
                [11, 12, 13, 14, 15],
                [16, 17, 18, 19, 20],
                [21, 22, 23, 24, 25]])
i, j = 1, 3
major = np.diagonal(lst, offset=(j - i))
print(major)
array([ 3, 9, 15])
minor = np.diagonal(np.rot90(lst), offset=-lst.shape[1] + (j + i) + 1)
print(minor)
array([ 5, 9, 13, 17, 21])
The indices i and j are the row and column. By specifying the offset, numpy knows where to begin selecting elements for the diagonal.
For the major diagonal, you want to start collecting from 3 in the first row, so you subtract the current row index from the current column index to find the correct column at row 0. Similarly for the minor diagonal, except the array is first flipped (rotated by 90°) before the same process is applied.
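If the rot90 bookkeeping feels opaque, the same minor diagonal can be taken with np.fliplr, which only mirrors the columns; a small equivalent sketch reusing lst, i and j from above:

# After the flip, entry (i, j) sits in column n-1-j, so the anti-diagonal
# through it has offset (n - 1 - j) - i.
n = lst.shape[1]
minor_alt = np.diagonal(np.fliplr(lst), offset=(n - 1 - j) - i)
print(minor_alt)   # [ 5  9 13 17 21]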
As another alternative, by raveling the array, for a matrix with shape (n, n):
array = np.array([[ 1,  2,  3,  4,  5],
                  [ 6,  7,  8,  9, 10],
                  [11, 12, 13, 14, 15],
                  [16, 17, 18, 19, 20],
                  [21, 22, 23, 24, 25]])
x, y = 1, 3

a_mod = array.ravel()
size = array.shape[0]

if y >= x:
    diag = a_mod[y - x:(x + size - y) * size:size + 1]
else:
    diag = a_mod[(x - y) * size::size + 1]

if x - (size - 1 - y) >= 0:
    reverse_diag = array[:, ::-1].ravel()[(x - (size - 1 - y)) * size::size + 1]
else:
    reverse_diag = a_mod[x + y:(x + y) * size + 1:size - 1]

# diag         --> [ 3  9 15]
# reverse_diag --> [ 5  9 13 17 21]
The correctness of the resulting arrays should still be checked further. This approach can be extended to handle matrices with other shapes, e.g. (n, m).
