I am constructing a sparse vector using a scipy.sparse.csr_matrix like so:
csr_matrix((values, (np.zeros(len(indices)), indices)), shape = (1, max_index))
This works fine for most of my data, but occasionally I get a ValueError: could not convert integer scalar.
This reproduces the problem:
In [145]: inds
Out[145]:
array([ 827969148, 996833913, 1968345558, 898183169, 1811744124,
2101454109, 133039182, 898183170, 919293479, 133039089])
In [146]: vals
Out[146]:
array([ 1., 1., 1., 1., 1., 2., 1., 1., 1., 1.])
In [147]: max_index
Out[147]:
2337713000
In [143]: csr_matrix((vals, (np.zeros(10), inds)), shape = (1, max_index+1))
...
996 fn = _sparsetools.csr_sum_duplicates
997 M,N = self._swap(self.shape)
--> 998 fn(M, N, self.indptr, self.indices, self.data)
999
1000 self.prune() # nnz may have changed
ValueError: could not convert integer scalar
inds is a np.int64 array and vals is a np.float64 array.
The relevant part of the scipy sum_duplicates code is here.
Note that this works:
In [235]: csr_matrix(([1,1], ([0,0], [1,2])), shape = (1, 2**34))
Out[235]:
<1x17179869184 sparse matrix of type '<type 'numpy.int64'>'
with 2 stored elements in Compressed Sparse Row format>
So the problem is not that one of the dimensions is > 2^31
Any thoughts why these values should be causing a problem?
Might it be that max_index > 2**31?
Try this, just to make sure:
csr_matrix((vals, (np.zeros(10), inds//2)), shape=(1, max_index//2))
The shape you pass only needs to cover the maximum column index you are actually supplying. This
sparse.csr_matrix((vals, (np.zeros(10), inds)), shape=(1, np.max(inds)+1))
works fine for me, although calling .todense() gives a memory error because of the large size of the matrix.
Commenting out the sum_duplicates function leads to other errors. But the fix from strange error when creating csr_matrix also solves your problem; you can extend the version check to newer versions of scipy.
import scipy
import scipy.sparse

if scipy.__version__ in ("0.14.0", "0.14.1", "0.15.1"):
    _get_index_dtype = scipy.sparse.sputils.get_index_dtype

    def _my_get_index_dtype(*a, **kw):
        # drop check_contents so the indices are not downcast to int32
        kw.pop('check_contents', None)
        return _get_index_dtype(*a, **kw)

    scipy.sparse.compressed.get_index_dtype = _my_get_index_dtype
    scipy.sparse.csr.get_index_dtype = _my_get_index_dtype
    scipy.sparse.bsr.get_index_dtype = _my_get_index_dtype
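A possible workaround that sidesteps the monkey-patch entirely (just a sketch, not tested against those old scipy releases): sum any duplicate column indices yourself and build the CSR arrays directly, so the coo-to-csr conversion that calls sum_duplicates never runs. The helper name make_row_csr is made up for illustration.
import numpy as np
from scipy.sparse import csr_matrix

def make_row_csr(inds, vals, n_cols):
    """Build a 1 x n_cols CSR row vector without relying on sum_duplicates."""
    inds = np.asarray(inds, dtype=np.int64)
    vals = np.asarray(vals, dtype=np.float64)
    # collapse duplicate column indices by summing their values
    uniq, inverse = np.unique(inds, return_inverse=True)
    data = np.bincount(inverse, weights=vals)
    indptr = np.array([0, len(uniq)], dtype=np.int64)
    # the (data, indices, indptr) constructor takes the arrays as given
    return csr_matrix((data, uniq, indptr), shape=(1, n_cols))
For the arrays above that would be make_row_csr(inds, vals, max_index + 1).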
I want to create a NumPy array of np.ndarray from an iterable. This is because I have a function that will return np.ndarray of some constant shape, and I need to create an array of results from this function, something like this:
OUTPUT_SHAPE = some_constant

def foo(input) -> np.ndarray:
    # processing
    # generates an np.ndarray of shape OUTPUT_SHAPE
    return output
inputs = [i for i in range(100000)]
iterable = (foo(input) for input in inputs)
arr = np.fromiter(iterable, np.ndarray)
This obviously gives an error:
cannot create object arrays from iterator
I cannot first create a list and then convert it to an array, because that would first create a copy of every output array, so for a time almost double the memory would be occupied, and I have very limited memory.
Can anyone help me?
You probably shouldn't make an object array. You should probably make an ordinary 2D array of non-object dtype. As long as you know the number of results the iterator will give in advance, you can avoid most of the copying you're worried about by doing it like this:
# assuming OUTPUT_SHAPE is a shape tuple, prepend the number of outputs to it
arr = numpy.empty((num_iterator_outputs,) + OUTPUT_SHAPE, dtype=whatever_appropriate_dtype)
for i, output in enumerate(iterable):
    arr[i] = output
This only needs to hold arr and a single output in memory at once, instead of arr and every output.
If you really want an object array, you can get one. The simplest way would be to go through a list, which will not perform the copying you're worried about as long as you do it right:
outputs = list(iterable)
arr = numpy.empty(len(outputs), dtype=object)
arr[:] = outputs
Note that if you just try to call numpy.array on outputs, it will try to build a 2D array, which will cause the copying you're worried about. This is true even if you specify dtype=object - it'll try to build a 2D array of object dtype, and that'll be even worse, for both usability and memory.
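A small sketch to make the difference concrete (the shapes here are made up purely for illustration):
import numpy as np

outputs = [np.zeros((2, 3)) for _ in range(4)]

# np.array stacks equal-shaped arrays into one block, copying all the data
stacked = np.array(outputs)
print(stacked.shape)                      # (4, 2, 3)

# empty + slice assignment keeps references to the original arrays instead
arr = np.empty(len(outputs), dtype=object)
arr[:] = outputs
print(arr.shape, arr[0] is outputs[0])    # (4,) True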
An object dtype array contains references, just like a list.
Define 3 arrays:
In [589]: a,b,c = np.arange(3), np.ones(3), np.zeros(3)
put them in a list:
In [590]: alist = [a,b,c]
and in an object dtype array:
In [591]: arr = np.empty(3,object)
In [592]: arr[:] = alist
In [593]: arr
Out[593]:
array([array([0, 1, 2]), array([1., 1., 1.]), array([0., 0., 0.])],
dtype=object)
In [594]: alist
Out[594]: [array([0, 1, 2]), array([1., 1., 1.]), array([0., 0., 0.])]
Modify one, and see the change in the list and array:
In [595]: b[:] = [1,2,3]
In [596]: b
Out[596]: array([1., 2., 3.])
In [597]: alist
Out[597]: [array([0, 1, 2]), array([1., 2., 3.]), array([0., 0., 0.])]
In [598]: arr
Out[598]:
array([array([0, 1, 2]), array([1., 2., 3.]), array([0., 0., 0.])],
dtype=object)
A numeric dtype array created from these copies all values:
In [599]: arr1 = np.stack(arr)
In [600]: arr1
Out[600]:
array([[0., 1., 2.],
[1., 2., 3.],
[0., 0., 0.]])
So even if your use of fromiter worked, it wouldn't be any different, memory-wise, from a list accumulation:
alist = []
for i in range(n):
    alist.append(constant_array)
I have an array of floats, and I want to floor them to the nearest integer so I can use them as indices.
For example:
In [2]: import numpy as np
In [3]: arr = np.random.rand(1, 10) * 10
In [4]: arr
Out[4]:
array([[4.97896461, 0.21473121, 0.13323678, 3.40534157, 5.08995577,
6.7924586 , 1.82584208, 6.73890807, 2.45590354, 9.85600841]])
In [5]: arr = np.floor(arr)
In [6]: arr
Out[6]: array([[4., 0., 0., 3., 5., 6., 1., 6., 2., 9.]])
In [7]: arr.dtype
Out[7]: dtype('float64')
They are still floats after flooring; is there a way to automatically cast them to integers?
I have edited the answer with @DanielF's explanation:
"floor doesn't convert to integer, it just gives integer-valued floats, so you still need an astype to change to int"
Check this code to understand the solution:
import numpy as np
arr = np.random.rand(1, 10) * 10
print(arr)
arr = np.floor(arr).astype(int)
print(arr)
OUTPUT:
[[2.76753828 8.84095843 2.5537759 5.65017407 7.77493733 6.47403036
7.72582766 5.03525625 9.75819442 9.10578944]]
[[2 8 2 5 7 6 7 5 9 9]]
Why not just use:
np.random.randint(1, 10, size=(1, 10))
As an alternative to changing the type after flooring, you can provide an output array of the desired data type to np.floor (and to any other numpy ufunc). For example, if you want the output as np.int32, do the following:
import numpy as np
arr = np.random.rand(1, 10) * 10
out = np.empty_like(arr, dtype=np.int32)
np.floor(arr, out=out, casting='unsafe')
As the casting argument already indicates, you should know what you are doing when casting outputs into different types. However, in your case it is not really unsafe.
That said, I would not call np.floor in your case at all, because all values are greater than zero. The simplest and probably fastest solution to your problem is therefore a direct cast to integer.
import numpy as np
arr = (np.random.rand(1, 10) * 10).astype(int)
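One caveat worth keeping in mind (not an issue here, since all values are positive): astype truncates toward zero, so it only agrees with np.floor for non-negative inputs. A quick sketch of the difference:
import numpy as np

x = np.array([-1.5, -0.5, 0.5, 1.5])
print(x.astype(int))             # [-1  0  0  1]  (truncates toward zero)
print(np.floor(x).astype(int))   # [-2 -1  0  1]  (true floor)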
I have a question which I think may have an easy answer. I have a numpy array with three dimensions - (num_users, num_dates, num_holdings). I'd like to initialize it to some random test values. np.random.rand works perfectly fine for this, but for each user and each date the third dimension has to sum to 1 (i.e., for any user and any date, their holdings have to sum to 1). I can do this by iterating, as in:
num_users = 2
num_dates = 2
num_holdings = 5
test_arr = np.random.rand(num_users, num_dates, num_holdings)
for user in range(num_users):
    for date in range(num_dates):
        starting_total = np.sum(test_arr[user, date, :])
        test_arr[user, date, :] = np.divide(test_arr[user, date, :], starting_total)

# Check it works
print(np.all(np.sum(test_arr, axis=2).reshape(-1) == 1))
But if I'm creating multiple arrays it starts to get a bit slow. Plus it feels a little unsatisfactory. I was wondering if anyone knew of a better way to do it using vector math?
Thanks!
You could do
test_arr /= test_arr.sum(axis=2, keepdims=True)
For example:
In [95]: test_arr = np.random.rand(2, 2, 5)
In [96]: test_arr
Out[96]:
array([[[0.44621493, 0.04093414, 0.30051671, 0.40939041, 0.37251939],
[0.33997017, 0.81257008, 0.52820553, 0.55382711, 0.11720684]],
[[0.78460482, 0.43458619, 0.07722273, 0.18181153, 0.52101088],
[0.47933417, 0.31354249, 0.09966921, 0.59655266, 0.24816989]]])
In [97]: test_arr.sum(axis=2, keepdims=True)
Out[97]:
array([[[1.56957558],
[2.35177973]],
[[1.99923614],
[1.73726842]]])
The use of keepdims=True means that we get a resulting shape (2,2,1) which will correctly broadcast when we divide by it.
In [98]: test_arr /= test_arr.sum(axis=2, keepdims=True)
In [99]: test_arr.sum(axis=2)
Out[99]:
array([[1., 1.],
[1., 1.]])
Note that because of limited precision you won't get exactly 1.0 as the sum, but the difference is negligible:
In [100]: test_arr.sum(axis=2) - 1.0
Out[100]:
array([[ 0.00000000e+00, 0.00000000e+00],
[-1.11022302e-16, -1.11022302e-16]])
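So if you keep the sanity check from the question, compare against 1 with a tolerance rather than with ==:
import numpy as np

test_arr = np.random.rand(2, 2, 5)
test_arr /= test_arr.sum(axis=2, keepdims=True)

# floating-point-safe version of the check from the question
print(np.allclose(test_arr.sum(axis=2), 1.0))   # True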
Is there a way to globally prevent matrix from appearing in any of the results of numpy computations? For example, currently if x is a numpy.ndarray and y is a scipy.sparse.csc_matrix, and you write x += y, x will become a matrix afterwards. Is there a way to prevent that from happening, i.e., to keep x an ndarray, and more generally, to keep getting ndarray in all places where a matrix would otherwise be produced?
I added the scipy tag; this is a scipy.sparse problem, not an np.matrix one.
In [250]: y=sparse.csr_matrix([[0,1],[1,0]])
In [251]: x=np.arange(2)
In [252]: y+x
Out[252]:
matrix([[0, 2],
[1, 1]])
so sparse + array => matrix
(As a side note, np.matrix is a subclass of np.ndarray; sparse.csr_matrix is not. It has many numpy-like operations, but it implements them in its own code.)
In [255]: x += y
In [256]: x
Out[256]:
matrix([[0, 2],
[1, 1]])
Technically this shouldn't happen; in effect it is doing x = x + y, assigning a new value to x rather than just modifying x in place.
If I first turn y into a regular dense matrix, I get an error. Allowing the action would change a 1d array into a 2d one.
In [258]: x += y.todense()
...
ValueError: non-broadcastable output operand with shape (2,) doesn't match the broadcast shape (2,2)
Changing x to 2d allows the addition to proceed - without changing array to matrix:
In [259]: x=np.eye(2)
In [260]: x
Out[260]:
array([[ 1., 0.],
[ 0., 1.]])
In [261]: x += y.todense()
In [262]: x
Out[262]:
array([[ 1., 1.],
[ 1., 1.]])
In general, performing addition/subtraction with sparse matrices is tricky. They were designed for matrix multiplication. Multiplication doesn't change sparsity as much as addition. y+1 for example makes it dense.
Without digging into the details of how sparse addition is coded, I'd say - don't try this x+=... operation without first turning y into a dense version.
In [265]: x += y.A
In [266]: x
Out[266]:
array([[ 1., 2.],
[ 2., 1.]])
I can't think of a good reason not to do this.
(I should check the scipy github for a bug issue on this).
scipy/sparse/compressed.py has the csr addition code. x+y uses x.__add__(y) but sometimes that is flipped to y.__add__(x). x+=y uses x.__iadd__(y). So I may need to examine __iadd__ for ndarray as well.
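A quick way to see why the dispatch flips to the sparse side (output from the scipy versions discussed here; the exact priority value is an implementation detail):
import numpy as np
from scipy import sparse

y = sparse.csr_matrix([[0, 1], [1, 0]])
x = np.arange(2)

# sparse matrices advertise a higher __array_priority__, so numpy defers
# to them and the sparse class gets to build the (np.matrix) result
print(y.__array_priority__)          # 10.1
print(type(x + y), type(y + x))      # both <class 'numpy.matrix'>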
But the basic addition for a sparse matrix is:
def __add__(self, other):
    # First check if argument is a scalar
    if isscalarlike(other):
        if other == 0:
            return self.copy()
        else:  # Now we would add this scalar to every element.
            raise NotImplementedError('adding a nonzero scalar to a '
                                      'sparse matrix is not supported')
    elif isspmatrix(other):
        if (other.shape != self.shape):
            raise ValueError("inconsistent shapes")
        return self._binopt(other, '_plus_')
    elif isdense(other):
        # Convert this matrix to a dense matrix and add them
        return self.todense() + other
    else:
        return NotImplemented
So the y+x becomes y.todense() + x. And x+y uses the same thing.
Regardless of the += details, it is clear that adding a sparse to a dense (array or np.matrix) involves converting the sparse to dense. There's no code that iterates through the sparse values and adds those selectively to the dense array.
It's only when both operands are sparse that it performs a special sparse addition. y+y works, returning a sparse matrix. y+=y fails with a NotImplementedError from sparse.base.__iadd__.
This is the best diagnostic sequence that I've come up with, trying various ways of adding y to a (2,2) array.
In [348]: x=np.eye(2)
In [349]: x+y
Out[349]:
matrix([[ 1., 1.],
[ 1., 1.]])
In [350]: x+y.todense()
Out[350]:
matrix([[ 1., 1.],
[ 1., 1.]])
Addition produces a matrix, but values can be written to x without changing x's class (or shape):
In [351]: x[:] = x+y
In [352]: x
Out[352]:
array([[ 1., 1.],
[ 1., 1.]])
+= with a dense matrix does the same:
In [353]: x += y.todense()
In [354]: x
Out[354]:
array([[ 1., 2.],
[ 2., 1.]])
but something in the += sparse path changes the class of x:
In [355]: x += y
In [356]: x
Out[356]:
matrix([[ 1., 3.],
[ 3., 1.]])
Further testing, and looking at id(x) and x.__array_interface__, makes it clear that x += y replaces x. This is true even if x starts as np.matrix. So the sparse += is not an in-place operation; x += y.todense() is an in-place operation.
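A minimal check of that claim (run against the same older scipy/numpy combination as the transcripts above; newer releases may differ):
import numpy as np
from scipy import sparse

y = sparse.csr_matrix([[0, 1], [1, 0]])

x = np.eye(2)
before = id(x)
x += y.todense()                   # genuine in-place addition
print(id(x) == before, type(x))    # True <class 'numpy.ndarray'>

x = np.eye(2)
before = id(x)
x += y                             # rebinds x to a brand new np.matrix
print(id(x) == before, type(x))    # False <class 'numpy.matrix'>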
Yes, it's a bug; but https://github.com/scipy/scipy/issues/7826 says
"I do not really see a way to change this."
An X += c * Y without todense follows. Some combinations of inc(various array / matrix, various sparse) have been tested, but certainly not all.
def inc( X, Y, c=1. ):
    """ X += c * Y, X Y sparse or dense """
    if (not hasattr( X, "indices" )    # dense += sparse
            and hasattr( Y, "indices" )):
        # inc an ndarray view, because ndarray += sparse -> matrix
        X = getattr( X, "A", X ).squeeze()
        X[Y.indices] += c * Y.data
    else:
        X += c * Y  # sparse + different sparse: SparseEfficiencyWarning
    return X
I'm relatively new to python but I'm trying to understand something which seems basic.
Create a vector:
x = np.linspace(0,2,3)
Out[38]: array([ 0., 1., 2.])
Now why isn't x[:,0] a valid index?
IndexError: invalid index
It must be x[0]. I have a function I am calling which calculates:
np.sqrt(x[:,0]**2 + x[:,1]**2 + x[:,2]**2)
Why can't what I have just work regardless of the input? In many other languages it is independent of there being other rows in the array. Perhaps I misunderstand something fundamental - sorry if so. I'd like to avoid putting:
if x.ndim == 1:
    norm = np.sqrt(x[0]**2 + x[1]**2 + x[2]**2)
else:
    norm = np.sqrt(x[:,0]**2 + x[:,1]**2 + x[:,2]**2)
everywhere. Surely there is a way around this... thanks.
Edit: An example of it working in another language is Matlab:
>> b = [1,2,3]
b =
1 2 3
>> b(:,1)
ans =
1
>> b(1)
ans =
1
Perhaps you are looking for this:
np.sqrt(x[...,0]**2 + x[...,1]**2 + x[...,2]**2)
There can be any number of dimensions in place of the ellipsis ...
See also What does the Python Ellipsis object do?, and the docs of NumPy basic slicing
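A quick illustration of why the ellipsis version handles both the 1-D and the 2-D case:
import numpy as np

x1 = np.array([3.0, 4.0, 0.0])                  # a single vector, shape (3,)
x2 = np.array([[3.0, 4.0, 0.0],
               [1.0, 2.0, 2.0]])                # two vectors, shape (2, 3)

for x in (x1, x2):
    print(np.sqrt(x[..., 0]**2 + x[..., 1]**2 + x[..., 2]**2))
# prints 5.0, then [5. 3.]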
It looks like the ellipsis as described by #JanneKarila has answered your question, but I'd like to point out how you might make your code a bit more "numpythonic". It appears you want to handle an n-dimensional array with the shape (d_1, d_2, ..., d_{n-1}, 3), and compute the magnitudes of this collection of three-dimensional vectors, resulting in an (n-1)-dimensional array with shape (d_1, d_2, ..., d_{n-1}). One simple way to do that is to square all the elements, then sum along the last axis, and then take the square root. If x is the array, that calculation can be written np.sqrt(np.sum(x**2, axis=-1)). The following shows a few examples.
x is 1-D, with shape (3,):
In [31]: x = np.array([1.0, 2.0, 3.0])
In [32]: np.sqrt(np.sum(x**2, axis=-1))
Out[32]: 3.7416573867739413
x is 2-D, with shape (2, 3):
In [33]: x = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
In [34]: x
Out[34]:
array([[ 1., 2., 3.],
[ 4., 5., 6.]])
In [35]: np.sqrt(np.sum(x**2, axis=-1))
Out[35]: array([ 3.74165739, 8.77496439])
x is 3-D, with shape (2, 2, 3):
In [36]: x = np.arange(1.0, 13.0).reshape(2,2,3)
In [37]: x
Out[37]:
array([[[ 1., 2., 3.],
[ 4., 5., 6.]],
[[ 7., 8., 9.],
[ 10., 11., 12.]]])
In [38]: np.sqrt(np.sum(x**2, axis=-1))
Out[38]:
array([[ 3.74165739, 8.77496439],
[ 13.92838828, 19.10497317]])
I tend to solve this by writing
x = np.atleast_2d(x)
norm = np.sqrt(x[:,0]**2 + x[:,1]**2 + x[:,2]**2)
Matlab doesn't have 1D arrays, so b=[1 2 3] is still a 2D array and indexing with two dimensions makes sense. 1D arrays may be a novel concept for you, but they're quite useful in fact (you can stop worrying about whether you need to multiply by the transpose, or how to insert a row or a column into another array...)
By the way, you could write a fancier, more general norm like this:
x = np.atleast_2d(x)
norm = np.sqrt((x**2).sum(axis=1))
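np.linalg.norm gives the same result in one call, if you prefer (its axis argument has been available for a long time):
import numpy as np

x = np.atleast_2d(np.array([3.0, 4.0, 0.0]))
norm = np.linalg.norm(x, axis=1)   # same as np.sqrt((x**2).sum(axis=1))
print(norm)                        # [5.]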
The problem is that x[:,0] in Python isn't the same as in Matlab.
If you want to extract the first element of the 1-D array as a length-1 slice, you can go with
x[:1]
This is called a "slice". In this example it means that you take everything in the array from the first element up to, but not including, the element with index 1.
Remember that Python uses zero-based indexing.
Another example may be:
x[0:2]
which would return the first and second elements of the array.