I have a generator that yields one-dimensional numpy arrays of the same length. I would like to have a sparse matrix containing that data. Rows are generated in the same order I'd like them to appear in the final matrix. A csr matrix is preferable to a lil matrix, but I assume the latter will be easier to build in the scenario I'm describing.
Assuming row_gen is a generator yielding numpy.array rows, the following code works as expected.
import numpy
import scipy.sparse

def row_gen():
    yield numpy.array([1, 2, 3])
    yield numpy.array([1, 0, 1])
    yield numpy.array([1, 0, 0])

matrix = scipy.sparse.lil_matrix(list(row_gen()))
Because the list will essentially ruin any advantages of the generator, I'd like the following to have the same end result. More specifically, I cannot hold the entire dense matrix (or a list of all matrix rows) in memory:
def row_gen():
    yield numpy.array([1, 2, 3])
    yield numpy.array([1, 0, 1])
    yield numpy.array([1, 0, 0])

matrix = scipy.sparse.lil_matrix(row_gen())
However, it raises the following exception when run:
TypeError: no supported conversion for types: (dtype('O'),)
I also noticed the traceback includes the following:
File "/usr/local/lib/python2.7/site-packages/scipy/sparse/lil.py", line 122, in __init__
A = csr_matrix(A, dtype=dtype).tolil()
This makes me think that scipy.sparse.lil_matrix ends up creating a csr matrix and only then converts it to a lil matrix. In that case I would rather just create a csr matrix to begin with.
To recap, my question is: What is the most efficient way to create a scipy.sparse matrix from a Python generator of one-dimensional numpy arrays?
Let's look at the code for sparse.lil_matrix. It checks the first argument:
if isspmatrix(arg1):                # is it already a sparse matrix
    ...
elif isinstance(arg1, tuple):       # is it the shape tuple
    if isshape(arg1):
        if shape is not None:
            raise ValueError('invalid use of shape parameter')
        M, N = arg1
        self.shape = (M, N)
        self.rows = np.empty((M,), dtype=object)
        self.data = np.empty((M,), dtype=object)
        for i in range(M):
            self.rows[i] = []
            self.data[i] = []
    else:
        raise TypeError('unrecognized lil_matrix constructor usage')
else:
    # assume A is dense
    try:
        A = np.asmatrix(arg1)
    except TypeError:
        raise TypeError('unsupported matrix type')
    else:
        from .csr import csr_matrix
        A = csr_matrix(A, dtype=dtype).tolil()
        self.shape = A.shape
        self.dtype = A.dtype
        self.rows = A.rows
        self.data = A.data
As per the documentation - you can construct it from another sparse matrix, from a shape, and from a dense array. The dense array constructor first makes a csr matrix, and then converts it to lil.
The shape version constructs an empty lil with data like:
In [161]: M=sparse.lil_matrix((3,5),dtype=int)
In [163]: M.data
Out[163]: array([[], [], []], dtype=object)
In [164]: M.rows
Out[164]: array([[], [], []], dtype=object)
It should be obvious that passing a generator isn't going to work - it isn't a dense array.
But having created a lil matrix, you can fill in elements with a regular array assignment:
In [167]: M[0,:]=[1,0,2,0,0]
In [168]: M[1,:]=[0,0,2,0,0]
In [169]: M[2,3:]=[1,1]
In [170]: M.data
Out[170]: array([[1, 2], [2], [1, 1]], dtype=object)
In [171]: M.rows
Out[171]: array([[0, 2], [2], [3, 4]], dtype=object)
In [172]: M.A
Out[172]:
array([[1, 0, 2, 0, 0],
[0, 0, 2, 0, 0],
[0, 0, 0, 1, 1]])
and you can assign values to the sublists directly (I think this is faster, but a little more dangerous):
In [173]: M.data[1]=[1,2,3]
In [174]: M.rows[1]=[0,2,4]
In [176]: M.A
Out[176]:
array([[1, 0, 2, 0, 0],
[1, 0, 2, 0, 3],
[0, 0, 0, 1, 1]])
Another incremental approach is to build the three arrays (or lists) of the coo format - data, row, col - and then make a coo or csr matrix from those.
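A minimal sketch of that coo route, reusing the row_gen generator from the question; only the nonzero triplets are ever held in memory, never the dense matrix:

import numpy as np
from scipy import sparse

def row_gen():
    yield np.array([1, 2, 3])
    yield np.array([1, 0, 1])
    yield np.array([1, 0, 0])

data, rows, cols = [], [], []
n_rows, n_cols = 0, 0
for i, row in enumerate(row_gen()):
    nz = np.nonzero(row)[0]          # column indices of the nonzero entries
    rows.extend([i] * len(nz))
    cols.extend(nz.tolist())
    data.extend(row[nz].tolist())
    n_rows, n_cols = i + 1, len(row)  # all rows are assumed the same length

matrix = sparse.coo_matrix((data, (rows, cols)), shape=(n_rows, n_cols)).tocsr()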
sparse.bmat is another option, and its code is a good example of building the coo inputs. I'll let you look at that yourself.
I'm trying to perform this type of calculation:
arr = np.arange(4)
# array([0, 1, 2, 3])
arr_t = arr.reshape((-1, 1))
# array([[0],
#        [1],
#        [2],
#        [3]])
mult_arr = np.multiply(arr, arr_t)  # <<< the multiplication
# array([[0, 0, 0, 0],
#        [0, 1, 2, 3],
#        [0, 2, 4, 6],
#        [0, 3, 6, 9]])
Eventually I want to perform this for each row of a bigger matrix, and sum all the matrices produced by the calculation:
arr = np.random.random((600, 150))
arr_t = arr.reshape((-1, arr.shape[1], 1))
mult = np.multiply(arr[:, None], arr_t)
summed = np.sum(mult, axis=0)
summed
Till now it's all pure awesomeness; the problem starts when I try to run it on a bigger dataset, for example this array instead:
arr = np.random.random((6000, 1500))
I get the following error - MemoryError: Unable to allocate 101. GiB for an array with shape (6000, 1500, 1500) and data type float64
which makes sense, but my question is:
Can I get around this without being forced to use loops that slow the process down entirely?
My question is mainly about performance; any solution requiring a long-running task of more than 30 seconds is not an option.
Looks like you are simply trying to perform a dot product: summing the outer product of each row with itself over all rows is, by definition, the matrix product arr.T @ arr, which never materializes the (6000, 1500, 1500) intermediate array:
arr.T @ arr
or
arr.T.dot(arr)
Checking that this is what you want:
arr = np.random.random((600, 150))
arr_t = arr.reshape((-1, arr.shape[1], 1))
mult = np.multiply(arr[:, None], arr_t)
summed = np.sum(mult, axis=0)
np.allclose(arr.T @ arr, summed)
# True
I'm fairly new to NumPy, so it's quite possible that I'm missing something fundamental. Don't hesitate to ask "stupid" questions about "basic" things!
I'm trying to write some functions that manipulate vectors. I'd like them to work on single vectors, as well as on arrays of vectors, like most of NumPy's ufuncs:
import math
import numpy

def func(scalar, x, vector):
    # arbitrary function
    # I'm NOT looking to replace this with numpy.magic_sum_multiply()
    # I'm trying to understand broadcasting & dtypes
    return scalar * x + vector

print(func(
    scalar=numpy.array(2),
    x=numpy.array([1, 0, 0]),
    vector=numpy.array([1, 0, 0]),
))
# => [3 0 0], as expected

print(func(
    scalar=numpy.array(2),
    x=numpy.array([1, 0, 0]),
    vector=numpy.array([[1, 0, 0], [0, 1, 0]]),
))
# => [[3 0 0], [2 1 0]], as expected. x & scalar are broadcasted out to match the multiple vectors
However, when trying to use multiple scalars, things go wrong:
print(func(
    scalar=numpy.array([1, 2]),
    x=numpy.array([1, 0, 0]),
    vector=numpy.array([1, 0, 0]),
))
# => ValueError: operands could not be broadcast together with shapes (2,) (3,)
# expected: [[2 0 0], [3 0 0]]
I'm not entirely surprised by this. After all, NumPy has no idea that I'm working with vectors that are a single entity, rather than just another array dimension.
I can solve this ad-hoc with some expand_dims() and/or squeeze() to add/remove axes, but that feels hacky...
So I figured that, since I'm working with vectors that are a single "entity", dtypes may be what I'm looking for:
vector_dtype = numpy.dtype([
    ('x', numpy.float64),
    ('y', numpy.float64),
    ('z', numpy.float64),
])

_ = numpy.array([(1, 0, 0), (0, 1, 0)], dtype=vector_dtype)
print(_.shape)  # => (2,), good, we indeed have 2 vectors!

_ = numpy.array((1, 0, 0, 7), dtype=vector_dtype)
# Good, basic checking that I'm staying in 3D
# => ValueError: could not assign tuple of length 4 to structure with 3 fields.
However, I seem to lose basic math capabilities:
print(2 * _)
# => TypeError: The DTypes <class 'numpy.dtype[void]'> and <class 'numpy.dtype[uint8]'> do not have a common DType. For example they cannot be stored in a single array unless the dtype is `object`.
So my main question is: How do I solve this?
Is there some numpy.magic_broadcast_that_understands_what_I_mean() function?
Can I define math-operators (such as addition, ...) on the vector-dtype?
How do I solve this?
You are after the first-argument vectorized version of func; let's call it vfunc. (vfunc is not "vectorization" stricto sensu, since the vectorization job is done internally, in Python.)
def vfunc(scalars, x, vector):  # note the plural: scalars
    return numpy.vstack([       # assuming that's the shape you want
        scalar * x + vector for scalar in scalars
    ])

print(vfunc(
    scalars=[2],  # no need for an array instance actually
    x=numpy.array([1, 0, 0]),
    vector=numpy.array([1, 0, 0]),
))
# => [[3 0 0]], as expected (vstack always returns a 2-D result)

print(vfunc(
    scalars=[2],
    x=numpy.array([1, 0, 0]),
    vector=numpy.array([[1, 0, 0], [0, 1, 0]]),
))
# => [[3 0 0], [2 1 0]], as expected

print(vfunc(
    scalars=[1, 2],
    x=numpy.array([1, 0, 0]),
    vector=numpy.array([1, 0, 0]),
))
# => [[2 0 0], [3 0 0]], as expected
[...] dtypes may be what I'm looking for
No it is not.
Is there some numpy.magic_broadcast_that_understands_what_I_mean()
Yes. It is called numpy.vectorize but it is not worth it.
As it reads in the documentation:
The vectorize function is provided primarily for convenience, not for performance. The implementation is essentially a for loop.
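As a quick illustration with a hypothetical toy function (not from the question): np.vectorize turns a true-scalar function into one that accepts arrays and applies broadcasting rules, but every element still goes through a Python-level call:

import numpy as np

def add_one(s):
    # operates on a single true scalar
    return s + 1

vadd_one = np.vectorize(add_one)
print(vadd_one([1, 2, 3]))  # => [2 3 4], computed one element at a time in Python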
ufuncs obey the same broadcasting rules as the operators, and your own function, written with numpy operators and ufuncs, has to work with those as well. Your function can tweak the dimensions to translate inputs into something that works with the rest of numpy. (Writing your own ufuncs is an advanced topic.)
In [64]: scalar=numpy.array([1, 2])
...: x=numpy.array([1, 0, 0])
...: vector=numpy.array([1, 0, 0])
In [65]: scalar * x + vector
Traceback (most recent call last):
File "<ipython-input-65-ad4a73833616>", line 1, in <module>
scalar * x + vector
ValueError: operands could not be broadcast together with shapes (2,) (3,)
The problem is the multiplication; regardless of what you call it, scalar is a (2,) shape array, which does not work with a (3,) array.
In [68]: scalar*x
Traceback (most recent call last):
File "<ipython-input-68-0d21729ffa15>", line 1, in <module>
scalar*x
ValueError: operands could not be broadcast together with shapes (2,) (3,)
But what do you expect to happen? What shape should the result have?
If scalar is a (2,1) shaped array, then by broadcasting this result is (2,3) - taking the 2 from scalar and 3 from the other arrays:
In [76]: scalar[:,None] * x + vector
Out[76]:
array([[2, 0, 0],
[3, 0, 0]])
This is standard numpy broadcasting, and there's nothing "hacky" about it.
I don't know what you mean by calling scalar a 'single entity'.
A structured array is a convenient way of putting arrays with diverse dtypes into one structure, or of accessing "columns" by convenient names.
But you can't perform math across the fields of such an array.
In [70]: z=np.array([(1, 0, 0), (0, 1, 0)], dtype=vector_dtype)
In [71]: z
Out[71]:
array([(1., 0., 0.), (0., 1., 0.)],
dtype=[('x', '<f8'), ('y', '<f8'), ('z', '<f8')])
In [72]: z.shape
Out[72]: (2,)
In [73]: z.dtype
Out[73]: dtype([('x', '<f8'), ('y', '<f8'), ('z', '<f8')])
In [74]: z['x']
Out[74]: array([1., 0.])
In [75]: 2*z['x'] # math on a single field
Out[75]: array([2., 0.])
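One workaround, assuming every field shares the same dtype (as the x/y/z fields here do): a view can reinterpret the structured array as a plain 2-D array, which restores ordinary math. A sketch:

import numpy as np

vector_dtype = np.dtype([('x', np.float64), ('y', np.float64), ('z', np.float64)])
z = np.array([(1, 0, 0), (0, 1, 0)], dtype=vector_dtype)

zv = z.view((np.float64, 3))  # reinterpret the 3 float64 fields as a (2, 3) array
print(2 * zv)
# [[2. 0. 0.]
#  [0. 2. 0.]]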
note
There is an np.vectorize function. It takes a function that accepts only (true) scalar arguments, and applies array arguments according to the standard broadcasting rules. So even if your func was implemented with it, you'd still have to adjust the arguments as I did above. Sometimes it's convenient, but it's better to use standard numpy functions and operators where possible - better, and much faster.
Let's say I have a two-dimensional array
import numpy as np
a = np.array([[1, 1, 1], [2,2,2], [3,3,3]])
and I would like to replace the third vector (in the second dimension) with zeros. I would do
a[:, 2] = np.array([0, 0, 0])
But what if I would like to be able to do that programmatically? I mean, let's say that variable x = 1 contained the dimension on which I wanted to do the replacing. How would the function replace(arr, dimension, value, arr_to_be_replaced) have to look if I wanted to call it as replace(a, x, 2, np.array([0, 0, 0]))?
numpy has a similar function, insert. However, it doesn't replace at dimension i, it returns a copy with an additional vector.
All solutions are welcome, but I do prefer a solution that doesn't recreate the array as to save memory.
arr[:, 1]
is basically shorthand for
arr[(slice(None), 1)]
that is, a tuple with slice elements and integers.
Knowing that, you can construct a tuple of slice objects manually, adjust the values depending on an axis parameter and use that as your index. So for
import numpy as np
arr = np.array([[1, 1, 1], [2, 2, 2], [3, 3, 3]])
axis = 1
idx = 2
arr[:, idx] = np.array([0, 0, 0])
# ^- axis position
you can use
slices = [slice(None)] * arr.ndim
slices[axis] = idx
arr[tuple(slices)] = np.array([0, 0, 0])
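Wrapped into the replace helper the question asks for (a sketch, with slightly renamed parameters), that could look like:

import numpy as np

def replace(arr, axis, idx, replacement):
    # build an index like arr[:, ..., idx, ..., :] and assign in place (no copy)
    slices = [slice(None)] * arr.ndim
    slices[axis] = idx
    arr[tuple(slices)] = replacement

a = np.array([[1, 1, 1], [2, 2, 2], [3, 3, 3]])
replace(a, 1, 2, np.array([0, 0, 0]))  # zero out the third column
print(a)
# [[1 1 0]
#  [2 2 0]
#  [3 3 0]]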
I'm working with a connectivity matrix that is a representation of a graph datastructure. The NxM matrix corresponds to N edges with M vertices (it's likely to have more edges than vertices, which is why I am working with scipy's csr_matrix). The "start" point of the edge is represented by "-1" and the end point is represent by "1" in the connectivity matrix. All other values are 0, so each row only has 2 nonzero values.
I need to integrate a "subdivide" method, which will efficiently update the connectivity matrix. Currently I am transforming the connectivity matrix to a dense matrix so I can add the new rows/columns and update the old ones. I am converting to a dense matrix as I haven't found a solution to finding the column index for updating the old edge connectivity (no equivalent scipy.where) and the csr representation does not allow me to update values via indexing.
from numpy import where, array, zeros, hstack, vstack
from scipy.sparse import coo_matrix, csr_matrix

def connectivity_matrix(edges):
    m = len(edges)
    data = array([-1] * m + [1] * m)
    rows = array(list(range(m)) + list(range(m)))
    cols = array([edge[0] for edge in edges] + [edge[1] for edge in edges])
    C = coo_matrix((data, (rows, cols))).asfptype()
    return C.tocsr()

def subdivide_edges(C, edge_indices):
    C = C.todense()
    num_e = C.shape[0]  # number of edges
    num_v = C.shape[1]  # number of vertices
    for edge in edge_indices:
        num_e += 1  # increment row (edge count)
        num_v += 1  # increment column (vertex count)
        _, start = where(C[edge] == -1.0)
        _, end = where(C[edge] == 1.0)
        si = start[0]
        ei = end[0]
        # add row
        r, c = C.shape
        new_r = zeros((1, c))
        C = vstack([C, new_r])
        # add column
        r, c = C.shape
        new_c = zeros((r, 1))
        C = hstack([C, new_c])
        # edit edge start/end points
        C[edge, ei] = 0.0
        C[edge, num_v - 1] = 1.0
        # add new edge start/end points
        C[num_e - 1, ei] = 1.0
        C[num_e - 1, num_v - 1] = -1.0
    return csr_matrix(C)

edges = [(0, 1), (1, 2)]  # edge connectivity
C = connectivity_matrix(edges)
C = subdivide_edges(C, [0, 1])
# new edge connectivity: [(0, 3), (1, 4), (3, 1), (4, 2)]
A sparse matrix does have a nonzero method (np.where uses np.nonzero). But look at its code - it returns coo row/cols data.
Using a sparse matrix left over from another question:
In [468]: M
Out[468]:
<5x5 sparse matrix of type '<class 'numpy.float64'>'
with 5 stored elements in COOrdinate format>
In [469]: Mc = M.tocsr()
In [470]: Mc.nonzero()
Out[470]: (array([0, 1, 2, 3, 4], dtype=int32), array([2, 0, 4, 3, 1], dtype=int32))
In [471]: Mc[1,:].nonzero()
Out[471]: (array([0]), array([0]))
In [472]: Mc[3,:].nonzero()
Out[472]: (array([0]), array([3]))
I converted to csr to do the row index.
There is also a sparse vstack.
But iterative work on sparse matrices is slow compared to dense arrays.
Be wary of float comparisons like C[edge] == -1.0. == tests work much better with integers.
Changing values from zero to nonzero does raise a warning, but does work:
In [473]: Mc[1,1] = 23
/usr/local/lib/python3.5/dist-packages/scipy/sparse/compressed.py:774: SparseEfficiencyWarning: Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.
SparseEfficiencyWarning)
In [474]: (Mc[1,:]==23).nonzero()
Out[474]: (array([0]), array([1]))
Changing nonzeros to zero doesn't produce the warning, but it also doesn't change the underlying sparsity (until the matrix is cleaned up). lil format is better for element by element changes.
In [478]: Ml = M.tolil()
In [479]: Ml.nonzero()
Out[479]: (array([0, 1, 2, 3, 4], dtype=int32), array([2, 0, 4, 3, 1], dtype=int32))
In [480]: Ml[1,:].nonzero()
Out[480]: (array([0], dtype=int32), array([0], dtype=int32))
In [481]: Ml[1,2]=.5
In [482]: Ml[1,:].nonzero()
Out[482]: (array([0, 0], dtype=int32), array([0, 2], dtype=int32))
In [483]: (Ml[1,:]==.5).nonzero()
Out[483]: (array([0], dtype=int32), array([2], dtype=int32))
In [486]: sparse.vstack((Ml,Ml),format='lil')
Out[486]:
<10x5 sparse matrix of type '<class 'numpy.float64'>'
with 12 stored elements in LInked List format>
sparse.vstack works by converting the inputs to coo, and joining their attributes (rows, cols, data), and making a new matrix.
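A minimal sketch of that coo-joining idea for two matrices (coo_vstack is a hypothetical name; sparse.vstack generalizes this to any number of inputs and output formats):

import numpy as np
from scipy import sparse

def coo_vstack(A, B):
    # stack A on top of B by concatenating their coo attributes
    A, B = A.tocoo(), B.tocoo()
    data = np.concatenate([A.data, B.data])
    rows = np.concatenate([A.row, B.row + A.shape[0]])  # offset B's row indices
    cols = np.concatenate([A.col, B.col])
    return sparse.coo_matrix((data, (rows, cols)),
                             shape=(A.shape[0] + B.shape[0], A.shape[1]))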
I suspect that your code will work with a lil matrix without too many changes. But it probably will be slower. Sparse gets its best speed when doing things like matrix multiplication on low density matrices. It also helps when the dense equivalents are too large to fit in memory. But for iterative work and growing matrices it is slow.
Is it generally safe to provide the input array as the optional out argument to a ufunc in numpy, provided the type is correct? For example, I have verified that the following works:
>>> import numpy as np
>>> arr = np.array([1.2, 3.4, 4.5])
>>> np.floor(arr, arr)
array([ 1., 3., 4.])
The array type must be either compatible or identical with the output (which is a float for numpy.floor()), or this happens:
>>> arr2 = np.array([1, 3, 4], dtype = np.uint8)
>>> np.floor(arr2, arr2)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: ufunc 'floor' output (typecode 'e') could not be coerced to provided output parameter (typecode 'B') according to the casting rule ''same_kind''
So given an array of the proper type, is it generally safe to apply ufuncs in-place? Or is floor() an exceptional case? The documentation does not make it clear, and neither do the following two threads, which have tangential bearing on the question:
Numpy modify array in place?
Numpy Ceil and Floor "out" Argument
EDIT:
As a first-order guess, I would assume it is often, but not always, safe, based on the tutorial at http://docs.scipy.org/doc/numpy/user/c-info.ufunc-tutorial.html. There does not appear to be any restriction on using the output array as temporary storage for intermediate results during the computation. While something like floor() and ceil() may not require temporary storage, more complex functions might. That being said, the entire existing library may be written with that in mind.
The out parameter of a numpy function is the array where the result is written. The main advantage of using out is avoiding the allocation of new memory where it is not necessary.
Is it safe to write the output of a function to the same array that was passed as input? There is no general answer; it depends on what the function is doing.
Two examples
Here are two examples of ufunc-like functions:
In [1]: def plus_one(x, out=None):
...: if out is None:
...: out = np.zeros_like(x)
...:
...: for i in range(x.size):
...: out[i] = x[i] + 1
...: return out
...:
In [2]: x = np.arange(5)
In [3]: x
Out[3]: array([0, 1, 2, 3, 4])
In [4]: y = plus_one(x)
In [5]: y
Out[5]: array([1, 2, 3, 4, 5])
In [6]: z = plus_one(x, x)
In [7]: z
Out[7]: array([1, 2, 3, 4, 5])
Function shift_one:
In [11]: def shift_one(x, out=None):
...: if out is None:
...: out = np.zeros_like(x)
...:
...: n = x.size
...: for i in range(n):
...: out[(i+1) % n] = x[i]
...: return out
...:
In [12]: x = np.arange(5)
In [13]: x
Out[13]: array([0, 1, 2, 3, 4])
In [14]: y = shift_one(x)
In [15]: y
Out[15]: array([4, 0, 1, 2, 3])
In [16]: z = shift_one(x, x)
In [17]: z
Out[17]: array([0, 0, 0, 0, 0])
For the function plus_one there is no problem: the expected result is obtained when the parameters x and out are the same array. But the function shift_one gives a surprising result when the parameters x and out are the same array, because the array is overwritten while it is still being read: the write out[1] = x[0] happens before iteration i=1 reads x[1], so the initial value x[0] = 0 propagates through the entire array.
Discussion
For functions of the form out[i] := some_operation(x[i]), such as plus_one above, but also floor, ceil, sin, cos, tan, log, conj, etc., it is, as far as I know, safe to write the result into the input using the out parameter.
It is also safe for functions taking two input parameters of the form out[i] := some_operation(x[i], y[i]), such as the numpy functions add, multiply and subtract.
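For example, with one of those elementwise ufuncs:

import numpy as np

x = np.arange(5)
np.add(x, x, out=x)  # out[i] depends only on x[i], so writing in place is fine
print(x)             # [0 2 4 6 8]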
For other functions, it is case-by-case. As illustrated below, matrix multiplication is not safe:
In [18]: a = np.arange(4).reshape((2,2))
In [19]: a
Out[19]:
array([[0, 1],
[2, 3]])
In [20]: b = (np.arange(4) % 2).reshape((2,2))
In [21]: b
Out[21]:
array([[0, 1],
[0, 1]], dtype=int32)
In [22]: c = np.dot(a, b)
In [23]: c
Out[23]:
array([[0, 1],
[0, 5]])
In [24]: d = np.dot(a, b, out=a)
In [25]: d
Out[25]:
array([[0, 1],
[0, 3]])
Last remark: if the implementation is multithreaded, the result of an unsafe function may even be non-deterministic, because it depends on the order in which the array elements are processed.
This is an old question, but there is an updated answer:
Yes, it is safe. In the Numpy documentation, we see that as of v1.13:
Operations where ufunc input and output operands have memory overlap are defined to be the same as for equivalent operations where there is no memory overlap. Operations affected make temporary copies as needed to eliminate data dependency. As detecting these cases is computationally expensive, a heuristic is used, which may in rare cases result in needless temporary copies. For operations where the data dependency is simple enough for the heuristic to analyze, temporary copies will not be made even if the arrays overlap, if it can be deduced copies are not necessary. As an example, np.add(a, b, out=a) will not involve copies.
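A short illustration of that guarantee (assuming numpy >= 1.13; the input and output are overlapping views of the same array, and numpy resolves the data dependency itself):

import numpy as np

a = np.arange(5)                  # [0 1 2 3 4]
np.add(a[:-1], a[1:], out=a[1:])  # overlapping views; numpy copies as needed
print(a)                          # [0 1 3 5 7]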