Fastest way to fill a scipy.sparse.dok_matrix - python

What's the most efficient way to fill a scipy.sparse.dok_matrix, based on an input list?
Neither the number of columns nor the number of rows in the dok_matrix is known in advance.
The number of rows is the length of the input list, the number of columns depends on the values within the input list.
The obvious:
from typing import Any, List
import scipy.sparse
def get_dok_matrix(values: List[Any]) -> scipy.sparse.dok_matrix:
    max_cols = 0
    datas = []
    # First pass: collect the data for each row and track the widest row
    for value in values:
        data = get_data(value)
        datas.append(data)
        if len(data) > max_cols:
            max_cols = len(data)
    dok_matrix = scipy.sparse.dok_matrix((len(values), max_cols))
    # Second pass: fill the matrix element by element
    for i, data in enumerate(datas):
        for j, datum in enumerate(data):
            dok_matrix[i, j] = datum
    return dok_matrix
This has two for loops, one of them nested, and many len() checks. I can't imagine this being very efficient.
I have also considered:
def get_dok_matrix(values: List[Any]) -> scipy.sparse.dok_matrix:
    cols = 0
    dok_matrix = scipy.sparse.dok_matrix((0, 0))
    for row, value in enumerate(values):
        # Grow the matrix by one row for each input value
        dok_matrix.resize(row + 1, cols)
        data = get_data(value)
        for col, datum in enumerate(data):
            if col + 1 > cols:
                # Grow the matrix whenever a wider row is encountered
                cols = col + 1
                dok_matrix.resize(row + 1, cols)
            dok_matrix[row, col] = datum
    return dok_matrix
This hugely depends on how efficient scipy.sparse.dok_matrix.resize is, which I couldn't find in the documentation.
Which of these is most efficient?
Is there a better way that I am missing (maybe I can O(1) set an entire row at once?)?

With:
from scipy import sparse
def get_dok_matrix(values):
    max_cols = 0
    datas = []
    for value in values:
        data = value # changed
        datas.append(data)
        if len(data) > max_cols:
            max_cols = len(data)
    dok_matrix = sparse.dok_matrix((len(values), max_cols))
    for i, data in enumerate(datas):
        for j, datum in enumerate(data):
            dok_matrix[i, j] = datum
    return dok_matrix
And
In [13]: values = [[1],[1,2],[1,2,3],[4,5,6,7],[8,9,10,11,12]]
In [14]: dd = get_dok_matrix(values)
In [15]: dd
Out[15]:
<5x5 sparse matrix of type '<class 'numpy.float64'>'
with 15 stored elements in Dictionary Of Keys format>
In [16]: dd.A
Out[16]:
array([[ 1., 0., 0., 0., 0.],
[ 1., 2., 0., 0., 0.],
[ 1., 2., 3., 0., 0.],
[ 4., 5., 6., 7., 0.],
[ 8., 9., 10., 11., 12.]])
I wish you'd provided a values example, so I wouldn't have to study your code and create one that would work with it.
To make a coo format:
def get_coo_matrix(values):
    data, row, col = [],[],[]
    for i,value in enumerate(values):
        n = len(value)
        data.extend(value)
        row.extend([i]*n)
        col.extend(list(range(n)))
    return sparse.coo_matrix((data,(row,col)))
In [18]: M = get_coo_matrix(values)
In [19]: M
Out[19]:
<5x5 sparse matrix of type '<class 'numpy.int64'>'
with 15 stored elements in COOrdinate format>
In [20]: M.A
Out[20]:
array([[ 1, 0, 0, 0, 0],
[ 1, 2, 0, 0, 0],
[ 1, 2, 3, 0, 0],
[ 4, 5, 6, 7, 0],
[ 8, 9, 10, 11, 12]])
times:
In [22]: timeit dd = get_dok_matrix(values)
431 µs ± 10.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [23]: timeit M = get_coo_matrix(values)
152 µs ± 524 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
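If a dok_matrix is still required at the end, one option is to build the COO triplets first and convert once with todok(). The following is only a sketch under that assumption: get_dok_via_coo is a hypothetical helper name (it does not appear in the answer above), and it assumes values is a non-empty list of numeric sequences.

```python
import numpy as np
from scipy import sparse


def get_dok_via_coo(values):
    """Build COO triplets from a ragged list of rows, then convert to DOK once.

    Hypothetical helper; assumes `values` is a non-empty list of numeric sequences.
    """
    lengths = np.array([len(v) for v in values])
    # Row index i is repeated once per element of row i.
    row = np.repeat(np.arange(len(values)), lengths)
    # Column indices restart at 0 for every row.
    col = np.concatenate([np.arange(n) for n in lengths])
    data = np.concatenate([np.asarray(v, dtype=float) for v in values])
    # Pass the shape explicitly instead of letting it be inferred.
    shape = (len(values), int(lengths.max()))
    return sparse.coo_matrix((data, (row, col)), shape=shape).todok()


values = [[1], [1, 2], [1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11, 12]]
dd = get_dok_via_coo(values)
```

Whether the extra conversion step is worth it depends on the data; timing it against get_dok_matrix on realistic input would settle that.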

Related

Efficient dense counterpart to scipy.sparse.diags

scipy.sparse.diags allows me to enter multiple diagonal vectors, together with their location, to build a matrix such as
import numpy as np
from scipy.sparse import diags
vec = np.ones((5,))
vec2 = vec + 1
diags([vec, vec2], [-2, 2])
I'm looking for an efficient way to do the same but build a dense matrix, instead of DIA. np.diag only supports a single diagonal. What's an efficient way to build a dense matrix from multiple diagonal vectors?
Expected output: the same as np.array(diags([vec, vec2], [-2, 2]).todense())
One way would be to index into the flattened output array using a step of N+1:
import numpy as np
from scipy.sparse import diags
from timeit import timeit
def diags_pp(vecs, offs, dtype=float, N=None):
    if N is None:
        N = len(vecs[0]) + abs(offs[0])
    out = np.zeros((N, N), dtype)
    outf = out.reshape(-1)
    for vec, off in zip(vecs, offs):
        if off<0:
            outf[-N*off::N+1] = vec
        else:
            outf[off:N*(N-off):N+1] = vec
    return out
def diags_sp(vecs, offs):
    return diags(vecs, offs).A
for N, k in [(10, 2), (100, 20), (1000, 200)]:
    print(N)
    O = np.arange(-k,k)
    D = [np.arange(1, N+1-abs(o)) for o in O]
    for n, f in list(globals().items()):
        if n.startswith('diags_'):
            print(n.replace('diags_', ''), timeit(lambda: f(D, O), number=10000//N)*N)
            if n != 'diags_sp':
                assert np.all(f(D, O) == diags_sp(D, O))
Sample run:
10
pp 0.06757194991223514
sp 1.9529316504485905
100
pp 0.45834919437766075
sp 4.684177896706387
1000
pp 23.397524026222527
sp 170.66762899048626
With Paul Panzer's (10,2) case
In [107]: O
Out[107]: array([-2, -1, 0, 1])
In [108]: D
Out[108]:
[array([1, 2, 3, 4, 5, 6, 7, 8]),
array([1, 2, 3, 4, 5, 6, 7, 8, 9]),
array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]),
array([1, 2, 3, 4, 5, 6, 7, 8, 9])]
The diagonals have different lengths.
sparse.diags converts this to a sparse.dia_matrix:
In [109]: M = sparse.diags(D,O)
In [110]: M
Out[110]:
<10x10 sparse matrix of type '<class 'numpy.float64'>'
with 36 stored elements (4 diagonals) in DIAgonal format>
In [111]: M.data
Out[111]:
array([[ 1., 2., 3., 4., 5., 6., 7., 8., 0., 0.],
[ 1., 2., 3., 4., 5., 6., 7., 8., 9., 0.],
[ 1., 2., 3., 4., 5., 6., 7., 8., 9., 10.],
[ 0., 1., 2., 3., 4., 5., 6., 7., 8., 9.]])
Here the ragged list of diagonals has been converted to a padded 2d array. This can be a convenient way of specifying the diagonals, but it isn't particularly efficient. It has to be converted to csr format for most calculations:
In [112]: timeit sparse.diags(D,O)
99.8 µs ± 3.65 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [113]: timeit sparse.diags(D,O, format='csr')
371 µs ± 155 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Using np.diag I can construct the same array with an iteration
np.add.reduce([np.diag(v,k) for v,k in zip(D,O)])
In [117]: timeit np.add.reduce([np.diag(v,k) for v,k in zip(D,O)])
39.3 µs ± 131 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
and with Paul's function:
In [120]: timeit diags_pp(D,O)
12.3 µs ± 24.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
The key step in np.diag is a flat assignment:
res[:n-k].flat[i::n+1] = v
This is essentially the same as Paul's outf assignments. So the functionality is basically the same, assigning each diagonal via a slice. Paul streamlines it.
Creating the M.data array (Out[111]) also requires copying the D arrays to a 2d array - but with different slices.
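To make the strided flat assignment concrete, here is a small sketch for a single diagonal. set_diagonal is a hypothetical helper written for illustration only, and it assumes out is a square, C-contiguous array so that reshape(-1) returns a view.

```python
import numpy as np


def set_diagonal(out, vec, off):
    """Write `vec` onto diagonal `off` of the square, C-contiguous array `out`, in place."""
    n = out.shape[0]
    flat = out.reshape(-1)  # a view, because `out` is C-contiguous
    if off < 0:
        # Sub-diagonal: starts at row -off, column 0.
        flat[-n * off::n + 1] = vec
    else:
        # Super-diagonal: starts at row 0, column off; stop before the slice wraps around.
        flat[off:n * (n - off):n + 1] = vec


out = np.zeros((5, 5))
set_diagonal(out, [1, 2, 3], 2)
set_diagonal(out, [10, 20, 30, 40], -1)
expected = np.diag([1, 2, 3], 2) + np.diag([10, 20, 30, 40], -1)
```

Stepping by n + 1 in the flattened array moves one row down and one column right, which is why a single slice covers a whole diagonal.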

Replace values in numpy 2D array based on pandas dataframe

>>> arr
array([[ 0., 10., 0., ..., 0., 0., 0.],
[ 0., 4., 0., ..., 6., 0., 9.],
[ 0., 0., 0., ..., 0., 0., 0.],
...,
[ 0., 0., 0., ..., 0., 0., 0.],
[ 0., 0., 0., ..., 2., 0., 0.],
[ 0., 0., 0., ..., 0., 3., 0.]])
In the numpy array above, I would like to replace every value that matches the column country_codes in the dataframe (df_A) with the value from the column continent_codes in df_A. df_A looks like:
country_codes continent_codes
0 4 4
1 8 3
2 12 5
3 16 6
4 24 5
Right now, I loop through dataframe and replace using numpy indexing notation. Given that iterrows() tends to be slow, is there a more direct/vectorized way to do this?
for index, row in self.df_A.iterrows():
arr[arr == row['country_codes']] = row['continent_codes']
Approach #1 : One vectorized approach using np.searchsorted and np.in1d would be as listed below -
# Store country_codes and continent_codes column data for further usage
oldval = np.array(df['country_codes'])
newval = np.array(df['continent_codes'])
# Mask of elements to be changed
mask = np.in1d(arr,oldval)
# Indices for each match from oldval in arr
idx = np.searchsorted(oldval,arr.ravel()[mask])
# Using the mask put selective elements from continent_codes column into arr
arr.ravel()[mask] = newval[idx]
Sample run -
>>> arr # Original 2D array
array([[23, 4, 23, 5, 8],
[ 3, 6, 8, 5, 11],
[16, 24, 15, 4, 10],
[ 4, 16, 10, 8, 1]])
>>> df
country_codes continent_codes
0 4 4
1 8 3
2 12 5
3 16 6
4 24 5
>>> oldval = np.array(df['country_codes'])
>>> newval = np.array(df['continent_codes'])
>>> mask = np.in1d(arr,oldval)
>>> idx = np.searchsorted(oldval,arr.ravel()[mask])
>>> arr.ravel()[mask] = newval[idx]
>>> mask.reshape(arr.shape) # Mask array depicting which elements were updated
array([[False, True, False, False, True],
[False, False, True, False, False],
[ True, True, False, True, False],
[ True, True, False, True, False]], dtype=bool)
>>> arr # Updated 2D array
array([[23, 4, 23, 5, 3],
[ 3, 6, 3, 5, 11],
[ 6, 5, 15, 4, 10],
[ 4, 6, 10, 3, 1]])
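Note that np.searchsorted expects a sorted lookup array, which holds for the country_codes column in the sample. If that is not guaranteed, one possible adjustment is to sort both columns together first. This is a sketch only: replace_values is a hypothetical helper name, it assumes oldval contains no duplicates, and it uses np.isin in place of np.in1d.

```python
import numpy as np


def replace_values(arr, oldval, newval):
    """Return a copy of `arr` with each entry equal to oldval[i] replaced by newval[i].

    Hypothetical helper; assumes `oldval` contains no duplicates.
    """
    oldval = np.asarray(oldval)
    newval = np.asarray(newval)
    order = np.argsort(oldval)  # np.searchsorted needs a sorted lookup array
    sorted_old, sorted_new = oldval[order], newval[order]
    out = np.array(arr, copy=True, order='C')
    flat = out.reshape(-1)  # a view onto `out`, which is C-contiguous
    mask = np.isin(flat, sorted_old)
    flat[mask] = sorted_new[np.searchsorted(sorted_old, flat[mask])]
    return out


arr = np.array([[23, 4, 23, 5, 8],
                [3, 6, 8, 5, 11],
                [16, 24, 15, 4, 10],
                [4, 16, 10, 8, 1]])
# Same mapping as the sample dataframe, deliberately given in unsorted order.
result = replace_values(arr, [24, 4, 16, 8, 12], [5, 4, 6, 3, 5])
```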
Approach #2 : As a variant, you can create the mask by comparing np.searchsorted(oldval,arr,'left') with np.searchsorted(oldval,arr,'right'): the two differ exactly where an element of arr is present in oldval. The 'left' indices can then be re-used when putting values into arr, which makes for a more efficient solution, like so -
# Store country_codes and continent_codes column data for further usage
oldval = np.array(df['country_codes'])
newval = np.array(df['continent_codes'])
# Left and right indices for each match from oldval in arr
left_idx = np.searchsorted(oldval,arr,'left')
right_idx = np.searchsorted(oldval,arr,'right')
# Mask of elements to be changed
mask = left_idx!=right_idx
# Using the mask put selective elements from continent_codes column into arr
arr[mask] = newval[left_idx[mask]]
Runtime tests and verify outputs
Function definitions -
def original_app(arr,df):
    for index, row in df.iterrows():
        arr[arr == row['country_codes']] = row['continent_codes']
def vectorized_app1(arr,df):
    oldval = np.array(df['country_codes'])
    newval = np.array(df['continent_codes'])
    mask = np.in1d(arr,oldval)
    idx = np.searchsorted(oldval,arr.ravel()[mask])
    arr.ravel()[mask] = newval[idx]
def vectorized_app2(arr,df):
    oldval = np.array(df['country_codes'])
    newval = np.array(df['continent_codes'])
    left_idx = np.searchsorted(oldval,arr,'left')
    right_idx = np.searchsorted(oldval,arr,'right')
    mask = left_idx!=right_idx
    arr[mask] = newval[left_idx[mask]]
Verify outputs -
In [195]: # Input array
...: arr = np.random.randint(0,100000,(1000,1000))
...:
...: # Setup input dataframe
...: N = 1000
...: oldvals = np.unique(np.random.randint(0,100000,N))
...: newvals = np.random.randint(0,9,(oldvals.size))
...: df=pd.DataFrame({'country_codes':oldvals,'continent_codes':newvals})
...: df = df.reindex_axis(sorted(df.columns)[::-1], axis=1)
...:
...: # Make copies for input array for funcs to update them
...: arrc1 = arr.copy()
...: arrc2 = arr.copy()
...: arrc3 = arr.copy()
...:
In [196]: # Verify outputs
...: original_app(arrc1,df)
...: vectorized_app1(arrc2,df)
...: vectorized_app2(arrc3,df)
...:
In [197]: np.allclose(arrc1,arrc2)
Out[197]: True
In [198]: np.allclose(arrc1,arrc3)
Out[198]: True
Timings -
In [199]: # Make copies for input array for funcs to update them
...: arrc1 = arr.copy()
...: arrc2 = arr.copy()
...: arrc3 = arr.copy()
...:
In [200]: %timeit original_app(arrc1,df)
1 loops, best of 3: 2.79 s per loop
In [201]: %timeit vectorized_app1(arrc2,df)
1 loops, best of 3: 360 ms per loop
In [202]: %timeit vectorized_app2(arrc3,df)
1 loops, best of 3: 213 ms per loop
With this data as an example, with at most N countries,
N=10**5
values=np.random.randint(0,N,(1000,1000))
example={'country':np.arange(N//2),'continent':np.random.randint(1,5,N//2)}
df=pd.DataFrame.from_dict(example)
You can just do :
v=np.arange(N)
v[df['country']]=df['continent']
v.take(values,out=values)
Probably not optimal, but efficient (20ms).
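The lookup-table idea can be illustrated on the small sample array used earlier in this thread. This is a sketch under the assumption that all codes are small non-negative integers; the names lut, oldval and newval are chosen here for illustration.

```python
import numpy as np

arr = np.array([[23, 4, 23, 5, 8],
                [3, 6, 8, 5, 11],
                [16, 24, 15, 4, 10],
                [4, 16, 10, 8, 1]])
oldval = np.array([4, 8, 12, 16, 24])   # country codes
newval = np.array([4, 3, 5, 6, 5])      # continent codes

# Identity table: every value maps to itself unless overridden below.
lut = np.arange(max(arr.max(), oldval.max()) + 1)
lut[oldval] = newval

# Indexing the table with the array performs all replacements at once.
result = lut[arr]
```

The table needs one entry per possible value, so this trades memory for speed and only makes sense when the largest code is reasonably small.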

Numpy indexing 3-dimensional array into 2-dimensional array

I have a three-dimensional array of the following structure:
x = np.array([[[1,2],
[3,4]],
[[5,6],
[7,8]]], dtype=np.double)
Additionally, I have an index array
idx = np.array([[0,1],[1,3]], dtype=int)
Each row of idx defines the row/column indices for the placement of each sub-array along the 0 axis in x into a two-dimensional array K that is initialized as
K = np.zeros((4,4), dtype=np.double)
I would like to use fancy indexing/broadcasting to performing the indexing without a for loop. I currently do it this way:
for i, id in enumerate(idx):
    idx_grid = np.ix_(id,id)
    K[idx_grid] += x[i]
Such that the result is:
>>> K
array([[ 1., 2., 0., 0.],
       [ 3., 9., 0., 6.],
       [ 0., 0., 0., 0.],
       [ 0., 7., 0., 8.]])
Is this possible to do with fancy indexing?
Here's one alternative way. With x, idx and K defined as in your question:
indices = (idx[:,None] + K.shape[1]*idx).ravel('f')
np.add.at(K.ravel(), indices, x.ravel())
Then we have:
>>> K
array([[ 1., 2., 0., 0.],
[ 3., 9., 0., 6.],
[ 0., 0., 0., 0.],
[ 0., 7., 0., 8.]])
To perform unbuffered inplace addition on NumPy arrays you need to use np.add.at (to avoid using += in a for loop).
However, it's slightly problematic to pass a list of 2D index arrays, and corresponding arrays to add at these indices, to np.add.at. This is because the function interprets these lists of arrays as higher-dimensional arrays and IndexErrors are raised.
It's much simpler to pass in 1D arrays. You can temporarily ravel K and x to give you a 1D array of zeros and a 1D array of values to add to those zeros. The only fiddly part is constructing a corresponding 1D array of indices from idx at which to add the values. This can be done via broadcasting with arithmetical operators and then ravelling, as shown above.
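As a complement to the flattened version, np.add.at also accepts a tuple of index arrays that broadcast against each other, in the same way ordinary fancy indexing does. The sketch below relies on that; the names rows and cols are introduced here for illustration and are not part of the answer above.

```python
import numpy as np

x = np.array([[[1, 2], [3, 4]],
              [[5, 6], [7, 8]]], dtype=float)
idx = np.array([[0, 1], [1, 3]])
K = np.zeros((4, 4))

# For block i, entry (b, c) of x[i] goes to K[idx[i, b], idx[i, c]].
rows = idx[:, :, None]   # shape (2, 2, 1)
cols = idx[:, None, :]   # shape (2, 1, 2)

# Unbuffered in-place addition: repeated (row, col) pairs accumulate.
np.add.at(K, (rows, cols), x)
```

Here rows and cols broadcast to the shape of x, so each element of x is paired with exactly one target position in K.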
The intended operation is one of an accumulation of values from x into places indexed by idx. You could think of those idx places as bins of a histogram data and the x values as the weights that you need to accumulate for those bins. Now, to perform such a binning operation, np.bincount could be used. Here's one such implementation with it -
# Get size info of expected output
N = idx.max()+1
# Extend idx to cover two axes, equivalent to `np.ix_`
idx1 = idx[:,None,:] + N*idx[:,:,None]
# "Accumulate" values from x into places indexed by idx1
K = np.bincount(idx1.ravel(),x.ravel()).reshape(N,N)
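If K has a fixed size that is larger than idx.max()+1, the minlength argument of np.bincount can pad the result to the required length. The following is a sketch of that variant; accumulate_blocks and the parameter n are hypothetical names, and it assumes every index in idx is smaller than n.

```python
import numpy as np


def accumulate_blocks(x, idx, n):
    """Sum the blocks of `x` into an (n, n) array at the positions given by `idx`.

    Hypothetical helper; assumes idx.max() < n.
    """
    # Flat index of target (row, col) in an n-by-n array is row * n + col.
    flat_idx = idx[:, :, None] * n + idx[:, None, :]
    counts = np.bincount(flat_idx.ravel(), weights=x.ravel(), minlength=n * n)
    return counts.reshape(n, n)


x = np.array([[[1, 2], [3, 4]],
              [[5, 6], [7, 8]]], dtype=float)
idx = np.array([[0, 1], [1, 3]])
K = accumulate_blocks(x, idx, 5)
```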
Runtime tests -
1) Create inputs:
In [361]: # Create x and idx, with idx having unique elements in each row of idx,
...: # as otherwise the intended operation is not clear
...:
...: nrows = 100
...: max_idx = 100
...: ncols_idx = 2
...:
...: x = np.random.rand(nrows,ncols_idx,ncols_idx)
...: idx = np.random.randint(0,max_idx,(nrows,ncols_idx))
...:
...: valid_mask = ~np.any(np.diff(np.sort(idx,axis=1),axis=1)==0,axis=1)
...:
...: x = x[valid_mask]
...: idx = idx[valid_mask]
...:
2) Define functions:
In [362]: # Define the original and proposed (bincount based) approaches
...:
...: def org_approach(x,idx):
...:     N = idx.max()+1
...:     K = np.zeros((N,N), dtype=np.double)
...:     for i, id in enumerate(idx):
...:         idx_grid = np.ix_(id,id)
...:         K[idx_grid] += x[i]
...:     return K
...:
...:
...: def bincount_approach(x,idx):
...:     N = idx.max()+1
...:     idx1 = idx[:,None,:] + N*idx[:,:,None]
...:     return np.bincount(idx1.ravel(),x.ravel()).reshape(N,N)
...:
3) Finally time them:
In [363]: %timeit org_approach(x,idx)
100 loops, best of 3: 2.13 ms per loop
In [364]: %timeit bincount_approach(x,idx)
10000 loops, best of 3: 32 µs per loop
I do not think it is efficiently possible, since you have += in the loop. This means, you would have to "blow up" your array idx by one dimension and reduce it again by utilizing np.sum(x[...], axis=...).
A minor optimization would be:
import numpy as np
xx = np.array([[[1, 2],
                [3, 4]],
               [[5, 6],
                [7, 8]]], dtype=np.double)
idx = np.array([[0, 1], [1, 3]], dtype=int)
K0, K1 = np.zeros((4, 4), dtype=np.double), np.zeros((4, 4), dtype=np.double)
# Original loop, indexing xx by position
for k, i in enumerate(idx):
    idx_grid = np.ix_(i, i)
    K0[idx_grid] += xx[k]
# Slightly leaner loop, iterating over blocks and index rows together
for x, i in zip(xx, idx):
    K1[np.ix_(i, i)] += x
print("K1 == K0:", np.allclose(K1, K0)) # prints: K1 == K0: True
PS: Do not use id as a variable name, since it shadows the Python built-in function id().

Assign to array, adding multiple copies of index

So I have this array, right?
a=np.zeros(5)
I want to add values to it at the given indices, where indices can be duplicates.
e.g.
a[[1, 2, 2]] += [1, 2, 3]
I want this to produce array([ 0., 1., 5., 0., 0.]), but the answer I get is array([ 0., 1., 3., 0., 0.]).
I'd like this to work with multidimensional arrays and broadcastable indices and all that. Any ideas?
You need to use np.add.at to get around the buffering issue that you encounter with += (values are not accumulated at repeated indices). Specify the array, the indices, and the values to add in place at those indices:
>>> a = np.zeros(5)
>>> np.add.at(a, [1, 2, 2], [1, 2, 3])
>>> a
array([ 0., 1., 5., 0., 0.])
at is part of other ufuncs too (multiply, divide, and so on). This method will also work for multidimensional arrays.
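A short side-by-side sketch of the two behaviours, including a two-dimensional case. The array names b and c are introduced here for illustration only.

```python
import numpy as np

# Buffered: for the repeated index 2, only the last value (3) is applied.
a = np.zeros(5)
a[[1, 2, 2]] += [1, 2, 3]

# Unbuffered: both values for index 2 are accumulated (2 + 3).
b = np.zeros(5)
np.add.at(b, [1, 2, 2], [1, 2, 3])

# Two dimensions: pass a tuple of (row indices, column indices).
c = np.zeros((2, 3))
np.add.at(c, ([0, 0, 1], [1, 1, 2]), [1, 2, 3])
```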
The operation you are performing can be looked at as binning, and to be technically more specific, you are doing weighted binning with those values being the weights and the indices being the bins. For such a binning operation, you can use np.bincount.
Here's the implementation -
import numpy as np
a=np.zeros(5) # initialize output array
idx = [1, 2, 2] # indices
vals = [1, 2, 3] # values
a[:max(idx)+1] = np.bincount(idx,vals) # finally store the bincounts
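The slice assignment above overwrites whatever a already holds in that range. If a may start with non-zero values, a possible variant is to pad the counts with minlength and add them in place. This is a sketch under the assumption that every index is smaller than a.size.

```python
import numpy as np

a = np.ones(5)       # output array that already holds values
idx = [1, 2, 2]      # indices, possibly repeated
vals = [1, 2, 3]     # values to accumulate

# minlength pads the result to a.size so the shapes match for +=.
a += np.bincount(idx, weights=vals, minlength=a.size)
```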
Runtime tests
Here are some runtime tests for two sets of input datasizes comparing the proposed bincount based approach and the add.at based approach listed in the other answer:
Datasize #1 -
In [251]: a=np.zeros(1000)
...: idx = np.sort(np.random.randint(1,1000,(500))).tolist()
...: vals = np.random.rand(500).tolist()
...:
In [252]: %timeit np.add.at(a, idx, vals)
10000 loops, best of 3: 63.4 µs per loop
In [253]: %timeit a[:max(idx)+1] = np.bincount(idx,vals)
10000 loops, best of 3: 42.4 µs per loop
Datasize #2 -
In [254]: a=np.zeros(10000)
...: idx = np.sort(np.random.randint(1,10000,(5000))).tolist()
...: vals = np.random.rand(5000).tolist()
...:
In [255]: %timeit np.add.at(a, idx, vals)
1000 loops, best of 3: 597 µs per loop
In [256]: %timeit a[:max(idx)+1] = np.bincount(idx,vals)
1000 loops, best of 3: 404 µs per loop

Making a numpy ndarray matrix symmetric

I have a 70x70 numpy ndarray, which is mainly diagonal. The only off-diagonal values are below the diagonal. I would like to make the matrix symmetric.
As a newcomer from Matlab world, I can't get it working without for loops. In MATLAB it was easy:
W = max(A,A')
where A' is matrix transposition and the max() function takes care to make the W matrix which will be symmetric.
Is there an elegant way to do so in Python as well?
EXAMPLE
The sample A matrix is:
1 0 0 0
0 2 0 0
1 0 2 0
0 1 0 3
The desired output matrix W is:
1 0 1 0
0 2 0 1
1 0 2 0
0 1 0 3
I found the following solution, which works for me:
import numpy as np
W = np.maximum( A, A.transpose() )
Use the NumPy tril and triu functions as follows. It essentially "mirrors" elements in the lower triangle into the upper triangle.
import numpy as np
A = np.array([[1, 0, 0, 0], [0, 2, 0, 0], [1, 0, 2, 0], [0, 1, 0, 3]])
W = np.tril(A) + np.triu(A.T, 1)
tril(m, k=0) gets the lower triangle of a matrix m (returns a copy of the matrix m with all elements above the kth diagonal zeroed). Similarly, triu(m, k=0) gets the upper triangle of a matrix m (all elements below the kth diagonal zeroed).
To prevent the diagonal being added twice, one must exclude the diagonal from one of the triangles, using either np.tril(A) + np.triu(A.T, 1) or np.tril(A, -1) + np.triu(A.T).
Also note that this behaves slightly differently to using maximum. All elements in the upper triangle are overwritten, regardless of whether they are the maximum or not. This means they can be any value (e.g. nan or inf).
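A small sketch of that difference, using a made-up matrix with a negative entry below the diagonal. The names W_max and W_mirror are introduced here for illustration.

```python
import numpy as np

A = np.array([[1, 0],
              [-1, 2]])

# Element-wise maximum: the pair (0, -1) collapses to 0, so the -1 is lost.
W_max = np.maximum(A, A.T)

# Mirroring: the lower triangle is copied into the upper triangle as-is.
W_mirror = np.tril(A) + np.triu(A.T, 1)
```

Both results are symmetric, but only the mirrored version preserves the original lower-triangle values when they are smaller than their upper-triangle counterparts.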
For what it is worth, the NumPy equivalent of the MATLAB expression you mentioned is more efficient than the method from the link @plonser added.
In [1]: import numpy as np
In [2]: A = np.zeros((4, 4))
In [3]: np.fill_diagonal(A, np.arange(4)+1)
In [4]: A[2:,:2] = np.eye(2)
# numpy equivalent to MATLAB:
In [5]: %timeit W = np.maximum( A, A.T)
100000 loops, best of 3: 2.95 µs per loop
# method from link
In [6]: %timeit W = A + A.T - np.diag(A.diagonal())
100000 loops, best of 3: 9.88 µs per loop
Timing for larger matrices can be done similarly:
In [1]: import numpy as np
In [2]: N = 100
In [3]: A = np.zeros((N, N))
In [4]: A[2:,:N-2] = np.eye(N-2)
In [5]: np.fill_diagonal(A, np.arange(N)+1)
In [6]: A
Out[6]:
array([[ 1., 0., 0., ..., 0., 0., 0.],
[ 0., 2., 0., ..., 0., 0., 0.],
[ 1., 0., 3., ..., 0., 0., 0.],
...,
[ 0., 0., 0., ..., 98., 0., 0.],
[ 0., 0., 0., ..., 0., 99., 0.],
[ 0., 0., 0., ..., 1., 0., 100.]])
# numpy equivalent to MATLAB:
In [6]: %timeit W = np.maximum( A, A.T)
10000 loops, best of 3: 28.6 µs per loop
# method from link
In [7]: %timeit W = A + A.T - np.diag(A.diagonal())
10000 loops, best of 3: 49.8 µs per loop
And with N = 1000
# numpy equivalent to MATLAB:
In [6]: %timeit W = np.maximum( A, A.T)
100 loops, best of 3: 5.65 ms per loop
# method from link
In [7]: %timeit W = A + A.T - np.diag(A.diagonal())
100 loops, best of 3: 11.7 ms per loop
