Error in scipy sparse diags matrix construction - python

When using scipy.sparse.spdiags or scipy.sparse.diags, I have noticed what I consider to be a bug in these routines, e.g.
scipy.sparse.spdiags([1.1,1.2,1.3],1,4,4).toarray()
returns
array([[ 0. , 1.2, 0. , 0. ],
       [ 0. , 0. , 1.3, 0. ],
       [ 0. , 0. , 0. , 0. ],
       [ 0. , 0. , 0. , 0. ]])
That is, for positive diagonals it drops the first k data values. One might argue that there is some grand programming reason for this and that I just need to pad with zeros. OK, annoying as that may be, one can use scipy.sparse.diags, which gives the correct result. However, this routine has a bug that can't be worked around:
scipy.sparse.diags([1.1,1.2],0,(4,2)).toarray()
gives
array([[ 1.1, 0. ],
       [ 0. , 1.2],
       [ 0. , 0. ],
       [ 0. , 0. ]])
nice, and
scipy.sparse.diags([1.1,1.2],-2,(4,2)).toarray()
gives
array([[ 0. , 0. ],
       [ 0. , 0. ],
       [ 1.1, 0. ],
       [ 0. , 1.2]])
but
scipy.sparse.diags([1.1,1.2],-1,(4,2)).toarray()
gives an error saying ValueError: Diagonal length (index 0: 2 at offset -1) does not agree with matrix size (4, 2). Obviously the answer is
array([[ 0. , 0. ],
       [ 1.1, 0. ],
       [ 0. , 1.2],
       [ 0. , 0. ]])
and for extra random behaviour we have
scipy.sparse.diags([1.1],-1,(4,2)).toarray()
giving
array([[ 0. , 0. ],
       [ 1.1, 0. ],
       [ 0. , 1.1],
       [ 0. , 0. ]])
Anyone know if there is a function for constructing diagonal sparse matrices that actually works?

Executive summary: spdiags works correctly, even if the matrix input isn't the most intuitive. diags has a bug that affects some offsets in rectangular matrices. There is a bug fix on scipy github.
The example for spdiags is:
>>> data = array([[1,2,3,4],[1,2,3,4],[1,2,3,4]])
>>> diags = array([0,-1,2])
>>> spdiags(data, diags, 4, 4).todense()
matrix([[1, 0, 3, 0],
        [1, 2, 0, 4],
        [0, 2, 3, 0],
        [0, 0, 3, 4]])
Note that the 3rd column of data always appears in the 3rd column of the sparse matrix. The other columns also line up, but they are omitted where they 'fall off the edge'.
The input to this function is a matrix, while the input to diags is a ragged list. The diagonals of the sparse matrix all have different numbers of values, so the specification has to accommodate this in one way or another. spdiags does this by ignoring the values that fall off the edge; diags does it by taking a ragged list as input.
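If the goal is to put [1.1, 1.2, 1.3] on the +1 diagonal of a 4x4 matrix, here is a minimal sketch of the padding workaround (assuming the column-alignment behaviour of spdiags described above):
import numpy as np
from scipy import sparse

# Pad the data so spdiags lines the values up with the intended columns
# for a positive offset (values taken from the question).
k = 1
vals = [1.1, 1.2, 1.3]
padded = np.concatenate((np.zeros(k), vals))   # [0. , 1.1, 1.2, 1.3]
sparse.spdiags(padded, k, 4, 4).toarray()
# array([[ 0. , 1.1, 0. , 0. ],
#        [ 0. , 0. , 1.2, 0. ],
#        [ 0. , 0. , 0. , 1.3],
#        [ 0. , 0. , 0. , 0. ]])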
The sparse.diags([1.1,1.2],-1,(4,2)) error is puzzling.
The spdiags equivalent does work:
In [421]: sparse.spdiags([[1.1,1.2]],-1,4,2).A
Out[421]:
array([[ 0. , 0. ],
       [ 1.1, 0. ],
       [ 0. , 1.2],
       [ 0. , 0. ]])
The error is raised in this block of code:
for j, diagonal in enumerate(diagonals):
    offset = offsets[j]
    k = max(0, offset)
    length = min(m + offset, n - offset)
    if length <= 0:
        raise ValueError("Offset %d (index %d) out of bounds" % (offset, j))
    try:
        data_arr[j, k:k+length] = diagonal
    except ValueError:
        if len(diagonal) != length and len(diagonal) != 1:
            raise ValueError(
                "Diagonal length (index %d: %d at offset %d) does not "
                "agree with matrix size (%d, %d)." % (
                    j, len(diagonal), offset, m, n))
        raise
The actual matrix constructor in diags is:
dia_matrix((data_arr, offsets), shape=(m, n))
This is the same constructor that spdiags uses, but without any manipulation.
In [434]: sparse.dia_matrix(([[1.1,1.2]],-1),shape=(4,2)).A
Out[434]:
array([[ 0. , 0. ],
       [ 1.1, 0. ],
       [ 0. , 1.2],
       [ 0. , 0. ]])
In dia format, the inputs are stored exactly as given by spdiags (complete with that matrix with extra values):
In [436]: M.data
Out[436]: array([[ 1.1, 1.2]])
In [437]: M.offsets
Out[437]: array([-1], dtype=int32)
As #user2357112 points out, length = min(m + offset, n - offset) is wrong, producing 3 in the test case. Changing it to length = min(m + k, n - k) makes all cases for this (4,2) matrix work. But it fails with the transpose: diags([1.1,1.2], 1, (2, 4)).
The correction, as of Oct 5, for this issue is:
https://github.com/pv/scipy-work/commit/529cbde47121c8ed87f74fa6445c05d71353eb6c
length = min(m + offset, n - offset, min(m,n))
With this fix, diags([1.1,1.2], 1, (2, 4)) works.
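Until a release with that fix is available, one possible workaround (a sketch only; rect_diag is a hypothetical helper, not a scipy function) is to build the dia_matrix directly, as shown above:
import numpy as np
from scipy import sparse

def rect_diag(values, offset, shape):
    """Rectangular sparse matrix with the given values on the given diagonal (sketch)."""
    k = max(0, offset)                      # positive offsets start at column k
    data = np.zeros((1, k + len(values)))
    data[0, k:] = values
    return sparse.dia_matrix((data, [offset]), shape=shape)

rect_diag([1.1, 1.2], -1, (4, 2)).toarray()
# array([[ 0. , 0. ],
#        [ 1.1, 0. ],
#        [ 0. , 1.2],
#        [ 0. , 0. ]])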

Related

Adjacency Matrix from Numpy array using Euclidean Distance

Can someone please help me with how to generate a weighted adjacency matrix from a numpy array, based on the Euclidean distance between all rows, i.e. rows 0 and 1, 0 and 2, ..., 1 and 2, ...?
Given the following example with an input matrix of shape (5, 4):
matrix = [[ 2, 10,  9,  6],
          [ 5,  1,  4,  7],
          [ 3,  2,  1,  0],
          [10, 20,  1,  4],
          [17,  3,  5, 18]]
I would like to obtain a weighted adjacency matrix of shape (5, 5) containing the minimal distances between nodes, i.e., if dist(row0, row1) = 10.77 and dist(row0, row2) = 12.84, then the output matrix will take the first (smaller) distance as a column value.
I have already solved the first part, the generation of the adjacency matrix, with the following code:
from scipy.spatial.distance import cdist
dist = cdist( matrix, matrix, metric='euclidean')
and I get the following result:
array([[ 0. , 10.77032961, 12.84523258, 15.23154621, 20.83266666],
       [10.77032961, 0. , 7.93725393, 20.09975124, 16.43167673],
       [12.84523258, 7.93725393, 0. , 19.72308292, 23.17326045],
       [15.23154621, 20.09975124, 19.72308292, 0. , 23.4520788 ],
       [20.83266666, 16.43167673, 23.17326045, 23.4520788 , 0. ]])
But I don't yet know how to specify the number of neighbors to keep, for example 2 neighbors for each node. Say we define the number of neighbors N = 2; then for each row we keep only the two neighbors with the two minimum distances, and we get as a result:
[[ 0. , 10.77032961, 12.84523258, 0, 0],
 [10.77032961, 0. , 7.93725393, 0, 0],
 [12.84523258, 7.93725393, 0. , 0, 0],
 [15.23154621, 0, 19.72308292, 0. , 0 ],
 [20.83266666, 16.43167673, 0, 0 , 0. ]]
You can use this cleaner solution to get the n smallest values from a matrix. Try the following. dist.argsort(1).argsort(1) creates a rank order (smallest is 0 and largest is 4) over axis=1, and the <= 2 decides how many of the smallest values you keep from that rank order (rank 0 is the diagonal zero; ranks 1 and 2 are the two nearest neighbors). np.where keeps those values and replaces everything else with 0.
np.where(dist.argsort(1).argsort(1) <= 2, dist, 0)
array([[ 0. , 10.77032961, 12.84523258, 0. , 0. ],
       [10.77032961, 0. , 7.93725393, 0. , 0. ],
       [12.84523258, 7.93725393, 0. , 0. , 0. ],
       [15.23154621, 0. , 19.72308292, 0. , 0. ],
       [20.83266666, 16.43167673, 0. , 0. , 0. ]])
This works for any axis, and the same idea gives you the n largest instead of the n smallest values from a matrix, as in the sketch below.
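For instance, a minimal sketch of the n-largest variant (the threshold expression is an assumption based on the same rank-order idea; dist is the distance matrix from the question):
import numpy as np

n = 2
ranks = dist.argsort(1).argsort(1)                       # 0 = smallest ... 4 = largest per row
largest = np.where(ranks >= dist.shape[1] - n, dist, 0)  # keep only the n largest per row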
Assuming a is your Euclidean distance matrix, you can use np.argpartition to choose the n min/max values per row. Keep in mind the diagonal is always 0 and Euclidean distances are non-negative, so to keep the two closest points in each row you need to keep the three smallest per row (including the 0 on the diagonal). This does not hold if you want the max, however.
a[np.arange(a.shape[0])[:,None],np.argpartition(a, 3, axis=1)[:,3:]] = 0
output:
array([[ 0. , 10.77032961, 12.84523258, 0. , 0. ],
       [10.77032961, 0. , 7.93725393, 0. , 0. ],
       [12.84523258, 7.93725393, 0. , 0. , 0. ],
       [15.23154621, 0. , 19.72308292, 0. , 0. ],
       [20.83266666, 16.43167673, 0. , 0. , 0. ]])

A robust way to keep the n-largest elements in rows or colums in the matrix

I would like to make a sparse matrix from a dense one, such that in each row or column only the n largest elements are preserved. I do the following:
import numpy as np
import scipy.sparse as spsp

def sparsify(K, min_nnz=5):
    '''
    Keeps only the elements that are among the min_nnz largest of their row
    or of their column; everything else is set to zero.

    Parameters
    ----------
    K : ndarray
        the input matrix
    min_nnz : int
        the minimal number of elements in a row or column to be preserved
    '''
    cond = np.bitwise_or(
        K >= -np.partition(-K, min_nnz - 1, axis=1)[:, min_nnz - 1][:, None],
        K >= -np.partition(-K, min_nnz - 1, axis=0)[min_nnz - 1, :][None, :])
    return spsp.csr_matrix(np.where(cond, K, 0))
This approach works as intended, but it does not seem to be the most efficient or the most robust one. How would you recommend doing it in a better way?
An example of usage:
A = np.random.rand(10, 10)
A_sp = sparsify(A, min_nnz = 3)
Instead of making another dense matrix, you can use coo_matrix to build up using only the values you need:
return spsp.coo_matrix((K[cond], np.where(cond)), shape = K.shape)
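As a tiny illustration of the (data, (row, col)) COO constructor used here (toy values, not from the question):
import numpy as np
import scipy.sparse as spsp

K = np.arange(9, dtype=float).reshape(3, 3)
cond = K > 4                                   # boolean mask of the entries to keep
M = spsp.coo_matrix((K[cond], np.where(cond)), shape=K.shape)
M.toarray()
# array([[0., 0., 0.],
#        [0., 0., 5.],
#        [6., 7., 8.]])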
As for the rest, you can maybe short-circuit the second dimension, but your time savings will be completely dependent on your inputs:
def sparsify(K, min_nnz=5):
    '''
    Keeps only the elements that are among the min_nnz largest of their row
    or of their column; everything else is set to zero.

    Parameters
    ----------
    K : ndarray
        the input matrix
    min_nnz : int
        the minimal number of elements in a row or column to be preserved
    '''
    cond = K >= -np.partition(-K, min_nnz - 1, axis=0)[min_nnz - 1, :]
    mask = cond.sum(1) < min_nnz
    cond[mask] = np.bitwise_or(
        cond[mask],
        K[mask] >= -np.partition(-K[mask], min_nnz - 1, axis=1)[:, min_nnz - 1][:, None])
    return spsp.coo_matrix((K[cond], np.where(cond)), shape=K.shape)
Testing:
sparsify(A)
Out[]:
<10x10 sparse matrix of type '<class 'numpy.float64'>'
with 58 stored elements in COOrdinate format>
sparsify(A).A
Out[]:
array([[0. , 0. , 0.61362248, 0. , 0.73648987,
0.64561856, 0.40727807, 0.61674005, 0.53533315, 0. ],
[0.8888361 , 0.64548039, 0.94659603, 0.78474203, 0. ,
0. , 0.78809603, 0.88938798, 0. , 0.37631541],
[0.69356682, 0. , 0. , 0. , 0. ,
0.7386594 , 0.71687659, 0.67750768, 0.58002451, 0. ],
[0.67241433, 0.71923718, 0.95888737, 0. , 0. ,
0. , 0.82773085, 0.69788448, 0.63736915, 0.4263064 ],
[0. , 0.65831794, 0. , 0. , 0.59850093,
0. , 0. , 0.61913869, 0.65024867, 0.50860294],
[0.75522891, 0. , 0.93342402, 0.8284258 , 0.64471939,
0.6990814 , 0. , 0. , 0. , 0.32940821],
[0. , 0.88458635, 0.62460096, 0.60412265, 0.66969674,
0. , 0.40318741, 0. , 0. , 0.44116059],
[0. , 0. , 0.500971 , 0.92291245, 0. ,
0.8862903 , 0. , 0.375885 , 0.49473635, 0. ],
[0.86920647, 0.85157893, 0.89883006, 0. , 0.68427193,
0.91195162, 0. , 0. , 0.94762875, 0. ],
[0. , 0.6435456 , 0. , 0.70551006, 0. ,
0.8075527 , 0. , 0.9421039 , 0.91096934, 0. ]])
sparsify(A).A.astype(bool).sum(0)
Out[]: array([5, 6, 7, 5, 5, 6, 5, 7, 7, 5])
sparsify(A).A.astype(bool).sum(1)
Out[]: array([6, 7, 5, 7, 5, 6, 6, 5, 6, 5])

Reshape Array in Array in Array

I have a ROOT file that I open with 2000 entries and a variable number of subentries, and each column is a different variable. Let's say I am only interested in 5 of those. I want to put them in an array with np.shape(array) = (2000, 250, 5). The 250 is plenty to contain all subentries per entry.
The ROOT file is converted into a dictionary by uproot: DATA = {variablename: [array of entries [array of subentries]]}.
I create an array np.zeros((2000, 250, 5)) and fill it with the data I want, but it takes about 500 ms, and I need a solution that scales as I aim for 1 million entries later on. I found multiple solutions, but my lowest was about 300 ms:
lim_i = len(N_DATA["nTrack"])
i = 0
INPUT_ARRAY = np.zeros((lim_i, 500, 5))
for l in range(len(INPUT_ARRAY)):
    while i < lim_i:
        EVENT = np.zeros((500, 5))
        k = 0
        lim_k = len(TRACK_DATA["Track_pt"][i])
        while k < lim_k:
            EVENT[k][0] = TRACK_DATA["Track_pt"][i][k]
            EVENT[k][1] = TRACK_DATA["Track_phi"][i][k]
            EVENT[k][2] = TRACK_DATA["Track_eta"][i][k]
            EVENT[k][3] = TRACK_DATA["Track_dxy"][i][k]
            EVENT[k][4] = TRACK_DATA["Track_charge"][i][k]
            k += 1
        INPUT_ARRAY[i] = EVENT
        i += 1
INPUT_ARRAY
Taking note of Karl Knechtel's second comment, "You should avoid explicitly iterating over Numpy arrays yourself (there is practically guaranteed to be a built-in Numpy thing that just does what you want, and probably much faster than native Python can)," there is a way to do this with array-at-a-time programming, but not in NumPy. The reason Uproot returns Awkward Arrays is because you need a way to deal with variable-length data efficiently.
I don't have your file, but I'll start with a similar one:
>>> import uproot4
>>> import skhep_testdata
>>> events = uproot4.open(skhep_testdata.data_path("uproot-HZZ.root"))["events"]
The branches that start with "Muon_" in this file have the same variable-length structure as in your tracks. (The C++ typename is a dynamically sized array, interpreted in Python "as jagged.")
>>> events.show(filter_name="Muon_*")
name | typename | interpretation
---------------------+--------------------------+-------------------------------
Muon_Px | float[] | AsJagged(AsDtype('>f4'))
Muon_Py | float[] | AsJagged(AsDtype('>f4'))
Muon_Pz | float[] | AsJagged(AsDtype('>f4'))
Muon_E | float[] | AsJagged(AsDtype('>f4'))
Muon_Charge | int32_t[] | AsJagged(AsDtype('>i4'))
Muon_Iso | float[] | AsJagged(AsDtype('>f4'))
If you just ask for these arrays, you get them as an Awkward Array.
>>> muons = events.arrays(filter_name="Muon_*")
>>> muons
<Array [{Muon_Px: [-52.9, 37.7, ... 0]}] type='2421 * {"Muon_Px": var * float32,...'>
To put them to better use, let's import Awkward Array and start by asking for its type.
>>> import awkward1 as ak
>>> ak.type(muons)
2421 * {"Muon_Px": var * float32, "Muon_Py": var * float32, "Muon_Pz": var * float32, "Muon_E": var * float32, "Muon_Charge": var * int32, "Muon_Iso": var * float32}
What does this mean? It means you have 2421 records with fields named "Muon_Px", etc., that each contain variable-length lists of float32 or int32, depending on the field. We can look at one of them by converting it to Python lists and dicts.
>>> muons[0].tolist()
{'Muon_Px': [-52.89945602416992, 37.7377815246582],
'Muon_Py': [-11.654671669006348, 0.6934735774993896],
'Muon_Pz': [-8.16079330444336, -11.307581901550293],
'Muon_E': [54.77949905395508, 39.401695251464844],
'Muon_Charge': [1, -1],
'Muon_Iso': [4.200153350830078, 2.1510612964630127]}
(You could have made these lists of records, rather than records of lists, by passing how="zip" to TTree.arrays or using ak.unzip and ak.zip in Awkward Array, but that's tangential to the padding that you want to do.)
The problem is that the lists have different lengths. NumPy doesn't have any functions that will help us here because it deals entirely in rectilinear arrays. Therefore, we need a function that's specific to Awkward Array, ak.num.
>>> ak.num(muons)
<Array [{Muon_Px: 2, ... Muon_Iso: 1}] type='2421 * {"Muon_Px": int64, "Muon_Py"...'>
This is telling us the number of elements in each list, per field. For clarity, look at the first one:
>>> ak.num(muons)[0].tolist()
{'Muon_Px': 2, 'Muon_Py': 2, 'Muon_Pz': 2, 'Muon_E': 2, 'Muon_Charge': 2, 'Muon_Iso': 2}
You want to turn these irregular lists into regular lists that all have the same size. That's called "padding." Again, there's a function for that, but we first need to get the maximum number of elements, so that we know how much to pad it by.
>>> ak.max(ak.num(muons))
4
So let's make them all length 4.
>>> ak.pad_none(muons, ak.max(ak.num(muons)))
<Array [{Muon_Px: [-52.9, 37.7, ... None]}] type='2421 * {"Muon_Px": var * ?floa...'>
Again, let's look at the first one to understand what we have.
>>> ak.pad_none(muons, ak.max(ak.num(muons)))[0].tolist()
{'Muon_Px': [-52.89945602416992, 37.7377815246582, None, None],
'Muon_Py': [-11.654671669006348, 0.6934735774993896, None, None],
'Muon_Pz': [-8.16079330444336, -11.307581901550293, None, None],
'Muon_E': [54.77949905395508, 39.401695251464844, None, None],
'Muon_Charge': [1, -1, None, None],
'Muon_Iso': [4.200153350830078, 2.1510612964630127, None, None]}
You wanted to pad them with zeros, not None, so we convert the missing values into zeros.
>>> ak.fill_none(ak.pad_none(muons, ak.max(ak.num(muons))), 0)[0].tolist()
{'Muon_Px': [-52.89945602416992, 37.7377815246582, 0.0, 0.0],
'Muon_Py': [-11.654671669006348, 0.6934735774993896, 0.0, 0.0],
'Muon_Pz': [-8.16079330444336, -11.307581901550293, 0.0, 0.0],
'Muon_E': [54.77949905395508, 39.401695251464844, 0.0, 0.0],
'Muon_Charge': [1, -1, 0, 0],
'Muon_Iso': [4.200153350830078, 2.1510612964630127, 0.0, 0.0]}
Finally, NumPy doesn't have records (other than the structured array, which also implies that the columns are contiguous in memory; Awkward Array's "records" are abstract). So let's unzip what we have into six separate arrays.
>>> arrays = ak.unzip(ak.fill_none(ak.pad_none(muons, ak.max(ak.num(muons))), 0))
>>> arrays
(<Array [[-52.9, 37.7, 0, 0, ... 23.9, 0, 0, 0]] type='2421 * var * float64'>,
<Array [[-11.7, 0.693, 0, 0, ... 0, 0, 0]] type='2421 * var * float64'>,
<Array [[-8.16, -11.3, 0, 0, ... 0, 0, 0]] type='2421 * var * float64'>,
<Array [[54.8, 39.4, 0, 0], ... 69.6, 0, 0, 0]] type='2421 * var * float64'>,
<Array [[1, -1, 0, 0], ... [-1, 0, 0, 0]] type='2421 * var * int64'>,
<Array [[4.2, 2.15, 0, 0], ... [0, 0, 0, 0]] type='2421 * var * float64'>)
Note that this one line does everything, starting from the initial data pull from Uproot (muons). I'm not going to profile it now, but you'll find that this one line is considerably faster than explicit looping.
Now what we have is semantically equivalent to six NumPy arrays, so we'll just cast them as NumPy. (Attempts to do so with irregular data would fail. You have to explicitly pad the data.)
>>> numpy_arrays = [ak.to_numpy(x) for x in arrays]
>>> numpy_arrays
[array([[-52.89945602, 37.73778152, 0. , 0. ],
[ -0.81645936, 0. , 0. , 0. ],
[ 48.98783112, 0.82756668, 0. , 0. ],
...,
[-29.75678635, 0. , 0. , 0. ],
[ 1.14186978, 0. , 0. , 0. ],
[ 23.9132061 , 0. , 0. , 0. ]]),
array([[-11.65467167, 0.69347358, 0. , 0. ],
[-24.40425873, 0. , 0. , 0. ],
[-21.72313881, 29.8005085 , 0. , 0. ],
...,
[-15.30385876, 0. , 0. , 0. ],
[ 63.60956955, 0. , 0. , 0. ],
[-35.66507721, 0. , 0. , 0. ]]),
array([[ -8.1607933 , -11.3075819 , 0. , 0. ],
[ 20.19996834, 0. , 0. , 0. ],
[ 11.16828537, 36.96519089, 0. , 0. ],
...,
[-52.66374969, 0. , 0. , 0. ],
[162.17631531, 0. , 0. , 0. ],
[ 54.71943665, 0. , 0. , 0. ]]),
array([[ 54.77949905, 39.40169525, 0. , 0. ],
[ 31.69044495, 0. , 0. , 0. ],
[ 54.73978806, 47.48885727, 0. , 0. ],
...,
[ 62.39516068, 0. , 0. , 0. ],
[174.20863342, 0. , 0. , 0. ],
[ 69.55621338, 0. , 0. , 0. ]]),
array([[ 1, -1, 0, 0],
[ 1, 0, 0, 0],
[ 1, -1, 0, 0],
...,
[-1, 0, 0, 0],
[-1, 0, 0, 0],
[-1, 0, 0, 0]]),
array([[4.20015335, 2.1510613 , 0. , 0. ],
[2.18804741, 0. , 0. , 0. ],
[1.41282165, 3.38350415, 0. , 0. ],
...,
[3.76294518, 0. , 0. , 0. ],
[0.55081069, 0. , 0. , 0. ],
[0. , 0. , 0. , 0. ]])]
And now NumPy's dstack is appropriate. (This is making them contiguous in memory, so you could use NumPy's structured arrays if you want to. I would find that easier for keeping track of which index means which variable, but that's up to you. Actually, Xarray is particularly good at tracking metadata of rectilinear arrays.)
>>> import numpy as np
>>> np.dstack(numpy_arrays)
array([[[-52.89945602, -11.65467167, -8.1607933 , 54.77949905,
1. , 4.20015335],
[ 37.73778152, 0.69347358, -11.3075819 , 39.40169525,
-1. , 2.1510613 ],
[ 0. , 0. , 0. , 0. ,
0. , 0. ],
[ 0. , 0. , 0. , 0. ,
0. , 0. ]],
[[ -0.81645936, -24.40425873, 20.19996834, 31.69044495,
1. , 2.18804741],
[ 0. , 0. , 0. , 0. ,
0. , 0. ],
[ 0. , 0. , 0. , 0. ,
0. , 0. ],
[ 0. , 0. , 0. , 0. ,
0. , 0. ]],
[[ 48.98783112, -21.72313881, 11.16828537, 54.73978806,
1. , 1.41282165],
[ 0.82756668, 29.8005085 , 36.96519089, 47.48885727,
-1. , 3.38350415],
[ 0. , 0. , 0. , 0. ,
0. , 0. ],
[ 0. , 0. , 0. , 0. ,
0. , 0. ]],
...,
[[-29.75678635, -15.30385876, -52.66374969, 62.39516068,
-1. , 3.76294518],
[ 0. , 0. , 0. , 0. ,
0. , 0. ],
[ 0. , 0. , 0. , 0. ,
0. , 0. ],
[ 0. , 0. , 0. , 0. ,
0. , 0. ]],
[[ 1.14186978, 63.60956955, 162.17631531, 174.20863342,
-1. , 0.55081069],
[ 0. , 0. , 0. , 0. ,
0. , 0. ],
[ 0. , 0. , 0. , 0. ,
0. , 0. ],
[ 0. , 0. , 0. , 0. ,
0. , 0. ]],
[[ 23.9132061 , -35.66507721, 54.71943665, 69.55621338,
-1. , 0. ],
[ 0. , 0. , 0. , 0. ,
0. , 0. ],
[ 0. , 0. , 0. , 0. ,
0. , 0. ],
[ 0. , 0. , 0. , 0. ,
0. , 0. ]]])
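Putting the whole thing together for the branches in the original question, the pipeline would look roughly like this (a sketch; the Track_* branch names and the events handle are assumptions carried over from the question, and the field order along the stacked axis follows the order in which the branches are returned):
import awkward1 as ak
import numpy as np

# Pull the jagged Track_* branches (assumed names) as one Awkward record array.
tracks = events.arrays(filter_name="Track_*")

# Pad every per-event list to the longest one (or to a fixed 250/500),
# replace the padding with zeros, and stack the fields along a third axis.
max_len = ak.max(ak.num(tracks["Track_pt"]))
padded = ak.fill_none(ak.pad_none(tracks, max_len), 0)
input_array = np.dstack([ak.to_numpy(x) for x in ak.unzip(padded)])
input_array.shape   # (n_events, max_len, n_fields)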
Observation 1: we can assign directly to the appropriate sub-arrays of INPUT_ARRAY[i], instead of creating EVENT as a proxy for INPUT_ARRAY[i] and then copying that in. (I will also set your variable names in lowercase, to follow normal conventions.)
lim_i = len(n_data["nTrack"])
i = 0
input_array = np.zeros((lim_i, 500, 5))
for l in range(len(input_array)):
    while i < lim_i:
        k = 0
        lim_k = len(track_data["Track_pt"][i])
        while k < lim_k:
            input_array[i][k][0] = track_data["Track_pt"][i][k]
            input_array[i][k][1] = track_data["Track_phi"][i][k]
            input_array[i][k][2] = track_data["Track_eta"][i][k]
            input_array[i][k][3] = track_data["Track_dxy"][i][k]
            input_array[i][k][4] = track_data["Track_charge"][i][k]
            k += 1
        i += 1
Observation 2: the assignments we make in the innermost loop have the same basic structure. It would be nice if we could take the various entries of the TRACK_DATA dict (which are 2-dimensional data) and stack them together. Numpy has a convenient (and efficient) built-in for stacking 2-dimensional data along the third dimension: np.dstack. Having prepared that 3-dimensional array, we can just copy in from it mechanically:
track_array = np.dstack((
    track_data['Track_pt'],
    track_data['Track_phi'],
    track_data['Track_eta'],
    track_data['Track_dxy'],
    track_data['Track_charge']
))
lim_i = len(n_data["nTrack"])
i = 0
input_array = np.zeros((lim_i, 500, 5))
for l in range(len(input_array)):
    while i < lim_i:
        k = 0
        lim_k = len(track_data["Track_pt"][i])
        while k < lim_k:
            input_array[i][k][0] = track_array[i][k][0]
            input_array[i][k][1] = track_array[i][k][1]
            input_array[i][k][2] = track_array[i][k][2]
            input_array[i][k][3] = track_array[i][k][3]
            input_array[i][k][4] = track_array[i][k][4]
            k += 1
        i += 1
Observation 3: but now, the purpose of our innermost loop is simply to copy an entire chunk of track_array along the last dimension. We could just do that directly:
track_array = np.dstack((
    track_data['Track_pt'],
    track_data['Track_phi'],
    track_data['Track_eta'],
    track_data['Track_dxy'],
    track_data['Track_charge']
))
lim_i = len(n_data["nTrack"])
i = 0
input_array = np.zeros((lim_i, 500, 5))
for l in range(len(input_array)):
    while i < lim_i:
        k = 0
        lim_k = len(track_data["Track_pt"][i])
        while k < lim_k:
            input_array[i][k] = track_array[i][k]
            k += 1
        i += 1
Observation 4: But actually, the same reasoning applies to the other two dimensions of the array. Clearly, our intent is to copy the entire array produced from the dstack; and that is already a new array, so we could just use it directly.
input_array = np.dstack((
    track_data['Track_pt'],
    track_data['Track_phi'],
    track_data['Track_eta'],
    track_data['Track_dxy'],
    track_data['Track_charge']
))

Getting the singular values of np.linalg.svd as a matrix

Given a 5x4 matrix A, here is a piece of Python code to construct it:
A = np.array([[1, 0, 0, 0],
              [0, 0, 0, 4],
              [0, 3, 0, 0],
              [0, 0, 0, 0],
              [2, 0, 0, 0]])
WolframAlpha gives the SVD result with the singular values Σ in the form of a rectangular diagonal matrix, whereas the equivalent quantity (NumPy calls it s) in the output of np.linalg.svd is a flat vector:
[ 4. 3. 2.23606798 -0. ]
Is there a way to have this quantity in the output of numpy.linalg.svd shown the way WolframAlpha shows it?
You can get most of the way there with diag:
>>> u, s, vh = np.linalg.svd(A)
>>> np.diag(s)
array([[ 4. , 0. , 0. , 0. ],
       [ 0. , 3. , 0. , 0. ],
       [ 0. , 0. , 2.23606798, 0. ],
       [ 0. , 0. , 0. , -0. ]])
Note that WolframAlpha is giving an extra row. Getting that is marginally more involved:
>>> sigma = np.zeros(A.shape, s.dtype)
>>> np.fill_diagonal(sigma, s)
>>> sigma
array([[ 4. , 0. , 0. , 0. ],
       [ 0. , 3. , 0. , 0. ],
       [ 0. , 0. , 2.23606798, 0. ],
       [ 0. , 0. , 0. , -0. ],
       [ 0. , 0. , 0. , 0. ]])
Depending on what your goal is, removing a column from U might be a better approach than adding a row of zeros to sigma. That would look like:
>>> u, s, vh = np.linalg.svd(A, full_matrices=False)
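As a quick sanity check (a minimal sketch, assuming the A defined in the question), either form reconstructs A:
import numpy as np

A = np.array([[1, 0, 0, 0],
              [0, 0, 0, 4],
              [0, 3, 0, 0],
              [0, 0, 0, 0],
              [2, 0, 0, 0]])

u, s, vh = np.linalg.svd(A, full_matrices=False)   # u: (5, 4), s: (4,), vh: (4, 4)
np.allclose(u @ np.diag(s) @ vh, A)                # True: U @ Sigma @ Vh recovers A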

Printing numpy with different position in the column

I have the following numpy array:
import numpy as np
np.random.seed(20)
np.random.rand(20).reshape(5, 4)
array([[ 0.5881308 , 0.89771373, 0.89153073, 0.81583748],
       [ 0.03588959, 0.69175758, 0.37868094, 0.51851095],
       [ 0.65795147, 0.19385022, 0.2723164 , 0.71860593],
       [ 0.78300361, 0.85032764, 0.77524489, 0.03666431],
       [ 0.11669374, 0.7512807 , 0.23921822, 0.25480601]])
For each column I would like to slice it starting at these positions:
position_for_slicing=[0, 3, 4, 4]
So I will get the following array:
array([[ 0.5881308 , 0.85032764, 0.23921822, 0.81583748],
       [ 0.03588959, 0.7512807 , 0, 0],
       [ 0.65795147, 0, 0, 0],
       [ 0.78300361, 0, 0, 0],
       [ 0.11669374, 0, 0, 0]])
Is there a fast way to do this? I know I can use a for loop over each column, but I was wondering if there is a more elegant way to do this.
If "elegant" means "no loop" the following would qualify, but probably not under many other definitions (arr is your input array):
m, n = arr.shape
arrf = np.asanyarray(arr, order='F')
padded = np.r_[arrf, np.zeros_like(arrf)]
assert padded.flags['F_CONTIGUOUS']
expnd = np.lib.stride_tricks.as_strided(padded, (m, m+1, n), padded.strides[:1] + padded.strides)
expnd[:, [0,3,4,4], range(4)]
# array([[ 0.5881308 , 0.85032764, 0.23921822, 0.25480601],
#        [ 0.03588959, 0.7512807 , 0. , 0. ],
#        [ 0.65795147, 0. , 0. , 0. ],
#        [ 0.78300361, 0. , 0. , 0. ],
#        [ 0.11669374, 0. , 0. , 0. ]])
Please note that order='C' and then 'C_CONTIGUOUS' in the assertion also works. My hunch is that 'F' could be a bit faster because the indexing then operates on contiguous slices.
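For reference, a plain-loop equivalent (a sketch assuming arr and position_for_slicing from the question) that produces the same result as the strided version above:
import numpy as np

position_for_slicing = [0, 3, 4, 4]
out = np.zeros_like(arr)
for j, p in enumerate(position_for_slicing):
    # take column j from row p downwards, left-align it at the top, zero-pad the rest
    out[:arr.shape[0] - p, j] = arr[p:, j]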
