I just found some unexpected behaviour in the triu function in NumPy 1.8.1.
>>> import numpy as np
>>> a = np.zeros((4, 4))
>>> a[1:, 0] = np.inf
>>> a
array([[  0.,   0.,   0.,   0.],
       [ inf,   0.,   0.,   0.],
       [ inf,   0.,   0.,   0.],
       [ inf,   0.,   0.,   0.]])
>>> np.triu(a)
array([[  0.,   0.,   0.,   0.],
       [ nan,   0.,   0.,   0.],
       [ nan,   0.,   0.,   0.],
       [ nan,   0.,   0.,   0.]])
Would this behaviour ever be desirable? Or shall I file a bug report?
Edit
I raised an issue on the NumPy GitHub page.
1. Explanation
Looks like you ignored the RuntimeWarning:
>>> np.triu(a)
twodim_base.py:450: RuntimeWarning: invalid value encountered in multiply
out = multiply((1 - tri(m.shape[0], m.shape[1], k - 1, dtype=m.dtype)), m)
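If you want a warning like this to be impossible to miss, you can ask NumPy to raise on invalid floating-point operations instead; this is a small sketch, not part of the original post:
np.seterr(invalid='raise')   # promote "invalid value" warnings to FloatingPointError
np.triu(a)                   # now raises instead of silently returning nan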
The source code for numpy.triu is as follows:
def triu(m, k=0):
    m = asanyarray(m)
    out = multiply((1 - tri(m.shape[0], m.shape[1], k - 1, dtype=m.dtype)), m)
    return out
This uses numpy.tri to get an array with ones below the diagonal and zeros above, and subtracts this from 1 to get an array with zeros below the diagonal and ones above:
>>> 1 - np.tri(4, 4, -1)
array([[ 1.,  1.,  1.,  1.],
       [ 0.,  1.,  1.,  1.],
       [ 0.,  0.,  1.,  1.],
       [ 0.,  0.,  0.,  1.]])
Then it multiplies this element-wise with the original array. So wherever the original array has inf below the diagonal, the result is inf * 0, which is NaN.
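That inf * 0 gives NaN is just standard floating-point behaviour, easy to check in isolation:
>>> np.inf * 0
nan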
2. Workaround
Use numpy.tril_indices to generate the indices of the lower triangle, and set all those entries to zero:
>>> a = np.ones((4, 4))
>>> a[1:, 0] = np.inf
>>> a
array([[  1.,   1.,   1.,   1.],
       [ inf,   1.,   1.,   1.],
       [ inf,   1.,   1.,   1.],
       [ inf,   1.,   1.,   1.]])
>>> a[np.tril_indices(4, -1)] = 0
>>> a
array([[ 1.,  1.,  1.,  1.],
       [ 0.,  1.,  1.,  1.],
       [ 0.,  0.,  1.,  1.],
       [ 0.,  0.,  0.,  1.]])
(Depending on what you are going to do with a, you might want to take a copy before zeroing these entries.)
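For example, a minimal sketch of the non-destructive version (upper is just a name introduced here for the copy):
>>> upper = a.copy()                   # keep the original array untouched
>>> upper[np.tril_indices(4, -1)] = 0  # zero only the copy's strict lower triangle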
Related
I have a 2-D numpy array, let's say like this:
matrix([[1., 0., 0., ..., 1., 0., 0.],
        [1., 0., 0., ..., 0., 1., 1.],
        [1., 0., 0., ..., 1., 0., 0.],
        [1., 1., 0., ..., 1., 0., 0.],
        [1., 1., 0., ..., 1., 0., 0.],
        [1., 1., 0., ..., 1., 0., 0.]])
I want to transform it into a 3-D numpy array based on the values of a column of a dataframe. Let's say the column is like this:
df = pd.DataFrame({"Case":[1,1,2,2,3,4]})
The final 3-D array should look like this:
matrix([
[
[1., 0., 0., ..., 1., 0., 0.], [1., 0., 0., ..., 0., 1., 1.]
],
[
[1., 0., 0., ..., 1., 0., 0.], [1., 1., 0., ..., 1., 0., 0.]
],
[
[1., 1., 0., ..., 1., 0., 0.]
],
[
[1., 1., 0., ..., 1., 0., 0.]
]
])
The first 2 rows of the initial 2-D array become one 2-D array of the final 3-D array because, in the column of the dataframe, the first and second rows both have the value '1'.
Similarly, the next 2 rows become another 2-D array of 2 rows because the next two values of the column are '2', so they belong together.
There is only one row each for the values '3' and '4', so the corresponding 2-D arrays of the 3-D array have only 1 row each.
So, basically, if two or more values in the column of the dataframe are the same, then the rows of the initial 2-D matrix at those indices belong together: they are collected into a 2-D matrix and pushed as an element of the final 3-D matrix.
How do I do this?
Numpy doesn't have very good support for arrays with rows of different length, but you can make it a list of 2D arrays instead:
M = np.array(
    [[1., 0., 0., ..., 1., 0., 0.],   # the '...' stands for the columns omitted in the question
     [1., 0., 0., ..., 0., 1., 1.],
     [1., 0., 0., ..., 1., 0., 0.],
     [1., 1., 0., ..., 1., 0., 0.],
     [1., 1., 0., ..., 1., 0., 0.],
     [1., 1., 0., ..., 1., 0., 0.]]
)
df = pd.DataFrame({"Case":[1,1,2,2,3,4]})
M_per_case = [
    np.stack([M[index] for index in df.index[df['Case'] == case]])
    for case in set(df['Case'])
]
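If you are already using pandas, an equivalent sketch is to group the row positions by Case directly; this reuses M and df from above and assumes the default RangeIndex, so the index values equal the row positions in M:
M_per_case = [
    M[group.index.to_numpy()]            # rows of M belonging to this case
    for _, group in df.groupby('Case')   # one sub-frame per distinct Case value
]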
Given a 3d tensor, say:
batch x sentence length x embedding dim
a = torch.rand((10, 1000, 96))
and an array (or tensor) of actual lengths for each sentence
lengths = torch.randint(1000, (10,))
which outputs tensor([ 370., 502., 652., 859., 545., 964., 566., 576., 1000., 803.])
How do I fill tensor 'a' with zeros after a certain index along dimension 1 (sentence length), according to tensor 'lengths'?
I want something like this:
a[:, lengths:, :] = 0
One way of doing it (slow if the batch size is big enough):
for i_batch in range(10):
    a[i_batch, lengths[i_batch]:, :] = 0
You can do it using a binary mask.
Using lengths as column indices into mask, we indicate where each sequence ends (note that we make mask one column longer than a.size(1) to allow for sequences of full length).
Using cumsum(), we set all entries in mask after the sequence length to 1.
mask = torch.zeros(a.shape[0], a.shape[1] + 1, dtype=a.dtype, device=a.device)
mask[(torch.arange(a.shape[0]), lengths)] = 1
mask = mask.cumsum(dim=1)[:, :-1] # remove the superfluous column
a = a * (1. - mask[..., None]) # use the mask to zero everything after each sequence's length
For a.shape = (10, 5, 96), and lengths = [1, 2, 1, 1, 3, 0, 4, 4, 1, 3].
Assigning 1 at the respective length in each row, mask looks like:
mask =
tensor([[0., 1., 0., 0., 0., 0.],
        [0., 0., 1., 0., 0., 0.],
        [0., 1., 0., 0., 0., 0.],
        [0., 1., 0., 0., 0., 0.],
        [0., 0., 0., 1., 0., 0.],
        [1., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 1., 0.],
        [0., 0., 0., 0., 1., 0.],
        [0., 1., 0., 0., 0., 0.],
        [0., 0., 0., 1., 0., 0.]])
After cumsum you get
mask =
tensor([[0., 1., 1., 1., 1.],
        [0., 0., 1., 1., 1.],
        [0., 1., 1., 1., 1.],
        [0., 1., 1., 1., 1.],
        [0., 0., 0., 1., 1.],
        [1., 1., 1., 1., 1.],
        [0., 0., 0., 0., 1.],
        [0., 0., 0., 0., 1.],
        [0., 1., 1., 1., 1.],
        [0., 0., 0., 1., 1.]])
Note that it has zeros exactly where the valid sequence entries are and ones beyond the lengths of the sequences. Taking 1 - mask gives you exactly what you want.
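An alternative that skips the cumsum trick entirely is to build the keep-mask by comparing position indices against lengths with broadcasting; this is a sketch using the same a and lengths as above:
idx = torch.arange(a.shape[1], device=a.device)        # positions 0 .. seq_len-1
keep = (idx[None, :] < lengths[:, None]).to(a.dtype)   # 1 where the entry is still valid
a = a * keep[..., None]                                # zero everything past each length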
Enjoy ;)
What's the most pythonic way of writing a function that returns an nxn boundary mask for convolution? E.g. for 3x3 it will return [[1,1,1],[1,0,1],[1,1,1]]; for 5x5 it will return [[1,1,1,1,1],[1,0,0,0,1],[1,0,0,0,1],[1,0,0,0,1],[1,1,1,1,1]]; and so on.
This works (but isn't so pythonic):
def boundaryMask(size):
    mask = np.zeros((size, size))
    for i in range(size):
        mask[0][i] = 1
        mask[i][0] = 1
        mask[i][size-1] = 1
        mask[size-1][i] = 1
    return mask
One option would be to create an array of ones, and then assign zeros to the center of the array using slicing:
N = 4
x = np.ones((N, N))
x[1:-1, 1:-1] = 0
x
#array([[ 1.,  1.,  1.,  1.],
#       [ 1.,  0.,  0.,  1.],
#       [ 1.,  0.,  0.,  1.],
#       [ 1.,  1.,  1.,  1.]])
Put in a function and test on various sizes:
def boundaryMask(size):
    mask = np.ones((size, size))
    mask[1:-1, 1:-1] = 0
    return mask
boundaryMask(1)
# array([[ 1.]])
boundaryMask(2)
#array([[ 1.,  1.],
#       [ 1.,  1.]])
boundaryMask(3)
#array([[ 1.,  1.,  1.],
#       [ 1.,  0.,  1.],
#       [ 1.,  1.,  1.]])
boundaryMask(4)
#array([[ 1.,  1.,  1.,  1.],
#       [ 1.,  0.,  0.,  1.],
#       [ 1.,  0.,  0.,  1.],
#       [ 1.,  1.,  1.,  1.]])
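As a quick sanity check of the "for convolution" part, here is a sketch that uses the mask from above as a kernel with scipy.signal.convolve2d to sum each cell's boundary neighbours; the image array here is just made up for the example:
import numpy as np
from scipy.signal import convolve2d

image = np.arange(36).reshape(6, 6)
kernel = boundaryMask(3)   # the 3x3 ring of ones defined above
neighbour_sums = convolve2d(image, kernel, mode='same', boundary='fill', fillvalue=0)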
I am trying to store matrices into an array; however, when I append a matrix, it takes every element and outputs just a 1-dimensional array.
Example Code:
matrix_array = np.array([])
for y in y_label:
    matrix_array = np.append(matrix_array, np.identity(3))
Clearly np.append is the wrong tool for the job:
In [144]: np.append(np.array([]), np.identity(3))
Out[144]: array([ 1., 0., 0., 0., 1., 0., 0., 0., 1.])
From its docs:
If axis is not specified, values can be any shape and will be
flattened before use.
With list append
In [153]: alist=[]
In [154]: for y in [1,2]:
...: alist.append(np.identity(3))
...:
In [155]: alist
Out[155]:
[array([[ 1.,  0.,  0.],
        [ 0.,  1.,  0.],
        [ 0.,  0.,  1.]]), array([[ 1.,  0.,  0.],
        [ 0.,  1.,  0.],
        [ 0.,  0.,  1.]])]
In [156]: np.array(alist)
Out[156]:
array([[[ 1.,  0.,  0.],
        [ 0.,  1.,  0.],
        [ 0.,  0.,  1.]],

       [[ 1.,  0.,  0.],
        [ 0.,  1.,  0.],
        [ 0.,  0.,  1.]]])
In [157]: _.shape
Out[157]: (2, 3, 3)
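If all the matrices have the same shape (as the identity matrices here do), np.stack on a list is the most direct one-liner; y_label is whatever iterable the question loops over:
matrix_array = np.stack([np.identity(3) for _ in y_label])   # shape (len(y_label), 3, 3)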
Is there a way I can allocate memory for scipy sparse matrix functions to process large datasets?
Specifically, I'm attempting to use Asymmetric Least Squares Smoothing (translated into python here and the original here) to perform a baseline correction on a large mass spec dataset (length of ~60,000).
The function (see below) uses the scipy.sparse matrix operations.
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import spsolve

def baseline_als(y, lam, p, niter):
    L = len(y)
    D = sparse.csc_matrix(np.diff(np.eye(L), 2))
    w = np.ones(L)
    for i in range(niter):
        W = sparse.spdiags(w, 0, L, L)
        Z = W + lam * D.dot(D.transpose())
        z = spsolve(Z, w*y)
        w = p * (y > z) + (1-p) * (y < z)
    return z
I have no problem when I pass data sets that are 10,000 or less in length:
baseline_als(np.ones(10000),100,0.1,10)
But when passing larger data sets, e.g.
baseline_als(np.ones(50000), 100, 0.1, 10)
I get a MemoryError, for the line
D = sparse.csc_matrix(np.diff(np.eye(L), 2))
Try changing
D = sparse.csc_matrix(np.diff(np.eye(L), 2))
to
diag = np.ones(L - 2)
D = sparse.spdiags([diag, -2*diag, diag], [0, -1, -2], L, L-2)
D will be a sparse matrix in DIAgonal format. If it turns out that being in CSC format is important, convert it using the tocsc() method:
D = sparse.spdiags([diag, -2*diag, diag], [0, -1, -2], L, L-2).tocsc()
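The reason the original line blows up is that np.eye(L) is a dense L-by-L float64 array, so it needs roughly L * L * 8 bytes before np.diff even runs; the DIA version only ever stores the three diagonals. A rough back-of-the-envelope check for the sizes involved:
L = 50000
dense_bytes = L * L * 8          # np.eye(L): about 2e10 bytes, i.e. roughly 20 GB
sparse_bytes = 3 * (L - 2) * 8   # three diagonals of length L-2: about 1.2 MB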
The following example shows that the old and new versions generate the same matrix:
In [67]: from scipy import sparse
In [68]: L = 8
Original:
In [69]: D = sparse.csc_matrix(np.diff(np.eye(L), 2))
In [70]: D.A
Out[70]:
array([[ 1.,  0.,  0.,  0.,  0.,  0.],
       [-2.,  1.,  0.,  0.,  0.,  0.],
       [ 1., -2.,  1.,  0.,  0.,  0.],
       [ 0.,  1., -2.,  1.,  0.,  0.],
       [ 0.,  0.,  1., -2.,  1.,  0.],
       [ 0.,  0.,  0.,  1., -2.,  1.],
       [ 0.,  0.,  0.,  0.,  1., -2.],
       [ 0.,  0.,  0.,  0.,  0.,  1.]])
New version:
In [71]: diag = np.ones(L - 2)
In [72]: D = sparse.spdiags([diag, -2*diag, diag], [0, -1, -2], L, L-2)
In [73]: D.A
Out[73]:
array([[ 1.,  0.,  0.,  0.,  0.,  0.],
       [-2.,  1.,  0.,  0.,  0.,  0.],
       [ 1., -2.,  1.,  0.,  0.,  0.],
       [ 0.,  1., -2.,  1.,  0.,  0.],
       [ 0.,  0.,  1., -2.,  1.,  0.],
       [ 0.,  0.,  0.,  1., -2.,  1.],
       [ 0.,  0.,  0.,  0.,  1., -2.],
       [ 0.,  0.,  0.,  0.,  0.,  1.]])
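Putting the fix back into the function from the question gives something like the sketch below (same imports as above; the only changed lines are the construction of D):
def baseline_als(y, lam, p, niter):
    L = len(y)
    diag = np.ones(L - 2)
    # sparse second-difference matrix, built directly in sparse form and converted to CSC
    D = sparse.spdiags([diag, -2*diag, diag], [0, -1, -2], L, L - 2).tocsc()
    w = np.ones(L)
    for i in range(niter):
        W = sparse.spdiags(w, 0, L, L)
        Z = W + lam * D.dot(D.transpose())
        z = spsolve(Z, w*y)
        w = p * (y > z) + (1-p) * (y < z)
    return z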