Hello, this is a code block that one-hot encodes a DNA sequence. The problem is that for 'n' it sets 1 in all 4 positions along the second axis. I want to avoid using an if-else in the following code.
import numpy as np

seq = 'nnnactgactgnnnnn'
onehot = np.zeros((len(seq), 4))
mapper = {'a': 0, 'c': 1, 'g': 2, 't': 3, 'n': None}
for i in range(len(seq)):
    onehot[i][mapper[seq[i]]] = 1
output:
array([[1., 1., 1., 1.],
[1., 1., 1., 1.],
[1., 1., 1., 1.],
[1., 0., 0., 0.],
[0., 1., 0., 0.],
[0., 0., 0., 1.],
[0., 0., 1., 0.],
[1., 0., 0., 0.],
[0., 1., 0., 0.],
[0., 0., 0., 1.],
[0., 0., 1., 0.],
[1., 1., 1., 1.],
[1., 1., 1., 1.],
[1., 1., 1., 1.],
[1., 1., 1., 1.],
[1., 1., 1., 1.]])
How can I assign 0 for 'n' while still using the mapper dict?
tl;dr: indexing with None selects all positions in a row. How do I avoid that?
This happens because None acts as np.newaxis, so onehot[i][None] is a (1, 4) view of the whole row and the assignment broadcasts 1 across all four positions. You could use:
mapper = {'a':[1,0,0,0],'c':[0,1,0,0],'g':[0,0,1,0],'t':[0,0,0,1],'n':[0,0,0,0]}
Then build each row by looking up the corresponding encoding for each character in the sequence, as sketched below.
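For example, a minimal sketch along those lines (reusing seq from the question; the dict values become the rows directly):
import numpy as np

seq = 'nnnactgactgnnnnn'
mapper = {'a': [1, 0, 0, 0], 'c': [0, 1, 0, 0], 'g': [0, 0, 1, 0],
          't': [0, 0, 0, 1], 'n': [0, 0, 0, 0]}
# one lookup per character; 'n' becomes an all-zero row, no if-else needed
onehot = np.array([mapper[base] for base in seq], dtype=float)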
Edit: The version below might be faster.
seq = 'nnnactgactgnnnnn'
onehot = np.zeros((len(seq), 4))
mapper = {'a':0,'c':1,'g':2,'t':3,'n':None}
result = {'a':1,'c':1,'g':1,'t':1,'n':0}
for i in range(len(seq)):
    onehot[i][mapper[seq[i]]] = result[seq[i]]
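With that change, rows for 'n' stay all-zero, since onehot[i][None] = 0 just writes zeros over an already-zero row. A quick sanity check (the printed values below are what I would expect, not quoted from the answer):
print(onehot[:4])
# [[0. 0. 0. 0.]
#  [0. 0. 0. 0.]
#  [0. 0. 0. 0.]
#  [1. 0. 0. 0.]]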
Hi, I have the following code. Is there any way to replace the for loop with a single line of NumPy code?
import numpy as np

x = 10
y = 5
z = 2
b = np.zeros((x, y))
a = np.random.choice(np.arange(y), size=(x, z))
for i in range(len(a)):
    b[i, a[i]] = 1
With the above code, I get b as
array([[1., 0., 1., 0., 0.],
[0., 1., 0., 1., 0.],
[0., 0., 0., 1., 1.],
[0., 0., 1., 0., 1.],
[1., 0., 0., 1., 0.],
[1., 0., 0., 0., 0.],
[0., 0., 0., 1., 0.],
[0., 1., 1., 0., 0.],
[0., 0., 1., 0., 0.],
[0., 0., 0., 1., 1.]])
I've tried b[a] = 1 instead of
for i in range(len(a)):
    b[i, a[i]] = 1
and it gives all ones in the first 5 rows.
array([[1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1.],
[1., 0., 0., 0., 0.],
[0., 0., 0., 1., 0.],
[0., 1., 1., 0., 0.],
[0., 0., 1., 0., 0.],
[0., 0., 0., 1., 1.]])
b[a] = 1 uses the values in a as row indices, so it assigns 1 to entire rows 0 through 4. You need to pair each row index with its column indices instead:
b[range(len(a)), a[:, 0]] = 1
b[range(len(a)), a[:, 1]] = 1
Or, for an arbitrary value of z, you can do it like this:
b[np.resize(range(len(a)), a.shape[0] * a.shape[1]), a.T.reshape(-1)] = 1
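For reference, a fully vectorized alternative (my own sketch, not taken from the answer above) uses advanced integer indexing with broadcasting, reusing x, y, and a from the question:
b = np.zeros((x, y))
rows = np.arange(len(a))[:, None]  # shape (x, 1), broadcasts against a's shape (x, z)
b[rows, a] = 1                     # each (row, column) pair gets set to 1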
I have an array that is grouped and looks like this:
import numpy as np
y = np.array(
[[0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0.],
[1., 1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1., 1.],
[2., 2., 2., 2., 2., 2.],
[2., 2., 2., 2., 2., 2.],
[2., 2., 2., 2., 2., 2.],
[2., 2., 2., 2., 2., 2.]]
)
n_repeats = 4
The array contains three groups, here marked as 0, 1, and 2. Every group appears n_repeats times. Here n_repeats=4. Currently I do the following to compute the mean and variance of chunks of that array:
mean = np.array([np.mean(y[i: i+n_repeats], axis=0) for i in range(0, len(y), n_repeats)])
var = np.array([np.var(y[i: i+n_repeats], axis=0) for i in range(0, len(y), n_repeats)])
Is there a better and faster way to achieve this?
Yes, reshape and then use .mean and .var along the appropriate dimension:
>>> y.reshape(-1, 4, 6)
array([[[0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0.]],
[[1., 1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1., 1.]],
[[2., 2., 2., 2., 2., 2.],
[2., 2., 2., 2., 2., 2.],
[2., 2., 2., 2., 2., 2.],
[2., 2., 2., 2., 2., 2.]]])
>>> y.reshape(-1, 4, 6).mean(axis=1)
array([[0., 0., 0., 0., 0., 0.],
[1., 1., 1., 1., 1., 1.],
[2., 2., 2., 2., 2., 2.]])
>>> y.reshape(-1, 4, 6).var(axis=1)
array([[0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0.]])
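A slightly more general form of the same reshape idea, written against the question's y and n_repeats so the group size is not hard-coded (my own sketch, not the answerer's wording):
n_groups = len(y) // n_repeats
grouped = y.reshape(n_groups, n_repeats, y.shape[1])
mean = grouped.mean(axis=1)  # shape (n_groups, 6)
var = grouped.var(axis=1)    # shape (n_groups, 6)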
In case you do not know how many groups there are, or the number of repeats, you can try:
>>> np.vstack([y[y == i].reshape(-1,y.shape[1]).mean(axis=0) for i in np.unique(y)])
array([[0., 0., 0., 0., 0., 0.],
[1., 1., 1., 1., 1., 1.],
[2., 2., 2., 2., 2., 2.]])
>>> np.vstack([y[y == i].reshape(-1,y.shape[1]).var(axis=0) for i in np.unique(y)])
array([[0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0.]])
Given a 3D tensor, say:
batch x sentence length x embedding dim
import torch

a = torch.rand((10, 1000, 96))
and an array(or tensor) of actual lengths for each sentence
lengths = torch.randint(1000, (10,))
which outputs tensor([ 370., 502., 652., 859., 545., 964., 566., 576., 1000., 803.])
How do I fill tensor 'a' with zeros after a certain index along dimension 1 (sentence length), according to tensor 'lengths'?
I want something like this:
a[:, lengths:, :] = 0
One way of doing it (slow if the batch size is big enough):
for i_batch in range(10):
    a[i_batch, lengths[i_batch]:, :] = 0
You can do it using a binary mask.
Using lengths as column indices into mask, we mark where each sequence ends (note that mask is made one column longer than a.size(1) to allow for full-length sequences).
Using cumsum(), we set all entries in mask after the sequence length to 1.
mask = torch.zeros(a.shape[0], a.shape[1] + 1, dtype=a.dtype, device=a.device)
mask[(torch.arange(a.shape[0]), lengths)] = 1
mask = mask.cumsum(dim=1)[:, :-1] # remove the superfluous column
a = a * (1. - mask[..., None])  # use the inverted mask to zero out entries past each sequence length
For a.shape = (10, 5, 96) and lengths = [1, 2, 1, 1, 3, 0, 4, 4, 1, 3], after assigning 1 at the respective length in each row, mask looks like:
mask =
tensor([[0., 1., 0., 0., 0., 0.],
[0., 0., 1., 0., 0., 0.],
[0., 1., 0., 0., 0., 0.],
[0., 1., 0., 0., 0., 0.],
[0., 0., 0., 1., 0., 0.],
[1., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 1., 0.],
[0., 0., 0., 0., 1., 0.],
[0., 1., 0., 0., 0., 0.],
[0., 0., 0., 1., 0., 0.]])
After cumsum() (and dropping the superfluous last column) you get
mask =
tensor([[0., 1., 1., 1., 1.],
[0., 0., 1., 1., 1.],
[0., 1., 1., 1., 1.],
[0., 1., 1., 1., 1.],
[0., 0., 0., 1., 1.],
[1., 1., 1., 1., 1.],
[0., 0., 0., 0., 1.],
[0., 0., 0., 0., 1.],
[0., 1., 1., 1., 1.],
[0., 0., 0., 1., 1.]])
Note that it has zeros exactly where the valid sequence entries are and ones beyond the sequence lengths. Multiplying a by 1 - mask therefore gives you exactly what you want.
Enjoy ;)
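For reference, a compact alternative sketch (my own, not part of the answer above) builds an equivalent padding mask with a broadcasted comparison instead of cumsum(), assuming the same kind of a and lengths as in the question:
import torch

a = torch.rand((10, 1000, 96))
lengths = torch.randint(1, 1001, (10,))  # hypothetical lengths in 1..1000

# positions >= length count as padding: compare a (1, 1000) position range
# against the (10, 1) column of lengths to get a (10, 1000) boolean mask
pad_mask = torch.arange(a.shape[1])[None, :] >= lengths[:, None]
a = a.masked_fill(pad_mask[..., None], 0.0)  # zero everything past each length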
I have two 500x500 images and need to merge them by combining their channels.
When I used NumPy's concatenate function, for instance, the returned output becomes 500x1000, and I'm not sure whether the color channels were combined at all.
The output I'm looking for when merging two color 500x500 images would be 500x500x6.
How can I perform that in Python?
Thanks.
A couple of options, depending on whether you want the two RGB images kept separate or stuck together along the channel axis:
np.stack([np.zeros((2,2,3)), np.ones((2,2,3))], axis=2)
Out[157]:
array([[[[ 0., 0., 0.],
[ 1., 1., 1.]],
[[ 0., 0., 0.],
[ 1., 1., 1.]]],
[[[ 0., 0., 0.],
[ 1., 1., 1.]],
[[ 0., 0., 0.],
[ 1., 1., 1.]]]])
np.concatenate([np.zeros((2,2,3)), np.ones((2,2,3))], axis=2)
Out[158]:
array([[[ 0., 0., 0., 1., 1., 1.],
[ 0., 0., 0., 1., 1., 1.]],
[[ 0., 0., 0., 1., 1., 1.],
[ 0., 0., 0., 1., 1., 1.]]])
To recover each original image from the above:
two_img =np.stack([np.zeros((2,2,3)), np.ones((2,2,3))], axis=2)
two_img[...,0,:]
Out[160]:
array([[[ 0., 0., 0.],
[ 0., 0., 0.]],
[[ 0., 0., 0.],
[ 0., 0., 0.]]])
two_img[...,1,:]
Out[161]:
array([[[ 1., 1., 1.],
[ 1., 1., 1.]],
[[ 1., 1., 1.],
[ 1., 1., 1.]]])
too_img = np.concatenate([np.zeros((2,2,3)), np.ones((2,2,3))], axis=2)
too_img[...,0:3]
Out[163]:
array([[[ 0., 0., 0.],
[ 0., 0., 0.]],
[[ 0., 0., 0.],
[ 0., 0., 0.]]])
too_img[...,3:]
Out[164]:
array([[[ 1., 1., 1.],
[ 1., 1., 1.]],
[[ 1., 1., 1.],
[ 1., 1., 1.]]])
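Applied to the question's two color 500x500 images (img1 and img2 here are hypothetical placeholders for the actual image arrays), the concatenate option gives the requested 500x500x6 result:
import numpy as np

img1 = np.zeros((500, 500, 3))  # stand-in for the first 500x500x3 image
img2 = np.ones((500, 500, 3))   # stand-in for the second 500x500x3 image

merged = np.concatenate([img1, img2], axis=-1)
print(merged.shape)  # (500, 500, 6)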