using numpy to randomly distribute DNA sequence reads over genomic features - python

Hi, I have written a script that randomly shuffles read sequences over the gene they were mapped to.
This is useful if you want to determine whether a peak that you observe over your gene of interest is statistically significant. I use this code to calculate false discovery rates for peaks in my gene of interest.
Below is the code:
import numpy as np
import matplotlib.pyplot as plt
iterations = 1000 # number of times a read needs to be shuffled
featurelength = 1000 # length of the gene
a = np.zeros((iterations,featurelength)) # create a matrix with 1000 rows of the feature length
b = np.arange(iterations) # a matrix with the number of iterations (0-999)
reads = np.random.randint(10,50,1000) # a random dataset containing an array of DNA read lengths
Below is the code to fill the large matrix (a):
for i in reads: # for read with read length i
    r = np.random.randint(-i,featurelength-1,iterations) # generate random read start positions for the read i
    for j in b: # for each row in a:
        pos = r[j] # get the random start position for that row
        if pos < 0: # start position can be negative because a read does not have to completely overlap with the feature
            a[j][:pos+i] += 1
        else:
            a[j][pos:pos+i] += 1 # add the read to the array and repeat
Then generate a heat map to see if the distribution is roughly even:
plt.imshow(a)
plt.show()
This generates the desired result but it is very slow because of the many for loops.
I tried fancy numpy indexing, but I constantly get a "too many indices" error.
Does anybody have a better idea of how to do this?

Fancy indexing is a bit tricky, but still possible:
for i in reads:
    r = np.random.randint(-i,featurelength-1,iterations)
    idx = np.clip(np.arange(i)[:,None]+r, 0, featurelength-1)
    a[b,idx] += 1
To deconstruct this a bit, we're:
- Creating a simple index array as a column vector, from 0 to i: np.arange(i)[:,None]
- Adding each element from r (a row vector), which broadcasts to make a matrix of size (i, iterations) with the correct offsets into the columns of a.
- Clamping the indices to the range [0, featurelength), via np.clip.
- Finally, fancy-indexing a for each row (b) and the relevant columns (idx).
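To make the broadcasting concrete, here is a tiny worked sketch with a hypothetical read length of 3, four iterations and featurelength = 10 (the start positions in r are made up):
import numpy as np

featurelength = 10
i = 3                          # hypothetical read length
r = np.array([-2, 0, 5, 8])    # one made-up start position per row (4 iterations)
idx = np.clip(np.arange(i)[:, None] + r, 0, featurelength-1)
print(idx)
# [[0 0 5 8]
#  [0 1 6 9]
#  [0 2 7 9]]
Column j holds the (clipped) positions covered by a read of length 3 starting at r[j], so a[b,idx] += 1 bumps exactly those positions in row j of a.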

TypeError: all the input array dimensions for the concatenation axis must match exactly

TypeError: all the input array dimensions for the concatenation axis must match exactly, but along dimension 0, the array at index 0 has size 101 and the array at index 1 has size 2
Hi, I'm super new to coding. I am trying to stack two arrays with numpy.column_stack; one of them is composed after a for loop and has two columns, and the other only has one. I've been trying to figure out how to make this error go away and nothing I've tried has worked. Please help.
import numpy as np
import math
import matplotlib.pyplot as plt
def fb2(t,J):
    x=J[0]
    y=J[1]
    dxdt=0.25*y-x
    dydt=3*x-y
    #X0,Y0=1,1 initial conditions
    return([dxdt,dydt])

def odeRK4(function,tspan,R,h,*args):
    #R is the vector of initial conditions
    x0=R[0]
    y0=R[1]
    #writing a statement for what to do if h isn't given / is too large
    if h==None:
        h=.01*(tspan[1]-tspan[0])
    elif h>tspan[1]-tspan[0]:
        h=.01*(tspan[1]-tspan[0])
    else:
        h=h
    #defining the 2-element array (i hope)
    #pretty sure tspan is the range of t values
    x0=tspan[0] #probably 0 if this is meant for time
    xn=tspan[1] #whatever time we want it to end at?
    #xn is the final x value (t), x0 is the initial one
    t_values=np.arange(x0,xn+h,h) #range of values based on increments of h
    N=len(t_values)
    y_val=np.zeros(N)
    y_val[0]=y0
    if function==fb2:
        y_val=np.zeros([2,N]) #makes 2 vectors; this makes it possible for : to apply
        y_val[:,0]=y0
    for i in range(1,N):
        #rk4 method
        k1=function(t_values[i-1],y_val[:,i-1],*args)
        #0.5*k1*h
        a=np.multiply(k1,h) #this multiplies each part of k1 by h; below it is multiplied again by 0.5
        k2=function(t_values[i-1]+0.5*h,y_val[:,i-1]+np.multiply(a,.5),*args)
        b=np.multiply(k2,h)
        k3=function(t_values[i-1]+0.5*h,y_val[:,i-1]+np.multiply(b,0.5),*args)
        k4=function(t_values[i-1]+h,y_val[:,i-1]+np.multiply(k3,h),*args)
        c=(k1+np.multiply(k2,2)+np.multiply(k3,2)+k4)
        d=np.multiply(c,1/6)
        y_val[:,i]=y_val[:,i-1]+np.multiply(d,h)
        #this fills the t_val array and keeps the loop going
    a=np.column_stack((t_values,y_val)) #this is the line that raises the error
    print('At time t, Y= (t on left, Y on right)')
    print(a)
    plt.plot(t_values,y_val)

print('for 3B:')
odeRK4(fb2,[0,20],[1,1],None)
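For what it's worth, one way to resolve the mismatch (my own suggestion, not part of the original post): column_stack lines arrays up along their first dimension, and here t_values has shape (N,) while y_val has shape (2, N), so stacking the transpose of y_val gives an (N, 3) table of t, x(t) and y(t):
# y_val has shape (2, N) while t_values has shape (N,), so stack y_val.T instead
a = np.column_stack((t_values, y_val.T))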

Generate random binary matrix constrained to no null row

I want to generate a random binary matrix, so I'm using W = np.random.binomial(1, p, (n,n)).
It works fine, but I want to add the constraint that no row consists only of 0s.
I created the following function:
def random_matrix(p,n):
    m = 0
    while m == 0:
        W = np.random.binomial(1, p, (n,n))
        m = min(W.sum(axis=1))
    return W
It also works fine, but it seems too inefficient to me. Is there a faster way to enforce this constraint?
When the matrix is large, regenerating the entire matrix just because a few rows are full of zeros is not efficient. It should be statistically safe to regenerate only the offending rows. Here is an example:
def random_matrix(p,n):
    W = np.random.binomial(1, p, (n,n))
    while True:
        null_rows = np.where(W.sum(axis=1) == 0)[0]
        # If there is no null row, then m>0, so we stop the replacement
        if null_rows.size == 0:
            break
        # Replace only the null rows
        W[null_rows] = np.random.binomial(1, p, (null_rows.shape[0], n))
    return W
Even faster solutions
There is an even more efficient approach when p is close to 0 (when p is close to 1, the above function is already fast). A binomial random variable restricted to 0-1 values is a Bernoulli random variable, and the sum of n independent Bernoulli(p) values is itself binomial. So you can draw the row sums directly with S = np.random.binomial(n, p, n), apply the same replacement trick to remove any zero sums, and then build the final matrix by placing S[i] ones in the ith row and using np.random.shuffle to randomize the order of the 0-1 values within each row. This resolves conflicts much more efficiently than the other methods: there is no need to generate a full row just to check whether it is all zeros, so conflicts are roughly n times cheaper to resolve.
If this is not enough, you can use the uint8 dtype when generating W. Memory is slow, so generating smaller matrices is generally faster, not to mention it takes less RAM.
If this is still not enough, you can generate S item by item with the Numba JIT compiler and a basic loop. This should be faster since no temporary arrays are created except the final one. For large matrices, this algorithm can even be parallelized (every row can be generated independently). This last solution should be close to optimal.
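As a minimal sketch of that row-sum approach (the function name and loop structure here are my own, not from the answer above):
def random_matrix_fast(p, n):
    # Each row sum is Binomial(n, p), so draw the n row sums directly
    S = np.random.binomial(n, p, n)
    # Resolve conflicts on the sums alone: no full row is generated just to be rejected
    while True:
        null_rows = np.where(S == 0)[0]
        if null_rows.size == 0:
            break
        S[null_rows] = np.random.binomial(n, p, null_rows.size)
    # Place S[i] ones in the ith row, then shuffle the 0-1 values within that row
    W = np.zeros((n, n), dtype=np.uint8)
    for i in range(n):
        W[i, :S[i]] = 1
        np.random.shuffle(W[i])
    return W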

How to sample M times from N different normal distributions in python? Is there a "faster" way in terms of processing time?

I need to sample multiple (M) times from N different normal distributions. This repeated sampling will happen several thousand times in turn. I want to do this in the most efficient way possible, because I would like not to die of old age before this process ends. The code looks something like this:
import numpy as np
# bunch of stuff that is unrelated to the problem
number_of_repeated_processes = 5000
number_of_samples_per_process = 20
# the normal distributions I'm sampling from are described by 2 vectors:
#
# myMEANS <- a numpy array of length 10 containing the means of the distributions
# myVAR <- a numpy array of length 10 containing the variances of the distributions
for i in range(number_of_repeated_processes):
    # myRESULT is a list of arrays containing the results of the sampling
    myRESULT = [np.random.normal(loc=myMEANS[j], scale=myVAR[j], size=number_of_samples_per_process)
                for j in range(10)]
    #
    # here do something with myRESULT
# end for loop
The question is... is there a better way to obtain the myRESULT matrix?
np.random.normal accepts arrays for the mean and scale directly, and you can choose a size that covers all the sampling in one run, without loops:
myRESULT = np.random.normal(loc=myMEANS, scale=myVAR, size = (number_of_samples_per_process, number_of_repeated_processes,myMEANS.size))
This will return a number_of_samples_per_process by number_of_repeated_processes array for each mean-var pair in your myMEANS-myVAR arrays. For example, to access the samples for myMEANS[i]-myVAR[i], use myRESULT[...,i]. This should boost your performance somewhat.
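As a small, self-contained illustration of the shapes involved (the concrete numbers below are made up):
import numpy as np

myMEANS = np.array([0.0, 5.0, 10.0])
myVAR = np.array([1.0, 0.5, 2.0])
number_of_repeated_processes = 4
number_of_samples_per_process = 20

myRESULT = np.random.normal(loc=myMEANS, scale=myVAR,
                            size=(number_of_samples_per_process,
                                  number_of_repeated_processes,
                                  myMEANS.size))

print(myRESULT.shape)          # (20, 4, 3)
print(myRESULT[..., 1].shape)  # samples drawn with myMEANS[1]/myVAR[1]: (20, 4)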

Protein Mutual Information

I'm trying to find the mutual information (MI) between columns of a multiple sequence alignment (MSA).
The math behind it is fine for me; I just don't know how to implement it in Python, at least not in a fast way.
How should I compute the overall frequencies P(i;x), P(j;y) and P(ij;xy)? The Px and Py frequencies are easy to calculate (a hash could deal with them), but what about P(ij;xy)?
So my real question is: how should I calculate the probability Pxy for a given pair of columns i and j?
Please note that MI can be defined as:
MI(i,j) = Sum(x->n) Sum(y->m) P(ij,xy) * log(P(ij,xy) / (P(i,x)*P(j,y)))
in which i and j are amino acid positions (columns), and x and y are the different amino acids found in a given column i or j.
Thanks,
EDIT
My input data looks like a df:
A = [
    ['M','T','S','K','L','G','-','-','S','L','K','P'],
    ['M','A','A','S','L','A','-','A','S','L','P','E'],
    ...,
    ['M','T','S','K','L','G','A','A','S','L','P','E'],
]
So indeed it is truly easy to compute the frequency of any amino acid at a given position, for example:
P(M) at position 1: 1
P(T) at position 2: 2/3
P(A) at position 2: 1/3
P(S) at position 3: 2/3
P(A) at position 3: 1/3
How should I proceed to get, for example, the probability of a T at position 2 and an S at position 3 at the same time? In this example it is 2/3.
So P(ij, xy) means the probability (or frequency) of an amino acid x in column i occurring at the same time as an amino acid y in column j.
PS: for a simpler explanation of MI please refer to this link: mistic.leloir.org.ar/docs/help.html (thanks to Aaron)
I am not 100% sure this is correct (e.g., how is '-' supposed to be handled?). I assume that the sum is over all pairs for which the frequencies in the numerator and denominator inside the log are all nonzero, and furthermore I assumed that it should be the natural log:
from math import log
from collections import Counter

def MI(sequences,i,j):
    Pi = Counter(sequence[i] for sequence in sequences)
    Pj = Counter(sequence[j] for sequence in sequences)
    Pij = Counter((sequence[i],sequence[j]) for sequence in sequences)
    return sum(Pij[(x,y)]*log(Pij[(x,y)]/(Pi[x]*Pj[y])) for x,y in Pij)
The code works by using 3 Counter objects to get the relevant counts, and then returning a sum which is a straightforward translation of the formula.
If this isn't correct, it would be helpful if you edit your question so that it has some expected output to test against.
On Edit: here is a version which doesn't treat '-' as just another amino acid but instead filters out sequences in which it appears in either of the two columns, interpreting those sequences as ones for which the requisite information is not available:
def MI(sequences,i,j):
    sequences = [s for s in sequences if '-' not in (s[i], s[j])]
    Pi = Counter(s[i] for s in sequences)
    Pj = Counter(s[j] for s in sequences)
    Pij = Counter((s[i],s[j]) for s in sequences)
    return sum(Pij[(x,y)]*log(Pij[(x,y)]/(Pi[x]*Pj[y])) for x,y in Pij)
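For illustration, calling this on the three example sequences from the question (the question's positions 2 and 3 are the 0-based column indices 1 and 2):
A = [
    ['M','T','S','K','L','G','-','-','S','L','K','P'],
    ['M','A','A','S','L','A','-','A','S','L','P','E'],
    ['M','T','S','K','L','G','A','A','S','L','P','E'],
]
print(MI(A, 1, 2))  # mutual information between the second and third columns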
Here's a place to get started... read the comments
import numpy as np
A = [  # you'll need to pad the end of your strings so that they're all the
       # same length for this to play nice with numpy
    "MTSKLG--SLKP",
    "MAASLA-ASLPE",
    "MTSKLGAASLPE"]
# create an array of bytes (np.fromstring is deprecated, so encode and use frombuffer)
B = np.array([np.frombuffer(a.encode(), dtype=np.uint8) for a in A])
# create a search string to do bytewise xoring; same length as B.shape[1]
search_string = "-TS---------" # P of T at pos 1 and S at pos 2
#               "M-----------" # P of M at pos 0
# take the ord of each char in the string
search_ord = np.frombuffer(search_string.encode(), dtype=np.uint8)
#locate positions not compared
search_mask = search_ord != ord('-')
#xor with search_ord. 0 indicates letter in that position matches
#multiply with search_mask to force uninteresting positions to 0
#any remaining arrays that are all 0 are a match. ("any()" taken along axis 1)
#this prints [False, True, False]. take the sum to get the number of non-matches
print(((B^search_ord) * search_mask).any(1))
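As a small extension (not part of the original answer), the mask can be turned into the joint frequency the question asks about:
mismatches = ((B ^ search_ord) * search_mask).any(1)
matches = (~mismatches).sum()     # 2 sequences have T at pos 1 and S at pos 2
frequency = matches / B.shape[0]  # 2/3, i.e. the P(ij, xy) from the question
print(matches, frequency)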

How to use an additive assignment with list based indexing in Numpy [duplicate]

This question already has answers here: Handling of duplicate indices in NumPy assignments (5 answers). Closed 8 years ago.
I am currently trying to vectorize some code I had written using a large for loop in Python. The vectorized code is as follows:
rho[pi,pj] += (rho_coeff*dt)*i_frac*j_frac
rho[pi+1,pj] += (rho_coeff*dt)*ip1_frac*j_frac
rho[pi,pj+1] += (rho_coeff*dt)*i_frac*jp1_frac
rho[pi+1,pj+1] += (rho_coeff*dt)*ip1_frac*jp1_frac
Each of pi, pj, dt, i_frac, j_frac, ip1_frac and jp1_frac is a one-dimensional numpy array, and all are of the same length. rho is a two-dimensional numpy array. pi and pj make up a list of coordinates (pi, pj) indicating which elements of the matrix rho are modified. The modification involves adding the (rho_coeff*dt)*i_frac*j_frac term to the (pi, pj) element, as well as adding similar terms to the neighbouring elements (pi+1, pj), (pi, pj+1) and (pi+1, pj+1). Each coordinate in the list (pi, pj) has a unique dt, i_frac, j_frac, ip1_frac and jp1_frac associated with it.
The problem is that the list can (and always will) have repeating coordinates. So instead of successively adding to rho each time the same coordinate is encountered in the list, only the term corresponding to the last occurrence of that coordinate gets added. This problem is described briefly, with an example, in the Tentative NumPy Tutorial under fancy indexing with arrays of indices (see the last three examples before boolean indexing). Unfortunately they do not provide a solution.
Is there a way of doing this operation without resorting to a for loop? I am trying to optimize for performance and want to do away with the loop if possible.
FYI: this code forms part of a 2D particle-tracking algorithm, where the charge from each particle is added to the four nodes of the mesh surrounding the particle's position, based on volume fractions.
You are going to have to figure out the repeated items and add them together before updating your array. The following code shows a way of doing that for your first update:
import numpy as np

rows, cols = 100, 100
items = 1000
rho = np.zeros((rows, cols))
rho_coeff, dt, i_frac, j_frac = np.random.rand(4, items)
pi = np.random.randint(1, rows-1, size=(items,))
pj = np.random.randint(1, cols-1, size=(items,))

# The following code assumes pi and pj have the same dtype
pij = np.column_stack((pi, pj)).view((np.void,
                                      2*pi.dtype.itemsize)).ravel()
unique_coords, indices = np.unique(pij, return_inverse=True)
unique_coords = unique_coords.view(pi.dtype).reshape(-1, 2)

data = rho_coeff*dt*i_frac*j_frac
binned_data = np.bincount(indices, weights=data)
rho[tuple(unique_coords.T)] += binned_data
I think you can reuse all of the unique coordinate finding above for the other updates, so the following would work:
ip1_frac, jp1_frac = np.random.rand(2, items)

# (pi+1, pj) update
unique_coords[:, 0] += 1
data = rho_coeff*dt*ip1_frac*j_frac
binned_data = np.bincount(indices, weights=data)
rho[tuple(unique_coords.T)] += binned_data

# (pi+1, pj+1) update
unique_coords[:, 1] += 1
data = rho_coeff*dt*ip1_frac*jp1_frac
binned_data = np.bincount(indices, weights=data)
rho[tuple(unique_coords.T)] += binned_data

# (pi, pj+1) update
unique_coords[:, 0] -= 1
data = rho_coeff*dt*i_frac*jp1_frac
binned_data = np.bincount(indices, weights=data)
rho[tuple(unique_coords.T)] += binned_data
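For reference, NumPy 1.8+ also provides np.add.at, which performs unbuffered in-place addition and therefore accumulates over repeated indices; a sketch of how the same four updates could be written that way (not part of the original answer):
import numpy as np

rows, cols, items = 100, 100, 1000
rho = np.zeros((rows, cols))
rho_coeff, dt, i_frac, j_frac, ip1_frac, jp1_frac = np.random.rand(6, items)
pi = np.random.randint(1, rows-1, size=items)
pj = np.random.randint(1, cols-1, size=items)

# Repeated (pi, pj) coordinates each contribute their own term instead of overwriting
np.add.at(rho, (pi, pj), (rho_coeff*dt)*i_frac*j_frac)
np.add.at(rho, (pi+1, pj), (rho_coeff*dt)*ip1_frac*j_frac)
np.add.at(rho, (pi, pj+1), (rho_coeff*dt)*i_frac*jp1_frac)
np.add.at(rho, (pi+1, pj+1), (rho_coeff*dt)*ip1_frac*jp1_frac)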
