I have two strings of DNA sequences and I want to compare both sequences, character by character, in order to get a matrix of comparison values. The general idea comes down to three essential rules:
If the pair is the complement A/T (A in one sequence and T in the other), the score is 2/3.
If the pair is the complement C/G (C in one sequence and G in the other), the score is 1.
Otherwise, the score is 0.
For example, if both sequences are ACTG, the result would be:
  |  A  |  C  |  T  |  G
A |  0  |  0  | 2/3 |  0
C |  0  |  0  |  0  |  1
T | 2/3 |  0  |  0  |  0
G |  0  |  1  |  0  |  0
I saw there is some help in the post Calculating a similarity/difference matrix from equal length strings in Python, and it really works, but only as long as the sequence is 4 nucleotides long.
When I tried a larger sequence, this error was printed:
ValueError: shapes (5,4) and (5,4) not aligned: 4 (dim 1) != 5 (dim 0)
I have the code in R, which is:
## 2.1 Split the strings
seq <- "ACTG"
seq1 <- strsplit(seq, "")[[1]]       # vector of single characters
n <- nchar(seq)
a <- matrix(ncol = n, nrow = n)
a[, 1] <- seq1                       # sequence down the first column
a[1, ] <- seq1                       # sequence along the first row
b <- matrix(ncol = n, nrow = n)
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    if (a[i, 1] == "A" & a[1, j] == "T" | a[i, 1] == "T" & a[1, j] == "A") {
      b[i, j] <- 2/3
    } else if (a[i, 1] == "C" & a[1, j] == "G" | a[i, 1] == "G" & a[1, j] == "C") {
      b[i, j] <- 1
    } else {
      b[i, j] <- 0
    }
  }
}
But I can't get this code to work in Python.
I think you're making it harder than it needs to be.
import numpy as np

seq1 = 'AACCTTGG'
seq2 = 'ACGTACGT'

matrix = np.zeros((len(seq1), len(seq2)))
for y, c2 in enumerate(seq2):
    for x, c1 in enumerate(seq1):
        if c1 + c2 in ('AT', 'TA'):    # A/T complement scores 2/3
            matrix[x, y] = 2/3
        elif c1 + c2 in ('CG', 'GC'):  # C/G complement scores 1
            matrix[x, y] = 1.0
print(matrix)
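If the loops become a bottleneck for long sequences, the same matrix can be built without an explicit Python loop. This is only a sketch of one possible alternative, not part of the original answer: encode the bases as integers and let NumPy's fancy indexing build the whole grid in one shot.
import numpy as np

base_index = {'A': 0, 'C': 1, 'G': 2, 'T': 3}
score = np.zeros((4, 4))
score[base_index['A'], base_index['T']] = score[base_index['T'], base_index['A']] = 2/3
score[base_index['C'], base_index['G']] = score[base_index['G'], base_index['C']] = 1.0

seq1 = 'AACCTTGG'
seq2 = 'ACGTACGT'
i1 = np.array([base_index[c] for c in seq1])
i2 = np.array([base_index[c] for c in seq2])
matrix = score[i1[:, None], i2[None, :]]   # broadcasting builds the len(seq1) x len(seq2) grid
print(matrix)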
I have this query:
MyTable.objects.filter(date=date).exclude(starthour__range=(start, end), endhour__range=(start, end))
But I want to exclude the records where starthour__range=(start, end) AND endhour__range=(start, end), not OR. I think in this case OR is used.
Could you help me please?
Thank you very much!
This is a consequence of De Morgan's law [wiki], which specifies that ¬(x ∧ y) is ¬x ∨ ¬y. This means that the negation of "x and y" is "not x, or not y". Indeed, if we take a look at the truth table:
 x | y | x ∧ y | ¬x | ¬y | ¬(x ∧ y) | ¬x ∨ ¬y
---+---+-------+----+----+----------+---------
 0 | 0 |   0   |  1 |  1 |    1     |    1
 0 | 1 |   0   |  1 |  0 |    1     |    1
 1 | 0 |   0   |  0 |  1 |    1     |    1
 1 | 1 |   1   |  0 |  0 |    0     |    0
So excluding items where both the starthour is in the (start, end) range and the endhour is in the (start, end) range is logically equivalent to allowing items where the starthour is not in the range, or the endhour is not in the range.
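The equivalence is small enough to verify exhaustively; here is a quick sanity check (my own snippet, not part of the original answer):
from itertools import product

# check not(x and y) == (not x) or (not y) for every boolean combination
for x, y in product([False, True], repeat=2):
    assert (not (x and y)) == ((not x) or (not y))
print('De Morgan holds for all inputs')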
Using and logic
You can thus make a disjunction in the .exclude(…) call to filter out items that satisfy one of the two conditions, or thus retain objects that do not satisfy any of the two conditions:
from django.db.models import Q

MyTable.objects.filter(date=date).exclude(
    Q(starthour__range=(start, end)) | Q(endhour__range=(start, end))
)
Overlap logic
Based on your query, however, you are looking for overlap, not for such range checks. Validating the starthour and endhour alone is not sufficient to decide whether two ranges overlap. Indeed, imagine that an event starts at 08:00 and ends at 18:00, and you filter for a range from 09:00 to 17:00: neither the starthour nor the endhour is in the range, but the events still overlap.
Two ranges [s1, e1] and [s2, e2] do not overlap if s1 ≥ e2 or s2 ≥ e1. The negation, i.e. the condition under which the two overlap, is thus s1 < e2 and s2 < e1. We can therefore exclude the items that overlap with:
# retain records that do not overlap with [start, end]
MyTable.objects.filter(date=date).exclude(
    starthour__lt=end, endhour__gt=start
)
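To see why this is the right condition, here is a minimal illustration in plain Python (hours as plain numbers; my own sketch, not part of the original answer):
def overlaps(s1, e1, s2, e2):
    # two ranges [s1, e1] and [s2, e2] overlap iff s1 < e2 and s2 < e1
    return s1 < e2 and s2 < e1

# the 08:00-18:00 event overlaps the 09:00-17:00 window even though
# neither of its endpoints lies inside the window
print(overlaps(8, 18, 9, 17))   # True
print(overlaps(8, 9, 9, 17))    # False: they merely touch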
I faced the problem of quickly finding the nearest neighbors in a given range.
Example of dataset:
id  | string | float
----+--------+------
0   | AA     | 0.1
12  | BB     | 0.5
2   | CC     | 0.3
102 | AA     | 1.1
33  | AA     | 2.8
17  | AA     | 0.5
For each row, print the number of rows satisfying the following conditions:
the string field is equal to the current row's string;
the float field is below the current float, but within del of it: current float − del ≤ float < current float.
For this example, with del = 1.5:
id  | count
----+------
0   | 0
12  | 0
2   | 0
102 | 2   (rows id=0, 33, 17 have the same string, but only id=0 and id=17 qualify: 1.1 − 1.5 ≤ 0.1 < 1.1 and 1.1 − 1.5 ≤ 0.5 < 1.1)
33  | 0   (rows id=0, 102, 17 have the same string, but 2.8 − 1.5 is greater than 0.1, 1.1 and 0.5)
17  | 1
To solve this problem I used a BallTree with a custom metric, but it runs for a very long time on large datasets because the tree traversal has to call back into the custom Python metric.
Can someone suggest other solutions, or a way to bring a custom metric up to the speed of the built-in metrics from sklearn.neighbors.DistanceMetric?
My code:
from sklearn.neighbors import BallTree

def distance(x, y):
    # same string and x's float is larger: the (positive) difference
    if x[0] == y[0] and x[1] > y[1]:
        return x[1] - y[1]
    else:
        return x[1] + y[1]

# note: `del` is a reserved word in Python, so the radius is named delta here
tree2 = BallTree(X, leaf_size=X.shape[0], metric=distance)
mas = tree2.query_radius(X, r=delta, count_only=True)
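One way to avoid the per-pair metric entirely (a sketch of my own, under the assumption that the counting rule matches the worked example above): group by the string column, sort the floats once per group, and binary-search both ends of the range. This replaces the tree queries with O(n log n) work overall.
import numpy as np
import pandas as pd

# hypothetical data in the question's format
df = pd.DataFrame({
    'id': [0, 12, 2, 102, 33, 17],
    'string': ['AA', 'BB', 'CC', 'AA', 'AA', 'AA'],
    'float': [0.1, 0.5, 0.3, 1.1, 2.8, 0.5],
})
delta = 1.5

def count_in_range(group, delta):
    # sort once per group, then count values in [v - delta, v) for each v
    vals = np.sort(group['float'].to_numpy())
    lo = np.searchsorted(vals, group['float'].to_numpy() - delta, side='left')
    hi = np.searchsorted(vals, group['float'].to_numpy(), side='left')
    return pd.Series(hi - lo, index=group.index)

df['count'] = df.groupby('string', group_keys=False).apply(count_in_range, delta=delta)
print(df[['id', 'count']])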
I have a dataframe df of the form
type | time | value
-----+------+------
a    | 1.2  | 1
a    | 1.3  | 3
a    | 2.1  | 4
a    | 2.3  | 6
b    | 2    | 21
b    | 3    | 3
.    | .    | .
.    | .    | .
Is there any feasible way to, for all rows, consolidate (sum) all following rows of a given type whose timestamp is less than, for example, 1 away from the first row of the run?
So for this example, the values of the second and third rows should be added to the first, and the output should be:
type | time | value
-----+------+------
a    | 1.2  | 8
a    | 2.3  | 6
b    | 2    | 21
b    | 3    | 3
.    | .    | .
.    | .    | .
Normally I would simply iterate over every row, add the value of all following rows that satisfy the constraint to the active row, and then drop the absorbed rows from the dataframe. But I'm not completely sure how to do that safely with pandas, considering that "You should never modify something you are iterating over."
But I sadly also don't see how this could be done with any operation that is applied to the whole dataframe at once.
Edit: I've found a very rough way to do it using a while loop. In every iteration it only adds the next row to those rows that don't already have a same-type row with a timestamp less than 1 before them:
def add_neighbor_columns(df):
    # columns describing the previous/next row, for vectorised comparisons
    df['nexttime'] = df['time'].shift(-1)
    df['nexttype'] = df['type'].shift(-1)
    df['lasttime'] = df['time'].shift(1)
    df['lasttype'] = df['type'].shift(1)
    df['nextvalue'] = df['value'].shift(-1)
    return df

def absorbing_rows(df):
    # rows that start a run (no same-type row within 1 before them) and
    # whose next row is of the same type and within 1 after them
    return ((df.type == df.nexttype)
            & ((df.time - df.lasttime > 1) | (df.type != df.lasttype))
            & (df.time - df.nexttime <= 1))

df = add_neighbor_columns(df)
mask = absorbing_rows(df)
while df.loc[mask, 'value'].any():
    df.loc[mask, 'value'] = df.loc[mask, 'value'] + df.loc[mask, 'nextvalue']
    df = df.loc[~mask.shift(1, fill_value=False)]   # drop the absorbed rows
    df = add_neighbor_columns(df)
    mask = absorbing_rows(df)
I would still be very interested in a faster way to do this, as this kind of loop is obviously not very efficient (especially since, for the dataframes I work with, it has to iterate a few ten thousand times).
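A single pass per group avoids the repeated whole-frame scans. This is a sketch of my own, under the assumption (taken from the example above) that the time difference is measured against the first row of each consolidated run:
import pandas as pd

# hypothetical data matching the example
df = pd.DataFrame({
    'type': ['a', 'a', 'a', 'a', 'b', 'b'],
    'time': [1.2, 1.3, 2.1, 2.3, 2.0, 3.0],
    'value': [1, 3, 4, 6, 21, 3],
})

def run_ids(times, window=1.0):
    # start a new run whenever a row is >= window after the run's first row
    ids, anchor, gid = [], None, -1
    for t in times:
        if anchor is None or t - anchor >= window:
            gid += 1
            anchor = t
        ids.append(gid)
    return ids

df = df.sort_values(['type', 'time'])
df['run'] = df.groupby('type')['time'].transform(run_ids)
out = (df.groupby(['type', 'run'], as_index=False)
         .agg(time=('time', 'first'), value=('value', 'sum'))
         .drop(columns='run'))
print(out)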
I have the following model:
from gurobipy import *

n_units = 1
n_periods = 3
n_ageclasses = 4

units = range(1, n_units + 1)
periods = range(1, n_periods + 1)
periods_plus1 = list(periods)   # list() so append() works under Python 3, where slicing a range returns a range
periods_plus1.append(max(periods_plus1) + 1)
ageclasses = range(1, n_ageclasses + 1)
nothickets = ageclasses[1:]

model = Model('MPPM')

HARVEST = model.addVars(units, periods, nothickets, vtype=GRB.INTEGER, name="HARVEST")
FOREST = model.addVars(units, periods_plus1, ageclasses, vtype=GRB.INTEGER, name="FOREST")

model.addConstrs(
    (quicksum(HARVEST[(k+1), (t+1), nothicket]
              for k in range(n_units)
              for t in range(n_periods)
              for nothicket in nothickets)
     == FOREST[unit, period + 1, 1]
     for unit in units for period in periods if period < max(periods_plus1)),
    name="A_Thicket")
I have a problem with formulating the constraint. For every unit and every period, I want to sum the HARVEST variable over the nothickets index only. Concretely, I want x_{k=1,t=1,2} + x_{k=1,t=1,3} + x_{k=1,t=1,4} and so on. This should result in only three ones per row of the constraint matrix, but with the formulation above I get 9 ones.
I tried to use a for loop outside of the sum, but this results in another problem:
for k in range(n_units):
    for t in range(n_periods):
        model.addConstrs(
            (quicksum(HARVEST[(k+1), (t+1), nothicket] for nothicket in nothickets)
             == FOREST[unit, period + 1, 1]
             for unit in units for period in periods if period < max(periods_plus1)),
            name="A_Thicket")
With this formulation I get this matrix:
[screenshot: constraint matrix]
But what I want is:
row_idx | col_idx | coeff
--------+---------+------
0       | 0       |  1
0       | 1       |  1
0       | 2       |  1
0       | 13      | -1
1       | 3       |  1
1       | 4       |  1
1       | 5       |  1
1       | 17      | -1
2       | 6       |  1
2       | 7       |  1
2       | 8       |  1
2       | 21      | -1
Can anybody please help me to reformulate this constraint?
This worked for me:
model.addConstrs((HARVEST.sum(unit, period, '*') == ...
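For completeness, the full constraint presumably looks something like the sketch below (an assumption on my part, with the right-hand side taken from the question, not the answerer's verbatim code). The '*' wildcard in tupledict.sum() sums over all nothicket values for a fixed unit and period, which yields exactly three coefficients per row:
model.addConstrs(
    (HARVEST.sum(unit, period, '*') == FOREST[unit, period + 1, 1]
     for unit in units for period in periods),
    name="A_Thicket")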
Basically I'm estimating pi using polygons. I have a loop which gives me values for n, ann and bnn before running the loop again. Here is what I have so far:
from math import sqrt

def printPiTable(an, bn, n, k):
    """Prints out a table for values n, 2n, ..., (2^k)n"""
    u = (2**k) * n
    power = 0
    t = (2**power) * n
    while t <= u:
        if power < 1:
            print(t, an, bn)
            power = power + 1
            t = (2**power) * n
        else:
            afrac = (1/2) * ((1/an) + (1/bn))
            ann = 1/afrac
            bnn = sqrt(ann * bn)
            print(t, ann, bnn)
            an = ann
            bn = bnn
            power = power + 1
            t = (2**power) * n
    return
This is what I get if I run it with these values:
>>> printPiTable(4,2*sqrt(2),4,5)
4 4 2.8284271247461903
8 3.3137084989847607 3.0614674589207187
16 3.1825978780745285 3.121445152258053
32 3.1517249074292564 3.1365484905459398
64 3.1441183852459047 3.1403311569547534
128 3.1422236299424577 3.1412772509327733
I want to find a way to print these values in a nice, neat table instead of this raw output. Any help?
Use string formatting. For example,
print('{:<4}{:>20f}{:>20f}'.format(t,ann,bnn))
produces
4               4.000000            2.828427
8               3.313708            3.061467
16              3.182598            3.121445
32              3.151725            3.136548
64              3.144118            3.140331
128             3.142224            3.141277
{:<4} is replaced by t, left-justified in a field of width 4.
{:>20f} is replaced by ann, right-justified and formatted as a float in a field of width 20.
The full story on the format string syntax is explained here.
To add column headers, just add a print statement like
print('{:<4}{:>20}{:>20}'.format('t','a','b'))
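On Python 3.6+, the same formatting can also be written with an f-string; for example, with one row of the values above:
t, ann, bnn = 8, 3.3137084989847607, 3.0614674589207187

# f-string equivalent of the format() call above
print(f'{t:<4}{ann:>20f}{bnn:>20f}')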
For fancier ascii tables, consider using a package like prettytable:
from math import sqrt

import prettytable

def printPiTable(an, bn, n, k):
    """Prints out a table for values n, 2n, ..., (2^k)n"""
    table = prettytable.PrettyTable(['t', 'a', 'b'])
    u = (2**k) * n
    power = 0
    t = (2**power) * n
    while t <= u:
        if power < 1:
            table.add_row((t, an, bn))
            power = power + 1
            t = (2**power) * n
        else:
            afrac = (1/2) * ((1/an) + (1/bn))
            ann = 1/afrac
            bnn = sqrt(ann * bn)
            table.add_row((t, ann, bnn))
            an = ann
            bn = bnn
            power = power + 1
            t = (2**power) * n
    print(table)

printPiTable(4, 2*sqrt(2), 4, 5)
yields
+-----+---------------+---------------+
| t | a | b |
+-----+---------------+---------------+
| 4 | 4 | 2.82842712475 |
| 8 | 3.31370849898 | 3.06146745892 |
| 16 | 3.18259787807 | 3.12144515226 |
| 32 | 3.15172490743 | 3.13654849055 |
| 64 | 3.14411838525 | 3.14033115695 |
| 128 | 3.14222362994 | 3.14127725093 |
+-----+---------------+---------------+
Perhaps it is overkill for this sole purpose, but Pandas can make nice tables too, and can export them in other formats, such as HTML.
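A minimal sketch of that route (my own snippet, reusing two rows of the values printed above):
import pandas as pd

rows = [(4, 4.0, 2.8284271247461903),
        (8, 3.3137084989847607, 3.0614674589207187)]
df = pd.DataFrame(rows, columns=['t', 'a', 'b'])
print(df.to_string(index=False))   # plain-text table
html = df.to_html(index=False)     # or export to HTML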
You can use output formatting to make it look pretty. Look here for an example:
http://docs.python.org/release/1.4/tut/node45.html