Finding overlaps between millions of ranges/intervals - python

I am trying to find pairs of intervals that overlap by at least some minimum overlap length that is set by the user. The intervals are from this pandas dataframe:
import pandas as pds
print(df1.head())
print(df1.tail())

         query_id  start_pos   end_pos  read_length orientation
0         1687655          1      4158         4158           F
1         2485364          1      7233         7233           R
2         1412202          1      3215         3215           R
3         1765889          1      3010         3010           R
4         2944965          1      4199         4199           R

         query_id  start_pos   end_pos  read_length orientation
3082467    112838   27863832  27865583         1752           F
3082468    138670   28431208  28431804          597           R
3082469    171683   28489928  28490409          482           F
3082470   2930053   28569533  28569860          328           F
3082471   1896622   28589281  28589554          274           R
where start_pos is where the interval starts and end_pos is where the interval ends. read_length is the length of the interval.
The data is sorted by start_pos.
The program should have the following output format:
query_id1 -- query_id2 -- read_length1 -- read_length2 -- overlap_length
I am running the program on a compute node with up to 512gb RAM and 4x Intel Xeon E7-4830 CPU (32 cores).
I've tried running my own code to find the overlaps, but it is taking too long to run.
Here is the code that I tried,
import pandas as pds
import numpy as np  # needed for np.repeat below

overlap_df = pds.DataFrame()

def create_overlap_table(df1, ovl_len):
    ...
    # (sort and clean the data here)
    ...
    def iterate_queries(row):
        global overlap_df
        index1 = df1.index[df1['query_id'] == row['query_id']]
        next_int_index = df1.index.get_loc(index1[0]) + 1
        if row['read_length'] >= ovl_len:
            if df1.index.size - 1 >= next_int_index:
                end_pos_minus_ovlp = (row['end_pos'] - ovl_len) + 2
                subset_df = df1.loc[df1['start_pos'] < end_pos_minus_ovlp]
                subset_df = subset_df.loc[subset_df.index == subset_df.index.max()]
                subset_df = df1.iloc[next_int_index:df1.index.get_loc(subset_df.index[0])]
                subset_df = subset_df.loc[subset_df['read_length'] >= ovl_len]
                rows1 = pds.DataFrame({'read_id1': np.repeat(row['query_id'], repeats=subset_df.index.size),
                                       'read_id2': subset_df['query_id']})
                overlap_df = overlap_df.append(rows1)  # note: DataFrame.append is deprecated in recent pandas

    df1.apply(iterate_queries, axis=1)
    print(overlap_df)
Again, I ran this code on the compute node, but it was running for hours before I finally cancelled the job.
I've also tried two packages for this problem (PyRanges, as well as an R package called IRanges), but they take too long to run as well. I've seen posts on interval trees and a Python library called pybedtools, and I was planning on looking into them as a next step.
Any feedback would be really appreciated.
EDIT:
For a minimum overlap length of, say, 800, the first 5 rows should look like this:
query_id1  query_id2  read_length1  read_length2  overlap_length
1687655    2485364            4158          7233            4158
1687655    1412202            4158          3215            3215
1687655    1765889            4158          3010            3010
1687655    2944965            4158          4199            4158
2485364    1412202            7233          3215            3215
So, query_id1 and query_id2 cannot be identical. Also, no duplications (i.e., an overlap between A and B should not appear twice in the output).

Here's an algorithm.
Prepare a set of intervals sorted by starting point. Initially the set is empty.
Sort all starting and ending points.
Traverse the points. If a starting point is encountered, add the corresponding interval to the set. If an ending point is encountered, remove the corresponding interval from the set.
When removing an interval, look at the other intervals in the set. They all overlap the interval being removed, and walking the set in start order yields the overlaps longest first, in non-increasing length. Traverse the set until the overlap length drops below the minimum, reporting each pair along the way.
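Here is a minimal sketch of this sweep in Python, not a drop-in solution: it assumes half-open [start, end) intervals given as (start, end, query_id, read_length) tuples with min_ovl >= 1, the tuple layout and function name are illustrative, and the active set uses the third-party sortedcontainers package.

from sortedcontainers import SortedList

def sweep_overlaps(intervals, min_ovl):
    # Build the event list: each interval contributes a start and an end
    # event; at equal coordinates, starts sort before ends.
    events = []
    for iv in intervals:
        events.append((iv[0], 0, iv))  # 0 = start event
        events.append((iv[1], 1, iv))  # 1 = end event
    events.sort(key=lambda e: (e[0], e[1]))

    active = SortedList(key=lambda iv: iv[0])  # active set, sorted by start
    pairs = []
    for pos, kind, iv in events:
        if kind == 0:
            active.add(iv)
            continue
        active.remove(iv)
        start_x, end_x, id_x, len_x = iv
        # Everything still in the set overlaps iv, and walking the set in
        # start order yields overlap lengths in non-increasing order, so
        # we can stop at the first overlap that is too short.
        for other in active:
            ovl = end_x - max(start_x, other[0])
            if ovl < min_ovl:
                break
            pairs.append((id_x, other[2], len_x, other[3], ovl))
    return pairs

Each pair is reported exactly once, when the earlier-ending interval of the pair is removed, so the no-self-pairs and no-duplicates requirements from the EDIT hold; the cost is O(n log n) for the sweep plus the number of reported pairs.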

pyranges author here. Thanks for trying my library.
How big are your data? When both PyRanges had 1e7 rows, the heavy part done by pyranges took about 12 seconds using 24 cores on a busy server with 200 GB RAM.
Setup:
import pyranges as pr
import numpy as np
import mkl
mkl.set_num_threads(1)
### Create data ###
length = int(1e7)
minimum_overlap = 5
gr = pr.random(length)
gr2 = pr.random(length)
# add ids
gr.id1 = np.arange(len(gr))
gr2.id2 = np.arange(len(gr2))
# add lengths
gr.length1 = gr.lengths()
gr2.length2 = gr2.lengths()
gr
# +--------------+-----------+-----------+--------------+-----------+-----------+
# | Chromosome | Start | End | Strand | id1 | length1 |
# | (category) | (int32) | (int32) | (category) | (int64) | (int32) |
# |--------------+-----------+-----------+--------------+-----------+-----------|
# | chr1 | 146230338 | 146230438 | + | 0 | 100 |
# | chr1 | 199561432 | 199561532 | + | 1 | 100 |
# | chr1 | 189095813 | 189095913 | + | 2 | 100 |
# | chr1 | 27608425 | 27608525 | + | 3 | 100 |
# | ... | ... | ... | ... | ... | ... |
# | chrY | 21533766 | 21533866 | - | 9999996 | 100 |
# | chrY | 30105890 | 30105990 | - | 9999997 | 100 |
# | chrY | 49764407 | 49764507 | - | 9999998 | 100 |
# | chrY | 3319478 | 3319578 | - | 9999999 | 100 |
# +--------------+-----------+-----------+--------------+-----------+-----------+
# Stranded PyRanges object has 10,000,000 rows and 6 columns from 25 chromosomes.
Doing your analysis:
j = gr.join(gr2, new_pos="intersection", nb_cpu=24)
# CPU times: user 3.85 s, sys: 3.56 s, total: 7.41 s
# Wall time: 12.3 s
j.overlap = j.lengths()
out = j.df["id1 id2 length1 length2 overlap".split()]
out = out[out.overlap >= minimum_overlap]
out.head()
   id1     id2  length1  length2  overlap
1    2  485629      100      100       74
2    2  418820      100      100       92
3    3  487066      100      100       13
4    7  191109      100      100       31
5   11  403447      100      100       76
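If you instead join a single PyRanges with itself to find overlaps within one dataset, as in the question, a one-line filter on the result would enforce the requirements from the EDIT (no self-matches, each unordered pair reported once), assuming numeric ids as above:

# drop self-pairs and keep each unordered pair exactly once
out = out[out.id1 < out.id2]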

Related

How should I solve logic error in timestamp using Python?

I have written code to calculate a, b, and c; they are initialized to 0.
This is my input file
-------------------------------------------------------------
| Line | Time | Command | Data |
-------------------------------------------------------------
| 1 | 0015 | ACTIVE | |
| 2 | 0030 | WRITING | |
| 3 | 0100 | WRITING_A | |
| 4 | 0115 | PRECHARGE | |
| 5 | 0120 | REFRESH | |
| 6 | 0150 | ACTIVE | |
| 7 | 0200 | WRITING | |
| 8 | 0314 | PRECHARGE | |
| 9 | 0318 | ACTIVE | |
| 10 | 0345 | WRITING_A | |
| 11 | 0430 | WRITING_A | |
| 12 | 0447 | WRITING | |
| 13 | 0503 | WRITING | |
and the timestamps and commands are used to process the calculation for a, b, and c.
import re

count = {}
timestamps = {}
with open("page_stats.txt", "r") as f:
    for line in f:
        m = re.split(r"\s*\|\s*", line)
        if len(m) > 3 and re.match(r"\d+", m[1]):
            count[m[3]] = count[m[3]] + 1 if m[3] in count else 1
            #print(m[2])
            if m[3] in timestamps:
                timestamps[m[3]].append(m[2])
                #print(m[3], m[2])
            else:
                timestamps[m[3]] = [m[2]]
                #print(m[3], m[2])

a = b = c = 0
for key in count:
    print("%-10s: %2d, %s" % (key, count[key], timestamps[key]))

if timestamps["ACTIVE"] > timestamps["PRECHARGE"]:  # line causing logic error
    a = a + 1
print(a)
Before getting into the calculation, I assign the timestamps with respect to the commands. This is the output for this section.
ACTIVE : 3, ['0015', '0150', '0318']
WRITING : 4, ['0030', '0200', '0447', '0503']
WRITING_A : 3, ['0100', '0345', '0430']
PRECHARGE : 2, ['0115', '0314']
REFRESH : 1, ['0120']
To get a, the timestamp of ACTIVE must be greater than that of PRECHARGE, and WRITING must be greater than ACTIVE. (Lines 4, 6, and 7 contribute to the first a, and lines 8, 9, and 12 contribute to the second a.)
To get b, the timestamp of WRITING must be greater than that of ACTIVE. Lines that already contributed to a, such as lines 4, 6, 7, 8, 9, and 12, cannot be reused for b. So lines 1 and 2 contribute to b.
To get c, the rest of the unused lines containing WRITING will contribute to c.
The expected output:
a = 2
b = 1
c = 1
However, in my code, when I print a, it displays 0, which shows the logic has an error. Any suggestions on how to amend my code to achieve the goal? I have tried for a few days and have not solved it yet.
I made a function that returns the commands, in order, that match a pattern, with gaps allowed.
I also made a more compact version of your file reading.
There is probably a better way to divide the list into two parts; the difficulty is to admit only elements that match the whole pattern. In this version I iterate over the elements twice.
import re

commands = list()
with open("page_stats.txt", "r") as f:
    for line in f:
        m = re.split(r"\s*\|\s*", line)
        if len(m) > 3 and re.match(r"\d+", m[1]):
            _, line, time, command, data, _ = m
            commands.append((line, time, command))

def search_pattern(pattern, iterable, key=None):
    iter = 0
    count = 0
    length = len(pattern)
    results = []
    sentinel = object()
    for elem in iterable:
        original_elem = elem
        if key is not None:
            elem = key(elem)
        if elem == pattern[iter]:
            iter += 1
            results.append((original_elem, sentinel))
            if iter >= length:
                iter = iter % length
                count += length
        else:
            results.append((sentinel, original_elem))
    matching = []
    nonmatching = []
    for res in results:
        first, second = res
        if count > 0:
            if second is sentinel:
                matching.append(first)
                count -= 1
            elif first is sentinel:
                nonmatching.append(second)
        else:
            value = first if second is sentinel else second
            nonmatching.append(value)
    return matching, nonmatching

pattern_a = ['PRECHARGE', 'ACTIVE', 'WRITING']
pattern_b = ['ACTIVE', 'WRITING']
pattern_c = ['WRITING']

matching, nonmatching = search_pattern(pattern_a, commands, key=lambda t: t[2])
a = len(matching)//len(pattern_a)
matching, nonmatching = search_pattern(pattern_b, nonmatching, key=lambda t: t[2])
b = len(matching)//len(pattern_b)
matching, nonmatching = search_pattern(pattern_c, nonmatching, key=lambda t: t[2])
c = len(matching)//len(pattern_c)

print(f'{a=}')
print(f'{b=}')
print(f'{c=}')
Output:
a=2
b=1
c=1
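For intuition, here is a small hedged example of search_pattern on plain strings (the sample list is made up):

sample = ['X', 'ACTIVE', 'WRITING', 'ACTIVE']
m, nm = search_pattern(['ACTIVE', 'WRITING'], sample)
print(m)   # ['ACTIVE', 'WRITING']  (one complete match)
print(nm)  # ['X', 'ACTIVE']        (leftovers, including the partial match)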

Nearest neighbors in a given range

I faced the problem of quickly finding the nearest neighbors in a given range.
Example of dataset:
id  | string | float
0   | AA     | 0.1
12  | BB     | 0.5
2   | CC     | 0.3
102 | AA     | 1.1
33  | AA     | 2.8
17  | AA     | 0.5
For each line, print the number of lines satisfying both of the following conditions:
the string field is equal to the current one
the float field is less than the current float, but by no more than del (that is, current float - del <= float field < current float)
For this example with del = 1.5:
id  | count
0   | 0
12  | 0
2   | 0
102 | 2   (rows id=0, 33, 17 match the string, but only id=0 and id=17 satisfy the float condition: 1.1 - 1.5 <= 0.1 < 1.1 and 1.1 - 1.5 <= 0.5 < 1.1)
33  | 0   (rows id=0, 102, 17 match the string, but none of 0.1, 1.1, 0.5 lies in [2.8 - 1.5, 2.8))
17  | 1
To solve this problem I used sklearn's BallTree with a custom metric, but it runs for a very long time on a large dataset, apparently due to a reverse tree walk.
Can someone suggest another approach, or a way to make a custom metric as fast as the built-in metrics from sklearn.neighbors.DistanceMetric?
My code:
from sklearn.neighbors import BallTree

def distance(x, y):
    # x and y are (string-code, float) pairs
    if x[0] == y[0] and x[1] > y[1]:
        return x[1] - y[1]
    else:
        return x[1] + y[1]

tree2 = BallTree(X, leaf_size=X.shape[0], metric=distance)
# `del` is a reserved word in Python, so the radius is called delta here
mas = tree2.query_radius(X, r=delta, count_only=True)
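Since the metric only encodes "same string, float smaller by at most del", one tree-free alternative is to sort within each string group and use np.searchsorted. This is only a hedged sketch, not an existing sklearn API: it assumes the data sits in a DataFrame df with columns id, string, and float and a default RangeIndex, and count_in_range is an illustrative name.

import numpy as np
import pandas as pd

def count_in_range(df, delta):
    """For each row, count rows with the same string whose float lies
    in [row.float - delta, row.float)."""
    counts = np.zeros(len(df), dtype=int)
    for _, grp in df.groupby('string'):
        vals = np.sort(grp['float'].to_numpy())
        cur = grp['float'].to_numpy()
        hi = np.searchsorted(vals, cur, side='left')          # floats < current
        lo = np.searchsorted(vals, cur - delta, side='left')  # floats < current - delta
        counts[grp.index.to_numpy()] = hi - lo
    return pd.Series(counts, index=df.index, name='count')

# e.g. df['count'] = count_in_range(df, 1.5)

This reproduces the counts in the example above in O(n log n) overall.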

Comparing a value from one dataframe with values from columns in another dataframe and getting the data from third column

The title is bit confusing but I'll do my best to explain my problem here. I have 2 pandas dataframes, a and b:
>> print a
id | value
1 | 250
2 | 150
3 | 350
4 | 550
5 | 450
>> print b
low | high | class
100 | 200 | 'A'
200 | 300 | 'B'
300 | 500 | 'A'
500 | 600 | 'C'
I want to create a new column called class in table a that contains the class of the value in accordance with table b. Here's the result I want:
>> print a
id | value | class
1 | 250 | 'B'
2 | 150 | 'A'
3 | 350 | 'A'
4 | 550 | 'C'
5 | 450 | 'A'
I have the following code written that sort of does what I want:
a['class'] = pd.Series()
for i in range(len(a)):
    val = a['value'][i]
    cl = b['class'][(b['low'] <= val) &
                    (b['high'] >= val)].iat[0]
    a['class'].set_value(i, cl)
Problem is, this is quick for tables of length 10 or so, but I am trying to do this with tables of 100,000+ rows for both a and b. Is there a quicker way to do this, using some function/attribute in pandas?
Here is a way to do a range join inspired by @piRSquared's solution:
import numpy as np
import pandas as pd

A = a['value'].values
bh = b.high.values
bl = b.low.values

i, j = np.where((A[:, None] >= bl) & (A[:, None] <= bh))

pd.DataFrame(
    np.column_stack([a.values[i], b.values[j]]),
    columns=a.columns.append(b.columns)
)
Output:
  id  value  low  high  class
0  1    250  200   300    'B'
1  2    150  100   200    'A'
2  3    350  300   500    'A'
3  4    550  500   600    'C'
4  5    450  300   500    'A'
Here's a solution that is admittedly less elegant than using Series.searchsorted, but it runs super fast!
I pull the data out of the pandas DataFrames, convert them to lists, and then use np.where to populate a variable called "aclass" where the conditions are satisfied (in brute-force for loops). Then I write "aclass" back to the original DataFrame a.
The evaluation time was 0.07489705 s, so it's pretty fast, even with 200,000 data points!
import time
import numpy as np

# create 200,000 fake a data points
avalue = 100 + 600*np.random.random(200000)  # assuming you extracted this from a with avalue = np.array(a['value'])
blow = [100, 200, 300, 500]    # assuming you extracted this from b with list(b['low'])
bhigh = [200, 300, 500, 600]   # assuming you extracted this from b with list(b['high'])
bclass = ['A', 'B', 'A', 'C']  # assuming you extracted this from b with list(b['class'])
aclass = [[]]*len(avalue)      # initialize aclass

start_time = time.time()  # this is just for timing the execution
for i in range(len(blow)):
    for j in np.where((avalue >= blow[i]) & (avalue <= bhigh[i]))[0]:
        aclass[j] = bclass[i]

# add the class column to the original a DataFrame
a['class'] = aclass
print("--- %s seconds ---" % np.round(time.time() - start_time, decimals=8))
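For reference, here is what the Series.searchsorted approach mentioned above could look like. This sketch assumes b is sorted by low and its bands are contiguous, as in the example; values that fall outside every band would need an extra mask against high:

import pandas as pd

a = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                  'value': [250, 150, 350, 550, 450]})
b = pd.DataFrame({'low': [100, 200, 300, 500],
                  'high': [200, 300, 500, 600],
                  'class': ['A', 'B', 'A', 'C']})

# index of the band whose 'low' is the rightmost one <= value
idx = b['low'].searchsorted(a['value'], side='right') - 1
a['class'] = b['class'].values[idx]
print(a)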

Gurobi: How can I sum just a part of a variable?

I have the following model:
from gurobipy import *

n_units = 1
n_periods = 3
n_ageclasses = 4

units = list(range(1, n_units+1))      # lists rather than ranges,
periods = list(range(1, n_periods+1))  # so that .append below works in Python 3
periods_plus1 = periods[:]
periods_plus1.append(max(periods_plus1)+1)
ageclasses = list(range(1, n_ageclasses+1))
nothickets = ageclasses[1:]

model = Model('MPPM')

HARVEST = model.addVars(units, periods, nothickets, vtype=GRB.INTEGER, name="HARVEST")
FOREST = model.addVars(units, periods_plus1, ageclasses, vtype=GRB.INTEGER, name="FOREST")

model.addConstrs(
    (quicksum(HARVEST[(k+1), (t+1), nothicket]
              for k in range(n_units)
              for t in range(n_periods)
              for nothicket in nothickets)
     == FOREST[unit, period+1, 1]
     for unit in units for period in periods if period < max(periods_plus1)),
    name="A_Thicket")
I have a problem with formulating the constraint. For every unit and every period I want to sum HARVEST over the nothickets index only; concretely, for unit k=1 and period t=1 I want HARVEST[1,1,2] + HARVEST[1,1,3] + HARVEST[1,1,4], and so on. This should result in only three ones per row of the constraint matrix, but with the formulation above I get 9 ones.
I tried to use a for loop outside of the sum, but this results in another problem:
for k in range(n_units):
    for t in range(n_periods):
        model.addConstrs(
            (quicksum(HARVEST[(k+1), (t+1), nothicket] for nothicket in nothickets)
             == FOREST[unit, period+1, 1]
             for unit in units for period in periods if period < max(periods_plus1)),
            name="A_Thicket")
With this formulation I get this matrix:
(image: constraint matrix)
But what I want is:
row_idx | col_idx | coeff
0 | 0 | 1
0 | 1 | 1
0 | 2 | 1
0 | 13 | -1
1 | 3 | 1
1 | 4 | 1
1 | 5 | 1
1 | 17 | -1
2 | 6 | 1
2 | 7 | 1
2 | 8 | 1
2 | 21 | -1
Can anybody please help me to reformulate this constraint?
This worked for me:
model.addConstrs((HARVEST.sum(unit, period, '*') == ...
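For completeness, here is a hedged guess at the full constraint, filling in the FOREST right-hand side from the question (the '*' wildcard in tupledict.sum sums HARVEST over the nothickets index only, giving the three ones per row that were asked for):

# assumed completion; the right-hand side is taken from the question
model.addConstrs(
    (HARVEST.sum(unit, period, '*') == FOREST[unit, period+1, 1]
     for unit in units for period in periods),
    name="A_Thicket")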

How do I put data from a while loop into a table?

Basically I'm estimating pi using polygons. I have a loop which gives me a value for n, ann, and bnn before running the loop again. Here is what I have so far:
from math import sqrt

def printPiTable(an, bn, n, k):
    """Prints out a table for values n,2n,...,(2^k)n"""
    u = (2**k)*n
    power = 0
    t = (2**power)*n
    while t <= u:
        if power < 1:
            print(t, an, bn)
            power = power + 1
            t = (2**power)*n
        else:
            afrac = (1/2)*((1/an) + (1/bn))
            ann = 1/afrac
            bnn = sqrt(ann*bn)
            print(t, ann, bnn)
            an = ann
            bn = bnn
            power = power + 1
            t = (2**power)*n
    return
This is what I get if I run it with these values:
>>> printPiTable(4,2*sqrt(2),4,5)
4 4 2.8284271247461903
8 3.3137084989847607 3.0614674589207187
16 3.1825978780745285 3.121445152258053
32 3.1517249074292564 3.1365484905459398
64 3.1441183852459047 3.1403311569547534
128 3.1422236299424577 3.1412772509327733
Instead of printing these values, I want to print them in a nice, neat table. Any help?
Use string formatting. For example,
print('{:<4}{:>20f}{:>20f}'.format(t,ann,bnn))
produces
4               4.000000            2.828427
8               3.313708            3.061467
16              3.182598            3.121445
32              3.151725            3.136548
64              3.144118            3.140331
128             3.142224            3.141277
{:<4} is replaced by t, left-justified, formatted to a string of length 4.
{:>20f} is replaced by ann, right-justified, formatted as a float to a string of length 20.
The full story on the format string syntax is explained in the Python documentation.
To add column headers, just add a print statement like
print('{:<4}{:>20}{:>20}'.format('t','a','b'))
For fancier ascii tables, consider using a package like prettytable:
import prettytable
from math import sqrt

def printPiTable(an, bn, n, k):
    """Prints out a table for values n,2n,...,(2^k)n"""
    table = prettytable.PrettyTable(['t', 'a', 'b'])
    u = (2**k)*n
    power = 0
    t = (2**power)*n
    while t <= u:
        if power < 1:
            table.add_row((t, an, bn))
            power = power + 1
            t = (2**power)*n
        else:
            afrac = (1/2)*((1/an) + (1/bn))
            ann = 1/afrac
            bnn = sqrt(ann*bn)
            table.add_row((t, ann, bnn))
            an = ann
            bn = bnn
            power = power + 1
            t = (2**power)*n
    print(table)

printPiTable(4, 2*sqrt(2), 4, 5)
yields
+-----+---------------+---------------+
| t | a | b |
+-----+---------------+---------------+
| 4 | 4 | 2.82842712475 |
| 8 | 3.31370849898 | 3.06146745892 |
| 16 | 3.18259787807 | 3.12144515226 |
| 32 | 3.15172490743 | 3.13654849055 |
| 64 | 3.14411838525 | 3.14033115695 |
| 128 | 3.14222362994 | 3.14127725093 |
+-----+---------------+---------------+
Perhaps it is overkill for this sole purpose, but Pandas can make nice tables too, and can export them in other formats, such as HTML.
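For instance, a hedged sketch with pandas (column names illustrative, values taken from the output above):

import pandas as pd

rows = [(4, 4, 2.8284271247461903),
        (8, 3.3137084989847607, 3.0614674589207187)]
df = pd.DataFrame(rows, columns=['t', 'a', 'b'])
print(df.to_string(index=False))
# df.to_html() would render the same table as HTML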
You can use output formatting to make it look pretty. Look here for an example:
http://docs.python.org/release/1.4/tut/node45.html
