Efficient Way to Build Large Scale Hierarchical Data Tree Path - python

I have a large dataset (think: big data) of network elements that form a tree-like network.
A toy dataset looks like this:
| id | type | parent_id |
|-----:|:-------|:------------|
| 1 | D | <NA> |
| 2 | C | 1 |
| 3 | C | 2 |
| 4 | C | 3 |
| 5 | B | 3 |
| 6 | B | 4 |
| 7 | A | 4 |
| 8 | A | 5 |
| 9 | A | 3 |
Important rules:
The root nodes (in the toy example of type D) and the leaf nodes (in the toy example of type A) cannot be connected with each other and amongst each other. I.e., a D node cannot be connected with another D node (vice-versa for A nodes) and an A node cannot directly be connected with a D node.
For simplicity reasons, any other node type can randomly be connected in terms of types.
The tree depth can be arbitrarily deep.
The leaf node is always of type A.
A leaf node does not need to be connected through all intermediate nodes. In reality there are only a handful intermediary nodes that are mandatory to pass through. This circumstance can be neglected for this example here.
If you are to recommend doing it in Spark, the solution must be written with pyspark in mind.
What I would like to achieve is to build an efficient way (preferably in Spark) to calculate the tree-path for each node, like so:
| id | type | parent_id | path |
|-----:|:-------|:------------|:--------------------|
| 1 | D | <NA> | D:1 |
| 2 | C | 1 | D:1>C:2 |
| 3 | C | 2 | D:1>C:2>C:3 |
| 4 | C | 3 | D:1>C:2>C:3>C:4 |
| 5 | B | 3 | D:1>C:2>C:3>B:5 |
| 6 | B | 4 | D:1>C:2>C:3>C:4>B:6 |
| 7 | A | 4 | D:1>C:2>C:3>C:4>A:7 |
| 8 | A | 5 | D:1>C:2>C:3>B:5>A:8 |
| 9 | A | 3 | D:1>C:2>C:3>A:9 |
Note:
Each element in the tree path is constructed like this: id:type.
If you have other efficient ways to store the tree path (e.g., closure tables) and calculate them, I am happy to hear them as well. However, the runtime for calculation must be really low (less than an hour, preferably minutes) and retrieval later needs to be in the area of few seconds.
The ultimate end goal is to have a data structure that allows me to aggregate any network node underneath a certain node efficiently (runtime of a few seconds at most).
The actual dataset consisting of around 3M nodes can be constructed like this:
Note:
The commented node_counts that produces the above shown toy examples
The distribution of the node elements is close to reality.
import random
import pandas as pd
random.seed(1337)
node_counts = {'A': 1424383, 'B': 596994, 'C': 234745, 'D': 230937, 'E': 210663, 'F': 122859, 'G': 119453, 'H': 57462, 'I': 23260, 'J': 15008, 'K': 10666, 'L': 6943, 'M': 6724, 'N': 2371, 'O': 2005, 'P': 385}
#node_counts = {'A': 3, 'B': 2, 'C': 3, 'D': 1}
elements = list()
candidates = list()
root_type = list(node_counts.keys())[-1]
leaf_type = list(node_counts.keys())[0]
root_counts = node_counts[root_type]
leaves_count = node_counts[leaf_type]
ids = [i + 1 for i in range(sum(node_counts.values()))]
idcounter = 0
for i, (name, count) in enumerate(sorted(node_counts.items(), reverse=True)):
for _ in range(count):
_id = ids[idcounter]
idcounter += 1
_type = name
if i == 0:
_parent = None
else:
# select a random one that is not a root or a leaf
if len(candidates) == 0: # first bootstrap case
candidate = random.choice(elements)
else:
candidate = random.choice(candidates)
_parent = candidate['id']
_obj = {'id': _id, 'type': _type, 'parent_id': _parent}
#print(_obj)
elements.append(_obj)
if _type != root_type and _type != leaf_type:
candidates.append(_obj)
df = pd.DataFrame.from_dict(elements).astype({'parent_id': 'Int64'})
In order to produce the tree path in pure python with the above toy data you can use the following function:
def get_hierarchy_path(df, cache_dict, ID='id', LABEL = 'type', PARENT_ID = 'parent_id', node_sep='|', elem_sep=':'):
def get_path(record):
if pd.isna(record[PARENT_ID]):
return f'{record[LABEL]}{elem_sep}{record[ID]}'
else:
if record[PARENT_ID] in cache_dict:
parent_path = cache_dict[record[PARENT_ID]]
else:
try:
parent_path = get_path(df.query(f'{ID} == {record[PARENT_ID]}').iloc[0])
except IndexError as e:
print(f'Index Miss for {record[PARENT_ID]} on record {record.to_dict()}')
parent_path = f'{record[LABEL]}{elem_sep}{record[ID]}'
cache_dict[record[PARENT_ID]] = parent_path
return f"{parent_path}{node_sep}{record[LABEL]}{elem_sep}{record[ID]}"
return df.apply(get_path, axis=1)
df['path'] = get_hierarchy_path(df, dict(), node_sep='>')
What I have already tried:
Calculating in pure python with the above function on the large dataset takes me around 5.5 hours. So this is not really a solution. Anything quicker than this is appreciated.
Technically using the spark graphframes package, I could use BFS. This would give me a good solution for individual leave nodes, but it does not scale to the entire network.
I think Pregel is the way to go here. But I do not know how to construct it in Pyspark.
Thank you for your help.

My current solution for this challenge relies now no longer on Spark but on SQL.
I load the whole dataset to a Postgres DB and place a Unique Index on id, type and parent_id.
Then using the following query, I can calculate the path:
with recursive recursive_hierarchy AS (
-- starting point
select
parent_id
, id
, type
, type || ':' || id as path
, 1 as lvl
from hierarchy.nodes
union all
-- recursion
select
ne.parent_id as parent_id
, h.id
, h.type
, ne.type || ':' || ne.id || '|' || h.path as path
, h.lvl + 1 as lvl
from (
select *
from hierarchy.nodes
) ne
inner join recursive_hierarchy h
on ne.id = h.parent_id
), paths as (
-- complete results
select
*
from recursive_hierarchy
), max_lvl as (
-- retrieve the longest path of a network element
select
id
, max(lvl) as max_lvl
from paths
group by id
)
-- all results with only the longest path of a network element
select distinct
, p.id
, p.type
, p.path
from paths p
inner join max_lvl l
on p.id = l.id
and p.lvl = l.max_lvl

Related

How should I solve logic error in timestamp using Python?

I have written a code to calculate a, b, and c. They were initialized at 0.
This is my input file
-------------------------------------------------------------
| Line | Time | Command | Data |
-------------------------------------------------------------
| 1 | 0015 | ACTIVE | |
| 2 | 0030 | WRITING | |
| 3 | 0100 | WRITING_A | |
| 4 | 0115 | PRECHARGE | |
| 5 | 0120 | REFRESH | |
| 6 | 0150 | ACTIVE | |
| 7 | 0200 | WRITING | |
| 8 | 0314 | PRECHARGE | |
| 9 | 0318 | ACTIVE | |
| 10 | 0345 | WRITING_A | |
| 11 | 0430 | WRITING_A | |
| 12 | 0447 | WRITING | |
| 13 | 0503 | WRITING | |
and the timestamps and commands are used to process the calculation for a, b, and c.
import re
count = {}
timestamps = {}
with open ("page_stats.txt", "r") as f:
for line in f:
m = re.split(r"\s*\|\s*", line)
if len(m) > 3 and re.match(r"\d+", m[1]):
count[m[3]] = count[m[3]] + 1 if m[3] in count else 1
#print(m[2])
if m[3] in timestamps:
timestamps[m[3]].append(m[2])
#print(m[3], m[2])
else:
timestamps[m[3]] = [m[2]]
#print(m[3], m[2])
a = b = c = 0
for key in count:
print("%-10s: %2d, %s" % (key, count[key], timestamps[key]))
if timestamps["ACTIVE"] > timestamps["PRECHARGE"]: #line causing logic error
a = a + 1
print(a)
Before getting into the calculation, I assign the timestamps with respect to the commands. This is the output for this section.
ACTIVE : 3, ['0015', '0150', '0318']
WRITING : 4, ['0030', '0200', '0447', '0503']
WRITING_A : 3, ['0100', '0345', '0430']
PRECHARGE : 2, ['0115', '0314']
REFRESH : 1, ['0120']
To get a, the timestamps of ACTIVE must be greater than PRECHARGE and WRITING must be greater than ACTIVE. (Line 4, 6, 7 will contribute to the first a and Line 8, 9, and 12 contributes to the second a)
To get b, the timestamps of WRITING must be greater than ACTIVE. For the lines that contribute to a such as Line 4, 6, 7, 8, 9, and 12, they cannot be used to calculate b. So, Line 1 and 2 contribute to b.
To get c, the rest of the unused lines containing WRITING will contribute to c.
The expected output:
a = 2
b = 1
c = 1
However, in my code, when I print a, it displays 0, which shows the logic has some error. Any suggestion to amend my code to achieve the goal? I have tried for a few days and the problem is not solved yet.
I made a function that will return the commands in order that match a pattern with gaps allowed.
I also made a more compact version of your file reading.
There is probably a better version to divide the list into two parts, the problem was to only allow elements in that match the whole pattern. In this one I iterate over the elements twice.
import re
commands = list()
with open ("page_stats.txt", "r") as f:
for line in f:
m = re.split(r"\s*\|\s*", line)
if len(m) > 3 and re.match(r"\d+", m[1]):
_, line, time, command, data, _ = m
commands.append((line,time,command))
def search_pattern(pattern, iterable, key=None):
iter = 0
count = 0
length = len(pattern)
results = []
sentinel = object()
for elem in iterable:
original_elem = elem
if key is not None:
elem = key(elem)
if elem == pattern[iter]:
iter += 1
results.append((original_elem,sentinel))
if iter >= length:
iter = iter % length
count += length
else:
results.append((sentinel,original_elem))
matching = []
nonmatching = []
for res in results:
first,second = res
if count > 0:
if second is sentinel:
matching.append(first)
count -= 1
elif first is sentinel:
nonmatching.append(second)
else:
value = first if second is sentinel else second
nonmatching.append(value)
return matching, nonmatching
pattern_a = ['PRECHARGE','ACTIVE','WRITING']
pattern_b = ['ACTIVE','WRITING']
pattern_c = ['WRITING']
matching, nonmatching = search_pattern(pattern_a, commands, key=lambda t: t[2])
a = len(matching)//len(pattern_a)
matching, nonmatching = search_pattern(pattern_b, nonmatching, key=lambda t: t[2])
b = len(matching)//len(pattern_b)
matching, nonmatching = search_pattern(pattern_c, nonmatching, key=lambda t: t[2])
c = len(matching)//len(pattern_c)
print(f'{a=}')
print(f'{b=}')
print(f'{c=}')
Output:
a=2
b=1
c=1

Calculating comparisson matrix using strings as input in python - HARD

I have two strings of DNA sequences and I want to compare both sequences, character by character, in order to get a matrix with comparisson values. The general idea is to have three essential points:
If there is the complementary AT (A in one sequence and T in the other) then 2/3.
If there is the complementary CG (C in one sequence and G in the other) then 1.
Otherwise, 0 is returned.
For example if I have two sequences ACTG then the result would be:
| A | C | T | G |
A| 0 | 0 | 2/3 | 0 |
C| 0 | 0 | 0 | 1 |
T| 2/3 | 0 | 0 | 0 |
G| 0 | 1 | 0 | 0 |
I saw there is some help in this post
Calculating a similarity/difference matrix from equal length strings in Python and it really work if you are using only a 4 nucleotide long sequence-
I tried using a larger sequence and this error was printed:
ValueError: shapes (5,4) and (5,4) not aligned: 4 (dim 1) != 5 (dim 0)
I have the code in R which is
##2.1 Separas los strings
seq <- "ACTG"
seq1 <- unlist(as.matrix(strsplit(seq,""),ncol=nchar(seq),
nrow=nchar(seq)))
a <- matrix(ncol=length(seq),nrow=length(seq))
a[,1] <- seq1
a[1,] <- seq1
b <- matrix(ncol=length(a[1,]),nrow=length(a[1,]))
for (i in seq(nchar(seq))){
for (j in seq(nchar(seq))){
if (a[i,1] == "A" & a[1,j] == "T" | a[i,1] == "T" & a[1,j] == "A"){
b[[i,j]] <- 2/3
} else if (a[i,1] == "C" & a[1,j] == "G" | a[i,1] == "G" & a[1,j] == "C"){
b[[i,j]] <- 1
} else
b[[i,j]] <- 0
}
But I can't get it code in python.
I think you're making it harder than it needs to be.
import numpy as np
seq1 = 'AACCTTGG'
seq2 = 'ACGTACGT'
matrix = np.zeros((len(seq1),len(seq2)))
for y,c2 in enumerate(seq2):
for x,c1 in enumerate(seq1):
if c1+c2 in ('TA','AT'):
matrix[x,y] = 1.
elif c1+c2 in ('CG','GC'):
matrix[x,y] = 2/3
print(matrix)

Nearest neighbors in a given range

I faced the problem of quickly finding the nearest neighbors in a given range.
Example of dataset:
id | string | float
0 | AA | 0.1
12 | BB | 0.5
2 | CC | 0.3
102| AA | 1.1
33 | AA | 2.8
17 | AA | 0.5
For each line, print the number of lines satisfying the following conditions:
string field is equal to current
float field <= current float - del
For this example with del = 1.5:
id | count
0 | 0
12 | 0
2 | 0
102| 2 (string is equal row with id=0,33,17 but only in row id=0,17 float value: 1.1-1.5<=0.1, 1.1-1.5<=0.5)
33 | 0 (string is equal row with id=0,102,17 but 2.8-1.5>=0.1/1.1/1.5)
17 | 1
To solve this problem, I used a class BallTree with custom metric, but it works for a very long time due to a reverse tree walk (on a large dataset).
Can someone suggest other solutions or how you can increase the speed of custom metrics to the speed of the metrics from the sklearn.neighbors.DistanceMetric?
My code:
from sklearn.neighbors import BallTree
def distance(x, y):
if(x[0]==y[0] and x[1]>y[1]):
return (x[1] - y[1])
else:
return (x[1] + y[1])
tree2 = BallTree(X, leaf_size=X.shape[0], metric=distance)
mas=tree2.query_radius(X, r=del, count_only = True)

Panda sum rows based on metric on column value

I have a dataframe df of the form
type | time | value
------------------------
a | 1.2 | 1
a | 1.3 | 3
a | 2.1 | 4
a | 2.3 | 6
b | 2 | 21
b | 3 | 3
. . .
. . .
Is there any feasible way to, for all rows, consolidate (sum) all following rows of a given type that have a timestamp difference of less than, for example, 1?
So for this example, the second and third row should be added to the first and the output should be
type | time | value
------------------------
a | 1.2 | 8
a | 2.3 | 6
b | 2 | 21
b | 3 | 3
. . .
. . .
Normally I would simply iterate over every row, add the value for all following rows that satisfy the constraint to the active row and then drop all rows whose values were added from the dataframe. But I'm not completely sure how to do that safely with panda considering that "You should never modify something you are iterating over."
But I sadly also don't see how this could be done with any operation that is applied on the whole dataframe at once.
Edit: I've found a very rough way to do it using a while loop. In every iteration it only adds the next row to those rows that already have no row of the same type with a time stamp less than 1 before it:
df['nexttime']= df['time'].shift(-1)
df['nexttype']= df['type'].shift(-1)
df['lasttime']= df['time'].shift(1)
df['lasttype']= df['type'].shift(1)
df['nextvalue'] = df['value'].shift(-1)
while df.loc[(df.type == df.nexttype) & ((df.time - df.lasttime >1) | (df.type != df.lasttype)) & (df.time - df.nexttime <=1 ),'value'].any():
df.loc[(df.type == df.nexttype) & ((df.time - df.lasttime >1 ) | (df.type != df.lasttype)) & (df.time - df.nexttime <=1 ),'value'] = df.loc[(df.type == df.nexttype) & ((df.time - df.lasttime >1 ) | (df.type != df.lasttype)) & (df.time - df.nexttime <=1 ),'value'] + df.loc[(df.type == df.nexttype) & ((df.time - df.lasttime >1 ) | (df.type != df.lasttype)) & (df.time - df.nexttime <=1 ),'nextvalue']
df = df.loc[~((df.shift(1).type == df.shift(1).nexttype) & ((df.shift(1).time - df.shift(1).lasttime >1 ) | (df.shift(1).type != df.shift(1).lasttype)) & (df.shift(1).time - df.shift(1).nexttime <=1 ))]
df['nexttime']= df['time'].shift(-1)
df['nexttype']= df['type'].shift(-1)
df['lasttime']= df['time'].shift(1)
df['lasttype']= df['type'].shift(1)
df['nextvalue'] = df['value'].shift(-1)
I would still be very interested if there is any faster way to do this, as this kind of loop obviously is not very efficient (especially since for the kind of dataframes I work with it has to iterate a few ten thousand times).

How do I put data from a while loop into a table?

Basically I'm estimating pi using polygons. I have a loop which gives me a value for n, ann and bnn before running the loop again. here is what I have so far:
def printPiTable(an,bn,n,k):
"""Prints out a table for values n,2n,...,(2^k)n"""
u = (2**k)*n
power = 0
t = ((2**power)*n)
while t<=u:
if power < 1:
print(t,an,bn)
power = power + 1
t = ((2**power)*n)
else:
afrac = (1/2)*((1/an)+(1/bn))
ann = 1/afrac
bnn = sqrt(ann*bn)
print(t,ann,bnn)
an = ann
bn = bnn
power = power + 1
t = ((2**power)*n)
return
This is what I get if I run it with these values:
>>> printPiTable(4,2*sqrt(2),4,5)
4 4 2.8284271247461903
8 3.3137084989847607 3.0614674589207187
16 3.1825978780745285 3.121445152258053
32 3.1517249074292564 3.1365484905459398
64 3.1441183852459047 3.1403311569547534
128 3.1422236299424577 3.1412772509327733
I want to find a way to make it instead of printing out these values, just print the values in a nice neat table, any help?
Use string formatting. For example,
print('{:<4}{:>20f}{:>20f}'.format(t,ann,bnn))
produces
4 4.000000 2.828427
8 3.313708 3.061467
16 3.182598 3.121445
32 3.151725 3.136548
64 3.144118 3.140331
128 3.142224 3.141277
{:<4} is replaced by t, left-justified, formatted to a string of length 4.
{:>20f} is replaced by ann, right-justified, formatted as a float to a string of length 20.
The full story on the format string syntax is explained here.
To add column headers, just add a print statement like
print('{:<4}{:>20}{:>20}'.format('t','a','b'))
For fancier ascii tables, consider using a package like prettytable:
import prettytable
def printPiTable(an,bn,n,k):
"""Prints out a table for values n,2n,...,(2^k)n"""
table = prettytable.PrettyTable(['t', 'a', 'b'])
u = (2**k)*n
power = 0
t = ((2**power)*n)
while t<=u:
if power < 1:
table.add_row((t,an,bn))
power = power + 1
t = ((2**power)*n)
else:
afrac = (1/2)*((1/an)+(1/bn))
ann = 1/afrac
bnn = sqrt(ann*bn)
table.add_row((t,ann,bnn))
an = ann
bn = bnn
power = power + 1
t = ((2**power)*n)
print(table)
printPiTable(4,2*sqrt(2),4,5)
yields
+-----+---------------+---------------+
| t | a | b |
+-----+---------------+---------------+
| 4 | 4 | 2.82842712475 |
| 8 | 3.31370849898 | 3.06146745892 |
| 16 | 3.18259787807 | 3.12144515226 |
| 32 | 3.15172490743 | 3.13654849055 |
| 64 | 3.14411838525 | 3.14033115695 |
| 128 | 3.14222362994 | 3.14127725093 |
+-----+---------------+---------------+
Perhaps it is overkill for this sole purpose, but Pandas can make nice tables too, and can export them in other formats, such as HTML.
You can use output formatting to make it look pretty. Look here for an example:
http://docs.python.org/release/1.4/tut/node45.html

Categories