Pandas: sum rows based on a metric on column value - python

I have a dataframe df of the form
type | time | value
------------------------
a | 1.2 | 1
a | 1.3 | 3
a | 2.1 | 4
a | 2.3 | 6
b | 2 | 21
b | 3 | 3
. . .
. . .
Is there any feasible way to, for all rows, consolidate (sum) all following rows of a given type that have a timestamp difference of less than, for example, 1?
So for this example, the second and third rows should be added to the first, and the output should be
type | time | value
------------------------
a | 1.2 | 8
a | 2.3 | 6
b | 2 | 21
b | 3 | 3
. . .
. . .
Normally I would simply iterate over every row, add the value of all following rows that satisfy the constraint to the active row, and then drop the rows whose values were added from the dataframe. But I'm not completely sure how to do that safely with pandas, considering that "You should never modify something you are iterating over."
Sadly, I also don't see how this could be done with any operation that is applied to the whole dataframe at once.
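For concreteness, the row-by-row consolidation described above could look roughly like the sketch below. It groups by type, keeps an "anchor" row, folds every following row that is less than 1 away from that anchor into it, and writes the result to a new frame instead of mutating the one being iterated over (the column names and sample data come from the example; everything else is illustrative):

import pandas as pd

df = pd.DataFrame({'type':  ['a', 'a', 'a', 'a', 'b', 'b'],
                   'time':  [1.2, 1.3, 2.1, 2.3, 2.0, 3.0],
                   'value': [1, 3, 4, 6, 21, 3]})

consolidated = []
for _, grp in df.groupby('type', sort=False):
    anchor = None
    for row in grp.itertuples(index=False):
        # start a new consolidated row once we are at least 1 away from the anchor
        if anchor is None or row.time - anchor['time'] >= 1:
            anchor = {'type': row.type, 'time': row.time, 'value': row.value}
            consolidated.append(anchor)
        else:
            anchor['value'] += row.value

result = pd.DataFrame(consolidated)
print(result)  # a 1.2 8, a 2.3 6, b 2.0 21, b 3.0 3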
Edit: I've found a very rough way to do it using a while loop. In every iteration it only adds the next row to those rows that have no row of the same type within a timestamp difference of 1 before them:
df['nexttime'] = df['time'].shift(-1)
df['nexttype'] = df['type'].shift(-1)
df['lasttime'] = df['time'].shift(1)
df['lasttype'] = df['type'].shift(1)
df['nextvalue'] = df['value'].shift(-1)

while df.loc[(df.type == df.nexttype) & ((df.time - df.lasttime > 1) | (df.type != df.lasttype)) & (df.time - df.nexttime <= 1), 'value'].any():
    mask = (df.type == df.nexttype) & ((df.time - df.lasttime > 1) | (df.type != df.lasttype)) & (df.time - df.nexttime <= 1)
    df.loc[mask, 'value'] = df.loc[mask, 'value'] + df.loc[mask, 'nextvalue']
    df = df.loc[~((df.shift(1).type == df.shift(1).nexttype) & ((df.shift(1).time - df.shift(1).lasttime > 1) | (df.shift(1).type != df.shift(1).lasttype)) & (df.shift(1).time - df.shift(1).nexttime <= 1))]
    df['nexttime'] = df['time'].shift(-1)
    df['nexttype'] = df['type'].shift(-1)
    df['lasttime'] = df['time'].shift(1)
    df['lasttype'] = df['type'].shift(1)
    df['nextvalue'] = df['value'].shift(-1)
I would still be very interested if there is any faster way to do this, as this kind of loop is obviously not very efficient (especially since, for the kind of dataframes I work with, it has to iterate a few tens of thousands of times).

Related

Efficient Way to Build Large Scale Hierarchical Data Tree Path

I have a large dataset (think: big data) of network elements that form a tree-like network.
A toy dataset looks like this:
| id | type | parent_id |
|-----:|:-------|:------------|
| 1 | D | <NA> |
| 2 | C | 1 |
| 3 | C | 2 |
| 4 | C | 3 |
| 5 | B | 3 |
| 6 | B | 4 |
| 7 | A | 4 |
| 8 | A | 5 |
| 9 | A | 3 |
Important rules:
The root nodes (type D in the toy example) and the leaf nodes (type A in the toy example) cannot be connected to each other or amongst themselves. I.e., a D node cannot be connected to another D node (and likewise for A nodes), and an A node cannot be directly connected to a D node.
For simplicity reasons, any other node type can randomly be connected in terms of types.
The tree can be arbitrarily deep.
The leaf node is always of type A.
A leaf node does not need to be connected through all intermediate nodes. In reality there are only a handful of intermediary nodes that are mandatory to pass through. This circumstance can be neglected for this example.
If you are to recommend doing it in Spark, the solution must be written with pyspark in mind.
What I would like to achieve is to build an efficient way (preferably in Spark) to calculate the tree-path for each node, like so:
| id | type | parent_id | path |
|-----:|:-------|:------------|:--------------------|
| 1 | D | <NA> | D:1 |
| 2 | C | 1 | D:1>C:2 |
| 3 | C | 2 | D:1>C:2>C:3 |
| 4 | C | 3 | D:1>C:2>C:3>C:4 |
| 5 | B | 3 | D:1>C:2>C:3>B:5 |
| 6 | B | 4 | D:1>C:2>C:3>C:4>B:6 |
| 7 | A | 4 | D:1>C:2>C:3>C:4>A:7 |
| 8 | A | 5 | D:1>C:2>C:3>B:5>A:8 |
| 9 | A | 3 | D:1>C:2>C:3>A:9 |
Note:
Each element in the tree path is constructed like this: type:id.
If you have other efficient ways to store the tree path (e.g., closure tables) and calculate them, I am happy to hear them as well. However, the runtime for the calculation must be really low (less than an hour, preferably minutes) and retrieval later needs to be in the area of a few seconds.
The ultimate end goal is to have a data structure that allows me to aggregate any network node underneath a certain node efficiently (runtime of a few seconds at most).
The actual dataset consisting of around 3M nodes can be constructed like this:
Note:
The commented-out node_counts produces the toy example shown above.
The distribution of the node elements is close to reality.
import random
import pandas as pd

random.seed(1337)

node_counts = {'A': 1424383, 'B': 596994, 'C': 234745, 'D': 230937, 'E': 210663, 'F': 122859, 'G': 119453, 'H': 57462, 'I': 23260, 'J': 15008, 'K': 10666, 'L': 6943, 'M': 6724, 'N': 2371, 'O': 2005, 'P': 385}
#node_counts = {'A': 3, 'B': 2, 'C': 3, 'D': 1}

elements = list()
candidates = list()
root_type = list(node_counts.keys())[-1]
leaf_type = list(node_counts.keys())[0]
root_counts = node_counts[root_type]
leaves_count = node_counts[leaf_type]
ids = [i + 1 for i in range(sum(node_counts.values()))]
idcounter = 0

for i, (name, count) in enumerate(sorted(node_counts.items(), reverse=True)):
    for _ in range(count):
        _id = ids[idcounter]
        idcounter += 1
        _type = name
        if i == 0:
            _parent = None
        else:
            # select a random one that is not a root or a leaf
            if len(candidates) == 0:  # first bootstrap case
                candidate = random.choice(elements)
            else:
                candidate = random.choice(candidates)
            _parent = candidate['id']
        _obj = {'id': _id, 'type': _type, 'parent_id': _parent}
        #print(_obj)
        elements.append(_obj)
        if _type != root_type and _type != leaf_type:
            candidates.append(_obj)

df = pd.DataFrame.from_dict(elements).astype({'parent_id': 'Int64'})
In order to produce the tree path in pure python with the above toy data you can use the following function:
def get_hierarchy_path(df, cache_dict, ID='id', LABEL='type', PARENT_ID='parent_id', node_sep='|', elem_sep=':'):
    def get_path(record):
        if pd.isna(record[PARENT_ID]):
            return f'{record[LABEL]}{elem_sep}{record[ID]}'
        else:
            if record[PARENT_ID] in cache_dict:
                parent_path = cache_dict[record[PARENT_ID]]
            else:
                try:
                    parent_path = get_path(df.query(f'{ID} == {record[PARENT_ID]}').iloc[0])
                except IndexError as e:
                    print(f'Index Miss for {record[PARENT_ID]} on record {record.to_dict()}')
                    parent_path = f'{record[LABEL]}{elem_sep}{record[ID]}'
                cache_dict[record[PARENT_ID]] = parent_path
            return f"{parent_path}{node_sep}{record[LABEL]}{elem_sep}{record[ID]}"
    return df.apply(get_path, axis=1)

df['path'] = get_hierarchy_path(df, dict(), node_sep='>')
What I have already tried:
Calculating in pure python with the above function on the large dataset takes me around 5.5 hours. So this is not really a solution. Anything quicker than this is appreciated.
Technically, using the Spark graphframes package, I could use BFS. This would give me a good solution for individual leaf nodes, but it does not scale to the entire network.
I think Pregel is the way to go here. But I do not know how to construct it in Pyspark.
Thank you for your help.
My current solution for this challenge no longer relies on Spark but on SQL.
I load the whole dataset into a Postgres DB and place a unique index on id, type and parent_id.
Then using the following query, I can calculate the path:
with recursive recursive_hierarchy as (
    -- starting point
    select
        parent_id
        , id
        , type
        , type || ':' || id as path
        , 1 as lvl
    from hierarchy.nodes

    union all

    -- recursion
    select
        ne.parent_id as parent_id
        , h.id
        , h.type
        , ne.type || ':' || ne.id || '|' || h.path as path
        , h.lvl + 1 as lvl
    from (
        select *
        from hierarchy.nodes
    ) ne
    inner join recursive_hierarchy h
        on ne.id = h.parent_id
), paths as (
    -- complete results
    select *
    from recursive_hierarchy
), max_lvl as (
    -- retrieve the longest path of a network element
    select
        id
        , max(lvl) as max_lvl
    from paths
    group by id
)
-- all results with only the longest path of a network element
select distinct
    p.id
    , p.type
    , p.path
from paths p
inner join max_lvl l
    on p.id = l.id
    and p.lvl = l.max_lvl
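Once the path column exists (whether from the pure-python function above or from the recursive query), aggregating everything underneath a given node reduces to a prefix match on path. A minimal pandas sketch on the toy data; counting the descendants of node 3 is just an illustration:

# count the nodes underneath node 3 in the toy data via a path-prefix match
prefix = 'D:1>C:2>C:3>'   # trailing separator avoids matching e.g. a hypothetical C:30
descendants = df[df['path'].str.startswith(prefix)]
print(len(descendants))   # 6 in the toy example (nodes 4-9)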

Calculating a comparison matrix using strings as input in python - HARD

I have two strings of DNA sequences and I want to compare both sequences, character by character, in order to get a matrix of comparison values. The general idea rests on three essential rules:
If there is the complementary AT (A in one sequence and T in the other) then 2/3.
If there is the complementary CG (C in one sequence and G in the other) then 1.
Otherwise, 0 is returned.
For example, if both sequences are ACTG, the result would be:
| A | C | T | G |
A| 0 | 0 | 2/3 | 0 |
C| 0 | 0 | 0 | 1 |
T| 2/3 | 0 | 0 | 0 |
G| 0 | 1 | 0 | 0 |
I saw there is some help in this post: Calculating a similarity/difference matrix from equal length strings in Python, and it really works if you are using only a 4-nucleotide-long sequence.
I tried using a larger sequence and this error was printed:
ValueError: shapes (5,4) and (5,4) not aligned: 4 (dim 1) != 5 (dim 0)
I have the code in R, which is:
## 2.1 Split the strings
seq <- "ACTG"
seq1 <- unlist(as.matrix(strsplit(seq, ""), ncol = nchar(seq), nrow = nchar(seq)))
a <- matrix(ncol = nchar(seq), nrow = nchar(seq))
a[, 1] <- seq1
a[1, ] <- seq1
b <- matrix(ncol = length(a[1, ]), nrow = length(a[1, ]))
for (i in seq(nchar(seq))) {
  for (j in seq(nchar(seq))) {
    if (a[i, 1] == "A" & a[1, j] == "T" | a[i, 1] == "T" & a[1, j] == "A") {
      b[[i, j]] <- 2/3
    } else if (a[i, 1] == "C" & a[1, j] == "G" | a[i, 1] == "G" & a[1, j] == "C") {
      b[[i, j]] <- 1
    } else {
      b[[i, j]] <- 0
    }
  }
}
But I can't get this to work in Python.
I think you're making it harder than it needs to be.
import numpy as np

seq1 = 'AACCTTGG'
seq2 = 'ACGTACGT'

matrix = np.zeros((len(seq1), len(seq2)))

for y, c2 in enumerate(seq2):
    for x, c1 in enumerate(seq1):
        if c1 + c2 in ('TA', 'AT'):
            matrix[x, y] = 2/3
        elif c1 + c2 in ('CG', 'GC'):
            matrix[x, y] = 1.
print(matrix)
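A vectorized variant of the same idea, as a sketch that uses NumPy broadcasting instead of explicit loops (not part of the original answer):

import numpy as np

seq1 = np.array(list('AACCTTGG'))
seq2 = np.array(list('ACGTACGT'))

# pairwise character concatenation via broadcasting, e.g. 'A' and 'T' -> 'AT'
pairs = np.char.add(seq1[:, None], seq2[None, :])
matrix = np.where(np.isin(pairs, ['AT', 'TA']), 2/3,
                  np.where(np.isin(pairs, ['CG', 'GC']), 1.0, 0.0))
print(matrix)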

How can I add the condition AND using exclude?

I have this query:
MyTable.objects.filter(date=date).exclude(starthour__range=(start, end), endhour__range=(start, end))
But I want to exclude the rows where starthour__range=(start, end) AND endhour__range=(start, end), not OR. I think in this case OR is used.
Could you help me please?
Thank you very much!
This is a consequence of De Morgan's law [wiki], which specifies that ¬(x ∧ y) is ¬x ∨ ¬y. This thus means that the negation of x and y is not x or not y. Indeed, if we take a look at the truth table:
 x | y | x ∧ y | ¬x | ¬y | ¬(x ∧ y) | ¬x ∨ ¬y
---+---+-------+----+----+----------+---------
 0 | 0 |   0   |  1 |  1 |     1    |    1
 0 | 1 |   0   |  1 |  0 |     1    |    1
 1 | 0 |   0   |  0 |  1 |     1    |    1
 1 | 1 |   1   |  0 |  0 |     0    |    0
So excluding items where both the starthour is in the (start, end) range and the endhour is in the (start, end) range is logically equivalent to allowing items where the starthour is not in the range, or where the endhour is not in the range.
Using and logic
You can thus make a disjunction in the .exclude(…) call to filter out items that satisfy one of the two conditions, or thus retain objects that satisfy neither of the two conditions:
from django.db.models import Q

MyTable.objects.filter(date=date).exclude(
    Q(starthour__range=(start, end)) | Q(endhour__range=(start, end))
)
Overlap logic
Based on your query however, you are looking for overlap, not for such range checks. It is not sufficient to validate only the starthour and the endhour if you want to check whether two ranges overlap. Indeed, imagine that an event starts at 08:00 and ends at 18:00, and you filter for the range 09:00 to 17:00; then neither the starthour nor the endhour is in the range, but the events still overlap.
Two ranges [s1, e1] and [s2, e2] do not overlap if s1 ≥ e2 or s2 ≥ e1. The negation, i.e. the condition under which the two overlap, is thus s1 < e2 and s2 < e1. We can therefore exclude the items that overlap with:
# records that do not overlap with the (start, end) range
MyTable.objects.filter(date=date).exclude(
    starthour__lt=end, endhour__gt=start
)
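Conversely, keeping only the records that do overlap the (start, end) window is the direct filter form of the same condition; a sketch against the same MyTable query from the question, with the two lookups annotated with the inequalities they encode:

# records that DO overlap with the (start, end) range
MyTable.objects.filter(
    date=date,
    starthour__lt=end,   # s1 < e2
    endhour__gt=start,   # s2 < e1
)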

Nearest neighbors in a given range

I faced the problem of quickly finding the nearest neighbors in a given range.
Example of dataset:
id | string | float
0 | AA | 0.1
12 | BB | 0.5
2 | CC | 0.3
102| AA | 1.1
33 | AA | 2.8
17 | AA | 0.5
For each line, print the number of lines satisfying the following conditions:
string field is equal to the current one
float field is at most the current float and no more than del below it (current float − float ≤ del)
For this example with del = 1.5:
id | count
0 | 0
12 | 0
2 | 0
102| 2 (rows with the same string are id=0, 33, 17, but only id=0 and 17 qualify: 1.1−1.5 ≤ 0.1 and 1.1−1.5 ≤ 0.5)
33 | 0 (rows with the same string are id=0, 102, 17, but 2.8−1.5 is greater than 0.1, 1.1 and 0.5)
17 | 1
To solve this problem I used the BallTree class with a custom metric, but it takes a very long time due to a reverse tree walk (on a large dataset).
Can someone suggest other solutions, or how to bring the speed of a custom metric closer to the speed of the built-in metrics from sklearn.neighbors.DistanceMetric?
My code:
from sklearn.neighbors import BallTree

def distance(x, y):
    if x[0] == y[0] and x[1] > y[1]:
        return x[1] - y[1]
    else:
        return x[1] + y[1]

# X is the data array; the radius is called delta here because 'del' is a reserved word in Python
tree2 = BallTree(X, leaf_size=X.shape[0], metric=distance)
mas = tree2.query_radius(X, r=delta, count_only=True)
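For what it's worth, one vectorized alternative (a sketch, not taken from the original post): sort each string group by the float column and use searchsorted to count, for every row, how many values of the same group fall into the window below it.

import numpy as np
import pandas as pd

df = pd.DataFrame({'id':     [0, 12, 2, 102, 33, 17],
                   'string': ['AA', 'BB', 'CC', 'AA', 'AA', 'AA'],
                   'float':  [0.1, 0.5, 0.3, 1.1, 2.8, 0.5]})
delta = 1.5

def count_in_window(group):
    vals = np.sort(group['float'].values)
    # for each value v, count group members with float in [v - delta, v]
    upper = np.searchsorted(vals, group['float'].values, side='right')
    lower = np.searchsorted(vals, group['float'].values - delta, side='left')
    return pd.Series(upper - lower - 1, index=group.index)  # -1 removes the row itself

df['count'] = df.groupby('string', group_keys=False).apply(count_in_window)
print(df[['id', 'count']])  # matches the expected counts above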

Comparing a value from one dataframe with values from columns in another dataframe and getting the data from a third column

The title is a bit confusing, but I'll do my best to explain my problem here. I have 2 pandas dataframes, a and b:
>> print a
id | value
1 | 250
2 | 150
3 | 350
4 | 550
5 | 450
>> print b
low | high | class
100 | 200 | 'A'
200 | 300 | 'B'
300 | 500 | 'A'
500 | 600 | 'C'
I want to create a new column called class in table a that contains the class of the value in accordance with table b. Here's the result I want:
>> print a
id | value | class
1 | 250 | 'B'
2 | 150 | 'A'
3 | 350 | 'A'
4 | 550 | 'C'
5 | 450 | 'A'
I have the following code written that sort of does what I want:
a['class'] = pd.Series()
for i in range(len(a)):
    val = a['value'][i]
    cl = b['class'][(b['low'] <= val) &
                    (b['high'] >= val)].iat[0]
    a['class'].set_value(i, cl)
Problem is, this is quick for tables with a length of 10 or so, but I am trying to do this with a table size of 100,000+ for both a and b. Is there a quicker way to do this, using some function/attribute in pandas?
Here is a way to do a range join inspired by @piRSquared's solution:
import numpy as np
import pandas as pd

A = a['value'].values
bh = b.high.values
bl = b.low.values

i, j = np.where((A[:, None] >= bl) & (A[:, None] <= bh))

pd.DataFrame(
    np.column_stack([a.values[i], b.values[j]]),
    columns=a.columns.append(b.columns)
)
Output:
id value low high class
0 1 250 200 300 'B'
1 2 150 100 200 'A'
2 3 350 300 500 'A'
3 4 550 500 600 'C'
4 5 450 300 500 'A'
Here's a solution that is admittedly less elegant than using Series.searchsorted, but it runs super fast!
I pull the data out of the pandas DataFrames, convert it to lists, and then use np.where to populate a variable called "aclass" where the conditions are satisfied (in brute-force for loops). Then I write "aclass" back to the original dataframe a.
The evaluation time was 0.07489705 s, so it's pretty fast, even with 200,000 data points!
import time
import numpy as np

# create 200,000 fake a data points
avalue = 100 + 600*np.random.random(200000)  # assuming you extracted this from a with avalue = np.array(a['value'])
blow = [100, 200, 300, 500]    # assuming you extracted this from b with list(b['low'])
bhigh = [200, 300, 500, 600]   # assuming you extracted this from b with list(b['high'])
bclass = ['A', 'B', 'A', 'C']  # assuming you extracted this from b with list(b['class'])

aclass = [[]]*len(avalue)  # initialize aclass

start_time = time.time()  # this is just for timing the execution
for i in range(len(blow)):
    for j in np.where((avalue >= blow[i]) & (avalue <= bhigh[i]))[0]:
        aclass[j] = bclass[i]

# add the class column to the original a DataFrame
a['class'] = aclass
print("--- %s seconds ---" % np.round(time.time() - start_time, decimals=8))
