I need to get all descendant points of links represented as side_a - side_b pairs (in one dataframe), following each side_a until its end_point (given in another dataframe) is reached. So:
df1:
side_a side_b
a b
b c
c d
k l
l m
l n
p q
q r
r s
df2:
side_a end_point
a c
b c
c c
k m
k n
l m
l n
p s
q s
r s
The point is to get, for each side_a value, all points visited until the end_point from df2 for that value is reached.
If a value has two end_point values (like "k" does), then there should be two lists.
I have some code, but it isn't written with this approach: it drops all rows from df1 where df1['side_a'] == df2['end_point'], and that causes certain problems. If someone wants me to post the code, I will, of course.
The desired output would be something like this:
side_a end_point
a [b, c]
b [c]
c [c]
k [l, m]
k [l, n]
l [m]
l [n]
p [q, r, s]
q [r, s]
r [s]
And one more thing: if both sides are the same (as in c - c), that point doesn't need to be listed at all; I can append it later, whichever is easier.
import pandas as pd

def get_child_list(df, parent_id):
    list_of_children = []
    list_of_children.append(df[df['side_a'] == parent_id]['side_b'].values)
    for c_, r_ in df[df['side_a'] == parent_id].iterrows():
        if r_['side_b'] != parent_id:
            list_of_children.append(get_child_list(df, r_['side_b']))
    # flatten the list of arrays/lists collected above
    list_of_children = [item for sublist in list_of_children for item in sublist]
    return list_of_children

new_df = pd.DataFrame(columns=['side_a', 'list_of_children'])
for index, row in df1.iterrows():
    temp_df = pd.DataFrame(columns=['side_a', 'list_of_children'])
    temp_df['list_of_children'] = pd.Series(get_child_list(df1, row['side_a']))
    temp_df['side_a'] = row['side_a']
    new_df = pd.concat([new_df, temp_df])
So the problem with this code is that it only works if I drop the rows where side_a equals end_point from df2. I don't know how to implement the condition that, once an end_point from df2 shows up in the side_b column, the traversal should stop and go no further.
Any help or hint is welcomed here, truly.
Thanks in advance.
You can use the networkx library and graphs:
import networkx as nx

G = nx.from_pandas_edgelist(df1, source='side_a', target='side_b')
df2.apply(lambda x: [nx.shortest_path(G, x.side_a, x.end_point)[0],
                     nx.shortest_path(G, x.side_a, x.end_point)[1:]], axis=1)
Output:
side_a end_point
0 a [b, c]
1 b [c]
2 c []
3 k [l, m]
4 k [l, n]
5 l [m]
6 l [n]
7 p [q, r, s]
8 q [r, s]
9 r [s]
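A side note on the apply call above: depending on your pandas version, returning a plain list does not always expand into two columns. Wrapping the row result in a pd.Series makes the column layout explicit; this is a sketch of the same logic, and the column labels are my own choice:
import pandas as pd
import networkx as nx

G = nx.from_pandas_edgelist(df1, source='side_a', target='side_b')
res = df2.apply(lambda x: pd.Series({
    'side_a': x.side_a,
    'end_point': nx.shortest_path(G, x.side_a, x.end_point)[1:]}), axis=1)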
Your rules are inconsistent and your definitions are unclear, so you may need to add some constraints here and there; it is not entirely clear exactly what you are asking. By organizing the data structure to fit the problem and building a more robust traversal function (shown below), it becomes easier to add or edit constraints as needed and to solve the problem completely.
Transform the df to a dict to better represent a tree structure
This problem is a lot simpler if you transform the data structure into something more intuitive for the problem, instead of trying to solve it in the context of the current structure.
## Example dataframe
import pandas as pd

df = pd.DataFrame({'side_a': ['a', 'b', 'c', 'k', 'l', 'l', 'p', 'q', 'r'],
                   'side_b': ['b', 'c', 'd', 'l', 'm', 'n', 'q', 'r', 's']})

## Instantiate a blank tree with every item
all_items = set(list(df['side_a']) + list(df['side_b']))
tree = {ii: set() for ii in all_items}

## Populate the tree with each row
for idx, row in df.iterrows():
    tree[row['side_a']].add(row['side_b'])
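For the example dataframe above, the populated tree maps each node to its direct children; printed as a quick check (key and set ordering may vary):
print(tree)
# {'a': {'b'}, 'b': {'c'}, 'c': {'d'}, 'd': set(),
#  'k': {'l'}, 'l': {'m', 'n'}, 'm': set(), 'n': set(),
#  'p': {'q'}, 'q': {'r'}, 'r': {'s'}, 's': set()}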
Traverse the Tree
This is much more straightforward now that the data structure is intuitive. Any standard depth-first search algorithm with path saving will do the trick. I modified the one in the link to work with this example.
Edit: Reading again, it looks like you have a condition for terminating the search at end_point (you need to be clearer in your question about what is input and what is output). You can adjust dfs_paths to take a target argument, dfs_paths(tree, target, root), and change the termination condition so that it returns only the correct paths.
## Standard DFS pathfinder
def dfs_paths(tree, root):
    stack = [(root, [root])]
    while stack:
        (node, path) = stack.pop()
        for nextNode in tree[node] - set(path):
            # Termination condition.
            ### I set it to terminate the search at the end of each path.
            ### You can edit the termination condition to fit the
            ### constraints of your goal.
            if not tree[nextNode]:
                yield set(path + [nextNode]) - {root}
            else:
                stack.append((nextNode, path + [nextNode]))
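For example, with the tree built above (a quick check; set ordering may vary):
print(list(dfs_paths(tree, 'k')))   # [{'l', 'm'}, {'l', 'n'}]
print(list(dfs_paths(tree, 'a')))   # [{'b', 'c', 'd'}]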
Build a dataframe from the generators we yielded
If you're not super comfortable with generators, you can structure the DFS traversal so that it outputs a list instead of a generator:
set_a = []
end_points = []
gen_dict = [{ii: dfs_paths(tree, ii)} for ii in all_items]
for gen in gen_dict:
    for row in list(gen.values()).pop():
        set_a.append(list(gen.keys()).pop())
        end_points.append(row)

## To dataframe
df_2 = pd.DataFrame({'set_a': set_a, 'end_points': end_points}).sort_values('set_a')
Output
df_2[['set_a','end_points']]
set_a end_points
a {b, c, d}
b {c, d}
c {d}
k {n, l}
k {m, l}
l {n}
l {m}
p {s, r, q}
q {s, r}
r {s}
If you're OK with an extra import, this can be posed as a path problem on a graph and solved in a handful of lines using NetworkX:
import networkx

g = networkx.DiGraph(zip(df1.side_a, df1.side_b))
outdf = df2.apply(lambda row: [row.side_a,
                               set().union(*networkx.all_simple_paths(g, row.side_a, row.end_point))
                               - {row.side_a}],
                  axis=1)
outdf looks like this. Note that this contains sets instead of lists as in your desired output - this allows all the paths to be combined in a simple way.
side_a end_point
0 a {c, b}
1 b {c}
2 c {}
3 k {l, m}
4 k {l, n}
5 l {m}
6 l {n}
7 p {r, q, s}
8 q {r, s}
9 r {s}
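If you prefer lists, as in the desired output, you could build the frame explicitly and sort each set. A sketch under the same assumptions (df1 and df2 as defined in the question):
import pandas as pd
import networkx as nx

g = nx.DiGraph(zip(df1.side_a, df1.side_b))
rows = [(r.side_a,
         sorted(set().union(*nx.all_simple_paths(g, r.side_a, r.end_point)) - {r.side_a}))
        for r in df2.itertuples()]
out = pd.DataFrame(rows, columns=['side_a', 'end_point'])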
Related
This is homework that was given to me, and I have been struggling to write the solution.
Write a program that finds the longest adjacent sequence of colors in a matrix (2D grid). Colors are represented by ‘R’, ‘G’, ‘B’ characters (respectively Red, Green and Blue).
You will be provided with 4 individual test cases, which must also be included in your solution.
An example of your solution root directory should look like this:
solutionRootDir
| - (my solution files and folders)
| - tests/
| - test_1
| - test_2
| - test_3
| - test_4
Individual test case input format:
First you should read two whitespace-separated 32-bit integers from the provided test case
that represent the size (rows and cols) of the matrix.
Next you should read rows newline-separated lines of 8-bit characters.
Your program should find the longest adjacent sequence (diagonals are not counted as adjacent fields)
and print its length to standard output.
NOTE: in case of several sequences with the same length – simply print their equal length.
test_1
Provided input:
3 3
R R B
G G R
R B G
Expected Output:
2
test_2
Provided input:
4 4
R R R G
G B R G
R G G G
G G B B
Expected Output:
7
test_3
Provided input:
6 6
R R B B B B
B R B B G B
B G G B R B
B B R B G B
R B R B R B
R B B B G B
Expected Output:
22
test_4
Provided input:
1000 1000
1000 rows of 1000 R’s
Expected Output:
1000000
Your program entry point should accept from one to four additional parameters.
Those parameters indicate the names of the test cases that your program should run.
• Example 1: ./myprogram test_1 test_3
• Example 2: ./myprogram test_1 test_2 test_3 test_4
• you can assume that the input from the user will be correct (no validation is required)
import numpy as np

a = int(input("Enter rows: "))
b = int(input("Enter columns: "))
rgb = ["R", "G", "B"]
T = [[0 for col in range(b)] for row in range(a)]
for row in range(a):
    for col in range(b):
        T[row][col] = np.random.choice(rgb)
for r in T:
    for c in r:
        print(c, end=" ")
    print()

def solution(t):
    rows: int = len(t)
    cols: int = len(t[0])
    longest = np.empty((rows, cols))
    longest_sean = 1
    for i in range(rows - 1, -1, -1):
        for j in range(cols - 1, -1, -1):
            target = t[i][j]
            current = 1
            for ii in range(i, rows):
                for jj in range(j, cols):
                    length = 1
                    if target == t[ii][jj]:
                        length += longest[ii][jj]
                    current = max(current, length)
            longest[i][j] = current
            longest_sean = max(current, longest_sean)
    return longest_sean

print(solution(T))
In order to get the parameters from the console execution you have to use sys.argv, so from sys import argv. Then convert your text file to Python lists like this:
def load(file):
    with open(file + ".txt") as f:
        data = f.readlines()
    res = []
    for row in data:
        res.append([])
        for element in row:
            if element != "\n" and element != " ":
                res[-1].append(element)
    return res
which will create a two-dimensional list containing "R", "B" and "G". Then you can simply look for the largest area of one value using this function:
def findLargest(data):
    visited = set()   # a set makes the membership checks below fast
    area = []
    length = 0
    movement = [(1, 0), (0, 1), (-1, 0), (0, -1)]

    def recScan(x, y, scanArea):
        visited.add((x, y))
        scanArea.append((x, y))
        for dx, dy in movement:
            newX, newY = x + dx, y + dy
            if newX >= 0 and newY >= 0 and newX < len(data) and newY < len(data[newX]):
                if data[x][y] == data[newX][newY] and (newX, newY) not in visited:
                    recScan(newX, newY, scanArea)
        return scanArea

    for x in range(len(data)):
        for y in range(len(data[x])):
            if (x, y) not in visited:
                newArea = recScan(x, y, [])
                if len(newArea) > length:
                    length = len(newArea)
                    area = newArea
    return length, area
whereby recScan will check all adjacent fields that haven't been visited yet. Then just call the functions like this:
if __name__ == "__main__":
    for file in argv[1:]:
        data = load(file)
        print(findLargest(data))
The argv[1:] is required because the first argument passed to Python is the file you want to execute. My data structure is:
main.py
test_1.txt
test_2.txt
test_3.txt
test_4.txt
and test_1 through test_4 look like this, just with other values.
R R B B B B
B R B B G B
B G G B R B
B B R B G B
R B R B R B
R B B B G B
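One caveat this answer does not cover: on the 1000x1000 all-R test case, recScan recurses once per cell and will exceed Python's default recursion limit. Here is a sketch of the same flood fill with an explicit stack instead of recursion; it assumes visited is a set, as in findLargest above:
def iterScan(data, x, y, visited):
    # same flood fill as recScan, but with an explicit stack instead of recursion
    movement = [(1, 0), (0, 1), (-1, 0), (0, -1)]
    stack = [(x, y)]
    visited.add((x, y))
    scanArea = []
    while stack:
        cx, cy = stack.pop()
        scanArea.append((cx, cy))
        for dx, dy in movement:
            newX, newY = cx + dx, cy + dy
            if 0 <= newX < len(data) and 0 <= newY < len(data[newX]):
                if data[cx][cy] == data[newX][newY] and (newX, newY) not in visited:
                    visited.add((newX, newY))
                    stack.append((newX, newY))
    return scanArea
Inside findLargest, the recScan(x, y, []) call would then become iterScan(data, x, y, visited).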
Maybe this is more of a theoretical language question than a pandas question per se. I have a set of function extensions that I'd like to "attach" to e.g. a pandas DataFrame without explicitly calling utility functions and passing the DataFrame as an argument, i.e. to have the syntactic sugar. Extending the pandas DataFrame is also not a choice because of the inaccessible types needed to define and chain the DataFrame constructor, e.g. Axes and Dtype.
In Scala one can define an implicit class to attach functionality to an otherwise unavailable or too-complex-to-initialize object, e.g. the String type, which can't be extended in Java AFAIR. For example, the following attaches a function to the String type dynamically: https://www.oreilly.com/library/view/scala-cookbook/9781449340292/ch01s11.html
scala> implicit class StringImprovements(s: String) {
def increment = s.map(c => (c + 1).toChar)
}
scala> val result = "HAL".increment
result: String = IBM
Likewise, I'd like to be able to do:
# somewhere in scope
def lexi_sort(df):
    """Lexicographically sorts the input pandas DataFrame by index and columns."""
    df.sort_index(axis=0, level=df.index.names, inplace=True)
    df.sort_index(axis=1, level=df.columns.names, inplace=True)
    return df

df = pd.DataFrame(...)

# some magic and then ...
df.lexi_sort()
One valid possibility is to use the Decorator Pattern but I was wondering whether Python offered a less boiler-plate language alternative like Scala does.
In pandas, you can do:
def lexi_sort(df):
    """Lexicographically sorts the input pandas DataFrame by index and columns."""
    df.sort_index(axis=0, level=df.index.names, inplace=True)
    df.sort_index(axis=1, level=df.columns.names, inplace=True)
    return df

pd.DataFrame.lexi_sort = lexi_sort

df = pd.read_csv('dummy.csv')
df.lexi_sort()
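For completeness: pandas also ships a documented extension mechanism for exactly this, the accessor registration API. A minimal sketch (the accessor name "ext" is arbitrary):
import pandas as pd

@pd.api.extensions.register_dataframe_accessor("ext")
class ExtAccessor:
    def __init__(self, pandas_obj):
        self._df = pandas_obj

    def lexi_sort(self):
        """Lexicographically sorts the wrapped DataFrame by index and columns."""
        self._df.sort_index(axis=0, level=self._df.index.names, inplace=True)
        self._df.sort_index(axis=1, level=self._df.columns.names, inplace=True)
        return self._df
Usage is then df.ext.lexi_sort(). Unlike direct monkey-patching, the accessor keeps the added methods in their own namespace.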
I guess for other objects you can define a method within the class to achieve the same outcome.
class A():
    def __init__(self, df: pd.DataFrame):
        self.df = df
        self.n = 0

    def lexi_sort(self):
        """Lexicographically sorts the wrapped pandas DataFrame by index and columns."""
        self.df.sort_index(axis=0, level=self.df.index.names, inplace=True)
        self.df.sort_index(axis=1, level=self.df.columns.names, inplace=True)
        return self.df

    def add_one(self):
        self.n += 1

a = A(df)
print(a.n)
a.add_one()
print(a.n)
Subclass DataFrame and don't do anything but add your feature.
import pandas as pd
import random, string

class Foo(pd.DataFrame):
    def lexi_sort(self):
        """Lexicographically sorts the DataFrame by index and columns."""
        self.sort_index(axis=0, level=self.index.names, inplace=True)
        self.sort_index(axis=1, level=self.columns.names, inplace=True)

nrows = 10
columns = ['b', 'd', 'a', 'c']
rows = [random.sample(string.ascii_lowercase, len(columns)) for _ in range(nrows)]
index = random.sample(string.ascii_lowercase, nrows)
df = Foo(rows, index, columns)
>>> df
b d a c
w n g u m
x t e q k
n u x j s
u s t u b
f g t e j
j w b h j
h v o p a
a q i l b
g p i k u
o q x p t
>>> df.lexi_sort()
>>> df
a b c d
a l q b i
f e g j t
g k p u i
h p v a o
j h w j b
n j u s x
o p q t x
u u s b t
w u n m g
x q t k e
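One caveat with subclassing that the answer does not show: many pandas operations construct a new frame, and they will hand back a plain DataFrame rather than Foo unless the subclass also overrides the _constructor property (pandas' documented subclassing hook). A minimal sketch:
import pandas as pd

class Foo(pd.DataFrame):
    @property
    def _constructor(self):
        # tell pandas to keep returning Foo from operations that copy/slice
        return Foo

    def lexi_sort(self):
        """Lexicographically sorts the DataFrame by index and columns."""
        self.sort_index(axis=0, level=self.index.names, inplace=True)
        self.sort_index(axis=1, level=self.columns.names, inplace=True)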
It took 22 minutes on a Core 2 Duo with no result.
for a in range(10):
    for b in range(10):
        while b!=a:
            for c in range(10):
                while c!= a or b:
                    for d in range(10):
                        while d!=a or b or c:
                            for e in range(10):
                                while e!=a or b or c or d:
                                    for f in range(10):
                                        while f!=a or b or c or d or e:
                                            for g in range(10):
                                                while g!=a or b or c or d or e or f:
                                                    for h in range(10):
                                                        while h!= a or b or c or d or e or f or g:
                                                            for i in range(10):
                                                                while i!= a or b or c or d or e or f or g or h:
                                                                    if (a+b+c==15 and a+d+g==15 and d+e+f==15 and g+h+i==15 and b+e+h==15 and c+f+i==15 and a+e+i==15 and c+e+g==15):
                                                                        print(a,b,c,d,e,f,g,h,i)
This is ugly, but the error is that you cannot have a comparison like
while e!=a or b or c or d:
instead, you should write
while e!=a and e!=b and e!=c and e!=d:
Please learn how to use arrays/lists and re-think the problem.
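Incidentally, the corrected condition is shorter as a membership test, which stays readable as more variables accumulate:
# equivalent to: e != a and e != b and e != c and e != d
if e not in (a, b, c, d):
    ...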
@Selcuk has identified your problem; here is an improved approach:
# !!! Comments are good !!!

# Generate a magic square like
#
#   a b c
#   d e f
#   g h i
#
# where each row, col, and diag adds to 15

values = list(range(10))   # problem domain
unused = set(values)       # compare against a set
                           # instead of chained "a != b and a != c"
SUM = 15

for a in values:
    unused.remove(a)
    for b in values:
        if b in unused:
            unused.remove(b)
            # value for c is forced by values of a and b
            c = SUM - a - b
            if c in unused:
                unused.remove(c)
                #
                # etc
                #
                unused.add(c)
            unused.add(b)
    unused.add(a)
This will work, but will still be ugly.
In the long run you would be better off using a constraint solver like python-constraint:
from constraint import Problem, AllDifferentConstraint

values = list(range(10))
SUM = 15

prob = Problem()

# set up variables
prob.addVariables("abcdefghi", values)

# only use each value once
prob.addConstraint(AllDifferentConstraint())

# row sums
prob.addConstraint(lambda a, b, c: a+b+c == SUM, "abc")
prob.addConstraint(lambda d, e, f: d+e+f == SUM, "def")
prob.addConstraint(lambda g, h, i: g+h+i == SUM, "ghi")

# col sums
prob.addConstraint(lambda a, d, g: a+d+g == SUM, "adg")
prob.addConstraint(lambda b, e, h: b+e+h == SUM, "beh")
prob.addConstraint(lambda c, f, i: c+f+i == SUM, "cfi")

# diag sums
prob.addConstraint(lambda a, e, i: a+e+i == SUM, "aei")
prob.addConstraint(lambda c, e, g: c+e+g == SUM, "ceg")

for sol in prob.getSolutionIter():
    print("{a} {b} {c}\n{d} {e} {f}\n{g} {h} {i}\n\n".format(**sol))
Note that this returns 8 solutions which are all rotated and mirrored versions of each other. If you only want unique solutions, you can add a constraint like
prob.addConstraint(lambda a,c,g: a<c<g, "acg")
which forces a unique ordering.
Also note that fixing the values of any three corners, or the center and any two non-opposed corners, forces all the remaining values. This leads to a simplified solution:
values = set(range(1, 10))
SUM = 15

for a in values:
    for c in values:
        if a < c:              # for unique ordering
            for g in values:
                if c < g:      # for unique ordering
                    b = SUM - a - c
                    d = SUM - a - g
                    e = SUM - c - g
                    f = SUM - d - e
                    h = SUM - b - e
                    i = SUM - a - e
                    if {a, b, c, d, e, f, g, h, i} == values:
                        print(
                            "{} {} {}\n{} {} {}\n{} {} {}\n\n"
                            .format(a, b, c, d, e, f, g, h, i)
                        )
On my machine this runs in 168 µs (approximately 1/6000th of a second).
So, I have a huge input file that looks like this: (you can download here)
1. FLO8;PRI2
2. FLO8;EHD3
3. GRI2;BET2
4. HAL4;AAD3
5. PRI2;EHD3
6. QLN3;FZF1
7. QLN3;ABR5
8. FZF1;ABR5
...
See it as a two-column table, where the element before ";" points to the element after ";".
I want to iteratively print simple strings that show the three elements constituting a feedforward loop.
The example numbered list from above would output:
"FLO8 PRI2 EHD3"
"QLN3 FZF1 ABR5"
...
Explaining the first output line as a feedforward loop:
A -> B (FLO8;PRI2)
B -> C (PRI2;EHD3)
A -> C (FLO8;EHD3)
Only the circled one from this link
So, I have this, but it is terribly slow... Any suggestions for a faster implementation?
import csv

TF = []
TAR = []

# READING THE FILE
with open("MYFILE.tsv") as tsv:
    for line in csv.reader(tsv, delimiter=";"):
        TF.append(line[0])
        TAR.append(line[1])

# I WANT A BETTER WAY TO RUN THIS.. All these for loops are killing me
for i in range(len(TAR)):
    for j in range(len(TAR)):
        if (TAR[j] != TF[j] and TAR[i] != TF[i] and TAR[i] != TAR[j] and TF[j] == TF[i]):
            for k in range(len(TAR)):
                if (not (k == i or k == j) and TF[k] == TAR[j] and TAR[k] == TAR[i]):
                    print("FFL: " + TF[i] + " " + TAR[j] + " " + TAR[i])
NOTE: I don't want self-loops...from A -> A, B -> B or C -> C
I use a dict of sets to allow very fast lookups, like so:
Edit: prevented self-loops:
from collections import defaultdict

INPUT = "RegulationTwoColumnTable_Documented_2013927.tsv"

# load the data as { "ABF1": set(["ABF1", "ACS1", "ADE5,7", ... ]) }
data = defaultdict(set)
with open(INPUT) as inf:
    for line in inf:
        a, b = line.rstrip().split(";")
        if a != b:   # no self-loops
            data[a].add(b)

# find all triplets such that A -> B -> C and A -> C
# (.get avoids inserting empty entries for nodes that are never
#  sources while we iterate over data.items())
found = []
for a, bs in data.items():
    bint = bs.intersection
    for b in bs:
        for c in bint(data.get(b, ())):
            found.append("{} {} {}".format(a, b, c))
On my machine, this loads the data in 0.36s and finds 1,933,493 solutions in 2.90s; results look like
['ABF1 ADR1 AAC1',
'ABF1 ADR1 ACC1',
'ABF1 ADR1 ACH1',
'ABF1 ADR1 ACO1',
'ABF1 ADR1 ACS1',
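As a quick sanity check, the same logic on the small example from the question (inline pairs instead of the input file, purely for illustration) reproduces the two expected loops:
from collections import defaultdict

sample = ["FLO8;PRI2", "FLO8;EHD3", "GRI2;BET2", "HAL4;AAD3",
          "PRI2;EHD3", "QLN3;FZF1", "QLN3;ABR5", "FZF1;ABR5"]
data = defaultdict(set)
for pair in sample:
    a, b = pair.split(";")
    if a != b:
        data[a].add(b)

found = []
for a, bs in data.items():
    for b in bs:
        for c in bs.intersection(data.get(b, ())):
            found.append("{} {} {}".format(a, b, c))

print(found)   # ['FLO8 PRI2 EHD3', 'QLN3 FZF1 ABR5'] (order may vary)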
Edit2: not sure this is what you want, but if you need A -> B and A -> C and B -> C but not B -> A or C -> A or C -> B, you could try
found = []
for a, bs in data.items():
    bint = bs.intersection
    for b in bs:
        if a not in data.get(b, ()):
            for c in bint(data.get(b, ())):
                if a not in data.get(c, ()) and b not in data.get(c, ()):
                    found.append("{} {} {}".format(a, b, c))
but this still returns 1,380,846 solutions.
Test set
targets = {'A':['B','C','D'],'B':['C','D'],'C':['A','D']}
And the function
for i in targets.keys():
    try:
        for y in targets.get(i):
            # compare the dict values of two keys and save the overlapping ones to diff
            diff = list(set(targets.get(i)) & set(targets.get(y)))
            # if there is at least one element overlapping from key values i and y,
            # take up those elements and style them with some arrows
            if (len(diff) > 0 and not i == y):
                feed = i + '->' + y + '-->'
                forward = '+'.join(diff)
                feedForward = feed + forward
                print(feedForward)
    except:
        pass
The output is
A->B-->C+D
A->C-->D
C->A-->D
B->C-->D
Greetings to the Radboud Computational Biology course, Robin (q1/2016).
I am reading a file using pandas.
import pandas
d = pandas.read_csv("data.csv")
data.csv
A B C
d 408.56087701 87.26907024
b 277.95015117 75.19386881
b 385.41416264 84.73488504
b 380.31630662 71.23504808
b 392.10729207 83.80720357
b 399.70877373 76.59640833
b 350.93124656 79.34979059
b 330.09702335 79.37166555
back = [399.70877373, 385.41416264]
I am trying to sum the values of C where there is a match between "back" and column B:
s = 0
for indj, j in enumerate(back):
    for indi, i in enumerate(d['B']):
        if j == i:
            s = s + d['C'][indi]
I am trying to implement this using reduce:
reduce(lambda x, y: x + y, dat ..)
but I couldn't find a way to add a condition to filter the values.
I just solved this using
the_sum = sum(x[2] for x in data if x[1] in back)
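For reference, the filtering condition can also be expressed with reduce over a filtered generator, or directly in pandas. A sketch, assuming d was loaded with pandas.read_csv and has the columns A, B, C shown above, and data is the same iterable of rows used in the line above:
from functools import reduce

# reduce, with the condition applied in a generator expression
the_sum = reduce(lambda acc, x: acc + x, (row[2] for row in data if row[1] in back), 0)

# pandas equivalent: filter column B against back, then sum column C
the_sum = d.loc[d['B'].isin(back), 'C'].sum()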