I have a bit of a logical challenge. I have a single table in Excel that contains an identifier column and a cross reference column. There can be multiple rows for a single identifier, which indicates multiple cross references (see the basic example below).
Any record that ends in the letter "X" indicates that it is a cross reference, and not an actual identifier. I need to generate a list of the cross references for each identifier, but trace each one down to the actual cross reference identifier. So using "A1" as an example from the table below, I would need the list returned as follows: "A2,A3,B1,B3". Notice there are no identifiers ending in "X" in the list; they have been traced down to the actual source records through the table.
Any ideas or help would be much appreciated. I'm using Python and xlrd to read the table.
t = [
    ["a1", "a2"],
    ["a1", "a3"],
    ["a1", "ax"],
    ["ax", "b1"],
    ["ax", "bx"],
    ["bx", "b3"]
]
import itertools

def find_matches(t, key):
    # Keep real references; recurse through any reference that ends in "x".
    return list(itertools.chain(*[[v] if not v.endswith("x") else find_matches(t, v)
                                  for k, v in t if k == key]))

print(find_matches(t, "a1"))
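To build t from the workbook itself, a minimal xlrd sketch might look like this (the filename, header row and column layout are assumptions; adjust them to the real sheet). Lower-casing the cells also maps the "X" suffix from the question onto the lowercase "x" used above:

import xlrd

book = xlrd.open_workbook("cross_refs.xls")   # hypothetical filename
sheet = book.sheet_by_index(0)
# Column 0 = identifier, column 1 = cross reference; skip one header row.
t = [(str(sheet.cell_value(r, 0)).lower(), str(sheet.cell_value(r, 1)).lower())
     for r in range(1, sheet.nrows)]

print(find_matches(t, "a1"))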
You could treat your list as the edge list of a graph and walk it. Something like:
t = [
    ["a1", "a2"],
    ["a1", "a3"],
    ["a1", "ax"],
    ["ax", "b1"],
    ["ax", "bx"],
    ["bx", "b3"]
]
class MyGraph:
    def __init__(self, adjacency_table):
        self.table = adjacency_table
        self.graph = {}
        # Build an adjacency list: from_node -> list of to_nodes.
        for from_node, to_node in adjacency_table:
            if from_node in self.graph:
                self.graph[from_node].append(to_node)
            else:
                self.graph[from_node] = [to_node]
        print(self.graph)

    def find_leaves(self, v):
        seen = {v}  # note: set(v) would split the string into characters
        def search(v):
            for vertex in self.graph[v]:
                if vertex in seen:
                    continue
                seen.add(vertex)
                if vertex in self.graph:
                    for p in search(vertex):
                        yield p
                else:
                    yield vertex
        for p in search(v):
            yield p

print(list(MyGraph(t).find_leaves("a1")))
I have been trying to figure out an issue in which my program does the topological sort, but not in the format I'm trying to get. For instance, if I give it the input:
Learn Python
Understand higher-order functions
Learn Python
Read the Python tutorial
Do assignment 1
Learn Python
It should output:
Read the python tutorial
Understand higher-order functions
Learn Python
Do assignment 1
Instead, the first two lines come out swapped. For some of my other test cases this happens as well, with two seemingly random entries swapped. Here's my code:
import sys

graph = {}

def populate(name, dep):
    if name in graph:
        graph[name].append(dep)
    else:
        graph[name] = [dep]
    if dep not in graph:
        graph[dep] = []

def main():
    last = ""
    for line in sys.stdin:
        lne = line.strip()
        if last == "":
            last = lne
        else:
            populate(last, lne)
            last = ""

def topoSort(graph):
    sortedList = []  # result
    zeroDegree = []
    inDegree = {u: 0 for u in graph}
    for u in graph:
        for v in graph[u]:
            inDegree[v] += 1
    for i in inDegree:
        if inDegree[i] == 0:
            zeroDegree.append(i)
    while zeroDegree:
        v = zeroDegree.pop(0)
        sortedList.append(v)
        # selection sort for alphabetical sort
        for x in graph[v]:
            inDegree[x] -= 1
            if inDegree[x] == 0:
                zeroDegree.insert(0, x)
    sortedList.reverse()
    #for y in range(len(sortedList)):
    #    min = y
    #    for j in range(y+1, len(sortedList)):
    #        if sortedList[min] > sortedList[y]:
    #            min = j
    #    sortedList[y], sortedList[min] = sortedList[min], sortedList[y]
    return sortedList

if __name__ == '__main__':
    main()
    result = topoSort(graph)
    if len(result) == len(graph):
        print(result)
    else:
        print("cycle")
Any ideas as to why this may be occurring?
The elements within sets are not ordered, and the iteration order of your dictionaries is not alphabetical either. When you add elements, there is no guarantee about the order they come back out in, and I think that is why your sorting algorithm gives seemingly random results. I suspect it has to do with inDegree, but I didn't debug very much.
I can't offer you a specific fix for your code, but according to the wanted input and output it could look like this:
# Read pairs from stdin until Ctrl+D is pressed (on Linux) or EOF is reached.
graph = set()
while True:
    try:
        graph |= {(input().strip(), input().strip())}
    except EOFError:
        break

# Apply the topological sort and print it to stdout.
print("----")
while graph:
    z = {(a, b) for a, b in graph if not [1 for c, d in graph if b == c]}
    print("\n".join(sorted({b for a, b in z})
                    + sorted({a for a, b in z if not [1 for c, d in graph if a == d]})))
    graph -= z
The great advantage of Python (here 3.9.1) is how short the solution can be. Instead of lists I would use sets, because they are easier to edit: graph | {elements} adds items to the set and graph - {elements} removes them. Duplicates are ignored.
First, pairs are read from stdin with input() calls and collected into the graph set of tuples.
The line z = {...} filters the elements that can be printed in this round, which are then subtracted from the graph set.
The generated sets are unordered, so the output is turned into sorted lists before printing, joined with newlines.
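For reference, here is a separate sketch (not a patch of the code above) of Kahn's algorithm that breaks ties alphabetically by keeping the ready nodes in a heap; fed with the question's (task, prerequisite) pairs it prints the expected order:

import heapq

def topo_sort_alpha(pairs):
    # pairs: (task, prerequisite) tuples; prerequisites come out before the
    # tasks that need them, with ties broken alphabetically.
    deps = {}       # prerequisite -> tasks that depend on it
    in_degree = {}  # node -> number of unmet prerequisites
    for task, prereq in pairs:
        deps.setdefault(prereq, []).append(task)
        deps.setdefault(task, [])
        in_degree[task] = in_degree.get(task, 0) + 1
        in_degree.setdefault(prereq, 0)
    heap = [node for node, d in in_degree.items() if d == 0]
    heapq.heapify(heap)
    order = []
    while heap:
        node = heapq.heappop(heap)  # smallest name among the currently ready nodes
        order.append(node)
        for nxt in deps[node]:
            in_degree[nxt] -= 1
            if in_degree[nxt] == 0:
                heapq.heappush(heap, nxt)
    if len(order) != len(in_degree):
        raise ValueError("cycle")
    return order

print("\n".join(topo_sort_alpha([
    ("Learn Python", "Understand higher-order functions"),
    ("Learn Python", "Read the Python tutorial"),
    ("Do assignment 1", "Learn Python"),
])))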
I have a DataFrame with 2 columns: "emp" is the child column and "man" is the parent column. I need to count the total number of children (direct and indirect) for any given parent.
emp man
23ank(5*) 213raj(11*)
55man(5*) 213raj(11*)
2shu(1*) 23ank(5*)
7am(3*) 55man(5*)
9shi(0*) 55man(5*)
213raj(11*) 66sam(13*)
The solution I am looking for is if, for instance, I want the details related to 213raj(11*), then:
213raj(11*),23ank(5*),2shu(1*),55man(5*),7am(3*),9shi(0*)
and the total count for 213raj(11*) =5.
If I consider 66sam(13*) then:
66sam(13*),213raj(11*),23ank(5*),2shu(1*),55man(5*),7am(3*),9shi(0*)
and the total count for 66sam(13*) =6
I tried the code below but am not getting the required results:
kv = kvpp[['emp','man']]
kvp = dict(zip(kv.emp,kv.man))
parents = set()
children = {}
for c,p in kvp.items():
parents.add(p)
children[c] = p
def ancestors(p):
return (ancestors(children[p]) if p in children else []) + [p]
pp = []
for k in (set(children.keys()) - parents):
pp.append('/'.join(ancestors(k)))
In graph theory terms, your two columns form the edge list of a directed acyclic graph.
Here's a solution using the NetworkX graph theory library.
import networkx as nx

emp_to_man = [
    ('23ank(5*)', '213raj(11*)'),
    ('55man(5*)', '213raj(11*)'),
    ('2shu(1*)', '23ank(5*)'),
    ('7am(3*)', '55man(5*)'),
    ('9shi(0*)', '55man(5*)'),
    ('213raj(11*)', '66sam(13*)'),
]

# Create a directed graph from the edge list.
# Converting a 2-column DF into a digraph is as easy as
# `nx.DiGraph(list(df.values))`.
g = nx.DiGraph(emp_to_man)

for emp in sorted(g):  # For every employee (in sorted order for tidiness),
    # ... print the set of ancestors (in no particular order).
    # Should the edges run `man_to_emp` instead, you'd use `nx.descendants`.
    print(emp, nx.ancestors(g, emp))
This prints out
213raj(11*) {'55man(5*)', '7am(3*)', '2shu(1*)', '9shi(0*)', '23ank(5*)'}
23ank(5*) {'2shu(1*)'}
2shu(1*) set()
55man(5*) {'9shi(0*)', '7am(3*)'}
66sam(13*) {'213raj(11*)', '55man(5*)', '7am(3*)', '9shi(0*)', '2shu(1*)', '23ank(5*)'}
7am(3*) set()
9shi(0*) set()
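Since the question also asks for a total count per manager, the size of each ancestor set gives it directly (a small follow-up to the code above):

for man in sorted(g):
    # Number of direct plus indirect reports is the size of the ancestor set.
    print(man, len(nx.ancestors(g, man)))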
EDIT: In case performance is paramount, I'd heartily suggest the NetworkX approach. Based on a quick timeit test, finding all the employees is roughly 62 times faster than the Pandas-based code, and that's converting the DF into an NX network on every invocation.
EDIT 2: To my rather great surprise, a naïve set/defaultdict graph traversal is faster still -- 387 times faster than the Pandas code and 5 times faster than the Nx code above.
import collections

def dag_count_all_children():
    # df rows are (emp, man); after the unpacking below, `dag` maps each
    # manager to the set of their direct reports.
    dag = collections.defaultdict(set)
    for man, emp in df.values:
        dag[emp].add(man)
    out = {}
    for man in set(dag):
        # Breadth-first collection of all direct and indirect reports.
        found = set()
        open = {man}
        while open:
            emp = open.pop()
            open.update(dag[emp] - found)
            found.update(dag[emp])
        out[man] = found
    return out
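To turn that into the counts the question asks for, a dict comprehension over the result is enough (a small follow-up, assuming df is the DataFrame from the question):

counts = {man: len(emps) for man, emps in dag_count_all_children().items()}
print(counts['213raj(11*)'])  # 5
print(counts['66sam(13*)'])   # 6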
If I've understood your question correctly, this function should give you the correct answers:
import pandas as pd

df = pd.DataFrame({'emp': ['23ank(5*)', '55man(5*)', '2shu(1*)', '7am(3*)', '9shi(0*)', '213raj(11*)'],
                   'man': ['213raj(11*)', '213raj(11*)', '23ank(5*)', '55man(5*)', '55man(5*)', '66sam(13*)']})

def count_children(parent):
    total_children = []  # initialise list of children to append to
    direct = df[df['man'] == parent]['emp'].to_list()
    total_children += direct  # add direct children
    indirect = df[df['man'].isin(direct)]['emp'].to_list()
    total_children += indirect  # add indirect children
    # next, add children of indirect children in a loop
    next_indirect = indirect
    while True:
        next_indirect = df[df['man'].isin(next_indirect)]['emp'].to_list()
        if not next_indirect or all(i in total_children for i in next_indirect):
            break
        else:
            total_children = list(set(next_indirect).union(set(total_children)))
    count = len(total_children)
    return pd.DataFrame({'count': count,
                         'children': ','.join(total_children)},
                        index=[parent])
count_children('213raj(11*)') gives a count of 5 and count_children('66sam(13*)') gives a count of 6, matching the expected results.
I need some help getting my brain around designing an (efficient) Markov chain in Spark (via Python). I've written it as best as I could, but the code I came up with doesn't scale. Basically, for the various map stages I wrote custom functions, and they work fine for sequences of a couple thousand, but when we get into the 20,000+ range (and I've got some up to 800k) things slow to a crawl.
For those of you not familiar with Markov models, this is the gist of it.
This is my data. I've got the actual data (no header) in an RDD at this point.
ID, SEQ
500, HNL, LNH, MLH, HML
We look at sequences in tuples, so
(HNL, LNH), (LNH,MLH), etc..
And I need to get to this point, where I return a dictionary (for each row of data) that I then serialize and store in an in-memory database.
{500:
    {'HNLLNH': 0.333,
     'LNHMLH': 0.333,
     'MLHHML': 0.333,
     'LNHHNL': 0.000,
     ...}
}
So in essence, each sequence is combined with the next (HNL, LNH becomes 'HNLLNH'); then, for all possible transitions (combinations of sequences), we count their occurrences and divide by the total number of transitions (3 in this case) to get their frequency of occurrence.
There were 3 transitions above, and one of those was HNLLNH, so for HNLLNH, 1/3 = 0.333.
As a side note, and I'm not sure if it's relevant, the values for each position in a sequence are limited: 1st position (H/M/L), 2nd position (M/L), 3rd position (H/M/L).
What my code had previously done was to collect() the RDD and map over it a couple of times using functions I wrote. Those functions first turned the string into a list, then merged list[1] with list[2], then list[2] with list[3], then list[3] with list[4], etc., so I ended up with something like this:
[HNLLNH],[LNHMLH],[MHLHML], etc..
Then the next function created a dictionary out of that list, using each list item as a key, counted the total occurrences of that key in the full list, and divided by len(list) to get the frequency. I then wrapped that dictionary in another dictionary along with its ID number (resulting in the 2nd code block above).
Like I said, this worked well for small-ish sequences, but not so well for lists with a length of 100k+.
Also, keep in mind, this is just one row of data. I have to perform this operation on anywhere from 10-20k rows of data, with rows varying between 500 and 800,000 sequences per row.
Any suggestions on how I can write pyspark code (using the API map/reduce/agg/etc.. functions) to do this efficiently?
EDIT
Code as follows. It probably makes sense to start at the bottom. Please keep in mind I'm learning this (Python and Spark) as I go, and I don't do this for a living, so my coding standards are not great.
def f(x):
    # Custom RDD map function
    # Combines two separate transactions
    # into a single transition state
    cust_id = x[0]
    trans = ','.join(x[1])
    y = trans.split(",")
    s = ''
    for i in range(len(y) - 1):
        s = s + str(y[i] + str(y[i+1])) + ","
    return str(cust_id + ',' + s[:-1])

def g(x):
    # Custom RDD map function
    # Calculates the transition state probabilities
    # by adding up state-transition occurrences
    # and dividing by total transitions
    cust_id = str(x.split(",")[0])
    trans = x.split(",")[1:]
    temp_list = []
    middle = int((len(trans[0]) + 1) / 2)
    for i in trans:
        temp_list.append((''.join(i)[:middle], ''.join(i)[middle:]))
    state_trans = {}
    for i in temp_list:
        state_trans[i] = temp_list.count(i) / (len(temp_list))
    my_dict = {}
    my_dict[cust_id] = state_trans
    return my_dict

def gen_tsm_dict_spark(lines):
    # Takes RDD/string input with format CUST_ID(or)PROFILE_ID,SEQ,SEQ,SEQ....
    # Returns RDD of dict with CUST_ID and tsm per customer
    # i.e. {cust_id : { ('NLN', 'LNN') : 0.33, ('HPN', 'NPN') : 0.66}
    # creates a tuple ([cust/profile_id], [SEQ,SEQ,SEQ])
    cust_trans = lines.map(lambda s: (s.split(",")[0], s.split(",")[1:]))
    with_seq = cust_trans.map(f)
    full_tsm_dict = with_seq.map(g)
    return full_tsm_dict

def main():
    result = gen_tsm_dict_spark(my_rdd)
    # Insert into DB
    for x in result.collect():
        for k, v in x.items():
            db_insert(k, v)
You can try something like below. It depends heavily on toolz, but if you prefer to avoid external dependencies you can easily replace it with some standard Python libraries.
from __future__ import division
from collections import Counter
from itertools import product
from toolz.curried import sliding_window, map, pipe, concat
from toolz.dicttoolz import merge

# Generate all possible transitions
defaults = sc.broadcast(dict(map(
    lambda x: ("".join(concat(x)), 0.0),
    product(product("HNL", "NL", "HNL"), repeat=2))))

rdd = sc.parallelize(["500, HNL, LNH, NLH, HNL", "600, HNN, NNN, NNN, HNN, LNH"])

def process(line):
    """
    >>> process("000, HHH, LLL, NNN")
    ('000', {'LLLNNN': 0.5, 'HHHLLL': 0.5})
    """
    bits = line.split(", ")
    transactions = bits[1:]
    n = len(transactions) - 1
    frequencies = pipe(
        sliding_window(2, transactions),  # Get all transitions
        map(lambda p: "".join(p)),        # Join strings
        Counter,                          # Count
        lambda cnt: {k: v / n for (k, v) in cnt.items()}  # Get frequencies
    )
    return bits[0], frequencies

def store_partition(iter):
    for (k, v) in iter:
        db_insert(k, merge([defaults.value, v]))

rdd.map(process).foreachPartition(store_partition)
Since you know all possible transitions, I would recommend using a sparse representation and ignoring zeros. Moreover, you can replace the dictionaries with sparse vectors to reduce the memory footprint.
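As a rough sketch of that idea (assuming pyspark.ml.linalg is available; the index layout is illustrative), each customer's frequency dict can be mapped onto a fixed index space derived from the broadcast defaults:

from pyspark.ml.linalg import Vectors

# Fixed ordering over all possible transitions, shared by every row.
all_transitions = sorted(defaults.value)
index_of = {t: i for i, t in enumerate(all_transitions)}

def to_sparse(frequencies):
    # Keep only the non-zero entries; indices must be in increasing order.
    entries = sorted((index_of[t], f) for t, f in frequencies.items() if f != 0.0)
    return Vectors.sparse(len(all_transitions), entries)

sparse_rdd = rdd.map(process).mapValues(to_sparse)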
You can achieve this result using pure PySpark; I did it that way.
To create the frequencies, let's say you have already got this far and these are your input RDDs:
ID, SEQ
500, [HNL, LNH, MLH, HML ...]
and to get frequencies for transitions like (HNL, LNH), (LNH, MLH), ...:
inputRDD.map(lambda kv: get_frequencies(kv[1])).flatMap(lambda x: x) \
        .reduceByKey(lambda v1, v2: v1 + v2)
def get_frequencies(states_list):
    """
    :param states_list: A list of customer states.
    :return: List of (state transition, count) tuples.
    """
    rest = []
    tuples_list = []
    for idx in range(0, len(states_list)):
        if idx + 1 < len(states_list):
            tuples_list.append((states_list[idx], states_list[idx + 1]))
    unique = set(tuples_list)
    for value in unique:
        rest.append((value, tuples_list.count(value)))
    return rest
and you will get results like
((HNL, LNH), 98), ((LNH, MLH), 458), ...
After this you may convert the result RDD into a DataFrame, or you can insert directly into the DB using the RDD's mapPartitions.
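For example, a sketch of the DataFrame conversion (assuming an active SparkSession so that RDD.toDF is available, and where result_rdd is the reduced RDD from the snippet above; column names are illustrative):

# ((from_state, to_state), count) pairs -> a three-column DataFrame.
freq_df = result_rdd.map(lambda kv: (kv[0][0], kv[0][1], kv[1])) \
                    .toDF(["from_state", "to_state", "count"])
freq_df.show()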
I have a data set of ~500 points in 2D, with given coordinates (x,y) between 0 and 10 (which also implies I can refer to each point with a single integer). Now I'm trying to divide the area into regular square cells by applying a grid. Note that this process is being repeated in an algorithm, and at some point there will be far more than 500 square cells.
What I want to achieve: Loop over all points, for each point find the square cell in which the point lies and save this information.
A few steps later: Loop over all points again, for each point identify its cell and the adjacent cells of the cell. Take all the points of these cells and add them to e.g. a list, for further usage.
My thought process: since there will be a lot of empty cells and I do not want to waste memory on them, use a tree.
Example: there is a point in cell_39_41 and another in cell_39_42.
First level: root-node with child 39
Second level: 39 node with children 41,42
Third level: 41 node with child point1 and 42 node with child point2
Fourth level: Nodes representing actual points
If I find more points in cell_39_41 or cell_39_42 they will be added as children of their respective third level nodes.
class Node(object):
    def __init__(self, data):
        self.data = data
        self.children = []

    def add_child(self, obj):
        self.children.append(obj)
I left out an irrelevant method that returns the points in a cell.
Problems with this implementation:
1. If I add a second- or third-level node, I need to be able to refer to it later in order to add children or to find the points in a certain cell and its adjacent cells. This means a lot of costly linear searches, since the children lists are not sorted.
2. I will be adding hundreds of nodes, but I need to be able to refer to them by unique names. This might be a big personal fail, but I cannot think of a way to generate such names in a loop.
So basically I'm pretty sure there's some mistake in my thought process, or maybe the tree implementation I used is not suitable. I have read a lot about implementations of B-trees and similar structures, but since this problem is limited to 2D I felt they were overkill and not well suited.
How about this ...
def add_point(data_dict, row, column, point):
    # Modifies data_dict in place, since dictionaries are mutable.
    data_dict.setdefault(row, {}).setdefault(column, []).append(point)

def get_table(data):
    out_dict = {}
    for row, column, point in data:
        add_point(out_dict, row, column, point)
    return out_dict

if __name__ == "__main__":
    data = [(38, 41, 38411), (39, 41, 39411), (39, 42, 39421)]
    points = get_table(data)
    print(points)
    add_point(points, 39, 42, 39422)
    print(points)
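To map a raw (x, y) coordinate onto a (row, column) cell, and to gather the points of a cell plus its eight neighbours, a sketch along the same lines might be (cell_size is whatever grid spacing you choose; the helper names are illustrative):

def cell_of(x, y, cell_size):
    # Integer cell coordinates for a point; floor division keeps the cells regular.
    return int(x // cell_size), int(y // cell_size)

def neighbourhood_points(points, row, column):
    # Collect the points of the cell and of its eight adjacent cells.
    found = []
    for r in range(row - 1, row + 2):
        for c in range(column - 1, column + 2):
            found.extend(points.get(r, {}).get(c, []))
    return found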
Use a dict of dicts as the tree:
tree = {
    '_data': 123,
    'node1': {
        '_data': 456,
        'node11': {
            'node111': {}
        },
        'node2': {}
    }
}
Lookups in dicts are fast!
tree['node1']['node12']['node123']['_data'] = 123  # adding (the intermediate dicts must already exist)
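If you don't want to create those intermediate dicts by hand before an assignment like that, a small autovivification sketch with collections.defaultdict handles it automatically:

from collections import defaultdict

def make_tree():
    # Missing keys create empty sub-trees on first access.
    return defaultdict(make_tree)

auto_tree = make_tree()
auto_tree['node1']['node12']['node123']['_data'] = 123  # intermediate nodes appear automatically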
Unique names:
shortcuts = {}
shortcuts['name'] = tree['node1']['node11']['node111']
print(shortcuts['name']['_data'])
I have a project which I decided to do in Python. In brief: I have a list of lists, each of which also contains lists, sometimes with one element, sometimes with more. It looks like this:
rules = [
    [[1], [2], [3, 4, 5], [4], [5], [7]],
    [[1], [8], [3, 7, 8], [3], [45], [12]],
    [[31], [12], [43, 24, 57], [47], [2], [43]]
]
The point is to compare values from a NumPy array to values from these rules (elements of the rules table). We compare some [x][y] point to the first element of a rule (e.g. 1 in the first rule); then, if that matches, the value [x-1][y] from the array is compared with the second element of the rule, and so on. The first five comparisons must all be true to change the value of the [x][y] point. I've written something like this (the main function is SimulateLoop; the order is switched because simulate2 was written after the other one):
def simulate2(self, i, j, w, rule):
    data = Data(rule)
    if w.world[i][j] in data.c:
        if w.world[i-1][j] in data.n:
            if w.world[i][j+1] in data.e:
                if w.world[i+1][j] in data.s:
                    if w.world[i][j-1] in data.w:
                        w.world[i][j] = data.cc[0]
                    else: return
                else: return
            else: return
        else: return
    else: return

def SimulateLoop(self, w):
    for z in range(w.steps):
        for i in range(2, w.x-1):
            for j in range(2, w.y-1):
                for rule in w.rules:
                    self.simulate2(i, j, w, rule)
Data class:
class Data:
    def __init__(self, rule):
        self.c = rule[0]
        self.n = rule[1]
        self.e = rule[2]
        self.s = rule[3]
        self.w = rule[4]
        self.cc = rule[5]
The NumPy array is an object from the World class. rules is a list as described above, parsed by a function obtained from another program (GPL license).
To be honest, it looks like it should work, but it does not. I've tried other possibilities without luck. It runs and the interpreter doesn't report any errors, but somehow the values in the array change incorrectly. The rules are good, because they were produced by the program from which I obtained the parser (GPL license).
Maybe it will be helpful: it is Perrier's loop, a modified Langton's loop (artificial life).
I will be very thankful for any help!
I am not familiar with Perrier's loop, but if you are coding something like the famous Game of Life, you may have made a simple mistake: storing the next generation in the same array and thus corrupting it.
Normally you store the next generation in a temporary array and do a copy/swap after the sweep, like in this sketch:
import numpy as np

def do_step_in_game_life(world):
    next_gen = np.zeros(world.shape)  # <<< Tmp array here
    Nx, Ny = world.shape
    for i in range(1, Nx-1):
        for j in range(1, Ny-1):
            neighbours = np.sum(world[i-1:i+2, j-1:j+2]) - world[i, j]
            if neighbours < 3:
                # 2 live neighbours: the cell keeps its state; fewer: it dies.
                next_gen[i, j] = world[i, j] if neighbours == 2 else 0
            elif neighbours == 3:
                next_gen[i, j] = 1  # birth or survival
            else:
                next_gen[i, j] = 0  # overcrowding
    world[:, :] = next_gen[:, :]  # <<< Saving computed next generation
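Applied to the rule-based loop from the question, the same idea might look roughly like this (a sketch reusing the question's Data class and rule layout; the helper name is illustrative, and whether a cell should stop after the first matching rule is up to you):

def step_rules(world, rules):
    # Read every neighbourhood from `world`, write the results into a copy,
    # and only hand the copy back once the whole sweep is finished.
    next_world = world.copy()
    nx, ny = world.shape
    for i in range(2, nx - 1):
        for j in range(2, ny - 1):
            for rule in rules:
                d = Data(rule)
                if (world[i][j] in d.c and world[i-1][j] in d.n and
                        world[i][j+1] in d.e and world[i+1][j] in d.s and
                        world[i][j-1] in d.w):
                    next_world[i][j] = d.cc[0]
                    break  # first matching rule wins here
    return next_world

Each simulation step then becomes world = step_rules(world, w.rules), repeated w.steps times.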