I have been trying to figure out an issue in which my program does the topological sort, but not in the order that I'm trying to get. For instance, if I give it the input:
Learn Python
Understand higher-order functions
Learn Python
Read the Python tutorial
Do assignment 1
Learn Python
It should output:
Read the Python tutorial
Understand higher-order functions
Learn Python
Do assignment 1
Instead, I get output where the first two lines are swapped. For some of my other test cases this occurs as well: two seemingly random entries get swapped. Here's my code:
import sys
graph={}
def populate(name,dep):
if name in graph:
graph[name].append(dep)
else:
graph[name]=[dep]
if dep not in graph:
graph[dep]=[]
def main():
last = ""
for line in sys.stdin:
lne=line.strip()
if last == "":
last=lne
else:
populate(last,lne)
last=""
def topoSort(graph):
sortedList=[] #result
zeroDegree=[]
inDegree = { u : 0 for u in graph }
for u in graph:
for v in graph[u]:
inDegree[v]+=1
for i in inDegree:
if(inDegree[i]==0):
zeroDegree.append(i)
while zeroDegree:
v=zeroDegree.pop(0)
sortedList.append(v)
#selection sort for alphabetical sort
for x in graph[v]:
inDegree[x]-=1
if (inDegree[x]==0):
zeroDegree.insert(0,x)
sortedList.reverse()
#for y in range(len(sortedList)):
# min=y
# for j in range(y+1,len(sortedList)):
# if sortedList[min]>sortedList[y]:
# min=j
# sortedList[y],sortedList[min]=sortedList[min],sortedList[y]
return sortedList
if __name__=='__main__':
main()
result=topoSort(graph)
if (len(result)==len(graph)):
print(result)
else:
print("cycle")
Any ideas as to why this may be occurring?
The elements within sets (and, before Python 3.7, dictionaries) are not ordered: if you add elements, the iteration order is not simply the order in which you appended them. I think that is the reason why you get seemingly random results from your sorting algorithm. I guess it has something to do with inDegree, but I didn't debug very much.
I can't offer you a specific fix for your code, but according to the desired input and output it could look like this:
# read tuples from stdin until ctrl+d is pressed (on linux) or EOF is reached
graph = set()
while True:
try:
graph |= { (input().strip(), input().strip()) }
    except EOFError:
break
# apply topological sort and print it to stdout
print("----")
while graph:
z = { (a,b) for a,b in graph if not [1 for c,d in graph if b==c] }
print ( "\n".join ( sorted ( {b for a,b in z} )
+ sorted ( {a for a,b in z if not [1 for c,d in graph if a==d]} ) ) )
graph -= z
The great advantage of Python (here 3.9.1) is how short the solution can be. Instead of lists I would use sets, because they are easier to edit: graph | {elements} adds items to the set and graph - {elements} removes them. Duplicates are ignored.
First, pairs of lines are read from stdin with (input().strip(), input().strip()) and added to the graph set as tuples.
The line z = {...} filters out the elements that can be printed in this round, which are then subtracted from the graph set.
The generated sets are unordered, so the printable elements are turned into sorted lists at the end and joined with newlines.
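For comparison, here is a minimal sketch of how the original Kahn-style approach could be made deterministic, assuming alphabetical order is wanted among the tasks that become ready at the same time (this uses a heap and reversed edges; it is not the asker's code):
import heapq
def topo_sort_alpha(graph):
    # graph maps each task to the list of tasks it depends on, as in the question
    dependents = {u: [] for u in graph}
    in_degree = {u: 0 for u in graph}
    for task, deps in graph.items():
        for dep in deps:
            dependents[dep].append(task)   # reversed edge: dependency -> dependent
            in_degree[task] += 1
    ready = [u for u in graph if in_degree[u] == 0]
    heapq.heapify(ready)                   # alphabetically smallest task first
    order = []
    while ready:
        u = heapq.heappop(ready)
        order.append(u)
        for v in dependents[u]:
            in_degree[v] -= 1
            if in_degree[v] == 0:
                heapq.heappush(ready, v)
    return order if len(order) == len(graph) else None   # None signals a cycle
With the sample input this returns the four tasks in the asker's expected order; whether alphabetical tie-breaking is what the assignment actually requires is an assumption.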
I have a file called countries.txt that lists all the countries by their name, area (in km2), and population (e.g. ["Afghanistan", 647500.0, 25500100]).
def readCountries(filename):
    result=[]
    lines=open(filename)
    for line in lines:
        result.append(line.strip('\n').split(',\t'))
    for sublist in result:
        sublist[1]=float(sublist[1])
        sublist[2]=int(sublist[2])
    return result
I am trying to sort the list using a bubble sort according to the area of each country:
>>> c = countryByArea(7)
>>> c
["India", 3287590.0, 1239240000]
When given the parameter n, it should return the country with the nth largest area.
I have this, but I'm not sure how to return the information:
def countryByArea(area):
myList=readCountries('countries.txt')
for i in range(0,len(list)):
for j in range(0,len(list)-1):
if list[j]>list[j+1]:
temp=list[j]
list[j]=list[j+1]
list[j+1]=temp
First of all, implement a generic bubble sort method. This is a correct bubble sort implementation... I'm sure you can find other implementations on http://rosettacode.org
def bubble_sort(a_list,a_key):
changed=True
while changed:
changed = False
for i in range(len(a_list)-1):
if a_key(a_list[i]) > a_key(a_list[i+1]):
a_list[i],a_list[i+1] = a_list[i+1],a_list[i]
changed = True
Then simply pass a key function that represents the data you want to sort by (in this case the middle value, i.e. index one, of each row):
import csv
def sort_by_area(fname):
with open(fname) as f:
a = list(csv.reader(f))
    bubble_sort(a, lambda row: float(row[1]))  # the area column holds values like "647500.0", so use float, not int
return a
a = sort_by_area("a_file.txt")
print a[-7] #the 7th largest by area
You can take this info and combine it to complete your assignment (see the sketch below) ... but really this is a question you should have asked a classmate or your teacher for help with ...
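For instance, a minimal sketch of countryByArea built on the two pieces above (hypothetical wiring: it assumes readCountries returns its result list and that the parameter counts from the largest area):
def countryByArea(n):
    countries = readCountries('countries.txt')   # must return the parsed list
    bubble_sort(countries, lambda row: row[1])   # ascending by area
    return countries[-n]                         # nth largest counted from the end
c = countryByArea(7)
print(c)   # e.g. ["India", 3287590.0, 1239240000]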
I've got a question regarding Linear Searching in Python. Say I've got the base code of
for l in lines:
for f in search_data:
if my_search_function(l[1],[f[0],f[2]]):
print "Found it!"
break
in which we want to determine where in search_data the value stored in l[1] exists. Say my_search_function() looks like this:
def my_search_function(search_key, search_values):
for s in search_values:
if search_key in s:
return True
return False
Is there any way to increase the speed of processing? Binary Search would not work in this case, as lines and search_data are multidimensional lists and I need to preserve the indexes. I've tried an outside-in approach, i.e.
for line in lines:
negative_index = -1
positive_index = 0
middle_element = len(search_data) /2 if len(search_data) %2 == 0 else (len(search_data)-1) /2
found = False
while positive_index < middle_element:
# print str(positive_index)+","+str(negative_index)
if my_search_function(line[1], [search_data[positive_index][0],search_data[negative_index][0]]):
print "Found it!"
break
positive_index = positive_index +1
negative_index = negative_index -1
However, I'm not seeing any speed increase from this. Does anyone have a better approach? I'm looking to cut the processing time in half, as I'm working with large amounts of CSV and the processing time for one file is > 00:15, which is unacceptable as I'm processing batches of 30+ files.
Basically, the data I'm searching on is essentially SKUs. A value from lines[0] could be something like AS123JK, and a valid match for that value could be AS123. So a HashMap would not work here, unless there exists a way to do partial matches in a HashMap lookup that wouldn't require me breaking down the values like ['AS123', 'AS123J', 'AS123JK'], which is not ideal in this scenario. Thanks!
Binary Search would not work in this case, as lines and search_data are multidimensional lists and I need to preserve the indexes.
Regardless, it may be worth your while to extract the strings (along with some reference to the original data structure) into a flat list, sort it, and perform fast binary searches on it with the help of the bisect module.
Or, instead of a large number of searches, also sort a combined list of all the search keys and traverse both lists in parallel, looking for matches (proceeding in a similar manner to the merge step in merge sort, without actually outputting a merged list).
Code to illustrate the second approach:
lines = ['AS12', 'AS123', 'AS123J', 'AS123JK','AS124']
search_keys = ['AS123', 'AS125']
try:
    iter_keys = iter(sorted(search_keys))
    key = next(iter_keys)
    for line in sorted(lines):
        # advance past keys that are lexicographically smaller than this line's prefix
        while key < line[:len(key)]:
            key = next(iter_keys)
        if line.startswith(key):
            print('Line {} matches {}'.format(line, key))
except StopIteration:  # all keys processed
    pass
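For the first (bisect) suggestion, a minimal sketch under the same prefix-matching assumption, pairing each string with the index of its original row (the names here are hypothetical):
import bisect
# build once: a flat, sorted list of (sku, original_row_index) pairs
flat = sorted((row[1], idx) for idx, row in enumerate(lines))
skus = [sku for sku, _ in flat]
def rows_matching(key):
    # all rows whose SKU starts with key; bisect_left finds the first candidate
    hits = []
    i = bisect.bisect_left(skus, key)
    while i < len(skus) and skus[i].startswith(key):
        hits.append(flat[i][1])
        i += 1
    return hits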
It depends on the problem details.
For instance, if you search for complete words, you could create a hash table on the searchable elements, and the final search would be a simple lookup.
Filling the hash table is pseudo-linear.
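A minimal sketch of that idea, assuming exact (complete-value) keys and the column layout from the question, where f[0] and f[2] are the searchable values:
# build the lookup table once: value -> list of row numbers in search_data
index = {}
for row_number, f in enumerate(search_data):
    for value in (f[0], f[2]):
        index.setdefault(value, []).append(row_number)
# each search is then a single dictionary lookup
for l in lines:
    if l[1] in index:
        print("Found it!")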
Ultimately, I broke down and implemented binary search on my multidimensional lists by sorting with the sorted() function and a lambda as the key argument. Here is the first-pass code that I whipped up. It's not 100% efficient, but it's a vast improvement from where we were.
def binary_search(master_row, source_data, master_search_index, source_search_index):
    lower_bound = 0
    upper_bound = len(source_data) - 1
    while lower_bound <= upper_bound:
        middle_pos = (lower_bound + upper_bound) // 2
        if source_data[middle_pos][source_search_index] < master_row[master_search_index]:
            if search([source_data[middle_pos][source_search_index]], [master_row[master_search_index]]):
                return {"result": True, "index": middle_pos}
            lower_bound = middle_pos + 1
        elif source_data[middle_pos][source_search_index] > master_row[master_search_index]:
            if search([master_row[master_search_index]], [source_data[middle_pos][source_search_index]]):
                return {"result": True, "index": middle_pos}
            upper_bound = middle_pos - 1
        else:
            if len(source_data[middle_pos][source_search_index]) > 5:
                return {"result": True, "index": middle_pos}
            break
    # fall through: no match found, so the caller can check results["result"] safely
    return {"result": False, "index": -1}
and then here is where we actually make the binary search call:
#where master_copy is the first multidimensional list, data_copy is the second
#the search columns are the columns we want to search against
for line in master_copy:
for m in master_search_columns:
found = False
for d in data_search_columns:
data_copy = sorted(data_copy, key=lambda x: x[d], reverse=False)
results = binary_search(line, data_copy,m, d)
found = results["result"]
if found:
line = update_row(line, data_copy[results["index"]], column_mapping)
found_count = found_count +1
break
if found:
break
Here's the info for sorting a multidimensional list: Python Sort Multidimensional Array Based on 2nd Element of Subarray
I need some help getting my brain around designing an (efficient) Markov chain in Spark (via Python). I've written it as best as I could, but the code I came up with doesn't scale. Basically, for the various map stages I wrote custom functions, and they work fine for sequences of a couple thousand, but when we get to 20,000+ (and I've got some up to 800k) things slow to a crawl.
For those of you not familiar with Markov models, this is the gist of it.
This is my data.. I've got the actual data (no header) in an RDD at this point.
ID, SEQ
500, HNL, LNH, MLH, HML
We look at sequences in tuples, so
(HNL, LNH), (LNH,MLH), etc..
And I need to get to this point.. where I return a dictionary (for each row of data) that I then serialize and store in an in memory database.
{500:
    {'HNLLNH': 0.333,
     'LNHMLH': 0.333,
     'MLHHML': 0.333,
     'LNHHNL': 0.000,
     etc..
    }
}
So in essence, each sequence is combined with the next (HNL,LNH become 'HNLLNH'), then for all possible transitions (combinations of sequences) we count their occurrence and then divide by the total number of transitions (3 in this case) and get their frequency of occurrence.
There were 3 transitions above, and one of those was HNLLNH.. So for HNLLNH, 1/3 = 0.333
As a side note, and I'm not sure if it's relevant, but the values for each position in a sequence are limited: 1st position (H/M/L), 2nd position (M/L), 3rd position (H/M/L).
What my code had previously done was to collect() the rdd, and map it a couple times using functions I wrote. Those functions first turned the string into a list, then merged list[1] with list[2], then list[2] with list[3], then list[3] with list[4], etc.. so I ended up with something like this..
[HNLLNH],[LNHMLH],[MHLHML], etc..
Then the next function created a dictionary out of that list, using each list item as a key, counted the total occurrences of that key in the full list, and divided by len(list) to get the frequency. I then wrapped that dictionary in another dictionary, along with its ID number (resulting in the 2nd code block, up above).
Like I said, this worked well for small-ish sequences, but not so well for lists with a length of 100k+.
Also, keep in mind, this is just one row of data. I have to perform this operation on anywhere from 10-20k rows of data, with rows of data varying between lengths of 500-800,000 sequences per row.
Any suggestions on how I can write pyspark code (using the API map/reduce/agg/etc.. functions) to do this efficiently?
EDIT
Code as follows. It probably makes sense to start at the bottom. Please keep in mind I'm learning this (Python and Spark) as I go, and I don't do this for a living, so my coding standards are not great.
def f(x):
# Custom RDD map function
# Combines two separate transactions
# into a single transition state
cust_id = x[0]
trans = ','.join(x[1])
y = trans.split(",")
s = ''
for i in range(len(y)-1):
s= s + str(y[i] + str(y[i+1]))+","
return str(cust_id+','+s[:-1])
def g(x):
# Custom RDD map function
# Calculates the transition state probabilities
# by adding up state-transition occurrences
# and dividing by total transitions
cust_id=str(x.split(",")[0])
trans = x.split(",")[1:]
temp_list=[]
middle = int((len(trans[0])+1)/2)
for i in trans:
temp_list.append( (''.join(i)[:middle], ''.join(i)[middle:]) )
state_trans = {}
for i in temp_list:
state_trans[i] = temp_list.count(i)/(len(temp_list))
my_dict = {}
my_dict[cust_id]=state_trans
return my_dict
def gen_tsm_dict_spark(lines):
# Takes RDD/string input with format CUST_ID(or)PROFILE_ID,SEQ,SEQ,SEQ....
# Returns RDD of dict with CUST_ID and tsm per customer
# i.e. {cust_id : { ('NLN', 'LNN') : 0.33, ('HPN', 'NPN') : 0.66}
# creates a tuple ([cust/profile_id], [SEQ,SEQ,SEQ])
cust_trans = lines.map(lambda s: (s.split(",")[0],s.split(",")[1:]))
with_seq = cust_trans.map(f)
full_tsm_dict = with_seq.map(g)
return full_tsm_dict
def main():
    result = gen_tsm_dict_spark(my_rdd)
# Insert into DB
for x in result.collect():
for k,v in x.iteritems():
db_insert(k,v)
You can try something like the code below. It depends heavily on toolz, but if you prefer to avoid external dependencies you can easily replace it with some standard Python library code.
from __future__ import division
from collections import Counter
from itertools import product
from toolz.curried import sliding_window, map, pipe, concat
from toolz.dicttoolz import merge
# Generate all possible transitions
defaults = sc.broadcast(dict(map(
lambda x: ("".join(concat(x)), 0.0),
product(product("HNL", "NL", "HNL"), repeat=2))))
rdd = sc.parallelize(["500, HNL, LNH, NLH, HNL", "600, HNN, NNN, NNN, HNN, LNH"])
def process(line):
"""
>>> process("000, HHH, LLL, NNN")
('000', {'LLLNNN': 0.5, 'HHHLLL': 0.5})
"""
bits = line.split(", ")
transactions = bits[1:]
n = len(transactions) - 1
frequencies = pipe(
sliding_window(2, transactions), # Get all transitions
map(lambda p: "".join(p)), # Joins strings
Counter, # Count
lambda cnt: {k: v / n for (k, v) in cnt.items()} # Get frequencies
)
return bits[0], frequencies
def store_partition(iter):
for (k, v) in iter:
db_insert(k, merge([defaults.value, v]))
rdd.map(process).foreachPartition(store_partition)
Since you know all possible transitions, I would recommend using a sparse representation and ignoring zeros. Moreover, you can replace the dictionaries with sparse vectors to reduce the memory footprint.
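A minimal sketch of that last idea, reusing the broadcast defaults above as a fixed vocabulary of transitions; the exact wiring (names like to_sparse and the use of pyspark.mllib's SparseVector) is an assumption, not part of the original answer:
from pyspark.mllib.linalg import SparseVector
# fixed ordering of all possible transitions -> vector positions
all_transitions = sorted(defaults.value)
index_of = sc.broadcast({t: i for i, t in enumerate(all_transitions)})
def to_sparse(pair):
    cust_id, freqs = pair
    idx = index_of.value
    # assumes every observed transition is in the precomputed vocabulary
    entries = sorted((idx[k], v) for k, v in freqs.items())   # indices must be ascending
    return cust_id, SparseVector(len(idx), [i for i, _ in entries], [v for _, v in entries])
sparse_rdd = rdd.map(process).map(to_sparse)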
You can achieve this result using pure PySpark; I did it that way myself.
To create the frequencies, let's say you have already got your input RDD to this point:
ID, SEQ
500, [HNL, LNH, MLH, HML ...]
and to get frequency counts for pairs like (HNL, LNH), (LNH, MLH), ...:
inputRDD.map(lambda (k, list): get_frequencies(list)).flatMap(lambda x: x) \
.reduceByKey(lambda v1,v2: v1 +v2)
def get_frequencies(states_list):
"""
:param states_list: Its a list of Customer States.
:return: State Frequencies List.
"""
rest = []
tuples_list = []
for idx in range(0,len(states_list)):
if idx + 1 < len(states_list):
tuples_list.append((states_list[idx],states_list[idx+1]))
unique = set(tuples_list)
for value in unique:
rest.append((value, tuples_list.count(value)))
return rest
and you will get results like:
((HNL, LNH), 98), ((LNH, MLH), 458), ...
After this you may convert the result RDD into a DataFrame, or you can insert directly into the DB using the RDD's mapPartitions (see the sketch below).
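For example, a rough sketch of the partition-wise insert, assuming a db_insert(key, value) helper like the one in the question, and with result_rdd standing in for the reduceByKey output above (foreachPartition is the action counterpart of mapPartitions):
def save_partition(records):
    # records is an iterator over ((state1, state2), count) pairs for one partition
    for (state_pair, count) in records:
        db_insert(state_pair, count)
result_rdd.foreachPartition(save_partition)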
I am working in Python and I need to match the strings of several data files. First I use pickle to unpack my files and then I place them into lists. I only want to match strings that have the same conditions. These conditions are indicated at the end of the string.
My working script looks approximately like this:
import pickle
f = open("data_a.dat")
list_a = pickle.load( f )
f.close()
f = open("data_b.dat")
list_b = pickle.load( f )
f.close()
f = open("data_c.dat")
list_c = pickle.load( f )
f.close()
f = open("data_d.dat")
list_d = pickle.load( f )
f.close()
for a in list_a:
    for b in list_b:
        for c in list_c:
            for d in list_d:
                if a.GetName()[12:] in b.GetName():
                    if a.GetName()[12:] in c.GetName():
                        if a.GetName()[12:] in d.GetName():
                            "do whatever"
This seems to work fine for these 2 lists. The problems begin when I try to add 8 or 9 more data files for which I also need to match the same conditions. The script simply won't finish and it gets stuck. I appreciate your help.
Edit: Each of the lists contains histograms named after the parameters that were used to create them. The names of the histograms contain these parameters and their values at the end of the string. In the example I did it for 2 data sets; now I would like to do it for 9 data sets without using multiple loops.
Edit 2: I just expanded the code to reflect more accurately what I want to do. Now if I try to do that for 9 lists, it not only looks horrible, but it also doesn't work.
Off the top of my head:
files = ["file_a", "file_b", "file_c"]
sets = []
for fname in files:
    f = open(fname)
    sets.append(set(pickle.load(f)))
    f.close()
intersection = sets[0].intersection(*sets[1:])
EDIT: Well I overlooked your mapping to x.GetName()[12:], but you should be able to reduce your problem to set logic.
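For instance, a rough sketch of that reduction, assuming the condition always starts at character 12 of the name and that matching means the condition strings are equal (both are assumptions, not something stated in the question):
all_lists = [list_a, list_b, list_c, list_d]           # extend with the other data files
# one condition-string set per list, then intersect them all
suffix_sets = [{h.GetName()[12:] for h in lst} for lst in all_lists]
common = set.intersection(*suffix_sets)
# map condition -> histogram per list so the matched histograms can be fetched directly
by_suffix = [{h.GetName()[12:]: h for h in lst} for lst in all_lists]
for cond in sorted(common):
    matched = [d[cond] for d in by_suffix]             # one histogram per list
    # "do whatever" with matched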
Here is a small piece of code you can take inspiration from. The main idea is the use of a recursive function.
For simplicity's sake, I assume the data is already loaded in lists, but you can load it from the files first:
data_files = [
'data_a.dat',
'data_b.dat',
'data_c.dat',
'data_d.dat',
'data_e.dat',
]
lists = [pickle.load(open(f)) for f in data_files]
And because I don't really get the details of what you need to do, my goal here is to find the matches on the first four characters:
def do_wathever(string):
print "I have match the string '%s'" % string
lists = [
["hello", "world", "how", "grown", "you", "today", "?"],
["growl", "is", "a", "now", "on", "appstore", "too bad"],
["I", "wish", "I", "grow", "Magnum", "mustache", "don't you?"],
]
positions = [0 for i in range(len(lists))]
def recursive_match(positions, lists):
strings = map(lambda p, l: l[p], positions, lists)
match = True
searched_string = strings.pop(0)[:4]
for string in strings:
if searched_string not in string:
match = False
break
if match:
do_wathever(searched_string)
# increment positions:
new_positions = positions[:]
lists_len = len(lists)
for i, l in enumerate(reversed(lists)):
max_position = len(l)-1
list_index = lists_len - i - 1
current_position = positions[list_index]
if max_position > current_position:
new_positions[list_index] += 1
break
else:
new_positions[list_index] = 0
continue
return new_positions, not any(new_positions)
search_is_finished = False
while not search_is_finished:
positions, search_is_finished = recursive_match(positions, lists)
Of course you can optimize a lot of things here; this is draft code, but take a look at the recursive function, as it is the main concept.
In the end I used the built-in map function. I realize now I should have been more explicit than I was (which I will do in the future).
My data files contain histograms with 5 parameters, some with 3 or 4. Something like this:
par1=["list with some values"]
par2=["list with some values"]
par3=["list with some values"]
par4=["list with some values"]
par5=["list with some values"]
I need to examine the behavior of the plotted quantity for each possible combination of the parameter values. In the end, I get a data file with ~300 histograms, each identified in its name by the corresponding parameter values and the sample name. It looks something like:
datasample1-par1=val1-par2=val2-par3=val3-par4=val4-par5=val5
datasample1-"permutation of the above values"
...
datasample9-par1=val1-par2=val2-par3=val3-par4=val4-par5=val5
datasample9-"permutation of the above values"
So I get 300 histograms for each of the 9 data files, but luckily all of these histograms are created in the same order. Hence I can pair all of them just using the built-in map function. I unpack the data files, put each into a list, and then use the map function to pair each histogram with its corresponding configuration in the other data samples.
for lst in map(None, data1_histosli, data2_histosli, ...data9_histosli):
do_something(lst)
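(If this ever needs to run on Python 3, where map(None, ...) is gone, itertools.zip_longest gives the same padded pairing, using the list names from above:)
from itertools import zip_longest
for lst in zip_longest(data1_histosli, data2_histosli):   # ...add the remaining lists here
    do_something(lst)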
This solves my problem. Thank you to all for your help!
I have a project which I decided to do in Python. In brief: I have a list of lists. Each of them also contains lists, sometimes one-element, sometimes longer. It looks like this:
rules=[
    [[1],[2],[3,4,5],[4],[5],[7]],
    [[1],[8],[3,7,8],[3],[45],[12]],
    [[31],[12],[43,24,57],[47],[2],[43]]
]
The point is to compare values from a NumPy array to values from these rules (elements of the rules table). We compare some [x][y] point to the first element (e.g. 1 in the first rule); then, if it matches, we compare the value [x-1][y] from the array to the second element of the list, and so on. The first five comparisons must all be true to change the value of the [x][y] point. I've written something like this (the main function is SimulateLoop; the order is switched because the simulate2 function was written after the second one):
def simulate2(self, i, j, w, rule):
data = Data(rule)
if w.world[i][j] in data.c:
if w.world[i-1][j] in data.n:
if w.world[i][j+1] in data.e:
if w.world[i+1][j] in data.s:
if w.world[i][j-1] in data.w:
w.world[i][j] = data.cc[0]
else: return
else: return
else: return
else: return
else: return
def SimulateLoop(self,w):
for z in range(w.steps):
for i in range(2,w.x-1):
for j in range(2,w.y-1):
for rule in w.rules:
self.simulate2(i,j,w,rule)
Data class:
class Data:
def __init__(self, rule):
self.c = rule[0]
self.n = rule[1]
self.e = rule[2]
self.s = rule[3]
self.w = rule[4]
self.cc = rule[5]
The NumPy array is an object from the World class. Rules is a list as described above, parsed by a function obtained from another program (GPL license).
To be honest, it seems like it should work, but it does not. I've tried other possibilities, without luck. It runs, and the interpreter doesn't report any errors, but somehow the values in the array change incorrectly. The rules are good, because they were provided by the program from which I obtained the parser (GPL license).
Maybe it will be helpful: it is Perrier's loop, a modified Langton's loop (artificial life).
I will be very thankful for any help!
I am not familiar with Perrier's loop, but if you code something like the famous Game of Life, you may have made a simple mistake: storing the next generation in the same array, thus corrupting it.
Normally you store the next generation in a temporary array and copy/swap after the sweep, like in this sketch:
import numpy as np
def do_step_in_game_life(world):
    next_gen = np.zeros(world.shape)   # <<< Tmp array here
    Nx, Ny = world.shape
    for i in range(1, Nx-1):
        for j in range(1, Ny-1):
            neighbours = world[i-1:i+2, j-1:j+2].sum() - world[i,j]
            # standard Game of Life rules shown as an example; substitute your own rule table
            if world[i,j] and neighbours in (2, 3):
                next_gen[i,j] = 1
            elif not world[i,j] and neighbours == 3:
                next_gen[i,j] = 1
            else:
                next_gen[i,j] = 0
    world[:,:] = next_gen[:,:]   # <<< Saving computed next generation
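Applied to the loop in the question, a rough sketch of the same idea; it assumes simulate2 is changed to read from w.world but write its result into the temporary array (this is not the answer's code, just an illustration):
import numpy as np
def SimulateLoop(self, w):
    for z in range(w.steps):
        next_world = np.copy(w.world)
        for i in range(2, w.x-1):
            for j in range(2, w.y-1):
                for rule in w.rules:
                    # simulate2 would take next_world as an extra argument and
                    # assign next_world[i][j] = data.cc[0] instead of touching w.world
                    self.simulate2(i, j, w, next_world, rule)
        w.world[:,:] = next_world[:,:]   # commit the whole sweep at once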