Walking through a Directed Graph Python - python

I know this is not the place to ask for homework answers, but I just want some direction since I'm completely lost. I started a Python course at my college where the professor assumes no one has experience in the language. This is our first assignment:
Build on this code to create a subsequent loop that starts at W and takes a random walk of a particular maximum length specified as a variable. If there are multiple possible next states, choose one uniformly at random. If there are no next states, then you should stop the loop, even if you haven't reached your maximum length.
I understand that this code my professor provided is walking through each character of the string s, but I don't know how it's actually working. I don't get what c in enumerate(s) or next[c]=[] are doing. Any help explaining how this works or how to handle characters and strings in Python would be greatly appreciated. I've been coding in other languages for a few years and I have no idea how to even start this assignment.
next = {}  # This will hold the directed graph
s = "Welcome to cs 477"
# This loops through all of the characters in s
# and keeps track of their indices
for i, c in enumerate(s):
    if not c in next:
        # If this is the first time seeing a particular
        # character, we need to make a new key/value pair for it
        next[c] = []
    if i < len(s)-1:  # If there is a character after this
        # Record that s[i+1] is one of the following characters
        next[c].append(s[i+1])

enumerate(s) lets you iterate over a sequence (here, the string s) while getting both each element and the position it's in. In your case, the variable i holds the index of the current element, starting from 0, and c holds the current character. If you were to print i and c inside your for loop, the result would look like this:
for i, c in enumerate(s):
    print(i, c)
# 0 W
# 1 e
# 2 l
# ...
Regarding next[c] = [], you first need to understand that next is a dictionary, which is basically a hash map from keys to values. What next[c] = [] does is add the character c to the dictionary as a key, with an empty list as its value.
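To connect this back to the assignment, here is a minimal sketch of the random walk it describes, not a definitive solution. The names build_graph, random_walk, graph and rng are made up for illustration (graph replaces next so the built-in next() is not shadowed), and "maximum length" is assumed to mean the maximum number of steps taken.

```python
import random


def build_graph(s):
    """Map each character of s to the list of characters that follow it."""
    graph = {}
    for i, c in enumerate(s):
        if c not in graph:
            graph[c] = []
        if i < len(s) - 1:
            graph[c].append(s[i + 1])
    return graph


def random_walk(graph, start, max_length, rng=random):
    """Take at most max_length steps from start; stop early at a dead end."""
    walk = [start]
    current = start
    for _ in range(max_length):
        successors = graph.get(current, [])
        if not successors:
            # No next states: stop even though max_length was not reached
            break
        current = rng.choice(successors)  # uniform over the list entries
        walk.append(current)
    return walk


graph = build_graph("Welcome to cs 477")
print("".join(random_walk(graph, "W", 10)))
```

Note that random.choice is uniform over the entries of the list, so a successor that was recorded twice would be twice as likely. If the assignment means uniform over distinct next states, the list would need to be deduplicated first.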

Related

Performance task in Python

We have an IT Olympiad in my country. Solutions are normally written in Java, C or C++. I guess for the past year or so other languages like Python have also been allowed.
I tried to solve a task from a previous year, called Letters, in Python, and I keep failing. The task is to write code that counts the minimum number of shifts (swaps of neighboring letters) needed to turn one string into another.
As input you get the number of letters in one string, followed by two strings with the same letters in a different order. The length of one string is from 2 to 1 000 000 letters. There are only capital letters; they may or may not be sorted, and they can repeat.
Here's an example:
7
AABCDDD
DDDBCAA
Correct output should be 16
As output you have to return a single value, the minimum number of shifts. The output has to be calculated in under 5 seconds.
I got it to calculate the correct output, but on longer strings (like 800 000 letters) it starts to slow down. The longest inputs return a value in about 30 seconds. There's also one input with 900 000 letters per word that takes 30 minutes!
All the input files for the tests can be found under this link:
https://oi.edu.pl/l/19oi_ksiazeczka/
Click on this link to download files for "Letters" task:
XIX OI testy i rozwiązania - zad. LIT (I etap) (3.5 MB)
Below is my code. How can I speed it up?
# import time
import sys
# start = time.time()
def file_reader():
    standard_input=""
    try:
        data = sys.stdin.readlines()
        for line in data:
            standard_input+=line
    except:
        print("An exception occurred")
    return standard_input
def mergeSortInversions(arr):
    if len(arr) == 1:
        return arr, 0
    else:
        a = arr[:len(arr)//2]
        b = arr[len(arr)//2:]
        a, ai = mergeSortInversions(a)
        b, bi = mergeSortInversions(b)
        c = []
        i = 0
        j = 0
        inversions = 0 + ai + bi
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            c.append(a[i])
            i += 1
        else:
            c.append(b[j])
            j += 1
            inversions += (len(a)-i)
    c += a[i:]
    c += b[j:]
    return c, inversions
def literki():
    words=(file_reader()).replace("\n", "")
    number = int("".join((map(str, ([int(i) for i in list(words) if i.isdigit()])))))
    all_letters = [x for x in list(words) if x not in str(number)]
    name = all_letters[:number]
    anagram = all_letters[number:]
    p=[]
    index=list(range(len(anagram)))
    anagram_dict = {index[i]: anagram[i] for i in range(len(index))}
    new_dict = {}
    anagram_counts={}
    for key, value in anagram_dict.items():
        if value in new_dict:
            new_dict[value].append(key)
        else:
            new_dict[value]=[key]
    for i in new_dict:
        anagram_counts.update({i:new_dict[i]})
    for letter in name:
        a=anagram_counts[letter]
        p.append(a.pop(0))
    print(mergeSortInversions(p)[1])
#>>
literki()
# end = time.time()
# print(start-end)
To explain what it does, in parts. file_reader simply reads the input from standard input. mergeSortInversions(arr) would normally just sort a list, but here I wanted it to return the number of inversions. I'm not smart enough to figure it out by myself; I found it on the web, but it does the job. Unfortunately, for strings of 1 million letters it takes 10 seconds or so. In the literki function, I first split the input into the number of letters and two words of equal length, as lists.
Then I made something similar in function to an array of stacks (not sure if that's what it's called in English). Basically, I made a dictionary with every letter as a key and the indexes of that letter as a list in the value (if a letter occurs more than once, the value contains a list of all indexes for that letter). The last thing I do before "the slow thing" is extract the corresponding index for every letter in the name variable. Up to that point, all operations take around 2 seconds for every input. And now the two lines that account for the rest of the time: - I append the index to the list p and at the same time pop it from the list in the dictionary, so it won't be read again for another occurrence of the same letter. - I calculate the number of moves (inversions) with mergeSortInversions(arr) based on the list p and print it as the output.
I know that popping from the front of a list is slow, but the alternative is to build the index lists in reverse (so I could pop from the end), and that took even longer. I've also tried converting a=[...] to a deque, but that was also too slow.
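For reference, the steps described above (letter to index list, building p, counting inversions) can be sketched compactly as follows. This is an illustrative rewrite under assumptions, not a measured fix: the names count_inversions, min_adjacent_swaps, positions and cursor are made up, and a per-letter cursor is swapped in for pop(0) so that no list is shifted. Whether it meets the 5-second limit has not been checked here.

```python
from collections import defaultdict


def count_inversions(seq):
    """Return (sorted copy of seq, number of inversions) via merge sort."""
    if len(seq) <= 1:
        return list(seq), 0
    mid = len(seq) // 2
    left, left_inv = count_inversions(seq[:mid])
    right, right_inv = count_inversions(seq[mid:])
    merged = []
    inversions = left_inv + right_inv
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
            # Every element still waiting in left is greater than right[j]
            inversions += len(left) - i
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged, inversions


def min_adjacent_swaps(name, anagram):
    """Count adjacent swaps needed to turn name into anagram."""
    positions = defaultdict(list)
    for index, letter in enumerate(anagram):
        positions[letter].append(index)
    cursor = defaultdict(int)  # next unused index per letter (replaces pop(0))
    p = []
    for letter in name:
        p.append(positions[letter][cursor[letter]])
        cursor[letter] += 1
    return count_inversions(p)[1]


print(min_adjacent_swaps("AABCDDD", "DDDBCAA"))  # 16, as in the example above
```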
I think I'd try a genetic algorithm for this problem. GAs don't always come up with an optimal solution, but they are very good at getting an acceptable solution in a reasonable amount of time. And for small inputs, they can be optimal.
The gist is to come up with:
1) A fitness function that assigns a number indicating how good a particular candidate solution is
2) A sexual reproduction function, that combines, in a simple way, part of two candidate solutions
3) A mutation function, that introduces one small change to a candidate solution.
So you just let those functions go to town, creating solution after solution, and keeping the best ones - not the best one, the best ones.
Then after a while, the best solution found is your answer.
Here's an example of using a GA for another hard problem, called The House Robber Problem. It's in Python:
http://stromberg.dnsalias.org/~strombrg/house-robber-problem/

How do for loops work in Python? Like deep down what process is happening?

I am learning Python and having a lot of trouble with for loops. I know they are similar to while loops. My basic understanding is that they go through a list item by item and apply a block to it.
But I cannot seem to write a functioning for loop, I just can't wrap my head around something. Also when I look at example ones in my classes (Udacity) I don't understand how that works.
Here is an example of code that works, but I couldn't come up with that code or figure out why it is working:
def measure_udacity(U):
    count = 0
    for e in U:
        if e[0] == 'U':
            count = count + 1
    return count
print(measure_udacity(['Dave','Sebastian','Katy']))
print(measure_udacity(['Umika','Umberto']))
print(measure_udacity(['udacity', 'United States', 'umbrella', 'U2']))
The three prints output 0, 2 and 2. I guess what I don't understand is how does this line work?
if e[0] == 'U':
If you are specifying [0], how does it then get applied to 'Sebastian' and 'Katy', which are in positions [1] and [2]?
I was trying to write the same for loop before I saw the solution and had something more like this:
def measure_udacity(ulist):
    i = 0
    j = 0
    for i in ulist:
        if ulist[j] == 'U':
            i += 1
            j += 1
        else:
            j+=1
    return i
Basically trying to advance the position of where it was searching in the list for 'U'. This worked out about as well as a sack of bricks trying to float. So far Python has been very simple and the fact that I'm having so much trouble on for loops is telling me there is some basic thing about them I don't understand.
When the function measure_udacity is called, the variable U is a list of strings. The first time it's called, U has three elements and they are 'Dave', 'Sebastian' and 'Katy'. The for statement causes the indented block to run exactly once for each element, and the elements are assigned in turn to the variable e. The first time through the loop e='Dave', the second time through the loop e='Sebastian' and the third time e='Katy'. Python causes these assignments to the variable e to happen automatically; you don't need to write any extra code to do that (as you were trying to do in your question). That is the answer to your question about how to access all three strings in the list - it happens automatically because the computer goes through the loop three times, and each time e has a different value.
You asked about the line if e[0] == 'U'. I hope you can see now that e is a string, one of the strings from the list U. e[0] is simply the first character in the string, so this line compares the first character to 'U'. If it's true the counter increments by one. By the time you've looped through the entire list, the counter will equal the number of words that started with U.
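To see the automatic assignment to e for yourself, here is a small sketch of the same function with a trace line added. The print call and the local name first are additions for illustration only.

```python
def measure_udacity(U):
    count = 0
    for e in U:
        first = e[0]  # first character of the current string, not of the list
        print("e =", repr(e), "-> e[0] =", repr(first))
        if first == 'U':
            count = count + 1
    return count


print(measure_udacity(['Dave', 'Sebastian', 'Katy']))
```

Each pass through the loop prints a different string for e, which shows that [0] is applied to every element in turn rather than to the list.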
I hope that helps, good luck learning.
In Python, for loops are essentially foreach loops, where the iterating variable is a loop item, rather than an index.
If you want to get the index as well, you can call enumerate on your list:
for index, item in enumerate(items):
    print("{item} is at index {index}".format(item=item, index=index))
The three prints output 0, 2 and 2. I guess what I don't understand is how does this line work?
if e[0] == 'U':
If you are specifying [0], how does it then get applied to 'Sebastian' and 'Katy', which are in positions [1] and [2]?
In this case, e is not the top-level U list, but an item of U. e[0] is then the first character of the string e, not the first item of the list U.

Annotation tips

Okay, so I've annotated almost everything in my code, but I'm struggling slightly with annotating my for loop. I've got all of it except these two lines, which I just don't know how to explain in a way that makes sense to anyone but me. It would be great if I could get some tips on this!
y = {} #
Positions = [] #
for i, word in enumerate(Sentence): # This starts a for loop to go through every word within the variable 'Sentence', which is a list.
    if not word in y: # This continues the code below if the word isn't found within the variable 'y'.
        y[word] = (i+1) # This gives the word that wasn't found within the variable 'y' the next unused number plus 1 so that it doesn't confuse those unfamiliar with computer science starting at 0.
    Positions = Positions + [y[word]] # This sets the variable 'Positions' to the variables 'Positions' and '[y[word]]'.
If you're going to comment a variable, then the comment should explain what the variable contains (or, to be precise, since the purpose of the code is to populate these variables, what the variable is meant to contain) and/or what it's expected to be used for. Since we don't see this data used for anything, I'll stick to the former:
y = {} # dictionary mapping words to the (1-based) index of their first occurrence in the sentence
Positions = [] # list containing, for each position in the sentence, the (1-based) index of the first occurrence of the word at that position.
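A quick run on a made-up sentence may make those two comments easier to check. The sample value of Sentence is hypothetical, and append is used here in place of list concatenation.

```python
Sentence = "the cat sat on the mat".split()  # hypothetical sample input

y = {}          # word -> 1-based index of its first occurrence
Positions = []  # for each word in order, the index stored in y
for i, word in enumerate(Sentence):
    if word not in y:
        y[word] = i + 1
    Positions.append(y[word])

print(y)          # {'the': 1, 'cat': 2, 'sat': 3, 'on': 4, 'mat': 6}
print(Positions)  # [1, 2, 3, 4, 1, 6]
```

The second 'the' maps back to 1, the index of its first occurrence, which is exactly what the comment on Positions says.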
In one you are declaring a dictionary:
y = {} #
in another a list:
Positions = [] #
Dictionaries store objects under keys. Lists are ordered sequences of elements, accessed by position.

Strange behavior of list append()

I am trying to identify, from a text file, the set of words that appear at least some number of times within any single line of the text file. I have a list to hold the qualifying words. The file is read line by line. For each line, the words that occur in the line and their counts are put into a dictionary. Words with a count higher than the threshold are appended to the list. The code operating on a single line looks as follows (I pseudo-coded some parts that don't pertain to the problem):
words = []
candidates = {}
for line in text:
    for word in line:
        if word in candidates:
            candidates[word] += 1
        else:
            candidates[word] = 1
    for word in candidates:
        if candidates[word] > threshold:
            if word not in words:
                words.append(word)
    # candidates.clear()
At the end of each line, I was hoping to empty the dictionary and not carry useless content in it. However, the line that is now commented out with #, candidates.clear(), erases the content of the list and leaves only the qualifying words from the final line. When this line is removed, the output is correct.
Can someone please explain why this is happening? Does the append() method of the list class make a local copy of the data, or just maintain a pointer? Does the dictionary clear() method not only release the dict's reference to the key-value pairs, but also release other objects' references to them?
#EDIT: to address some of the comments, the word extraction in each line is pseudocode. I did not think this step is relevant to the problem. If you guys are interested, here's the original code. https://github.com/muyezhu/python/blob/master/freqword
This code looks for frequently occurring short DNA piece in a long sequence. Sample data can be downloaded at this link: http://rosalind.info/problems/1d/
Trying your linked code with the linked dataset shows that you're only getting one set of updates to kmers because the outermost for loop only runs once.
This is due to the range call you're using: range(0, len(genome) - L + 1, L - k). In the example data, len(genome) is 100, L is 75 and k is 5. That means your range is range(0, 26, 70), which yields only 0 (the next value would be 70, which is much greater than the upper bound of 26).
I'm pretty sure you don't want to give the L - k step argument to range. If you change the loop code to use range(len(genome) - L + 1), you get the expected results in kmers: ['CGACA', 'GAAGA', 'AATGT'].
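Both points can be checked in isolation with a few lines. This is a sketch with made-up sample values (the keys in candidates and the threshold are hypothetical). It shows that the range call produces a single start offset, and that clearing the dictionary leaves the list untouched.

```python
# The range from the linked code, with the example values filled in
starts = list(range(0, 26, 70))
print(starts)  # [0]: the outer loop body runs only once

# append() stores a reference to the string; clear() only empties the dict
words = []
candidates = {"ACGTA": 3, "TTTTT": 1}  # hypothetical counts for one window
threshold = 2
for word in candidates:
    if candidates[word] > threshold and word not in words:
        words.append(word)
candidates.clear()
print(candidates)  # {}
print(words)       # ['ACGTA']: still there after clear()
```

So clear() is not what empties the list. With only one pass through the outer loop, there is simply only one batch of words to collect.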

Python recursive function seems to lose some variable values

I have a 4x4 table of letters and I want to find all possible paths through it. They are candidates for being words. I have problems with the variable "used". It is a list of all the places the path has already visited, so that it doesn't go there again. There should be one used list for every path. But it doesn't work correctly. For example, I had a test print that printed the current word and the used list. Sometimes the word had only one letter, but the path had gone through all 16 cells/indices.
The for loop of size 8 covers all possible directions. The main function executes the chase function 16 times, once for every possible starting point. The move function returns the index after moving in a specific direction, and is_allowed tests whether it is allowed to move in a certain direction.
sample input: oakaoastsniuttot (4x4 table, where the first 4 letters are the first row, etc.)
sample output: all the real words that can be found in some dictionary of words
In my case it might output one or two words but not nearly all, because it thinks some cells are used even though they are not.
def chase(current_place, used, word):
    used.append(current_place)  # used === list of indices that have been used
    word += letter_list[current_place]
    if len(word)>=11:
        return 0
    for i in range(3,9):
        if len(word) == i and word in right_list[i-3]:  # right_list === list of all words
            print(word)
            break
    for i in range(8):
        if is_allowed(current_place, i) and (move(current_place, i) not in used):
            chase(move(current_place, i), used, word)
The problem is that there's only one used list, and it gets passed around to every recursive call. You have two options for fixing this in chase():
1) Make a copy of used and work with that copy.
2) Before you return from the function, undo the append() that was done at the start.
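Here is a minimal sketch of both options on a tiny grid, to show they explore the same paths. Everything in it is a stand-in rather than the asker's code: SIZE, LETTERS, neighbours, chase_copy, chase_undo and all_paths are hypothetical names, a 2x2 grid replaces the 4x4 one, and the word-list lookup is left out.

```python
SIZE = 2          # hypothetical 2x2 grid to keep the search small
LETTERS = "abcd"  # row-major letters, first SIZE letters are the first row


def neighbours(cell):
    """Yield indices of the cells adjacent to cell (up to 8 directions)."""
    row, col = divmod(cell, SIZE)
    for d_row in (-1, 0, 1):
        for d_col in (-1, 0, 1):
            if (d_row, d_col) == (0, 0):
                continue
            r, c = row + d_row, col + d_col
            if 0 <= r < SIZE and 0 <= c < SIZE:
                yield r * SIZE + c


def chase_copy(cell, used, word, found):
    """Option 1: every call works on its own copy of used."""
    used = used + [cell]  # builds a new list; the caller's list is untouched
    word += LETTERS[cell]
    found.append(word)
    for nxt in neighbours(cell):
        if nxt not in used:
            chase_copy(nxt, used, word, found)


def chase_undo(cell, used, word, found):
    """Option 2: share one list, but undo the append before returning."""
    used.append(cell)
    word += LETTERS[cell]
    found.append(word)
    for nxt in neighbours(cell):
        if nxt not in used:
            chase_undo(nxt, used, word, found)
    used.pop()  # must run on every path out of the function


def all_paths(chase):
    found = []
    for start in range(SIZE * SIZE):
        chase(start, [], "", found)
    return found


print(sorted(all_paths(chase_copy)) == sorted(all_paths(chase_undo)))
```

With option 2, an early return such as the length check in the original chase() also has to undo the append, otherwise the cell stays marked as used.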
