Trying to unravel a recursive Python program - python

I am writing a simple cryptogram solver and am having trouble 'unrolling' a recursive function. I must unroll it for other reasons, otherwise I would leave it recursive.
Here's the idea: I have a variable number of lists, each with words in them. The function's job is to go through each list and, after checking that the word fits in the current alphabet setup, find its score. So if you have the following lists:
LIST1: [the, and, can,...]
LIST2: [kids, cars, knee,...]
LIST3: [talks, walks, music,...]
...
and the function needs to go through each list (in order) and try to find the best sentence. (I have a scoring algorithm that it calls to compare.) It starts with the first word in the first list, then iterates the second list until it finds a word that works, then starts iterating the third list until it finds a word in that list that works, etc. Once it exhausts the words in the 3rd list, it should then go back to the second and find the next word that works, continuing the process until it's done.
I tried using the product function (itertools.product), but that doesn't work the right way: it just gives me all possible combinations, which technically works, but is not very efficient.
def find_sentence():
    cycle through first list:
        cycle through second list:
            if word works:
                start cycling through third word list.
            else:
                keep cycling through 2nd word list.
    ...
Keep going until we have gone through all word lists, finding a score that is above a threshold.
Any help?
From Bakuriu's response:
Thanks for your fast reply! I'm not that great at Python, but I don't think this is working the way I need it to. Your solution is similar to the product method in that its goal is to find all words that will work (or fit a score). The method I need to use is:
1. Start with the 1st word in the 1st list.
2. Start iterating the next list of words.
3. As soon as one of those words works, start going through the 3rd list, etc.
4. When you've reached the end (the last list of words) and find a candidate, you now have a solution, as you have one word in each list that works.
5. If, say, a word in list 3 does not fit, you must go back to list 2 and CONTINUE searching through that list, finding the next word that works, moving on to start list 3 OVER AGAIN, and continuing until nothing works or you've reached the end.
I hope this is clear. Please let me know if I can clarify anything.

You really don't need recursion here at all, actually.
def find_sentence(*variable_number_of_lists):
    out = []
    for eachlist in variable_number_of_lists:
        for word in eachlist:
            if scoreword(out, word) > threshold:
                # presumably, your 'scoreword' function would take in the current
                # list of okayed words in order to find the most recent one for use
                # in your scoring, if I've understood the problem correctly
                out.append(word)
                break
    return out
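The loop above is greedy and never backs up, whereas the question's steps 4-5 ask for backtracking when a later list dead-ends. Here is a minimal iterative sketch of that behavior with an explicit stack of indices; score and threshold are stand-ins for your scoring function and cutoff (the same role scoreword and threshold play above), and the function takes the word lists as a single list of lists:

def find_sentence(word_lists, score, threshold):
    if not word_lists:
        return []
    chosen = []       # one accepted word per list so far
    indices = [0]     # next index to try in the list at each depth
    while indices:
        depth = len(indices)
        lst = word_lists[depth - 1]
        i = indices[-1]
        # scan forward in the current list for a word that works
        while i < len(lst) and score(chosen, lst[i]) <= threshold:
            i += 1
        if i == len(lst):
            # exhausted this list: back up to the previous list and resume there
            indices.pop()
            if chosen:
                chosen.pop()
            if indices:
                indices[-1] += 1
            continue
        chosen.append(lst[i])
        indices[-1] = i
        if depth == len(word_lists):
            return chosen           # one working word from every list
        indices.append(0)           # move on to the next list
    return None                     # no combination works

Calling find_sentence([list1, list2, list3], scoreword, threshold) would return the first combination with one working word per list, or None if nothing scores above the threshold.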

Related

python check string contains all characters

I'm reading in a long list of words, and I made a node for every word in the list. Each node has a 'word' attribute holding the word at that position in the list.
I am trying to connect a node to the next node if the next node's word is the previous node's word with just one letter added.
I also ordered the characters of each word alphabetically, so that CAT -> ACT.
I want to draw an edge from each unique starting word, to all of the possible chains, so I can see all the possible chains in the list.
For example
A -> AN -> TAN -> RANT
However A --x-> T
This is my attempt
for i in range(0, G.number_of_nodes()-1):
    if ( ( (len(G.node[i]['word'])+1) == len(G.node[i+1]['word']) ) and (G.node[i]['word'] in G.node[i+1]['word']) ):
        print G.node[i]['word'], G.node[i+1]['word']
Gave me this,
DGO DGOS
DGOS DGOSS
I IN
ELLMS ELLMSS
AEPRS AEPRSS
INW DINW
DINW DINWY
(An image in the original post showed what the word list and the alphabetically sorted list look like.)
Why do I not see IN INW?
Also, AGNRT AGNRST should be on there but isn't, and I don't understand why, along with a lot of other pairs.
Where do you think I went wrong?
The problem is that you are only comparing words that appear right next to each other in the list, i.e. words i and i+1, e.g. I and IN appear next to each other, as do WIN and WIND, but IN and WIND are far apart. It seems you want to compare all possible words, which requires a more sophisticated algorithm. Here's an idea:
Make a dictionary where the keys are sorted words and the values are lists of actual words, e.g. {"ACT": ["CAT", "ACT", "TAC"], ...}. A collections.defaultdict(list) will be useful for this.
Sort the full input list of words by length. You can use list.sort(key=len) assuming you have just a list of words.
Iterate through the list sorted by length. For each word, go through every subset of length n-1. Something like for i in range(len(word)): process(word[:i] + word[i+1:]). You may want to be careful about duplicates here.
For each subset, sort the subset and look it up in the dictionary. Make a link from every word in the dictionary's value (a list of actual words) to the bigger word.
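A rough sketch of those steps in plain Python (the function and variable names are illustrative; words is your full word list, and you would add each resulting pair as an edge in your graph):

from collections import defaultdict

def chain_edges(words):
    # 1. sorted letters -> list of actual words, e.g. {"ACT": ["CAT", "ACT", "TAC"]}
    by_letters = defaultdict(list)
    for w in words:
        by_letters[''.join(sorted(w))].append(w)

    # 2. work through the words from shortest to longest
    edges = []
    for w in sorted(words, key=len):
        # 3. every subset of length n-1, skipping duplicate subsets
        seen = set()
        for i in range(len(w)):
            key = ''.join(sorted(w[:i] + w[i+1:]))
            if not key or key in seen:
                continue
            seen.add(key)
            # 4. link every word with exactly those letters to the bigger word
            for shorter in by_letters.get(key, []):
                edges.append((shorter, w))
    return edges

For example, chain_edges(["A", "AN", "TAN", "RANT"]) gives [("A", "AN"), ("AN", "TAN"), ("TAN", "RANT")], which covers pairs that are far apart in the original list as well as adjacent ones.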
Looks like a formal languages problem. How do you handle looping nodes?
IN INW is in the list you gave.
AGNRT AGNRST is not in the list because you started out with a single letter; that letter has to be in the next word (for example I -> IN), but IN is not in AGNRT or AGNRST.
You seem to be comparing each node with just one other node, so
"IN" directly follows "I" in your wordlist, but "INW" is not directly after "IN"
You can use the third-party Python library python-Levenshtein to calculate the Levenshtein distance, which is the string edit distance. In your case, the only allowed 'edit' is the 'insertion' of a character in the next string/word in your list, so you will also need to verify that the length of the next word is 1 plus the length of the previous word.
Here is sample code that would achieve this:
import Levenshtein as lvst

# `words` is your list of words, in order; compare each word with the next one
for word1, word2 in zip(words, words[1:]):
    if len(word2) - len(word1) == 1 and lvst.distance(word1, word2) == 1:
        print(word1, word2)
You can install python-levenshtein by either apt-get (systemwide) or pip:
sudo apt-get install python-levenshtein
or
sudo apt-get install python3-levenshtein
or
pip install python-levenshtein

Memory error while solving an anagram

I am trying to solve the below question:
An anagram is a type of word play, the result of rearranging the letters of a word or phrase to produce a new word or phrase, using all the original letters exactly once; e.g., orchestra = carthorse. Using the word list at http://www.puzzlers.org/pub/wordlists/unixdict.txt, write a program that finds the sets of words that share the same characters that contain the most words in them.
It fails even with just 1000 bytes of the file. Also, a new list is created on every iteration, so why does Python keep the old list in memory? I am getting the error below.
l=list(map(''.join, itertools.permutations(i)))
gives me:
MemoryError
Here's my code:
import itertools

def anagram():
    f = open('unixdict.txt')
    f2 = open('result_anagram.txt', 'w')
    words = f.read(1000).split('\n')
    for i in words:
        l = []
        l = list(map(''.join, itertools.permutations(i)))
        l.remove(i)
        for anagram in l:
            if l == i:
                f2.write(i + "\n")
    return True

anagram()
I changed the above code as per the suggestion, but I am still getting the memory error.
import itertools

def anagram():
    f = open('unixdict.txt')
    f2 = open('result_anagram.txt', 'w')
    words = set(line.rstrip('\n') for line in f)
    for i in words:
        l = map(''.join, itertools.permutations(i))
        l = (x for x in l if x != i)
        for anagram in l:
            if anagram in words:
                f2.write(i + "\n")
    return True

anagram()
MemoryError
[Finished in 22.2s]
This program is going to be horribly inefficient no matter what you do.
But you can fix this MemoryError so it'll just take forever to run instead of failing.
First, note that a 12-letter word has 479,001,600 permutations. Storing all of those in memory is going to take more than 2GB of memory. So, how do you solve that? Just don't store them all in memory. Leave the iterator as an iterator instead of making a list, and then you'll only have to fit one at a time, instead of all of them.
There's one problem here: You're actually using that list in the if l==i: line. But clearly that's a mistake. There's no way that a list of strings can ever equal a single string. You might as well replace that line with raise TypeError, at which point you can just replace the whole loop and fail a whole lot faster. :)
I think what you wanted there is if anagram in words:. In which case you have no need for l, except for in the for loop, which means you can safely leave it as a lazy iterator:
for i in words:
    l = map(''.join, itertools.permutations(i))
    l = (x for x in l if x != i)
    for anagram in l:
        if anagram in words:
            f2.write(i + "\n")
I'm assuming Python 3.x here, since otherwise the list call was completely unnecessary. If you're using 2.x, replace that map with itertools.imap.
As a side note, f.read(1000) is usually going to get part of an extra word at the end, and the leftover part in the next loop. Try readlines. While it's useless with no argument, with an argument it's very useful:
Read and return a list of lines from the stream. hint can be specified to control the number of lines read: no more lines will be read if the total size (in bytes/characters) of all lines so far exceeds hint.
So, f.readlines(1000) will let you read buffers of about 1K at a time, without getting partial lines. Of course now, instead of having to split on newlines, you have to rstrip them:
words = [line.rstrip('\n') for line in f.readlines(1000)]
However, you've got another problem. If you're only reading about 100 words at a time, the chances of finding an anagram are pretty slim. For example, orchestra is not going to be anywhere near carthorse in the dictionary, so there's no way to find it unless you remember the entire file. But that should be fine; a typical Unix dictionary like web2 has around 200K lines; you can easily read that into memory and keep it around as a set without making even a dent in your 2GB. So:
words = set(line.rstrip('\n') for line in f)
Also, note that you're trying to print out every word in the dictionary that has an anagram (multiple times, if it has multiple anagrams). Even with an efficient algorithm, that's going to take a long time—and spew out more data than you could possibly want. A more useful program might be one that takes an input word (e.g., via input or sys.argv[1]) and outputs just the anagrams of that word.
Finally:
Even after using l as a generator, it is taking too much time, though it no longer fails with a memory error. Can you explain the importance of words being a set rather than a list? [Finished in 137.4s] just for 200 bytes; you mentioned this before, but how do I overcome it using words as a set?
As I said at the top, "This program is going to be horribly inefficient no matter what you do."
In order to find the anagrams of a 12-letter word, you're going through 479 million permutations, and checking each one against a dictionary of about 200 thousand words, so that's 479M * 200K = 95 trillion checks, for each word. There are two ways to improve this, the first involving using the right data structures for the job, and the second involving the right algorithms for the job.
Changing the collection of things to iterate over from a list into a generator (a lazy iterable) turns something that took linear space (479M strings) into something that takes constant space (some fixed-size iterator state, plus one string at a time). Similarly, changing the collection of words to check against from a list into a set turns something that takes linear time (comparing a string against every element in the list) into something that takes constant time (hashing a string, then seeing if there's anything in the set with that hash value). So, this gets rid of the * 200K part of your problem.
But you've still got the 479M part of the problem. And you can't make that go away with a better data structure. Instead, you have to rethink the problem. How can you check whether any permutation of a word matches any other words, without trying all the permutations?
Well, some permutation of the word X matches the word Y if and only if X and Y have the same letters. It doesn't matter what order the letters in X were in; if the set is the same, there is at least one matching permutation (or exactly one, depending on how you count duplicate letters), and if not, there are exactly 0. So, instead of iterating through all the permutations of the word to look up, just look up its set of letters. But it does matter if there are duplicates, so you can't just use set here. You could use some kind of multi-set (collections.Counter would work)… or, with very little loss in efficiency and a big gain in simplicity, you could just sort the letters. After all, if two words have the same letters in some arbitrary order, they have the same letters in the same order when they're both sorted.
Of course you need to know which words are anagrams, not just that there is an anagram, so you can't just look it up in a set of letter sets, you have to look it up in a dictionary that maps letter sets to words. For example, something like this:
import collections

lettersets = collections.defaultdict(set)
for word in words:
    lettersets[''.join(sorted(word))].add(word)
So now, to look up the anagrams for a word, all you have to do is:
anagrams = lettersets[''.join(sorted(word))]
Not only is that simple and readable, it's also constant-time.
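Combining that with the earlier suggestion of taking the word from the command line, a complete sketch might look like this (assuming unixdict.txt has already been downloaded next to the script):

import collections
import sys

with open('unixdict.txt') as f:
    words = set(line.rstrip('\n') for line in f)

# sorted letters -> set of dictionary words with exactly those letters
lettersets = collections.defaultdict(set)
for word in words:
    lettersets[''.join(sorted(word))].add(word)

target = sys.argv[1]
anagrams = lettersets[''.join(sorted(target))] - {target}
print(', '.join(sorted(anagrams)) or 'no anagrams found')

The 200K-line setup cost is paid once; each lookup after that is a single sort of one word plus a dictionary access.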
And if you really want to print out the massive list of all anagrams of all words… well, that's easy too:
for _, anagrams in lettersets.items():
    if len(anagrams) < 2:
        continue  # skip words that have no anagrams
    for word in anagrams:
        print('{} is an anagram of {}'.format(word, ', '.join(anagrams - {word})))
Now, instead of taking 479M*200K time to find anagrams for one word, or 479M*200K*200K time to find all anagrams for all words, it takes constant time to find anagrams for one word, or 200K time to find all anagrams for all words. (Of course there is 200K setup time added to the start to create the mapping, but spending 200K time up-front to save 200K, much less 479M*200K, time for each lookup is an obvious win.)
Things get a little trickier when you want to, e.g., find partial anagrams, or sentence anagrams, but you want to follow the same basic principles: find data structures that let you do things in constant or logarithmic time instead of linear or worse, and find algorithms that don't require you to brute-force your way through an exponential or factorial number of candidates.
import urllib

def anagram():
    f = urllib.urlopen('http://www.puzzlers.org/pub/wordlists/unixdict.txt')
    words = f.read().split('\n')
    d = {''.join(sorted(x)): [] for x in words}  # create dict with empty list as default
    for x in words:
        d[''.join(sorted(x))].append(x)
    max_len = max(len(v) for k, v in d.iteritems())
    for k, v in d.iteritems():
        if len(v) >= max_len:
            print v

anagram()
Output:
['abel', 'able', 'bale', 'bela', 'elba']
['alger', 'glare', 'lager', 'large', 'regal']
['angel', 'angle', 'galen', 'glean', 'lange']
['evil', 'levi', 'live', 'veil', 'vile']
['caret', 'carte', 'cater', 'crate', 'trace']
['elan', 'lane', 'lean', 'lena', 'neal']
Finished in 5.7 secs
Here's a hint on solving the problem: two strings are anagrams of each other if they have the same collection of letters. You can sort the letters of each word (turning e.g. "orchestra" into "acehorrst"), then just check whether two words have the same sorted form. If they do, then the original words must have been anagrams of each other, since they have all the same letters (in a different order).
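To make the hint concrete, a tiny sketch of that test:

def same_letters(a, b):
    # two words are anagrams of each other exactly when their sorted letters match
    return sorted(a) == sorted(b)

print(same_letters("orchestra", "carthorse"))   # True
print(same_letters("orchestra", "orchestral"))  # False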

Python comparing elements in two lists

I have two lists:
a - dictionary which contains keywords such as ["impeccable", "obvious", "fantastic", "evident"] as elements of the list
b - sentences which contains sentences such as ["I am impeccable", "you are fantastic", "that is obvious", "that is evident"]
The goal is to use the dictionary list as a reference.
The process is as follows:
Take an element from the sentences list and run it against each element in the dictionary list. If any of those elements exists in the sentence, then spit out that sentence to a new list.
Repeat step 1 for each of the elements in the sentences list.
Any help would be much appreciated.
Thanks.
Below is the code:
sentences = "The book was awesome and envious", "splendid job done by those guys", "that was an amazing sale"
dictionary = "awesome", "amazing", "fantastic", "envious"

## Find Matches
for match in dictionary:
    if any(match in value for value in sentences):
        print match
Now that you've fixed the original problem, and fixed the next problem with doing the check backward, and renamed all of your variables, you have this:
for match in dictionary:
    if any(match in value for value in sentences):
        print match
And your problem with it is:
The way I have the code written i can get the dictionary items but instead i want to print the sentences.
Well, yes, your match is a dictionary item, and that's what you're printing, so of course that's what you get.
If you want to print the sentences that contain the dictionary item, you can't use any, because the whole point of that function is to just return True if any elements are true. It won't tell you which ones; in fact, if there are more than one, it'll stop at the first one.
If you don't understand functions like any and the generator expressions you're passing to them, you really shouldn't be using them as magic invocations. Figure out how to write them as explicit loops, and you will be able to answer these problems for yourself easily. (Note that the any docs directly show you how to write an equivalent loop.)
For example, your existing code is equivalent to:
for match in dictionary:
    for value in sentences:
        if match in value:
            print match
            break
Written that way, it should be obvious how to fix it. First, you want to print the sentence instead of the word, so print value instead of match (and again, it would really help if you used meaningful variable names like sentence and word instead of meaningless names like value and misleading names like match…). Second, you want to print all matching sentences, not just the first one, so don't break. So:
for match in dictionary:
    for value in sentences:
        if match in value:
            print value
And if you go back to my first answer, you may notice that this is the exact same structure I suggested.
You can simplify or shorten this by using comprehensions and iterator functions, but not until you understand the simple version, and how those comprehensions and iterator functions work.
First translate your algorithm into pseudocode instead of a vague description, like this:
for each sentence:
    for each element in the dictionary:
        if the element is in the sentence:
            spit out the sentence to a new list
The only one of these steps that isn't completely trivial to convert to Python is "spit out the sentence to a new list". To do that, you'll need to have a new list before you get started, like a_new_list = [], and then you can call append on it.
Once you convert this to Python, you will discover that "I am impeccable and fantastic" gets spit out twice. If you don't want that, you need to find the appropriate place to break out of the inner loop and move on to the next sentence. Which is also trivial to convert to Python.
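For reference, a direct translation of that pseudocode, using the sentences and dictionary names from the posted code (matching_sentences is just an illustrative name for the new list):

matching_sentences = []          # the "new list" from the pseudocode
for sentence in sentences:
    for word in dictionary:
        if word in sentence:
            matching_sentences.append(sentence)
            break                # stop so the same sentence isn't added twice
print(matching_sentences)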
Now that you've posted your code… I don't know what problem you were asking about, but there's at least one thing obviously wrong with it.
sentences is a list of sentences.
So, for partial in sentences means each partial will be a sentence, like "I am impeccable".
dictionary is a list of words. So, for value in dictionary means each value will be a word, like "impeccable".
Now, you're checking partial in value for each value for each partial. That will never be true. "I am impeccable" is not in "impeccable".
If you turn that around, and check whether value in partial, it will give you something that's at least true sometimes, and that may even be what you actually want, but I'm not sure.
As a side note, if you used better names for your variables, this would be a lot more obvious. partial and value don't tell you what those things actually are; if you'd called them sentence and word it would be pretty clear that sentence in word is never going to be true, and that word in sentence is probably what you wanted.
Also, it really helps to look at intermediate values to debug things like this. When you use an explicit for statement, you can print(partial) to see each thing that partial holds, or you can put a breakpoint in your debugger, or you can step through in a visualizer like this one. If you have to break the any(genexpr) up into an explicit loop to do that, then do so. (If you don't know how, then you probably don't understand what generator expressions or the any function do, and have just copied and pasted random code you didn't understand and tried changing random things until it worked… in which case you should stop doing that and learn what they actually mean.)

Python if statements moving ahead

I have the following code where frag is a list of strings which are cut up (in order) DNA sequence data:
for a in frag:
    length_fragment = len(a)
    if (a[0:5] == 'CCAGC') and (a[-1:] == 'C'):
        total_length.append(length_fragment)
However, I want to jump ahead to the next a in the for loop and see whether the first letters of that next fragment are CCAGC... is this possible to do in Python?
So I want to change the a[-1:] == 'C' condition to a statement that checks the next a[0:5] == 'ACGAG'. The key word there is the next a in the for loop. So I want to skip ahead briefly in the for loop.
for a, next_a in zip(frag, frag[1:]):
If frag is large, it will be more efficient to use an itertools.islice instead of [1:]
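A minimal sketch combining this with the question's loop, assuming frag and total_length from the original code and using the question's second formulation (count a fragment that starts with CCAGC only when the next fragment starts with ACGAG):

total_length = []
for a, next_a in zip(frag, frag[1:]):
    # look one fragment ahead without breaking out of the loop
    if a[0:5] == 'CCAGC' and next_a[0:5] == 'ACGAG':
        total_length.append(len(a))

Note that the last fragment is never counted, since it has no successor to check.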
Use continue to skip the rest of the for loop and restart at the beginning with the next iteration.
(I'm not 100% clear on your intent, so I'll interpret: you want to find sequences that begin with CCAGC, but only if the following sequence begins with ACGAG. On that assumption...)
If it's convenient, store the data as a single string containing all the sequences, one per line, then use a regex:
import re

# `sequences` is all fragments joined into one string, one per line
ccagc_then_acgag = re.compile(r'(CCAGC.*)\n(?=ACGAG)')
total_length = sum(len(seq) for seq in ccagc_then_acgag.findall(sequences))
I can't say whether this will be faster or slower than iterating over a list of strings (regex libraries have some nice optimisations and the entire loop runs in native code, but the list of strings has the advantage of not having to scan an entire line to find the ACGAG match), but it's worth testing.

How can I sort an item alphabetically before I add it to the list?

I have an assignment where I am supposed to find each word in a line and add that word to a list. There is also another list, corresponding to the list of words, that tells how many times each word appears in the text.
I have finished that part. However, I cannot find a way to compare a newly found word to the words in the list and find the index at which to insert it so the list stays in alphabetical order. I know that I am supposed to write a function that will find that index in the list, so I can insert the item into both lists. I am not allowed to use the sort operator, so I am having a little trouble. Can anyone help me write that function using only comparison operators?
If I am not clear, please let me know.
Homework hint: look at Python's source code for the bisect module. That shows how to find indexes and make insertions in a sorted list.
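The heart of that module's bisect_left function is just a binary search using comparisons, roughly like this sketch (not the stdlib source verbatim; the names are illustrative):

def insert_index(sorted_words, new_word):
    # index at which new_word should be inserted to keep sorted_words in alphabetical order
    lo, hi = 0, len(sorted_words)
    while lo < hi:
        mid = (lo + hi) // 2
        if sorted_words[mid] < new_word:
            lo = mid + 1
        else:
            hi = mid
    return lo

The same index can then be used to insert into both parallel lists so they stay aligned.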
