Memory error while solving an anagram - python

I am trying to solve the below question:
An anagram is a type of word play, the result of rearranging the letters of a word or phrase to produce a new word or phrase, using all the original letters exactly once; e.g., orchestra = carthorse. Using the word list at http://www.puzzlers.org/pub/wordlists/unixdict.txt, write a program that finds the sets of words that share the same characters that contain the most words in them.
It's failing even with just 1000 bytes of file size. Also every time a new list is created, so why does Python keep the old list in memory? I am getting the below error.
l=list(map(''.join, itertools.permutations(i)))
gives me:
MemoryError
Here's my code:
import itertools
def anagram():
    f=open('unixdict.txt')
    f2=open('result_anagram.txt','w')
    words = f.read(1000).split('\n')
    for i in words:
        l=[]
        l=list(map(''.join, itertools.permutations(i)))
        l.remove(i)
        for anagram in l:
            if l==i:
                f2.write(i + "\n")
    return True
anagram()
I changed the code above to the following, as per the suggestion, but I am still getting the memory error.
import itertools
def anagram():
    f=open('unixdict.txt')
    f2=open('result_anagram.txt','w')
    words = set(line.rstrip('\n') for line in f)
    for i in words:
        l= map(''.join, itertools.permutations(i))
        l =(x for x in l if x!=i)
        for anagram in l:
            if anagram in words:
                f2.write(i + "\n")
    return True
anagram()
MemoryError
[Finished in 22.2s]

This program is going to be horribly inefficient no matter what you do.
But you can fix this MemoryError so it'll just take forever to run instead of failing.
First, note that a 12-letter word has 479,001,600 permutations. Storing all of those in memory is going to take more than 2GB of memory. So, how do you solve that? Just don't store them all in memory. Leave the iterator as an iterator instead of making a list, and then you'll only have to fit one at a time, instead of all of them.
There's one problem here: You're actually using that list in the if l==i: line. But clearly that's a mistake. There's no way that a list of strings can ever equal a single string. You might as well replace that line with raise TypeError, at which point you can just replace the whole loop and fail a whole lot faster. :)
I think what you wanted there is if anagram in words:. In which case you have no need for l, except for in the for loop, which means you can safely leave it as a lazy iterator:
for i in words:
    l = map(''.join, itertools.permutations(i))
    l = (x for x in l if x != i)
    for anagram in l:
        if anagram in words:
            f2.write(i + "\n")
I'm assuming Python 3.x here, since otherwise the list call was completely unnecessary. If you're using 2.x, replace that map with itertools.imap.
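To make the numbers above concrete, here is a small sketch (Python 3 assumed) that checks the permutation counts with math.factorial and shows that itertools.permutations hands out results one at a time. The sample word and the cutoff of three items are just illustrative choices, not part of the original program.

```python
import itertools
import math

word = "orchestra"  # 9 letters, used only as an example

# The number of orderings of n letters is n factorial.
print(math.factorial(len(word)))  # 362880
print(math.factorial(12))         # 479001600

# permutations() is lazy: asking for the first few items does not
# build the remaining hundreds of thousands in memory.
perms = map(''.join, itertools.permutations(word))
first_three = list(itertools.islice(perms, 3))
print(first_three)
```

Calling list() on the whole iterator is what forces every permutation into memory at once; iterating over it directly keeps only one string alive at a time.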
As a side note, f.read(1000) is usually going to get part of an extra word at the end, and the leftover part in the next loop. Try readlines. While it's useless with no argument, with an argument it's very useful:
Read and return a list of lines from the stream. hint can be specified to control the number of lines read: no more lines will be read if the total size (in bytes/characters) of all lines so far exceeds hint.
So, f.readlines(1000) will let you read buffers of about 1K at a time, without getting partial lines. Of course now, instead of having to split on newlines, you have to rstrip them:
words = [line.rstrip('\n') for line in f.readlines(1000)]
However, you've got another problem. If you're only reading about 100 words at a time, the chances of finding an anagram are pretty slim. For example, orchestra is not going to be anywhere near carthorse in the dictionary, so there's no way to find it unless you remember the entire file. But that should be fine; a typical Unix dictionary like web2 has around 200K lines; you can easily read that into memory and keep it around as a set without making even a dent in your 2GB. So:
words = set(line.rstrip('\n') for line in f)
Also, note that you're trying to print out every word in the dictionary that has an anagram (multiple times, if it has multiple anagrams). Even with an efficient algorithm, that's going to take a long time—and spew out more data than you could possibly want. A more useful program might be one that takes an input word (e.g., via input or sys.argv[1]) and outputs just the anagrams of that word.
Finally:
Even after using l as a generator, it is taking too much time, though it no longer fails with a memory error. Can you explain the importance of words as a set rather than a list? [Finished in 137.4s] just for 200 bytes. You have mentioned it before, but how do I overcome it using words as a set?
As I said at the top, "This program is going to be horribly inefficient no matter what you do."
In order to find the anagrams of a 12-letter word, you're going through 479 million permutations, and checking each one against a dictionary of about 200 thousand words, so that's 479M * 200K = 95 trillion checks, for each word. There are two ways to improve this, the first involving using the right data structures for the job, and the second involving the right algorithms for the job.
Changing the collection of things to iterate over from a list into a generator (a lazy iterable) turns something that took linear space (479M strings) into something that takes constant space (some fixed-size iterator state, plus one string at a time). Similarly, changing the collection of words to check against from a list into a set turns something that takes linear time (comparing a string against every element in the list) into something that takes constant time (hashing a string, then seeing if there's anything in the set with that hash value). So, this gets rid of the * 200K part of your problem.
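A rough way to see the list-versus-set difference for yourself is to time one membership test against each. This is only a sketch: the 200,000 stand-in strings and the probe value are made up, and the absolute timings will vary by machine.

```python
import timeit

# Stand-in for a ~200K-word dictionary (the real words don't matter here).
words_list = [str(n) for n in range(200000)]
words_set = set(words_list)

# A string that is not present is the worst case for the list,
# because every element has to be compared before giving up.
probe = "not-a-word"

list_time = timeit.timeit(lambda: probe in words_list, number=20)
set_time = timeit.timeit(lambda: probe in words_set, number=20)
print("list: %.6fs  set: %.6fs" % (list_time, set_time))
```

Both containers give the same answer; only the amount of work needed to reach it differs.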
But you've still got the 479M part of the problem. And you can't make that go away with a better data structure. Instead, you have to rethink the problem. How can you check whether any permutation of a word matches any other words, without trying all the permutations?
Well, some permutation of the word X matches the word Y if and only if X and Y have the same letters. It doesn't matter what order the letters in X were in; if the set is the same, there is at least one matching permutation (or exactly one, depending on how you count duplicate letters), and if not, there are exactly 0. So, instead of iterating through all the permutations of the word to look up, just look up its set. But it does matter if there are duplicates, so you can't just use set here. You could use some kind of multi-set (collections.Counter works)… or, with very little loss in efficiency and a big gain in simplicity, you could just sort the letters. After all, if two words have the same letters in some arbitrary order, they have the same letters in the same order when they're both sorted.
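As a quick illustration of why a plain set is not enough while a multi-set or sorted letters are, consider the sketch below. The helper name signature is made up for this example.

```python
from collections import Counter

def signature(word):
    """Return the word's letters in sorted order, duplicates kept."""
    return ''.join(sorted(word))

# Same letters, same counts: the sorted forms match.
print(signature("orchestra"), signature("carthorse"))

# A plain set loses duplicate letters, so it wrongly treats these as equal...
print(set("aab") == set("ab"))
# ...while a Counter (multi-set) and the sorted signature both tell them apart.
print(Counter("aab") == Counter("ab"))
print(signature("aab") == signature("ab"))
```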
Of course you need to know which words are anagrams, not just that there is an anagram, so you can't just look it up in a set of letter sets, you have to look it up in a dictionary that maps letter sets to words. For example, something like this:
import collections
lettersets = collections.defaultdict(set)
for word in words:
    lettersets[''.join(sorted(word))].add(word)
So now, to look up the anagrams for a word, all you have to do is:
anagrams = lettersets[''.join(sorted(word))]
Not only is that simple and readable, it's also constant-time.
And if you really want to print out the massive list of all anagrams of all words… well, that's easy too:
for _, words in lettersets.items():
    for word in words:
        print('{} is an anagram of {}'.format(word, ', '.join(words - {word})))
Now, instead of taking 479M*200K time to find anagrams for one word, or 479M*200K*200K time to find all anagrams for all words, it takes constant time to find anagrams for one word, or 200K time to find all anagrams for all words. (Of course there is 200K setup time added to the start to create the mapping, but spending 200K time up-front to save 200K, much less 479M*200K, time for each lookup is an obvious win.)
Things get a little trickier when you want to, e.g., find partial anagrams, or sentence anagrams, but you want to follow the same basic principles: find data structures that let you do things in constant or logarithmic time instead of linear or worse, and find algorithms that don't require you to brute-force your way through an exponential or factorial number of candidates.

import urllib
def anagram():
    f=urllib.urlopen('http://www.puzzlers.org/pub/wordlists/unixdict.txt')
    words = f.read().split('\n')
    d={''.join(sorted(x)):[] for x in words} #create dic with empty list as default
    for x in words:
        d[''.join(sorted(x))].append(x)
    max_len= max( len(v) for k,v in d.iteritems())
    for k,v in d.iteritems():
        if len(v)>=max_len:
            print v
anagram()
Output:
['abel', 'able', 'bale', 'bela', 'elba']
['alger', 'glare', 'lager', 'large', 'regal']
['angel', 'angle', 'galen', 'glean', 'lange']
['evil', 'levi', 'live', 'veil', 'vile']
['caret', 'carte', 'cater', 'crate', 'trace']
['elan', 'lane', 'lean', 'lena', 'neal']
Finished in 5.7 secs
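The code above is Python 2 (urllib.urlopen, iteritems and the print statement). For readers on Python 3, the same grouping idea might look like the sketch below. The function name largest_anagram_sets and the tiny sample list are made up for illustration; in practice the words would come from unixdict.txt.

```python
from collections import defaultdict

def largest_anagram_sets(words):
    """Group words by their sorted letters and return the biggest groups."""
    groups = defaultdict(list)
    for word in words:
        groups[''.join(sorted(word))].append(word)
    if not groups:
        return []
    max_len = max(len(group) for group in groups.values())
    return [group for group in groups.values() if len(group) == max_len]

# Hypothetical sample; a real run would read the word list from a file.
sample = ["abel", "able", "bale", "evil", "live", "veil", "vile", "dog"]
print(largest_anagram_sets(sample))
```

Each word is sorted once and appended once, so the work grows with the size of the word list rather than with the number of permutations.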

Here's a hint on solving the problem: two strings are anagrams of each other if they have the same collection of letters. You can sort the letters of each word (turning e.g. "orchestra" into "acehorrst"), then just see whether two words have the same sorted form. If they do, then the original words must have been anagrams of each other, since they have all the same letters (in a different order).

Related

Most efficient way to check if any substrings in list are in another list of strings

I have two lists, one of words, and another of character combinations. What would be the fastest way to only return the combinations that don't match anything in the list?
I've tried to make it as streamlined as possible, but it's still very slow when it uses 3 characters for the combinations (goes up to 290 seconds for 4 characters, not even going to try 5)
Here's some example code, currently I'm converting all the words to a list, and then searching the string for each list value.
#Sample of stuff
allCombinations = ["a","aa","ab","ac","ad"]
allWords = ["testing", "accurate" ]
#Do the calculations
allWordsJoined = ",".join( allWords )
invalidCombinations = set( i for i in allCombinations if i not in allWordsJoined )
print invalidCombinations
#Result: set(['aa', 'ab', 'ad'])
I'm just curious if there's a better way to do this with sets? With a combination of 3 letters, there are 18278 list items to search for, and for 4 letters, that goes up to 475254, so currently my method isn't really fast enough, especially when the word list string is about 1 million characters.
Set.intersection seems like a very useful method if you need the whole string, so surely there must be something similar to search for a substring.
The first thing that comes to mind is that you can optimize the lookup by checking the current combination against combinations that are already "invalid". That is, if ab is invalid, then any longer combination containing ab (such as ab.?) will be invalid too, and there's no point in checking it.
And one more thing: try using
invalidCombinations = set()
for i in allCombinations:
    if i not in allWordsJoined:
        invalidCombinations.add(i)
instead of
invalidCombinations = set(i for i in allCombinations if i not in allWordsJoined)
I'm not sure, but fewer memory allocations may give a small boost when running on real data.
Checking whether a set contains an item is O(1). You would still have to iterate through your list of combinations to compare them with your original set of words, with some exceptions: if a word doesn't contain "a", it's not going to contain any other combination that contains "a". You can use some tree-like data structure for this.
You shouldn't convert your wordlist to a string, but rather a set. You should get O(N) where N is the length of your combinations.
Also, I like Python, but it isn't the fastest of languages. If this is the only task you need to do, and it needs to be very fast, and you can't improve the algorithm, you might want to check out other languages. You should be able to very easily prototype something to get an idea of the difference in speed for different languages.
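One way to make the set idea work for substrings (set membership alone only matches whole strings) is to collect every substring of the words, up to the longest combination length, into a set and then take a set difference. This is a sketch under that assumption; the function name invalid_combinations is made up, and memory use grows with the number of distinct substrings.

```python
def invalid_combinations(combinations, words):
    """Return the combinations that occur in none of the words."""
    max_len = max((len(c) for c in combinations), default=0)
    seen = set()
    for word in words:
        for start in range(len(word)):
            # Only substrings as long as the longest combination matter.
            for length in range(1, max_len + 1):
                if start + length > len(word):
                    break
                seen.add(word[start:start + length])
    return set(combinations) - seen

all_combinations = ["a", "aa", "ab", "ac", "ad"]
all_words = ["testing", "accurate"]
print(invalid_combinations(all_combinations, all_words))
```

The word list is scanned once, and each combination then costs a single hash lookup instead of a search through a million-character string.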

Performing Counts, Sorting/mapping Large Dicts

I'm doing this week's 'easy' Daily Programmer Challenge on Reddit. The description is at the link, but essentially the challenge is to read a text file from a url and do a word count. Needless to say the resulting output is a fairly large dictionary object. I have a few questions, mostly regarding accessing or sorting keys according to their value.
First, I developed the code according to what I currently understand about OOP and good Python style. I wanted it to be as robust as possible but I also wanted to use the least amount of imported modules. My goal is to become a good programmer, thus I believe it's important to develop a strong foundation and figure out how to do things myself whenever possible. That being said, the code:
from urllib2 import urlopen
class Word(object):
    def __init__(self):
        self.word_count = {}
    def alpha_only(self, word):
        """Converts word to lowercase and removes any non-alphabetic characters."""
        x = ''
        for letter in word:
            s = letter.lower()
            if s in 'abcdefghijklmnopqrstuvwxyz':
                x += s
        if len(x) > 0:
            return x
    def count(self, line):
        """Takes a line from the file and builds a list of lowercased words containing only alphabetic chars.
        Adds each word to word_count if not already present, if present increases the count by 1."""
        words = [self.alpha_only(x) for x in line.split(' ') if self.alpha_only(x) != None]
        for word in words:
            if word in self.word_count:
                self.word_count[word] += 1
            elif word != None:
                self.word_count[word] = 1
class File(object):
    def __init__(self,book):
        self.book = urlopen(book)
        self.word = Word()
    def strip_line(self,line):
        """Strips newlines, tabs, and return characters from beginning and end of line. If remaining string > 1,
        splits up the line and passes it along to the count method of the word object."""
        s = line.strip('\n\r\t')
        if s > 1:
            self.word.count(s)
    def process_book(self):
        """Main processing loop, will not begin processing until the first line after the line containing "START".
        After processing it will close the file."""
        begin = False
        for line in self.book:
            if begin == True:
                self.strip_line(line)
            elif 'START' in line:
                begin = True
        self.book.close()
book = File('http://www.gutenberg.org/cache/epub/47498/pg47498.txt')
book.process_book()
count = book.word.word_count
So now I have a fairly accurate and robust word count that probably doesn't have any duplicates or blank entries, but is nevertheless a dict object containing over 3k key/value pairs. I can't iterate over it using for k,v in count or it gives me the exception ValueError: too many values to unpack, which rules out using list comprehension or mapping to a function to perform any kind of sorting.
I was reading this HowTo on Sorting and playing with it a few minutes ago and noticed that for x in count.items() lets me iterate through a list of key/value pairs without throwing a ValueError exception, so I removed the line count = book.word.word_count and added the following:
s_count = sorted(book.word.word_count.items(), key=lambda count: count[1], reverse=True)
# Delete the original dict, it is no longer needed
del book.word.word_count
Now I finally have a sorted list of words, s_count. PHEW! So, my questions are:
Is a dict even the best data type to perform the original counting? Would a list of tuples like that returned by count.items() have been preferable? But that would probably slow it down, right?
This seems kind of 'clunky', as I'm building a dict, converting it to a list containing tuples, then sorting the list and returning a new list. However, it is my understanding that dictionaries allow me to perform the fastest lookups, so am I missing something here?
I read briefly about hashing. While I think I understand that the point is that hashing will save space in memory and allow me to perform faster look-ups and comparisons, wouldn't the trade off be that the program becomes more computationally expensive(higher CPU load) because it would then be calculating hashes for each word? Is hashing relevant here?
Any feedback on naming conventions (which I am terrible at), or any other suggestions about basically anything (including style), would be greatly appreciated.
Are you sure that for k,v in count: gives the exception ValueError: too many values to unpack? I expect it to give ValueError: need more than 1 value to unpack.
When you use a dict as an iterator (eg in a for loop) you just get the keys, you don't get the values. If you want key, value pairs you need to use the dict's iteritems() method as mentioned by figs in the comment (or in Python 3 the items() method).
Of course, you can always do something like:
for k in count:
    print k, count[k]
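A short Python 3 sketch of the difference between iterating over a dict directly and iterating over its items() follows. The two-entry count dict is a made-up stand-in for the real word count.

```python
count = {"the": 3, "cat": 1}  # hypothetical stand-in for the real dict

# Iterating over the dict itself yields only the keys.
keys = [k for k in count]
print(keys)

# items() yields (key, value) pairs, which can be unpacked safely.
pairs = [(k, v) for k, v in count.items()]
print(pairs)

# Sorting by value, highest first, as in the question.
by_count = sorted(count.items(), key=lambda pair: pair[1], reverse=True)
print(by_count)
```

Writing for k, v in count tries to unpack each key string into two names, which is why it raises ValueError unless every key happens to be exactly two characters long.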
...
I think that most of your questions are more suited to Code Review than to Stack Overflow. But since you've asked so nicely here, I'll mention a few points. :)
It's rather inefficient to build up a string char by char, so your alpha_only() method would be better if it collected chars in a list then used the str.join() method to join them into a single string. The usual Python idiom would do that using a list comprehension.
The list comprehension in your count() method calls alpha_only() twice for each word, which is inefficient.
You could make your strip() call simpler by using the default argument, as that strips all white space (and you don't need to preserve space chars in this application). Similarly, using split() with its default arg will split on any runs of blank space, which is probably better in this application, since giving an arg of a single space means that you'll get some empty strings in the list returned by split if there are any runs of multiple spaces within a line.
...
You mention hashing in your question, and whether it's useful for this application. Yes, it is. Python dictionaries actually use hashing of their keys, so you don't need to worry about the details. And yes, a dictionary is a good data structure to use for this task. There are fancier forms of dictionary that make things a bit simpler, but to use them does require importing a (standard) module. But using a dictionary (of some flavour or another) to hold data and then generating a list of tuples from it for final sorting is a fairly common practice in Python. And there's no need to specifically delete the dictionary when you've finished with it if the program's about to terminate anyway.
...
As for the duplicated call of alpha_only(), whenever you find yourself doing that sort of thing it's a sign that a list comprehension isn't really suitable for the task and that you should just use a normal for loop so that you can save the result of the function call rather than having to recalculate it. Eg,
words = []
for word in line.split():
    word = self.alpha_only(word)
    if word is not None:
        words.append(word)
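Putting the suggestions above together, a Python 3 sketch might look like this. It assumes collections.Counter is an acceptable import (it is one of the "fancier" standard-library dictionaries), and the function names and sample lines are made up for illustration rather than taken from the original class design.

```python
from collections import Counter

def alpha_only(word):
    """Lowercase the word and keep only the letters a-z; None if nothing is left."""
    cleaned = ''.join(ch for ch in word.lower() if 'a' <= ch <= 'z')
    return cleaned or None

def count_words(lines):
    """Count cleaned words across an iterable of text lines."""
    counts = Counter()
    for line in lines:
        # split() with no argument splits on any run of whitespace.
        for raw in line.split():
            word = alpha_only(raw)  # called once per word
            if word is not None:
                counts[word] += 1
    return counts

counts = count_words(["The cat -- the CAT!", "  a  cat "])
print(counts.most_common(2))
```

Counter.most_common does the "convert to a list of tuples and sort by count" step in one call, which is the same pattern as sorting count.items() by value.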

Python comparing elements in two lists

I have two lists:
a - dictionary which contains keywords such as ["impeccable", "obvious", "fantastic", "evident"] as elements of the list
b - sentences which contains sentences such as ["I am impeccable", "you are fantastic", "that is obvious", "that is evident"]
The goal is to use the dictionary list as a reference.
The process is as follows:
Take an element for the sentences list and run it against each element in the dictionary list. If any of the elements exists, then spit out that sentence to a new list
Repeating step 1 for each of the elements in the sentences list.
Any help would be much appreciated.
Thanks.
Below is the code:
sentences = "The book was awesome and envious","splendid job done by those guys", "that was an amazing sale"
dictionary = "awesome","amazing", "fantastic","envious"
##Find Matches
for match in dictionary:
    if any(match in value for value in sentences):
        print match
Now that you've fixed the original problem, and fixed the next problem with doing the check backward, and renamed all of your variables, you have this:
for match in dictionary:
    if any(match in value for value in sentences):
        print match
And your problem with it is:
The way I have the code written i can get the dictionary items but instead i want to print the sentences.
Well, yes, your match is a dictionary item, and that's what you're printing, so of course that's what you get.
If you want to print the sentences that contain the dictionary item, you can't use any, because the whole point of that function is to just return True if any elements are true. It won't tell you which ones; in fact, if there is more than one, it'll stop at the first one.
If you don't understand functions like any and the generator expressions you're passing to them, you really shouldn't be using them as magic invocations. Figure out how to write them as explicit loops, and you will be able to answer these problems for yourself easily. (Note that the any docs directly show you how to write an equivalent loop.)
For example, your existing code is equivalent to:
for match in dictionary:
    for value in sentences:
        if match in value:
            print match
            break
Written that way, it should be obvious how to fix it. First, you want to print the sentence instead of the word, so print value instead of match (and again, it would really help if you used meaningful variable names like sentence and word instead of meaningless names like value and misleading names like match…). Second, you want to print all matching sentences, not just the first one, so don't break. So:
for match in dictionary:
    for value in sentences:
        if match in value:
            print value
And if you go back to my first answer, you may notice that this is the exact same structure I suggested.
You can simplify or shorten this by using comprehensions and iterator functions, but not until you understand the simple version, and how those comprehensions and iterator functions work.
First translate your algorithm into pseudocode instead of a vague description, like this:
for each sentence:
    for each element in the dictionary:
        if the element is in the sentence:
            spit out the sentence to a new list
The only one of these steps that isn't completely trivial to convert to Python is "spit out the sentence to a new list". To do that, you'll need to have a new list before you get started, like a_new_list = [], and then you can call append on it.
Once you convert this to Python, you will discover that "I am impeccable and fantastic" gets spit out twice. If you don't want that, you need to find the appropriate place to break out of the inner loop and move on to the next sentence. Which is also trivial to convert to Python.
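For reference, one possible translation of that pseudocode is sketched below, using the sample data from the question plus the "impeccable and fantastic" sentence. The names keywords and matches are chosen here for illustration.

```python
keywords = ["impeccable", "obvious", "fantastic", "evident"]
sentences = [
    "I am impeccable",
    "you are fantastic",
    "that is obvious",
    "that is evident",
    "I am impeccable and fantastic",
    "nothing to see here",
]

matches = []
for sentence in sentences:
    for word in keywords:
        if word in sentence:
            matches.append(sentence)
            break  # one hit is enough; move on to the next sentence

print(matches)

# Once the loop version is clear, the same result fits in one expression.
same_matches = [s for s in sentences if any(w in s for w in keywords)]
```

Note that word in sentence is a substring test, so a keyword also matches inside a longer word.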
Now that you've posted your code… I don't know what problem you were asking about, but there's at least one thing obviously wrong with it.
sentences is a list of sentences.
So, for partial in sentences means each partial will be a sentence, like "I am impeccable".
dictionary is a list of words. So, for value in dictionary means each value will be a word, like "impeccable".
Now, you're checking partial in value for each value for each partial. That will never be true. "I am impeccable" is not in "impeccable".
If you turn that around, and check whether value in partial, it will give you something that's at least true sometimes, and that may even be what you actually want, but I'm not sure.
As a side note, if you used better names for your variables, this would be a lot more obvious. partial and value don't tell you what those things actually are; if you'd called them sentence and word it would be pretty clear that sentence in word is never going to be true, and that word in sentence is probably what you wanted.
Also, it really helps to look at intermediate values to debug things like this. When you use an explicit for statement, you can print(partial) to see each thing that partial holds, or you can put a breakpoint in your debugger, or you can step through in a visualizer like this one. If you have to break the any(genexpr) up into an explicit loop to do that, then do so. (If you don't know how, then you probably don't understand what generator expressions or the any function do, and have just copied and pasted random code you didn't understand and tried changing random things until it worked… in which case you should stop doing that and learn what they actually mean.)

Trying to unravel a recursive Python program

I am writing a simple cryptogram solver and am having trouble 'unrolling' a recursive function. I must unroll it for other reasons, otherwise I would leave it recursive.
Here's the idea: I have a variable number of lists, each with words in them. The function's job is to go through each list and, after checking that the word fits in the current alphabet setup, find its score. So if you have the following lists:
LIST1: [the, and, can,...]
LIST2: [kids, cars, knee,...]
LIST3: [talks, walks, music,...]
...
and the function needs to go through each list (in order) and try to find the best sentence. (I have a scoring algorithm that it calls to compare.) It starts with the first word in the first list, then iterates the second list until it finds a word that works, then starts iterating the third list until it finds a word in that list that works, etc. Once it exhausts the words in the 3rd list, it should then go back to the second and find the next word that works, continuing the process until it's done.
I tried using the Product function, but that doesn't work the right way...that just gives me all possible combinations, and technically works, but is not very efficient.
def find_sentence():
    cycle through first list:
        cycle through second list:
            if word works:
                start cycling through third word list.
            else:
                keep cycling through 2nd word list.
    ...
    Keep going until we have gone through all word lists, finding a score that is above a threshold..
Any help?
From Bakuriu's response:
Thanks for your fast reply! I'm not that great at Python, but I don't think this is working the way I need it to. Your solution is similar to the Product method in that its goal is to find all words that will work (or fit a score). The method I need to use is:
1. Start with the 1st word in the 1st list.
2. Start iterating the next list of words.
3. As soon as one of those words works, start going through the 3rd list, etc.
4. When you've reached the end (the last list of words) and find a candidate, you now have a solution, as you have one word in each list that works.
5. If, say, a word in list 3 does not fit, you must go back to list 2 and CONTINUE searching through that list, finding the next word that works, moving on to start list 3 OVER AGAIN, and continuing until nothing works or you've reached the end.
I hope this is clear. Please let me know if I can clarify anything.
You really don't need recursion here at all, actually.
def find_sentence(*variable_number_of_lists):
    out = []
    for eachlist in variable_number_of_lists:
        for word in eachlist:
            if scoreword(out, word) > threshhold:
                # presumably, your 'scoreword' function would take in the current
                # list of okayed words in order to find the most recent one for use
                # in your scoring, if I've understood the problem correctly
                out.append(word)
                break
    return out
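The loop above picks the first acceptable word from each list and never revisits an earlier choice. If the "go back to list 2 and continue" behaviour described in the question is needed without recursion, one option is a depth-first search with an explicit stack of positions. The sketch below assumes a hypothetical predicate fits(chosen, word) standing in for the asker's scoring and alphabet check; it is an illustration, not the asker's actual scoring logic.

```python
def find_sentence(word_lists, fits):
    """Iterative backtracking: pick one word per list, in order.

    fits(chosen, word) is a hypothetical check that decides whether
    word is acceptable given the words chosen so far.
    Returns the first complete selection, or None if there is none.
    """
    if not word_lists:
        return []
    chosen = []      # words picked so far, one per earlier list
    positions = [0]  # next index to try in each list currently in play
    while positions:
        depth = len(positions) - 1
        candidates = word_lists[depth]
        i = positions[depth]
        # Advance to the next word in this list that fits.
        while i < len(candidates) and not fits(chosen, candidates[i]):
            i += 1
        if i == len(candidates):
            # This list is exhausted: drop it and undo the previous choice.
            positions.pop()
            if chosen:
                chosen.pop()
            continue
        positions[depth] = i + 1  # where to resume if we come back here
        chosen.append(candidates[i])
        if len(chosen) == len(word_lists):
            return list(chosen)
        positions.append(0)       # start the next list from the top
    return None
```

For example, with a toy rule that no two chosen words may start with the same letter, find_sentence([["a1", "b1"], ["a2", "c2"], ["c3", "a3"]], fits) first tries a1 and c2, finds nothing usable in the third list, backs up, and eventually settles on b1, a2, c3.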

for loop only returning the last word in a list of many words

for word in wordStr:
    word = word.strip()
print word
When the above code analyzes a .txt with thousands of words, why does it only return the last word in the .txt file? What do I need to do to get it to return all the words in the text file?
Because you're overwriting word in the loop, not really a good idea. You can try something like:
wordlist = ""
for word in wordStr:
    wordlist = "%s %s"%(wordlist,word.strip())
print wordlist[1:]
This is fairly primitive Python and I'm sure there's a more Pythonic way to do it with list comprehensions and all that new-fangled stuff :-) but I usually prefer readability where possible.
What this does is to maintain a list of the words in a separate string and then add each stripped word to the end of that list. The [1:] at the end is simply to get rid of the initial space that was added when the first word was tacked on to the end of the empty word list.
It will suffer eventually as the word count becomes substantial since tacking things on to the end of a string is less optimal than other data structures. However, even up to 10,000 words (with the print removed), it's still well under a second of execution time.
At 50,000 words it becomes noticeable, taking 3 seconds on my box. If you're going to be processing that sort of quantity, you would probably opt for a real list-based solution like (equivalent to above but with a different underlying data structure):
wordlist = []
for word in wordStr:
    wordlist.append (word.strip())
print wordlist
That takes about 0.22 seconds (without the print) to do my entire dictionary file, some 110,000 words.
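If a single space-separated string is still wanted at the end, the list-based approach combines naturally with str.join, which avoids growing a string one piece at a time. A small Python 3 sketch, with a made-up three-line input standing in for wordStr:

```python
# Hypothetical stand-in for the lines read from the .txt file.
word_str = ["  alpha\n", "beta \n", "gamma"]

# Collect the stripped words in a list...
stripped = [word.strip() for word in word_str]

# ...then build the final string in a single pass.
joined = " ".join(stripped)
print(joined)
```

No leading space is produced, so the [1:] slice from the string-building version is not needed.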
To print all the words in wordStr (assuming that wordStr is some kind of iterable that returns strings), you can simply write
for word in wordStr:
    word = word.strip()
    print word # Notice that the only difference is the indentation on this line
Python cares about indentation, so in your code the print statement is outside the loop and is only executed once. In the modified version, the print statement is inside the loop and is executed once per word.
That is because the variable word always holds the current word, so when the loop finishes the file, it holds the last one. Collect the words in a list instead:
def stuff():
    words = []
    for word in wordStr:
        words.append(word.strip())
    print words
    return words
List comprehensions should make your code snazzier and more pythonic.
wordStr = 'here are some words'
print [word.strip() for word in wordStr.split()]
returns
['here', 'are', 'some', 'words']
If you don't know why the code in your example is "returning" only the last word (actually it's not returning anything, it's printing a single word), then I'm afraid nothing here is going to help you very much.
I know that sounds hostile, but I don't wish to be. It seems from your question that you are throwing bits of Python together with little real understanding of the fundamental basics of programming in Python.
Now don't get me wrong, trying stuff out is a great early learning activity, and having a task to motivate your learning is also a great way to do it. So I don't want to tell you to stop! But whether you're trying to learn to program or just have a task that needs you to write a program, you aren't going to get very far without developing an understanding of the fundamental underlying issues that make your example not do what you want it to do.
We can tell you here that this:
for word in wordStr:
    word = word.strip()
print word
is a program that roughly means "for every word in wordStr, bind word to the result of word.strip(); then after all that, print the contents of word", whereas what you wanted is likely:
for word in wordStr:
    word = word.strip()
    print word
which is a program that roughly means "for every word in wordStr, bind word to the result of word.strip() and then print word". And that solves your immediate problem. But you're going to run into many more problems of a very similar nature, and without an understanding of the very basics you won't be able to see that they're all "of a kind", and you'll just end up asking more questions here.
What you need is to gain a basic understanding of how variables, statements, loops, etc work in Python. You will probably eventually gain that if you just keep trying to apply code and ask questions here. But Stack Overflow is not the most efficient resource for gaining that understanding; a much better bet would be to find yourself a good book or tutorial (there's an official one at http://docs.python.org/tutorial/).
Here endeth the soap box.
