I am trying to solve the below question:
An anagram is a type of word play, the result of rearranging the letters of a word or phrase to produce a new word or phrase, using all the original letters exactly once; e.g., orchestra = carthorse. Using the word list at http://www.puzzlers.org/pub/wordlists/unixdict.txt, write a program that finds the sets of words that share the same characters that contain the most words in them.
It fails even when I read just 1000 bytes of the file. Also, a new list is created on every iteration, so why does Python keep the old list in memory? I am getting the error below.
l=list(map(''.join, itertools.permutations(i)))
gives me:
MemoryError
Here's my code:
import itertools

def anagram():
    f=open('unixdict.txt')
    f2=open('result_anagram.txt','w')
    words = f.read(1000).split('\n')
    for i in words:
        l=[]
        l=list(map(''.join, itertools.permutations(i)))
        l.remove(i)
        for anagram in l:
            if l==i:
                f2.write(i + "\n")
    return True

anagram()
Changed the above code to, as per suggestion. But still getting the memory error.
import itertools

def anagram():
    f=open('unixdict.txt')
    f2=open('result_anagram.txt','w')
    words = set(line.rstrip('\n') for line in f)
    for i in words:
        l = map(''.join, itertools.permutations(i))
        l = (x for x in l if x != i)
        for anagram in l:
            if anagram in words:
                f2.write(i + "\n")
    return True

anagram()
MemoryError
[Finished in 22.2s]
This program is going to be horribly inefficient no matter what you do.
But you can fix this MemoryError so it'll just take forever to run instead of failing.
First, note that a 12-letter word has 479,001,600 permutations. Storing all of those in memory is going to take more than 2GB of memory. So, how do you solve that? Just don't store them all in memory. Leave the iterator as an iterator instead of making a list, and then you'll only have to fit one at a time, instead of all of them.
There's one problem here: You're actually using that list in the if l==i: line. But clearly that's a mistake. There's no way that a list of strings can ever equal a single string. You might as well replace that line with raise TypeError, at which point you can just replace the whole loop and fail a whole lot faster. :)
I think what you wanted there is if anagram in words:. In which case you have no need for l, except for in the for loop, which means you can safely leave it as a lazy iterator:
for i in words:
    l = map(''.join, itertools.permutations(i))
    l = (x for x in l if x != i)
    for anagram in l:
        if anagram in words:
            f2.write(i + "\n")
I'm assuming Python 3.x here, since otherwise the list call was completely unnecessary. If you're using 2.x, replace that map with itertools.imap.
As a side note, f.read(1000) is usually going to get part of an extra word at the end, and the leftover part in the next loop. Try readlines. While it's useless with no argument, with an argument it's very useful:
Read and return a list of lines from the stream. hint can be specified to control the number of lines read: no more lines will be read if the total size (in bytes/characters) of all lines so far exceeds hint.
So, f.readlines(1000) will let you read buffers of about 1K at a time, without getting partial lines. Of course now, instead of having to split on newlines, you have to rstrip them:
words = [line.rstrip('\n') for line in f.readlines(1000)]
However, you've got another problem. If you're only reading about 100 words at a time, the chances of finding an anagram are pretty slim. For example, orchestra is not going to be anywhere near carthorse in the dictionary, so there's no way to find it unless you remember the entire file. But that should be fine; a typical Unix dictionary like web2 has around 200K lines; you can easily read that into memory and keep it around as a set without making even a dent in your 2GB. So:
words = set(line.rstrip('\n') for line in f)
Also, note that you're trying to print out every word in the dictionary that has an anagram (multiple times, if it has multiple anagrams). Even with an efficient algorithm, that's going to take a long time—and spew out more data than you could possibly want. A more useful program might be one that takes an input word (e.g., via input or sys.argv[1]) and outputs just the anagrams of that word.
Finally:
Even after using l as a generator, it is taking too much time, though it no longer fails with a memory error. Can you explain the importance of words being a set rather than a list? [Finished in 137.4s] just for 200 bytes. You mentioned it before, but how do I overcome this using words as a set?
As I said at the top, "This program is going to be horribly inefficient no matter what you do."
In order to find the anagrams of a 12-letter word, you're going through 479 million permutations, and checking each one against a dictionary of about 200 thousand words, so that's 479M * 200K = 95 trillion checks, for each word. There are two ways to improve this, the first involving using the right data structures for the job, and the second involving the right algorithms for the job.
Changing the collection of things to iterate over from a list into a generator (a lazy iterable) turns something that took linear space (479M strings) into something that takes constant space (some fixed-size iterator state, plus one string at a time). Similarly, changing the collection of words to check against from a list into a set turns something that takes linear time (comparing a string against every element in the list) into something that takes constant time (hashing a string, then seeing if there's anything in the set with that hash value). So, this gets rid of the * 200K part of your problem.
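As a minimal illustration of that second point (the word list here is made up, not the Unix dictionary):

```python
# Membership tests: a list is scanned element by element (linear time),
# while a set hashes the candidate once (constant time on average).
words_list = ["apple", "banana", "cherry"] * 1000
words_set = set(words_list)

print("zzz" in words_list)   # scans all 3000 entries before answering
print("zzz" in words_set)    # one hash lookup
```

For a 200K-word dictionary checked against millions of permutations, that constant-factor difference is the whole game.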
But you've still got the 479M part of the problem. And you can't make that go away with a better data structure. Instead, you have to rethink the problem. How can you check whether any permutation of a word matches any other words, without trying all the permutations?
Well, some permutation of the word X matches the word Y if and only if X and Y have the same letters. It doesn't matter what order the letters in X were in; if the set is the same, there is at least one matching permutation (or exactly one, depending on how you count duplicate letters), and if not, there are exactly 0. So, instead of iterating through all the permutations of the word to look up, just look up its set. But it does matter if there are duplicates, so you can't just use set here. You could use some kind of multi-set (collections.Counter)… or, with very little loss in efficiency and a big gain in simplicity, you could just sort the letters. After all, if two words have the same letters in some arbitrary order, they have the same letters in the same order when they're both sorted.
Of course you need to know which words are anagrams, not just that there is an anagram, so you can't just look it up in a set of letter sets, you have to look it up in a dictionary that maps letter sets to words. For example, something like this:
lettersets = collections.defaultdict(set)
for word in words:
    lettersets[''.join(sorted(word))].add(word)
So now, to look up the anagrams for a word, all you have to do is:
anagrams = lettersets[''.join(sorted(word))]
Not only is that simple and readable, it's also constant-time.
And if you really want to print out the massive list of all anagrams of all words… well, that's easy too:
for _, words in lettersets.items():
    for word in words:
        print('{} is an anagram of {}'.format(word, ', '.join(words - {word})))
Now, instead of taking 479M*200K time to find anagrams for one word, or 479M*200K*200K time to find all anagrams for all words, it takes constant time to find anagrams for one word, or 200K time to find all anagrams for all words. (Of course there is 200K setup time added to the start to create the mapping, but spending 200K time up-front to save 200K, much less 479M*200K, time for each lookup is an obvious win.)
Things get a little trickier when you want to, e.g., find partial anagrams, or sentence anagrams, but you want to follow the same basic principles: find data structures that let you do things in constant or logarithmic time instead of linear or worse, and find algorithms that don't require you to brute-force your way through an exponential or factorial number of candidates.
import urllib

def anagram():
    f = urllib.urlopen('http://www.puzzlers.org/pub/wordlists/unixdict.txt')
    words = f.read().split('\n')
    d = {''.join(sorted(x)): [] for x in words}  # create dict with an empty list as the default
    for x in words:
        d[''.join(sorted(x))].append(x)
    max_len = max(len(v) for k, v in d.iteritems())
    for k, v in d.iteritems():
        if len(v) >= max_len:
            print v

anagram()
Output:
['abel', 'able', 'bale', 'bela', 'elba']
['alger', 'glare', 'lager', 'large', 'regal']
['angel', 'angle', 'galen', 'glean', 'lange']
['evil', 'levi', 'live', 'veil', 'vile']
['caret', 'carte', 'cater', 'crate', 'trace']
['elan', 'lane', 'lean', 'lena', 'neal']
Finished in 5.7 secs
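For anyone on Python 3, a rough sketch of the same approach (the sample word list here is hypothetical; substitute the lines of unixdict.txt):

```python
from collections import defaultdict

def largest_anagram_groups(words):
    # Group words by their sorted-letter key, then keep only the biggest groups.
    groups = defaultdict(set)
    for w in words:
        groups[''.join(sorted(w))].add(w)
    max_len = max(len(v) for v in groups.values())
    return [sorted(v) for v in groups.values() if len(v) == max_len]

words = ["abel", "able", "bale", "bela", "elba", "evil", "live", "yed"]
for group in largest_anagram_groups(words):
    print(group)
```

On that sample input, the only maximal group is the five-word abel/able/bale/bela/elba set.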
Here's a hint on solving the problem: two strings are anagrams of each other if they have the same collection of letters. You can sort the letters of each word (turning e.g. "orchestra" into "acehorrst"), then just see if two words have the same sorted form. If they do, then the original words must have been anagrams of each other, since they have all the same letters (in a different order).
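A quick sketch of that hint:

```python
def letter_key(word):
    # Canonical form: the word's letters in sorted order.
    return ''.join(sorted(word))

print(letter_key("orchestra"))                             # acehorrst
print(letter_key("carthorse"))                             # acehorrst
print(letter_key("orchestra") == letter_key("carthorse"))  # True
```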
I have two lists:
a - dictionary which contains keywords such as ["impeccable", "obvious", "fantastic", "evident"] as elements of the list
b - sentences which contains sentences such as ["I am impeccable", "you are fantastic", "that is obvious", "that is evident"]
The goal is to use the dictionary list as a reference.
The process is as follows:
Take an element from the sentences list and run it against each element in the dictionary list. If any of the dictionary elements exists in the sentence, then spit out that sentence to a new list.
Repeat step 1 for each of the elements in the sentences list.
Any help would be much appreciated.
Thanks.
Below is the code:
sentences = "The book was awesome and envious", "splendid job done by those guys", "that was an amazing sale"
dictionary = "awesome", "amazing", "fantastic", "envious"

## Find Matches
for match in dictionary:
    if any(match in value for value in sentences):
        print match
Now that you've fixed the original problem, and fixed the next problem with doing the check backward, and renamed all of your variables, you have this:
for match in dictionary:
    if any(match in value for value in sentences):
        print match
And your problem with it is:
The way I have the code written i can get the dictionary items but instead i want to print the sentences.
Well, yes, your match is a dictionary item, and that's what you're printing, so of course that's what you get.
If you want to print the sentences that contain the dictionary item, you can't use any, because the whole point of that function is just to return True if any elements are true. It won't tell you which ones; in fact, if there is more than one, it'll stop at the first one.
If you don't understand functions like any and the generator expressions you're passing to them, you really shouldn't be using them as magic invocations. Figure out how to write them as explicit loops, and you will be able to answer these problems for yourself easily. (Note that the any docs directly show you how to write an equivalent loop.)
For example, your existing code is equivalent to:
for match in dictionary:
    for value in sentences:
        if match in value:
            print match
            break
Written that way, it should be obvious how to fix it. First, you want to print the sentence instead of the word, so print value instead of match (and again, it would really help if you used meaningful variable names like sentence and word instead of meaningless names like value and misleading names like match…). Second, you want to print all matching sentences, not just the first one, so don't break. So:
for match in dictionary:
    for value in sentences:
        if match in value:
            print value
And if you go back to my first answer, you may notice that this is the exact same structure I suggested.
You can simplify or shorten this by using comprehensions and iterator functions, but not until you understand the simple version, and how those comprehensions and iterator functions work.
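For reference, once the explicit loops are clear, one possible comprehension form of "collect every sentence containing a dictionary word" (using the sample data from the question) is:

```python
sentences = ["The book was awesome and envious",
             "splendid job done by those guys",
             "that was an amazing sale"]
dictionary = ["awesome", "amazing", "fantastic", "envious"]

# Keep each sentence that contains at least one dictionary word.
matches = [sentence for sentence in sentences
           if any(word in sentence for word in dictionary)]
print(matches)
```

Here any is appropriate, because per sentence you only need to know whether some word matched, not which one.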
First translate your algorithm into pseudocode instead of a vague description, like this:
for each sentence:
    for each element in the dictionary:
        if the element is in the sentence:
            spit out the sentence to a new list
The only one of these steps that isn't completely trivial to convert to Python is "spit out the sentence to a new list". To do that, you'll need to have a new list before you get started, like a_new_list = [], and then you can call append on it.
Once you convert this to Python, you will discover that "I am impeccable and fantastic" gets spit out twice. If you don't want that, you need to find the appropriate place to break out of the inner loop and move on to the next sentence. Which is also trivial to convert to Python.
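A direct translation of the pseudocode above might look like this (the break keeps a doubly matching sentence from being added twice):

```python
sentences = ["I am impeccable", "you are fantastic",
             "that is obvious", "I am impeccable and fantastic"]
dictionary = ["impeccable", "obvious", "fantastic", "evident"]

a_new_list = []
for sentence in sentences:
    for word in dictionary:
        if word in sentence:
            a_new_list.append(sentence)
            break  # stop at the first match and move to the next sentence
print(a_new_list)
```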
Now that you've posted your code… I don't know what problem you were asking about, but there's at least one thing obviously wrong with it.
sentences is a list of sentences.
So, for partial in sentences means each partial will be a sentence, like "I am impeccable".
dictionary is a list of words. So, for value in dictionary means each value will be a word, like "impeccable".
Now, you're checking partial in value for each value for each partial. That will never be true. "I am impeccable" is not in "impeccable".
If you turn that around, and check whether value in partial, it will give you something that's at least true sometimes, and that may even be what you actually want, but I'm not sure.
As a side note, if you used better names for your variables, this would be a lot more obvious. partial and value don't tell you what those things actually are; if you'd called them sentence and word it would be pretty clear that sentence in word is never going to be true, and that word in sentence is probably what you wanted.
Also, it really helps to look at intermediate values to debug things like this. When you use an explicit for statement, you can print(partial) to see each thing that partial holds, or you can put a breakpoint in your debugger, or you can step through in a visualizer like this one. If you have to break the any(genexpr) up into an explicit loop to do, then do so. (If you don't know how, then you probably don't understand what generator expressions or the any function do, and have just copied and pasted random code you didn't understand and tried changing random things until it worked… in which case you should stop doing that and learn what they actually mean.)
The task that I have to perform is as follows :
Say I have a list of words (Just an example...the list can have any word):
'yappingly', 'yarding', 'yarly', 'yawnfully', 'yawnily', 'yawning','yawningly',
'yawweed', 'yealing', 'yeanling', 'yearling', 'yearly', 'yearnfully','yearning',
'yearnling', 'yeastily', 'yeasting', 'yed',
I have to create a new list of words to which words having the suffix ing are added after removing the suffix (i.e., yeasting is added to the new list as yeast), and the remaining words are added as-is.
Now as far as insertion of strings ending with ing is concerned, I wrote the following code and it works fine:
Data=[w[0:-3] for w in wordlist if re.search('ing$',w)]
But how do I add the remaining words to the list? How do I add an else clause to the above if statement? I was unable to find suitable documentation for this. I did come across several questions on SO regarding the shorthand if-else statement, but simply adding an else at the end of the above code doesn't work. How do I go about it?
Secondly, if I have to extend the above regular expression for multiple suffixes say as follows:
re.search('(ing|ed|al)$',w)
How do I perform the "trim" operation to remove the matched suffix and simultaneously add the word to the new list?
Please Help.
First, what makes you think you need a regexp at all? There are easier ways to strip suffixes.
Second, if you want to use regexps, why not just re.sub instead of trying to use regexps and slicing together? For example:
Data = [re.sub('(ing|ed|al)$', '', w) for w in wordlist]
Then you don't need to work out how much to slice off (which would require you to keep track of the result of re.search so you can get the length of the group, instead of just turning it into a bool).
But if you really want to do things your way, just replace your if filter with a conditional expression, as in iCodez's answer.
Finally, if you're stuck on how to fit something into a one-liner, just take it out of the one-liner. It should be easy to write a strip_suffixes function that returns the suffix-stripped string (which is the original string if there was no suffix). Then you can just write:
Data = [strip_suffixes(w) for w in wordlist]
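One possible strip_suffixes sketch (the suffix tuple is the one from the question; everything else is an assumption):

```python
def strip_suffixes(word, suffixes=("ing", "ed", "al")):
    # Remove the first matching suffix; return the word unchanged otherwise.
    for suffix in suffixes:
        if word.endswith(suffix):
            return word[:-len(suffix)]
    return word

wordlist = ["yeasting", "yearly", "crowed", "seasonal"]
print([strip_suffixes(w) for w in wordlist])  # ['yeast', 'yearly', 'crow', 'season']
```

Note that no regexp is needed at all here; str.endswith and slicing do the job.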
Regarding your first question, you can use a ternary placed just before the for:
Data=[w[0:-3] if re.search('ing$',w) else w for w in wordlist]
Regarding your second, well, the best answer in my opinion is to use re.sub as @abarnert demonstrated. However, you could also make a slight adaptation to your use of re.search:
Data = [re.search('(.*?)(?:ing|ed|al)?$', w).group(1) for w in wordlist]
Finally, here is a link for more information on comprehensions.
I am writing a simple cryptogram solver and am having trouble 'unrolling' a recursive function. I must unroll it for other reasons, otherwise I would leave it recursive.
Here's the idea: I have a variable number of lists, each with words in them. The function's job is to go through each list and, after checking that the word fits the current alphabet setup, find its score. So if you have the following lists:
LIST1: [the, and, can,...]
LIST2: [kids, cars, knee,...]
LIST3: [talks, walks, music,...]
...
and the function needs to go through each list (in order) and try to find the best sentence. (I have a scoring algorithm that it calls to compare.) It starts with the first word in the first list, then iterates the second list until it finds a word that works, then starts iterating the third list until it finds a word in that list that works, etc. Once it exhausts the words in the 3rd list, it should then go back to the second and find the next word that works, continuing the process until it's done.
I tried using the Product function, but that doesn't work the right way...that just gives me all possible combinations, and technically works, but is not very efficient.
def find_sentence():
    cycle through first list:
        cycle through second list:
            if word works:
                start cycling through third word list.
            else:
                keep cycling through 2nd word list.
    ...
    Keep going until we have gone through all word lists, finding a score that is above a threshold.
Any help?
From Bakuriu's response:
Thanks for your fast reply! I'm not that great at Python, but I don't think this is working the way I need it to. Your solution is similar to the Product method in that its goal is to find all words that will work (or fit a score). The method I need is:
1. Start with the 1st word in the 1st list.
2. Start iterating the next list of words.
3. As soon as one of those words works, start going through the 3rd list, etc.
4. When you've reached the end (the last list of words) and found a candidate, you have a solution, as you have one word in each list that works.
5. If, say, a word in list 3 does not fit, you must go back to list 2 and CONTINUE searching through that list, finding the next word that works, then start list 3 OVER AGAIN, continuing until nothing works or you've reached the end.
I hope this is clear. Please let me know if I can clarify anything.
You really don't need recursion here at all, actually.
def find_sentence(*variable_number_of_lists):
    out = []
    for eachlist in variable_number_of_lists:
        for word in eachlist:
            if scoreword(out, word) > threshold:
                # presumably, your 'scoreword' function would take in the current
                # list of okayed words in order to find the most recent one for use
                # in your scoring, if I've understood the problem correctly
                out.append(word)
                break
    return out
for word in wordStr:
    word = word.strip()
print word
When the above code analyzes a .txt with thousands of words, why does it only return the last word in the .txt file? What do I need to do to get it to return all the words in the text file?
Because you're overwriting word in the loop, not really a good idea. You can try something like:
wordlist = ""
for word in wordStr:
    wordlist = "%s %s" % (wordlist, word.strip())
print wordlist[1:]
This is fairly primitive Python and I'm sure there's a more Pythonic way to do it with list comprehensions and all that new-fangled stuff :-) but I usually prefer readability where possible.
What this does is maintain the collected words in a separate string, adding each stripped word to the end. The [1:] at the end simply gets rid of the leading space that was added when the first word was tacked on to the empty string.
It will suffer eventually as the word count becomes substantial since tacking things on to the end of a string is less optimal than other data structures. However, even up to 10,000 words (with the print removed), it's still well under a second of execution time.
At 50,000 words it becomes noticeable, taking 3 seconds on my box. If you're going to be processing that sort of quantity, you would probably opt for a real list-based solution like (equivalent to above but with a different underlying data structure):
wordlist = []
for word in wordStr:
    wordlist.append(word.strip())
print wordlist
That takes about 0.22 seconds (without the print) to do my entire dictionary file, some 110,000 words.
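For completeness, the comprehension-style version alluded to above might look like this (wordStr is just a stand-in iterable of lines):

```python
wordStr = ["  the\n", "quick \n", "fox\n"]

# Strip each line once, then join the results in a single pass.
wordlist = [word.strip() for word in wordStr]
print(' '.join(wordlist))  # the quick fox
```

A single join avoids the repeated string concatenation that made the first version slow.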
To print all the words in wordStr (assuming that wordStr is some kind of iterable that returns strings), you can simply write
for word in wordStr:
    word = word.strip()
    print word  # Notice that the only difference is the indentation on this line
Python cares about indentation, so in your code the print statement is outside the loop and is only executed once. In the modified version, the print statement is inside the loop and is executed once per word.
That is because the variable word holds the current word; when the file is finished, it holds the last one:
def stuff():
    words = []
    for word in wordStr:
        words.append(word.strip())
    print words
    return words
List comprehensions should make your code snazzier and more pythonic.
wordStr = 'here are some words'
print [word.strip() for word in wordStr.split()]
returns
['here', 'are', 'some', 'words']
If you don't know why the code in your example is "returning" only the last word (actually it's not returning anything, it's printing a single word), then I'm afraid nothing here is going to help you very much.
I know that sounds hostile, but I don't wish to be. It seems from your question that you are throwing bits of Python together with little real understanding of the fundamental basics of programming in Python.
Now don't get me wrong, trying stuff out is a great early learning activity, and having a task to motivate your learning is also a great way to do it. So I don't want to tell you to stop! But whether you're trying to learn to program or just have a task that needs you to write a program, you aren't going to get very far without developing an understanding of the fundamental underlying issues that make your example not do what you want it to do.
We can tell you here that this:
for word in wordStr:
    word = word.strip()
print word
is a program that roughly means "for every word in wordStr, bind word to the result of word.strip(); then after all that, print the contents of word", whereas what you wanted is likely:
for word in wordStr:
    word = word.strip()
    print word
which is a program that roughly means "for every word in wordStr, bind word to the result of word.strip() and then print word". And that solves your immediate problem. But you're going to run into many more problems of a very similar nature, and without an understanding of the very basics you won't be able to see that they're all "of a kind", and you'll just end up asking more questions here.
What you need is to gain a basic understanding of how variables, statements, loops, etc work in Python. You will probably eventually gain that if you just keep trying to apply code and ask questions here. But Stack Overflow is not the most efficient resource for gaining that understanding; a much better bet would be to find yourself a good book or tutorial (there's an official one at http://docs.python.org/tutorial/).
Here endeth the soap box.