I'm trying to write an algorithm that, given a bunch of letters, returns all the words that can be constructed from those letters. For instance, given 'car' it should return a list containing ['arc', 'car', 'a', ...], and out of that list it picks the best Scrabble word. The problem is in building that list of all valid words.
I've got a giant txt file dictionary, line-delimited, and I've tried this so far:
def find_optimal(bunch_of_letters: str):
    words_to_check = []
    c1 = Counter(bunch_of_letters.lower())
    for word in load_words():
        c2 = Counter(word.lower())
        if c2 & c1 == c2:
            words_to_check.append(word)
    max_word = max_word_value(words_to_check)
    return max_word, calc_word_value(max_word)
max_word_value - returns the word with the maximum value from the list given
calc_word_value - returns the word's score in Scrabble
load_words - returns a list of the dictionary words
I'm currently using counters to do the trick, but the problem is that each search takes about 2.5 seconds and I don't know how to optimize this. Any thoughts?
Try this:
def find_optimal(bunch_of_letters):
    sorted_letters = ''.join(sorted(bunch_of_letters))
    def is_subsequence(small, big):
        it = iter(big)  # a sorted word can be built from the letters iff it is a subsequence of the sorted letters
        return all(ch in it for ch in small)
    words_to_check = [word for word in load_words()
                      if is_subsequence(''.join(sorted(word)), sorted_letters)]
    max_word = max_word_value(words_to_check)
    return max_word, calc_word_value(max_word)
I've just used (or at least tried to use) a list comprehension. Essentially, words_to_check will (hopefully!) be a list of all the words from your text file that can be built from your letters.
On a side note, if you don't want to use a gigantic text file for the words, check out enchant!
from itertools import permutations

theword = 'car'  # or we can use input('Type in a word: ')
mylist = [permutations(theword, i) for i in range(1, len(theword) + 1)]
for generator in mylist:
    for word in generator:
        print(''.join(word))  # instead of ''.join(word), print(word) gives a tuple
Output:
c
a
r
ca
cr
...
ar
rc
ra
car
cra
acr
arc
rca
rac
This will give us all the possible permutations of a word.
If you're looking to see whether a generated word is an actual word in the English dictionary, we can use the enchant library:
import enchant

d = enchant.Dict("en_US")
# note: re-create mylist first if the generators were already consumed above
for generator in mylist:
    for letters in generator:
        word = ''.join(letters)  # permutations yields tuples of characters
        print(d.check(word), word)
Conclusion:
If we want to generate all the permutations of a word, we use this code:
from itertools import permutations

word = 'word'  # or we can use input('Type in a word: ')
solution = permutations(word, 4)
for i in solution:
    print(''.join(i))  # just print(i) if you want a tuple
Related
I'm trying to create a function in Python that will generate anagrams of a given word. I'm not just looking for code that will rearrange the letters aimlessly; all the options returned must be real words. I currently have a solution (which, to be honest, I took mostly from a YouTube video), but it is very slow for my purpose and can only produce one-word responses to a single given word. It compares the words it is going through against a 400,000-word dictionary file called "dict.txt".
My goal is to get this code to mimic how well this website's code works:
https://wordsmith.org/anagram/
I could not find the JavaScript code when reviewing the network activity using Google Chrome's developer tools, so I believe the code probably runs in the background, possibly using Node.js. That would perhaps make it faster than Python, but given how much faster it is, I believe there is more to it than just the programming language. I assume they use some type of search algorithm rather than going through each line one by one as I do. I also like that their response is not limited to a single word: it can break up the given word to provide more options to the user. For example, an anagram of "anagram" is "nag a ram".
Any suggestions or ideas would be appreciated.
Thank you.
import random

def init_words(filename):
    words = {}
    with open(filename) as f:
        for line in f:
            word = line.strip()
            words[word] = 1
    return words

def init_anagram_dict(words):
    anagram_dict = {}
    for word in words:
        sorted_word = ''.join(sorted(list(word)))
        if sorted_word not in anagram_dict:
            anagram_dict[sorted_word] = []
        anagram_dict[sorted_word].append(word)
    return anagram_dict

def find_anagrams(word, anagram_dict):
    key = ''.join(sorted(list(word)))
    if key in anagram_dict:
        return set(anagram_dict[key]).difference(set([word]))
    return set([])

# This is the first function called.
def make_anagram(user_word):
    lower_user_word = str(user_word).lower()
    word_dict = init_words('dict.txt')
    result = find_anagrams(lower_user_word, init_anagram_dict(word_dict.keys()))
    list_result = list(result)
    count = len(list_result)
    if count > 0:
        random_num = random.randint(0, count - 1)
        anagram_value = list_result[random_num]
        return ('An anagram of %s is %s. Would you like me to search for another word?'
                % (lower_user_word, anagram_value))
    else:
        return ("Sorry, I could not find an anagram for %s." % (lower_user_word))
You can build a dictionary of anagrams by grouping words by their sorted text. All words that have the same sorted text are anagrams of each other:
from collections import defaultdict

with open("/usr/share/dict/words", "r") as wordFile:
    words = wordFile.read().split("\n")

anagrams = defaultdict(list)
for word in words:
    anagrams["".join(sorted(word))].append(word)

aWord = "spear"
result = anagrams["".join(sorted(aWord))]
print(aWord, result)
# ['asper', 'parse', 'prase', 'spaer', 'spare', 'spear']
Using 235,000 words, the response time is effectively instantaneous.
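The same grouping works on any word list. Here is a self-contained toy version (with a small inline list standing in for /usr/share/dict/words):

```python
from collections import defaultdict

# toy word list standing in for /usr/share/dict/words
words = ["spear", "spare", "parse", "pears", "reaps", "dog", "god"]

anagrams = defaultdict(list)
for w in words:
    anagrams["".join(sorted(w))].append(w)

print(anagrams["".join(sorted("spear"))])
# ['spear', 'spare', 'parse', 'pears', 'reaps']
```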
In order to obtain multiple words forming an anagram of the specified word, you will need to get into combinatorics. A recursive function is probably the easiest way to go about it:
from itertools import combinations, product
from collections import Counter, defaultdict

with open("/usr/share/dict/words", "r") as wordFile:
    words = wordFile.read().split("\n")

anagrams = defaultdict(set)
for word in words:
    anagrams["".join(sorted(word))].add(word)
counters = {w: Counter(w) for w in anagrams}

minLen = 2  # minimum word length

def multigram(word, memo=dict()):
    sWord = "".join(sorted(word))
    if sWord in memo:
        return memo[sWord]
    result = anagrams[sWord]
    wordCounts = counters.get(sWord, Counter())
    for size in range(minLen, len(word) - minLen + 1):
        seen = set()
        for combo in combinations(word, size):
            left = "".join(sorted(combo))
            if left in seen or seen.add(left):
                continue
            left = multigram(left, memo)
            if not left:
                continue
            right = multigram("".join((wordCounts - Counter(combo)).elements()), memo)
            if not right:
                continue
            result.update(a + " " + b for a, b in product(left, right))
    memo[sWord] = list(result)
    return memo[sWord]
Performance is good up to 12-character words. Beyond that, the exponential nature of combinations starts to take a heavy toll:
result = multigram("spear")
print(result)
# ['parse', 'asper', 'spear', 'er spa', 're spa', 'se rap', 'er sap', 'sa per', 're asp', 'ar pes', 'se par', 'pa ers', 're sap', 'er asp', 'as per', 'spare', 'spaer', 'as rep', 'sa rep', 'ra pes', 'pa ser', 'es rap', 'es par', 'prase']
len(multigram("mulberries")) # 15986 0.1 second 10 letters
len(multigram("raspberries")) # 60613 0.2 second 11 letters
len(multigram("strawberries")) # 374717 1.3 seconds 12 letters
len(multigram("tranquillizer")) # 711491 7.6 seconds 13 letters
len(multigram("communications")) # 10907666 52.2 seconds 14 letters
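The blow-up is easy to quantify: the number of letter subsets that combinations() must enumerate grows combinatorially with word length. A quick illustration using math.comb (Python 3.8+), counting only the ways to split off half the letters:

```python
from math import comb

# ways to choose half the letters of an n-letter word
for n in (10, 12, 14):
    print(n, comb(n, n // 2))
# 10 252
# 12 924
# 14 3432
```

And each of those splits then recurses on both halves, which is why the 14-letter case takes nearly a minute above.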
In order to avoid any delay, you can convert the function to an iterator. This allows you to get the first few anagrams without having to generate them all:
def iMultigram(word, prefix=""):
    sWord = "".join(sorted(word))
    seen = set()
    for anagram in anagrams.get(sWord, []):
        full = prefix + anagram
        if full in seen or seen.add(full):
            continue
        yield full
    wordCounts = counters.get(sWord, Counter(word))
    for size in reversed(range(minLen, len(word) - minLen + 1)):  # longest first
        for combo in combinations(sWord, size):
            left = "".join(sorted(combo))
            if left in seen or seen.add(left):
                continue
            for left in iMultigram(left, prefix):
                right = "".join((wordCounts - Counter(combo)).elements())
                for full in iMultigram(right, left + " "):
                    if full in seen or seen.add(full):
                        continue
                    yield full
from itertools import islice
list(islice(iMultigram("communications"),5)) # 0.0 second
# ['communications', 'cinnamomic so ut', 'cinnamomic so tu', 'cinnamomic os ut', 'cinnamomic os tu']
So I was exploring on coderbyte.com and one of the challenges is to find the longest word in a string. My code to do so is the following:
def LongestWord(sen):
    current = ""
    currentBest = 0
    numberOfLettersInWord = 0
    longestWord = 0
    temp = sen.split()
    for item in temp:
        listOfCharacters = list(item)
        for currentChar in listOfCharacters:
            if currentChar.isalpha():
                numberOfLettersInWord += 1
        if numberOfLettersInWord > longestWord:
            longestWord = numberOfLettersInWord
            numberOfLettersInWord = 0
            currentBest = item
    z = list(currentBest)
    x = ''
    for item in z:
        if item.isalpha(): x += item
    return x

testCase = "a confusing /:sentence:/ this"
print LongestWord(testCase)
When testCase is "a confusing /:sentence:/", the code returns 'confusing', which is the correct answer. But when the test case is the one in the current code, my code returns 'this' instead of 'confusing'.
Any ideas as to why this is happening?
I know that this is not the answer to your question, but this is how I would calculate the longest word. And not sharing it, wouldn't help you, either:
import re
def func(text: str) -> str:
    words = re.findall(r"\w+", text)
    return max(words, key=len)
print(func('a confusing /:sentence:/ this'))
Let me suggest another approach, which is more modular and more Pythonic.
Let's make a function to measure word length:
def word_length(w):
    return sum(ch.isalpha() for ch in w)
So it will count (using sum()) how many characters there are for which .isalpha() is True:
>>> word_length('hello!!!')
5
>>> word_length('/:sentence:/')
8
Now, from a list of words, create a list of lengths. This is easily done with map():
>>> sen = 'a confusing /:sentence:/ this'.split()
>>> list(map(word_length, sen))
[1, 9, 8, 4]
Another builtin useful to find the maximum value in a list is max():
>>> max(map(word_length, sen))
9
But you want to know the word which maximizes the length, which in mathematical terms is called argument of the maximum.
To solve this, zip() the lengths with the words, and get the second argument found by max().
Since this is useful in many cases, make it a function:
def arg_max(func, values):
    return max(zip(map(func, values), values))[1]
Now the longest word is easily found with:
>>> arg_max(word_length, sen)
'confusing'
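A side note (not part of the answer above): the built-in max() already accepts a key function, so the arg-max can also be written directly, without zip():

```python
def word_length(w):
    # count only the alphabetic characters of a token
    return sum(ch.isalpha() for ch in w)

sen = 'a confusing /:sentence:/ this'.split()
print(max(sen, key=word_length))  # confusing
```

Note that max() with a key returns the original token, so any surrounding punctuation would still need to be stripped afterwards.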
Note: PEP-0008 (Style Guide for Python Code) suggests that function names be lower case and with words separated by underscore.
You loop through the words composing the sentence. However, numberOfLettersInWord is never reset, so it keeps increasing while you iterate over the words.
You have to set the counter to 0 each time you start a new word:
for item in temp:
    numberOfLettersInWord = 0
This solves your issue, as you can see: https://ideone.com/y1cmHX
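Applied to the full function, the fix looks like this (a sketch keeping the original variable names; the reset happens at the top of the outer loop):

```python
def LongestWord(sen):
    currentBest = ""
    longestWord = 0
    for item in sen.split():
        numberOfLettersInWord = 0  # reset for every word
        for currentChar in item:
            if currentChar.isalpha():
                numberOfLettersInWord += 1
        if numberOfLettersInWord > longestWord:
            longestWord = numberOfLettersInWord
            currentBest = item
    # strip non-letter characters from the winning token
    return ''.join(ch for ch in currentBest if ch.isalpha())

print(LongestWord("a confusing /:sentence:/ this"))  # confusing
```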
Here's a little function I just wrote that returns the longest word, using a regular expression to remove non-alphanumeric characters:
import re

def longest_word(input):
    words = input.split()
    longest = ''
    for word in words:
        word = re.sub(r'\W+', '', word)
        if len(word) > len(longest):
            longest = word
    return longest

print(longest_word("a confusing /:sentence:/ this"))
I'm trying to compare the output of a permutation of a string to a txt with pretty much every word in the dictionary. The function itself is an anagram solver.
The function takes word as a parameter. Here's basically what I have
def anagram(word):
    return {c for c in permutations(word, len(word))}
This output will give me a set of the possible orderings of word.
If word = 'dog', the output will be:
[('d', 'o', 'g'), ('g', 'o', 'd'), ...]
plus the other four permutations.
I want to compare the result of this permutation to a list of words and then return the word(s) which are anagrams of the original word.
So
if result ('god' or 'dog' or 'dgo' or 'gdo', ...) is in word_list:
    return result
Thanks in advance!
EDIT
Sorry I didn't explicitly say that the word list had already been imported in as a set/list.
The code for that is:
def load_words():
    name = 'words.2-10.txt'
    if isfile(name):
        all_words = [l.rstrip() for l in open(name, 'r')]
        as_lists = {}
        for size in range(2, 11):
            as_lists[size] = [word for word in all_words if len(word) == size]
        as_sets = {size: (set(words) if words else None) for size, words in as_lists.iteritems()}
        return as_lists, as_sets
    return None, None

word_lists, word_sets = load_words()
word_lists, word_sets = load_words()
Apologies!
First, you can get all the words from the file, and form a set using set comprehension, like this
with open("strings.txt") as strings_file:
    words = {line.strip() for line in strings_file}
And then, when you generate the permutations, just join them with "".join, like this
def anagram(word):
    return {"".join(c) for c in permutations(word, len(word))}
and then you can simply do set intersection operation, like this
print words & anagram("dog")
Now, you can use the same set of words to compare against any number of permutations, like this
print words & anagram("cabbage")
print words & anagram("Jon")
print words & anagram("Ffisegydd")
import itertools

def anagram(word):
    for w in itertools.permutations(word):
        yield ''.join(w)

def main():
    word = input("Enter a word: ")
    listOfWords = ['some', 'list', 'of', 'words']
    for w in anagram(word):
        if w in listOfWords:
            print(w, 'is in the list')
Here is an approach with Python sets. If you need something more efficient than sets, you should look into generators (yield) and the filter function.
from itertools import permutations

word_list = set(["god", "dog", "bla"])

def anagram(word):
    perm = [''.join(p) for p in permutations(word)]
    return set(perm)

dog = anagram("dog")
print word_list.intersection(dog)
I'm learning python from Think Python by Allen Downey and I'm stuck at Exercise 6 here. I wrote a solution to it, and at first look it seemed to be an improvement over the answer given here. But upon running both, I found that my solution took a whole day (~22 hours) to compute the answer, while the author's solution only took a couple seconds.
Could anyone tell me how the author's solution is so fast, when it iterates over a dictionary containing 113,812 words and applies a recursive function to each to compute a result?
My solution:
known_red = {'sprite': 6, 'a': 1, 'i': 1, '': 0}  # Global dict of known reducible words, with their lengths as values

def compute_children(word):
    """Returns a list of all valid words that can be constructed from the word by removing one letter."""
    from dict_exercises import words_dict
    wdict = words_dict()  # Builds a dictionary containing all valid English words as keys
    wdict['i'] = 'i'
    wdict['a'] = 'a'
    wdict[''] = ''
    res = []
    for i in range(len(word)):
        child = word[:i] + word[i+1:]
        if child in wdict:
            res.append(child)
    return res

def is_reducible(word):
    """Returns True if a word is reducible to ''. Recursively, a word is reducible if any of its children is reducible."""
    if word in known_red:
        return True
    children = compute_children(word)
    for child in children:
        if is_reducible(child):
            known_red[word] = len(word)
            return True
    return False

def longest_reducible():
    """Finds the longest reducible word in the dictionary."""
    from dict_exercises import words_dict
    wdict = words_dict()
    reducibles = []
    for word in wdict:
        if 'i' in word or 'a' in word:  # a word is only reducible if it reduces to 'i' or 'a', the only one-letter words
            if word not in known_red and is_reducible(word):
                known_red[word] = len(word)
    for word, length in known_red.items():
        reducibles.append((length, word))
    reducibles.sort(reverse=True)
    return reducibles[0][1]
wdict = words_dict() #Builds a dictionary containing all valid English words...
Presumably, this takes a while.
However, you regenerate this same, unchanging dictionary many times, once for every word you try to reduce. What a waste! If you build the dictionary once and then reuse it for every word, as you already do with known_red, the computation time will drop dramatically.
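A minimal sketch of that caching pattern (here `words_dict` is a cheap stand-in for the book's expensive builder, just to show the structure):

```python
_wdict_cache = None

def words_dict():
    """Stand-in for the expensive dictionary build."""
    return {w: w for w in ("a", "i", "sprite", "spite", "site")}

def get_words_dict():
    """Build the dictionary once; every later call reuses the cached object."""
    global _wdict_cache
    if _wdict_cache is None:
        _wdict_cache = words_dict()
    return _wdict_cache

assert get_words_dict() is get_words_dict()  # same object, built only once
```

With this in place, compute_children can call get_words_dict() instead of rebuilding wdict on every invocation.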
On the back of a block calendar I found the following riddle:
How many common English words of 4 letters or more can you make from the letters
of the word 'textbook' (each letter can only be used once).
My first solution that I came up with was:
from itertools import permutations
with open('/usr/share/dict/words') as f:
words = f.readlines()
words = map(lambda x: x.strip(), words)
given_word = 'textbook'
found_words = []
ps = (permutations(given_word, i) for i in range(4, len(given_word)+1))
for p in ps:
for word in map(''.join, p):
if word in words and word != given_word:
found_words.append(word)
print set(found_words)
This gives the result set(['tote', 'oboe', 'text', 'boot', 'took', 'toot', 'book', 'toke', 'betook']) but took more than 7 minutes on my machine.
My next iteration was:
with open('/usr/share/dict/words') as f:
words = f.readlines()
words = map(lambda x: x.strip(), words)
given_word = 'textbook'
print [word for word in words if len(word) >= 4 and sorted(filter(lambda letter: letter in word, given_word)) == sorted(word) and word != given_word]
Which returns an answer almost immediately, but gives as answer: ['book', 'oboe', 'text', 'toot']
What is the fastest, correct and most pythonic solution to this problem?
(edit: added my earlier permutations solution and its different output).
I thought I'd share this slightly interesting trick, although it takes a good bit more code than the rest and isn't really "pythonic". It should, however, be quite fast judging by the timings the other solutions need.
We do a bit of preprocessing to speed up the computations. The basic approach is the following: we assign every letter in the alphabet a prime number, e.g. A = 2, B = 3, and so on. We then compute a hash for every word in the dictionary, which is simply the product of the prime representations of the word's characters, and store every word in a dictionary indexed by that hash.
Now if we want to find out which words are equivalent to, say, 'textbook', we only have to compute the same hash for the word and look it up in our dictionary. Usually (say, in C++) we'd have to worry about overflows, but in Python it's even simpler than that: every word in the list with the same index will contain exactly the same characters.
Here's the code, with the slight optimization that in our case we only have to worry about characters that also appear in the given word, which means we can get by with a much smaller prime table than otherwise (the obvious optimization would be to assign values only to the characters that appear in the word; it was fast enough anyhow, and this way we can preprocess once and reuse the table for several words). The prime algorithm is useful often enough that you should have one yourself anyhow ;)
from collections import defaultdict

def gen_primes(n):
    """Simple trial-division sieve: returns the first n primes ('some arbitrary prime generator')."""
    primes, candidate = [], 2
    while len(primes) < n:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes

PRIMES = list(gen_primes(256))

def get_dict(path):
    res = defaultdict(list)
    with open(path, "r") as file:
        for line in file.readlines():
            word = line.strip().upper()
            hash = compute_hash(word)
            res[hash].append(word)
    return res

def compute_hash(word):
    hash = 1
    for char in word:
        try:
            hash *= PRIMES[ord(char) - ord(' ')]
        except IndexError:
            # contains some character out of range - always 0 for our purposes
            return 0
    return hash

def get_result(path, given_word):
    words = get_dict(path)
    given_word = given_word.upper()
    result = set()
    powerset = lambda x: powerset(x[1:]) + [x[:1] + y for y in powerset(x[1:])] if x else [x]
    for word in (word for word in powerset(given_word) if len(word) >= 4):
        hash = compute_hash(word)
        for equiv in words[hash]:
            result.add(equiv)
    return result

if __name__ == '__main__':
    path = "dict.txt"
    given_word = "textbook"
    result = get_result(path, given_word)
    print(result)
This runs on my Ubuntu word list (98k words) rather quickly, but it's not what I'd call pythonic since it's basically a port of a C++ algorithm. It's useful if you want to compare more than one word that way.
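The core idea can be demonstrated in a few lines: anagrams, and only anagrams, multiply out to the same product of primes, because prime factorization is unique. A self-contained sketch with a hard-coded prime table:

```python
from math import prod  # Python 3.8+

# one prime per lowercase letter
PRIMES = dict(zip('abcdefghijklmnopqrstuvwxyz',
                  [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41,
                   43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101]))

def prime_hash(word):
    # product of the primes for each letter; order of letters cannot matter
    return prod(PRIMES[c] for c in word.lower())

print(prime_hash('listen') == prime_hash('silent'))  # True
print(prime_hash('listen') == prime_hash('enlist'))  # True
```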
How about this?
from itertools import permutations, chain

with open('/usr/share/dict/words') as fp:
    words = set(fp.read().split())

given_word = 'textbook'

perms = (permutations(given_word, i) for i in range(4, len(given_word)+1))
pwords = (''.join(p) for p in chain(*perms))
matches = words.intersection(pwords)
print matches
which gives
>>> print matches
set(['textbook', 'keto', 'obex', 'tote', 'oboe', 'text', 'boot', 'toto', 'took', 'koto', 'bott', 'tobe', 'boke', 'toot', 'book', 'bote', 'otto', 'toke', 'toko', 'oket'])
There is a generator, itertools.permutations, with which you can gather all permutations of a sequence with a specified length. That makes it easier:
from itertools import permutations

GIVEN_WORD = 'textbook'

with open('/usr/share/dict/words', 'r') as f:
    words = [s.strip() for s in f.readlines()]

print len(filter(lambda x: ''.join(x) in words, permutations(GIVEN_WORD, 4)))
Edit #1: Oh! It says "4 or more" ;) Forget what I said!
Edit #2: This is the second version I came up with:
from collections import Counter

LETTERS = Counter('textbook')

with open('/usr/share/dict/words') as f:
    WORDS = filter(lambda x: len(x) >= 4, [l.strip() for l in f])

# a word qualifies if it never uses a letter more often than 'textbook' supplies it
matching = filter(lambda x: not (Counter(x) - LETTERS), WORDS)
print len(matching)
Create the whole power set, then check whether the dictionary word is in the set (order of the letters doesn't matter):
powerset = lambda x: powerset(x[1:]) + [x[:1] + y for y in powerset(x[1:])] if x else [x]
pw = map(lambda x: sorted(x), powerset(given_word))
filter(lambda x: sorted(x) in pw, words)
The following just checks each word in the dictionary to see whether it is of the appropriate length, and then whether it can be formed from the letters of 'textbook'. I borrowed the permutation check from Checking if two strings are permutations of each other in Python, but changed it slightly.
given_word = 'textbook'

with open('/usr/share/dict/words', 'r') as f:
    words = [s.strip() for s in f.readlines()]

matches = []
for word in words:
    if word != given_word and 4 <= len(word) <= len(given_word):
        if all(word.count(char) <= given_word.count(char) for char in word):
            matches.append(word)

print sorted(matches)
This finishes almost immediately and gives the correct result.
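The same multiset check can also be written more compactly with collections.Counter: subtracting the given word's letter counts leaves an empty Counter exactly when the word never over-uses a letter. A small sketch with an inline word list (`subwords` is a name chosen here, not from the answer above):

```python
from collections import Counter

def subwords(words, given_word, min_len=4):
    given = Counter(given_word)
    # Counter(w) - given is empty iff w uses no letter more often than given_word supplies it
    return sorted(w for w in words
                  if w != given_word
                  and min_len <= len(w) <= len(given_word)
                  and not (Counter(w) - given))

print(subwords(["book", "oboe", "text", "boot", "zebra"], "textbook"))
# ['book', 'boot', 'oboe', 'text']
```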
Permutations get very big for longer words. Try 'counterrevolutionary', for example.
I would filter the dict for words of 4 to len(word) letters (8 for 'textbook').
Then I would filter with the regular expression "oboe".matches("[textbook]+").
The remaining words I would sort and compare with a sorted version of the given word, e.g. ("beoo", "bekoottx"), jumping to the next index of a matching character, to find mismatching counts of characters:
("beoo", "bekoottx")
("eoo", "ekoottx")
("oo", "koottx")
("oo", "oottx")
("o", "ottx")
("", "ttx") => matched
("bbo", "bekoottx")
("bo", "ekoottx") => mismatch
Since I don't talk python, I leave the implementation as an exercise to the audience.
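Taking up that exercise, here is one possible Python sketch of the "jump to the next matching character" comparison on the two sorted strings (my interpretation, not the original author's code):

```python
def can_build(candidate, letters):
    """Two-pointer walk over two sorted strings: can `candidate` be drawn
    from `letters`, respecting letter multiplicities?"""
    small = sorted(candidate)
    big = sorted(letters)
    i = 0
    for ch in small:
        while i < len(big) and big[i] < ch:
            i += 1  # skip letters we don't need
        if i == len(big) or big[i] != ch:
            return False  # ran out, or mismatching count of characters
        i += 1  # consume the matched letter
    return True

print(can_build("oboe", "textbook"))  # True
print(can_build("bobo", "textbook"))  # False (needs two b's, only one available)
```

This matches the traces above: "beoo" is consumed entirely from "bekoottx", while the second 'b' of "bbo" finds no counterpart and fails early.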