I'm currently trying to generate a list of words that rhyme with an input word according to the CMU Pronouncing Dictionary. I have managed to arrange all the words into a dictionary whose keys are the words and whose values are lists of strings representing their pronunciations. However, because rhyming depends on the last vowel, I'm stuck on how to handle words that contain more than one vowel.
def dotheyrhyme(filename, word):
    rhymes = {}
    with open(filename) as f:
        text = f.readlines()[56:]
        for line in text:
            splitline = line.split(" ")
            rhymes[str(splitline[0])] = " ".join(splitline[1:])
    comparer = rhymes[word.upper()].rstrip().split(" ")
    return comparer
I plan to use the comparer variable as a baseline, and I believe reversing it could also be a good way to go about it, but I'm lost (or overthinking) on how to compare whether the last vowel and the letters after it are the same, and append matches accordingly.
Example:
{'SECOND': ['S', 'EH1', 'K', 'AH0', 'N', 'D']}
would rhyme with
{'AND': ['AH0', 'N', 'D']}
but these two wouldn't rhyme:
{'YELLOW': ['Y', 'EH1', 'L', 'OW0']}
and
{'HELLO': ['HH', 'AH0', 'L', 'OW1']}
But I can't think of methods that handle varying lengths and multiple vowels.
Thanks for your help!
Finding the last vowel requires you to have a set of vowels. After that, you only have to iterate over the list backwards.
vowels = {...}  # some list of vowels
word = ['S', 'EH1', 'K', 'AH0', 'N', 'D']

for i in word[::-1]:
    if i in vowels:
        last_vowel = i
        break
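Once you have the last vowel, one way to use it is to slice from that vowel to the end and compare those tails; a rough sketch (the vowel set below is only a small stand-in for the full list of CMU vowel symbols):

vowels = {'AH0', 'EH1', 'OW0', 'OW1'}  # stand-in; use the full set of CMU vowel symbols

def rhyme_tail(phones):
    """Return the slice of phones from the last vowel to the end."""
    for i in range(len(phones) - 1, -1, -1):
        if phones[i] in vowels:
            return phones[i:]
    return phones  # no vowel found; fall back to the whole pronunciation

print(rhyme_tail(['S', 'EH1', 'K', 'AH0', 'N', 'D']))  # ['AH0', 'N', 'D']
print(rhyme_tail(['AH0', 'N', 'D']))                   # ['AH0', 'N', 'D'], same tail, so they rhyme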
If you're open to other ideas, you can also look at this library, which finds the rhymes for you: https://pypi.org/project/pronouncing/
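If you go that route, usage looks roughly like this (a small sketch; the library ships with its own copy of the CMU dictionary, so no file parsing is needed):

import pronouncing

print(pronouncing.rhymes("second")[:10])  # the first few rhymes for "second" according to the CMU data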
You would have to start comparing from the end. There are special algorithms and data structures that can help in cases like yours; you can check out the Aho-Corasick algorithm.
But in the simple case, you would need to compare the words in reverse order and find a common suffix longer than some threshold to call the two words a rhyme, e.g.:
def if_rhymes(word1, word2):
    # rhymes is the dict of word -> list of sounds built from the file
    r1 = rhymes[word2][::-1]  # sounds of word2, reversed
    r2 = rhymes[word1][::-1]  # sounds of word1, reversed
    the_same = 0
    for sound1, sound2 in zip(r1, r2):
        if sound1 == sound2:
            the_same += 1
        else:
            break
    if the_same < threshold:
        return 'no rhyme'  # or False if you want
    else:
        return 'rhymes'  # or True
What the algorithm does
It takes the list of sounds from the rhymes dictionary that you populated from file (for clarity I recommend doing it outside the rhyme testing function).
Then it reverses the order of elements in lists of sounds for both words and creates a list of pairs (or tuples) using zip.
Each of the tuples (sounds from the words in the reverse order) is compared. We count the ones that are the same and stop comparing on the first different pair of sounds from the back.
Depending on the threshold (you may want to substitute an actual value for the variable), you consider the given pair of words a rhyme or not.
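A minimal usage sketch, with a tiny hand-built stand-in for the rhymes dict and an arbitrary threshold (both would really come from your own code):

rhymes = {
    'SECOND': ['S', 'EH1', 'K', 'AH0', 'N', 'D'],
    'AND': ['AH0', 'N', 'D'],
}
threshold = 3  # require at least three matching sounds from the end

print(if_rhymes('SECOND', 'AND'))  # 'rhymes'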
Related
I have been working on writing a Wordle bot, and wanted to see how it performs with all 13,000 words. The problem is that I am running this through a for loop and it is very inefficient. After running it for 30 minutes, it only gets to around 5%. I could wait all that time, but it would end up being 10+ hours. There has got to be a more efficient way. I am new to Python, so any suggestions would be greatly appreciated.
The code here is the code that is used to narrow down the guesses each time. Would there be a way to search for a word that contains "a", "b", and "c", instead of running it 3 separate times? Right now contains, nocontains, and isletter each run every time I need to search for a new letter. Searching for them all together would greatly reduce the time.
# Find the words that only match the criteria
def contains(letter, place):
    list.clear()
    for x in words:
        if x not in removed:
            if letter in x:
                if letter == x[place]:
                    removed.append(x)
                else:
                    list.append(x)
            else:
                removed.append(x)

def nocontains(letter):
    list.clear()
    for x in words:
        if x not in removed:
            if letter not in x:
                list.append(x)
            else:
                removed.append(x)

def isletter(letter, place):
    list.clear()
    for x in words:
        if x not in removed:
            if letter == x[place]:
                list.append(x)
            else:
                removed.append(x)
The performance problems can be massively reduced by using sets. Any time you want to repeatedly test for membership (even only a few times), e.g. if x not in removed, you want to try to make a set. Lists require checking every element to find x, which is bad if the list has thousands of elements. With a Python set, if x not in removed takes roughly the same small, constant amount of time whether removed has 100 elements or 100,000.
Besides this, you're running into problems by trying to use mutable global variables everywhere, like list (which needs to be renamed) and removed. There's no benefit to doing that and several downsides, such as making it harder to reason about your code or to optimize it. One benefit of Python is that you can pass large containers or objects to functions without any extra time or space cost: calling f(huge_list) is as fast and uses as little extra memory as calling f(tiny_list), as if you were passing by reference in other languages, so don't hesitate to use containers as function parameters or return types.
In summary, here's how your code could be refactored if you take away 'list' and 'removed' and instead store this as a set of possible words:
all_words = []  # Huge word list to read in from text file
current_possible_words = set(all_words)

def contains_only_elsewhere(possible_words, letter, place):
    """Given letter and place, remove from possible_words every word
    that lacks letter or has letter at place (i.e. keep only the words
    containing letter somewhere other than place)."""
    to_remove = {word for word in possible_words
                 if letter not in word or word[place] == letter}
    return possible_words - to_remove

def must_not_contain(possible_words, letter):
    """Given a letter, remove from possible_words all words containing letter"""
    to_remove = {word for word in possible_words
                 if letter in word}
    return possible_words - to_remove

def exact_letter_match(possible_words, letter, place):
    """Given a letter and place, remove from possible_words
    all words not containing letter at place"""
    to_remove = {word for word in possible_words
                 if word[place] != letter}
    return possible_words - to_remove
The outside code will be different: for example,
current_possible_words = exact_letter_match(current_possible_words, 'a', 2)
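For example, to apply the feedback from a whole guess in one pass you could chain the three functions (a sketch; the 'g'/'y'/'x' encoding for green/yellow/grey is my own convention, not part of the code above):

def apply_guess(possible_words, guess, feedback):
    """feedback is a string like 'xgxxy': g = green, y = yellow, x = grey."""
    for place, (letter, mark) in enumerate(zip(guess, feedback)):
        if mark == 'g':
            possible_words = exact_letter_match(possible_words, letter, place)
        elif mark == 'y':
            possible_words = contains_only_elsewhere(possible_words, letter, place)
        else:
            possible_words = must_not_contain(possible_words, letter)
    return possible_words

current_possible_words = apply_guess(current_possible_words, 'crane', 'xgxxy')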
Further optimizations are possible (and much easier now): storing only indices to words rather than the strings; precomputing, for each letter, the set of all words containing that letter, etc.
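As a rough sketch of that last idea (the names here are mine, not from the code above):

from collections import defaultdict

# Precompute, for each letter, the set of words that contain it.
words_with_letter = defaultdict(set)
for word in all_words:
    for letter in set(word):
        words_with_letter[letter].add(word)

# "must not contain 'e'" becomes a single set difference,
# and "must contain 'e' somewhere" a single intersection:
without_e = current_possible_words - words_with_letter['e']
with_e = current_possible_words & words_with_letter['e']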
I just wrote a wordle bot that runs in about a second including the web scraping to fetch a list of 5 letter words.
import urllib.request
from bs4 import BeautifulSoup

def getwords():
    source = "https://www.thefreedictionary.com/5-letter-words.htm"
    filehandle = urllib.request.urlopen(source)
    soup = BeautifulSoup(filehandle.read(), "html.parser")
    wordslis = soup.findAll("li", {"data-f": "15"})
    words = []
    for k in wordslis:
        words.append(k.getText())
    return words

words = getwords()

def hasLetterAtPosition(letter, position, word):
    return letter == word[position]

def hasLetterNotAtPosition(letter, position, word):
    return letter in word[:position] + word[position+1:]

def doesNotHaveLetter(letter, word):
    return letter not in word

lettersPositioned = [(0, "y")]
lettersMispositioned = [(0, "h")]
lettersNotHad = ["p"]

idx = 0
while idx < len(words):
    eliminated = False
    for criteria in lettersPositioned:
        if not hasLetterAtPosition(criteria[1], criteria[0], words[idx]):
            del words[idx]
            eliminated = True
            break
    if eliminated:
        continue
    for criteria in lettersMispositioned:
        if not hasLetterNotAtPosition(criteria[1], criteria[0], words[idx]):
            del words[idx]
            eliminated = True
            break
    if eliminated:
        continue
    for letter in lettersNotHad:
        if not doesNotHaveLetter(letter, words[idx]):
            del words[idx]
            eliminated = True
            break
    if eliminated:
        continue
    idx += 1

print(words)  # ["youth"]
The reason yours is slow is that you have a lot of calls checking if word in removed, a number of superfluous logical conditions, and you go through all the words for every one of your checks.
Edit: Here's a get words function that gets more words.
def getwords():
    source = "https://wordfind-com.translate.goog/length/5-letter-words/?_x_tr_sl=es&_x_tr_tl=en&_x_tr_hl=en&_x_tr_pto=wapp"
    filehandle = urllib.request.urlopen(source)
    soup = BeautifulSoup(filehandle.read(), "html.parser")
    wordslis = soup.findAll("a", {"rel": "nofollow"})
    words = []
    for k in wordslis:
        words.append(k.getText())
    return words
Say I have a list of words. If a word's last letter is the same as another word's first letter, then we can connect them together. We don't connect a word with itself. The input elements are distinct.
Example: apple - elephant - tower - rank
I implemented it as this.
def transform(lst):
    graph = []
    for picked in lst:
        link = []
        i = lst.index(picked)
        rest = lst[:i] + lst[i+1:]
        for compare in rest:
            if picked[-1] == compare[0]:
                link.append(compare)
        if len(link) != 0:
            graph.append(link)
    return graph
I don't know if I can still improve it.
=======================================================================
I think I should change
if len(link) != 0:
    graph.append(link)
to
graph.append(link)
Otherwise, the order of the adjacency lists will get mixed up.
You should start by identifying the two things you're grouping by here. Ending letters and starting letters. Drop all the words into two dicts, keyed by each, and you'll end up with much faster lookups. list.index is a killer for efficiency, with each lookup costing O(n)
from collections import defaultdict

startswith, endswith = defaultdict(list), defaultdict(list)
wordlist = ['apple', 'elephant', 'tower', 'rank']

for word in wordlist:
    startswith[word[0]].append(word)
    endswith[word[-1]].append(word)
Then it should be a fairly simple graph traversal problem.
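For instance, a simple depth-first search over the startswith dict already enumerates chains; here is a sketch that looks for the longest chain (what counts as the "best" chain is my assumption, not part of the question):

def longest_chain(word, startswith, used=None):
    """Longest chain starting at word, never reusing a word."""
    used = used or {word}
    best = [word]
    for nxt in startswith.get(word[-1], []):
        if nxt not in used:
            candidate = [word] + longest_chain(nxt, startswith, used | {nxt})
            if len(candidate) > len(best):
                best = candidate
    return best

print(longest_chain('apple', startswith))  # ['apple', 'elephant', 'tower', 'rank']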
funny!
I hope I understand what you're talking about.
def transform(lst):
    graph = []
    for picked in lst:
        if len(graph) == 0:
            graph.append(picked)
        else:
            ch = graph[-1][-1]  # last letter of the last word added
            word_ch = [i for i in lst if (i not in graph) and (i[0] == ch)]
            if len(word_ch) == 0:
                break
            else:
                graph.append(word_ch[0])
    return graph
lista=['apple','carloh','horse','apple','elephant','tower','rank']
print(str(transform(lista)))
['apple', 'elephant', 'tower', 'rank']
print(str([[i] for i in transform(lista)]))
[['apple'], ['elephant'], ['tower'], ['rank']]
I'm trying to write an algorithm that, given a bunch of letters, gives you all the words that can be constructed from those letters. For instance, given 'car' it should return a list containing [arc, car, a, etc...], and out of that list it should return the best Scrabble word. The problem is in finding the list which contains all the words.
I've got a giant txt file dictionary, line-delimited, and I've tried this so far:
def find_optimal(bunch_of_letters: str):
    words_to_check = []
    c1 = Counter(bunch_of_letters.lower())
    for word in load_words():
        c2 = Counter(word.lower())
        if c2 & c1 == c2:
            words_to_check.append(word)
    max_word = max_word_value(words_to_check)
    return max_word, calc_word_value(max_word)
max_word_value - returns the highest-valued word from the given list.
calc_word_value - returns the word's score in Scrabble.
load_words - returns the dictionary file as a list of words.
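For reference, these helpers do roughly the following (a simplified sketch using standard English Scrabble letter values, with no bonus squares):

SCORES = {'a': 1, 'b': 3, 'c': 3, 'd': 2, 'e': 1, 'f': 4, 'g': 2, 'h': 4, 'i': 1,
          'j': 8, 'k': 5, 'l': 1, 'm': 3, 'n': 1, 'o': 1, 'p': 3, 'q': 10, 'r': 1,
          's': 1, 't': 1, 'u': 1, 'v': 4, 'w': 4, 'x': 8, 'y': 4, 'z': 10}

def calc_word_value(word):
    """Sum of the Scrabble letter values of word."""
    return sum(SCORES.get(ch, 0) for ch in word.lower())

def max_word_value(words):
    """The highest-scoring word in the list."""
    return max(words, key=calc_word_value)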
I'm currently using Counters to do the trick, but the problem is that I'm at about 2.5 seconds per search and I don't know how to optimize this. Any thoughts?
Try this:
def find_optimal(bunch_of_letters):
    bunch_of_letters = ''.join(sorted(bunch_of_letters))
    words_to_check = [word for word in load_words()
                      if ''.join(sorted(word)) in bunch_of_letters]
    max_word = max_word_value(words_to_check)
    return max_word, calc_word_value(max_word)
I've just used (or at least tried to use) a list comprehension. Essentially, words_to_check will (hopefully!) be a list of all of the words from your text file that can be built from your letters.
On a side note, if you don't want to use a gigantic text file for the words, check out enchant!
from itertools import permutations

theword = 'car'  # or we can use input('Type in a word: ')

mylist = [permutations(theword, i) for i in range(1, len(theword) + 1)]

for generator in mylist:
    for word in generator:
        print(''.join(word))
        # instead of ''.join just print(word) for a tuple
Output:
c
a
r
ca
cr
...
ar
rc
ra
car
cra
acr
arc
rca
rac
This will give us all the possible permutations of the word.
If you're looking to see whether a generated word is an actual word in the English dictionary, we can use This Answer:
import enchant

d = enchant.Dict("en_US")
for i in range(1, len(theword) + 1):
    for letters in permutations(theword, i):
        word = ''.join(letters)
        print(d.check(word), word)
Conclusion:
If we want to generate all the permutations of the word, we can use this code:
from itertools import combinations, permutations, product

word = 'word'  # or we can use input('Type in a word: ')
solution = permutations(word, 4)

for i in solution:
    print(''.join(i))  # just print(i) if you want a tuple
I'm learning python from Think Python by Allen Downey and I'm stuck at Exercise 6 here. I wrote a solution to it, and at first look it seemed to be an improvement over the answer given here. But upon running both, I found that my solution took a whole day (~22 hours) to compute the answer, while the author's solution only took a couple seconds.
Could anyone tell me how the author's solution is so fast, when it iterates over a dictionary containing 113,812 words and applies a recursive function to each to compute a result?
My solution:
known_red = {'sprite': 6, 'a': 1, 'i': 1, '': 0}  # Global dict of known reducible words, with their lengths as values

def compute_children(word):
    """Returns a list of all valid words that can be constructed from the word by removing one letter"""
    from dict_exercises import words_dict
    wdict = words_dict()  # Builds a dictionary containing all valid English words as keys
    wdict['i'] = 'i'
    wdict['a'] = 'a'
    wdict[''] = ''
    res = []
    for i in range(len(word)):
        child = word[:i] + word[i+1:]
        if child in wdict:
            res.append(child)
    return res
def is_reducible(word):
    """Returns True if a word is reducible to ''. Recursively, a word is reducible if any of its children are reducible"""
    if word in known_red:
        return True
    children = compute_children(word)
    for child in children:
        if is_reducible(child):
            known_red[word] = len(word)
            return True
    return False

def longest_reducible():
    """Finds the longest reducible word in the dictionary"""
    from dict_exercises import words_dict
    wdict = words_dict()
    reducibles = []
    for word in wdict:
        if 'i' in word or 'a' in word:  # A word can only be reducible if it reduces to 'i' or 'a', the only one-letter words possible
            if word not in known_red and is_reducible(word):
                known_red[word] = len(word)
    for word, length in known_red.items():
        reducibles.append((length, word))
    reducibles.sort(reverse=True)
    return reducibles[0][1]
wdict = words_dict() #Builds a dictionary containing all valid English words...
Presumably, this takes a while.
However, you regenerate this same, unchanging dictionary many times for every word you try to reduce. What a waste! If you make this dictionary once, and then re-use that dictionary for every word you try to reduce like you do for known_red, the computation time should be greatly reduced.
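A minimal way to do that, keeping the structure of the question's code (a sketch):

from dict_exercises import words_dict

# Build the dictionary of valid words once, at module level, and reuse it.
wdict = words_dict()
wdict['i'] = 'i'
wdict['a'] = 'a'
wdict[''] = ''

def compute_children(word):
    """All valid words obtained by removing one letter, using the shared wdict."""
    res = []
    for i in range(len(word)):
        child = word[:i] + word[i+1:]
        if child in wdict:
            res.append(child)
    return res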
First of all, I want to mention that there might not be any real-life applications for this simple script I created, but I did it because I'm learning and I couldn't find anything similar here on SO. I wanted to know what could be done to "arbitrarily" change characters in an iterable like a list.
Sure, title() is a handy tool I learned relatively quickly, but then I got to thinking: what if, just for kicks, I wanted to format (upper-case) the last character instead? Or the third, the middle one, etc.? What about lower case? Replacing specific characters with others?
Like I said, this is surely not perfect but could give some food for thought to other noobs like myself. Plus I think this can be modified in hundreds of ways to achieve all kinds of different formatting.
How about helping me improve what I just did? How about making it leaner and meaner? Checking for style, methods, efficiency, etc...
Here it goes:
words = ['house', 'flower', 'tree'] #string list
counter = 0 #counter to iterate over the items in list
chars = 4 #character position in string (0,1,2...)

for counter in range (0,len(words)):
while counter < len(words):
    z = list(words[counter]) # z is a temp list created to slice words
    if len(z) > chars: # to compare char position and z length
        upper = [k.upper() for k in z[chars]] # string formatting EX: uppercase
        z[chars] = upper [0] # replace formatted character with original
        words[counter] = ("".join(z)) # convert and replace temp list back into original word str list
        counter +=1
    else:
        break

print (words)
['housE', 'flowEr', 'tree']
This is somewhat of a combination of both (so +1 to both of them :) ). The main function accepts a list, an arbitrary function and the character to act on:
In [47]: def RandomAlter(l, func, char):
   ....:     return [''.join([func(w[x]) if x == char else w[x] for x in xrange(len(w))]) for w in l]
   ....:

In [48]: RandomAlter(words, str.upper, 4)
Out[48]: ['housE', 'flowEr', 'tree']

In [49]: RandomAlter([str.upper(w) for w in words], str.lower, 2)
Out[49]: ['HOuSE', 'FLoWER', 'TReE']

In [50]: RandomAlter(words, lambda x: '_', 4)
Out[50]: ['hous_', 'flow_r', 'tree']
The function RandomAlter can be rewritten as this, which may make it a bit more clear (it takes advantage of a feature called list comprehensions to reduce the lines of code needed).
def RandomAlter(l, func, char):
    # For each word in our list
    main_list = []
    for w in l:
        # Create a container that is going to hold our new 'word'
        new_word = []
        # Iterate over a range that is equal to the number of chars in the word
        # xrange is a more memory efficient 'range' - same behavior
        for x in xrange(len(w)):
            # If the current position is the character we want to modify
            if x == char:
                # Apply the function to the character and append to our 'word'
                # This is a cool Python feature - you can pass around functions
                # just like any other variable
                new_word.append(func(w[x]))
            else:
                # Just append the normal letter
                new_word.append(w[x])
        # Now we append the 'word' to our main_list. However since the 'word' is
        # a list of letters, we need to 'join' them together to form a string
        main_list.append(''.join(new_word))
    # Now just return the main_list, which will be a list of altered words
    return main_list
There are much better Pythonistas than me, but here's one attempt:
[''.join([a[x].upper() if x == chars else a[x]
          for x in xrange(0, len(a))])
 for a in words]
Also, we're talking about the programmer's 4th, right? What everyone else calls 5th, yes?
Some comments on your code:
for counter in range (0,len(words)):
while counter < len(words):
This won't compile unless you indent the while loop under the for loop. And, if you do that, the inner loop will completely screw up the loop counter for the outer loop. And finally, you almost never want to maintain an explicit loop counter in Python. You probably want this:
for counter, word in enumerate(words):
Next:
z = list(words[counter]) # z is a temp list created to slice words
You can already slice strings, in exactly the same way you slice lists, so this is unnecessary.
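For example:

word = 'flower'
print(word[4])                                # 'e'  (indexing works directly on a str)
print(word[:4] + word[4].upper() + word[5:])  # 'flowEr' (no list() round-trip needed)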
Next:
upper = [k.upper() for k in z[chars]] # string formatting EX: uppercase
This is a bad name for the variable, since there's a function with the exact same name—which you're calling on the same line.
Meanwhile, the way you defined things, z[chars] is a character, a copy of words[counter][4]. You can iterate over a single character in Python, because each character is itself a string, but it's generally pointless: [k.upper() for k in z[chars]] is the same thing as [z[chars].upper()].
z[chars] = upper [0] # replace formatted character with original
So you only wanted the list of 1 character to get the first character out of it… why make it a list in the first place? Just replace the last two lines with z[chars] = z[chars].upper().
else:
    break
This is going to stop on the first string shorter than length 4, rather than just skip strings shorter than length 4, which is what it seems like you want. The way to say that is continue, not break. Or, better, just fall off the end of the list. In some cases, it's hard to write things without a continue, but in this case, it's easy—it's already at the end of the loop, and in fact it's inside an else: that has nothing else in it, so just remove both lines.
It's hard to tell with upper that your loops are wrong, because if you accidentally call upper twice, it looks the same as if you called it once. Change the upper to chr(ord(k)+1), which replaces any letter with the next letter. Then try it with:
words = ['house', 'flower', 'tree', 'a', 'abcdefgh']
You'll notice that, e.g., you get 'flowgr' instead of 'flowfr'.
You may also want to add a variable that counts up the number of times you run through the inner loop. It should only be len(words) times, but it's actually len(words) * len(words) if you have no short words, or len(words) * len(<up to the first short word>) if you have any. You're making the computer do a whole lot of extra work—if you have 1000 words, it has to do 1000000 loops instead of 1000. In technical terms, your algorithm is O(N^2), even though it only needs to be O(N).
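For instance, here is a sketch of such a counter wrapped around the loops from the question (with the inner while indented so that it actually runs):

words = ['house', 'flower', 'tree', 'a', 'abcdefgh']
chars = 4
inner_iterations = 0

for counter in range(0, len(words)):
    while counter < len(words):
        inner_iterations += 1
        z = list(words[counter])
        if len(z) > chars:
            z[chars] = chr(ord(z[chars]) + 1)
            words[counter] = "".join(z)
            counter += 1
        else:
            break

print(inner_iterations)  # 8 passes for 5 words; without the short words it grows quadratically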
Putting it all together:
words = ['house', 'flower', 'tree', 'a', 'abcdefgh'] #string list
chars = 4 #character position in string (0,1,2...)

for counter, word in enumerate(words):
    if len(word) > chars: # to compare char position and word length
        z = list(word)
        z[chars] = chr(ord(z[chars]) + 1) # replace character with the next character
        words[counter] = "".join(z) # convert temp list back into a string and store it in the word list

print (words)
That does the same thing as your original code (except using "next character" instead of "uppercase character"), without the bugs, with much less work for the computer, and much easier to read.
I think the general case of what you're talking about is a method that, given a string and an index, returns that string, with the indexed character transformed according to some rule.
def transform_string(strng, index, transform):
    lst = list(strng)
    if index < len(lst):
        lst[index] = transform(lst[index])
    return ''.join(lst)

words = ['house', 'flower', 'tree']
output = [transform_string(word, 4, str.upper) for word in words]
To make it even more abstract, you could have a factory that returns a method, like so:
def transformation_factory(index, transform):
    def inner(word):
        lst = list(word)
        if index < len(lst):
            lst[index] = transform(lst[index])
        return ''.join(lst)
    return inner

transform = transformation_factory(4, lambda x: x.upper())
output = map(transform, words)