Best matches for a partially specified word in Python

I have a file dict.txt that has all words in the English language.
The user will input their word:
x = raw_input("Enter partial word: ")
Example inputs would be: r-n, --n, -u-, he--o, h-llo and so on; unknown characters would preferably be specified by an underscore rather than a dash (-).
I want the program to come up with a list, of all the best matches that are found in the dictionary.
Example: If partial word was r--, the list would contain run, ran, rat, rob and so on.
Is there any way to do this using for loops?

One easy way to do this would be by using regular expressions. Since it is unclear whether this question is homework, the details are left as an exercise for the reader.

Rather than using _ to denote wildcards, use \w. Add \b to the beginning and end of the pattern, then just run the dictionary through a regexp matcher. So -un--- becomes:
>>> import re
>>> re.findall(r'\b\wun\w\w\w\b', "run runner bunt bunter bunted bummer")
['runner', 'bunter', 'bunted']
\w matches any 'word character'. \b matches any word boundary.
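For example, a minimal sketch of turning the user's partial word into such a pattern and filtering a word list with it (the dict.txt filename and the one-word-per-line format are assumptions, and raw_input follows the question's Python 2 usage):
import re

# a minimal sketch of the regex approach above; assumes dict.txt has one word per line
partial = raw_input("Enter partial word: ")  # e.g. "h_llo" or "h-llo"
pattern = re.compile(r'\b' + partial.replace('_', r'\w').replace('-', r'\w') + r'\b')

with open('dict.txt') as fh:
    matches = [word for word in (line.strip() for line in fh) if pattern.match(word)]
print(matches)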

If you want to do this repeatedly you should create an index:
wordlist = [word.strip() for word in "run, ran, rat, rob, fish, tree".split(',')]
from collections import defaultdict
class Index(object):
def __init__(self, wordlist=()):
self.trie = defaultdict(set)
for word in wordlist:
self.add_word(word)
def add_word(self, word):
""" adds word to the index """
# save the length of the word
self.trie[len(word)].add(word)
for marker in enumerate(word):
# add word to the set of words with (pos,char)
self.trie[marker].add(word)
def find(self, pattern, wildcard='-' ):
# get all word with matching length as candidates
candidates = self.trie[len(pattern)]
# get all words with all the markers
for marker in enumerate(pattern):
if marker[1] != wildcard:
candidates &= self.trie[marker]
# exit early if there are no candicates
if not candidates:
return None
return candidates
with open('dict.txt', 'rt') as lines:
wordlist = [word.strip() for word in lines]
s = Index(wordlist)
print s.find("r--")
Tries are made for searching strings; this index takes a related approach, keying a single dict on word length and on (position, character) pairs.

Sounds like homework involving searching algorithms or something, but I'll give you a start.
One solution might be to index the file (if this can be done in a reasonable time) into a tree structure, with each character representing a node value and each child being a possible subsequent character. You could then traverse the tree using the input as a map: a character tells you the next node to go to, and a dash means you should include all child nodes. Every time you hit a leaf n levels deep, with n being the length of the input, you know you've found a match.
The nice thing is that once you index, your search will speed up considerably. It's the indexing that can take forever...
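A rough sketch of that idea, under the assumption that the tree is a nested dict of characters with a sentinel key marking the end of a word (all names here are illustrative, not from the answer above):
END = '$'  # sentinel key marking the end of a word

def add_word(tree, word):
    node = tree
    for ch in word:
        node = node.setdefault(ch, {})
    node[END] = True

def find(node, pattern, prefix=''):
    # a character names the child to follow; '-' (the wildcard) follows all children
    if not pattern:
        return [prefix] if END in node else []
    ch, rest = pattern[0], pattern[1:]
    matches = []
    children = node.items() if ch == '-' else [(ch, node[ch])] if ch in node else []
    for c, child in children:
        if c != END:
            matches.extend(find(child, rest, prefix + c))
    return matches

tree = {}
for w in ("run", "ran", "rat", "rob", "fish"):
    add_word(tree, w)
print(find(tree, "r--"))  # e.g. ['run', 'ran', 'rat', 'rob'] in some order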

Takes a bit of memory but this does the trick:
import re
import sys
word = '\\b' + sys.argv[1].replace('-', '\\w') + '\\b'
print word
with open('data.txt', 'r') as fh:
    print re.findall(word, fh.read())

A few approaches occur to me:
The first is to preprocess your dictionary into words[wordlength][offset][charAtOffset] = set(matching words); then your query becomes an intersection of all relevant word-sets. Very fast, but memory-intensive and a lot of set-up work.
Ex:
# search for 'r-n'
matches = list(words[3][0]['r'] & words[3][2]['n'])
The second is a linear scan of the dictionary using regular expressions; much slower, but minimal memory footprint.
Ex:
import re
foundMatch = re.compile('r.n').match
matches = [word for word in allWords if foundMatch(word)]
Third would be a recursive search into a word-Trie;
Fourth - and it sounds like what you want - is a naive word-matcher:
with open('dictionary.txt') as inf:
    all_words = [word.strip().lower() for word in inf]  # one word per line

find_word = 'r-tt-r'
matching_words = []
for word in all_words:
    if len(word) == len(find_word):
        if all(find == ch or find == '-' for find, ch in zip(find_word, word)):
            matching_words.append(word)
Edit: full code for the first option follows:
from collections import defaultdict
from functools import reduce   # reduce is a builtin in Python 2.x but lives in functools in 3.x
import operator

try:
    inp = raw_input   # Python 2.x
except NameError:
    inp = input       # Python 3.x

class Words(object):
    @classmethod
    def fromFile(cls, fname):
        with open(fname) as inf:
            return cls(inf)

    def __init__(self, words=None):
        super(Words, self).__init__()
        self.words = set()
        self.index = defaultdict(lambda: defaultdict(lambda: defaultdict(set)))
        _addword = self.addWord
        for word in words:
            _addword(word.strip().lower())

    def addWord(self, word):
        self.words.add(word)
        _ind = self.index[len(word)]
        for ind, ch in enumerate(word):
            _ind[ind][ch].add(word)

    def findAll(self, pattern):
        pattern = pattern.strip().lower()
        _ind = self.index[len(pattern)]
        return reduce(operator.__and__, (_ind[ind][ch] for ind, ch in enumerate(pattern) if ch != '-'), self.words)

def main():
    print('Loading dict... ')
    words = Words.fromFile('dict.txt')
    print('done.')
    while True:
        seek = inp('Enter partial word ("-" is wildcard, nothing to exit): ').strip()
        if seek:
            print("Matching words: " + ' '.join(words.findAll(seek)) + '\n')
        else:
            break

if __name__ == "__main__":
    main()

Related

Is there a way to detect words without searching for whitespace or underscores

I am trying to write a CLI for generating Python classes. Part of this requires validating the identifiers provided in user input, and for Python this means making sure that identifiers conform to the PEP 8 best practices/standards: classes in CapsCase, fields in all_lowercase_with_underscores, packages and modules likewise, and so forth.
# it is easy to correct when there is an identifier
# with underscores or whitespace and correcting for a class
def package_correct_convention(item):
    return item.strip().lower().replace(" ", "").replace("_", "")
But when there are no whitespaces or underscores between tokens, I'm not sure how to correctly capitalize the first letter of each word in an identifier. Is it possible to implement something like that without using AI?
say for example:
# providing "ClassA" returns "classa" because there is no delimiter between "class" and "a"
def class_correct_convention(item):
if item.count(" ") or item.count("_"):
# checking whether space or underscore was used as word delimiter.
if item.count(" ") > item.count("_"):
item = item.split(" ")
elif item.count(" ") < item.count("_"):
item = item.split("_")
item = list(map(lambda x: x.title(), item))
return ("".join(item)).replace("_", "").replace(" ","")
# if there is no white space, best we can do it capitalize first letter
return item[0].upper() + item[1:]
Well, an AI-based approach would be difficult, imperfect, and a lot of work. If that is not worth it, there is maybe something simpler and certainly comparably efficient.
I understand the worst-case scenario is "todelineatewordsinastringlikethat".
I would recommend downloading a text file of English words, one word per line, and proceeding this way:
import re

string = "todelineatewordsinastringlikethat"

#with open("mydic.dat", "r") as msg:
#    lst = msg.read().splitlines()
lst = ['to', 'string', 'in']  # Let's say the dict contains 3 words

lst = sorted(lst, key=len, reverse=True)
replaced = []

for elem in lst:
    if elem in string:                      # Very fast
        replaced_str = " ".join(replaced)   # Faster to check elem in a string than elem in a list
        capitalized = elem[0].upper() + elem[1:]   # Prepare your capitalized word
        if elem not in replaced_str:        # Check if elem could be a substring of something you replaced already
            string = re.sub(elem, capitalized, string)
        elif elem in replaced_str:          # If elem is a sub of something you replaced, you'll protect
            protect_replaced = [item for item in replaced if elem in item]   # Get the list of replaced items containing the substring elem
            for protect in protect_replaced:   # Uppercase the whole word to protect, as we do a case-sensitive re.sub()
                string = re.sub(protect, protect.upper(), string)
            string = re.sub(elem, capitalized, string)
            for protect in protect_replaced:   # Deprotect by doing the reverse, full uppercase to capitalized
                string = re.sub(protect.upper(), protect, string)
        replaced.append(capitalized)   # Append replaced element to the list
print (string)
Output:
TodelIneatewordsInaStringlikethat
#You see that String has been protected but not delIneate, cause it was not in our dict.
This is certainly not optimal, but it will perform comparably to an AI-based approach for a problem that would not be presented to an AI like this anyway (input preparation is very important in AI).
Note that it is important to reverse-sort the list of words by length, because you want to detect full words first, not substrings. With beforehand, for example, you want the full word, not before or and.
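For instance, a quick demo of that length-based reverse sort (the sample words are just illustrative):
>>> sorted(["and", "before", "beforehand"], key=len, reverse=True)
['beforehand', 'before', 'and']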

Automate the Boring Stuff With Python Madlibs: Trouble with Replacing Matched Regex (Losing Punctuation Marks)

This is my code:
import os, re

def madLibs():
    madLibsDirectory = 'madLibsFiles'
    os.chdir(madLibsDirectory)
    madLibsFile = 'panda.txt'
    madLibsFile = open(madLibsFile)
    file = madLibsFile.read()
    madLibsFile.close()
    wordRegex = re.compile(r"ADJECTIVE|VERB|ADVERB|NOUN")

    file = file.split()  # split the madlib into a list with each word.
    for word in file:
        # check if word matches regex
        if wordRegex.match(word):
            foundWord = wordRegex.search(word)  # create regex match object on word
            newWord = input(f'Please Enter A {foundWord.group()}: ')  # receive word
            file[file.index(word)] = wordRegex.sub(newWord, foundWord.group(), 1)

    file = ' '.join(file)
    print(file)

def main():
    madLibs()

if __name__ == '__main__':
    main()
The problem line is file[file.index(word)] = wordRegex.sub(newWord, foundWord.group(), 1).
When my program runs across the word ADJECTIVE, VERB, ADVERB, or NOUN, it will prompt the user for a word and replace this placeholder with the input. Currently this code correctly replaces the word; HOWEVER, it does not keep punctuation.
For example here is panda.txt:
The ADJECTIVE panda walked to the NOUN and then VERB. A nearby NOUN
was unaffected by these events.
When I replace VERB with say "ate" it will do so but remove the period: "...and then ate A nearby...".
I'm sure this answer isn't too complicated but my REGEX knowledge is not fantastic yet unfortunately.
Thanks!
You've correctly identified the line that has the problem:
file[file.index(word)] = wordRegex.sub(newWord, foundWord.group(), 1)
The problem with this line is that you're replacing only a part of foundWord.group(), which only contains the matched word, not any punctuation marks that appear around it.
One easy fix is to drop foundWord completely and just use word as the text to do your replacement in. The line above would become:
file[file.index(word)] = wordRegex.sub(newWord, word, 1)
That should work! You can however improve your code in a number of other ways. For instance, rather than needing to search file for word to get the correct index for the assignment, you should use enumerate to get the index of each word as you go:
for i, word in enumerate(file):
    if ...:
        ...
        file[i] = ...
Or you could make a bigger change. The re.sub function (and the equivalent method of compiled pattern objects) can make multiple substitutions in a single pass, and it can take a function, rather than a string to use as the replacement. The function will be called with a match object each time the pattern matches in the text. So why not use a function to prompt the user for the replacement word, and replace all the keywords in a single go?
def madLibs():
    madLibsDirectory = 'madLibsFiles'
    os.chdir(madLibsDirectory)
    filename = 'panda.txt'  # changed this variable name, to avoid duplication
    with open(filename) as file:  # a with statement will automatically close the file
        text = file.read()  # renamed this variable too
    wordRegex = re.compile(r"ADJECTIVE|VERB|ADVERB|NOUN")
    modified_text = wordRegex.sub(lambda match: input(f'Please Enter A {match.group()}: '),
                                  text)  # all the substitutions happen in this one call
    print(modified_text)
The lambda in the call to wordRegex.sub is equivalent to this named function:
def func(match):
    return input(f'Please Enter A {match.group()}: ')

How to calculate average word & Sentence length in python 2.7 from a text file

I have been stuck on this for the past 2 weeks and was wondering if you could help.
I am trying to calculate the average word length & sentence length from a text file. I just can't seem to wrap my head around it. I have just started using functions, which are then called from the main file.
My Main file looks like so
import Consonants
import Vowels
import Sentences
import Questions
import Words
""" Vowels """
text = Vowels.fileToString("test.txt")
x = Vowels.countVowels(text)
print str(x) + " Vowels"
""" Consonats """
text = Consonants.fileToString("test.txt")
x = Consonants.countConsonants(text)
print str(x) + " Consonants"
""" Sentences """
text = Sentences.fileToString("test.txt")
x = Sentences.countSentences(text)
print str(x) + " Sentences"
""" Questions """
text = Questions.fileToString("test.txt")
x = Questions.countQuestions(text)
print str(x) + " Questions"
""" Words """
text = Words.fileToString("test.txt")
x = Words.countWords(text)
print str(x) + " Words"
And one of my function files is like so:
def fileToString(filename):
    myFile = open(filename, "r")
    myText = ""
    for ch in myFile:
        myText = myText + ch
    return myText

def countWords(text):
    vcount = 0
    spaces = [' ']
    for letter in text:
        if (letter in spaces):
            vcount = vcount + 1
    return vcount
I was wondering how I would go about calculating the word length as a function that I import? I tried using some of the other threads here but they did not work for me correctly.
I'll try to give you an algorithm for that.
Read the file, loop over it with enumerate(), split() the text, and check how the words end with endswith(). Like:
for ind, word in enumerate(readlines.split()):
    if word.endswith("?"):
        .....
    if word.endswith("!"):
Then put them in a dict and use the ind (index) value with a while loop:
obj = "Hey there! how are you? I hope you are ok."
dict1 = {}
for ind,word in enumerate(obj.split()):
dict1[ind]=word
x = 0
while x<len(dict1):
if "?" in dict1[x]:
print (list(dict1.values())[:x+1])
x += 1
Output;
>>>
['Hey', 'there!', 'how', 'are', 'you?']
>>>
You see, I actually cut the words until reaching ?, so I now have a sentence in a list (you can change it to !). I can reach every element's length; the rest is simple math: find the sum of every element's length, then divide it by the length of that list. Theoretically, that gives the average.
Remember, this is just the algorithm. You will have to adapt this code to fit your data; the key points are enumerate(), endswith() and dict.
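To make that last step concrete, here is a small continuation of the sketch above (the variable names are illustrative): once a sentence has been cut out, sum the word lengths and divide by the word count.
sentence = ['Hey', 'there!', 'how', 'are', 'you?']        # one sentence from the dict above
total_chars = sum(len(word) for word in sentence)          # 3 + 6 + 3 + 3 + 4 = 19
average_word_length = total_chars / float(len(sentence))   # 19 / 5 = 3.8
print(average_word_length)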
Honestly when you're matching things like words and sentences, you're better off learning and using regular expressions than just relying on str.split to catch every corner case.
# test.txt
Here is some text. It is written on more than one line, and will have several sentences.
Some sentences will have their OWN line!
It will also have a question. Is this the question? I think it is.
#!/usr/bin/python

from __future__ import division  # so the averages below are not truncated under Python 2.7
import re

with open('test.txt') as infile:
    data = infile.read()

sentence_pat = re.compile(r"""
    \b              # sentences will start with a word boundary
    ([^.!?]+[.!?]+) # continue with one or more non-sentence-ending
                    # characters, followed by one or more sentence-
                    # ending characters.""", re.X)

word_pat = re.compile(r"""
    (\S+)           # Words are just groups of non-whitespace together
    """, re.X)

sentences = sentence_pat.findall(data)
words = word_pat.findall(data)

average_sentence_length = sum([len(sentence) for sentence in sentences]) / len(sentences)
average_word_length = sum([len(word) for word in words]) / len(words)
DEMO:
>>> sentences
['Here is some text.',
'It is written on more than one line, and will have several sentences.',
'Some sentences will have their OWN line!',
'It will also have a question.',
'Is this the question?',
'I think it is.']
>>> words
['Here',
'is',
'some',
'text.',
'It',
'is',
... ,
'I',
'think',
'it',
'is.']
>>> average_sentence_length
31.833333333333332
>>> average_word_length
4.184210526315789
To answer this:
I was wondering how I would go about calculating the word length as a
function that I import?
def avg_word_len(filename):
    word_lengths = []
    for line in open(filename).readlines():
        word_lengths.extend([len(word) for word in line.split()])
    # use float() so the average is not truncated under Python 2.7
    return float(sum(word_lengths)) / len(word_lengths)
Note: This doesn't account for punctuation such as . and ! at the end of a word, etc.
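If you do want to ignore trailing punctuation, a minimal variant is to strip it from each word before measuring (string.punctuation covers the usual marks; the function name below is just illustrative):
import string

def avg_word_len_no_punct(filename):
    word_lengths = []
    for line in open(filename).readlines():
        # strip leading/trailing punctuation before counting characters
        word_lengths.extend([len(word.strip(string.punctuation)) for word in line.split()])
    return float(sum(word_lengths)) / len(word_lengths)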
This doesn't apply if you want to write the script yourself, but I would use NLTK. It has some really great tools for dealing with really long texts.
This page provides a cheat sheet for nltk. You should be able to import your text, get the sentences as a large list of lists of words, and then calculate the averages.
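For what it's worth, a minimal sketch with NLTK might look like this (it assumes nltk is installed and the 'punkt' tokenizer data has been downloaded via nltk.download('punkt')):
import nltk

with open('test.txt') as fh:
    text = fh.read()

sentences = nltk.sent_tokenize(text)   # list of sentence strings
words = nltk.word_tokenize(text)       # list of word/punctuation tokens

avg_sentence_len = sum(len(s) for s in sentences) / float(len(sentences))
avg_word_len = sum(len(w) for w in words) / float(len(words))
print(avg_sentence_len, avg_word_len)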

searching in python

I am trying to search a file to find all words which use any or all of the letters of a person's first name and are the same length as their first name. I have imported the file and it can be opened and read, but now I want to be able to search the file for any words which contain the specified letters; the words have to be the same length as the person's first name.
You can use regular expressions (to split each line into words) and then check the letters of each candidate one by one:
def find_anagrams_in_file(filename, searchword):
    import re
    searchword = searchword.lower()
    found_words = []
    for line in open(filename, 'rt'):
        words = re.split(r'\W', line)
        for word in words:
            if len(word) == len(searchword):
                tmp = word.lower()
                try:
                    # remove each letter of the search word from a copy of the
                    # candidate; if any letter is missing, ValueError is raised
                    for letter in searchword:
                        idx = tmp.index(letter)
                        tmp = tmp[:idx] + tmp[idx+1:]
                    found_words += [word]
                except ValueError:
                    pass
    return found_words
Run as so (Python 3):
>>> print(find_anagrams_in_file('apa.txt', 'Urne'))
['Rune', 'NurE', 'ERUN']
I would approach this problem this way:
filter out the words whose length differs from the length of the first name,
iterate over the rest of the words, checking whether the intersection of the first name's letters and the word's letters is non-empty (a set might be useful here); a rough sketch follows below.
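A sketch of that approach, assuming words is an iterable of dictionary words and first_name is the person's first name (both names are illustrative):
def candidate_words(first_name, words):
    name = first_name.lower()
    name_letters = set(name)
    # keep only words of the same length as the first name
    same_length = (w for w in words if len(w) == len(name))
    # keep those that share at least one letter with the name
    return [w for w in same_length if set(w.lower()) & name_letters]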
P.S. Is that your homework?

Python- Remove all words that contain other words in a list

I have a list populated with words from a dictionary. I want to find a way to remove all words that contain other, shorter words, only considering root words that occur at the beginning of the target word.
For example, the word "rodeo" would be removed from the list because it contains the English-valid word "rode." "Typewriter" would be removed because it contains the English-valid word "type." However, the word "snicker" is still valid even if it contains the word "nick" because "nick" is in the middle and not at the beginning of the word.
I was thinking something like this:
for line in wordlist:
    if line.find(...):
but I want that "if" statement to then run through every single word in the list checking to see if its found and, if so, remove itself from the list so that only root words remain. Do I have to create a copy of wordlist to traverse?
So you have two lists: the list of words you want to check and possibly remove, and a list of valid words. If you like, you can use the same list for both purposes, but I'll assume you have two lists.
For speed, you should turn your list of valid words into a set. Then you can very quickly check to see if any particular word is in that set. Then, take each word, and check whether all its prefixes exist in the valid words list or not. Since "a" and "I" are valid words in English, will you remove all valid words starting with 'a', or will you have a rule that sets a minimum length for the prefix?
I am using the file /usr/share/dict/words from my Ubuntu install. This file has all sorts of odd things in it; for example, it seems to contain every letter by itself as a word. Thus "k" is in there, "q", "z", etc. None of these are words as far as I know, but they are probably in there for some technical reason. Anyway, I decided to simply exclude anything shorter than three letters from my valid words list.
Here is what I came up with:
# build valid list from /usr/share/dict/words
wfile = "/usr/share/dict/words"
valid = set(line.strip() for line in open(wfile) if len(line.strip()) >= 3)

lst = ["ark", "booze", "kite", "live", "rodeo"]

def subwords(word):
    for i in range(len(word) - 1, 0, -1):
        w = word[:i]
        yield w

newlst = []
for word in lst:
    # uncomment these for debugging to make sure it works
    # print "subwords", [w for w in subwords(word)]
    # print "valid subwords", [w for w in subwords(word) if w in valid]
    if not any(w in valid for w in subwords(word)):
        newlst.append(word)

print(newlst)
If you are a fan of one-liners, you could do away with the for loop and use a list comprehension:
newlst = [word for word in lst if not any(w in valid for w in subwords(word))]
I think that's more terse than it should be, and I like being able to put in the print statements to debug.
Hmm, come to think of it, it's not too terse if you just add another function:
def keep(word):
    return not any(w in valid for w in subwords(word))

newlst = [word for word in lst if keep(word)]
Python can be easy to read and understand if you make functions like this, and give them good names.
I'm assuming that you only have one list from which you want to remove any elements that have prefixes in that same list.
# Important assumption here... wordlist is sorted
base = wordlist[0]                 # consider the first word in the list
for word in wordlist:              # loop through the entire list checking if
    if not word.startswith(base):  # the word we're considering starts with the base
        print base                 # If not... we have a new base, print the current
        base = word                # one and move to this new one
    # else: word starts with base;
    # don't output word, and go on to the next item in the list
print base                         # finish by printing the last base
EDIT: Added some comments to make the logic more obvious
I find jkerian's answer to be the best (assuming only one list) and I would like to explain why.
Here is my version of the code (as a function):
wordlist = ["a","arc","arcane","apple","car","carpenter","cat","zebra"];
def root_words(wordlist):
result = []
base = wordlist[0]
for word in wordlist:
if not word.startswith(base):
result.append(base)
base=word
result.append(base)
return result;
print root_words(wordlist);
As long as the word list is sorted (you could do this in the function if you wanted to), this will get the result in a single pass. This is because, when you sort the list, all words made up of another word in the list will come directly after that root word. E.g. anything that falls between "arc" and "arcane" in your particular list will also be eliminated because of the root word "arc".
You should use the built-in filter function with a lambda for this. I think it'll make your life a lot easier.
words = ['rode', 'nick']  # this is the list of all the words that you have.
                          # I'm using 'rode' and 'nick' as they're in your example
listOfWordsToTry = ['rodeo', 'snicker']

def validate(w):
    for word in words:
        if w.startswith(word):
            return False
    return True

wordsThatDontStartWithValidEnglishWords = \
    filter(lambda x: validate(x), listOfWordsToTry)
This should work for your purposes, unless I misunderstand your question.
Hope this helps
I wrote an answer that assumes two lists, the list to be pruned and the list of valid words. In the discussion around my answer, I commented that maybe a trie solution would be good.
What the heck, I went ahead and wrote it.
You can read about a trie here:
http://en.wikipedia.org/wiki/Trie
For my Python solution, I basically used dictionaries. A key is a sequence of symbols, and each symbol goes into a dict, with another Trie instance as the data. A second dictionary stores "terminal" symbols, which mark the end of a "word" in the Trie. For this example, the "words" are actually words, but in principle the words could be any sequence of hashable Python objects.
The Wikipedia example shows a trie where the keys are letters, but can be more than a single letter; they can be a sequence of multiple letters. For simplicity, my code uses only a single symbol at a time as a key.
If you add both the word "cat" and the word "catch" to the trie, then there will be nodes for 'c', 'a', and 't' (and also the second 'c' in "catch"). At the node level for 'a', the dictionary of "terminals" will have 't' in it (thus completing the coding for "cat"), and likewise at the deeper node level of the second 'c' the dictionary of terminals will have 'h' in it (completing "catch"). So, adding "catch" after "cat" just means one additional node and one more entry in the terminals dictionary. The trie structure makes a very efficient way to store and index a really large list of words.
def _pad(n):
    return " " * n

class Trie(object):
    def __init__(self):
        self.t = {}  # dict mapping symbols to sub-tries
        self.w = {}  # dict listing terminal symbols at this level

    def add(self, word):
        if 0 == len(word):
            return
        cur = self
        for ch in word[:-1]:  # add all symbols but terminal
            if ch not in cur.t:
                cur.t[ch] = Trie()
            cur = cur.t[ch]
        ch = word[-1]
        cur.w[ch] = True  # add terminal

    def prefix_match(self, word):
        if 0 == len(word):
            return False
        cur = self
        for ch in word[:-1]:  # check all symbols but last one
            # If you check the last one, you are not checking a prefix,
            # you are checking whether the whole word is in the trie.
            if ch in cur.w:
                return True
            if ch not in cur.t:
                return False
            cur = cur.t[ch]  # walk down the trie to next level
        return False

    def debug_str(self, nest, s=None):
        "print trie in a convenient nested format"
        lst = []
        s_term = "".join(ch for ch in self.w)
        if 0 == nest:
            lst.append(object.__str__(self))
            lst.append("--top--: " + s_term)
        else:
            tup = (_pad(nest), s, s_term)
            lst.append("%s%s: %s" % tup)
        for ch, d in self.t.items():
            lst.append(d.debug_str(nest + 1, ch))
        return "\n".join(lst)

    def __str__(self):
        return self.debug_str(0)
t = Trie()

# Build valid list from /usr/share/dict/words, which has every letter of
# the alphabet as words! Only take 2-letter words and longer.
wfile = "/usr/share/dict/words"
for line in open(wfile):
    word = line.strip()
    if len(word) >= 2:
        t.add(word)

# add valid 1-letter English words
t.add("a")
t.add("I")

lst = ["ark", "booze", "kite", "live", "rodeo"]
# "ark" starts with "a"
# "booze" starts with "boo"
# "kite" starts with "kit"
# "live" is good: "l", "li", "liv" are not words
# "rodeo" starts with "rode"

newlst = [w for w in lst if not t.prefix_match(w)]
print(newlst)  # prints: ['live']
I don't want to provide an exact solution, but I think there are two key functions in Python that will help you greatly here.
The first, jkerian mentioned: string.startswith() http://docs.python.org/library/stdtypes.html#str.startswith
The second: filter() http://docs.python.org/library/functions.html#filter
With filter, you could write a conditional function that will check to see if a word is the base of another word and return true if so.
For each word in the list, you would need to iterate over all of the other words and evaluate the conditional using filter, which could return the proper subset of root words.
I only had one list - and I wanted to remove any word from it that was a prefix of another.
Here is a solution that should run in O(N log N) time and O(M) space, where M is the size of the returned list. The runtime is dominated by the sorting.
l = sorted(your_list)
removed_prefixes = [l[g] for g in range(0, len(l)-1) if not l[g+1].startswith(l[g])] + l[-1:]
If the list is sorted then the item at index N is a prefix if it begins the item at index N+1.
At the end it appends the last item of the original sorted list, since by definition it is not a prefix.
Handling it last also allows us to iterate over an arbitrary number of indexes w/o going out of range.
If you have the banned list hardcoded in another list:
banned = tuple(banned_prefixes)
removed_prefixes = [ i for i in your_list if not i.startswith(banned)]
This relies on the fact that startswith accepts a tuple. It probably runs in something close to N * M where N is elements in list and M is elements in banned. Python could conceivably be doing some smart things to make it a bit quicker. If you are like OP and want to disregard case, you will need .lower() calls in places.
