searching in python

I am trying to search a file to find all words which use any or all of the letters of a person's first name and are the same length as their first name. I have imported the file and it can be opened and read etc., but now I want to be able to search the file for any words which contain the specified letters; the words have to be the same length as the person's first name.

You can use regular expressions (to split each line into words) and a simple letter-by-letter check:

def find_anagrams_in_file(filename, searchword):
    import re
    searchword = searchword.lower()
    found_words = []
    for line in open(filename, 'rt'):
        words = re.split(r'\W', line)
        for word in words:
            if len(word) == len(searchword):
                tmp = word.lower()
                try:
                    # remove each letter of searchword from tmp;
                    # if every letter is found, the word is an anagram
                    for letter in searchword:
                        idx = tmp.index(letter)
                        tmp = tmp[:idx] + tmp[idx+1:]
                    found_words += [word]
                except ValueError:
                    pass
    return found_words
Run it like so (Python 3):
>>> print(find_anagrams_in_file('apa.txt', 'Urne'))
['Rune', 'NurE', 'ERUN']

I would approach this problem this way:
filter out the words whose length differs from the length of the first name,
iterate over the remaining words, checking whether the intersection of the first name's letters and the word's letters is non-empty (a set might be useful here).
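A rough sketch of that idea (the file name and the example name below are just placeholders):
name = "urne"                       # placeholder first name, lowercased

with open('words.txt') as f:        # placeholder word file, one or more words per line
    words = f.read().split()

# keep only words of the same length, then check for a shared letter
same_length = (w for w in words if len(w) == len(name))
matches = [w for w in same_length if set(w.lower()) & set(name)]
print(matches)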
P.S. Is that your homework?

Related

How to check generated strings against a text file

I'm trying to have the user input a string of characters with one asterisk. The asterisk indicates a character that can be subbed out for a vowel (a,e,i,o,u) in order to see what substitutions produce valid words.
Essentially, I want to take an input "l*g" and have it return "lag, leg, log, lug" because "lig" is not a valid English word. In my code below, invalid words are represented as "x".
I've gotten it to properly output each possible combination (e.g., including "lig"), but once I try to compare these words with the text file I'm referencing (for the list of valid words), it'll only return 5 lines of x's. I'm guessing it's that I'm improperly importing or reading the file?
Here's the link to the file I'm looking at so you can see the formatting:
https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/words.zip
I'm using the "en" file (~2.5 MB).
It's not in a dictionary layout, i.e. there are no corresponding keys/values, just lines (maybe I could use the line number as the index, but I don't know how to do that). What can I change so that the test words are checked against the text file and narrowed down to the valid ones?
import os

with open(os.path.expanduser('~/Downloads/words/en')) as f:
    words = f.readlines()

inputted_word = input("Enter a word with ' * ' as the missing letter: ")

letters = []
for l in inputted_word:
    letters.append(l)

### find the index of the blank
asterisk = inputted_word.index('*') # also used a redundant int(), works fine

### sub in vowels
vowels = ['a','e','i','o','u']

list_of_new_words = []
for v in vowels:
    letters[asterisk] = v
    new_word = ''.join(letters)
    list_of_new_words.append(new_word)

for w in list_of_new_words:
    if w in words:
        print(new_word)
    else:
        print('x')
There are probably more efficient ways to do this, but I'm brand new to this. The last two for loops could probably be combined but debugging it was tougher that way.
print(list_of_new_words)
gives
['lag', 'leg', 'lig', 'log', 'lug']
So far, so good.
But this :
for w in list_of_new_words:
    if w in words:
        print(new_word)
    else:
        print('x')
Here you print new_word, which is defined in the previous for loop:
for v in vowels:
    letters[asterisk] = v
    new_word = ''.join(letters) # <----
    list_of_new_words.append(new_word)
So after the loop, new_word still has the last value it was assigned: "lug" (if the script input was l*g).
You probably meant w instead?
for w in list_of_new_words:
    if w in words:
        print(w)
    else:
        print('x')
But it still prints 5 x's...
So that means that w in words is always False. How is that?
Looking at words:
print(words[0:10]) # the first 10 will suffice
['A\n', 'a\n', 'aa\n', 'aal\n', 'aalii\n', 'aam\n', 'Aani\n', 'aardvark\n', 'aardwolf\n', 'Aaron\n']
All the words from the dictionary contain a newline character (\n) at the end. I guess you were not aware that this is what readlines does. So I recommend using:
words = f.read().splitlines()
instead.
With these 2 modifications (w and splitlines):
Enter a word with ' * ' as the missing letter: l*g
lag
leg
x
log
lug
🎉
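For reference, here is the whole script with both changes applied (otherwise the same as yours):
import os   # needed for os.path.expanduser

with open(os.path.expanduser('~/Downloads/words/en')) as f:
    words = f.read().splitlines()   # no trailing newlines

inputted_word = input("Enter a word with ' * ' as the missing letter: ")

letters = []
for l in inputted_word:
    letters.append(l)

asterisk = inputted_word.index('*')

vowels = ['a', 'e', 'i', 'o', 'u']
list_of_new_words = []
for v in vowels:
    letters[asterisk] = v
    new_word = ''.join(letters)
    list_of_new_words.append(new_word)

for w in list_of_new_words:   # w, not new_word
    if w in words:
        print(w)
    else:
        print('x')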

Grouping together words in a string which have common substrings of some length

I want to display words with a common substring in a string.
For example, if the given string is
str = "the games are lame"
and the words have to be grouped together on the basis of a common substring of length 3, the output should be
the
games, lame
are
since the common substring of length 3 is "ame".
I proceeded by converting the string to a list, say "lista", using split(), and made another list, say "listb", with all the possible substrings of length 3, like
the, gam, gme, ges, ame, aes, mes, are, lam, lme, ame
then I checked "listb" for duplicate items ('ame') and, on the basis of those, compared them with the items in "lista" like so
for items in duplicate:
    for item in lista:
        if items in item and item not in listc:
            listc.append(item)
Now I have a "listc" with items that have a common substring of length 3, but I can't figure out how to group them as needed in the output. Also, if "str" contains more words with a common substring, "listc" will contain those words as well.
I don't know if I should have proceeded this way, and I can't seem to figure out how to group the items from "listc" as needed in the output.
Here is a solution:
str_ = "the games are lame"
# first I get a list of all the words
words = str_.split()
# words >>> ['the', 'games', 'are', 'lame']
groups = []
# This variable will contain the list of words
# For each words
for word in words:
found = False
# Get the first words of each groups
other_words = [x[0] for x in groups if x != word]
# Loop through the word and get all substring of 3 characters
for i in range(len(word)):
substring = word[i:i+3]
# Eliminates the substring that doesn't have the correct length
if len(substring) != 3:
continue
try:
# try to find the substring in a group and get the corresponding index of that group
index = [substring in other_word for other_word in other_words].index(True)
found = True
# Add the word in the group
groups[index].append(word)
except ValueError:
continue
# If we don't find a group for the word, we create a new group with that word in it
if not found:
groups.append([word])
# groups >>> [['the'], ['games', 'lame'], ['are']]
# Now print the groups
for group in groups:
print(", ".join(group))
output:
the
games, lame
are
I think you're creating a lot of lists there and this can be quite confusing.
If you want to use a purely logical approach without using libraries designed for sequence matching, such as difflib, you can first define a function that compares two strings; then you separate your sentence into a words list and perform a double iteration (nested) through that list comparing all possible pairs.
If the strings match, they are printed on the same line separated by commas; otherwise each goes on a new line.
In the following function I've also added a parameter for the length of the substring that you want to match, set to 3 by default to stay in line with your question:
# This function compares two strings and returns them in a tuple if they contain the
# same substring of len_substring characters.
def string_matcher(string_a, string_b, len_substring=3):
    # +1 so that the last substring of string_a is also checked
    for i in range(len(string_a) - len_substring + 1):
        if string_a[i:i+len_substring] in string_b:
            return string_a, string_b
    return None

string = "the games are lame"
words = string.split()
output = ""

# Making a double iteration over the words list and calling string_matcher for each pair.
for i in range(len(words) - 1):
    output = output + words[i]
    for j in range(i + 1, len(words)):
        try:
            word_a, word_b = string_matcher(words[i], words[j])
            output = output + ", " + word_b
        except TypeError:
            pass
    output = output + "\n"

print(output)
The program prints out:
the
games, lame
are

Duplicates within a sentence of a text file in python

Hi, I want to write code that reads a text file and identifies the sentences in the file with words that have duplicates within that sentence. I was thinking of putting each sentence of the file in a dictionary and finding which sentences have duplicates. Since I am new to Python, I need some help in writing the code.
This is what I have so far:
def Sentences():
    def Strings():
        l = string.split('.')
        for x in range(len(l)):
            print('Sentence', x + 1, ': ', l[x])
        return
    text = open('Rand article.txt', 'r')
    string = text.read()
    Strings()
    return
The code above converts files to sentences.
Suppose you have a file where each line is a sentence, e.g. "sentences.txt":
I contain unique words.
This sentence repeats repeats a word.
The strategy could be to split the sentence into its constituent words, then use set to find the unique words in the sentence. If the resulting set is shorter than the list of all words, then you know that the sentence contains at least one duplicated word:
sentences_with_dups = []
with open("sentences.txt") as fh:
    for sentence in fh:
        words = sentence.split(" ")
        if len(set(words)) != len(words):
            sentences_with_dups.append(sentence)
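With the example file above, only the second sentence should end up in the list:
print(sentences_with_dups)   # ['This sentence repeats repeats a word.\n'] -- the \n is kept because the line is stored as read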

Best matches for a partially specified word in Python

I have a file dict.txt that has all words in the English language.
The user will input their word:
x = raw_input("Enter partial word: ")
Example inputs would be: r-n, --n, -u-, he--o, h-llo, and so on; unknown characters would preferably be specified by an underscore instead of (-).
I want the program to come up with a list, of all the best matches that are found in the dictionary.
Example: If partial word was r--, the list would contain run, ran, rat, rob and so on.
Is there any way to do this using for loops?
One easy way to do this would be by using regular expressions. Since it is unclear whether this question is homework, the details are left as an exercise for the reader.
Instead of using _ to denote wildcards, use \w. Add \b to the beginning and end of the pattern, then just run the dictionary through a regexp matcher. So -un--- becomes:
>>> import re
>>> re.findall(r'\b\wun\w\w\w\b', "run runner bunt bunter bunted bummer")
['runner', 'bunter', 'bunted']
\w matches any 'word character'. \b matches any word boundary.
If you want to do this repeatedly you should create an index:
wordlist = [word.strip() for word in "run, ran, rat, rob, fish, tree".split(',')]

from collections import defaultdict

class Index(object):
    def __init__(self, wordlist=()):
        self.trie = defaultdict(set)
        for word in wordlist:
            self.add_word(word)

    def add_word(self, word):
        """ adds word to the index """
        # save the length of the word
        self.trie[len(word)].add(word)
        for marker in enumerate(word):
            # add word to the set of words with (pos,char)
            self.trie[marker].add(word)

    def find(self, pattern, wildcard='-'):
        # get all words with matching length as candidates
        # (copy the set so that "&=" below does not modify the index itself)
        candidates = set(self.trie[len(pattern)])
        # get all words with all the markers
        for marker in enumerate(pattern):
            if marker[1] != wildcard:
                candidates &= self.trie[marker]
            # exit early if there are no candidates
            if not candidates:
                return None
        return candidates

with open('dict.txt', 'rt') as lines:
    wordlist = [word.strip() for word in lines]

s = Index(wordlist)
print(s.find("r--"))
Tries are made for searching strings; this index takes a similar approach, keyed by word length and by (position, character) pairs in a single dict.
Sounds like homework involving searching algorithms or something, but I'll give you a start.
One solution might be to index the file (if this can be done in a reasonable time) into a tree structure with each character representing a node value and each child being all subsequent characters. You could then traverse the tree using the input as a map. A character represents the next node to go to and a dash represents that it should include all child nodes. Every time you hit a leaf n levels deep, with n being the length of the input, you know you've found a match.
The nice thing is that once you index, your search will speed up considerably. It's the indexing that can take forever...
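A minimal sketch of that idea (the word list below is just a placeholder): each node is a dict mapping a character to its child node, and a None key marks the end of a word.
def build_tree(words):
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node[None] = True                      # a whole word ends at this node
    return root

def search(node, pattern, prefix=""):
    if not pattern:                            # consumed the whole input
        return [prefix] if None in node else []
    ch, rest = pattern[0], pattern[1:]
    if ch == "-":                              # dash: follow every child node
        keys = [k for k in node if k is not None]
    else:                                      # letter: follow that child only
        keys = [ch] if ch in node else []
    matches = []
    for k in keys:
        matches += search(node[k], rest, prefix + k)
    return matches

tree = build_tree(["run", "ran", "rat", "rob", "rune"])   # placeholder word list
print(search(tree, "r--"))                     # ['run', 'ran', 'rat', 'rob']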
Takes a bit of memory but this does the trick:
import re
import sys

word = '\\b' + sys.argv[1].replace('-', '\\w') + '\\b'
print(word)

with open('data.txt', 'r') as fh:
    print(re.findall(word, fh.read()))
A few approaches occur to me:
The first is to preprocess your dictionary into words[wordlength][offset][charAtOffset] = set(matching words); then your query becomes an intersection of all relevant word-sets. Very fast, but memory-intensive and a lot of set-up work.
Ex:
# search for 'r-n'
matches = list(words[3][0]['r'] & words[3][2]['n'])
The second is a linear scan of the dictionary using regular expressions; much slower, but minimal memory footprint.
Ex:
import re
foundMatch = re.compile('r.n').match
matches = [word for word in allWords if foundMatch(word)]
Third would be a recursive search into a word-Trie;
Fourth - and it sounds like what you want - is a naive word-matcher:
with open('dictionary.txt') as inf:
    all_words = [word.strip().lower() for word in inf]  # one word per line

find_word = 'r-tt-r'

matching_words = []
for word in all_words:
    if len(word) == len(find_word):
        if all(find == ch or find == '-' for find, ch in zip(find_word, word)):
            matching_words.append(word)
Edit: full code for the first option follows:
from collections import defaultdict
from functools import reduce   # builtin in Python 2.x, needed in Python 3.x
import operator

try:
    inp = raw_input   # Python 2.x
except NameError:
    inp = input       # Python 3.x

class Words(object):
    @classmethod
    def fromFile(cls, fname):
        with open(fname) as inf:
            return cls(inf)

    def __init__(self, words=None):
        super(Words, self).__init__()
        self.words = set()
        self.index = defaultdict(lambda: defaultdict(lambda: defaultdict(set)))
        _addword = self.addWord
        for word in words:
            _addword(word.strip().lower())

    def addWord(self, word):
        self.words.add(word)
        _ind = self.index[len(word)]
        for ind, ch in enumerate(word):
            _ind[ind][ch].add(word)

    def findAll(self, pattern):
        pattern = pattern.strip().lower()
        _ind = self.index[len(pattern)]
        return reduce(operator.__and__, (_ind[ind][ch] for ind, ch in enumerate(pattern) if ch != '-'), self.words)

def main():
    print('Loading dict... ')
    words = Words.fromFile('dict.txt')
    print('done.')
    while True:
        seek = inp('Enter partial word ("-" is wildcard, nothing to exit): ').strip()
        if seek:
            print("Matching words: " + ' '.join(words.findAll(seek)) + '\n')
        else:
            break

if __name__ == "__main__":
    main()

Python- Remove all words that contain other words in a list

I have a list populated with words from a dictionary. I want to find a way to remove all words that contain a shorter root word, considering only roots that appear at the beginning of the target word.
For example, the word "rodeo" would be removed from the list because it contains the English-valid word "rode." "Typewriter" would be removed because it contains the English-valid word "type." However, the word "snicker" is still valid even if it contains the word "nick" because "nick" is in the middle and not at the beginning of the word.
I was thinking something like this:
for line in wordlist:
    if line.find(...) --
but I want that "if" statement to then run through every single word in the list, checking to see if it's found and, if so, remove that word from the list so that only root words remain. Do I have to create a copy of wordlist to traverse?
So you have two lists: the list of words you want to check and possibly remove, and a list of valid words. If you like, you can use the same list for both purposes, but I'll assume you have two lists.
For speed, you should turn your list of valid words into a set. Then you can very quickly check whether any particular word is in that set. Then, take each word and check whether any of its prefixes is in the valid words set. Since "a" and "I" are valid words in English, will you remove all valid words starting with 'a', or will you have a rule that sets a minimum length for the prefix?
I am using the file /usr/share/dict/words from my Ubuntu install. This file has all sorts of odd things in it; for example, it seems to contain every letter by itself as a word. Thus "k" is in there, "q", "z", etc. None of these are words as far as I know, but they are probably in there for some technical reason. Anyway, I decided to simply exclude anything shorter than three letters from my valid words list.
Here is what I came up with:
# build valid list from /usr/share/dict/words
wfile = "/usr/share/dict/words"
valid = set(line.strip() for line in open(wfile) if len(line.strip()) >= 3)

lst = ["ark", "booze", "kite", "live", "rodeo"]

def subwords(word):
    for i in range(len(word) - 1, 0, -1):
        w = word[:i]
        yield w

newlst = []
for word in lst:
    # uncomment these for debugging to make sure it works
    # print("subwords", [w for w in subwords(word)])
    # print("valid subwords", [w for w in subwords(word) if w in valid])
    if not any(w in valid for w in subwords(word)):
        newlst.append(word)

print(newlst)
If you are a fan of one-liners, you could do away with the for loop and use a list comprehension:
newlst = [word for word in lst if not any(w in valid for w in subwords(word))]
I think that's more terse than it should be, and I like being able to put in the print statements to debug.
Hmm, come to think of it, it's not too terse if you just add another function:
def keep(word):
    return not any(w in valid for w in subwords(word))

newlst = [word for word in lst if keep(word)]
Python can be easy to read and understand if you make functions like this, and give them good names.
I'm assuming that you only have one list from which you want to remove any elements that have prefixes in that same list.
# Important assumption here... wordlist is sorted
base = wordlist[0]                    # consider the first word in the list
for word in wordlist:                 # loop through the entire list checking if
    if not word.startswith(base):     # the word we're considering starts with the base
        print(base)                   # If not... we have a new base, print the current
        base = word                   # one and move to this new one
    # else word starts with base
    # don't output word, and go on to the next item in the list
print(base)                           # finish by printing the last base
EDIT: Added some comments to make the logic more obvious
I find jkerian's answer to be the best (assuming only one list) and I would like to explain why.
Here is my version of the code (as a function):
wordlist = ["a","arc","arcane","apple","car","carpenter","cat","zebra"];
def root_words(wordlist):
result = []
base = wordlist[0]
for word in wordlist:
if not word.startswith(base):
result.append(base)
base=word
result.append(base)
return result;
print root_words(wordlist);
As long as the word list is sorted (you could do this in the function if you wanted to), this will get the result in a single pass. This is because when you sort the list, all words made up of another word in the list will come directly after that root word; e.g. anything that falls between "arc" and "arcane" in your particular list will also be eliminated because of the root word "arc".
You can use the built-in filter function with a lambda for this. I think it'll make your life a lot easier
words = ['rode', 'nick']  # this is the list of all the words that you have.
                          # I'm using 'rode' and 'nick' as they're in your example
listOfWordsToTry = ['rodeo', 'snicker']

def validate(w):
    for word in words:
        if w.startswith(word):
            return False
    return True

wordsThatDontStartWithValidEnglishWords = \
    filter(lambda x: validate(x), listOfWordsToTry)
This should work for your purposes, unless I misunderstand your question.
Hope this helps
I wrote an answer that assumes two lists, the list to be pruned and the list of valid words. In the discussion around my answer, I commented that maybe a trie solution would be good.
What the heck, I went ahead and wrote it.
You can read about a trie here:
http://en.wikipedia.org/wiki/Trie
For my Python solution, I basically used dictionaries. A key is a sequence of symbols, and each symbol goes into a dict, with another Trie instance as the data. A second dictionary stores "terminal" symbols, which mark the end of a "word" in the Trie. For this example, the "words" are actually words, but in principle the words could be any sequence of hashable Python objects.
The Wikipedia example shows a trie where the keys are letters, but keys can be more than a single letter; they can be sequences of multiple symbols. For simplicity, my code uses only a single symbol at a time as a key.
If you add both the word "cat" and the word "catch" to the trie, then there will be nodes for 'c', 'a', and 't' (and also the second 'c' in "catch"). At the node level for 'a', the dictionary of "terminals" will have 't' in it (thus completing the coding for "cat"), and likewise at the deeper node level of the second 'c' the dictionary of terminals will have 'h' in it (completing "catch"). So, adding "catch" after "cat" just means one additional node and one more entry in the terminals dictionary. The trie structure makes a very efficient way to store and index a really large list of words.
def _pad(n):
    return " " * n

class Trie(object):
    def __init__(self):
        self.t = {}  # dict mapping symbols to sub-tries
        self.w = {}  # dict listing terminal symbols at this level

    def add(self, word):
        if 0 == len(word):
            return
        cur = self
        for ch in word[:-1]:  # add all symbols but terminal
            if ch not in cur.t:
                cur.t[ch] = Trie()
            cur = cur.t[ch]
        ch = word[-1]
        cur.w[ch] = True  # add terminal

    def prefix_match(self, word):
        if 0 == len(word):
            return False
        cur = self
        for ch in word[:-1]:  # check all symbols but last one
            # If you check the last one, you are not checking a prefix,
            # you are checking whether the whole word is in the trie.
            if ch in cur.w:
                return True
            if ch not in cur.t:
                return False
            cur = cur.t[ch]  # walk down the trie to next level
        return False

    def debug_str(self, nest, s=None):
        "print trie in a convenient nested format"
        lst = []
        s_term = "".join(ch for ch in self.w)
        if 0 == nest:
            lst.append(object.__str__(self))
            lst.append("--top--: " + s_term)
        else:
            tup = (_pad(nest), s, s_term)
            lst.append("%s%s: %s" % tup)
        for ch, d in self.t.items():
            lst.append(d.debug_str(nest+1, ch))
        return "\n".join(lst)

    def __str__(self):
        return self.debug_str(0)

t = Trie()

# Build valid list from /usr/share/dict/words, which has every letter of
# the alphabet as words! Only take 2-letter words and longer.
wfile = "/usr/share/dict/words"
for line in open(wfile):
    word = line.strip()
    if len(word) >= 2:
        t.add(word)

# add valid 1-letter English words
t.add("a")
t.add("I")

lst = ["ark", "booze", "kite", "live", "rodeo"]
# "ark" starts with "a"
# "booze" starts with "boo"
# "kite" starts with "kit"
# "live" is good: "l", "li", "liv" are not words
# "rodeo" starts with "rode"

newlst = [w for w in lst if not t.prefix_match(w)]
print(newlst)  # prints: ['live']
I don't want to provide an exact solution, but I think there are two key functions in Python that will help you greatly here.
The first, jkerian mentioned: string.startswith() http://docs.python.org/library/stdtypes.html#str.startswith
The second: filter() http://docs.python.org/library/functions.html#filter
With filter, you could write a conditional function that will check to see if a word is the base of another word and return true if so.
For each word in the list, you would need to iterate over all of the other words and evaluate the conditional using filter, which could return the proper subset of root words.
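A rough sketch of how those two pieces could fit together (the word list below is just a placeholder, not a complete solution):
words = ["rode", "type", "nick", "rodeo", "typewriter", "snicker"]  # placeholder list

def has_root(word):
    # True if some *other* word in the list is a prefix of this word
    return any(word.startswith(w) for w in words if w != word)

print(list(filter(lambda w: not has_root(w), words)))
# ['rode', 'type', 'nick', 'snicker']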
I only had one list - and I wanted to remove any word from it that was a prefix of another.
Here is a solution that should run in O(N log N) time and O(M) space, where M is the size of the returned list. The runtime is dominated by the sorting.
l = sorted(your_list)
removed_prefixes = [l[g] for g in range(0, len(l)-1) if not l[g+1].startswith(l[g])] + l[-1:]
If the list is sorted then the item at index N is a prefix if it begins the item at index N+1.
At the end it appends the last item of the original sorted list, since by definition it is not a prefix.
Handling it last also allows us to iterate over an arbitrary number of indexes w/o going out of range.
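For example, assuming a small sample list:
your_list = ["arcane", "arc", "apple", "banana"]   # placeholder input
l = sorted(your_list)                              # ['apple', 'arc', 'arcane', 'banana']
removed_prefixes = [l[g] for g in range(0, len(l)-1) if not l[g+1].startswith(l[g])] + l[-1:]
print(removed_prefixes)                            # ['apple', 'arcane', 'banana'] -- 'arc' was a prefix of 'arcane'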
If you have the banned list hardcoded in another list:
banned = tuple(banned_prefixes)
removed_prefixes = [i for i in your_list if not i.startswith(banned)]
This relies on the fact that startswith accepts a tuple. It probably runs in something close to N * M where N is elements in list and M is elements in banned. Python could conceivably be doing some smart things to make it a bit quicker. If you are like OP and want to disregard case, you will need .lower() calls in places.
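For example:
banned = ("rode", "type")                          # placeholder banned prefixes
your_list = ["rodeo", "typewriter", "snicker"]     # placeholder input
print([i for i in your_list if not i.startswith(banned)])   # ['snicker']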
