Finding anagrams using a dictionary in Python

I'm trying to create a function in python that will print out the anagrams of words in a text file using dictionaries. I've looked at what feels like hundreds of similar questions, so I apologise if this is a repetition, but I can't seem to find a solution that fits my issue.
I understand what I need to do (at least, I think so), but I'm stuck on the final part.
This is what I have so far:
with open('words.txt', 'r') as fp:
    line = fp.readlines()

def make_anagram_dict(line):
    dict = {}
    for word in line:
        key = ''.join(sorted(word.lower()))
        if key in dict.keys():
            dict[key].append(word.lower())
        else:
            dict[key] = []
            dict[key].append(word.lower())
        if line == key:
            print(line)

make_anagram_dict(line)
I think I need something which compares the key of each value to the keys of other values, and then prints if they match, but I can't get something to work.
At the moment, the best I can do is print out all the keys and values in the file, but ideally, I would be able to print all the anagrams from the file.
Output: I don't have a concretely specified output format in mind, but something along the lines of:
[cat: act, tac]
for each anagram.
Again, apologies if this is a repetition, but any help would be greatly appreciated.

I'm not sure about the output format. In my implementation, all anagrams are printed out at the end.
with open('words.txt', 'r') as fp:
    line = fp.readlines()

def make_anagram_dict(line):
    d = {}  # avoid using 'dict' as variable name
    for word in line:
        word = word.lower()  # call lower() only once
        key = ''.join(sorted(word))
        if key in d:  # no need to call keys()
            d[key].append(word)
        else:
            d[key] = [word]  # you can initialize the list with the initial value
    return d  # just return the mapping to process it later

if __name__ == '__main__':
    d = make_anagram_dict(line)
    for words in d.values():
        if len(words) > 1:  # several anagrams in this group
            print('Anagrams: {}'.format(', '.join(words)))
Also, consider using defaultdict: it's a dictionary that creates a value of a specified type for any fresh key.
from collections import defaultdict

with open('words.txt', 'r') as fp:
    line = fp.readlines()

def make_anagram_dict(line):
    d = defaultdict(list)  # argument is the default constructor for the value
    for word in line:
        word = word.lower()  # call lower() only once
        key = ''.join(sorted(word))
        d[key].append(word)  # now d[key] is always a list
    return d  # just return the mapping to process it later

if __name__ == '__main__':
    d = make_anagram_dict(line)
    for words in d.values():
        if len(words) > 1:  # several anagrams in this group
            print('Anagrams: {}'.format(', '.join(words)))

I'm going to make the assumption that you're grouping words within a file which are anagrams of each other.
If, on the other hand, you're being asked to find all the English-language anagrams for a list of words in a file, you will need a way of determining what is or isn't a word. This means you either need an actual "dictionary", as in a set(<of all english words>), or maybe a very sophisticated predicate method.
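If it is the latter, one rough sketch is to index a real word list by its sorted letters and look candidates up there. Here english_words.txt is a hypothetical file with one valid word per line; the name and approach are assumptions, not part of the question:
from collections import defaultdict

# Sketch only: 'english_words.txt' is a hypothetical word list, one word per line.
with open('english_words.txt') as fp:
    english = [w.strip().lower() for w in fp if w.strip()]

by_key = defaultdict(list)  # sorted letters -> real English words
for w in english:
    by_key[''.join(sorted(w))].append(w)

def english_anagrams(word):
    """All English-language anagrams of `word`, excluding the word itself."""
    word = word.lower()
    return [w for w in by_key[''.join(sorted(word))] if w != word]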
Anyhow, here's a relatively straightforward solution which assumes your words.txt is small enough to be read into memory completely:
with open('words.txt', 'r') as infile:
    words = infile.read().split()

anagram_dict = {word: list() for word in words}

for k, v in anagram_dict.items():
    k_anagrams = (othr for othr in words if (sorted(k) == sorted(othr)) and (k != othr))
    anagram_dict[k].extend(k_anagrams)

print(anagram_dict)
This isn't the most efficient way to do this, but hopefully it gets across the power of filtering.
Arguably, the most important thing here is the if (sorted(k) == sorted(othr)) and (k != othr) filter in the k_anagrams definition. This filter only allows identical letter combinations, but weeds out exact matches.
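As one possible tweak (a sketch only, reusing the words list read above), you could sort each word a single time up front instead of re-sorting k inside every comparison; the filter itself stays the same:
# Sketch: precompute each word's sorted-letter key once, then apply the same filter.
sorted_keys = {word: ''.join(sorted(word)) for word in words}

anagram_dict = {
    word: [othr for othr in words
           if sorted_keys[othr] == key and othr != word]
    for word, key in sorted_keys.items()
}
print(anagram_dict)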

Your code is pretty much there, just needs some tweaks:
import re

def make_anagram_dict(words):
    d = {}
    for word in words:
        word = word.lower()  # call lower() only once
        key = ''.join(sorted(word))  # make the key
        if key in d:  # check if it's in the dictionary already
            if word not in d[key]:  # avoid duplicates
                d[key].append(word)
        else:
            d[key] = [word]  # initialize list with the initial value
    return d  # return the entire dictionary

if __name__ == '__main__':
    filename = 'words.txt'
    with open(filename) as file:
        # Use regex to extract words. You can adjust to include/exclude
        # characters, numbers, punctuation...
        # This returns a list of words
        words = re.findall(r"([a-zA-Z\-]+)", file.read())
    # Now process them
    d = make_anagram_dict(words)
    # Now print them
    for words in d.values():
        if len(words) > 1:  # we found anagrams
            print('Anagram group: {}'.format(', '.join(words)))
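For example, with a words.txt containing cat, act, tac, dog and god (in any mix of case or punctuation), this prints something along the lines of Anagram group: cat, act, tac followed by Anagram group: dog, god.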

Related

Need to sort a dictionary built from multiple text files - unsure how to structure function

I have the following code:
import os
import string

#(Function A) - that will take in string as input, and return a dictionary of word and word frequency.
def master_dictionary(directory):
    filelist = [os.path.join(directory, f) for f in os.listdir(directory)]

    def counter(x):
        f = open(x, "rt")
        words = f.read().split()
        words = filter(lambda x: x.isalpha(), words)
        word_counter = dict()
        for word in words:
            if word in word_counter:
                word_counter[word] += 1
            else:
                word_counter[word] = 1
        return (word_counter)

    def sort_dictionary(counter()):
        remove_duplicate = []
        new_list = dict()
        for key, val in word_counter.items():
            if val not in remove_duplicate:
                remove_duplicate.append(val)
                new_list[key] = val
        new_list = sorted(new_list.items(), key=lambda word_counter: word_counter[1], reverse=True)
        print(f'Top 3 words in file {x}:', new_list[:3])

    return [counter(file) for file in filelist]

master_dictionary('medline')
I need to pass the return value from the counter function to the sort_dictionary function. The function needs to combine all the dictionaries from each file, and the output should only be the top 3 words from that master dictionary. Unfortunately, I don't know how to structure it.
Your code has several issues:
needlessly nested functions; this causes the inner functions to be redefined every time the outer function is called
missing return values; sort_dictionary doesn't return anything
a syntax error; you cannot define sort_dictionary with a signature of sort_dictionary(counter())
confusing names, e.g. calling a dict new_list
an actual error in functionality; check below
It looks like you want to create a function that prints the top 3 words of a file, for every file in a directory.
Your counter function is more or less OK:
def counter(x):
    f = open(x, "rt")
    words = f.read().split()
    words = filter(lambda x: x.isalpha(), words)
    word_counter = dict()
    for word in words:
        if word in word_counter:
            word_counter[word] += 1
        else:
            word_counter[word] = 1
    return (word_counter)
Although it could certainly be improved upon:
from collections import defaultdict

def word_counter(filename):
    word_count = defaultdict(int)
    with open(filename, 'rt') as f:
        for word in f.read().split():
            word_count[word] += 1
    return word_count
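As an aside, the standard library's collections.Counter implements exactly this counting pattern, so an equivalent sketch can be even shorter (Counter is still a dict subclass, so the rest of the code below would work with it unchanged):
from collections import Counter

def word_counter(filename):
    with open(filename, 'rt') as f:
        return Counter(f.read().split())
Counter also has a most_common(3) method that returns the three highest (word, count) pairs, which covers the top-three step discussed next.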
Your sort_dictionary function has several issues though. You're trying to find the highest word counts by inverting the dictionary and then sorting the items whose count is not a duplicate. That kind of works, but isn't very clear.
However, what if there are 2 words that both occur 10 times? You don't want to just throw one of them out if the next best is only 9. I.e., what if the top counts are 10, 10, and 9? I'd imagine you'd want the two 10's and then the 9?
Something like this works better:
def top_three(wc):
    return {
        w: c for c, w in
        sorted([(count, word) for word, count in wc.items()], key=lambda x: x[0])[-3:]
    }
And to then put it all together in a function that gets all the files in the directory and applies these functions (which is what your question was about):
def print_master_dictionary(dirname):
    for filename in [os.path.join(dirname, f) for f in os.listdir(dirname)]:
        print(f'Top 3 words in file {filename}:',
              list(top_three(word_counter(filename)).keys()))
All together:
import os
from collections import defaultdict

def word_counter(filename):
    word_count = defaultdict(int)
    with open(filename, 'rt') as f:
        for word in f.read().split():
            word_count[word] += 1
    return word_count

def top_three(wc):
    return {
        w: c for c, w in
        sorted([(count, word) for word, count in wc.items()], key=lambda x: x[0])[-3:]
    }

def print_master_dictionary(dirname):
    for filename in [os.path.join(dirname, f) for f in os.listdir(dirname)]:
        print(f'Top 3 words in file {filename}:',
              list(top_three(word_counter(filename)).keys()))

print_master_dictionary('my_folder')
Note that the real answer here is the part list(top_three(word_counter(filename)).keys()). I like how the renaming of the functions makes it very clear what's happening: a list of the top three keys in a word_counter dictionary for some filename.
If you prefer a bit more of a step-by-step look, in your code:
def master_dictionary(dirname):
    for filename in [os.path.join(dirname, f) for f in os.listdir(dirname)]:
        wc = word_counter(filename)
        tt = top_three(wc)
        print(f'Top 3 words in file {filename}:', list(tt.keys()))
The main takeaway on how to structure code like this:
write functions with a well-defined purpose, doing one thing and doing it well
return results that can be easily reused, without side effects (like printing or changing globals)
when using the functions, remember that you're calling them, and after execution you need to capture the returned result, either in a variable or as an argument to the next function.
Something like x(y(z(1))) causes z(1) to be executed first; the result is then passed to y(), and once y completes execution and returns, its result is passed to x(). So z, y and x are executed in that order, each using the result of the previous one as its argument, and you'll be left with the result of the call to x.
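To make that concrete, here's a tiny sketch with made-up functions (the names and bodies are chosen only for illustration):
def z(n): return n + 1           # z(1) runs first  -> 2
def y(n): return n * 2           # then y(2)        -> 4
def x(n): return f"result: {n}"  # finally x(4)     -> "result: 4"

print(x(y(z(1))))                # prints: result: 4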

For line in : not returning all lines

I am trying to traverse a text file and take each line and put it into a dictionary. Ex:
If the txt file is
a
b
c
I am trying to create a dictionary like
word_dict = {'a': 1, 'b': 2, 'c': 3}
When I use this code:
def word_dict():
    fin = open('words2.txt', 'r')
    dict_words = dict()
    i = 1
    for line in fin:
        txt = fin.readline().strip()
        dict_words.update({txt: i})
        i += 1
    print(dict_words)
My dictionary only contains a partial list. If I use this code (not trying to build the dictionary, just testing):
def word_dict():
    fin = open('words2.txt', 'r')
    i = 1
    while fin.readline():
        txt = fin.readline().strip()
        print(i, '.', txt)
        i += 1
Same thing. It prints a list of values that is incomplete. The list matches the dictionary values though. What am I missing?
You're trying to read the lines twice.
Just do this:
def word_dict(file_path):
    with open(file_path, 'r') as input_file:
        words = {line.strip(): i for i, line in enumerate(input_file, 1)}
    return words

print(word_dict('words2.txt'))
This fixes a couple of things.
Functions should not have hard-coded values; rather, you should use an argument. This way you can reuse the function.
Functions should (generally) return values instead of printing them. This allows you to use the results of the function in further computation.
You were using a manual index variable instead of the built-in enumerate.
This line {line.strip(): i for i, line in enumerate(input_file, 1)} is what's known as a dictionary comprehension. It is equivalent to the following code:
words = {}
for i, line in enumerate(input_file, 1):
    words[line.strip()] = i
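For the three-line a, b, c file from the question, either form produces {'a': 1, 'b': 2, 'c': 3}.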
This is because you are reading from the file twice in each iteration: once via the loop itself and once more with readline(). Simply do:
def word_dict():
    fin = open('words2.txt', 'r')
    dict_words = dict()
    i = 1
    for line in fin:
        txt = line.strip()
        dict_words.update({txt: i})
        i += 1
    print(dict_words)

How to return a list of words from a text file in python

I want to return all words found in a text file. This is the code i have so far.
def get_dictionary_word_list():
    f = open('dictionary.txt')
    for word in f.read().split():
        print(word)
It works using the print function, but instead of printing the words I want to return all the words in the text file. Using return, it only shows 'aa' and not the rest of the words in the file. I'm not sure why it's not working with return?
If you used return in the loop, it returned on the first iteration and you only got back the first word.
What you want is an aggregation of the words - or better yet, return the list you got back from splitting the file's contents. You may want to sanitize line breaks.
def get_dictionary_word_list():
    # the with context manager assures us the
    # file will be closed when leaving the scope
    with open('dictionary.txt') as f:
        # return the split results, which is all the words in the file.
        return f.read().split()
To get a dictionary back, you can use this (takes care of line breaks):
def get_dictionary_word_list():
    # the with context manager assures us the
    # file will be closed when leaving the scope
    with open('dictionary.txt') as f:
        # create a dictionary object to return
        result = dict()
        for line in f.read().splitlines():
            # split the line to a key - value.
            k, v = line.split()
            # add the key - value to the dictionary object
            result[k] = v
        return result
To get key,value items back, you can use something like this to return a generator (keep in mind the file will be left open as long as the generator remains open). You can modify it to return just words if that's what you want, it's pretty straightforward:
def get_dictionary_word_list():
    # the with context manager assures us the
    # file will be closed when leaving the scope
    with open('dictionary.txt') as f:
        for line in f.read().splitlines():
            # yield a tuple (key, value)
            yield tuple(line.split())
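If you do want a plain dict back from that generator version, one way (just a sketch) is to feed the yielded pairs straight into dict():
# Consuming the generator fully also lets the file close promptly.
word_pairs = dict(get_dictionary_word_list())
print(word_pairs)   # for the sample file below: {'a': 'asd', 'b': 'bsd', 'c': 'csd'}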
Example output for the first function:
xxxx:~$ cat dictionary.txt
a asd
b bsd
c csd
xxxx:~$ cat ld.py
#!/usr/bin/env python
def get_dictionary_word_list():
    # the with context manager assures us the
    # file will be closed when leaving the scope
    with open('dictionary.txt') as f:
        # return the split results, which is all the words in the file.
        return f.read().split()

print get_dictionary_word_list()
xxxx:~$ ./ld.py
['a', 'asd', 'b', 'bsd', 'c', 'csd']
How about this:
def get_dictionary_word_list(fname):
    with open(fname) as fh:
        return set(fh.read().split())
Try it through a list:
def get_dictionary_word_list():
    f = open('dictionary.txt')
    ll = []
    for word in f.read().split():
        ll.append(word)
    return ll
Simply try this:
def func():
    with open('new.txt') as f:
        return f.read()  # returns the complete file

with open('out.txt', 'w+') as w:
    w.write(func())
    w.seek(0)
    print w.read()
With generators:
def func():
    with open('new.txt') as f:
        yield f.read()

data = func()
with open('out2.txt', 'w+') as w:
    for line in data:
        w.write(line)  # or you may use map(w.write, line)
    w.seek(0)
    print w.read()

Splitting a sentence into two and storing them into a defaultdict as key and value in Python

I have some questions about Defaultdict and Counter. I have a situation where I have a text file with one sentence per line. I want to split up the sentence into two (at first space) and store them into a dictionary with the first substring as the key and the second substring as the value. The reason for doing this is so that I can get a total number of sentences that share the same key.
Text file format:
id1 This is an example
id3 Hello World
id1 This is also an example
id4 Hello Hello World
.
.
This is what I have tried but it doesn't work. I have looked at Counter but it's a bit tricky in my situation.
try:
    openFileObject = open('test.txt', "r")
    try:
        with openFileObject as infile:
            for line in infile:
                # Break up line into two strings at first space
                tempLine = line.split(' ', 1)
                classDict = defaultdict(tempLine)
                for tempLine[0], tempLine[1] in tempLine:
                    classDict[tempLine[0]].append(tempLine[1])
        # Get the total number of keys
        len(classDict)
        # Get value for key id1 (should return 2)
    finally:
        print 'Done.'
        openFileObject.close()
except IOError:
    pass
Is there a way to do this without splitting up the sentences and storing them as tuples in a huge list before attempting using Counter or defaultdict? Thanks!
EDIT: Thanks to all who answered. I finally found out where I went wrong in this. I edited the program with all the suggestions given by everyone.
openFileObject = open(filename, "r")
tempList = []
with openFileObject as infile:
    for line in infile:
        tempLine = line.split(' ', 1)
        tempList.append(tempLine)

classDict = defaultdict(list)  # My error was here: I used tempLine instead of list
for key, value in tempList:
    classDict[key].append(value)

print len(classDict)
print len(classDict['key'])
Using collections.Counter to "get a total number of sentences that share the same key."
from collections import Counter

with openFileObject as infile:
    print Counter(x.split()[0] for x in infile)
will print
Counter({'id1': 2, 'id4': 1, 'id3': 1})
If you want to store a list of all the lines, your main mistake is here
classDict = defaultdict(tempLine)
For this pattern, you should be using
classDict = defaultdict(list)
But there's no point storing all those lines in a list if you're just intending to take the length.
dict.get(key, 0) returns the current accumulated count; if key is not in the dict, it returns 0.
classDict = {}
with open('text.txt') as infile:
    for line in infile:
        key = line.split(' ', 1)[0]
        classDict[key] = classDict.get(key, 0) + 1

print(len(classDict))
for key in classDict:
    print('{}: {}'.format(key, classDict[key]))
http://docs.python.org/3/library/stdtypes.html#dict.get
Full example of defaultdict (and improved way of displaying classDict)
from collections import defaultdict

classDict = defaultdict(int)
with open('text.txt') as f:
    for line in f:
        first_word = line.split()[0]
        classDict[first_word] += 1

print(len(classDict))
for key, value in classDict.iteritems():
    print('{}: {}'.format(key, value))
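With the four example lines from the question, this prints 3 followed by id1: 2, id3: 1 and id4: 1, in whatever order the dictionary happens to yield them.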

Why can't I search through the dictionary I created (Python)?

In this program I am making a dictionary from a plain text file. Basically, I count the number of times a word occurs in a document; the word becomes the key and the number of occurrences is the value. I can create the dictionary, but then I cannot search through it. Here is my updated code with your guys' input. I really appreciate the help.
from collections import defaultdict
import operator

def readFile(fileHandle):
    d = defaultdict(int)
    with open(fileHandle, "r") as myfile:
        for currline in myfile:
            for word in currline.split():
                d[word] += 1
    return d

def reverseLookup(dictionary, value):
    for key in dictionary.keys():
        if dictionary[key] == value:
            return key
    return None

afile = raw_input("What is the absolute file path: ")
print readFile(afile)

choice = raw_input("Would you like to (1) Query Word Count (2) Print top words to a new document (3) Exit: ")

if (choice == "1"):
    query = raw_input("What word would like to look up? ")
    print reverseLookup(readFile(afile), query)
if (choice == "2"):
    f = open("new.txt", "a")
    d = dict(int)
    for w in text.split():
        d[w] += 1
    f.write(d)
    file.close(f)
if (choice == "3"):
    print "The EXIT has HAPPENED"
else:
    print "Error"
Your approach is very complicated (and syntactically wrong, at least in your posted code sample).
Also, you're rebinding the built-in name dict which is problematic, too.
Furthermore, this functionality is already built-in in Python:
from collections import defaultdict

def readFile(fileHandle):
    d = defaultdict(int)  # Access to undefined keys creates an entry with value 0
    with open(fileHandle, "r") as myfile:  # File will automatically be closed
        for currline in myfile:  # Loop through file line-by-line
            for word in currline.strip().split():  # Loop through words w/o CRLF
                d[word] += 1  # Increase word counter
    return d
As for your reverseLookup function, see ypercube's answer.
Your code returns after it looks in the first (key,value) pair. You have to search the whole dictionary before returning that the value has not been found.
def reverseLookup(dictionary, value):
    for key in dictionary.keys():
        if dictionary[key] == value:
            return key
    return None
You should also not return "error" as it can be a word and thus a key in your dictionary!
Depending upon how you're intending to use this reverseLookup() function, you might find your code much happier if you employ two dictionaries: build the first dictionary as you already do, and then build a second dictionary that contains mappings between the number of occurrences and the words that occurred that many times. Then your reverseLookup() wouldn't need to perform the for k in d.keys() loop on every single lookup. That loop would only happen once, and every single lookup after that would run significantly faster.
I've cobbled together (but not tested) some code that shows what I'm talking about. I stole Tim's readFile() routine, because I like the look of it more :) but took his nice function-local dictionary d and moved it to global, just to keep the functions short and sweet. In a 'real project', I'd probably wrap the whole thing in a class to allow arbitrary number of dictionaries at run time and provide reasonable encapsulation. This is just demo code. :)
import operator
from collections import defaultdict

d = defaultdict(int)
numbers_dict = {}

def readFile(fileHandle):
    with open(fileHandle, "r") as myfile:
        for currline in myfile:
            for word in currline.split():
                d[word] += 1
    return d

def prepareReverse():
    for (k, v) in d.items():
        old_list = numbers_dict.get(v, [])
        new_list = old_list + [k]  # append this word to the list of words seen v times
        numbers_dict[v] = new_list

def reverseLookup(v):
    return numbers_dict[v]
If you intend on making two or more lookups, this code will trade memory for execution speed. You only iterate through the dictionary once (iteration over all elements is not a dict's strong point), but at the cost of duplicate data in memory.
The search is not working because you have a dictionary mapping each word to its count, so getting the number of occurrences for a word should be just dictionary[word]. You don't really need the reverseLookup(); there is already a .get(key, default_value) method on dict: dictionary.get(value, None).
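Putting that together, a lookup sketch reusing the readFile() and afile from the question (no reverse lookup needed) might be:
counts = readFile(afile)                   # word -> count mapping, built once
query = raw_input("What word would you like to look up? ")
print counts.get(query, 0)                 # 0 if the word never appears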
