Why can't I search through the dictionary I created (Python)? - python

In this program I am making a dictionary from a plain text file: I count the number of times each word occurs in a document, so the word becomes the key and the number of occurrences is the value. I can create the dictionary, but then I cannot search through it. Here is my updated code with your guys' input. I really appreciate the help.
from collections import defaultdict
import operator
def readFile(fileHandle):
    d = defaultdict(int)
    with open(fileHandle, "r") as myfile:
        for currline in myfile:
            for word in currline.split():
                d[word] +=1
    return d
def reverseLookup(dictionary, value):
    for key in dictionary.keys():
        if dictionary[key] == value:
            return key
    return None
afile = raw_input ("What is the absolute file path: ")
print readFile (afile)
choice = raw_input ("Would you like to (1) Query Word Count (2) Print top words to a new document (3) Exit: ")
if (choice == "1"):
    query = raw_input ("What word would like to look up? ")
    print reverseLookup(readFile(afile), query)
if (choice == "2"):
    f = open("new.txt", "a")
    d = dict(int)
    for w in text.split():
        d[w] += 1
    f.write(d)
    file.close (f)
if (choice == "3"):
    print "The EXIT has HAPPENED"
else:
    print "Error"

Your approach is very complicated (and syntactically wrong, at least in your posted code sample).
Also, you're rebinding the built-in name dict which is problematic, too.
Furthermore, this functionality is already built-in in Python:
from collections import defaultdict
def readFile(fileHandle):
    d = defaultdict(int)  # Access to undefined keys creates an entry with value 0
    with open(fileHandle, "r") as myfile:  # File will automatically be closed
        for currline in myfile:  # Loop through file line-by-line
            for word in currline.strip().split():  # Loop through words w/o CRLF
                d[word] += 1  # Increase word counter
    return d
As for your reverseLookup function, see ypercube's answer.

Your code returns after it looks in the first (key,value) pair. You have to search the whole dictionary before returning that the value has not been found.
def reverseLookup(dictionary, value):
    for key in dictionary.keys():
        if dictionary[key] == value:
            return key
    return None
You should also not return "error" as it can be a word and thus a key in your dictionary!

Depending upon how you're intending to use this reverseLookup() function, you might find your code much happier if you employ two dictionaries: build the first dictionary as you already do, and then build a second dictionary that contains mappings between the number of occurrences and the words that occurred that many times. Then your reverseLookup() wouldn't need to perform the for k in d.keys() loop on every single lookup. That loop would only happen once, and every single lookup after that would run significantly faster.
I've cobbled together (but not tested) some code that shows what I'm talking about. I stole Tim's readFile() routine, because I like the look of it more :) but took his nice function-local dictionary d and moved it to global, just to keep the functions short and sweet. In a 'real project', I'd probably wrap the whole thing in a class to allow arbitrary number of dictionaries at run time and provide reasonable encapsulation. This is just demo code. :)
from collections import defaultdict
d = defaultdict(int)  # word -> number of occurrences
numbers_dict = {}  # number of occurrences -> list of words
def readFile(fileHandle):
    with open(fileHandle, "r") as myfile:
        for currline in myfile:
            for word in currline.split():
                d[word] += 1
    return d
def prepareReverse():
    # Build the count -> words mapping once, after readFile() has run
    for (k, v) in d.items():
        numbers_dict.setdefault(v, []).append(k)
def reverseLookup(v):
    # Return every word that occurred exactly v times (empty list if none)
    return numbers_dict.get(v, [])
If you intend on making two or more lookups, this code will trade memory for execution speed. You only iterate through the dictionary once (iteration over all elements is not a dict's strong point), but at the cost of duplicate data in memory.

The search is not working because you have a dictionary mapping a word to its count, so getting the number of occurrences for a word is just dictionary[word]. You don't really need reverseLookup(); dict already has a .get(key, default_value) method: dictionary.get(word, None)
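A minimal sketch of the direct lookup described above. The sample sentence and the names word_counts and query_count are made up for illustration; they are not part of the original code.

```python
from collections import defaultdict

# Hypothetical sample data standing in for the result of readFile().
word_counts = defaultdict(int)
for word in "the cat and the hat".split():
    word_counts[word] += 1

def query_count(counts, word):
    """Return how often `word` occurred, or 0 if it never appeared."""
    # .get() does not insert missing keys, unlike counts[word] on a defaultdict.
    return counts.get(word, 0)

print(query_count(word_counts, "the"))  # 2
print(query_count(word_counts, "dog"))  # 0
```

Using .get() here instead of indexing keeps a failed query from adding a zero-count entry to the defaultdict.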

Related

Why can't i change dictionary when for loop is working?

So what I need is to add new key:value pairs or change existing ones, but my code is not working. What is wrong and how can I solve it?
What I want: create a dictionary where the key is the symbol and the value is how many times this symbol occurs in the list.
My code:
def f(data: list):
    characters = dict()
    for i in data:
        if i in characters:
            characters[i] += 1
        elif i not in data:
            characters[i] = 1
    return characters
Please post your code instead of a screenshot. But, your code is guaranteed to return an empty dictionary regardless of your input. Both of your conditions will always evaluate to false. I think what you want to do is something like this:
if i in characters:
    characters[i] += 1
else:
    characters[i] = 1
Change your elif i not in data: line to else: or elif i not in characters:.
This can be simplified if you can use defaultdict:
from collections import defaultdict
def f(data):
    characters = defaultdict(int)
    for i in data:
        characters[i] += 1
    return characters
from collections import Counter
characters = Counter(data)
The Counter class from collections is exactly what you need.
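As a small illustration of the Counter suggestion, here is a sketch with made-up input; the function name count_symbols is hypothetical.

```python
from collections import Counter

def count_symbols(data):
    """Count how many times each symbol occurs in `data`."""
    return Counter(data)

characters = count_symbols(["a", "b", "a", "c", "a", "b"])
print(characters["a"])            # 3
print(characters["z"])            # 0, missing keys count as zero
print(characters.most_common(2))  # [('a', 3), ('b', 2)]
```

Counter is a dict subclass, so the result can be used anywhere the hand-built dictionary was used.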

Need to sort a dictionary built from multiple text files - unsure how to structure function

I have the following code:
import os
import string
#(Function A) - that will take in string as input, and return a dictionary of word and word frequency.
def master_dictionary(directory):
    filelist=[os.path.join(directory,f) for f in os.listdir(directory)]
    def counter(x):
        f = open(x, "rt")
        words = f.read().split()
        words= filter(lambda x: x.isalpha(), words)
        word_counter = dict()
        for word in words:
            if word in word_counter:
                word_counter[word] += 1
            else:
                word_counter[word] = 1
        return (word_counter)
    def sort_dictionary(counter()):
        remove_duplicate = []
        new_list = dict()
        for key, val in word_counter.items():
            if val not in remove_duplicate:
                remove_duplicate.append(val)
                new_list[key] = val
        new_list = sorted(new_list.items(), key = lambda word_counter: word_counter[1], reverse = True)
        print (f'Top 3 words in file {x}:', new_list[:3])
    return [counter(file) for file in filelist]
master_dictionary('medline')
I need to call on the return value from the counter function to the sort_dictionary function. The function needs to combine all the dictionaries from each file and the output should only be the top 3 words from that master dictionary. Unfortunately, I don't know how to structure it.
Your code has several issues:
needlessly nested functions; this causes functions to be redefined every time the function is called
missing return values, sort_dictionary doesn't return anything
syntax error, you cannot define sort_dictionary with a signature of sort_dictionary(counter())
confusing names, i.e. calling a dict, new_list
actual error in functionality, check below
It looks like you want to create a function that prints the top 3 words of a file, for every file in a directory.
Your counter function is more or less OK:
def counter(x):
    f = open(x, "rt")
    words = f.read().split()
    words= filter(lambda x: x.isalpha(), words)
    word_counter = dict()
    for word in words:
        if word in word_counter:
            word_counter[word] += 1
        else:
            word_counter[word] = 1
    return (word_counter)
Although it could certainly be improved upon:
from collections import defaultdict
def word_counter(filename):
    word_count = defaultdict(int)
    with open(filename, 'rt') as f:
        for word in f.read().split():
            word_count[word] += 1
    return word_count
Your sort_dictionary function has several issues though. You're trying to find the highest word counts by inverting the dictionary and then sorting the items for which there is no duplicate. That kinda works, but isn't very clear.
However, what if there's 2 words that both occur 10 times? You don't just want to throw one out, if the next best is only 9? I.e. what if the top counts are 10, 10, and 9. I'd imagine you'd want the two 10's and any of the 9's?
Something like this works better:
def top_three(wc):
    return {
        w: c for c, w in
        sorted([(count, word) for word, count in wc.items()], key=lambda x: x[0])[-3:]
    }
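If ties at the cutoff matter, one possible variant is sketched here. It is not the answer's method: it assumes you would rather keep every word whose count is at least the third-highest count, so it can return more than three words. The name top_three_with_ties is hypothetical.

```python
def top_three_with_ties(wc):
    """Return all words whose count reaches the third-highest count in `wc`."""
    if not wc:
        return {}
    counts = sorted(wc.values(), reverse=True)
    # Count of the third-ranked word (or of the last word if there are fewer than three).
    cutoff = counts[min(3, len(counts)) - 1]
    return {word: count for word, count in wc.items() if count >= cutoff}

sample = {"a": 10, "b": 10, "c": 9, "d": 9, "e": 1}
print(top_three_with_ties(sample))  # {'a': 10, 'b': 10, 'c': 9, 'd': 9}
```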
And to then put it all together in a function that gets all the files in the directory and applies these functions (which is what your question was about):
def print_master_dictionary(dirname):
    for filename in [os.path.join(dirname, f) for f in os.listdir(dirname)]:
        print(f'Top 3 words in file {filename}:',
              list(top_three(word_counter(filename)).keys()))
All together:
import os
from collections import defaultdict
def word_counter(filename):
    word_count = defaultdict(int)
    with open(filename, 'rt') as f:
        for word in f.read().split():
            word_count[word] += 1
    return word_count
def top_three(wc):
    return {
        w: c for c, w in
        sorted([(count, word) for word, count in wc.items()], key=lambda x: x[0])[-3:]
    }
def print_master_dictionary(dirname):
    for filename in [os.path.join(dirname, f) for f in os.listdir(dirname)]:
        print(f'Top 3 words in file {filename}:',
              list(top_three(word_counter(filename)).keys()))
print_master_dictionary('my_folder')
Note that the real answer here is the part list(top_three(word_counter(filename)).keys()). I like how the renaming of the functions makes it very clear what's happening: a list of the top three keys in a word_counter dictionary for some filename.
If you prefer a bit more of a step by step look in your code:
def master_dictionary(dirname):
    for filename in [os.path.join(dirname, f) for f in os.listdir(dirname)]:
        wc = word_counter(filename)
        tt = top_three(wc)
        print(f'Top 3 words in file {filename}:', list(tt.keys()))
The main takeaway on 'how to structure' code like this:
write functions with a well-defined function, doing one thing and doing it well
return results that can be easily reused, without side-effects (like printing or changing globals)
when using the function, remember that you're calling them and after execution you need to capture the returned result, either in a variable or as an argument to the next function.
Something like this: x(y(z(1))) causes z(1) to be executed first, the result is then passed to y(), and once y completes execution and returns, its result is passed to x(). so z, y and x are executed in that order, each using the result of the previous as its argument. And you'll be left with the result of the call to x.
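The question also asked for one combined dictionary across all files. A rough sketch of that, following the same small-functions structure, might look like the following. It swaps in collections.Counter for the defaultdict used above, and the names master_counter and top_words are hypothetical.

```python
import os
from collections import Counter

def word_counter(filename):
    """Count the whitespace-separated words in one file."""
    with open(filename, "rt") as f:
        return Counter(f.read().split())

def master_counter(dirname):
    """Add up the word counts of every regular file in `dirname`."""
    total = Counter()
    for name in os.listdir(dirname):
        path = os.path.join(dirname, name)
        if os.path.isfile(path):  # skip subdirectories
            total.update(word_counter(path))
    return total

def top_words(counts, n=3):
    """Return the `n` most frequent words, most frequent first."""
    return [word for word, _ in counts.most_common(n)]
```

With these in place, top_words(master_counter('medline')) is another instance of the x(y(z(1))) pattern: each call consumes the previous call's return value.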

Finding anagrams using dictionary in Python

I'm trying to create a function in python that will print out the anagrams of words in a text file using dictionaries. I've looked at what feels like hundreds of similar questions, so I apologise if this is a repetition, but I can't seem to find a solution that fits my issue.
I understand what I need to do (at least, I think so), but I'm stuck on the final part.
This is what I have so far:
with open('words.txt', 'r') as fp:
    line = fp.readlines()
def make_anagram_dict(line):
    dict = {}
    for word in line:
        key = ''.join(sorted(word.lower()))
        if key in dict.keys():
            dict[key].append(word.lower())
        else:
            dict[key] = []
            dict[key].append(word.lower())
    if line == key:
        print(line)
make_anagram_dict(line)
I think I need something which compares the key of each value to the keys of other values, and then prints if they match, but I can't get something to work.
At the moment, the best I can do is print out all the keys and values in the file, but ideally, I would be able to print all the anagrams from the file.
Output: I don't have a concrete specified output, but something along the lines of:
[cat: act, tac]
for each anagram.
Again, apologies if this is a repetition, but any help would be greatly appreciated.
I'm not sure about the output format. In my implementation, all anagrams are printed out in the end.
with open('words.txt', 'r') as fp:
    line = fp.readlines()
def make_anagram_dict(line):
    d = {}  # avoid using 'dict' as variable name
    for word in line:
        word = word.strip().lower()  # drop the trailing newline; call lower() only once
        key = ''.join(sorted(word))
        if key in d:  # no need to call keys()
            d[key].append(word)
        else:
            d[key] = [word]  # you can initialize list with the initial value
    return d  # just return the mapping to process it later
if __name__ == '__main__':
    d = make_anagram_dict(line)
    for words in d.values():
        if len(words) > 1:  # several anagrams in this group
            print('Anagrams: {}'.format(', '.join(words)))
Also, consider using defaultdict - it's a dictionary, that creates values of a specified type for fresh keys.
from collections import defaultdict
with open('words.txt', 'r') as fp:
    line = fp.readlines()
def make_anagram_dict(line):
    d = defaultdict(list)  # argument is the default constructor for value
    for word in line:
        word = word.strip().lower()  # drop the trailing newline; call lower() only once
        key = ''.join(sorted(word))
        d[key].append(word)  # now d[key] is always list
    return d  # just return the mapping to process it later
if __name__ == '__main__':
    d = make_anagram_dict(line)
    for words in d.values():
        if len(words) > 1:  # several anagrams in this group
            print('Anagrams: {}'.format(', '.join(words)))
I'm going to make the assumption you're grouping words within a file which are anagrams of each other.
If, on the other hand, you're being asked to find all the English-language anagrams for a list of words in a file, you will need a way of determining what is or isn't a word. This means you either need an actual "dictionary" as in a set(<of all english words>) or maybe a very sophisticated predicate method.
Anyhow, here's a relatively straightforward solution which assumes your words.txt is small enough to be read into memory completely:
with open('words.txt', 'r') as infile:
    words = infile.read().split()
anagram_dict = {word : list() for word in words}
for k, v in anagram_dict.items():
    k_anagrams = (othr for othr in words if (sorted(k) == sorted(othr)) and (k != othr))
    anagram_dict[k].extend(k_anagrams)
print(anagram_dict)
This isn't the most efficient way to do this, but hopefully it gets across the power of filtering.
Arguably, the most important thing here is the if (sorted(k) == sorted(othr)) and (k != othr) filter in the k_anagrams definition. This is a filter which only allows identical letter-combinations, but weeds out exact matches.
Your code is pretty much there, just needs some tweaks:
import re
def make_anagram_dict(words):
    d = {}
    for word in words:
        word = word.lower()  # call lower() only once
        key = ''.join(sorted(word))  # make the key
        if key in d:  # check if it's in dictionary already
            if word not in d[key]:  # avoid duplicates
                d[key].append(word)
        else:
            d[key] = [word]  # initialize list with the initial value
    return d  # return the entire dictionary
if __name__ == '__main__':
    filename = 'words.txt'
    with open(filename) as file:
        # Use regex to extract words. You can adjust to include/exclude
        # characters, numbers, punctuation...
        # This returns a list of words
        words = re.findall(r"([a-zA-Z\-]+)", file.read())
    # Now process them
    d = make_anagram_dict(words)
    # Now print them
    for words in d.values():
        if len(words) > 1:  # we found anagrams
            print('Anagram group: {}'.format(', '.join(words)))
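To get closer to the "cat: act, tac" output the question mentions, a sketch along the following lines could work. It assumes the first word seen in each group acts as the heading; the names group_anagrams and format_group are made up for illustration.

```python
from collections import defaultdict

def group_anagrams(words):
    """Group words that share the same sorted letters, keeping groups of two or more."""
    groups = defaultdict(list)
    for word in words:
        word = word.strip().lower()
        if not word:
            continue  # ignore blank lines
        key = "".join(sorted(word))
        if word not in groups[key]:  # avoid duplicates
            groups[key].append(word)
    return [group for group in groups.values() if len(group) > 1]

def format_group(group):
    """Render one group as 'first: second, third'."""
    head, *rest = group
    return "{}: {}".format(head, ", ".join(rest))

for group in group_anagrams(["cat", "act", "dog", "tac", "Cat"]):
    print(format_group(group))  # cat: act, tac
```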

Best method to match words between a list and a dictionary, returning only the ones that are unique for a key without the use of modules

I'm writing this script where archive contains the words that a person has said and their age, and clues are sentences where some words are extracted to match with the most likely person that said them. Words that work as a clue are marked with a * and all clues should be uniquely used by a person.
from typing import List, Dict, TextIO, Tuple
def who_did_it(archive: Dict[str, List[tuple]], clues: str) -> str:
    word_list = []
    #contains person and a list of its words in a list
    clean_clue = get_words(clues)
    #get_words: extract the clues clean into a list without the `*`
    suspect = []
    #a list for the most likely person that did it
    dict_list = {}
    #person as key, a list of words as values
    for people in archive:
        clues = archive.get(people)
        word_list.append([people, get_words(clues[0])])
    clean_clue.sort()
    for person, words in word_list:
        dict_list.setdefault(person, words)
    numb = 0
    for names in dict_list:
        for clues in clean_clue:
            if clues in dict_list.get(names):
                numb = numb + 1
            elif tags not in dict_list.get(names):
                numb = numb - 1
        if numb == 1:
            suspect.append(names)
        counter = 0
    if len(suspect) == 1:
        print(suspect[0])
    else:
        print('need more evidence')
The problem comes when I use my test cases: some of them don't seem to work because of the way I'm doing it. Is there any other way to compare these values? How can I compare these values in an efficient way without using modules?
You are better off using a dict with keys that are your clues/weapons and sets of names as values:
def who(things,clues):
    """ Returns a sorted list of (name, [clues,...]) tuples, sorted by longest len first"""
    result = {}
    for t in things:
        for name in clues[t]:
            result.setdefault(name,[])
            result[name].append(t)
    return sorted(result.items(), key=lambda x:-len(x[1]))
clues = { "knife":{"Joe","Phil"}, "club":{"Jane","John"}, "ice":{"Joe","Neffe"}}
print(who({"knife","ice"}, clues))
Output:
[('Joe', ['knife', 'ice']), ('Phil', ['knife']), ('Neffe', ['ice'])]
The reason the other way round is better: you are looking for the clues - which should be the keys.
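If the data starts out keyed by person, as in the question, it can be turned around without any modules. The sketch below assumes an archive that maps each person to a plain list of words, which is a simplification of the question's structure; the names invert_archive and unique_suspects are hypothetical.

```python
def invert_archive(archive):
    """Map each word to the set of people who used it."""
    by_word = {}
    for person, words in archive.items():
        for word in words:
            by_word.setdefault(word, set()).add(person)
    return by_word

def unique_suspects(by_word, clues):
    """Return the people who are the only user of at least one clue word."""
    suspects = set()
    for clue in clues:
        people = by_word.get(clue, set())
        if len(people) == 1:  # the clue points at exactly one person
            suspects |= people
    return suspects

archive = {"martin": ["knife", "broom"], "jose": ["knife"]}
by_word = invert_archive(archive)
print(unique_suspects(by_word, ["broom"]))  # {'martin'}
print(unique_suspects(by_word, ["knife"]))  # set()
```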
Your logic is mixed up with the parsing which is not a very good thing. If you separate them things are much easier to understand.
from typing import List, Dict
def get_words(sentence: str) -> List:
    return [word[1:] for word in sentence.split() if word.startswith('*')]
def who_did_it(archive: Dict[str, List[str]], clues: List[str]) -> str:
    suspect = []
    #a list for the most likely person that did it
    for name, belongings in archive.items():
        if all(clue in belongings for clue in clues):
            suspect.append(name)
    if len(suspect) == 1:
        print(suspect[0])
    else:
        print('need more evidence')
facts = {
    'martin': ('I had a knife and a *broom', 22),
    'jose': ('I had a *knife', 21),
}
archive = { name : get_words(fact[0]) for name, fact in facts.items()}
who_did_it(archive, get_words('he had a *knife'))

Sorting Dictionary in python by making a sorted list of tuples doesn't work

I've been using the solution provided in Sorting_Dictionary to sort a dictionary according to values. I know dictionaries cannot be sorted as such, but a sorted list of tuples can be obtained.
Complete code:
import sys
import pprint
def helper(filename):
    Word_count={}
    f=open(filename)
    for line in f:
        words=line.split()
        for word in words:
            word=word.lower()
            Word_count.setdefault(word,0)
            Word_count[word]+=1
    f.close()
    return Word_count
def print_words(filename):
    Word_count_new=helper(filename)
    sorted_count=sorted(Word_count_new.items(),key=Word_count_new.get,reverse=True)
    for word in sorted_count:
        pprint.pprint(word)
def print_top(filename):
    word_list=[]
    Word_count=helper(filename)
    word_list=[(k,v) for k,v in Word_count.items()]
    for i in range(20):
        print word_list[i] + '\n'
###
# This basic command line argument parsing code is provided and
# calls the print_words() and print_top() functions which you must define.
def main():
    if len(sys.argv) != 3:
        print 'usage: ./wordcount.py {--count | --topcount} file'
        sys.exit(1)
    option = sys.argv[1]
    filename = sys.argv[2]
    if option == '--count':
        print_words(filename)
    elif option == '--topcount':
        print_top(filename)
    else:
        print 'unknown option: ' + option
        sys.exit(1)
if __name__ == '__main__':
    main()
This function is the one that causes the problem:
def print_words(filename):
    Word_count_new=helper(filename)
    sorted_count=sorted(Word_count_new.items(),key=Word_count_new.get,reverse=True)
    for word in sorted_count:
        pprint.pprint(word)
Here helper is a function that returns the dictionary to be sorted. The dictionary looks like this: {'Dad': 1, 'Mom': 2, 'baby': 3}
But this doesn't produce a sorted list of tuples. Instead the output is somewhat random, like this:
('he', 111)
("hot-tempered,'", 1)
('made', 29)
('wise', 2)
('whether', 11)
('wish', 21)
('scroll', 1)
('eyes;', 1)
('this,', 17)
('signed', 2)
('this.', 1)
How can we explain this behaviour?
sorted_count = sorted(Word_count_new.items(), key=lambda x: x[1], reverse=True)
According to the documentation for sorted (https://docs.python.org/3/library/functions.html#sorted), the key argument is a function that creates a comparison key from each list element, so it is called on each element, not on the dict as a whole.
Word_count_new.items() returns an iterable of tuples (a view in Python 3, a list in Python 2), and each tuple is what's passed to your key function. If you want your comparison key to be based on the word frequency (the second element), return the second element in this function (x[1], where x is the individual tuple being compared).
To also explain the random output you got: your key was Word_count_new.get. Since your dict does not have tuples as keys, .get returns its default value, None, for every item. All items then compare as equal, so in Python 2 the sort leaves them in the dict's arbitrary order.
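A short sketch of the difference, using a made-up three-word dictionary. operator.itemgetter(1) is shown as an equivalent alternative to the lambda; it is not part of the original answer.

```python
from operator import itemgetter

word_count = {"dad": 1, "mom": 2, "baby": 3}

# The key function receives each (word, count) tuple, so pick out the count.
by_lambda = sorted(word_count.items(), key=lambda item: item[1], reverse=True)
by_itemgetter = sorted(word_count.items(), key=itemgetter(1), reverse=True)
print(by_lambda)      # [('baby', 3), ('mom', 2), ('dad', 1)]
print(by_itemgetter)  # [('baby', 3), ('mom', 2), ('dad', 1)]

# Looking up a whole tuple in the dict finds nothing, which is why
# key=word_count.get produced None for every item.
print(word_count.get(("dad", 1)))  # None
```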
