Disclaimer: I am just getting started learning Python
I have a function that counts the number of times a word appears in a text file and sets the word as the key and the count as the value, and stores it in a dictionary "book_index". Here is my code:
alice = open('location of the file', 'r', encoding = "cp1252")
def book_index(alice):
    """Alice is a file reference"""
    """Alice is opened, nothing else is done"""
    worddict = {}
    line = 0
    for ln in alice:
        words = ln.split()
        for wd in words:
            if wd not in worddict:
                worddict[wd] = 1 #if wd is not in worddict, increase the count for that word to 1
            else:
                worddict[wd] = worddict[wd] + 1 #if wd IS in worddict, increase the count for that word BY 1
        line = line + 1
    return(worddict)
I need to turn that dictionary "inside out" and use the count as the key, and any word that appears x amount of times as the value. For instance: [2, 'hello', 'hi'] where 'hello' and 'hi' appear twice in the text file.
Do I need to loop through my existing dictionary or loop through the text file again?
Because a dictionary maps keys to values, you cannot look up entries by value efficiently. You have to loop through all items in the dictionary to find the keys whose value equals the one you are looking for.
This will print out all keys in the dictionary d where the value is equal to searchValue:
for k, v in d.items():
    if v == searchValue:
        print(k)
Regarding your book_index function, note that you can use Counter from the standard library's collections module for counting things. Counter is essentially a dictionary that works with counts as its values and automatically takes care of nonexistent keys. Using a Counter, your code would look like this:
from collections import Counter
def book_index(alice):
    worddict = Counter()
    for ln in alice:
        worddict.update(ln.split())
    return worddict
Or, as roippi suggested as a comment to another answer, just worddict = Counter(word for line in alice for word in line.split()).
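A minimal sketch of that one-liner, runnable without a file on disk. The io.StringIO object is only a stand-in for the opened book (an assumption for illustration); book_index just needs an iterable of lines.

```python
import io
from collections import Counter

def book_index(lines):
    """Count whitespace-separated words in an iterable of lines."""
    return Counter(word for line in lines for word in line.split())

# StringIO is a hypothetical stand-in for the real file object.
sample = io.StringIO("hello hi\nhello hi there\n")
counts = book_index(sample)
print(counts["hello"], counts["there"], counts["missing"])  # 2 1 0
```

Looking up a word that never appeared returns 0 instead of raising KeyError, which is what "takes care of nonexistent keys" means in practice.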
Personally I would suggest the use of a Counter object here, which is specifically made for this kind of application. For instance:
from collections import Counter
counter = Counter()
for ln in alice:
    counter.update(ln.split())
This will give you the relevant dictionary, and as the Counter docs describe, you can then retrieve the most common results with most_common().
This might not work in every case in your proposed problem, but it's slightly nicer than manually iterating through even the first time around.
If you really want to "flip" this dictionary you could do something along these lines:
matching_values = lambda value: (word for word, freq in worddict.items() if freq == value)
{value: matching_values(value) for value in set(worddict.values())}
The above solution has some advantages over other solutions: because the generators are lazy, no words are collected until you actually consume the generator for a given count. For sparse use, where you only look up a few counts or just want to discover which counts have corresponding entries, this can be faster.
That said, this solution will usually be worse than the vanilla iteration solution, since it scans the whole dictionary again every time you ask for a new count.
Not radically different, but I didn't want to just copy the other answers here.
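To make the trade-off concrete, here is a small sketch comparing the lazy lookup with building the whole inverted mapping once. The sample worddict and the helper names words_with_count and invert are made up for illustration.

```python
from collections import defaultdict

worddict = {"hello": 2, "hi": 2, "world": 4}

def words_with_count(counts, target):
    """Lazily yield the words whose count equals target (one full scan per call)."""
    return (word for word, freq in counts.items() if freq == target)

def invert(counts):
    """Build the whole count -> words mapping in a single pass."""
    inverted = defaultdict(list)
    for word, freq in counts.items():
        inverted[freq].append(word)
    return dict(inverted)

lazy_twos = sorted(words_with_count(worddict, 2))
by_count = invert(worddict)
print(lazy_twos)  # ['hello', 'hi']
print(by_count)   # {2: ['hello', 'hi'], 4: ['world']}
```

If you expect to ask about many different counts, the single-pass invert() is likely the better fit; the lazy version only pays off when few lookups are made.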
Loop through your existing dictionary, here is an example using dict.setdefault():
countdict = {}
for k, v in worddict.items():
    countdict.setdefault(v, []).append(k)
Or with collections.defaultdict:
import collections
countdict = collections.defaultdict(list)
for k, v in worddict.items():
    countdict[v].append(k)
Personally I prefer the setdefault() method because the result is a regular dictionary.
Example:
>>> worddict = {"hello": 2, "hi": 2, "world": 4}
>>> countdict = {}
>>> for k, v in worddict.items():
...     countdict.setdefault(v, []).append(k)
...
>>> countdict
{2: ['hi', 'hello'], 4: ['world']}
As noted in some of the other answers, you can significantly shorten your book_index function by using collections.Counter.
Without duplicates:
word_by_count_dict = {value: key for key, value in worddict.items()}
See PEP 274 to understand dictionary comprehension with Python: http://www.python.org/dev/peps/pep-0274/
With duplicates:
import collections
words_by_count_dict = collections.defaultdict(list)
for key, value in worddict.items():
    words_by_count_dict[value].append(key)
This way:
words_by_count_dict[2] = ["hello", "hi"]
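A quick sketch of why the plain dictionary comprehension only suits the "without duplicates" case: when two words share a count, the later one silently overwrites the earlier one. The sample worddict is made up for illustration.

```python
worddict = {"hello": 2, "hi": 2, "world": 4}

# Flipping keys and values keeps only one word per count.
flipped = {count: word for word, count in worddict.items()}
print(flipped)  # {2: 'hi', 4: 'world'}  ('hello' was overwritten)
```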
Related
I know how to write something simple and slow with a loop, but I need it to run very fast at a large scale.
input:
lst = [[1, 1, 2], ["txt1", "txt2", "txt3"]]
desired output:
d = {1 : ["txt1", "txt2"], 2 : "txt3"}
Is there something built into Python that makes dict() extend the value for an existing key instead of replacing it?
dict(list(zip(lst[0], lst[1])))
One option is to use dict.setdefault:
out = {}
for k, v in zip(*lst):
    out.setdefault(k, []).append(v)
Output:
{1: ['txt1', 'txt2'], 2: ['txt3']}
If you want the element itself for singleton lists, one way is adding a condition that checks for it while you build an output dictionary:
out = {}
for k,v in zip(*lst):
    if k in out:
        if isinstance(out[k], list):
            out[k].append(v)
        else:
            out[k] = [out[k], v]
    else:
        out[k] = v
or if lst[0] is sorted (like it is in your sample), you could use itertools.groupby:
from itertools import groupby
out = {}
pos = 0
for k, v in groupby(lst[0]):
    length = len([*v])
    if length > 1:
        out[k] = lst[1][pos:pos+length]
    else:
        out[k] = lst[1][pos]
    pos += length
Output:
{1: ['txt1', 'txt2'], 2: 'txt3'}
But as @timgeb notes, it's probably not something you want, because afterwards you'll have to check the data type each time you access this dictionary (whether the value is a list or not), which is an unnecessary problem that you could avoid by having all values as lists.
If you're dealing with large datasets it may be useful to add a pandas solution.
>>> import pandas as pd
>>> lst = [[1, 1, 2], ["txt1", "txt2", "txt3"]]
>>> s = pd.Series(lst[1], index=lst[0])
>>> s
1 txt1
1 txt2
2 txt3
>>> s.groupby(level=0).apply(list).to_dict()
{1: ['txt1', 'txt2'], 2: ['txt3']}
Note that this also produces lists for single elements (e.g. ['txt3']) which I highly recommend. Having both lists and strings as possible values will result in bugs because both of those types are iterable. You'd need to remember to check the type each time you process a dict-value.
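A small sketch of the kind of bug meant here. The two dictionaries and the count_items helper are made up for illustration; the point is that iterating a bare string silently yields its characters.

```python
mixed = {1: ["txt1", "txt2"], 2: "txt3"}      # singleton stored as a bare string
uniform = {1: ["txt1", "txt2"], 2: ["txt3"]}  # every value is a list

def count_items(groups):
    """Count the grouped items by iterating each value."""
    return sum(1 for values in groups.values() for _ in values)

print(count_items(mixed))    # 6: "txt3" is iterated character by character
print(count_items(uniform))  # 3
```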
You can use a defaultdict to group the strings by their corresponding key, then make a second pass through the resulting dictionary to extract the strings from singleton lists. Regardless of what you do, you'll need to access every element in both lists at least once, so some iteration structure is necessary (and even if you don't explicitly use iteration, whatever you use will almost definitely use iteration under the hood):
from collections import defaultdict
lst = [[1, 1, 2], ["txt1", "txt2", "txt3"]]
result = defaultdict(list)
for key, value in zip(lst[0], lst[1]):
    result[key].append(value)
for key in result:
    if len(result[key]) == 1:
        result[key] = result[key][0]
print(dict(result)) # Prints {1: ['txt1', 'txt2'], 2: 'txt3'}
I would like my code to print all occurrences of words that appear once. I have produced this code below, however, the result only shows the first instance of the word occurring once but not all of them. I am not sure where the problem is and how to fix it. Any help would be greatly appreciated.
import re
from collections import Counter
def inspiration(text):
    quote = re.findall(r'\w+', text.lower())
    quote_dict = dict(Counter(quote).most_common())
    quote_one = {}
    for key, value in quote_dict.items():
        if value == 1:
            quote_one[key]= value
            return quote_one
print(inspiration("We know what we are, but know not what we may be- William Shakespeare"))
Expected output: {"are":1, "but":1, "not":1, "be":1, "William":1, "Shakespeare":1}
Try this :
import re
from collections import Counter
def inspiration(text):
    lst = re.findall(r'\w+', text.lower())
    return {k: v for k, v in Counter(lst).items() if v == 1}
print(inspiration("We know what we are, but know not what we may be- William Shakespeare"))
output :
{'are': 1, 'but': 1, 'not': 1, 'may': 1, 'be': 1, 'william': 1, 'shakespeare': 1}
Note :
1- There is no need for .most_common()
2- Counter is a subclass of dict, so it has .items(). There is no need to convert it to a dictionary.
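Both notes can be checked with a short sketch. The sample sentence is made up for illustration, and the printed order assumes Python 3.7+, where Counter keeps first-seen order.

```python
from collections import Counter

counts = Counter("we know what we are".split())

print(isinstance(counts, dict))  # True: Counter is a dict subclass
print(counts.most_common(1))     # [('we', 2)]

# .items() works directly on the Counter, no dict() conversion needed.
once = [word for word, n in counts.items() if n == 1]
print(once)                      # ['know', 'what', 'are']
```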
Your return is inside the for block, so the function returns before the loop has finished; you need to return after iterating over all pairs. Also, you don't need the value since it's always 1, so just return a list of words:
def inspiration(text):
    quote = re.findall(r'\w+', text.lower())
    quote_dict = dict(Counter(quote).most_common())
    quote_one = []
    for key, value in quote_dict.items():
        if value == 1:
            quote_one.append(key)
    return quote_one
Improvements
You don't need most_common, because you don't need an ordered iteration.
So there is no need to wrap the result back into a dict; you could iterate over the pairs directly.
Combined with a list comprehension:
def inspiration(text):
    return [k for k, v in Counter(re.findall(r'\w+', text.lower())).items() if v == 1]
quote_one[key]= value should be indented to the right because it belongs under the if statement.
return ends the function, so inside the loop it ends the function before every pair has been checked.
Here is the correct code
def inspiration(text):
    quote = re.findall(r'\w+', text.lower())
    quote_dict = dict(Counter(quote).most_common())
    quote_one = {}
    for key, value in quote_dict.items():
        if value == 1:
            quote_one[key]= value
    return quote_one
I have a list called data and a dict object called word_count. Before converting the frequencies into unique integers, I want to return a dict object word_count (expected format: {'marjori': 1, 'splendid': 1, ...}) and then sort it by frequency.
data = [['marjori',
'splendid'],
['rivet',
'perform',
'farrah',
'fawcett']]
def build_dict(data, vocab_size = 5000):
    word_count = {}
    for w in data:
        word_count.append(data.count(w)) ????
    #print(word_count)
    # how can I sort the words to make sorted_words[0] is the most frequently appearing word and sorted_words[-1] is the least frequently appearing word.
    sorted_words = ??
I'm new to Python, can someone help me, thanks in advance. (I only want to use numpy library and for loop.)
For each word, you need to create a dict entry if it doesn't exist yet, or add 1 to its value if it does exist:
word_count = dict()
for w in data:
    if word_count.get(w) is not None:
        word_count[w] += 1
    else:
        word_count[w] = 1
Then you can sort your dictionary by value:
word_count = {k: v for k, v in sorted(word_count.items(), key=lambda item: item[1], reverse=True)}
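One thing to watch: the data in the question is a list of lists, so looping over data directly yields lists, which are unhashable and cannot be used as dictionary keys. A minimal sketch with plain for loops that handles the nesting follows; the second "farrah" is added here purely so the counts differ.

```python
# Nested sample data; the extra "farrah" is added for illustration only.
data = [["marjori", "splendid"],
        ["rivet", "perform", "farrah", "fawcett", "farrah"]]

word_count = {}
for review in data:  # each element of data is itself a list of words
    for w in review:
        word_count[w] = word_count.get(w, 0) + 1

# Words ordered from most to least frequent.
sorted_words = sorted(word_count, key=word_count.get, reverse=True)
print(sorted_words[0])  # farrah
```

Here sorted_words is a list, so sorted_words[0] and sorted_words[-1] can be indexed the way the question asks.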
The last part of your code is unclear, but if you only want to count the words, store the counts in a dictionary, and sort it by frequency in descending order, I would suggest using defaultdict and implementing it like this:
data = ['marjori',
'splendid',
'rivet',
'farrah',
'perform',
'farrah',
'fawcett']
from collections import defaultdict
def build_dict(data, vocab_size = 5000):
    """Construct and return a dictionary mapping each of the most frequently appearing words to a unique integer."""
    word_count = defaultdict(int) # A dict storing the words that appear in the reviews along with how often they occur
    for w in data:
        word_count[w] += 1
    #print(word_count)
    # Sort so that the most frequently appearing word comes first and the least frequently appearing word comes last.
    sorted_words = {k: v for k, v in sorted(word_count.items(), key=lambda item: item[1], reverse=True)}
    return sorted_words
build_dict(data)
Output:
{'farrah': 2,
 'marjori': 1,
 'splendid': 1,
 'rivet': 1,
 'perform': 1,
 'fawcett': 1}
I created a dictionary of the alphabet in which each value starts at 0 and is increased by a certain amount depending on the word file. I hard-coded the initial dictionary and wanted it to stay in alphabetical order, but it does not at all. I want the function to return the dictionary in alphabetical order, basically keeping the same order as the initial dictionary.
How can I keep it in order?
from wordData import *
def letterFreq(words):
    totalLetters = 0
    letterDict = {'a':0,'b':0,'c':0,'d':0,'e':0,'f':0,'g':0,'h':0,'i':0,'j':0,'k':0,'l':0,'m':0,'n':0,'o':0,'p':0,'q':0,
                  'r':0,'s':0,'t':0,'u':0,'v':0,'w':0,'x':0,'y':0,'z':0}
    for word in words:
        totalLetters += totalOccurences(word,words)*len(word)
        for char in range(0,len(word)):
            for letter in letterDict:
                if letter == word[char]:
                    for year in words[word]:
                        letterDict[letter] += year.count
    for letters in letterDict:
        letterDict[letters] = float(letterDict[letters] / totalLetters)
    print(letterDict)
    return letterDict
def main():
    filename = input("Enter filename: ")
    words = readWordFile(filename)
    letterFreq(words)
if __name__ == '__main__':
    main()
Update for Python 3.7+:
Dictionaries now officially maintain insertion order for Python 3.7 and above.
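A minimal sketch of what that guarantee means for the alphabet dictionary, assuming Python 3.7 or later. The name letter_dict is illustrative, not taken from the question.

```python
import string

# Keys are inserted in alphabetical order, so they iterate in that order.
letter_dict = {letter: 0 for letter in string.ascii_lowercase}
letter_dict["e"] += 5  # updating a value does not move the key

print(list(letter_dict)[:5])                     # ['a', 'b', 'c', 'd', 'e']
print(list(letter_dict) == sorted(letter_dict))  # True
```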
Update for Python 3.6:
Dictionaries maintain insertion order in CPython 3.6; however, this is considered an implementation detail and should not be relied upon.
Original answer - up to and including Python 3.5:
Dictionaries are not ordered and don't keep any order for you.
You could use an ordered dictionary, which maintains insertion order:
from collections import OrderedDict
letterDict = OrderedDict([('a', 0), ('b', 0), ('c', 0)])
Or you could just return a sorted list of your dictionary contents:
letterDict = {'a':0,'b':0,'c':0}
sortedList = sorted(letterDict.items())
print(sortedList) # [('a', 0), ('b', 0), ('c', 0)]
You only need the keys in order once, so:
# create letterDict as in your question
keys = list(letterDict)
keys.sort()
for key in keys:
    # do whatever with letterDict[key]
    print(key, letterDict[key])
If you needed them in order more than once, you could use the standard library's collections.OrderedDict. Sometimes that's all you need. It preserves dictionary key order by order of addition.
If you truly need an ordered-by-keys dictionary type, and you don't need it just once (where list_.sort() is better), you could try one of these:
http://stromberg.dnsalias.org/~dstromberg/datastructures/
With regard to the above link, if your keys are getting added in an already-sorted order, you're probably best off with a treap or red-black tree (a treap is better on average, but red-black trees have a lower standard deviation). If your keys are (always) getting added in a randomized order, then the simple binary tree is better.
BTW, current fashion seems to favor sorted(list_) over list_.sort(), but sorted(list_) is a relatively recent addition to the language that we got along fine without before it was added, and it's a little slower. Also, list_.sort() doesn't give rise to one-liner-abuse the way sorted(list_) does.
Oh, and vanilla dictionaries are not sorted by key; that's why they're fast for accessing arbitrary elements (they're built on a hash table). Some of the types at the datastructures URL I gave above are good at dict_.find_min() and dict_.find_max() and obviate keys.sort(), but they're slower (O(log n)) at accessing arbitrary elements.
You can sort your dictionary's keys and iterate over your dict.
>>> for key in sorted(letterDict.keys()):
...     print('{}: {}'.format(key, letterDict.get(key)))
...
a: 0
b: 0
c: 0
d: 0
e: 0
...
OR
This can be a possible solution in your case. We can keep all of your dictionary's keys in a list whose order doesn't change, and then read the values from your dictionary in that order.
>>> import string
>>> keys = list(string.ascii_lowercase)
>>> letterDict = {'a':0,'b':0,'c':0,'d':0,'e':0,'f':0,'g':0,'h':0,'i':0,'j':0,'k':0,'l':0,'m':0,'n':0,'o':0,'p':0,'q':0,
... 'r':0,'s':0,'t':0,'u':0,'v':0,'w':0,'x':0,'y':0,'z':0}
>>> for key in keys:
...     if key in letterDict:
...         print('{}: {}'.format(key, letterDict.get(key)))
...
a: 0
b: 0
c: 0
d: 0
e: 0
f: 0
g: 0
h: 0
i: 0
j: 0
k: 0
l: 0
m: 0
....
I wouldn't implement it that way. It's pretty hard to read. Something more like this:
# Make sure that division always gives you a float
from __future__ import division
from collections import defaultdict, OrderedDict
from string import ascii_lowercase
...
letterDict = defaultdict(int)
...
# Replace the for char in range(0,len(word)): loop with this
# Shorter, easier to understand, should be equivalent
for year in words[word]:
    for char in word:
        letterDict[char] += year.count
...
# Filter out any non-letters at this point
# Note that this is the OrderedDict constructor given a generator that creates tuples
# Already in order since ascii_lowercase is
letterRatio = OrderedDict((letter, letterDict[letter] / totalLetters) for letter in ascii_lowercase)
print(letterRatio)
return letterRatio
...
Now that you're returning an OrderedDict, the order will be preserved. I do caution you, though. If you really need it to be in order at some point, I would just sort it when you need it in the right order. Don't depend on functions that compute new data to return things in a specific sort order. Sort it when you need it sorted, and not before.
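As a rough sketch of "sort it when you need it sorted": the function below returns a plain dict and the caller sorts at the point of use. The input shape (a word to total-occurrences mapping) and the name letter_ratios are assumptions for illustration, since the structure returned by readWordFile() is not shown in the question.

```python
from collections import Counter
from string import ascii_lowercase

def letter_ratios(word_totals):
    """Return each letter's share of all letters, given a word -> occurrences mapping."""
    letter_counts = Counter()
    for word, occurrences in word_totals.items():
        for char in word:
            letter_counts[char] += occurrences
    total = sum(letter_counts[c] for c in ascii_lowercase)
    if total == 0:
        return {c: 0.0 for c in ascii_lowercase}
    return {c: letter_counts[c] / total for c in ascii_lowercase}

# Hypothetical input: "abba" seen twice, "cab" seen once.
ratios = letter_ratios({"abba": 2, "cab": 1})
for letter in sorted(ratios):  # sort where the order matters, not inside the function
    if ratios[letter]:
        print(letter, round(ratios[letter], 3))
```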
I am trying to iterate over an ordered dictionary in last-in, first-out order.
While everything works fine for a standard dictionary, the first solution for the OrderedDict behaves strangely. It seems that popitem() returns one key/value pair (but somehow sequentially, since I can't replace kv_pair with two variables), and the iteration is then finished. I see no easy way to proceed to the next key/value pair.
While I found two working alternatives (shown below), both of them lack the elegance of the normal dictionary approach.
From what I found in the online help, it is impossible to decide, but I assume I have the wrong expectations. Is there a more elegant approach?
from collections import OrderedDict
normaldict = {"0": "a0.csf", "1":"b1.csf", "2":"c2.csf"}
for k, v in normaldict.iteritems():
    print k,":",v
d = OrderedDict()
d["0"] = "a0.csf"
d["1"] = "b1.csf"
d["2"] = "c2.csf"
print d, "****"
for kv_pair in d.popitem():
    print kv_pair
print "++++"
for k in reversed(d.keys()):
    print k, d[k]
print "%%%%"
while len(d) > 0:
    k, v = d.popitem()
    print k, v
dict.popitem() is not the same thing as dict.iteritems(); it removes one pair from the dictionary as a tuple, and you are looping over that pair.
The most efficient method is to use a while loop instead. There is no need to call len(); just test the dictionary itself, because an empty dictionary is considered false:
while d:
    key, value = d.popitem()
    print key, value
The alternative is to use reversed():
for key, value in reversed(d.items()):
    print key, value
but that requires the whole dictionary to be copied into a list first.
However, if you were looking for a LIFO queue (a stack), use collections.deque() instead:
from collections import deque
d = deque(["a0.csf", "b1.csf", "c2.csf"])
while d:
    item = d.pop()
or use deque.reverse().
d.popitem() returns only one tuple (k, v), so your for loop iterates over the two elements of that single pair and then ends.
You can try:
while d:
    k, v = d.popitem()
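The code in this thread is Python 2. For reference, a Python 3 sketch of both approaches follows, assuming Python 3.5 or later, where the views of an OrderedDict support reversed() directly. The names newest_first and popped are illustrative.

```python
from collections import OrderedDict

d = OrderedDict([("0", "a0.csf"), ("1", "b1.csf"), ("2", "c2.csf")])

# Non-destructive: walk the items newest-first.
newest_first = [key for key, value in reversed(d.items())]
print(newest_first)  # ['2', '1', '0']

# Destructive: popitem() removes and returns the most recently added pair.
popped = []
while d:
    popped.append(d.popitem())
print(popped[0])  # ('2', 'c2.csf')
print(len(d))     # 0
```

The reversed() form leaves the dictionary intact, while the popitem() loop empties it, so which one fits depends on whether the entries are still needed afterwards.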