Iteratively generate keys and count number of words - python

Is there a way to iterate through a dictionary, count the words in the string stored under each key, and save the result as a new dictionary that gives the count of each word for that key?
For example...
#Input:
inputdict = {
'key1': 'The brown fox is brown and a fox.',
'key2': 'The red dog is the red and is a dog.'
}
newdict = {}
for k, v in inputdict:
    newdict(str(k) + "_" + str(v)) = count(v)
#Output:
newdict = {
'key1_the': 1, 'key1_brown': 2, 'key1_is': 1, # ...
'key2_the': 2, 'key2_red': 2, # ...
}
Side Note:
This is kind of a follow-up to an article at https://janav.wordpress.com/2013/10/27/tf-idf-and-cosine-similarity/, except that instead of reading in strings I'm trying to read in items from a dictionary.

Yes, just use the collections.Counter() class with a generator expression producing the values you want to count:
from collections import Counter
Counter(f"{k}_{word}" for k, v in inputdict.items() for word in v.split())
I've assumed a simple split on whitespace is sufficient, but if you need something more sophisticated then replace v.split() with something that produces an iterable of words to count.
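As a rough sketch of what "something more sophisticated" could look like: the question's expected output uses lowercase words without trailing punctuation, so one option is a small tokenizer built on re.findall(). The tokenize() helper and its pattern are assumptions made for illustration, not part of the original answer.

```python
import re
from collections import Counter

inputdict = {
    'key1': 'The brown fox is brown and a fox.',
    'key2': 'The red dog is the red and is a dog.',
}

def tokenize(text):
    # Hypothetical helper: lowercase the text, then keep runs of
    # letters, digits and apostrophes, which drops the trailing periods.
    return re.findall(r"[a-z0-9']+", text.lower())

newdict = Counter(
    f"{k}_{word}" for k, v in inputdict.items() for word in tokenize(v)
)

print(newdict['key1_fox'])  # 'fox' and 'fox.' are now counted together
print(newdict['key2_the'])  # 'The' and 'the' are now counted together
```

With this tokenizer 'The' and 'the' collapse into one key, which may or may not be what you want for a TF-IDF calculation.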
Counter() is a subclass of dict, but with a few extra methods to help handle counts.
Demo:
>>> from collections import Counter
>>> inputdict = {
... 'key1': 'The brown fox is brown and a fox.',
... 'key2': 'The red dog is the red and is a dog.'
... }
>>> Counter(f"{k}_{word}" for k, v in inputdict.items() for word in v.split())
Counter({'key1_brown': 2, 'key2_red': 2, 'key2_is': 2, 'key1_The': 1, 'key1_fox': 1, 'key1_is': 1, 'key1_and': 1, 'key1_a': 1, 'key1_fox.': 1, 'key2_The': 1, 'key2_dog': 1, 'key2_the': 1, 'key2_and': 1, 'key2_a': 1, 'key2_dog.': 1})
Personally, I'd produce separate counts, using a nested dictionary structure:
{key: Counter(value.split()) for key, value in inputdict.items()}
and so produce:
{'key1': Counter({'brown': 2, 'The': 1, 'fox': 1, ... }),
'key2': Counter({'red': 2, 'is': 2, 'The': 1, ... })}
so you can access counts per sentence, with newdict["key1"] and newdict["key2"].
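A short usage sketch of the nested structure, assuming the same inputdict as in the question. Looking up a word that never occurred returns 0 rather than raising KeyError, because the inner values are Counter objects.

```python
from collections import Counter

inputdict = {
    'key1': 'The brown fox is brown and a fox.',
    'key2': 'The red dog is the red and is a dog.',
}

# One Counter per key, using the same whitespace split as above.
newdict = {key: Counter(value.split()) for key, value in inputdict.items()}

print(newdict["key1"]["brown"])         # count of 'brown' in the first sentence
print(newdict["key1"]["cat"])           # missing words count as 0
print(newdict["key1"].most_common(1))   # the most frequent word for key1
```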

Related

How to loop through dictionary to get both frequency of words and symbols?

I have set up a function that counts how many times each word appears in a text file, but the frequency is wrong for a couple of words because the function does not separate words from symbols, as in "happy,".
I have already tried using the split function to split on every "," and every ".", but that does not work. I am also not allowed to import anything into the function, as the professor does not want us to.
The code below turns the text file into a dictionary, using the word or symbol as the key and the frequency as the value.
def getTokensFreq(file):
    dict = {}
    with open(file, 'r') as text:
        wholetext = text.read().split()
        for word in wholetext:
            if word in dict:
                dict[word] += 1
            else:
                dict[word] = 1
    return dict
We are using a text file named "f". This is what is inside the file.
I felt happy because I saw the others were happy and because I knew I should feel happy, but I was not really happy.
The desired result is this, where both words and symbols are counted.
{'i': 5, 'felt': 1, 'happy': 4, 'because': 2, 'saw': 1,
'the': 1, 'others': 1, 'were': 1, 'and': 1, 'knew': 1, 'should': 1,
'feel': 1, ',': 1, 'but': 1, 'was': 1, 'not': 1, 'really': 1, '.': 1}
This is what I am getting, where some words with attached symbols are counted as separate words:
{'I': 5, 'felt': 1, 'happy': 2, 'because': 2, 'saw': 1, 'the': 1, 'others': 1, 'were': 1, 'and': 1, 'knew': 1, 'should': 1, 'feel': 1, 'happy,': 1, 'but': 1, 'was': 1, 'not': 1, 'really': 1, 'happy.': 1}
This is how to generate your desired frequency dictionary for one sentence. To do this for the whole file, just run this code on each line to update the contents of your dictionary.
# init vars
f = "I felt happy because I saw the others were happy and because I knew I should feel happy, but I was not really happy."
d = {}
# count punctuation chars
d['.'] = f.count('.')
d[','] = f.count(',')
# remove . and ,
for word in f.replace(',', '').replace('.','').split(' '):
    if word not in d.keys():
        d[word] = 1
    else:
        d[word] += 1
Alternatively, you can use a mix of regex and list comprehensions, like the following:
import re
# filter words and symbols (raw strings so that \s reaches the regex engine unchanged)
words = re.sub(r'[^A-Za-z0-9\s]+', '', f).split(' ')
symbols = re.sub(r'[A-Za-z0-9\s]+', ' ', f).strip().split(' ')
# count occurrences
count_words = dict(zip(set(words), [words.count(w) for w in set(words)]))
count_symbols = dict(zip(set(symbols), [symbols.count(s) for s in set(symbols)]))
# merge both results into one dict
d = count_symbols.copy()
d.update(count_words)
Output:
{',': 1,
'.': 1,
'I': 5,
'and': 1,
'because': 2,
'but': 1,
'feel': 1,
'felt': 1,
'happy': 4,
'knew': 1,
'not': 1,
'others': 1,
'really': 1,
'saw': 1,
'should': 1,
'the': 1,
'was': 1,
'were': 1}
Running both approaches 1,000 times in a loop and recording the run times showed that the second approach is faster than the first.
My solution is to first replace all symbols with a space and then split on spaces. We will need a little help from regular expressions.
import re
a = 'I felt happy because I saw the others were happy and because I knew I should feel happy, but I was not really happy.'
b = re.sub('[^A-Za-z0-9]+', ' ', a)
print(b)
wholetext = b.split(' ')
print(wholetext)
My solution is similar to Verse's, but it also builds a list of the symbols in the sentence. Afterwards, you can use the for loop and the dictionary to determine the counts.
import re
a = 'I felt happy because I saw the others were happy and because I knew I should feel happy, but I was not really happy.'
# replace every run of symbols with a space, keeping words and whitespace
b = re.sub(r'[^A-Za-z0-9\s]+', ' ', a)
print(b)
wholetext = b.split(' ')
print(wholetext)
# replace every run of words and whitespace with a space, keeping the symbols
c = re.sub(r'[A-Za-z0-9\s]+', ' ', a)
symbols = c.strip().split(' ')
print(symbols)
# do the for loop stuff you did in your question but with wholetext and symbols
Oh, I missed that you couldn't import anything :(
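Since imports are ruled out, here is a minimal sketch that uses only built-in string methods. It walks the text one character at a time, collects alphanumeric characters into words, and counts every other non-space character as its own symbol. The function name get_tokens_freq and the choice to keep the original letter case are assumptions; call text.lower() first if 'I' should be counted as 'i', as in the desired output.

```python
def get_tokens_freq(text):
    # Hypothetical helper, no imports: counts words and symbols in a string.
    freq = {}
    word = ""
    for ch in text:
        if ch.isalnum():
            word += ch  # still inside a word
        else:
            if word:
                # a word just ended, so record it
                freq[word] = freq.get(word, 0) + 1
                word = ""
            if not ch.isspace():
                # punctuation counts as a token of its own
                freq[ch] = freq.get(ch, 0) + 1
    if word:
        # record a trailing word that is not followed by a symbol or space
        freq[word] = freq.get(word, 0) + 1
    return freq

sentence = ("I felt happy because I saw the others were happy and because "
            "I knew I should feel happy, but I was not really happy.")
counts = get_tokens_freq(sentence)
print(counts['happy'], counts[','], counts['.'])
```

To use it on a file, read the contents with open(file).read() and pass the resulting string to the function.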

How to iterate over a dictionary using for loops?

I want to decompress a dictionary and a list into a sentence. For example:
newlist = [1, 2, 3, 4, 5, 6]
new_dictionary = {'code': 2, 'help': 6, 'broken': 4, 'is': 3, 'please': 5, 'my': 1}
The original sentence is 'My code is broken please help'. The list shows the positions that the words appear within the sentence. The dictionary stores the word and the position that the word associates with.
The goal is to iterate over the dictionary until a value matches the number in the list. Once this happens, the key that corresponds to that value is added to a list. This continues until there are no more numbers in the list. The list is then converted into a string and printed to the user.
I would imagine that something like this would be the solution:
for loop in range(len(newlist)):
    x = 0
    for k,v in new_dictionary.items():
        if numbers[x] == v:
            original_sentence.append(k)
        else:
            x = x + 1
print(original_sentence)
However, the code just prints an empty list. Is there any way of re-wording or re-arranging the for loops so that the code works?
Invert the dictionary and proceed. Try the following code.
>>> d = {'code': 2, 'help': 6, 'broken': 4, 'is': 3, 'please': 5, 'my': 1}
>>> numbers = [1, 2, 3, 4, 5, 6]
>>> d_inv = {v:k for k,v in d.items()}
>>> ' '.join([d_inv[i] for i in numbers])
'my code is broken please help'
I assume you don't want to invert the dictionary, so you can try something like this:
dictionary = {'code': 2, 'help': 6, 'broken': 4, 'is': 3, 'please': 5, 'my': 1}
numbers = [1, 2, 3, 4, 5, 6]
sentence = []
for number in numbers:
    # scan the dictionary for the word stored at this position
    for key in dictionary.keys():
        if dictionary[key] == number:
            sentence.append(key)
            break
print(' '.join(sentence))
Sort the dictionary items by their values.
import operator
new_dictionary = {'code': 2, 'help': 6, 'broken': 4, 'is': 3, 'please': 5, 'my': 1}
sorted_x = sorted(new_dictionary.items(), key=operator.itemgetter(1))
print(' '.join(i[0] for i in sorted_x))
Result:
my code is broken please help
The whole thing on a single line:
In [1]: ' '.join([item[0] for item in sorted(new_dictionary.items(), key=operator.itemgetter(1))])
Out[1]: 'my code is broken please help'

How to compare values inside a dictionary to fill up sets()

dico = {"dico": {1:"bailler",2:"bailler",3:"percer",4:"calculer",5:"calculer",6:"trouer",7:"bailler",8:"découvrir",9:"bailler",10:"miser",11:"trouer",12:"changer"}}
I have a big dictionary of dictionaries like that. I want to group identical elements together in sets, that is, apply some kind of condition that says: if values of "dico" are equal, put their keys in the same set():
b = [{1, 2, 7, 9}, {3}, {4, 5}, {6, 11}, {8}, {10}, {12}]
I don't know whether this question has already been asked, but as a new Python user I don't have all the keys... ^^
Thank you for your answers.
I would reverse your dictionary and have the value a set(), then return all the values.
>>> from collections import defaultdict
>>> my_dict = {"dico": {1:"bailler",2:"bailler",3:"percer",4:"calculer",5:"calculer",6:"trouer",7:"bailler",8:"découvrir",9:"bailler",10:"miser",11:"trouer",12:"changer"}}
>>> my_other_dict = defaultdict(set)
>>> for dict_name, sub_dict in my_dict.items():
...     for k, v in sub_dict.items():
...         # the value, i.e. "bailler", is now the key,
...         # e.g. {"bailler": {1, 2, 9, 7}, ...
...         my_other_dict[v].add(k)
...
>>> list(my_other_dict.values())
[{1, 2, 9, 7}, {3}, {4, 5}, {11, 6}, {8}, {10}, {12}]
Of course as cynddl has pointed out, if your index in a list will always be the "key", simply enumerate a list and you won't have to store original data as a dictionary, nor use sets() as indices are unique.
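A minimal sketch of that list-based variant, assuming positions start at 1 as the dictionary keys do in the question. Because each position occurs only once, plain lists are enough and no set is needed; the names groups and result are illustrative.

```python
from collections import defaultdict

dico = ["bailler", "bailler", "percer", "calculer", "calculer", "trouer",
        "bailler", "découvrir", "bailler", "miser", "trouer", "changer"]

groups = defaultdict(list)
# enumerate() supplies the position, so no explicit keys have to be stored
for position, word in enumerate(dico, start=1):
    groups[word].append(position)

result = list(groups.values())
print(result)
```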
You should write your data this way:
dico = ["bailler", "bailler", "percer", "calculer", "calculer", "trouer", "bailler", "découvrir", "bailler", "miser", "trouer", "changer"]
If you want to count the number of identical elements, use collections.Counter:
import collections
counter=collections.Counter(dico)
print(counter)
which returns a Counter object:
Counter({'bailler': 4, 'calculer': 2, 'trouer': 2, 'percer': 1, 'découvrir': 1, 'miser': 1, 'changer': 1})
The dict.setdefault() method can be handy for tasks like this, as well as dict.items() which iterates through the (key, value) pairs of the dictionary.
>>> dico = {"dico": {1:"bailler",2:"bailler",3:"percer",4:"calculer",5:"calculer",6:"trouer",7:"bailler",8:"découvrir",9:"bailler",10:"miser",11:"trouer",12:"changer"}}
>>> newdict = {}
>>> for k, subdict in dico.items():
...     newdict[k] = {}
...     for subk, subv in subdict.items():
...         newdict[k].setdefault(subv, set()).add(subk)
...
>>> newdict
{'dico': {'bailler': {1, 2, 9, 7}, 'miser': {10}, 'découvrir': {8}, 'calculer': {4, 5}, 'changer': {12}, 'percer': {3}, 'trouer': {11, 6}}}
>>> newdict['dico'].values()
dict_values([{1, 2, 9, 7}, {10}, {8}, {4, 5}, {12}, {3}, {11, 6}])

Python: Using dict as accumulator

I am trying to get the counts of each word in a text file with the below code.
def count_words(file_name):
    with open(file_name, 'r') as f: return reduce(lambda acc, x: acc.get(x, 0) + 1, sum([line.split() for line in f], []), dict())
but I get the error
  File "C:\Python27\abc.py", line 173, in count_words
    with open(file_name, 'r') as f: return reduce(lambda acc, x: acc.get(x, 0) + 1, sum([line.split() for line in f], []), dict())
  File "C:\Python27\abc.py", line 173, in <lambda>
    with open(file_name, 'r') as f: return reduce(lambda acc, x: acc.get(x, 0) + 1, sum([line.split() for line in f], []), dict())
AttributeError: 'int' object has no attribute 'get'
I am not able to understand the error message here. Why does it complain that 'int' has no attribute even when I passed a dict as accumulator?
You can use collections.Counter to count the words:
In [692]: t='I am trying to get the counts of each word in a text file with the below code'
In [693]: from collections import Counter
In [694]: Counter(t.split())
Out[694]: Counter({'the': 2, 'a': 1, 'code': 1, 'word': 1, 'get': 1, 'I': 1, 'of': 1, 'in': 1, 'am': 1, 'to': 1, 'below': 1, 'text': 1, 'file': 1, 'each': 1, 'trying': 1, 'with': 1, 'counts': 1})
In [695]: c=Counter(t.split())
In [696]: c['the']
Out[696]: 2
The problem is that your lambda function returns an int, not a dict.
So, even if you use a dict as seed, when your lambda function is called the second time, acc will be the result of acc.get(x, 0) + 1 from the first call, and it's an int and not a dict.
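To make the point concrete, here is a sketch of a reduce() call whose step function updates the dict and then returns the dict itself, so the accumulator stays a dict on every call. It assumes Python 3 (where reduce lives in functools) and takes any iterable of lines instead of a file name so that it runs on its own; the names step and words are illustrative.

```python
from functools import reduce

def count_words(lines):
    def step(acc, word):
        acc[word] = acc.get(word, 0) + 1
        return acc  # return the accumulator dict, not the new count

    # flatten the lines into one stream of words
    words = (word for line in lines for word in line.split())
    return reduce(step, words, {})

print(count_words(["One flew over the ocean", "One flew over the sea"]))
```

An open file object is also an iterable of lines, so count_words(f) should work inside a with open(...) as f: block.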
So if you are looking for a one-liner, I almost have a one-liner in the spirit of what you were trying to do with get.
>>> words = """One flew over the ocean
... One flew over the sea
... My Bonnie loves pizza
... but she doesn't love me"""
>>>
>>> f = open('foo.txt', 'w')
>>> f.writelines(words)
>>> f.close()
The "one-liner" (two-liner actually)
>>> word_count = {}
>>> with open('foo.txt', 'r') as f:
...     _ = [word_count.update({word: word_count.get(word, 0) + 1}) for word in f.read().split()]
...
Result:
>>> word_count
{'but': 1, 'One': 2, 'the': 2, 'she': 1, 'over': 2, 'love': 1, 'loves': 1, 'ocean': 1, "doesn't": 1, 'pizza': 1, 'My': 1, 'me': 1, 'flew': 2, 'sea': 1, 'Bonnie': 1}
I imagine there's something you could do with a dict comprehension, but I couldn't see how to use get in that case.
The f.read().split() gives you a nice list of words to work with, however, and should be easier than trying to get words out of a list of lines. It's a better approach unless you have a huge file.
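For what it's worth, one way a dict comprehension could do the job is to skip get entirely and call list.count() once per distinct word. This is only a sketch: it rescans the whole list for every distinct word, so it is quadratic and suits small inputs only. The text is inlined here instead of being read from foo.txt so the example runs on its own.

```python
text = """One flew over the ocean
One flew over the sea
My Bonnie loves pizza
but she doesn't love me"""

words = text.split()
# one count() call per distinct word
word_count = {word: words.count(word) for word in set(words)}
print(word_count['One'], word_count['the'])
```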

how to group new dictionaries by using common strings from my dictionary

Refer to my previous question: How to extract the common words before particular symbol and find particular word
mydict = {"g18_84pp_2A_MVP1_GoodiesT0-HKJ-DFG_MIX-CMVP1_Y1000-MIX.txt" : 0,
"g18_84pp_2A_MVP2_GoodiesT0-HKJ-DFG_MIX-CMVP2_Y1000-MIX.txt" : 1,
"g18_84pp_2A_MVP3_GoodiesT0-HKJ-DFG_MIX-CMVP3_Y1000-MIX.txt" : 2,
"g18_84pp_2A_MVP4_GoodiesT0-HKJ-DFG_MIX-CMVP4_Y1000-MIX.txt" : 3,
"g18_84pp_2A_MVP5_GoodiesT0-HKJ-DFG_MIX-CMVP5_Y1000-MIX.txt" : 4,
"g18_84pp_2A_MVP6_GoodiesT0-HKJ-DFG_MIX-CMVP6_Y1000-MIX.txt" : 5,
"h18_84pp_3A_MVP1_GoodiesT1-HKJ-DFG-CMVP1_Y1000-FIX.txt" : 6,
"g18_84pp_2A_MVP7_GoodiesT0-HKJ-DFG_MIX-CMVP7_Y1000-MIX.txt" : 7,
"h18_84pp_3A_MVP2_GoodiesT1-HKJ-DFG-CMVP2_Y1000-FIX.txt" : 8,
"h18_84pp_3A_MVP3_GoodiesT1-HKJ-DFG-CMVP3_Y1000-FIX.txt" : 9,
"p18_84pp_2B_MVP1_GoodiesT2-HKJ-DFG-CMVP3_Y1000-FIX.txt" : 10}
and I already got my OutputNameDict,
OutputNameDict = {'h18_84pp_3A_MVP_FIX': 1, 'p18_84pp_2B_MVP_FIX': 2, 'g18_84pp_2A_MVP_MIX': 0}
Now I want to build three new dictionaries, grouped by the common strings in CaseNameString (see the previous question) and the values from OutputNameDict.
The ideal result would look like this:
Group 1. mydict0, using value 0 in OutputNameDict and the string g18_84pp_2A_MVP_GoodiesT0 in CaseNameString.
mydict0 = {"g18_84pp_2A_MVP1_GoodiesT0-HKJ-DFG_MIX-CMVP1_Y1000-MIX.txt" : 0,
"g18_84pp_2A_MVP2_GoodiesT0-HKJ-DFG_MIX-CMVP2_Y1000-MIX.txt" : 1,
"g18_84pp_2A_MVP3_GoodiesT0-HKJ-DFG_MIX-CMVP3_Y1000-MIX.txt" : 2,
"g18_84pp_2A_MVP4_GoodiesT0-HKJ-DFG_MIX-CMVP4_Y1000-MIX.txt" : 3,
"g18_84pp_2A_MVP5_GoodiesT0-HKJ-DFG_MIX-CMVP5_Y1000-MIX.txt" : 4,
"g18_84pp_2A_MVP6_GoodiesT0-HKJ-DFG_MIX-CMVP6_Y1000-MIX.txt" : 5,
"g18_84pp_2A_MVP7_GoodiesT0-HKJ-DFG_MIX-CMVP7_Y1000-MIX.txt" : 6}
Group 2. mydict1, using value 1 in OutputNameDict and the string h18_84pp_3A_MVP_GoodiesT1 in CaseNameString.
mydict1 ={"h18_84pp_3A_MVP1_GoodiesT1-HKJ-DFG-CMVP1_Y1000-FIX.txt" : 0,
"h18_84pp_3A_MVP2_GoodiesT1-HKJ-DFG-CMVP2_Y1000-FIX.txt" : 1,
"h18_84pp_3A_MVP3_GoodiesT1-HKJ-DFG-CMVP3_Y1000-FIX.txt" : 2}
Group 3. mydict2, using value 2 in OutputNameDict and the string p18_84pp_2B_MVP_GoodiesT2 in CaseNameString.
mydict2 ={"p18_84pp_2B_MVP1_GoodiesT2-HKJ-DFG-CMVP3_Y1000-FIX.txt" : 0}
Any suggestion? Is there any function to call?
I'd change your OutputNameDict keys to be regular expression patterns, as follows:
OutputNameDict = {'h18_84pp_3A_MVP.*FIX': 1, 'p18_84pp_2B_MVP.*FIX': 2, 'g18_84pp_2A_MVP.*MIX': 0}
Then, using the re regular expression module, match each pattern against the keys in mydict, and place the dictionary element under the appropriate key in the output_dicts dictionary, as follows:
import collections
import re
output_dicts = collections.defaultdict(dict)
for k, v in mydict.items():
    for pattern, suffix in OutputNameDict.items():
        if re.match(pattern, k):
            output_dicts['mydict' + str(suffix)][k] = v
            break
    else:
        # no pattern matched this key
        output_dicts['not matched'][k] = v
This results in the output_dicts dictionary populated as follows:
for k, v in output_dicts.items():
    print(k)
    print(v)
    print()
Which outputs
mydict1
{'h18_84pp_3A_MVP2_GoodiesT1-HKJ-DFG-CMVP2_Y1000-FIX.txt': 8,
'h18_84pp_3A_MVP3_GoodiesT1-HKJ-DFG-CMVP3_Y1000-FIX.txt': 9,
'h18_84pp_3A_MVP1_GoodiesT1-HKJ-DFG-CMVP1_Y1000-FIX.txt': 6}
mydict0
{'g18_84pp_2A_MVP1_GoodiesT0-HKJ-DFG_MIX-CMVP1_Y1000-MIX.txt': 0,
'g18_84pp_2A_MVP2_GoodiesT0-HKJ-DFG_MIX-CMVP2_Y1000-MIX.txt': 1,
'g18_84pp_2A_MVP4_GoodiesT0-HKJ-DFG_MIX-CMVP4_Y1000-MIX.txt': 3,
'g18_84pp_2A_MVP5_GoodiesT0-HKJ-DFG_MIX-CMVP5_Y1000-MIX.txt': 4,
'g18_84pp_2A_MVP3_GoodiesT0-HKJ-DFG_MIX-CMVP3_Y1000-MIX.txt': 2,
'g18_84pp_2A_MVP6_GoodiesT0-HKJ-DFG_MIX-CMVP6_Y1000-MIX.txt': 5,
'g18_84pp_2A_MVP7_GoodiesT0-HKJ-DFG_MIX-CMVP7_Y1000-MIX.txt': 7}
mydict2
{'p18_84pp_2B_MVP1_GoodiesT2-HKJ-DFG-CMVP3_Y1000-FIX.txt': 10}
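Note that the output above keeps the original values, whereas the question renumbers them from 0 within each group. A possible way to get that, sketched here on an abbreviated copy of mydict (five of the eleven entries), is to visit the file names in order of their original value and assign each one the current size of its group. The variable name group and the renumbering rule are my reading of the question, not something the question states.

```python
import re
from collections import defaultdict

# Abbreviated copy of mydict from the question (only five of the entries).
mydict = {
    "g18_84pp_2A_MVP1_GoodiesT0-HKJ-DFG_MIX-CMVP1_Y1000-MIX.txt": 0,
    "h18_84pp_3A_MVP1_GoodiesT1-HKJ-DFG-CMVP1_Y1000-FIX.txt": 6,
    "g18_84pp_2A_MVP7_GoodiesT0-HKJ-DFG_MIX-CMVP7_Y1000-MIX.txt": 7,
    "h18_84pp_3A_MVP2_GoodiesT1-HKJ-DFG-CMVP2_Y1000-FIX.txt": 8,
    "p18_84pp_2B_MVP1_GoodiesT2-HKJ-DFG-CMVP3_Y1000-FIX.txt": 10,
}
OutputNameDict = {
    'h18_84pp_3A_MVP.*FIX': 1,
    'p18_84pp_2B_MVP.*FIX': 2,
    'g18_84pp_2A_MVP.*MIX': 0,
}

output_dicts = defaultdict(dict)
# Visit names in order of their original value so the new numbering keeps that order.
for name in sorted(mydict, key=mydict.get):
    for pattern, suffix in OutputNameDict.items():
        if re.match(pattern, name):
            group = output_dicts['mydict' + str(suffix)]
            group[name] = len(group)  # next free index within this group
            break
    else:
        # no pattern matched: keep the original value
        output_dicts['not matched'][name] = mydict[name]

for group_name, members in output_dicts.items():
    print(group_name, members)
```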
