Using np.where to access values in a dictionary - vectorization - python

Given a dictionary
coaching_hours_per_level = {1:30, 2: 55, 3:80, 4:115}
coaching_hours_per_level
and a dataframe:
df1 = {'skill_0': {'jay': 1, 'roy': 4, 'axel': 5, 'billy': 1, 'charlie': 2},
'skill_1': {'jay': 5, 'roy': 3, 'axel': 2, 'billy': 5, 'charlie': 1},
'skill_2': {'jay': 4, 'roy': 1, 'axel': 2, 'billy': 1, 'charlie': 4},
'skill_3': {'jay': 1, 'roy': 3, 'axel': 5, 'billy': 4, 'charlie': 3},
'skill_4': {'jay': 3, 'roy': 4, 'axel': 2, 'billy': 3, 'charlie': 4},
'skill_5': {'jay': 5, 'roy': 2, 'axel': 4, 'billy': 2, 'charlie': 4},
'skill_6': {'jay': 5, 'roy': 5, 'axel': 2, 'billy': 5, 'charlie': 1},
'skill_7': {'jay': 3, 'roy': 3, 'axel': 4, 'billy': 2, 'charlie': 1},
'skill_8': {'jay': 1, 'roy': 4, 'axel': 2, 'billy': 1, 'charlie': 2},
'skill_9': {'jay': 4, 'roy': 3, 'axel': 4, 'billy': 2, 'charlie': 1}}
My target is:
target = {'skill_0': {'jim': 3},
'skill_1': {'jim': 5},
'skill_2': {'jim': 1},
'skill_3': {'jim': 2},
'skill_4': {'jim': 1},
'skill_5': {'jim': 2},
'skill_6': {'jim': 3},
'skill_7': {'jim': 5},
'skill_8': {'jim': 3},
'skill_9': {'jim': 3}}
What I want to do is work out how many hours of coaching a person needs to catch up to the target level of a given skill. E.g., for Jay in skill_0, Jay has to upskill 2 levels (30 + 55, a total of 85h). If the skill is already at the same level or above, it should be 0.
I've tried np.where as below, and it works for obtaining just the difference:
np.where(df1>=target.values, 0, target.values-df1)
But when I try to access the dictionary to get the sum of the coaching hours needed, np.where no longer seems to vectorize, even if I simply try to access a value in the dict:
np.where(df1>=target.values, 0, coaching_hours_per_level[target.values+1])

You can build an hour matrix to indicate how much time it takes to go from level x to level y.
First some sample data:
current = {
"skill_0": {"jay": 1, "roy": 4, "axel": 5, "billy": 1, "charlie": 2},
"skill_1": {"jay": 5, "roy": 3, "axel": 2, "billy": 5, "charlie": 1},
"skill_2": {"jay": 4, "roy": 1, "axel": 2, "billy": 1, "charlie": 4},
"skill_3": {"jay": 1, "roy": 3, "axel": 5, "billy": 4, "charlie": 3},
"skill_4": {"jay": 3, "roy": 4, "axel": 2, "billy": 3, "charlie": 4},
"skill_5": {"jay": 5, "roy": 2, "axel": 4, "billy": 2, "charlie": 4},
"skill_6": {"jay": 5, "roy": 5, "axel": 2, "billy": 5, "charlie": 1},
"skill_7": {"jay": 3, "roy": 3, "axel": 4, "billy": 2, "charlie": 1},
"skill_8": {"jay": 1, "roy": 4, "axel": 2, "billy": 1, "charlie": 2},
"skill_9": {"jay": 4, "roy": 3, "axel": 4, "billy": 2, "charlie": 1},
}
# We will up the challenge a bit by saying not everyone
# wants to level up every skill
target = {
"skill_0": {"jay": 3, "charlie": 5},
"skill_1": {"jay": 5, "charlie": 5},
"skill_2": {"jay": 1, "charlie": 1},
"skill_3": {"jay": 2, "charlie": 1},
"skill_4": {"jay": 1, "charlie": 1},
"skill_5": {"jay": 2},
"skill_6": {"jay": 3},
"skill_7": {"jay": 5},
"skill_8": {"jay": 3},
"skill_9": {"jay": 3},
}
The algorithm:
import numpy as np
import pandas as pd
coaching_hours_per_level = {1: 30, 2: 55, 3: 80, 4: 115}
hours = [0] + list(coaching_hours_per_level.values())
# The value in hours_matrix[i, j] is the total time it takes
# to go from level (i + 1) to level (j + 1). Notice that
# hours_matrix[i, j] = 0 if i >= j -- no time is needed to
# stay at the same level or to down-level.
hours_matrix = np.triu(
    np.tile(hours, (len(hours), 1)),
    k=1,
).cumsum(axis=1)
# Now line up the data
result = (
    pd.concat(
        [pd.DataFrame(current).unstack(), pd.DataFrame(target).unstack()],
        axis=1,
        keys=["current", "target"],
    )
    .dropna()
    .astype("int")
)
# And the final step is just taking data from hours_matrix
result["hours"] = hours_matrix[result["current"] - 1, result["target"] - 1]
Result:
                 current  target  hours
skill_0 jay            1       3     85
        charlie        2       5    250
skill_1 jay            5       5      0
        charlie        1       5    280
skill_2 jay            4       1      0
        charlie        4       1      0
skill_3 jay            1       2     30
        charlie        3       1      0
skill_4 jay            3       1      0
        charlie        4       1      0
skill_5 jay            5       2      0
skill_6 jay            5       3      0
skill_7 jay            3       5    195
skill_8 jay            1       3     85
skill_9 jay            4       3      0
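For the layout in the original question (a single target row compared against every person), a cumulative-hours lookup array can stand in for the dictionary inside np.where. This is only a sketch: cum_hours, step_hours, cur and tgt are names introduced here, the two-skill frames are cut-down sample data, and levels are assumed to be integers from 1 to 5.

```python
import numpy as np
import pandas as pd

coaching_hours_per_level = {1: 30, 2: 55, 3: 80, 4: 115}

# cum_hours[k] is the total time needed to climb from level 1 to level k + 1.
step_hours = [coaching_hours_per_level[k] for k in sorted(coaching_hours_per_level)]
cum_hours = np.concatenate(([0], np.cumsum(step_hours)))  # [0, 30, 85, 165, 280]

# Cut-down sample data (assumption: same shape as in the question, fewer columns).
current = pd.DataFrame({"skill_0": {"jay": 1, "roy": 4},
                        "skill_1": {"jay": 5, "roy": 3}})
target = pd.DataFrame({"skill_0": {"jim": 3}, "skill_1": {"jim": 5}})

cur = current.to_numpy()
tgt = target.to_numpy()  # shape (1, n_skills); broadcasts against every row of cur

# Integer-array indexing replaces the per-element dictionary lookup.
hours = np.where(cur >= tgt, 0, cum_hours[tgt - 1] - cum_hours[cur - 1])
result = pd.DataFrame(hours, index=current.index, columns=current.columns)
print(result)
```

The dictionary lookup in the question fails because a dict needs a single hashable key, while a NumPy array can be indexed by a whole array of integers at once.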

Related

Fastest way to split a list into a minimum amount of sets, enumerating all possible solutions

Say I have a list of numbers with duplicates.
import random
lst = [0, 0, 1, 2, 2, 2, 3, 3, 4, 4, 4, 4, 5, 6, 7, 7, 8, 8, 8, 9]
random.shuffle(lst)
I want to split the list into the minimum number of sub-"set"s, each containing only unique numbers, without discarding any numbers. I managed to write the following code, but it feels hard-coded, so there should be faster and more general solutions.
from collections import Counter
counter = Counter(lst)
maxcount = counter.most_common(1)[0][1]
res = []
while maxcount > 0:
    res.append(set(x for x in lst if counter[x] >= maxcount))
    maxcount -= 1
assert len([x for st in res for x in st]) == len(lst)
print(res)
Output:
[{4}, {8, 2, 4}, {0, 2, 3, 4, 7, 8}, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}]
Obviously, this is only one of the solutions. Another solution could be
[{4, 9}, {8, 2, 4}, {0, 2, 3, 4, 7, 8}, {0, 1, 2, 3, 4, 5, 6, 7, 8}]
I want to find all possible solutions with minimum amount of sub"set"s (4 in this case). Note that same numbers are indistinguishable, e.g. [{1}, {1, 2}] is the same solution as [{1, 2}, {1}] for a list of [1, 2, 1].
Any suggestions?
This way takes time linear in the number of list elements, and its output is the same (same sets, in the same order) regardless of the input list's order. It's basically a more "eager" variation of your code:
def split(xs):
    from collections import defaultdict
    x2count = defaultdict(int)
    result = []
    for x in xs:
        x2count[x] += 1
        count = x2count[x]
        if count > len(result):
            result.append(set())
        result[count - 1].add(x)
    return result
Then, e.g.,
xs = [0, 0, 1, 2, 2, 2, 3, 3, 4, 4, 4, 4, 5, 6, 7, 7, 8, 8, 8, 9]
import random
random.shuffle(xs)
print(split(xs))
displays
[{0, 1, 2, 3, 4, 5, 6, 7, 8, 9}, {0, 2, 3, 4, 7, 8}, {8, 2, 4}, {4}]
Finding all answers is bound to be annoying ;-) But straightforward enough. Once you know there are 4 sets in the result, you have a hairy kind of Cartesian product to compute. You know that, e.g., 7 appears twice, so there are comb(4, 2) == 6 ways to pick the two result sets 7 goes in. For each of those ways, you know, e.g., that 8 appears 3 times, so there are comb(4, 3) == 4 ways to pick the three result sets 8 goes in. Now we're up to 6 * 4 == 24 partial results. Repeat similarly for all other original integers. itertools.combinations() can be used to do the choosing.
Unclear: consider input [1, 1, 2]. The output here is [{1, 2}, {1}]. Do you, or do you not, consider that to be the same as [{1}, {1, 2}]? That is, do you consider an output to be a sequence of sets (in which case they're different), or as a set of sets (in which case they're the same)? A straightforward Cartesian product takes the "it's a sequence" view.
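The counting argument above can be checked with a few lines. This is a sketch under the "it's a sequence" view only: it counts ordered arrangements, many of which are the same solution once the order of the sets is ignored. The names n_sets, ways_per_value and total_sequences are introduced here for illustration.

```python
from collections import Counter
from math import comb, prod

lst = [0, 0, 1, 2, 2, 2, 3, 3, 4, 4, 4, 4, 5, 6, 7, 7, 8, 8, 8, 9]

counts = Counter(lst)
n_sets = max(counts.values())  # minimum number of sets: 4 here

# Each distinct value independently chooses which of the n_sets sets it goes in.
ways_per_value = {value: comb(n_sets, count) for value, count in counts.items()}
total_sequences = prod(ways_per_value.values())

print(n_sets, ways_per_value[7], ways_per_value[8], total_sequences)
# 4 6 4 884736
```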
Finding all of 'em
Here's a way. As sketched, it computes a Cartesian product of all ways of distributing each element across the number of required output sets. But rather than use itertools.product() for this, it does it recursively, one element at a time. This allows it to check partial results so far for isomorphism, and decline to extend any partial solution that's isomorphic to a partial solution it already extended.
Toward that end, it views a partial solution as a set of sets. For technical reasons, Python requires using a frozenset for a set that's going to be used in turn as a set element.
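Caution: this generator yields the same result object every time. That's for efficiency. If you don't like that, you could, e.g., replace
yield result
with
yield [s.copy() for s in result]
instead. (A shallow list copy such as result[:] is not enough, because the sets inside it keep being mutated.)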
EDIT: note that I replaced the line
sig = frozenset(map(frozenset, result))
with
sig = frozenset(Counter(map(frozenset, result)).items())
That's because you really aren't viewing the result as a set of sets, but as a multiset of sets (a given set can appear more than once in a result, and the number of times it appears is significant). In fancier test cases than were given here, that can make a real difference.
A Counter is the closest thing Python has to a builtin multiset type, but there is no "frozen" Counter workalike akin to frozensets. So instead we turn the Counter into a sequence of 2-tuples, and put those tuples into a frozenset. Using (set, count) pairs lets us account for the fact that the number of times a set appears in a result is significant.
def allsplit(xs):
    from collections import Counter
    from itertools import combinations
    c = Counter(xs)
    n = max(c.values())
    result = [set() for i in range(n)]
    pairs = list(c.items())
    pin = len(pairs)
    def inner(pi):
        if pi >= pin:
            yield result
            return
        elt, count = pairs[pi]
        seen = set()
        for ixs in combinations(range(n), count):
            for i in ixs:
                result[i].add(elt)
            sig = frozenset(Counter(map(frozenset, result)).items())
            if sig not in seen:
                yield from inner(pi + 1)
                seen.add(sig)
            for i in ixs:
                result[i].remove(elt)
    return inner(0)
Example:
>>> for x in allsplit([1, 1, 2, 3, 8, 4, 4]):
...     print(x)
[{1, 2, 3, 4, 8}, {1, 4}]
[{1, 2, 3, 4}, {8, 1, 4}]
[{1, 2, 4, 8}, {1, 3, 4}]
[{1, 2, 4}, {1, 3, 4, 8}]
For your original example, it finds 36992 unique ways to partition the input.
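To see why the multiset signature matters, here is a small sketch comparing the two signatures on hypothetical partial results that differ only in how often each set appears. The helper names set_sig and multiset_sig and the sample lists are made up for illustration; they are not part of the answer's code.

```python
from collections import Counter

# Two hypothetical partial results that differ only in multiplicity.
result_a = [{1}, {1}, {2}]
result_b = [{1}, {2}, {2}]

def set_sig(result):
    # "Set of sets" view: duplicates collapse.
    return frozenset(map(frozenset, result))

def multiset_sig(result):
    # "Multiset of sets" view: (set, count) pairs keep the multiplicity.
    return frozenset(Counter(map(frozenset, result)).items())

print(set_sig(result_a) == set_sig(result_b))                   # True
print(multiset_sig(result_a) == multiset_sig(result_b))         # False
print(multiset_sig(result_a) == multiset_sig([{1}, {2}, {1}]))  # True
```

The last line shows that the multiset signature still ignores the order of the sets, which is what the isomorphism check needs.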
I'd suggest using a prefilled list and then storing each value in separate buckets:
import random
from collections import Counter
lst = [0, 0, 1, 2, 2, 2, 3, 3, 4, 4, 4, 4, 5, 6, 7, 7, 8, 8, 8, 9]
random.shuffle(lst)
c = Counter(lst)
maxcount = c.most_common(1)[0][1]
result = [set() for _ in range(maxcount)]
for k, v in c.items():
    for i in range(v):
        result[i].add(k)
print(result)
Can also be achieved with a defaultdict:
from collections import Counter, defaultdict
c = Counter(lst)
result = defaultdict(set)
for k, v in c.items():
    for i in range(v):
        result[i].add(k)
result = list(result.values())
print(result)
Note on performance
from timeit import timeit
import numpy as np
lst = list(np.random.randint(0, 100, 10000))
nb = 1000
print(timeit(lambda: prefilled_list(lst), number=nb)) # 2.144
print(timeit(lambda: default_dict_set(lst), number=nb)) # 1.903
print(timeit(lambda: op_while_loop(lst), number=nb)) # 318.2
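The three functions timed above are not defined in the post. A plausible reading, and it is only an assumption, is that they wrap the prefilled-list snippet, the defaultdict snippet, and the question's while loop, roughly like this:

```python
from collections import Counter, defaultdict

def prefilled_list(lst):
    # Assumed wrapper around the prefilled-list snippet.
    c = Counter(lst)
    maxcount = c.most_common(1)[0][1]
    result = [set() for _ in range(maxcount)]
    for k, v in c.items():
        for i in range(v):
            result[i].add(k)
    return result

def default_dict_set(lst):
    # Assumed wrapper around the defaultdict snippet.
    c = Counter(lst)
    result = defaultdict(set)
    for k, v in c.items():
        for i in range(v):
            result[i].add(k)
    return list(result.values())

def op_while_loop(lst):
    # Assumed wrapper around the question's loop; it rescans lst once per set.
    counter = Counter(lst)
    maxcount = counter.most_common(1)[0][1]
    res = []
    while maxcount > 0:
        res.append(set(x for x in lst if counter[x] >= maxcount))
        maxcount -= 1
    return res

sample = [0, 0, 1, 2, 2, 2, 3, 3, 4, 4, 4, 4, 5, 6, 7, 7, 8, 8, 8, 9]
print(prefilled_list(sample) == default_dict_set(sample))     # True
print(prefilled_list(sample) == op_while_loop(sample)[::-1])  # True
```

The first two build the sets biggest first, while the question's loop builds them smallest first, hence the reversal in the last comparison. The large timing gap presumably comes from the question's loop scanning the whole list once for every set it produces.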
My simple solution, returning sets from biggest to smallest:
# Problem definition
import random
lst = [0, 0, 1, 2, 2, 2, 3, 3, 4, 4, 4, 4, 5, 6, 7, 7, 8, 8, 8, 9]
random.shuffle(lst)
# Divide in sets, biggest first
l = lst.copy()
sets = []
while l:
    sets.append(set(l))
    for item in sets[-1]:
        l.remove(item)
Now the hard part: combinations. I suggest starting from something like the following and expanding it. This is just a proof of concept; it does not avoid duplicates beyond skipping the trivial case of moving the entire difference set across. A real implementation would cover all combinations of set-to-set transfers (just add another itertools.combinations level), but off the top of my head I have no clever way to handle moving elements between different sets in parallel.
import itertools
more_sets = [sets]
diff_0_1 = sets[0] - sets[1]
for comb_size in range(1, len(diff_0_1)):
    for comb in itertools.combinations(diff_0_1, comb_size):
        s0 = sets[0] - set(comb)
        s1 = sets[1] | set(comb)
        more_sets.append([s0, s1] + sets[2:])
for some_sets in more_sets:
    print(some_sets)
The above code returns this:
~ python3.8 tmp.py
[{0, 1, 2, 3, 4, 5, 6, 7, 8, 9}, {0, 2, 3, 4, 7, 8}, {8, 2, 4}, {4}]
[{0, 2, 3, 4, 5, 6, 7, 8, 9}, {0, 1, 2, 3, 4, 7, 8}, {8, 2, 4}, {4}]
[{0, 1, 2, 3, 4, 6, 7, 8, 9}, {0, 2, 3, 4, 5, 7, 8}, {8, 2, 4}, {4}]
[{0, 1, 2, 3, 4, 5, 7, 8, 9}, {0, 2, 3, 4, 6, 7, 8}, {8, 2, 4}, {4}]
[{0, 1, 2, 3, 4, 5, 6, 7, 8}, {0, 2, 3, 4, 7, 8, 9}, {8, 2, 4}, {4}]
[{0, 2, 3, 4, 6, 7, 8, 9}, {0, 1, 2, 3, 4, 5, 7, 8}, {8, 2, 4}, {4}]
[{0, 2, 3, 4, 5, 7, 8, 9}, {0, 1, 2, 3, 4, 6, 7, 8}, {8, 2, 4}, {4}]
[{0, 2, 3, 4, 5, 6, 7, 8}, {0, 1, 2, 3, 4, 7, 8, 9}, {8, 2, 4}, {4}]
[{0, 1, 2, 3, 4, 7, 8, 9}, {0, 2, 3, 4, 5, 6, 7, 8}, {8, 2, 4}, {4}]
[{0, 1, 2, 3, 4, 6, 7, 8}, {0, 2, 3, 4, 5, 7, 8, 9}, {8, 2, 4}, {4}]
[{0, 1, 2, 3, 4, 5, 7, 8}, {0, 2, 3, 4, 6, 7, 8, 9}, {8, 2, 4}, {4}]
[{0, 2, 3, 4, 7, 8, 9}, {0, 1, 2, 3, 4, 5, 6, 7, 8}, {8, 2, 4}, {4}]
[{0, 2, 3, 4, 6, 7, 8}, {0, 1, 2, 3, 4, 5, 7, 8, 9}, {8, 2, 4}, {4}]
[{0, 2, 3, 4, 5, 7, 8}, {0, 1, 2, 3, 4, 6, 7, 8, 9}, {8, 2, 4}, {4}]
[{0, 1, 2, 3, 4, 7, 8}, {0, 2, 3, 4, 5, 6, 7, 8, 9}, {8, 2, 4}, {4}]

Sort the keys of a dictionary by key using a list and for loop [duplicate]

I need to sort this dictionary that counts the times that some words appear in a song:
word_freq = {'love': 25, 'conversation': 1, 'every': 6, "we're": 1, 'plate': 1, 'sour': 1, 'jukebox': 1, 'now': 11, 'taxi': 1, 'fast': 1, 'bag': 1, 'man': 1, 'push': 3, 'baby': 14, 'going': 1, 'you': 16, "don't": 2, 'one': 1, 'mind': 2, 'backseat': 1, 'friends': 1, 'then': 3, 'know': 2, 'take': 1, 'play': 1, 'okay': 1, 'so': 2, 'begin': 1, 'start': 2, 'over': 1, 'body': 17, 'boy': 2, 'just': 1, 'we': 7, 'are': 1, 'girl': 2, 'tell': 1, 'singing': 2, 'drinking': 1, 'put': 3, 'our': 1, 'where': 1, "i'll": 1, 'all': 1, "isn't": 1, 'make': 1, 'lover': 1, 'get': 1, 'radio': 1, 'give': 1, "i'm": 23, 'like': 10, 'can': 1, 'doing': 2, 'with': 22, 'club': 1, 'come': 37, 'it': 1, 'somebody': 2, 'handmade': 2, 'out': 1, 'new': 6, 'room': 3, 'chance': 1, 'follow': 6, 'in': 27, 'may': 2, 'brand': 6, 'that': 2, 'magnet': 3, 'up': 3, 'first': 1, 'and': 23, 'pull': 3, 'of': 6, 'table': 1, 'much': 2, 'last': 3, 'i': 6, 'thrifty': 1, 'grab': 2, 'was': 2, 'driver': 1, 'slow': 1, 'dance': 1, 'the': 18, 'say': 2, 'trust': 1, 'family': 1, 'week': 1, 'date': 1, 'me': 10, 'do': 3, 'waist': 2, 'smell': 3, 'day': 6, 'although': 3, 'your': 21, 'leave': 1, 'want': 2, "let's": 2, 'lead': 6, 'at': 1, 'hand': 1, 'how': 1, 'talk': 4, 'not': 2, 'eat': 1, 'falling': 3, 'about': 1, 'story': 1, 'sweet': 1, 'best': 1, 'crazy': 2, 'let': 1, 'too': 5, 'van': 1, 'shots': 1, 'go': 2, 'to': 2, 'a': 8, 'my': 33, 'is': 5, 'place': 1, 'find': 1, 'shape': 6, 'on': 40, 'kiss': 1, 'were': 3, 'night': 3, 'heart': 3, 'for': 3, 'discovering': 6, 'something': 6, 'be': 16, 'bedsheets': 3, 'fill': 2, 'hours': 2, 'stop': 1, 'bar': 1}
In order to do it I need:
To create a new list just with the keys of the dictionary.
keys = list(word_freq.keys())
Sort the key list.
keys.sort()
Create an empty dictionary.
word_freq2 = {}
Use a for loop to iterate over each value of the list. For each key iterated, find the corresponding value in the first dictionary and insert the key-value pair into the new empty dictionary.
This is my best solution up to now:
for key in keys:
    if key in word_freq:
        word_freq2.update({key: value})
print(word_freq2)
The problem is that I don't know how to add the correct value, because right now I get just 1 as the value for every key, as shown here:
{'a': 1, 'about': 1, 'all': 1, 'although': 1, 'and': 1, 'are': 1, 'at': 1, 'baby': 1, 'backseat': 1, 'bag': 1, 'bar': 1, 'be': 1, 'bedsheets': 1, 'begin': 1, 'best': 1, 'body': 1, 'boy': 1, 'brand': 1, 'can': 1, 'chance': 1, 'club': 1, 'come': 1, 'conversation': 1, 'crazy': 1, 'dance': 1, 'date': 1, 'day': 1, 'discovering': 1, 'do': 1, 'doing': 1, "don't": 1, 'drinking': 1, 'driver': 1, 'eat': 1, 'every': 1, 'falling': 1, 'family': 1, 'fast': 1, 'fill': 1, 'find': 1, 'first': 1, 'follow': 1, 'for': 1, 'friends': 1, 'get': 1, 'girl': 1, 'give': 1, 'go': 1, 'going': 1, 'grab': 1, 'hand': 1, 'handmade': 1, 'heart': 1, 'hours': 1, 'how': 1, 'i': 1, "i'll": 1, "i'm": 1, 'in': 1, 'is': 1, "isn't": 1, 'it': 1, 'jukebox': 1, 'just': 1, 'kiss': 1, 'know': 1, 'last': 1, 'lead': 1, 'leave': 1, 'let': 1, "let's": 1, 'like': 1, 'love': 1, 'lover': 1, 'magnet': 1, 'make': 1, 'man': 1, 'may': 1, 'me': 1, 'mind': 1, 'much': 1, 'my': 1, 'new': 1, 'night': 1, 'not': 1, 'now': 1, 'of': 1, 'okay': 1, 'on': 1, 'one': 1, 'our': 1, 'out': 1, 'over': 1, 'place': 1, 'plate': 1, 'play': 1, 'pull': 1, 'push': 1, 'put': 1, 'radio': 1, 'room': 1, 'say': 1, 'shape': 1, 'shots': 1, 'singing': 1, 'slow': 1, 'smell': 1, 'so': 1, 'somebody': 1, 'something': 1, 'sour': 1, 'start': 1, 'stop': 1, 'story': 1, 'sweet': 1, 'table': 1, 'take': 1, 'talk': 1, 'taxi': 1, 'tell': 1, 'that': 1, 'the': 1, 'then': 1, 'thrifty': 1, 'to': 1, 'too': 1, 'trust': 1, 'up': 1, 'van': 1, 'waist': 1, 'want': 1, 'was': 1, 'we': 1, "we're": 1, 'week': 1, 'were': 1, 'where': 1, 'with': 1, 'you': 1, 'your': 1}
This code seems to work just fine:
word_freq = {'love': 25, 'conversation': 1, 'every': 6, "we're": 1, 'plate': 1, 'sour': 1, 'jukebox': 1, 'now': 11, 'taxi': 1, 'fast': 1, 'bag': 1, 'man': 1, 'push': 3, 'baby': 14, 'going': 1, 'you': 16, "don't": 2, 'one': 1, 'mind': 2, 'backseat': 1, 'friends': 1, 'then': 3, 'know': 2, 'take': 1, 'play': 1, 'okay': 1, 'so': 2, 'begin': 1, 'start': 2, 'over': 1, 'body': 17, 'boy': 2, 'just': 1, 'we': 7, 'are': 1, 'girl': 2, 'tell': 1, 'singing': 2, 'drinking': 1, 'put': 3, 'our': 1, 'where': 1, "i'll": 1, 'all': 1, "isn't": 1, 'make': 1, 'lover': 1, 'get': 1, 'radio': 1, 'give': 1, "i'm": 23, 'like': 10, 'can': 1, 'doing': 2, 'with': 22, 'club': 1, 'come': 37, 'it': 1, 'somebody': 2, 'handmade': 2, 'out': 1, 'new': 6, 'room': 3, 'chance': 1, 'follow': 6, 'in': 27, 'may': 2, 'brand': 6, 'that': 2, 'magnet': 3, 'up': 3, 'first': 1, 'and': 23, 'pull': 3, 'of': 6, 'table': 1, 'much': 2, 'last': 3, 'i': 6, 'thrifty': 1, 'grab': 2, 'was': 2, 'driver': 1, 'slow': 1, 'dance': 1, 'the': 18, 'say': 2, 'trust': 1, 'family': 1, 'week': 1, 'date': 1, 'me': 10, 'do': 3, 'waist': 2, 'smell': 3, 'day': 6, 'although': 3, 'your': 21, 'leave': 1, 'want': 2, "let's": 2, 'lead': 6, 'at': 1, 'hand': 1, 'how': 1, 'talk': 4, 'not': 2, 'eat': 1, 'falling': 3, 'about': 1, 'story': 1, 'sweet': 1, 'best': 1, 'crazy': 2, 'let': 1, 'too': 5, 'van': 1, 'shots': 1, 'go': 2, 'to': 2, 'a': 8, 'my': 33, 'is': 5, 'place': 1, 'find': 1, 'shape': 6, 'on': 40, 'kiss': 1, 'were': 3, 'night': 3, 'heart': 3, 'for': 3, 'discovering': 6, 'something': 6, 'be': 16, 'bedsheets': 3, 'fill': 2, 'hours': 2, 'stop': 1, 'bar': 1}
keys = list(word_freq.keys())
keys.sort()
word_freq2 = {}
for key in keys:
    word_freq2[key] = word_freq[key]
print(word_freq2)
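In the question's loop the name value is never assigned inside the loop, so every key received whatever value happened to hold beforehand (presumably a leftover 1 from earlier code; that part is a guess). Looking the value up with word_freq[key] fixes it. A minimal sketch on a four-word excerpt of the dictionary, with a one-line shortcut for comparison:

```python
# Small excerpt of the dictionary from the question.
word_freq = {"love": 25, "conversation": 1, "every": 6, "we're": 1}

# Explicit version: sorted keys, then look each value up in the source dict.
word_freq2 = {}
for key in sorted(word_freq):
    word_freq2[key] = word_freq[key]

# Shortcut: sort the (key, value) pairs and rebuild the dict.
shortcut = dict(sorted(word_freq.items()))

print(word_freq2)
# {'conversation': 1, 'every': 6, 'love': 25, "we're": 1}
```

Both versions rely on dictionaries preserving insertion order, which is guaranteed from Python 3.7 onwards.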

Create a pandas DataFrame of specified shape containing the same dict in every row [closed]

I would like to create a pandas.DataFrame which contains the same dict in every row:
input:
length = 5
a = {'this':1, 'is':2, 'an':3, 'example':4}
output:
0
0 {'this': 1, 'is': 2, 'an': 3, 'example': 4}
1 {'this': 1, 'is': 2, 'an': 3, 'example': 4}
2 {'this': 1, 'is': 2, 'an': 3, 'example': 4}
3 {'this': 1, 'is': 2, 'an': 3, 'example': 4}
4 {'this': 1, 'is': 2, 'an': 3, 'example': 4}
How could I do this properly?
Try this method -
import pandas as pd
length = 5
a = {'this':1, 'is':2, 'an':3, 'example':4}
out = pd.DataFrame({0:[a] * length})
print(out)
0
0 {'this': 1, 'is': 2, 'an': 3, 'example': 4}
1 {'this': 1, 'is': 2, 'an': 3, 'example': 4}
2 {'this': 1, 'is': 2, 'an': 3, 'example': 4}
3 {'this': 1, 'is': 2, 'an': 3, 'example': 4}
4 {'this': 1, 'is': 2, 'an': 3, 'example': 4}
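One thing to be aware of with [a] * length is that every row refers to the same dict object, so a later change to a shows up in all rows. If independent rows are wanted, each row can get its own copy. The sketch below illustrates the difference; the names shared and independent are introduced here for illustration.

```python
import copy

import pandas as pd

length = 5
a = {'this': 1, 'is': 2, 'an': 3, 'example': 4}

# One dict object referenced from every row.
shared = pd.DataFrame({0: [a] * length})
# A separate copy of the dict in every row.
independent = pd.DataFrame({0: [copy.deepcopy(a) for _ in range(length)]})

a['this'] = 99  # mutate the original dict after building both frames

print(shared[0].iloc[3]['this'])       # follows the change to a
print(independent[0].iloc[3]['this'])  # keeps the original value
```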

Python Pyplot word occurrence frequency

I have to plot how often each word frequency occurs in a txt file. So far I have a dictionary that contains each word and the number of times it appears in the txt file. In order to plot, I have to convert that dictionary into a new dictionary (I'm assuming) that counts the number of words at each frequency. For instance, if 5 words appear 3 times in the txt file, they should form a single group, so that the plot shows the frequency on the x axis and the number of words at that frequency on the y axis.
What I have now is simply not working:
def plot(word_dict):
    new_dict = {}
    for value in word_dict.values():
        if value in word_dict:
            new_dict += 1
        else:
            new_dict = 1
    y = new_dict[value]
    x = word_dict[value]
    pyplot.plot(x, y)
    pyplot.show()
A sample of data:
{'bangs': 1, 'sees': 1, 'stuff,': 1, 'Knox....': 1, 'Well': 1, 'about': 2, 'your': 1, 'blocks.': 1, 'what': 4, 'beetles....': 1, 'Boom': 1, 'blue': 1, 'paddled': 1, 'mixed': 1, 'fox': 5, 'Through': 1, 'on': 16, 'trick,': 2, 'When': 4, '...a': 1, 'silly': 1, 'band.': 2, 'come.': 3, "We'll": 2, 'likes': 2, 'slick,': 1, 'comes?': 1, 'chick': 1, 'goo,': 1, "it's": 2, 'then,': 1, 'muddled': 1, 'Now': 3, 'not': 1, 'flew,': 1, 'If,': 1, 'sneeze.': 1, 'bottled': 1, 'paddle': 4, 'called': 1, 'Goo-Goose,': 1, 'Blue': 2, 'Come': 1, 'fox.': 1, 'can': 3, 'poodle,': 1, 'this': 7, "Sue's": 4, 'Ben': 5, 'is': 7, 'goes': 1, 'to': 10, 'Crow': 4, 'cheese': 2, 'quick': 5, 'sir.': 27, 'easy,': 1, 'Clocks': 2, 'Fox': 6, 'Stop': 2, 'up': 1, 'be': 1, 'Well...': 1, 'hose': 2, 'Rose': 1, 'three': 4, 'Freezy': 2, 'New': 3, 'hate': 1, 'broom': 2, 'quite': 1, 'duck': 3, 'we': 1, 'done,': 1, 'tick.': 2, "can't": 5, 'beetles?': 1, 'well,': 2, 'box.': 4, "That's": 4, 'Do,': 1, 'say': 4, 'chicks': 5, '...': 1, 'enough,': 1, 'brick': 1, 'lot': 1, 'You': 4, 'sick': 2, 'that': 1, 'goo.': 4, 'Gooey': 1, 'made': 3, 'new': 5, 'noodles...': 1, 'Knox,': 6, 'for': 2, 'muddle.': 1, 'Bricks': 1, 'Luck': 4, 'Bim': 5, 'minute,': 1, 'brings': 2, 'bottle': 4, 'duddled': 1, "I'll": 3, 'come': 2, 'battles': 1, 'clocks': 2, 'such': 2, 'Then': 1, 'in': 19, 'sir....': 1, 'Two': 1, 'Knox.': 2, "Luke's": 1, 'lakes.': 4, 'trees': 3, "isn't": 2, 'band!': 4, 'our': 1, 'And': 2, 'blubber!': 1, 'another': 1, 'sews': 9, "bottle's": 1, "Crow's": 3, 'Step': 1, 'What': 1, 'grows': 1, 'like': 1, 'ticks': 2, 'too': 1, 'trick': 4, 'Fox,': 3, 'goo': 2, 'chewing!': 1, 'blocks': 3, 'fleas': 3, 'a': 24, 'lakes': 2, "don't": 2, 'those': 1, 'Luke': 4, 'sorry,': 1, 'tocks,': 2, 'Whose': 1, 'you': 3, 'Here': 1, 'tricks': 2, "poodle's": 1, 'they': 3, 'that.': 1, 'doing.': 1, 'Gluey.': 2, 'eating': 1, 'sir!': 1, 'breeze': 2, 'My': 4, 'tweetle': 11, 'these': 5, 'puddle,': 2, 'chewy': 1, 'tongue': 3, 'talk': 1, 'with': 11, 'beetles': 6, 
'noodle': 2, 'make': 5, 'who': 1, 'lame,': 1, 'flew.': 1, "I'm": 1, 'Fox!': 2, 'Nose': 1, 'the': 7, 'I': 9, "crow's": 2, 'Thank': 1, 'easy': 2, 'likes.': 2, 'battle': 7, 'licks': 4, 'goes.': 1, 'socks': 4, 'lead': 1, 'muddle': 1, 'shame,': 1, 'Please,': 1, 'fight,': 1, 'fun,': 1, 'chew,': 2, 'fuddled': 1, 'Broom': 1, 'No,': 1, 'Hose': 1, 'something': 2, 'find': 3, 'know': 1, 'Who': 4, 'call...': 1, 'First,': 1, 'Gooey.': 2, 'Look,': 2, 'fight': 1, 'This': 1, "Luck's": 1, 'poor': 2, 'now.': 6, 'freeze.': 2, 'game': 4, "Ben's": 5, 'it!': 2, 'Joe': 5, 'their': 2, 'you,': 1, 'Box': 1, 'bands.': 2, 'it': 3, 'bands': 1, 'bricks': 5, "here's": 1, "Let's": 3, 'Sue': 5, 'when': 2, 'clocks,': 2, 'breaks.': 2, 'puddle': 8, 'Socks': 4, 'sir,': 6, 'an': 2, "Bim's": 5, 'Pig': 2, 'now....': 1, 'battle.': 4, 'Slow': 5, 'sew': 2, 'blew.': 1, 'bring': 1, 'game,': 1, 'AND...': 3, 'and': 16, 'brooms.': 1, 'way.': 2, 'booms.': 1, 'lots': 1, 'clock': 1, 'comes.': 4, 'please....': 1, 'then...': 1, '...they': 2, 'say....': 1, 'beetle': 7, 'nose.': 1, 'slow,': 1, 'or': 1, 'Six': 2, 'AND': 1, 'block': 1, 'broom.': 4, 'do': 6, 'it,': 1, 'some.': 2, 'Duck': 1, 'sir?': 2, 'grows.': 1, 'this,': 1, 'Very': 2, 'Big': 2, 'whose': 3, 'noodle-eating': 1, 'chew': 2, 'choose': 2, 'Mr.': 13, 'band': 2, "Here's": 2, 'it.': 2, 'call': 3, 'dumb': 1, 'have': 2, 'so': 2, 'Goo-Goose': 1, 'say.': 2, 'socks.': 5, "trees'": 1, 'poodle': 3, 'socks,': 4, 'my': 1, 'While': 1, 'play.': 2, 'Chicks': 3, 'stack.': 4, 'rose': 2, 'freezy': 1, 'clothes.': 3, 'makes': 1, 'little': 1, 'paddles': 3, 'box': 2, 'all': 1, 'free': 2, 'blocks,': 1, 'Do': 1, 'blab': 1, 'THIS': 1, 'thing': 1, 'bends': 2, 'bent': 2, 'Knox': 8, 'socks?': 2, 'tock.': 2, 'wuddled': 1, 'much': 1, 'takes': 2, 'bends.': 2, 'wait': 1, 'see': 1, 'rubber.': 1, 'of': 4, 'clothes?': 2, 'mouth': 3, 'bottle...': 1, 'too,': 1, 'blibber': 1, 'Try': 2, 'where': 1, "won't": 2, 'get': 1}
Use a Counter from the collections library.
Since the values you want to count are the values of your word_dict (i.e. the frequencies of each word), you'll need to initialize the Counter instance like freq = Counter(word_dict.values()). Then you can extract the x and y series for your plot with freq.keys() and freq.values().
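As a minimal sketch of that idea, using a seven-word excerpt of the sample data (the sorting step and the names x and y are additions for illustration):

```python
from collections import Counter

# Small excerpt of the sample data: word -> number of occurrences.
word_dict = {"fox": 5, "Ben": 5, "what": 4, "paddle": 4, "Crow": 4,
             "about": 2, "blue": 1}

# Count how many words share each frequency.
freq = Counter(word_dict.values())

# Sort by frequency so the x axis is in ascending order.
x = sorted(freq)
y = [freq[f] for f in x]

print(x)  # [1, 2, 4, 5]
print(y)  # [1, 1, 3, 2]
```

The two lists can then be passed to a plotting call such as pyplot.bar(x, y).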
It seems as though you are attempting to plot strings along your x-axis, namely the keys you are using. This is not how pyplot works. You need to plot your values against a numeric vector (typically a numpy array). Once you have done this you can relabel your independent (x) vector using the xticks command.
x = numpy.linspace(0, len(new_dict) - 1, len(new_dict))
pyplot.xticks(x, list(new_dict.keys()))
Assuming you mean reversing the keys and values, you can do:
>>> di={'bangs': 1, 'sees': 1, 'stuff,': 1, 'Knox....': 1, 'Well': 1, 'about': 2, 'your': 1, 'blocks.': 1, 'what': 4, 'beetles....': 1, 'Boom': 1, 'blue': 1, 'paddled': 1, 'mixed': 1, 'fox': 5, 'Through': 1, 'on': 16, 'trick,': 2, 'When': 4, '...a': 1, 'silly': 1, 'band.': 2, 'come.': 3, "We'll": 2, 'likes': 2, 'slick,': 1, 'comes?': 1, 'chick': 1, 'goo,': 1, "it's": 2, 'then,': 1, 'muddled': 1, 'Now': 3, 'not': 1, 'flew,': 1, 'If,': 1, 'sneeze.': 1, 'bottled': 1, 'paddle': 4, 'called': 1, 'Goo-Goose,': 1, 'Blue': 2, 'Come': 1, 'fox.': 1, 'can': 3, 'poodle,': 1, 'this': 7, "Sue's": 4, 'Ben': 5, 'is': 7, 'goes': 1, 'to': 10, 'Crow': 4, 'cheese': 2, 'quick': 5, 'sir.': 27, 'easy,': 1, 'Clocks': 2, 'Fox': 6, 'Stop': 2, 'up': 1, 'be': 1, 'Well...': 1, 'hose': 2, 'Rose': 1, 'three': 4, 'Freezy': 2, 'New': 3, 'hate': 1, 'broom': 2, 'quite': 1, 'duck': 3, 'we': 1, 'done,': 1, 'tick.': 2, "can't": 5, 'beetles?': 1, 'well,': 2, 'box.': 4, "That's": 4, 'Do,': 1, 'say': 4, 'chicks': 5, '...': 1, 'enough,': 1, 'brick': 1, 'lot': 1, 'You': 4, 'sick': 2, 'that': 1, 'goo.': 4, 'Gooey': 1, 'made': 3, 'new': 5, 'noodles...': 1, 'Knox,': 6, 'for': 2, 'muddle.': 1, 'Bricks': 1, 'Luck': 4, 'Bim': 5, 'minute,': 1, 'brings': 2, 'bottle': 4, 'duddled': 1, "I'll": 3, 'come': 2, 'battles': 1, 'clocks': 2, 'such': 2, 'Then': 1, 'in': 19, 'sir....': 1, 'Two': 1, 'Knox.': 2, "Luke's": 1, 'lakes.': 4, 'trees': 3, "isn't": 2, 'band!': 4, 'our': 1, 'And': 2, 'blubber!': 1, 'another': 1, 'sews': 9, "bottle's": 1, "Crow's": 3, 'Step': 1, 'What': 1, 'grows': 1, 'like': 1, 'ticks': 2, 'too': 1, 'trick': 4, 'Fox,': 3, 'goo': 2, 'chewing!': 1, 'blocks': 3, 'fleas': 3, 'a': 24, 'lakes': 2, "don't": 2, 'those': 1, 'Luke': 4, 'sorry,': 1, 'tocks,': 2, 'Whose': 1, 'you': 3, 'Here': 1, 'tricks': 2, "poodle's": 1, 'they': 3, 'that.': 1, 'doing.': 1, 'Gluey.': 2, 'eating': 1, 'sir!': 1, 'breeze': 2, 'My': 4, 'tweetle': 11, 'these': 5, 'puddle,': 2, 'chewy': 1, 'tongue': 3, 'talk': 1, 'with': 11, 'beetles': 
6, 'noodle': 2, 'make': 5, 'who': 1, 'lame,': 1, 'flew.': 1, "I'm": 1, 'Fox!': 2, 'Nose': 1, 'the': 7, 'I': 9, "crow's": 2, 'Thank': 1, 'easy': 2, 'likes.': 2, 'battle': 7, 'licks': 4, 'goes.': 1, 'socks': 4, 'lead': 1, 'muddle': 1, 'shame,': 1, 'Please,': 1, 'fight,': 1, 'fun,': 1, 'chew,': 2, 'fuddled': 1, 'Broom': 1, 'No,': 1, 'Hose': 1, 'something': 2, 'find': 3, 'know': 1, 'Who': 4, 'call...': 1, 'First,': 1, 'Gooey.': 2, 'Look,': 2, 'fight': 1, 'This': 1, "Luck's": 1, 'poor': 2, 'now.': 6, 'freeze.': 2, 'game': 4, "Ben's": 5, 'it!': 2, 'Joe': 5, 'their': 2, 'you,': 1, 'Box': 1, 'bands.': 2, 'it': 3, 'bands': 1, 'bricks': 5, "here's": 1, "Let's": 3, 'Sue': 5, 'when': 2, 'clocks,': 2, 'breaks.': 2, 'puddle': 8, 'Socks': 4, 'sir,': 6, 'an': 2, "Bim's": 5, 'Pig': 2, 'now....': 1, 'battle.': 4, 'Slow': 5, 'sew': 2, 'blew.': 1, 'bring': 1, 'game,': 1, 'AND...': 3, 'and': 16, 'brooms.': 1, 'way.': 2, 'booms.': 1, 'lots': 1, 'clock': 1, 'comes.': 4, 'please....': 1, 'then...': 1, '...they': 2, 'say....': 1, 'beetle': 7, 'nose.': 1, 'slow,': 1, 'or': 1, 'Six': 2, 'AND': 1, 'block': 1, 'broom.': 4, 'do': 6, 'it,': 1, 'some.': 2, 'Duck': 1, 'sir?': 2, 'grows.': 1, 'this,': 1, 'Very': 2, 'Big': 2, 'whose': 3, 'noodle-eating': 1, 'chew': 2, 'choose': 2, 'Mr.': 13, 'band': 2, "Here's": 2, 'it.': 2, 'call': 3, 'dumb': 1, 'have': 2, 'so': 2, 'Goo-Goose': 1, 'say.': 2, 'socks.': 5, "trees'": 1, 'poodle': 3, 'socks,': 4, 'my': 1, 'While': 1, 'play.': 2, 'Chicks': 3, 'stack.': 4, 'rose': 2, 'freezy': 1, 'clothes.': 3, 'makes': 1, 'little': 1, 'paddles': 3, 'box': 2, 'all': 1, 'free': 2, 'blocks,': 1, 'Do': 1, 'blab': 1, 'THIS': 1, 'thing': 1, 'bends': 2, 'bent': 2, 'Knox': 8, 'socks?': 2, 'tock.': 2, 'wuddled': 1, 'much': 1, 'takes': 2, 'bends.': 2, 'wait': 1, 'see': 1, 'rubber.': 1, 'of': 4, 'clothes?': 2, 'mouth': 3, 'bottle...': 1, 'too,': 1, 'blibber': 1, 'Try': 2, 'where': 1, "won't": 2, 'get': 1}
>>> new_di={}
>>> for k, v in di.items():
...     new_di.setdefault(v, []).append(k)
>>> new_di
{1: ['What', 'game,', 'Whose', 'Thank', 'Broom', 'goo,', 'bring', 'fuddled', 'hate', 'Hose', 'then,', 'sneeze.', 'Here', 'sir....', 'Please,', '...', 'it,', 'get', 'Goo-Goose', 'bands', 'muddle', 'nose.', 'Goo-Goose,', 'sorry,', 'not', "I'm", 'little', 'No,', 'like', 'THIS', 'poodle,', 'Knox....', 'Bricks', 'blibber', 'chick', 'where', 'Rose', 'see', 'noodle-eating', 'call...', 'fun,', 'blue', 'chewing!', 'clock', 'lots', 'slow,', 'sir!', 'chewy', 'goes', 'beetles?', 'Do', 'goes.', 'flew.', 'Box', 'be', 'we', 'eating', 'this,', 'stuff,', "poodle's", 'Duck', 'Well...', 'then...', 'quite', 'minute,', 'Step', 'doing.', 'wait', 'brooms.', 'bottle...', 'thing', 'bangs', 'mixed', 'fight,', 'makes', 'or', 'grows.', 'duddled', 'all', 'too,', 'Two', 'Gooey', 'Boom', 'another', 'If,', 'done,', 'your', '...a', 'First,', 'now....', 'fight', 'muddle.', "trees'", 'too', 'lot', 'enough,', 'blew.', 'brick', 'This', 'Come', 'easy,', 'that', 'Well', "Luke's", 'those', "here's", 'say....', 'up', 'you,', 'freezy', 'silly', 'flew,', 'wuddled', 'dumb', 'my', 'called', 'lame,', 'sees', 'Do,', 'comes?', "Luck's", 'blubber!', 'rubber.', 'shame,', 'paddled', 'Then', 'blab', 'battles', 'booms.', 'bottled', 'please....', 'Through', 'grows', 'muddled', 'that.', 'our', 'who', 'much', 'slick,', 'Nose', 'blocks,', "bottle's", 'While', 'beetles....', 'noodles...', 'lead', 'fox.', 'AND', 'blocks.', 'block', 'talk', 'know'], 2: ['Blue', "don't", 'choose', 'clocks', 'band.', 'tock.', 'Big', 'broom', 'some.', "crow's", 'easy', 'it.', 'it!', 'Try', 'tocks,', 'Pig', 'Clocks', "isn't", 'likes', 'sew', 'chew', 'bends', 'Very', 'box', 'puddle,', 'Knox.', 'band', 'Six', 'for', 'ticks', '...they', "Here's", 'hose', 'And', 'free', 'say.', 'come', 'about', 'chew,', 'likes.', 'Freezy', 'way.', 'tick.', 'rose', 'cheese', 'bent', 'takes', 'their', "it's", "We'll", 'Fox!', 'brings', 'noodle', 'clocks,', 'Gooey.', 'Gluey.', 'sir?', 'when', 'breaks.', 'have', 'an', 'well,', 'something', 'clothes?', 'bends.', 'Stop', 
'trick,', 'sick', 'poor', "won't", 'bands.', 'goo', 'play.', 'socks?', 'such', 'tricks', 'freeze.', 'breeze', 'so', 'lakes', 'Look,'], 3: ['find', 'Now', 'mouth', 'trees', 'they', 'Chicks', 'fleas', 'New', 'come.', 'whose', 'AND...', 'tongue', 'poodle', 'duck', 'call', 'Fox,', "I'll", 'made', 'can', 'paddles', 'it', 'clothes.', "Let's", 'you', 'blocks', "Crow's"], 4: ['goo.', 'band!', 'game', 'socks', 'battle.', 'My', 'lakes.', 'broom.', 'what', 'paddle', "Sue's", 'of', 'When', 'Socks', 'three', 'box.', 'licks', "That's", 'trick', 'socks,', 'say', 'comes.', 'You', 'stack.', 'Luke', 'Who', 'Luck', 'Crow', 'bottle'], 5: ['chicks', 'Bim', 'quick', 'Sue', 'fox', 'Joe', 'new', "Bim's", "can't", 'bricks', 'socks.', "Ben's", 'Ben', 'Slow', 'make', 'these'], 6: ['Fox', 'Knox,', 'do', 'now.', 'sir,', 'beetles'], 7: ['beetle', 'battle', 'this', 'is', 'the'], 8: ['Knox', 'puddle'], 9: ['sews', 'I'], 10: ['to'], 11: ['tweetle', 'with'], 13: ['Mr.'], 16: ['and', 'on'], 19: ['in'], 24: ['a'], 27: ['sir.']}
I'm not sure what you used for tokenizing your data, but a quick solution could be using nltk.
Here is a small example on how it can be done:
# necessary imports
from nltk import FreqDist # used later to plot and get count
from nltk.tokenize import word_tokenize # tokenizes our sentence by word
# sample text
text = ('this is a super long text, that has some random words in it. It is not really '
        'that long, but could be very long.')
tknz = word_tokenize(text) # tokenizes the text into ['this', 'is', ...]
fdist = FreqDist(tknz) # creates frequency distribution from the tokenized words
From that you can simply call fdist.plot(), which draws a plot of the token frequencies.
From here you have a matplotlib plot that you can edit, and it only took a few lines to obtain.
You can find additional information about FreqDist in the NLTK documentation. It also behaves like a dictionary:
>>> fdist.items()
dict_items([(',', 2), ('in', 1), ('a', 1), ('very', 1), ('really', 1), ('be', 1), ...])

Complex dictionary sorting

I have a dictionary whose keys are words and whose values are numbers. I want to output the keys with the top 10 largest values, but multiple keys share the same value. How do I display keys that share a value in alphabetical order, alongside the keys whose value is unique?
HERE IS MY DICTIONARY AS PROMISED!
{'callooh': 1, 'all': 2, 'beware': 1, 'through': 3, 'eyes': 1, 'its': 1, 'callay': 1,
'jubjub': 1, 'to': 1, 'frumious': 1, 'wood': 1, 'tulgey': 1, 'has': 1, 'his': 2,
'"beware': 1, 'one': 2, 'day': 1, 'mome': 2, 'uffish': 1, 'manxome': 1, 'did': 2,
'galumphing': 1, 'whiffling': 1, '`twas': 1, 'went': 2, 'outgrabe': 2, 'slithy': 2,
'blade': 1, 'bandersnatch!"': 1, 'jaws': 1, 'snicker-snack': 1, 'back': 1, 'dead': 1,
'stood': 2, 'foe': 1, 'bird': 1, 'claws': 1, 'joy': 1, 'shun': 1, 'come': 1, 'by': 1,
'boy': 1, 'raths': 2, 'thou': 1, 'of': 1, 'o': 1, 'toves': 2, 'son': 1, '"and': 1,
'slain': 1, 'twas': 1, 'brillig': 2, 'bite': 1, 'two': 2, 'long': 1, 'head': 1, 'that': 2,
'took': 1, 'vorpal': 2, 'arms': 1, 'catch': 1, 'with': 2, 'he': 7, 'wabe': 2,
'tree': 1, 'flame': 1, 'were': 2, 'chortled': 1, 'beamish': 1, 'and': 13,
'gimble': 2, 'it': 2, 'as': 2, 'in': 6, 'sought': 1, 'my': 3, 'awhile': 1, 'mimsy': 2,
'sword': 1, 'borogoves': 2, 'hand': 1, 'rested': 1, 'frabjous': 1, 'gyre': 2,
'tumtum': 1, 'thought': 2, 'so': 1, 'time': 1, 'jabberwock': 3, 'the': 19,
'burbled': 1, 'came': 2, 'left': 1}
>>> from itertools import islice, chain, repeat
>>> food = {1: ['apple', 'chai', 'coffe', 'dom banana'], 2: ['pie', 'tea'], 3: ['bacon', 'pepsi'], 4: ['strawberry'], 5: ['egg'], 7: ['cake', 'ham'], 9: ['milk', 'mocha'], 10: ['pear'], 11: ['chicken', 'latte'], 13: ['coke'], 20: ['chocolate']}
>>> list(islice(chain.from_iterable(repeat(k, len(v))
...                                 for k, v in
...                                 sorted(food.items(), reverse=True)), 10))
[20, 13, 11, 11, 10, 9, 9, 7, 7, 5]
I'm not sure I completely understand, but you can try something like:
# Assuming the data you're working with is something like:
>>> d = {'apple': 10, 'banana': 10, 'pear': 5, 'peach': 35, 'plum': 17, 'tomato': 17}
# Use - to order by values descending, key ordering will still be ascending.
>>> sorted(d.items(), key = lambda kv: (-kv[1], kv[0]))
[('peach', 35),
('plum', 17),
('tomato', 17),
('apple', 10),
('banana', 10),
('pear', 5)]
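Applied to the question, the same sort key followed by a slice gives a "top N" list with ties in alphabetical order. The sketch below uses an eight-word excerpt of the question's dictionary and N = 5 to keep the output short; the names ranked, top and grouped, and the grouping step, are additions for illustration.

```python
from itertools import groupby

# Excerpt of the question's dictionary.
word_counts = {"the": 19, "and": 13, "he": 7, "in": 6,
               "through": 3, "my": 3, "jabberwock": 3, "all": 2}

# Highest count first; ties broken alphabetically by word.
ranked = sorted(word_counts.items(), key=lambda kv: (-kv[1], kv[0]))
top = ranked[:5]

# Optional: collect the words that share each count.
grouped = {count: [word for word, _ in group]
           for count, group in groupby(top, key=lambda kv: kv[1])}

print(top)
# [('the', 19), ('and', 13), ('he', 7), ('in', 6), ('jabberwock', 3)]
print(grouped)
# {19: ['the'], 13: ['and'], 7: ['he'], 6: ['in'], 3: ['jabberwock']}
```

Note that a plain slice cuts through a group of tied words: 'my' and 'through' also have a count of 3 but fall outside the first five entries.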
