I am trying to write code to find the mean of the keys in my dict, but based on the dict values. So, for example, for:
d = {1:2, 2:1, 3:2}
the expanded list of keys (each key repeated as many times as its value) would be:
[1,1,2,3,3]
I've written the following code, which works for small data sets such as the above:
def get_median_of_dict_keys(d: dict) -> float:
    nums_list = []
    for k,v in d.items():
        if type(v) != int:
            raise TypeError
        nums_list.extend([k] * v)
    median = sum(nums_list) / len(nums_list)
    return median
This gets me the values I want when the data set is small, but if the data set is something like:
d = {1:1_000_000_000_000_000, 2:2_000, 3:1_000_000_000_000_000}
I get an out of memory error which, now that I think about it, makes sense.
So how can I structure the above function in a way that will also handle those larger data sets? Thanks for your time.
You do not need to create a list; just keep two running variables, one holding the total sum and the other holding the number of elements:
def get_mean_of_dict_keys(d: dict) -> float:
    total = 0
    count = 0
    for k, v in d.items():
        total += k * v
        count += v
    mean = total / count
    return mean
print(get_mean_of_dict_keys({1: 2, 2: 1, 3: 2}))
Output
2.0
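As a sanity check, the same running-sum idea can be applied to the large dictionary from the question. This is only a sketch: the name `weighted_mean_of_keys` and the empty-input check are illustrative additions, not part of the answer above. Because Python integers have arbitrary precision, both sums stay exact and only the final division produces a float.

```python
def weighted_mean_of_keys(d: dict) -> float:
    """Mean of the keys, each key counted d[key] times (hypothetical helper)."""
    total = 0  # running sum of key * count
    count = 0  # running number of elements
    for key, times in d.items():
        if not isinstance(times, int):
            raise TypeError(f"count for {key!r} must be an int")
        total += key * times
        count += times
    if count == 0:
        # Assumption: an empty input should fail loudly rather than divide by zero.
        raise ValueError("cannot take the mean of zero elements")
    return total / count


big = {1: 1_000_000_000_000_000, 2: 2_000, 3: 1_000_000_000_000_000}
print(weighted_mean_of_keys(big))
```

No list is ever built, so memory use is constant regardless of how large the counts are.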
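If you want the mean, this is perfectly attainable with larger numbers, because only the products key * count are needed, not the expanded list. Divide their sum by the total count:
import numpy as np
d = {1:2000000000, 2:1000, 3:2000000000}
print(np.sum([i*d[i] for i in d]) / sum(d.values()))
Output
2.0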
breakdown
[i*d[i] for i in d]
# is equivalent to:
lst = []
for i in d:
    lst.append(i*d[i])
What you want to find is the weighted average.
Formula:
$$\bar{X} = \frac{\sum_{i=1}^{n} W_i X_i}{\sum_{i=1}^{n} W_i}$$
Where,
X1..n are the keys in your dictionary.
W1..n are the values in your dictionary.
X̅ is the weighted average.
Pure Python approach.
Using itertools.starmap with operator.mul
from itertools import starmap
from operator import mul
d = {1:2, 2:1, 3:2}
sum(starmap(mul, d.items()))/sum(d.values())
# 2.0
If you want to use NumPy
You can use np.average here.
import numpy as np
np.average([*d.keys()], weights=[*d.values()])
# 2.0
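For small inputs, one way to convince yourself that the weighted average really is the mean of the expanded key list is to compute both and compare. The sketch below uses only the standard library; the variable names are illustrative.

```python
from statistics import mean

d = {1: 2, 2: 1, 3: 2}

# Expand each key v times. This is only feasible when the counts are small.
expanded = [k for k, v in d.items() for _ in range(v)]

# Weighted average computed without expanding anything.
weighted = sum(k * v for k, v in d.items()) / sum(d.values())

print(expanded)
print(mean(expanded) == weighted)
```

The expanded form exists here purely for comparison; with very large counts only the weighted form is practical.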
I know how to write something simple and slow with a loop, but I need it to run very fast at large scale.
Input:
lst = [[1, 1, 2], ["txt1", "txt2", "txt3"]]
Desired output:
d = {1 : ["txt1", "txt2"], 2 : "txt3"}
Is there something built into Python that makes dict() extend the value for a repeated key instead of replacing it?
dict(list(zip(lst[0], lst[1])))
One option is to use dict.setdefault:
out = {}
for k, v in zip(*lst):
    out.setdefault(k, []).append(v)
Output:
{1: ['txt1', 'txt2'], 2: ['txt3']}
If you want the element itself for singleton lists, one way is adding a condition that checks for it while you build an output dictionary:
out = {}
for k,v in zip(*lst):
    if k in out:
        if isinstance(out[k], list):
            out[k].append(v)
        else:
            out[k] = [out[k], v]
    else:
        out[k] = v
or if lst[0] is sorted (like it is in your sample), you could use itertools.groupby:
from itertools import groupby
out = {}
pos = 0
for k, v in groupby(lst[0]):
    length = len([*v])
    if length > 1:
        out[k] = lst[1][pos:pos+length]
    else:
        out[k] = lst[1][pos]
    pos += length
Output:
{1: ['txt1', 'txt2'], 2: 'txt3'}
But as @timgeb notes, this is probably not what you want: afterwards you'll have to check the data type each time you access this dictionary (whether the value is a list or not), an unnecessary problem that you could avoid by keeping all values as lists.
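If the keys are not already sorted, itertools.groupby only merges adjacent equal keys, so the pairs need sorting first. Here is a minimal sketch of that variant, keeping every value as a list as recommended above. The unsorted sample input is made up for illustration.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical unsorted input in the same two-list layout as the question.
lst = [[2, 1, 1], ["txt3", "txt1", "txt2"]]

# Pair keys with values, then sort by key so equal keys become adjacent.
pairs = sorted(zip(*lst), key=itemgetter(0))

# Group adjacent pairs by key and keep only the values.
out = {
    key: [value for _, value in group]
    for key, group in groupby(pairs, key=itemgetter(0))
}
print(out)
```

The sort costs O(n log n), so for unsorted data the setdefault loop is usually the simpler choice.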
If you're dealing with large datasets it may be useful to add a pandas solution.
>>> import pandas as pd
>>> lst = [[1, 1, 2], ["txt1", "txt2", "txt3"]]
>>> s = pd.Series(lst[1], index=lst[0])
>>> s
1 txt1
1 txt2
2 txt3
>>> s.groupby(level=0).apply(list).to_dict()
{1: ['txt1', 'txt2'], 2: ['txt3']}
Note that this also produces lists for single elements (e.g. ['txt3']) which I highly recommend. Having both lists and strings as possible values will result in bugs because both of those types are iterable. You'd need to remember to check the type each time you process a dict-value.
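To make the risk concrete, here is a small sketch of what happens when code that expects lists meets a bare string. The helper name `flatten` is hypothetical.

```python
mixed = {1: ["txt1", "txt2"], 2: "txt3"}      # list or string values
uniform = {1: ["txt1", "txt2"], 2: ["txt3"]}  # always lists


def flatten(grouped):
    """Collect every item from every value (hypothetical helper)."""
    return [item for values in grouped.values() for item in values]


print(flatten(uniform))
print(flatten(mixed))  # the string is silently split into characters
```

No exception is raised in the mixed case, which is what makes this kind of bug easy to miss.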
You can use a defaultdict to group the strings by their corresponding key, then make a second pass through the list to extract the strings from singleton lists. Regardless of what you do, you'll need to access every element in both lists at least once, so some iteration structure is necessary (and even if you don't explicitly use iteration, whatever you use will almost definitely use iteration under the hood):
from collections import defaultdict
lst = [[1, 1, 2], ["txt1", "txt2", "txt3"]]
result = defaultdict(list)
for key, value in zip(lst[0], lst[1]):
    result[key].append(value)
for key in result:
    if len(result[key]) == 1:
        result[key] = result[key][0]
print(dict(result)) # Prints {1: ['txt1', 'txt2'], 2: 'txt3'}
I have a dictionary like this :
d = {'v03':["elem_A","elem_B","elem_C"],'v02':["elem_A","elem_D","elem_C"],'v01':["elem_A","elem_E"]}
How would you return a new dictionary containing, for each key, the elements that are not in the list stored under the highest key?
In this case :
d2 = {'v02':['elem_D'],'v01':["elem_E"]}
Thank you,
I prefer to do differences with the builtin data type designed for it: sets.
It is also preferable to write loops rather than elaborate comprehensions. One-liners are clever, but understandable code that you can return to and understand is even better.
d = {'v03':["elem_A","elem_B","elem_C"],'v02':["elem_A","elem_D","elem_C"],'v01':["elem_A","elem_E"]}
last = None
d2 = {}
for key in sorted(d.keys()):
    if last:
        if set(d[last]) - set(d[key]):
            d2[last] = sorted(set(d[last]) - set(d[key]))
    last = key
print(d2)
{'v01': ['elem_E'], 'v02': ['elem_D']}
from collections import defaultdict
myNewDict = defaultdict(list)
all_keys = sorted(d.keys())  # d.keys() has no .sort() method in Python 3
max_value = all_keys[-1]
for key in d:
    if key != max_value:
        for value in d[key]:
            if value not in d[max_value]:
                myNewDict[key].append(value)
myNewDict is then:
defaultdict(<class 'list'>, {'v02': ['elem_D'], 'v01': ['elem_E']})
You can get fancier with set operations by taking the set difference between the values in d[max_value] and each of the other keys, but first I think you should get comfortable working with dictionaries and lists.
One reason not to use sets is that the solution does not generalize: sets can only contain hashable objects. If your values are lists of lists, the members (sublists) are not hashable, so you can't use set operations on them.
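A short sketch of that limitation, using made-up nested data: building a set from sublists raises TypeError, while a plain membership test on lists still works. The helper name `list_difference` is hypothetical.

```python
def list_difference(values, reference):
    """Items of `values` not present in `reference`, without hashing."""
    return [v for v in values if v not in reference]


highest = [["a"], ["b"]]
other = [["a"], ["d"]]

try:
    set(other)  # sublists are unhashable, so this fails
    error = None
except TypeError as exc:
    error = str(exc)

print(error)
print(list_difference(other, highest))
```

The list-based test is O(len(values) * len(reference)), which is the price of not requiring hashable items.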
Depending on your python version, you may be able to get this done with only one line, using dict comprehension:
>>> d2 = {k:[v for v in values if not v in d.get(max(d.keys()))] for k, values in d.items()}
>>> d2
{'v01': ['elem_E'], 'v02': ['elem_D'], 'v03': []}
This builds a copy of d in which every list has been stripped of the items stored under the max key. The resulting dict looks more or less like what you are going for.
If you don't want the empty list at key v03, filter the result with another dict comprehension:
>>> {k:v for k,v in d2.items() if len(v) > 0}
{'v01': ['elem_E'], 'v02': ['elem_D']}
EDIT:
In case your original dict has a very large key set, or the operation is performed frequently, you may also want to replace the expression d.get(max(d.keys())) with a list assigned beforehand. Otherwise the expression is re-evaluated for every element. On my machine this roughly halves the running time: the following runs 100,000 times in 1.5 seconds, whereas the unsubstituted expression takes more than 3 seconds.
>>> bl = d.get(max(d.keys()))
>>> d2 = {k:v for k,v in {k:[v for v in values if not v in bl] for k, values in d.items()}.items() if len(v) > 0}
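The same precomputation can be spelled out as a loop, which some readers may find easier to follow. This sketch goes one step further and stores the reference values in a set for constant-time membership tests; that assumes the elements are hashable, as the strings in the question are. The function name `not_in_highest` is illustrative.

```python
def not_in_highest(d):
    """Per key, the elements missing from the list under the highest key."""
    top_key = max(d)              # computed once, not once per element
    top_values = set(d[top_key])  # assumption: elements are hashable
    result = {}
    for key, values in d.items():
        if key == top_key:
            continue
        leftover = [v for v in values if v not in top_values]
        if leftover:  # skip keys that would map to an empty list
            result[key] = leftover
    return result


d = {'v03': ["elem_A", "elem_B", "elem_C"],
     'v02': ["elem_A", "elem_D", "elem_C"],
     'v01': ["elem_A", "elem_E"]}
print(not_in_highest(d))
```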
I have the following three integer values:
id # identifies the pair
entropy # gives entropy information
len # basically the length of a string
Now I want to store many of these values and select the top 10 with the highest entropy overall among those with a length value over n.
from collections import defaultdict
d = defaultdict(list)
for id, entropy, len in generateValues:
    d[id].append(entropy)
    d[id].append(len)
# now get the top 10 values
Can this be easily done?
You can get the top 10 values after you've constructed the dictionary like this, although it would be more efficient to find them as you construct the dictionary, if that's possible.
import heapq
heapq.nlargest(10, (k for k in d if d[k][1] > n), key=lambda k: d[k][0])
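If the values can be processed as they are generated, one possible way to find the top entries during construction is a bounded min-heap. This is a sketch under assumptions: the function name `top_k_by_entropy` is hypothetical, and the input is assumed to be an iterable of (id, entropy, length) tuples of integers.

```python
import heapq


def top_k_by_entropy(records, k, n):
    """Return up to k (entropy, id, length) tuples with length > n,
    highest entropy first, keeping at most k items in memory."""
    heap = []  # min-heap: heap[0] is the weakest of the current top k
    for id_, entropy, length in records:
        if length <= n:
            continue  # fails the length filter
        item = (entropy, id_, length)
        if len(heap) < k:
            heapq.heappush(heap, item)
        elif item > heap[0]:
            heapq.heapreplace(heap, item)  # drop the weakest, add the new one
    return sorted(heap, reverse=True)


# Made-up sample data: (id, entropy, length)
records = [(1, 5, 10), (2, 9, 3), (3, 7, 12), (4, 8, 20)]
print(top_k_by_entropy(records, k=2, n=5))
```

Ties on entropy are broken by id here because the tuples are compared element by element.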
To solve your problem, sorted supports a key argument:
filtered = ((k,v) for k,v in d.items() if v[1] > n) # or filter(lambda t: t[1][1] > n, d.items())
topTen = sorted(filtered, key=lambda t: t[1][0], reverse=True)[:10]
This is, imho, more readable than the solutions using heapq, though it sorts every filtered item, whereas heapq.nlargest only keeps track of the top 10.
In Python I currently have a dictionary with a composite key. In this dictionary there are multiple occurrences of these keys. (The keys are comma-separated):
(A,B), (A,C), (A,B), (A,D), (C,A), (A,B), (C,A), (C,B), (C,B)
I already have something that totals the unique occurrences and counts the duplicates which gives me a print-out similar to this:
(A,B) with a count of 4, (A,C) with a count of 2, (B,C) with a count of 6, etc.
I would like to know how to code a loop that would give me the following:
Print out the first occurrence of the first part of the key and its associated values and counts.
Name: A:
Type    Count
B       4
C       2
Total   6
Name: B:
Type    Count
A       3
B       2
C       3
Total   8
I know I need a loop that groups the entries whose first key part is the same, but I have no real idea how to approach or code this.
Here's a slightly slow algorithm that'll get it done:
import collections
def convert(myDict):
    keys = myDict.keys()
    answer = collections.defaultdict(dict)
    for key in keys:
        # keys are tuples, so compare their first elements
        for k in [k for k in keys if k[0] == key[0]]:
            answer[key[0]][k[1]] = myDict[k]
    return answer
Ultimately, I think what you're after is a trie.
It's a little misleading to say that your dictionary has multiple values for a given key. Python doesn't allow that. Instead, what you have are keys that are tuples. You want to unpack those tuples and rebuild a nested dictionary.
Here's how I'd do it:
import collections
# rebuild data structure
nested = collections.defaultdict(dict)
for k, v in myDict.items():
    k1, k2 = k # unpack key tuple
    nested[k1][k2] = v
# print out data in the desired format (with totals)
for k1, inner in nested.items():
    print("%s\tType\tCount" % k1)
    total = 0
    for k2, v in inner.items():
        print("\t%s\t%d" % (k2, v))
        total += v
    print("\tTotal\t%d" % total)
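For a self-contained run, the counts can be built directly from the list of pairs in the question with collections.Counter and then nested the same way. This is only a sketch; the names `pairs`, `counts` and `totals` are illustrative, and the counts it produces come from the nine pairs listed in the question, not from the sample printout there.

```python
from collections import Counter, defaultdict

pairs = [("A", "B"), ("A", "C"), ("A", "B"), ("A", "D"), ("C", "A"),
         ("A", "B"), ("C", "A"), ("C", "B"), ("C", "B")]

# Count how often each (name, type) tuple occurs.
counts = Counter(pairs)

# Regroup by the first part of the key.
nested = defaultdict(dict)
for (name, kind), count in counts.items():
    nested[name][kind] = count

# Per-name totals across all types.
totals = {name: sum(inner.values()) for name, inner in nested.items()}

for name, inner in nested.items():
    print("Name: %s" % name)
    for kind, count in inner.items():
        print("\t%s\t%d" % (kind, count))
    print("\tTotal\t%d" % totals[name])
```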
How can I take a list of values (percentages):
example = [(1,100), (1,50), (2,50), (1,100), (3,100), (2,50), (3,50)]
and return a dictionary:
example_dict = {1:250, 2:100, 3:150}
and recalculate by dividing by sum(example_dict.values())/100:
final_dict = {1:50, 2:20, 3:30}
The methods I have tried for mapping the list of values to a dictionary result in values being overwritten rather than summed.
Edit:
Since it was asked, here are some attempts (after just writing over old values) that went nowhere and demonstrate my 'noviceness' with Python:
{k: +=v if k==w[x][0] for x in range(0,len(w),1)}
invalid
for i in w[x][0] in range(0,len(w),1):
for item in r:
+=v (don't know where I was going on that one)
invalid again.
another similar one that was invalid, nothing on google, then to SO.
You could try something like this:
total = float(sum(v for k,v in example))
example_dict = {}
for k,v in example:
    example_dict[k] = example_dict.get(k, 0) + v * 100 / total
See it working online: ideone
Use the Counter class:
from collections import Counter
totals = Counter()
for k, v in example:
    totals.update({k: v})  # Counter.update adds to the existing count
total = sum(totals.values())
final_dict = {k: 100 * v // total for k, v in totals.items()}
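Both steps the question asks for, the summed dictionary and the rescaled one, can also be produced in one small function. A minimal sketch, assuming the input is a list of (key, value) pairs as in the question; the function name `percent_shares` is hypothetical.

```python
from collections import defaultdict


def percent_shares(pairs):
    """Sum the values per key, then rescale so the shares add up to 100."""
    sums = defaultdict(int)
    for key, value in pairs:
        sums[key] += value  # accumulate instead of overwriting
    grand_total = sum(sums.values())
    shares = {key: 100 * value / grand_total for key, value in sums.items()}
    return dict(sums), shares


example = [(1, 100), (1, 50), (2, 50), (1, 100), (3, 100), (2, 50), (3, 50)]
example_dict, final_dict = percent_shares(example)
print(example_dict)
print(final_dict)
```

True division is used here, so the shares are floats; use `//` as in the Counter answer if integer percentages are wanted.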