How to efficiently remove some elements from a Python dictionary?

I want to filter out some elements from a Python dictionary. Which way is preferable? My dictionary is quite big...
Below is an example with a smaller data set:
d = {0: 0, 1: 2, 2: 4, 3: 6, 4: 8, 5: 10}
1st
d = {i:v for i, v in d.items() if i%2 == 0}
2nd
list_to_del = []
for i in d.keys():
    if i%2 == 0:
        list_to_del.append(i)  # collect the key, not the dict itself
for i in list_to_del:
    del d[i]
list_to_del.clear()
Is there any risk of memory leak in the first case?

Honestly, your first one looks exactly how I would do it. However, if you do need to just iterate over the entries and filter, try just leaving it in iterator form.
entries = filter(lambda x: x[0] % 2 == 0, d.items())  # x is a (key, value) pair; test the key, as in the question

The first one is faster. The repeated .append() calls are comparatively slow, so the first for-loop in your second approach will take longer than iterating over the elements once in the comprehension.
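On the memory question, here is a minimal sketch of the difference between the two approaches (the helper names keep_even_rebuild and keep_even_in_place are made up for illustration). The comprehension builds a new dictionary, and the old one is reclaimed once nothing else refers to it, so it does not leak. The deletion loop instead mutates the existing object, which matters if other code holds a reference to it.

```python
def keep_even_rebuild(d):
    """Return a new dict with only the even keys; the input is untouched."""
    return {k: v for k, v in d.items() if k % 2 == 0}


def keep_even_in_place(d):
    """Delete odd keys from d itself, so every reference to d sees the change."""
    # Iterate over a snapshot of the keys: deleting while iterating over
    # d directly raises RuntimeError in Python 3.
    for k in list(d):
        if k % 2 != 0:
            del d[k]


original = {0: 0, 1: 2, 2: 4, 3: 6, 4: 8, 5: 10}
alias = original                      # a second reference to the same dict

rebuilt = keep_even_rebuild(original)
print(rebuilt)                        # {0: 0, 2: 4, 4: 8}
print(len(alias))                     # 6, the original is unchanged

keep_even_in_place(original)
print(alias)                          # {0: 0, 2: 4, 4: 8}, alias sees the deletion
```

If nothing else references the dictionary, the comprehension is usually the simpler choice; the in-place version is only needed when the identity of the object must be preserved.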

Related

Is it possible to look up the key of a dictionary value under a certain number?

For example, I have a dictionary like such:
sampleDict = {0: 0, 1: 10, 2: 2, 3: 3, 4: 4}
I know that I can use the get() function to get the value of a particular key, but is there a way to do the opposite in a conditional manner?
For example, what if I wanted to get the key for maximum number in sampleDict. Is there a good function or set of functions I can use to quickly and efficiently get that value? I've tried a quick Google search, but I'm not coming up with what I want. In this example, I would want to receive the number 1 since the value for key 1 is the maximum value in the dictionary.
To get the key with the maximum value, you can use the built-in max() function:
sampleDict = {0: 0, 1: 10, 2: 2, 3: 3, 4: 4}
max_key = max(sampleDict, key=sampleDict.get)
print(max_key)
Prints:
1
This returns a list of the keys that have the value you are searching for (value is the number to look up):
keys = [k for k, v in sampleDict.items() if v == value]
print(keys)
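As a small sketch of how that comprehension can be combined with max(), the following collects every key that shares the maximum value. The data is made up so that two keys tie, and the name sample_dict is used only for this illustration.

```python
sample_dict = {0: 0, 1: 10, 2: 2, 3: 10, 4: 4}   # made-up data with a tie

value = max(sample_dict.values())                 # the value to look up
keys = [k for k, v in sample_dict.items() if v == value]
print(keys)  # [1, 3]
```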
You can use a loop to find the key which matches the max value:
sampleDict = {0: 0, 1: 10, 2: 2, 3: 3, 4: 4}
def foo(sampleDict):
    m = max(sampleDict.values())
    for k, v in sampleDict.items():
        if v == m:
            return k
print(foo(sampleDict))
Result: 1
This obviously does not account for any dupe max values. It just finds the first value that matches the max.
In terms of performance, this assumes the input size is not massive.
If we needed to make it more efficient, we could calculate the maximum as we iterate.
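Computing the maximum while iterating could look roughly like the sketch below. The function name key_of_max and the choice to raise ValueError on an empty mapping are assumptions made for this illustration, not part of the answer above.

```python
def key_of_max(mapping):
    """Return the first key whose value is the largest, in a single pass."""
    items = iter(mapping.items())
    try:
        best_key, best_value = next(items)
    except StopIteration:
        raise ValueError("key_of_max() arg is an empty mapping") from None
    for key, value in items:
        # Strict comparison keeps the first key when several values tie.
        if value > best_value:
            best_key, best_value = key, value
    return best_key


sample_dict = {0: 0, 1: 10, 2: 2, 3: 3, 4: 4}
print(key_of_max(sample_dict))  # 1
```

This visits each item once, whereas calling max() on the values and then searching for the match walks the dictionary up to twice.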

python list of lists to dict when key appear many times

I know how to write something simple and slow with a loop, but I need it to run very fast at large scale.
input:
lst = [[1, 1, 2], ["txt1", "txt2", "txt3"]]
desired output:
d = {1 : ["txt1", "txt2"], 2 : "txt3"}
Is there something built into Python that makes dict() extend the value for a repeated key instead of replacing it?
dict(list(zip(lst[0], lst[1])))
One option is to use dict.setdefault:
out = {}
for k, v in zip(*lst):
    out.setdefault(k, []).append(v)
Output:
{1: ['txt1', 'txt2'], 2: ['txt3']}
If you want the element itself for singleton lists, one way is adding a condition that checks for it while you build an output dictionary:
out = {}
for k,v in zip(*lst):
    if k in out:
        if isinstance(out[k], list):
            out[k].append(v)
        else:
            out[k] = [out[k], v]
    else:
        out[k] = v
or if lst[0] is sorted (like it is in your sample), you could use itertools.groupby:
from itertools import groupby
out = {}
pos = 0
for k, v in groupby(lst[0]):
    length = len([*v])
    if length > 1:
        out[k] = lst[1][pos:pos+length]
    else:
        out[k] = lst[1][pos]
    pos += length
Output:
{1: ['txt1', 'txt2'], 2: 'txt3'}
But as @timgeb notes, it's probably not something you want, because afterwards you'll have to check the data type each time you access this dictionary (whether the value is a list or not), which is an unnecessary problem that you could avoid by having all values as lists.
If you're dealing with large datasets it may be useful to add a pandas solution.
>>> import pandas as pd
>>> lst = [[1, 1, 2], ["txt1", "txt2", "txt3"]]
>>> s = pd.Series(lst[1], index=lst[0])
>>> s
1 txt1
1 txt2
2 txt3
>>> s.groupby(level=0).apply(list).to_dict()
{1: ['txt1', 'txt2'], 2: ['txt3']}
Note that this also produces lists for single elements (e.g. ['txt3']) which I highly recommend. Having both lists and strings as possible values will result in bugs because both of those types are iterable. You'd need to remember to check the type each time you process a dict-value.
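To make the risk concrete, here is a minimal sketch (the flatten helper and both dictionaries are made up for illustration) of what can go wrong when some values are lists and others are bare strings. The same loop silently splits the string into characters instead of failing.

```python
mixed = {1: ["txt1", "txt2"], 2: "txt3"}       # list for key 1, bare string for key 2
uniform = {1: ["txt1", "txt2"], 2: ["txt3"]}   # every value is a list


def flatten(d):
    """Collect every item from every value, assuming each value is a list."""
    return [item for values in d.values() for item in values]


print(flatten(uniform))  # ['txt1', 'txt2', 'txt3']
print(flatten(mixed))    # ['txt1', 'txt2', 't', 'x', 't', '3']
```

No exception is raised in the mixed case, which is why this kind of bug tends to surface late.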
You can use a defaultdict to group the strings by their corresponding key, then make a second pass through the list to extract the strings from singleton lists. Regardless of what you do, you'll need to access every element in both lists at least once, so some iteration structure is necessary (and even if you don't explicitly use iteration, whatever you use will almost definitely use iteration under the hood):
from collections import defaultdict
lst = [[1, 1, 2], ["txt1", "txt2", "txt3"]]
result = defaultdict(list)
for key, value in zip(lst[0], lst[1]):
    result[key].append(value)
for key in result:
    if len(result[key]) == 1:
        result[key] = result[key][0]
print(dict(result)) # Prints {1: ['txt1', 'txt2'], 2: 'txt3'}

Summing instead of overriding values in dictionary comprehension

When several keys in a dictionary comprehension are the same, I want their values to be added up. For example,
>>> dct = {-1: 1, 0: 2, 1: 3}
>>> {k**2: v for k, v in dct.items()}
{1: 3, 0: 2}
However, what I want to get in this case is {1: 4, 0: 2}, because the squares of both 1 and -1 are 1, and 1 + 3 = 4.
Clearly, I can do it with a for loop, but is there a shorthand?
There isn't a shorthand version, since your comprehension would need to keep track of the current state which isn't doable. Like you said, the answer is a for loop:
old = {-1: 1, 0: 2, 1:3}
new = {}
for k, v in old.items():
    new[k**2] = new.get(k**2, 0) + v
I saw the trick of using the dict.get method somewhere in the Python docs. It does the same thing as:
if k**2 in new:
    new[k**2] += v
else:
    new[k**2] = v
But this variation uses the get method which returns a default 0 which is added on to the value that will be assigned (when the key doesn't exist). Since it is 0, and the values are numbers being added, 0 has no effect. By contrast, if you needed to get the product, you'd use 1 as the default as starting off with 0 will mean that you never increase the value.
In addition, the latter, more verbose, method shown above evaluates k**2 twice each cycle which uses up computation. To make it use 1 calculation would require another line of code which in my opinion isn't worth the time when the get method is so much cleaner.
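The point about default values can be sketched as follows, with made-up input values and the hypothetical names sums and products. The sum uses 0 as the default passed to get, the product uses 1, and the squared key is stored in a local variable so it is computed once per iteration.

```python
old = {-1: 2, 0: 5, 1: 3}   # made-up values for illustration
sums = {}
products = {}
for k, v in old.items():
    key = k ** 2                                  # computed once per iteration
    sums[key] = sums.get(key, 0) + v              # 0 is neutral for addition
    products[key] = products.get(key, 1) * v      # 1 is neutral for multiplication

print(sums)      # {1: 5, 0: 5}
print(products)  # {1: 6, 0: 5}
```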
One of the fastest ways to calculate the sums is to use defaultdict - a self-initializing dictionary:
from collections import defaultdict
new = defaultdict(int)
for k, v in old.items():
    new[k**2] += v

Python count elements in dictionary comprehension when count is above a threshold

I want to populate a dictionary with the counts of various items in a list, but only when the count exceeds a certain number. (This is in Python 2.7)
For example:
x = [2,3,4,2,3,5,6] if I only want numbers that appear twice or more, I would want only
d = {2: 2, 3: 2} as an output.
I wanted to do this with a dictionary comprehension, for example
{(num if x.count(num) >= 2): x.count(num) for num in x}
But this throws an "invalid syntax" error, and it seems I need to set some default key, which means some key I don't want being added to the dictionary which I then have to remove.
What I'm doing now is in two lines:
d = {(num if x.count(num) >= 2 else None): x.count(num) for num in x}
d.pop(None, None)
But is there a way to do it in one, or to do the dictionary comprehension with an if statement without actually adding any default key for the else statement?
Use Counter to count each item in x, then use a dictionary comprehension to pull out the entries whose count is greater than or equal to your threshold (e.g. 2). This uses iteritems() because the question targets Python 2.7; on Python 3, use c.items() instead.
from collections import Counter
x = [2, 3, 4, 2, 3, 5, 6]
threshold = 2
c = Counter(x)
d = {k: v for k, v in c.iteritems() if v >= threshold}
>>> d
{2: 2, 3: 2}
That works:
{ i: x.count(i) for i in x if x.count(i) >= 2}
The if part must be after the for, not before, that's why you get the syntax error.
To avoid counting elements twice, and without any extra import, you could also use two nested comprehensions (actually the inner one is a generator to avoid iterating the full list twice) :
>>> { j: n for j, n in ((i, x.count(i)) for i in x) if n >= 2}
{2: 2, 3: 2}
The test in your expression: (num if x.count(num) >= 2 else None) comes too late: you already instructed the dict comp to issue a value. You have to filter it out beforehand.
Just move the condition from the ternary to the filter part of the comprehension:
x = [2,3,4,2,3,5,6]
d = {num: x.count(num) for num in x if x.count(num) >= 2}
That said, this method isn't very efficient, because it counts each element twice.
Filter a Counter instead:
import collections
d = {num:count for num,count in collections.Counter(x).items() if count>=2}
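As a rough sketch of why the Counter version scales better: each list.count call scans the whole list, so calling it per element is quadratic, while Counter tallies everything in one pass. The wrapper name counts_at_least is made up for this example, and the printed key order assumes Python 3.7 or later, where dictionaries keep insertion order.

```python
from collections import Counter


def counts_at_least(items, threshold):
    """Map each item to its count, keeping only counts >= threshold."""
    return {item: n for item, n in Counter(items).items() if n >= threshold}


x = [2, 3, 4, 2, 3, 5, 6]
print(counts_at_least(x, 2))  # {2: 2, 3: 2}

# Same result as the quadratic version that calls list.count per element.
quadratic = {n: x.count(n) for n in set(x) if x.count(n) >= 2}
print(counts_at_least(x, 2) == quadratic)  # True
```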
This should work:
a = [1,2,2,2,3,4,4,5,6,2,2,2]
{n: a.count(n) for n in set(a) if a.count(n) >= 2}
{2: 6, 4: 2}
This counts every element; add a condition such as if a.count(i) >= 2 to the comprehension to apply the threshold:
Input:
a = [2,2,2,2,1,1,1,3,3,4]
Code:
x = { i : a.count(i) for i in a }
print(x)
Output:
{2: 4, 1: 3, 3: 2, 4: 1}

python quickest way to merge dictionaries based on key match

I have 2 lists of dictionaries. List A is 34,000 long, list B is 650,000 long. I am essentially inserting all the List B dicts into the List A dicts based on a key match. Currently, I am doing the obvious, but its taking forever (seriously, like a day). There must be a quicker way!
for a in listA:
    a['things'] = []
    for b in listB:
        if a['ID'] == b['ID']:
            a['things'].append(b)
from collections import defaultdict
dictB = defaultdict(list)
for b in listB:
    dictB[b['ID']].append(b)
for a in listA:
    a['things'] = []
    for b in dictB[a['ID']]:
        a['things'].append(b)
This will turn your algorithm from O(n*m) to O(n+m), where n = len(listA) and m = len(listB).
Basically, it avoids looping through every dict in listB for each dict in listA by precomputing which dicts from listB match each 'ID'.
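The same indexing idea, wrapped in a function and run on a few made-up records, might look like this. The function name attach_things, the key parameter, and the sample data are assumptions for illustration only.

```python
from collections import defaultdict


def attach_things(list_a, list_b, key="ID"):
    """Attach to each dict in list_a the dicts in list_b that share its key."""
    by_id = defaultdict(list)
    for b in list_b:                 # one pass over list_b: O(m)
        by_id[b[key]].append(b)
    for a in list_a:                 # one pass over list_a: O(n)
        # .get avoids inserting empty lists for IDs that have no match.
        a["things"] = list(by_id.get(a[key], []))
    return list_a


list_a = [{"ID": 1}, {"ID": 2}, {"ID": 3}]
list_b = [
    {"ID": 1, "v": "x"},
    {"ID": 3, "v": "y"},
    {"ID": 1, "v": "z"},
    {"ID": 9, "v": "w"},             # no matching ID in list_a
]
attach_things(list_a, list_b)
print([len(a["things"]) for a in list_a])  # [2, 0, 1]
```

Each list is traversed once, and every lookup in by_id takes constant time on average.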
Here's an approach that may help. I'll leave it to you to fill in the details.
Your code is slow because it is an O(n^2) algorithm, comparing every A against every B.
If you first sort both listA and listB by ID (these are O(n log n) operations), you can then iterate through the sorted versions of A and B in a single pass (this takes linear time).
This approach is common when you have to do external merges on very large data sets. Mihai's answer is better for internal merging, where you simply index everything by id (in memory). If you have the memory to hold these additional structures, and dictionary lookup is constant time, that approach will likely be faster, not to mention simpler. :)
By way of example let's say A had the following ids after sorting
acfgjp
and B had these ids, again after sorting
aaaabbbbcccddeeeefffggiikknnnnppppqqqrrr
The idea is, strangely enough, to keep indexes into A and B (I know that does not sound very Pythonic). At first you are looking at a in A and a in B. So you walk through B adding all the a's to your "things" array for a. Once you exhaust the a's in B, you move up one in A, to c. But the next item in B is b, which is less than c, so you have to skip the b's. Then you arrive at a c in B, so you can start adding into "things" for c. Continue in this fashion until both lists are exhausted. Just one pass. :)
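The walk described above can be sketched as follows, using the example IDs from this answer. The function name merge_sorted is made up, and the sketch assumes that IDs are unique within A (as in the example) and that each record is a dict with an "ID" entry.

```python
def merge_sorted(list_a, list_b, key="ID"):
    """Fill a['things'] for each a using one pass over both sorted lists.

    Assumes the IDs in list_a are unique.
    """
    a_sorted = sorted(list_a, key=lambda d: d[key])
    b_sorted = sorted(list_b, key=lambda d: d[key])
    j = 0                                     # index into b_sorted
    for a in a_sorted:
        a["things"] = []
        # Skip B entries whose ID is smaller than the current A ID.
        while j < len(b_sorted) and b_sorted[j][key] < a[key]:
            j += 1
        # Collect B entries whose ID equals the current A ID.
        while j < len(b_sorted) and b_sorted[j][key] == a[key]:
            a["things"].append(b_sorted[j])
            j += 1
    return a_sorted


list_a = [{"ID": c} for c in "acfgjp"]
list_b = [{"ID": c} for c in "aaaabbbbcccddeeeefffggiikknnnnppppqqqrrr"]
result = merge_sorted(list_a, list_b)
print({a["ID"]: len(a["things"]) for a in result})
# {'a': 4, 'c': 3, 'f': 3, 'g': 2, 'j': 0, 'p': 4}
```

The index j only moves forward, so after sorting, the merge itself touches each element of B at most once.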
I'd convert ListA and ListB into dictionaries instead, dictionaries with ID as the key. Then it is a simple matter to append data using python's quick dictionary lookups:
from collections import defaultdict
class thingdict(dict):
    def __init__(self, *args, **kwargs):
        things = []
        super(thingdict,self).__init__(*args, things=things, **kwargs)
A = defaultdict(thingdict)
A[1] = defaultdict(list)
A[2] = defaultdict(list, things=[6]) # with some dummy data
A[3] = defaultdict(list, things=[7])
B = {1: 5, 2: 6, 3: 7, 4: 8, 5: 9}
for k, v in B.items():
    # print k,v
    A[k]['things'].append(v)
print A
print B
This returns:
defaultdict(<class '__main__.thingdict'>, {
1: defaultdict(<type 'list'>, {'things': [5]}),
2: defaultdict(<type 'list'>, {'things': [6, 6]}),
3: defaultdict(<type 'list'>, {'things': [7, 7]}),
4: {'things': [8]},
5: {'things': [9]}
})
{1: 5, 2: 6, 3: 7, 4: 8, 5: 9}
