Optimize creation of dictionary - python

I have a list with ids called ids. Every element in ids is a string. One id can exist multiple times in this list.
My aim is to create a dictionary which has the number of occurrences as the key, and whose value is a list of the ids that appear that often.
My current approach looks like this:
from collections import defaultdict
import numpy as np
ids = ["foo", "foo", "bar", "hi", "hi"]
counts = defaultdict(list)
for id in np.unique(ids):
    counts[ids.count(id)].append(id)
Output:
print counts
--> defaultdict(<type 'list'>, {1: ['bar'], 2: ['foo', 'hi']})
This works nicely if the list of ids is not too long. However, for longer lists the performance is rather bad.
How can I make this faster?

Instead of calling count for each element in the list, create a collections.Counter for the entire list:
from collections import Counter, defaultdict
ids = ["foo", "foo", "bar", "hi", "hi"]
counts = defaultdict(list)
for i, c in Counter(ids).items():
    counts[c].append(i)
# counts: defaultdict(<class 'list'>, {1: ['bar'], 2: ['foo', 'hi']})
If you prefer a one-liner, you could also combine Counter.most_common (for a view of the elements sorted by count) and itertools.groupby, though I would rather not:
>>> {k: [v[0] for v in g] for k, g in groupby(Counter(ids).most_common(), lambda x: x[1])}
{1: ['bar'], 2: ['foo', 'hi']}
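The reason this is faster: ids.count(id) scans the whole list once per distinct id, so the original loop costs roughly O(n * u) for n ids and u distinct values, while Counter tallies everything in a single pass. A minimal sketch of the Counter approach wrapped in a function (the name group_ids_by_count is hypothetical, not from the question):

```python
from collections import Counter, defaultdict


def group_ids_by_count(ids):
    """Map each occurrence count to the list of ids seen that many times."""
    groups = defaultdict(list)
    # Counter tallies every id in one pass over the list.
    for item, n in Counter(ids).items():
        groups[n].append(item)
    return dict(groups)


result = group_ids_by_count(["foo", "foo", "bar", "hi", "hi"])
print(result)
```

Returning a plain dict at the end avoids accidentally creating empty entries when a missing count is looked up later.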

Related

Pythonic/List comprehension way of creating a list from a dict where the values are guaranteed to be int of range 0..len(dict) -1

Suppose I have a python dict like:
my_dict = {"foo": 1, "bar": 0, "more": 2}
I want to create an ordered list of the keys such that the order follows the value increasingly. The values are guaranteed to be unique and in the range of 0.. len(my_dict) -1. The following code achieves the objective:
my_list = []
for i in range(0, len(self.my_dict)):
    for key, idx in self.my_dict.items():
        if idx == i:
            my_list.append(key)
my_list.reverse()
Is there a more pythonic/list comprehension way of achieving the same objective?
Just plain old sort with a key:
>>> my_dict = {"foo": 1, "bar": 0, "more": 2}
>>> sorted(my_dict, key=my_dict.get)
['bar', 'foo', 'more']
Include the reverse keyword if you want decreasing values (your question says increasing, but your example code produces decreasing order, so I'm not sure):
>>> sorted(my_dict, key=my_dict.get, reverse=True)
['more', 'foo', 'bar']
The values are guaranteed to be unique and in the range of 0.. len(my_dict) -1.
In this special case you also have some O(n) solutions available. Sorting would be O(n log n).
>>> my_dict_inverse = {v: k for k, v in my_dict.items()}
>>> [my_dict_inverse[i] for i in range(len(my_dict))]
['bar', 'foo', 'more']
Or:
L = [None] * len(my_dict)
for key, i in my_dict.items():
    L[i] = key
I would still take the sorted approach unless the data is very big, though.
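As a rough sketch, the O(n) placement idea can be wrapped in a function. The name keys_in_value_order is hypothetical, and the code relies on the question's guarantee that the values are exactly 0..len(mapping)-1 with no repeats; it does not check that.

```python
def keys_in_value_order(mapping):
    """Return keys ordered by value, assuming values are exactly 0..len(mapping)-1."""
    ordered = [None] * len(mapping)
    for key, index in mapping.items():
        # Each value is used directly as the key's position in the result.
        ordered[index] = key
    return ordered


my_dict = {"foo": 1, "bar": 0, "more": 2}
my_list = keys_in_value_order(my_dict)
print(my_list)  # ['bar', 'foo', 'more']
```

If the guarantee does not hold, a value outside the range raises IndexError and a duplicate value silently overwrites an earlier key, so sorted() remains the safer default.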

python list of lists to dict when key appear many times

I know how to write something simple and slow with a loop, but I need it to run very fast at large scale.
input:
lst = [[1, 1, 2], ["txt1", "txt2", "txt3"]]
Desired output:
d = {1 : ["txt1", "txt2"], 2 : "txt3"}
Is there something built into Python that makes dict() extend the value for a repeated key instead of replacing it?
dict(list(zip(lst[0], lst[1])))
One option is to use dict.setdefault:
out = {}
for k, v in zip(*lst):
    out.setdefault(k, []).append(v)
Output:
{1: ['txt1', 'txt2'], 2: ['txt3']}
If you want the element itself for singleton lists, one way is adding a condition that checks for it while you build an output dictionary:
out = {}
for k,v in zip(*lst):
    if k in out:
        if isinstance(out[k], list):
            out[k].append(v)
        else:
            out[k] = [out[k], v]
    else:
        out[k] = v
or if lst[0] is sorted (like it is in your sample), you could use itertools.groupby:
from itertools import groupby
out = {}
pos = 0
for k, v in groupby(lst[0]):
    length = len([*v])
    if length > 1:
        out[k] = lst[1][pos:pos+length]
    else:
        out[k] = lst[1][pos]
    pos += length
Output:
{1: ['txt1', 'txt2'], 2: 'txt3'}
But as @timgeb notes, it's probably not something you want, because afterwards you'll have to check the data type each time you access this dictionary (whether the value is a list or not). That is an unnecessary problem you can avoid by having all values as lists.
If you're dealing with large datasets it may be useful to add a pandas solution.
>>> import pandas as pd
>>> lst = [[1, 1, 2], ["txt1", "txt2", "txt3"]]
>>> s = pd.Series(lst[1], index=lst[0])
>>> s
1 txt1
1 txt2
2 txt3
>>> s.groupby(level=0).apply(list).to_dict()
{1: ['txt1', 'txt2'], 2: ['txt3']}
Note that this also produces lists for single elements (e.g. ['txt3']) which I highly recommend. Having both lists and strings as possible values will result in bugs because both of those types are iterable. You'd need to remember to check the type each time you process a dict-value.
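To illustrate the kind of bug meant here, a small sketch (the helper name flatten is hypothetical): the same loop over the values behaves differently when a singleton is stored as a bare string, because iterating over a string yields its characters.

```python
mixed = {1: ["txt1", "txt2"], 2: "txt3"}      # singleton stored as a bare string
uniform = {1: ["txt1", "txt2"], 2: ["txt3"]}  # every value is a list


def flatten(d):
    """Collect every item from every value of d into one list."""
    return [item for value in d.values() for item in value]


print(flatten(uniform))  # ['txt1', 'txt2', 'txt3']
print(flatten(mixed))    # ['txt1', 'txt2', 't', 'x', 't', '3']
```

No exception is raised in the mixed case, which is what makes this kind of mistake easy to miss.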
You can use a defaultdict to group the strings by their corresponding key, then make a second pass through the list to extract the strings from singleton lists. Regardless of what you do, you'll need to access every element in both lists at least once, so some iteration structure is necessary (and even if you don't explicitly use iteration, whatever you use will almost definitely use iteration under the hood):
from collections import defaultdict
lst = [[1, 1, 2], ["txt1", "txt2", "txt3"]]
result = defaultdict(list)
for key, value in zip(lst[0], lst[1]):
    result[key].append(value)
for key in result:
    if len(result[key]) == 1:
        result[key] = result[key][0]
print(dict(result)) # Prints {1: ['txt1', 'txt2'], 2: 'txt3'}

Python array of tuples group by first, store second

So I have an array of tuples something like this
query_results = [("foo", "bar"), ("foo", "qux"), ("baz", "foo")]
I would like to achieve something like:
{
"foo": ["bar", "qux"],
"baz": ["foo"]
}
So I have tried using this
from itertools import groupby
grouped_results = {}
for key, y in groupby(query_results, lambda x: x[0]):
    grouped_results[key] = [y[1] for u in list(y)]
The issue I have is that although the number of keys is correct, the number of values in each list is dramatically lower than it should be. Can anyone explain why this happens and what I should be doing?
You'd better use a defaultdict for this:
from collections import defaultdict
result = defaultdict(list)
for k,v in query_results:
    result[k].append(v)
Which yields:
>>> result
defaultdict(<class 'list'>, {'baz': ['foo'], 'foo': ['bar', 'qux']})
If you wish to turn it into a vanilla dictionary again, you can - after the for loop - use:
result = dict(result)
this then results in:
>>> dict(result)
{'baz': ['foo'], 'foo': ['bar', 'qux']}
A defaultdict is constructed with a factory, here list. In case the key cannot be found in the dictionary, the factory is called (list() constructs a new empty list). The result is then associated with the key.
So for each key k that is not yet in the dictionary, we will construct a new list first. We then call .append(v) on that list to append values to it.
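A short sketch of that mechanism, using only the standard collections.defaultdict. The comments mark where the factory is and is not called:

```python
from collections import defaultdict

result = defaultdict(list)

# "foo" is missing, so list() is called and the new empty list is stored
# under "foo" before .append() runs on it.
result["foo"].append("bar")

# "foo" now exists, so the stored list is reused and no factory call happens.
result["foo"].append("qux")

print(result["foo"])    # ['bar', 'qux']
print("baz" in result)  # False: membership tests do not call the factory
```

Only item access with square brackets triggers the factory; checking membership with `in` or calling .get() leaves the dictionary unchanged.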
Well why not use a simple for loop?
grouped_results = {}
for key, value in query_results:
    grouped_results.setdefault(key, []).append(value)
Output:
{'foo': ['bar', 'qux'], 'baz': ['foo']}
How about using a defaultdict?
from collections import defaultdict
d = defaultdict(list)
for pair in query_results:
    d[pair[0]].append(pair[1])
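As for why the groupby attempt loses values: itertools.groupby only groups consecutive items with equal keys, so if the same key shows up again later in unsorted input, it starts a new group that overwrites the earlier dictionary entry. That is a likely explanation rather than a confirmed one, since the real data is not shown. A minimal sketch that sorts first, using a hypothetical input where equal keys are not adjacent:

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical input where equal keys are not adjacent.
query_results = [("foo", "bar"), ("baz", "foo"), ("foo", "qux")]

# Sort by the first element so that equal keys sit next to each other.
ordered = sorted(query_results, key=itemgetter(0))

grouped_results = {
    key: [second for _, second in group]
    for key, group in groupby(ordered, key=itemgetter(0))
}
print(grouped_results)  # {'baz': ['foo'], 'foo': ['bar', 'qux']}
```

The sort makes this O(n log n), so the defaultdict or setdefault loops above are still the simpler choice when the input is not already sorted.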

How can I keep python dictionary sorted?

Let's say I have a list like this:
x = ["foo", "foo", "bar", "baz", "foo", "bar"]
and I want to count number of each occurrence in a dict but I want them ordered while I am looping through the list. Like this:
from collections import defaultdict
ordered_dict = defaultdict(lambda: 0, {})
for line in x:
    ordered_dict[line] += 1
I want the result to be something like this:
{"foo":3, "bar":2, "baz":1}
I wonder if there is some way to keep the dictionary ordered while I am looping. Currently I use heapq after the loop.
You need an OrderedDict. You used a defaultdict instead; these are not sorted, by design.
https://docs.python.org/dev/library/collections.html#collections.OrderedDict
An OrderedDict keeps its keys in insertion order. If you want the dictionary sorted by value, you may want to use a Counter and then insert its entries into an OrderedDict.
Basically, you just used the wrong collection from collections:
>>> from collections import Counter, OrderedDict
>>> OrderedDict(Counter(["foo", "foo", "bar", "baz", "foo", "bar"]).most_common())
OrderedDict([('foo', 3), ('bar', 2), ('baz', 1)])
Code:
x = ["foo", "foo", "bar", "baz", "foo", "bar"]
d = {}
for item in x:
    d[item] = d.get(item,0) + 1
print(d)
Output:
{'bar': 2, 'baz': 1, 'foo': 3}
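For reference, a sketch of the count-then-sort idea on a current interpreter. It assumes Python 3.7 or later, where a plain dict keeps insertion order, so no OrderedDict is needed. The ordering is applied once after counting, not maintained during the loop.

```python
from collections import Counter

x = ["foo", "foo", "bar", "baz", "foo", "bar"]

# most_common() returns (item, count) pairs sorted by count, highest first.
ordered_counts = dict(Counter(x).most_common())
print(ordered_counts)  # {'foo': 3, 'bar': 2, 'baz': 1}
```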

How to implement associative array (not dictionary) in Python?

I am trying to print out a dictionary in Python:
Dictionary = {"Forename":"Paul","Surname":"Dinh"}
for Key,Value in Dictionary.iteritems():
    print Key,"=",Value
Although the item "Forename" is listed first, dictionaries in Python seem to be sorted by values, so the result looks like this:
Surname = Dinh
Forename = Paul
How can I print these out in the same order as in the code, or in the order in which the items were added (sorted neither by values nor by keys)?
You can use a list of tuples (or list of lists). Like this:
Arr= [("Forename","Paul"),("Surname","Dinh")]
for Key,Value in Arr:
    print Key,"=",Value
Forename = Paul
Surname = Dinh
you can make a dictionary out of this with:
Dictionary=dict(Arr)
And the correctly sorted keys like this:
keys = [k for k,v in Arr]
Then do this:
for k in keys: print k,Dictionary[k]
But I agree with the comments on your question: would it not be easier to sort the keys into the required order when looping instead?
EDIT: (thank you Rik Poggi), OrderedDict does this for you:
import collections
od=collections.OrderedDict(Arr)
for k in od: print k,od[k]
First of all, dictionaries are not sorted at all, neither by key nor by value.
Based on your description, what you actually need is the collections.OrderedDict class:
from collections import OrderedDict
my_dict = OrderedDict([("Forename", "Paul"), ("Surname", "Dinh")])
for key, value in my_dict.iteritems():
    print '%s = %s' % (key, value)
Note that you need to instantiate the OrderedDict from a list of tuples, not from another dict, because the dict instance will shuffle the order of the items before the OrderedDict is instantiated.
You can use collections.OrderedDict. It's available in Python 2.7 and Python 3.1+.
This may meet your need better:
Dictionary = {"Forename":"Paul","Surname":"Dinh"}
KeyList = ["Forename", "Surname"]
for Key in KeyList:
    print Key,"=",Dictionary[Key]
'but dictionaries in Python are sorted by values': maybe I'm mistaken here, but what gave you that idea? Dictionaries are not sorted by anything.
You have two options: either keep a list of keys in addition to the dictionary, or use a different data structure, such as an array of arrays.
I wonder if it is an ordered dict that you want:
>>> k = "one two three four five".strip().split()
>>> v = "a b c d e".strip().split()
>>> k
['one', 'two', 'three', 'four', 'five']
>>> v
['a', 'b', 'c', 'd', 'e']
>>> dx = dict(zip(k, v))
>>> dx
{'four': 'd', 'three': 'c', 'five': 'e', 'two': 'b', 'one': 'a'}
>>> for itm in dx:
...     print(itm)
four
three
five
two
one
>>> # instantiate this data structure from the OrderedDict class in the collections module
>>> from collections import OrderedDict
>>> dx = OrderedDict(zip(k, v))
>>> for itm in dx:
...     print(itm)
one
two
three
four
five
A dictionary created using OrderedDict preserves the original insertion order.
Put another way, such a dictionary iterates over the key/value pairs in the order in which they were inserted.
So, for instance, when you delete a key and then add the same key again, the iteration order changes:
>>> del dx['two']
>>> for itm in dx:
...     print(itm)
one
three
four
five
>>> dx['two'] = 'b'
>>> for itm in dx:
...     print(itm)
one
three
four
five
two
As of Python 3.7, regular dicts are guaranteed to be ordered, so you can just do
Dictionary = {"Forename":"Paul","Surname":"Dinh"}
for Key,Value in Dictionary.items():
    print(Key,"=",Value)
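A small sketch of what that guarantee means in practice, assuming Python 3.7 or later. The same delete-and-reinsert behaviour shown for OrderedDict above also applies to a plain dict:

```python
dictionary = {"Forename": "Paul", "Surname": "Dinh"}
print(list(dictionary))  # ['Forename', 'Surname']

# Updating an existing key keeps its original position.
dictionary["Forename"] = "Paula"
print(list(dictionary))  # ['Forename', 'Surname']

# Deleting a key and adding it back moves it to the end.
del dictionary["Forename"]
dictionary["Forename"] = "Paul"
print(list(dictionary))  # ['Surname', 'Forename']
```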
