I'm trying to do grouping in python in a one line expression. I want to build a dict of groups and number of items in the group:
{k: {'objects': list(g), 'count': len(list(g))}
for k,g in groupby(rows, key=lambda x: x['group_id'])}
But g is an iterator, so the first list(g) call exhausts it and the second use, 'count': len(list(g)), gets an empty list.
How can I do the counting and reuse g in a one-line expression?
You can't consume an iterator more than once, no; a second list() call on the same iterator returns an empty list. You have to store the result of the first call.
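A minimal demonstration of the problem, with toy data:

```python
# An iterator is consumed by the first list() call, so a second
# list() call on it has nothing left and returns an empty list.
it = iter([1, 2, 3])
first = list(it)   # consumes every item
second = list(it)  # the iterator is already exhausted
```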
Your options are, in order of feasibility:
To not use a one-liner. Use a regular for loop and assign the list() result to a separate variable first.
Wrap the groupby() iterator in a generator expression that applies list() to the group object.
Add a second loop over a single-element tuple containing the list() result, so you can use the loop target as a variable for both entries in the dictionary you are building.
Wait for Python 3.8, which adds PEP 572 assignment expressions, and assign the list() result to a name you can re-use for len().
The first should be the preferred option. Readability counts!
result = {}
for group_id, group in groupby(rows, key=lambda x: x['group_id']):
objects = list(group)
result[group_id] = {'objects': objects, 'count': len(objects)}
Using a generator expression is perhaps the next best option:
list_group = ((k, list(g)) for k, g in groupby(rows, key=lambda x: x['group_id']))
result = {k: {'objects': gl, 'count': len(gl)} for k, gl in list_group}
The generator expression is evaluated lazily, in tandem with the dict comprehension: each (k, list(g)) pair is produced only as for k, gl in list_group requests it.
The second-loop option looks like this:
{
k: {'objects': gl, 'count': len(gl)}
for k, g in groupby(rows, key=lambda x: x['group_id'])
for gl in (list(g),)
}
Because this trick is surprising and hard to read, I strongly recommend against using it.
In Python 3.8, with PEP 572 implemented, you can use:
{
k: {'objects': (gl := list(g)), 'count': len(gl)}
for k, g in groupby(rows, key=lambda x: x['group_id'])
}
Iterators can be 'doubled' with itertools.tee(), but tee() then has to cache the whole group in memory separately, doubling the memory cost, and the code would become no more readable: you'd need a similar trick just to store the two tee() iterators in variables anyway.
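For completeness, a quick sketch of what tee() does with a plain iterator (toy data, not the groupby case):

```python
from itertools import tee

# tee() splits one iterator into two independent ones, buffering
# internally every value that one copy has seen but the other hasn't.
source = iter([1, 2, 3])
a, b = tee(source)
first = list(a)   # consumes the source; values are cached for b
second = list(b)  # served entirely from tee's internal buffer
```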
Related
I am very new to python and I was wondering how to get the following dictionary from the list of tuples?
Question
x = [('A',1),('B',2),('C',3),('A',10),('B',10)]
required_dict = {'A': 11,'B': 12, 'C': 3}
Easiest way IMO is using defaultdict and a single for loop:
from collections import defaultdict
required_dict = defaultdict(int)
for k, v in x:
required_dict[k] += v
You could also do this in a single line with a nested comprehension, but this is less efficient because it involves iterating over x repeatedly instead of doing it in a single pass:
required_dict = {k: sum(v for k1, v in x if k1 == k) for k, v in x}
Another comprehension-based solution that doesn't involve redundant iteration would be to use groupby in order to iterate only within each group of identical keys:
from itertools import groupby
required_dict = {
k: sum(v for _, v in g)
for k, g in groupby(sorted(x), key=lambda t: t[0])
}
These three approaches are respectively:
O(n) (single iteration)
O(n^2) (re-iteration for each element)
O(n log n) (a full sort followed by a single iteration)
from itertools import groupby
from operator import itemgetter
d = [{'k': 'v1'}]
r = ((k, v) for k, v in groupby(d, key=itemgetter('k')))
for k, v in r:
print(k, list(v)) # v1 [{'k': 'v1'}]
print('---')
r = {k: v for k, v in groupby(d, key=itemgetter('k'))}
for k, v in r.items():
print(k, list(v)) # v1 []
Seems like some quirk, or am I missing something?
This is a documented part of itertools.groupby:
The returned group is itself an iterator that shares the underlying iterable with groupby(). Because the source is shared, when the groupby() object is advanced, the previous group is no longer visible. So, if that data is needed later, it should be stored as a list
In other words, you need to consume each group before advancing the groupby iterator to the next one -- in this case, that means consuming it from inside the dict comprehension itself, by making the list there:
from itertools import groupby
from operator import itemgetter
d = [{'k': 'v1'}]
r = {k: list(v) for k, v in groupby(d, key=itemgetter('k'))}
for k, v in r.items():
print(k, v) # v1 [{'k': 'v1'}]
In your first example, because you are using a generator expression, you don't actually start iterating the groupby iterator until you start the for loop. However, you would have the same issue if you used a non-lazy list comprehension instead of a generator (i.e. r = [(k, v) for k, v in groupby(d, key=itemgetter('k'))]).
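You can check this with the same single-element d as above: the eager list comprehension exhausts the groupby iterator, so every stored group is already empty by the time you read it.

```python
from itertools import groupby
from operator import itemgetter

d = [{'k': 'v1'}]
# The list comprehension advances groupby all the way to the end,
# invalidating each group object before we ever consume it.
r = [(k, v) for k, v in groupby(d, key=itemgetter('k'))]
emptied = [(k, list(v)) for k, v in r]  # every group is now empty
```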
Why does it work this way?
Preserving lazy iteration is the motivating idea behind itertools. Because it is dealing with (possibly large, or infinite) iterators, it never wants to store any values in memory. It just calls next() on the underlying iterator and does something with that value. Once you've called next() you can't go back to earlier values (without storing them, which itertools doesn't want to do).
With groupby it's easier to see with an example. Here is a simple generator that makes alternating ranges of positive and negative numbers and a groupby iterator that groups them:
def make_groups():
i = 1
while True:
for n in range(1, 10):
print("yielding: ", n*i)
yield n * i
i *= -1
g = make_groups()
grouper = groupby(g, key=lambda x: x>0)
make_groups prints a line each time next() is called, before yielding the value, to make it easier to see what's happening. When we call next() on grouper, this triggers a next() call on g and gets our first group and value:
> k, gr = next(grouper)
yielding: 1
Now each next() call on gr results in a next() call to the underlying g as you can see from the print:
> next(gr)
1 # already have this value from the initial next(grouper)
> next(gr)
yielding: 2 # gets the next value and advances the underlying generator to the next yield
2
Now look what happens if we call next() on grouper to get the next group:
> next(grouper)
yielding: 3
yielding: 4
yielding: 5
yielding: 6
yielding: 7
yielding: 8
yielding: 9
yielding: -1
groupby iterated through the generator until it hit a value that changed the key. Those values have already been yielded by g. We can no longer get the next value of gr (i.e. 3) unless we had somehow stored all those values, or somehow tee'd the underlying g generator off into two independent generators. Neither is a good solution for the default implementation (especially since the point of itertools is not to do this), so it is left up to you: store the values before anything calls next(grouper) and advances the generator past the values you wanted.
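On a finite toy sequence, the "store each group before advancing" fix looks like this:

```python
from itertools import groupby

# Values alternating in sign, grouped by whether each is positive.
# list(g) stores each group's values before groupby advances past them.
data = [1, 2, -1, -2, 3]
stored = [(k, list(g)) for k, g in groupby(data, key=lambda x: x > 0)]
```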
This is a fairly simple question that came up while I was checking out various discussions on dictionary comprehensions over here.
So I did something like this:
newlist=[{v:k} for k,v in dicnew.items() if v==value]
>>> newlist
[{2: 'a'}, {2: 'ac'}, {2: 'b'}, {2: 'love'}]
What I did next is something like this:
newlist.setdefault(v,[]).append((v,k) for k,v in dicnew.items() if v==value)
>>> newlist
{'go': [<generator object <genexpr> at 0x01E98D50>]}
What just happened? What is this 'go'?
You called .setdefault() with a value of v:
newlist.setdefault(v,[])
and v must have been defined already and set to 'go'. Had you run this code in a new interpreter, or executed del v first, Python would have raised a NameError exception instead:
>>> newlist = {}
>>> newlist.setdefault(v,[]).append((v,k) for k,v in dicnew.items() if v==value)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
NameError: name 'v' is not defined
>>> v = 'go'
>>> newlist.setdefault(v, [])
[]
>>> newlist
{'go': []}
The setdefault() part is executed before the .append() method is executed. The v name used in the generator expression inside the .append() method has nothing to do with the v name used in the .setdefault() call.
In Python 2.7 and earlier, list comprehension variables 'leak' to the parent scope:
>>> [foo for foo in range(3)]
[0, 1, 2]
>>> foo
2
As such, v was set in your previous loop, and the last value assigned to it is 'go'.
If you wanted to 'invert' a dictionary by collecting a list of keys for each value, use:
from collections import defaultdict
keys_for_value = defaultdict(list)
for key, value in original_dict.iteritems():
keys_for_value[value].append(key)
If you must insist on a one-liner, use itertools.groupby and sorting:
from itertools import groupby
from operator import itemgetter
v = itemgetter(1)
keys_for_value = {value: [k for k, v in items] for value, items in groupby(sorted(original_dict.iteritems(), key=v), key=v)}
This is going to be slower, as you need to sort the dictionary items first (cost O(n log n)) before looping over the sorted results (itself O(n), so a total of O(n) + O(n log n)), as opposed to a simple O(n) complexity of using defaultdict and a for loop.
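As a runnable sketch (Python 3 syntax, with a made-up dictionary reusing the values from the question above), the defaultdict inversion would go like this:

```python
from collections import defaultdict

# Hypothetical dictionary to invert; several keys share the value 2
original_dict = {'a': 2, 'ac': 2, 'b': 2, 'love': 2, 'go': 1}

keys_for_value = defaultdict(list)
for key, value in original_dict.items():  # .iteritems() on Python 2
    keys_for_value[value].append(key)
```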
So I've got a comprehension to the effect of:
dict((x.key, x.value) for x in y)
The problem, of course, is that if there are multiple x's with the same x.key, they get collapsed, and only the last x.value for that x.key survives. I want the values of the resulting dict to be lists instead:
{
'key1': ['value1'],
'key2': ['value2', 'value3', 'value4'],
'key3': ['value5'],
# etc.
}
Is this logic possible with a comprehension?
You can add the elements one by one to a dictionary that contains empty lists by default:
import collections
result_dict = collections.defaultdict(list)
for x in y:
result_dict[x.key].append(x.value)
You can also do something very similar without having to use the collections module:
result_dict = {}
for x in y:
result_dict.setdefault(x.key, []).append(x.value)
but this is arguably slightly less legible.
An equivalent, more legible (no need to "parse" the less common setdefault) but more pedestrian, base Python approach is:
result_dict = {}
for x in y:
if x.key not in result_dict:
result_dict[x.key] = []
result_dict[x.key].append(x.value)
The first solution is clearly the preferred one, as it is at the same time concise, legible, and fast.
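To make the preferred solution concrete, here's a runnable sketch; the Item namedtuple and the sample y are made-up stand-ins, since the question doesn't show what the objects in y look like:

```python
from collections import defaultdict, namedtuple

# Hypothetical objects with .key/.value attributes, like the question's y
Item = namedtuple('Item', ['key', 'value'])
y = [Item('key1', 'value1'), Item('key2', 'value2'),
     Item('key2', 'value3'), Item('key2', 'value4')]

# Missing keys start out as empty lists, so we can append unconditionally
result_dict = defaultdict(list)
for x in y:
    result_dict[x.key].append(x.value)
```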
Nope. You cannot do this in a comprehension.
But you can use itertools.groupby.
I'm not saying it's the right thing to do, but, just out of sheer intellectual curiosity..
You can use itertools.groupby and a lambda to do it in one dict comprehension, if that's what you really want to do (where l is the list of tuples you want to make a dict out of; note that l must already be sorted by key, since groupby only groups consecutive items):
dict((k, [v[1] for v in vs]) for (k, vs) in itertools.groupby(l, lambda x: x[0]))
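A runnable sketch with toy tuples, wrapping the input in sorted() so that equal keys sit next to each other:

```python
import itertools

# Toy input, deliberately unsorted; sort first because groupby
# only merges adjacent items with equal keys.
l = [('key2', 'value2'), ('key1', 'value1'), ('key2', 'value3')]
grouped = dict((k, [v[1] for v in vs])
               for (k, vs) in itertools.groupby(sorted(l), lambda x: x[0]))
```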
I have dictionary like
d = {'user_id':1, 'user':'user1', 'group_id':3, 'group_name':'ordinary users'}
and "mapping" dictionary like:
m = {'user_id':'uid', 'group_id':'gid', 'group_name':'group'}
All I want to do is "replace" keys in the first dictionary with values from the second one. The expected output is:
d = {'uid':1, 'user':'user1', 'gid':3, 'group':'ordinary users'}
I know that keys are immutable and I know how to do it with an 'if/else' statement.
But maybe there is way to do it in one line expression?
Sure:
d = dict((m.get(k, k), v) for (k, v) in d.items())
Let's take the excellent code from #karlknechtel and see what it does:
>>> d = dict((m.get(k, k), v) for (k, v) in d.items())
{'gid': 3, 'group': 'ordinary users', 'uid': 1, 'user': 'user1'}
But how does it work?
To build a dictionary, you can use the dict() function. It expects a list of tuples. In 3.x and 2.7+, you can also use a dictionary comprehension (see the answer by #nightcracker).
Let's dissect the argument of dict. At first, we need a list of all items in m. Every item is a tuple in the format (key, value).
>>> d.items()
[('group_id', 3), ('user_id', 1), ('user', 'user1'), ('group_name', 'ordinary users')]
Given a key value k, we could get the right key value from m by doing m[k].
>>> k = 'user_id'
>>> m[k]
'uid'
Unfortunately, not all keys in d also exist in m.
>>> k = 'user'
>>> m[k]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
KeyError: 'user'
To work around that, you can use d.get(x, y), which returns d[x] if the key x exists, or the default value y if it doesn't. Now, if a key k from d doesn't exist in m, we just keep it, so the default is k.
>>> m.get(k, k)
'user'
Now we are ready to build a list of tuples to supply to dict(). To build a list in one line, we can use list comprehension.
To build a list of squares, you would write this:
>>> [x**2 for x in range(5)]
[0, 1, 4, 9, 16]
In our case, it looks like this:
>>> [(m.get(k, k), v) for (k, v) in d.items()]
[('gid', 3), ('uid', 1), ('user', 'user1'), ('group', 'ordinary users')]
That's a mouthful, let's look at that again.
Give me a list [...], which consists of tuples:
[(.., ..) ...]
I want one tuple for every item x in d:
[(.., ..) for x in d.items()]
We know that every item is a tuple with two components, so we can expand it to two variables k and v.
[(.., ..) for (k, v) in d.items()]
Every tuple should have the right key from m as first component, or k if k doesn't exist in m, and the value from d.
[(m.get(k, k), v) for (k, v) in d.items()]
We can pass it as argument to dict().
>>> dict([(m.get(k, k), v) for (k, v) in d.items()])
{'gid': 3, 'group': 'ordinary users', 'uid': 1, 'user': 'user1'}
Looks good! But wait, you might say, #karlknechtel didn't use square brackets.
Right, he didn't use a list comprehension, but a generator expression. Simply speaking, the difference is that a list comprehension builds the whole list in memory, while a generator expression produces one item at a time. If a list only serves as an intermediate result, it's usually a good idea to use a generator expression. In this example it doesn't really make a difference, but it's a good habit to get into.
The equivalent generator expression looks like this:
>>> ((m.get(k, k), v) for (k, v) in d.items())
<generator object <genexpr> at 0x1004b61e0>
If you pass a generator expression as argument to a function, you can usually omit the outer parentheses. Finally, we get:
>>> dict((m.get(k, k), v) for (k, v) in d.items())
{'gid': 3, 'group': 'ordinary users', 'uid': 1, 'user': 'user1'}
There happens quite a lot in one line of code. Some say this is unreadable, but once you are used to it, stretching this code over several lines seems unreadable. Just don't overdo it. List comprehension and generator expressions are very powerful, but with great power comes great responsibility. +1 for a good question!
In 3.x:
d = {m.get(key, key):value for key, value in d.items()}
It works by creating a new dictionary which contains every value from d mapped to a new key. The key is retrieved like this: m[key] if key in m else key, but then with the .get() method (which supports a default value for when the key doesn't exist).
Why would you want to do it in one line?
result = {}
for k, v in d.iteritems():
result[m.get(k, k)] = v