Sort a list by multiple attributes?

Sort a list by multiple attributes? - python

I have a list of lists:
[[12, 'tall', 'blue', 1],
[2, 'short', 'red', 9],
[4, 'tall', 'blue', 13]]
If I wanted to sort by one element, say the tall/short element, I could do it via s = sorted(s, key = itemgetter(1)).
If I wanted to sort by both tall/short and colour, I could do the sort twice, once for each element, but is there a quicker way?

A key can be a function that returns a tuple:
s = sorted(s, key = lambda x: (x[1], x[2]))
Or you can achieve the same using itemgetter (which is faster and avoids a Python function call):
import operator
s = sorted(s, key = operator.itemgetter(1, 2))
And notice that here you can use sort instead of using sorted and then reassigning:
s.sort(key = operator.itemgetter(1, 2))

I'm not sure if this is the most pythonic method ...
I had a list of tuples that needed sorting 1st by descending integer values and 2nd alphabetically. This required reversing the integer sort but not the alphabetical sort. Here was my solution: (on the fly in an exam btw, I was not even aware you could 'nest' sorted functions)
a = [('Al', 2),('Bill', 1),('Carol', 2), ('Abel', 3), ('Zeke', 2), ('Chris', 1)]
b = sorted(sorted(a, key = lambda x : x[0]), key = lambda x : x[1], reverse = True)
print(b)
[('Abel', 3), ('Al', 2), ('Carol', 2), ('Zeke', 2), ('Bill', 1), ('Chris', 1)]

Several years late to the party but I want to both sort on 2 criteria and use reverse=True. In case someone else wants to know how, you can wrap your criteria (functions) in parenthesis:
s = sorted(my_list, key=lambda i: ( criteria_1(i), criteria_2(i) ), reverse=True)

It appears you could use a list instead of a tuple.
This becomes more important I think when you are grabbing attributes instead of 'magic indexes' of a list/tuple.
In my case I wanted to sort by multiple attributes of a class, where the incoming keys were strings. I needed different sorting in different places, and I wanted a common default sort for the parent class that clients were interacting with; only having to override the 'sorting keys' when I really 'needed to', but also in a way that I could store them as lists that the class could share
So first I defined a helper method
def attr_sort(self, attrs=['someAttributeString']:
'''helper to sort by the attributes named by strings of attrs in order'''
return lambda k: [ getattr(k, attr) for attr in attrs ]
then to use it
# would defined elsewhere but showing here for consiseness
self.SortListA = ['attrA', 'attrB']
self.SortListB = ['attrC', 'attrA']
records = .... #list of my objects to sort
records.sort(key=self.attr_sort(attrs=self.SortListA))
# perhaps later nearby or in another function
more_records = .... #another list
more_records.sort(key=self.attr_sort(attrs=self.SortListB))
This will use the generated lambda function sort the list by object.attrA and then object.attrB assuming object has a getter corresponding to the string names provided. And the second case would sort by object.attrC then object.attrA.
This also allows you to potentially expose outward sorting choices to be shared alike by a consumer, a unit test, or for them to perhaps tell you how they want sorting done for some operation in your api by only have to give you a list and not coupling them to your back end implementation.

convert the list of list into a list of tuples then sort the tuple by multiple fields.
data=[[12, 'tall', 'blue', 1],[2, 'short', 'red', 9],[4, 'tall', 'blue', 13]]
data=[tuple(x) for x in data]
result = sorted(data, key = lambda x: (x[1], x[2]))
print(result)
output:
[(2, 'short', 'red', 9), (12, 'tall', 'blue', 1), (4, 'tall', 'blue', 13)]

Here's one way: You basically re-write your sort function to take a list of sort functions, each sort function compares the attributes you want to test, on each sort test, you look and see if the cmp function returns a non-zero return if so break and send the return value.
You call it by calling a Lambda of a function of a list of Lambdas.
Its advantage is that it does single pass through the data not a sort of a previous sort as other methods do. Another thing is that it sorts in place, whereas sorted seems to make a copy.
I used it to write a rank function, that ranks a list of classes where each object is in a group and has a score function, but you can add any list of attributes.
Note the un-lambda-like, though hackish use of a lambda to call a setter.
The rank part won't work for an array of lists, but the sort will.
#First, here's a pure list version
my_sortLambdaLst = [lambda x,y:cmp(x[0], y[0]), lambda x,y:cmp(x[1], y[1])]
def multi_attribute_sort(x,y):
r = 0
for l in my_sortLambdaLst:
r = l(x,y)
if r!=0: return r #keep looping till you see a difference
return r
Lst = [(4, 2.0), (4, 0.01), (4, 0.9), (4, 0.999),(4, 0.2), (1, 2.0), (1, 0.01), (1, 0.9), (1, 0.999), (1, 0.2) ]
Lst.sort(lambda x,y:multi_attribute_sort(x,y)) #The Lambda of the Lambda
for rec in Lst: print str(rec)
Here's a way to rank a list of objects
class probe:
def __init__(self, group, score):
self.group = group
self.score = score
self.rank =-1
def set_rank(self, r):
self.rank = r
def __str__(self):
return '\t'.join([str(self.group), str(self.score), str(self.rank)])
def RankLst(inLst, group_lambda= lambda x:x.group, sortLambdaLst = [lambda x,y:cmp(x.group, y.group), lambda x,y:cmp(x.score, y.score)], SetRank_Lambda = lambda x, rank:x.set_rank(rank)):
#Inner function is the only way (I could think of) to pass the sortLambdaLst into a sort function
def multi_attribute_sort(x,y):
r = 0
for l in sortLambdaLst:
r = l(x,y)
if r!=0: return r #keep looping till you see a difference
return r
inLst.sort(lambda x,y:multi_attribute_sort(x,y))
#Now Rank your probes
rank = 0
last_group = group_lambda(inLst[0])
for i in range(len(inLst)):
rec = inLst[i]
group = group_lambda(rec)
if last_group == group:
rank+=1
else:
rank=1
last_group = group
SetRank_Lambda(inLst[i], rank) #This is pure evil!! The lambda purists are gnashing their teeth
Lst = [probe(4, 2.0), probe(4, 0.01), probe(4, 0.9), probe(4, 0.999), probe(4, 0.2), probe(1, 2.0), probe(1, 0.01), probe(1, 0.9), probe(1, 0.999), probe(1, 0.2) ]
RankLst(Lst, group_lambda= lambda x:x.group, sortLambdaLst = [lambda x,y:cmp(x.group, y.group), lambda x,y:cmp(x.score, y.score)], SetRank_Lambda = lambda x, rank:x.set_rank(rank))
print '\t'.join(['group', 'score', 'rank'])
for r in Lst: print r

There is a operator < between lists e.g.:
[12, 'tall', 'blue', 1] < [4, 'tall', 'blue', 13]
will give
False

Related

Sort list of objects based on multiple attributes of the objects? [duplicate]

I have a list of lists:
[[12, 'tall', 'blue', 1],
[2, 'short', 'red', 9],
[4, 'tall', 'blue', 13]]
If I wanted to sort by one element, say the tall/short element, I could do it via s = sorted(s, key = itemgetter(1)).
If I wanted to sort by both tall/short and colour, I could do the sort twice, once for each element, but is there a quicker way?

A key can be a function that returns a tuple:
s = sorted(s, key = lambda x: (x[1], x[2]))
Or you can achieve the same using itemgetter (which is faster and avoids a Python function call):
import operator
s = sorted(s, key = operator.itemgetter(1, 2))
And notice that here you can use sort instead of using sorted and then reassigning:
s.sort(key = operator.itemgetter(1, 2))

I'm not sure if this is the most pythonic method ...
I had a list of tuples that needed sorting 1st by descending integer values and 2nd alphabetically. This required reversing the integer sort but not the alphabetical sort. Here was my solution: (on the fly in an exam btw, I was not even aware you could 'nest' sorted functions)
a = [('Al', 2),('Bill', 1),('Carol', 2), ('Abel', 3), ('Zeke', 2), ('Chris', 1)]
b = sorted(sorted(a, key = lambda x : x[0]), key = lambda x : x[1], reverse = True)
print(b)
[('Abel', 3), ('Al', 2), ('Carol', 2), ('Zeke', 2), ('Bill', 1), ('Chris', 1)]

Several years late to the party but I want to both sort on 2 criteria and use reverse=True. In case someone else wants to know how, you can wrap your criteria (functions) in parenthesis:
s = sorted(my_list, key=lambda i: ( criteria_1(i), criteria_2(i) ), reverse=True)

It appears you could use a list instead of a tuple.
This becomes more important I think when you are grabbing attributes instead of 'magic indexes' of a list/tuple.
In my case I wanted to sort by multiple attributes of a class, where the incoming keys were strings. I needed different sorting in different places, and I wanted a common default sort for the parent class that clients were interacting with; only having to override the 'sorting keys' when I really 'needed to', but also in a way that I could store them as lists that the class could share
So first I defined a helper method
def attr_sort(self, attrs=['someAttributeString']:
'''helper to sort by the attributes named by strings of attrs in order'''
return lambda k: [ getattr(k, attr) for attr in attrs ]
then to use it
# would defined elsewhere but showing here for consiseness
self.SortListA = ['attrA', 'attrB']
self.SortListB = ['attrC', 'attrA']
records = .... #list of my objects to sort
records.sort(key=self.attr_sort(attrs=self.SortListA))
# perhaps later nearby or in another function
more_records = .... #another list
more_records.sort(key=self.attr_sort(attrs=self.SortListB))
This will use the generated lambda function sort the list by object.attrA and then object.attrB assuming object has a getter corresponding to the string names provided. And the second case would sort by object.attrC then object.attrA.
This also allows you to potentially expose outward sorting choices to be shared alike by a consumer, a unit test, or for them to perhaps tell you how they want sorting done for some operation in your api by only have to give you a list and not coupling them to your back end implementation.

convert the list of list into a list of tuples then sort the tuple by multiple fields.
data=[[12, 'tall', 'blue', 1],[2, 'short', 'red', 9],[4, 'tall', 'blue', 13]]
data=[tuple(x) for x in data]
result = sorted(data, key = lambda x: (x[1], x[2]))
print(result)
output:
[(2, 'short', 'red', 9), (12, 'tall', 'blue', 1), (4, 'tall', 'blue', 13)]

Here's one way: You basically re-write your sort function to take a list of sort functions, each sort function compares the attributes you want to test, on each sort test, you look and see if the cmp function returns a non-zero return if so break and send the return value.
You call it by calling a Lambda of a function of a list of Lambdas.
Its advantage is that it does single pass through the data not a sort of a previous sort as other methods do. Another thing is that it sorts in place, whereas sorted seems to make a copy.
I used it to write a rank function, that ranks a list of classes where each object is in a group and has a score function, but you can add any list of attributes.
Note the un-lambda-like, though hackish use of a lambda to call a setter.
The rank part won't work for an array of lists, but the sort will.
#First, here's a pure list version
my_sortLambdaLst = [lambda x,y:cmp(x[0], y[0]), lambda x,y:cmp(x[1], y[1])]
def multi_attribute_sort(x,y):
r = 0
for l in my_sortLambdaLst:
r = l(x,y)
if r!=0: return r #keep looping till you see a difference
return r
Lst = [(4, 2.0), (4, 0.01), (4, 0.9), (4, 0.999),(4, 0.2), (1, 2.0), (1, 0.01), (1, 0.9), (1, 0.999), (1, 0.2) ]
Lst.sort(lambda x,y:multi_attribute_sort(x,y)) #The Lambda of the Lambda
for rec in Lst: print str(rec)
Here's a way to rank a list of objects
class probe:
def __init__(self, group, score):
self.group = group
self.score = score
self.rank =-1
def set_rank(self, r):
self.rank = r
def __str__(self):
return '\t'.join([str(self.group), str(self.score), str(self.rank)])
def RankLst(inLst, group_lambda= lambda x:x.group, sortLambdaLst = [lambda x,y:cmp(x.group, y.group), lambda x,y:cmp(x.score, y.score)], SetRank_Lambda = lambda x, rank:x.set_rank(rank)):
#Inner function is the only way (I could think of) to pass the sortLambdaLst into a sort function
def multi_attribute_sort(x,y):
r = 0
for l in sortLambdaLst:
r = l(x,y)
if r!=0: return r #keep looping till you see a difference
return r
inLst.sort(lambda x,y:multi_attribute_sort(x,y))
#Now Rank your probes
rank = 0
last_group = group_lambda(inLst[0])
for i in range(len(inLst)):
rec = inLst[i]
group = group_lambda(rec)
if last_group == group:
rank+=1
else:
rank=1
last_group = group
SetRank_Lambda(inLst[i], rank) #This is pure evil!! The lambda purists are gnashing their teeth
Lst = [probe(4, 2.0), probe(4, 0.01), probe(4, 0.9), probe(4, 0.999), probe(4, 0.2), probe(1, 2.0), probe(1, 0.01), probe(1, 0.9), probe(1, 0.999), probe(1, 0.2) ]
RankLst(Lst, group_lambda= lambda x:x.group, sortLambdaLst = [lambda x,y:cmp(x.group, y.group), lambda x,y:cmp(x.score, y.score)], SetRank_Lambda = lambda x, rank:x.set_rank(rank))
print '\t'.join(['group', 'score', 'rank'])
for r in Lst: print r

There is a operator < between lists e.g.:
[12, 'tall', 'blue', 1] < [4, 'tall', 'blue', 13]
will give
False

Sorting based on frequency and alphabetical order

I am trying to sort based on frequency and display it in alphabetical order
After freq counting , I have a list with (string, count) tuple
E.g tmp = [("xyz", 1), ("foo", 2 ) , ("bar", 2)]
I then sort as sorted(tmp, reverse=True)
This gives me [("foo", 2 ) , ("bar", 2), ("xyz", 1)]
How can I make them sort alphabetically in lowest order when frequency same, Trying to figure out the comparator function
expected output:[("bar", 2), ("foo", 2 ), ("xyz", 1)]

You have to sort by multiple keys.
sorted(tmp, key=lambda x: (-x[1], x[0]))
Source: Sort a list by multiple attributes?.

Use this code:
from operator import itemgetter
tmp = [('xyz',1), ('foo', 2 ) , ('bar', 2)]
print(sorted(tmp, key=itemgetter(0,1)))
This skips the usage of function call.

Spark select top values in RDD

The original dataset is:
# (numbersofrating,title,avg_rating)
newRDD =[(3,'monster',4),(4,'minions 3D',5),....]
I want to select top N avg_ratings in newRDD.I use the following code,it has an error.
selectnewRDD = (newRDD.map(x, key =lambda x: x[2]).sortBy(......))
TypeError: map() takes no keyword arguments
The expected data should be:
# (numbersofrating,title,avg_rating)
selectnewRDD =[(4,'minions 3D',5),(3,'monster',4)....]

You can use either top or takeOrdered with key argument:
newRDD.top(2, key=lambda x: x[2])
or
newRDD.takeOrdered(2, key=lambda x: -x[2])
Note that top is taking elements in descending order and takeOrdered in ascending so key function is different in both cases.

Have you tried using top? Given that you want the top avg ratings (and it is the third item in the tuple), you'll need to assign it to the key using a lambda function.
# items = (number_of_ratings, title, avg_rating)
newRDD = sc.parallelize([(3, 'monster', 4), (4, 'minions 3D', 5)])
top_n = 10
>>> newRDD.top(top_n, key=lambda items: items[2])
[(4, 'minions 3D', 5), (3, 'monster', 4)]

How to use bisect.insort_left with a key?

Doc's are lacking an example...How do you use bisect.insort_left)_ based on a key?
Trying to insert based on key.
bisect.insort_left(data, ('brown', 7))
puts insert at data[0].
From docs...
bisect.insort_left(a, x, lo=0, hi=len(a))
Insert x in a in sorted order. This is equivalent to a.insert(bisect.bisect_left(a, x, lo, hi), x) assuming that a is already sorted. Keep in mind that the O(log n) search is dominated by the slow O(n) insertion step.
Sample usage:
>>> data = [('red', 5), ('blue', 1), ('yellow', 8), ('black', 0)]
>>> data.sort(key=lambda r: r[1])
>>> keys = [r[1] for r in data] # precomputed list of keys
>>> data[bisect_left(keys, 0)]
('black', 0)
>>> data[bisect_left(keys, 1)]
('blue', 1)
>>> data[bisect_left(keys, 5)]
('red', 5)
>>> data[bisect_left(keys, 8)]
('yellow', 8)
>>>
I want to put ('brown', 7) after ('red', 5) on sorted list in data using bisect.insort_left. Right now bisect.insort_left(data, ('brown', 7)) puts ('brown', 7) at data[0]...because I am not using the keys to do insert...docs don't show to do inserts using the keys.

You could wrap your iterable in a class that implements __getitem__ and __len__. This allows you the opportunity to use a key with bisect_left. If you set up your class to take the iterable and a key function as arguments.
To extend this to be usable with insort_left it's required to implement the insert method. The problem here is that if you do that is that insort_left will try to insert your key argument into the list containing the objects of which the the key is a member.
An example is clearer
from bisect import bisect_left, insort_left
class KeyWrapper:
def __init__(self, iterable, key):
self.it = iterable
self.key = key
def __getitem__(self, i):
return self.key(self.it[i])
def __len__(self):
return len(self.it)
def insert(self, index, item):
print('asked to insert %s at index%d' % (item, index))
self.it.insert(index, {"time":item})
timetable = [{"time": "0150"}, {"time": "0250"}, {"time": "0350"}, {"time": "0450"}, {"time": "0550"}, {"time": "0650"}, {"time": "0750"}]
bslindex = bisect_left(KeyWrapper(timetable, key=lambda t: t["time"]), "0359")
islindex = insort_left(KeyWrapper(timetable, key=lambda t: t["time"]), "0359")
See how in my insert method I had to make it specific to the timetable dictionary otherwise insort_left would try insert "0359" where it should insert {"time": "0359"}?
Ways round this could be to construct a dummy object for the comparison, inherit from KeyWrapper and override insert or pass some sort of factory function to create the object. None of these ways are particularly desirable from an idiomatic python point of view.
So the easiest way is to just use the KeyWrapper with bisect_left, which returns you the insert index and then do the insert yourself. You could easily wrap this in a dedicated function.
e.g.
bslindex = bisect_left(KeyWrapper(timetable, key=lambda t: t["time"]), "0359")
timetable.insert(bslindex, {"time":"0359"})
In this case ensure you don't implement insert, so you will be immediately aware if you accidentally pass a KeyWrapper to a mutating function like insort_left which probably wouldn't do the right thing.
To use your example data
from bisect import bisect_left
class KeyWrapper:
def __init__(self, iterable, key):
self.it = iterable
self.key = key
def __getitem__(self, i):
return self.key(self.it[i])
def __len__(self):
return len(self.it)
data = [('red', 5), ('blue', 1), ('yellow', 8), ('black', 0)]
data.sort(key=lambda c: c[1])
newcol = ('brown', 7)
bslindex = bisect_left(KeyWrapper(data, key=lambda c: c[1]), newcol[1])
data.insert(bslindex, newcol)
print(data)
Here is the class with proper typing:
from typing import TypeVar, Generic, Sequence, Callable
T = TypeVar('T')
V = TypeVar('V')
class KeyWrapper(Generic[T, V]):
def __init__(self, iterable: Sequence[T], key: Callable[[T], V]):
self.it = iterable
self.key = key
def __getitem__(self, i: int) -> V:
return self.key(self.it[i])
def __len__(self) -> int:
return len(self.it)

This does essentially the same thing the SortedCollection recipe does that the bisect documentation mentions in its See also: section at the end, but unlike the insert() method in the recipe, the function shown supports a key-function.
What's being done is a separate sorted keys list is maintained in parallel with the sorted data list to improve performance (it's faster than creating the keys list before each insertion, but keeping it around and updating it isn't strictly required). The ActiveState recipe encapsulated this for you within a class, but in the code below they're just two separate independent lists being passed around (so it'd be easier for them to get out of sync than it would be if they were both held in an instance of the recipe's class).
from bisect import bisect_left
def insert(seq, keys, item, keyfunc=lambda v: v):
"""Insert an item into a sorted list using a separate corresponding
sorted keys list and a keyfunc() to extract the key from each item.
Based on insert() method in SortedCollection recipe:
http://code.activestate.com/recipes/577197-sortedcollection/
"""
k = keyfunc(item) # Get key.
i = bisect_left(keys, k) # Determine where to insert item.
keys.insert(i, k) # Insert key of item to keys list.
seq.insert(i, item) # Insert the item itself in the corresponding place.
# Initialize the sorted data and keys lists.
data = [('red', 5), ('blue', 1), ('yellow', 8), ('black', 0)]
data.sort(key=lambda r: r[1]) # Sort data by key value
keys = [r[1] for r in data] # Initialize keys list
print(data) # -> [('black', 0), ('blue', 1), ('red', 5), ('yellow', 8)]
insert(data, keys, ('brown', 7), keyfunc=lambda x: x[1])
print(data) # -> [('black', 0), ('blue', 1), ('red', 5), ('brown', 7), ('yellow', 8)]
Follow-on question:
    Can bisect.insort_left be used?
No, you can't simply use the bisect.insort_left() function to do this because it wasn't written in a way that supports a key-function—instead it just compares the whole item passed to it to insert, x, with one of the whole items in the array in its if a[mid] < x: statement. You can see what I mean by looking at the source for the bisect module in Lib/bisect.py.
Here's the relevant excerpt:
def insort_left(a, x, lo=0, hi=None):
"""Insert item x in list a, and keep it sorted assuming a is sorted.
If x is already in a, insert it to the left of the leftmost x.
Optional args lo (default 0) and hi (default len(a)) bound the
slice of a to be searched.
"""
if lo < 0:
raise ValueError('lo must be non-negative')
if hi is None:
hi = len(a)
while lo < hi:
mid = (lo+hi)//2
if a[mid] < x: lo = mid+1
else: hi = mid
a.insert(lo, x)
You could modify the above to accept an optional key-function argument and use it:
def my_insort_left(a, x, lo=0, hi=None, keyfunc=lambda v: v):
x_key = keyfunc(x) # Get comparison value.
. . .
if keyfunc(a[mid]) < x_key: # Compare key values.
lo = mid+1
. . .
...and call it like this:
my_insort_left(data, ('brown', 7), keyfunc=lambda v: v[1])
Actually, if you're going to write a custom function, for the sake of more efficiency at the expense of unneeded generality, you could dispense with the adding of a generic key function argument and just hardcode everything to operate the way needed with the data format you have. This will avoid the overhead of repeated calls to a key-function while doing the insertions.
def my_insort_left(a, x, lo=0, hi=None):
x_key = x[1] # Key on second element of each item in sequence.
. . .
if a[mid][1] < x_key: lo = mid+1 # Compare second element to key.
. . .
...called this way without passing keyfunc:
my_insort_left(data, ('brown', 7))

Add comparison methods to your class
Sometimes this is the least painful way, especially if you already have a class and just want to sort by a key from it:
#!/usr/bin/env python3
import bisect
import functools
#functools.total_ordering
class MyData:
def __init__(self, color, number):
self.color = color
self.number = number
def __lt__(self, other):
return self.number < other.number
def __str__(self):
return '{} {}'.format(self.color, self.number)
mydatas = [
MyData('red', 5),
MyData('blue', 1),
MyData('yellow', 8),
MyData('black', 0),
]
mydatas_sorted = []
for mydata in mydatas:
bisect.insort(mydatas_sorted, mydata)
for mydata in mydatas_sorted:
print(mydata)
Output:
black 0
blue 1
red 5
yellow 8
See also: "Enabling" comparison for classes
Tested in Python 3.5.2.
Upstream requests/patches
I get the feeling this is going to happen sooner or later ;-)
https://github.com/python/cpython/pull/13970
https://bugs.python.org/issue4356

As of Python 3.10, all the binary search helpers in the bisect module now accept a key argument:
key specifies a key function of one argument that is used to extract a
comparison key from each input element. The default value is None
(compare the elements directly).
Therefore, you can pass the same function you used to sort the data:
>>> import bisect
>>> data = [('red', 5), ('blue', 1), ('yellow', 8), ('black', 0)]
>>> data.sort(key=lambda r: r[1])
>>> data
[('black', 0), ('blue', 1), ('red', 5), ('yellow', 8)]
>>> bisect.insort_left(data, ('brown', 7), key=lambda r: r[1])
>>> data
[('black', 0), ('blue', 1), ('red', 5), ('brown', 7), ('yellow', 8)]

If your goal is to mantain a list sorted by key, performing usual operations like bisect insert, delete and update, I think sortedcontainers should suit your needs as well, and you'll avoid O(n) inserts.

From python version 3.10, the key argument has been added.
It will be something like:
import bisect
bisect.bisect_left(('brown', 7), data, key=lambda r: r[1])
Sources:
GitHub feature request
Documentation for version 3.10
See that documentation for version 3.9 does not have the key argument.

How do I use itertools.groupby()?

I haven't been able to find an understandable explanation of how to actually use Python's itertools.groupby() function. What I'm trying to do is this:
Take a list - in this case, the children of an objectified lxml element
Divide it into groups based on some criteria
Then later iterate over each of these groups separately.
I've reviewed the documentation, but I've had trouble trying to apply them beyond a simple list of numbers.
So, how do I use of itertools.groupby()? Is there another technique I should be using? Pointers to good "prerequisite" reading would also be appreciated.

IMPORTANT NOTE: You have to sort your data first.
The part I didn't get is that in the example construction
groups = []
uniquekeys = []
for k, g in groupby(data, keyfunc):
groups.append(list(g)) # Store group iterator as a list
uniquekeys.append(k)
k is the current grouping key, and g is an iterator that you can use to iterate over the group defined by that grouping key. In other words, the groupby iterator itself returns iterators.
Here's an example of that, using clearer variable names:
from itertools import groupby
things = [("animal", "bear"), ("animal", "duck"), ("plant", "cactus"), ("vehicle", "speed boat"), ("vehicle", "school bus")]
for key, group in groupby(things, lambda x: x[0]):
for thing in group:
print("A %s is a %s." % (thing[1], key))
print("")
This will give you the output:
A bear is a animal.
A duck is a animal.
A cactus is a plant.
A speed boat is a vehicle.
A school bus is a vehicle.
In this example, things is a list of tuples where the first item in each tuple is the group the second item belongs to.
The groupby() function takes two arguments: (1) the data to group and (2) the function to group it with.
Here, lambda x: x[0] tells groupby() to use the first item in each tuple as the grouping key.
In the above for statement, groupby returns three (key, group iterator) pairs - once for each unique key. You can use the returned iterator to iterate over each individual item in that group.
Here's a slightly different example with the same data, using a list comprehension:
for key, group in groupby(things, lambda x: x[0]):
listOfThings = " and ".join([thing[1] for thing in group])
print(key + "s: " + listOfThings + ".")
This will give you the output:
animals: bear and duck.
plants: cactus.
vehicles: speed boat and school bus.

itertools.groupby is a tool for grouping items.
From the docs, we glean further what it might do:
# [k for k, g in groupby('AAAABBBCCDAABBB')] --> A B C D A B
# [list(g) for k, g in groupby('AAAABBBCCD')] --> AAAA BBB CC D
groupby objects yield key-group pairs where the group is a generator.
Features
A. Group consecutive items together
B. Group all occurrences of an item, given a sorted iterable
C. Specify how to group items with a key function *
Comparisons
# Define a printer for comparing outputs
>>> def print_groupby(iterable, keyfunc=None):
... for k, g in it.groupby(iterable, keyfunc):
... print("key: '{}'--> group: {}".format(k, list(g)))
# Feature A: group consecutive occurrences
>>> print_groupby("BCAACACAADBBB")
key: 'B'--> group: ['B']
key: 'C'--> group: ['C']
key: 'A'--> group: ['A', 'A']
key: 'C'--> group: ['C']
key: 'A'--> group: ['A']
key: 'C'--> group: ['C']
key: 'A'--> group: ['A', 'A']
key: 'D'--> group: ['D']
key: 'B'--> group: ['B', 'B', 'B']
# Feature B: group all occurrences
>>> print_groupby(sorted("BCAACACAADBBB"))
key: 'A'--> group: ['A', 'A', 'A', 'A', 'A']
key: 'B'--> group: ['B', 'B', 'B', 'B']
key: 'C'--> group: ['C', 'C', 'C']
key: 'D'--> group: ['D']
# Feature C: group by a key function
>>> # islower = lambda s: s.islower() # equivalent
>>> def islower(s):
... """Return True if a string is lowercase, else False."""
... return s.islower()
>>> print_groupby(sorted("bCAaCacAADBbB"), keyfunc=islower)
key: 'False'--> group: ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'D']
key: 'True'--> group: ['a', 'a', 'b', 'b', 'c']
Uses
Anagrams (see notebook)
Binning
Group odd and even numbers
Group a list by values
Remove duplicate elements
Find indices of repeated elements in an array
Split an array into n-sized chunks
Find corresponding elements between two lists
Compression algorithm (see notebook)/Run Length Encoding
Grouping letters by length, key function (see notebook)
Consecutive values over a threshold (see notebook)
Find ranges of numbers in a list or continuous items (see docs)
Find all related longest sequences
Take consecutive sequences that meet a condition (see related post)
Note: Several of the latter examples derive from Víctor Terrón's PyCon (talk) (Spanish), "Kung Fu at Dawn with Itertools". See also the groupby source code written in C.
* A function where all items are passed through and compared, influencing the result. Other objects with key functions include sorted(), max() and min().
Response
# OP: Yes, you can use `groupby`, e.g.
[do_something(list(g)) for _, g in groupby(lxml_elements, criteria_func)]

The example on the Python docs is quite straightforward:
groups = []
uniquekeys = []
for k, g in groupby(data, keyfunc):
groups.append(list(g)) # Store group iterator as a list
uniquekeys.append(k)
So in your case, data is a list of nodes, keyfunc is where the logic of your criteria function goes and then groupby() groups the data.
You must be careful to sort the data by the criteria before you call groupby or it won't work. groupby method actually just iterates through a list and whenever the key changes it creates a new group.

A neato trick with groupby is to run length encoding in one line:
[(c,len(list(cgen))) for c,cgen in groupby(some_string)]
will give you a list of 2-tuples where the first element is the char and the 2nd is the number of repetitions.
Edit: Note that this is what separates itertools.groupby from the SQL GROUP BY semantics: itertools doesn't (and in general can't) sort the iterator in advance, so groups with the same "key" aren't merged.

Another example:
for key, igroup in itertools.groupby(xrange(12), lambda x: x // 5):
print key, list(igroup)
results in
0 [0, 1, 2, 3, 4]
1 [5, 6, 7, 8, 9]
2 [10, 11]
Note that igroup is an iterator (a sub-iterator as the documentation calls it).
This is useful for chunking a generator:
def chunker(items, chunk_size):
'''Group items in chunks of chunk_size'''
for _key, group in itertools.groupby(enumerate(items), lambda x: x[0] // chunk_size):
yield (g[1] for g in group)
with open('file.txt') as fobj:
for chunk in chunker(fobj):
process(chunk)
Another example of groupby - when the keys are not sorted. In the following example, items in xx are grouped by values in yy. In this case, one set of zeros is output first, followed by a set of ones, followed again by a set of zeros.
xx = range(10)
yy = [0, 0, 0, 1, 1, 1, 0, 0, 0, 0]
for group in itertools.groupby(iter(xx), lambda x: yy[x]):
print group[0], list(group[1])
Produces:
0 [0, 1, 2]
1 [3, 4, 5]
0 [6, 7, 8, 9]

WARNING:
The syntax list(groupby(...)) won't work the way that you intend. It seems to destroy the internal iterator objects, so using
for x in list(groupby(range(10))):
print(list(x[1]))
will produce:
[]
[]
[]
[]
[]
[]
[]
[]
[]
[9]
Instead, of list(groupby(...)), try [(k, list(g)) for k,g in groupby(...)], or if you use that syntax often,
def groupbylist(*args, **kwargs):
return [(k, list(g)) for k, g in groupby(*args, **kwargs)]
and get access to the groupby functionality while avoiding those pesky (for small data) iterators all together.

I would like to give another example where groupby without sort is not working. Adapted from example by James Sulak
from itertools import groupby
things = [("vehicle", "bear"), ("animal", "duck"), ("animal", "cactus"), ("vehicle", "speed boat"), ("vehicle", "school bus")]
for key, group in groupby(things, lambda x: x[0]):
for thing in group:
print "A %s is a %s." % (thing[1], key)
print " "
output is
A bear is a vehicle.
A duck is a animal.
A cactus is a animal.
A speed boat is a vehicle.
A school bus is a vehicle.
there are two groups with vehicule, whereas one could expect only one group

#CaptSolo, I tried your example, but it didn't work.
from itertools import groupby
[(c,len(list(cs))) for c,cs in groupby('Pedro Manoel')]
Output:
[('P', 1), ('e', 1), ('d', 1), ('r', 1), ('o', 1), (' ', 1), ('M', 1), ('a', 1), ('n', 1), ('o', 1), ('e', 1), ('l', 1)]
As you can see, there are two o's and two e's, but they got into separate groups. That's when I realized you need to sort the list passed to the groupby function. So, the correct usage would be:
name = list('Pedro Manoel')
name.sort()
[(c,len(list(cs))) for c,cs in groupby(name)]
Output:
[(' ', 1), ('M', 1), ('P', 1), ('a', 1), ('d', 1), ('e', 2), ('l', 1), ('n', 1), ('o', 2), ('r', 1)]
Just remembering, if the list is not sorted, the groupby function will not work!

Sorting and groupby
from itertools import groupby
val = [{'name': 'satyajit', 'address': 'btm', 'pin': 560076},
{'name': 'Mukul', 'address': 'Silk board', 'pin': 560078},
{'name': 'Preetam', 'address': 'btm', 'pin': 560076}]
for pin, list_data in groupby(sorted(val, key=lambda k: k['pin']),lambda x: x['pin']):
... print pin
... for rec in list_data:
... print rec
...
o/p:
560076
{'name': 'satyajit', 'pin': 560076, 'address': 'btm'}
{'name': 'Preetam', 'pin': 560076, 'address': 'btm'}
560078
{'name': 'Mukul', 'pin': 560078, 'address': 'Silk board'}

Sadly I don’t think it’s advisable to use itertools.groupby(). It’s just too hard to use safely, and it’s only a handful of lines to write something that works as expected.
def my_group_by(iterable, keyfunc):
"""Because itertools.groupby is tricky to use
The stdlib method requires sorting in advance, and returns iterators not
lists, and those iterators get consumed as you try to use them, throwing
everything off if you try to look at something more than once.
"""
ret = defaultdict(list)
for k in iterable:
ret[keyfunc(k)].append(k)
return dict(ret)
Use it like this:
def first_letter(x):
return x[0]
my_group_by('four score and seven years ago'.split(), first_letter)
to get
{'f': ['four'], 's': ['score', 'seven'], 'a': ['and', 'ago'], 'y': ['years']}

How do I use Python's itertools.groupby()?
You can use groupby to group things to iterate over. You give groupby an iterable, and a optional key function/callable by which to check the items as they come out of the iterable, and it returns an iterator that gives a two-tuple of the result of the key callable and the actual items in another iterable. From the help:
groupby(iterable[, keyfunc]) -> create an iterator which returns
(key, sub-iterator) grouped by each value of key(value).
Here's an example of groupby using a coroutine to group by a count, it uses a key callable (in this case, coroutine.send) to just spit out the count for however many iterations and a grouped sub-iterator of elements:
import itertools
def grouper(iterable, n):
def coroutine(n):
yield # queue up coroutine
for i in itertools.count():
for j in range(n):
yield i
groups = coroutine(n)
next(groups) # queue up coroutine
for c, objs in itertools.groupby(iterable, groups.send):
yield c, list(objs)
# or instead of materializing a list of objs, just:
# return itertools.groupby(iterable, groups.send)
list(grouper(range(10), 3))
prints
[(0, [0, 1, 2]), (1, [3, 4, 5]), (2, [6, 7, 8]), (3, [9])]

This basic implementation helped me understand this function. Hope it helps others as well:
arr = [(1, "A"), (1, "B"), (1, "C"), (2, "D"), (2, "E"), (3, "F")]
for k,g in groupby(arr, lambda x: x[0]):
print("--", k, "--")
for tup in g:
print(tup[1]) # tup[0] == k
-- 1 --
A
B
C
-- 2 --
D
E
-- 3 --
F

One useful example that I came across may be helpful:
from itertools import groupby
#user input
myinput = input()
#creating empty list to store output
myoutput = []
for k,g in groupby(myinput):
myoutput.append((len(list(g)),int(k)))
print(*myoutput)
Sample input: 14445221
Sample output: (1,1) (3,4) (1,5) (2,2) (1,1)

from random import randint
from itertools import groupby
l = [randint(1, 3) for _ in range(20)]
d = {}
for k, g in groupby(l, lambda x: x):
if not d.get(k, None):
d[k] = list(g)
else:
d[k] = d[k] + list(g)
the code above shows how groupby can be used to group a list based on the lambda function/key supplied. The only problem is that the output is not merged, this can be easily resolved using a dictionary.
Example:
l = [2, 1, 2, 3, 1, 3, 2, 1, 3, 3, 1, 3, 2, 3, 1, 2, 1, 3, 2, 3]
after applying groupby the result will be:
for k, g in groupby(l, lambda x:x):
print(k, list(g))
2 [2]
1 [1]
2 [2]
3 [3]
1 [1]
3 [3]
2 [2]
1 [1]
3 [3, 3]
1 [1]
3 [3]
2 [2]
3 [3]
1 [1]
2 [2]
1 [1]
3 [3]
2 [2]
3 [3]
Once a dictionary is used as shown above following result is derived which can be easily iterated over:
{2: [2, 2, 2, 2, 2, 2], 1: [1, 1, 1, 1, 1, 1], 3: [3, 3, 3, 3, 3, 3, 3, 3]}

The key thing to recognize with itertools.groupby is that items are only grouped together as long as they're sequential in the iterable. This is why sorting works, because basically you're rearranging the collection so that all of the items which satisfy callback(item) now appear in the sorted collection sequentially.
That being said, you don't need to sort the list, you just need a collection of key-value pairs, where the value can grow in accordance to each group iterable yielded by groupby. i.e. a dict of lists.
>>> things = [("vehicle", "bear"), ("animal", "duck"), ("animal", "cactus"), ("vehicle", "speed boat"), ("vehicle", "school bus")]
>>> coll = {}
>>> for k, g in itertools.groupby(things, lambda x: x[0]):
... coll.setdefault(k, []).extend(i for _, i in g)
...
{'vehicle': ['bear', 'speed boat', 'school bus'], 'animal': ['duck', 'cactus']}

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Sort a list by multiple attributes? - python

Several years late to the party but I want to both sort on 2 criteria and use reverse=True. In case someone else wants to know how, you can wrap your criteria (functions) in parenthesis: s = sorted(my_list, key=lambda i: ( criteria_1(i), criteria_2(i) ), reverse=True)

There is a operator < between lists e.g.: [12, 'tall', 'blue', 1] < [4, 'tall', 'blue', 13] will give False

Related

Sort list of objects based on multiple attributes of the objects? [duplicate]

Sorting based on frequency and alphabetical order

Spark select top values in RDD

How to use bisect.insort_left with a key?

How do I use itertools.groupby()?

Categories

Resources