Spark select top values in RDD

The original dataset is:
# (numbersofrating,title,avg_rating)
newRDD =[(3,'monster',4),(4,'minions 3D',5),....]
I want to select the top N avg_rating values in newRDD. I use the following code, but it raises an error.
selectnewRDD = (newRDD.map(x, key =lambda x: x[2]).sortBy(......))
TypeError: map() takes no keyword arguments
The expected data should be:
# (numbersofrating,title,avg_rating)
selectnewRDD =[(4,'minions 3D',5),(3,'monster',4)....]

You can use either top or takeOrdered with key argument:
newRDD.top(2, key=lambda x: x[2])
or
newRDD.takeOrdered(2, key=lambda x: -x[2])
Note that top returns the largest elements in descending order, while takeOrdered returns the smallest elements in ascending order, so the key function is negated in the second case.
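To see why the two keys differ, here is a local sketch that needs no Spark cluster. It uses heapq.nlargest and heapq.nsmallest from the standard library as stand-ins for top and takeOrdered, on the assumption that they order their results the same way. The third tuple is invented sample data.

```python
import heapq

# (number_of_ratings, title, avg_rating); the third row is invented sample data
ratings = [(3, 'monster', 4), (4, 'minions 3D', 5), (10, 'up', 3)]

# Stand-in for top(2, key=...): the two largest avg_rating values, largest first
largest = heapq.nlargest(2, ratings, key=lambda x: x[2])

# Stand-in for takeOrdered(2, key=...): the two smallest keys,
# so the key is negated to surface the largest ratings
smallest_negated = heapq.nsmallest(2, ratings, key=lambda x: -x[2])

print(largest)
print(smallest_negated)
```

Both calls should print the same two tuples, highest rating first.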

Have you tried using top? Given that you want the top avg ratings (and it is the third item in the tuple), you'll need to assign it to the key using a lambda function.
# items = (number_of_ratings, title, avg_rating)
newRDD = sc.parallelize([(3, 'monster', 4), (4, 'minions 3D', 5)])
top_n = 10
>>> newRDD.top(top_n, key=lambda items: items[2])
[(4, 'minions 3D', 5), (3, 'monster', 4)]

Related

How to sort sub-items in a list by ascending order to display them in a tuples

Let's say I have exactly this list: ["('Jeff', 8)", "('Louis', 9)","('Deandre', 5)"]
I want to be able to display exactly this in the console:
('Deandre',5)
('Jeff',8)
('Louis',9)
so that the 2nd elements of the tuples are in ascending order (5, 8, 9...).
I tried something like this:
sorted(mylist, key=lambda x: x[1])
But nothing...

First parse the strings to tuples using ast.literal_eval, then sort:
from ast import literal_eval
lst = ["('Jeff', 8)", "('Louis', 9)", "('Deandre', 5)"]
out = sorted(map(literal_eval, lst), key=lambda x: x[1])
print(*out, sep="\n")
Prints:
('Deandre', 5)
('Jeff', 8)
('Louis', 9)

Sort list of objects based on multiple attributes of the objects? [duplicate]

I have a list of lists:
[[12, 'tall', 'blue', 1],
[2, 'short', 'red', 9],
[4, 'tall', 'blue', 13]]
If I wanted to sort by one element, say the tall/short element, I could do it via s = sorted(s, key = itemgetter(1)).
If I wanted to sort by both tall/short and colour, I could do the sort twice, once for each element, but is there a quicker way?

A key can be a function that returns a tuple:
s = sorted(s, key = lambda x: (x[1], x[2]))
Or you can achieve the same using itemgetter (which is faster and avoids a Python function call):
import operator
s = sorted(s, key = operator.itemgetter(1, 2))
And notice that here you can use sort instead of using sorted and then reassigning:
s.sort(key = operator.itemgetter(1, 2))

I'm not sure if this is the most pythonic method ...
I had a list of tuples that needed sorting 1st by descending integer values and 2nd alphabetically. This required reversing the integer sort but not the alphabetical sort. Here was my solution: (on the fly in an exam btw, I was not even aware you could 'nest' sorted functions)
a = [('Al', 2),('Bill', 1),('Carol', 2), ('Abel', 3), ('Zeke', 2), ('Chris', 1)]
b = sorted(sorted(a, key = lambda x : x[0]), key = lambda x : x[1], reverse = True)
print(b)
[('Abel', 3), ('Al', 2), ('Carol', 2), ('Zeke', 2), ('Bill', 1), ('Chris', 1)]
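The nested call works because Python's sort is stable: the inner sort fixes the alphabetical order, and the outer sort keeps that order among equal integers. As a small sketch using the same data, a single sort with a negated integer in a tuple key should give the same result:

```python
a = [('Al', 2), ('Bill', 1), ('Carol', 2), ('Abel', 3), ('Zeke', 2), ('Chris', 1)]

# Two passes: secondary key first, then primary key (ties keep alphabetical order)
nested = sorted(sorted(a, key=lambda x: x[0]), key=lambda x: x[1], reverse=True)

# One pass: negating the integer makes ascending key order mean descending values
single = sorted(a, key=lambda x: (-x[1], x[0]))

print(nested == single)
print(single)
```

The negation trick only applies to numeric keys; for non-numeric keys the two-pass form is the usual fallback.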

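Several years late to the party, but I wanted to both sort on 2 criteria and use reverse=True. In case someone else wants to know how, you can wrap your criteria (functions) in parentheses so that the key returns a tuple; note that reverse=True then reverses the order for both criteria: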
s = sorted(my_list, key=lambda i: ( criteria_1(i), criteria_2(i) ), reverse=True)
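A short sketch of what that means in practice, with made-up rows: reverse=True on a tuple key flips every criterion, so when only one criterion should be descending and the keys are not numeric, one option is two stable sorts.

```python
from operator import itemgetter

# Made-up (height, colour) rows for illustration
rows = [('tall', 'blue'), ('short', 'red'), ('tall', 'amber')]

# Tuple key with reverse=True: both height and colour end up descending
both_desc = sorted(rows, key=lambda r: (r[0], r[1]), reverse=True)

# Height descending, colour ascending: sort by the secondary key first,
# then by the primary key; the stable sort preserves the colour order within ties
mixed = sorted(sorted(rows, key=itemgetter(1)), key=itemgetter(0), reverse=True)

print(both_desc)
print(mixed)
```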

It appears you could use a list instead of a tuple.
This becomes more important I think when you are grabbing attributes instead of 'magic indexes' of a list/tuple.
In my case I wanted to sort by multiple attributes of a class, where the incoming keys were strings. I needed different sorting in different places, and I wanted a common default sort for the parent class that clients were interacting with, only having to override the 'sorting keys' when I really needed to, and in a way that let me store them as lists that the class could share.
So first I defined a helper method
def attr_sort(self, attrs=['someAttributeString']):
    '''helper to sort by the attributes named by strings of attrs in order'''
    return lambda k: [getattr(k, attr) for attr in attrs]
then to use it
# would be defined elsewhere, but shown here for conciseness
self.SortListA = ['attrA', 'attrB']
self.SortListB = ['attrC', 'attrA']
records = .... #list of my objects to sort
records.sort(key=self.attr_sort(attrs=self.SortListA))
# perhaps later nearby or in another function
more_records = .... #another list
more_records.sort(key=self.attr_sort(attrs=self.SortListB))
This uses the generated lambda to sort the list by object.attrA and then object.attrB, assuming the object has an attribute for each of the string names provided. The second case would sort by object.attrC and then object.attrA.
This also lets you expose sorting choices so they can be shared by a consumer or a unit test, or lets callers tell you how they want sorting done for some operation in your API by giving you only a list, without coupling them to your back-end implementation.
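For reference, a self-contained sketch of the same idea as a plain function, using operator.attrgetter in place of the hand-written lambda. The record objects and their values are hypothetical; only the attribute names attrA, attrB and attrC come from the answer.

```python
from operator import attrgetter
from types import SimpleNamespace

def attr_sort(attrs):
    """Return a key function that reads the named attributes in order."""
    return attrgetter(*attrs)

# Hypothetical records standing in for "my objects to sort"
records = [
    SimpleNamespace(attrA=2, attrB='x', attrC='m'),
    SimpleNamespace(attrA=1, attrB='z', attrC='n'),
    SimpleNamespace(attrA=1, attrB='y', attrC='k'),
]

by_a_then_b = sorted(records, key=attr_sort(['attrA', 'attrB']))
by_c_then_a = sorted(records, key=attr_sort(['attrC', 'attrA']))

print([(r.attrA, r.attrB) for r in by_a_then_b])
print([(r.attrC, r.attrA) for r in by_c_then_a])
```

Taking the attribute list as a required argument also sidesteps the mutable default argument in the original helper.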

Convert the list of lists into a list of tuples, then sort the tuples by multiple fields.
data=[[12, 'tall', 'blue', 1],[2, 'short', 'red', 9],[4, 'tall', 'blue', 13]]
data=[tuple(x) for x in data]
result = sorted(data, key = lambda x: (x[1], x[2]))
print(result)
output:
[(2, 'short', 'red', 9), (12, 'tall', 'blue', 1), (4, 'tall', 'blue', 13)]

Here's one way: rewrite your comparison function to take a list of comparison functions, each of which compares one attribute you want to test. On each comparison, check whether the cmp function returns a non-zero value; if so, stop and return that value.
You call it by passing sort a lambda that calls a function that walks a list of lambdas.
Its advantage is that it sorts the data once rather than re-sorting the result of a previous sort, as other methods do. It also sorts in place, whereas sorted returns a new list.
I used it to write a rank function that ranks a list of objects, where each object is in a group and has a score, but you can add any list of attributes.
Note the un-lambda-like, hackish use of a lambda to call a setter.
The rank part won't work for a list of lists, but the sort will.
# First, here's a pure list version
from functools import cmp_to_key  # Python 3: sort() only accepts key=, so wrap the comparator
def cmp(a, b):
    # Python 3 dropped the built-in cmp(); this returns -1, 0 or 1 like the original
    return (a > b) - (a < b)
my_sortLambdaLst = [lambda x, y: cmp(x[0], y[0]), lambda x, y: cmp(x[1], y[1])]
def multi_attribute_sort(x, y):
    r = 0
    for l in my_sortLambdaLst:
        r = l(x, y)
        if r != 0:
            return r  # keep looping till you see a difference
    return r
Lst = [(4, 2.0), (4, 0.01), (4, 0.9), (4, 0.999), (4, 0.2), (1, 2.0), (1, 0.01), (1, 0.9), (1, 0.999), (1, 0.2)]
Lst.sort(key=cmp_to_key(lambda x, y: multi_attribute_sort(x, y)))  # The Lambda of the Lambda
for rec in Lst:
    print(rec)
Here's a way to rank a list of objects
class probe:
    def __init__(self, group, score):
        self.group = group
        self.score = score
        self.rank = -1
    def set_rank(self, r):
        self.rank = r
    def __str__(self):
        return '\t'.join([str(self.group), str(self.score), str(self.rank)])
def RankLst(inLst, group_lambda=lambda x: x.group, sortLambdaLst=[lambda x, y: cmp(x.group, y.group), lambda x, y: cmp(x.score, y.score)], SetRank_Lambda=lambda x, rank: x.set_rank(rank)):
    # Inner function is the only way (I could think of) to pass the sortLambdaLst into a sort function
    def multi_attribute_sort(x, y):
        r = 0
        for l in sortLambdaLst:
            r = l(x, y)
            if r != 0:
                return r  # keep looping till you see a difference
        return r
    inLst.sort(key=cmp_to_key(lambda x, y: multi_attribute_sort(x, y)))
    # Now rank your probes: the rank restarts at 1 whenever the group changes
    rank = 0
    last_group = group_lambda(inLst[0])
    for rec in inLst:
        group = group_lambda(rec)
        if last_group == group:
            rank += 1
        else:
            rank = 1
            last_group = group
        SetRank_Lambda(rec, rank)  # This is pure evil!! The lambda purists are gnashing their teeth
Lst = [probe(4, 2.0), probe(4, 0.01), probe(4, 0.9), probe(4, 0.999), probe(4, 0.2), probe(1, 2.0), probe(1, 0.01), probe(1, 0.9), probe(1, 0.999), probe(1, 0.2)]
RankLst(Lst, group_lambda=lambda x: x.group, sortLambdaLst=[lambda x, y: cmp(x.group, y.group), lambda x, y: cmp(x.score, y.score)], SetRank_Lambda=lambda x, rank: x.set_rank(rank))
print('\t'.join(['group', 'score', 'rank']))
for r in Lst:
    print(r)
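For comparison, here is a sketch of the same group-then-score ranking written with a tuple key and itertools.groupby instead of comparison functions. This is a swapped-in technique, not the approach of the answer above, and it works on plain (group, score) pairs rather than probe objects.

```python
from itertools import groupby

# (group, score) pairs, the same values as the pure-list example
pairs = [(4, 2.0), (4, 0.01), (4, 0.9), (4, 0.999), (4, 0.2),
         (1, 2.0), (1, 0.01), (1, 0.9), (1, 0.999), (1, 0.2)]

# One in-place sort on a tuple key: group first, then score
pairs.sort(key=lambda p: (p[0], p[1]))

# groupby yields runs of equal groups; enumerate restarts the rank at 1 for each run
ranked = [
    (group, score, rank)
    for group, members in groupby(pairs, key=lambda p: p[0])
    for rank, (_, score) in enumerate(members, start=1)
]

print('\t'.join(['group', 'score', 'rank']))
for row in ranked:
    print(*row, sep='\t')
```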

There is a < operator between lists, which compares them element by element from the left, e.g.:
[12, 'tall', 'blue', 1] < [4, 'tall', 'blue', 13]
will give
False
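A small sketch of how that comparison plays out, which is also why sorted works on a list of lists without any key. The list c is an extra row added for illustration.

```python
a = [12, 'tall', 'blue', 1]
b = [4, 'tall', 'blue', 13]
c = [4, 'tall', 'blue', 1]   # extra row, added for illustration

first_decides = a < b    # first elements differ, and 12 < 4 is False
last_decides = c < b     # first three elements are equal, so 1 < 13 decides
ordered = sorted([a, b, c])

print(first_decides, last_decides)
print(ordered)
```

In Python 3 this only works when elements at the same position are mutually comparable; comparing an int with a str at the deciding position raises TypeError.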

Sorting based on frequency and alphabetical order

I am trying to sort by frequency and, where frequencies tie, display the items in alphabetical order.
After counting frequencies, I have a list of (string, count) tuples,
e.g. tmp = [("xyz", 1), ("foo", 2 ) , ("bar", 2)]
I then sort as sorted(tmp, reverse=True)
This gives me [("foo", 2 ) , ("bar", 2), ("xyz", 1)]
How can I make them sort alphabetically, in ascending order, when the frequency is the same? I am trying to figure out the comparator function.
Expected output: [("bar", 2), ("foo", 2 ), ("xyz", 1)]

You have to sort by multiple keys.
sorted(tmp, key=lambda x: (-x[1], x[0]))
Source: Sort a list by multiple attributes?.
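As an end-to-end sketch, assuming the counts come from collections.Counter (the question does not say how they were produced) and using a made-up word list that yields the same counts:

```python
from collections import Counter

# Made-up input that produces the counts from the question
words = ["foo", "xyz", "bar", "foo", "bar"]
tmp = list(Counter(words).items())

# Negated count puts higher frequencies first; the string breaks ties alphabetically
result = sorted(tmp, key=lambda x: (-x[1], x[0]))
print(result)
```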

Use this code:
from operator import itemgetter
tmp = [('xyz',1), ('foo', 2 ) , ('bar', 2)]
print(sorted(tmp, key=itemgetter(0,1)))
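This avoids a Python-level function call, but note that itemgetter(0, 1) sorts by string first and then by count, so it matches the expected output only for this particular sample; it does not order by descending frequency.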

how to sort a list by two different keys?

db = {'Adam': {('Cleaning',4), ('Tutoring',2), ('Baking',1)},
      'Betty': {('Gardening',2), ('Tutoring',1), ('Cleaning',3)},
      'Charles': {('Plumbing',2), ('Cleaning',5)},
      'Diane': {('Laundry',2), ('Cleaning',4), ('Gardening',3)}}
def who(db : {str:{(str,int)}}, job: str, min_skill : int) -> [(str,int)]:
    result = []
    if type(min_skill) != int:
        raise AssertionError
    if min_skill < 0 or min_skill > 5:
        raise AssertionError
    for key,value in db.items():
        for item in value:
            if item[0] == job and item[1] >= min_skill:
                result.append((key,item[1]))
    return sorted(result,key = lambda x: x[1],reverse = True )
The who function takes a database, a job (str), and a minimum skill level (int) as arguments; it returns a list of 2-tuples of persons and their skill levels, sorted by decreasing skill level. If two people have the same skill level, they should appear alphabetically.
My function is able to sort the list by skill level, but is not able to sort it alphabetically.
I got the following error:
*Error: who(db,'Cleaning' ,4) -> [('Charles', 5), ('Diane', 4), ('Adam', 4)] but should -> [('Charles', 5), ('Adam', 4), ('Diane', 4)]
Can someone help me fix my code so that it sorts by skill level and then alphabetically?

Pass sorted a key function that returns a tuple. Negate the second element (skill level) to sort it in descending order; if skill levels tie, the entries are then sorted by name in ascending order:
sorted(result,key = lambda x: (-x[1], x[0]))
who(db, "Cleaning", 4)
# [('Charles', 5), ('Adam', 4), ('Diane', 4)]

You can sort on two attributes by giving the key argument a function that returns a tuple. Try this:
return sorted(result, key = lambda x: (-x[1], x[0]))
Here you make x[1] negative inside the tuple to sort by skill level in decreasing order, then ascending order for the names. This eliminates the need for the reverse=True.
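Putting the fix into the whole function, a runnable sketch might look like this. The db variable name is taken from the error message in the question; the list comprehension is a restyling, not the asker's original loop.

```python
db = {
    'Adam': {('Cleaning', 4), ('Tutoring', 2), ('Baking', 1)},
    'Betty': {('Gardening', 2), ('Tutoring', 1), ('Cleaning', 3)},
    'Charles': {('Plumbing', 2), ('Cleaning', 5)},
    'Diane': {('Laundry', 2), ('Cleaning', 4), ('Gardening', 3)},
}

def who(db, job, min_skill):
    # Same validation as the question: an int between 0 and 5
    if type(min_skill) is not int or not 0 <= min_skill <= 5:
        raise AssertionError
    result = [
        (person, skill)
        for person, skills in db.items()
        for skill_job, skill in skills
        if skill_job == job and skill >= min_skill
    ]
    # Skill level descending (negated), then name ascending
    return sorted(result, key=lambda x: (-x[1], x[0]))

print(who(db, 'Cleaning', 4))
```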
