I'm getting an error "int object is unsubscriptable" while executing the following script:
element.reduceByKey(lambda x, y: x[1] + y[1])
where element is a key-value RDD and the value is a tuple. Example input:
(A, (toto, 10))
(A, (titi, 30))
(5, (tata, 10))
(A, (toto, 10))
I understand that the reduceByKey function takes (K, V) tuples and applies a function to all the values to get the final result of the reduce.
Like the example given in the Apache reduceByKey documentation.
Any help please?
Here is an example that will illustrate what's going on.
Let's consider what happens when you call reduce on a list with some function f:
reduce(f, [a,b,c]) = f(f(a,b),c)
If we take your example, f = lambda u, v: u[1] + v[1], then the above expression breaks down into:
reduce(f, [a,b,c]) = f(f(a,b),c) = f(a[1]+b[1],c)
But a[1] + b[1] is an integer, which has no __getitem__ method, hence your error.
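Outside Spark, a minimal sketch with functools.reduce (using the values grouped under key 'A' from your data) reproduces the same failure:
from functools import reduce

f = lambda u, v: u[1] + v[1]
values = [('toto', 10), ('titi', 30), ('toto', 10)]  # values grouped under key 'A'

# Step 1: f(('toto', 10), ('titi', 30)) -> 40, a plain int
# Step 2: f(40, ('toto', 10)) -> tries 40[1] and fails
try:
    reduce(f, values)
except TypeError as e:
    print(e)  # 'int' object is not subscriptable (worded as "unsubscriptable" on older Pythons)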
In general, the better approach (as shown below) is to use map() to first extract the data in the format that you want, and then apply reduceByKey().
An MCVE with your data
element = sc.parallelize(
    [
        ('A', ('toto', 10)),
        ('A', ('titi', 30)),
        ('5', ('tata', 10)),
        ('A', ('toto', 10))
    ]
)
You can almost get your desired output with a more sophisticated reduce function:
def add_tuple_values(a, b):
    # On the first call both arguments are (name, value) tuples; on later
    # calls the accumulated argument is already a plain integer.
    try:
        u = a[1]
    except:
        u = a
    try:
        v = b[1]
    except:
        v = b
    return u + v
print(element.reduceByKey(add_tuple_values).collect())
Except that this results in:
[('A', 50), ('5', ('tata', 10))]
Why? Because there's only one value for the key '5', so there is nothing to reduce.
For these reasons, it's best to first call map. To get your desired output, you could do:
>>> print(element.map(lambda x: (x[0], x[1][1])).reduceByKey(lambda u, v: u+v).collect())
[('A', 50), ('5', 10)]
Update 1
Here's one more approach:
You could create tuples in your reduce function, and then call map to extract the value you want. (Essentially reverse the order of map and reduce.)
print(
    element.reduceByKey(lambda u, v: (0, u[1] + v[1]))
           .map(lambda x: (x[0], x[1][1]))
           .collect()
)
[('A', 50), ('5', 10)]
Notes
Had there been at least 2 records for each key, using add_tuple_values() would have given you the correct output.
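For example, with a hypothetical second record added for key '5' (so every key has at least two values), the fallback branches are enough:
element2 = sc.parallelize(
    [
        ('A', ('toto', 10)),
        ('A', ('titi', 30)),
        ('5', ('tata', 10)),
        ('5', ('tutu', 25)),  # hypothetical extra record
        ('A', ('toto', 10))
    ]
)

print(element2.reduceByKey(add_tuple_values).collect())
# e.g. [('A', 50), ('5', 35)] (key order may vary)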
Another approach would be to use a DataFrame:
from pyspark.sql.functions import sum as sum_

rdd = sc.parallelize([('A', ('toto', 10)), ('A', ('titi', 30)), ('5', ('tata', 10)), ('A', ('toto', 10))])
rdd.map(lambda kv: (kv[0], kv[1][0], kv[1][1])).toDF(['a', 'b', 'c']).groupBy('a').agg(sum_('c')).rdd.map(tuple).collect()
# [('5', 10), ('A', 50)]
Related
I have a list of lists:
[[12, 'tall', 'blue', 1],
[2, 'short', 'red', 9],
[4, 'tall', 'blue', 13]]
If I wanted to sort by one element, say the tall/short element, I could do it via s = sorted(s, key = itemgetter(1)).
If I wanted to sort by both tall/short and colour, I could do the sort twice, once for each element, but is there a quicker way?
A key can be a function that returns a tuple:
s = sorted(s, key = lambda x: (x[1], x[2]))
Or you can achieve the same using itemgetter (which is faster and avoids a Python function call):
import operator
s = sorted(s, key = operator.itemgetter(1, 2))
And notice that here you can use sort instead of using sorted and then reassigning:
s.sort(key = operator.itemgetter(1, 2))
I'm not sure if this is the most Pythonic method ...
I had a list of tuples that needed sorting first by descending integer value and second alphabetically. This required reversing the integer sort but not the alphabetical sort. Here was my solution (written on the fly in an exam, by the way; I wasn't even aware you could 'nest' sorted calls):
a = [('Al', 2),('Bill', 1),('Carol', 2), ('Abel', 3), ('Zeke', 2), ('Chris', 1)]
b = sorted(sorted(a, key = lambda x : x[0]), key = lambda x : x[1], reverse = True)
print(b)
[('Abel', 3), ('Al', 2), ('Carol', 2), ('Zeke', 2), ('Bill', 1), ('Chris', 1)]
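This works because Python's sort is stable: the outer sort preserves the alphabetical order produced by the inner one among equal integers. If the numeric field can be negated, a roughly equivalent single-pass version is:
a = [('Al', 2), ('Bill', 1), ('Carol', 2), ('Abel', 3), ('Zeke', 2), ('Chris', 1)]

# Negate the numeric field for descending order while names stay ascending
b = sorted(a, key=lambda x: (-x[1], x[0]))
print(b)
# [('Abel', 3), ('Al', 2), ('Carol', 2), ('Zeke', 2), ('Bill', 1), ('Chris', 1)]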
Several years late to the party, but I wanted to both sort on two criteria and use reverse=True. In case someone else wants to know how, you can wrap your criteria (functions) in parentheses:
s = sorted(my_list, key=lambda i: ( criteria_1(i), criteria_2(i) ), reverse=True)
It appears you could use a list instead of a tuple.
This becomes more important, I think, when you are grabbing attributes instead of 'magic indexes' of a list/tuple.
In my case I wanted to sort by multiple attributes of a class, where the incoming keys were strings. I needed different sorting in different places, and I wanted a common default sort for the parent class that clients were interacting with, only having to override the 'sorting keys' when I really needed to, but also in a way that I could store them as lists the class could share.
So first I defined a helper method
def attr_sort(self, attrs=['someAttributeString']):
    '''helper to sort by the attributes named by strings of attrs, in order'''
    return lambda k: [getattr(k, attr) for attr in attrs]
then to use it
# would be defined elsewhere, but shown here for conciseness
self.SortListA = ['attrA', 'attrB']
self.SortListB = ['attrC', 'attrA']
records = .... #list of my objects to sort
records.sort(key=self.attr_sort(attrs=self.SortListA))
# perhaps later nearby or in another function
more_records = .... #another list
more_records.sort(key=self.attr_sort(attrs=self.SortListB))
This will use the generated lambda function to sort the list by object.attrA and then object.attrB, assuming object has a getter corresponding to each of the string names provided. The second case would sort by object.attrC then object.attrA.
This also allows you to expose sorting choices outward, to be shared alike by a consumer, a unit test, or callers who want to tell you how sorting should be done for some operation in your API, by only having to give you a list rather than coupling them to your back-end implementation.
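A minimal self-contained sketch of the same idea, written as a free function rather than a method (the Record class and attribute names here are hypothetical):
class Record:
    def __init__(self, attrA, attrB, attrC):
        self.attrA, self.attrB, self.attrC = attrA, attrB, attrC

    def __repr__(self):
        return f'Record({self.attrA!r}, {self.attrB!r}, {self.attrC!r})'

def attr_sort(attrs):
    '''Return a key function that sorts by the named attributes, in order.'''
    return lambda obj: [getattr(obj, attr) for attr in attrs]

records = [Record(2, 'x', 9), Record(1, 'z', 3), Record(1, 'a', 7)]
records.sort(key=attr_sort(['attrA', 'attrB']))
print(records)  # sorted by attrA, then attrB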
Convert the list of lists into a list of tuples, then sort the tuples by multiple fields.
data=[[12, 'tall', 'blue', 1],[2, 'short', 'red', 9],[4, 'tall', 'blue', 13]]
data=[tuple(x) for x in data]
result = sorted(data, key = lambda x: (x[1], x[2]))
print(result)
output:
[(2, 'short', 'red', 9), (12, 'tall', 'blue', 1), (4, 'tall', 'blue', 13)]
Here's one way: you basically rewrite your sort function to take a list of sort functions. Each sort function compares the attributes you want to test; on each test you check whether the cmp function returns a non-zero result, and if so you break and return that value.
You call it by passing a lambda of a function of a list of lambdas.
Its advantage is that it makes a single pass through the data rather than sorting the result of a previous sort, as the other methods do. Another thing is that it sorts in place, whereas sorted makes a copy.
I used it to write a rank function that ranks a list of objects, where each object is in a group and has a score, but you can add any list of attributes. (Note that this relies on cmp and list.sort(cmp_function), which exist only in Python 2.)
Note the un-lambda-like, though hackish, use of a lambda to call a setter.
The rank part won't work for an array of lists, but the sort will.
#First, here's a pure list version
my_sortLambdaLst = [lambda x, y: cmp(x[0], y[0]), lambda x, y: cmp(x[1], y[1])]

def multi_attribute_sort(x, y):
    r = 0
    for l in my_sortLambdaLst:
        r = l(x, y)
        if r != 0: return r  #keep looping till you see a difference
    return r

Lst = [(4, 2.0), (4, 0.01), (4, 0.9), (4, 0.999), (4, 0.2), (1, 2.0), (1, 0.01), (1, 0.9), (1, 0.999), (1, 0.2)]
Lst.sort(lambda x, y: multi_attribute_sort(x, y))  #The Lambda of the Lambda
for rec in Lst: print str(rec)
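Since cmp and comparison-function sorting were removed in Python 3, here is a rough Python 3 equivalent of the pure list version above, using functools.cmp_to_key and a small stand-in for cmp (a sketch, not a drop-in port):
from functools import cmp_to_key

def cmp(a, b):  # Python 3 has no built-in cmp; a minimal stand-in
    return (a > b) - (a < b)

sort_cmps = [lambda x, y: cmp(x[0], y[0]), lambda x, y: cmp(x[1], y[1])]

def multi_attribute_cmp(x, y):
    for c in sort_cmps:
        r = c(x, y)
        if r != 0:
            return r  # keep looping until a difference is found
    return 0

lst = [(4, 2.0), (4, 0.01), (1, 2.0), (1, 0.01)]
lst.sort(key=cmp_to_key(multi_attribute_cmp))
print(lst)  # [(1, 0.01), (1, 2.0), (4, 0.01), (4, 2.0)]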
Here's a way to rank a list of objects:
class probe:
    def __init__(self, group, score):
        self.group = group
        self.score = score
        self.rank = -1

    def set_rank(self, r):
        self.rank = r

    def __str__(self):
        return '\t'.join([str(self.group), str(self.score), str(self.rank)])


def RankLst(inLst, group_lambda=lambda x: x.group,
            sortLambdaLst=[lambda x, y: cmp(x.group, y.group), lambda x, y: cmp(x.score, y.score)],
            SetRank_Lambda=lambda x, rank: x.set_rank(rank)):
    #Inner function is the only way (I could think of) to pass the sortLambdaLst into a sort function
    def multi_attribute_sort(x, y):
        r = 0
        for l in sortLambdaLst:
            r = l(x, y)
            if r != 0: return r  #keep looping till you see a difference
        return r

    inLst.sort(lambda x, y: multi_attribute_sort(x, y))

    #Now Rank your probes
    rank = 0
    last_group = group_lambda(inLst[0])
    for i in range(len(inLst)):
        rec = inLst[i]
        group = group_lambda(rec)
        if last_group == group:
            rank += 1
        else:
            rank = 1
            last_group = group
        SetRank_Lambda(inLst[i], rank)  #This is pure evil!! The lambda purists are gnashing their teeth

Lst = [probe(4, 2.0), probe(4, 0.01), probe(4, 0.9), probe(4, 0.999), probe(4, 0.2), probe(1, 2.0), probe(1, 0.01), probe(1, 0.9), probe(1, 0.999), probe(1, 0.2)]
RankLst(Lst, group_lambda=lambda x: x.group,
        sortLambdaLst=[lambda x, y: cmp(x.group, y.group), lambda x, y: cmp(x.score, y.score)],
        SetRank_Lambda=lambda x, rank: x.set_rank(rank))
print '\t'.join(['group', 'score', 'rank'])
for r in Lst: print r
There is an operator < between lists, e.g.:
[12, 'tall', 'blue', 1] < [4, 'tall', 'blue', 13]
will give
False
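Because lists (like tuples) compare element by element, a key function can simply return a slice, so the lists of lists from the question sort directly without any conversion; a quick sketch:
s = [[12, 'tall', 'blue', 1],
     [2, 'short', 'red', 9],
     [4, 'tall', 'blue', 13]]

s = sorted(s, key=lambda x: x[1:3])  # sort by the 2nd and 3rd fields
print(s)
# [[2, 'short', 'red', 9], [12, 'tall', 'blue', 1], [4, 'tall', 'blue', 13]]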
So my rdd consists of data looking like:
(k, [v1,v2,v3...])
I want to create all combinations of two elements from the value part.
So the end map should look like:
(k1, (v1,v2))
(k1, (v1,v3))
(k1, (v2,v3))
I know that to get the value part, I would use something like
rdd.cartesian(rdd).filter(case (a,b) => a < b)
However, that requires the entire RDD to be passed (right?), not just the value part. I am unsure how to arrive at my desired end; I suspect it's a groupBy.
Also, ultimately, I want to get to the k,v looking like
((k1,v1,v2),1)
I know how to get from what I am looking for to that, but maybe it's easier to go straight there?
Thanks.
I think Israel's answer is incomplete, so I go a step further.
import itertools

a = sc.parallelize([
    (1, [1, 2, 3, 4]),
    (2, [3, 4, 5, 6]),
    (3, [-1, 2, 3, 4])
])

def combinations(row):
    l = row[1]
    k = row[0]
    return [(k, v) for v in itertools.combinations(l, 2)]

a.map(combinations).flatMap(lambda x: x).take(3)
# [(1, (1, 2)), (1, (1, 3)), (1, (1, 4))]
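Since flatMap already flattens the per-record lists, the map/flatMap pair above can be collapsed into a single call, which should give the same result:
a.flatMap(combinations).take(3)
# [(1, (1, 2)), (1, (1, 3)), (1, (1, 4))]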
Use itertools to create the combinations. Here is a demo:
import itertools
k, v1, v2, v3 = 'k1 v1 v2 v3'.split()
a = (k, [v1,v2,v3])
b = itertools.combinations(a[1], 2)
data = [(k, pair) for pair in b]
data will be:
[('k1', ('v1', 'v2')), ('k1', ('v1', 'v3')), ('k1', ('v2', 'v3'))]
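To go all the way to the ((k, v1, v2), 1) pairs mentioned at the end of the question, here is a sketch of the same idea applied directly to the RDD (assuming rdd holds (k, [v1, v2, ...]) records, as in the question):
import itertools

pairs = rdd.flatMap(
    lambda kv: [((kv[0],) + pair, 1) for pair in itertools.combinations(kv[1], 2)]
)
# ('k1', ['v1', 'v2', 'v3']) would yield (('k1', 'v1', 'v2'), 1), (('k1', 'v1', 'v3'), 1), (('k1', 'v2', 'v3'), 1)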
I have made this algorithm, but with higher numbers it looks like it doesn't work or is very slow. It will run on a big data cluster (Cloudera), so I think I have to port the function to PySpark. Please lend a hand if you can.
import pandas as pd
import itertools as itts

number_list = [10953, 10423, 10053]

def reducer(nums):
    def ranges(n):
        print(n)
        return range(n, -1, -1)
    num_list = list(map(ranges, nums))
    return list(itts.product(*num_list))

data = pd.DataFrame(reducer(number_list))
print(data)
I'm trying to filter an RDD of tuples to return the largest N tuples based on key values. I need the return format to be an RDD.
So the RDD:
[(4, 'a'), (12, 'e'), (2, 'u'), (49, 'y'), (6, 'p')]
filtered for the largest 3 keys should return the RDD:
[(6,'p'), (12,'e'), (49,'y')]
Doing a sortByKey() and then take(N) returns the values and doesn't result in an RDD, so that won't work.
I could return all of the keys, sort them, find the Nth largest value, and then filter the RDD for key values greater than that, but that seems very inefficient.
What would be the best way to do this?
With RDD
A quick but not particularly efficient solution is to follow sortByKey with zipWithIndex and filter:
n = 3
rdd = sc.parallelize([(4, 'a'), (12, 'e'), (2, 'u'), (49, 'y'), (6, 'p')])

rdd.sortByKey(ascending=False).zipWithIndex().filter(lambda xi: xi[1] < n).keys()
If n is relatively small compared to the RDD size, a slightly more efficient approach is to avoid a full sort:
import heapq

def key(kv):
    return kv[0]

top_per_partition = rdd.mapPartitions(lambda it: heapq.nlargest(n, it, key=key))
top_per_partition.sortByKey(ascending=False).zipWithIndex().filter(lambda xi: xi[1] < n).keys()
If the keys are much smaller than the values and the order of the final output doesn't matter, then a filter-based approach can work just fine:
keys = rdd.keys()
identity = lambda x: x

offset = (keys
          .mapPartitions(lambda it: heapq.nlargest(n, it))
          .sortBy(identity, ascending=False)
          .zipWithIndex()
          .filter(lambda xi: xi[1] < n)
          .keys()
          .min())

rdd.filter(lambda kv: kv[0] >= offset)
Also, it won't keep exactly n values in case of ties.
With DataFrames
You can just orderBy and limit:
from pyspark.sql.functions import col
rdd.toDF().orderBy(col("_1").desc()).limit(n)
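If the result has to come back as an RDD of plain tuples (as the question requires), one extra step should do it; a sketch:
top_n_rdd = rdd.toDF().orderBy(col("_1").desc()).limit(n).rdd.map(tuple)
top_n_rdd.collect()
# e.g. [(49, 'y'), (12, 'e'), (6, 'p')]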
A lower-effort approach, since you only want to convert the take(N) results into a new RDD:
sc.parallelize(yourSortedRdd.take(Nth))
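Applied to the question's data, that might look like the following (note that take/parallelize pulls the N rows through the driver, which is fine for small N):
n = 3
sorted_rdd = rdd.sortByKey(ascending=False)
top_n_rdd = sc.parallelize(sorted_rdd.take(n))
top_n_rdd.collect()
# e.g. [(49, 'y'), (12, 'e'), (6, 'p')]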
The original dataset is:
# (numbersofrating,title,avg_rating)
newRDD =[(3,'monster',4),(4,'minions 3D',5),....]
I want to select the top N avg_ratings in newRDD. I use the following code, but it has an error.
selectnewRDD = (newRDD.map(x, key =lambda x: x[2]).sortBy(......))
TypeError: map() takes no keyword arguments
The expected data should be:
# (numbersofrating,title,avg_rating)
selectnewRDD =[(4,'minions 3D',5),(3,'monster',4)....]
You can use either top or takeOrdered with the key argument:
newRDD.top(2, key=lambda x: x[2])
or
newRDD.takeOrdered(2, key=lambda x: -x[2])
Note that top takes elements in descending order and takeOrdered in ascending order, so the key function is different in the two cases.
Have you tried using top? Given that you want the top avg ratings (and it is the third item in the tuple), you'll need to assign it to the key using a lambda function.
# items = (number_of_ratings, title, avg_rating)
newRDD = sc.parallelize([(3, 'monster', 4), (4, 'minions 3D', 5)])
top_n = 10
>>> newRDD.top(top_n, key=lambda items: items[2])
[(4, 'minions 3D', 5), (3, 'monster', 4)]
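Both top and takeOrdered return a plain Python list rather than an RDD, so if a distributed result is needed, the list can be re-parallelized, e.g.:
top_rdd = sc.parallelize(newRDD.takeOrdered(2, key=lambda x: -x[2]))
top_rdd.collect()
# [(4, 'minions 3D', 5), (3, 'monster', 4)]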