Call Distinct on 'pyspark.resultiterable.ResultIterable'

Call Distinct on 'pyspark.resultiterable.ResultIterable' - python

I am writing some spark code and I have an RDD which looks like
[(4, <pyspark.resultiterable.ResultIterable at 0x9d32a4c>),
(1, <pyspark.resultiterable.ResultIterable at 0x9d32cac>),
(5, <pyspark.resultiterable.ResultIterable at 0x9d32bac>),
(2, <pyspark.resultiterable.ResultIterable at 0x9d32acc>)]
What I need to do is to call a distinct on the pyspark.resultiterable.ResultIterable
I tried this
def distinctHost(a, b):
p = sc.parallelize(b)
return (a, p.distinct())
mydata.map(lambda x: distinctHost(*x))
But I get an error:
Exception: It appears that you are attempting to reference
SparkContext from a broadcast variable, action, or transforamtion.
SparkContext can only be used on the driver, not in code that it run
on workers. For more information, see SPARK-5063.
The error is self explanatory that I cannot use sc. But I need to find a way to cover the pyspark.resultiterable.ResultIterable to an RDD so that I can call distinct on it.

Straightforward approach is to use sets:
from numpy.random import choice, seed
seed(323)
keys = (4, 1, 5, 2)
hosts = [
u'in24.inetnebr.com',
u'ix-esc-ca2-07.ix.netcom.com',
u'uplherc.upl.com',
u'slppp6.intermind.net',
u'piweba4y.prodigy.com'
]
pairs = sc.parallelize(zip(choice(keys, 20), choice(hosts, 20))).groupByKey()
pairs.map(lambda (k, v): (k, set(v))).take(3)
Result:
[(1, {u'ix-esc-ca2-07.ix.netcom.com', u'slppp6.intermind.net'}),
(2,
{u'in24.inetnebr.com',
u'ix-esc-ca2-07.ix.netcom.com',
u'slppp6.intermind.net',
u'uplherc.upl.com'}),
(4, {u'in24.inetnebr.com', u'piweba4y.prodigy.com', u'uplherc.upl.com'})]
If there is a particular reason for using rdd.disinct you can try something like this:
def distinctHost(pairs, key):
return (pairs
.filter(lambda (k, v): k == key)
.flatMap(lambda (k, v): v)
.distinct())
[(key, distinctHost(pairs, key).collect()) for key in pairs.keys().collect()]

Related

Fail to subtract two RDD with list column in PySpark

I have two RDD of below type:
RDD[(int, List[(string, int)]
And I want to get the subtract set from the two RDD. The code is like:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('xxxxx.com').getOrCreate()
rdd1 = spark.sparkContext.parallelize([(1, [("foo", 101), ("bar", 111)]), (2, [("foobar", 22), ("bar", 222)]), (3, [("foo", 333)])])
rdd2 = spark.sparkContext.parallelize([(1, [("foo", 101), ("bar", 111)]), (2, [("foobar", 22), ("bar", 222)]), (3, [("foo", 333)])])
rdd = rdd1.subtract(rdd2)
rdd.toDF().show()
However I got the below errors:
d[k] = comb(d[k], v) if k in d else creator(v)
TypeError: unhashable type: 'list'
But if i change the rdd to DF first and then do subtract, it can get the right answer. Not know how to fix the issue if using rdd directly.
rdd1 = spark.sparkContext.parallelize([(1, [("foo", 101), ("bar", 111)]), (2, [("foobar", 22), ("bar", 222)]), (3, [("foo", 333)])])
rdd2 = spark.sparkContext.parallelize([(1, [("foo", 101), ("bar", 111)]), (2, [("foobar", 22), ("bar", 222)]), (3, [("foo", 333)])])
rdd = rdd1.toDF().subtract(rdd2.toDF())
rdd.show()

First of all, the reason why this does not work in python is simple. subtract is about finding the elements of rdd1 that are not in rdd2. To do that, spark will put all the records with the same hash on the same partition and then check for each record of rdd1 if there is an equal record with the same hash from rdd2. To do that, the records must be hashable. In python, tuples are but lists are not, hence the error you obtain. There are several workarounds. The easiest one would probably be to work in scala. Lists are hashable in java/scala.
val rdd1 = spark.sparkContext.parallelize(Seq((1, Seq(("foo", 101), ("bar", 111))), (2, Seq(("foobar", 22), ("bar", 222))), (3, Seq(("foo", 333)))))
val rdd2 = spark.sparkContext.parallelize(Seq((1, Seq(("foo", 101), ("bar", 111))), (2, Seq(("foobar", 22), ("bar", 222))), (3, Seq(("foo", 333)))))
// and you can check that this works
rdd1.subtract(rdd2).collect()
In python, one way at this would be to define your own list class. It would need to be hashable and to provide an __eq__ method to allow spark to know when objects are equal. Such a custom class could be defined as follows:
class my_list:
def __init__(self, list):
self.list=list
def __hash__(self):
my_hash = 0
for t in self.list:
my_hash+=hash(t[0])
my_hash+=t[1]
return my_hash
def __eq__(self, other_list):
self.list == other_list.list
Then, you can check that this works:
rdd1.mapValues(lambda x : my_list(x))\
.subtract(rdd2.mapValues(lambda x: my_list(x)))\
.collect()
NB: if you work in a shell, do not define the class within the shell or pickle won't be able to serialize your class. Define it in a separate file like my_list.py and import it with pyspark --py-files my_list.py and in the shell, import my_list from my_list.

Converting 2 list and one string to dictionary

P.S: Thank you everybody ,esp Matthias Fripp . Just reviewed the question You are right I made mistake : String is value not the key
num=[1,2,3,4,5,6]
pow=[1,4,9,16,25,36]
s= ":subtraction"
dic={1:1 ,0:s , 2:4,2:s, 3:9,6:s, 4:16,12:s.......}
There is easy way to convert two list to dictionary :
newdic=dict(zip(list1,list2))
but for this problem no clue even with comprehension:
print({num[i]:pow[i] for i in range(len(num))})

As others have said, dict cannot contain duplicate keys. You can make key duplicate with a little bit of tweaking. I used OrderedDict to keep order of inserted keys:
from pprint import pprint
from collections import OrderedDict
num=[1,2,3,4,5,6]
pow=[1,4,9,16,25,36]
pprint(OrderedDict(sum([[[a, b], ['substraction ({}-{}):'.format(a, b), a-b]] for a, b in zip(num, pow)], [])))
Prints:
OrderedDict([(1, 1),
('substraction (1-1):', 0),
(2, 4),
('substraction (2-4):', -2),
(3, 9),
('substraction (3-9):', -6),
(4, 16),
('substraction (4-16):', -12),
(5, 25),
('substraction (5-25):', -20),
(6, 36),
('substraction (6-36):', -30)])

In principle, this would do what you want:
nums = [(n, p) for (n, p) in zip(num, pow)]
diffs = [('subtraction', p-n) for (n, p) in zip(num, pow)]
items = nums + diffs
dic = dict(items)
However, a dictionary cannot have multiple items with the same key, so each of your "subtraction" items will be replaced by the next one added to the dictionary, and you'll only get the last one. So you might prefer to work with the items list directly.
If you need the items list sorted as you've shown, that will take a little more work. Maybe something like this:
items = []
for n, p in zip(num, pow):
items.append((n, p))
items.append(('subtraction', p-n))
# the next line will drop most 'subtraction' entries, but on
# Python 3.7+, it will at least preserve the order (not possible
# with earlier versions of Python)
dic = dict(items)

Error using reducebykey: int object is unsubscriptable

I'm getting an error "int object is unsubscriptable" while executing the following script :
element.reduceByKey( lambda x , y : x[1]+y[1])
with element is an key-value RDD and the value is a tuple. Example input:
(A, (toto , 10))
(A, (titi , 30))
(5, (tata, 10))
(A, (toto, 10))
I understand that the reduceByKey function takes (K,V) tuples and apply a function on all the values to get the final result of the reduce.
Like the example given in ReduceByKey Apache.
Any help please?

Here is an example that will illustrate what's going on.
Let's consider what happens when you call reduce on a list with some function f:
reduce(f, [a,b,c]) = f(f(a,b),c)
If we take your example, f = lambda u, v: u[1] + v[1], then the above expression breaks down into:
reduce(f, [a,b,c]) = f(f(a,b),c) = f(a[1]+b[1],c)
But a[1] + b[1] is an integer so there is no __getitem__ method, hence your error.
In general, the better approach (as shown below) is to use map() to first extract the data in the format that you want, and then apply reduceByKey().
A MCVE with your data
element = sc.parallelize(
[
('A', ('toto' , 10)),
('A', ('titi' , 30)),
('5', ('tata', 10)),
('A', ('toto', 10))
]
)
You can almost get your desired output with a more sophisticated reduce function:
def add_tuple_values(a, b):
try:
u = a[1]
except:
u = a
try:
v = b[1]
except:
v = b
return u + v
print(element.reduceByKey(add_tuple_values).collect())
Except that this results in:
[('A', 50), ('5', ('tata', 10))]
Why? Because there's only one value for the key '5', so there is nothing to reduce.
For these reasons, it's best to first call map. To get your desired output, you could do:
>>> print(element.map(lambda x: (x[0], x[1][1])).reduceByKey(lambda u, v: u+v).collect())
[('A', 50), ('5', 10)]
Update 1
Here's one more approach:
You could create tuples in your reduce function, and then call map to extract the value you want. (Essentially reverse the order of map and reduce.)
print(
element.reduceByKey(lambda u, v: (0,u[1]+v[1]))
.map(lambda x: (x[0], x[1][1]))
.collect()
)
[('A', 50), ('5', 10)]
Notes
Had there been at least 2 records for each key, using add_tuple_values() would have given you the correct output.

Another approach would be to use Dataframe
rdd = sc.parallelize([('A', ('toto', 10)),('A', ('titi', 30)),('5', ('tata', 10)),('A', ('toto', 10))])
rdd.map(lambda (a,(b,c)): (a,b,c)).toDF(['a','b','c']).groupBy('a').agg(sum("c")).rdd.map(lambda (a,c): (a,c)).collect()
>>>[(u'5', 10), (u'A', 50)]

Creating combination of value list with existing key - Pyspark

So my rdd consists of data looking like:
(k, [v1,v2,v3...])
I want to create a combination of all sets of two for the value part.
So the end map should look like:
(k1, (v1,v2))
(k1, (v1,v3))
(k1, (v2,v3))
I know to get the value part, I would use something like
rdd.cartesian(rdd).filter(case (a,b) => a < b)
However, that requires the entire rdd to be passed (right?) not just the value part. I am unsure how to arrive at my desired end, I suspect its a groupby.
Also, ultimately, I want to get to the k,v looking like
((k1,v1,v2),1)
I know how to get from what I am looking for to that, but maybe its easier to go straight there?
Thanks.

I think Israel's answer is a incomplete, so I go a step further.
import itertools
a = sc.parallelize([
(1, [1,2,3,4]),
(2, [3,4,5,6]),
(3, [-1,2,3,4])
])
def combinations(row):
l = row[1]
k = row[0]
return [(k, v) for v in itertools.combinations(l, 2)]
a.map(combinations).flatMap(lambda x: x).take(3)
# [(1, (1, 2)), (1, (1, 3)), (1, (1, 4))]

Use itertools to create the combinations. Here is a demo:
import itertools
k, v1, v2, v3 = 'k1 v1 v2 v3'.split()
a = (k, [v1,v2,v3])
b = itertools.combinations(a[1], 2)
data = [(k, pair) for pair in b]
data will be:
[('k1', ('v1', 'v2')), ('k1', ('v1', 'v3')), ('k1', ('v2', 'v3'))]

I have made this algorithm, but with higher numbers looks like that doesn't work or its very slow, it will run in a cluster of big data(cloudera), so i think that i have to put the function into pyspark, please give a hand if you can.
import pandas as pd
import itertools as itts
number_list = [10953, 10423, 10053]
def reducer(nums):
def ranges(n):
print(n)
return range(n, -1, -1)
num_list = list(map(ranges, nums))
return list(itts.product(*num_list))
data=pd.DataFrame(reducer(number_list))
print(data)

How to use bisect.insort_left with a key?

Doc's are lacking an example...How do you use bisect.insort_left)_ based on a key?
Trying to insert based on key.
bisect.insort_left(data, ('brown', 7))
puts insert at data[0].
From docs...
bisect.insort_left(a, x, lo=0, hi=len(a))
Insert x in a in sorted order. This is equivalent to a.insert(bisect.bisect_left(a, x, lo, hi), x) assuming that a is already sorted. Keep in mind that the O(log n) search is dominated by the slow O(n) insertion step.
Sample usage:
>>> data = [('red', 5), ('blue', 1), ('yellow', 8), ('black', 0)]
>>> data.sort(key=lambda r: r[1])
>>> keys = [r[1] for r in data] # precomputed list of keys
>>> data[bisect_left(keys, 0)]
('black', 0)
>>> data[bisect_left(keys, 1)]
('blue', 1)
>>> data[bisect_left(keys, 5)]
('red', 5)
>>> data[bisect_left(keys, 8)]
('yellow', 8)
>>>
I want to put ('brown', 7) after ('red', 5) on sorted list in data using bisect.insort_left. Right now bisect.insort_left(data, ('brown', 7)) puts ('brown', 7) at data[0]...because I am not using the keys to do insert...docs don't show to do inserts using the keys.

You could wrap your iterable in a class that implements __getitem__ and __len__. This allows you the opportunity to use a key with bisect_left. If you set up your class to take the iterable and a key function as arguments.
To extend this to be usable with insort_left it's required to implement the insert method. The problem here is that if you do that is that insort_left will try to insert your key argument into the list containing the objects of which the the key is a member.
An example is clearer
from bisect import bisect_left, insort_left
class KeyWrapper:
def __init__(self, iterable, key):
self.it = iterable
self.key = key
def __getitem__(self, i):
return self.key(self.it[i])
def __len__(self):
return len(self.it)
def insert(self, index, item):
print('asked to insert %s at index%d' % (item, index))
self.it.insert(index, {"time":item})
timetable = [{"time": "0150"}, {"time": "0250"}, {"time": "0350"}, {"time": "0450"}, {"time": "0550"}, {"time": "0650"}, {"time": "0750"}]
bslindex = bisect_left(KeyWrapper(timetable, key=lambda t: t["time"]), "0359")
islindex = insort_left(KeyWrapper(timetable, key=lambda t: t["time"]), "0359")
See how in my insert method I had to make it specific to the timetable dictionary otherwise insort_left would try insert "0359" where it should insert {"time": "0359"}?
Ways round this could be to construct a dummy object for the comparison, inherit from KeyWrapper and override insert or pass some sort of factory function to create the object. None of these ways are particularly desirable from an idiomatic python point of view.
So the easiest way is to just use the KeyWrapper with bisect_left, which returns you the insert index and then do the insert yourself. You could easily wrap this in a dedicated function.
e.g.
bslindex = bisect_left(KeyWrapper(timetable, key=lambda t: t["time"]), "0359")
timetable.insert(bslindex, {"time":"0359"})
In this case ensure you don't implement insert, so you will be immediately aware if you accidentally pass a KeyWrapper to a mutating function like insort_left which probably wouldn't do the right thing.
To use your example data
from bisect import bisect_left
class KeyWrapper:
def __init__(self, iterable, key):
self.it = iterable
self.key = key
def __getitem__(self, i):
return self.key(self.it[i])
def __len__(self):
return len(self.it)
data = [('red', 5), ('blue', 1), ('yellow', 8), ('black', 0)]
data.sort(key=lambda c: c[1])
newcol = ('brown', 7)
bslindex = bisect_left(KeyWrapper(data, key=lambda c: c[1]), newcol[1])
data.insert(bslindex, newcol)
print(data)
Here is the class with proper typing:
from typing import TypeVar, Generic, Sequence, Callable
T = TypeVar('T')
V = TypeVar('V')
class KeyWrapper(Generic[T, V]):
def __init__(self, iterable: Sequence[T], key: Callable[[T], V]):
self.it = iterable
self.key = key
def __getitem__(self, i: int) -> V:
return self.key(self.it[i])
def __len__(self) -> int:
return len(self.it)

This does essentially the same thing the SortedCollection recipe does that the bisect documentation mentions in its See also: section at the end, but unlike the insert() method in the recipe, the function shown supports a key-function.
What's being done is a separate sorted keys list is maintained in parallel with the sorted data list to improve performance (it's faster than creating the keys list before each insertion, but keeping it around and updating it isn't strictly required). The ActiveState recipe encapsulated this for you within a class, but in the code below they're just two separate independent lists being passed around (so it'd be easier for them to get out of sync than it would be if they were both held in an instance of the recipe's class).
from bisect import bisect_left
def insert(seq, keys, item, keyfunc=lambda v: v):
"""Insert an item into a sorted list using a separate corresponding
sorted keys list and a keyfunc() to extract the key from each item.
Based on insert() method in SortedCollection recipe:
http://code.activestate.com/recipes/577197-sortedcollection/
"""
k = keyfunc(item) # Get key.
i = bisect_left(keys, k) # Determine where to insert item.
keys.insert(i, k) # Insert key of item to keys list.
seq.insert(i, item) # Insert the item itself in the corresponding place.
# Initialize the sorted data and keys lists.
data = [('red', 5), ('blue', 1), ('yellow', 8), ('black', 0)]
data.sort(key=lambda r: r[1]) # Sort data by key value
keys = [r[1] for r in data] # Initialize keys list
print(data) # -> [('black', 0), ('blue', 1), ('red', 5), ('yellow', 8)]
insert(data, keys, ('brown', 7), keyfunc=lambda x: x[1])
print(data) # -> [('black', 0), ('blue', 1), ('red', 5), ('brown', 7), ('yellow', 8)]
Follow-on question:
    Can bisect.insort_left be used?
No, you can't simply use the bisect.insort_left() function to do this because it wasn't written in a way that supports a key-function—instead it just compares the whole item passed to it to insert, x, with one of the whole items in the array in its if a[mid] < x: statement. You can see what I mean by looking at the source for the bisect module in Lib/bisect.py.
Here's the relevant excerpt:
def insort_left(a, x, lo=0, hi=None):
"""Insert item x in list a, and keep it sorted assuming a is sorted.
If x is already in a, insert it to the left of the leftmost x.
Optional args lo (default 0) and hi (default len(a)) bound the
slice of a to be searched.
"""
if lo < 0:
raise ValueError('lo must be non-negative')
if hi is None:
hi = len(a)
while lo < hi:
mid = (lo+hi)//2
if a[mid] < x: lo = mid+1
else: hi = mid
a.insert(lo, x)
You could modify the above to accept an optional key-function argument and use it:
def my_insort_left(a, x, lo=0, hi=None, keyfunc=lambda v: v):
x_key = keyfunc(x) # Get comparison value.
. . .
if keyfunc(a[mid]) < x_key: # Compare key values.
lo = mid+1
. . .
...and call it like this:
my_insort_left(data, ('brown', 7), keyfunc=lambda v: v[1])
Actually, if you're going to write a custom function, for the sake of more efficiency at the expense of unneeded generality, you could dispense with the adding of a generic key function argument and just hardcode everything to operate the way needed with the data format you have. This will avoid the overhead of repeated calls to a key-function while doing the insertions.
def my_insort_left(a, x, lo=0, hi=None):
x_key = x[1] # Key on second element of each item in sequence.
. . .
if a[mid][1] < x_key: lo = mid+1 # Compare second element to key.
. . .
...called this way without passing keyfunc:
my_insort_left(data, ('brown', 7))

Add comparison methods to your class
Sometimes this is the least painful way, especially if you already have a class and just want to sort by a key from it:
#!/usr/bin/env python3
import bisect
import functools
#functools.total_ordering
class MyData:
def __init__(self, color, number):
self.color = color
self.number = number
def __lt__(self, other):
return self.number < other.number
def __str__(self):
return '{} {}'.format(self.color, self.number)
mydatas = [
MyData('red', 5),
MyData('blue', 1),
MyData('yellow', 8),
MyData('black', 0),
]
mydatas_sorted = []
for mydata in mydatas:
bisect.insort(mydatas_sorted, mydata)
for mydata in mydatas_sorted:
print(mydata)
Output:
black 0
blue 1
red 5
yellow 8
See also: "Enabling" comparison for classes
Tested in Python 3.5.2.
Upstream requests/patches
I get the feeling this is going to happen sooner or later ;-)
https://github.com/python/cpython/pull/13970
https://bugs.python.org/issue4356

As of Python 3.10, all the binary search helpers in the bisect module now accept a key argument:
key specifies a key function of one argument that is used to extract a
comparison key from each input element. The default value is None
(compare the elements directly).
Therefore, you can pass the same function you used to sort the data:
>>> import bisect
>>> data = [('red', 5), ('blue', 1), ('yellow', 8), ('black', 0)]
>>> data.sort(key=lambda r: r[1])
>>> data
[('black', 0), ('blue', 1), ('red', 5), ('yellow', 8)]
>>> bisect.insort_left(data, ('brown', 7), key=lambda r: r[1])
>>> data
[('black', 0), ('blue', 1), ('red', 5), ('brown', 7), ('yellow', 8)]

If your goal is to mantain a list sorted by key, performing usual operations like bisect insert, delete and update, I think sortedcontainers should suit your needs as well, and you'll avoid O(n) inserts.

From python version 3.10, the key argument has been added.
It will be something like:
import bisect
bisect.bisect_left(('brown', 7), data, key=lambda r: r[1])
Sources:
GitHub feature request
Documentation for version 3.10
See that documentation for version 3.9 does not have the key argument.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Call Distinct on 'pyspark.resultiterable.ResultIterable' - python

Related

Fail to subtract two RDD with list column in PySpark

Converting 2 list and one string to dictionary

Error using reducebykey: int object is unsubscriptable

Creating combination of value list with existing key - Pyspark

How to use bisect.insort_left with a key?

Categories

Resources