I have two RDDs of the following type:
RDD[(Int, List[(String, Int)])]
I want to compute the set difference (subtract) of the two RDDs. The code looks like this:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('xxxxx.com').getOrCreate()
rdd1 = spark.sparkContext.parallelize([(1, [("foo", 101), ("bar", 111)]), (2, [("foobar", 22), ("bar", 222)]), (3, [("foo", 333)])])
rdd2 = spark.sparkContext.parallelize([(1, [("foo", 101), ("bar", 111)]), (2, [("foobar", 22), ("bar", 222)]), (3, [("foo", 333)])])
rdd = rdd1.subtract(rdd2)
rdd.toDF().show()
However, I get the following error:
d[k] = comb(d[k], v) if k in d else creator(v)
TypeError: unhashable type: 'list'
But if I first convert the RDDs to DataFrames and then do the subtract, I get the right answer. I don't know how to fix the issue when using the RDDs directly.
rdd1 = spark.sparkContext.parallelize([(1, [("foo", 101), ("bar", 111)]), (2, [("foobar", 22), ("bar", 222)]), (3, [("foo", 333)])])
rdd2 = spark.sparkContext.parallelize([(1, [("foo", 101), ("bar", 111)]), (2, [("foobar", 22), ("bar", 222)]), (3, [("foo", 333)])])
rdd = rdd1.toDF().subtract(rdd2.toDF())
rdd.show()
First of all, the reason this does not work in Python is simple. subtract finds the elements of rdd1 that are not in rdd2. To do that, Spark puts all records with the same hash on the same partition and then checks, for each record of rdd1, whether there is an equal record with the same hash from rdd2. This requires the records to be hashable. In Python, tuples are hashable but lists are not, hence the error you get. There are several workarounds. The easiest one would probably be to work in Scala, where Lists are hashable:
val rdd1 = spark.sparkContext.parallelize(Seq((1, Seq(("foo", 101), ("bar", 111))), (2, Seq(("foobar", 22), ("bar", 222))), (3, Seq(("foo", 333)))))
val rdd2 = spark.sparkContext.parallelize(Seq((1, Seq(("foo", 101), ("bar", 111))), (2, Seq(("foobar", 22), ("bar", 222))), (3, Seq(("foo", 333)))))
// and you can check that this works
rdd1.subtract(rdd2).collect()
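Back in Python, the same hashing failure can be reproduced locally, without Spark at all:

```python
# A tuple containing a list is unhashable: hashing the tuple hashes its
# elements, and lists do not implement __hash__.
record = (1, [("foo", 101), ("bar", 111)])
try:
    hash(record)
except TypeError as e:
    print(e)  # unhashable type: 'list'

# Replacing the inner list with a tuple makes the record hashable.
hashable_record = (1, (("foo", 101), ("bar", 111)))
print(hash(hashable_record) == hash((1, (("foo", 101), ("bar", 111)))))  # True
```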
In Python, one way to address this is to define your own list class. It needs to be hashable and to provide an __eq__ method so that Spark can tell when objects are equal. Such a custom class could be defined as follows:
class my_list:
    def __init__(self, lst):
        self.list = lst

    def __hash__(self):
        # Combine the hashes of the string elements with the int elements.
        my_hash = 0
        for t in self.list:
            my_hash += hash(t[0])
            my_hash += t[1]
        return my_hash

    def __eq__(self, other_list):
        return self.list == other_list.list
Then, you can check that this works:
rdd1.mapValues(lambda x: my_list(x))\
    .subtract(rdd2.mapValues(lambda x: my_list(x)))\
    .collect()
NB: if you work in a shell, do not define the class within the shell, or pickle won't be able to serialize it. Define it in a separate file such as my_list.py, start the shell with pyspark --py-files my_list.py, and in the shell run from my_list import my_list.
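A simpler workaround (a sketch, not from the answer above): if the values don't need to stay lists, convert them to tuples before the subtract and back afterwards. The Spark calls are shown as comments since they need a running SparkContext; the local demonstration uses plain sets, which impose the same hashability requirement:

```python
# Spark sketch (assumes rdd1 and rdd2 as defined in the question):
#   rdd = rdd1.mapValues(tuple).subtract(rdd2.mapValues(tuple)).mapValues(list)
#
# The same idea with plain Python sets:
a = [(1, [("foo", 101)]), (2, [("bar", 222)])]
b = [(1, [("foo", 101)])]

def to_hashable(kv):
    # Turn the list value into a tuple so the whole record is hashable.
    return (kv[0], tuple(kv[1]))

diff = set(map(to_hashable, a)) - set(map(to_hashable, b))
print(diff)  # {(2, (('bar', 222),))}
```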
Related
I have a list of tuples like (id, object) and I want to remove duplicate ids. Where there are multiple pairs with the same id, I'd like to keep the one whose object has the higher score. How could I implement this efficiently?
# For the sake of example - assume that a hashing function is implemented based on the score
class Object:
    def __init__(self, score=0):
        self.score = score

    def __repr__(self):
        return f'<Object {self.score}>'
pairs = [(1, <Object 1>), (1, <Object 1>), (3, <Object 7>), (9, <Object 3>), (9, <Object 4>)]
filtered_pairs = [(1, <Object 1>), (3, <Object 7>), (9, <Object 4>)]
I know that I can call set on the pairs, but that'll only take care of the cases where both the id and score are equivalent (like with Object 1). How can I filter it but in the case where there are matching ids, take the higher score?
I know that I could do groupby from itertools, and implement a sort using the score as the key and then just take the last item from every group, but I'm wondering if there's a more efficient way.
You can use itertools.groupby to group by the first element of each pair and take the max of each group:
from itertools import groupby

class Object:
    def __init__(self, score):
        self.score = score

    def __repr__(self):
        return f'<Object {self.score}>'

pairs = [(1, Object(1)), (1, Object(1)), (3, Object(7)), (9, Object(3)), (9, Object(4))]
filtered_pairs = [max(grp, key=lambda x: x[1].score)
                  for _, grp in groupby(pairs, key=lambda x: x[0])]
print(filtered_pairs)
Output:
[(1, <Object 1>), (3, <Object 7>), (9, <Object 4>)]
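One caveat: itertools.groupby only groups consecutive equal keys, so the input must already be sorted (or at least grouped) by the grouping key, which the pairs list in the question happens to be. For arbitrary ordering, sort first (shown here with plain integer scores to keep the sketch self-contained):

```python
from itertools import groupby

# ids deliberately out of order to show why the sort is needed
pairs = [(9, 3), (1, 1), (9, 4), (3, 7), (1, 1)]
pairs.sort(key=lambda p: p[0])  # groupby only merges *consecutive* equal keys
best = [max(grp, key=lambda p: p[1]) for _, grp in groupby(pairs, key=lambda p: p[0])]
print(best)  # [(1, 1), (3, 7), (9, 4)]
```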
Since you are considering a set, I'm assuming the original order isn't important. If that's the case, one option is to add a __lt__ method to your class so you can compare objects by score. Then sort the tuples in reverse order, group by the integer, and take the first item from each group. It's easier to see in code than to explain:
from itertools import groupby

class myObject:
    def __init__(self, score):
        self.score = score

    def __repr__(self):
        return f'<Object {self.score}>'

    def __lt__(self, other):
        return self.score < other.score

pairs = [(1, myObject(1)), (1, myObject(1)), (3, myObject(7)), (9, myObject(3)), (9, myObject(4))]
[next(v) for k, v in groupby(sorted(pairs, reverse=True), key=lambda x: x[0])]
Result
[(9, <Object 4>), (3, <Object 7>), (1, <Object 1>)]
Something like this:
from collections import namedtuple

Pair = namedtuple('Pair', ['id', 'score'])
pairs = [Pair(*t) for t in [(1, 1), (1, 1), (3, 7), (9, 3), (9, 4)]]

best_pairs = {}
for p in pairs:
    if p.id not in best_pairs or p.score > best_pairs[p.id]:
        best_pairs[p.id] = p.score

pairs = [Pair(*t) for t in best_pairs.items()]
print(pairs)
namedtuple is only there as a more concise version of your Object, and the conversion back to a list of Pairs is only there in case you don't want your result to be the dictionary best_pairs.
Result:
[Pair(id=1, score=1), Pair(id=3, score=7), Pair(id=9, score=4)]
You can sort by score, convert to a dict (so that the max scores end up as the dict values) and convert back to a list of tuples:
class Object:
    def __init__(self, score):
        self.score = score

    def __repr__(self):
        return f'<Object {self.score}>'

    def __gt__(self, other):
        return self.score > other.score

pairs = [(1, Object(1)), (1, Object(1)), (3, Object(7)), (9, Object(4)), (9, Object(3))]
filtered_pairs = list(dict(sorted(pairs)).items())
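The mechanics: sorted() orders the pairs ascending, and within a duplicate id the pair with the higher score sorts last; dict() then lets later entries overwrite earlier ones, so each id keeps its maximum. The same mechanics, with plain integers standing in for Object instances to keep the sketch self-contained:

```python
pairs = [(1, 1), (1, 1), (3, 7), (9, 4), (9, 3)]
# sorted() puts the highest score last within each id; dict() keeps the last.
filtered_pairs = list(dict(sorted(pairs)).items())
print(filtered_pairs)  # [(1, 1), (3, 7), (9, 4)]
```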
P.S.: Thank you everybody, especially Matthias Fripp. I just reviewed the question and you are right, I made a mistake: the string is the value, not the key.
num=[1,2,3,4,5,6]
pow=[1,4,9,16,25,36]
s= ":subtraction"
dic = {1: 1, 0: s, 2: 4, 2: s, 3: 9, 6: s, 4: 16, 12: s, .......}
There is an easy way to convert two lists to a dictionary:
newdic = dict(zip(list1, list2))
but for this problem I have no clue, even with a comprehension:
print({num[i]: pow[i] for i in range(len(num))})
As others have said, a dict cannot contain duplicate keys. You can make the keys effectively unique with a little bit of tweaking. I used OrderedDict to keep the insertion order of the keys:
from pprint import pprint
from collections import OrderedDict
num=[1,2,3,4,5,6]
pow=[1,4,9,16,25,36]
pprint(OrderedDict(sum([[[a, b], ['substraction ({}-{}):'.format(a, b), a-b]] for a, b in zip(num, pow)], [])))
Prints:
OrderedDict([(1, 1),
('substraction (1-1):', 0),
(2, 4),
('substraction (2-4):', -2),
(3, 9),
('substraction (3-9):', -6),
(4, 16),
('substraction (4-16):', -12),
(5, 25),
('substraction (5-25):', -20),
(6, 36),
('substraction (6-36):', -30)])
In principle, this would do what you want:
nums = [(n, p) for (n, p) in zip(num, pow)]
diffs = [('subtraction', p-n) for (n, p) in zip(num, pow)]
items = nums + diffs
dic = dict(items)
However, a dictionary cannot have multiple items with the same key, so each of your "subtraction" items will be replaced by the next one added to the dictionary, and you'll only get the last one. So you might prefer to work with the items list directly.
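The last-one-wins behaviour of duplicate keys is easy to see in isolation:

```python
# Three pairs with the same key: dict() silently keeps only the last value.
items = [("subtraction", 0), ("subtraction", -2), ("subtraction", -6)]
print(dict(items))  # {'subtraction': -6}
```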
If you need the items list sorted as you've shown, that will take a little more work. Maybe something like this:
items = []
for n, p in zip(num, pow):
    items.append((n, p))
    items.append(('subtraction', p - n))

# The next line will drop most 'subtraction' entries, but on
# Python 3.7+ it will at least preserve the order (not possible
# with earlier versions of Python).
dic = dict(items)
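If you do want a single dict, one workaround in the spirit of the OrderedDict answer above (the label format here is made up for illustration) is to make each 'subtraction' key unique by embedding the operands:

```python
num = [1, 2, 3, 4, 5, 6]
pow = [1, 4, 9, 16, 25, 36]

dic = {}
for n, p in zip(num, pow):
    dic[n] = p
    # Embed the operands in the key so no two labels collide.
    dic[f"subtraction ({n}-{p})"] = n - p

print(dic[2], dic["subtraction (2-4)"])  # 4 -2
```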
I am writing some Spark code and I have an RDD which looks like
[(4, <pyspark.resultiterable.ResultIterable at 0x9d32a4c>),
(1, <pyspark.resultiterable.ResultIterable at 0x9d32cac>),
(5, <pyspark.resultiterable.ResultIterable at 0x9d32bac>),
(2, <pyspark.resultiterable.ResultIterable at 0x9d32acc>)]
What I need to do is call distinct on each pyspark.resultiterable.ResultIterable.
I tried this
def distinctHost(a, b):
    p = sc.parallelize(b)
    return (a, p.distinct())

mydata.map(lambda x: distinctHost(*x))
But I get an error:
Exception: It appears that you are attempting to reference
SparkContext from a broadcast variable, action, or transforamtion.
SparkContext can only be used on the driver, not in code that it run
on workers. For more information, see SPARK-5063.
The error is self-explanatory: I cannot use sc on the workers. But I need to find a way to convert the pyspark.resultiterable.ResultIterable to an RDD so that I can call distinct on it.
A straightforward approach is to use sets:
from numpy.random import choice, seed
seed(323)

keys = (4, 1, 5, 2)
hosts = [
    u'in24.inetnebr.com',
    u'ix-esc-ca2-07.ix.netcom.com',
    u'uplherc.upl.com',
    u'slppp6.intermind.net',
    u'piweba4y.prodigy.com'
]

pairs = sc.parallelize(list(zip(choice(keys, 20), choice(hosts, 20)))).groupByKey()
pairs.map(lambda kv: (kv[0], set(kv[1]))).take(3)
Result:
[(1, {u'ix-esc-ca2-07.ix.netcom.com', u'slppp6.intermind.net'}),
(2,
{u'in24.inetnebr.com',
u'ix-esc-ca2-07.ix.netcom.com',
u'slppp6.intermind.net',
u'uplherc.upl.com'}),
(4, {u'in24.inetnebr.com', u'piweba4y.prodigy.com', u'uplherc.upl.com'})]
If there is a particular reason for using rdd.distinct, you can try something like this:
def distinctHost(pairs, key):
    return (pairs
            .filter(lambda kv: kv[0] == key)
            .flatMap(lambda kv: kv[1])
            .distinct())

[(key, distinctHost(pairs, key).collect()) for key in pairs.keys().collect()]
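A further option (my own sketch, not from the answers above): if the duplicates exist at the (key, host) pair level, deduplicate before grouping, so every ResultIterable that groupByKey produces is already duplicate-free. The Spark call is shown as a comment since it needs a running SparkContext; the same dedupe-then-group logic is demonstrated locally:

```python
# Spark sketch (assumes an RDD of (key, host) pairs named raw_pairs):
#   grouped = raw_pairs.distinct().groupByKey().mapValues(list)
#
# The same logic with plain Python:
from itertools import groupby

raw_pairs = [(1, "a"), (1, "a"), (1, "b"), (2, "a"), (2, "a")]
deduped = sorted(set(raw_pairs))  # set() removes duplicate pairs
grouped = {k: [v for _, v in grp] for k, grp in groupby(deduped, key=lambda kv: kv[0])}
print(grouped)  # {1: ['a', 'b'], 2: ['a']}
```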
The docs are lacking an example... How do you use bisect.insort_left() based on a key?
I am trying to insert based on a key:
bisect.insort_left(data, ('brown', 7))
puts the new item at data[0].
From docs...
bisect.insort_left(a, x, lo=0, hi=len(a))
Insert x in a in sorted order. This is equivalent to a.insert(bisect.bisect_left(a, x, lo, hi), x) assuming that a is already sorted. Keep in mind that the O(log n) search is dominated by the slow O(n) insertion step.
Sample usage:
>>> data = [('red', 5), ('blue', 1), ('yellow', 8), ('black', 0)]
>>> data.sort(key=lambda r: r[1])
>>> keys = [r[1] for r in data] # precomputed list of keys
>>> data[bisect_left(keys, 0)]
('black', 0)
>>> data[bisect_left(keys, 1)]
('blue', 1)
>>> data[bisect_left(keys, 5)]
('red', 5)
>>> data[bisect_left(keys, 8)]
('yellow', 8)
>>>
I want to put ('brown', 7) after ('red', 5) in the sorted list data using bisect.insort_left. Right now bisect.insort_left(data, ('brown', 7)) puts ('brown', 7) at data[0], because I am not using the keys to do the insert; the docs don't show how to do inserts using the keys.
You could wrap your iterable in a class that implements __getitem__ and __len__, which gives you the opportunity to use a key with bisect_left, if you set up the class to take the iterable and a key function as arguments.
To extend this to be usable with insort_left, you would also need to implement an insert method. The problem is that insort_left will then try to insert your key argument into the list containing the objects of which the key is a member.
An example makes this clearer:
from bisect import bisect_left, insort_left

class KeyWrapper:
    def __init__(self, iterable, key):
        self.it = iterable
        self.key = key

    def __getitem__(self, i):
        return self.key(self.it[i])

    def __len__(self):
        return len(self.it)

    def insert(self, index, item):
        print('asked to insert %s at index %d' % (item, index))
        self.it.insert(index, {"time": item})

timetable = [{"time": "0150"}, {"time": "0250"}, {"time": "0350"}, {"time": "0450"}, {"time": "0550"}, {"time": "0650"}, {"time": "0750"}]
bslindex = bisect_left(KeyWrapper(timetable, key=lambda t: t["time"]), "0359")
insort_left(KeyWrapper(timetable, key=lambda t: t["time"]), "0359")  # returns None; inserts via KeyWrapper.insert
See how in my insert method I had to make it specific to the timetable dictionary? Otherwise insort_left would try to insert "0359" where it should insert {"time": "0359"}.
Ways around this could be to construct a dummy object for the comparison, inherit from KeyWrapper and override insert, or pass some sort of factory function to create the object. None of these is particularly desirable from an idiomatic Python point of view.
So the easiest way is to just use the KeyWrapper with bisect_left, which returns the insert index, and then do the insert yourself. You could easily wrap this in a dedicated function.
e.g.
bslindex = bisect_left(KeyWrapper(timetable, key=lambda t: t["time"]), "0359")
timetable.insert(bslindex, {"time":"0359"})
In this case, ensure you don't implement insert, so you will be immediately aware if you accidentally pass a KeyWrapper to a mutating function like insort_left, which probably wouldn't do the right thing.
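Such a dedicated function might look like this (a sketch; the name insert_sorted is made up here):

```python
from bisect import bisect_left

class KeyWrapper:
    def __init__(self, iterable, key):
        self.it = iterable
        self.key = key

    def __getitem__(self, i):
        return self.key(self.it[i])

    def __len__(self):
        return len(self.it)

def insert_sorted(seq, item, key):
    """Insert item into seq (already sorted by key), keeping it sorted."""
    index = bisect_left(KeyWrapper(seq, key=key), key(item))
    seq.insert(index, item)

timetable = [{"time": "0150"}, {"time": "0250"}, {"time": "0450"}]
insert_sorted(timetable, {"time": "0359"}, key=lambda t: t["time"])
print([t["time"] for t in timetable])  # ['0150', '0250', '0359', '0450']
```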
To use your example data
from bisect import bisect_left

class KeyWrapper:
    def __init__(self, iterable, key):
        self.it = iterable
        self.key = key

    def __getitem__(self, i):
        return self.key(self.it[i])

    def __len__(self):
        return len(self.it)

data = [('red', 5), ('blue', 1), ('yellow', 8), ('black', 0)]
data.sort(key=lambda c: c[1])

newcol = ('brown', 7)
bslindex = bisect_left(KeyWrapper(data, key=lambda c: c[1]), newcol[1])
data.insert(bslindex, newcol)
print(data)
Here is the class with proper typing:
from typing import TypeVar, Generic, Sequence, Callable

T = TypeVar('T')
V = TypeVar('V')

class KeyWrapper(Generic[T, V]):
    def __init__(self, iterable: Sequence[T], key: Callable[[T], V]):
        self.it = iterable
        self.key = key

    def __getitem__(self, i: int) -> V:
        return self.key(self.it[i])

    def __len__(self) -> int:
        return len(self.it)
This does essentially the same thing as the SortedCollection recipe that the bisect documentation mentions in its See also section, but unlike the insert() method in the recipe, the function shown here supports a key-function.
A separate sorted keys list is maintained in parallel with the sorted data list to improve performance (it's faster than creating the keys list before each insertion, but keeping it around and updating it isn't strictly required). The ActiveState recipe encapsulates this for you within a class, but in the code below they're just two separate independent lists being passed around (so it's easier for them to get out of sync than if they were both held in an instance of the recipe's class).
from bisect import bisect_left

def insert(seq, keys, item, keyfunc=lambda v: v):
    """Insert an item into a sorted list using a separate corresponding
    sorted keys list and a keyfunc() to extract the key from each item.

    Based on the insert() method in the SortedCollection recipe:
    http://code.activestate.com/recipes/577197-sortedcollection/
    """
    k = keyfunc(item)         # Get key.
    i = bisect_left(keys, k)  # Determine where to insert item.
    keys.insert(i, k)         # Insert key of item into keys list.
    seq.insert(i, item)       # Insert the item itself in the corresponding place.

# Initialize the sorted data and keys lists.
data = [('red', 5), ('blue', 1), ('yellow', 8), ('black', 0)]
data.sort(key=lambda r: r[1])  # Sort data by key value.
keys = [r[1] for r in data]    # Initialize keys list.
print(data)  # -> [('black', 0), ('blue', 1), ('red', 5), ('yellow', 8)]

insert(data, keys, ('brown', 7), keyfunc=lambda x: x[1])
print(data)  # -> [('black', 0), ('blue', 1), ('red', 5), ('brown', 7), ('yellow', 8)]
Follow-on question:
Can bisect.insort_left be used?
No, you can't simply use the bisect.insort_left() function to do this, because it wasn't written in a way that supports a key-function; instead it just compares the whole item passed to it, x, with whole items in the array in its if a[mid] < x: statement. You can see what I mean by looking at the source for the bisect module in Lib/bisect.py.
Here's the relevant excerpt:
def insort_left(a, x, lo=0, hi=None):
    """Insert item x in list a, and keep it sorted assuming a is sorted.

    If x is already in a, insert it to the left of the leftmost x.

    Optional args lo (default 0) and hi (default len(a)) bound the
    slice of a to be searched.
    """
    if lo < 0:
        raise ValueError('lo must be non-negative')
    if hi is None:
        hi = len(a)
    while lo < hi:
        mid = (lo+hi)//2
        if a[mid] < x: lo = mid+1
        else: hi = mid
    a.insert(lo, x)
You could modify the above to accept an optional key-function argument and use it:
def my_insort_left(a, x, lo=0, hi=None, keyfunc=lambda v: v):
    x_key = keyfunc(x)  # Get comparison value.
    . . .
    if keyfunc(a[mid]) < x_key:  # Compare key values.
        lo = mid+1
    . . .
...and call it like this:
my_insort_left(data, ('brown', 7), keyfunc=lambda v: v[1])
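Filled out completely, the modified function might look like this (a sketch expanding the excerpt above; it is not part of the standard library):

```python
def my_insort_left(a, x, lo=0, hi=None, keyfunc=lambda v: v):
    """Like bisect.insort_left, but compares keyfunc(item) values."""
    x_key = keyfunc(x)  # Get comparison value.
    if lo < 0:
        raise ValueError('lo must be non-negative')
    if hi is None:
        hi = len(a)
    while lo < hi:
        mid = (lo + hi) // 2
        if keyfunc(a[mid]) < x_key:  # Compare key values.
            lo = mid + 1
        else:
            hi = mid
    a.insert(lo, x)

data = [('black', 0), ('blue', 1), ('red', 5), ('yellow', 8)]
my_insort_left(data, ('brown', 7), keyfunc=lambda v: v[1])
print(data)  # [('black', 0), ('blue', 1), ('red', 5), ('brown', 7), ('yellow', 8)]
```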
Actually, if you're going to write a custom function, for the sake of more efficiency at the expense of unneeded generality, you could dispense with adding a generic key-function argument and just hardcode everything to operate on the data format you have. This avoids the overhead of repeated calls to a key-function while doing the insertions.
def my_insort_left(a, x, lo=0, hi=None):
    x_key = x[1]  # Key on second element of each item in sequence.
    . . .
    if a[mid][1] < x_key: lo = mid+1  # Compare second element to key.
    . . .
...called this way without passing keyfunc:
my_insort_left(data, ('brown', 7))
Add comparison methods to your class
Sometimes this is the least painful way, especially if you already have a class and just want to sort by a key from it:
#!/usr/bin/env python3

import bisect
import functools

@functools.total_ordering
class MyData:
    def __init__(self, color, number):
        self.color = color
        self.number = number

    def __lt__(self, other):
        return self.number < other.number

    def __str__(self):
        return '{} {}'.format(self.color, self.number)

mydatas = [
    MyData('red', 5),
    MyData('blue', 1),
    MyData('yellow', 8),
    MyData('black', 0),
]
mydatas_sorted = []
for mydata in mydatas:
    bisect.insort(mydatas_sorted, mydata)
for mydata in mydatas_sorted:
    print(mydata)
Output:
black 0
blue 1
red 5
yellow 8
See also: "Enabling" comparison for classes
Tested in Python 3.5.2.
Upstream requests/patches
I get the feeling this is going to happen sooner or later ;-)
https://github.com/python/cpython/pull/13970
https://bugs.python.org/issue4356
As of Python 3.10, all the binary search helpers in the bisect module now accept a key argument:
key specifies a key function of one argument that is used to extract a
comparison key from each input element. The default value is None
(compare the elements directly).
Therefore, you can pass the same function you used to sort the data:
>>> import bisect
>>> data = [('red', 5), ('blue', 1), ('yellow', 8), ('black', 0)]
>>> data.sort(key=lambda r: r[1])
>>> data
[('black', 0), ('blue', 1), ('red', 5), ('yellow', 8)]
>>> bisect.insort_left(data, ('brown', 7), key=lambda r: r[1])
>>> data
[('black', 0), ('blue', 1), ('red', 5), ('brown', 7), ('yellow', 8)]
If your goal is to maintain a list sorted by key, performing the usual operations like bisect insert, delete and update, I think sortedcontainers should suit your needs as well, and you'll avoid O(n) inserts.
From Python version 3.10, the key argument has been added.
It will be something like:
import bisect
bisect.insort_left(data, ('brown', 7), key=lambda r: r[1])
Sources:
GitHub feature request
Documentation for version 3.10
Note that the documentation for version 3.9 does not have the key argument.
I have some code written like so:
class Invite(models.Model):
    STATE_UNKNOWN = 0
    STATE_WILL_PLAY = 1
    STATE_WONT_PLAY = 2
    STATE_READY = 3
    STATE_CHOICES = ((STATE_UNKNOWN, _("Unknown")),
                     (STATE_WILL_PLAY, _("Yes, I'll play")),
                     (STATE_WONT_PLAY, _("Sorry, can't play")),
                     (STATE_READY, _("I'm ready to play now")))
    ...
    def change_state(self, state):
        assert(state in dict(Invite.STATE_CHOICES))
This code works like I want it to, but I'm curious as to why it works this way. It is admittedly very convenient that it does work this way, but it seems like maybe I'm missing some underlying philosophy as to why that is.
If I try something like:
dict([(1, 2, 3), (2, 2, 3), (3, 2, 3)])
ValueError: dictionary update sequence element #0 has length 3; 2 is required
it doesn't create a dict that looks like
{1: (2,3), 2: (2,3), 3: (2,3)}
So the general pattern is not to take the first part of the tuple as the key and the rest as the value. Is there some fundamental underpinning that causes this behavior, or is it just, well, it would be convenient if it did...
I think it's somewhat obvious. In your example, (1,2,3) is a single object. So the idea behind a dictionary is to map a key to a value (i.e. object).
So consider the output:
>>> dict(((1,(2,3)), (2,(2,3)))).items()
[(1, (2, 3)), (2, (2, 3))]
But you can also do something like this:
>>> dict((((1,2),3), ((2,2),3))).items()
[((1, 2), 3), ((2, 2), 3)]
Where the key is actually an object too! In this case a tuple also.
So in your example:
dict([(1, 2, 3), (2, 2, 3), (3, 2, 3)])
how do you know which part of each tuple is the key and which is the value?
If you find this annoying, it's a simple fix to write your own constructor:
def special_dict(*args):
    return dict((arg[0], arg[1:]) for arg in args)
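For example, applied to the tuples from the question (repeating the constructor so the sketch is self-contained):

```python
def special_dict(*args):
    # Treat the first element of each tuple as the key, the rest as the value.
    return dict((arg[0], arg[1:]) for arg in args)

print(special_dict((1, 2, 3), (2, 2, 3), (3, 2, 3)))
# {1: (2, 3), 2: (2, 3), 3: (2, 3)}
```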
Also, regarding Rafe's comment, you should define the dictionary right away:
class Invite(models.Model):
    STATE_UNKNOWN = 0
    STATE_WILL_PLAY = 1
    STATE_WONT_PLAY = 2
    STATE_READY = 3
    STATE_CHOICES = dict(((STATE_UNKNOWN, _("Unknown")),
                          (STATE_WILL_PLAY, _("Yes, I'll play")),
                          (STATE_WONT_PLAY, _("Sorry, can't play")),
                          (STATE_READY, _("I'm ready to play now"))))
    ...
    def change_state(self, state):
        assert(state in Invite.STATE_CHOICES)
If you ever want to iterate over the states, all you have to do is:
for state, description in Invite.STATE_CHOICES.iteritems():
    print "{0} == {1}".format(state, description)
The construction of the dictionary in your change_state function is unnecessarily costly.
When you define the Django field, just do:
models.IntegerField(choices=sorted(Invite.STATE_CHOICES.iteritems()))
The constructor of dict accepts (among other things) a sequence of (key, value) tuples. Your second example passes a list of tuples of length 3 instead of 2, and hence fails.
dict([(1, (2, 3)), (2, (2, 3)), (3, (2, 3))])
however will create the dictionary
{1: (2, 3), 2: (2, 3), 3: (2, 3)}
The general pattern is just this: you can create a dict from a list (in general: iterable) of pairs, treated as (key, value). Anything longer would be arbitrary: why (1,2,3)->{1:(2,3)} and not (1,2,3)-> {(1,2):3}?
Moreover, the pairs<->dict conversion is obviously two-way. With triples it couldn't be (see the above example).
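The two-way property is easy to check: a list of pairs survives a round trip through dict(), while a list of triples cannot even make the first leg:

```python
pairs = [(1, (2, 3)), (2, (2, 3))]
d = dict(pairs)
assert list(d.items()) == pairs  # pairs -> dict -> pairs round-trips

triples = [(1, 2, 3), (2, 2, 3)]
try:
    dict(triples)
except ValueError as e:
    print(e)  # dictionary update sequence element #0 has length 3; 2 is required
```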