What kind of objects can be elements in a Spark RDD? - python

What are the constraints on the elements that can be passed to SparkContext.parallelize to create an RDD? More specifically, if I create a custom class in Python, what methods do I need to implement to ensure it works correctly in an RDD? I'm assuming it needs to implement __eq__ and __hash__ and be picklable. What else? Links to relevant documentation would be greatly appreciated. I couldn't find this anywhere.

Strictly speaking, the only hard requirement is that the class is serializable (picklable), although even that is unnecessary for objects whose life cycle is limited to a single task (objects that are neither shuffled nor collected / parallelized).
Consistent __hash__ and __eq__ are required only if the class will be used as a shuffle key, either directly (as a key in *byKey operations) or indirectly (for example for distinct or cache).
Additionally, the class definition has to be importable on each worker node, so the module has to be present on the PYTHONPATH or distributed with pyFiles. If the class depends on native libraries, these have to be present on each worker node as well.
Finally, for sorting, the type has to be orderable using standard Python semantics.
To summarize:
No special requirements, other than being importable:
class Foo:
    ...

# objects are used locally inside a single task
rdd.map(lambda i: Foo(i)).map(lambda foo: foo.get())
Has to be serializable:
# Has to be pickled to be distributed
sc.parallelize([Foo(1), Foo(2)])
# Has to be pickled to be persisted
sc.range(10).map(lambda i: Foo(i)).cache()
# Has to be pickled to be fetched to the driver
sc.range(10).map(lambda i: Foo(i)).collect() # take, first, etc.
Has to be hashable:
# Explicitly used as a shuffle key (assumes: from operator import add)
sc.range(10).map(lambda i: (Foo(i), 1)).reduceByKey(add)  # *byKey
# Implicitly used as a shuffle key
sc.range(10).map(lambda i: Foo(i)).distinct()  # subtract, etc.
Additionally, all variables passed via closures have to be serializable.

How to hash a class or function definition?

Background
When experimenting with machine learning, I often reuse models trained previously, by means of pickling/unpickling.
However, when working on the feature-extraction part, it's a challenge not to confuse different models.
Therefore, I want to add a check that ensures that the model was trained using exactly the same feature-extraction procedure as the test data.
Problem
My idea was the following:
Along with the model, I'd include in the pickle dump a hash value which fingerprints the feature-extraction procedure.
When training a model or using it for prediction/testing, the model wrapper is given a feature-extraction class that conforms to a certain protocol.
Using hash() on that class won't work, of course, as it isn't stable across interpreter runs.
So I thought I could maybe find the source file where the class is defined, and get a hash value from that file.
However, there might be a way to get a stable hash value from the class’s in-memory contents directly.
This would have two advantages:
It would also work if no source file can be found.
And it would probably ignore irrelevant changes to the source file (e.g. fixing a typo in the module docstring).
Do classes have a code object that could be used here?
All you’re looking for is a hash procedure that includes all the salient details of the class’s definition. (Base classes can be included by including their definitions recursively.) To minimize false matches, the basic idea is to apply a wide (cryptographic) hash to a serialization of your class. So start with pickle: it supports more types than hash and, when it uses identity, it uses a reproducible identity based on name. This makes it a good candidate for the base case of a recursive strategy: deal with the functions and classes whose contents are important and let it handle any ancillary objects referenced.
So define a serialization by cases. Call an object special if it falls under any case below but the last.
For a tuple deemed to contain special objects:
  - the character t
  - the serialization of its len
  - the serialization of each element, in order
For a dict deemed to contain special objects:
  - the character d
  - the serialization of its len
  - the serialization of each name and value, in sorted order
For a class whose definition is salient:
  - the character C
  - the serialization of its __bases__
  - the serialization of its vars
For a function whose definition is salient:
  - the character f
  - the serialization of its __defaults__
  - the serialization of its __kwdefaults__ (in Python 3)
  - the serialization of its __closure__ (but with cell values instead of the cells themselves)
  - the serialization of its vars
  - the serialization of its __code__
For a code object (since pickle doesn't support them at all):
  - the character c
  - the serializations of its co_argcount, co_nlocals, co_flags, co_code, co_consts, co_names, co_freevars, and co_cellvars, in that order; none of these are ever special
For a static or class method object:
  - the character s or m
  - the serialization of its __func__
For a property:
  - the character p
  - the serializations of its fget, fset, and fdel, in that order
For any other object: pickle.dumps(x, -1)
(You never actually store all this: just create a hashlib object of your choice in the top-level function, and in the recursive part update it with each piece of the serialization in turn.)
The type tags are to avoid collisions and in particular to be prefix-free. Binary pickles are already prefix-free. You can base the decision about a container on a deterministic analysis of its contents (even if heuristic) or on context, so long as you’re consistent.
As always, there is something of an art to balancing false positives against false negatives: for a function, you could include __globals__ (with pruning of objects already serialized to avoid large if not infinite serializations) or just any __name__ found therein. Omitting co_varnames ignores renaming local variables, which is good unless introspection is important; similarly for co_filename and co_name.
You may need to support more types: look for static attributes and default arguments that don’t pickle correctly (because they contain references to special types) or at all. Note of course that some types (like file objects) are unpicklable because it’s difficult or impossible to serialize them (although unlike pickle you can handle lambdas just like any other function once you’ve done code objects). At some risk of false matches, you can choose to serialize just the type of such objects (as always, prefixed with a character ? to distinguish from actually having the type in that position).

Is it problematic to start a dynamically created identifier with a digit?

The Python docs (Python2 and Python3) state that identifiers must not start with a digit. From my understanding this is solely a compiler constraint (see also this question). So is there anything wrong about starting dynamically created identifiers with a digit? For example:
type('3Tuple', (object,), {})
setattr(some_object, '123', 123)
Edit
Admittedly the second example (using setattr) from above might be less relevant: one can introspect the object via dir and discover the attribute '123', but cannot retrieve it via some_object.123.
So I'll elaborate a bit more on the first example (which appears more relevant to me).
The user should be provided with fixed-length tuples, and because the tuple length is arbitrary and not known in advance, a proxy function for retrieving such tuples is used (it could also be a class implementing __call__ or __getattr__):
def NTuple(number_of_elements):
    # Add methods here.
    return type('{0}Tuple'.format(number_of_elements),
                (object,),
                {'number_of_elements': number_of_elements})
The typical use case involves referencing instances of those dynamically created classes, not the classes themselves, as for example:
limits = NTuple(3)(1, 2, 3)
But still the class name provides some useful information (as opposed to just using 'Tuple'):
>>> limits.__class__.__name__
'3Tuple'
Also, those class names will not be relevant for any code at compile time, so they don't introduce any obstacles for the programmer/user.

Use python dict to lookup mutable objects

I have a bunch of File objects and a bunch of Folder objects. Each folder has a list of files. Now, sometimes I'd like to look up which folder a certain file is in. I don't want to traverse over all folders and files, so I create a lookup dict file -> folder.
folder = Folder()
myfile = File()
folder_lookup = {}
# This is pseudocode, I don't actually reach into the Folder
# object, but have an appropriate method
folder.files.append(myfile)
folder_lookup[myfile] = folder
Now, the problem is that the files are mutable objects. My application is built around that fact: I change properties on them, and the GUI is notified and updated accordingly. Of course you can't put mutable objects in dicts. So what I tried first was to generate a hash based on the current content, basically:
def __hash__(self):
    return hash((self.title, ...))
This didn't work, of course, because when the object's contents changed, its hash (and thus its identity) changed, and everything got messed up. What I need is an object that keeps its identity although its contents change. I tried various things, like making __hash__ return id(self), overriding __eq__, and so on, but never found a satisfying solution. One complication is that the whole construction should be picklable, which means I'd have to store the id on creation, since it could change across a pickle round-trip, I guess.
So I basically want to use the identity of an object (not its state) to quickly look up data related to the object. I've actually found a really nice pythonic workaround for my problem, which I might post shortly, but I'd like to see if someone else comes up with a solution.
I felt dirty writing this. Just put folder as an attribute on the file.
class dodgy(list):
    def __init__(self, title):
        self.title = title
        super(dodgy, self).__init__()
        # a throwaway class created per instance; its identity never changes
        self.store = type("store", (object,), {"blanket": self})

    def __hash__(self):
        return hash(self.store)

innocent_d = {}
dodge_1 = dodgy("dodge_1")
dodge_2 = dodgy("dodge_2")
innocent_d[dodge_1] = dodge_1.title
innocent_d[dodge_2] = dodge_2.title
print(innocent_d[dodge_1])
dodge_1.extend(range(5))
dodge_1.title = "oh no"
print(innocent_d[dodge_1])
OK, everybody noticed the extremely obvious workaround (that took me some days to come up with): just put an attribute on File that tells you which folder it is in. (Don't worry, that is also what I did.)
But, it turns out that I was working under wrong assumptions. You are not supposed to use mutable objects as keys, but that doesn't mean you can't (diabolical laughter)! The default implementation of __hash__ returns a unique value, derived from the object's id(), that remains constant during the object's lifetime. And the default __eq__ follows the same notion of object identity.
So you can put mutable objects in a dict, and they work as expected (if you expect equality based on instance, not on value).
See also: I'm able to use a mutable object as a dictionary key in python. Is this not disallowed?
I was having problems because I was pickling/unpickling the objects, which of course changed the hashes. One could overcome this by generating a unique ID in the constructor and using that for equality and for deriving a hash.
(For the curious, as to why such a "lookup based on instance identity" dict might be necessary: I've been experimenting with a kind of "object database". You have pure Python objects, put them in lists/containers, and can define indexes on attributes for faster lookup, complex queries and so on. For foreign keys (1:n relationships) I can just use containers, but for the backlink I have to come up with something clever if I don't want to modify the objects on the n side.)

Python - set somehow getting duplicate data

I have a class definition with a __hash__ function that uses the object's properties to create a unique key for comparison in Python sets.
The hash method looks like this:
def __hash__(self):
    return int('%d%s' % (self.id, self.create_key))
In a module responsible for implementing this class, several queries are run that could conceivably construct duplicate instances of this class, and the queue that is created in the function responsible for doing this is represented as a set to make sure the dupes can be omitted:
in_set = set()
out_set = set()
for inid in inids:
    ps = Perceptron.getwherelinked(inid, self.in_ents)
    for p in ps:
        in_set.add(p)
for poolid in poolids:
    ps = Perceptron.getwherelinked(poolid, self.out_ents)
    for p in ps:
        out_set.add(p)
return in_set.union(out_set)
Somehow, despite calling the union method, I am still getting the two duplicate instances. When printed out (with a __str__ method in the Perceptron class that just calls __hash__), the two hashes are identical, which theoretically shouldn't be possible.
set([1630, 1630])
Any guidance would be appreciated.
If a class does not define a __cmp__() or __eq__() method it should not define a __hash__() operation either
source
Define __eq__().
You also need to implement __eq__() to match your __hash__() implementation: a set only treats two elements as duplicates if their hashes are equal and they also compare equal, and the default __eq__ is based on object identity, so two distinct instances with identical hashes still count as different elements.

Best way to store and use a large text-file in python

I'm creating a networked server for a boggle-clone I wrote in Python, which accepts users, solves the boards, and scores the player input. The dictionary file I'm using is 1.8MB (the ENABLE2K dictionary), and I need it to be available to several game solver classes. Right now, each class iterates through the file line by line and generates a hash table (associative array), but the more solver classes I instantiate, the more memory it takes up.
What I would like to do is import the dictionary file once and pass it to each solver instance as they need it. But what is the best way to do this? Should I import the dictionary in the global space, then access it in the solver class as globals()['dictionary']? Or should I import the dictionary then pass it as an argument to the class constructor? Is one of these better than the other? Is there a third option?
If you create a dictionary.py module, containing code which reads the file and builds a dictionary, this code will only be executed the first time it is imported. Further imports will return a reference to the existing module instance. As such, your classes can:
import dictionary
dictionary.words[whatever]
where dictionary.py has:
words = {}
# read file and add to 'words'
Even though it is essentially a singleton at this point, the usual arguments against globals apply. For a Pythonic singleton substitute, look up the "Borg" pattern.
That's really the only difference. Once the dictionary object is created, you are only binding new references as you pass it along, unless you explicitly perform a deep copy. It makes sense to construct it centrally, once and only once, so long as each solver instance does not require a private copy for modification.
Adam, remember that in Python when you say:
a = read_dict_from_file()
b = a
... you are not actually copying a, and thus using more memory, you are merely making b another reference to the same object.
So basically any of the solutions you propose will be far better in terms of memory usage. Basically, read in the dictionary once and then hang on to a reference to that. Whether you do it with a global variable, or pass it to each instance, or something else, you'll be referencing the same object and not duplicating it.
Which one is most Pythonic? That's a whole 'nother can of worms, but here's what I would do personally:
def main(args):
    run_initialization_stuff()
    dictionary = read_dictionary_from_file()
    solvers = [Solver(dictionary=dictionary)
               for _ in range(number_of_solvers)]
HTH.
Depending on what your dict contains, you may be interested in the 'shelve' or 'anydbm' modules. They give you dict-like interfaces (just strings as keys and items for 'anydbm'; strings as keys and any picklable Python object as item for 'shelve'), but the data actually lives in a DBM file (gdbm, ndbm, dbhash, bsddb, depending on what's available on the platform). You probably still want to share the actual database between classes as you are asking for, but it would avoid the parsing-the-textfile step as well as the keeping-it-all-in-memory bit.
