Background
When experimenting with machine learning, I often reuse models trained previously, by means of pickling/unpickling.
However, when working on the feature-extraction part, it's a challenge not to confuse different models.
Therefore, I want to add a check that ensures that the model was trained using exactly the same feature-extraction procedure as the test data.
Problem
My idea was the following:
Along with the model, I'd include in the pickle dump a hash value which fingerprints the feature-extraction procedure.
When training a model or using it for prediction/testing, the model wrapper is given a feature-extraction class that conforms to certain protocol.
Using hash() on that class won't work, of course, as it isn't persistent across interpreter runs.
So I thought I could maybe find the source file where the class is defined, and get a hash value from that file.
However, there might be a way to get a stable hash value from the class’s in-memory contents directly.
This would have two advantages:
It would also work if no source file can be found.
And it would probably ignore irrelevant changes to the source file (e.g. fixing a typo in the module docstring).
Do classes have a code object that could be used here?
All you’re looking for is a hash procedure that includes all the salient details of the class’s definition. (Base classes can be included by including their definitions recursively.) To minimize false matches, the basic idea is to apply a wide (cryptographic) hash to a serialization of your class. So start with pickle: it supports more types than hash and, when it uses identity, it uses a reproducible identity based on name. This makes it a good candidate for the base case of a recursive strategy: deal with the functions and classes whose contents are important and let it handle any ancillary objects referenced.
So define a serialization by cases. Call an object special if it falls under any case below but the last.
For a tuple deemed to contain special objects:
The character t
The serialization of its len
The serialization of each element, in order
For a dict deemed to contain special objects:
The character d
The serialization of its len
The serialization of each key and value, in sorted key order
For a class whose definition is salient:
The character C
The serialization of its __bases__
The serialization of its vars
For a function whose definition is salient:
The character f
The serialization of its __defaults__
The serialization of its __kwdefaults__ (in Python 3)
The serialization of its __closure__ (but with cell values instead of the cells themselves)
The serialization of its vars
The serialization of its __code__
For a code object (since pickle doesn’t support them at all):
The character c
The serializations of its co_argcount, co_nlocals, co_flags, co_code, co_consts, co_names, co_freevars, and co_cellvars, in that order; none of these are ever special
For a static or class method object:
The character s or m
The serialization of its __func__
For a property:
The character p
The serializations of its fget, fset, and fdel, in that order
For any other object: pickle.dumps(x, -1)
(You never actually store all this: just create a hashlib object of your choice in the top-level function, and in the recursive part update it with each piece of the serialization in turn.)
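Here is a minimal sketch of the whole scheme (the names class_fingerprint and _update are mine; it simply treats every tuple and dict as special, which is at least consistent, and as discussed below you will likely need to add cases):

import hashlib
import pickle
import types

def class_fingerprint(cls, algorithm='sha256'):
    """Hash the salient parts of a class definition, recursively."""
    h = hashlib.new(algorithm)
    _update(h, cls)
    return h.hexdigest()

def _update(h, x):
    if isinstance(x, type):                      # C, bases, namespace
        # This sketch treats every class it meets as salient; a real
        # version might pickle builtin/third-party classes by name instead.
        h.update(b'C')
        _update(h, x.__bases__)
        _update(h, dict(vars(x)))
    elif isinstance(x, types.FunctionType):      # f, defaults, closure, dict, code
        h.update(b'f')
        _update(h, x.__defaults__)
        _update(h, x.__kwdefaults__)
        cells = (None if x.__closure__ is None else
                 tuple(c.cell_contents for c in x.__closure__))
        _update(h, cells)
        _update(h, dict(vars(x)))
        _update(h, x.__code__)
    elif isinstance(x, types.CodeType):          # c, selected co_* fields
        h.update(b'c')
        for field in (x.co_argcount, x.co_nlocals, x.co_flags, x.co_code,
                      x.co_consts, x.co_names, x.co_freevars, x.co_cellvars):
            _update(h, field)
    elif isinstance(x, staticmethod):
        h.update(b's')
        _update(h, x.__func__)
    elif isinstance(x, classmethod):
        h.update(b'm')
        _update(h, x.__func__)
    elif isinstance(x, property):
        h.update(b'p')
        for accessor in (x.fget, x.fset, x.fdel):
            _update(h, accessor)
    elif isinstance(x, tuple):                   # t, len, elements in order
        h.update(b't')
        _update(h, len(x))
        for item in x:
            _update(h, item)
    elif isinstance(x, dict):                    # d, len, sorted items
        h.update(b'd')                           # (assumes orderable keys,
        _update(h, len(x))                       # e.g. the strings in a
        for key in sorted(x):                    # class namespace)
            _update(h, key)
            _update(h, x[key])
    else:
        try:
            h.update(pickle.dumps(x, -1))        # base case: binary pickle
        except Exception:
            h.update(b'?')                       # unpicklable: hash only the
            h.update(pickle.dumps(type(x).__qualname__, -1))  # type's name

You would then store class_fingerprint(YourFeatureExtractor) in the pickle dump next to the model and compare it at load time.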
The type tags are to avoid collisions and in particular to be prefix-free. Binary pickles are already prefix-free. You can base the decision about a container on a deterministic analysis of its contents (even if heuristic) or on context, so long as you’re consistent.
As always, there is something of an art to balancing false positives against false negatives: for a function, you could include __globals__ (with pruning of objects already serialized to avoid large if not infinite serializations) or just any __name__ found therein. Omitting co_varnames ignores renaming local variables, which is good unless introspection is important; similarly for co_filename and co_name.
You may need to support more types: look for static attributes and default arguments that don’t pickle correctly (because they contain references to special types) or at all. Note of course that some types (like file objects) are unpicklable because it’s difficult or impossible to serialize them (although unlike pickle you can handle lambdas just like any other function once you’ve done code objects). At some risk of false matches, you can choose to serialize just the type of such objects (as always, prefixed with a character ? to distinguish from actually having the type in that position).
Related
After learning about the different data types, I learned that once an object of a given type is created, it has built-in methods that can do 'things'.
Playing around, I noticed that, while some methods return a value, others change the original data stored.
Is there any specific term for these two types of methods and is there any intuition or logic as to which methods return a value and which make changes?
For example:
abc = "something"
defg = [12, 34, 11, 45, 132, 1]
abc.capitalize()  # this returns a value
defg.sort()  # this changes the original list
Is there any specific term for these two types of methods
A method that changes an object's state (e.g. list.sort()) is usually called a "mutator" (it "mutates" the object). There's no general name for methods that return values: they could be "getters" (methods that take no arguments and return part of the object's state), alternative constructors (methods that are called on the class itself and provide an alternative way to construct an instance of the class), or just methods that take some arguments, do some computation based on both the arguments and the object's state, and return a result. Or, really, they can do anything (some computation AND a change to the object's state AND returning a value).
is there any intuition or logic as to which methods return a value and which make changes?
Some Python objects are immutable (strings, numerics, tuples, etc.), so when you're working on one of those types you know you won't have any mutator. Apart from this special case, no: you will have to check the docs. The only naming convention here is that methods whose name starts with "set_" and take one argument will change the object's state based on their argument (and most often return nothing), and that methods whose name starts with "get_" and take no arguments will return information about the object's state and change nothing (you'll often see the former called "setters" and the latter "getters"). But like any convention it's only followed by those who follow it; IOW, don't assume that because a method name starts with "get_" or "set_" it will indeed behave as expected.
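For example, a hypothetical class following that convention:

class Thermostat:
    def set_temperature(self, value):
        # "setter"/mutator: changes state, returns nothing
        self._temperature = value

    def get_temperature(self):
        # "getter": returns part of the state, changes nothing
        return self._temperature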
Strings are immutable, so all libraries that do string manipulation will return a new string.
For the other types, you will have to refer to the library documentation.
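To make the difference concrete with the question's own examples:

abc = "something"
defg = [12, 34, 11, 45, 132, 1]

print(abc.capitalize())  # 'Something' - a new string; abc is unchanged
print(abc)               # 'something'

result = defg.sort()     # sorts defg in place; by convention returns None
print(defg)              # [1, 11, 12, 34, 45, 132]
print(result)            # None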
What are the constraints on the elements that can be passed to SparkContext.parallelize to create an RDD? More specifically, if I create a custom class in Python, what methods do I need to implement to ensure it works correctly in an RDD? I'm assuming it needs to implement __eq__ and __hash__ and be picklable. What else? Links to relevant documentation would be greatly appreciated. I couldn't find this anywhere.
Strictly speaking, the only hard requirement is that the class is serializable (picklable), although this is not necessary for objects whose life cycle is limited to a single task (objects that are neither shuffled nor collected / parallelized).
Consistent __hash__ and __eq__ are required only if the class will be used as a shuffle key, either directly (as a key in byKey operations) or indirectly (for example for distinct or cache).
Additionally, the class definition has to be importable on each worker node, so the module has to be already present on the PYTHONPATH or distributed with pyFiles. If the class depends on native dependencies, these have to be present on each worker node as well.
Finally, for sorting, the type has to be orderable using standard Python semantics.
To summarize:
No special requirements, other than being importable:
class Foo:
...
# objects are used locally inside a single task
rdd.map(lambda i: Foo(i)).map(lambda foo: foo.get())
Has to be serializable:
# Has to be pickled to be distributed
sc.parallelize([Foo(1), Foo(2)])
# Has to be pickled to be persisted
sc.range(10).map(lambda i: Foo(i)).cache()
# Has to be pickled to be fetched to the driver
sc.range(10).map(lambda i: Foo(i)).collect() # take, first, etc.
Has to be Hashable:
# Explicitly used as a shuffle key
from operator import add
sc.range(10).map(lambda i: (Foo(i), 1)).reduceByKey(add)  # *byKey
# Implicitly used as a shuffle key
sc.range(10).map(lambda i: Foo(i)).distinct()  # subtract, etc.
Additionally, all variables passed via closures have to be serializable.
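For instance, a Foo satisfying all of the above might look like this (my sketch, not from the question; plain attributes keep it picklable, and value-based __eq__/__hash__ make it consistent across worker processes, unlike the identity-based defaults):

class Foo:
    def __init__(self, value):
        self.value = value

    def get(self):
        return self.value

    # Value-based equality and hashing, safe for use as a shuffle key.
    def __eq__(self, other):
        return isinstance(other, Foo) and self.value == other.value

    def __hash__(self):
        return hash(self.value)

    # Orderable, in case instances are ever sorted.
    def __lt__(self, other):
        return self.value < other.value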
The Python docs (Python 2 and Python 3) state that identifiers must not start with a digit. From my understanding, this is solely a compiler constraint (see also this question). So is there anything wrong with starting dynamically created identifiers with a digit? For example:
type('3Tuple', (object,), {})
setattr(some_object, '123', 123)
Edit
Admittedly, the second example above (using setattr) might be less relevant, as one could introspect the object via dir and discover the attribute '123', but could not retrieve it via some_object.123.
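For illustration (with a hypothetical object x): the attribute is perfectly usable through the dynamic APIs; only the literal attribute syntax is ruled out by the grammar:

class X:
    pass

x = X()
setattr(x, '123', 123)
print('123' in dir(x))    # True
print(getattr(x, '123'))  # 123
# x.123                   # SyntaxError at compile time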
So I'll elaborate a bit more on the first example (which appears more relevant to me).
The user should be provided with fixed-length tuples, and because the tuple length is arbitrary and not known in advance, a proxy function for retrieving such tuples is used (it could also be a class implementing __call__ or __getattr__):
def NTuple(number_of_elements):
# Add methods here.
return type('{0}Tuple'.format(number_of_elements),
(object,),
{'number_of_elements': number_of_elements})
The typical use case involves referencing instances of those dynamically created classes, not the classes themselves, as for example:
limits = NTuple(3)(1, 2, 3)
But still the class name provides some useful information (as opposed to just using 'Tuple'):
>>> limits.__class__.__name__
'3Tuple'
Also, those class names will not be relevant for any code at compile time, hence this doesn't introduce any obstacles for the programmer/user.
I have a bunch of File objects, and a bunch of Folder objects. Each folder has a list of files. Now, sometimes I'd like to lookup which folder a certain file is in. I don't want to traverse over all folders and files, so I create a lookup dict file -> folder.
folder = Folder()
myfile = File()
folder_lookup = {}
# This is pseudocode, I don't actually reach into the Folder
# object, but have an appropriate method
folder.files.append(myfile)
folder_lookup[myfile] = folder
Now, the problem is, the files are mutable objects. My application is built around that fact: I change properties on them, and the GUI is notified and updated accordingly. Of course you can't put mutable objects in dicts. So what I tried first was to generate a hash based on the current content, basically:
def __hash__(self):
return hash((self.title, ...))
This didn't work of course, because when the object's contents changed, its hash (and thus its identity) changed, and everything got messed up. What I need is an object that keeps its identity although its contents change. I tried various things, like making __hash__ return id(self), overriding __eq__, and so on, but never found a satisfying solution. One complication is that the whole construction should be picklable, and that means I'd have to store an id at creation time, since it could change when pickling, I guess.
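A small demonstration of that failure mode, using a stripped-down File class:

class File:
    def __init__(self, title):
        self.title = title

    def __hash__(self):  # content-based hash: looks harmless...
        return hash(self.title)

    def __eq__(self, other):
        return isinstance(other, File) and self.title == other.title

f = File("a.txt")
lookup = {f: "folder"}
f.title = "b.txt"  # the hash changes, so the dict now
lookup[f]          # looks in the wrong bucket: KeyError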
So I basically want to use the identity of an object (not its state) to quickly look up data related to the object. I've actually found a really nice pythonic workaround for my problem, which I might post shortly, but I'd like to see if someone else comes up with a solution.
I felt dirty writing this. Just put folder as an attribute on the file.
class dodgy(list):
def __init__(self, title):
self.title = title
        super(dodgy, self).__init__()
        self.store = type("store", (object,), {"blanket": self})
def __hash__(self):
return hash(self.store)
innocent_d = {}
dodge_1 = dodgy("dodge_1")
dodge_2 = dodgy("dodge_2")
innocent_d[dodge_1] = dodge_1.title
innocent_d[dodge_2] = dodge_2.title
print(innocent_d[dodge_1])
dodge_1.extend(range(5))
dodge_1.title = "oh no"
print(innocent_d[dodge_1])
OK, everybody noticed the extremely obvious workaround (that took me some days to come up with): just put an attribute on File that tells you which folder it is in. (Don't worry, that is also what I did.)
But, it turns out that I was working under wrong assumptions. You are not supposed to use mutable objects as keys, but that doesn't mean you can't (diabolic laughter)! The default implementation of __hash__ returns a unique value, probably derived from the object's address, that remains constant in time. And the default __eq__ follows the same notion of object identity.
So you can put mutable objects in a dict, and they work as expected (if you expect equality based on instance, not on value).
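A quick demonstration (hypothetical File class) that the defaults behave as described:

class File:
    pass

f = File()
folder_lookup = {f: "folder_a"}  # default __hash__/__eq__ use identity
f.title = "renamed.txt"          # mutate the object freely...
print(folder_lookup[f])          # ...the lookup still works: 'folder_a'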
See also: I'm able to use a mutable object as a dictionary key in python. Is this not disallowed?
I was having problems because I was pickling/unpickling the objects, which of course changed the hashes. One could generate a unique ID in the constructor and use that for equality and for deriving a hash, to overcome this.
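A sketch of that idea (the _uid attribute name is my own):

import uuid

class File:
    def __init__(self, title):
        self.title = title
        self._uid = uuid.uuid4()  # fixed at creation; survives pickling

    def __eq__(self, other):
        return isinstance(other, File) and self._uid == other._uid

    def __hash__(self):
        return hash(self._uid)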
(For the curious, as to why such a "lookup based on instance identity" dict might be necessary: I've been experimenting with a kind of "object database". You have pure Python objects, put them in lists/containers, and can define indexes on attributes for faster lookup, complex queries and so on. For foreign keys (1:n relationships) I can just use containers, but for the backlink I have to come up with something clever if I don't want to modify the objects on the n side.)
I am new to Python programming. I have referred to different Python tutorials: some authors say that in Python, numbers (int, float, complex), list, set, tuple, dictionary and string are data types, some say they are data structures, and a few say they are classes. I am confused about which is correct.
I'm doing an essay on Python and found this statement on a random site; just wondering if anyone could clarify it and justify their answer.
exact meanings have changed slightly over time. the latest version of python (python 3) is the simplest and most consistent, so i will explain with that.
let's start with the idea that there are two kinds of things: values and the types of those values.
a value in python can be, for example, a number, a list, or even a function.
types of values describe those. so the type of a number might be int for example.
so far we have only considered things "built in" to the language. but you can also define your own things. to do that you define a new class. the type() function will say (in python 3) that the type of an instance of your class is the class itself:
so maybe you define a class called MyFoo:
>>> class MyFoo:
...     def __init__(self, a):
...         self.a = a
...
>>> foo = MyFoo(1)
>>> type(foo)
<class '__main__.MyFoo'>
compare that with integers:
>>> type(1)
<class 'int'>
and it's clear that the type of your value is its class.
so values in python (eg numbers, lists, and even functions) are all instances of classes. and the type of a value is the class that describes how it behaves.
now things get more complicated, because you can also assign a type to a name! then you have a value that is a type:
>>> x = type(1)
>>> type(x)
<class 'type'>
it turns out that the type of any type is type. which means that any class is itself an instance (of type). which is all a little weird. but it's consistent and not something you need to worry about normally.
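you can check this directly:

>>> type(type(1))
<class 'type'>
>>> type(type) is type
True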
so, in summary, for python 3 (which is simplest):
every value has a type
the type describes how the value works
types are classes
the type of an instance of a class is its class
numbers, lists, functions, user-defined objects are all instances of classes
even classes are instances of classes! they are instances of type!
finally, to try to answer your exact question: some people call classes data types and the instances data structures (i think). it's messy and confusing and people are not very careful. it's easiest to just stick with classes and types (which are the same thing really) and instances.
A "data type" is a description of a kind of data: what kinds of values can be an instance of that kind, and what kind of operations can be done on them.
A "class" is one way of representing a data type (although not the only way), treating the operations on the type as "methods" on the instances of the type (called "objects"). This is a general term across all class-based languages. But Python also has a specific meanings for "class": Something defined by the class statement, or something defined in built-in/extension code that meets certain requirements, is a class.
So, arbitrary-sized integers and mapping dictionaries are data types. In Python, they're represented by the built-in classes int and dict.
A "data structure" is a way of organizing data for efficient or easy access. This isn't directly relevant to data types. In many languages (like C++ or Java), defining a new class requires you to tell the compiler how an instance's members are laid out in memory, and things like that, but in Python you just construct objects and add members to them and the interpreter figures out how to organize them. (There are exceptions that come up when you're building extension modules or using ctypes to build wrapper classes, but don't worry about that.)
Things get blurry when you get to higher-level abstract data structures (like pointer-based nodes) and lower-level abstract data types (like order-preserving collection of elements that can do constant-time insertion and deletion at the head). Is a linked list a data type that inherently requires a certain data structure, or a data structure that defines an obvious data type, or what? Well, unless you major in computer science in college, the answer to that isn't really going to make much difference, as long as you understand the question.
So, mapping dictionaries are data types, but they're also abstract data structures; and, under the covers, Python's dict objects are built from a specific concrete data structure (an open-addressing hash table) which is still partly abstract (each bucket contains a duck-typed value).
The terms "data type" and "class" are synonymous in Python, and they are both correct for the examples you gave. Unlike some other languages, there are no simple types, everything (that you can point to with a variable) in Python is an object. The term "data structure" on the other hand should probably be reserved for container types, for example sets, tuples, dictionaries or lists.