I have a class definition with a __hash__ method that uses the object's properties to create a unique key for comparison in Python sets.
The hash method looks like this:
def __hash__(self):
    return int('%d%s' % (self.id, self.create_key))
In a module responsible for implementing this class, several queries are run that could conceivably construct duplicate instances of this class, and the queue that is created in the function responsible for doing this is represented as a set to make sure the dupes can be omitted:
in_set = set()
out_set = set()
for inid in inids:
    ps = Perceptron.getwherelinked(inid, self.in_ents)
    for p in ps:
        in_set.add(p)
for poolid in poolids:
    ps = Perceptron.getwherelinked(poolid, self.out_ents)
    for p in ps:
        out_set.add(p)
return in_set.union(out_set)
Somehow, despite calling the union method, I am still getting two duplicate instances. When printed out (with a __str__ method on the Perceptron class that just returns the hash), the two hashes are identical, which theoretically shouldn't be possible:
set([1630, 1630])
Any guidance would be appreciated.
If a class does not define a __cmp__() or __eq__() method it should not define a __hash__() operation either (source).
Define __eq__().
You also need to implement __eq__() to match your __hash__() implementation.
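For instance, a minimal sketch of an __eq__() (plus __ne__(), needed in Python 2) that matches the question's __hash__(), assuming id and create_key together identify an instance:

def __eq__(self, other):
    # Equal exactly when the fields feeding __hash__ are equal
    return (isinstance(other, Perceptron)
            and self.id == other.id
            and self.create_key == other.create_key)

def __ne__(self, other):
    return not self.__eq__(other)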
Related
What are the constraints on the elements that can be passed to SparkContext.parallelize to create an RDD? More specifically, if I create a custom class in Python, what methods do I need to implement to ensure it works correctly in an RDD? I'm assuming it needs to implement __eq__ and __hash__ and be picklable. What else? Links to relevant documentation would be greatly appreciated. I couldn't find this anywhere.
Strictly speaking, the only hard requirement is that the class is serializable (picklable), although this is not necessary for objects whose life cycle is limited to a single task (objects that are neither shuffled nor collected / parallelized).
Consistent __hash__ and __eq__ are required only if the class will be used as a shuffle key, either directly (as a key in *byKey operations) or indirectly (for example for distinct or cache).
Additionally, the class definition has to be importable on each worker node, so the module has to already be present on the PYTHONPATH, or distributed with pyFiles. If the class depends on native dependencies, these have to be present on each worker node as well.
Finally, for sorting, the type has to be orderable using standard Python semantics.
To summarize:
No special requirements, other than being importable:
class Foo:
    ...

# objects are used locally inside a single task
rdd.map(lambda i: Foo(i)).map(lambda foo: foo.get())
Has to be serializable:
# Has to be pickled to be distributed
sc.parallelize([Foo(1), Foo(2)])
# Has to be pickled to be persisted
sc.range(10).map(lambda i: Foo(i)).cache()
# Has to be pickled to be fetched to the driver
sc.range(10).map(lambda i: Foo(i)).collect() # take, first, etc.
Has to be hashable:
# Explicitly used as a shuffle key
sc.range(10).map(lambda i: (Foo(i), 1)).reduceByKey(add)  # *byKey
# Implicitly used as a shuffle key
sc.range(10).map(lambda i: Foo(i)).distinct()  # subtract, etc.
Additionally, all variables referenced by a closure have to be serializable.
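To make this concrete, here is a minimal sketch of a class that meets the shuffle-key requirements above (the field name value is hypothetical, and add in the reduceByKey example comes from operator.add):

class Foo(object):
    def __init__(self, value):
        self.value = value

    # Consistent __eq__ / __hash__, so Foo can serve as a shuffle key
    def __eq__(self, other):
        return isinstance(other, Foo) and self.value == other.value

    def __ne__(self, other):
        return not self.__eq__(other)

    def __hash__(self):
        return hash(self.value)

    # Orderable, in case instances are ever sorted
    def __lt__(self, other):
        return self.value < other.value

Defined in a module that is present on every worker (or shipped with pyFiles), instances of this class can be used with reduceByKey, distinct, and sortBy alike.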
I have a bunch of File objects, and a bunch of Folder objects. Each folder has a list of files. Now, sometimes I'd like to look up which folder a certain file is in. I don't want to traverse over all folders and files, so I create a lookup dict file -> folder.
folder = Folder()
myfile = File()
folder_lookup = {}
# This is pseudocode, I don't actually reach into the Folder
# object, but have an appropriate method
folder.files.append(myfile)
folder_lookup[myfile] = folder
Now, the problem is, the files are mutable objects. My application is built around that fact: I change properties on them, and the GUI is notified and updates accordingly. Of course you can't put mutable objects in dicts. So what I tried first was to generate a hash based on the current content, basically:
def __hash__(self):
    return hash((self.title, ...))
This didn't work, of course, because when the object's contents changed, its hash (and thus its identity) changed, and everything got messed up. What I need is an object that keeps its identity although its contents change. I tried various things, like making __hash__ return id(self), overriding __eq__, and so on, but never found a satisfying solution. One complication is that the whole construction should be picklable, which means I'd have to store the id on creation, since it could change when pickling, I guess.
So I basically want to use the identity of an object (not its state) to quickly look up data related to the object. I've actually found a really nice pythonic workaround for my problem, which I might post shortly, but I'd like to see if someone else comes up with a solution.
I felt dirty writing this. Just put folder as an attribute on the file.
class dodgy(list):
    def __init__(self, title):
        self.title = title
        super(dodgy, self).__init__()
        self.store = type("store", (object,), {"blanket": self})

    def __hash__(self):
        return hash(self.store)
innocent_d = {}
dodge_1 = dodgy("dodge_1")
dodge_2 = dodgy("dodge_2")
innocent_d[dodge_1] = dodge_1.title
innocent_d[dodge_2] = dodge_2.title
print innocent_d[dodge_1]
dodge_1.extend(range(5))
dodge_1.title = "oh no"
print innocent_d[dodge_1]
OK, everybody noticed the extremely obvious workaround (that took me some days to come up with): just put an attribute on File that tells you which folder it is in. (Don't worry, that is also what I did.)
But, it turns out that I was working under wrong assumptions. You are not supposed to use mutable objects as keys, but that doesn't mean you can't (diabolic laughter)! The default implementation of __hash__ returns a unique value, probably derived from the object's address, that remains constant in time. And the default __eq__ follows the same notion of object identity.
So you can put mutable objects in a dict, and they work as expected (if you expect equality based on instance, not on value).
See also: I'm able to use a mutable object as a dictionary key in python. Is this not disallowed?
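A quick demonstration of this default, identity-based behaviour (using a hypothetical minimal File class):

class File(object):
    pass

f = File()
lookup = {f: "some folder"}

f.title = "renamed"  # mutate the object freely
print lookup[f]      # still found: hashing is by identity, not content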
I was having problems because I was pickling/unpickling the objects, which of course changed the hashes. One could generate a unique ID in the constructor, and use that for equality and deriving a hash to overcome this.
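For example, a sketch of that idea, with uuid4 standing in as one hypothetical way to generate the ID (any scheme that survives pickling would do):

import uuid

class File(object):
    def __init__(self, title):
        self.title = title        # mutable state, free to change
        self._uid = uuid.uuid4()  # fixed identity, pickled along with the object

    def __hash__(self):
        return hash(self._uid)

    def __eq__(self, other):
        return isinstance(other, File) and self._uid == other._uid

    def __ne__(self, other):
        return not self.__eq__(other)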
(For the curious, as to why such a "lookup based on instance identity" dict might be necessary: I've been experimenting with a kind of "object database". You have pure Python objects, put them in lists/containers, and can define indexes on attributes for faster lookup, complex queries and so on. For foreign keys (1:n relationships) I can just use containers, but for the backlink I have to come up with something clever if I don't want to modify the objects on the n side.)
s = set([1,2,3])
It would be elegant to be able to do the following:
a.update(s).update(s)
It doesn't work the way I thought it would; I expected it to make a contain set([1, 2, 3, 1, 2, 3, 1, 2, 3]).
So I'm wondering: does Python advocate this chainable practice?
set.update() returns None, so you can't chain updates like that.
The usual rule in Python is that methods that mutate an object don't return the object.
Contrast this with methods on immutable objects, which must return a new object, such as str.replace(), which can be chained.
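So the non-chained form is the way to go; each call mutates a in place and returns None (a minimal sketch):

a = set()
s = set([1, 2, 3])
a.update(s)  # mutates a, returns None
a.update(s)  # adds nothing new: sets keep unique elements only
print a      # set([1, 2, 3])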
It depends.
Methods that modify an object usually return None, so you can't call a sequence of methods like this:
L.append(2).append(3).append(4)
And hope to have the same effect as:
L.append(2)
L.append(3)
L.append(4)
You'll probably get an AttributeError because the first call to append returns None and None does not have an append method.
Methods that create a new object return that object, so, for example:
"some string".replace("a", "b").replace("c", "d")
Is perfectly fine.
Usually immutable objects return new objects, while mutable ones return None, but it depends on the method and the specific object.
Anyway, it's certainly not a feature of the language itself, but only a way of implementing methods. Chainable methods can be implemented in probably any language, so the question "are Python methods chainable" does not make much sense.
The only reasonable question would be "Are Python methods always/forced to be chainable?", and the answer to that question is "No".
In your example, a set can only contain unique items, so the result you show does not make sense. You probably wanted to use a simple list.
Also, the update method does not return a set, but rather None.
So you cannot invoke another update in a chain on NoneType.
This will give you an error regardless:
a.update(s).update(s)
However, a set can contain only unique values, so even if you split your updates onto separate lines, you won't get the set you want.
Yes, you can chain method calls in Python. As to whether it's good practice, there are a number of libraries out there which advocate using chained calls. For example, SQLAlchemy's tutorial makes extensive use of this style of coding. You frequently encounter code snippets like
session.query(User).filter(User.name.in_(['Edwardo', 'fakeuser'])).all()
A sensible guideline to adopt is to ask yourself whether it'll make the code easier to read and maintain. Always strive to make code readable.
Here is a simple example; chainable methods should always return an object, such as self:
class Chain(object):
    """Chain example"""

    def __init__(self):
        self._content = ''

    def update(self, new_content):
        """Append new content and return self to allow chaining."""
        self._content += new_content
        return self

    def __str__(self):
        return self._content
>>> c = Chain()
>>> c.update('str1').update('str2').update('str3')
>>> print c
str1str2str3
Using Python 2.6, with the set() builtin, not sets.Set.
I have defined some custom data abstraction classes, which will be made members of some sets using the builtin set() object.
The classes are already being stored in a separate structure, before being divided up into sets. All instances of the classes are declared first. No class instances are created or deleted after the first set is declared. No two class instances are ever considered to be "equal" to each other. (Two instances of the class, containing identical data, are considered not the same. A == B is False for all A,B where B is not A.)
Given the above, will there be any reasonable difference between these strategies for testing set_a == set_b?
Option 1: Store integers in the sets that uniquely identify instances of my class.
Option 2: Store instances of my class, and implement __hash__() and __eq__() to compare id(self) == id(other). (This may not be necessary? Do default implementations of these functions in object just do the same thing but faster?) Possibly use an instance variable that increments every time a new instance calls __init__(). (Not thread safe?)
or,
Option 3: The instances are already stored and looked up in dictionaries keyed by rather long strings. The strings are what most directly represents what the instances are, and are kept unique. I thought storing these strings in the sets would be a RAM overhead and/or create a bunch of extra runtime by calling __eq__() and __hash__(). If this is not the case, I should store the strings directly. (But I think what I've read so far tells me it is the case.)
I'm somewhat new to sets in Python. I've figured out some of what I need to know already, just want to make sure I'm not overlooking something tricky or drawing a false conclusion somewhere.
I might be misunderstanding the question, but this is how Python behaves by default:
class Foo(object):
    pass

a = Foo()
b = Foo()
c = Foo()

x = set([a, b])
y = set([a, b])
z = set([a, c])

print x == y  # True
print x == z  # False
Do default implementations of these functions in object just do the same thing but faster?
Yes. User-defined classes have __cmp__() and __hash__() methods by default; with them, all objects compare unequal (except with themselves), and x.__hash__() returns id(x) (see the docs).
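A quick check of those defaults on CPython 2.x:

class Foo(object):
    pass

f = Foo()
print f.__hash__() == id(f)  # typically True: the default hash is derived from id(x)
print f == Foo()             # False: default equality is identity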
I've set up a custom namespace lookup dictionary in order to map elements in XML files to subclasses of ObjectifiedElement. Now, I want to add some data to instances of these classes. But due to the way ObjectifiedElement works, adding an attribute will result in an element being added to the element tree, which is not what I want. More importantly, this doesn't work for all Python types; for example, it is not possible to create an attribute of the list type.
This seems to be possible by subclassing ElementBase instead, but that would imply losing the features provided by ObjectifiedElement. You could say I only need the read part of ObjectifiedElement. I suppose I can add a __getattr__ to my subclasses to simulate this, but I was hoping there was another way.
I ended up having __getattr__() simply forward to etree's find():
from lxml import etree

class SomewhatObjectifiedElement(etree.ElementBase):
    nsmap = {'ns': 'http://www.my.org/namespace'}

    def __getattr__(self, name):
        return self.find('ns:' + name, self.nsmap)
This will only return the first element if several match, unlike ObjectifiedElement's behaviour, but it suffices for my application (usually there can be only a single match; otherwise I use findall()).
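For reference, one hypothetical way to wire the class up with lxml's class lookup machinery (the XML content here is made up):

lookup = etree.ElementDefaultClassLookup(element=SomewhatObjectifiedElement)
parser = etree.XMLParser()
parser.set_element_class_lookup(lookup)

doc = '<root xmlns="http://www.my.org/namespace"><child>hello</child></root>'
root = etree.fromstring(doc, parser)
print(root.child.text)  # attribute access forwards to find(): hello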