Subclass str to override __eq__ in Python to customize set.intersection - python

I wrote a Python script that creates two sets. I am trying to override __eq__ in the string class so that string a "equals" string b whenever a is contained in b. I subclassed str, and the two sets contain instances of the new class.
Then I tried to use set.intersection to get the result, but the result always shows 0 elements. My code looks like this:
# override str class method __eq__
class newString(str):
    def __new__(self, origial):
        self.value = origial
        return str.__new__(self, origial)
    def __eq__(self, other):
        return other.value in self.value or self.value in other.value
    def __ne__(self, other):
        return not self.__eq__(other)
    def __hash__(self):
        return 1
def get_rows():
    lines = set([])
    for line in file_handler:
        lines.add(newString(line.upper()))
    unique_new_set = lines.intersection(columb)
    intersection_new_set = lines.intersection(columa)
# open file1 and file2 in append model
A = open(mailfile, 'r+U')
B = open(suppfile, 'r+U')
get_rows(intersection, unique, A, AB, CLEAN)
A.close()
B.close()
AB.close()
CLEAN.close()

You cannot use sets to do this, because you also need to produce the same hash values for the two strings. You cannot do that, because you'd have to know up front what containment equalities might exist.
From the object.__hash__ documentation:
Called by built-in function hash() and for operations on members of hashed collections including set, frozenset, and dict. __hash__() should return an integer. The only required property is that objects which compare equal have the same hash value; it is advised to somehow mix together (e.g. using exclusive or) the hash values for the components of the object that also play a part in comparison of objects.
Emphasis mine.
You should not set the return value of __hash__ to a constant either, because then you map all values to the same hash table slot, removing any and all advantages a set might have over other data structures. Instead, you'll get an endless series of hash collisions for any object you try to add to the set, turning an O(1) lookup into O(N).
Sets are the wrong approach because your equality test does not allow for the data to be partitioned into proper sub-sets. If you have the lines The quick brown fox jumps over the lazy dog, quick brown fox and lazy dog, depending on how you build sets you have between 1 and 3 unique values; for sets the values need to be unique in whatever order you add them to a set.
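The order dependence can be seen in a minimal sketch. ContainsStr and build are hypothetical names made up for this illustration; the class mirrors the question's idea (containment counts as equality, constant hash) but is not the asker's exact code.

```python
class ContainsStr(str):
    """Hypothetical str subclass: containment in either direction is 'equal'."""

    def __eq__(self, other):
        return str(other) in str(self) or str(self) in str(other)

    def __ne__(self, other):
        return not self.__eq__(other)

    def __hash__(self):
        # Constant hash, as in the question, so every lookup falls back to __eq__.
        return 1


def build(lines):
    """Add the lines to a set one at a time, in the given order."""
    result = set()
    for line in lines:
        result.add(ContainsStr(line))
    return result


full = "The quick brown fox jumps over the lazy dog"
fox = "quick brown fox"
dog = "lazy dog"

long_first = build([full, fox, dog])   # both short lines "equal" the long one
short_first = build([fox, dog, full])  # fox and dog differ; full "equals" one of them

print(len(long_first), len(short_first))
```

The same three lines produce sets of different sizes depending only on insertion order, which is why the result of an intersection over such objects is not well defined.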

Related

Override python's float for inexact comparison - hash implementation

I ran into an interesting problem. Here is an override of float that compares inexactly:
class Rounder(float):
    """Float wrapper used for inexact comparison."""
    __slots__ = ()
    def __hash__(self):
        raise NotImplementedError
    def __eq__(self, b, rel_tol=1e-06, abs_tol=1e-12):
        """Check if the two floats are equal to the sixth place (relatively)
        or to the twelfth place (absolutely)."""
        try:
            return abs(self - b) <= max(rel_tol * max(abs(self), abs(b)),
                                        abs_tol)  # could use math.isclose
        except TypeError:
            return NotImplemented
    ...
The requirement is that equal objects have equal hashes, but I can't seem to come up with a formula that represents all Rounder(float) instances that would compare the same (so as to map them all to the same hash value). Most of the advice on the web is about how to define hash/equals for classes that compare based on some (immutable) attributes, which does not apply to this case.
There is no valid way to hash these objects. Hashing requires a transitive definition of ==, which your objects do not have. Even if you did something like def __hash__(self): return 0, intransitivity of == would still make your objects unsafe to use as dict keys.
Non-transitivity is one of the big reasons not to define == this way. If you want to do a closeness check, do that explicitly with math.isclose. Don't make that operation ==.
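A small sketch of the intransitivity, using math.isclose with the same relative tolerance as the Rounder class. The three sample values are picked purely for illustration.

```python
import math

REL_TOL = 1e-6  # same relative tolerance as Rounder.__eq__ in the question

# Illustrative values: each neighbour is within tolerance, the two ends are not.
a = 1.0
b = 1.0 + 0.9e-6
c = 1.0 + 1.8e-6

a_close_b = math.isclose(a, b, rel_tol=REL_TOL)
b_close_c = math.isclose(b, c, rel_tol=REL_TOL)
a_close_c = math.isclose(a, c, rel_tol=REL_TOL)

# a ~ b and b ~ c hold, yet a ~ c does not.
print(a_close_b, b_close_c, a_close_c)
```

If this relation were used as ==, a and c would each have to share a hash with b while being unequal to each other, and the same chain could be extended to link any two floats, so no useful hash function exists.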

What are the best practices for __repr__ with collection class Python?

I have a custom Python class which essentially encapsulates a list of some kind of object, and I'm wondering how I should implement its __repr__ function. I'm tempted to go with the following:
class MyCollection:
    def __init__(self, objects = []):
        self._objects = []
        self._objects.extend(objects)
    def __repr__(self):
        return f"MyCollection({self._objects})"
This has the advantage of producing valid Python output which fully describes the class instance. However, in my real-world case, the object list can be rather large and each object may have a large repr by itself (they are arrays themselves).
What are the best practices in such situations? Accept that the repr might often be a very long string? Are there potential issues related to this (debugger UI, etc.)? Should I implement some kind of shortening scheme using semicolon? If so, is there a good/standard way to achieve this? Or should I skip listing the collection's content altogether?
The official documentation outlines this as how you should handle __repr__:
Called by the repr() built-in function to compute the “official”
string representation of an object. If at all possible, this should
look like a valid Python expression that could be used to recreate an
object with the same value (given an appropriate environment). If this
is not possible, a string of the form <...some useful description...>
should be returned. The return value must be a string object. If a
class defines __repr__() but not __str__(), then __repr__() is also
used when an “informal” string representation of instances of that
class is required.
This is typically used for debugging, so it is important that the
representation is information-rich and unambiguous.
Python 3 __repr__ Docs
Lists, strings, sets, tuples and dictionaries all print out the entirety of their collection in their __repr__ method.
Your current code follows what the documentation suggests. I would, however, suggest changing your __init__ method so it looks more like this:
class MyCollection:
    def __init__(self, objects=None):
        if objects is None:
            objects = []
        self._objects = objects
    def __repr__(self):
        return f"MyCollection({self._objects})"
You generally want to avoid using mutable objects as default arguments. Technically because of the way your method is implemented using extend (which makes a copy of the list), it will still work perfectly fine, but Python's documentation still suggests you avoid this.
It is good programming practice to not use mutable objects as default
values. Instead, use None as the default value and inside the
function, check if the parameter is None and create a new
list/dictionary/whatever if it is.
https://docs.python.org/3/faq/programming.html#why-are-default-values-shared-between-objects
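To see what the FAQ is warning about, here is a minimal sketch. The function names append_shared and append_fresh are made up for the illustration.

```python
def append_shared(item, bucket=[]):
    """Mutable default: the same list object is reused on every call."""
    bucket.append(item)
    return bucket


def append_fresh(item, bucket=None):
    """None default: a new list is created whenever none is passed in."""
    if bucket is None:
        bucket = []
    bucket.append(item)
    return bucket


first = append_shared(1)
second = append_shared(2)  # same list as `first`, now holding both items

fresh_one = append_fresh(1)
fresh_two = append_fresh(2)  # independent list

print(first is second, second, fresh_one, fresh_two)
```

The default list is created once, when the def statement runs, so anything that mutates it leaks into later calls.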
If you're interested in how another library handles it differently, the repr for NumPy arrays only shows the first three items and the last three items when the array length is greater than 1,000. It also formats the items so they all use the same amount of space (in the example below, 1000 takes up four characters, so 0 has to be padded with three spaces to match).
>>> repr(np.array([i for i in range(1001)]))
'array([   0,    1,    2, ...,  998,  999, 1000])'
To mimic this numpy array style you could implement a __repr__ method like this in your class:
class MyCollection:
    def __init__(self, objects=None):
        if objects is None:
            objects = []
        self._objects = objects
    def __repr__(self):
        # If length is less than 1,000 return the full list.
        if len(self._objects) < 1000:
            return f"MyCollection({self._objects})"
        else:
            # Get the first and last three items
            items_to_display = self._objects[:3] + self._objects[-3:]
            # Find which item has the longest repr
            max_length_repr = max(items_to_display, key=lambda x: len(repr(x)))
            # Get the length of the item with the longest repr
            padding = len(repr(max_length_repr))
            # Create a list of the reprs of each item and apply the padding
            values = [repr(item).rjust(padding) for item in items_to_display]
            # Insert the '...' in between the 3rd and 4th item
            values.insert(3, '...')
            # Convert the list to a string joined by commas
            array_as_string = ', '.join(values)
            return f"MyCollection([{array_as_string}])"
>>> repr(MyCollection([1,2,3,4]))
'MyCollection([1, 2, 3, 4])'
>>> repr(MyCollection([i for i in range(1001)]))
'MyCollection([   0,    1,    2, ...,  998,  999, 1000])'
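If a hand-rolled scheme feels like too much, the standard library's reprlib module offers a size-limited repr. This is a different technique from the NumPy-style code above, shown only as a sketch: with default settings it keeps the first six list items and replaces the rest with an ellipsis. Copying the input with list() is an added assumption here, not part of the original class.

```python
import reprlib


class MyCollection:
    def __init__(self, objects=None):
        # Copy the input so the collection does not alias the caller's list.
        self._objects = [] if objects is None else list(objects)

    def __repr__(self):
        # reprlib.repr truncates long containers (6 list items by default).
        return f"MyCollection({reprlib.repr(self._objects)})"


short_repr = repr(MyCollection([1, 2, 3]))
long_repr = repr(MyCollection(range(1001)))
print(short_repr)
print(long_repr)
```

The truncated form is no longer a valid Python expression, which is the trade-off the documentation allows for when a full round-trippable repr is impractical.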

Case-insensitive set intersection

What would be the best way to do the following case-insensitive intersection:
a1 = ['Disney', 'Fox']
a2 = ['paramount', 'fox']
a1.intersection(a2)
> ['fox']
Normally I'd do a list comprehension to convert both to all lowercased:
>>> set([_.lower() for _ in a1]).intersection(set([_.lower() for _ in a2]))
set(['fox'])
but it's a bit ugly. Is there a better way to do this?
Using the set comprehension syntax is slightly less ugly:
>>> {str.casefold(x) for x in a1} & {str.casefold(x) for x in a2}
{'fox'}
The algorithm is the same, and there is not any more efficient way available because the hash values of strings are case sensitive.
Using str.casefold instead of str.lower will behave more correctly for international data, and is available since Python 3.3.
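A quick sketch of where the two differ. The German sample strings are chosen for illustration and are not from the question.

```python
a1 = ['Straße', 'Fox']
a2 = ['STRASSE', 'fox']

# lower() keeps 'ß' as is, so 'straße' and 'strasse' do not match.
with_lower = {s.lower() for s in a1} & {s.lower() for s in a2}

# casefold() maps 'ß' to 'ss', so both spellings normalise to 'strasse'.
with_casefold = {s.casefold() for s in a1} & {s.casefold() for s in a2}

print(with_lower)
print(with_casefold)
```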
There are some problems with definitions here, for example in the case that a string appears twice in the same set with two different cases, or in two different sets (which one do we keep?).
With that being said, if you don't care, and you want to perform this sort of intersections a lot of times, you can create a case invariant string object:
class StrIgnoreCase:
    def __init__(self, val):
        self.val = val
    def __eq__(self, other):
        if not isinstance(other, StrIgnoreCase):
            return False
        return self.val.lower() == other.val.lower()
    def __hash__(self):
        return hash(self.val.lower())
And then I'd just maintain both sets so that they contain these objects instead of plain strings. It would require fewer conversions on each creation of new sets and each intersection operation.
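A usage sketch of that wrapper with the question's data. The class is repeated so the snippet runs on its own; the names s1, s2 and common are illustrative.

```python
class StrIgnoreCase:
    def __init__(self, val):
        self.val = val

    def __eq__(self, other):
        if not isinstance(other, StrIgnoreCase):
            return False
        return self.val.lower() == other.val.lower()

    def __hash__(self):
        # Must agree with __eq__: equal objects hash the same.
        return hash(self.val.lower())


s1 = {StrIgnoreCase(x) for x in ['Disney', 'Fox']}
s2 = {StrIgnoreCase(x) for x in ['paramount', 'fox']}

common = s1 & s2
# Which original spelling survives ('Fox' or 'fox') is not something to rely on,
# so normalise before inspecting the result.
common_names = {item.val.lower() for item in common}
print(common_names)
```

This works because, unlike containment or closeness, case-insensitive equality is transitive and has a natural normal form to hash.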

Python ordered list search versus searching sets for list of objects

I have two lists of objects. Let's call the lists a and b. The objects (for our intents and purposes) are defined as below:
from fuzzywuzzy import fuzz
class MyObj:
    def __init__(self, string: str, integer: int):
        self.string = string
        self.integer = integer
    def __eq__(self, other):
        if self.integer == other.integer:
            pass
        else:
            return False
        if fuzz.ratio(self.string, other.string) > 90:  # fuzzywuzzy library checks if strings are "similar enough"
            return True
        else:
            return False
Now what I want to achieve is to check which objects in list a are "in" list b (return true against == when compared to some object in list b).
Currently I'm just looping through them as follows:
for obj in a:
    for other_obj in b:
        if obj == other_obj:
            <do something>
            break
I strongly suspect that there is a faster way of implementing this. The lists are long. Up to like 100 000 objects each. So this is a big bottleneck in my code.
I looked at this answer Fastest way to search a list in python and it suggests that sets work much better. I'm a bit confused by this though:
How significant is the "removal of duplicates" speedup? I don't expect to have many duplicates in my lists.
Can sets remove duplicates and properly hash when I have defined the eq the way I have?
How would this compare with pre-ordering the list, and using something like binary search? A set is unordered...
So what is the best approach here? Please provide implementation guidelines in the answer as well.
TL;DR, when using fuzzy comparison techniques, sets and sorting can be very difficult to work with without some normalization method. You can try to be smart about reducing search spaces as much as possible, but care should be taken to do it consistently.
If a class defines __eq__ and not __hash__, it is not hashable.
For instance, consider the following class
class Name:
    def __init__(self, first, last):
        self.first = first
        self.last = last
    def __repr__(self):
        return f'{self.first} {self.last}'
    def __eq__(self, other):
        return (self.first == other.first) and (self.last == other.last)
Now, if you were to try to create a set with these elements
>>> {Name('Neil', 'Stackoverflow-user')}
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'Name'
So, in the case of Name, you would simply define a __hash__ method. However, in your case, this is more difficult since you have fuzzy equality semantics. The only way I can think of to get around this is to have a normalization function that you can prove will be consistent, and use the normalized string instead of the actual string as part of your hash. Take Floats as dictionary keys as an example of needing to normalize in order to use a "fuzzy" type like floats as keys.
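For the exact-equality Name class, a matching __hash__ could look like the sketch below. Hashing a tuple of the compared fields is one common choice, not the only one.

```python
class Name:
    def __init__(self, first, last):
        self.first = first
        self.last = last

    def __repr__(self):
        return f'{self.first} {self.last}'

    def __eq__(self, other):
        return (self.first == other.first) and (self.last == other.last)

    def __hash__(self):
        # Hash exactly the fields that __eq__ compares.
        return hash((self.first, self.last))


names = {Name('Neil', 'Stackoverflow-user'), Name('Neil', 'Stackoverflow-user')}
print(len(names))
```

This only works because the equality here is exact. With a fuzzy __eq__ there is no such tuple of fields to hash unless the strings are first normalized.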
For sorting and binary searching, you still need to be careful, since you are fuzzy-searching. As an example, assume equality is determined by being within a certain Levenshtein distance. Then book and hook will be similar to each other (distance = 1), but hack, at a distance of 2 from hook, will be closer to hook than to book. So how will you define a good sort order for fuzzy searching in this case?
One thing to try would be to use some form of group-by/bucketing, like a dictionary of the type Dict[int, List[MyObj]], where instances of MyObj are classified by their one constant, the self.integer field. Then you can try comparing smaller sub-lists. This would at least reduce search spaces by clustering.
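A rough sketch of the bucketing idea. To keep it runnable with only the standard library, difflib.SequenceMatcher stands in for fuzz.ratio; that substitution, the 0.9 threshold, and the helper names similar and find_matches are all assumptions made for the illustration.

```python
from collections import defaultdict
from difflib import SequenceMatcher


class MyObj:
    def __init__(self, string, integer):
        self.string = string
        self.integer = integer


def similar(s1, s2, threshold=0.9):
    """Stand-in for the fuzzy check; difflib replaces fuzzywuzzy here."""
    return SequenceMatcher(None, s1, s2).ratio() > threshold


def find_matches(a, b):
    """Return the objects of `a` that fuzzily match something in `b`."""
    # Group b by the exact-match field so each lookup scans one bucket only.
    buckets = defaultdict(list)
    for other in b:
        buckets[other.integer].append(other)

    matches = []
    for obj in a:
        for other in buckets.get(obj.integer, ()):
            if similar(obj.string, other.string):
                matches.append(obj)
                break
    return matches


a = [MyObj("hello world", 1), MyObj("hello world", 2), MyObj("goodbye", 1)]
b = [MyObj("hello world!", 1), MyObj("something else", 2)]
matches = find_matches(a, b)
print([m.string for m in matches])
```

How much this helps depends on how evenly the integer field spreads the objects: with many distinct integers each bucket is small, while a single shared integer degenerates back to the full nested loop.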

Why does my equal method always return false?

Hi I just started learning classes in python and I'm trying to implement an array based list. This is my class and the init constructor.
class List:
    def __init__(self,max_capacity=50):
        self.array=build_array(max_capacity)
        self.count=0
I then wrote an __eq__ method that should return True if the list equals another, but it always returns False. And yes, my append method is working.
def __eq__(self,other):
    result=False
    if self.array==other:
        result=True
    else:
        result=False
    return result
This is how I tested it, but it returns False:
a_list=List()
b_list=[3,2,1]
a_list.append(3)
a_list.append(2)
a_list.append(1)
print(a_list==b_list)
Any help would be appreciated!
EDIT:
After all the helpful suggestions, I figured out I have to iterate through other and a_list and check the elements.
__eq__, for any class, should handle three cases:
self and other are the same object
self and other are compatible instances (up to duck-typing: they don't need to be instances of the same class, but should support the same interface as necessary)
self and other are not comparable.
Keeping these three points in mind, define __eq__ as
def __eq__(self, other):
    if self is other:
        return True
    try:
        return self.array == other.array
    except AttributeError:
        # other doesn't have an array attribute,
        # meaning they can't be equal
        return False
Note this assumes that a List instance should compare as equal to another object as long as both objects have equal array attributes (whatever that happens to mean). If that isn't what you want, you'll have to be more specific in your question.
One final option is to fall back to other == self to see if the type of other knows how to compare itself to your List class. Equality should be symmetric, so self == other and other == self should produce the same value if, indeed, the two values can be compared for equality.
    except AttributeError:
        return other == self
Of course, you need to be careful that this doesn't lead to an infinite loop of List and type(other) repeatedly deferring to the other.
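A different way to get the same fallback without risking that loop is to return NotImplemented, which tells Python to try the reflected other.__eq__(self) once and then fall back to an identity check. The sketch below assumes a plain Python list in place of build_array, and Wrapper is a hypothetical second type invented for the demonstration.

```python
class List:
    def __init__(self):
        self.array = []  # assumption: a plain list stands in for build_array

    def append(self, item):
        self.array.append(item)

    def __eq__(self, other):
        if self is other:
            return True
        try:
            return self.array == other.array
        except AttributeError:
            # Let Python ask `other` instead of calling other == self ourselves.
            return NotImplemented


class Wrapper:
    """Hypothetical type that knows how to compare itself to List."""

    def __init__(self, items):
        self.items = items

    def __eq__(self, other):
        if isinstance(other, List):
            return self.items == other.array
        return NotImplemented


a_list = List()
a_list.append(3)

equal_to_wrapper = (a_list == Wrapper([3]))  # resolved by Wrapper.__eq__
equal_to_int = (a_list == 5)                 # neither side knows: False
print(equal_to_wrapper, equal_to_int)
```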
You are comparing the array embedded in your instance (self.array) to the entirety of the other object. This will not work well.
You need to compare self.array to other.array and/or convert both objects to the same type before comparing. You probably also need to specify what it means to compare two arrays (i.e., you want a single boolean value that indicates whether all elements are equal, not an array of boolean values for each element).
For the code below, I assume you are using a numpy ndarray for self.array. If not, you could write your own array_equal that will convert other to an array, then compare the lengths of the arrays, then return (self.array==other_as_array).all().
If you want to test for strict equality between the objects (same types, same values), you could use this:
from numpy import array_equal
import numpy as np
class List:
    ...
    def __eq__(self, other):
        return isinstance(other, List) and array_equal(self.array, other.array)
If you just want to check for equality of the items in the list, regardless of the object type, then you could do this:
def __eq__(self, other):
    if isinstance(other, List):
        return array_equal(self.array, other.array)
    else:
        return array_equal(self.array, other)
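If self.array is not a NumPy array, the element-wise comparison can be written in plain Python. The sketch below assumes build_array returns a fixed-size list of None (it is not shown in the question) and adds __len__ and __iter__ helpers that the original class may not have.

```python
class List:
    def __init__(self, max_capacity=50):
        # Assumption: a fixed-size list of None stands in for build_array.
        self.array = [None] * max_capacity
        self.count = 0

    def append(self, item):
        self.array[self.count] = item
        self.count += 1

    def __len__(self):
        return self.count

    def __iter__(self):
        # Only the filled part of the backing array counts as content.
        return iter(self.array[:self.count])

    def __eq__(self, other):
        try:
            if len(self) != len(other):
                return False
            return all(x == y for x, y in zip(self, other))
        except TypeError:
            # `other` has no length or is not iterable.
            return NotImplemented


a_list = List()
for value in (3, 2, 1):
    a_list.append(value)

print(a_list == [3, 2, 1], a_list == [3, 2])
```

Comparing only the first count slots matters here, because two lists with different capacities would otherwise differ in their unused padding.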
