I've been trying to create a nested or recursive effect with SequenceMatcher.
The final goal is comparing two sequences, both may contain instances of different types.
For example, the sequences could be:
l1 = [1, "Foo", "Bar", 3]
l2 = [1, "Fo", "Bak", 2]
Normally, SequenceMatcher will identify only [1] as a common sub-sequence for l1 and l2.
I'd like SequenceMatcher to be applied twice for string instances, so that "Foo" and "Fo" will be considered equal, as well as "Bar" and "Bak", and the longest common sub-sequence will be of length 3 [1, Foo/Fo, Bar/Bak]. That is, I'd like SequenceMatcher to be more forgiving when comparing string members.
What I tried doing is write a wrapper for the built-in str class:
from difflib import SequenceMatcher
class myString:
    def __init__(self, string):
        self.string = string

    def __hash__(self):
        return hash(self.string)

    def __eq__(self, other):
        return SequenceMatcher(a=self.string, b=other.string).ratio() > 0.5
Edit: perhaps a more elegant way is:
class myString(str):
    def __eq__(self, other):
        return SequenceMatcher(a=self, b=other).ratio() > 0.5
By doing this, the following is made possible:
>>> Foo = myString("Foo")
>>> Fo = myString("Fo")
>>> Bar = myString("Bar")
>>> Bak = myString("Bak")
>>> l1 = [1, Foo, Bar, 3]
>>> l2 = [1, Fo, Bak, 2]
>>> SequenceMatcher(a=l1, b=l2).ratio()
0.75
So, evidently it's working, but I have a bad feeling about overriding the hash function.
When is the hash used? Where can it come back and bite me?
SequenceMatcher's documentation states the following:
This is a flexible class for comparing pairs of sequences of any type,
so long as the sequence elements are hashable.
And by definition hashable elements are required to fulfill the following requirement:
Hashable objects which compare equal must have the same hash value.
In addition, do I need to override __cmp__ as well?
I'd love to hear about other solutions that come to mind.
Thanks.
Your solution isn't bad - you could also look at re-working the SequenceMatcher to recursively apply when elements of a sequence are themselves iterables, with some custom logic. That would be sort of a pain. If you only want this subset of SequenceMatcher's functionality, writing a custom diff tool might not be a bad idea either.
Overriding __hash__ to make "Foo" and "Fo" equal will cause collisions in dictionaries (hash tables) and such. If you're literally only interested in the first 2 characters and are set on using SequenceMatcher, having __hash__ return hash(self[:2]) might be the way to go.
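To make that concrete, here is a minimal sketch of where the broken contract bites, assuming the myString wrapper from the question is defined: fuzzy-equal objects land in different hash buckets, so dict lookups and set deduplication silently disagree with ==. (Note also that in Python 3, the str-subclass version is worse off still: defining __eq__ without also defining __hash__ sets __hash__ to None, making instances unhashable.)
d = {myString("Foo"): "value"}
print(myString("Foo") == myString("Fo"))       # True: fuzzy-equal (ratio 0.8 > 0.5)
print(myString("Fo") in d)                     # False: hash("Fo") != hash("Foo"), so the
                                               # dict never even calls __eq__
print(len({myString("Foo"), myString("Fo")}))  # 2: "equal" objects are not deduplicated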
All that said, your best bet is probably a one-off diff tool. I can sketch out the basics of something like that if you're interested. You just need to know what the constraints are in the circumstances (does the subsequence always start on the first element, that kind of thing).
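If you do go the one-off route, a sketch of the basics might look like this (fuzzy_eq and fuzzy_lcs are names of my own invention, and the 0.5 threshold is the one from the question). It is a plain dynamic-programming LCS, so no hashing is involved and the __hash__/__eq__ contract never comes into play:
from difflib import SequenceMatcher

def fuzzy_eq(x, y, threshold=0.5):
    # strings are compared fuzzily, everything else exactly
    if isinstance(x, str) and isinstance(y, str):
        return SequenceMatcher(a=x, b=y).ratio() > threshold
    return x == y

def fuzzy_lcs(s1, s2):
    # length of the longest common subsequence under fuzzy_eq,
    # via the classic O(len(s1) * len(s2)) dynamic program
    m, n = len(s1), len(s2)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if fuzzy_eq(s1[i], s2[j]):
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    return dp[m][n]

print(fuzzy_lcs([1, "Foo", "Bar", 3], [1, "Fo", "Bak", 2]))  # 3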
I have a custom Python class which essentially encapsulate a list of some kind of object, and I'm wondering how I should implement its __repr__ function. I'm tempted to go with the following:
class MyCollection:
    def __init__(self, objects = []):
        self._objects = []
        self._objects.extend(objects)

    def __repr__(self):
        return f"MyCollection({self._objects})"
This has the advantage of producing a valid Python output which fully describes the class instance. However, in my real-world case, the object list can be rather large and each object may have a large repr by itself (they are arrays themselves).
What are the best practices in such situations? Accept that the repr might often be a very long string? Are there potential issues related to this (debugger UI, etc.)? Should I implement some kind of shortening scheme using an ellipsis? If so, is there a good/standard way to achieve this? Or should I skip listing the collection's content altogether?
The official documentation outlines this as how you should handle __repr__:
Called by the repr() built-in function to compute the “official”
string representation of an object. If at all possible, this should
look like a valid Python expression that could be used to recreate an
object with the same value (given an appropriate environment). If this
is not possible, a string of the form <...some useful description...>
should be returned. The return value must be a string object. If a
class defines __repr__() but not __str__(), then __repr__() is also
used when an “informal” string representation of instances of that
class is required.
This is typically used for debugging, so it is important that the
representation is information-rich and unambiguous.
Python 3 __repr__ Docs
Lists, strings, sets, tuples and dictionaries all print out the entirety of their collection in their __repr__ method.
Your current code looks to perfectly follow the example of what the documentation suggests. Though I would suggest changing your __init__ method so it looks more like this:
class MyCollection:
    def __init__(self, objects=None):
        if objects is None:
            objects = []
        self._objects = objects

    def __repr__(self):
        return f"MyCollection({self._objects})"
You generally want to avoid using mutable objects as default arguments. Technically, because your method copies the elements into a fresh list with extend rather than keeping a reference to the argument, it will still work perfectly fine, but Python's documentation still suggests you avoid this.
It is good programming practice to not use mutable objects as default
values. Instead, use None as the default value and inside the
function, check if the parameter is None and create a new
list/dictionary/whatever if it is.
https://docs.python.org/3/faq/programming.html#why-are-default-values-shared-between-objects
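A quick illustration of that pitfall, in case it is unfamiliar (the class name Buggy is just for this demo; unlike your version, it assigns the default directly instead of extending a fresh list):
class Buggy:
    def __init__(self, objects=[]):  # one list object, created once, shared by every call
        self._objects = objects      # aliased, not copied

x = Buggy()
y = Buggy()
x._objects.append(1)
print(y._objects)  # [1]: y sees x's mutation, because both instances share the default list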
If you're interested in how another library handles it differently, the repr for Numpy arrays only shows the first three items and the last three items when the array length is greater than 1,000. It also formats the items so they all use the same amount of space (In the example below, 1000 takes up four spaces so 0 has to be padded with three more spaces to match).
>>> import numpy as np
>>> repr(np.array([i for i in range(1001)]))
'array([   0,    1,    2, ...,  998,  999, 1000])'
To mimic this numpy array style you could implement a __repr__ method like this in your class:
class MyCollection:
    def __init__(self, objects=None):
        if objects is None:
            objects = []
        self._objects = objects

    def __repr__(self):
        # If length is less than 1,000, return the full list.
        if len(self._objects) < 1000:
            return f"MyCollection({self._objects})"
        else:
            # Get the first and last three items
            items_to_display = self._objects[:3] + self._objects[-3:]
            # Find which item has the longest repr
            max_length_repr = max(items_to_display, key=lambda x: len(repr(x)))
            # Get the length of the item with the longest repr
            padding = len(repr(max_length_repr))
            # Create a list of the reprs of each item and apply the padding
            values = [repr(item).rjust(padding) for item in items_to_display]
            # Insert the '...' in between the 3rd and 4th items
            values.insert(3, '...')
            # Convert the list to a string joined by commas
            array_as_string = ', '.join(values)
            return f"MyCollection([{array_as_string}])"
>>> repr(MyCollection([1,2,3,4]))
'MyCollection([1, 2, 3, 4])'
>>> repr(MyCollection([i for i in range(1001)]))
'MyCollection([   0,    1,    2, ...,  998,  999, 1000])'
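As for a good/standard way to do the shortening: the standard library's reprlib module produces size-limited reprs, so you could also delegate to it instead of hand-rolling the truncation. It won't give you the numpy-style alignment, though; by default it keeps only the first six list items:
import reprlib

class MyCollection:
    def __init__(self, objects=None):
        self._objects = [] if objects is None else objects

    def __repr__(self):
        # reprlib.repr truncates long containers with '...'
        return f"MyCollection({reprlib.repr(self._objects)})"

>>> repr(MyCollection(list(range(1001))))
'MyCollection([0, 1, 2, 3, 4, 5, ...])'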
What would be the best way to do the following case-insensitive intersection:
a1 = ['Disney', 'Fox']
a2 = ['paramount', 'fox']
a1.intersection(a2)
> ['fox']
Normally I'd do a list comprehension to convert both to all lowercased:
>>> set([_.lower() for _ in a1]).intersection(set([_.lower() for _ in a2]))
set(['fox'])
but it's a bit ugly. Is there a better way to do this?
Using the set comprehension syntax is slightly less ugly:
>>> {str.casefold(x) for x in a1} & {str.casefold(x) for x in a2}
{'fox'}
The algorithm is the same, and there isn't a more efficient way available, because the hash values of strings are case-sensitive.
Using str.casefold instead of str.lower will behave more correctly for international data; it is available since Python 3.3.
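For example, the German 'ß' only matches its conventional expansion 'ss' after casefolding:
>>> 'STRASSE'.lower(), 'STRASSE'.casefold()
('strasse', 'strasse')
>>> 'straße'.lower(), 'straße'.casefold()
('straße', 'strasse')
>>> {str.casefold(x) for x in ['STRASSE']} & {str.casefold(x) for x in ['straße']}
{'strasse'}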
There are some problems with definitions here, for example in the case that a string appears twice in the same set with two different cases, or in two different sets (which one do we keep?).
With that being said, if you don't care, and you want to perform this sort of intersections a lot of times, you can create a case invariant string object:
class StrIgnoreCase:
    def __init__(self, val):
        self.val = val

    def __eq__(self, other):
        if not isinstance(other, StrIgnoreCase):
            return False
        return self.val.lower() == other.val.lower()

    def __hash__(self):
        return hash(self.val.lower())
And then I'd just maintain both the sets so that they contain these objects instead of plain strings. It would require less conversions on each creation of new sets and each intersection operation.
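A quick usage sketch with the class above (as noted, which original spelling survives the intersection is an implementation detail):
>>> s1 = {StrIgnoreCase('Disney'), StrIgnoreCase('Fox')}
>>> s2 = {StrIgnoreCase('paramount'), StrIgnoreCase('fox')}
>>> {s.val for s in s1 & s2}
{'fox'}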
I have two lists of objects. Let's call the lists a and b. The objects (for our intents and purposes) are defined as below:
from fuzzywuzzy import fuzz  # fuzzywuzzy checks if strings are "similar enough"

class MyObj:
    def __init__(self, string: str, integer: int):
        self.string = string
        self.integer = integer

    def __eq__(self, other):
        if self.integer != other.integer:
            return False
        return fuzz.ratio(self.string, other.string) > 90
Now what I want to achieve is to check which objects in list a are "in" list b (return true against == when compared to some object in list b).
Currently I'm just looping through them as follows:
for obj in a:
    for other_obj in b:
        if obj == other_obj:
            <do something>
            break
I strongly suspect that there is a faster way of implementing this. The lists are long. Up to like 100 000 objects each. So this is a big bottleneck in my code.
I looked at this answer Fastest way to search a list in python and it suggests that sets work much better. I'm a bit confused by this though:
How significant is the "removal of duplicates" speedup? I don't expect to have many duplicates in my lists.
Can sets remove duplicates and properly hash when I have defined __eq__ the way I have?
How would this compare with pre-ordering the list, and using something like binary search? A set is unordered...
So what is the best approach here? Please provide implementation guidelines in the answer as well.
TL;DR, when using fuzzy comparison techniques, sets and sorting can be very difficult to work with without some normalization method. You can try to be smart about reducing search spaces as much as possible, but care should be taken to do it consistently.
If a class defines __eq__ and not __hash__, it is not hashable.
For instance, consider the following class
class Name:
    def __init__(self, first, last):
        self.first = first
        self.last = last

    def __repr__(self):
        return f'{self.first} {self.last}'

    def __eq__(self, other):
        return (self.first == other.first) and (self.last == other.last)
Now, if you were to try to create a set with these elements
>>> {Name('Neil', 'Stackoverflow-user')}
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'Name'
So, in the case of Name, you would simply define a __hash__ method. However, in your case, this is more difficult since you have fuzzy equality semantics. The only way I can think of to get around this is to have a normalization function that you can prove will be consistent, and use the normalized string instead of the actual string as part of your hash. Take Floats as dictionary keys as an example of needing to normalize in order to use a "fuzzy" type like floats as keys.
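As a sketch of what that could look like for your MyObj (normalize here is a hypothetical stand-in; the hard part is guaranteeing that any two strings you consider fuzzy-equal always normalize to the same key, which a simple case/whitespace collapse like this one does not actually do):
def normalize(s: str) -> str:
    # hypothetical stand-in: collapses case and runs of whitespace
    return ' '.join(s.lower().split())

class MyObj:
    def __init__(self, string: str, integer: int):
        self.string = string
        self.integer = integer

    def __hash__(self):
        # hash on the normalized string so that "fuzzy-equal" objects
        # can land in the same bucket (sound only if normalize is consistent)
        return hash((self.integer, normalize(self.string)))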
For sorting and binary searching, since you are fuzzy-searching, you still need to be careful. As an example, assume you say equality is determined by being within a certain Levenshtein distance. Then book and hook will be similar to each other (distance = 1), while hack sits at distance 2 from hook but distance 3 from book. So how will you define a good sorting order for fuzzy searching in this case?
One thing to try would be to use some form of group-by/bucketing, like a dictionary of the type Dict[int, List[MyObj]], where instances of MyObj are classified by their one constant, the self.integer field. Then you can try comparing smaller sub-lists. This would at least reduce search spaces by clustering.
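A sketch of that bucketing, reusing the loop from your question (a and b are your two lists of MyObj; defaultdict is just a convenience):
from collections import defaultdict

# Dict[int, List[MyObj]]: group b's objects by their exact integer field
buckets = defaultdict(list)
for other_obj in b:
    buckets[other_obj.integer].append(other_obj)

for obj in a:
    # only objects with a matching integer can ever compare equal,
    # so the fuzzy string comparison runs on far fewer pairs
    for other_obj in buckets[obj.integer]:
        if obj == other_obj:
            # <do something> with the match here
            break
In the best case (integers spread evenly) this cuts the 100,000 x 100,000 comparisons down dramatically; in the worst case (all integers equal) it degenerates to your original double loop.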
If I have a list
a=[4, 5, 6]
as far as I know the simplest way to filter it is like:
[i for i in a if i < 6]
Now I have just been introduced to dataframes, where for a dataframe like
df = pd.DataFrame({'a':[4, 5, 6], 'b':[7, 1, 2]})
I can apply a (row) filter just by specifying the element and the condition:
df[df['a']<6]
This seems more concise and maybe less confusing (once you get used to it) than the way to filter a list. Couldn't a list filter by applied by simply specifying a condition in the [], like
a[<6]
Obviously, it isn't implemented this way currently, but isn't the current method relatively verbose? Why couldn't it be simplified?
You have the simplest way above. However, you can use the following
filtered_list = list(filter(lambda k: k < 6, original_list))  # filter returns a lazy iterator in Python 3, hence the list()
This looks great, but I still have a soft spot for the list comprehension.
You can't have exactly the syntax you're asking for, but if you want to create your own list class, you can have one just as succinct:
class List(list):
    def __lt__(self, other):
        return List(i for i in self if i < other)

a = List([4, 5, 6])
b = a < 6
assert b == [4, 5]
Yes, the Python language could be built so that a[<6] did the filtering you want, but then every Python programmer and every Python implementation would have to learn that special syntax just to save a little source code in a few special cases: >, >=, ==, <=, <.
Pandas does something like that, as you showed, but it is built for lots of numerical analysis, so that syntactic sugar may be more worth the cost. Also, pandas tends to provide lots of syntactic sugar inspired by the R language, so it's not very idiomatic Python.
I have two sets (although I can do lists, or whatever):
a = frozenset(('Today','I','am','fine'))
b = frozenset(('hello','how','are','you','today'))
I want to get:
frozenset(['Today'])
or at least:
frozenset(['today'])
The second option is doable if I lowercase everything I presume, but I'm looking for a more elegant way. Is it possible to do
a.intersection(b)
in a case-insensitive manner?
Shortcuts in Django are also fine since I'm using that framework.
Example from intersection method below (I couldn't figure out how to get this formatted in a comment):
print intersection('Today I am fine tomorrow'.split(),
                   'Hello How a re you TODAY and today and Today and Tomorrow'.split(),
                   key=str.lower)
[(['tomorrow'], ['Tomorrow']), (['Today'], ['TODAY', 'today', 'Today'])]
Here's version that works for any pair of iterables:
def intersection(iterableA, iterableB, key=lambda x: x):
    """Return the intersection of two iterables with respect to `key` function."""
    def unify(iterable):
        d = {}
        for item in iterable:
            d.setdefault(key(item), []).append(item)
        return d

    A, B = unify(iterableA), unify(iterableB)
    return [(A[k], B[k]) for k in A if k in B]
Example:
print intersection('Today I am fine'.split(),
                   'Hello How a re you TODAY'.split(),
                   key=str.lower)
# -> [(['Today'], ['TODAY'])]
Unfortunately, even if you could "change on the fly" the comparison-related special methods of the sets' items (__lt__ and friends; actually, only __eq__ is needed the way sets are currently implemented, but that's an implementation detail), it wouldn't suffice. And you can't anyway, because they belong to a built-in type, str. Moreover, __hash__ is also crucial, and by the time you want to do your intersection it has already been applied, putting the sets' items in different hash buckets from where they'd need to end up to make intersection work the way you want (i.e., there is no guarantee that 'Today' and 'today' land in the same bucket).
So, for your purposes, you inevitably need to build new data structures; if you consider it "inelegant" to have to do that at all, you're plain out of luck: built-in sets just don't carry around the HUGE baggage and overhead that would be needed to allow people to change comparison and hashing functions, which would bloat things by 10 times (or more) for the sake of a need felt in (maybe) one use case in a million.
If you have frequent needs connected with case-insensitive comparison, you should consider subclassing or wrapping str (overriding comparison and hashing) to provide a "case insensitive str" type cistr -- and then, of course, make sure that only instances of cistr are (e.g.) added to your sets (&c) of interest (either by subclassing set &c, or simply by taking care). To give an oversimplified example...:
class ci(str):
    def __hash__(self):
        return hash(self.lower())

    def __eq__(self, other):
        return self.lower() == other.lower()

class cifrozenset(frozenset):
    def __new__(cls, seq=()):
        # frozenset.__new__(cls, ...) so an actual cifrozenset instance is returned
        return frozenset.__new__(cls, (ci(x) for x in seq))
a = cifrozenset(('Today','I','am','fine'))
b = cifrozenset(('hello','how','are','you','today'))
print a.intersection(b)
this does emit frozenset(['Today']), as per your expressed desire. Of course, in real life you'd probably want to do MUCH more overriding (for example...: the way I have things here, any operation on a cifrozenset returns a plain frozenset, losing the precious case independence special feature -- you'd probably want to ensure that a cifrozenset is returned each time instead, and, while quite feasible, that's NOT trivial).
First, don't you mean a.intersection(b)? The intersection (if case insensitive) would be set(['today']). The difference would be set(['i', 'am', 'fine'])
Here are two ideas:
1.) Write a function to convert the elements of both sets to lowercase and then do the intersection. Here's one way you could do it:
>>> intersect_with_key = lambda s1, s2, key=(lambda i: i): set(map(key, s1)).intersection(map(key, s2))
>>> fs1 = frozenset('Today I am fine'.split())
>>> fs2 = frozenset('Hello how are you TODAY'.split())
>>> intersect_with_key(fs1, fs2)
set([])
>>> intersect_with_key(fs1, fs2, key=str.lower)
set(['today'])
>>>
This is not very efficient though because the conversion and new sets would have to be created on each call.
2.) Extend the frozenset class to keep a case insensitive copy of the elements. Override the intersection method to use the case insensitive copy of the elements. This would be more efficient.
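A sketch of idea 2 (the class and attribute names are mine, it is untested beyond the basics, and it only covers intersection; as the previous answer points out, the other set operations and operators such as & would need the same treatment):
class CIFrozenset(frozenset):
    def __new__(cls, seq=()):
        self = frozenset.__new__(cls, seq)
        # cache a lowercased copy once, at construction time
        self._folded = frozenset(s.lower() for s in self)
        return self

    def intersection(self, other):
        folded = getattr(other, '_folded', None)
        if folded is None:
            folded = frozenset(s.lower() for s in other)
        # report the original spellings kept in self
        return frozenset(s for s in self if s.lower() in folded)

>>> a = CIFrozenset(('Today', 'I', 'am', 'fine'))
>>> b = CIFrozenset(('hello', 'how', 'are', 'you', 'today'))
>>> a.intersection(b)
frozenset({'Today'})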
>>> a_, b_ = map(set, [map(str.lower, a), map(str.lower, b)])
>>> a_ & b_
set(['today'])
Or... with fewer maps,
>>> a_ = set(map(str.lower, a))
>>> b_ = set(map(str.lower, b))
>>> a_ & b_
set(['today'])