What would be the best way to do the following case-insensitive intersection:
a1 = ['Disney', 'Fox']
a2 = ['paramount', 'fox']
a1.intersection(a2)
> ['fox']
Normally I'd do a list comprehension to convert both to all-lowercase strings:
>>> set([_.lower() for _ in a1]).intersection(set([_.lower() for _ in a2]))
set(['fox'])
but it's a bit ugly. Is there a better way to do this?
Using the set comprehension syntax is slightly less ugly:
>>> {str.casefold(x) for x in a1} & {str.casefold(x) for x in a2}
{'fox'}
The algorithm is the same, and there is no more efficient approach available, because string hash values are case-sensitive.
Using str.casefold instead of str.lower behaves more correctly for international data; it is available since Python 3.3.
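For example, casefold folds the German sharp s ('ß') to 'ss', which lower leaves unchanged:
>>> 'STRASSE'.lower() == 'straße'.lower()
False
>>> 'STRASSE'.casefold() == 'straße'.casefold()
True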
There are some definitional problems here: for example, what happens when a string appears twice in the same set with two different cases, or appears in both sets (which spelling do we keep?).
That being said, if you don't care, and you want to perform this sort of intersection many times, you can create a case-invariant string object:
class StrIgnoreCase:
    def __init__(self, val):
        self.val = val

    def __eq__(self, other):
        if not isinstance(other, StrIgnoreCase):
            return False
        return self.val.lower() == other.val.lower()

    def __hash__(self):
        return hash(self.val.lower())
I'd then maintain both sets so that they contain these objects instead of plain strings. This requires fewer conversions: each string is lowercased once, rather than on every creation of a new set and every intersection operation.
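A quick usage sketch (since StrIgnoreCase defines no __repr__, the .val attribute is printed; which original casing survives the intersection is an implementation detail, as noted above):

s1 = {StrIgnoreCase('Disney'), StrIgnoreCase('Fox')}
s2 = {StrIgnoreCase('paramount'), StrIgnoreCase('fox')}

common = s1 & s2
print([m.val for m in common])  # -> ['Fox'] or ['fox']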
Related
Short version of the question:
When choosing between a dictionary whose keys are the integers in range(n) and a list of length n, what are the key considerations for picking one over the other? Things like "if you are doing a lot of this thing with your object, then a dictionary is better".
Long version of the question
I'm not sure if the following details of my implementation matter for the question... So here it is.
In trying to make my code a bit more pythonic, I implemented a subclass of UserList that accepts as index both an integer and a list that represents an integer in base l.
from collections import UserList

class MyList(UserList):
    """
    A list that can be accessed either by a g-tuple of coefficients in
    range(l) or by the corresponding integer.
    """
    def __init__(self, data=None, l=2, g=None):
        self.l = l
        if data is None:
            if g is None:
                raise ValueError("either data or g must be given")
            self.data = [0] * (l**g)
        else:
            self.data = data

    def __setitem__(self, key, value):
        if isinstance(key, int):
            self.data[key] = value
        else:
            self.data[self.idx(key)] = value

    def __getitem__(self, key):
        if isinstance(key, int):
            return self.data[key]
        return self.data[self.idx(key)]

    def idx(self, key):
        # Interpret `key` as little-endian digits in base `l`.
        l = self.l
        idx = 0
        for i, value in enumerate(key):
            idx += value * l**i
        return idx
Which can be used like this:
L = MyList(l=4, g=2) #creates a list of length 4**2 initialized at zero
L[9] = 'Hello World'
L[9] == L[1,2]
I have generalized this class to also accept l as a tuple of bases (let's call this generalized class MyListTuple). It works great, but the code is in SageMath, so I don't really want to translate it to pure Python here.
It would look something like this:
L = MyListTuple(l=[2,4], g=2) #creates a list of length 2^2*4^2 initialized at zero
L[0,9] = 'Hello World'
L[0,9] == L[[0,0],[1,2]]
For the next part I want to improve, I currently use a dictionary whose keys are tuples of integers (so you would access it as d[9,13,0]), but I also want to be able to use, as equivalent keys, lists representing each integer in base l as above (so for l=4 that would be d[[1,2], [1,3], [0,0]]).
This is very similar to what I have done in MyListTuple, but in this case, a lot of the keys are never used.
So my question is: how should I choose between creating a subclass of UserDict that handles keys the way MyListTuple does, and just using MyListTuple even though in most cases most entries will never be used?
Or, as I phrased it above: what are the usage details of this structure that should drive the choice between the two? (things like "if you are doing a lot of this thing with your object, then a dictionary is better")
(Will only try to address the general "list vs dict" part.
Take this with a grain of salt; from a user, not an implementer.
This is not a real answer, more of a big comment.)
A Python list is a dynamic array of pointers to objects,
not a linked list. Indexing is O(1), and append/pop at the
end are amortized O(1); insertion & deletion anywhere else
are O(n), because the following elements must be shifted.
Searching by value is inefficient (O(n) -check all items-,
plus cache misses* -bad locality of reference-).
(*vs items stored contiguously in memory (e.g. numpy.array)).
Dict (a hash map) should theoretically provide
efficient searches, insertions & deletions (amortized O(1));
but the constants depend on the quality of the hash function,
the load factor, the usage patterns, etc. (I don't know enough).
Iterating through all items sequentially will be inefficient
for both, due to cache misses / bad locality of reference
(following pointers, instead of accessing memory sequentially).
As far as I know:
You would use lists as mutable sequences (when you need to
iterate over all items) in Python, for lack of a better
alternative (C arrays, C++ std::array/std::vector, etc.).
You would use dicts for quick lookup/search based on keys,
when searching is more important/frequent than insertion/deletion.
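One concrete consideration for the question as asked: since most entries will never be used, a dict stores only the keys actually assigned, which can be a large memory win for sparse data. Here's a rough sketch (my own illustration; MyDict and _idx are made-up names) of what a UserDict counterpart to MyList might look like for the single-base case:

from collections import UserDict

class MyDict(UserDict):
    """
    Sparse counterpart to MyList: integer keys and base-l digit tuples
    are treated as equivalent; only assigned entries consume memory.
    """
    def __init__(self, l=2):
        super().__init__()
        self.l = l

    def _idx(self, key):
        # Accept either a plain int or an iterable of base-l digits.
        if isinstance(key, int):
            return key
        return sum(digit * self.l**i for i, digit in enumerate(key))

    def __setitem__(self, key, value):
        self.data[self._idx(key)] = value

    def __getitem__(self, key):
        return self.data[self._idx(key)]

d = MyDict(l=4)
d[9] = 'Hello World'
print(d[9] == d[1, 2])  # True: (1, 2) encodes 1 + 2*4 = 9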
I have two lists of objects. Let's call the lists a and b. The objects (for our intents and purposes) are defined as below:
from fuzzywuzzy import fuzz  # fuzzywuzzy checks whether strings are "similar enough"

class MyObj:
    def __init__(self, string: str, integer: int):
        self.string = string
        self.integer = integer

    def __eq__(self, other):
        if self.integer != other.integer:
            return False
        return fuzz.ratio(self.string, other.string) > 90
Now what I want to achieve is to check which objects in list a are "in" list b (return true against == when compared to some object in list b).
Currently I'm just looping through them as follows:
for obj in a:
    for other_obj in b:
        if obj == other_obj:
            <do something>
            break
I strongly suspect that there is a faster way of implementing this. The lists are long. Up to like 100 000 objects each. So this is a big bottleneck in my code.
I looked at this answer Fastest way to search a list in python and it suggests that sets work much better. I'm a bit confused by this though:
How significant is the "removal of duplicates" speedup? I don't expect to have many duplicates in my lists.
Can sets remove duplicates and hash properly when I have defined __eq__ the way I have?
How would this compare with pre-ordering the list, and using something like binary search? A set is unordered...
So what is the best approach here? Please provide implementation guidelines in the answer as well.
TL;DR, when using fuzzy comparison techniques, sets and sorting can be very difficult to work with without some normalization method. You can try to be smart about reducing search spaces as much as possible, but care should be taken to do it consistently.
If a class defines __eq__ and not __hash__, it is not hashable.
For instance, consider the following class
class Name:
    def __init__(self, first, last):
        self.first = first
        self.last = last

    def __repr__(self):
        return f'{self.first} {self.last}'

    def __eq__(self, other):
        return (self.first == other.first) and (self.last == other.last)
Now, if you were to try to create a set with these elements
>>> {Name('Neil', 'Stackoverflow-user')}
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'Name'
So, in the case of Name, you would simply define a __hash__ method. However, in your case, this is more difficult since you have fuzzy equality semantics. The only way I can think of to get around this is to have a normalization function that you can prove will be consistent, and use the normalized string instead of the actual string as part of your hash. Take Floats as dictionary keys as an example of needing to normalize in order to use a "fuzzy" type like floats as keys.
For sorting and binary searching, since you are fuzzy-searching, you still need to be careful. As an example, assume equality is determined by being within a certain Levenshtein distance. Then book and hook will be similar to each other (distance = 1), but hack, at distance 2 from hook, sorts closer to hook alphabetically than book does. So how will you define a good sort order for fuzzy searching in this case?
One thing to try would be to use some form of group-by/bucketing, like a dictionary of the type Dict[int, List[MyObj]], where instances of MyObj are classified by their one constant, the self.integer field. Then you can try comparing smaller sub-lists. This would at least reduce search spaces by clustering.
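A minimal sketch of that bucketing idea, reusing MyObj from the question (the find_matches name is mine; the fuzzy check lives in MyObj.__eq__ as defined above):

from collections import defaultdict

def find_matches(a, b):
    # Bucket b's objects by their one exact field, self.integer...
    buckets = defaultdict(list)
    for obj in b:
        buckets[obj.integer].append(obj)
    # ...then fuzzy-compare each object in a only against its own bucket.
    for obj in a:
        for other in buckets.get(obj.integer, []):
            if obj == other:  # MyObj.__eq__ does the fuzzy string check
                yield obj, other
                break

This turns the O(len(a) * len(b)) double loop into roughly O(len(a) * k), where k is the size of the largest bucket.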
I wrote a Python script to create two sets. I was trying to override __eq__ in the str class, so that the equality logic is: if string a is in string b, then a "equals" b. I subclassed str, and the two sets contain instances of the new class.
Then I tried to use set.intersection to get the result, but the result always shows 0 items. My code is like this:
# override str's __eq__ method
class newString(str):
    def __new__(cls, original):
        obj = str.__new__(cls, original)
        obj.value = original
        return obj

    def __eq__(self, other):
        return other.value in self.value or self.value in other.value

    def __ne__(self, other):
        return not self.__eq__(other)

    def __hash__(self):
        return 1
def get_rows():
    lines = set([])
    for line in file_handler:
        lines.add(newString(line.upper()))
    unique_new_set = lines.intersection(columb)
    intersection_new_set = lines.intersection(columa)

# open file1 and file2 in append mode
A = open(mailfile, 'r+U')
B = open(suppfile, 'r+U')
get_rows(intersection, unique, A, AB, CLEAN)
A.close()
B.close()
AB.close()
CLEAN.close()
You cannot use sets to do this, because you also need to produce the same hash values for the two strings. You cannot do that, because you'd have to know up front what containment equalities might exist.
From the object.__hash__ documentation:
Called by built-in function hash() and for operations on members of hashed collections including set, frozenset, and dict. __hash__() should return an integer. The only required property is that objects which compare equal have the same hash value; it is advised to somehow mix together (e.g. using exclusive or) the hash values for the components of the object that also play a part in comparison of objects.
Emphasis mine.
You cannot set the return value of __hash__ to a constant, because then you map all values to the same hash table slot, removing any and all advantages a set might have over other data structures. Instead, you'll get an endless series of hash collisions for any object you try to add to the set, turning an O(1) lookup into O(N).
Sets are the wrong approach because your equality test does not allow the data to be partitioned into proper sub-sets. If you have the lines The quick brown fox jumps over the lazy dog, quick brown fox and lazy dog, then depending on the order in which you add them you end up with anywhere between 1 and 3 "unique" values; set semantics require that what counts as unique not depend on insertion order.
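A quick demonstration of why no consistent partition exists, using the newString class from the question: containment "equality" is not transitive.

a = newString('THE QUICK BROWN FOX JUMPS OVER THE LAZY DOG')
b = newString('QUICK BROWN FOX')
c = newString('LAZY DOG')

print(a == b)  # True  (b is contained in a)
print(a == c)  # True  (c is contained in a)
print(b == c)  # False (no containment either way)

Since a "equals" both b and c while b and c differ, there is no way to bucket these values so that equal items always land in the same bucket.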
I've been trying to create a nested or recursive effect with SequenceMatcher.
The final goal is comparing two sequences, both may contain instances of different types.
For example, the sequences could be:
l1 = [1, "Foo", "Bar", 3]
l2 = [1, "Fo", "Bak", 2]
Normally, SequenceMatcher will identify only [1] as a common sub-sequence for l1 and l2.
I'd like SequenceMatcher to be applied twice for string instances, so that "Foo" and "Fo" will be considered equal, as well as "Bar" and "Bak", and the longest common sub-sequence will be of length 3: [1, Foo/Fo, Bar/Bak]. That is, I'd like SequenceMatcher to be more forgiving when comparing string members.
What I tried doing is write a wrapper for the built-in str class:
from difflib import SequenceMatcher

class myString:
    def __init__(self, string):
        self.string = string

    def __hash__(self):
        return hash(self.string)

    def __eq__(self, other):
        # note: compare against other.string, not self.string twice
        return SequenceMatcher(a=self.string, b=other.string).ratio() > 0.5
Edit: perhaps a more elegant way is:
class myString(str):
    def __eq__(self, other):
        return SequenceMatcher(a=self, b=other).ratio() > 0.5
By doing this, the following is made possible:
>>> Foo = myString("Foo")
>>> Fo = myString("Fo")
>>> Bar = myString("Bar")
>>> Bak = myString("Bak")
>>> l1 = [1, Foo, Bar, 3]
>>> l2 = [1, Fo, Bak, 2]
>>> SequenceMatcher(a=l1, b=l2).ratio()
0.75
So, evidently it's working, but I have a bad feeling about overriding the hash function.
When is the hash used? Where can it come back and bite me?
SequenceMatcher's documentation states the following:
This is a flexible class for comparing pairs of sequences of any type,
so long as the sequence elements are hashable.
And by definition hashable elements are required to fulfill the following requirement:
Hashable objects which compare equal must have the same hash value.
In addition, do I need to override __cmp__ as well?
I'd love to hear about other solutions that come to mind.
Thanks.
Your solution isn't bad - you could also look at re-working the SequenceMatcher to recursively apply when elements of a sequence are themselves iterables, with some custom logic. That would be sort of a pain. If you only want this subset of SequenceMatcher's functionality, writing a custom diff tool might not be a bad idea either.
Overriding __hash__ to make "Foo" and "Fo" hash the same will cause collisions in dictionaries (hash tables) and such. If you're literally only interested in the first 2 characters and are set on using SequenceMatcher, returning something like hash(self[:2]) might be the way to go.
All that said, your best bet is probably a one-off diff tool. I can sketch out the basics of something like that if you're interested. You just need to know what the constraints are in the circumstances (does the subsequence always start on the first element, that kind of thing).
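For instance, here's a minimal sketch of such a one-off tool (my own illustration, not from the answer; elements_equal and fuzzy_lcs_length are made-up names): a classic LCS dynamic program with a pluggable equality test, which sidesteps SequenceMatcher's hashing requirement entirely.

from difflib import SequenceMatcher

def elements_equal(x, y, threshold=0.5):
    # Strings are compared fuzzily; everything else falls back to ==.
    if isinstance(x, str) and isinstance(y, str):
        return SequenceMatcher(a=x, b=y).ratio() > threshold
    return x == y

def fuzzy_lcs_length(s1, s2, eq=elements_equal):
    # Classic O(len(s1) * len(s2)) LCS dynamic program with a pluggable
    # equality test -- no hashing of elements required at all.
    prev = [0] * (len(s2) + 1)
    for x in s1:
        cur = [0]
        for j, y in enumerate(s2, 1):
            cur.append(prev[j - 1] + 1 if eq(x, y) else max(prev[j], cur[-1]))
        prev = cur
    return prev[-1]

l1 = [1, "Foo", "Bar", 3]
l2 = [1, "Fo", "Bak", 2]
print(fuzzy_lcs_length(l1, l2))  # -> 3, matching 1, Foo~Fo and Bar~Bak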
I have two sets (although I can do lists, or whatever):
a = frozenset(('Today','I','am','fine'))
b = frozenset(('hello','how','are','you','today'))
I want to get:
frozenset(['Today'])
or at least:
frozenset(['today'])
The second option is doable if I lowercase everything I presume, but I'm looking for a more elegant way. Is it possible to do
a.intersection(b)
in a case-insensitive manner?
Shortcuts in Django are also fine since I'm using that framework.
Example from intersection method below (I couldn't figure out how to get this formatted in a comment):
print intersection('Today I am fine tomorrow'.split(),
                   'Hello How a re you TODAY and today and Today and Tomorrow'.split(),
                   key=str.lower)
[(['tomorrow'], ['Tomorrow']), (['Today'], ['TODAY', 'today', 'Today'])]
Here's a version that works for any pair of iterables:
def intersection(iterableA, iterableB, key=lambda x: x):
    """Return the intersection of two iterables with respect to `key` function."""
    def unify(iterable):
        d = {}
        for item in iterable:
            d.setdefault(key(item), []).append(item)
        return d

    A, B = unify(iterableA), unify(iterableB)
    return [(A[k], B[k]) for k in A if k in B]
Example:
print intersection('Today I am fine'.split(),
                   'Hello How a re you TODAY'.split(),
                   key=str.lower)
# -> [(['Today'], ['TODAY'])]
Unfortunately, even if you COULD "change on the fly" the comparison-related special methods of the sets' items (__lt__ and friends -- actually, only __eq__ is needed the way sets are currently implemented, but that's an implementation detail) -- and you can't, because they belong to a built-in type, str -- that wouldn't suffice, because __hash__ is also crucial, and by the time you want to do your intersection it's already been applied, putting the sets' items in different hash buckets from where they'd need to end up to make the intersection work the way you want (i.e., there's no guarantee that 'Today' and 'today' are in the same bucket).
So, for your purposes, you inevitably need to build new data structures -- if you consider it "inelegant" to have to do that at all, you're plain out of luck: built-in sets just don't carry around the HUGE baggage and overhead that would be needed to allow people to change comparison and hashing functions, which would bloat things by 10 times (or more) for the sake of a need felt in (maybe) one use case in a million.
If you have frequent needs connected with case-insensitive comparison, you should consider subclassing or wrapping str (overriding comparison and hashing) to provide a "case insensitive str" type, cistr -- and then, of course, make sure that only instances of cistr are (e.g.) added to your sets (&c) of interest (either by subclassing set &c, or simply by taking care). To give an oversimplified example...:
class ci(str):
    def __hash__(self):
        return hash(self.lower())
    def __eq__(self, other):
        return self.lower() == other.lower()

class cifrozenset(frozenset):
    def __new__(cls, seq=()):
        return frozenset(ci(x) for x in seq)

a = cifrozenset(('Today', 'I', 'am', 'fine'))
b = cifrozenset(('hello', 'how', 'are', 'you', 'today'))
print a.intersection(b)
this does emit frozenset(['Today']), as per your expressed desire. Of course, in real life you'd probably want to do MUCH more overriding (for example...: the way I have things here, any operation on a cifrozenset returns a plain frozenset, losing the precious case independence special feature -- you'd probably want to ensure that a cifrozenset is returned each time instead, and, while quite feasible, that's NOT trivial).
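To illustrate the "ensure a cifrozenset is returned" point for just one method, here's a hedged sketch (the cifrozenset2 name is mine; every other set operation would need the same treatment):

class cifrozenset2(frozenset):
    def __new__(cls, seq=()):
        # Actually construct an instance of the subclass this time.
        return super(cifrozenset2, cls).__new__(cls, (ci(x) for x in seq))

    def intersection(self, other):
        # Re-wrap the plain frozenset result so case independence survives.
        return type(self)(frozenset.intersection(self, other))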
First, don't you mean a.intersection(b)? The intersection (if case insensitive) would be set(['today']). The difference would be set(['i', 'am', 'fine'])
Here are two ideas:
1.) Write a function to convert the elements of both sets to lowercase and then do the intersection. Here's one way you could do it:
>>> intersect_with_key = lambda s1, s2, key=lambda i: i: set(map(key, s1)).intersection(map(key, s2))
>>> fs1 = frozenset('Today I am fine'.split())
>>> fs2 = frozenset('Hello how are you TODAY'.split())
>>> intersect_with_key(fs1, fs2)
set([])
>>> intersect_with_key(fs1, fs2, key=str.lower)
set(['today'])
>>>
This is not very efficient though because the conversion and new sets would have to be created on each call.
2.) Extend the frozenset class to keep a case insensitive copy of the elements. Override the intersection method to use the case insensitive copy of the elements. This would be more efficient.
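No code was given for idea 2, so here's a rough sketch of one way it might look (entirely my own illustration; the CIFrozenSet name and the _folded attribute are made up):

class CIFrozenSet(frozenset):
    """A frozenset that precomputes a lowercased copy of its elements
    so repeated case-insensitive intersections are cheap."""
    def __new__(cls, seq=()):
        obj = super().__new__(cls, seq)
        obj._folded = frozenset(s.lower() for s in obj)
        return obj

    def intersection(self, other):
        # Intersect the lowercased copies instead of the originals.
        other_folded = getattr(other, '_folded',
                               frozenset(s.lower() for s in other))
        return self._folded & other_folded

a = CIFrozenSet(('Today', 'I', 'am', 'fine'))
b = CIFrozenSet(('hello', 'how', 'are', 'you', 'today'))
print(a.intersection(b))  # frozenset({'today'})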
>>> a_, b_ = map(set, [map(str.lower, a), map(str.lower, b)])
>>> a_ & b_
set(['today'])
Or... with fewer maps,
>>> a_ = set(map(str.lower, a))
>>> b_ = set(map(str.lower, b))
>>> a_ & b_
set(['today'])