I have two sets (although I can do lists, or whatever):
a = frozenset(('Today','I','am','fine'))
b = frozenset(('hello','how','are','you','today'))
I want to get:
frozenset(['Today'])
or at least:
frozenset(['today'])
The second option is doable if I lowercase everything I presume, but I'm looking for a more elegant way. Is it possible to do
a.intersection(b)
in a case-insensitive manner?
Shortcuts in Django are also fine since I'm using that framework.
Example from intersection method below (I couldn't figure out how to get this formatted in a comment):
print intersection('Today I am fine tomorrow'.split(),
                   'Hello How are you TODAY and today and Today and Tomorrow'.split(),
                   key=str.lower)
[(['tomorrow'], ['Tomorrow']), (['Today'], ['TODAY', 'today', 'Today'])]
Here's a version that works for any pair of iterables:
def intersection(iterableA, iterableB, key=lambda x: x):
    """Return the intersection of two iterables with respect to `key` function."""
    def unify(iterable):
        d = {}
        for item in iterable:
            d.setdefault(key(item), []).append(item)
        return d
    A, B = unify(iterableA), unify(iterableB)
    return [(A[k], B[k]) for k in A if k in B]
Example:
print intersection('Today I am fine'.split(),
                   'Hello How are you TODAY'.split(),
                   key=str.lower)
# -> [(['Today'], ['TODAY'])]
Unfortunately, even if you COULD "change on the fly" the comparison-related special methods of the sets' items (__lt__ and friends -- actually, only __eq__ is needed the way sets are currently implemented, but that's an implementation detail) -- and you can't, because they belong to a built-in type, str -- that wouldn't suffice. __hash__ is also crucial, and by the time you want to do your intersection it's already been applied, putting the sets' items in different hash buckets from where they'd need to end up to make intersection work the way you want (i.e., there's no guarantee that 'Today' and 'today' are in the same bucket).
So, for your purposes, you inevitably need to build new data structures -- if you consider it "inelegant" to have to do that at all, you're plain out of luck: built-in sets just don't carry around the HUGE baggage and overhead that would be needed to allow people to change comparison and hashing functions, which would bloat things by 10 times (or more) for the sake of a need felt in (maybe) one use case in a million.
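You can see the bucket mismatch directly (a quick illustration of my own):

>>> hash('Today') == hash('today')
False
>>> frozenset(['Today']) & frozenset(['today'])
frozenset([])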
If you have frequent needs connected with case-insensitive comparison, you should consider subclassing or wrapping str (overriding comparison and hashing) to provide a "case insensitive str" type cistr -- and then, of course, make sure that only instances of cistr are (e.g.) added to your sets (&c) of interest (either by subclassing set &c, or simply by taking care). To give an oversimplified example...:
class ci(str):
    def __hash__(self):
        return hash(self.lower())
    def __eq__(self, other):
        return self.lower() == other.lower()

class cifrozenset(frozenset):
    def __new__(cls, seq=()):
        # build the frozenset out of ci instances so hashing and
        # equality are case-insensitive
        return frozenset.__new__(cls, (ci(x) for x in seq))
a = cifrozenset(('Today','I','am','fine'))
b = cifrozenset(('hello','how','are','you','today'))
print a.intersection(b)
This does emit frozenset(['Today']), as per your expressed desire. Of course, in real life you'd probably want to do MUCH more overriding (for example: the way I have things here, any operation on a cifrozenset returns a plain frozenset, losing the precious case-independence special feature -- you'd probably want to ensure that a cifrozenset is returned each time instead, and, while quite feasible, that's NOT trivial).
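For instance, one hedged way to keep the type through a single operation (my own sketch, not part of the original answer) is simply to re-wrap the result:

class cifrozenset(frozenset):
    def __new__(cls, seq=()):
        return frozenset.__new__(cls, (ci(x) for x in seq))
    def intersection(self, other):
        # re-wrap so the case-insensitive behavior survives the operation
        return cifrozenset(frozenset.intersection(self, other))

You'd have to do the same for __and__, union, difference, and so on, which is exactly the non-trivial part.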
First, don't you mean a.intersection(b)? The intersection (if case-insensitive) would be set(['today']); the difference would be set(['i', 'am', 'fine']).
Here are two ideas:
1.) Write a function to convert the elements of both sets to lowercase and then do the intersection. Here's one way you could do it:
>>> intersect_with_key = lambda s1, s2, key=lambda i: i: set(map(key, s1)).intersection(map(key, s2))
>>> fs1 = frozenset('Today I am fine'.split())
>>> fs2 = frozenset('Hello how are you TODAY'.split())
>>> intersect_with_key(fs1, fs2)
set([])
>>> intersect_with_key(fs1, fs2, key=str.lower)
set(['today'])
>>>
This is not very efficient though because the conversion and new sets would have to be created on each call.
2.) Extend the frozenset class to keep a case insensitive copy of the elements. Override the intersection method to use the case insensitive copy of the elements. This would be more efficient.
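A minimal sketch of that second idea, assuming only intersection is needed (the class name and the cached lowercase copy are my own):

class CIFrozenset(frozenset):
    """frozenset that keeps a lowercased copy for case-insensitive operations."""
    def __new__(cls, iterable=()):
        self = frozenset.__new__(cls, iterable)
        self._lower = frozenset(s.lower() for s in self)
        return self
    def intersection(self, other):
        # lowercase view of the other operand (reuse its cache if present)
        if isinstance(other, CIFrozenset):
            other_lower = other._lower
        else:
            other_lower = frozenset(s.lower() for s in other)
        # keep the original spellings from self whose lowercase form matches
        return frozenset(s for s in self if s.lower() in other_lower)

a = CIFrozenset(('Today', 'I', 'am', 'fine'))
b = CIFrozenset(('hello', 'how', 'are', 'you', 'today'))
print a.intersection(b)  # frozenset(['Today'])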
>>> a_, b_ = map(set, [map(str.lower, a), map(str.lower, b)])
>>> a_ & b_
set(['today'])
Or... with fewer maps:
>>> a_ = set(map(str.lower, a))
>>> b_ = set(map(str.lower, b))
>>> a_ & b_
set(['today'])
Related
What would be the best way to do the following case-insensitive intersection:
a1 = ['Disney', 'Fox']
a2 = ['paramount', 'fox']
a1.intersection(a2)
> ['fox']
Normally I'd do a list comprehension to convert both to all lowercased:
>>> set([_.lower() for _ in a1]).intersection(set([_.lower() for _ in a2]))
set(['fox'])
but it's a bit ugly. Is there a better way to do this?
Using the set comprehension syntax is slightly less ugly:
>>> {str.casefold(x) for x in a1} & {str.casefold(x) for x in a2}
{'fox'}
The algorithm is the same, and there is no more efficient way available, because the hash values of strings are case-sensitive.
Using str.casefold instead of str.lower behaves more correctly for international data; it has been available since Python 3.3.
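For example, German 'ß' shows the difference between the two (a quick illustration):

>>> 'Straße'.lower()
'straße'
>>> 'Straße'.casefold()
'strasse'
>>> {str.casefold(x) for x in ['Straße']} & {str.casefold(x) for x in ['STRASSE']}
{'strasse'}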
There are some problems with the definitions here: for example, when a string appears twice in the same set with two different cases, or once in each set, which one do we keep?
With that being said, if you don't care, and you want to perform this sort of intersections a lot of times, you can create a case invariant string object:
class StrIgnoreCase:
    def __init__(self, val):
        self.val = val

    def __eq__(self, other):
        if not isinstance(other, StrIgnoreCase):
            return False
        return self.val.lower() == other.val.lower()

    def __hash__(self):
        return hash(self.val.lower())
And then I'd just maintain both the sets so that they contain these objects instead of plain strings. It would require less conversions on each creation of new sets and each intersection operation.
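For instance (a usage sketch of my own; note that which spelling survives an intersection is an implementation detail of set):

>>> a = {StrIgnoreCase('Disney'), StrIgnoreCase('Fox')}
>>> b = {StrIgnoreCase('paramount'), StrIgnoreCase('fox')}
>>> [s.val for s in a & b]  # could equally be ['Fox']
['fox']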
Consider:
operator.add(a, b)
I'm having trouble understanding what this does. An operator is something like +-*/, so what does operator.add(a, b) do and how would you use it in a program?
Operator functions let you pick operations dynamically.
They do the same thing as the operator, so operator.add(a, b) does the exact same thing as a + b, but you can now use these operators in abstract.
Take for example:
import operator, random
ops = [operator.add, operator.sub]
print(random.choice(ops)(10, 5))
The above code will randomly either add up or subtract the two numbers. Because the operators can be applied in function form, you can also store these functions in variables (lists, dictionaries, etc.) and use them indirectly, based on your code. You can pass them to map() or reduce() or partial, etc. etc. etc.
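For instance, a common pattern is a lookup table mapping symbols to operator functions (a sketch of my own; the names are made up):

import operator

ops = {'+': operator.add, '-': operator.sub,
       '*': operator.mul, '/': operator.truediv}

def calculate(x, symbol, y):
    # pick the operation at run time based on the symbol
    return ops[symbol](x, y)

print(calculate(10, '*', 5))  # 50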
Since operator.add is a function to which you can pass arguments, it's for situations where you cannot write an expression like a + b directly, such as with the map or itertools.imap functions. For a better understanding, see the following example:
>>> import operator
>>> from itertools import imap
>>> list(imap(operator.add, [1, 3], [5, 5]))
[6, 8]
It does the same thing; it's just the function version of the + operator from the Python operator module. It returns the result, so you would use it like this:
result = operator.add(a, b)
This is functionally equivalent to
result = a + b
It literally is how the + operator is defined. Look at the following example:
class foo():
    def __init__(self, a):
        self.a = a
    def __add__(self, b):
        return self.a + b
>>> x = foo(5)
>>> x + 3
8
The + operator actually just calls the __add__ method of the class. The same thing happens for native Python types:
>>> 5 + 3
8
>>> operator.add(5,3)
8
Note that since I defined my __add__ method, I can also do
>>> operator.add(x, 3)
8
For the first part of your question, checkout the source for operator.add. It does exactly as you'd expect; adds two values together.
The answer to part two of your question is a little tricky.
They can be good for when you don't know what operator you'll need until run time. Like when the data file you're reading contains the operation as well as the values:
# warning: nsfw
total = 0
with open('./foo.dat') as fp:
    for line in fp:
        # each line is assumed to look like e.g. "add 12 30"
        operation, first_val, second_val = line.split()
        # split() yields strings, so convert before applying the operator
        total += getattr(operator, operation)(float(first_val), float(second_val))
Also, you might want to make your code cleaner or more efficient (subjective) by using the operator functions with the map built-in, as the example in the Python docs shows:
orig_values = [1, 2, 3, 4, 5]
new_values = [5, 4, 3, 2, 1]
total = sum(map(operator.add, orig_values, new_values))  # 30
Those are both convoluted examples which usually means that you probably won't use them except in extraordinary situations. You should really know that you need these functions before you use them.
I've been trying to create a nested or recursive effect with SequenceMatcher.
The final goal is comparing two sequences, both may contain instances of different types.
For example, the sequences could be:
l1 = [1, "Foo", "Bar", 3]
l2 = [1, "Fo", "Bak", 2]
Normally, SequenceMatcher will identify only [1] as a common sub-sequence for l1 and l2.
I'd like SequenceMatcher to be applied twice for string instances, so that "Foo" and "Fo" will be considered equal, as well as "Bar" and "Bak", and the longest common sub-sequence will be of length 3: [1, Foo/Fo, Bar/Bak]. That is, I'd like SequenceMatcher to be more forgiving when comparing string members.
What I tried doing is write a wrapper for the built-in str class:
from difflib import SequenceMatcher

class myString:
    def __init__(self, string):
        self.string = string
    def __hash__(self):
        return hash(self.string)
    def __eq__(self, other):
        return SequenceMatcher(a=self.string, b=other.string).ratio() > 0.5
Edit: perhaps a more elegant way is:
class myString(str):
    def __eq__(self, other):
        return SequenceMatcher(a=self, b=other).ratio() > 0.5
By doing this, the following is made possible:
>>> Foo = myString("Foo")
>>> Fo = myString("Fo")
>>> Bar = myString("Bar")
>>> Bak = myString("Bak")
>>> l1 = [1, Foo, Bar, 3]
>>> l2 = [1, Fo, Bak, 2]
>>> SequenceMatcher(a=l1, b=l2).ratio()
0.75
So, evidently it's working, but I have a bad feeling about overriding the hash function.
When is the hash used? Where can it come back and bite me?
SequenceMatcher's documentation states the following:
This is a flexible class for comparing pairs of sequences of any type,
so long as the sequence elements are hashable.
And by definition hashable elements are required to fulfill the following requirement:
Hashable objects which compare equal must have the same hash value.
In addition, do I need to override __cmp__ as well?
I'd love to hear about other solutions that come to mind.
Thanks.
Your solution isn't bad - you could also look at re-working the SequenceMatcher to recursively apply when elements of a sequence are themselves iterables, with some custom logic. That would be sort of a pain. If you only want this subset of SequenceMatcher's functionality, writing a custom diff tool might not be a bad idea either.
Overriding __hash__ to make "Foo" and "Fo" equal will cause collisions in dictionaries (hash tables) and such. If you're literally only interested in the first 2 characters and are set on using SequenceMatcher, returning hash(self[:2]) (hashing just that prefix, so near-matches land in the same bucket) might be the way to go.
All that said, your best bet is probably a one-off diff tool. I can sketch out the basics of something like that if you're interested. You just need to know what the constraints are in the circumstances (does the subsequence always start on the first element, that kind of thing).
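For the record, here is a rough sketch of such a one-off tool (fuzzy_match, fuzzy_lcs and the 0.5 cutoff are my own choices, assuming all you need is the length of the fuzzy common subsequence):

from difflib import SequenceMatcher

def fuzzy_match(x, y, cutoff=0.5):
    # strings are compared fuzzily; everything else must be exactly equal
    if isinstance(x, str) and isinstance(y, str):
        return SequenceMatcher(a=x, b=y).ratio() > cutoff
    return x == y

def fuzzy_lcs(seq_a, seq_b):
    # classic dynamic-programming LCS, with a fuzzy element comparison
    table = [[0] * (len(seq_b) + 1) for _ in range(len(seq_a) + 1)]
    for i, x in enumerate(seq_a):
        for j, y in enumerate(seq_b):
            if fuzzy_match(x, y):
                table[i + 1][j + 1] = table[i][j] + 1
            else:
                table[i + 1][j + 1] = max(table[i][j + 1], table[i + 1][j])
    return table[-1][-1]

l1 = [1, "Foo", "Bar", 3]
l2 = [1, "Fo", "Bak", 2]
print(fuzzy_lcs(l1, l2))  # 3, i.e. [1, Foo/Fo, Bar/Bak]

This sidesteps the hashing question entirely, since it never puts the fuzzy strings into a dict or set.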
I would like to be able to make comparisons in a mixed-type dictionary (containing int, floats, strings, numpy.arrays). My minimal example has a list of dictionaries and I would like a function (or generator) to iterate over that list and pick out the elements (dicts) that contain key-value pairs as specified by **kwargs input to that function (or generator).
import re

list_of_dicts = [{'s1':'abcd', 's2':'ABC', 'i':42, 'f':4.2},
                 {'s2':'xyz', 'i':84, 'f':8.4}]

def find_list_element(**kwargs):
    for s in list_of_dicts:
        for criterion, criterion_val in kwargs.iteritems():
            if type(criterion_val) is str:
                if re.match(criterion_val, s.get(criterion, 'unlikely_return_val')):
                    yield s
                    continue
            if s.get(criterion, None) == criterion_val:
                yield s
print [a for a in find_list_element(i=41)] # []
print [a for a in find_list_element(i=42)] # [{'i': 42, 's2': 'ABC', 's1': 'abcd', 'f': 4.2}]
print [a for a in find_list_element(s1='xyz')] # []
print [a for a in find_list_element(s2='xyz')] # [{'i': 84, 's2': 'xyz', 'f': 8.4}]
print [a for a in find_list_element(s2='[a-z]')] # [{'i': 84, 's2': 'xyz', 'f': 8.4}]
My two problems with the above are:
If the function asks for a comparison with a string, I would like to switch to regex matching (re.search or re.match) instead of plain string comparison. In the above code this is accomplished through the reviled type checking, and it doesn't look all that elegant. Are there better solutions not involving type checking? Or maybe this is a case where type checking is allowed in Python?
**kwargs can of course contain more than one comparison. Currently I can only think of a solution involving some flags (found = False switched to a found = True and evaluated at the end of each iteration of list_of_dicts). Is there some clever way to accumulate the comparison results for each s before deciding on whether to yield it or not?
Are there ways to make this whole walk through this collection of dicts prettier?
PS: The actual use case for this involves the representation of acquired MRI datasets (BRUKER). Datasets are characterized through parameter files that I have converted to dicts that are part of the objects representing said scans. I am collecting these datasets and would like to further filter them based on certain criteria given by these parameter files. These parameters can be strings, numbers and some other less handy types.
UPDATE and Distilled Answer
If I had to come up with a consensus answer derived from the input by @BrenBarn and @srgerg, it would be this:
list_of_dicts = [{'s1':'abcd', 's2':'ABC', 'i':42, 'f':4.2},
                 {'s2':'xyz', 'i':84, 'f':8.4}]

# just making up some comparison strategies
def regex_comp(a, b): return re.match(a, b)
def int_comp(a, b): return a == b
def float_comp(a, b): return round(a, -1) == round(b, -1)

pre_specified_comp_dict = {frozenset(['s1', 's2']): regex_comp,
                           frozenset(['i']): int_comp,
                           frozenset(['f']): float_comp}

def fle_new(**kwargs):
    chosen_comps = {}
    for key in kwargs.keys():
        # remember, the keys of pre_specified_comp_dict are frozensets
        cand_comp = [x for x in pre_specified_comp_dict if key in x]
        chosen_comps[key] = pre_specified_comp_dict[cand_comp[0]]
    matches = lambda d: all(k in d and chosen_comps[k](v, d[k])
                            for k, v in kwargs.items())
    return filter(matches, list_of_dicts)
Now the only challenge would be to come up with a pain-free strategy of creating pre_specified_comp_dict.
It seems okay to me to use type-checking in this situation, as you really do want totally different behavior depending on the type. However, you should make your typecheck a bit smarter. Use if isinstance(criterion_val, basestring) rather than a direct check for str type. This way, it will still work for unicode strings.
The way to avoid typechecking would be to pre-specify the comparison type for each field. Looking at your sample data, it looks like each field always has a consistent type (e.g., s1 is always a string). If that's the case, you could create an explicit mapping between the field names and the type of comparison, something like:
regex_fields = ['s1', 's2']
Then in your code, instead of the type check, do if criterion in regex_fields to see if the field is one that should be compared with a regex. If you have more than just two types of comparison, you could use a dict mapping field names to some kind of ID for the comparison operation.
The advantage of this is that it encodes your assumptions more explicitly, so that if some weird data gets in (e.g., a string where you expect a number), an error will be raised instead of silently applying the type-appropriate comparison. It also keeps the relationship between fields and comparisons "separate" rather than burying it in the middle of the actual comparison code.
This might especially be worth doing if you had a large number of fields with many different comparison operations for different subsets of them. In that case, it might be better to predefine which comparisons apply to which field names (as opposed to which types), rather than deciding on the fly for each comparison. As long as you always know, based on the field name, what type of comparison to do, this will keep things cleaner. It does add maintenance overhead if you need to add a new field, though, so I probably wouldn't do it if this was just a script for a private audience.
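A compact sketch of that mapping idea (the helper names are mine; it is essentially what the asker's distilled answer above builds on):

import re

# map each field name directly to its comparison function
field_comps = {'s1': lambda pat, s: re.match(pat, s),
               's2': lambda pat, s: re.match(pat, s),
               'i':  lambda a, b: a == b,
               'f':  lambda a, b: a == b}

def find_by_field(**kwargs):
    for d in list_of_dicts:
        if all(k in d and field_comps[k](v, d[k]) for k, v in kwargs.items()):
            yield d

An unexpected type (say, a number for 's1') then fails loudly inside the regex comparison instead of being silently compared the wrong way.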
Here's how I would implement your find_list_element function. It still uses The Reviled Type Checking (TM), but it looks a little more eloquent IMHO:
def find_list_element(**kwargs):
    compare = lambda e, a: re.match(e, a) is not None if isinstance(e, str) else e == a
    matches = lambda d: all(k in d and compare(v, d[k]) for k, v in kwargs.items())
    return filter(matches, list_of_dicts)
(I'm using Python 3, by the way. The code also works in Python 2.7, but there you should use basestring rather than str, as BrenBarn has already pointed out.)
Note that I have used Python's all function to avoid having to accumulate the comparison results.
You can see my code below, which handles the need for more than one comparison:
def find_dict(**kwargs):
    for data in lds:  # lds is the same as list_of_dicts
        for key, val in kwargs.iteritems():
            if not data.get(key, False) == val:
                break  # 'return False' inside a generator is a syntax error in Python 2
        else:
            yield data
O/P:
>>> list(find_dict(i=42, s1='abcd'))
[{'i': 42, 's2': 'ABC', 's1': 'abcd', 'f': 4.2}]
I have not included the code for regex comparison!
Cheers!
In Python, you can have a function return multiple values. Here's a contrived example:
def divide(x, y):
    quotient = x / y
    remainder = x % y
    return quotient, remainder

(q, r) = divide(22, 7)
This seems very useful, but it looks like it can also be abused ("Well..function X already computes what we need as an intermediate value. Let's have X return that value also").
When should you draw the line and define a different method?
Absolutely (for the example you provided).
Tuples are first class citizens in Python
There is a builtin function divmod() that does exactly that.
q, r = divmod(x, y)  # q == (x - x % y) / y, r == x % y; invariant: q*y + r == x
There are other examples: zip, enumerate, dict.items.
for i, e in enumerate([1, 3, 3]):
    print "index=%d, element=%s" % (i, e)

# reverse keys and values in a dictionary
d = dict((v, k) for k, v in adict.items())  # or
d = dict(zip(adict.values(), adict.keys()))
BTW, parentheses are not necessary most of the time.
Citation from Python Library Reference:
Tuples may be constructed in a number of ways:
Using a pair of parentheses to denote the empty tuple: ()
Using a trailing comma for a singleton tuple: a, or (a,)
Separating items with commas: a, b, c or (a, b, c)
Using the tuple() built-in: tuple() or tuple(iterable)
Functions should serve a single purpose
Therefore they should return a single object. In your case this object is a tuple. Consider the tuple an ad-hoc compound data structure. There are languages where almost every single function returns multiple values (lists in Lisp).
Sometimes it is sufficient to return (x, y) instead of Point(x, y).
Named tuples
With the introduction of named tuples in Python 2.6 it is preferable in many cases to return named tuples instead of plain tuples.
>>> import collections
>>> Point = collections.namedtuple('Point', 'x y')
>>> x, y = Point(0, 1)
>>> p = Point(x, y)
>>> x, y, p
(0, 1, Point(x=0, y=1))
>>> p.x, p.y, p[0], p[1]
(0, 1, 0, 1)
>>> for i in p:
... print(i)
...
0
1
Firstly, note that Python allows for the following (no need for the parentheses):
q, r = divide(22, 7)
Regarding your question, there's no hard and fast rule either way. For simple (and usually contrived) examples, it may seem that it's always possible for a given function to have a single purpose, resulting in a single value. However, when using Python for real-world applications, you quickly run into many cases where returning multiple values is necessary, and results in cleaner code.
So, I'd say do whatever makes sense, and don't try to conform to an artificial convention. Python supports multiple return values, so use it when appropriate.
The example you give is actually a Python built-in function, called divmod. So someone, at some point in time, thought that it was pythonic enough to include in the core functionality.
To me, if it makes the code cleaner, it is pythonic. Compare these two code blocks:
seconds = 1234
minutes, seconds = divmod(seconds, 60)
hours, minutes = divmod(minutes, 60)
seconds = 1234
minutes = seconds / 60
seconds = seconds % 60
hours = minutes / 60
minutes = minutes % 60
Yes, returning multiple values (i.e., a tuple) is definitely pythonic. As others have pointed out, there are plenty of examples in the Python standard library, as well as in well-respected Python projects. Two additional comments:
Returning multiple values is sometimes very, very useful. Take, for example, a method that optionally handles an event (returning some value in doing so) and also returns success or failure. This might arise in a chain of responsibility pattern. In other cases, you want to return multiple, closely linked pieces of data, as in the example given. In this setting, returning multiple values is akin to returning a single instance of an anonymous class with several member variables.
Python's handling of method arguments necessitates the ability to directly return multiple values. In C++, for example, method arguments can be passed by reference, so you can assign output values to them, in addition to the formal return value. In Python, arguments are passed "by reference" (but in the sense of Java, not C++). You can't assign new values to method arguments and have it reflected outside method scope. For example:
// C++
void test(int& arg)
{
    arg = 1;
}

int foo = 0;
test(foo); // foo is now 1!
Compare with:
# Python
def test(arg):
    arg = 1

foo = 0
test(foo)  # foo is still 0
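The idiomatic Python workaround (my own addition, mirroring the C++ snippet) is to return the new value instead of mutating the argument:

def test(arg):
    return 1

foo = 0
foo = test(foo)  # foo is now 1

And when there are several such "output parameters", returning a tuple is the natural replacement.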
It's definitely pythonic. The fact that you can return multiple values from a function spares you the boilerplate you would have in a language like C, where you need to define a struct for every combination of types you return somewhere.
However, if you reach the point where you are returning something crazy like 10 values from a single function, you should seriously consider bundling them in a class because at that point it gets unwieldy.
Returning a tuple is cool. Also note the new namedtuple, added in Python 2.6, which may make this more palatable for you:
http://docs.python.org/dev/library/collections.html#collections.namedtuple
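For instance (a small sketch of my own; the DivResult name is made up):

from collections import namedtuple

DivResult = namedtuple('DivResult', 'quotient remainder')

def divide(x, y):
    return DivResult(x // y, x % y)

result = divide(22, 7)
print result.quotient, result.remainder  # 3 1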
OT: RSRE's Algol68 has the curious "/:=" operator, e.g.:
INT quotient := 355, remainder;
remainder := (quotient /:= 113);
Giving a quotient of 3 and a remainder of 16.
Note: typically the value of "(x/:=y)" is discarded as quotient "x" is assigned by reference, but in RSRE's case the returned value is the remainder.
c.f. Integer Arithmetic - Algol68
It's fine to return multiple values using a tuple for simple functions such as divmod. If it makes the code readable, it's Pythonic.
If the return value starts to become confusing, check whether the function is doing too much and split it if it is. If a big tuple is being used like an object, make it an object. Also, consider using named tuples, which will be part of the standard library in Python 2.6.
I'm fairly new to Python, but the tuple technique seems very pythonic to me. However, I've had another idea that may enhance readability: using a dictionary allows access to the different values by name rather than by position. For example:
def divide(x, y):
    return {'quotient': x / y, 'remainder': x % y}

answer = divide(22, 7)
print answer['quotient']
print answer['remainder']