I would like to be able to make comparisons in a mixed-type dictionary (containing int, floats, strings, numpy.arrays). My minimal example has a list of dictionaries and I would like a function (or generator) to iterate over that list and pick out the elements (dicts) that contain key-value pairs as specified by **kwargs input to that function (or generator).
import re

list_of_dicts = [{'s1':'abcd', 's2':'ABC', 'i':42, 'f':4.2},
                 {'s2':'xyz', 'i':84, 'f':8.4}]

def find_list_element(**kwargs):
    for s in list_of_dicts:
        for criterion, criterion_val in kwargs.iteritems():
            if type(criterion_val) is str:
                if re.match(criterion_val, s.get(criterion, 'unlikely_return_val')):
                    yield s
                continue
            if s.get(criterion, None) == criterion_val:
                yield s
print [a for a in find_list_element(i=41)] # []
print [a for a in find_list_element(i=42)] # [{'i': 42, 's2': 'ABC', 's1': 'abcd', 'f': 4.2}]
print [a for a in find_list_element(s1='xyz')] # []
print [a for a in find_list_element(s2='xyz')] # [{'i': 84, 's2': 'xyz', 'f': 8.4}]
print [a for a in find_list_element(s2='[a-z]')] # [{'i': 84, 's2': 'xyz', 'f': 8.4}]
My two problems with the above are:
If the function is asked for a comparison against a string, I would like to switch to regex matching (re.search or re.match) instead of plain string comparison. In the above code this is accomplished through the reviled type checking, and it doesn't look all that elegant. Are there better solutions not involving type checking? Or maybe this is a case where type checking is allowed in Python?
**kwargs can of course contain more than one comparison. Currently I can only think of a solution involving flags (a found = False that is switched to found = True and evaluated at the end of each iteration over list_of_dicts). Is there some clever way to accumulate the comparison results for each s before deciding whether to yield it or not?
Are there ways to make this whole walk through this collection of dicts prettier?
PS: The actual use case for this involves the representation of acquired MRI datasets (BRUKER). Datasets are characterized through parameter files that I have converted to dicts that are part of the objects representing said scans. I am collecting these datasets and would like to further filter them based on certain criteria given by these parameter files. These parameters can be strings, numbers and some other less handy types.
UPDATE and Distilled Answer
If I had to come up with a consensus answer derived from the input by @BrenBarn and @srgerg, it would be this:
import re

list_of_dicts = [{'s1':'abcd', 's2':'ABC', 'i':42, 'f':4.2},
                 {'s2':'xyz', 'i':84, 'f':8.4}]

# just making up some comparison strategies
def regex_comp(a, b): return re.match(a, b)
def int_comp(a, b): return a == b
def float_comp(a, b): return round(a, -1) == round(b, -1)

pre_specified_comp_dict = {frozenset(['s1', 's2']): regex_comp,
                           frozenset(['i']): int_comp,
                           frozenset(['f']): float_comp}

def fle_new(**kwargs):
    chosen_comps = {}
    for key in kwargs.keys():
        # remember, the keys of pre_specified_comp_dict are frozensets
        cand_comp = [x for x in pre_specified_comp_dict if key in x]
        chosen_comps[key] = pre_specified_comp_dict[cand_comp[0]]
    matches = lambda d: all(k in d and chosen_comps[k](v, d[k])
                            for k, v in kwargs.items())
    return filter(matches, list_of_dicts)
Now the only challenge would be to come up with a pain-free strategy of creating pre_specified_comp_dict.
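If, as in the sample data, every field always carries the same type, one hedged possibility (my sketch, not from the original answers) is to bootstrap the table once from a sample record, so the type checking happens a single time at setup rather than on every comparison:

    # Sketch only: derive pre_specified_comp_dict from one sample record,
    # assuming each field always holds the same type as in that sample.
    sample = list_of_dicts[0]
    type_to_comp = {str: regex_comp, int: int_comp, float: float_comp}
    pre_specified_comp_dict = {
        frozenset(k for k, v in sample.items() if isinstance(v, t)): comp
        for t, comp in type_to_comp.items()
        if any(isinstance(v, t) for v in sample.values())
    }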
It seems okay to me to use type-checking in this situation, as you really do want totally different behavior depending on the type. However, you should make your typecheck a bit smarter. Use if isinstance(criterion_val, basestring) rather than a direct check for str type. This way, it will still work for unicode strings.
The way to avoid typechecking would be to pre-specify the comparison type for each field. Looking at your sample data, it looks like each field always has a consistent type (e.g., s1 is always a string). If that's the case, you could create an explicit mapping between the field names and the type of comparison, something like:
regex_fields = ['s1', 's2']
Then in your code, instead of the type check, do if criterion in regex_fields to see if the field is one that should be compared with a regex. If you have more than just two types of comparison, you could use a dict mapping field names to some kind of ID for the comparison operation.
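For instance, such a mapping might look like this (a sketch with made-up comparison IDs, not from the original answer):

    # hypothetical mapping from field name to a comparison-operation ID
    comparison_for_field = {'s1': 'regex',
                            's2': 'regex',
                            'i': 'exact',
                            'f': 'exact'}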
The advantage of this is that it encodes your assumptions more explicitly, so that if some weird data gets in (e.g., a string where you expect a number), an error will be raised instead of silently applying the type-appropriate comparison. It also keeps the relationship between fields and comparisons "separate" rather than burying it in the middle of the actual comparison code.
This might especially be worth doing if you had a large number of fields with many different comparison operations for different subsets of them. In that case, it might be better to predefine which comparisons apply to which field names (as opposed to which types), rather than deciding on-the-fly for each comparison. As long as you always know, based on the field name, what type of comparison to do, this will keep things cleaner. It does add maintenance overhead if you need to add a new field, though, so I probably wouldn't do it if this was just a script for a private audience.
Here's how I would implement your find_list_element function. It still uses The Reviled Type Checking (TM), but it looks a little more elegant IMHO:
def find_list_element(**kwargs):
    compare = lambda e, a: re.match(e, a) is not None if isinstance(e, str) else e == a
    matches = lambda d: all(k in d and compare(v, d[k]) for k, v in kwargs.items())
    return filter(matches, list_of_dicts)
(I'm using Python 3, by the way. The code also works in Python 2.7, but there it should use basestring rather than str, as BrenBarn has already pointed out.)
Note that I have used Python's all function to avoid having to accumulate the comparison results.
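For example, with the sample data above (Python 3, where filter is lazy, hence the list() call):

>>> list(find_list_element(s2='[a-z]', i=84))
[{'s2': 'xyz', 'i': 84, 'f': 8.4}]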
You can see my code below, which handles the need for more than one comparison:

def find_dict(**kwargs):
    for data in lds:  # lds is the same as list_of_dicts
        for key, val in kwargs.iteritems():
            if not data.get(key, False) == val:
                break  # `return False` is a SyntaxError inside a Python 2 generator
        else:
            yield data
O/P:

>>> list(find_dict(i=42, s1='abcd'))
[{'i': 42, 's2': 'ABC', 's1': 'abcd', 'f': 4.2}]
I have not included the code for the regex comparison!
Cheers!
Related
I am looking for an efficient Python method to utilise a hash table whose keys are two-element tuples:
E.g.:
(1,5) --> {a}
(2,3) --> {b,c}
(2,4) --> {d}
Further I need to be able to retrieve whole blocks of entries, for example all entries that have "2" at the 0-th position (here: (2,3) as well as (2,4)).
In another post it was suggested to use list comprehension, i.e.:
sum(val for key, val in dict.items() if key[0] == 'B')
I learned that dictionaries are (probably?) the most efficient way to retrieve a value from a collection of key:value pairs. However, querying with an incomplete tuple key is a bit different from querying the whole key, where I either get a value or nothing. Can Python still return the matching values in a time proportional to the number of key:value pairs that match? Or, alternatively, is the tuple-keyed dictionary (plus list comprehension) better than using pandas.df.groupby() (which would occupy rather a lot of memory)?
The "standard" way would be something like
from random import randint

d = {(randint(1, 10), i): "something" for i, x in enumerate(range(200))}

def byfilter(n, d):
    return list(filter(lambda x: x[0] == n, d.keys()))

byfilter(5, d)  ## returns a list of tuples where x[0] == 5
Although in similar situations I often used next() to iterate manually, when I didn't need the full list.
However, there may be some use cases where we can optimize that. Suppose you need to do a couple of accesses (or more) by the key's first element, and you know the dict keys are not changing in the meantime. Then you can extract the keys into a list and sort it, and make use of some itertools functions, namely dropwhile() and takewhile():
from itertools import dropwhile, takewhile

ls = [x for x in d.keys()]
ls.sort()  ## I do not know why, but this seems faster than ls = sorted(d.keys())

def bysorted(n, ls):
    return list(takewhile(lambda x: x[0] == n, dropwhile(lambda x: x[0] != n, ls)))

bysorted(5, ls)  ## returns the same list as above
This can be up to 10x faster in the best case (n=1 in my example) and takes more or less the same time in the worst case (n=10), because we are trimming the number of iterations needed.
Of course you can do the same for accessing keys by x[1]; you just need to add a key parameter to the sort() call:
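A sketch of that variant (same data and imports as above):

ls2 = sorted(d.keys(), key=lambda t: t[1])  ## contiguous by second element

def bysorted_second(n, ls):
    return list(takewhile(lambda t: t[1] == n, dropwhile(lambda t: t[1] != n, ls)))

bysorted_second(5, ls2)  ## all keys whose second element is 5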
I've always used Perl hash reference tricks to deal with tree structures.
This time, I have to do it in Python, which I am not really familiar with.
For example, in the following code, I create a hash chain and cut it in half. Then I concatenate one piece to another hash chain.
my $hashRoot->{'1'}->{'2'}->{'C'}->{'D'} = 'val';
my $subHash = $hashRoot->{'1'}->{'2'}; #Extract the Last 2 Level of Chain
my $anotherHashRoot->{'A'}->{'B'} = $subHash; #Concatenate to another Hash Chain
say '$hashRoot';
say Dump $hashRoot;
say '$subHash';
say Dump $subHash;
say '$anotherHashRoot';
say Dump $anotherHashRoot;
The above code generates the following output:
$hashRoot
---
1:
2:
C:
D: val
$subHash
---
C:
D: val
$anotherHashRoot
---
A:
B:
C:
D: val
In short, I am searching for a Pythonic way to cut/insert/copy a hash (dict) chain as in Perl (or in C). Does anyone have the answer?
You have several ways to declare a dictionary in Python. First of all, you can write it out directly, as in this case:
>>> your_dict = {1: {2: {'C': {'D': 'val'}}}}
>>> print(your_dict)
{1: {2: {'C': {'D': 'val'}}}}
>>> sub_dict = your_dict[1][2]
>>> print(sub_dict)
{'C': {'D': 'val'}}
>>> new_dict = {'A': {'B': sub_dict}}
>>> print(new_dict)
{'A': {'B': {'C': {'D': 'val'}}}}
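Note that, just as with Perl hash references, sub_dict is a reference into your_dict rather than a copy: a mutation is visible through every dict that holds it (use copy.deepcopy if you need an independent piece). For example:

>>> sub_dict['E'] = 'other'
>>> print(your_dict)
{1: {2: {'C': {'D': 'val'}, 'E': 'other'}}}
>>> print(new_dict)
{'A': {'B': {'C': {'D': 'val'}, 'E': 'other'}}}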
This particular case is not perfectly suitable for using generator expressions, but it's also possible to use them to build a dict. And, of course, you can create a dictionary in a for-loop.
You may also be interested in additional data types in Python, such as OrderedDict and defaultdict.
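In particular, a well-known defaultdict recipe gives you Perl-style autovivification, where missing intermediate levels spring into existence on first access:

from collections import defaultdict

def tree():
    return defaultdict(tree)

hash_root = tree()
hash_root[1][2]['C']['D'] = 'val'  # every missing level is created on the fly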
If you want to understand how dict objects work, it's a good idea to read this spec.
Some important parts:
object.__len__(self)
    Called to implement the built-in function len(). Should return the length of the object, an integer >= 0. Also, an object that doesn't define a __nonzero__() method and whose __len__() method returns zero is considered to be false in a Boolean context.

object.__getitem__(self, key)
    Called to implement evaluation of self[key]. For sequence types, the accepted keys should be integers and slice objects. Note that the special interpretation of negative indexes (if the class wishes to emulate a sequence type) is up to the __getitem__() method. If key is of an inappropriate type, TypeError may be raised; if of a value outside the set of indexes for the sequence (after any special interpretation of negative values), IndexError should be raised. For mapping types, if key is missing (not in the container), KeyError should be raised.

object.__missing__(self, key)
    Called by dict.__getitem__() to implement self[key] for dict subclasses when key is not in the dictionary.
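As a sketch of how __missing__ enables the same kind of chained assignment (equivalent in effect to the defaultdict recipe above):

class AutoDict(dict):
    def __missing__(self, key):
        # create, store and return a new level whenever a key is absent
        value = self[key] = AutoDict()
        return value

root = AutoDict()
root['A']['B'] = 'val'
print(root)  # {'A': {'B': 'val'}}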
To remove duplicates from a list of mixed types, in Python 2 you could use something like
>>> items = [[1, 2], [3], [3], 4, 'a', 'b', 'a']
>>> from itertools import groupby
>>> [k for k, g in groupby(sorted(items))]
[4, [1, 2], [3], 'a', 'b']
This works well, in O(N log N) time. However, Python 3 raises TypeError: unorderable types: int() < list(). So what's the best way to do it in Python 3? (I know "best" is a subjective term, but really there should be one obvious way of doing it, according to Python.)
EDIT: It doesn't have to use a sort, but I'm guessing that would be the best way
In 2.x, values of two incomparable built-in types are ordered by type. The order of the types is not defined, except that it will be consistent during one run of the interpreter. So, 2 < [2] may be true or false, but it will be consistently true or false.
In 3.x, values of incomparable built-in types are incomparable—meaning they raise a TypeError if you try to compare them. So, 2 < [2] is an error. And, at least as of 3.3, the types themselves aren't even comparable. But if all you want to reproduce is the 2.x behavior, their ids are definitely comparable, and consistent during a run of the interpreter. So:
sorted(items, key=lambda x: (id(type(x)), x))
For your use case, that's all you need.
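Plugging it into the original recipe (a sketch; note that the relative order of the type groups depends on the ids and may differ between interpreter runs):

from itertools import groupby

items = [[1, 2], [3], [3], 4, 'a', 'b', 'a']
unique = [k for k, g in groupby(sorted(items, key=lambda x: (id(type(x)), x)))]
# e.g. [4, [1, 2], [3], 'a', 'b']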
However, this won't be exactly the same thing that 2.x does, because it means that, for example, 1.5 < 2 may be False (because float > int). If you want to duplicate the exact behavior, you need to write a key function that first tries to compare the values, then on TypeError falls back to comparing the types.
This is one of the few cases where an old-style cmp function is a lot easier to read than a new-style key function, so let's write one of those, then use cmp_to_key on it:
import functools

def cmp2x(a, b):
    try:
        if a == b: return 0
        elif a < b: return -1
        elif b < a: return 1
    except TypeError:
        pass
    # incomparable: fall back to comparing the ids of the types,
    # which are plain ints and therefore always comparable
    return cmp2x(id(type(a)), id(type(b)))

sorted(items, key=functools.cmp_to_key(cmp2x))
This still doesn't guarantee the same order between two values of different types that 2.x would give, but since 2.x didn't define any such order (just that it's consistent within one run), there's no way it could.
There is still one real flaw, however: If you define a class whose objects are not fully-ordered, they will end up sorting as equal, and I'm not sure this is the same thing 2.x would do in that case.
Let's take a step back.
You want to uniquify a collection.
If the values were hashable, you'd use the O(N) set solution. But they're not. If you could come up with some kind of hash function, you could equivalently use a dict of myhash(value): value. If your use case really is "nothing but hashable values and flat lists of hashable values", you could do that by trying to hash, then falling back to hash(tuple(value)). But in general, that's not going to work.
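A minimal sketch of that fallback idea, assuming nothing but hashable values and flat lists of hashable values:

def uniquify(items):
    seen = {}  # myhash(value) -> values already kept with that hash
    unique = []
    for value in items:
        try:
            h = hash(value)
        except TypeError:
            h = hash(tuple(value))  # assumes a flat list of hashables
        bucket = seen.setdefault(h, [])
        if value not in bucket:  # guard against hash collisions
            bucket.append(value)
            unique.append(value)
    return unique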
If they were fully ordered, you'd use the O(N log N) sorted solution (or, equivalently, a tree-based solution or similar). If you could come up with some kind of full-ordering function, you could just pass a key to the sorted function. I think this will work in your use case (hence my other answer). But, if not, no O(N log N) solution is going to work.
If they're neither, you can fall back to the O(N**2) linear search solution:
unique = []
for value in items:
    if value not in unique:
        unique.append(value)
If you can't find some way to define a full-ordering or a hash function on your values, this is the best you can do.
I'm trying to learn python (with a VBA background).
I've imported the following function into my interpreter:
def shuffle(dict_in_question):  # takes a dictionary as an argument and shuffles it
    shuff_dict = {}
    n = len(dict_in_question.keys())
    for i in range(0, n):
        # pick_item (defined elsewhere in the module) evidently removes
        # and returns a random item from dict_in_question -- see below
        shuff_dict[i] = pick_item(dict_in_question)
    return shuff_dict
The following is a transcript of my interpreter session:
>>> stuff = {"a":"Dave", "b":"Ben", "c":"Harry"}
>>> stuff
{'a': 'Dave', 'c': 'Harry', 'b': 'Ben'}
>>> decky11.shuffle(stuff)
{0: 'Harry', 1: 'Dave', 2: 'Ben'}
>>> stuff
{}
>>>
It looks like the dictionary gets shuffled, but after that, the dictionary is empty. Why? Or, am I using it wrong?
You need to assign it back to stuff too, as you're returning a new dictionary.
>>> stuff = decky11.shuffle(stuff)
Dogbert's answer solves your immediate problem, but keep in mind that dictionaries don't have an order! There's no such thing as "the first element of my_dict." (Using .keys() or .values() generates a list, which does have an order, but the dictionary itself doesn't.) So, it's not really meaningful to talk about "shuffling" a dictionary.
All you've actually done here is remapped the keys from letters a, b, c, to integers 0, 1, 2. These keys have different hash values than the keys you started with, so they print in a different order. But you haven't changed the order of the dictionary, because the dictionary didn't have an order to begin with.
Depending on what you're ultimately using this for (are you iterating over keys?), you can do something more direct:
import random

shufflekeys = list(stuff.keys())
random.shuffle(shufflekeys)  # shuffles the list in place (and returns None)
for key in shufflekeys:
    # do thing that requires order
As a side note, dictionaries (aka hash tables) are a really clever, hyper-useful data structure, which I'd recommend learning deeply if you're not already familiar. A good hash function (and non-pathological data) will give you O(1) (i.e., constant) lookup time - so you can check if a key is in a dictionary of a million items as fast as you can in a dictionary of ten items! The lack of order is a critical feature of a dictionary that enables this speed.
I have two sets (although I can do lists, or whatever):
a = frozenset(('Today','I','am','fine'))
b = frozenset(('hello','how','are','you','today'))
I want to get:
frozenset(['Today'])
or at least:
frozenset(['today'])
The second option is doable if I lowercase everything I presume, but I'm looking for a more elegant way. Is it possible to do
a.intersection(b)
in a case-insensitive manner?
Shortcuts in Django are also fine since I'm using that framework.
Example from intersection method below (I couldn't figure out how to get this formatted in a comment):
print intersection('Today I am fine tomorrow'.split(),
                   'Hello How are you TODAY and today and Today and Tomorrow'.split(),
                   key=str.lower)
[(['tomorrow'], ['Tomorrow']), (['Today'], ['TODAY', 'today', 'Today'])]
Here's a version that works for any pair of iterables:
def intersection(iterableA, iterableB, key=lambda x: x):
    """Return the intersection of two iterables with respect to `key` function.
    """
    def unify(iterable):
        d = {}
        for item in iterable:
            d.setdefault(key(item), []).append(item)
        return d
    A, B = unify(iterableA), unify(iterableB)
    return [(A[k], B[k]) for k in A if k in B]
Example:
print intersection('Today I am fine'.split(),
                   'Hello How are you TODAY'.split(),
                   key=str.lower)
# -> [(['Today'], ['TODAY'])]
Unfortunately, even if you COULD "change on the fly" the comparison-related special methods of the sets' items (__lt__ and friends -- actually, only __eq__ is needed the way sets are currently implemented, but that's an implementation detail) -- and you can't, because they belong to a built-in type, str -- that wouldn't suffice, because __hash__ is also crucial, and by the time you want to do your intersection it's already been applied, putting the sets' items in different hash buckets from where they'd need to end up to make intersection work the way you want (i.e., there's no guarantee that 'Today' and 'today' are in the same bucket).
So, for your purposes, you inevitably need to build new data structures -- if you consider it "inelegant" to have to do that at all, you're plain out of luck: built-in sets just don't carry around the HUGE baggage and overhead that would be needed to allow people to change comparison and hashing functions, which would bloat things by 10 times (or more) for the sake of a need felt in (maybe) one use case in a million.
If you have frequent needs connected with case-insensitive comparison, you should consider subclassing or wrapping str (overriding comparison and hashing) to provide a "case-insensitive str" type, cistr -- and then, of course, make sure that only instances of cistr are (e.g.) added to your sets (&c) of interest (either by subclassing set &c, or simply by taking care). To give an oversimplified example...:
class ci(str):
    def __hash__(self):
        return hash(self.lower())
    def __eq__(self, other):
        return self.lower() == other.lower()

class cifrozenset(frozenset):
    def __new__(cls, seq=()):
        return frozenset(ci(x) for x in seq)
a = cifrozenset(('Today','I','am','fine'))
b = cifrozenset(('hello','how','are','you','today'))
print a.intersection(b)
this does emit frozenset(['Today']), as per your expressed desire. Of course, in real life you'd probably want to do MUCH more overriding (for example...: the way I have things here, any operation on a cifrozenset returns a plain frozenset, losing the precious case independence special feature -- you'd probably want to ensure that a cifrozenset is returned each time instead, and, while quite feasible, that's NOT trivial).
First, don't you mean a.intersection(b)? The intersection (if case insensitive) would be set(['today']). The difference would be set(['i', 'am', 'fine'])
Here are two ideas:
1.) Write a function to convert the elements of both sets to lowercase and then do the intersection. Here's one way you could do it:
>>> intersect_with_key = lambda s1, s2, key=(lambda i: i): set(map(key, s1)).intersection(map(key, s2))
>>> fs1 = frozenset('Today I am fine'.split())
>>> fs2 = frozenset('Hello how are you TODAY'.split())
>>> intersect_with_key(fs1, fs2)
set([])
>>> intersect_with_key(fs1, fs2, key=str.lower)
set(['today'])
>>>
This is not very efficient, though, because the conversions would have to be done and new sets created on each call.
2.) Extend the frozenset class to keep a case-insensitive copy of the elements. Override the intersection method to use the case-insensitive copy of the elements. This would be more efficient:
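A rough sketch of this second idea (the class name and details are mine, purely illustrative):

class CIFrozenSet(frozenset):
    # keeps a lowercased copy of the elements and uses it for intersection
    def __new__(cls, seq=()):
        self = super(CIFrozenSet, cls).__new__(cls, seq)
        self._lower = frozenset(s.lower() for s in self)
        return self

    def intersection(self, other):
        other_lower = frozenset(s.lower() for s in other)
        common = self._lower & other_lower
        # return the original-case elements of self that match
        return frozenset(s for s in self if s.lower() in common)

With the question's data, CIFrozenSet(a).intersection(b) would yield frozenset(['Today']).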
>>> a_, b_ = map(set, [map(str.lower, a), map(str.lower, b)])
>>> a_ & b_
set(['today'])
Or... with fewer maps,
>>> a_ = set(map(str.lower, a))
>>> b_ = set(map(str.lower, b))
>>> a_ & b_
set(['today'])