I have a list of objects that I need to sort according to a key function. The problem is that some of the elements in my list can go "out-of-date" while the list is being sorted. When the key function is called on such an expired item, it fails with an exception.
Ideally, what I would like is a way of sorting my list with a key function such that when an error occurs upon calling the key function on an element, this element is excluded from the sort result.
My problem can be reconstructed using the following example: Suppose I have two classes, Good and Bad:
class Good(object):
    def __init__(self, x):
        self.x = x

    def __repr__(self):
        return 'Good(%r)' % self.x

class Bad(object):
    @property
    def x(self):
        raise RuntimeError()

    def __repr__(self):
        return 'Bad'
I want to sort instances of these classes according to their x property, e.g.:
>>> sorted([Good(5), Good(3), Good(7)], key=lambda obj: obj.x)
[Good(3), Good(5), Good(7)]
Now, when there is a Bad in my list, the sorting fails:
>>> sorted([Good(5), Good(3), Bad()], key=lambda obj: obj.x)
... RuntimeError
I am looking for a magical function func that sorts a list according to a key function, but simply ignores elements for which the key function raised an error:
>>> func([Good(5), Good(3), Bad()], key=lambda obj: obj.x)
[Good(3), Good(5)]
What is the most Pythonic way of achieving this?
No sorting algorithm I know of throws out values because they're outdated. A sorting algorithm's job is to sort the list, and to sort it fast; everything else is a separate, task-specific concern.
So, I would write this magical function myself. It would do the sorting in two steps: first it would filter the list, leaving only Good values, and then sort the resulting list.
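For instance, here is a minimal sketch of that two-step approach (the function name and the choice to catch any exception are my own):

def sort_good(seq, key):
    # Step 1: filter, keeping only the items whose key call succeeds.
    good = []
    for item in seq:
        try:
            good.append((key(item), item))
        except Exception:
            pass  # the item expired; drop it from the result
    # Step 2: sort by the precomputed key values.
    good.sort(key=lambda pair: pair[0])
    return [item for _, item in good]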
I did this once with a mergesort. Mergesort makes it relatively simple to eliminate no-longer-useful values.
The project I did it in is at http://stromberg.dnsalias.org/~dstromberg/equivalence-classes.html#python-3e . Feel free to raid it for ideas, or lift code out of it; it's Free as in speech (GPLv2 or later, at your option).
The sort in that code should almost do what you want, except it'll sort a list with duplicates to a list of lists, where each sublist has equal values. That part may or may not be useful to you.
I've got a more straightforward mergesort (it doesn't do the duplicate buckets thing, but it doesn't deal with dropping no-longer-good values either) at http://stromberg.dnsalias.org/svn/sorts/compare/trunk/ . The file is .m4, but don't let that fool you - it's really pure python or cython autogenerated from the same .m4 file.
Since the result of the key function can change over time, and most sorting implementations probably assume a deterministic key function, it's probably best to only execute the key function once per object, to ensure a well-ordered and crash-free final list.
def func(seq, **kargs):
    key = kargs["key"]
    stored_values = {}
    for item in seq:
        try:
            value = key(item)
            stored_values[item] = value
        except RuntimeError:
            pass
    return sorted(stored_values.iterkeys(), key=lambda item: stored_values[item])

print func([Good(5), Good(3), Bad()], key=lambda obj: obj.x)
Result:
[Good(3), Good(5)]
If the list items can go from Good to Bad while sorting, then there is nothing you can do. The keys are evaluated only once before sorting, so any change in the key will be invisible to the sort function:
>>> from random import randrange
>>> values = [randrange(100) for i in range(10)]
>>> values
[54, 72, 91, 73, 55, 68, 21, 25, 18, 95]
>>> def k(x):
... print x
... return x
...
>>> values.sort(key=k)
54
72
91
73
55
68
21
25
18
95
(If the key were evaluated many times during the sort, you would see the numbers printed many times).
Here is the code:
EDIT: Please, no more "it's not possible with an unordered dictionary" replies. I pretty much already know that. I made this post on the off-chance that it MIGHT be possible, or that someone has a workable idea.
# position equals some set of two-dimensional coords
for name in self.regions["regions"]:  # I want to start the iteration with 'last_region'
    # I don't want to run these next two lines over every dictionary key each time,
    # since the likelihood is that the new position is still within the last region
    # that was matched.
    rect = (self.regions["regions"][name]["pos1"], self.regions["regions"][name]["pos2"])
    if all(self.point_inside(rect, position)):
        # record the name of this region in variable 'last_region' so I can
        # start with it on the next search...
        # other code I want to run when I get a match
        return
return  # if code gets here, the points were not inside any of the named regions
Hopefully the comments in the code explain my situation well enough. Let's say I was last inside region "delta" (i.e., the key name is delta, the value is the sets of coordinates defining its boundaries) and I have 500 more regions. The first time I find myself in region delta, the code may not discover this until, let's say (hypothetically), the 389th iteration... so it made 388 all(self.point_inside(rect, position)) calculations before it found that out. Since I will probably still be in delta the next time it runs (but I must verify that each time the code runs), it would be helpful if the key "delta" was the first one checked by the for loop.
This particular code can run many times a second for many different users, so speed is critical. The design is such that very often the user will not be in a region, and all 500 records may need to be cycled through before the loop exits with no match; but I would like to speed the overall program up by speeding it up for the users who are presently in one of the regions.
I don't want the additional overhead of sorting the dictionary into any particular order, etc. I just want the loop to start with the last key that successfully matched all(self.point_inside(rect, position)).
Maybe this will help a bit more. The following is the dictionary I am using (only the first record shown) so you can see the structure I coded to above... and yes, despite the name "rect" in the code, it actually checks for the point in a cubical region.
{"regions": {"shop": {"flgs": {"breakprot": true, "placeprot": true}, "dim": 0, "placeplayers": {"4f953255-6775-4dc6-a612-fb4230588eff": "SurestTexas00"}, "breakplayers": {"4f953255-6775-4dc6-a612-fb4230588eff": "SurestTexas00"}, "protected": true, "banplayers": {}, "pos1": [5120025, 60, 5120208], "pos2": [5120062, 73, 5120257], "ownerUuid": "4f953255-6775-4dc6-a612-fb4230588eff", "accessplayers": {"4f953255-6775-4dc6-a612-fb4230588eff": "SurestTexas00"}}, more, more, more...}
You may try to implement some caching mechanism within a custom subclass of dict.
You could set a self._cache = None in __init__, add a method like set_cache(self, key) to set the cache and finally overriding __iter__ to yield self._cache before calling the default __iter__.
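A rough, untested sketch of that idea (the class name is mine):

class CachedFirstDict(dict):
    """Dict whose iteration yields the cached key first."""
    def __init__(self, *args, **kwargs):
        dict.__init__(self, *args, **kwargs)
        self._cache = None

    def set_cache(self, key):
        # Remember the key that should be tried first on iteration.
        self._cache = key if key in self else None

    def __iter__(self):
        if self._cache is not None:
            yield self._cache
        for key in dict.__iter__(self):
            if key != self._cache:
                yield key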
However, that can be kinda cumbersome, if you consider this stackoverflow answer and also this one.
Given what's written in your question, I would instead try to implement this caching logic in your own code.
def _match_region(self, name, position):
    rect = (self.regions["regions"][name]["pos1"], self.regions["regions"][name]["pos2"])
    return all(self.point_inside(rect, position))

# ...then, inside the method that handles a position update:
if self.last_region and self._match_region(self.last_region, position):
    self.code_to_run_when_match(position)
    return

for name in self.regions["regions"]:
    if self._match_region(name, position):
        self.last_region = name
        self.code_to_run_when_match(position)
        return
return  # if code gets here, the points were not inside any of the named regions
That is right: a dictionary is an unordered type, so OrderedDict won't help you much with what you want to do.
You could store the last region in your class. Then, on the next call, check whether the last region still matches before checking the entire dictionary.
Instead of a for-loop, you could use iterators directly. Here's an example function that does something similar to what you want, using iterators:
def iterate(what, iterator):
    iterator = iterator or what.iteritems()
    try:
        while True:
            k, v = iterator.next()
            print "Trying k = ", k
            if v > 100:
                return iterator
    except StopIteration:
        return None
Instead of storing the name of the region in last_region, you would store the result of this function, which is like a "pointer" to where you left off. Then, you can use the function like this (shown as if run in the Python interactive interpreter, including the output):
>>> x = {'a':12, 'b': 42, 'c':182, 'd': 9, 'e':12}
>>> last_region = None
>>> last_region = iterate(x, last_region)
Trying k = a
Trying k = c
>>> last_region = iterate(x, last_region)
Trying k = b
Trying k = e
Trying k = d
Thus, you can easily resume from where you left off, but there's one additional caveat to be aware of:
>>> last_region = iterate(x, last_region)
Trying k = a
Trying k = c
>>> x['z'] = 45
>>> last_region = iterate(x, last_region)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 5, in iterate
RuntimeError: dictionary changed size during iteration
As you can see, it'll raise an error if you ever add a new key. So, if you use this method, you'll need to be sure to set last_region = None any time you add a new region to the dictionary.
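In terms of the question's structures, that invalidation could be as simple as this sketch (the method name is mine):

def add_region(self, name, data):
    self.regions["regions"][name] = data
    self.last_region = None  # the saved iterator is now stale; start fresh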
TigerhawkT3 is right. Dicts are unordered in the sense that there is no guaranteed order of keys in a given dictionary. You can even get a different order of keys if you iterate over the same dictionary twice. If you want order, you need to use either OrderedDict or a plain list. You can convert your dict to a list and sort it in whatever way represents the order you need.
Without knowing what your objects are, and whether self in the example is a user instance or an environment instance, it is hard to come up with a solution. But if self in the example is the environment, its class could have a class attribute that is a dictionary of all current users and their last known position, provided the user instance is hashable.
Something like this
class Thing(object):
    __user_regions = {}

    def where_ami(self, user):
        try:
            region = self.__user_regions[user]
            print 'AHA!! I know where you are!!'
        except KeyError:
            # find region
            print 'Hmmmm. let me think about that'
            region = 'foo'
            self.__user_regions[user] = region

class User(object):
    def __init__(self, position):
        self.pos = position

thing = Thing()
thing2 = Thing()
u = User((1,2))
v = User((3,4))
Now you can try to retrieve the user's region from the class attribute. If there is more than one Thing they would share that class attribute.
>>>
>>> thing._Thing__user_regions
{}
>>> thing2._Thing__user_regions
{}
>>>
>>> thing.where_ami(u)
Hmmmm. let me think about that
>>>
>>> thing._Thing__user_regions
{<__main__.User object at 0x0433E2B0>: 'foo'}
>>> thing2._Thing__user_regions
{<__main__.User object at 0x0433E2B0>: 'foo'}
>>>
>>> thing2.where_ami(v)
Hmmmm. let me think about that
>>>
>>> thing._Thing__user_regions
{<__main__.User object at 0x0433EA90>: 'foo', <__main__.User object at 0x0433E2B0>: 'foo'}
>>> thing2._Thing__user_regions
{<__main__.User object at 0x0433EA90>: 'foo', <__main__.User object at 0x0433E2B0>: 'foo'}
>>>
>>> thing.where_ami(u)
AHA!! I know where you are!!
>>>
You say that you "don't want an additional overhead of sorting the dictionary in any particular order". What overhead? Presumably OrderedDict uses some additional data structure internally to keep track of the order of keys. But unless you know that this is costing you too much memory, OrderedDict is your solution. That means profiling your code and making sure that an OrderedDict is actually the source of your bottleneck.
If you want the cleanest code, just use an OrderedDict. It has a move_to_end method which takes a key and puts it either at the front of the dictionary, or at the end. For example:
from collections import OrderedDict

animals = OrderedDict([('cat', 1), ('dog', 2), ('turtle', 3), ('lizard', 4)])

def check_if_turtle(animals):
    for animal in animals:
        print('Checking %s...' % animal)
        if animal == 'turtle':
            animals.move_to_end('turtle', last=False)
            return True
    else:
        return False
Our check_if_turtle function looks through an OrderedDict for a turtle key. If it doesn't find it, it returns False. If it does find it, it returns True, but only after moving the turtle key to the beginning of the OrderedDict.
Let's try it. On the first run:
>>> check_if_turtle(animals)
Checking cat...
Checking dog...
Checking turtle...
True
we see that it checked all of the keys up to turtle. Now, if we run it again:
>>> check_if_turtle(animals)
Checking turtle...
True
we see that it checked the turtle key first.
I am practically repeating the same code in several functions, with only one minor, but essential, change in each.
I have about 4 functions that look similar to this:
def list_expenses(self):
    explist = [(key, item.amount) for key, item in self.expensedict.iteritems()]  # create a list from the dictionary, making a tuple of dict key and object values
    sortedlist = reversed(sorted(explist, key=lambda (k, a): a))  # sort the list by the amount in each tuple; reverse to get high to low
    for ka in sortedlist:
        k, a = ka
        print k, a

def list_income(self):
    inclist = [(key, item.amount) for key, item in self.incomedict.iteritems()]  # create a list from the dictionary, making a tuple of dict key and object values
    sortedlist = reversed(sorted(inclist, key=lambda (k, a): a))  # sort the list by the amount in each tuple; reverse to get high to low
    for ka in sortedlist:
        k, a = ka
        print k, a
I believe this is what they refer to as violating "DRY". However, I don't have any idea how to change this to be more DRY-like, as I have two separate dictionaries (expensedict and incomedict) that I need to work with.
I did some google searching and found something called decorators, and I have a very basic understanding of how they work, but no clue how I would apply it to this.
So my request/question: is this a candidate for a decorator, and if a decorator is necessary, could I get a hint as to what the decorator should do? Pseudocode is fine. I don't mind struggling. I just need something to start with.
What do you think about using a separate function (as a private method) for list processing? For example, you may do the following:
def __list_processing(self, d):
    # do the generic processing of your lists:
    # print (key, amount) pairs sorted by amount, high to low
    pairs = [(key, item.amount) for key, item in d.iteritems()]
    for k, a in sorted(pairs, key=lambda (k, a): a, reverse=True):
        print k, a

def list_expenses(self):
    self.__list_processing(self.expensedict)

def list_income(self):
    self.__list_processing(self.incomedict)
It looks better since all the complicated processing is in a single place, and list_expenses, list_income, etc. are just the corresponding wrapper functions.
I am working with data pulled from a spreadsheet-like file. I am trying to find, for each "ligand", the item with the lowest corresponding "energy". To do this I'm trying to make a list of all the ligands I find in the file, and compare them to one another, using the index value to find the energy of each ligand, keeping the one with the lowest energy. However, the following loop is not working out for me. The program won't finish, it just keeps running until I cancel it manually. I'm assuming this is due to an error in the structure of my loop.
for item in ligandList:
    for i in ligandList:
        if ligandList.index(item) != ligandList.index(i):
            if item == i:
                if float(lineList[ligandList.index(i)][42]) < float(lineList[ligandList.index(item)][42]):
                    lineList.remove(ligandList.index(item))
                else:
                    lineList.remove(ligandList.index(i))
As you can see, I've created a separate ligandList containing the ligands, and am using the current index of that list to access the energy values in the lineList.
Does anyone know why this isn't working?
It is a bit hard to answer without some actual data to play with, but I hope this works, or at least leads you in the right direction:
for idx1, item1 in enumerate(ligandList):
    for idx2, item2 in enumerate(ligandList):
        if idx1 == idx2: continue
        if item1 != item2: continue
        if float(lineList[idx1][42]) < float(lineList[idx2][42]):
            del lineList[idx1]
        else:
            del lineList[idx2]
That’s a really inefficient way of doing things. Lots of index calls. It might just feel infinite because it’s slow.
Zip your related things together:
l = zip(ligandList, lineList)
Sort them by “ligand” and “energy”:
l = sorted(l, key=lambda t: (t[0], t[1][42]))
Grab the first (lowest) “energy” for each:
l = ((lig, lin.next()[1]) for lig, lin in itertools.groupby(l, key=lambda t: t[0]))
Yay.
import itertools

result = ((lig, lin.next()[1]) for lig, lin in itertools.groupby(
    sorted(zip(ligandList, lineList), key=lambda t: (t[0], t[1][42])),
    lambda t: t[0]
))
It would probably look more flattering if you made lineList contain classes of some kind.
It looks like you're trying to find the element in ligandList with the smallest value at index 42. Let's just do that....
min(ligandList, key=lambda x: float(x[42]))
If these "Ligands" are something you use regularly, STRONGLY consider writing a class wrapper for them, something like:
class Ligand(object):
def __init__(self,lst):
self.attr_name = lst[index_of_attr] # for each attribute
... # for each attribute
... # etc etc
self.energy = lst[42]
def __str__(self):
"""This method defines what the class looks like if you call str() on
it, e.g. a call to print(Ligand) will show this function's return value."""
return "A Ligand with energy {}".format(self.energy) # or w/e
def transmogfiscate(self,other):
pass # replace this with whatever Ligands do, if they do things...
In which case you can simply create a list of the Ligands:
ligands = [Ligand(ligand) for ligand in ligandList]
and return the object with the smallest energy:
lil_ligand = min(ligands, key=lambda ligand: ligand.energy)
As a huge aside, PEP 8 encourages the use of the lowercase naming convention for variables, rather than mixedCase as many languages use.
I have a dict that has string-type keys whose exact values I can't know (because they're generated dynamically elsewhere). However, I know that the key I want contains a particular substring, and that a single key with this substring is definitely in the dict.
What's the best, or "most pythonic" way to retrieve the value for this key?
I thought of two strategies, but both irk me:
for k, v in some_dict.items():
    if 'substring' in k:
        value = v
        break
-- OR --
value = [v for (k,v) in some_dict.items() if 'substring' in k][0]
The first method is bulky and somewhat ugly, while the second is cleaner, but the extra step of indexing into the list comprehension (the [0]) irks me. Is there a better way to express the second version, or a more concise way to write the first?
There is a way to write the second version with the performance characteristics of the first one.
Use a generator expression instead of list comprehension:
value = next(v for (k,v) in some_dict.iteritems() if 'substring' in k)
The expression inside the parentheses returns a generator, which you then ask for its next, i.e. first, element. No further elements are processed.
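One caveat worth adding: if no key matches, next() raises StopIteration. You can pass a default as the second argument to get None instead:

value = next((v for (k, v) in some_dict.iteritems() if 'substring' in k), None)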
How about this:
value = (v for (k,v) in some_dict.iteritems() if 'substring' in k).next()
It will stop immediately when it finds the first match.
But it still has O(n) complexity, where n is the number of key-value pairs. You need something like a suffix list or a suffix tree to speed up searching.
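For instance, if the set of interesting substrings is known up front, a simple precomputed index (a sketch, not a real suffix tree; known_substrings is a hypothetical list) gives O(1) lookups at the cost of a one-time build:

# Build once: map each known substring to the key that contains it.
index = {}
for key in some_dict:
    for sub in known_substrings:
        if sub in key:
            index[sub] = key

# Thereafter, lookups are constant time.
value = some_dict[index['substring']]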
If there are many keys but the full key is easy to reconstruct from the substring, it can be faster to reconstruct it. For example, you often know the start of the key but not the datestamp that was appended to it, so you may only have to try 365 dates rather than iterate through millions of keys.
It's unlikely to be the case but I thought I would suggest it anyway.
e.g.
>>> names={'bob_k':32,'james_r':443,'sarah_p':12}
>>> firstname='james' #you know the substring james because you have a list of firstnames
>>> for c in "abcdefghijklmnopqrstuvwxyz":
... name="%s_%s"%(firstname,c)
... if name in names:
... print name
...
james_r
class MyDict(dict):
    def __init__(self, *args):
        dict.__init__(self, *args)

    def __getitem__(self, x):
        return next(v for (k, v) in self.iteritems() if x in k)
# Defining several dicts ------------------------------------------------------
some_dict = {'abc4589': 4578, 'abc7812': 798, 'kjuy45763': 1002}
another_dict = {'boumboum14': 'WSZE x478',
                'tagada4783': 'ocean11',
                'maracuna102455': None}
still_another = {12: 'jfg', 45: 'klsjgf'}

# Selecting the dicts whose __getitem__ method will be changed ----------------
name, obj = None, None  # pre-define so the comprehension doesn't resize globals()
selected_dicos = [(name, obj) for (name, obj) in globals().iteritems()
                  if type(obj) == dict
                  and all(type(x) == str for x in obj.iterkeys())]

print 'names of selected_dicos ==', [name for (name, obj) in selected_dicos]

# Transforming the selected dicts into instances of class MyDict --------------
for k, v in selected_dicos:
    globals()[k] = MyDict(v)

# Example of getting a value ---------------------------------------------
print "some_dict['7812'] ==", some_dict['7812']
result
names of selected_dicos == ['another_dict', 'some_dict']
some_dict['7812'] == 798
I prefer the first version, although I'd use some_dict.iteritems() (if you're on Python 2) because then you don't have to build an entire list of all the items beforehand. Instead you iterate through the dict and break as soon as you're done.
On Python 3, some_dict.items() already returns a dictionary view, so that's already a suitable iterator.
This question is in relation to another question asked here:
Sorting 1M records
I have since figured out the problem I was having with sorting. I was sorting items from a dictionary into a list every time I updated the data. I have since realized that a lot of the power of Python's sort resides in the fact that it sorts already partially sorted data more quickly.
So, here is the question. Suppose I have the following as a sample set:
self.sorted_records = [(1, 1234567890), (20, 1245678903),
(40, 1256789034), (70, 1278903456)]
t[1] of each tuple in the list is a unique id. Now I want to update this list with the following:
updated_records = {1245678903:45, 1278903456:76}
What is the fastest way for me to do so ending up with
self.sorted_records = [(1, 1234567890), (45, 1245678903),
(40, 1256789034), (76, 1278903456)]
Currently I am doing something like this:
updated_keys = updated_records.keys()
for i, record in enumerate(self.sorted_data):
    if record[1] in updated_keys:
        updated_keys.remove(record[1])
        self.sorted_data[i] = (updated_records[record[1]], record[1])
But I am sure there is a faster, more elegant solution out there.
Any help?
EDIT: It turns out I used bad examples for the ids, since they end up in sorted order when I do my update. I am actually interested in t[0] being in sorted order. After the update I was intending to re-sort with the updated data, but it looks like bisect might be the ticket for inserting in sorted order.
You're scanning through all n records. You could instead do a binary search, which would be O(log(n)) instead of O(n). You can use the bisect module to do this.
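For instance, assuming the list is kept sorted by id (a sketch using the standard bisect module; the parallel ids list is my own device, since bisect would otherwise compare whole tuples):

import bisect

sorted_records = [(1, 1234567890), (20, 1245678903),
                  (40, 1256789034), (70, 1278903456)]

# Search a parallel list of ids, which mirrors the sort order of the records.
ids = [rec_id for _value, rec_id in sorted_records]
pos = bisect.bisect_left(ids, 1245678903)
if pos < len(ids) and ids[pos] == 1245678903:
    sorted_records[pos] = (45, 1245678903)  # replace the value in place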
Since apparently you don't care about the ending value of self.sorted_records actually being sorted (you have values in the order 1, 45, 20, 76 -- that's NOT sorted!), and you only appear to care about IDs in updated_records that are also in self.sorted_data, a listcomp (with side effects, if you want to change updated_records on the fly) would serve you well, i.e.:
self.sorted_data = [(updated_records.pop(recid, value), recid)
for (value, recid) in self.sorted_data]
The .pop call removes from updated_records the keys (and corresponding values) that end up in the new self.sorted_data; the previous value for that recid, value, is supplied as the 2nd argument to pop to ensure no change where a recid is NOT in updated_records. This leaves in updated_records the "new" stuff, so you can e.g. append it to self.sorted_data before re-sorting, i.e. I suspect you want to continue with something like:
self.sorted_data.extend((value, recid)
                        for recid, value in updated_records.iteritems())
self.sorted_data.sort()
though this part DOES go beyond the question you're actually asking (and I'm giving it only because I've seen your previous questions;-).
You'd probably be best served by some form of tree here (preserving sorted order while allowing O(log n) replacements). There is no builtin balanced tree type, but you can find many third-party examples. Alternatively, you could either:
Use a binary search to find the node. The bisect module will do this, but it compares based on the normal python comparison order, whereas you seem to be sorted based on the second element of each tuple. You could reverse this, or just write your own binary search (or simply take the code from bisect_left and modify it)
Use both a dict and a list. The list contains the sorted keys only. You can wrap the dict class easily enough to ensure this is kept in sync. This allows you fast dict updating while maintaining sort order of the keys. This should prevent your problem of losing sort performance due to constant conversion between dict/list.
Here's a quick implementation of such a thing:
import bisect

class SortedDict(dict):
    """Dictionary which is iterable in sorted order.

    O(n) sorted iteration
    O(1) lookup
    O(log n) replacement (but O(n) insertion of new items)
    """
    def __init__(self, *args, **kwargs):
        dict.__init__(self, *args, **kwargs)
        self._keys = sorted(dict.iterkeys(self))

    def __setitem__(self, key, val):
        if key not in self:
            # New key - need to add to list of keys.
            pos = bisect.bisect_left(self._keys, key)
            self._keys.insert(pos, key)
        dict.__setitem__(self, key, val)

    def __delitem__(self, key):
        if key in self:
            pos = bisect.bisect_left(self._keys, key)
            del self._keys[pos]
        dict.__delitem__(self, key)

    def __iter__(self):
        for k in self._keys:
            yield k
    iterkeys = __iter__

    def iteritems(self):
        for k in self._keys:
            yield (k, self[k])

    def itervalues(self):
        for k in self._keys:
            yield self[k]

    def update(self, other):
        dict.update(self, other)
        # Rebuild the key list; faster if lots of changes were made, may be
        # slower for only minor changes to a large dict.
        self._keys = sorted(dict.iterkeys(self))

    def keys(self): return list(self.iterkeys())
    def values(self): return list(self.itervalues())
    def items(self): return list(self.iteritems())

    def __repr__(self):
        return "%s(%s)" % (self.__class__.__name__,
                           ', '.join("%s=%r" % (k, self[k]) for k in self))
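A quick usage sketch of the class above (my own example values):

sd = SortedDict({1234567890: 1, 1245678903: 20, 1256789034: 40})
sd[1245678903] = 45    # existing key: plain dict store, key order unchanged
sd[1278903456] = 76    # new key: bisect finds the slot, O(n) list insert
print sd.items()       # pairs come back in sorted key order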
Since you want to replace by dictionary key, but have the array sorted by dictionary value, you definitely need a linear search for the key. In that sense, your algorithm is the best you can hope for.
If you preserved the old dictionary value, you could use a binary search for the value, and then locate the key in the proximity of where the binary search led you.