Python: update a list of tuples... fastest method

This question is in relation to another question asked here:
Sorting 1M records
I have since figured out the problem I was having with sorting. I was sorting items from a dictionary into a list every time I updated the data. I have since realized that a lot of the power of Python's sort resides in the fact that it sorts already partially sorted data more quickly.
So, here is the question. Suppose I have the following as a sample set:
self.sorted_records = [(1, 1234567890), (20, 1245678903),
                       (40, 1256789034), (70, 1278903456)]
t[1] of each tuple in the list is a unique id. Now I want to update this list with the following:
updated_records = {1245678903:45, 1278903456:76}
What is the fastest way for me to do so ending up with
self.sorted_records = [(1, 1234567890), (45, 1245678903),
                       (40, 1256789034), (76, 1278903456)]
Currently I am doing something like this:
updated_keys = updated_records.keys()
for i, record in enumerate(self.sorted_data):
    if record[1] in updated_keys:
        updated_keys.remove(record[1])
        self.sorted_data[i] = (updated_records[record[1]], record[1])
But I am sure there is a faster, more elegant solution out there.
Any help?
* edit
It turns out I used bad examples for the ids, since they end up in sorted order when I do my update. I am actually interested in t[0] being in sorted order. After the update I was intending to re-sort with the updated data, but it looks like bisect might be the ticket for inserting in sorted order.
end edit *

You're scanning through all n records. You could instead do a binary search, which would be O(log(n)) instead of O(n). You can use the bisect module to do this.
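For instance, a minimal sketch of that idea, assuming the list is kept sorted by the id field (t[1]) and that a parallel list of the ids is maintained alongside it (replace_record is a hypothetical helper):

import bisect

def replace_record(sorted_records, ids, rec_id, new_value):
    # ids is a plain list of record ids in the same order as sorted_records,
    # so bisect can search it directly.
    pos = bisect.bisect_left(ids, rec_id)          # O(log n) search
    if pos < len(ids) and ids[pos] == rec_id:
        sorted_records[pos] = (new_value, rec_id)  # O(1) replacement

records = [(1, 1234567890), (20, 1245678903), (40, 1256789034), (70, 1278903456)]
ids = [r[1] for r in records]  # built once, kept in sync with the records
replace_record(records, ids, 1245678903, 45)
print(records)  # [(1, 1234567890), (45, 1245678903), (40, 1256789034), (70, 1278903456)]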

Since apparently you don't care about the ending value of self.sorted_records actually being sorted (you have values in the order 1, 45, 20, 76 -- that's NOT sorted!-), and you only appear to care about IDs in updated_records that are also in self.sorted_data, a list comprehension (with side effects, if you want to change updated_records on the fly) would serve you well, i.e.:
self.sorted_data = [(updated_records.pop(recid, value), recid)
                    for (value, recid) in self.sorted_data]
The .pop call removes from updated_records the keys (and corresponding values) that end up in the new self.sorted_data (the "previous value for that recid", value, is supplied as the 2nd argument to pop to ensure no change where a recid is NOT in updated_records); this leaves in updated_records only the "new" stuff, so you can e.g. append it to self.sorted_data before re-sorting, i.e. I suspect you want to continue with something like:
self.sorted_data.extend((value, recid)
                        for recid, value in updated_records.iteritems())
self.sorted_data.sort()
though this part DOES go beyond the question you're actually asking (and I'm giving it only because I've seen your previous questions;-).

You'd probably be best served by some form of tree here (preserving sorted order while allowing O(log n) replacements). There is no built-in balanced tree type, but you can find many third-party examples. Alternatively, you could either:
Use a binary search to find the node. The bisect module will do this, but it compares based on the normal Python comparison order, whereas you seem to be sorting based on the second element of each tuple. You could reverse this, or just write your own binary search (or simply take the code from bisect_left and modify it).
Use both a dict and a list. The list contains the sorted keys only. You can wrap the dict class easily enough to ensure this is kept in sync. This allows you fast dict updating while maintaining sort order of the keys. This should prevent your problem of losing sort performance due to constant conversion between dict/list.
Here's a quick implementation of such a thing:
import bisect

class SortedDict(dict):
    """Dictionary which is iterable in sorted order.

    O(n) sorted iteration
    O(1) lookup
    O(log n) replacement (but O(n) insertion of new items)
    """
    def __init__(self, *args, **kwargs):
        dict.__init__(self, *args, **kwargs)
        self._keys = sorted(dict.iterkeys(self))

    def __setitem__(self, key, val):
        if key not in self:
            # New key - need to add it to the sorted list of keys.
            pos = bisect.bisect_left(self._keys, key)
            self._keys.insert(pos, key)
        dict.__setitem__(self, key, val)

    def __delitem__(self, key):
        if key in self:
            pos = bisect.bisect_left(self._keys, key)
            del self._keys[pos]
        dict.__delitem__(self, key)

    def __iter__(self):
        for k in self._keys:
            yield k
    iterkeys = __iter__

    def iteritems(self):
        for k in self._keys:
            yield (k, self[k])

    def itervalues(self):
        for k in self._keys:
            yield self[k]

    def update(self, other):
        dict.update(self, other)
        # Rebuild the key list: faster if lots of changes were made, but may
        # be slower for only minor changes to a large dict.
        self._keys = sorted(dict.iterkeys(self))

    def keys(self):
        return list(self.iterkeys())

    def values(self):
        return list(self.itervalues())

    def items(self):
        return list(self.iteritems())

    def __repr__(self):
        return "%s(%s)" % (self.__class__.__name__,
                           ', '.join("%s=%r" % (k, self[k]) for k in self))

Since you want to replace by dictionary key, but have the array sorted by dictionary value, you definitely need a linear search for the key. In that sense, your algorithm is the best you can hope for.
If you preserved the old dictionary value, you could use a binary search for the value, and then locate the key in the vicinity of where the binary search led you.
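A sketch of that idea, assuming the records are (value, id) tuples sorted by value and the old value is known (find_by_old_value is a hypothetical helper):

import bisect

def find_by_old_value(sorted_records, old_value, rec_id):
    # Bisect to the first record whose value >= old_value, then scan
    # forward through the run of equal values for the matching id.
    pos = bisect.bisect_left(sorted_records, (old_value,))
    while pos < len(sorted_records) and sorted_records[pos][0] == old_value:
        if sorted_records[pos][1] == rec_id:
            return pos
        pos += 1
    return -1

records = [(1, 1234567890), (20, 1245678903), (40, 1256789034), (70, 1278903456)]
print(find_by_old_value(records, 20, 1245678903))  # 1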

Related

Does OrderedDict.items() also keep the order preserved?

Assuming that I'm using the following OrderedDict:
ordered_dict = OrderedDict([("a", 1), ("b", 2), ("c", 3)])
At some point, I would like to get the (key,value) items and define an iterator, and start moving it once desired:
ordered_dict_items_iter = iter(ordered_dict.items())
...
key,val = next(ordered_dict_items_iter)
...
I'd like to know whether ordered_dict.items() will also preserve the same order.
From what I've observed it seems to preserve the order, but I couldn't prove it.
It does. The idea of OrderedDict is that it behaves exactly like a dictionary, but internally it tracks the insertion order of its keys (CPython keeps a doubly linked list of them), so that order is preserved. All dictionary methods respect this ordering.
Note: since Python 3.7, standard dictionaries are also guaranteed to maintain insertion order.
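A quick check, mirroring the question's setup:

from collections import OrderedDict

ordered_dict = OrderedDict([("a", 1), ("b", 2), ("c", 3)])
items_iter = iter(ordered_dict.items())
print(next(items_iter))  # ('a', 1)
print(next(items_iter))  # ('b', 2)
print(next(items_iter))  # ('c', 3)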
Yes, it'll preserve the order you specify while initializing the dictionary.
Yes. OrderedDict.items() will return the items in the order they are inserted.
If you check the implementation of OrderedDict, you can see that items() returns an _OrderedDictItemsView.
class OrderedDict(dict):
    ...

    def items(self):
        "D.items() -> a set-like object providing a view on D's items"
        return _OrderedDictItemsView(self)
And if you dig deeper to find the implementation of _OrderedDictItemsView:
class _OrderedDictItemsView(_collections_abc.ItemsView):

    def __reversed__(self):
        for key in reversed(self._mapping):
            yield (key, self._mapping[key])
And if you go deeper to check _collections_abc.ItemsView, you will see:
class ItemsView(MappingView, Set):
    ...

    def __iter__(self):
        for key in self._mapping:
            yield (key, self._mapping[key])
And further down, in MappingView, you will see:
class MappingView(Sized):
    __slots__ = '_mapping',

    def __init__(self, mapping):
        self._mapping = mapping
Now our journey has reached its destination, and we can see that _mapping is the OrderedDict we created, which is always in order. The __iter__ method of ItemsView just iterates over each key of the OrderedDict and yields the corresponding (key, value) pair. Hence the proof :)

Python defaultdict for large data sets

I am using defaultdict to store millions of phrases, so my data structure looks like mydict['string'] = set(['other', 'strings']). It seems to work OK for smaller sets, but when I hit anything over 10 million keys, my program just crashes with the helpful message Process killed. I know defaultdicts are memory heavy, but is there an optimised method of storing with defaultdicts, or would I have to look at other data structures like a numpy array?
Thank you
If you're set on staying in memory with a single Python process, then you're going to have to abandon the dict datatype -- as you noted, it has excellent runtime performance characteristics, but it uses a bunch of memory to get you there.
Really, I think @msw's comment and @Udi's answer are spot on -- to scale, you ought to look at on-disk or at least out-of-process storage of some sort; an RDBMS is probably the easiest way to get going.
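For example, a minimal sketch of the RDBMS route using the standard-library sqlite3 module (the schema and names here are illustrative, not prescriptive):

import sqlite3

# One row per set member; the index makes per-phrase lookups fast.
conn = sqlite3.connect('phrases.db')
conn.execute('CREATE TABLE IF NOT EXISTS phrases (phrase TEXT, member TEXT)')
conn.execute('CREATE INDEX IF NOT EXISTS idx_phrase ON phrases (phrase)')

conn.execute('INSERT INTO phrases VALUES (?, ?)', ('my string', 'other'))
conn.execute('INSERT INTO phrases VALUES (?, ?)', ('my string', 'strings'))
conn.commit()

members = {row[0] for row in
           conn.execute('SELECT member FROM phrases WHERE phrase = ?',
                        ('my string',))}
print(members)  # {'other', 'strings'}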
However, if you're sure that you need to stay in memory and in-process, I'd recommend using a sorted list to store your dataset. You can do lookups in O(log n) time; insertions and deletions into the underlying list are O(n), though with a small constant factor. You can wrap up the code for yourself so that the usage looks pretty much like a defaultdict. Something like this might help (not debugged beyond the tests at the bottom):
import bisect

class mystore:
    def __init__(self, constructor):
        self.store = []
        self.constructor = constructor
        self.empty = constructor()

    def __getitem__(self, key):
        i, k = self.lookup(key)
        if k == key:
            return self.store[i][1]
        # Key not present: create a new item for this key.
        value = self.constructor()
        self.store.insert(i, (key, value))
        return value

    def __setitem__(self, key, value):
        i, k = self.lookup(key)
        if k == key:
            self.store[i] = (key, value)
        else:
            self.store.insert(i, (key, value))

    def lookup(self, key):
        # bisect_left lands on the first entry for this key (if any), since
        # an empty value sorts before any non-empty one for types like set.
        i = bisect.bisect_left(self.store, (key, self.empty))
        if 0 <= i < len(self.store):
            return i, self.store[i][0]
        return i, None

if __name__ == '__main__':
    s = mystore(set)
    s['a'] = set(['1'])
    print(s.store)
    s['b']
    print(s.store)
    s['a'] = set(['2'])
    print(s.store)
Maybe try redis' set data type:
Redis Sets are unordered collections of strings. The SADD command adds new elements to a set. It's also possible to do a number of other operations against sets, like testing if a given element already exists...
From here: http://redis.io/topics/data-types-intro
redis-py supports these commands.
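A minimal sketch of what that might look like with redis-py, assuming a Redis server running on localhost (the key naming scheme is illustrative):

import redis

r = redis.Redis(host='localhost', port=6379, db=0)

# Store the strings associated with each phrase as a Redis set,
# keeping the bulk of the data out of the Python process.
r.sadd('phrase:some phrase', 'other', 'strings')

print(r.sismember('phrase:some phrase', 'other'))  # True
print(r.smembers('phrase:some phrase'))            # the members, as bytes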

Python: Handling exceptions while sorting

I have a list of objects that I need to sort according to a key function. The problem is that some of the elements in my list can go "out-of-date" while the list is being sorted. When the key function is called on such an expired item, it fails with an exception.
Ideally, what I would like is a way of sorting my list with a key function such that when an error occurs upon calling the key function on an element, this element is excluded from the sort result.
My problem can be reconstructed using the following example: Suppose I have two classes, Good and Bad:
class Good(object):
    def __init__(self, x):
        self.x = x

    def __repr__(self):
        return 'Good(%r)' % self.x

class Bad(object):
    @property
    def x(self):
        raise RuntimeError()

    def __repr__(self):
        return 'Bad'
I want to sort instances of these classes according to their x property. E.g.:
>>> sorted([Good(5), Good(3), Good(7)], key=lambda obj: obj.x)
[Good(3), Good(5), Good(7)]
Now, when there is a Bad in my list, the sorting fails:
>>> sorted([Good(5), Good(3), Bad()], key=lambda obj: obj.x)
... RuntimeError
I am looking for a magical function func that sorts a list according to a key function, but simply ignores elements for which the key function raised an error:
>>> func([Good(5), Good(3), Bad()], key=lambda obj: obj.x)
[Good(3), Good(5)]
What is the most Pythonic way of achieving this?
No sorting algorithm I know of throws out values because they're outdated or the like. A sorting algorithm's task is to sort the list, and to sort it fast; everything else is an extraneous, task-specific concern.
So, I would write this magical function myself. It would do the sorting in two steps: first it would filter the list, leaving only the "good" values, and then sort the resulting list, as sketched below.
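A minimal sketch of that two-step approach, using the Good/Bad classes from the question (note it re-evaluates the key during the sort, a wrinkle a later answer addresses by caching the key values):

def func(seq, key):
    # Step 1: keep only the items whose key can currently be computed.
    good = []
    for item in seq:
        try:
            key(item)
        except RuntimeError:
            continue
        good.append(item)
    # Step 2: sort what's left. The key is evaluated again here, so an
    # item that expires between the two steps can still raise.
    return sorted(good, key=key)

print(func([Good(5), Good(3), Bad()], key=lambda obj: obj.x))
# [Good(3), Good(5)]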
I did this once with a mergesort. Mergesort makes it relatively simple to eliminate no-longer-useful values.
The project I did it in is at http://stromberg.dnsalias.org/~dstromberg/equivalence-classes.html#python-3e . Feel free to raid it for ideas, or lift code out of it; it's Free as in speech (GPLv2 or later, at your option).
The sort in that code should almost do what you want, except it'll sort a list with duplicates to a list of lists, where each sublist has equal values. That part may or may not be useful to you.
I've got a more straightforward mergesort (it doesn't do the duplicate-buckets thing, but it doesn't deal with dropping no-longer-good values either) at http://stromberg.dnsalias.org/svn/sorts/compare/trunk/ . The file is .m4, but don't let that fool you - it's really pure Python or Cython autogenerated from the same .m4 file.
Since the result of the key function can change over time, and most sorting implementations probably assume a deterministic key function, it's probably best to only execute the key function once per object, to ensure a well-ordered and crash-free final list.
def func(seq, **kargs):
    key = kargs["key"]
    stored_values = {}
    for item in seq:
        try:
            value = key(item)
            stored_values[item] = value
        except RuntimeError:
            pass
    return sorted(stored_values.iterkeys(),
                  key=lambda item: stored_values[item])

print func([Good(5), Good(3), Bad()], key=lambda obj: obj.x)
Result:
[Good(3), Good(5)]
If the list items can go from Good to Bad while sorting, then there is nothing you can do. The keys are evaluated only once before sorting, so any change in the key will be invisible to the sort function:
>>> from random import randrange
>>> values = [randrange(100) for i in range(10)]
>>> values
[54, 72, 91, 73, 55, 68, 21, 25, 18, 95]
>>> def k(x):
...     print x
...     return x
...
>>> values.sort(key=k)
54
72
91
73
55
68
21
25
18
95
(If the key were evaluated many times during the sort, you would see the numbers printed many times).

Last element in OrderedDict

I have od of type OrderedDict. I want to access its most recently added (key, value) pair. od.popitem(last=True) would do it, but would also remove the pair from od, which I don't want.
What's a good way to do that? Can /should I do this:
class MyOrderedDict(OrderedDict):
    def last(self):
        return next(reversed(self))
Using next(reversed(od)) is a perfect way of accessing the most-recently added element. The class OrderedDict uses a doubly linked list for the dictionary items and implements __reversed__(), so this implementation gives you O(1) access to the desired element. Whether it is worthwhile to subclass OrderedDict() for this simple operation may be questioned, but there's nothing actually wrong with this approach.
God, I wish this was all built-in functionality...
Here's something to save you precious time. Tested in Python 3.7. od is your OrderedDict.
# Get first key
next(iter(od))
# Get last key
next(reversed(od))
# Get first value
od[next(iter(od))]
# Get last value
od[next(reversed(od))]
# Get first key-value tuple
next(iter(od.items()))
# Get last key-value tuple
next(reversed(od.items()))
A little magic from timeit can help here...
from collections import OrderedDict

class MyOrderedDict1(OrderedDict):
    def last(self):
        k = next(reversed(self))
        return (k, self[k])

class MyOrderedDict2(OrderedDict):
    def last(self):
        out = self.popitem()
        self[out[0]] = out[1]
        return out

class MyOrderedDict3(OrderedDict):
    def last(self):
        k = list(self.keys())[-1]
        return (k, self[k])

if __name__ == "__main__":
    from timeit import Timer
    N = 100

    d1 = MyOrderedDict1()
    for i in range(N):
        d1[i] = i
    print("d1", d1.last())

    d2 = MyOrderedDict2()
    for i in range(N):
        d2[i] = i
    print("d2", d2.last())

    d3 = MyOrderedDict3()
    for i in range(N):
        d3[i] = i
    print("d3", d3.last())

    t = Timer("d1.last()", 'from __main__ import d1')
    print("OrderedDict1", t.timeit())
    t = Timer("d2.last()", 'from __main__ import d2')
    print("OrderedDict2", t.timeit())
    t = Timer("d3.last()", 'from __main__ import d3')
    print("OrderedDict3", t.timeit())
results in:
d1 (99, 99)
d2 (99, 99)
d3 (99, 99)
OrderedDict1 1.159217119216919
OrderedDict2 3.3667118549346924
OrderedDict3 24.030261993408203
(Tested on python3.2, Ubuntu Linux).
As pointed out by @SvenMarnach, the method you described is quite efficient compared to the other two ways I could cook up.
Your idea is fine; however, the default iterator only goes over the keys, so your example will return only the last key. What you actually want is:
class MyOrderedDict(OrderedDict):
    def last(self):
        return list(self.items())[-1]
This gives the (key, value) pairs, not just the keys, as you wanted.
Note that on pre-3.x versions of Python, OrderedDict.items() returns a list, so you don't need the list() call, but later versions return a dictionary view object, so you will.
Edit: As noted in the comments, the quicker operation is to do:
class MyOrderedDict(OrderedDict):
    def last(self):
        key = next(reversed(self))
        return (key, self[key])
Although I must admit I find this uglier in the code (I never liked getting the key and then doing x[key] to get the value separately; I prefer getting the (key, value) tuple) - depending on the importance of speed and your preferences, you may wish to pick the former option.

How to retrieve from python dict where key is only partially known?

I have a dict that has string-type keys whose exact values I can't know (because they're generated dynamically elsewhere). However, I know that the key I want contains a particular substring, and that a single key with this substring is definitely in the dict.
What's the best, or "most pythonic" way to retrieve the value for this key?
I thought of two strategies, but both irk me:
for k, v in some_dict.items():
    if 'substring' in k:
        value = v
        break
-- OR --
value = [v for (k,v) in some_dict.items() if 'substring' in k][0]
The first method is bulky and somewhat ugly, while the second is cleaner, but the extra step of indexing into the list comprehension (the [0]) irks me. Is there a better way to express the second version, or a more concise way to write the first?
There is a way to write the second version with the performance characteristics of the first one.
Use a generator expression instead of list comprehension:
value = next(v for (k,v) in some_dict.iteritems() if 'substring' in k)
The expression inside the parentheses returns a generator, which you then ask to provide the next, i.e. first, element. No further elements are processed.
How about this:
value = (v for (k,v) in some_dict.iteritems() if 'substring' in k).next()
It will stop immediately when it finds the first match.
But it still has O(n) complexity, where n is the number of key-value pairs. You need something like a suffix list or a suffix tree to speed up searching.
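As a crude sketch of that indexing idea (not a real suffix tree: the build cost is quadratic in each key's length, so it only pays off when the dict is queried many times between rebuilds):

from collections import defaultdict

def build_substring_index(d):
    # Map every substring of every key to the set of keys containing it.
    index = defaultdict(set)
    for key in d:
        for i in range(len(key)):
            for j in range(i + 1, len(key) + 1):
                index[key[i:j]].add(key)
    return index

some_dict = {'abc4589': 4578, 'abc7812': 798, 'kjuy45763': 1002}
index = build_substring_index(some_dict)
key = next(iter(index['7812']))  # O(1) average lookup by substring
print(some_dict[key])            # 798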
If there are many keys but the string is easy to reconstruct from the substring, then it can be faster to reconstruct it. E.g. often you know the start of the key but not the datestamp that has been appended to it (so you may only have to try 365 dates rather than iterate through millions of keys, for example).
It's unlikely to be the case but I thought I would suggest it anyway.
e.g.
>>> names = {'bob_k': 32, 'james_r': 443, 'sarah_p': 12}
>>> firstname = 'james'  # you know the substring 'james' because you have a list of first names
>>> for c in "abcdefghijklmnopqrstuvwxyz":
...     name = "%s_%s" % (firstname, c)
...     if name in names:
...         print name
...
james_r
class MyDict(dict):
    def __init__(self, *kwargs):
        dict.__init__(self, *kwargs)

    def __getitem__(self, x):
        return next(v for (k, v) in self.iteritems() if x in k)

# Defining several dicts -----------------------------------------------------
some_dict = {'abc4589': 4578, 'abc7812': 798, 'kjuy45763': 1002}
another_dict = {'boumboum14': 'WSZE x478',
                'tagada4783': 'ocean11',
                'maracuna102455': None}
still_another = {12: 'jfg', 45: 'klsjgf'}

# Selecting the dicts whose __getitem__ method will be changed ---------------
name, obj = None, None  # pre-create the names leaked by the listcomp so that
                        # globals() doesn't change size during iteration
selected_dicos = [(name, obj) for (name, obj) in globals().iteritems()
                  if type(obj) == dict
                  and all(type(x) == str for x in obj.iterkeys())]
print 'names of selected_dicos ==', [name for (name, obj) in selected_dicos]

# Transforming the selected dicts into instances of class MyDict -------------
for k, v in selected_dicos:
    globals()[k] = MyDict(v)

# Example of getting a value -------------------------------------------------
print "some_dict['7812'] ==", some_dict['7812']
Result:
names of selected_dicos == ['another_dict', 'some_dict']
some_dict['7812'] == 798
I prefer the first version, although I'd use some_dict.iteritems() (if you're on Python 2) because then you don't have to build an entire list of all the items beforehand. Instead you iterate through the dict and break as soon as you're done.
On Python 3, some_dict.items() already returns a dictionary view, so that's already a suitable iterable.
