Use Python 2 Dict Comparison in Python 3

I'm trying to port some code from Python 2 to Python 3. It's ugly stuff but I'm trying to get the Python 3 results to be as identical to the Python 2 results as possible. I have code similar to this:
import json

# Read a list of json dictionaries by line from file.
objs = []
with open('data.txt') as fptr:
    for line in fptr:
        objs.append(json.loads(line))

# Give the dictionaries a reliable order.
objs = sorted(objs)

# Do something externally visible with each dictionary:
for obj in objs:
    do_stuff(obj)
When I port this code from Python 2 to Python 3, I get an error:
TypeError: unorderable types: dict() < dict()
So I changed the sorted line to this:
objs = sorted(objs, key=id)
But the ordering of the dictionaries still changed between Python 2 and Python 3.
Is there a way to replicate the Python 2 comparison logic in Python 3? Is it simply that id was used before and is not reliable between Python versions?

If you want the same behavior as earlier versions of Python 2.x in both 2.7 (which uses an arbitrary sort order instead) and 3.x (which refuses to sort dicts), Ned Batchelder's answer to a question about how sorting dicts works gets you part of the way there, but not all the way.
First, it gives you an old-style cmp function, not a new-style key function. Fortunately, both 2.7 and 3.x have functools.cmp_to_key to solve that. (You could of course instead rewrite the code as a key function, but that may make it harder to see any differences between the posted code and your code…)
More importantly, it not only doesn't do the same thing in 2.7 and 3.x, it doesn't even work in 3.x. To understand why, look at the code:
def smallest_diff_key(A, B):
    """return the smallest key k in A such that A[k] != B[k]"""
    diff_keys = [k for k in A if A.get(k) != B.get(k)]
    return min(diff_keys)

def dict_cmp(A, B):
    if len(A) != len(B):
        return cmp(len(A), len(B))
    adiff = smallest_diff_key(A, B)
    bdiff = smallest_diff_key(B, A)
    if adiff != bdiff:
        return cmp(adiff, bdiff)
    return cmp(A[adiff], B[bdiff])
Notice that it's calling cmp on the mismatched values.
If the dicts can contain other dicts, that's relying on the fact that cmp(d1, d2) is going to end up calling this function… which is obviously not true in newer Python.
On top of that, in 3.x cmp doesn't even exist anymore.
Also, this relies on the fact that any value can be compared with any other value—you might get back arbitrary results, but you won't get an exception. That was true (except in a few rare cases) in 2.x, but it's not true in 3.x. That may not be a problem for you if you don't want to compare dicts with non-comparable values (e.g., if it's OK for {1: 2} < {1: 'b'} to raise an exception), but otherwise, it is.
And of course if you don't want arbitrary results for dict comparison, do you really want arbitrary results for value comparisons?
The solution to all three problems is simple: you have to replace cmp, instead of calling it. So, something like this:
def mycmp(A, B):
    if isinstance(A, dict) and isinstance(B, dict):
        return dict_cmp(A, B)
    try:
        return (A > B) - (A < B)  # a cmp() equivalent that works in 3.x
    except TypeError:
        # what goes here depends on how far you want to go for consistency;
        # re-raising is the simplest choice
        raise
If you want the exact rules for comparison of objects of different types that 2.7 used, they're documented, so you can implement them. But if you don't need that much detail, you can write something simpler here (or maybe even just not trap the TypeError, if the exception mentioned above is acceptable).
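Putting the pieces together, here is one self-contained sketch under those assumptions (dict_cmp rewritten to call mycmp instead of cmp; mixed-type TypeErrors are simply allowed to propagate):
from functools import cmp_to_key

def mycmp(A, B):
    # Dispatch dicts to dict_cmp; order everything else with Python 3's rules.
    if isinstance(A, dict) and isinstance(B, dict):
        return dict_cmp(A, B)
    return (A > B) - (A < B)

def smallest_diff_key(A, B):
    # The smallest key k in A such that A[k] != B[k].
    # (Like the original, this misbehaves if None is used as a value.)
    return min(k for k in A if A.get(k) != B.get(k))

def dict_cmp(A, B):
    if len(A) != len(B):
        return (len(A) > len(B)) - (len(A) < len(B))
    if A == B:
        return 0  # equal dicts have no differing key to find below
    adiff = smallest_diff_key(A, B)
    bdiff = smallest_diff_key(B, A)
    if adiff != bdiff:
        return mycmp(adiff, bdiff)
    return mycmp(A[adiff], B[bdiff])

objs = sorted(objs, key=cmp_to_key(mycmp))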

The logic in CPython 2.x is somewhat complicated since the behavior is dictated by dict.__cmp__. A Python implementation can be found here.
However, if you really want a reliable ordering, you'll need to sort on a better key than id. You could use functools.cmp_to_key to transform the comparison function from the linked answer into a key function, but really, it's not a good ordering anyway, as it is completely arbitrary.
Your best bet is to sort all of the dictionaries by a field's value (or multiple fields); see the example below. operator.itemgetter can be used for this purpose quite nicely. Using this as a key function should give you consistent results for any reasonably modern implementation and version of Python.
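For instance, a minimal sketch assuming each dictionary has 'name' and 'id' fields (substitute whatever fields define your ordering):
from operator import itemgetter

# Sort by 'name', breaking ties with 'id'.
objs = sorted(objs, key=itemgetter('name', 'id'))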

Is there a way to replicate the Python 2 comparison logic in Python 3? Is it simply that id was used before and is not reliable between Python versions?
id is never "reliable". The id that you get for any given object is a completely arbitrary value; it can be different from one run to the next, even on the same machine and Python version.
Python 2.x doesn't actually document that it sorts by id. All it says is:
Outcomes other than equality are resolved consistently, but are not otherwise defined.
But that just makes the point even better: the order is explicitly defined to be arbitrary (except for being consistent during any given run). Which is exactly the same guarantee you get by sorting with key=id in Python 3.x, whether or not it actually works the same way.*
So you're doing the same thing in 3.x. The fact that the two arbitrary orders are different just means that arbitrary is arbitrary.
If you want some kind of repeatable ordering for a dict based on what it contains, you just have to decide what that order is, and then you can build it. For example, you can sort the items in order and then compare them, recursively passing down the same key function in case the items are or contain dicts (see the sketch after the footnotes).**
And, having designed and implemented some kind of sensible, non-arbitrary ordering, it will of course work the same way in 2.7 and 3.x.
* Note that it's not equivalent for identity comparisons, only for ordering comparisons. If you're only using it for sorted, this has the consequence that your sort will no longer be stable. But since it's in arbitrary order anyway, that hardly matters.
** Note that Python 2.x used to use a rule similar to this. From the footnote to the above: "Earlier versions of Python used lexicographic comparison of the sorted (key, value) lists, but this was very expensive for the common case of comparing for equality." So, that tells you that it's a reasonable rule—as long as it's actually the rule you want, and you don't mind the performance cost.
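Here is a minimal sketch of that rule, assuming keys and values are mutually comparable (mixed types will still raise TypeError):
def dict_key(obj):
    # Order dicts by their sorted (key, value) item lists, recursing into
    # nested dicts; anything else orders as itself.
    if isinstance(obj, dict):
        return sorted((dict_key(k), dict_key(v)) for k, v in obj.items())
    return obj

objs = sorted(objs, key=dict_key)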

If you simply need an order that is consistent across multiple runs of Python on potentially different platforms but don't actually care about the actual order then a simple solution is to dump the dicts to JSON before sorting them:
import json

def sort_as_json(dicts):
    return sorted(dicts, key=lambda d: json.dumps(d, sort_keys=True))

print(sort_as_json([{'foo': 'bar'}, {1: 2}]))
# Prints [{1: 2}, {'foo': 'bar'}]
Obviously this only works if your dicts are JSON-representable, but since you load them from JSON anyway that should be no problem. In your case, you could achieve the same result by simply sorting the file you're loading the objects from before deserializing the JSON.
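For example, the file-based variant might look like this (a sketch; it matches the key-based sort above only if each line was serialized with its keys in a consistent order):
import json

with open('data.txt') as fptr:
    objs = [json.loads(line) for line in sorted(fptr)]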

You can compare .items()
d1 = {"key1": "value1"}
d2 = {"key1": "value1", "key2": "value2"}
d1.items() <= d2.items()
True
But this is not recursive
d1 = {"key1": "value1", "key2": {"key11": "value11"}}
d2 = {"key1": "value1", "key2": {"key11": "value11", "key12": "value12"}}
d1.items() <= d2.items()
False
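If you need the recursive behavior, you have to write it yourself. A hypothetical sketch:
def items_subset(d1, d2):
    # Recursive analogue of d1.items() <= d2.items().
    for key, value in d1.items():
        if key not in d2:
            return False
        other = d2[key]
        if isinstance(value, dict) and isinstance(other, dict):
            if not items_subset(value, other):
                return False
        elif value != other:
            return False
    return True

d1 = {"key1": "value1", "key2": {"key11": "value11"}}
d2 = {"key1": "value1", "key2": {"key11": "value11", "key12": "value12"}}
items_subset(d1, d2)
# True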

Related

Python 3.7+: Access elements of a dictionary by order

So I understand that, since Python 3.7, dicts are ordered, but the documentation doesn't seem to list methods of using said ordering. For example, how do I access say the first element of the dict by order, independent of keys? What operations can I do with the ordered dict?
As an example, I am working on implementing a Least Frequently Used cache, where I need to track not only the number of uses of a key, but also use Least Recently Used information as a tie break. I could use a dict of queues to implement a priority queue, but then I lose O(1) lookup within the queue. If I use a dict of dicts I can retain the advantages of a hashed set AND implement it as a priority queue... I think. I just need to be able to pop the first element of the dictionary. Alas, there is no dict.popleft().
For now I am converting the keys to a list and just using the first element of the list, but while this works (the dict is keeping ordering), the conversion is costly.
import collections

LFU_queue = collections.defaultdict(collections.defaultdict)
LFU_queue[1].update({"key_1": None})
LFU_queue[1].update({"key_32": None})
LFU_queue[1].update({"key_6": None})

# Inspecting this, I DO get what I expect, which is a
# dict of dicts with the given ordering:
# {1: {"key_1": None, "key_32": None, "key_6": None}}

# Here is where I'd love to be able to do something like
# LFU_queue[1].popleft() to return {"key_1": None}
list(LFU_queue[1])[0]  # works, but is less than ideal
As others have commented, OrderedDict is the right choice for that problem. But you can also get the first pair from a bare dict using items(), like this:
d = {}
d['a'] = 1
d['b'] = 2
first_pair = next(iter(d.items()), None)
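And if you need the popleft() behavior itself, collections.OrderedDict provides it directly through popitem(last=False):
from collections import OrderedDict

q = OrderedDict()
q["key_1"] = None
q["key_32"] = None
q["key_6"] = None

q.popitem(last=False)
# ('key_1', None)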
If you are interested, see PEP 372 and PEP 468 for the history of dictionary ordering.
With PEP 468 (which made **kwargs order-preserving), you can implement syntactic sugar like this.
from ctypes import Structure, c_int, c_ulong

def c_fields(**kwargs):
    return list(kwargs.items())

class XClientMessageEvent(Structure):
    _fields_ = c_fields(
        type=c_int,
        serial=c_ulong,
        send_event=c_int,
        ...
    )

Most efficient way to get a key from a dictionary

Let d be a large Python dictionary (one that still fits into memory) where we do not know what the keys are. What is the most efficient way to get a key of d, where it does not matter which key you get, and d is unchanged in both content and order (for newer versions of Python) once you are done? Here "efficient" means the memory used to perform the task is small compared to the size of the dictionary, and the speed is at least as fast as either of the methods below. This question is not about readability but about the Python dictionary objects. For example, two methods are:
Use the list method
any_key = list(d)[0]
Using the popitem method
any_key, y = d.popitem()
d[any_key] = y
So both methods essentially implement a peekkey() method. My basic timeit analysis shows that method 2) is much faster than method 1), and I assume that method 2) uses a lot less memory (but I do not really know if this is true yet). Is method 2) "best", or is there something better?
Extra brownie points if you get a fast and a readable method using only Python. Even more points for a C/Python method that accesses the dictionary object directly if that method is significantly faster than the best python method.
If you do not care about which key you get, and you don't mean "sample" in the random sense, then just grab the first key using next
key = next(iter(d.keys()))
which, for brevity, is the same as
key = next(iter(d))
Just to test performance, if I generate a dict with 1000 elements
d = {k:k for k in range(1000)}
then benchmarking these two methods shows that the next approach is about 95% faster:
>>> timeit.timeit('sample_key = list(d)[0]', setup='d = {k:k for k in range(1000)}')
5.3303698
>>> timeit.timeit('next(iter(d.keys()))', setup='d = {k:k for k in range(1000)}')
0.18915620000001354

Python - Versioned list instead of immutable list?

Update:
As of CPython 3.6, dictionaries have a version (thank you pylang for showing this to me).
If they added the same version to list and made it public, all 3 asserts from my original post would pass! It would definitely meet my needs. Their implementation differs from what I envisioned, but I like it.
As it is, I don't feel I can use dictionary version:
It isn't public. Jake VanderPlas shows how to expose it in a post, but he cautions that it is "definitely not code you should use for any purpose beyond simply having fun." I agree with his reasons.
In all of my use cases, the data is conceptually arrays of elements each of which has the same structure. A list of tuples is a natural fit. Using a dictionary would make the code less natural and probably more cumbersome.
Does anyone know if there are plans to add version to list?
Are there plans to make it public?
If there are plans to add version to list and make it public, I would feel awkward putting forward an incompatible VersionedList now. I would just implement the bare minimum I need and get by.
Original post below
Turns out that many of the times I wanted an immutable list, a VersionedList would have worked almost as well (sometimes even better).
Has anyone implemented a versioned list?
Is there a better, more Pythonic, concept that meets my needs? (See motivation below.)
What I mean by a versioned list is:
A class that behaves like a list
Any change to an instance or elements in the instance results in instance.version() being updated. So, if alist is a normal list:
a = VersionedList(alist)
a_version = a.version()
change(a)
assert a_version != a.version()
reverse_last_change(a)
If lists were hashable, hash() would achieve the above and meet all the needs identified in the motivation below. We need to define version() in a way that doesn't have all of the same problems as hash().
If identical data in two lists is highly unlikely to ever happen except at initialization, we aren't going to have a reason to test for deep equality. From https://docs.python.org/3.5/reference/datamodel.html#object.__hash__: "The only required property is that objects which compare equal have the same hash value." If we don't impose this requirement on version(), it seems likely that version() won't have all of the same problems that make lists unhashable. So, unlike hash(), identical contents don't imply the same version:
#contents of 'a' are now identical to original, but...
assert a_version != a.version()
b = VersionedList(alist)
c = VersionedList(alist)
assert b.version() != c.version()
For VersionedList, it would be good if any attempt to modify the result of __getitem__ automatically resulted in a copy instead of modifying the underlying implementation data. I think that the only other option would be to have __getitem__ always return a copy of the elements, and this would be very inefficient for all of the use cases I can think of. I think we need to restrict the elements to immutable objects (deeply immutable, for example: exclude tuples with list elements). I can think of 3 ways to achieve this:
Only allow elements that can't contain mutable elements (int, str, etc. are fine, but exclude tuples). (This is far too limiting for my cases.)
Add code to __init__, __setitem__, etc. to traverse inputs to deeply check for mutable sub-elements. (Expensive; any way to avoid this?)
Also allow more complex elements, but require that they are deeply immutable. Perhaps require that they expose a deeply_immutable attribute. (This turns out to be easy for all the use cases I have. A minimal sketch of the interface follows this list.)
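A minimal sketch of the interface described above (hypothetical; it ignores the element-immutability question and simply bumps a shared counter on every mutation):
import itertools

class VersionedList:
    # Versions are unique across instances, so two freshly built lists with
    # identical contents still have different versions.
    _counter = itertools.count()

    def __init__(self, iterable=()):
        self._data = list(iterable)
        self._bump()

    def _bump(self):
        self._version = next(VersionedList._counter)

    def version(self):
        return self._version

    def __len__(self):
        return len(self._data)

    def __getitem__(self, index):
        return self._data[index]

    def __setitem__(self, index, value):
        self._data[index] = value
        self._bump()

    def append(self, value):
        self._data.append(value)
        self._bump()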
Motivation:
If I am analyzing a dataset, I often have to perform multiple steps that return large datasets (note: since the dataset is ordered, it is best represented by a List not a set).
If at the end of several steps (ex: 5) it turns out that I need to perform different analysis (ex: back at step 4), I want to know that the dataset from step 3 hasn't accidentally been changed. That way I can start at step 4 instead of repeating steps 1-3.
I have functions (control-points, first-derivative, second-derivative, offset, outline, etc) that depend on and return array-valued objects (in the linear algebra sense). The base 'array' is knots.
control-points() depends on: knots, algorithm_enum
first-derivative() depends on: control-points(), knots
offset() depends on: first-derivative(), control-points(), knots, offset_distance
outline() depends on: offset(), end_type_enum
If offset_distance changes, I want to avoid having to recalculate first-derivative() and control-points(). To avoid recalculation, I need to know that nothing has accidentally changed the resultant 'arrays'.
If 'knots' changes, I need to recalculate everything and not depend on the previous resultant 'arrays'.
To achieve this, knots and all of the 'array-valued' objects could be VersionedList.
FYI: I had hoped to take advantage of an efficient class like numpy.ndarray. In most of my use cases, the elements logically have structure. Having to mentally keep track of multi-dimensions of indexes meant implementing and debugging the algorithms was many times more difficult with ndarray. An implementation based on lists of namedtuples of namedtuples turned out to be much more sustainable.
Private dicts in 3.6
In Python 3.6, dictionaries are now versioned (PEP 509) and compact (issue 27350), which track changes and preserve insertion order respectively. These features presently hold in the CPython 3.6 implementation. Jake VanderPlas demonstrates in his blog post how to expose this versioning feature from CPython within normal Python. We can use his approach to:
determine when a dictionary has been updated
preserve the order
Example
import numpy as np

d = {"a": np.array([1, 2, 3]),
     "c": np.array([1, 2, 3]),
     "b": np.array([8, 9, 10]),
     }

for i in range(3):
    print(d.get_version())  # monkey-patch (see Details below)

# 524938
# 524938
# 524938
Notice the version number does not change until the dictionary is updated, as shown below:
d.update({"c": np.array([10, 11, 12])})
d.get_version()
# 534448
In addition, the insertion order is preserved (the following was tested in restarted sessions of Python 3.5 and 3.6):
list(d.keys())
# ['a', 'c', 'b']
You may be able to take advantage of this new dictionary behavior, saving you from implementing a new datatype.
Details
For those interested, get_version() is a monkey-patched method for any dictionary, implemented in Python 3.6 using the following modified code derived from Jake VanderPlas' blog post. This code was run prior to calling get_version().
import types
import ctypes
import sys

assert (3, 6) <= sys.version_info < (3, 7)  # valid only in Python 3.6

py_ssize_t = ctypes.c_ssize_t

# Emulate the PyObject struct from CPython
class PyObjectStruct(ctypes.Structure):
    _fields_ = [('ob_refcnt', py_ssize_t),
                ('ob_type', ctypes.c_void_p)]

# Create a DictStruct class to wrap existing dictionaries
class DictStruct(PyObjectStruct):
    _fields_ = [("ma_used", py_ssize_t),
                ("ma_version_tag", ctypes.c_uint64),
                ("ma_keys", ctypes.c_void_p),
                ("ma_values", ctypes.c_void_p),
                ]

    def __repr__(self):
        return (f"DictStruct(size={self.ma_used}, "
                f"refcount={self.ob_refcnt}, "
                f"version={self.ma_version_tag})")

    @classmethod
    def wrap(cls, obj):
        assert isinstance(obj, dict)
        return cls.from_address(id(obj))

assert object.__basicsize__ == ctypes.sizeof(PyObjectStruct)
assert dict.__basicsize__ == ctypes.sizeof(DictStruct)

# Code for monkey-patching existing dictionaries
class MappingProxyStruct(PyObjectStruct):
    _fields_ = [("mapping", ctypes.POINTER(DictStruct))]

    @classmethod
    def wrap(cls, D):
        assert isinstance(D, types.MappingProxyType)
        return cls.from_address(id(D))

assert types.MappingProxyType.__basicsize__ == ctypes.sizeof(MappingProxyStruct)

def mappingproxy_setitem(obj, key, val):
    """Set an item in a read-only mapping proxy"""
    proxy = MappingProxyStruct.wrap(obj)
    ctypes.pythonapi.PyDict_SetItem(proxy.mapping,
                                    ctypes.py_object(key),
                                    ctypes.py_object(val))

mappingproxy_setitem(dict.__dict__,
                     'get_version',
                     lambda self: DictStruct.wrap(self).ma_version_tag)

conditionals with dicts Python

I was wondering what the correct way is to check a key:value pair of a dict. Let's say I have this dict:
dict_ = {
    'key1': 'val1',
    'key2': 'val2',
}
I can check a condition like this
if dict_['key1'] == 'val1':
but I feel like there is a more elegant way that takes advantage of the dict data structure.
What you're doing already does take advantage of the data structure, which is why it's "the one obvious way" to do what you want to do. (You can find examples like this all over the tutorial, the reference docs, and the stdlib implementation.)
However, I can see what you're thinking: the dict is in some sense a container of key-value pairs (even if it's only a collections.Container of keys…), so… shouldn't there be some way to just check whether a key-value pair exists?
Up to Python 2.6, there really isn't.* But in 3.0, the items() method returns a special set-like view of the key-value pairs. And 2.7 backported that functionality, under the name viewitems. So:
('key1', 'val1') in d.viewitems()
But I don't think that's really clearer or cleaner; "items" feels like a lower-level way to think of dictionaries than "mappings", which is what both your original code and smci's answer rely on.
It's also less concise, it doesn't work in 2.6 or earlier, many dict-like mapping objects don't support it,** and it's slightly slower on 2.7 to boot, but these are probably less important, and not what you asked about.
* Well, there is, but only by iterating over all of the items with iteritems, or using items to effectively do the same exhaustive search behind your back, neither of which is what you want.
** In fact, in 2.7, it's not actually possible to support it with a pure-Python class…
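In Python 3, items() itself returns the set-like view, so the same check is simply:
('key1', 'val1') in dict_.items()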
If you want to avoid throwing KeyError if dict doesn't even contain 'key1':
if dict_.get('key1') == 'val1':
(However, throwing an exception for missing key is perfectly fine Python idiom.)
Otherwise, @Cyber is correct that it's already fine! (What exactly is the problem?)
There is a has_key function (Python 2 only; it was removed in Python 3):
dict_.has_key('key1')
This returns a boolean True or False.
Alternatively, you can have the get function return a default value when the key is not present:
dict_.get('key3', 'Default Value')

Why use dict.keys?

I recently wrote some code that looked something like this:
# dct is a dictionary
if "key" in dct.keys():
However, I later found that I could achieve the same results with:
if "key" in dct:
This discovery got me thinking and I began to run some tests to see if there could be a scenario when I must use the keys method of a dictionary. My conclusion however is no, there is not.
If I want the keys in a list, I can do:
keys_list = list(dct)
If I want to iterate over the keys, I can do:
for key in dct:
...
Lastly, if I want to test if a key is in dct, I can use in as I did above.
Summed up, my question is: am I missing something? Could there ever be a scenario where I must use the keys method? Or is it simply a leftover from an earlier version of Python that should be ignored?
On Python 3, use dct.keys() to get a dictionary view object, which lets you do set operations on just the keys:
>>> for sharedkey in dct1.keys() & dct2.keys():  # intersection of two dictionaries
...     print(dct1[sharedkey], dct2[sharedkey])
In Python 2.7, you'd use dct.viewkeys() for that.
In Python 2, dct.keys() returns a list, a copy of the keys in the dictionary. This can be passed around as a separate object that can be manipulated in its own right, including removing elements without affecting the dictionary itself; however, you can create the same list with list(dct), which works in both Python 2 and 3.
You indeed don't want any of these for iteration or membership testing; always use for key in dct and key in dct for those, respectively.
Source: PEP 234, PEP 3106
Python 2's relatively useless dict.keys method exists for historical reasons. Originally, dicts weren't iterable. In fact, there was no such thing as an iterator; iterating over sequences worked by calling __getitem__, the element access method, with increasing integer indices until an IndexError was raised. To iterate over the keys of a dict, you had to call the keys method to get an explicit list of keys and iterate over that.
When iterators went in, dicts became iterable, because it was more convenient, faster, and all around better to say
for key in d:
than
for key in d.keys()
This had the side-effect of making d.keys() utterly superfluous; list(d) and iter(d) now did everything d.keys() did in a cleaner, more general way. They couldn't get rid of keys, though, since so much code already called it.
(At this time, dicts also got a __contains__ method, so you could say key in d instead of d.has_key(key). This was shorter and nicely symmetrical with for key in d; the symmetry is also why iterating over a dict gives the keys instead of (key, value) pairs.)
In Python 3, taking inspiration from the Java Collections Framework, the keys, values, and items methods of dicts were changed. Instead of returning lists, they would return views of the original dict. The key and item views would support set-like operations, and all views would be wrappers around the underlying dict, reflecting any changes to the dict. This made keys useful again.
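A quick sketch of that view behavior (output shown for CPython 3.7+, where insertion order is guaranteed):
d = {'a': 1}
keys = d.keys()
d['b'] = 2
print(keys)          # dict_keys(['a', 'b']) -- the view reflects the change
print(keys & {'a'})  # {'a'} -- set operations work on the view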
Assuming you're not using Python 3, list(dct) is equivalent to dct.keys(). Which one you use is a matter of personal preference. I personally think dct.keys() is slightly clearer, but to each their own.
In any case, there isn't a scenario where you "need" to use dct.keys() per se.
In Python 3, dct.keys() returns a "dictionary view object", so if you need to get a hold of an unmaterialized view to the keys (which could be useful for huge dictionaries) outside of a for loop context, you'd need to use dct.keys().
key in dict
is much faster in Python 2 than checking
key in dict.keys()
(In Python 3, keys() returns a view and the difference is small, but key in dict remains the idiomatic form.)
