I've been working with enums lately and I don't really get the utility of them in some cases. I hope my question is not too trivial or too stupid, and I would really love to better understand the logic behind this Python structure. One common use I found online, or in some pieces of code I have been working on lately, is the use of values that are strings, for example:
from enum import Enum
class Days(Enum):
    MONDAY = 'monday'
    TUESDAY = 'tuesday'
    ...
    SUNDAY = 'sunday'
And here, from my humble perspective, the values seem redundant: if I print the value of a member I obtain the following:
print(Days.MONDAY.value)
>> 'monday'
I totally understand the utility when the values are numbers and they represent a hierarchical structure, for example:
class Levels(Enum):
    HIGH = 10
    MID = 5
    LOW = 0
In which you can do stuff like (note that ordered comparisons like this require IntEnum, or custom ordering methods, rather than plain Enum):
Levels.HIGH > Levels.LOW
>> True
But in a lot of examples and actual code I see the first approach, the one with MONDAY = 'monday', i.e. where the values are strings instead of numbers, and in this case I really don't understand the utility of having a name that is pretty much equal to the value.
If anyone can help me understand, or show some uses, I would really love to learn something new.
The members (should) always be named in all uppercase (per the very first note on the enum docs), so if you want them to have "names" that are in some other casing, you can assign strings with arbitrary case, which may be more human-friendly when it comes time to display the values to the user.
You can also convert from said values to the enum constant easily, with the Enum's constructor, e.g.:
Days('monday') is Days.MONDAY # This is True
so if the data from the user (or database or whatever) has specific values, you can easily convert them to their logical Enum equivalents this way.
If the values really aren't meaningful, you can just assign auto() to all of them and not think about the values.
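For example, a minimal sketch of the auto() approach (member names here are arbitrary, just for illustration):
from enum import Enum, auto

class Color(Enum):
    RED = auto()    # becomes 1, assigned automatically
    GREEN = auto()  # becomes 2
    BLUE = auto()   # becomes 3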
Just in case you're asking "why not use the strings themselves?", the general advantage to enums is guaranteed uniqueness and exhaustiveness for efficient checks and self-documenting code. If you use the strings directly, you have to use == checks (str has no guarantee that equal values are the same object unless you explicitly intern them, so is checks can't be used), and people can pass in strings that don't actually come from the expected set of strings. With Enums there is a central definition of all possible values (and therefore all other values are not possible), and since the values are all guaranteed singletons, when you have a member of that Enum, you can use is/is not testing for cheap identity testing (in CPython at least, it's literally just a pointer comparison) without relying on value equality tests, ==/!=, that invoke the more expensive rich comparison machinery. This even works when you make aliases for the same enum member, e.g.:
class Foo(Enum):
    SPAM = 1
    EGGS = 2
    ALSO_SPAM = 1
which seamlessly makes Foo.ALSO_SPAM the same object as Foo.SPAM, so Foo.SPAM is Foo.ALSO_SPAM is true, allowing two aliases with different names to be used interchangeably.
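A quick interactive check (illustrative session) shows the alias resolving to the canonical member:
>>> Foo.SPAM is Foo.ALSO_SPAM
True
>>> Foo.ALSO_SPAM
<Foo.SPAM: 1>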
It's one approach to adding type safety to your code. Even if Days looks like a collection of strings, it's not a subclass/subtype/special-case of str.
>>> isinstance(Days.MONDAY, Days)
True
>>> isinstance('monday', Days)
False
>>> isinstance(Days.MONDAY, str)
False
In addition to ShadowRanger's answer: the primary reason why Enum has both a name and a value field is so meaningful constants can have easy-to-use names -- constants such as database values, various codes (HTTP, error, response, etc.) -- in other words, so Enum can provide an easy interface with other systems.
Python seems to lack a base class for "all numbers", e.g. int, float, complex, long (in Python 2). This is unfortunate and a bit inconvenient for me.
I'm writing a function for matching data types onto each other (a tree matching / searching algorithm). For this I'd like to test the input values for being lists, dicts, strings, or "numbers". Each of the four cases is handled separately, but I do not want to distinguish between int, long, float, or complex (though the latter will probably not appear). This would be rather easy to achieve if all number types were derived from a base type number, but unfortunately they are not, AFAICS.
This enforcement of explicitness makes it obvious that the unusual complex is included, which typically raises questions I would rather not think about. My design rather says "all number types" than that explicit list. I also do not want to explicitly list all possible number types coming from other libraries like numpy or similar.
First question, rather a theoretical one: Why didn't the designers make all number types inherit a common number base class? Is there a good reason for this, maybe a theory about it which lets it seem not recommended? Or would it make sense to propose this idea for later versions of Python?
Second question, the more practical one: is there a recommended way of checking a value for being a number, maybe a library call I'm not aware of? The straightforward version, isinstance(x, (int, float, complex, long)), looks like a kludge to me, isn't compatible with Python 3 (which no longer has a long type), and doesn't cover library-based number types like numpy.int32.
There is actually a base class for those types you listed.
If you're not looking at numpy types, a good starting point would be numbers.Complex:
>>> import numbers
>>> isinstance(1+9j, numbers.Complex)
True
>>> isinstance(1L, numbers.Complex)
True
>>> isinstance(1., numbers.Complex)
True
>>> isinstance(1, numbers.Complex)
True
It gets a bit messier when you start to include types from numpy; however, the numbers.Complex abstract base class already handles a good number of the mentioned cases.
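For example, numbers.Number (the root of that ABC hierarchy) works as a catch-all check for the built-in numeric types; whether third-party scalars such as numpy's register with it depends on the library version, so treat that part as an assumption to verify:
>>> import numbers
>>> all(isinstance(x, numbers.Number) for x in (1, 1.5, 1 + 2j))
True
>>> isinstance('1', numbers.Number)
False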
Sorry, a bit late to the party. But this might be helpful for future readers of this page. I suggest you use Python's "duck typing" character and its EAFP ("Easier to Ask Forgiveness than Permission") philosophy: in other words, just try using the object in question as a number. Write something like this:
def isnumber(thing):
    try:
        thing + 0
        return True
    except TypeError:
        return False
It should work for any type of number, including user-defined classes.
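For example (a quick, illustrative check; Decimal stands in here for a non-builtin numeric type):
>>> from decimal import Decimal
>>> isnumber(3), isnumber(1.5), isnumber(Decimal('2.5'))
(True, True, True)
>>> isnumber('3'), isnumber([1, 2])
(False, False)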
I'm working through http://www.mypythonquiz.com, and question #45 asks for the output of the following code:
confusion = {}
confusion[1] = 1
confusion['1'] = 2
confusion[1.0] = 4
sum = 0
for k in confusion:
    sum += confusion[k]
print sum
The output is 6, since the key 1.0 replaces 1. This feels a bit dangerous to me; is this ever a useful language feature?
First of all: the behaviour is documented explicitly in the docs for the hash function:
hash(object)
Return the hash value of the object (if it has one). Hash values are
integers. They are used to quickly compare dictionary keys during a
dictionary lookup. Numeric values that compare equal have the same
hash value (even if they are of different types, as is the case for 1
and 1.0).
Secondly, a limitation of hashing is pointed out in the docs for object.__hash__
object.__hash__(self)
Called by built-in function hash() and for operations on members of
hashed collections including set, frozenset, and dict. __hash__()
should return an integer. The only required property is that objects
which compare equal have the same hash value;
This is not unique to python. Java has the same caveat: if you implement hashCode then, in order for things to work correctly, you must implement it in such a way that: x.equals(y) implies x.hashCode() == y.hashCode().
So, python decided that 1.0 == 1 holds, hence it's forced to provide an implementation for hash such that hash(1.0) == hash(1). The side effect is that 1.0 and 1 act exactly in the same way as dict keys, hence the behaviour.
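A quick illustration of both halves of that contract:
>>> 1 == 1.0
True
>>> hash(1) == hash(1.0)
True
>>> d = {1: 'int'}
>>> d[1.0] = 'float'
>>> d
{1: 'float'}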
In other words the behaviour in itself doesn't have to be used or useful in any way. It is necessary. Without that behaviour there would be cases where you could accidentally overwrite a different key.
If we had 1.0 == 1 but hash(1.0) != hash(1) we could still have a collision. And if 1.0 and 1 collide, the dict will use equality to be sure whether they are the same key or not and kaboom the value gets overwritten even if you intended them to be different.
The only way to avoid this would be to have 1.0 != 1, so that the dict is able to distinguish between them even in case of collision. But it was deemed more important to have 1.0 == 1 than to avoid the behaviour you are seeing, since you practically never use floats and ints as dictionary keys anyway.
Since python tries to hide the distinction between numbers by automatically converting them when needed (e.g. 1/2 -> 0.5) it makes sense that this behaviour is reflected even in such circumstances. It's more consistent with the rest of python.
This behaviour would appear in any implementation where the matching of the keys is at least partially (as in a hash map) based on comparisons.
For example, if a dict was implemented using a red-black tree or another kind of balanced BST, when the key 1.0 is looked up, the comparisons with other keys would return the same results as for 1, and so they would still act in the same way.
Hash maps require even more care, because it's the hash value that is used to find the key's entry, and comparisons are done only afterwards. So breaking the rule presented above means you'd introduce a bug that's quite hard to spot: at times the dict may seem to work as you'd expect, and at other times, when the size changes, it would start to behave incorrectly.
Note that there would be a way to fix this: have a separate hash map/BST for each type inserted in the dictionary. In this way there couldn't be any collisions between objects of different type and how == compares wouldn't matter when the arguments have different types.
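A toy sketch of that idea (purely illustrative, not how dict is actually implemented): keep one inner dict per key type, so keys of different types can never overwrite each other.
class TypeSeparatedDict:
    """Illustrative only: a mapping that segregates keys by type."""
    def __init__(self):
        self._by_type = {}
    def __setitem__(self, key, value):
        # one inner dict per key type, created on demand
        self._by_type.setdefault(type(key), {})[key] = value
    def __getitem__(self, key):
        return self._by_type[type(key)][key]

d = TypeSeparatedDict()
d[1] = 'int'
d[1.0] = 'float'
print(d[1], d[1.0])  # prints: int float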
However, this would complicate the implementation. It would probably also be inefficient, since hash maps have to keep quite a few free slots in order to provide O(1) access times; if they become too full, performance decreases. Having multiple hash maps means wasting more space, and you'd also need to choose which hash map to look at before even starting the actual lookup of the key.
If you used BSTs, you'd first have to look up the type and then perform a second lookup. So if you are going to use many types, you'd end up with twice the work (and the lookup would take O(log n) instead of O(1)).
You should consider that the dict aims at storing data depending on the logical numeric value, not on how you represented it.
The difference between ints and floats is indeed just an implementation detail and not conceptual. Ideally the only number type would be an arbitrary-precision number with unbounded accuracy, even below the unit... this is however hard to implement without getting into trouble... but maybe that will be the only numeric type for Python in the future.
So, while it has different types for technical reasons, Python tries to hide these implementation details, and int->float conversion is automatic.
It would be much more surprising if in a Python program if x == 1: ... wasn't going to be taken when x is a float with value 1.
Note that also with Python 3 the value of 1/2 is 0.5 (the division of two integers) and that the types long and non-unicode string have been dropped with the same attempt to hide implementation details.
In python:
1==1.0
True
This is because of implicit casting
However:
1 is 1.0
False
I can see why automatic casting between float and int is handy. It is relatively safe to cast an int into a float, and yet there are other languages (e.g. Go) that stay away from implicit casting.
It is actually a language design decision and a matter of taste, more than a matter of different functionality.
Dictionaries are implemented with a hash table. To look up something in a hash table, you start at the position indicated by the hash value, then search different locations until you find a key value that's equal or an empty bucket.
If you have two key values that compare equal but have different hashes, you may get inconsistent results depending on whether the other key value was in the searched locations or not; this becomes more likely, for example, as the table gets full. This is something you want to avoid. It appears that the Python developers had this in mind, since the built-in hash function returns the same hash for equivalent numeric values, no matter whether those values are int or float. Note that this extends to other numeric types: False is equal to 0 and True is equal to 1. Even fractions.Fraction and decimal.Decimal uphold this property.
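For example, in a recent CPython 3 all of these hold:
>>> from fractions import Fraction
>>> from decimal import Decimal
>>> hash(True) == hash(1) == hash(1.0) == hash(Fraction(1, 1)) == hash(Decimal(1))
True
>>> {1: 'a', True: 'b', 1.0: 'c'}
{1: 'c'}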
The requirement that if a == b then hash(a) == hash(b) is documented in the definition of object.__hash__():
Called by built-in function hash() and for operations on members of hashed collections including set, frozenset, and dict. __hash__() should return an integer. The only required property is that objects which compare equal have the same hash value; it is advised to somehow mix together (e.g. using exclusive or) the hash values for the components of the object that also play a part in comparison of objects.
TL;DR: a dictionary would break if keys that compared equal did not map to the same value.
Frankly, the opposite is dangerous! 1 == 1.0, so it's not hard to imagine that if they pointed to different keys and you tried to access them based on a computed number, you'd likely run into trouble, because the ambiguity is hard to reason about.
Dynamic typing means that the value is more important than the technical type of something, since the type is malleable (which is a very useful feature), and so treating ints and floats of the same value as distinct keys is unnecessary semantics that will only lead to confusion.
I agree with others that it makes sense to treat 1 and 1.0 as the same in this context. Even if Python did treat them differently, it would probably be a bad idea to try to use 1 and 1.0 as distinct keys for a dictionary. On the other hand -- I have trouble thinking of a natural use-case for using 1.0 as an alias for 1 in the context of keys. The problem is that either the key is literal or it is computed. If it is a literal key then why not just use 1 rather than 1.0? If it is a computed key -- round off error could muck things up:
>>> d = {}
>>> d[1] = 5
>>> d[1.0]
5
>>> x = sum(0.01 for i in range(100)) #conceptually this is 1.0
>>> d[x]
Traceback (most recent call last):
File "<pyshell#12>", line 1, in <module>
d[x]
KeyError: 1.0000000000000007
So I would say that, generally speaking, the answer to your question "is this ever a useful language feature?" is "No, probably not."
I would like to have informative representations for my composite objects (i.e., objects composed of other (potentially composite) objects). However, because my code fundamentally deals with high-precision numbers (please don't ask me why I don't just use doubles), I end up with representations like you see here: http://pastebin.com/jpLgAfxC. Would it just be better to just stick with the default __repr__?
Whether to have a verbose repr depends on what you want to accomplish. For complex or composite objects, I know which I'd prefer of the following:
Point(x=1.12, y=2.2, z=-1.9)
<__main__.Point object at 0x103011890>
They both tell me what type the object is, but only the first is clear about all of the (relevant) values involved, and avoids low-level information that is only relevant on the rarest of occasions.
I like to see the real values. But, yours is a special case, given that your values are so frightfully humongous:
72401317106217603290426741268390656010621951704689382948334809645
87850348552960901165648762842931879347325584704068956434195098288
38279057775096090002410493665682226331178331461681861612403032369
73237863637784679012984303024949059416189689048527978878840119376
5152408961823197987224502419157858495179687559851
that they cannot be useful for most development or debugging purposes. I'm sure there are times you need the full serialization--to send to and from files, for example. But those have to be fairly rare, no? I can't imagine you really remember all 309 digits, or can determine whether the above number is the same as the one below on visual inspection:
72401317106217603290426741268390656010621951704689382948334809645
87850348552960901165648762842931879347325584704068956434195098288
38279057775096090002410493665682226331178331461681861612403032369
73327863637784679012984303024949059416189689048527978878840119376
5152408961823197987224502419157858495179687559851
They're not the same. But unless you're Spock or The Terminator, you wouldn't know that from a quick glance. (And actually, I've made it easier here, length-wrapping to avoid having to horizontally scroll.)
So I would recommend (massively) shortening their representation, to make the output more tractable. This is like printing out the entire chapter text every time you want to print a Chapter object. Overkill.
Instead, try something much shorter and easier to work with. Truncation and/or ellipsis are useful. e.g.
72401...59851
7240131710...
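A minimal sketch of a truncating repr along those lines (assuming the digits live in a value attribute; adjust to your own class):
def __repr__(self):
    digits = str(self.value)                  # full decimal expansion
    if len(digits) > 12:
        digits = digits[:5] + '...' + digits[-5:]
    return '{}({})'.format(type(self).__name__, digits)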
You can use the object id as well. If your high-precision type is HP, then:
HP(0x103011890)
At least then you will be able to tell them apart. One ugliness of using object ids, however, is that objects can be logically equivalent, but if you create multiple objects with the same logical value, they'd have different ids, thus appear different when they are not. You can get around that by creating your own short hash function. There's a bit of an art to hashing, but for reprs, even something simple would work. E.g.:
import binascii, struct

def shorthash(s):
    """
    Given a Python value, produce a short alphanumeric hash that
    helps identify it for debugging purposes. A riff on
    http://stackoverflow.com/a/2511059/240490
    Enhanced to remove trailing boilerplate, and to work
    on either Python 2 or Python 3.
    """
    hashbytes = binascii.b2a_base64(struct.pack('l', hash(s)))
    return hashbytes.decode('utf-8').rstrip().rstrip("=")
Then define your repr in the high-precision class:
def __repr__(self):
    clsname = self.__class__.__name__
    return '{0}({1})'.format(clsname, shorthash(self.value))
Where self.value is whatever local attribute, property, or method creates the multi-hundred-digit value. If you're subclassing int, this could be just self.
This gets you to:
HP(Tea+5MY0WwA)
The two massive, almost identical numbers above? Using this scheme, they render out to:
HP(XhkG0358Fx4)
HP(27CdIG5elhQ)
Which are obviously different. You can combine this with a bit of a value representation. E.g. a few alternatives:
HP(~7.24013e308 # XhkG0358Fx4)
HP(dig='72401...59851', ndigits=309, hash='XhkG0358Fx4')
You'll find these shorter values more useful in debugging contexts. You can, of course, keep around a method or property (e.g. .value, .digits, or .alldigits) for those cases in which you need every last bit, but define the common case as something more easily consumed.
Thank you to Demian for the pointer to https://docs.python.org/2/reference/datamodel.html#object.repr, specifically:
This is typically used for debugging, so it is important that the
representation is information-rich and unambiguous.
http://pastebin.com/jpLgAfxC is probably the best possible __repr__ in this case.
I noticed that many libraries nowadays seem to prefer the use of strings over enum-type variables for parameters.
Where people would previously use enums, e.g. dateutil.rrule.FR for a Friday, it seems that this has shifted towards using strings (e.g. 'FRI').
Same in numpy (or pandas for that matter), where searchsorted, for example, uses strings (e.g. side='left' or side='right') rather than a defined enum. For the avoidance of doubt, before Python 3.4 this could have been easily implemented as an enum like so:
class SIDE:
    RIGHT = 0
    LEFT = 1
And the advantages of enum-type variables are clear: you can't misspell them without raising an error, they offer proper support for IDEs, etc.
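To make the misspelling point concrete, here is an illustrative session using the SIDE class above:
>>> SIDE.RIGHT
0
>>> SIDE.RIHGT            # typo is caught immediately
Traceback (most recent call last):
  ...
AttributeError: type object 'SIDE' has no attribute 'RIHGT'
>>> side = 'rihgt'        # a typo in a plain string goes unnoticed until (if ever) it is validated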
So why use strings at all, instead of sticking to enum types? Doesn't this make the programs much more prone to user errors? It's not like enums create an overhead - if anything they should be slightly more efficient. So when and why did this paradigm shift happen?
I think enums are safer especially for larger systems with multiple developers.
As soon as the need arises to change the value of such an enum, looking up and replacing a string in many places is not my idea of fun :-)
The most important criterion IMHO is the usage: for use in a module or even a package a string seems to be fine; in a public API I'd prefer enums.
[update]
As of today (2019), Python has dataclasses - combined with optional type annotations and static type analyzers like mypy, I think this is a solved problem.
As for efficiency, attribute lookup is somewhat expensive in Python compared to most other languages, so I guess some libraries may still choose to avoid it for performance reasons.
[original answer]
IMHO it is a matter of taste. Some people like this style:
def searchsorted(a, v, side='left', sorter=None):
    ...
    assert side in ('left', 'right'), "Invalid side '{}'".format(side)
    ...

numpy.searchsorted(a, v, side='right')
Yes, if you call searchsorted with side='foo' you may get an AssertionError much later at runtime - but at least the bug will be pretty easy to spot by looking at the traceback.
While other people may prefer (for the advantages you highlighted):
numpy.searchsorted(a, v, side=numpy.CONSTANTS.SIDE.RIGHT)
I favor the first because I think seldom used constants are not worth the namespace cruft. You may disagree, and people may align with either side due to other concerns.
If you really care, nothing prevents you from defining your own "enums":
class SIDE(object):
    RIGHT = 'right'
    LEFT = 'left'
numpy.searchsorted(a, v, side=SIDE.RIGHT)
I don't think it is worth it, but again, it is a matter of taste.
[update]
Stefan made a fair point:
As soon as the need arises to change the value of such an enum, looking up and replacing a string in many places is not my idea of fun :-)
I can see how painful this can be in a language without named parameters - using your example, you would have to search for the string 'right' and get a lot of false positives. In Python you can narrow it down by searching for side='right'.
Of course if you are dealing with an interface that already has a defined set of enums/constants (like an external C library) then yes, by all means mimic the existing conventions.
I understand this question has already been answered, but there is one thing that has not been addressed at all: the fact that you must explicitly ask a Python Enum member for its value when you want the value it stores.
>>> class Test(Enum):
... WORD='word'
... ANOTHER='another'
...
>>> str(Test.WORD.value)
'word'
>>> str(Test.WORD)
'Test.WORD'
One simple solution to this problem is to offer an implementation of __str__()
>>> class Test(Enum):
... WORD='word'
... ANOTHER='another'
... def __str__(self):
... return self.value
...
>>> Test.WORD
<Test.WORD: 'word'>
>>> str(Test.WORD)
'word'
Yes, adding .value is not a huge deal, but it is an inconvenience nonetheless. Using regular strings requires zero extra effort, no extra classes, and no redefinition of any default class methods. Still, there must be explicit casting to a string value in many cases where a plain str would not have a problem.
I prefer strings for debugging reasons. Compare an object like
side=1, opt_type=0, order_type=6
to
side='BUY', opt_type='PUT', order_type='FILL_OR_KILL'
I also like "enums" where the values are strings:
class Side(object):
    BUY = 'BUY'
    SELL = 'SELL'
    SHORT = 'SHORT'
Strictly speaking Python does not have enums - or at least it didn't prior to v3.4
https://docs.python.org/3/library/enum.html
I prefer to think of your example as programmer defined constants.
In argparse, one set of constants has string values. While the code uses the constant names, users more often use the strings.
e.g. argparse.ZERO_OR_MORE = '*'
argparse.OPTIONAL = '?'
numpy is one of the older third-party packages (at least its roots, like Numeric, are). String values are more common than enums. In fact, I can't offhand think of any enums (as you define them) in it.
I intend to implement a library which includes lots of constants. Now, there are already many questions about constants on Stack Overflow, but I am pondering a special case which complicates things, and I am not sure which way to go. Probably someone has already solved a similar problem and can help me avoid obvious fails :).
To be more precise, I am going to implement SMPP (Short Message Peer-to-Peer) related stuff. SMPP consists of many fields which have different possible values, and which sometimes are composed of different parts. The esm_class field, for example, is an 8-bit integer, with bits 1-0, 5-2 and 7-6 being 'subfields'. So the esm_class value consists of three constants 'concatenated' together.
But the story goes on. The esm_class subfields have different meanings depending on the type of message they are being used in.
Now a few questions arose:
What's the best way, to organise the different constants?
How do I define different constants for the same field (or subfield)?
What's the best way to add a description to each constant?
For question 3 I went for tuples: CONST_NAME = (0x01, 'value meaning').
My first idea for the other problems was to organise the constants in modules. This would for example give us:
constants.esm_class which contains all esm_class values
constants.data_coding which contains all possible encodings
etc (for each field a module in constants)
For data_coding this works well. But esm_class actually consists of 'subfields'. So I thought about:
constants.esm_class.types for the message type subfield
constants.esm_class.modes for the message modes
constants.esm_class.features for the features field
This is already quite long, resulting in constants.esm_class.features.UDHI_AND_REPLY_PATH, for example. Not very nice, somehow. And that's not yet all I need... actually I even need to split constants.esm_class.types apart, because the type values have different meanings in different contexts (basically incoming messages vs. outgoing short messages). So this would result in yet another submodule if I followed the schema: constants.esm_class.types.outgoing and constants.esm_class.types.incoming.
Additionally, I also need a way to concatenate types, modes and features to create the esm_class value and I also need to split esm_class apart for parsing. So there needs to be a generic way to build field values from constants and also to detect that a certain value is illegal (i.e. has a value not defined by a constant).
My question: what do you think is the best way to go? Should I go for classes? As I've read, best practice in Python is to define constants at module level. I also see some issues with using classes. So perhaps a new data type for each possible field would do it? I'm stuck and looking forward to your ideas!
Edit: I intend to use the lib with Python 3.3+ and probably I want to make it compatible with 2.7, so Enums from Python 3.4 unfortunately are not a solution.
Edit 2: Probably I packed too much information into too little text. So I will give an example.
esm_class in SMPP is kind of a derived value, consisting of a type, a mode and a feature. If esm_class, for example, has the value 00111100, it means that the feature equals 00, the mode 1111 and the type again 00. That's what I refer to as a "subfield". The constants will be put together with bitwise operations.
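Expressed directly in code, that decomposition and re-composition look like this (bit layout as described above: bits 7-6 feature, 5-2 mode, 1-0 type):
value = 0b00111100
feature = (value >> 6) & 0b11     # -> 0b00
mode = (value >> 2) & 0b1111      # -> 0b1111
type_ = value & 0b11              # -> 0b00
assert (feature << 6) | (mode << 2) | type_ == value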
I will make an attempt at giving an answer (I hope I've understood you correctly). Elaborating on my comments above, I would go for something like:
class esm_class:
    BITMASK_FEATURE = 0b11000000
    BITMASK_MODE = 0b00111100
    BITMASK_TYPE = 0b00000011

    class Features:
        FEATURE_1 = 0b00
        FEATURE_2 = 0b01
        FEATURE_3 = 0b10
        FEATURE_4 = 0b11

    class Modes:
        MODE_1 = 0b0000
        # …

    class Types:
        TYPE_1 = 0b00
        TYPE_2 = 0b01
        # …

    def __init__(self, type_, mode, feature):
        self.type = type_
        self.mode = mode
        self.feature = feature

    def __bytes__(self):
        '''
        Use this to write an instance of esm_class to a file-like
        object or stream, e.g.:
        mysocket.send(bytes(my_esm_class_obj))
        '''
        return ((self.feature << 6) | (self.mode << 2) |
                self.type).to_bytes(1, byteorder='big')

    @classmethod
    def parse(cls, filelike):
        '''
        Use this to parse incoming data from a file-like object
        and create an instance of esm_class accordingly, e.g.:
        my_esm_class_obj = esm_class.parse(stream)
        '''
        data = filelike.read(1)[0]  # single byte, as an int
        feature = (data & cls.BITMASK_FEATURE) >> 6
        mode = (data & cls.BITMASK_MODE) >> 2
        type_ = data & cls.BITMASK_TYPE
        return cls(type_, mode, feature)
Now, the code can certainly be optimized but the point is: I tried to abstract away the binary representation as soon as possible, in order to arrive at a class hierarchy representing the structures appearing in the protocol. Accordingly, I didn't simply put the constants / the values in different modules but organized them in such a way that they appear together with the structure they are part of – which is also the structure you're going to use in the rest of your code.
Further things to note:
You can certainly omit the classes Features, Modes and Types if you like and put all values in the scope of esm_class. However, I would do this only if the symbols' names FEATURE_1, MODE_2 and so on make it clear which part of the bitstring they correspond to (i.e. the feature part or the mode part or…).
I would usually avoid Python 3.4's enums because IMO they are tedious to deal with when you need to access a symbol's value (i.e. its value attribute), but that's also because I don't need the symbols' names themselves. Your mileage may vary.
Finally, in the comments above, chepner made a very good point regarding whether to store descriptions in your code at all. I recommend you follow his advice.