Semantic Type Safety in Python

In a recent project I have the problem that some values are often misinterpreted. For instance, I calculate a wave as a sum of two waves (for which I need two amplitudes and two phase shifts), and then sample it at 4 points. I pass these tuples of four values to different functions, but sometimes I make the mistake of passing wave parameters instead of sample points.
These errors are hard to find, because all the calculations run without raising any error, but the values are meaningless in this context, so the results are simply wrong.
What I want now is some kind of semantic type. I want to state that one function returns sample points and another function expects sample points, so that anything conflicting with these declarations immediately raises an error.
Is there any way to do this in Python?

I would recommend implementing specific data types so that you can distinguish between different kinds of information that share the same structure.
For example, you can simply subclass list and then do some type checking at runtime within your functions:
class WaveParameter(list):
    pass
class Point(list):
    pass
# you can use them just like lists
point = Point([1, 2, 3, 4])
wp = WaveParameter([5, 6])
# of course all methods from list are inherited
wp.append(7)
wp.append(8)
# let's check them
print(point)
print(wp)
# type checking examples
print(isinstance(point, Point))          # True
print(isinstance(wp, Point))             # False
print(isinstance(point, WaveParameter))  # False
print(isinstance(wp, WaveParameter))     # True
So you can include this kind of type checking in your functions, to make sure the correct kind of data was passed to them:
import logging
log = logging.getLogger(__name__)
def example_function_with_waveparameter(data):
    if not isinstance(data, WaveParameter):
        log.error("received wrong parameter type (%s instead of WaveParameter)",
                  type(data))
    # and then do the stuff
or simply assert:
def example_function_with_waveparameter(data):
    assert isinstance(data, WaveParameter)
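As a variation on the checks above, here is a minimal sketch that raises a TypeError instead of logging or asserting, so the wrong argument stops the computation even when Python runs with -O (which strips assert statements). The names SamplePoints and sample_mean are hypothetical and only serve the illustration.

```python
class WaveParameter(list):
    """Amplitudes and phase shifts of the component waves."""


class SamplePoints(list):
    """Values of the summed wave at the sample positions."""


def sample_mean(points):
    """Hypothetical consumer that only makes sense for sample points."""
    if not isinstance(points, SamplePoints):
        raise TypeError(
            "expected SamplePoints, got %s" % type(points).__name__
        )
    return sum(points) / len(points)


print(sample_mean(SamplePoints([1.0, 2.0, 3.0, 4.0])))

try:
    # Same shape (four floats), wrong meaning: rejected at the call site.
    sample_mean(WaveParameter([1.0, 0.5, 0.0, 1.57]))
except TypeError as exc:
    print("rejected:", exc)
```

The error now appears where the wrong value is passed in, not somewhere downstream in the results.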

Python's notion of a "semantic type" is called a class, but as mentioned, Python is dynamically typed, so even if you use custom classes instead of tuples you won't get any compile-time error. At best you'll get runtime errors, if your classes are designed in such a way that trying to use one instead of the other fails.
Now, classes are not just about data, they are about behaviour too. If you have functions that do waveform-specific computations, these functions would probably become methods of the Waveform class, and likewise for the Point part. This might be enough to avoid logical errors like passing a "waveform" tuple to a function expecting a "point" tuple.
To make a long story short: if you want a statically typed functional language, Python is not the right tool (Haskell might be a better choice). If you really want or have to use Python, try using classes and methods instead of tuples and functions. It still won't detect type errors at compile time, but chances are you'll have fewer type errors, and those type errors will be detected at runtime instead of silently producing wrong results.
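To illustrate the "classes and methods instead of tuples and functions" suggestion, here is a rough sketch. The class names Wave and Samples, their fields, and the peak() method are assumptions made up for this example; they are not taken from the question.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Wave:
    """Sum of two sine waves; all field names are illustrative."""
    amp1: float
    phase1: float
    amp2: float
    phase2: float

    def value_at(self, t):
        return (self.amp1 * math.sin(t + self.phase1)
                + self.amp2 * math.sin(t + self.phase2))

    def sample(self, times):
        # Returns a Samples object, never a bare tuple.
        return Samples(tuple(self.value_at(t) for t in times))


@dataclass(frozen=True)
class Samples:
    values: tuple

    def peak(self):
        return max(abs(v) for v in self.values)


wave = Wave(1.0, 0.0, 0.5, math.pi / 2)
samples = wave.sample([0.0, 0.5, 1.0, 1.5])
print(samples.peak())

try:
    wave.peak()          # a Wave is not a set of samples
except AttributeError as exc:
    print("caught:", exc)
```

Because peak() only exists on Samples, mixing up the two kinds of value fails loudly at runtime instead of producing a meaningless number.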

Related

How to check the type of an operation in a statement?

I want to be able to check whether the type of a return value matches the declared type of its method in ANTLR (i.e. int processOperation() should return an int, as in return (3-1*4)).
My grammar is the following: https://github.com/RodrigoZea/Lab00DDC/blob/fda787998e5ed1cc5e5d94e6506ed6ca08dbd955/Decaf/Decaf.g4
I'm using the Python implementation of ANTLR4, but I'm unsure how to check the type of an operation in a return statement; for example, (1+3*4) should return an int. I'm using a Listener, so my logic is as follows:
First, check whether the value is a primitive (i.e. return "random", return 1).
Then check whether the value is an operation or a single variable.
For a single variable, looking it up in the symbol table would be enough, but for an operation I'm unsure how to approach it. I've read about using a ParseTreeProperty<>, but I don't think the Python version of ANTLR4 implements it. From what I've read in the ANTLR4 definitive reference, that seemed to be the best approach, since it stores the data type of each node (and of the operation subtree), so I can easily check the type and compare it to my method's type. I'm guessing I would need to check when I'm entering an operator rule, but I'm unsure what to do with that data, or whether there's a way to implement a ParseTreeProperty in Python. Thanks.

So, basing myself on what Mike and kaby answered, I came up with a solution. It's incredibly simple but very functional.
The best way to replicate a ParseTreeProperty in Python is to create a dictionary: the ctx object is the key, and the value is set manually, depending on what you want it to be (this is where Mike's answer comes in handy). You update the dictionary values in the *Exit() methods, just as Mike said as well.
For example, if you're exiting an int literal, char literal, or whatever (you can take my grammar as reference) you can add an entry to your dictionary as follows:
def exitType_literal(self, ctx: DecafParser.Type_literalContext):
    self.nodeTypes[ctx] = 'type'
So for example, if I wanted to save a node as an int value, I would do something like...
def exitInt_literal(self, ctx: DecafParser.Int_literalContext):
    self.parseTreePropertyDictionary[ctx] = 'int'
If you want to get the type of a variable, however, you have to look it up in your implementation of a symbol table. That was my approach to getting the type value.
So once every node is set up, you can decide how you want to process your operations. For example, if you want a "+" operator to work on ints, you check the types of the first and second operands in your dictionary, check whether both are ints, and if that's the case, save the node where you are processing your "+" operator in your dictionary as an 'int' type node.
Then, to get the type of the operation, you will simply access your dictionary on that node and it will return 'int' or whatever the type you set it up to be.
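The dictionary approach can be sketched without ANTLR at all. In the example below, Node, TypeListener, and walk are toy stand-ins (assumptions for illustration, not ANTLR classes); in real code the keys would be the ctx objects passed to the generated exit methods.

```python
class Node:
    """Stand-in for a parse-tree context; not part of ANTLR."""
    def __init__(self, kind, text=None, children=()):
        self.kind = kind          # 'int', 'string' or 'add'
        self.text = text
        self.children = list(children)


class TypeListener:
    def __init__(self):
        self.node_types = {}      # node object -> type name

    def exit_node(self, node):
        if node.kind in ('int', 'string'):
            self.node_types[node] = node.kind
        elif node.kind == 'add':
            # Children were exited first, so their types are already stored.
            left, right = (self.node_types[c] for c in node.children)
            both_int = left == 'int' and right == 'int'
            self.node_types[node] = 'int' if both_int else 'error'


def walk(listener, node):
    """Post-order walk: children are exited before their parent."""
    for child in node.children:
        walk(listener, child)
    listener.exit_node(node)


tree = Node('add', children=[Node('int', '1'), Node('int', '3')])
listener = TypeListener()
walk(listener, tree)
print(listener.node_types[tree])
```

Plain objects are hashable by identity in Python, which is why a node can be used directly as a dictionary key here.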

ParseTreeProperty is a convenience for "attaching" properties to nodes of your parse tree, and could be a useful way to keep track of the type of each node in your tree. However, as the comments mention, there are other data structures you can use to track the type of each node and map back to it. (Note: if you use this approach with a listener, as your question implies, you'd need to implement it in the *Exit() methods, because you want all the children to have been "listened to" and their types assigned before you determine the type of the parent expression.)
Using a listener, you can also just keep a stack of types. When you exit each expression, it pops the types of all of its children, evaluates the expression type for itself, and pushes that type on the stack. You, of course, have to take care to manage the pushing and popping properly (look out for exceptions), but it can be a reasonably clean implementation.
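A rough sketch of the stack idea, again on a toy tree in place of a real ANTLR parse tree. Node, StackTypeListener, walk, and type_of are hypothetical names, and the rule "every operand must be an int" is just an assumed policy.

```python
class Node:
    """Toy stand-in for a parse-tree node; not an ANTLR class."""
    def __init__(self, kind, children=()):
        self.kind = kind                  # 'int', 'string' or 'mul'
        self.children = list(children)


class StackTypeListener:
    def __init__(self):
        self.stack = []

    def exit_node(self, node):
        if not node.children:
            self.stack.append(node.kind)  # a literal pushes its own type
            return
        # Pop one type per child; they come off in reverse order.
        operand_types = [self.stack.pop() for _ in node.children][::-1]
        ok = all(t == 'int' for t in operand_types)
        self.stack.append('int' if ok else 'error')


def walk(listener, node):
    for child in node.children:
        walk(listener, child)
    listener.exit_node(node)


def type_of(tree):
    listener = StackTypeListener()
    walk(listener, tree)
    assert len(listener.stack) == 1       # exactly one result must remain
    return listener.stack.pop()


print(type_of(Node('mul', [Node('int'), Node('int')])))
print(type_of(Node('mul', [Node('int'), Node('string')])))
```

The final length check is the kind of bookkeeping guard the paragraph above warns about: a mismatched push or pop shows up immediately.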
You could also implement an expression type validation visitor. With this approach, you write an expression visitor that returns its type. In each overridden visit*() method you can just call visit() on each child to get its type, and then decide what you want the resulting type to be (and probably whether it's even a valid expression). Notice that, with visitors, visiting a node returns a result; this is one of the key differences between visitors and listeners (the other being that, with visitors, you have to explicitly choose how to navigate your child nodes).
So far as "what to do with this data", at this point you're making design decisions about how you want your language to behave, what's valid, etc.
For example:
7 * "string"
Maybe you decide 7 is an Int type and "string" is a String type. In your listener/visitor for multiplication expressions, it's up to you to decide whether this is an error (and the resulting "type" is InvalidType, perhaps), or maybe, as in Python, it's a cute way of getting "stringstringstringstringstringstringstring", in which case you'd return a type of String. For functions, you have decisions to make about the return type of the function. Do you require return types to be explicitly defined? Must functions be defined before they're referenced? (If not, you'll need to make a pass over your parse tree to create a symbol table of functions and return types to reference, before you can navigate your tree evaluating expression types.) Maybe you have a dynamic language where different input types (or even values) might result in different return types from your function.
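The multiplication decision can be sketched with a visitor whose visit methods return a type name. Everything here is hypothetical (Node, TypeVisitor, and the allow_string_repeat switch are invented for illustration); a generated ANTLR visitor would dispatch on context classes, not on a kind string.

```python
class Node:
    """Toy expression node; a stand-in for a generated context class."""
    def __init__(self, kind, children=()):
        self.kind = kind                  # 'Int', 'String' or 'mul'
        self.children = list(children)


class TypeVisitor:
    def __init__(self, allow_string_repeat):
        # The language-design decision, expressed as a switch.
        self.allow_string_repeat = allow_string_repeat

    def visit(self, node):
        # Dispatch on the node kind and return whatever the handler returns.
        return getattr(self, 'visit_' + node.kind)(node)

    def visit_Int(self, node):
        return 'Int'

    def visit_String(self, node):
        return 'String'

    def visit_mul(self, node):
        left, right = (self.visit(child) for child in node.children)
        if left == right == 'Int':
            return 'Int'
        if self.allow_string_repeat and {left, right} == {'Int', 'String'}:
            return 'String'
        return 'InvalidType'


expr = Node('mul', [Node('Int'), Node('String')])     # 7 * "string"
print(TypeVisitor(allow_string_repeat=False).visit(expr))
print(TypeVisitor(allow_string_repeat=True).visit(expr))
```

The same tree gets a different type depending on the policy, which is the point: the grammar is identical and only the semantic rule changes.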
Clearly, this gets pretty deep into language design choices, and languages have made many different decisions about how to handle them. ANTLR is just your parsing technology and (other than providing convenience classes like listeners and visitors) has nothing to say about how you make these decisions or how you implement them. And there's no way to codify them in your grammar, as they are semantic concerns that have no impact on parsing or the construction of your parse tree.

Can you create an abstract data type that is not in a class?

This is just a question that is a curiosity as I was reviewing OOP. Can you have an ADT that is not in a class? So it'd all be separate functions. The language (it shouldn't matter, but in case it does) that I'm thinking in is Python 3.

No. A data type (in Python at least) is by definition a class. In C, you have to simulate object-orientedness by having individual functions, but there still has to be a struct to hold the data. Otherwise, there's no "data type".

Within the Python language, a data type is a class (with certain properties), so the trivial answer is no. In particular, one major characteristic that differentiates class (or data type) functionality from simple function calls, is that the defined data operations work seamlessly on the data type, or with trivial syntax, rather than having to specify every operation and operand in an explicit call.
Consider the statements:
# Fully functional, implicit data type operation
z = x + y
# Explicit data type operation, still within the class
z = x.add(y)
# Function call
z = add(x, y)
In the third instance, you have none of the built-in protections or encapsulations that come with a class. You can have a set of functions that just happen to coordinate to give you the desired results, but this is not an abstract data type.
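To make the three call styles concrete, here is a small sketch. The Money class is an invented example, not something from the question; it shows how defining __add__ is what lets z = x + y work on a user-defined type.

```python
class Money:
    """Tiny illustrative type; the name and fields are made up."""
    def __init__(self, cents):
        self.cents = cents

    def add(self, other):
        return Money(self.cents + other.cents)

    def __add__(self, other):
        # Refuse operands of other types instead of guessing.
        if not isinstance(other, Money):
            return NotImplemented
        return self.add(other)

    def __eq__(self, other):
        return isinstance(other, Money) and self.cents == other.cents

    def __repr__(self):
        return 'Money(cents=%d)' % self.cents


def add(x, y):
    """Free function: nothing ties it to Money except convention."""
    return Money(x.cents + y.cents)


x, y = Money(150), Money(275)
print(x + y, x.add(y), add(x, y))
```

All three calls give the same value here, but only the class-based versions can reject a wrong operand type on their own terms.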

__repr__ for (large) composite objects

I would like to have informative representations for my composite objects (i.e., objects composed of other (potentially composite) objects). However, because my code fundamentally deals with high-precision numbers (please don't ask me why I don't just use doubles), I end up with representations like the one you see here: http://pastebin.com/jpLgAfxC. Would it be better to just stick with the default __repr__?

Whether to have a verbose repr depends on what you want to accomplish. For complex or composite objects, I know which I'd prefer of the following:
Point(x=1.12, y=2.2, z=-1.9)
<__main__.Point object at 0x103011890>
They both tell me what type the object is, but only the first is clear about all of the (relevant) values involved, and avoids low-level information that is only relevant on the rarest of occasions.
I like to see the real values. But, yours is a special case, given that your values are so frightfully humongous:
72401317106217603290426741268390656010621951704689382948334809645
87850348552960901165648762842931879347325584704068956434195098288
38279057775096090002410493665682226331178331461681861612403032369
73237863637784679012984303024949059416189689048527978878840119376
5152408961823197987224502419157858495179687559851
Numbers that long cannot be useful for most development or debugging purposes. I'm sure there are times you need the full serialization, to send to and from files, for example. But those have to be fairly rare, no? I can't imagine you really remember all 309 digits, or can determine on visual inspection whether the above number is the same as the one below:
72401317106217603290426741268390656010621951704689382948334809645
87850348552960901165648762842931879347325584704068956434195098288
38279057775096090002410493665682226331178331461681861612403032369
73327863637784679012984303024949059416189689048527978878840119376
5152408961823197987224502419157858495179687559851
They're not the same. But unless you're Spock or The Terminator, you wouldn't know that from a quick glance. (And actually, I've made it easier here, length-wrapping to avoid having to horizontally scroll.)
So I would recommend (massively) shortening their representation, to make the output more tractable. This is like printing out the entire chapter text every time you want to print a Chapter object. Overkill.
Instead, try something much shorter and easier to work with. Truncation and/or ellipsis are useful. e.g.
72401...59851
7240131710...
You can use the object id as well. If your high-precision type is HP, then:
HP(0x103011890)
At least then you will be able to tell them apart. One ugliness of using object ids, however, is that objects can be logically equivalent, but if you create multiple objects with the same logical value, they'd have different ids, thus appear different when they are not. You can get around that by creating your own short hash function. There's a bit of an art to hashing, but for reprs, even something simple would work. E.g.:
import binascii, struct
def shorthash(s):
    """
    Given a Python value, produce a short alphanumeric hash that
    helps identify it for debugging purposes. A riff on
    http://stackoverflow.com/a/2511059/240490
    Enhanced to remove trailing boilerplate, and to work
    on either Python 2 or Python 3.
    """
    hashbytes = binascii.b2a_base64(struct.pack('l', hash(s)))
    return hashbytes.decode('utf-8').rstrip().rstrip("=")
Then define your repr in the high-precision class:
def __repr__(self):
    clsname = self.__class__.__name__
    return '{0}({1})'.format(clsname, shorthash(self.value))
Where self.value is whatever local attribute, property, or method creates the multi-hundred-digit value. If you're subclassing int, this could be just self.
This gets you to:
HP(Tea+5MY0WwA)
The two massive, almost identical numbers above? Using this scheme, they render out to:
HP(XhkG0358Fx4)
HP(27CdIG5elhQ)
Which are obviously different. You can combine this with a bit of a value representation. E.g. a few alternatives:
HP(~7.24013e308 # XhkG0358Fx4)
HP(dig='72401...59851', ndigits=309, hash='XhkG0358Fx4')
You'll find these shorter values more useful in debugging contexts. You can, of course, keep around a method or property (e.g. .value, .digits, or .alldigits) for those cases in which you need every last bit, but define the common case as something more easily consumed.
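For the truncation variant, a minimal sketch might look like the following. It assumes the number is stored as a plain int in a .value attribute of a wrapper class called HP; both are assumptions for the example, as is the 12-digit cut-off.

```python
class HP:
    """Hypothetical wrapper around a very large integer."""
    def __init__(self, value):
        self.value = value

    def __repr__(self):
        digits = str(abs(self.value))
        sign = '-' if self.value < 0 else ''
        if len(digits) <= 12:
            shown = digits                # short enough to show in full
        else:
            shown = digits[:5] + '...' + digits[-5:]
        return "{0}(dig='{1}{2}', ndigits={3})".format(
            type(self).__name__, sign, shown, len(digits))


print(repr(HP(42)))
print(repr(HP(12345678901234567890)))
```

The full value stays available through .value, so nothing is lost; only the debugging view is shortened.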

Thank you to Demian for the pointer to https://docs.python.org/2/reference/datamodel.html#object.__repr__, specifically:
This is typically used for debugging, so it is important that the
representation is information-rich and unambiguous.
http://pastebin.com/jpLgAfxC is probably the best possible __repr__ in this case.

How much input validation should I be doing on my python functions/methods?

I'm interested in how much up front validation people do in the Python they write.
Here are a few examples of simple functions:
def factorial(num):
    """Computes the factorial of num."""
def isPalindrome(inputStr):
    """Tests to see if inputStr is the same backwards and forwards."""
def sum(nums):
    """Same as the built-in sum()... computes the sum of all the numbers passed in."""
How thoroughly do you check the input values before beginning computation, and how do you do your checking? Do you throw some kind of proprietary exception if input is faulty (BadInputException defined in the same module, for example)? Do you just start your calculation and figure it will throw an exception at some point if bad data was passed in ("asd" to factorial, for example)?
When the passed in value is supposed to be a container do you check not only the container but all the values inside it?
What about situations like factorial, where what's passed in might be convertible to an int (e.g. a float) but you might lose precision when doing so?

I assert what's absolutely essential.
Important: What's absolutely essential. Some people over-test things.
def factorial(num):
    assert int(num)
    assert num > 0
Isn't completely correct. long is also a legal possibility.
def factorial(num):
    assert type(num) in ( int, long )
    assert num > 0
Is better, but still not perfect. Many Python types (like rational numbers, or number-like objects) can also work in a good factorial function. It's hard to assert that an object has basic integer-like properties without being too specific and eliminating future unthought-of classes from consideration.
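One way to test for "integer-like" without naming concrete types is the numbers.Integral abstract base class from the standard library. The sketch below is only one possible policy: it raises exceptions where the snippets above assert, and it accepts 0 (since 0! = 1), which those snippets would reject.

```python
import numbers


def factorial(num):
    """Computes the factorial of a non-negative integer-like num."""
    if not isinstance(num, numbers.Integral):
        raise TypeError("factorial() needs an integer-like value")
    if num < 0:
        raise ValueError("factorial() is undefined for negative values")
    result = 1
    for factor in range(2, num + 1):
        result *= factor
    return result


print(factorial(5))
```

Any class registered as numbers.Integral passes the check, so future integer-like types are not excluded. Note that bool also counts as Integral.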
I never define unique exceptions for individual functions. I define a unique exception for a significant module or package. Usually, however, just an Error class or something similar. That way the application says except somelibrary.Error,e: which is about all you need to know. Fine-grained exceptions get fussy and silly.
I've never done this, but I can see places where it might be necessary.
assert all( type(i) in (int,long) for i in someList )
Generally, however, the ordinary Python built-in type checks work fine. They find almost all of the exceptional situations that matter almost all the time. When something isn't the right type, Python raises a TypeError that always points at the right line of code.
BTW. I only add asserts at design time if I'm absolutely certain the function will be abused. I sometimes add assertions later when I have a unit test that fails in an obscure way.

For calculations like sum, factorial, etc., Python's built-in type checks will do fine. The calculations will end up calling __add__, __mul__, etc. on the types, and if those break, they will throw the correct exception anyway. By enforcing your own checks, you may reject input that would otherwise work.

I try to write a docstring stating what type of parameter is expected and accepted, and I don't check it explicitly in my functions.
If someone wants to use my function with any other type, it's their responsibility to check whether their type emulates one I accept well enough. Maybe your factorial can be used with some custom long-like type to obtain something you wouldn't think of? Or maybe your sum can be used to concatenate strings? Why should you disallow it by type checking? It's not C, anyway.
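A small sketch of that idea: a summing function with no type checks works for any items that support +. The name total is made up here, to avoid shadowing the built-in sum (which itself refuses to add strings).

```python
from functools import reduce
import operator


def total(items):
    """Adds up the items using whatever + means for their type."""
    return reduce(operator.add, items)


print(total([1, 2, 3]))
print(total(['ab', 'cd']))
print(total([[1], [2, 3]]))
```

Mixing incompatible items, such as total([1, 'a']), still fails with Python's own TypeError, so nothing is lost by skipping the explicit check. One caveat of this particular sketch is that reduce raises a TypeError on an empty sequence.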

I basically try to convert the variable to what it should be and pass up or throw the appropriate exception if that fails.
def factorial(num):
    """Computes the factorial of num."""
    try:
        num = int(num)
    except ValueError as e:
        print(e)
    else:
        ...

It rather depends on what I'm writing, and how the output gets there. Python doesn't have the public/private protections of other OO-languages. Instead there are conventions. For example, external code should only call object methods that are not prefixed by an underscore.
Therefore, if I'm writing a module, I'd validate anything that is not generated from my own code, i.e. any calls to publicly-accessible methods/functions. Sometimes, if I know the validation is expensive, I make it togglable with a kwarg:
def publicly_accessible_function(arg1, validate=False):
    if validate:
        do_validation(arg1)
    do_work
Internal methods can do validation via the assert statement, which can be disabled altogether when the code goes out of development and into production.

I almost never enforce any kind of a check, unless I think there's a possibility that someone might think they can pass some X which would produce completely crazy results.
The other time I check is when I accept several types for an argument, for example a function that takes a list, might accept an arbitrary object and just wrap it in a list (if it's not already a list). So in that case I check for the type -not to enforce anything- just because I want the function to be flexible in how it's used.
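The "wrap it in a list" case might look like this. The helper names as_list and describe are hypothetical.

```python
def as_list(value):
    """Returns value unchanged if it is a list, else wraps it in one."""
    if isinstance(value, list):
        return value
    return [value]


def describe(items):
    # Accepts either a list or a single object.
    items = as_list(items)
    return ', '.join(str(item) for item in items)


print(describe([1, 2, 3]))
print(describe(7))
```

Here the isinstance check widens what the function accepts; it does not reject anything.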

Only bother to check if you have a failing unit-test that forces you to.
Also consider "EAFP"... It's the Python way!
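EAFP ("easier to ask forgiveness than permission") means attempting the operation and handling the failure, rather than validating up front. A minimal sketch, using a hypothetical mean function:

```python
def mean(nums):
    """Arithmetic mean; relies on sum() and len() to reject bad input."""
    try:
        return sum(nums) / len(nums)
    except ZeroDivisionError:
        # Translate the one failure whose message would be misleading.
        raise ValueError("mean() needs at least one value") from None


print(mean([1, 2, 3]))
```

Passing a string or a bare number already raises a TypeError from sum() itself, so only the empty-input case needs explicit handling.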

A bit of perspective on how another language handles it might add some value. For Perl, I remember using this module - http://search.cpan.org/dist/Params-Validate/ which offloads a lot of parameter validation from the developer. I was searching for something similar in python and came across this: http://www.voidspace.org.uk/python/validate.html I haven't tried it out. But I guess aiming for a standard way of validating params across the entire codebase leads to upfront setting of parameter validation expectations across the entire team.

Why should functions always return the same type?

I read somewhere that functions should always return only one type,
so the following code is considered bad code:
def x(foo):
    if 'bar' in foo:
        return (foo, 'bar')
    return None
I guess the better solution would be
def x(foo):
    if 'bar' in foo:
        return (foo, 'bar')
    return ()
Wouldn't it be cheaper memory-wise to return None than to create a new empty tuple, or is this difference too small to notice even in larger projects?

Why should functions return values of a consistent type? To meet the following two rules.
Rule 1 -- a function has a "type" -- inputs mapped to outputs. It must return a consistent type of result, or it isn't a function. It's a mess.
Mathematically, we say some function, F, is a mapping from domain, D, to range, R. F: D -> R. The domain and range form the "type" of the function. The input types and the result type are as essential to the definition of the function as is the name or the body.
Rule 2 -- when you have a "problem" or can't return a proper result, raise an exception.
def x(foo):
    if 'bar' in foo:
        return (foo, 'bar')
    raise Exception( "oh, dear me." )
You can break the above rules, but the cost of long-term maintainability and comprehensibility is astronomical.
"Wouldn't it be cheaper memory wise to return a None?" Wrong question.
The point is not to optimize memory at the cost of clear, readable, obvious code.

It's not so clear that a function must always return objects of a limited type, or that returning None is wrong. For instance, re.search can return a _sre.SRE_Match object or a NoneType object:
import re
match=re.search('a','a')
type(match)
# <type '_sre.SRE_Match'>
match=re.search('a','b')
type(match)
# <type 'NoneType'>
Designed this way, you can test for a match with the idiom
if match:
    # do xyz
If the developers had required re.search to return a _sre.SRE_Match object, then
the idiom would have to change to
if match.group(1) is None:
    # do xyz
There would not be any major gain by requiring re.search to always return a _sre.SRE_Match object.
So I think how you design the function must depend on the situation and in particular, how you plan to use the function.
Also note that both _sre.SRE_Match and NoneType are instances of object, so in a broad sense they are of the same type. So the rule that "functions should always return only one type" is rather meaningless.
Having said that, there is a beautiful simplicity to functions that return objects which all share the same properties. (Duck typing, not static typing, is the Python way!) It can allow you to chain functions together, as in foo(bar(baz)), and know with certainty the type of object you'll receive at the other end.
This can help you check the correctness of your code. By requiring that a function returns only objects of a certain limited type, there are fewer cases to check. "foo always returns an integer, so as long as an integer is expected everywhere I use foo, I'm golden..."

Best practice in what a function should return varies greatly from language to language, and even between different Python projects.
For Python in general, I agree with the premise that returning None is bad if your function generally returns an iterable, because iterating without testing becomes impossible. Just return an empty iterable in this case. It will still test False if you use Python's standard truth testing:
ret_val = x()
if ret_val:
    do_stuff(ret_val)
and still allow you to iterate over it without testing:
for child in x():
    do_other_stuff(child)
For functions that are likely to return a single value, I think returning None is perfectly acceptable; just document in your docstring that this might happen.
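As a runnable illustration of the empty-iterable convention, here is a sketch with a hypothetical children_of function over a dictionary that stands in for some tree structure:

```python
def children_of(tree, name):
    """Returns the children of name, or an empty list if there are none."""
    return tree.get(name, [])


tree = {'root': ['a', 'b'], 'a': ['c']}

for child in children_of(tree, 'root'):
    print(child)

for child in children_of(tree, 'c'):      # no children: loop body is skipped
    print(child)

if not children_of(tree, 'c'):
    print('c is a leaf')
```

Both the for loop and the truth test work on the "nothing found" result without any special-casing.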

Here are my thoughts on all that and I'll try to also explain why I think that the accepted answer is mostly incorrect.
First of all programming functions != mathematical functions. The closest you can get to mathematical functions is if you do functional programming but even then there are plenty of examples that say otherwise.
Functions do not have to have input
Functions do not have to have output
Functions do not have to map input to output (because of the previous two bullet points)
In programming terms, a function can be viewed simply as a block of memory with a start (the function's entry point), a body (empty or otherwise) and an exit point (one or several, depending on the implementation), all of which exist for the purpose of reusing code that you've written. Even if you don't see it, a function always "returns" something: the address of the next statement right after the function call. You will see this in all of its glory if you do some really low-level programming in an assembly language (I dare you to go the extra mile and write some machine code by hand, like Linus Torvalds, who every so often mentions this during his seminars and interviews :D). In addition, a function can also take some input and spit out some output. That is why
def foo():
    pass
is a perfectly correct piece of code.
So why would returning multiple types be bad? Well... it isn't at all, unless you abuse it. Abuse is of course a matter of poor programming skills and/or not knowing what the language you're using can do.
"Wouldn't it be cheaper memory wise to return a None than to create a new empty tuple or is this time difference too small to notice even in larger projects?"
As far as I know, yes: returning a NoneType object would be much cheaper memory-wise. Here is a small experiment (returned values are in bytes):
>>> import sys
>>> sys.getsizeof(None)
16
>>> sys.getsizeof(())
48
Based on the type of object you are using as your return value (numeric type, list, dictionary, tuple etc.) Python manages the memory in different ways including the initially reserved storage.
However you have to also consider the code that is around the function call and how it handles whatever your function returns. Do you check for NoneType? Or do you simply check if the returned tuple has length of 0? This propagation of the returned value and its type (NoneType vs. empty tuple in your case) might actually be more tedious to handle and blow up in your face. Don't forget - the code itself is loaded into memory so if handling the NoneType requires too much code (even small pieces of code but in a large quantity) better leave the empty tuple, which will also avoid confusion in the minds of people using your function and forgetting that it actually returns 2 types of values.
Speaking of returning multiple types of value, this is the part where I agree with the accepted answer (but only partially): returning a single type makes the code more maintainable, without a doubt. It's much easier to check only for type A than for A, B, C, etc.
However, Python is an object-oriented language, and as such inheritance, abstract classes, and everything else that is part of the whole OOP shenanigans come into play. It can go as far as generating classes on the fly, which I discovered a few months ago and which stunned me (I had never seen that stuff in C/C++).
Side note: You can read a little bit about metaclasses and dynamic classes in this nice overview article with plenty of examples.
There are in fact multiple design patterns and techniques that wouldn't even exist without so-called polymorphic functions. Below I give you two very popular topics (I can't find a better way to summarize both in a single term):
Duck typing - often part of the dynamic typing languages which Python is a representative of
Factory method design pattern - basically it's a function that returns various objects based on the input it receives.
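Both points can be seen in one short sketch: a factory function that returns objects of different classes, which callers then use through a shared method. Circle, Square, and make_shape are invented names for illustration.

```python
import math


class Circle:
    def __init__(self, size):
        self.radius = size

    def area(self):
        return math.pi * self.radius ** 2


class Square:
    def __init__(self, size):
        self.side = size

    def area(self):
        return self.side ** 2


def make_shape(kind, size):
    """Factory: the returned object's class depends on the input."""
    shapes = {'circle': Circle, 'square': Square}
    try:
        cls = shapes[kind]
    except KeyError:
        raise ValueError('unknown shape: %r' % kind) from None
    return cls(size)


for kind in ('circle', 'square'):
    shape = make_shape(kind, 2)
    # Duck typing: only the presence of area() matters to the caller.
    print(type(shape).__name__, shape.area())
```

make_shape returns more than one type, yet callers stay simple because every returned object supports the same operation.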
Finally whether your function returns one or multiple types is totally based on the problem you have to solve. Can this polymorphic behaviour be abused? Sure, like everything else.

I personally think it is perfectly fine for a function to return a tuple or None. However, a function should return at most 2 different types and the second one should be a None. A function should never return a string and list for example.

If x is called like this
foo, bar = x(foo)
returning None would result in a
TypeError: 'NoneType' object is not iterable
if 'bar' is not in foo.
Example
def x(foo):
    if 'bar' in foo:
        return (foo, 'bar')
    return None
foo, bar = x(["foo", "bar", "baz"])
print foo, bar
foo, bar = x(["foo", "NOT THERE", "baz"])
print foo, bar
This results in:
['foo', 'bar', 'baz'] bar
Traceback (most recent call last):
  File "f.py", line 9, in <module>
    foo, bar = x(["foo", "NOT THERE", "baz"])
TypeError: 'NoneType' object is not iterable

Premature optimization is the root of all evil. The minuscule efficiency gains might be important, but not until you've proven that you need them.
Whatever your language: a function is defined once, but tends to be used at any number of places. Having a consistent return type (not to mention documented pre- and postconditions) means you have to spend more effort defining the function, but you simplify the usage of the function enormously. Guess whether the one-time costs tend to outweigh the repeated savings...?

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.
