Is there a Pythonic way to validate whether a string represents a floating-point number (any input that would be recognizable by float(), e.g. -1.6e3), without converting it (and, ideally, without resorting to throwing and catching exceptions)?
Previous questions have been submitted about how to check if a string represents an integer or a float. Answers suggest using try...except clauses together with the int() and float() built-ins, in a user-defined function.
However, these haven't properly addressed the issue of speed. While using the try...except idiom for this ties the conversion process to the validation process (to some extent rightfully), applications that go over a large amount of text for validation purposes (any schema validator, parsers) will suffer from the overhead of performing the actual conversion. Besides the slowdown due to the actual conversion of the number, there is also the slowdown caused by throwing and catching exceptions. This GitHub gist demonstrates how, compared to user-defined validation only, built-in conversion code is twice as costly (compare True cases), and exception handling time (False time minus True time for the try..except version) alone is as much as 7 validations. This answers my question for the case of integer numbers.
Valid answers will be: functions that solve the problem in a more efficient way than the try..except method, a reference to documentation for a built-in feature that will allow this in the future, a reference to a Python package that allows this now (and is more efficient than the try..except method), or an explanation pointing to documentation of why such a solution is not Pythonic, or will otherwise never be implemented. Specifically, to prevent clutter, please avoid answers such as 'No' without pointing to official documentation or mailing-list debate, and avoid reiterating the try..except method.
As @John mentioned in a comment, this appears as an answer to another question, though it is not the accepted answer there. Regular expressions and the fastnumbers module are two solutions to this problem.
However, as @en_Knight duly noted, performance depends largely on the inputs. If mostly valid inputs are expected, then the EAFP approach is faster, and arguably more elegant; if you don't know what input to expect, then LBYL might be more appropriate. Validation, in essence, should expect mostly valid inputs, which makes try..except the more appropriate choice there.
The fact is, for my use case (and, as the writer of the question, it bears relevance) of identifying the types of data in a tabular data file, the try..except method was more appropriate: a column is either all floats or, once it hits a non-float value, it is considered textual from that row on, so most of the inputs actually tested for float are valid either way. I guess all those other answers were on to something.
Back to the answer: fastnumbers and regular expressions are still appealing solutions for the general case. Specifically, the fastnumbers package seems to work well for all values except special ones, such as Infinity, Inf and NaN, as demonstrated in this GitHub gist. The same goes for the simple regular expression from the aforementioned answer (modified slightly: the trailing \b was removed, as it caused some inputs to fail):
^[-+]?(?:\b[0-9]+(?:\.[0-9]*)?|\.[0-9]+)(?:[eE][-+]?[0-9]+\b)?$
A bulkier version, that does recognize the special values, was used in the gist, and has equal performance:
^[-+]?(?:[Nn][Aa][Nn]|[Ii][Nn][Ff](?:[Ii][Nn][Ii][Tt][Yy])?|(?:\b[0-9]+(?:\.[0-9]*)?|\.[0-9]+)(?:[eE][-+]?[0-9]+\b)?)$
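As a sketch, the bulkier pattern above can be compiled once with the standard re module and wrapped in a small validator (is_float_re is just an illustrative name):

```python
import re

# The pattern above: special values (nan/inf/infinity, in any case),
# plus ordinary decimal and exponent notation.
FLOAT_RE = re.compile(
    r"^[-+]?(?:[Nn][Aa][Nn]"
    r"|[Ii][Nn][Ff](?:[Ii][Nn][Ii][Tt][Yy])?"
    r"|(?:\b[0-9]+(?:\.[0-9]*)?|\.[0-9]+)(?:[eE][-+]?[0-9]+\b)?)$"
)

def is_float_re(string):
    # A match means float(string) would very likely succeed.
    return FLOAT_RE.match(string) is not None
```

For example, is_float_re("-1.6e3"), is_float_re("nan") and is_float_re("Infinity") match, while is_float_re("abc") and is_float_re("1e") do not.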
The regular expression implementation is ~2.8 times slower on valid inputs, but ~2.2 times faster on invalid inputs. Invalid inputs run ~5 times slower than valid ones using try..except, or ~1.3 times faster using regular expressions. Given these results, it is favorable to use regular expressions when 40% or more of the expected inputs are invalid.
fastnumbers is merely ~1.2 times faster on valid inputs, but ~6.3 times faster on invalid inputs.
Results are described in the plot below. I ran with 10^6 repeats, with 170 valid inputs and 350 invalid inputs (weighted accordingly, so the average time is per single input). The colors don't show because the boxes are too narrow, but in each column the boxes on the left describe timings for valid inputs, and those on the right describe invalid inputs.
NOTE The answer was edited multiple times to reflect on comments both to the question, this answer and other answers. For clarity, edits have been merged. Some of the comments refer to previous versions.
If being Pythonic is the justification, then you should just stick to The Zen of Python, specifically these points:
Explicit is better than implicit.
Simple is better than complex.
Readability counts.
There should be one-- and preferably only one --obvious way to do it.
If the implementation is hard to explain, it's a bad idea.
All of those are in favour of the try-except approach: the conversion is explicit, simple, readable, obvious, and easy to explain.
Also, the only way to know whether something is a float number is to test whether it is a float number. This may sound redundant, but it's not.
Now, if the main problem is speed when testing a large batch of supposed float numbers, you could use a C extension with Cython to test all of them at once. But I don't really think that will buy you much speed unless the number of strings to test is really big.
Edit:
Python developers tend to prefer the EAFP approach ("easier to ask forgiveness than permission"), making the try-except approach more Pythonic (I can't find the PEP).
And here (Cost of exception handlers in Python) is a comparison between the try-except approach and the if-then approach. It turns out that in Python exception handling is not as expensive as in other languages, and it is only more expensive in the case that an exception must actually be handled. In general use cases you won't be validating strings that have a high probability of not actually being float numbers (unless your specific scenario includes this case).
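A minimal sketch of that asymmetry, measurable with the standard timeit module (the absolute numbers will vary by machine):

```python
import timeit

def is_float_eafp(string):
    # EAFP: attempt the conversion and treat failure as "not a float".
    try:
        float(string)
        return True
    except ValueError:
        return False

# The success path never touches the exception machinery...
t_valid = timeit.timeit(lambda: is_float_eafp("1.5"), number=100_000)
# ...while the failure path raises and catches an exception per call.
t_invalid = timeit.timeit(lambda: is_float_eafp("not a float"), number=100_000)

print(f"valid: {t_valid:.3f}s, invalid: {t_invalid:.3f}s")
```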
Again, as I said in a comment, the entire question doesn't make much sense without a specific use case, data to test against and a measure of time. Speaking of the most generic use case, try-except is the way to go; if you have an actual need that can't be satisfied fast enough by it, you should add it to the question.
To prove a point: there's not that many conditions that a string has to abide by in order to be float-able. However, checking all those conditions in Python is going to be rather slow.
ALLOWED = "0123456789+-eE."

def is_float(string):
    # A sign is only valid at the start, or right after an exponent marker
    # (so that inputs such as "1e-3" and "-1.5" pass, but "1-2" fails).
    for i, char in enumerate(string):
        if char in "+-" and i != 0 and string[i - 1] not in "eE":
            return False
    points = string.count(".")
    if points > 1:
        return False
    small_es = string.count("e")
    large_es = string.count("E")
    es = small_es + large_es
    if es > 1:
        return False
    if es == 1 and points == 1:
        # The decimal point must come before the exponent marker.
        e_index = string.index("e") if small_es else string.index("E")
        if string.index(".") > e_index:
            return False
    return all(char in ALLOWED for char in string)
I didn't actually test this, but I'm willing to bet that this is a lot slower than try: float(string); return True; except Exception: return False
Speedy Solution If You're Sure You Want it
Taking a look at this reference implementation: the conversion to float in Python happens in C code and is executed very efficiently. If you really were worried about overhead, you could copy that code verbatim into a custom C extension but, instead of raising the error flag, return a boolean indicating success.
In particular, look at the complicated logic implemented to coerce hex into float. This is done in the C level, with a lot of error cases; it seems highly unlikely there's a shortcut here (note the 40 lines of comments arguing for one particular guarding case), or that any hand-rolled implementation will be faster while preserving these cases.
But... Necessary?
As a hypothetical, this question is interesting, but in the general case one should profile their code to ensure that the try/catch method is actually adding overhead. Try/catch is often idiomatic and, moreover, can be faster depending on your usage. For example, for-loops in Python use try/catch by design.
Alternatives and Why I Don't Like Them
To clarify, the question asks about
any input that would be recognizable by float()
Alternative #1 -- How about a regex
I find it hard to believe that you will get a regex to solve this problem in general. While a regex will be good at capturing float literals, there are a lot of corner cases. Look at all the cases on this answer - does your regex handle NaN? Exponentials? Bools (but not bool strings)?
Alternative #2: Manually Unrolled Python Check:
To summarize the tough cases that need to be captured (which Python natively does)
Case-insensitive capturing of NaN
Hex matching
All of the cases enumerated in the language specification
Signs, including signs in the exponent
Booleans
I would also point you to the case below floating-point numbers in the language specification: imaginary numbers. float() handles these elegantly by recognizing what they are, but raising a TypeError on the conversion. Will your custom method emulate that behaviour?
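A few of those corner cases, illustrated against CPython's actual behaviour (all of these pass on Python 3.6+):

```python
import math

# Special values are recognized case-insensitively.
assert math.isnan(float("NaN"))
assert float("Infinity") == float("inf")

# Surrounding whitespace and digit-group underscores (3.6+) are accepted.
assert float("  1e3  ") == 1000.0
assert float("1_000.5") == 1000.5

# Booleans convert (they are ints), but boolean *strings* do not.
assert float(True) == 1.0
try:
    float("True")
except ValueError:
    pass

# Hex float strings go through float.fromhex, not float() itself.
assert float.fromhex("0x1.8p3") == 12.0

# Imaginary numbers: float() recognizes the type but raises TypeError.
try:
    float(1j)
except TypeError:
    pass
```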
Related
I am a newbie reading Uncle Bob's Clean Code Book.
It is indeed good practice to limit the number of function arguments to as few as possible, but I still come across many functions in many libraries that require a bunch of arguments. For example, in Python's pandas, there is a function with 9 arguments:
DataFrame.groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, squeeze=<object object>, observed=False, dropna=True)
(And this function also violates the advice about flag arguments)
It seems that such cases are much rarer in Python standard libraries, but I still managed to find one with 4 arguments:
re.split(pattern, string, maxsplit=0, flags=0)
I understand that this is just a suggestion rather than a silver bullet, but does it apply to functions like the ones mentioned above?
Uncle Bob does not mention a hard limit on the number of arguments that would make your code smell, but I would consider 9 arguments too many.
Today's IDEs are much better at supporting the readability of code; nevertheless, refactoring stays tricky, especially with a large number of equally typed arguments.
The suggested solution is to encapsulate the arguments in a single struct/object (depending on your language). In the given case, this could be a GroupingStrategy:
strategy = GroupingStrategy()
strategy.by = "Foo"
strategy.axis = 0
strategy.sorted = True
DataFrame.groupby(strategy)
All not mentioned attributes will be assigned with the respective default values.
You could then also convert it to a fluent API:
DataFrame.groupby(GroupingStrategy.by("Foo").axis(0).sorted())
Or keep some of the arguments, if this feels better:
DataFrame.groupby("Foo", GroupingStrategy.default())
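In Python, the parameter object could be sketched as a dataclass (GroupingStrategy and its fields are illustrative, not a real pandas API):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class GroupingStrategy:
    # A few of groupby's parameters, with their usual defaults
    # (hypothetical field set, for illustration only).
    by: Optional[str] = None
    axis: int = 0
    sort: bool = True
    dropna: bool = True

# Unmentioned attributes keep their defaults.
strategy = GroupingStrategy(by="Foo", sort=False)
```

A groupby accepting such an object would then read strategy.by, strategy.axis and so on, instead of nine separate parameters.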
The first point to note is that all those arguments to groupby are relevant. You can reduce the number of arguments by having different versions of groupby but that doesn't help much when the arguments can be applied independently of each other, as is the case here. The same logic would apply to re.split.
It's true that integer "flag" arguments can be dodgy from a maintenance point of view: what happens if you want to change a flag value in your code? You have to hunt through and manually fix each occurrence. The traditional approach is to use enums, which map words to numbers (e.g. a Day enum would have Day.Sun = 0, Day.Mon = 1, etc.). In compiled languages like C++ or C#, this gives you the speed of using integers under the hood with the readability of using labels/words in your code. However, enums in Python are slow.
One rule that I think applies to any source code is to avoid "magic numbers", i.e. numbers that appear directly in the source code. The enum is one solution. Another solution is to have constant variables representing the different flag settings. Python sort of supports constants (uppercase variable names in constant.py, which you then import); however, they are constant only by convention, and you can actually change their value :(
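The Day example, sketched with the standard enum module (IntEnum keeps integer compatibility while the source reads as labels):

```python
from enum import IntEnum

class Day(IntEnum):
    SUN = 0
    MON = 1
    TUE = 2

# The source reads as a label, but the value still behaves as an int,
# so changing a flag's underlying value later is a one-line edit.
assert Day.MON == 1
assert Day.MON.name == "MON"
```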
I was reading PEP 8, and I noticed that implicit comparison with booleans is preferred.
if booleanCond == True:  # actually works
if booleanCond:          # works too, and is preferred according to PEP 8
Those two statements mean the same, but in most languages I know explicit comparison is preferred.
Can anyone explain (quickly?) why?
Thanks !
AFAIK explicit comparison is frowned upon in most languages. There is a question about this practice on the Software Engineering stack exchange.
The big picture is that if you need to explicitly compare your boolean condition to True, you might have a naming problem with your variable.
if is_blue: reads well (which is an important thing in python because it helps reduce the cognitive load of the programmer) and if is_blue is True: does not.
As usual, this is a heuristic and should not be dogmatic, but if you ever feel that you need to compare a boolean value to True or False to help your reader understand what you're doing, it might be worth questioning your naming for that variable.
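One concrete difference worth knowing: if x: tests truthiness, while == True (and is True) only match the exact value, so the forms are not interchangeable:

```python
x = 1  # truthy, and equal to True, but not the object True
assert x == True       # ints compare equal to bools
assert x is not True   # identity with the True object fails
assert bool(x)         # the implicit `if x:` test succeeds

y = [1, 2]  # a non-empty list is truthy but not equal to True
assert y != True
assert bool(y)
```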
I have a method, which needs to return the result of multiple checks for equality. It is an implementation of the __eq__ method of a class, which represents vocables in my application. Here is the code for the return statement:
return all((
self.first_language_translations == other.first_language_translations,
self.first_language_phonetic_scripts == other.first_language_phonetic_scripts,
self.second_language_translations == other.second_language_translations,
self.second_language_phonetic_scripts == other.second_language_phonetic_scripts
))
I've tested the runtime of this way of doing it against the other way, using and operators. The and operators are slightly faster, maybe by 0.05s. That seems logical, because a tuple of those boolean values has to be created first and then a function is run, which might do more than the corresponding and operators would have done. However, this is probably going to be executed a lot during my application's runtime.
Now I am wondering whether the usage of all in such a case is good or bad practice and, if it is good practice, whether it is worth the slowdown. My application is all about vocables and might often need to check whether a vocable, or an identical one, is already in a list of vocables. It doesn't need to be super fast, and I'm thinking this might be micro-optimization, so I'd like to use the best practice for such a situation.
Is this a good usage of the built in all function?
No, that's not a good use of all(), since you have a small, fixed number of comparisons to make, and all() isn't even letting you represent it any more succinctly than you would when using and. Using and is more readable, and you should always put readability first unless you've profiled and performance is actually an issue. That said, using and is indeed a tiny bit faster in the worst case, and even faster on average because it'll short circuit on the first False rather than executing all the comparisons every time.
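For comparison, the and form of the same __eq__, sketched in a minimal stand-in class using the field names from the question:

```python
class Vocable:
    def __init__(self, flt, flps, slt, slps):
        self.first_language_translations = flt
        self.first_language_phonetic_scripts = flps
        self.second_language_translations = slt
        self.second_language_phonetic_scripts = slps

    def __eq__(self, other):
        # `and` short-circuits: evaluation stops at the first mismatch.
        return (
            self.first_language_translations == other.first_language_translations
            and self.first_language_phonetic_scripts == other.first_language_phonetic_scripts
            and self.second_language_translations == other.second_language_translations
            and self.second_language_phonetic_scripts == other.second_language_phonetic_scripts
        )
```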
Reading the documentation I have noticed that the built-in function len doesn't support all iterables but just sequences and mappings (and sets). Before reading that, I always thought that the len function used the iteration protocol to evaluate the length of an object, so I was really surprised reading that.
I read the already-posted questions (here and here), but I am still confused; I'm still not getting the real reason why len isn't allowed to work with all iterables in general.
Is it a more conceptual/logical reason than an implementational one? I mean, when I ask for the length of an object, I'm asking for a property (how many elements it has), a property that objects such as generators don't have, because they don't have elements inside; they produce elements.
Furthermore, generator objects can yield infinitely many elements, leading to an undefined length, something that cannot happen with other objects such as lists, tuples, dicts, etc.
So am I right, or are there more insights/something more that I'm not considering?
The biggest reason is that it reduces type safety.
How many programs have you written where you actually needed to consume an iterable just to know how many elements it had, throwing away anything else?
I, in quite a few years of coding in Python, have never needed that. It's a nonsensical operation in normal programs. An iterator may not have a length (e.g. infinite iterators, or generators that expect inputs via send()), so asking for it doesn't make much sense. The fact that len(an_iterator) produces an error means that you can find bugs in your code: you can see that in a certain part of the program you are calling len on the wrong thing, or maybe your function actually needs a sequence instead of the iterator you expected.
Removing such errors would create a new class of bugs where people, calling len, erroneously consume an iterator, or use an iterator as if it were a sequence without realizing.
If you really need to know the length of an iterator, what's wrong with len(list(iterator))? The extra six characters? It's trivial to write your own version that works for iterators but, as I said, 99% of the time this simply means that something in your code is wrong, because such an operation doesn't make much sense.
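The workaround, and the consumption it implies, in a few lines:

```python
squares = (n * n for n in range(5))  # a generator: no __len__

# len() on it fails loudly, as the answer argues it should.
try:
    len(squares)
except TypeError:
    pass

# Materializing gives a length, at the cost of consuming the iterator.
assert len(list(squares)) == 5
assert list(squares) == []  # exhausted: a second pass sees nothing
```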
The second reason is that, with that change, you are violating two nice properties of len that currently hold for all (known) containers:
It is known to be cheap on all containers ever implemented in Python (all built-ins, standard library, numpy & scipy and all other big third party libraries do this on both dynamically sized and static sized containers). So when you see len(something) you know that the len call is cheap. Making it work with iterators would mean that suddenly all programs might become inefficient due to computations of the length.
Also note that you can, trivially, implement O(1) __len__ on every container. The cost to pre-compute the length is often negligible, and generally worth paying.
The only exception would be if you implement immutable containers that have part of their internal representation shared with other instances (to save memory). However, I don't know of any implementation that does this, and most of the time you can achieve better than O(n) time anyway.
In summary: currently everybody implements __len__ in O(1), and it's easy to continue doing so, so there is an expectation for calls to len to be O(1), even if it's not part of the standard. Python developers intentionally avoid C/C++-style legalese in their documentation and trust the users. In this case, if your __len__ isn't O(1), you are expected to document that.
It is known to be not destructive. Any sensible implementation of __len__ doesn't change its argument. So you can be sure that len(x) == len(x), or that n = len(x);len(list(x)) == n.
Even this property is not defined in the documentation, however it's expected by everyone, and currently, nobody violates it.
Such properties are good, because you can reason and make assumptions about code using them.
They can help you ensure the correctness of a piece of code, or understand its asymptotic complexity. The change you propose would make it much harder to look at some code and understand whether it's correct, or what its complexity would be, because you'd have to keep the special cases in mind.
In summary, the change you are proposing has one, really small, pro: saving few characters in very particular situations, but it has several, big, disadvantages which would impact a huge portion of existing code.
Another minor reason: if len consumed iterators, I'm sure some people would start to abuse this for its side effects (replacing the already ugly abuse of map or list comprehensions). Suddenly people could write code like:
len(print(something) for ... in ...)
to print text, which is really just ugly. It doesn't read well. Stateful code should be relegated to statements, since those provide a visual cue of side effects.
In Python, find and index are very similar methods, used to look up values in a sequence type: find is used for strings, while index is for lists and tuples. They both return the lowest index (the index furthest to the left) at which the supplied argument is found.
For example, both of the following would return 1:
"abc".find("b")
[1,2,3].index(2)
However, one thing I'm somewhat confused about is that, even though the two methods are very similar, and fill nearly the same role, just for different data types, they have very different reactions to attempting to find something not in the sequence.
"abc".find("d")
Returns -1, to signify 'not found', while
[1,2,3].index(4)
raises an exception.
Basically, why do they have different behaviors? Is there a particular reason, or is it just a weird inconsistency for no particular reason?
Now, I'm not asking how to deal with this; obviously, a try/except block or a conditional in check would work. I'm simply asking what the rationale was for making the behavior different in just that particular case. To me, it would make more sense to have one particular behavior to say "not found", for consistency's sake.
Also, I'm not asking for opinions on whether the reason is a good reason or not – I'm simply curious about what the reason is.
Edit: Some have pointed out that strings also have an index method, which works like the index method for lists (which I'll admit I didn't know), but that just makes me wonder why, if strings have both, lists only have index.
This has always been annoying ;-) Contrary to one answer, there's nothing special about -1 with respect to strings; e.g.,
>>> "abc"[-1]
'c'
>>> [2, 3, 42][-1]
42
The problem with find() in practice is that -1 is in fact not special as an index. So code using find() is prone to surprises when the thing being searched for is not found - it was noted even before Python 1.0.0 was released that such code often went on to do a wrong thing.
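That surprise in a few lines: a failed find() hands back -1, which is itself a valid index:

```python
s = "abc"
i = s.find("d")  # "d" is not in the string
assert i == -1
assert s[i] == "c"  # -1 silently indexes the *last* character

# index() cannot be ignored by accident:
try:
    s.index("d")
except ValueError:
    pass
```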
No such surprises occur when index() is used instead - an exception can't be ignored silently. But setting up try/except for such a simple operation is not only annoying, it adds major overhead (extra time) for what "should be" a fast operation. Because of that, string.find() was added in Python 0.9.9 (before then, only string.index() could be used).
So we have both, and that persists even into Python 3. Pick your poison :-)