Python - Why do the find and index methods work differently?

In Python, find and index are very similar methods, used to look up values in a sequence type. find is used for strings, while index is for lists and tuples. They both return the lowest index (the index furthest to the left) at which the supplied argument is found.
For example, both of the following would return 1:
"abc".find("b")
[1,2,3].index(2)
However, one thing that somewhat confuses me is that, even though the two methods are very similar and fill nearly the same role, just for different data types, they react very differently to an attempt to find something not in the sequence.
"abc".find("d")
Returns -1, to signify 'not found', while
[1,2,3].index(4)
raises an exception.
Basically, why do they have different behaviors? Is there a particular reason, or is it just a weird inconsistency for no particular reason?
Now, I'm not asking how to deal with this – obviously, a try/except block or an in check would work. I'm simply asking what the rationale was for making the behavior different in just that particular case. To me, it would make more sense to have a single, consistent way to signal 'not found'.
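(For reference, the workarounds I mean look roughly like this:)
data = [1, 2, 3]
if 4 in data:
    i = data.index(4)
else:
    i = -1
# or
try:
    i = data.index(4)
except ValueError:
    i = -1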
Also, I'm not asking for opinions on whether the reason is a good reason or not – I'm simply curious about what the reason is.
Edit: Some have pointed out that strings also have an index method, which works like the index method for lists, which I'll admit I didn't know, but that just makes me wonder why, if strings have both, lists only have index.

This has always been annoying ;-) Contrary to one answer, there's nothing special about -1 with respect to strings; e.g.,
>>> "abc"[-1]
'c'
>>> [2, 3, 42][-1]
42
The problem with find() in practice is that -1 is in fact not special as an index. So code using find() is prone to surprises when the thing being searched for is not found - it was noted even before Python 1.0.0 was released that such code often went on to do a wrong thing.
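A made-up example of the kind of surprise meant here: because -1 is a valid index, slicing with find()'s result silently does the wrong thing when the substring is missing.
s = "report.txt"
cut = s.find(".csv")    # -1, not an error, because ".csv" isn't there
print(s[:cut])          # 'report.tx' -- silently wrong
cut = s.index(".csv")   # raises ValueError: substring not found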
No such surprises occur when index() is used instead - an exception can't be ignored silently. But setting up try/except for such a simple operation is not only annoying, it adds major overhead (extra time) for what "should be" a fast operation. Because of that, string.find() was added in Python 0.9.9 (before then, only string.index() could be used).
So we have both, and that persists even into Python 3. Pick your poison :-)

Related

Do “Clean Code”'s function argument number guidelines apply to API design?

I am a newbie reading Uncle Bob's Clean Code Book.
It is indeed great practice to limit the number of function arguments to as few as possible. But I still come across many functions in many libraries that require a bunch of arguments. For example, in Python's pandas there is a function with 9 arguments:
DataFrame.groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, squeeze=<object object>, observed=False, dropna=True)
(And this function also violates the advice about flag arguments)
It seems that such cases are much rarer in Python standard libraries, but I still managed to find one with 4 arguments:
re.split(pattern, string, maxsplit=0, flags=0)
I understand that this is just a guideline rather than a silver bullet, but does it apply to cases like the ones mentioned above?
Uncle Bob does not mention a hard limit on the number of arguments that would make your code smell, but I would consider 9 arguments too many.
Today's IDEs are much better at supporting code readability; nevertheless, refactoring remains tricky, especially with a large number of identically typed arguments.
The suggested solution is to encapsulate the arguments in a single struct/object (depending on your language). In the given case, this could be a GroupingStrategy:
strategy = GroupingStrategy()
strategy.by = "Foo"
strategy.axis = 0
strategy.sorted = True
DataFrame.groupby(strategy)
All not mentioned attributes will be assigned with the respective default values.
You could then also convert it to a fluent API:
DataFrame.groupby(GroupingStrategy.by("Foo").axis(0).sorted())
Or keep some of the arguments, if this feels better:
DataFrame.groupby("Foo", GroupingStrategy.default())
The first point to note is that all those arguments to groupby are relevant. You can reduce the number of arguments by having different versions of groupby but that doesn't help much when the arguments can be applied independently of each other, as is the case here. The same logic would apply to re.split.
It's true that integer "flag" arguments can be dodgy from a maintenance point of view: what happens if you want to change a flag value in your code? You have to hunt through and manually fix each case. The traditional approach is to use enums (which attach names to numbers, e.g. a Day enum would have Day.Sun = 0, Day.Mon = 1, etc.). In compiled languages like C++ or C# this gives you the speed of using integers under the hood with the readability of using labels/words in your code. However, enums in Python are slow.
One rule that I think applies to any source code is to avoid "magic numbers", i.e. numbers that appear directly in the source code. The enum is one solution. Another is to have constant variables represent the different flag settings. Python sort-of supports constants (uppercase variable names in constant.py which you then import); however, they are constant only by convention, and you can actually still change their value :(
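For instance, with the re.split call quoted in the question, the standard library already exposes its flags as named constants (an enum under the hood in recent Python versions), so the readable spelling costs nothing:
import re

# Magic number: what does 2 mean here?
re.split("and", "cats AND dogs and birds", flags=2)

# Named flag, same behaviour, but readable and greppable (re.IGNORECASE happens to equal 2).
re.split("and", "cats AND dogs and birds", flags=re.IGNORECASE)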

Does python iterable imply countable?

I've stumbled upon an interesting case - the only thing I am sure about is that I will get an iterable object.
What I really, and only, want to do is to count it.
I've searched for whether iterable in Python implies countable, and I found various places claiming so, but not the official docs.
So 2 questions arise:
In Python, does iterable => countable (number of items)? Or is it just very common to be so?
Is there a generic Pythonic way to get the count from an iterable? This seems to be answered here https://stackoverflow.com/a/3345807/1835470 - i.e. not without counting, but the author provided a Pythonic one-liner:
sum(1 for _ in iterableObject)
Iterables do not support the len() function. To get their length you have to convert them into something that does support it, or manually loop through them while counting (exhausting them in the process), as in the code snippet you posted.
As a result, the requirement of using an iterable while also needing to know its length is somewhat conflicting. Plus, as @Veedrac says in the comments, iterables can have infinite length.
To sum up, if you need to know how many items are contained, you need a different data structure.
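A short sketch of the trade-offs described above (counting with sum consumes the iterable, list materializes it, and some iterables never end):
import itertools

gen = (x * x for x in range(10))
print(sum(1 for _ in gen))    # 10 -- but gen is now exhausted
print(next(gen, None))        # None: nothing left to iterate

print(len(list(x * x for x in range(10))))   # 10 -- works by materializing everything first

# And some iterables have no finite count at all:
# sum(1 for _ in itertools.count())   # would never return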

Why is it allowed for a list to append itself?

I noticed by accident that in Python you can write something like the following:
l = list()
for i in range(5):
    l.append(l)
If I print out what l is it looks like this:
[[...], [...], [...], [...], [...]]
And now I can do things like:
l[0][0][0][0]...[0]
which will return still the exact same list.
The error I had made in my code was not detected by my linter and so might have been difficult to spot. So my question is: why is it allowed for a list to append itself? What use can you make of it?
If it's not useful it should probably be disallowed.
Your criterion for not allowing a list to contain itself is that it isn't useful. But it's hard to prove that something isn't useful. Maybe nobody knows of a use yet, but that doesn't mean someone won't find one in the future. In the meantime, there is an added cost to the language, both in explaining what exceptions there are to adding an object to a list and in implementing those exceptions.
If you could prove that it is harmful to allow adding a list to itself, then you would have a stronger case for making that change to the language. But "I accidentally typed the wrong variable name" is not a strong argument for harm.
Why is it allowed for a list to append itself? Remember that in Python, everything is a reference. A list is just a sequence of references to objects. Since a list is an object, it can be referenced by another list. When a list references an object, it's not that the object is actually contained within the list; it's just that a single element of the sequence is now a reference to that object. Though circular references are sometimes bad, they are sometimes useful, and nothing has actually been broken here. Remember, a list containing itself is really just a list that holds a reference to itself.
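A short illustration of that point: nothing is copied into a list, only referenced.
a = [1, 2]
b = [a, a]              # b holds two references to the same list object
a.append(3)
print(b)                # [[1, 2, 3], [1, 2, 3]] -- both entries reflect the change
l = []
l.append(l)             # same mechanism: l now holds a reference to itself
print(l is l[0])        # True
print(l[0][0][0] is l)  # True -- you can follow the reference forever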
Why is it allowed for a list to append itself?
Your assumption that it's an error and should not be allowed is incorrect. It's up to us as engineers to decide if it's a good fit for the situation we have or not. I'll turn this question around, why do you think it should be disallowed?
What use can you make of it?
Adding arrays to arrays is actually pretty common. I could imagine a situation where you have to start with 5 identical arrays, perform different calculations on each one, then compare the results. This would be a way to accomplish that.

Python: validate whether string is a float without conversion

Is there a Pythonic way to validate whether a string represents a floating-point number (any input that would be recognizable by float(), e.g. -1.6e3), without converting it (and, ideally, without resorting to throwing and catching exceptions)?
Previous questions have been submitted about how to check if a string represents an integer or a float. Answers suggest using try...except clauses together with the int() and float() built-ins, in a user-defined function.
However, these haven't properly addressed the issue of speed. While using the try...except idiom for this ties the conversion process to the validation process (to some extent rightfully), applications that go over a large amount of text for validation purposes (any schema validator, parsers) will suffer from the overhead of performing the actual conversion. Besides the slowdown due to the actual conversion of the number, there is also the slowdown caused by throwing and catching exceptions. This GitHub gist demonstrates how, compared to user-defined validation only, built-in conversion code is twice as costly (compare True cases), and exception handling time (False time minus True time for the try..except version) alone is as much as 7 validations. This answers my question for the case of integer numbers.
Valid answers will be: functions that solve the problem in a more efficient way than the try..except method, a reference to documentation for a built-in feature that will allow this in the future, a reference to a Python package that allows this now (and is more efficient than the try..except method), or an explanation pointing to documentation of why such a solution is not Pythonic, or will otherwise never be implemented. Specifically, to prevent clutter, please avoid answers such as 'No' without pointing to official documentation or mailing-list debate, and avoid reiterating the try..except method.
As @John mentioned in a comment, this appears as an answer in another question, though it is not the accepted answer in that case. Regular expressions and the fastnumbers module are two solutions to this problem.
However, it's duly noted (as @en_Knight did) that performance depends largely on the inputs. If you expect mostly valid inputs, then the EAFP approach is faster and arguably more elegant. If you don't know what input to expect, then LBYL might be more appropriate. Validation, in essence, should expect mostly valid inputs, so try..except is the more appropriate fit.
The fact is, for my use case (and as the writer of the question it bears relevance) of identifying types of data in a tabular data file, the try..except method was more appropriate: a column is either all float, or, if it has a non-float value, from that row on it's considered textual, so most of the inputs actually tested for float are valid in either case. I guess all those other answers were on to something.
Back to the answer: fastnumbers and regular expressions are still appealing solutions for the general case. Specifically, the fastnumbers package seems to work well for all values except special ones such as Infinity, Inf and NaN, as demonstrated in this GitHub gist. The same goes for the simple regular expression from the aforementioned answer (modified slightly: the trailing \b was removed as it caused some inputs to fail):
^[-+]?(?:\b[0-9]+(?:\.[0-9]*)?|\.[0-9]+)(?:[eE][-+]?[0-9]+\b)?$
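For concreteness, a sketch of how that pattern might be used (precompiled; match() suffices since the pattern is anchored):
import re

FLOAT_RE = re.compile(r"^[-+]?(?:\b[0-9]+(?:\.[0-9]*)?|\.[0-9]+)(?:[eE][-+]?[0-9]+\b)?$")

def is_float_re(s):
    return FLOAT_RE.match(s) is not None

is_float_re("-1.6e3")   # True
is_float_re("1.2.3")    # False
is_float_re("nan")      # False -- this simpler variant rejects the special values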
A bulkier version, that does recognize the special values, was used in the gist, and has equal performance:
^[-+]?(?:[Nn][Aa][Nn]|[Ii][Nn][Ff](?:[Ii][Nn][Ii][Tt][Yy])?|(?:\b[0-9]+(?:\.[0-9]*)?|\.[0-9]+)(?:[eE][-+]?[0-9]+\b)?)$
The regular expression implementation is ~2.8 times slower on valid inputs, but ~2.2 times faster on invalid inputs. Invalid inputs run ~5 times slower than valid ones using try..except, or ~1.3 times faster using regular expressions. Given these results, it's favorable to use regular expressions when 40% or more of the expected inputs are invalid.
fastnumbers is merely ~1.2 times faster on valid inputs, but ~6.3 times faster on invalid inputs.
Results are described in the plot below. I ran with 10^6 repeats, with 170 valid inputs and 350 invalid inputs (weighted accordingly, so the average time is per a single input). Colors don't show because boxes are too narrow, but the ones on the left of each column describe timings for valid inputs, while invalid inputs are to the right.
NOTE The answer was edited multiple times to reflect on comments both to the question, this answer and other answers. For clarity, edits have been merged. Some of the comments refer to previous versions.
If being Pythonic is the justification, then you should just stick to The Zen of Python, specifically these:
Explicit is better than implicit.
Simple is better than complex.
Readability counts.
There should be one-- and preferably only one --obvious way to do it.
If the implementation is hard to explain, it's a bad idea.
All of those favour the try-except approach. The conversion is explicit, simple, readable, obvious and easy to explain.
Also, the only way to know whether something is a float is to test whether it's a float. This may sound redundant, but it's not.
Now, if the main problem is speed when testing too many supposed float numbers, you could use a C extension with Cython to test all of them at once. But I don't really think it will give you much improvement in speed unless the number of strings to test is really big.
Edit:
Python developers tend to prefer the EAFP approach (Easier to Ask Forgiveness than Permission), making the try-except approach more Pythonic (I can't find the PEP).
And here (Cost of exception handlers in Python) is a comparison of the try-except approach and the if-then approach. It turns out that in Python exception handling is not as expensive as in other languages, and it's only more costly when an exception actually has to be handled. In the typical use case you won't be validating strings that have a high probability of not actually being float numbers (unless your specific scenario is like that).
Again, as I said in a comment: the whole question doesn't make much sense without a specific use case, data to test, and a time measurement. Speaking only about the most generic use case, try-except is the way to go; if you have some actual need that it can't satisfy fast enough, then you should add that to the question.
To prove a point: there aren't that many conditions a string has to satisfy in order to be float-able. However, checking all those conditions in Python is going to be rather slow.
ALLOWED = "0123456789+-eE."

def is_float(string):
    minuses = string.count("-")
    if minuses == 1 and string[0] != "-":
        return False
    if minuses > 1:
        return False
    pluses = string.count("+")
    if pluses == 1 and string[0] != "+":
        return False
    if pluses > 1:
        return False
    points = string.count(".")
    if points > 1:
        return False
    small_es = string.count("e")
    large_es = string.count("E")
    es = small_es + large_es
    if es > 1:
        return False
    if (es == 1) and (points == 1):
        if small_es == 1:
            if string.index(".") > string.index("e"):
                return False
        else:
            if string.index(".") > string.index("E"):
                return False
    return all(char in ALLOWED for char in string)
I didn't actually test this, but I'm willing to bet that this is a lot slower than try: float(string); return True; except Exception: return False
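(Spelled out as a function, that baseline is simply:)
def is_float_try(string):
    try:
        float(string)
        return True
    except Exception:   # mirrors the one-liner above; ValueError is the usual case
        return False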
Speedy Solution If You're Sure You Want it
Taking a look at this reference implementation: the conversion to float in Python happens in C code and is executed very efficiently. If you really were worried about overhead, you could copy that code verbatim into a custom C extension, but instead of raising the error flag, return a boolean indicating success.
In particular, look at the complicated logic implemented to coerce hex into float. This is done at the C level, with a lot of error cases; it seems highly unlikely there's a shortcut here (note the 40 lines of comments arguing for one particular guarding case), or that any hand-rolled implementation will be faster while preserving these cases.
But... Necessary?
As a hypothetical, this question is interesting, but in the general case you should profile your code to confirm that the try/except method is actually adding overhead. Try/except is often idiomatic, and moreover can be faster depending on your usage. For example, for-loops in Python use an exception by design to signal the end of iteration.
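That last point can be seen directly: the for statement drives an iterator with next() and stops when the StopIteration exception is raised.
it = iter([10, 20])
print(next(it))   # 10
print(next(it))   # 20
next(it)          # raises StopIteration -- the exact signal a for-loop catches to stop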
Alternatives and Why I Don't Like Them
To clarify, the question asks about
any input that would be recognizable by float()
Alternative #1 -- How about a regex
I find it hard to believe that you will get a regex to solve this problem in general. While a regex will be good at capturing float literals, there are a lot of corner cases. Look at all the cases on this answer - does your regex handle NaN? Exponentials? Bools (but not bool strings)?
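A few of those corner cases, all accepted by float() (illustrative, not exhaustive):
float("nan"), float("NaN")         # not-a-number, any casing
float("inf"), float("-Infinity")   # infinities
float("  1.5\n")                   # surrounding whitespace is stripped
float("1_000.5")                   # underscores in the string (Python 3.6+)
float(True)                        # bools convert fine; float("True") raises ValueError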
Alternative #2: Manually Unrolled Python Check:
To summarize the tough cases that need to be captured (which Python handles natively):
Case-insensitive capturing of NaN
Hex matching
All of the cases enumerated in the language specification
Signs, including signs in the exponent
Booleans
I would also point you to the case listed below floating-point numbers in the language specification: imaginary numbers. float() handles these elegantly by recognizing what they are, but throwing a TypeError on the conversion. Will your custom method emulate that behaviour?

Understanding len function with iterators

Reading the documentation I have noticed that the built-in function len doesn't support all iterables but just sequences and mappings (and sets). Before reading that, I always thought that the len function used the iteration protocol to evaluate the length of an object, so I was really surprised reading that.
I read the already-posted questions (here and here) but I am still confused; I'm still not getting the real reason for not allowing len to work with all iterables in general.
Is it a more conceptual/logical reason than an implementation one? I mean, when I ask for the length of an object, I'm asking for one property (how many elements it has), a property that objects such as generators don't have, because they do not have elements inside; they produce elements.
Furthermore, generator objects can yield infinitely many elements, leading to an undefined length, something that cannot happen with other objects such as lists, tuples, dicts, etc.
So am I right, or are there more insights/something more that I'm not considering?
The biggest reason is that it reduces type safety.
How many programs have you written where you actually needed to consume an iterable just to know how many elements it had, throwing away anything else?
I, in quite a few years of coding in Python, have never needed that. It's a nonsensical operation in normal programs. An iterator may not have a length (e.g. infinite iterators, or generators that expect inputs via send()), so asking for it doesn't make much sense. The fact that len(an_iterator) produces an error means that you can find bugs in your code: you can see that in a certain part of the program you are calling len on the wrong thing, or that your function actually needs a sequence instead of an iterator, contrary to what you expected.
Removing such errors would create a new class of bugs where people, calling len, erroneously consume an iterator, or use an iterator as if it were a sequence without realizing.
If you really need to know the length of an iterator, what's wrong with len(list(iterator))? The extra 6 characters? It's trivial to write your own version that works for iterators, but, as I said, 99% of the time this simply means that something in your code is wrong, because such an operation doesn't make much sense.
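For completeness, the "write your own version" option is nothing more than a throwaway helper like this:
def ilen(iterable):
    # Counts by consuming: O(n), and destructive when given an iterator.
    return sum(1 for _ in iterable)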
The second reason is that, with that change, you are violating two nice properties of len that currently hold for all (known) containers:
It is known to be cheap on all containers ever implemented in Python (all built-ins, standard library, numpy & scipy and all other big third party libraries do this on both dynamically sized and static sized containers). So when you see len(something) you know that the len call is cheap. Making it work with iterators would mean that suddenly all programs might become inefficient due to computations of the length.
Also note that you can, trivially, implement O(1) __len__ on every container. The cost to pre-compute the length is often negligible, and generally worth paying.
The only exception would be if you implement immutable containers that have part of their internal representation shared with other instances (to save memory). However, I don't know of any implementation that does this, and most of the time you can achieve better than O(n) time anyway.
In summary: currently everybody implements __len__ in O(1) and it's easy to continue doing so, so there is an expectation for calls to len to be O(1), even though that's not part of any standard. Python developers intentionally avoid C/C++-style legalese in their documentation and trust the users; in this case, if your __len__ isn't O(1), it's expected that you document that.
It is known to be non-destructive. Any sensible implementation of __len__ doesn't change its argument, so you can be sure that len(x) == len(x), or that n = len(x); len(list(x)) == n.
This property too is not defined in the documentation; however, it's expected by everyone, and currently nobody violates it.
Such properties are good, because you can reason and make assumptions about code using them.
They can help you ensure the correctness of a piece of code, or understand its asymptotic complexity. The change you propose would make it much harder to look at some code and understand whether it's correct, or what its complexity would be, because you would have to keep the special cases in mind.
In summary, the change you are proposing has one really small pro, saving a few characters in very particular situations, but it has several big disadvantages that would impact a huge portion of existing code.
One other minor reason: if len consumed iterators, I'm sure some people would start to abuse it for its side effects (replacing the already ugly use of map or list comprehensions for that purpose). Suddenly people could write code like:
len(print(something) for ... in ...)
to print text, which is really just ugly. It doesn't read well. Stateful code should be relegated to statements, since they provide a visual cue of side effects.
