In documenting a Python function, I find it more Pythonic to say:
def Foo(i):
"""i: An interable containing…"""
…rather than…
def Foo(i):
"""i: A list of …"""
when i really doesn't need to be a list (Foo will happily operate on a set, tuple, etc.). The problem is generators. Generators typically allow only one pass. Most functions are OK with generators or other single-pass iterables, but some are not.
For those functions that cannot accept generators/things that can only be iterated once, is there a clear, consistent Python term for "thing that can be iterated more than once"?
The Python glossary entries for iterable and iterator seem to have a "once, but maybe more if you're lucky" definition.
I don't know of a standard term for this, at least not offhand, but I think "reusable iterable" would get the point across if you need a short phrase.
In practice, it's generally possible to structure your function so that you don't need to iterate over i more than once. Alternatively, you can create a list out of the iterable and then iterate over the list as many times as you want; or you can use itertools.tee to get multiple independent "copies" of the iterator. That lets you accept a generator even if you do need to use it more than once.
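For example, a minimal sketch of both approaches (the function names here are made up for illustration):

import itertools

def total_and_maximum(numbers):
    # Option 1: materialise the iterable; the resulting list can be
    # traversed any number of times, so generators are accepted too.
    numbers = list(numbers)
    return sum(numbers), max(numbers)

def total_and_maximum_tee(numbers):
    # Option 2: itertools.tee hands back independent iterators over the
    # same underlying data (buffering items internally as needed).
    first, second = itertools.tee(numbers)
    return sum(first), max(second)

print(total_and_maximum(x * x for x in range(5)))      # (30, 16)
print(total_and_maximum_tee(x * x for x in range(5)))  # (30, 16)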
This is probably more a matter of style and preference than anything else, yet... I have a different take on my documentation: I always write the docstring according to the expected input in the context of the program.
Example: if I write a function that is expected to go over the keys of a dictionary and ignore its values, I write:
arg : a dictionary of...
even if for e in arg: would work with other iterables. I choose to do so because, within the context of my code, I don't care whether the function would still work... I care more that whoever reads the documentation understands how that function is meant to be used.
On the other hand, if I am writing a utility function that can cope with a wide spectrum of iterables by design, I go one of these two ways:
document what kind of exception will be raised under certain conditions [e.g. "Raises TypeError if the iterable can't be iterated more than once"]
perform some pre-emptive argument handling that will make the function compatible with 'once-only' iterables.
In other words, I try to either make my function solid enough to handle edge cases, or to be very outspoken on its limitations.
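A minimal sketch of what I mean, with a made-up function, combining a documented exception with pre-emptive handling of once-only iterables:

def mean_and_spread(values):
    """Return (mean, largest deviation from the mean) of values.

    Accepts any iterable, including generators: the input is materialised
    into a list up front so it can be traversed more than once.

    Raises ValueError if values is empty.
    """
    values = list(values)  # pre-emptive handling of once-only iterables
    if not values:
        raise ValueError("mean_and_spread() needs at least one value")
    mean = sum(values) / len(values)
    return mean, max(abs(v - mean) for v in values)

print(mean_and_spread(x for x in range(5)))  # (2.0, 2.0)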
Again: there's nothing wrong with the approach you want to take, but I consider this one of the cases in which "explicit is better than implicit": documentation that mentions a "reusable iterable" is definitely accurate, but the adjective could easily be overlooked.
HTH!
I am a newbie reading Uncle Bob's Clean Code Book.
It is indeed good practice to limit the number of function arguments to as few as possible. But I still come across many functions in many libraries that require a bunch of arguments. For example, in Python's pandas, there is a function with 9 arguments:
DataFrame.groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, squeeze=<object object>, observed=False, dropna=True)
(And this function also violates the advice about flag arguments)
It seems that such cases are much rarer in the Python standard library, but I still managed to find one with 4 arguments:
re.split(pattern, string, maxsplit=0, flags=0)
I understand that this is just a suggestion rather than a silver bullet, but does it apply to cases like the ones mentioned above?
Uncle Bob does not mention a hard limit on the number of arguments that would make your code smell, but I would consider 9 arguments too many.
Today's IDEs are much better at supporting the readability of code; nevertheless, refactoring stays tricky, especially with a large number of identically typed arguments.
The suggested solution is to encapsulate the arguments in a single struct/object (depending on your language). In the given case, this could be a GroupingStrategy:
strategy = GroupingStrategy()
strategy.by = "Foo"
strategy.axis = 0
strategy.sorted = True
DataFrame.groupby(strategy)
All attributes not mentioned will be assigned their respective default values.
You could then also convert it to a fluent API:
DataFrame.groupby(GroupingStrategy.by("Foo").axis(0).sorted())
Or keep some of the arguments, if this feels better:
DataFrame.groupby("Foo", GroupingStrategy.default())
The first point to note is that all those arguments to groupby are relevant. You can reduce the number of arguments by having different versions of groupby but that doesn't help much when the arguments can be applied independently of each other, as is the case here. The same logic would apply to re.split.
It's true that integer "flag" arguments can be dodgy from a maintenance point of view - what happens if you want to change a flag value in your code? You have to hunt through and manually fix each case. The traditional approach is to use enums (which map numbers to words, e.g. a Day enum would have Day.Sun = 0, Day.Mon = 1, etc.). In compiled languages like C++ or C# this gives you the speed of using integers under the hood but the readability of using labels/words in your code. However, enums in Python are slow.
One rule that I think applies to any source code is to avoid "magic numbers", i.e. numbers which appear directly in the source code. The enum is one solution. Another solution is to have constant variables to represent different flag settings. Python sort-of supports constants (uppercase variable names in constant.py which you then import), however they are constant only by convention; you can actually change their value :(
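As an illustration, the standard library already uses named flag constants for re, and enum.IntEnum covers the Day example from above:

import re
from enum import IntEnum

# Named flags instead of a magic number: flags=10 would also work here,
# but re.IGNORECASE | re.MULTILINE says what it means.
parts = re.split(r"\W+", "Words, words\nWORDS", flags=re.IGNORECASE | re.MULTILINE)
print(parts)  # ['Words', 'words', 'WORDS']

class Day(IntEnum):
    SUN = 0
    MON = 1
    TUE = 2

print(Day.MON.name, Day.MON.value)  # MON 1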
I've stumbled upon an interesting case - the only thing I am sure about is that I will get an iterable object.
What I really, and only, want to do is to count it.
I've searched for whether iterable in Python implies countable, and I found various places claiming so, but not the official docs.
So 2 questions arise:
In Python, does iterable => countable (number of items)? Or is it just very common to be so?
Is there a generic Pythonic way to get the count from an iterable? This seems to be answered here https://stackoverflow.com/a/3345807/1835470 - i.e. not without counting, but the author provided a Pythonic one-liner:
sum(1 for _ in iterableObject)
Iterables in general do not have to support len(). To get their length you have to convert them into something that does support it, or manually loop through them (and exhaust them in the process) while counting, such as in the code snippet you posted.
As a result, the requirement of using an iterable and needing to know its length is somewhat conflicting. Plus, as @Veedrac says in the comments, iterables can have infinite length.
To sum up, if you need to know how many items are contained, you need a different data structure.
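A small demonstration of the trade-off with a throwaway generator:

def squares(n):
    for i in range(n):
        yield i * i

gen = squares(5)
print(sum(1 for _ in gen))  # 5 -- but the generator is now exhausted
print(list(gen))            # [] -- nothing left to iterate

# If you need both the count and the items, materialise the data first:
items = list(squares(5))
print(len(items), items)    # 5 [0, 1, 4, 9, 16]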
Some methods don't need to make a new variable; for example, lists.reverse() works like this:
lists = [123, 456, 789]
lists.reverse()
print(lists)
This method reverses the list in place (without making a new variable).
Why are there different ways of operating on variables in Python?
Some calls, like variable.method().method2().method3(), can be chained one after another, but type(variable) and print(variable) are not written that way. Why can't we write variable.print() or variable.type()?
Is there a philosophical reason for this in Python?
You may be confused by the difference between a function and a method, and by three different purposes they serve. As much as I dislike using SO for tutorial purposes, these issues can be hard to grasp from other documentation. You can look up function vs method easily enough -- once you know it's a (slightly) separate issue.
Your first question is a matter of system design. Python merely facilitates what programmers want to do, and the differentiation is common to many (most?) programming languages since ASM and FORTRAN crawled out of the binary slime pools in the days when dinosaurs roamed the earth.
When you design how your application works, you need to make a lot of implementation decisions: individual variables vs a sequence, in-line coding vs functions, separate functions vs encased functions vs classes and methods, etc. Part of this decision making is what each function should do. You've raised three main types:
(1) Process this data -- take the given data and change it, rearrange it, whatever needs doing -- but I don't need the previous version, just the improved version, so just put the new stuff where the old stuff was. This is used almost exclusively when one variable is getting processed; we don't generally take four separate variables and change each of them. In that case, we'd put them all in a list and change the list (a single variable). reverse falls into this class.
One important note is that for such a function, the argument in question must be mutable (capable of change). Python has mutable and immutable types. For instance, a list is mutable; a tuple is immutable. If you wanted to reverse a tuple, you'd need to return a new tuple; you can't change the original.
(2) Tell me something interesting -- take the given data and extract some information. However, I'm going to need the originals, so leave them alone. If I need to remember this cool new insight, I'll put it in a variable of my own. This is a function that returns a value. sqrt is one such function.
(3) Interact with the outside world -- input or output data permanently. For output, nothing in the program changes; we may present the data in an easy-to-read format, but we don't change anything internally. print is such a function.
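The three types side by side, using only built-ins and the standard library:

import math

data = [3, 1, 2]

# (1) Process this data: changes the list in place, returns nothing useful.
data.reverse()
print(data)              # [2, 1, 3]

# (2) Tell me something interesting: leaves its argument alone, returns a value.
root = math.sqrt(data[0])
print(root, data)        # 1.4142135623730951 [2, 1, 3]

# (3) Interact with the outside world: output only, nothing internal changes.
print("done")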
Much of this decision also depends on the function's designed purpose: is this a "verb" function (do something) or a noun/attribute function (look at this data and tell me what you see)?
Now you get the interesting job for yourself: learn the art of system design. You need to become familiar enough with the available programming tools that you have a feeling for how they can be combined to form useful applications.
See the documentation:
The reverse() method modifies the sequence in place for economy of space when reversing a large sequence. To remind users that it operates by side effect, it does not return the reversed sequence.
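Hence reverse() returns None; if you want a new reversed list while keeping the original, use reversed() or a slice instead:

lists = [123, 456, 789]

print(lists.reverse())        # None -- reverse() works purely by side effect
print(lists)                  # [789, 456, 123]

print(list(reversed(lists)))  # [123, 456, 789] -- new list, original untouched
print(lists[::-1])            # [123, 456, 789] -- slicing also copies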
Reading the documentation I have noticed that the built-in function len doesn't support all iterables, but just sequences and mappings (and sets). Before reading that, I always thought that the len function used the iteration protocol to evaluate the length of an object, so I was really surprised to read that.
I read the already-posted questions (here and here) but I am still confused; I'm still not getting the real reason why len is not allowed to work with all iterables in general.
Is it a more conceptual/logical reason than an implementational one? I mean, when I ask for the length of an object, I'm asking for one property (how many elements it has), a property that objects such as generators don't have, because they do not have elements inside; they produce elements.
Furthermore, generator objects can yield infinitely many elements, leading to an undefined length, something that cannot happen with other objects such as lists, tuples, dicts, etc.
So am I right, or are there more insights/something more that I'm not considering?
The biggest reason is that it reduces type safety.
How many programs have you written where you actually needed to consume an iterable just to know how many elements it had, throwing away anything else?
I, in quite a few years of coding in Python, never needed that. It's a nonsensical operation in normal programs. An iterator may not have a length (e.g. infinite iterators, or generators that expect inputs via send()), so asking for it doesn't make much sense. The fact that len(an_iterator) produces an error means that you can find bugs in your code. You can see that in a certain part of the program you are calling len on the wrong thing, or maybe your function actually needs a sequence instead of an iterator as you expected.
Removing such errors would create a new class of bugs where people, calling len, erroneously consume an iterator, or use an iterator as if it were a sequence without realizing.
If you really need to know the length of an iterator, what's wrong with len(list(iterator))? The extra 6 characters? It's trivial to write your own version that works for iterators, but, as I said, 99% of the time this simply means that something with your code is wrong, because such an operation doesn't make much sense.
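For completeness, such a hand-rolled helper is a one-liner (the name ilen is made up here):

def ilen(iterable):
    # Counts by consuming the iterable: O(n), and destructive for iterators.
    return sum(1 for _ in iterable)

print(ilen(x for x in range(10) if x % 3 == 0))  # 4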
The second reason is that, with that change, you are violating two nice properties of len that currently hold for all (known) containers:
It is known to be cheap on all containers ever implemented in Python (all built-ins, standard library, numpy & scipy and all other big third party libraries do this on both dynamically sized and static sized containers). So when you see len(something) you know that the len call is cheap. Making it work with iterators would mean that suddenly all programs might become inefficient due to computations of the length.
Also note that you can, trivially, implement O(1) __len__ on every container. The cost to pre-compute the length is often negligible, and generally worth paying.
The only exception would be if you implement immutable containers that have part of their internal representation shared with other instances (to save memory). However, I don't know of any implementation that does this, and most of the time you can achieve better than O(n) time anyway.
In summary: currently everybody implements __len__ in O(1) and it's easy to continue to do so, so there is an expectation for calls to len to be O(1), even if it's not part of the standard. Python developers intentionally avoid C/C++-style legalese in their documentation and trust the users. In this case, if your __len__ isn't O(1), it's expected that you document that.
It is known to be non-destructive. Any sensible implementation of __len__ doesn't change its argument, so you can be sure that len(x) == len(x), or that n = len(x); len(list(x)) == n.
Even though this property is not defined in the documentation, it's expected by everyone, and currently nobody violates it.
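As a sketch of both properties, here is a toy (hypothetical) container that keeps a running count on insertion, so __len__ is both O(1) and non-destructive:

class Bag:
    def __init__(self):
        self._head = None   # singly linked chain of (value, next) pairs
        self._count = 0

    def add(self, value):
        self._head = (value, self._head)
        self._count += 1

    def __iter__(self):
        node = self._head
        while node is not None:
            yield node[0]
            node = node[1]

    def __len__(self):
        return self._count  # O(1): no traversal, and nothing is consumed

bag = Bag()
for x in "abc":
    bag.add(x)
print(len(bag), sorted(bag))  # 3 ['a', 'b', 'c']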
Such properties are good, because you can reason and make assumptions about code using them.
They can help you ensure the correctness of a piece of code, or understand its asymptotic complexity. The change you propose would make it much harder to look at some code and understand whether it's correct or what its complexity would be, because you would have to keep the special cases in mind.
In summary, the change you are proposing has one really small pro: saving a few characters in very particular situations. But it has several big disadvantages which would impact a huge portion of existing code.
One other minor reason: if len consumed iterators, I'm sure some people would start to abuse this for its side effects (replacing the already ugly use of map or list comprehensions for that purpose). Suddenly people could write code like:
len(print(something) for ... in ...)
to print text, which is really just ugly. It doesn't read well. Stateful code should be relegated to statements, since they provide a visual cue of side effects.
I'm quite new to Python (2.7) and have a question about the most Pythonic way to do something; my code (part of a class) looks like this (a somewhat naive version):
def calc_pump_height(self):
    for i in range(len(self.primary_)):
        for j in range(len(self.primary_)):
            if self.connections_[i][j].sub_kind_ in [1, 4]:
                self.calc_spec_pump_height(i, j)

def calc_spec_pump_height(self, i, j):
    pass
(obviously pass will be replaced by something else, manipulating attributes of the object of this class, without generating a return value)
I'd like to ask how I should do this: I could avoid the second function and write the extra code directly into the first function, getting rid of one function (Simple is better than complex), but creating a heavily nested function at the same time (Flat is better than nested).
I could also create some sort of list comprehension to avoid using a double Loop, eg:
def calc_pump_height(self):
    ra = range(len(self.primary_))
    [self.calc_spec_pump_height(i, j) for i, j in zip(ra, ra)]
(I'd have to move the if condition into the second function; this would also create a list of None values, but I don't care about that, since calc_spec_pump_height is supposed to manipulate the object, not return something useful.)
In essence: I'm iterating over a 2D list, testing each object for a certain characteristic and then do something with that object.
Which of the above methods is 'the best'? Or is there another way that I'm missing?
The key thing about functions/methods is that they should do one thing.
calc_pump_height implements two things: it finds elements in a 2D list that match some criteria, and then it calculates a value for each of those elements. It's OK for its purpose to be the combination of those two operations, if that makes sense for the object's public API, but it's not OK for it to implement either or both of them itself.
Finding the elements that match the criteria is a discrete step; that should be a function.
Calculating your value is clearly a discrete step; that should be a function.
I would implement the element matcher as a (private) generator, that takes the test condition as an argument, and yields all matching elements. It's just an iterator over your data structure, masked by the logical test. You can wrap that in a named public method called get_1_4_subkinds() or something that makes more sense in your domain. That generalises the code and gives you the flexibility to implement other conditions in the future. Also, your i and j are tightly coupled, so it makes sense to pass them around as a single concept. Then your code becomes:
def calc_pump_height(self):
    for subkind_indices in self.get_1_4_subkinds():
        self.calc_spec_pump_height(subkind_indices)
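And a rough sketch of the two helper methods, meant to sit on the same class as calc_pump_height (attribute names taken from the question):

def _matching_connections(self, condition):
    # Private generator: yields (i, j) index pairs whose connection satisfies condition.
    for i in range(len(self.primary_)):
        for j in range(len(self.primary_)):
            if condition(self.connections_[i][j]):
                yield (i, j)

def get_1_4_subkinds(self):
    # Public, domain-named wrapper around the generic matcher.
    return self._matching_connections(lambda conn: conn.sub_kind_ in (1, 4))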
You have misunderstood “simplicity”:
write the extra code directly into the first function, getting rid of one function (Simple is better than complex)
That's not simple. Breaking complex sequences into discrete, focussed functions increases simplicity.
In that light, I would say that yes, you should definitely prefer calc_spec_pump_height as a separate function.
You can eliminate one level of nesting in your first function by using itertools.product to generate your i and j values at the same time: itertools.product(range(len(self.primary_)), repeat=2). The zip you use in your second version won't work correctly; it will only yield identical pairs: 0,0, 1,1, 2,2, etc.
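That would look roughly like this (a sketch of the questioner's method, not a tested drop-in replacement):

import itertools

def calc_pump_height(self):
    indices = range(len(self.primary_))
    for i, j in itertools.product(indices, repeat=2):
        if self.connections_[i][j].sub_kind_ in (1, 4):
            self.calc_spec_pump_height(i, j)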
As for the overall design, you should not use a list comprehension if you don't care about the return value from the function you're calling. Use an explicit loop when it's the looping you want (rather than a list of computed values).
If there's a non-trivial amount of code that will go in calc_spec_pump_height, it makes perfect sense to make it as a separate method. If it's a one or two liner, then it might be OK to inline within calc_pump_height, but that method's loops and condition testing may be complicated enough already to justify factoring out the inner part of the algorithm.
You should usually think about splitting a big function up when it is too long to fit onto a single screen in your editor. That is about the limit of how many details (variable names, etc.) we can keep in our mind simultaneously. On the other hand, you shouldn't waste time (either your own programming time or function call overhead at run time) by factoring out every little piece of every problem. Factor part of a function out if you're using it from more than one place, or if you can't keep the details of the whole function in your head at once otherwise.
So, other than the (marginal) improvement of itertools.product and given the limited information you've provided about what calc_spec_pump_height will do, I think your code is already about as good as it can get!