Understanding len function with iterators

Reading the documentation I have noticed that the built-in function len doesn't support all iterables but just sequences and mappings (and sets). Before reading that, I always thought that the len function used the iteration protocol to evaluate the length of an object, so I was really surprised reading that.
I read the already-posted questions (here and here) but I am still confused; I still don't get the real reason why len isn't allowed to work with all iterables in general.
Is it a more conceptual/logical reason than an implementational one? I mean, when I ask for the length of an object, I'm asking for one property (how many elements it has), a property that objects such as generators don't have, because they do not have elements inside; they produce elements.
Furthermore, generator objects can yield infinitely many elements, leading to an undefined length, something that cannot happen with other objects such as lists, tuples, dicts, etc.
So am I right, or are there more insights/something more that I'm not considering?

The biggest reason is that it reduces type safety.
How many programs have you written where you actually needed to consume an iterable just to know how many elements it had, throwing away anything else?
I, in quite a few years of coding in Python, never needed that. It's a nonsensical operation in normal programs. An iterator may not have a length (e.g. infinite iterators or generators that expect inputs via send()), so asking for it doesn't make much sense. The fact that len(an_iterator) produces an error means that you can find bugs in your code. You can see that in a certain part of the program you are calling len on the wrong thing, or maybe your function actually needs a sequence instead of an iterator as you expected.
Removing such errors would create a new class of bugs where people, calling len, erroneously consume an iterator, or use an iterator as if it were a sequence without realizing.
If you really need to know the length of an iterator, what's wrong with len(list(iterator))? The extra 6 characters? It's trivial to write your own version that works for iterators, but, as I said, 99% of the time this simply means that something with your code is wrong, because such an operation doesn't make much sense.
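For the rare case where counting really is what you want, a minimal sketch of such a helper could look like this. The name count_items is made up for illustration; unlike the built-in len, it consumes whatever iterator it is given.

```python
def count_items(iterable):
    """Count the items of any iterable by consuming it (hypothetical helper)."""
    return sum(1 for _ in iterable)


squares = (n * n for n in range(5))

try:
    len(squares)  # generators define no __len__, so this raises
except TypeError as exc:
    print("len failed:", exc)

# The failed len call did not advance the generator, so all items are counted.
print(count_items(squares))  # 5
```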
The second reason is that, with that change, you are violating two nice properties of len that currently hold for all (known) containers:
It is known to be cheap on all containers ever implemented in Python (all built-ins, standard library, numpy & scipy and all other big third party libraries do this on both dynamically sized and static sized containers). So when you see len(something) you know that the len call is cheap. Making it work with iterators would mean that suddenly all programs might become inefficient due to computations of the length.
Also note that you can, trivially, implement O(1) __len__ on every container. The cost to pre-compute the length is often negligible, and generally worth paying.
The only exception would be if you implement immutable containers that have part of their internal representation shared with other instances (to save memory). However, I don't know of any implementation that does this, and most of the time you can achieve better than O(n) time anyway.
In summary: currently everybody implements __len__ in O(1) and it's easy to continue to do so. So there is an expectation for calls to len to be O(1). Even if it's not part of the standard. Python developers intentionally avoid C/C++'s style legalese in their documentation and trust the users. In this case, if your __len__ isn't O(1), it's expected that you document that.
It is known to be not destructive. Any sensible implementation of __len__ doesn't change its argument. So you can be sure that len(x) == len(x), or that n = len(x);len(list(x)) == n.
Even this property is not defined in the documentation, however it's expected by everyone, and currently, nobody violates it.
Such properties are good, because you can reason and make assumptions about code using them.
They can help you ensure the correctness of a piece of code, or understand its asymptotic complexity. The change you propose would make it much harder to look at some code and understand whether it's correct or what its complexity would be, because you have to keep in mind the special cases.
In summary, the change you are proposing has one, really small, pro: saving a few characters in very particular situations, but it has several, big, disadvantages which would impact a huge portion of existing code.
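A small sketch of the non-destructive property, and of how it would break if measuring a length meant consuming an iterator. The variable names are illustrative only.

```python
items = [1, 2, 3]
n = len(items)

# On a real container, asking for the length changes nothing.
same_on_repeat = len(items) == n
same_after_copy = len(list(items)) == n
print(same_on_repeat, same_after_copy)  # True True

# Measuring an iterator by consuming it gives a different answer the second time.
it = iter(items)
first = len(list(it))
second = len(list(it))
print(first, second)  # 3 0
```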
Another, minor reason: if len consumed iterators, I'm sure that some people would start to abuse this for its side effects (replacing the already ugly use of map or list comprehensions). Suddenly people could write code like:
len(print(something) for ... in ...)
to print text, which is really just ugly. It doesn't read well. Stateful code should be relegated to statements, since they provide a visual cue of side effects.

Related

Why hasn't iter.remove been implemented in python dicts?

Is there a good reason that iter.remove() is not currently implemented in python dicts?
Let us say I need to remove about half the elements in a set/dictionary. Then I'm forced to either:
Copy the entire set/dictionary (n space, n time)
Iterate over the copy to find elements to remove, remove it from the original dictionary (n/2 plus n/2 distinct lookups)
Or:
Iterate over the dictionary, add elements to remove to a new set (n space, n time)
Iterate over the new set, removing each element from the original dictionary (n/2 plus n/2 lookups)
While asymptotically everything is still "O(n)" time, this is horribly inefficient and about 3 times as slow when compared to the sane way of doing this:
Iterate over the dict, removing what you don't want as you go. This is truly n time, and O(1) space.
At least under the common implementation of hash sets as buckets of linked lists, the iterator should be able to remove the element it just visited without making a new lookup, by simply removing the node in the linked list.
More importantly, the bad solution also requires O(n) space, which really is bad even for those who tend to dismiss these kinds of optimization concerns in python.

In your comparison, you made two big mistakes. First, you neglected to even consider the idiomatic "don't delete anything, copy half the dict" option. Second, you didn't realize that deleting half the entries in a hash table at 2/3 load leaves you with a hash table of the exact same size at 1/3 load.
So, let's compare the actual choices (I'll ignore the 2/3 load to be consistent with your n/2 measures). For each one, there's the peak space, the final space, and the time:
2.0n, 1.0n, 1.5n: Copy, delete half the original
2.0n, 1.0n, 1.5n: Copy, delete half the copy
1.5n, 1.0n, 1.5n: Build a deletions set, then delete
1.0n, 1.0n, 0.5n: Delete half in-place
1.5n, 0.5n, 1.0n: Delete half in-place, then compact
1.5n, 0.5n, 0.5n: Copy half
So, your proposed design would be worse than what we already do idiomatically. Either you're doubling the final (permanent) space just to save an equivalent amount of transient space, or you're taking twice as long for the same space.
And meanwhile, building a new dictionary, especially if you use a comprehension, means:
Effectively non-mutating (automatic thread/process safety, referential transparency, etc.).
Fewer places to make "small" mistakes that are hard to detect and debug.
Generally more compact and more readable.
Semantically restricted looping, dict building, and exception handling provides opportunities for optimization (which CPython takes; typically a comprehension is about 40% faster than an explicit loop).
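To make the comparison concrete, here is a rough sketch that removes half the entries both ways. The predicate keep and the variable names are invented for the example, and the sys.getsizeof figures are a CPython implementation detail that can differ between versions, so treat the printed sizes as indicative only.

```python
import sys


def keep(key):
    """Hypothetical predicate: keep even keys only."""
    return key % 2 == 0


original = {k: str(k) for k in range(1000)}

# Idiomatic option: build a new dict holding only the wanted entries.
rebuilt = {k: v for k, v in original.items() if keep(k)}

# In-place option: copy the keys first, then delete from a working dict.
in_place = dict(original)
for k in list(in_place):
    if not keep(k):
        del in_place[k]

print(rebuilt == in_place)  # True: same contents either way
# Deleting entries does not hand the table's memory back, so the in-place
# dict is typically no smaller than the freshly built one.
print(sys.getsizeof(in_place), sys.getsizeof(rebuilt))
```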
For more information on how dictionaries are implemented in CPython, look at the source, which is comprehensively documented, and mostly pretty readable even if you're not a C expert.
If you think about how things work, some of the choices you assumed should obviously go the other way—e.g., consider that Python only stores references in containers, not actual values, and avoids malloc overhead wherever possible, so what are the odds that it would use chaining instead of open addressing?
You may also want to look at the PyPy implementation, which is in Python and has more clever tricks.
Before I respond to all of your comments, you should keep in mind that StackOverflow is not where Python changes get considered or made. If you really think something should be changed, you should post it on python-ideas, python-dev, and/or the bugs site. But before you do: You're pretty clearly still using 2.x; if you're not willing to learn 3.x to get any of the improvements or optimizations made over the past half-decade, nobody over there is going to take you seriously when you suggest additional changes. Also, familiarize yourself with the constructs you want to change; as soon as you start arguing on the basis of Python dicts probably using chaining, the only replies you're going to get will be corrections. Anyway:
Please explain to me how 'Delete half in place' takes 1.0n space and adds 1.0n space to the final space.
I can't explain something I didn't say and that isn't true. There's no "adds" anywhere. My numbers are total peak space and total final space. Your algorithm is clearly 1.0n for each. Which sounds great, until you compare it to the last two options, which have 0.5n total final space.
As your arguments in favor of not providing to the programmer the option of delete in place,
The argument not to make a change is never "that change is impossible", and rarely "that change is inherently bad", but usually "the costs of that change outweigh the benefits". The costs are obvious: there's the work involved; the added complexity of the language and each implementation; more differences between Python versions; potential TOOWTDI violations or attractive nuisances; etc. None of those things mean no change can go in; almost every change ever made to Python had almost all of those costs. But if the benefits of a change aren't worth the cost, it's not worth changing. And if the benefits are less than they initially appear because your hoped-for optimization (a) is actually a pessimization, and (b) would require giving up other benefits to use even if it weren't, that puts you a lot farther from the bar.
Also, I'm not sure, but it sounds like you believe that the idea of there being an obvious, one way to do things, and having a language designed to encourage that obvious way when possible, constitutes Python being a "nanny". If so, then you're seriously using the wrong language. There are people who hate Python for trying to get them to do things the Pythonic way, but those people are smart enough not to use Python, much less try to change it.
Your fourth point, which echoes the one presented in the mailing list about the issue, could easily be fixed … by simply providing a 'for (a,b) in mydict.iteritems() as iter', in the same way as it is currently done for file handles in a 'with open(...) as filehandle' context.
How would that "fix" anything? It sounds like the exact same semantics you could get by writing it = iter(mydict.items()) then for (a, b) in it:. But whatever the semantics are, how would they provide the same, or equivalent, easy opportunities for compiler optimization that comprehensions provide? In a comprehension, there is only one place in the scope that you can return from. It always returns the top value already on the stack. There is guaranteed to be no exception handling in the current scope except a stereotyped StopIteration handler. There is a very specific sequence of events in building the list/set/dict that makes it safe to use generally-unsafe and inflexible opcodes that short-circuit the usual behavior. How are you expecting to get any of those optimizations, much less all of them?
"Either you're doubling the final (permanent) space just to save an equivalent amount of transient space, or you're taking twice as long for the same space." Please explain how you think this works.
This works because 1.0 is double 0.5. More concretely, a hash table that's expanded to n elements and is now at about 1/3 load is twice as big as a hash table that's expanded to n/2 elements and is now at about 2/3 load. How is this not clear?
Delete in place takes O(1) space
OK, if you want to count extra final space instead of total final space, then yes, we can say that delete in place takes 0.0n space, and copying half takes -0.5n. Shifting the zero point doesn't change the comparison.
and none of the options can take less than 1.0n time
Sorry, this was probably unclear, because here I was talking about added cost, and probably shouldn't have been, and didn't mention it. But again, changing the scale or the zero point doesn't make any difference. It clearly takes just as much time to delete 0.5n keys from one dict as it does to add 0.5n keys to another one, and all of the other steps are identical, so there is no time difference. Whether you call them both 0.5n or both 1.0n, they're still equal.
The reason I didn't consider only copying half the dictionary, is that the requirement is to actually modify the dictionary, as is clearly stated.
No, it isn't clearly stated. All you said is "I need to remove about half the elements in a set/dictionary". In 99% of use cases, d = {k: v for k, v in d.items() if pred(k)} is the way to write that. And many of the cases people come up with where that isn't true ("but I need the background thread to see the changes immediately") are actively bad ideas. Of course there are some counterexamples, but you can't expect people to just assume you had one when you didn't even give a hint that you might.
But also, the final space of that is 1.5n, not .5n
No it isn't. The original hash table is garbage, so it gets cleaned up, so the final space is just the new, half-sized hash table. (If that isn't true, then you actually still need the original dict alongside the new one, in which case you had no choice but to copy in the first place.)
And if you're going to say, "Yeah, but until it gets cleaned up"—yes, that's why the peak space is 1.5n instead of 1.0n, because there is some non-zero time that both hash tables are alive.

There is another approach:
for key in list(mydict.keys()):  # copy the keys so the dict can be mutated
    val = mydict[key]
    if should_drop(val):  # should_drop stands in for your own predicate
        mydict.pop(key)
Which could be explained as:
Copy the keys of the original dictionary
Iterate the dictionary through individual lookups
Delete elements when required
I suspect that the overhead of individual lookups will be too high compared to straightforward iteration. But I am curious (and have not tested it yet).
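One way to test this would be a small timeit harness like the sketch below. The function names and the is_odd predicate are invented for the example, and no timings are claimed here; the numbers depend on the machine and the Python version.

```python
import timeit


def drop_via_key_copy(d, should_drop):
    """Copy the keys, then look up and pop entries one by one."""
    for key in list(d.keys()):
        if should_drop(d[key]):
            d.pop(key)
    return d


def drop_via_comprehension(d, should_drop):
    """Build a new dict that holds only the entries to keep."""
    return {k: v for k, v in d.items() if not should_drop(v)}


def is_odd(value):
    """Hypothetical drop rule used only for this measurement."""
    return value % 2 == 1


data = {i: i for i in range(100_000)}

for func in (drop_via_key_copy, drop_via_comprehension):
    # dict(data) gives each run a fresh copy, so the in-place version
    # always starts from the full dictionary.
    seconds = timeit.timeit(lambda: func(dict(data), is_odd), number=5)
    print(f"{func.__name__}: {seconds:.3f}s")
```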

Python complexity reference?

Is there any Python complexity reference? In cppreference, for example, for many functions (such as std::array::size or std::array::fill) there's a complexity section which describes their running complexity, in terms of linear in the size of the container or constant.
I would expect the same information to appear in the python website, perhaps, at least for the CPython implementation. For example, in the list reference, in list.insert I would expect to see complexity: linear; I know this case (and many other container-related operations) is covered here, but many other cases are not. Here are a few examples:
What is the complexity of tuple.__le__? It seems like when comparing two tuples of size n, k, the complexity is about O(min(n,k)) (however, for small n's it looks different).
What is the complexity of random.shuffle? It appears to be O(n). It also appears that the complexity of random.randint is O(1).
What is the complexity of the __format__ method of strings? It appears to be linear in the size of the input string; however, it also grows when the number of relevant arguments grow (compare ("{0}"*100000).format(*(("abc",)*100000)) with ("{}"*100000).format(*(("abc",)*100000))).
I'm aware that (a) each of these questions may be answered by itself, (b) one may look at the code of these modules (even though some are written in C), and (c) StackExchange is not a python mailing list for user requests. So: this is not a doc-feature request, just a question of two parts:
Do you know if such a resource exists?
If not, do you know what is the place to ask for such, or can you suggest why I don't need such?

CPython is pretty good about its algorithms, and the time complexity of an operation is usually just the best you would expect of a good standard library.
For example:
Tuple ordering has to be O(min(n,m)), because it works by comparing element-wise.
random.shuffle is O(n), because that's the complexity of the modern Fisher–Yates shuffle.
.format I imagine is linear, since it only requires one scan through the template string. As for the difference you see, CPython might just be clever enough to cache the same format code used twice.
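The tuple case can be observed directly by counting element comparisons. The sketch below uses a made-up Counted wrapper class; the counts reflect how CPython compares tuples (scan for the first unequal pair, then order that pair), which is an observation about the implementation rather than a documented complexity guarantee.

```python
class Counted:
    """Wrapper that counts how often == is evaluated (illustrative only)."""

    eq_calls = 0

    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        Counted.eq_calls += 1
        return self.value == other.value

    def __lt__(self, other):
        return self.value < other.value


def eq_calls_for(a, b):
    """Return how many element equality checks the comparison a < b performs."""
    Counted.eq_calls = 0
    a < b
    return Counted.eq_calls


n = 100
base = tuple(Counted(0) for _ in range(n))
differs_first = (Counted(1),) + tuple(Counted(0) for _ in range(n - 1))
differs_last = tuple(Counted(0) for _ in range(n - 1)) + (Counted(1),)

print(eq_calls_for(base, differs_first))  # 1: stops at the first mismatch
print(eq_calls_for(base, differs_last))   # 100: scans the whole tuple
```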
The docs do mention time complexity, but generally only when it's not what you would expect — for example, because a deque is implemented with a doubly-linked list, it's explicitly mentioned as having O(n) for indexing in the middle.
Would the docs benefit from having time complexity called out everywhere it's appropriate? I'm not sure. The docs generally present builtins by what they should be used for and have implementations optimized for those use cases. Emphasizing time complexity seems like it would either be useless noise or encourage developers to second-guess the Python implementation itself.

copy.deepcopy or create a new object?

I'm developing a real-time application and sometimes I need to create instances to new objects always with the same data.
First, I did it just instantiating them, but then I realised maybe with copy.deepcopy it would be faster. Now, I find people who say deepcopy is horribly slow.
I can't do a simple copy.copy because my object has lists.
My question is, do you know a faster way or I just need to give up and instantiate them again?
Thank you for your time

I believe copy.deepcopy() is still pure Python, so it's unlikely to give you any speed boost.
It sounds to me a little like a classic case of early optimisation. I would suggest writing your code to be intuitive which, in my opinion, is simply instantiating each object. Then you can profile it and see where savings need to be made, if anywhere. It may well be in your real-world use-case that some completely different piece of code will be a bottleneck.
EDIT: One thing I forgot to mention in my original answer - if you're copying a list, make sure you use slice notation (new_list = old_list[:]) rather than iterating through it in Python, which will be slower. This won't do a deep copy, however, so if your lists have other lists or dictionaries you'll need to use deepcopy(). For dict objects, use the copy() method.
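A short illustration of the difference, with made-up variable names: the slice copies only the outer list, so nested lists stay shared, while copy.deepcopy duplicates them too.

```python
import copy

old_list = [[1, 2], [3, 4]]

shallow = old_list[:]            # new outer list, same inner lists
deep = copy.deepcopy(old_list)   # new outer list and new inner lists

old_list[0].append(99)           # mutate a nested list through the original

print(shallow[0])  # [1, 2, 99]: the shallow copy sees the change
print(deep[0])     # [1, 2]: the deep copy does not
```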
If you still find constructing your objects is what's taking the time, you can then consider how to speed it up. You could experiment with __slots__, although they're typically about saving memory more than CPU time so I doubt they'll buy you much. In the extreme case, you can push your object out to a C extension module, which is likely to be a lot faster at the expense of increased complexity. This is always the approach I've taken in the past, where I use native C data structures under the hood and use Python's special methods to wrap a "list-like" or "dict-like" interface on top. This does rely on you being happy with coding in C, of course.
(As an aside I would avoid C++ unless you have a compelling reason, C++ Python extensions are slightly more fiddly to get building than plain C - it's entirely possible if you have a good motivation, however)
If your object has, for example, very long lists then you might get some mileage out of a sort of copy-on-write approach, where clones of objects just keep the same references instead of copying the lists. Every time you access them you could use sys.getrefcount() to see if it's safe to update in-place or whether you need to take a copy. This approach is likely to be error-prone and overly complex, but I thought I'd mention it for interest.
You could also look at your object hierarchy and see whether you can break objects up such that parts which don't need to be duplicated can be shared between other objects. Again, you'd need to take care when modifying such shared objects.
The important point, however, is that you first want to get your code correct and then make your code fast once you understand the best ways to do that from real world usage.

If I am sorting a sequence using the functional paradigm, isn't making copies wasteful?

Goal: sorting a sequence in a functional way without using builtin sorted(..) function.
def my_sorted(seq):
    """returns an iterator"""
    pass
Motivation: In the FP way, I am constrained:
never mutate seq (which could be an iterator or a realized list)
By implication, no in-place sorting.
Question 1 Since I cannot mutate seq, I would need to maintain a separate mutable data structure to store the sorted sequence. That seems wasteful compared to an in-place list.sort(). How do other functional programming languages handle this ?
Question 2 If I return a mutable sequence, is that ok in the functional paradigm?

Of course sorting cannot be totally lazy (the last element of the input could be the first one in the output), but you could implement a computationally lazy sort that, after reading the whole sequence, only generates the exact sorted output on request, element by element. You can also delay reading the input until at least one output is requested, so sorting and then ignoring the result will require no computation.
For this computationally lazy approach the best candidate I know is the heapsort algorithm (you only do the heap-building step upfront).
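A minimal sketch of that idea using the standard heapq module. The name lazy_sorted is invented here; because it is a generator, nothing is read from the input until the first element is requested, and the caller's sequence is never mutated.

```python
import heapq
from itertools import islice


def lazy_sorted(iterable):
    """Yield items in ascending order, doing the work only on demand."""
    heap = list(iterable)           # private copy; runs on the first next()
    heapq.heapify(heap)             # heap-building step, O(n)
    while heap:
        yield heapq.heappop(heap)   # O(log n) per element actually requested


data = [5, 1, 4, 2, 3]
smallest_two = list(islice(lazy_sorted(data), 2))

print(smallest_two)  # [1, 2]
print(data)          # [5, 1, 4, 2, 3]: the input is untouched
```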

Mutation in-place is only safe if no one else has references to the data, expecting it to be as it was prior to the sort. So it isn't really wasteful to have a new structure for the sorted results, in general. The in-place optimization is only safe if you're using the data in a linear fashion.
So, just allocate a new structure, since that is more generally useful. The in-place version is a special case.

The appropriate defensive programming is wasteful at times, but there's also nothing you can do about it.
This is why languages built to support functional use from the ground up use structural sharing for their natively immutable types; programming in a functional style in a language which isn't built for it (such as Python) isn't going to be as well-supported as a matter of course. That said, a sort operation isn't necessarily a good candidate for structural sharing (if more than minor changes need to be made).
As such, there often is at least one copy operation involved in a sort, even in other functional languages. Clojure, for instance, delegates to Java's native (highly optimized) sort operation on a temporary mutable array, and returns a seq wrapping that array (thus making the result just as immutable as the input which was used to populate it). If the inputs are immutable, and the outputs are immutable, and what happens in between isn't visible to the outside world (particularly, to any other thread), transient mutability is often a necessary and appropriate thing.

Use a sorting algorithm that can be performed in a manner that creates a new datastructure, such as heapsort or mergesort.
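As one possible illustration, here is a sketch of a merge sort that never touches its input and returns a fresh list. The name merge_sorted is hypothetical; the only mutation is on local lists that are not visible to the caller until they are returned.

```python
def merge_sorted(seq):
    """Return a new sorted list; seq itself is left unmodified."""
    items = list(seq)
    if len(items) <= 1:
        return items
    mid = len(items) // 2
    left = merge_sorted(items[:mid])
    right = merge_sorted(items[mid:])

    # Merge the two sorted halves into a new list.
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if right[j] < left[i]:
            merged.append(right[j])
            j += 1
        else:
            merged.append(left[i])  # ties take the left item, keeping the sort stable
            i += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


source = (3, 1, 2)
print(merge_sorted(source))  # [1, 2, 3]
print(source)                # (3, 1, 2)
```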

Wasteful of what? bits? electricity? wall-clock time? A parallel merge-sort may be the quickest to complete if you have enough cpus and a large amount of data, but may produce many intermediary representations.
In general, parallelising an algorithm may lead to a very different optimisation strategy than a serial algorithm. For instance, due to Amdahl's Law, re-performing redundant work locally to avoid sharing. This may be considered "wasteful" in a serial context, but leads to a much more scalable algorithm.

Py3k memory conservation by returning iterators rather than lists

Many methods that used to return lists in Python 2.x now seem to return iterators in Py3k
Are iterators also generator expressions? Lazy evaluation?
Thus, with this the memory footprint of python is going to reduce drastically. Isn't it?
What about for the programs converted from 2to3 using the builtin script?
Does the builtin tool explicitly convert all the returned iterators into lists, for compatibility? If so then the lower memory footprint benefit of Py3k is not really apparent in the converted programs. Is it?

Many of them are not exactly iterators, but special view objects. For instance range() now returns something similar to the old xrange object - it can still be indexed, but lazily constructs the integers as needed.
Similarly dict.keys() gives a dict_keys object implementing a view on the dict, rather than creating a new list with a copy of the keys.
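Both behaviours are easy to check in a Python 3 interpreter. The sketch below uses arbitrary example values.

```python
# A range supports indexing and len without storing its integers.
r = range(10**9)
print(r[5], len(r))  # 5 1000000000

# A keys view tracks the dict instead of copying it.
d = {"a": 1}
keys = d.keys()
d["b"] = 2
print(list(keys))    # ['a', 'b']
```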
How this affects memory footprints probably depends on the program. Certainly there's more of an emphasis towards using iterators unless you really need lists, whereas using lists was generally the default case in python2. That will cause the average program to probably be more memory efficient. Cases where there are really big savings are probably going to already be implemented as iterators in python2 programs however, as really large memory usage will stand out, and is more likely to be already addressed. (eg. the file iterator is already much more memory efficient than the older file.readlines() method)
Converting is done by the 2to3 tool, and will generally convert things like range() to iterators where it can safely determine a real list isn't needed, so code like:
for x in range(10): print x
will switch to the new range() object, no longer creating a list, and so will obtain the reduced memory benefit, but code like:
x = range(20)
will be converted as:
x = list(range(20))
as the converter can't know if the code expects a real list object in x.

Are iterators also generator expressions? Lazy evaluation?
An iterator is just an object with a next method (spelled __next__ in Python 3). What the documentation means most of the time when saying that a function returns an iterator is that its result is produced lazily.
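A minimal sketch of a hand-written iterator, using the Python 3 spelling __next__. The Countdown class is made up for the example; each value is computed only when it is asked for.

```python
class Countdown:
    """Iterator that yields start, start - 1, ..., 1 (illustrative only)."""

    def __init__(self, start):
        self.current = start

    def __iter__(self):
        return self  # an iterator returns itself from __iter__

    def __next__(self):
        if self.current <= 0:
            raise StopIteration  # signals that the iterator is exhausted
        value = self.current
        self.current -= 1
        return value


print(list(Countdown(3)))  # [3, 2, 1]
```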
Thus, with this the memory footprint of python is going to reduce drastically. Isn't it?
It depends. I'd guess that the average program wouldn't notice a huge difference though. The performance advantages of iterators over lists is really only significant if you have a large dataset. You may want to see this question.

One of the biggest benefits of iterators over lists isn't memory, it is actually computation time. For instance, in Python 2:
for i in range(1000000):  # spend a bunch of time making a big list
    if i == 0:
        break  # Building the list was a waste since we only looped once
Now take for instance:
for i in xrange(1000000):  # starts loop almost immediately
    if i == 0:
        break  # we didn't waste time even if we break early
Although the example is contrived, the use case isn't: loops are often broken out of mid-way. Building an entire list to only use part of it is a waste unless you are going to use it more than once. If that is the case, you can explicitly build a list: r = list(range(100)). This is why iterators are the default in more places in Python 3; you aren't out anything since you can still explicitly create lists (or other containers) when you need. But you aren't forced to when all you plan to do is iterate over an iterable once (which I would argue is the much more common case).
