In Python 2.x:
>>> '' > 0
True
Why is that?
The original design motivation for allowing order-comparisons of arbitrary objects was to allow sorting of heterogeneous lists -- usefully, that would put all strings next to each other in alphabetical order, and all numbers next to each other in numerical order, although which of the two blocks came first was not guaranteed by the language. For example, this allowed getting only unique items in any list (even one with non-hashable items) in O(N log N) worst-case time
Over the years, this pragmatic arrangement eroded. The first crack came when the ability to order-compare complex numbers was taken away, quite a few versions ago. Suddenly, the ability to sort any list disappeared: it did not apply any more if the list contained complex numbers, possibly together with items of other types. Then Guido started disliking heterogeneous lists more generally, and thus started thinking that it didn't really matter if such lists could be usefully sorted or not... because such lists should not exist in the first place, according to his new thinking. He didn't do anything to forbid them, but was not inclined to accept any compromises to support them either.
Note that both changes move the balance a little bit away from the "practicality beats purity" item of the Zen of Python (which was written earlier, back when complex numbers still could be order-compared ;-) – a bit more purity, a bit less practicality.
Nevertheless the ability to order-compare two arbitrary objects (as long as neither was a complex number ;-) remained for a long time, because around that same time Guido started really insisting on maintaining strong backwards compatibility (a shift that's both practical and pure ;-).
So, it's only in Python 3, which explicitly and deliberately removed the constraint of strong backwards compatibility to allow some long-desired but backwards incompatible enhancements (especially simplifications and removal of obsolete, redundant way to perform certain tasks), that order comparison of instances of different types became an error.
So this historical and philosophical treatise is basically the only way to truly respond to your "why" question...! :-)
from https://docs.python.org/2.7/tutorial/datastructures.html#id1
Note that comparing objects of different types is legal. The outcome
is deterministic but arbitrary: the types are ordered by their name.
Thus, a list is always smaller than a string, a string is always
smaller than a tuple, etc. [1] Mixed numeric types are compared
according to their numeric value, so 0 equals 0.0, etc.
Related
I have a really big, like huge, Dictionary (it isn't really but pretend because it is easier and not relevant) that contains the same strings over and over again. I have verified that I can store a lot more in memory if I do poor man's compression on the system and instead store INTs that correspond to the string.
animals = ['ape','butterfly,'cat','dog']
exists in a list and therefore has an index value such that animals.index('cat') returns 2
This allows me to store in my object BobsPets = set(2,3)
rather than Cat and Dog
For the number of items the memory savings are astronomical. (Really Don't try and dissuade me that is well tested.
Currently I then convert the INTs back to Strings with a FOR loop
tempWordList = set()
for IntegOfIndex in TempSet:
tempWordList.add(animals[IntegOfIndex])
return tempWordList
This code works. It feels "Pythonic," but it feels like there should be a better way. I am in Python 2.7 on AppEngine if that matters. It may since I wonder if Numpy has something I missed.
I have about 2.5 Million things in my object, and each has an average of 3 of these "pets" and there 7500-ish INTs that represent the pets. (no they aren't really pets)
I have considered using a Dictionary with the position instead of using Index. This doesn't seem faster, but am interested if anyone thinks it should be. (it took more memory and seemed to be the same speed or really close)
I am considering running a bunch of tests with Numpy and its array's rather than lists, but before I do, I thought I'd ask the audience and see if I would be wasting time on something that I have already reached the best solution for.
Last thing, The solution should be pickable since I do that for loading and transferring data.
It turns out that since my list of strings is fixed, and I just wish the index of the string, I am building what is essentially an index array that is immutable. Which is in short a Tuple.
Moving to a Tuple rather than a list gains about 30% improvement in speed. Far more than I would have anticipated.
The bonus is largest on very large lists. It seems that each time you cross a bit threshold the bonus increases, so in sub 1024 lists their is basically no bonus and at a million there is pretty significant.
The Tuple also uses very slightly less memory for the same data.
An aside, playing with the lists of integers, you can make these significantly smaller by using a NUMPY array, but the advantage doesn't extend to pickling. The Pickles will be about 15% larger. I think this is because of the object description being stored in the pickle, but I didn't spend much time looking.
So in short the only change was to make the Animals list a Tuple. I really was hoping the answer was something more exotic.
Reading the documentation I have noticed that the built-in function len doesn't support all iterables but just sequences and mappings (and sets). Before reading that, I always thought that the len function used the iteration protocol to evaluate the length of an object, so I was really surprised reading that.
I read the already-posted questions (here and here) but I am still confused, I'm still not getting the real reason why not allow len to work with all iterables in general.
Is it a more conceptual/logical reason than an implementational one? I mean when I'm asking the length of an object, I'm asking for one property (how many elements it has), a property that objects as generators don't have because they do not have elements inside, the produce elements.
Furthermore generator objects can yield infinite elements bring to an undefined length, something that can not happen with other objects as lists, tuples, dicts, etc...
So am I right, or are there more insights/something more that I'm not considering?
The biggest reason is that it reduces type safety.
How many programs have you written where you actually needed to consume an iterable just to know how many elements it had, throwing away anything else?
I, in quite a few years of coding in Python, never needed that. It's a non-sensical operation in normal programs. An iterator may not have a length (e.g. infinite iterators or generators that expects inputs via send()), so asking for it doesn't make much sense. The fact that len(an_iterator) produces an error means that you can find bugs in your code. You can see that in a certain part of the program you are calling len on the wrong thing, or maybe your function actually needs a sequence instead of an iterator as you expected.
Removing such errors would create a new class of bugs where people, calling len, erroneously consume an iterator, or use an iterator as if it were a sequence without realizing.
If you really need to know the length of an iterator, what's wrong with len(list(iterator))? The extra 6 characters? It's trivial to write your own version that works for iterators, but, as I said, 99% of the time this simply means that something with your code is wrong, because such an operation doesn't make much sense.
The second reason is that, with that change, you are violating two nice properties of len that currently hold for all (known) containers:
It is known to be cheap on all containers ever implemented in Python (all built-ins, standard library, numpy & scipy and all other big third party libraries do this on both dynamically sized and static sized containers). So when you see len(something) you know that the len call is cheap. Making it work with iterators would mean that suddenly all programs might become inefficient due to computations of the length.
Also note that you can, trivially, implement O(1) __len__ on every container. The cost to pre-compute the length is often negligible, and generally worth paying.
The only exception would be if you implement immutable containers that have part of their internal representation shared with other instances (to save memory). However, I don't know of any implementation that does this, and most of the time you can achieve better than O(n) time anyway.
In summary: currently everybody implements __len__ in O(1) and it's easy to continue to do so. So there is an expectation for calls to len to be O(1). Even if it's not part of the standard. Python developers intentionally avoid C/C++'s style legalese in their documentation and trust the users. In this case, if your __len__ isn't O(1), it's expected that you document that.
It is known to be not destructive. Any sensible implementation of __len__ doesn't change its argument. So you can be sure that len(x) == len(x), or that n = len(x);len(list(x)) == n.
Even this property is not defined in the documentation, however it's expected by everyone, and currently, nobody violates it.
Such properties are good, because you can reason and make assumptions about code using them.
They can help you ensure the correctness of a piece of code, or understand its asymptotic complexity. The change you propose would make it much harder to look at some code and understand whether it's correct or what would be it's complexity, because you have to keep in mind the special cases.
In summary, the change you are proposing has one, really small, pro: saving few characters in very particular situations, but it has several, big, disadvantages which would impact a huge portion of existing code.
An other minor reason. If len consumes iterators I'm sure that some people would start to abuse this for its side-effects (replacing the already ugly use of map or list-comprehensions). Suddenly people can write code like:
len(print(something) for ... in ...)
to print text, which is really just ugly. It doesn't read well. Stateful code should be relagated to statements, since they provide a visual cue of side-effects.
Is there a good reason that iter.remove() is not currently implemented in python dicts?
Let us say I need to remove about half the elements in a set/dictionary. Then I'm forced to either:
Copy the entire set/dictionary (n space, n time)
Iterate over the copy to find elements to remove, remove it from the original dictionary (n/2 plus n/2 distinct lookups)
Or:
Iterate over the dictionary, add elements to remove to a new set (n space, n time)
Iterate over the new set, removing each element from the original dictionary (n/2 plus n/2 lookups)
While asymptotically everything is still "O(n)" time, this is horribly inefficient and about 3 times as slow when compared to the sane way of doing this:
Iterate over the dict, removing what you don't want as you go. This is truly n time, and O(1) space.
At least under the common implementation of hash sets as buckets of linked lists, the iterator should be able to remove the element it just visited without making a new lookup, by simply removing the node in the linked list.
More importantly, the bad solution also requires O(n) space, which really is bad even for those who tend to dismiss these kinds of optimization concerns in python.
In your comparison, you made two big mistakes. First, you neglected to even consider the idiomatic "don't delete anything, copy half the dict" option. Second, you didn't realize that deleting half the entries in a hash table at 2/3 load leaves you with a hash table of the exact same size at 1/3 load.
So, let's compare the actual choices (I'll ignore the 2/3 load to be consistent with your n/2 measures). For each one, there's the peak space, the final space, and the time:
2.0n, 1.0n, 1.5n: Copy, delete half the original
2.0n, 1.0n, 1.5n: Copy, delete half the copy
1.5n, 1.0n, 1.5n: Built a deletions set then delete
1.0n, 1.0n, 0.5n: Delete half in-place
1.5n, 0.5n, 1.0n: Delete half in-place, then compact
1.5n, 0.5n, 0.5n: Copy half
So, your proposed design would be worse than what we already do idiomatically. Either you're doubling the final (permanent) space just to save an equivalent amount of transient space, or you're taking twice as long for the same space.
And meanwhile, building a new dictionary, especially if you use a comprehension, means:
Effectively non-mutating (automatic thread/process safety, referential transparency, etc.).
Fewer places to make "small" mistakes that are hard to detect and debug.
Generally more compact and more readable.
Semantically restricted looping, dict building, and exception handling provides opportunities for optimization (which CPython takes; typically a comprehension is about 40% faster than an explicit loop).
For more information on how dictionaries are implemented in CPython, look at the source, which is comprehensively documented, and mostly pretty readable even if you're not a C expert.
If you think about how things work, some of the choices you assumed should obviously go the other way—e.g., consider that Python only stores references in containers, not actual values, and avoids malloc overhead wherever possible, so what are the odds that it would use chaining instead of open addressing?
You may also want to look at the PyPy implementation, which is in Python and has more clever tricks.
Before I respond to all of your comments, you should keep in mind that StackOverflow is not where Python changes get considered or made. If you really think something should be changed, you should post it on python-ideas, python-dev, and/or the bugs site. But before you do: You're pretty clearly still using 2.x; if you're not willing to learn 3.x to get any of the improvements or optimizations made over the past half-decade, nobody over there is going to take you seriously when you suggest additional changes. Also, familiarize yourself with the constructs you want to change; as soon as you start arguing on the basis of Python dicts probably using chaining, the only replies you're going to get will be corrections. Anyway:
Please explain to me how 'Delete half in place' takes 1.0n space and adds 1.0n space to the final space.
I can't explain something I didn't say and that isn't true. There's no "adds" anywhere. My numbers are total peak space and total final space. You're algorithm is clearly 1.0n for each. Which sounds great, until you compare it to the last two options, which have 0.5n total final space.
As your arguments in favor of not providing to the programmer the option of delete in place,
The argument not to make a change is never "that change is impossible", and rarely "that change is inherently bad", but usually "the costs of that change outweigh the benefits". The costs are obvious: there's the work involved; the added complexity of the language and each implementation; more differences between Python versions; potential TOOWTDI violations or attractive nuisances; etc. None of those things mean no change can go in; almost every change ever made to Python had almost all of those costs. But if the benefits of a change aren't worth the cost, it's not worth changing. And if the benefits are less than they initially appear because your hoped-for optimization (a) is actually a pessimization, and (b) would require giving up other benefits to use even if it weren't, that puts you a lot farther from the bar.
Also, I'm not sure, but it sounds like you believe that the idea of there being an obvious, one way to do things, and having a language designed to encourage that obvious way when possible, constitutes Python being a "nanny". If so, then you're seriously using the wrong language. There are people who hate Python for trying to get them to do things the Pythonic way, but those people are smart enough not to use Python, much less try to change it.
Your fourth point, which echoes the one presented in the mailing list about the issue, could easily be fixed … by simply providing a 'for (a,b) in mydict.iteritems() as iter', in the same way as it is currently done for file handles in a 'with open(...) as filehandle' context.
How would that "fix" anything? It sounds like the exact same semantics you could get by writing it = iter(mydict.items()) then for (a, b) in it:. But whatever the semantics are, how would they provide the same, or equivalent, easy opportunities for compiler optimization that comprehensions provide? In a comprehension, there is only one place in the scope that you can return from. It always returns the top value already on the stack. There is guaranteed to be no exception handling in the current scope except a stereotyped StopIteration handler. There is a very specific sequence of events in building the list/set/dict that makes it safe to use generally-unsafe and inflexible opcodes that short-circuit the usual behavior. How are you expecting to get any of those optimizations, much less all of them?
"Either you're doubling the final (permanent) space just to save an equivalent amount of transient space, or you're taking twice as long for the same space." Please explain how you think this works.
This works because 1.0 is double 0.5. More concretely, a hash table that's expanded to n elements and is now at about 1/3 load is twice as big as a hash table that's expanded to n/2 elements and is now at about 2/3 load. How is this not clear?
Delete in place takes O(1) space
OK, if you want to count extra final space instead of total final space, then yes, we can say that delete in place takes 0.0n space, and copying half takes -0.5n. Shifting the zero point doesn't change the comparison.
and none of the options can take less than 1.0n time
Sorry, this was probably unclear, because here I was talking about added cost, and probably shouldn't have been, and didn't mention it. But again, changing the scale or the zero point doesn't make any difference. It clearly takes just as much time to delete 0.5n keys from one dict as it does to add 0.5n keys to another one, and all of the other steps are identical, so there is no time difference. Whether you call them both 0.5n or both 1.0n, they're still equal.
The reason I didn't consider only copying half the dictionary, is that the requirement is to actually modify the dictionary, as is clearly stated.
No, it isn't clearly stated. All you said is "I need to remove about half the elements in a set/dictionary". In 99% of use cases, d = {k: v for k, v in d.items() if pred(k)} is the way to write that. And many of the cases people come up with where that isn't true ("but I need the background thread to see the changes immediately") are actively bad ideas. Of course there are some counterexamples, but you can't expect people to just assume you had one when you didn't even give a hint that you might.
But also, the final space of that is 1.5n, not .5n
No it isn't. The original hash table is garbage, so it gets cleaned up, so the final space is just the new, half-sized hash table. (If that isn't true, then you actually still need the original dict alongside the new one, in which case you had no choice but to copy in the first place.)
And if you're going to say, "Yeah, but until it gets cleaned up"—yes, that's why the peak space is 1.5n instead of 1.0n, because there is some non-zero time that both hash tables are alive.
There is another approach:
for key in list(mydict.keys()):
val = mydict[key]
if <decide drop>(val):
mydict.pop(key)
Which could be explained as:
Copy the keys of the original dictionary
Iterate the dictionary through individual lookups
Delete elements when required
I suspect that the overhead of invidual lookups will be too high, comparing to the straightforward iteration. But, I am curious (and have not tested it yet).
Is there any Python complexity reference? In cppreference, for example, for many functions (such as std::array::size or std::array::fill) there's a complexity section which describes their running complexity, in terms of linear in the size of the container or constant.
I would expect the same information to appear in the python website, perhaps, at least for the CPython implementation. For example, in the list reference, in list.insert I would expect to see complexity: linear; I know this case (and many other container-related operations) is covered here, but many other cases are not. Here are a few examples:
What is the complexity of tuple.__le__? It seems like when comparing two tuples of size n, k, the complexity is about O(min(n,k)) (however, for small n's it looks different).
What is the complexity of random.shuffle? It appears to be O(n). It also appears that the complexity of random.randint is O(1).
What is the complexity of the __format__ method of strings? It appears to be linear in the size of the input string; however, it also grows when the number of relevant arguments grow (compare ("{0}"*100000).format(*(("abc",)*100000)) with ("{}"*100000).format(*(("abc",)*100000))).
I'm aware that (a) each of these questions may be answered by itself, (b) one may look at the code of these modules (even though some are written in C), and (c) StackExchange is not a python mailing list for user requests. So: this is not a doc-feature request, just a question of two parts:
Do you know if such a resource exists?
If not, do you know what is the place to ask for such, or can you suggest why I don't need such?
CPython is pretty good about its algorithms, and the time complexity of an operation is usually just the best you would expect of a good standard library.
For example:
Tuple ordering has to be O(min(n,m)), because it works by comparing element-wise.
random.shuffle is O(n), because that's the complexity of the modern Fisher–Yates shuffle.
.format I imagine is linear, since it only requires one scan through the template string. As for the difference you see, CPython might just be clever enough to cache the same format code used twice.
The docs do mention time complexity, but generally only when it's not what you would expect — for example, because a deque is implemented with a doubly-linked list, it's explicitly mentioned as having O(n) for indexing in the middle.
Would the docs benefit from having time complexity called out everywhere it's appropriate? I'm not sure. The docs generally present builtins by what they should be used for and have implementations optimized for those use cases. Emphasizing time complexity seems like it would either be useless noise or encourage developers to second-guess the Python implementation itself.
In the article linked below, the author compares the efficiency of different string concatenation methodologies in python:
http://www.skymind.com/~ocrow/python_string/
One thing that I did not understand is, why does method 3 (Mutable Character Arrays) result in a significantly slower performance than method 4 (joining a list of strings)
both of them are mutable and I would think that they should have comparable performance.
"Both of them are mutable" is misleading you a bit.
It's true that in the list-append method, the list is mutable. But building up the list isn't the slow part. If you have 1000 strings of average length 1000, you're doing 1000000 mutations to the array, but only 1000 mutations to the list (plus 1000 increfs to string objects).
In particular, that means the array will have to spend 1000x as much time expanding (allocating new storage and copying the whole thing so far).
The slow part for the list method is the str.join call at the end. But that isn't mutable, and doesn't require any expanding. It uses two passes, to first calculate the size needed, then copy everything into it.
Also, that code inside str.join has had (and has continued to have since that article was written 9 years ago) a lot of work to optimize it, because it's a very common, and recommended, idiom that many real programs depend on every day; array has barely been touched since it was first added to the language.
But if you really want to understand the differences, you have to look at the source. In 2.7, the main work for the array method is in array_fromstring, while the main work for the list method is in string_join. You can see how the latter takes advantage of the fact that we already know all of the strings we're going to be joining up at the start, while the former can't.