Strings from `raw_input()` in memory - python

I've known for a while that Python likes to reuse strings in memory instead of having duplicates:
>>> a = "test"
>>> id(a)
36910184L
>>> b = "test"
>>> id(b)
36910184L
However, I recently discovered that the string returned from raw_input() does not follow that typical optimization pattern:
>>> a = "test"
>>> id(a)
36910184L
>>> c = raw_input()
test
>>> id(c)
45582816L
I'm curious why this is the case. Is there a technical reason?

To me it appears that python interns string literals, but strings which are created via some other process don't get interned:
>>> s = 'ab'
>>> id(s)
952080
>>> g = 'a' if True else 'c'
>>> g += 'b'
>>> g
'ab'
>>> id(g)
951336
Of course, raw_input is creating new strings without using string literals, so it's reasonable to expect that it won't return an object with the same id. There are (at least) two reasons why CPython interns strings -- memory (you can save a lot if you don't store many copies of the same thing) and resolution of hash collisions. If two strings hash to the same value (in a dictionary lookup, for instance), Python needs to check that they are actually equal. It can do a character-by-character compare if they're not interned, but if they are interned, it only needs to do a pointer compare, which is a bit more efficient.

The compiler can't intern strings except where they're present in actual source code (i.e., string literals). In addition to that, raw_input also strips off the trailing newline.
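As a quick illustration (a Python 2 sketch; interning behaviour is a CPython implementation detail, not a language guarantee), a string built at runtime is not interned, but intern() hands back the shared copy:

lit = "test"                     # literal, interned by the compiler
dyn = "".join(["te", "st"])      # equal value, but built at runtime
print lit == dyn, lit is dyn     # True False: dyn was never interned
print lit is intern(dyn)         # True: intern() returns the canonical object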

[update] In order to answer the question, it is necessary to know why, how and when Python reuses strings.
Let's start with the how: Python uses "interned" strings - from Wikipedia:
In computer science, string interning is a method of storing only one copy of each distinct string value, which must be immutable. Interning strings makes some string processing tasks more time- or space-efficient at the cost of requiring more time when the string is created or interned. The distinct values are stored in a string intern pool.
Why? Seems like saving memory is not the main goal here, only a nice side-effect.
String interning speeds up string comparisons, which are sometimes a performance bottleneck in applications (such as compilers and dynamic programming language runtimes) that rely heavily on hash tables with string keys. Without interning, checking that two different strings are equal involves examining every character of both strings. This is slow for several reasons: it is inherently O(n) in the length of the strings; it typically requires reads from several regions of memory, which take time; and the reads fill up the processor cache, meaning there is less cache available for other needs. With interned strings, a simple object identity test suffices after the original intern operation; this is typically implemented as a pointer equality test, normally just a single machine instruction with no memory reference at all.
String interning also reduces memory usage if there are many instances of the same string value; for instance, when it is read from a network or from storage. Such strings may include magic numbers or network protocol information. For example, XML parsers may intern names of tags and attributes to save memory.
Now the "when": cpython "interns" a string in the following situations:
when you use the intern() non-essential built-in function (which moved to sys.intern in Python 3).
small strings (0 or 1 byte) - this very informative article by Laurent Luce explains the implementation
the names used in Python programs are automatically interned
the dictionaries used to hold module, class or instance attributes have interned keys
For other situations, implementations vary considerably in when they automatically intern strings; a quick sketch of the cases above follows.
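A short sketch of two of those cases (CPython 2.7; all of this is an implementation detail, not a guarantee):

a = 'x'
b = 'xy'[0]                 # built at runtime, but single-character strings are cached
print a is b                # True on CPython

def f(): pass
print f.__name__ is 'f'     # names used in programs are interned, so this is
                            # usually True on CPython as well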
I could not possibly put it better than Alex Martelli did in this answer (no wonder the guy has a 245k reputation):
Each implementation of the Python language is free to make its own tradeoffs in allocating immutable objects (such as strings) -- either making a new one, or finding an existing equal one and using one more reference to it, is just fine from the language's point of view. In practice, of course, real-world implementations strike a reasonable compromise: one more reference to a suitable existing object when locating such an object is cheap and easy; just make a new object if the task of locating a suitable existing one (which may or may not exist) looks like it could potentially take a long time searching.
So, for example, multiple occurrences of the same string literal within a single function will (in all implementations I know of) use the "new reference to same object" strategy, because when building that function's constants-pool it's pretty fast and easy to avoid duplicates; but doing so across separate functions could potentially be a very time-consuming task, so real-world implementations either don't do it at all, or only do it in some heuristically identified subset of cases where one can hope for a reasonable tradeoff of compilation time (slowed down by searching for identical existing constants) vs memory consumption (increased if new copies of constants keep being made).
I don't know of any implementation of Python (or for that matter other languages with constant strings, such as Java) that takes the trouble of identifying possible duplicates (to reuse a single object via multiple references) when reading data from a file -- it just doesn't seem to be a promising tradeoff (and here you'd be paying runtime, not compile time, so the tradeoff is even less attractive). Of course, if you know (thanks to application level considerations) that such immutable objects are large and quite prone to many duplications, you can implement your own "constants-pool" strategy quite easily (intern can help you do it for strings, but it's not hard to roll your own for, e.g., tuples with immutable items, huge long integers, and so forth).
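A minimal sketch of such a home-grown "constants pool", along the lines Martelli describes (the _pool dict and my_intern name are made up for illustration; it works for any hashable, immutable values, not just strings):

_pool = {}

def my_intern(obj):
    # Return a canonical shared instance for an immutable, hashable object.
    return _pool.setdefault(obj, obj)

t1 = my_intern((1, 2, 'three'))
t2 = my_intern((1, 2, 'three'))
print t1 == t2, t1 is t2    # True True: both names now share one pooled object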
[initial answer]
This is more a comment than an answer, but the comment system is not well suited for posting code:
def main():
    while True:
        s = raw_input('feed me:')
        print '"{}".id = {}'.format(s, id(s))

if __name__ == '__main__':
    main()
Running this gives me:
"test".id = 41898688
"test".id = 41900848
"test".id = 41898928
"test".id = 41898688
"test".id = 41900848
"test".id = 41898928
"test".id = 41898688
From my experience, at least on 2.7, there is some optimization going on even for raw_input().
If the implementation uses hash tables, I guess there is more than one. Going to dive into the source right now.
[first update]
Looks like my experiment was flawed:
def main():
    storage = []
    while True:
        s = raw_input('feed me:')
        print '"{}".id = {}'.format(s, id(s))
        storage.append(s)

if __name__ == '__main__':
    main()
Result:
"test".id = 43274944
"test".id = 43277104
"test".id = 43487408
"test".id = 43487504
"test".id = 43487552
"test".id = 43487600
"test".id = 43487648
"test".id = 43487744
"test".id = 43487864
"test".id = 43487936
"test".id = 43487984
"test".id = 43488032
In his answer to another question, user tzot warns about object lifetime:
A side note: it is very important to know the lifetime of objects in Python. Note the following session:
Python 2.6.4 (r264:75706, Dec 26 2009, 01:03:10)
[GCC 4.3.4] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> a="a"
>>> b="b"
>>> print id(a+b), id(b+a)
134898720 134898720
>>> print (a+b) is (b+a)
False
The reasoning "the printed IDs are equal, ergo the two expressions must yield the same object" is faulty: a single line of output does not imply that all of its contents were created, or co-existed, at the same moment in time.
If you want to know if two objects are the same object, ask Python directly (using the is operator).

Related

If Python strings are immutable, why does it keep the same id if I use += to append to it?

Strings in Python are immutable, which means the value cannot be changed. However, when appending to the string in the following example, it looks like the original string memory is modified since the id remains the same:
>>> s = 'String'
>>> for i in range(5, 0, -1):
...     s += str(i)
...     print(f"{s:<11} stored at {id(s)}")
...
String5 stored at 139841228476848
String54 stored at 139841228476848
String543 stored at 139841228476848
String5432 stored at 139841228476848
String54321 stored at 139841228476848
Conversely, in the following example, the id changes:
>>> a = "hello"
>>> id(a)
139841228475760
>>> a = "b" + a[1:]
>>> print(a)
bello
>>> id(a)
139841228475312
It's a CPython-specific optimization for the case when the str being appended to happens to have no other living references. The interpreter "cheats" in this case, allowing it to modify the existing string by reallocating (which can be in place, depending on heap layout) and appending the data directly, and often reducing the work significantly in loops that repeatedly concatenate (making it behave more like the amortized O(1) appends of a list rather than O(n) copy operations each time). It has no visible effect besides the unchanged id, so it's legal to do this (no one with an existing reference to a str ever sees it change unless the str was logically being replaced).
You're not actually supposed to rely on it (non-reference counted interpreters can't use this trick, since they can't know if the str has other references), per PEP8's very first programming recommendation:
Code should be written in a way that does not disadvantage other implementations of Python (PyPy, Jython, IronPython, Cython, Psyco, and such).
For example, do not rely on CPython’s efficient implementation of in-place string concatenation for statements in the form a += b or a = a + b. This optimization is fragile even in CPython (it only works for some types) and isn’t present at all in implementations that don’t use refcounting. In performance sensitive parts of the library, the ''.join() form should be used instead. This will ensure that concatenation occurs in linear time across various implementations.
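A sketch of the portable ''.join() form PEP 8 recommends, doing the same job as the += loop above:

pieces = ['String']
for i in range(5, 0, -1):
    pieces.append(str(i))
s = ''.join(pieces)
print(s)    # String54321, built in linear time on any Python implementation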
If you want to break the optimization, there are all sorts of ways to do so, e.g. changing your code to:
>>> while i != 0:
...     s_alias = s    # Gonna save off an alias here
...     s += str(i)
...     print(s + " stored at " + str(id(s)))
...     i -= 1
...
breaks it by creating an alias, increasing the reference count and telling Python that the change would be visible somewhere other than s, so it can't apply it. Similarly, code like:
s = s + a + b
can't use it, because s + a occurs first, and produces a temporary that b must then be added to, rather than immediately replacing s, and the optimization is too brittle to try to handle that. Almost identical code like:
s += a + b
or:
s = s + (a + b)
restores the optimization by ensuring the final concatenation is always one where s is the left operand and the result is used to immediately replace s.
Regardless of implementation details, the docs say:
… Two objects with non-overlapping lifetimes may have the same id() value.
The previous object referenced by s no longer exists after the += so the new object breaks no rules by having the same id.
If two objects have non-overlapping lifetimes, they may end up with the same id() value; if their lifetimes overlap, they must have different id() values.
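A small sketch of that rule (the reuse itself is a CPython allocator detail, not something to rely on):

a = object()
old_id = id(a)
del a                      # the first object's lifetime ends here
b = object()
print(id(b) == old_id)     # often True in CPython because the freed slot is
                           # reused, but nothing guarantees it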
In C, Java, and some other languages, a variable is an identifier (a name) bound to a memory location, and assignment writes a new value into that location.
In Python, by contrast, a variable is better thought of as a tag tied to a value; the value itself is an object, and memory is allocated for the value rather than for the variable. Rebinding the same name to a different value simply points the tag at a different object; if nothing else references the old object, the garbage collector reclaims it, and a later object may even reuse the freed memory location.
So in C or Java a memory location belongs to a variable, whereas in Python the memory location belongs to the value, which is treated as an object.
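For example (CPython shown; the small-integer cache is itself an implementation detail):

a = 10
b = 10
print(a is b)          # True: both names are tags on the same cached int object
a = a + 1              # rebinding 'a' points the tag at a different object (11)
print(a, b, a is b)    # 11 10 False: 'b' still tags the original 10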

List comprehension is sorting automatically [duplicate]

The question arose when answering to another SO question (there).
When I iterate several times over a Python set (without changing it between calls), can I assume it will always return elements in the same order? And if not, what is the rationale for changing the order? Is it deterministic, or random? Or implementation defined?
And when I call the same python program repeatedly (not random, not input dependent), will I get the same ordering for sets?
The underlying question is if python set iteration order only depends on the algorithm used to implement sets, or also on the execution context?
There's no formal guarantee about the stability of sets. However, in the CPython implementation, as long as nothing changes the set, the items will be produced in the same order. Sets are implemented as open-addressing hashtables (with a prime probe), so inserting or removing items can completely change the order (in particular, when that triggers a resize, which reorganizes how the items are laid out in memory.) You can also have two identical sets that nonetheless produce the items in different order, for example:
>>> s1 = {-1, -2}
>>> s2 = {-2, -1}
>>> s1 == s2
True
>>> list(s1), list(s2)
([-1, -2], [-2, -1])
Unless you're very certain you have the same set and nothing touched it in between the two iterations, it's best not to rely on the order staying the same. Making seemingly irrelevant changes to, say, functions you call in between could produce very hard-to-find bugs.
A set or frozenset is inherently an unordered collection. Internally, sets are based on a hash table, and the order of keys depends both on the insertion order and on the hash algorithm. In CPython (aka standard Python), integers less than the machine word size (32 bit or 64 bit) hash to themselves, but text strings, byte strings, and datetime objects hash to integers that vary randomly; you can control that by setting the PYTHONHASHSEED environment variable.
From the __hash__ docs:
Note
By default, the __hash__() values of str, bytes and datetime objects are "salted" with an unpredictable random value. Although they remain constant within an individual Python process, they are not predictable between repeated invocations of Python.
This is intended to provide protection against a denial-of-service caused by carefully-chosen inputs that exploit the worst case performance of a dict insertion, O(n^2) complexity. See http://www.ocert.org/advisories/ocert-2011-003.html for details.
Changing hash values affects the iteration order of dicts, sets and other mappings. Python has never made guarantees about this ordering (and it typically varies between 32-bit and 64-bit builds).
See also PYTHONHASHSEED.
The results of hashing objects of other classes depend on the details of the class's __hash__ method.
The upshot of all this is that you can have two sets containing identical strings but when you convert them to lists they can compare unequal. Or they may not. ;) Here's some code that demonstrates this. On some runs, it will just loop, not printing anything, but on other runs it will quickly find a set that uses a different order to the original.
from random import seed, shuffle
seed(42)
data = list('abcdefgh')
a = frozenset(data)
la = list(a)
print(''.join(la), a)
while True:
    shuffle(data)
    lb = list(frozenset(data))
    if lb != la:
        print(''.join(data), ''.join(lb))
        break
typical output
dachbgef frozenset({'d', 'a', 'c', 'h', 'b', 'g', 'e', 'f'})
deghcfab dahcbgef
And when I call the same python program repeatedly (not random, not input dependent), will I get the same ordering for sets?
I can answer this part of the question now after a quick experiment. Using the following code:
class Foo(object):
    def __init__(self, val):
        self.val = val

    def __repr__(self):
        return str(self.val)

x = set()
for y in range(500):
    x.add(Foo(y))
print list(x)[-10:]
I can trigger the behaviour that I was asking about in the other question. If I run this repeatedly then the output changes, but not on every run. It seems to be "weakly random" in that it changes slowly. This is certainly implementation dependent so I should say that I'm running the macports Python2.6 on snow-leopard. While the program will output the same answer for long runs of time, doing something that affects the system entropy pool (writing to the disk mostly works) will somethimes kick it into a different output.
The class Foo is just a simple int wrapper as experiments show that this doesn't happen with sets of ints. I think that the problem is caused by the lack of __eq__ and __hash__ members for the object, although I would dearly love to know the underlying explanation / ways to avoid it. Also useful would be some way to reproduce / repeat a "bad" run. Does anyone know what seed it uses, or how I could set that seed?
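One way to avoid it (a sketch, keeping the Python 2 style of the snippet above): give Foo value-based __eq__ and __hash__ methods, so the set layout depends on the values rather than on per-run object addresses:

class Foo(object):
    def __init__(self, val):
        self.val = val

    def __repr__(self):
        return str(self.val)

    def __eq__(self, other):
        return isinstance(other, Foo) and self.val == other.val

    def __hash__(self):
        return hash(self.val)    # ints hash to themselves, so this is stable

x = set()
for y in range(500):
    x.add(Foo(y))
print list(x)[-10:]              # now the same on every run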
It’s definitely implementation defined. The specification of a set says only that
Being an unordered collection, sets do not record element position or order of insertion.
Why not use OrderedDict to create your own OrderedSet class?
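For instance, a minimal sketch of that idea (the OrderedSet name is made up, and only a handful of set methods are implemented):

from collections import OrderedDict

class OrderedSet(object):
    def __init__(self, iterable=()):
        self._data = OrderedDict.fromkeys(iterable)

    def add(self, item):
        self._data[item] = None

    def discard(self, item):
        self._data.pop(item, None)

    def __contains__(self, item):
        return item in self._data

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)

s = OrderedSet(['b', 'a', 'b', 'c'])
print(list(s))    # ['b', 'a', 'c'] - iteration follows insertion order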
The answer is simply no.
Python set iteration order is NOT stable.
I did a simple experiment to show this.
The code:
import random

random.seed(1)

class aaa(object):
    def __init__(self, a, b):
        self.a = a
        self.b = b

x = []
for i in range(5):
    x.append(aaa(random.choice('asf'), random.randint(1, 4000)))

for j in x:
    print(j.a, j.b)
print('====')
for j in set(x):
    print(j.a, j.b)
Run this twice and you will get:
First time result:
a 2332
a 1045
a 2030
s 1935
f 1555
====
a 2030
a 2332
f 1555
a 1045
s 1935
Process finished with exit code 0
Second time result:
a 2332
a 1045
a 2030
s 1935
f 1555
====
s 1935
a 2332
a 1045
f 1555
a 2030
Process finished with exit code 0
The reason is explained in comments in this answer.
However, there are some ways to make it stable:
set PYTHONHASHSEED to 0, see details here, here and here.
Use OrderedDict instead.
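For the first suggestion, note that PYTHONHASHSEED has to be set before the interpreter starts (e.g. PYTHONHASHSEED=0 python myscript.py), and it only pins the salted hashes of types such as str, bytes and datetime:

print(hash('abc'))    # repeats across runs only when PYTHONHASHSEED is fixed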
As pointed out, this is strictly an implementation detail.
But as long as you don’t change the structure between calls, there should be no reason for a read-only operation (= iteration) to change with time: no sane implementation does that. Even randomized (= non-deterministic) data structures that can be used to implement sets (e.g. skip lists) don’t change the reading order when no changes occur.
So, being rational, you can safely rely on this behaviour.
(I’m aware that certain GCs may reorder memory in a background thread but even this reordering will not be noticeable on the level of data structures, unless a bug occurs.)
The definition of a set is unordered, unique elements ("Unordered collections of unique elements"). You should care only about the interface, not the implementation. If you want an ordered enumeration, you should probably put it into a list and sort it.
There are many different implementations of Python. Don't rely on undocumented behaviour, as your code could break on different Python implementations.

Integers vs floats in Python: cannot understand the behavior

I was playing a bit in my python shell while learning about mutability of objects.
I found something strange:
>>> x=5.0
>>> id(x)
48840312
>>> id(5.0)
48840296
>>> x=x+3.0
>>> id(x) # why did x (now 8.0) keep the same id as 5.0?
48840296
>>> id(5.0)
36582128
>>> id(5.0)
48840344
Why is the id of 5.0 reused after the statement x=x+3.0?
Fundamentally, the answer to your question is "calling id() on numbers will give you unpredictable results". The reason for this is because unlike languages like Java, where primitives literally are their value in memory, "primitives" in Python are still objects, and no guarantee is provided that exactly the same object will be used every time, merely that a functionally equivalent one will be.
CPython caches the values of the integers from -5 to 256 for efficiency (ensuring that calls to id() will always be the same), since these are commonly used and can be effectively cached; however, nothing about the language requires this to be the case, and other implementations may choose not to do so.
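A sketch of the difference (CPython behaviour, not a language guarantee):

a = 100
b = int('100')          # built at runtime
print(a is b)           # True: ints in [-5, 256] come from a shared cache

x = 5.0
y = float('5.0')        # built at runtime
print(x == y, x is y)   # True False: floats get no such cache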
Whenever you write a float literal in Python, you're asking the interpreter to convert the string into a valid numerical object. If it can, Python will reuse existing objects, but if it cannot easily determine whether an equal object already exists, it will simply create a new one.
This is not to say that numbers in Python are mutable - they aren't. Any instance of a number, such as 5.0, in Python cannot be changed by the user after being created. However there's nothing wrong, as far as the interpreter is concerned, with constructing more than one instance of the same number.
Your specific example of the object representing x = 5.0 being reused for the value of x += 3.0 is an implementation detail. Under the covers, CPython may, if it sees fit, reuse numerical objects, both integers and floats, to avoid the costly activity of constructing a whole new object. I stress however, this is an implementation detail; it's entirely possible certain cases will not display this behavior, and CPython could at any time change its number-handling logic to no longer behave this way. You should avoid writing any code that relies on this quirk.
The alternative, as eryksun points out, is simply that you stumbled on an object being garbage collected and replaced in the same location. From the user's perspective, there's no difference between the two cases, and this serves to stress that id() should not be used on "primitives".
The Devil is in the details
PyObject* PyInt_FromLong(long ival)
Return value: New reference.
Create a new integer object with a value of ival.
The current implementation keeps an array of integer objects for all integers between -5 and 256; when you create an int in that range you actually just get back a reference to the existing object. So it should be possible to change the value of 1. I suspect the behaviour of Python in this case is undefined. :-)
Note: this is true only for CPython and may not apply to other Python implementations.

Python id of name and '__main__' different

Normally if I assign a variable some value, and then check their ids, I expect them to be the same, because python is essentially just giving my object a "name". This can be seen in the below code:
>>> a = 3
>>> id(a)
19845928
>>> id(3)
19845928
The problem is when I do the same with __name__:
>>> __name__
'__main__'
>>> id(__name__)
19652416
>>> id('__main__')
19652448
How can their ids be different? Shouldn't they be the same, since __name__ should also be just a reference?
id() gives essentially the memory pointer to the data. Although strings are immutable, they are not guaranteed to be interned. This means that some strings with equal values have different pointers.
For small integers (which CPython caches), the pointers will be the same, so your 3 example works fine.
@KartikAnand: The way you're checking for "same object" is valid, although the usual way is to use x is y. The problem is that they are not the same object, and are not guaranteed to be; they simply have the same value. Note that when you write the literal "__main__" you're creating a new object. Sometimes Python does a nice optimization and reuses a previously-created string of the same value, but it doesn't have to.
Kartik's goal is to "verify that assignment is in a way reference and objects are not created on the fly". To do this, avoid creating new objects (no string literals).
>>> __name__
'__main__'
>>> x = __name__
>>> id(__name__)
3078339808L
>>> id(x)
3078339808L
>>> __name__ is x
True
Just because two strings have the same value, it does not follow that they are the same object. This is completely expected behaviour.
In Python, small integers are "pooled", so that all small integer values point to the same object. This is not necessarily true for strings.
At any rate, this is an implementation detail that should not be relied upon.
What you are running into here is the fact that primitives are pseudo (or real) singletons in Python. Furthermore, looking at strings clouds the issue, because when strings are interned, value and identity become synonymous as a side effect, so some strings with matching values will have matching ids and others won't. Try looking at hand-built objects instead, since then you control when a new instance is created, and the difference between identity and value becomes explicit.
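A sketch of that suggestion with a hand-built class (the Point name is made up for the example):

class Point(object):
    def __init__(self, x, y):
        self.x, self.y = x, y

    def __eq__(self, other):
        return (self.x, self.y) == (other.x, other.y)

p = Point(1, 2)
q = Point(1, 2)
r = p
print(p == q, p is q)   # True False: equal value, two distinct objects
print(p == r, p is r)   # True True: one object reachable through two names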

In Python, are single character strings guaranteed to be identical?

I read somewhere (an SO post, I think, and probably somewhere else, too) that Python automatically interns single-character strings, so that not only does 'a' == 'a', but also 'a' is 'a'.
However, I can't remember reading if this is guaranteed behavior in Python, or is it just implementation specific?
Bonus points for official sources.
It's implementation specific. It's difficult to tell, because (as the reference says):
... for immutable types, operations that compute new values may actually return a reference to any existing object with the same type and value, while for mutable objects this is not allowed.
The interpreter's pretty good about ensuring they're identical, but it doesn't always work:
x = u'a'
y = u'abc'[:1]
print x == y, x is y
Run on CPython 2.6, this gives True False.
It is all implementation defined.
The documentation for intern says: "Normally, the names used in Python programs are automatically interned, and the dictionaries used to hold module, class or instance attributes have interned keys."
That means that anything that could be a name and which is known at compile time is likely (but not guaranteed) to be the same as any other occurrences of the same name.
Other strings aren't stated to be interned. Constant strings appearing in the same compilation unit are folded together (but that is also just an implementation detail) so you get:
>>> a = '!'
>>> a is '!'
False
>>> a = 'a'
>>> a is 'a'
True
>>>
The string that contains an identifier is interned so even in different compilations you get the same string. The string that is not an identifier is only shared when in the same compilation unit:
>>> '!' is '!'
True
