What determines which strings are interned and when? [duplicate] - python

This question already has answers here:
'is' operator behaves differently when comparing strings with spaces
(6 answers)
About the changing id of an immutable string
(5 answers)
Closed 8 years ago.
>>> s1 = "spam"
>>> s2 = "spam"
>>> s1 is s2
True
>>> q = 'asdalksdjfla;ksdjf;laksdjfals;kdfjasl;fjasdf'
>>> r = 'asdalksdjfla;ksdjf;laksdjfals;kdfjasl;fjasdf'
>>> q is r
False
How many characters must a string have before s1 is s2 gives False? Where is the limit? In other words, how long does a string have to be before Python starts making separate copies of it?

String interning is implementation specific and shouldn't be relied upon. Use equality testing (==) if you want to check whether two strings have the same contents.
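To make the difference concrete, here is a minimal sketch, assuming Python 3 (where the function lives at sys.intern). The helper name build is made up for illustration; it only assembles the string at runtime so that it is not a compile-time constant.

```python
import sys

def build(parts):
    """Assemble a string at runtime (hypothetical helper for this example)."""
    return "".join(parts)

a = build(["hi", "!"])
b = build(["hi", "!"])

# Value comparison: this is what you almost always want.
equal_by_value = (a == b)

# Identity comparison: the result is an implementation detail,
# so it is stored here but should never be relied on.
same_object = (a is b)

# sys.intern returns the one canonical object for a given string value.
ia = sys.intern(a)
ib = sys.intern(b)
interned_same_object = (ia is ib)
```

Here equal_by_value and interned_same_object are both True, while same_object may come out either way depending on the interpreter.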

If you want, for some bizarre reason, to force the identity comparison to be true, then use the intern function (a builtin in Python 2; in Python 3 it is sys.intern):
>>> a = intern('12345678012345678901234567890qazwsxedcrfvtgbyhnujmikolp')
>>> b = intern('12345678012345678901234567890qazwsxedcrfvtgbyhnujmikolp')
>>> a is b
True

Here is a comment about interned strings from the CPython 2.5.0 source file stringobject.h:
/* ... ... This is generally restricted to strings that "look like" Python identifiers, although the intern() builtin can be used to force interning of any string ... ... */
Accordingly, string constants that contain only underscores, digits, or letters will be interned. In your example, q and r contain ;, so they are not interned.
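As a rough illustration of that rule, the sketch below classifies a few of the strings from these questions. The function looks_like_name and the regular expression are assumptions made for this example, not CPython's actual code, and the real behaviour can differ between versions.

```python
import re

# Hypothetical approximation of the "looks like an identifier" rule:
# only ASCII letters, digits, and underscores.
_NAME_CHARS = re.compile(r"[A-Za-z0-9_]*")

def looks_like_name(s):
    """Return True if s consists solely of name characters."""
    return _NAME_CHARS.fullmatch(s) is not None

candidates = [
    "spam",
    "asdalksdjfla;ksdjf",
    "hi!",
    "zzzzqqqqasdfasdf1234",
    "!##$",
]

# Map each candidate to whether it would qualify under this approximation.
report = {s: looks_like_name(s) for s in candidates}
```

Under this approximation, only "spam" and "zzzzqqqqasdfasdf1234" qualify, which matches the is and id() results shown in the transcripts.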

Related

Python unnecessary string allocation [duplicate]

This question already has answers here:
What are the rules for cpython's string interning?
(2 answers)
Python string interning
(2 answers)
Why and where python interned strings when executing `a = 'python'` while the source code does not show that?
(1 answer)
Closed last year.
When you assign the same string literal to two variables, Python allocates only one string. This is reasonable, since strings are immutable objects in Python.
>>> a = "Hello"
>>> b = "Hello"
>>> id(a)
4311984752
>>> id(b)
4311984752
>>> a is b
True
But the strange part is this: when the string contains a special character (like !), Python allocates two strings with exactly the same content.
>>> a = "hi!"
>>> b = "hi!"
>>> id(a)
4328663024
>>> id(b)
4317237616
>>> a is b
False
I read about this behaviour here: https://python-course.eu/python-tutorial/data-types-and-variables.php
But that guide doesn't explain why Python makes this seemingly unnecessary duplicate allocation.
My question is: what is the rationale behind Python allocating duplicate strings when the string contains a special character?

Why is id of two strings different? [duplicate]

This question already has answers here:
'is' operator behaves differently when comparing strings with spaces
(6 answers)
Closed 8 months ago.
>>> a = "zzzzqqqqasdfasdf1234"
>>> b = "zzzzqqqqasdfasdf1234"
>>> id(a)
4402117560
>>> id(b)
4402117560
but
>>> c = "!##$"
>>> d = "!##$"
>>> id(c) == id(d)
False
>>> id(a) == id(b)
True
Why do I get the same id() result only for some strings?
Edit: I replaced "ascii string" with just "string". Thanks for the feedback.
It's not about ASCII vs. non-ASCII (your "non-ASCII" is still ASCII, it's just punctuation, not alphanumeric). CPython, as an implementation detail, interns string constants that contain only "name characters". "Name characters" in this case means the same thing as the regex escape \w: Alphanumeric, plus underscore.
Note: This can change at any time, and should never be relied on, it's just an optimization they happen to use.
At a guess, this choice was made to optimize code that uses getattr and setattr, dicts keyed by a handful of string literals, etc., where interning means that the dictionary lookups involved often end up doing pointer comparisons and avoid comparing the strings at all (when two strings are both interned, they are by definition either the same object or not equal, so you can avoid reading their data entirely).
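The parenthetical above can be sketched as follows, assuming Python 3 and sys.intern. The function interned_equal and the key names are made up for illustration. The shortcut is valid only when both arguments are known to be interned.

```python
import sys

def interned_equal(a, b):
    """Compare two strings that are BOTH interned.

    For interned strings, identity alone decides equality,
    so the character data never has to be read.
    """
    return a is b

# One key is assembled at runtime, the other comes from a literal.
key1 = sys.intern("".join(["user", "_", "name"]))
key2 = sys.intern("user_name")
other = sys.intern("user_id")

same = interned_equal(key1, key2)        # equal values, so same object
different = interned_equal(key1, other)  # different values, so different objects
```

For ordinary code, == remains the right comparison. This only shows why an interpreter can benefit from interning internally.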

Why id() for same Unicode string literals gives different result? [duplicate]

This question already has answers here:
memory location in unicode strings
(2 answers)
Closed 8 years ago.
Why do Unicode string literals show different ids? I was expecting the same behavior as for ordinary string literals.
>>> p = 'abcd'
>>> q = 'abcd'
>>> id(p) == id(q)
True
>>> p = u'abcd'
>>> q = u'abcd'
>>> id(p) == id(q)
False
Please provide some pointers on this.
For the same reason two dicts with the same contents would have different ids: they are distinct objects. I suspect that the non-Unicode string literals being the same object is something of an optimization.
