Python String Object Reference

I'm learning Python (3.6) and I have discovered the following:
a = "hi"
b = "hi"
a == b #True
a is b #True
a = list(a)
b = list(b)
a = "".join(a)
b = "".join(b)
a == b #True
a is b #False
Why is the result different after conversion to list and joining back to string? I do understand that Python VM maintains a pool of strings and hence the reference is the same for a and b. But why does this not work after joining the list to the very same string?
Thanks!

The key lies here:
a = "".join(a)
b = "".join(b)
The str.join() method returns a new string, built by joining the elements of a list.
Each call to str.join() instantiates a new string: in the first call a string is created and its reference is assigned to a; then, in the second call, a new string is built and its reference is assigned to b. Because of this, the names a and b refer to two new and distinct strings, which are two separate objects.
The is operator behaves as designed, returning False because a and b are not references to the same object.
If you're trying to see whether the two strings are equal in content, then the == operator is the better choice.
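A small sketch of what happens step by step (the identity results shown are what CPython produces; they are an implementation detail, not a language guarantee):

```python
a = "hi"
b = "hi"
assert a == b         # equal content
assert a is b         # same interned object (CPython literal interning)

a = "".join(list(a))  # join() builds a brand-new string object
b = "".join(list(b))  # ...and another, separate one
assert a == b         # contents are still equal
assert a is not b     # but they are now two distinct objects
```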

You shouldn't really compare anything except singletons (like None, True or False) with is, because is doesn't compare content; it just checks whether two names refer to the same object. So is will fail if you compare different objects with the same content.
The fact that your first a is b worked is because string literals are interned (*). So a and b are the same object because both are literals with the same content. But that's an implementation detail, and it could yield different results in future (or older) Python versions, so don't start comparing string literals with is on the basis that it works right now.
(*) It really should return False because, the way you've written the cases, they shouldn't be the same object. They just happen to be the same one because CPython optimizes some cases.

There are lots of ways to answer this, but here you can think about memory: the physical bits in your RAM that make up the data. In Python, the is keyword checks whether the addresses of two objects match exactly. The == operator checks whether the values of the objects are the same by running the comparison defined in the __eq__ magic method - the mechanism Python uses to turn operators into method calls - which can be defined for every class. The interesting part arises from the fact that the two strings are originally identical; this question should help you with that:
when does Python allocate new memory for identical strings?.
Essentially, Python can optimise the "hi" strings because you've typed them before running your code: the interpreter makes a table of all typed string literals to save memory. When a string object is built from a list, Python doesn't know in advance what the contents will be. By default, in your particular version of the interpreter, this means a new string object is created - this saves the time of checking for duplicates, but requires more memory. If you want to force the interpreter to check for duplicates, use the sys.intern function: https://docs.python.org/3/library/sys.html
sys.intern(string):
Enter string in the table of “interned” strings and return the interned string – which is string itself or a copy. Interning strings is useful to gain a little performance on dictionary lookup – if the keys in a dictionary are interned, and the lookup key is interned, the key comparisons (after hashing) can be done by a pointer compare instead of a string compare. Normally, the names used in Python programs are automatically interned, and the dictionaries used to hold module, class or instance attributes have interned keys.
Interned strings are not immortal; you must keep a reference to the return value of intern() around to benefit from it.
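To illustrate the documentation above, here is a short sketch showing how sys.intern restores shared identity for dynamically built strings (the first identity result is CPython-specific):

```python
import sys

a = "".join(["h", "i"])  # a freshly built, non-interned string
b = "".join(["h", "i"])  # another fresh string with the same content
assert a is not b        # two distinct objects on CPython

a = sys.intern(a)        # enter the string in the interned table
b = sys.intern(b)        # returns the already-interned copy
assert a is b            # both names now reference one object
```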

Related

Is string internally stored as individual characters, each character in memory shared by other similar strings?

For example, is the string var1 = 'ROB' stored as 3 memory locations R, O and B each with its own address and the variable var1 points to the memory location R? Then how does it point to O and B?
And do other strings – for example: var2 = 'BOB' – point to the same B and O in memory that var1 refers to?
How strings are stored is an implementation detail, but in practice, on the CPython reference interpreter, they're stored as a C-style array of characters. So if the R is at address x, then O is at x+1 (or +2 or +4, depending on the largest ordinal value in the string), and B is at x+2 (or +4 or +8). Because the letters are stored consecutively, knowing where R is (and a flag in the str that says how big each character's storage is) is enough to locate O and B.
'BOB' is at a completely different address, y, and its O and B are contiguous as well. The OB in 'ROB' is utterly unrelated to the OB in 'BOB'.
There is a confusing aspect to this. If you index into the strings, and check the id of the result, it will seem like 'O' has the same address in both strings. But that's only because:
Indexing into a string returns a new string, unrelated to the one being indexed, and
CPython caches length one strings in the latin-1 range, so 'O' is a singleton (no matter how you make it, you get back the cached string)
I'll note that the actual str internals in modern Python are even more complicated than I covered above; a single string might store the same data in up to three different encodings in the same object (the canonical form, and cached version(s) for working with specific Python C APIs). It's not visible from the Python level aside from checking the size with sys.getsizeof though, so it's not worth worrying about in general.
If you really want to head off into the weeds, feel free to read PEP 393: Flexible String Representation which elaborates on the internals of the new str object structure adopted in CPython 3.3.
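The single-character caching described above can be observed directly. This is a CPython-specific demonstration, not guaranteed behaviour:

```python
s1 = "ROB"
s2 = "BOB"
# Indexing returns new 1-character strings, but CPython caches every
# single character in the latin-1 range, so both lookups hit the cache:
assert s1[1] is s2[1]        # both are the cached 'O' singleton
assert s1[1:] is not s2[1:]  # multi-character slices are fresh copies
```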
This is only a partial answer:
var1 is a name that refers to a string object 'ROB'.
var2 is a name that refers to another string object 'BOB'.
How a string object stores the individual characters, and whether different string objects share the same memory, I cannot answer now in more detail than "sometimes" and "it depends". It has to do with string interning, which may be used.

Comparing any two objects of any data type in python

I am trying to compare the values of any two objects, the datatype for which could be anything (including byte array, django objects, dictionary, boolean..... and so on). Right now I am using the '==' operator for the same. Is this the correct approach of comparing two objects?
'==' returns True if two objects are equal in value, while 'is' returns True if the two variables point to the same object.
Look at this page for a more in depth explanation:
Is there a difference between `==` and `is` in Python?.
Are you asking how to compare two objects that are equal in value or if they are pointing at the same common object?

Does Python do slice-by-reference on strings?

I want to know if when I do something like
a = "This could be a very large string..."
b = a[:10]
a new string is created or a view/iterator is returned
Python does slice-by-copy, meaning every time you slice (except for very trivial slices, such as a[:]), it copies all of the data into a new string object.
According to one of the developers, this choice was made because
The [slice-by-reference] approach is more complicated, harder to implement
and may lead to unexpected behavior.
For example:
a = "a long string with 500,000 chars ..."
b = a[0]
del a
With the slice-as-copy design the string a is immediately freed. The
slice-as-reference design would keep the 500kB string in memory although
you are only interested in the first character.
Apparently, if you absolutely need a view into a string, you can use a memoryview object.
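A minimal sketch of the memoryview approach. Note that memoryview requires a bytes-like object, so a str would have to be encoded first (which itself makes a copy):

```python
data = b"This could be a very large byte string..."
view = memoryview(data)  # no copy: a window onto data's buffer
chunk = view[:10]        # slicing a memoryview is also copy-free
assert bytes(chunk) == b"This could"  # bytes() copies only on demand
```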
When you slice a string, a new string instance is returned; strings are immutable objects.

Deterministic key serialization

I'm writing a mapping class which persists to the disk. I am currently allowing only str keys but it would be nice if I could use a couple more types: hopefully up to anything that is hashable (ie. same requirements as the builtin dict), but more reasonable I would accept string, unicode, int, and tuples of these types.
To that end I would like to derive a deterministic serialization scheme.
Option 1 - Pickling the key
The first thought I had was to use the pickle (or cPickle) module to serialize the key, but I noticed that the output from pickle and cPickle do not match each other:
>>> import pickle
>>> import cPickle
>>> def dumps(x):
...     print repr(pickle.dumps(x))
...     print repr(cPickle.dumps(x))
...
>>> dumps(1)
'I1\n.'
'I1\n.'
>>> dumps('hello')
"S'hello'\np0\n."
"S'hello'\np1\n."
>>> dumps((1, 2, 'hello'))
"(I1\nI2\nS'hello'\np0\ntp1\n."
"(I1\nI2\nS'hello'\np1\ntp2\n."
Is there any implementation/protocol combination of pickle which is deterministic for some set of types (e.g. can only use cPickle with protocol 0)?
Option 2 - Repr and ast.literal_eval
Another option is to use repr to dump and ast.literal_eval to load. I have written a function to determine if a given key would survive this process (it is rather conservative on the types it allows):
def is_reprable_key(key):
    return type(key) in (int, str, unicode) or (type(key) == tuple and all(
        is_reprable_key(x) for x in key))
The question for this method is whether repr itself is deterministic for the types that I have allowed here. I believe this would not survive the 2/3 version barrier due to the change in str/unicode literals. It also would not work for integers where 2**32 - 1 < x < 2**64 when jumping between 32- and 64-bit platforms. Are there any other conditions (i.e. do strings serialize differently under different conditions in the same interpreter)? Edit: I'm just trying to understand the conditions under which this breaks down, not necessarily overcome them.
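For reference, the round-trip check described above can be sketched in Python 3 (where the str/unicode distinction is gone; the roundtrips helper name is my own, not from the question):

```python
import ast

def roundtrips(key):
    """True if repr() followed by ast.literal_eval() reproduces the key."""
    try:
        return ast.literal_eval(repr(key)) == key
    except (ValueError, SyntaxError):
        return False

assert roundtrips((1, 2, "hello"))
assert roundtrips("hi")
assert not roundtrips(object())  # repr of an arbitrary object is not a literal
```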
Option 3: Custom repr
Another option which is likely overkill is to write my own repr which flattens out the things of repr which I know (or suspect may be) a problem. I just wrote an example here: http://gist.github.com/423945
(If this all fails miserably then I can store the hash of the key along with the pickle of both the key and value, then iterate across rows that have a matching hash looking for one that unpickles to the expected key, but that really does complicate a few other things and I would rather not do it. Edit: it turns out that the builtin hash is not deterministic across platforms either. Scratch that.)
Any insights?
Important note: repr() is not deterministic if a dictionary or set type is embedded in the object you are trying to serialize, because the keys could be printed in any order.
For example, print repr({'a': 1, 'b': 2}) might print {'a': 1, 'b': 2} or {'b': 2, 'a': 1}, depending on how Python decides to manage the keys in the dictionary.
After reading through much of the source (of CPython 2.6.5) for the implementation of repr for the basic types I have concluded (with reasonable confidence) that repr of these types is, in fact, deterministic. But, frankly, this was expected.
I believe that the repr method is susceptible to nearly all of the same cases under which the marshal method would break down (longs greater than 2**32 can never be an int on a 32-bit machine, the output is not guaranteed to stay the same between versions or interpreters, etc.).
My solution for the time being has been to use the repr method and write a comprehensive test suite to make sure that repr returns the same values on the various platforms I am using.
In the long run the custom repr function would flatten out all platform/implementation differences, but is certainly overkill for the project at hand. I may do this in the future, however.
"Any value which is an acceptable key for a builtin dict" is not feasible: such values include arbitrary instances of classes that don't define __hash__ or comparisons, implicitly using their id for hashing and comparison purposes, and the ids won't be the same even across runs of the very same program (unless those runs are strictly identical in all respects, which is very tricky to arrange -- identical inputs, identical starting times, absolutely identical environment, etc, etc).
For strings, unicodes, ints, and tuples whose items are all of these kinds (including nested tuples), the marshal module could help (within a single version of Python: marshaling code can and does change across versions). E.g.:
>>> marshal.dumps(23)
'i\x17\x00\x00\x00'
>>> marshal.dumps('23')
't\x02\x00\x00\x0023'
>>> marshal.dumps(u'23')
'u\x02\x00\x00\x0023'
>>> marshal.dumps((23,))
'(\x01\x00\x00\x00i\x17\x00\x00\x00'
This is Python 2; Python 3 would be similar (except that all the representation of these byte strings would have a leading b, but that's a cosmetic issue, and of course u'23' becomes invalid syntax and '23' becomes a Unicode string). You can see the general idea: a leading byte represents the type, such as u for Unicode strings, i for integers, ( for tuples; then for containers comes (as a little-endian integer) the number of items followed by the items themselves, and integers are serialized into a little-endian form. marshal is designed to be portable across platforms (for a given version; not across versions) because it's used as the underlying serializations in compiled bytecode files (.pyc or .pyo).
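In Python 3 the same idea looks like this; within a single interpreter version, the same value marshals to the same bytes, which is the determinism the question asks for:

```python
import marshal

key = (1, 2, "hello")
blob = marshal.dumps(key)
assert marshal.loads(blob) == key  # round-trips exactly
assert marshal.dumps(key) == blob  # same value -> same bytes (same version)
```

Remember the caveat above: the format is not stable across Python versions, so this only works if reader and writer run the same version.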
You mention a few requirements in the paragraph, and I think you might want to be a little more clear on these. So far I gather:
You're building an SQLite backend to basically a dictionary.
You want to allow the keys to be more than the basestring type (which types, exactly?).
You want it to survive the Python 2 -> Python 3 barrier.
You want to support large integers above 2**32 as the key.
You want the ability to store an unbounded number of distinct values (because you don't want hash collisions).
So, are you trying to build a general 'this can do it all' solution, or just trying to solve an immediate problem to continue on within a current project? You should spend a little more time to come up with a clear set of requirements.
Using a hash seemed like the best solution to me, but then you complain that you're going to have multiple rows with the same hash implying you're going to be storing enough values to even worry about the hash.

Python string interning and substrings

Does python create a completely new string (copying the contents) when you do a substring operation like:
new_string = my_old_string[foo:bar]
Or does it use interning to point to the old data ?
As a clarification, I'm curious if the underlying character buffer is shared as it is in Java. I realize that strings are immutable and will always appear to be a completely new string, and it would have to be an entirely new string object.
Examining the source reveals:
When the slice indexes match the start and end of the original string, then the original string is returned.
Otherwise, you get the result of the function PyString_FromStringAndSize, which takes the existing string object. This function returns an interned string in the case of a 0- or 1-character string; otherwise it copies the substring into a new string object.
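Both behaviours described above can be observed (on CPython; the identity results are implementation details):

```python
s = "hello world"
assert s[:] is s           # a full-range slice returns the original object
assert s[:5] is not s[:5]  # each partial slice copies into a new object
```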
You may also be interested in itertools.islice, which provides a lazy iterator over the original string rather than a copy:
>>> from sys import getrefcount
>>> from itertools import islice
>>> h="foobarbaz"
>>> getrefcount(h)
2
>>> g=islice(h,3,6)
>>> getrefcount(h)
3
>>> "".join(g)
'bar'
>>>
It's a completely new string (so the old, bigger one can be freed when feasible, rather than staying alive just because some tiny string has been sliced from it and is being kept around).
intern is a different thing, though.
Looks like I can answer my own question. I opened up the source, and guess what I found:
static PyObject *
string_slice(register PyStringObject *a, register Py_ssize_t i,
             register Py_ssize_t j)
    ... snip ...
    return PyString_FromStringAndSize(a->ob_sval + i, j-i);
...and no reference to interning. PyString_FromStringAndSize() only explicitly interns strings of size 1 and 0.
So it seems clear that you'll always get a totally new object and they won't share any buffers.
In Python, strings are immutable. That means you will always get a copy from any slice, concatenation, or other operation.
http://effbot.org/pyfaq/why-are-python-strings-immutable.htm is a nice explanation for some of the reasons behind immutable strings.
