Does Python do slice-by-reference on strings? - python

I want to know if when I do something like
a = "This could be a very large string..."
b = a[:10]
a new string is created or a view/iterator is returned

Python does slice-by-copy, meaning every time you slice (except for very trivial slices, such as a[:]), it copies the selected data into a new string object.
According to one of the developers, this choice was made because
The [slice-by-reference] approach is more complicated, harder to implement
and may lead to unexpected behavior.
For example:
a = "a long string with 500,000 chars ..."
b = a[0]
del a
With the slice-as-copy design the string a is immediately freed. The
slice-as-reference design would keep the 500kB string in memory although
you are only interested in the first character.
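If you absolutely need a view rather than a copy, you can use a memoryview object, but note that it works on bytes-like objects (bytes, bytearray), not on str, so the text has to be encoded first.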
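As a rough illustration of the copy-versus-view distinction (a sketch assuming CPython 3; the variable names are only for illustration), slicing a str yields a separate object, while a memoryview over the encoded bytes refers to the original buffer without copying it:

```python
a = "This could be a very large string..."

b = a[:10]        # a new, independent str holding 10 characters
print(b)          # This could
print(b is a)     # False

# memoryview needs a bytes-like object, so the text is encoded first
# (the encoding step itself creates a copy of the data).
data = a.encode("utf-8")
view = memoryview(data)[:10]   # slicing a memoryview does not copy the bytes

print(view.obj is data)        # True: the view still refers to the original buffer
print(view.tobytes())          # b'This could' (tobytes() copies on demand)
```

Because the view keeps data alive, the memory concern from the quote above applies to it: the whole buffer stays in memory for as long as the view exists.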

When you slice a string, you get back a new str instance. Strings are immutable objects.

Related

Is string internally stored as individual characters, each character in memory shared by other similar strings?

For example, is the string var1 = 'ROB' stored as 3 memory locations R, O and B each with its own address and the variable var1 points to the memory location R? Then how does it point to O and B?
And do other strings – for example: var2 = 'BOB' – point to the same B and O in memory that var1 refers to?
How strings are stored is an implementation detail, but in practice, on the CPython reference interpreter, they're stored as a C-style array of characters. So if the R is at address x, then O is at x+1 (or +2 or +4, depending on the largest ordinal value in the string), and B is at x+2 (or +4 or +8). Because the letters are stored consecutively, knowing where R is (and a flag in the str that says how big each character's storage is) is enough to locate O and B.
'BOB' is at a completely different address, y, and its O and B are contiguous as well. The OB in 'ROB' is utterly unrelated to the OB in 'BOB'.
There is a confusing aspect to this. If you index into the strings, and check the id of the result, it will seem like 'O' has the same address in both strings. But that's only because:
Indexing into a string returns a new string, unrelated to the one being indexed, and
CPython caches length one strings in the latin-1 range, so 'O' is a singleton (no matter how you make it, you get back the cached string)
I'll note that the actual str internals in modern Python are even more complicated than I covered above; a single string might store the same data in up to three different encodings in the same object (the canonical form, and cached version(s) for working with specific Python C APIs). It's not visible from the Python level aside from checking the size with sys.getsizeof though, so it's not worth worrying about in general.
If you really want to head off into the weeds, feel free to read PEP 393: Flexible String Representation which elaborates on the internals of the new str object structure adopted in CPython 3.3.
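The points above can be checked from the Python level. This is a sketch that assumes CPython 3.3 or later; bytes_per_char is a hypothetical helper written for this illustration, and the identity result is an implementation detail rather than a language guarantee.

```python
import sys

var1 = "ROB"
var2 = "BOB"

# Indexing returns a new one-character string; CPython hands back a cached
# singleton for characters in the latin-1 range, so the identities match.
print(var1[1] is var2[1])   # True on CPython, not guaranteed elsewhere


def bytes_per_char(ch):
    """Hypothetical helper: growth in object size when one more `ch` is added."""
    return sys.getsizeof(ch * 101) - sys.getsizeof(ch * 100)


# The storage width depends on the largest code point in the string.
print(bytes_per_char("a"))            # 1 (ASCII / latin-1 range)
print(bytes_per_char("\u20ac"))       # 2 (euro sign, needs 16 bits)
print(bytes_per_char("\U0001F600"))   # 4 (outside the Basic Multilingual Plane)
```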
This is only a partial answer:
var1 is a name that refers to a string object 'ROB'.
var2 is a name that refers to another string object 'BOB'.
How a string object stores the individual characters, and whether different string objects share the same memory, I cannot answer now in more detail than "sometimes" and "it depends". It has to do with string interning, which may be used.

How to replace an individual bit within a Python object of type bytes by setting its value explicitly

Let's assume I have a variable tmp that is of type bytes and contains zeros and ones. I want to replace the value of the fifth position within tmp by setting an explicit value (e.g. 1).
I wonder what is a clean way to replace individual bits within an object (tmp) that has type 'Bytes'. I would like to set it directly. My attempt does not work. Help in understanding the problem in my approach would highly be appreciated.
print(tmp) # -> b'00101001'
print(type(tmp)) # -> <class 'bytes'>
tmp[3] = 1 # Expected b'00111001' but actually got TypeError: 'bytes' object does not support item assignment
Is there a function like set_bit_in(tmp, position, bit_value)?
A bytes object is immutable in Python; you can index it and iterate over it, though.
You can turn it into a bytearray though, and that would be the easiest way to go about it
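Or what you can do is, for example, turn it into a list of characters, change the value, and join it back, as follows:
tmp_list = list(tmp.decode('ascii'))      # ['0', '0', '1', '0', '1', '0', '0', '1']
tmp_list[3] = '1'
tmp = ''.join(tmp_list).encode('ascii')   # b'00111001'
If the value is held as an int instead, bin() gives you the digits; the first two characters of its result are always '0b' and can be stripped with [2:], but note that bin() also drops leading zeros.
Also, this bytes object holds the ASCII characters '0' and '1' rather than raw bit values, so the assignment you want to make in the list is = '1', not = 1.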
If turning to a list, then back, is not the way you wanna go you can also just copy the string representation and change the one element you wanna change.
Alternatively you can perform bitwise operations (on the int itself), if you feel comfortable with working with binaries
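A minimal sketch of the bytearray and bitwise routes mentioned above, assuming tmp holds ASCII '0'/'1' characters as in the question. The set_bit helper and its parameters are hypothetical names for this illustration, not a standard-library function.

```python
tmp = b"00101001"

# Route 1: bytearray is the mutable counterpart of bytes.
buf = bytearray(tmp)
buf[3] = ord("1")        # items are ints, so store the code of the character '1'
via_bytearray = bytes(buf)
print(via_bytearray)     # b'00111001'


# Route 2: treat the digits as a number and use bitwise operations.
def set_bit(value, position, bit_value, width=8):
    """Set the bit at `position`, counted from the left of a `width`-bit number."""
    mask = 1 << (width - 1 - position)
    return value | mask if bit_value else value & ~mask


as_int = int(tmp.decode("ascii"), 2)                       # 41
new_int = set_bit(as_int, 3, 1)                            # 57
via_bitwise = format(new_int, "08b").encode("ascii")
print(via_bitwise)       # b'00111001'
```

Counting positions from the left keeps the helper consistent with the string index used in the question; counting from the least significant bit would be the more common convention for integers.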

Python String Object Reference

I'm learning python (3.6) and I have discovered the following:
a = "hi"
b = "hi"
a == b #True
a is b #True
a = list(a)
b = list(b)
a = "".join(a)
b = "".join(b)
a == b #True
a is b #False
Why is the result different after conversion to list and joining back to string? I do understand that Python VM maintains a pool of strings and hence the reference is the same for a and b. But why does this not work after joining the list to the very same string?
Thanks!
The key lies here:
a = "".join(a)
b = "".join(b)
The str.join() method returns a new string, built by joining the elements of a list.
Each call to str.join() instantiates a new string: in the first call a string is created and its reference is assigned to a, then, in the second call, a new string gets built and its reference is assigned to b. Because of this, the two names a and b are references to two new and distinct strings, which themselves are two separate objects.
The is operator behaves as designed, returning False as a and b are not references to the same object.
If you're trying to see if the two strings are equal in content, then the operator == is likely a better choice.
You shouldn't really compare anything except singletons (like None, True or False) with is. Because is doesn't really compare the content, it just checks if it's the same object. So is will fail if you compare different objects with the same content.
The fact that your first a is b worked is because literals are interned (*). So a and b are the same object because both are literals with the same content. But that's an implementation detail and it could yield different results in future (or older) Python versions, so don't start comparing string literals with is on the basis that it works right now.
(*) It really should return False because the way you've written the cases they shouldn't be the same object. They just happen to be the same one because CPython optimizes some cases.
There are lots of ways to answer this, but here you can think about memory: the physical bits in your RAM that make up the data. In Python, the is operator checks whether two names refer to the very same object (in CPython, whether their addresses match exactly). The == operator checks whether the values of the objects are the same by calling the __eq__ magic method, which a class defines to give the operator its meaning. The interesting part arises from the fact that a and b are originally identical; this question should help you with that:
when does Python allocate new memory for identical strings?
Essentially, Python can optimise the "hi" strings because they appear as literals in your source: when the code is compiled, it builds a table of such strings to save memory. When a string object is built from a list at run time, Python doesn't know in advance what the contents will be. In your particular version of the interpreter, this means a new string object is created by default; that saves the time of checking for duplicates but requires more memory. If you want to force the interpreter to check for duplicates, use the sys.intern function: https://docs.python.org/3/library/sys.html
sys.intern(string):
Enter string in the table of “interned” strings and return the interned string – which is string itself or a copy. Interning strings is useful to gain a little performance on dictionary lookup – if the keys in a dictionary are interned, and the lookup key is interned, the key comparisons (after hashing) can be done by a pointer compare instead of a string compare. Normally, the names used in Python programs are automatically interned, and the dictionaries used to hold module, class or instance attributes have interned keys.
Interned strings are not immortal; you must keep a reference to the return value of intern() around to benefit from it.
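A short sketch of the behaviour discussed in this thread, assuming CPython 3. The identity of the two joined strings is an implementation detail, so that line carries no guaranteed output; only the sys.intern result is something to rely on.

```python
import sys

a = "".join(["h", "i"])
b = "".join(["h", "i"])

print(a == b)    # True: same content
print(a is b)    # typically False on CPython: two separately built objects

# sys.intern returns the single canonical object for a given content.
a_interned = sys.intern(a)
b_interned = sys.intern(b)
print(a_interned is b_interned)   # True
```

For ordinary code, == remains the right comparison; interning mainly helps when many equal strings are used as dictionary keys.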

What does python sys getsizeof for string return?

What does sys.getsizeof return for a standard string? I am noticing that this value is much higher than what len returns.
I will attempt to answer your question from a broader point of view. You're referring to two functions and comparing their outputs. Let's take a look at their documentation first:
len():
Return the length (the number of items) of an object. The argument may
be a sequence (such as a string, bytes, tuple, list, or range) or a
collection (such as a dictionary, set, or frozen set).
So in case of string, you can expect len() to return the number of characters.
sys.getsizeof():
Return the size of an object in bytes. The object can be any type of
object. All built-in objects will return correct results, but this
does not have to hold true for third-party extensions as it is
implementation specific.
So in the case of a string (as with many other objects) you can expect sys.getsizeof() to return the size of the object in bytes. There is no reason to think that it should be the same as the number of characters.
Let's have a look at some examples:
>>> import sys
>>> first = "First"
>>> len(first)
5
>>> sys.getsizeof(first)
42
This example confirms that the size is not the same as the number of characters.
>>> second = "Second"
>>> len(second)
6
>>> sys.getsizeof(second)
43
We can notice that if we look at a string one character longer, its size is one byte bigger as well. We don't know if it's a coincidence or not though.
>>> together = first + second
>>> print(together)
FirstSecond
>>> len(together)
11
If we concatenate the two strings, their combined length is equal to the sum of their lengths, which makes sense.
>>> sys.getsizeof(together)
48
Contrary to what someone might expect though, the size of the combined string is not equal to the sum of their individual sizes. But it still seems to be the length plus something. In particular, something worth 37 bytes. Now you need to realize that it's 37 bytes in this particular case, using this particular Python implementation etc. You should not rely on that at all. Still, we can take a look at why it's 37 bytes and what they are (approximately) used for.
In CPython 2 (CPython being probably the most widely used implementation of Python), string objects are implemented as PyStringObject; Python 3's str uses a different structure, described in PEP 393. This is the C source code (I use the 2.7.9 version):
typedef struct {
    PyObject_VAR_HEAD
    long ob_shash;
    int ob_sstate;
    char ob_sval[1];

    /* Invariants:
     *     ob_sval contains space for 'ob_size+1' elements.
     *     ob_sval[ob_size] == 0.
     *     ob_shash is the hash of the string or -1 if not computed yet.
     *     ob_sstate != 0 iff the string object is in stringobject.c's
     *       'interned' dictionary; in this case the two references
     *       from 'interned' to this object are *not counted* in ob_refcnt.
     */
} PyStringObject;
You can see that there is something called PyObject_VAR_HEAD, one int, one long and a char array. The char array will always contain one more character to store the '\0' at the end of the string. This, along with the int, long and PyObject_VAR_HEAD, takes the additional 37 bytes. On a typical 64-bit Linux build that would be 24 bytes for PyObject_VAR_HEAD (reference count, type pointer and length), 8 for the long, 4 for the int and 1 for the '\0', but the exact layout is platform-specific. PyObject_VAR_HEAD is defined in another C header and refers to other implementation-specific details, which you would need to explore if you want to find out exactly where the 37 bytes go. Plus, the documentation mentions that sys.getsizeof()
adds an additional garbage collector overhead if the object is managed
by the garbage collector.
Overall, you don't need to know what exactly takes the something (the 37 bytes here) but this answer should give you a certain idea why the numbers differ and where to find more information should you really need it.
To quote the documentation:
Return the size of an object in bytes. The object can be any type of object. All built-in objects will return correct results, but this does not have to hold true for third-party extensions as it is implementation specific.
Built-in strings are not simple character sequences. They are full-fledged objects carrying per-object overhead (such as a reference count, a type pointer, a length and a cached hash), which probably explains the size discrepancy you're noticing.
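To reproduce the observation on a current interpreter, here is a small sketch assuming CPython 3. The overhead helper is a hypothetical name used only here, and the absolute numbers are deliberately not shown because they vary between versions and platforms.

```python
import sys


def overhead(s):
    """Hypothetical helper: bytes reported by getsizeof beyond one byte per character."""
    return sys.getsizeof(s) - len(s)


first = "First"
second = "Second"
together = first + second

for s in (first, second, together):
    # Prints the length, the reported size and the fixed part for each string.
    print(len(s), sys.getsizeof(s), overhead(s))

# For pure-ASCII strings the fixed part is the same for all three, so the
# concatenation is smaller than the sum of the two individual sizes.
print(sys.getsizeof(together) < sys.getsizeof(first) + sys.getsizeof(second))
```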

Python string interning and substrings

Does python create a completely new string (copying the contents) when you do a substring operation like:
new_string = my_old_string[foo:bar]
Or does it use interning to point to the old data ?
As a clarification, I'm curious if the underlying character buffer is shared as it is in Java. I realize that strings are immutable and will always appear to be a completely new string, and it would have to be an entirely new string object.
Examining the source reveals:
When the slice indexes match the start and end of the original string, then the original string is returned.
Otherwise, you get the result of the function PyString_FromStringAndSize, which takes a pointer into the existing string's character buffer and a length. This function returns a shared, cached string in the case of a 0- or 1-character string; otherwise it copies the substring into a new string object.
You may also be interested in islice, which gives you a lazy iterator over the original string (it holds a reference to it instead of copying, although it still has to step past the skipped characters):
>>> from sys import getrefcount
>>> from itertools import islice
>>> h="foobarbaz"
>>> getrefcount(h)
2
>>> g=islice(h,3,6)
>>> getrefcount(h)
3
>>> "".join(g)
'bar'
>>>
It's a completely new string (so the old, bigger one can be freed when feasible, rather than staying alive just because some tiny string was sliced from it and is being kept around).
intern is a different thing, though.
Looks like I can answer my own question, opened up the source and guess what I found:
static PyObject *
string_slice(register PyStringObject *a, register Py_ssize_t i,
             register Py_ssize_t j)
    ... snip ...
    return PyString_FromStringAndSize(a->ob_sval + i, j-i);
..and no reference to interning. FromStringAndSize() only explicitly interns on strings of size 1 and 0
So it seems clear that you'll always get a totally new object and they won't share any buffers.
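In Python, strings are immutable, so slicing, concatenation and similar operations never modify the original. In CPython they produce a new string object holding a copy of the data, apart from trivial cases such as a[:], which can return the original object.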
http://effbot.org/pyfaq/why-are-python-strings-immutable.htm is a nice explanation for some of the reasons behind immutable strings.
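The cases described in this thread can be observed from Python itself. This is a sketch assuming CPython 3; the identity results are implementation details and may differ on other interpreters.

```python
s = "foobarbaz"

whole = s[:]       # slice covering the entire string
part = s[3:6]      # proper substring
one = s[3:4]       # one-character substring

print(part)            # bar
print(part is s)       # False: a separate object with its own copy of the data
print(whole is s)      # True on CPython: the original object is returned
print(one is s[6])     # True on CPython: one-character strings are cached
```

Nothing here relies on interning: the substring is simply built from a copy of the selected characters.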
