Does converting from the mutable bytearray type to the non-mutable bytes type incur a copy? Is there any cost associated with it, or does the interpreter just treat it as an immutable byte sequence, like casting a char* to a const char* const in C++?
ba = bytearray()
ba.extend("some big long string".encode('utf-8'))
# Is this conversion free or expensive?
write_bytes(bytes(ba))
Does this differ between Python 3 where bytes is its own type and Python 2.7 where bytes is just an alias for str?
A new copy is created; the buffer is not shared between the bytearray and the new bytes object, in either Python 2 or 3.
You couldn't share it, as the bytearray object could still be referenced elsewhere and used to mutate the value.
For the details, see the bytesobject.c source code, where the buffer protocol is used to create a straight-up copy of the data (via PyBuffer_ToContiguous()).
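You can see the independence of the two buffers from Python itself; a quick sketch:
ba = bytearray(b"some big long string")
b = bytes(ba)          # creates a new bytes object with its own copy of the data
ba[0:4] = b"SOME"      # mutate the original bytearray
print(b)               # b'some big long string' -- the bytes object is unaffected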
Martijn is right. I just wanted to back that answer up with the CPython source.
Looking at the source for bytes here, first bytes_new is called, which will call PyBytes_FromObject, which will call _PyBytes_FromBuffer, which creates a new bytes object and calls PyBuffer_ToContiguous (defined here). This calls buffer_to_contiguous, which is a memory copy function. The comment for the function reads:
Copy src to a contiguous representation. order is one of 'C', 'F' (Fortran) or 'A' (Any). Assumptions: src has PyBUF_FULL information, src->ndim >= 1, len(mem) == src->len.
Thus, a call to bytes with a bytearray argument will copy the data.
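If you want to observe the cost rather than infer it from the source, a rough timing sketch (the numbers are machine-dependent, so treat them as illustrative only):
import timeit

for n in (10**3, 10**5, 10**7):
    ba = bytearray(n)
    t = timeit.timeit(lambda: bytes(ba), number=100)
    print(n, t / 100)  # per-conversion time grows roughly linearly with n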
Apologies if this is a stupid question, which I suspect it may well be. I'm a Python user with little experience in C.
According to the official Python docs (v3.10.6):
PyObject *PyUnicode_FromStringAndSize(const char *u, Py_ssize_t size)
Return value: New reference. Part of the Stable ABI.
Create a Unicode object from the char buffer u. The bytes will be interpreted as being UTF-8 encoded. The buffer is copied into the new object. If the buffer is not NULL, the return value might be a shared object, i.e. modification of the data is not allowed. [...]
which has me slightly confused.
It says the data i.e. the buffer u is copied.
But then it says the data may be shared which seems to contradict the first statement.
My Question is:
What exactly do they mean? That the newly allocated copy of the data is shared? If so who with?
Also, coming from Python: Why do they make a point of warning against tampering with the data anyway? Is changing a Python-immutable object something routinely done in C?
Ultimately, all I need to know is what to do with u: Can/should I free it or has Python taken ownership?
Ultimately, all I need to know is what to do with u: Can/should I free it or has Python taken ownership?
You still own u. Python has no idea where u came from or how it should be freed. It could even be a local array. Python will not retain a pointer to u. Cleaning up u is still your responsibility.
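A minimal sketch of those ownership rules in C (make_greeting is a made-up name for illustration):
#include <Python.h>

static PyObject *make_greeting(void)
{
    char *u = malloc(6);        /* a buffer we own */
    if (u == NULL)
        return PyErr_NoMemory();
    memcpy(u, "hello", 6);
    PyObject *s = PyUnicode_FromStringAndSize(u, 5);  /* copies u's contents */
    free(u);                    /* safe: Python retained no pointer to u */
    return s;                   /* may be NULL if creation failed */
}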
What exactly do they mean? That the newly allocated copy of the data is shared? If so who with?
The returned string object may be shared with arbitrary other code. Python makes no promises about how that might happen, but in the current implementation, one way is that a single-character ASCII string will be drawn from a cached array of such strings:
/* ASCII is equivalent to the first 128 ordinals in Unicode. */
if (size == 1 && (unsigned char)s[0] < 128) {
    if (consumed) {
        *consumed = 1;
    }
    return get_latin1_char((unsigned char)s[0]);
}
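You can observe that sharing from the Python level; on CPython (an implementation detail, not a language guarantee):
a = "A"
b = chr(65)
print(a is b)  # True on CPython: both names refer to the cached 1-character string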
Also, coming from Python: Why do they make a point of warning against tampering with the data anyway? Is changing a Python-immutable object something routinely done in C?
It is in fact fairly routine. Python-immutable objects have to be initialized somehow, and that means writing to their memory, in C. Immutability is an abstraction presented at the Python level, but the physical memory of an object is mutable. However, such mutation is only safe in very limited circumstances, and one of the requirements is that no other code should hold any references to the object.
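One documented instance of this is building a bytes object in place, per the PyBytes_FromStringAndSize(NULL, size) exception quoted further below; a minimal sketch:
#include <Python.h>

static PyObject *build_bytes(Py_ssize_t n)
{
    /* Allocate a bytes object with an uninitialized buffer... */
    PyObject *b = PyBytes_FromStringAndSize(NULL, n);
    if (b == NULL)
        return NULL;
    /* ...then write into it. This is explicitly allowed only because
       no other code holds a reference to b yet. */
    memset(PyBytes_AS_STRING(b), 'x', (size_t)n);
    return b;
}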
I extract strings from a PyObject pointer using:
char* str = PyBytes_AsString(PyUnicode_AsASCIIString(strObj));
I'm wondering whether I have to free the memory after doing this. The manual doesn't seem to say anything about this for these functions, although for some other functions it does say to free the returned memory using PyMem_Free().
More details
After all, str is a pointer to some allocated memory. I would have thought this works like std::string::c_str(), where the return value is a const char* into the object's own storage, but Python strings can use any kind of encoding, in general not ASCII, meaning that a conversion has to reserve new space somewhere.
Free or not? If not, how does Python do this?
If you look at the documentation for PyUnicode_AsASCIIString, you will see
Return value: New reference.
Encode a Unicode object using ASCII and return the result as Python bytes object. Error handling is “strict”. Return NULL if an exception was raised by the codec.
The returned object is a regular Python bytes object, subject to regular Python reference handling. The reference you receive is a new reference, so you own that reference, and you are responsible for Py_DECREFing it when you're done with it. Since you do not do this, your code leaks this object.
You also need to handle the null return case. Since you do not, your code currently invokes undefined behavior if the codec raises an exception.
If you look at the documentation for PyBytes_AsString, you will see
Return a pointer to the contents of o. The pointer refers to the internal buffer of o, which consists of len(o) + 1 bytes. The last byte in the buffer is always null, regardless of whether there are any other null bytes. The data must not be modified in any way, unless the object was just created using PyBytes_FromStringAndSize(NULL, size). It must not be deallocated. If o is not a bytes object at all, PyBytes_AsString() returns NULL and raises TypeError.
This function returns a pointer to the internals of a bytes object. You should not free the pointer, and you should not modify the data it points to. You should also wait until you're done with this pointer before Py_DECREFing the bytes object whose internals you're looking at.
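Putting those rules together, a sketch of the original one-liner with the missing reference handling and error checks added (copy_ascii is a made-up helper name):
#include <Python.h>

static char *copy_ascii(PyObject *strObj)
{
    PyObject *bytesObj = PyUnicode_AsASCIIString(strObj);  /* new reference */
    if (bytesObj == NULL)
        return NULL;                         /* the codec raised an exception */
    char *copy = NULL;
    char *buf = PyBytes_AsString(bytesObj);  /* borrowed view into bytesObj */
    if (buf != NULL) {
        size_t len = strlen(buf) + 1;
        copy = malloc(len);                  /* copy out before dropping the ref */
        if (copy != NULL)
            memcpy(copy, buf, len);
    }
    Py_DECREF(bytesObj);                     /* release the reference we own */
    return copy;                             /* caller frees with free() */
}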
When I ran the below code I got 3 and 36 as the answers respectively.
x ="abd"
print len(x)
print sys.getsizeof(x)
Can someone explain to me what's the difference between them ?
They are not the same thing at all.
len() queries for the number of items contained in a container. For a string that's the number of characters:
Return the length (the number of items) of an object. The argument may be a sequence (string, tuple or list) or a mapping (dictionary).
sys.getsizeof() on the other hand returns the memory size of the object:
Return the size of an object in bytes. The object can be any type of object. All built-in objects will return correct results, but this does not have to hold true for third-party extensions as it is implementation specific.
Python string objects are not simple sequences of characters, 1 byte per character.
Specifically, the sys.getsizeof() function includes the garbage collector overhead if any:
getsizeof() calls the object’s __sizeof__ method and adds an additional garbage collector overhead if the object is managed by the garbage collector.
String objects do not need to be tracked (they cannot create circular references), but string objects do need more memory than just the bytes per character. In Python 2, the __sizeof__ method returns (in C code):
Py_ssize_t res;
res = PyStringObject_SIZE + PyString_GET_SIZE(v) * Py_TYPE(v)->tp_itemsize;
return PyInt_FromSsize_t(res);
where PyStringObject_SIZE is the C struct header size for the type, PyString_GET_SIZE basically is the same as len() and Py_TYPE(v)->tp_itemsize is the per-character size. In Python 2.7, for byte strings, the size per character is 1, but it's PyStringObject_SIZE that is confusing you; on my Mac that size is 37 bytes:
>>> sys.getsizeof('')
37
For unicode strings the per-character size goes up to 2 or 4 (depending on compilation options). On Python 3.3 and newer, Unicode strings take up between 1 and 4 bytes per character, depending on the contents of the string.
For containers such as dictionaries or lists that reference other objects, the memory size given covers only the memory used by the container and the pointer values used to reference those other objects. There is no straightforward method of including the memory size of the ‘contained’ objects because those same objects could have many more references elsewhere and are not necessarily owned by a single container.
The documentation states it like this:
Only the memory consumption directly attributed to the object is accounted for, not the memory consumption of objects it refers to.
If you need to calculate the memory footprint of a container and anything referenced by that container you’ll have to use some method of traversing to those contained objects and get their size; the documentation points to a recursive recipe.
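A small sketch of these effects on Python 3 (the exact numbers vary by version and platform, so take them as ballpark figures):
import sys

print(len("abd"), sys.getsizeof("abd"))  # e.g. 3 and 52 on a 64-bit CPython 3.x
print(sys.getsizeof(""))                 # the fixed per-object overhead, e.g. 49
print(sys.getsizeof("é" * 1000))         # non-ASCII text widens per-character storage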
The key difference is that len() gives the number of elements in the container, whereas sys.getsizeof() gives the size in bytes of the memory the object occupies.
For more information, read the Python documentation, available at
https://docs.python.org/3/library/sys.html#module-sys
Is it possible to make Python use less than 12 bytes for an int?
>>> x=int()
>>> x
0
>>> sys.getsizeof(x)
12
I am not a computer specialist but isn't 12 bytes excessive?
The smallest int I want to store is 0, the largest int 147097614, so I shouldn't really need more than 4 bytes.
(There is probably something I misunderstand here as I couldn't find an answer anywhere on the net. Keep that in mind.)
In Python, ints are objects just like everything else. Because of that, there is a little extra overhead associated simply with the fact that you're using an object, which carries some metadata.
If you're going to use lots of ints, and it makes sense to lay them out in an array-like structure, you should look into numpy. NumPy ndarray objects have a little overhead for the various pieces of metadata that the array object keeps track of, but the actual data is stored as the datatype you specify (e.g. numpy.int32 for a 4-byte integer).
Thus, if you have:
import numpy as np
a = np.zeros(5000, dtype=np.int32)
The array will take only slightly more than 4 * 5000 = 20000 bytes of your memory.
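You can verify that directly (the exact header size varies slightly between numpy versions):
import sys
import numpy as np

a = np.zeros(5000, dtype=np.int32)
print(a.nbytes)          # 20000: the raw element data
print(sys.getsizeof(a))  # 20000 plus a small fixed header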
The size of an integer object includes the overhead of maintaining other object information along with its value. This additional information can include the object type, reference count, and other implementation-specific details.
If you store many integers and want to optimize the space spent, use the array module, specifically arrays constructed with array.array('i').
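For example (the 'i' typecode maps to a C int, which is 4 bytes on most platforms):
import array
import sys

arr = array.array('i', range(5000))
print(sys.getsizeof(arr))  # roughly 4 * 5000 bytes plus a small header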
Integers in Python are objects, and are therefore stored with extra overhead.
You can read more information about it here
The integer type in CPython is stored in a structure like so:
typedef struct {
    PyObject_HEAD
    long ob_ival;
} PyIntObject;
PyObject_HEAD is a macro that expands out into a reference count and a pointer to the type object.
So, on a 32-bit build, you can see that:
long ob_ival - 4 bytes for the long.
Py_ssize_t ob_refcnt - I would assume the Py_ssize_t here is 4 bytes.
PyTypeObject *ob_type - a pointer, so another 4 bytes.
12 bytes in total!
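You can check the premise of this arithmetic yourself; the 12-byte figure matches a 32-bit Python 2 build, while 64-bit builds and Python 3 report larger sizes:
import sys

print(sys.getsizeof(0))  # 12 on 32-bit Python 2; e.g. 24 on a 64-bit build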