PyUnicode_FromStringAndSize: Very terse documentation

Apologies if this is a stupid question which I suspect it may well be. I'm a Python user with little experience in C.
According to the official Python docs (v3.10.6):
PyObject *PyUnicode_FromStringAndSize(const char *u, Py_ssize_t size)
Return value: New reference. Part of the Stable ABI.
Create a Unicode object from the char buffer u. The bytes will be interpreted as being UTF-8 encoded. The buffer is copied into the new object. If the buffer is not NULL, the return value might be a shared object, i.e. modification of the data is not allowed. [...]
which has me slightly confused.
It says the data i.e. the buffer u is copied.
But then it says the data may be shared which seems to contradict the first statement.
My Question is:
What exactly do they mean? That the newly allocated copy of the data is shared? If so who with?
Also, coming from Python: Why do they make a point of warning against tampering with the data anyway? Is changing a Python-immutable object something routinely done in C?
Ultimately, all I need to know is what to do with u: Can/should I free it or has Python taken ownership?

Ultimately, all I need to know is what to do with u: Can/should I free it or has Python taken ownership?
You still own u. Python has no idea where u came from or how it should be freed. It could even be a local array. Python will not retain a pointer to u. Cleaning up u is still your responsibility.
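If you want to convince yourself of this without writing any C, you can call the function through ctypes. This is only a sketch: it assumes CPython (ctypes.pythonapi is CPython-specific), and the names from_string_and_size, buf and s are mine, not anything from the docs.

```python
import ctypes

# Bind the C API function; ctypes.pythonapi only exists on CPython.
from_string_and_size = ctypes.pythonapi.PyUnicode_FromStringAndSize
from_string_and_size.argtypes = [ctypes.c_char_p, ctypes.c_ssize_t]
from_string_and_size.restype = ctypes.py_object

# "u" in the docs: a buffer that we own, not Python's string machinery.
buf = ctypes.create_string_buffer(b"hello")

# The bytes are copied into a brand new str object.
s = from_string_and_size(buf, 5)

# Scribbling over our buffer afterwards does not affect the str,
# because the str never kept a pointer to it.
buf[0] = b"j"

print(s)          # the str built from the original bytes
print(buf.value)  # our buffer, now modified
```

In C terms that means you are free to free(u), reuse it, or let it go out of scope as soon as the call returns.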
What exactly do they mean? That the newly allocated copy of the data is shared? If so who with?
The returned string object may be shared with arbitrary other code. Python makes no promises about how that might happen, but in the current implementation, one way is that a single-character ASCII string will be drawn from a cached array of such strings:
/* ASCII is equivalent to the first 128 ordinals in Unicode. */
if (size == 1 && (unsigned char)s[0] < 128) {
    if (consumed) {
        *consumed = 1;
    }
    return get_latin1_char((unsigned char)s[0]);
}
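You can watch that cache in action from Python via ctypes. Again just a sketch, assuming CPython; the identity results are an implementation detail of the versions I know of, not something the docs promise, and the variable names are made up for the example.

```python
import ctypes

from_string_and_size = ctypes.pythonapi.PyUnicode_FromStringAndSize
from_string_and_size.argtypes = [ctypes.c_char_p, ctypes.c_ssize_t]
from_string_and_size.restype = ctypes.py_object

# Two separate calls for a one-character ASCII string...
first = from_string_and_size(b"a", 1)
second = from_string_and_size(b"a", 1)

# ...hand back the very same cached object in current CPython.
same_object = first is second

# Longer strings take the normal path and get a fresh object each time.
long_a = from_string_and_size(b"ab", 2)
long_b = from_string_and_size(b"ab", 2)
distinct_objects = long_a is not long_b

print(same_object, distinct_objects)
```

So the object you get back for "a" is one that the rest of the interpreter is already using, which is exactly why you must not write into it.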
Also, coming from Python: Why do they make a point of warning against tampering with the data anyway? Is changing a Python-immutable object something routinely done in C?
It is in fact fairly routine. Python-immutable objects have to be initialized somehow, and that means writing to their memory, in C. Immutability is an abstraction presented at the Python level, but the physical memory of an object is mutable. However, such mutation is only safe in very limited circumstances, and one of the requirements is that no other code should hold any references to the object.


Is string internally stored as individual characters, each character in memory shared by other similar strings?

For example, is the string var1 = 'ROB' stored as 3 memory locations R, O and B each with its own address and the variable var1 points to the memory location R? Then how does it point to O and B?
And do other strings – for example: var2 = 'BOB' – point to the same B and O in memory that var1 refers to?
How strings are stored is an implementation detail, but in practice, on the CPython reference interpreter, they're stored as a C-style array of characters. So if the R is at address x, then O is at x+1 (or +2 or +4, depending on the largest ordinal value in the string), and B is at x+2 (or +4 or +8). Because the letters are stored consecutively, knowing where R is (and a flag in the str that says how big each character's storage is) is enough to locate O and B.
'BOB' is at a completely different address, y, and its O and B are contiguous as well. The OB in 'ROB' is utterly unrelated to the OB in 'BOB'.
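The "+1 (or +2 or +4)" part can be observed from Python with sys.getsizeof. A rough sketch, assuming CPython 3.3 or later (the PEP 393 layout); bytes_per_char is a helper I made up for the illustration.

```python
import sys

def bytes_per_char(ch):
    """Growth in sys.getsizeof when one more copy of ch is appended."""
    return sys.getsizeof(ch * 101) - sys.getsizeof(ch * 100)

# The storage width is chosen from the largest ordinal in the string.
ascii_width = bytes_per_char("R")             # plain ASCII letter
bmp_width = bytes_per_char("\u0394")          # GREEK CAPITAL LETTER DELTA
astral_width = bytes_per_char("\U0001F600")   # an emoji outside the BMP

print(ascii_width, bmp_width, astral_width)   # 1 2 4 on CPython >= 3.3
```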
There is a confusing aspect to this. If you index into the strings, and check the id of the result, it will seem like 'O' has the same address in both strings. But that's only because:
Indexing into a string returns a new string, unrelated to the one being indexed, and
CPython caches length one strings in the latin-1 range, so 'O' is a singleton (no matter how you make it, you get back the cached string)
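The two points above can be checked like this. It is a sketch that relies on CPython's caching of one-character strings, which is an implementation detail, so treat the identity results as "what CPython happens to do" rather than a language guarantee.

```python
var1 = "ROB"
var2 = "BOB"

# The two strings are separate objects with separate character arrays.
whole_strings_differ = var1 is not var2

# Indexing builds a new length-one str; CPython returns its cached 'O'.
shared_o = var1[1] is var2[1]

# The same cached object comes back however the 'O' is produced.
made_another_way = chr(79) is var1[1]

print(whole_strings_differ, shared_o, made_another_way)
```

The shared id says nothing about how 'ROB' and 'BOB' lay out their own characters; it only shows that the temporary one-character results are the same cached object.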
I'll note that the actual str internals in modern Python are even more complicated than I covered above; a single string might store the same data in up to three different encodings in the same object (the canonical form, and cached version(s) for working with specific Python C APIs). It's not visible from the Python level aside from checking the size with sys.getsizeof though, so it's not worth worrying about in general.
If you really want to head off into the weeds, feel free to read PEP 393: Flexible String Representation which elaborates on the internals of the new str object structure adopted in CPython 3.3.
This is only a partial answer:
var1 is a name that refers to a string object 'ROB'.
var2 is a name that refers to another string object 'BOB'.
How a string object stores the individual characters, and whether different string objects share the same memory, I cannot answer now in more detail than "sometimes" and "it depends". It has to do with string interning, which may be used.

how to memset a unicode string in python 2.7

I have a unicode string f. I want to memset it to 0. print f should display null (\0)
I am using ctypes.memset to achieve this -
>>> f
u'abc'
>>> print ("%s" % type(f))
<type 'unicode'>
>>> import ctypes
>>> ctypes.memset(id(f)+50,0,6)
4363962530
>>> f
u'abc'
>>> print f
abc
Why did the memory location not get memset in case of unicode string?
It works perfectly for an str object.
Thanks for help.
First, this is almost certainly a very bad idea. Python expects strings to be immutable. There's a reason that even the C API won't let you change their contents after they're flagged ready. If you're just doing this to play around with the interpreter's implementation, that can be fun and instructive, but if you're doing it for any real-life purpose, you're probably doing something wrong.
In particular, if you're doing it for "security", what you almost certainly really want to do is to not create a unicode in the first place, but instead create, say, a bytearray with the UTF-16 or UTF-32 encoding of your string, which can be zeroed out in a way that's safe, portable, and a lot easier.
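Something along these lines is what I mean. It is a minimal sketch written in Python 3 syntax (the question is about 2.7, where the idea is the same), and the variable names are just for illustration. In real code you would fill the bytearray directly from wherever the data comes from, rather than from a str literal as I do here, since the literal itself stays in memory.

```python
# Hold the sensitive text as mutable bytes instead of an immutable string.
secret = bytearray("abc", "utf-16-le")
length_before = len(secret)   # 3 characters * 2 bytes each

# ... use the secret here ...

# Overwrite every byte in place; no ctypes or object offsets required.
for i in range(len(secret)):
    secret[i] = 0

print(secret)
```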
Anyway, there's no reason to expect that two completely different types should store their buffers at the same offset.
In CPython 2.x, a str is a PyStringObject:
typedef struct {
    PyObject_VAR_HEAD
    long ob_shash;
    int ob_sstate;
    char ob_sval[1];
} PyStringObject;
That ob_sval is the buffer; the offset should be 36 on 64-bit builds and (I think) 20 on 32-bit builds, since the five fields in front of it are then 4 bytes each.
In a comment, you say:
I read it somewhere and also the offset for a string type is 37 in my system which is what sys.getsizeof('') shows -> >>> sys.getsizeof('') 37
The offset for a string buffer is actually 36, not 37. And the fact that it's even that close is just a coincidence of the way str is implemented. (Hopefully you can understand why by looking at the struct definition—if not, you definitely shouldn't be writing code like this.) There's no reason to expect the same trick to work for some other type without looking at its implementation.
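If you'd rather not do the arithmetic by hand, you can mirror the struct with ctypes and ask for the offset. This is a sketch under the assumption that PyObject_VAR_HEAD expands to a refcount, a type pointer and a size (as in a non-debug CPython 2.x build); PyStringObjectLayout is my own name for the mirror, not a real CPython type.

```python
import ctypes

class PyStringObjectLayout(ctypes.Structure):
    """Hypothetical mirror of CPython 2.x's PyStringObject."""
    _fields_ = [
        # PyObject_VAR_HEAD (non-debug build assumed)
        ("ob_refcnt", ctypes.c_ssize_t),
        ("ob_type", ctypes.c_void_p),
        ("ob_size", ctypes.c_ssize_t),
        # str-specific fields
        ("ob_shash", ctypes.c_long),
        ("ob_sstate", ctypes.c_int),
        ("ob_sval", ctypes.c_char * 1),
    ]

# Where the character data starts, on whatever platform runs this.
sval_offset = PyStringObjectLayout.ob_sval.offset

print(sval_offset)  # 36 on a typical 64-bit Linux or macOS build
```

The empty string still needs one byte of ob_sval for its terminating null, which is why sys.getsizeof('') comes out one larger than this offset.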
A unicode is a PyUnicodeObject:
typedef struct {
    PyObject_HEAD
    Py_ssize_t length;   /* Length of raw Unicode data in buffer */
    Py_UNICODE *str;     /* Raw Unicode buffer */
    long hash;           /* Hash value; -1 if not set */
    PyObject *defenc;    /* (Default) Encoded version as Python
                            string, or NULL; this is used for
                            implementing the buffer protocol */
} PyUnicodeObject;
Its buffer is not even inside the object itself; that str member is a pointer to the buffer (which is not guaranteed to be right after the struct). The offset of that pointer should be 24 on 64-bit builds, and (I think) 12 on 32-bit builds. So, to do the equivalent, you'd need to read the pointer there, then follow it to find the location to memset.
If you're using a narrow-Unicode build, it should look like this:
>>> ctypes.POINTER(ctypes.c_uint16 * len(g)).from_address(id(g)+24).contents[:]
[97, 98, 99]
That's the ctypes translation of finding (uint16_t *)(((char *)g)+24) and reading the array that starts at *that and ends at *(that+len(g)), which is what you'd have to do if you were writing C code and didn't have access to the unicodeobject.h header.
(In the test I just quoted, g is at 0x10a598090, while its str points to 0x10a3b09e0, so the buffer is not immediately after the struct, or anywhere near it; it's about 2MB before it.)
For a wide-Unicode build, the same thing with c_uint32.
So, that should show you what you want to memset.
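Since most people no longer have a Python 2.7 around, here is the same pointer-chasing on a stand-in struct that you can run anywhere. It is only a sketch: PyUnicodeObjectLayout and fake are hypothetical names, the layout assumes a non-debug, narrow-Unicode build, and nothing here touches a real unicode object.

```python
import ctypes

class PyUnicodeObjectLayout(ctypes.Structure):
    """Hypothetical mirror of CPython 2.x's PyUnicodeObject (narrow build)."""
    _fields_ = [
        ("ob_refcnt", ctypes.c_ssize_t),
        ("ob_type", ctypes.c_void_p),
        ("length", ctypes.c_ssize_t),
        ("str", ctypes.POINTER(ctypes.c_uint16)),
        ("hash", ctypes.c_long),
        ("defenc", ctypes.c_void_p),
    ]

# The character data lives in its own allocation, like the real buffer.
chars = (ctypes.c_uint16 * 3)(97, 98, 99)   # u'abc' as UTF-16 code units
fake = PyUnicodeObjectLayout()
fake.length = 3
fake.str = ctypes.cast(chars, ctypes.POINTER(ctypes.c_uint16))

# Step 1: find where the pointer is stored inside the struct.
str_offset = PyUnicodeObjectLayout.str.offset

# Step 2: read the pointer stored there and follow it to the code units.
array_ptr_type = ctypes.POINTER(ctypes.c_uint16 * fake.length)
units = array_ptr_type.from_address(ctypes.addressof(fake) + str_offset).contents[:]

# The buffer is not part of the struct's own memory.
start = ctypes.addressof(fake)
buffer_inside_struct = start <= ctypes.addressof(chars) < start + ctypes.sizeof(fake)

print(str_offset, units, buffer_inside_struct)
```

A memset at object address plus some offset would therefore overwrite the pointer (or whatever follows it), not the characters.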
And you should also see a serious implication for your attempt at "security" here. (If I have to point it out, that's yet another indication that you should not be writing this code.)

Python 3.x C API: Do I have to free the memory after extracting a string from a PyObject?

I extract strings from a PyObject pointer using:
char* str = PyBytes_AsString(PyUnicode_AsASCIIString(strObj));
I'm wondering whether I have to free the memory after doing this. The manual doesn't seem to provide any information about this for these functions, but other functions do provide information to free memory using PyMem_Free().
More details
After all str is a pointer to something reserved. I would've thought that this is like std::string::c_str(), where the return is a const char* to something inside the object, but Python strings can have any kind of encoding, which is in general not ASCII. Meaning that if we convert formatting, we need to reserve some new space.
Free or not? If not, how does Python do this?
If you look at the documentation for PyUnicode_AsASCIIString, you will see
Return value: New reference.
Encode a Unicode object using ASCII and return the result as Python bytes object. Error handling is “strict”. Return NULL if an exception was raised by the codec.
The returned object is a regular Python bytes object, subject to regular Python reference handling. The reference you receive is a new reference, so you own that reference, and you are responsible for Py_DECREFing it when you're done with it. Since you do not do this, your code leaks this object.
You also need to handle the null return case. Since you do not, your code currently invokes undefined behavior if the codec raises an exception.
If you look at the documentation for PyBytes_AsString, you will see
Return a pointer to the contents of o. The pointer refers to the internal buffer of o, which consists of len(o) + 1 bytes. The last byte in the buffer is always null, regardless of whether there are any other null bytes. The data must not be modified in any way, unless the object was just created using PyBytes_FromStringAndSize(NULL, size). It must not be deallocated. If o is not a bytes object at all, PyBytes_AsString() returns NULL and raises TypeError.
This function returns a pointer to the internals of a bytes object. You should not free the pointer, and you should not modify the data it points to. You should also wait until you're done with this pointer before Py_DECREFing the bytes object whose internals you're looking at.
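The real fix belongs in your C code, but the behaviour described above can be poked at from Python with ctypes. Take this as a sketch assuming CPython: here ctypes takes care of the new reference for us (which is exactly the bookkeeping you have to do by hand with Py_DECREF in C), and the names as_ascii, as_string, encoded and raw are mine.

```python
import ctypes

as_ascii = ctypes.pythonapi.PyUnicode_AsASCIIString
as_ascii.argtypes = [ctypes.py_object]
as_ascii.restype = ctypes.py_object      # new reference, owned by ctypes here

as_string = ctypes.pythonapi.PyBytes_AsString
as_string.argtypes = [ctypes.py_object]
as_string.restype = ctypes.c_void_p      # raw address of the internal buffer

# The result of the encode step is an ordinary bytes object.
encoded = as_ascii("hello")

# The pointer refers to len(o) + 1 bytes inside that object; it is only
# valid while "encoded" is alive, so we keep "encoded" around while reading.
address = as_string(encoded)
raw = ctypes.string_at(address, len(encoded) + 1)

# Error handling is strict: non-ASCII input fails instead of returning bytes.
try:
    as_ascii("h\u00e9llo")
    strict_error = False
except UnicodeEncodeError:
    strict_error = True

print(encoded, raw, strict_error)
```

In C the same order applies: check the bytes object for NULL, use the char pointer, and only then Py_DECREF the bytes object.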

Does converting from bytearray to bytes incur a copy?

Does converting from the mutable bytearray type to the non-mutable bytes type incur a copy? Is there any cost associated with it, or does the interpreter just treat it as an immutable byte sequence, like casting a char* to a const char* const in C++?
ba = bytearray()
ba.extend("some big long string".encode('utf-8'))
# Is this conversion free or expensive?
write_bytes(bytes(ba))
Does this differ between Python 3 where bytes is its own type and Python 2.7 where bytes is just an alias for str?
A new copy is created; the buffer is not shared between the bytearray and the new bytes object, in either Python 2 or 3.
You couldn't share it, as the bytearray object could still be referenced elsewhere and mutate the value.
For the details, see the bytesobject.c source code, where the buffer protocol is used to create a straight up copy of the data (via PyBuffer_ToContiguous()).
Martijn is right. I just wanted to back that answer up with the CPython source.
Looking at the source for bytes here, first bytes_new is called, which will call PyBytes_FromObject, which will call _PyBytes_FromBuffer, which creates a new bytes object and calls PyBuffer_ToContiguous (defined here). This calls buffer_to_contiguous, which is a memory copy function. The comment for the function reads:
Copy src to a contiguous representation. order is one of 'C', 'F' (Fortran) or 'A' (Any). Assumptions: src has PyBUF_FULL information, src->ndim >= 1, len(mem) == src->len.
Thus, a call to bytes with a bytearray argument will copy the data.
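You can also see the copy from the Python side without reading any C. A small sketch; the memoryview is there only as a contrast, to show what actually sharing the buffer looks like.

```python
ba = bytearray(b"some big long string")

frozen = bytes(ba)        # copies the data into a new bytes object
view = memoryview(ba)     # shares ba's buffer, no copy

# Mutate the bytearray after the conversion.
ba[0] = ord("S")

print(frozen)             # still starts with a lowercase "s"
print(bytes(view))        # reflects the change made through ba
```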

Possible to store Python ints in less than 12 bytes?

Is it possible to make Python use less than 12 bytes for an int?
>>> x=int()
>>> x
0
>>> sys.getsizeof(x)
12
I am not a computer specialist but isn't 12 bytes excessive?
The smallest int I want to store is 0, the largest int 147097614, so I shouldn't really need more than 4 bytes.
(There is probably something I misunderstand here as I couldn't find an answer anywhere on the net. Keep that in mind.)
In Python, ints are objects just like everything else. Because of that, there is a little extra overhead just associated with the fact that you're using an object which has some associated meta-data.
If you're going to use lots of ints, and it makes sense to lay them out in an array-like structure, you should look into NumPy. NumPy ndarray objects will have a little overhead associated with them for the various pieces of meta-data that the array objects keep track of, but the actual data is stored as the datatype you specify (e.g. numpy.int32 for a 4-byte integer).
Thus, if you have:
import numpy as np
a = np.zeros(5000,dtype=np.int32)
The array will take only slightly more than 4*5000 = 20000 bytes of your memory.
Size of an integer object includes the overhead of maintaining other object information along with its value. The additional information can include object type, reference count and other implementation-specific details.
If you store many integers and want to optimize the space spent, use the array module, specifically arrays constructed with array.array('i').
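For example, something like the following. It is a rough sketch: the 'i' typecode is a C int, which is 4 bytes on the platforms I know of but is only guaranteed to be at least 2, and the exact getsizeof figures vary by Python version and build.

```python
import array
import sys

# 5000 values in the range from the question (0 .. 147097614).
values = array.array("i", [0, 147097614] * 2500)

payload_bytes = values.itemsize * len(values)    # raw storage for the numbers
total_bytes = sys.getsizeof(values)              # payload plus array overhead
as_int_objects = sum(sys.getsizeof(v) for v in values)  # cost as separate ints

print(values.itemsize, payload_bytes, total_bytes, as_int_objects)
```

The per-object overhead is paid once for the array rather than once per number.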
Integers in Python are objects, and are therefore stored with extra overhead.
You can read more information about it here
The integer type in CPython is stored in a structure like so:
typedef struct {
    PyObject_HEAD
    long ob_ival;
} PyIntObject;
PyObject_HEAD is a macro that expands out into a reference count and a pointer to the type object.
So you can see that:
long ob_ival - 4 bytes for a long.
Py_ssize_t ob_refcnt - I would assume Py_ssize_t here is 4 bytes.
PyTypeObject *ob_type - Is a pointer, so another 4 bytes.
12 bytes in total!
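The same sum can be done for whatever machine you are on by mirroring the struct with ctypes. This is a sketch that assumes PyObject_HEAD is just the refcount plus the type pointer (a non-debug CPython 2.x build); PyIntObjectLayout is my own name for the mirror.

```python
import ctypes

class PyIntObjectLayout(ctypes.Structure):
    """Hypothetical mirror of CPython 2.x's PyIntObject."""
    _fields_ = [
        # PyObject_HEAD (non-debug build assumed)
        ("ob_refcnt", ctypes.c_ssize_t),
        ("ob_type", ctypes.c_void_p),
        # the actual integer value
        ("ob_ival", ctypes.c_long),
    ]

header_bytes = PyIntObjectLayout.ob_ival.offset   # refcount + type pointer
value_bytes = ctypes.sizeof(ctypes.c_long)
total_bytes = ctypes.sizeof(PyIntObjectLayout)    # includes any padding

print(header_bytes, value_bytes, total_bytes)
```

With 4-byte fields, as on the asker's build, that gives 8 + 4 = 12; on a typical 64-bit Linux or macOS build each field is 8 bytes, so the total is 24.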
