I have a unicode string f. I want to memset its contents to 0, so that print f displays null characters (\0).
I am using ctypes.memset to achieve this -
>>> f
u'abc'
>>> print ("%s" % type(f))
<type 'unicode'>
>>> import ctypes
>>> ctypes.memset(id(f)+50,0,6)
4363962530
>>> f
u'abc'
>>> print f
abc
Why did the memory location not get memset in the case of a unicode string?
It works perfectly for a str object.
Thanks for the help.
First, this is almost certainly a very bad idea. Python expects strings to be immutable. There's a reason that even the C API won't let you change their contents after they're flagged ready. If you're just doing this to play around with the interpreter's implementation, that can be fun and instructive, but if you're doing it for any real-life purpose, you're probably doing something wrong.
In particular, if you're doing it for "security", what you almost certainly really want to do is to not create a unicode in the first place, but instead create, say, a bytearray with the UTF-16 or UTF-32 encoding of your string, which can be zeroed out in a way that's safe, portable, and a lot easier.
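For example, a minimal sketch of that safer alternative (the variable name and the choice of UTF-32 here are mine, just for illustration):

# Keep the sensitive text in a mutable bytearray instead of a unicode object,
# then zero it in place once you're done with it.
secret = bytearray(u'abc'.encode('utf-32-le'))
# ... use secret ...
for i in range(len(secret)):
    secret[i] = 0
print(secret == bytearray(len(secret)))  # True: every byte is now \x00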
Anyway, there's no reason to expect that two completely different types should store their buffers at the same offset.
In CPython 2.x, a str is a PyStringObject:
typedef struct {
PyObject_VAR_HEAD
long ob_shash;
int ob_sstate;
char ob_sval[1];
} PyStringObject;
That ob_sval is the buffer; the offset should be 36 on 64-bit builds and (I think) 24 on 32-bit builds.
In a comment, you say:
I read it somewhere, and also the offset for a string type is 37 on my system, which is what sys.getsizeof('') shows: >>> sys.getsizeof('') 37
The offset for a string buffer is actually 36, not 37. And the fact that it's even that close is just a coincidence of the way str is implemented. (Hopefully you can understand why by looking at the struct definition—if not, you definitely shouldn't be writing code like this.) There's no reason to expect the same trick to work for some other type without looking at its implementation.
A unicode is a PyUnicodeObject:
typedef struct {
PyObject_HEAD
Py_ssize_t length; /* Length of raw Unicode data in buffer */
Py_UNICODE *str; /* Raw Unicode buffer */
long hash; /* Hash value; -1 if not set */
PyObject *defenc; /* (Default) Encoded version as Python
string, or NULL; this is used for
implementing the buffer protocol */
} PyUnicodeObject;
Its buffer is not even inside the object itself; that str member is a pointer to the buffer (which is not guaranteed to be right after the struct). Its offset should be 24 on 64-bit builds, and (I think) 20 on 32-bit builds. So, to do the equivalent, you'd need to read the pointer there, then follow it to find the location to memset.
If you're using a narrow-Unicode build, it should look like this:
>>> ctypes.POINTER(ctypes.c_uint16 * len(g)).from_address(id(g)+24).contents[:]
[97, 98, 99]
That's the ctypes translation of treating ((char *)g)+24 as a uint16_t **: read the pointer stored there, then read the len(g)-element array it points to. It's what you'd have to do if you were writing C code and didn't have access to the unicodeobject.h header.
(In the test I just quoted, g is at 0x10a598090, while its str member points to 0x10a3b09e0, so the buffer is not immediately after the struct, or anywhere near it; it's about 2MB before it.)
For a wide-Unicode build, the same thing with c_uint32.
So, that should show you what you want to memset.
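Putting it together, a rough sketch of what that memset would look like (this assumes narrow-Unicode CPython 2.x on a 64-bit build, so the Py_UNICODE * lives at offset 24 and each character is 2 bytes; it deliberately corrupts an "immutable" object, so treat it as an experiment only):

import ctypes
g = u'abc'
# Read the str member (the pointer to the character buffer) out of the object...
buf_ptr = ctypes.c_void_p.from_address(id(g) + 24).value
# ...and zero the characters it points to (2 bytes per Py_UNICODE on a narrow build).
ctypes.memset(buf_ptr, 0, len(g) * 2)
print(repr(g))  # u'\x00\x00\x00'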
And you should also see a serious implication for your attempt at "security" here. (If I have to point it out, that's yet another indication that you should not be writing this code.)
Related
Apologies if this is a stupid question, which I suspect it may well be. I'm a Python user with little experience in C.
According to the official Python docs (v3.10.6):
PyObject *PyUnicode_FromStringAndSize(const char *u, Py_ssize_t size)
Return value: New reference. Part of the Stable ABI.
Create a Unicode object from the char buffer u. The bytes will be interpreted as being UTF-8 encoded. The buffer is copied into the new object. If the buffer is not NULL, the return value might be a shared object, i.e. modification of the data is not allowed. [...]
which has me slightly confused.
It says the data, i.e. the buffer u, is copied.
But then it says the data may be shared, which seems to contradict the first statement.
My Question is:
What exactly do they mean? That the newly allocated copy of the data is shared? If so who with?
Also, coming from Python: Why do they make a point of warning against tampering with the data anyway? Is changing a Python-immutable object something routinely done in C?
Ultimately, all I need to know is what to do with u: Can/should I free it or has Python taken ownership?
Ultimately, all I need to know is what to do with u: Can/should I free it or has Python taken ownership?
You still own u. Python has no idea where u came from or how it should be freed. It could even be a local array. Python will not retain a pointer to u. Cleaning up u is still your responsibility.
What exactly do they mean? That the newly allocated copy of the data is shared? If so who with?
The returned string object may be shared with arbitrary other code. Python makes no promises about how that might happen, but in the current implementation, one way is that a single-character ASCII string will be drawn from a cached array of such strings:
/* ASCII is equivalent to the first 128 ordinals in Unicode. */
if (size == 1 && (unsigned char)s[0] < 128) {
if (consumed) {
*consumed = 1;
}
return get_latin1_char((unsigned char)s[0]);
}
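You can see that sharing from Python too, if you're curious (this relies on a CPython caching detail, not on anything the documentation guarantees):

a = "a"
b = "abc"[0]
# Two independently produced one-character ASCII strings come from the same
# cached object on current CPython builds.
print(a is b)  # True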
Also, coming from Python: Why do they make a point of warning against tampering with the data anyway? Is changing a Python-immutable object something routinely done in C?
It is in fact fairly routine. Python-immutable objects have to be initialized somehow, and that means writing to their memory, in C. Immutability is an abstraction presented at the Python level, but the physical memory of an object is mutable. However, such mutation is only safe in very limited circumstances, and one of the requirements is that no other code should hold any references to the object.
I'm using a binary Python library that returns a Buffer object. This object is basically a wrapper of a C object containing a pointer to the actual memory buffer. What I need is to get the memory address contained in that pointer from Python, the problem is that the Buffer object doesn't have a Python method to obtain it, so I need to do some hacky trick to get it.
For the moment I found an ugly and unsafe way to get the pointer value:
I know the internal structure of the C object:
typedef struct _Buffer {
PyObject_VAR_HEAD PyObject *parent;
int type; /* GL_BYTE, GL_SHORT, GL_INT, GL_FLOAT */
int ndimensions;
int *dimensions;
union {
char *asbyte;
short *asshort;
int *asint;
float *asfloat;
double *asdouble;
void *asvoid;
} buf;
} Buffer;
So I wrote this Python code:
import sys
import ctypes
import struct

# PyObject_VAR_HEAD size (taken from sys.getsizeof(0))
# + 8 bytes for PyObject *parent
# + 4 bytes for int type
# + 4 bytes for int ndimensions
# + 8 bytes for int *dimensions
# = 24
offset = sys.getsizeof(0) + 24
buffer_pointer_addr = id(buffer) + offset
buffer_pointer_data = ctypes.string_at(buffer_pointer_addr, 8)
buffer_pointer_value = struct.unpack('Q', buffer_pointer_data)[0]
This is working consistently for me. As you can see I'm getting the memory address of the Python Buffer object with id(buffer), but as you may know that's not the actual pointer to the buffer, but just a Python number that in CPython happens to be the memory address to the Python object.
So then I'm adding the offset that I calculated by adding the sizes of all the variables in the C struct. I'm hardcoding the byte sizes (which is obviously completely unsafe) except for the PyObject_VAR_HEAD, which I get with sys.getsizeof(0).
By adding the offset I get the memory address that contains the pointer to the actual buffer, then I use ctypes to extract it with ctypes.string_at hardcoding the size of the pointer as 8 bytes (I'm on a 64bit OS), then I use struct.unpack to convert it to an actual Python int.
So now my question is: how could I implement a safer solution without hardcoding all the sizes? (if it exists). Maybe something with ctypes? It's OK if it only works on CPython.
I found a safer solution after investigating about C Struct padding and based on the following assumptions:
The code will only be used on CPython.
The buffer pointer is at the end of the C Struct.
The buffer pointer size can be safely taken from the void * C type, as it will be the biggest member of the union{} in the C struct; in any case, there will be no size differences between data pointer types on most modern OSes.
The C struct members are exactly the ones shown in the question.
Based on all these assumptions and the rules found here: https://stackoverflow.com/a/38144117/8861787, we can safely say that there will be no padding at the end of the struct, and we can extract the pointer without hardcoding anything:
import sys
import ctypes

# Get the size of the Buffer Python object
buffer_obj_size = sys.getsizeof(buffer)
# Get the size of the void * C type
buffer_pointer_size = ctypes.sizeof(ctypes.c_void_p)
# Calculate the address of the pointer, assuming it sits at the end of the C struct
buffer_pointer_addr = id(buffer) + buffer_obj_size - buffer_pointer_size
# Get the actual pointer value as a Python int
buffer_pointer_value = ctypes.c_void_p.from_address(buffer_pointer_addr).value
When defining a variable type that will hold a string in Cython + Python 3, I can use (at least):
cdef char* mystring = "foo"
cdef str mystring = "foo"
cdef bytes mystring = "foo"
The documentation page on strings is unclear on this -- it mostly gives examples using char* and bytes, and frankly I'm having a lot of difficulty understanding it.
In my case the strings will be coming from a Python3 program and are assumed to be unicode. They will be used as dict keys and function arguments, but I will do no further manipulation on them. Needless to say I am trying to maximize speed.
This question suggests that under Python2.7 and without Unicode, typing as str makes string manipulation code run SLOWER than with no typing at all. (But that's not necessarily relevant here since I won't be doing much string manipulation.)
What are the advantages and disadvantages of each of these options?
If no further processing is done on a particular string, it is best and fastest not to type it at all, which means it is treated as a general-purpose PyObject *.
The str type is a special case: it is the byte string (bytes) in Python 2 and the Unicode string (unicode) in Python 3.
So code that types a string as str and handles it as unicode will break on Python 2, where str means bytes.
Strings only need to be typed if they are to be converted to C char* or C++ std::string. There, you would use str to handle py2/py3 compatibility, along with helper functions to convert to/from bytes and unicode in order to be able to convert to either char* or std::string.
Typing of strings is for interoperability with C/C++, not for speed as such. Cython will auto-convert, without copying, a bytes string to a char* for example when it sees something like cdef char* c_string = b_string[:b_len] where b_string is a bytes type.
OTOH, if strings are typed without that type being used, Cython will do a conversion from object to bytes/unicode when it does not need to, which leads to overhead.
This can be seen in the C code generated as Pyx_PyObject_AsString, Pyx_PyUnicode_FromString et al.
This is also true in general - the rule of thumb is that if a specific type is not needed for further processing/conversion, it's best not to type it at all. Everything in Python is an object, so typing will convert from the general-purpose PyObject * to something more specific.
Some quick testing revealed that for this particular case, only the str declaration worked -- all other options produced errors. Since the string is generated elsewhere in Python3, evidently the str type declaration is needed.
Whether it is faster not to make any declaration at all remains an open question.
How does Python allocate memory for large integers?
An int type has a size of 28 bytes and as I keep increasing the value of the int, the size increases in increments of 4 bytes.
Why 28 bytes initially for any value as low as 1?
Why increments of 4 bytes?
PS: I am running Python 3.5.2 on a x86_64 (64 bit machine). Any pointers/resources/PEPs on how the (3.0+) interpreters work on such huge numbers is what I am looking for.
Code illustrating the sizes:
>>> a=1
>>> print(a.__sizeof__())
28
>>> a=1024
>>> print(a.__sizeof__())
28
>>> a=1024*1024*1024
>>> print(a.__sizeof__())
32
>>> a=1024*1024*1024*1024
>>> print(a.__sizeof__())
32
>>> a=1024*1024*1024*1024*1024*1024
>>> a
1152921504606846976
>>> print(a.__sizeof__())
36
Why 28 bytes initially for any value as low as 1?
I believe bgusach answered that completely; Python uses C structs to represent objects in the Python world, any object, including ints:
struct _longobject {
PyObject_VAR_HEAD
digit ob_digit[1];
};
PyObject_VAR_HEAD is a macro that, when expanded, adds another field to the struct (a PyVarObject, which is specifically used for objects that have some notion of length), and ob_digit is an array holding the value of the number. The boilerplate in the size comes from that struct, for both small and large Python numbers.
Why increments of 4 bytes?
Because, when a larger number is created, the size (in bytes) grows in multiples of sizeof(digit); you can see that in _PyLong_New, where the allocation of memory for a new longobject is performed with PyObject_MALLOC:
/* Number of bytes needed is: offsetof(PyLongObject, ob_digit) +
sizeof(digit)*size. Previous incarnations of this code used
sizeof(PyVarObject) instead of the offsetof, but this risks being
incorrect in the presence of padding between the PyVarObject header
and the digits. */
if (size > (Py_ssize_t)MAX_LONG_DIGITS) {
PyErr_SetString(PyExc_OverflowError,
"too many digits in integer");
return NULL;
}
result = PyObject_MALLOC(offsetof(PyLongObject, ob_digit) +
size*sizeof(digit));
offsetof(PyLongObject, ob_digit) is the 'boiler-plate' (in bytes) for the long object that isn't related with holding its value.
digit is defined in the header file holding the struct _longobject as a typedef for uint32_t:
typedef uint32_t digit;
and sizeof(uint32_t) is 4 bytes. That's the amount by which you'll see the size in bytes increase when the size argument to _PyLong_New increases.
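You can watch that happen from Python (a quick illustration; the exact numbers assume a 64-bit CPython build with 30-bit digits, so the base cost is 24 bytes and each extra digit adds 4):

import sys

for n in (0, 1, 2**30 - 1, 2**30, 2**60, 2**90):
    # 0 stores no digits; each additional 30-bit digit costs sizeof(uint32_t) == 4 bytes.
    print(n.bit_length(), sys.getsizeof(n))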
Of course, this is just how CPython has chosen to implement it. It is an implementation detail, and as such you won't find much information about it in PEPs. The python-dev mailing list would hold implementation discussions, if you can find the corresponding thread :-).
Either way, you might find differing behavior in other popular implementations, so don't take this one for granted.
It's actually easy. Python's int is not the kind of primitive you may be used to from other languages, but a full fledged object, with its methods and all the stuff. That is where the overhead comes from.
Then, you have the payload itself, the integer that is being represented. And there is no limit for that, except your memory.
The size of a Python int is what it needs to represent the number plus a little overhead.
If you want to read further, take a look at the relevant part of the documentation:
Integers have unlimited precision
I often have to write code in other languages that interact with C structs. Most typically this involves writing Python code with the struct or ctypes modules.
So I'll have a .h file full of struct definitions, and I have to manually read through them and duplicate those definitions in my Python code. This is time consuming and error-prone, and it's difficult to keep the two definitions in sync when they change frequently.
Is there some tool or library in any language (doesn't have to be C or Python) which can take a .h file and produce a structured list of its structs and their fields? I'd love to be able to write a script to automatically generate my struct definitions in Python, and I don't want to have to process arbitrary C code to do it. Regular expressions would work great about 90% of the time and then cause endless headaches for the remaining 10%.
If you compile your C code with debugging (-g), pahole (git) can give you the exact structure layouts being used.
$ pahole /bin/dd
…
struct option {
const char * name; /* 0 8 */
int has_arg; /* 8 4 */
/* XXX 4 bytes hole, try to pack */
int * flag; /* 16 8 */
int val; /* 24 4 */
/* size: 32, cachelines: 1, members: 4 */
/* sum members: 24, holes: 1, sum holes: 4 */
/* padding: 4 */
/* last cacheline: 32 bytes */
};
…
This should be quite a lot nicer to parse than straight C.
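As a rough sketch of how little code that takes (the regex and sample below are mine, not something pahole ships), you can pull (member name, offset, size) tuples out of pahole's "type name; /* offset size */" lines:

import re

# Match "type name; /* offset size */" member lines and ignore everything else.
MEMBER_RE = re.compile(r'^\s*(.+?\W)(\w+)(\[\d+\])?;\s*/\*\s*(\d+)\s+(\d+)\s*\*/')

def parse_members(pahole_output):
    for line in pahole_output.splitlines():
        m = MEMBER_RE.match(line)
        if m:
            yield m.group(2), int(m.group(4)), int(m.group(5))

sample = """struct option {
        const char * name; /* 0 8 */
        int has_arg; /* 8 4 */
        int * flag; /* 16 8 */
        int val; /* 24 4 */
};"""
print(list(parse_members(sample)))
# [('name', 0, 8), ('has_arg', 8, 4), ('flag', 16, 8), ('val', 24, 4)]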
Regular expressions would work great about 90% of the time and then cause endless headaches for the remaining 10%.
The headaches happen in the cases where the C code contains syntax that you didn't think of when writing your regular expressions. Then you go back and realise that C can't really be parsed by regular expressions, and life becomes not fun.
Try turning it around: define your own simple format, which allows fewer tricks than C does, and generate both the C header file and the Python interface code from your file:
define socketopts
int16 port
int32 ipv4address
int32 flags
Then you can easily write some Python to convert this to:
typedef struct {
short port;
int ipv4address;
int flags;
} socketopts;
and also to emit a Python class which uses struct to pack/unpack three values (possibly two of them big-endian and the other native-endian, up to you).
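A minimal sketch of that generator idea (it emits the C typedef plus a struct format string rather than a full class; the type tables and function name are made up for illustration):

C_TYPES = {'int16': 'short', 'int32': 'int'}
STRUCT_CODES = {'int16': 'h', 'int32': 'i'}

def generate(lines):
    # "define socketopts" -> struct name; remaining lines -> (type, name) pairs
    name = lines[0].split()[1]
    fields = [line.split() for line in lines[1:]]
    c_body = "\n".join("    %s %s;" % (C_TYPES[t], n) for t, n in fields)
    c_header = "typedef struct {\n%s\n} %s;\n" % (c_body, name)
    fmt = "".join(STRUCT_CODES[t] for t, _ in fields)
    return c_header, fmt

spec = ["define socketopts", "int16 port", "int32 ipv4address", "int32 flags"]
c_header, fmt = generate(spec)
print(c_header)  # the C typedef shown above
print(fmt)       # "hii" -- feed this to struct.pack/struct.unpack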
Have a look at SWIG or SIP, which can generate the interface code for you, or use ctypes.
Have you looked at Swig?
I have quite successfully used GCCXML on fairly large projects. You get an XML representation of the C code (including structures) which you can post-process with some simple Python.
ctypes-codegen or ctypeslib (same thing, I think) will generate ctypes Structure definitions (also other things, I believe, but I only tried structs) by parsing header files using GCCXML. It's no longer supported, but will likely work in some cases.
A friend of mine wrote a C parser for this task, which he uses together with cog.