How to extract a unicode string with Boost.Python

It seems that the code will crash when I do extract<const char*>("a unicode string")
Anyone know how to solve this?

This compiles and works for me, with your example string and using Python 2.x:
void process_unicode(boost::python::object u) {
    using namespace boost::python;
    const char* value = extract<const char*>(str(u).encode("utf-8"));
    std::cout << "The string value is '" << value << "'" << std::endl;
}
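For reference, here is a minimal Python-side check of the function above; it assumes the function is exposed via BOOST_PYTHON_MODULE(example), where the module name example is hypothetical:
# Hypothetical usage; assumes process_unicode is exported in a
# Boost.Python module named 'example'.
import example

example.process_unicode(u'a unicode string')
# prints: The string value is 'a unicode string'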
You can write a specific from-python converter if you wish to auto-convert PyUnicode (Python 2.x) to const wchar_t* or to a type from ICU (that seems to be the common recommendation for dealing with Unicode in C++).
If you want full support for Unicode characters outside the ASCII range (for example, accented characters such as á, ç or ï), you will need to write the from-python converter. Note this will have to be done separately for Python 2.x and 3.x if you wish to support both. For Python 3.x, the PyUnicode type was deprecated and now the string type works as PyUnicode used to for Python 2.x. Nothing that a couple of #if PY_VERSION_HEX >= 0x03000000 cannot handle.
[edit]
The above comment was wrong. Note that, since Python 3.x treats unicode strings as normal strings, boost::python will wrap that into boost::python::str objects. I have not verified how those are handled w.r.t. unicode translation in this case.

Have you tried
extract<std::string>("a unicode string").c_str()
or
extract<wchar_t*>(...)

Related

Cython: when should I define a string as char*, str, or bytes?

When defining a variable type that will hold a string in Cython + Python 3, I can use (at least):
cdef char* mystring = "foo"
cdef str mystring = "foo"
cdef bytes mystring = "foo"
The documentation page on strings is unclear on this -- it mostly gives examples using char* and bytes, and frankly I'm having a lot of difficulty understanding it.
In my case the strings will be coming from a Python3 program and are assumed to be unicode. They will be used as dict keys and function arguments, but I will do no further manipulation on them. Needless to say I am trying to maximize speed.
This question suggests that under Python2.7 and without Unicode, typing as str makes string manipulation code run SLOWER than with no typing at all. (But that's not necessarily relevant here since I won't be doing much string manipulation.)
What are the advantages and disadvantages of each of these options?
If there is no further processing done on a particular string, it would be best and fastest not to type it at all, which means it is treated as a general purpose PyObject *.
The str type is a special case: it is the byte string on Python 2 and the Unicode string on Python 3.
So code that types a string as str and handles it as Unicode will break on Python 2, where str means bytes.
Strings only need to be typed if they are to be converted to C char* or C++ std::string. There, you would use str to handle py2/py3 compatibility, along with helper functions to convert to/from bytes and unicode in order to be able to convert to either char* or std::string; a sketch of such helpers is shown below.
Typing of strings is for interoperability with C/C++, not for speed as such. Cython will auto-convert, without copying, a bytes string to a char* for example when it sees something like cdef char* c_string = b_string[:b_len] where b_string is a bytes type.
OTOH, if strings are typed without that type being used, Cython will do a conversion from object to bytes/unicode when it does not need to, which leads to overhead.
This can be seen in the generated C code as __Pyx_PyObject_AsString, __Pyx_PyUnicode_FromString et al.
This is also true in general: the rule of thumb is that if a specific type is not needed for further processing/conversion, it is best not to type it at all. Everything in Python is an object, so typing will convert from the general purpose PyObject* to something more specific.
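As a concrete illustration of the helper functions mentioned above, here is a minimal sketch; the names to_text and to_bytes are hypothetical, UTF-8 is assumed as the encoding, and the code runs unchanged under Python 2 and 3 (and inside a .pyx file):
def to_text(s):
    # Normalize to the text type of the running interpreter.
    if isinstance(s, bytes):
        return s.decode('utf-8')
    return s

def to_bytes(s):
    # Normalize to bytes, e.g. just before converting to a C char*.
    if isinstance(s, bytes):
        return s
    return s.encode('utf-8')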
Some quick testing revealed that for this particular case, only the str declaration worked -- all other options produced errors. Since the string is generated elsewhere in Python3, evidently the str type declaration is needed.
Whether it is faster not to make any declaration at all remains an open question.

How to memset a unicode string in Python 2.7

I have a unicode string f. I want to memset it to 0, so that print f displays null (\0).
I am using ctypes.memset to achieve this:
>>> f
u'abc'
>>> print ("%s" % type(f))
<type 'unicode'>
>>> import ctypes
>>> ctypes.memset(id(f)+50, 0, 6)
4363962530
>>> f
u'abc'
>>> print f
abc
Why did the memory location not get memset in the case of the unicode string?
It works perfectly for a str object.
Thanks for the help.
First, this is almost certainly a very bad idea. Python expects strings to be immutable. There's a reason that even the C API won't let you change their contents after they're flagged ready. If you're just doing this to play around with the interpreter's implementation, that can be fun and instructive, but if you're doing it for any real-life purpose, you're probably doing something wrong.
In particular, if you're doing it for "security", what you almost certainly really want to do is to not create a unicode in the first place, but instead create, say, a bytearray with the UTF-16 or UTF-32 encoding of your string, which can be zeroed out in a way that's safe, portable, and a lot easier.
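For illustration, a minimal sketch of that bytearray approach, assuming UTF-32-LE as the encoding:
secret = bytearray(u'abc'.encode('utf-32-le'))  # mutable buffer, not an immutable unicode
# ... hand the buffer to whatever needs the sensitive data ...
for i in range(len(secret)):
    secret[i] = 0  # zeroed in place; no immutable copy is left behind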
Anyway, there's no reason to expect that two completely different types should store their buffers at the same offset.
In CPython 2.x, a str is a PyStringObject:
typedef struct {
    PyObject_VAR_HEAD
    long ob_shash;
    int ob_sstate;
    char ob_sval[1];
} PyStringObject;
That ob_sval is the buffer; the offset should be 36 on 64-bit builds and (I think) 24 on 32-bit builds.
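You can check that offset from the interpreter. This sketch assumes a 64-bit CPython 2.x build, where the offset is the 36 stated above:
import ctypes
s = 'abc'
print ctypes.string_at(id(s) + 36, len(s))  # prints: abc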
In a comment, you say:
I read it somewhere, and also the offset for a string type is 37 on my system, which is what sys.getsizeof('') shows (>>> sys.getsizeof('') gives 37).
The offset for a string buffer is actually 36, not 37. And the fact that it's even that close is just a coincidence of the way str is implemented. (Hopefully you can understand why by looking at the struct definition—if not, you definitely shouldn't be writing code like this.) There's no reason to expect the same trick to work for some other type without looking at its implementation.
A unicode is a PyUnicodeObject:
typedef struct {
    PyObject_HEAD
    Py_ssize_t length;      /* Length of raw Unicode data in buffer */
    Py_UNICODE *str;        /* Raw Unicode buffer */
    long hash;              /* Hash value; -1 if not set */
    PyObject *defenc;       /* (Default) Encoded version as Python
                               string, or NULL; this is used for
                               implementing the buffer protocol */
} PyUnicodeObject;
Its buffer is not even inside the object itself; that str member is a pointer to the buffer (which is not guaranteed to be right after the struct). Its offset should be 24 on 64-bit builds, and (I think) 20 on 32-bit builds. So, to do the equivalent, you'd need to read the pointer there, then follow it to find the location to memset.
If you're using a narrow-Unicode build, it should look like this:
>>> ctypes.POINTER(ctypes.c_uint16 * len(g)).from_address(id(g)+24).contents[:]
[97, 98, 99]
That's the ctypes translation of finding (uint16_t *)(((char *)g)+24) and reading the array that starts at *that and ends at *(that+len(g)), which is what you'd have to do if you were writing C code and didn't have access to the unicodeobject.h header.
(In the test I just quoted, g is at 0x10a598090, while its str member points to 0x10a3b09e0, so the buffer is not immediately after the struct, or anywhere near it; it's about 2MB before it.)
For a wide-Unicode build, the same thing with c_uint32.
So, that should show you what you want to memset.
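Putting it together, a sketch of the equivalent memset for a unicode object, under the same assumptions as above (64-bit CPython 2.x, narrow build, str member at offset 24); this deliberately mutates an "immutable" object, with all the caveats already discussed:
import ctypes
f = u'abc'
buf = ctypes.c_void_p.from_address(id(f) + 24).value  # follow the str pointer
ctypes.memset(buf, 0, len(f) * 2)  # 2 bytes per Py_UNICODE on a narrow build
print repr(f)  # u'\x00\x00\x00'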
And you should also see a serious implication for your attempt at "security" here. (If I have to point it out, that's yet another indication that you should not be writing this code.)

String operation issues when switching from Python 2.x to Python 3

I am facing some problems with strings when switching from Python 2.x to Python 3.
Issue 1:
from ctypes import *
charBuffer = create_string_buffer(1000)
var = charBuffer.value  # var contains something like "abc:def:ghi:1234"
a, b, c, d = var.split(':')
It works fine in Python 2.x but not in 3.x, where it throws an error like this:
a,b,c,d= var.split(':')
TypeError: 'str' does not support the buffer interface
I got these links after doing some research on Stack Overflow: link, link2.
If I print them, the desired output would be:
a = abc
b = def
c = ghi
d = 1234
Issue2:
from ctypes import *
mydll = cdll.LoadLibrary("Windll")
var = 0x1fffffffffffffffffffffff  # I want to send this long variable to a character pointer in the DLL
charBuf = create_string_buffer(var.to_bytes(32, 'little'))
mydll.createBuff(charBuf)
The DLL function:
int createBuff(char *charBuff) {
    printf("%s\n", charBuff);
    return 0;
}
I want to send this long variable to a character pointer in the DLL; since it is a character pointer, it is throwing errors.
I need your valuable inputs on how I could achieve this. Thanks in advance.
In Python 3.x, .value on the return of create_string_buffer() returns a byte string.
In your example you are trying to split the byte string using a Unicode string (which is the normal string in Python 3.x). This is what is causing your issue.
You would need to either split with a byte string. Example:
a, b, c, d = var.split(b':')
Or you can decode the byte string to a Unicode string using the .decode() method. Example:
var = var.decode('<encoding>')
Split using b":" and you will be fine in both versions of Python.
In py2 str is a bytestring; in py3 str is a unicode object. The object returned by the ctypes string buffer is a bytestring (str on py2 and bytes on py3). By writing the string literal as b"..." you force it to be a bytestring in both versions of Python.
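A quick demonstration that runs unchanged on both versions:
var = b"abc:def:ghi:1234"  # what charBuffer.value gives you
a, b, c, d = var.split(b':')
print(a)  # b'abc' on Python 3, 'abc' on Python 2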

How can ctypes be used to parse unicode strings?

Sending a string from Python to C++ using Python's ctypes module requires you to pass it as a c_char_p (a char *). I've found I need to use a plain Python byte string and not a Python unicode string. If I use unicode strings, the variables just get overwritten instead of being sent properly. Here's an example:
C++
void test_function(char * a, char * b, char * c) {
    printf("Result: %s %s %s", a, b, c);
}
Python
... load c++ using ctypes ...
lib.test_function.argtypes = [c_char_p, c_char_p, c_char_p]
lib.test_function(u'x', u'y', u'z')
lib.test_function('x', 'y', 'z')
Running the above Python code gives the following in stdout:
Result: z z z
Result: x y z
Why is this? Is this a quirk of ctypes? What's an elegant way to avoid this quirk if I am getting unicode strings?
Thanks!
C/C++ has no real support for Unicode, so there really isn't anything you can do about it. You must encode your strings in order to pass them into the C/C++ world: you could use UTF-8, UTF-16, or UTF-32, depending on your use case.
For example, you can encode them as UTF-8 and pass in an array of bytes (bytes in Python and char * in C/C++):
lib.test_function(u'x'.encode('utf8'),
                  u'y'.encode('utf8'),
                  u'z'.encode('utf8'))
Exactly which encoding you pick is another story, but it will depend on what your C++ library is willing to accept.
Try c_wchar instead of c_char:
https://docs.python.org/2/library/ctypes.html
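For example, a sketch of that alternative; it assumes the C side is changed to take wchar_t * instead of char *, and the library path libtest.so is hypothetical:
from ctypes import CDLL, c_wchar_p

lib = CDLL('./libtest.so')  # hypothetical path; load your library as before
lib.test_function.argtypes = [c_wchar_p, c_wchar_p, c_wchar_p]
lib.test_function(u'x', u'y', u'z')  # unicode strings can now be passed directly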

How to print out the memory contents of an object in Python?

In C/C++, we can print the memory contents of a variable as below:
double d = 234.5;
unsigned char *p = (unsigned char *)&d;
size_t i;
for (i = 0; i < sizeof d; ++i)
    printf("%02x\n", p[i]);
Yes, I know we can use pickle.dump() to serialize an object, but it seems to generate a lot of redundant data.
How can we achieve this in Python in a pure way?
The internal memory representation of a Python object cannot be reached from the logical level of Python code; you'd need to write a C extension.
If you're designing your own serialization protocol, then maybe the struct module is what you're looking for. It allows converting from Python values to binary data and back, in the format you specify. For example
import struct
print(list(struct.pack('d', 3.14)))
will display [31, 133, 235, 81, 184, 30, 9, 64] because those are the byte values for the double precision representation of 3.14.
NOTE: struct.pack returns a bytes object in Python 3.x but an str object in Python 2.x. To see the numeric code of the bytes in Python 2.x you need to use print map(ord, struct.pack(...)) instead.
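If you really do want the raw bytes behind an object, a CPython-only sketch with ctypes is possible; it relies on the implementation detail that id() returns the object's address, so treat it as a curiosity rather than a portable technique:
import ctypes
import sys

def dump(obj):
    # Read sys.getsizeof(obj) raw bytes starting at the object's address.
    data = (ctypes.c_ubyte * sys.getsizeof(obj)).from_address(id(obj))
    return ' '.join('%02x' % b for b in data)

print(dump(234.5))  # object header bytes followed by the double 234.5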
You cannot do this in pure Python, but you could write a Python extension module in C that does exactly what you ask for, though it would probably not be very useful. You can read more about extension modules here.
I assume that by Python you mean C-Python, and not PyPy, Jython or IronPython.
