Testing wchar_t* for convertible characters - python

I'm working on talking to a library that handles strings as wchar_t arrays. I need to convert these to char arrays so that I can hand them over to Python (using SWIG and Python's PyString_FromString function). Obviously not all wide characters can be converted to chars. According to the documentation for wcstombs, I ought to be able to do something like
wcstombs(NULL, wideString, wcslen(wideString))
to test the string for unconvertible characters -- it's supposed to return -1 if there are any. However, in my test case it's always returning -1. Here's my test function:
void getString(wchar_t* target, int size) {
    int i;
    for(i = 0; i < size; ++i) {
        target[i] = L'a' + i;
    }
    printf("Generated %d characters, nominal length %d, compare %d\n", size,
           wcslen(target), wcstombs(NULL, target, size));
}
This is generating output like this:
Generated 32 characters, nominal length 39, compare -1
Generated 16 characters, nominal length 20, compare -1
Generated 4 characters, nominal length 6, compare -1
Any idea what I'm doing wrong?
On a related note, if you know of a way to convert directly from wchar_t*s to Python unicode strings, that'd be welcome. :) Thanks!

Clearly, as you found, it's essential to zero-terminate your input data.
Regarding the final paragraph, I would convert from wide to UTF-8 and call PyUnicode_FromString.
Note that I am assuming you are using Python 2.x; it's presumably all different in Python 3.x.
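For what it's worth, here is a minimal sketch of that approach in plain C, assuming the current locale is UTF-8; the buffer names are made up for illustration, and the resulting char buffer is what you would hand to PyUnicode_FromString on the SWIG side.

#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

int main(void)
{
    setlocale(LC_ALL, "");            /* use the environment's (hopefully UTF-8) locale */

    wchar_t wide[5];
    int i;
    for (i = 0; i < 4; ++i)
        wide[i] = L'a' + i;
    wide[4] = L'\0';                  /* the terminator the test function was missing */

    size_t needed = wcstombs(NULL, wide, 0);   /* (size_t)-1 if anything is unconvertible */
    if (needed == (size_t)-1) {
        fprintf(stderr, "string contains unconvertible characters\n");
        return 1;
    }

    char *narrow = malloc(needed + 1);
    wcstombs(narrow, wide, needed + 1);
    printf("converted: %s\n", narrow);         /* e.g. PyUnicode_FromString(narrow) */
    free(narrow);
    return 0;
}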

How to do random.choice in C

As a Python intermediate learner, I made an 8 ball in Python.
Now that I am starting to learn C, is there a way to simulate the way random.choice can select a string from a list of strings, but in C?
The closest thing to a "list of strings" in C is an array of string pointers; and the only standard library function that produces random numbers is rand(), defined in <stdlib.h>.
A simple example:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>   // needed for the usual srand() initialization

int main(void)
{
    const char *string_table[] = {   // array of pointers to constant strings
        "alpha",
        "beta",
        "gamma",
        "delta",
        "epsilon"
    };
    int table_size = 5;              // This must match the number of entries above

    srand(time(NULL));               // randomize the start value

    for (int i = 1; i <= 10; ++i)
    {
        const char *rand_string = string_table[rand() % table_size];
        printf("%2d. %s\n", i, rand_string);
    }
    return 0;
}
That will generate and print ten random choices from an array of five strings.
The string_table variable is an array of const char * pointers. You should always use a pointer-to-const to refer to a literal character string like "alpha"; it keeps you from using that pointer in a context where the string contents might be changed.
The random numbers are what are called "pseudorandom"; statistically uncorrelated, but completely determined by a starting "seed" value. Using the statement srand(time(NULL)) takes the current time/date value (seconds since some starting date) and uses that as a seed that won't be repeated in any computer's lifetime. But you will get exactly the same "random" numbers if you manage to run the program twice in the same second. This is easy to do in a shell script, for example. A higher-resolution timestamp would be nice, but there isn't anything useful in the C standard library.
The rand() function returns a non-negative int value from 0 to some implementation-dependent maximum value. The symbolic constant RAND_MAX has that value. The expression rand() % N will return the remainder from dividing that value by N, which is a number from 0 to N-1.
Aconcagua has pointed out that this isn't ideal. If N doesn't evenly divide RAND_MAX, there will be a bias toward smaller numbers. It's okay for now, but plan to learn some other methods later if you do serious simulation or statistical work; and if you do get to that point, you probably won't use the built-in rand() function anyway.
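For the record, one common way around that bias (just a sketch, and probably overkill here) is to reject the top slice of rand()'s range before taking the remainder:

#include <stdlib.h>

/* Returns a value in [0, n-1] with no modulo bias (assumes 0 < n <= RAND_MAX). */
int unbiased_rand(int n)
{
    int limit = RAND_MAX - (RAND_MAX % n);   /* a multiple of n, so every residue is equally likely */
    int r;
    do {
        r = rand();
    } while (r >= limit);
    return r % n;
}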
You can write a function that takes your array and its size, uses rand() % size to get a random index into the array, and then returns the value of arr[randidx].
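Something like this, for example (random_choice is an illustrative name, not a standard function):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Return a random element of an array of string pointers. */
const char *random_choice(const char **arr, int size)
{
    int randidx = rand() % size;      /* random index in [0, size-1] */
    return arr[randidx];
}

int main(void)
{
    const char *words[] = { "yes", "no", "maybe" };
    srand(time(NULL));
    printf("%s\n", random_choice(words, 3));
    return 0;
}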

char array to unsigned char python

I'm trying to translate this C++ code into Python, but I'm having problems with the char* to ushort* conversion:
std::string sendAsciiCommand(std::string command) {
    unsigned int nchars = command.length() + 1;    // Char count of command string
    unsigned int nshorts = ceil(nchars / 2);       // Number of shorts to store the string
    std::vector<unsigned short> regs(nshorts);     // Vector of short registers

    // Transform char array to short array with endianness conversion
    unsigned short *ascii_short_ptr = (unsigned short *)(command.c_str());
    for (unsigned int i = 0; i < nshorts; i++)
        regs[i] = htons(ascii_short_ptr[i]);

    return std::string((char *)regs.data());
}
So far I have tried this code in Python 2.7:
from math import ceil
from array import array
command = "hello"
nchars = len(command) + 1
nshorts = ceil(nchars/2)
regs = array("H", command)
But it gives me the error:
ValueError: string length not a multiple of item size
Any help?
The exception text:
ValueError: string length not a multiple of item size
means what it says, i.e., the length of the string from which you are trying to create an array must be a multiple of the item size. In this case the item size is that of an unsigned short, which is 2 bytes. Therefore the length of the string must be a multiple of 2. hello has length 5, which is not a multiple of 2, so you can't create an array of 2-byte integers from it. It will work if the string is 6 bytes long, e.g. hello!.
>>> array("H", 'hello!')
array('H', [25960, 27756, 8559])
You might still need to convert to network byte order. array uses the native byte order on your machine, so if your native byte order is little endian you will need to convert it to big endian (network byte order). Use sys.byteorder to check and array.byteswap() to swap the byte order if required:
import sys
from array import array
s = 'hello!'
regs = array('H', s)
print(regs)
# array('H', [25960, 27756, 8559])
if sys.byteorder != 'big':
    regs.byteswap()
print(regs)
# array('H', [26725, 27756, 28449])
However, it's easier to use struct.unpack() to convert straight to network byte order if necessary:
import struct
s = 'hello!'
n = len(s)/struct.calcsize('H')
regs = struct.unpack('!{}H'.format(n), s)
print(regs)
#(26725, 27756, 28449)
If you really need an array:
regs = array('H', struct.unpack('!{}H'.format(n), s))
It's also worth pointing out that your C++ code contains an error. If the string length is odd, an extra byte will be read at the end of the string and included in the converted data. That extra byte will be \0, since the C string should be null-terminated, but the last unsigned short should either be ignored, or you should check that the length of the string is a multiple of the size of an unsigned short, just as Python does.
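If it helps, here is a rough C sketch of a version that avoids that stray byte by copying the text into a zero-padded buffer of whole shorts first; the names and the POSIX htons() include are assumptions, not part of the original code.

#include <arpa/inet.h>   /* htons(); assumes a POSIX system */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    const char *command = "hello";
    size_t nchars  = strlen(command) + 1;      /* include the terminating '\0' */
    size_t nshorts = (nchars + 1) / 2;         /* round up to whole shorts     */

    /* Zero-padded copy, so nothing past the string is ever read. */
    unsigned char *buf = calloc(nshorts, 2);
    memcpy(buf, command, nchars);

    unsigned short *regs = malloc(nshorts * sizeof *regs);
    size_t i;
    for (i = 0; i < nshorts; i++) {
        unsigned short word;
        memcpy(&word, buf + 2 * i, sizeof word);
        regs[i] = htons(word);                 /* endianness conversion */
    }

    for (i = 0; i < nshorts; i++)
        printf("%04x ", regs[i]);
    printf("\n");                              /* 6865 6c6c 6f00 */

    free(buf);
    free(regs);
    return 0;
}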

Python struct.unpack equivalent in C

What I have in Python is a string of length 4 that I can unpack to get some values.
Is there an equivalent of this Python function in C?
In C, there is no concept of "packing" like this. Whenever you have a char buffer such as
char buf[128];
whether you treat it like a string or a complex data structure is up to you. The simplest way would be to define a struct and copy data back and forth from your array.
#include <string.h>   // for memcpy

struct MyStruct {
    int data1;
    int data2;
};

char buf[sizeof(struct MyStruct)];
struct MyStruct myStruct;
myStruct.data1 = 1;
myStruct.data2 = 2;
memcpy(buf, &myStruct, sizeof(struct MyStruct));
Please note that there IS some packing/padding that MAY happen here. For example, if you have a short in your struct, the compiler MAY use 4 bytes anyway. This also fails when you have to use pointers, such as char* strings in the structure.
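When the bytes have to match a specific Python format string (say '>hI'), a common workaround for the padding problem is to pull the fields out byte by byte instead of memcpy-ing a whole struct. A rough sketch, with a made-up 6-byte layout:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* 6 bytes: a big-endian int16 followed by a big-endian uint32,
       i.e. what struct.unpack('>hI', ...) would read in Python. */
    const unsigned char buf[6] = { 0x00, 0x02, 0x00, 0x00, 0x01, 0x00 };

    int16_t  data1 = (int16_t)((buf[0] << 8) | buf[1]);
    uint32_t data2 = ((uint32_t)buf[2] << 24) | ((uint32_t)buf[3] << 16)
                   | ((uint32_t)buf[4] << 8)  |  (uint32_t)buf[5];

    printf("data1 = %d, data2 = %lu\n", data1, (unsigned long)data2);   /* data1 = 2, data2 = 256 */
    return 0;
}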

Translate a hashing algorithm from C to Python

My client is a Python programmer and I have created a C++ backend for him which includes license generation and checking. For additional safety, the Python front-end will also perform a validity check of the license.
The license generation and checking algorithm, however, is based on hashing methods which rely on the fact that an integer has a fixed byte size and that bit-shifting a value will not extend the integer's byte count.
This is a simplified example code:
unsigned int HashString(const char* str) {
    unsigned int hash = 3151;
    while (*str != 0) {
        hash = (hash << 3) + (*str << 2) * 3;
        str++;
    }
    return hash;
}
How can this be translated to Python? The direct translation obviously yields a different result:
def hash_string(str):
    hash = 3151
    for c in str:
        hash = (hash << 3) + (ord(c) << 2) * 3
    return hash
For instance:
hash_string("foo bar spam") # 228667414299004
HashString("foo bar spam") // 3355459964
Edit: The same would also be necessary for PHP since the online shop should be able to generate valid licenses, too.
Mask the hash value with &:
def hash_string(str, _width=2**32-1):
    hash = 3151
    for c in str:
        hash = ((hash << 3) + (ord(c) << 2) * 3)
    return hash & _width
This manually cuts the hash back to size. You only need to limit the result once; it's not as if those higher bits make a difference for the final result.
Demo:
>>> hash_string("foo bar spam")
3355459964
The issue here is that C's unsigned int automatically rolls over when it goes past UINT_MAX, while Python's int just keeps getting bigger.
The easiest fix is just to correct at the end:
return hash % (1 << 32)
For very large strings, it may be a little faster to mask after each operation, to avoid ending up with humongous int values that are slow to work with. But for smaller strings, that will probably be slower, because the cost of calling % 12 times instead of once will easily outweigh the cost of dealing with a 48-bit int.
PHP may have the same problem, or a different one.
PHP's default integer type is a C long. On a 64-bit Unix platform, this is bigger than an unsigned int, so you will have to use the same trick as on Python (either % or &, whichever makes more sense to you.)
But on a 32-bit Unix platform, or on Windows, this is the same size as unsigned int but signed, which means you need a different trick. You can't actually represent, say, 4294967293 directly (try it, and you'll get -3 instead). You can use a GMP or BCMath integer instead of the default type (in which case it's basically the same as in Python), or you can just write custom code for printing, comparing, etc. that will treat that -3 as if it were 4294967293.
Note that I'm just assuming that int is 32 bits, and long is either 32 or 64, because that happens to be true on every popular platform today. But the C standard only requires that int be at least 16 bits long, and long be at least 32 bits and no shorter than int. If you need to deal with very old platforms where int might be 16 bits (or 18!), or future platforms where it might be 64 or more, you have to adjust your code appropriately.
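If you control the C side as well, a variant that sidesteps the whole question is to hash in a fixed-width type; a sketch (HashString32 is just an illustrative name, not the original function):

#include <stdint.h>
#include <stdio.h>

/* Same hash, but in uint32_t so it wraps identically on every platform
   and always matches the masked Python version (assumes ASCII input). */
uint32_t HashString32(const char *str)
{
    uint32_t hash = 3151;
    while (*str != 0) {
        hash = (hash << 3) + ((uint32_t)(unsigned char)*str << 2) * 3;
        str++;
    }
    return hash;
}

int main(void)
{
    printf("%u\n", (unsigned)HashString32("foo bar spam"));   /* 3355459964 */
    return 0;
}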

str_to_a32 - What does this function do?

I need to rewrite some Python script in Objective-C. It's not that hard since Python is easily readable, but this piece of code puzzles me a bit.
def str_to_a32(b):
    if len(b) % 4:
        # pad to multiple of 4
        b += '\0' * (4 - len(b) % 4)
    return struct.unpack('>%dI' % (len(b) / 4), b)
What is this function supposed to do?
I'm not positive, but I'm using the documentation to take a stab at it.
Looking at the docs, we're going to return a tuple based on the format string:
Unpack the string (presumably packed by pack(fmt, ...)) according to the given format. The result is a tuple even if it contains exactly one item. The string must contain exactly the amount of data required by the format (len(string) must equal calcsize(fmt)).
The item coming in (b) is probably a byte buffer (represented as a string) - looking at the examples, they are represented with the \x escape, which consumes the next two characters as hex.
It appears the format string is
'>%dI' % (len(b) / 4)
The % and %d are going to put a number into the format string, so if the length of b is 32 the format string becomes
>8I
The first part of the format string is >, which the documentation says is setting the byte order to big-endian and size to standard.
The I says it will be an unsigned int with size 4 (docs), and the 8 in front of it means it will be repeated 8 times.
>IIIIIIII
So I think this is saying: take this byte buffer, make sure it's a multiple of 4 by appending as many 0x00s as is necessary, then unpack that into a tuple with as many unsigned integers as there are blocks of 4 bytes in the buffer.
Looks like it's supposed to take an input array of bytes represented as a string and unpack them as big-endian (the ">") unsigned ints (the 'I'). The formatting codes are explained in http://docs.python.org/2/library/struct.html
This takes a string and converts it into a tuple of Unsigned Integers. If you look at the python struct documentation you will see how it works. In a nutshell it handles conversions between Python values and C structs represented as Python strings for handling binary data stored in files (unceremoniously copied from the link provided).
In your case, the function takes a string, b, and adds some extra characters to make sure that its length is a multiple of the standard size of an unsigned int (see link), and then converts it into a tuple of integers using the big-endian representation of the characters. This is the '>' part; the I part says to use unsigned integers.
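Since the goal is an Objective-C rewrite, here is a rough plain-C sketch of the same unpacking (plain C can be used as-is from Objective-C); the function signature is just one possible design, not taken from the original project.

#include <stdint.h>
#include <stdio.h>

/* Unpack len bytes of b into big-endian 32-bit words, zero-padding to a
   multiple of 4. out must have room for (len + 3) / 4 words. */
size_t str_to_a32(const char *b, size_t len, uint32_t *out)
{
    size_t nwords = (len + 3) / 4;
    size_t i, j;
    for (i = 0; i < nwords; i++) {
        uint32_t word = 0;
        for (j = 0; j < 4; j++) {
            size_t k = 4 * i + j;
            uint32_t byte = (k < len) ? (unsigned char)b[k] : 0;  /* '\0' padding */
            word = (word << 8) | byte;                            /* big-endian   */
        }
        out[i] = word;
    }
    return nwords;
}

int main(void)
{
    uint32_t words[2];
    size_t n = str_to_a32("abcde", 5, words);
    printf("%lu words: %lu %lu\n", (unsigned long)n,
           (unsigned long)words[0], (unsigned long)words[1]);
    /* 2 words: 1633837924 1694498816  (i.e. 0x61626364, 0x65000000) */
    return 0;
}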
