Translate a hashing algorithm from C to Python

Translate a hashing algorithm from C to Python - python

My client is a Python programmer and I have created a C++ backend for him which includes license generation and checking. For additional safety, the Python front-end will also perform a validity check of the license.
The license generation and checking algorithm however is based on hashing methods which rely on the fact that an integer is of a fixed byte size and bit-shifting a value will not extend the integers byte count.
This is a simplified example code:
unsigned int HashString(const char* str) {
unsigned int hash = 3151;
while (*str != 0) {
hash = (hash << 3) + (*str << 2) * 3;
str++;
}
return hash;
}
How can this be translated to Python? The direct translation obviously yields a different result:
def hash_string(str):
hash = 3151
for c in str:
hash = (hash << 3) + (ord(c) << 2) * 3
return hash
For instance:
hash_string("foo bar spam") # 228667414299004
HashString("foo bar spam") // 3355459964
Edit: The same would also be necessary for PHP since the online shop should be able to generate valid licenses, too.

Mask the hash value with &:
def hash_string(str, _width=2**32-1):
hash = 3151
for c in str:
hash = ((hash << 3) + (ord(c) << 2) * 3)
return hash & _width
This manually cuts the hash back to size. You only need to limit the result once; it's not as if those higher bits make a difference for the final result.
Demo:
>>> hash_string("foo bar spam")
3355459964

The issue here is that C's unsigned int automatically rolls over when it goes past UINT_MAX, while Python's int just keeps getting bigger.
The easiest fix is just to correct at the end:
return hash % (1 << 32)
For very large strings, it maybe a little faster to mask after each operation, to avoid ending up with humongous int values that are slow to work with. But for smaller strings, that will probably be slower, because the cost of calling % 12 times instead of 1 will easily outweigh the cost of dealing with a 48-bit int.
PHP may have the same problem, or a different one.
PHP's default integer type is a C long. On a 64-bit Unix platform, this is bigger than an unsigned int, so you will have to use the same trick as on Python (either % or &, whichever makes more sense to you.)
But on a 32-bit Unix platform, or on Windows, this is the same size as unsigned int but signed, which means you need a different trick. You can't actually represent, say, 4294967293 directly (try it, and you'll get -3 instead). You can use a GMP or BCMath integer instead of the default type (in which case it's basically the same as in Python), or you can just write custom code for printing, comparing, etc. that will treat that -3 as if it were 4294967293.
Note that I'm just assuming that int is 32 bits, and long is either 32 or 64, because that happens to be true on every popular platform today. But the C standard only requires that int be at least 16 bits long, and long be at least 32 bits and no shorter than int. If you need to deal with very old platforms where int might be 16 bits (or 18!), or future platforms where it might be 64 or more, you have to adjust your code appropriately.

Related

When I use ctypes.c_int() it returns a different number?

When I use the code below, it returns a different number.
numb = 5000000000
n = ctypes.c_int(numb)
the number it converts : 705032703
UPDATE
now it doesn't give error. and performed the operation. However, the number it returns is still different. It should return the number I entered. but it does not return.
main.py
numb = 5000000000
n = ctypes.c_int64(numb)
x = hello_world(n)
print(x)
golang code that I converted to c code
main.go
package main
import "C"
func helloWorld(x int64) int64 {
s := int64(1)
for i := int64(1); i < x; i++ {
s = i
}
return s
}

5,000,000,000 is too large for a 32-bit integer. You can see below it just truncates to 32 bits:
>>> n=5000000000
>>> hex(n)
'0x12a05f200' # 33 bits
>>> import ctypes
>>> hex(ctypes.c_int(n).value)
'0x2a05f200' # dropped the 33rd bit
You can see the same thing by ANDing with a 32-bit mask:
>>> n & 0xffffffff
705032704

ctypes exposes the C-compatible types, with ctypes.c_int() being an exposure of the signed int type, which could be 32-bit or 64-bit depending on the platform (or possibly even narrower or wider).
In your instance, your c_int is a 32-bit type, and a 32-bit signed int can hold the range [-2147483648, 2147483647], or 4294967296 possible values.
So it can't hold your 5000000000 (5 billion) as-is, and it's implementation-defined how overflow is handled.
If you want to hold large numbers, it's recommended to used fixed-width types like c_int64 so you know exactly what you're dealing with, as c_int and c_long have some size guarantees but are otherwise implementation/platform specific.

Python C-API int128 support

In python one can handle very large integers (for instance uuid.uuid4().int.bit_length() gives 128), but the largest int datastructure the C-API documentation offers is long long, and it is a 64-bit int.
I would love to be able to get a C int128 from a PyLong, but it seems there is no tooling for this. PyLong_AsLongLong for instance cannot handle python integers bigger than 2**64.
Is there some documentation I missed, and it is actually possible?
Is there currently not possible, but some workaround exist? (I would love to use the tooling available in the python C-API for long long with int128, for instance a PyLong_AsInt128AndOverflow function).
Is it a planed feature in a forthcoming python release?

There are a couple of different ways you can access the level of precision you want.
Systems with 64-bit longs often have 128-bit long longs. Notice that the article you link says "at least 64 bits". It's worth checking sizeof(long long) in case there's nothing further to do.
Assuming that is not what you are working with, you'll have to look closer at the raw PyLongObject, which is actually a typedef of the private _longobject structure.
The raw bits are accessible through the ob_digit field, with the length given by ob_size. The data type of the digits, and the actual number of boots they hold is given by the typedef digit and the macro PYLONG_BITS_IN_DIGIT. The latter must be smaller than 8 * sizeof(digit), larger than 8, and a multiple of 5 (so 30 or 15, depending on how your build was done).
Luckily for you, there is an "undocumented" method in the C API that will copy the bytes of the number for you: _PyLong_AsByteArray. The comment in longobject.h reads:
/* _PyLong_AsByteArray: Convert the least-significant 8*n bits of long
v to a base-256 integer, stored in array bytes. Normally return 0,
return -1 on error.
If little_endian is 1/true, store the MSB at bytes[n-1] and the LSB at
bytes[0]; else (little_endian is 0/false) store the MSB at bytes[0] and
the LSB at bytes[n-1].
If is_signed is 0/false, it's an error if v < 0; else (v >= 0) n bytes
are filled and there's nothing special about bit 0x80 of the MSB.
If is_signed is 1/true, bytes is filled with the 2's-complement
representation of v's value. Bit 0x80 of the MSB is the sign bit.
Error returns (-1):
+ is_signed is 0 and v < 0. TypeError is set in this case, and bytes
isn't altered.
+ n isn't big enough to hold the full mathematical value of v. For
example, if is_signed is 0 and there are more digits in the v than
fit in n; or if is_signed is 1, v < 0, and n is just 1 bit shy of
being large enough to hold a sign bit. OverflowError is set in this
case, but bytes holds the least-significant n bytes of the true value.
*/
You should be able to get a UUID with something like
PyLongObject *mylong;
unsigned char myuuid[16];
_PyLong_AsByteArray(mylong, myuuid, sizeof(myuuid), 1, 0);

Unsigned ints of arbitrary length in python

I am trying to simulate a fixed-point filter implementation. I want to capture low-level hardware features like 2s-complement wraparound/overflow and fixed register widths. Some of the registers widths are set by hardware features at unusual and long widths (ie 72b).
I've been making some progress using the built-in integers. The infinite width is incredibly useful... but I find myself fighting Python a lot because it sometimes wants to interpret a binary as a positive integer, and sometimes it seems to want to interpret a very similar binary as a negative 2's complement. For example:
>> a = 0b11111 # sign-extended -1
>> b = 0b0011
>> print("{0:b}".format(a*b))
5f
>> print("{0:b}".format((a*b)&a)) # Truncate to correct product length
11101 # == -3 in 2s complement. Great!
>> print("{0:b}".format(~((a*b)&a)+1)) # Actually perform the 2's complement
-11101 # Arrrrggggghhh
>> print("{0:b}".format((~((a*b)&a)&a)+1)) # Truncate with extreme prejudice
11 # OK. Fine.
I guess if I think hard enough I can figure out why all this works the way it does, but if I could just do it all in unsigned space without worrying about python adding sign bits it would make things easier and less error-prone. Anyone know if there's a relatively easy way to do this? I considered bit strings, but I have to do a lot of adds & multiplies in this application and built-in integer arithmetic is really useful for that.

~x is literally defined on arbitrary precision integers as -(x+1). It does not do bit arithmetic: ~0 is 255 in one-byte integers, 65535 in two-byte integers, 1023 for 10-bit integers etc; so defining ~ via bit inversion on stretchy integers is useless.
If a defines the fixed width of your integers (with 0b11111 saying you are working with five-bit numbers), bit inversion is as simple as a^x.
print("{0:b}".format(a ^ b)
# => 11100
Two's complement is meanwhile easiest done as a+1-b, or equivalently a^b+1:
print("{0:b}".format((a + 1) - b))
# => 11101
print("{0:b}".format((a ^ b) + 1))
# => 11101
tl;dr: Don't use ~ if you want to stay unsigned.

Trouble porting a function in Python to C using the Python C API

I have a checksum function in Python:
def checksum(data):
a = b = 0
l = len(data)
for i in range(l):
a += ord(data[i])
b += (l - i)*ord(data[i])
return (b << 16) | a, a, b
that I am trying to port to a C module for speed. Here's the C function:
static PyObject *
checksum(PyObject *self, PyObject *args)
{
int i, length;
unsigned long long a = 0, b = 0;
unsigned long long checksum = 0;
char *data;
if (!PyArg_ParseTuple(args, "s#", &data, &length)) {
return NULL;
}
for (i = 0; i < length; i++) {
a += (int)data[i];
b += (length - i) * (int)data[i];
}
checksum = (b << 16) | a;
return Py_BuildValue("(Kii)", checksum, (int)a, (int)b);
}
I use it by opening a file and feeding it a 4096 block of data. They both return the same values for small strings, but when I feed it binary data straight from a file, the C version returns wildly different values. Any help would be appreciated.

I would guess that you have some kind of overflow in your local variables. Probably b gets to large. Just dump the values for debugging purposes and you should see if it's the problem. As you mention, that you are porting the method for performance reasons. Have you checked psyco? Might be fast enough and much easier. There are more other tools which compile parts of python code on the fly to C, but I don't have the names in my head.

I'd suggest that the original checksum function is "incorrect". The value returned for checksum is of unlimited size (for any given size in MB, you could construct an input for which the checksum will be at least of this size). If my calculations are correct, the value can fit in 64 bits for inputs of less than 260 MB, and b can fit in an integer for anything less than 4096 bytes. Now, I might be off with the number, but it means that for larger inputs the two functions are guaranteed to work differently.
To translate the first function to C, you'd need to keep b and c in Python integers, and to perform the last calculation as a Python expression. This can be improved, though:
You could use C long long variables to store an intermediate sum and add it to the Python integers after a certain number of iterations. If the number of iterations is n, the maximum value for a is n * 255, and for b is len(data) * n * 255. Try to keep those under 2**63-1 when storing them in C long long variables.
You can use long long instead of unsigned long long, and raise a RuntimeError every time it gets negative in debug mode.
Another solution would be to limit the Python equivalent to 64 bits by using a & 0xffffffffffffffff and b & 0xffffffffffffffff.
The best solution would be to use another kind of checksum, like binascii.crc32.

Python, len, and size of ints

So, cPython (2.4) has some interesting behaviour when the length of something gets near to 1<<32 (the size of an int).
r = xrange(1<<30)
assert len(r) == 1<<30
is fine, but:
r = xrange(1<<32)
assert len(r) == 1<<32
ValueError: xrange object size cannot be reported`__len__() should return 0 <= outcome
Alex's wowrange has this behaviour as well. wowrange(1<<32).l is fine, but len(wowrange(1<<32)) is bad. I'm guessing there is some floating point behaviour (being read as negative) action going on here.
What exactly is happening here? (this is pretty well-solved below!)
How can I get around it? Longs?
(My specific application is random.sample(xrange(1<<32),ABUNCH)) if people want to tackle that question directly!)

cPython assumes that lists fit in memory. This extends to objects that behave like lists, such as xrange. essentially, the len function expects the __len__ method to return something that is convertable to size_t, which won't happen if the number of logical elements is too large, even if those elements don't actually exist in memory.

You'll find that
xrange(1 << 31 - 1)
is the last one that behaves as you want. This is because the maximum signed (32-bit) integer is 2^31 - 1.
1 << 32 is not a positive signed 32-bit integer (Python's int datatype), so that's why you're getting that error.
In Python 2.6, I can't even do xrange(1 << 32) or xrange(1 << 31) without getting an error, much less len on the result.
Edit If you want a little more detail...
1 << 31 represents the number 0x80000000 which in 2's complement representation is the lowest representable negative number (-1 * 2^31) for a 32-bit int. So yes, due to the bit-wise representation of the numbers you're working with, it's actually becoming negative.
For a 32-bit 2's complement number, 0x7FFFFFFF is the highest representable integer (2^31 - 1) before you "overflow" into negative numbers.
Further reading, if you're interested.
Note that when you see something like 2147483648L in the prompt, the "L" at the end signifies that it's now being represented as a "long integer" (64 bits, usually, I can't make any promises on how Python handles it because I haven't read up on it).

1<<32, when treated as a signed integer, is negative.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Translate a hashing algorithm from C to Python - python

Related

When I use ctypes.c_int() it returns a different number?

Python C-API int128 support

Unsigned ints of arbitrary length in python

Trouble porting a function in Python to C using the Python C API

Python, len, and size of ints

Categories

Resources