Simulate integer overflow in Python

I'm working with Python and I would like to simulate the effect of a C/C++ cast on an integer value.
For example, if I have the unsigned 8-bit value 234, I would like a formula that converts it to -22 (a signed cast), and another that converts -22 back to 234 (an unsigned cast).
I know numpy already has functions to do this, but I would like to avoid it.

You could use bitwise operations based on a 2^(n-1) mask value (for the sign bit):
size = 8
sign = 1 << size - 1   # 0x80, the sign bit for 8-bit values
number = 234
signed = (number & sign - 1) - (number & sign)     # low bits, minus the sign bit's weight
unsigned = (signed & sign - 1) | (signed & sign)   # reassemble the raw 8-bit pattern
print(signed) # -22
print(unsigned) # 234
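A quick way to convince yourself that the two formulas are inverses is to round-trip every 8-bit value (a small sketch reusing the definitions above):
for number in range(256):
    signed = (number & sign - 1) - (number & sign)
    assert (signed & sign - 1) | (signed & sign) == number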

You can easily create such a function yourself:
def toInt8(value):
    valueUint8 = value & 255
    if valueUint8 & 128:
        return valueUint8 - 256
    return valueUint8
>>> toInt8(234)
-22
You can make a version that accepts the number of bits as a parameter, rather than it being hardcoded to 8:
def toSignedInt(value, bits):
    masked = value & (2**bits - 1)    # truncate to the requested width
    if masked & 2**(bits - 1):        # sign bit set?
        return masked - 2**bits
    return masked
>>> toSignedInt(234, 8)
-22
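The question also asks for the reverse "unsigned cast" (-22 back to 234); that is just the mask by itself. A minimal sketch in the same style:
def toUnsignedInt(value, bits):
    # keep only the low `bits` bits, like a C cast to an unsigned type of that width
    return value & (2**bits - 1)
>>> toUnsignedInt(-22, 8)
234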

Related

Converting lower and upper 8 bits into one value with python

I have pairs like these: (-102, -56), (123, -56). The first value of each pair represents the lower 8 bits and the second value represents the upper 8 bits; both are in signed decimal form. I need to convert each pair into a single 16-bit value.
I think I was able to convert (-102,-56) pair by:
l = bin(-102 & 0b1111111111111111)[-8:]
u = bin(-56 & 0b1111111111111111)[-8:]
int(u+l,2)
But when I try to do the same with (123, -56) pair I get the following error:
ValueError: invalid literal for int() with base 2: '11001000b1111011'.
I understand that it's due to the different lengths for different values and I need to fill them up to 8 bits.
Am I approaching this completely wrong? What's the best way to do this so it works both on negative and positive values?
UPDATE:
I was able to solve this by:
low_int = 123
up_int = -56
(low_int & 0xFF) | ((up_int & 0xFF) << 8)
You can try shifting the upper value left by 8 bits, using the logic described here: https://stackoverflow.com/a/1857965/8947333
Just guessing:
l, u = -102 & 255, -56 & 255
# shift the upper byte 8 bits to the left, then add the lower byte
(u << 8) + l
Bitwise operations are fine, but not strictly required.
In the most common 2's complement representation for 8 bits:
-1 signed == 255 unsigned
-2 signed == 254 unsigned
...
-127 signed == 129 unsigned
-128 signed == 128 unsigned
The absolute values in each signed/unsigned pair always sum to 256.
Use this to convert negative values:
if b < 0:
    b += 256
and then combine the high and low byte:
value = 256 * hi8 + lo8
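Putting the whole thread together: a sketch that combines a signed (low, high) pair into one 16-bit value (combine_bytes and its signed flag are illustrative names, not from any of the answers above):
def combine_bytes(lo, hi, signed=False):
    # mask each byte to its unsigned 8-bit form, then shift the high byte up
    value = (lo & 0xFF) | ((hi & 0xFF) << 8)
    if signed and value & 0x8000:
        value -= 0x10000   # reinterpret the result as a signed 16-bit quantity
    return value
>>> combine_bytes(-102, -56)
51354
>>> combine_bytes(-102, -56, signed=True)
-14182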

Why does the ~ operator in Python 3 output a value I did not expect? [duplicate]

I'm a little confused by the ~ operator. Code goes below:
a = 1
~a #-2
b = 15
~b #-16
How does ~ do work?
I thought ~a would be something like:
0001 = a
1110 = ~a
Why isn't that what I get?
You are exactly right. It's an artifact of two's complement integer representation.
In 16 bits, 1 is represented as 0000 0000 0000 0001. Inverted, you get 1111 1111 1111 1110, which is -2. Similarly, 15 is 0000 0000 0000 1111. Inverted, you get 1111 1111 1111 0000, which is -16.
In general, ~n = -n - 1
The '~' operator is defined as:
"The bit-wise inversion of x is defined as -(x+1). It only applies to integral numbers."Python Doc - 5.5
The important part of this sentence is that it applies to 'integral numbers' (also called integers). Your example represents a 4-bit number.
'0001' = 1
The integer range of a 4-bit number is -8..0..7. On the other hand, you could use 'unsigned integers', which do not include negative numbers; the range for your 4-bit number would then be 0..15.
Since Python operates on integers, the behavior you described is expected. Integers are represented using two's complement. In the case of a 4-bit number this looks like the following:
7 = '0111'
0 = '0000'
-1 = '1111'
-8 = '1000'
Python 2 uses 32 bits for its plain integers on a 32-bit system. You can check the largest one with:
sys.maxint # (2^31)-1 on my system
If you would like an unsigned integer returned for your 4-bit number, you have to mask:
'0001' = a # unsigned '1' / integer '1'
'1110' = ~a # unsigned '14' / integer -2
(~a & 0xF) # returns 14
If you want to get an unsigned 8 bit number range (0..255) instead just use:
(~a & 0xFF) # returns 254
It looks like I found a simpler solution that does what is desired:
uint8: x ^ 0xFF
uint16: x ^ 0xFFFF
uint32: x ^ 0xFFFFFFFF
uint64: x ^ 0xFFFFFFFFFFFFFFFF
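The XOR trick and the mask trick agree, which is easy to spot-check (a one-line sketch):
x = 1
assert x ^ 0xFF == ~x & 0xFF == 254   # both yield the unsigned 8-bit inversion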
You could also use unsigned ints (for example from the numpy package) to achieve the expected behaviour.
>>> import numpy as np
>>> bin( ~ np.uint8(1))
'0b11111110'
The problem is that the number represented by the result of applying ~ is not well defined, as it depends on the number of bits used to represent the original value. For instance:
5 = 101
~5 = 010 = 2
5 = 0101
~5 = 1010 = 10
5 = 00101
~5 = 11010 = 26
However, the two's complement of ~5 is the same in all cases:
two_complement(~101) = 2^3 - 2 = 6
two_complement(~0101) = 2^4 - 10 = 6
two_complement(~00101) = 2^5 - 26 = 6
And given that the two's complement is used to represent negative values, it makes sense to consider ~5 as the negative value, -6, of its complement.
So, more formally, to arrive at this result we have:
flipped zeros and ones (that's equivalent to taking the ones' complement)
taken the two's complement
applied a negative sign
and if x is an n-bit number:
~x = -two_complement(ones_complement(x)) = -two_complement(2^n - 1 - x) = -(2^n - (2^n - 1 - x)) = -(x + 1)
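The identity is easy to verify in Python, where integers are unbounded (a small sketch; the chosen values and widths are arbitrary):
for x in (1, 5, 15, 255):
    assert ~x == -(x + 1)
    for bits in (8, 16, 32):
        mask = (1 << bits) - 1
        assert ~x & mask == mask - x   # the masked view is the n-bit ones' complement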

python: unpack IBM 32-bit floating point

I was reading a binary file in python like this:
from struct import unpack
ns = 1000
f = open("binary_file", 'rb')
while True:
    data = f.read(ns * 4)
    if data == '':
        break
    unpacked = unpack(">%sf" % ns, data)
    print str(unpacked)
Then I realized that unpack(">f", data) unpacks IEEE floating point numbers, while my data are IBM 32-bit floating point numbers.
My question is:
How can I implement my own unpack for IBM 32-bit floating point numbers?
I don't mind using like ctypes to extend python to get better performance.
EDIT: I did some searching:
http://mail.scipy.org/pipermail/scipy-user/2009-January/019392.html
This looks very promising, but I want something more efficient: there are potentially tens of thousands of loops.
EDIT: posted answer below. Thanks for the tip.
I think I understood it:
First unpack the string to an unsigned 4-byte integer, and then use this function:
def ibm2ieee(ibm):
    """
    Converts an IBM floating point number into IEEE format.
    :param: ibm - 32 bit unsigned integer: unpack('>L', f.read(4))
    """
    if ibm == 0:
        return 0.0
    sign = ibm >> 31 & 0x01
    exponent = ibm >> 24 & 0x7f
    mantissa = (ibm & 0x00ffffff) / float(pow(2, 24))
    return (1 - 2 * sign) * mantissa * pow(16, exponent - 64)
Thanks for all who helped!
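For completeness, wiring it into the read loop from the question might look like this (a sketch; f is an already-open binary file handle and unpack comes from the struct import in the question):
(as_uint32,) = unpack('>L', f.read(4))
print ibm2ieee(as_uint32)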
IBM Floating Point Architecture, how to encode and decode:
http://en.wikipedia.org/wiki/IBM_Floating_Point_Architecture
My solution:
I wrote a class. I think it can be a bit faster this way, because it uses a Struct object, so the unpack format is compiled only once.
EDIT: it also unpacks all size values at once, and unpacking can be an expensive operation.
from struct import Struct

class StructIBM32(object):
    """
    see example in:
    http://en.wikipedia.org/wiki/IBM_Floating_Point_Architecture#An_Example
    >>> import struct
    >>> c = StructIBM32(1)
    >>> bit = '11000010011101101010000000000000'
    >>> c.unpack(struct.pack('>L', int(bit, 2)))
    [-118.625]
    """
    def __init__(self, size):
        self.p24 = float(pow(2, 24))
        self.unpack32int = Struct(">%sL" % size).unpack

    def unpack(self, data):
        int32 = self.unpack32int(data)
        return [self.ibm2ieee(i) for i in int32]

    def ibm2ieee(self, int32):
        if int32 == 0:
            return 0.0
        sign = int32 >> 31 & 0x01
        exponent = int32 >> 24 & 0x7f
        mantissa = (int32 & 0x00ffffff) / self.p24
        return (1 - 2 * sign) * mantissa * pow(16, exponent - 64)

if __name__ == "__main__":
    import doctest
    doctest.testmod()
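To process the file from the original question with this class, record by record, something like the following should work (a sketch: "binary_file" and ns = 1000 are taken from the question, and a short trailing record would need its own, smaller Struct):
ns = 1000
reader = StructIBM32(ns)
with open("binary_file", "rb") as f:
    while True:
        data = f.read(ns * 4)
        if len(data) < ns * 4:
            break   # EOF, or a partial record this reader cannot unpack
        print reader.unpack(data)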

Translating a C binary data read function to Python

(I've edited this for clarity, and changed the actual question a bit based on EOL's answer)
I'm trying to translate the following function in C to Python but failing miserably (see C code below). As I understand it, it takes four 1-byte chars starting from the memory location pointed to by from, treats them as unsigned long ints in order to give each one 4 bytes of space, and does some bit-shifting to arrange them as a big-endian 32-bit integer. The result is then used in an algorithm that checks file validity. (from the Treaty of Babel)
static int32 read_alan_int(unsigned char *from)
{
    return ((unsigned long int) from[3]) | ((unsigned long int) from[2] << 8) |
           ((unsigned long int) from[1] << 16) | ((unsigned long int) from[0] << 24);
}

/*
  The claim algorithm for Alan files is:
  * For Alan 3, check for the magic word
  * load the file length in blocks
  * check that the file length is correct
  * For alan 2, each word between byte address 24 and 81 is a
    word address within the file, so check that they're all within
    the file
  * Locate the checksum and verify that it is correct
*/
static int32 claim_story_file(void *story_file, int32 extent)
{
    unsigned char *sf = (unsigned char *) story_file;
    int32 bf, i, crc = 0;
    if (extent < 160) return INVALID_STORY_FILE_RV;
    if (memcmp(sf, "ALAN", 4))
    {   /* Identify Alan 2.x */
        bf = read_alan_int(sf + 4);
        if (bf > extent / 4) return INVALID_STORY_FILE_RV;
        for (i = 24; i < 81; i += 4)
            if (read_alan_int(sf + i) > extent / 4) return INVALID_STORY_FILE_RV;
        for (i = 160; i < (bf * 4); i++)
            crc += sf[i];
        if (crc != read_alan_int(sf + 152)) return INVALID_STORY_FILE_RV;
        return VALID_STORY_FILE_RV;
    }
    else
    {   /* Identify Alan 3 */
        bf = read_alan_int(sf + 12);
        if (bf > (extent / 4)) return INVALID_STORY_FILE_RV;
        for (i = 184; i < (bf * 4); i++)
            crc += sf[i];
        if (crc != read_alan_int(sf + 176)) return INVALID_STORY_FILE_RV;
    }
    return INVALID_STORY_FILE_RV;
}
I'm trying to reimplement this in Python. For implementing the read_alan_int function, I would think that importing struct and doing struct.unpack_from('>L', data, offset) would work. However, on valid files, this always returns 24 for the value bf, which means that the for loop is skipped.
def read_alan_int(file_buffer, i):
    i0 = ord(file_buffer[i]) * (2 ** 24)
    i1 = ord(file_buffer[i + 1]) * (2 ** 16)
    i2 = ord(file_buffer[i + 2]) * (2 ** 8)
    i3 = ord(file_buffer[i + 3])
    return i0 + i1 + i2 + i3

def is_a(file_buffer):
    crc = 0
    if len(file_buffer) < 160:
        return False
    if file_buffer[0:4] == 'ALAN':
        # Identify Alan 2.x
        bf = read_alan_int(file_buffer, 4)
        if bf > len(file_buffer) / 4:
            return False
        for i in range(24, 81, 4):
            if read_alan_int(file_buffer, i) > len(file_buffer) / 4:
                return False
        for i in range(160, bf * 4):
            crc += ord(file_buffer[i])
        if crc != read_alan_int(file_buffer, 152):
            return False
        return True
    else:
        # Identify Alan 3.x
        #bf = read_long(file_buffer, 12, '>')
        bf = read_alan_int(file_buffer, 12)
        print bf
        if bf > len(file_buffer) / 4:
            return False
        for i in range(184, bf * 4):
            crc += ord(file_buffer[i])
        if crc != read_alan_int(file_buffer, 176):
            return False
        return True
    return False

if __name__ == '__main__':
    import sys, struct
    data = open(sys.argv[1], 'rb').read()
    print is_a(data)
...but the damn thing still returns 24. Unfortunately, my C skills are non-existent so I'm having trouble getting the original program to print some debug output so I can know what bf is supposed to be.
What am I doing wrong?
Ok, so I'm apparently doing read_alan_int correctly. However, what's failing for me is the check that the first 4 characters are "ALAN". All of my test files fail this test. I've changed the code to remove this if/else statement and to instead just take advantage of early returns, and now all of my unit tests pass. So, on a practical level, I'm done. However, I'll keep the question open to address the new problem: how can I possibly wrangle the bits to get "ALAN" out of the first 4 chars?
def is_a(file_buffer):
    crc = 0
    if len(file_buffer) < 160:
        return False
    #if file_buffer.startswith('ALAN'):
    # Identify Alan 2.x
    bf = read_long(file_buffer, 4)
    if bf > len(file_buffer) / 4:
        return False
    for i in range(24, 81, 4):
        if read_long(file_buffer, i) > len(file_buffer) / 4:
            return False
    for i in range(160, bf * 4):
        crc += ord(file_buffer[i])
    if crc == read_long(file_buffer, 152):
        return True
    # Identify Alan 3.x
    crc = 0
    bf = read_long(file_buffer, 12)
    if bf > len(file_buffer) / 4:
        return False
    for i in range(184, bf * 4):
        crc += ord(file_buffer[i])
    if crc == read_long(file_buffer, 176):
        return True
    return False
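Note that read_long is never defined in the post; a minimal stand-in, assuming it behaves like read_alan_int with the optional byte-order argument seen in the commented-out call earlier, could be:
import struct

def read_long(file_buffer, offset, byteorder='>'):
    # unsigned 32-bit read at `offset`, big-endian by default
    return struct.unpack_from(byteorder + 'L', file_buffer, offset)[0]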
Ah, I think I've got it. Note that the description says
/*
The claim algorithm for Alan files is:
* For Alan 3, check for the magic word
* load the file length in blocks
* check that the file length is correct
* For alan 2, each word between byte address 24 and 81 is a
word address within the file, so check that they're all within
the file
* Locate the checksum and verify that it is correct
*/
which I read as saying that there's a magic word in Alan 3, but not in Alan 2. However, your code goes the other way, even though the C code only assumes that the ALAN exists for Alan 3 files.
Why? Because you don't speak C, so you guessed -- naturally enough! -- that memcmp would return (the equivalent of a Python) True if the first four characters of sf and "ALAN" are equal... but it doesn't. memcmp returns 0 if the contents are equal, and nonzero if they differ.
And that seems to be the way it works:
>>> import urllib2
>>>
>>> alan2 = urllib2.urlopen("http://ifarchive.plover.net/if-archive/games/competition2001/alan/chasing/chasing.acd").read(4)
>>> alan3 = urllib2.urlopen("http://mirror.ifarchive.org/if-archive/games/competition2006/alan/enterthedark/EnterTheDark.a3c").read(4)
>>>
>>> alan2
'\x02\x08\x01\x00'
>>> alan3
'ALAN'
Hypothesis 1: You are running on Windows, and you haven't opened your file in binary mode.
Your Python version looks fine to me.
PS: I missed the "memcmp() catch" that DSM found, so the Python code for if memcmp(…)… should actually be if file_buffer[0:4] != 'ALAN'.
As far as I can see from the C code and from the sample file you give in the comments to the original question, the sample file is indeed invalid; here are the values:
read_alan_int(sf+12) == 24 # 0, 0, 0, 24 in file sf, big endian
crc = 0
read_alan_int(sf+176) == 46 # 0, 0, 0, 46 in file sf, big endian
So, crc != read_alan_int(sf+176), indeed.
Are you sure that the sample file is a valid file? Or is part of the calculation of crc missing from the original post?

Hex string to signed int in Python

How do I convert a hex string to a signed int in Python 3?
The best I can come up with is
h = '9DA92DAB'
b = bytes(h, 'utf-8')
ba = binascii.a2b_hex(b)
print(int.from_bytes(ba, byteorder='big', signed=True))
Is there a simpler way? Unsigned is so much easier: int(h, 16)
BTW, the origin of the question is itunes persistent id - music library xml version and iTunes hex version
In n-bit two's complement, the bits have these values:
bit 0 = 2^0
bit 1 = 2^1
...
bit n-2 = 2^(n-2)
bit n-1 = -2^(n-1)
But bit n-1 has value 2^(n-1) when read as unsigned, so an unsigned reading is 2^n too high whenever the sign bit is set. Subtract 2^n if bit n-1 is set:
def twos_complement(hexstr, bits):
    value = int(hexstr, 16)
    if value & (1 << (bits - 1)):
        value -= 1 << bits
    return value
print(twos_complement('FFFE', 16))
print(twos_complement('7FFF', 16))
print(twos_complement('7F', 8))
print(twos_complement('FF', 8))
Output:
-2
32767
127
-1
import struct
For Python 3 (with comments' help):
h = '9DA92DAB'
struct.unpack('>i', bytes.fromhex(h))
For Python 2:
h = '9DA92DAB'
struct.unpack('>i', h.decode('hex'))
or if it is little endian:
h = '9DA92DAB'
struct.unpack('<i', h.decode('hex'))
Here's a general function you can use for hex of any size:
import math
# hex string to signed integer
def htosi(val):
uintval = int(val,16)
bits = 4 * (len(val) - 2)
if uintval >= math.pow(2,bits-1):
uintval = int(0 - (math.pow(2,bits) - uintval))
return uintval
And to use it:
h = str(hex(-5))
h2 = str(hex(-13589))
x = htosi(h)
x2 = htosi(h2)
This works for 16-bit signed ints; you can extend it for 32-bit ints. It uses the basic definition of two's complement signed numbers. Also note that XOR with all ones is the same as a bitwise NOT.
# convert to unsigned
x = int('ffbf', 16) # example (-65)
# check the sign bit
if (x & 0x8000) == 0x8000:
    # if set, invert and add one to get the magnitude, then apply the negative sign
    x = -((x ^ 0xffff) + 1)
It's a very late answer, but here's a function to do the above. This will extend for whatever length you provide. Credit for portions of this to another SO answer (I lost the link, so please provide it if you find it).
def hex_to_signed(source):
    """Convert a hex string to a signed integer.

    This assumes that source is the proper length, and the sign bit
    is the first bit in the first byte of the correct length.
    hex_to_signed("F") should return -1.
    hex_to_signed("0F") should return 15.
    """
    if not isinstance(source, str):
        raise ValueError("string type required")
    if 0 == len(source):
        raise ValueError("string is empty")
    sign_bit_mask = 1 << (len(source) * 4 - 1)
    other_bits_mask = sign_bit_mask - 1
    value = int(source, 16)
    return -(value & sign_bit_mask) | (value & other_bits_mask)
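A couple of quick checks, using the expected values from the docstring plus one taken from the twos_complement examples earlier:
assert hex_to_signed("F") == -1
assert hex_to_signed("0F") == 15
assert hex_to_signed("FFFE") == -2   # agrees with twos_complement('FFFE', 16)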
