The Python struct library has a set of format characters, each corresponding to a C type ("h": int16, "H": uint16).
Is there a simple way to go from a format string (e.g. "h", "H", etc.) to the range of possible values (e.g. -32768 to 32767, 0 to 65535, etc.)?
I see the struct library provides calcsize, but what I really want is something like calcrange.
Is there a built-in solution, or an elegant solution I am neglecting? I am also open to third party libraries.
I have made a DIY calcrange below, but it only covers a limited number of possible format strings and makes some non-generalizable assumptions.
from struct import calcsize
from typing import Tuple

def calcrange(fmt: str) -> Tuple[int, int]:
    """Calculate the min and max possible value of a given struct format string."""
    size: int = calcsize(fmt)
    unsigned_max = int("0x" + "FF" * size, 16)
    if fmt.islower():
        # Signed case
        min_ = -1 * int("0x80" + "00" * (size - 1), 16)
        return min_, unsigned_max + min_
    # Unsigned case
    return 0, unsigned_max
The math can be simplified. If b is the bit-width, then unsigned values range from 0 to 2**b - 1 and signed values from -2**(b-1) to 2**(b-1) - 1. Note that this only works for the integer format codes.
Here's the simplified version:
import struct

def calcrange(intcode):
    b = struct.calcsize(intcode) * 8
    if intcode.islower():
        return -2**(b-1), 2**(b-1)-1
    else:
        return 0, 2**b-1

for code in 'bBhHiIlLqQnN':
    s, e = calcrange(code)
    print(f'{code} {s:26,} to {e:26,}')
Output:
b -128 to 127
B 0 to 255
h -32,768 to 32,767
H 0 to 65,535
i -2,147,483,648 to 2,147,483,647
I 0 to 4,294,967,295
l -2,147,483,648 to 2,147,483,647
L 0 to 4,294,967,295
q -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807
Q 0 to 18,446,744,073,709,551,615
n -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807
N 0 to 18,446,744,073,709,551,615
How do I get the string as the binary IEEE 754 representation of a 32-bit float?
Example
1.00 -> '00111111100000000000000000000000'
You can do that with the struct package:
import struct

def binary(num):
    return ''.join('{:0>8b}'.format(c) for c in struct.pack('!f', num))
That packs the number as a network byte-ordered (big-endian) float, then converts each of the resulting bytes to an 8-bit binary representation and concatenates them:
>>> binary(1)
'00111111100000000000000000000000'
Edit:
There was a request to expand the explanation. I'll expand this using intermediate variables to comment each step.
def binary(num):
    # Struct can provide us with the float packed into bytes. The '!' ensures that
    # it's in network byte order (big-endian) and the 'f' says that it should be
    # packed as a float. Alternatively, for double-precision, you could use 'd'.
    packed = struct.pack('!f', num)
    print('Packed: %s' % repr(packed))

    # Iterating over a bytes object in Python 3 yields the integer code points
    # directly, so no ord() call is needed
    #
    # [62, 163, 215, 10] = list(b'>\xa3\xd7\n')
    integers = list(packed)
    print('Integers: %s' % integers)

    # For each integer, we'll convert it to its binary representation.
    binaries = [bin(i) for i in integers]
    print('Binaries: %s' % binaries)

    # Now strip off the '0b' from each of these
    stripped_binaries = [s.replace('0b', '') for s in binaries]
    print('Stripped: %s' % stripped_binaries)

    # Pad each byte's binary representation with 0's to make sure it has all 8 bits:
    #
    # ['00111110', '10100011', '11010111', '00001010']
    padded = [s.rjust(8, '0') for s in stripped_binaries]
    print('Padded: %s' % padded)

    # At this point, we have each of the bytes for the network byte ordered float
    # in a list as binary strings. Now we just concatenate them to get the total
    # representation of the float:
    return ''.join(padded)
And the result for a few examples:
>>> binary(1)
Packed: b'?\x80\x00\x00'
Integers: [63, 128, 0, 0]
Binaries: ['0b111111', '0b10000000', '0b0', '0b0']
Stripped: ['111111', '10000000', '0', '0']
Padded: ['00111111', '10000000', '00000000', '00000000']
'00111111100000000000000000000000'
>>> binary(0.32)
Packed: b'>\xa3\xd7\n'
Integers: [62, 163, 215, 10]
Binaries: ['0b111110', '0b10100011', '0b11010111', '0b1010']
Stripped: ['111110', '10100011', '11010111', '1010']
Padded: ['00111110', '10100011', '11010111', '00001010']
'00111110101000111101011100001010'
Here's an ugly one ...
>>> import struct
>>> bin(struct.unpack('!i',struct.pack('!f',1.0))[0])
'0b111111100000000000000000000000'
Basically, I just used the struct module to convert the float to an int ...
Here's a slightly better one using ctypes:
>>> import ctypes
>>> bin(ctypes.c_uint32.from_buffer(ctypes.c_float(1.0)).value)
'0b111111100000000000000000000000'
Basically, I construct a float and use the same memory location, but I tag it as a c_uint32. The c_uint32's value is a Python integer, which you can use the built-in bin function on.
Note: by switching types we can do the reverse operation as well:
>>> ctypes.c_float.from_buffer(ctypes.c_uint32(int('0b111111100000000000000000000000', 2))).value
1.0
Also, for double-precision 64-bit floats, we can use the same trick with ctypes.c_double and ctypes.c_uint64 instead.
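Here is a quick sketch of that 64-bit round trip (my own check, not part of the original answer):

import ctypes

# Reinterpret the bits of a C double as a C uint64 sharing the same buffer
bits = ctypes.c_uint64.from_buffer(ctypes.c_double(1.0)).value
print('{:064b}'.format(bits))  # 0011111111110000 followed by 48 zeros

# ... and back again
print(ctypes.c_double.from_buffer(ctypes.c_uint64(bits)).value)  # 1.0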
Found another solution using the bitstring module.
import bitstring
f1 = bitstring.BitArray(float=1.0, length=32)
print(f1.bin)
Output:
00111111100000000000000000000000
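Assuming I am reading the bitstring API correctly, the reverse direction works too: a 32- or 64-bit BitArray can be interpreted back as a float via its .float property:

f2 = bitstring.BitArray(bin='00111111100000000000000000000000')
print(f2.float)  # 1.0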
For the sake of completeness, you can achieve this with numpy using:
import numpy as np

f = 1.00
int32bits = np.asarray(f, dtype=np.float32).view(np.int32).item()  # item() optional
You can then print this, with padding, using the b format specifier
print('{:032b}'.format(int32bits))
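A 64-bit variant of the same trick (my own extrapolation from the answer above):

import numpy as np

f = 1.00
int64bits = np.asarray(f, dtype=np.float64).view(np.int64).item()
print('{:064b}'.format(int64bits))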
With these two simple functions (Python >=3.6) you can easily convert a float number to binary and vice versa, for IEEE 754 binary64.
import struct

def bin2float(b):
    ''' Convert binary string to a float.

    Attributes:
        :b: Binary string to transform.
    '''
    h = int(b, 2).to_bytes(8, byteorder="big")
    return struct.unpack('>d', h)[0]

def float2bin(f):
    ''' Convert float to 64-bit binary string.

    Attributes:
        :f: Float number to transform.
    '''
    [d] = struct.unpack(">Q", struct.pack(">d", f))
    return f'{d:064b}'
For example:
print(float2bin(1.618033988749894))
print(float2bin(3.14159265359))
print(float2bin(5.125))
print(float2bin(13.80))
print(bin2float('0011111111111001111000110111011110011011100101111111010010100100'))
print(bin2float('0100000000001001001000011111101101010100010001000010111011101010'))
print(bin2float('0100000000010100100000000000000000000000000000000000000000000000'))
print(bin2float('0100000000101011100110011001100110011001100110011001100110011010'))
The output is:
0011111111111001111000110111011110011011100101111111010010100100
0100000000001001001000011111101101010100010001000010111011101010
0100000000010100100000000000000000000000000000000000000000000000
0100000000101011100110011001100110011001100110011001100110011010
1.618033988749894
3.14159265359
5.125
13.8
I hope you like it, it works perfectly for me.
This problem is more cleanly handled by breaking it into two parts.
The first is to convert the float into an int with the equivalent bit pattern:
import struct

def float32_bit_pattern(value):
    return sum(ord(b) << 8*i for i, b in enumerate(struct.pack('f', value)))
Python 3 doesn't require ord to convert the bytes to integers, so you can simplify the above a little bit:
def float32_bit_pattern(value):
    return sum(b << 8*i for i, b in enumerate(struct.pack('f', value)))
Next convert the int to a string:
def int_to_binary(value, bits):
    return bin(value).replace('0b', '').rjust(bits, '0')
Now combine them:
>>> int_to_binary(float32_bit_pattern(1.0), 32)
'00111111100000000000000000000000'
Piggybacking on Dan's answer with a colored version for Python 3:
import struct

BLUE = "\033[1;34m"
CYAN = "\033[1;36m"
GREEN = "\033[0;32m"
RESET = "\033[0;0m"

def binary(num):
    return [bin(c).replace('0b', '').rjust(8, '0') for c in struct.pack('!f', num)]

def binary_str(num):
    bits = ''.join(binary(num))
    return ''.join([BLUE, bits[:1], GREEN, bits[1:10], CYAN, bits[10:], RESET])

def binary_str_fp16(num):
    bits = ''.join(binary(num))
    return ''.join([BLUE, bits[:1], GREEN, bits[1:10][-5:], CYAN, bits[10:][:11], RESET])

x = 0.7
print(x, "as fp32:", binary_str(0.7), "as fp16 is sort of:", binary_str_fp16(0.7))
After browsing through lots of similar questions I've written something which hopefully does what I wanted.
import struct

f = 1.00
negative = False
if f < 0:
    f = f*-1
    negative = True
s = struct.pack('>f', f)
p = struct.unpack('>l', s)[0]
hex_data = hex(p)
scale = 16
num_of_bits = 32
binrep = bin(int(hex_data, scale))[2:].zfill(num_of_bits)
if negative:
    binrep = '1' + binrep[1:]
binrep is the result.
Each part will be explained.
f = 1.00
negative = False
if f < 0:
    f = f*-1
    negative = True
Converts the number to a positive if negative, and sets the variable negative to True. The reason for this is that the difference between positive and negative binary representations is just in the first bit, and this was simpler than figuring out what goes wrong when doing the whole process with negative numbers.
s = struct.pack('>f', f) #'?\x80\x00\x00'
p = struct.unpack('>l', s)[0] #1065353216
hex_data = hex(p) #'0x3f800000'
s is a packed-bytes representation of the binary f; it is however not in the pretty form I need. That's where p comes in: it is the int unpacked from the bytes s. And then another conversion to get a pretty hex.
scale = 16
num_of_bits = 32
binrep = bin(int(hex_data, scale))[2:].zfill(num_of_bits)
if negative:
    binrep = '1' + binrep[1:]
scale is the base 16 for the hex. num_of_bits is 32, as a float is 32 bits; it is used later to pad with zeros up to 32 places. Got the code for binrep from this question. If the number was negative, just change the first bit.
I know this is ugly, but I didn't find a nice way and I needed it fast. Comments are welcome.
This is a little more than was asked, but it was what I needed when I found this entry. This code will give the mantissa, exponent (called base in the code), and sign of an IEEE 754 32-bit float.
import ctypes

def binRep(num):
    binNum = bin(ctypes.c_uint.from_buffer(ctypes.c_float(num)).value)[2:]
    print("bits: " + binNum.rjust(32, "0"))
    mantissa = "1" + binNum[-23:]
    print("sig (bin): " + mantissa.rjust(24))
    mantInt = int(mantissa, 2)/2**23
    print("sig (float): " + str(mantInt))
    base = int(binNum[-31:-23], 2) - 127
    print("base:" + str(base))
    sign = 1 - 2*("1" == binNum[-32:-31].rjust(1, "0"))
    print("sign:" + str(sign))
    print("recreate:" + str(sign*mantInt*(2**base)))

binRep(-0.75)
output:
bits: 10111111010000000000000000000000
sig (bin): 110000000000000000000000
sig (float): 1.5
base:-1
sign:-1
recreate:-0.75
Convert a float between 0..1:
def float_bin(n, places=3):
    if (n < 0 or n > 1):
        return "ERROR, n must be in 0..1"

    answer = "0."
    while n > 0:
        if len(answer) - 2 == places:
            return answer
        b = n * 2
        if b >= 1:
            answer += '1'
            n = b - 1
        else:
            answer += '0'
            n = b
    return answer
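For example, exercising the loop on a couple of values I picked myself:

>>> float_bin(0.5)
'0.1'
>>> float_bin(0.625)
'0.101'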
Several of these answers did not work as written with Python 3, or did not give the correct representation for negative floating point numbers. I found the following to work for me (though this gives the 64-bit representation, which is what I needed):
import struct

def float_to_binary_string(f):
    def int_to_8bit_binary_string(n):
        stg = bin(n).replace('0b', '')
        fillstg = '0' * (8 - len(stg))
        return fillstg + stg
    return ''.join(int_to_8bit_binary_string(int(b)) for b in struct.pack('>d', f))
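As a quick check (mine, not the original author's) against the known binary64 pattern for 1.0:

>>> float_to_binary_string(1.0)
'0011111111110000000000000000000000000000000000000000000000000000'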
I made a very simple one; please check it, and if you think there is any mistake, please let me know. It works fine for me.
sds = float(input("Enter the number : "))
sf = float("0." + (str(sds).split(".")[-1]))
aa = []

while len(aa) < 15:
    dd = round(sf*2, 5)
    if dd-1 > 0:
        aa.append(1)
        sf = dd-1
    else:
        sf = round(dd, 5)
        aa.append(0)

des = aa[:-1]
print("\n")
AA = ([str(i) for i in des])
print("So the Binary Of : %s>>>" % sds, bin(int(str(sds).split(".")[0])).replace("0b", '') + "." + "".join(AA))
Or in the case of an integer number, just use bin(integer).replace("0b", '').
Let's use numpy!
import numpy as np

def binary(num, string=True):
    bits = np.unpackbits(np.array([num]).view('u1'))
    if string:
        return np.array2string(bits, separator='')[1:-1]
    else:
        return bits
e.g.,
binary(np.pi)
# '0001100000101101010001000101010011111011001000010000100101000000'
binary(np.pi, string=False)
# array([0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1,
# 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0,
# 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0],
# dtype=uint8)
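Note that this unpacks the bytes in the machine's native (typically little-endian) order, which is why the sign bit is not first in the output above. A variant I believe works (asking numpy for a big-endian float64 up front, so the bits come out sign-bit-first in the usual IEEE 754 order; the expected string is my own calculation):

bits = np.unpackbits(np.array([np.pi], dtype='>f8').view('u1'))
print(np.array2string(bits, separator='')[1:-1])
# '0100000000001001001000011111101101010100010001000010110100011000'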
You can use .format for the easiest representation of bits, in my opinion.
My code would look something like:
import struct

def fto32b(flt):
    # is given a 32 bit float value and converts it to a binary string
    if isinstance(flt, float):
        # THE FOLLOWING IS AN EXPANDED REPRESENTATION OF THE ONE LINE RETURN
        #   packed = struct.pack('!f', flt)   <- get the bytes in (!) big-endian format of an (f) float
        #   binaries = []
        #   for i in packed:                  <- iterating over bytes yields ints in Python 3
        #       binaries.append("{0:08b}".format(i))  <- the 8-bit binary representation of each int (00100101)
        #   binarystring = ''.join(binaries)  <- join all the bytes together
        #   return binarystring
        return ''.join(["{0:08b}".format(i) for i in struct.pack('!f', flt)])
    return None
Output:
>>> fto32b(5.0)
'01000000101000000000000000000000'
>>> fto32b(1.0)
'00111111100000000000000000000000'
The usual saying is that string comparison must be done in constant time when checking things like passwords or hashes, and thus it is recommended to avoid a == b.
However, I ran the following script, and the results don't support the hypothesis that a == b short circuits on the first non-identical character.
from time import perf_counter_ns
import random

def timed_cmp(a, b):
    start = perf_counter_ns()
    a == b
    end = perf_counter_ns()
    return end - start

def n_timed_cmp(n, a, b):
    "average time for a==b done n times"
    ts = [timed_cmp(a, b) for _ in range(n)]
    return sum(ts) / len(ts)

def check_cmp_time():
    random.seed(123)
    # generate a random string of n characters
    n = 2 ** 8
    s = "".join([chr(random.randint(ord("a"), ord("z"))) for _ in range(n)])
    # generate a list of strings, which all differs from the original string
    # by one character, at a different position
    # only do that for the first 50 char, it's enough to get data
    diffs = [s[:i] + "A" + s[i+1:] for i in range(min(50, n))]
    timed = [(i, n_timed_cmp(10000, s, d)) for (i, d) in enumerate(diffs)]
    sorted_timed = sorted(timed, key=lambda t: t[1])

    # print the 10 fastest
    for x in sorted_timed[:10]:
        i, t = x
        print("{}\t{:3f}".format(i, t))

    print("---")
    i, t = timed[0]
    print("{}\t{:3f}".format(i, t))
    i, t = timed[1]
    print("{}\t{:3f}".format(i, t))

if __name__ == "__main__":
    check_cmp_time()
Here is the result of a run; re-running the script gives slightly different results, but nothing conclusive.
# ran with cpython 3.8.3
6 78.051700
1 78.203200
15 78.222700
14 78.384800
11 78.396300
12 78.441800
9 78.476900
13 78.519000
8 78.586200
3 78.631500
---
0 80.691100
1 78.203200
I would've expected the fastest comparison to be where the first differing character is at the beginning of the string, but that's not what I get.
Any idea what's going on?
There's a difference, you just don't see it on such small strings. Here's a small patch to apply to your code: I use longer strings, and I do the checks by putting the A at evenly spaced positions in the original string, from the beginning to the end, like this:
A_______________________________________________________________
______A_________________________________________________________
____________A___________________________________________________
__________________A_____________________________________________
________________________A_______________________________________
______________________________A_________________________________
____________________________________A___________________________
__________________________________________A_____________________
________________________________________________A_______________
______________________________________________________A_________
____________________________________________________________A___
@@ -15,13 +15,13 @@ def n_timed_cmp(n, a, b):
 def check_cmp_time():
     random.seed(123)
     # generate a random string of n characters
-    n = 2 ** 8
+    n = 2 ** 16
     s = "".join([chr(random.randint(ord("a"), ord("z"))) for _ in range(n)])
     # generate a list of strings, which all differs from the original string
     # by one character, at a different position
     # only do that for the first 50 char, it's enough to get data
-    diffs = [s[:i] + "A" + s[i+1:] for i in range(min(50, n))]
+    diffs = [s[:i] + "A" + s[i+1:] for i in range(0, n, n // 10)]
     timed = [(i, n_timed_cmp(10000, s, d)) for (i, d) in enumerate(diffs)]
     sorted_timed = sorted(timed, key=lambda t: t[1])
and you'll get:
0 122.621000
1 213.465700
2 380.214100
3 460.422000
5 694.278700
4 722.010000
7 894.630300
6 1020.722100
9 1149.473000
8 1341.754500
---
0 122.621000
1 213.465700
Note that with your example, with only 2**8 characters, it's already noticeable; apply this patch:
@@ -21,7 +21,7 @@ def check_cmp_time():
     # generate a list of strings, which all differs from the original string
     # by one character, at a different position
     # only do that for the first 50 char, it's enough to get data
-    diffs = [s[:i] + "A" + s[i+1:] for i in range(min(50, n))]
+    diffs = [s[:i] + "A" + s[i+1:] for i in [0, n - 1]]
     timed = [(i, n_timed_cmp(10000, s, d)) for (i, d) in enumerate(diffs)]
     sorted_timed = sorted(timed, key=lambda t: t[1])
to only keep the two extreme cases (first letter change vs last letter change) and you'll get:
$ python3 cmp.py
0 124.131800
1 135.566000
Numbers may vary, but most of the time test 0 is a tad faster than test 1.
Isolating more precisely which character is modified is possible as long as memcmp compares character by character, that is, as long as it does not use integer comparisons; that's typically the case on the last characters if they get misaligned, or on really short strings, like an 8-character string, as I demo here:
from time import perf_counter_ns
from statistics import median
import random

def check_cmp_time():
    random.seed(123)
    # generate a random string of n characters
    n = 8
    s = "".join([chr(random.randint(ord("a"), ord("z"))) for _ in range(n)])
    # generate a list of strings, which all differ from the original string
    # by one character, at a different position
    diffs = [s[:i] + "A" + s[i + 1 :] for i in range(n)]

    values = {x: [] for x in range(n)}
    for _ in range(10_000_000):
        for i, diff in enumerate(diffs):
            start = perf_counter_ns()
            s == diff
            values[i].append(perf_counter_ns() - start)

    timed = [[k, median(v)] for k, v in values.items()]
    sorted_timed = sorted(timed, key=lambda t: t[1])

    # print the 10 fastest
    for x in sorted_timed[:10]:
        i, t = x
        print("{}\t{:3f}".format(i, t))

    print("---")
    i, t = timed[0]
    print("{}\t{:3f}".format(i, t))
    i, t = timed[1]
    print("{}\t{:3f}".format(i, t))

if __name__ == "__main__":
    check_cmp_time()
Which gives me:
1 221.000000
2 222.000000
3 223.000000
4 223.000000
5 223.000000
6 223.000000
7 223.000000
0 241.000000
The differences are so small, Python and perf_counter_ns may no longer be the right tools here.
See, to know why it doesn't short circuit, you'll have to do some digging. The simple answer is, of course, it doesn't short circuit because the standard doesn't specify so. But you might think, "Why wouldn't the implementations choose to short circuit? Surely it must be faster!" Not quite.
Let's take a look at CPython, for obvious reasons. Look at the code for the unicode_compare_eq function defined in unicodeobject.c:
static int
unicode_compare_eq(PyObject *str1, PyObject *str2)
{
    int kind;
    void *data1, *data2;
    Py_ssize_t len;
    int cmp;

    len = PyUnicode_GET_LENGTH(str1);
    if (PyUnicode_GET_LENGTH(str2) != len)
        return 0;
    kind = PyUnicode_KIND(str1);
    if (PyUnicode_KIND(str2) != kind)
        return 0;
    data1 = PyUnicode_DATA(str1);
    data2 = PyUnicode_DATA(str2);

    cmp = memcmp(data1, data2, len * kind);
    return (cmp == 0);
}
(Note: This function is actually called after deducing that str1 and str2 are not the same object - if they are - well that's just a simple True immediately)
Focus on this line specifically:
cmp = memcmp(data1, data2, len * kind);
Ahh, we're back at another crossroads. Does memcmp short circuit? The C standard does not specify such a requirement. As seen in the opengroup docs and also in Section 7.24.4.1 of the C Standard Draft:
7.24.4.1 The memcmp function
Synopsis
#include <string.h>
int memcmp(const void *s1, const void *s2, size_t n);
Description
The memcmp function compares the first n characters of the object pointed to by s1 to
the first n characters of the object pointed to by s2.
Returns
The memcmp function returns an integer greater than, equal to, or less than zero,
accordingly as the object pointed to by s1 is greater than, equal to, or less than the object pointed to by s2.
Some C implementations choose not to short circuit. But why? Are we missing something; why would you not short circuit?
Because the comparison they use might not be as naive as a byte-by-byte check. The standard does not require the objects to be compared byte by byte. Therein lies the chance for optimization.
What glibc does is compare elements of type unsigned long int instead of singular bytes represented by unsigned char. Check out the implementation.
There's a lot more going on under the hood - a discussion far outside the scope of this question; after all, this isn't even tagged as a C question ;). Though I found that this answer may be worth a look. But just know, the optimization is there, just in a much different form than the approach that may come to mind at first glance.
Edit: Fixed wrong function link
Edit: As @Konrad Rudolph has stated, glibc memcmp does apparently short circuit. I've been misinformed.
Is there a method that converts a string of text such as 'you' to a number other than
y = tuple('you')
for k in y:
    k = ord(k)
which only converts one character at a time?
In order to convert a string to a number (and the reverse), you should first always work with bytes. Since you are using Python 3, strings are actually Unicode strings and as such may contain characters that have an ord() value higher than 255. bytes, however, have just a single byte per character, so you should always convert between those two types first.
So basically, you are looking for a way to convert a bytes string (which is basically a list of bytes, a list of numbers 0–255) into a single number, and the inverse. You can use int.to_bytes and int.from_bytes for that:
import math
def convertToNumber(s):
    return int.from_bytes(s.encode(), 'little')

def convertFromNumber(n):
    return n.to_bytes(math.ceil(n.bit_length() / 8), 'little').decode()
>>> convertToNumber('foo bar baz')
147948829660780569073512294
>>> x = _
>>> convertFromNumber(x)
'foo bar baz'
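For what it's worth, here is a minimal sketch (my own) of the same round trip with big-endian byte order, where the first character lands in the most significant byte:

>>> n = int.from_bytes('abc'.encode(), 'big')
>>> hex(n)
'0x616263'
>>> n.to_bytes(3, 'big').decode()
'abc'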
Treat the string as a base-256 number.

from functools import reduce

# Reverse the digits to make reconstructing the string more efficient
digits = reversed(y.encode())
n = reduce(lambda x, y: x*256 + y, digits)

new_y = ""
while n > 0:
    n, b = divmod(n, 256)
    new_y += chr(b)

assert y == new_y
(Note this is essentially the same as poke's answer, but written explicitly rather than using available methods for converting between a byte string and an integer.)
You don't need to convert the string into a tuple.
k is overwritten in your loop. Collect the items using something like a list comprehension:
>>> text = 'you'
>>> [ord(ch) for ch in text]
[121, 111, 117]
To get the text back, use chr, and join the characters using str.join:
>>> numbers = [ord(ch) for ch in text]
>>> ''.join(chr(n) for n in numbers)
'you'
Though there are a number of ways to fulfill this task, I prefer the hashing way because it has the following nice properties:
it ensures that the number you get is highly random, effectively uniformly distributed
it ensures that even a small change in your input string will lead to a significant difference in the output integer
it is an irreversible process, i.e., you can't tell which string was the input based on the integer output
import hashlib
# there are a number of hashing functions you can pick, and they provide tags of different lengths and security levels.
hashing_func = hashlib.md5
# the lambda func does three things
# 1. hash a given string using the given algorithm
# 2. retrieve its hex hash tag
# 3. convert hex to integer
str2int = lambda s : int(hashing_func(s.encode()).hexdigest(), 16)
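As a quick sanity check of the lambda (relying on the well-known MD5 digest of the empty string):

>>> str2int('') == int('d41d8cd98f00b204e9800998ecf8427e', 16)
True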
To see how the resulting integers are uniformly distributed, we first need a random string generator:
import string
import numpy as np
# candidate characters
letters = string.ascii_letters
# total number of candidates
L = len(letters)
# control the seed or prng for reproducible results
prng = np.random.RandomState(1234)
# define the string prng of length 10
prng_string = lambda : "".join([letters[k] for k in prng.randint(0, L, size=(10))])
Now we generate a sufficient number of random strings and obtain the corresponding integers:
ss = [prng_string() for x in range(50000)]
vv = np.array([str2int(s) for s in ss])
Let us check the randomness by comparing the theoretical mean and standard deviation of a uniform distribution and those we observed.
for max_num in [256, 512, 1024, 4096]:
    ints = vv % max_num
    print("distribution comparisons for max_num = {:4d} \n\t[theoretical] {:7.2f} +/- {:8.3f} | [observed] {:7.2f} +/- {:8.3f}".format(
        max_num, max_num/2., np.sqrt(max_num**2/12), np.mean(ints), np.std(ints)))
Finally, you will see the results below, which indicate that the numbers you get are very uniform.
distribution comparisons for max_num = 256
    [theoretical] 128.00 +/- 73.901 | [observed] 127.21 +/- 73.755
distribution comparisons for max_num = 512
    [theoretical] 256.00 +/- 147.802 | [observed] 254.90 +/- 147.557
distribution comparisons for max_num = 1024
    [theoretical] 512.00 +/- 295.603 | [observed] 512.02 +/- 296.519
distribution comparisons for max_num = 4096
    [theoretical] 2048.00 +/- 1182.413 | [observed] 2048.67 +/- 1181.422
It is worth calling out that other posted answers may not attain these properties.
For example, @poke's convertToNumber solution will give:
distribution comparisons for max_num = 256
    [theoretical] 128.00 +/- 73.901 | [observed] 93.48 +/- 17.663
distribution comparisons for max_num = 512
    [theoretical] 256.00 +/- 147.802 | [observed] 220.71 +/- 129.261
distribution comparisons for max_num = 1024
    [theoretical] 512.00 +/- 295.603 | [observed] 477.67 +/- 277.651
distribution comparisons for max_num = 4096
    [theoretical] 2048.00 +/- 1182.413 | [observed] 1816.51 +/- 1059.643
I was trying to find a way to convert a numpy character array into a unique numeric array in order to do some other stuff. I have implemented the following functions, including the answers by @poke and @falsetrue (these methods were giving me some trouble when the strings were too large). I have also added the hash method (a hash is a fixed-sized integer that identifies a particular value).
import numpy as np

def str_to_num(x):
    """Converts a string into a unique concatenated UNICODE representation

    Args:
        x (string): input string

    Raises:
        ValueError: x must be a string
    """
    if isinstance(x, str):
        x = [str(ord(c)) for c in x]
        x = int(''.join(x))
    else:
        raise ValueError('x must be a string.')
    return x

def chr_to_num(x):
    return int.from_bytes(x.encode(), 'little')

def char_arr_to_num(arr, type='hash'):
    """Converts a character array into a unique numeric representation.

    Args:
        arr (np.array): numpy character array.
    """
    if type == 'unicode':
        vec_fun = np.vectorize(str_to_num)
    elif type == 'byte':
        vec_fun = np.vectorize(chr_to_num)
    elif type == 'hash':
        vec_fun = np.vectorize(hash)
    out = np.apply_along_axis(vec_fun, 0, arr)
    out = out.astype(float)
    return out

a = np.array([['x', 'y', 'w'], ['x', 'z', 'p'], ['y', 'z', 'w'], ['x', 'w', 'y'], ['w', 'z', 'q']])

char_arr_to_num(a, type='unicode')
char_arr_to_num(a, type='byte')
char_arr_to_num(a, type='hash')
How do I convert a hex string to a signed int in Python 3?
The best I can come up with is
import binascii

h = '9DA92DAB'
b = bytes(h, 'utf-8')
ba = binascii.a2b_hex(b)
print(int.from_bytes(ba, byteorder='big', signed=True))
Is there a simpler way? Unsigned is so much easier: int(h, 16)
BTW, the origin of the question is itunes persistent id - music library xml version and iTunes hex version
In n-bit two's complement, bits have value:

bit 0 = 2**0
bit 1 = 2**1
...
bit n-2 = 2**(n-2)
bit n-1 = -2**(n-1)

But bit n-1 has value 2**(n-1) when unsigned, so the number is 2**n too high. Subtract 2**n if bit n-1 is set:
def twos_complement(hexstr, bits):
    value = int(hexstr, 16)
    if value & (1 << (bits - 1)):
        value -= 1 << bits
    return value
print(twos_complement('FFFE', 16))
print(twos_complement('7FFF', 16))
print(twos_complement('7F', 8))
print(twos_complement('FF', 8))
Output:
-2
32767
127
-1
import struct
For Python 3 (with help from the comments):
h = '9DA92DAB'
struct.unpack('>i', bytes.fromhex(h))
For Python 2:
h = '9DA92DAB'
struct.unpack('>i', h.decode('hex'))
or if it is little endian:
h = '9DA92DAB'
struct.unpack('<i', h.decode('hex'))
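As a quick check of the big-endian Python 3 variant against the question's string (the value is my own calculation):

>>> import struct
>>> struct.unpack('>i', bytes.fromhex('9DA92DAB'))[0]
-1649857109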
Here's a general function you can use for hex of any size (it expects the '0x' prefix, as produced by hex()):
import math

# hex string to signed integer
def htosi(val):
    uintval = int(val, 16)
    bits = 4 * (len(val) - 2)
    if uintval >= math.pow(2, bits-1):
        uintval = int(0 - (math.pow(2, bits) - uintval))
    return uintval
And to use it:
h = str(hex(-5))
h2 = str(hex(-13589))
x = htosi(h)
x2 = htosi(h2)
This works for 16-bit signed ints; you can extend it for 32-bit ints. It uses the basic definition of two's complement signed numbers. Also note that XOR with all ones (0xffff here) flips every bit, i.e. it is binary negation.
# convert to unsigned
x = int('ffbf', 16) # example (-65)
# check sign bit
if (x & 0x8000) == 0x8000:
    # if set, invert and add one to get the negative value, then add the negative sign
    x = -((x ^ 0xffff) + 1)
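And a 32-bit extension of the same idea (my own, following the recipe above):

# convert to unsigned
x = int('ffffffbf', 16) # example (-65 as a 32-bit value)
# check sign bit
if (x & 0x80000000) == 0x80000000:
    x = -((x ^ 0xffffffff) + 1)
print(x) # -65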
It's a very late answer, but here's a function to do the above. This will extend for whatever length you provide. Credit for portions of this to another SO answer (I lost the link, so please provide it if you find it).
def hex_to_signed(source):
    """Convert a string hex value to a signed integer.

    This assumes that source is the proper length, and the sign bit
    is the first bit in the first byte of the correct length.

    hex_to_signed("F") should return -1.
    hex_to_signed("0F") should return 15.
    """
    if not isinstance(source, str):
        raise ValueError("string type required")
    if 0 == len(source):
        raise ValueError("string is empty")
    sign_bit_mask = 1 << (len(source)*4 - 1)
    other_bits_mask = sign_bit_mask - 1
    value = int(source, 16)
    return -(value & sign_bit_mask) | (value & other_bits_mask)
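A quick sanity check (mine) against the hex string from the question:

>>> hex_to_signed('9DA92DAB')
-1649857109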
I have this Python code to do this:
from struct import pack as _pack

def packl(lnum, pad=1):
    if lnum < 0:
        raise ValueError("Cannot use packl to convert a negative integer "
                         "to a string.")
    count = 0
    l = []
    while lnum > 0:
        l.append(lnum & 0xffffffffffffffffL)
        count += 1
        lnum >>= 64
    if count <= 0:
        return '\0' * pad
    elif pad >= 8:
        lens = 8 * count % pad
        pad = ((lens != 0) and (pad - lens)) or 0
        l.append('>' + 'x' * pad + 'Q' * count)
        l.reverse()
        return _pack(*l)
    else:
        l.append('>' + 'Q' * count)
        l.reverse()
        s = _pack(*l).lstrip('\0')
        lens = len(s)
        if (lens % pad) != 0:
            return '\0' * (pad - lens % pad) + s
        else:
            return s
This takes approximately 174 usec to convert 2**9700 - 1 to a string of bytes on my machine. If I'm willing to use the Python 2.7 and Python 3.x specific bit_length method, I can shorten that to 159 usecs by pre-allocating the l array to be the exact right size at the very beginning and using l[something] = syntax instead of l.append.
Is there anything I can do that will make this faster? This will be used to convert large prime numbers used in cryptography as well as some (but not many) smaller numbers.
Edit
This is currently the fastest option in Python < 3.2, it takes about half the time either direction as the accepted answer:
import binascii

def packl(lnum, padmultiple=1):
    """Packs the lnum (which must be convertible to a long) into a
    byte string 0 padded to a multiple of padmultiple bytes in size. 0
    means no padding whatsoever, so that packing 0 results in an empty
    string. The resulting byte string is the big-endian two's
    complement representation of the passed in long."""
    if lnum == 0:
        return b'\0' * padmultiple
    elif lnum < 0:
        raise ValueError("Can only convert non-negative numbers.")
    s = hex(lnum)[2:]
    s = s.rstrip('L')
    if len(s) & 1:
        s = '0' + s
    s = binascii.unhexlify(s)
    if (padmultiple != 1) and (padmultiple != 0):
        filled_so_far = len(s) % padmultiple
        if filled_so_far != 0:
            s = b'\0' * (padmultiple - filled_so_far) + s
    return s

def unpackl(bytestr):
    """Treats a byte string as a sequence of base 256 digits
    representing an unsigned integer in big-endian format and converts
    that representation into a Python integer."""
    return int(binascii.hexlify(bytestr), 16) if len(bytestr) > 0 else 0
In Python 3.2 the int class has to_bytes and from_bytes functions that can accomplish this much more quickly than the method given above.
Here is a solution calling the Python/C API via ctypes. Currently, it uses NumPy, but if NumPy is not an option, it could be done purely with ctypes.
import numpy
import ctypes

PyLong_AsByteArray = ctypes.pythonapi._PyLong_AsByteArray
PyLong_AsByteArray.argtypes = [ctypes.py_object,
                               numpy.ctypeslib.ndpointer(numpy.uint8),
                               ctypes.c_size_t,
                               ctypes.c_int,
                               ctypes.c_int]

def packl_ctypes_numpy(lnum):
    a = numpy.zeros(lnum.bit_length()//8 + 1, dtype=numpy.uint8)
    PyLong_AsByteArray(lnum, a, a.size, 0, 1)
    return a
On my machine, this is 15 times faster than your approach.
Edit: Here is the same code using ctypes only and returning a string instead of a NumPy array:
import ctypes

PyLong_AsByteArray = ctypes.pythonapi._PyLong_AsByteArray
PyLong_AsByteArray.argtypes = [ctypes.py_object,
                               ctypes.c_char_p,
                               ctypes.c_size_t,
                               ctypes.c_int,
                               ctypes.c_int]

def packl_ctypes(lnum):
    a = ctypes.create_string_buffer(lnum.bit_length()//8 + 1)
    PyLong_AsByteArray(lnum, a, len(a), 0, 1)
    return a.raw
This is another two times faster, totalling to a speed-up factor of 30 on my machine.
For completeness and for future readers of this question:
Starting in Python 3.2, there are functions int.from_bytes() and int.to_bytes() that perform the conversion between bytes and int objects in a choice of byte orders.
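A minimal sketch of those methods doing the same job as the packl/unpackl pair above (the padding logic is my own addition, mirroring packl's behaviour):

def packl_32(lnum, padmultiple=1):
    if lnum == 0:
        return b'\0' * padmultiple
    size = (lnum.bit_length() + 7) // 8
    if padmultiple > 1:
        size += -size % padmultiple
    return lnum.to_bytes(size, 'big')

def unpackl_32(bytestr):
    return int.from_bytes(bytestr, 'big')

assert unpackl_32(packl_32(2**9700 - 1)) == 2**9700 - 1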
I suppose you really should just be using numpy, which I'm sure has something or other built in for this. It might also be faster to hack around with the array module. But I'll take a stab at it anyway.
IMX, creating a generator and using a list comprehension and/or built-in summation is faster than a loop that appends to a list, because the appending can be done internally. Oh, and 'lstrip' on a large string has got to be costly.
Also, some style points: special cases aren't special enough; and you appear not to have gotten the memo about the new x if y else z construct. :) Although we don't need it anyway. ;)
from struct import pack as _pack

Q_size = 64
Q_bitmask = (1L << Q_size) - 1L

def quads_gen(a_long):
    while a_long:
        yield a_long & Q_bitmask
        a_long >>= Q_size

def pack_long_big_endian(a_long, pad=1):
    if a_long < 0:
        raise ValueError("Cannot use packl to convert a negative integer "
                         "to a string.")
    qs = list(quads_gen(a_long))[::-1]
    # Pack the first one separately so we can lstrip nicely.
    first = _pack('>Q', qs[0]).lstrip('\x00')
    rest = _pack('>%dQ' % (len(qs) - 1), *qs[1:])
    count = len(first) + len(rest)
    # A little math trick that depends on Python's behaviour of modulus
    # for negative numbers - but it's well-defined and documented
    return '\x00' * (-count % pad) + first + rest
Just wanted to post a follow-up to Sven's answer (which works great). The opposite operation - going from an arbitrarily long bytes object to a Python integer object - requires the following (because there is no PyLong_FromByteArray() C API function that I can find):
import binascii

def unpack_bytes(stringbytes):
    # binascii.hexlify will be obsolete in python3 soon
    # They will add a .tohex() method to bytes class
    # Issue 3532 bugs.python.org
    return int(binascii.hexlify(stringbytes), 16)
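And a quick round-trip check of my own, pairing it with the packl from the question's edit:

>>> unpack_bytes(b'\x01\x00')
256
>>> unpack_bytes(packl(2**64))
18446744073709551616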