Convert arbitrary int to bytes [duplicate]


This question already has answers here:
Why does "bytes(n)" create a length n byte string instead of converting n to a binary representation?
(16 answers)
Closed 2 years ago.
How can I use struct to convert an arbitrary int (unsigned, signed, long long,... just int in Python) to a byte sequence (bytes)?
If I read this correctly, I would need to decide on a format string depending on the sign or length of the int, but in Python I don't really have this distinction. I'm trying to convert an arbitrary int to bytes and re-create the int again from the sequence of bytes (bytes).
Here are some attempts which failed:
# int is too big
>>> struct.unpack('>i', struct.pack('>i', -12343243543543534))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
struct.error: 'i' format requires -2147483648 <= number <= 2147483647
>>> -12343243543543534
-12343243543543534
>>> struct.unpack('>q', struct.pack('>q', -12343243543543534))
(-12343243543543534,)
# again, integer value is too big, but can be represented as integer (below)
>>> struct.unpack('>q', struct.pack('>q', -1234324354354353432432424))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
struct.error: int too large to convert
>>> -1234324354354353432432424
-1234324354354353432432424
Alternatively, I could also use the largest "container" to convert to bytes and "up cast" when turning the bytes back into an integer, but then I would need to know which format string is safe (= largest) to use.
bytes approach
bytes(int) seems to have the same problem and requires to know about the sign:
>>> bytes(i)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: negative count
>>> i
-123432432432232
int.to_bytes / int.from_bytes
With a sufficiently large number of bytes, I can store "any" integer value, but it is still required to know about the sign.
>>> int(-1234324354354353432432424).to_bytes(64, 'little')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
OverflowError: can't convert negative int to unsigned
>>> int(-1234324354354353432432424).to_bytes(64, 'little', signed=True)
b'\xd9\xf0;\xd5l5\x86$\x9f\xfa\xfe\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff'
>>> int.from_bytes(int(-1234324354354353432432424).to_bytes(64, 'little', signed=True))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: from_bytes() missing required argument 'byteorder' (pos 2)
# no sign parameter
>>> int.from_bytes(int(-1234324354354353432432424 + 1).to_bytes(64, 'little', signed=True), 'little')
13407807929942597099574024998205846127479365820592393377723561443721764030073546976801874298166903427690031858186486050853753882810712245592079295573651673
# actual number
>>> int.from_bytes(int(-1234324354354353432432424 + 1).to_bytes(64, 'little', signed=True), 'little', signed=True)
-1234324354354353432432423

bytes(int) seems to have the same problem and requires to know about the sign
No; this is not for the kind of conversion you have in mind at all. Observe:
>>> bytes(3) # 3 is the *length* of the result
b'\x00\x00\x00'
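For contrast, here is a minimal sketch of calls that do put the value 3 into a byte. The variable names are only illustrative.

```python
# bytes(n) with an int argument allocates n zero bytes.
zeros = bytes(3)

# An iterable of ints is taken as the byte values themselves.
from_list = bytes([3])

# int.to_bytes converts the value, given an explicit length and byte order.
from_int = (3).to_bytes(1, 'big')

print(zeros, from_list, from_int)
```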
With a sufficiently large number of bytes, I can store "any" integer value, but it is still required to know about the sign.
This is not a limitation you could possibly avoid even in theory. The stored byte data is just data; whether it represents a signed or unsigned value is an interpretation of that data that must be imposed upon it. There is no way in principle that you could take in b'\x80' and just know whether it should represent 128 or -128, just by looking at it; and it does not matter whether that value came from using the struct module, int.to_bytes or anything else. No tool you use can make the decision for you, because you are not giving it any information with which the decision could be made.
(Note that your use of the struct module does encode assumptions about signedness. For example, q denotes a signed 64-bit type; for the corresponding unsigned type, you would use Q.)
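As a small illustration of that point, the same single byte can be decoded both ways, and only the caller's choice differs. The variable names below are just for the example.

```python
import struct

raw = b'\x80'

# The byte itself carries no sign; the caller chooses the interpretation.
as_unsigned = int.from_bytes(raw, 'big', signed=False)
as_signed = int.from_bytes(raw, 'big', signed=True)

# struct encodes the same choice in the format character ('B' vs 'b', 'Q' vs 'q').
(struct_unsigned,) = struct.unpack('>B', raw)
(struct_signed,) = struct.unpack('>b', raw)

print(as_unsigned, as_signed, struct_unsigned, struct_signed)
```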
In short, what you are asking for does not make sense.
You can, of course, work around the limitation by either a) adopting a convention (you know whether to interpret the value as signed or unsigned because of the context in which you are interpreting the bytes), or b) explicitly adding that information (storing a byte that encodes the signedness of the value - but this hardly ever is done in the real world). That is to say: you can give your decoding mechanism the information to make the decision, or you can make the decision yourself.
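A minimal sketch of option (a), assuming the convention "always signed, big-endian" is acceptable for your data. The helper names int_to_bytes and int_from_bytes are hypothetical, not part of any library.

```python
def int_to_bytes(n):
    """Encode any int as signed big-endian bytes of sufficient length."""
    # bit_length() excludes the sign, so reserve one extra bit and
    # round up to whole bytes: ceil((bits + 1) / 8) == (bits + 8) // 8.
    length = (n.bit_length() + 8) // 8
    return n.to_bytes(length, 'big', signed=True)


def int_from_bytes(data):
    """Decode bytes produced by int_to_bytes (always read as signed)."""
    return int.from_bytes(data, 'big', signed=True)


samples = [0, 127, 128, -128, -129, 2**64, -1234324354354353432432424]
round_trip = [int_from_bytes(int_to_bytes(n)) for n in samples]
print(round_trip == samples)
```

Because both sides agree that the data is signed, no sign flag has to be stored. The length still has to be known from somewhere, though.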
You can't just write this int into a file and read it from that file without also storing the sign (you need to specify a signed parameter in order to reassemble this integer correctly). I was hoping to "dump" the memory content of this integer into a file (as bytes) and read it again without worrying about size or sign.
You are, presumably, storing multiple values in the same file. That means that the foregoing is not a real limitation anyway - because you already have to make decisions about the size of the values you read from the file; where one ends and the next begins. Which is to say, in your example:
int.from_bytes(int(-1234324354354353432432424 + 1).to_bytes(64, 'little', signed=True), 'little', signed=True)
you tell yourself that you don't need to know the size for the .from_bytes call, but in a real application you still would - because you would be taking a slice of the file data, not the whole thing, and you would need to know the bounds of the slice.
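One way to make the slice bounds explicit is to prefix each value with its length. The sketch below is an assumption about how such a file could be laid out, not something from the question: it uses a 4-byte big-endian length prefix, and write_int and read_int are hypothetical helper names.

```python
import io
import struct


def write_int(stream, n):
    """Write n as a 4-byte big-endian length followed by a signed payload."""
    payload = n.to_bytes((n.bit_length() + 8) // 8, 'big', signed=True)
    stream.write(struct.pack('>I', len(payload)))
    stream.write(payload)


def read_int(stream):
    """Read one integer previously written by write_int."""
    (length,) = struct.unpack('>I', stream.read(4))
    return int.from_bytes(stream.read(length), 'big', signed=True)


# io.BytesIO stands in for a file opened in binary mode.
buffer = io.BytesIO()
values = [-1234324354354353432432424, 0, 255, -1]
for value in values:
    write_int(buffer, value)

buffer.seek(0)
decoded = [read_int(buffer) for _ in values]
print(decoded == values)
```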

Related

str.isdigit() behaviour when handling strings

Assuming the following:
>>> square = '²' # Superscript Two (Unicode U+00B2)
>>> cube = '³' # Superscript Three (Unicode U+00B3)
Curiously:
>>> square.isdigit()
True
>>> cube.isdigit()
True
OK, let's convert those "digits" to integer:
>>> int(square)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: invalid literal for int() with base 10: '²'
>>> int(cube)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: invalid literal for int() with base 10: '³'
Oooops!
Could someone please explain what behavior I should expect from the str.isdigit() method when handling strings?

str.isdigit doesn't claim to be related to parsability as an int. It only reports a Unicode property, namely whether every character is a decimal character or a digit of some sort. From the documentation:
str.isdigit()
Return True if all characters in the string are digits and there is at least one character, False otherwise. Digits include decimal characters and digits that need special handling, such as the compatibility superscript digits. This covers digits which cannot be used to form numbers in base 10, like the Kharosthi numbers. Formally, a digit is a character that has the property value Numeric_Type=Digit or Numeric_Type=Decimal.
In short, str.isdigit is thoroughly useless for detecting valid numbers. The correct solution to checking if a given string is a legal integer is to call int on it, and catch the ValueError if it's not a legal integer. Anything else you do will be (badly) reinventing the same tests the actual parsing code in int() performs, so why not let it do the work in the first place?
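The "call int and catch ValueError" approach can be sketched like this. The function name is_int_literal is hypothetical.

```python
def is_int_literal(text):
    """Return True if int(text) succeeds for the given string (base 10)."""
    try:
        int(text)
    except ValueError:
        return False
    return True


candidates = ['42', '-7', '²', '³', '', '4.2']
results = {text: is_int_literal(text) for text in candidates}
print(results)
```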
Side-note: You're using the term "utf-8" incorrectly. UTF-8 is a specific way of encoding Unicode, and only applies to raw binary data. Python's str is an "idealized" Unicode text type; it has no encoding (under the hood, it's stored encoded as one of ASCII, latin-1, UCS-2, UCS-4, and possibly also UTF-8, but none of that is visible at the Python layer outside of indirect measurements like sys.getsizeof, which only hints at the underlying encoding by letting you see how much memory the string consumes). The characters you're talking about are simple Unicode characters above the ASCII range, they're not specifically UTF-8.
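A short sketch of that distinction: the str holds a code point, and an encoding only appears once the text is converted to bytes. The variable names are illustrative.

```python
square = '²'  # one Unicode code point, U+00B2

# A str has a length in code points and no encoding of its own.
code_point = ord(square)
length_in_chars = len(square)

# Encodings only come into play when the text is turned into bytes.
utf8_bytes = square.encode('utf-8')      # two bytes in UTF-8
latin1_bytes = square.encode('latin-1')  # one byte in Latin-1

print(hex(code_point), length_in_chars, utf8_bytes, latin1_bytes)
```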

Bytearray conversion, integer is required error on python3

Python 3 asks for an integer when I try to add the byte 0x00 to a bytearray:
>>> command = bytearray()
>>> command.extend(chr(0x00))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: an integer is required

Bytearrays consist of either bytes (b'\x00') or byte-sized ints (0x00). The result of chr(0x00) is a unicode string, however.
You can feed bytearray.extend with either a) a bytes string or b) an iterable of byte-sized integers. Both of these represent "sequence of bytes", which a bytearray is. Also, both can be used with hex notation.
command.extend(b'\x00')
command.extend([0x00])
In case you want to add a single integer, you can also use bytearray.append:
command.append(0x00)
Since a string is an iterable, bytearray.extend tries to append its elements. These are also strings, however. Hence, the error that an integer was expected.
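To see the difference side by side, the following sketch catches the failure and then builds the same bytearray in a few equivalent ways. The variable names are illustrative, and the extra byte values are made up for the example.

```python
command = bytearray()

# Iterating a str yields one-character str objects, which extend rejects.
try:
    command.extend(chr(0x00))
    str_accepted = True
except TypeError:
    str_accepted = False

# bytes([value]) is the Python 3 way to get a single byte from an int.
command.extend(bytes([0x00]))

# Several bytes can also be written as hex text.
command.extend(bytearray.fromhex('ff 10'))

print(str_accepted, command)
```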

How do I encode hexadecimal to base64 in python?

If I try to do:
from base64 import b64encode
b64encode('ffffff')
I get this error:
Traceback (most recent call last):
File "<pyshell#13>", line 1, in <module>
base64.b64encode('ffffff')
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/base64.py", line 58, in b64encode
encoded = binascii.b2a_base64(s, newline=False)
TypeError: a bytes-like object is required, not 'str'
Because it said bytes-like object I then tried this:
b64encode(bytes('ffffff'))
Which failed.
Traceback (most recent call last):
File "<pyshell#10>", line 1, in <module>
b64encode(bytes('ffffff'))
TypeError: string argument without an encoding
Finally, using the .encode('utf-8') function:
b64encode('ffffff'.encode('utf-8'))
This gives the incorrect output b'ZmZmZmZm'; the correct base64 encoding is ////.
I already know how to decode b64 to hex so don't say how to do that.
Edit: This question got flagged for being the same as converting hex strings to hex bytes. This involves base64.

To fully go from the string ffffff to base64 of the hex value, you need to run it through some encoding and decoding, using the codecs module:
import codecs
# Decode the hex string into raw bytes: b'\xff\xff\xff'
raw = codecs.decode('ffffff', 'hex')
# Encode the raw bytes as base64 (returns bytes with a trailing newline): b'////\n'
codecs.encode(raw, 'base64')
For an odd-length string like 0xfffff you need to put a zero at the beginning of the hex string (0x0fffff), otherwise python will give you an error.
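The same conversion can also be sketched with bytes.fromhex and base64.b64encode, including the zero padding for odd-length input described above. The helper name hex_to_base64 is hypothetical.

```python
import base64


def hex_to_base64(hex_string):
    """Base64-encode the bytes described by a hex string."""
    # Left-pad odd-length input with a zero so every byte has two hex digits.
    if len(hex_string) % 2:
        hex_string = '0' + hex_string
    return base64.b64encode(bytes.fromhex(hex_string))


even = hex_to_base64('ffffff')
odd = hex_to_base64('fffff')
print(even, odd)
```

Unlike the base64 codec, base64.b64encode does not append a newline to the result.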

Here's an alternative to using codecs.
This one is a bit less readable, but it works and hopefully teaches you how codecs, hex and integers work. (A word of caution: it runs on odd-length strings, but zip silently drops the final unpaired character.)
import struct
s = 'ffffff'
b''.join([struct.pack('B', int(''.join(x), 16)) for x in zip(s[0::2], s[1::2])])
Which should give you b'\xff\xff\xff'.
Your main problem is probably that you think 'ffffff' represents the values 255, 255, 255. It doesn't. It is still a string made up of the letter f. You therefore need to parse the string representation of hex into actual byte values. We can do this by first passing the string through int(), which can interpret hex in string form.
You will need to convert each pair of characters individually by doing int('ff', 16), which tells Python to interpret the string as a base-16 integer (a hex number).
And then convert that integer into a bytes-like object representing that integer. That's where struct.pack comes in. It's meant for exactly this.
struct.pack('B', 255) # 255 is given to us by int('ff', 16)
Essentially, 'B' tells Python to pack the value 255 into a 1-byte-object, in this case, that gives us b'\xff' which is your end goal. Now, do this for every 2-pair of letters in your original data.
This is more of a manual approach where you'll iterate over 2 characters in the string at a time, and use the above description to bundle them into what you expect them to be. Or just use codecs, either way works.
Expanded version of the above oneliner:
import struct
hex_string = 'ffffff'
result = b''
for pair in zip(hex_string[0::2], hex_string[1::2]):
    value = int(''.join(pair), 16)
    result += struct.pack('B', value)
At the very least, I hope this explains how hex works on a practical level, and how the computer interprets our human-readable version of bits and bytes.

base64.encodestring failing in python 3

The following piece of code runs successfully on a python 2 machine:
base64_str = base64.encodestring('%s:%s' % (username,password)).replace('\n', '')
I am trying to port it over to Python 3 but when I do so I encounter the following error:
>>> a = base64.encodestring('{0}:{1}'.format(username,password)).replace('\n','')
Traceback (most recent call last):
File "/auto/pysw/cel55/python/3.4.1/lib/python3.4/base64.py", line 519, in _input_type_check
m = memoryview(s)
TypeError: memoryview: str object does not have the buffer interface
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/auto/pysw/cel55/python/3.4.1/lib/python3.4/base64.py", line 548, in encodestring
return encodebytes(s)
File "/auto/pysw/cel55/python/3.4.1/lib/python3.4/base64.py", line 536, in encodebytes
_input_type_check(s)
File "/auto/pysw/cel55/python/3.4.1/lib/python3.4/base64.py", line 522, in _input_type_check
raise TypeError(msg) from err
TypeError: expected bytes-like object, not str
I tried searching for examples of encodestring usage but was not able to find good documentation. Am I missing something obvious? I am running this on RHEL 2.6.18-371.11.1.el5

You can encode() the string (to convert it to a byte string) before passing it to base64.encodestring. Example:
base64_str = base64.encodestring(('%s:%s' % (username,password)).encode()).decode().strip()

To expand on Anand's answer (which is quite correct), Python 2 made little distinction between "Here's a string which I want to treat like text" and "Here's a string which I want to treat like a sequence of 8-bit byte values". Python 3 firmly distinguishes the two, and doesn't let you mix them up: the former is the str type, and the latter is the bytes type.
When you Base64 encode a string, you're not actually treating the string as text, you're treating it as a series of 8-bit byte values. That's why you're getting an error from base64.encodestring() in Python 3: because that is an operation that deals with the string's characters as 8-bit bytes, and so you should pass it a parameter of type bytes rather than a parameter of type str.
Therefore, to convert your str object into a bytes object, you have to call its encode() method to turn it into a set of 8-bit byte values, in whatever Unicode encoding you have chosen to use. (Which should be UTF-8 unless you have a very specific reason to choose something else, but that's another topic).
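Put together, the str to bytes to Base64 to str pipeline might look like the sketch below. It uses base64.b64encode rather than encodestring, which was removed in Python 3.9. The function name basic_auth_token and the sample credentials are made up for illustration.

```python
import base64


def basic_auth_token(username, password):
    """Return the Base64 text for 'username:password'."""
    text = '{0}:{1}'.format(username, password)
    raw = text.encode('utf-8')        # str -> bytes
    encoded = base64.b64encode(raw)   # bytes -> Base64 bytes, no line breaks
    return encoded.decode('ascii')    # Base64 bytes -> str


token = basic_auth_token('user', 'pass')
print(token)
```

Since b64encode does not insert newlines, the replace('\n', '') step from the Python 2 code is not needed here.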

In Python 3, the source of encodestring says:
def encodestring(s):
    """Legacy alias of encodebytes()."""
    import warnings
    warnings.warn("encodestring() is a deprecated alias, use encodebytes()", DeprecationWarning, 2)
    return encodebytes(s)
Here is working code for Python 3.5.1, it also shows how to url encode:
import base64
import urllib.parse

def _encodeBase64(consumer_key, consumer_secret):
    """
    :type consumer_key: str
    :type consumer_secret: str
    :rtype: str
    """
    # 1. URL encode the consumer key and the consumer secret according to RFC 1738.
    dummy_param_name = 'bla'
    key_url_encoded = urllib.parse.urlencode({dummy_param_name: consumer_key})[len(dummy_param_name) + 1:]
    secret_url_encoded = urllib.parse.urlencode({dummy_param_name: consumer_secret})[len(dummy_param_name) + 1:]
    # 2. Concatenate the encoded consumer key, a colon character “:”, and the encoded consumer secret into a single string.
    credentials = '{}:{}'.format(key_url_encoded, secret_url_encoded)
    # 3. Base64 encode the string from the previous step.
    bytes_base64_encoded_credentials = base64.encodebytes(credentials.encode('utf-8'))
    return bytes_base64_encoded_credentials.decode('utf-8').replace('\n', '')
(I am sure it could be more concise, I am new to Python...)
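A possibly more concise variant, offered as a sketch rather than a drop-in replacement: urllib.parse.quote_plus is what urlencode applies to each value by default, so the dummy parameter and slicing can be skipped, and base64.b64encode avoids the newline cleanup. The function name encode_credentials is hypothetical.

```python
import base64
import urllib.parse


def encode_credentials(consumer_key, consumer_secret):
    """URL-encode both parts, join them with ':', and Base64 encode."""
    key = urllib.parse.quote_plus(consumer_key)
    secret = urllib.parse.quote_plus(consumer_secret)
    credentials = '{}:{}'.format(key, secret)
    return base64.b64encode(credentials.encode('utf-8')).decode('ascii')


token = encode_credentials('key', 'secret')
print(token)
```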
Also see: http://pythoncentral.io/encoding-and-decoding-strings-in-python-3-x/

COM call : OverflowError: Python int too large to convert to C long

I'm trying to call a COM Method using win32com Python (v3.3) library. This method accepts an address and a data to write to that address.
SWDIOW(IN addr, IN data)
Problem is: accepted size for data is 32bits. When I call this method with such parameters there is no problem:
swdiow(0x400c0000, 0x00000012)
But when data is too big (actually greater than 2^31 - 1), such as:
swdiow(0x400c0000, 0xF3804770)
This happens:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\WinPython\python-3.3.2.amd64\lib\site-packages\win32com\gen_py\97FF1B53-6477-4BD1-8A29-ADDB58E862E5x0x16x0.py", line 1123, in swdiow
, data)
OverflowError: Python int too large to convert to C long
I think this problem happens because python takes 0xF3804770 as a signed integer. This value requires 32 bits to be represented as unsigned. To make it signed, Python adds one more sign bit, and size of 0xF3804770 becomes 33 bits. Because of that sign bit, it cannot convert to C long type which I believe the size is 32 bits.
I've found a workaround to this that will convert number to a negative integer when it's too big.
>>> int.from_bytes((0xF3804770).to_bytes(4,'big',signed=False),'big', signed=True)
-209696912
Maybe this is the best thing I could do, but I wonder if there is a more elegant solution to this?
By the way, the source of my 'data' is a text file that contains 4-byte hex values such as these:
D0010880
D1FD1E40
F3EF4770
431103C2
...
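For reference, the same reinterpretation as the to_bytes/from_bytes workaround can be sketched with plain arithmetic, applied directly to hex text like the lines above. The helper name to_signed32 is hypothetical, and whether the COM method accepts the resulting negative value is assumed from the question's workaround, not verified here.

```python
def to_signed32(value):
    """Reinterpret an unsigned 32-bit value as a signed 32-bit integer."""
    value &= 0xFFFFFFFF
    # If bit 31 is set, the two's-complement value is value - 2**32.
    return value - 0x100000000 if value & 0x80000000 else value


hex_lines = ['D0010880', 'D1FD1E40', 'F3EF4770', '431103C2']
signed_values = [to_signed32(int(line, 16)) for line in hex_lines]
example = to_signed32(0xF3804770)
print(example, signed_values)
```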
