Let's say, I have a string (Unicode if it matters) variable which is less than 100 bytes. I want to create another variable with exactly 100 byte in size which includes this string and is padded with zero or whatever. How would I do it in Python 3?
For assembling packets to go over the network, or for assembling byte-perfect binary files, I suggest using the struct module.
struct — Interpret bytes as packed binary data
Just for the string, you might not need struct, but as soon as you start also packing binary values, struct will make your life much easier.
Depending on your needs, you might be better off with an off-the-shelf network serialization library, such as Protocol Buffers; or you might even just use JSON for the wire format.
Protocol Buffer Basics: Python
PyMOTW - JavaScript Object Notation Serializer
Something like this should work:
st = "具有"
by = bytes(st, "utf-8")
by += b"0" * (100 - len(by))
print(by)
# b'\xe5\x85\xb7\xe6\x9c\x890000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000'
Obligatory addendum since your original post seems to conflate strings with the length of their encoded byte representation: Python unicode explanation
To pad with null bytes you can do it the way they do it in the stdlib base64 module.
some_data = b'foosdsfkl\x05'
null_padded = some_data + bytes(100 - len(some_data))
Here's a roundabout way of doing it:
>>> import sys
>>> a = "a"
>>> sys.getsizeof(a)
22
>>> a = "aa"
>>> sys.getsizeof(a)
23
>>> a = "aaa"
>>> sys.getsizeof(a)
24
So following this, an ASCII string of 100 bytes will need to be 79 characters long
>>> a = "".join(["a" for i in range(79)])
>>> len(a)
79
>>> sys.getsizeof(a)
100
This approach above is a fairly simple way of "calibrating" strings to figure out their lengths. You could automate a script to pad a string out to the appropriate memory size to account for other encodings.
def padder(strng):
TARGETSIZE = 100
padChar = "0"
curSize = sys.getsizeof(strng)
if curSize <= TARGETSIZE:
for i in range(TARGETSIZE - curSize):
strng = padChar + strng
return strng
else:
return strng # Not sure if you need to handle strings that start longer than your target, but you can do that here
Related
I have a long Hex string that represents a series of values of different types. I need to convert this Hex String into bytes or bytearray so that I can extract each value from the raw data. How can I do this?
For example, the string "ab" should convert to the bytes b"\xab" or equivalent byte array. Longer example:
>>> # what to use in place of `convert` here?
>>> convert("8e71c61de6a2321336184f813379ec6bf4a3fb79e63cd12b")
b'\x8eq\xc6\x1d\xe6\xa22\x136\x18O\x813y\xeck\xf4\xa3\xfby\xe6<\xd1+'
Suppose your hex string is something like
>>> hex_string = "deadbeef"
Convert it to a bytearray (Python 3 and 2.7):
>>> bytearray.fromhex(hex_string)
bytearray(b'\xde\xad\xbe\xef')
Convert it to a bytes object (Python 3):
>>> bytes.fromhex(hex_string)
b'\xde\xad\xbe\xef'
Note that bytes is an immutable version of bytearray.
Convert it to a string (Python ≤ 2.7):
>>> hex_data = hex_string.decode("hex")
>>> hex_data
"\xde\xad\xbe\xef"
There is a built-in function in bytearray that does what you intend.
bytearray.fromhex("de ad be ef 00")
It returns a bytearray and it reads hex strings with or without space separator.
provided I understood correctly, you should look for binascii.unhexlify
import binascii
a='45222e'
s=binascii.unhexlify(a)
b=[ord(x) for x in s]
Assuming you have a byte string like so
"\x12\x45\x00\xAB"
and you know the amount of bytes and their type you can also use this approach
import struct
bytes = '\x12\x45\x00\xAB'
val = struct.unpack('<BBH', bytes)
#val = (18, 69, 43776)
As I specified little endian (using the '<' char) at the start of the format string the function returned the decimal equivalent.
0x12 = 18
0x45 = 69
0xAB00 = 43776
B is equal to one byte (8 bit) unsigned
H is equal to two bytes (16 bit) unsigned
More available characters and byte sizes can be found here
The advantages are..
You can specify more than one byte and the endian of the values
Disadvantages..
You really need to know the type and length of data your dealing with
You can use the Codecs module in the Python Standard Library, i.e.
import codecs
codecs.decode(hexstring, 'hex_codec')
You should be able to build a string holding the binary data using something like:
data = "fef0babe"
bits = ""
for x in xrange(0, len(data), 2)
bits += chr(int(data[x:x+2], 16))
This is probably not the fastest way (many string appends), but quite simple using only core Python.
A good one liner is:
byte_list = map(ord, hex_string)
This will iterate over each char in the string and run it through the ord() function. Only tested on python 2.6, not too sure about 3.0+.
-Josh
I'm using the following code to pack an integer into an unsigned short as follows,
raw_data = 40
# Pack into little endian
data_packed = struct.pack('<H', raw_data)
Now I'm trying to unpack the result as follows. I use utf-16-le since the data is encoded as little-endian.
def get_bin_str(data):
bin_asc = binascii.hexlify(data)
result = bin(int(bin_asc.decode("utf-16-le"), 16))
trimmed_res = result[2:]
return trimmed_res
print(get_bin_str(data_packed))
Unfortunately, it throws the following error,
result = bin(int(bin_asc.decode("utf-16-le"), 16)) ValueError: invalid
literal for int() with base 16: '㠲〰'
How do I properly decode the bytes in little-endian to binary data properly?
Use unpack to reverse what you packed. The data isn't UTF-encoded so there is no reason to use UTF encodings.
>>> import struct
>>> data_packed = struct.pack('<H', 40)
>>> data_packed.hex() # the two little-endian bytes are 0x28 (40) and 0x00 (0)
2800
>>> data = struct.unpack('<H',data_packed)
>>> data
(40,)
unpack returns a tuple, so index it to get the single value
>>> data = struct.unpack('<H',data_packed)[0]
>>> data
40
To print in binary use string formatting. Either of these work work best. bin() doesn't let you specify the number of binary digits to display and the 0b needs to be removed if not desired.
>>> format(data,'016b')
'0000000000101000'
>>> f'{data:016b}'
'0000000000101000'
You have not said what you are trying to do, so let's assume your goal is to educate yourself. (If you are trying to pack data that will be passed to another program, the only reliable test is to check if the program reads your output correctly.)
Python does not have an "unsigned short" type, so the output of struct.pack() is a byte array. To see what's in it, just print it:
>>> data_packed = struct.pack('<H', 40)
>>> print(data_packed)
b'(\x00'
What's that? Well, the character (, which is decimal 40 in the ascii table, followed by a null byte. If you had used a number that does not map to a printable ascii character, you'd see something less surprising:
>>> struct.pack("<H", 11)
b'\x0b\x00'
Where 0b is 11 in hex, of course. Wait, I specified "little-endian", so why is my number on the left? The answer is, it's not. Python prints the byte string left to right because that's how English is written, but that's irrelevant. If it helps, think of strings as growing upwards: From low memory locations to high memory. The least significant byte comes first, which makes this little-endian.
Anyway, you can also look at the bytes directly:
>>> print(data_packed[0])
40
Yup, it's still there. But what about the bits, you say? For this, use bin() on each of the bytes separately:
>>> bin(data_packed[0])
'0b101000'
>>> bin(data_packed[1])
'0b0'
The two high bits you see are worth 32 and 8. Your number was less than 256, so it fits entirely in the low byte of the short you constructed.
What's wrong with your unpacking code?
Just for fun let's see what your sequence of transformations in get_bin_str was doing.
>>> binascii.hexlify(data_packed)
b'2800'
Um, all right. Not sure why you converted to hex digits, but now you have 4 bytes, not two. (28 is the number 40 written in hex, the 00 is for the null byte.) In the next step, you call decode and tell it that these 4 bytes are actually UTF-16; there's just enough for two unicode characters, let's take a look:
>>> b'2800'.decode("utf-16-le")
'㠲〰'
In the next step Python finally notices that something is wrong, but by then it does not make much difference because you are pretty far away from the number 40 you started with.
To correctly read your data as a UTF-16 character, call decode directly on the byte string you packed.
>>> data_packed.decode("utf-16-le")
'('
>>> ord('(')
40
I am working with python sockets and although I have read everything I can find on stackoverflow, regarding the socket.recv byte stream, I still have a few questions I hope some of you can answer.
So I have this line:
chunk = s.recv(1024)
I receive the following packet: '\x12\x03\x10\x00\x00\x00\x00\x23\x45\x34\x56'
I need to get bytes at index 1 & 2 ('\x03\x10') and convert the 2-byte data to an integer .. which should be 784.
What is the best way to handle this?
In addition, what is the best way to remove a portion of the chunk data and pass it into a function? I.E. I need to remove bytes at index 7 - 9 and pass that string of 3 bytes into a function.
Getting substrings out of a larger string is simple; for example, given:
>>> thestring = 'this is a test'
If I want bytes 1 and 2, and I can ask for:
>>> thestring[1:3]
'hi'
In your example, once you have the two bytes of interest:
>>> chunk = '\x12\x03\x10\x00\x00\x00\x00\x23\x45\x34\x56'
>>> data = chunk[1:3]
You can use the struct module to unpack those into a single integer:
>>> import struct
>>> struct.unpack('H', data)
(4099,)
That assumes native byte order. You can explicitly request big endian like this:
>>> struct.unpack('>H', data)
(784,)
Read the struct documentation for more information. Note that H (used here) means "unsigned short", while h (used in another answer) means "signed short", and which one is appropriate depends on your application.
Use the struct module:
import struct
pkt = b'\x12\x03\x10\x00\x00\x00\x00\x23\x45\x34\x56'
i = struct.unpack_from('>h', pkt, 1)[0]
>>> print i
784
The format string '>h' means that the data has big endian byte order (>) and to treat the data as a short 2-byte integer (h). There is also an unsigned variety denoted by H.
To grab a substring of the data use slicing:
>>> idx = 7
>>> length = 3
>>> pkt[idx:idx+length]
b'#E4'
and pass it to a function:
>>> func(pkt[idx:idx+length])
I have a long Hex string that represents a series of values of different types. I need to convert this Hex String into bytes or bytearray so that I can extract each value from the raw data. How can I do this?
For example, the string "ab" should convert to the bytes b"\xab" or equivalent byte array. Longer example:
>>> # what to use in place of `convert` here?
>>> convert("8e71c61de6a2321336184f813379ec6bf4a3fb79e63cd12b")
b'\x8eq\xc6\x1d\xe6\xa22\x136\x18O\x813y\xeck\xf4\xa3\xfby\xe6<\xd1+'
Suppose your hex string is something like
>>> hex_string = "deadbeef"
Convert it to a bytearray (Python 3 and 2.7):
>>> bytearray.fromhex(hex_string)
bytearray(b'\xde\xad\xbe\xef')
Convert it to a bytes object (Python 3):
>>> bytes.fromhex(hex_string)
b'\xde\xad\xbe\xef'
Note that bytes is an immutable version of bytearray.
Convert it to a string (Python ≤ 2.7):
>>> hex_data = hex_string.decode("hex")
>>> hex_data
"\xde\xad\xbe\xef"
There is a built-in function in bytearray that does what you intend.
bytearray.fromhex("de ad be ef 00")
It returns a bytearray and it reads hex strings with or without space separator.
provided I understood correctly, you should look for binascii.unhexlify
import binascii
a='45222e'
s=binascii.unhexlify(a)
b=[ord(x) for x in s]
Assuming you have a byte string like so
"\x12\x45\x00\xAB"
and you know the amount of bytes and their type you can also use this approach
import struct
bytes = '\x12\x45\x00\xAB'
val = struct.unpack('<BBH', bytes)
#val = (18, 69, 43776)
As I specified little endian (using the '<' char) at the start of the format string the function returned the decimal equivalent.
0x12 = 18
0x45 = 69
0xAB00 = 43776
B is equal to one byte (8 bit) unsigned
H is equal to two bytes (16 bit) unsigned
More available characters and byte sizes can be found here
The advantages are..
You can specify more than one byte and the endian of the values
Disadvantages..
You really need to know the type and length of data your dealing with
You can use the Codecs module in the Python Standard Library, i.e.
import codecs
codecs.decode(hexstring, 'hex_codec')
You should be able to build a string holding the binary data using something like:
data = "fef0babe"
bits = ""
for x in xrange(0, len(data), 2)
bits += chr(int(data[x:x+2], 16))
This is probably not the fastest way (many string appends), but quite simple using only core Python.
A good one liner is:
byte_list = map(ord, hex_string)
This will iterate over each char in the string and run it through the ord() function. Only tested on python 2.6, not too sure about 3.0+.
-Josh
I'm trying to create a hex representation of some data that needs to be transmitted (specifically, in ASN.1 notation). At some points, I need to convert data to its hex representation. Since the data is transmitted as a byte sequence, the hex representation has to be padded with a 0 if the length is odd.
Example:
>>> hex2(3)
'03'
>>> hex2(45)
'2d'
>>> hex2(678)
'02a6'
The goal is to find a simple, elegant implementation for hex2.
Currently I'm using hex, stripping out the first two characters, then padding the string with a 0 if its length is odd. However, I'd like to find a better solution for future reference. I've looked in str.format without finding anything that pads to a multiple.
def hex2(n):
x = '%x' % (n,)
return ('0' * (len(x) % 2)) + x
To be totally honest, I am not sure what the issue is. A straightforward implementation of what you describe goes like this:
def hex2(v):
s = hex(v)[2:]
return s if len(s) % 2 == 0 else '0' + s
I would not necessarily call this "elegant" but I would certainly call it "simple."
Python's binascii module's b2a_hex is guaranteed to return an even-length string.
the trick then is to convert the integer into a bytestring. Python3.2 and higher has that built-in to int:
from binascii import b2a_hex
def hex2(integer):
return b2a_hex(integer.to_bytes((integer.bit_length() + 7) // 8, 'big'))
Might want to look at the struct module, which is designed for byte-oriented i/o.
import struct
>>> struct.pack('>i',678)
'\x00\x00\x02\xa6'
#Use h instead of i for shorts
>>> struct.pack('>h',1043)
'\x04\x13'