I have a special packet in string format, with a 32-byte header and a body containing one or more entries of 90 bytes each.
I want to process this string using Python. Can I read it the way I would read a socket: read the first 32 bytes as the header, take them off the string, and then continue by reading the 90 bytes of the first entry?
Something like:
str.read(32) # => "\x01\x02..."
str.read(90) # => "\x02\x05..."
In Python 2, you can use StringIO to read a string like a file:
>>> import StringIO
>>> s = 'Hello, World!'
>>> sio = StringIO.StringIO(s)
>>> sio.read(6)
'Hello,'
>>> sio.read()
' World!'
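On Python 3 the counterpart for binary data is io.BytesIO. A minimal sketch, assuming the packet is already a bytes object; the helper name read_packet and the dummy packet contents are made up for illustration:

```python
import io

HEADER_SIZE = 32  # sizes taken from the question
ENTRY_SIZE = 90

def read_packet(packet):
    """Return (header, entries) read sequentially from a bytes packet."""
    stream = io.BytesIO(packet)
    header = stream.read(HEADER_SIZE)
    entries = []
    while True:
        entry = stream.read(ENTRY_SIZE)
        if not entry:
            break  # no more data in the stream
        entries.append(entry)
    return header, entries

# Dummy packet: a 32-byte header followed by two 90-byte entries.
packet = bytes(32) + b"\x01" * 90 + b"\x02" * 90
header, entries = read_packet(packet)
print(len(header), [len(e) for e in entries])
```

Each read() advances the stream position, so nothing has to be sliced off the original data by hand.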
I would also suggest you take a look at the struct module for help with parsing binary data:
>>> from struct import *
>>> pack('>hhl', 1, 2, 3)
'\x00\x01\x00\x02\x00\x00\x00\x03'
>>> unpack('>hhl', '\x00\x01\x00\x02\x00\x00\x00\x03')
(1, 2, 3)
You define the layout of the data using format strings, so '>hhl' in the above example is big-endian short (2 bytes), short (2 bytes), long (4 bytes). The leading '>' specifies the byte order (endianness); without a prefix, struct uses the machine's native byte order, sizes and alignment.
For example, if your header format was uint, 4-byte str, uint, uint, ushort, ulong (using native sizes and alignment, here on a platform where L is 8 bytes):
>>> import struct
>>> data = ''.join(chr(i) for i in range(128)) * 10
>>> hdr_fmt = 'I4sIIHL'
>>> struct.calcsize(hdr_fmt)
32
>>> struct.unpack_from(hdr_fmt, data, 0)
(50462976, '\x04\x05\x06\x07', 185207048, 252579084, 4368, 2242261671028070680)
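The size of 32 above comes from native alignment and an 8-byte native L, so it can differ between platforms. For data that comes off the wire, a byte-order prefix makes the layout explicit. A sketch assuming a hypothetical little-endian header; the field order and the 6 pad bytes are invented for illustration, not taken from the question:

```python
import struct

# Hypothetical layout: uint32, 4-byte tag, uint32, uint32, uint16,
# 6 pad bytes, uint64. The '<' prefix selects little-endian with
# standard sizes and no implicit alignment padding, so the padding
# is spelled out with '6x'.
HEADER = struct.Struct('<I4sIIH6xQ')

data = bytes(range(128))
fields = HEADER.unpack_from(data, 0)
print(HEADER.size)
print(fields)
```

Precompiling the format with struct.Struct also avoids re-parsing the format string for every packet.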
To split the packet into a 32 byte header and body:
header = packet[:32]
body = packet[32:]
To further split the body into one or more entries:
entries = [body[i:i+90] for i in range(0, len(body), 90)]
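Slicing never raises on short input, so a truncated packet would silently produce a short last entry. A small sketch that adds a length check; the function name split_packet is an assumption, and the sizes are the ones stated in the question:

```python
HEADER_SIZE = 32
ENTRY_SIZE = 90

def split_packet(packet):
    """Split a bytes packet into (header, list of 90-byte entries)."""
    header, body = packet[:HEADER_SIZE], packet[HEADER_SIZE:]
    # Reject packets with a short header or a partial trailing entry.
    if len(header) < HEADER_SIZE or len(body) % ENTRY_SIZE:
        raise ValueError("truncated packet: %d bytes" % len(packet))
    entries = [body[i:i + ENTRY_SIZE] for i in range(0, len(body), ENTRY_SIZE)]
    return header, entries

header, entries = split_packet(bytes(32) + b"A" * 90 + b"B" * 90)
print(len(header), len(entries))
```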
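In Python 2.x you could simply do:
header = s[:32]
body = s[32:32+90] # the first 90-byte entry
In Python 3.x all strings are Unicode, so I would first make sure the data is a byte sequence (data read from a socket already is bytes):
s = bytearray(s) # s must be bytes here; for a str, pass an encoding, e.g. bytearray(s, 'latin-1')
header = s[:32]
body = s[32:32+90] # the first 90-byte entry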
Suppose I read a long bytes object from somewhere, knowing it is utf-8 encoded. But the read may not fully consume the available content so that the last character in the stream may be incomplete. Calling bytes.decode() on this object may result in a decode error. But what really fails is only the last few bytes. Is there a function that works in this case, returning the longest decoded string and the remaining bytes?
UTF-8 encodes a character in at most 4 bytes, so retrying the decode with the last one to three bytes removed should work, but the vast majority of that computation would be wasted, and I don't really like this solution.
To give a simple but concrete example:
>>> b0 = b'\xc3\x84\xc3\x96\xc3'
>>> b1 = b'\x9c\xc3\x84\xc3\x96\xc3\x9c'
>>> (b0 + b1).decode()
'ÄÖÜÄÖÜ'
(b0 + b1).decode() is fine, but b0.decode() will raise. The solution should be able to decode b0 for as much as possible and return the bytes that cannot be decoded.
You are describing the basic usage of io.TextIOWrapper: a buffered text stream over a binary stream.
>>> import io
>>> txt = 'before\N{PILE OF POO}after'
>>> b = io.BytesIO(txt.encode('utf-8'))
>>> t = io.TextIOWrapper(b)
>>> t.read(5)
'befor'
>>> t.read(1)
'e'
>>> t.read(1)
'💩'
>>> t.read(1)
'a'
Contrast with reading a bytes stream directly, where it would be possible to read halfway through an encoded pile of poo:
>>> b.seek(0)
0
>>> b.read(5)
b'befor'
>>> b.read(1)
b'e'
>>> b.read(1)
b'\xf0'
>>> b.read(1)
b'\x9f'
>>> b.read(1)
b'\x92'
>>> b.read(1)
b'\xa9'
>>> b.read(1)
b'a'
Specify encoding="utf-8" if you want to be explicit. The default encoding, i.e. locale.getpreferredencoding(False), is often UTF-8 anyway, but it depends on the platform and locale settings.
As I mentioned in the comments under @wim's answer, I think you could use the codecs.iterdecode() incremental decoder to do this. Since it's a generator function, there's no need to manually save and restore its state between iterative calls to it.
Here's how it might be used to handle a situation like the one you described:
import codecs
from random import randint
def reader(sequence):
    """ Yield random length chunks of sequence until exhausted. """
    plural = lambda word, n, ending='s': (word+ending) if n > 1 else word
    i = 0
    while i < len(sequence):
        size = randint(1, 4)
        chunk = sequence[i: i+size]
        hexrepr = '0x' + ''.join('%02X' % b for b in chunk)
        print('read {} {}: {}'.format(size, plural('byte', len(chunk)), hexrepr))
        yield chunk
        i += size
bytes_obj = b'\xc3\x84\xc3\x96\xc3\x9c\xc3\x84\xc3\x96\xc3\x9c' # 'ÄÖÜÄÖÜ'
for decoded in codecs.iterdecode(reader(bytes_obj), 'utf-8'):
    print(decoded)
Sample output:
read 3 bytes: 0xC384C3
Ä
read 1 byte: 0x96
Ö
read 1 byte: 0xC3
read 3 bytes: 0x9CC384
ÜÄ
read 2 bytes: 0xC396
Ö
read 4 bytes: 0xC39C
Ü
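If you need exactly what the question asks for, the decoded prefix plus the leftover bytes, the incremental decoder object can be used directly. A minimal sketch; the function name decode_prefix is made up, and it relies on getstate() returning the buffered, not yet decoded input as its first element, which is how the documentation describes it:

```python
import codecs

def decode_prefix(data, encoding="utf-8"):
    """Decode as much of data as possible; return (text, undecoded_tail)."""
    decoder = codecs.getincrementaldecoder(encoding)()
    # final=False tells the decoder that more input may follow, so an
    # incomplete trailing sequence is buffered instead of raising.
    text = decoder.decode(data, final=False)
    pending, _flags = decoder.getstate()
    return text, pending

b0 = b'\xc3\x84\xc3\x96\xc3'
text, rest = decode_prefix(b0)
print(repr(text), rest)
```

Invalid bytes in the middle of the input still raise UnicodeDecodeError; only an incomplete sequence at the very end is held back.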
How can you decompress a string of text in Python 3, that has been compressed with gzip and converted to base 64?
For example, the text:
EgAAAB+LCAAAAAAABAALycgsVgCi4vzcVAWFktSKEgC9n1/fEgAAAA==
Should convert to:
This is some text
The following C# code successfully does this:
var gzBuffer = Convert.FromBase64String(compressedText);
using (var ms = new MemoryStream()) {
int msgLength = BitConverter.ToInt32(gzBuffer, 0);
ms.Write(gzBuffer, 4, gzBuffer.Length - 4);
var buffer = new byte[msgLength];
ms.Position = 0;
using (var zip = new GZipStream(ms, CompressionMode.Decompress)) {
zip.Read(buffer, 0, buffer.Length);
}
return Encoding.UTF8.GetString(buffer);
}
You can use the gzip and base64 modules.
>>> import gzip
>>> import base64
>>> s = 'EgAAAB+LCAAAAAAABAALycgsVgCi4vzcVAWFktSKEgC9n1/fEgAAAA=='
>>> gz = base64.b64decode(s)
>>> gz
b'\x12\x00\x00\x00\x1f\x8b\x08\x00\x00\x00\x00\x00\x04\x00\x0b\xc9\xc8,V\x00\xa2\xe2\xfc\xdcT\x05\x85\x92\xd4\x8a\x12\x00\xbd\x9f_\xdf\x12\x00\x00\x00'
# If you need the length
>>> import struct
# Unpack the binary-encoded 4-byte integer; '<' forces little-endian, the byte order of the data shown above
# Only select first four bytes with [:4] slice
>>> struct.unpack('<i', gz[:4])[0]
18
# Skip length value with [4:] slice
>>> gzip.decompress(gz[4:]).decode('UTF8')
'This is some text'
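The steps above can be wrapped into one function. A sketch, assuming the 4-byte prefix is a little-endian length of the uncompressed data, as in the C# code; the names decompress_text and compress_text are made up here, and compress_text exists only to give a round-trip check:

```python
import base64
import gzip
import struct

def decompress_text(b64_text):
    """Decode Base64, skip the 4-byte length prefix, gunzip, decode UTF-8."""
    raw = base64.b64decode(b64_text)
    (length,) = struct.unpack('<i', raw[:4])
    data = gzip.decompress(raw[4:])
    if len(data) != length:
        raise ValueError("length prefix %d does not match %d bytes" % (length, len(data)))
    return data.decode('utf-8')

def compress_text(text):
    """Inverse of decompress_text: length prefix + gzip, Base64-encoded."""
    data = text.encode('utf-8')
    payload = struct.pack('<i', len(data)) + gzip.compress(data)
    return base64.b64encode(payload).decode('ascii')

round_trip = decompress_text(compress_text('This is some text'))
print(round_trip)
```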
I'm learning Python.
While looking at the struct module, a question came up:
Is it possible to pack a string into binary data without specifying its length?
This is the case where the length is given:
F = open("data.bin", "wb")
import struct
data = struct.pack("24s",b"This is an unknown string")
print(data)
F.write(data)
F.close()
I'm trying to do the same, but with a string of unknown length.
Thanks a lot!
If you have the string, use len to determine its length and build the format string from it.
For example:
data = struct.pack("{0}s".format(len(unknown_string)), unknown_string)
The bytes type is a binary data type: it simply stores a sequence of 8-bit values. Note that the code with struct.pack ends up creating a bytes object:
>>> import struct
>>> data = struct.pack("24s",b"This is an unknown string")
>>> type(data)
<class 'bytes'>
>>> len(data)
24
The length of this is 24 as per your format specifier. If you just want to place the bytes-string directly into the file without doing any length checking you don't even need to use the struct module, you can just write it directly to the file:
with open("data.bin", "wb") as F: # the with block closes the file
    F.write(b"This will work")
If however you wanted to keep the 24 bytes length you could keep using struct.pack:
>>> data = struct.pack("24s",b"This is an unknown st")
>>> len(data)
24
>>> print(data)
b'This is an unknown st\x00\x00\x00'
>>> data = struct.pack("24s",b"This is an unknown string abcdef")
>>> print(data)
b'This is an unknown strin'
If the bytes object supplied is too short, struct.pack pads the remainder with null bytes; if it is too long, struct.pack truncates it.
If you don't mind the missing space being padded with zeros, you can pass the bytes object directly to struct.pack and it will handle it.
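When the length is not known in advance, a reader of the file also needs some way to find out where the string ends. One common approach, sketched here as an assumption rather than anything the struct module prescribes, is to store the length in front of the data; the helper names pack_string and unpack_string are hypothetical:

```python
import struct

def pack_string(payload):
    """Prefix payload (bytes) with its length as a little-endian uint32."""
    return struct.pack('<I%ds' % len(payload), len(payload), payload)

def unpack_string(buf, offset=0):
    """Return (payload, offset just past the payload)."""
    (length,) = struct.unpack_from('<I', buf, offset)
    start = offset + 4
    return buf[start:start + length], start + length

record = pack_string(b"This is an unknown string")
payload, next_offset = unpack_string(record)
print(payload, next_offset)
```

Several records packed this way can be concatenated and read back one after another by passing next_offset to the following call.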
Thanks to both...
My new code:
F = open("data.bin", "wb")
strs = b"This is an unknown string"
import struct
data = struct.pack("{0}s".format(len(strs)),strs)
print(data)
F.write(data)
F.close()
I'm dealing with a character separated hex file, where each field has a particular start code. I've opened the file as 'rb', but I was wondering, after I get the index of the startcode using .find, how do I read a certain number of bytes from this position?
This is how I am loading the file and what I am attempting to do:
with open(someFile, 'rb') as fileData:
    startIndex = fileData.find('(G')
    data = fileData[startIndex:7]
where 7 is the number of bytes I want to read from the index returned by the find function. I am using python 2.7.3
You can get the position of a substring in a bytestring under python2.7 like this:
>>> with open('student.txt', 'rb') as f:
...     data = f.read()
...
>>> data # holds the French word for student: élève
'\xc3\xa9l\xc3\xa8ve\n'
>>> len(data) # this shows we are dealing with bytes here, because "élève\n" would be 6 characters long, had it been properly decoded!
8
>>> len(data.decode('utf-8'))
6
>>> data.find('\xa8') # continue with the bytestring...
4
>>> bytes_to_read = 3
>>> data[4:4+bytes_to_read]
'\xa8ve'
You can also search for the whole encoded character. For compatibility with Python 3, it's better to prefix the literal with a b, indicating these are bytes (in Python 2.x it works without the prefix too). Note that Python 3 only accepts ASCII characters inside a bytes literal, so write non-ASCII characters as escapes or encode them explicitly:
>>> data.find(b'\xc3\xa8') # the UTF-8 encoding of è; u'è'.encode('utf-8') gives the same bytes
3
>>> bytes_to_read = 3
>>> pos = data.find(b'\xc3\xa8')
>>> data[pos:pos+bytes_to_read] # the slice syntax 'n:m' selects bytes n up to, but not including, m
'\xc3\xa8v'
>>>
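Applied to the question, the slice has to run from the start index to the start index plus the count, not to 7, and find has to be called on the data read from the file, not on the file object. A minimal sketch; the helper name read_after_marker and the sample bytes are made up for illustration:

```python
def read_after_marker(data, marker, count):
    """Return count bytes of data starting at the first marker, or None."""
    start = data.find(marker)
    if start == -1:
        return None  # marker not present
    return data[start:start + count]

# Stand-in for the contents of a file opened with 'rb' and read with f.read().
sample = b'\x00\x01(G12345(G678'
chunk = read_after_marker(sample, b'(G', 7)
print(chunk)
```

The returned bytes include the start code itself; use start + len(marker) as the lower bound if only the bytes after it are wanted.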
I have a list of hex bytes strings like this
['0xe1', '0xd7', '0x7', '0x0']
(as read from a binary file)
I want to flip the list and append the list together to create one hex number,
['0x07D7E1']
How do I format the list to this format?
Concatenate your hex numbers into one string:
'0x' + ''.join([format(int(c, 16), '02X') for c in reversed(inputlist)])
This does include the 00 byte explicitly in the output:
>>> inputlist = ['0xe1', '0xd7', '0x7', '0x0']
>>> '0x' + ''.join([format(int(c, 16), '02X') for c in reversed(inputlist)])
'0x0007D7E1'
However, I'd look into reading your binary file format properly, for example using struct to unpack bytes directly from the file into proper integers in the right byte order:
>>> import struct
>>> raw = ''.join([chr(int(c, 16)) for c in inputlist])
>>> value = struct.unpack('<I', raw)[0]
>>> print hex(value)
0x7d7e1
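The struct snippet above is Python 2 (print statement, str used as bytes). A rough Python 3 equivalent, assuming the same inputlist, could use int.from_bytes instead:

```python
inputlist = ['0xe1', '0xd7', '0x7', '0x0']

# Turn the hex strings back into the raw bytes they came from.
raw = bytes(int(c, 16) for c in inputlist)

# Interpret the bytes as one little-endian unsigned integer.
value = int.from_bytes(raw, 'little')

# '#010x' pads to 8 hex digits plus the '0x' prefix.
text = format(value, '#010x')
print(hex(value), text)
```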