How to strip first and last byte from binary data in Python

How to strip first and last byte from binary data in Python - python

I have a Python program that is receiving data from a UDP socket. Python is reading it as binary data.
Each datagram should start with the STX character (0x02) and end with the ETX character (0x03). I want my code to test the first and last character and, if they are correct, to strip them before passing the rest of the data out of my function to be further processed by the calling function.
I believe I can test the first character with data[0].
How can I get the length of the data so as to test the last character?
How can I get a "substring" of the data?
I am tempted to decode it to a string, but suspect that involves extra overhead and anyway, as a Python novice, I have no idea how.

As I understand the question you need to slice your data.
So you can slice from 2nd character by using data[1:] and up to last character but not including by using data[:-1].
As a combination of both, you can use
data[1:-1]
You can get length by using len() function.
len(data)
You can get the last byte by:
data[-1]

Related

Reading mixed ASCII/binary serial data stream -- how to find delimiter?

I am trying to read serial data from a device that outputs in a mix of ASCII and binary, using Python 3.
The message format is: "$PASHR,msg type,binary payload,checksum,\r\n" (minus the quotes)
To make it more interesting, there are several different message types, and they have different payload lengths, so I can't just read X bytes (I can infer the payload length based on the message type). The sequence of a variable number of messages of each type (around 15 in total) is sent every 20 seconds, at 115200 baud.
I haven't been able to read this with serial.readline(), probably because of newlines embedded in the payload.
I think that if I could set the line-end character to the sequence "$PASHR" that would give me a way to frame the messages -- ie, everything between one $PASHR and the next is one message, and the likelihood of seeing the sequence in the binary payload is nil. But I have not been able to make it work either using serial.newline = b'$PASHR' or readuntil(b'$PASHR') -- I still get variable length reads. I suspect that the serial.timeout setting enters into the solution, but I am not sure how.
Here's the last version I ended up with last night:
delimiter = b'$PASHR'
while True:
if self.serial.in_waiting:
message = self.serial.read_until(delimiter)
Every method I've tried gives a variable response length, when each record of a give should be the same length. Is there a way to set the newline to that multi-character string, or is there a different/better approach I should use?
Thanks!

I think I figured it out. This seems to work with a timeout of 1 second:
self.serial.timeout = 1
delimiter = b'$PASHR'
if self.serial.in_waiting:
message = self.serial.read_until(delimiter)
if len(message) > 6: # ignore carryover PASHR'
***process message***
When I looked carefully at the variable length returns, I discovered that the message sequence after the first incomplete cycle was:
b'$PASHR' (6 bytes)
b',MPC,*data**checksum*\r\n$PASHR' (108 bytes)
...
b',MPC,*data**checksum*\r\n' (102 bytes)
(delay)
b'$PASHR' (6 bytes)
The delimiter from the last message in the sequence is chopped off and output at the beginning of the next sequence. I can deal with that.
I will try Deepstop's idea of the inter_byte timeout as it is probably more elegant than using the raw timeout, and might get around the partial message problem.

Python Parsing MBR

I am just starting out with Python scripting and I am trying to write a program that will parse through a provided MBR but I'm not sure how to start.
I want to write a program that will parse a portion of the MBR's partition table. The first partition entry is located at the address 1BE. Print out the status byte (1 byte located at the starting address), the partition type (1 byte located at the address 1BE + 4) and the address of the first sector in the partition (1BE + 8).
Any help would be greatly appreciated!

Batteries included. Use the array or struct module.
Or else one of these (but here they're likely overkill):
https://github.com/digidotcom/python-suitcase
https://github.com/vstinner/hachoir3
https://github.com/Muterra/py_smartyparse

I know this is a very old question, but I came here looking for an answer and the only one here did not answer the question itself very well. I believe I have a proper understanding of the question and answer at this point; So, starting with the first part which is the status address. A status address is typically something like: 0x80, which is an active status flag, and is only a byte long. This can be found with the following lines:
import struct # This is where we get our bytearray() structure
mbr = bytearray() # We want each index of our array to be a byte
binary_file = open(file, 'rb')
mbr = binary_file.read(512) # The first 512 bytes are the first sector, which is the MBR
status_flag = mbr[0x1BE]
The status flag is only a single byte, and because we know it is located at the address 0x1BE we are able to simply pull that index from the MBR array (what we gathered when we read the file but broken into 1 byte chunks). Another way to read 0x1BE could be as the integer 446; so we are really looking at the byte stored in the index mbr[446] in the example above (Because we start with 0x Python knows to interpret it as a hex value, so 446 is 0x1BE).
Moving onto the second part, similarly to the first part, the partition type is a single byte stored at the address 0x1BE+4 or 0x1C2. So, to find this, much like with the status byte, we are able to simply do:
partition_type = mbr[0x1C2]
Because the partition type is also just a byte, and each index of our mbr array is a byte, we can simply pull the value at the address 0x1C2.
As for the last part, the address of the first sector is a 4-byte value that starts at the address: 0x1BE+8 or 0x1C6. Because it is bytes, we know that it ends at the address 0x1BE+12 or 0x1CA. So, to find this, we can do the following:
first_sector_addr = struct.unpack('<I', mbr[0x1C6:0x1CA])
'''
For the line above, we are using the unpack function also
included with the struct import. This function takes two
primary arguments: the byte order/size/alignment, and the
data to read (https://docs.python.org/3/library/struct.html).
We must read the data as little-endian and as an unsigned int
(https://thestarman.pcministry.com/asm/mbr/PartTables.htm).
'''
Once we have all of the variables collected (status_flag, partition_type, first_sector_addr) we can print each of them to the screen. I recommend printing the first two as hex values as these are what are used for identification. For example, if the partition type has the hex value 0x83 it is a Linux Native file system (https://thestarman.pcministry.com/asm/mbr/PartTypes.htm)
https://thestarman.pcministry.com/asm/mbr/PartTables.htm
https://en.wikipedia.org/wiki/Master_boot_record#Sector_layout
https://www.ijais.org/research/volume10/number8/sadi-2016-ijais-451541.pdf
(Last link will prompt for pdf download, but is a useful resource on MBR. I think that is why I had to post it as code rather than text)

How do I search for a set amount of hex and non hex data in python

I have a string that looks like this
'\x00\x03\x10B\x00\x0e12102 G1103543\x10T\x07\x21'
I have been able to match the data I want which is "12102 G1103543" with this.
re.findall('\x10\x42(.*)\x10\x54', data)
Which will output this
'\x00\x0e12102 G1103543'
The problem im having is that \x10\x54 is not always at the end of the data I want. However what I have noticed is that the first two hex digits correspond to how long the data length will be. I.E. \x00\x0e = 14 so the data length is 14char long.
Is there a better way to do this, like matching the first part then cutting the next 14 characters? I should also say that the length will vary as im looking to match several things.
Also is there a way to output the string in all hex so its easier for me to read when working in a python shell I.E. \x10B == \x10\x42
Thank You!
Edit: I managed to come up with this working solution.
newdata = re.findall('\x10\x42(.*)', data)
newdata[0][2:int(newdata[0][0:2].encode('hex'))]

Please, note that you have an structured binary file at your hands, and it is foolish to try to use regular expressions to extract data from it.
First of all the "hex data" you talk about is not "hex data" -it is just bytes
in your stream outside the ASCII range - therefore Python2 will display these characters as a \x10 and so on - but internally it is just a single byte with the value 16 (when viewed as decimal). The \x42you write corresponds to the ASCII letter B and that is why you see B in your representation.
So your best bet there would be to get the file specification, and read the data you want from there using the struct module and byte-string slicing.
If you can't have the file spec, so it is a reverse-engineering work to find out the fields of interest -just like you are already doing. But even then, you should write some code with the struct module to get your values, since field lenghts (and most likely offsets) are encoded in the byte stream itself.
In this example, your marker "\x10\x42" will rarely be a marker per se - it is most likely its position is indicated by other factors in the file (either a fixed place in the file definition, or by an offset earlier on the file.
But - if you are correctly using this as a marker, you could make use of regular expressions just to findout all offsets of the "\x10\x42" marker as you are doing, and them interpreting the following two bytes as the message length:
import struct, re
def get_data(data, sep=b"\x10B"):
results = []
for match in re.finditer(sep, data):
offset = match.start()
msglen = struct.unpack(">H", data[offset + 2: offset + 4])[0]
print(msglen)
results.append(data[offset + 4: offset + 4 + msglen])
return results

How to convert python byte string containing a mix of hex characters?

Specifically, I am receiving a stream of bytes from a TCP socket that looks something like this:
inc_tcp_data = b'\x02hello\x1cthisisthedata'
The stream using hex values to denote different parts of the incoming data. However I want to use the inc_data in the following format:
converted_data = '\x02hello\x1cthisisthedata'
essentially I want to get rid of the b and just literally spit out what came in.
I've tried various struct.unpack methods as well as .decode("encoding). I could not get the former to work at all, and the latter would strip out the hex values if there was no visual way to encode it or it would convert it to character if it could. Any ideas?
Update:
I was able to get my desired result with the following code:
inc_tcp_data = b'\x02hello\x3Fthisisthedata'.decode("ascii")
d = repr(inc_tcp_data)
print(d)
print(len(d))
print(len(inc_tcp_data))
the output is:
'\x02hello?thisisthedata'
25
20
however, this still doesn't help me because I do actually need the regular expression that follows to see \x02 as a hex value and not as a 4 byte string.
what am I doing wrong?
UPDATE
I've solved this issue by not solving it. The reason I wanted the hex characters to remain unchanged was so that a regular expression would be able to detect it further down the road. However what I should have done (and did) was simply change the regular expression to analyze the bytes without decoding it. Once I had separated out all the parts via regular expression, I decoded the parts with .decode("ascii") and everything worked out great.
I'm just updating this if it happens to help someone else.

Assuming you are using python 3
>>> inc_tcp_data.decode('ascii')
'\x02hello\x1cthisisthedata'

LZ77 compression reserved bytes "< , >"

I'm learning about LZ77 compression, and I saw that when I find a repeated string of bytes, I can use a pointer of the form <distance, length>, and that the "<", ",", ">" bytes are reserved. So... How do I compress a file that has these bytes, if I cannot compress these byte,s but cannot change it by a different byte (because decoders wouldn't be able to read it). Is there a way? Or decoders only decode is there is a exact <d, l> string? (if there is, so imagine if by a coencidence, we find these bytes in a file. What would happen?)
Thanks!

LZ77 is about referencing strings back in the decompressing buffer by their lengths and distances from the current position. But it is left to you how do you encode these back-references. Many implementations of LZ77 do it in different ways.
But you are right that there must be some way to distinguish "literals" (uncompressed pieces of data meant to be copied "as is" from the input to the output) from "back-references" (which are copied from already uncompressed portion).
One way to do it is reserving some characters as "special" (so called "escape sequences"). You can do it the way you did it, that is, by using < to mark the start of a back-reference. But then you also need a way to output < if it is a literal. You can do it, for example, by establishing that when after < there's another <, then it means a literal, and you just output one <. Or, you can establish that if after < there's immediately >, with nothing in between, then that's not a back-reference, so you just output <.
It also wouldn't be the most efficient way to encode those back-references, because it uses several bytes to encode a back-reference, so it will become efficient only for referencing strings longer than those several bytes. For shorter back-references it will inflate the data instead of compressing them, unless you establish that matches shorter than several bytes are being left as is, instead of generating back-references. But again, this means lower compression gains.
If you compress only plain old ASCII texts, you can employ a better encoding scheme, because ASCII uses just 7 out of 8 bits in a byte. So you can use the highest bit to signal a back-reference, and then use the remaining 7 bits as length, and the very next byte (or two) as back-reference's distance. This way you can always tell for sure whether the next byte is a literal ASCII character or a back-reference, by checking its highest bit. If it is 0, just output the character as is. If it is 1, use the following 7 bits as length, and read up the next 2 bytes to use it as distance. This way every back-reference takes 3 bytes, so you can efficiently compress text files with repeating sequences of more than 3 characters long.
But there's a still better way to do this, which gives even more compression: you can replace your characters with bit codes of variable lengths, crafted in such a way that the characters appearing more often would have shortest codes, and those which are rare would have longer codes. To achieve that, these codes have to be so-called "prefix codes", so that no code would be a prefix of some other code. When your codes have this property, you can always distinguish them by reading these bits in sequence until you decode some of them. Then you can be sure that you won't get any other valid item by reading more bits. The next bit always starts another new sequence. To produce such codes, you need to use Huffman trees. You can then join all your bytes and different lengths of references into one such tree and generate distinct bit codes for them, depending on their frequency. When you try to decode them, you just read the bits until you reach the code of some of these elements, and then you know for sure whether it is a code of some literal character or a code for back-reference's length. In the second case, you then read some additional bits for the distance of the back-reference (also encoded with a prefix code). This is what DEFLATE compression scheme does. But this is whole another story, and you will find the details in the RFC supplied by #MarkAdler.

If I understand your question correctly, it makes no sense. There are no "reserved bytes" for the uncompressed input of an LZ77 compressor. You need to simply encodes literals and length/distance pairs unambiguously.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.