Reading mixed ASCII/binary serial data stream -- how to find delimiter?

Reading mixed ASCII/binary serial data stream -- how to find delimiter? - python

I am trying to read serial data from a device that outputs in a mix of ASCII and binary, using Python 3.
The message format is: "$PASHR,msg type,binary payload,checksum,\r\n" (minus the quotes)
To make it more interesting, there are several different message types, and they have different payload lengths, so I can't just read X bytes (I can infer the payload length based on the message type). The sequence of a variable number of messages of each type (around 15 in total) is sent every 20 seconds, at 115200 baud.
I haven't been able to read this with serial.readline(), probably because of newlines embedded in the payload.
I think that if I could set the line-end character to the sequence "$PASHR" that would give me a way to frame the messages -- ie, everything between one $PASHR and the next is one message, and the likelihood of seeing the sequence in the binary payload is nil. But I have not been able to make it work either using serial.newline = b'$PASHR' or readuntil(b'$PASHR') -- I still get variable length reads. I suspect that the serial.timeout setting enters into the solution, but I am not sure how.
Here's the last version I ended up with last night:
delimiter = b'$PASHR'
while True:
if self.serial.in_waiting:
message = self.serial.read_until(delimiter)
Every method I've tried gives a variable response length, when each record of a give should be the same length. Is there a way to set the newline to that multi-character string, or is there a different/better approach I should use?
Thanks!

I think I figured it out. This seems to work with a timeout of 1 second:
self.serial.timeout = 1
delimiter = b'$PASHR'
if self.serial.in_waiting:
message = self.serial.read_until(delimiter)
if len(message) > 6: # ignore carryover PASHR'
***process message***
When I looked carefully at the variable length returns, I discovered that the message sequence after the first incomplete cycle was:
b'$PASHR' (6 bytes)
b',MPC,*data**checksum*\r\n$PASHR' (108 bytes)
...
b',MPC,*data**checksum*\r\n' (102 bytes)
(delay)
b'$PASHR' (6 bytes)
The delimiter from the last message in the sequence is chopped off and output at the beginning of the next sequence. I can deal with that.
I will try Deepstop's idea of the inter_byte timeout as it is probably more elegant than using the raw timeout, and might get around the partial message problem.

Related

How to strip first and last byte from binary data in Python

I have a Python program that is receiving data from a UDP socket. Python is reading it as binary data.
Each datagram should start with the STX character (0x02) and end with the ETX character (0x03). I want my code to test the first and last character and, if they are correct, to strip them before passing the rest of the data out of my function to be further processed by the calling function.
I believe I can test the first character with data[0].
How can I get the length of the data so as to test the last character?
How can I get a "substring" of the data?
I am tempted to decode it to a string, but suspect that involves extra overhead and anyway, as a Python novice, I have no idea how.

As I understand the question you need to slice your data.
So you can slice from 2nd character by using data[1:] and up to last character but not including by using data[:-1].
As a combination of both, you can use
data[1:-1]
You can get length by using len() function.
len(data)
You can get the last byte by:
data[-1]

Protobuf-net unrecognized stream prefix

I'm trying to reverse engineer the Quasar RAT protobuf protocol structure.
Quasar is a Remote Administration Tool written in C# which is open source and can be found online here.
https://github.com/quasar/QuasarRAT
I've managed to reverse most of it and I can now connect to the Quasar server client from a python script. How ever one question remains open, it appears that every byte stream that is being sent from the client to the server begins with a 3 byte field which is not registered within the protobuf class within Quasar. This field seems to provide the length of the message not including the prefixed bytes. As can be seen within this block for an example a prefixed byte stream generated for an array of size 0x2d2, these are the prefixed bytes being appended to the message.
0x0A, 0xCF, 0x05
If how ever I decide to change the message fields before serializing the message, this byte stream would change except from the first 0x0A byte. It seems that if I keep appending bytes to the message fields the second byte grows and if I overflow the second byte(make it reach above 0xff) - it would increment the third byte and reset the second byte to 0x80. But the math wont make sense to me at all as this field should return the size of the array but doesn't under any sensible formula that I could compute. I know that protobuf-net can generate PreLengthPrefix bytes to prefix the message with the length of it but this is not the case here.
Any help would be appreciated.

The encoding rules are here: https://developers.google.com/protocol-buffers/docs/encoding
Basically, each field is a encoded as a field-header (aka "tag"), followed by a payload. The field-header is a "varint" (see the encoding guide), the value of which is an integer composed of a field number and the wire-type. The wire-type is the 3 least significant bits, and the field number is the rest (shifted by 3 bits). In the case of 0x0A (binary 1010), the wire type is 2 (binary 010), and the field number is 1.
How you treat the payload depends on the wire type. For wire type 2 (length prefixed), you should expect next:
a varint that is the length of the payload in bytes, then
that many bytes of the actual payload
Unfortunately protobuf is ambiguous without a schema, so knowing that you have length prefixed data doesn't tell you what the data is; a length prefixed payload could be:
a UTF-8 string
a raw BLOB (bytes)
a sub-message
a "packed" array of some primitive type (integers/floating point numbers/etc) - remembering that the length prefix is the number of bytes, not the number of elements; the elements are not even necessarily fixed size (they could themselves be varints
In many ways, the purpose of the wire type isn't to tell you how to interpret the data; it is to tell you how to skip (or just store verbatim) the field if it isn't one you know about. For example, somebody else is using V3 of the API and you have only updated your schema to V2; they send a V3 message to your V2 API; V3 has extra fields you don't care about - the deserializer needs to not break when it hits them, so the wire type tells it how to ignore the field (i.e. what the rules are for finding the next field). Otherwise, we could just use the schema information and not store the wire type in the payload at all (although it is also used as an optimization on repeated primitive data, via "packed" arrays - it is up to the serializer whether it encodes such as length-prefixed vs lots of field header/value pairs).

How do I search for a set amount of hex and non hex data in python

I have a string that looks like this
'\x00\x03\x10B\x00\x0e12102 G1103543\x10T\x07\x21'
I have been able to match the data I want which is "12102 G1103543" with this.
re.findall('\x10\x42(.*)\x10\x54', data)
Which will output this
'\x00\x0e12102 G1103543'
The problem im having is that \x10\x54 is not always at the end of the data I want. However what I have noticed is that the first two hex digits correspond to how long the data length will be. I.E. \x00\x0e = 14 so the data length is 14char long.
Is there a better way to do this, like matching the first part then cutting the next 14 characters? I should also say that the length will vary as im looking to match several things.
Also is there a way to output the string in all hex so its easier for me to read when working in a python shell I.E. \x10B == \x10\x42
Thank You!
Edit: I managed to come up with this working solution.
newdata = re.findall('\x10\x42(.*)', data)
newdata[0][2:int(newdata[0][0:2].encode('hex'))]

Please, note that you have an structured binary file at your hands, and it is foolish to try to use regular expressions to extract data from it.
First of all the "hex data" you talk about is not "hex data" -it is just bytes
in your stream outside the ASCII range - therefore Python2 will display these characters as a \x10 and so on - but internally it is just a single byte with the value 16 (when viewed as decimal). The \x42you write corresponds to the ASCII letter B and that is why you see B in your representation.
So your best bet there would be to get the file specification, and read the data you want from there using the struct module and byte-string slicing.
If you can't have the file spec, so it is a reverse-engineering work to find out the fields of interest -just like you are already doing. But even then, you should write some code with the struct module to get your values, since field lenghts (and most likely offsets) are encoded in the byte stream itself.
In this example, your marker "\x10\x42" will rarely be a marker per se - it is most likely its position is indicated by other factors in the file (either a fixed place in the file definition, or by an offset earlier on the file.
But - if you are correctly using this as a marker, you could make use of regular expressions just to findout all offsets of the "\x10\x42" marker as you are doing, and them interpreting the following two bytes as the message length:
import struct, re
def get_data(data, sep=b"\x10B"):
results = []
for match in re.finditer(sep, data):
offset = match.start()
msglen = struct.unpack(">H", data[offset + 2: offset + 4])[0]
print(msglen)
results.append(data[offset + 4: offset + 4 + msglen])
return results

Hex Formatting in Python

I've been stuck on this one for a while now. I need to send a serial command to a device in python. Here's an example of the format:
ser = serial.Serial('/dev/ttyO1',baudrate=115200)
ser.write('\xAA\x04\x05\x00\x00\x00')
I'm able to communicate just fine if I communicate using the format .write(~hex bytes~)
However, when I need to receive larger chunks of data, the device's communication protocol splits things into packets and I have to confirm receiving each packet. To avoid getting carpal tunnel syndrome while typing out individual ser.write() commands for each packet I want to write a loop that does the packet counting/confirmation for me. here's what I have:
for n in range(0,num_packets,1):
x = '\\x'"%0.2x"%(n)
print('Picture Ready Host Ack')
ser.write('\xAA\x0E\x00\x00' + x + '\x00')
time.sleep(.1)
response = ser.read(ser.inWaiting())
image_write(response,base_path, filename)
However, the double slashes ('\x'..) gives me two slashes:
'\xaa\x0e\x00\x00\\x05\x00'
While a single slash ('\x'...) returns a problem when I define the variable x:
ValueError: invalid \x escape
...Help?

If instead of defining x as '\\x'"%0.2x"%(n), you do x = chr(n), everything should work.

\xij just expands to the character chr(int(ij, 16))
Since you are anyway sending bytes, I would say in a loop, send your first chunk. Do not add anything to it, then send your delimiter by chr(int(n),16) and subsequently the next chunks. This should solve your problem.

How to parse binary data from socket in python?

I'm currently working on a project where I need to communicate to a microhard cellular modem IPn3G. I have the modem set up to send messages to my computer through TCP and I can pick up the message in a socket.
The message looks like this though:
���������DKReadyCANRogersWirelessInc. Home354626030393530302720391029547
Now, I can recognize a few of these fields like the Status or the Carrierinfo as well as the imei and imsi in the end.
My problem is, how do I parse the funny looking things? I have tried struct, but it didn't seem to help me out very much.
In the documentation of the modem I only found this:
Modem_event message structure:
fixed header (fixed size 20 bytes)
Modem ID (uint64_t (8 bytes))
Message type mask (uint8_t(1 byte))
reserved
packet length (uint16_t(2 bytes))
Note: packet length = length of fixed header + length of message payload.
Carrier info:
Content length 2 BYTES (UINT16_T)
RSSI 1 BYTE (UINT8_T)
RF Band 2 BYTES (UINT16_T)
Service type STRING (1-30 Bytes)
Channel number STRING (1-30 Bytes)
SIM card number STRING (1-30 Bytes)
Phone number STRING (1-30 Bytes)
To me it seems like the message doesn't even line up with what it's supposed to be. I would be very glad if anyone had advice on how to tackle this problem.
Thank you

Python has great struct module, which allows you to pack and unpack binary data.
I see, that you've already tried to use it. I don't have a documentation, for your device (go and check it!), but I can suppose, that strings are null-terminated (they don't have their size given anywhere before).
Get the size of the string part of the message (from size of the other fields and packet length) and read all strings into one Python string using <number>s and split them finding the null characters.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.