I have an MQTT broker which is receiving data from an external publisher. When a subscriber receives this data, it arrives as a byte string. Here is an example:
payload = b'\xd1\x04\x1c\x00\x00\x00A8000_CP805x_VG_BF1702056637\x01\xe8\x03\x8f\x03\x01\x00'
The problem is decoding this payload.
When I try to decode it with different encodings I get:
encodings = ['utf-7', 'utf-8', 'utf-8-sig',
             'utf-16', 'utf-16-be', 'utf-16-le',
             'utf-32', 'utf-32-be', 'utf-32-le',
             'ASCII', 'latin-1', 'iso-8859-1']
for enc in encodings:
    try:
        print('[' + enc + ']: \t\t' + payload.decode(enc))
    except Exception as e:
        print('[' + enc + ']: \t\t' + str(e))
# Output:
#[utf-7]: 'utf7' codec can't decode byte 0xd1 in position 0: unexpected special character
#[utf-8]: 'utf-8' codec can't decode byte 0xd1 in position 0: invalid continuation byte
#[utf-8-sig]: 'utf-8' codec can't decode byte 0xd1 in position 0: invalid continuation byte
#[utf-16]: 'utf-16-le' codec can't decode byte 0x00 in position 40: truncated data
#[utf-16-be]: 'utf-16-be' codec can't decode byte 0x00 in position 40: truncated data
#[utf-16-le]: 'utf-16-le' codec can't decode byte 0x00 in position 40: truncated data
#[utf-32]: 'utf-32-le' codec can't decode bytes in position 0-3: code point not in range(0x110000)
#[utf-32-be]: 'utf-32-be' codec can't decode bytes in position 0-3: code point not in range(0x110000)
#[utf-32-le]: 'utf-32-le' codec can't decode bytes in position 0-3: code point not in range(0x110000)
#[ASCII]: 'ascii' codec can't decode byte 0xd1 in position 0: ordinal not in range(128)
#[latin-1]: ÑA8000_CP805x_VG_BF1702056637è
#[iso-8859-1]: ÑA8000_CP805x_VG_BF1702056637è
None of these results is acceptable, and I am at a loss as to what I can do. I'm guessing the problem is that I'm using the wrong encoding, but unfortunately I have no documentation on the payload.
Additional information:
The 'A8000_CP805x_VG_BF1702056637' part of the byte string is a reference to the device sending the data. I am not sure whether this is part of the problem.
This is only one of the payloads. There are also much larger payloads where I get similarly unreadable results, with readable text interleaved between the raw bytes.
Any and all help is welcome.
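One observation: this payload does not look like text in any encoding; it looks like a custom binary protocol, so the right tool is struct, not decode. Notably, the bytes \x1c\x00\x00\x00 are the little-endian integer 28, which is exactly the length of 'A8000_CP805x_VG_BF1702056637', suggesting a length-prefixed string. The field layout below is a guess based only on this one sample; the names and meanings of the numeric fields are hypothetical.

```python
import struct

payload = b'\xd1\x04\x1c\x00\x00\x00A8000_CP805x_VG_BF1702056637\x01\xe8\x03\x8f\x03\x01\x00'

# Guessed layout (undocumented): little-endian uint16 header,
# uint32 name length (0x1c = 28, exactly the device-name length),
# the ASCII device name, then a few small integer fields.
header, name_len = struct.unpack_from('<HI', payload, 0)
name = payload[6:6 + name_len].decode('ascii')
flag, val1, val2, val3 = struct.unpack_from('<BHHH', payload, 6 + name_len)

print(header, name, flag, val1, val2, val3)
```

Only the text field gets decoded; everything else is read as integers. Whether the trailing values are one byte plus three uint16s (as assumed here) or some other split can only be confirmed against more samples or vendor documentation.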
I extract data with Selenium from a network response. I should be getting XML.
Instead I'm getting: b'\xa5\xff\xff\xc7\x88\xe4\xb4\xd7\x03\xa0\x11:|\xce\xdb\xb7\x0f\xf1\xdf\xfc\x1f\xdb\x93\x91^\xbc\xa3\xdd\xc2\x02V\x00\xba$\xbd\x10\xd2\xd0E\xf2\x90\xb6\xca\xee\x10\xbf\xbf_\xbf\xfc\xef?\xe9\x13{H\xf1\xa1\xa0\x00\x1c\x01(\x80\x1c\x81\x02(s\xe7Z\xf3\xb3N\xf5L\xdc>\xe7\x8f\xbbwl\xbf\x99\x91\xd4O\xde\xb4,\xf3PH\x02L1\x00\xc98\xc3,\x13!\x82\xc6\xc2\xa6Bd"k\xcb\x9d(\xb9\x13%WQr\x15%W\xb1\xe5J\t\x9e:\x8a\x03\x99\x06H\xd0\x8f\xd8\xfe\x9f9\xbc\xfc\x157\x111\xd7\x15\xaab\xfb\xe8;\xab\xee\xfc\x9b\xeeu\x10<d\x04\x06Y\xa8\xd7\x9f\x11...
Code:
...
for request in driver.requests:
    if request.response:
        text_file.write(str(request.response.body))
I've tried:
decoded = request.response.body.decode('ascii')
or request.response.body.decode('utf-8') or cp1251/1252
I get:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa5 in position 0: ordinal not in range(128)
The response should be XML (~1.5 MB), as shown in the attached photo.
If I use:
decoded = base64.b64decode(request.response.body)
I'm getting something like: b'T#\x00\xad\x9a\xb5\xba\xfa3u\xca\x84PG\xbd\x8a\xab\x1f\xcdcJ%\r\xd4\xff\x0c$)\x9a>... — not what was expected.
Combining decoded = base64.b64decode(request.response.body).decode('ascii') also doesn't help:
UnicodeDecodeError: 'ascii' codec can't decode byte 0x85 in position 0: ordinal not in range(128)
Help me, please.
It's because of the header 'Content-Encoding': 'br'.
Installing brotli helped. Also deleting
This message helped a lot
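The general point carries over to other questions here: a compressed response body is binary until it is decompressed, and only then can it be decoded as text. A minimal sketch of the principle using gzip from the standard library (for 'Content-Encoding: br' you would call brotli.decompress from the third-party brotli package instead):

```python
import gzip

# A compressed body looks like random bytes and fails every text codec.
raw = gzip.compress('<xml>example</xml>'.encode('utf-8'))

# Decompress first, decode second. For Content-Encoding: br, swap in
# brotli.decompress (pip install brotli).
text = gzip.decompress(raw).decode('utf-8')
print(text)
```

The stand-in `raw` here plays the role of `request.response.body`; selenium-wire hands you the body exactly as it came off the wire, so the Content-Encoding header tells you which decompressor to apply.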
Email clients decode messages correctly, so I assume there must also be a way to decode emails correctly with Python.
I use the built-in email Python library to process incoming emails.
import email
...
email_message = email.message_from_file(fp)
email_message.is_multipart() # => False
email_message.get_content_type() # 'text/plain'
to_decode = email_message.get_payload(decode=True)
charset = email_message.get_content_charset()
# charset is utf-8
to_decode.decode(charset)
Exception:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 135: invalid start byte
This is a part of the string within to_decode variable.
b'Dzie\\u0144 dobry,\n\nniestety w podany'
I figured out by trial and error that I can do the following.
test = b'Dzie\\u0144 dobry,\n\nniestety w podany'
test.decode('unicode-escape')
>> output: 'Dzień dobry,\n\nniestety w podany'
Which is correct. But I think there must be a better way than guessing. How does my email client do this?
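One caveat with the trial-and-error solution: bytes.decode('unicode_escape') treats every non-ASCII byte as latin-1, so it corrupts any genuine UTF-8 characters that sit alongside the literal \uXXXX escapes. A safer variant, assuming the body mixes real UTF-8 bytes with literal escape sequences, is to decode the bytes first and expand the escapes afterwards:

```python
raw = b'Dzie\\u0144 dobry,\n\nniestety w podany'

# Decode the actual bytes first (permissively, since the charset claim
# may be wrong), then expand the literal \uXXXX escapes. This avoids
# unicode_escape's implicit latin-1 reading of non-ASCII bytes.
s = raw.decode('utf-8', errors='replace')
text = s.encode('latin-1', 'backslashreplace').decode('unicode_escape')
print(text)
```

That said, literal \uXXXX sequences in a text/plain body usually mean the sender double-escaped the message; a compliant client shouldn't need this step at all, so it is worth checking what the sending system produces.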
I am using curl to get data.
import os
cmd = "curl --data \"action=getdata\" https://localhost:8070"
print(cmd)
data = os.popen(cmd).read()
The line above produces an error UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 565334: character maps to <undefined>.
When I debugged using breakpoints, the command os.popen generates a large corpus of text and when it goes to read() the error arises in file cp1252.py in IncrementalDecoder class. I tried doing,
data = os.popen(cmd).read().encode('utf-8').decode('ascii')
and
data = os.popen(cmd).read().encode().decode('utf-8')
But the error persists. How can we solve this?
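The encode/decode attempts cannot work because os.popen().read() has already decoded the child's output with the console's locale codec (cp1252 in the traceback) before your code runs. The usual workaround is to capture raw bytes with subprocess and decode them explicitly. A sketch; the child process here is just a portable stand-in that emits the offending 0x9d byte, and for the real case you would pass the curl argv instead:

```python
import subprocess
import sys

# Capture stdout as raw bytes instead of letting the locale codec decode it.
# Stand-in child: writes a byte (0x9d) that cp1252 cannot map.
proc = subprocess.run(
    [sys.executable, '-c', "import sys; sys.stdout.buffer.write(b'ok \\x9d')"],
    capture_output=True,
)

# Now the decoding step is explicit and controllable.
data = proc.stdout.decode('utf-8', errors='replace')
print(data)
```

For the question's command the argv would be something like ['curl', '--data', 'action=getdata', 'https://localhost:8070']; whether utf-8 is the right codec depends on what the server actually sends.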
When I use the following code:
import requests
def googleSearch(qu):
with requests.session() as c:
url = 'https://www.google.com'
qu = {'q': qu}
urllink = requests.get(url, params=qu)
x=urllink.url
return x
x=googleSearch('translation')
print(x)
import urllib.request
site = urllib.request.urlopen(x)
bytes = site.read()
# artificial limit of size:
# bytes = bytes[0:6000]
text = bytes.decode("utf8")
print(text)
I got the following errors (running the program again and again):
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 6116: invalid continuation byte
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 6143: invalid continuation byte
etc.
So I suppose the "site" file is too big.
When I limit the size of the file to 6000 bytes there is no error.
What is happening? Should I slice the file and treat each slice separately?
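Size is not the problem. A likely culprit is that cutting a UTF-8 byte string at an arbitrary position (as the 6000-byte limit does, or as any fixed-size read would) can split a multi-byte character, leaving an invalid trailing sequence. A small demonstration of the failure mode and two ways around it:

```python
data = 'café'.encode('utf-8')   # b'caf\xc3\xa9' — 'é' occupies two bytes

try:
    data[:4].decode('utf-8')    # the slice cuts 'é' in half
except UnicodeDecodeError as exc:
    print(exc)                  # complains about the dangling lead byte

# Either decode the complete buffer, or tolerate a cut at the boundary:
safe = data[:4].decode('utf-8', errors='ignore')
print(safe)
```

So: read the whole response and decode it once, or pass errors='replace'/'ignore' if truncation is unavoidable. (Google's results page can also contain bytes in other encodings; checking the Content-Type header's charset is worthwhile before assuming UTF-8.)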
It doesn't decode it properly. In fact, it raises the error from the title. What does it mean? What should I do?
im1_bytes = client_socket.recv(int_size)
im1_str = im1_bytes.decode('utf-8')
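The variable name im1 suggests the socket is carrying image data, and image bytes are not UTF-8 text, so .decode('utf-8') raises exactly this error. A sketch of the distinction, with a literal byte string standing in for client_socket.recv(int_size) (a hypothetical PNG-like header):

```python
# Stand-in for client_socket.recv(int_size): a PNG-style signature.
im1_bytes = b'\x89PNG\r\n\x1a\n' + b'\x00' * 8

err_msg = None
try:
    im1_bytes.decode('utf-8')   # 0x89 is not valid UTF-8
except UnicodeDecodeError as exc:
    err_msg = str(exc)
    print(err_msg)

# Binary payloads stay as bytes — write them out without decoding.
with open('received.png', 'wb') as f:
    f.write(im1_bytes)
```

Two extra caveats: only decode if the peer actually sent text, and remember recv(int_size) may return fewer than int_size bytes, so binary payloads usually need a read loop until the expected length has arrived.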