I need to decode a UTF-8 sequence, which is stored in a bytearray, to a string.
The UTF-8 sequence might contain erroneous parts. In this case I need to decode as much as possible and (optionally?) substitute invalid parts by something like "?".
# First part decodes to "ABÄC"
b = bytearray([0x41, 0x42, 0xC3, 0x84, 0x43])
s = str(b, "utf-8")
print(s)
# Second part, invalid sequence, which I want to decode to something like "AB?C"
b = bytearray([0x41, 0x42, 0xC3, 0x43])
s = str(b, "utf-8")
print(s)
What's the best way to achieve this in Python 3?
There are several built-in error-handling schemes for decoding and encoding str to and from bytes and bytearray, e.g. with bytearray.decode(). For example:
>>> b = bytearray([0x41, 0x42, 0xC3, 0x43])
>>> b.decode('utf8', errors='ignore') # discard malformed bytes
'ABC'
>>> b.decode('utf8', errors='replace') # replace with U+FFFD
'AB�C'
>>> b.decode('utf8', errors='backslashreplace') # replace with backslash-escape
'AB\\xc3C'
In addition, you can write your own error handler and register it:
import codecs

def my_handler(exception):
    """Replace unexpected bytes with '?'."""
    return '?', exception.end

codecs.register_error('my_handler', my_handler)
>>> b.decode('utf8', errors='my_handler')
'AB?C'
All of these error handling schemes can also be used with the str() constructor as in your question:
>>> str(b, 'utf8', errors='my_handler')
'AB?C'
... although it's more idiomatic to use bytearray.decode() explicitly.
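Alternatively, if all you want is '?' instead of U+FFFD and you'd rather not register a handler, post-processing the output of errors='replace' gives the same result (a small sketch, not part of the original answer):
>>> b = bytearray([0x41, 0x42, 0xC3, 0x43])
>>> b.decode('utf8', errors='replace').replace('\ufffd', '?')
'AB?C'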
I'm creating my RSA signature like this:
transactionStr = json.dumps(GenesisTransaction())
signature = rsa.sign(transactionStr.encode(), client.privateKey, 'SHA-1')
But I'm unable to convert it to a string so that I can save it.
I have tried decoding it using utf8
signature.decode("utf8")
but I get the error "'utf-8' codec can't decode byte 0xe3 in position 2"
Is there any way I can do this?
An RSA signature looks like this:
b'aL\xe3\xf4\xbeEM\xc4\x9e\n\x9e\xf4M`\xba\x85*\x13\xd52x\xd9\\\xe8F\x1c\x07\x90[/\x9dy\xce\xa9IV\x89\xe0\xcd9\\_3\x1e\xaa\x80\xdea\xd1\xbem/\x8e\x91\xbd\x13\x12o\x8c\xed\xf6\x89\xb5\x0b'
.decode('utf8') is used to decode text encoded in UTF8, not arbitrary bytes. Convert the byte string to a hexadecimal string instead:
>>> sig = b'aL\xe3\xf4\xbeEM\xc4\x9e\n\x9e\xf4M`\xba\x85*\x13\xd52x\xd9\\\xe8F\x1c\x07\x90[/\x9dy\xce\xa9IV\x89\xe0\xcd9\\_3\x1e\xaa\x80\xdea\xd1\xbem/\x8e\x91\xbd\x13\x12o\x8c\xed\xf6\x89\xb5\x0b'
>>> s = sig.hex()
>>> s
'614ce3f4be454dc49e0a9ef44d60ba852a13d53278d95ce8461c07905b2f9d79cea9495689e0cd395c5f331eaa80de61d1be6d2f8e91bd13126f8cedf689b50b'
To convert back, if needed:
>>> b = bytes.fromhex(s)
>>> b
b'aL\xe3\xf4\xbeEM\xc4\x9e\n\x9e\xf4M`\xba\x85*\x13\xd52x\xd9\\\xe8F\x1c\x07\x90[/\x9dy\xce\xa9IV\x89\xe0\xcd9\\_3\x1e\xaa\x80\xdea\xd1\xbem/\x8e\x91\xbd\x13\x12o\x8c\xed\xf6\x89\xb5\x0b'
>>> b==sig
True
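If a shorter printable form than hex is acceptable, base64 is a common alternative for storing signature bytes as text; a minimal sketch (not part of the original answer):
>>> import base64
>>> s = base64.b64encode(sig).decode('ascii')  # printable text, shorter than the hex form
>>> base64.b64decode(s) == sig
True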
I have a string which includes encoded bytes inside it:
str1 = "b'Output file \xeb\xac\xb8\xed\x95\xad\xeb\xb6\x84\xec\x84\x9d.xlsx Created'"
I want to decode it, but I can't since it has become a string. Therefore I want to ask whether there is any way I can convert it into
str2 = b'Output file \xeb\xac\xb8\xed\x95\xad\xeb\xb6\x84\xec\x84\x9d.xlsx Created'
Here str2 is a bytes object which I can decode easily using
str2.decode('utf-8')
to get the final result:
'Output file 문항분석.xlsx Created'
You could use ast.literal_eval:
>>> print(str1)
b'Output file \xeb\xac\xb8\xed\x95\xad\xeb\xb6\x84\xec\x84\x9d.xlsx Created'
>>> type(str1)
<class 'str'>
>>> from ast import literal_eval
>>> literal_eval(str1).decode('utf-8')
'Output file 문항분석.xlsx Created'
Based on the SyntaxError mentioned in your comments, you may be having a testing issue when attempting to print, because stdout is set to ascii in your console (and your console may not support some of the characters you are trying to print). You can try something like the following to set sys.stdout to utf-8 and see what your console will print (using a string slice and encode below to get bytes, rather than the ast.literal_eval approach already suggested):
import codecs
import sys
sys.stdout = codecs.getwriter('utf-8')(sys.stdout.buffer)
s = "b'Output file \xeb\xac\xb8\xed\x95\xad\xeb\xb6\x84\xec\x84\x9d.xlsx Created'"
b = s[2:-1].encode('latin-1').decode('utf-8')  # latin-1 maps each character back to its original byte value
print(b)
A simple way is to assume that all the characters of the initial string are in the [0,256) range and map to the same Unicode code points, which means that it is effectively a Latin1-encoded string.
The conversion is then trivial:
str1[2:-1].encode('Latin1').decode('utf8')
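Applied to the string from the question, this reproduces the expected result (a quick check of the trick above):
>>> str1 = "b'Output file \xeb\xac\xb8\xed\x95\xad\xeb\xb6\x84\xec\x84\x9d.xlsx Created'"
>>> str1[2:-1].encode('Latin1').decode('utf8')
'Output file 문항분석.xlsx Created'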
Finally, I found an answer in which I use a function to cast a string to bytes without encoding. Given the string
str1 = "b'Output file \xeb\xac\xb8\xed\x95\xad\xeb\xb6\x84\xec\x84\x9d.xlsx Created'"
I now take only the actual encoded text inside it,
str1[2:-1]
and pass it to a function which converts the string to bytes without encoding its values:
import struct

def rawbytes(s):
    """Convert a string to raw bytes without encoding"""
    outlist = []
    for cp in s:
        num = ord(cp)
        if num < 256:
            outlist.append(struct.pack('B', num))
        elif num < 65536:
            outlist.append(struct.pack('>H', num))
        else:
            b = (num & 0xFF0000) >> 16
            H = num & 0xFFFF
            outlist.append(struct.pack('>bH', b, H))
    return b''.join(outlist)
Calling the function converts the string to bytes, which can then be decoded:
rawbytes(str1[2:-1]).decode('utf-8')
This gives the correct output:
'Output file 문항분석.xlsx Created'
Suppose I read a long bytes object from somewhere, knowing it is utf-8 encoded. But the read may not fully consume the available content so that the last character in the stream may be incomplete. Calling bytes.decode() on this object may result in a decode error. But what really fails is only the last few bytes. Is there a function that works in this case, returning the longest decoded string and the remaining bytes?
UTF-8 encodes a character in at most 4 bytes, so retrying the decode after stripping the last one to three bytes should work, but most of that computation is wasted and I don't really like this solution.
To give a simple but concrete example:
>>> b0 = b'\xc3\x84\xc3\x96\xc3'
>>> b1 = b'\x9c\xc3\x84\xc3\x96\xc3\x9c'
>>> (b0 + b1).decode()
'ÄÖÜÄÖÜ'
(b0 + b1).decode() is fine, but b0.decode() will raise. The solution should be able to decode as much of b0 as possible and return the bytes that cannot be decoded.
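For reference, the "decode as much as possible" behaviour can also be expressed by catching the error once and splitting on its start attribute; a rough sketch (decode_partial is just an illustrative helper, and it assumes the data is only truncated at the end, not corrupted in the middle):
def decode_partial(data):
    """Return (decoded_text, undecoded_tail); assumes errors only occur at the end."""
    try:
        return data.decode('utf-8'), b''
    except UnicodeDecodeError as e:
        return data[:e.start].decode('utf-8'), data[e.start:]

print(decode_partial(b'\xc3\x84\xc3\x96\xc3'))  # ('ÄÖ', b'\xc3')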
You are describing the basic usage of io.TextIOWrapper: a buffered text stream over a binary stream.
>>> import io
>>> txt = 'before\N{PILE OF POO}after'
>>> b = io.BytesIO(txt.encode('utf-8'))
>>> t = io.TextIOWrapper(b)
>>> t.read(5)
'befor'
>>> t.read(1)
'e'
>>> t.read(1)
'💩'
>>> t.read(1)
'a'
Contrast with reading a bytes stream directly, where it would be possible to read halfway through an encoded pile of poo:
>>> b.seek(0)
0
>>> b.read(5)
b'befor'
>>> b.read(1)
b'e'
>>> b.read(1)
b'\xf0'
>>> b.read(1)
b'\x9f'
>>> b.read(1)
b'\x92'
>>> b.read(1)
b'\xa9'
>>> b.read(1)
b'a'
Specify encoding="utf-8" if you want to be explicit. The default encoding, i.e. locale.getpreferredencoding(False), would usually be utf-8 anyway.
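The same buffering is also available without a stream object, by using the codecs incremental decoder directly; a minimal sketch (not part of the original answer):
import codecs

dec = codecs.getincrementaldecoder('utf-8')()
b0 = b'\xc3\x84\xc3\x96\xc3'
b1 = b'\x9c\xc3\x84\xc3\x96\xc3\x9c'
print(dec.decode(b0))  # 'ÄÖ' -- the trailing 0xC3 is buffered, not decoded yet
print(dec.decode(b1))  # 'ÜÄÖÜ' -- the buffered byte is combined with the next chunk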
As I mentioned in the comments under @wim's answer, I think you could use the codecs.iterdecode() incremental decoder to do this. Since it's a generator function, there's no need to manually save and restore its state between iterative calls to it.
Here's how it might be used to handle a situation like the one you described:
import codecs
from random import randint

def reader(sequence):
    """ Yield random length chunks of sequence until exhausted. """
    plural = lambda word, n, ending='s': (word+ending) if n > 1 else word
    i = 0
    while i < len(sequence):
        size = randint(1, 4)
        chunk = sequence[i: i+size]
        hexrepr = '0x' + ''.join('%02X' % b for b in chunk)
        print('read {} {}: {}'.format(size, plural('byte', len(chunk)), hexrepr))
        yield chunk
        i += size

bytes_obj = b'\xc3\x84\xc3\x96\xc3\x9c\xc3\x84\xc3\x96\xc3\x9c'  # 'ÄÖÜÄÖÜ'

for decoded in codecs.iterdecode(reader(bytes_obj), 'utf-8'):
    print(decoded)
Sample output:
read 3 bytes: 0xC384C3
Ä
read 1 byte: 0x96
Ö
read 1 byte: 0xC3
read 3 bytes: 0x9CC384
ÜÄ
read 2 bytes: 0xC396
Ö
read 4 bytes: 0xC39C
Ü
The code I have is from a single-sign-on function:
from urllib.parse import unquote
import base64
payload = unquote(payload)
print(payload)
print(type(payload))
decoded = base64.decodestring(payload)
decodestring is complaining that I gave it a string instead of bytes...
File "/Users/Jeff/Development/langalang/proj/discourse/views.py", line 38, in sso
decoded = base64.decodestring(payload)
File "/Users/Jeff/.virtualenvs/proj/lib/python3.6/base64.py", line 559, in decodestring
return decodebytes(s)
File "/Users/Jeff/.virtualenvs/proj/lib/python3.6/base64.py", line 551, in decodebytes
_input_type_check(s)
File "/Users/Jeff/.virtualenvs/proj/lib/python3.6/base64.py", line 520, in _input_type_check
raise TypeError(msg) from err
TypeError: expected bytes-like object, not str
which is fine, but when I look at what my print statements printed to the terminal, I see this...
b'bm9uY2U9NDI5NDg5OTU0NjU4MjAzODkyNTI=\n'
<class 'str'>
it seems to be saying it is a string of bytes, but then it says that it is a string.
What is going on here?
If I add an encode() to the end of the payload declaration, I see this...
payload = unquote(payload).encode()
b"b'bm9uY2U9NDQxMTQ4MzIyNDMwNjU3MjcyMDM=\\n'"
<class 'bytes'>
EDIT: adding the method that makes the payload
@patch("discourse.views.HttpResponseRedirect")
def test_sso_success(self, mock_redirect):
    """Test for the sso view"""
    # Generating a random number, encoding for url, signing it with a hash
    nonce = "".join([str(random.randint(0, 9)) for i in range(20)])
    # The sso payload needs to be a dict of params
    params = {"nonce": nonce}
    payload = base64.encodestring(urlencode(params).encode())
    print(payload.decode() + " tests")
    key = settings.SSO_SECRET
    h = hmac.new(key.encode(), payload, digestmod=hashlib.sha256)
    signature = h.hexdigest()
    url = reverse("discourse:sso") + "?sso=%s&sig=%s" % (payload, signature)
    req = self.rf.get(url)
    req.user = self.user
    response = sso(req)
    self.assertTrue(mock_redirect.called)
Your payload is generated by base64.encodestring(s), which according to the documentation:
Encode the bytes-like object s, which can contain arbitrary binary
data, and return bytes containing the base64-encoded data, with
newlines (b'\n') inserted after every 76 bytes of output, and ensuring
that there is a trailing newline, as per RFC 2045 (MIME).
Then you apply urllib.parse.unquote to what is by then the textual representation of that bytes object (ASCII characters only). At that point the b' prefix has become part of your string, because the str constructor was run over the payload bytes somewhere along the way. The result is a str rather than bytes, and one that is no longer valid base64.
it seems to be saying it is a string of bytes, but then it says that it is a string.
It looks like the string you have is "b'bm9uY2U9NDQxMTQ4MzIyNDMwNjU3MjcyMDM=\\n'", so the leading b is not a bytes-literal prefix; it is just part of the string's value.
So you need to strip those characters before passing the value to the base64 decoder:
from urllib.parse import unquote
import base64
payload = unquote(payload)
print(payload[2:-1])
enc = base64.decodebytes(payload[2:-1].encode())
print(enc)
The original error already suggested it, and the display of the encoded string confirms it: your payload is a unicode string which happens to begin with the prefix "b'" and end with a single "'".
Such a string is generally built with a repr call:
>>> b = b'abc' # b is a byte string
>>> r = repr(b) # by construction r is a unicode string
>>> print(r) # will look like a byte string
b'abc'
>>> print(b) # what is printed for a true byte string
abc
You can revert to a true byte string with literal_eval:
>>> import ast
>>> b2 = ast.literal_eval(r)
>>> type(b2)
<class 'bytes'>
>>> b == b2
True
But reverting like this is only a workaround; you should track down where your code builds a representation of a byte string.
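For what it's worth, in the test shown in the question the representation is most likely created when the URL is built: %-formatting a bytes object into a str inserts its repr, complete with the b'' wrapper. A small illustration, reusing the payload value printed above (and one possible fix, assuming the endpoint expects the plain base64 text):
payload = b'bm9uY2U9NDQxMTQ4MzIyNDMwNjU3MjcyMDM=\n'
url = "?sso=%s" % payload              # str %-formatting of bytes inserts repr(payload)
print(url)                             # ?sso=b'bm9uY2U9NDQxMTQ4MzIyNDMwNjU3MjcyMDM=\n'
url = "?sso=%s" % payload.decode().strip()
print(url)                             # ?sso=bm9uY2U9NDQxMTQ4MzIyNDMwNjU3MjcyMDM=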
I'm falling into unicode hell.
My environment in on unix, python 2.7.3
LC_CTYPE=zh_TW.UTF-8
LANG=en_US.UTF-8
I'm trying to dump hex-encoded data in a human-readable format; here is the simplified code:
#! /usr/bin/env python
# encoding:utf-8
import sys
s=u"readable\n" # previous result keep in unicode string
s2="fb is not \xfb" # data read from binary file
s += s2
print s # method 1
print s.encode('utf-8') # method 2
print s.encode('utf-8','ignore') # method 3
print s.decode('iso8859-1') # method 4
# method 1-4 display following error message
#UnicodeDecodeError: 'ascii' codec can't decode byte 0xfb
# in position 0: ordinal not in range(128)
f = open('out.txt','wb')
f.write(s)
I just want to print out the 0xfb.
I should describe this more. The key is s += s2, where s keeps my previously decoded string and s2 is the next string that should be appended to s.
If I modify it as follows, the error occurs on the file write:
s=u"readable\n"
s2="fb is not \xfb"
s += s2.decode('cp437')
print s
f=open('out.txt','wb')
f.write(s)
# UnicodeEncodeError: 'ascii' codec can't encode character
# u'\u221a' in position 1: ordinal not in range(128)
I want the content of out.txt to be
readable
fb is not \xfb
or
readable
fb is not 0xfb
[Solution]
#! /usr/bin/env python
# encoding:utf-8
import sys
import binascii

def fmtstr(s):
    r = ''
    for c in s:
        if ord(c) >= 128:   # non-ASCII byte: show it as \xHH
            r = ''.join([r, "\\x" + binascii.hexlify(c)])
        else:
            r = ''.join([r, c])
    return r

s = u"readable"
s2 = "fb is not \xfb"
s += fmtstr(s2)
print s
f = open('out.txt', 'wb')
f.write(s)
I strongly suspect that your code is actually erroring out on the previous line: the s += s2 one. s2 is just a series of bytes, which can't be arbitrarily tacked on to a unicode object (which is instead a series of code points).
If you had intended the '\xfb' to represent U+FB, LATIN SMALL LETTER U WITH CIRCUMFLEX, it would have been better to assign it like this instead:
s2 = u"\u00fb"
But you said that you just want to print out \xHH codes for control characters. If you just want it to be something humans can understand which still makes it apparent that special characters are in a string, then repr may be enough. First, don't have s be a unicode object, because you're treating your strings here as a series of bytes, not a series of code points.
s = s.encode('utf-8')
s += s2
print repr(s)
Finally, if you don't want the extra quotes on the outside that repr adds, for nice pretty printing or whatever, there's not a simple builtin way to do that in Python (that I know of). I've used something like this before:
import re

controlchars_re = re.compile(r'[\x00-\x1f\x7f-\xff]')  # control characters and non-ASCII bytes

def _show_control_chars(match):
    txt = repr(match.group(0))
    return txt[1:-1]

def escape_special_characters(s):
    return controlchars_re.sub(_show_control_chars, s.replace('\\', '\\\\'))
You can pretty easily tweak the controlchars_re regex to define which characters you care about escaping.
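As an aside, on Python 3 the backslashreplace error handler shown in the first answer above produces exactly this kind of escaped output without a custom function; a small sketch:
data = b"fb is not \xfb"
print(data.decode('ascii', errors='backslashreplace'))  # prints: fb is not \xfb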