I would like to include picture bytes into a JSON, but I struggle with a encoding issue:
import urllib
import json
data = urllib.urlopen('https://www.python.org/static/community_logos/python-logo-master-v3-TM-flattened.png').read()
json.dumps({'picture' : data})
UnicodeDecodeError: 'utf8' codec can't decode byte 0x89 in position 0: invalid start byte
I don't know how to deal with that issue since I am handling an image, so I am a bit confused about this encoding issue. I am using python 2.7. Does anyone can help me? :)
JSON data expects to handle Unicode text. Binary image data is not text, so when the json.dumps() function tries to decode the bytestring to unicode using UTF-8 (the default) that decoding fails.
You'll have to wrap your binary data in a text-safe encoding first, such as Base-64:
json.dumps({'picture' : data.encode('base64')})
Of course, this then assumes that the receiver expects your data to be wrapped so.
If your API endpoint has been so badly designed to expect your image bytes to be passed in as text, then the alternative is to pretend that your bytes are really text; if you first decode it as Latin-1 you can map those bytes straight to Unicode codepoints:
json.dumps({'picture' : data.encode('latin-1')})
With the data already a unicode object the json library will then proceed to treat it as text. This does mean that it can replace non-ASCII codepoints with \uhhhh escapes.
The best solution that comes to my mind for this situation, space-wise, is base85 encoding which represents four bytes as five characters. Also you could also map every byte to the corresponding character in U+0000-U+00FF format and then dump it in the json.
But still, those could be overkill methods for this and base64, ease-wise, would be the winner.
Related
Background: I've got a byte file that is encoded using unicode. However, I can't figure out the right method to get Python to decode it to a string. Sometimes is uses 1-byte ASCII text. The majority of the time it uses 2-byte "plain latin" text, but it can possibly contain any unicode character. So my python program needs to be able to decode that and handle it. Unfortunately byte_string.decode('unicode') isn't a thing, so I need to specify another encoding scheme. Using Python 3.9
I've read through the Python doc on unicode and utf-8 Python doc. If Python uses unicode for it's strings, and utf-8 as default, this should be pretty straightforward, yet I keep getting incorrect decodes.
If I understand how unicode works, the most significant byte is the character code, and the least significant byte is the lookup value in the decode table. So I would expect 0x00_41 to decode to "A",
0x00_F2 =>
x65_03_01 => é (e with combining acute accent).
I wrote a short test file to experiment with these byte combinations, and I'm running into a few situations that I don't understand (despite extensive reading).
Example code:
def main():
print("Starting MAIN...")
vrsn_bytes = b'\x76\x72\x73\x6E'
serato_bytes = b'\x00\x53\x00\x65\x00\x72\x00\x61\x00\x74\x00\x6F'
special_bytes = b'\xB2\xF2'
combining_bytes = b'\x41\x75\x64\x65\x03\x01'
print(f"vrsn_bytes: {vrsn_bytes}")
print(f"serato_bytes: {serato_bytes}")
print(f"special_bytes: {special_bytes}")
print(f"combining_bytes: {combining_bytes}")
encoding_method = 'utf-8' # also tried latin-1 and cp1252
vrsn_str = vrsn_bytes.decode(encoding_method)
serato_str = serato_bytes.decode(encoding_method)
special_str = special_bytes.decode(encoding_method)
combining_str = combining_bytes.decode(encoding_method)
print(f"vrsn_str: {vrsn_str}")
print(f"serato_str: {serato_str}")
print(f"special_str: {special_str}")
print(f"combining_str: {combining_str}")
return True
if __name__ == '__main__':
print("Starting Command Line Experiment!")
if not main():
print("\n Command Line Test FAILED!!")
else:
print("\n Command Line Test PASSED!!")
Issue 1: utf-8 encoding. As the experiment is written, I get the following error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb2 in position 0: invalid start byte
I don't understand why this fails to decode, according to the unicode decode table, 0x00B2 should be "SUPERSCRIPT TWO". In fact, it seems like anything above 0x7F returns the same UnicodeDecodeError.
I know that some encoding schemes only support 7 bits, which is what seems like is happening, but utf-8 should support not only 8 bits, but multiple bytes.
If I changed encoding_method to encoding_method = 'latin-1' which extends the original ascii 128 characters to 256 characters (up to 0xFF), then I get a better output:
vrsn_str: vrsn
serato_str: Serato
special_str: ²ò
combining_str: Aude
However, this encoding is not handling the 2-byte codes properly. \x00_53 should be S, not �S, and none of the encoding methods I'll mention in this post handle the combining acute accent after Aude properly.
So far I've tried many different encoding methods, but the ones that are closest are: unicode_escape, latin-1, and cp1252. while I expect utf-8 to be what I'm supposed to use, it does not behave like it's described in the Python doc linked above.
Any help is appreciated. Besides trying more methods, I don't understand why this isn't decoding according to the table in link 3.
UPDATE:
After some more reading, and see your responses, I understand why you're so confused. I'm going to explain further so that hopefully this helps someone in the future.
The byte file that I'm decoding is not mine (hence why the encoding does not make sense). What I see now is that the bytes represent the code point, not the byte representation of the unicode character.
For example: I want 0x00_B2 to translate to ò. But the actual byte representation of ò is 0xC3_B2. What I have is the integer representation of the code point. So while I was trying to decode, what I actually need to do is convert 0x00B2 to an integer = 178. then I can use chr(178) to convert to unicode.
I don't know why the file was written this way, and I can't change it. But I see now why the decoding wasn't working. Hopefully this helps someone avoid the frustration I've been figuring out.
Thanks!
This isn't actually a python issue, it's how you're encoding the character. To convert a unicode codepoint to utf-8, you do not simply get the bytes from the codepoint position.
For example, the code point U+2192 is →. The actual binary representation in utf-8 is: 0xE28692, or 11100010 10000110 10010010
As we can see, this is 3 bytes, not 2 as we'd expect if we only used the position. To get correct behavior, you can either do the encoding by hand, or use a converter such as this one:
https://onlineunicodetools.com/convert-unicode-to-binary
This will let you input a unicode character and get the utf-8 binary representation.
To get correct output for ò, we need to use 0xC3B2.
>>> s = b'\xC3\xB2'
>>> print(s.decode('utf-8'))
ò
The reason why you can't use the direct binary representation is because of the header for the bytes. In utf-8, we can have 1-byte, 2-byte, and 4-byte codepoints. For example, to signify a 1 byte codepoint, the first bit is encoded as a 0. This means that we can only store 2^7 1-byte code points. So, the codepoint U+0080, which is a control character, must be encoded as a 2-byte character such as 11000010 10000000.
For this character, the first byte begins with the header 110, while the second byte begins with the header 10. This means that the data for the codepoint is stored in the last 5 bits of the first byte and the last 6 bits of the second byte. If we combine those, we get
00010 000000, which is equivalent to 0x80.
I recently came across this string: b'\xd5\xa3Lk\xd4\xad\xaeH\xb8\xae\xab\xd8EL3\xd1RR\x17\x0c\xea~\xfa\xd0\xc9\xfeJ\x9aq\xd0\xc57\xfd\xfa\x1d}\x8f\x99?*\xef\x88\x1e\x99\x8d\x81t`1\x91\xebh\xc5\x9d\xa7\xa5\x8e\xb9X' and I wanna decode it.
Now I know this is possible with python with string.decode() but it requires an encoding. How can I determine the encoding to decode this string?
My earlier comment to your question, which is partly accurate and partly in error. From the documentation of Standard Encodings:
Without external information it’s impossible to reliably determine which encoding was used for encoding a string. Each charmap encoding can decode any random byte sequence. However that’s not possible with UTF-8, as UTF-8 byte sequences have a structure that doesn’t allow arbitrary byte sequences.
So you should try to decode with 'utf-8-sig' (for the general case in which a Byte Order Mark or BOM might be present as the first 3 bytes -- which is not the case for your example so you could just use 'utf-8'). But if that fails, you are not guaranteed knowing what encoding was used using trial-and-error decoding because, according to the above documentation, an attempt at decoding with another codec could succeed (and possibly give you garbage). If the `utf-8' decoding succeeds, it is probably the encoding that was used. See below.
s = 'abcde'
print(s.encode('utf-32').decode('utf-16'))
print(s.encode('cp500').decode('latin-1'))
Prints:
a b c d e
�����
Of course, a 'utf-8' encoding will also successfully decode a string that was encoded with the 'ascii' codec, so there is that level of indeterminacy.
I'm Using python 3.5
I have a couple of byte strings representing text that is encoded in various codecs so: b'mybytesstring' , now some are Utf8 encoded other are latin1 and so on. What I want to in the following order is:
transform the bytes string into an ascii like string.
transform the ascii like string back to a bytes string.
decode the bytes string with correct codec.
The problem is that I have to move the bytes string into something that does not accept bytes objects so I'm looking for a solution that lets me do bytes -> ascii -> bytes safely.
x = x.decode().encode('ascii',errors='ignore')
You use the encode and decode methods for this, and supply the desired encoding to them. It's not clear to me if you know the encoding beforehand. If you don't know it you're in trouble. You may have to guess the encoding in some way, risking garbage output.
OK I found a solution which is much easier than I thought
mybytes = 'ëýđþé'.encode()
str_mybytes = str(mybytes)
again_mybytes = eval(str_mybytes)
decoded = again_mybytes.decode('utf8')
I am trying to convert an incoming byte string that contains non-ascii characters into a valid utf-8 string such that I can dump is as json.
b = '\x80'
u8 = b.encode('utf-8')
j = json.dumps(u8)
I expected j to be '\xc2\x80' but instead I get:
UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 0: ordinal not in range(128)
In my situation, 'b' is coming from mysql via google protocol buffers and is filled out with some blob data.
Any ideas?
EDIT:
I have ethernet frames that are stored in a mysql table as a blob (please, everyone, stay on topic and keep from discussing why there are packets in a table). The table collation is utf-8 and the db layer (sqlalchemy, non-orm) is grabbing the data and creating structs (google protocol buffers) which store the blob as a python 'str'. In some cases I use the protocol buffers directly with out any issue. In other cases, I need to expose the same data via json. What I noticed is that when json.dumps() does its thing, '\x80' can be replaced with the invalid unicode char (\ufffd iirc)
You need to examine the documentation for the software API that you are using. BLOB is an acronym: BINARY Large Object.
If your data is in fact binary, the idea of decoding it to Unicode is of course a nonsense.
If it is in fact text, you need to know what encoding to use to decode it to Unicode.
Then you use json.dumps(a_Python_object) ... if you encode it to UTF-8 yourself, json will decode it back again:
>>> import json
>>> json.dumps(u"\u0100\u0404")
'"\\u0100\\u0404"'
>>> json.dumps(u"\u0100\u0404".encode('utf8'))
'"\\u0100\\u0404"'
>>>
UPDATE about latin1:
u'\x80' is a useless meaningless C1 control character -- the encoding is extremely unlikely to be Latin-1. Latin-1 is "a snare and a delusion" -- all 8-bit bytes are decoded to Unicode without raising an exception. Don't confuse "works" and "doesn't raise an exception".
Use b.decode('name of source encoding') to get a unicode version. This was surprising to me when I learned it. eg:
In [123]: 'foo'.decode('latin-1')
Out[123]: u'foo'
I think what you are trying to do is decode the string object of some encoding. Do you know what that encoding is? To get the unicode object.
unicode_b = b.decode('some_encoding')
and then re-encoding the unicode object using the utf_8 encoding back to a string object.
b = unicode_b.encode('utf_8')
Using the unicode object as a translator, without knowing what the original encoding of the string is I can't know for certain but there is the possibility that the conversion will not go as expected. The unicode object is not meant for converting strings of one encoding to another. I would work with the unicode object assuming you know what the encoding is, if you don't know what the encoding is then there really isn't a way to find out without trial and error, and then convert back to the encoded string when you want a string object back.
I am using the Amazon MWS API to get the sales report for my store and then save that report in a table in the database. Unfortunately I am getting an encoding error when I try to encode the information as Unicode. After looking through the report (exactly as amazon sent it) I saw this string which is the location of the buyer:
'S�o Paulo'
so I tried to encode it like so:
encodeme = 'S�o Paulo'
encodeme.encode('utf-8)
but got the following error
UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 1: ordinal not in range(128)
The whole reason why I am trying to encode it is because as soon as Django sees the � character it throws a warning and cuts off the string, meaning that the location is saved as S instead of
São Paulo
Any help is appreciated.
It looks like you are having some kind of encoding problem.
First, you should be very certain what encoding Amazon is using in the report body they send you. Is it UTF-8? Is it ISO 8859-1? Something else?
Unfortunately the Amazon MWS Reports API documentation, especially their API Reference, is not very forthcoming about what encoding they use. They only encoding I see them mention is UTF-8, so that should be your first guess. The GetReport API documentation (p.36-37) describes the response element Report as being type xs:string, but I don't see where they define that data type. Maybe they mean XML Schema's string datatype.
So, I suggest you save the byte sequence you are receiving as your report body from Amazon in a file, with zero transformations. Be aware that your code which calls AWS might be modifying the report body string inadvertently. Examine the non-ASCII bytes in that file with a binary editor. Is the "São" of "São" stored as S\xC3\xA3o, indicating UTF-8 encoding? Or is it stored as S\xE3o, indicating ISO 8859-1 encoding?
I'm guessing that you receive your report as a flat file. The Amazon AWS documentation says that you can request reports be delivered to you as XML. This would have the advantage of giving you a reply with an explicit encoding declaration.
Once you know the encoding of the report body, you now need to handle it properly. You imply that you are using the Django framework and Python language code to receive the report from Amazon AWS.
One thing to get very clear (as Skirmantas also explains):
Unicode strings hold characters. Byte strings hold bytes (octets).
Encoding converts a Unicode string into a byte string.
Decoding converts a byte string into a Unicode string.
The string you get from Amazon AWS is a byte string. You need to decode it to get a Unicode string. But your code fragment, encodeme = 'São Paulo', gives you a byte string. encodeme.encode('utf-8) performs an encode() on the byte string, which isn't what you want. (The missing closing quote on 'utf-8 doesn't help.)
Try this example code:
>>> reportbody = 'S\xc3\xa3o Paulo' # UTF-8 encoded byte string
>>> reportbody.decode('utf-8') # returns a Unicode string, u'...'
u'S\xe3o Paulo'
You might find some background reading helpful. I agree with Hoxieboy that you should take the time to read Python's Unicode HOWTO. Also check out the top answers to What do I need to know about Unicode?.
I think you have to decode it using a correct encoding rather than encode it to utf-8. Try
s = s.decode('utf-8')
However you need to know which encoding to use. Input can come in other encodings that utf-8.
The error which you received UnicodeDecodeError means that your object is not unicode, it is a bytestring. When you do bytestring.encode, the string firstly is decoded into unicode object with default encoding (ascii) and only then it is encoded with utf-8.
I'll try to explain the difference of unicode string and utf-8 bytestring in python.
unicode is a python's datatype which represents a unicode string. You use unicode for most of string operations in your program. Python probably uses utf-8 in its internals though it could also be utf-16 and this doesn't matter for you.
bytestring is a binary safe string. It can be of any encoding. When you receive data, for example you open a file, you get a bytestring and in most cases you will want to decode it to unicode. When you write to file you have to encode unicode objects into bytestrings. Sometimes decoding/encoding is done for you by a framework or library. Not always however framework can do this because not always framework can known which encoding to use.
utf-8 is an encoding which can correctly represent any unicode string as a bytestring. However you can't decode any kind of bytestring with utf-8 into unicode. You need to know what encoding is used in the bytestring to decode it.
Official Python unicode documentation
You might try that webpage if you haven't already and see if you can get the answer you're looking for ;)