I need to convert a bytearray which contains non-encoded raw unicode data to a unicode string, e.g. the unicode \u2167 represents the Roman numeral 8:
print(u'\u2167')
Ⅷ
Having this information stored in a bytearray, I need to find a way to convert it back to unicode. Decoding from e.g. 'utf8' obviously does not work:
b = bytearray([0x21,0x67])
print(b.decode('utf8'))
!g
Any ideas?
EDIT
@Luke's comment got me on the right track. Apparently the original data (not the simplified one I am showing here) is encoded as UTF-16LE. The data is obtained from a wxPython TextDataObject. wxPython usually uses unicode internally. That is what made me think that I was dealing with unicode data.
... a bytearray which contains non-encoded raw unicode data
If it is in a bytearray, it is by definition encoded. The Python bytes or bytearray types can contain encoded Unicode data. The str type contains Unicode code points. You .decode a byte string to a Unicode string, and .encode a Unicode string into byte strings. The encoding used for your example is UTF-16BE:
>>> b = bytearray([0x21,0x67])
>>> b.decode('utf-16be')
'Ⅷ'
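As a minimal sketch of the byte-order point (the little-endian byte values below are my own illustration, not data from the question), the same code point can be recovered from either byte order as long as the matching codec is named:

```python
# U+2167 (ROMAN NUMERAL EIGHT) stored as two bytes in each byte order.
big_endian = bytearray([0x21, 0x67])
little_endian = bytearray([0x67, 0x21])  # illustrative, assumed layout

text_be = big_endian.decode('utf-16-be')
text_le = little_endian.decode('utf-16-le')
print(text_be == text_le == '\u2167')  # True

# Encoding goes the other way and gives back the original bytes.
encoded_again = '\u2167'.encode('utf-16-be')
print(encoded_again == bytes(big_endian))  # True
```

If the data really is UTF-16LE, as the edit to the question suggests, 'utf-16-le' would be the codec to try first.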
The spelling in print(b.decode('utf8')) is not the problem: 'utf8' is an accepted alias for 'utf-8', so the following line gives the same '!g' result:
print(b.decode("utf-8"))
Related
I have a JSON file which contains strings encoded like this:
"sender_name": "Horn\u00c3\u00adkov\u00c3\u00a1",
I am trying to parse this file using the json module. However I am not able to decode this string correctly.
What I get after decoding the JSON using .load() method is 'HornÃ\xadková'. The string should be correctly decoded as 'Horníková' instead.
I read the JSON specification and I understand that after \u there should be 4 hexadecimal digits specifying the Unicode number of the character. But it seems that in this JSON file UTF-8 encoded bytes are stored as \u-sequences.
What type of encoding is this and how to correctly parse it in Python 3?
Is this type of JSON file even valid JSON according to the specification?
Your text is already encoded. Normally you would tell Python this with a b prefix on the string, but since you're using json and the input has to be a str, you have to decode the encoded text manually. Because your input is not bytes, you can open the file with the 'raw_unicode_escape' encoding, which turns the \u escapes into characters instead of applying the default encoding of open(), and then encode the result with 'raw_unicode_escape' to get the underlying bytes. Decoding those bytes as UTF-8 gives the desired result.
Note that since the encoding and decoding have to be performed on the file content as a loaded string, you should use json.loads() instead of json.load().
In [168]: with open('test.json', encoding='raw_unicode_escape') as f:
...:     d = json.loads(f.read().encode('raw_unicode_escape').decode())
...:
In [169]: d
Out[169]: {'sender_name': 'Horníková'}
The JSON you are reading was written incorrectly and the Unicode strings decoded from it will have to be re-encoded with the wrong encoding used, then decoded with the correct encoding.
Here's an example:
#!python3
import json
# The bad JSON you have
bad_json = r'{"sender_name": "Horn\u00c3\u00adkov\u00c3\u00a1"}'
print('bad_json =',bad_json)
# The wanted result from json.loads()
wanted = {'sender_name':'Horníková'}
# What correctly written JSON should look like
good_json = json.dumps(wanted)
print('good_json =',good_json)
# What you get when loading the bad JSON.
got = json.loads(bad_json)
print('wanted =',wanted)
print('got =',got)
# How to correct the mojibake string
corrected_sender = got['sender_name'].encode('latin1').decode('utf8')
print('corrected_sender =',corrected_sender)
Output:
bad_json = {"sender_name": "Horn\u00c3\u00adkov\u00c3\u00a1"}
good_json = {"sender_name": "Horn\u00edkov\u00e1"}
wanted = {'sender_name': 'Horníková'}
got = {'sender_name': 'HornÃ\xadková'}
corrected_sender = Horníková
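If the file has many such fields, the same latin1/utf8 round trip can be applied to every string after loading. Here is a rough sketch; fix_mojibake is a hypothetical helper name and the extra "tags" field is made up for illustration. It assumes that any string which survives the round trip was mojibake, and it leaves everything else untouched:

```python
import json

def fix_mojibake(value):
    """Recursively undo UTF-8 bytes that were stored as \\u00XX escapes."""
    if isinstance(value, str):
        try:
            return value.encode('latin-1').decode('utf-8')
        except (UnicodeEncodeError, UnicodeDecodeError):
            # Not mojibake (or not repairable this way): keep the string as-is.
            return value
    if isinstance(value, list):
        return [fix_mojibake(item) for item in value]
    if isinstance(value, dict):
        return {fix_mojibake(k): fix_mojibake(v) for k, v in value.items()}
    return value  # numbers, booleans, None

# Hypothetical input in the same broken style as the question.
bad_json = r'{"sender_name": "Horn\u00c3\u00adkov\u00c3\u00a1", "tags": ["caf\u00c3\u00a9", 1]}'
fixed = fix_mojibake(json.loads(bad_json))
print(fixed)  # {'sender_name': 'Horníková', 'tags': ['café', 1]}
```

The try/except matters: a correctly written string such as 'Horníková' fails the UTF-8 decode step and is returned unchanged instead of raising.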
I don't know enough about JSON to be able to say whether this is valid or not, but you can parse these strings using the raw_unicode_escape codec:
>>> "Horn\u00c3\u00adkov\u00c3\u00a1".encode('raw_unicode_escape').decode('utf8')
'Horníková'
Reencode to bytes, and then redecode to text.
>>> 'HornÃ\xadková'.encode('latin-1').decode('utf-8')
'Horníková'
Is this type of JSON file even valid JSON according to the specification?
No.
A string is a sequence of zero or more Unicode characters, wrapped in double quotes, using backslash escapes [emphasis added].
source
A string is a sequence of Unicode code points wrapped with quotation marks (U+0022). [...] Any code point may be represented as a hexadecimal escape sequence [...] represented as a six-character sequence: a reverse solidus, followed by the lowercase letter u, followed by four hexadecimal digits that encode the code point [emphasis added].
source
UTF-8 byte sequences are neither Unicode characters nor Unicode code points.
I have a byte string which I'm decoding to unicode in python using .decode('unicode-escape'). This returns a unicode string. Encoding this unicode string to obtain it in byte form again however returns a different byte string. Why is this, and how can I decode and encode in a way that preserves the original data?
Examples:
some_bytes = b'7Q\x82\xacqo\xbb\x0f\x03\x105\x93<\xebD\xbe\xde\xad\x82\xf9\xa6\x1cX\x01N\x8c\xff\x9e\x84\x1e\xa1\x97'
some_bytes.decode('unicode-escape')
yields: 7Q¬qo»5<ëD¾Þù¦XNÿ¡
some_bytes.decode('unicode-escape').encode()
yields: b'7Q\xc2\x82\xc2\xacqo\xc2\xbb\x0f\x03\x105\xc2\x93<\xc3\xabD\xc2\xbe\xc3\x9e\xc2\xad\xc2\x82\xc3\xb9\xc2\xa6\x1cX\x01N\xc2\x8c\xc3\xbf\xc2\x9e\xc2\x84\x1e\xc2\xa1\xc2\x97'
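\xc2 and \xc3 are the UTF-8 lead bytes for the code points U+0080 to U+00FF. For example, superscript two (U+00B2) is \xc2\xb2 in UTF-8.
So when you call .encode(), which defaults to UTF-8, every code point in that range becomes a two-byte sequence starting with \xc2 or \xc3.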
For more details, you can see below link
https://www.utf8-chartable.de/unicode-utf8-table.pl?start=128&number=128&utf8=string-literal&unicodeinhtml=hex
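As for preserving the original data, one option (my suggestion, not something stated in the answer above) is to use latin-1 in both directions. It maps each byte 0x00-0xFF to the code point with the same number, so the round trip is lossless, and unlike unicode-escape it never interprets backslashes in the data. A minimal sketch with the bytes from the question:

```python
some_bytes = b'7Q\x82\xacqo\xbb\x0f\x03\x105\x93<\xebD\xbe\xde\xad\x82\xf9\xa6\x1cX\x01N\x8c\xff\x9e\x84\x1e\xa1\x97'

# Decode and re-encode with the same one-byte-per-code-point codec.
as_text = some_bytes.decode('latin-1')
round_tripped = as_text.encode('latin-1')
print(round_tripped == some_bytes)  # True

# Encoding with the default UTF-8 instead gives a different, longer byte string.
utf8_bytes = as_text.encode()
print(len(utf8_bytes) > len(some_bytes))  # True
```

This only makes sense if the str is just a carrier for the bytes. If the bytes are real text in some other encoding, decode with that encoding instead.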
I'm using Python 3.5.
I have a couple of byte strings representing text that is encoded in various codecs, e.g. b'mybytesstring'. Some are UTF-8 encoded, others are latin1, and so on. What I want to do, in the following order, is:
transform the bytes string into an ascii like string.
transform the ascii like string back to a bytes string.
decode the bytes string with correct codec.
The problem is that I have to move the bytes string into something that does not accept bytes objects so I'm looking for a solution that lets me do bytes -> ascii -> bytes safely.
x = x.decode().encode('ascii',errors='ignore')
You use the encode and decode methods for this, and supply the desired encoding to them. It's not clear to me if you know the encoding beforehand. If you don't know it you're in trouble. You may have to guess the encoding in some way, risking garbage output.
OK I found a solution which is much easier than I thought
mybytes = 'ëýđþé'.encode()
str_mybytes = str(mybytes)
again_mybytes = eval(str_mybytes)
decoded = again_mybytes.decode('utf8')
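An alternative that avoids eval (which will execute whatever expression the string contains) is to use base64 for the bytes -> ascii -> bytes step. This is only a sketch of that idea, not what the poster above used, and the variable names are my own:

```python
import base64

original = 'ëýđþé'.encode('utf-8')

# bytes -> ASCII-only str, safe to pass through anything that wants text
ascii_text = base64.b64encode(original).decode('ascii')

# ASCII-only str -> the exact same bytes again
restored = base64.b64decode(ascii_text)
print(restored == original)  # True

# Finally decode with the codec that is known to be correct for this data.
decoded = restored.decode('utf-8')
print(decoded)  # ëýđþé
```

If the str(mybytes) form has to be kept, ast.literal_eval from the standard library parses a bytes literal without evaluating arbitrary code.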
There is no difference for the printing results, what is the usage of encoding and decoding for utf-8?
And is it encode('utf8') or encode('utf-8')?
u ='abc'
print(u)
u=u.encode('utf-8')
print(u)
uu = u.decode('utf-8')
print(uu)
str.encode encodes the string (or unicode string) into a series of bytes. In Python 3 this is a bytes object, in Python 2 it's str again (confusingly). When you encode a unicode string, you are left with bytes, not unicode. Remember that UTF-8 is not unicode, it's an encoding method that can turn unicode codepoints into bytes.
decode (bytes.decode in Python 3, str.decode in Python 2) will decode the serialized byte stream with the selected codec, picking the proper unicode codepoints and giving you a unicode string.
So, what you're doing in Python 2 is: 'abc' > 'abc' > u'abc', and in Python 3 is:
'abc' > b'abc' > 'abc'. Try printing repr(u) or type(u) in addition to see what's changing where.
utf_8 might be the most canonical, but it doesn't really matter.
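To make the Python 3 case concrete, here is a small sketch that prints the type and repr at each step and checks that the codec spellings are interchangeable. The 'é' example is my addition to show a case where the bytes actually differ from the text:

```python
import codecs

u = 'abc'
encoded = u.encode('utf-8')
decoded = encoded.decode('utf-8')

for value in (u, encoded, decoded):
    print(type(value).__name__, repr(value))
# str 'abc'
# bytes b'abc'
# str 'abc'

# With non-ASCII text the difference is visible in the bytes themselves.
accented = 'é'.encode('utf-8')
print(accented)  # b'\xc3\xa9'

# 'utf8', 'utf-8' and 'utf_8' all resolve to the same codec.
same_codec = (codecs.lookup('utf8').name
              == codecs.lookup('utf-8').name
              == codecs.lookup('utf_8').name)
print(same_codec)  # True
```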
Usually Python will first try to decode it to unicode before it can encode it back to UTF-8. There are also encodings that have nothing to do with character sets and that can be applied to 8-bit strings.
For example:
data = u'\u00c3'  # Unicode data
data = data.encode('utf8')
print(data)
b'\xc3\x83'  # the output
My question is about python 3.0 strings.
My understanding is that for the line str = "a", the character 'a' is encoded (using UTF-8, for example) and stored in the str object. If the UTF-8 representation of 'a' is 1 byte, the string is 1 byte long. Am I right?
If the above is true, what happens when we read a binary file using read()? Suppose I have a file with two bytes of binary data and I read it into a string using read() like
file = open(fileName, mode='rb')
str = file.read()
Now str will be two bytes long and each byte will be what was stored in fileName. Am I right?
If I am right in the above point, then the str object is not in any particular encoding format (like UTF, etc.). So what does it mean that Python strings are always unicode? Also, what will happen if I call str.encode()? Will it make no sense?
As the str object read from the file is actually an array of bytes, is there any way to convert it to the bytearray type?
You are confused. "Encodings" pertain to byte strings, not to unicode strings. Meaningful statements: "This byte string is utf-8 encoded.", "This byte string is 2 bytes long." Meaningless statements: "This unicode string is utf-8 encoded", "This unicode string is 2 bytes long"
str = "a" means "create a unicode string 'a' and a reference to it named str". Unicode strings are of course stored in some encoding because they need to exist as bytes in memory, but that is not relevant. All your code treats the string as if it has no encoding at all: it has been abstracted away from bytes. A unicode string is a sequence of unicode code points (i.e. of integers that represent characters).
Yes and no. str here (the return value of read()) is a byte string, not a unicode string. "a" != b"a".
Your byte-string str possesses an unknown encoding and must be decoded to produce a unicode string. Byte strings don't have an encode() method because it is meaningless--they are either already an encoding of a unicode string, or they are not representing a unicode string at all (e.g. an image).
It's not an array of bytes, it's a byte-string. A bytearray is a mutable list of bytes. You can produce a bytearray with bytearray(byte_string), but bytearrays are intended for fairly specialized uses (e.g., to avoid copying for send-recv buffers), not casual use. Normally you just want a byte string.
When you read a file in binary mode, the value returned from the read() method is a bytes object, not a str object. The documentation covers this in depth.
>>> with open('foo', mode='rb') as f: s = f.read()
...
>>> s
b'abc\n'
>>> len(s)
4
>>> type(s)
<class 'bytes'>
Python strings store Unicode codepoints.
Codepoints are not the same thing as bytes. Bytes are a computer representation of numbers (most commonly between 0 and 255), and those numbers can be translated to codepoints through the process of decoding, and in the other direction with encoding. Python 3 strings contain codepoints, one for each character in the text.
Python source code can define string literals using a series of bytes, that the interpreter decodes to unicode using the UTF-8 codec by default, but you can set other codecs at the top of the file. On disk, the letter a in UTF-8 encoding is indeed just one byte, that is the nature of the UTF-8 standard.
If you read a file in text mode, Python applies the decoding process for you automatically, but when you open it in binary mode, no decoding is done and you get a bytes object instead. The contents of that object should reflect the contents of the file exactly. Note that it is not of type str, it is not unicode, it is not even a Python string. To turn bytes into a string you'd need to explicitly decode with the .decode() method.
A bytearray is trivially created from a bytes value, just call bytearray() on it.
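A short sketch of the last two points, using a bytes literal to stand in for what f.read() returns in binary mode (the UTF-8 choice is an assumption about the file's content):

```python
data = b'abc\n'               # stand-in for f.read() on a file opened with mode='rb'

text = data.decode('utf-8')   # bytes -> str, assuming the file holds UTF-8 text
print(type(data).__name__, type(text).__name__)  # bytes str

buf = bytearray(data)         # mutable copy of the same bytes
buf[0] = ord('A')             # item assignment works on bytearray, not on bytes
print(buf)                    # bytearray(b'Abc\n')
print(data)                   # b'abc\n'  (the original bytes object is unchanged)
```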