I have a string that contains printable and unprintable characters, for instance:
'\xe8\x00\x00\x00\x00\x60\xfc\xe8\x89\x00\x00\x00\x60\x89'
What's the most "pythonesque" way to convert this to a bytes object in Python 3, i.e.:
b'\xe8\x00\x00\x00\x00`\xfc\xe8\x89\x00\x00\x00`\x89'
If all your codepoints are within the range U+0000 to U+00FF, you can encode to Latin-1:
inputstring.encode('latin1')
as the first 256 codepoints of Unicode map one-to-one to bytes in the Latin-1 standard.
This is by far the fastest method, but it won't work for any characters in the input string outside that range.
Basically, if you got Unicode that contains 'bytes' that should not have been decoded, encode to Latin-1 to get the original bytes again.
Demo:
>>> '\xe8\x00\x00\x00\x00\x60\xfc\xe8\x89\x00\x00\x00\x60\x89'.encode('latin1')
b'\xe8\x00\x00\x00\x00`\xfc\xe8\x89\x00\x00\x00`\x89'
Related
I need to convert a bytearray which contains non-encoded raw unicode data to a Unicode string, e.g. the codepoint \u2167 represents the Roman numeral eight:
print(u'\u2167')
Ⅷ
Having this information stored in a bytearray, I need to find a way to convert it back to Unicode. Decoding from e.g. 'utf8' obviously does not work:
b = bytearray([0x21,0x67])
print(b.decode('utf8'))
!g
Any ideas?
EDIT
@Luke's comment got me on the right track. Apparently the original data (not the simplified example I am showing here) is encoded as UTF-16LE. The data is obtained from a wxPython TextDataObject. wxPython usually uses Unicode internally; that is what made me think I was dealing with Unicode data.
... a bytearray which contains non-encoded raw unicode data
If it is in a bytearray, it is by definition encoded. The Python bytes and bytearray types can contain encoded Unicode data. The str type contains Unicode code points. You .decode a byte string to get a Unicode string, and .encode a Unicode string to get a byte string. The encoding used for your example is UTF-16BE:
>>> b = bytearray([0x21,0x67])
>>> b.decode('utf-16be')
'Ⅷ'
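Since the EDIT says the real data is actually UTF-16LE, the same code point would arrive with its bytes swapped; for illustration (not from the original answer):
>>> bytearray([0x67, 0x21]).decode('utf-16-le')
'Ⅷ'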
Note that print(b.decode('utf8')) and print(b.decode("utf-8")) are equivalent: 'utf8' is an accepted alias for the 'utf-8' codec in Python, so the spelling is not the problem here; the bytes simply are not UTF-8-encoded.
From what I understand, when concatenating a string and a Unicode string, Python will automatically decode the string based on the default encoding and convert it to Unicode before concatenating.
I'm assuming something like this if default is 'ascii' (please correct if mistaken):
string -> ASCII hexadecimal bytes -> Unicode hexadecimal bytes -> Unicode string
Wouldn't it be easier and raise fewer UnicodeDecodeErrors if, for example, u'a' + 'Ӹ' were converted to u'a' + u'Ӹ' directly before concatenating? Why does the string need to be decoded first? Why does it matter if the string contains non-ASCII characters if it will be converted to Unicode anyway?
Wouldn't it be easier and raise fewer UnicodeDecodeErrors if, for example, u'a' + 'Ӹ' were converted to u'a' + u'Ӹ' directly before concatenating?
It could probably do that with literals, but not with string values at runtime. Imagine a string that contains a 'Ӹ' character. How do you think it can be converted to u'Ӹ' in Unicode? It has to be decoded!
Ӹ is Unicode codepoint U+04F8 CYRILLIC CAPITAL LETTER YERU WITH DIAERESIS. 'Ӹ' and u'Ӹ' are not encoded the same way (in fact, I can't even find an 8-bit encoding that supports U+04F8), so you can't simply change one into the other directly. A string has to be decoded from its source encoding (ASCII, ISO-8859-1, etc.) to an intermediary (ISO 10646, Unicode) that can then be represented in the target encoding (UTF-8, UTF-16, UTF-32, etc.).
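In Python 2, that implicit decode is exactly what fails when the byte string isn't ASCII. An illustrative session, assuming a UTF-8 terminal so that 'Ӹ' is the two bytes '\xd3\xb8':
>>> u'a' + 'Ӹ'
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd3 in position 0: ordinal not in range(128)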
Why does the string need to be decoded first?
Because the two values being concatenated need to be in the same encoding before they can be concatenated.
Why does it matter if the string contains non-ASCII characters if it will be converted to Unicode anyway?
Because non-ASCII characters are represented differently in different encodings. Unicode is universal, but other encodings are not. And Python supports hundreds of encodings.
Take the Euro sign (€, Unicode codepoint U+20AC), for example. It does not exist in ASCII or in most ISO-8859-X encodings, but it is encoded as byte 0xA4 in ISO-8859-7, -15, and -16, and as byte 0x88 in Windows-1251. Meanwhile, byte 0xA4 represents different Unicode codepoints in other encodings: it is ¤ (U+00A4 CURRENCY SIGN) in ISO-8859-1, but Є (U+0404 CYRILLIC CAPITAL LETTER UKRAINIAN IE) in ISO-8859-5, etc.
So how do you expect Python to convert 0xA4 to Unicode? Should it convert to U+00A4, U+0404, or U+20AC?
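You can see the ambiguity directly by decoding the same byte with different codecs (Python 3 syntax, added here for illustration):
>>> b'\xa4'.decode('iso-8859-1')
'¤'
>>> b'\xa4'.decode('iso-8859-5')
'Є'
>>> b'\xa4'.decode('iso-8859-15')
'€'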
So, string encoding matters!
See The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
Parsing some HTML content I got the following string:
АБВ\u003d\"res
The common advice on handling it appears to be to decode using unicode_escape. However, this results in the following:
ÐÐÐ="res
The escaped characters get decoded correctly, but the Cyrillic letters get mangled for some reason. Other than using a regex to extract everything that looks like a Unicode escape, decoding only those parts with unicode_escape and then stitching everything back into a new string, what other methods exist for decoding strings with Unicode escape sequences in Python?
unicode_escape treats the input as Latin-1 encoded; any bytes that do not form a Python string-literal escape sequence are mapped directly to Unicode codepoints. You gave it UTF-8 bytes, so each Cyrillic character, represented by two bytes in UTF-8, was decoded to two Latin-1 characters, one of which is U+00D0 Ð and the other unprintable:
>>> print repr('АБВ\\u003d\\"res')
'\xd0\x90\xd0\x91\xd0\x92\\u003d\\"res'
>>> print repr('АБВ\\u003d\\"res'.decode('latin1'))
u'\xd0\x90\xd0\x91\xd0\x92\\u003d\\"res'
>>> print 'АБВ\\u003d\\"res'.decode('latin1')
ÐÐÐ\u003d\"res
This kind of mis-decoding is called a Mojibake, and can be repaired by re-encoding to Latin-1, then decoding from the correct codec (UTF-8 in your case):
>>> print 'АБВ\u003d\\"res'.decode('unicode_escape')
ÐÐÐ="res
>>> print 'АБВ\u003d\\"res'.decode('unicode_escape').encode('latin1').decode('utf8')
АБВ="res
Note that this will fail if the \uhhhh escape sequences encode codepoints outside of the Latin-1 range (U+0000-U+00FF).
The Python 3 equivalent of the above uses codecs.decode():
>>> import codecs
>>> codecs.decode('АБВ\\u003d\\"res', 'unicode_escape').encode('latin1').decode('utf8')
'АБВ="res'
The regex really is the easiest solution (Python 3):
import re

text = 'АБВ\\u003d\\"re'
re.sub(r'(?i)(?<!\\)(?:\\\\)*\\u([0-9a-f]{4})', lambda m: chr(int(m.group(1), 16)), text)
This works fine with any 4-nibble Unicode escape, and can be pretty easily extended to other escapes.
For Python 2, make all strings u'' strings, and use unichr.
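To illustrate the "extended to other escapes" point, here is one possible variant (a sketch; the helper name unescape_unicode is mine) that also handles \UXXXXXXXX escapes and keeps any literal backslash pairs in front of the escape instead of swallowing them:
import re

def unescape_unicode(text):
    # Match \uXXXX or \UXXXXXXXX that is not itself escaped; group 1 keeps any
    # preceding literal backslash pairs so they survive the substitution.
    pattern = r'(?<!\\)((?:\\\\)*)\\(?:u([0-9a-fA-F]{4})|U([0-9a-fA-F]{8}))'
    return re.sub(pattern,
                  lambda m: m.group(1) + chr(int(m.group(2) or m.group(3), 16)),
                  text)

print(unescape_unicode('euro: \\u20ac, snowman: \\U00002603'))
# euro: €, snowman: ☃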
I need to test whether a string is Unicode, and then whether it's UTF-8. After that, I need to get the string's length in bytes, including the BOM if it ever uses one. How can this be done in Python?
Also for didactic purposes, what does a byte list representation of a UTF-8 string look like? I am curious how a UTF-8 string is represented in Python.
Later edit: pprint does that pretty well.
try:
    string.decode('utf-8')
    print "string is UTF-8, length %d bytes" % len(string)
except UnicodeError:
    print "string is not UTF-8"
In Python 2, str is a sequence of bytes and unicode is a sequence of characters. You use str.decode to decode a byte sequence to unicode, and unicode.encode to encode a sequence of characters to str. So for example, u"é" is the unicode string containing the single character U+00E9 and can also be written u"\xe9"; encoding into UTF-8 gives the byte sequence "\xc3\xa9".
In Python 3, this is changed; bytes is a sequence of bytes and str is a sequence of characters.
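A quick Python 3 illustration of the same é round trip (added; not part of the original answer):
>>> "é".encode("utf-8")
b'\xc3\xa9'
>>> b'\xc3\xa9'.decode("utf-8")
'é'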
To Check if Unicode
>>> a = u'F'
>>> isinstance(a, unicode)
True
To Check if it is UTF-8 or ASCII
>>> import chardet
>>> encoding = chardet.detect('AA')
>>> encoding['encoding']
'ascii'
I would definitely recommend Joel Spolsky's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know about Unicode and Character Sets (No Excuses!), if you haven't already read it.
For Python's Unicode and encoding/decoding machinery, start here. To get the byte-length of a Unicode string encoded in utf-8, you could do:
print len(my_unicode_string.encode('utf-8'))
Your question is tagged python-2.5, but be aware that this changes somewhat in Python 3+.
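As for the didactic aside about what the bytes of a UTF-8 string look like, one simple way to inspect them is shown below (an added illustration in the same Python 2 style, assuming a terminal that accepts the é literal):
>>> map(ord, u"héllo".encode("utf-8"))
[104, 195, 169, 108, 108, 111]
>>> [hex(ord(c)) for c in u"héllo".encode("utf-8")]
['0x68', '0xc3', '0xa9', '0x6c', '0x6c', '0x6f']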
I get a string from a function that is represented like u'\xd0\xbc\xd0\xb0\xd1\x80\xd0\xba\xd0\xb0', but to process it I need it to be a bytestring (like '\xd0\xbc\xd0\xb0\xd1\x80\xd0\xba\xd0\xb0').
How do I convert it without changes?
My best guess so far is to take s.encode('unicode_escape'), which will return '\\xd0\\xbc\\xd0\\xb0\\xd1\\x80\\xd0\\xba\\xd0\\xb0', and then process the escape sequences so that each escape like '\\xd0' becomes the single character '\xd0'.
ISO 8859-1 (aka Latin-1) maps the first 256 Unicode codepoints to their byte values.
>>> u'\xd0\xbc\xd0\xb0\xd1\x80\xd0\xba\xd0\xb0'.encode('latin-1')
'\xd0\xbc\xd0\xb0\xd1\x80\xd0\xba\xd0\xb0'
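If the eventual goal is the readable text rather than the raw bytes, those bytes happen to look like UTF-8 (a guess based on the \xd0/\xd1 lead bytes), so you could go one step further:
>>> print u'\xd0\xbc\xd0\xb0\xd1\x80\xd0\xba\xd0\xb0'.encode('latin-1').decode('utf-8')
марка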