Given a byte string, for instance B = b"\x81\xc9\x00\x07I ABCD_\xe2\x86\x97_", I want to convert it to the printable string that stays as close to valid UTF-8 as possible: S = "\\x81\\xc9\\x00\\x07I ABCD_↗_". Note that the first group of hex bytes are not valid UTF-8, but the last three bytes before the final underscore do form a valid UTF-8 character (the arrow). It seems like this should be part of codecs, but I cannot figure out how to make it happen.
For instance:
>>> codecs.decode(codecs.escape_encode(B, 'utf-8')[0], 'utf-8')
'\\x81\\xc9\\x00\\x07I ABCD_\\xe2\\x86\\x97_'
escapes the bytes of the valid UTF-8 character (the arrow) along with the invalid bytes.
Specifying 'backslashreplace' as the error handling mode when decoding a bytestring will replace un-decodable bytes with backslashed escape sequences:
decoded = B.decode('utf-8', errors='backslashreplace')
Also, this is a decoding operation, not an encoding operation. Decoding is bytes->string. Encoding is string->bytes.
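Applied to the byte string from the question, that looks like this (a minimal sketch; note that bytes such as \x00 and \x07 are themselves valid UTF-8, so they decode to real control characters rather than to literal escape text):
>>> B = b"\x81\xc9\x00\x07I ABCD_\xe2\x86\x97_"
>>> B.decode('utf-8', errors='backslashreplace')
'\\x81\\xc9\x00\x07I ABCD_↗_'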
How come the following works without any errors in Python?
>>> '你好'.encode('UTF-8').decode('ISO8859-1')
'ä½\xa0好'
>>> _.encode('ISO8859-1').decode('UTF-8')
'你好'
I would have expected it to fail with a UnicodeEncodeError or UnicodeDecodeError
Is there some property of ISO8859-1 and UTF-8 such that I can take any UTF-8 encoded string and decode it to a ISO8859-1 string, which can later be reversed to get the original UTF-8 string?
I'm working with an older database that only supports the ISO8859-1 character set. It seems like the developers were able to store Chinese and other languages in this database by decoding UTF-8 encoded strings into ISO8859-1, and storing the resulting garbage string in the database. Downstream systems which query this database then have to encode the garbage string in ISO8859-1 and then decode the result with UTF-8 to get the correct string.
I would have assumed that such a process would not work at all.
What am I missing?
The special property of ISO-8859-1 is that the 256 characters it represents correspond 1:1 with the first 256 Unicode code points, so byte 00h decodes to U+0000, and byte FFh decodes to U+00FF.
So if you encode as UTF-8 and decode as ISO-8859-1 you get a Unicode string made up of code points whose values match the UTF-8 bytes encoded:
>>> s = '你好'
>>> s.encode('utf8').hex()
'e4bda0e5a5bd'
>>> u = s.encode('utf8').decode('iso-8859-1')
>>> u
'ä½\xa0好'
>>> for c in u:
...     print(f'{c} U+{ord(c):04X}')
...
ä U+00E4 # Unicode code points are the same as the bytes of UTF-8.
½ U+00BD
U+00A0
å U+00E5
¥ U+00A5
½ U+00BD
>>> u.encode('iso-8859-1').hex() # transform back to bytes.
'e4bda0e5a5bd'
>>> u.encode('iso-8859-1').decode('utf8') # and decode to UTF-8 again.
'你好'
Any 8-bit encoding that has a character assigned to every one of the 256 byte values would also work; the Unicode code points just wouldn't match the byte values. Code Page 1256 is one such encoding:
>>> for c in s.encode('utf8').decode('cp1256'):
...     print(f'{c} U+{ord(c):04X}')
...
ن U+0646 # This would still .encode('cp1256') back to byte E4, for example
½ U+00BD
U+00A0
ه U+0647
¥ U+00A5
½ U+00BD
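And the same round trip restores the original string with cp1256 as well (a quick sketch, relying on each of those characters encoding back to its original byte, as noted above):
>>> s.encode('utf8').decode('cp1256').encode('cp1256').decode('utf8')
'你好'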
No, there is no special property of ISO8859-1 needed here, just a property common to many 8-bit encodings: they accept every byte value from 0 to 255.
So decode('ISO8859-1') just transforms the bytes into 256 distinct characters (including control codes) in a unique way. You then perform the reverse operation, so you lose nothing.
This works with most old 8-bit encodings: every byte just needs a corresponding Unicode code point (because Python expects strings to be Unicode strings).
Note: ISO8859-1 really is special with respect to Unicode: the first 256 code points of Unicode correspond to the Latin-1 characters (with the same numbers). But that doesn't matter much for your experiment.
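A quick sketch of that property: all 256 byte values decode under ISO8859-1 and survive the round trip (shown here in Python 3):
>>> bytes(range(256)).decode('iso8859-1').encode('iso8859-1') == bytes(range(256))
True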
I have a byte string which I'm decoding to unicode in python using .decode('unicode-escape'). This returns a unicode string. Encoding this unicode string to obtain it in byte form again however returns a different byte string. Why is this, and how can I decode and encode in a way that preserves the original data?
Examples:
some_bytes = b'7Q\x82\xacqo\xbb\x0f\x03\x105\x93<\xebD\xbe\xde\xad\x82\xf9\xa6\x1cX\x01N\x8c\xff\x9e\x84\x1e\xa1\x97'
some_bytes.decode('unicode-escape')
yields: 7Q¬qo»5<ëD¾Þù¦XNÿ¡
some_bytes.decode('unicode-escape').encode()
yields: b'7Q\xc2\x82\xc2\xacqo\xc2\xbb\x0f\x03\x105\xc2\x93<\xc3\xabD\xc2\xbe\xc3\x9e\xc2\xad\xc2\x82\xc3\xb9\xc2\xa6\x1cX\x01N\xc2\x8c\xc3\xbf\xc2\x9e\xc2\x84\x1e\xc2\xa1\xc2\x97'
The \xc2 and \xc3 bytes are UTF-8 lead bytes for code points in the range U+0080 to U+00FF. For example, superscript two (², U+00B2) is encoded in UTF-8 as \xc2\xb2.
So when you encode the decoded string (with the default UTF-8 codec), a lead byte is added before every such code point, which is why the result is longer than the original.
For more details, see the link below:
https://www.utf8-chartable.de/unicode-utf8-table.pl?start=128&number=128&utf8=string-literal&unicodeinhtml=hex
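If the goal is simply to get the original bytes back, one option (a sketch, assuming the byte string contains no literal backslash bytes, which unicode-escape would otherwise interpret as escape sequences) is to encode with latin-1, since the unicode-escape decode mapped each byte to the code point with the same value:
some_bytes = b'7Q\x82\xacqo\xbb\x0f\x03\x105\x93<\xebD\xbe\xde\xad\x82\xf9\xa6\x1cX\x01N\x8c\xff\x9e\x84\x1e\xa1\x97'
text = some_bytes.decode('unicode-escape')
text.encode('latin-1') == some_bytes  # True: each code point maps back to a single byte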
Is it possible to construct a unicode string that the utf-8 codec cannot encode?
From the documentation (https://docs.python.org/2/library/codecs.html), it appears that the utf-8 codec can encode a symbol in "any language". The docs also note when a codec can only encode certain characters or only the Basic Multilingual Plane. I don't know whether this is equivalent to saying "it is impossible to construct a unicode value that cannot be converted to a bytestring using the utf-8 codec", however.
Here's the table entry for the utf-8 codec.
Codec    Aliases          Purpose
utf_8    U8, UTF, utf8    all languages
The motivation here is that I have a utility function that takes either a unicode string or a byte string and converts it to a byte string. When given a byte string it is a no-op. This function is not supposed to throw an exception unless it is called with a non-string type and in that case it's supposed to fail informatively with a TypeError that will be caught later and logged. (We can still run into problems if the repr of the item we attempted to insert into the exception message is too big, but let's ignore that for now).
I'm using the strict setting because I want this function to throw an exception in the event that it encounters a unicode object that it cannot encode, but am hoping that that isn't possible.
import codecs

def utf8_to_bytes(item):
    """take a bytes or unicode object and convert it to bytes,
    using utf-8 if necessary"""
    if isinstance(item, bytes):
        return item
    elif isinstance(item, unicode):
        return codecs.encode(item, 'utf-8', 'strict')
    else:
        raise TypeError("item must be bytes or unicode. got %r" % type(item))
UTF-8 is designed to encode all of the Unicode standard. Encoding Unicode text to UTF-8 will not normally throw an exception.
From the Wikipedia article on the codec:
UTF-8 is a character encoding capable of encoding all possible characters, or code points, defined by Unicode
The Python 2 UTF-8 encoding has no edge-cases that I know of; non-BMP data and surrogate pairs are handled just the same:
>>> import sys
>>> hex(sys.maxunicode) # a narrow UCS-2 build
'0xffff'
>>> len(u'\U0001F525')
2
>>> u'\U0001F525'.encode('utf-8')
'\xf0\x9f\x94\xa5'
>>> u'\ud83d\udd25'
u'\U0001f525'
>>> len(u'\ud83d\udd25')
2
>>> u'\ud83d\udd25'.encode('utf-8')
'\xf0\x9f\x94\xa5'
Note that strict is the default encoding mode. You don't need to use the codecs module either, just use the encode method on the unicode object:
return item.encode('utf-8')
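Either way the behaviour is the same; a quick interactive check using the function from the question (a Python 2 sketch):
>>> utf8_to_bytes(u'\U0001F525')
'\xf0\x9f\x94\xa5'
>>> utf8_to_bytes(b'already bytes')
'already bytes'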
In Python 3, the situation is slightly more complicated. Decoding and encoding surrogate pairs is restricted; the official standard states such characters should only ever appear in UTF-16 encoded data, and then only in a low and high pair.
As such, you need to explicitly state that you want to support such codepoints with the surrogatepass error handler:
Allow encoding and decoding of surrogate codes. These codecs normally treat the presence of surrogates as an error.
The only difference between surrogatepass and strict is that surrogatepass will allow you to encode any surrogate codepoints in your Unicode text to UTF-8. You'd only get such data in rare circumstances: when the codepoints are defined as literals, or when they were accidentally left unpaired in UTF-16 data and then decoded using surrogatepass.
So, in Python 3, only if there is a chance your Unicode text was produced by a surrogatepass decode, or contains such codepoints as literals, would you need item.encode('utf8', 'surrogatepass') to be absolutely certain that every possible Unicode value can be encoded.
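For example (a Python 3 sketch using a lone surrogate as the input):
>>> lone = '\ud83d'  # an unpaired high surrogate
>>> lone.encode('utf-8')  # strict, the default, refuses it
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83d' in position 0: surrogates not allowed
>>> lone.encode('utf-8', 'surrogatepass')
b'\xed\xa0\xbd'
>>> b'\xed\xa0\xbd'.decode('utf-8', 'surrogatepass')
'\ud83d'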
I have a legacy code segment that always calls encode('utf-8') on the strings I pass in (which come directly from the database). Is there a way to transform a unicode string into another form that can be encoded to UTF-8 again without raising an error? I am not allowed to change the legacy code segment.
I've tried decoding it first but it returns this error
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
If I leave the unicode string as is it returns
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe8 in position 986: ordinal not in range(128)
If I change the legacy code to not encode('utf-8') it works, but this is not a viable option
Edit:
Here is the code snippet
#!/usr/bin/env python2
# -*- coding: utf-8 -*-
if __name__ == "__main__":
    # 1
    a = u'贸易'
    # 2
    a = a.decode('utf-8')
    # 3
    a.encode('utf-8')
For some reason, if I skip #2 I don't get the error mentioned above. I double-checked the type of the string: both are unicode and hold the same characters, yet the string in the code I am working on cannot be encoded or decoded to UTF-8, while the same characters in this snippet can.
Consider the following cases:
If you want a unicode string, and you already have a unicode string, you need do nothing.
If you want a bytestring, and you already have a bytestring, you need do nothing.
If you have a unicode string and want a bytestring, you encode it.
If you have a bytestring and want a unicode string, you decode it.
In none of these cases is it appropriate to encode or decode more than once.
In order for encode('utf-8') to make sense, the string must be a unicode string (or contain all-ASCII characters...). So, unless it's a unicode instance already, you have to decode it first from whatever encoding it's in to a unicode string, after which you can pass it into your legacy interface.
At no point does it make sense for anything to be double-encoded -- encoding takes a string and transforms it into a series of bytes; decoding takes a series of bytes and transforms them back into a string. The confusion only arises because Python 2 uses the same str type for both plain-ASCII text and raw byte sequences.
>>> u'é'.encode('utf-8') # unicode string
'\xc3\xa9' # bytes, not unicode string
>>> '\xc3\xa9'.decode('utf-8')
u'\xe9' # unicode string
>>> u'\xe9' == u'é'
True
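In the legacy-code situation above, that means decoding only when you actually hold bytes; a possible sketch (Python 2; to_unicode is a hypothetical helper, not part of the legacy interface):
def to_unicode(value, encoding='utf-8'):
    # Return a unicode string, decoding only if we were handed bytes.
    if isinstance(value, unicode):
        return value  # already unicode; decoding again would trigger the implicit ascii error
    return value.decode(encoding)

a = u'贸易'
to_unicode(a).encode('utf-8')  # now safe to hand to the legacy encode('utf-8') call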
I have already tried all the previous answers and solutions.
I am trying to use this value, which gives me an encoding-related error.
ar = [u'http://dbpedia.org/resource/Anne_Hathaway', u'http://dbpedia.org/resource/Jodie_Bain', u'http://dbpedia.org/resource/Wendy_Divine', u'http://dbpedia.org/resource/Jos\xe9_El\xedas_Moreno', u'http://dbpedia.org/resource/Baaba_Maal']
So I tried,
d = [x.decode('utf-8') for x in ar]
which gives:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 31: ordinal not in range(128)
I tried out
d = [x.encode('utf-8') for x in ar]
which removes the error but changes the original content:
the original value u'http://dbpedia.org/resource/Jos\xe9_El\xedas_Moreno' was converted to 'http://dbpedia.org/resource/Jos\xc3\xa9_El\xc3\xadas_Moreno' by encode.
What is the correct way to deal with this scenario?
Edit
The error comes when I feed these links into
req = urllib2.Request()
The second version of your string is the correct utf-8 representation of your original unicode string. If you want to have a meaningful comparison, you have to use the same representation for both the stored string and the user input string. The sane thing to do here is to always use Unicode string internally (in your code), and make sure both your user inputs and stored strings are correctly decoded to unicode from their respective encodings at your system's boundaries (storage subsystem and user inputs subsystem).
Also you seem to be a bit confused about unicode and encodings, so reading this and this might help.
Unicode strings in Python are "raw" unicode, so make sure to .encode() and .decode() them as appropriate. Using UTF-8 encoding is considered a best practice among development groups all over the world.
To encode use the quote function from the urllib2 library:
from urllib2 import quote
escaped_string = quote(unicode_string.encode('utf-8'))
To decode, use unquote:
from urllib2 import unquote
src = "http://dbpedia.org/resource/Jos\xc3\xa9_El\xc3\xadas_Moreno"
unicode_string = unquote(src).decode('utf-8')
Also, if you're interested in more Unicode and UTF-8 work, check out the Unicode HOWTO.
In your Unicode list, u'http://dbpedia.org/resource/Jos\xe9_El\xedas_Moreno' is an ASCII-safe way to represent a Unicode string. When encoded in a form that supports the full Western European character set, such as UTF-8, it's: http://dbpedia.org/resource/José_Elías_Moreno
Your .encode("UTF-8") is correct and would have looked OK in a UTF-8 editor or browser. What you saw after the encode was an ASCII-safe representation of the UTF-8.
For example, your trouble chars were é and í.
é = 00E9 Unicode = C3A9 UTF-8
í = 00ED Unicode = C3AD UTF-8
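A quick interactive check of those two characters (a Python 2 sketch) matches the values above:
>>> u'\xe9'.encode('utf-8').encode('hex')
'c3a9'
>>> u'\xed'.encode('utf-8').encode('hex')
'c3ad'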
In short, your .encode() method is correct and should be used for writing to files or to a browser.