Python - convert unicode and hex to unicode

I have a supposedly unicode string like this:
u'\xc3\xa3\xc6\u2019\xc2\xa9\xc3\xa3\xc6\u2019\xe2\u20ac\u201c\xc3\xa3\xc6\u2019\xc2\xa9\xc3\xa3\xe2\u20ac\u0161\xc2\xa4\xc3\xa3\xc6\u2019\xe2\u20ac\u201c\xc3\xaf\xc2\xbc\xc2\x81\xc3\xa3\xe2\u20ac\u0161\xc2\xb9\xc3\xa3\xe2\u20ac\u0161\xc2\xaf\xc3\xa3\xc6\u2019\xc2\xbc\xc3\xa3\xc6\u2019\xc2\xab\xc3\xa3\xe2\u20ac\u0161\xc2\xa2\xc3\xa3\xe2\u20ac\u0161\xc2\xa4\xc3\xa3\xc6\u2019\xe2\u20ac\xb0\xc3\xa3\xc6\u2019\xc2\xab\xc3\xa3\xc6\u2019\xe2\u20ac\xa2\xc3\xa3\xe2\u20ac\u0161\xc2\xa7\xc3\xa3\xe2\u20ac\u0161\xc2\xb9\xc3\xa3\xc6\u2019\xe2\u20ac\xa0\xc3\xa3\xe2\u20ac\u0161\xc2\xa3\xc3\xa3\xc6\u2019\xc2\x90\xc3\xa3\xc6\u2019\xc2\xab\xc3\xaf\xc2\xbc\xcb\u2020\xc3\xa3\xe2\u20ac\u0161\xc2\xb9\xc3\xa3\xe2\u20ac\u0161\xc2\xaf\xc3\xa3\xc6\u2019\xe2\u20ac\xa2\xc3\xa3\xe2\u20ac\u0161\xc2\xa7\xc3\xa3\xe2\u20ac\u0161\xc2\xb9\xc3\xaf\xc2\xbc\xe2\u20ac\xb0'
How do I get the correct unicode string out of this? I think the actual unicode value is ラブライブ!スクールアイドルフェスティバル(スクフェス)

You have a Mojibake, an incorrectly decoded piece of text.
You can use the ftfy library to un-do the damage:
>>> from ftfy import fix_text
>>> fix_text(s)
u'\u30e9\u30d6\u30e9\u30a4\u30d6!\u30b9\u30af\u30fc\u30eb\u30a2\u30a4\u30c9\u30eb\u30d5\u30a7\u30b9\u30c6\u30a3\u30d0\u30eb(\u30b9\u30af\u30d5\u30a7\u30b9)'
>>> print fix_text(s)
ラブライブ!スクールアイドルフェスティバル(スクフェス)
According to ftfy, your data was encoded as UTF-8, then decoded as Windows codepage 1252; the ftfy.fixes.fix_one_step_and_explain() function shows the repair steps needed:
>>> ftfy.fixes.fix_one_step_and_explain(s)[-1]
[(u'encode', u'sloppy-windows-1252', 0), (u'decode', u'utf-8', 0)]
(the 'sloppy' encoding is needed because not all UTF-8 bytes can be decoded as cp1252, but some bad decoders then just copy the original byte; the special codec reverses that process).
In fact, in your case this was done twice, not a feat I had seen before:
>>> print s.encode('sloppy-cp1252').decode('utf8').encode('sloppy-cp1252').decode('utf8')
ラブライブ!スクールアイドルフェスティバル(スクフェス)
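For reference, the same double repair as a minimal Python 3 sketch, assuming ftfy is installed (importing ftfy.bad_codecs is what registers the 'sloppy-' codecs):
import ftfy.bad_codecs  # registers the 'sloppy-windows-1252' codec

def undo_double_mojibake(s):
    # each round trip reverses one wrong decode-as-cp1252 step
    for _ in range(2):
        s = s.encode('sloppy-windows-1252').decode('utf-8')
    return s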

Related

Two unicode encodings represent 1 cyrillic letter

I have the following string, in unicode and in utf-8 representation:
\u00d0\u0095\u00d1\u0081\u00d0\u00bb\u00d0\u00b8\u00d0\u00bf\u00d0\u00be\u00d0\u00b2\u00d0\u00b5\u00d0\u00b7\u00d0\u00b5\u00d1\u0082 \u00d1\u0082\u00d0\u00be\u00d1\u0081\u00d0\u00b5\u00d0\u00b3\u00d0\u00be\u00d0\u00b4\u00d0\u00bd\u00d1\u008f\u00d1\u0083\u00d0\u00b6\u00d0\u00b5\u00d1\u0081\u00d0\u00ba\u00d0\u00b8\u00d0\u00bd\u00d1\u0083
and
ЕÑли повезет то ÑÐµÐ³Ð¾Ð´Ð½Ñ ÑƒÐ¶Ðµ Ñкину.
The desired output is "Если повезет то сегодня уже скину".
I have tried all possible encodings but still wasn't able to get it in complete cyrillic form.
The best I got was
'�?�?ли повезе�? �?о �?егодн�? �?же �?кин�?'
using windows-1252.
And I've also noticed that each cyrillic letter in the desired string corresponds to two unicode escapes.
For example: \u00d0\u0095 = 'Е'.
Maybe someone knows what encoding and how to use it to get a normal result?
You have a mis-decoded string where the UTF-8 bytes were translated as ISO-8859-1 (also known as latin1). Ideally, re-download with the correct encoding, but you can also encode with the wrongly-used encoding to regain the original byte stream, then decode with the right encoding (UTF-8):
Python:
>>> s = '\u00d0\u0095\u00d1\u0081\u00d0\u00bb\u00d0\u00b8\u00d0\u00bf\u00d0\u00be\u00d0\u00b2\u00d0\u00b5\u00d0\u00b7\u00d0\u00b5\u00d1\u0082 \u00d1\u0082\u00d0\u00be\u00d1\u0081\u00d0\u00b5\u00d0\u00b3\u00d0\u00be\u00d0\u00b4\u00d0\u00bd\u00d1\u008f\u00d1\u0083\u00d0\u00b6\u00d0\u00b5\u00d1\u0081\u00d0\u00ba\u00d0\u00b8\u00d0\u00bd\u00d1\u0083'
>>> s
'Ð\x95Ñ\x81Ð»Ð¸Ð¿Ð¾Ð²ÐµÐ·ÐµÑ\x82 Ñ\x82Ð¾Ñ\x81ÐµÐ³Ð¾Ð´Ð½Ñ\x8fÑ\x83Ð¶ÐµÑ\x81ÐºÐ¸Ð½Ñ\x83'
>>> print(s)
ÐÑÐ»Ð¸Ð¿Ð¾Ð²ÐµÐ·ÐµÑ ÑÐ¾ÑÐµÐ³Ð¾Ð´Ð½ÑÑÐ¶ÐµÑÐºÐ¸Ð½Ñ
>>> s.encode('latin1')
b'\xd0\x95\xd1\x81\xd0\xbb\xd0\xb8\xd0\xbf\xd0\xbe\xd0\xb2\xd0\xb5\xd0\xb7\xd0\xb5\xd1\x82 \xd1\x82\xd0\xbe\xd1\x81\xd0\xb5\xd0\xb3\xd0\xbe\xd0\xb4\xd0\xbd\xd1\x8f\xd1\x83\xd0\xb6\xd0\xb5\xd1\x81\xd0\xba\xd0\xb8\xd0\xbd\xd1\x83'
>>> s.encode('latin1').decode('utf8')
'Еслиповезет тосегодняужескину'
You may also have a literal string of Unicode escape codes, which is a bit trickier:
>>> s=r'\u00d0\u0095\u00d1\u0081\u00d0\u00bb\u00d0\u00b8\u00d0\u00bf\u00d0\u00be\u00d0\u00b2\u00d0\u00b5\u00d0\u00b7\u00d0\u00b5\u00d1\u0082 \u00d1\u0082\u00d0\u00be\u00d1\u0081\u00d0\u00b5\u00d0\u00b3\u00d0\u00be\u00d0\u00b4\u00d0\u00bd\u00d1\u008f\u00d1\u0083\u00d0\u00b6\u00d0\u00b5\u00d1\u0081\u00d0\u00ba\u00d0\u00b8\u00d0\u00bd\u00d1\u0083'
>>> print(s)
\u00d0\u0095\u00d1\u0081\u00d0\u00bb\u00d0\u00b8\u00d0\u00bf\u00d0\u00be\u00d0\u00b2\u00d0\u00b5\u00d0\u00b7\u00d0\u00b5\u00d1\u0082 \u00d1\u0082\u00d0\u00be\u00d1\u0081\u00d0\u00b5\u00d0\u00b3\u00d0\u00be\u00d0\u00b4\u00d0\u00bd\u00d1\u008f\u00d1\u0083\u00d0\u00b6\u00d0\u00b5\u00d1\u0081\u00d0\u00ba\u00d0\u00b8\u00d0\u00bd\u00d1\u0083
In this case, the string has to be converted back to bytes, decoded as Unicode escapes, then encoded back to bytes and correctly decoded as UTF-8. latin1 has the feature that the first 256 code points of Unicode map to bytes 0-255 in that codec, so it converts code points to byte values 1:1.
>>> s.encode('latin1').decode('unicode-escape').encode('latin1').decode('utf8')
'Еслиповезет тосегодняужескину'
d0 95 d1 81 d0 bb d0 b8 is the correct UTF-8 octet stream for "Если".
So you need to convert each character to a byte (8-bit word, octet) by dropping the high byte of the code point (which is always 0 in your example anyway), then decode the result as UTF-8.
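A minimal sketch of that per-character conversion, equivalent to the latin1 round-trip shown above:
>>> s = '\u00d0\u0095\u00d1\u0081\u00d0\u00bb\u00d0\u00b8'  # mis-decoded "Если"
>>> bytes(ord(c) for c in s).decode('utf-8')  # code point -> byte, then UTF-8
'Если'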
Or better, go back to the source from which you got this, and make sure the stream of octets is not seen as single-byte encoding.

Python3 - reinterpret string as bytes [duplicate]

Is there a way to convert a \x escaped string like "\\xe8\\xaa\\x9e\\xe8\\xa8\\x80" into readable form: "語言"?
>>> a = "\\xe8\\xaa\\x9e\\xe8\\xa8\\x80"
>>> print(a)
\xe8\xaa\x9e\xe8\xa8\x80
I am aware that there is a similar question here, but it seems the solution is only for latin characters. How can I convert this form of string into readable CJK characters?
Decode it first using 'unicode-escape', then as 'utf8':
a = "\\xe8\\xaa\\x9e\\xe8\\xa8\\x80"
decoded = a.encode('latin1').decode('unicode_escape').encode('latin1').decode('utf8')
print(decoded)
# 語言
Note that since we can only decode bytes objects, we need to transparently encode it in between, using 'latin1'.
Starting with string a, which appears to follow Python's hex escaping rules, you can decode it to a bytes object plus the length of the string decoded.
>>> a = "\\xe8\\xaa\\x9e\\xe8\\xa8\\x80"
>>> import codecs
>>> codecs.escape_decode(a)
(b'\xe8\xaa\x9e\xe8\xa8\x80', 24)
You don't need the length here, so just get item 0. Now it's time for some guessing. Assuming that this string actually represents a utf-8 encoding, you now have a bytes object that you can decode:
>>> codecs.escape_decode(a)[0].decode('utf-8')
'語言'
If the underlying encoding was different (say, a Windows CJK code page), you'd have to decode with its decoder.
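If you need to guess, a small sketch that probes a few plausible codecs (the candidate list here is an assumption; only inspecting the output tells you which guess is right):
>>> import codecs
>>> a = "\\xe8\\xaa\\x9e\\xe8\\xa8\\x80"
>>> raw = codecs.escape_decode(a)[0]
>>> for codec in ('utf-8', 'gbk', 'big5', 'shift_jis'):
...     try:
...         print(codec, '->', raw.decode(codec))
...     except UnicodeDecodeError:
...         print(codec, '-> not valid')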
Text like this could make a valid Python bytes literal. Assuming we don't have to worry about invalid input, we can simply construct a string that looks like the corresponding source code, and use ast.literal_eval to interpret it that way (this is safe, unlike using eval). Finally we decode the resulting bytes as UTF-8. Thus:
>>> a = "\\xe8\\xaa\\x9e\\xe8\\xa8\\x80"
>>> import ast
>>> ast.literal_eval(f"b'{a}'")
b'\xe8\xaa\x9e\xe8\xa8\x80'
>>> ast.literal_eval(f"b'{a}'").decode('utf-8')
'語言'
Such a codec is missing from the stdlib. My package all-escapes registers a codec which can be used:
>>> a = "\\xe8\\xaa\\x9e\\xe8\\xa8\\x80"
>>> a.encode('all-escapes').decode()
'語言'

Python: decoding a string that consists of both unicode code points and unicode text

Parsing some HTML content I got the following string:
АБВ\u003d\"res
The common advice on handling it appears to be to decode using unicode_escape. However, this results in the following:
ÐÐÐ="res
The escaped characters get correctly decoded, but cyrillic letters for some reason get mangled. Other than using regexes to extract everything that looks like a unicode string, decoding only them using unicode_escape and then putting everything into a new string, what other methods exist to decode strings with unicode code points in Python?
unicode_escape treats the input as Latin-1 encoded; any bytes that do not form a Python string literal escape sequence are decoded by mapping the bytes directly to Unicode codepoints. You gave it UTF-8 bytes, so the cyrillic characters are represented with 2 bytes each, which are decoded to two Latin-1 characters each, one of which is U+00D0 Ð, the other unprintable:
>>> print repr('АБВ\\u003d\\"res')
'\xd0\x90\xd0\x91\xd0\x92\\u003d\\"res'
>>> print repr('АБВ\\u003d\\"res'.decode('latin1'))
u'\xd0\x90\xd0\x91\xd0\x92\\u003d\\"res'
>>> print 'АБВ\\u003d\\"res'.decode('latin1')
ÐÐÐ\u003d\"res
This kind of mis-decoding is called a Mojibake, and can be repaired by re-encoding to Latin-1, then decoding from the correct codec (UTF-8 in your case):
>>> print 'АБВ\u003d\\"res'.decode('unicode_escape')
ÐÐÐ="res
>>> print 'АБВ\u003d\\"res'.decode('unicode_escape').encode('latin1').decode('utf8')
АБВ="res
Note that this will fail if the \uhhhh escape sequences encode codepoints outside of the Latin-1 range (U+0000-U+00FF).
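For example, an escape for a codepoint outside that range breaks the Latin-1 step:
>>> u'\u2013'.encode('latin1')  # EN DASH is above U+00FF
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2013' in position 0: ordinal not in range(256)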
The Python 3 equivalent of the above uses codecs.decode():
>>> import codecs
>>> codecs.decode('АБВ\\u003d\\"res', 'unicode_escape').encode('latin1').decode('utf8')
'АБВ="res'
The regex really is the easiest solution (Python 3):
import re
text = 'АБВ\\u003d\\"re'
# group 1 preserves any escaped backslashes that precede the \u escape
re.sub(r'(?i)(?<!\\)((?:\\\\)*)\\u([0-9a-f]{4})',
       lambda m: m.group(1) + chr(int(m.group(2), 16)), text)
This works fine with any 4-nibble Unicode escape, and can be pretty easily extended to other escapes.
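As a sketch, one way to extend it to \xhh and \Uhhhhhhhh escapes as well (the helper name is made up):
import re

_ESCAPE = re.compile(
    r'(?<!\\)((?:\\\\)*)'  # keep escaped backslashes intact
    r'\\(?:x([0-9a-fA-F]{2})|u([0-9a-fA-F]{4})|U([0-9a-fA-F]{8}))')

def unescape(text):
    def repl(m):
        code = next(g for g in m.groups()[1:] if g is not None)
        return m.group(1) + chr(int(code, 16))
    return _ESCAPE.sub(repl, text)

print(unescape('АБВ\\u003d\\"re'))  # АБВ=\"re  (only the listed escapes change)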
For Python 2, make all strings u'' strings, and use unichr.

'ascii' codec can't encode character u'\xe9'

I have already tried all the previous answers and solutions.
I am trying to use this value, which gives me an encoding-related error.
ar = [u'http://dbpedia.org/resource/Anne_Hathaway', u'http://dbpedia.org/resource/Jodie_Bain', u'http://dbpedia.org/resource/Wendy_Divine', u'http://dbpedia.org/resource/Jos\xe9_El\xedas_Moreno', u'http://dbpedia.org/resource/Baaba_Maal']
So I tried,
d = [x.decode('utf-8') for x in ar]
which gives:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 31: ordinal not in range(128)
I tried out
d = [x.encode('utf-8') for x in ar]
which removes the error but changes the original content:
The original value was u'http://dbpedia.org/resource/Jos\xe9_El\xedas_Moreno', which was converted to 'http://dbpedia.org/resource/Jos\xc3\xa9_El\xc3\xadas_Moreno' by encode.
What is the correct way to deal with this scenario?
Edit
The error comes when I feed these links into
req = urllib2.Request()
The second version of your string is the correct utf-8 representation of your original unicode string. If you want a meaningful comparison, you have to use the same representation for both the stored string and the user input string. The sane thing to do here is to always use Unicode strings internally (in your code), and make sure both your user inputs and stored strings are correctly decoded to unicode from their respective encodings at your system's boundaries (storage subsystem and user input subsystem).
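A minimal Python 2 sketch of that boundary discipline (names are illustrative):
stored = 'http://dbpedia.org/resource/Jos\xc3\xa9_El\xc3\xadas_Moreno'  # UTF-8 bytes from storage
user_input = 'Jos\xc3\xa9'  # raw UTF-8 bytes, e.g. from a web form

stored_u = stored.decode('utf-8')    # decode once, at the boundary
user_u = user_input.decode('utf-8')  # decode once, at the boundary

print user_u in stored_u  # compare as unicode: True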
Also you seem to be a bit confused about unicode and encodings, so reading this and this might help.
Unicode strings in Python are "raw" unicode, so make sure to .encode() and .decode() them as appropriate. Using the utf8 encoding is considered a best practice among many dev groups all over the world.
To encode, use the quote function from the urllib2 library:
from urllib2 import quote
escaped_string = quote(unicode_string.encode('utf-8'))
To decode, use unquote:
from urllib2 import unquote
src = "http://dbpedia.org/resource/Jos\xc3\xa9_El\xc3\xadas_Moreno"
unicode_string = unquote(src).decode('utf-8')
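Putting this together for the urllib2.Request case from the question's edit, a sketch (assuming only non-ASCII path characters need escaping, so ':' and '/' are left unquoted):
import urllib2
from urllib import quote

ar = [u'http://dbpedia.org/resource/Jos\xe9_El\xedas_Moreno']
for link in ar:
    url = quote(link.encode('utf-8'), safe=':/')  # percent-encode the UTF-8 bytes
    req = urllib2.Request(url)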
Also, if you're more interested in Unicode and UTF-8 handling, check out the Unicode HOWTO.
In your Unicode list, u'http://dbpedia.org/resource/Jos\xe9_El\xedas_Moreno' is an ASCII-safe way to represent a Unicode string. When encoded in a form that supports the full Western European character set, such as UTF-8, it's: http://dbpedia.org/resource/José_Elías_Moreno
Your .encode("UTF-8") is correct and would have looked OK in a UTF-8 editor or browser. What you saw after the encode was an ASCII-safe representation of UTF-8.
For example, your trouble chars were é and í.
é = U+00E9 in Unicode = C3 A9 in UTF-8
í = U+00ED in Unicode = C3 AD in UTF-8
In short, your .encode() method is correct and should be used for writing to files or to a browser.
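You can verify those byte sequences in a Python 2 shell:
>>> u'\xe9'.encode('utf-8')
'\xc3\xa9'
>>> u'\xed'.encode('utf-8')
'\xc3\xad'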

Unicode Normalization

Is there a possible normalization path which brings both strings below to the same value?
u'Aho\xe2\u20ac\u201cCorasick_string_matching_algorithm'
u'Aho\u2013Corasick string matching algorithm'
It looks like you have a Mojibake there, UTF-8 bytes that have been decoded as if they were Windows-1252 data instead. Your 3 'characters', encoded to Windows-1252, produce the exact 3 UTF-8 bytes for the U+2013 EN DASH character in your target string:
>>> u'\u2013'.encode('utf8')
'\xe2\x80\x93'
>>> u'\u2013'.encode('utf8').decode('windows-1252')
u'\xe2\u20ac\u201c'
You can use the ftfy module to repair that data, so you get an en dash for those bytes:
>>> import ftfy
>>> sample = u'Aho\xe2\u20ac\u201cCorasick_string_matching_algorithm'
>>> ftfy.fix_text(sample)
u'Aho\u2013Corasick_string_matching_algorithm'
then simply replace underscores with spaces:
>>> ftfy.fix_text(sample).replace('_', ' ')
u'Aho\u2013Corasick string matching algorithm'
You can also simply encode to Windows-1252 and decode again as UTF-8, but that doesn't always work: some byte values cannot legally be decoded as Windows-1252, yet some systems producing these Mojibakes do so anyway (passing the offending byte through unchanged). ftfy includes specialised repair codecs to reverse that process. In addition, it detects the specific Mojibake errors made, automating the repair across multiple possible codec errors.
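For this sample the direct round-trip does work; a quick sketch (it raises UnicodeEncodeError when the Mojibake contains characters with no Windows-1252 mapping):
>>> sample = u'Aho\xe2\u20ac\u201cCorasick_string_matching_algorithm'
>>> print sample.encode('windows-1252').decode('utf8').replace('_', ' ')
Aho–Corasick string matching algorithm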
