I am trying to compare unicode strings in Python. Since a lot of the symbols look similar and some may contain non-printable characters, I am having trouble debugging where my comparisons are failing. Is there a way to take a string of unicode characters and print their unicode codes? For example:
>>> unicode_print('❄')
'\u2744'
You can encode that string with some other encoding:
>>> s = '❄'
>>> s.encode() # "utf8" by default
b'\xe2\x9d\x84'
And for the output you specified, use the unicode_escape codec:
>>> s.encode("unicode_escape")
b'\\u2744'
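For debugging a failing comparison, it may help to list every code point on its own line, so that invisible or look-alike characters stand out. A minimal sketch, assuming Python 3; describe is a hypothetical helper name, not a standard function:

```python
import unicodedata

def describe(text):
    """Return one (code point, name) pair per character in text."""
    return [
        ("U+%04X" % ord(ch), unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]

# The second character is a zero-width space, invisible when printed.
for code, name in describe("❄\u200b"):
    print(code, name)
```

Characters without an official name (control characters, for instance) fall back to the "<unnamed>" placeholder instead of raising ValueError.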
I'm a Python newbie and I'm trying to make one script that writes some strings in a file if there's a difference. Problem is that original string has some characters in \uNNNN Unicode format and I cannot convert the new string to the same Unicode format.
The original string I'm trying to compare: \u00A1 ATENCI\u00D3N! \u25C4
New string is received as: ¡ ATENCIÓN! ◄
And this is the code:
str = u'¡ ATENCIÓN! ◄'
print(str)
str1 = str.encode('unicode_escape')
print (str1)
str2 = str1.decode()
print (str2)
And the result is:
¡ ATENCIÓN! ◄
b'\\xa1 ATENCI\\xd3N! \\u25c4'
\xa1 ATENCI\xd3N! \u25c4
So, how can I get \xa1 ATENCI\xd3N! \u25c4 converted to \u00A1 ATENCI\u00D3N! \u25C4 as this is the only Unicode format I can save?
Note: Cases of characters in strings also need to be the same for comparison.
The issue is, according to the docs (read down a little bit, between the escape sequences tables), the \u, \U, and \N Unicode escape sequences are only recognized in string literals. That means that once the literal is evaluated in memory, such as in a variable assignment:
s = "\u00A1 ATENCI\u00D3N! \u25C4"
encoding it with str.encode('unicode_escape') produces a bytes object that uses \x escapes where it can:
b'\\xa1 ATENCI\\xd3N! \\u25c4'
Using
b'\\xa1 ATENCI\\xd3N! \\u25c4'.decode("unicode_escape")
will convert it back to '¡ ATENCIÓN! ◄'. This uses the actual (intended) representation of the characters, and not the \uXXXX escape sequences of the original string s.
So, what you should do is not mess around with encoding and decoding things. Observe:
print("\u00A1 ATENCI\u00D3N! \u25C4" == '¡ ATENCIÓN! ◄')
True
That's all the comparison you need to do.
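If the file really has to store the uppercase \uXXXX form, the escapes can be built by hand instead of relying on unicode_escape. A minimal sketch, assuming Python 3 and text limited to the Basic Multilingual Plane; to_u_escapes is a hypothetical helper name, and literal backslashes in the input are passed through unchanged:

```python
def to_u_escapes(text):
    """Escape every non-ASCII character as \\uXXXX with uppercase hex digits."""
    parts = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            parts.append(ch)               # plain ASCII is kept as-is
        elif cp <= 0xFFFF:
            parts.append("\\u%04X" % cp)   # e.g. U+00A1 becomes \u00A1
        else:
            # Assumption: the target format has no agreed form for these.
            raise ValueError("character outside the BMP: %r" % ch)
    return "".join(parts)

stored = "\\u00A1 ATENCI\\u00D3N! \\u25C4"   # text as it appears in the file
new = "¡ ATENCIÓN! ◄"
print(to_u_escapes(new) == stored)
```

Going the other way, stored.encode('ascii').decode('unicode_escape') turns the escaped text back into real characters, which can then be compared directly.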
For further reading, you may be interested in:
How to work with surrogate pairs in Python?
Encodings and Unicode from the Python docs.
How do I convert a unicode character 'ב' to its corresponding Unicode character string '\u05d1' in Python?
I asked the opposite question a few days ago:
Python: convert unicode string to corresponding Unicode character
You can do something like,
>>> x
'ב'
>>> x.encode('ascii', 'backslashreplace').decode('utf-8')
'\\u05d1'
From the docs:
The errors parameter is the same as the parameter of the decode()
method but supports a few more possible handlers. As well as 'strict',
'ignore', and 'replace' (which in this case inserts a question mark
instead of the unencodable character), there is also
'xmlcharrefreplace' (inserts an XML character reference),
backslashreplace (inserts a \uNNNN escape sequence) and namereplace
(inserts a \N{...} escape sequence).
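To see how the handlers listed above differ, the same character can be encoded to ASCII with each of them. A small sketch, assuming Python 3.5 or later (namereplace is not available before that):

```python
ch = "ב"  # U+05D1 HEBREW LETTER BET
handlers = ("replace", "ignore", "xmlcharrefreplace",
            "backslashreplace", "namereplace")

# Encode the same character to ASCII with each handler in turn.
results = {name: ch.encode("ascii", name) for name in handlers}
for name, encoded in results.items():
    print(name, encoded)
```

The default handler, 'strict', is left out because it raises UnicodeEncodeError for this character.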
Something like this works
>>> hex(ord('ב'))
'0x5d1'
Python Specific Encodings:
unicode_escape - Encoding suitable as the contents of a Unicode
literal in ASCII-encoded Python source code, except that quotes are
not escaped.
'ב'.encode('unicode-escape').decode() ### '\\u05d1'
print('ב'.encode('unicode-escape').decode()) ### \u05d1
I prefer my own answer which is clean and simple:
json.dumps(unicode_character)
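Spelled out, the json approach looks roughly like this. A sketch assuming Python 3 and the default ensure_ascii=True; note that the result is wrapped in double quotes, and that JSON also escapes quotes and backslashes and writes characters above U+FFFF as surrogate pairs:

```python
import json

escaped = json.dumps("ב")   # ensure_ascii=True is the default
print(escaped)               # "\u05d1", including the double quotes
print(escaped[1:-1])         # \u05d1, with the surrounding quotes stripped
```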
decoded_string = "ב"
encoded_string = decoded_string.encode("utf-8")
I'm getting back from a library what looks to be an incorrect unicode string:
>>> title
u'Sopet\xc3\xb3n'
Now, those two hex escapes there are the UTF-8 encoding for U+00F3 LATIN SMALL LETTER O WITH ACUTE. So far as I understand, a unicode string in Python should have the actual character, not the UTF-8 encoding for the character, so I think this is incorrect and presumably a bug either in the library or in my input, right?
The question is, how do I (a) recognize that I have UTF-8 encoded text in my unicode string, and (b) convert this to a proper unicode string?
I'm stumped on (a), as there's nothing wrong, encoding-wise, about that original string (i.e., both are valid characters in their own right, u'\xc3\xb3' == Ã³, but they're not what's supposed to be there).
It looks like I can achieve (b) by eval()ing that repr() output minus the "u" in front to get a str and then decoding the str with UTF-8:
>>> eval(repr(title)[1:]).decode("utf-8")
u'Sopet\xf3n'
>>> print eval(repr(title)[1:]).decode("utf-8")
Sopetón
But that seems a bit kludgy. Is there an officially-sanctioned way to get the raw data out of a unicode string and treat that as a regular string?
a) Try to put it through the method below.
b)
>>> u'Sopet\xc3\xb3n'.encode('latin-1').decode('utf-8')
u'Sopet\xf3n'
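For part (a), one heuristic is simply to attempt the round trip and keep the original string when it fails. A minimal sketch for Python 3; fix_mojibake is a hypothetical helper name, and the check is a guess rather than proof, since a string that legitimately contains a sequence like 'Ã³' would be altered too:

```python
def fix_mojibake(text):
    """Undo UTF-8 bytes that were mis-decoded as Latin-1, when that fits."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either not representable in Latin-1, or the resulting bytes are
        # not valid UTF-8: assume the string was fine to begin with.
        return text

print(fix_mojibake("Sopet\xc3\xb3n"))   # damaged input
print(fix_mojibake("Sopetón"))          # already correct input
```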
You should use:
>>> title.encode('raw_unicode_escape')
Python2:
print(u'\xd0\xbf\xd1\x80\xd0\xb8'.encode('raw_unicode_escape'))
Python3:
print(u'\xd0\xbf\xd1\x80\xd0\xb8'.encode('raw_unicode_escape').decode('utf8'))
I would like to turn this string:
a = '\\a'
into this one
b = '\a'
It doesn't seem like there is an obvious way to do this with replace?
To be more precise, I want to change the escaping of the backslash to escaping the character a.
The character '\a' is the ASCII BEL character, chr(7).
To do the conversion in Python 2:
from __future__ import print_function
a = '\\a'
c = a.decode('string-escape')
print(repr(a), repr(c))
output
'\\a' '\x07'
And for future reference, in Python 3:
a = '\\a'
b = bytes(a, encoding='ascii')
c = b.decode('unicode-escape')
print(repr(a), repr(c))
This gives identical output to the above snippet.
In Python 3, if you were working with bytes objects you'd do something like this:
a = b'\\a'
c = bytes(a.decode('unicode-escape'), 'ascii')
print(repr(a), repr(c))
output
b'\\a' b'\x07'
As Antti Haapala mentions, this simple strategy for Python 3 won't work if the source string contains unicode characters too. In that case, please see his answer for a more robust solution.
On Python 2 you can use
>>> '\\a'.decode('string_escape')
'\x07'
Note how \a is repr'd as \x07.
If the string is a unicode string that also contains extended characters, you need to encode it to a bytestring explicitly first; otherwise the default encoding (ascii!) is used to convert the unicode object to a bytestring.
However, this codec doesn't exist in Python 3, and things are much more complicated. You can use the unicode-escape codec to decode, but it is badly broken if the source string contains non-ASCII characters too:
>>> '\\aäầ'.encode().decode('unicode_escape')
'\x07Ã¤áº§'
The resulting string doesn't consist of Unicode characters but bytes decoded as latin-1. The solution is to re-encode to latin-1 and then decode as utf8 again:
>>> '\\aäầ\u1234'.encode().decode('unicode_escape').encode('latin1').decode()
'\x07äầሴ'
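Another way to avoid the Latin-1 round trip is to decode only the escape sequences themselves and leave every other character untouched. A sketch for Python 3, assuming that only the escapes listed in the pattern need handling; unescape and ESCAPE_RE are names chosen for illustration:

```python
import re

# Matches the escape sequences Python recognises in string literals.
ESCAPE_RE = re.compile(r'''
    \\( U[0-9A-Fa-f]{8}     # \UXXXXXXXX
      | u[0-9A-Fa-f]{4}     # \uXXXX
      | x[0-9A-Fa-f]{2}     # \xXX
      | [0-7]{1,3}          # octal
      | N\{[^}]+\}          # \N{NAME}
      | [\\'"abfnrtv]       # single-character escapes
    )''', re.VERBOSE)

def unescape(text):
    """Decode backslash escapes in text without touching other characters."""
    def decode_match(match):
        # Each match is pure ASCII, so unicode_escape handles it safely.
        return match.group(0).encode('ascii').decode('unicode_escape')
    return ESCAPE_RE.sub(decode_match, text)

print(repr(unescape('\\aäầ\\u1234')))
```

Because the non-ASCII characters never pass through the codec, escapes above U+00FF are handled as well.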
Unescape string is what I searched for to find this:
>>> a = r'\a'
>>> a.encode().decode('unicode-escape')
'\x07'
>>> '\a'
'\x07'
That's the way to do it with unicode. Since you're in Python 2 and may not be using unicode, you may actually want this one:
>>> a.decode('string-escape')
'\x07'
Parsing some HTML content I got the following string:
АБВ\u003d\"res
The common advice on handling it appears to be to decode using unicode_escape. However, this results in the following:
ÐÐÐ="res
The escaped characters get correctly decoded, but cyrillic letters for some reason get mangled. Other than using regexes to extract everything that looks like a unicode string, decoding only them using unicode_escape and then putting everything into a new string, which other methods exist to decode strings with unicode code points in Python?
unicode_escape treats the input as Latin-1 encoded; any bytes that do not represent a Python string literal escape sequence are decoded by mapping bytes directly to Unicode codepoints. You gave it UTF-8 bytes, so the cyrillic characters, which are represented with 2 bytes each, were decoded to two Latin-1 characters each, one of which is U+00D0 Ð, the other unprintable:
>>> print repr('АБВ\\u003d\\"res')
'\xd0\x90\xd0\x91\xd0\x92\\u003d\\"res'
>>> print repr('АБВ\\u003d\\"res'.decode('latin1'))
u'\xd0\x90\xd0\x91\xd0\x92\\u003d\\"res'
>>> print 'АБВ\\u003d\\"res'.decode('latin1')
ÐÐÐ\u003d\"res
This kind of mis-decoding is called mojibake, and can be repaired by re-encoding to Latin-1, then decoding from the correct codec (UTF-8 in your case):
>>> print 'АБВ\u003d\\"res'.decode('unicode_escape')
ÐÐÐ="res
>>> print 'АБВ\u003d\\"res'.decode('unicode_escape').encode('latin1').decode('utf8')
АБВ="res
Note that this will fail if the \uhhhh escape sequences encode codepoints outside of the Latin-1 range (U+0000-U+00FF).
The Python 3 equivalent of the above uses codecs.decode():
>>> import codecs
>>> codecs.decode('АБВ\\u003d\\"res', 'unicode_escape').encode('latin1').decode('utf8')
'АБВ="res'
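The limitation noted above, escapes that encode codepoints beyond U+00FF, can be reproduced in Python 3 like this. A small sketch; the sample string is made up for illustration:

```python
raw = 'АБВ\\u25c4'   # the escape stands for U+25C4, outside Latin-1

# Step 1 succeeds: the escape becomes a real character, but the
# cyrillic letters are mis-decoded as Latin-1.
decoded = raw.encode('utf-8').decode('unicode_escape')

# Step 2 fails: U+25C4 cannot be encoded back to Latin-1.
try:
    repaired = decoded.encode('latin-1').decode('utf-8')
except UnicodeEncodeError as exc:
    repaired = None
    print('round trip failed:', exc.reason)
```

In that situation the round trip has to be replaced by something that decodes only the escape sequences, such as a regex.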
The regex really is the easiest solution (Python 3):
import re
text = 'АБВ\\u003d\\"re'
re.sub(r'(?i)(?<!\\)(?:\\\\)*\\u([0-9a-f]{4})', lambda m: chr(int(m.group(1), 16)), text)
This works fine with any 4-nibble Unicode escape, and can be pretty easily extended to other escapes.
For Python 2, make all strings u'' strings, and use unichr.