I am trying to use the encode method of Python strings to return the Unicode escape codes for characters, like this:
>>> print( 'ф'.encode('unicode_escape').decode('utf8') )
\u0444
This works fine for non-ASCII characters, but for ASCII characters it just returns the ASCII characters themselves:
>>> print( 'f'.encode('unicode_escape').decode('utf8') )
f
The desired output would be \u0066. This script is for pedagogical purposes.
How can I get the unicode hex codes for ALL characters?
ord can be used for this; there is no need for encoding/decoding at all:
>>> '"\\U{:08x}"'.format(ord('f')) # ...or \u{:04x} if you prefer
'"\\U00000066"'
>>> eval(_)
'f'
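To escape every character of a whole string the same way, the per-character escapes can simply be joined; a minimal sketch (the helper name is my own, and it assumes BMP-only input since \uhhhh takes exactly four hex digits):
def escape_all(s):
    # ord() gives the code point; format it as a 4-digit hex escape
    return ''.join('\\u{:04x}'.format(ord(c)) for c in s)
>>> escape_all('foo ф')
'\\u0066\\u006f\\u006f\\u0020\\u0444'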
You'd have to do so manually; if you assume that all your input is within the Unicode BMP, then a straightforward regex is probably fastest; this replaces every character with its \uhhhh escape:
import re
def unicode_escaped(s, _pattern=re.compile(r'[\x00-\uffff]')):
    return _pattern.sub(
        lambda m: '\\u{:04x}'.format(ord(m.group(0))), s)
I've explicitly limited the pattern to the BMP, so anything outside it is passed through untouched rather than mangled into an invalid four-digit escape.
Demo:
>>> print(unicode_escaped('foo bar ф'))
\u0066\u006f\u006f\u0020\u0062\u0061\u0072\u0020\u0444
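If input beyond the BMP also needs escaping, a hedged variant of the same function (my own adaptation, not from the original answer) can fall back to the eight-digit \U form:
def unicode_escaped_full(s):
    # \uhhhh for BMP code points, \Uhhhhhhhh for astral ones
    return ''.join(
        '\\u{:04x}'.format(ord(c)) if ord(c) <= 0xffff
        else '\\U{:08x}'.format(ord(c))
        for c in s)
>>> print(unicode_escaped_full('f🇦'))
\u0066\U0001f1e6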
Sometimes when I get input from a file or the user, I get a string with escape sequences in it. I would like to process the escape sequences in the same way that Python processes escape sequences in string literals.
For example, let's say myString is defined as:
>>> myString = "spam\\neggs"
>>> print(myString)
spam\neggs
I want a function (I'll call it process) that does this:
>>> print(process(myString))
spam
eggs
It's important that the function can process all of the escape sequences in Python (listed in a table in the language reference).
Does Python have a function to do this?
The correct thing to do is use the 'string-escape' codec to decode the string.
>>> myString = "spam\\neggs"
>>> decoded_string = bytes(myString, "utf-8").decode("unicode_escape") # python3
>>> decoded_string = myString.decode('string_escape') # python2
>>> print(decoded_string)
spam
eggs
Don't use the AST or eval. Using the string codecs is much safer.
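As a quick illustrative check, the codec handles hex, Unicode, and octal escapes from that table in a single pass:
>>> bytes("a\\tb\\x41\\u00e9\\101", "utf-8").decode("unicode_escape")
'a\tbAéA'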
unicode_escape doesn't work in general
It turns out that the string_escape or unicode_escape solution does not work in general -- particularly, it doesn't work in the presence of actual Unicode.
If you can be sure that every non-ASCII character will be escaped (and remember, anything beyond the first 128 characters is non-ASCII), unicode_escape will do the right thing for you. But if there are any literal non-ASCII characters already in your string, things will go wrong.
unicode_escape is fundamentally designed to convert bytes into Unicode text. But in many places -- for example, Python source code -- the source data is already Unicode text.
The only way this can work correctly is if you encode the text into bytes first. UTF-8 is the sensible encoding for all text, so that should work, right?
The following examples are in Python 3, so that the string literals are cleaner, but the same problem exists with slightly different manifestations on both Python 2 and 3.
>>> s = 'naïve \\t test'
>>> print(s.encode('utf-8').decode('unicode_escape'))
naÃ¯ve 	 test
Well, that's wrong.
The new recommended way to use codecs that decode text into text is to call codecs.decode directly. Does that help?
>>> import codecs
>>> print(codecs.decode(s, 'unicode_escape'))
naÃ¯ve 	 test
Not at all. (Also, the above is a UnicodeError on Python 2.)
The unicode_escape codec, despite its name, turns out to assume that all non-ASCII bytes are in the Latin-1 (ISO-8859-1) encoding. So you would have to do it like this:
>>> print(s.encode('latin-1').decode('unicode_escape'))
naïve test
But that's terrible. This limits you to the 256 Latin-1 characters, as if Unicode had never been invented at all!
>>> print('Ernő \\t Rubik'.encode('latin-1').decode('unicode_escape'))
UnicodeEncodeError: 'latin-1' codec can't encode character '\u0151'
in position 3: ordinal not in range(256)
Adding a regular expression to solve the problem
(Surprisingly, we do not now have two problems.)
What we need to do is only apply the unicode_escape decoder to things that we are certain to be ASCII text. In particular, we can make sure only to apply it to valid Python escape sequences, which are guaranteed to be ASCII text.
The plan is, we'll find escape sequences using a regular expression, and use a function as the argument to re.sub to replace them with their unescaped value.
import re
import codecs
ESCAPE_SEQUENCE_RE = re.compile(r'''
    ( \\U........      # 8-digit hex escapes
    | \\u....          # 4-digit hex escapes
    | \\x..            # 2-digit hex escapes
    | \\[0-7]{1,3}     # Octal escapes
    | \\N\{[^}]+\}     # Unicode characters by name
    | \\[\\'"abfnrtv]  # Single-character escapes
    )''', re.UNICODE | re.VERBOSE)
def decode_escapes(s):
    def decode_match(match):
        return codecs.decode(match.group(0), 'unicode-escape')

    return ESCAPE_SEQUENCE_RE.sub(decode_match, s)
And with that:
>>> print(decode_escapes('Ernő \\t Rubik'))
Ernő Rubik
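The earlier failing example now works too, because the literal non-ASCII text never passes through the codec:
>>> print(decode_escapes('naïve \\t test'))
naïve 	 test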
The actually correct and convenient answer for Python 3:
>>> import codecs
>>> myString = "spam\\neggs"
>>> print(codecs.escape_decode(bytes(myString, "utf-8"))[0].decode("utf-8"))
spam
eggs
>>> myString = "naïve \\t test"
>>> print(codecs.escape_decode(bytes(myString, "utf-8"))[0].decode("utf-8"))
naïve test
Details regarding codecs.escape_decode:
- codecs.escape_decode is a bytes-to-bytes decoder.
- It decodes ASCII escape sequences, such as b"\\n" -> b"\n" and b"\\xce" -> b"\xce".
- It does not care or need to know about the byte object's encoding, but the escaped bytes should use the same encoding as the rest of the object.
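A quick illustration of the bytes-to-bytes behaviour (note that escape_decode returns a (decoded_bytes, length_consumed) tuple):
>>> import codecs
>>> codecs.escape_decode(b'sp\\nam')
(b'sp\nam', 6)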
Background:
@rspeer is correct: unicode_escape is the incorrect solution for Python 3. This is because unicode_escape decodes the escaped bytes, then decodes those bytes to a unicode string, but receives no information about which codec to use for the second operation.
@Jerub is correct: avoid the AST or eval.
I first discovered codecs.escape_decode from this answer to "how do I .decode('string-escape') in Python3?". As that answer states, the function is currently not documented for Python 3.
The ast.literal_eval function comes close, but it will expect the string to be properly quoted first.
Of course Python's interpretation of backslash escapes depends on how the string is quoted ("" vs r"" vs u"", triple quotes, etc) so you may want to wrap the user input in suitable quotes and pass to literal_eval. Wrapping it in quotes will also prevent literal_eval from returning a number, tuple, dictionary, etc.
Things still might get tricky if the user types unquoted quotes of the type you intend to wrap around the string.
The (currently) accepted answer by Jerub is correct for Python 2, but incorrect for Python 3, where it may produce garbled results (as Apalala points out in a comment on that answer). That's because the unicode_escape codec requires its input to be encoded as latin-1, not utf-8, per the official Python docs. Hence, in Python 3 use:
>>> myString="špåm\\nëðþ\\x73"
>>> print(myString)
špåm\nëðþ\x73
>>> decoded_string = myString.encode('latin-1','backslashreplace').decode('unicode_escape')
>>> print(decoded_string)
špåm
ëðþs
This method also avoids the extra unnecessary roundtrip between strings and bytes in metatoaster's comments to Jerub's solution (but hats off to metatoaster for recognizing the bug in that solution).
This is a bad way of doing it, but it worked for me when trying to interpret escaped octals passed in a string argument.
input_string = eval('b"' + sys.argv[1] + '"')
It's worth mentioning that there is a difference between eval and ast.literal_eval (eval being way more unsafe). See Using python's eval() vs. ast.literal_eval()?
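A somewhat safer sketch of the same trick swaps eval for ast.literal_eval; the quoting pitfalls discussed in the next answer still apply:
import ast
import sys

# literal_eval only accepts literals, so arbitrary code cannot run,
# but the argument must still form a valid bytes literal
input_string = ast.literal_eval('b"' + sys.argv[1] + '"')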
Quote the string properly so that it looks like the equivalent Python string literal, and then use ast.literal_eval. This is safe, but much trickier to get right than you might expect.
It's easy enough to add a " to the beginning and end of the string, but we also need to make sure that any " inside the string are properly escaped. If we want fully Python-compliant translation, we need to account for the deprecated behaviour of invalid escape sequences.
It works out that we need to add one backslash to:
any sequence of an even number of backslashes followed by a double-quote (so that we escape a quote if needed, but don't escape a backslash and un-escape the quote if it was already escaped); as well as
a sequence of an odd number of backslashes at the end of the input (because otherwise a backslash would escape our enclosing double-quote).
Here is an acid-test input showing a bunch of difficult cases:
>>> text = r'''\\ \ \" \\" \\\" \'你好'\n\u062a\xff\N{LATIN SMALL LETTER A}"''' + '\\'
>>> text
'\\\\ \\ \\" \\\\" \\\\\\" \\\'你好\'\\n\\u062a\\xff\\N{LATIN SMALL LETTER A}"\\'
>>> print(text)
\\ \ \" \\" \\\" \'你好'\n\u062a\xff\N{LATIN SMALL LETTER A}"\
I was eventually able to work out a regex that handles all these cases properly, allowing literal_eval to be used:
>>> import ast, re
>>> def parse_escapes(text):
...     fixed_escapes = re.sub(r'(?<!\\)(\\\\)*("|\\$)', r'\\\1\2', text)
...     return ast.literal_eval(f'"{fixed_escapes}"')
...
Testing the results:
>>> parse_escapes(text)
'\\ \\ " \\" \\" \'你好\'\nتÿa"\\'
>>> print(parse_escapes(text))
\ \ " \" \" '你好'
تÿa"\
This should correctly handle everything - strings containing both single and double quotes, every weird situation with backslashes, and non-ASCII characters in the input. (I admit it's a bit difficult to verify the results by eye!)
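A couple of quick usage checks on simpler inputs:
>>> parse_escapes(r'say \"hi\"')  # already-escaped quotes pass through
'say "hi"'
>>> parse_escapes('say "hi"')     # bare quotes get escaped for us
'say "hi"'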
The code below should work when a literal \n in the string needs to be converted to an actual newline:
our_str = 'The String is \\n, \\n and \\n!'
new_str = our_str.replace('\\n', '\n')  # str.replace; the string module's replace() was removed in Python 3
print(new_str)
For example, I can print Unicode symbol like:
print u'\u00E0'
Or
a = u'\u00E0'
print a
But it looks like I can't do something like this:
a = '\u00E0'
print someFunctionToDisplayTheCharacterRepresentedByThisCodePoint(a)
The main use case will be in loops. I have a list of unicode code points and I wish to display them on console. Something like:
with open("someFileWithAListOfUnicodeCodePoints") as uniCodeFile:
for codePoint in uniCodeFile:
print codePoint #I want the console to display the unicode character here
The file has a list of unicode code points. For example:
2109
00B0
00E4
1F1E6
The loop should output:
℉
°
ä
🇦
Any help will be appreciated!
This is probably not a great way, but it's a start:
>>> import struct
>>> x = '00e4'
>>> print unicode(struct.pack("!I", int(x, 16)), 'utf_32_be')
ä
First, we get the integer represented by the hexadecimal string x. We pack that into a byte string, which we can then decode using the utf_32_be encoding.
Since you are doing this a lot, you can precompile the struct:
int2bytes = struct.Struct("!I").pack
with open("someFileWithAListOfUnicodeCodePoints") as fh:
    for code_point in fh:
        print unicode(int2bytes(int(code_point, 16)), 'utf_32_be')
If you think it's clearer, you can also use the decode method instead of the unicode type directly:
>>> print int2bytes(int('00e4', 16)).decode('utf_32_be')
ä
Python 3 added a to_bytes method to the int class that lets you bypass the struct module:
>>> str(int('00e4', 16).to_bytes(4, 'big'), 'utf_32_be')
"ä"
You want print unichr(int('00E0',16)). Convert the hex string to an integer and print its Unicode codepoint.
Caveat: On Windows codepoints > U+FFFF won't work.
Solution: Use Python 3.3+ print(chr(int(line,16)))
In all cases you'll still need to use a font that supports the glyphs for the codepoints.
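Putting that together with the loop from the question, a minimal Python 3 sketch (assuming the file holds one hex code point per line):
with open("someFileWithAListOfUnicodeCodePoints") as fh:
    for line in fh:
        print(chr(int(line, 16)))  # int() tolerates the trailing newline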
These are Unicode code points, but they lack the \u prefix that the Python unicode-escape codec expects. So, just add it:
with open("someFileWithAListOfUnicodeCodePoints", "rb") as uniCodeFile:
for codePoint in uniCodeFile:
print "\\u" + codePoint.strip()).decode("unicode-escape")
Whether this works on a given system depends on the console's encoding. If it's a Windows code page and the characters are not in its range, you'll still get funky errors.
In Python 3 the prefix would need to be the bytes literal b"\\u", since the file is opened in binary mode.
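A Python 3 sketch of this approach (my own adaptation) also has to pad five-digit code points such as 1F1E6 to the eight-digit \U form, since \u takes exactly four hex digits:
with open("someFileWithAListOfUnicodeCodePoints", "rb") as fh:
    for code_point in fh:
        padded = code_point.strip().rjust(8, b'0')  # e.g. b'0001F1E6'
        print((b'\\U' + padded).decode("unicode_escape"))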
Parsing some HTML content I got the following string:
АБВ\u003d\"res
The common advice on handling it appears to be to decode using unicode_escape. However, this results in the following:
ÐÐÐ="res
The escaped characters get correctly decoded, but the Cyrillic letters are mangled for some reason. Other than using a regex to extract everything that looks like a Unicode escape, decoding only those with unicode_escape, and then reassembling the string, what other methods exist to decode strings with Unicode code points in Python?
unicode_escape treats the input as Latin-1 encoded; any bytes that do not form a Python string-literal escape sequence are mapped directly to Unicode codepoints. You gave it UTF-8 bytes, so each Cyrillic character is represented by 2 bytes, and each of those bytes is decoded to a Latin-1 character, one of which is U+00D0 Ð and the other unprintable:
>>> print repr('АБВ\\u003d\\"res')
'\xd0\x90\xd0\x91\xd0\x92\\u003d\\"res'
>>> print repr('АБВ\\u003d\\"res'.decode('latin1'))
u'\xd0\x90\xd0\x91\xd0\x92\\u003d\\"res'
>>> print 'АБВ\\u003d\\"res'.decode('latin1')
ÐÐÐ\u003d\"res
This kind of mis-decoding is called a Mojibake, and can be repaired by re-encoding to Latin-1, then decoding from the correct codec (UTF-8 in your case):
>>> print 'АБВ\u003d\\"res'.decode('unicode_escape')
ÐÐÐ="res
>>> print 'АБВ\u003d\\"res'.decode('unicode_escape').encode('latin1').decode('utf8')
АБВ="res
Note that this will fail if the \uhhhh escape sequences encode codepoints outside of the Latin-1 range (U+0000-U+00FF).
The Python 3 equivalent of the above uses codecs.decode():
>>> import codecs
>>> codecs.decode('АБВ\\u003d\\"res', 'unicode_escape').encode('latin1').decode('utf8')
'АБВ="res'
The regex really is the easiest solution (Python 3):
import re

text = 'АБВ\\u003d\\"re'
# the run of escaped backslashes is captured and kept, so literal \\ pairs survive
re.sub(r'(?i)(?<!\\)((?:\\\\)*)\\u([0-9a-f]{4})',
       lambda m: m.group(1) + chr(int(m.group(2), 16)), text)
This works fine with any 4-nibble Unicode escape, and can be pretty easily extended to other escapes.
For Python 2, make all strings u'' strings, and use unichr.
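For reference, a Python 2 sketch of the same substitution along those lines:
# -*- coding: utf-8 -*-
import re

text = u'АБВ\\u003d\\"re'
print re.sub(r'(?i)(?<!\\)((?:\\\\)*)\\u([0-9a-f]{4})',
             lambda m: m.group(1) + unichr(int(m.group(2), 16)), text)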
I am trying to compare unicode strings in Python. Since a lot of the symbols look similar and some may contain non-printable characters, I am having trouble debugging where my comparisons are failing. Is there a way to take a string of unicode characters and print their unicode codes? i.e.:
>>> unicode_print('❄')
'\u2744'
You can encode that string with some other encoding:
>>> s = '❄'
>>> s.encode() # "utf8" by default
b'\xe2\x9d\x84'
And for the exact output you specified, there is the unicode_escape codec:
>>> s.encode("unicode_escape")
b'\\u2744'
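If you want a str rather than bytes, the unicode_escape output is pure ASCII, so it is safe to decode it afterwards:
>>> s.encode("unicode_escape").decode("ascii")
'\\u2744'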