converting Unicode code point numbers to Unicode characters - python

I'm using the argparse library in Python 3 to read in Unicode strings from a command line parameter. Often those strings contain "ordinary" Unicode characters (extended Latin, etc.), but sometimes--particularly when the characters belong to a right-to-left script--it's easier to encode the strings as Unicode code points, like \u0644. But argparse treats these designators as a sequence of characters, and does not convert them into the character they designate. For instance, if a command line parameter is
... -a "abc\06d2d" ...
then what I get in the argparse variable is
"abc\06d2d"
rather than the expected
"abcےd"
(the character between the 'c' and 'd' is the yeh baree). Of course both outcomes are logical, it's just that the second one is the one I want.
I tried to reproduce this in an interpreter, but under most circumstances Python3 automagically converts a string like "abc\06d2d" into "abcےd". Not so when I read the string using argparse...
I came up with a function to do the conversion, see below. But I feel like I'm missing something much simpler. Is there an easier way to do this conversion? (Obviously I could use str.startswith(), or regex's to match the entire thing, rather than going character by character, but the code below is really just an illustration. It seems like I shouldn't have to create my own function to do this at all, especially since in some circumstances it seems to happen automatically.)
---------My code to do this follows---------
def ParseString2Unicode(sInString):
"""Return a version of sInString in which any Unicode code points of the form
\uXXXX (X = hex digit)
have been converted into their corresponding Unicode characters.
Example:
"\u0064b\u0065"
becomes
"dbe"
"""
sOutString = ""
while sInString:
if len(sInString) >= 6 and \
sInString[0] == "\\" and \
sInString[1] == "u" and \
sInString[2] in "0123456789ABCDEF" and \
sInString[3] in "0123456789ABCDEF" and \
sInString[4] in "0123456789ABCDEF" and \
sInString[5] in "0123456789ABCDEF":
#If we get here, the first 6 characters of sInString represent
# a Unicode code point, like "\u0065"; convert it into a char:
sOutString += chr(int(sInString[2:6], 16))
sInString = sInString[6:]
else:
#Strip a single char:
sOutString += sInString[0]
sInString = sInString[1:]
return sOutString

What you may want to look at is the raw_unicode_escape encoding.
>>> len(b'\\uffff')
6
>>> b'\\uffff'.decode('raw_unicode_escape')
'\uffff'
>>> len(b'\\uffff'.decode('raw_unicode_escape'))
1
So, the function would be:
def ParseString2Unicode(sInString):
try:
decoded = sInString.encode('utf-8')
return decoded.decode('raw_unicode_escape')
except UnicodeError:
return sInString
This, however, also matches other unicode escape sequences, like \Uxxxxxxxx. If you just want to match \uxxxx, use a regex, like so:
import re
escape_sequence_re = re.compile(r'\\u[0-9a-fA-F]{4}')
def _escape_sequence_to_char(match):
return chr(int(match[0][2:], 16))
def ParseString2Unicode(sInString):
return re.sub(escape_sequence_re, _escape_sequence_to_char, sInString)

A concise, flexible way of handling this would be to use a regular expression:
return re.sub(
r"\\u([0-9A-Fa-f]{4})",
lambda m: chr(int(m[1], 16)),
sInString
)

Related

How to properly unescape select sequences in python

I'm escaping certain characters in strings (e.g., \n, \\) with double backslashes, like this: text.replace("\\", "\\\\").replace("\n", "\\n")
Naïvely, I tried to unescape using: text.replace("\\n", "\n").replace("\\\\", "\\")
However, this fails on strings like:
>>> text = "\\\n\\n"
>>> print(text)
\
\n
>>> etext = text.replace("\\", "\\\\").replace("\n", "\\n")
>>> print(etext)
\\\n\\n
>>> ftext = etext.replace("\\n", "\n").replace("\\\\", "\\")
>>> print(ftext)
\
\
>>>
As you can see the original string doesn't survive the round trip.
Even changing the order of replaces around would not solve the issue.
The only way to correctly unescape is to do the replacements in one go.
Python's str has maketrans and translate to achieve a similar effect
but they only work on single characters as keys.
re.sub also does not work since the substitution would need to distinguish the case somehow. (\1 does not work since if the second character is n we want the newline character as output instead of n)
A correct (but slow) solution would be:
def unescape(text: str) -> str:
res: list[str] = []
in_escape = False
for c in text:
if in_escape:
in_escape = False
if c == "\\":
res.append("\\")
continue
if c == "n":
res.append("\n")
continue
if c == "\\":
in_escape = True
continue
res.append(c)
return "".join(res)
>>> text = "\\\n\\n"
>>> print(text)
\
\n
>>> etext = text.replace("\\", "\\\\").replace("\n", "\\n")
>>> print(etext)
\\\n\\n
>>> print(unescape(etext))
\
\n
>>>
Is there a proper/canonical/fast way of escaping (only certain sequences in) strings?
(EDIT: to answer why a subset of escapes is preferred. in my case other escapes are not needed and it's easy to permanently corrupt your data by escaping things that don't need to. for example, from the top of my head I can think of three different escape functions just in python alone that all escape completely different subsets of characters. even the str.escape function changes what it escapes between python versions. now most of the time unescape can handle a wider set of escape sequences than its corresponding escape function but this is not always the case. this all doesn't even take into account trying to load the escaped data in a different language)

Convert escaped utf-8 string to utf in python 3

I have a py3 string that includes escaped utf-8 sequencies, such as "Company\\ffffffc2\\ffffffae", which I would like to convert to the correct utf 8 string (which would in the example be "Company®", since the escaped sequence is c2 ae). I've tried
print (bytes("Company\\\\ffffffc2\\\\ffffffae".replace(
"\\\\ffffff", "\\x"), "ascii").decode("utf-8"))
result: Company\xc2\xae
print (bytes("Company\\\\ffffffc2\\\\ffffffae".replace (
"\\\\ffffff", "\\x"), "ascii").decode("unicode_escape"))
result: Company®
(wrong, since chracters are treated separately, but they should be treated together.
If I do
print (b"Company\xc2\xae".decode("utf-8"))
It gives the correct result.
Company®
How can i achieve that programmatically (i.e. starting from a py3 str)
A simple solution is:
import ast
test_in = "Company\\\\ffffffc2\\\\ffffffae"
test_out = ast.literal_eval("b'''" + test_in.replace('\\\\ffffff','\\x') + "'''").decode('utf-8')
print(test_out)
However it will fail if there is a triple quote ''' in the input string itself.
Following code does not have this problem, but it is not as simple as the first one.
In the first step the string is split on a regular expression. The odd items are ascii parts, e.g. "Company"; each even item corresponds to one escaped utf8 code, e.g. "\\\\ffffffc2". Each substring is converted to bytes according to its meaning in the input string. Finally all parts are joined together and decoded from bytes to a string.
import re
REGEXP = re.compile(r'(\\\\ffffff[0-9a-f]{2})', flags=re.I)
def convert(estr):
def split(estr):
for i, substr in enumerate(REGEXP.split(estr)):
if i % 2:
yield bytes.fromhex(substr[-2:])
elif substr:
yield bytes(substr, 'ascii')
return b''.join(split(estr)).decode('utf-8')
test_in = "Company\\\\ffffffc2\\\\ffffffae"
print(convert(test_in))
The code could be optimized. Ascii parts do not need encode/decode and consecutive hex codes should be concatenated.

UTF-8 decoding with ascii code in it with Python

From the question and answer in UTF-8 coding in Python, I could use binascii package to decode an utf-8 string with '_' in it.
def toUtf(r):
try:
rhexonly = r.replace('_', '')
rbytes = binascii.unhexlify(rhexonly)
rtext = rbytes.decode('utf-8')
except TypeError:
rtext = r
return rtext
This code works fine with only utf-8 characters:
r = '_ed_8e_b8'
print toUtf(r)
>> 편
However, this code does not work when the string has normal ascii code in it. The ascii can be anywhere in the string.
r = '_2f119_ed_8e_b8'
print toUtf(r)
>> doesn't work - _2f119_ed_8e_b8
>> this should be '/119편'
Maybe, I can use regular expression to extract the utf-8 part and ascii part to reassmeble after the conversion, but I wonder if there is an easier way to do the conversion. Any good solution?
Quite straightforward with re.sub:
import re
bytegroup = r'(_[0-9a-z]{2})+'
def replacer(match):
return toUtf(match.group())
rtext = re.sub(bytegroup, replacer, r, flags=re.I)
That is some truly terrible input you've got. It's still fixable though. First off, replace the non-"encoded" stuff with hex equivalents:
import itertools
import re
r = '_2f119_ed_8e_b8'
# Split so you have even entries in the list as ASCII, odd as hex encodings
rsplit = re.split(r'((?:_[0-9a-fA-F]{2})+)', r) # ['', '_2f', '119', '_ed_8e_b8', '']
# Process the hex encoded UTF-8 with your existing function, leaving
# ASCII untouched
rsplit[1::2] = map(toUtf, rsplit[1::2]) # ['', '/', '119', '관', '']
rtext = ''.join(rsplit) # '/119편'
The above is a verbose version that shows the individual steps, but as chthonicdaemon's answer point's out, it can be shortened dramatically. You use the same regular expression with re.sub instead of re.split, and pass a function to perform the replacement instead of a replacement pattern string:
# One-liner equivalent to the above with no intermediate lists
rtext = re.sub(r'(?:_[0-9a-f]{2})+', lambda m: toUtf(m.group()), r, flags=re.I)
You can package that up as a function itself, so you have one function that deals with purely hex encoded UTF-8, and a second general function that uses the first function as part of processing mixed non-encoded ASCII and hex encoded UTF-8 data.
Mind you, this won't necessarily work all that well if the non-encoded ASCII might contain _ normally; the regex tries to be as targeted as possible, but you've got a problem here where no matter how finely you target your heuristics, some ASCII data will be mistaken for encoded UTF-8 data.

Python3 : unescaping non ascii characters

(Python 3.3.2) I have to unescape some non ASCII escaped characters returned by a call to re.escape(). I see here and here methods that doesn't work. I'm working in a 100% UTF-8 environment.
# pure ASCII string : ok
mystring = "a\n" # expected unescaped string : "a\n"
cod = codecs.getencoder('unicode_escape')
print( cod(mystring) )
# non ASCII string : method #1
mystring = "€\n"
# equivalent to : mystring = codecs.unicode_escape_decode(mystring)
cod = codecs.getdecoder('unicode_escape')
print(cod(mystring))
# RESULT = ('â\x82¬\n', 5) INSTEAD OF ("€\n", 2)
# non ASCII string : method #2
mystring = "€\n"
mystring = bytes(mystring, 'utf-8').decode('unicode_escape')
print(mystring)
# RESULT = â\202¬ INSTEAD OF "€\n"
Is this a bug ? Have I misunderstood something ?
Any help would be appreciated !
PS : I edited my post thanks to the Michael Foukarakis' remark.
I guess the actual string you need to process is mystring = €\\n?
mystring = "€\n" # that's 2 char, "€" and new line
mystring = "€\\n" # that's 3 char, "€", "\" and "n"
I don't really understand what's going wrong within encode() and decode() of python3, but my friend solve this problem when we are writing some tools.
How we did is to bypass the encoder("utf_8") after the escape procedure is done.
>>> "€\\n".encode("utf_8")
b'\xe2\x82\xac\\n'
>>> "€\\n".encode("utf_8").decode("unicode_escape")
'â\x82¬\n'
>>> "€\\n".encode("utf_8").decode("unicode_escape").encode("utf_8")
b'\xc3\xa2\xc2\x82\xc2\xac\n' # we don't want this
>>> bytes([ord(char) for char in "€\\n".encode("utf_8").decode("unicode_escape")])
b'\xe2\x82\xac\n' # what we really need
>>> str(bytes([ord(char) for char in "€\\n".encode("utf_8").decode("unicode_escape")]), "utf_8")
'€\n'
We can see that: though the result of decode("unicode_escape") looks wired, the bytes object actually contain the correct bytes of your strings(with utf-8 encoding), in this case, "\xe2\x82\xac\n"
And we now do not print the str object directly, neither do we use encode("utf_8"), we use ord() to create the bytes object b'\xe2\x82\xac\n'.
And you can get the correct str from this bytes object, just put it into str()
BTW, the tool my friend and me want to make is a wrapper that allow user to input c-like string literal, and convert the escaped sequence automatically.
User input:\n\x61\x62\n\x20\x21 # 20 characters, which present 6 chars semantically
output: # \n
ab # \x61\x62\n
! # \x20\x21
That's a powerful tool for user to input some non-printable character in terminal.
Our final tools is:
#!/usr/bin/env python3
import sys
for line in sys.stdin:
sys.stdout.buffer.write(bytes([ord(char) for char in line[:-1].encode().decode('unicode_escape')]))
sys.stdout.flush()
You seem to misunderstand encodings. To be protected against common errors, we usually encode a string when it leaves our application, and decode it when it comes in.
Firstly, let's look at the documentation for unicode_escape, which states:
Produce[s] a string that is suitable as Unicode literal in Python source code.
Here is what you would get from the network or a file that claims its contents are Unicode escaped:
b'\\u20ac\\n'
Now, you have to decode this to use it in your app:
>>> s = b'\\u20ac\\n'.decode('unicode_escape')
>>> s
'€\n'
and if you wanted to write it back to, say, a Python source file:
with open('/tmp/foo', 'wb') as fh: # binary mode
fh.write(b'print("' + s.encode('unicode_escape') + b'")')
import string
printable = string.printable
printable = printable + '€'
def cod(c):
return c.encode('unicode_escape').decode('ascii')
def unescape(s):
return ''.join(c if ord(c)>=32 and c in printable else cod(c) for c in s)
mystring = "€\n"
print(unescape(mystring))
Unfortunately string.printable only includes ASCII characters. You can make a copy as I did here and extend it with any Unicode characters that you'd like, such as €.

Show non printable characters in a string

Is it possible to visualize non-printable characters in a python string with its hex values?
e.g. If I have a string with a newline inside I would like to replace it with \x0a.
I know there is repr() which will give me ...\n, but I'm looking for the hex version.
I don't know of any built-in method, but it's fairly easy to do using a comprehension:
import string
printable = string.ascii_letters + string.digits + string.punctuation + ' '
def hex_escape(s):
return ''.join(c if c in printable else r'\x{0:02x}'.format(ord(c)) for c in s)
I'm kind of late to the party, but if you need it for simple debugging, I found that this works:
string = "\n\t\nHELLO\n\t\n\a\17"
procd = [c for c in string]
print(procd)
# Prints ['\n,', '\t,', '\n,', 'H,', 'E,', 'L,', 'L,', 'O,', '\n,', '\t,', '\n,', '\x07,', '\x0f,']
While just list is simpler, a comprehension makes it easier to add in filtering/mapping if necessary.
You'll have to make the translation manually; go through the string with a regular expression for example, and replace each occurrence with the hex equivalent.
import re
replchars = re.compile(r'[\n\r]')
def replchars_to_hex(match):
return r'\x{0:02x}'.format(ord(match.group()))
replchars.sub(replchars_to_hex, inputtext)
The above example only matches newlines and carriage returns, but you can expand what characters are matched, including using \x escape codes and ranges.
>>> inputtext = 'Some example containing a newline.\nRight there.\n'
>>> replchars.sub(replchars_to_hex, inputtext)
'Some example containing a newline.\\x0aRight there.\\x0a'
>>> print(replchars.sub(replchars_to_hex, inputtext))
Some example containing a newline.\x0aRight there.\x0a
Modifying ecatmur's solution to handle non-printable non-ASCII characters makes it less trivial and more obnoxious:
def escape(c):
if c.printable():
return c
c = ord(c)
if c <= 0xff:
return r'\x{0:02x}'.format(c)
elif c <= '\uffff':
return r'\u{0:04x}'.format(c)
else:
return r'\U{0:08x}'.format(c)
def hex_escape(s):
return ''.join(escape(c) for c in s)
Of course if str.isprintable isn't exactly the definition you want, you can write a different function. (Note that it's a very different set from what's in string.printable—besides handling non-ASCII printable and non-printable characters, it also considers \n, \r, \t, \x0b, and \x0c as non-printable.
You can make this more compact; this is explicit just to show all the steps involved in handling Unicode strings. For example:
def escape(c):
if c.printable():
return c
elif c <= '\xff':
return r'\x{0:02x}'.format(ord(c))
else:
return c.encode('unicode_escape').decode('ascii')
Really, no matter what you do, you're going to have to handle \r, \n, and \t explicitly, because all of the built-in and stdlib functions I know of will escape them via those special sequences instead of their hex versions.
I did something similar once by deriving a str subclass with a custom __repr__() method which did what I wanted. It's not exactly what you're looking for, but may give you some ideas.
# -*- coding: iso-8859-1 -*-
# special string subclass to override the default
# representation method. main purpose is to
# prefer using double quotes and avoid hex
# representation on chars with an ord > 128
class MsgStr(str):
def __repr__(self):
# use double quotes unless there are more of them within the string than
# single quotes
if self.count("'") >= self.count('"'):
quotechar = '"'
else:
quotechar = "'"
rep = [quotechar]
for ch in self:
# control char?
if ord(ch) < ord(' '):
# remove the single quotes around the escaped representation
rep += repr(str(ch)).strip("'")
# embedded quote matching quotechar being used?
elif ch == quotechar:
rep += "\\"
rep += ch
# else just use others as they are
else:
rep += ch
rep += quotechar
return "".join(rep)
if __name__ == "__main__":
s1 = '\tWürttemberg'
s2 = MsgStr(s1)
print "str s1:", s1
print "MsgStr s2:", s2
print "--only the next two should differ--"
print "repr(s1):", repr(s1), "# uses built-in string 'repr'"
print "repr(s2):", repr(s2), "# uses custom MsgStr 'repr'"
print "str(s1):", str(s1)
print "str(s2):", str(s2)
print "repr(str(s1)):", repr(str(s1))
print "repr(str(s2)):", repr(str(s2))
print "MsgStr(repr(MsgStr('\tWürttemberg'))):", MsgStr(repr(MsgStr('\tWürttemberg')))
There is also a way to print non-printable characters in the sense of them executing as commands within the string even if not visible (transparent) in the string, and their presence can be observed by measuring the length of the string using len as well as by simply putting the mouse cursor at the start of the string and seeing/counting how many times you have to tap the arrow key to get from start to finish, as oddly some single characters can have a length of 3 for example, which seems perplexing. (Not sure if this was already demonstrated in prior answers)
In this example screenshot below, I pasted a 135-bit string that has a certain structure and format (which I had to manually create beforehand for certain bit positions and its overall length) so that it is interpreted as ascii by the particular program I'm running, and within the resulting printed string are non-printable characters such as the 'line break` which literally causes a line break (correction: form feed, new page I meant, not line break) in the printed output there is an extra entire blank line in between the printed result (see below):
Example of printing non-printable characters that appear in printed string
Input a string:100100001010000000111000101000101000111011001110001000100001100010111010010101101011100001011000111011001000101001000010011101001000000
HPQGg]+\,vE!:#
>>> len('HPQGg]+\,vE!:#')
17
>>>
In the above code excerpt, try to copy-paste the string HPQGg]+\,vE!:# straight from this site and see what happens when you paste it into the Python IDLE.
Hint: You have to tap the arrow/cursor three times to get across the two letters from P to Q even though they appear next to each other, as there is actually a File Separator ascii command in between them.
However, even though we get the same starting value when decoding it as a byte array to hex, if we convert that hex back to bytes they look different (perhaps lack of encoding, not sure), but either way the above output of the program prints non-printable characters (I came across this by chance while trying to develop a compression method/experiment).
>>> bytes(b'HPQGg]+\,vE!:#').hex()
'48501c514767110c5d2b5c2c7645213a40'
>>> bytes.fromhex('48501c514767110c5d2b5c2c7645213a40')
b'HP\x1cQGg\x11\x0c]+\\,vE!:#'
>>> (0x48501c514767110c5d2b5c2c7645213a40 == 0b100100001010000000111000101000101000111011001110001000100001100010111010010101101011100001011000111011001000101001000010011101001000000)
True
>>>
In the above 135 bit string, the first 16 groups of 8 bits from the big-endian side encode each character (including non-printable), whereas the last group of 7 bits results in the # symbol, as seen below:
Technical breakdown of the format of the above 135-bit string
And here as text is the breakdown of the 135-bit string:
10010000 = H (72)
10100000 = P (80)
00111000 = x1c (28 for File Separator) *
10100010 = Q (81)
10001110 = G(71)
11001110 = g (103)
00100010 = x11 (17 for Device Control 1) *
00011000 = x0c (12 for NP form feed, new page) *
10111010 = ] (93 for right bracket ‘]’
01010110 = + (43 for + sign)
10111000 = \ (92 for backslash)
01011000 = , (44 for comma, ‘,’)
11101100 = v (118)
10001010 = E (69)
01000010 = ! (33 for exclamation)
01110100 = : (58 for colon ‘:’)
1000000 = # (64 for ‘#’ sign)
So in closing, the answer to the sub-question about showing the non-printable as hex, in byte array further above appears the letters x1c which denote the file separator command which was also noted in the hint. The byte array could be considered a string if excluding the prefix b on the left side, and again this value shows in the print string albeit it is invisible (although its presence can be observed as demonstrated above with the hint and len command).

Categories