Identify Unicode characters that can't be printed - Python

I need to be able to determine (or predict) when a Unicode character won't be printable. For instance, if I print this Unicode character under default settings, it prints fine:
>>> print(u'\ua62b')
ꘫ
But if I print another unicode character, it prints as a stupid, weird square:
>>> print(u'\ua62c')
꘬
I really need to be able to determine, before a character is printed, whether it will display as an ugly square like this (or sometimes as an anonymous blank). What causes this, and how can I predict it?

While it's hard to tell whether the terminal running your script (or the font your terminal is using) is able to render a given character correctly, you can at least check that the character actually has a representation.
The character \ua62b is defined as VAI SYLLABLE NDOLE DO, whereas the character \ua62c has no definition, which is why it may be rendered as a square or other placeholder glyph.
One way to check if a character is defined is to use the unicodedata module:
>>> import unicodedata
>>> unicodedata.name(u"\ua62b")
'VAI SYLLABLE NDOLE DO'
>>> unicodedata.name(u"\ua62c")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: no such name
As you can see above, a ValueError is raised for the \ua62c character because it isn't defined.
Another method is to check the category of the character. If it is Cn then the character is not assigned:
>>> import unicodedata
>>> unicodedata.category(u"\ua62b")
'Lo'
>>> unicodedata.category(u"\ua62c")
'Cn'
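Putting the two checks together, a small helper along these lines can screen characters before you print them (a minimal sketch; the name is_assigned is just for illustration):

import unicodedata

def is_assigned(ch):
    # 'Cn' is the "Other, not assigned" category
    return unicodedata.category(ch) != 'Cn'

print(is_assigned('\ua62b'))  # True  - VAI SYLLABLE NDOLE DO
print(is_assigned('\ua62c'))  # False - unassigned codepoint

Keep in mind this only tells you the codepoint is assigned; whether your terminal's font can actually render it is still a separate question.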

Related

str.isdigit() behaviour when handling strings

Assuming the following:
>>> square = '²' # Superscript Two (Unicode U+00B2)
>>> cube = '³' # Superscript Three (Unicode U+00B3)
Curiously:
>>> square.isdigit()
True
>>> cube.isdigit()
True
OK, let's convert those "digits" to integers:
>>> int(square)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: invalid literal for int() with base 10: '²'
>>> int(cube)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: invalid literal for int() with base 10: '³'
Oooops!
Could someone please explain what behavior I should expect from the str.isdigit() method when handling strings?
str.isdigit doesn't claim to be related to parsability as an int. It reports a simple Unicode property: whether the character is a decimal character or a digit of some sort:
str.isdigit()
Return True if all characters in the string are digits and there is at least one character, False otherwise. Digits include decimal characters and digits that need special handling, such as the compatibility superscript digits. This covers digits which cannot be used to form numbers in base 10, like the Kharosthi numbers. Formally, a digit is a character that has the property value Numeric_Type=Digit or Numeric_Type=Decimal.
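You can see that distinction directly with unicodedata: U+00B2 has a digit value but no decimal value, which is exactly why int() rejects it. A quick sketch:

import unicodedata

print(unicodedata.digit('²'))    # 2 - Numeric_Type=Digit
print(unicodedata.decimal('2'))  # 2 - Numeric_Type=Decimal, so int('2') works
try:
    unicodedata.decimal('²')     # superscript two has no decimal value
except ValueError as e:
    print(e)                     # not a decimal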
In short, str.isdigit is thoroughly useless for detecting valid numbers. The correct solution to checking if a given string is a legal integer is to call int on it, and catch the ValueError if it's not a legal integer. Anything else you do will be (badly) reinventing the same tests the actual parsing code in int() performs, so why not let it do the work in the first place?
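In practice that means a small try/except wrapper, something like this sketch (parse_int is just an illustrative name):

def parse_int(s):
    # Return the integer value of s, or None if s isn't a valid base-10 integer
    try:
        return int(s)
    except ValueError:
        return None

print(parse_int('42'))  # 42
print(parse_int('²'))   # None - isdigit() is True, but int() still rejects it
print(parse_int('٤٢'))  # 42 - Arabic-Indic digits are Numeric_Type=Decimal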
Side-note: You're using the term "utf-8" incorrectly. UTF-8 is a specific way of encoding Unicode, and only applies to raw binary data. Python's str is an "idealized" Unicode text type; it has no encoding (under the hood, it's stored encoded as one of ASCII, latin-1, UCS-2, UCS-4, and possibly also UTF-8, but none of that is visible at the Python layer outside of indirect measurements like sys.getsizeof, which only hints at the underlying encoding by letting you see how much memory the string consumes). The characters you're talking about are simple Unicode characters above the ASCII range, they're not specifically UTF-8.

How to check if a chr()'s output will be undefined

I'm using chr() to run through a list of Unicode characters, but whenever it comes across a character that is unassigned, it just continues running and doesn't error out or anything. How do I check if the output of chr() will be undefined?
For example,
print(chr(55396))
is in the Unicode range, it's just an unassigned character. How do I check whether chr() will give me an actual character, so that this hangup doesn't occur?
You could use the unicodedata module:
>>> import unicodedata
>>> unicodedata.name(chr(55396))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: no such name
>>> unicodedata.name(chr(120))
'LATIN SMALL LETTER X'
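Note that chr() itself never raises for values in range(0x110000), so you have to filter its output yourself. A minimal sketch using name()'s optional default argument to avoid the try/except (the codepoint range is chosen only for illustration):

import unicodedata

for cp in range(0xA620, 0xA630):
    ch = chr(cp)
    # name() raises ValueError for nameless codepoints unless a default is given
    name = unicodedata.name(ch, None)
    if name is None:
        continue  # unassigned; this also skips surrogates and most control codes
    print('U+%04X %s %s' % (cp, ch, name))

Incidentally, chr(55396) from the question is not merely unassigned: it's a surrogate codepoint (category 'Cs'), which likewise has no name, so the same filter skips it.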

Why are there blank spaces in the Unicode character table and how do I check if a unicode value is one of those?

If you check out the Unicode table, there are several spots further into the table that are simply blank: there's a Unicode value, but no character, e.g. U+0BA5. Why are there these empty places?
Second of all, how would I check if a Unicode value is one of these empty spots? My code produces a Unicode character using unichr(int), which returns a valid value, but I don't know how to check whether that character will simply appear as an empty box.
Not all Unicode codepoints have been assigned; this can be for any number of reasons: historical, practical, political, etc. Every value from 0 through 10FFFF is a Unicode codepoint, but not all of them have been assigned a character or a name.
You could test if a given codepoint has a Unicode name, by using the unicodedata.name() function; it'll raise a ValueError when a codepoint has no name assigned to it:
>>> import unicodedata
>>> unicodedata.name('\u0BA5')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: no such name
Whether that suffices depends on exactly what you want to skip. For example, the control codes at the start of the Unicode table you referenced don't have names, but they are assigned a specific purpose.
Every Unicode codepoint also has a general category, a two-letter code that tells you what the codepoint is meant for. The unicodedata.category() function gives you that category:
>>> unicodedata.category('\u0BA5')
'Cn'
The Cn category is the Other, not assigned category.
It depends on what specifically you need to do with the character. There are codepoints that don't have a name but do have meaning, such as the control codes (category Cc), codepoints that exist for purposes other than displaying text (such as the surrogate and formatting codepoints, categories Cs and Cf, respectively), and codepoints set aside for private use (Co). As such you may want to exclude all C* category codepoints:
unicodedata.category(codepoint)[0] == "C"
Last, but not least, the Unicode standard is updated regularly, and codepoints that fall under the Cn category in older versions of the standard have received assignments in newer. New minor Python releases (the second digit in the Python version, so 3.7, 3.8, etc.) will generally include the most recent Unicode standard version at the time of their release. Check the unicodedata.unidata_version attribute for what specific version of the standard was bundled.
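For example, to apply that broader filter and see which Unicode version your checks are based on (a minimal sketch; is_displayable_text is just an illustrative name):

import unicodedata

def is_displayable_text(ch):
    # excludes all 'Other' categories: Cc, Cf, Cs, Co, Cn
    return not unicodedata.category(ch).startswith('C')

print(unicodedata.unidata_version)    # exact value depends on your Python version
print(is_displayable_text('\u0BA5'))  # False - unassigned (Cn)
print(is_displayable_text('A'))       # True  - Lu, an uppercase letter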
The accepted answer only works by coincidence.
There are code points that have no name but which are nevertheless assigned. For example, in Unicode 5.2.0, unicodedata.name(u'\u0000') raises an error, yet U+0000 is assigned: the NUL character has the Unicode category Cc.
To test for unassigned code points, test if the category is 'Cn':
unicodedata.category(u'\u0BA5') == 'Cn'
This evaluates to True, meaning the codepoint is not assigned.
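The NUL example is easy to verify:

import unicodedata

try:
    unicodedata.name('\u0000')        # assigned, but has no name
except ValueError as e:
    print(e)                          # no such name

print(unicodedata.category('\u0000')) # 'Cc' - an assigned control character
print(unicodedata.category('\u0BA5')) # 'Cn' - genuinely unassigned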

How do I split a multi-languages line in Python and get the Unicode hex value?

I'm trying to split lines like this one in Python:
aiburenshi 爱不忍释 "לא מסוגל להינתק, לא יכול להיפרד מדבר מרוב חיבתו אליו"
This line contains Hebrew, simplified Chinese and English.
For example, I would like to end up with a tuple T = (Hebrew string, English string, Chinese string).
The problem is that I can't figure out how to get the Unicode values of the Chinese or the Hebrew letters. Neither of these lines works:
print ((unicode("释","utf-8")).encode("utf-8"))
print ((unicode("א","utf-8")).encode("utf-8"))
And I get this error:
SyntaxError: Non-ASCII character '\xe9' in file split_or.py on line 9, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details
In Python 2, you need to open the file specifying an encoding like this:
import codecs
f = codecs.open("myfile.txt","r",encoding="utf-8")
In Python 3, you can just add the encoding option to any open() calls.
This will guarantee that the file is correctly decoded. Note that this doesn't mean your print calls will work properly; that depends on many things (see for example http://www.pycs.net/users/0000323/stories/14.html, and that's just a start); it's better to either use a proper debugger, or output to a file (which will again be opened with codecs.open()).
To get the actual codepoint (i.e. integer "value"), you can use the built-in ord():
>>> ord(u"£")
163
If you know the codepoint ranges for the different scripts, that's all you need; the Unicode code charts list the ranges for each block.
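For this particular line, a range-based classifier could look like the sketch below (Python 3 syntax); the ranges cover only the basic Hebrew block and the CJK Unified Ideographs block, which is enough for the example but by no means all of either script:

def script(ch):
    # very rough classification by codepoint range (illustrative only)
    cp = ord(ch)
    if 0x0590 <= cp <= 0x05FF:
        return 'Hebrew'
    if 0x4E00 <= cp <= 0x9FFF:
        return 'Chinese'
    return 'Other'

for ch in u'a释א':
    print(ch, hex(ord(ch)), script(ch))
# a 0x61 Other
# 释 0x91ca Chinese
# א 0x5d0 Hebrew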
Otherwise, you might want to use unicodedata to look up stuff, like the bidirectional category:
>>> unicodedata.bidirectional(u"£")
'ET'  # 'E'uropean 'T'erminator
In Python 2, Unicode string literals need to be prefixed with u. Note that once a string is already unicode, passing it to unicode() again with an encoding raises a TypeError; just encode it directly:
print(u"释".encode("utf-8"))
print(u"א".encode("utf-8"))
(You'll also need the coding declaration that the SyntaxError message points to; see PEP 263.)
In Python 3, string constants are Unicode by default.
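In Python 3 the whole dance becomes unnecessary; strings are Unicode already (and UTF-8 is the default source encoding), so you can inspect codepoints directly:

import unicodedata

for ch in '释א':
    print(ch, hex(ord(ch)), unicodedata.name(ch))
# 释 0x91ca CJK UNIFIED IDEOGRAPH-91CA
# א 0x5d0 HEBREW LETTER ALEF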

Printing escaped Unicode in Python

>>> s = 'auszuschließen'
>>> print(s.encode('ascii', errors='xmlcharrefreplace'))
b'auszuschlie&#223;en'
>>> print(str(s.encode('ascii', errors='xmlcharrefreplace'), 'ascii'))
auszuschlie&#223;en
Is there a prettier way to print any string without the b''?
EDIT:
I'm just trying to print escaped characters from Python, and my only gripe is that Python adds the b'' when I do that.
If I want to see the actual characters in a dumb terminal like Windows 7's, then I get this:
Traceback (most recent call last):
File "Mailgen.py", line 378, in <module>
marked_copy = mark_markup(language_column, item_row)
File "Mailgen.py", line 210, in mark_markup
print("TP: %r" % "".join(to_print))
File "c:\python32\lib\encodings\cp437.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2026' in position 29: character maps to <undefined>
>>> s='auszuschließen…'
>>> s
'auszuschließen…'
>>> print(s)
auszuschließen…
>>> b=s.encode('ascii','xmlcharrefreplace')
>>> b
b'auszuschlie&#223;en&#8230;'
>>> print(b)
b'auszuschlie&#223;en&#8230;'
>>> b.decode()
'auszuschlie&#223;en&#8230;'
>>> print(b.decode())
auszuschlie&#223;en&#8230;
You start out with a Unicode string. Encoding it to ASCII creates a bytes object containing the escaped characters you want. Python won't print it without converting it back into a string, and the default conversion puts in the b and quotes. Calling decode explicitly converts it back to a string; decode() defaults to UTF-8, and since your bytes consist only of ASCII, which is a subset of UTF-8, it is guaranteed to work.
To see the ASCII representation (like repr() in Python 2) for debugging:
print(ascii('auszuschließen…'))
# -> 'auszuschlie\xdfen\u2026'
To print the bytes themselves:
import sys

sys.stdout.buffer.write('auszuschließen…'.encode('ascii', 'xmlcharrefreplace'))
# -> auszuschlie&#223;en&#8230;
Not all terminals can handle more than some 8-bit character set, that's true. But they won't handle those characters no matter what you do, really.
Printing a Unicode string will, assuming that your OS sets up the terminal properly, produce the best result possible: characters the terminal cannot print will be replaced with some placeholder, like a question mark or similar. Doing that translation yourself will not really improve things.
Update:
Since you want to know what characters are in the string, you actually want their Unicode codepoints, or the XML equivalents in this case. That's more inspecting than printing, and then the b'' part usually isn't a problem per se.
But you can get rid of it easily and hackily like so:
print(repr(s.encode('ascii', errors='xmlcharrefreplace'))[2:-1])
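A less hacky equivalent is to decode the ASCII bytes back into a str before printing, as in the earlier answer:

print(s.encode('ascii', errors='xmlcharrefreplace').decode('ascii'))
# -> auszuschlie&#223;en&#8230;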
Since you're using Python 3, you can simply write print(s) to the console.
I agree that, depending on the console, it may not print properly, but most modern OSes since 2006 can handle Unicode strings without much of an issue. I'd encourage you to give it a try and see if it works.
Alternatively, you can declare the source file's encoding by placing this before any other lines in the file (similar to a shebang):
# -*- coding: utf-8 -*-
Note that this only tells the interpreter how the source file itself is encoded (in Python 3, UTF-8 is already the default); it doesn't change how strings are encoded when they're printed.
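If the real obstacle is the terminal's encoding rather than the source file's, on Python 3.7+ you can also reconfigure the standard output stream; a sketch (the choice of error handler is up to you):

import sys

# substitute XML character references for anything the terminal can't encode,
# instead of raising UnicodeEncodeError
sys.stdout.reconfigure(errors='xmlcharrefreplace')
print('auszuschließen…')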
