In Python 2.7 at least, unicodedata.name() doesn't recognise certain characters.
>>> from unicodedata import name
>>> name(u'\n')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: no such name
>>> name(u'a')
'LATIN SMALL LETTER A'
Certainly Unicode contains the character \n, and it has a name, specifically "LINE FEED".
NB. unicodedata.lookup('LINE FEED') and unicodedata.lookup(u'LINE FEED') both give a KeyError: undefined character name.
The unicodedata.name() lookup relies on column 2 of the UnicodeData.txt database in the standard (Python 2.7 uses Unicode 5.2.0).
If that name starts with < it is ignored. All control codes, including newlines, fall into that category; for them, the name column contains nothing but <control>:
000A;<control>;Cc;0;B;;;;;N;LINE FEED (LF);;;;
Column 10 is the old, Unicode 1.0 name, and should not be used, according to the standard. In other words, \n has no name, other than the generic <control>, which the Python database ignores (as it is not unique).
Python 3.3 added support for NameAliases.txt, which lets you look up names by alias; so lookup('LINE FEED'), lookup('new line') or lookup('eol'), etc, all reference \n. However, the unicodedata.name() method does not support aliases, nor could it (which would it pick?):
Added support for Unicode name aliases and named sequences. Both unicodedata.lookup() and '\N{...}' now resolve name aliases, and unicodedata.lookup() resolves named sequences too.
TL;DR: LINE FEED is not the official name for \n, it is but an alias for it. Python 3.3 and up let you look up characters by alias.
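For example, a quick interactive check on Python 3.3 or newer:
>>> import unicodedata
>>> unicodedata.lookup('LINE FEED')        # alias resolved since Python 3.3
'\n'
>>> '\N{LINE FEED}'                        # \N{...} escapes resolve aliases too
'\n'
>>> unicodedata.name('\n')                 # still raises: control codes have no name
Traceback (most recent call last):
  ...
ValueError: no such name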
Related
If you check out the Unicode table, there are several spots further in the table that are simply blank: there is a Unicode value, but no character, e.g. U+0BA5. Why are there these empty places?
Secondly, how would I check whether a Unicode value is one of these empty spots? My code determines a Unicode value using unichr(int), which returns a valid Unicode value, but I don't know how to check whether that value will simply appear as an empty box.
Not all Unicode codepoints have received an assignment; this can be for any number of reasons: historical, practical, political, etc. The full range of values between U+0000 and U+10FFFF are Unicode codepoints, but they are not necessarily assigned a character or a name.
You could test if a given codepoint has a Unicode name, by using the unicodedata.name() function; it'll raise a ValueError when a codepoint has no name assigned to it:
>>> import unicodedata
>>> unicodedata.name('\u0BA5')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: no such name
Whether that suffices depends on exactly what you want to skip. For example, the control codes at the start of the Unicode table you referenced don't have names, but they are assigned a specific purpose.
Every Unicode codepoint also has a general category, a two-letter code that tells you what the codepoint is meant for. The unicodedata.category() function gives you that category:
>>> unicodedata.category('\u0BA5')
'Cn'
The Cn category is the Other, not assigned category.
It depends on what specifically you need to do with the character. Some codepoints have no name but do have meaning, such as the control codes (category Cc); others exist for very specific purposes other than displayable text (such as the surrogate and formatting codepoints, categories Cs and Cf, respectively) or are reserved for private use (Co). As such you may want to exclude all C* category codepoints:
unicodedata.category(codepoint)[0] == "C"
Last, but not least, the Unicode standard is updated regularly, and codepoints that fall under the Cn category in older versions of the standard have received assignments in newer. New minor Python releases (the second digit in the Python version, so 3.7, 3.8, etc.) will generally include the most recent Unicode standard version at the time of their release. Check the unicodedata.unidata_version attribute for what specific version of the standard was bundled.
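Putting that together, a small helper (a sketch; the exact unidata_version shown depends on your Python build):
>>> import unicodedata
>>> def is_assigned(char):
...     # a codepoint is assigned if its general category is anything but Cn
...     return unicodedata.category(char) != 'Cn'
...
>>> is_assigned('\u0BA5')
False
>>> is_assigned('a')
True
>>> unicodedata.unidata_version   # which Unicode version the bundled data comes from
'13.0.0'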
The accepted answer only works by coincidence.
There are code points that have no name but are nevertheless assigned. For example, in Unicode 5.2.0, unicodedata.name(u'\u0000') raises a ValueError, yet the codepoint is assigned: the NUL character has the Unicode category Cc.
To test for unassigned code points, test if the category is 'Cn':
unicodedata.category(u'\u0BA5') == 'Cn'
This evaluates to True, meaning that the codepoint is not assigned.
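To see the difference between "unnamed" and "unassigned" in an interactive session:
>>> import unicodedata
>>> unicodedata.category(u'\u0000')   # NUL: no name, but assigned (a control code)
'Cc'
>>> unicodedata.category(u'\u0BA5')   # genuinely unassigned
'Cn'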
I need to be able to determine (or predict) when a unicode character won't be printable. For instance, if I print this unicode character under default settings, it prints fine:
>>> print(u'\ua62b')
ꘫ
But if I print another unicode character, it prints as a stupid, weird square:
>>> print(u'\ua62c')
I really need to be able to determine before a character is printed if it will display like this as an ugly square (or sometimes as an anonymous blank). What causes this, and how can I predict it?
While it's not very easy to tell whether the terminal running your script (or the font your terminal is using) can render a given character correctly, you can at least check that the character is actually defined.
The character \ua62b is defined as VAI SYLLABLE NDOLE DO, whereas the character \ua62c has no definition, which is why it may be rendered as a square or other generic symbol.
One way to check if a character is defined is to use the unicodedata module:
>>> import unicodedata
>>> unicodedata.name(u"\ua62b")
'VAI SYLLABLE NDOLE DO'
>>> unicodedata.name(u"\ua62c")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: no such name
As you can see above, a ValueError is raised for the \ua62c character because it isn't defined.
Another method is to check the category of the character. If it is Cn then the character is not assigned:
>>> import unicodedata
>>> unicodedata.category(u"\ua62b")
'Lo'
>>> unicodedata.category(u"\ua62c")
'Cn'
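If you'd rather wrap the name() check so it returns a boolean instead of raising, a small sketch:
>>> import unicodedata
>>> def has_name(char):
...     # True if the Unicode database knows a name for this character
...     try:
...         unicodedata.name(char)
...         return True
...     except ValueError:
...         return False
...
>>> has_name(u'\ua62b')
True
>>> has_name(u'\ua62c')
False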
I'm trying to get a Unicode character by its (unique) name in Python 2.7. The method I've found in the docs is not working for me:
>>> import unicodedata
>>> print unicodedata.lookup('PILE OF POO')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
KeyError: "undefined character name 'PILE OF POO'"
The problem is that PILE OF POO was introduced with Unicode 6.0, whereas the data behind unicodedata is older (Unicode 5.2.0 in Python 2.7). The docs say:
The module uses the same names and symbols as defined by the UnicodeData File Format 5.2.0 (see http://www.unicode.org/reports/tr44/tr44-4.html).
This means, unfortunately, that you are also out of luck with almost all emoji and hieroglyphs (if you're into Egyptology).
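On Python 3, which ships a much newer Unicode database, the same lookup works; a quick check:
>>> import unicodedata
>>> unicodedata.lookup('PILE OF POO')
'💩'
>>> unicodedata.name(u'\U0001F4A9')
'PILE OF POO'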
While inside the Python interpreter, what are some ways to learn about the packages I have?
>>> man sys
File "<stdin>", line 1
man sys
^
SyntaxError: invalid syntax
>>> sys --help
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: bad operand type for unary -: '_Helper'
Corrected:
>>> help(sys)
...
Now, how do I see all the packages available to me on my sys.path, and view their usage and documentation? I know that I can easily download a PDF, but all this stuff is already baked in; I'd rather not duplicate files.
Thanks!
You can call help("modules"); it displays the list of available modules.
To explore a particular module/class/function use dir and __doc__:
>>> import sys
>>> sys.__doc__
"This module ..."
>>> dir(sys)
[..., 'setprofile', ...]
>>> print(sys.setprofile.__doc__)
setprofile(function)
Set the profiling function. It will be called on each function call
and return. See the profiler chapter in the library manual.
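If you'd rather get the module list programmatically (instead of through the help pager), the standard library's pkgutil module can walk sys.path for you; a small sketch (the exact contents depend on your installation):
>>> import pkgutil
>>> names = sorted(name for _, name, _ in pkgutil.iter_modules())   # top-level modules on sys.path
>>> 'base64' in names
True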
Python has an extensive built-in help system which you can access with help():
>>> help()
Welcome to Python 2.7! This is the online help utility.
If this is your first time using Python, you should definitely check out
the tutorial on the Internet at http://docs.python.org/tutorial/.
Enter the name of any module, keyword, or topic to get help on writing
Python programs and using Python modules. To quit this help utility and
return to the interpreter, just type "quit".
To get a list of available modules, keywords, or topics, type "modules",
"keywords", or "topics". Each module also comes with a one-line summary
of what it does; to list the modules whose summaries contain a given word
such as "spam", type "modules spam".
help> modules
Please wait a moment while I gather a list of all available modules...
BaseHTTPServer base64 importlib sha
Bastion bdb imputil shelve
CGIHTTPServer binascii inspect shlex
Canvas binhex io shutil
[...]
Enter any module name to get more help. Or, type "modules spam" to search
for modules whose descriptions contain the word "spam".
help> base64
Help on module base64:
NAME
base64 - RFC 3548: Base16, Base32, Base64 Data Encodings
FILE
/usr/local/lib/python2.7/base64.py
MODULE DOCS
http://docs.python.org/library/base64
FUNCTIONS
b16decode(s, casefold=False)
Decode a Base16 encoded string.
s is the string to decode. Optional casefold is a flag specifying whether
a lowercase alphabet is acceptable as input. For security purposes, the
default is False.
The decoded string is returned. A TypeError is raised if s were
incorrectly padded or if there are non-alphabet characters present in the
string.
b16encode(s)
Encode a string using Base16.
s is the string to encode. The encoded string is returned.
b32decode(s, casefold=False, map01=None)
Decode a Base32 encoded string.
s is the string to decode. Optional casefold is a flag specifying whether
a lowercase alphabet is acceptable as input. For security purposes, the
default is False.
RFC 3548 allows for optional mapping of the digit 0 (zero) to the letter O
(oh), and for optional mapping of the digit 1 (one) to either the letter I
(eye) or letter L (el). The optional argument map01 when not None,
specifies which letter the digit 1 should be mapped to (when map01 is not
None, the digit 0 is always mapped to the letter O). For security
purposes the default is None, so that 0 and 1 are not allowed in the
input.
The decoded string is returned. A TypeError is raised if s were
incorrectly padded or if there are non-alphabet characters present in the
string.
b32encode(s)
Encode a string using Base32.
s is the string to encode. The encoded string is returned.
There is a search engine called nullege for finding out how a particular module or object is used in real source code.
See the example for os.
I wrote a Python interface for it on GitHub.
I'm trying to split this kind of line in Python:
aiburenshi 爱不忍释 "לא מסוגל להינתק, לא יכול להיפרד מדבר מרוב חיבתו אליו"
This line contains Hebrew, simplified Chinese and English.
If I have a tuple T for example, I would like to get the tuple to be T= (Hebrew string, English string, Chinese string).
The problem is that I can't figure out how to get the Unicode value of the Chinese or the Hebrew letters. Neither of these lines works:
print ((unicode("释","utf-8")).encode("utf-8"))
print ((unicode("א","utf-8")).encode("utf-8"))
And I get this error:
SyntaxError: Non-ASCII character '\xe9' in file split_or.py on line 9, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details
In Python 2, you need to open the file specifying an encoding like this:
import codecs
f = codecs.open("myfile.txt","r",encoding="utf-8")
In Python 3, you can just add the encoding option to any open() calls.
This will guarantee that the file is correctly decoded. Note that this doesn't mean your print calls will work properly; that depends on many things (see for example http://www.pycs.net/users/0000323/stories/14.html, and that's just a start). It's better to either use a proper debugger or output to a file (which will again be opened with codecs.open()).
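A minimal sketch of that approach, assuming the mixed-script lines live in a file named myfile.txt (the file name is just a placeholder):
import codecs
f = codecs.open("myfile.txt", "r", encoding="utf-8")   # placeholder file name
for line in f:
    # each line is now a unicode object; as a quick check, print its repr
    print(repr(line))
f.close()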
To get the actual codepoint (i.e. integer "value"), you can use the built-in ord():
>>> ord(u"£")
163
If you know the codepoint ranges for the different scripts, that's all you need; see this page or this page for the ranges.
Otherwise, you might want to use unicodedata to look up stuff, like the bidirectional category:
>>> unicodedata.bidirectional(u"£")
ET # 'E'uropean 'T'erminator
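For instance, a rough classifier using just ord() and two block ranges (a sketch that assumes the text only uses the basic Hebrew block and the CJK Unified Ideographs block; real text may use characters from other blocks):
>>> def rough_script(ch):
...     cp = ord(ch)                      # integer codepoint
...     if 0x0590 <= cp <= 0x05FF:        # Hebrew block
...         return 'hebrew'
...     if 0x4E00 <= cp <= 0x9FFF:        # CJK Unified Ideographs block
...         return 'chinese'
...     return 'other'
...
>>> rough_script(u'\u05d0')    # א
'hebrew'
>>> rough_script(u'\u91ca')    # 释
'chinese'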
In Python 2, Unicode string literals need to be prefixed with the u character (and the source file needs an encoding declaration such as # -*- coding: utf-8 -*- before non-ASCII literals are allowed at all). Once the literal is already Unicode, there is no need to pass it through unicode() again; just encode it:
print(u"释".encode("utf-8"))
print(u"א".encode("utf-8"))
In Python 3, string literals are Unicode by default.