How to check if chr()'s output will be undefined - Python

I'm using chr() to run through a list of Unicode characters, but whenever it comes across a character that is unassigned, it just keeps running and doesn't error out or anything. How do I check if the output of chr() will be undefined?
For example,
print(chr(55396))
is in range for Unicode; it's just an unassigned character. How do I check whether chr() will give me an actual assigned character, so that this hang-up doesn't occur?

You could use the unicodedata module:
>>> import unicodedata
>>> unicodedata.name(chr(55396))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: no such name
>>> unicodedata.name(chr(120))
'LATIN SMALL LETTER X'
>>>
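If you just need a yes/no answer, you can wrap this in a small helper. Here is a minimal sketch (is_assigned() is just an illustrative name); it relies on unicodedata.name() accepting a default value as a second argument instead of raising:
import unicodedata

def is_assigned(codepoint):
    # name() raises ValueError for unnamed code points unless a default is given
    return unicodedata.name(chr(codepoint), None) is not None

print(is_assigned(120))    # True  -> 'LATIN SMALL LETTER X'
print(is_assigned(55396))  # False -> no name assigned to this code point
One caveat: a few assigned characters, such as the control characters (e.g. chr(10)), also have no name, so if those matter to you, check unicodedata.category() as shown in one of the answers further down.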

Related

Unicodedata.normalize() ValueError: invalid normalization form

I'm trying to take foreign language text and output a human-readable, filename-safe equivalent. After looking around, it seems like the best option is unicodedata.normalize(), but I can't get it to work. I've tried putting the exact code from some answers here and elsewhere, but it keeps giving me this error. I only got one success, when I ran:
unicodedata.normalize('NFD', '\u00C7')
'C\u0327'
But every other time, I get an error. Here's the code I've tried:
unicodedata.normalize('NFKD', u'\u2460') #error, not sure why. Looks the same as above.
s = 'ذهب الرجل'
unicodedata.normalize('NKFC',s) #error
unicodedata.normalize('NKFD', 'ñ') #error
Specifically, the error I get is:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: invalid normalization form
I don't understand why this isn't working. All of these are strings, which means they are Unicode in Python 3. I tried encoding them using .encode(), but then normalize() said it only takes string arguments, so I know that can't be it. I'm seriously at a loss because even code I'm copying from here seems to error out. What's going on here?
Looking at unicodedata.c, the only way you can get that error is if you enter an invalid form string. The valid values are "NFC", "NFKC", "NFD", and "NFKD", but you seem to be using values with the "F" and "K" switched around:
>>> import unicodedata
>>>
>>> unicodedata.normalize('NKFD', 'ñ')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: invalid normalization form
>>>
>>> unicodedata.normalize('NFKD', 'ñ')
'ñ'
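For reference, here is a short sketch applying all four valid form names to the same string, so the accepted spellings are easy to see:
import unicodedata

s = 'ñ'
for form in ('NFC', 'NFD', 'NFKC', 'NFKD'):
    print(form, [hex(ord(c)) for c in unicodedata.normalize(form, s)])
# NFC and NFKC give ['0xf1']; NFD and NFKD give ['0x6e', '0x303']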

Identify unicode characters that can't be printed

I need to be able to determine (or predict) when a unicode character won't be printable. For instance, if I print this unicode character under default settings, it prints fine:
>>> print(u'\ua62b')
ꘫ
But if I print another unicode character, it prints as a stupid, weird square:
>>> print(u'\ua62c')
꘬
I really need to be able to determine, before a character is printed, whether it will display as an ugly square like this (or sometimes as an anonymous blank). What causes this, and how can I predict it?
While it's not very easy to tell if the terminal running your script (or the font your terminal is using) is able to render a given character correctly, you can at least check that the character actually has a representation.
The character \ua62b is defined as VAI SYLLABLE NDOLE DO, whereas the character \ua62c has no definition, which is why it may be rendered as a square or other generic symbol.
One way to check if a character is defined is to use the unicodedata module:
>>> import unicodedata
>>> unicodedata.name(u"\ua62b")
'VAI SYLLABLE NDOLE DO'
>>> unicodedata.name(u"\ua62c")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: no such name
As you can see above, a ValueError is raised for the \ua62c character because it isn't defined.
Another method is to check the category of the character. If it is Cn then the character is not assigned:
>>> import unicodedata
>>> unicodedata.category(u"\ua62b")
'Lo'
>>> unicodedata.category(u"\ua62c")
'Cn'
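Putting that into a reusable check, a minimal sketch (is_defined() is just an illustrative name, and remember it only tells you the code point is assigned, not that your terminal's font can actually draw it):
import unicodedata

def is_defined(ch):
    # 'Cn' is the general category for unassigned code points
    return unicodedata.category(ch) != 'Cn'

print(is_defined(u"\ua62b"))  # True  (VAI SYLLABLE NDOLE DO)
print(is_defined(u"\ua62c"))  # False (unassigned, likely to render as a box)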

How to find unicode characters by their descriptive names?

I'm trying to get a Unicode character by its (unique) name in Python 2.7. The method I've found in the docs is not working for me:
>>> import unicodedata
>>> print unicodedata.lookup('PILE OF POO')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
KeyError: "undefined character name 'PILE OF POO'"
The problem is that PILE OF POO was introduced with Unicode 6.0. However, the data behind unicodedata is older, from the 5.x versions. The docs say:
The module uses the same names and symbols as defined by the UnicodeData File Format 5.2.0 (see http://www.unicode.org/reports/tr44/tr44-4.html).
This means, unfortunately, that you are also out of luck with almost all Emoji and hieroglyphs (if you're into Egyptology).
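You can check which database your interpreter ships with via unicodedata.unidata_version, and fall back to the code point escape for names it doesn't know yet. A small Python 2.7 sketch (U+1F4A9 is the code point for PILE OF POO):
import unicodedata

print unicodedata.unidata_version   # '5.2.0' on CPython 2.7

try:
    poo = unicodedata.lookup('PILE OF POO')
except KeyError:
    poo = u'\U0001F4A9'   # fall back to writing the code point directly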

IronPython define utf-16 character in string

I want to define UTF-16 (LE) characters by their code point number.
An example is 'LINEAR B SYLLABLE B028 I'.
When I escape this character as u'\U00010001', I receive u'\u0001'.
Indeed:
>>> u'\U00010001' == u'\u0001'
True
If I use unichr() I get errors too:
>>> unichr(0x10001)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: 65536 is not in required range
How can I define UTF-16 characters in my Python app?
(IronPython 2.7)
You could try using a named literal (note the u prefix: in Python 2, \N{...} escapes are only interpreted inside unicode string literals):
print u"\N{LINEAR B SYLLABLE B038 E}"
If the other methods work on CPython but not IronPython, please open an IronPython issue with a minimal test case.
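If the named literal also misbehaves, another option on a UTF-16 runtime is to build the surrogate pair yourself. A hedged sketch (wide_unichr() is a made-up helper name, and this assumes unichr() accepts lone surrogate code units, as it does on CPython narrow builds):
def wide_unichr(codepoint):
    # return a unicode string for codepoint, using a surrogate pair above U+FFFF
    if codepoint <= 0xFFFF:
        return unichr(codepoint)
    codepoint -= 0x10000
    high = 0xD800 + (codepoint >> 10)    # lead surrogate
    low = 0xDC00 + (codepoint & 0x3FF)   # trail surrogate
    return unichr(high) + unichr(low)

linear_b = wide_unichr(0x10001)   # LINEAR B SYLLABLE B038 E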

How to format a write statement in Python?

I have data that I want to print to a file. For missing data, I wish to print the mean of the actual data. However, the mean is calculated to more than the required 4 decimal places. How can I write the mean to the file and format it at the same time?
I have tried the following, but keep getting errors:
outfile.write('{0:%.3f}'.format(str(mean))+"\n")
First, remove the % since it makes your format syntax invalid. See a demonstration below:
>>> '{:%.3f}'.format(1.2345)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: Invalid conversion specification
>>> '{:.3f}'.format(1.2345)
'1.234'
>>>
Second, don't wrap mean in str(), since str.format is expecting a float (that's what the f in the format syntax represents). Below is a demonstration of this bug:
>>> '{:.3f}'.format('1.2345')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: Unknown format code 'f' for object of type 'str'
>>> '{:.3f}'.format(1.2345)
'1.234'
>>>
Third, the +"\n" is unnecessary since you can put the "\n" directly in the string you pass to str.format.
Finally, as shown in my demonstrations, you can remove the 0 since it is redundant.
In the end, the code should be like this:
outfile.write('{:.3f}\n'.format(mean))
You don't need to convert to string using str(). Also, the "%" is not required. Just use:
outfile.write('{0:.3f}'.format(mean)+"\n")
First of all, the formatting of your string has nothing to do with your write statement. You can reduce your problem to:
string = '{0:%.3f}'.format(str(mean))+"\n"
outfile.write(string)
Then, your format specification is incorrect and should be:
string = '{0:.3f}\n'.format(mean)
outfile.write(string)
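Putting it all together, a minimal end-to-end sketch (the file name and data here are made up for illustration):
values = [1.0, 2.5, 4.75]
mean = sum(values) / len(values)   # 2.75

with open('means.txt', 'w') as outfile:
    outfile.write('{0:.3f}\n'.format(mean))   # writes the line "2.750"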
