IronPython: define a UTF-16 character in a string - python

I want to define UTF-16 (LE) characters by their code point number.
An example is 'LINEAR B SYLLABLE B038 E' (U+10001).
When I escape this character as u'\U00010001' I get u'\u0001'.
Indeed:
>>> u'\U00010001' == u'\u0001'
True
If I use unichr() I get an error too:
>>> unichr(0x10001)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: 65536 is not in required range
How can I define UTF-16 characters in my Python app?
I'm using IronPython 2.7.

You could try using a named literal:
print "\N{LINEAR B SYLLABLE B038 E}"
If the other methods work on CPython but not on IronPython, please open an IronPython issue with a minimal test case.
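If the named literal fails as well, one more thing to try on a runtime whose strings are made of UTF-16 code units (an assumption about IronPython 2.7 here, not something verified) is to spell the character as its surrogate pair. The sketch below, written in Python 3 syntax, computes the pair for a code point above U+FFFF. The helper name surrogate_pair is made up for illustration.

```python
def surrogate_pair(code_point):
    """Return the (high, low) UTF-16 surrogates for a code point above U+FFFF."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point must be in U+10000..U+10FFFF")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


high, low = surrogate_pair(0x10001)
print("\\u%04x\\u%04x" % (high, low))  # prints \ud800\udc01
```

On a UTF-16 runtime the literal u'\ud800\udc01' would then be expected to stand for U+10001. Whether IronPython 2.7 accepts it is untested here.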

Related

str.isdigit() behaviour when handling strings

Assuming the following:
>>> square = '²' # Superscript Two (Unicode U+00B2)
>>> cube = '³' # Superscript Three (Unicode U+00B3)
Curiously:
>>> square.isdigit()
True
>>> cube.isdigit()
True
OK, let's convert those "digits" to integer:
>>> int(square)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: invalid literal for int() with base 10: '²'
>>> int(cube)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: invalid literal for int() with base 10: '³'
Oooops!
Could someone please explain what behavior I should expect from the str.isdigit() method when handling strings?
str.isdigit doesn't claim to be related to parsability as an int. It reports a simple Unicode property: whether each character is a decimal character or a digit of some sort. From the docs:
str.isdigit()
Return True if all characters in the string are digits and there is at least one character, False otherwise. Digits include decimal characters and digits that need special handling, such as the compatibility superscript digits. This covers digits which cannot be used to form numbers in base 10, like the Kharosthi numbers. Formally, a digit is a character that has the property value Numeric_Type=Digit or Numeric_Type=Decimal.
In short, str.isdigit is thoroughly useless for detecting valid numbers. The correct solution to checking if a given string is a legal integer is to call int on it, and catch the ValueError if it's not a legal integer. Anything else you do will be (badly) reinventing the same tests the actual parsing code in int() performs, so why not let it do the work in the first place?
Side-note: You're using the term "utf-8" incorrectly. UTF-8 is a specific way of encoding Unicode, and only applies to raw binary data. Python's str is an "idealized" Unicode text type; it has no encoding (under the hood, it's stored encoded as one of ASCII, latin-1, UCS-2, UCS-4, and possibly also UTF-8, but none of that is visible at the Python layer outside of indirect measurements like sys.getsizeof, which only hints at the underlying encoding by letting you see how much memory the string consumes). The characters you're talking about are simple Unicode characters above the ASCII range, they're not specifically UTF-8.
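The "call int and catch the ValueError" approach from this answer can be sketched as follows (Python 3). The helper name parse_int is hypothetical, and returning None for invalid input is just one possible convention.

```python
def parse_int(text):
    """Return int(text), or None if text is not a legal integer literal."""
    try:
        return int(text)
    except ValueError:
        return None


square = '\u00b2'  # Superscript Two, the same character as in the question

print(square.isdigit())    # True: the character has Numeric_Type=Digit
print(square.isdecimal())  # False: it is not a decimal digit
print(parse_int(square))   # None: int() rejects it
print(parse_int('42'))     # 42
```

If a yes/no test without parsing is really wanted, str.isdecimal() is the stricter check of the two, but it still says nothing about signs or surrounding whitespace, which int() does accept.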

How to check if a chr()'s output will be undefined

I'm using chr() to run through a list of Unicode characters, but whenever it comes across a character that is unassigned, it just keeps running and doesn't raise an error or anything. How do I check whether the output of chr() will be undefined?
For example,
print(chr(55396))
is within the Unicode range, it's just an unassigned character. How do I check that chr() will give me an actual character, so that this hangup doesn't occur?
You could use the unicodedata module:
>>> import unicodedata
>>> unicodedata.name(chr(55396))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: no such name
>>> unicodedata.name(chr(120))
'LATIN SMALL LETTER X'
>>>
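One caveat with unicodedata.name() is that some assigned characters, such as control characters, have no name either. A possible refinement is to look at the general category as well. The sketch below (Python 3) uses a hypothetical helper, describe, that returns both pieces of information. In the Unicode general categories, 'Cn' means unassigned and 'Cs' means surrogate; 55396 is U+D864, which lies in the surrogate range U+D800..U+DFFF.

```python
import unicodedata


def describe(code_point):
    """Return (general category, name or None) for a code point."""
    ch = chr(code_point)
    # Passing a default avoids the ValueError for nameless characters.
    return unicodedata.category(ch), unicodedata.name(ch, None)


for cp in (120, 55396, 0xFFFF):
    print(hex(cp), describe(cp))
```

A loop over code points could then skip anything whose category is 'Cn' or 'Cs' instead of relying on an exception.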

How to convert hex str to an hex bytearray [duplicate]

I have a long sequence of hex digits in a string, such as
000000000000484240FA063DE5D0B744ADBED63A81FAEA390000C8428640A43D5005BD44
only much longer, several kilobytes. Is there a built-in way to convert this to a bytes object in Python 2.6/3?
In Python 3:
result = bytes.fromhex(some_hex_string)
This works in Python 2.7 and higher, including Python 3:
result = bytearray.fromhex('deadbeef')
Note: There seems to be a bug in bytearray.fromhex() in Python 2.6. The python.org documentation states that the function accepts a string as an argument, but calling it with a str raises the following error:
>>> bytearray.fromhex('B9 01EF')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: fromhex() argument 1 must be unicode, not str
You can do this with the hex codec (Python 2 only), i.e.:
>>> s='000000000000484240FA063DE5D0B744ADBED63A81FAEA390000C8428640A43D5005BD44'
>>> s.decode('hex')
'\x00\x00\x00\x00\x00\x00HB@\xfa\x06=\xe5\xd0\xb7D\xad\xbe\xd6:\x81\xfa\xea9\x00\x00\xc8B\x86@\xa4=P\x05\xbdD'
Try the binascii module:
from binascii import unhexlify
b = unhexlify(myhexstr)
Or, equivalently:
import binascii
binascii.a2b_hex(hex_string)
That's the way I did it.
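For comparison, here is a small Python 3 sketch that runs the approaches from the answers above on the string from the question and checks that they agree. The variable names are illustrative only.

```python
import binascii

hex_string = ('000000000000484240FA063DE5D0B744ADBED63A81'
              'FAEA390000C8428640A43D5005BD44')

from_bytes = bytes.fromhex(hex_string)             # Python 3 only
from_bytearray = bytearray.fromhex(hex_string)     # mutable variant
from_binascii = binascii.unhexlify(hex_string)     # same function as a2b_hex

# All three hold the same byte values.
assert from_bytes == from_binascii == bytes(from_bytearray)
print(len(from_bytes), from_bytes[6:9])
```

bytes.fromhex() and bytearray.fromhex() ignore spaces between byte pairs, while binascii.unhexlify() expects an even number of hex digits with no separators.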

How to find unicode characters by their descriptive names?

I'm trying to get a Unicode character by its (unique) name in Python 2.7. The method I've found in the docs is not working for me:
>>> import unicodedata
>>> print unicodedata.lookup('PILE OF POO')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
KeyError: "undefined character name 'PILE OF POO'"
The problem is that PILE OF POO was introduced with Unicode 6. However, the data in unicodedata is mostly older, from 5.x versions or so. The docs say:
The module uses the same names and symbols as defined by the UnicodeData File Format 5.2.0 (see http://www.unicode.org/reports/tr44/tr44-4.html).
This means, unfortunately, that you are also out of luck with almost all emoji and hieroglyphs (if you're into Egyptology).
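To see which Unicode version a given interpreter ships with, and to look a name up without crashing when it is missing, something like the following sketch may help (Python 3 syntax). The helper name lookup_or_none is hypothetical; unicodedata.unidata_version and unicodedata.lookup are the standard module attributes.

```python
import unicodedata


def lookup_or_none(name):
    """Return the character with this Unicode name, or None if unknown."""
    try:
        return unicodedata.lookup(name)
    except KeyError:
        return None


# Version of the Unicode database bundled with this interpreter.
print(unicodedata.unidata_version)

# None on interpreters whose database predates Unicode 6.
print(repr(lookup_or_none('PILE OF POO')))
```

On an interpreter with a newer database the second line prints the character instead of None.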
