How to find unicode characters by their descriptive names? - python

Trying to get a Unicode character by its (unique) name in Python 2.7. The method I've found in the docs is not working for me:
>>> import unicodedata
>>> print unicodedata.lookup('PILE OF POO')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
KeyError: "undefined character name 'PILE OF POO'"

The problem is that PILE OF POO was introduced in Unicode 6.0, while the database bundled with unicodedata in Python 2.7 is older. The docs say:
The module uses the same names and symbols as defined by the UnicodeData File Format 5.2.0 (see http://www.unicode.org/reports/tr44/tr44-4.html).
This means, unfortunately, that you are also out of luck with almost all emoji and Egyptian hieroglyphs (if you're into Egyptology).
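On Python 3 the bundled database is much newer, so the same lookup succeeds there; a quick check, as a minimal sketch (the exact version string depends on your Python release):
>>> import unicodedata
>>> unicodedata.unidata_version
'13.0.0'
>>> print(unicodedata.lookup('PILE OF POO'))
💩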

Related

Expressions with Python 3.6 'f' strings not compatible with `.format` or `.format_map`

I discovered that .format_map and .format are not compatible with Python 3.6 f-strings, in that the native f prefix allows complex expressions (such as slices and function calls), but .format_map doesn't, for example:
>>> version = '1.13.8.10'
>>> f'example-{".".join(version.split(".")[:3])}'
'example-1.13.8'
>>> 'example-{".".join(version.split(".")[:3])}'.format_map(dict(version=version))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
KeyError: '"'
>>> 'example-{".".join(version.split(".")[:3])}'.format(version=version)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
KeyError: '"'
I really want to be able to expose the full capability of Python f-strings via a configuration file, where the user supplies a string that may contain non-trivial {...} segments referencing other values in the same file and optionally doing some basic data manipulation. For example, within a YAML file I post-process a subset of keys using .format so they can reference variables from within the same config file.
This works fine for simple variables, e.g. {version} where the YAML file is a dictionary with a version key and I pass the dict in as an argument to .format_map, but throws KeyError with more complicated expressions (as shown above).
There must be a way to get the same functionality as f strings... I thought .format_map was it... but it doesn't offer complex expressions...
You are looking for eval:
>>> def format_map_eval(string, mapping):
...     return eval(f'f{string!r}', mapping)
...
>>> version = '1.13.8.10'
>>> some_config_string = 'example-{".".join(version.split(".")[:3])}'
>>> format_map_eval(some_config_string, dict(version=version))
'example-1.13.8'
This at least is explicit about the fact that you are the one supplying the string to eval.
The key feature of f-strings is that they evaluate arbitrary expressions inside their formatting brackets. If you want a function call that does this, you are asking for eval.
Now that I think about it, I'm not sure this is safe or portable across implementations because there is no guarantee about the repr of str as far as I know.
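To make that concern concrete, consider a hypothetical template that mixes both quote characters: repr() must then escape one of them, and the escaping backslash lands inside the f-string expression, which is a SyntaxError before Python 3.12 (PEP 701):
>>> template = 'both {"a" + \'b\'} quotes'
>>> print(repr(template))
'both {"a" + \'b\'} quotes'
>>> # eval('f' + repr(template), {})  # SyntaxError on Python < 3.12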

How to check if a chr()'s output will be undefined

I'm using chr() to run through a list of Unicode characters, but whenever it comes across a character that is unassigned, it just continues running and doesn't error out or anything. How do I check if the output of chr() will be undefined?
for example,
print(chr(55396))
is in the Unicode range, it's just an unassigned character. How do I check whether chr() will give me an actual character, so that this hangup doesn't occur?
You could use the unicodedata module:
>>> import unicodedata
>>> unicodedata.name(chr(55396))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: no such name
>>> unicodedata.name(chr(120))
'LATIN SMALL LETTER X'
>>>
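Building on that, unicodedata.name() accepts a default instead of raising, which makes a small helper easy; a sketch, where is_assigned is a hypothetical name and "assigned" is taken to mean "has a name in the database":
>>> import unicodedata
>>> def is_assigned(codepoint):
...     return unicodedata.name(chr(codepoint), None) is not None
...
>>> is_assigned(55396)
False
>>> is_assigned(120)
True
Note that control characters such as '\n' are assigned yet have no name, so a stricter check could consult unicodedata.category() instead ('Cn' means unassigned, 'Cs' surrogate).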

How to iterate through arabic word in python? [duplicate]

In Python 2.7 at least, unicodedata.name() doesn't recognise certain characters.
>>> from unicodedata import name
>>> name(u'\n')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: no such name
>>> name(u'a')
'LATIN SMALL LETTER A'
Certainly Unicode contains the character \n, and it has a name, specifically "LINE FEED".
NB. unicodedata.lookup('LINE FEED') and unicodedata.lookup(u'LINE FEED') both give a KeyError: undefined character name.
The unicodedata.name() lookup relies on the name field (the second column) of the UnicodeData.txt database in the standard (Python 2.7 uses Unicode 5.2.0).
If that name starts with < it is ignored. All control codes, including newlines, fall into that category; their name field holds nothing but <control>:
000A;<control>;Cc;0;B;;;;;N;LINE FEED (LF);;;;
The LINE FEED (LF) field further along that line is the old Unicode 1.0 name, which should not be used, according to the standard. In other words, \n has no name other than the generic <control>, which the Python database ignores (as it is not unique).
Python 3.3 added support for NameAliases.txt, which lets you look up names by alias; so lookup('LINE FEED'), lookup('new line') or lookup('eol'), etc, all reference \n. However, the unicodedata.name() method does not support aliases, nor could it (which would it pick?):
Added support for Unicode name aliases and named sequences. Both unicodedata.lookup() and '\N{...}' now resolve name aliases, and unicodedata.lookup() resolves named sequences too.
TL;DR: LINE FEED is not the official name for \n; it is only an alias for it. Python 3.3 and up let you look up characters by alias.
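A quick sanity check on Python 3.3+, assuming the aliases from NameAliases.txt:
>>> import unicodedata
>>> unicodedata.lookup('LINE FEED')
'\n'
>>> '\N{LINE FEED}' == '\n'
True
>>> unicodedata.name('\n', '<unnamed>')
'<unnamed>'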

IronPython define utf-16 character in string

I want to define UTF-16 (LE) characters by their code point number.
An example is 'LINEAR B SYLLABLE B028 I'.
When I write this character with the escape u'\U00010001', I get u'\u0001'. Indeed:
>>> u'\U00010001' == u'\u0001'
True
If I use unichr() I get errors too:
>>> unichr(0x10001)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: 65536 is not in required range
How can I define utf-16 characters in my Python app?
IronPython 2.7
You could try using a named literal:
print u"\N{LINEAR B SYLLABLE B038 E}"
If the other methods work on CPython but not IronPython, please open an IronPython issue with a minimal test case.
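If the escape itself is broken, one workaround on a UTF-16 (narrow) build is to assemble the surrogate pair by hand; a sketch, where wide_unichr is a hypothetical helper name:
>>> def wide_unichr(codepoint):
...     if codepoint <= 0xFFFF:
...         return unichr(codepoint)
...     codepoint -= 0x10000  # split into a UTF-16 surrogate pair
...     return unichr(0xD800 + (codepoint >> 10)) + unichr(0xDC00 + (codepoint & 0x3FF))
...
>>> len(wide_unichr(0x10001))  # two UTF-16 code units
2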

How to create python bytes object from long hex string?

I have a long sequence of hex digits in a string, such as
000000000000484240FA063DE5D0B744ADBED63A81FAEA390000C8428640A43D5005BD44
only much longer, several kilobytes. Is there a builtin way to convert this to a bytes object in python 2.6/3?
result = bytes.fromhex(some_hex_string)
Works in Python 2.7 and higher, including Python 3:
result = bytearray.fromhex('deadbeef')
Note: There seems to be a bug with the bytearray.fromhex() function in Python 2.6. The python.org documentation states that the function accepts a string as an argument, but when applied, the following error is thrown:
>>> bytearray.fromhex('B9 01EF')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: fromhex() argument 1 must be unicode, not str
You can do this with the hex codec (Python 2 only), i.e.:
>>> s='000000000000484240FA063DE5D0B744ADBED63A81FAEA390000C8428640A43D5005BD44'
>>> s.decode('hex')
'\x00\x00\x00\x00\x00\x00HB@\xfa\x06=\xe5\xd0\xb7D\xad\xbe\xd6:\x81\xfa\xea9\x00\x00\xc8B\x86@\xa4=P\x05\xbdD'
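On Python 3, str has no .decode() method; besides bytes.fromhex(), the codec spelling goes through the codecs module. A minimal sketch:
>>> import codecs
>>> codecs.decode('deadbeef', 'hex')
b'\xde\xad\xbe\xef'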
Try the binascii module:
from binascii import unhexlify
b = unhexlify(myhexstr)
import binascii
binascii.a2b_hex(hex_string)
That's the way I did it.
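For a quick sanity check, binascii.hexlify() is the inverse operation; a round-trip sketch (output shown for Python 3):
>>> import binascii
>>> data = binascii.unhexlify('deadbeef')
>>> binascii.hexlify(data)
b'deadbeef'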
