Details of Unicode Names \N Documented? [duplicate] - python

It appears, based on a urwid example, that u'\N{HYPHEN BULLET}' will create a Unicode character that is a hyphen intended for use as a bullet.
The names for Unicode characters seem to be defined at fileformat.info, and some aspects of using Unicode in Python are covered in the HOWTO documentation, though it makes no mention of the \N{} syntax.
If you pull all these docs together you get the idea that the constant u"\N{HYPHEN BULLET}" creates a ⁃
However, this is all a theory based on pulling all this data together. I can find no documentation for "\N{}" in the Python docs.
My question is whether my theory of operation is correct and whether it is documented anywhere?

Not every gory detail can be found in a how-to. The table of escape sequences in the reference manual includes:
Escape Sequence: \N{name}
Meaning: Character named name in the Unicode database (Unicode only)

You are correct that u"\N{CHARACTER NAME}" produces a valid Unicode character in Python.
It is not documented much in the Python docs, but after some searching I found a reference to it on effbot.org:
http://effbot.org/librarybook/ucnhash.htm
The ucnhash module
(Implementation, 2.0 only) This module is an implementation module,
which provides a name to character code mapping for Unicode string
literals. If this module is present, you can use \N{} escapes to map
Unicode character names to codes.
In Python 2.1, the functionality of this module was moved to the
unicodedata module.
Checking the documentation for unicodedata shows that the module is using the data from the Unicode Character Database.
unicodedata — Unicode Database
This module provides access to the Unicode Character Database (UCD)
which defines character properties for all Unicode characters. The
data contained in this database is compiled from the UCD version
9.0.0.
The full data can be found at: https://www.unicode.org/Public/9.0.0/ucd/UnicodeData.txt
Each record has the structure HEXVALUE;CHARACTER NAME;... (further fields follow), so you can use this file to look up characters.
For example:
# 0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
>>> u"\N{LATIN CAPITAL LETTER A}"
'A'
# FF7B;HALFWIDTH KATAKANA LETTER SA;Lo;0;L;<narrow> 30B5;;;;N;;;;;
>>> u"\N{HALFWIDTH KATAKANA LETTER SA}"
'ｻ'
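As a rough illustration of how those records map to characters, here is a minimal Python 3 sketch that parses the two sample lines quoted above. The helper name parse_record and the SAMPLE_RECORDS list are made up for this example; only the first two fields of each record are used.

```python
import unicodedata

# Two sample records copied from the examples above; fields are separated by ';'.
SAMPLE_RECORDS = [
    "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;",
    "FF7B;HALFWIDTH KATAKANA LETTER SA;Lo;0;L;<narrow> 30B5;;;;N;;;;;",
]


def parse_record(line):
    """Return (character, name) from one UnicodeData.txt-style line (hypothetical helper)."""
    fields = line.split(";")
    code_point = int(fields[0], 16)  # first field is the code point in hex
    return chr(code_point), fields[1]  # second field is the character name


for record in SAMPLE_RECORDS:
    char, name = parse_record(record)
    # unicodedata.lookup() resolves a name to a character, like \N{...} does in a literal.
    print(name, repr(char), unicodedata.lookup(name) == char)
```

This assumes the unicodedata build shipped with your interpreter knows both names, which should hold for any recent Python 3.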

The \N{} syntax is documented in the Unicode HOWTO, at least.
The names are documented in the Unicode standard, such as:
http://www.unicode.org/Public/UCD/latest/ucd/NamesList.txt
The unicodedata module can look up a name for a character:
>>> import unicodedata as ud
>>> ud.name('A')
'LATIN CAPITAL LETTER A'
>>> print('\N{LATIN CAPITAL LETTER A}')
A
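To check the theory from the question directly, a small Python 3 sketch along these lines round-trips between a character and its name. The helpers describe and from_name are illustrative names, not part of unicodedata; they only wrap unicodedata.name() and unicodedata.lookup().

```python
import unicodedata

bullet = "\N{HYPHEN BULLET}"  # the same character as "\u2043"


def describe(ch, default="<no name>"):
    """Return the Unicode name of ch, or a placeholder if it has none."""
    return unicodedata.name(ch, default)


def from_name(name):
    """Return the character for a name, or None if the name is unknown."""
    try:
        return unicodedata.lookup(name)
    except KeyError:
        return None


print(describe(bullet))
print(from_name("HYPHEN BULLET") == bullet)
print(from_name("NOT A REAL CHARACTER NAME"))
```

Passing a default to unicodedata.name() avoids the ValueError it would otherwise raise for code points that have no name.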

Related

extract some arabic/persian (unicode) words with regex using python [duplicate]

I need to extract some specific names in Arabic/Persian (something like proper nouns in English), using Python's re library.
example (the word "شرکت" means "company" and we want to extract what the company name is):
input: شرکت تست گستران خلیج فارس
output: تست گستران خلیج فارس
I've seen [this answer], and it would be fine to replace "university" with "شرکت" in that example, but I don't understand how to match the keyword with a regex in Arabic Unicode when this does not work:
re.match("شرکت", "\u0634\u0631\u06A9\u062A") # returns None
Python 2 does not default to parsing unicode literals (like when pasting unicode letters, or having a \u in the code). You have to be explicit about it:
re.match(u"شرکت", u"\u0634\u0631\u06A9\u062A")
Otherwise, the Arabic will be translated to the actual bytes, which are different than the Unicode code points, and the string on the right will contain literal backslashes, since Python 2 does not recognize \u as a valid escape in a plain (byte) string literal.
Another option is to import from __future__. In Python 3 every string literal is parsed as Unicode, making the u"..." prefix somewhat obsolete:
from __future__ import unicode_literals
will make unicode literals be parsed correctly with no u"".
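For the extraction itself, a minimal sketch in Python 3 (where every str is already Unicode) might look like the following. The names KEYWORD, REST, and company_name are assumptions for this example, and the pattern simply captures everything after the keyword and the whitespace that follows it.

```python
import re

# "company", spelled with escapes so the exact code points are explicit.
KEYWORD = "\u0634\u0631\u06A9\u062A"
REST = "تست گستران خلیج فارس"  # the company name from the example input
text = KEYWORD + " " + REST

# Match the keyword at the start, skip whitespace, capture the remainder.
match = re.match(re.escape(KEYWORD) + r"\s+(.+)", text)
company_name = match.group(1) if match else None
print(company_name)
```

On Python 2 the same idea should work once both the pattern and the input are unicode objects, as described above.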

Python: how to split a string by a delimiter that is invalid <0x0c>

I have a document which contains the character <0x0c>.
I am splitting on it using re.split.
The problem is that the code looks like this:
import re
re.split('',text)
Although it works, you CAN'T see the character, and apart from leaving a nice comment, it is a great candidate to become one of those pieces of legacy code that only I would understand.
How can I write it in a different, readable way?
You can express any character using escape codes. The 0x0C Form Feed ASCII codepoint can be expressed as \f or as \x0c:
re.split('\f', text)
See the Python string and byte literals syntax for more details on what escape sequences Python supports when defining a string literal value.
Note: you don't need to use the regex module to split on straight-up character sequences, you can just as well use str.split() here:
text.split('\f')
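A small sketch showing that the different spellings name the same character, and that both split approaches agree. The sample text and the FORM_FEED constant are made up for illustration; a named constant is one way to make the intent visible to the next reader.

```python
import re

FORM_FEED = "\f"  # ASCII 0x0C, also written "\x0c" or chr(12)

# Hypothetical document with two page breaks, written with different escapes.
text = "page one\x0cpage two\fpage three"

pages = text.split(FORM_FEED)
pages_via_regex = re.split(r"\f", text)  # the re module understands \f as well

print(pages)
print(pages == pages_via_regex)
```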

Reading unicode characters from file/sqlite database and using it in Python

I have a list of variables with Unicode characters, some of them for chemicals such as ozone gas: 'O\u2083'. All of them are stored in a SQLite database, which is read by Python code to produce O3. However, when I read them I get 'O\\u2083'. The SQLite database is created from a CSV file that contains the string 'O\u2083' among others. I understand that \u2083 is not being stored in the SQLite database as one Unicode character but as 6 Unicode characters (which would be \,u,2,0,8,3). Is there any way to recognize Unicode characters in this context? For now, my first option is to write a function that recognizes such sets of characters and replaces them with the Unicode characters. Is there anything like this already implemented?
SQLite allows you to read/write Unicode text directly. u'O\u2083' is two characters u'O' and u'\u2083' (your question has a typo: 'u\2083' != '\u2083').
I understand that u\2083 is not being stored in sqlite database as unicode character but as 6 unicode characters (which would be u,\,2,0,8,3)
Don't confuse u'u\2083' and u'\u2083': the latter is a single character while the former is 4-character sequence: u'u', u'\x10' ('\20' is interpreted as octal in Python), u'8', u'3'.
If you save a single Unicode character u'\u2083' into a SQLite database, it is stored as a single Unicode character (the internal representation of Unicode inside the database is irrelevant as long as the abstraction holds).
On Python 2, if there is no from __future__ import unicode_literals at the top of the module then 'abc' string literal creates a bytestring instead of a Unicode string -- in that case both 'u\2083' and '\u2083' are sequences of bytes, not text characters (\uxxxx is not recognized as a unicode escape sequence inside bytestrings).
If you have a byte string (length 7), decode the Unicode escape.
>>> s = 'O\u2083'
>>> len(s)
7
>>> s
'O\\u2083'
>>> print(s)
O\u2083
>>> u = s.decode('unicode-escape')
>>> len(u)
2
>>> u
u'O\u2083'
>>> print(u)
O₃
Caveat: Your console/IDE used to print the character needs to use an encoding that supports the character or you'll get a UnicodeEncodeError when printing. The font must support the symbol as well.
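The session above is Python 2. On Python 3, where str has no decode() method, one possible sketch is to replace each literal \uXXXX sequence with a regular expression. The helper name unescape_u is hypothetical, and this version deliberately handles only four-digit \u escapes (no \U escapes and no surrogate pairs).

```python
import re

# Matches a literal backslash, 'u', and exactly four hex digits.
_U_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")


def unescape_u(text):
    """Replace literal \\uXXXX sequences with the characters they name."""
    return _U_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


stored = "O\\u2083"  # 7 characters, as read back from the database
fixed = unescape_u(stored)

print(len(stored), len(fixed))
print(fixed)
```

Unlike a blanket unicode-escape decode, this leaves any real non-ASCII characters already in the text untouched.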
It's important to remember everything is bytes. To pull bytes into something useful to you, you kind of have to know what encoding is used when you pull in data. There are too many ambiguous cases to determine encoding by analyzing the data. When you send data out of your program, it's all back out to bytes again. Depending on whether you're using Python 2.x or 3.x you'll have a very different experience with Unicode and Python.
You can, however, attempt encoding and simply do a "replace" on errors. For example, the_string.encode("utf-8", "replace") will try to encode as UTF-8 and will replace problem characters with a ? -- You could also anticipate problem characters and replace them beforehand, but that gets unmanageable quickly. Take a look at the codecs module for more error-handling options.
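To make the error-handling options concrete, here is a short Python 3 sketch that encodes the ozone string to ASCII with several of the standard error handlers. ASCII is used instead of UTF-8 only because UTF-8 can represent this character, so nothing would be replaced.

```python
text = "O\u2083"  # 'O' followed by SUBSCRIPT THREE

# Each handler decides what happens to characters the target encoding lacks.
replaced = text.encode("ascii", "replace")             # substitutes '?'
ignored = text.encode("ascii", "ignore")               # drops the character
escaped = text.encode("ascii", "backslashreplace")     # keeps a \uXXXX escape
as_xml = text.encode("ascii", "xmlcharrefreplace")     # keeps an XML reference

for result in (replaced, ignored, escaped, as_xml):
    print(result)
```

The last two handlers are lossless in the sense that the original character can be recovered later, which "replace" and "ignore" do not allow.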

how to convert u'\uf04a' to unicode in python [duplicate]

I am trying to decode u'\uf04a' in Python so that I can print it without errors or warnings. In other words, I need to convert stupid Microsoft Windows-1252 characters to actual Unicode.
The source of html containing the unusual errors comes from here http://members.lovingfromadistance.com/showthread.php?12338-HAVING-SECOND-THOUGHTS
Read about u'\uf04a' and u'\uf04c' by clicking here http://www.fileformat.info/info/unicode/char/f04a/index.htm
one example looks like this:
"Oh god please some advice ":
Out[408]: u'Oh god please some advice \uf04c'
Given a thread like this as one example for test:
thread = u'who are you \uf04a Why you are so harsh to her \uf04c'
thread.decode('utf8')
print u'\uf04a'
print u'\uf04a'.decode('utf8') # error!!!
'charmap' codec can't encode character u'\uf04a' in position 1526: character maps to undefined
With the help of two Python scripts, I successfully convert the u'\x92', but I am still stuck with u'\uf04a'. Any suggestions?
References
https://github.com/AnthonyBRoberts/NNS/blob/master/tools/killgremlins.py
Handling non-standard American English Characters and Symbols in a CSV, using Python
Solution:
Following the comments below, I replaced these characters with a question mark ('?'):
thread = u'who are you \uf04a Why you are so harsh to her \uf04c'
thread = thread.replace(u'\uf04a', '?')
thread = thread.replace(u'\uf04c', '?')
Hope this is helpful to other beginners.
The notation u'\uf04a' denotes the Unicode codepoint U+F04A, which is by definition a private use codepoint. This means that the Unicode standard does not assign any character to it, and never will; instead, it can be used by private agreements.
It is thus meaningless to talk about printing it. If there is a private agreement on using it in some context, then you print it using a font that has a glyph allocated to that codepoint. Different agreements and different fonts may allocate completely different characters and glyphs to the same codepoint.
It is possible that U+F04A is a result of erroneous processing (e.g., wrong conversions) of character data at some earlier phase.
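One way to confirm this from Python 3 is to ask unicodedata for the general category, which is "Co" for private use code points. The helper is_private_use below is an illustrative name, not a library function.

```python
import unicodedata


def is_private_use(ch):
    """True if ch is a private use code point (general category 'Co')."""
    return unicodedata.category(ch) == "Co"


for ch in ("\uf04a", "\uf04c", "A"):
    # Private use code points have no name, so supply a default.
    print(hex(ord(ch)), is_private_use(ch), unicodedata.name(ch, "<no name>"))
```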
u'\uf04a'
already is a Unicode object, which means there's nothing to decode. The only thing you can do with it is encode it, if you're targeting a specific file encoding like UTF-8 (which is not the same as Unicode, but is confused with it all the time).
u'\uf04a'.encode("utf-8")
gives you a string (Python 2) or bytes object (Python 3) which you can then write to a file or a UTF-8 terminal etc.
You won't be able to encode it as a plain Windows string because cp1252 doesn't have that character.
What you can do is convert it to an encoding that doesn't have those offending characters by telling the encoder to replace missing characters by ?:
>>> u'who\uf04a why\uf04c'.encode("ascii", errors="replace")
'who? why?'
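Encoding to ASCII replaces every non-ASCII character, not just the offending ones. If only private use code points should be replaced, a sketch like the following (Python 3) targets the three private use ranges with a character class. The names PRIVATE_USE and replace_private_use are assumptions for this example.

```python
import re

# The BMP Private Use Area plus the two supplementary private use planes.
PRIVATE_USE = re.compile(
    "[\ue000-\uf8ff\U000f0000-\U000ffffd\U00100000-\U0010fffd]"
)


def replace_private_use(text, replacement="?"):
    """Replace private use code points, leaving all other characters alone."""
    return PRIVATE_USE.sub(replacement, text)


thread = "who are you \uf04a Why you are so harsh to her \uf04c caf\u00e9"
print(replace_private_use(thread))
```

Here the accented character survives, whereas encode("ascii", errors="replace") would turn it into a question mark too.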

Is there a list of all ASCII characters in python's standard library? [duplicate]

Is there a field or a function that would return all ASCII characters in python's standard library?
You can make one.
ASCII = ''.join(chr(x) for x in range(128))
If you need to check for membership, there are other ways to do it:
if c in ASCII:
    pass  # c is an ASCII character
if c <= '\x7f':
    pass  # c is an ASCII character
If you want to check that an entire string is ASCII:
def is_ascii(s):
    """Returns True if a string is ASCII, False otherwise."""
    try:
        s.encode('ASCII')
        return True
    except UnicodeEncodeError:
        return False
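On Python 3.7 and later, str has a built-in isascii() method that covers the whole-string check. The sketch below compares it with a code-point-based version; the function name is_ascii_by_ordinal is made up for this example.

```python
def is_ascii_by_ordinal(s):
    """True if every character in s has a code point below 128."""
    return all(ord(ch) < 128 for ch in s)


for sample in ("hello", "h\u00e9llo", ""):
    # str.isascii() requires Python 3.7+; it returns True for the empty string.
    print(repr(sample), sample.isascii(), is_ascii_by_ordinal(sample))
```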
You can use the string module:
import string
print string.printable
which gives:
'0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c'
I don't know of any included Python module that has such an attribute. However, the easiest and shortest way is probably just to create it yourself:
standard_ascii = [chr(i) for i in xrange(128)]
or
extended_ascii = [chr(i) for i in xrange(256)]
for the extended ascii character list.
Note that
import string
string.printable
does not include all of the 128 standard ASCII characters, which you can see by
len(string.printable)
> 100
If you want them as string instead of a list, just add an "".join(), like so:
extended_ascii = "".join([chr(i) for i in xrange(256)])
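The snippets above use Python 2's xrange. A Python 3 sketch of the same idea follows, along with a check of which ASCII characters string.printable leaves out. Note that in Python 3, chr(i) for 128 to 255 yields the Latin-1 range of Unicode, which is only one of several things "extended ASCII" can mean; the variable names are illustrative.

```python
import string

standard_ascii = "".join(chr(i) for i in range(128))
latin1_range = "".join(chr(i) for i in range(256))  # assumption: "extended" = Latin-1

# ASCII characters that string.printable does not contain (control characters).
missing = sorted(set(standard_ascii) - set(string.printable))

print(len(standard_ascii), len(string.printable), len(missing))
```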
You could use the Python Standard Library module curses.ascii. Some of the included functions include:
curses.ascii.isascii() # Checks for a character value in the 7-bit ASCII set.
curses.ascii.iscntrl() # Checks for an ASCII control character (in the range 0x00 to 0x1f or 0x7f).
curses.ascii.isalpha() # Check for an ASCII alphabetic character.
From the documentation:
The curses.ascii module supplies name constants for ASCII characters and functions to test membership in various ASCII character classes.
Note that the curses module may not be available on a Windows system:
The curses module provides an interface to the curses library, the de-facto standard for portable advanced terminal handling.
While curses is most widely used in the Unix environment, versions are available for DOS, OS/2, and possibly other systems as well. This extension module is designed to match the API of ncurses, an open-source curses library hosted on Linux and the BSD variants of Unix.
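Because of that portability concern, one hedged approach is to try the import and fall back to a small pure-Python check when curses is unavailable. The fallback below is a sketch written for this example, not the library's implementation; it accepts a one-character string or an integer code.

```python
try:
    # May raise ImportError on platforms where curses is not installed.
    from curses.ascii import isascii
except ImportError:
    def isascii(c):
        """Fallback: True if c (a one-character string or int) is 7-bit ASCII."""
        code = ord(c) if isinstance(c, str) else c
        return 0 <= code <= 127

print(isascii("A"), isascii("\u00e9"))
```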
