According to the Python documentation for the chr built-in function, the maximum value that chr accepts is 1114111 (decimal), i.e. 0x10FFFF (hexadecimal). And in fact:
>>> chr(1114112)
Traceback (most recent call last):
File "<pyshell#20>", line 1, in <module>
chr(1114112)
ValueError: chr() arg not in range(0x110000)
My first question is: why exactly that number? The second question: if this number ever changes, is it possible to find out from a Python command the maximum value accepted by chr?
Use sys.maxunicode:
An integer giving the value of the largest Unicode code point, i.e. 1114111 (0x10FFFF in hexadecimal).
On my Python 2.7 UCS-2 build the maximum Unicode character supported by unichr() is 0xFFFF:
>>> import sys
>>> sys.maxunicode
65535
but Python 3.3 and newer switched to a new internal storage format for Unicode strings, and the maximum is now always 0x10FFFF. See PEP 393.
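So the answer to the second question is to check sys.maxunicode at runtime. On any Python 3:
>>> import sys
>>> sys.maxunicode
1114111
>>> chr(sys.maxunicode)        # the highest code point chr accepts
'\U0010ffff'
>>> chr(sys.maxunicode + 1)    # one past the limit fails, as in the question
Traceback (most recent call last):
  ...
ValueError: chr() arg not in range(0x110000)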
0x10FFFF is the maximum Unicode codepoint as defined in the Unicode standard. Quoting the Wikipedia article on Unicode:
Unicode defines a codespace of 1,114,112 code points in the range 0 to 10FFFF.
Assuming the following:
>>> square = '²' # Superscript Two (Unicode U+00B2)
>>> cube = '³' # Superscript Three (Unicode U+00B3)
Curiously:
>>> square.isdigit()
True
>>> cube.isdigit()
True
OK, let's convert those "digits" to integer:
>>> int(square)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: invalid literal for int() with base 10: '²'
>>> int(cube)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: invalid literal for int() with base 10: '³'
Oooops!
Could someone please explain what behavior I should expect from the str.isdigit() method when handling strings?
str.isdigit doesn't claim to be related to parsability as an int. It reports a simple Unicode property: is the character a decimal character or a digit of some sort?
str.isdigit()
Return True if all characters in the string are digits and there is at least one character, False otherwise. Digits include decimal characters and digits that need special handling, such as the compatibility superscript digits. This covers digits which cannot be used to form numbers in base 10, like the Kharosthi numbers. Formally, a digit is a character that has the property value Numeric_Type=Digit or Numeric_Type=Decimal.
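You can see the distinction in the REPL: str.isdecimal() reports the Numeric_Type=Decimal property, which is what int() actually accepts:
>>> '²'.isdigit()      # Numeric_Type=Digit
True
>>> '²'.isdecimal()    # Numeric_Type=Decimal: what int() needs
False
>>> '٢'.isdecimal()    # ARABIC-INDIC DIGIT TWO is a true decimal digit
True
>>> int('٢')
2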
In short, str.isdigit is thoroughly useless for detecting valid numbers. The correct solution to checking if a given string is a legal integer is to call int on it, and catch the ValueError if it's not a legal integer. Anything else you do will be (badly) reinventing the same tests the actual parsing code in int() performs, so why not let it do the work in the first place?
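A minimal sketch of that approach (the helper name is made up):
def is_valid_int(s):
    # Let int() do the parsing; a ValueError means s is not a legal integer.
    try:
        int(s)
        return True
    except ValueError:
        return False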
Side note: you're using the term "utf-8" incorrectly. UTF-8 is a specific way of encoding Unicode, and it only applies to raw binary data. Python's str is an "idealized" Unicode text type; it has no encoding. (Under the hood it's stored encoded as one of ASCII, latin-1, UCS-2, or UCS-4, and possibly also UTF-8, but none of that is visible at the Python layer outside of indirect measurements like sys.getsizeof, which only hints at the underlying storage by showing how much memory the string consumes.) The characters you're talking about are simply Unicode characters above the ASCII range; they're not specifically UTF-8.
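If you're curious, you can observe that storage width indirectly. Absolute sizes vary by Python version, but the per-character delta shows 1 byte per ASCII character versus 2 bytes per BMP character like Δ:
>>> import sys
>>> sys.getsizeof('a' * 101) - sys.getsizeof('a')            # 1 byte per extra char
100
>>> sys.getsizeof('\u0394' * 101) - sys.getsizeof('\u0394')  # 2 bytes per extra char
200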
I am writing a program that deals with letters from a foreign alphabet. The program takes as input a number that corresponds to a character's Unicode number; for example, 062A is the number assigned in Unicode to one such letter.
I first ask the user to input a number that corresponds to a specific letter, e.g. 062A. I am then attempting to turn that number into a 16-bit integer that Python can decode to print the character back to the user.
For example, for \u0394:
print(bytes([0x94, 0x03]).decode('utf-16'))
however when I am using
int('062A', '16')
I receive this error:
ValueError: invalid literal for int() with base 10: '062A'
I know it is because of the A in the string; however, that is part of the Unicode number for the symbol. Can anyone help me?
however when I am using int('062A', '16'), I receive this error:
ValueError: invalid literal for int() with base 10: '062A'
No, you aren't:
>>> int('062A', '16')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'str' object cannot be interpreted as an integer
It's exactly as it says. The problem is not the '062A', but the '16'. The base should be specified directly as an integer, not a string:
>>> int('062A', 16)
1578
If you want the character at a given Unicode code point, then converting through bytes and UTF-16 is too much work. Just ask directly using chr, for example:
>>> chr(int('0394', 16))
'Δ'
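Putting it together for the original program, a minimal sketch (the prompt wording is made up):
code = input('Enter the Unicode number in hex (e.g. 062A): ')
print(chr(int(code, 16)))   # for 062A this prints the corresponding Arabic letter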
Edit: I'm talking about behavior in Python 2.7.
The chr function converts integers between 0 and 127 into the ASCII characters. E.g.
>>> chr(65)
'A'
I get how this is useful in certain situations and I understand why it covers 0..127, the 7-bit ASCII range.
The function also takes arguments from 128..255. For these numbers, it simply returns a one-character string whose repr is a hex escape such as '\xe4'. In this range, different bytes mean different things depending on which part of the ISO 8859 standard is in use.
I'd understand if chr took another argument, e.g.
>>> chr(228, encoding='iso-8859-1') # hypothetical
'ä'
However, there is no such option:
chr(i) -> character
Return a string of one character with ordinal i; 0 <= i < 256.
My question is: what is the point of raising ValueError for i > 255 instead of i > 127, when all the function does for 128 <= i < 256 is return hex escapes?
In Python 2.x, a str is a sequence of bytes, so chr() returns a string of one byte and accepts values in the range 0-255, as this is the range that can be represented by a byte. When you print the repr() of a string with a byte in the range 128-255, the character is printed in escape format because there is no standard way to represent such characters (ASCII defines only 0-127). You can convert it to Unicode using unicode() however, and specify the source encoding:
unicode(chr(200), encoding="latin1")
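For example, in a Python 2.x session:
>>> chr(200)                                   # a single byte; repr shows a hex escape
'\xc8'
>>> print unicode(chr(200), encoding="latin1")
È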
In Python 3.x, str is a sequence of Unicode characters and chr() takes a much larger range. Bytes are handled by the bytes type.
I see what you're saying but it isn't correct. In Python 3.4 chr is documented as:
Return the string representing a character whose Unicode codepoint is the integer i.
And here are some examples:
>>> chr(15000)
'㪘'
>>> chr(5000)
'ᎈ'
In Python 2.x it was:
Return a string of one character whose ASCII code is the integer i.
The chr function has been around in Python for a long time, and I think a careful treatment of the various encodings only developed in recent releases. In that sense it made sense to support the basic ASCII table and return hex escapes for the extended range 128..255.
Even within Unicode the ASCII set is only defined as 128 characters, not 256, so there isn't (wasn't) a standard, accepted way for chr() to return a character for those input values.
Note that Python 2 string handling is broken. It's one of the reasons I recommend switching to Python 3.
In Python 2, the string type was designed to represent both text and binary strings. So chr() is used to convert an integer to a byte; it's not really related to text, ASCII, or ISO 8859-1. It's for building a binary stream of bytes:
binary_command = chr(100) + chr(200) + chr(10)
device.write(binary_command)
etc()
In Python 2.7, the bytes() type was added for forward compatibility with Python 3, and it is simply an alias for str().
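You can verify that alias on a 2.7 interpreter, and also see why it is only a partial bridge:
>>> bytes is str             # Python 2.7: bytes is just another name for str
True
>>> bytes([100, 200, 10])    # so this is str([...]), not a byte string!
'[100, 200, 10]'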
I'm in an interactive Python 2.7 terminal (the terminal's default encoding is "utf-8"). I have a string from the internet; let's call it a:
>>> a
u'M\xfcssen'
>>> a[1]
u'\xfc'
I wonder why its value is not shown as ü, so I try:
>>> print(a)
Müssen
>>> print(a[1])
ü
which works as intended.
So my first question is: what does print a do that is missing if I just type a?
And out of curiosity: why do I get different output for the following in the same Python terminal session?
>>> "ü"
'\xc3\xbc'
>>> print "ü"
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/codecs.py", line 351, in write
data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
>>> print u"ü"
ü
You have to understand how Python stores the various data types and which functions expect which input. It's all quite confusing, and it also depends on the LOCALE setting of your terminal.
The following link might help reduce the confusion: https://pythonhosted.org/kitchen/unicode-frustrations.html
All str objects like "My String" are stored with 8 bits per character. In your case, '\xc3\xbc' is the UTF-8 representation of UMLAUT-U (ü) as a str object.
For unicode objects, Python uses 16-bit or 32-bit integers to store the string.
Now the print statement expects str objects as input. That's why the following works:
>>> print '\xc3\xbc'
ü
To turn the UMLAUT-U from a str into a unicode object, you have to decode it, telling Python that the byte string is in UTF-8 representation:
>>> '\xc3\xbc'.decode('utf8')
u'\xfc'
What does print a do that is missing if I just type a?
The interactive >>> prompt outputs values using the Python source code representation of the value, as returned by the repr() function. That's why you get not just \xFC for the ü character but also quote marks around the string. The prompt is trying to show you what you would need to type into a Python program to get the string value you have.
The print statement outputs the raw string conversion of the value, as returned by the str() function.
For some types repr() and str() generate the same output, but this is not the case for strings.
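A quick way to see both conversions side by side in the same Python 2 session:
>>> a = u'M\xfcssen'
>>> print repr(a)    # what the bare >>> prompt shows
u'M\xfcssen'
>>> print a          # what the print statement shows
Müssen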
unichr(0x10000) fails with a ValueError when CPython is compiled without --enable-unicode=ucs4.
Is there a language builtin or core library function that converts an arbitrary unicode scalar value or code-point to a unicode string that works regardless of what kind of python interpreter the program is running on?
Yes, here you go:
>>> unichr(0xd800)+unichr(0xdc00)
u'\U00010000'
The crucial point to understand is that unichr() converts an integer to a single code unit in the Python interpreter's string encoding. The Python Standard Library documentation for 2.7.3, 2. Built-in Functions, on unichr() reads:
Return the Unicode string of one character whose Unicode code is the integer i.... The valid range for the argument depends how Python was configured – it may be either UCS2 [0..0xFFFF] or UCS4 [0..0x10FFFF]. ValueError is raised otherwise.
I added emphasis to "one character", by which they mean "one code unit" in Unicode terms.
I'm assuming that you are using Python 2.x. The Python 3.x interpreter has no built-in unichr() function; instead, The Python Standard Library documentation for 3.3.0, 2. Built-in Functions, on chr() reads:
Return the string representing a character whose Unicode codepoint is the integer i.... The valid range for the argument is from 0 through 1,114,111 (0x10FFFF in base 16).
Note that the return value is now a string of unspecified length, not a string with a single code unit. So in Python 3.x, chr(0x10000) would behave as you expected. It "converts an arbitrary unicode scalar value or code-point to a unicode string that works regardless of what kind of python interpreter the program is running on".
But back to Python 2.x. If you use unichr() to create Python 2.x unicode objects, and you are using Unicode scalar values above 0xFFFF, then you are committing your code to being aware of the Python interpreter's implementation of unicode objects.
You can isolate this awareness with a function which tries unichr() on a scalar value, catches ValueError, and tries again with the corresponding UTF-16 surrogate pair:
def unichr_supplemental(scalar):
    try:
        return unichr(scalar)
    except ValueError:
        # On a UCS-2 build, encode the supplementary-plane scalar
        # as a UTF-16 surrogate pair instead.
        return (unichr(0xd800 + ((scalar - 0x10000) // 0x400)) +
                unichr(0xdc00 + ((scalar - 0x10000) % 0x400)))
>>> unichr_supplemental(0x41), len(unichr_supplemental(0x41))
(u'A', 1)
>>> unichr_supplemental(0x10000), len(unichr_supplemental(0x10000))
(u'\U00010000', 2)
But you might find it easier to just convert your scalars to 4-byte UTF-32 values in a UTF-32 byte string, and decode this byte string into a unicode string:
>>> '\x00\x00\x00\x41'.decode('utf-32be'), \
... len('\x00\x00\x00\x41'.decode('utf-32be'))
(u'A', 1)
>>> '\x00\x01\x00\x00'.decode('utf-32be'), \
... len('\x00\x01\x00\x00'.decode('utf-32be'))
(u'\U00010000', 2)
The code above was tested on Python 2.6.7 with UTF-16 encoding for Unicode strings. I didn't test it on a Python 2.x interpreter with UTF-32 encoding for Unicode strings. However, it should work unchanged on any Python 2.x interpreter with any Unicode string implementation.