Supplemental codepoints to unicode string in Python

unichr(0x10000) fails with a ValueError when CPython is compiled without --enable-unicode=ucs4.
Is there a language built-in or core library function that converts an arbitrary Unicode scalar value or code point to a Unicode string, and that works regardless of which kind of Python interpreter the program is running on?

Yes, here you go:
>>> unichr(0xd800)+unichr(0xdc00)
u'\U00010000'
The crucial point to understand is that unichr() converts an integer to a single code unit in the Python interpreter's string encoding. The Python Standard Library documentation for 2.7.3, 2. Built-in Functions, on unichr() reads,
Return the Unicode string of one character whose Unicode code is the integer i.... The valid range for the argument depends how Python was configured – it may be either UCS2 [0..0xFFFF] or UCS4 [0..0x10FFFF]. ValueError is raised otherwise.
I added emphasis to "one character", by which they mean "one code unit" in Unicode terms.
I'm assuming that you are using Python 2.x. The Python 3.x interpreter has no built-in unichr() function. Instead, the Python Standard Library documentation for 3.3.0, 2. Built-in Functions, on chr() reads,
Return the string representing a character whose Unicode codepoint is the integer i.... The valid range for the argument is from 0 through 1,114,111 (0x10FFFF in base 16).
Note that the return value is now a string of unspecified length, not a string with a single code unit. So in Python 3.x, chr(0x10000) would behave as you expected. It "converts an arbitrary unicode scalar value or code-point to a unicode string that works regardless of what kind of python interpreter the program is running on".
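For example, a quick check in a Python 3.3+ interpreter (since PEP 393, string storage no longer depends on compile options, so non-BMP code points always have length 1):
>>> s = chr(0x10000)
>>> len(s)
1
>>> print(ascii(s))
'\U00010000'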
But back to Python 2.x. If you use unichr() to create Python 2.x unicode objects, and you are using Unicode scalar values above 0xFFFF, then you are committing your code to being aware of the Python interpreter's implementation of unicode objects.
You can isolate this awareness with a function which tries unichr() on a scalar value, catches ValueError, and tries again with the corresponding UTF-16 surrogate pair:
def unichr_supplemental(scalar):
    try:
        # Scalars the build can represent as one code unit pass through unchanged.
        return unichr(scalar)
    except ValueError:
        # Narrow (UCS-2) build: split the scalar into a UTF-16 surrogate pair.
        return unichr(0xd800 + ((scalar - 0x10000) // 0x400)) \
             + unichr(0xdc00 + ((scalar - 0x10000) % 0x400))
>>> unichr_supplemental(0x41), len(unichr_supplemental(0x41))
(u'A', 1)
>>> unichr_supplemental(0x10000), len(unichr_supplemental(0x10000))
(u'\U00010000', 2)
But you might find it easier to just convert your scalars to 4-byte UTF-32 values in a UTF-32 byte string, and decode this byte string into a unicode string:
>>> '\x00\x00\x00\x41'.decode('utf-32be'), \
... len('\x00\x00\x00\x41'.decode('utf-32be'))
(u'A', 1)
>>> '\x00\x01\x00\x00'.decode('utf-32be'), \
... len('\x00\x01\x00\x00'.decode('utf-32be'))
(u'\U00010000', 2)
The code above was tested on Python 2.6.7 with UTF-16 encoding for Unicode strings. I didn't test it on a Python 2.x interpreter with UTF-32 encoding for Unicode strings. However, it should work unchanged on any Python 2.x interpreter with any Unicode string implementation.
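If you'd rather not hand-write the byte strings, here is a minimal sketch of the same idea (the helper name unichr_any is my own, not a standard function) that packs the scalar with struct and decodes it; UTF-32 has no surrogates, so this behaves the same on narrow and wide builds:
import struct

def unichr_any(scalar):
    # Pack the scalar as one big-endian 32-bit code unit, then decode it.
    return struct.pack('>I', scalar).decode('utf-32be')

>>> unichr_any(0x10000)
u'\U00010000'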


Python 3.6 equivalent of md5 digest() method?

I'm having trouble achieving the same result from the md5 digest() method in Python 3.6 as I got in Python 2.7.
Python 2.7:
import md5
encryption_base = 'cS35jJYp15kjQf01FVqA7ubRaNOXKPmYGRbLUiimX0g3frQhzOZBmTSni4IEjHLWYMMioGaliIz5z8u2:abcdefghkmnopqrstuvwxyz:4'
digest = md5.new(encryption_base).digest()
print(digest)
#T┼ǃ×ÞRK(M<¶┤# ²
Python 3.6:
from hashlib import md5
encryption_base = 'cS35jJYp15kjQf01FVqA7ubRaNOXKPmYGRbLUiimX0g3frQhzOZBmTSni4IEjHLWYMMioGaliIz5z8u2:abcdefghkmnopqrstuvwxyz:4'
digest = md5(encryption_base.encode()).digest()
print(digest)
#b'T\xc5\x80\x9f\x9e\xe8RK(M<\xf4\xb4#\t\xfd'
How can I get the same string as in the Python 2.7 result? Using .hexdigest() doesn't solve this either.
You have the exact same result, a bytestring. The only difference is that in Python 3 printing a bytestring gives you a debugging-friendly representation, not the raw bytes. That's because the raw bytes are not necessarily printable and print() needs Unicode strings.
If you must have the same output, write the bytes directly to the stdout buffer, bypassing the Unicode TextIOWrapper() that takes care of encoding text to the underlying locale codec:
import sys
digest = md5(encryption_base.encode('ASCII')).digest()
sys.stdout.buffer.write(digest + b'\n')
Note that you must ensure that you define your encryption_base value as a bytes value too, or at least encode it to the same codec, ASCII, like I did above.
Defining it as a bytestring gives you the same value as in Python 2 without encoding:
encryption_base = b'cS35jJYp15kjQf01FVqA7ubRaNOXKPmYGRbLUiimX0g3frQhzOZBmTSni4IEjHLWYMMioGaliIz5z8u2:abcdefghkmnopqrstuvwxyz:4'
When you use str.encode() without explicitly setting an argument, you are encoding to UTF-8. If your encryption_base string only consists of ASCII codepoints, the result would be the same, but not if you have any Latin-1 or higher codepoints in there too. Don't conflate bytes with Unicode codepoints! See https://nedbatchelder.com/text/unipain.html to fully understand the difference and how that difference applies to Python 2 and 3.
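If the same script has to run on both Python 2 and 3, a minimal sketch of a version-portable variant (Python 2's sys.stdout has no .buffer attribute, so we fall back to the stream itself, which accepts bytes directly there):
import sys
from hashlib import md5

encryption_base = b'cS35jJYp15kjQf01FVqA7ubRaNOXKPmYGRbLUiimX0g3frQhzOZBmTSni4IEjHLWYMMioGaliIz5z8u2:abcdefghkmnopqrstuvwxyz:4'
digest = md5(encryption_base).digest()

# Python 3 needs the binary .buffer layer; Python 2's stdout takes bytes as-is.
out = getattr(sys.stdout, 'buffer', sys.stdout)
out.write(digest + b'\n')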

Convert Unicode to normal string [duplicate]

When I parse this XML with p = xml.parsers.expat.ParserCreate():
<name>Fortuna Düsseldorf</name>
The character parsing event handler includes u'\xfc'.
How can u'\xfc' be turned into u'ü'?
This is the main question in this post; the rest just shows further (ranting) thoughts about it.
Isn't Python unicode broken, since u'\xfc' should yield u'ü' and nothing else?
u'\xfc' is already a unicode string, so converting it to unicode again doesn't work!
Converting it to ASCII as well doesn't work.
The only thing that I found works is: (This cannot be intended, right?)
exec( 'print u\'' + 'Fortuna D\xfcsseldorf'.decode('8859') + u'\'')
Replacing 8859 with utf-8 fails! What is the point of that?
Also, what is the point of the Python Unicode HOWTO? It only gives examples of failures instead of showing how to do the conversions that people (especially the hundreds of people who ask similar questions here) actually use in real-world practice.
Unicode is no magic - why do so many people here have issues?
The underlying problem of unicode conversion is dirt simple:
One bidirectional lookup table '\xFC' <-> u'ü'
unicode( 'Fortuna D\xfcsseldorf' )
What is the reason why the creators of Python think it is better to show an error instead of simply producing this: u'Fortuna Düsseldorf'?
Also, why did they make it not reversible?
>>> u'Fortuna Düsseldorf'.encode('utf-8')
'Fortuna D\xc3\xbcsseldorf'
>>> unicode('Fortuna D\xc3\xbcsseldorf','utf-8')
u'Fortuna D\xfcsseldorf'
You already have the value. Python simply tries to make debugging easier by giving you a representation that is ASCII-friendly. Echoing a value in the interpreter gives you the result of calling repr() on that value.
In other words, you are confusing the representation of the value with the value itself. The representation is designed to be safely copied and pasted around, without worry about how other systems might handle non-ASCII codepoints. As such the Python string literal syntax is used, with any non-printable and non-ASCII characters replaced by \xhh and \uhhhh escape sequences. Pasting those strings back into a Python string or interactive Python session will reproduce the exact same value.
As such, ü has been replaced by \xfc, because U+00FC is the Unicode codepoint for LATIN SMALL LETTER U WITH DIAERESIS.
If your terminal is configured correctly, you can just use print and Python will encode the Unicode value to your terminal codec, resulting in your terminal display giving you the non-ASCII glyphs:
>>> u'Fortuna Düsseldorf'
u'Fortuna D\xfcsseldorf'
>>> print u'Fortuna Düsseldorf'
Fortuna Düsseldorf
If your terminal is configured for UTF-8, you can also write the UTF-8 bytes directly to your terminal, after encoding explicitly:
>>> u'Fortuna Düsseldorf'.encode('utf8')
'Fortuna D\xc3\xbcsseldorf'
>>> print u'Fortuna Düsseldorf'.encode('utf8')
Fortuna Düsseldorf
The alternative is for you to upgrade to Python 3; there repr() only uses escape sequences for codepoints that have no printable glyphs (control codes, reserved codepoints, surrogates, etc.; if the codepoint is not a space but falls in a C* or Z* general category, it is escaped). The new ascii() function still gives you the Python 2 repr() behaviour.
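A quick illustration in a Python 3 interpreter:
>>> s = 'Fortuna Düsseldorf'
>>> s                # repr() now shows printable non-ASCII glyphs directly
'Fortuna Düsseldorf'
>>> print(ascii(s))  # ascii() reproduces the old Python 2 escaped form
'Fortuna D\xfcsseldorf'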

What's the point of chr(128) .. chr(255) in Python?

Edit: I'm talking about behavior in Python 2.7.
The chr function converts integers between 0 and 127 into the corresponding ASCII characters. E.g.
>>> chr(65)
'A'
I get how this is useful in certain situations and I understand why it covers 0..127, the 7-bit ASCII range.
The function also takes arguments from 128..255. For these numbers, it simply returns a one-byte string whose repr shows a hexadecimal escape. In this range, different bytes mean different things depending on which part of the ISO-8859 standard is used.
I'd understand if chr took another argument, e.g.
>>> chr(228, encoding='iso-8859-1') # hypothetical
'ä'
However, there is no such option:
chr(i) -> character
Return a string of one character with ordinal i; 0 <= i < 256.
My question is: what is the point of raising ValueError for i > 255 instead of i > 127? All the function does for 128 <= i < 256 is return hex escapes anyway.
In Python 2.x, a str is a sequence of bytes, so chr() returns a string of one byte and accepts values in the range 0-255, as this is the range that can be represented by a byte. When you print the repr() of a string with a byte in the range 128-255, the character is printed in escape format because there is no standard way to represent such characters (ASCII defines only 0-127). You can convert it to Unicode using unicode() however, and specify the source encoding:
unicode(chr(200), encoding="latin1")
In Python 3.x, str is a sequence of Unicode characters and chr() takes a much larger range. Bytes are handled by the bytes type.
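For example, in a Python 2.x session (the printed glyph assumes a terminal whose codec can display it):
>>> chr(200)                              # one raw byte
'\xc8'
>>> unicode(chr(200), encoding='latin1')
u'\xc8'
>>> print unicode(chr(200), encoding='latin1')
È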
I see what you're saying but it isn't correct. In Python 3.4 chr is documented as:
Return the string representing a character whose Unicode codepoint is the integer i.
And here are some examples:
>>> chr(15000)
'㪘'
>>> chr(5000)
'ᎈ'
In Python 2.x it was:
Return a string of one character whose ASCII code is the integer i.
The function chr has been around for a long time in Python and I think the understanding of various encodings only developed in recent releases. In that sense it makes sense to support the basic ASCII table and return hex values for the extended ASCII set within the 128 - 255 range.
Even within Unicode the ASCII set is only defined as 128 characters, not 256, so there isn't (wasn't) a standard and accepted way of letting chr() return an answer for those input values.
Note that Python 2 string handling is broken; it's one of the reasons I recommend switching to Python 3.
In Python 2, the string type was designed to represent both text and binary strings. So chr() is used to convert an integer to a byte. It's not really related to text, or ASCII, or ISO-8859-1. It's a binary stream of bytes:
# 'device' stands in for any binary file-like object, e.g. an open serial port.
binary_command = chr(100) + chr(200) + chr(10)  # three raw bytes: 0x64, 0xc8, 0x0a
device.write(binary_command)
# ... and so on
In Python 2.7, the bytes() type was added for forward compatibility with Python 3, and it maps to str().
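A quick check in a Python 2.7 interpreter confirms the aliasing:
>>> bytes is str
True
>>> bytes('abc')        # just str() under another name
'abc'
>>> b'abc' == 'abc'     # the b prefix is a no-op in 2.x
True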

Why doesn't the Python interpreter use the file coding format for decoding?

The code below will cause a UnicodeDecodeError:
#-*- coding:utf-8 -*-
s="中文"
u=u"123"
u=s+u
I know it's because the Python interpreter is using ASCII to decode s.
Why doesn't the Python interpreter use the file's coding format (UTF-8) for decoding?
Implicit decoding cannot know what source encoding was used. That information is not stored with strings.
All that Python has after importing is a byte string with characters representing bytes in the range 0-255. You could have imported that string from another module, or read it from a file object, etc. The fact that the parser knew what encoding was used for those bytes doesn't even matter for plain byte strings.
As such, it is always better to decode bytes explicitly, rather than rely on implicit decoding. Either use a Unicode literal for s as well, or explicitly decode using str.decode():
u = s.decode('utf8') + u
The types of the two strings are different: the first is a normal (byte) string, the second is a unicode string, hence the error.
So, instead of doing s="中文", do as following to get unicode strings for both:
s=u"中文"
u=u"123"
u=s+u
The code works perfectly fine on Python 3.
However, in Python 2, if you do not add a u before a string literal, you are constructing a string of bytes. When one wants to combine a string of bytes and a string of characters, one either has to decode the string of bytes, or encode the string of characters. Python 2.x opted for the former. In order to prevent accidents (for example, someone appending binary data to a user input and thus generating garbage), the Python developers chose ascii as the encoding for that conversion.
You can add a line
from __future__ import unicode_literals
after the #coding declaration so that literals without u or b prefixes are always character and not byte literals.

What does sys.maxunicode mean?

CPython stores unicode strings as either UTF-16 or UTF-32 internally, depending on compile options. In UTF-16 builds of Python, string slicing, iteration, and len() seem to work on code units, not code points, so that non-BMP characters behave strangely.
E.g., on CPython 2.6 with sys.maxunicode = 65535:
>>> char = u'\U0001D49E'
>>> len(char)
2
>>> char[0:1]
u'\ud835'
>>> char[1:2]
u'\udc9e'
According to the Python documentation, sys.maxunicode is "An integer giving the largest supported code point for a Unicode character."
Does this mean that unicode operations aren't guaranteed to work on code points beyond sys.maxunicode? If I want to work with characters outside the BMP, do I either have to use a UTF-32 build or write my own portable Unicode operations?
I came across this problem in How to iterate over Unicode characters in Python 3?
Characters beyond sys.maxunicode = 65535 are stored internally using UTF-16 surrogates. Yes, you have to deal with this yourself or use a wide build. Even with a wide build, you may also have to deal with single characters represented by a combination of code points. For example:
>>> print('a\u0301')
á
>>> print('\xe1')
á
The first uses a combining accent character and the second doesn't. Both print the same. You can use unicodedata.normalize to convert the forms.
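For example, a quick Python 3 session showing the conversion (NFC composes the pair into the single codepoint):
>>> import unicodedata
>>> s = 'a\u0301'                        # 'a' plus U+0301 COMBINING ACUTE ACCENT
>>> len(s)
2
>>> nfc = unicodedata.normalize('NFC', s)
>>> len(nfc), nfc == '\xe1'
(1, True)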
