Edit: I'm talking about behavior in Python 2.7.
The chr function converts integers between 0 and 127 into the ASCII characters. E.g.
>>> chr(65)
'A'
I get how this is useful in certain situations and I understand why it covers 0..127, the 7-bit ASCII range.
The function also takes arguments from 128..255. For these numbers, it simply returns the hexadecimal representation of the argument. In this range, different bytes mean different things depending on which part of the ISO-8859 standard is used.
I'd understand if chr took another argument, e.g.
>>> chr(228, encoding='iso-8859-1') # hypothetical
'ä'
However, there is no such option:
chr(i) -> character
Return a string of one character with ordinal i; 0 <= i < 256.
My question is: what is the point of raising ValueError for i > 255 instead of i > 127? All the function does for 128 <= i < 256 is return hex values?
In Python 2.x, a str is a sequence of bytes, so chr() returns a string of one byte and accepts values in the range 0-255, as this is the range that can be represented by a byte. When you print the repr() of a string with a byte in the range 128-255, the character is printed in escape format because there is no standard way to represent such characters (ASCII defines only 0-127). You can convert it to Unicode using unicode() however, and specify the source encoding:
unicode(chr(200), encoding="latin1")
In Python 3.x, str is a sequence of Unicode characters and chr() takes a much larger range. Bytes are handled by the bytes type.
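For comparison, here is a minimal Python 3 sketch of the same idea: a byte value has no character identity until a codec is chosen. The byte value 228 and the two codecs are picked purely for illustration.

```python
# One byte, two interpretations: the codec decides which character it is.
raw = bytes([228])                      # a single byte, 0xE4

as_latin1 = raw.decode("latin-1")       # ISO-8859-1: U+00E4 (a with diaeresis)
as_cyrillic = raw.decode("iso8859_5")   # ISO-8859-5: U+0444 (Cyrillic ef)

print(repr(raw), as_latin1, as_cyrillic)
```

This is why a bare byte in the 128-255 range is shown in escaped form: without a declared encoding there is no single character to display.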
I see what you're saying but it isn't correct. In Python 3.4 chr is documented as:
Return the string representing a character whose Unicode codepoint is the integer i.
And here are some examples:
>>> chr(15000)
'㪘'
>>> chr(5000)
'ᎈ'
In Python 2.x it was:
Return a string of one character whose ASCII code is the integer i.
The function chr has been around for a long time in Python and I think the understanding of various encodings only developed in recent releases. In that sense it makes sense to support the basic ASCII table and return hex values for the extended ASCII set within the 128 - 255 range.
Even within Unicode the ASCII set is only defined as 128 characters, not 256, so there isn't (wasn't) a standard and accepted way of letting chr() return a character for those input values.
Note that python 2 string handling is broken. It's one of the reasons I recommend switching to python 3.
In python 2, the string type was designed to represent both text and binary strings. So, chr() is used to convert an integer to a byte. It's not really related to text, or ASCII, or ISO-8859-1. It's a binary stream of bytes:
# 'device' stands for any object with a write() method that accepts raw bytes
binary_command = chr(100) + chr(200) + chr(10)
device.write(binary_command)
# ... and so on
Since Python 2.6, the name bytes exists for forward compatibility with Python 3; in Python 2 it is simply an alias for str.
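In Python 3 the same binary command would probably be built with bytes rather than chr(). A small sketch, where io.BytesIO is only a stand-in for the device (the real device object is an assumption and is not shown in the original):

```python
import io

# io.BytesIO stands in for the device; a real device handle is assumed
# to expose a similar write() method that accepts bytes.
device = io.BytesIO()

binary_command = bytes([100, 200, 10])   # three raw bytes, no text involved
device.write(binary_command)

sent = device.getvalue()
print(sent)
```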
I am trying to figure out how to either convert UTF-16 offsets to UTF-8 offsets, or somehow be able to count the # of UTF-16 code points in a string. (I think in order to do the former, you have to do the latter anyways.)
Sanity check: am I correct that the len() function, when applied to a Python string, returns the number of code points in it in UTF-8?
I need to do this because the LSP protocol requires the offsets to be in UTF-16, and I am trying to build something with LSP in mind.
I can't seem to find how to do this, the only python LSP server I know of doesn't even handle this conversion itself.
Python has two datatypes which can be used for characters, neither of which natively represents UTF-16 code units.
In Python 3, strings are represented as str objects, which are conceptually vectors of Unicode codepoints. So the length of a str is the number of Unicode characters it contains, and len("😀") is 1, just as with any other single character. That's independent of the fact that "😀" requires two UTF-16 code units (or four UTF-8 code units).
Python 3 also has a bytes object, which is a vector of bytes (as its name suggests). You can encode a str into a sequence of bytes using the encode method, specifying some encoding. So if you want to produce the stream of bytes representing the character "😀" in UTF-16LE, you would invoke "😀".encode('utf-16-le').
Specifying le (for little-endian) is important because encode produces a stream of bytes, not UTF-16 code units, and each code unit requires two bytes since it's a 16-bit number. If you don't specify a byte order, as in encode('utf-16'), you'll find a two-byte UTF-16 Byte Order Mark at the beginning of the encoded stream.
Since the UTF-16 encoding requires exactly two bytes for each UTF-16 code unit, you can get the UTF-16 length of a unicode string by dividing the length of the encoded bytes object by two: len(s.encode('utf-16-le')) // 2.
But that's a pretty clunky way to convert between UTF-16 offsets and character indexes. Instead, you can just use the fact that characters representable with a single UTF-16 code unit are precisely the characters with codepoints less than 65536 (2^16):
def utf16len(c):
    """Returns the length of the single character 'c'
    in UTF-16 code units."""
    return 1 if ord(c) < 65536 else 2
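Building on that per-character rule, the offset conversion asked about in the question could be sketched roughly as follows. The function names are hypothetical, and an offset that falls in the middle of a surrogate pair is rounded up to the next character, which is only one possible policy.

```python
def utf16_units(ch):
    """UTF-16 code units needed for one character (1 for BMP, else 2)."""
    return 1 if ord(ch) < 0x10000 else 2


def utf16_offset_to_index(text, utf16_offset):
    """Map a UTF-16 code-unit offset to an index into the str."""
    units = 0
    for index, ch in enumerate(text):
        if units >= utf16_offset:
            return index
        units += utf16_units(ch)
    return len(text)


def index_to_utf16_offset(text, index):
    """Map a str index to a UTF-16 code-unit offset."""
    return sum(utf16_units(ch) for ch in text[:index])


sample = "a\U0001F600b"   # 'a', one character outside the BMP, 'b'
print(index_to_utf16_offset(sample, 2))   # offset of 'b' in UTF-16 units
print(utf16_offset_to_index(sample, 3))   # str index of 'b'
```

Both functions scan from the start of the string, so for long documents it may be worth converting per line, which is how LSP positions are expressed anyway.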
For counting the bytes, including BOM, len(str.encode("utf-16")) would work. You can use utf-16-le for bytes without BOM.
Example:
>>> len("abcd".encode("utf-16"))
10
>>> len("abcd".encode("utf-16-le"))
8
As for your question: No, len() on a Python str counts decoded characters (code points), not encoded bytes. If a character takes 4 bytes in UTF-8, it still counts as 1.
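A quick way to see the three different "lengths" side by side, using an arbitrarily chosen character outside the Basic Multilingual Plane:

```python
s = "\U0001F600"                                # one character outside the BMP

code_points = len(s)                            # characters as Python counts them
utf8_bytes = len(s.encode("utf-8"))             # bytes in UTF-8
utf16_units = len(s.encode("utf-16-le")) // 2   # 16-bit code units in UTF-16

print(code_points, utf8_bytes, utf16_units)
```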
I've been working on a project where I'm encoding numbers as characters. Being used to C++, I assumed I could just use any 8bit number and cast it to a character. However, python's chr() function is returning Unicode characters, which aren't 8-bit, so that will not work.
I am new to Python and, from what I've read, previous versions used to have 2 separate functions: chr() for ASCII characters and unichr() for Unicode characters.
I am also limited to what I can get in the standard python library for windows (we are not allowed to install modules with pip).
This might usually be okay, but here's an example of when this can mess with my program:
If I'm encoding the integer 143:
# this is not taken from my actual code
num = 143
c = chr(143)
print(c)
I would expect this to print the ASCII character (a capital A with a little circle above it). Instead, I get the unicode \x8f, which represents "SS3" (Single Shift 3).
TL;DR: I'm converting 8-bit numbers to characters, but chr() converts to Unicode and I REALLY need a way to convert to ASCII instead, but I can't seem to find it in the standard library.
I know that this is such a simple problem and it's extremely frustrating to be stuck on this of all things.
Thanks a lot in advance!
Have a nice day!
- Vlad
"A with a little circle above it" is not an ASCII character, and 143 is outside the ASCII range (0-127).
It seems you are thinking in terms of the encoded bytes rather than unicode codepoints (which Python3 uses to represent string values). See here for 8 bit encodings where b'\x8f' represents 'Å'.
You probably want to do something like this:
import sys
c = 143
# Convert to byte
b = c.to_bytes(1, sys.byteorder)
# Decode to unicode (str) and print
print(b.decode('cp437'))
Å
You could also take a look at the struct package in the standard library, which deals with bytes and chars in a more "C-like" fashion.
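As a rough sketch of the struct route, and of a lossless way to carry 8-bit numbers in a str: latin-1 is used here for the round trip because it maps byte values 0-255 directly onto code points 0-255. Whether that fits the project in the question is an assumption.

```python
import struct

num = 143

# struct packs the number as one unsigned byte, much like a C unsigned char.
packed = struct.pack("B", num)

# latin-1 maps byte values 0-255 one-to-one onto code points 0-255,
# so it round-trips any 8-bit number without loss.
as_text = packed.decode("latin-1")
restored = as_text.encode("latin-1")[0]

# cp437 is only needed when a specific glyph should be displayed.
glyph = packed.decode("cp437")

print(packed, restored, glyph)
```

If the data never needs to be shown as text, keeping it as bytes throughout avoids the question of encodings entirely.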
For a Computer Science class, we've got to make a Python program that converts a character into its Unicode code point (the bin/hex number which is the reference to the character). Is there a function out there which can do this, like how the ord() function converts to ASCII, and is there a function which does the reverse, turning a Unicode code point into a character?
Thanks
In Python3, if you know the unicode code point of a character, for example, 我 with Unicode code point \u6211, you can get the character with:
chr(0x6211)
The builtin function ord also works for unicode characters both in Python2 and Python3.
Python 3
>>> c='\U0010ffff'
>>> ord(c)
1114111
Python 2
>>> c=u'\U0010ffff'
>>> ord(c)
1114111
Difference between Python 2 and 3
The difference between Python 2 and Python 3 is when you go the other way around.
In Python 3, the function chr can take any code, ascii or unicode, and outputs the character.
In Python 2, the function chr is for extended ascii (code 0 to 255) and the function unichr is for unicode.
This is due to the fact that in Python 2, unicode and ascii strings were two different types.
Hexadecimal
If you need to get the character code in hexadecimal, you can use hex.
>>> hex(1114111)
'0x10ffff'
Binary
If you need to get the character code in binary, you can use bin.
>>> bin(1114111)
'0b100001111111111111111'
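Putting the pieces together, a small Python 3 sketch of the round trip for the assignment. The U+XXXX label format is the conventional notation, and the variable names are just for illustration.

```python
c = "\u6211"                         # the character with code point U+6211

code = ord(c)                        # character -> integer code point
label = "U+{:04X}".format(code)      # conventional U+XXXX notation
back = chr(int("6211", 16))          # parse hex digits, then integer -> character

print(code, label, back == c)
```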
I found the Python 3 documentation on chr and ord to be a little unclear as to how they relate to the two main textual data types: str and bytes. Or maybe I'm overthinking it.
Here is what I think probably happens, but can you let me know if I'm right?
ord() takes as input a single-character str and returns an int. The input is a str just like any other str in Python 3. In particular, it is not bytes encoded in some specific Unicode format like UTF-8, rather it represents Unicode Code Points in Python's internal str format.
chr() takes as input an int, and returns a single character str. The returned str is just like any other str in Python, and similarly is not a specific encoding using bytes.
At no point do either ord() or chr() deal with bytes, nor do they deal with specific Unicode formats like UTF-8, they are only dealing with Python's internal str representation which deals more abstractly with Unicode Code Points.
You are right.
ord() and chr() deal only with single-character strings.
Their documentation is quite clear about that:
>>> help(ord)
ord(c, /)
Return the Unicode code point for a one-character string.
>>> help(chr)
chr(i, /)
Return a Unicode string of one character with ordinal i; 0 <= i <= 0x10ffff.
Use str.encode / bytes.decode for conversion to/from bytes.
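To make the distinction concrete, a short sketch: ord() and chr() move between a character and its code point, while encode() and decode() move between str and bytes. The character chosen is arbitrary.

```python
ch = "\u00e9"                    # e with acute accent

code_point = ord(ch)             # an int, independent of any encoding
utf8 = ch.encode("utf-8")        # bytes in UTF-8
utf16 = ch.encode("utf-16-le")   # bytes in UTF-16, little-endian

round_trip = utf8.decode("utf-8")

print(code_point, utf8, utf16, round_trip == ch)
```

The code point stays 233 in every case; only the byte sequences differ between encodings.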
So I am pretty sure this is a dumb question, but I am trying to get a deeper understanding of the python chr() function.
Also, I am wondering if it is possible to always have the integer argument three digits long, or just a fixed length for all ascii values?
chr(20) ## '\x14'
chr(020) ## '\x10'
Why is it giving me different answers? Does it think '020' is hex or something?
Also, I am running Python 2.7 on Windows!
-Thanks!
This has nothing to do with chr. It is all about numeric literals, and the same convention exists in many languages: a leading 0 indicates octal and a leading 0x indicates hexadecimal.
print 010 # 8
print 0x10 # 16
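For the fixed-width part of the question, one option is to keep the zero-padded form as text and convert it explicitly. A sketch in Python 3 syntax, where a bare literal such as 020 is a syntax error and octal must be written with the 0o prefix:

```python
octal_value = 0o20                  # explicit octal prefix
from_octal_text = int("020", 8)     # parse a string as base 8
from_decimal_text = int("020")      # base 10: leading zeros are ignored
padded = "{:03d}".format(65)        # fixed-width decimal text
ch = chr(int(padded))               # back to a character

print(octal_value, from_octal_text, from_decimal_text, padded, ch)
```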
It makes sense to explain chr and ord together.
You are obviously using Python2 (because of the octal problem, Python3 requires 0o as the prefix), but I'll explain both.
In Python2, chr is a function that takes any integer from 0 to 255 and returns a string containing just that extended-ascii character. unichr is the same but returns a unicode character, for code points up to 0x10FFFF. ord is the inverse function, which takes a single-character string (of either type) and returns an integer.
In Python3, chr returns a single-character unicode string. The equivalent for byte strings is bytes([v]). ord still does both.
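A brief Python 3 sketch of that last point, with the value 200 chosen arbitrarily:

```python
v = 200

text_char = chr(v)            # str of length 1 (code point U+00C8)
byte_char = bytes([v])        # bytes of length 1

from_text = ord(text_char)    # ord on a one-character str
from_bytes = ord(byte_char)   # ord also accepts a length-1 bytes object
indexed = byte_char[0]        # indexing bytes already yields an int

print(repr(text_char), byte_char, from_text, from_bytes, indexed)
```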