I've been working on a project where I'm encoding numbers as characters. Being used to C++, I assumed I could just take any 8-bit number and cast it to a character. However, Python's chr() function returns Unicode characters, which aren't 8-bit, so that won't work.
I am new to Python and, from what I've read, previous versions used to have 2 separate functions: chr() for ASCII characters and unichr() for Unicode characters.
I am also limited to what I can get in the standard Python library for Windows (we are not allowed to install modules with pip).
This might usually be okay, but here's an example of when this can mess with my program:
If I'm encoding the integer 143:
# this is not taken from my actual code
num = 143
c = chr(143)
print(c)
I would expect this to print the extended ASCII character (a capital A with a little circle above it). Instead, I get the Unicode character '\x8f', which represents "SS3" (Single Shift Three).
TL;DR: I'm converting 8-bit numbers to characters, but chr() converts to Unicode and I REALLY need a way to convert to ASCII instead, but I can't seem to find it in the standard library.
I know that this is such a simple problem and it's extremely frustrating to be stuck on this of all things.
Thanks a lot in advance!
Have a nice day!
- Vlad
"A with a little circle above it" is not an ASCII character, and 143 is outside the ASCII range (0-127).
It seems you are thinking in terms of the encoded bytes rather than Unicode code points (which Python 3 uses to represent string values). See here for 8-bit encodings where b'\x8f' represents 'Å'.
You probably want to do something like this:
import sys
c = 143
# Convert to byte
b = c.to_bytes(1, sys.byteorder)
# Decode to unicode (str) and print
print(b.decode('cp437'))
Å
You could also take a look at the struct package in the standard library, which deals with bytes and chars in a more "C-like" fashion.
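For example, a minimal sketch using struct (the format character 'B' packs one unsigned byte; cp437 is just one choice of 8-bit code page):
import struct
num = 143
b = struct.pack('B', num)   # pack the integer as a single unsigned byte -> b'\x8f'
print(b.decode('cp437'))    # decode that byte with an 8-bit code page -> Å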
Related
Trying to understand the encoding/decoding/Unicode business in Python 2.7 with vim.
I have a unicode string us to which I assign some unicode string u'é'.
Question 1
How is us represented in memory? Is it a sequence of 32-bit integers holding the Unicode code points (\u escapes)? Or is it kept in memory as a sequence of 8-bit values (\x escapes) in some default encoding?
Question 2
I see four different ways to set encoding for the unicode string us: #1 in the beginning of the test.py file; #2 as an argument of encode function; #3 as an argument for vim; #4 as a local encoding of the file system. So, what do all these four encodings (#1,#2,#3,#4) do?
$ vim test.py
_____________
#encoding: #1
us=u'é'
print us.encode(encoding='#2')
_____________
:set encoding=#3
$ locale | grep LANG
LANG=en_US.#4
LANGUAGE=
In Python 2.x Unicode strings are stored internally as either UCS-2 or UCS-4, depending on the options used when Python was built.
#1: Source encoding as far as Python is concerned.
#2: Encoding used to encode us as bytes when the code is executed.
#3: Source encoding as far as vim is concerned. If this doesn't match #1 then expect trouble.
#4: System encoding. Mostly affects filesystem and terminal output operations.
Question 1 - Storage
us = u'é'
This creates a Unicode string containing the single character é. In Python 2.2+, Unicode strings are stored as UCS-2 or UCS-4, which use 2- or 4-byte unsigned integers per character depending on a build-time option.
Python 3.3+ uses the Flexible String Representation (PEP 393), which uses 1, 2, or 4 bytes for each character depending on the range it is in.
The storage of Unicode strings now depends on the highest code point in
the string:
pure ASCII and Latin-1 strings (U+0000-U+00FF) use 1 byte per code point;
BMP strings (U+0100-U+FFFF) use 2 bytes per code point;
strings containing characters from other planes (U+10000-U+10FFFF) use 4 bytes per code point.
Question 2 - Encoding
us=u'é'
Declares us to be a Unicode string stored as above; note that in Python 3 all strings are Unicode by default, so the u prefix can be omitted.
print(us.encode('utf-8', 'strict'))  # encoding='#2'
Tells encode() how to translate the Unicode string into bytes for output; note that if you are using Python 3.3+ and a Unicode-capable terminal/console you probably never need to do this just to print.
:set encoding=#3
Tells vim (and, via their own settings, emacs and a number of other editors) the encoding to use when displaying and/or editing the file; this applies to all text files, not just Python.
$ locale | grep LANG
LANG=en_US.#4
Is an operating system setting for the locale language that tells it how to display various things, in particular which code page/character set to use when displaying extended (non-ASCII) characters.
This doesn't actually answer the question but I'm hoping it gives some more insight into this problem.
Answer to question 1: it shouldn't matter to the programmer how Unicode strings are represented internally in Python.
To question 2:
All the programmer should care about is that the data sink and source encoding requirements are known and correctly specified. I would assume that Python can correctly interpret UTF-encoded files by reading the BOM, and maybe even by making educated guesses, but without the BOM it can be ambiguous how to handle bytes with the high bit set. So it's advisable either to make sure the BOM is there or to tell Python that the file is UTF-8 encoded if you're not sure.
There's a difference between "Unicode" and "UTF" that seems to be glossed over above; "UTF" specifies the representation in storage (disk, memory, network packet), but "Unicode" is simply the fact that each character has a single value (code point) that ranges from 0 to 0x10FFFF. The various flavors of UTF encode that value into the appropriate storage. Working with encoded strings can be annoying, though (as the character width is variable), so when strings are actually held in memory it's often easier to expand them into some format that allows for easy manipulation. (This is touched on in a comment on another answer.)
If you want a Unicode string in Python pre-3, just type u'<whatever>' and in 3+ type '<whatever>'. You'll get Unicode and you can use \uXXXX and \UXXXXXXXX escapes if it's infeasible to just type the characters in directly. When you want to write the data, specify the encoding. UTF-8 is often the easiest to deal with and seems to be the most commonly used but you may have reason to use a UTF-16 flavor.
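As a small sketch of that (Python 3 syntax; the string and encodings here are just illustrative):
s = 'caf\u00e9 \U0001D49E'       # \uXXXX and \UXXXXXXXX escapes
utf8 = s.encode('utf-8')         # 1-4 bytes per code point
utf16 = s.encode('utf-16-le')    # 2 or 4 bytes per code point
assert utf8.decode('utf-8') == s
assert utf16.decode('utf-16-le') == s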
The takeaway here is that the encoding is just a way to transform Unicode data so that it can be persisted. The various flavors of UTF are just the encodings, they are not actually Unicode.
From Dive into Python:
In Python 3, all strings are sequences of Unicode characters. There is no such thing as a Python string encoded in UTF-8, or a Python string encoded as CP-1252. “Is this string UTF-8?” is an invalid question. UTF-8 is a way of encoding characters as a sequence of bytes. If you want to take a string and turn it into a sequence of bytes in a particular character encoding, Python 3 can help you with that. If you want to take a sequence of bytes and turn it into a string, Python 3 can help you with that too. Bytes are not characters; bytes are bytes. Characters are an abstraction. A string is a sequence of those abstractions.
I don't understand what the author means by that.
When I say s = 'hello', how is s encoded internally? Of course it must use some encoding. He says all strings are sequences of Unicode characters. But how many bytes is each character? Is this string UTF-8? Why does he say: "There is no such thing as a Python string encoded in UTF-8"?
I understand Python provides capabilities of converting a Python "string" into a series of bytes that can be read by another software that uses that encoding. It also supports conversion of a series of bytes into a Python "string". Now the internal representation of this "string" is what confuses me.
When I say s = 'hello', how is s encoded internally? Of course it must use some encoding.
It depends. Frankly, it doesn't matter. CPython now uses the Flexible String Representation, a wonderful space and time optimisation. But you shouldn't care because it doesn't matter.
He says all strings are sequences of Unicode characters. But how many bytes is each character?
Dunno. It depends. It'll probably be in Latin-1 (1 byte) (when using CPython) in that particular case.
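If you want to peek anyway, here's a rough illustration (CPython 3.3+; the printed sizes include object overhead and vary by version, so treat the numbers as indicative only):
import sys
print(sys.getsizeof('abcd'))            # 1 byte per character (Latin-1 storage)
print(sys.getsizeof('abc\u20ac'))       # 2 bytes per character (BMP storage)
print(sys.getsizeof('abc\U0001F600'))   # 4 bytes per character (astral storage)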
Is this string UTF-8?
No.
Why does he say: "There is no such thing as a Python string encoded in UTF-8"?
Because it's a series of Unicode code points. If you confuse encodings with strings (as other languages often force you to do), you might think that b'Jalape\xc3\xb1o' is 'Jalapeño', because in UTF-8 the byte sequence b'\xc3\xb1' represents 'ñ'. But it's not, because the string doesn't have an intrinsic encoding, just like the number 100 is the number 100, not 4, whether or not you represent it in binary, decimal or unary.
He says it because people come from languages where they only have bytes that represent strings and they think "but how is this encoded" as if they have to decode it themselves. It'd be like carrying a list of 1s and 0s instead of being able to use numbers, and you have to tell every function what endianness you're using.
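A small sketch of the distinction, using an illustrative word:
s = 'Jalape\u00f1o'                 # a str: a sequence of code points
utf8_bytes = s.encode('utf-8')      # b'Jalape\xc3\xb1o'
latin1_bytes = s.encode('latin-1')  # b'Jalape\xf1o'
assert utf8_bytes.decode('utf-8') == s
assert latin1_bytes.decode('latin-1') == s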
I understand Python provides capabilities of converting a Python "string" into a series of bytes that can be read by another software that uses that encoding. It also supports conversion of a series of bytes into a Python "string". Now the internal representation of this "string" is what confuses me.
Hopefully it does not any more :).
If this confuses you, I recommend this question, partially 'cause someone called my answer "superbly comprehensive"¹ but also because Steven D'Aprano has had one of his Python mailing list excellencies posted there - he and I answered from the list and had our text posted across.
If you're wondering why it's relevant, I'll quote:
So the person you are quoting is causing confusion when he talks about an "encoded string", he should either make it clear he means a string of bytes, or not mention the word string at all.
Isn't that exactly your confusion?
¹ Technically he called another answer "another superbly comprehensive answer", but that implies what I just said ;).
The author compares strings in Python 2 and 3. In Python 2 strings were represented as byte arrays, which introduced a lot of problems when dealing with non-ASCII characters. The programmer always had to keep track of the current encoding of the strings in their application (e.g. the encoding of the text on an HTML page). There was an attempt to solve this in Python 2.x with the introduction of Unicode objects:
s = 'text' # string/byte array object
un = u'text' # unicode object
But many applications still used the normal, old-style strings.
So, in Python 3 it was decided to separate strings (making them all Unicode) and byte arrays. Thus, in Python 3 we have:
s = 'text' # string/unicode object
b = bytes([0xA2,0x01,0x02,0x03,0x04]) # byte array object
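Converting between the two is always explicit; a minimal sketch:
s = 'text'
b = s.encode('utf-8')            # str  -> bytes
assert b.decode('utf-8') == s    # bytes -> str
assert b'text' != 'text'         # bytes and str never compare equal in Python 3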
Python uses UCS-2 or UCS-4 encoding internally for Unicode strings (at least in Python 2.x).
The Python file:
# -*- coding: utf-8 -*-
print u"。"
print [u"。".encode('utf8')]
Produces:
。
['\xe3\x80\x82']
Why does Python use 3 characters to store my one full stop? This is really strange; if you print each one out individually, they are all different as well. Any ideas?
In UTF-8, three bytes (not really characters) are used to represent code points between U+0800 and U+FFFF, such as this character, IDEOGRAPHIC FULL STOP (U+3002).
Try dumping the script file with od -x. You should find the same three bytes used to represent the character there.
UTF-8 is a multibyte character representation so characters that are not ASCII will take up more than one byte.
Looks correctly UTF-8 encoded to me. See here for an explanation about UTF-8 encoding.
The latest version of Unicode supports more than 109,000 characters in 93 different scripts. Mathematically, the minimum number of bytes you'd need to encode that number of code points is 3, since this is 17 bits' worth of information. (Unicode actually reserves a 21-bit range, but this still fits in 3 bytes.) You might therefore reasonably expect every character to need 3 bytes in the most straightforward imaginable encoding, in which each character is represented as an integer using the smallest possible whole number of bytes. (In fact, as pointed out by dan04, you need 4 bytes to get all of Unicode's functionality.)
A common data compression technique is to use short tokens to represent frequently-occurring elements, even though this means that infrequently-occurring elements will need longer tokens than they otherwise might. UTF-8 is a Unicode encoding that uses this approach to store text written in English and other European languages in fewer bytes, at the cost of needing more bytes for text written in other languages. In UTF-8, the most common Latin characters need only 1 byte (UTF-8 overlaps with ASCII for the convenience of English users), and other common characters need only 2 bytes. But some characters need 3 or even 4 bytes, which is more than they'd need in a "naive" encoding. The particular character you're asking about needs 3 bytes in UTF-8 by definition.
As it happens, in UTF-16 this code point would need only 2 bytes, though other characters need 4 (there are no 3-byte characters in UTF-16). If you are truly concerned with space efficiency, do as John Machin suggests in his comment and use an encoding that is designed to be maximally space-efficient for your language.
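You can check the byte counts for this particular character yourself; a quick sketch in Python 3 syntax (the -le variants avoid adding a BOM):
s = '\u3002'                       # IDEOGRAPHIC FULL STOP
print(len(s.encode('utf-8')))      # 3
print(len(s.encode('utf-16-le')))  # 2
print(len(s.encode('utf-32-le')))  # 4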
CPython stores Unicode strings as either UTF-16 or UTF-32 internally, depending on compile options. In UTF-16 (narrow) builds of Python, string slicing, iteration, and len() seem to work on code units, not code points, so that non-BMP characters behave strangely.
E.g., on CPython 2.6 with sys.maxunicode = 65535:
>>> char = u'\U0001D49E'
>>> len(char)
2
>>> char[0:1]
u'\ud835'
>>> char[1:2]
u'\udc9e'
According to the Python documentation, sys.maxunicode is "An integer giving the largest supported code point for a Unicode character."
Does this mean that unicode operations aren't guaranteed to work on code points beyond sys.maxunicode? If I want to work with characters outside the BMP, do I either have to use a UTF-32 (wide) build or write my own portable unicode operations?
I came across this problem in How to iterate over Unicode characters in Python 3?
Characters beyond sys.maxunicode = 65535 are stored internally using UTF-16 surrogate pairs. Yes, you have to deal with this yourself or use a wide build. Even with a wide build you may also have to deal with single characters represented by a combination of code points. For example:
>>> print('a\u0301')
á
>>> print('\xe1')
á
The first uses a combining accent character and the second doesn't. Both print the same. You can use unicodedata.normalize to convert the forms.
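For example, a minimal sketch of normalizing both spellings so they compare equal:
import unicodedata
combined = 'a\u0301'   # 'a' followed by COMBINING ACUTE ACCENT
precomposed = '\xe1'   # LATIN SMALL LETTER A WITH ACUTE
print(combined == precomposed)                                # False
print(unicodedata.normalize('NFC', combined) == precomposed)  # True
print(unicodedata.normalize('NFD', precomposed) == combined)  # True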
How can I change the character encoding of a string to UTF-8? I am making some execv calls to a Python program, but Python returns the strings with some characters cut off. I don't know if this is a Python issue or a C issue, but I thought if I can change the string's encoding in C and then pass it to Python, it should do the trick. So how can I do that?
Thanks.
C as a language does not facilitate string encoding. A C string is simply a null-terminated sequence of characters (8-bit signed integers, on most systems).
A wide string (with characters of type wchar_t, typically 16- or 32-bit integers depending on the platform) can also be used to hold larger character values; however, again, C standard library functions and data types are in no way aware of any concept of string encoding.
The answer to your question is to ensure that the strings you're passing into Python are encoded as UTF-8.
In order to help you accomplish that in any detailed capacity, however, you will have to provide more information about how your strings are currently formed, what they contain, and how you're constructing your argument list for exec.
There is no such thing as character encoding in C.
A char* can hold any data; how you interpret the characters is up to you. For instance, printf will typically dump the characters as they are to the standard output, and if your console interprets those characters as UTF-8, they'll appear as such.
If you want to convert between different encodings in the C side, you can have a look at ICU.
If you want to convert between encodings in the Python side, look at http://docs.python.org/howto/unicode.html.
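On the Python side that conversion is just a decode followed by an encode. A minimal Python 3 sketch, assuming purely for illustration that the incoming bytes are Latin-1:
raw = b'\xc5lesund'           # bytes assumed (for this example) to be Latin-1
text = raw.decode('latin-1')  # bytes -> str ('Ålesund')
utf8 = text.encode('utf-8')   # str -> UTF-8 bytes: b'\xc3\x85lesund'
print(utf8.decode('utf-8'))   # Ålesund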