I was reading a highly rated post on SO about Unicode.
Here is an illustration given there:
$ python
>>> import sys
>>> print sys.stdout.encoding
UTF-8
>>> print '\xe9' # (1)
é
>>> print u'\xe9' # (2)
é
>>> print u'\xe9'.encode('latin-1') # (3)
é
>>>
and the explanation given there was:
(1) Python outputs the binary string as is; the terminal receives it and tries to match its value against the Latin-1 character map. In Latin-1, 0xe9 or 233 yields the character "é", so that's what the terminal displays.
My question is: why does the terminal match against the Latin-1 character map when the encoding is 'UTF-8'?
Also when I tried
>>> print '\xe9'
?
>>> print u'\xe9'
é
I get a different result for the first one than what is described above. Why is there this discrepancy, and where does Latin-1 come into play here?
You are missing some important context: in that case the OP had configured the terminal emulator (Gnome Terminal) to interpret output as Latin-1 but left the shell variables set to UTF-8. The shell thus tells Python to use UTF-8 for Unicode output, while the terminal is actually configured to expect Latin-1 bytes.
The print output clearly shows the terminal is interpreting output using Latin-1, and is not using UTF-8.
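To see what Python itself has been told, you can ask it directly. This is a minimal sketch; note that neither value can reveal what the emulator is actually set to, which is exactly the mismatch at play here:

import sys
import locale

# Both values come from the environment (e.g. LANG=en_US.UTF-8),
# not from the terminal emulator's own configuration.
print sys.stdout.encoding            # what Python will use when encoding output
print locale.getpreferredencoding()  # the locale's preferred encoding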
When a terminal is set to UTF-8, the \xe9 byte is not valid (incomplete) UTF-8 and your terminal usually prints a question mark instead:
>>> import sys
>>> sys.stdout.encoding
'UTF-8'
>>> print '\xe9'
?
>>> print u'\xe9'
é
>>> print u'\xe9'.encode('utf8')
é
If you instruct Python to ignore such errors, it gives you the U+FFFD REPLACEMENT CHARACTER glyph � instead:
>>> '\xe9'.decode('utf8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/mj/Development/venvs/stackoverflow-2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe9 in position 0: unexpected end of data
>>> '\xe9'.decode('utf8', 'replace')
u'\ufffd'
>>> print '\xe9'.decode('utf8', 'replace')
�
That's because in UTF-8, \xe9 is the start byte of a 3-byte sequence, used for the Unicode codepoints U+9000 through U+9FFF; printed as just a single byte, it is invalid. This works:
>>> print '\xe9\x80\x80'
退
because that's the UTF-8 encoding of the U+9000 codepoint, a CJK Ideograph glyph.
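If you want to verify that byte arithmetic yourself, here is a minimal sketch that reassembles the codepoint by hand (the bit layout of a 3-byte sequence is 1110xxxx 10xxxxxx 10xxxxxx):

# Mask off the marker bits of each byte and concatenate the payloads.
b1, b2, b3 = 0xe9, 0x80, 0x80
codepoint = ((b1 & 0x0f) << 12) | ((b2 & 0x3f) << 6) | (b3 & 0x3f)
print hex(codepoint)                  # 0x9000
print unichr(codepoint) == u'\u9000'  # True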
If you want to understand the difference between encodings and Unicode, and how UTF-8 and other codecs work, I strongly recommend you read:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
The Python Unicode HOWTO
Pragmatic Unicode by Ned Batchelder
Using Python 3 to minimize the pain when dealing with Unicode, I can print a UTF-8 character like so:
>>> print (u'\u1010')
တ
But when trying to do the same with UTF-16, let's say U+20000, u'\u20000' is the wrong way to initialize the character:
>>> print (u'\u20000')
0
>>> print (list(u'\u20000'))
['\u2000', '0']
It is read as 2 characters instead.
I've also tried the big U, i.e. u'\U20000', but it throws an escape error:
>>> print (u'\U20000')
File "<stdin>", line 1
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-6: truncated \UXXXXXXXX escape
Big U outside the string didn't work either:
>>> print (U'\u20000')
0
>>> print (U'\U20000')
File "<stdin>", line 1
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-6: truncated \UXXXXXXXX escape
These are not UTF-8 and UTF-16 literals, but just unicode literals, and they mean the same:
>>> print(u'\u1010')
တ
>>> print(u'\U00001010')
တ
>>> print(u'\u1010' == u'\U00001010')
True
The second form just allows you to specify a code point above U+FFFF.
The easiest way to do this: encode your source file as UTF-8 (or UTF-16), and then you can just write u"တ" and u"𠀀".
UTF-8 and UTF-16 are ways to encode those to bytes. To be technical, in UTF-8 that would be "\xf0\xa0\x80\x80" (which I would probably write as u"𠀀".encode("utf-8")).
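To make that distinction concrete, here is a small sketch (Python 3, to match the question) spelling the same code point in both encodings:

# One code point, two byte-level encodings.
s = '\U00020000'
print(s.encode('utf-8'))      # b'\xf0\xa0\x80\x80' (4 bytes)
print(s.encode('utf-16-be'))  # b'\xd8@\xdc\x00' (a surrogate pair)
print(len(s))                 # 1: a single code point either way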
As @Mark Ransom commented, Python's \U escape notation requires eight hex digits to work.
Therefore, the Python code to use is:
u"\U00020000"
I am having some problems with encoding some unicode characters.
This is the code I am using:
test = raw_input("Test: ")
print test.encode("utf-8")
When I use normal ASCII characters it works, and the same goes for some "strange" Unicode characters like ☃.
But when I use characters like ß ä ö ü § it fails with this error:
Traceback (most recent call last):
File "C:\###\Test.py", line 5, in <module>
print test.encode("utf-8")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xdf in position 0: ordinal not in range(128)
Note that I am using a pc where German is the default language (so these characters are default characters).
raw_input() returns a byte string. You don't need to encode that byte string, it is already encoded.
What happens instead is that Python first decodes to get a unicode value to encode; you asked Python to encode, so it'll damn well try to get you something that can be encoded. It is the decoding that fails here. Implicit decoding uses ASCII, which is why you got a UnicodeDecodeError exception (note the Decode in the name) for that codec.
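Here is a minimal sketch of that implicit decode, assuming a Latin-1/cp1252 terminal where ß arrives as the single byte \xdf:

# .encode() on a byte string makes Python 2 implicitly run
# the ASCII decode first, and that is the step that blows up.
b = '\xdf'  # the byte your terminal sends for 'ß'
try:
    b.encode('utf-8')  # effectively: b.decode('ascii').encode('utf-8')
except UnicodeDecodeError as e:
    print e  # 'ascii' codec can't decode byte 0xdf in position 0: ...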
If you wanted to produce a unicode object you'd have to explicitly decode. Use the codec Python has detected for stdin:
import sys
test = raw_input("Test: ")
print test.decode(sys.stdin.encoding)
You don't need to do that here, because you are printing, i.e. writing right back to the same terminal, which will use the same codec for input and output. Writing a byte string encoded with UTF-8 when you just received that byte string is then fine. Decoding to unicode is fine too, as printing will auto-encode to sys.stdout.encoding.
Getting this error when trying to parse words in Te Reo Maori:
Pāngarau - I am assuming it's the macron.
UnicodeEncodeError: 'ascii' codec can't encode character u'\u0101'
Any ideas on how to sort this out?
from lxml import html
import requests
page = requests.get('http://www.nzqa.govt.nz/qualifications-standards/qualifications/ncea/subjects/')
tree = html.fromstring(page.text)
text = tree.xpath('//*[@id="mainPage"]/table[1]/tbody/tr[1]/td[3]/a')
print text[0].text
Traceback (most recent call last):
File "/Users/Teacher/Documents/Python/Standards/rip_html2.py", line 10, in <module>
print text[0].text
UnicodeEncodeError: 'ascii' codec can't encode character u'\u0101' in position 1: ordinal not in range(128)
[Finished in 0.5s with exit code 1]
In Python2, lxml sometimes returns strs, and sometimes unicode when you inspect an Element's text attribute.
It returns a str when the text is composed entirely of ascii characters, but it returns a unicode otherwise.
At the point where the error occurs, text[0].text is a unicode containing the character u'\u0101'.
To fix the error, explicitly encode the unicode to a byte string before printing:
print(text[0].text.encode('utf-8'))
Note that utf-8 is just one of many encodings you could use.
Usually, if you are printing to a terminal, Python will detect the encoding used by the terminal and use that encoding to encode the unicode, printing the resulting bytes to the terminal.
Since you are getting the error
UnicodeEncodeError: 'ascii' codec can't encode character u'\u0101' in position 1: ordinal not in range(128)
it appears you might be printing to a file, or Python was unable to determine the encoding of the output device. Since output devices only accept bytes (never unicode), all unicode must be encoded. In such cases Python2 automatically attempts to encode the unicode using the ascii codec. Hence the error.
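A hedged sketch of a guard for that situation, in case your output is sometimes redirected to a file:

import sys

text = u'P\u0101ngarau'
if sys.stdout.encoding is None:
    # Output is redirected; Python 2 cannot detect an encoding here,
    # so encode explicitly instead of letting it fall back to ASCII.
    print text.encode('utf-8')
else:
    print text  # the detected terminal encoding handles it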
See also: the PrintFails wiki page
It might be because Python 2 by default supports only ASCII in source files unless another encoding is explicitly declared. To declare UTF-8 instead, add the following on the first line of your script:
# -*- coding: utf-8 -*-
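For instance, a minimal sketch of a script saved as UTF-8 that relies on that declaration (the literals are just examples):

# -*- coding: utf-8 -*-
# With the declaration above, Python 2 accepts the non-ASCII bytes
# that make up these literals in the source file.
text = u'ß ä ö ü §'
print text.encode('utf-8')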
I'm trying to left-align a UTF-8-encoded string with string.ljust. This exception is raised: UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 0: ordinal not in range(128). For example,
s = u"你好" // a Chinese string
stdout.write(s.encode("UTF-8").ljust(20))
Am I on the right track? Or should I use some other approach to formatting?
Thanks and best regards.
Did you post the exact code and the exact error you received? Because your code works without throwing an error on both a cp437 and utf-8 terminal. In any case you should justify the Unicode string before sending it to the terminal. Note the difference because the UTF-8-encoded Chinese has length 6 when encoded instead of length 2:
>>> sys.stdout.write(s.encode('utf-8').ljust(20) + "hello")
你好              hello
>>> sys.stdout.write(s.ljust(20).encode('utf-8') + "hello")
你好                  hello
Note also that Chinese characters are wider than the other characters in typical fixed-width fonts so things may still not line up as you like if mixing languages (see this answer for a solution):
>>> sys.stdout.write("12".ljust(20) + "hello")
12 hello
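One way to compensate, sketched here with the standard unicodedata module, is to pad by display columns, counting East Asian wide and fullwidth characters as two:

import unicodedata

def display_width(u):
    # 'W' (wide) and 'F' (fullwidth) characters take two terminal columns.
    return sum(2 if unicodedata.east_asian_width(ch) in ('W', 'F') else 1
               for ch in u)

def ljust_columns(u, width):
    return u + u' ' * max(0, width - display_width(u))

print ljust_columns(u'\u4f60\u597d', 20) + u'hello'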
Normally you can skip explicit encoding to stdout. Python implicitly encodes Unicode strings to the terminal in the terminal's encoding (see sys.stdout.encoding):
sys.stdout.write(s.ljust(20))
Another option is using print:
print "%20s" % s # old-style
or:
print '{:20}'.format(s) # new-style
I'm loading a web page using urllib. There are Russian symbols, but the page encoding is 'utf-8'.
Attempt 1:
pageData = unicode(requestHandler.read()).decode('utf-8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 262: ordinal not in range(128)
Attempt 2:
pageData = requestHandler.read()
soupHandler = BeautifulSoup(pageData)
print soupHandler.findAll(...)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 340-345: ordinal not in range(128)
In your first snippet, the call unicode(requestHandler.read()) tells Python to convert the byte string returned by read into unicode: since no codec is specified for the conversion, ascii gets tried (and fails). It never gets to the point where you would call .decode (which would make no sense to call on that unicode object anyway).
Either use unicode(requestHandler.read(), 'utf-8'), or requestHandler.read().decode('utf-8'): either of these should produce a correct unicode object if the encoding is indeed utf-8 (the presence of that D0 byte suggests it may not be, but it's impossible to guess from being shown a single non-ascii character out of context).
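Putting that together, a minimal sketch (Python 2; the URL is a placeholder):

import urllib

requestHandler = urllib.urlopen('http://example.com/')  # placeholder URL
raw = requestHandler.read()     # a byte string, exactly as sent by the server
pageData = raw.decode('utf-8')  # unicode; raises UnicodeDecodeError
                                # if the page is not really UTF-8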
Printing Unicode data is a different issue and requires a well-configured and cooperative terminal emulator -- one that lets Python determine sys.stdout.encoding on startup. For example, on a Mac, using Apple's Terminal.App:
>>> sys.stdout.encoding
'UTF-8'
so the printing of Unicode objects works fine here:
>>> print u'\xabutf8\xbb'
«utf8»
as does the printing of utf8-encoded byte strings:
>>> print u'\xabutf8\xbb'.encode('utf8')
«utf8»
but on other machines only the latter will work (using the terminal emulator's own encoding, which you need to discover on your own because the terminal emulator isn't telling Python;-).
If requestHandler.read() delivers a UTF-8 encoded stream, then
pageData = requestHandler.read().decode('utf-8')
will decode this into a Unicode string (at which point, as Dietrich Epp correctly noted, the unicode() call is no longer necessary).
If it throws an exception, then the input is obviously not UTF-8-encoded.
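If you need to keep going even when the page lies about its encoding, a tolerant variant is to substitute U+FFFD for the undecodable bytes:

# Undecodable bytes become U+FFFD instead of raising an exception.
pageData = requestHandler.read().decode('utf-8', 'replace')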