I am having some problems with encoding some unicode characters.
This is the code I am using:
test = raw_input("Test: ")
print test.encode("utf-8")
When I use normal ASCII characters it works, and the same goes for some "strange" unicode characters like ☃.
But when I use characters like ß ä ö ü §, it fails with this error:
Traceback (most recent call last):
File "C:\###\Test.py", line 5, in <module>
print test.encode("utf-8")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xdf in position 0: ordinal not in range(128)
Note that I am using a PC where German is the default language (so these characters are standard characters).
raw_input() returns a byte string. You don't need to encode that byte string, it is already encoded.
What happens instead is that Python first decodes the byte string to get a unicode value to encode; you asked Python to encode, so it'll damn well try to get you something that can be encoded. It is that implicit decoding that fails here. Implicit decoding uses ASCII, which is why you got a UnicodeDecodeError exception (note the Decode in the name) for that codec.
If you wanted to produce a unicode object you'd have to explicitly decode. Use the codec Python has detected for stdin:
import sys
test = raw_input("Test: ")
print test.decode(sys.stdin.encoding)
You don't need to do that here because you are printing: you are writing right back to the same terminal, which uses the same codec for input and output. Writing a byte string encoded with UTF-8 when you just received that byte string is fine. Decoding to unicode is fine too, as printing will auto-encode to sys.stdout.encoding.
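The failure mode can be reproduced explicitly in Python 3, where the decode step Python 2 performed implicitly has to be spelled out (a sketch; the byte 0xdf is what a Windows terminal using cp1252 or latin-1 produces for 'ß'):

```python
# 0xdf is the single byte a cp1252/latin-1 terminal sends for 'ß'.
raw = b"\xdf"

# This is the step Python 2 performed implicitly before encoding:
try:
    raw.decode("ascii")
except UnicodeDecodeError as exc:
    print(exc)  # 'ascii' codec can't decode byte 0xdf in position 0 ...

# Decoding with the codec the input actually used works fine:
print(raw.decode("cp1252"))  # ß
```

The point is that the traceback in the question is a *decode* error, raised before any encoding happens.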
Related
I am trying to use Python 3's unicode_escape to turn the \n in my string into a real newline, but the challenge is that there are non-ASCII characters present in the string: if I encode with utf8 and then decode the bytes using unicode_escape, the special character gets garbled. Is there any way to have the \n escaped to a newline without garbling the special character?
s = "hello\\nworld└--"
print(s.encode('utf8').decode('unicode_escape'))
Expected Result:
hello
world└--
Actual Result:
hello
worldâ--
As user wowcha observes, the unicode-escape codec assumes a latin-1 encoding, but your string contains a character that is not encodable as latin-1.
>>> s = "hello\\nworld└--"
>>> s.encode('latin-1')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'latin-1' codec can't encode character '\u2514' in position 12: ordinal not in range(256)
Encoding the string as utf-8 gets around the encoding problem, but results in mojibake when decoding from unicode-escape.
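The mojibake can be reproduced directly (a sketch; '└' is U+2514, whose three UTF-8 bytes get reinterpreted one at a time as Latin-1 by unicode_escape):

```python
s = "hello\\nworld└--"

# UTF-8 encodes '└' as the three bytes e2 94 94 ...
encoded = s.encode("utf8")

# ... which unicode_escape then reads back byte-by-byte as Latin-1,
# producing 'â' plus two control characters instead of '└'.
garbled = encoded.decode("unicode_escape")
print(repr(garbled))  # 'hello\nworldâ\x94\x94--'
```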
The solution is to use the backslashreplace error handler when encoding. This will convert the problem character to an escape sequence that can be encoded as latin-1 and does not get mangled when decoded from unicode-escape.
>>> s.encode('latin-1', errors='backslashreplace')
b'hello\\nworld\\u2514--'
>>> s.encode('latin-1', errors='backslashreplace').decode('unicode-escape')
'hello\nworld└--'
>>> print(s.encode('latin-1', errors='backslashreplace').decode('unicode-escape'))
hello
world└--
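The two steps can be wrapped in a small helper (a sketch; the function name is mine):

```python
def decode_escapes(s: str) -> str:
    """Process backslash escapes in s without mangling non-Latin-1 text.

    Characters that latin-1 cannot encode are first turned into \\uXXXX
    escape sequences, which unicode_escape then converts straight back.
    """
    return s.encode("latin-1", errors="backslashreplace").decode("unicode_escape")

print(decode_escapes("hello\\nworld└--"))
```

Note that this still interprets every backslash escape in the input, so it is only suitable when the string is trusted to contain escape sequences you actually want processed.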
Try removing the second escape backslash and decode using utf8:
>>> s = "hello\nworld└--"
>>> print(s.encode('utf8').decode('utf8'))
hello
world└--
I believe the problem you are having is that unicode_escape assumes your bytes are 'latin-1', since that is the codec the unicode_escape function decodes from...
Looking at the Python documentation for codecs, we see that unicode_escape is an "Encoding suitable as the contents of a Unicode literal in ASCII-encoded Python source code, except that quotes are not escaped. Decodes from Latin-1 source code. Beware that Python source code actually uses UTF-8 by default." This tells us that unicode_escape assumes your text is ISO Latin-1.
So if we run your code with latin1 encoding we get this error:
s.encode('latin1')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'latin-1' codec can't encode character '\u2514' in position 12: ordinal not in range(256)
The offending character is '\u2514', which converts to '└'. The simplest way to put it is that this character cannot be represented in a Latin-1 string, which is why you get different characters back.
I also think it's worth pointing out that your string contains '\\n' and not just '\n'. The extra backslash means this is not a newline: the first backslash escapes the second, so the '\n' is never interpreted. Perhaps try not using the \\n...
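The difference between the two spellings can be checked directly (a sketch):

```python
escaped = "hello\\nworld"  # backslash + 'n': two separate characters
real = "hello\nworld"      # a single newline character

print(len(escaped) - len(real))  # 1: the escaped form is one char longer
print("\n" in escaped)           # False
print("\n" in real)              # True
```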
Getting this error when trying to parse words in Te Reo Maori
Pāngarau - I am assuming it's the macron
UnicodeEncodeError: 'ascii' codec can't encode character u'\u0101'
Any ideas on how to sort this out?
from lxml import html
import requests
page = requests.get('http://www.nzqa.govt.nz/qualifications-standards/qualifications/ncea/subjects/')
tree = html.fromstring(page.text)
text = tree.xpath('//*[@id="mainPage"]/table[1]/tbody/tr[1]/td[3]/a')
print text[0].text
Traceback (most recent call last):
File "/Users/Teacher/Documents/Python/Standards/rip_html2.py", line 10, in <module>
print text[0].text
UnicodeEncodeError: 'ascii' codec can't encode character u'\u0101' in position 1: ordinal not in range(128)
[Finished in 0.5s with exit code 1]
In Python2, lxml sometimes returns strs, and sometimes unicode when you inspect an Element's text attribute.
It returns a str when the text is composed entirely of ascii characters, but it returns a unicode otherwise.
At the point where the error occurs, text[0].text is a unicode containing the character u'\u0101'.
To fix the error, explicitly encode the unicode to a byte string before printing:
print(text[0].text.encode('utf-8'))
Note that utf-8 is just one of many encodings you could use.
Usually, if you are printing to a terminal, Python will detect the encoding used by the terminal, and use that encoding to encode unicode thus printing the bytes to the terminal.
Since you are getting the error
UnicodeEncodeError: 'ascii' codec can't encode character u'\u0101' in position 1: ordinal not in range(128)
it appears you might be printing to a file, or Python was unable to determine the encoding of the output device. Since output devices only accept bytes (never unicode), all unicode must be encoded. In such cases Python2 automatically attempts to encode the unicode using the ascii codec. Hence the error.
See also: the PrintFails wiki page
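The underlying rule can be checked directly (a Python 3 sketch; u'\u0101' is 'ā', the macron vowel from 'Pāngarau'):

```python
macron = "\u0101"  # 'ā'

# ASCII cannot represent the character at all ...
try:
    macron.encode("ascii")
except UnicodeEncodeError as exc:
    print(exc)  # 'ascii' codec can't encode character '\u0101' ...

# ... while UTF-8 encodes it as two bytes:
print(macron.encode("utf-8"))  # b'\xc4\x81'
```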
It might be because Python 2 assumes source files contain only ASCII unless the encoding is declared explicitly. To use Unicode literals in your source, you can add the following declaration as the first line of your script:
# -*- coding: utf-8 -*-
Working with svn logs in xml format i've accidentally got an error in my script.
Error message is:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-9: ordinal not in range(128)
By debugging input data i have found what was wrong. Here is an example:
a=u'\u0440\u0435\u044c\u0434\u0437\u0444\u043a\u044b\u0443\u043a \u043c\u0443\u043a\u044b\u0448\u0449\u0442 \u0430\u0448\u0447'
>>> print a
реьдзфкыук мукышщт ашч
>>> print '{}'.format(a)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-9: ordinal not in range(128)
Can you please explain what is wrong with format?
It seems like format sees the u before the string bytes and tries to decode it from UTF8.
However in Python 3 above example works without error.
You are mixing Unicode and byte string values. Use a unicode format:
print u'{}'.format(a)
Demo:
>>> a=u'\u0440\u0435\u044c\u0434\u0437\u0444\u043a\u044b\u0443\u043a \u043c\u0443\u043a\u044b\u0448\u0449\u0442 \u0430\u0448\u0447'
>>> print u'{}'.format(a)
реьдзфкыук мукышщт ашч
In Python 3, strings are unicode values by default; in Python 2, u"..." indicates a unicode value and regular strings ("...") are byte strings.
Mixing byte string and unicode values results in automatic encoding or decoding with the default codec (ASCII), and that's what happens here. The str.format() method has to encode the Unicode value to a byte string in order to interpolate it.
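Python 3 removed the implicit coercion, but the failing step Python 2 performed under the hood can be spelled out explicitly (a sketch):

```python
a = "\u0440\u0435\u044c"  # a few Cyrillic characters

# Python 2's '{}'.format(a) implicitly did the equivalent of this:
try:
    a.encode("ascii")
except UnicodeEncodeError as exc:
    print(exc)  # 'ascii' codec can't encode characters ...

# With a unicode format string there is nothing to coerce:
print("{}".format(a))  # реь
```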
Possible Duplicate:
Convert Unicode to UTF-8 Python
I'm a very new Python programmer, working on my first script. The script pulls in text from a plist string, then does some things to it, then packages it up as an HTML email.
from a few of the entries, I'm getting the dreaded Unicode "outside ordinal 128" error.
Having read as much as I can find about encoding and decoding, I know that it is important for me to get the text encoded, but I'm having a difficult time understanding when or how exactly to do this.
The offending variable is first pulled in using plistlib, and converted to HTML from markdown, like this:
entry = result['Entry Text']
donotecontent = markdown2.markdown(entry)
Later, it is put in the email like this:
html = donotecontent + '<br /><br />' + var3
part1 = MIMEText(html, 'html')
msg.attach(part1)
My question is, what is the best way for me to make sure that Unicode characters in this content doesn't cause this to throw an error. I prefer not to ignore the characters.
Sorry for my broken English. I speak Chinese/Japanese and use CJK characters every day.
Ceron's answer covers most of this problem, so I won't talk about how to use encode()/decode() again.
When we use str() to cast a unicode object, it encodes the unicode string to byte data; when we use unicode() to cast a str object, it decodes the byte data to unicode characters.
In both cases, the encoding used is whatever sys.getdefaultencoding() returns.
By default, sys.getdefaultencoding() returns 'ascii', so an encoding/decoding exception may be thrown when doing a str()/unicode() cast.
If you want str <-> unicode conversion via str() or unicode() to implicitly encode/decode with 'utf-8', you can execute the following statements:
import sys # sys.setdefaultencoding is cancelled by site.py
reload(sys) # to re-enable sys.setdefaultencoding()
sys.setdefaultencoding('utf-8')
After this, str() and unicode() will convert any basestring object using the utf-8 encoding.
However, I would prefer to use encode()/decode() explicitly, because it makes code maintenance easier for me.
Assuming you're using Python 2.x, remember: there are two types of strings: str and unicode. str is a byte string, whereas unicode is a unicode string. Unicode strings can represent text in any language, but to store text on a computer or send it via email, you need to represent that text using bytes. To represent text using bytes, you need an encoding. There are many encodings; Python uses ascii by default, but ascii can only represent a few characters, mostly English letters. If you try to encode text with other letters using ascii, you will get the famous "outside ordinal 128" error. For example:
>>> u'Cerón'.encode('ascii')
Traceback (most recent call last):
File "<input>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf3' in position 3:
ordinal not in range(128)
The same happens if you use str(u'Cerón'), because Python uses ascii by default to convert unicode to str.
To make this work, you have to use a different coding format. UTF-8 is a coding format that can express any unicode text as bytes. To convert the u'Cerón' unicode string to bytes you have to use:
>>> u'Cerón'.encode('utf-8')
'Cer\xc3\xb3n'
No errors this time.
Now, back to your email problem. I can see that you're using MIMEText, which accepts an already encoded str argument, in your case the html variable. MIMEText also accepts an argument specifying what kind of encoding is being used. So, in your case, if html is a unicode string, you have to encode it as utf-8 and pass the charset parameter too (because MIMEText uses ascii by default):
part1 = MIMEText(html.encode('utf-8'), 'html', 'utf-8')
But be careful, because if html is already a str instead of unicode, the encoding will fail. This is one of the problems of Python 2.x: it lets you call encode() on an already-encoded string, which triggers an implicit ASCII decode and throws an error.
Another problem to add to the list is that utf-8 is compatible with ascii characters, and Python will always try to automatically encode/decode strings using ascii. If you're not properly encoding your strings but only use ascii characters, things will work fine. However, if for some reason some non-ascii characters slip into your message, you will get the error; this makes such bugs harder to detect.
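That ASCII-compatibility can be seen directly (a sketch): pure-ASCII text produces byte-for-byte identical output under both codecs, so a missing encode step goes unnoticed until the first non-ASCII character appears.

```python
# For pure ASCII text, utf-8 and ascii produce identical bytes,
# which is why an improperly-encoded code path can appear to work:
assert "hello".encode("utf-8") == "hello".encode("ascii")

# The first non-ASCII character breaks the coincidence:
print("Cerón".encode("utf-8"))  # b'Cer\xc3\xb3n'
try:
    "Cerón".encode("ascii")
except UnicodeEncodeError as exc:
    print(exc)  # 'ascii' codec can't encode character ...
```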
Remember: You can't decode a unicode, and you can't encode a str
>>> u"\xa0".decode("ascii", "ignore")
Traceback (most recent call last):
File "<pyshell#7>", line 1, in <module>
u"\xa0".decode("ascii", "ignore")
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 0: ordinal not in range(128)
>>> "\xc2".encode("ascii", "ignore")
Traceback (most recent call last):
File "<pyshell#6>", line 1, in <module>
"\xc2".encode("ascii", "ignore")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)
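In Python 3 this trap was removed outright: str no longer has a decode() method and bytes no longer has an encode() method, so the mistake fails immediately instead of raising a confusing cross-named exception (a sketch):

```python
# Python 3 dropped the cross-type methods entirely, so the Python 2
# "decode a unicode / encode a str" trap cannot be reproduced:
print(hasattr("text", "decode"))    # False
print(hasattr(b"bytes", "encode"))  # False
```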
Check out this excellent tutorial
I'm loading a web page using urllib. There are Russian symbols, but the page encoding is 'utf-8'
1
pageData = unicode(requestHandler.read()).decode('utf-8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 262: ordinal not in range(128)
2
pageData = requestHandler.read()
soupHandler = BeautifulSoup(pageData)
print soupHandler.findAll(...)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 340-345: ordinal not in range(128)
In your first snippet, the call unicode(requestHandler.read()) tells Python to convert the bytestring returned by read into unicode: since no codec is specified for the conversion, ascii gets tried (and fails). It never gets to the point where you call .decode (which would make no sense to call on that unicode object anyway).
Either use unicode(requestHandler.read(), 'utf-8'), or requestHandler.read().decode('utf-8'): either of these should produce a correct unicode object if the encoding is indeed utf-8 (the presence of that D0 byte suggests it may not be, but it's impossible to guess from being shown a single non-ascii character out of context).
printing Unicode data is a different issue and requires a well configured and cooperative terminal emulator -- one that lets Python set sys.stdout.encoding on startup. For example, on a Mac, using Apple's Terminal.App:
>>> sys.stdout.encoding
'UTF-8'
so the printing of Unicode objects works fine here:
>>> print u'\xabutf8\xbb'
«utf8»
as does the printing of utf8-encoded byte strings:
>>> print u'\xabutf8\xbb'.encode('utf8')
«utf8»
but on other machines only the latter will work (using the terminal emulator's own encoding, which you need to discover on your own because the terminal emulator isn't telling Python;-).
If requestHandler.read() delivers a UTF-8 encoded stream, then
pageData = requestHandler.read().decode('utf-8')
will decode this into a Unicode string (at which point, as Dietrich Epp correctly noted, the unicode() call is no longer necessary).
If it throws an exception, then the input is obviously not UTF-8-encoded.
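The decode step can be exercised on raw bytes directly (a sketch; the two bytes are the UTF-8 encoding of the Cyrillic capital letter 'Р'):

```python
page_bytes = b"\xd0\xa0"  # UTF-8 for Cyrillic 'Р' (U+0420)

# Correct: decode with the page's declared charset.
print(page_bytes.decode("utf-8"))  # Р

# What the bare unicode(...) call in the question effectively did:
try:
    page_bytes.decode("ascii")
except UnicodeDecodeError as exc:
    print(exc)  # 'ascii' codec can't decode byte 0xd0 ...
```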