UnicodeEncodeError only with str(text) in Python - python

I'm reading a UTF-8 encoded file. When I print the text directly, everything is fine. When I print the text from a class using msg.__str__(), it works too.
But I really don't know how to print it with just str(msg), because this always raises the error "'ascii' codec can't encode character u'\xe4' in position 10: ordinal not in range(128)" if the text contains an umlaut.
Example Code:
#!/usr/bin/env python
# encoding: utf-8
import codecs
from TempClass import TempClass
file = codecs.open("person.txt", encoding="utf-8")
message = file.read() #I am Mr. Händler.
#works
print message
msg = TempClass(message)
#works
print msg.__str__()
#works
print msg.get_string()
#error
print str(msg)
And the class:
class TempClass(object):
    def __init__(self, text):
        self.text = text
    def get_string(self):
        return self.text
    def __str__(self):
        return self.text
I tried to decode and encode the text in several ways but nothing works for me.
Help? :)
Edit: I am using Python 2.7.9

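Because message (and msg.text) are not str but unicode objects. In Python 2, str() has to return a byte string, so it implicitly encodes the unicode value returned by __str__ with the default ASCII codec. Encode to UTF-8 explicitly instead. Your __str__ method should look like: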
def __str__(self):
    return self.text.encode('utf-8')
unicode can be implicitly encoded to str if it contains only ASCII characters, which is why you only see the error when the input contains an umlaut.
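The implicit step can be reproduced explicitly. The following is a minimal sketch written for Python 3 (where the question's unicode type is simply str), so it illustrates the mechanism rather than the exact Python 2.7 code path; the sample text is taken from the comment in the question.

```python
# Sample text from the question; \u00e4 is the umlaut "ä".
text = "I am Mr. H\u00e4ndler."

# Python 2's str() on a unicode value encodes with the ASCII codec behind
# the scenes. Doing that step by hand reproduces the failure.
try:
    text.encode("ascii")
    ascii_error = None
except UnicodeEncodeError as exc:
    ascii_error = exc

# Encoding explicitly with UTF-8 always succeeds for valid text.
utf8_bytes = text.encode("utf-8")

print(ascii_error)
print(utf8_bytes)
```

The failing position reported by the exception is the index of the umlaut, which is why ASCII-only input never triggers the error.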

Related

How to get py-scrypt's "simple password verifier" example functions to work?

I am using the example script provided by py-scrypt to build a simple password verifier. Below is my test script.
Test Script:
#!/usr/bin/python3
# -*- coding: utf-8 -*-
import scrypt
import os
def hash2_password(a_secret_message, password, maxtime=0.5, datalength=64):
    #return scrypt.encrypt(a_secret_message, password, maxtime=maxtime)
    return scrypt.encrypt(os.urandom(datalength), password, maxtime=maxtime)
def verify2_password(data, password, maxtime=0.5):
    try:
        secret_message = scrypt.decrypt(data, password, maxtime)
        print('\nDecrypted secret message:', secret_message)
        return True
    except scrypt.error:
        return False
password2 = 'Baymax'
secret_message2 = "Go Go"
data2 = hash2_password(secret_message2, password2, maxtime=0.1, datalength=64)
print('\nEncrypted secret message2:')
print(data2)
password_ok = verify2_password(data2, password2, maxtime=0.1)
print('\npassword_ok? :', password_ok)
Issues:
I often get error messages, e.g.:
Traceback (most recent call last):
File "~/example_scrypt_v1.py", line 56, in <module>
password_ok = verify2_password(data2, password2, maxtime=0.1)
File "~/example_scrypt_v1.py", line 43, in verify2_password
secret_message = scrypt.decrypt(data, password, maxtime)
File "~/.local/lib/python3.5/site-packages/scrypt/scrypt.py", line 188, in decrypt
return str(out_bytes, encoding)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xca in position 0: invalid continuation byte
where the last line varies, e.g.:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xaf in position 3: invalid start byte
or
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xee in position 1: invalid continuation byte
or no error message but return False
password_ok? : False
When I comment return scrypt.encrypt(os.urandom(datalength), password, maxtime=maxtime) to remove the random secret message generator and uncomment return scrypt.encrypt(a_secret_message, password, maxtime=maxtime) to use a non-random secret message, the function verify2_password works.
Question: How do I get the random secret message element to work? What is causing its failure?
Explanation for UnicodeDecodeError Exception
Reason 1
I think I understand why Scrypt is issuing a UnicodeDecodeError. Quoting Python's UnicodeDecodeError :
The UnicodeDecodeError normally happens when decoding an str string
from a certain coding. Since codings map only a limited number of str
strings to unicode characters, an illegal sequence of str characters
will cause the coding-specific decode() to fail.
Also, Python's Unicode HOWTO, in the section Python’s Unicode Support --> The String Type, says:
In addition, one can create a string using the decode() method of
bytes. This method takes an encoding argument, such as UTF-8, and
optionally an errors argument
The errors argument specifies the response when the input string can’t
be converted according to the encoding’s rules. Legal values for this
argument are 'strict' (raise a UnicodeDecodeError exception),
'replace' (use U+FFFD, REPLACEMENT CHARACTER), 'ignore' (just leave
the character out of the Unicode result), or 'backslashreplace'
(inserts a \xNN escape sequence).
In short, whenever Python's .decode() method cannot map the input bytes to Unicode characters and is using errors='strict', it raises a UnicodeDecodeError exception.
I tried to find the .decode() method in the .decrypt() method of py-scrypt/scrypt/scrypt.py. Initially, I could not locate it. For Python3, the .decrypt() method return statement was:
return str(out_bytes, encoding)
However, further checking Python's explanation on the str class, I found the explanation saying that:
if object is a bytes (or bytearray) object, then str(bytes, encoding,
errors) is equivalent to bytes.decode(encoding, errors).
This means that, because no errors argument is given in str(out_bytes, encoding), the call behaves like out_bytes.decode(encoding, errors='strict') and raises a UnicodeDecodeError whenever the bytes are not valid in that encoding.
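The equivalence between str(bytes, encoding) and bytes.decode(encoding) can be checked without scrypt. This is a small sketch; try_decode is a hypothetical helper introduced only for illustration, and the two-byte sample merely mimics the 0xca byte from the first traceback.

```python
import os

def try_decode(data, encoding="utf-8", errors="strict"):
    """Return the decoded text, or None if the bytes are invalid."""
    try:
        # Same call shape as the return statement in scrypt.decrypt().
        return str(data, encoding, errors)
    except UnicodeDecodeError:
        return None

# 0xca announces a two-byte UTF-8 sequence, but 0x41 is not a
# continuation byte, so strict decoding fails at position 0.
bad = b"\xca\x41"
strict_result = try_decode(bad)
replaced_result = try_decode(bad, errors="replace")

# Plain ASCII text is valid UTF-8 and decodes fine.
text_result = try_decode(b"Go Go")

# Count how many random 64-byte samples happen to be valid UTF-8.
valid_random = sum(try_decode(os.urandom(64)) is not None for _ in range(1000))
print(strict_result, repr(replaced_result), text_result, valid_random)
```

Random bytes are almost never valid UTF-8, which matches the observation that the text message "Go Go" works while os.urandom() output does not.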
Reason 2
In the "simple password verifier" example, the input argument of scrypt.encrypt() is os.urandom(datalength), which returns a <class 'bytes'>. When these bytes are encrypted and subsequently decrypted by scrypt.decrypt(), the decrypted value is again arbitrary bytes. According to the docstring of the .decrypt() method, in Python 3 this method returns a str instance if an encoding is given, and a bytes instance if encoding=None. Because scrypt.decrypt() defaults to encoding='utf-8' when called in verify2_password(), it attempts to return a <class 'str'>, and decoding random bytes as UTF-8 results in the UnicodeDecodeError.
Solution to the "simple password verifier" example script given in py-scrypt:
The verify_password() function should contain the argument encoding=None .
scrypt.decrypt() should contain the argument encoding=encoding .
Revised Example Script:
#!/usr/bin/python3
# -*- coding: utf-8 -*-
import scrypt
import os
def encrypt_password(password, maxtime=0.5, datalength=64):
    passphrase = os.urandom(datalength)
    print('\npassphrase = ', passphrase, type(passphrase))
    return scrypt.encrypt(passphrase, password, maxtime=maxtime)
def verify_password(encrypted_passphrase, password, maxtime=0.5, encoding=None):
    try:
        passphrase = scrypt.decrypt(encrypted_passphrase, password, maxtime,
                                    encoding=encoding)
        print('\npassphrase = ', passphrase, type(passphrase))
        return True
    except scrypt.error:
        return False
password = 'Baymax'
encrypted_passphrase = encrypt_password(password, maxtime=0.5, datalength=64)
print('\nEncrypted PassPhrase:')
print(encrypted_passphrase)
password_ok = verify_password(encrypted_passphrase, password, maxtime=0.5)
print('\npassword_ok? :', password_ok)

Encode error scraping

I am scraping a site with Chinese characters.
How do I scrape Chinese characters?
from urllib.request import urlopen
from urllib.parse import urljoin
from lxml.html import fromstring
URL = 'http://list.suning.com/0-258003-0.html'
ITEM_PATH = '.clearfix .product .border-out .border-in .wrap .res-info .sell-point'
def parse_items():
    f = urlopen(URL)
    list_html = f.read().decode('utf-8')
    list_doc = fromstring(list_html)
    for elem in list_doc.cssselect(ITEM_PATH):
        a = elem.cssselect('a')[0]
        href = a.get('href')
        title = a.text
        em = elem.cssselect('em')[0]
        title2 = em.text
        print(href, title, title2)
def main():
    parse_items()
if __name__ == '__main__':
    main()
The error looks like this:
http://product.suning.com/0000000000/146422477.html Traceback (most recent call last):
File "parser.py", line 27, in <module>
main()
File "parser.py", line 24, in main
parse_items()
File "parser.py", line 20, in parse_items
print(href, title, title2)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
From the print syntax and the imports, I assume that you are using Python 3, which matters for Unicode handling.
So we can expect that href, title and title2 are all unicode strings (Python 3 str). But the print function will try to convert the strings to an encoding acceptable to the output system. For a reason I cannot know, your system uses ASCII by default, hence the error.
How to fix:
The best way would be to make your system accept Unicode. On Linux or other Unixes, you can declare a UTF-8 charset in the LANG environment variable (export LANG=en_US.UTF-8); on Windows you can try chcp 65001, but the latter is far from reliable.
If that does not work, or does not meet your needs, you can force an explicit encoding, or more exactly filter out the offending characters, because Python 3 natively uses Unicode strings.
I would use:
import sys
def u_filter(s, encoding = sys.stdout.encoding):
    return (s.encode(encoding, errors='replace').decode(encoding)
            if isinstance(s, str) else s)
That means: if s is a unicode string, encode it in the encoding used for stdout, replacing any non-convertible character with a replacement character, and decode it back into a now clean string.
and next:
def fprint(*args, **kwargs):
    fargs = [ u_filter(arg) for arg in args ]
    print(*fargs, **kwargs)
means: filter out any offending character from unicode strings and print the remaining unchanged.
With that you can safely replace your print throwing the exception with:
fprint(href, title, title2)
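To see how the output stream's encoding decides whether print succeeds, the situation can be simulated with in-memory streams. This is only a sketch: make_stream is a hypothetical helper, the two-character title is made-up sample data, and the in-memory buffers stand in for a real terminal.

```python
import io

def make_stream(encoding, errors="strict"):
    """Build a text stream over an in-memory byte buffer."""
    raw = io.BytesIO()
    text = io.TextIOWrapper(raw, encoding=encoding, errors=errors, newline="\n")
    return raw, text

title = "\u82cf\u5b81"  # two Chinese characters (sample data)

# An ASCII-only stream behaves like the misconfigured terminal.
raw_ascii, ascii_out = make_stream("ascii")
try:
    print(title, file=ascii_out)
    ascii_out.flush()
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# A UTF-8 stream accepts the same text unchanged.
raw_utf8, utf8_out = make_stream("utf-8")
print(title, file=utf8_out)
utf8_out.flush()
utf8_bytes = raw_utf8.getvalue()

# errors="replace" gives the same effect as u_filter: "?" placeholders.
raw_repl, repl_out = make_stream("ascii", errors="replace")
print(title, file=repl_out)
repl_out.flush()
replaced_bytes = raw_repl.getvalue()
```

On Python 3.7 and later, sys.stdout.reconfigure(encoding='utf-8') applies the same idea to the real standard output, provided the terminal can actually display UTF-8.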

Python 2.7: 'ascii' codec can't encode character u'\xe9' error while writing in file

I know this question has been asked many times, but somehow I am not getting results.
I am fetching data from the web which contains the string Elzéar. When I write it to a CSV file, I get the error mentioned in the question title.
When producing the data I did the following:
address = str(address).strip()
address = address.encode('utf8')
return name+','+address+','+city+','+state+','+phone+','+fax+','+pumps+','+parking+','+general+','+entertainment+','+fuel+','+resturants+','+services+','+technology+','+fuel_cards+','+credit_cards+','+permits+','+money_services+','+security+','+medical+','+longit+','+latit
and writing it as:
with open('records.csv', 'a') as csv_file:
    print(type(data)) #prints <unicode>
    data = data.encode('utf8')
    csv_file.write(id+','+data+'\n')
status = 'OK'
the_file.write(ts+'\t'+url+'\t'+status+'\n')
This generates the error:
'ascii' codec can't encode character u'\xe9' in position 55: ordinal not in range(128)
You could try something like (python2.7):
#! /usr/bin/env python
# -*- coding: utf-8 -*-
import codecs
...
with codecs.open('records.csv', 'a', encoding="utf8") as csv_file:
    print(type(data)) #prints <unicode>
    # because data is unicode
    csv_file.write(unicode(id)+u','+data+u'\n')
status = u'OK'
the_file.write(unicode(ts, encoding="utf8")+u'\t'+unicode(url, encoding="utf8")+u'\t'+status+u'\n')
The main idea is to work with unicode as much as possible and convert to str only when outputting (better not to operate on str at all).
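For comparison, the same "text inside, bytes only at the file boundary" idea looks like this in Python 3. This is a sketch under assumptions: the row values other than Elzéar are made-up sample data, the file goes to a temporary directory, and the csv module replaces the manual comma joining from the question.

```python
import csv
import os
import tempfile

# Sample record; only "Elzéar" comes from the question.
row = ["42", "Elz\u00e9ar", "Qu\u00e9bec"]

path = os.path.join(tempfile.mkdtemp(), "records.csv")

# The encoding is stated once, when the file is opened; everything
# written through csv_file is ordinary text.
with open(path, "a", encoding="utf-8", newline="") as csv_file:
    csv.writer(csv_file).writerow(row)

# Read the row back to confirm the accented characters survived.
with open(path, encoding="utf-8", newline="") as csv_file:
    read_back = next(csv.reader(csv_file))

# Inspect the raw bytes on disk: "é" is stored as the UTF-8 pair c3 a9.
with open(path, "rb") as raw_file:
    raw_bytes = raw_file.read()

print(read_back)
```

Using csv.writer also takes care of quoting if a field such as an address contains a comma.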

Ascii codec can't encode character

I'm running a Flask web service, with text-input, but now I have the problem that the text-input sometimes consists of characters that are not included in the ASCII-character set (Example of error: "(Error: no text provided) 'ascii' codec can't encode character u'\u2019' in position 20)")
My code for the Flask web service looks (somewhat) like this:
class Classname(Resource):
    def __init__(self):
        self.reqparse = reqparse.RequestParser()
        self.reqparse.add_argument('text', type=str, required=True, help='Error: no text provided')
        super(Classname,self).__init__()
    def post(self):
        args = self.reqparse.parse_args()
        text = args['text']
        return resultOfSomeFunction(text)
I already tried turning the ASCII string into unicode, but that didn't work (error: 'unicode' object is not callable). I also tried to add:
text = re.sub(r'[^\x00-\x7f]',r' ',text)
after the line
text = args['text']
but that also gave me the same error ('ascii' codec can't encode character).
How can I solve this?
Have you tried removing type=str from self.reqparse.add_argument('text', type=str, required=True, help='Error: no text provided')?
Note:
The default argument type is a unicode string. This will be str in
python3 and unicode in python2
Source: http://flask-restful-cn.readthedocs.org/en/0.3.4/reqparse.html

Encoding gives "'ascii' codec can't encode character … ordinal not in range(128)"

I am working through the Django RSS reader project here.
The RSS feed will read something like "OKLAHOMA CITY (AP) — James Harden let". The RSS feed's encoding reads encoding="UTF-8" so I believe I am passing utf-8 to markdown in the code snippet below. The em dash is where it chokes.
I get the Django error "'ascii' codec can't encode character u'\u2014' in position 109: ordinal not in range(128)", which is a UnicodeEncodeError. In the variables being passed I see "OKLAHOMA CITY (AP) \u2014 James Harden". The line of code that is not working is:
content = content.encode(parsed_feed.encoding, "xmlcharrefreplace")
I am using markdown 2.0, django 1.1, and python 2.4.
What is the magic sequence of encoding and decoding that I need to do to make this work?
(In response to Prometheus' request. I agree the formatting helps)
So in views I add a smart_unicode line above the parsed_feed encoding line...
content = smart_unicode(content, encoding='utf-8', strings_only=False, errors='strict')
content = content.encode(parsed_feed.encoding, "xmlcharrefreplace")
This pushes the problem to my models.py for me where I have
def save(self, force_insert=False, force_update=False):
    if self.excerpt:
        self.excerpt_html = markdown(self.excerpt)
    # super save after this
If I change the save method to have...
def save(self, force_insert=False, force_update=False):
    if self.excerpt:
        encoded_excerpt_html = (self.excerpt).encode('utf-8')
        self.excerpt_html = markdown(encoded_excerpt_html)
I get the error "'ascii' codec can't decode byte 0xe2 in position 141: ordinal not in range(128)" because now it reads "\xe2\x80\x94" where the em dash was.
If the data that you are receiving is, in fact, encoded in UTF-8, then it should be a sequence of bytes -- a Python 'str' object, in Python 2.X.
You can verify this with an assertion:
assert isinstance(content, str)
Once you know that that's true, you can move to the actual encoding. Python doesn't do transcoding -- directly from UTF-8 to ASCII, for instance. You need to first turn your sequence of bytes into a Unicode string, by decoding it:
unicode_content = content.decode('utf-8')
(If you can trust parsed_feed.encoding, then use that instead of the literal 'utf-8'. Either way, be prepared for errors.)
You can then take that string, and encode it in ASCII, substituting high characters with their XML entity equivalents:
xml_content = unicode_content.encode('ascii', 'xmlcharrefreplace')
The full method, then, would look something like this:
try:
    content = content.decode(parsed_feed.encoding).encode('ascii', 'xmlcharrefreplace')
except UnicodeDecodeError:
    # Couldn't decode the incoming string -- possibly not encoded in utf-8
    # Do something here to report the error
    raise  # placeholder: re-raise until real error reporting is added
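The decode-then-encode sequence can be checked on the headline from the question. The sketch below is written for Python 3, where the incoming data is a bytes object rather than a Python 2 str; to_ascii_xml is a hypothetical helper name, and the byte string is the UTF-8 form of the text quoted in the question.

```python
# UTF-8 bytes of the headline; \xe2\x80\x94 is the em dash (U+2014).
raw = b"OKLAHOMA CITY (AP) \xe2\x80\x94 James Harden let"

def to_ascii_xml(data, source_encoding="utf-8"):
    """Decode bytes to text, then encode to ASCII with XML char refs."""
    text = data.decode(source_encoding)  # may raise UnicodeDecodeError
    return text.encode("ascii", "xmlcharrefreplace")

xml_content = to_ascii_xml(raw)
print(xml_content)
```

The em dash comes out as the numeric character reference &#8212;, so the result is pure ASCII and can be passed on without triggering the ASCII codec error.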
Django provides a couple of useful functions for converting back and forth between Unicode and bytestrings:
from django.utils.encoding import smart_unicode, smart_str
I encountered this error when writing a file by name into a zip archive. The following failed:
ZipFile.write(root+'/%s'%file, newRoot + '/%s'%file)
and the following worked:
ZipFile.write(str(root+'/%s'%file), str(newRoot + '/%s'%file))
