Ascii codec can't encode character - python

I'm running a Flask web service that accepts text input. The input sometimes contains characters outside the ASCII character set, which produces an error such as: "(Error: no text provided) 'ascii' codec can't encode character u'\u2019' in position 20)"
My code for the Flask web service looks (somewhat) like this:
class Classname(Resource):
    def __init__(self):
        self.reqparse = reqparse.RequestParser()
        self.reqparse.add_argument('text', type=str, required=True, help='Error: no text provided')
        super(Classname, self).__init__()

    def post(self):
        args = self.reqparse.parse_args()
        text = args['text']
        return resultOfSomeFunction(text)
I already tried converting the ASCII string to unicode, but that didn't work (error: 'unicode' object is not callable). I also tried adding:
    text = re.sub(r'[^\x00-\x7f]',r' ',text)
after the line
    text = args['text']
but that gave me the same error ('ascii' codec can't encode character).
How can I solve this?

Have you tried removing type=str from self.reqparse.add_argument('text', type=str, required=True, help='Error: no text provided')?
Note:
The default argument type is a unicode string. This will be str in
python3 and unicode in python2
Source: http://flask-restful-cn.readthedocs.org/en/0.3.4/reqparse.html
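A minimal sketch of why type=str can trigger this error, written for Python 3 using only the standard library. It assumes the parser simply calls the configured type on the incoming text, and that Python 2's str() applied to a unicode value behaves like an ASCII encode. The names py2_str and convert are hypothetical and are not part of Flask-RESTful.

```python
def py2_str(value):
    """Stand-in for Python 2's str(): implicitly encodes text with the ASCII codec."""
    return value.encode("ascii")


def convert(value, converter=None):
    """Apply an optional converter, roughly as an argument parser might (assumption)."""
    if converter is None:
        return value  # default: leave the text as a Unicode string
    return converter(value)


text = "The service\u2019s input"

# Without a converter the text passes through untouched.
unchanged = convert(text)

# With the ASCII-only converter, the right single quote (U+2019) cannot be encoded.
try:
    convert(text, py2_str)
    error = None
except UnicodeEncodeError as exc:
    error = exc
```

The failure happens inside the conversion step, which would explain why a re.sub() placed after parse_args() never gets a chance to run.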

Related

How to get py-scrypt's "simple password verifier" example functions to work?

I am using the example script provided by py-scrypt to build a simple password verifier. Below is my test script.
Test Script:
#!/usr/bin/python3
# -*- coding: utf-8 -*-
import scrypt
import os


def hash2_password(a_secret_message, password, maxtime=0.5, datalength=64):
    #return scrypt.encrypt(a_secret_message, password, maxtime=maxtime)
    return scrypt.encrypt(os.urandom(datalength), password, maxtime=maxtime)


def verify2_password(data, password, maxtime=0.5):
    try:
        secret_message = scrypt.decrypt(data, password, maxtime)
        print('\nDecrypted secret message:', secret_message)
        return True
    except scrypt.error:
        return False


password2 = 'Baymax'
secret_message2 = "Go Go"
data2 = hash2_password(secret_message2, password2, maxtime=0.1, datalength=64)
print('\nEncrypted secret message2:')
print(data2)
password_ok = verify2_password(data2, password2, maxtime=0.1)
print('\npassword_ok? :', password_ok)
Issues:
I often get an error message, e.g.:
Traceback (most recent call last):
  File "~/example_scrypt_v1.py", line 56, in <module>
    password_ok = verify2_password(data2, password2, maxtime=0.1)
  File "~/example_scrypt_v1.py", line 43, in verify2_password
    secret_message = scrypt.decrypt(data, password, maxtime)
  File "~/.local/lib/python3.5/site-packages/scrypt/scrypt.py", line 188, in decrypt
    return str(out_bytes, encoding)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xca in position 0: invalid continuation byte
where the last line varies, e.g.:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xaf in position 3: invalid start byte
or
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xee in position 1: invalid continuation byte
or there is no error message but the function returns False:
password_ok? : False
When I comment out return scrypt.encrypt(os.urandom(datalength), password, maxtime=maxtime) to remove the random secret message generator and uncomment return scrypt.encrypt(a_secret_message, password, maxtime=maxtime) to use a non-random secret message, the function verify2_password works.
Question: How do I get the random secret message element to work? What is causing its failure?
Explanation for UnicodeDecodeError Exception
Reason 1
I think I understand why scrypt is raising a UnicodeDecodeError. Quoting Python's UnicodeDecodeError:
The UnicodeDecodeError normally happens when decoding an str string
from a certain coding. Since codings map only a limited number of str
strings to unicode characters, an illegal sequence of str characters
will cause the coding-specific decode() to fail.
Also, Python's Unicode HOWTO, in the section Python’s Unicode Support --> The String Type, says:
In addition, one can create a string using the decode() method of
bytes. This method takes an encoding argument, such as UTF-8, and
optionally an errors argument
The errors argument specifies the response when the input string can’t
be converted according to the encoding’s rules. Legal values for this
argument are 'strict' (raise a UnicodeDecodeError exception),
'replace' (use U+FFFD, REPLACEMENT CHARACTER), 'ignore' (just leave
the character out of the Unicode result), or 'backslashreplace'
(inserts a \xNN escape sequence).
In short, whenever Python's .decode() method fails to map a byte sequence to Unicode characters while using the 'strict' error handler, it raises a UnicodeDecodeError exception.
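The error handlers quoted above can be compared on a small invalid input. This is a Python 3 sketch using only the standard library; the two-byte value in raw is chosen for illustration so that it reproduces the "invalid continuation byte" case.

```python
raw = b"\xcaA"  # 0xca starts a two-byte UTF-8 sequence, but "A" is not a continuation byte

# Collect what each lenient handler produces for the same input.
results = {}
for handler in ("replace", "ignore", "backslashreplace"):
    results[handler] = raw.decode("utf-8", errors=handler)

# With no errors argument, decode() uses "strict" and raises instead.
try:
    raw.decode("utf-8")
    strict_reason = None
except UnicodeDecodeError as exc:
    strict_reason = exc.reason
```

Here 'replace' yields U+FFFD followed by "A", 'ignore' yields just "A", and 'backslashreplace' yields the four-character escape \xca followed by "A".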
I looked for a .decode() call in the .decrypt() function of py-scrypt/scrypt/scrypt.py but initially could not locate one. For Python 3, the return statement of .decrypt() was:
return str(out_bytes, encoding)
However, further checking Python's explanation on the str class, I found the explanation saying that:
if object is a bytes (or bytearray) object, then str(bytes, encoding,
errors) is equivalent to bytes.decode(encoding, errors).
This meant that, with no errors argument given in str(bytes, encoding), the call defaulted to bytes.decode(encoding, errors='strict') and raised a UnicodeDecodeError whenever the bytes could not be mapped to Unicode characters.
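The equivalence between str(bytes, encoding) and bytes.decode(encoding) can be checked directly. This is a small Python 3 sketch; decode_both_ways is a hypothetical helper, and the fixed value in binary stands in for os.urandom() output so the result is repeatable.

```python
def decode_both_ways(data, encoding="utf-8"):
    """Return the outcome of str(data, encoding) and data.decode(encoding)."""
    outcomes = []
    for decode in (lambda: str(data, encoding), lambda: data.decode(encoding)):
        try:
            outcomes.append(decode())
        except UnicodeDecodeError as exc:
            # Record the exception type name instead of a decoded string.
            outcomes.append(type(exc).__name__)
    return outcomes


text_bytes = "Go Go".encode("utf-8")  # valid UTF-8, like the non-random message
binary = b"\xff\xfe\xfd"              # not valid UTF-8, like most random bytes

text_outcome = decode_both_ways(text_bytes)
binary_outcome = decode_both_ways(binary)
```

Both spellings succeed on the text bytes and both fail on the binary value, which is consistent with the fixed "Go Go" message working while the random message does not.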
Reason 2
In the "simple password verifier" example, the input to scrypt.encrypt() was os.urandom(datalength), which returns a <class 'bytes'> object. After this value is encrypted and then decrypted by scrypt.decrypt(), the decrypted value must also be a <class 'bytes'> object. According to the docstring of .decrypt(), on Python 3 the function returns a str instance decoded with encoding; if encoding=None, it returns a bytes instance. Because scrypt.decrypt() defaults to encoding='utf-8' when called from verify2_password(), its attempt to return a <class 'str'> from random bytes resulted in the UnicodeDecodeError.
Solution to the "simple password verifier" example script given in py-scrypt:
The verify_password() function should accept the argument encoding=None.
scrypt.decrypt() should be called with the argument encoding=encoding.
Revised Example Script:
#!/usr/bin/python3
# -*- coding: utf-8 -*-
import scrypt
import os


def encrypt_password(password, maxtime=0.5, datalength=64):
    passphrase = os.urandom(datalength)
    print('\npassphrase = ', passphrase, type(passphrase))
    return scrypt.encrypt(passphrase, password, maxtime=maxtime)


def verify_password(encrypted_passphrase, password, maxtime=0.5, encoding=None):
    try:
        # encoding=None makes decrypt() return bytes instead of decoding to str
        passphrase = scrypt.decrypt(encrypted_passphrase, password, maxtime,
                                    encoding=encoding)
        print('\npassphrase = ', passphrase, type(passphrase))
        return True
    except scrypt.error:
        return False


password = 'Baymax'
encrypted_passphrase = encrypt_password(password, maxtime=0.5, datalength=64)
print('\nEncrypted PassPhrase:')
print(encrypted_passphrase)
password_ok = verify_password(encrypted_passphrase, password, maxtime=0.5)
print('\npassword_ok? :', password_ok)

UnicodeEncodeError only with str(text) in Python

I'm reading a UTF-8 encoded file. When I print the text directly, everything is fine. When I print the text from a class using msg.__str__(), it works too.
But I don't know how to print it with just str(msg), because that always raises the error "'ascii' codec can't encode character u'\xe4' in position 10: ordinal not in range(128)" if the text contains an umlaut.
Example Code:
#!/usr/bin/env python
# encoding: utf-8
import codecs
from TempClass import TempClass

file = codecs.open("person.txt", encoding="utf-8")
message = file.read()  # I am Mr. Händler.

# works
print message
msg = TempClass(message)

# works
print msg.__str__()

# works
print msg.get_string()

# error
print str(msg)
And the class:
class TempClass(object):
    def __init__(self, text):
        self.text = text

    def get_string(self):
        return self.text

    def __str__(self):
        return self.text
I tried to decode and encode the text in several ways but nothing works for me.
Help? :)
Edit: I am using Python 2.7.9
This happens because message (and msg.text) are unicode objects, not str. In Python 2, str() has to produce a byte string, so your __str__ method needs to encode the text as UTF-8 explicitly:
def __str__(self):
    return self.text.encode('utf-8')
unicode can be implicitly encoded to str if it contains only ASCII characters, which is why you only see the error when the input contains an umlaut.
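The position reported in the error can be reproduced with an explicit ASCII encode. This sketch is written for Python 3, where str is already Unicode and the original error does not occur on its own, so the ASCII step is spelled out to imitate Python 2's implicit conversion.

```python
message = "I am Mr. H\xe4ndler."  # \xe4 is the umlaut "ä"

# Explicit UTF-8 encoding, as in the corrected __str__, succeeds.
utf8_bytes = message.encode("utf-8")

# ASCII-only text survives an ASCII encode...
ascii_ok = "I am Mr. Handler.".encode("ascii")

# ...but the umlaut does not; exc.start reports its index in the text.
try:
    message.encode("ascii")
    failed_at = None
except UnicodeEncodeError as exc:
    failed_at = exc.start
```

The index stored in failed_at is 10, matching "position 10" in the question's error message.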

Django 1.4 - django.db.models.FileField.save(filename, file, save=True) produces error with non-ascii filename

I'm building a file upload feature using django.db.models.FileField in Django 1.4.
When I try to upload a file whose name includes non-ASCII characters, it produces the error below.
'ascii' codec can't encode characters in position 109-115: ordinal not in range(128)
The actual code looks like this:
file = models.FileField(_("file"),
                        max_length=512,
                        upload_to=os.path.join('uploaded', 'files', '%Y', '%m', '%d'))
# The next line produces the error above if 'filename' includes a non-ASCII character
file.save(filename, file, save=True)
If I try to use unicode(filename, 'utf-8') instead of filename, it produces the error below:
TypeError: decoding Unicode is not supported
How can I upload a file whose name has non-ASCII characters?
Info of my environment:
sys.getdefaultencoding() : 'ascii'
sys.getfilesystemencoding() : 'UTF-8'
using Django-1.4.10-py2.7.egg
You need to use .encode() to encode the string:
file.save(filename.encode('utf-8', 'ignore'), file, save=True)
In your FileField definition, the 'upload_to' argument could also be written as os.path.join(u'uploaded', 'files', '%Y', '%m', '%d')
(note the u prefix on the first component, u'uploaded'), so that the joined path is of type unicode; this may help as well.
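The TypeError in the question suggests that filename was already a Unicode string, so it needed encoding rather than decoding. A rough Python 3 analogue using only the standard library is sketched below; the file name is made up for illustration, and the Python 3 wording of the exception differs slightly from the Python 2 message.

```python
filename = "r\xe9sum\xe9.txt"  # already text: "résumé.txt"

# Decoding something that is already text is rejected with a TypeError.
try:
    str(filename, "utf-8")
    decode_error = None
except TypeError as exc:
    decode_error = str(exc)

# Encoding text to UTF-8 bytes is the operation that succeeds.
encoded = filename.encode("utf-8")
```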

Insert record of utf-8 character (Chinese, Arabic, Japanese.. etc) into GAE datastore programatically with python

I want to build a simple UI translation feature in GAE (using the Python SDK).
def add_translation(self, pid=None):
    trans = Translation()
    trans.tlang = db.Key("agtwaW1kZXNpZ25lcnITCxILQXBwTGFuZ3VhZ2UY8aIEDA")
    trans.ttype = "UI"
    trans.transid = "ui-about"
    trans.content = "关于我们"
    trans.put()
This results in an encoding error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 0: ordinal not in range(128)
How do I correctly encode the inserted content when it contains Unicode (UTF-8) characters?
Use the u notation:
>>> s=u"关于我们"
>>> print s
关于我们
Or explicitly, stating the encoding:
>>> s=unicode('אדם מתן', 'utf8')
>>> print s
אדם מתן
Read more at the Unicode HOWTO page in the python documentation site.
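The byte named in the error message is the first UTF-8 byte of the string's first character. The following Python 3 sketch, using only the standard library, shows the same bytes failing under the ASCII codec and decoding cleanly as UTF-8, which is what unicode(..., 'utf8') does in Python 2.

```python
text = "关于我们"
raw = text.encode("utf-8")  # the byte string a Python 2 plain "..." literal would hold

first_byte = raw[0]

# Decoding with the right codec restores the original text.
round_trip = raw.decode("utf-8")

# Decoding with the ASCII codec fails on the very first byte.
try:
    raw.decode("ascii")
    ascii_failure = None
except UnicodeDecodeError as exc:
    ascii_failure = (exc.start, exc.reason)
```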

Encoding gives "'ascii' codec can't encode character … ordinal not in range(128)"

I am working through the Django RSS reader project here.
The RSS feed will read something like "OKLAHOMA CITY (AP) — James Harden let". The RSS feed's encoding reads encoding="UTF-8" so I believe I am passing utf-8 to markdown in the code snippet below. The em dash is where it chokes.
I get the Django error "'ascii' codec can't encode character u'\u2014' in position 109: ordinal not in range(128)", which is a UnicodeEncodeError. In the variables being passed I see "OKLAHOMA CITY (AP) \u2014 James Harden". The line of code that is not working is:
content = content.encode(parsed_feed.encoding, "xmlcharrefreplace")
I am using markdown 2.0, django 1.1, and python 2.4.
What is the magic sequence of encoding and decoding that I need to do to make this work?
(In response to Prometheus' request. I agree the formatting helps)
So in views I add a smart_unicode line above the parsed_feed encoding line...
content = smart_unicode(content, encoding='utf-8', strings_only=False, errors='strict')
content = content.encode(parsed_feed.encoding, "xmlcharrefreplace")
This pushes the problem to my models.py, where I have
def save(self, force_insert=False, force_update=False):
    if self.excerpt:
        self.excerpt_html = markdown(self.excerpt)
    # super save after this
If I change the save method to have...
def save(self, force_insert=False, force_update=False):
    if self.excerpt:
        encoded_excerpt_html = (self.excerpt).encode('utf-8')
        self.excerpt_html = markdown(encoded_excerpt_html)
I get the error "'ascii' codec can't decode byte 0xe2 in position 141: ordinal not in range(128)" because now it reads "\xe2\x80\x94" where the em dash was.
If the data that you are receiving is, in fact, encoded in UTF-8, then it should be a sequence of bytes -- a Python 'str' object, in Python 2.X
You can verify this with an assertion:
assert isinstance(content, str)
Once you know that that's true, you can move to the actual encoding. Python doesn't do transcoding -- directly from UTF-8 to ASCII, for instance. You need to first turn your sequence of bytes into a Unicode string, by decoding it:
unicode_content = content.decode('utf-8')
(If you can trust parsed_feed.encoding, then use that instead of the literal 'utf-8'. Either way, be prepared for errors.)
You can then take that string, and encode it in ASCII, substituting high characters with their XML entity equivalents:
xml_content = unicode_content.encode('ascii', 'xmlcharrefreplace')
The full method, then, would look something like this:
try:
    content = content.decode(parsed_feed.encoding).encode('ascii', 'xmlcharrefreplace')
except UnicodeDecodeError:
    # Couldn't decode the incoming string -- possibly not encoded in utf-8
    # Do something here to report the error
    raise  # placeholder so the block is valid until real error reporting is added
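The decode-then-encode sequence above can be tried on the headline from the question. This is a Python 3 sketch in which bytes plays the role of Python 2's str; to_ascii_xml is a hypothetical helper name and the input bytes are typed out by hand for illustration.

```python
def to_ascii_xml(raw, encoding="utf-8"):
    """Decode bytes to text, then encode as ASCII with XML character references."""
    return raw.decode(encoding).encode("ascii", "xmlcharrefreplace")


# \xe2\x80\x94 is the UTF-8 encoding of the em dash (U+2014).
feed_bytes = b"OKLAHOMA CITY (AP) \xe2\x80\x94 James Harden let"
result = to_ascii_xml(feed_bytes)
```

The em dash comes out as the character reference &#8212;, and the rest of the text is unchanged.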
Django provides a couple of useful functions for converting back and forth between Unicode and bytestrings:
from django.utils.encoding import smart_unicode, smart_str
I encountered this error while writing a file into a zip archive with ZipFile. The following failed:
ZipFile.write(root+'/%s'%file, newRoot + '/%s'%file)
and the following worked
ZipFile.write(str(root+'/%s'%file), str(newRoot + '/%s'%file))
