What am I missing to pickle and unpickle Unicode - python

I'm working on a Py3k program that I want to be able to accept Unicode strings and pickle/unpickle them.
However, it is defaulting to an ASCII codec, and complaining about a Unicode error:
UnicodeEncodeError: 'ascii' codec can't encode character '\u0161' in position 1442: ordinal not in range(128)
args = ('ascii', "Content-Type: text/html\n\n<!DOCTYPE html>\n<html>\n...ype='submit'>\n </form>\n </body>\n</html>", 1442, 1443, 'ordinal not in range(128)')
encoding = 'ascii'
end = 1443
object = "Content-Type: text/html\n\n<!DOCTYPE html>\n<html>\n...ype='submit'>\n </form>\n </body>\n</html>"
reason = 'ordinal not in range(128)'
start = 1442
with_traceback = <built-in method with_traceback of UnicodeEncodeError object>
How can I change the codec or otherwise change things so that Unicode values taken from a CGI string will be successfully marshalled and unmarshalled as Unicode strings?
Thanks,
--EDIT--
The source code is at http://pastebin.com/nX2w1tqa .

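I would try explicitly passing a Unicode string to pickle.dump(), something like pickle.dump(str(state), output_file). Note that in Python 3 the unicode() built-in no longer exists and str is already Unicode, and that output_file has to be opened in binary mode ('wb').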
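As a rough illustration of the points above, here is a minimal Python 3 sketch. It shows that pickle itself round-trips a non-ASCII string without trouble, and that the text of a page can be encoded to UTF-8 explicitly instead of relying on an ASCII default. The traceback in the question names the HTML page as the failing object, which suggests (but does not prove) that the error comes from writing the output rather than from pickle. The names state, buffer and page are made up for the example, and io.BytesIO stands in for a file opened with mode 'wb'.

```python
import io
import pickle

# Hypothetical application state containing U+0161, the character from the traceback.
state = {"name": "Pa\u0161i\u0107"}

buffer = io.BytesIO()        # stands in for open(path, "wb")
pickle.dump(state, buffer)   # pickle writes bytes, so no text codec is involved
buffer.seek(0)
restored = pickle.load(buffer)

# Encode the response explicitly rather than relying on the default stdout codec.
page = "Content-Type: text/html; charset=utf-8\n\n<p>" + restored["name"] + "</p>"
page_bytes = page.encode("utf-8")
```

In a CGI script the resulting bytes could then be written with sys.stdout.buffer.write(page_bytes), assuming the page is meant to be served as UTF-8.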

Related

How to fix UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d : character maps to <undefined>?

I am using curl to fetch data.
import os
cmd = "curl --data \"action=getdata\" https://localhost:8070"
print(cmd)
data = os.popen(cmd).read()
The line above produces an error UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 565334: character maps to <undefined>.
When I debugged using breakpoints, the command os.popen generates a large corpus of text and when it goes to read() the error arises in file cp1252.py in IncrementalDecoder class. I tried doing,
data = os.popen(cmd).read().encode('utf-8').decode('ascii')
and
data = os.popen(cmd).read().encode().decode('utf-8')
But the error persists. How can we solve this?
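One possible approach, sketched below under the assumption that the server sends UTF-8: the traceback points at cp1252.py, which suggests the output is being decoded with the Windows locale codec inside read(), so calling encode()/decode() on the result comes too late. Reading the raw bytes and decoding them explicitly avoids that step. The helper name run_and_decode is made up for this example, and a small child Python process stands in for curl so the snippet runs anywhere.

```python
import subprocess
import sys

def run_and_decode(args, encoding="utf-8"):
    """Run a command and decode its stdout explicitly, not with the locale codec."""
    completed = subprocess.run(args, stdout=subprocess.PIPE, check=True)
    # errors="replace" keeps going if a stray byte is not valid in the chosen encoding.
    return completed.stdout.decode(encoding, errors="replace")

# Stand-in for the curl call: a child process that emits UTF-8 bytes containing 0x9d.
child = [sys.executable, "-c",
         "import sys; sys.stdout.buffer.write('\\u201d'.encode('utf-8'))"]
text = run_and_decode(child)
```

For the command in the question, the argument list would be the curl executable followed by its options and the URL, each as a separate list item.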

UnicodeEncodeError: 'ascii' codec can't encode characters in position 90-96: ordinal not in range(128)

I have this code:
url= 'https://yandex.ru/search/xml?user=uid-2h3232xfhboy&key=03.292922330523:6b4c80ghghghhghgdsfdsfds4c4b4a7872fb7d2bb04bfdgbb02b76c3d&query='
key = "абс"
url = url + key
print(url)
xml = urllib.request.urlopen(url).read()
But I got an error:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 90-96: ordinal not in range(128)
What do I do?
I tried url = url.encode("utf-8"), but that didn't help. I got this error:
AttributeError: 'bytes' object has no attribute 'timeout'
I tried to do this:
url = u''.join((self.ya_url, key)).encode('utf-8')
As suggested here: UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 20: ordinal not in range(128)
But got the same error
AttributeError: 'bytes' object has no attribute 'timeout'
What do I do?
You can't use non-ASCII characters in a URL. You need to quote your key value appropriately:
import urllib.parse
url= 'https://yandex.ru/search/xml?user=uid-2h3232xfhboy&key=03.292922330523:6b4c80ghghghhghgdsfdsfds4c4b4a7872fb7d2bb04bfdgbb02b76c3d&query='
key = "абс"
quoted = urllib.parse.quote(key)
url = url + quoted
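To make the effect of quoting visible, here is a small sketch using a placeholder endpoint (https://example.com/search/xml is an assumption, not the URL from the question). urllib.parse.quote percent-encodes the UTF-8 bytes of the key, and urllib.parse.urlencode does the same while also building the query string.

```python
from urllib.parse import quote, urlencode

base = "https://example.com/search/xml"  # hypothetical endpoint for illustration
key = "абс"

quoted = quote(key)                           # percent-encoded UTF-8 bytes of the key
url = base + "?" + urlencode({"query": key})  # builds "query=<quoted key>"
```

The resulting URL contains only ASCII characters, so it can be passed to urllib.request.urlopen as a str.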
This method worked for me (I use the PyCharm IDE). Go to client.py and change request.encode('ascii') to request.encode('utf-8') or any other encoding you want. It should then work with no problem.
Edit: you need to change the source file in order to use UTF characters in the URL, because the encoding in request.encode is hard-coded to ASCII.

Python output replaces non ASCII characters with �

I am using Python 2.7 to read data from a MySQL table. In MySQL the name looks like this:
Garasa, Ángel.
But when I print it in Python the output is
Garasa, �ngel
The character set name in MySQL is utf8.
This is my Python code:
# coding: utf-8
import MySQLdb
connection = MySQLdb.connect(host="localhost", user="root", passwd="root", db="jmdb")
cursor = connection.cursor()
cursor.execute("select * from actors where actorid=672462;")
data = cursor.fetchall()
for row in data:
    print "IMDB Name=", row[4]
    wiki = "".join(row[4])
    print wiki
I have tried decoding it, but I get errors such as:
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc1 in position 8:
invalid start byte
I have read about decoding and UTF-8 but couldn't find a solution.
Get the MySQL driver to return Unicode strings instead. This means that you don't have to deal with decoding in your own code.
Simply set use_unicode=True in the connection parameters. If the table uses a specific encoding, set the charset parameter to match.
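A minimal sketch of what those connection parameters might look like, reusing the host, user and database values from the question. The helper names unicode_connection_kwargs and fetch_actor_row are made up for this example, and the charset default of "utf8" is an assumption that should be replaced if the stored bytes turn out to be in another encoding. The sketch is written in Python 3 syntax.

```python
def unicode_connection_kwargs(charset="utf8"):
    """Connection parameters that ask MySQLdb to return Unicode strings."""
    return {
        "host": "localhost",
        "user": "root",
        "passwd": "root",
        "db": "jmdb",
        "use_unicode": True,  # return text instead of raw byte strings
        "charset": charset,   # should match the encoding of the stored data
    }

def fetch_actor_row(actor_id):
    # Third-party driver, imported lazily so the helper above works without it.
    import MySQLdb
    connection = MySQLdb.connect(**unicode_connection_kwargs())
    try:
        cursor = connection.cursor()
        cursor.execute("select * from actors where actorid=%s", (actor_id,))
        return cursor.fetchone()
    finally:
        connection.close()
```

Passing the actor id as a query parameter instead of formatting it into the SQL string is a design choice of this sketch, not something the question requires.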
I think the right character mapping in your case is cp1252 :
>>> s = 'Garasa, Ángel.'
>>> s.decode('utf-8')
Traceback (most recent call last):
  File "<pyshell#63>", line 1, in <module>
    s.decode('utf-8')
  File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc1 in position 8: invalid start byte
>>> s.decode('cp1252')
u'Garasa, \xc1ngel.'
>>>
>>> print s.decode('cp1252')
Garasa, Ángel.
EDIT: It could also be latin-1:
>>> s.decode('latin-1')
u'Garasa, \xc1ngel.'
>>> print s.decode('latin-1')
Garasa, Ángel.
The cp1252 and latin-1 code pages agree for all codes except the range 128 to 159.
Quoting from this source (latin-1):
The Windows-1252 codepage coincides with ISO-8859-1 for all codes
except the range 128 to 159 (hex 80 to 9F), where the little-used C1
controls are replaced with additional characters including all the
missing characters provided by ISO-8859-15
And this one (cp1252):
This character encoding is a superset of ISO 8859-1, but differs from
the IANA's ISO-8859-1 by using displayable characters rather than
control characters in the 80 to 9F (hex) range.
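The difference described in those two quotes can be checked directly. The following Python 3 sketch decodes the bytes from the question with each codec and then decodes the single byte 0x80, which falls inside the 128 to 159 range where the two code pages differ. The variable names are made up for the example.

```python
raw = b"Garasa, \xc1ngel."  # the bytes behind the name in the question

# 0xC1 is never a valid start byte in UTF-8, so this decode fails.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Outside the range 0x80 to 0x9F both code pages give the same result.
as_cp1252 = raw.decode("cp1252")
as_latin1 = raw.decode("latin-1")

# Inside that range they differ.
byte80_cp1252 = b"\x80".decode("cp1252")   # a printable character (the euro sign)
byte80_latin1 = b"\x80".decode("latin-1")  # a C1 control character
```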

Django 1.4 - django.db.models.FileField.save(filename, file, save=True) produces error with non-ascii filename

I'm building a file upload feature using django.db.models.FileField in Django 1.4.
When I try to upload a file whose name includes non-ASCII characters, it produces the error below.
'ascii' codec can't encode characters in position 109-115: ordinal not
in range(128)
The actual code is like below
file = models.FileField(_("file"),
                        max_length=512,
                        upload_to=os.path.join('uploaded', 'files', '%Y', '%m', '%d'))
file.save(filename, file, save=True)  # <- This line produces the error above if 'filename' includes a non-ASCII character
If I try to use unicode(filename, 'utf-8') instead of filename, it produces the error below:
TypeError: decoding Unicode is not supported
How can I upload a file whose name has non-ascii characters?
Info of my environment:
sys.getdefaultencoding() : 'ascii'
sys.getfilesystemencoding() : 'UTF-8'
using Django-1.4.10-py2.7.egg
You need to use .encode() to encode the string:
file.save(filename.encode('utf-8', 'ignore'), file, save=True)
In your FileField definition, the 'upload_to' argument could be written as os.path.join(u'uploaded', 'files', '%Y', '%m', '%d')
(note that the first component, u'uploaded', starts with u), so that the whole path is of type unicode. This may help.
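The "decoding Unicode is not supported" TypeError quoted in the question is what Python 2 raises when unicode() is given an encoding together with a value that is already a unicode object, which suggests the filename had already been decoded. A defensive pattern is to decode only when the value is still bytes. The sketch below uses Python 3 syntax, and the helper name to_text is made up for this example.

```python
def to_text(value, encoding="utf-8"):
    """Return text, decoding only when the value is still a byte string."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value  # already text, so decoding again would raise TypeError

from_bytes = to_text("résumé.txt".encode("utf-8"))
from_text = to_text("résumé.txt")
```

Both calls yield the same text, so the rest of the code can treat the filename uniformly.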

Insert record of utf-8 character (Chinese, Arabic, Japanese.. etc) into GAE datastore programatically with python

I just want to build simple UI translation built in GAE (using python SDK).
def add_translation(self, pid=None):
    trans = Translation()
    trans.tlang = db.Key("agtwaW1kZXNpZ25lcnITCxILQXBwTGFuZ3VhZ2UY8aIEDA")
    trans.ttype = "UI"
    trans.transid = "ui-about"
    trans.content = "关于我们"
    trans.put()
This results in an encoding error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 0: ordinal not in range(128)
How do I correctly insert content containing Unicode (UTF-8) characters?
Use the u notation:
>>> s=u"关于我们"
>>> print s
关于我们
Or explicitly, stating the encoding:
>>> s=unicode('אדם מתן', 'utf8')
>>> print s
אדם מתן
Read more at the Unicode HOWTO page in the python documentation site.
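To see where the 0xe5 in the error message comes from, here is a small Python 3 sketch. It assumes the source file was saved as UTF-8, in which case a plain Python 2 string literal would hold the same bytes as raw below. Decoding those bytes as ASCII fails on the first byte, while decoding them as UTF-8 recovers the text.

```python
text = "关于我们"
raw = text.encode("utf-8")  # the bytes a plain (non-u) Python 2 literal would hold

first_byte = raw[0]         # the byte named in the error message

# An implicit ASCII decode is what fails in the question.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

round_trip = raw.decode("utf-8")  # an explicit UTF-8 decode recovers the text
```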
