ConfigParser with Unicode items - python

My troubles with ConfigParser continue. It doesn't seem to support Unicode very well. The config file is indeed saved as UTF-8, but when ConfigParser reads it, it seems to be encoded into something else. I assumed it was latin-1 and I thought overriding optionxform could help:
-- configfile.cfg --
[rules]
Häjsan = 3
☃ = my snowman
-- myapp.py --
# -*- coding: utf-8 -*-
import ConfigParser
def _optionxform(s):
    try:
        newstr = s.decode('latin-1')
        newstr = newstr.encode('utf-8')
        return newstr
    except Exception, e:
        print e
cfg = ConfigParser.ConfigParser()
cfg.optionxform = _optionxform
cfg.read("myconfig")
Of course, when I read the config I get:
'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
I've tried a couple of different variations of decoding 's' but the point seems moot, since it really should be a unicode object from the beginning. After all, the config file is UTF-8. I have confirmed that something is wrong in the way ConfigParser reads the file by stubbing it out with this DummyConfig class. If I use that, everything is nice unicode, fine and dandy.
-- config.py --
# -*- coding: utf-8 -*-
apa = {'rules': [(u'Häjsan', 3), (u'☃', u'my snowman')]}
class DummyConfig(object):
    def sections(self):
        return apa.keys()
    def items(self, section):
        return apa[section]
    def add_section(self, apa):
        pass
    def set(self, *args):
        pass
Any ideas about what could be causing this, or suggestions of other config modules that support Unicode better, are most welcome. I don't want to use sys.setdefaultencoding()!

The ConfigParser.readfp() method can take a file object. Have you tried opening the file with the correct encoding using the codecs module before passing it to ConfigParser, as below?
import codecs
cfg.readfp(codecs.open("myconfig", "r", "utf8"))
In Python 3.2 and above, readfp() is deprecated (it was removed in Python 3.12). Use read_file() instead.

In Python 3.2 an encoding parameter was added to read(), so it can now be used as:
cfg.read("myconfig", encoding='utf-8')
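As a minimal sketch of the Python 3 approach, the following writes a small UTF-8 config to a temporary file and reads it back with read(..., encoding="utf-8"). The helper name load_utf8_config, the temporary file name and the sample contents are assumptions made for illustration. Note that the default optionxform lowercases option names.

```python
import configparser
import os
import tempfile

# Sample contents mirroring the config file from the question.
SAMPLE = "[rules]\nHäjsan = 3\n☃ = my snowman\n"

def load_utf8_config(path):
    """Read an .ini file that is known to be saved as UTF-8."""
    cfg = configparser.ConfigParser()
    # read() silently skips files it cannot open, so check the result.
    loaded = cfg.read(path, encoding="utf-8")
    if not loaded:
        raise FileNotFoundError(path)
    return cfg

with tempfile.TemporaryDirectory() as tmp:
    cfg_path = os.path.join(tmp, "myconfig.cfg")
    with open(cfg_path, "w", encoding="utf-8") as fh:
        fh.write(SAMPLE)
    cfg = load_utf8_config(cfg_path)
    # Option names come back lowercased unless optionxform is overridden.
    rules = dict(cfg.items("rules"))
```

Passing the encoding explicitly avoids depending on the platform's locale, which is what read() falls back to when no encoding is given.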

Try overriding the write method of RawConfigParser like this:
from ConfigParser import RawConfigParser
class ConfigWithCoder(RawConfigParser):
    def write(self, fp):
        """Write an .ini-format representation of the configuration state."""
        if self._defaults:
            fp.write("[%s]\n" % "DEFAULT")
            for (key, value) in self._defaults.items():
                fp.write("%s = %s\n" % (key, str(value).replace('\n', '\n\t')))
            fp.write("\n")
        for section in self._sections:
            fp.write("[%s]\n" % section)
            for (key, value) in self._sections[section].items():
                if key == "__name__":
                    continue
                if (value is not None) or (self._optcre == self.OPTCRE):
                    # Encode unicode values explicitly instead of relying on str()
                    if type(value) == unicode:
                        value = ''.join(value).encode('utf-8')
                    else:
                        value = str(value)
                    value = value.replace('\n', '\n\t')
                    key = " = ".join((key, value))
                fp.write("%s\n" % (key))
            fp.write("\n")

This seems to be a problem with the Python 2.x version of ConfigParser; the Python 3.x version is free of it. The corresponding issue in the Python bug tracker is closed as WONTFIX.
I fixed it by editing the ConfigParser.py file. In the write method (around line 412), change:
key = " = ".join((key, str(value).replace('\n', '\n\t')))
to
key = " = ".join((key, str(value).decode('utf-8').replace('\n', '\n\t')))
I don't know if it's a real solution, but I tested it on Windows 7 and Ubuntu 15.04 and it works like a charm: I can share and work with the same .ini file on both systems.

What I did is just:
file_name = file_name.decode("utf-8")
cfg.read(file_name)

Related

Streaming API: Tracked keywords result in "error: Non-UTF-8 code... but no encoding declared"

I have working code that uses tweepy's stream listener to stream tweets. It works just fine and I have run it successfully a couple of times, with Arabic, English, and French keywords combined.
For some reason, when I insert my whole set of keywords (397), the code fails with the error
SyntaxError: Non-UTF-8 code starting with '\xd9' in file twitter_streaming_copy.py on line 67, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
Oddly, I have tried running the code with different parts of the keyword set and it works fine; it is only when I put them all together that it stops working. Any idea? Here is my code (I'm using Python 3):
# Chap02-03/twitter_streaming.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys
import string
import time
import tweepy
from tweepy import Stream
from tweepy.streaming import StreamListener
consumer_key = ".."
consumer_secret = ".."
access_key = ".-."
access_secret = ".."
class CustomListener(StreamListener):
    """Custom StreamListener for streaming Twitter data."""
    def __init__(self, fname):
        safe_fname = format_filename(fname)
        self.outfile = "stream_%s.jsonl" % safe_fname
    def on_data(self, data):
        try:
            with open(self.outfile, 'a') as f:
                f.write(data)
                return True
        except BaseException as e:
            sys.stderr.write("Error on_data: {}\n".format(e))
            time.sleep(5)
        return True
    def on_error(self, status):
        if status == 420:
            sys.stderr.write("Rate limit exceeded\n")
            return False
        else:
            sys.stderr.write("Error {}\n".format(status))
            return True
def format_filename(fname):
    """Convert fname into a safe string for a file name.
    Return: string
    """
    return ''.join(convert_valid(one_char) for one_char in fname)
def convert_valid(one_char):
    """Convert a character into '_' if "invalid".
    Return: string
    """
    valid_chars = "-_.%s%s" % (string.ascii_letters, string.digits)
    if one_char in valid_chars:
        return one_char
    else:
        return '_'
if __name__ == '__main__':
    query = sys.argv[1:] # list of CLI arguments
    query_fname = ' '.join(query) # string
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_key, access_secret)
    api = tweepy.API(auth)
    twitter_stream = Stream(auth, CustomListener(query_fname))
    twitter_stream.filter(track=['saudi لبنان', 'iran لبنان', 'iran lebanon', 'ايران لبنان', 'hezbollah lebanon', 'حزب الله لبنان', 'saoudite liban', 'iran liban', 'hezbollah liban'], async=True)
I reproduced a similar error with the following code by saving the file as Windows-1256 (Arabic):
# Chap02-03/twitter_streaming.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
s = ['saudi لبنان', 'iran لبنان', 'iran lebanon', 'ايران لبنان', 'hezbollah lebanon', 'حزب الله لبنان', 'saoudite liban', 'iran liban', 'hezbollah liban']
Output:
File "C:\test.py", line 4
SyntaxError: Non-UTF-8 code starting with '\xe1' in file C:\test.py on line 4, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
@Martijn's answer is correct that the coding line must be one of the first two lines, but UTF-8 is the default encoding in Python 3 anyway. If the file had been saved as UTF-8, it would have worked even with the comment on the wrong line; the file must actually be saved in the declared encoding.
You haven't saved your source file as UTF-8. Configure your editor correctly.
Alternatively, adjust your coding comment at the top; the default for Python 3 is UTF-8, but if you used a different codec you need to specify it in that comment. However, the encoding comment must appear in the first two lines of your file. You have it on the third line. Quoting from the PEP linked in the error message:
To define a source code encoding, a magic comment must be placed into the source files either as first or second line in the file[.]
(Bold emphasis mine)
Re-arrange your comments to:
#!/usr/bin/env python
# -*- coding: <your codec> -*-
# Chap02-03/twitter_streaming.py
I moved the first comment down; the #! line must be the first line in the file for it to work. You could also just remove it altogether, since you were not using it.
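To check which encoding Python would assume for a given file, the standard library's tokenize.detect_encoding can be used. The sketch below feeds it two in-memory headers; the helper name declared_encoding and the choice of cp1256 as the declared codec are assumptions for illustration (cp1256 is the codec the other answer used to reproduce the error).

```python
import io
import tokenize

def declared_encoding(source_bytes):
    """Return the source encoding Python would assume for these bytes."""
    encoding, _ = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    return encoding

# Coding comment on the third line: only the first two lines are inspected,
# so the comment is ignored and the default encoding applies.
late_cookie = (
    b"# Chap02-03/twitter_streaming.py\n"
    b"#!/usr/bin/env python\n"
    b"# -*- coding: cp1256 -*-\n"
)
# Coding comment on the second line: it is picked up.
early_cookie = (
    b"#!/usr/bin/env python\n"
    b"# -*- coding: cp1256 -*-\n"
    b"# Chap02-03/twitter_streaming.py\n"
)
late_result = declared_encoding(late_cookie)
early_result = declared_encoding(early_cookie)
```

For a real file, opening it in binary mode and passing its readline method works the same way.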

Porting pickle py2 to py3 strings become bytes

I have a pickle file that was created with python 2.7 that I'm trying to port to python 3.6. The file is saved in py 2.7 via pickle.dumps(self.saved_objects, -1)
and loaded in python 3.6 via loads(data, encoding="bytes") (from a file opened in rb mode). If I try opening in r mode and pass encoding=latin1 to loads I get UnicodeDecode errors. When I open it as a byte stream it loads, but literally every string is now a byte string. Every object's __dict__ keys are all b"a_variable_name" which then generates attribute errors when calling an_object.a_variable_name because __getattr__ passes a string and __dict__ only contains bytes. I feel like I've tried every combination of arguments and pickle protocols already. Apart from forcibly converting all objects' __dict__ keys to strings I'm at a loss. Any ideas?
** Skip to 4/28/17 update for better example
-------------------------------------------------------------------------------------------------------------
** Update 4/27/17
This minimum example illustrates my problem:
From py 2.7.13
import pickle
class test(object):
    def __init__(self):
        self.x = u"test ¢" # including a unicode str breaks things
t = test()
dumpstr = pickle.dumps(t)
>>> dumpstr
"ccopy_reg\n_reconstructor\np0\n(c__main__\ntest\np1\nc__builtin__\nobject\np2\nNtp3\nRp4\n(dp5\nS'x'\np6\nVtest \xa2\np7\nsb."
From py 3.6.1
import pickle
class test(object):
    def __init__(self):
        self.x = "xyz"
dumpstr = b"ccopy_reg\n_reconstructor\np0\n(c__main__\ntest\np1\nc__builtin__\nobject\np2\nNtp3\nRp4\n(dp5\nS'x'\np6\nVtest \xa2\np7\nsb."
t = pickle.loads(dumpstr, encoding="bytes")
>>> t
<__main__.test object at 0x040E3DF0>
>>> t.x
Traceback (most recent call last):
File "<pyshell#15>", line 1, in <module>
t.x
AttributeError: 'test' object has no attribute 'x'
>>> t.__dict__
{b'x': 'test ¢'}
>>>
-------------------------------------------------------------------------------------------------------------
Update 4/28/17
To re-create my issue I'm posting my actual raw pickle data here
The pickle file was created in python 2.7.13, windows 10 using
with open("raw_data.pkl", "wb") as fileobj:
    pickle.dump(library, fileobj, protocol=0)
(protocol 0 so it's human readable)
To run it you'll need classes.py
# classes.py
class Library(object): pass
class Book(object): pass
class Student(object): pass
class RentalDetails(object): pass
And the test script here:
# load_pickle.py
import pickle, sys, itertools, os
raw_pkl = "raw_data.pkl"
is_py3 = sys.version_info.major == 3
read_modes = ["rb"]
encodings = ["bytes", "utf-8", "latin-1"]
fix_imports_choices = [True, False]
files = ["raw_data_%s.pkl" % x for x in range(3)]
def py2_test():
    with open(raw_pkl, "rb") as fileobj:
        loaded_object = pickle.load(fileobj)
        print("library dict: %s" % (loaded_object.__dict__.keys()))
        return loaded_object
def py2_dumps():
    library = py2_test()
    for protcol, path in enumerate(files):
        print("dumping library to %s, protocol=%s" % (path, protcol))
        with open(path, "wb") as writeobj:
            pickle.dump(library, writeobj, protocol=protcol)
def py3_test():
    # this test iterates over the different options trying to load
    # the data pickled with py2 into a py3 environment
    print("starting py3 test")
    for (read_mode, encoding, fix_import, path) in itertools.product(read_modes, encodings, fix_imports_choices, files):
        py3_load(path, read_mode=read_mode, fix_imports=fix_import, encoding=encoding)
def py3_load(path, read_mode, fix_imports, encoding):
    from traceback import print_exc
    print("-" * 50)
    print("path=%s, read_mode = %s fix_imports = %s, encoding = %s" % (path, read_mode, fix_imports, encoding))
    if not os.path.exists(path):
        print("start this file with py2 first")
        return
    try:
        with open(path, read_mode) as fileobj:
            loaded_object = pickle.load(fileobj, fix_imports=fix_imports, encoding=encoding)
            # print the object's __dict__
            print("library dict: %s" % (loaded_object.__dict__.keys()))
            # consider the test a failure if any member attributes are saved as bytes
            test_passed = not any((isinstance(k, bytes) for k in loaded_object.__dict__.keys()))
            print("Test %s" % ("Passed!" if test_passed else "Failed"))
    except Exception:
        print_exc()
        print("Test Failed")
    input("Press Enter to continue...")
    print("-" * 50)
if is_py3:
    py3_test()
else:
    # py2_test()
    py2_dumps()
Put all three files in the same directory and run c:\python27\python load_pickle.py first, which creates one pickle file for each of the three protocols. Then run the same script with Python 3 and notice that the Python 3 run converts the __dict__ keys to bytes. I had it working for about 6 hours, but for the life of me I can't figure out how I broke it again.
In short, you're hitting bug 22005 with datetime.date objects in the RentalDetails objects.
That can be worked around with the encoding='bytes' parameter, but that leaves your classes with __dict__ containing bytes:
>>> library = pickle.loads(pickle_data, encoding='bytes')
>>> dir(library)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: '<' not supported between instances of 'str' and 'bytes'
It's possible to manually fix that based on your specific data:
def fix_object(obj):
    """Decode obj.__dict__ containing bytes keys"""
    obj.__dict__ = dict((k.decode("ascii"), v) for k, v in obj.__dict__.items())
def fix_library(library):
    """Walk all library objects and decode __dict__ keys"""
    fix_object(library)
    for student in library.students:
        fix_object(student)
    for book in library.books:
        fix_object(book)
        for rental in book.rentals:
            fix_object(rental)
But that's fragile, and enough of a pain that you should look for a better option.
1) Implement __getstate__/__setstate__ that maps datetime objects to a non-broken representation, for instance:
import datetime
class Event(object):
    """Example class working around datetime pickling bug"""
    def __init__(self):
        self.date = datetime.date.today()
    def __getstate__(self):
        # Pickle the date as a plain integer ordinal
        state = self.__dict__.copy()
        state["date"] = state["date"].toordinal()
        return state
    def __setstate__(self, state):
        # Restore the date object from its ordinal
        self.__dict__.update(state)
        self.date = datetime.date.fromordinal(self.date)
2) Don't use pickle at all. Along the lines of __getstate__/__setstate__, you can just implement to_dict/from_dict methods or similar in your classes for saving their content as json or some other plain format.
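A rough sketch of the to_dict/from_dict idea with JSON follows. The Rental class, its attribute names and the sample values are hypothetical stand-ins, not the classes from the question; the date is stored as an ISO 8601 string so that it survives the round trip between Python versions.

```python
import datetime
import json

class Rental(object):
    """Hypothetical stand-in for a class holding a date attribute."""

    def __init__(self, title, rental_date):
        self.title = title
        self.rental_date = rental_date

    def to_dict(self):
        # Store the date as an ISO 8601 string, which JSON can represent.
        return {"title": self.title, "rental_date": self.rental_date.isoformat()}

    @classmethod
    def from_dict(cls, data):
        # Parse "YYYY-MM-DD" back into a date object.
        year, month, day = (int(part) for part in data["rental_date"].split("-"))
        return cls(data["title"], datetime.date(year, month, day))

original = Rental(u"Example title", datetime.date(2017, 2, 16))
payload = json.dumps(original.to_dict())
restored = Rental.from_dict(json.loads(payload))
```

The same file can then be written by a Python 2.7 script and read by Python 3 without going through pickle at all.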
A final note, having a backreference to library in each object shouldn't be required.
You should treat pickle data as specific to the (major) version of Python that created it.
(See Gregory Smith's message w.r.t. issue 22005.)
The best way to get around this is to write a Python 2.7 program to read the pickled data, and write it out in a neutral format.
Taking a quick look at your actual data, it seems to me that an SQLite database is appropriate as an interchange format, since the Books contain references to a Library and RentalDetails. You could create separate tables for each.
Question: Porting pickle py2 to py3 strings become bytes
Using encoding='latin-1', as shown below, is fine.
Your problem with b'' keys is the result of using encoding='bytes'.
That makes dict keys unpickle as bytes instead of str.
The problem data are the datetime.date values '\x07á\x02\x10', starting at line 56 in raw-data.pkl.
It's a known issue, as already pointed out.
Unpickling python2 datetime under python3
http://bugs.python.org/issue22005
For a workaround, I have patched pickle.py and got unpickled object, e.g.
book.library.books[0].rentals[0].rental_date=2017-02-16
This will work for me:
t = pickle.loads(dumpstr, encoding="latin-1")
Output:
<__main__.test object at 0xf7095fec>
t.__dict__={'x': 'test ¢'}
test ¢
Tested with Python:3.4.2
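The difference between the two encoding arguments can be seen on a small Python 2 pickle. The byte string below is a hand-trimmed protocol 0 pickle of the dict {'x': u'test ¢'}, modelled on the tail of the dump in the question; trimming it down to a plain dict is an assumption made so that no class definitions are needed.

```python
import pickle

# Protocol 0 pickle of the Python 2 dict {'x': u'test ¢'}:
# the key is a Python 2 str (S opcode), the value is unicode (V opcode).
py2_pickle = b"(dp0\nS'x'\np1\nVtest \xa2\np2\ns."

# encoding="bytes" leaves Python 2 str objects as bytes.
as_bytes = pickle.loads(py2_pickle, encoding="bytes")

# encoding="latin-1" decodes Python 2 str objects to Python 3 str.
as_latin1 = pickle.loads(py2_pickle, encoding="latin-1")
```

Only the key changes type between the two calls; the unicode value is decoded the same way in both.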

Encode error scraping

I'm scraping a site that contains Chinese characters.
How do I scrape Chinese characters?
from urllib.request import urlopen
from urllib.parse import urljoin
from lxml.html import fromstring
URL = 'http://list.suning.com/0-258003-0.html'
ITEM_PATH = '.clearfix .product .border-out .border-in .wrap .res-info .sell-point'
def parse_items():
    f = urlopen(URL)
    list_html = f.read().decode('utf-8')
    list_doc = fromstring(list_html)
    for elem in list_doc.cssselect(ITEM_PATH):
        a = elem.cssselect('a')[0]
        href = a.get('href')
        title = a.text
        em = elem.cssselect('em')[0]
        title2 = em.text
        print(href, title, title2)
def main():
    parse_items()
if __name__ == '__main__':
    main()
Error looks like this.
http://product.suning.com/0000000000/146422477.html Traceback (most recent call last):
File "parser.py", line 27, in <module>
main()
File "parser.py", line 24, in main
parse_items()
File "parser.py", line 20, in parse_items
print(href, title, title2)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
From the print syntax and the imports, I assume that you are using Python 3, which matters for Unicode handling.
So we can expect href, title and title2 to all be unicode strings (plain str in Python 3). But the print function tries to convert the strings to an encoding accepted by the output system, and for a reason I cannot know, your system uses ASCII by default, hence the error.
How to fix:
The best way would be to make your system accept Unicode. On Linux or other Unixes, you can declare a UTF-8 charset in the LANG environment variable (export LANG=en_US.UTF-8); on Windows you can try chcp 65001, though the latter is far from guaranteed to work.
If that does not work, or does not meet your needs, you can force an explicit encoding, or more exactly filter out the offending characters, because Python 3 natively uses unicode strings.
I would use:
import sys
def u_filter(s, encoding = sys.stdout.encoding):
    return (s.encode(encoding, errors='replace').decode(encoding)
            if isinstance(s, str) else s)
That means: if s is a unicode string encode it in the encoding used for stdout, replacing any non convertible character by a replacement char, and decode it back into a now clean string
and next:
def fprint(*args, **kwargs):
    fargs = [ u_filter(arg) for arg in args ]
    print(*fargs, **kwargs)
means: filter out any offending character from unicode strings and print the remaining unchanged.
With that you can safely replace your print throwing the exception with:
fprint(href, title, title2)
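To see what the filter does when the output encoding is ASCII, here is a small sketch with the encoding hard-coded. The name ascii_safe and the sample text are illustrative, not part of the answer above.

```python
def ascii_safe(text):
    """Replace every character that ASCII cannot represent with '?'."""
    return text.encode("ascii", errors="replace").decode("ascii")

# Two Chinese characters followed by plain ASCII text.
cleaned = ascii_safe("\u4ef7\u683c price")
```

The result can always be printed on an ASCII-only terminal, at the cost of losing the original characters.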

UnicodeEncodeError when writing to file

I have a python script that works great on my local machine (OS X), but when I copied it to a server (Debian), it does not work as expected. The script reads an xml file and prints the contents in a new format. On my local machine, I can run the script with stdout to the terminal or to a file (i.e. > myFile.txt), and both work fine.
However, on the server (over ssh), printing to the terminal works fine, but printing to the file (which is what I really need) gives UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128). All files are in utf-8 encoding, and utf-8 is declared in the magic comment.
If I print the str objects inside a list (which is a trick I usually use to get a handle on encoding issues), it also throws the same error.
If I use print( x.encode('utf-8') ), then it prints code-style bits (e.g. b'1' b'\xd0\x9a\xd0\xb0\xd0\xbc\xd0\xb0').
If I $ export PYTHONIOENCODING=utf-8 in the shell (as suggested in some SO posts), then I get a binary file: 1 <D0><9A><D0><B0><D0><BC><D0><B0>.
I have checked all of the locale variables and the relevant ones match what I have on my local machine.
I can simply process the file locally and upload it, but I really want to understand what is happening here. Since the python code is working on one computer, I am not sure that it is relevant, but I am adding it below:
# -*- encoding: utf-8 -*-
import sys, xml.etree.ElementTree as ET
corpus = ET.parse('file.xml')
text = corpus.getroot()
for body in text :
    for sent in body :
        depDOMs = [(0,'') for i in range(len(sent)+1)]
        for word in sent :
            if word.tag == 'LF' :
                pass
            elif 'ID' in word.attrib and 'FEAT' in word.attrib and 'DOM' in word.attrib :
                ID = word.attrib['ID']
                try :
                    Form = word.text.replace(' ','_')
                except AttributeError :
                    Form = '_'
                try :
                    Lemma = word.attrib['LEMMA'].replace(' ', '_')
                except KeyError :
                    Lemma = '*NULL*'
                CPOS = word.attrib['FEAT'].split()[0]
                POS = word.attrib['FEAT'].replace( ' ' , '_' )
                Feats = '_'
                Head = word.attrib['DOM']
                if Head == '_root' :
                    Head = '0'
                try :
                    DepRel = word.attrib['LINK']
                except KeyError :
                    DepRel = 'ROOT'
                PHead = '_'
                PDepRel = '_'
                try:
                    if word.attrib['NODETYPE'] == 'FANTOM' :
                        word.attrib['LEMMA'] = '*'+word.attrib['LEMMA']+'*'
                except KeyError :
                    pass
                print( ID , Form , Lemma , Feats, CPOS , POS , Head , DepRel , PHead , PDepRel , sep='\t' )
            else :
                print( 'WARNING: what is this?',sent.attrib['ID'],word.attrib)
        print()
The underlying issue may be a misconfiguration of Linux's locales, which makes Python overly cautious when printing non-ASCII characters.
Check the locale configuration with the locale command. If there's a problem, you'll see something like:
$ locale
locale: Cannot set LC_CTYPE to default locale: No such file or directory
locale: Cannot set LC_ALL to default locale: No such file or directory
LANG=en_US.UTF-8
LANGUAGE=
Fix this with:
$ sudo locale-gen "en_US.UTF-8"
(replace "en_US.UTF-8" with the locale that's not working). For further info, see: https://askubuntu.com/questions/162391/how-do-i-fix-my-locale-issue
You can find important information related to the error you are experiencing in the attributes of the UnicodeError based exception.
Quoting the documentation:
UnicodeError has attributes that describe the encoding or decoding
error. For example, err.object[err.start:err.end] gives the particular
invalid input that the codec failed on.
encoding
The name of the encoding that raised the error.
reason
A string describing the specific codec error.
object
The object the codec was attempting to encode or decode.
start
The first index of invalid data in object.
end
The index after the last invalid data in object.
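A short sketch of reading those attributes follows. The helper name describe_unicode_error and the sample string (the word from the question's byte dump plus a tab and a digit) are illustrative assumptions.

```python
def describe_unicode_error(text, encoding="ascii"):
    """Return details of the first encoding failure, or None if none occurs."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError as err:
        return {
            "encoding": err.encoding,
            "reason": err.reason,
            "start": err.start,
            "end": err.end,
            # The exact slice of the input the codec could not handle.
            "bad_input": err.object[err.start:err.end],
        }
    return None

details = describe_unicode_error("\u041a\u0430\u043c\u0430\t1")
```

Logging err.encoding in particular shows which codec the output stream actually picked, which is the key question when the same script behaves differently on two machines.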

UnicodeEncodeError only with str(text) in Python

I'm reading a utf-8 encoded file. When I print the text directly, everything is fine. When I print the text from a class using msg.__str__(), it works too.
But I really don't know how to print it with just str(msg), because that always raises the error "'ascii' codec can't encode character u'\xe4' in position 10: ordinal not in range(128)" if the text contains an umlaut.
Example Code:
#!/usr/bin/env python
# encoding: utf-8
import codecs
from TempClass import TempClass
file = codecs.open("person.txt", encoding="utf-8")
message = file.read() #I am Mr. Händler.
#works
print message
msg = TempClass(message)
#works
print msg.__str__()
#works
print msg.get_string()
#error
print str(msg)
And the class:
class TempClass(object):
    def __init__(self, text):
        self.text = text
    def get_string(self):
        return self.text
    def __str__(self):
        return self.text
I tried to decode and encode the text in several ways but nothing works for me.
Help? :)
Edit: I am using Python 2.7.9
This happens because message (and msg.text) are unicode objects, not str. str() expects __str__ to return a str, so you need to encode the text as UTF-8 again. Your __str__ method should look like:
def __str__(self):
    return self.text.encode('utf-8')
unicode can be implicitly encoded to str if it contains only ASCII characters, which is why you only see the error when the input contains an umlaut.
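The implicit step can be reproduced by hand. The sketch below encodes the sample sentence from the question with the ascii codec, which is roughly what Python 2's str() does with a unicode return value, and then with UTF-8; the variable names are illustrative.

```python
text = u"I am Mr. H\xe4ndler."

try:
    # Roughly the implicit conversion that fails inside str(msg).
    text.encode("ascii")
    failed_at = None
except UnicodeEncodeError as err:
    # Index of the first character the ascii codec rejected.
    failed_at = err.start

# Encoding explicitly as UTF-8 succeeds and yields a byte string.
utf8_bytes = text.encode("utf-8")

# Pure ASCII text encodes without error, so the problem stays hidden.
ascii_only = u"I am Mr. Smith.".encode("ascii")
```

The failing index corresponds to the position reported in the error message from the question.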
