UnicodeEncodeError when writing to file - python

I have a python script that works great on my local machine (OS X), but when I copied it to a server (Debian), it does not work as expected. The script reads an xml file and prints the contents in a new format. On my local machine, I can run the script with stdout to the terminal or to a file (i.e. > myFile.txt), and both work fine.
However, on the server (over ssh), printing to the terminal works fine, but printing to a file (which is what I really need) raises UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128). All files are UTF-8 encoded, and utf-8 is declared in the magic comment.
If I print the str objects inside a list (which is a trick I usually use to get a handle on encoding issues), it also throws the same error.
If I use print( x.encode('utf-8') ), then it prints bytes literals instead of text (e.g. b'1' b'\xd0\x9a\xd0\xb0\xd0\xbc\xd0\xb0').
If I $ export PYTHONIOENCODING=utf-8 in the shell (as suggested in some SO posts), then I get a binary file: 1 <D0><9A><D0><B0><D0><BC><D0><B0>.
I have checked all of the locale variables and the relevant ones match what I have on my local machine.
I can simply process the file locally and upload it, but I really want to understand what is happening here. Since the python code is working on one computer, I am not sure that it is relevant, but I am adding it below:
# -*- encoding: utf-8 -*-
import sys, xml.etree.ElementTree as ET
corpus = ET.parse('file.xml')
text = corpus.getroot()
for body in text :
    for sent in body :
        depDOMs = [(0,'') for i in range(len(sent)+1)]
        for word in sent :
            if word.tag == 'LF' :
                pass
            elif 'ID' in word.attrib and 'FEAT' in word.attrib and 'DOM' in word.attrib :
                ID = word.attrib['ID']
                try :
                    Form = word.text.replace(' ','_')
                except AttributeError :
                    Form = '_'
                try :
                    Lemma = word.attrib['LEMMA'].replace(' ', '_')
                except KeyError :
                    Lemma = '*NULL*'
                CPOS = word.attrib['FEAT'].split()[0]
                POS = word.attrib['FEAT'].replace( ' ' , '_' )
                Feats = '_'
                Head = word.attrib['DOM']
                if Head == '_root' :
                    Head = '0'
                try :
                    DepRel = word.attrib['LINK']
                except KeyError :
                    DepRel = 'ROOT'
                PHead = '_'
                PDepRel = '_'
                try:
                    if word.attrib['NODETYPE'] == 'FANTOM' :
                        word.attrib['LEMMA'] = '*'+word.attrib['LEMMA']+'*'
                except KeyError :
                    pass
                print( ID , Form , Lemma , Feats, CPOS , POS , Head , DepRel , PHead , PDepRel , sep='\t' )
            else :
                print( 'WARNING: what is this?',sent.attrib['ID'],word.attrib)
        print()

The underlying issue may be caused by a misconfiguration of the Linux locale, in which case Python is overly cautious and refuses to print non-ASCII characters.
Check the locale configuration by running locale. If there's a problem, you'll see something like:
$ locale
locale: Cannot set LC_CTYPE to default locale: No such file or directory
locale: Cannot set LC_ALL to default locale: No such file or directory
LANG=en_US.UTF-8
LANGUAGE=
Fix this with:
$ sudo locale-gen "en_US.UTF-8"
(replace "en_US.UTF-8" with the locale that's not working). For further info, see: https://askubuntu.com/questions/162391/how-do-i-fix-my-locale-issue
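To see what Python itself has concluded from the locale, it can help to print the encodings it picked for the current process. This is only a minimal diagnostic sketch; the helper name describe_encodings is made up for illustration. Running it once with output to the terminal and once redirected to a file shows whether the two cases differ on the server.

```python
import locale
import sys


def describe_encodings():
    """Collect the encodings Python has picked for this process."""
    return {
        # Encoding used by print(); may differ between a terminal and a redirect.
        "stdout": getattr(sys.stdout, "encoding", None),
        # Encoding that open() uses when no encoding= argument is given.
        "preferred": locale.getpreferredencoding(False),
        # Encoding used to convert file names to and from bytes.
        "filesystem": sys.getfilesystemencoding(),
    }


if __name__ == "__main__":
    for name, value in describe_encodings().items():
        print(name, value, sep="\t")
```

If "stdout" reports an ASCII variant (for example ANSI_X3.4-1968) when redirected, the locale is the likely culprit.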

You can find important information about the error you are experiencing in the attributes of the UnicodeError-based exception.
Quoting the documentation:
UnicodeError has attributes that describe the encoding or decoding
error. For example, err.object[err.start:err.end] gives the particular
invalid input that the codec failed on.
encoding: The name of the encoding that raised the error.
reason: A string describing the specific codec error.
object: The object the codec was attempting to encode or decode.
start: The first index of invalid data in object.
end: The index after the last invalid data in object.
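As a small sketch of how these attributes can be inspected, the snippet below forces an ASCII encoding failure on a string similar to the one in the question. The helper name explain_encode_failure and the sample text are illustrative only.

```python
def explain_encode_failure(text, encoding="ascii"):
    """Return the UnicodeError attributes for `text`, or None if it encodes fine."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError as err:
        return {
            "encoding": err.encoding,
            "reason": err.reason,
            "start": err.start,
            "end": err.end,
            # The exact slice of the input the codec could not handle.
            "bad": err.object[err.start:err.end],
        }
    return None


# "1", a tab, then four Cyrillic letters (the bytes shown in the question).
info = explain_encode_failure("1\t\u041a\u0430\u043c\u0430")
print(info)
```

Catching the exception around the failing print call and logging these fields shows which characters and which codec are involved.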

Related

Python Camelot PDF - UnicodeEncodeError when using Stream flavor, on Windows

Python 3.7 on Windows 10. Camelot 0.8.2
I'm using the following code to convert a pdf file to HTML:
import camelot
import os
def CustomScript(args):
    # raw string, so that "\a" is not interpreted as an escape sequence
    path_to_pdf = r"C:\PDFfolder\abc.pdf"
    folder_to_pdf = os.path.dirname(path_to_pdf)
    tables = camelot.read_pdf(os.path.normpath(path_to_pdf), flavor='stream', pages='1-end')
    tables.export(os.path.normpath(os.path.join(folder_to_pdf,"temp","foo.html")), f='html')
    return CustomScriptReturn.Empty();
I receive the following error at the tables.export line:
UnicodeEncodeError: 'charmap' codec can't encode character '\u2010' in position y: character maps to <undefined>
This code runs without issue on Mac. This error seems to pertain to Windows, which is the environment I will need to run this on.
I have now spent two entire days researching this error ad nauseam. I have tried many of the solutions offered here on Stack Overflow in the several posts related to this, but the error persists. The problem with the lines of code suggested in those solutions is that they are all arguments to vanilla Python methods, and these arguments are not available in Camelot's export method.
EDIT 1: Updated post to specify which line is throwing the error.
EDIT 2: PDF file used: http://tsbde.texas.gov/78i8ljhbj/Fiscal-Year-2014-Disciplinary-Actions.pdf
EDIT 3: Here is the full Traceback from Windows console:
Traceback (most recent call last):
  File "main.py", line 18, in <module>
    tables.export(os.path.normpath(os.path.join(folder_to_pdf, "foo.html")), f='html')
  File "C:\Users\stpete\AppData\Local\Programs\Python\Python37\lib\site-packages\camelot\core.py", line 737, in export
    self._write_file(f=f, **kwargs)
  File "C:\Users\stpete\AppData\Local\Programs\Python\Python37\lib\site-packages\camelot\core.py", line 699, in _write_file
    to_format(filepath)
  File "C:\Users\stpete\AppData\Local\Programs\Python\Python37\lib\site-packages\camelot\core.py", line 636, in to_html
    f.write(html_string)
  File "C:\Users\stpete\AppData\Local\Programs\Python\Python37\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2010' in position 5737: character maps to <undefined>
The problem you are facing is related to the method camelot.core.Table.to_html:
def to_html(self, path, **kwargs):
    """Writes Table to an HTML file.
    For kwargs, check :meth:`pandas.DataFrame.to_html`.
    Parameters
    ----------
    path : str
        Output filepath.
    """
    html_string = self.df.to_html(**kwargs)
    with open(path, "w") as f:
        f.write(html_string)
Here, the file is opened without an explicit encoding, so on Windows it falls back to the platform default (cp1252 in your traceback) instead of UTF-8.
This is my solution, which uses a monkey patch to replace the original camelot method:
import camelot
import os
# here I define the corrected method
def to_html(self, path, **kwargs):
    """Writes Table to an HTML file.
    For kwargs, check :meth:`pandas.DataFrame.to_html`.
    Parameters
    ----------
    path : str
        Output filepath.
    """
    html_string = self.df.to_html(**kwargs)
    with open(path, "w", encoding="utf-8") as f:
        f.write(html_string)
# monkey patch: I replace the original method with the corrected one
camelot.core.Table.to_html=to_html
def CustomScript(args):
    # raw string, so that "\a" is not interpreted as an escape sequence
    path_to_pdf = r"C:\PDFfolder\abc.pdf"
    folder_to_pdf = os.path.dirname(path_to_pdf)
    tables = camelot.read_pdf(os.path.normpath(path_to_pdf), flavor='stream', pages='1-end')
    tables.export(os.path.normpath(os.path.join(folder_to_pdf,"temp","foo.html")), f='html')
    return CustomScriptReturn.Empty();
I tested this solution and it works for Python 3.7, Windows 10, Camelot 0.8.2.
You're getting a UnicodeEncodeError, which in this case means that the output to be written to the file contains a character that cannot be encoded in the default encoding for your platform, cp1252.
camelot does not seem to offer a way to set the encoding when writing an HTML file.
A workaround might be to set the PYTHONIOENCODING environment variable to "UTF-8" when running your program:
C:\> set PYTHONIOENCODING=UTF-8 && python myprog.py
to try to force UTF-8 output. Note, however, that PYTHONIOENCODING only controls the encoding of stdin, stdout and stderr; files opened with open() use the locale's preferred encoding, so this may not change what camelot writes.
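The failure in the traceback can be reproduced without camelot. The sketch below writes a string containing U+2010 (the character named in the error) once with cp1252 and once with UTF-8. The sample text and the helper name try_write are invented for illustration.

```python
import os
import tempfile

# Sample text containing U+2010 HYPHEN, the character from the traceback.
text = "Fiscal\u2010Year"


def try_write(path, encoding):
    """Write `text` to `path` using `encoding`; return True on success."""
    try:
        with open(path, "w", encoding=encoding) as f:
            f.write(text)
    except UnicodeEncodeError:
        return False
    return True


with tempfile.TemporaryDirectory() as tmp:
    cp1252_ok = try_write(os.path.join(tmp, "a.html"), "cp1252")
    utf8_ok = try_write(os.path.join(tmp, "b.html"), "utf-8")

print(cp1252_ok, utf8_ok)
```

cp1252 has no mapping for U+2010, so only the UTF-8 write succeeds, which is why passing encoding="utf-8" to open() in the patched method makes the export work.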

How to give windows directory path in os.listdir(path)?

When I pass a Windows directory path to os.listdir(), I get an error.
My code snippet:
with os.listdir('C:\Users\Hp\Desktop\video') as entries:
I know that python takes '\' as an escape sequence but I cannot find any alternative on windows. The error given out is:
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX
I've tried the solution below, but it gives me a different error:
AttributeError: __enter__
Is there any problem with my code:
import os
import moviepy.editor as mp
#location = os.path.join("C:", "Users", "Hp", "Desktop", "video")
with os.listdir("C:\\Users\\Hp\\Desktop\\video") as entries:
    for entry in entries:
        if(".py" or ".png") not in entry:
            video = mp.VideoFileClip("entry.name")
            logo = (mp.ImageClip("logo.png")
                    .set_duration(video.duration)
                    .resize(height=50) # if you need to resize...
                    .margin(right=8, top=8, opacity=0) # (optional) logo-border padding
                    .set_pos(("right","top")))
            final = mp.CompositeVideoClip([video, logo])
            final.write_videofile('o' + "entry.name")
Either use raw strings which ignore the backslash as an escape character
with os.listdir(r'C:\Users\Hp\Desktop\video') as entries:
Or use a literal backslash (an escaped backslash)
with os.listdir('C:\\Users\\Hp\\Desktop\\video') as entries:
Or just use forward slashes. They work all over in Windows.
with os.listdir('C:/Users/Hp/Desktop/video') as entries:
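A quick way to convince yourself that the three spellings above describe the same path is to compare them. This sketch uses pathlib.PureWindowsPath so it can be run on any operating system; the variable names are only illustrative.

```python
from pathlib import PureWindowsPath

raw = r'C:\Users\Hp\Desktop\video'         # raw string: backslashes taken literally
escaped = 'C:\\Users\\Hp\\Desktop\\video'  # each backslash escaped
forward = 'C:/Users/Hp/Desktop/video'      # forward slashes

# The raw and escaped literals are the same string once parsed.
same_text = raw == escaped
# As Windows paths, the forward-slash form is equivalent too.
same_path = PureWindowsPath(forward) == PureWindowsPath(raw)

print(same_text, same_path)
```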
@Adam Smith's answer is correct, but I'd like to point out a mistake. os.listdir doesn't return a context manager (just a normal list), so using it with the with keyword doesn't work. Call the function normally:
entries = os.listdir(r'C:\Users\Hp\Desktop\video')
and you won't get the AttributeError: __enter__ error.
The with keyword is an automated way to call obj.__enter__() before the block and obj.__exit__() after it. If the object (here a list returned from os.listdir) doesn't have those methods, you'll get an error.
https://docs.python.org/3/library/os.html#os.listdir
https://docs.python.org/3/reference/datamodel.html#object.__enter__
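If the with block and the entry.name attribute in the question came from an os.scandir example, that would explain the mix-up: os.scandir (unlike os.listdir) returns an iterator that works as a context manager and yields entries with a .name attribute. A minimal sketch, using a temporary folder and made-up file names in place of the real video directory:

```python
import os
import tempfile


def list_names(folder):
    """Return the sorted names of the regular files in `folder`."""
    # os.scandir supports `with` (Python 3.6+); os.listdir does not.
    with os.scandir(folder) as entries:
        return sorted(entry.name for entry in entries if entry.is_file())


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical files standing in for the contents of the video folder.
    for name in ("a.mp4", "logo.png"):
        open(os.path.join(tmp, name), "w").close()
    names = list_names(tmp)
    listdir_has_enter = hasattr(os.listdir(tmp), "__enter__")

print(names, listdir_has_enter)
```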
import os
import moviepy.editor as mp
path="C:\\Users\\Hp\\Desktop\\video"
entries=os.listdir(path)
for entry in entries:
    # entries are plain file names (str); skip scripts and images by extension
    if not entry.endswith((".py", ".png")):
        # build the full path from the folder and the file name
        video = mp.VideoFileClip(os.path.join(path, entry))
        logo = (mp.ImageClip("logo.png")
                .set_duration(video.duration)
                .resize(height=50) # if you need to resize...
                .margin(right=8, top=8, opacity=0) # (optional) logo-border padding
                .set_pos(("right","top")))
        final = mp.CompositeVideoClip([video, logo])
        final.write_videofile('o' + entry)
It is better to use the pathlib library, which handles such issues very well.
from pathlib import Path
p = Path(r'C:\Users\Hp\Desktop\video') # raw string; forward slashes work too
for file in p.iterdir():
    # Every 'file' is a 'Path' object with the full path
    file.name # Returns the file name including the extension
    file.suffix # Returns extension. e.g. '.jpg'
    str(file) # Returns path as python string
See the pathlib documentation to know more.

nltk NERTagger UnicodeDecodeError in python

I am writing a program in Python 2.7.6 that uses nltk with the Stanford named entity tagger on Windows 7 Professional to tag a text and print the result as follows:
import re
from nltk.tag.stanford import NERTagger
WORD = re.compile(r'\w+')
st = NERTagger("./classifiers/english.all.3class.distsim.crf.ser.gz", "stanford-ner.jar")
text = "title Wienfilm 1896-1976 (1976)"
words = WORD.findall(text )
print words
answer = st.tag(words )
print answer
The last print statement in the program is supposed to print a list of five tuples:
[(u'title', u'O'), (u'Wienfilm', u'O'), (u'1896', u'O'), (u'1976', u'O'), (u'1976', u'O')]
However when I run the program, it gives me the following error message:
['title', 'Wienfilm', '1896', '1976', '1976']
Traceback (most recent call last):
File "E:\Google Drive\myPyPrgs\testNLP.py", line 27, in <module>
answer = st.tag(words )
File "C:\Python27\lib\site-packages\nltk\tag\stanford.py", line 59, in tag
return self.tag_sents([tokens])[0]
File "C:\Python27\lib\site-packages\nltk\tag\stanford.py", line 82, in tag_sents
stanpos_output = stanpos_output.decode(encoding)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 23: ordinal not in
range(128)
Note that if I remove the number, '-1976' from the text string the program tags and prints the correct answer. But if the number '-1976' is within the text, I always have the above error.
In this forum, somebody suggested that I change the default encoding in nltk's stanford.py. I changed the default encoding in stanford.py from ascii to UTF-16 and replaced the last print statement of the above code with the following loop:
for i, word_pos in enumerate(answer):
    word, pos = word_pos
    print i , word.encode('utf-16'), pos.encode('utf-16')
I got the following incorrect output:
0 ÿþ ÿþtitle/O Wienfilm/O 1896 1976 1976/O
Any clues on how to deal with this issue? Thanks in advance.
This worked for me: specify the encoding argument as UTF-8 when you create NERTagger object
st = NERTagger("./classifiers/english.all.3class.distsim.crf.ser.gz", "stanford-ner.jar", encoding='utf-8')
Open a terminal (cmd) and run:
chcp
It should return something like:
active code page: 857
Then run:
chcp 1254
After that, add this line at the top of your .py script:
# -*- coding: cp1254 -*-
This should solve your problem. If it does not, paste these lines at the top of your script:
# -*-coding:utf-8-*-
import locale
locale.setlocale(locale.LC_ALL, '')
I had many decoding problems before, and these methods solved them.
ASCII covers only 2^7 = 128 characters, which is why you are getting that error, as the message "ordinal not in range(128)" indicates.
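The 128-character limit is easy to see in isolation. The sketch below is written in Python 3 syntax (the question itself uses Python 2), and the byte string is a made-up stand-in for the tagger output; only the 0xa0 byte is taken from the traceback.

```python
# Hypothetical tagger output; 0xa0 is the byte reported in the traceback.
raw = b"1896\xa01976"

try:
    raw.decode("ascii")
    ascii_error = None
except UnicodeDecodeError as err:
    # ASCII only defines bytes 0x00-0x7f, so 0xa0 is rejected.
    ascii_error = (err.start, err.reason)

# Latin-1 maps every byte to a character; 0xa0 becomes a no-break space.
as_latin1 = raw.decode("latin-1")

print(ascii_error, repr(as_latin1))
```

Whichever encoding is chosen, it has to match what the Java process actually emits, which is presumably why passing encoding='utf-8' to NERTagger (so that both sides agree) is the cleanest of the fixes listed here.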
At the top of your app add:
import sys
reload(sys)
sys.setdefaultencoding("utf-8")
I was dealing with the same problem and solved it by adding the encoding options in nltk's internals.py.
Open internals.py, which is saved at:
%YourPythonFolder%\Lib\site-packages\nltk\internals.py
Then go to the java function and add this line after the "# Construct the full command string" comment (about line 147):
cmd = cmd + ['-inputEncoding', 'utf-8', '-outputEncoding', 'utf-8']
That section of the code should then look like this:
# Construct the full command string.
cmd = list(cmd)
cmd = ['-cp', classpath] + cmd
cmd = [_java_bin] + _java_options + cmd
cmd = cmd + ['-inputEncoding', 'utf-8', '-outputEncoding', 'utf-8']
Hope it helps.

Django 1.4 - django.db.models.FileField.save(filename, file, save=True) produces error with non-ascii filename

I'm building a file upload feature using django.db.models.FileField in Django 1.4.
When I try to upload a file whose name includes non-ASCII characters, it produces the error below.
'ascii' codec can't encode characters in position 109-115: ordinal not
in range(128)
The actual code looks like this:
file = models.FileField(_("file"),
                        max_length=512,
                        upload_to=os.path.join('uploaded', 'files', '%Y', '%m', '%d'))
file.save(filename, file, save=True) # <- this line produces the error above if 'filename' includes a non-ASCII character
If I try to use unicode(filename, 'utf-8') instead of filename, it produces the error below:
TypeError: decoding Unicode is not supported
How can I upload a file whose name has non-ascii characters?
Info of my environment:
sys.getdefaultencoding() : 'ascii'
sys.getfilesystemencoding() : 'UTF-8'
using Django-1.4.10-py2.7.egg
You need to use .encode() to encode the string:
file.save(filename.encode('utf-8', 'ignore'), file, save=True)
In your FileField definition, the 'upload_to' argument could be written as os.path.join(u'uploaded', 'files', '%Y', '%m', '%d')
(note the u prefix on the first component, u'uploaded'), so that the joined path is of type unicode. This may help you.

ConfigParser with Unicode items

My troubles with ConfigParser continue. It seems it doesn't support Unicode very well. The config file is indeed saved as UTF-8, but when ConfigParser reads it, it seems to be encoded into something else. I assumed it was latin-1 and thought overriding optionxform could help:
-- configfile.cfg --
[rules]
Häjsan = 3
☃ = my snowman
-- myapp.py --
# -*- coding: utf-8 -*-
import ConfigParser
def _optionxform(s):
    try:
        newstr = s.decode('latin-1')
        newstr = newstr.encode('utf-8')
        return newstr
    except Exception, e:
        print e
cfg = ConfigParser.ConfigParser()
cfg.optionxform = _optionxform
cfg.read("myconfig")
Of course, when I read the config I get:
'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
I've tried a couple of different variations of decoding 's', but the point seems moot, since it really should be a unicode object from the beginning. After all, the config file is UTF-8. I have confirmed that something is wrong in the way ConfigParser reads the file by stubbing it out with this DummyConfig class. If I use that, everything is nice unicode, fine and dandy.
-- config.py --
# -*- coding: utf-8 -*-
apa = {'rules': [(u'Häjsan', 3), (u'☃', u'my snowman')]}
class DummyConfig(object):
    def sections(self):
        return apa.keys()
    def items(self, section):
        return apa[section]
    def add_section(self, apa):
        pass
    def set(self, *args):
        pass
Any ideas what could be causing this or suggestions of other config modules that supports Unicode better are most welcome. I don't want to use sys.setdefaultencoding()!
The ConfigParser.readfp() method takes a file object. Have you tried opening the file with the correct encoding using the codecs module before passing it to ConfigParser, like below?
import codecs
cfg.readfp(codecs.open("myconfig", "r", "utf8"))
In Python 3.2 and above, readfp() is deprecated (and it was removed in Python 3.12); use read_file() instead.
Python 3.2 also introduced an encoding parameter for read(), so it can now be used as:
cfg.read("myconfig", encoding='utf-8')
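For Python 3, a self-contained sketch of this approach might look as follows. It writes a config similar to the one in the question to a temporary file and reads it back with an explicit encoding. Setting optionxform to str is an addition of mine so that option names keep their case, and the temporary path is only for illustration.

```python
import configparser
import os
import tempfile

# Same content as the question's config: "H\u00e4jsan" is Häjsan, "\u2603" is the snowman.
content = "[rules]\nH\u00e4jsan = 3\n\u2603 = my snowman\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "myconfig")
    with open(path, "w", encoding="utf-8") as f:
        f.write(content)

    cfg = configparser.ConfigParser()
    cfg.optionxform = str  # keep option names as written instead of lowercasing them
    cfg.read(path, encoding="utf-8")

rules = dict(cfg.items("rules"))
print(rules)
```

In Python 3 the option names and values come back as str, so no manual decode or encode step is needed.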
Try overriding the write method of RawConfigParser like this:
from ConfigParser import RawConfigParser
class ConfigWithCoder(RawConfigParser):
    def write(self, fp):
        """Write an .ini-format representation of the configuration state."""
        if self._defaults:
            fp.write("[%s]\n" % "DEFAULT")
            for (key, value) in self._defaults.items():
                fp.write("%s = %s\n" % (key, str(value).replace('\n', '\n\t')))
            fp.write("\n")
        for section in self._sections:
            fp.write("[%s]\n" % section)
            for (key, value) in self._sections[section].items():
                if key == "__name__":
                    continue
                if (value is not None) or (self._optcre == self.OPTCRE):
                    # encode unicode values explicitly instead of relying on str()
                    if type(value) == unicode:
                        value = ''.join(value).encode('utf-8')
                    else:
                        value = str(value)
                    value = value.replace('\n', '\n\t')
                    key = " = ".join((key, value))
                fp.write("%s\n" % (key))
            fp.write("\n")
This seems to be a problem with the Python 2.x version of ConfigParser; the 3.x version does not have it. The corresponding issue in the Python bug tracker is closed as WONTFIX.
I fixed it by editing the ConfigParser.py file. In the write method (around line 412), change:
key = " = ".join((key, str(value).replace('\n', '\n\t')))
to
key = " = ".join((key, str(value).decode('utf-8').replace('\n', '\n\t')))
I don't know if it's a real solution, but I tested it on Windows 7 and Ubuntu 15.04 and it works like a charm; I can share and work with the same .ini file on both systems.
What I did is just:
file_name = file_name.decode("utf-8")
cfg.read(file_name)
