Python Camelot PDF - UnicodeEncodeError when using Stream flavor, on Windows - python

Python 3.7 on Windows 10. Camelot 0.8.2
I'm using the following code to convert a pdf file to HTML:
import camelot
import os
def CustomScript(args):
path_to_pdf = "C:\PDFfolder\abc.pdf"
folder_to_pdf = os.path.dirname(path_to_pdf)
tables = camelot.read_pdf(os.path.normpath(path_to_pdf), flavor='stream', pages='1-end')
tables.export(os.path.normpath(os.path.join(folder_to_pdf,"temp","foo.html")), f='html')
return CustomScriptReturn.Empty();
I receive the following error at the tables.export line:
"UnicodeEncodeError -'charmap' codec can't encode character '\u2010'
in position y: character maps to undefined.
This code runs without issue on Mac. This error seems to pertain to Windows, which is the environment I will need to run this on.
I have now spent two entire days researching this error ad nauseum - I have tried many of the solutions offered here on Stack Overflow from the several posts related to this. The error persists. The problem with adding the lines of code suggested in all the solutions is that they're all arguments to be added to vanilla Python methods. These arguments are not available to the Camelot's export method.
EDIT 1: Updated post to specify which line is throwing the error.
EDIT 2: PDF file used: http://tsbde.texas.gov/78i8ljhbj/Fiscal-Year-2014-Disciplinary-Actions.pdf
EDIT 3: Here is the full Traceback from Windows console:
> Traceback (most recent call last): File "main.py", line 18, in
> <module>
> tables.export(os.path.normpath(os.path.join(folder_to_pdf, "foo.html")), f='html') File
> "C:\Users\stpete\AppData\Local\Programs\Python\Python37\lib\site-packages\camelot\core.py",
> line 737, in export
> self._write_file(f=f, **kwargs) File "C:\Users\stpete\AppData\Local\Programs\Python\Python37\lib\site-packages\camelot\core.py",
> line 699, in _write_file
> to_format(filepath) File "C:\Users\stpete\AppData\Local\Programs\Python\Python37\lib\site-packages\camelot\core.py",
> line 636, in to_html
> f.write(html_string) File "C:\Users\stpete\AppData\Local\Programs\Python\Python37\lib\encodings\cp1252.py",
> line 19, in encode
> return codecs.charmap_encode(input,self.errors,encoding_table)[0] UnicodeEncodeError: 'charmap' codec can't encode character '\u2010' in
> position 5737: character maps to <undefined>

The problem you are facing is related to the method camelot.core.Table.to_html:
def to_html(self, path, **kwargs):
"""Writes Table to an HTML file.
For kwargs, check :meth:`pandas.DataFrame.to_html`.
Parameters
----------
path : str
Output filepath.
"""
html_string = self.df.to_html(**kwargs)
with open(path, "w") as f:
f.write(html_string)
Here, the file to be written should be opened with UTF-8 encoding and it is not.
This is my solution, which uses a monkey patch to replace original camelot method:
import camelot
import os
# here I define the corrected method
def to_html(self, path, **kwargs):
"""Writes Table to an HTML file.
For kwargs, check :meth:`pandas.DataFrame.to_html`.
Parameters
----------
path : str
Output filepath.
"""
html_string = self.df.to_html(**kwargs)
with open(path, "w", encoding="utf-8") as f:
f.write(html_string)
# monkey patch: I replace the original method with the corrected one
camelot.core.Table.to_html=to_html
def CustomScript(args):
path_to_pdf = "C:\PDFfolder\abc.pdf"
folder_to_pdf = os.path.dirname(path_to_pdf)
tables = camelot.read_pdf(os.path.normpath(path_to_pdf), flavor='stream', pages='1-end')
tables.export(os.path.normpath(os.path.join(folder_to_pdf,"temp","foo.html")), f='html')
return CustomScriptReturn.Empty();
I tested this solution and it works for Python 3.7, Windows 10, Camelot 0.8.2.

You're getting UnicodeEncodeError, which in this case means that the output to be written to file contains a character than cannot be encoded in the default encoding for your platform, cp1252.
camelot does not seem to handle setting an encoding when writing to an html file.
A workaround might be to set the PYTHONIOENCODING environment variable to "UTF-8" when running your program:
C:\> set PYTHONIOENCODING=UTF-8 && python myprog.py
to force outputting the file(s) with UTF-8 encoding.

Related

Opening a text file, and receiving a encoding error, tried multiple methods no hope

I'm trying to open up a password database file (consists of a bunch of common passwords) and I'm getting the following error:
Attempts so far..
Code:
f = open("crackstation-human-only.txt", 'r')
for i in f:
print(i)
Error Code:
Traceback (most recent call last):
File "C:\Users\David\eclipse-workspace\Kaplin\password_cracker.py", line 3, in <module>
for i in f:
File "C:\Users\David\AppData\Local\Programs\Python\Python37\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 753: character maps to <undefined>
After doing some research I was told to attempt encoding = 'utf-8' which I later discovered was basically guessing and hoping that the file would show all the outputs
Code:
f = open("crackstation-human-only.txt", 'r', encoding = 'utf-8')
for i in f:
print(i)
Error:
Traceback (most recent call last):
File "C:\Users\David\eclipse-workspace\Kaplin\password_cracker.py", line 3, in <module>
for i in f:
File "C:\Users\David\AppData\Local\Programs\Python\Python37\lib\codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 5884: invalid continuation byte
After receiving this error message, I was recommended to attempt to download a text editor like 'Sublime Text 3', and to open the console end enter the command 'Encoding()', but unfortunately it wasn't able to detect the encoding.
My professor was able to use bash to 'grep cat' the lines in the file (I honestly know very little about bash so if anyone else knows those terms i'm not sure if that will help them out)
If anyone has any suggestions on what I can do in order to get this to work out I would greatly appreciate it.
I will post the link to the text document if anyone is interested in seeing what types of characters are within the file.
Link to the file, it's a .txt from my school/professors domain
UPDATE:
I have a fellow classmate that is running elementary OS, and he was using the terminal to write his python program which would iterate through the file, and he was using the encoding 'latin-1', he was able to output more characters than me, I'm on Windows 10, using Eclipse-atom for all my scripts.
So there seems to be something that's causing me possibly not to get the correct outputs based on these factors, i'm guessing because it just seems that way based on the results,
I will be installing elementary-os and attempting all the solutions there, to see if I can get this file to work out. I'll add another update soon!
Faced a similar problem a while ago, and more often I've found that setting
encoding = 'raw_unicode_escape'
has worked for me
For your particular case, I tried all Python 3 supported encoding types and found
raw_unicode_escape
mbcs
palmos
Try either of the above to read your file
f = open("crackstation-human-only.txt", 'r', encoding = 'mbcs')
For more information on encodings, refer https://docs.python.org/2.4/lib/standard-encodings.html
Hope this helps.
re:
With the link above i made a list of encoding formats to try on your file. I hadn't saved my previous work, which was more in detail, but this code should do the same. I re-ran it now as follows:
enc_list = ['big5big5-tw,',
'cp037IBM037,',
'cp437437,',
'cp737Greek',
'cp850850,',
'cp855855,',
'cp857857,',
'cp861861,',
'cp863863,',
'cp865865,',
'cp869869,',
'cp875Greek',
'cp949949,',
'cp1006Urdu',
'cp1140ibm1140Western',
'cp1251windows-1251Bulgarian,',
'cp1253windows-1253Greek',
'cp1255windows-1255Hebrew',
'cp1257windows-1257Baltic',
'euc_jpeucjp,',
'euc_jisx0213eucjisx0213Japanese',
'gb2312chinese,',
'gb18030gb18030-2000Unified',
'iso2022_jpcsiso2022jp,',
'iso2022_jp_2iso2022jp-2,',
'iso2022_jp_3iso2022jp-3,',
'iso2022_krcsiso2022kr,',
'iso8859_2iso-8859-2,',
'iso8859_4iso-8859-4,',
'iso8859_6iso-8859-6,',
'iso8859_8iso-8859-8,',
'iso8859_10iso-8859-10',
'iso8859_14iso-8859-14,',
'johabcp1361,',
'koi8_uUkrainian',
'mac_greekmacgreekGreek',
'mac_latin2maclatin2,',
'mac_turkishmacturkishTurkish',
'shift_jiscsshiftjis,',
'shift_jisx0213shiftjisx0213,',
'utf_16_beUTF-16BEall',
'utf_16_le',
'utf_7',
'utf_8',
'base64_codec',
'bz2_codec',
'hex_codec',
'idna',
'mbcs',
'palmos',
'punycode',
'quopri_codec',
'raw_unicode_escape',
'rot_13',
'string_escape',
'undefined',
'unicode_escape',
'unicode_internal',
'uu_codec',
'zlib_codec'
]
for encode in enc_list:
try:
with open(r"crackstation-human-only.txt", encoding=encode) as f:
temp = len(f.read())
except:
enc_list.remove(encode)
print(enc_list)
Run this code on your machine and you'll get a list of encodings you can try on your file. The output i received was
['cp037IBM037,', 'cp737Greek', 'cp855855,', 'cp861861,', 'cp865865,', 'cp875Greek', 'cp1006Urdu', 'cp1251windows-1251Bulgarian,', 'cp1255windows-1255Hebrew', 'euc_jpeucjp,', 'gb2312chinese,', 'iso2022_jpcsiso2022jp,', 'iso2022_jp_3iso2022jp-3,', 'iso8859_2iso-8859-2,', 'iso8859_6iso-8859-6,', 'iso8859_10iso-8859-10', 'johabcp1361,', 'mac_greekmacgreekGreek', 'mac_turkishmacturkishTurkish', 'shift_jisx0213shiftjisx0213,', 'utf_16_le', 'utf_8', 'bz2_codec', 'idna', 'mbcs', 'palmos', 'quopri_codec', 'raw_unicode_escape', 'string_escape', 'unicode_escape', 'uu_codec']
You do have some interesting characters in there. Even though your code does work for me, I'd suggest using a try/except block to catch the lines your system can't handle and skip them:
with open("crackstation-human-only.txt", 'r') as f:
for i in f:
try:
print(i)
except UnicodeDecodeError:
continue
Alternatively, try using open with
the binary read mode 'rb' instead of 'r'
the errors='replace' argument, but that will not do what you want.
see the open documentation

How to find string and return it to stdout in Python

I am getting familiar with Python & am struggling to do the below with BeautifulSoup, Python.
What is expected:
*If the output of the script below contains the string 5378, it should email me with the line the string appears.
#! /usr/bin/env python
from bs4 import BeautifulSoup
from lxml import html
import urllib2,re
import codecs
import sys
streamWriter = codecs.lookup('utf-8')[-1]
sys.stdout = streamWriter(sys.stdout)
BASE_URL = "http://outlet.us.dell.com/ARBOnlineSales/Online/InventorySearch.aspx?c=us&cs=22&l=en&s=dfh&brandid=2201&fid=111162"
webpage = urllib2.urlopen(BASE_URL)
soup = BeautifulSoup(webpage.read(), "lxml")
findcolumn = soup.find("div", {"id": "itemheader-FN"})
name = findcolumn.text.strip()
print name
I tried using findall(5378, name), but it returns to empty braces like this [].
I am struggling with Unicode issues if I am trying to use it along with grep.
$ python dell.py | grep 5378
Traceback (most recent call last):
File "dell.py", line 18, in <module>
print name
UnicodeEncodeError: 'ascii' codec can't encode character u'\u201d' in position 817: ordinal not in range(128)
Can someone tell me what am I doing wrong in both cases?
The function findall (from the re module) expects the first parameter to be a a regular expression, which is a string, but you provided an integer. Try this instead:
re.findall("5378", name)
When printed this will output [u'5378'] when it found something or [] when it didn't.
I suspect you want to retrieve the product name from the number, which means you have to iterate through elements in findcolumn. We can use re.search() here to check for a single match within the element's texts.
for input_element in findcolumn.find_all("div"):
name = unicode(input_element.text.strip())
if re.search("5378", name) != None:
print unicode(name)
As for the unicode error, there are a bunch of solutions, depending on your operating system and configuration: Reconfigure your system locale on Ubuntu or Encode your script output with .encode()/unicode().

Encode error scraping

Scraping site with chineese simbols .
How do i scrap chineese simbolse ??
from urllib.request import urlopen
from urllib.parse import urljoin
from lxml.html import fromstring
URL = 'http://list.suning.com/0-258003-0.html'
ITEM_PATH = '.clearfix .product .border-out .border-in .wrap .res-info .sell-point'
def parse_items():
f = urlopen(URL)
list_html = f.read().decode('utf-8')
list_doc = fromstring(list_html)
for elem in list_doc.cssselect(ITEM_PATH):
a = elem.cssselect('a')[0]
href = a.get('href')
title = a.text
em = elem.cssselect('em')[0]
title2 = em.text
print(href, title, title2)
def main():
parse_items()
if __name__ == '__main__':
main()
Error looks like this.
Error looks like this
Error looks like this
Error looks like this
Error looks like this
http://product.suning.com/0000000000/146422477.html Traceback (most recent call last):
File "parser.py", line 27, in <module>
main()
File "parser.py", line 24, in main
parse_items()
File "parser.py", line 20, in parse_items
print(href, title, title2)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
From the print syntax and the imports, I assume that you use a Python3 version, since it can matter for unicode.
So, we can expect that href, title and title2 are all unicode strings (or Python 3 strings). But the print function will try to convert the strings to an encoding acceptable by the output system - for a reason I cannot know, you system uses by default ASCII, so the error.
How to fix:
the best way would be to make your system accept unicode. On Linux or other unixes, you can declare an UTF8 charset in LANG environment variable (export LANG=en_US.UTF-8), on Windows you can try chcp 65001 but this latter if far from being sure
if it does not work, or does not meet your needs, you can force an explicit encoding, or more exactly filter out offending characters, because Python3 natively uses unicode strings.
I would use:
import sys
def u_filter(s, encoding = sys.stdout.encoding):
return (s.encode(encoding, errors='replace').decode(encoding)
if isinstance(s, str) else s)
That means: if s is a unicode string encode it in the encoding used for stdout, replacing any non convertible character by a replacement char, and decode it back into a now clean string
and next:
def fprint(*args, **kwargs):
fargs = [ u_filter(arg) for arg in args ]
print(*fargs, **kwargs)
means: filter out any offending character from unicode strings and print the remaining unchanged.
With that you can safely replace your print throwing the exception with:
fprint(href, title, title2)

python odfpy AttributeError: Text instance has no attribute encode

I'm trying to read from an ods (Opendocument spreadsheet) document with the odfpy modules. So far I've been able to extract some data but whenever a cell contains non-standard input the script errors out with:
Traceback (most recent call last):
File "python/test.py", line 26, in <module>
print x.firstChild
File "/usr/lib/python2.7/site-packages/odf/element.py", line 247, in __str__
return self.data.encode()
UnicodeEncodeError: 'ascii' codec can't encode character u'\u0105' in position 4: ordinal not in range(128)
I tried to force an encoding on the output but apparently it does not go well with print:
Traceback (most recent call last):
File "python/test.py", line 27, in <module>
print x.firstChild.encode('utf-8', 'ignore')
AttributeError: Text instance has no attribute 'encode'
What is the problem here and how could it be solved without editing the module code (which I'd like to avoid at all cost)? Is there an alternative to running encode on output that could work?
Here is my code:
from odf.opendocument import Spreadsheet
from odf.opendocument import load
from odf.table import Table,TableRow,TableCell
from odf.text import P
import sys,codecs
doc = load(sys.argv[1])
d = doc.spreadsheet
tables = d.getElementsByType(Table)
for table in tables:
tName = table.attributes[(u'urn:oasis:names:tc:opendocument:xmlns:table:1.0', u'name')]
print tName
rows = table.getElementsByType(TableRow)
for row in rows[:2]:
cells = row.getElementsByType(TableCell)
for cell in cells:
tps = cell.getElementsByType(P)
if len(tps)>0:
for x in tps:
#print x.firstChild
print x.firstChild.encode('utf-8', 'ignore')
Maybe you are not using the latest odfpy, in the latest verion, the __str__ method of Text is implemented as:
def __str__(self):
return self.data
Update odfpy to the latest version, and modify your code as:
print x.firstChild.__str__().encode('utf-8', 'ignore')
UPDATE
This is another method for getting the raw unicode data for Text: __unicode__. So if you don't want to update odfpy, modify your code as:
print x.firstChild.__unicode__().encode('utf-8', 'ignore')
Seems like the library itself is calling encode() -
return self.data.encode()
This uses the system default encoding , which in your case seems to be ascii. you can check that by using -
import sys
sys.getdefaultencoding()
From the traceback, seems like the actual data exists in a variable called data.
Try doing the below instead -
print x.firstChild.data

nltk NERTagger UnicodeDecodeError in python

I am writing a program in python 2.7.6 that uses nltk with Stanford named entity tagger in Window 7 professional to tag a text and print the result as follows:
import re
from nltk.tag.stanford import NERTagger
WORD = re.compile(r'\w+')
st = NERTagger("./classifiers/english.all.3class.distsim.crf.ser.gz", "stanford-ner.jar")
text = "title Wienfilm 1896-1976 (1976)"
words = WORD.findall(text )
print words
answer = st.tag(words )
print answer
The last print statement in the program suppose to return a tuple consisting of five lists as:
[(u'title', u'O'), (u'Wienfilm', u'O'), (u'1896', u'O'), (u'1976', u'O'), (u'1976', u'O')]
However when I run the program, it gives me the following error message:
['title', 'Wienfilm', '1896', '1976', '1976']
Traceback (most recent call last):
File "E:\Google Drive\myPyPrgs\testNLP.py", line 27, in <module>
answer = st.tag(words )
File "C:\Python27\lib\site-packages\nltk\tag\stanford.py", line 59, in tag
return self.tag_sents([tokens])[0]
File "C:\Python27\lib\site-packages\nltk\tag\stanford.py", line 82, in tag_sents
stanpos_output = stanpos_output.decode(encoding)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 23: ordinal not in
range(128)
Note that if I remove the number, '-1976' from the text string the program tags and prints the correct answer. But if the number '-1976' is within the text, I always have the above error.
In this forum, somebody suggested to me to change the default encoding in the stanford.py of the nltk. When I changed the default encoding in the stanford.py from ascii to UTF-16 and replaced the the last print statement of the above code with the following looping:
for i, word_pos in enumerate(answer):
word, pos = word_pos
print i , word.encode('utf-16'), pos.encode('utf-16')
I got the following incorrect output:
0 ÿþ ÿþtitle/O Wienfilm/O 1896 1976 1976/O
Please any clues on how to deal with this issue? Thanks in advance.
This worked for me: specify the encoding argument as UTF-8 when you create NERTagger object
st = NERTagger("./classifiers/english.all.3class.distsim.crf.ser.gz", "stanford-ner.jar", encoding='utf-8')
Open terminal(cmd), and write;
chcp
It should return something like;
active code page: 857
Then, write;
chcp 1254
After then, in your .py script, to the top of your script write;
# -*- coding: cp1254 -*-
This should solve your problem.If it's not, copy these codes and paste to the top of your script.
# -*-coding:utf-8-*-
import locale
locale.setlocale(locale.LC_ALL, '')
I had many problems with decoding before, these methods solved.
ASCII can decode only 2^7 = 128 characters, that's why you getting that error.As you see in the error sentence ordinal not in range(128) .
And check this website please.Use arrow keys for switching pages :-) I believe it's going to solve your problem.
At the top of your app add:
import sys
reload(sys)
sys.setdefaultencoding("utf-8")
I was dealing with the same problem and I solved it by adding the encoding options on internals.py in nltk.
You must open internals.py saven on:
%YourPythonFolder%\Lib\site-packages\nltk\internals.py
Then go to the method java and adding this line after #construct the full command string (about line 147)
cmd = cmd + ['-inputEncoding', 'utf-8', '-outputEncoding', 'utf-8']
This section code must look like:
# Construct the full command string.
cmd = list(cmd)
cmd = ['-cp', classpath] + cmd
cmd = [_java_bin] + _java_options + cmd
cmd = cmd + ['-inputEncoding', 'utf-8', '-outputEncoding', 'utf-8']
Hope it helps.

Categories