python odfpy AttributeError: Text instance has no attribute encode - python

I'm trying to read from an ods (Opendocument spreadsheet) document with the odfpy modules. So far I've been able to extract some data but whenever a cell contains non-standard input the script errors out with:
Traceback (most recent call last):
File "python/test.py", line 26, in <module>
print x.firstChild
File "/usr/lib/python2.7/site-packages/odf/element.py", line 247, in __str__
return self.data.encode()
UnicodeEncodeError: 'ascii' codec can't encode character u'\u0105' in position 4: ordinal not in range(128)
I tried to force an encoding on the output but apparently it does not go well with print:
Traceback (most recent call last):
File "python/test.py", line 27, in <module>
print x.firstChild.encode('utf-8', 'ignore')
AttributeError: Text instance has no attribute 'encode'
What is the problem here and how could it be solved without editing the module code (which I'd like to avoid at all cost)? Is there an alternative to running encode on output that could work?
Here is my code:
from odf.opendocument import Spreadsheet
from odf.opendocument import load
from odf.table import Table,TableRow,TableCell
from odf.text import P
import sys,codecs
doc = load(sys.argv[1])
d = doc.spreadsheet
tables = d.getElementsByType(Table)
for table in tables:
    tName = table.attributes[(u'urn:oasis:names:tc:opendocument:xmlns:table:1.0', u'name')]
    print tName
    rows = table.getElementsByType(TableRow)
    for row in rows[:2]:
        cells = row.getElementsByType(TableCell)
        for cell in cells:
            tps = cell.getElementsByType(P)
            if len(tps)>0:
                for x in tps:
                    #print x.firstChild
                    print x.firstChild.encode('utf-8', 'ignore')

You may not be using the latest odfpy. In the latest version, the __str__ method of Text is implemented as:
def __str__(self):
    return self.data
Update odfpy to the latest version, and modify your code as follows:
print x.firstChild.__str__().encode('utf-8', 'ignore')
UPDATE
There is another method for getting the raw unicode data from Text: __unicode__. So if you don't want to update odfpy, modify your code as follows:
print x.firstChild.__unicode__().encode('utf-8', 'ignore')

It seems the library itself is calling encode():
return self.data.encode()
This uses the system default encoding, which in your case seems to be ascii. You can check that with:
import sys
sys.getdefaultencoding()
From the traceback, the actual text appears to be stored in an attribute called data.
Try the following instead:
print x.firstChild.data
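The difference between the node and its text can be sketched without odfpy installed. The FakeText class below is a hypothetical stand-in for odfpy's Text node (only the data attribute is taken from the traceback above), and the snippet is written for Python 3 rather than the Python 2.7 used in the question:

```python
class FakeText(object):
    """Hypothetical stand-in for an odfpy Text node; not the real class."""

    def __init__(self, data):
        # The node's text, mirroring `self.data` from the traceback.
        self.data = data


def node_text_as_utf8(node):
    """Encode the node's text rather than the node object itself."""
    return node.data.encode("utf-8")


node = FakeText(u"cz\u0119\u015b\u0107 g\u0119\u015bl\u0105")

# The node has no encode() method of its own; only its text does.
has_encode = hasattr(node, "encode")
encoded = node_text_as_utf8(node)
```

The point of the sketch is only that encode() belongs to the string held in data, so the call has to be made on that attribute and not on the node.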

Related

Python Camelot PDF - UnicodeEncodeError when using Stream flavor, on Windows

Python 3.7 on Windows 10. Camelot 0.8.2
I'm using the following code to convert a pdf file to HTML:
import camelot
import os
def CustomScript(args):
    # raw string, so that "\a" is not read as an escape sequence
    path_to_pdf = r"C:\PDFfolder\abc.pdf"
    folder_to_pdf = os.path.dirname(path_to_pdf)
    tables = camelot.read_pdf(os.path.normpath(path_to_pdf), flavor='stream', pages='1-end')
    tables.export(os.path.normpath(os.path.join(folder_to_pdf,"temp","foo.html")), f='html')
    return CustomScriptReturn.Empty();
I receive the following error at the tables.export line:
"UnicodeEncodeError -'charmap' codec can't encode character '\u2010'
in position y: character maps to undefined.
This code runs without issue on Mac. This error seems to pertain to Windows, which is the environment I will need to run this on.
I have now spent two entire days researching this error ad nauseam. I have tried many of the solutions offered here on Stack Overflow in the several posts related to this, and the error persists. The problem with the lines of code suggested in those solutions is that they are all arguments to be added to vanilla Python methods. These arguments are not available on Camelot's export method.
EDIT 1: Updated post to specify which line is throwing the error.
EDIT 2: PDF file used: http://tsbde.texas.gov/78i8ljhbj/Fiscal-Year-2014-Disciplinary-Actions.pdf
EDIT 3: Here is the full Traceback from Windows console:
Traceback (most recent call last):
  File "main.py", line 18, in <module>
    tables.export(os.path.normpath(os.path.join(folder_to_pdf, "foo.html")), f='html')
  File "C:\Users\stpete\AppData\Local\Programs\Python\Python37\lib\site-packages\camelot\core.py", line 737, in export
    self._write_file(f=f, **kwargs)
  File "C:\Users\stpete\AppData\Local\Programs\Python\Python37\lib\site-packages\camelot\core.py", line 699, in _write_file
    to_format(filepath)
  File "C:\Users\stpete\AppData\Local\Programs\Python\Python37\lib\site-packages\camelot\core.py", line 636, in to_html
    f.write(html_string)
  File "C:\Users\stpete\AppData\Local\Programs\Python\Python37\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2010' in position 5737: character maps to <undefined>
The problem you are facing is related to the method camelot.core.Table.to_html:
def to_html(self, path, **kwargs):
    """Writes Table to an HTML file.
    For kwargs, check :meth:`pandas.DataFrame.to_html`.
    Parameters
    ----------
    path : str
        Output filepath.
    """
    html_string = self.df.to_html(**kwargs)
    with open(path, "w") as f:
        f.write(html_string)
Here the file is opened without an explicit encoding, so Python falls back to the platform default (cp1252 in the traceback above) instead of UTF-8.
This is my solution, which uses a monkey patch to replace original camelot method:
import camelot
import os
# here I define the corrected method
def to_html(self, path, **kwargs):
    """Writes Table to an HTML file.
    For kwargs, check :meth:`pandas.DataFrame.to_html`.
    Parameters
    ----------
    path : str
        Output filepath.
    """
    html_string = self.df.to_html(**kwargs)
    with open(path, "w", encoding="utf-8") as f:
        f.write(html_string)
# monkey patch: I replace the original method with the corrected one
camelot.core.Table.to_html=to_html
def CustomScript(args):
    # raw string, so that "\a" is not read as an escape sequence
    path_to_pdf = r"C:\PDFfolder\abc.pdf"
    folder_to_pdf = os.path.dirname(path_to_pdf)
    tables = camelot.read_pdf(os.path.normpath(path_to_pdf), flavor='stream', pages='1-end')
    tables.export(os.path.normpath(os.path.join(folder_to_pdf,"temp","foo.html")), f='html')
    return CustomScriptReturn.Empty();
I tested this solution and it works for Python 3.7, Windows 10, Camelot 0.8.2.
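The underlying cause can be reproduced with the standard library alone. The following is a minimal sketch, not Camelot code: the sample text and the temporary file name are made up for illustration, and only the character U+2010 and the cp1252 codec come from the traceback.

```python
import os
import tempfile

# Sample text containing U+2010 HYPHEN, the character named in the traceback.
text = "pre\u2010fix"

# cp1252 has no mapping for U+2010, so encoding with it fails.
try:
    text.encode("cp1252")
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False

# A file handle opened with an explicit UTF-8 encoding accepts the same text
# regardless of the platform default.
path = os.path.join(tempfile.mkdtemp(), "demo.html")
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

with open(path, "r", encoding="utf-8") as f:
    round_trip = f.read()
```

Passing encoding="utf-8" explicitly is what the patched to_html does; the rest of the method is unchanged.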
You're getting a UnicodeEncodeError, which in this case means that the output to be written to the file contains a character that cannot be encoded in the default encoding for your platform, cp1252.
Camelot does not seem to offer a way to set the encoding when writing to an HTML file.
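Setting the PYTHONIOENCODING environment variable to "UTF-8" is sometimes suggested as a workaround, but it only changes the encoding of stdin, stdout and stderr, not of files opened with open(). On Python 3.7 and later, enabling UTF-8 mode with the PYTHONUTF8 environment variable should make open() default to UTF-8 instead:
C:\> set PYTHONUTF8=1
C:\> python myprog.py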

Attribute error when encoding with base64

I have two keys (a secret key and a public key) that are generated using curve25519. I want to encode the two keys using base64.safe_b64encode, but I keep getting an error. Is there any way I can encode them like this?
This is my code:
import libnacl.public
import libnacl.secret
import libnacl.utils
from tinydb import TinyDB
from hashlib import sha256
import json
import base64
pikeys = libnacl.public.SecretKey()
piprivkey = pikeys.sk
pipubkey = pikeys.pk
piprivkey = base64.safe_b64encode(piprivkey)
pipubkey = base64.safe_b64encode(pipubkey)
print("encoded priv", piprivkey)
print("encoded pub", pipubkey)
This is the error I got:
Traceback (most recent call last):
File "/home/pi/Desktop/finalcode/pillar1.py", line 130, in <module>
File "/home/pi/Desktop/finalcode/pillar1.py", line 50, in generatepillar1key
piprivkey = base64.safe_b64encode(piprivkey)
AttributeError: 'module' object has no attribute 'safe_b64encode'
You get this error because the base64 module does not have a function named safe_b64encode. It is not clear which encoding you are after or why both keys need to be base64-encoded, but the module offers a URL-safe encoding function and the regular base64 encoding function:
encoded_data = base64.b64encode(data_to_encode)
or
encoded_data = base64.urlsafe_b64encode(data_to_encode)
The latter uses a slightly different alphabet, with - instead of + and _ instead of /, so the result is URL-safe. I'm not sure what you want to do, but refer to the base64 module documentation.
The error is telling you that the function safe_b64encode does not exist in the base64 module. Perhaps you meant to use base64.urlsafe_b64encode(s)?
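As a rough illustration of the two functions that do exist, the sketch below encodes 32 random bytes. The os.urandom call is only a placeholder for a real key such as pikeys.sk, since libnacl is not assumed to be installed here.

```python
import base64
import os

# Placeholder for a 32-byte key; a real key would come from the key pair.
key = os.urandom(32)

standard = base64.b64encode(key)
url_safe = base64.urlsafe_b64encode(key)

# Both encodings decode back to the same bytes.
assert base64.b64decode(standard) == key
assert base64.urlsafe_b64decode(url_safe) == key

# The two alphabets differ only in the characters used for values 62 and 63.
sample = b"\xfb\xff"
print(base64.b64encode(sample))          # b'+/8='
print(base64.urlsafe_b64encode(sample))  # b'-_8='
```

Both functions take and return bytes, so call .decode("ascii") on the result if a str is needed for printing or storage.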

How to find string and return it to stdout in Python

I am getting familiar with Python & am struggling to do the below with BeautifulSoup, Python.
What is expected:
*If the output of the script below contains the string 5378, it should email me with the line the string appears.
#! /usr/bin/env python
from bs4 import BeautifulSoup
from lxml import html
import urllib2,re
import codecs
import sys
streamWriter = codecs.lookup('utf-8')[-1]
sys.stdout = streamWriter(sys.stdout)
BASE_URL = "http://outlet.us.dell.com/ARBOnlineSales/Online/InventorySearch.aspx?c=us&cs=22&l=en&s=dfh&brandid=2201&fid=111162"
webpage = urllib2.urlopen(BASE_URL)
soup = BeautifulSoup(webpage.read(), "lxml")
findcolumn = soup.find("div", {"id": "itemheader-FN"})
name = findcolumn.text.strip()
print name
I tried using findall(5378, name), but it returns empty brackets like this: [].
I also run into Unicode issues when I try to use the script together with grep.
$ python dell.py | grep 5378
Traceback (most recent call last):
File "dell.py", line 18, in <module>
print name
UnicodeEncodeError: 'ascii' codec can't encode character u'\u201d' in position 817: ordinal not in range(128)
Can someone tell me what I am doing wrong in both cases?
The function findall (from the re module) expects its first parameter to be a regular expression, which is a string, but you provided an integer. Try this instead:
re.findall("5378", name)
When printed this will output [u'5378'] when it found something or [] when it didn't.
I suspect you want to retrieve the product name from the number, which means you have to iterate through elements in findcolumn. We can use re.search() here to check for a single match within the element's texts.
for input_element in findcolumn.find_all("div"):
    name = unicode(input_element.text.strip())
    if re.search("5378", name) is not None:
        print name
As for the Unicode error, there are several solutions, depending on your operating system and configuration: reconfigure your system locale (on Ubuntu, for example), or encode your script's output explicitly with .encode().
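The search itself can also be done in Python instead of piping to grep. The following is a minimal Python 3 sketch; the helper name matching_lines and the sample text are invented for illustration and do not come from the Dell page.

```python
import re


def matching_lines(text, needle):
    """Return the lines of `text` that contain `needle` as a literal string."""
    pattern = re.compile(re.escape(needle))
    return [line for line in text.splitlines() if pattern.search(line)]


# Made-up sample text containing a non-ASCII quote (U+201D), as in the traceback.
sample = u"Inspiron 13 5378 13.3\u201d touch\nLatitude 7480\nInspiron 5378 2-in-1"
hits = matching_lines(sample, "5378")

for line in hits:
    print(line)
```

The list returned by matching_lines is what would go into the body of the notification email; sending the mail is left out of the sketch.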

Strange UnicodeEncodeError/AttributeError in my script

Currently I am writing a script in Python 2.7 that works fine except for after running it for a few seconds it runs into an error:
Enter Shopify website URL (without HTTP): store.highsnobiety.com
Scraping! Check log file # z:\shopify_output.txt to see output.
!!! Also make sure to clear file every hour or so !!!
Copper Bracelet - 3mm - Polished ['3723603267']
Traceback (most recent call last):
File "shopify_sitemap_scraper.py", line 38, in <module>
print(prod, variants).encode('utf-8')
AttributeError: 'NoneType' object has no attribute 'encode'
The script is to get data from a Shopify website and then print it to console. Code here:
# -*- coding: utf-8 -*-
from __future__ import print_function
from lxml.html import fromstring
import requests
import time
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
# Log file location, change "z://shopify_output.txt" to your location.
logFileLocation = "z:\shopify_output.txt"
log = open(logFileLocation, "w")
# URL of Shopify website from user input (for testing, just use store.highsnobiety.com during input)
url = 'http://' + raw_input("Enter Shopify website URL (without HTTP): ") + '/sitemap_products_1.xml'
print ('Scraping! Check log file # ' + logFileLocation + ' to see output.')
print ("!!! Also make sure to clear file every hour or so !!!")
while True :
    page = requests.get(url)
    tree = fromstring(page.content)
    # skip first url tag with no image:title
    url_tags = tree.xpath("//url[position() > 1]")
    data = [(e.xpath("./image/title//text()")[0],e.xpath("./loc/text()")[0]) for e in url_tags]
    for prod, url in data:
        # add xml extension to url
        page = requests.get(url + ".xml")
        tree = fromstring(page.content)
        variants = tree.xpath("//variants[@type='array']//id[@type='integer']//text()")
        print(prod, variants).encode('utf-8')
The most crazy part about it is that when I take out the .encode('utf-8') it gives me a UnicodeEncodeError seen here:
Enter Shopify website URL (without HTTP): store.highsnobiety.com
Scraping! Check log file # z:\shopify_output.txt to see output.
!!! Also make sure to clear file every hour or so !!!
Copper Bracelet - 3mm - Polished ['3723603267']
Copper Bracelet - 5mm - Brushed ['3726247811']
Copper Bracelet - 7mm - Polished ['3726253635']
Highsnobiety x EARLY - Leather Pouch ['14541472963', '14541473027', '14541473091']
Traceback (most recent call last):
File "shopify_sitemap_scraper.py", line 38, in <module>
print(prod, variants)
File "C:\Python27\lib\encodings\cp437.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\xae' in position 13: character maps to <undefined>'
Any ideas? Have no idea what else to try after hours of googling.
snakecharmerb almost got it, but missed the cause of your first error. Your code
print(prod, variants).encode('utf-8')
means you print the values of the prod and variants variables, then try to call encode() on the return value of print. Unfortunately, print() (as a function, which it is in Python 2 after from __future__ import print_function and always in Python 3) returns None. To fix it, use the following instead:
print(prod.encode("utf-8"), variants)
Your console has a default encoding of cp437, and cp437 is unable to represent the character u'\xae'.
>>> print (u'\xae')
®
>>> print (u'\xae'.encode('utf-8'))
b'\xc2\xae'
>>> print (u'\xae'.encode('cp437'))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.5/encodings/cp437.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character '\xae' in position 0: character maps to <undefined>
You can see that it's trying to convert to cp437 in the traceback:
File "C:\Python27\lib\encodings\cp437.py", line 12, in encode
(I reproduced the problem in Python3.5, but it's the same issue in both versions of Python)
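Both errors can be reproduced in a few lines of Python 3. This is a small sketch under the assumption of a made-up product name; only the character u'\xae' and the cp437 codec are taken from the traceback.

```python
# print() writes its arguments and returns None, so there is nothing to encode.
result = print("Copper Bracelet")
print_returns_none = result is None

# Hypothetical product name containing U+00AE (registered sign).
name = u"Brand\xae Bracelet"

# UTF-8 can represent the character.
utf8_bytes = name.encode("utf-8")

# cp437 cannot, which is the UnicodeEncodeError from the question.
try:
    name.encode("cp437")
    cp437_ok = True
except UnicodeEncodeError:
    cp437_ok = False

# errors="replace" substitutes "?" instead of raising.
replaced = name.encode("cp437", errors="replace")
```

Replacing unencodable characters loses information, so it is only a reasonable choice for console output, not for data written to the log file.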

Python3 and hmac . How to handle string not being binary

I had a script in Python2 that was working great.
def _generate_signature(data):
    return hmac.new('key', data, hashlib.sha256).hexdigest()
Where data was the output of json.dumps.
Now, if I try to run the same kind of code in Python 3, I get the following:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python3.4/hmac.py", line 144, in new
return HMAC(key, msg, digestmod)
File "/usr/lib/python3.4/hmac.py", line 42, in __init__
raise TypeError("key: expected bytes or bytearray, but got %r" %type(key).__name__)
TypeError: key: expected bytes or bytearray, but got 'str'
If I try something like transforming the key to bytes like so:
bytes('key')
I get
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: string argument without an encoding
I'm still struggling to understand the encodings in Python 3.
You can use bytes literal: b'key'
def _generate_signature(data):
    return hmac.new(b'key', data, hashlib.sha256).hexdigest()
In addition to that, make sure data is also bytes. For example, if it is read from file, you need to use binary mode (rb) when opening the file.
Not to resurrect an old question but I did want to add something I feel is missing from this answer, to which I had trouble finding an appropriate explanation/example of anywhere else:
Aquiles Carattino was pretty close with his attempt at converting the string to bytes, but was missing the second argument, the encoding of the string to be converted to bytes.
If someone would like to convert a string to bytes through some other means than static assignment (such as reading from a config file or a DB), the following should work:
(Python 3+ only, not compatible with Python 2)
import hmac, hashlib
def _generate_signature(data):
    key = 'key' # Defined as a simple string.
    key_bytes = bytes(key, 'latin-1') # Commonly 'latin-1' or 'ascii'
    data_bytes = bytes(data, 'latin-1') # Assumes `data` is also an ascii string.
    return hmac.new(key_bytes, data_bytes, hashlib.sha256).hexdigest()
print(
    _generate_signature('this is my string of data')
)
Try codecs.encode(), which can be used in both Python 2.7.12 and 3.5.2:
import hashlib
import codecs
import hmac
a = "aaaaaaa"
b = "bbbbbbb"
hmac.new(codecs.encode(a), msg=codecs.encode(b), digestmod=hashlib.sha256).hexdigest()
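For Python 3, this is how I solved it. Note that only the key and the message are encoded; the digest constructor hashlib.sha256 is passed as-is.
import codecs
import hashlib
import hmac
key = 'key'
def _generate_signature(data):
    return hmac.new(codecs.encode(key), codecs.encode(data), hashlib.sha256).hexdigest()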
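The answers above can be combined into one helper that accepts either str or bytes. This is a sketch for Python 3 only; the function name generate_signature, the UTF-8 choice and the sample payload are assumptions made for illustration, not part of the original script.

```python
import hashlib
import hmac
import json


def generate_signature(key, data):
    """Return the HMAC-SHA256 hex digest; str inputs are encoded as UTF-8 first."""
    if isinstance(key, str):
        key = key.encode("utf-8")
    if isinstance(data, str):
        data = data.encode("utf-8")
    return hmac.new(key, data, hashlib.sha256).hexdigest()


# json.dumps returns a str in Python 3, so it must be encoded before signing.
payload = json.dumps({"user": "alice", "amount": 10}, sort_keys=True)
signature = generate_signature("key", payload)

# Passing bytes directly gives the same digest; compare in constant time.
is_valid = hmac.compare_digest(
    signature, generate_signature(b"key", payload.encode("utf-8"))
)
```

Whatever encoding is chosen, both the signer and the verifier have to use the same one, otherwise the digests will differ for non-ASCII input.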
