I have some code that is throwing a fatal exception on unicode input. I am using ElementTree to build up an xml document and tostring() to print it. I've tried passing in unicode objects, and encoding them as UTF-8 bytestrings, and it makes no difference. I can't figure out if I'm doing something wrong or if there is a bug in the module.
Here is a small sample to reproduce it.
#!/usr/bin/python
from __future__ import unicode_literals, print_function
from xml.etree.ElementTree import Element, SubElement, tostring
import xml.etree.ElementTree as ET
import time
def main():
xml = Element('build_summary')
mpversion = SubElement(xml, 'magpy_version')
mpversion.text = '1.2.3.4'
version = SubElement(xml, 'version')
version.text = '11.22.33.44'
date = SubElement(xml, 'date')
date.text = time.strftime("%a %b %-d %Y", time.localtime())
args = SubElement(xml, 'args')
args.text = 'build args'
issues = SubElement(xml, 'issues')
# Add the repos and the changes in them
changelog = 'this is the changelog \u2615'
#changelog = 'this is the changelog'
print("adding changelog:", changelog)
repository = SubElement(issues, 'repo')
reponame = SubElement(repository, 'reponame')
reponame.text = 'repo name'
repoissues = SubElement(repository, 'repoissues')
#repoissues.text = changelog.encode('UTF-8', 'replace')
repoissues.text = changelog
# Generate a string, reparse it, and pretty-print it.
#ET.dump(xml)
#xml.write('myoutput.xml')
rough = tostring(xml, encoding='UTF-8', method='xml')
#rough = tostring(xml)
print(rough)
if __name__ == '__main__':
main()
This yields the following:
msoulier#anton:~$ python treetest.py
adding changelog: this is the changelog ☕
Traceback (most recent call last):
File "treetest.py", line 38, in <module>
main()
File "treetest.py", line 33, in main
rough = tostring(xml, encoding='UTF-8', method='xml')
File "/usr/local/Cellar/python#2/2.7.16/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1127, in tostring
return "".join(data)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 22: ordinal not in range(128)
So, what am I doing wrong here? Oddly, ElementTree.dump works fine but the docs say not to use it for anything bug debugging.
Related
I am a newbie python programmer. I am strugling with strange error - which pops out only when I move working code from main script file to separate module (file) as a function. The error is SyntaxError: (unicode error) 'utf-8' codec can't decode byte 0xbf in position 58: invalid start byte.
If the function is in the main code there is no error and code works properly...
The code is about do some webscraping with the use of selenium and xpath
#main file:
import requests
import lxml.html as lh
import pandas as pd
import numpy
import csv
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
import funkcje as f
spolka = "https://mojeinwestycje.interia.pl/gie/prof/spolki/notowania?wlid=213"
wynik = f.listaTransakcji(spolka)
#module file with function definition (funkcje.py):
def listaTransakcji(spolka):
driver = webdriver.Firefox()
driver.implicitly_wait(30)
driver.get(spolka)
driver.find_element_by_xpath("//button[#class='rodo-popup-agree']").click()
driver.find_element_by_xpath("//input[#type='radio' and #name='typ' and #value='wsz']").click()
driver.find_element_by_xpath("//input[#type='submit' and #name='Submit' and #value='pokaż']").click()
page = driver.page_source
#end of selenium-----------------------------------------------------------------------------
#Store the contents of the website under doc
doc = lh.fromstring(page)
#wyluskanie rekordów transakcji - xpath------------------------------------------------------
tr_elements = doc.xpath('//table//tr[#bgcolor="#FFFFFF" or #bgcolor="#F7FAFF"]/td')
rekord = numpy.array([])
length = len(tr_elements)
for i in range (0, length):
if(tr_elements[i].text=='TRANSAKCJA') or (tr_elements[i].text=='WIDEŁKI STAT') or (tr_elements[i].text=='WIDEŁKI DYN'):
new_rekord=[tr_elements[i-5].text, tr_elements[i-4].text, tr_elements[i-3].text, tr_elements[i-2].text, tr_elements[i-1].text, tr_elements[i].text]
rekord=numpy.concatenate((rekord,new_rekord))
ilosc = (len(rekord))//6
tablica = numpy.array([])
tablica = rekord.reshape(ilosc, 6)
header = numpy.array(["godzina", "cena", "zmiana", "wolumen", "numer", "typ operacji"])
header = header.reshape(1, 6)
tablica = numpy.concatenate((header,tablica))
return (tablica)
offending line 10:
import funkcje as f
offending line 34:
driver.find_element_by_xpath("//input[#type='submit' and #name='Submit' and #value='pokaż']").click()
expected result:
["11:17:40","0,4930","0,00",24300,76,"TRANSAKCJA"]
actual result:
Traceback (most recent call last):
File "C:/Users/Vox/PycharmProjects/interia/scraper.py", line 10, in <module>
import funkcje as f
File "C:\Users\Vox\PycharmProjects\interia\funkcje.py", line 34
SyntaxError: (unicode error) 'utf-8' codec can't decode byte 0xbf in position 58: invalid start byte
thanks Marat!
try putting # -- coding: utf-8 -- on top of the new file (replace
utf-8 with whatever encoding is used for pokaż
solved issue... no idea why it happened in the first place though. Like new file that is not a main file is not utf-8 by default?
Scraping site with chineese simbols .
How do i scrap chineese simbolse ??
from urllib.request import urlopen
from urllib.parse import urljoin
from lxml.html import fromstring
URL = 'http://list.suning.com/0-258003-0.html'
ITEM_PATH = '.clearfix .product .border-out .border-in .wrap .res-info .sell-point'
def parse_items():
f = urlopen(URL)
list_html = f.read().decode('utf-8')
list_doc = fromstring(list_html)
for elem in list_doc.cssselect(ITEM_PATH):
a = elem.cssselect('a')[0]
href = a.get('href')
title = a.text
em = elem.cssselect('em')[0]
title2 = em.text
print(href, title, title2)
def main():
parse_items()
if __name__ == '__main__':
main()
Error looks like this.
Error looks like this
Error looks like this
Error looks like this
Error looks like this
http://product.suning.com/0000000000/146422477.html Traceback (most recent call last):
File "parser.py", line 27, in <module>
main()
File "parser.py", line 24, in main
parse_items()
File "parser.py", line 20, in parse_items
print(href, title, title2)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
From the print syntax and the imports, I assume that you use a Python3 version, since it can matter for unicode.
So, we can expect that href, title and title2 are all unicode strings (or Python 3 strings). But the print function will try to convert the strings to an encoding acceptable by the output system - for a reason I cannot know, you system uses by default ASCII, so the error.
How to fix:
the best way would be to make your system accept unicode. On Linux or other unixes, you can declare an UTF8 charset in LANG environment variable (export LANG=en_US.UTF-8), on Windows you can try chcp 65001 but this latter if far from being sure
if it does not work, or does not meet your needs, you can force an explicit encoding, or more exactly filter out offending characters, because Python3 natively uses unicode strings.
I would use:
import sys
def u_filter(s, encoding = sys.stdout.encoding):
return (s.encode(encoding, errors='replace').decode(encoding)
if isinstance(s, str) else s)
That means: if s is a unicode string encode it in the encoding used for stdout, replacing any non convertible character by a replacement char, and decode it back into a now clean string
and next:
def fprint(*args, **kwargs):
fargs = [ u_filter(arg) for arg in args ]
print(*fargs, **kwargs)
means: filter out any offending character from unicode strings and print the remaining unchanged.
With that you can safely replace your print throwing the exception with:
fprint(href, title, title2)
I'm trying to read from an ods (Opendocument spreadsheet) document with the odfpy modules. So far I've been able to extract some data but whenever a cell contains non-standard input the script errors out with:
Traceback (most recent call last):
File "python/test.py", line 26, in <module>
print x.firstChild
File "/usr/lib/python2.7/site-packages/odf/element.py", line 247, in __str__
return self.data.encode()
UnicodeEncodeError: 'ascii' codec can't encode character u'\u0105' in position 4: ordinal not in range(128)
I tried to force an encoding on the output but apparently it does not go well with print:
Traceback (most recent call last):
File "python/test.py", line 27, in <module>
print x.firstChild.encode('utf-8', 'ignore')
AttributeError: Text instance has no attribute 'encode'
What is the problem here and how could it be solved without editing the module code (which I'd like to avoid at all cost)? Is there an alternative to running encode on output that could work?
Here is my code:
from odf.opendocument import Spreadsheet
from odf.opendocument import load
from odf.table import Table,TableRow,TableCell
from odf.text import P
import sys,codecs
doc = load(sys.argv[1])
d = doc.spreadsheet
tables = d.getElementsByType(Table)
for table in tables:
tName = table.attributes[(u'urn:oasis:names:tc:opendocument:xmlns:table:1.0', u'name')]
print tName
rows = table.getElementsByType(TableRow)
for row in rows[:2]:
cells = row.getElementsByType(TableCell)
for cell in cells:
tps = cell.getElementsByType(P)
if len(tps)>0:
for x in tps:
#print x.firstChild
print x.firstChild.encode('utf-8', 'ignore')
Maybe you are not using the latest odfpy, in the latest verion, the __str__ method of Text is implemented as:
def __str__(self):
return self.data
Update odfpy to the latest version, and modify your code as:
print x.firstChild.__str__().encode('utf-8', 'ignore')
UPDATE
This is another method for getting the raw unicode data for Text: __unicode__. So if you don't want to update odfpy, modify your code as:
print x.firstChild.__unicode__().encode('utf-8', 'ignore')
Seems like the library itself is calling encode() -
return self.data.encode()
This uses the system default encoding , which in your case seems to be ascii. you can check that by using -
import sys
sys.getdefaultencoding()
From the traceback, seems like the actual data exists in a variable called data.
Try doing the below instead -
print x.firstChild.data
I am writing a program in python 2.7.6 that uses nltk with Stanford named entity tagger in Window 7 professional to tag a text and print the result as follows:
import re
from nltk.tag.stanford import NERTagger
WORD = re.compile(r'\w+')
st = NERTagger("./classifiers/english.all.3class.distsim.crf.ser.gz", "stanford-ner.jar")
text = "title Wienfilm 1896-1976 (1976)"
words = WORD.findall(text )
print words
answer = st.tag(words )
print answer
The last print statement in the program suppose to return a tuple consisting of five lists as:
[(u'title', u'O'), (u'Wienfilm', u'O'), (u'1896', u'O'), (u'1976', u'O'), (u'1976', u'O')]
However when I run the program, it gives me the following error message:
['title', 'Wienfilm', '1896', '1976', '1976']
Traceback (most recent call last):
File "E:\Google Drive\myPyPrgs\testNLP.py", line 27, in <module>
answer = st.tag(words )
File "C:\Python27\lib\site-packages\nltk\tag\stanford.py", line 59, in tag
return self.tag_sents([tokens])[0]
File "C:\Python27\lib\site-packages\nltk\tag\stanford.py", line 82, in tag_sents
stanpos_output = stanpos_output.decode(encoding)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 23: ordinal not in
range(128)
Note that if I remove the number, '-1976' from the text string the program tags and prints the correct answer. But if the number '-1976' is within the text, I always have the above error.
In this forum, somebody suggested to me to change the default encoding in the stanford.py of the nltk. When I changed the default encoding in the stanford.py from ascii to UTF-16 and replaced the the last print statement of the above code with the following looping:
for i, word_pos in enumerate(answer):
word, pos = word_pos
print i , word.encode('utf-16'), pos.encode('utf-16')
I got the following incorrect output:
0 ÿþ ÿþtitle/O Wienfilm/O 1896 1976 1976/O
Please any clues on how to deal with this issue? Thanks in advance.
This worked for me: specify the encoding argument as UTF-8 when you create NERTagger object
st = NERTagger("./classifiers/english.all.3class.distsim.crf.ser.gz", "stanford-ner.jar", encoding='utf-8')
Open terminal(cmd), and write;
chcp
It should return something like;
active code page: 857
Then, write;
chcp 1254
After then, in your .py script, to the top of your script write;
# -*- coding: cp1254 -*-
This should solve your problem.If it's not, copy these codes and paste to the top of your script.
# -*-coding:utf-8-*-
import locale
locale.setlocale(locale.LC_ALL, '')
I had many problems with decoding before, these methods solved.
ASCII can decode only 2^7 = 128 characters, that's why you getting that error.As you see in the error sentence ordinal not in range(128) .
And check this website please.Use arrow keys for switching pages :-) I believe it's going to solve your problem.
At the top of your app add:
import sys
reload(sys)
sys.setdefaultencoding("utf-8")
I was dealing with the same problem and I solved it by adding the encoding options on internals.py in nltk.
You must open internals.py saven on:
%YourPythonFolder%\Lib\site-packages\nltk\internals.py
Then go to the method java and adding this line after #construct the full command string (about line 147)
cmd = cmd + ['-inputEncoding', 'utf-8', '-outputEncoding', 'utf-8']
This section code must look like:
# Construct the full command string.
cmd = list(cmd)
cmd = ['-cp', classpath] + cmd
cmd = [_java_bin] + _java_options + cmd
cmd = cmd + ['-inputEncoding', 'utf-8', '-outputEncoding', 'utf-8']
Hope it helps.
I have the following input XML file,i read the rel_notes tag and print it...running into the following error
Input XML:
<rel_notes>
• Please move to this build for all further test and development activities
• Please use this as base build to verify compilation and sanity before any check-in happens
</rel_notes>
Sample python code:
file = open('data.xml,'r')
from xml.etree import cElementTree as etree
tree = etree.parse(file)
print('\n'.join(elem.text for elem in tree.iter('rel_notes')))
OUTPUT
print('\n'.join(elem.text for elem in tree.iter('rel_notes')))
File "C:\python2.7.3\lib\encodings\cp437.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u2022' in position 9: character maps to <undefined>
The issue is with printing Unicode to Windows console. Namely, the character '•' can't be represented in cp437 used by your console.
To reproduce the problem, try:
print u'\u2022'
You could set PYTHONIOENCODING environment variable to instruct python to replace all unrepresentable characters with corresponding xml char references:
T:\> set PYTHONIOENCODING=cp437:xmlcharrefreplace
T:\> python your_script.py
Or encode the text to bytes before printing:
print u'\u2022'.encode('cp437', 'xmlcharrefreplace')
answer to your initial question
To print text of each <build_location/> element:
import sys
from xml.etree import cElementTree as etree
input_file = sys.stdin # filename or file object
tree = etree.parse(input_file)
print('\n'.join(elem.text for elem in tree.iter('build_location')))
If input file is large; iterparse() could be used:
import sys
from xml.etree import cElementTree as etree
input_file = sys.stdin
context = iter(etree.iterparse(input_file, events=('start', 'end')))
_, root = next(context) # get root element
for event, elem in context:
if event == 'end' and elem.tag == 'build_location':
print(elem.text)
root.clear() # free memory
I don't think the entire snippet above is completely helpful. But, UnicodeEncodeError usually happens when the ASCII characters aren't handled properly.
Example:
unicode_str = html.decode(<source encoding>)
encoded_str = unicode_str.encode("utf8")
Its already explained clearly in this answer: Python: Convert Unicode to ASCII without errors
This should at least solve the UnicodeEncodeError.