Because I'm not able to use an XSL IDE, I've written a super-simple Python script using lxml to transform a given XML file with a given XSL transform, and write the results to a file. As follows (abridged):
p = XMLParser(huge_tree=True)
xml = etree.parse(xml_filename, parser=p)
xml_root = xml.getroot()
print(xml_root.tag)
xslt_root = etree.parse(xsl_filename)
transform = etree.XSLT(xslt_root)
newtext = transform(xml)
with open(output, 'w') as f:
    f.write(str(newtext))
I'm getting the following error:
"lxml.etree.XSLTApplyError: Failed to evaluate the 'select' expression"
...but I have quite a number of select expressions in my XSLT. After looking carefully and isolating blocks of code, I'm still at a loss as to which select is failing, or why.
Without trying to debug the code, is there a way to get more information out of lxml, like a line number or quote from the failing expression?
aaaaaand of course as soon as I actually take the time to post the question, I stumble upon the answer.
This might be a duplicate of this question, but I think the added benefit here is the Python side of things.
The linked answer points out that each parser includes an error log that you can access. The only "trick" is catching those errors so that you can look in the log once it's been created.
I did it thusly (perhaps also poorly, but it worked):
import os
import lxml.etree as etree
from lxml.etree import XMLParser
import sys
xml_filename = '(some path to an XML file)'
xsl_filename = '(some path to an XSL file)'
output = '(some path to a file)'
p = XMLParser(huge_tree=True)
xml = etree.parse(xml_filename, parser=p)
xml_root = xml.getroot()
xslt_root = etree.parse(xsl_filename)
transform = etree.XSLT(xslt_root)
newtext = None
try:
    newtext = transform(xml)
    with open(output, 'w') as f:
        f.write(str(newtext))
except etree.XSLTApplyError:
    # catch the failed transform so the error log can be inspected
    for error in transform.error_log:
        print(error.message, error.line)
The messages in this log are more descriptive than those printed to the console, and the "line" element will point you to the line number where the failure occurred.
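For reference, here is a minimal sketch of dumping the other attributes each log entry carries (message, line, column and filename, per lxml's error log API), which can make it even quicker to locate the failing select:

# a minimal sketch: print every entry in the XSLT error log with its location
for entry in transform.error_log:
    print('{}:{}:{}: {}'.format(entry.filename, entry.line, entry.column, entry.message))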
This is the code that results in an error message:
import urllib
import xml.etree.ElementTree as ET
url = raw_input('Enter URL:')
urlhandle = urllib.urlopen(url)
data = urlhandle.read()
tree = ET.parse(data)
The error:
I'm new to Python. I did read the documentation and a couple of tutorials, but clearly I've still done something wrong. I don't believe it is the XML file itself, because it does this to two different XML files.
Consider using ElementTree's fromstring():
import urllib
import xml.etree.ElementTree as ET
url = raw_input('Enter URL:')
# http://feeds.bbci.co.uk/news/rss.xml?edition=int
urlhandle = urllib.urlopen(url)
data = urlhandle.read()
tree = ET.fromstring(data)
print ET.tostring(tree, encoding='utf8', method='xml')
data is a reference to the XML content as a string, but the parse() function expects a filename or file object as its argument. That's why there is an error.
urlhandle is a file object, so tree = ET.parse(urlhandle) should work for you.
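To illustrate, a minimal sketch of that fix against the original snippet (URL prompt unchanged) might look like this:

import urllib
import xml.etree.ElementTree as ET

url = raw_input('Enter URL:')
urlhandle = urllib.urlopen(url)
# parse() accepts a file object, so pass the handle directly instead of the string
tree = ET.parse(urlhandle)
root = tree.getroot()
print root.tag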
The error message indicates that your code is trying to open a file whose name is stored in the variable source.
It's failing to open that file (IOError) because the variable source contains a bunch of XML, not a file name.
I am trying to use this code to add a simple piece of info, which I have in a table, into an XML tree. Each file has its id, which I need to add into it. The corresp dictionary holds file name and id pairs. There is already an empty element in the XML, called idno[@type='TM'], into which I need to enter the corresponding id number.
from bs4 import BeautifulSoup
DIR = 'files/'
corresp = {"00100004":"362375", "00100005":"362376", "00100006":"362377", "00100007":"362378"}
for fileName, tm in corresp.iteritems():
    soup = BeautifulSoup(open(DIR + fileName + ".xml"))
    tmid = soup.find("idno", type="TM")
    tmid.append(tm)
    print soup
My first problem is that sometimes it works, and sometimes it says
tmid.append(tm)
AttributeError: 'NoneType' object has no attribute 'append'
I have no idea why. Yesterday evening I ran the same sample code, and now it complains in this way.
I have also tried etree
import xml.etree.ElementTree as ET
DIR = 'files/'
corresp = {"00100004":"362375", "00100005":"362376", "00100006":"362377", "00100007":"362378"}
for fileName, tm in corresp.iteritems():
    f = open(DIR + fileName + ".xml")
    tree = ET.parse(f)
    tmid = tree.findall("//idno[@type='TM']")
    tmid.append(tm)
    tree.write('output.xml', encoding='utf-8', xml_declaration=True)
But it says "no element found: line 1, column 0"
My second, probably related problem is that when it did work, I was not able to write the output to a file. Ideally I would like to simply write it back to the file I am modifying.
Thank you very much for any advice on this.
For your first question:
find() just returns the first matching element. If find() can't find anything, it returns None, and None has no append() method, which is why you get that AttributeError whenever a file has no matching idno element.
check the doc:
http://www.crummy.com/software/BeautifulSoup/bs4/doc/
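As an illustration, here is a minimal sketch (reusing the question's own loop and dictionary) of guarding against the None result before appending:

from bs4 import BeautifulSoup

DIR = 'files/'
corresp = {"00100004": "362375", "00100005": "362376"}

for fileName, tm in corresp.iteritems():
    soup = BeautifulSoup(open(DIR + fileName + ".xml"))
    tmid = soup.find("idno", type="TM")
    if tmid is not None:
        # only append when the element actually exists
        tmid.append(tm)
    else:
        # otherwise report which file is missing the element
        print "no idno[@type='TM'] element in", fileName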
I would like to parse large HTML files and extract information from them through XPath. To do that, I'm using Python and lxml. However, lxml doesn't seem to work well with large files; it can only parse files correctly up to around 16 MB. The fragment of code where it tries to extract information from the HTML through XPath is the following:
tree = lxml.html.fragment_fromstring(htmlCode)
links = tree.xpath("//*[contains(@id, 'item')]/div/div[2]/p/text()")
The variable htmlCode contains the HTML code read from a file. I also tried using the parse() method to read the code from the file instead of passing it in as a string, but it didn't work either. As the contents of the file are read successfully, I think the problem is related to lxml. I've been looking for other libraries for parsing HTML with XPath, but it looks like lxml is the main library used for that.
Is there another method/function of lxml that deals better with large HTML files?
If the file is very large, you can use iterparse and add the html=True argument to parse the file without any validation.
Since iterparse hands you elements one at a time rather than a whole tree to query, you need to recreate the conditions from your XPath manually.
from lxml import etree
import sys
import unicodedata
TAG = '{http://www.mediawiki.org/xml/export-0.8/}text'
def fast_iter(context, func, *args, **kwargs):
    # http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
    # Author: Liza Daly
    # modified to call func() only in the event and elem needed
    for event, elem in context:
        if event == 'end' and elem.tag == TAG:
            func(elem, *args, **kwargs)
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    del context

def process_element(elem, fout):
    global counter
    normalized = unicodedata.normalize('NFKD', \
        unicode(elem.text)).encode('ASCII','ignore').lower()
    print >>fout, normalized.replace('\n', ' ')
    if counter % 10000 == 0: print "Doc " + str(counter)
    counter += 1

def main():
    fin = open("large_file", 'r')
    fout = open('output.txt', 'w')
    context = etree.iterparse(fin, html=True)
    global counter
    counter = 0
    fast_iter(context, process_element, fout)

if __name__ == "__main__":
    main()
Source
I'm parsing an xml file using the code below:
import lxml
file_name = input('Enter the file name, including .xml extension: ')
print('Parsing ' + file_name)
from lxml import etree
parser = lxml.etree.XMLParser()
tree = lxml.etree.parse(file_name, parser)
root = tree.getroot()
nsmap = {'xmlns': 'urn:tva:metadata:2010'}
with open(file_name+'.log', 'w', encoding='utf-8') as f:
    for info in root.xpath('//xmlns:ProgramInformation', namespaces=nsmap):
        crid = (info.get('programId'))
        titlex = (info.find('.//xmlns:Title', namespaces=nsmap))
        title = (titlex.text if titlex != None else 'Missing')
        synopsis1x = (info.find('.//xmlns:Synopsis[1]', namespaces=nsmap))
        synopsis1 = (synopsis1x.text if synopsis1x != None else 'Missing')
        synopsis1 = synopsis1.replace('\r','').replace('\n','')
        f.write('{}|{}|{}\n'.format(crid, title, synopsis1))
Let's take an example title of 'Přešité bydlení'. If I print the title while parsing the file, it comes out as expected. When I write it out, however, it displays as 'PÅ™eÅ¡ité bydlenÃ'.
I understand that this is to do with encoding (as I was able to change the print command to use UTF-8 and 'corrupt' the output), but I couldn't get the written output to print as I desired. I had a look at the codecs library, but wasn't successful. Having 'encoding = "utf-8"' in the XMLParser line didn't make any difference.
How can I configure the written output to be human readable?
I had all sorts of trouble with this before, but the solution is rather simple. There is a chapter in the documentation on how to read and write unicode to a file. This Python talk is also very enlightening for understanding the issue. Unicode can be a pain; it gets a lot easier if you start using Python 3, though.
import codecs
f = codecs.open('test', encoding='utf-8', mode='w+')
f.write(u'\u4500 blah blah blah\n')
f.seek(0)
print repr(f.readline()[:1])
f.close()
Your code looks OK, so I reckon your input is duff. Assuming you're viewing your output file with a UTF-8 viewer or shell, I suspect that the encoding declared in the <?xml ?> declaration doesn't match the actual encoding of the file.
This would explain why printing works but not writing to a file. If your shell/IDE is set to "ISO-8859-2" and your input XML is also "ISO-8859-2" then printing is pushing out the raw encoding.
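If you want to check this, a minimal sketch (assuming the document is parsed with lxml, as in the question's code) is to compare the encoding the parser picked up from the declaration against what you expect the bytes to be:

from lxml import etree

tree = etree.parse(file_name)
# docinfo.encoding reports the encoding lxml took from the <?xml ?> declaration;
# if this says e.g. ISO-8859-2 while the bytes are really UTF-8 (or vice versa),
# the characters will come out mangled exactly as described above
print(tree.docinfo.encoding)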
I wrote this code to validate my xml file against an xsd:
def parseAndObjectifyXml(xmlPath, xsdPath):
    from lxml import etree
    xsdFile = open(xsdPath)
    schema = etree.XMLSchema(file=xsdFile)
    xmlinput = open(xmlPath)
    xmlContent = xmlinput.read()
    myxml = etree.parse(xmlinput) # In this line xml input is empty
    schema.assertValid(myxml)
but when I want to validate it, my xmlinput is empty while my xmlContent is not empty.
What is the problem?
Files in python have a "current position"; it starts at the beginning of the file (position 0), then, as you read the file, the current position pointer moves along until it reaches the end.
You'll need to put that pointer back to the beginning before the lxml parser can read the contents in full. Use the .seek() method for that:
from lxml import etree
def parseAndObjectifyXml(xmlPath, xsdPath):
    xsdFile = open(xsdPath)
    schema = etree.XMLSchema(file=xsdFile)
    xmlinput = open(xmlPath)
    xmlContent = xmlinput.read()
    xmlinput.seek(0)
    myxml = etree.parse(xmlinput)
    schema.assertValid(myxml)
You only need to do this if you need xmlContent somewhere else too; you could alternatively pass it into the .parse() method if wrapped in a StringIO object to provide the necessary file object methods:
from lxml import etree
from cStringIO import StringIO
def parseAndObjectifyXml(xmlPath, xsdPath):
    xsdFile = open(xsdPath)
    schema = etree.XMLSchema(file=xsdFile)
    xmlinput = open(xmlPath)
    xmlContent = xmlinput.read()
    myxml = etree.parse(StringIO(xmlContent))
    schema.assertValid(myxml)
If you are not using xmlContent for anything else, then you do not need the extra .read() call either, and subsequently won't have problems parsing it with lxml; just omit the call altogether, and you won't need to move the current position pointer back to the start either:
from lxml import etree
def parseAndObjectifyXml(xmlPath, xsdPath):
    xsdFile = open(xsdPath)
    schema = etree.XMLSchema(file=xsdFile)
    xmlinput = open(xmlPath)
    myxml = etree.parse(xmlinput)
    schema.assertValid(myxml)
To learn more about .seek() (and its counterpart, .tell()), read up on file objects in the Python tutorial.
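For illustration, here is a small sketch (the file name 'example.xml' is just a placeholder) showing how the position pointer moves and how .seek() resets it:

f = open('example.xml')
print f.tell()         # 0: the pointer starts at the beginning
data = f.read()        # reading moves the pointer to the end of the file
print f.tell()         # now equal to the number of bytes read
print repr(f.read())   # '' - a second read from the end returns nothing
f.seek(0)              # move the pointer back to the start
print repr(f.read(10)) # the first 10 bytes again
f.close()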
You could instead parse the XML content that you have already read, using etree.fromstring():
xmlContent = xmlinput.read()
myxml = etree.fromstring(xmlContent)
instead of:
myxml = etree.parse(xmlinput)