I wrote this code to validate my XML file against an XSD:
def parseAndObjectifyXml(xmlPath, xsdPath):
    from lxml import etree
    xsdFile = open(xsdPath)
    schema = etree.XMLSchema(file=xsdFile)
    xmlinput = open(xmlPath)
    xmlContent = xmlinput.read()
    myxml = etree.parse(xmlinput) # In this line xml input is empty
    schema.assertValid(myxml)
But when I try to validate it, xmlinput is empty, even though xmlContent is not.
What is the problem?
Files in Python have a "current position". It starts at the beginning of the file (position 0), and as you read the file, the position pointer moves along until it reaches the end. Once xmlinput.read() has consumed the whole file, there is nothing left for a second read to return.
You'll need to put that pointer back to the beginning before the lxml parser can read the contents in full. Use the .seek() method for that:
from lxml import etree
def parseAndObjectifyXml(xmlPath, xsdPath):
    xsdFile = open(xsdPath)
    schema = etree.XMLSchema(file=xsdFile)
    xmlinput = open(xmlPath)
    xmlContent = xmlinput.read()
    xmlinput.seek(0)
    myxml = etree.parse(xmlinput)
    schema.assertValid(myxml)
You only need to do this if you need xmlContent somewhere else too; you could alternatively pass it into the .parse() method if wrapped in a StringIO object to provide the necessary file object methods:
from lxml import etree
from cStringIO import StringIO
def parseAndObjectifyXml(xmlPath, xsdPath):
    xsdFile = open(xsdPath)
    schema = etree.XMLSchema(file=xsdFile)
    xmlinput = open(xmlPath)
    xmlContent = xmlinput.read()
    myxml = etree.parse(StringIO(xmlContent))
    schema.assertValid(myxml)
If you are not using xmlContent for anything else, then you do not need the extra .read() call either, and subsequently won't have problems parsing it with lxml; just omit the call altogether, and you won't need to move the current position pointer back to the start either:
from lxml import etree
def parseAndObjectifyXml(xmlPath, xsdPath):
    xsdFile = open(xsdPath)
    schema = etree.XMLSchema(file=xsdFile)
    xmlinput = open(xmlPath)
    myxml = etree.parse(xmlinput)
    schema.assertValid(myxml)
To learn more about .seek() (and its counterpart, .tell()), read up on file objects in the Python tutorial.
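To see the position pointer in action without lxml, here is a minimal sketch that uses an in-memory file from the standard library. The XML snippet and the variable names are made up for illustration; a real file opened with open() should behave the same way.

```python
import io

# An in-memory text file standing in for open(xmlPath); the XML is hypothetical.
xmlinput = io.StringIO("<root><item>1</item></root>")

start = xmlinput.tell()      # position before anything has been read
content = xmlinput.read()    # reads everything and leaves the pointer at the end
leftover = xmlinput.read()   # a second read has nothing left to return

xmlinput.seek(0)             # move the pointer back to the beginning
again = xmlinput.read()      # the full content is available again

print(start, repr(leftover), again == content)
```

The second read() returning an empty string is the same thing the parser runs into when it is handed a file that has already been read to the end.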
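You should parse the XML content that you have already read. Note that etree.parse() expects a filename or file object, so pass the content to etree.fromstring() instead:
xmlContent = xmlinput.read()
myxml = etree.fromstring(xmlContent)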
instead of:
myxml = etree.parse(xmlinput)
Related
I read an XML file from GitLab into a variable, then do some manipulations with it. I need to overwrite the file in GitLab using that variable. When I use dump, it deletes everything from the file. How can I rewrite an XML file in GitLab from Python?
import gitlab
import io
import xml.etree.ElementTree as ET
gl = gitlab.Gitlab(
    private_token='xxxxx')
gl.auth()
projects = gl.projects.list(owned=True, search='Python')
raw_content = projects[0].files.raw(file_path='9_XML/XML_hw.xml', ref='main')
f = io.BytesIO()
f.write(raw_content)
f.seek(0)
xml_file = ET.parse(f) # read file
# ..... some manipulations with xml_file
project_id = 111111
project = gl.projects.get(project_id)
f = project.files.get(file_path='9_XML/XML_hw.xml', ref='main')
f.content = ET.dump(xml_file) # IT doesn't rewrite, it deletes everything from the file
f.save(branch='main', commit_message='Update file')
ET.dump doesn't produce a return value. It only prints to stdout. As stated in the docs:
Writes an element tree or element structure to sys.stdout. This function should be used for debugging only.
Hence, you end up setting f.content = None.
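Instead of using .dump, use .tostring. Because ET.parse returns an ElementTree and .tostring expects an Element, pass it the root element:
xml_str = ET.tostring(xml_file.getroot(), encoding='unicode')
f.content = xml_str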
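The difference between the two calls can be checked with the standard library alone. This is a small sketch with a made-up document (the invoice/total element names are not from the question); it does not touch GitLab at all.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical document standing in for the file fetched from GitLab.
tree = ET.parse(io.BytesIO(b"<invoice><total>10</total></invoice>"))
root = tree.getroot()
root.find("total").text = "12"   # some manipulation of the tree

# ET.dump() prints the XML to stdout and returns None.
dumped = ET.dump(tree)

# ET.tostring() returns the serialized XML; encoding="unicode" yields a str.
xml_str = ET.tostring(root, encoding="unicode")

print(dumped is None)
print(xml_str)
```

Assigning the result of ET.dump() therefore stores None, while the string from ET.tostring() is what can be assigned to f.content.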
I'm trying to make a class which makes it easier to handle XML-invoices, but I am having trouble getting ElementTree to work within a class.
This is the general idea of what I'm trying to do:
def open_invoice(input_file):
    with open(input_file, 'r', encoding = 'utf8') as invoice_file:
        return ET.parse(input_file).getroot()
This works fine, and I can make functions to handle the data without issue. But when trying to do the equivalent inside a class, I get an error message:
xml.etree.ElementTree.ParseError: no element found: line 1, column 0
I think this means that the parser is never given anything to parse, though I could be wrong.
The class:
import xmltodict
import xml.etree.ElementTree as ET
class Peppol:
    def __init__(self, peppol_invoice):
        self.invoice = xmltodict.parse(
            peppol_invoice.read()
        )
        self.root = ET.parse(peppol_invoice).getroot()
Making the class instance:
from pypeppol import Peppol
def open_invoice(input_file):
    with open(input_file, 'r', encoding = 'utf8') as invoice_file:
        return Peppol(invoice_file)
invoice = open_invoice('invoice.xml')
Help is much appreciated.
The error means that the parser found no XML element in its input: invoice.xml is empty, does not contain XML, or contains other stuff before the XML data.
import xml.etree.ElementTree as ET
with open('empty.xml', 'w') as f:
    f.write('')
    # or
    # f.write('No xml here!')
with open('empty.xml') as f:
    ET.parse(f).getroot()
xml.etree.ElementTree.ParseError: no element found: line 1, column 0
The problem here is that you are attempting to read the contents of the file peppol_invoice twice, once in the call to xmltodict.parse, and once in the call to ET.parse.
After the call to peppol_invoice.read() completes, peppol_invoice is left pointing at the end of the file. You get the error in your question title because when peppol_invoice is passed to ET.parse, there is nothing left to be read from the file.
If you want to read the contents of the file again, call peppol_invoice.seek(0) to reset the pointer back to the start of the file:
import xmltodict
import xml.etree.ElementTree as ET
class Peppol:
    def __init__(self, peppol_invoice):
        self.invoice = xmltodict.parse(
            peppol_invoice.read()
        )
        peppol_invoice.seek(0) # add this line
        self.root = ET.parse(peppol_invoice).getroot()
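The failure can be reproduced with the standard library only, which also shows an alternative to seek(0): read the file once and hand the same text to every parser that needs it. This is a sketch under assumptions; the Invoice/ID document is made up, and xmltodict is left out so the snippet has no third-party dependencies.

```python
import io
import xml.etree.ElementTree as ET

# In-memory stand-in for the opened invoice file; the content is hypothetical.
invoice_file = io.StringIO("<Invoice><ID>42</ID></Invoice>")

text = invoice_file.read()   # the first consumer reads to the end of the file

# A second consumer that reads from the file object now gets nothing.
try:
    ET.parse(invoice_file)
    second_parse_error = None
except ET.ParseError as exc:
    second_parse_error = str(exc)

# Alternative to seek(0): reuse the text that was already read.
root = ET.fromstring(text)
invoice_id = root.findtext("ID")

print(second_parse_error)
print(invoice_id)
```

Reading once and reusing the string avoids depending on the file position at all, at the cost of keeping the whole document in memory.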
I am trying to parse an XML file (containing bad characters) using the lxml module with recover=True.
Below is the code snippet:
from lxml import etree
f=open('test.xml')
data=f.read()
f.close()
parser = etree.XMLParser(recover=True)
x = etree.fromstring(data, parser=parser)
Now I want to create another XML file (test1.xml) from the above object (x).
Could anyone please help with this?
Thanks
I think this is what you are searching for
from lxml import etree
# open the source file in binary mode so lxml can deal with the encoding itself
with open('test.xml', 'rb') as f:
    # read the file contents
    data = f.read()
parser = etree.XMLParser(recover=True)
# fromstring() parses XML from a string directly into an Element
x = etree.fromstring(data, parser=parser)
# serialize the recovered tree; tostring() returns bytes by default
y = etree.tostring(x, pretty_print=True)
# write the serialized content to the output file
with open('test1.xml', 'wb') as f:
    f.write(y)
Because I'm not able to use an XSL IDE, I've written a super-simple Python script using lxml to transform a given XML file with a given XSL transform, and write the results to a file. As follows (abridged):
p = XMLParser(huge_tree=True)
xml = etree.parse(xml_filename, parser=p)
xml_root = xml.getroot()
print(xml_root.tag)
xslt_root = etree.parse(xsl_filename)
transform = etree.XSLT(xslt_root)
newtext = transform(xml)
with open(output, 'w') as f:
    f.write(str(newtext))
I'm getting the following error:
"lxml.etree.XSLTApplyError: Failed to evaluate the 'select' expression"
...but I have quite a number of select expressions in my XSLT. After having looked carefully and isolated blocks of code, I'm still at a loss as to which select is failing, or why.
Without trying to debug the code, is there a way to get more information out of lxml, like a line number or quote from the failing expression?
aaaaaand of course as soon as I actually take the time to post the question, I stumble upon the answer.
This might be a duplicate of this question, but I think the added benefit here is the Python side of things.
The linked answer points out that each parser includes an error log that you can access. The only "trick" is catching those errors so that you can look in the log once it's been created.
I did it thusly (perhaps also poorly, but it worked):
import lxml.etree as etree
from lxml.etree import XMLParser
xml_filename = '(some path to an XML file)'
xsl_filename = '(some path to an XSL file)'
output = '(some path to a file)'
p = XMLParser(huge_tree=True)
xml = etree.parse(xml_filename, parser=p)
xml_root = xml.getroot()
xslt_root = etree.parse(xsl_filename)
transform = etree.XSLT(xslt_root)
newtext = None
try:
    newtext = transform(xml)
    with open(output, 'w') as f:
        f.write(str(newtext))
except etree.XSLTApplyError:
    # the transform's error log has one entry per problem, with message and line
    for error in transform.error_log:
        print(error.message, error.line)
The messages in this log are more descriptive than those printed to the console, and the "line" element will point you to the line number where the failure occurred.
This is the code that results in an error message:
import urllib
import xml.etree.ElementTree as ET
url = raw_input('Enter URL:')
urlhandle = urllib.urlopen(url)
data = urlhandle.read()
tree = ET.parse(data)
The error:
I'm new to Python. I have read the documentation and a couple of tutorials, but clearly I have still done something wrong. I don't believe the problem is the XML file itself, because the same thing happens with two different XML files.
Consider using ElementTree's fromstring():
import urllib
import xml.etree.ElementTree as ET
url = raw_input('Enter URL:')
# http://feeds.bbci.co.uk/news/rss.xml?edition=int
urlhandle = urllib.urlopen(url)
data = urlhandle.read()
tree = ET.fromstring(data)
print ET.tostring(tree, encoding='utf8', method='xml')
data is a reference to the XML content as a string, but the parse() function expects a filename or file object as its argument. That's why there is an error.
urlhandle is a file object, so tree = ET.parse(urlhandle) should work for you.
The error message indicates that your code is trying to open a file whose name is stored in the variable source.
It's failing to open that file (IOError) because the variable source contains a bunch of XML, not a file name.
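The three answers above describe the same distinction, which can be sketched in Python 3 with the standard library only. The RSS snippet is a made-up stand-in for whatever urlhandle.read() returns, and the variable names are just for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the bytes returned by urlhandle.read().
data = b"<rss><channel><title>News</title></channel></rss>"

# fromstring() takes the XML content itself and returns the root Element.
root = ET.fromstring(data)

# parse() takes a filename or a file object and returns an ElementTree.
tree = ET.parse(io.BytesIO(data))

# Passing the content to parse() makes it treat the XML as a file name.
try:
    ET.parse(data.decode("utf-8"))
    open_failed = False
except OSError:
    open_failed = True

title_from_string = root.findtext("channel/title")
title_from_file = tree.getroot().findtext("channel/title")
print(title_from_string, title_from_file, open_failed)
```

Wrapping already-read content in io.BytesIO is only needed when an API insists on a file object; otherwise fromstring() is the more direct choice.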