lxml - parse an xml with no line breaks in it - python

I am using lxml's iterparse in Python to loop through the elements in my XML file. It works fine with most XML files but fails for some, one of which has no line breaks in it. The error and a sample of such XML are below. Any clues?
Thanks!!
<root><person><name>"xyz"</name><age>"10"</age></person><person><name>"abc"</name><age>"20"</age></person></root>
error -
XMLSyntaxError: Document is empty, line 1, column 1
code -
from lxml import etree
def parseXml(context,elemList):
    for event, element in context:
        if element.tag in elemList:
            #read text and attributes if any
            element.clear()
def main(object):
    elemList= ['name','age','id']
    context=etree.iterparse(fullFilePath, events=("start","end"))
    parseXml(context,elemList)

etree.iterparse expects a file or file-like object for its source argument, and the name of the variable you are passing, "fullFilePath", suggests that it is not a file (so the parser is trying to parse the file path instead of the file content).
Try passing an opened file instead.
context=etree.iterparse(open(fullFilePath), events=("start","end"))
or string:
from io import BytesIO
from lxml import etree
xml = b'<root><person><name>"xyz"</name><age>"10"</age></person><person><name>"abc"</name><age>"20"</age></person></root>\n'
def parseXml(context,elemList):
    for event, element in context:
        if element.tag in elemList:
            print(element.tag, end=' ')
            element.clear()
def main():
    elemList= ['name','age','id']
    # wrap the bytes in a file-like object so iterparse can read from it
    context=etree.iterparse(BytesIO(xml), events=("start","end"))
    parseXml(context,elemList)
main()
>>>name name age age name name age age
PS: And what do you mean by this?
def main(object):
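The approach in the answer above can be sketched as follows. This is a minimal illustration, not the asker's actual program: it uses the standard library's xml.etree.ElementTree.iterparse so that it runs without extra packages (lxml.etree.iterparse is called the same way), the helper name collect is hypothetical, and it listens only for "end" events so that each element's text is complete before it is read and cleared.

```python
from io import BytesIO
import xml.etree.ElementTree as ET

# The single-line sample document from the question, as bytes.
XML = (b'<root><person><name>"xyz"</name><age>"10"</age></person>'
       b'<person><name>"abc"</name><age>"20"</age></person></root>')

def collect(source, wanted):
    """Return (tag, text) pairs for the wanted tags, reading incrementally."""
    found = []
    # Only "end" events: the element is fully parsed when we see it.
    for event, element in ET.iterparse(source, events=("end",)):
        if element.tag in wanted:
            found.append((element.tag, element.text))
            element.clear()  # free the element's content once it is read
    return found

# BytesIO gives the parser a file-like object; an opened file works too.
pairs = collect(BytesIO(XML), {"name", "age", "id"})
print(pairs)
```

Line breaks play no role here: the parser reads the byte stream, so a document on a single line is handled like any other.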


How to locate an XML error in python given the line number and column number?

I am getting an error when I parse my xml. It gives a line and column number, but I am not sure how to go about locating it.
My code
import requests
from xml.etree import ElementTree as ET
urlBase = 'https://www.goodreads.com/review/list_rss/'
urlMiddle = '?shelf=read&order=d&sort=rating&per_page=200&page='
finalUrl = urlBase + str(32994) + urlMiddle + str(1)
resp = requests.get(finalUrl)
x = ET.fromstring(resp.content)
Error
File "<string>", line unknown
ParseError: not well-formed (invalid token): line 952, column 1023
I try to print the contents, but it's just one line
resp.content
The output is too big to print here.
So I'm not sure how to check a specific line since it's just one line.
You are trying to parse HTML content with an XML parser. You may run into problems if the content is not valid XML, that is, if it is not XHTML.
Instead, you can use an HTML parser such as the one available with lxml.
For instance:
from io import BytesIO
from lxml import etree
parser = etree.HTMLParser()
tree = etree.parse(BytesIO(resp.content), parser)
This should solve your issue.
Most likely you are on Windows and print isn't rendering line breaks such as \n.
Try adding:
open('resp.xml', 'wb').write(resp.content)
right after the line where you get resp.
Then you can open resp.xml in an editor and see what line 952 looks like.
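As an alternative to opening the file in an editor, the position can be read from the exception itself. The sketch below assumes the standard library parser from the question; the helper name locate_error, the context width, and the sample document are made up for illustration. ParseError carries a position attribute holding the line and column from the error message.

```python
import xml.etree.ElementTree as ET

def locate_error(data, context=20):
    """Return (line, column, snippet) for the first parse error, or None.

    data is expected to be bytes, as with resp.content.
    """
    try:
        ET.fromstring(data)
    except ET.ParseError as err:
        line, column = err.position  # same numbers as in the error message
        lines = data.splitlines()
        bad_line = lines[line - 1] if line - 1 < len(lines) else b""
        start = max(column - context, 0)
        # Slice a window of text around the reported column.
        return line, column, bad_line[start:column + context]
    return None

# Hypothetical broken document: a bare "&" is not well-formed XML.
broken = b"<feed>\n<item>ok</item>\n<item>a & b</item>\n</feed>"
print(locate_error(broken))
```

The snippet shows the text around the reported column, which is usually enough to spot an unescaped character or a stray tag.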

Python xml.ElementTree - function to return parsed xml in variable to be used later

I have a function which sends a GET request and parses the response as XML:
def get_object(object_name):
    ...
    ...
    #parse xml file
    encoded_text = response.text.encode('utf-8', 'replace')
    root = ET.fromstring(encoded_text)
    tree = ET.ElementTree(root)
    return tree
Then I use this function to loop through the objects in a list, get their XML, and store it in a variable:
jx_task_tree = ''
for jx in jx_tasks_lst:
    jx_task_tree += str(get_object(jx))
I am not sure whether the function returns data in the correct form for me to use later the way I need to.
When I try to parse the variable jx_task_tree like this:
parser = ET.XMLParser(encoding="utf-8")
print(type(jx_task_tree))
tree = ET.parse(jx_task_tree, parser=parser)
print(ET.tostring(tree))
it throws me an error:
Traceback (most recent call last):
File "import_uac_wf.py", line 59, in <module>
tree = ET.parse(jx_task_tree, parser=parser)
File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1182, in
parse
tree.parse(source, parser)
File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 647, in parse
source = open(source, "rb")
IOError: [Errno 36] File name too long:
'<xml.etree.ElementTree.ElementTree
object at 0x7ff2607c8910>\n<xml.etree.ElementTree.ElementTree object at
0x7ff2607e23d0>\n<xml.etree.ElementTree.ElementTree object at
0x7ff2607ee4d0>\n<xml.etree.ElementTree.ElementTree object at
0x7ff2607d8e90>\n<xml.etree.ElementTree.ElementTree object at
0x7ff2607e2550>\n<xml.etree.ElementTree.ElementTree object at
0x7ff2607889d0>\n<xml.etree.ElementTree.ElementTree object at
0x7ff26079f3d0>\n'
Could anybody tell me what get_object() should return and how to work with it later, so that what's returned can be joined into one variable and parsed?
Regarding your current exception:
According to [Python 3.Docs]: xml.etree.ElementTree.parse(source, parser=None) (emphasis is mine):
Parses an XML section into an element tree. source is a filename or file object containing XML data.
If you want to load the XML from a string, use ET.fromstring instead.
Then, as you suspected, the 2nd code snippet is completely wrong:
get_object(jx) returns already parsed XML, that is, an ElementTree object
Calling str on it yields its textual representation (e.g. "<xml.etree.ElementTree.ElementTree object at 0x7ff26079f3d0>"), which is not what you want
You could do something like:
jx_tasks_string = ""
for jx in jx_tasks_lst:
    jx_tasks_string += ET.tostring(get_object(jx).getroot())
Since jx_tasks_string is the concatenation of some strings obtained from parsing some XML blobs, there's no reason to parse it again.
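If the goal is one document that can be searched later, another option is to keep the parsed elements and attach them to a common root instead of concatenating strings. This is only a sketch under assumptions: parse_response stands in for get_object() (which is not shown in full), the wrapper tag "tasks" and the sample payloads are invented, and it targets Python 3, where encoding="unicode" makes ET.tostring return text.

```python
import xml.etree.ElementTree as ET

def parse_response(text):
    """Hypothetical stand-in for get_object(): parse XML text, return the root."""
    # fromstring takes the XML itself; ET.parse expects a filename or file object.
    return ET.fromstring(text)

# Invented sample payloads standing in for the HTTP responses.
responses = ["<task><name>a</name></task>", "<task><name>b</name></task>"]

combined = ET.Element("tasks")  # hypothetical wrapper element
for text in responses:
    combined.append(parse_response(text))

# The combined tree can be queried directly, with no second parse.
names = [task.findtext("name") for task in combined.findall("task")]
serialized = ET.tostring(combined, encoding="unicode")
print(names)
print(serialized)
```

Wrapping the pieces in a single root also keeps the result well-formed, which a plain concatenation of several documents would not be.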

TypeError: write() got an unexpected keyword argument 'pretty_print'

I am writing a Python script which will append a new tag/element to the config.xml of my Jenkins job.
This is how my script looks:
#!/usr/bin/python
import os, fnmatch, pdb, re, string, fileinput, sys
from lxml import etree
def find(pattern, path):
    result = []
    for root, dirs, files in os.walk(path):
        for name in files:
            if fnmatch.fnmatch(name, pattern):
                result.append(os.path.join(root, name))
    return result
finalresult = find('config.xml', './')
print finalresult
def writexml(filepath):
    tree = etree.parse(filepath)
    root = tree.getroot()
    a=[]
    for v in root.iter('publishers'):
        for a in v:
            if a.tag == "hudson.plugins.emailext.ExtendedEmailPublisher":
                t1=etree.SubElement(v,'org.jenkinsci.plugins.postbuildscript.PostBuildScript',{'plugin':"postbuildscript#017"})
                t2=etree.SubElement(t1,"buildSteps")
                t3=etree.SubElement(t2,'hudson.tasks.Shell')
                t4=etree.SubElement(t3,"command")
                t4.text = "bash -x /d0/jenkins/scripts/parent-pom-enforcer.sh"
                t5 = etree.SubElement(t1,'scriptOnlyIfSuccess')
                t5.text = "false"
                t6 = etree.SubElement(t1,'scriptOnlyIfFailure')
                t6.text = "false"
                t7= etree.SubElement(t1,'markBuildUnstable')
                t7.text = "true"
    tree.write(filepath,pretty_print=True)
findMavenProject=[]
for i in finalresult:
    tree = etree.parse(i)
    root = tree.getroot()
    for v in root.iter('hudson.tasks.Maven'):
        if v.tag == "hudson.tasks.Maven":
            writexml(i)
            findMavenProject.append(i)
print findMavenProject
When I execute this script, I get following error:
running with cElementTree on Python 2.5+
['./jen1/config.xml', './jen2/config.xml', './jen3/config.xml', './jen4/config.xml']
Traceback (most recent call last):
File "./find-test.py", line 50, in <module>
writexml(i)
File "./find-test.py", line 41, in writexml
tree.write(filepath,pretty_print=True)
TypeError: write() got an unexpected keyword argument 'pretty_print'
I googled this error and found that I should use "lxml". I did, but I am still getting the same error. I am using Python 2.7.6.
Any clue?
Python's standard library xml.etree.ElementTree.write() does not have the pretty_print argument.
The version of that method that is in recent versions of lxml does. I just tested it with lxml 3.3.3, and the docs at http://lxml.de/tutorial.html#serialisation mention it.
You are either using an old version of lxml, or somehow still using the standard library's older copy of the library.
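The difference can be checked directly against the standard library. The sketch below is an illustration under assumptions rather than a fix for the script above: it uses a made-up fragment, and it relies on ET.indent, which exists only in Python 3.9 and later, as the standard library's way to get indented output.

```python
import io
import xml.etree.ElementTree as ET

# Made-up fragment for illustration.
root = ET.fromstring("<publishers><step><command>run</command></step></publishers>")
tree = ET.ElementTree(root)

# The standard library's write() does not accept pretty_print.
try:
    tree.write(io.BytesIO(), pretty_print=True)
    rejected = False
except TypeError:
    rejected = True

# Python 3.9+: indent the tree in place, then serialize as usual.
ET.indent(tree, space="  ")
pretty = ET.tostring(root, encoding="unicode")
print(rejected)
print(pretty)
```

If the same TypeError appears with lxml imported, checking which module the tree object actually comes from, for example with type(tree), is a quick way to see whether the standard library version is still in use.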
With the standard library's ElementTree, tree.write(filepath, pretty_print=True) is wrong.
The right way: tree.write(filepath), or, with lxml, etree.tostring(root, pretty_print=True).
Detailed explanation:
If tree is a standard-library ElementTree object, its write() method has no pretty_print argument:
write(file, encoding="us-ascii", xml_declaration=None, default_namespace=None, method="xml")
Writes the element tree to a file, as XML. file is a file name, or a file object opened for writing. encoding [1] is the output encoding (default is US-ASCII). xml_declaration controls if an XML declaration should be added to the file. Use False for never, True for always, None for only if not US-ASCII or UTF-8 (default is None). default_namespace sets the default XML namespace (for “xmlns”). method is either "xml", "html" or "text" (default is "xml"). Returns an encoded string.
xml.etree.ElementTree.Element docs
UPDATE
To answer another question, Pretty printing XML in Python:
import xml.dom.minidom
my_xml_doc = xml.dom.minidom.parse(xml_fname) # or xml.dom.minidom.parseString(xml_string)
pretty_xml_as_string = my_xml_doc.toprettyxml()
UPDATE2
lxml.etree does have the pretty_print argument as follows:
etree.tostring(root, pretty_print=True)
lxml docs
Special thanks to @RemcoGerlich
The issue with the code below is that it creates duplicate empty lines with tabs if you are trying to edit an existing XML file.
import xml.dom.minidom
xml = xml.dom.minidom.parse(xml_fname) # or xml.dom.minidom.parseString(xml_string)
pretty_xml_as_string = xml.toprettyxml()
Solution: I tried the code below and it worked for me.
from lxml import etree as ET
import xml.dom.minidom
def writexml(filepath):
    parser = ET.XMLParser(resolve_entities=False, strip_cdata=False)
    tree = ET.parse(filepath, parser)
    root = tree.getroot()
    a=[]
    for v in root.iter('publishers'):
        for a in v:
            if a.tag == "hudson.plugins.emailext.ExtendedEmailPublisher":
                t1=ET.SubElement(v,'org.jenkinsci.plugins.postbuildscript.PostBuildScript',{'plugin':"postbuildscript#0.17"})
                t2=ET.SubElement(t1,"buildSteps")
                t3=ET.SubElement(t2,'hudson.tasks.Shell')
                t4=ET.SubElement(t3,"command")
                t4.text = "bash -x /d0/jenkins/scripts/parent-pom-enforcer.sh"
                t5 = ET.SubElement(t1,'scriptOnlyIfSuccess')
                t5.text = "false"
                t6 = ET.SubElement(t1,'scriptOnlyIfFailure')
                t6.text = "false"
                t7= ET.SubElement(t1,'markBuildUnstable')
                t7.text = "true"
    xml1 = xml.dom.minidom.parseString(ET.tostring(root, pretty_print=True))
    pretty_xml_as_string = xml1.toprettyxml()
    # write back only the non-blank lines to avoid the duplicate empty lines
    f = open(filepath, "w")
    for v in str(pretty_xml_as_string).split("\n"):
        if v.strip():
            f.write(v+"\n")
    f.close()
writexml('test.xml') #provide full path of the file as arg to function writexml.

"Invalid tag name" error when creating element with lxml in python

I am using lxml to make an xml file and my sample program is :
from lxml import etree
import datetime
dt=datetime.datetime(2013,11,30,4,5,6)
dt=dt.strftime('%Y-%m-%d')
page=etree.Element('html')
doc=etree.ElementTree(page)
dateElm=etree.SubElement(page,dt)
outfile=open('somefile.xml','w')
doc.write(outfile)
And I am getting the following error output :
dateElm=etree.SubElement(page,dt)
File "lxml.etree.pyx", line 2899, in lxml.etree.SubElement (src/lxml/lxml.etree.c:62284)
File "apihelpers.pxi", line 171, in lxml.etree._makeSubElement (src/lxml/lxml.etree.c:14296)
File "apihelpers.pxi", line 1523, in lxml.etree._tagValidOrRaise (src/lxml/lxml.etree.c:26852)
ValueError: Invalid tag name u'2013-11-30'
I thought it was a Unicode error, so I tried changing the encoding of 'dt' with code like
str(dt)
unicode(dt).encode('unicode_escape')
dt.encode('ascii','ignore')
dt.encode('ascii','decode')
and some others, but none worked and the same error message was generated.
You get the error because element names are not allowed to begin with a digit in XML. See http://www.w3.org/TR/xml/#sec-common-syn and http://www.w3.org/TR/xml/#sec-starttags. The first character of a name must be a NameStartChar, which disallows digits.
An element such as <2013-11-30>...</2013-11-30> is invalid.
An element such as <D2013-11-30>...</D2013-11-30> is OK.
If your program is changed to use ElementTree instead of lxml (from xml.etree import ElementTree as etree instead of from lxml import etree), there is no error. But I would consider that a bug. lxml does the right thing, ElementTree does not.
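To make the rule concrete, here is a small sketch. The regular expression is a simplified, ASCII-only approximation of the XML Name production (the real rule also allows many non-ASCII characters), and the helper name and the "entry" element are made up for illustration. It also shows one way around the problem: keeping the date in an attribute value, where a leading digit is fine.

```python
import re
import xml.etree.ElementTree as ET

# Simplified, ASCII-only approximation of the XML Name rule (an assumption,
# not the full production): the first character may not be a digit, "-" or ".".
ASCII_NAME = re.compile(r"[A-Za-z_:][A-Za-z0-9_.:\-]*")

def is_valid_ascii_name(tag):
    """Return True if tag looks like a valid (ASCII) XML element name."""
    return ASCII_NAME.fullmatch(tag) is not None

print(is_valid_ascii_name("2013-11-30"))   # starts with a digit
print(is_valid_ascii_name("D2013-11-30"))  # starts with a letter

# Alternative: store the date as an attribute value instead of a tag name.
page = ET.Element("html")
ET.SubElement(page, "entry", {"date": "2013-11-30"})  # "entry" is hypothetical
markup = ET.tostring(page, encoding="unicode")
print(markup)
```

Attribute values are ordinary text, so the naming restriction on elements does not apply to them.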
It is not about Unicode. There is no 2013-11-30 tag in HTML. You could use the time tag instead:
#!/usr/bin/env python
from datetime import date
from lxml.html import tostring
from lxml.html.builder import E
datestr = date(2013, 11, 30).strftime('%Y-%m-%d')
page = E.html(
    E.title("date demo"),
    E('time', "some value", datetime=datestr))
with open('somefile.html', 'wb') as file:
    file.write(tostring(page, doctype='<!doctype html>', pretty_print=True))

Python , XML Index error

Hello, I am having trouble with an XML file I am using. On a short XML file the program works fine, but once the file reaches a certain size (I am thinking 1 MB) it gives me an "IndexError: list index out of range".
Here is the code I have written so far.
from xml.dom import minidom
import smtplib
from email.mime.text import MIMEText
from datetime import datetime
def xml_data():
    f = open(r'C:\opidea_2.xml', 'r')
    data = f.read()
    f.close()
    dom = minidom.parseString(data)
    ic = (dom.getElementsByTagName('logentry'))
    dom = None
    content = ''
    for num in ic:
        name = num.getElementsByTagName('author')[0].firstChild.nodeValue
        if name:
            content += "***Changes by:" + str(name) + "*** " + '\n\n Date: '
        else:
            content += "***Changes are made Anonymously *** " + '\n\n Date: '
    print content
if __name__ == "__main__":
    xml_data ()
Here is part of the xml if it helps.
<log>
<logentry
revision="33185">
<author>glv</author>
<date>2012-08-06T21:01:52.494219Z</date>
<paths>
<path
kind="file"
action="M">/branches/Patch_4_2_0_Branch/text.xml</path>
<path
kind="dir"
action="M">/branches/Patch_4_2_0_Branch</path>
</paths>
<msg>PATCH_BRANCH:N/A
BUG_NUMBER:N/A
FEATURE_AFFECTED:N/A
OVERVIEW:N/A
Adding the SVN log size requirement to the branch
</msg>
</logentry>
</log>
The actual XML file is much bigger, but this is the general format. It works when the file is this small, but once it gets bigger I get problems.
Here is the traceback:
Traceback (most recent call last):
File "C:\python\src\SVN_Email_copy.py", line 141, in <module>
xml_data ()
File "C:\python\src\SVN_Email_copy.py", line 50, in xml_data
name = num.getElementsByTagName('author')[0].firstChild.nodeValue
IndexError: list index out of range
Based on the code provided your error is going to be in this line:
name = num.getElementsByTagName('author')[0].firstChild.nodeValue
#xml node-^
#function call -------------------------^
#list indexing ----------------------------^
#attribute access -------------------------------------^
That's the only place in the demonstrated code that you're indexing into a list. That would imply that in your larger XML Sample you're missing an <author> tag. You'll have to correct that, or add in some level of error handling / data validation.
Please see the code elaboration for more explanation. You're doing a ton of things in a single line by taking advantage of the return behaviors of successive commands. So, the num is defined, that's fine. Then you call a function (method). It returns a list. You attempt to retrieve from that list and it throws an exception, so you never make it to the Attribute Access to get to firstChild, which definitely means you get no nodeValue.
Error checking may look something like this:
authors = num.getElementsByTagName('author')
if len(authors) > 0:
    name = authors[0].firstChild.nodeValue
Though there are many, many ways you could achieve that.
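One of those ways, sketched below, wraps the lookup in a small helper that also covers an <author> element with no text, since firstChild is None in that case. The helper name author_of and the sample log are made up for illustration; they only assume the <logentry>/<author> layout shown in the question.

```python
from xml.dom import minidom

# Made-up log: one normal entry, one without <author>, one with an empty <author>.
LOG = """<log>
<logentry revision="1"><author>glv</author></logentry>
<logentry revision="2"></logentry>
<logentry revision="3"><author></author></logentry>
</log>"""

def author_of(entry):
    """Return the author's name for a logentry node, or None if unavailable."""
    authors = entry.getElementsByTagName("author")
    # Guard both cases: no <author> at all, and an <author> with no text child.
    if authors and authors[0].firstChild is not None:
        return authors[0].firstChild.nodeValue
    return None

dom = minidom.parseString(LOG)
names = [author_of(entry) for entry in dom.getElementsByTagName("logentry")]
print(names)
```

With a helper like this, the existing "if name:" branch in the question's loop can fall through to the anonymous case instead of raising an exception.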
