Why does lxml ElementMaker break when I use an integer as an attribute value? - python

I am trying to create an XML document with the help of lxml. I realized that
ElementMaker breaks when I use an integer as an attribute value.
Code
from lxml.builder import ElementMaker
from lxml import etree
maker = ElementMaker()
maker.text(**{'label': 'my textarea'}) # works
maker.ratings(**{'points':5}) # breaks
Error
File "/usr/local/lib/python2.7/dist-packages/lxml/builder.py", line 210, in __call__
get(dict)(elem, attrib)
File "/usr/local/lib/python2.7/dist-packages/lxml/builder.py", line 197, in add_dict
attrib[k] = typemap[type(v)](None, v)
KeyError: <type 'int'>
Why can't I assign an integer as the attribute value?

XML has no integer type for attributes: every attribute value is text, and ElementMaker does not convert an int for you by default.
You can store the data as a string and convert it to the required type when you parse the document.
In your case, try using 'points': "5" (or str(5)) and then convert the string back to an integer when you parse it.
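A minimal sketch of the string-conversion approach. The helper name stringify_attrs is hypothetical and not part of lxml. The sketch uses the standard library's xml.etree.ElementTree so that it runs without lxml installed; the converted dict can be passed to maker.ratings(**attrs) in the same way.

```python
import xml.etree.ElementTree as ET


def stringify_attrs(attrs):
    """Return a copy of attrs with every value converted to str (hypothetical helper)."""
    return {key: str(value) for key, value in attrs.items()}


# Convert the attribute values before handing them to the element factory.
attrs = stringify_attrs({'points': 5})
ratings = ET.Element('ratings', attrs)
serialized = ET.tostring(ratings)
```

When reading the document back, int(ratings.get('points')) recovers the number.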

Related

How to locate an XML error in python given the line number and column number?

I am getting an error when I parse my xml. It gives a line and column number, but I am not sure how to go about locating it.
My code
import requests
from xml.etree import ElementTree as ET
urlBase = 'https://www.goodreads.com/review/list_rss/'
urlMiddle = '?shelf=read&order=d&sort=rating&per_page=200&page='
finalUrl = urlBase + str(32994) + urlMiddle + str(1)
resp = requests.get(finalUrl)
x = ET.fromstring(resp.content)
Error
File "<string>", line unknown
ParseError: not well-formed (invalid token): line 952, column 1023
I tried to print the contents, but it's just one line:
resp.content
The output is too big to print here.
So I'm not sure how to check a specific line since it's just one line.
You are trying to parse HTML content with an XML parser. You may run into problems if the content is not well-formed XML, that is, if it is not XHTML.
Instead, you can use an HTML parser such as the one available in lxml.
For instance:
from io import BytesIO
from lxml import etree
parser = etree.HTMLParser()
tree = etree.parse(BytesIO(resp.content), parser)
This will solve your issue.
Most likely you are on Windows and the print isn't respecting line breaks such as \n.
Try adding this right after the line where you get resp:
with open('resp.xml', 'wb') as f:
    f.write(resp.content)
Then you can open resp.xml in an editor and see what line 952 looks like.
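The line and column can also be looked up in code. This is a sketch, not a definitive tool: show_parse_error is a hypothetical helper, and it assumes that the position reported by the standard library's ParseError is a 1-based line number and a 0-based column counted in bytes.

```python
from xml.etree import ElementTree as ET


def show_parse_error(data, context=40):
    """Return (line, column, snippet) around the parse failure, or None if data parses.

    `data` is expected to be bytes, e.g. resp.content.
    """
    try:
        ET.fromstring(data)
    except ET.ParseError as err:
        line_no, col = err.position  # assumed: 1-based line, 0-based column
        lines = data.splitlines()
        bad_line = lines[line_no - 1]
        start = max(col - context, 0)
        return line_no, col, bad_line[start:col + context]
    return None


# A bare "&" is not well-formed XML, so this input fails on its second line.
result = show_parse_error(b"<root>\n<a>x & y</a>\n</root>")
```

Calling show_parse_error(resp.content) would return the text surrounding line 952, column 1023 without opening an editor.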

Python xml.ElementTree - function to return parsed xml in variable to be used later

I have a function which sends a GET request and parses the response as XML:
def get_object(object_name):
    ...
    ...
    # parse xml file
    encoded_text = response.text.encode('utf-8', 'replace')
    root = ET.fromstring(encoded_text)
    tree = ET.ElementTree(root)
    return tree
Then I use this function to loop through the objects in a list, get their XML, and store it in a variable:
jx_task_tree = ''
for jx in jx_tasks_lst:
    jx_task_tree += str(get_object(jx))
I am not sure whether the function returns the data in the right form for me to use it later the way I need to.
When I try to parse the variable jx_task_tree like this:
parser = ET.XMLParser(encoding="utf-8")
print(type(jx_task_tree))
tree = ET.parse(jx_task_tree, parser=parser)
print(ET.tostring(tree))
it throws an error:
Traceback (most recent call last):
File "import_uac_wf.py", line 59, in <module>
tree = ET.parse(jx_task_tree, parser=parser)
File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1182, in parse
tree.parse(source, parser)
File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 647, in parse
source = open(source, "rb")
IOError: [Errno 36] File name too long: '<xml.etree.ElementTree.ElementTree object at 0x7ff2607c8910>\n<xml.etree.ElementTree.ElementTree object at 0x7ff2607e23d0>\n<xml.etree.ElementTree.ElementTree object at 0x7ff2607ee4d0>\n<xml.etree.ElementTree.ElementTree object at 0x7ff2607d8e90>\n<xml.etree.ElementTree.ElementTree object at 0x7ff2607e2550>\n<xml.etree.ElementTree.ElementTree object at 0x7ff2607889d0>\n<xml.etree.ElementTree.ElementTree object at 0x7ff26079f3d0>\n'
Could anybody tell me what get_object() should return, and how to work with the result later, so that the returned values can be joined into one variable and parsed?
Regarding your current exception:
According to [Python 3.Docs]: xml.etree.ElementTree.parse(source, parser=None) (emphasis is mine):
Parses an XML section into an element tree. source is a filename or file object containing XML data.
If you want to load the XML from a string, use ET.fromstring instead.
Then, as you suspected, the 2nd code snippet is completely wrong:
get_object(jx) returns already parsed XML, that is, an ElementTree object
Calling str on it yields its textual representation (e.g. "<xml.etree.ElementTree.ElementTree object at 0x7ff26079f3d0>"), which is not what you want
You could do something like:
jx_tasks_string = ""
for jx in jx_tasks_lst:
    jx_tasks_string += ET.tostring(get_object(jx).getroot())
Since jx_tasks_string is the concatenation of strings obtained by serializing already parsed XML, there's no reason to parse it again. (On Python 3, ET.tostring returns bytes unless you pass encoding="unicode".)
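If the goal is a single tree that can be processed later, one alternative is to keep the parsed roots and attach them to a common parent, because several concatenated documents have no single root element. The sketch below assumes Python 3; combine_roots and the wrapper tag "tasks" are hypothetical names, and the two inline strings stand in for the responses that get_object() would fetch.

```python
import xml.etree.ElementTree as ET


def combine_roots(xml_blobs, wrapper_tag="tasks"):
    """Parse each XML string and collect the roots under one wrapper element."""
    wrapper = ET.Element(wrapper_tag)
    for blob in xml_blobs:
        wrapper.append(ET.fromstring(blob))
    return wrapper


# Stand-in data; in the question these would be the downloaded XML documents.
combined = combine_roots(["<task id='1'/>", "<task id='2'/>"])
as_text = ET.tostring(combined, encoding="unicode")
```

The combined element can be searched with combined.findall("task") or written out once with ET.tostring.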

"Invalid tag name" error when creating element with lxml in python

I am using lxml to make an XML file, and my sample program is:
from lxml import etree
import datetime
dt=datetime.datetime(2013,11,30,4,5,6)
dt=dt.strftime('%Y-%m-%d')
page=etree.Element('html')
doc=etree.ElementTree(page)
dateElm=etree.SubElement(page,dt)
outfile=open('somefile.xml','w')
doc.write(outfile)
And I am getting the following error output:
dateElm=etree.SubElement(page,dt)
File "lxml.etree.pyx", line 2899, in lxml.etree.SubElement (src/lxml/lxml.etree.c:62284)
File "apihelpers.pxi", line 171, in lxml.etree._makeSubElement (src/lxml/lxml.etree.c:14296)
File "apihelpers.pxi", line 1523, in lxml.etree._tagValidOrRaise (src/lxml/lxml.etree.c:26852)
ValueError: Invalid tag name u'2013-11-30'
I thought it was a Unicode error,
so I tried changing the encoding of 'dt' with code like
str(dt)
unicode(dt).encode('unicode_escape')
dt.encode('ascii','ignore')
dt.encode('ascii','decode')
and some others as well, but none worked and the same error message was generated.
You get the error because element names are not allowed to begin with a digit in XML. See http://www.w3.org/TR/xml/#sec-common-syn and http://www.w3.org/TR/xml/#sec-starttags. The first character of a name must be a NameStartChar, which disallows digits.
An element such as <2013-11-30>...</2013-11-30> is invalid.
An element such as <D2013-11-30>...</D2013-11-30> is OK.
If your program is changed to use ElementTree instead of lxml (from xml.etree import ElementTree as etree instead of from lxml import etree), there is no error. But I would consider that a bug. lxml does the right thing, ElementTree does not.
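A small sketch of the D-prefix idea from the example above. safe_tag is a hypothetical helper, and its check is a deliberate simplification: it only tests for an ASCII letter or underscore, whereas the NameStartChar production in the XML specification allows more characters. The standard library's ElementTree is used here only so the sketch runs without lxml.

```python
import re
import xml.etree.ElementTree as ET


def safe_tag(name, prefix="D"):
    """Prefix names that do not start with an ASCII letter or underscore."""
    if re.match(r"[A-Za-z_]", name):
        return name
    return prefix + name


page = ET.Element("html")
# "2013-11-30" starts with a digit, so it becomes "D2013-11-30".
date_elm = ET.SubElement(page, safe_tag("2013-11-30"))
```

Another option is to keep a fixed element name and store the date in an attribute or as the element's text, which avoids encoding data in tag names at all.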
It is not about Unicode. There is no 2013-11-30 tag in HTML. You could use the time tag instead:
#!/usr/bin/env python
from datetime import date
from lxml.html import tostring
from lxml.html.builder import E
datestr = date(2013, 11, 30).strftime('%Y-%m-%d')
page = E.html(
    E.title("date demo"),
    E('time', "some value", datetime=datestr))
with open('somefile.html', 'wb') as file:
    file.write(tostring(page, doctype='<!doctype html>', pretty_print=True))

objectify and etree elements

The module I've been writing works finestkind with the test data file, but totally moofs on the live data from flickrapi.
After days of frustration (see, I DO have a lot of nothing to do!) I think I found the problem, but I don't know the fix for it.
Internal test data returns a type() of: <type 'str'>
External test data returns a type() of: <type 'str'> ## opening & reading external XML
Live data returns a type() of: <class 'xml.etree.ElementTree.Element'>
Beyond this point in the module, I use objectify. Objectify parses <type 'str'> just fine, but it will not read the etree elements. I think I need to convert the class 'xml.etree.ElementTree.Element' to str(), but haven't sussed that out yet.
The error I get from objectify.fromstring() is:
Traceback (most recent call last):
File "C:\Mirc\Python\Temp Files\test_lxml_2.py", line 101, in <module>
Grp = objectify.fromstring(flickr.groups_getInfo(group_id=gid))
File "lxml.objectify.pyx", line 1791, in lxml.objectify.fromstring (src\lxml\lxml.objectify.c:20904)
File "lxml.etree.pyx", line 2994, in lxml.etree.fromstring (src\lxml\lxml.etree.c:63296)
File "parser.pxi", line 1614, in lxml.etree._parseMemoryDocument (src\lxml\lxml.etree.c:93607)
ValueError: can only parse strings
Please help before the boss turns loose those damn flying monkeys again!!!
import fileinput
from lxml import html, etree, objectify
import re
import time
import flickrapi
if '#N' in gid:
    try:
        if tst:
            Grp = objectify.fromstring(test_data)
        else:
            Grp = objectify.fromstring(flickr.groups_getInfo(group_id=gid))
        fErr = ''
        mn = Grp.xpath(u'//group')[0].attrib
        res = Grp.xpath(u'//restrictions')[0].attrib
        root = Grp.group
        gNSID = gid
        gAlias = ""
        err_tst = getattr(root, "not-there", "Error OK")
        gName = getattr(root, "name", "")
        Images = getattr(root, 'pool_count', (-1))
        Mbr = getattr(root, "members", (-1))
The solution is to stop converting your live data to xml.etree.ElementTree.Element objects before invoking the objectify api.
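If that's impossible (which I doubt), you can render the XML back to a text representation and pass that to lxml.objectify.fromstring. Because the live data is a standard-library xml.etree.ElementTree.Element rather than an lxml element, serialize it with xml.etree.ElementTree.tostring; lxml.etree.tostring only accepts lxml's own elements.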
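A sketch of that conversion step, assuming Python 3. to_xml_bytes is a hypothetical helper that accepts either text or a standard-library Element, and the sample XML is made up for illustration rather than taken from a real Flickr response.

```python
import xml.etree.ElementTree as ET


def to_xml_bytes(data):
    """Return XML as bytes whether data is a stdlib Element or already text/bytes."""
    if isinstance(data, bytes):
        return data
    if isinstance(data, str):
        return data.encode("utf-8")
    # Anything else is assumed to be a stdlib Element.
    return ET.tostring(data)


# Made-up stand-in for the Element returned by the live API call.
live = ET.fromstring("<rsp><group id='1'><name>demo</name></group></rsp>")
xml_bytes = to_xml_bytes(live)
# With lxml installed, the result can then be parsed as before:
#     Grp = objectify.fromstring(xml_bytes)
```

This keeps the rest of the module unchanged, since objectify receives the same kind of input for test data and live data.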
I think the "test_data" that you pass to objectify.fromstring is an instance of StringIO, so you must read it first and then objectify it:
objectify.fromstring(test_data.read())

lxml - parse an xml with no line breaks in it

I am using lxml iterparse in python to loop through the elements in my xml file. It works fine with most of the xmls, but fails for some. One of them has no line breaks in it. The error and a sample of such xml are as below. Any clues?
Thanks!!
<root><person><name>"xyz"</name><age>"10"</age></person><person><name>"abc"</name><age>"20"</age></person></root>
error -
XMLSyntaxError: Document is empty, line 1, column 1
code -
from lxml import etree
def parseXml(context,elemList):
    for event, element in context:
        if element.tag in elemList:
            #read text and attributes if any
            element.clear()
def main(object):
    elemList= ['name','age','id']
    context=etree.iterparse(fullFilePath, events=("start","end"))
    parseXml(context,elemList)
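The name of the variable you are passing to etree.iterparse, "fullFilePath", tells me that it is a path and not an open file. lxml documents the source argument as a filename or a file-like object, so a path should also be accepted, but passing an already opened file rules out the parser working on something other than the file's content.
Try passing an opened file instead:
context=etree.iterparse(open(fullFilePath, 'rb'), events=("start","end"))
or a string wrapped in a file-like object: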
from io import BytesIO
from lxml import etree
xml = b'<root><person><name>"xyz"</name><age>"10"</age></person><person><name>"abc"</name><age>"20"</age></person></root>\n'
def parseXml(context, elemList):
    for event, element in context:
        if element.tag in elemList:
            print(element.tag, end=' ')
            element.clear()
def main():
    elemList = ['name', 'age', 'id']
    context = etree.iterparse(BytesIO(xml), events=("start", "end"))
    parseXml(context, elemList)
main()
>>>name name age age name name age age
PS: And what do you mean by this?
def main(object):
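One more detail when reading values inside such a loop: an element's text is only guaranteed to be complete on its "end" event, so that is the safer place to read it before clearing. The sketch below uses the standard library's iterparse, which has a similar interface, so that it runs without lxml; collect_text is a hypothetical helper.

```python
import io
import xml.etree.ElementTree as ET


def collect_text(source, wanted):
    """Yield (tag, text) for each wanted element, read on its "end" event."""
    for event, element in ET.iterparse(source, events=("end",)):
        if element.tag in wanted:
            yield element.tag, element.text
            element.clear()  # free the element once its text has been read


xml = b'<root><person><name>"xyz"</name><age>"10"</age></person></root>'
pairs = list(collect_text(io.BytesIO(xml), {"name", "age", "id"}))
```

Here pairs holds one (tag, text) tuple per matching element, in document order.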
