objectify and etree elements - python

The module I've been writing works finestkind with the test data file, but totally moofs on the live data from flickrapi.
After days of frustration (see, I DO have a lot of nothing to do!) I think I found the problem, but I don't know the fix for it.
Internal test data returns a type() of: <type 'str'>
External test data returns a type() of: <type 'str'> ## opening & reading external XML
Live data returns a type() of: <class 'xml.etree.ElementTree.Element'>
Beyond this point in the module, I use objectify. Objectify parses <type 'str'> just fine, but it will not read the etree elements. I think I need to convert the class 'xml.etree.ElementTree.Element' to str(), but haven't sussed that out yet.
The error I get from objectify.fromstring() is:
Traceback (most recent call last):
  File "C:\Mirc\Python\Temp Files\test_lxml_2.py", line 101, in <module>
    Grp = objectify.fromstring(flickr.groups_getInfo(group_id=gid))
  File "lxml.objectify.pyx", line 1791, in lxml.objectify.fromstring (src\lxml\lxml.objectify.c:20904)
  File "lxml.etree.pyx", line 2994, in lxml.etree.fromstring (src\lxml\lxml.etree.c:63296)
  File "parser.pxi", line 1614, in lxml.etree._parseMemoryDocument (src\lxml\lxml.etree.c:93607)
ValueError: can only parse strings
Please help before the boss turns loose those damn flying monkeys again!!!
import fileinput
from lxml import html, etree, objectify
import re
import time
import flickrapi
if '#N' in gid:
    try:
        if tst:
            Grp = objectify.fromstring(test_data)
        else:
            Grp = objectify.fromstring(flickr.groups_getInfo(group_id=gid))
        fErr = ''
        mn = Grp.xpath(u'//group')[0].attrib
        res = Grp.xpath(u'//restrictions')[0].attrib
        root = Grp.group
        gNSID = gid
        gAlias = ""
        err_tst = getattr(root, "not-there", "Error OK")
        gName = getattr(root, "name", "")
        Images = getattr(root, 'pool_count', (-1))
        Mbr = getattr(root, "members", (-1))

The solution is to stop converting your live data to xml.etree.ElementTree.Element objects before invoking the objectify API.
If that's impossible (which I doubt), you can render the XML back to a text representation with xml.etree.ElementTree.tostring (lxml.etree.tostring only accepts lxml's own elements), then pass the result to lxml.objectify.fromstring.
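If the conversion cannot be avoided, the round trip through text might look like the sketch below. It serializes the element with the standard library's ElementTree.tostring, since the live data is a standard-library element. The helper name element_to_xml_bytes and the sample <rsp> payload are made up for illustration, and the final objectify call is left as a comment because it assumes lxml is installed.

```python
import xml.etree.ElementTree as ET


def element_to_xml_bytes(element):
    """Serialize a standard-library Element back to UTF-8 encoded XML."""
    return ET.tostring(element, encoding="utf-8")


# Hypothetical stand-in for what flickr.groups_getInfo() returned in the question.
live = ET.fromstring('<rsp stat="ok"><group id="1"><name>Demo</name></group></rsp>')

xml_bytes = element_to_xml_bytes(live)
print(type(xml_bytes))

# With lxml installed, the bytes can then be handed to objectify:
#     from lxml import objectify
#     Grp = objectify.fromstring(xml_bytes)
```

Passing bytes rather than a decoded string also sidesteps lxml's refusal to parse unicode strings that carry an encoding declaration.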

I think the test_data that you pass to objectify.fromstring is an instance of StringIO, so you must read it first and then objectify it:
objectify.fromstring(test_data.read())

Related

How to locate an XML error in python given the line number and column number?

I am getting an error when I parse my xml. It gives a line and column number, but I am not sure how to go about locating it.
My code
import requests
from xml.etree import ElementTree as ET
urlBase = 'https://www.goodreads.com/review/list_rss/'
urlMiddle = '?shelf=read&order=d&sort=rating&per_page=200&page='
finalUrl = urlBase + str(32994) + urlMiddle + str(1)
resp = requests.get(finalUrl)
x = ET.fromstring(resp.content)
Error
File "<string>", line unknown
ParseError: not well-formed (invalid token): line 952, column 1023
I try to print the contents, but it's just one line
resp.content
The output is too big to print here.
So I'm not sure how to check a specific line since it's just one line.
You are trying to parse HTML content with an XML parser. You may run into problems if the content is not well-formed XML, that is, if it is not XHTML.
Instead, you can use an HTML parser such as the one that ships with lxml.
For instance:
from io import BytesIO
from lxml import etree
parser = etree.HTMLParser()
tree = etree.parse(BytesIO(resp.content), parser)
This should get you past the parse error.
Most likely you are on Windows and print isn't respecting, e.g., \n.
Try adding:
with open('resp.xml', 'wb') as f:
    f.write(resp.content)
after the line where you get resp.
Then you can open resp.xml in an editor and see what line 952 looks like.
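The position can also be read programmatically instead of opening an editor. The sketch below relies on the position attribute of xml.etree.ElementTree.ParseError, a (line, column) tuple; the helper name show_parse_error and the broken sample document are made up for illustration.

```python
import xml.etree.ElementTree as ET


def show_parse_error(data, context=40):
    """Return the text around the position reported by ParseError.

    Returns None when the document parses cleanly.
    """
    try:
        ET.fromstring(data)
    except ET.ParseError as err:
        line, column = err.position  # lines are numbered from 1, columns from 0
        text = data.decode("utf-8", "replace") if isinstance(data, bytes) else data
        bad_line = text.split("\n")[line - 1]
        start = max(column - context, 0)
        return bad_line[start:column + context]
    return None


# Hypothetical broken input: a bare "&" is not allowed in XML text.
broken = "<root>\n  <title>Tom & Jerry</title>\n</root>"
print(show_parse_error(broken))
```

In the question's case this would be called as show_parse_error(resp.content), which should print the neighbourhood of line 952, column 1023.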

Python xml.ElementTree - function to return parsed xml in variable to be used later

I have a function which sends a GET request and parses the response as XML:
def get_object(object_name):
    ...
    ...
    # parse xml file
    encoded_text = response.text.encode('utf-8', 'replace')
    root = ET.fromstring(encoded_text)
    tree = ET.ElementTree(root)
    return tree
Then I use this function to loop through object from list to get xmls and store them in variable:
jx_task_tree = ''
for jx in jx_tasks_lst:
    jx_task_tree += str(get_object(jx))
I am not sure whether the function returns the data in the right form for the way I need to use it later.
When I want to parse variable jx_task_tree like this:
parser = ET.XMLParser(encoding="utf-8")
print(type(jx_task_tree))
tree = ET.parse(jx_task_tree, parser=parser)
print(ET.tostring(tree))
it throws me an error:
Traceback (most recent call last):
  File "import_uac_wf.py", line 59, in <module>
    tree = ET.parse(jx_task_tree, parser=parser)
  File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1182, in parse
    tree.parse(source, parser)
  File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 647, in parse
    source = open(source, "rb")
IOError: [Errno 36] File name too long: '<xml.etree.ElementTree.ElementTree object at 0x7ff2607c8910>\n<xml.etree.ElementTree.ElementTree object at 0x7ff2607e23d0>\n<xml.etree.ElementTree.ElementTree object at 0x7ff2607ee4d0>\n<xml.etree.ElementTree.ElementTree object at 0x7ff2607d8e90>\n<xml.etree.ElementTree.ElementTree object at 0x7ff2607e2550>\n<xml.etree.ElementTree.ElementTree object at 0x7ff2607889d0>\n<xml.etree.ElementTree.ElementTree object at 0x7ff26079f3d0>\n'
Could anybody tell me what get_object() should return and how to work with it afterwards, so that the returned values can be joined into one variable and parsed?
Regarding your current exception:
According to the Python 3 docs for xml.etree.ElementTree.parse(source, parser=None) (emphasis is mine):
Parses an XML section into an element tree. source is a filename or file object containing XML data.
If you want to load the XML from a string, use ET.fromstring instead.
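The difference between the two entry points can be seen in a few lines. This is a minimal sketch with a made-up sample document: fromstring takes the XML text itself, while parse needs a filename or a file-like object, here an io.StringIO wrapper.

```python
import io
import xml.etree.ElementTree as ET

xml_text = "<task><name>demo</name></task>"  # made-up sample document

# fromstring() takes the XML text itself and returns the root Element.
root_from_string = ET.fromstring(xml_text)

# parse() wants a filename or file object, so wrap the text in StringIO.
tree = ET.parse(io.StringIO(xml_text))
root_from_file = tree.getroot()

print(root_from_string.tag, root_from_file.find("name").text)
```

Passing the raw text to parse, as in the question, makes it try to open a file with that name, which is where the "File name too long" error comes from.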
Then, as you suspected, the 2nd code snippet is completely wrong:
get_object(jx) returns an already parsed XML, so an ElementTree object
Calling str on it, will yield its textual representation (e.g. "<xml.etree.ElementTree.ElementTree object at 0x7ff26079f3d0>") which is not what you want
You could do something like:
jx_tasks_string = ""
for jx in jx_tasks_lst:
    # On Python 3, ET.tostring() returns bytes unless you pass encoding="unicode".
    jx_tasks_string += ET.tostring(get_object(jx).getroot())
Since jx_tasks_string is the concatenation of some strings obtained from parsing some XML blobs, there's no reason to parse it again.
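If the goal is a single document that can be queried later, one option is to skip strings entirely and collect the parsed roots under a new wrapper element. This is only a sketch of that idea, not something the original answer proposes; combine_trees, the "tasks" wrapper tag and the sample <task> elements are made-up names.

```python
import xml.etree.ElementTree as ET


def combine_trees(trees, wrapper_tag="tasks"):
    """Collect the root of every ElementTree under one new wrapper element."""
    wrapper = ET.Element(wrapper_tag)
    for tree in trees:
        wrapper.append(tree.getroot())
    return ET.ElementTree(wrapper)


# Stand-ins for the trees that get_object() returns in the question.
parsed = [ET.ElementTree(ET.fromstring(f"<task id='{n}'/>")) for n in (1, 2)]

combined = combine_trees(parsed)
print(ET.tostring(combined.getroot(), encoding="unicode"))
```

A wrapper also keeps the result well-formed: plain concatenation of several serialized documents has more than one root element, so it could not be parsed again as one XML document anyway.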

TypeError: write() got an unexpected keyword argument 'pretty_print'

I am writing a Python script which will append a new tag/element to the config.xml of my Jenkins job.
This is what my script looks like:
#!/usr/bin/python
import os, fnmatch, pdb, re, string, fileinput, sys
from lxml import etree
def find(pattern, path):
    result = []
    for root, dirs, files in os.walk(path):
        for name in files:
            if fnmatch.fnmatch(name, pattern):
                result.append(os.path.join(root, name))
    return result
finalresult = find('config.xml', './')
print finalresult
def writexml(filepath):
    tree = etree.parse(filepath)
    root = tree.getroot()
    a=[]
    for v in root.iter('publishers'):
        for a in v:
            if a.tag == "hudson.plugins.emailext.ExtendedEmailPublisher":
                t1=etree.SubElement(v,'org.jenkinsci.plugins.postbuildscript.PostBuildScript',{'plugin':"postbuildscript#017"})
                t2=etree.SubElement(t1,"buildSteps")
                t3=etree.SubElement(t2,'hudson.tasks.Shell')
                t4=etree.SubElement(t3,"command")
                t4.text = "bash -x /d0/jenkins/scripts/parent-pom-enforcer.sh"
                t5 = etree.SubElement(t1,'scriptOnlyIfSuccess')
                t5.text = "false"
                t6 = etree.SubElement(t1,'scriptOnlyIfFailure')
                t6.text = "false"
                t7= etree.SubElement(t1,'markBuildUnstable')
                t7.text = "true"
    tree.write(filepath,pretty_print=True)
findMavenProject=[]
for i in finalresult:
    tree = etree.parse(i)
    root = tree.getroot()
    for v in root.iter('hudson.tasks.Maven'):
        if v.tag == "hudson.tasks.Maven":
            writexml(i)
            findMavenProject.append(i)
print findMavenProject
When I execute this script, I get following error:
running with cElementTree on Python 2.5+
['./jen1/config.xml', './jen2/config.xml', './jen3/config.xml', './jen4/config.xml']
Traceback (most recent call last):
  File "./find-test.py", line 50, in <module>
    writexml(i)
  File "./find-test.py", line 41, in writexml
    tree.write(filepath,pretty_print=True)
TypeError: write() got an unexpected keyword argument 'pretty_print'
I googled this error and found that I should use lxml. I used it, but I still get the same error. I am using Python 2.7.6.
Any clue?
Python's standard library xml.etree.ElementTree.write() does not have the pretty_print argument.
The version of that method that is in recent versions of lxml does. I just tested it with lxml 3.3.3, and the docs at http://lxml.de/tutorial.html#serialisation mention it.
You are either using an old version of lxml, or somehow still using the standard library's older copy of the library.
Calling tree.write(filepath, pretty_print=True) is what fails here.
The right way: tree.write(filepath) or, with lxml, etree.tostring(root, pretty_print=True).
Detailed explanation:
tree is a standard-library ElementTree object, whose write() has no pretty_print argument:
write(file, encoding="us-ascii", xml_declaration=None, default_namespace=None, method="xml")
Writes the element tree to a file, as XML. file is a file name, or a file object opened for writing. encoding [1] is the output encoding (default is US-ASCII). xml_declaration controls if an XML declaration should be added to the file. Use False for never, True for always, None for only if not US-ASCII or UTF-8 (default is None). default_namespace sets the default XML namespace (for “xmlns”). method is either "xml", "html" or "text" (default is "xml"). Returns an encoded string.
xml.etree.ElementTree.Element docs
UPDATE
To answer another question, Pretty printing XML in Python:
import xml.dom.minidom
my_xml_doc = xml.dom.minidom.parse(xml_fname) # or xml.dom.minidom.parseString(xml_string)
pretty_xml_as_string = my_xml_doc.toprettyxml()
UPDATE2
lxml.etree does have the pretty_print argument as follows:
etree.tostring(root, pretty_print=True)
lxml docs
special thanks to @RemcoGerlich
The issue with the code below is that it creates duplicate empty lines with tabs if you are trying to edit an existing XML file.
import xml.dom.minidom
xml = xml.dom.minidom.parse(xml_fname) # or xml.dom.minidom.parseString(xml_string)
pretty_xml_as_string = xml.toprettyxml()
Solution: I tried the code below and it worked for me.
from lxml import etree as ET
import xml.dom.minidom
def writexml(filepath):
    parser = ET.XMLParser(resolve_entities=False, strip_cdata=False)
    tree = ET.parse(filepath, parser)
    root = tree.getroot()
    a=[]
    for v in root.iter('publishers'):
        for a in v:
            if a.tag == "hudson.plugins.emailext.ExtendedEmailPublisher":
                t1=ET.SubElement(v,'org.jenkinsci.plugins.postbuildscript.PostBuildScript',{'plugin':"postbuildscript#0.17"})
                t2=ET.SubElement(t1,"buildSteps")
                t3=ET.SubElement(t2,'hudson.tasks.Shell')
                t4=ET.SubElement(t3,"command")
                t4.text = "bash -x /d0/jenkins/scripts/parent-pom-enforcer.sh"
                t5 = ET.SubElement(t1,'scriptOnlyIfSuccess')
                t5.text = "false"
                t6 = ET.SubElement(t1,'scriptOnlyIfFailure')
                t6.text = "false"
                t7= ET.SubElement(t1,'markBuildUnstable')
                t7.text = "true"
    xml1 = xml.dom.minidom.parseString(ET.tostring(root, pretty_print=True))
    pretty_xml_as_string = xml1.toprettyxml()
    f = open(filepath, "w")
    # write only the non-blank lines to drop the duplicate empty lines
    for v in str(pretty_xml_as_string).split("\n"):
        if v.strip():
            f.write(v+"\n")
    f.close()
writexml('test.xml') #provide full path of the file as arg to function writexml.
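For readers on a newer interpreter, the blank-line cleanup can be avoided with the standard library alone. The sketch below assumes Python 3.9 or later, where xml.etree.ElementTree.indent exists; pretty_xml and the sample fragment are made-up names, and this is an alternative to the lxml approach above rather than a drop-in replacement for it.

```python
import xml.etree.ElementTree as ET


def pretty_xml(xml_text, space="  "):
    """Re-indent an XML document and return it as a string."""
    root = ET.fromstring(xml_text)
    ET.indent(root, space=space)  # available from Python 3.9
    return ET.tostring(root, encoding="unicode")


# Made-up fragment with uneven whitespace, loosely shaped like a Jenkins config.
messy = "<project><publishers>\n\n   <command>echo hi</command></publishers></project>"
print(pretty_xml(messy))
```

Because indent overwrites whitespace-only text between elements, re-running it on an already edited file should not pile up empty lines. One caveat: the default standard-library parser discards XML comments, so this is only suitable when the file has none worth keeping.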

Why LXML ElementMaker breaks when I use an integer as attribute value?

I am trying to create an XML document with the help of lxml. I realized that ElementMaker breaks when I use an integer.
Code
from lxml.builder import ElementMaker
from lxml import etree
maker = ElementMaker()
maker.text(**{'label': 'my textarea'}) # works
maker.ratings(**{'points':5}) # breaks
Error
  File "/usr/local/lib/python2.7/dist-packages/lxml/builder.py", line 210, in __call__
    get(dict)(elem, attrib)
  File "/usr/local/lib/python2.7/dist-packages/lxml/builder.py", line 197, in add_dict
    attrib[k] = typemap[type(v)](None, v)
KeyError: <type 'int'>
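Why can't I assign an integer as the attribute value?
XML attribute values are always text, so you cannot store an integer directly. ElementMaker looks up a handler for each value's type in its typemap and, as the KeyError in your traceback shows, it has no entry for int.
You can enter the data as a string and convert it to the required type when you parse the data.
In your case, try using 'points': "5" and then convert the string to an integer when you parse it.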
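The string-in, integer-out round trip can be sketched with the standard library's ElementTree, which stores attributes as text in the same way. The helper name stringify_attrs is made up for illustration; with lxml the same helper could presumably be used as maker.ratings(**stringify_attrs({'points': 5})).

```python
import xml.etree.ElementTree as ET


def stringify_attrs(attrs):
    """Return a copy of attrs with every value converted to str."""
    return {key: str(value) for key, value in attrs.items()}


# Writing: convert before the values reach the element.
ratings = ET.Element("ratings", stringify_attrs({"points": 5}))
xml_text = ET.tostring(ratings, encoding="unicode")
print(xml_text)

# Reading: convert back to the type you need.
points = int(ET.fromstring(xml_text).get("points"))
print(points)
```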

Python sys.argv TypeErrors with printing function results?

I have been trying to learn how to use sys.argv properly while calling an executable file from the command line.
I wanted the function's results to print to the command line when I pass the filename and argument on the command line, but I get a TypeError.
So far I have:
#! /usr/bin/env python
import mechanize
from BeautifulSoup import BeautifulSoup
import sys
def dictionary(word):
    br = mechanize.Browser()
    response = br.open('http://www.dictionary.reference.com')
    br.select_form(nr=0)
    br.form['q'] = sys.argv
    br.submit()
    definition = BeautifulSoup(br.response().read())
    trans = definition.findAll('td',{'class':'td3n2'})
    fin = [i.text for i in trans]
    query = {}
    for i in fin:
        query[fin.index(i)] = i
    return query
print dictionary(sys.argv)
When I call this:
./this_file.py 'pass'
I am left with this error message:
Traceback (most recent call last):
  File "./hot.py", line 20, in <module>
    print dictionary(sys.argv)
  File "./hot.py", line 10, in dictionary
    br.form['q'] = sys.argv
  File "/usr/local/lib/python2.7/dist-packages/mechanize/_form.py", line 2782, in __setitem__
    control.value = value
  File "/usr/local/lib/python2.7/dist-packages/mechanize/_form.py", line 1217, in __setattr__
    raise TypeError("must assign a string")
TypeError: must assign a string
With
br.form['q'] = sys.argv
you are assigning a list of strings here instead of a string.
>>> type(sys.argv)
<type 'list'>
>>> type(sys.argv[0])
<type 'str'>
>>>
You want to identify a specific string to assign via an index.
Most likely it will be index 1 given what you have in your post (since index 0 is the name of the script). So perhaps
br.form['q'] = sys.argv[1]
will do for you. Of course it could be another index too, depending on your particular application/needs.
Note, as @Dougal observes in a helpful comment below, that the function parameter word is not being used. You are calling your dictionary function with sys.argv and then ought to refer to word inside the function. The type doesn't change, only the name by which you refer to the command-line args inside your function. The idea of word is good, as it avoids the use of global variables. If you prefer to use globals (not really encouraged), then removing word is recommended, as it would be confusing to have it there.
So your statement should really read
br.form['q'] = word[1]
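A slightly tidier variant is to pick the single string out of the argument list once, at the edge of the program, and pass only that string on. The sketch below is written for Python 3 and leaves the mechanize part out; pick_word and main are hypothetical names, and the final call simulates running ./this_file.py pass.

```python
def pick_word(argv):
    """Return the first command-line argument, or None if it is missing."""
    # argv[0] is the script name, so the first real argument is argv[1].
    return argv[1] if len(argv) > 1 else None


def main(argv):
    word = pick_word(argv)
    if word is None:
        print("usage: this_file.py WORD")
        return 1
    print("looking up:", word)  # a single str, safe to assign to a form field
    return 0


# Simulate running: ./this_file.py pass
main(["this_file.py", "pass"])
```

In a real script the last line would be main(sys.argv) after an import sys, and the lookup function would receive word rather than the whole list.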
