decorate xml subtree with lxml - python

I want to transform this xml tree
<doc>
<a>
<op>xxx</op>
</a>
</doc>
to
<doc>
<a>
<cls>
<op>xxx</op>
</cls>
</a>
</doc>
I use this python code
from lxml import etree
f = etree.fromstring('<doc><a><op>xxx</op></a></doc>')
node_a = f.xpath('/doc/a')[0]
ele = etree.Element('cls')
node_a.insert(0, ele)
node_cls = f.xpath('/doc/a/cls')[0]
node_op = f.xpath('/doc/a/op')[0]
node_cls.append(node_op)
print(etree.tostring(f, pretty_print=True).decode())
Is this the best solution?
Now I want to obtain
<cls>
<doc>
<a>
<op>xxx</op>
</a>
</doc>
</cls>
I am unable to find any solution.
Thanks for your help.

I prefer BeautifulSoup to lxml; I find it easier to handle.
Both problems can be solved with the same approach: find the element, get its parent, create the new element, and put the old one inside it.
from bs4 import BeautifulSoup
import sys
soup = BeautifulSoup(open(sys.argv[1], 'r'), 'xml')
for e in soup.find_all(sys.argv[2]):
    p = e.parent
    cls = soup.new_tag('cls')
    e_extracted = e.extract()
    cls.append(e_extracted)
    p.append(cls)
print(soup.prettify())
The script accepts two arguments: the first is the XML file and the second is the tag to surround with the new tag. Run it like:
python3 script.py xmlfile op
That yields:
<?xml version="1.0" encoding="utf-8"?>
<doc>
 <a>
  <cls>
   <op>
    xxx
   </op>
  </cls>
 </a>
</doc>
For <doc>, run it like:
python3 script.py xmlfile doc
With the following result:
<?xml version="1.0" encoding="utf-8"?>
<cls>
 <doc>
  <a>
   <op>
    xxx
   </op>
  </a>
 </doc>
</cls>
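For anyone who wants to stay with lxml, as in the question, the same find/create/move idea can be sketched as below. The helper name wrap is made up for this illustration and is not part of lxml. It puts the wrapper at the element's original position, and it also handles the root element, in which case the wrapper simply becomes the new root.

```python
from lxml import etree

def wrap(elem, tag):
    """Wrap elem in a new element named tag and return the wrapper.

    Hypothetical helper for illustration; not an lxml API.
    """
    wrapper = etree.Element(tag)
    parent = elem.getparent()
    if parent is not None:
        # Hand any trailing text over to the wrapper so it stays in place.
        wrapper.tail, elem.tail = elem.tail, None
        # replace() puts the wrapper exactly where elem used to be.
        parent.replace(elem, wrapper)
    # Moving elem into the wrapper; for a root element the wrapper is the new root.
    wrapper.append(elem)
    return wrapper

# First case from the question: wrap <op> inside <a>.
root = etree.fromstring('<doc><a><op>xxx</op></a></doc>')
wrap(root.find('a/op'), 'cls')
inner = etree.tostring(root, encoding='unicode')
print(inner)

# Second case: wrap the root element itself.
root2 = etree.fromstring('<doc><a><op>xxx</op></a></doc>')
new_root = wrap(root2, 'cls')
outer = etree.tostring(new_root, encoding='unicode')
print(outer)
```

Note that in the second case the old root variable still points at doc, so any later serialization has to start from the returned wrapper.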

Related

Extract the xpath for every node (tag) from XML file using python

I just have the similar like the following XML file :
<xml>
<Catalog>
<Book>
<Textbook>
<Author ="MEMO" />
</Textbook>
</Book>
<Journal>
<Science>
<Author ="David" />
</Science>
</Journal>
</Catalog>
</xml>
What I would like to do is write Python code that finds and prints the XPath of every node in my XML file. Any idea or suggestion would be much appreciated :). Is there a module I can use to find the full path? For example, the result should look like:
MEMO: Catalog/Book/Textbook/Author
It can be done with lxml:
import lxml.html as lh
from lxml import etree
books = """[your html above]"""
doc = lh.fromstring(books)
tree = etree.ElementTree(doc)
for e in doc.iter('author'):
    print("Memo: ",tree.getpath(e).replace('/xml/',''))
Output:
Memo: catalog/book/textbook/author
Memo: catalog/journal/science/author
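The paths above come out lowercase because the HTML parser lowercases tag names. If the original case matters, a minimal sketch with lxml's XML parser might look like the following. The XML in the question is not well-formed (the Author attribute has no name), so this sketch assumes the attribute is called name; that name is an assumption made for the example only.

```python
from lxml import etree

# Assumption: the nameless attribute from the question is called "name".
xml = """<xml>
<Catalog>
<Book><Textbook><Author name="MEMO"/></Textbook></Book>
<Journal><Science><Author name="David"/></Science></Journal>
</Catalog>
</xml>"""

root = etree.fromstring(xml)
tree = etree.ElementTree(root)

# Path of every element in the document, root included.
all_paths = [tree.getpath(elem) for elem in root.iter()]

# Author name mapped to its path, with the leading "/xml/" stripped.
paths = {}
for elem in root.iter("Author"):
    paths[elem.get("name")] = tree.getpath(elem)[len("/xml/"):]

for name, path in paths.items():
    print(f"{name}: {path}")
```

getpath() only adds a positional index such as [2] when an element has same-named siblings, which is not the case here.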

python Parsing xml: get text from tag which contains <i> or <b> or similar

I'm using xml.etree.ElementTree. When I try to get the text from AbstractText, I get None or only part of the text if there are formatting tags such as i or b inside the text.
Here is an XML example
<root>
<AbstractText><b>1.</b> test text <b> 2. </b> is very silly.</AbstractText>
<AbstractText>hello <b> this is </b> another example </AbstractText>
</root>
python code is
import xml.etree.ElementTree as ET
tree = ET.parse("xml/test.xml")
root = tree.getroot()
for node in root.findall('AbstractText'):
    print(node.text)
and the output is
None
hello
How can I fix it? I want all the text without i, b, or other information
Here is an example using lxml:
from lxml import etree
tree = etree.parse('file.xml')
for node in tree.findall('AbstractText'):
    # method='text' drops the tags and keeps all the text, including the text after each </b>
    print(etree.tostring(node, encoding='unicode', method='text', with_tail=False))
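The reason node.text falls short is that it only holds the text before the first child element; the rest lives in the children's text and tail. If you would rather stay with xml.etree.ElementTree, a minimal sketch using itertext() could look like this. The whitespace normalization at the end is optional and is my own addition, not something the question asked for.

```python
import xml.etree.ElementTree as ET

xml = """<root>
<AbstractText><b>1.</b> test text <b> 2. </b> is very silly.</AbstractText>
<AbstractText>hello <b> this is </b> another example </AbstractText>
</root>"""

root = ET.fromstring(xml)

texts = []
for node in root.findall("AbstractText"):
    # itertext() walks the element and all descendants, yielding text and tails in order.
    raw = "".join(node.itertext())
    # Optional: collapse runs of whitespace left behind by the removed tags.
    texts.append(" ".join(raw.split()))

for t in texts:
    print(t)
```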

Comment out an element using lxml

Is it possible to comment out an xml element with python's lxml while preserving the original element rendering inside the comment? I tried the following
elem.getparent().replace(elem, etree.Comment(etree.tostring(elem, pretty_print=True)))
but tostring() adds the namespace declaration.
The namespace of the commented-out element is inherited from the root element. Demo:
from lxml import etree
XML = """
<root xmlns='foo'>
<a>
<b>AAA</b>
</a>
</root>"""
root = etree.fromstring(XML)
b = root.find(".//{foo}b")
b.getparent().replace(b, etree.Comment(etree.tostring(b)))
print(etree.tostring(root).decode())
Result:
<root xmlns="foo">
<a>
<!--<b xmlns="foo">AAA</b>
--></a>
</root>
Manipulating namespaces is often harder than you might suspect. See https://stackoverflow.com/a/31870245/407651.
My suggestion here is to use BeautifulSoup, which in practice does not really care about namespaces (soup.find('b') returns the b element even though it is in the foo namespace).
from bs4 import BeautifulSoup, Comment
soup = BeautifulSoup(XML, "xml")
b = soup.find('b')
b.replace_with(Comment(str(b)))
print(soup.prettify())
Result:
<?xml version="1.0" encoding="utf-8"?>
<root xmlns="foo">
<a>
<!--<b>AAA</b>-->
</a>
</root>

Using Python to extract information from a XML file?

Can anyone offer some help with regards to using Python to extract information from a XML file? This will be my example XML.
<root>
<number index="2">
<info>
<info.RANDOM>Random Text</info.RANDOM>
</info>
</root>
What I want to print out is everything between the root tags. I want it printed as is: all the tags, the text between the tags, and the attributes inside the tags (in this case index="2" on number). I have tried itertext(), but that removes the tags and returns only the text between the root tags. So far I have a makeshift solution that prints only element.tag and element.text, but it does not print the end tags or the attributes. Any help would be appreciated! :)
With s as your input,
s='''<root>
<number index="2">
<info>
<info.RANDOM>Random Text</info.RANDOM>
</info>
</number>
</root>'''
Find all tags named number and convert each one to a string using ET.tostring():
import xml.etree.ElementTree as ET
root = ET.fromstring(s)
for node in root.findall('.//number'):
    # encoding='unicode' returns str instead of bytes
    print(ET.tostring(node, encoding='unicode'))
Output:
<number index="2">
<info>
<info.RANDOM>Random Text</info.RANDOM>
</info>
</number>
Another option is BeautifulSoup:
from bs4 import BeautifulSoup
xml = "<root><number index=\"2\"><info><info.RANDOM>Random Text</info.RANDOM></info></root>"
soup = BeautifulSoup(xml, "xml")
output = soup.prettify()
print(output[output.find("<root>") + 7:output.rfind("</root>")])
The + 7 skips past <root> and the newline that follows it (7 characters in total).
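If the goal is everything inside root regardless of the child tag names, a sketch that avoids string offsets is to serialize each child of the root and join the pieces. This uses the well-formed input from the first answer (with the closing number tag added) and relies on ET.tostring() including each child's trailing text.

```python
import xml.etree.ElementTree as ET

s = '''<root>
<number index="2">
<info>
<info.RANDOM>Random Text</info.RANDOM>
</info>
</number>
</root>'''

root = ET.fromstring(s)

# Text before the first child, then every child serialized with its tail.
inner = (root.text or "") + "".join(
    ET.tostring(child, encoding="unicode") for child in root
)
print(inner.strip())
```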

How to parse xml with lxml

So for example I have XML doc:
<?xml version="1.0"?>
<a>
<b>Text I need</b>
</a>
<a>
<b>Text I need2</b>
</a>
How do I extract the text inside all the b elements? I read my whole file into a string.
I only know how to parse HTML; I tried applying the same approach to XML, but it failed.
from lxml import html
string = myfile.read();
tree = html.fromstring(string);
result = tree.xpath('//a/#b');
But it won't work.
The first thing to do is make sure the file is well-formed XML. An XML document must have exactly one root element, and the file in the question has two top-level a elements, so lxml's XML parser will fail on it. Wrapping everything in a single root element fixes that; the name is arbitrary, and "body" is used here:
<?xml version="1.0"?>
<body>
<a>
<b>Text I need</b>
</a>
<a>
<b>Text I need2</b>
</a>
</body>
Let us refer to this file as "foo.xml". Now that this data format is better for parsing, import etree from the lxml library:
from lxml import etree as et
Now it is time to parse the data and create a root object from which to start:
file_name = r"C:\foo.xml"
xmlParse = et.parse(file_name) #Parse the xml file
root = xmlParse.getroot() #Get the root
Once we have the root object, we can use the iter() method to iterate over all b tags (getiterator() is the older, deprecated name for the same thing). Because iter() returns an iterator, we can pass it to list() to save the element objects in a list. From there we can edit the text between the b tags:
bTags = list(root.iter("b")) #Collect the b elements from the iterator
bTags[0].text = "Change b tag 1." #Change tag from "Text I need"
bTags[1].text = "Change b tag 2." #Change tag from "Text I need2"
xmlParse.write(file_name) #Edit original xml file
The final output should look something like this:
<?xml version="1.0"?>
<body>
<a>
<b>Change b tag 1.</b>
</a>
<a>
<b>Change b tag 2.</b>
</a>
</body>
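The question itself was about reading the text of the b elements rather than changing it. With the single-root version of the document, a minimal sketch could look like this; it parses from a string instead of a file so it runs as is.

```python
from lxml import etree

xml = b"""<?xml version="1.0"?>
<body>
<a>
<b>Text I need</b>
</a>
<a>
<b>Text I need2</b>
</a>
</body>"""

root = etree.fromstring(xml)

# XPath: the text nodes of every b that is a child of an a.
texts = root.xpath("//a/b/text()")

# The same result without XPath, via element iteration.
texts_iter = [b.text for b in root.iter("b")]

print(texts)
```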
