How to parse xml with lxml - python

So for example I have XML doc:
<?xml version="1.0"?>
<a>
<b>Text I need</b>
</a>
<a>
<b>Text I need2</b>
</a>
How do I extract the text inside all the b elements? I read my whole file into a string.
I only know how to parse HTML; I tried applying the same approach to XML, but it failed.
from lxml import html
string = myfile.read();
tree = html.fromstring(string);
result = tree.xpath('//a/#b');
But it won't work.

The first thing to do is make sure your XML file is well-formed. An XML document must have exactly one root element, and your sample has two top-level a elements, so the lxml parser will fail. The root can have any name; here I wrap everything in a "body" tag. May I make this suggestion:
<?xml version="1.0"?>
<body>
<a>
<b>Text I need</b>
</a>
<a>
<b>Text I need2</b>
</a>
</body>
Let us refer to this file as "foo.xml". Now that this data format is better for parsing, import etree from the lxml library:
from lxml import etree as et
Now it is time to parse the data and create a root object from which to start:
file_name = r"C:\foo.xml"
xmlParse = et.parse(file_name) #Parse the xml file
root = xmlParse.getroot() #Get the root
Once we have the root object, we can use the iter() method to iterate over all b tags (older code uses getiterator(), which is deprecated). Because iter() returns an iterator, we can pass it to list() to collect the element objects. From there we can edit the text between the b tags:
bTags = list(root.iter("b")) #Collect all b elements from the iterator
bTags[0].text = "Change b tag 1." #Change tag from "Text I need"
bTags[1].text = "Change b tag 2." #Change tag from "Text I need2"
xmlParse.write(file_name) #Edit original xml file
The final output should look something like this:
<?xml version="1.0"?>
<body>
<a>
<b>Change b tag 1.</b>
</a>
<a>
<b>Change b tag 2.</b>
</a>
</body>
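To answer the original question directly (reading the text rather than changing it), here is a minimal sketch. It parses the wrapped document from a bytes literal instead of a file, and the fallback import is my own assumption so the snippet also runs where lxml is not installed; the calls used here (fromstring, iter) exist in both libraries.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: fall back to the standard library, which offers the same calls used below
    import xml.etree.ElementTree as etree

# The sample document, wrapped in a single root element
xml_doc = b"""<?xml version="1.0"?>
<body>
<a>
<b>Text I need</b>
</a>
<a>
<b>Text I need2</b>
</a>
</body>"""

root = etree.fromstring(xml_doc)

# Collect the text of every b element, at any depth below the root
b_texts = [b.text for b in root.iter("b")]
print(b_texts)
```

With lxml specifically, root.xpath('//b/text()') should give the same list.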

Related

python Parsing xml: get text from tag which contains <i> or <b> or similar

I'm using xml.etree.ElementTree. When I try to get the text from AbstractText, I get None or only partial text if the text contains formatting tags such as i or b.
Here is an XML example
<root>
<AbstractText><b>1.</b> test text <b> 2. </b> is very silly.</AbstractText>
<AbstractText>hello <b> this is </b> another example </AbstractText>
</root>
python code is
import xml.etree.ElementTree as ET
tree = ET.parse("xml/test.xml")
root = tree.getroot()
for node in root.findall('AbstractText'):
    print(node.text)
and the output is
None
hello
How can I fix it? I want all the text without i, b, or other information
Here is an example using lxml:
from lxml import etree
tree = etree.parse('file.xml')
for node in tree.findall('AbstractText'):
    # method='text' drops the markup; encoding='unicode' returns str instead of bytes
    print(etree.tostring(node, encoding='unicode', method='text'))
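If you would rather stay with xml.etree.ElementTree as in the question, itertext() walks the text of an element and all of its children. A minimal sketch follows; the whitespace normalization at the end is my own addition, not something the question asked for.

```python
import xml.etree.ElementTree as ET

xml_doc = """<root>
<AbstractText><b>1.</b> test text <b> 2. </b> is very silly.</AbstractText>
<AbstractText>hello <b> this is </b> another example </AbstractText>
</root>"""

root = ET.fromstring(xml_doc)

texts = []
for node in root.findall("AbstractText"):
    # itertext() yields the element's own text plus the text and tail of every descendant
    raw = "".join(node.itertext())
    # Collapse runs of whitespace left behind by the removed tags
    texts.append(" ".join(raw.split()))

for text in texts:
    print(text)
```

node.text only returns the text before the first child element, which is why the original loop printed None and a partial string.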

Extracting the hyperlink from link tag using xpath

Consider the html as
<item>
<title>this is the title</title>
<link>www.linktoawebsite.com</link>
</item>
I am using lxml (Python) and XPath, and I am trying to extract the content of both the title tag and the link tag.
The code is
page=urllib.urlopen(url).read()
x=etree.HTML(page)
titles=x.xpath('//item/title/text()')
links=x.xpath('//item/link/text()')
But the last line returns an empty list. However, the following returns a link element:
links=x.xpath('//item/link') #returns <Element link at 0xb6b0ae0c>
Can anyone suggest how to extract the urls from the link tag?
You are using the wrong parser for the job; you don't have HTML, you have XML.
A proper HTML parser will ignore the contents of a <link> tag, because in the HTML specification that tag is always empty.
Use the etree.parse() function to parse your URL stream (no separate .read() call needed):
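from urllib.request import urlopen # Python 3; the question's urllib.urlopen is Python 2
from lxml import etree
response = urlopen(url)
tree = etree.parse(response) # parse as XML, reading directly from the response
titles = tree.xpath('//item/title/text()')
links = tree.xpath('//item/link/text()')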
You could also use etree.fromstring(page) but leaving the reading to the parser is easier.
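As a quick check without a network call, here is a sketch that parses the sample item from the question as XML. The fallback import is my own assumption so it runs without lxml too; findtext() exists in both libraries.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: fall back to the standard library, which offers the same calls used below
    import xml.etree.ElementTree as etree

content = b"""<item>
<title>this is the title</title>
<link>www.linktoawebsite.com</link>
</item>"""

# Parsed as XML, <link> is an ordinary element and keeps its text
item = etree.fromstring(content)
title = item.findtext("title")
link = item.findtext("link")
print(title, link)
```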
When the content is parsed with etree.HTML(), the <link> tag is closed immediately, so the link element has no text value.
Demo:
>>> from lxml import etree
>>> content = """<item>
... <title>this is the title</title>
... <link>www.linktoawebsite.com</link>
... </item>"""
>>> x = etree.HTML(content)
>>> etree.tostring(x)
'<html><body><item>\n<title>this is the title</title>\n<link/>www.linktoawebsite.com\n</item></body></html>'
>>>
In HTML, a <link> element cannot contain text, so this markup is not valid HTML.
I think the link tag is normally used like this:
<head>
<link rel="stylesheet" type="text/css" href="theme.css">
</head>

Removing a tag and its contents in xml using BeautifulSoup and lxml in Python

I am working with my Evernote data - extracted to an xml file. I have parsed the data using BeautifulSoup and here is a sampling of my xml data.
<note>
<title>
Audio and camera roll from STUDY DAY! in San Francisco
</title>
<content>
<![CDATA[<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE en-note SYSTEM "http://xml.evernote.com/pub/enml2.dtd">
<en-note><div><en-media type="image/jpeg" hash="e3a84de41c9886b93a6921413b8482d5" width="1080" style="" height="1920"/><en-media type="image/jpeg" hash="b907b22a9f2db379aec3739d65ce62db" width="1123" style="" height="1600"/><en-media type="audio/wav" hash="d3fdcd5a487531dc156a8c5ef6000764" style=""/><br/></div>
</en-note>
]]>
</content>
<created>
20130217T153800Z
</created>
<updated>
20130217T154311Z
</updated>
<note-attributes>
<latitude>
37.78670730072799
</latitude>
<longitude>
-122.4171893858559
</longitude>
<altitude>
42
</altitude>
<source>
mobile.iphone
</source>
<reminder-order>
0
</reminder-order>
</note-attributes>
<resource>
<data encoding="base64">
There are two avenues I would like to explore here:
1. Finding and removing Specific tags (in this case )
2. locating a group/list of tags to extract to another document
This is my current code, which parses the XML, prettifies it, and writes it to a text file.
from bs4 import BeautifulSoup
soup = BeautifulSoup(open('myNotes.xml','r'))
with open("file.txt", "w") as f:
    f.write(soup.prettify().encode('utf8'))
You can search nodes by name
from bs4 import BeautifulSoup
soup = BeautifulSoup(open('myNotes.xml', 'r'), 'xml') # 'xml' uses the lxml XML parser
source = soup.source
print(source)
#<source>
# mobile.iphone
#</source>
print(source.string)
# mobile.iphone
Another way to do it is the find_all method:
for tag in soup.find_all('source'):
    print(tag.string)
If you want to print every node with the tags stripped, this should do the job:
for tag in soup.find_all():
    print(tag.string)
Hope it helps.
EDIT:
BeautifulSoup assumes you know the structure, although by definition XML is structured data storage.
So you need to give BS a guideline for parsing your XML.
row = []
title = soup.note.title.string
created = soup.note.created.string
row.append(title)
row.append(created)
Now you only have to iterate over xml.
If you're using BeautifulSoup, you could use the getText() method to strip out the tags in the child elements and get one consolidated text
source.getText()
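None of the above actually removes a tag, so here is a rough sketch of that step. It uses the ElementTree-style API (lxml if installed, otherwise the standard library; that fallback is my own assumption) rather than BeautifulSoup. The tag name resource and the tiny note document are placeholders of mine, since the question does not say which tag should go.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: fall back to the standard library, which offers the same calls used below
    import xml.etree.ElementTree as etree


def remove_tags(root, tag_name):
    """Remove every descendant element named tag_name, contents included."""
    removed = 0
    # Snapshot the tree first so removing nodes does not disturb the iteration
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == tag_name:
                parent.remove(child)
                removed += 1
    return removed


# Hypothetical, heavily shortened stand-in for an exported note
note = etree.fromstring(
    b"<note><title>Study day</title>"
    b"<resource><data encoding='base64'>AAAA</data></resource>"
    b"<created>20130217T153800Z</created></note>"
)

removed = remove_tags(note, "resource")
remaining = [child.tag for child in note]
print(removed, remaining)
```

If you stay with BeautifulSoup, calling decompose() on each tag found by find_all() should do the same job.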

decorate xml subtree with lxml

I want to transform this xml tree
<doc>
<a>
<op>xxx</op>
</a>
</doc>
to
<doc>
<a>
<cls>
<op>xxx</op>
</cls>
</a>
</doc>
I use this python code
from lxml import etree
f = etree.fromstring('<doc><a><op>xxx</op></a></doc>')
node_a = f.xpath('/doc/a')[0]
ele = etree.Element('cls')
node_a.insert(0, ele)
node_cls = f.xpath('/doc/a/cls')[0]
node_op = f.xpath('/doc/a/op')[0]
node_cls.append(node_op)
print etree.tostring(f, pretty_print=True)
Is it the best solution ?
Now I want to obtain
<cls>
<doc>
<a>
<op>xxx</op>
</a>
</doc>
</cls>
I am unable to find any solution.
Thanks for your help.
I prefer BeautifulSoup to lxml; I find it easier to handle.
Both problems can be solved using the same approach, first find the element, get its parent, create the new element and put the old one inside it.
from bs4 import BeautifulSoup
import sys
soup = BeautifulSoup(open(sys.argv[1], 'r'), 'xml')
for e in soup.find_all(sys.argv[2]):
    p = e.parent
    cls = soup.new_tag('cls')
    e_extracted = e.extract()
    cls.append(e_extracted)
    p.append(cls)
print(soup.prettify())
The script accepts two arguments: the first is the XML file and the second is the tag to wrap in the new tag. Run it like:
python3 script.py xmlfile op
That yields:
<?xml version="1.0" encoding="utf-8"?>
<doc>
<a>
<cls>
<op>
xxx
</op>
</cls>
</a>
</doc>
For <doc>, run it like:
python3 script.py xmlfile doc
With following result:
<?xml version="1.0" encoding="utf-8"?>
<cls>
<doc>
<a>
<op>
xxx
</op>
</a>
</doc>
</cls>
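If you want to stay with the ElementTree-style API the question started from, the same idea can be sketched as below. wrap_child is a hypothetical helper name of mine, and the fallback import is an assumption so the snippet runs without lxml as well.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: fall back to the standard library, which offers the same calls used below
    import xml.etree.ElementTree as etree


def wrap_child(parent, child, wrapper_tag):
    """Put child inside a new wrapper element that takes child's place in parent."""
    index = list(parent).index(child)
    parent.remove(child)
    wrapper = etree.Element(wrapper_tag)
    wrapper.append(child)
    parent.insert(index, wrapper)
    return wrapper


# Case 1: wrap <op> in <cls>
doc = etree.fromstring("<doc><a><op>xxx</op></a></doc>")
node_a = doc.find("a")
wrap_child(node_a, node_a.find("op"), "cls")
inner = etree.tostring(doc).decode()

# Case 2: wrap the root itself; the root has no parent, so build a new root around it
doc2 = etree.fromstring("<doc><a><op>xxx</op></a></doc>")
new_root = etree.Element("cls")
new_root.append(doc2)
outer = etree.tostring(new_root).decode()

print(inner)
print(outer)
```

Unlike appending the wrapper at the end of the parent, inserting it at the original index keeps sibling order when the parent has more than one child.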

How to find a specific tag in an XML file and then access its parent tag with Python and minidom

I'm trying to write some code that will search through an XML file of articles for a particular DOI contained within a tag. When it has found the correct DOI I'd like it to then access the <title> and <abstract> text for the article associated with that DOI.
My XML file is in this format:
<root>
<article>
<number>
0
</number>
<DOI>
10.1016/B978-0-12-381015-1.00004-6
</DOI>
<title>
The patagonian toothfish biology, ecology and fishery.
</title>
<abstract>
lots of abstract text
</abstract>
</article>
<article>
...All the article tags as shown above...
</article>
</root>
I'd like the script to find the article with the DOI 10.1016/B978-0-12-381015-1.00004-6 (for example) and then for me to be able to access the <title> and <abstract> tags within the corresponding <article> tag.
So far I've tried to adapt code from this question:
from xml.dom import minidom
datasource = open('/Users/philgw/Dropbox/PW-Honours-Project/Code/processed.xml')
xmldoc = minidom.parse(datasource)
#looking for: 10.1016/B978-0-12-381015-1.00004-6
matchingNodes = [node for node in xmldoc.getElementsByTagName("DOI") if node.firstChild.nodeValue == '10.1016/B978-0-12-381015-1.00004-6']
for i in range(len(matchingNodes)):
    DOI = str(matchingNodes[i])
    print DOI
But I'm not entirely sure what I'm doing!
Thanks for any help.
Is minidom a requirement? It would be quite easy to parse it with lxml and XPath.
from lxml import etree
tree = etree.parse('/Users/philgw/Dropbox/PW-Honours-Project/Code/processed.xml')
# normalize-space() trims the whitespace around the DOI text before comparing
path = tree.xpath("//article[normalize-space(DOI)='10.1016/B978-0-12-381015-1.00004-6']")
This will get you the article with the DOI specified.
Also, it seems that there is whitespace between the tags. I don't know whether this is because of the Stack Overflow formatting. This is probably why you cannot match it with minidom.
imho - just look it up in the python docs!
try this (not tested):
from xml.dom import minidom

datasource = open('/Users/philgw/Dropbox/PW-Honours-Project/Code/processed.xml')
xmldoc = minidom.parse(datasource)

def get_xmltext(parent, subnode_name):
    # Serialize the children of the first matching sub-element and trim surrounding whitespace
    node = parent.getElementsByTagName(subnode_name)[0]
    return "".join([ch.toxml() for ch in node.childNodes]).strip()

matchingNodes = [node for node in xmldoc.getElementsByTagName("article")
                 if get_xmltext(node, "DOI") == '10.1016/B978-0-12-381015-1.00004-6']
for node in matchingNodes:
    print("title:", get_xmltext(node, "title"))
    print("abstract:", get_xmltext(node, "abstract"))
