Can anyone help me use Python to extract information from an XML file? Here is my example XML:
<root>
<number index="2">
<info>
<info.RANDOM>Random Text</info.RANDOM>
</info>
</number>
</root>
What I want to print is everything between the root tags, exactly as it appears: the tags, the text between the tags, and the attributes inside the tags (in this case index="2"). I have tried itertext(), but that removes the tags and returns only the text between the root tags. So far I have a makeshift solution that prints element.tag and element.text, but it does not print the end tags or the attributes. Any help would be appreciated! :)
With s as your input,
s='''<root>
<number index="2">
<info>
<info.RANDOM>Random Text</info.RANDOM>
</info>
</number>
</root>'''
Find all elements with the tag name number and convert each one to a string using ET.tostring():
import xml.etree.ElementTree as ET
root = ET.fromstring(s)
for node in root.findall('.//number'):
    # encoding='unicode' returns a str rather than bytes on Python 3
    print(ET.tostring(node, encoding='unicode'))
Output:
<number index="2">
<info>
<info.RANDOM>Random Text</info.RANDOM>
</info>
</number>
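Since the question asks for everything between the root tags rather than one named element, the same tostring() call can be applied to each child of the root. The following is a minimal sketch using only the standard library; the helper name inner_xml is made up for illustration, and s is the input string shown above.

```python
import xml.etree.ElementTree as ET

s = '''<root>
<number index="2">
<info>
<info.RANDOM>Random Text</info.RANDOM>
</info>
</number>
</root>'''


def inner_xml(element):
    """Return the markup between an element's start and end tags (hypothetical helper)."""
    # Text that appears before the first child, if any
    parts = [element.text or '']
    for child in element:
        # tostring() serializes the child with its tags, attributes, and trailing text
        parts.append(ET.tostring(child, encoding='unicode'))
    return ''.join(parts)


root = ET.fromstring(s)
inner = inner_xml(root)
print(inner)
```

This keeps the attributes and end tags because each child is serialized as a whole, instead of being rebuilt from element.tag and element.text.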
from bs4 import BeautifulSoup
xml = "<root><number index=\"2\"><info><info.RANDOM>Random Text</info.RANDOM></info></number></root>"
soup = BeautifulSoup(xml, "xml")
output = soup.prettify()
print(output[output.find("<root>") + 7:output.rfind("</root>")])
The + 7 skips past <root> and the newline that follows it.
I'm using xml.etree.ElementTree. When I try to get the text from AbstractText, I get None or only partial text if the text contains formatting tags such as i or b.
Here is an XML example:
<root>
<AbstractText><b>1.</b> test text <b> 2. </b> is very silly.</AbstractText>
<AbstractText>hello <b> this is </b> another example </AbstractText>
</root>
The Python code is:
import xml.etree.ElementTree as ET
tree = ET.parse("xml/test.xml")
root = tree.getroot()
for node in root.findall('AbstractText'):
    print(node.text)
and the output is
None
hello
How can I fix it? I want all the text, without the i, b, or other formatting tags.
Here is an example using lxml:
from lxml import etree
tree = etree.parse('file.xml')
for node in tree.findall('AbstractText'):
    # method='text' drops the tags and keeps the text of the element and its children
    print(etree.tostring(node, encoding='unicode', method='text'))
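For anyone who wants to stay with xml.etree.ElementTree, as in the question, itertext() should give the same result. This is a small sketch under that assumption; the XML is inlined here instead of being read from xml/test.xml, and the names xml_text and texts are illustrative.

```python
import xml.etree.ElementTree as ET

xml_text = '''<root>
<AbstractText><b>1.</b> test text <b> 2. </b> is very silly.</AbstractText>
<AbstractText>hello <b> this is </b> another example </AbstractText>
</root>'''

root = ET.fromstring(xml_text)

texts = []
for node in root.findall('AbstractText'):
    # itertext() yields the element's own text, the text of every nested
    # element, and the text that follows each nested element, in document order
    full_text = ''.join(node.itertext())
    # Collapse the runs of whitespace left behind by the removed tags
    texts.append(' '.join(full_text.split()))

for text in texts:
    print(text)
```

node.text stops at the first child element, which is why the original code printed None and a partial string.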
I am getting a response using the requests module in Python, and the response is XML. I want to parse it and extract the contents of each 'dt' tag, but I have not been able to do that with lxml.
Here is the XML response:
<?xml version="1.0" encoding="utf-8" ?>
<entry_list version="1.0">
<entry id="harsh">
<ew>harsh</ew><subj>MD-2</subj><hw>harsh</hw>
<sound><wav>harsh001.wav</wav><wpr>!h#rsh</wpr></sound>
<pr>ˈhärsh</pr>
<fl>adjective</fl>
<et>Middle English <it>harsk,</it> of Scandinavian origin; akin to Norwegian <it>harsk</it> harsh</et>
<def>
<date>14th century</date>
<sn>1</sn>
<dt>:having a coarse uneven surface that is rough or unpleasant to the touch</dt>
<sn>2 a</sn>
<dt>:causing a disagreeable or painful sensory reaction :<sx>irritating</sx></dt>
<sn>b</sn>
<dt>:physically discomforting :<sx>painful</sx></dt>
<sn>3</sn>
<dt>:unduly exacting :<sx>severe</sx></dt>
<sn>4</sn>
<dt>:lacking in aesthetic appeal or refinement :<sx>crude</sx></dt>
<ss>rough</ss>
</def>
<uro><ure>harsh*ly</ure> <fl>adverb</fl></uro>
<uro><ure>harsh*ness</ure> <fl>noun</fl></uro>
</entry>
</entry_list>
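A simple way is to walk down the hierarchy of the XML document with an XPath expression.
import requests
from lxml import etree
response = requests.get(url)  # url is the address of the service being queried
root = etree.fromstring(response.content)
print(root.xpath('//entry_list/entry/def/dt/text()'))
This returns the text nodes that sit directly inside each 'dt' tag in the XML document; text inside nested tags such as sx is not included.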
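If the cross-references inside sx should be part of each definition, one option is to join all the text below each dt element. The sketch below uses the standard library's ElementTree on a shortened copy of the response, so it runs without a network call; the names sample and definitions are illustrative.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the response shown in the question
sample = '''<entry_list version="1.0">
<entry id="harsh">
<def>
<sn>1</sn>
<dt>:having a coarse uneven surface that is rough or unpleasant to the touch</dt>
<sn>2 a</sn>
<dt>:causing a disagreeable or painful sensory reaction :<sx>irritating</sx></dt>
</def>
</entry>
</entry_list>'''

root = ET.fromstring(sample)

# iter('dt') visits every dt element; itertext() includes nested sx text
definitions = [''.join(dt.itertext()) for dt in root.iter('dt')]

for definition in definitions:
    print(definition)
```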
from xml.dom import minidom
# List with dt values
dt_elems = []
# Parse the XML and collect the elements by tag name
xmldoc = minidom.parse('text.xml')
itemlist = xmldoc.getElementsByTagName('dt')
# Join the text nodes that are direct children of each dt element
for i in itemlist:
    dt_elems.append(" ".join(t.nodeValue for t in i.childNodes if t.nodeType == t.TEXT_NODE))
# Print the resulting list
print(dt_elems)
For example, I have this XML document:
<?xml version="1.0"?>
<a>
<b>Text I need</b>
</a>
<a>
<b>Text I need2</b>
</a>
How do I parse all the text inside the b tags? I read my whole file into a string.
I only know how to parse HTML. I tried applying the same approach to XML, but it failed.
from lxml import html
string = myfile.read();
tree = html.fromstring(string);
result = tree.xpath('//a/#b');
But it won't work.
The first thing to do is make sure your XML file is well-formed. An XML document must have exactly one root element, and this file has two top-level a elements, so the lxml XML parser will fail on it. I suggest wrapping the content in a single enclosing tag, here called body:
<?xml version="1.0"?>
<body>
<a>
<b>Text I need</b>
</a>
<a>
<b>Text I need2</b>
</a>
</body>
Let us refer to this file as "foo.xml". Now that the data is well-formed, import etree from the lxml library:
from lxml import etree as et
Now it is time to parse the data and create a root object from which to start:
file_name = r"C:\foo.xml"
xmlParse = et.parse(file_name) #Parse the xml file
root = xmlParse.getroot() #Get the root
Once we have the root object, we can use the iter() method to iterate over all b tags (the older getiterator() method is deprecated). Because iter() returns an iterator, we can pass it to list() to collect the element objects in a list. From there we can edit the text between the b tags:
bTags = list(root.iter("b")) #Collect every b element in document order
bTags[0].text = "Change b tag 1." #Change text from "Text I need"
bTags[1].text = "Change b tag 2." #Change text from "Text I need2"
xmlParse.write(file_name) #Overwrite the original xml file
The final output should look something like this:
<?xml version="1.0"?>
<body>
<a>
<b>Change b tag 1.</b>
</a>
<a>
<b>Change b tag 2.</b>
</a>
</body>
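If editing the file by hand is not practical, the wrapping step can also be done in code on the string the question already reads from the file. This is a rough sketch using the standard library's ElementTree rather than lxml; it assumes the only problem is the missing root element, and the names raw, wrapped, and b_texts are illustrative.

```python
import xml.etree.ElementTree as ET

# The file contents from the question, already read into a string
raw = '''<?xml version="1.0"?>
<a>
<b>Text I need</b>
</a>
<a>
<b>Text I need2</b>
</a>'''

body = raw.strip()
if body.startswith('<?xml'):
    # The XML declaration must come first in a document, so drop it before wrapping
    body = body[body.index('?>') + 2:]

# Wrap the fragments in a single root element so the parser accepts them
wrapped = '<body>' + body + '</body>'
root = ET.fromstring(wrapped)

# Read the text of every b element without modifying the file
b_texts = [b.text for b in root.iter('b')]
print(b_texts)
```

This reads the values the question asks for and leaves the original file untouched.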
I am working with my Evernote data, exported to an XML file. I have parsed the data using BeautifulSoup, and here is a sample of the XML.
<note>
<title>
Audio and camera roll from STUDY DAY! in San Francisco
</title>
<content>
<![CDATA[<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE en-note SYSTEM "http://xml.evernote.com/pub/enml2.dtd">
<en-note><div><en-media type="image/jpeg" hash="e3a84de41c9886b93a6921413b8482d5" width="1080" style="" height="1920"/><en-media type="image/jpeg" hash="b907b22a9f2db379aec3739d65ce62db" width="1123" style="" height="1600"/><en-media type="audio/wav" hash="d3fdcd5a487531dc156a8c5ef6000764" style=""/><br/></div>
</en-note>
]]>
</content>
<created>
20130217T153800Z
</created>
<updated>
20130217T154311Z
</updated>
<note-attributes>
<latitude>
37.78670730072799
</latitude>
<longitude>
-122.4171893858559
</longitude>
<altitude>
42
</altitude>
<source>
mobile.iphone
</source>
<reminder-order>
0
</reminder-order>
</note-attributes>
<resource>
<data encoding="base64">
There are two avenues I would like to explore here:
1. Finding and removing specific tags (in this case )
2. Locating a group/list of tags to extract to another document
This is my current code, which parses the XML, prettifies it, and writes the result to a text file.
from bs4 import BeautifulSoup
soup = BeautifulSoup(open('myNotes.xml','r'))
with open("file.txt", "w", encoding="utf8") as f:
    # prettify() returns a str on Python 3, so let open() handle the encoding
    f.write(soup.prettify())
You can search for nodes by name:
from bs4 import BeautifulSoup
soup = BeautifulSoup(open('myNotes.xml', 'r'))
source = soup.source
print(source)
#<source>
# mobile.iphone
#</source>
print(source.string)
# mobile.iphone
Another way to do it is the find_all method:
for tag in soup.find_all('source'):
    print(tag.string)
If you want to print every node with the tags stripped, this should do the job:
for tag in soup.find_all():
    print(tag.string)
Hope it helps.
EDIT:
BeautifulSoup assumes you know the structure of the document, although XML is by definition a structured data format.
So you need to tell BeautifulSoup which elements to pick out of your XML.
row = []
title = soup.note.title.string
created = soup.note.created.string
row.append(title)
row.append(created)
Now you only have to repeat this for each note in the XML.
If you're using BeautifulSoup, you could use the getText() method to strip out the tags in the child elements and get one consolidated string:
source.getText()
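Neither answer shows the removal step from the question. As a rough sketch, this is one way to drop every element with a given tag name and then collect a few fields, using the standard library's ElementTree instead of BeautifulSoup. The tag name resource is only a stand-in, since the question does not say which tag should go, and the sample note is a shortened, made-up version of the export.

```python
import xml.etree.ElementTree as ET

# Shortened, made-up note in the same shape as the export in the question
sample = '''<note>
<title>
Sample note
</title>
<created>
20130217T153800Z
</created>
<note-attributes>
<source>
mobile.iphone
</source>
</note-attributes>
<resource><data encoding="base64">AAAA</data></resource>
</note>'''

note = ET.fromstring(sample)

TAG_TO_REMOVE = 'resource'  # assumption: replace with the tag you want gone

# ElementTree elements do not know their parent, so visit every element
# and remove matching children from it
for parent in list(note.iter()):
    for child in list(parent):
        if child.tag == TAG_TO_REMOVE:
            parent.remove(child)


def text_of(path):
    """Return the stripped text at path, or '' if the element is missing."""
    return (note.findtext(path) or '').strip()


row = [text_of('title'), text_of('created'), text_of('note-attributes/source')]
print(row)
```

The row list plays the same role as the one built with soup.note.title.string above and could be written to another document.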
I want to transform this xml tree
<doc>
<a>
<op>xxx</op>
</a>
</doc>
to
<doc>
<a>
<cls>
<op>xxx</op>
</cls>
</a>
</doc>
I use this Python code:
from lxml import etree
f = etree.fromstring('<doc><a><op>xxx</op></a></doc>')
node_a = f.xpath('/doc/a')[0]
ele = etree.Element('cls')
node_a.insert(0, ele)
node_cls = f.xpath('/doc/a/cls')[0]
node_op = f.xpath('/doc/a/op')[0]
node_cls.append(node_op)
print(etree.tostring(f, pretty_print=True).decode())
Is this the best solution?
Now I want to obtain
<cls>
<doc>
<a>
<op>xxx</op>
</a>
</doc>
</cls>
I am unable to find any solution.
Thanks for your help.
I prefer BeautifulSoup to lxml; I find it easier to handle.
Both problems can be solved with the same approach: find the element, get its parent, create the new element, and put the old one inside it.
from bs4 import BeautifulSoup
import sys
soup = BeautifulSoup(open(sys.argv[1], 'r'), 'xml')
for e in soup.find_all(sys.argv[2]):
    p = e.parent
    cls = soup.new_tag('cls')
    # extract() detaches the element from the tree and returns it
    e_extracted = e.extract()
    cls.append(e_extracted)
    p.append(cls)
print(soup.prettify())
The script accepts two arguments: the first is the XML file and the second is the tag to wrap in the new tag. Run it like:
python3 script.py xmlfile op
That yields:
<?xml version="1.0" encoding="utf-8"?>
<doc>
<a>
<cls>
<op>
xxx
</op>
</cls>
</a>
</doc>
For <doc>, run it like:
python3 script.py xmlfile doc
With the following result:
<?xml version="1.0" encoding="utf-8"?>
<cls>
<doc>
<a>
<op>
xxx
</op>
</a>
</doc>
</cls>
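The same two transformations can also be sketched with the standard library's xml.etree.ElementTree, for anyone who would rather not add a dependency. The helper name wrap_children is made up for this example; unlike the append-based script above, it puts the wrapper back at the original position of the wrapped element.

```python
import xml.etree.ElementTree as ET


def wrap_children(parent, child_tag, wrapper_tag):
    """Wrap each direct child of parent named child_tag in a new wrapper_tag element."""
    # Iterate over a copy because parent is modified inside the loop
    for index, child in enumerate(list(parent)):
        if child.tag == child_tag:
            wrapper = ET.Element(wrapper_tag)
            parent.remove(child)
            wrapper.append(child)
            # Reinsert at the same position so sibling order is preserved
            parent.insert(index, wrapper)


# Case 1: wrap <op> inside <a>
doc = ET.fromstring('<doc><a><op>xxx</op></a></doc>')
wrap_children(doc.find('a'), 'op', 'cls')
inner_result = ET.tostring(doc, encoding='unicode')
print(inner_result)

# Case 2: wrap the root itself, which has no parent, in a new root element
doc2 = ET.fromstring('<doc><a><op>xxx</op></a></doc>')
new_root = ET.Element('cls')
new_root.append(doc2)
outer_result = ET.tostring(new_root, encoding='unicode')
print(outer_result)
```

Wrapping the root needs its own case because there is no parent element to insert the wrapper into.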