A sample of my XML file is given below. I want to access the text "The bread is top notch as well." and the category "food".
<sentences>
<sentence id="32897564#894393#2">
<text>The bread is top notch as well.</text>
<aspectTerms>
<aspectTerm term="bread" polarity="positive" from="4" to="9"/>
</aspectTerms>
<aspectCategories>
<aspectCategory category="food" polarity="positive" />
</aspectCategories>
</sentence>
My code is:
import xml.etree.ElementTree as ET
test_text_file=open('Restaurants_Test_Gold.txt', 'rt')
test_text_file1=test_text_file.read()
root = ET.fromstring(test_text_file1)
for page in list(root):
    text = page.find('text').text
    Category = page.find('aspectCategory')
    print ('sentence: %s; category: %s' % (text,Category))
test_text_file.close()
It depends on how complicated your XML format is. The easiest way is to access the path directly.
import xml.etree.ElementTree as ET
tree = ET.parse('x.xml')
root = tree.getroot()
print(root.find('.//text').text)
print(root.find('.//aspectCategory').attrib['category'])
But if there are similar tags elsewhere in the document, you might want to use a longer path such as .//aspectCategories/aspectCategory instead.
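As a rough sketch of the longer-path approach applied to every sentence rather than only the first match (find() returns just the first matching element), something like the following might work. The XML string is the sample from the question with a closing </sentences> tag assumed, and the variable names are only for illustration.

```python
import xml.etree.ElementTree as ET

# Sample from the question; the closing </sentences> tag is assumed.
SAMPLE = """<sentences>
<sentence id="32897564#894393#2">
<text>The bread is top notch as well.</text>
<aspectTerms>
<aspectTerm term="bread" polarity="positive" from="4" to="9"/>
</aspectTerms>
<aspectCategories>
<aspectCategory category="food" polarity="positive" />
</aspectCategories>
</sentence>
</sentences>"""

root = ET.fromstring(SAMPLE)
results = []
for sentence in root.findall('sentence'):
    text = sentence.findtext('text')
    # A sentence may carry several categories, so collect them all.
    categories = [c.get('category')
                  for c in sentence.findall('aspectCategories/aspectCategory')]
    results.append((text, categories))
    print('sentence: %s; categories: %s' % (text, ', '.join(categories)))
```

Reading from a file instead would only change the first step: ET.parse(filename).getroot() in place of ET.fromstring(SAMPLE).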
Here is my code solving your problem
import os
import xml.etree.ElementTree as ET
basedir = os.path.abspath(os.path.dirname(__file__))
filenamepath = os.path.join(basedir, 'Restaurants_Test_Gold.txt')
test_text_file = open(filenamepath, 'r')
file_contents = test_text_file.read()
tree = ET.fromstring(file_contents)
for sentence in list(tree):
    sentence_items = list(sentence.iter())
    # remove first element because it's the sentence element [<sentence>] itself
    sentence_items = sentence_items[1:]
    for item in sentence_items:
        if item.tag == 'text':
            print(item.text)
        elif item.tag == 'aspectCategories':
            category = item.find('aspectCategory')
            print(category.attrib.get('category'))
test_text_file.close()
Hope it helps
I have read some of the answers to related questions, but none of them deals directly with lxml's tostring and pretty_print.
I am using lxml to create an XML file on Python 3.6.
The problem is that the elements are not placed on separate lines and indented under their parent element, and I believe this is related to the "pretty_print" option.
What I need to achieve is:
<root>
<element1></element1>
<element2></element2>
<child1></child1>
<child2></child2>
</root>
The result I get is:
<root><element1></element1><element2></element2><child1></child1><child2></child2></root>
Part of the code I am using:
from lxml import etree as et
CompanyID = "Company Identification"
TaxRegistrationNumber = "Company Reg. Number"
TaxAccountingBasis = "File Tipe"
CompanyName = "Company Name"
BusinessName = "Business Name"
root = et.Element("root")
header = et.SubElement(root, 'Header')
header.tail = '\n'
data = (
('CompanyID', str(CompanyID)),
('TaxRegistrationNumber', str(TaxRegistrationNumber)),
('TaxAccountingBasis', str(TaxAccountingBasis)),
('CompanyName', str(CompanyName)),
('BusinessName', str(BusinessName)),
)
for tag, value in data:
    if value is None :
        continue
    et.SubElement(header, tag).text=value
xml_txt = et.tostring(root, pretty_print=True, encoding="UTF-8")
print(xml_txt)
If I print the elements with no data in them, "pretty_print" works fine.
If I add data to each of the elements (using the variables above), "pretty_print" does not work and the structure gets messed up.
What could be wrong?
I found it.
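I removed the line header.tail = '\n' from the code and it's working now. As far as I can tell, pretty_print will not add indentation around elements that already carry text or tail content, so the manually set tail was getting in the way.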
root = et.Element("root")
header = et.SubElement(root, 'Header')
#header.tail = '\n'
Thank you all
I want to save data in XML in this format:
<posts>
<row Id="1" PostTypeId="1" AcceptedAnswerId="13"
CreationDate="2010-09-13T19:16:26.763" Score="297" ViewCount="472045"
Body="&lt;p&gt;This is a common question by those who have just rooted their phones. What apps, ROMs, benefits, etc. do I get from rooting? What should I be doing now?&lt;/p&gt;&#xA;"
OwnerUserId="10"
LastEditorUserId="16575" LastEditDate="2013-04-05T15:50:48.133"
LastActivityDate="2018-05-19T19:51:11.530"
Title="I've rooted my phone. Now what? What do I gain from rooting?"
Tags="&lt;rooting&gt;&lt;root-access&gt;"
AnswerCount="3" CommentCount="0" FavoriteCount="194"
CommunityOwnedDate="2011-01-25T08:44:10.820" />
</posts>
I tried this, but I don't know how to save the data in the above format:
import xml.etree.cElementTree as ET
root = ET.Element("posts")
row = ET.SubElement(root, "row")
ET.SubElement(row, questionText = "questionText").text = questionText
ET.SubElement(row, votes = "votes").text = votes
ET.SubElement(row, tags = "tags").text = tags
tree = ET.ElementTree(root)
tree.write("data.xml")
The docs for SubElement say it takes the following arguments: parent, tag, attrib={}, **extra.
You omitted the tag, so your code fails with TypeError: SubElement() takes at least 2 arguments (1 given). You need something like:
import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9
root = ET.Element("posts")
row = ET.SubElement(root, "row", attrib={"foo":"bar", "baz":"qux"})
tree = ET.ElementTree(root)
tree.write("data.xml")
Outputs: <posts><row baz="qux" foo="bar" /></posts>
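To get closer to the format in the question, each field can be passed as an attribute of the row element instead of as child text. The sketch below assumes the question's variables questionText, votes and tags hold the values shown (they are made up here), and it picks Id, Body, Score and Tags as attribute names to mirror the sample. ElementTree escapes characters such as < in attribute values when serializing.

```python
import xml.etree.ElementTree as ET

# Hypothetical values standing in for the question's variables.
questionText = "<p>What do I gain from rooting?</p>"
votes = 297
tags = "<rooting><root-access>"

root = ET.Element("posts")
# Attribute values must be strings, so numbers are converted with str().
row = ET.SubElement(root, "row", attrib={
    "Id": "1",
    "Body": questionText,
    "Score": str(votes),
    "Tags": tags,
})

# Serialize to a string; markup characters in the values are escaped.
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)

# To save to a file instead, one could call:
# ET.ElementTree(root).write("data.xml", encoding="utf-8", xml_declaration=True)
```

Parsing the result back gives the original, unescaped strings, so the escaping is only a property of the file on disk.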
I have an xml file like this
<?xml version="1.0"?>
<sample>
<text>My name is <b>Wrufesh</b>. What is yours?</text>
</sample>
I have a python code like this
import xml.etree.ElementTree as ET
tree = ET.parse('sample.xml')
root = tree.getroot()
for child in root:
    print(child.text)
I only get
'My name is' as an output.
I want to get
'My name is <b>Wrufesh</b>. What is yours?' as an output.
What can I do?
You can get your desired output using ElementTree.tostringlist() (on Python 3, pass encoding='unicode' so that it returns str rather than bytes):
>>> import xml.etree.ElementTree as ET
>>> root = ET.parse('sample.xml').getroot()
>>> l = ET.tostringlist(root.find('text'))
>>> l
['<text', '>', 'My name is ', '<b', '>', 'Wrufesh', '</b>', '. What is yours?', '</text>', '\n']
>>> ''.join(l[2:-2])
'My name is <b>Wrufesh</b>. What is yours?'
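I wonder, though, how practical this is going to be for generic use: the documentation does not guarantee how tostringlist() splits its output, only that joining the pieces gives the same result as tostring(), so the slice indices may not hold in general.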
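A somewhat more general variant, offered only as a sketch: take the element's leading text and append each child serialized with tostring(), which includes the child's tail text. The helper name inner_xml is made up for this example, and the XML is the sample from the question inlined as a string.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<?xml version="1.0"?>
<sample>
<text>My name is <b>Wrufesh</b>. What is yours?</text>
</sample>"""

def inner_xml(elem):
    """Return the content between elem's start and end tags as a string."""
    # elem.text is the text before the first child (None if there is none).
    parts = [elem.text or '']
    # tostring() on a child includes the child's tail text after its end tag.
    parts.extend(ET.tostring(child, encoding='unicode') for child in elem)
    return ''.join(parts)

root = ET.fromstring(SAMPLE)
result = inner_xml(root.find('text'))
print(result)
```

This does not depend on how the serializer chunks its output, so it should carry over to elements with any number of nested children.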
I don't think treating tags in XML as strings is right. You can access the text parts of the XML like this:
#!/usr/bin/env python
# -*- coding:utf-8 -*-
import xml.etree.ElementTree as ET
tree = ET.parse('sample.xml')
root = tree.getroot()
text = root[0]
for i in text.itertext():
    print(i)
# As you can see, `<b>` and `</b>` are a pair of tags, not strings.
# list(text) gives the child elements (the private _children attribute is not available on Python 3).
print(list(text))
I would suggest pre-processing the XML file to wrap the content of the <text> element in CDATA. You should be able to read the value without a problem afterwards.
<text><![CDATA[My name is <b>Wrufesh</b>. What is yours?]]></text>
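A quick check of that idea, assuming the file has already been rewritten so that the <text> content sits in a CDATA section (the string below stands in for the pre-processed file):

```python
import xml.etree.ElementTree as ET

# Stand-in for the pre-processed file: the markup is wrapped in CDATA.
PREPROCESSED = """<sample>
<text><![CDATA[My name is <b>Wrufesh</b>. What is yours?]]></text>
</sample>"""

root = ET.fromstring(PREPROCESSED)
# Inside CDATA the parser treats <b>...</b> as plain characters, not a tag.
cdata_text = root.find('text').text
print(cdata_text)
```

The trade-off is that <b> is no longer an element in the tree, so it cannot be found with find() or iterated over.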
I have this XML file and I'd like to read some data out of it using Python's xml.etree :
<a>
<b>
<AuthorName>
<GivenName>John</GivenName>
<FamilyName>Smith</FamilyName>
</AuthorName>
<AuthorName>
<GivenName>Saint</GivenName>
<GivenName>Patrick</GivenName>
<FamilyName>Thomas</FamilyName>
</AuthorName>
</b>
</a>
The result that I wish to have is this :
John Smith
Saint Patrick Thomas
The thing, as you may have noticed, is that sometimes I have 1 GivenName tag and sometimes I have 2 GivenName tags
What I did was this :
from xml.etree import ElementTree as ET
xx = ET.parse('file.xml')
authorName = xx.findall('.//AuthorName')
for name in authorName:
    print(name[0].text + " " + name[1].text)
It works fine with 1 GivenName tag but not when I have 2.
What can I do?
Thanks!
Try this:
from xml.etree import ElementTree as ET
xx = ET.parse('file.xml')
authorName = xx.findall('.//AuthorName')
for name in authorName:
    nameStr = ' '.join([child.text for child in name])
    print(nameStr)
You have to look at all the child tags inside each AuthorName, take their text, and join them into nameStr.
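For a version that can be run without file.xml, here is a small sketch with the sample from the question inlined as a string. It selects children by tag name, which is an extra assumption on my part: it keeps the given names ahead of the family name even if the file lists them in another order.

```python
from xml.etree import ElementTree as ET

SAMPLE = """<a>
<b>
<AuthorName>
<GivenName>John</GivenName>
<FamilyName>Smith</FamilyName>
</AuthorName>
<AuthorName>
<GivenName>Saint</GivenName>
<GivenName>Patrick</GivenName>
<FamilyName>Thomas</FamilyName>
</AuthorName>
</b>
</a>"""

root = ET.fromstring(SAMPLE)
names = []
for author in root.findall('.//AuthorName'):
    # Select by tag so unrelated children (if any) are ignored,
    # and put given names before the family name regardless of order.
    given = [g.text for g in author.findall('GivenName')]
    family = [f.text for f in author.findall('FamilyName')]
    names.append(' '.join(given + family))

for n in names:
    print(n)
```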
It appears that you aren't really making use of your loop. Something like this might work a bit better for you:
from xml.etree import ElementTree as ET
xx = ET.parse('file.xml')
authorName = xx.findall('.//AuthorName')
for name in authorName:
    # collect the text of every child tag of this AuthorName
    nameParts = []
    for part in name:
        nameParts.append(part.text)
    fullName = ' '.join(nameParts)
    print(fullName)
Now, one more thing that you can do here to make your life a bit easier is learn about list comprehensions. For example, the above can be reduced to:
from xml.etree import ElementTree as ET
xx = ET.parse('file.xml')
for name in xx.findall('.//AuthorName'):
    fullName = ' '.join(part.text for part in name)
    print(fullName)
Note: This has not actually been tested to run. There may be typos.
I have a question about formatting XML files after generating them. Here is my code:
import csv
from xml.etree.ElementTree import Element, SubElement, Comment, tostring
from xml.etree.ElementTree import ElementTree
import xml.etree.ElementTree as etree
root = Element('Solution')
root.set('version','1.0')
tree = ElementTree(root)
head = SubElement(root, 'DrillHoles')
head.set('total_holes', '238')
description = SubElement(head,'description')
with open ('1250_12.csv', 'r') as data:
    current_group = None
    reader = csv.reader(data)
    i = 0
    for row in reader:
        if i > 0:
            x1,y1,z1,x2,y2,z2,cost = row
            if current_group is None or i != current_group.text:
                current_group = SubElement(description, 'hole',{'hole_id':"%s"%i})
            information = SubElement (current_group, 'hole',{'collar':', '.join((x1,y1,z1)),
                                                             'toe':', '.join((x2,y2,z2)),
                                                             'cost': cost})
        i+=1
Which produces the following xml file:
<?xml version="1.0"?>
<Solution version="1.0">
<DrillHoles total_holes="238">
<description>
<hole hole_id="1">
<hole toe="5797.82, 3061.01, 2576.29" cost="102.12" collar="5720.44, 3070.94, 2642.19"/></hole>
That is just part of the XML file, but it is enough to illustrate the problem.
There are several things I would like to change. First, I would like toe, cost, and collar to be on separate lines, like so:
<collar>0,-150,0</collar>
<toe>69.9891,-18.731,-19.2345</toe>
<cost>15</cost>
I would also like them to appear in the order collar, toe, cost, as shown above.
Furthermore, the XML file displays hole toe="5797.82, 3061.01, 2576.29". How do I get rid of the inner hole? That's about it. I am really new to Python, so go easy on me, haha.
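One possible way to get that layout, as a sketch only: give each hole three child elements (collar, toe, cost) whose text carries the values, instead of a nested hole element with attributes. Child elements are written in the order they are created, which settles the ordering. The CSV rows are inlined here with made-up numbers rather than read from 1250_12.csv, the first row is assumed to be a header (as the i > 0 check suggests), and ET.indent() is assumed to be available (Python 3.9 or later).

```python
import csv
import io
import xml.etree.ElementTree as ET

# Made-up stand-in for 1250_12.csv: a header row, then x1,y1,z1,x2,y2,z2,cost.
CSV_DATA = """x1,y1,z1,x2,y2,z2,cost
0,-150,0,69.9891,-18.731,-19.2345,15
5720.44,3070.94,2642.19,5797.82,3061.01,2576.29,102.12
"""

root = ET.Element('Solution', {'version': '1.0'})
head = ET.SubElement(root, 'DrillHoles')
description = ET.SubElement(head, 'description')

reader = csv.reader(io.StringIO(CSV_DATA))
next(reader)  # skip the header row
for hole_id, row in enumerate(reader, start=1):
    x1, y1, z1, x2, y2, z2, cost = row
    hole = ET.SubElement(description, 'hole', {'hole_id': str(hole_id)})
    # Children are serialized in creation order: collar, toe, cost.
    ET.SubElement(hole, 'collar').text = ','.join((x1, y1, z1))
    ET.SubElement(hole, 'toe').text = ','.join((x2, y2, z2))
    ET.SubElement(hole, 'cost').text = cost

# Count the holes instead of hard-coding the total.
head.set('total_holes', str(len(description)))

# ET.indent() (Python 3.9+) adds line breaks and indentation in place.
if hasattr(ET, 'indent'):
    ET.indent(root)
xml_text = ET.tostring(root, encoding='unicode')
print(xml_text)
```

Because the values are now element text rather than attributes, the inner hole element disappears and each value ends up on its own line once the tree is indented.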