Extracting comments from XML file in Python


I would like to extract the comment text from an XML file. The information I want is inside the Tag element, within its Text element; in this example it is "EXAMPLE".
The structure of the XML file is shown below.
<Boxes>
  <Box Id="3" ZIndex="13">
    <Shape>Rectangle</Shape>
    <Brush Id="0" />
    <Pen>
      <Color>#FF000000</Color>
    </Pen>
    <Tag>&lt;?xml version="1.0"?&gt;
&lt;PFDComment xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema"&gt;
  &lt;Text&gt;**EXAMPLE** &lt;/Text&gt;
&lt;/PFDComment&gt;</Tag>
  </Box>
</Boxes>
I tried something like the code below but couldn't get the information that I want.
def read_cooments(xml):
    tree = lxml.etree.parse(xml)
    Comments= {}
    for comment in tree.xpath("//Boxes/Box"):
        #
        get_id = comment.attrib['Id']
        Comments[get_id] = []
        for group in comment.xpath(".//Tag"):
            #
            Comments[get_id].append(group.text)
    df_name1 = pd.DataFrame(dict([(k,pd.Series(v)) for k,v in Comments.items()]))
Can anyone help me extract the comments from the XML file shown above? Any help is appreciated!

Use the code given below:
def read_comments(xml):
    tree = etree.parse(xml)
    rows = []
    for box in tree.xpath('Box'):
        id = box.attrib['Id']
        # Text content of the Tag child (None if there is no Tag element)
        tagTxt = box.findtext('Tag')
        if tagTxt is None:
            continue
        # The Tag text is itself an XML document; parse it and look for Text
        txtNode = etree.XML(tagTxt).find('Text')
        if txtNode is None:
            continue
        rows.append([id, txtNode.text.strip()])
    return pd.DataFrame(rows, columns=['id', 'Comment'])
Note that if you create a DataFrame within a function, it is a local
variable of that function and is not visible from outside.
A better and more readable approach (the one I used) is to have the function
return the DataFrame.
The function also contains continue in two places, to guard against possible
error cases: when a Box element does not contain a Tag child, or when the
Tag does not contain a Text child element.
I also noticed that there is no need to replace &lt; or &gt; with < or >
in my own code, as lxml performs this unescaping on its own.
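The unescaping step can be checked in isolation. The following is a minimal sketch, not the code from the answer: it uses the standard library's xml.etree.ElementTree instead of lxml so that it runs without extra packages, and the SAMPLE string is a shortened, hypothetical version of the file from the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened sample: the inner document is stored as
# escaped text inside the Tag element.
SAMPLE = (
    '<Boxes><Box Id="3">'
    '<Tag>&lt;?xml version="1.0"?&gt;'
    '&lt;PFDComment&gt;&lt;Text&gt;**EXAMPLE** &lt;/Text&gt;&lt;/PFDComment&gt;'
    '</Tag></Box></Boxes>'
)

root = ET.fromstring(SAMPLE)
# The parser has already turned &lt; and &gt; back into < and >.
tag_text = root.find('Box').findtext('Tag')
print(tag_text)

# The unescaped text is a complete XML document, so it can be parsed again.
inner = ET.fromstring(tag_text)
comment = inner.findtext('Text').strip()
print(comment)
```

The first print shows the inner document with real angle brackets, which is why a second parse of that string is all that is needed to reach the Text element.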
Edit
My test is as follows. Start from the imports:
import pandas as pd
from lxml import etree
I used a file containing:
<Boxes>
  <Box Id="3" ZIndex="13">
    <Shape>Rectangle</Shape>
    <Brush Id="0" />
    <Pen>
      <Color>#FF000000</Color>
    </Pen>
    <Tag>&lt;?xml version="1.0"?&gt;
&lt;PFDComment xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema"&gt;
  &lt;Text&gt;**EXAMPLE** &lt;/Text&gt;
&lt;/PFDComment&gt;</Tag>
  </Box>
</Boxes>
I called the above function:
df_name1 = read_comments('Boxes.xml')
and when I printed df_name1, I got:
  id      Comment
0  3  **EXAMPLE**
If something goes wrong, use the "extended" version of the above function,
with test printouts:
def read_comments(xml):
    tree = etree.parse(xml)
    rows = []
    for box in tree.xpath('Box'):
        id = box.attrib['Id']
        tagTxt = box.findtext('Tag')
        if tagTxt is None:
            print('No Tag element')
            continue
        txtNode = etree.XML(tagTxt).find('Text')
        if txtNode is None:
            print('No Text element')
            continue
        txt = txtNode.text.strip()
        print(f'{id}: {txt}')
        rows.append([id, txt])
    return pd.DataFrame(rows, columns=['id', 'Comment'])
and take a look at printouts.

Related

How to manipulate xml based on the specific tags?

There's an XML file that looks something like this:
<OUTER>
  <TYPE>FIRST</TYPE>
  <FIELD1>1</FIELD1>
  <ID>55056</ID>
  <TASK>
    <FILE>
      <OPTIONS>1</OPTIONS>
    </FILE>
  </TASK>
</OUTER>
<OUTER>
  <TYPE>SECOND</TYPE>
  <FIELD1>2</FIELD1>
  <ID>58640</ID>
  <TASK>
    <FILE>
      <OPTIONS>1</OPTIONS>
    </FILE>
  </TASK>
</OUTER>
The text in the ID tag needs to be updated with a new value, which is held in the variable NEW_ID1. The comparison should be made on the TYPE tag: only if its text is FIRST should the ID be replaced with the new ID and written back to the XML. Similarly, if TYPE is SECOND, update the ID with NEW_ID2, and so on. How can I do this? I tried the following:
tree = ET.parse("sample.xml")
root = tree.getroot()
det = tree.findall(".//OUTER[TYPE='FIRST']")
.
.
ID = NEW_ID1
tree.write("sample.xml")
but I was not able to manipulate it further.

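You are close. Note that TYPE is a child element, not an attribute. In ElementTree the predicate [TYPE='FIRST'] tests the text of a child element (an attribute test would be written [@TYPE='FIRST']), so your findall call does select the matching OUTER elements. What is missing is the step that actually changes the ID.
A straightforward way to do that is to iterate through all of the OUTER elements and test whether each one contains a TYPE child with "FIRST" as its text value. Then you can grab that OUTER element's ID descendant and change its text value.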
For example:
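import xml.etree.ElementTree as ET

tree = ET.parse("sample.xml")
root = tree.getroot()
for outer in tree.findall(".//OUTER"):
    # TYPE is a child element of OUTER, so look it up by its tag name
    elem = outer.find("TYPE")
    if elem is not None and elem.text == "FIRST":
        id_elem = outer.find(".//ID")
        id_elem.text = "NEWID1"
tree.write("sample.xml")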
Note: I am assuming that your xml file doesn't only contain the markup that is in your question. There should only be one root element in an xml file.
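To handle "SECOND" and further types without repeating the if block, the new IDs can be kept in a dictionary keyed by the TYPE text. This is only a sketch under stated assumptions: the ROOT wrapper element, the shortened OUTER elements and the replacement values in NEW_IDS are hypothetical, and the XML is parsed from a string rather than from sample.xml so the example is self-contained.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: the OUTER elements wrapped in a single root element.
SAMPLE = """<ROOT>
  <OUTER><TYPE>FIRST</TYPE><ID>55056</ID></OUTER>
  <OUTER><TYPE>SECOND</TYPE><ID>58640</ID></OUTER>
</ROOT>"""

# Hypothetical replacement values, keyed by the text of the TYPE element.
NEW_IDS = {"FIRST": "11111", "SECOND": "22222"}

root = ET.fromstring(SAMPLE)
for outer in root.iter("OUTER"):
    # findtext returns None when there is no TYPE child
    new_id = NEW_IDS.get(outer.findtext("TYPE"))
    id_elem = outer.find("ID")
    if new_id is not None and id_elem is not None:
        id_elem.text = new_id

updated = [e.text for e in root.iter("ID")]
print(updated)
```

OUTER elements whose TYPE is not in the dictionary are left untouched, so new types can be supported by adding entries rather than new branches.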

XML to table form in Excel

Excel offers an option when you open an XML file with it. You get prompted with the option, as shown in a screenshot in the original post.
It basically opens the XML file as a table and, based on the analysis I have done, it seems to do a pretty good job.
A second screenshot in the original post shows how the file looks after opening it in Excel in table form.
My question: I want to convert an XML file into a table the way that feature in Excel does. Is that possible?
The reason I want this result is that working with tables is really easy using libraries like pandas. However, I don't want to open every XML file with Excel, show the table and then save it again. That is not very time efficient.
This is my XML file
<?xml version="1.0" encoding="utf-8"?>
<ProjectData>
  <FINAL>
    <START id="ID0001" service_code="0x5196">
      <Docs Docs_type="START">
        <Rational>225196</Rational>
        <Qualify>6251960000A0DE</Qualify>
      </Docs>
      <Description num="1213f2312">The parameter</Description>
      <SetFile dg="" dg_id="">
        <SetData value="32" />
      </SetFile>
    </START>
    <START id="DG0003" service_code="0x517B">
      <Docs Docs_type="START">
        <Rational>23423</Rational>
        <Qualify>342342</Qualify>
      </Docs>
      <Description num="3423423f3423">The third</Description>
      <SetFile dg="" dg_id="">
        <FileX dg="" axis_pts="2" name="" num="" dg_id="" />
        <FileY unit="" axis_pts="20" name="TOOLS" text_id="23423" unit_id="" />
        <SetData x="E1" value="21259" />
        <SetData x="E2" value="0" />
      </SetFile>
    </START>
    <START id="ID0048" service_code="0x5198">
      <RawData rawdata_type="OPDATA">
        <Request>225198</Request>
        <Response>343243324234234</Response>
      </RawData>
      <Meaning text_id="434234234">The forth</Meaning>
      <ValueDataset unit="m" unit_id="FEDS">
        <FileX dg="kg" discrete="false" axis_pts="19" name="weight" text_id="SDF3" unit_id="SDGFDS" />
        <SetData xin="sdf" xax="233" value="323" />
        <SetData xin="123" xax="213" value="232" />
        <SetData xin="2321" xax="232" value="23" />
      </ValueDataset>
    </START>
  </FINAL>
</ProjectData>

So let's say I have the following input.xml file:
<main>
  <item name="item1" image="a"></item>
  <item name="item2" image="b"></item>
  <item name="item3" image="c"></item>
  <item name="item4" image="d"></item>
</main>
You can use the following code:
import xml.etree.ElementTree as ET
import pandas as pd
tree = ET.parse('input.xml')
tags = [ e.attrib for e in tree.getroot() ]
df = pd.DataFrame(tags)
# df:
#     name image
# 0  item1     a
# 1  item2     b
# 2  item3     c
# 3  item4     d
And this should be independent of the number of attributes in a given file.
To write to a simple CSV file from pandas, you can use the to_csv method (see the pandas documentation). If the result needs to be an Excel sheet, you can use to_excel instead.
# Write to csv without the row names
df.to_csv('file_name.csv', index = False)
# Write to xlsx sheet without the row names
df.to_excel('file_name.xlsx', index=False)
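For comparison, the same attributes-to-rows idea can be sketched with only the standard library, which may help when pandas is not available. This swaps pandas for the csv module and is not the code from the answer; the SAMPLE string and its extra size attribute are hypothetical, added to show that rows with different attribute sets still line up.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical input: the second item carries an extra attribute.
SAMPLE = """<main>
  <item name="item1" image="a"/>
  <item name="item2" image="b" size="10"/>
</main>"""

# One dict of attributes per child element of the root
rows = [dict(e.attrib) for e in ET.fromstring(SAMPLE)]
# Column order: first appearance of each attribute name across all rows
columns = list(dict.fromkeys(key for row in rows for key in row))

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=columns, restval='',
                        lineterminator='\n')
writer.writeheader()
writer.writerows(rows)  # missing attributes are written as empty cells
csv_text = buffer.getvalue()
print(csv_text)
```

Writing to an io.StringIO buffer keeps the sketch free of file-system side effects; passing an open file object instead would produce the CSV file directly.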
UPDATE:
For your XML file, and based on your clarification in the comments, I suggest the following, where every element in the first level of the tree becomes a row, and every attribute or node text becomes a column:
def has_children(e):
    ''' Check if element, e, has children'''
    return len(list(e)) > 0

def has_attrib(e):
    ''' Check if element, e, has attributes'''
    return len(e.attrib) > 0

def get_unique_key(mydict, key):
    ''' Generate unique key if already exists in mydict'''
    while key in mydict:
        key = key + '*'
    return key

tree = ET.parse('input2.xml')
root = tree.getroot()
# Get first level:
lvl_one = list(root)
myList = []
for e in lvl_one:
    mydict = {}
    # Iterate over each node in level one element
    for node in e.iter():
        if not has_children(node) and node.text is not None:
            unique_key = get_unique_key(mydict, node.tag)
            mydict[unique_key] = node.text
        if has_attrib(node):
            for key in node.attrib:
                unique_key = get_unique_key(mydict, key)
                mydict[unique_key] = node.attrib[key]
    myList.append(mydict)
print(pd.DataFrame(myList))
Notice in this code, I check if the column name exists for each key, and if it exists, I create a new column name by suffixing with '*'.
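The effect of the '*' suffix is easiest to see on a tiny input. The sketch below re-implements the key helper on its own (named get_unique_key here) and feeds it three values that share the name value; the pairs are made up for illustration and only loosely mirror the repeated SetData attributes in the question's file.

```python
def get_unique_key(mydict, key):
    """Append '*' to key until it no longer collides with a key in mydict."""
    while key in mydict:
        key += '*'
    return key

row = {}
# Hypothetical (name, value) pairs with a repeated name
for name, value in [('value', '32'), ('value', '21259'), ('value', '0')]:
    row[get_unique_key(row, name)] = value

print(row)
# {'value': '32', 'value*': '21259', 'value**': '0'}
```

Each repeated name gets one more asterisk than the previous one, so no value overwrites another and every occurrence ends up in its own column.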

parse a section of an XML file with python

I'm new to both Python and XML. I have looked at the previous posts on the topic, and I can't figure out how to do exactly what I need to, although it seems simple enough in principle.
<Project>
  <Items>
    <Item>
      <Code>A456B</Code>
      <Database>
        <Data>
          <Id>mountain</Id>
          <Value>12000</Value>
        </Data>
        <Data>
          <Id>UTEM</Id>
          <Value>53.2</Value>
        </Data>
      </Database>
    </Item>
    <Item>
      <Code>A786C</Code>
      <Database>
        <Data>
          <Id>mountain</Id>
          <Value>5000</Value>
        </Data>
        <Data>
          <Id>UTEM</Id>
          <Value></Value>
        </Data>
      </Database>
    </Item>
  </Items>
</Project>
All I want to do is extract all of the Codes, Values and ID's, which is no problem.
import xml.etree.ElementTree as ET

name = 'example tree.xml'
tree = ET.parse(name)
root = tree.getroot()
codes=[]
ids=[]
val=[]
for db in root.iter('Code'):
    codes.append(db.text)
for ID in root.iter('Id'):
    ids.append(ID.text)
for VALUE in root.iter('Value'):
    val.append(VALUE.text)
print(codes)
print(ids)
print(val)
['A456B', 'A786C']
['mountain', 'UTEM', 'mountain', 'UTEM']
['12000', '53.2', '5000', None]
I want to know which Ids and Values go with which Code. Something like a dictionary of dictionaries maybe OR perhaps a list of DataFrames with the row index being the Id, and the column header being Code.
for example
A456B = {mountain:12000, UTEM:53.2}
A786C = {mountain:5000, UTEM: None}
Eventually I want to use the Values to feed an equation.
Note that the real xml file might not contain the same number of Ids and Values in each Code. Also, Id and Value might be different from one Code section to another.
Sorry if this question is elementary, or unclear...I've only been doing python for a month :/

BeautifulSoup is a very useful module for parsing HTML and XML.
from bs4 import BeautifulSoup
import os
# read the file into a BeautifulSoup object
soup = BeautifulSoup(open(os.getcwd() + "\\input.txt"))
results = {}
# parse the data, and put it into a dict, where the values are dicts
for item in soup.findAll('item'):
    # assemble dicts on the fly using a dict comprehension:
    # http://stackoverflow.com/a/14507637/4400277
    results[item.code.text] = {data.id.text:data.value.text for data in item.findAll('data')}
>>> results
{u'A786C': {u'mountain': u'5000', u'UTEM': u''},
 u'A456B': {u'mountain': u'12000', u'UTEM': u'53.2'}}

This might be what you want:
import xml.etree.ElementTree as ET

name = 'test.xml'
tree = ET.parse(name)
root = tree.getroot()
codes={}
for item in root.iter('Item'):
    code = item.find('Code').text
    codes[code] = {}
    for datum in item.iter('Data'):
        if datum.find('Value') is not None:
            value = datum.find('Value').text
        else:
            value = None
        # Only store the pair when the Data element has an Id child
        if datum.find('Id') is not None:
            id = datum.find('Id').text
            codes[code][id] = value
print(codes)
This produces:
{'A456B': {'mountain': '12000', 'UTEM': '53.2'}, 'A786C': {'mountain': '5000', 'UTEM': None}}
This iterates over all Item tags and, for each one, creates a dict key pointing to a dict of Id/Value pairs. An Id/Value pair is only created if the Id tag is present.
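Since the question mentions feeding the values into an equation later, the strings usually need to become numbers first. Here is a minimal sketch, assuming the codes dictionary has the shape printed above; the to_number helper is hypothetical and simply maps empty or missing values to None.

```python
# Same shape as the dictionary printed above
codes = {
    'A456B': {'mountain': '12000', 'UTEM': '53.2'},
    'A786C': {'mountain': '5000', 'UTEM': None},
}

def to_number(text):
    """Return text as a float, or None when the element was empty."""
    if text is None or not text.strip():
        return None
    return float(text)

numeric = {
    code: {key: to_number(val) for key, val in data.items()}
    for code, data in codes.items()
}
print(numeric['A456B']['mountain'] + numeric['A456B']['UTEM'])
```

Keeping None for empty values leaves it to the later calculation to decide whether a missing value should be skipped or treated as an error.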

Search xml for text and return element/node

I'd like to be able to search an XML file by text value and return the id of the element it belongs to. I've looked through the XML modules in the Python library but only saw examples for searching by elements/nodes. I have a simplified XML sample below; for example, I'd like to search for "3x3 Eyes" and get back "2". The search should match the exact text, ignoring case. There will normally be multiple title entries under each anime, so the search can stop at the first match. Thanks
<?xml version="1.0" encoding="UTF-8"?>
<animetitles>
  <anime aid="1">
    <title type="official" xml:lang="fr">Crest of the Stars</title>
    <title type="official" xml:lang="fr">Crest of the Stars</title>
  </anime>
  <anime aid="2">
    <title type="official" xml:lang="en">3x3 Eyes</title>
  </anime>
  <anime aid="3">
    <title type="official" xml:lang="en">3x3 Eyes: Legend of the Divine Demon</title>
  </anime>
</animetitles>

import xml.etree.ElementTree as et

tree = et.parse( ... )
# Unique match
results = []
for anime in tree.findall('anime'):
    for title in anime.findall('title'):
        if title.text == '3x3 Eyes':
            results.append(anime.get('aid'))
print(results)
# Everything that starts with
results = []
for anime in tree.findall('anime'):
    for title in anime.findall('title'):
        if title.text.startswith('3x3 Eyes'):
            results.append(anime.get('aid'))
print(results)
The first one prints ['2'], the second one ['2', '3'] (attribute values are strings).
Or a little bit more cryptic but, hey, why not :)
results = [anime.get('aid') for anime in tree.findall('anime')
           for title in anime.findall('title') if title.text == '3x3 Eyes']

You can use ElementTree for your purpose.
import xml.etree.ElementTree as ET

tree = ET.parse('a.xml')
root = tree.getroot()

def findParentAttrib(string):
    # Visit every element and compare the text of its direct children
    for parent in root.iter():
        for child in parent:
            if child.text == string:
                return parent.attrib['aid']

print(findParentAttrib("3x3 Eyes"))  # prints 2
Also refer to this page.
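Neither snippet above covers the two extra requirements from the question: ignoring case and stopping at the first match. A minimal sketch of that behaviour follows; the find_aid function name and the shortened SAMPLE document are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical version of the file from the question
SAMPLE = """<animetitles>
  <anime aid="1"><title>Crest of the Stars</title></anime>
  <anime aid="2"><title>3x3 Eyes</title></anime>
  <anime aid="3"><title>3x3 Eyes: Legend of the Divine Demon</title></anime>
</animetitles>"""

def find_aid(root, wanted):
    """Return the aid of the first anime whose title equals wanted, ignoring case."""
    target = wanted.casefold()
    for anime in root.iter('anime'):
        for title in anime.iter('title'):
            # title.text is None for an empty element, hence the "or ''"
            if (title.text or '').casefold() == target:
                return anime.get('aid')  # stop at the first match
    return None

root = ET.fromstring(SAMPLE)
print(find_aid(root, '3X3 eyes'))
```

Returning from inside the loop is what stops the search at the first match, and str.casefold is used rather than lower so that the comparison also behaves sensibly for non-ASCII titles.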

Parsing XML tags through Python and replace it using xml.dom.minidom

My XML file test.xml contains the following tags
<?xml version="1.0" encoding="ISO-8859-1"?>
<AppName>
  <out>This is a sample output with <test>default</test> text </out>
</AppName>
So far I have written Python code that does the following:
from xml.dom.minidom import parseString

list = {'test':'example'}
file = open('test.xml','r')
data = file.read()
file.close()
dom = parseString(data)
if (len(dom.getElementsByTagName('out'))!=0):
    xmlTag = dom.getElementsByTagName('out')[0].toxml()
    out = xmlTag.replace('<out>','').replace('</out>','')
    print(out)
The output of the above program is: This is a sample output with <test>default</test> text
You will also notice I have defined a dictionary named list, with list = {'test':'example'}.
I want to check whether out contains a tag that is listed in list. If it does, the tag should be replaced with the corresponding value; otherwise the default value should be kept.
In this case, the output should be:
This is a sample output with example text

This will do more or less what you want:
from xml.dom.minidom import parseString

test_xml = '''<?xml version="1.0" encoding="ISO-8859-1"?>
<AppName>
<out>This is a sample output with <test>default</test> text </out>
</AppName>'''
replacements = {'test':'example'}
dom = parseString(test_xml)
if (len(dom.getElementsByTagName('out'))!=0):
    xmlTag = dom.getElementsByTagName('out')[0]
    children = xmlTag.childNodes
    text = ""
    for c in children:
        if c.nodeType == c.TEXT_NODE:
            text += c.data
        else:
            if c.nodeName in replacements.keys():
                text += replacements[c.nodeName]
            else: # not text, nor a listed tag
                text += c.toxml()
    print(text)
Notice that I used replacements rather than list. In Python terms, it's a dictionary, not a list, so list is a confusing name. list is also the name of a builtin, so you should avoid using it as a variable name.
If you want a dom object rather than just text, you'll need to take a different approach.
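One possible version of that different approach is to edit the DOM in place: swap each listed element for a text node and then merge the adjacent text nodes. This is a sketch only, assuming the same replacements dictionary as above; the XML is shortened to a single line for the example, and tags that are not listed are simply left as they are.

```python
from xml.dom.minidom import parseString

test_xml = ('<AppName><out>This is a sample output with '
            '<test>default</test> text </out></AppName>')
replacements = {'test': 'example'}

dom = parseString(test_xml)
out = dom.getElementsByTagName('out')[0]
# Iterate over a copy, because replaceChild changes out.childNodes
for child in list(out.childNodes):
    if child.nodeType == child.ELEMENT_NODE and child.nodeName in replacements:
        new_text = dom.createTextNode(replacements[child.nodeName])
        out.replaceChild(new_text, child)
out.normalize()  # merge the adjacent text nodes into a single one

result = out.toxml()
print(result)
# <out>This is a sample output with example text </out>
```

After this, dom is still a regular minidom document, so it can be queried further or serialized with dom.toxml().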
