Parse a section of an XML file with Python

I'm new to both Python and XML. I have looked at the previous posts on the topic, but I can't figure out how to do exactly what I need, although it seems simple enough in principle.
<Project>
<Items>
<Item>
<Code>A456B</Code>
<Database>
<Data>
<Id>mountain</Id>
<Value>12000</Value>
</Data>
<Data>
<Id>UTEM</Id>
<Value>53.2</Value>
</Data>
</Database>
</Item>
<Item>
<Code>A786C</Code>
<Database>
<Data>
<Id>mountain</Id>
<Value>5000</Value>
</Data>
<Data>
<Id>UTEM</Id>
<Value></Value>
</Data>
</Database>
</Item>
</Items>
</Project>
All I want to do is extract all of the Codes, Values and Ids, which is no problem.
import xml.etree.ElementTree as ET
name = 'example tree.xml'
tree = ET.parse(name)
root = tree.getroot()
codes = []
ids = []
val = []
# collect every Code, Id and Value in document order
for db in root.iter('Code'):
    codes.append(db.text)
for ID in root.iter('Id'):
    ids.append(ID.text)
for VALUE in root.iter('Value'):
    val.append(VALUE.text)
print(codes)
print(ids)
print(val)
['A456B', 'A786C']
['mountain', 'UTEM', 'mountain', 'UTEM']
['12000', '53.2', '5000', None]
I want to know which Ids and Values go with which Code. Something like a dictionary of dictionaries maybe OR perhaps a list of DataFrames with the row index being the Id, and the column header being Code.
for example
A456B = {mountain:12000, UTEM:53.2}
A786C = {mountain:5000, UTEM: None}
Eventually I want to use the Values to feed an equation.
Note that the real xml file might not contain the same number of Ids and Values in each Code. Also, Id and Value might be different from one Code section to another.
Sorry if this question is elementary, or unclear...I've only been doing python for a month :/

BeautifulSoup is a very useful module for parsing HTML and XML.
from bs4 import BeautifulSoup
import os
# read the file into a BeautifulSoup object; the HTML parser lowercases tag names,
# which is why the lookups below use 'item', 'code', 'data', 'id' and 'value'
soup = BeautifulSoup(open(os.getcwd() + "\\input.txt"), "html.parser")
results = {}
# parse the data, and put it into a dict, where the values are dicts
for item in soup.findAll('item'):
    # assemble dicts on the fly using a dict comprehension:
    # http://stackoverflow.com/a/14507637/4400277
    results[item.code.text] = {data.id.text: data.value.text for data in item.findAll('data')}
>>> results
{u'A786C': {u'mountain': u'5000', u'UTEM': u''},
 u'A456B': {u'mountain': u'12000', u'UTEM': u'53.2'}}

This might be what you want:
import xml.etree.ElementTree as ET
name = 'test.xml'
tree = ET.parse(name)
root = tree.getroot()
codes = {}
for item in root.iter('Item'):
    code = item.find('Code').text
    codes[code] = {}
    for datum in item.iter('Data'):
        # an empty <Value></Value> element has text None
        if datum.find('Value') is not None:
            value = datum.find('Value').text
        else:
            value = None
        # skip Data elements that have no Id child
        if datum.find('Id') is not None:
            id = datum.find('Id').text
            codes[code][id] = value
print(codes)
This produces:
{'A456B' : {'mountain' : '12000', 'UTEM' : '53.2'}, 'A786C' : {'mountain' : '5000', 'UTEM' : None}}
This iterates over all Item elements and, for each one, creates a dict key pointing to a dict of Id/Value pairs. A pair is only created if the Data element has an Id child; a missing or empty Value is stored as None.
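Since the question mentions feeding the Values into an equation, here is a minimal sketch of the same idea that also converts each Value to a float. It assumes every non-empty Value is numeric; the helper names to_float and values_by_code are made up for this illustration, and the XML is inlined so the snippet runs on its own.

```python
import xml.etree.ElementTree as ET

# Inline copy of the question's data so the sketch is self-contained.
XML = """<Project><Items>
<Item><Code>A456B</Code><Database>
<Data><Id>mountain</Id><Value>12000</Value></Data>
<Data><Id>UTEM</Id><Value>53.2</Value></Data>
</Database></Item>
<Item><Code>A786C</Code><Database>
<Data><Id>mountain</Id><Value>5000</Value></Data>
<Data><Id>UTEM</Id><Value></Value></Data>
</Database></Item>
</Items></Project>"""


def to_float(text):
    """Return float(text), or None when the element was missing or empty."""
    return float(text) if text and text.strip() else None


def values_by_code(root):
    """Build {code: {id: float or None}} from the parsed Project element."""
    result = {}
    for item in root.iter('Item'):
        code = item.findtext('Code')
        result[code] = {
            data.findtext('Id'): to_float(data.findtext('Value'))
            for data in item.iter('Data')
            if data.find('Id') is not None  # skip Data without an Id
        }
    return result


values = values_by_code(ET.fromstring(XML))
print(values)
```

Each inner dict can have a different set of Ids, so this copes with Code sections that do not share the same Ids. If a table is preferred, a dict of dicts like this can be passed to pandas.DataFrame, which gives one column per Code and one row per Id.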

Related

How to manipulate xml based on the specific tags?

There's an XML something like this
<OUTER>
<TYPE>FIRST</TYPE>
<FIELD1>1</FIELD1>
<ID>55056</ID>
<TASK>
<FILE>
<OPTIONS>1</OPTIONS>
</FILE>
</TASK>
</OUTER>
<OUTER>
<TYPE>SECOND</TYPE>
<FIELD1>2</FIELD1>
<ID>58640</ID>
<TASK>
<FILE>
<OPTIONS>1</OPTIONS>
</FILE>
</TASK>
</OUTER>
The text in the ID tag needs to be updated with a new value, which is held in the variable NEW_ID1. The comparison should happen on the TYPE tag: only if its text is FIRST should the ID be replaced with the new ID and written back to the XML. Similarly, if the type is SECOND, the ID should be updated with NEW_ID2, and so on. How can I do this? I tried the following:
tree = ET.parse("sample.xml")
root = tree.getroot()
det = tree.findall(".//OUTER[TYPE='FIRST']")
.
.
ID = NEW_ID1
tree.write("sample.xml")
but I am not able to take it any further.
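You are close. Note that TYPE is a child element rather than an attribute; ElementTree's limited XPath does accept a child-text predicate such as [TYPE='FIRST'], so findall can select the matching OUTER elements, but you still have to assign the new text to the ID element of each match.
Alternatively, you can iterate through all of the OUTER elements and test whether each one contains a TYPE child whose text is "FIRST". Then you can grab that OUTER element's ID descendant and change its text value.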
For example:
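import xml.etree.ElementTree as ET
tree = ET.parse("sample.xml")
root = tree.getroot()
for outer in tree.findall(".//OUTER"):
    # look up the TYPE child element and compare its text
    elem = outer.find(".//TYPE")
    if elem is not None and elem.text == "FIRST":
        id_elem = outer.find(".//ID")
        if id_elem is not None:
            id_elem.text = "NEWID1"
tree.write("sample.xml")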
Note: I am assuming that your xml file doesn't only contain the markup that is in your question. There should only be one root element in an xml file.
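To cover the "and so on" part of the question (SECOND gets NEW_ID2, etc.), one option is a lookup table from TYPE text to new ID. The sketch below is only an illustration: the ROOT wrapper element, the NEW_IDS values and the update_ids helper are assumptions, not part of the original file.

```python
import xml.etree.ElementTree as ET

# The OUTER elements are wrapped in a hypothetical single ROOT element,
# since a well-formed XML document needs exactly one root.
XML = """<ROOT>
<OUTER><TYPE>FIRST</TYPE><FIELD1>1</FIELD1><ID>55056</ID></OUTER>
<OUTER><TYPE>SECOND</TYPE><FIELD1>2</FIELD1><ID>58640</ID></OUTER>
</ROOT>"""

# Hypothetical replacement values, keyed by the text of the TYPE element.
NEW_IDS = {"FIRST": "11111", "SECOND": "22222"}


def update_ids(root, new_ids):
    """Replace the ID text of every OUTER whose TYPE text is a key of new_ids."""
    for outer in root.iter("OUTER"):
        type_text = outer.findtext("TYPE")
        id_elem = outer.find("ID")
        if id_elem is not None and type_text in new_ids:
            id_elem.text = new_ids[type_text]
    return root


root = update_ids(ET.fromstring(XML), NEW_IDS)
print([e.text for e in root.iter("ID")])

# The child-text predicate selects the same elements directly.
first_only = root.findall(".//OUTER[TYPE='FIRST']")
print(len(first_only))
```

With a real file, ET.parse("sample.xml") followed by tree.write("sample.xml") would replace the ET.fromstring call.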

Extracting comments from XML file in Python

I would like to extract the comment section of the XML file. The information that I would like to extract is inside the Tag element, within its Text tag; in this example it is "EXAMPLE".
The structure of the XML file is shown below.
<Boxes>
<Box Id="3" ZIndex="13">
<Shape>Rectangle</Shape>
<Brush Id="0" />
<Pen>
<Color>#FF000000</Color>
</Pen>
<Tag>&lt;?xml version="1.0"?&gt;
&lt;PFDComment xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema"&gt;
&lt;Text&gt;**EXAMPLE** &lt;/Text&gt;
&lt;/PFDComment&gt;</Tag>
</Box>
</Boxes>
I tried something like the code below but couldn't get the information that I want.
def read_cooments(xml):
    tree = lxml.etree.parse(xml)
    Comments= {}
    for comment in tree.xpath("//Boxes/Box"):
        #
        get_id = comment.attrib['Id']
        Comments[get_id] = []
        for group in comment.xpath(".//Tag"):
            #
            Comments[get_id].append(group.text)
    df_name1 = pd.DataFrame(dict([(k,pd.Series(v)) for k,v in Comments.items()]))
Can anyone help to extract comments from XML file shown above? Any help is appreciated!
Use the code given below:
def read_comments(xml):
    tree = etree.parse(xml)
    rows= []
    for box in tree.xpath('Box'):
        id = box.attrib['Id']
        tagTxt = box.findtext('Tag')
        if tagTxt is None:
            continue
        txtNode = etree.XML(tagTxt).find('Text')
        if txtNode is None:
            continue
        rows.append([id, txtNode.text.strip()])
    return pd.DataFrame(rows, columns=['id', 'Comment'])
Note that if you create a DataFrame within a function, it is a local variable of that function and is not visible from outside. A better and more readable approach (as I did) is for the function to return the DataFrame.
The function also contains continue in two places, to guard against possible "error cases" where either a Box element has no Tag child or the Tag content has no Text child element.
I also noticed that there is no need to replace &lt; or &gt; with < or > in my own code, as lxml does this on its own.
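The two-stage parse is the key point: the parser unescapes the text of Tag, and that text is itself an XML document that has to be parsed a second time. A minimal sketch of just that step follows. It uses the standard library's xml.etree.ElementTree as a stand-in for lxml and returns plain tuples instead of a DataFrame; the sample string is shortened and the name comments_by_box is made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Shortened sample: the content of Tag is escaped XML, as in the question.
OUTER = """<Boxes>
<Box Id="3" ZIndex="13">
<Tag>&lt;?xml version="1.0"?&gt;
&lt;PFDComment&gt;
&lt;Text&gt;**EXAMPLE** &lt;/Text&gt;
&lt;/PFDComment&gt;</Tag>
</Box>
</Boxes>"""


def comments_by_box(xml_text):
    """Return (box id, comment text) pairs from the embedded documents."""
    rows = []
    for box in ET.fromstring(xml_text).findall("Box"):
        inner = box.findtext("Tag")  # already unescaped by the parser
        if not inner or not inner.strip():
            continue
        # Second parse: the unescaped text is a document of its own.
        text_node = ET.fromstring(inner.strip()).find("Text")
        if text_node is not None and text_node.text:
            rows.append((box.get("Id"), text_node.text.strip()))
    return rows


rows = comments_by_box(OUTER)
print(rows)
```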
Edit
My test is as follows. Start from the imports:
import pandas as pd
from lxml import etree
I used a file containing:
<Boxes>
<Box Id="3" ZIndex="13">
<Shape>Rectangle</Shape>
<Brush Id="0" />
<Pen>
<Color>#FF000000</Color>
</Pen>
<Tag>&lt;?xml version="1.0"?&gt;
&lt;PFDComment xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema"&gt;
&lt;Text&gt;**EXAMPLE** &lt;/Text&gt;
&lt;/PFDComment&gt;</Tag>
</Box>
</Boxes>
I called the above function:
df_name1 = read_comments('Boxes.xml')
and when I printed df_name1, I got:
  id      Comment
0  3  **EXAMPLE**
If something goes wrong, use the "extended" version of the above function,
with test printouts:
def read_comments(xml):
    tree = etree.parse(xml)
    rows= []
    for box in tree.xpath('Box'):
        id = box.attrib['Id']
        tagTxt = box.findtext('Tag')
        if tagTxt is None:
            print('No Tag element')
            continue
        txtNode = etree.XML(tagTxt).find('Text')
        if txtNode is None:
            print('No Text element')
            continue
        txt = txtNode.text.strip()
        print(f'{id}: {txt}')
        rows.append([id, txt])
    return pd.DataFrame(rows, columns=['id', 'Comment'])
and take a look at printouts.

How do I write a function that takes an xml file and an integer value X as parameters and updates the attributes of the xml based on the given integer

I am trying to write a function that will take as parameters my xml file file.xml and an integer I want to input from the keyboard.
My xml files looks like this:
<root>
<item name="A" days="10"/>
<item name="B" days="20"/>
I have the integer X:
X = int(input("X value is:"))
I want to add the X value to the days attribute in my xml.
For X=1.1, I want the output:
A, 11.1 days
B, 21.1 days
I don't know how to write the function because when I tried calling it the name of the file I wanted to open was not recognized =>
read_xml(file.xml)
NameError : name 'file' is not defined.
But more importantly, I don't know how to add an integer value to the attribute of an xml file.
What I did so far using the ElementTree library:
import os
import xml.etree.ElementTree as et
tree = et.ElementTree(file = 'file.xml')
root = tree.getroot()
for item in root.findall('item'):
    names = item.get('name')
    ages = item.get('age')
    genders = item.get('sex')
    print(f'''\n{names}, {ages} years old''')
At this moment I get the desired output format but without the integer X added to the days attribute.
Please let me know if you have any idea how to solve this in Python3.
Thanks!!!
import xml.etree.ElementTree as ET
xml = '''<root>
<item name="A" days="10"/>
<item name="B" days="20"/>
</root>'''
def change_days_value(x):
    root = ET.fromstring(xml)
    items = root.findall('.//item')
    for item in items:
        # add x to the current value, as the question asks
        item.attrib['days'] = str(int(item.attrib['days']) + x)
    ET.dump(root)
# read this value from the user, e.g. x = float(input("X value is: "))
x = 1.1
change_days_value(x)
output (Python 3.8+, which keeps the original attribute order)
<root>
<item name="A" days="11.1" />
<item name="B" days="21.1" />
</root>
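The NameError in the question comes from writing file.xml without quotes, so Python looks for a variable called file; the file name has to be passed as a string. Below is a rough sketch of a function that takes the path and the number as parameters. The name add_days is made up, the demo writes a temporary copy of the sample file so it can run anywhere, and float is used instead of int because 1.1 is not an integer.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def add_days(path, x):
    """Add x to the days attribute of every item; return (name, days) pairs."""
    tree = ET.parse(path)
    result = []
    for item in tree.getroot().findall("item"):
        days = float(item.get("days")) + x
        item.set("days", str(days))  # attribute values must be strings
        result.append((item.get("name"), days))
    return result


# Demo with a temporary file standing in for file.xml.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "file.xml")
    with open(path, "w") as f:
        f.write('<root><item name="A" days="10"/><item name="B" days="20"/></root>')
    pairs = add_days(path, 1.1)  # note: the path is a string

for name, days in pairs:
    print(f"{name}, {days} days")
```

In a real script the call would look like add_days('file.xml', float(input("X value is: "))). To keep the change on disk, the function would also need to call tree.write(path).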

Finding element in xml with python

I am trying to parse XML before converting its content into lists and then into CSV. Unfortunately, I think my search terms for finding the initial element are failing, which causes the subsequent searches further down the hierarchy to fail as well. I am new to XML, so I've tried variations on namespace dictionaries and including the namespace references. The simplified XML is given below:
<?xml version="1.0" encoding="utf-8"?>
<StationList xmlns:xsd="http://www.w3.org/2001/XMLSchema"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:add="http://www.govtalk.gov.uk/people/AddressAndPersonalDetails"
xmlns:com="http://nationalrail.co.uk/xml/common" xsi:schemaLocation="http://internal.nationalrail.co.uk/xml/XsdSchemas/External/Version4.0/nre-station-v4-0.xsd"
xmlns="http://nationalrail.co.uk/xml/station">
<Station xsi:schemaLocation="http://internal.nationalrail.co.uk/xml/XsdSchemas/External/Version4.0/nre-station-v4-0.xsd">
<ChangeHistory>
<com:ChangedBy>spascos</com:ChangedBy>
<com:LastChangedDate>2018-11-07T00:00:00.000Z</com:LastChangedDate>
</ChangeHistory>
<Name>Aber</Name>
</Station>
The Code I am using to try to extract the com/...xml/station / ChangedBy element is below
tree = ET.parse(rootfilepath + "NRE_Station_Dataset_2019_raw.xml")
root = tree.getroot()
#get at the tags and their data
#for elem in tree.iter():
#    print(f"this the tag {elem.tag} and this is the data: {elem.text}")
#open file for writing
station_data = open(rootfilepath + 'station_data.csv','w')
csvwriter = csv.writer(station_data)
station_head = []
count = 0
#inspiration for this code: http://blog.appliedinformaticsinc.com/how-to-parse-and-convert-xml-to-csv-using-python/
#this is where it goes wrong; some combination of the namespace and the tag can't find anything in line 27, 'StationList'
for member in root.findall('{http://nationalrail.co.uk/xml/station}Station'):
    station = []
    if count == 0:
        changedby = member.find('{http://nationalrail.co.uk/xml/common}ChangedBy').tag
        station_head.append(changedby)
        name = member.find('{http://nationalrail.co.uk/xml/station}Name').tag
        station_head.append(name)
        count = count+1
    changedby = member.find('{http://nationalrail.co.uk/xml/common}ChangedBy').text
    station.append(changedby)
    name = member.find('{http://nationalrail.co.uk/xml/station}Name').text
    station.append(name)
    csvwriter.writerow(station)
I have tried:
using dictionaries of namespaces, but that results in nothing being found at all
using hard-coded namespaces, but that results in "AttributeError: 'NoneType' object has no attribute 'tag'"
Thanks in advance for all and any assistance.
First of all, your XML is invalid (the closing </StationList> tag is missing at the end of the file).
Assuming you have valid XML file:
<?xml version="1.0" encoding="utf-8"?>
<StationList xmlns:xsd="http://www.w3.org/2001/XMLSchema"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:add="http://www.govtalk.gov.uk/people/AddressAndPersonalDetails"
xmlns:com="http://nationalrail.co.uk/xml/common" xsi:schemaLocation="http://internal.nationalrail.co.uk/xml/XsdSchemas/External/Version4.0/nre-station-v4-0.xsd"
xmlns="http://nationalrail.co.uk/xml/station">
<Station xsi:schemaLocation="http://internal.nationalrail.co.uk/xml/XsdSchemas/External/Version4.0/nre-station-v4-0.xsd">
<ChangeHistory>
<com:ChangedBy>spascos</com:ChangedBy>
<com:LastChangedDate>2018-11-07T00:00:00.000Z</com:LastChangedDate>
</ChangeHistory>
<Name>Aber</Name>
</Station>
</StationList>
Then you can convert your XML to a dict with xmltodict and simply index into it to reach the required value:
import xmltodict
with open('file.xml', 'r') as f:
    data = xmltodict.parse(f.read())
changed_by = data['StationList']['Station']['ChangeHistory']['com:ChangedBy']
print(changed_by)
Output:
spascos
Try lxml:
#!/usr/bin/env python3
from lxml import etree
ns = {"com": "http://nationalrail.co.uk/xml/common"}
with open("so.xml") as f:
    tree = etree.parse(f)
    for t in tree.xpath("//com:ChangedBy/text()", namespaces=ns):
        print(t)
Output:
spascos
You can use Beautifulsoup which is an html and xml parser
from bs4 import BeautifulSoup
fd = open(rootfilepath + "NRE_Station_Dataset_2019_raw.xml")
soup = BeautifulSoup(fd,'lxml-xml')
for i in soup.findAll('ChangeHistory'):
    print(i.ChangedBy.text)
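For completeness, the original ElementTree approach can also be made to work. In the question's code, find() is given only the ChangedBy tag, which searches direct children of Station, but ChangedBy sits one level down inside ChangeHistory, which is likely why None came back. The sketch below is an illustration under a few assumptions: the XML is a trimmed inline copy, the "st" prefix in NS is an arbitrary name chosen here for the default namespace, and the CSV goes to an in-memory buffer instead of station_data.csv.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Trimmed inline copy of the question's document.
XML = """<StationList xmlns:com="http://nationalrail.co.uk/xml/common"
 xmlns="http://nationalrail.co.uk/xml/station">
<Station>
<ChangeHistory>
<com:ChangedBy>spascos</com:ChangedBy>
</ChangeHistory>
<Name>Aber</Name>
</Station>
</StationList>"""

# Prefixes used in the search paths; "st" is an arbitrary local name.
NS = {
    "st": "http://nationalrail.co.uk/xml/station",
    "com": "http://nationalrail.co.uk/xml/common",
}

root = ET.fromstring(XML)
rows = []
for station in root.findall("st:Station", NS):
    # A direct-child search misses ChangedBy, because it is a grandchild.
    direct = station.find("com:ChangedBy", NS)
    # Spelling out the path through ChangeHistory finds it.
    changed_by = station.findtext("st:ChangeHistory/com:ChangedBy", namespaces=NS)
    name = station.findtext("st:Name", namespaces=NS)
    rows.append([changed_by, name])

# Write a header row plus one row per station.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["ChangedBy", "Name"])
writer.writerows(rows)
print(buf.getvalue())
```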

How to copy certain information from a text file to XML using Python?

We get order e-mails whenever a buyer makes a purchase; these e-mails are sent in a text format with some relevant and some irrelevant information. I am trying to write a Python program which will read the text and then build an XML file (using ElementTree) which we can import into other software.
Unfortunately I do not quite know the proper terms for some of this, so please bear with the overlong explanations.
The problem is that I cannot figure out how to make it work with more than one product on the order. The program currently goes through each order and puts the data in a dictionary.
while file_length_dic != 0:
    # goes line by line and adds each value (and its name) to a dictionary
    # keys are the first half of a sentence followed by a distinguishing number
    for line in raw_email:
        colon_loc = line.index(':')
        end_loc = len(line)
        data_type = line[0:colon_loc] + "_" + file_length
        data_variable = line[colon_loc+2:end_loc].lstrip(' ')
        xml_dic[data_type] = data_variable
        # str.find returns -1 (truthy) when "URL" is absent, so test membership instead
        if "URL" in line:
            break
    file_length_dic -= 1
How can I get this dictionary values into XML? For example, under the main "JOB" element there will be a sub-element ITEMNUMBER and then SALESMANN and QUANTITY. How can I fill out multiple sets?
<JOB>
<ITEM>
<ITEMNUMBER>36322</ITEMNUMBER>
<SALESMANN>17</SALESMANN>
<QUANTITY>2</QUANTITY>
</ITEM>
<ITEM>
<ITEMNUMBER>22388</ITEMNUMBER>
<SALESMANN>5</SALESMANN>
<QUANTITY>8</QUANTITY>
</ITEM>
</JOB>
As far as I can tell, ElementTree will only let me put the data into the first set of children, but I can't imagine that this is really so. I also do not know in advance how many items are in each order; it can be anywhere from 1 to 150, and the program needs to scale easily.
Should I be using a different library? lxml looks powerful but again, I do not know what it is exactly I am looking for.
Here's a simple example. Note that the basic ElementTree doesn't pretty-print, so I included a pretty-print function from the ElementTree author.
If you provide an actual example of the input file and dictionary, it would be easier to target your specific case. I just put some data in a dictionary to show how to iterate over it and generate some XML.
from xml.etree import ElementTree as et
def indent(elem, level=0):
    # in-place pretty printer: sets text/tail whitespace to indent child elements
    i = "\n" + level*"  "
    if len(elem):
        if not elem.text or not elem.text.strip():
            elem.text = i + "  "
        if not elem.tail or not elem.tail.strip():
            elem.tail = i
        for elem in elem:
            indent(elem, level+1)
        if not elem.tail or not elem.tail.strip():
            elem.tail = i
    else:
        if level and (not elem.tail or not elem.tail.strip()):
            elem.tail = i
D = {36322:(17,2),22388:(5,8)}
job = et.Element('JOB')
for itemnumber,(salesman,quantity) in D.items():
    et.SubElement(job,'ITEMNUMBER').text = str(itemnumber)
    et.SubElement(job,'SALESMAN').text = str(salesman)
    et.SubElement(job,'QUANTITY').text = str(quantity)
indent(job)
et.dump(job)
Output:
<JOB>
  <ITEMNUMBER>36322</ITEMNUMBER>
  <SALESMAN>17</SALESMAN>
  <QUANTITY>2</QUANTITY>
  <ITEMNUMBER>22388</ITEMNUMBER>
  <SALESMAN>5</SALESMAN>
  <QUANTITY>8</QUANTITY>
</JOB>
Although, as @alko mentioned, a more structured XML might be:
job = et.Element('JOB')
for itemnumber,(salesman,quantity) in D.items():
    item = et.SubElement(job,'ITEM')
    et.SubElement(item,'NUMBER').text = str(itemnumber)
    et.SubElement(item,'SALESMAN').text = str(salesman)
    et.SubElement(item,'QUANTITY').text = str(quantity)
Output:
<JOB>
  <ITEM>
    <NUMBER>36322</NUMBER>
    <SALESMAN>17</SALESMAN>
    <QUANTITY>2</QUANTITY>
  </ITEM>
  <ITEM>
    <NUMBER>22388</NUMBER>
    <SALESMAN>5</SALESMAN>
    <QUANTITY>8</QUANTITY>
  </ITEM>
</JOB>
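To connect this back to the e-mail text, one possible approach is to collect one dict per product instead of one big dict with numbered keys, and then create an ITEM element per dict. The sketch below is only a guess at the workflow: the question does not show the real e-mail layout, so the RAW text, the field names and the rule that a URL line ends each product are all assumptions, as are the helper names parse_items and build_job.

```python
import xml.etree.ElementTree as ET

# Hypothetical order text; the real e-mail layout is not shown in the question.
RAW = """Item Number: 36322
Salesman: 17
Quantity: 2
URL: http://example.com/1
Item Number: 22388
Salesman: 5
Quantity: 8
URL: http://example.com/2"""


def parse_items(text):
    """Split 'key: value' lines into one dict per product (URL ends a product)."""
    items, current = [], {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")  # split at the first colon only
        if not sep:
            continue  # skip lines without a colon
        current[key.strip()] = value.strip()
        if key.strip() == "URL":
            items.append(current)
            current = {}
    return items


def build_job(items):
    """Create one ITEM element under JOB for every parsed product."""
    job = ET.Element("JOB")
    for entry in items:
        item = ET.SubElement(job, "ITEM")
        ET.SubElement(item, "ITEMNUMBER").text = entry["Item Number"]
        ET.SubElement(item, "SALESMAN").text = entry["Salesman"]
        ET.SubElement(item, "QUANTITY").text = entry["Quantity"]
    return job


job = build_job(parse_items(RAW))
print(ET.tostring(job, encoding="unicode"))
```

Because each product becomes its own ITEM element, the same loop handles 1 or 150 products without changes.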
Your XML structure does not seem valid to me. How can one tell which salesman refers to which item number?
Probably, you need something like
<JOB>
<ITEM>
<NUMBER>36322</NUMBER>
<SALESMANN>17</SALESMANN>
<QUANTITY>2</QUANTITY>
</ITEM>
<ITEM>
<NUMBER>22388</NUMBER>
<SALESMANN>5</SALESMANN>
<QUANTITY>8</QUANTITY>
</ITEM>
</JOB>
For a list of serialization techniques, refer to Serialize Python dictionary to XML
Sample with dicttoxml:
import dicttoxml
from xml.dom.minidom import parseString
xml = dicttoxml.dicttoxml({'JOB':[{'NUMBER':36322,
'QUANTITY': 2,
'SALESMANN': 17}
]}, root=False)
dom = parseString(xml)
and output
>>> print(dom.toprettyxml())
<?xml version="1.0" ?>
<JOB type="list">
<item type="dict">
<SALESMANN type="int">
17
</SALESMANN>
<NUMBER type="int">
36322
</NUMBER>
<QUANTITY type="int">
2
</QUANTITY>
</item>
</JOB>
