Difference of two XML files in Python

I have two XML files with the same tags but different attribute and text values. I tried xmldiff, but I could not work out how to use it in my project.
I need to extract the differences in the attributes and values into a new XML structure.
In my code I tried to compare every node with all the other nodes, but this does not work when the other file has several sub-elements with the same name.
My code:
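# lxml.etree is assumed here, since getparent() is an lxml method
for elt in tree1.iter():  # iter() replaces the deprecated getiterator()
    el = root2.find(elt.tag)
    if el is None:
        continue  # no direct child of root2 has this tag
    for ee in root2.iter(elt.tag):
        p1 = ee.getparent()
        p2 = el.getparent()
        if p1 == p2:
            if ee.text != elt.text:
                # an f-string avoids a TypeError when either text is None
                print(f"{ee.text}****{elt.text}")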
How can I get the direct (first-order) parent of an XML node?
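With lxml, getparent() (already used above) returns the direct parent. The standard-library ElementTree keeps no parent pointer, so a common workaround is to build a child-to-parent map once. The sketch below shows that map plus one possible way to collect text and attribute differences. It assumes both files have the same structure, so elements can be paired by document order; parent_map and diff_trees are hypothetical helper names, not part of any library.

```python
import xml.etree.ElementTree as ET


def parent_map(root):
    """Map every element to its direct parent (the root has no entry)."""
    return {child: parent for parent in root.iter() for child in parent}


def diff_trees(root1, root2):
    """Pair elements by document order and list text/attribute differences.

    Assumption: both trees have identical structure (same tags, same order).
    """
    diffs = []
    for a, b in zip(root1.iter(), root2.iter()):
        if a.tag != b.tag:
            raise ValueError(f"structure differs: {a.tag!r} vs {b.tag!r}")
        text_a = (a.text or "").strip()
        text_b = (b.text or "").strip()
        if text_a != text_b:
            diffs.append((a.tag, "text", text_a, text_b))
        # Compare the union of attribute names so missing ones are reported too.
        for key in sorted(set(a.attrib) | set(b.attrib)):
            if a.attrib.get(key) != b.attrib.get(key):
                diffs.append((a.tag, "@" + key, a.attrib.get(key), b.attrib.get(key)))
    return diffs


root1 = ET.fromstring('<r><a x="1">one</a><a x="2">two</a></r>')
root2 = ET.fromstring('<r><a x="1">one</a><a x="3">TWO</a></r>')
for tag, kind, old, new in diff_trees(root1, root2):
    print(tag, kind, old, new)
```

Pairing by position sidesteps the problem of repeated sibling names, because each element is compared only with the element at the same place in the other tree. The resulting tuples could then be written into a new XML structure.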

Related

Get the start position and end position of found named entities in xml

I'm new to XML parsing. I have an XML file that contains some text content with identified entities (person and location). The file has close to 10 "person" entities and just 3 "location" entities.
<em>
Mad Max:
<location>Fury Road</location>
</em>
and so on ..
I want to extract the content and the start and end position of each entity in the XML file (using Python, with a for loop), but I am not sure how to start writing code that gets those positions from the file.
Can someone please help me?
Instead of using regular for-loops (which could lead to problems in the future), you could use the builtin xml module in Python.
In your example:
import xml.etree.ElementTree as ET
tree = ET.parse(xmlfile)
root = tree.getroot()
From here you can walk the tree: each element's text and tail attributes hold the character data, so start and end positions can be computed by accumulating their lengths.
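One way to turn that into positions is sketched below. It assumes that "position" means a character offset into the plain text with all tags removed; entity_spans is a hypothetical helper name, and the sample string is made up to match the shape of the question's snippet.

```python
import xml.etree.ElementTree as ET


def entity_spans(root):
    """Return (plain_text, spans); spans are (tag, start, end) character offsets."""
    parts = []
    spans = []
    pos = 0

    def visit(elem, is_root):
        nonlocal pos
        start = pos
        if elem.text:
            parts.append(elem.text)
            pos += len(elem.text)
        for child in elem:
            visit(child, False)
            # The tail is the text after the child's closing tag; it belongs
            # to the parent's content, not to the child entity.
            if child.tail:
                parts.append(child.tail)
                pos += len(child.tail)
        if not is_root:
            spans.append((elem.tag, start, pos))

    visit(root, True)
    return "".join(parts), spans


root = ET.fromstring(
    "<em>Mad Max: <location>Fury Road</location> stars <person>Tom Hardy</person></em>"
)
text, spans = entity_spans(root)
for tag, start, end in spans:
    print(tag, start, end, repr(text[start:end]))
```

The end offset is exclusive, so text[start:end] gives back the entity string.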

Change tag name in XML using Python [duplicate]

I am very new to XML and Python. I want to change the tag names of certain elements in an XML document. Here's how the document looks now:
<Company>
<Employee>
<SSN>xxxxx1234</SSN>
<Dependent>
<SSN>xxxxx4321</SSN>
I want to change the <SSN> tag under the Employee to <EE SSN> and leave the tag under the Dependent unchanged. The document includes hundreds of companies and thousands of employees, each with tens to hundreds of sub-elements, so I believe a find-and-replace approach is what I need.
I want to use the ElementTree module, but I am open to other suggestions. There are other modifications I want to make (copying and pasting elements), which I will post as a separate question, so I would like to stick to one module for everything. The only code I have working so far imports the data and writes it to a new file. Thanks for all your help!
ElementTree is your way to go.
First you get the root, and after that it's just a matter of accessing the nodes:
import xml.etree.ElementTree as ET
root = ET.fromstring(xml) # assuming you can store your data in a string (xml)
root[i] # access the ith node of the root
root[i][j] # access the jth node of the ith node
To get data from the nodes, you can use:
root[i][j].text
...where you can change the data as needed.
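For the renaming itself, an element's tag attribute can simply be reassigned. The sketch below renames SSN only where it is a direct child of Employee. Two assumptions: the document is closed properly (the snippet in the question is truncated, so the sample string here is made up), and the new name is EE_SSN, because XML tag names cannot contain a space and <EE SSN> would be parsed as a tag EE with an attribute. rename_child_tags is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def rename_child_tags(root, parent_tag, old_tag, new_tag):
    """Rename old_tag to new_tag only where it is a direct child of parent_tag."""
    count = 0
    for parent in root.iter(parent_tag):
        # findall() with a bare tag name matches direct children only,
        # so an SSN nested deeper (under Dependent) is left alone.
        for child in parent.findall(old_tag):
            child.tag = new_tag
            count += 1
    return count


xml = (
    "<Company><Employee><SSN>xxxxx1234</SSN>"
    "<Dependent><SSN>xxxxx4321</SSN></Dependent>"
    "</Employee></Company>"
)
root = ET.fromstring(xml)
rename_child_tags(root, "Employee", "SSN", "EE_SSN")
print(ET.tostring(root, encoding="unicode"))
```

Because the loop uses iter(), it covers every Employee in the document regardless of how many companies there are.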

xpath contains() not working: returning every node that contains text i.e. where text is not null

I am parsing a large xml file containing biological data.
The xml file is organised like this:
<part>
<part_id>41926</part_id>
<part_name>BBa_K1906018</part_name>
<part_short_name>K1906018</part_short_name>
<part_short_desc>Ribothermometer JB1-G3</part_short_desc>
<part_type>RBS</part_type>
<release_status>Not Released</release_status>
<sample_status>Not in stock</sample_status>
<part_results>None</part_results>
<part_nickname/>
<part_rating/>
<part_url>http://parts.igem.org/Part:BBa_K1906018</part_url>
<part_entered>2016-10-14</part_entered>
<part_author>Yuwei Han</part_author>
<deep_subparts/>
<specified_subparts/>
<specified_subscars/>
<sequences>
<seq_data>tactagagctcttattgtaaaacatgtactaaggagtactag </seq_data>
</sequences>
...
</part>
I have already developed xpath expressions that return exact matches E.g.
current_tree.xpath("//part/%s/text()[normalize-space(.)='%s']/../.."
"" % (arg_key, arg_values[0]))
Where arg_key will refer to one of the nodes of the document e.g. "part_type" and arg_values[0] will refer to an argument value such as "RBS".
I am trying to write an xpath expression that will find all seq_data nodes that contain a sequence motif, and return the nearest part parent node.
My xpath expression to do this (not working):
current_tree.xpath("//seq_data/text()[contains(.,%s)]"
"/ancestor::part" % (arg_values[0]))
This returns all parts whose seq_data node contains any text at all, i.e. it's fetching all nodes whose seq_data/text() is not empty.
I can't figure out why. Thanks
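A likely cause: the motif is interpolated without quotes, so XPath reads it as an element name rather than a string literal. That name selects nothing, an empty node-set converts to the empty string, and contains(x, '') is true for every x. Quoting the value ('%s') would fix it; the sketch below instead uses lxml's XPath variables, which avoid quoting problems altogether. The sample document and the parts_with_motif helper are made up for illustration.

```python
from lxml import etree

XML = """<parts>
  <part><part_name>A</part_name>
    <sequences><seq_data>tactagag </seq_data></sequences></part>
  <part><part_name>B</part_name>
    <sequences><seq_data>ccccgggg</seq_data></sequences></part>
</parts>"""


def parts_with_motif(tree, motif):
    """Return the <part> ancestors of every seq_data whose text contains motif."""
    # $motif is an XPath variable; lxml binds it from the keyword argument,
    # so the value is always treated as a string, never as a node test.
    return tree.xpath(
        "//seq_data[contains(., $motif)]/ancestor::part", motif=motif
    )


tree = etree.fromstring(XML)
print([p.findtext("part_name") for p in parts_with_motif(tree, "ctag")])
```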

how to modify attribute values from xml tag with python

I have many graphml files starting with:
<?xml version="1.0" encoding="UTF-8"?>
<graphml xmlns="http://graphml.graphdrawing.org/xmlns/graphml"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://graphml.graphdrawing.org/xmlns/graphml">
I need to change the xmlns and xsi attributes to reflect proper values for this XML file format specification:
<?xml version="1.0" encoding="UTF-8"?>
<graphml xmlns="http://graphml.graphdrawing.org/xmlns"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://graphml.graphdrawing.org/xmlns
http://graphml.graphdrawing.org/xmlns/1.0/graphml.xsd">
I tried to change these values with BeautifulSoup like:
soup = BeautifulSoup(myfile, 'html.parser')
soup.graphml['xmlns'] = 'http://graphml.graphdrawing.org/xmlns'
soup.graphml['xsi:schemalocation'] = "http://graphml.graphdrawing.org/xmlns http://graphml.graphdrawing.org/xmlns/1.0/graphml.xsd"
It works fine but it is definitely too slow on some of my larger files, so I am trying to do the same with lxml, but I don't understand how to achieve the same result. I sort of managed to reach the attributes, but don't know how to change them:
doc = etree.parse(myfile)
root = doc.getroot()
root.attrib
> {'{http://www.w3.org/2001/XMLSchema-instance}schemaLocation': 'http://graphml.graphdrawing.org/xmlns/graphml'}
What is the right way to accomplish this task?
When you say that you have many files "starting with" those 4 lines, if you really mean they're exactly like that, the fastest way is probably to ignore the fact that it's XML entirely and just replace those lines.
In Python, just read the first four lines, compare them to what you expect (so you can issue a warning if they don't match), then discard them. Write out the new four lines you want, then copy the rest of the file out. Repeat for each file.
On the other hand, if you have namespace attributes anywhere else in the file this method wouldn't catch them, and you should probably do a real XML-based solution. With a regular SAX parser, you get a callback for each element start, element end, text node, etc. as it comes along. So you'd just copy them out until you hit the one(s) you want (in this case, a graphml element), then instead of copying out that start-tag, write out the new one you want. Then back to copying. XSLT is also a fine way to do this, which would let you write a tiny generic copier, plus one rule to handle the graphml element.
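The line-replacement approach described above could look roughly like this. It is a sketch that assumes every file starts with exactly the four header lines from the question (compared after stripping surrounding whitespace); rewrite_header is a hypothetical helper name, and it works on open text streams so it can be tried without touching real files.

```python
import shutil
import warnings

OLD_HEADER = [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<graphml xmlns="http://graphml.graphdrawing.org/xmlns/graphml"',
    'xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"',
    'xsi:schemaLocation="http://graphml.graphdrawing.org/xmlns/graphml">',
]
NEW_HEADER = [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<graphml xmlns="http://graphml.graphdrawing.org/xmlns"',
    'xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"',
    'xsi:schemaLocation="http://graphml.graphdrawing.org/xmlns',
    'http://graphml.graphdrawing.org/xmlns/1.0/graphml.xsd">',
]


def rewrite_header(src, dst):
    """Copy src to dst, swapping the old header for the new one.

    Returns True if the header matched; otherwise warns and copies unchanged.
    """
    head = [src.readline() for _ in OLD_HEADER]
    matched = [line.strip() for line in head] == OLD_HEADER
    if matched:
        dst.write("\n".join(NEW_HEADER) + "\n")
    else:
        warnings.warn("unexpected header; copying it through unchanged")
        dst.writelines(head)
    shutil.copyfileobj(src, dst)  # stream the rest without parsing it
    return matched
```

For real files, src and dst would be opened with open(path, encoding="utf-8"), writing to a new file rather than overwriting the input in place.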

Multiple Elements in ElementMaker in lxml python

I wish to use ElementMaker in lxml to build an XML representation of an Excel spreadsheet with the corresponding nesting. I would like the output to look like this:
<excelbook>
workbook info
<excelsheet>
<sheetname>Sheet1</sheetname>
<exceltable>
<numrows>10</numrows>
<numcols>13</numcols>
</exceltable>
<exceltable>
<numrows>10</numrows>
<numcols>13</numcols>
</exceltable>
</excelsheet>
<excelsheet>
<sheetname>Sheet2</sheetname>
<exceltable>
<numrows>10</numrows>
<numcols>13</numcols>
</exceltable>
<exceltable>
<numrows>10</numrows>
<numcols>13</numcols>
</exceltable>
</excelsheet>
</excelbook>
My Python code looks like the following:
for excelSheet in excelBook.excelSheets:
    for excelTable in excelSheet.excelTables:
        exceltable = E.exceltable(
            E.num_rows(str(excelTable.num_rows)),
            E.num_cols(str(excelTable.num_cols)),
        )
    excelsheet = E.excelsheet(
        exceltable,
        E.sheetname(excelSheet.sheetName),
    )
excelbook = E.excelbook(
    excelsheet,
    E.bookname(fullpathname),
    E.numSheets(str(excelBook.numSheets)))
root = E.XML(excelbook)
The problem is that I can only nest one sheet inside each book and one table inside each sheet. How do I change the code to allow multiple sheets in each book and multiple tables inside each sheet?
You can nest several tags with the same name under the same parent: XML allows repeated sibling elements, and appending a second one does not override the first.
For example, if you have:
<root>
<elem1>some text 1</elem1>
</root>
and you append another "elem1" element to root, root ends up with two elem1 children.
The reason you only get one sheet and one table is that each loop iteration rebinds exceltable and excelsheet, so only the last element built is passed to its parent. Collect the elements in a list inside the loop and pass all of them to the parent call.
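Repeated sibling tags are valid XML, and an ElementMaker call accepts any number of child arguments, so one option is to collect the children in a list and unpack it into the parent call. The sketch below assumes a simplified input, a list of (sheet_name, [(num_rows, num_cols), ...]) tuples, as a hypothetical stand-in for the excelBook object in the question; build_book is likewise a made-up name.

```python
from lxml import etree
from lxml.builder import E


def build_book(book_name, sheets):
    """Build <excelbook> with one <excelsheet> per sheet and one <exceltable> per table."""
    sheet_elems = []
    for sheet_name, tables in sheets:
        # Build every table first, then pass them all to the sheet at once.
        table_elems = [
            E.exceltable(E.numrows(str(rows)), E.numcols(str(cols)))
            for rows, cols in tables
        ]
        sheet_elems.append(E.excelsheet(E.sheetname(sheet_name), *table_elems))
    return E.excelbook(
        E.bookname(book_name),
        E.numSheets(str(len(sheet_elems))),
        *sheet_elems,
    )


book = build_book(
    "book.xlsx",
    [("Sheet1", [(10, 13), (10, 13)]), ("Sheet2", [(10, 13), (10, 13)])],
)
print(etree.tostring(book, pretty_print=True).decode())
```

Calling parent.append(child) inside the loop would work equally well; the list is just one way to keep the nested E(...) style.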
