Suppress automatically added namespace in etree Python - python

<rootTag xmlns="model">
<tag>
I have an XML file with a namespace specified as above. I can parse it with etree in Python, but after I make changes and write it back to the file, etree changes it to this
<rootTag xmlns:ns0="model">
<ns0:tag>
and prepends "ns0" to all the tags. I don't want that to happen.
A sample program is as follows:
import xml.etree.ElementTree
et = xml.etree.ElementTree.parse(xml_name)
root = et.getroot()
root.find('.//*' + pattern).text = new_text
et.write(xml_name)
Is there some way to suppress this automatic change? Thanks

This can be done using register_namespace() by using an empty string for the prefix...
ET.register_namespace("", "model")
Full working example...
import xml.etree.ElementTree as ET
xml = """
<rootTag xmlns="model">
<tag>foo</tag>
</rootTag>
"""
ET.register_namespace("", "model")
root = ET.fromstring(xml)
root.find("{model}tag").text = "bar"
print(ET.tostring(root).decode())
printed output...
<rootTag xmlns="model">
<tag>bar</tag>
</rootTag>
Also see this answer for another example.
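The question writes the tree back with write() rather than printing it, so here is a minimal sketch of that round trip. It assumes the same "model" namespace as above; the in-memory buffers stand in for the real file and are only there to keep the example self-contained.

```python
import io
import xml.etree.ElementTree as ET

# Stand-in for the file on disk (assumption: same structure as in the question)
SOURCE = '<rootTag xmlns="model"><tag>foo</tag></rootTag>'

# Register the empty prefix before serializing; the registration is global
ET.register_namespace("", "model")

tree = ET.parse(io.StringIO(SOURCE))
tree.getroot().find("{model}tag").text = "bar"

# Write to a buffer here; with a real file this would be tree.write(xml_name, ...)
buffer = io.BytesIO()
tree.write(buffer, encoding="utf-8", xml_declaration=True)
result = buffer.getvalue().decode("utf-8")
print(result)
```

Note that register_namespace() affects every later serialization in the process, so it only needs to be called once before writing.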

Related

Update the xml using python3 at specific subelement?

I am trying to update the XML file below in Python 3 using import xml.etree.ElementTree as ET, but I am not able to add anything between the tags.
The issue I am facing is that I cannot fetch the tags nested below fileSets.
Can someone tell me how to update the XML?
abc.xml
<assembly
xmlns="http://maven.apache.org/plugins/maven-assembly-plugin/assembly/1.1.2"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="
http://maven.apache.org/plugins/maven-assembly-plugin/assembly/1.1.2
http://maven.apache.org/xsd/assembly-1.1.2-xsd"
>
<id></id>
<formats>
<format>zip</format>
</formats>
<fileSets>
<fileSet>
<outputDirectory>/</outputDirectory>
<directory>../</directory>
<useDefaultExcludes>false</useDefaultExcludes>
<includes>
</includes>
</fileSet>
</fileSets>
</assembly>
Expected output (file names will be added dynamically):
abc.xml
<assembly
xmlns="http://maven.apache.org/plugins/maven-assembly-plugin/assembly/1.1.2"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="
http://maven.apache.org/plugins/maven-assembly-plugin/assembly/1.1.2
http://maven.apache.org/xsd/assembly-1.1.2-xsd"
>
<id></id>
<formats>
<format>zip</format>
</formats>
<fileSets>
<fileSet>
<outputDirectory>/</outputDirectory>
<directory>../</directory>
<useDefaultExcludes>false</useDefaultExcludes>
<includes>
<include>abc.text</include>
<include>def.text</include>
<include>ghi.text</include>
</includes>
</fileSet>
</fileSets>
</assembly>
I am trying this, and it prints all four elements in this file, but I don't know how to access includes and then add the file names (abc.text and so on) inside it.
import xml.etree.ElementTree as ET
tree = ET.parse('abc.xml')
root = tree.getroot()
for actor in root.findall('{http://maven.apache.org/plugins/maven-assembly-plugin/assembly/1.1.2}fileSets'):
    for name in actor.findall('{http://maven.apache.org/plugins/maven-assembly-plugin/assembly/1.1.2}fileSet'):
        print(name)
You don't have to do anything with fileSets or fileSet. Since you want to add children to includes, get that element directly.
import xml.etree.ElementTree as ET
# Ensure that the proper prefix is used in the output (in this case, no prefix at all)
ET.register_namespace("", "http://maven.apache.org/plugins/maven-assembly-plugin/assembly/1.1.2")
tree = ET.parse("abc.xml")
# Find the 'includes' element (.// means search the whole document).
# {*} is a wildcard that matches any namespace (requires Python 3.8 or later)
includes = tree.find(".//{*}includes")
# Create three new 'include' elements
include1 = ET.Element("include")
include1.text = "abc.text"
include2 = ET.Element("include")
include2.text = "def.text"
include3 = ET.Element("include")
include3.text = "ghi.text"
# Add the new elements as children of 'includes'
includes.append(include1)
includes.append(include2)
includes.append(include3)
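Since the file names are added dynamically, the three hand-written elements can be replaced by a loop. The following is a sketch under a few assumptions: the file_names list is hypothetical, the XML is trimmed to the relevant elements, and the namespace is spelled out explicitly so the code also runs on Python versions older than 3.8. It also creates the new elements in the document's namespace rather than without one.

```python
import xml.etree.ElementTree as ET

NS = "http://maven.apache.org/plugins/maven-assembly-plugin/assembly/1.1.2"

# Trimmed stand-in for abc.xml (assumption: only the relevant elements are kept)
SOURCE = (
    f'<assembly xmlns="{NS}"><fileSets><fileSet>'
    "<includes></includes>"
    "</fileSet></fileSets></assembly>"
)

# Hypothetical list of names; in practice this would be built at run time
file_names = ["abc.text", "def.text", "ghi.text"]

ET.register_namespace("", NS)
root = ET.fromstring(SOURCE)

# Locate 'includes' anywhere in the document using the fully qualified tag
includes = root.find(f".//{{{NS}}}includes")
for file_name in file_names:
    child = ET.SubElement(includes, f"{{{NS}}}include")
    child.text = file_name

result = ET.tostring(root, encoding="unicode")
print(result)
```

When working with the real file, parsing it with ET.parse("abc.xml") and calling tree.write("abc.xml") afterwards would save the change back to disk.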

python iterate xml avoiding namespace

With my Python script I want to iterate over my XML file, searching for a specific element tag.
I have a problem related to the namespace of the root tag.
Below is my XML structure:
<?xml version="1.0" ?>
<rootTag xmlns="blablabla">
<tag_1>
<sub_tag_1>..something..</sub_tag_1>
</tag_1>
<tag_2>
<sub_tag_2>..something..</sub_tag_2>
</tag_2>
...and so on...
</rootTag>
Below is my Python script:
import xml.etree.ElementTree as ET
root = ET.fromstring(xml_taken_from_web)
print(root.tag)
The problem is that output of print is:
{blablabla}rootTag
so when I iterate over it, tag_1, tag_2, and all the other tags carry the {blablabla} string, and I'm not able to make any check on the tag.
I tried using a regular expression in this way:
root = re.sub('^{.*?}', '', root.tag)
The problem is that root is then a string, so I can no longer iterate over it as I could over an Element.
How can I print only rootTag?
For that, just use QName from lxml:
import xml.etree.ElementTree as ET
from lxml import etree
root = ET.fromstring(xml_taken_from_web)
print(etree.QName(root.tag).localname)
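If installing lxml is not an option, the local name can also be obtained with the standard library alone. This is a minimal sketch: local_name is a hypothetical helper, and the sample document is a shortened version of the one in the question. It relies on ElementTree storing qualified tags in the form {namespace}name.

```python
import xml.etree.ElementTree as ET

def local_name(tag):
    """Return the tag without its leading {namespace} part, if it has one."""
    return tag.rpartition("}")[2]

# Shortened stand-in for the document fetched from the web
SAMPLE = (
    '<rootTag xmlns="blablabla">'
    "<tag_1><sub_tag_1>..something..</sub_tag_1></tag_1>"
    "<tag_2/>"
    "</rootTag>"
)

root = ET.fromstring(SAMPLE)

# root stays an Element, so it can still be iterated; only the names are stripped
names = [local_name(element.tag) for element in root.iter()]
print(names)
```

Because the element objects are left untouched, checks such as local_name(child.tag) == "tag_1" can be made while iterating.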

Create array of values from specific element in XML using Python

I have an XML file which has many elements. I would like to create a list/array of all the values which have a specific element name, in my case "pair:ApplicationNumber".
I've gone over a lot of the other questions, but I have not been able to find an answer. I know that I can do this by loading the text file and going over it using pandas; however, I'm sure there's a much better way.
I was unsuccessful with ElementTree as well as with xml.dom using minidom.
My code currently looks as follows:
import os
from xml.dom import minidom
WindowsUser = os.getenv('username')
XMLPath = os.path.join('C:\\Users', WindowsUser, 'Downloads', 'ApplicationsByCustomerNumber.xml')
xmldoc = minidom.parse(XMLPath)
itemlist = xmldoc.getElementsByTagName('pair:ApplicationNumber')
for s in itemlist:
    print(s.attributes['pair:ApplicationNumber'].value)
an example XML file looks as follows:
<?xml version="1.0" encoding="UTF-8"?>
<pair:PatentApplicationList xsi:schemaLocation="urn:us:gov:uspto:pair PatentApplicationList.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:pair="urn:us:gov:uspto:pair">
<pair:FileHeader>
<pair:FileCreationTimeStamp>2017-07-10T10:52:12.12</pair:FileCreationTimeStamp>
</pair:FileHeader>
<pair:ApplicationStatusData>
<pair:ApplicationNumber>62383607</pair:ApplicationNumber>
<pair:ApplicationStatusCode>20</pair:ApplicationStatusCode>
<pair:ApplicationStatusText>Application Dispatched from Preexam, Not Yet Docketed</pair:ApplicationStatusText>
<pair:ApplicationStatusDate>2016-09-16</pair:ApplicationStatusDate>
<pair:AttorneyDocketNumber>1354-T-02-US</pair:AttorneyDocketNumber>
<pair:FilingDate>2016-09-06</pair:FilingDate>
<pair:LastModifiedTimestamp>2017-05-30T21:40:37.37</pair:LastModifiedTimestamp>
<pair:CustomerNumber>122761</pair:CustomerNumber><pair:LastFileHistoryTransaction>
<pair:LastTransactionDate>2017-05-30</pair:LastTransactionDate>
<pair:LastTransactionDescription>Email Notification</pair:LastTransactionDescription> </pair:LastFileHistoryTransaction>
<pair:ImageAvailabilityIndicator>true</pair:ImageAvailabilityIndicator>
</pair:ApplicationStatusData>
<pair:ApplicationStatusData>
<pair:ApplicationNumber>62292372</pair:ApplicationNumber>
<pair:ApplicationStatusCode>160</pair:ApplicationStatusCode>
<pair:ApplicationStatusText>Abandoned -- Incomplete Application (Pre-examination)</pair:ApplicationStatusText>
<pair:ApplicationStatusDate>2016-11-01</pair:ApplicationStatusDate>
<pair:AttorneyDocketNumber>681-S-23-US</pair:AttorneyDocketNumber>
<pair:FilingDate>2016-02-08</pair:FilingDate>
<pair:LastModifiedTimestamp>2017-06-20T21:59:26.26</pair:LastModifiedTimestamp>
<pair:CustomerNumber>122761</pair:CustomerNumber><pair:LastFileHistoryTransaction>
<pair:LastTransactionDate>2017-06-20</pair:LastTransactionDate>
<pair:LastTransactionDescription>Petition Entered</pair:LastTransactionDescription> </pair:LastFileHistoryTransaction>
<pair:ImageAvailabilityIndicator>true</pair:ImageAvailabilityIndicator>
</pair:ApplicationStatusData>
<pair:ApplicationStatusData>
<pair:ApplicationNumber>62289245</pair:ApplicationNumber>
<pair:ApplicationStatusCode>160</pair:ApplicationStatusCode>
<pair:ApplicationStatusText>Abandoned -- Incomplete Application (Pre-examination)</pair:ApplicationStatusText>
<pair:ApplicationStatusDate>2016-10-26</pair:ApplicationStatusDate>
<pair:AttorneyDocketNumber>1526-P-01-US</pair:AttorneyDocketNumber>
<pair:FilingDate>2016-01-31</pair:FilingDate>
<pair:LastModifiedTimestamp>2017-06-15T21:24:13.13</pair:LastModifiedTimestamp>
<pair:CustomerNumber>122761</pair:CustomerNumber><pair:LastFileHistoryTransaction>
<pair:LastTransactionDate>2017-06-15</pair:LastTransactionDate>
<pair:LastTransactionDescription>Petition Entered</pair:LastTransactionDescription> </pair:LastFileHistoryTransaction>
<pair:ImageAvailabilityIndicator>true</pair:ImageAvailabilityIndicator>
</pair:ApplicationStatusData>
</pair:PatentApplicationList>
ElementTree expands the "pair:" prefix of each tag into the namespace URI it is bound to (urn:us:gov:uspto:pair), so the stored tag doesn't match 'pair:ApplicationNumber', even though it looks like it should.
I've used ElementTree to extract the application numbers as follows (I've just used a local XML file in my examples, rather than the full path in your code).
Example 1:
from xml.etree import ElementTree
tree = ElementTree.parse('ApplicationsByCustomerNumber.xml')
root = tree.getroot()
for item in root:
    if 'ApplicationStatusData' in item.tag:
        for child in item:
            if 'ApplicationNumber' in child.tag:
                print(child.text)
Example 2:
from xml.etree import ElementTree
tree = ElementTree.parse('ApplicationsByCustomerNumber.xml')
root = tree.getroot()
for item in root.iter('{urn:us:gov:uspto:pair}ApplicationStatusData'):
    for child in item.iter('{urn:us:gov:uspto:pair}ApplicationNumber'):
        print(child.text)
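As the question asks for a list rather than printed values, the same lookup can be written as a list comprehension. This sketch passes a prefix-to-URI mapping to findall() so the familiar pair: prefix can be used in the path; the shortened sample document and the NAMESPACES name are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Maps the prefix used in the search path to the namespace URI from the document
NAMESPACES = {"pair": "urn:us:gov:uspto:pair"}

# Shortened stand-in for ApplicationsByCustomerNumber.xml
SAMPLE = """\
<pair:PatentApplicationList xmlns:pair="urn:us:gov:uspto:pair">
  <pair:ApplicationStatusData>
    <pair:ApplicationNumber>62383607</pair:ApplicationNumber>
  </pair:ApplicationStatusData>
  <pair:ApplicationStatusData>
    <pair:ApplicationNumber>62292372</pair:ApplicationNumber>
  </pair:ApplicationStatusData>
</pair:PatentApplicationList>
"""

root = ET.fromstring(SAMPLE)

# Collect the text of every ApplicationNumber element, at any depth
application_numbers = [
    element.text
    for element in root.findall(".//pair:ApplicationNumber", NAMESPACES)
]
print(application_numbers)
```

With the real file, ET.parse(path).getroot() would take the place of ET.fromstring(SAMPLE).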
Hope this may be useful.

Adding <root> tag to XML doc with Python

I am attempting to add a root tag to the beginning and end of a 2-million-line XML file so the file can be properly processed with my Python code.
I tried using this code from a previous post, but I am getting the error "XMLSyntaxError: Extra content at the end of the document, line __, column 1".
How do I solve this? Or is there a better way to add a root tag to the beginning and end of my large XML doc?
import lxml.etree as ET
tree = ET.parse('test.xml')
root = tree.getroot()
newroot = ET.Element("root")
newroot.insert(0, root)
print(ET.tostring(newroot, pretty_print=True))
My test XML
<pub>
<ID>75</ID>
<title>Use of Lexicon Density in Evaluating Word Recognizers</title>
<year>2000</year>
<booktitle>Multiple Classifier Systems</booktitle>
<pages>310-319</pages>
<authors>
<author>Petr Slavík</author>
<author>Venu Govindaraju</author>
</authors>
</pub>
<pub>
<ID>120</ID>
<title>Virtual endoscopy with force feedback - a new system for neurosurgical training</title>
<year>2003</year>
<booktitle>CARS</booktitle>
<pages>782-787</pages>
<authors>
<author>Christos Trantakis</author>
<author>Friedrich Bootz</author>
<author>Gero Strauß</author>
<author>Edgar Nowatius</author>
<author>Dirk Lindner</author>
<author>Hüseyin Kemâl Çakmak</author>
<author>Heiko Maaß</author>
<author>Uwe G. Kühnapfel</author>
<author>Jürgen Meixensberger</author>
</authors>
</pub>
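I suspect that gambit works in the previous post because there is only one A element at the highest level; your file has many top-level pub elements, so the parser reports extra content after the first one. Fortunately, even with two million lines it's easy to add the lines you need.
In doing this I noticed that the lxml parser seems unable to process the accented characters. I have therefore added code to anglicise them: the regular expression replaces each named entity (the &name; form) with the letter that follows the ampersand.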
import re
from lxml import etree
def anglicise(matchobj):
    # Keep only the character that follows '&' in the matched entity
    return matchobj.group(0)[1]
outputFilename = 'result.xml'
with open('test.xml') as inXML, open(outputFilename, 'w') as outXML:
    outXML.write('<root>\n')
    # Iterate over the file line by line so the whole file is never held in memory
    for line in inXML:
        outXML.write(re.sub('&[a-zA-Z]+;', anglicise, line))
    outXML.write('</root>\n')
tree = etree.parse(outputFilename)
years = tree.xpath('.//year')
print(years[0].text)
Edit: Replace anglicise with this version to avoid replacing &amp;.
def anglicise(matchobj):
    # Leave the predefined &amp; entity untouched; it is valid XML
    if matchobj.group(0) == '&amp;':
        return matchobj.group(0)
    else:
        return matchobj.group(0)[1]
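A possible alternative that avoids writing a second file is to feed the wrapper tags to the parser around the file's own content. The sketch below uses the standard library's XMLParser.feed(); parse_with_synthetic_root is a hypothetical helper, and it assumes the file is UTF-8, has no XML declaration, and contains no undefined entities (those would still need the substitution shown above). The whole tree is still built in memory.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

def parse_with_synthetic_root(path, root_tag="root", chunk_size=65536):
    """Parse a file of sibling elements as if it were wrapped in <root_tag>."""
    parser = ET.XMLParser()
    parser.feed(f"<{root_tag}>".encode("utf-8"))
    with open(path, "rb") as handle:
        # Read fixed-size chunks so the raw text is never loaded in one piece
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            parser.feed(chunk)
    parser.feed(f"</{root_tag}>".encode("utf-8"))
    return parser.close()

# Small stand-in for test.xml: two top-level elements and no root
fragment = (
    "<pub><ID>75</ID><year>2000</year></pub>\n"
    "<pub><ID>120</ID><year>2003</year></pub>\n"
)
with tempfile.NamedTemporaryFile(
    "w", suffix=".xml", delete=False, encoding="utf-8"
) as tmp:
    tmp.write(fragment)

root = parse_with_synthetic_root(tmp.name)
os.remove(tmp.name)

years = [year.text for year in root.iter("year")]
print(years)
```

The original file is left unchanged, which may be preferable when it is shared with other tools.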

Generating XML file with proper indentation

I am trying to generate an XML file in Python, but it is not getting indented; the output comes out on a single line.
from xml.etree.ElementTree import Element, SubElement, Comment, tostring
name = str(request.POST.get('name'))
top = Element('scenario')
environment = SubElement(top, 'environment')
cluster = SubElement(top, 'cluster')
cluster.text=name
I tried to use a pretty printer, but it gives me an error: 'Element' object has no attribute 'read'
import xml.dom.minidom
xml_p = xml.dom.minidom.parse(top)
pretty_xml = xml_p.toprettyxml()
Is the input given to the parser in the proper format? If this is the wrong method, please suggest another way to indent.
You cannot directly parse top, which is an Element; you need to convert it to a string first (which is why you should import tostring, which you are currently not using) and call xml.dom.minidom.parseString() on the result:
import xml.dom.minidom
xml_p = xml.dom.minidom.parseString(tostring(top))
pretty_xml = xml_p.toprettyxml()
print(pretty_xml)
that gives:
<?xml version="1.0" ?>
<scenario>
	<environment/>
	<cluster>xyz</cluster>
</scenario>
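On Python 3.9 or later, ElementTree can also indent the tree itself, without the round trip through minidom. A minimal sketch, using the literal "xyz" in place of the request value from the question:

```python
import xml.etree.ElementTree as ET

top = ET.Element("scenario")
ET.SubElement(top, "environment")
cluster = ET.SubElement(top, "cluster")
cluster.text = "xyz"  # stand-in for the name taken from the request

# indent() modifies the tree in place by inserting whitespace (Python 3.9+)
ET.indent(top, space="    ")

pretty = ET.tostring(top, encoding="unicode")
print(pretty)
```

Unlike toprettyxml(), this does not add an XML declaration unless one is requested when serializing.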
