Update the xml using python3 at specific subelement? - python

I am trying to update the XML file below in Python 3 using xml.etree.ElementTree (imported as ET), but I am not able to add anything between the tags.
The issue I am facing is that I cannot fetch the elements nested under fileSets.
Can someone let me know how to update the XML?
abc.xml
<assembly
xmlns="http://maven.apache.org/plugins/maven-assembly-plugin/assembly/1.1.2"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="
http://maven.apache.org/plugins/maven-assembly-plugin/assembly/1.1.2
http://maven.apache.org/xsd/assembly-1.1.2-xsd"
>
<id></id>
<formats>
<format>zip</format>
</formats>
<fileSets>
<fileSet>
<outputDirectory>/</outputDirectory>
<directory>../</directory>
<useDefaultExcludes>false</useDefaultExcludes>
<includes>
</includes>
</fileSet>
</fileSets>
</assembly>
Expected output:(file names will be added dynamically)
abc.xml
<assembly
xmlns="http://maven.apache.org/plugins/maven-assembly-plugin/assembly/1.1.2"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="
http://maven.apache.org/plugins/maven-assembly-plugin/assembly/1.1.2
http://maven.apache.org/xsd/assembly-1.1.2-xsd"
>
<id></id>
<formats>
<format>zip</format>
</formats>
<fileSets>
<fileSet>
<outputDirectory>/</outputDirectory>
<directory>../</directory>
<useDefaultExcludes>false</useDefaultExcludes>
<includes>
<include>abc.text</include>
<include>def.text</include>
<include>ghi.text</include>
</includes>
</fileSet>
</fileSets>
</assembly>
I tried the code below. It prints the elements it finds, but I don't know how to access includes and then add entries such as abc.text inside it.
import xml.etree.ElementTree as ET
tree = ET.parse("abc.xml")
root = tree.getroot()
for actor in root.findall('{http://maven.apache.org/plugins/maven-assembly-plugin/assembly/1.1.2}fileSets'):
    for name in actor.findall('{http://maven.apache.org/plugins/maven-assembly-plugin/assembly/1.1.2}fileSet'):
        print(name)

You don't have to do anything with fileSets or fileSet. Since you want to add children to includes, get that element directly.
import xml.etree.ElementTree as ET
# Ensure that the proper prefix is used in the output (in this case, no prefix at all)
ET.register_namespace("", "http://maven.apache.org/plugins/maven-assembly-plugin/assembly/1.1.2")
tree = ET.parse("abc.xml")
# Find the 'includes' element (.// means search the whole document).
# {*} is a wildcard that matches any namespace (requires Python 3.8 or later)
includes = tree.find(".//{*}includes")
# Create three new 'include' elements
include1 = ET.Element("include")
include1.text = "abc.text"
include2 = ET.Element("include")
include2.text = "def.text"
include3 = ET.Element("include")
include3.text = "ghi.text"
# Add the new elements as children of 'includes'
includes.append(include1)
includes.append(include2)
includes.append(include3)
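Since the file names are meant to be added dynamically, the same idea can be written as a loop. The following is a minimal sketch rather than the answer's own code: it parses a trimmed-down copy of abc.xml from a string, the file_names list is a hypothetical stand-in for however the names are produced, and the new elements are created with the namespace-qualified tag so that they belong to the same namespace as the rest of the document.

```python
import xml.etree.ElementTree as ET

NS = "http://maven.apache.org/plugins/maven-assembly-plugin/assembly/1.1.2"

# Keep the default namespace unprefixed when the tree is serialized.
ET.register_namespace("", NS)

# Trimmed-down stand-in for abc.xml (assumption: the real file has more elements).
xml_text = f"""<assembly xmlns="{NS}">
  <fileSets>
    <fileSet>
      <includes>
      </includes>
    </fileSet>
  </fileSets>
</assembly>"""

root = ET.fromstring(xml_text)
includes = root.find(f".//{{{NS}}}includes")

# Hypothetical list of names; in practice this would be built at run time.
file_names = ["abc.text", "def.text", "ghi.text"]
for file_name in file_names:
    # SubElement creates the child and appends it to 'includes' in one step.
    child = ET.SubElement(includes, f"{{{NS}}}include")
    child.text = file_name

result = ET.tostring(root, encoding="unicode")
print(result)
```

With the real file, one would parse it with ET.parse("abc.xml"), look up includes on the tree in the same way, and save the change with tree.write("abc.xml").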

Related

Suppress automatically added namespace in etree Python

<rootTag xmlns="model">
<tag>
I have an xml file with a namespace specified as above. I can use etree in Python to parse it, but after making changes and writing it back to the file, etree changes it to this
<rootTag xmlns:ns0="model">
<ns0:tag>
and prepended "ns0" to all the tags. I don't want that to happen.
A sample program is as follows:
et = xml.etree.ElementTree.parse(xml_name)
root = (et.getroot())
root.find('.//*'+pattern).text = new_text
et.write(xml_name)
Is there some way to suppress this automatic change? Thanks
This can be done using register_namespace() by using an empty string for the prefix...
ET.register_namespace("", "model")
Full working example...
import xml.etree.ElementTree as ET
xml = """
<rootTag xmlns="model">
<tag>foo</tag>
</rootTag>
"""
ET.register_namespace("", "model")
root = ET.fromstring(xml)
root.find("{model}tag").text = "bar"
print(ET.tostring(root).decode())
printed output...
<rootTag xmlns="model">
<tag>bar</tag>
</rootTag>
Also see this answer for another example.
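To make the effect of register_namespace() visible, here is a small sketch that serializes the same tree before and after the call. The namespace URI urn:example:model is a made-up value chosen only for this illustration; everything else uses the standard library.

```python
import xml.etree.ElementTree as ET

NS = "urn:example:model"  # hypothetical namespace URI, used only for this sketch
xml_text = f'<rootTag xmlns="{NS}"><tag>foo</tag></rootTag>'

root = ET.fromstring(xml_text)

# No prefix is registered for NS yet, so ElementTree generates one on output.
before = ET.tostring(root, encoding="unicode")

# Map the namespace to the empty prefix, then serialize the same tree again.
ET.register_namespace("", NS)
after = ET.tostring(root, encoding="unicode")

print(before)
print(after)
```

Note that register_namespace() changes a module-level registry, so it affects every later serialization in the same process.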

python lxml how i use tag in items name?

I need to build an XML file whose elements have special names. This is my current code:
from lxml import etree
import lxml
from lxml.builder import E
wp = E.wp
tmp = wp("title")
print(etree.tostring(tmp))
The current output is:
b'<wp>title</wp>'
I want it to be:
b'<wp:title>title</wp:title>'
How can I create elements with a name like wp:title?
You confused the namespace prefix wp with the tag name. The namespace prefix is a document-local name for a namespace URI. wp:title requires a parser to look for an xmlns:wp="..." attribute to find the namespace itself (usually a URL, but any globally unique string would do), either on the tag itself or on a parent tag. This connects tags to a unique value without making tag names too verbose to type out or read.
You need to provide the namespace and, optionally, the namespace mapping (mapping short names to full namespace names) to the element maker object. The default E object provided doesn't have a namespace or namespace map set. I'm going to assume here that wp is the http://wordpress.org/export/1.2/ WordPress namespace, as that seems the most likely, although it could also be that you are trying to send Windows Phone notifications.
Instead of using the default E element maker, create your own ElementMaker instance and pass it a namespace argument to tell lxml what URL the element belongs to. To get the right prefix on your element names, you also need to give it a nsmap dictionary that maps prefixes to URLs:
from lxml.builder import ElementMaker
namespaces = {"wp": "http://wordpress.org/export/1.2/"}
E = ElementMaker(namespace=namespaces["wp"], nsmap=namespaces)
title = E.title("Value of the wp:title tag")
This produces a tag with both the correct prefix, and the xmlns:wp attribute:
>>> from lxml import etree
>>> from lxml.builder import ElementMaker
>>> namespaces = {"wp": "http://wordpress.org/export/1.2/"}
>>> E = ElementMaker(namespace=namespaces["wp"], nsmap=namespaces)
>>> title = E.title("Value of the wp:title tag")
>>> etree.tostring(title, encoding="unicode")
'<wp:title xmlns:wp="http://wordpress.org/export/1.2/">Value of the wp:title tag</wp:title>'
You can omit the nsmap value, but then you'd want to have such a map on a parent element of the document. In that case, you probably want to make separate ElementMaker objects for each namespace you need to support, and you put the nsmap namespace mapping on the outer-most element. When writing out the document, lxml then uses the short names throughout.
For example, creating a Wordpress WXR format document would require a number of namespaces:
from lxml.builder import ElementMaker
namespaces = {
    "excerpt": "https://wordpress.org/export/1.2/excerpt/",
    "content": "http://purl.org/rss/1.0/modules/content/",
    "wfw": "http://wellformedweb.org/CommentAPI/",
    "dc": "http://purl.org/dc/elements/1.1/",
    "wp": "https://wordpress.org/export/1.2/",
}
RootElement = ElementMaker(nsmap=namespaces)
ExcerptElement = ElementMaker(namespace=namespaces["excerpt"])
ContentElement = ElementMaker(namespace=namespaces["content"])
CommentElement = ElementMaker(namespace=namespaces["wfw"])
DublinCoreElement = ElementMaker(namespace=namespaces["dc"])
ExportElement = ElementMaker(namespace=namespaces["wp"])
and then you'd construct a document with
doc = RootElement.rss(
    RootElement.channel(
        ExportElement.wxr_version("1.2"),
        # etc. ...
    ),
    version="2.0"
)
which, when pretty printed with etree.tostring(doc, pretty_print=True, encoding="unicode"), produces:
<rss xmlns:excerpt="https://wordpress.org/export/1.2/excerpt/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:wp="https://wordpress.org/export/1.2/" version="2.0">
<channel>
<wp:wxr_version>1.2</wp:wxr_version>
</channel>
</rss>
Note how only the root <rss> element has xmlns attributes, and how the <wp:wxr_version> tag uses the right prefix even though we only gave it the namespace URI.
To give a different example, if you are building a Windows Phone tile notification, it'd be simpler. After all, there is just a single namespace to use:
from lxml.builder import ElementMaker
namespaces = {"wp": "WPNotification"}
E = ElementMaker(namespace=namespaces["wp"], nsmap=namespaces)
notification = E.Notification(
    E.Tile(
        E.BackgroundImage("https://example.com/someimage.png"),
        E.Count("42"),
        E.Title("The notification title"),
        # ...
    )
)
which produces
<wp:Notification xmlns:wp="WPNotification">
<wp:Tile>
<wp:BackgroundImage>https://example.com/someimage.png</wp:BackgroundImage>
<wp:Count>42</wp:Count>
<wp:Title>The notification title</wp:Title>
</wp:Tile>
</wp:Notification>
Only the outer-most element, <wp:Notification>, now has the xmlns:wp attribute. All other elements only need to include the wp: prefix.
Note that the prefix used is entirely up to you and even optional. It is the namespace URI that is the real key to uniquely identifying elements across different XML documents. If you used E = ElementMaker(namespace="WPNotification", nsmap={None: "WPNotification"}) instead, and so produced a top-level element with <Notification xmlns="WPNotification"> you still have a perfectly legal XML document that, according to the XML standard, has the exact same meaning.
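The point that the prefix does not change the meaning can be checked from the parsing side. This is a small sketch that uses the standard library's xml.etree.ElementTree instead of lxml, purely to keep it dependency-free; the two document strings are shortened versions of the notification example above.

```python
import xml.etree.ElementTree as ET

# Same namespace URI, once bound to the wp: prefix and once as the default namespace.
prefixed = ET.fromstring(
    '<wp:Notification xmlns:wp="WPNotification"><wp:Tile/></wp:Notification>'
)
unprefixed = ET.fromstring(
    '<Notification xmlns="WPNotification"><Tile/></Notification>'
)

# After parsing, every tag is stored as {namespace-uri}localname; the prefix is gone.
tags_prefixed = [element.tag for element in prefixed.iter()]
tags_unprefixed = [element.tag for element in unprefixed.iter()]

print(tags_prefixed)
print(tags_unprefixed)
```

Both lists contain the same fully qualified names, which is why a consumer that handles namespaces correctly treats the two documents alike.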

How to find a string inside a XML like tag from a file in Python?

I have an RDF document that looks as follows:
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:cd="http:xyz.com#">
<rdf:Description rdf:about="http:xyz.com#">
<cd:algorithmid>DPOT-5ab247867d368</cd:algorithmid>
<cd:owner>arun</cd:owner>
<cd:acesskey>ACCESS-5ab247867d370</cd:acesskey>
<cd:purpose>Research</cd:purpose>
<cd:metadata>10</cd:metadata>
<cd:completeness>Partial</cd:completeness>
<cd:completeness>Yes</cd:completeness>
<cd:inclusion_1>age</cd:inclusion_1>
<cd:feature_1>Sex</cd:feature_1>
<cd:target>Diagnosis</cd:target>
</rdf:Description>
</rdf:RDF>
From the above text, I need to extract the target (i.e. only the value between the opening and closing "cd:target" tags). The desired output is 'Diagnosis'. I tried an XML parser, but it does not work because the tag names contain ':'. Is there a better solution?
Update: This is what I tried; sorry for the naive coding style.
import xml.etree.ElementTree as et
def metadataParser(metadataFile):
    with open(metadataFile, 'r') as m:
        data = m.read()
    # Load the xml content from a string
    content = et.fromstring(data)
    description = content.find('rdf:Description')
    target = description.find("cd:target")
    return target
target = metadataParser('metadata.rdf')
print(target)
You can use the BeautifulSoup module with its XML parser.
from bs4 import BeautifulSoup
XML = '''
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:cd="http:xyz.com#">
<rdf:Description rdf:about="http:xyz.com#">
<cd:algorithmid>DPOT-5ab247867d368</cd:algorithmid>
<cd:owner>arun</cd:owner>
<cd:acesskey>ACCESS-5ab247867d370</cd:acesskey>
<cd:purpose>Research</cd:purpose>
<cd:metadata>10</cd:metadata>
<cd:completeness>Partial</cd:completeness>
<cd:completeness>Yes</cd:completeness>
<cd:inclusion_1>age</cd:inclusion_1>
<cd:feature_1>Sex</cd:feature_1>
<cd:target>Diagnosis</cd:target>
</rdf:Description>
</rdf:RDF>'''
soup = BeautifulSoup(XML, 'xml')
target = soup.find('target').text
print(target)
# Diagnosis
As you can see, it's pretty easy to use.
rdf: and cd: are namespace prefixes. In your search, replace them with the actual namespace URIs in braces, like so:
description = content.find('{http://www.w3.org/1999/02/22-rdf-syntax-ns#}Description')
target = description.find("{http:xyz.com#}target")
You could use the following regex; it captures the text inside every 'cd' element in your file.
import re
with open("file.rdf", "r") as file:
    for lines in file:
        pattern = "<cd:.*>(.*)</cd:.*>"
        output = re.findall(pattern, lines)
        if len(output) != 0:
            print(output[0])
And this outputs:
DPOT-5ab247867d368
arun
ACCESS-5ab247867d370
Research
10
Partial
Yes
age
Sex
Diagnosis
Explanation of the pattern variable:
The first .* matches ANY characters in that position (the rest of the opening tag name).
(.*) marks the section we want to capture.
The last .* does the same as the first and matches ANY characters (the rest of the closing tag name).
Note: I have included an if statement to check whether the output (a list) contains any elements; if it does not, the line is skipped (for example, your enclosing rdf elements are excluded).
The cd: part is a namespace. They're pretty common in XML, and just about any XML parser has a way to handle them.
Otherwise, if you are just looking for a single item and you don't care about structure, you could just do a simple string search and grab everything between <cd:target> and </cd:target>, like so:
rdf = '''rdf xml document'''
open_tag = '<cd:target>'
close_tag = '</cd:target>'
start = rdf.find(open_tag)
end = rdf.find(close_tag)
value = rdf[start + len(open_tag):end]
You can create a dictionary holding the namespace mappings seen at the top:
import xml.etree.ElementTree as ET
tree = ET.parse('input.xml')
ns = {'rdf' : 'http://www.w3.org/1999/02/22-rdf-syntax-ns#', 'cd' : 'http:xyz.com#'}
description = tree.find('rdf:Description', ns)
target = description.find('cd:target', ns)
print(target.text)
This would display:
Diagnosis
This approach is described in the Python xml.etree.ElementTree documentation.
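The same namespace dictionary also works with findall(), which is handy here because cd:completeness occurs twice. The following is a minimal, self-contained sketch that parses a shortened copy of the document from a string instead of reading input.xml.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the RDF document from the question.
rdf_text = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:cd="http:xyz.com#">
  <rdf:Description rdf:about="http:xyz.com#">
    <cd:completeness>Partial</cd:completeness>
    <cd:completeness>Yes</cd:completeness>
    <cd:target>Diagnosis</cd:target>
  </rdf:Description>
</rdf:RDF>"""

# Prefix -> namespace URI; the prefixes only need to match the ones used in the paths below.
ns = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "cd": "http:xyz.com#",
}

root = ET.fromstring(rdf_text)
description = root.find("rdf:Description", ns)

# find() returns the first match, findall() returns every match in document order.
target = description.find("cd:target", ns).text
completeness = [el.text for el in description.findall("cd:completeness", ns)]

print(target)
print(completeness)
```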

Create array of values from specific element in XML using Python

I have an XML file which has many elements. I would like to create a list/array of all the values which have a specific element name, in my case "pair:ApplicationNumber".
I've gone over a lot of the other questions, but I have not been able to find an answer. I know that I can do this by loading the file as text and going over it using pandas; however, I'm sure there's a much better way.
I was unsuccessful with ElementTree as well as with xml.dom.minidom.
My code currently looks as follows:
import os
from xml.dom import minidom
WindowsUser = os.getenv('username')
XMLPath = os.path.join('C:\\Users', WindowsUser, 'Downloads', 'ApplicationsByCustomerNumber.xml')
xmldoc = minidom.parse(XMLPath)
itemlist = xmldoc.getElementsByTagName('pair:ApplicationNumber')
for s in itemlist:
    print(s.attributes['pair:ApplicationNumber'].value)
An example XML file looks as follows:
<?xml version="1.0" encoding="UTF-8"?>
<pair:PatentApplicationList xsi:schemaLocation="urn:us:gov:uspto:pair PatentApplicationList.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:pair="urn:us:gov:uspto:pair">
<pair:FileHeader>
<pair:FileCreationTimeStamp>2017-07-10T10:52:12.12</pair:FileCreationTimeStamp>
</pair:FileHeader>
<pair:ApplicationStatusData>
<pair:ApplicationNumber>62383607</pair:ApplicationNumber>
<pair:ApplicationStatusCode>20</pair:ApplicationStatusCode>
<pair:ApplicationStatusText>Application Dispatched from Preexam, Not Yet Docketed</pair:ApplicationStatusText>
<pair:ApplicationStatusDate>2016-09-16</pair:ApplicationStatusDate>
<pair:AttorneyDocketNumber>1354-T-02-US</pair:AttorneyDocketNumber>
<pair:FilingDate>2016-09-06</pair:FilingDate>
<pair:LastModifiedTimestamp>2017-05-30T21:40:37.37</pair:LastModifiedTimestamp>
<pair:CustomerNumber>122761</pair:CustomerNumber><pair:LastFileHistoryTransaction>
<pair:LastTransactionDate>2017-05-30</pair:LastTransactionDate>
<pair:LastTransactionDescription>Email Notification</pair:LastTransactionDescription> </pair:LastFileHistoryTransaction>
<pair:ImageAvailabilityIndicator>true</pair:ImageAvailabilityIndicator>
</pair:ApplicationStatusData>
<pair:ApplicationStatusData>
<pair:ApplicationNumber>62292372</pair:ApplicationNumber>
<pair:ApplicationStatusCode>160</pair:ApplicationStatusCode>
<pair:ApplicationStatusText>Abandoned -- Incomplete Application (Pre-examination)</pair:ApplicationStatusText>
<pair:ApplicationStatusDate>2016-11-01</pair:ApplicationStatusDate>
<pair:AttorneyDocketNumber>681-S-23-US</pair:AttorneyDocketNumber>
<pair:FilingDate>2016-02-08</pair:FilingDate>
<pair:LastModifiedTimestamp>2017-06-20T21:59:26.26</pair:LastModifiedTimestamp>
<pair:CustomerNumber>122761</pair:CustomerNumber><pair:LastFileHistoryTransaction>
<pair:LastTransactionDate>2017-06-20</pair:LastTransactionDate>
<pair:LastTransactionDescription>Petition Entered</pair:LastTransactionDescription> </pair:LastFileHistoryTransaction>
<pair:ImageAvailabilityIndicator>true</pair:ImageAvailabilityIndicator>
</pair:ApplicationStatusData>
<pair:ApplicationStatusData>
<pair:ApplicationNumber>62289245</pair:ApplicationNumber>
<pair:ApplicationStatusCode>160</pair:ApplicationStatusCode>
<pair:ApplicationStatusText>Abandoned -- Incomplete Application (Pre-examination)</pair:ApplicationStatusText>
<pair:ApplicationStatusDate>2016-10-26</pair:ApplicationStatusDate>
<pair:AttorneyDocketNumber>1526-P-01-US</pair:AttorneyDocketNumber>
<pair:FilingDate>2016-01-31</pair:FilingDate>
<pair:LastModifiedTimestamp>2017-06-15T21:24:13.13</pair:LastModifiedTimestamp>
<pair:CustomerNumber>122761</pair:CustomerNumber><pair:LastFileHistoryTransaction>
<pair:LastTransactionDate>2017-06-15</pair:LastTransactionDate>
<pair:LastTransactionDescription>Petition Entered</pair:LastTransactionDescription> </pair:LastFileHistoryTransaction>
<pair:ImageAvailabilityIndicator>true</pair:ImageAvailabilityIndicator>
</pair:ApplicationStatusData>
</pair:PatentApplicationList>
ElementTree expands the "pair:" prefix of each tag into the namespace URI declared in the document (urn:us:gov:uspto:pair), so the tag is actually '{urn:us:gov:uspto:pair}ApplicationNumber' and does not match 'pair:ApplicationNumber', even though it looks like it should.
I've used ElementTree to extract the application numbers as follows (my examples use a local XML file rather than the full path in your code):
Example 1:
from xml.etree import ElementTree
tree = ElementTree.parse('ApplicationsByCustomerNumber.xml')
root = tree.getroot()
for item in root:
    if 'ApplicationStatusData' in item.tag:
        for child in item:
            if 'ApplicationNumber' in child.tag:
                print(child.text)
Example 2:
from xml.etree import ElementTree
tree = ElementTree.parse('ApplicationsByCustomerNumber.xml')
root = tree.getroot()
for item in root.iter('{urn:us:gov:uspto:pair}ApplicationStatusData'):
    for child in item.iter('{urn:us:gov:uspto:pair}ApplicationNumber'):
        print(child.text)
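Because the question asks for a list rather than printed output, the values can also be collected with a list comprehension. This is a minimal sketch that assumes a shortened copy of the sample file, parsed from a string, and uses a prefix-to-URI dictionary so the path can be written with the pair: prefix.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sample file; only the elements needed here are kept.
xml_text = """<pair:PatentApplicationList xmlns:pair="urn:us:gov:uspto:pair">
  <pair:ApplicationStatusData>
    <pair:ApplicationNumber>62383607</pair:ApplicationNumber>
  </pair:ApplicationStatusData>
  <pair:ApplicationStatusData>
    <pair:ApplicationNumber>62292372</pair:ApplicationNumber>
  </pair:ApplicationStatusData>
</pair:PatentApplicationList>"""

ns = {"pair": "urn:us:gov:uspto:pair"}
root = ET.fromstring(xml_text)

# .// searches all descendants; the dictionary resolves the pair: prefix to its URI.
application_numbers = [
    element.text for element in root.iterfind(".//pair:ApplicationNumber", ns)
]

print(application_numbers)
```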
Hope this may be useful.
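As a side note on the minidom attempt in the question: as far as I can tell, minidom keeps the prefixed name as the element's tagName, so getElementsByTagName('pair:ApplicationNumber') does find the elements. The likely problem is that the number is the element's text content, not an attribute. A minimal sketch under that assumption, again using a shortened copy of the sample file:

```python
from xml.dom import minidom

# Shortened copy of the sample file.
xml_text = """<pair:PatentApplicationList xmlns:pair="urn:us:gov:uspto:pair">
  <pair:ApplicationStatusData>
    <pair:ApplicationNumber>62383607</pair:ApplicationNumber>
  </pair:ApplicationStatusData>
  <pair:ApplicationStatusData>
    <pair:ApplicationNumber>62292372</pair:ApplicationNumber>
  </pair:ApplicationStatusData>
</pair:PatentApplicationList>"""

doc = minidom.parseString(xml_text)

# The value lives in the element's child text node, not in an attribute.
numbers = [
    node.firstChild.nodeValue
    for node in doc.getElementsByTagName("pair:ApplicationNumber")
]

print(numbers)
```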

lxml - using find method to find specific tag? (does not find)

I have an XML file in which I need to update the values of some specific tags. The Header element contains tags with namespace prefixes. Using find for those tags works, but when I search for tags that have no prefix, they are not found.
I tried relative and absolute paths, but nothing is found. The code looks like this:
from lxml import etree
tree = etree.parse('test.xml')
root = tree.getroot()
# get its namespace map, excluding default namespace
nsmap = {k:v for k,v in root.nsmap.items() if k}
# Replace values in tags
identity = tree.find('.//env:identity', nsmap)
identity.text = 'Placeholder' # works fine
e01_0017 = tree.find('.//e01_0017') # does not find
e01_0017.text = 'Placeholder' # and then it throws this, of course: AttributeError: 'NoneType' object has no attribute 'text'
# Also tried like this, but still not working
e01_0017 = tree.find('Envelope/Body/IVOIC/UNB/cmp04/e01_0017')
I even tried finding, for example, the Body tag, but it is not found either.
This is what the XML structure looks like:
<?xml version="1.0" encoding="ISO-8859-1"?><Envelope xmlns="http://www.someurl.com/TTT" xmlns:env="http://www.someurl.com/TTT_Envelope" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xsi:schemaLocation="http://www.someurl.com/TTT TTT_INVOIC.xsd"><Header>
<env:delivery>
<env:to>
<env:address>Test</env:address>
</env:to>
<env:from>
<env:address>Test2</env:address>
</env:from>
<env:reliability>
<env:sendReceiptTo/>
<env:receiptRequiredBy/>
</env:reliability>
</env:delivery>
<env:properties>
<env:identity>some code</env:identity>
<env:sentAt>2006-03-17T00:38:04+01:00</env:sentAt>
<env:expiresAt/>
<env:topic>http://www.someurl.com/TTT/</env:topic>
</env:properties>
<env:manifest>
<env:reference uri="#INVOIC#D00A">
<env:description>Doc Name Descr</env:description>
</env:reference>
</env:manifest>
<env:process>
<env:type></env:type>
<env:instance/>
<env:handle></env:handle>
</env:process>
</Header>
<Body>
<INVOIC>
<UNB>
<cmp01>
<e01_0001>1</e01_0001>
<e02_0002>1</e02_0002>
</cmp01>
<cmp02>
<e01_0004>from</e01_0004>
</cmp02>
<cmp03>
<e01_0010>to</e01_0010>
</cmp03>
<cmp04>
<e01_0017>060334</e01_0017>
<e02_0019>1652</e02_0019>
</cmp04>
<e01_0020>1</e01_0020>
<cmp05>
<e01_0022>1</e01_0022>
</cmp05>
</UNB>
</INVOIC>
</Body>
</Envelope>
Update: It seems something is wrong with the Header or Envelope tags. If, for example, I use the XML without the header and envelope information, the tags are found just fine. If I include the envelope attributes and the header, it stops finding tags. I have updated the XML sample with the header information.
The thing is that elements like e01_0017 also have a namespace: they inherit it from their parent, in this case all the way back to <Envelope>. The namespace of your elements is "http://www.someurl.com/TTT".
You have two options.
Either specify the namespace directly in the XPath. Example:
e01_0017 = tree.find('.//{http://www.someurl.com/TTT}e01_0017')
Demo (for your xml) -
In [39]: e01_0017 = tree.find('.//{http://www.someurl.com/TTT}e01_0017')
In [40]: e01_0017
Out[40]: <Element {http://www.someurl.com/TTT}e01_0017 at 0x2fe78c8>
Another option is to add it to the nsmap under a placeholder prefix (here 'def') as the key and then use that prefix in the XPath. Example:
nsmap = {(k or 'def'):v for k,v in root.nsmap.items()}
e01_0017 = tree.find('.//def:e01_0017',nsmap)
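The inheritance of the default namespace can be reproduced without lxml. The sketch below uses the standard library's xml.etree.ElementTree on a shortened copy of the document; the 'def' prefix is an arbitrary placeholder, as in the answer above.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the envelope; xmlns="..." on <Envelope> applies to all unprefixed descendants.
xml_text = """<Envelope xmlns="http://www.someurl.com/TTT"
          xmlns:env="http://www.someurl.com/TTT_Envelope">
  <Header>
    <env:properties><env:identity>some code</env:identity></env:properties>
  </Header>
  <Body><INVOIC><UNB><cmp04><e01_0017>060334</e01_0017></cmp04></UNB></INVOIC></Body>
</Envelope>"""

root = ET.fromstring(xml_text)
ns = {
    "def": "http://www.someurl.com/TTT",  # placeholder prefix for the default namespace
    "env": "http://www.someurl.com/TTT_Envelope",
}

# Without a namespace the search finds nothing, because the real tag is {uri}e01_0017.
unqualified = root.find(".//e01_0017")

# With the placeholder prefix the element is found and can be updated.
qualified = root.find(".//def:e01_0017", ns)
qualified.text = "Placeholder"

print(unqualified)
print(qualified.tag, qualified.text)
```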
