I have a file that looks like this:
<?xml version="1.0"etc>
<xliff version="1.2" etc>
<file datatype="plaintext" mt="eMT-R2" original="" source-language="en-US" target-language="es">
<header/>
<body>
<trans-unit etc>
<source>blabla</source>
<target>blabla</target>
<note>blabla</note>
</trans-unit>
</body>
</file>
</xliff>
I want to go through the source and target elements. My code only works if I have <body> as a root. Is there a way to skip the first 4 elements at the beginning of the file or just set the root to <body>?
import xml.etree.ElementTree as ET
tree = ET.parse('myfile.xlf')
root = tree.getroot()
for trans in root.findall('trans-unit'):
    source = trans.find('source').text
    target = trans.find('target').text
    lencomp = (len(target) - len(source)) / len(source) * 100.0
    print(source, ">>>", target)
ElementTree's findall takes a limited XPath-like path string. It is not the full-featured XPath that lxml offers, but it is enough for what you need:
import xml.etree.ElementTree as ET
tree = ET.parse('myfile.xlf')
for trans in tree.findall('file/body/trans-unit'):
    source = trans.find('source').text
    target = trans.find('target').text
    lencomp = (len(target) - len(source)) / len(source) * 100.0
    print(source, ">>>", target)
Ok, so it turns out the problem is not in the code but in my file. For anyone working with XLIFF files, this may be useful:
The issue is in the "XMLNS" - if you remove at least one letter, the file will be parsed correctly. I'm not sure exactly what the problem is, but changing this definitely solves the problem
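A likely explanation is that the xmlns attribute declares a default namespace, so every element name is stored with that namespace attached and a plain path such as 'file/body/trans-unit' no longer matches. Instead of editing the file, the search can name the namespace explicitly. The following is a minimal sketch, assuming the file declares the standard XLIFF 1.2 namespace (urn:oasis:names:tc:xliff:document:1.2); the sample document, the "x" prefix and the read_pairs helper are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the xmlns attribute puts every element in that namespace.
XLIFF = """<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2">
  <file datatype="plaintext" original="" source-language="en-US" target-language="es">
    <header/>
    <body>
      <trans-unit id="1">
        <source>Hello</source>
        <target>Hola</target>
      </trans-unit>
    </body>
  </file>
</xliff>"""

# The prefix "x" is arbitrary; only the URI has to match the one in the file.
NS = {"x": "urn:oasis:names:tc:xliff:document:1.2"}


def read_pairs(xml_text):
    """Return a list of (source, target) text pairs from an XLIFF string."""
    root = ET.fromstring(xml_text)
    pairs = []
    for unit in root.findall("x:file/x:body/x:trans-unit", NS):
        source = unit.findtext("x:source", default="", namespaces=NS)
        target = unit.findtext("x:target", default="", namespaces=NS)
        pairs.append((source, target))
    return pairs


pairs = read_pairs(XLIFF)
print(pairs)
```

If the file declares a different namespace URI, only the value in NS needs to change.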
I am trying to update the XML file below in Python 3 using xml.etree.ElementTree (imported as ET), but I am not able to add anything between the tags.
The issue I am facing is that I cannot fetch the elements nested under fileSets.
Can someone let me know how to update the XML?
abc.xml
<assembly
xmlns="http://maven.apache.org/plugins/maven-assembly-plugin/assembly/1.1.2"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="
http://maven.apache.org/plugins/maven-assembly-plugin/assembly/1.1.2
http://maven.apache.org/xsd/assembly-1.1.2-xsd"
>
<id></id>
<formats>
<format>zip</format>
</formats>
<fileSets>
<fileSet>
<outputDirectory>/</outputDirectory>
<directory>../</directory>
<useDefaultExcludes>false</useDefaultExcludes>
<includes>
</includes>
</fileSet>
</fileSets>
</assembly>
Expected output (file names will be added dynamically):
abc.xml
<assembly
xmlns="http://maven.apache.org/plugins/maven-assembly-plugin/assembly/1.1.2"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="
http://maven.apache.org/plugins/maven-assembly-plugin/assembly/1.1.2
http://maven.apache.org/xsd/assembly-1.1.2-xsd"
>
<id></id>
<formats>
<format>zip</format>
</formats>
<fileSets>
<fileSet>
<outputDirectory>/</outputDirectory>
<directory>../</directory>
<useDefaultExcludes>false</useDefaultExcludes>
<includes>
<include>abc.text</include>
<include>def.text</include>
<include>ghi.text</include>
</includes>
</fileSet>
</fileSets>
</assembly>
I am trying the code below. It prints the elements it finds, but I don't know how to access includes and then add entries such as abc.text inside it.
import xml.etree.ElementTree as ET
tree = ET.parse('abc.xml')
root = tree.getroot()
for actor in root.findall('{http://maven.apache.org/plugins/maven-assembly-plugin/assembly/1.1.2}fileSets'):
    for name in actor.findall('{http://maven.apache.org/plugins/maven-assembly-plugin/assembly/1.1.2}fileSet'):
        print(name)
You don't have to do anything with fileSets or fileSet. Since you want to add children to includes, get that element directly.
import xml.etree.ElementTree as ET
# Ensure that the proper prefix is used in the output (in this case, no prefix at all)
ET.register_namespace("", "http://maven.apache.org/plugins/maven-assembly-plugin/assembly/1.1.2")
tree = ET.parse("abc.xml")
# Find the 'includes' element (.// means search the whole document).
# {*} is a wildcard and matches any namespace (requires Python 3.8 or later)
includes = tree.find(".//{*}includes")
# Create three new 'include' elements
include1 = ET.Element("include")
include1.text = "abc.text"
include2 = ET.Element("include")
include2.text = "def.text"
include3 = ET.Element("include")
include3.text = "ghi.text"
# Add the new elements as children of 'includes'
includes.append(include1)
includes.append(include2)
includes.append(include3)
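Since the file names are added dynamically, the three hand-written elements above can be replaced by a loop. The sketch below is one possible way to do it, not the only one: it uses a cut-down copy of the assembly file as an inline string, a hypothetical add_includes helper, and the fully qualified tag name so that the new elements land in the same namespace as their parent.

```python
import xml.etree.ElementTree as ET

NS_URI = "http://maven.apache.org/plugins/maven-assembly-plugin/assembly/1.1.2"

# Serialize the assembly namespace as the default one (no ns0: prefix).
ET.register_namespace("", NS_URI)

# Cut-down, hypothetical copy of abc.xml so the example is self-contained.
ASSEMBLY = f"""<assembly xmlns="{NS_URI}">
  <fileSets>
    <fileSet>
      <directory>../</directory>
      <includes>
      </includes>
    </fileSet>
  </fileSets>
</assembly>"""


def add_includes(root, file_names):
    """Append one <include> child per file name to the first <includes> element."""
    includes = root.find(f".//{{{NS_URI}}}includes")
    if includes is None:
        raise ValueError("no <includes> element found")
    for name in file_names:
        child = ET.SubElement(includes, f"{{{NS_URI}}}include")
        child.text = name
    return includes


root = ET.fromstring(ASSEMBLY)
add_includes(root, ["abc.text", "def.text", "ghi.text"])
result = ET.tostring(root, encoding="unicode")
print(result)
```

When working with a parsed file rather than a string, tree.write("abc.xml", encoding="utf-8", xml_declaration=True) saves the result back to disk.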
I have some variables in Python that I need to store as XML. I have been using the lxml module for this so far, but I am not too experienced with it. I have tried various tutorials and docs, but I am at a dead end and need some help.
Here is the python script:
root = etree.Element("root")
coins=etree.Element("coins")
doc=etree.ElementTree(coins)
coins.append(etree.Element("trader"))
coins.append(etree.Element("metal"))
coins.append(etree.Element("type"))
coins.append(etree.Element("price"))
coins[0].text="Gold.co.uk"
coins[0].attrib["variable"]=("GLDAG_MAPLE")
coins[1].text="Silver"
coins[2].text="Britannia"
coins[3].text=str(GLDAG_MAPLE)
doc.write('data.xml', pretty_print=True)
As of now it outputs this:
<coins>
<trader variable="GLDAG_MAPLE">Gold.co.uk</trader>
<metal>Silver</metal>
<type>Britannia</type>
<price>
£31.20
</price>
</coins>
However I would like it to look like this:
<root>
<coin>
<trader> Gold.co.uk </trader>
<type> Britannia </type>
<price> £31.20 </price>
</coin>
</root>
The tags and their sub-tags would be repeated for every type of coin. I have no idea how to construct the XML so that the output looks like the desired structure above. So far I have tried to adapt scripts that I have seen on GitHub and other sites to suit my needs, but my scripts keep failing or producing incorrect results.
If someone could help me out then that would be great!
You can simply append the Element to root:
from lxml import etree
coinItems = [
    {'trader': 'Gold.co.uk', 'metal': 'Silver', 'type': 'Britannia'},
    {'trader': 'copper.co.uk', 'metal': 'Copper', 'type': 'World'}
]
root = etree.Element("root")
for ci in coinItems:
    coin = etree.Element("coin")
    # The third argument sets attributes on the new element
    etree.SubElement(coin, "trader", {'variable': 'GLDAG_MAPLE'}).text = ci['trader']
    etree.SubElement(coin, "metal").text = ci['metal']
    etree.SubElement(coin, "type").text = ci['type']
    root.append(coin)
fName = '/tmp/data.xml'
with open(fName, 'wb') as f:
    # Drop the encoding argument if you want non-ASCII characters escaped, e.g. &#163; for £
    f.write(etree.tostring(root, xml_declaration=True, encoding="utf-8", pretty_print=True))
print(open(fName).read())
Output:
<?xml version='1.0' encoding='utf-8'?>
<root>
<coin>
<trader variable="GLDAG_MAPLE">Gold.co.uk</trader>
<metal>Silver</metal>
<type>Britannia</type>
</coin>
<coin>
<trader variable="GLDAG_MAPLE">copper.co.uk</trader>
<metal>Copper</metal>
<type>World</type>
</coin>
</root>
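The same loop can also carry the price field from the question. The sketch below swaps in the standard library's xml.etree.ElementTree instead of lxml so that it runs without extra packages; the build_root helper and the second coin's price are made-up illustration values, not data from the question.

```python
import xml.etree.ElementTree as ET

# Sample data; the second price is a made-up placeholder.
coin_items = [
    {"trader": "Gold.co.uk", "metal": "Silver", "type": "Britannia", "price": "£31.20"},
    {"trader": "copper.co.uk", "metal": "Copper", "type": "World", "price": "£1.05"},
]


def build_root(items):
    """Build <root> with one <coin> child per dictionary in items."""
    root = ET.Element("root")
    for item in items:
        coin = ET.SubElement(root, "coin")
        for field in ("trader", "metal", "type", "price"):
            ET.SubElement(coin, field).text = item[field]
    return root


root = build_root(coin_items)
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

The output is a single unindented line; with lxml, pretty_print=True as shown above gives the indented layout.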
I prefer using the lxml builder (https://lxml.de/api/lxml.builder.ElementMaker-class.html) because imho it is easier to see the structure of your XML document.
from lxml.builder import E
root = E.root(
    E.coin(
        E.trader("Gold.co.uk",
                 variable="GLDAG_MAPLE"),
        E.metal("silver"),
        E.price("£31.20")
    )
)
You can then append the root element to your main document.
I have an RDF document, which looks as follows:
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:cd="http:xyz.com#">
<rdf:Description rdf:about="http:xyz.com#">
<cd:algorithmid>DPOT-5ab247867d368</cd:algorithmid>
<cd:owner>arun</cd:owner>
<cd:acesskey>ACCESS-5ab247867d370</cd:acesskey>
<cd:purpose>Research</cd:purpose>
<cd:metadata>10</cd:metadata>
<cd:completeness>Partial</cd:completeness>
<cd:completeness>Yes</cd:completeness>
<cd:inclusion_1>age</cd:inclusion_1>
<cd:feature_1>Sex</cd:feature_1>
<cd:target>Diagnosis</cd:target>
</rdf:Description>
</rdf:RDF>
From the above text, I need to extract the target (i.e. only the value between the opening and closing cd:target tags). The desired output is 'Diagnosis'. I tried an XML parser, but it does not work because the tag names contain ':'. Is there a better solution?
Update: This is what I tried; sorry for the naive coding style.
import xml.etree.ElementTree as et
def metadataParser(metadataFile):
    with open(metadataFile, 'r') as m:
        data = m.read()
    # Load the xml content from a string
    content = et.fromstring(data)
    description = content.find('rdf:Description')
    target = description.find("cd:target")
    return target
target = metadataParser('metadata.rdf')
print(target)
You can use the BeautifulSoup module with its XML parser.
from bs4 import BeautifulSoup
XML = '''
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:cd="http:xyz.com#">
<rdf:Description rdf:about="http:xyz.com#">
<cd:algorithmid>DPOT-5ab247867d368</cd:algorithmid>
<cd:owner>arun</cd:owner>
<cd:acesskey>ACCESS-5ab247867d370</cd:acesskey>
<cd:purpose>Research</cd:purpose>
<cd:metadata>10</cd:metadata>
<cd:completeness>Partial</cd:completeness>
<cd:completeness>Yes</cd:completeness>
<cd:inclusion_1>age</cd:inclusion_1>
<cd:feature_1>Sex</cd:feature_1>
<cd:target>Diagnosis</cd:target>
</rdf:Description>
</rdf:RDF>'''
soup = BeautifulSoup(XML, 'xml')
target = soup.find('target').text
print(target)
# Diagnosis
As you can see, it's pretty easy to use.
The rdf: and cd: parts are namespace prefixes. In your search they need to be replaced with the actual namespace URIs, like so:
description = content.find('{http://www.w3.org/1999/02/22-rdf-syntax-ns#}Description')
target = description.find("{http:xyz.com#}target")
You could use the following regex, which gets the text from inside every 'cd' tag in your file:
import re
with open("file.rdf", "r") as file:
    for lines in file:
        pattern = "<cd:.*>(.*)</cd:.*>"
        output = re.findall(pattern, lines)
        if len(output) != 0:
            print(output[0])
And this outputs:
DPOT-5ab247867d368
arun
ACCESS-5ab247867d370
Research
10
Partial
Yes
age
Sex
Diagnosis
Explanation of the pattern variable:
the first .* matches ANY characters in the tag name after cd:
(.*) is the group we want to capture, i.e. the text between the tags
the last .* does the same as the first, matching ANY characters in the closing tag name.
Note: I have included an if statement to check whether the output (which is a list) contains any elements; if it does not, the line is skipped (for example, your rdf: header lines are excluded).
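A slightly stricter variant of the same idea is sketched below, assuming each cd: element sits on one line with no nested tags. It captures the tag name as well, uses a non-greedy group for the value, and requires the closing tag to match the opening one through a backreference. The shortened RDF string is only there to keep the example self-contained.

```python
import re

# Shortened copy of the document from the question.
RDF = """<rdf:Description rdf:about="http:xyz.com#">
<cd:owner>arun</cd:owner>
<cd:purpose>Research</cd:purpose>
<cd:target>Diagnosis</cd:target>
</rdf:Description>"""

# Group 1 is the tag name, group 2 the text; \1 forces the closing tag to match.
PATTERN = re.compile(r"<cd:(\w+)>(.*?)</cd:\1>")

# findall returns (name, value) tuples, which dict() turns into a lookup table.
values = dict(PATTERN.findall(RDF))
print(values["target"])
```

Note that a dictionary keeps only the last value for a repeated tag such as cd:completeness; keep the list of tuples if duplicates matter.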
The cd: part is a namespace. They're pretty common in XML, and just about any XML parser has a way to handle them.
Otherwise, if you are just looking for a single item and you don't care about structure, you could just do a simple string search and grab everything between <cd:target> and </cd:target>, like so:
rdf = '''rdf xml document'''
open_tag = '<cd:target>'
close_tag = '</cd:target>'
start = rdf.find(open_tag)
end = rdf.find(close_tag)
value = rdf[start + len(open_tag):end]
You can create a dictionary holding the namespace mappings seen at the top:
import xml.etree.ElementTree as ET
tree = ET.parse('input.xml')
ns = {'rdf' : 'http://www.w3.org/1999/02/22-rdf-syntax-ns#', 'cd' : 'http:xyz.com#'}
description = tree.find('rdf:Description', ns)
target = description.find('cd:target', ns)
print(target.text)
This would display:
Diagnosis
This approach is described in the Python xml.etree.ElementTree documentation.
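A compact variant of the same approach is sketched below, assuming the document is available as a string. It uses findtext with a single path, which returns the element's text directly and falls back to a default when the element is missing; the get_target helper and the shortened RDF string are illustrative only.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the document from the question.
RDF = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:cd="http:xyz.com#">
  <rdf:Description rdf:about="http:xyz.com#">
    <cd:owner>arun</cd:owner>
    <cd:target>Diagnosis</cd:target>
  </rdf:Description>
</rdf:RDF>"""

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "cd": "http:xyz.com#",
}


def get_target(xml_text):
    """Return the text of cd:target, or None if the element is absent."""
    root = ET.fromstring(xml_text)
    return root.findtext("rdf:Description/cd:target", default=None, namespaces=NS)


target = get_target(RDF)
print(target)
```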
I have an XML file which has many elements. I would like to create a list/array of all the values which have a specific element name, in my case "pair:ApplicationNumber".
I've gone over a lot of the other questions, but I have not been able to find an answer. I know that I could do this by loading the file as text and going over it with pandas, but I'm sure there's a much better way.
I was unsuccessful with ElementTree as well as with xml.dom.minidom.
My code currently looks as follows:
import os
from xml.dom import minidom
WindowsUser = os.getenv('username')
XMLPath = os.path.join('C:\\Users', WindowsUser, 'Downloads', 'ApplicationsByCustomerNumber.xml')
xmldoc = minidom.parse(XMLPath)
itemlist = xmldoc.getElementsByTagName('pair:ApplicationNumber')
for s in itemlist:
    print(s.attributes['pair:ApplicationNumber'].value)
An example XML file looks as follows:
<?xml version="1.0" encoding="UTF-8"?>
<pair:PatentApplicationList xsi:schemaLocation="urn:us:gov:uspto:pair PatentApplicationList.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:pair="urn:us:gov:uspto:pair">
<pair:FileHeader>
<pair:FileCreationTimeStamp>2017-07-10T10:52:12.12</pair:FileCreationTimeStamp>
</pair:FileHeader>
<pair:ApplicationStatusData>
<pair:ApplicationNumber>62383607</pair:ApplicationNumber>
<pair:ApplicationStatusCode>20</pair:ApplicationStatusCode>
<pair:ApplicationStatusText>Application Dispatched from Preexam, Not Yet Docketed</pair:ApplicationStatusText>
<pair:ApplicationStatusDate>2016-09-16</pair:ApplicationStatusDate>
<pair:AttorneyDocketNumber>1354-T-02-US</pair:AttorneyDocketNumber>
<pair:FilingDate>2016-09-06</pair:FilingDate>
<pair:LastModifiedTimestamp>2017-05-30T21:40:37.37</pair:LastModifiedTimestamp>
<pair:CustomerNumber>122761</pair:CustomerNumber><pair:LastFileHistoryTransaction>
<pair:LastTransactionDate>2017-05-30</pair:LastTransactionDate>
<pair:LastTransactionDescription>Email Notification</pair:LastTransactionDescription> </pair:LastFileHistoryTransaction>
<pair:ImageAvailabilityIndicator>true</pair:ImageAvailabilityIndicator>
</pair:ApplicationStatusData>
<pair:ApplicationStatusData>
<pair:ApplicationNumber>62292372</pair:ApplicationNumber>
<pair:ApplicationStatusCode>160</pair:ApplicationStatusCode>
<pair:ApplicationStatusText>Abandoned -- Incomplete Application (Pre-examination)</pair:ApplicationStatusText>
<pair:ApplicationStatusDate>2016-11-01</pair:ApplicationStatusDate>
<pair:AttorneyDocketNumber>681-S-23-US</pair:AttorneyDocketNumber>
<pair:FilingDate>2016-02-08</pair:FilingDate>
<pair:LastModifiedTimestamp>2017-06-20T21:59:26.26</pair:LastModifiedTimestamp>
<pair:CustomerNumber>122761</pair:CustomerNumber><pair:LastFileHistoryTransaction>
<pair:LastTransactionDate>2017-06-20</pair:LastTransactionDate>
<pair:LastTransactionDescription>Petition Entered</pair:LastTransactionDescription> </pair:LastFileHistoryTransaction>
<pair:ImageAvailabilityIndicator>true</pair:ImageAvailabilityIndicator>
</pair:ApplicationStatusData>
<pair:ApplicationStatusData>
<pair:ApplicationNumber>62289245</pair:ApplicationNumber>
<pair:ApplicationStatusCode>160</pair:ApplicationStatusCode>
<pair:ApplicationStatusText>Abandoned -- Incomplete Application (Pre-examination)</pair:ApplicationStatusText>
<pair:ApplicationStatusDate>2016-10-26</pair:ApplicationStatusDate>
<pair:AttorneyDocketNumber>1526-P-01-US</pair:AttorneyDocketNumber>
<pair:FilingDate>2016-01-31</pair:FilingDate>
<pair:LastModifiedTimestamp>2017-06-15T21:24:13.13</pair:LastModifiedTimestamp>
<pair:CustomerNumber>122761</pair:CustomerNumber><pair:LastFileHistoryTransaction>
<pair:LastTransactionDate>2017-06-15</pair:LastTransactionDate>
<pair:LastTransactionDescription>Petition Entered</pair:LastTransactionDescription> </pair:LastFileHistoryTransaction>
<pair:ImageAvailabilityIndicator>true</pair:ImageAvailabilityIndicator>
</pair:ApplicationStatusData>
</pair:PatentApplicationList>
ElementTree expands the "pair:" prefix of each tag into the namespace URI declared at the top of the file, so a tag is stored as '{urn:us:gov:uspto:pair}ApplicationNumber' and does not match 'pair:ApplicationNumber', even though it looks like it should.
I've used ElementTree to extract the application numbers as follows (I've just used a local XML file in my examples, rather than the full path in your code):
Example 1:
from xml.etree import ElementTree
tree = ElementTree.parse('ApplicationsByCustomerNumber.xml')
root = tree.getroot()
for item in root:
    # Match on the local name, ignoring the '{namespace}' part of the tag
    if 'ApplicationStatusData' in item.tag:
        for child in item:
            if 'ApplicationNumber' in child.tag:
                print(child.text)
Example 2:
from xml.etree import ElementTree
tree = ElementTree.parse('ApplicationsByCustomerNumber.xml')
root = tree.getroot()
for item in root.iter('{urn:us:gov:uspto:pair}ApplicationStatusData'):
    for child in item.iter('{urn:us:gov:uspto:pair}ApplicationNumber'):
        print(child.text)
Hope this may be useful.
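Since the question asks for a list of the values rather than printed output, the search can be folded into a list comprehension. This is a minimal sketch using a namespace dictionary and a shortened inline copy of the file; with the real file, ElementTree.parse(path).getroot() would replace the fromstring call.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the file from the question.
PAIR_XML = """<pair:PatentApplicationList xmlns:pair="urn:us:gov:uspto:pair">
  <pair:ApplicationStatusData>
    <pair:ApplicationNumber>62383607</pair:ApplicationNumber>
  </pair:ApplicationStatusData>
  <pair:ApplicationStatusData>
    <pair:ApplicationNumber>62292372</pair:ApplicationNumber>
  </pair:ApplicationStatusData>
</pair:PatentApplicationList>"""

NS = {"pair": "urn:us:gov:uspto:pair"}

root = ET.fromstring(PAIR_XML)
# './/' searches the whole tree below the root, at any depth.
numbers = [el.text for el in root.findall(".//pair:ApplicationNumber", NS)]
print(numbers)
```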
I am using ElementTree to parse an XML file, edit the contents and write to a new XML file. I have this all working apart from one issue. When I generate the file, there is a lot of extra namespace information in the output. Here are some snippets of code:
import xml.etree.ElementTree as ET
ET.register_namespace("", "http://clish.sourceforge.net/XMLSchema")
tree = ET.parse('ethernet.xml')
root = tree.getroot()
commands = root.findall('{http://clish.sourceforge.net/XMLSchema}'
                        'VIEW/{http://clish.sourceforge.net/XMLSchema}COMMAND')
all1 = []  # collects each COMMAND element together with its descendants
for command in commands:
    all1.append(list(command.iter()))
And a sample of the output file, with the unwanted attribute xmlns="http://clish.sourceforge.net/XMLSchema":
<COMMAND xmlns="http://clish.sourceforge.net/XMLSchema" help="Interface specific description" name="description">
<PARAM help="Description (must be in double-quotes)" name="description" ptype="LINE" />
<CONFIG />
</COMMAND>
Can I remove this with ElementTree? Or will I have to use a regex (I am writing a string to the file)?
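One possible approach, sketched here under the assumption that the fragments do not need to stay in the clish namespace, is to strip the namespace from every tag before serializing. ElementTree only writes an xmlns declaration for namespaces that appear in the tags it serializes, so a fragment with plain tag names comes out without one. The sample document and the strip_namespaces helper are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical cut-down input in the clish namespace.
CLISH = """<CLISH_MODULE xmlns="http://clish.sourceforge.net/XMLSchema">
  <VIEW name="ethernet">
    <COMMAND name="description" help="Interface specific description">
      <CONFIG/>
    </COMMAND>
  </VIEW>
</CLISH_MODULE>"""


def strip_namespaces(root):
    """Rewrite '{uri}tag' to 'tag' on every element, in place."""
    for el in root.iter():
        # Comments and processing instructions have non-string tags; skip them.
        if isinstance(el.tag, str) and el.tag.startswith("{"):
            el.tag = el.tag.split("}", 1)[1]
    return root


root = strip_namespaces(ET.fromstring(CLISH))
# After stripping, plain paths work and fragments serialize without xmlns.
command = root.find("VIEW/COMMAND")
fragment = ET.tostring(command, encoding="unicode")
print(fragment)
```

If the whole document is written instead of individual fragments, keeping the namespace and serializing from the root puts the declaration once on the root element.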