xml.etree.ElementTree .remove


I'm trying to remove elements from an ALTO XML file with remove().
My ALTO file looks like this:
<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.loc.gov/standards/alto/ns-v4# http://www.loc.gov/standards/alto/v4/alto-4-2.xsd">
<Description>
<MeasurementUnit>pixel</MeasurementUnit>
<sourceImageInformation>
<fileName>filename</fileName>
</sourceImageInformation>
</Description>
<Layout>
<Page>
<PrintSpace>
<TextBlock>
<Shape><Polygon/></Shape>
<TextLine>
<Shape><Polygon/></Shape>
<String CONTENT="ABCDEF" HPOS="1234" VPOS="1234" WIDTH="1234" HEIGHT="1234" />
</TextLine>
</TextBlock>
</PrintSpace>
</Page>
</Layout>
</alto>
And my code is:
import xml.etree.ElementTree as ET
tree = ET.parse("file.xml")
root = tree.getroot()
ns = {'alto': 'http://www.loc.gov/standards/alto/ns-v4#'}
ET.register_namespace("", "http://www.loc.gov/standards/alto/ns-v4#")
for Test in root.findall('.//alto:TextBlock', ns):
    root.remove(Test)
tree.write('out.xml', encoding="UTF-8", xml_declaration=True)
Here is the error I get:
ValueError: list.remove(x): x not in list
Thanks a lot for your help 💐

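ElementFather.remove(ElementChild) works only if ElementChild is a direct child of ElementFather; otherwise it raises the ValueError you are seeing. In your file, TextBlock is a direct child of PrintSpace, not of the root, so you have to call remove on PrintSpace.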
import xml.etree.ElementTree as ET
tree = ET.parse("file.xml")
root = tree.getroot()
ns = {'alto': 'http://www.loc.gov/standards/alto/ns-v4#'}
ET.register_namespace("", "http://www.loc.gov/standards/alto/ns-v4#")
# TextBlock is a direct child of PrintSpace, so remove it from there
PrintSpace = root.find('.//alto:PrintSpace', ns)
for Test in root.findall('.//alto:TextBlock', ns):
    PrintSpace.remove(Test)
tree.write('out.xml', encoding="UTF-8", xml_declaration=True)
Note: this code is only an example of a working solution; you can certainly improve it.
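If the elements to delete can sit under different parents, a more general approach is to look up each element's real parent first. The following is a minimal sketch, not part of the original answer: the helper name remove_all and the inline sample document are made up for illustration, and it uses only the standard-library ElementTree API.

```python
import xml.etree.ElementTree as ET

ALTO_NS = "http://www.loc.gov/standards/alto/ns-v4#"

# Tiny stand-in for the ALTO file from the question (illustrative only).
SAMPLE = (
    '<alto xmlns="' + ALTO_NS + '"><Layout><Page><PrintSpace>'
    '<TextBlock><TextLine/></TextBlock><TextBlock/>'
    '</PrintSpace></Page></Layout></alto>'
)

def remove_all(root, qualified_tag):
    """Remove every descendant with the given tag from its actual parent."""
    # ElementTree elements do not know their parent, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    # findall() returns a list, so the tree can be modified while looping over it.
    targets = root.findall(".//" + qualified_tag)
    for elem in targets:
        parent_of[elem].remove(elem)
    return len(targets)

root = ET.fromstring(SAMPLE)
removed = remove_all(root, "{" + ALTO_NS + "}TextBlock")
print(removed)
```

Because the parent is looked up per element, the same helper should also work when the TextBlock elements are spread over several PrintSpace elements.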

Related

Fetching elements from XML and insert into Postgres DB

I have an XML file like the one below, and I need to insert its data into a PostgreSQL database. Below are the sample XML and the code I use, but I'm not getting any useful output. Can someone please explain how to fetch these XML values effectively?
<?xml version="1.0" encoding="utf-8"?>
<rss xmlns:g="http://base.google.com/ns/1.0" version="2.0" encoding="utf-8">
<config>
<g:system>Magento</g:system>
<g:extension>Magmodules_Googleshopping</g:extension>
<g:extension_version>1.6.8</g:extension_version>
<g:store>emb</g:store>
<g:url>https://www.xxxxx.com/</g:url>
<g:products>1320</g:products>
<g:generated>2020-06-11 11:18:32</g:generated>
<g:processing_time>17.5007</g:processing_time>
</config>
<channel>
<item>
<g:id>20</g:id>
<g:title>product 1</g:title>
<g:description>description about product 1</g:description>
<g:gtin>42662</g:gtin>
<g:brand>company</g:brand>
<g:mpn>0014</g:mpn>
<g:link>link.html</g:link>
<g:image_link>link/c/a/cat_21_16.jpg</g:image_link>
<g:availability>in stock</g:availability>
<g:condition>new</g:condition>
<g:price>9</g:price>
<g:shipping>
<g:country>UAE</g:country>
<g:service>DHL</g:service>
<g:price>2.90</g:price>
</g:shipping>
</item>
<item>
.
.
.
</item>
Below is the script I use.
Python: 3.5, Postgres version: 11
# import modules
import sys
import psycopg2
import datetime
now = datetime.datetime.now()
# current date and time
dt = now.strftime("%Y%m%dT%H%M%S")
# xml tree access
#from xml.etree import ElementTree
import xml.etree.ElementTree as ET
# incremental variable
x = 0
with open('/Users/admin/documents/shopping.xml', 'rt',encoding="utf8") as f:
    #tree = ElementTree.parse(f)
    tree = ET.parse(f)
# connection to postgreSQL database
try:
    conn=psycopg2.connect(host='localhost', database='postgres',
                          user='postgres', password='postgres',port='5432')
except:
    print ("Hey I am unable to connect to the database.")
cur = conn.cursor()
# access the xml tree element nodes
try:
    for node in tree.findall('.//item'):
        src = node.find('id')
        tgt = node.find('mpn')
        print(node)
except:
    print ("Oops I can't insert record into database table!")
conn.commit()
conn.close()
The current output I'm getting is:
None
None
None
Expected output:
id title description gtin ......
20 product 1 g:description xxxx .....

It is strange that you can't find item. Maybe you are reading the wrong file and it doesn't contain any item.
Using your XML data as a string with ET.fromstring(), I have no problem getting item.
Try print( f.read() ) to see what you really read from the file.
The only problem is with id and mpn, which are in the g: namespace. Writing 'g:id' and 'g:mpn' alone is not enough; you also have to pass find() a dictionary that maps the g prefix to its namespace URI:
tree = ET.fromstring(xml)
ns = {'g': "http://base.google.com/ns/1.0"}
for node in tree.findall('.//item'):
    src = node.find('g:id', ns)
    tgt = node.find('g:mpn', ns)
    print('Node:', node)
    print('src:', src.text)
    print('tgt:', tgt.text)
Or put the full namespace URI directly in the tag name, as in '{http://base.google.com/ns/1.0}id' and '{http://base.google.com/ns/1.0}mpn':
tree = ET.fromstring(xml)
for node in tree.findall('.//item'):
    src = node.find('{http://base.google.com/ns/1.0}id')
    tgt = node.find('{http://base.google.com/ns/1.0}mpn')
    print('Node:', node)
    print('src:', src.text)
    print('tgt:', tgt.text)
Minimal working code:
import xml.etree.ElementTree as ET
xml = '''<?xml version="1.0" encoding="utf-8"?>
<rss xmlns:g="http://base.google.com/ns/1.0" version="2.0" encoding="utf-8">
<config>
<g:system>Magento</g:system>
<g:extension>Magmodules_Googleshopping</g:extension>
<g:extension_version>1.6.8</g:extension_version>
<g:store>emb</g:store>
<g:url>https://www.xxxxx.com/</g:url>
<g:products>1320</g:products>
<g:generated>2020-06-11 11:18:32</g:generated>
<g:processing_time>17.5007</g:processing_time>
</config>
<channel>
<item>
<g:id>20</g:id>
<g:title>product 1</g:title>
<g:description>description about product 1</g:description>
<g:gtin>42662</g:gtin>
<g:brand>company</g:brand>
<g:mpn>0014</g:mpn>
<g:link>link.html</g:link>
<g:image_link>link/c/a/cat_21_16.jpg</g:image_link>
<g:availability>in stock</g:availability>
<g:condition>new</g:condition>
<g:price>9</g:price>
<g:shipping>
<g:country>UAE</g:country>
<g:service>DHL</g:service>
<g:price>2.90</g:price>
</g:shipping>
</item>
</channel>
</rss>
'''
tree = ET.fromstring(xml)
ns = {'g': "http://base.google.com/ns/1.0"}
for node in tree.findall('.//item'):
    src = node.find('g:id', ns)
    tgt = node.find('g:mpn', ns)
    print('Node:', node)
    print('src:', src.text)
    print('tgt:', tgt.text)
Result:
Node: <Element 'item' at 0x7f74ba45b710>
src: 20
tgt: 0014
BTW: it also works when I use io.StringIO to simulate a file:
f = io.StringIO(xml)
tree = ET.parse(f)
Minimal working code:
import xml.etree.ElementTree as ET
import io
xml = '''<?xml version="1.0" encoding="utf-8"?>
<rss xmlns:g="http://base.google.com/ns/1.0" version="2.0" encoding="utf-8">
<config>
<g:system>Magento</g:system>
<g:extension>Magmodules_Googleshopping</g:extension>
<g:extension_version>1.6.8</g:extension_version>
<g:store>emb</g:store>
<g:url>https://www.xxxxx.com/</g:url>
<g:products>1320</g:products>
<g:generated>2020-06-11 11:18:32</g:generated>
<g:processing_time>17.5007</g:processing_time>
</config>
<channel>
<item>
<g:id>20</g:id>
<g:title>product 1</g:title>
<g:description>description about product 1</g:description>
<g:gtin>42662</g:gtin>
<g:brand>company</g:brand>
<g:mpn>0014</g:mpn>
<g:link>link.html</g:link>
<g:image_link>link/c/a/cat_21_16.jpg</g:image_link>
<g:availability>in stock</g:availability>
<g:condition>new</g:condition>
<g:price>9</g:price>
<g:shipping>
<g:country>UAE</g:country>
<g:service>DHL</g:service>
<g:price>2.90</g:price>
</g:shipping>
</item>
</channel>
</rss>
'''
f = io.StringIO(xml)
tree = ET.parse(f)
ns = {'g': "http://base.google.com/ns/1.0"}
for node in tree.findall('.//item'):
    src = node.find('{http://base.google.com/ns/1.0}id')
    tgt = node.find('{http://base.google.com/ns/1.0}mpn')
    print('Node:', node)
    print('src:', src.text)
    print('mpn:', tgt.text)
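To get from the namespaced lookups above to something that can be inserted into a database, one option is to collect each item as a tuple first. This is only a sketch under assumptions: the function name item_rows, the chosen field list, and the shortened sample feed are illustrative, and the actual insert step is left out because the question does not show the target table.

```python
import xml.etree.ElementTree as ET

G_NS = {"g": "http://base.google.com/ns/1.0"}

# Shortened version of the feed from the question (illustrative only).
SAMPLE = """<rss xmlns:g="http://base.google.com/ns/1.0" version="2.0">
  <channel>
    <item>
      <g:id>20</g:id>
      <g:title>product 1</g:title>
      <g:mpn>0014</g:mpn>
    </item>
  </channel>
</rss>"""

def item_rows(xml_text, fields=("id", "title", "mpn")):
    """Return one tuple per <item>, holding the text of the requested g: children."""
    root = ET.fromstring(xml_text)
    rows = []
    for item in root.iterfind(".//item"):
        # findtext() returns None for a missing child instead of raising,
        # which avoids the AttributeError that .text on None would cause.
        rows.append(tuple(item.findtext("g:" + name, namespaces=G_NS)
                          for name in fields))
    return rows

rows = item_rows(SAMPLE)
print(rows)
```

Tuples of this shape are what a DB-API cursor's executemany() expects as parameters, so they could be handed to a parameterized INSERT once the table layout is known.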

copy xml node and children to new xml file

I have the following code that loops over a set of records and moves each record to a new file:
import os
import xml.etree.cElementTree as ET
for filename in os.listdir('modemuze'):
    if filename.endswith('.xml'):
        original_tree = ET.ElementTree(file='modemuze/'+filename)
        root = original_tree.getroot()
        for child in root[2]:
            if child.tag == "{http://www.openarchives.org/OAI/2.0/}record":
                new_tree = ET.ElementTree(file='test.xml')
                new_root = new_tree.getroot()
                new_root.append(child)
The file which contains the records I want to move has the following structure:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<OAI-PMH xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd" xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<responseDate>2018-02-14T12:15:11.241+02:00</responseDate>
<request verb="ListRecords">147.102.11.65:9000/fashionedmfp/oai</request>
<ListRecords>
<record>
<header>
<identifier>oai:fashionedmfp:8c549bcd078e2ce84a265d318547c5f8e8bf0cd0</identifier>
<datestamp>2016-06-27</datestamp>
</header>
<metadata>
<rdf:RDF xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:edm="http://www.europeana.eu/schemas/edm/" xmlns:foaf="http://xmlns.com/foaf/0.1/" xmlns:edmfp="http://www.europeanafashion.eu/edmfp/" xmlns:rdaGr2="http://rdvocab.info/ElementsGr2/" xmlns:ore="http://www.openarchives.org/ore/terms/" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#" xmlns:mrel="http://id.loc.gov/vocabulary/relators/" xmlns:owl="http://www.w3.org/2002/07/owl#" xmlns:wgs84="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:crm="http://www.cidoc-crm.org/rdfs/cidoc_crm_v5.0.2_english_label.rdfs#" xmlns:gr="http://www.heppnetz.de/ontologies/goodrelations/v1#" xmlns:xalan="http://xml.apache.org/xalan" xmlns:skos="http://www.w3.org/2004/02/skos/core#">
<edm:ProvidedCHO rdf:about="localID/europeana-fashion/10918x1y95563">
<dcterms:created>circa 1855-1874</dcterms:created>
<dcterms:issued>1875</dcterms:issued>
<dcterms:issued>1875</dcterms:issued>
<dcterms:medium>katoen</dcterms:medium>
<dcterms:provenance>schenking</dcterms:provenance>
<dcterms:provenance>1958</dcterms:provenance>
<dcterms:spatial>
<edm:Place>
<skos:prefLabel>Nederland</skos:prefLabel>
</edm:Place>
</dcterms:spatial>
<dcterms:creator>
<edm:Agent/>
</dcterms:creator>
<dc:creator>
<edm:Agent/>
</dc:creator>
<dc:description>Jurk (kind) van wit piqué, gegarneerd met broderie en soustache</dc:description>
<dc:identifier>0061211</dc:identifier>
<dc:title>jurken</dc:title>
<dc:title>kinderkleding</dc:title>
<edm:type>IMAGE</edm:type>
<edmfp:localType>jurken</edmfp:localType>
<edmfp:localType>kinderkleding</edmfp:localType>
<edmfp:technique>piqué-weefsel</edmfp:technique>
</edm:ProvidedCHO>
<ore:Aggregation rdf:about="localID/europeana-fashion/Aggregation_10918x1y95563">
<edm:aggregatedCHO rdf:resource="localID/europeana-fashion/10918x1y95563"/>
<edm:dataProvider>
<edm:Agent>
<skos:prefLabel>Gemeentemuseum Den Haag</skos:prefLabel>
</edm:Agent>
</edm:dataProvider>
<edm:hasView>
<edm:WebResource rdf:about="http://images.gemeentemuseum.nl/br/0061211.jpg"/>
</edm:hasView>
<edm:isShownBy>
<edm:WebResource rdf:about="http://images.gemeentemuseum.nl/br/0061211.jpg"/>
</edm:isShownBy>
<edm:provider>
<edm:Agent rdf:about="http://www.europeanafashion.eu/">
<skos:prefLabel>Europeana Fashion</skos:prefLabel>
</edm:Agent>
</edm:provider>
<edm:rights rdf:resource="http://www.europeana.eu/rights/rr-f/"/>
</ore:Aggregation>
</rdf:RDF>
</metadata>
</record>
</ListRecords>
</OAI-PMH>
And the file I want to place the records in has the following structure:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<ListRecords>
</ListRecords>
But when I run this code, my test.xml file contains no records.
Why is this happening?
I am totally new to XML and Python, so if I missed something or something is unclear, please let me know so I can clarify or add it.
I am doing this in Python 2.7.
All help and suggestions are very much appreciated!

You need to write the changes back to the file; append() only modifies the tree in memory.
Add new_tree.write('test.xml') to the code.
As a result, the code should look like this:
import os
import xml.etree.cElementTree as ET
for filename in os.listdir('modemuze'):
    if filename.endswith('.xml'):
        original_tree = ET.ElementTree(file='modemuze/'+filename)
        root = original_tree.getroot()
        for child in root[2]:
            if child.tag == "{http://www.openarchives.org/OAI/2.0/}record":
                new_tree = ET.ElementTree(file='test.xml')
                new_root = new_tree.getroot()
                new_root.append(child)
                new_tree.write('test.xml')
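The code above re-parses and rewrites test.xml once per record. A possible variation is to parse the target once, append everything, and write once at the end. This is a sketch assuming Python 3 and the plain xml.etree.ElementTree module; the function name collect_records and its parameters are made up for illustration.

```python
import os
import xml.etree.ElementTree as ET

OAI_RECORD = "{http://www.openarchives.org/OAI/2.0/}record"

def collect_records(source_dir, target_path):
    """Append every OAI record found in source_dir/*.xml to the target file's root."""
    target_tree = ET.parse(target_path)   # parse the target only once
    target_root = target_tree.getroot()
    count = 0
    for filename in sorted(os.listdir(source_dir)):
        if not filename.endswith(".xml"):
            continue
        source_root = ET.parse(os.path.join(source_dir, filename)).getroot()
        # iter() finds the records wherever they are nested, so the code
        # does not depend on ListRecords being the third child of the root.
        for record in source_root.iter(OAI_RECORD):
            target_root.append(record)
            count += 1
    # A single write at the end persists all appended records.
    target_tree.write(target_path, encoding="UTF-8", xml_declaration=True)
    return count
```

A call such as collect_records('modemuze', 'test.xml') would mirror the paths used in the question, provided test.xml is not itself stored inside the source directory.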

how to get value of an xml element not directly under root

I am trying to parse an XML file and get the value of dir_path as shown below, but I don't get the desired output. What is wrong here, and how can I fix it?
input.xml
<?xml version="1.0" ?>
<data>
<software>
<name>xyz</name>
<role>xyz</role>
<future>unknown</future>
</software>
<software>
<name>abc</name>
<role>abc</role>
<future>clear</future>
<dir_path cmm_root_path_var="COMP_softwareROOT">\\location\software\INR\</dir_path>
<loadit reduced="true">
<RW>yes</RW>
<readonly>R/</readonly>
</loadit>
<upload reduced="true">
</upload>
</software>
<software>
<name>def</name>
<role>def</role>
<future>clear</future>
<dir_path cmm_root_path_var="COMP2_softwareROOT">\\location1\software\INR\</dir_path>
<loadit reduced="true">
<RW>yes</RW>
<readonly>R/</readonly>
</loadit>
<upload reduced="true">
</upload>
</software>
</data>
CODE:-
import xml.etree.ElementTree as ET
tree = ET.parse('input.xml')
root = tree.getroot()
dir_path = root.find(".//dir_path")
print(dir_path.text)
OUTPUT:-
.\
EXPECTED OUTPUT:-
\\location\software\INR\

Try selecting the software element by its name child and then taking its dir_path:
from xml.etree import ElementTree as ET
tree = ET.parse('filename.xml')
item = tree.find('software[name="abc"]/dir_path')
print(item.text if item is not None else None)
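If the path of every software entry is needed rather than a single one, the same lookup can be done in a loop. The snippet below is a rough sketch: the helper name dir_paths and the trimmed sample are illustrative and not part of the original answer.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of input.xml from the question; a raw string keeps the backslashes.
SAMPLE = r"""<data>
  <software><name>xyz</name></software>
  <software>
    <name>abc</name>
    <dir_path cmm_root_path_var="COMP_softwareROOT">\\location\software\INR\</dir_path>
  </software>
  <software>
    <name>def</name>
    <dir_path cmm_root_path_var="COMP2_softwareROOT">\\location1\software\INR\</dir_path>
  </software>
</data>"""

def dir_paths(root):
    """Map each software name to its dir_path text (None when there is no dir_path)."""
    return {sw.findtext("name"): sw.findtext("dir_path")
            for sw in root.iter("software")}

paths = dir_paths(ET.fromstring(SAMPLE))
print(paths["abc"])
```

Using findtext() means entries without a dir_path, such as xyz here, simply map to None.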

python extract xml element value to csv

I'm new to Python, so please bear with me as I try to explain what I am trying to do.
Here is my XML:
<?xml version="1.0"?>
<playlist>
<list>
<txdate>2015-10-30</txdate>
<channel>cake</channel>
<name>Play List</name>
</list>
<eventlist>
<event type="MEDIA">
<title>title1</title>
<starttype>FIX</starttype>
<mediaid>a</mediaid>
<onairtime>2015-10-30T13:30:00:00</onairtime>
<som>00:00:40:03</som>
<duration>01:15:47:15</duration>
<reconcilekey>123</reconcilekey>
<category>PROGRAM</category>
<subtitles>
<cap>CLOSED</cap>
<file>a</file>
<lang>ENG</lang>
<lang>GER</lang>
</subtitles>
</event>
<event type="MEDIA">
<title>THREE DAYS AND A CHILD</title>
<mediaid>b</mediaid>
<onairtime>2015-10-30T14:45:47:15</onairtime>
<som>00:00:00:00</som>
<duration>01:19:41:07</duration>
<reconcilekey>321</reconcilekey>
<category>PROGRAM</category>
<subtitles>
<cap>CLOSED</cap>
<file>b</file>
<lang>ENG</lang>
<lang>GER</lang>
</subtitles>
</event>
</eventlist>
</playlist>
I would like to print all the mediaid values to a file.
This is my code so far:
import os
import xml.etree.ElementTree as ET
tree = ET.parse('data.xml')
root = tree.getroot()
wfile = 'new.csv'
for child in root:
    child.find( "media type" )
    for x in child.iter("mediaid"):
        file = open(wfile, 'a')
        file.write(str(x))
        file.close
I tried this with a few other nonstandard libraries, but I didn't have much success.

For your requirement (as mentioned in the comments):
just the mediaid from each <event type="MEDIA">
you should use the findall() method of ElementTree to get all the event elements with type="MEDIA", and then get the mediaid child element from each of them. Example:
import xml.etree.ElementTree as ET
tree = ET.parse('data.xml')
root = tree.getroot()
with open('new.csv','w') as outfile:
    # [@type="MEDIA"] selects only the event elements whose type attribute is MEDIA
    for elem in root.findall('.//event[@type="MEDIA"]'):
        mediaidelem = elem.find('./mediaid')
        if mediaidelem is not None:
            outfile.write("{}\n".format(mediaidelem.text))
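If more than one column is wanted, the standard csv module can handle the quoting and separators. This is a sketch only: the helper name write_media_csv, the chosen columns, and the shortened sample are assumptions made for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Shortened version of the playlist from the question (illustrative only).
SAMPLE = """<playlist><eventlist>
  <event type="MEDIA"><title>title1</title><mediaid>a</mediaid></event>
  <event type="MEDIA"><title>THREE DAYS AND A CHILD</title><mediaid>b</mediaid></event>
</eventlist></playlist>"""

def write_media_csv(root, out, fields=("mediaid", "title")):
    """Write a header plus one CSV row per MEDIA event; return the number of events."""
    writer = csv.writer(out)
    writer.writerow(fields)
    count = 0
    for event in root.iterfind('.//event[@type="MEDIA"]'):
        # A missing child becomes an empty cell rather than an error.
        writer.writerow([event.findtext(name, default="") for name in fields])
        count += 1
    return count

buf = io.StringIO()   # stands in for open('new.csv', 'w', newline='')
count = write_media_csv(ET.fromstring(SAMPLE), buf)
print(buf.getvalue())
```

When writing to a real file, opening it with newline='' is the usual recommendation for the csv module, so that the writer controls the line endings.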

how can I select all descendants of a certain element with ElementTree in Python 3.3?

This is the sample data.
input.xml
<root>
<entry id="1">
<headword>go</headword>
<example>I <hw>go</hw> to school.</example>
</entry>
</root>
I'd like to put the <example> node and its descendants into an <examplegrp> element. That is,
output.xml
<root>
<entry id="1">
<headword>go</headword>
<examplegrp>
<example>I <hw>go</hw> to school.</example>
</examplegrp>
</entry>
</root>
My poor and incomplete script is:
import codecs
import xml.etree.ElementTree as ET
fin = codecs.open(r'input.xml', 'rb', encoding='utf-8')
data = ET.parse(fin)
root = data.getroot()
example = root.find('.//example')
for elem in example.iter():
    ---and then I don't know what to do---

Here's an example of how it can be done with lxml (a third-party library, not the standard-library ElementTree):
text = """
<root>
<entry id="1">
<headword>go</headword>
<example>I <hw>go</hw> to school.</example>
</entry>
</root>
"""
import io
import lxml.etree
data = lxml.etree.parse(io.StringIO(text))
root = data.getroot()
for entry in root.xpath('//example/ancestor::entry[1]'):
    examplegrp = lxml.etree.SubElement(entry,"examplegrp")
    nodes = [node for node in entry.xpath('./example')]
    for node in nodes:
        entry.remove(node)
        examplegrp.append(node)
# tostring() returns bytes on Python 3, so decode before printing
print(lxml.etree.tostring(root, pretty_print=True).decode())
which will output:
<root>
<entry id="1">
<headword>go</headword>
<examplegrp><example>I <hw>go</hw> to school.</example>
</examplegrp></entry>
</root>
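Since the question asks about ElementTree specifically, roughly the same regrouping can be sketched with the standard library alone. The helper name wrap_children is made up for this example, and the approach (insert the wrapper at the position of the first matching child) is one possible choice, not the method used in the answer above.

```python
import xml.etree.ElementTree as ET

TEXT = ('<root><entry id="1"><headword>go</headword>'
        '<example>I <hw>go</hw> to school.</example></entry></root>')

def wrap_children(parent, child_tag, wrapper_tag):
    """Move all direct children named child_tag into a new wrapper element."""
    children = parent.findall(child_tag)
    if not children:
        return None
    # Remember where the first child sat so the wrapper takes its place.
    index = list(parent).index(children[0])
    wrapper = ET.Element(wrapper_tag)
    for child in children:
        parent.remove(child)      # detach from the old parent...
        wrapper.append(child)     # ...and attach, with its descendants, to the wrapper
    parent.insert(index, wrapper)
    return wrapper

root = ET.fromstring(TEXT)
for entry in root.findall(".//entry"):
    wrap_children(entry, "example", "examplegrp")
result = ET.tostring(root, encoding="unicode")
print(result)
```

Moving an element this way carries its whole subtree along, so there is no need to iterate over the descendants of example individually.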

http://docs.python.org/3/library/xml.dom.html?highlight=xml#node-objects
http://docs.python.org/3/library/xml.dom.html?highlight=xml#document-objects
You probably want to follow the pattern of creating a new element from the Document and appending each result to it:
group = document.createElement(tagName)
for found in founds:
    group.appendChild(found)
Or something like this.
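For completeness, here is how that DOM pattern might look with xml.dom.minidom from the standard library. It is only a sketch: the variable names and the inline sample are illustrative, and replacing each example node with its own wrapper is an assumption about what the asker wants.

```python
from xml.dom import minidom

TEXT = ('<root><entry id="1"><headword>go</headword>'
        '<example>I <hw>go</hw> to school.</example></entry></root>')

doc = minidom.parseString(TEXT)
# Copy the node list first so that restructuring the tree cannot affect the loop.
for example in list(doc.getElementsByTagName("example")):
    group = doc.createElement("examplegrp")
    # Put the wrapper where the example was, then move the example inside it.
    example.parentNode.replaceChild(group, example)
    group.appendChild(example)

result = doc.documentElement.toxml()
print(result)
```

As with ElementTree, appendChild() moves the node together with all of its descendants.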
