I'm trying to remove tags from an Xml.Alto file with remove.
My Alto file looks like this:
<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.loc.gov/standards/alto/ns-v4# http://www.loc.gov/standards/alto/v4/alto-4-2.xsd"> <Description>
<MeasurementUnit>pixel</MeasurementUnit>
<sourceImageInformation>
<fileName>filename</fileName>
</sourceImageInformation>
</Description>
<Layout>
<Page>
<PrintSpace>
<TextBlock>
<Shape><Polygon/></Shape>
<TextLine>
<Shape><Polygon/></Shape>
<String CONTENT="ABCDEF" HPOS="1234" VPOS="1234" WIDTH="1234" HEIGHT="1234" />
</TextLine>
</TextBlock>
</PrintSpace>
</Page>
</Layout>
</alto>
AND my code is :
import xml.etree.ElementTree as ET
tree = ET.parse("file.xml")
root = tree.getroot()
ns = {'alto': 'http://www.loc.gov/standards/alto/ns-v4#'}
ET.register_namespace("", "http://www.loc.gov/standards/alto/ns-v4#")
for Test in root.findall('.//alto:TextBlock', ns):
root.remove(Test)
tree.write('out.xml', encoding="UTF-8", xml_declaration=True)
Here is the error I get:
ValueError: list.remove(x): x not in list
Thanks a lot for your help š
ElementFather.remove(ElementChild) works only if the ElementChild is a sub-element of ElementFather. In your case, you have to call remove from PrintSpace.
import xml.etree.ElementTree as ET
tree = ET.parse("file.xml")
root = tree.getroot()
ns = {'alto': 'http://www.loc.gov/standards/alto/ns-v4#'}
ET.register_namespace("", "http://www.loc.gov/standards/alto/ns-v4#")
for Test in root.findall('.//alto:TextBlock', ns):
PrintSpace = root.find('.//alto:PrintSpace',ns)
PrintSpace.remove(Test)
tree.write('out.xml', encoding="UTF-8", xml_declaration=True)
Note: This code is only an example of a working solution, for sure you can improve it.
Related
I have an XML file like this i need to insert this data to PostgreSQL DB.Below is the sample XML and the code which i use ,but i'm not getting any output,Can someone please guide on how to effectively fetch these XML values.
<?xml version="1.0" encoding="utf-8"?>
<rss xmlns:g="http://base.google.com/ns/1.0" version="2.0" encoding="utf-8">
<config>
<g:system>Magento</g:system>
<g:extension>Magmodules_Googleshopping</g:extension>
<g:extension_version>1.6.8</g:extension_version>
<g:store>emb</g:store>
<g:url>https://www.xxxxx.com/</g:url>
<g:products>1320</g:products>
<g:generated>2020-06-11 11:18:32</g:generated>
<g:processing_time>17.5007</g:processing_time>
</config>
<channel>
<item>
<g:id>20</g:id>
<g:title>product 1</g:title>
<g:description>description about product 1</g:description>
<g:gtin>42662</g:gtin>
<g:brand>company</g:brand>
<g:mpn>0014</g:mpn>
<g:link>link.html</g:link>
<g:image_link>link/c/a/cat_21_16.jpg</g:image_link>
<g:availability>in stock</g:availability>
<g:condition>new</g:condition>
<g:price>9</g:price>
<g:shipping>
<g:country>UAE</g:country>
<g:service>DHL</g:service>
<g:price>2.90</g:price>
</g:shipping>
</item>
<item>
.
.
.
</item>
Below is the script which i use,
Python : 3.5 Postgres version 11
# import modules
import sys
import psycopg2
import datetime
now = datetime.datetime.now()
# current data and time
dt = now.strftime("%Y%m%dT%H%M%S")
# xml tree access
#from xml.etree import ElementTree
import xml.etree.ElementTree as ET
# incremental variable
x = 0
with open('/Users/admin/documents/shopping.xml', 'rt',encoding="utf8") as f:
#tree = ElementTree.parse(f)
tree = ET.parse(f)
# connection to postgreSQL database
try:
conn=psycopg2.connect(host='localhost', database='postgres',
user='postgres', password='postgres',port='5432')
except:
print ("Hey I am unable to connect to the database.")
cur = conn.cursor()
# access the xml tree element nodes
try:
for node in tree.findall('.//item'):
src = node.find('id')
tgt = node.find('mpn')
print(node)
except:
print ("Oops I can't insert record into database table!")
conn.commit()
conn.close()
The current output i'm getting is like,
None
None
None
Expected Output,
id title description gtin ......
20 product 1 g:description xxxx .....
Strange is that you can't find item. It seems you use wrong file and it doesn't have item.
Using your XML data as string and ET.fromstring() I have no problem to get item.
Maybe check print( f.read() ) to see what you really read from file.
Problem is only id, tgt which use namespace - g: - and it need something more then only g:id, g:tgt
tree = ET.fromstring(xml)
ns = {'g': "http://base.google.com/ns/1.0"}
for node in tree.findall('.//item'):
src = node.find('g:id', ns)
tgt = node.find('g:mpn', ns)
print('Node:', node)
print('src:', src.text)
print('tgt:', tgt.text)
or use directly as '{http://base.google.com/ns/1.0}id' '{http://base.google.com/ns/1.0}mpn'
tree = ET.fromstring(xml)
for node in tree.findall('.//item'):
src = node.find('{http://base.google.com/ns/1.0}id')
tgt = node.find('{http://base.google.com/ns/1.0}mpn')
print('Node:', node)
print('src:', src.text)
print('tgt:', tgt.text)
Minimal working code:
import xml.etree.ElementTree as ET
xml = '''<?xml version="1.0" encoding="utf-8"?>
<rss xmlns:g="http://base.google.com/ns/1.0" version="2.0" encoding="utf-8">
<config>
<g:system>Magento</g:system>
<g:extension>Magmodules_Googleshopping</g:extension>
<g:extension_version>1.6.8</g:extension_version>
<g:store>emb</g:store>
<g:url>https://www.xxxxx.com/</g:url>
<g:products>1320</g:products>
<g:generated>2020-06-11 11:18:32</g:generated>
<g:processing_time>17.5007</g:processing_time>
</config>
<channel>
<item>
<g:id>20</g:id>
<g:title>product 1</g:title>
<g:description>description about product 1</g:description>
<g:gtin>42662</g:gtin>
<g:brand>company</g:brand>
<g:mpn>0014</g:mpn>
<g:link>link.html</g:link>
<g:image_link>link/c/a/cat_21_16.jpg</g:image_link>
<g:availability>in stock</g:availability>
<g:condition>new</g:condition>
<g:price>9</g:price>
<g:shipping>
<g:country>UAE</g:country>
<g:service>DHL</g:service>
<g:price>2.90</g:price>
</g:shipping>
</item>
</channel>
</rss>
'''
tree = ET.fromstring(xml)
ns = {'g': "http://base.google.com/ns/1.0"}
for node in tree.findall('.//item'):
src = node.find('g:id', ns)
tgt = node.find('g:mpn', ns)
print('Node:', node)
print('src:', src.text)
print('tgt:', tgt.text)
Result:
Node: <Element 'item' at 0x7f74ba45b710>
src: 20
tgt: 0014
BTW: It works even when I use io.StringIO to simulate file
f = io.StringIO(xml)
tree = ET.parse(f)
Minimal working code:
import xml.etree.ElementTree as ET
import io
xml = '''<?xml version="1.0" encoding="utf-8"?>
<rss xmlns:g="http://base.google.com/ns/1.0" version="2.0" encoding="utf-8">
<config>
<g:system>Magento</g:system>
<g:extension>Magmodules_Googleshopping</g:extension>
<g:extension_version>1.6.8</g:extension_version>
<g:store>emb</g:store>
<g:url>https://www.xxxxx.com/</g:url>
<g:products>1320</g:products>
<g:generated>2020-06-11 11:18:32</g:generated>
<g:processing_time>17.5007</g:processing_time>
</config>
<channel>
<item>
<g:id>20</g:id>
<g:title>product 1</g:title>
<g:description>description about product 1</g:description>
<g:gtin>42662</g:gtin>
<g:brand>company</g:brand>
<g:mpn>0014</g:mpn>
<g:link>link.html</g:link>
<g:image_link>link/c/a/cat_21_16.jpg</g:image_link>
<g:availability>in stock</g:availability>
<g:condition>new</g:condition>
<g:price>9</g:price>
<g:shipping>
<g:country>UAE</g:country>
<g:service>DHL</g:service>
<g:price>2.90</g:price>
</g:shipping>
</item>
</channel>
</rss>
'''
f = io.StringIO(xml)
tree = ET.parse(f)
ns = {'g': "http://base.google.com/ns/1.0"}
for node in tree.findall('.//item'):
src = node.find('{http://base.google.com/ns/1.0}id')
tgt = node.find('{http://base.google.com/ns/1.0}mpn')
print('Node:', node)
print('src:', src.text)
print('mpn:', tgt.text)
I have the following code that loops over a set of records and moves each record to a new file:
import os
import xml.etree.cElementTree as ET
for filename in os.listdir('modemuze'):
if filename.endswith('.xml'):
original_tree = ET.ElementTree(file='modemuze/'+filename)
root = original_tree.getroot()
for child in root[2]:
if child.tag == "{http://www.openarchives.org/OAI/2.0/}record":
new_tree = ET.ElementTree(file='test.xml')
new_root = new_tree.getroot()
new_root.append(child)
the file which contains the records i want to move has to following structure:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<OAI-PMH xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd" xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<responseDate>2018-02-14T12:15:11.241+02:00</responseDate>
<request verb="ListRecords">147.102.11.65:9000/fashionedmfp/oai</request>
<ListRecords>
<record>
<header>
<identifier>oai:fashionedmfp:8c549bcd078e2ce84a265d318547c5f8e8bf0cd0</identifier>
<datestamp>2016-06-27</datestamp>
</header>
<metadata>
<rdf:RDF xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:edm="http://www.europeana.eu/schemas/edm/" xmlns:foaf="http://xmlns.com/foaf/0.1/" xmlns:edmfp="http://www.europeanafashion.eu/edmfp/" xmlns:rdaGr2="http://rdvocab.info/ElementsGr2/" xmlns:ore="http://www.openarchives.org/ore/terms/" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#" xmlns:mrel="http://id.loc.gov/vocabulary/relators/" xmlns:owl="http://www.w3.org/2002/07/owl#" xmlns:wgs84="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:crm="http://www.cidoc-crm.org/rdfs/cidoc_crm_v5.0.2_english_label.rdfs#" xmlns:gr="http://www.heppnetz.de/ontologies/goodrelations/v1#" xmlns:xalan="http://xml.apache.org/xalan" xmlns:skos="http://www.w3.org/2004/02/skos/core#">
<edm:ProvidedCHO rdf:about="localID/europeana-fashion/10918x1y95563">
<dcterms:created>circa 1855-1874</dcterms:created>
<dcterms:issued>1875</dcterms:issued>
<dcterms:issued>1875</dcterms:issued>
<dcterms:medium>katoen</dcterms:medium>
<dcterms:provenance>schenking</dcterms:provenance>
<dcterms:provenance>1958</dcterms:provenance>
<dcterms:spatial>
<edm:Place>
<skos:prefLabel>Nederland</skos:prefLabel>
</edm:Place>
</dcterms:spatial>
<dcterms:creator>
<edm:Agent/>
</dcterms:creator>
<dc:creator>
<edm:Agent/>
</dc:creator>
<dc:description>Jurk (kind) van wit piquƩ, gegarneerd met broderie en soustache</dc:description>
<dc:identifier>0061211</dc:identifier>
<dc:title>jurken</dc:title>
<dc:title>kinderkleding</dc:title>
<edm:type>IMAGE</edm:type>
<edmfp:localType>jurken</edmfp:localType>
<edmfp:localType>kinderkleding</edmfp:localType>
<edmfp:technique>piquƩ-weefsel</edmfp:technique>
</edm:ProvidedCHO>
<ore:Aggregation rdf:about="localID/europeana-fashion/Aggregation_10918x1y95563">
<edm:aggregatedCHO rdf:resource="localID/europeana-fashion/10918x1y95563"/>
<edm:dataProvider>
<edm:Agent>
<skos:prefLabel>Gemeentemuseum Den Haag</skos:prefLabel>
</edm:Agent>
</edm:dataProvider>
<edm:hasView>
<edm:WebResource rdf:about="http://images.gemeentemuseum.nl/br/0061211.jpg"/>
</edm:hasView>
<edm:isShownBy>
<edm:WebResource rdf:about="http://images.gemeentemuseum.nl/br/0061211.jpg"/>
</edm:isShownBy>
<edm:provider>
<edm:Agent rdf:about="http://www.europeanafashion.eu/">
<skos:prefLabel>Europeana Fashion</skos:prefLabel>
</edm:Agent>
</edm:provider>
<edm:rights rdf:resource="http://www.europeana.eu/rights/rr-f/"/>
</ore:Aggregation>
</rdf:RDF>
</metadata>
</record>
</ListRecords>
</OAI-PMH>
and the file i want to place the records in has the following structure:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<ListRecords>
</ListRecords>
but when i run this code my test.xml file contains no records..
Why is this happening?
I am totally new to XML and python so if i missed something or something is unclear please let me know so i can clarify or add it.
I am doing this is python 2.7
All help and suggestions are very much appreciated!
You need to write changes to the file.
Add to the code new_tree.write('test.xml')
As a result, the code should look like this
import os
import xml.etree.cElementTree as ET
for filename in os.listdir('modemuze'):
if filename.endswith('.xml'):
original_tree = ET.ElementTree(file='modemuze/'+filename)
root = original_tree.getroot()
for child in root[2]:
if child.tag == "{http://www.openarchives.org/OAI/2.0/}record":
new_tree = ET.ElementTree(file='test.xml')
new_root = new_tree.getroot()
new_root.append(child)
new_tree.write('test.xml')
I am trying to parse an xml and get the value of dir_path as below,however I dont seem to get the desired output,whats wrong here and how to fix it?
input.xml
<?xml version="1.0" ?>
<data>
<software>
<name>xyz</name>
<role>xyz</role>
<future>unknown</future>
</software>
<software>
<name>abc</name>
<role>abc</role>
<future>clear</future>
<dir_path cmm_root_path_var="COMP_softwareROOT">\\location\software\INR\</dir_path>
<loadit reduced="true">
<RW>yes</RW>
<readonly>R/</readonly>
</loadit>
<upload reduced="true">
</upload>
</software>
<software>
<name>def</name>
<role>def</role>
<future>clear</future>
<dir_path cmm_root_path_var="COMP2_softwareROOT">\\location1\software\INR\</dir_path>
<loadit reduced="true">
<RW>yes</RW>
<readonly>R/</readonly>
</loadit>
<upload reduced="true">
</upload>
</software>
</data>
CODE:-
tree = ET.parse(input.xml)
root = tree.getroot()
dir_path = root.find(".//dir_path")
print dir_path.text
OUTPUT:-
.\
EXPECTED OUTPUT:-
\\location\software\INR\
Try the following:
from xml.etree import ElementTree as ET
tree = ET.parse('filename.xml')
item = tree.find('software/[name="abc"]/dir_path')
print(item.text if item is not None else None)
I'm a new to python so please bear with me as I try to explain what I am trying to do
here is my xml
<?xml version="1.0"?>
<playlist>
<list>
<txdate>2015-10-30</txdate>
<channel>cake</channel>
<name>Play List</name>
</list>
<eventlist>
<event type="MEDIA">
<title>title1</title>
<starttype>FIX</starttype>
<mediaid>a</mediaid>
<onairtime>2015-10-30T13:30:00:00</onairtime>
<som>00:00:40:03</som>
<duration>01:15:47:15</duration>
<reconcilekey>123</reconcilekey>
<category>PROGRAM</category>
<subtitles>
<cap>CLOSED</cap>
<file>a</file>
<lang>ENG</lang>
<lang>GER</lang>
</subtitles>
</event>
<event type="MEDIA">
<title>THREE DAYS AND A CHILD</title>
<mediaid>b</mediaid>
<onairtime>2015-10-30T14:45:47:15</onairtime>
<som>00:00:00:00</som>
<duration>01:19:41:07</duration>
<reconcilekey>321</reconcilekey>
<category>PROGRAM</category>
<subtitles>
<cap>CLOSED</cap>
<file>b</file>
<lang>ENG</lang>
<lang>GER</lang>
</subtitles>
</event>
</eventlist>
</playlist>
I would like to print all the mediaid values to a file
this is my code so far
import os
import xml.etree.ElementTree as ET
tree = ET.parse('data.xml')
root = tree.getroot()
wfile = 'new.csv'
for child in root:
child.find( "media type" )
for x in child.iter("mediaid"):
file = open(wfile, 'a')
file.write(str(x))
file.close
I tried this with a few other nonstandard libraries but I didn't have much success
For your requirement (as mentioned in the comments) -
just the mediaid from each <event type="MEDIA">
You should use findall() method of ElementTree to get all the event elements with type="MEDIA" , and then get the child mediaid element from it. Example -
import xml.etree.ElementTree as ET
tree = ET.parse('data.xml')
root = tree.getroot()
with open('new.csv','w') as outfile:
for elem in root.findall('.//event[#type="MEDIA"]'):
mediaidelem = elem.find('./mediaid')
if mediaidelem is not None:
outfile.write("{}\n".format(mediaidelem.text))
This is the sample data.
input.xml
<root>
<entry id="1">
<headword>go</headword>
<example>I <hw>go</hw> to school.</example>
</entry>
</root>
I'd like to put node and its descendants into . That is,
output.xml
<root>
<entry id="1">
<headword>go</headword>
<examplegrp>
<example>I <hw>go</hw> to school.</example>
</examplegrp>
</entry>
</root>
My poor and incomplete script is:
import codecs
import xml.etree.ElementTree as ET
fin = codecs.open(r'input.xml', 'rb', encoding='utf-8')
data = ET.parse(fin)
root = data.getroot()
example = root.find('.//example')
for elem in example.iter():
---and then I don't know what to do---
Here's an example of how it can be done:
text = """
<root>
<entry id="1">
<headword>go</headword>
<example>I <hw>go</hw> to school.</example>
</entry>
</root>
"""
import lxml.etree
import StringIO
data = lxml.etree.parse(StringIO.StringIO(text))
root = data.getroot()
for entry in root.xpath('//example/ancestor::entry[1]'):
examplegrp = lxml.etree.SubElement(entry,"examplegrp")
nodes = [node for node in entry.xpath('./example')]
for node in nodes:
entry.remove(node)
examplegrp.append(node)
print lxml.etree.tostring(root,pretty_print=True)
which will output:
<root>
<entry id="1">
<headword>go</headword>
<examplegrp><example>I <hw>go</hw> to school.</example>
</examplegrp></entry>
</root>
http://docs.python.org/3/library/xml.dom.html?highlight=xml#node-objects
http://docs.python.org/3/library/xml.dom.html?highlight=xml#document-objects
You probably want to follow some paradigm of creating a Document Element and appending reach result to it.
group = Document.createElement(tagName)
for found in founds:
group.appendNode(found)
Or something like this