I have the following code:
from xml.etree import ElementTree
file_path = 'some_file_path'
document = ElementTree.parse(file_path, ElementTree.XMLParser(encoding='utf-8'))
If my XML looks like the following it gives me the error: "xml.etree.ElementTree.ParseError: not well-formed"
<?xml version="1.0" encoding="utf-8" ?>
<pages>
<page id="1">
<textbox id="0">
<textline bbox="53.999,778.980,130.925,789.888">
<text font="GCCBBY+TT228t00" bbox="60.598,778.980,64.594,789.888" size="10.908">H</text>
<text font="GCCBBY+TT228t00" bbox="64.558,778.980,70.558,789.888" size="10.908">-</text>
<text>
</text>
</textline>
</textbox>
</page>
</pages>
In Sublime Text or Notepad++ I see highlighted characters such as ACK, DC4, or STX, which seem to be the culprit (one of them appears as a "-" in the above XML, in the second "text" node). If I remove these characters it works. What are these and how can I fix this?
I ran your code as follows, and it works fine:
from xml.etree import ElementTree
from io import StringIO  # Python 3; the separate StringIO module exists only in Python 2
xml_content = """<?xml version="1.0" encoding="utf-8" ?>
<pages>
<page id="1">
<textbox id="0">
<textline bbox="53.999,778.980,130.925,789.888">
<text font="GCCBBY+TT228t00" bbox="60.598,778.980,64.594,789.888" size="10.908">H</text>
<text font="GCCBBY+TT228t00" bbox="64.558,778.980,70.558,789.888" size="10.908">-</text>
<text>
</text>
</textline>
</textbox>
</page>
</pages>"""
print("parsing xml document")
# using StringIO to simulate reading from file
document = ElementTree.parse(StringIO(xml_content), ElementTree.XMLParser(encoding='utf-8'))
for elem in document.iter():
    print(elem.tag)
And the output is as expected:
parsing xml document
pages
page
textbox
textline
text
text
text
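So the parsing code itself is fine, and the issue lies in the file's contents. The XML pasted here no longer contains the special characters your editor highlights; they may have been introduced when the file was copied or edited in Notepad++, so try opening it in another editor.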
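For context, ACK, DC4 and STX are ASCII control characters, and XML 1.0 allows only tab, line feed and carriage return below U+0020, so a conforming parser rejects them whichever editor is used. Here is a minimal sketch of stripping them before parsing, assuming the control characters carry no meaning and can simply be dropped. The helper name `strip_illegal_xml_chars` and the sample string are made up for illustration.

```python
import re
from xml.etree import ElementTree

# XML 1.0 allows only tab, LF and CR below U+0020; the rest of that range is illegal.
_ILLEGAL_XML_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def strip_illegal_xml_chars(text):
    """Hypothetical helper: drop control characters that XML 1.0 forbids."""
    return _ILLEGAL_XML_CHARS.sub("", text)


# Illustrative input: \x06 is the ACK control character.
raw = "<pages><text>H</text><text>\x06</text></pages>"
root = ElementTree.fromstring(strip_illegal_xml_chars(raw))
texts = [elem.text for elem in root.iter("text")]
print(texts)
```

For a real file, the same idea would be to read it as text, clean the string, and pass the result to `ElementTree.fromstring()` instead of calling `ElementTree.parse()` on the path.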
Related
I have an XML file as below:
<?xml version="1.0" encoding="UTF-8"?>
<data>
<text>
I have <num1>two</num1> apples and <num2>four</num2> mangoes
</text>
</data>
I want to parse the file and get the whole context of text and its children elements and assign it to variable sentence:
sentence = "I have two apples and four mangoes"
How can I do that using Python ElementTree?
xml = """
<?xml version="1.0" encoding="UTF-8"?>
<data>
<text>
I have <num1>two</num1> apples and <num2>four</num2> mangoes
</text>
</data>
"""
from xml.etree import ElementTree as ET
x_data = ET.fromstring(xml.strip())
all_text = list(x_data.findall(".//text")[0].itertext())
print(" ".join([text.strip() for text in all_text]))
Iterate over the text of the parent node with itertext(), which also yields the text of its child elements, and then process each fragment as needed.
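A slight variation, in case it is useful: joining the fragments first and normalizing whitespace afterwards avoids inserting a space where a child element sits in the middle of a word. This is only a sketch on the sample data from the question, not the one correct way to do it.

```python
from xml.etree import ElementTree as ET

xml = """<data>
<text>
I have <num1>two</num1> apples and <num2>four</num2> mangoes
</text>
</data>"""

text_elem = ET.fromstring(xml).find("text")
# itertext() yields the element's own text, then each descendant's text
# and tail, in document order.
joined = "".join(text_elem.itertext())
# split() with no arguments collapses every run of whitespace, including newlines.
sentence = " ".join(joined.split())
print(sentence)
```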
I am trying to use this code
s = """<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<gdml xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="http://service-spi.web.cern.ch/service-spi/app/releases/GDML/schema/gdml.xsd">"""
doctype = """<!DOCTYPE doc [
<!ENTITY ent SYSTEM "another_doc.xml">
]>"""
gdml = ET.fromstring(s.encode("UTF-8"))
But I get XMLSyntaxError: EndTag
Eventually I want to add the doctype as well.
The string you pass to fromstring() needs to be well-formed XML. Since you only have a start tag, it's not well-formed.
This is easily resolved by making the tag self-closing or by adding an end tag.
Also, if you want the XML declaration (<?xml ...?>) included in your serialized XML, you can add it to doctype.
Example...
from lxml import etree
s = """<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<gdml xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:noNamespaceSchemaLocation="http://service-spi.web.cern.ch/service-spi/app/releases/GDML/schema/gdml.xsd"/>"""
doctype = """<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE gdml [
<!ENTITY ent SYSTEM "another_doc.xml">
]>"""
gdml = etree.fromstring(s.encode("UTF-8"))
print(etree.tostring(gdml, doctype=doctype).decode())
This prints...
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE gdml [
<!ENTITY ent SYSTEM "another_doc.xml">
]>
<gdml xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="http://service-spi.web.cern.ch/service-spi/app/releases/GDML/schema/gdml.xsd"/>
I have an XML element, that looks like this:
<page>
<textline id="1">
<text>C</text>
<text>o</text>
<text>n</text>
<text>t</text>
<text>a</text>
<text>k</text>
<text>t</text>
</textline>
<textline id="2">
<text>
</text>
</textline>
<textline id="3">
<text>M</text>
<text>e</text>
</textline>
</page>
I am trying to get all the <textline> tags only:
with open(path_to_xml_file) as xml_file:
    parsed_xml = BeautifulSoup(xml_file, 'xml')
text_lines = parsed_xml.find_all("textline")
However, text_lines includes all children of <textline> - which means it includes all the <text></text> tags.
I can't seem to find anything in the documentation that explains how to only select the actual tag (and not any children, sub children etc.).
I found the recursive=False option, which should only select direct children, so I thought I could apply this to the page tag:
text_lines = parsed_xml.find_all("page", recursive=False)
But that returns an empty list: []
Expected result:
<textline id="1"></textline>
<textline id="2"></textline>
<textline id="3"></textline>
You can set each tag's string to '', which replaces its contents (including the child tags) with an empty string.
Example:
xml = """<page>
<textline id="1">
<text>C</text>
<text>o</text>
<text>n</text>
<text>t</text>
<text>a</text>
<text>k</text>
<text>t</text>
</textline>
<textline id="2">
<text>
</text>
</textline>
<textline id="3">
<text>M</text>
<text>e</text>
</textline>
</page>"""
from bs4 import BeautifulSoup
parsed_xml = BeautifulSoup(xml, 'xml')
text_lines = []
for tag in parsed_xml.find_all("textline"):
    tag.string = ''
    text_lines.append(tag)
print(text_lines)
Output:
[<textline id="1"></textline>,
<textline id="2"></textline>,
<textline id="3"></textline>]
You can use the clear() method to remove all the inner <text> tags from the <textline> tags.
One more thing: you can't pass a file name to BeautifulSoup. You have to open the file and pass its contents (or the open file object). Here I kept the XML content in a variable.
myxml = """<page>
<textline id="1">
<text>C</text>
<text>o</text>
<text>n</text>
<text>t</text>
<text>a</text>
<text>k</text>
<text>t</text>
</textline>
<textline id="2">
<text>
</text>
</textline>
<textline id="3">
<text>M</text>
<text>e</text>
</textline>
</page>"""
from bs4 import BeautifulSoup
parsed_xml = BeautifulSoup(myxml, 'xml')
text_lines = parsed_xml.find_all("textline")
for tl in text_lines:
    tl.clear()  # removes every child of the tag in place
print(text_lines)
Output:
[<textline id="1"/>, <textline id="2"/>, <textline id="3"/>]
I know I originally tagged this question with beautifulsoup, but I just wanted to share what I actually ended up using. The solution from @Rakesh does work with BeautifulSoup.
I actually ended up using Python's built-in XML parser:
import xml.etree.ElementTree as ET
tree = ET.parse(path_to_xml_file)
root = tree.getroot()
for textline in root.iter('textline'):
    print(textline)
I think this is a much cleaner solution, so hopefully this can help anyone coming across this post.
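Note that printing an ElementTree element shows only its repr (something like `<Element 'textline' at 0x...>`), not markup. If the goal is the expected result from the question, one possible sketch is to build an empty copy of each `<textline>` and serialize that. The shortened sample XML and the variable names below are illustrative.

```python
import xml.etree.ElementTree as ET

xml = """<page>
<textline id="1"><text>C</text><text>o</text></textline>
<textline id="2"><text>
</text></textline>
<textline id="3"><text>M</text><text>e</text></textline>
</page>"""

root = ET.fromstring(xml)
shallow = []
for textline in root.iter("textline"):
    # New element with the same tag and attributes, but no children or text.
    empty = ET.Element(textline.tag, textline.attrib)
    # short_empty_elements=False writes <tag></tag> instead of <tag />.
    shallow.append(
        ET.tostring(empty, encoding="unicode", short_empty_elements=False)
    )

print(shallow)
```

Unlike `clear()` or setting `string = ''`, this leaves the parsed tree untouched, which may matter if the children are needed later.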
I have an xml file which looks like the example below.
Many text values start with a space, have \n (newline) at the beginning or end, or contain other crazy stuff. I'm working with xml.etree.ElementTree, and it parses this file fine.
But I want more! :) I tried to prettify this mess, but without success. I tried many tutorials, but it never ends with pretty XML.
<?xml version="1.0"?>
<import>
<article>
<name> Name with space
</name>
<source> Daily Telegraph
</source>
<number>72/2015
</number>
<page>10
</page>
<date>2015-03-26
</date>
<author> Tomas First
</author>
<description>Economy
</description>
<attachment>
</attachment>
<region>
</region>
<text>
My text is here
</text>
</article>
<article>
<name> How to parse
</name>
<source> Internet article
</source>
<number>72/2015
</number>
<page>1
</page>
<date>2015-03-26
</date>
<author>Some author
</author>
<description> description
</description>
<attachment>
</attachment>
<region>
</region>
<text>
My text here
</text>
</article>
</import>
When I tried other answers from SO, they generated the same file or even messier XML.
bs4 can do it:
from bs4 import BeautifulSoup
doc = BeautifulSoup(xmlstring, 'xml')
print(doc.prettify())
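Since the question mentions xml.etree.ElementTree, here is a standard-library sketch as an alternative to bs4. It assumes Python 3.9+ (for `ET.indent`) and that the document has no mixed content, so whitespace around text and between elements can be discarded. The shortened sample document is illustrative.

```python
import xml.etree.ElementTree as ET

xmlstring = """<import>
<article>
<name> Name with space
</name>
<page>10
</page>
<attachment>
</attachment>
</article>
</import>"""

root = ET.fromstring(xmlstring)
for elem in root.iter():
    # Trim stray spaces and newlines around real text; drop whitespace-only text.
    stripped = elem.text.strip() if elem.text else ""
    elem.text = stripped or None
    # Assumption: tails hold only layout whitespace, so they can be discarded.
    elem.tail = None

ET.indent(root, space="  ")  # re-creates the layout whitespace (Python 3.9+)
pretty = ET.tostring(root, encoding="unicode")
print(pretty)
```

If some elements do carry meaningful text after a child element, the `elem.tail = None` line would destroy it, so that step would need to be more careful.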
I am trying to remove an element in an xml which contains a namespace.
Here is my code:
templateXml = """<?xml version="1.0" encoding="UTF-8"?>
<Metadata xmlns="http://www.amazon.com/UnboxMetadata/v1">
<Movie>
<CountryOfOrigin>US</CountryOfOrigin>
<TitleInfo>
<Title locale="en-GB">The Title</Title>
<Actor>
<ActorName locale="en-GB">XXX</ActorName>
<Character locale="en-GB">XXX</Character>
</Actor>
</TitleInfo>
</Movie>
</Metadata>"""
from lxml import etree
tree = etree.fromstring(templateXml.encode("utf-8"))  # on Python 3, lxml rejects str input that carries an encoding declaration
namespaces = {'ns':'http://www.amazon.com/UnboxMetadata/v1'}
for checkActor in tree.xpath('//ns:Actor', namespaces=namespaces):
    etree.strip_elements(tree, 'ns:Actor')
In my actual XML I have lots of tags, so I am trying to search for the Actor tags which contain XXX and completely remove each such tag and its contents. But it's not working.
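Use the remove() method on the element's parent:
for checkActor in tree.xpath('//ns:Actor', namespaces=namespaces):
    checkActor.getparent().remove(checkActor)
# tostring() returns bytes, so decode before printing on Python 3
print(etree.tostring(tree, pretty_print=True, xml_declaration=True).decode())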
prints:
<?xml version='1.0' encoding='ASCII'?>
<Metadata xmlns="http://www.amazon.com/UnboxMetadata/v1">
<Movie>
<CountryOfOrigin>US</CountryOfOrigin>
<TitleInfo>
<Title locale="en-GB">The Title</Title>
</TitleInfo>
</Movie>
</Metadata>
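The loop above removes every Actor element. To remove only the ones that contain XXX, as the question asks, a filter is needed. Below is a rough sketch using the standard library's ElementTree rather than lxml (so there is no `getparent()`, and the parent is found by walking the tree). The second actor, "Jane Doe", is made-up sample data added so there is something left to keep.

```python
import xml.etree.ElementTree as ET

NS_URI = "http://www.amazon.com/UnboxMetadata/v1"
ns = {"ns": NS_URI}

# Illustrative data: the second Actor is invented for the example.
template_xml = """<Metadata xmlns="http://www.amazon.com/UnboxMetadata/v1">
<Movie>
<TitleInfo>
<Title locale="en-GB">The Title</Title>
<Actor><ActorName locale="en-GB">XXX</ActorName><Character locale="en-GB">XXX</Character></Actor>
<Actor><ActorName locale="en-GB">Jane Doe</ActorName><Character locale="en-GB">Lead</Character></Actor>
</TitleInfo>
</Movie>
</Metadata>"""

root = ET.fromstring(template_xml)

# Snapshot the elements first so removals do not disturb the iteration.
for parent in list(root.iter()):
    for actor in parent.findall("ns:Actor", ns):
        # Remove the actor only if a direct child holds the placeholder text.
        if any(child.text == "XXX" for child in actor):
            parent.remove(actor)

remaining = [
    actor.findtext("ns:ActorName", namespaces=ns)
    for actor in root.iter("{%s}Actor" % NS_URI)
]
print(remaining)
```

With lxml, the same filter could probably be pushed into the XPath expression, for example `//ns:Actor[ns:ActorName='XXX']`, while keeping the `getparent().remove()` call from the answer.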