I have an XML element that looks like this:
<page>
<textline id="1">
<text>C</text>
<text>o</text>
<text>n</text>
<text>t</text>
<text>a</text>
<text>k</text>
<text>t</text>
</textline>
<textline id="2">
<text>
</text>
</textline>
<textline id="3">
<text>M</text>
<text>e</text>
</textline>
</page>
I am trying to get all the <textline> tags only:
with open(path_to_xml_file) as xml_file:
    parsed_xml = BeautifulSoup(xml_file, 'xml')
text_lines = parsed_xml.find_all("textline")
However, text_lines includes all children of <textline> - which means it includes all the <text></text> tags.
I can't seem to find anything in the documentation that explains how to only select the actual tag (and not any children, sub children etc.).
I found the recursive=False option, which should only select direct children, so I thought I could apply this to the page tag:
text_lines = parsed_xml.find_all("page", recursive=False)
But that returns an empty list: []
Expected result:
<textline id="1"></textline>
<textline id="2"></textline>
<textline id="3"></textline>
You can set each tag's string to '', which replaces all of its children with an empty string.
Example:
xml = """<page>
<textline id="1">
<text>C</text>
<text>o</text>
<text>n</text>
<text>t</text>
<text>a</text>
<text>k</text>
<text>t</text>
</textline>
<textline id="2">
<text>
</text>
</textline>
<textline id="3">
<text>M</text>
<text>e</text>
</textline>
</page>"""
from bs4 import BeautifulSoup
parsed_xml = BeautifulSoup(xml, 'xml')
text_lines = []
for tag in parsed_xml.find_all("textline"):
    # Assigning to .string replaces the tag's contents (this modifies the tree in place)
    tag.string = ''
    text_lines.append(tag)
print(text_lines)
Output:
[<textline id="1"></textline>,
<textline id="2"></textline>,
<textline id="3"></textline>]
You can use the clear() method to remove all the inner <text> tags from the <textline> tags.
One more thing: you can't pass a file name to BeautifulSoup; you have to open the file and pass its contents (or the open file object). Here I keep the XML content in a variable.
myxml = """<page>
<textline id="1">
<text>C</text>
<text>o</text>
<text>n</text>
<text>t</text>
<text>a</text>
<text>k</text>
<text>t</text>
</textline>
<textline id="2">
<text>
</text>
</textline>
<textline id="3">
<text>M</text>
<text>e</text>
</textline>
</page>"""
from bs4 import BeautifulSoup
parsed_xml = BeautifulSoup(myxml, 'xml')
text_lines = parsed_xml.find_all("textline")
for tl in text_lines:
    # clear() removes all children of the tag in place
    tl.clear()
print(text_lines)
Output:
[<textline id="1"/>, <textline id="2"/>, <textline id="3"/>]
I know I originally tagged this question with beautifulsoup, but I just wanted to share what I actually ended up using. The solution from @Rakesh does work with BeautifulSoup.
I actually ended up using Python's built-in XML parser:
import xml.etree.ElementTree as ET
tree = ET.parse(path_to_xml_file)
root = tree.getroot()
for textline in root.iter('textline'):
    print(textline)
I think this is a much cleaner solution, so hopefully it helps anyone coming across this post.
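Note that the loop above prints the Element objects themselves rather than the markup shown under "Expected result". As a minimal sketch (not part of the original answer), the expected output could be produced with ElementTree by copying each element's tag and attributes into a fresh, childless element. The helper name bare_textlines and the shortened sample XML are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened sample based on the XML in the question (assumption for illustration)
XML = """<page>
<textline id="1"><text>C</text><text>o</text></textline>
<textline id="2"><text>
</text></textline>
<textline id="3"><text>M</text><text>e</text></textline>
</page>"""


def bare_textlines(xml_text):
    """Return each <textline> serialized with its attributes but no children."""
    root = ET.fromstring(xml_text)
    result = []
    for textline in root.iter("textline"):
        # Build a new element so the parsed tree is left untouched
        bare = ET.Element(textline.tag, dict(textline.attrib))
        # short_empty_elements=False writes <a></a> instead of <a />
        result.append(
            ET.tostring(bare, encoding="unicode", short_empty_elements=False)
        )
    return result


lines = bare_textlines(XML)
for line in lines:
    print(line)
```

Unlike the clear() and .string approaches, this leaves the original tree intact, which may matter if the <text> children are needed later.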
I am trying to replace the information in an SVG file using ElementTree, however I am very new to it and haven't been making much progress.
So far, my code is:
import xml.etree.ElementTree as ET
tree = ET.parse('path-to-file')
root = tree.getroot()
for item in root.iter('tspan'):
    print(item)
However this doesn't find anything.
The SVG file information that I'm trying to locate is in the form:
<text
transform="matrix(0,-1,-1,0,2286,3426)"
style="font-variant:normal;font-weight:normal;font-size:123.10199738px;font-family:Arial;-inkscape-font-specification:ArialMT;writing-mode:lr-tb;fill:#000000;fill-opacity:1;fill-rule:nonzero;stroke:none"
id="text79724">
<tspan
x="0 71.891571 154.0006 188.22296 256.66766"
y="0"
sodipodi:role="line"
id="tspan79722"><SI1></tspan>
</text>
Where I am specifically looking to change the
x="0 71.891571 154.0006 188.22296 256.66766"
to x="0".
I am not set on using ElementTree to do this, however, most similar StackOverflow questions suggest that it's the best idea.
As you stated in the question, you are not set on using ElementTree, so here is a solution using BeautifulSoup:
data = '''<text
transform="matrix(0,-1,-1,0,2286,3426)"
style="font-variant:normal;font-weight:normal;font-size:123.10199738px;font-family:Arial;-inkscape-font-specification:ArialMT;writing-mode:lr-tb;fill:#000000;fill-opacity:1;fill-rule:nonzero;stroke:none"
id="text79724"><tspan
x="0 71.891571 154.0006 188.22296 256.66766"
y="0"
sodipodi:role="line"
id="tspan79722"><SI1></tspan></text>'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'html.parser')
for tspan in soup.select('tspan[x]'):
    if tspan['x'] == '0 71.891571 154.0006 188.22296 256.66766':
        # Attribute values are strings, so assign '0' rather than the integer 0
        tspan['x'] = '0'
print(soup.prettify())
# If writing to a new SVG file, write str(soup) instead of soup.prettify()
Prints:
<text id="text79724" style="font-variant:normal;font-weight:normal;font-size:123.10199738px;font-family:Arial;-inkscape-font-specification:ArialMT;writing-mode:lr-tb;fill:#000000;fill-opacity:1;fill-rule:nonzero;stroke:none" transform="matrix(0,-1,-1,0,2286,3426)">
<tspan id="tspan79722" sodipodi:role="line" x="0" y="0">
<SI1>
</tspan>
</text>
CSS selector tspan[x] will select <tspan> tag with attribute x. Then we check if attribute x is '0 71.891571 154.0006 188.22296 256.66766'. If it is, we set it to 0.
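As for why the ElementTree loop in the question finds nothing: a likely cause is XML namespaces. Assuming the SVG file declares the standard SVG namespace as its default (as Inkscape files normally do), every element name is namespace-qualified, so root.iter('tspan') matches nothing. The sketch below illustrates this with a cut-down document; the sample markup and the sodipodi namespace URI are assumptions, not taken from the asker's file.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"

# Cut-down sample; the namespace declarations are assumed, not from the question
SVG = """<svg xmlns="http://www.w3.org/2000/svg"
     xmlns:sodipodi="http://sodipodi.sourceforge.net/DTD/sodipodi-0.dtd">
  <text id="text79724">
    <tspan x="0 71.891571 154.0006 188.22296 256.66766" y="0"
           sodipodi:role="line" id="tspan79722">&lt;SI1&gt;</tspan>
  </text>
</svg>"""

root = ET.fromstring(SVG)

# The bare tag name does not match namespace-qualified elements
unqualified = list(root.iter("tspan"))

# The qualified name in {namespace}tag form does match
qualified = list(root.iter("{%s}tspan" % SVG_NS))

for tspan in qualified:
    tspan.set("x", "0")

print(len(unqualified), len(qualified), qualified[0].get("x"))
```

When writing the result back with ElementTree, calling ET.register_namespace("", SVG_NS) beforehand should keep the default namespace from being rewritten with an ns0: prefix.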
I'm using BeautifulSoup (version 4.4) to preprocess a Wikipedia textdump from https://dumps.wikimedia.org/enwiki/ for further parsing.
The textdump document contains multiple articles, each contained in a <page> tag.
Unfortunately, something about the document structure seems incompatible with BeautifulSoup: Within each <page>, the text body of an article is contained in a <text> block:
<text xml:space="preserve">...</text>
Once I've selected a certain <page> block, I should be able to access the content of the text block as page.text.string.
In BeautifulSoup, .text used to be reserved for the content of a tag between its brackets. In more recent versions, .string is used for that.
Unfortunately, it seems like page.text is still interpreted the same as page.string for backwards compatibility. (Edit: getattr(page, "text") does the same.)
Is there any way I can get around this and access an HTML tag named <text>?
(Edit: For a syntax example, see https://pastebin.com/WQvJn0gf.)
Using .find and .text works as expected:
from bs4 import BeautifulSoup
string = '''<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.10/ http://www.mediawiki.org/xml/export-0.10.xsd" version="0.10" xml:lang="en">
<siteinfo>...</siteinfo>
<page>
<title>AccessibleComputing</title>
<ns>0</ns>
<id>10</id>
<redirect title="Computer accessibility" />
<revision>
<id>854851586</id>
<parentid>834079434</parentid>
<timestamp>2018-08-14T06:47:24Z</timestamp>
<contributor>
<username>Godsy</username>
<id>23257138</id>
</contributor>
<comment>remove from category for seeking instructions on rcats</comment>
<model>wikitext</model>
<format>text/x-wiki</format>
<text xml:space="preserve">#REDIRECT [[Computer accessibility]]
{{R from move}}
{{R from CamelCase}}
{{R unprintworthy}}</text>
<sha1>42l0cvblwtb4nnupxm6wo000d27t6kf</sha1>
</revision>
</page>
...
</mediawiki>'''
soup = BeautifulSoup(string, 'html.parser')
page_tag = soup.find('page')
text_tag = page_tag.find('text')
print(text_tag.text)
# #REDIRECT [[Computer accessibility]]
# {{R from move}}
# {{R from CamelCase}}
# {{R unprintworthy}}
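If BeautifulSoup is not a hard requirement, the name clash with .text can also be sidestepped entirely by using ElementTree, where child elements are always looked up by explicit path. This is only a sketch under assumptions: the shortened dump, the mw prefix, and the articles dictionary are illustrative; the namespace URI is the one from the sample above.

```python
import xml.etree.ElementTree as ET

# Prefix "mw" is an arbitrary local alias for the dump's default namespace
NS = {"mw": "http://www.mediawiki.org/xml/export-0.10/"}

# Shortened sample dump (assumption for illustration)
DUMP = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/" version="0.10" xml:lang="en">
  <page>
    <title>AccessibleComputing</title>
    <revision>
      <text xml:space="preserve">#REDIRECT [[Computer accessibility]]</text>
    </revision>
  </page>
</mediawiki>"""

root = ET.fromstring(DUMP)

articles = {}
for page in root.iterfind("mw:page", NS):
    title = page.findtext("mw:title", namespaces=NS)
    # The <text> element is addressed by path, so no attribute clash can occur
    body = page.findtext("mw:revision/mw:text", namespaces=NS)
    articles[title] = body

print(articles)
```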
I have the following code:
from xml.etree import ElementTree
file_path = 'some_file_path'
document = ElementTree.parse(file_path, ElementTree.XMLParser(encoding='utf-8'))
If my XML looks like the following it gives me the error: "xml.etree.ElementTree.ParseError: not well-formed"
<?xml version="1.0" encoding="utf-8" ?>
<pages>
<page id="1">
<textbox id="0">
<textline bbox="53.999,778.980,130.925,789.888">
<text font="GCCBBY+TT228t00" bbox="60.598,778.980,64.594,789.888" size="10.908">H</text>
<text font="GCCBBY+TT228t00" bbox="64.558,778.980,70.558,789.888" size="10.908">-</text>
<text>
</text>
</textline>
</textbox>
</page>
</pages>
In sublime or Notepad++ I see highlighted characters such as ACK, DC4, or STX which seem to be the culprit (one of them appears as a "-" in the above xml in the second "text" node). If I remove these characters it works. What are these and how can I fix this?
I ran your code as follows, and it works fine:
from xml.etree import ElementTree
from io import StringIO  # on Python 2 this was: from StringIO import StringIO
xml_content = """<?xml version="1.0" encoding="utf-8" ?>
<pages>
<page id="1">
<textbox id="0">
<textline bbox="53.999,778.980,130.925,789.888">
<text font="GCCBBY+TT228t00" bbox="60.598,778.980,64.594,789.888" size="10.908">H</text>
<text font="GCCBBY+TT228t00" bbox="64.558,778.980,70.558,789.888" size="10.908">-</text>
<text>
</text>
</textline>
</textbox>
</page>
</pages>"""
print("parsing xml document")
# using StringIO to simulate reading from file
document = ElementTree.parse(StringIO(xml_content), ElementTree.XMLParser(encoding='utf-8'))
for elem in document.iter():
    print(elem.tag)
And the output is as expected:
parsing xml document
pages
page
textbox
textline
text
text
text
So the issue seems to be in how the file content is copied and pasted from Notepad++. Maybe it is adding some special characters, so try another editor.
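A note on the characters themselves: ACK, STX and DC4 are ASCII control characters, and XML 1.0 does not allow control characters other than tab, line feed and carriage return, so a conforming parser is expected to reject them. If they really are present in the file, one option is to strip them before parsing. The following is a minimal sketch; the helper name strip_illegal_xml_chars and the tiny sample document are assumptions for illustration, and it only handles the C0 control range.

```python
import re
import xml.etree.ElementTree as ET

# C0 control characters that XML 1.0 forbids (everything below 0x20
# except tab \x09, line feed \x0a and carriage return \x0d)
_ILLEGAL_XML_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def strip_illegal_xml_chars(text):
    """Remove control characters that an XML 1.0 parser would reject."""
    return _ILLEGAL_XML_CHARS.sub("", text)


# Sample document containing an ACK character (\x06) in the second <text>
raw = '<pages><page id="1"><text>H</text><text>\x06</text></page></pages>'

try:
    ET.fromstring(raw)
    parsed_raw = True
except ET.ParseError:
    parsed_raw = False

root = ET.fromstring(strip_illegal_xml_chars(raw))
texts = [t.text for t in root.iter("text")]

print(parsed_raw, texts)
```

Stripping discards data, so whether this is acceptable depends on what the control characters stood for in the source that produced the XML.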
I have an xml file which looks like the example below.
Many text values start with a space, have a newline (\n) at the beginning, or contain other crazy stuff. I'm working with xml.etree.ElementTree, and it parses this file fine.
But I want more! :) I tried to prettify this mess, without success. I tried many tutorials, but never ended up with pretty XML.
<?xml version="1.0"?>
<import>
<article>
<name> Name with space
</name>
<source> Daily Telegraph
</source>
<number>72/2015
</number>
<page>10
</page>
<date>2015-03-26
</date>
<author> Tomas First
</author>
<description>Economy
</description>
<attachment>
</attachment>
<region>
</region>
<text>
My text is here
</text>
</article>
<article>
<name> How to parse
</name>
<source> Internet article
</source>
<number>72/2015
</number>
<page>1
</page>
<date>2015-03-26
</date>
<author>Some author
</author>
<description> description
</description>
<attachment>
</attachment>
<region>
</region>
<text>
My text here
</text>
</article>
</import>
When I tried other answers from SO, they produced the same file or even messier XML.
bs4 can do it
from bs4 import BeautifulSoup
# xmlstring holds the XML document as a string
doc = BeautifulSoup(xmlstring, 'xml')
print(doc.prettify())
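Since the question mentions xml.etree.ElementTree, here is a sketch of a standard-library alternative, assuming Python 3.9 or later (where ET.indent is available). It first strips the stray whitespace from every element's text, then re-indents the tree. The shortened sample XML is an assumption for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened sample based on the XML in the question
XML = """<import>
<article>
<name> Name with space
</name>
<page>10
</page>
<attachment>
</attachment>
</article>
</import>"""

root = ET.fromstring(XML)

for elem in root.iter():
    if elem.text is not None:
        # Trim leading/trailing whitespace; whitespace-only text becomes None
        elem.text = elem.text.strip() or None

# Re-indent the whole tree in place (requires Python 3.9+)
ET.indent(root, space="  ")

pretty = ET.tostring(root, encoding="unicode")
print(pretty)
```

Stripping the text before indenting is the important step: indentation alone does not touch the whitespace inside leaf elements such as <name>.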
I want to convert these sentences to XML:
I will meet you at 1st.
5th... OK, 5th?
today is 2nd\n
Aug.3rd
Like this:
<Text VAlign="top" VPosition="85.00">
I will meet you at 1<Font Script="super">st</Font>.
</Text>
<Text VAlign="top" VPosition="85.00">
5<Font Script="super">th</Font>... OK, 5<Font Script="super">th</Font>
</Text>
<Text VAlign="top" VPosition="85.00">
today is 2<Font Script="super">nd</Font>\n
</Text>
<Text VAlign="top" VPosition="85.00">
Aug.3<Font Script="super">rd</Font>\n
</Text>
I am using minidom, but after reading many posts and answers, I don't mind rewriting my code with another parser. At first I thought this would be easy: just replace st|nd|rd|th with
<Font Script="super">st|nd|rd|th</Font> and then call createTextNode() with the new string.
However, the characters <, > and " are turned into &lt;, &gt; and &quot; by the writexml() method, which conforms to the XML specification but is not what I want in the output.
How can I deal with this? Thanks so much.
Here's what you can do with xml.etree.ElementTree from the standard library:
import re
import xml.etree.ElementTree as ET
data = """I will meet you at 1st.
5th... OK, 5th?
today is 2nd
Aug.3rd"""
endings = ['st', 'th', 'nd', 'rd']
pattern = re.compile('(%s)' % "|".join(endings))
root = ET.Element('root')
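for line in data.split('\n'):
    items = []
    # re.split with a capturing group keeps the matched endings in the result
    for item in re.split(pattern, line):
        if item in endings:
            items.append('<Font Script="super">%s</Font>' % item)
        else:
            items.append(item)
    element = ET.fromstring("""<Text VAlign="top" VPosition="85.00">%s</Text>""" % ''.join(items))
    root.append(element)
# encoding='unicode' returns a str instead of bytes on Python 3
print(ET.tostring(root, encoding='unicode'))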
It produces the following xml:
<root>
<Text VAlign="top" VPosition="85.00">I will meet you at 1<Font Script="super">st</Font>.
</Text>
<Text VAlign="top" VPosition="85.00">5<Font Script="super">th</Font>... OK, 5<Font Script="super">th</Font>?
</Text>
<Text VAlign="top" VPosition="85.00">today is 2
<Font Script="super">nd</Font>
</Text>
<Text VAlign="top" VPosition="85.00">Aug.3
<Font Script="super">rd</Font>
</Text>
</root>
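One caveat with the approach above is that the pattern matches st, th, nd or rd anywhere in a line (for example inside "first"), and that assembling markup as a string breaks if the text itself contains < or &. The following is an alternative sketch that builds the <Font> children directly through .text and .tail and only matches endings that follow a digit. The function name line_to_text_element and the regular expression are assumptions for illustration, not part of the original answer.

```python
import re
import xml.etree.ElementTree as ET

# Ordinal endings only when they directly follow a digit and end a word
ORDINAL = re.compile(r"(?<=\d)(st|nd|rd|th)\b")


def line_to_text_element(line):
    """Build a <Text> element with ordinal endings wrapped in <Font>."""
    text_el = ET.Element("Text", {"VAlign": "top", "VPosition": "85.00"})
    last_font = None
    pos = 0
    for match in ORDINAL.finditer(line):
        chunk = line[pos:match.start()]
        if last_font is None:
            text_el.text = chunk      # text before the first <Font>
        else:
            last_font.tail = chunk    # text between two <Font> elements
        last_font = ET.SubElement(text_el, "Font", {"Script": "super"})
        last_font.text = match.group(1)
        pos = match.end()
    rest = line[pos:]
    if last_font is None:
        text_el.text = rest
    else:
        last_font.tail = rest
    return text_el


el = line_to_text_element("5th... OK, 5th?")
out = ET.tostring(el, encoding="unicode")
print(out)
```

Because the text is assigned through .text and .tail, the serializer escapes it where needed, while the <Font> tags are real elements and are therefore written unescaped.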
To get output with proper indentation and newlines, I needed lxml, and I put this into alecxe's code:
from lxml import etree as ET
print(ET.tostring(root, pretty_print=True, encoding='unicode'))