How to parse an HTML document with &lt; and &gt; using BeautifulSoup? - python

I have multiple HTML documents that I need to parse for a very specific value, but I am not used to HTML documents using &lt; and &gt; for tags. I know they represent < and > for the tag, but I would like to know how to deal with this.
Snippet of the HTML doc:
&lt;score_result&gt;
&lt;Models&gt;
&lt;Model&gt;
&lt;Id&gt;CLASS&lt;/Id&gt;
&lt;Description&gt;Classifier Model 2.0&lt;/Description&gt;
&lt;Score&gt;613&lt;/Score&gt;
&lt;Messages&gt;
&lt;Message&gt;
&lt;MessageType&gt;RC&lt;/MessageType&gt;
&lt;Code&gt;017111&lt;/Code&gt;
&lt;Description&gt;# of bananas, S&amp;Accounts Established&lt;/Description&gt;
&lt;/Message&gt;
&lt;Message&gt;
&lt;MessageType&gt;RC&lt;/MessageType&gt;
&lt;Code&gt;P11P&lt;/Code&gt;
&lt;Description&gt;Absence of Banana&lt;/Description&gt;
&lt;/Message&gt;
&lt;Message&gt;
&lt;MessageType&gt;RC&lt;/MessageType&gt;
&lt;Code&gt;0111&lt;/Code&gt;
&lt;Description&gt;Presence of a Banana&lt;/Description&gt;
&lt;/Message&gt;
&lt;Message&gt;
&lt;MessageType&gt;RC&lt;/MessageType&gt;
&lt;Code&gt;0111&lt;/Code&gt;
&lt;Description&gt;# of Inquiries&lt;/Description&gt;
&lt;/Message&gt;
&lt;/Messages&gt;
&lt;/Model&gt;
&lt;/Models&gt;
&lt;/score_result&gt;
I am specifically trying to grab the Score value, <Score>613</Score>, from these HTML documents.
I first turn my dataframe into a list of tuples to iterate through each HTML document, create a BeautifulSoup object, and then try to find the tag with .find_all().
I get an empty result every time. I also considered using regex, but wanted to see what other people think.
My code:
result = [(x,y) for x,y in zip(df['ID'], df['data'])]
Score_lst = []
for row in result:
    try:
        Bs_data = BS(row[1])
        Score_lst.append(Bs_data.find_all('score'))
    except:
        print('Na')
Expected Output:
Score_lst
[613,
...,
...,
....]
The ... will be the other values I will parse.

Here is one way to solve this conundrum:
from bs4 import BeautifulSoup as bs
html = '''
&lt;score_result&gt;
&lt;Models&gt;
&lt;Model&gt;
&lt;Id&gt;CLASS&lt;/Id&gt;
&lt;Description&gt;Classifier Model 2.0&lt;/Description&gt;
&lt;Score&gt;613&lt;/Score&gt;
&lt;Messages&gt;
&lt;Message&gt;
&lt;MessageType&gt;RC&lt;/MessageType&gt;
&lt;Code&gt;017111&lt;/Code&gt;
&lt;Description&gt;# of bananas, S&amp;Accounts Established&lt;/Description&gt;
&lt;/Message&gt;
&lt;Message&gt;
&lt;MessageType&gt;RC&lt;/MessageType&gt;
&lt;Code&gt;P11P&lt;/Code&gt;
&lt;Description&gt;Absence of Banana&lt;/Description&gt;
&lt;/Message&gt;
&lt;Message&gt;
&lt;MessageType&gt;RC&lt;/MessageType&gt;
&lt;Code&gt;0111&lt;/Code&gt;
&lt;Description&gt;Presence of a Banana&lt;/Description&gt;
&lt;/Message&gt;
&lt;Message&gt;
&lt;MessageType&gt;RC&lt;/MessageType&gt;
&lt;Code&gt;0111&lt;/Code&gt;
&lt;Description&gt;# of Inquiries&lt;/Description&gt;
&lt;/Message&gt;
&lt;/Messages&gt;
&lt;/Model&gt;
&lt;/Models&gt;
&lt;/score_result&gt;
'''
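# html is expected to contain entity-escaped markup (&lt;Score&gt;...).
# The inner parse decodes the entities into plain text; the outer parse then reads that text as real tags.
soup = bs(bs(html, 'html.parser').text, 'html.parser')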
score = soup.select_one('Score')
print('And here is your score:', score.text)
Result in terminal:
And here is your score: 613
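A standard-library variant of the same idea, as a minimal sketch: decode the entities once with html.unescape, then parse the result as XML. The helper name extract_score and the docs list (a stand-in for df['data']) are hypothetical, and the sketch assumes each document is well-formed XML after one round of unescaping.

```python
import html
import xml.etree.ElementTree as ET


def extract_score(escaped_doc):
    """Return the text of the first <Score> element, or None if there is none."""
    # html.unescape turns &lt;Score&gt; into <Score>, so the parser sees real tags.
    markup = html.unescape(escaped_doc)
    root = ET.fromstring(markup)
    node = root.find(".//Score")
    return node.text if node is not None else None


# Hypothetical stand-in for df['data']: two entity-escaped documents.
docs = [
    "&lt;score_result&gt;&lt;Models&gt;&lt;Model&gt;&lt;Score&gt;613&lt;/Score&gt;"
    "&lt;/Model&gt;&lt;/Models&gt;&lt;/score_result&gt;",
    "&lt;score_result&gt;&lt;Models&gt;&lt;Model&gt;&lt;Id&gt;CLASS&lt;/Id&gt;"
    "&lt;/Model&gt;&lt;/Models&gt;&lt;/score_result&gt;",
]

score_lst = [extract_score(doc) for doc in docs]
print(score_lst)
```

Unlike html.parser, ElementTree keeps tag names case-sensitive, so the lookup uses Score rather than score.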

Related

How can I build XML like this with ElementTree, Python and a loop?

I need to produce something like this:
<a code="007">
<element name="b" replace="true" id="" date="02-09-2022 15:30:00" special="true" from="" to="">
<component name="value" valor="55"/>
</element>
<element name="c" replace="true" id="" date="02-09-2022 18:30:00" special="true" from="" to="">
<component name="value" valor="15"/>
</element>
****a loop for "n" elements here****
</a>
My code keeps opening and closing "a" on every iteration over "element"; that's my biggest problem.
I need to open "a" at the beginning, then use a loop (it could be a for loop) to fill in the list of "element" entries, and then close the "a" tag.
Any hint would be much appreciated.
I suggest taking a look at xml.dom.minidom. Consider the following simple example:
from xml.dom.minidom import getDOMImplementation
impl = getDOMImplementation()
# Create the document with its root element "a" once, before the loop
doc = impl.createDocument(None, "a", None)
for i in ['X','Y','Z']:
    elem = doc.createElement("b")
    elem.setAttribute("idcode",i)
    doc.documentElement.appendChild(elem)
print(doc.toxml())
This gives the output:
<?xml version="1.0" ?><a><b idcode="X"/><b idcode="Y"/><b idcode="Z"/></a>
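Since the question asks about ElementTree, the same pattern can be sketched with xml.etree.ElementTree: create the root "a" once, then append one "element" per loop iteration. The rows list is a hypothetical input; the attribute names and values are taken from the XML in the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical input rows: (name, date, valor) for each <element>.
rows = [
    ("b", "02-09-2022 15:30:00", "55"),
    ("c", "02-09-2022 18:30:00", "15"),
]

# Create the root <a> once, outside the loop.
root = ET.Element("a", {"code": "007"})

for name, date, valor in rows:
    # SubElement creates the child and attaches it to its parent in one step.
    element = ET.SubElement(root, "element", {
        "name": name, "replace": "true", "id": "", "date": date,
        "special": "true", "from": "", "to": "",
    })
    ET.SubElement(element, "component", {"name": "value", "valor": valor})

xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

Because the root is built before the loop and serialized after it, "a" is opened and closed exactly once however many rows there are.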

Finding an element based on its descendant tags using ElementTree

I'm completely new to XML parsing. I have several thousand XML files and I want to find every DE element, but only when it contains a country tag.
Here is my sample XML:
<?xml version="1.0" encoding="UTF-8"?>
<DE>
<CT>
<IG>
<FS id="01">
<FE id="A" fId="B">
<title>Apple</title>
</FE>
</FS>
<country syse="21" subSys="2">
<FF FR="101" fe="01" />
<referTo refType="t06">
<CF Code="350" />
</referTo>
<place id="00A" placeValue="00AB">
<Q>001</Q>
<TQ>0001</TQ>
<PR Value="A" CodeValue="C" />
</place>
<place id="00E" placeValue="00EF">
<Q>001</Q>
<TQ>0001</TQ>
<PR Value="03" AValue="957" />
<Books>
<IA>
<Part />
</IA>
<PRGroup>
<country Code="5">
<PR Value="02" AValue="345" />
<constrain>Double condition.</constrain>
<constrain>Double condition.</constrain>
</country>
</PRGroup>
</Books>
</place>
</country>
</IG>
</CT>
</DE>
import xml.etree.ElementTree as ET
tree = ET.parse(content)
root = tree.getroot()
Num = root.findall("//DE[//place/Books/PRGroup/country]")
I am getting a predicate error or an absolute path error when I try different approaches, and I am not able to figure this out.
How can I retrieve the results and access the attributes based on that?
Could you please help me with this?
With lxml it should be something along these lines:
from lxml import etree
content = """[your xml above]"""
root = etree.fromstring(content.encode())
# The leading . makes the predicate search inside each DE rather than the whole document
Num = root.xpath("//DE[.//place/Books/PRGroup/country]")
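If lxml is not available, a similar result can be sketched with the standard library. ElementTree supports only a small XPath subset, which is probably why the findall call in the question fails: a path starting with / is rejected on an element, and a predicate cannot hold a nested path. The sketch below moves the condition into Python instead; content is a trimmed, assumed version of the sample XML.

```python
import xml.etree.ElementTree as ET

# Trimmed version of the sample XML from the question.
content = """<DE>
  <CT><IG>
    <country syse="21" subSys="2">
      <place id="00E" placeValue="00EF">
        <Books><PRGroup>
          <country Code="5"><PR Value="02" AValue="345" /></country>
        </PRGroup></Books>
      </place>
    </country>
  </IG></CT>
</DE>"""

root = ET.fromstring(content)
path = ".//place/Books/PRGroup/country"

# root.iter("DE") also yields the root itself when its tag is DE.
matches = [de for de in root.iter("DE") if de.find(path) is not None]

for de in matches:
    for country in de.iterfind(path):
        # Attributes are exposed through .get() and the .attrib dict.
        print(country.get("Code"), [pr.attrib for pr in country.findall("PR")])
```

For a file on disk, ET.parse(filename).getroot() would replace the ET.fromstring call.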

BeautifulSoup can't access content of <text> tag

I'm using BeautifulSoup (version 4.4) to preprocess a Wikipedia textdump from https://dumps.wikimedia.org/enwiki/ for further parsing.
The textdump document contains multiple articles, each contained in a <page> tag.
Unfortunately, something about the document structure seems incompatible with BeautifulSoup: Within each <page>, the text body of an article is contained in a <text> block:
<text xml:space="preserve">...</text>
Once I've selected a certain <page> block, I should be able to access the content of the text block as page.text.string.
In BeautifulSoup, .text used to be reserved for the content of a tag between its brackets. In more recent versions, .string is used for that.
Unfortunately, it seems like page.text is still interpreted the same as page.string for backwards compatibility. (Edit: getattr(page, "text") does the same.)
Is there any way I can get around this and access an HTML tag named <text>?
(Edit: For a syntax example, see https://pastebin.com/WQvJn0gf.)
Using .find and .text works as expected:
from bs4 import BeautifulSoup
string = '''<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.10/ http://www.mediawiki.org/xml/export-0.10.xsd" version="0.10" xml:lang="en">
<siteinfo>...</siteinfo>
<page>
<title>AccessibleComputing</title>
<ns>0</ns>
<id>10</id>
<redirect title="Computer accessibility" />
<revision>
<id>854851586</id>
<parentid>834079434</parentid>
<timestamp>2018-08-14T06:47:24Z</timestamp>
<contributor>
<username>Godsy</username>
<id>23257138</id>
</contributor>
<comment>remove from category for seeking instructions on rcats</comment>
<model>wikitext</model>
<format>text/x-wiki</format>
<text xml:space="preserve">#REDIRECT [[Computer accessibility]]
{{R from move}}
{{R from CamelCase}}
{{R unprintworthy}}</text>
<sha1>42l0cvblwtb4nnupxm6wo000d27t6kf</sha1>
</revision>
</page>
...
</mediawiki>'''
soup = BeautifulSoup(string, 'html.parser')
page_tag = soup.find('page')
text_tag = page_tag.find('text')
print(text_tag.text)
# #REDIRECT [[Computer accessibility]]
# {{R from move}}
# {{R from CamelCase}}
# {{R unprintworthy}}

Using Python to extract information from an XML file?

Can anyone offer some help with using Python to extract information from an XML file? This will be my example XML.
<root>
<number index="2">
<info>
<info.RANDOM>Random Text</info.RANDOM>
</info>
</root>
What I want to print out is everything between the root tags. However, I want it printed as is, which means all the tags, the text between the tags, and the attributes within the tags (in this case, number index="2"). I have tried itertext(), but that removes the tags and prints only the text between the root tags. So far, I have a makeshift solution that prints only element.tag and element.text, but it does not print the end tags or the attributes. Any help would be appreciated! :)
With s as your input,
s='''<root>
<number index="2">
<info>
<info.RANDOM>Random Text</info.RANDOM>
</info>
</number>
</root>'''
Find all elements with the tag name number and convert each one to a string using ET.tostring():
import xml.etree.ElementTree as ET
root = ET.fromstring(s)
for node in root.findall('.//number'):
    # encoding='unicode' makes tostring() return str instead of bytes
    print(ET.tostring(node, encoding='unicode'))
Output:
<number index="2">
<info>
<info.RANDOM>Random Text</info.RANDOM>
</info>
</number>
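Another option is BeautifulSoup with its "xml" parser (which requires lxml to be installed):
from bs4 import BeautifulSoup
xml = "<root><number index=\"2\"><info><info.RANDOM>Random Text</info.RANDOM></info></root>"
soup = BeautifulSoup(xml, "xml")
output = soup.prettify()
print(output[output.find("<root>") + 7:output.rfind("</root>")])
The + 7 skips the seven characters of "<root>" plus the newline that follows it.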
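To print everything between the root tags without slicing strings, one possible sketch is to serialize each child of root and join the pieces. It assumes the corrected input s (with the closing number tag) from the first answer.

```python
import xml.etree.ElementTree as ET

s = """<root>
<number index="2">
<info>
<info.RANDOM>Random Text</info.RANDOM>
</info>
</number>
</root>"""

root = ET.fromstring(s)

# root.text is any text directly after <root>; each child is then serialized
# with its tags, attributes, nested elements and trailing whitespace.
inner = (root.text or "") + "".join(
    ET.tostring(child, encoding="unicode") for child in root
)
print(inner)
```

This keeps working if root has several children, whereas findall('.//number') only picks up the number elements.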

Download and include referenced URL in XML

I have an RSS feed to a news source. Along with the news text and other metadata, the feed also contains a URL reference to the comments section, which can also be in RSS format. I want to download and include the contents of the comments section for each news article. My aim is to create an RSS feed with the articles and the comments for each article included in the RSS, then convert this new RSS in calibre to PDF.
Here is an example XML:
<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
<entry>
<author>
<name>Some Author</name>
<uri>http://thenews.com</uri>
</author>
<category term="sports" label="Sports" />
<content type="html">This is the news text.</content>
<id>123abc</id>
<link href="http://thenews.com/article/123abc/comments" />
<updated>2016-04-29T13:44:00+00:00</updated>
<title>The Title</title>
</entry>
<entry>
<author>
<name>Some other Author</name>
<uri>http://thenews.com</uri>
</author>
<category term="sports" label="Sports" />
<content type="html">This is another news text.</content>
<id>123abd</id>
<link href="http://thenews.com/article/123abd/comments" />
<updated>2016-04-29T14:46:00+00:00</updated>
<title>The other Title</title>
</entry>
</feed>
Now I want to replace the <link href="http://thenews.com/article/123abc/comments" /> with the content of the URL. The RSS feed can be fetched by adding a /rss at the end of the URL. So in the end, a single entry would look like this:
<entry>
<author>
<name>Some Author</name>
<uri>http://thenews.com</uri>
</author>
<category term="sports" label="Sports" />
<content type="html">This is the news text.</content>
<id>123abc</id>
<comments>
<comment>
<author>A commenter</author>
<timestamp>2016-04-29T16:00:00+00:00</timestamp>
<text>Cool story, yo!</text>
</comment>
<comment>
<author>Another commenter</author>
<timestamp>2016-04-29T16:01:00+00:00</timestamp>
<text>This is interesting news.</text>
</comment>
</comments>
<updated>2016-04-29T13:44:00+00:00</updated>
<title>The Title</title>
</entry>
I'm open to any programming language. I tried this with python and lxml but couldn't get far. I was able to extract the comments URL and download the comments feed but couldn't replace the actual <link>-tag.
Without having to download the actual RSS, here's how far I've come:
import lxml.etree as et
import urllib2
import re
# These will be downloaded from the RSS feed source when the code works
xmltext = """[The above news feed, too long to paste]"""
commentsRSS = """[The above comments feed]"""
hdr = { 'User-Agent' : 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}
article = et.fromstring(xmltext)
for elem in article.xpath('//feed/entry'):
    commentsURL = elem.xpath('link/@href')
    #request = urllib2.Request(commentsURL[0] + '.rss', headers=hdr)
    #comments = urllib2.urlopen(request).read()
    comments = commentsRSS
    # Now the <link>-tag should be replaced by the comments feed without the <?xml ...> tag
For each <link> element, download the XML from the href attribute and parse it into a new Element. Then replace <link> with the corresponding new Element, something like this:
....
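article = et.fromstring(xmltext)
# The feed declares a default namespace, so the XPath needs a prefix mapped to it
ns = {'d': 'http://www.w3.org/2005/Atom'}
for elem in article.xpath('//d:feed/d:entry/d:link', namespaces=ns):
    # urllib2 is Python 2 only; on Python 3 the equivalent calls live in urllib.request
    request = urllib2.Request(elem.attrib['href'] + '.rss', headers=hdr)
    comments = urllib2.urlopen(request).read()
    newElem = et.fromstring(comments)
    # Swap <link> for the parsed comments element in the same position
    elem.getparent().replace(elem, newElem)
# print the result
print(et.tostring(article))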
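The replacement step can also be tried without lxml or a network connection. The following is a minimal sketch using the standard library, where fetch_comments is a hypothetical stand-in for the download and returns canned XML; the feed is a shortened version of the example above.

```python
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"
ET.register_namespace("", ATOM)  # serialize Atom as the default namespace

feed_xml = """<feed xmlns="http://www.w3.org/2005/Atom">
<entry>
<id>123abc</id>
<link href="http://thenews.com/article/123abc/comments" />
<title>The Title</title>
</entry>
</feed>"""


def fetch_comments(url):
    """Hypothetical stand-in for the download step; returns canned XML."""
    return ("<comments><comment><author>A commenter</author>"
            "<text>Cool story, yo!</text></comment></comments>")


feed = ET.fromstring(feed_xml)
for entry in feed.findall(f"{{{ATOM}}}entry"):
    link = entry.find(f"{{{ATOM}}}link")
    if link is None:
        continue
    new_elem = ET.fromstring(fetch_comments(link.get("href") + ".rss"))
    new_elem.tail = link.tail           # keep the original whitespace
    position = list(entry).index(link)  # remember where <link> sat
    entry.remove(link)
    entry.insert(position, new_elem)

result = ET.tostring(feed, encoding="unicode")
print(result)
```

ElementTree has no getparent(), so the sketch walks the entries and swaps the child by index. The inserted comments elements carry no namespace in the tree, which may matter if the result is queried again with namespaced paths.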
