Adjusting all text in DOM tree with BeautifulSoup - python

I'm trying to capitalize all the (user-visible) text in an HTML file. Here is the obvious thing:
from bs4 import BeautifulSoup

def upcaseAll(str):
    soup = BeautifulSoup(str)
    for tag in soup.find_all(True):
        for s in tag.strings:
            s.replace_with(unicode(s).upper())
    return unicode(soup)
That crashes:
File "/Users/malvolio/flip.py", line 23, in upcaseAll
for s in tag.strings:
File "/Library/Python/2.7/site-packages/bs4/element.py", line 827, in _all_strings
for descendant in self.descendants:
File "/Library/Python/2.7/site-packages/bs4/element.py", line 1198, in descendants
current = current.next_element
AttributeError: 'NoneType' object has no attribute 'next_element'
All the variations I can think of crash the same way. BS4 does not seem to like it when I replace a lot of NavigableStrings. How can I do this?

You should not use str as the function argument name, as it shadows the Python built-in.
Also, you should be able to convert the visible elements just by using prettify() with a custom formatter, like this:
...
return soup.prettify(formatter=lambda x: unicode(x).upper())
I have tested it now and it works:
from bs4 import BeautifulSoup
import requests
r = requests.get('http://www.stackoverflow.com')
soup = BeautifulSoup(r.content)
print soup.prettify(formatter=lambda x: unicode(x).upper())[:200]
<!DOCTYPE html>
<html>
 <head>
  <title>
   STACK OVERFLOW
  </title>
  <link href="//CDN.SSTATIC.NET/STACKOVERFLOW/IMG/FAVICON.ICO?V=00A326F96F68" rel="SHORTCUT ICON"/>
  <link href="//CDN.SSTATIC.NE
...
You can read the output formatters section of the BeautifulSoup documentation for more detailed information.
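If you want to modify the tree in place rather than producing prettified output, the crash can also be avoided by materializing the strings before replacing them: replace_with() breaks the next_element chain that the strings generator is walking, so collect first, then replace. A minimal sketch:
from bs4 import BeautifulSoup

def upcase_all(markup):
    soup = BeautifulSoup(markup)
    # collect the NavigableStrings up front so that replacing them
    # does not invalidate the generator we are iterating over
    for s in list(soup.strings):
        s.replace_with(unicode(s).upper())
    return unicode(soup)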
Hope this helps.

Related

How to delete line in HTML file with BeautifulSoup?

I am new to using BeautifulSoup.
I have a line in an HTML file that is stored locally.
<LINK rel="stylesheet" type="text/css" href="report.css" >
I wish to remove that line, but I don't know what approach to use to find the line and remove it.
I can find the line using:
old_text = soup.find("link", {"href": "report.css"})
But I can't work out how to remove it and save the file again.
You could use .decompose() to get rid of the tag:
soup.find("link", {"href": "report.css"}).decompose()
or
soup.select_one('link[href^="report."]').decompose()
and then convert the BeautifulSoup object back to a string to save it:
str(soup)
Example
from bs4 import BeautifulSoup
html = '''
<some tag>some content</some tag>
<LINK rel="stylesheet" type="text/css" href="report.css" >
<some tag>some content</some tag>
'''
soup = BeautifulSoup(html, "html.parser")
soup.select_one('link[href^="report."]').decompose()
print(str(soup))
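To save the result back to the local file, a minimal sketch (report.html here is a placeholder for your actual file path):
# write the modified markup back to disk
with open("report.html", "w") as f:
    f.write(str(soup))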

Parsing MS specific html tags in BeautifulSoup

When trying to parse an email sent using MS Outlook, I want to be able to strip the annoying Microsoft XML tags that it has added. One such example is the o:p tag. When trying to use Python's BeautifulSoup to parse an email as HTML, it can't seem to find these specialty tags.
For example:
from bs4 import BeautifulSoup

textToParse = """
<html>
  <head>
    <title>Something to parse</title>
  </head>
  <body>
    <p><o:p>This should go</o:p>Paragraph</p>
  </body>
</html>
"""
soup = BeautifulSoup(textToParse, "html5lib")
body = soup.find('body')
for otag in body.find_all('o'):
    print(otag)
for otag in body.find_all('o:p'):
    print(otag)
This outputs no text to the console, but if I switch the find_all call to search for p, it outputs the p node as expected.
How come these custom tags do not seem to work?
It's a namespace issue. Apparently, BeautifulSoup does not consider custom namespaces valid when parsing with "html5lib".
You can work around this with a regular expression, which, strangely, does work correctly:
import re

print(soup.find_all(re.compile('o:p')))
>>> [<o:p>This should go</o:p>]
but the "proper" solution is to change the parser to "lxml-xml" and introducing o: as a valid namespace.
from bs4 import BeautifulSoup

textToParse = """
<html xmlns:o='dummy_url'>
  <head>
    <title>Something to parse</title>
  </head>
  <body>
    <p><o:p>This should go</o:p>Paragraph</p>
  </body>
</html>
"""
soup = BeautifulSoup(textToParse, "lxml-xml")
body = soup.find('body')
print('this should find nothing')
for otag in body.find_all('o'):
    print(otag)
print('this should find o:p')
for otag in body.find_all('o:p'):
    print(otag)
>>>
this should find nothing
this should find o:p
<o:p>This should go</o:p>
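Once the tags can be found, stripping them while keeping their text (the original goal) can be done with unwrap(). A minimal sketch, continuing from the soup above:
# replace each o:p tag with its own contents
for otag in body.find_all('o:p'):
    otag.unwrap()
print(body.find('p'))
# expected: <p>This should goParagraph</p>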

Using Beautiful Soup for parsing NELL Knowledge Base page

I'm using Beautiful Soup to parse the list of categories from http://rtw.ml.cmu.edu/rtw/kbbrowser/, and I got the HTML code of this page:
<html>
  <head>
    <link href="../css/browser.css" rel="stylesheet" type="text/css"/>
    <script type="text/javascript">
      if (parent.location.href == self.location.href) {
        if (window.location.href.replace)
          window.location.replace('index.php');
        else
          // causes problems with back button, but works
          window.location.href = 'index.php';
      }
    </script>
  </head>
  <body id="ontology">
    ...
  </body>
</html>
I'm using quite simple code, but when I'm trying to get to the <body> element, I get None:
import urllib
from BeautifulSoup import BeautifulSoup
from bs4 import BeautifulSoup
import mechanize
from mechanize import Browser
import requests
import re
import os
link = 'http://rtw.ml.cmu.edu/rtw/kbbrowser/ontology.php'
pageFile = urllib.urlopen(link).read()
soup = BeautifulSoup(pageFile)
print soup.head.contents[0].name
print soup.html.contents[1].name
Why does the head element in this case not have a sibling?
I'm getting:
AttributeError: 'NoneType' object has no attribute 'next_element'
when trying to get head.next_sibling as well.
This is because text nodes (such as the whitespace between tags) are also part of contents.
Instead of operating on the contents property, use CSS selectors to locate the list of categories. For example, here is how you can list the top-level categories:
for li in soup.select("body#ontology > ul > li"):
    print li.find_all("a")[-1].text
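To see why the direct sibling lookups fail, note that the whitespace between tags is itself a node; a minimal sketch:
from bs4 import BeautifulSoup

markup = "<html><head></head>\n<body id='ontology'></body></html>"
soup = BeautifulSoup(markup, "html.parser")
# the newline between </head> and <body> is a text node
print repr(soup.head.next_sibling)        # u'\n'
print soup.head.find_next_sibling().name  # 'body': find_next_sibling() skips text nodes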

Extracting the hyperlink from link tag using xpath

Consider this HTML:
<item>
  <title>this is the title</title>
  <link>www.linktoawebsite.com</link>
</item>
I am using Lxml(python) and Xpath and trying to extract both the content of the title tag as well as the link tag.
The code is
page=urllib.urlopen(url).read()
x=etree.HTML(page)
titles=x.xpath('//item/title/text()')
links=x.xpath('//item/link/text()')
But links ends up as an empty list. However, the following does return a link element:
links=x.xpath('//item/link') #returns <Element link at 0xb6b0ae0c>
Can anyone suggest how to extract the urls from the link tag?
You are using the wrong parser for the job; you don't have HTML, you have XML.
A proper HTML parser will ignore the contents of a <link> tag, because in the HTML specification that tag is always empty.
Use the etree.parse() function to parse your URL stream (no separate .read() call needed):
response = urllib.urlopen(url)
tree = etree.parse(response)
titles = tree.xpath('//item/title/text()')
links = tree.xpath('//item/link/text()')
You could also use etree.fromstring(page) but leaving the reading to the parser is easier.
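A self-contained sketch of the difference, using io.BytesIO as a stand-in for the URL response stream:
from io import BytesIO
from lxml import etree

content = b"""<item>
<title>this is the title</title>
<link>www.linktoawebsite.com</link>
</item>"""

# parsed as XML, <link> keeps its text content
tree = etree.parse(BytesIO(content))
print tree.xpath('//item/title/text()')  # ['this is the title']
print tree.xpath('//item/link/text()')   # ['www.linktoawebsite.com']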
When etree parses the content as HTML, the <link> tag gets closed immediately, so there is no text value present for the link tag.
Demo:
>>> from lxml import etree
>>> content = """<item>
... <title>this is the title</title>
... <link>www.linktoawebsite.com</link>
... </item>"""
>>> x = etree.HTML(content)
>>> etree.tostring(x)
'<html><body><item>\n<title>this is the title</title>\n<link/>www.linktoawebsite.com\n</item></body></html>'
>>>
According to HTML, this is not a valid use of the tag. The link tag's structure is like:
<head>
  <link rel="stylesheet" type="text/css" href="theme.css">
</head>

How can I grab CData out of BeautifulSoup

I have a website that I'm scraping that has a structure similar to the following. I'd like to be able to grab the info out of the CData block.
I'm using BeautifulSoup to pull other info off the page, so if the solution can work with that, it would help keep my learning curve down, as I'm a Python novice.
Specifically, I want to get at the two different types of data hidden in the CData statement. The first, which is just text, I'm pretty sure I can throw a regex at to get what I need. For the second type, if I could drop the data that has HTML elements into its own BeautifulSoup object, I could parse that.
I'm just learning Python and BeautifulSoup, so I'm struggling to find the magical incantation that will give me just the CData by itself.
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>Cows and Sheep</title>
  </head>
  <body>
    <div id="main">
      <div id="main-precontents">
        <div id="main-contents" class="main-contents">
          <script type="text/javascript">
//<![CDATA[var _ = g_cow;_[7654]={cowname_enus:'cows rule!',leather_quality:99,icon:'cow_level_23'};_[37357]={sheepname_enus:'baa breath',wool_quality:75,icon:'sheep_level_23'};_[39654].cowmeat_enus = '<table><tr><td><b class="q4">cows rule!</b><br></br>
<!--ts-->
get it now<table width="100%"><tr><td>NOW</td><th>NOW</th></tr></table><span>244 Cows</span><br></br>67 leather<br></br>68 Brains
<!--yy-->
<span class="q0">Cow Bonus: +9 Cow Power</span><br></br>Sheep Power 60 / 60<br></br>Sheep 88<br></br>Cow Level 555</td></tr></table>
<!--?5695:5:40:45-->
';
//]]>
          </script>
        </div>
      </div>
    </div>
  </body>
</html>
One thing you need to be careful of when grabbing CData with BeautifulSoup is not to use the lxml parser.
By default, the lxml parser will strip CDATA sections from the tree and replace them with their plain text content. Learn more here.
#Trying it with html.parser
>>> from bs4 import BeautifulSoup
>>> import bs4
>>> s='''<?xml version="1.0" ?>
<foo>
<bar><![CDATA[
aaaaaaaaaaaaa
]]></bar>
</foo>'''
>>> soup = BeautifulSoup(s, "html.parser")
>>> soup.find(text=lambda tag: isinstance(tag, bs4.CData)).string.strip()
'aaaaaaaaaaaaa'
>>>
BeautifulSoup sees CData as a special case (subclass) of "navigable strings". So for example:
import BeautifulSoup

txt = '''<foobar>We have
<![CDATA[some data here]]>
and more.
</foobar>'''
soup = BeautifulSoup.BeautifulSoup(txt)
for cd in soup.findAll(text=True):
    if isinstance(cd, BeautifulSoup.CData):
        print 'CData contents: %r' % cd
In your case of course you could look in the subtree starting at the div with the 'main-contents' ID, rather than all over the document tree.
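A minimal sketch of that narrowing, assuming the document from the question has been parsed into soup:
# search only inside the div with the 'main-contents' id
main = soup.find('div', {'id': 'main-contents'})
for cd in main.findAll(text=True):
    if isinstance(cd, BeautifulSoup.CData):
        print 'CData contents: %r' % cd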
You could try this:
from BeautifulSoup import BeautifulSoup

# source.html contains your html above
f = open('source.html')
soup = BeautifulSoup(''.join(f.readlines()))
s = soup.findAll('script')
cdata = s[0].contents[0]
That should give you the contents of cdata.
Update
This may be a little cleaner:
from BeautifulSoup import BeautifulSoup
import re

# source.html contains your html above
f = open('source.html')
soup = BeautifulSoup(''.join(f.readlines()))
cdata = soup.find(text=re.compile("CDATA"))
Just personal preference, but I like the bottom one a little better.
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(content)
for x in soup.find_all('item'):
    # note: '[\[CDATA\]]' is a character class, so this strips the
    # individual characters [ ] C D A T anywhere in the string
    print re.sub('[\[CDATA\]]', '', x.string)
For anyone using BeautifulSoup4, Alex Martelli's solution works, but do this:
from bs4 import BeautifulSoup, CData

soup = BeautifulSoup(txt)
for cd in soup.findAll(text=True):
    if isinstance(cd, CData):
        print 'CData contents: %r' % cd
