Beautiful Soup replaces < with &lt; - python

I've found the text I want to replace, but when I print soup the format gets changed. <div id="content">stuff here</div> becomes &lt;div id="content"&gt;stuff here&lt;/div&gt;. How can I preserve the data? I have tried print(soup.encode(formatter="none")), but that produces the same incorrect format.
from bs4 import BeautifulSoup
with open(index_file) as fp:
    soup = BeautifulSoup(fp, "html.parser")
found = soup.find("div", {"id": "content"})
found.replace_with(data)
When I print found, I get the correct format:
>>> print(found)
<div id="content">stuff</div>
index_file contents are below:
<!DOCTYPE html>
<head>
Apples
</head>
<body>
<div id="page">
This is the Id of the page
<div id="main">
<div id="content">
stuff here
</div>
</div>
footer should go here
</div>
</body>
</html>

The found object is not a Python string, it's a Tag that just happens to have a nice string representation. You can verify this by doing
type(found)
A Tag is part of the hierarchy of objects that Beautiful Soup creates for you to be able to interact with the HTML. Another such object is NavigableString. NavigableString is a lot like a string, but it can only contain things that would go into the content portion of the HTML.
When you do
found.replace_with('<div id="content">stuff here</div>')
you are asking the Tag to be replaced with a NavigableString containing that literal text. The only way for HTML to be able to display that string is to escape all the angle brackets, as it's doing.
Instead of that mess, you probably want to keep your Tag and replace only its content:
found.string.replace_with('stuff here')
Notice that the correct replacement does not attempt to overwrite the tags.
When you do found.replace_with(...), the object referred to by the name found gets replaced in the parent hierarchy. However, the name found keeps pointing to the same outdated object as before. That is why printing soup shows the update, but printing found does not.
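If you really do need to swap in a whole new element rather than just its text, one option is to parse the replacement markup first, so that a Tag rather than a plain string is handed to replace_with. Here is a minimal sketch of both cases; the short html string is a made-up stand-in for index_file, and html.parser is assumed as in the question:

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the contents of index_file.
html = '<div id="main"><div id="content">old</div></div>'
replacement = '<div id="content">stuff here</div>'

# Case 1: a plain string becomes a NavigableString, so its angle
# brackets are escaped when the soup is rendered.
soup = BeautifulSoup(html, "html.parser")
soup.find("div", {"id": "content"}).replace_with(replacement)
escaped = str(soup)

# Case 2: parse the replacement first, then pass the resulting Tag.
soup = BeautifulSoup(html, "html.parser")
new_tag = BeautifulSoup(replacement, "html.parser").div
soup.find("div", {"id": "content"}).replace_with(new_tag)
preserved = str(soup)

print(escaped)
print(preserved)
```

In the first case the markup comes out as escaped text; in the second it stays a real element in the tree.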

Related

BeautifulSoup: can I prettify without adding extra tags?

Here's my code:
from bs4 import BeautifulSoup as bs
html = "<div><span>I am Spantacus</div></span>"
pretty = bs(html).prettify()
print("after:\n", pretty)
What I want:
A nicely indented and newline-d representation of the html, without anything added i.e.
<div>
 <span>
  I am Spantacus
 </span>
</div>
What I get instead:
<html>
 <body>
  <div>
   <span>
    I am Spantacus
   </span>
  </div>
 </body>
</html>
From stepping into prettify(), it seems the html, body tags get added by the soup's __init__, not by prettify. Is there some keyword or option to disable this addition?
Try this:
from bs4 import BeautifulSoup
html = "<div><span>I am Spantacus</div></span>"
soup1 = BeautifulSoup(html, "html.parser")
# html.parser leaves the fragment unwrapped; the lxml parser would add <html> and <body>
pretty = soup1.prettify()
print("after:\n", pretty)
Use the following to traverse to the body and print each of its children:
for c in soup.html.body.contents:
    print(c.prettify())
Modify it based on your needs.
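To check the parser's effect for yourself, a small sketch like the following may help. It assumes a well-nested version of the fragment (the closing tags in the question are swapped) and relies only on html.parser from the standard library:

```python
from bs4 import BeautifulSoup

# Well-nested variant of the fragment from the question (an assumption;
# the original has </div> and </span> in the wrong order).
html = "<div><span>I am Spantacus</span></div>"

# html.parser parses the fragment as given and does not wrap it
# in <html> or <body>.
fragment = BeautifulSoup(html, "html.parser").prettify()
print(fragment)
```

The printed result starts directly at the <div>, with no document-level wrapper around it.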

Can't get data from inside of span-tag with beautifulsoup

I am trying to scrape an Instagram page and want to access the div tags inside the span tag, but I can't. The HTML of the Instagram page looks like this:
<head>--</head>
<body>
<span id="react-root" aria-hidden="false">
<form enctype="multipart/form-data" method="POST" role="presentation">…</form>
<section class="_9eogI E3X2T">
<main class="SCxLW o64aR" role="main">
<div class="v9tJq VfzDr">
<header class=" HVbuG">…</header>
<div class="_4bSq7">…</div>
<div class="fx7hk">…</div>
</div>
</main>
</section>
</body>
This is what I do:
from bs4 import BeautifulSoup
import urllib.request as urllib2
html_page = urllib2.urlopen("https://www.instagram.com/cherrified_/?hl=en")
soup = BeautifulSoup(html_page,"lxml")
span_tag = soup.find('span') # return span-tag correctly
span_tag.find_all('div') # return empty list, why ?
Please also include an example.
Instagram is a Single Page Application powered by React, which means its source is just a simple "empty" page that loads JavaScript to dynamically generate the content in the browser after downloading.
Click "View source" or go to view-source:https://www.instagram.com/cherrified_/?hl=en in Chrome. This is the HTML you download with urllib.request.
You can see that there is a single <span> tag, which does not include a <div> tag. (Note: <div> inside a <span> is not allowed).
Scraping instagram.com this way is not possible. It also might not be legal (I am not a lawyer).
Notes:
your HTML code example doesn't include a closing tag for <span>.
your HTML code example doesn't match the link you provide in the python snippet.
in the last line of the python snippet you probably meant span_tag.find_all('div') (note the variable name and the singular 'div').
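The effect is easy to reproduce offline. The sketch below uses a made-up static_html string standing in for the kind of page source a JavaScript-driven site serves (it is not Instagram's real markup): the span exists, but nothing has been rendered into it yet.

```python
from bs4 import BeautifulSoup

# Hypothetical page source: the <span> is an empty mount point that a
# script would fill in later, inside a browser.
static_html = (
    '<html><body>'
    '<span id="react-root"></span>'
    '<script>/* code that builds the page in the browser */</script>'
    '</body></html>'
)

soup = BeautifulSoup(static_html, "html.parser")
span_tag = soup.find("span")       # found: the tag is in the source
divs = span_tag.find_all("div")    # empty: no <div> has been generated

print(span_tag)
print(len(divs))
```

Beautiful Soup only sees what the server sent; it never executes the script that would create the div tags.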

Unwrap "a" tag from image, without losing content

I want to remove the 'a' tag (link) from all the images found. For performance, I made a list of all images in the HTML, then look for the wrapping 'a' tag and simply remove the link.
I am using BeautifulSoup and am not sure what I am doing wrong: instead of removing the 'a' tag, it seems to remove the inner content.
This is what I did:
from bs4 import BeautifulSoup
html = '''<div> <img src="http://imgsrc.jpg" /> <img src="http://imgsrc2.jpg />" '''
soup = BeautifulSoup(html)
for img in soup.find_all('img'):
    print 'THIS IS THE BEGINING /////////////// '
    #print img.find_parent('a').unwrap()
    print img.parent.unwrap()
This gives me the following output:
>>> print img.parent()
<img src="http://imgsrc.jpg" />
<img src="http://imgsrc2.jpg />
>>> print img.parent.unwrap()
I have tried replaceWith and replaceWithChildren, but they don't work when I use object.parent or findParent.
I am not sure what I am doing wrong.
It's been just a few weeks since I started Python.
The unwrap() function returns the tag that has been removed. The tree itself has been properly modified. Quoting from the unwrap() documentation:
Like replace_with(), unwrap() returns the tag that was replaced.
In other words: it works correctly! Print the new parent of img instead of the return value of unwrap() to see that the <a> tags have indeed been removed:
>>> from bs4 import BeautifulSoup
>>> html = '''<div> <img src="http://imgsrc.jpg" /> <img src="http://imgsrc2.jpg />" '''
>>> soup = BeautifulSoup(html)
>>> for img in soup.find_all('img'):
...     img.parent.unwrap()
...     print img.parent
...
<div> <img src="http://imgsrc.jpg"/> <img src="http://imgsrc2.jpg /></a>"/></div>
<div> <img src="http://imgsrc.jpg"/> <img src="http://imgsrc2.jpg /></a>"/></div>
Here Python echoes the img.parent.unwrap() return value, followed by the output of the print statement showing the parent of the <img> tag is now the <div> tag. The first print shows the other <img> tag still wrapped, the second print shows them both as direct children of the <div> tag.
I'm not sure what output you are looking for. Is this it?
from bs4 import BeautifulSoup
html = '''<div> <img src="http://imgsrc.jpg" /> <img src="http://imgsrc2.jpg" /> '''
soup = BeautifulSoup(html)
for img in soup.find_all('img'):
    img.parent.unwrap()
print(soup)
yields
<html><body><div> <img src="http://imgsrc.jpg"/> <img src="http://imgsrc2.jpg"/></div></body></html>
I haven't worked much with Python, but it looks like unwrap returns the HTML that was removed, and not the img tag you're looking for. Try calling soup.prettify() and see if the link was removed after all.
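For anyone who wants to try this out, here is a small self-contained sketch of the same idea. The href values and image names are placeholders I made up, and the check on parent.name is an extra safety step that is not in the answers above:

```python
from bs4 import BeautifulSoup

# Placeholder markup: two images, each wrapped in a link.
html = (
    '<div>'
    '<a href="#one"><img src="a.jpg"/></a> '
    '<a href="#two"><img src="b.jpg"/></a>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")

removed = []
for img in soup.find_all("img"):
    parent = img.parent
    if parent.name == "a":                # only unwrap actual links
        removed.append(parent.unwrap())   # unwrap() returns the removed <a>

result = str(soup)
print(result)    # the tree itself, with the images left in place
print(removed)   # the detached <a> tags, now without children
```

Inspecting the soup rather than the return value of unwrap() is what shows whether the links are gone.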

html5lib. How to get valid html without adding html, head and body tags?

I'm validating custom HTML from users with html5lib. The problem is the html5lib adds html, head and body tags, which I don't need.
parser = html5lib.HTMLParser(tree=treebuilders.getTreeBuilder("simpleTree"))
f = open('/home/user/ex.html')
doc = parser.parse(f)
doc.toxml()
'<html><head/><body><div>\n speedhunters.com\n</div>\n</body></html>'
This is validated and can be sanitized, but how can I remove these tags, or prevent them from being added to the tree?
I mean without resorting to a string replace.
It seems that we can use the hidden property of Tags in order to prevent the tag itself from being 'exported' when casting a tag/soup to string/unicode:
>>> from bs4 import BeautifulSoup
>>> html = u"<div><footer><h3>foot</h3></footer></div><div>foo</div>"
>>> soup = BeautifulSoup(html, "html5lib")
>>> print soup.body.prettify()
<body>
 <div>
  <footer>
   <h3>
    foot
   </h3>
  </footer>
 </div>
 <div>
  foo
 </div>
</body>
Essentially, the questioner's goal is to get the entire content of the body tag without the <body> wrapper itself. This works:
>>> soup.body.hidden=True
>>> print soup.body.prettify()
<div>
<footer>
<h3>
foot
</h3>
</footer>
</div>
<div>
foo
</div>
I found this by going through the BeautifulSoup source. After calling soup = BeautifulSoup(html), the root tag has the internal name '[document]'. By default, only the root tag has hidden==True. This prevents its name from ending up in any HTML output.
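Since hidden is an internal attribute, a possible alternative is the public decode_contents() method, which renders a tag's children without the tag itself. This is a different technique from the one above, shown here only as a sketch. It uses html.parser with an explicit <body> in the input so that it runs without html5lib; with html5lib the <body> would be added for you.

```python
from bs4 import BeautifulSoup

# Input with an explicit <body>, so html.parser produces a body tag.
html = (
    "<html><head></head><body>"
    "<div><footer><h3>foot</h3></footer></div><div>foo</div>"
    "</body></html>"
)
soup = BeautifulSoup(html, "html.parser")

# decode_contents() returns the markup of the children only,
# leaving out the <body> wrapper.
inner = soup.body.decode_contents()
print(inner)
```

This leaves the tree untouched, which can matter if the soup is used again afterwards.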
Wow, html5lib has horrible documentation.
Looking through the source, and working on a quick test case, this appears to work:
import html5lib
from html5lib import treebuilders
# wrapped in a function so that the return statement is valid
def first_body_child(path):
    parser = html5lib.HTMLParser(tree=treebuilders.getTreeBuilder("simpleTree"))
    with open(path) as test:
        doc = parser.parse(test)
    for child in doc:
        if child.parent.name == "body":
            return child.toxml()
print first_body_child('test.html')
It's a bit hackish, but less so than a replace().
lxml may be a better choice if you're dealing with "uncommon" html.

Editing tree in place while iterating in lxml

I am using lxml to parse html and edit it to produce a new document. Essentially, I'm trying to use it somewhat like the javascript DOM - I know this is not really the intended use, but much of it works well so far.
Currently, I use iterdescendants() to get an iterator over the elements and then deal with each in turn.
However, if an element is dropped during the iteration, its children are still considered, since the dropping does not affect the iteration, as you would expect. In order to get the results I want, this hack works:
from lxml.html import fromstring, tostring
import urllib2
import re
html = '''
<html>
<head>
</head>
<body>
    <div>
        <p class="unwanted">This content should go</p>
        <p class="fine">This content should stay</p>
    </div>
    <div id = "second" class="unwanted">
        <p class = "alreadydead">This content should not be looked at</p>
        <p class = "alreadydead">Nor should this</p>
        <div class="alreadydead">
            <p class="alreadydead">Still dead</p>
        </div>
    </div>
    <div>
        <p class="yeswanted">This content should also stay</p>
    </div>
</body>
</html>
'''
page = fromstring(html)
allElements = page.iterdescendants()
for element in allElements:
    s = "%s%s" % (element.get('class', ''), element.get('id', ''))
    if re.compile('unwanted').search(s):
        # advance the iterator past the descendants of the element being dropped
        for i in range(len(element.findall('.//*'))):
            allElements.next()
        element.drop_tree()
print tostring(page.body)
This outputs:
<body>
    <div>
        <p class="fine">This content should stay</p>
    </div>
    <div>
        <p class="yeswanted">This content should also stay</p>
    </div>
</body>
This feels like a nasty hack - is there a more sensible way to achieve this using the library?
To simplify things you can use lxml's support for regular expressions within an XPath to find and kill the unwanted nodes without needing to iterate over all descendants.
This produces the same result as your script:
import lxml.html
EXSLT_NS = 'http://exslt.org/regular-expressions'
XPATH = r"//*[re:test(@class, '\bunwanted\b') or re:test(@id, '\bunwanted\b')]"
tree = lxml.html.fromstring(html)
# xpath() returns a list, so dropping nodes inside the loop does not disturb the iteration
for node in tree.xpath(XPATH, namespaces={'re': EXSLT_NS}):
    node.drop_tree()
print lxml.html.tostring(tree.body)
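A self-contained Python 3 variant of the same approach is sketched below, for readers who want to run it without the question's script. The shortened html string and its text content are my own stand-ins; the XPath expression and namespace are the ones from the answer.

```python
import lxml.html

# Shortened stand-in for the document in the question.
html = '''
<html><body>
<div><p class="unwanted">remove me</p><p class="fine">keep one</p></div>
<div id="second" class="unwanted"><p class="alreadydead">never visited</p></div>
<div><p class="yeswanted">keep two</p></div>
</body></html>
'''

EXSLT_NS = "http://exslt.org/regular-expressions"
XPATH = r"//*[re:test(@class, '\bunwanted\b') or re:test(@id, '\bunwanted\b')]"

tree = lxml.html.fromstring(html)

# xpath() returns a plain list of matches, so the tree can be edited
# inside the loop without upsetting any live iterator.
for node in tree.xpath(XPATH, namespaces={"re": EXSLT_NS}):
    node.drop_tree()

result = lxml.html.tostring(tree, encoding="unicode")
print(result)
```

Because the word-boundary pattern matches only whole class or id tokens, "yeswanted" is left alone, and the children of a dropped element never need to be visited.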
