This is a pretty small question that was almost resolved in a previous question.
The problem is that right now I have an array of comments, but it is not quite what I need: I get an array of the comments' contents, whereas I need the HTML in between the comments.
Say I have something like:
<p>some html here</p>
<!-- begin mark -->
<p>Html i'm interested at.</p>
<p>More html i want to pull out of the document.</p>
<!-- end mark -->
<!-- begin mark -->
<p>This will be pulled later, but we will come to it when I get to pull the previous section.</p>
<!-- end mark -->
In a reply, someone pointed me to the Crummy explanation of navigating the HTML tree, but I didn't find an answer to my problem there.
Any ideas? Thanks.
PS: Extra kudos if someone points me to an elegant way to repeat the process several times in a document; I could probably get it to work on my own, but only poorly :D
Edited to add:
With the information provided by Martijn Pieters, I managed to pass each comment in the array obtained with the above code to the generator function he designed. So this gives no error:
for elem in comments:
    htmlcode = list(allnext(elem))  # materialise the generator for this comment
    print htmlcode
I think it will now be possible to manipulate the htmlcode content before iterating through the array.
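For instance, to turn each extracted section back into a single HTML string, here is a minimal sketch (assuming the allnext() generator from the answer below):

for comment in comments:
    # join the extracted siblings back into one HTML string
    section_html = u''.join(unicode(elem) for elem in allnext(comment))
    print section_html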
You can use the .next_sibling pointer to get to the next element. You can use that to find everything following a comment, up to but not including another comment:
from bs4 import Comment

def allnext(comment):
    curr = comment
    while True:
        curr = curr.next_sibling
        # stop at the next comment, or at the end of the siblings
        if curr is None or isinstance(curr, Comment):
            return
        yield curr
This is a generator function; you use it to iterate over all 'next' elements:

for elem in allnext(comment):
    print elem
or you can use it to create a list of all next elements:
elems = list(allnext(comment))
Your example is a little too small for BeautifulSoup (it'll wrap each comment in a <p> tag), but if we use a snippet from your original target, www.gamespot.com, this works just fine:
<div class="ad_wrap ad_wrap_dart"><div style="text-align:center;"><img alt="Advertisement" src="http://ads.com.com/Ads/common/advertisement.gif" style="display:block;height:10px;width:120px;margin:0 auto;"/></div>
<!-- start of gamespot gpt ad tag -->
<div id="div-gpt-ad-1359295192-lb-top">
<script type="text/javascript">
googletag.display('div-gpt-ad-1359295192-lb-top');
</script>
<noscript>
<a href="http://pubads.g.doubleclick.net/gampad/jump?iu=/6975/row/gamespot.com/home&sz=728x90|970x66|970x150|970x250|960x150&t=pos%3Dtop%26platform%3Ddesktop%26&c=1359295192">
<img src="http://pubads.g.doubleclick.net/gampad/ad?iu=/6975/row/gamespot.com/home&sz=728x90|970x66|970x150|970x250|960x150&t=pos%3Dtop%26platform%3Ddesktop%26&c=1359295192"/>
</a>
</noscript>
</div>
<!-- end of gamespot gpt tag -->
</div>
If comment is a reference to the first comment in that snippet, the allnext() generator gives me:
>>> list(allnext(comment))
[u'\n', <div id="div-gpt-ad-1359295192-lb-top">
<script type="text/javascript">
googletag.display('div-gpt-ad-1359295192-lb-top');
</script>
<noscript>
<a href="http://pubads.g.doubleclick.net/gampad/jump?iu=/6975/row/gamespot.com/home&sz=728x90|970x66|970x150|970x250|960x150&t=pos%3Dtop%26platform%3Ddesktop%26&c=1359295192">
<img src="http://pubads.g.doubleclick.net/gampad/ad?iu=/6975/row/gamespot.com/home&sz=728x90|970x66|970x150|970x250|960x150&t=pos%3Dtop%26platform%3Ddesktop%26&c=1359295192"/>
</a>
</noscript>
</div>, u'\n']
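To repeat the process for every marked section in the document (the PS in the question), one option is to collect all the opening comments first and run the generator on each. A minimal sketch, assuming html holds the document and the opening markers literally contain the text 'begin mark':

from bs4 import BeautifulSoup, Comment

soup = BeautifulSoup(html, 'html.parser')

# only the comments that open a marked section
begin_marks = soup.find_all(text=lambda t: isinstance(t, Comment) and 'begin mark' in t)

for comment in begin_marks:
    section = list(allnext(comment))
    print section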
I am trying to scrape an Instagram page, and want to access the div tags present inside a span tag, but I can't! The HTML of the Instagram page looks like this:
<head>--</head>
<body>
<span id="react-root" aria-hidden="false">
<form enctype="multipart/form-data" method="POST" role="presentation">…</form>
<section class="_9eogI E3X2T">
<main class="SCxLW o64aR" role="main">
<div class="v9tJq VfzDr">
<header class=" HVbuG">…</header>
<div class="_4bSq7">…</div>
<div class="fx7hk">…</div>
</div>
</main>
</section>
</body>
I do it as follows:
from bs4 import BeautifulSoup
import urllib.request as urllib2
html_page = urllib2.urlopen("https://www.instagram.com/cherrified_/?hl=en")
soup = BeautifulSoup(html_page,"lxml")
span_tag = soup.find('span')  # returns the span tag correctly
span_tag.find_all('div')  # returns an empty list, why?
Please also provide an example.
Instagram is a Single Page Application powered by React, which means its source is just a simple "empty" page that loads JavaScript to dynamically generate the content in the browser after downloading.
Click "View source" or go to view-source:https://www.instagram.com/cherrified_/?hl=en in Chrome. This is the HTML you download with urllib.request.
You can see that there is a single <span> tag, which does not include a <div> tag. (Note: <div> inside a <span> is not allowed).
Scraping instagram.com this way is not possible. It also might not be legal (I am not a lawyer).
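You can confirm this from Python as well; a quick sketch reusing the code from the question (the page structure may have changed since this was written):

from bs4 import BeautifulSoup
import urllib.request as urllib2

html_page = urllib2.urlopen("https://www.instagram.com/cherrified_/?hl=en")
soup = BeautifulSoup(html_page, "lxml")

span_tag = soup.find('span')
print(span_tag)                  # the bare placeholder span from the static HTML
print(span_tag.find_all('div')) # [] - the divs only exist after JavaScript runs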
Notes:
your HTML code example doesn't include a closing tag for <span>.
your HTML code example doesn't match the link you provide in the python snippet.
in the last line of the python snippet you probably meant span_tag.find_all('div') (note the variable name and the singular 'div').
I've found the text I want to replace, but when I print soup the format gets changed: <div id="content">stuff here</div> becomes &lt;div id="content"&gt;stuff here&lt;/div&gt;. How can I preserve the original markup? I have tried print(soup.encode(formatter="none")), but that produces the same escaped output.
from bs4 import BeautifulSoup

with open(index_file) as fp:
    soup = BeautifulSoup(fp, "html.parser")

found = soup.find("div", {"id": "content"})
found.replace_with(data)
When I print found, I get the correct format:
>>> print(found)
<div id="content">stuff</div>
index_file contents are below:
<!DOCTYPE html>
<head>
Apples
</head>
<body>
<div id="page">
This is the Id of the page
<div id="main">
<div id="content">
stuff here
</div>
</div>
footer should go here
</div>
</body>
</html>
The found object is not a Python string; it's a Tag that just happens to have a nice string representation. You can verify this by doing
type(found)
A Tag is part of the hierarchy of objects that Beautiful Soup creates for you to be able to interact with the HTML. Another such object is NavigableString. NavigableString is a lot like a string, but it can only contain things that would go into the content portion of the HTML.
When you do
found.replace_with('<div id="content">stuff here</div>')
you are asking the Tag to be replaced with a NavigableString containing that literal text. The only way for HTML to be able to display that string is to escape all the angle brackets, as it's doing.
Instead of that mess, you probably want to keep your Tag and replace only its content:
found.string.replace_with('stuff here')
Notice that the correct replacement does not attempt to overwrite the tags.
When you do found.replace_with(...), the object referred to by the name found gets replaced in the parent hierarchy. However, the name found keeps pointing to the same outdated object as before. That is why printing soup shows the update, but printing found does not.
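A small sketch of that rebinding behaviour, using a throwaway document:

from bs4 import BeautifulSoup

soup = BeautifulSoup('<div id="content">stuff here</div>', "html.parser")
found = soup.find("div", {"id": "content"})

found.replace_with("replacement")
print(soup)   # replacement
print(found)  # <div id="content">stuff here</div> - the detached, outdated tag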
I have a long list of blog comments which are coded as
<p> This is the text <br /> of the comment </p>
<div id="comment_details"> Here the details of the same comment </div>
I need to parse each comment and its details in the same cycle of the loop, so as to store them in order.
Yet I am not sure how to proceed, because I can easily parse them in two separate loops. Is it elegant and practical to do it in only one?
Please consider the following MWE
from bs4 import BeautifulSoup
html_doc = """
<!DOCTYPE html>
<html>
<body>
<div id="firstDiv">
<br></br>
<p>My first paragraph.<br />But this a second line</p>
<div id="secondDiv">
<b>Date1</b>
</div>
<br></br>
<p>My second paragraph.</p>
<div id="secondDiv">
<b>Date2</b>
</div>
<br></br>
<p>My third paragraph.</p>
<div id="secondDiv">
<b>Date3</b>
</div>
<br></br>
<p>My fourth paragraph.</p>
<div id="secondDiv">
<b>Date4</b>
</div>
<br></br>
<p>My fifth paragraph.</p>
<div id="secondDiv">
<b>Date5</b>
</div>
<br></br>
</div>
</body>
</html>
"""
soup = BeautifulSoup(html_doc)

for p in soup.find(id="firstDiv").find_all("p"):
    print p.get_text()

for div in soup.find(id="firstDiv").find_all(id="secondDiv"):
    print div.b.get_text()
If you really wanted the subsequent sibling, that would be easy:
for p in soup.find(id="firstDiv").find_all("p"):
    print p.get_text()
    print p.next_sibling.b.get_text()
However, the next thing after the p is the string '\n', not the div you want.
The problem is, there is no real structural relationship between the p and the div; it just so happens that each p always has a div with a certain id as a later sibling, and you want to exploit that. (If you're generating this HTML, obviously fix the structure… but I'll assume you aren't.) So, here are some options.
The best is probably:
for p in soup.find(id="firstDiv").find_all("p"):
    print p.get_text()
    print p.find_next_sibling(id='secondDiv').b.get_text()
If you only care about this particular document, and you know that it will always be true that the next sibling after the next sibling is the div you want:
print p.get_text()
print p.next_sibling.next_sibling.b.get_text()
Or you could rely on the fact that find_next_sibling() with no argument, unlike next_sibling, skips to the first actual DOM element, so:
print p.get_text()
print p.find_next_sibling().b.get_text()
If you don't want to rely on any of that, but can count on the fact that they're always one-to-one (that is, no chance of any stray p elements without a corresponding secondDiv), you can just zip the two searches together:
fdiv = soup.find(id='firstDiv')
for p, sdiv in zip(fdiv.find_all('p'), fdiv.find_all(id='secondDiv')):
    print p.get_text(), sdiv.b.get_text()
You could also iterate p.next_siblings to find the element you want:
from bs4 import Tag

for p in soup.find(id='firstDiv').find_all('p'):
    div = next(sib for sib in p.next_siblings
               if isinstance(sib, Tag) and sib.get('id') == 'secondDiv')
    print p.get_text(), div.b.get_text()
But ultimately, that's just a more verbose way of writing the first solution, so go back to that one. :)
I am trying to extract data from a site which is in this format
<div id=storytextp class=storytextp align=center style='padding:10px;'>
<div id=storytext class=storytext>
<div class='a2a_kit a2a_default_style' style='float:right;margin-left:10px;border:none;'>
..... extra stuff
</div> **Main Content**
</div>
</div>
Note that the Main Content can contain other tags, but I want the entire content as a string.
So what I did was this:

_divTag = data.find("div", id="storytext")
innerdiv = _divTag.find("div")        # find the first inner div tag
innerdiv.contents[0].replaceWith("")  # replace its first child with an empty string

Thus _divTag should contain only the main content, but this does not work. Can anybody tell me what mistake I am making and how I should extract the main content?
Just do _divTag.contents[2].
Your formatting was maybe misleading you: this text does not belong to the innermost div tag (as innerdiv.text, innerdiv.contents or innerdiv.findChildren() will show you).
It makes things clearer if you indent your original HTML:
<div id=storytextp class=storytextp align=center style='padding:10px;'>
    <div id=storytext class=storytext>
        <div class='a2a_kit a2a_default_style' style='float:right;margin-left:10px;border:none;'>
            ..... extra stuff
        </div> **Main Content**
    </div>
</div>
(PS: I'm not clear what the intent of your innerdiv.contents[0].replaceWith("") was. To squelch the attributes? The newlines? Anyway, the BeautifulSoup philosophy is not to edit the parse tree, but simply to ignore the 99.9% of it that you don't care about; see the BeautifulSoup documentation.)
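A runnable sketch of that answer, using a trimmed-down version of the markup from the question (bs4 syntax; the original may have used the older BeautifulSoup 3, where find() works the same way):

from bs4 import BeautifulSoup

html = '''<div id="storytextp" class="storytextp">
<div id="storytext" class="storytext">
<div class="a2a_kit a2a_default_style">..... extra stuff</div> **Main Content**
</div>
</div>'''

soup = BeautifulSoup(html, "html.parser")
_divTag = soup.find("div", id="storytext")

# contents is [u'\n', <the inner div>, u' **Main Content**\n']
print(_divTag.contents[2])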
I am using lxml to parse HTML and edit it to produce a new document. Essentially, I'm trying to use it somewhat like the JavaScript DOM - I know this is not really the intended use, but much of it works well so far.
Currently, I use iterdescendants() to get an iterable of elements and then deal with each in turn.
However, if an element is dropped during the iteration, its children are still considered, since the dropping does not affect the iteration, as you would expect. In order to get the results I want, this hack works:
from lxml.html import fromstring, tostring
import urllib2
import re
html = '''
<html>
<head>
</head>
<body>
<div>
<p class="unwanted">This content should go</p>
<p class="fine">This content should stay</p>
</div>
<div id = "second" class="unwanted">
<p class = "alreadydead">This content should not be looked at</p>
<p class = "alreadydead">Nor should this</>
<div class="alreadydead">
<p class="alreadydead">Still dead</p>
</div>
</div>
<div>
<p class="yeswanted">This content should also stay</p>
</div>
</body>
</html>
'''

page = fromstring(html)
allElements = page.iterdescendants()

for element in allElements:
    s = "%s%s" % (element.get('class', ''), element.get('id', ''))
    if re.compile('unwanted').search(s):
        # skip over the descendants of the element we are about to drop
        for i in range(len(element.findall('.//*'))):
            allElements.next()
        element.drop_tree()

print tostring(page.body)
This outputs:
<body>
<div>
<p class="yeswanted">This content should stay</p>
</div>
<div>
<p class="yeswanted">This content should also stay</p>
</div>
</body>
This feels like a nasty hack - is there a more sensible way to achieve this using the library?
To simplify things you can use lxml's support for regular expressions within an XPath to find and kill the unwanted nodes without needing to iterate over all descendants.
This produces the same result as your script:
import lxml.html

EXSLT_NS = 'http://exslt.org/regular-expressions'
XPATH = r"//*[re:test(@class, '\bunwanted\b') or re:test(@id, '\bunwanted\b')]"

tree = lxml.html.fromstring(html)
for node in tree.xpath(XPATH, namespaces={'re': EXSLT_NS}):
    node.drop_tree()

print lxml.html.tostring(tree.body)
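Because the XPath matches the unwanted elements directly, drop_tree() removes each matched subtree in a single call, so there is no longer any need to manually skip over the dropped element's descendants during iteration.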