Parse paragraph and subsequent element with BeautifulSoup in one loop cycle - python

I have a long list of blog comments which are coded as
<p> This is the text <br /> of the comment </p>
<div id="comment_details"> Here the details of the same comment </div>
I need to parse each comment and its details in the same cycle of the loop, so that they are stored together in order.
Yet I am not sure how I should proceed: I can parse them easily in two separate loops, but is it elegant and practical to do it in only one?
Please consider the following MWE
from bs4 import BeautifulSoup
html_doc = """
<!DOCTYPE html>
<html>
<body>
<div id="firstDiv">
<br></br>
<p>My first paragraph.<br />But this a second line</p>
<div id="secondDiv">
<b>Date1</b>
</div>
<br></br>
<p>My second paragraph.</p>
<div id="secondDiv">
<b>Date2</b>
</div>
<br></br>
<p>My third paragraph.</p>
<div id="secondDiv">
<b>Date3</b>
</div>
<br></br>
<p>My fourth paragraph.</p>
<div id="secondDiv">
<b>Date4</b>
</div>
<br></br>
<p>My fifth paragraph.</p>
<div id="secondDiv">
<b>Date5</b>
</div>
<br></br>
</div>
</body>
</html>
"""
soup = BeautifulSoup(html_doc)
for p in soup.find(id="firstDiv").find_all("p"):
    print p.get_text()
for div in soup.find(id="firstDiv").find_all(id="secondDiv"):
    print div.b.get_text()

If you really wanted the subsequent sibling, that would be easy:
for p in soup.find(id="firstDiv").find_all("p"):
    print p.get_text()
    print p.next_sibling.b.get_text()
However, the next thing after the p is the string `'\n'`, not the div you want.
The problem is, there is no real structural relationship between the p and the div; it just so happens that each p always has a div with a certain id as a later sibling, and you want to exploit that. (If you're generating this HTML, obviously fix the structure… but I'll assume you aren't.) So, here are some options.
The best is probably:
for p in soup.find(id="firstDiv").find_all("p"):
    print p.get_text()
    print p.find_next_sibling(id='secondDiv').b.get_text()
If you only care about this particular document, and you know that it will always be true that the next sibling after the next sibling is the div you want:
    print p.get_text()
    print p.next_sibling.next_sibling.b.get_text()
Or you could rely on the fact that find_next_sibling() with no argument, unlike next_sibling, skips to the first actual DOM element, so:
    print p.get_text()
    print p.find_next_sibling().b.get_text()
If you don't want to rely on any of that, but can count on the fact that they're always one-to-one (that is, no chance of any stray p elements without a corresponding secondDiv), you can just zip the two searches together:
fdiv = soup.find(id='firstDiv')
for p, sdiv in zip(fdiv.find_all('p'), fdiv.find_all(id='secondDiv')):
    print p.get_text(), sdiv.b.get_text()
You could also iterate p.next_siblings to find the element you want:
from bs4 import Tag
for p in soup.find(id='firstDiv').find_all('p'):
    div = next(sib for sib in p.next_siblings
               if isinstance(sib, Tag) and sib.get('id') == 'secondDiv')
    print p.get_text(), div.b.get_text()
But ultimately, that's just a more verbose way of writing the first solution, so go back to that one. :)
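Putting it together with the MWE above, here is a minimal sketch (my own, not part of the original answer) of how the recommended find_next_sibling() approach can store each comment next to its details, which is what the question asked for:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc)
pairs = []
for p in soup.find(id="firstDiv").find_all("p"):
    details = p.find_next_sibling(id="secondDiv")
    # keep each paragraph's text next to its details, in document order
    pairs.append((p.get_text(), details.b.get_text()))
print pairs  # e.g. [(u'My first paragraph.But this a second line', u'Date1'), ...]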

Related

Find an element with Python Selenium

Can somebody help me find a way to click on an img with Python Selenium?
I need to find the div class="A" which has a div class="B", and then click on its img in class="D".
<div class="A">
...
</div>
...
<div class="A">
<div class="B"> <!-- HERE IS WHAT I NEED TO CHECK-->
<span class="E"> <!-- HERE IS WHAT I NEED TO CHECK-->
<span class="F"> <!-- HERE IS WHAT I NEED TO CHECK-->
</div> <!-- HERE IS WHAT I NEED TO CHECK-->
<div class="C"> </div>
<div class="D">
<img class="Y" src="..."> <!-- HERE IS WHAT I NEED TO CLICK IF class="B" DOES EXIST (EVERY class="A" has an IMG in div class="D")-->
</div>
</div>
...
<div class="A">
...
</div>
I know I have to use an XPath or CSS selector, but a simple relative/absolute XPath won't do here... Please help me.
To check if the element exists you can use driver.find_elements.
It returns a list of web elements matching the passed locator.
If the element exists it returns a non-empty list, which is treated as True in Python; otherwise it returns an empty list, which is treated as False.
UPD:
According to the latest requirements your code can be something like this:
a_list = driver.find_elements_by_xpath("//div[@class='A']")
for a_el in a_list:
    if a_el.find_elements_by_xpath(".//div[@class='B']"):
        a_el.find_element_by_xpath(".//img[@class='Y']").click()
You can try this:
if driver.find_elements_by_xpath("//div[@class='A']/div[@class='B']"):  # Finding the element in case it exists.
    image_to_click = driver.find_element_by_xpath("//div[@class='D']//img[@class='Y']")
    image_to_click.click()  # Clicking on the image.
Thanks for your help!
The final code is:
skin_list = driver.find_elements_by_class_name("B")
if len(skin_list) != 0:
    skin_list[0].find_element_by_xpath("..//div[@class='D']").click()
But now my new problem is that the "skin_list = driver.find_elements_by_class_name" search takes a long time...
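One hedged suggestion (not from the answers above): push the whole condition into a single compound XPath, so the driver does the filtering in one query instead of a per-element class-name scan. The class names A/B/D/Y are taken from the markup in the question.
# Click the img inside div.D of every div.A that also contains a div.B,
# using one compound XPath instead of a per-element search.
for img in driver.find_elements_by_xpath(
        "//div[@class='A'][.//div[@class='B']]//div[@class='D']/img[@class='Y']"):
    img.click()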

XPath for anchor element not in certain parent element?

Using XPath, how can I get all anchor tags from the second paragraph except the ones in italics? (The question and example have been simplified. Imagine a regular HTML page with multiple <p> and <a> elements.)
<html><body>
<p>
<a href="a.html">A
<b><a href="b.html">B</b>
<i><a href="c.html">C</i>
</p>
<p>
<b><a href="e.html">E</b>
<a href="f.html">F
<i><a href="g.html">G</i>
</p>
</body></html>
Should get:
<a href="e.html">
<a href="f.html">
What I have:
root.xpath('//body//p')[1].xpath('a[not(self::i)]')
I am only getting:
`<a href="f.html">`
Try the XPath below to get the required output:
//p[2]//a[not(parent::i)]
As @Andersson commented, it's unclear where your a elements are supposed to end.
Assuming that your a elements are meant to be self-closing,
<html><body>
<p>
<a href="a.html"/>
<b><a href="b.html"/></b>
<i><a href="c.html"/></i>
</p>
<p>
<b><a href="e.html"/></b>
<a href="f.html"/>
<i><a href="g.html"/></i>
</p>
</body>
</html>
Then this XPath,
/html/body/p[2]//a[not(parent::i)]
selects all of the a descendants of the second paragraph whose parent is not an i element:
<a href="e.html"/>
<a href="f.html"/>
Credit: Thanks to @Andersson for a correction. Please upvote his answer. Thanks.
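If you want to check this quickly with lxml, here is a small sketch (assuming the self-closing markup above is stored in a string called html_doc, a name of my own choosing):
from lxml import etree

root = etree.fromstring(html_doc)  # html_doc holds the self-closing markup above
for a in root.xpath('/html/body/p[2]//a[not(parent::i)]'):
    print(etree.tostring(a))  # prints <a href="e.html"/> and then <a href="f.html"/>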

How to join two elements together in scrapy?

I am trying to join all the "The Text" parts into one string or one item in my scrapy output file. The source code is below:
<div class="sth">
<h3 class="sth">The Text</h3>
<h4 class="sth2">
<span class="sth11">The Text</span>
</h4>
<h4 class="sth3">
<span class="sth11">The Text</span>
<span>The Text</span>
</h4>
</div>
Is there a good way to join all the "The Text" element all together into one item or one string?
Considering that you want any text that is a descendant of the wrapping div, that you want to join the pieces with a newline, and that you will run this inside a scrapy parse method, you could use:
"\n".join(response.xpath("//div[@class='sth']/descendant::*/text()").extract())

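For context, a minimal sketch of how that line might sit inside a spider's parse method; the spider name, start URL, and the 'text' item key are placeholders of my own, not from the question:
import scrapy

class TextSpider(scrapy.Spider):
    name = "text_spider"
    start_urls = ["http://example.com"]  # placeholder URL

    def parse(self, response):
        # join every text node under the wrapping div with newlines
        joined = "\n".join(
            response.xpath("//div[@class='sth']/descendant::*/text()").extract())
        yield {"text": joined}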
Split html document in pieces parsing html comments with BeautifulSoup

This is a pretty small question that has been almost resolved in a previous question.
The problem is that right now I have an array of comments, but it's not quite what I need: I get an array of the comments' content, and I need to get the HTML in between them.
Say I have something like:
<p>some html here</p>
<!-- begin mark -->
<p>Html i'm interested at.</p>
<p>More html i want to pull out of the document.</p>
<!-- end mark -->
<!-- begin mark -->
<p>This will be pulled later, but we will come to it when I get to pull the previous section.</p>
<!-- end mark -->
In a reply, they point to the Crummy explanation of navigating the HTML tree, but I didn't find an answer to my problem there.
Any ideas? Thanks.
PS. Extra kudos if someone points me to an elegant way to repeat the process a few times in a document, as I could probably get it to work, but poorly :D
Edited to add:
With the information provided by Martijn Pieters, I can pass each comment obtained with the above code to the generator function he designed. So this runs without error:
for elem in comments:
    htmlcode = list(allnext(elem))
    print htmlcode
I think it will now be possible to manipulate the htmlcode content before iterating through the array.
You can use the .next_sibling pointer to get to the next element. You can use that to find everything following a comment, up to but not including another comment:
from bs4 import Comment

def allnext(comment):
    curr = comment
    while True:
        curr = curr.next_sibling
        if isinstance(curr, Comment):
            return
        yield curr
This is a generator function; you use it to iterate over all 'next' elements:
for elem in allnext(comment):
    print elem
or you can use it to create a list of all next elements:
elems = list(allnext(comment))
Your example is a little too small for BeautifulSoup (it'll wrap each comment in a <p> tag), but if we use a snippet from your original target, www.gamespot.com, this works just fine:
<div class="ad_wrap ad_wrap_dart"><div style="text-align:center;"><img alt="Advertisement" src="http://ads.com.com/Ads/common/advertisement.gif" style="display:block;height:10px;width:120px;margin:0 auto;"/></div>
<!-- start of gamespot gpt ad tag -->
<div id="div-gpt-ad-1359295192-lb-top">
<script type="text/javascript">
googletag.display('div-gpt-ad-1359295192-lb-top');
</script>
<noscript>
<a href="http://pubads.g.doubleclick.net/gampad/jump?iu=/6975/row/gamespot.com/home&sz=728x90|970x66|970x150|970x250|960x150&t=pos%3Dtop%26platform%3Ddesktop%26&c=1359295192">
<img src="http://pubads.g.doubleclick.net/gampad/ad?iu=/6975/row/gamespot.com/home&sz=728x90|970x66|970x150|970x250|960x150&t=pos%3Dtop%26platform%3Ddesktop%26&c=1359295192"/>
</a>
</noscript>
</div>
<!-- end of gamespot gpt tag -->
</div>
If comment is a reference to the first comment in that snippet, the allnext() generator gives me:
>>> list(allnext(comment))
[u'\n', <div id="div-gpt-ad-1359295192-lb-top">
<script type="text/javascript">
googletag.display('div-gpt-ad-1359295192-lb-top');
</script>
<noscript>
<a href="http://pubads.g.doubleclick.net/gampad/jump?iu=/6975/row/gamespot.com/home&sz=728x90|970x66|970x150|970x250|960x150&t=pos%3Dtop%26platform%3Ddesktop%26&c=1359295192">
<img src="http://pubads.g.doubleclick.net/gampad/ad?iu=/6975/row/gamespot.com/home&sz=728x90|970x66|970x150|970x250|960x150&t=pos%3Dtop%26platform%3Ddesktop%26&c=1359295192"/>
</a>
</noscript>
</div>, u'\n']
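As for repeating the process for every marked section (the "extra kudos" part of the question), here is a minimal sketch assuming the document is already parsed into soup and that the sections are delimited by 'begin mark' comments as in the example above:
from bs4 import Comment

# find every "begin mark" comment; allnext() then stops at the matching "end mark"
begin_marks = soup.find_all(text=lambda t: isinstance(t, Comment) and 'begin mark' in t)
sections = [list(allnext(c)) for c in begin_marks]
for section in sections:
    print section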

Editing tree in place while iterating in lxml

I am using lxml to parse html and edit it to produce a new document. Essentially, I'm trying to use it somewhat like the javascript DOM - I know this is not really the intended use, but much of it works well so far.
Currently, I use iterdescendants() to get an iterator over the elements and then deal with each in turn.
However, if an element is dropped during the iteration, its children are still considered, since the dropping does not affect the iteration, as you would expect. In order to get the results I want, this hack works:
from lxml.html import fromstring, tostring
import re

html = '''
<html>
<head>
</head>
<body>
<div>
<p class="unwanted">This content should go</p>
<p class="fine">This content should stay</p>
</div>
<div id="second" class="unwanted">
<p class="alreadydead">This content should not be looked at</p>
<p class="alreadydead">Nor should this</p>
<div class="alreadydead">
<p class="alreadydead">Still dead</p>
</div>
</div>
<div>
<p class="yeswanted">This content should also stay</p>
</div>
</body>
</html>
'''

page = fromstring(html)
allElements = page.iterdescendants()
for element in allElements:
    s = "%s%s" % (element.get('class', ''), element.get('id', ''))
    if re.compile('unwanted').search(s):
        # manually advance the iterator past the children of the element
        # that is about to be dropped
        for i in range(len(element.findall('.//*'))):
            allElements.next()
        element.drop_tree()
print tostring(page.body)
This outputs:
<body>
<div>
<p class="fine">This content should stay</p>
</div>
<div>
<p class="yeswanted">This content should also stay</p>
</div>
</body>
This feels like a nasty hack - is there a more sensible way to achieve this using the library?
To simplify things you can use lxml's support for regular expressions within an XPath to find and kill the unwanted nodes without needing to iterate over all descendants.
This produces the same result as your script:
import lxml.html

EXSLT_NS = 'http://exslt.org/regular-expressions'
XPATH = r"//*[re:test(@class, '\bunwanted\b') or re:test(@id, '\bunwanted\b')]"

tree = lxml.html.fromstring(html)
for node in tree.xpath(XPATH, namespaces={'re': EXSLT_NS}):
    node.drop_tree()
print lxml.html.tostring(tree.body)
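Note that tree.xpath() returns a plain list up front, so dropping nodes inside the loop doesn't disturb the iteration the way it does with iterdescendants(). If you'd rather not depend on the EXSLT extension at all, here is a hedged alternative sketch using plain XPath string functions; it matches "unwanted" when it appears as a whole class/id token, which covers the example above:
import lxml.html

# match "unwanted" as a whole token in class or id, without EXSLT regexes
TOKEN = "contains(concat(' ', normalize-space(@%s), ' '), ' unwanted ')"
XPATH = "//*[%s or %s]" % (TOKEN % 'class', TOKEN % 'id')

tree = lxml.html.fromstring(html)
for node in tree.xpath(XPATH):
    node.drop_tree()
print lxml.html.tostring(tree.body)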
