Python/Beautiful Soup find particular heading output full div - python

I'm attempting to parse a very extensive HTML document that looks something like this:
<div class="reportsubsection n">
<h2> part 1 </h2>
<p> insert text here </p>
<table> crazy table thing here </table>
</div>
<div class="reportsubsection n">
<h2> part 2 </h2>
<p> insert text here </p>
<table> crazy table thing here </table>
</div>
I need to parse out the second div based on its h2 containing the text "part 2". I was able to pull out a div with:
divTag = soup.find("div", {"id": "reportsubsection"})
but I didn't know how to narrow it down from there. From other posts I was able to find the specific text "part 2", but I need to be able to output the whole div section it is contained in.
EDIT/UPDATE
Ok, sorry, but I'm still a little lost. Here is what I've got now. I feel like this should be much simpler than I'm making it. Thanks again for all the help.
divTag = soup.find("div", {"id": "reportsubsection"})
for reportsubsection in soup.select('div#reportsubsection #reportsubsection'):
    if not reportsubsection.findAll('h2', text=re.compile('Finding')):
        continue
print divTag

You can always go back up after finding the right h2, or you can test all subsections:
for subsection in soup.select('div#reportsubsection #subsection'):
    if not subsection.find('h2', text=re.compile('part 2')):
        continue
    # do something with this subsection
This uses a CSS selector to locate all subsections.
Or, going back up with the .parent attribute:
for header in soup.find_all('h2', text=re.compile('part 2')):
    section = header.parent
The trick is to narrow your search as early as possible; the second option has to scan every h2 element in the whole document, while the first narrows the search down more quickly.
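For what it's worth, here's a minimal runnable sketch of the .parent approach, adapted to the markup shown in the question (note that the sample HTML uses class="reportsubsection n", not an id, so this sketch matches on the class):

```python
import re
from bs4 import BeautifulSoup

html = """
<div class="reportsubsection n">
<h2> part 1 </h2>
<p> insert text here </p>
</div>
<div class="reportsubsection n">
<h2> part 2 </h2>
<p> insert text here </p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Find the <h2> whose text mentions "part 2", then climb to the
# enclosing subsection <div>.
header = soup.find("h2", string=re.compile("part 2"))
section = header.find_parent("div", class_="reportsubsection")
print(section)
```

find_parent with a class filter is a bit safer than a bare .parent here, since it keeps working even if the h2 is later nested one level deeper.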


In BeautifulSoup, Ignore Children Elements While Getting Parent Element Data

I have html as follows:
<html>
<div class="maindiv">
text data here
<br>
continued text data
<br>
<div class="somename">
text & data I want to omit
</div>
</div>
</html>
I am trying to get only the text found in the maindiv element, without getting the text data found in the somename element. In most cases, in my experience anyway, text data is contained within some child element. I have run into this particular case, however, where the data seems to be placed somewhat willy-nilly and is a bit harder to filter.
My approach is as follows:
textdata = soup.find('div', class_='maindiv').get_text()
This gets all the text data found within the maindiv element, as well as the text data found in the somename div element.
The logic I'd like to use is more along the lines of:
textdata = soup.find('div', class_='maindiv').get_text(recursive=False) which would omit any text data found within the somename element.
I know the recursive=False argument works for locating only top-level elements when searching the DOM structure with BeautifulSoup, but it can't be used with the .get_text() method.
I've considered the approach of finding all the text, then subtracting the string data found in the somename element from the string data found in the maindiv element, but I'm looking for something a little more efficient.
Not that far from your subtracting method, but one way to do it (at least in Python 3) is to discard all child divs.
s = soup.find('div', class_='maindiv')
for child in s.find_all("div"):
    child.decompose()
print(s.get_text())
Would print something like:
text data here
continued text data
That might be a bit more efficient and flexible than subtracting the strings, though it still needs to go through the children first.
from bs4 import BeautifulSoup
html ='''
<html>
<div class="maindiv">
text data here
<br>
continued text data
<br>
<div class="somename">
text & data I want to omit
</div>
</div>
</html>'''
soup = BeautifulSoup(html, 'lxml')
soup.find('div', class_="maindiv").next_element
out:
'\n text data here \n '
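Another non-destructive option is to collect only the text nodes that are direct children of maindiv, so nothing inside nested divs is touched and the tree is left intact. A runnable sketch using the question's HTML:

```python
from bs4 import BeautifulSoup

html = '''
<html>
<div class="maindiv">
    text data here
    <br>
    continued text data
    <br>
    <div class="somename">
        text & data I want to omit
    </div>
</div>
</html>'''

soup = BeautifulSoup(html, 'html.parser')
maindiv = soup.find('div', class_='maindiv')

# string=True with recursive=False yields only the div's own text
# nodes, skipping everything inside child elements like the nested div.
own_text = ' '.join(
    chunk.strip()
    for chunk in maindiv.find_all(string=True, recursive=False)
    if chunk.strip()
)
print(own_text)  # -> text data here continued text data
```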

python how to identify block html contain text?

I have raw HTML files and I removed the script tags.
I want to identify the block elements in the DOM (like <h1>, <p>, <div>, etc., not <a>, <em>, <b>, etc.) and enclose them in <div> tags.
Is there any easy way to do this in Python?
Is there a Python library to identify the block elements?
Thanks
UPDATE
Actually, I want to extract content from the HTML document. I have to identify the blocks which contain text. For each text element I have to find its closest parent element that is displayed as a block. After that, for each block, I will extract features such as the size and position of the block.
You should use something like Beautiful Soup or HTMLParser.
Have a look at their docs: Beautiful Soup or HTMLParser.
You should find what you are looking for there. If you cannot get it to work, consider asking a more specific question.
Here is a simple example of how you could go about this. Say data is the raw content of a site; then you could:
soup = BeautifulSoup(data)  # you may need to add from_encoding="utf-8" or so
Then you might want to walk through the tree looking for a specific node and do something with it. You could use a function like this:
def walker(soup):
    if soup.name is not None:
        for child in soup.children:
            # do stuff with the node
            print(':'.join([str(child.name), str(type(child))]))
            walker(child)
Note: the code is from this great tutorial.
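Building on the walker idea, here is a minimal sketch that finds, for each text node, its closest block-level ancestor. The set of block-level tag names is an assumption (HTML has no single short list), so extend it as needed:

```python
from bs4 import BeautifulSoup

# Assumed subset of block-level tags; extend for your documents.
BLOCK_TAGS = {'p', 'div', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6',
              'li', 'td', 'blockquote'}

html = '<div><h1>Title</h1><p>Some <em>inline</em> text</p></div>'
soup = BeautifulSoup(html, 'html.parser')

blocks = {}
for text in soup.find_all(string=True):
    if not text.strip():
        continue
    # Walk up to the nearest ancestor whose tag is block-level;
    # inline wrappers such as <em> are skipped over.
    block = text.find_parent(lambda tag: tag.name in BLOCK_TAGS)
    blocks.setdefault(block.name, []).append(text.strip())

print(blocks)  # e.g. {'h1': ['Title'], 'p': ['Some', 'inline', 'text']}
```

From the blocks mapping you can then compute per-block features (text length, position in the document, and so on).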

Using scrapy to digest loosely formatted, primive html into blocks of text

Just introduced to Scrapy. I have gone through the basic tutorial, but I feel I don't quite grok how to use it.
A simple case I need to parse looks like:
BASIC CASE
<li>
<b>Descriptive Title</b>
<br>
The_first_paragraph_of_text
<p>Second paragraphs of text</p>
...
<p>Nth Paragraph of text</p>
</li>
What I want to do is generate a database record with two columns, "title" and "body_text". Title comes from 'Descriptive Title', and body_text comes from concatenating all the paragraphs of text into one block.
I wrote something simple like
"""pulls out the Descriptive Title and all the <p>-wrapped paragraphs,
but misses the first paragraph (which isn't wrapped in a <p>)"""
for sel in response.xpath("//li"):
    b = sel.xpath('b').extract()
    print "b = {0}\n".format(b)
    for p in sel.xpath('p'):
        paragraph = p.xpath('text()').extract()
        print "\n{0}".format(paragraph)
But this doesn't catch the unwrapped first paragraph, only paragraph two onwards. It's also not robust to variations of the <li> HTML blocks.
In one variation the first paragraph is sometimes wrapped in italics.
ITALICS COMPLICATION
<li>
<b>Descriptive Title</b>
<br>
<i>Occasionally The_first_paragraph_of_text is in italics</i>
<p>Second paragraphs of text</p>
...
<p>Nth Paragraph of text</p>
</li>
In another variation, sometimes <li> are embedded inside some of the paragraph blocks.
SUB LIST-ITEM COMPLICATION
<li>
<b>Descriptive Title</b>
<br>
<i>Occasionally The_first_paragraph_of_text is in italics</i>
<p>Second paragraphs of text</p>
<p>Sometimes paragraphs will have lists inside them
<li>idea 1</li>
<li>idea 2</li>
<li>idea N</li>
</p>
<p>Nth Paragraph of text</p>
</li>
I suspect I'm not really digesting the html file in a "scrapythonic" way. What's the right approach to write a more robust Selector(s) to pull out what I want?
This is more of an XPath question than a Scrapy question.
If your title is always in the first element, being a <b> tag, it's as easy as sel.xpath('b[1]/text()').
If I were you, however, I would add some strict assertions to make it fail rather than scrape the wrong title text, because a <b> tag can often play other roles.
Try:
title, = sel.xpath('*[1][self::b and count(node())=count(text())]')
title = u'\n'.join(sel.xpath('text()'))
This reads as follows (assertions in parentheses):
There must exist one and only one tag (asserted by title, =)
that is the first tag (asserted by [1])
and is a <b> (asserted by self::b)
and has only text nodes (asserted by count(text())=count(node()))
For the body text, you'll have to get used to the use of HTML for formatting, unless you are scraping a site that formats its main content very simply (e.g. arranged <p> tags).
You can get all the text descendants in document order with:
sel.xpath('.//text()')
However, <script> and <style> tags may contribute unwanted text; you don't want your body text to be filled with JavaScript/CSS gibberish.
You can prevent them from being selected with:
sel.xpath('.//text()[not(parent::script or parent::style)]')
Then you'd probably want to join the extracted text with u'\n'.join(). This will solve your problem, but it will not handle all unwanted tags. Play around with lxml.html.clean.Cleaner(style=True) (documented at http://lxml.de/lxmlhtml.html#cleaning-up-html) and consider finding a library that renders HTML to text.
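The script/style-filtering selector above can be sketched end-to-end with plain lxml (the sample HTML is adapted from the question, with a <script> added to show the filtering):

```python
import lxml.html

html = '''<li>
<b>Descriptive Title</b>
<br>
The first paragraph of text
<script>var junk = 1;</script>
<p>Second paragraph of text</p>
</li>'''

li = lxml.html.fromstring(html)

# Take every descendant text node, except those whose parent is a
# <script> or <style> tag.
texts = li.xpath('.//text()[not(parent::script or parent::style)]')
body = u'\n'.join(t.strip() for t in texts if t.strip())
print(body)
```

The same expression works unchanged inside a Scrapy selector via sel.xpath(...).extract().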

Find elements which have a specific child with BeautifulSoup

With BeautifulSoup, how do I access a <li> which has a specific div as a child?
Example: how do I access the text (i.e. info#blah.com) of the li which has Email as a child div?
<li>
<div>Country</div>
Germany
</li>
<li>
<div>Email</div>
info#blah.com
</li>
I tried to do it manually: looping over all li, and for each of them, looping again over all child divs to check whether the text is Email, etc., but I'm sure there exists a cleverer way with BeautifulSoup.
There are multiple ways to approach the problem.
One option is to locate the Email div by text and get the next sibling:
soup.find("div", text="Email").next_sibling.strip() # prints "info#blah.com"
Your question is about getting the whole <li> part which has "Email" inside the <div> tag, right? Meaning you need to get the following result:
<li>
<div>Email</div>
info#blah.com
</li>
If I am understanding your question correctly, you need to do the following:
soup.find("div", text="Email").parent
Or, if you need "info#blah.com" as your result, do the following:
soup.find("div", text="Email").next_sibling
If the document has only a single div with the content "Email", you can do it this way:
soup.find("div", text="Email").find_parent('li')
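Putting the pieces together, a small runnable sketch using the markup from the question:

```python
from bs4 import BeautifulSoup

html = '''
<li>
<div>Country</div>
Germany
</li>
<li>
<div>Email</div>
info#blah.com
</li>'''

soup = BeautifulSoup(html, 'html.parser')

# Locate the <div> whose text is exactly "Email", then take the text
# node right after it, and the <li> that encloses it.
div = soup.find('div', string='Email')
value = div.next_sibling.strip()
item = div.find_parent('li')
print(value)
```

Note the .strip(): next_sibling is the raw text node, including the surrounding newlines and indentation.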

How to surround an html element with another tag using lxml in Python

What I want to do is something like this: in my page I have an HTML document which contains this tag
<p class="pretty">
Some text
</p>
And I want to replace it with
<blockquote>
<p>
Some Text
</p>
</blockquote>
I can strip the class of the tag using tag.attrib.pop('class'), but I can't work out how to wrap another HTML tag around a particular tag.
Any help is appreciated.
I believe you are thinking about it the wrong way: you cannot wrap an element around another in place. What you need to do is copy the contents of the <p> into a variable, delete the <p> element, create a <blockquote> element where the <p> element used to be, and then add the contents of the <p> element into the <blockquote>.
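A minimal lxml sketch of that copy-and-replace approach, on hypothetical sample markup (it only copies the <p>'s text, which is enough for this simple case; elements with children would need their subtrees copied too):

```python
import lxml.html

doc = lxml.html.fromstring('<div><p class="pretty">Some text</p></div>')

p = doc.find('.//p')
parent = p.getparent()

# Build the replacement: a <blockquote> containing a plain <p>
# that carries the old element's contents (no class attribute).
blockquote = lxml.html.Element('blockquote')
new_p = lxml.html.Element('p')
new_p.text = p.text            # copy the contents of the old <p>
blockquote.append(new_p)

# Swap the <blockquote> into the old <p>'s position.
parent.replace(p, blockquote)

print(lxml.html.tostring(doc).decode())
```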
