I have html as follows:
<html>
<div class="maindiv">
text data here
<br>
continued text data
<br>
<div class="somename">
text & data I want to omit
</div>
</div>
</html>
I am trying to get only the text found in the maindiv element, without getting the text data found in the somename element. In most cases, in my experience anyway, text data is contained within some child element. I have run into this particular case, however, where the data is placed somewhat willy-nilly and is a bit harder to filter.
My approach is as follows:
textdata = soup.find('div', class_='maindiv').get_text()
This gets all the text data found within the maindiv element, as well as the text data found in the somename div element.
The logic I'd like to use is more along the lines of:
textdata = soup.find('div', class_='maindiv').get_text(recursive=False)
which would omit any text data found within the somename element.
I know the recursive=False argument works for locating only top-level elements when searching the DOM structure with BeautifulSoup, but it can't be used with the .get_text() method.
I've considered the approach of finding all the text, then subtracting the string data found in the somename element from the string data found in the maindiv element, but I'm looking for something a little more efficient.
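For reference, that subtraction approach might look like the following sketch (assuming soup has been parsed from the HTML above):
main = soup.find('div', class_='maindiv')
inner = main.find('div', class_='somename')
# Brittle if the nested text also occurs in the parent's own text
textdata = main.get_text().replace(inner.get_text(), '')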
Not that far from your subtracting method, but one way to do it (at least in Python 3) is to discard all child divs:
s = soup.find('div', class_='maindiv')
for child in s.find_all("div"):
    child.decompose()
print(s.get_text())
Would print something like:
text data here
continued text data
That might be a bit more efficient and flexible than subtracting the strings, though it still needs to go through the children first.
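If you'd rather not mutate the tree, find_all with string=True and recursive=False collects only the text nodes that are direct children of maindiv. A minimal sketch (string= is the newer name for the text= argument in recent BeautifulSoup versions):
s = soup.find('div', class_='maindiv')
# Direct-child text nodes only; strings inside nested tags are skipped
chunks = s.find_all(string=True, recursive=False)
print(' '.join(c.strip() for c in chunks if c.strip()))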
from bs4 import BeautifulSoup
html = '''
<html>
<div class="maindiv">
text data here
<br>
continued text data
<br>
<div class="somename">
text & data I want to omit
</div>
</div>
</html>'''
soup = BeautifulSoup(html, 'lxml')
soup.find('div', class_="maindiv").next_element
out:
'\n text data here \n '
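Note that .next_element only returns that first text node. To gather the rest of the top-level text, one option is to walk the node's siblings; a sketch:
from bs4 import NavigableString

first = soup.find('div', class_="maindiv").next_element
texts = [first.strip()]
# The remaining top-level text nodes are siblings of the first one;
# the nested <div class="somename"> is a Tag, so it is skipped
for sib in first.next_siblings:
    if isinstance(sib, NavigableString) and sib.strip():
        texts.append(sib.strip())
print(texts)  # ['text data here', 'continued text data']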
I have raw HTML files from which I remove the script tags.
I want to identify the block elements in the DOM (like <h1>, <p>, <div>, etc., not <a>, <em>, <b>, etc.) and enclose them in <div> tags.
Is there an easy way to do this in Python? Is there a Python library that can identify block elements?
Thanks
UPDATE
Actually, I want to extract from the HTML document. I have to identify the blocks which contain text. For each text element I have to find its closest parent element that is displayed as a block. After that, for each block I will extract features such as the size and position of the block.
You should use something like Beautiful Soup or HTMLParser.
Have a look at their docs: Beautiful Soup or HTMLParser.
You should find what you are looking for there. If you cannot get it to work, consider asking a more specific question.
Here is a simple example of how you could go about this. Say data is the raw content of a site; then you could:
soup = BeautifulSoup(data)  # you may need to add from_encoding="utf-8" or so
Then you might want to walk through the tree, looking for a specific node and doing something with it. You could use a function like this:
def walker(soup):
    if soup.name is not None:
        for child in soup.children:
            # do stuff with the node
            print(':'.join([str(child.name), str(type(child))]))
            walker(child)
Note: the code is from this great tutorial.
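BeautifulSoup has no built-in notion of a "block element", so one common workaround is to test tag names against a hand-maintained set. BLOCK_TAGS and closest_block_parent below are hypothetical helpers, not library API; a sketch:
BLOCK_TAGS = {'p', 'div', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6',
              'ul', 'ol', 'li', 'table', 'blockquote', 'pre', 'body'}

def closest_block_parent(node):
    # Walk up from a text node to the nearest block-level ancestor
    parent = node.parent
    while parent is not None and parent.name not in BLOCK_TAGS:
        parent = parent.parent
    return parent

for text_node in soup.find_all(string=True):
    block = closest_block_parent(text_node)
    # measure/record the block's features here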
Just introduced to Scrapy. Have gone through the basic tutorial, but feel I don't quite grok how to use it.
The simple case I need to parse looks like:
BASIC CASE
<li>
<b>Descriptive Title</b>
<br>
The_first_paragraph_of_text
<p>Second paragraphs of text</p>
...
<p>Nth Paragraph of text</p>
</li>
What I want to do is generate a database record with two columns, "title" and "body_text". Title comes from the 'Descriptive Title' and the body_text comes from taking all paragraphs of text and concatenating into a block of text.
I wrote something simple like:
"""pulls out the Descriptive Title and all the <p>-wrapped paragraphs, but misses the first paragraph (which isn't wrapped in a <p>)"""
for sel in response.xpath("//li"):
    b = sel.xpath('b').extract()
    print "b = {0}\n".format(b)
    for p in sel.xpath('p'):
        paragraph = p.xpath('text()').extract()
        print "\n{0}".format(paragraph)
But this doesn't catch the unwrapped first paragraph, only paragraph two and onwards. It's also not robust to variations of the <li> html blocks.
In one variation the first paragraph is sometimes wrapped in italics.
ITALICS COMPLICATION
<li>
<b>Descriptive Title</b>
<br>
<i>Occasionally The_first_paragraph_of_text is in italics</i>
<p>Second paragraphs of text</p>
...
<p>Nth Paragraph of text</p>
</li>
In another variation, <li> elements are sometimes embedded inside some of the paragraph blocks.
SUB LIST-ITEM COMPLICATION
<li>
<b>Descriptive Title</b>
<br>
<i>Occasionally The_first_paragraph_of_text is in italics</i>
<p>Second paragraphs of text</p>
<p>Sometimes paragraphs will have lists inside them
<li>idea 1</li>
<li>idea 2</li>
<li>idea N</li>
</p>
<p>Nth Paragraph of text</p>
</li>
I suspect I'm not really digesting the html file in a "scrapythonic" way. What's the right approach to write a more robust Selector(s) to pull out what I want?
More of an XPath question than a Scrapy question. If your title is always in the first element and that element is a <b> tag, it's as easy as sel.xpath('b[1]/text()'). If I were you, however, I would add some strict assertions to make it fail rather than scrape the wrong title text, because a <b> tag can often play other roles.
Try:
title, = sel.xpath('*[1][self::b and count(node())=count(text())]')
title = u'\n'.join(title.xpath('text()').extract())
This reads as follows (assertions in parentheses):
There must exist one and only one tag (asserted by title, =)
that is the first tag (asserted by [1])
and is a <b> (asserted by self::b)
and has only text nodes (asserted by count(text())=count(node()))
For the body text, you'll have to get used to the use of html for formatting, unless you are scraping a site that formats its main content very simply (e.g. neatly arranged <p> tags).
You can get all the text descendants in document order with:
sel.xpath('.//text()')
However, <script> and <style> tags may contribute unwanted text; you don't want your body text to be filled with javascript/css gibberish.
You can prevent them from being selected with:
sel.xpath('.//text()[not(parent::script or parent::style)]')
Then, you'd probably want to join the extracted text with u'\n'.join()
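A minimal sketch of that join step, also dropping whitespace-only nodes:
body_text = u'\n'.join(
    t.strip()
    for t in sel.xpath('.//text()[not(parent::script or parent::style)]').extract()
    if t.strip()
)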
This will solve your problem but it will not handle all unwanted tags.
Play around with lxml.html.clean.Cleaner(style=True)
(documented here http://lxml.de/lxmlhtml.html#cleaning-up-html)
and consider finding a library that renders html to text.
I'm attempting to parse a very extensive HTML document that looks something like:
<div class="reportsubsection n" ><br>
<h2> part 1 </h2><br>
<p> insert text here </p><br>
<table> crazy table thing here </table><br>
</div>
<div class="reportsubsection n"><br>
<h2> part 2 </h2><br>
<p> insert text here </p><br>
<table> crazy table thing here </table><br>
</div>
I need to parse out the second div based on its h2 containing the text "part 2". I was able to break out all the divs with:
divTag = soup.find("div", {"id": "reportsubsection"})
but I didn't know how to narrow it down from there. From other posts I was able to find the specific text "part 2", but I need to be able to output the whole div section it is contained in.
EDIT/UPDATE
Ok sorry but I'm still a little lost. Here is what I've got now. I feel like this should be so much simpler than I'm making it. Thanks again for all the help
divTag = soup.find("div", {"id": "reportsubsection"})
for reportsubsection in soup.select('div#reportsubsection #reportsubsection'):
    if not reportsubsection.findAll('h2', text=re.compile('Finding')):
        continue
    print divTag
You can always go back up after finding the right h2, or you can test all subsections:
for subsection in soup.select('div#reportsubsection #subsection'):
    if not subsection.find('h2', text=re.compile('part 2')):
        continue
    # do something with this subsection
This uses a CSS selector to locate all subsections.
Or, going back up with the .parent attribute:
for header in soup.find_all('h2', text=re.compile('part 2')):
    section = header.parent
The trick is to narrow down your search as early as possible; the second option has to find all h2 elements in the whole document, while the first narrows the search down sooner.
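Putting the second option together for this document, a sketch (note the sample HTML carries the section name in class, not id, so this matches on class_; find_parent is the bs4 way to go back up to a specific ancestor):
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
for header in soup.find_all('h2', text=re.compile('part 2')):
    # Go back up to the enclosing section div and output the whole thing
    section = header.find_parent('div', class_='reportsubsection')
    print(section)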
While parsing a table on a webpage with little semantic structure, my BeautifulSoup expressions are getting really ugly. I might be going about it the wrong way and would like to know how I can rewrite my code to make it more readable and less messy.
For example, on one page there are three tables. The relevant data is in the third table. The actual data starts in the second row. The first entry in each row is an index, and the data I need is in the second td element. This second td element has two links, and my text of interest is within the second a tag. Translating this into BeautifulSoup, I wrote:
soup.find_all('table')[2].find_all('tr')[2].find_all('td')[1].find_all('a')[1].text
This works fine, and I grab all 70 elements in the table using the same principle in a list comprehension:
relevant_data = [ x.find_all('td')[1].find_all('a')[1].text for x in soup.find_all('table')[2].find_all('tr')[2:]]
Is this kind of code OK or is there any scope for improvement?
Using lxml, you can use XPath.
For example:
html = '''
<body>
<table></table>
<table></table>
<table>
<tr></tr>
<tr></tr>
<tr><td></td><td><a>blah1</a><a>blah1-1</a></td></tr>
<tr><td></td><td><a>blah2</a><a>blah2-1</a></td></tr>
<tr><td></td><td><a>blah3</a><a>blah3-1</a></td></tr>
<tr><td></td><td><a>blah4</a><a>blah4-1</a></td></tr>
<tr><td></td><td><a>blah5</a><a>blah5-1</a></td></tr>
</table>
<table></table>
</body>
'''
import lxml.html
root = lxml.html.fromstring(html)
print(root.xpath('.//table[3]/tr[position()>=2]/td[2]/a[2]/text()'))
output:
['blah1-1', 'blah2-1', 'blah3-1', 'blah4-1', 'blah5-1']
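If you'd rather stay with BeautifulSoup, simply naming the intermediate steps already helps readability; a sketch mirroring the list comprehension above:
data_table = soup.find_all('table')[2]      # the third table holds the data
data_rows = data_table.find_all('tr')[2:]   # the actual data rows
relevant_data = [row.find_all('td')[1].find_all('a')[1].text
                 for row in data_rows]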
I have the following HTML:
<h1 class="price">
<span class="strike">$325.00</span>$295.00
</h1>
I'd like to get the $295 out. However, if I simply use PyQuery as follows:
price = pq('h1').text()
I get both prices.
Extracting only the direct child text of an element looks reasonably complicated in jQuery; is there a way to do it at all in PyQuery?
Currently I'm extracting the first price separately, then using replace to remove it from the text, which is a bit fiddly.
Thanks for your help.
I don't think there is a clean way to do that. At least, I've found this solution:
>>> print doc('h1').html(doc('h1')('span').outerHtml())
<h1 class="price"><span class="strike">$325.00</span></h1>
You can use .text() instead of .outerHtml() if you don't want to keep the span tag.
Removing the first one is much easier:
>>> print doc('h1').remove('span')
<h1 class="price">
$295.00
</h1>
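Note that remove('span') mutates the document in place. If that matters, cloning first should keep the original intact (a sketch, assuming pyquery's clone() mirrors jQuery's):
>>> print doc('h1').clone().remove('span').text()
$295.00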