soup.find_all will search a BeautifulSoup document for all occurrences of a single tag. Is there a way to search for particular patterns of nested tags?
For example, I would like to search for all occurrences of this pattern:
<div class="separator">
<a>
<img />
</a>
</div>
There are multiple ways to find the pattern, but the easiest one would be to use a CSS selector:
for img in soup.select('div.separator > a > img'):
    print img  # or img.parent.parent to get the "div"
Demo:
>>> from bs4 import BeautifulSoup
>>> data = """
... <div>
... <div class="separator">
... <a>
... <img src="test1"/>
... </a>
... </div>
...
... <div class="separator">
... <a>
... <img src="test2"/>
... </a>
... </div>
...
... <div>test3</div>
...
... <div>
... <a>test4</a>
... </div>
... </div>
... """
>>> soup = BeautifulSoup(data)
>>>
>>> for img in soup.select('div.separator > a > img'):
... print img.get('src')
...
test1
test2
I do understand that, strictly speaking, the solution would not work if the div has more than one a child, or if the a tag contains something other than the img tag. In that case the solution can be improved with additional checks.
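For example, a minimal sketch of one such check, reusing the soup from the demo above (what counts as "strict enough" here is my assumption):

for img in soup.select('div.separator > a > img'):
    a = img.parent
    # keep only matches where the <img> is the <a>'s sole descendant tag
    # and the <a> is the div's sole direct child tag
    if len(a.find_all(True)) == 1 and len(a.parent.find_all(True, recursive=False)) == 1:
        print img.get('src')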
Check out this part of the docs. You probably want a function like this, passed as the only argument to find_all so that it receives each tag:
def nested_img(tag):
    # True for a <div> whose first child tag is an <a> that itself starts with an <img>
    child = tag.find(True, recursive=False)  # first direct child tag, skipping bare strings
    grandchild = child.find(True, recursive=False) if child else None
    return (tag.name == "div" and child is not None and child.name == "a"
            and grandchild is not None and grandchild.name == "img")

soup.find_all(nested_img)
P.S.: This is untested.
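For what it's worth, a quick sanity check against the soup built from the data snippet in the first answer should look something like this (the expected output is a sketch, assuming BeautifulSoup 4 on Python 2 as elsewhere on this page):

>>> [div.a.img['src'] for div in soup.find_all(nested_img)]
[u'test1', u'test2']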
I am using lxml with html:
from lxml import html
import requests
How would I check if any of an element's children have the class "nearby"?
My code (essentially):
url = "http://www.example.com"
Page = requests.get(url)
Tree = html.fromstring(Page.content)
resultList = Tree.xpath('//p[@class="result-info"]')

i = len(resultList) - 1  # to go through the list backwards
while i >= 0:
    if resultList[i].HasChildWithClass("nearby"):  # made-up method
        print('This result has a child with the class "nearby"')
    i -= 1
How would I replace "HasChildWithClass()" to make it actually work?
Here's an example tree:
...
<p class="result-info">
<span class="result-meta">
<span class="nearby">
... #this SHOULD print something
</span>
</span>
</p>
<p class="result-info">
<span class="result-meta">
<span class="FAR-AWAY">
... # this should NOT print anything
</span>
</span>
</p>
...
I am not sure why you want to use lxml to find the element; BeautifulSoup may be a better choice.
html_doc = """
<p class="result-info">
<span class="result-meta">
<span class="nearby">
... #this SHOULD print something
</span>
</span>
</p>
<p class="result-info">
<span class="result-meta">
<span class="FAR-AWAY">
... # this should NOT print anything
</span>
</span>
</p>
"""
But here is how to do what you want with lxml:
from lxml import html

Tree = html.fromstring(html_doc)
resultList = Tree.xpath('//p[@class="result-info"]')
for result in resultList:
    for e in result.iter():
        if e.attrib.get("class") == "nearby":
            print(e.text)
Or try bs4:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, "lxml")
result = soup.find_all("span", class_="nearby")
print(result[0].text)
Here is an experiment I did.
Take r = resultList[0] in a Python shell and type:
>>> dir(r)
['__bool__', '__class__', ..., 'find_class', ...
Now this find_class method is highly suspicious. If you check its help doc:
>>> help(r.find_class)
you'll confirm the guess. Indeed,
>>> r.find_class('nearby')
[<Element span at 0x109788ea8>]
For the other tag s = resultList[1] in the example xml code you gave,
>>> s.find_class('nearby')
[]
Now it's clear how to tell whether a 'nearby' child exists or not.
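For example, to answer the original question directly, a small sketch reusing resultList from above:

for result in resultList:
    if result.find_class('nearby'):  # non-empty list means some descendant has class="nearby"
        print('This result has a child with the class "nearby"')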
Cheers!
I'm not very familiar with BeautifulSoup.
I have HTML code like this (it's only part of it):
<div class="central-featured-lang lang1" lang="en">
<a class="link-box" href="//en.wikibooks.org/">
<strong>English</strong><br>
<em>Open-content textbooks</em><br>
<small>51 000+ pages</small></a>
</div>
As output I should get (and similarly for other languages):
English: 51 000+ pages.
I tried something like:
for item in soup.find_all('div'):
    print item.get('class')
But this does not work. Can you help me, or at least point me towards a solution?
item.get() returns attribute values, not text contained under an element.
You can get the text directly contained in elements with the Element.string attribute, or all contained text (recursively) with the Element.get_text() method.
Here, I'd search for div elements with a lang attribute, then use the contained elements to find strings:
for item in soup.find_all('div', lang=True):
    if not (item.strong and item.small):
        continue
    language = item.strong.string
    pages = item.small.string
    print '{}: {}'.format(language, pages)
Demo:
>>> from bs4 import BeautifulSoup
>>> sample = '''\
... <div class="central-featured-lang lang1" lang="en">
... <a class="link-box" href="//en.wikibooks.org/">
... <strong>English</strong><br>
... <em>Open-content textbooks</em><br>
... <small>51 000+ pages</small></a>
... </div>
... '''
>>> soup = BeautifulSoup(sample)
>>> for item in soup.find_all('div', lang=True):
... if not (item.strong and item.small):
... continue
... language = item.strong.string
... pages = item.small.string
... print '{}: {}'.format(language, pages)
...
English: 51 000+ pages
My code is like this:
import urllib2
from lxml import etree

response = urllib2.urlopen("file:///C:/data20140801.html")
page = response.read()
tree = etree.HTML(page)
data = tree.xpath("//p/span/text()")
The HTML page could have this structure:
<span style="font-size:10.0pt">Something</span>
It could also have this structure:
<p class="Normal">
<span style="font-size:10.0pt">Some</span>
<span style="font-size:10.0pt">thing</span>
</p>
Using the same code for both, I want to get "Something".
The XPath expression returns a list of values:
>>> from lxml import etree
>>> tree = etree.HTML('''\
... <p class="Normal">
... <span style="font-size:10.0pt">Some</span>
... <span style="font-size:10.0pt">thing</span>
... </p>
... ''')
>>> tree.xpath("//p/span/text()")
['Some', 'thing']
Use ''.join() to combine the two strings into one:
>>> ''.join(tree.xpath("//p/span/text()"))
'Something'
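The same expression also covers the single-span case, so one piece of code handles both structures (assuming the standalone span sits inside a <p>, as the //p/span path requires):

>>> tree = etree.HTML('<p><span style="font-size:10.0pt">Something</span></p>')
>>> ''.join(tree.xpath("//p/span/text()"))
'Something'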
I am trying to parse all elements under a div using BeautifulSoup. The issue is that I don't know all the elements underneath the div prior to parsing. For example, a div can have text data in paragraph mode and bullet format, along with some href elements. Each url that I open can have different elements underneath the specific div class that I am looking at:
example:
url a can have following:
<div class='content'>
<p> Hello I have a link </p>
<li> I have a bullet point </li>
<a href="foo.com">foo</a>
</div>
but url b
can have
<div class='content'>
<p> I only have paragraph </p>
</div>
I started as doing something like this:
content = souping_page.body.find('div', attrs={'class': 'content'})
but how to go beyond this is a little confusing. I was hoping to create one string from all the parsed data as an end result.
In the end I want the following string to be obtained from each example:
Example 1: Final Output
parse_data = Hello I have a link I have a bullet point
parse_links = foo.com
Example 2: Final Output
parse_data = I only have paragraph
You can get just the text of an element with element.get_text():
>>> from bs4 import BeautifulSoup
>>> sample1 = BeautifulSoup('''\
... <div class='content'>
... <p> Hello I have a link </p>
...
... <li> I have a bullet point </li>
...
... <a href="foo.com">foo</a>
... </div>
... ''').find('div')
>>> sample2 = BeautifulSoup('''\
... <div class='content'>
... <p> I only have paragraph </p>
...
... </div>
... ''').find('div')
>>> sample1.get_text()
u'\n Hello I have a link \n I have a bullet point \n\nfoo\n'
>>> sample2.get_text()
u'\n I only have paragraph \n'
or you can strip it down a little using element.stripped_strings:
>>> ' '.join(sample1.stripped_strings)
u'Hello I have a link I have a bullet point foo'
>>> ' '.join(sample2.stripped_strings)
u'I only have paragraph'
To get all links, look for all a elements with href attributes and gather these in a list:
>>> [a['href'] for a in sample1.find_all('a', href=True)]
['foo.com']
>>> [a['href'] for a in sample2.find_all('a', href=True)]
[]
The href=True argument limits the search to <a> tags that have a href attribute defined.
Per the Beautiful Soup docs, to iterate over the children of a tag use either .contents to get them as a list or .children (a generator).
for child in title_tag.children:
    print(child)
So, in your case, you could grab the .text of each tag and concatenate them together, as in the sketch below. I'm not clear on whether you want the link location or simply the label; if the former, refer to this SO question.
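A minimal sketch of that idea, assuming content is the div found earlier with souping_page.body.find (child nodes can be tags or bare strings, so both cases are handled):

pieces = []
for child in content.children:
    # Tags expose get_text(); bare NavigableStrings are already text
    text = child.get_text() if hasattr(child, 'get_text') else unicode(child)
    if text.strip():
        pieces.append(text.strip())
parse_data = ' '.join(pieces)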
I'm trying to match something like this with beautifulsoup.
<a href="./SlimLineUSB3/SlimLine1BayUSB3.php">
<b>1 Bay SlimLine with both eSATA and USB 3.0</b>
</a>
In a regexp, it would look something like this. I want to capture the url.
<a href="\.(.*)">
<b>.*</b>
</a>
How do I go about doing something like this with BeautifulSoup? I need to use the b tags inside of the 'a' tags I want, since that's the only thing that differentiates these 'a's from any other link on the page. It seems like I can only write regexps to match the tag name or specific attributes?
If you just want to get the href from all a tags which contain one b tag:
>>> from BeautifulSoup import BeautifulSoup
>>> html = """
... <html><head><title>Title</title></head><body>
... <a href="first/index.php"><b>first</b></a>
... <a><b>no-href</b></a>
... <div><a href="second/index.php"><b>second</b></a></div>
... <div><a href="third/index.php"><b>third</b></a></div>
... no-bold-tag
... <a href="fourth/index.php"><b>text</b><p>other-stuff</p></a>
... </body></html>
... """
>>> soup = BeautifulSoup(html)
>>> [a['href'] for a in soup('a', href=True) if a.b and len(a) == 1]
[u'first/index.php', u'second/index.php', u'third/index.php']
This can be done quite elegantly using an XPath expression if you don't mind using lxml.
import lxml.html as lh
html = '''
<a href="./SlimLineUSB3/SlimLine1BayUSB3.php">
<b>1 Bay SlimLine with both eSATA and USB 3.0</b>
</a>
<a href="./Another/URL.php">
<b>foo</b>
<p>bar</p>
</a>
'''
tree = lh.fromstring(html)
for link in tree.xpath('a[count(b) = 1 and count(*) = 1]'):
    print lh.tostring(link)
Result:
<a href="./SlimLineUSB3/SlimLine1BayUSB3.php">
<b>1 Bay SlimLine with both eSATA and USB 3.0</b>
</a>
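Since the original goal was to capture the URL rather than the element's markup, the matched element's href is one .get() away (same tree as above):

for link in tree.xpath('a[count(b) = 1 and count(*) = 1]'):
    print link.get('href')  # prints ./SlimLineUSB3/SlimLine1BayUSB3.php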
Or if you wanted to use a method more similar to @ekhumoro's with lxml, you could do:
[a for a in tree.xpath('a[@href]') if a.find('b') is not None and len(a) == 1]