Suppose we have the html code as follows:
html = '<div class="dt name">abc</div><div class="name">xyz</div>'
soup = BeautifulSoup(html, 'lxml')
I want to get the name xyz. Then, I write
soup.find('div',{'class':'name'})
However, it returns abc.
How to solve this problem?
The thing is that Beautiful Soup returns the first matching element, and a class filter matches per class token, so the first div (which has both the dt and name classes) matches 'name' and gets selected, as the snippet below shows.
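You can see this by asking for all matching divs instead of just the first one (a quick sketch using the html from the question):
from bs4 import BeautifulSoup

html = '<div class="dt name">abc</div><div class="name">xyz</div>'
soup = BeautifulSoup(html, 'lxml')

# class matching is per class token, so both divs match 'name'
print(soup.find_all('div', {'class': 'name'}))
# [<div class="dt name">abc</div>, <div class="name">xyz</div>]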
So filtering on div helps, but it still leaves 2 divs. soup('div') returns a list, so you can grab the second div with print(soup('div')[1].text). If you want to print all the divs, use this code:
for i in range(len(soup('div'))):
    print(soup('div')[i].text)
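The same loop can also be written by iterating over the tags directly instead of indexing (just a shorter equivalent of the loop above):
for div in soup('div'):
    print(div.text)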
And as pointed out in Ankur Sinha's answer, if you want to select only the divs whose class is exactly name (and nothing else), you have to use select, like this:
soup.select('div[class=name]')[0].get_text()
But if there are multiple divs that satisfy this property, use this:
for i in range(len(soup.select('div[class=name]'))):
    print(soup.select('div[class=name]')[i].get_text())
Just to continue Ankur Sinha's answer: select (and even a bare soup() call) returns a list, because there can be multiple matches. That's why I used len() to get the length of that list, and then ran a for loop over it, printing the match at index i, starting from 0.
Indexing like that gives you a specific div rather than the list itself; calling get_text() on the list would produce an error, because the list is not a tag.
This blog post was helpful for doing what you would like, which is to explicitly find a tag with a specific class attribute:
from bs4 import BeautifulSoup
html = '<div class="dt name">abc</div><div class="name">xyz</div>'
soup = BeautifulSoup(html, 'html.parser')
soup.find(lambda tag: tag.name == 'div' and tag['class'] == ['name'])
Output:
<div class="name">xyz</div>
You can also do it without a lambda, using select to match the exact class name, like this:
soup.select("div[class = name]")
Will give:
[<div class="name">xyz</div>]
And if you want the value between tags:
soup.select("div[class=name]")[0].get_text()
Will give:
xyz
In case you have multiple divs with class 'name', you can do:
for i in range(len(soup.select("div[class=name]"))):
print(soup.select("div[class=name]")[i].get_text())
Reference:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors
This might work for you, note that it is contingent on the div being the second div item in the html.
from bs4 import BeautifulSoup
html = '<div class="dt name">abc</div><div class="name">xyz</div>'
soup = BeautifulSoup(html, features='lxml')
print(soup('div')[1].text)
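If the div you want happens to be the last one rather than specifically the second, negative indexing avoids hard-coding the position (just standard Python list indexing):
print(soup('div')[-1].text)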
Related
I would like to select the second element by specifying the fact that it contains a "title" attribute (I don't want to just select the second element in the list).
sample = """<h5 class="card__coins">
<a class="link-detail" href="/en/coin/smartcash">SmartCash (SMART)</a>
</h5>
<a class="link-detail" href="/en/event/smartrewards-812" title="SmartRewards">"""
How could I do it?
My code (does not work):
from bs4 import BeautifulSoup
soup = BeautifulSoup(sample.content, "html.parser")
second = soup.find("a", {"title"})
for a_tag in soup.find_all('a', title=True):
    print(a_tag)
find_all('a', title=True) only returns the a tags that actually have a title attribute, so the loop above prints just those.
Another way to do this is with a CSS selector:
soup.select_one('a[title]')
This selects the first a element that has the title attribute.
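Putting it together with the sample from the question (note that sample is a plain string, so it goes to BeautifulSoup directly, without .content), a minimal sketch:
from bs4 import BeautifulSoup

sample = """<h5 class="card__coins">
<a class="link-detail" href="/en/coin/smartcash">SmartCash (SMART)</a>
</h5>
<a class="link-detail" href="/en/event/smartrewards-812" title="SmartRewards">"""

soup = BeautifulSoup(sample, "html.parser")

# only the second <a> carries a title attribute
second = soup.find("a", title=True)  # or soup.select_one('a[title]')
print(second["href"])   # /en/event/smartrewards-812
print(second["title"])  # SmartRewards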
I was wondering what I need to put inside of the .find parameter when using beautifulsoup to get the contents of "the-target" shown below.
<div class="item" the-target="this text" another-target="not this text">
This is the BeautifulSoup .find parameter I am talking about:
help = soup.find('div', 'What should I put here?').get_text()
Thanks
You can filter for the div with class item and get the value of the the-target key from the resulting Tag object (which supports dict-like attribute access):
soup.find('div', attrs={'class': 'item'})['the-target']
If you want to find by the the-target attribute:
soup.find('div', attrs={'the-target': 'this text'})
And get the value of the attribute like before:
soup.find('div', attrs={'the-target': 'this text'})['the-target']
In two steps:
tag = soup.find('div', attrs={'the-target': 'this text'})
the_target = tag.get('the-target')
You can use a CSS selector to find the item:
soup.select_one('div.item')['the-target']
OR
soup.select_one('.item')['the-target']
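For completeness, a small self-contained sketch using the html from the question:
from bs4 import BeautifulSoup

html = '<div class="item" the-target="this text" another-target="not this text">'
soup = BeautifulSoup(html, 'html.parser')

tag = soup.find('div', attrs={'class': 'item'})
print(tag['the-target'])       # this text
print(tag['another-target'])   # not this text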
This is my code:
html = '''
<td class="ClassName class" width="60%">Data I want to extract<span lang=EN-
UK style="font-size:12pt;font-family:'arial'"></span></td>
'''
soup = BeautifulSoup(html, 'html.parser')
print(soup.select_one('td').string)
It returns None. I think it has to do with that span tag, which is empty: does it go into that span tag and return its contents? So I either want to delete that span tag, stop as soon as it finds 'Data I want to extract', or tell it to ignore empty tags.
If there are no empty tags inside the td, it actually works.
Is there a way to ignore empty tags in general and go one step back, instead of ignoring this specific span tag?
Sorry if this is too elementary, but I spent a fair amount of time searching.
Use the .text property, not .string:
html = '''
<td class="ClassName class" width="60%">Data I want to extract<span lang=EN-
UK style="font-size:12pt;font-family:'arial'"></span></td>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
print(soup.select_one('td').text)
Output:
Data I want to extract
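.string returns None here because the td has more than one child (the text plus the empty span), and .string is only defined when a tag has a single child. If you really do want to delete the empty span first, as you suggested, a possible sketch using decompose():
td = soup.select_one('td')
for span in td.find_all('span'):
    if not span.get_text(strip=True):  # span has no text content
        span.decompose()               # remove it from the tree
print(td.string)  # Data I want to extract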
Use .text:
>>> soup.find('td').text
u'Data I want to extract'
I am struggling with the syntax required to grab some hrefs in a td.
The table, tr and td elements don't have any classes or ids.
If I wanted to grab the anchor in this example, what would I need?
<tr>
  <td><a>...
Thanks
As per the docs, you first make a parse tree:
import BeautifulSoup
html = "<html><body><tr><td><a href='foo'/></td></tr></body></html>"
soup = BeautifulSoup.BeautifulSoup(html)
and then you search in it, for example for <a> tags whose immediate parent is a <td>:
for ana in soup.findAll('a'):
    if ana.parent.name == 'td':
        print(ana["href"])
Something like this?
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(html)
anchors = [td.find('a') for td in soup.findAll('td')]
That should find the first "a" inside each "td" in the html you provide. You can tweak td.find to be more specific, or use findAll if you have several links inside each td (see the sketch after the update below).
UPDATE: re Daniele's comment, if you want to make sure you don't have any None's in the list, then you could modify the list comprehension thus:
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(html)
anchors = [a for a in (td.find('a') for td in soup.findAll('td')) if a]
Which basically just adds a check to see if you have an actual element returned by td.find('a').
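If each td can contain several links, here is a sketch of the findAll variant mentioned above, using the same BeautifulSoup 3 import as in this answer:
from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(html)
# one flat list with every <a> found inside any <td>
anchors = [a for td in soup.findAll('td') for a in td.findAll('a')]
for a in anchors:
    print(a["href"])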
I am trying to parse several div blocks using Beautiful Soup using some html from a website. However, I cannot work out which function should be used to select these div blocks. I have tried the following:
import urllib2
from bs4 import BeautifulSoup
def getData():
    html = urllib2.urlopen("http://www.racingpost.com/horses2/results/home.sd?r_date=2013-09-22", timeout=10).read().decode('UTF-8')
    soup = BeautifulSoup(html)
    print(soup.title)
    print(soup.find_all('<div class="crBlock ">'))

getData()
I want to be able to select everything between <div class="crBlock "> and its correct end </div>. (Obviously there are other div tags but I want to select the block all the way down to the one that represents the end of this section of html.)
The correct use would be:
soup.find_all('div', class_="crBlock ")
By default, Beautiful Soup returns the entire tag, including its contents. You can then do whatever you want with it if you store it in a variable. If you are only looking for one div, you can also use find() instead. For instance:
div = soup.find('div', class_="crBlock ")
print(div.find_all(text='foobar'))
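If the page has several such blocks and you want the text of each, you could loop over the result of find_all(), for example:
for div in soup.find_all('div', class_="crBlock "):
    print(div.get_text(strip=True))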
Check out the documentation page for more info on all the filters you can use.