Beautifulsoup selecting the element that contains certain attribute - python

I would like to select the second element by specifying that it has a "title" attribute (I don't want to just select the second element in the list).
sample = """<h5 class="card__coins">
<a class="link-detail" href="/en/coin/smartcash">SmartCash (SMART)</a>
</h5>
<a class="link-detail" href="/en/event/smartrewards-812" title="SmartRewards">"""
How could I do it?
My code (does not work):
from bs4 import BeautifulSoup
soup = BeautifulSoup(sample.content, "html.parser")
second = soup.find("a", {"title"})

for link in soup.find_all('a', title=True):
    print(link)
find_all('a', title=True) returns only the a tags that have a title attribute, so the loop prints just those tags.

Another way to do this is to use a CSS selector:
soup.select_one('a[title]')
This selects the first a element that has a title attribute.
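Both approaches can be checked against the sample from the question. This is a minimal sketch: it passes the sample string directly to BeautifulSoup (a plain string has no .content), and the closing </a> on the last link is an assumption added so the markup is complete.

```python
from bs4 import BeautifulSoup

sample = """<h5 class="card__coins">
<a class="link-detail" href="/en/coin/smartcash">SmartCash (SMART)</a>
</h5>
<a class="link-detail" href="/en/event/smartrewards-812" title="SmartRewards"></a>"""

# sample is already a string, so it is passed as-is
soup = BeautifulSoup(sample, "html.parser")

# title=True matches any <a> that has a title attribute, whatever its value
with_title = soup.find_all("a", title=True)

# the CSS attribute selector [title] expresses the same condition
first_with_title = soup.select_one("a[title]")

print(len(with_title))
print(first_with_title["href"])
```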

Related

Extract data-content from span tag in BeautifulSoup

I have such HTML code:
<li class="IDENTIFIER"><h5 class="hidden">IDENTIFIER</h5><p>
<span class="tooltip-iws" data-toggle="popover" data-content="SOME TEXT">
other text</span></p></li>
And I'd like to obtain the SOME TEXT from the data-content.
I wrote
target = soup.find('span', {'class' : 'tooltip-iws'})['data-content']
to get the span, and I wrote
identifier_elt= soup.find("li", {'class': 'IDENTIFIER'})
to get the class, but I'm not sure how to combine the two.
But the class tooltip-iws is not unique, and I would get extraneous results if I just used that (there are other spans, before the code snippet, with the same class)
That's why I want to specify my search within the class IDENTIFIER. How can I do that in BeautifulSoup?
Try using a CSS selector:
soup.select_one("li[class='IDENTIFIER'] > p > span")['data-content']
You could also try selectorlib, which should solve your issue. Comment if you need further assistance.
https://selectorlib.com/
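The two find() calls from the question can also be combined directly, because find() can be called on a tag as well as on the whole soup. A small sketch, assuming one extra span with the same class earlier in the page (that first line of HTML is made up for illustration):

```python
from bs4 import BeautifulSoup

html = """
<p><span class="tooltip-iws" data-content="UNRELATED">earlier span</span></p>
<li class="IDENTIFIER"><h5 class="hidden">IDENTIFIER</h5><p>
<span class="tooltip-iws" data-toggle="popover" data-content="SOME TEXT">
other text</span></p></li>
"""
soup = BeautifulSoup(html, "html.parser")

# Step 1: narrow the search to the <li class="IDENTIFIER"> element
identifier_elt = soup.find("li", {"class": "IDENTIFIER"})

# Step 2: search only inside that element, then read the attribute
target = identifier_elt.find("span", {"class": "tooltip-iws"})["data-content"]

# The same scoping written as a descendant CSS selector
same = soup.select_one("li.IDENTIFIER span.tooltip-iws")["data-content"]

print(target)
```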

Pull Title attribute with out .get("title")

I have a value like
<a href="/for-sale/property/abu-dhabi/page-3/" title="Next" class="b7880daf"><div title="Next" class="ea747e34 ">
I need to pull out only "Next" from title="Next". For that I used
soup.find('a',attrs={"title": "Next"}).get('title')
Is there any way to get the title value without using .get("title")?
My code
next_page_text = soup.find('a',attrs={"title": "Next"}).get('title')
Output:
Next
I need:
next_page_text = soup.find('a',attrs={"title": "Next"})
Output:
Next
Please let me know if there is any method to find.
You should get Next. Try this, using find() or select_one(), with an if to check that the element is present on the page.
from bs4 import BeautifulSoup
import requests
res=requests.get("https://www.bayut.com/for-sale/property/abu-dhabi/page-182/")
soup=BeautifulSoup(res.text,"html.parser")
if soup.find("a", attrs={"title": "Next"}):
    print(soup.find("a", attrs={"title": "Next"})['title'])
If you want to use a CSS selector:
if soup.select_one("a[title='Next']"):
    print(soup.select_one("a[title='Next']")['title'])
I'm re-writing my answer as there was confusion in your original post.
If you'd like to take the URL associated with the Next tag:
soup.find('a', title='Next')['href']
['href'] can be replaced with any other attribute in the element, so title, itemprop etc.
If you'd like to select the element with Next in the title:
soup.find('a', title='Next')
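To summarise the alternatives to .get("title"), here is a short sketch on the snippet from the question. The closing </div></a> tags are an assumption added so the markup is complete, and "rel" is just an example of an attribute that is not present.

```python
from bs4 import BeautifulSoup

html = ('<a href="/for-sale/property/abu-dhabi/page-3/" title="Next" class="b7880daf">'
        '<div title="Next" class="ea747e34 "></div></a>')
soup = BeautifulSoup(html, "html.parser")

link = soup.find("a", attrs={"title": "Next"})

# find() returns None when nothing matches, so check before reading attributes
if link is not None:
    print(link["title"])        # subscript access; raises KeyError if the attribute is absent
    print(link.attrs["title"])  # the underlying dict of attributes
    print(link.get("rel"))      # .get() returns None for a missing attribute
```

A tag on its own (soup.find(...) without a subscript) always prints as the whole element, so some form of attribute access is needed to get only the value.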

How to get the desired value in BeautifulSoup?

Suppose we have the html code as follows:
html = '<div class="dt name">abc</div><div class="name">xyz</div>'
soup = BeautifulSoup(html, 'lxml')
I want to get the name xyz. Then, I write
soup.find('div',{'class':'name'})
However, it returns abc.
How to solve this problem?
Beautiful Soup returns the first div whose class list includes name. The first div has both the classes dt and name, so it is the one selected.
So filtering on div and the class still leaves two candidates. soup('div') returns a list of all the divs, so the second one is soup('div')[1], and print(soup('div')[1].text) prints xyz. If you want to print all the divs, use this code:
for i in range(len(soup('div'))):
    print(soup('div')[i].text)
And as pointed out in Ankur Sinha's answer, if you want to select all the divs that have only class name, then you have to use select, like this:
soup.select('div[class=name]')[0].get_text()
But if there are multiple divs that satisfy this condition, use this:
for i in range(len(soup.select('div[class=name]'))):
    print(soup.select('div[class=name]')[i].get_text())
To continue from Ankur Sinha's answer: select, or even just soup(), returns a list, because there can be multiple matches. That is why I used len() to get the length of the list, then ran a for loop over it and printed the item at index i, starting from 0.
Indexing gives a single div rather than the list. Calling get_text() on the list itself would raise an error, because a list is not a tag.
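The index-based loop can also be written by iterating over the result directly, which avoids calling select() again on every pass. A minimal sketch; the third div is an assumption added so that more than one element matches:

```python
from bs4 import BeautifulSoup

html = ('<div class="dt name">abc</div>'
        '<div class="name">xyz</div>'
        '<div class="name">pqr</div>')
soup = BeautifulSoup(html, "html.parser")

# select() is evaluated once; each item in the result is a single tag
texts = [div.get_text() for div in soup.select('div[class="name"]')]

for text in texts:
    print(text)
```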
This blog was helpful in doing what you would like, and that is to explicitly find a tag with specific class attribute:
from bs4 import BeautifulSoup
html = '<div class="dt name">abc</div><div class="name">xyz</div>'
soup = BeautifulSoup(html, 'html.parser')
soup.find(lambda tag: tag.name == 'div' and tag['class'] == ['name'])
Output:
<div class="name">xyz</div>
You can do it without lambda also using select to find exact class name like this:
soup.select("div[class = name]")
Will give:
[<div class="name">xyz</div>]
And if you want the value between tags:
soup.select("div[class=name]")[0].get_text()
Will give:
xyz
In case you have multiple divs with class = 'name', then you can do:
for i in range(len(soup.select("div[class=name]"))):
    print(soup.select("div[class=name]")[i].get_text())
Reference:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors
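The reason the plain find() returns abc is that Beautiful Soup treats class as a multi-valued attribute and matches if any one of the values equals the search string. A short sketch that makes this visible on the HTML from the question:

```python
from bs4 import BeautifulSoup

html = '<div class="dt name">abc</div><div class="name">xyz</div>'
soup = BeautifulSoup(html, "html.parser")

first, second = soup.find_all("div")

# class is stored as a list of values, not as one string
print(first["class"])
print(second["class"])

# matches the first div, because "name" is one of its classes
loose = soup.find("div", {"class": "name"})

# matches only a div whose class list is exactly ["name"];
# tag.get() avoids a KeyError on tags that have no class at all
exact = soup.find(lambda tag: tag.name == "div" and tag.get("class") == ["name"])

print(loose.get_text(), exact.get_text())
```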
This might work for you. Note that it relies on the target being the second div in the HTML.
from bs4 import BeautifulSoup
html = '<div class="dt name">abc</div><div class="name">xyz</div>'
soup = BeautifulSoup(html, features='lxml')
print(soup('div')[1].text)

How to get the contents inside of a div tag

I was wondering what I need to put inside of the .find parameter when using beautifulsoup to get the contents of "the-target" shown below.
<div class="item" the-target="this text" another-target="not this text">
This is the .find beautifulsoup parameter I am talking about
help = soup.find('div', 'What should I put here?').get_text()
Thanks
You can filter the div with class item and get the value of the-target key from the resulting tag object (which is a dict-like object):
soup.find('div', attrs={'class': 'item'})['the-target']
If you want to find by the the-target attribute:
soup.find('div', attrs={'the-target': 'this text'})
And get the value of the attribute like before:
soup.find('div', attrs={'the-target': 'this text'})['the-target']
In two steps:
tag = soup.find('div', attrs={'the-target': 'this text'})
the_target = tag.get('the-target')
You can use css selector to find item.
soup.select_one('div.item')['the-target']
OR
soup.select_one('.item')['the-target']
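If the value of the-target is not known in advance, it should also be possible to match on the mere presence of the attribute. A minimal sketch on the div from the question (the closing </div> is an assumption added to complete the markup):

```python
from bs4 import BeautifulSoup

html = '<div class="item" the-target="this text" another-target="not this text"></div>'
soup = BeautifulSoup(html, "html.parser")

# True matches any div that has the attribute, whatever its value
tag = soup.find("div", attrs={"the-target": True})
value = tag["the-target"]

# the equivalent CSS attribute selector
via_css = soup.select_one("div[the-target]")["the-target"]

# get_text() returns the text between the tags, never attribute values
text = tag.get_text()

print(value)
```

This is also why .get_text() in the question cannot work here: the wanted string lives in an attribute, not in the element's text.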

Search and Replace in HTML with BeautifulSoup

I want to use BeautifulSoup to search and replace </a> with </a><br>. I know how to open the page with urllib2 and then parse it to extract all the <a> tags. What I want to do is search for the closing tag and replace it with the closing tag plus the break. Any help is much appreciated.
EDIT
I would assume it would be something similar to:
soup.findAll('a').
In the documentation, there is a:
find(text="ahh").replaceWith('Hooray')
So I would assume it would be along the lines of:
soup.findAll(tag = '</a>').replaceWith(tag = '</a><br>')
But that doesn't work and the python help() doesn't give much
This will insert a <br> tag after the end of each <a>...</a> element:
from BeautifulSoup import BeautifulSoup, Tag
# ....
soup = BeautifulSoup(data)
for a in soup.findAll('a'):
    a.parent.insert(a.parent.index(a)+1, Tag(soup, 'br'))
You can't use soup.findAll(tag = '</a>') because BeautifulSoup doesn't operate on the end tags separately - they are considered part of the same element.
If you wanted to put the <a> elements inside a <p> element as you ask in a comment, you can use this:
for a in soup.findAll('a'):
    p = Tag(soup, 'p') #create a P element
    a.replaceWith(p) #Put it where the A element is
    p.insert(0, a) #put the A element inside the P (between <p> and </p>)
Again, you don't create the <p> and </p> separately because they are part of the same thing.
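The snippets above use the older BeautifulSoup 3 import (from BeautifulSoup import ...). With the bs4 package, the same two operations can probably be written more directly with insert_after() and wrap(). A sketch under that assumption; the sample HTML strings are made up for illustration:

```python
from bs4 import BeautifulSoup

# Insert a <br> right after every <a>...</a> element
soup = BeautifulSoup('<p>one <a href="/x">link</a> two</p>', "html.parser")
for a in soup.find_all("a"):
    a.insert_after(soup.new_tag("br"))

# Put every <a> element inside a new <p> element
soup2 = BeautifulSoup('<div><a href="/x">link</a></div>', "html.parser")
for a in soup2.find_all("a"):
    a.wrap(soup2.new_tag("p"))

print(soup)
print(soup2)
```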
Suppose you have an element that you know contains <br/> tags. One way to replace each <br/> tag with a different string is to do the substitution on the serialized HTML and parse the result again:
with open("your_html_file.html") as f: # pass the markup or a file object, not the file name
    originalSoup = BeautifulSoup(f)
replaceString = ", " # replace each <br/> tag with ", "
# Ex. <p>Hello<br/>World</p> to <p>Hello, World</p>
cleanSoup = BeautifulSoup(str(originalSoup).replace("<br/>", replaceString))
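As an alternative to string replacement on the serialized document, the <br> tags can be replaced in the parsed tree itself, which does not depend on how the tag happens to be serialized. A minimal sketch, assuming the bs4 package:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello<br/>World</p>", "html.parser")

# Swap every <br> element for a plain text node
for br in soup.find_all("br"):
    br.replace_with(", ")

text = soup.p.get_text()
print(text)
```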
You don't replace an end-tag; in BeautifulSoup you are dealing with a document object model like in a browser, not a string full of HTML. So you couldn't ‘replace’ an end-tag without also replacing the start-tag.
What you want to do is insert a new <br> element immediately after the <a>...</a> element. To do so you'll need to find out the index of the <a> element inside its parent element, and insert the new element just after that index. eg.
soup= BeautifulSoup('<body>blah <a>blah</a> blah</body>')
for link in soup.findAll('a'):
    br= Tag(soup, 'br')
    index= link.parent.contents.index(link)
    link.parent.insert(index+1, br)
# soup now serialises to '<body>blah <a>blah</a><br /> blah</body>'
