Getting an error using BeautifulSoup find_all() .get('href') - Python

I'm trying to scrape an HTML page for links under a specific class called "category-list".
Each link resides under an h4 tag (I'm ignoring its parent h3 tag):
<ul class="category-list">
    <li class="category-item">
        <h3>
            <a href="/derdubor/c/alarm_og_sikkerhet/">
                Alarm og sikkerhet
            </a>
        </h3>
        <ul>
            <li>
                <h4>
                    <a href="/derdubor/c/alarm_og_sikkerhet/brannsikring/">
                        <span class="category-has-customers">
                            Brannsikring
                        </span>
                        (1)
                    </a>
                </h4>
            </li>
        </ul>
    </li>
...
My code for scraping the HTML is the following:
r = request.urlopen(str_top_url)
soup = BeautifulSoup(r.read(), 'html.parser')
tag_category_list = soup.find('ul', class_='category-list')
tag_items = tag_category_list.find_all('h4')
for tag_item in tag_items.find_all('a'):
    print(tag_item.get('href'))
I get the error:
"ResultSet object has no attribute '%s'. You're probably treating a list of items like a single item..."
Reading the BeautifulSoup manual on crummy.com, it looks like you can use the same methods belonging to the BeautifulSoup class on a Tag object?
I can't seem to figure out what I'm doing wrong...
I've tried numerous answers here on Stack Overflow, but to no avail...
Regards MH

The problem is in this line: for tag_item in tag_items.find_all('a'):. You should first iterate through tag_items and then through the find_all('a') items. Here is the edited code:
from bs4 import BeautifulSoup
soup = BeautifulSoup('<ul class="category-list"><li class="category-item"><h3><a href="/derdubor/c/alarm_og_sikkerhet/">Alarm og sikkerhet</a></h3><ul><li><h4><a href="/derdubor/c/alarm_og_sikkerhet/brannsikring/"><span class="category-has-customers">Brannsikring</span>(1)</a></h4></li></ul></li></ul>', 'html.parser')
tag_category_list = soup.find('ul', class_='category-list')
tag_items = tag_category_list.find_all('h4')
for elm in tag_items:
    for tag_item in elm.find_all('a'):
        print(tag_item.get('href'))
And here is the result:
/derdubor/c/alarm_og_sikkerhet/brannsikring/
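For what it's worth, the same hrefs can also be pulled in one pass with a CSS selector. A minimal sketch against a trimmed version of the question's markup:
from bs4 import BeautifulSoup

html = ('<ul class="category-list"><li class="category-item"><ul><li><h4>'
        '<a href="/derdubor/c/alarm_og_sikkerhet/brannsikring/">Brannsikring</a>'
        '</h4></li></ul></li></ul>')
soup = BeautifulSoup(html, 'html.parser')
# the selector matches every <a> nested under an <h4> inside ul.category-list
for a in soup.select('ul.category-list h4 a'):
    print(a.get('href'))  # /derdubor/c/alarm_og_sikkerhet/brannsikring/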

The problem is that tag_items is a ResultSet, not a Tag.
From the Beautiful Soup documentation:
AttributeError: 'ResultSet' object has no attribute 'foo' - This usually happens because you expected find_all() to return a single tag or string. But find_all() returns a list of tags and strings–a ResultSet object. You need to iterate over the list and look at the .foo of each one. Or, if you really only want one result, you need to use find() instead of find_all().
So this nested loop should work:
for tag_item in tag_items:
    for link in tag_item.find_all('a'):
        print(link.get('href'))
Or, if you were only expecting one h4, change find_all('h4') to find('h4').
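A minimal sketch of that variant, assuming the list only ever contains one h4:
h4 = tag_category_list.find('h4')  # a single Tag (or None), not a ResultSet
if h4 is not None:
    for link in h4.find_all('a'):
        print(link.get('href'))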

Related

How Can I Get Information From An A Tag Between Two Span Tags in BeautifulSoup Using Python?

I am trying to get information from the <a> tag in between these two span tags
<span class="mentioned">
<a class="mentioned-123" onclick="information('123');" href="#28669">>>28669</a>
</span>
For example, I would like to be able to get the value of its href attribute. How can I do this?
You can look for the mentioned-123 class and then access the href with:
soup = BeautifulSoup(html, "html.parser")
print(soup.find("a", class_="mentioned-123")["href"])
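A self-contained check against the snippet from the question (html.parser is just one parser choice):
from bs4 import BeautifulSoup

html = '''<span class="mentioned">
<a class="mentioned-123" onclick="information('123');" href="#28669">>>28669</a>
</span>'''
soup = BeautifulSoup(html, "html.parser")
print(soup.find("a", class_="mentioned-123")["href"])  # prints: #28669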

What is the difference between find() and find_all() in beautiful soup python?

I was doing web scraping but I got stuck/confused between find() and find_all(): where to use find_all(), and where to use find(). Also, where can I use these methods, such as in a for loop or on a ul/li list?
Here is the code I tried:
from bs4 import BeautifulSoup
import requests
urls = "https://www.flipkart.com/offers-list/latest-launches?screen=dynamic&pk=themeViews%3DAug19-Latest-launch-Phones%3ADTDealcard~widgetType%3DdealCard~contentType%3Dneo&wid=7.dealCard.OMU_5&otracker=hp_omu_Latest%2BLaunches_5&otracker1=hp_omu_WHITELISTED_neo%2Fmerchandising_Latest%2BLaunches_NA_wc_view-all_5"
source = requests.get(urls)
soup = BeautifulSoup(source.content, 'html.parser')
divs = soup.find_all('div', class_='MDGhAp')
names = divs.find_all('a')
full_name = names.find_all('div', class_='iUmrbN').text
print(full_name)
And got error like this
File "C:/Users/ASUS/Desktop/utube/sunil.py", line 9, in <module>
names = divs.find_all('a')
File "C:\Users\ASUS\AppData\Local\Programs\Python\Python38-32\lib\site-packages\bs4\element.py", line 1601, in __getattr__
raise AttributeError(
AttributeError: ResultSet object has no attribute 'find_all'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?
So can anyone explain where I should use find() and where find_all()?
find() - It returns the first result when the searched element is found in the page. The return type is <class 'bs4.element.Tag'>.
find_all() - It returns all the matches, i.e. it scans the entire document and returns all the results. The return type is <class 'bs4.element.ResultSet'>.
from robobrowser import RoboBrowser
browser = RoboBrowser(history=True, parser='html.parser')
browser.open('http://www.stackoverflow.com')
res=browser.find('h3')
print(type(res),res)
print(" ")
res=browser.find_all('h3')
print(type(res),res)
print(" ")
print("Iterating the Resultset")
print(" ")
for x in range(0, len(res)):
    print(x, res[x])
print(" ")
Output:
<class 'bs4.element.Tag'> <h3>current community
</h3>
<class 'bs4.element.ResultSet'> [<h3>current community
</h3>, <h3>
your communities </h3>, <h3>more stack exchange communities
</h3>, <h3 class="w90 mx-auto ta-center p-ff-roboto-slab-bold fs-headline2 mb24">Questions are everywhere, answers are on Stack Overflow</h3>, <h3 class="w90 mx-auto ta-center p-ff-roboto-slab-bold fs-headline2 mb24">Learn and grow with Stack Overflow</h3>, <h3 class="mx-auto w90 wmx12 p-ff-roboto-slab-bold fs-headline2 mb24 lg:ta-center">Looking for a job?</h3>]
Iterating the Resultset
0 <h3>current community
</h3>
1 <h3>
your communities </h3>
2 <h3>more stack exchange communities
</h3>
3 <h3 class="w90 mx-auto ta-center p-ff-roboto-slab-bold fs-headline2 mb24">Questions are everywhere, answers are on Stack Overflow</h3>
4 <h3 class="w90 mx-auto ta-center p-ff-roboto-slab-bold fs-headline2 mb24">Learn and grow with Stack Overflow</h3>
5 <h3 class="mx-auto w90 wmx12 p-ff-roboto-slab-bold fs-headline2 mb24 lg:ta-center">Looking for a job?</h3>
Maybe it's clearer with this example:
from bs4 import BeautifulSoup
html = """
<ul>
<li>First</li>
<li>Second</li>
<li>Third</li>
</ul>
"""
soup = BeautifulSoup(html,'html.parser')
for n in soup.find('li'):
    # find() gives you a single tag; iterating it yields its children
    print(n)
for n in soup.find_all('li'):
    # find_all() gives you every matching tag
    print(n)
Result :
First
<li>First</li>
<li>Second</li>
<li>Third</li>
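A note on why the first loop prints only First: soup.find('li') returns a single Tag, and iterating over a Tag iterates its children, so the loop yields the string inside the first <li> rather than a series of <li> elements. A quick check:
li = soup.find('li')      # a single Tag: <li>First</li>
print(type(li))           # <class 'bs4.element.Tag'>
print(list(li.children))  # ['First'] - iterating a Tag walks its children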
For more information, please read https://www.crummy.com/software/BeautifulSoup/bs4/doc/#calling-a-tag-is-like-calling-find-all
Found this in the Beautiful Soup documentation: if you are scraping something specific, try find(); if you are scraping something more general, such as every a or span, give find_all() a try.
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.find(id="link3")
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
Hope this helps!
Let us understand with the help of an example: I am trying to get the list of book names on the mentioned website. (https://www.bookdepository.com/bestsellers)
To iterate through all the book-related tags at once I use the find_all command; subsequently, I use find inside each list item to get the title of the book.
Note: find will fetch you the first match (the only match in this case), while find_all will produce a list of all matching items, which you can iterate through further.
from bs4 import BeautifulSoup as bs
import requests

url = "https://www.bookdepository.com/bestsellers"
response = requests.get(url)
soup = bs(response.content, "html.parser")
Use find_all to go through all book items:
a = soup.find_all("div", class_="item-info")
Use find to get the title of each book inside each book item:
for i in a:
    print(i.find("h3", class_="title").get_text())
From the documentation:
The find_all() method scans the entire document looking for results, but sometimes you only want to find one result. If you know a document only has one <title> tag, it's a waste of time to scan the entire document looking for more. Rather than passing in limit=1 every time you call find_all, you can use the find() method ... the following two lines are nearly equivalent:
soup.find_all('title', limit=1)
soup.find('title')
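One practical difference to keep in mind: when nothing matches, find() returns None while find_all() returns an empty ResultSet, so the two fail differently:
print(soup.find('nosuchtag'))      # None
print(soup.find_all('nosuchtag'))  # []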

Parsing multiple items using BeautifulSoup in Python

I'm trying to parse HTML from a website, where there are multiple elements with the same class. I can't seem to find a solution; I manage to get one item but not all of them.
Here's a bit of the HTML I'm trying to parse :
<h1>Synonymes travail</h1>
<div class="container-bloc1">
<strong> Nom</strong>
<br/>
-
<i><a class="lien2" href="/fr/accouchement.html"> accouchement </a></i>
:
<a class="lien3" href="/fr/gésine.html"> gésine</a>
<br/>
-
<i> <a class="lien2" href="/fr/action.html"> action </a></i>
:
<a class="lien3" href="/fr/activité.html"> activité</a>
,
<a class="lien3" href="/fr/labeur.html"> labeur</a>
</div>
In Python, I wrote it like this:
from bs4 import BeautifulSoup
import requests
import csv
source = requests.get("http://www.synonymes.net/fr/travail.html").text
soup = BeautifulSoup(source, "lxml")
for synonyme in soup.find_all("div", class_="container-bloc1"):
    print(synonyme)
    synonymesdumot = synonyme.find("a", class_="lien2").text
    print(synonymesdumot)
    for synonymesautres in synonyme.find_all("a", class_="lien3").text:
        print(synonymesautres)
The first part is working, since there is only one "lien2" in the HTML file. I could do the same for "lien3" but I'd only get one item, and I want all of them.
What am I doing wrong here? Thanks for your help guys!
If you run the code as-is from your question, you run into an AttributeError, because the output of .find_all() is a collection of tags (a ResultSet, more specifically) that has no attribute text; but each of its elements, which are of type bs4.element.Tag, does. So you need to get the text attribute for each of the tags inside the for loop:
for synonymesautres in synonyme.find_all("a", class_="lien3"):
    print(synonymesautres.text)
Output:
le
travail
manque
de
travail
travail
fatigant
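If the stray whitespace around those words is unwanted, get_text(strip=True) trims it:
for synonymesautres in synonyme.find_all("a", class_="lien3"):
    print(synonymesautres.get_text(strip=True))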

beautifulsoup CSS Select - find a tag in which a particular attribute (style, for example) is not present

My first post here on SO. Thanks for helping us noobs for so long. Coming straight to the point:
Scenario:
I am working on an existing program that reads a CSS selector as a string from a configuration file, to make the program dynamic and able to scrape any site by just changing the configured CSS selector.
Problem:
I am trying to scrape a site which is rendering items as one of the 2 options below:
Option1:
.........
<div class="price">
    <span class="price" style="color:red;margin-right:0.1in">
        <del>$299</del>
    </span>
    <span class="price">
        $195
    </span>
</div>
soup = soup.select("span.price") - this doesn't work, as I need the second (or last) span tag :(
Option2:
.........
<div class="price">
    <span class="price">
        $199
    </span>
</div>
soup = soup.select("span.price") - this works great!
Question:
In both the above options I want to be able to get the last span tag ($195 or $199) and don't care about the $299. Basically I just want to extract the final sale price and not the original price.
So the 2 ways I know as of now are:
1) Always get the last span tag
2) Always get the span tag which doesn't have style attribute
Now, I know the :not operator and :last-of-type are not present in bs4 (only :nth-of-type is available), so I am stuck here. Any suggestions are helpful.
Edit: Since this is an existing program, I can't use soup.find_all() or any other method apart from soup.select(). Sorry :(
Thanks!
You can search for the span tag without the style attribute:
prices = soup.select('span.price')
no_style = [price for price in prices if 'style' not in price.attrs]
>> [<span class="price">$199</span>]
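As a side note, newer versions of Beautiful Soup (4.7+, which delegate CSS matching to the soupsieve package) do support :not() in select(), so, assuming an up-to-date install, this may work directly:
# requires bs4 >= 4.7, where soupsieve handles the :not() pseudo-class
for span in soup.select('span.price:not([style])'):
    print(span.get_text(strip=True))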
This might be a good time to use a function. BeautifulSoup passes each tag in the document to the function, and the function tests whether the tag's name is span and it lacks the style attribute. If this is true then BeautifulSoup appends the tag to its list of results.
HTML = '''\
<div class='price'>
<span class='price' style='color: red; margin-right: 0.1in'>
<del>$299</del>
</span>
<span class='price'>
$195
</span>
</div>'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(HTML, 'lxml')
for item in soup.find_all(lambda tag: tag.name == 'span' and not tag.has_attr('style')):
    print(item)
The code inside the select function needs to change to:
def select(soup, the_variable_you_pass):
    return soup.find('div', attrs={'class': 'price'}).find_all(the_variable_you_pass)[-1]
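A quick usage sketch: with Option 1 from the question this prints $195, and with Option 2 it prints $199:
last_span = select(soup, 'span')
print(last_span.get_text(strip=True))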

Access next sibling <li> element with BeautifulSoup

I am completely new to web parsing with Python/BeautifulSoup. I have an HTML that has (part of) the code as follows:
<div id="pages">
    <ul>
        <li class="active"><a href="...">Example</a></li>
        <li><a href="...">Example</a></li>
        <li><a href="...">Example 1</a></li>
        <li><a href="...">Example 2</a></li>
    </ul>
</div>
I have to visit each link (basically each <li> element) until there are no more <li> tags present. Each time a link is clicked, its corresponding <li> element gets the class 'active'. My code is:
from bs4 import BeautifulSoup
import urllib2
import re
landingPage = urllib2.urlopen('somepage.com').read()
soup = BeautifulSoup(landingPage)
pageList = soup.find("div", {"id": "pages"})
page = pageList.find("li", {"class": "active"})
This code gives me the first <li> item in the list. My logic is that I keep checking whether next_sibling is not None. If it is not None, I create an HTTP request to the href attribute of the <a> tag in that sibling <li>. That would get me to the next page, and so on, till there are no more pages.
But I can't figure out how to get the next_sibling of the page variable given above. Is it page.next_sibling.get("href") or something like that? I looked through the documentation, but somehow couldn't find it. Can someone help, please?
Use find_next_sibling() and be explicit about what sibling element you want to find:
next_li_element = page.find_next_sibling("li")
next_li_element will be None if the active li is the last one in the list:
if next_li_element is None:
    # no more pages to go
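Tying it back to the paging logic described in the question, a rough sketch (assuming each sibling <li> wraps an <a> with a resolvable href, as the question implies):
while True:
    next_li = page.find_next_sibling("li")
    if next_li is None:
        break  # reached the last page
    link = next_li.find("a")
    landingPage = urllib2.urlopen(link.get("href")).read()
    soup = BeautifulSoup(landingPage)
    page = soup.find("div", {"id": "pages"}).find("li", {"class": "active"})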
Have you looked at dir(page) or the documentation? If so, how did you miss .find_next_sibling()?
from bs4 import BeautifulSoup
import urllib2
import re
landingPage = urllib2.urlopen('somepage.com').read()
soup = BeautifulSoup(landingPage)
pageList = soup.find("div", {"id": "pages"})
page = pageList.find("li", {"class": "active"})
sibling = page.find_next_sibling()
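Note the difference from the plain .next_sibling attribute the question asks about: in markup like this, the immediate sibling of a tag is usually the whitespace between tags, so .next_sibling often returns a newline string rather than the next <li>:
page.next_sibling         # typically a '\n' NavigableString, not the next <li>
page.find_next_sibling()  # skips over strings and returns the next Tag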
