How to extract a href from an a class in beautiful soup? - python

I'm trying to extract the href attribute from an a tag with a given class, but I can't get it.
I've tried url = tag_variable.find("href"), but I get None.
<a class="product-card__name" href="/store/groceryGateway/en/Herbs/Fresh/Longo%27s-Fresh-Herbs-Basil/p/00772468010517">
<strong>
Longo's Fresh Herbs Basil</strong>
</a>

href is an attribute of the a tag, not a tag itself, and find() searches for tags.
Assuming tag_variable holds the desired a tag, you can subscript it like a dict:
url = tag_variable["href"]
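A minimal sketch of the difference, using the HTML from the question. The lookup that builds tag_variable is an assumption, since the question does not show how the tag was obtained.

```python
from bs4 import BeautifulSoup

html = '''<a class="product-card__name" href="/store/groceryGateway/en/Herbs/Fresh/Longo%27s-Fresh-Herbs-Basil/p/00772468010517">
<strong>Longo's Fresh Herbs Basil</strong>
</a>'''

soup = BeautifulSoup(html, "html.parser")
# Assumed lookup: the question only says the tag is already in tag_variable.
tag_variable = soup.find("a", class_="product-card__name")

# find() searches for child tags, and there is no <href> tag, so this is None.
not_found = tag_variable.find("href")

# Subscripting reads an attribute; it raises KeyError if the attribute is absent.
url = tag_variable["href"]

# get() also reads an attribute, but returns None when it is absent.
title = tag_variable.get("title")

print(url)
```

Subscripting is the stricter choice; get() is convenient when some tags may lack the attribute.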

Related

How Can I Get Information From An A Tag Between Two Span Tags in BeautifulSoup Using Python?

I am trying to get information from the <a> tag in between these two span tags
<span class="mentioned">
<a class="mentioned-123" onclick="information('123');" href="#28669">>>28669</a>
</span>
For example I would like to be able to get the value of the href in it. How can I do this?
You can look for the mentioned-123 class and then access the href with:
soup = BeautifulSoup(html, "html.parser")
print(soup.find("a", class_="mentioned-123")["href"])
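If the class name on the link varies, a CSS selector anchored on the enclosing span may be easier to maintain. This is only a sketch: the sample HTML is copied from the question (with the link text escaped), and relying on the span's class is an assumption about the real page.

```python
from bs4 import BeautifulSoup

html = '''<span class="mentioned">
<a class="mentioned-123" onclick="information('123');" href="#28669">&gt;&gt;28669</a>
</span>'''

soup = BeautifulSoup(html, "html.parser")

# First <a> with an href attribute inside a <span class="mentioned">.
link = soup.select_one("span.mentioned a[href]")

# select_one() returns None when nothing matches, so guard before subscripting.
href = link["href"] if link is not None else None
print(href)
```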

Getting an error using Beautifulsoup find_all() .get('href')

I'm trying to scrape an HTML page for links under a specific class called "category-list".
Each link resides under an h4 tag (I'm ignoring its parent h3 tag):
<ul class="category-list">
<li class="category-item">
<h3>
<a href="/derdubor/c/alarm_og_sikkerhet/">
Alarm og sikkerhet
</a>
</h3>
<ul>
<li>
<h4>
<a href="/derdubor/c/alarm_og_sikkerhet/brannsikring/">
<span class="category-has-customers">
Brannsikring
</span>
(1)
</a>
</h4>
</li>
</ul>
</li>
...
My code for scraping the html is the following:
r = request.urlopen(str_top_url)
soup = BeautifulSoup(r.read(),'html.parser')
tag_category_list = soup.find('ul', class_ = 'category-list')
tag_items = tag_category_list.find_all('h4')
for tag_item in tag_items.find_all('a'):
    print(tag_item.get('href'))
I get the error:
"ResultSet object has no attribute '%s'. You're probably treating a list of items like a single item..."
Reading the BeautifulSoup manual on crummy, it looks like you can use the same methods belonging to the BeautifulSoup class on a tag object?
I can't seem to figure out what I'm doing wrong...
I've tried numerous answers here on Stack Overflow, but to no avail...
Regards MH
The problem is in the line for tag_item in tag_items.find_all('a'):. You should first iterate through tag_items and then through the find_all('a') results of each item. Here is the edited code:
from bs4 import BeautifulSoup
# Sample HTML taken from the question, including the <a> tags
soup = BeautifulSoup('<ul class="category-list"><li class="category-item"><h3><a href="/derdubor/c/alarm_og_sikkerhet/">Alarm og sikkerhet</a></h3><ul><li><h4><a href="/derdubor/c/alarm_og_sikkerhet/brannsikring/"><span class="category-has-customers">Brannsikring</span>(1)</a></h4></li></ul></li></ul>','html.parser')
tag_category_list = soup.find('ul', class_ = 'category-list')
tag_items = tag_category_list.find_all('h4')
# tag_items is a ResultSet, so loop over it before calling find_all() again
for elm in tag_items:
    for tag_item in elm.find_all('a'):
        print(tag_item.get('href'))
And here is the result:
/derdubor/c/alarm_og_sikkerhet/brannsikring/
The problem is that tag_items is a ResultSet, not a Tag.
From the Beautiful Soup documentation:
AttributeError: 'ResultSet' object has no attribute 'foo' - This usually happens because you expected find_all() to return a single tag or string. But find_all() returns a list of tags and strings–a ResultSet object. You need to iterate over the list and look at the .foo of each one. Or, if you really only want one result, you need to use find() instead of find_all().
So this nested loop should work:
for tag_item in tag_items:
    for link in tag_item.find_all('a'):
        print(link.get('href'))
Or, if you were only expecting one h4, change find_all('h4') to find('h4').
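As an alternative to the nested loop, a single CSS selector can collect the same links in one pass. This is a sketch built on the HTML from the question, not the approach either answer proposed.

```python
from bs4 import BeautifulSoup

html = """
<ul class="category-list">
  <li class="category-item">
    <h3><a href="/derdubor/c/alarm_og_sikkerhet/">Alarm og sikkerhet</a></h3>
    <ul>
      <li>
        <h4>
          <a href="/derdubor/c/alarm_og_sikkerhet/brannsikring/">
            <span class="category-has-customers">Brannsikring</span> (1)
          </a>
        </h4>
      </li>
    </ul>
  </li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# Every <a href> that sits inside an <h4> inside the category list.
# The link under <h3> is skipped because it has no <h4> ancestor.
hrefs = [a["href"] for a in soup.select("ul.category-list h4 a[href]")]
print(hrefs)
```

select() always returns a list, so there is no ResultSet-versus-Tag confusion to trip over.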

Parsing multiple items using BeautifulSoup in Python

I'm trying to parse HTML from a website, where there are multiple elements having the same class ID. I can't seem to find a solution; I manage to get one item but not all of them.
Here's a bit of the HTML I'm trying to parse :
<h1>Synonymes travail</h1>
<div class="container-bloc1">
<strong> Nom</strong>
<br/>
-
<i><a class="lien2" href="/fr/accouchement.html"> accouchement </a></i>
:
<a class="lien3" href="/fr/gésine.html"> gésine</a>
<br/>
-
<i> <a class="lien2" href="/fr/action.html"> action </a></i>
:
<a class="lien3" href="/fr/activité.html"> activité</a>
,
<a class="lien3" href="/fr/labeur.html"> labeur</a>
</div>
In Python, I wrote it like this :
from bs4 import BeautifulSoup
import requests
import csv
source = requests.get("http://www.synonymes.net/fr/travail.html").text
soup = BeautifulSoup(source, "lxml")
for synonyme in soup.find_all("div", class_="container-bloc1"):
    print(synonyme)
    synonymesdumot = synonyme.find("a", class_="lien2").text
    print(synonymesdumot)
    for synonymesautres in synonyme.find_all("a", class_="lien3").text:
        print(synonymesautres)
The first part is working, since there is only one "lien2" in the HTML file. I could do the same for "lien3" but I'd only get one item, and I want all of them.
What am I doing wrong here? Thanks for your help guys!
If you run the code as it is in your question, you get an AttributeError, because .find_all() returns a collection of tags (a ResultSet, more specifically) that has no text attribute. Each of its elements, which are of type bs4.element.Tag, does have one. So you need to read the text attribute of each tag inside the for loop:
for synonymesautres in synonyme.find_all("a", class_="lien3"):
    print(synonymesautres.text)
Output:
le
travail
manque
de
travail
travail
fatigant
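The same idea can be written as a list comprehension when you want the values collected rather than printed. The sketch below runs against a trimmed copy of the HTML from the question instead of the live site, so its result differs from the output above.

```python
from bs4 import BeautifulSoup

html = """
<div class="container-bloc1">
<i> <a class="lien2" href="/fr/action.html"> action </a></i>
:
<a class="lien3" href="/fr/activité.html"> activité</a>
,
<a class="lien3" href="/fr/labeur.html"> labeur</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
block = soup.find("div", class_="container-bloc1")

# find_all() returns a ResultSet; read the text of each Tag, not of the set.
others = [a.get_text(strip=True) for a in block.find_all("a", class_="lien3")]
print(others)
```

get_text(strip=True) also trims the leading spaces that the link text carries in this markup.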

Python: extract the href surrounding image

I am using bs4 and want to extract a href of a specified image.
For example in the html code I have:
<div style="text-align:center;"><a href="page/folder1/image.jpg"><img src="page_files/image.jpg" alt="Picture" border="0" width="150" height="150"></a></div>
</div>
I have the image src (page_files/image.jpg) and want to extract the corresponding href, which in this example is page/folder1/image.jpg. I tried the find_previous method, but I have trouble extracting the href content:
soup = bs4.BeautifulSoup(page)
for img in soup('img'):
    imgLink = img.find_previous("a")
This returns the whole tag:
<img alt="Tumblr" border="0" src="Here_is_source"/>
But I can't get the href content, because when I try:
imgLink = img.find_previous("a")['href']
I get an error.
The same thing happens when I use find_parent, like this:
imgLink = img.find_parent("a")['href']
How can I fix that? And what is better: find_previous() or find_parent()?
Make sure you are only looking for images that have an <a> parent tag with an href attribute:
for img in soup.select('a[href] img'):
    link = img.find_parent('a', href=True)
    print(link['href'])
The CSS selector picks only images that sit inside an <a href="..."> tag. The find_parent() search then again limits the match to <a> tags that have the attribute set.
If you are searching for all images, chances are you are finding some whose parent or preceding <a> tag does not have an href attribute; <a> tags can also be used for link targets with <a name="...">, for example. If you are getting errors that mention NoneType (for example TypeError: 'NoneType' object is not subscriptable), that simply means there is no such parent tag for the given <img> tag.
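For the specific case of looking up one image by its src, a small helper with explicit None checks avoids the error. This is a sketch: href_for_image is a hypothetical name, and the sample HTML is an assumption reconstructed from the question's description. On the find_previous() versus find_parent() question, find_parent() only considers ancestors of the image, while find_previous() walks backwards through the document and can return an unrelated link that merely appears earlier.

```python
from bs4 import BeautifulSoup

# Assumed sample page: one linked image, one image inside a named anchor
# without href, and one image with no <a> parent at all.
html = """
<div><a href="page/folder1/image.jpg"><img src="page_files/image.jpg" alt="Picture"></a></div>
<div><a name="anchor"><img src="page_files/other.jpg" alt="Anchor only"></a></div>
<div><img src="page_files/bare.jpg" alt="No link"></div>
"""

soup = BeautifulSoup(html, "html.parser")


def href_for_image(soup, src):
    """Return the href of the <a> enclosing the image with this src, or None."""
    img = soup.find("img", src=src)
    if img is None:
        return None
    # href=True skips ancestors such as <a name="..."> that carry no href.
    link = img.find_parent("a", href=True)
    return link["href"] if link is not None else None


print(href_for_image(soup, "page_files/image.jpg"))
```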

splitting the outerHTML attribute in python

I want to extract a particular piece of text from the outerHTML attribute of a web link.
while Id is true:
    link = driver.find_element_by_xpath("//a[@id='bu:ms:all-sp:2']")
    href = link.get_attribute("outerHTML")
    link.click()
    # This will load the link in the same page !
    self.assertIn(href, self.page.get_current_url())
When I print href, the output is:
<a id="bu:ms:all-sp:8" href="/euro/tennis" class="Pointer"><span class="SportImg8"></span> Tennis <span class="NumEvt">51</span></a>
I want to split this and assert only the value of href (/euro/tennis) against the current URL.
Could anyone please help me out here?
Get href attribute instead of outerHTML:
href = link.get_attribute("href")
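One caveat, stated as an assumption about typical Selenium behaviour rather than something the answer says: get_attribute("href") often returns the resolved absolute URL instead of the literal /euro/tennis text. Comparing URL paths sidesteps that either way. The helper below uses only the standard library; same_path and the example.com URLs are hypothetical.

```python
from urllib.parse import urljoin, urlparse


def same_path(href, current_url):
    """True if href, resolved against current_url, points at the same path."""
    # urljoin leaves absolute URLs alone and resolves relative ones.
    resolved = urljoin(current_url, href)
    return urlparse(resolved).path == urlparse(current_url).path


# Works whether href is relative or already absolute.
print(same_path("/euro/tennis", "http://example.com/euro/tennis"))
print(same_path("http://example.com/euro/tennis", "http://example.com/euro/tennis?tab=1"))
```

In the test from the question, the assertion could then take the form self.assertTrue(same_path(href, self.page.get_current_url())), assuming get_current_url() returns the full URL as a string.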
