How to use BeautifulSoup to get the image from a webpage - Python

I am struggling to get an image from this webpage. I am able to get the title, price and other elements fine, just not the image. The relevant HTML is:
<div class="product-img">
<a data-test-selector="linkProductURL" href="https://www.scottycameron.com/store/product/3494">
<div class="image" style="min-height: 350px;">
<img data-test-selector="imgProductImage" id="img-3494" class="img-responsive b-lazy b-loaded"
src="https://api.scottycameron.com/Data/Media/Catalog/1/370/c09b7470-42dd-47e5-a244-
9ef3d073c742LICENSE%20PLATE%20FRAME%20-%20SCOTTY%20CAMERON%20FINE%20MILLED%20PUTTERS.jpg">
The code I am currently using is:
for ele in array:
    item = [ele.find('h4', {'class': 'title'}).text,                              # title
            ele.find('span', {'data-test-selector': 'spanPrice'}).text,           # price
            ele.find('img', {'class': 'img-responsive b-lazy b-loaded'})['src']]  # image src
But that returns:
TypeError: 'NoneType' object is not subscriptable
Anyone have any idea?

You might want to check if there's an image tag in the first place and then reach for the attribute:
from bs4 import BeautifulSoup

element = """
<div class="product-img">
    <a data-test-selector="linkProductURL" href="https://www.scottycameron.com/store/product/3494">
        <div class="image" style="min-height: 350px;">
            <img data-test-selector="imgProductImage" id="img-3494" class="img-responsive b-lazy b-loaded"
                 src="https://api.scottycameron.com/Data/Media/Catalog/1/370/c09b7470-42dd-47e5-a244-9ef3d073c742LICENSE%20PLATE%20FRAME%20-%20SCOTTY%20CAMERON%20FINE%20MILLED%20PUTTERS.jpg">
        </div>
    </div>"""

image = BeautifulSoup(element, "html.parser").find("img", class_="img-responsive b-lazy b-loaded")
if image is not None:
    print(image["src"])
Output:
https://api.scottycameron.com/Data/Media/Catalog/1/370/c09b7470-42dd-47e5-a244-9ef3d073c742LICENSE%20PLATE%20FRAME%20-%20SCOTTY%20CAMERON%20FINE%20MILLED%20PUTTERS.jpg
EDIT:
As per your comment, try this:
item = []
for ele in array:
    title = ele.find('h4', {'class': 'title'}).text
    price = ele.find('span', {'data-test-selector': 'spanPrice'}).text
    img_src = ele.find('img', {'class': 'img-responsive b-lazy b-loaded'})
    if img_src is not None:
        item.append([title, price, img_src["src"]])
    else:
        item.append([title, price, "No image source"])
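One thing to keep in mind: the b-loaded class is normally added by the page's lazy-loading script after the image has loaded in a browser. If you fetch the raw HTML with requests/urllib, the tag may only carry b-lazy and the real URL may live in a data-src attribute instead of src (this is an assumption about the site's lazy loader, not something confirmed from the page). A sketch of the inner part of the same loop that tolerates both cases, matching on the more stable data-test-selector attribute:
# Assumption: the data-test-selector attribute is present in the served HTML.
img = ele.find('img', {'data-test-selector': 'imgProductImage'})
if img is not None:
    # Prefer src, fall back to data-src for lazy-loaded images.
    src = img.get('src') or img.get('data-src') or "No image source"
else:
    src = "No image source"
item.append([title, price, src])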

Using this:
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('https://www.scottycameron.com/store/product/3494')
bs = BeautifulSoup(html, 'html.parser')
images = bs.find_all('img')
for img in images:
    if img.has_attr('src'):
        print(img['src'])
Output
/img/icon-header-user.png
/img/icon-header-cart.png
https://www.scottycameron.com/media/18299/puttertarchivenav_jan2021.jpg
https://www.scottycameron.com/media/18302/customizenav_jan2021.jpg
https://www.scottycameron.com/media/18503/showcasenav_2_2021.jpg
https://www.scottycameron.com/media/18454/2021phtmx_new_nws_thmb1.jpg
https://www.scottycameron.com/media/18301/aboutnav_jan2021_b.jpg
https://api.scottycameron.com/Data/Media/Catalog/1/1000/c09b7470-42dd-47e5-a244-9ef3d073c742LICENSE PLATE FRAME - SCOTTY CAMERON FINE MILLED PUTTERS.jpg
/store/content/images/loading.svg
All image URLs on the page will be collected; from those we can do whatever further processing is required.
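If you only want the product image rather than every image on the page, one option (a sketch, assuming the data-test-selector attribute is present in the HTML that urlopen receives) is to filter on that attribute instead of the class list:
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('https://www.scottycameron.com/store/product/3494')
bs = BeautifulSoup(html, 'html.parser')

# Look up the product image by its data-test-selector attribute,
# which is less likely to change than the lazy-loading classes.
product_img = bs.find('img', attrs={'data-test-selector': 'imgProductImage'})
if product_img is not None:
    print(product_img.get('src') or product_img.get('data-src'))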

Related

How to extract SRC attribute from IMG tag with BeautifulSoup?

This is what I get with BS4, and I've managed to write it to a CSV:
[<div class="clearfix hide-mobile" id="image-block"><span id="view_full_size"><span class="product_game-language"><svg class="svg-flag_en icon"><use xlink:href="#svg-flag_en"></use></svg></span><span class="ratio-container" style="max-width:372px"><img alt="Dominion - Plunder" height="372" id="bigpic" itemprop="image" loading="lazy" src="https://imageotoget.jpg" width="372"/></span></span></div>]
How can I get only the SRC attribute and not the entire tags text?
from bs4 import BeautifulSoup
html = '<div class="clearfix hide-mobile" id="image-block"><span id="view_full_size"><span class="product_game-language"><svg class="svg-flag_en icon"><use xlink:href="#svg-flag_en"></use></svg></span><span class="ratio-container" style="max-width:372px"><img alt="Dominion - Plunder" height="372" id="bigpic" itemprop="image" loading="lazy" src="https://imageotoget.jpg" width="372"/></span></span></div>'
soup = BeautifulSoup(html, 'html.parser')
element = soup.select_one('img')
src = element.attrs['src']
print(src)
or, as @Ahmed Mohamedeen said,
src = element['src']
result is the same
https://imageotoget.jpg
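If the page had several images and you wanted every src at once, a short sketch using a CSS attribute selector (the markup below is my own example, not from the question) limited to tags that actually carry a src:
from bs4 import BeautifulSoup

html = '<img src="https://a.jpg"><img alt="no source"><img src="https://b.jpg">'  # example markup
soup = BeautifulSoup(html, 'html.parser')

# The [src] attribute selector skips img tags without a src.
sources = [img['src'] for img in soup.select('img[src]')]
print(sources)  # ['https://a.jpg', 'https://b.jpg']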

How to get child values of a div separately using BeautifulSoup

I have a div tag which contains three anchor tags, each with a URL in it.
I am able to print those 3 hrefs, but they get merged into one value.
Is there a way I can get three separate values?
Div looks like this:
<div class="speaker_social_wrap">
<a href="https://twitter.com/Sigve_telenor" target="_blank">
<i aria-hidden="true" class="x-icon x-icon-twitter" data-x-icon-b=""></i>
</a>
<a href="https://no.linkedin.com/in/sigvebrekke" target="_blank">
<i aria-hidden="true" class="x-icon x-icon-linkedin-in" data-x-icon-b=""></i>
</a>
<a href="https://www.facebook.com/sigve.telenor" target="_blank">
<i aria-hidden="true" class="x-icon x-icon-facebook-f" data-x-icon-b=""></i>
</a>
What I have tried so far:
social_media_url = soup.find_all('div', class_ = 'foo')
for url in social_media_url:
    print(url)
Expected Result:
http://twitter-url
http://linkedin-url
http://facebook-url
My Output
<div><a twitter-url><a linkedin-url><a facebook-url></div>
You can do it like this:
from bs4 import BeautifulSoup
import requests

url = 'https://dtw.tmforum.org/speakers/sigve-brekke-2/'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')

a = soup.find('div', class_='speaker_social_wrap').find_all('a')
for i in a:
    print(i['href'])
Output:
https://twitter.com/Sigve_telenor
https://no.linkedin.com/in/sigvebrekke
https://www.facebook.com/sigve.telenor
Your selector gives you the div itself, not the URLs, and find_all returns a list of matches rather than a single element. You need something more like:
social_media_div = soup.find('div', class_='foo')
social_media_anchors = social_media_div.find_all('a')
for anchor in social_media_anchors:
    print(anchor.get('href'))
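If you want the links keyed by platform rather than as a flat list, a small sketch (assuming the platform name can be read from the URL's hostname) could look like this:
from urllib.parse import urlparse
from bs4 import BeautifulSoup

html = """<div class="speaker_social_wrap">
<a href="https://twitter.com/Sigve_telenor" target="_blank"></a>
<a href="https://no.linkedin.com/in/sigvebrekke" target="_blank"></a>
<a href="https://www.facebook.com/sigve.telenor" target="_blank"></a>
</div>"""

soup = BeautifulSoup(html, 'html.parser')
links = {}
for a in soup.find('div', class_='speaker_social_wrap').find_all('a', href=True):
    # Use the second-level domain (twitter, linkedin, facebook) as the key.
    host = urlparse(a['href']).hostname or ''
    key = host.split('.')[-2] if '.' in host else host
    links[key] = a['href']

print(links)
# {'twitter': 'https://twitter.com/Sigve_telenor',
#  'linkedin': 'https://no.linkedin.com/in/sigvebrekke',
#  'facebook': 'https://www.facebook.com/sigve.telenor'}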

How to parse the drop-down list and get all the links to the PDFs using Beautiful Soup in Python?

I'm trying to scrape the PDF links from the drop-down on this website. I want to scrape just the Guideline Values (CVC) drop-down. Following is the code I used, but it did not succeed:
import requests
from bs4 import BeautifulSoup

req_ses = requests.Session()
igr_get_base_response = req_ses.get("https://igr.karnataka.gov.in/english#")
soup = BeautifulSoup(igr_get_base_response.text)

def matches_block(tag):
    return matches_dropdown(tag) and tag.find(matches_text) != None

def matches_dropdown(tag):
    return tag.name == 'li' and tag.has_attr('class') and 'dropdown-toggle' in tag['class']

def matches_text(tag):
    return tag.name == 'a' and tag.get_text()

for li in soup.find_all(matches_block):
    for ul in li.find_all('ul', class_='dropdown-toggle'):
        for a in ul.find_all('a'):
            if a.has_attr('href'):
                print(a['href'])
Any suggestion would be a great help!
Edit: Adding part of HTML below:
<div class="collapse navbar-collapse">
<ul class="nav navbar-nav">
<li class="">
<i class="fa fa-home"> </i>
</li>
<li>
<a class="dropdown-toggle" data-toggle="dropdown" title="RTI Act">RTI Act <b class="caret"></b></a>
<ul class="dropdown-menu multi-level">
<!-- <li> -->
<li class="">
<a href=" https://igr.karnataka.gov.in/page/RTI+Act/Yadagiri+./en " title="Yadagiri .">Yadagiri .
</a>
</li>
<!-- </li> -->
<!-- <li>
I have tried to get the links to all the PDF files that you need.
I selected the <a> tags whose href matches the pattern (see patt in the code below); this pattern is common to all the PDF files you need.
After the loop, all the links to the PDF files are in the links list.
from bs4 import BeautifulSoup
import requests

url = 'https://igr.karnataka.gov.in/english#'
resp = requests.get(url)
soup = BeautifulSoup(resp.text, 'html.parser')

# Locate the "Guidelines Value (CVC)" menu entry.
a = soup.find('a', attrs= {'title': 'Guidelines Value (CVC)'})
# a.parent() is shorthand for a.parent.find_all(), i.e. every tag nested under the parent element.
lst = a.parent()

links = []
patt = 'https://igr.karnataka.gov.in/storage/pdf-files/Guidelines Value/'
for i in lst:
    temp = i.find('a')
    if temp:
        if patt in temp['href']:
            links.append(temp['href'].strip())
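To actually fetch the files once links is populated by the loop above, a minimal sketch (the output directory and filename handling are my own assumptions; the spaces in the URLs are replaced with %20 before the request):
import os
import requests

out_dir = 'guideline_pdfs'  # hypothetical output directory
os.makedirs(out_dir, exist_ok=True)

for link in links:
    url = link.replace(' ', '%20')                         # encode the spaces in the URL
    filename = os.path.join(out_dir, link.rsplit('/', 1)[-1])
    resp = requests.get(url)
    if resp.ok:
        with open(filename, 'wb') as f:
            f.write(resp.content)                          # save the PDF to disk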
I first find the ul_tag in which all the data is available, then call find_all on it for the a tags with target="_blank"; those are the ones whose href points to a .pdf, so from them we can extract only the .pdf links.
from bs4 import BeautifulSoup
import requests

res = requests.get("https://igr.karnataka.gov.in/english#")
soup = BeautifulSoup(res.text, "lxml")

ul_tag = soup.find("ul", class_="nav navbar-nav")
a_tag = ul_tag.find_all("a", attrs={"target": "_blank"})
for i in a_tag:
    print(i.get_text(strip=True))
    print(i.get("href").strip())
Output:
SRO Chikkaballapur
https://igr.karnataka.gov.in/storage/pdf-files/Guidelines Value/chikkaballapur sro.pdf
SRO Gudibande
https://igr.karnataka.gov.in/storage/pdf-files/Guidelines Value/gudibande sro.pdf
SRO Shidlaghatta
https://igr.karnataka.gov.in/storage/pdf-files/Guidelines Value/shidlagatta sro.pdf
SRO Bagepalli
....
So, I used the following approach to complete the above-mentioned part:
def make_sqlite_dict_from_parsed_row(district_value, sro_value, pdf_file_link):
    sqlite_dict = {
        "district_value": district_value,
        "sro_value": sro_value,
        "pdf_file_link": pdf_file_link.strip().replace(' ', '%20'),
        "status": "PENDING"
    }
    sqlite_dict['hsh'] = get_hash(sqlite_dict, IGR_SQLITE_HSH_TUP)
    return sqlite_dict

li_element_list = home_response_soup.find_all('li', {'class': 'dropdown-submenu'})
parsed_row_list = []

for ele in li_element_list:
    district_value = ele.find('a', {'class': 'dropdown-toggle'}).get_text().strip()
    sro_pdf_a_tags = ele.find_all('a', attrs={'target': '_blank'})

    if len(sro_pdf_a_tags) >= 1:
        for sro_a_tag in sro_pdf_a_tags:
            sqlite_dict = make_sqlite_dict_from_parsed_row(
                district_value,
                sro_a_tag.get_text(strip=True),
                sro_a_tag.get('href')
            )
            parsed_row_list.append(sqlite_dict)
    else:
        print("District: ", district_value, "'s pdf is corrupted")
This gives a proper pdf_file_link, sro_name and district_name for each row.
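As a side note, replace(' ', '%20') only handles spaces; if other unsafe characters ever appear in the hrefs, urllib.parse.quote is a more general option (a sketch, not part of the original code):
from urllib.parse import quote

pdf_file_link = 'https://igr.karnataka.gov.in/storage/pdf-files/Guidelines Value/chikkaballapur sro.pdf'

# Keep ':' and '/' so the URL structure survives, percent-encode everything else that needs it.
encoded = quote(pdf_file_link.strip(), safe=':/')
print(encoded)
# https://igr.karnataka.gov.in/storage/pdf-files/Guidelines%20Value/chikkaballapur%20sro.pdf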

BS4 Python: get an href URL

I'm stuck on a bs4 script; I need to get an href link or the meta content. How can I do that? Basically, I need to get this:
<meta itemprop="image" content="https://resources.reed.co.uk/profileimages/logos/thumbs/Logo_71709.png?v=20200828172950">
or
<img src="https://resources.reed.co.uk/profileimages/logos/thumbs/Logo_71709.png?v=20200828172950" alt="Posted by Publica Group " width="120" height="50" class=" b-loaded" style="display: inline;">
I tried to do that with:
logoscrap = soup.find('meta', attrs={'itemprop': 'image'})
and
logoscrap = soup.find('img', class_="b-loaded").attrs['src']
But my code doesn't work...
soup.find returns a Tag object that supports dict-style access, so you can read the attribute from it directly:
img = soup.find('meta', attrs={'itemprop': 'image'})
logoscrap = img['content']
#output:
https://resources.reed.co.uk/profileimages/logos/thumbs/Logo_71709.png?v=20200828172950
or
img = soup.find('img', class_="b-loaded")
logoscrap = img['src']
#output:
https://resources.reed.co.uk/profileimages/logos/thumbs/Logo_71709.png?v=20200828172950
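If you are not sure which of the two elements will be present on a given page, a small sketch that tries the meta tag first and falls back to the image (both selectors taken from the question; the html string here only contains the meta tag as an example):
from bs4 import BeautifulSoup

html = '<meta itemprop="image" content="https://resources.reed.co.uk/profileimages/logos/thumbs/Logo_71709.png?v=20200828172950">'
soup = BeautifulSoup(html, 'html.parser')

logoscrap = None
meta = soup.find('meta', attrs={'itemprop': 'image'})
if meta is not None:
    logoscrap = meta.get('content')
else:
    # Fall back to the logo img tag if the meta tag is missing.
    img = soup.find('img', class_='b-loaded')
    if img is not None:
        logoscrap = img.get('src')

print(logoscrap)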

Extract element from HTML

I have html:
<div class="img-holder">
<h1>Sample Image</h1>
<img src="http://sample.com/img.jpg"/>
</div>
With:
s = soup.find('div', {'class' : 'img-holder'}).h1
s = s.get_text()
this displays 'Sample Image'.
How do I get the image src using the same format?
Use img.attrs["src"]
Ex:
from bs4 import BeautifulSoup

s = """<div class="img-holder">
    <h1>Sample Image</h1>
    <img src="http://sample.com/img.jpg"/>
</div>"""

soup = BeautifulSoup(s, "html.parser")
s = soup.find('div', {'class' : 'img-holder'})
print(s.img.attrs["src"])
Like this?
soup.find('img')['src']
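If there could be several images on the page and you specifically want the one inside the img-holder div, a CSS-selector sketch keeps the lookup scoped to that container:
from bs4 import BeautifulSoup

s = """<div class="img-holder">
    <h1>Sample Image</h1>
    <img src="http://sample.com/img.jpg"/>
</div>"""

soup = BeautifulSoup(s, "html.parser")
# Select the first img that is a descendant of div.img-holder.
img = soup.select_one('div.img-holder img')
if img is not None:
    print(img['src'])  # http://sample.com/img.jpg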
