BS4 Python get a href url - python

I'm stuck on a bs4 script. I need to get an href link or the content of a meta tag. How can I do that? Basically, I need to get this:
<meta itemprop="image" content="https://resources.reed.co.uk/profileimages/logos/thumbs/Logo_71709.png?v=20200828172950">
or
<img src="https://resources.reed.co.uk/profileimages/logos/thumbs/Logo_71709.png?v=20200828172950" alt="Posted by Publica Group " width="120" height="50" class=" b-loaded" style="display: inline;">
I tried to do that with:
logoscrap = soup.find('meta', attrs={'itemprop': 'image'})
and
logoscrap = soup.find('img', class_="b-loaded").attrs['src']
But my code doesn't work...

soup.find returns a Tag object (or None if nothing matches). A Tag supports dictionary-style access, so you can read an attribute directly:
img = soup.find('meta', attrs={'itemprop': 'image'})
logoscrap = img['content']
#output:
https://resources.reed.co.uk/profileimages/logos/thumbs/Logo_71709.png?v=20200828172950
or
img = soup.find('img', class_="b-loaded")
logoscrap = img['src']
#output:
https://resources.reed.co.uk/profileimages/logos/thumbs/Logo_71709.png?v=20200828172950
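Because soup.find can return None, a guarded lookup avoids a TypeError when the tag is missing. The following is only a minimal sketch: the markup is copied from the question, and `logo_url` is a hypothetical helper name, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Markup copied from the question; a real script would parse the fetched page.
html = """
<meta itemprop="image" content="https://resources.reed.co.uk/profileimages/logos/thumbs/Logo_71709.png?v=20200828172950">
<img src="https://resources.reed.co.uk/profileimages/logos/thumbs/Logo_71709.png?v=20200828172950" alt="Posted by Publica Group " width="120" height="50" class=" b-loaded" style="display: inline;">
"""

soup = BeautifulSoup(html, "html.parser")


def logo_url(soup):
    """Hypothetical helper: try the meta tag first, then fall back to the img tag."""
    meta = soup.find("meta", attrs={"itemprop": "image"})
    if meta is not None and meta.get("content"):
        return meta["content"]
    img = soup.find("img", class_="b-loaded")
    if img is not None:
        # .get() returns None instead of raising KeyError if src is absent
        return img.get("src")
    return None


print(logo_url(soup))
```

If neither tag is present, the helper returns None rather than raising, so the caller can decide what to do.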

Related

How to get child value of div separately using beautifulsoup

I have a div tag that contains three anchor tags, each with a URL.
I am able to print those 3 hrefs, but they get merged into one value.
Is there a way I can get three separate values?
The div looks like this:
<div class="speaker_social_wrap">
<a href="https://twitter.com/Sigve_telenor" target="_blank">
<i aria-hidden="true" class="x-icon x-icon-twitter" data-x-icon-b=""></i>
</a>
<a href="https://no.linkedin.com/in/sigvebrekke" target="_blank">
<i aria-hidden="true" class="x-icon x-icon-linkedin-in" data-x-icon-b=""></i>
</a>
<a href="https://www.facebook.com/sigve.telenor" target="_blank">
<i aria-hidden="true" class="x-icon x-icon-facebook-f" data-x-icon-b=""></i>
</a>
What I have tried so far:
social_media_url = soup.find_all('div', class_ = 'foo')
for url in social_media_url:
    print(url)
Expected Result:
http://twitter-url
http://linkedin-url
http://facebook-url
My Output
<div><a twitter-url><a linkedin-url><a facebook-url></div>
You can do it like this:
from bs4 import BeautifulSoup
import requests
url = 'https://dtw.tmforum.org/speakers/sigve-brekke-2/'
r = requests.get(url)
soup = BeautifulSoup(r.text,'lxml')
a = soup.find('div', class_='speaker_social_wrap').find_all('a')
for i in a:
    print(i['href'])
https://twitter.com/Sigve_telenor
https://no.linkedin.com/in/sigvebrekke
https://www.facebook.com/sigve.telenor
Your selector gives you the div, not the list of URLs. Use find rather than find_all for the div (find_all returns a list of results, which has no find_all method of its own), then collect the anchors inside it:
social_media_div = soup.find('div', class_ = 'foo')
social_media_anchors = social_media_div.find_all('a')
for anchor in social_media_anchors:
    print(anchor.get('href'))
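The same result can also be reached with a single CSS selector. This is a sketch under the assumption that the div carries the class speaker_social_wrap shown in the question; the markup below is a trimmed copy of it.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the div from the question (icon tags removed).
html = """
<div class="speaker_social_wrap">
<a href="https://twitter.com/Sigve_telenor" target="_blank"></a>
<a href="https://no.linkedin.com/in/sigvebrekke" target="_blank"></a>
<a href="https://www.facebook.com/sigve.telenor" target="_blank"></a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# "a[href]" only matches anchors that actually have an href attribute.
links = [a["href"] for a in soup.select("div.speaker_social_wrap a[href]")]

for link in links:
    print(link)
```

Each URL ends up as its own list item, so the values stay separate.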

How to use beautifulSoup to get the image from a webpage

I am struggling to get an image from this webpage. I am able to get the title, price and other elements fine, just not the image.
<div class="product-img">
<a data-test-selector="linkProductURL" href="https://www.scottycameron.com/store/product/3494">
<div class="image" style="min-height: 350px;">
<img data-test-selector="imgProductImage" id="img-3494" class="img-responsive b-lazy b-loaded"
src="https://api.scottycameron.com/Data/Media/Catalog/1/370/c09b7470-42dd-47e5-a244-9ef3d073c742LICENSE%20PLATE%20FRAME%20-%20SCOTTY%20CAMERON%20FINE%20MILLED%20PUTTERS.jpg">
The code I am currently using is:
for ele in array:
    item = [ele.find('h4', {'class': 'title'}).text,  # title
            ele.find('span', {'data-test-selector': 'spanPrice'}).text,
            ele.find('img', {'class': 'img-responsive b-lazy b-loaded'})['src']]
But that returns:
TypeError: 'NoneType' object is not subscriptable
Anyone have any idea?
You might want to check if there's an image tag in the first place and then reach for the attribute:
from bs4 import BeautifulSoup
element = """
<div class="product-img">
<a data-test-selector="linkProductURL" href="https://www.scottycameron.com/store/product/3494">
<div class="image" style="min-height: 350px;">
<img data-test-selector="imgProductImage" id="img-3494" class="img-responsive b-lazy b-loaded"
src="https://api.scottycameron.com/Data/Media/Catalog/1/370/c09b7470-42dd-47e5-a244-9ef3d073c742LICENSE%20PLATE%20FRAME%20-%20SCOTTY%20CAMERON%20FINE%20MILLED%20PUTTERS.jpg">
</div>
</div>"""
image = BeautifulSoup(element, "html.parser").find("img", class_="img-responsive b-lazy b-loaded")
if image is not None:
    print(image["src"])
Output:
https://api.scottycameron.com/Data/Media/Catalog/1/370/c09b7470-42dd-47e5-a244-9ef3d073c742LICENSE%20PLATE%20FRAME%20-%20SCOTTY%20CAMERON%20FINE%20MILLED%20PUTTERS.jpg
EDIT:
As per your comment, try this:
item = []
for ele in array:
    title = ele.find('h4', {'class': 'title'}).text
    price = ele.find('span', {'data-test-selector': 'spanPrice'}).text
    img_src = ele.find('img', {'class': 'img-responsive b-lazy b-loaded'})
    # Only read the src attribute when an image tag was actually found.
    if img_src is not None:
        item.append([title, price, img_src["src"]])
    else:
        item.append([title, price, "No image source"])
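One possible reason for the None result is that a class string containing spaces is matched against the exact value of the class attribute. If the HTML that the script receives lacks b-loaded (an assumption here: that class looks like something a lazy-loading script adds in the browser), the lookup finds nothing. The sketch below uses assumed markup with a placeholder src to show two lookups that do not depend on the full class string.

```python
from bs4 import BeautifulSoup

# Assumed static markup: the tag from the question without the "b-loaded"
# class, and with a placeholder src instead of the real URL.
html = """
<div class="product-img">
  <img data-test-selector="imgProductImage" id="img-3494"
       class="img-responsive b-lazy" src="https://example.com/plate.jpg">
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Exact multi-class string: no match, because the attribute value differs.
exact = soup.find("img", {"class": "img-responsive b-lazy b-loaded"})

# Match on the data attribute instead of the class list.
by_attr = soup.find("img", attrs={"data-test-selector": "imgProductImage"})

# Match on a single class with a CSS selector.
by_class = soup.select_one("img.b-lazy")

print(exact)
print(by_attr["src"])
print(by_class["src"])
```

Whether this is the actual cause depends on what the server returns, so printing the fetched HTML is a reasonable first check.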
You can also collect every image on the page with this:
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen('https://www.scottycameron.com/store/product/3494')
bs = BeautifulSoup(html, 'html.parser')
images = bs.find_all('img')
for img in images:
    # Skip img tags that have no src attribute.
    if img.has_attr('src'):
        print(img['src'])
Output
/img/icon-header-user.png
/img/icon-header-cart.png
https://www.scottycameron.com/media/18299/puttertarchivenav_jan2021.jpg
https://www.scottycameron.com/media/18302/customizenav_jan2021.jpg
https://www.scottycameron.com/media/18503/showcasenav_2_2021.jpg
https://www.scottycameron.com/media/18454/2021phtmx_new_nws_thmb1.jpg
https://www.scottycameron.com/media/18301/aboutnav_jan2021_b.jpg
https://api.scottycameron.com/Data/Media/Catalog/1/1000/c09b7470-42dd-47e5-a244-9ef3d073c742LICENSE PLATE FRAME - SCOTTY CAMERON FINE MILLED PUTTERS.jpg
/store/content/images/loading.svg
This collects the src of every image on the page; from there you can filter for the one you need.
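Some of the collected values are relative paths such as /img/icon-header-user.png. As one possible follow-up step, they can be resolved against the page URL with urllib.parse.urljoin. The markup here is a small assumed sample based on the output above, not the real page.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

base_url = "https://www.scottycameron.com/store/product/3494"

# Assumed sample markup modelled on the output listed above.
html = """
<img src="/img/icon-header-user.png">
<img src="https://www.scottycameron.com/media/18299/puttertarchivenav_jan2021.jpg">
<img alt="no source here">
"""

soup = BeautifulSoup(html, "html.parser")

# src=True keeps only img tags that have a src attribute;
# urljoin leaves absolute URLs unchanged and resolves relative ones.
absolute = [urljoin(base_url, img["src"]) for img in soup.find_all("img", src=True)]

for url in absolute:
    print(url)
```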

How to select second div tag with same classname?

I'm trying to select the second div tag with the info class name, but without success using bs4's find_next. How do you go about selecting the text inside the second div tag when both share a class name?
[<div class="info">
<a href="/clubs/12/Manchester-United/overview">
Manchester United<span class="playerClub badge-20 t1"></span>
</a>
</div>
<div class="info">Defender</div>
<div class="info">
<a href="/clubs/12/Manchester-United/overview">
Manchester United<span class="playerClub badge-20 t1"></span>
</a>
</div>
<div class="info">Defender</div>]
Here is what I have tried
from bs4 import BeautifulSoup
import requests
players_url =['http://www.premierleague.com//players/13559/Axel-Tuanzebe/stats']
# this is the dict where we store all the information:
players = {}
for url in players_url:
    player_page = requests.get(url)
    cont = BeautifulSoup(player_page.content, 'lxml')
    data = dict((k.contents[0].strip(), v.get_text(strip=True)) for k, v in zip(cont.select('.topStat span.stat, .normalStat span.stat'), cont.select('.topStat span.stat > span, .normalStat span.stat > span')))
    club = {"Club" : cont.find('div', attrs={'class' : 'info'}).get_text(strip=True)}
    position = {"Position": cont.find_next('div', attrs={'class' : 'info'})}
    players[cont.select_one('.playerDetails .name').get_text(strip=True)] = data
    print(position)
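You can try the following. Call find_next on the first matching element rather than on the whole document, so the search starts after the club div:
club_ele = cont.find('div', attrs={'class' : 'info'})
club = {"Club" : club_ele.get_text(strip=True)}
position = {"Position": club_ele.find_next('div', attrs={'class' : 'info'}).get_text(strip=True)}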
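An alternative sketch is to collect all matching divs with find_all and pick the second by index. The markup is a shortened copy of the snippet in the question, and the length check is an added safeguard rather than something the original code had.

```python
from bs4 import BeautifulSoup

# Shortened copy of the first two "info" divs from the question.
html = """
<div class="info"><a href="/clubs/12/Manchester-United/overview">
Manchester United<span class="playerClub badge-20 t1"></span></a></div>
<div class="info">Defender</div>
"""

cont = BeautifulSoup(html, "html.parser")

infos = cont.find_all("div", class_="info")
club = infos[0].get_text(strip=True)
# Guard against pages that have fewer than two matching divs.
position = infos[1].get_text(strip=True) if len(infos) > 1 else None

# find_next on the first div reaches the same element.
via_find_next = infos[0].find_next("div", class_="info").get_text(strip=True)

print(club)
print(position)
```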

Extract element from HTML

I have html:
<div class="img-holder">
<h1>Sample Image</h1>
<img src="http://sample.com/img.jpg"/>
</div>
With:
s = soup.find('div', {'class' : 'img-holder'}).h1
s = s.get_text()
This displays 'Sample Image'.
How do I get the image src using the same format?
Use img.attrs["src"]
Ex:
from bs4 import BeautifulSoup
s = """<div class="img-holder">
<h1>Sample Image</h1>
<img src="http://sample.com/img.jpg"/>
</div>"""
soup = BeautifulSoup(s, "html.parser")
s = soup.find('div', {'class' : 'img-holder'})
print( s.img.attrs["src"] )
Like this?
soup.find('img')['src']

BeautifulSoup create a <img /> tag

I need to create a <img /> tag.
With the code I wrote, BeautifulSoup creates an image tag like this:
soup = BeautifulSoup(text, "html5")
tag = Tag(soup, name='img')
tag.attrs = {'src': '/some/url/here'}
text = soup.renderContents()
print text
Output: <img src="/some/url/here"></img>
How can I make it produce this instead: <img src="/some/url/here" />
It can be of course done with REGEX or similar chemistry. However I was wondering maybe there is any standard way to produce tags like this?
Don't use Tag() to create new elements. Use the soup.new_tag() method:
soup = BeautifulSoup(text, "html5")
new_tag = soup.new_tag('img', src='/some/url/here')
some_element.append(new_tag)
The soup.new_tag() method will pass along the correct builder to the Tag() object, and it is the builder that is responsible for recognising <img/> as an empty tag.
Demo:
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('<div></div>', "html5")
>>> new_tag = soup.new_tag('img', src='/some/url/here')
>>> new_tag
<img src="/some/url/here"/>
>>> soup.div.append(new_tag)
>>> print(soup.prettify())
<html>
 <head>
 </head>
 <body>
  <div>
   <img src="/some/url/here"/>
  </div>
 </body>
</html>
In BS4 you can also do this:
img = BeautifulSoup('<img src="/some/url/here" />', 'lxml').img
print(img)
print(type(img))
which will output:
<img src="/some/url/here"/>
<class 'bs4.element.Tag'>
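For reference, here is a small self-contained sketch of the soup.new_tag() approach that uses the built-in html.parser, so it does not assume html5lib or lxml is installed. The span tag is added only for contrast with a non-void element.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div></div>", "html.parser")

# new_tag() builds the tag with the soup's own builder, which knows that
# img is a void element and serializes it as self-closing.
img = soup.new_tag("img", src="/some/url/here")
soup.div.append(img)

# A non-void element still gets a separate closing tag.
span = soup.new_tag("span")

print(img)
print(span)
print(soup)
```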
