I am stuck on a bs4 script. I need to get an href link or a meta tag's content. How can I do that? Basically, I need to extract the URL from this:
<meta itemprop="image" content="https://resources.reed.co.uk/profileimages/logos/thumbs/Logo_71709.png?v=20200828172950">
or
<img src="https://resources.reed.co.uk/profileimages/logos/thumbs/Logo_71709.png?v=20200828172950" alt="Posted by Publica Group " width="120" height="50" class=" b-loaded" style="display: inline;">
I tried to do that with:
logoscrap = soup.find('meta', attrs={'itemprop': 'image'})
and
logoscrap = soup.find('img', class_="b-loaded").attrs['src']
But my code doesn't work...
soup.find returns a Tag object (or None if nothing matches). A Tag supports dictionary-style access to its attributes, so you can read them directly:
img = soup.find('meta', attrs={'itemprop': 'image'})
logoscrap = img['content']
#output:
https://resources.reed.co.uk/profileimages/logos/thumbs/Logo_71709.png?v=20200828172950
or
img = soup.find('img', class_="b-loaded")
logoscrap = img['src']
#output:
https://resources.reed.co.uk/profileimages/logos/thumbs/Logo_71709.png?v=20200828172950
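Since find returns None when nothing matches, a defensive version of the lookups above may be easier to debug. The sketch below is only an illustration: the helper name logo_url is hypothetical, and the markup is the snippet from the question parsed from a string. One possible reason the original code failed (an assumption, not confirmed by the question) is that the b-loaded class is added by JavaScript in the browser and is missing from the HTML the script downloads, in which case the meta tag is the more reliable target.

```python
from bs4 import BeautifulSoup

html = """
<meta itemprop="image" content="https://resources.reed.co.uk/profileimages/logos/thumbs/Logo_71709.png?v=20200828172950">
<img src="https://resources.reed.co.uk/profileimages/logos/thumbs/Logo_71709.png?v=20200828172950" alt="Posted by Publica Group " width="120" height="50" class=" b-loaded" style="display: inline;">
"""


def logo_url(soup):
    """Hypothetical helper: return the logo URL, or None if neither tag exists."""
    # Prefer the <meta itemprop="image"> tag and read its content attribute.
    meta = soup.find("meta", attrs={"itemprop": "image"})
    if meta is not None and meta.get("content"):
        return meta["content"]
    # Fall back to the <img class="b-loaded"> tag and read its src attribute.
    img = soup.find("img", class_="b-loaded")
    if img is not None:
        return img.get("src")
    return None


soup = BeautifulSoup(html, "html.parser")
found = logo_url(soup)
missing = logo_url(BeautifulSoup("<p>no logo here</p>", "html.parser"))
print(found)
print(missing)
```

If the helper returns None on the real page, printing the downloaded HTML and searching it for the tag is a quick way to check whether the element is there at all.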
Related
I have a div tag that contains three anchor tags, each with a URL in it.
I am able to print those three hrefs, but they get merged into one value.
Is there a way I can get three separate values?
Div looks like this:
<div class="speaker_social_wrap">
<a href="https://twitter.com/Sigve_telenor" target="_blank">
<i aria-hidden="true" class="x-icon x-icon-twitter" data-x-icon-b=""></i>
</a>
<a href="https://no.linkedin.com/in/sigvebrekke" target="_blank">
<i aria-hidden="true" class="x-icon x-icon-linkedin-in" data-x-icon-b=""></i>
</a>
<a href="https://www.facebook.com/sigve.telenor" target="_blank">
<i aria-hidden="true" class="x-icon x-icon-facebook-f" data-x-icon-b=""></i>
</a>
What I have tried so far:
social_media_url = soup.find_all('div', class_='foo')
for url in social_media_url:
    print(url)
Expected Result:
http://twitter-url
http://linkedin-url
http://facebook-url
My Output
<div><a twitter-url><a linkedin-url><a facebook-url></div>
You can do it like this:
from bs4 import BeautifulSoup
import requests
url = 'https://dtw.tmforum.org/speakers/sigve-brekke-2/'
r = requests.get(url)
soup = BeautifulSoup(r.text,'lxml')
a = soup.find('div', class_='speaker_social_wrap').find_all('a')
for i in a:
    print(i['href'])
https://twitter.com/Sigve_telenor
https://no.linkedin.com/in/sigvebrekke
https://www.facebook.com/sigve.telenor
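Your selector gives you the div, not the list of URLs. Select the div with find (find_all returns a list-like ResultSet, which has no find_all method of its own), then collect its anchors. You need something more like:
social_media_div = soup.find('div', class_='foo')
social_media_anchors = social_media_div.find_all('a')
for anchor in social_media_anchors:
    print(anchor.get('href'))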
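As an alternative to chaining find and find_all, a CSS selector can pick out the anchors in one step. This is a minimal sketch that parses the div from the question as a string (no network request); the selector only matches anchors that actually carry an href attribute.

```python
from bs4 import BeautifulSoup

html = """
<div class="speaker_social_wrap">
<a href="https://twitter.com/Sigve_telenor" target="_blank">
<i aria-hidden="true" class="x-icon x-icon-twitter" data-x-icon-b=""></i>
</a>
<a href="https://no.linkedin.com/in/sigvebrekke" target="_blank">
<i aria-hidden="true" class="x-icon x-icon-linkedin-in" data-x-icon-b=""></i>
</a>
<a href="https://www.facebook.com/sigve.telenor" target="_blank">
<i aria-hidden="true" class="x-icon x-icon-facebook-f" data-x-icon-b=""></i>
</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# select() returns every element matching the CSS selector, in document order.
hrefs = [a["href"] for a in soup.select("div.speaker_social_wrap a[href]")]

for href in hrefs:
    print(href)
```

Each URL ends up as its own list element, so the values can be stored or processed separately.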
I am struggling to get an image from this webpage. I am able to get the title, price and other elements fine, just not the image.
<div class="product-img">
<a data-test-selector="linkProductURL" href="https://www.scottycameron.com/store/product/3494">
<div class="image" style="min-height: 350px;">
<img data-test-selector="imgProductImage" id="img-3494" class="img-responsive b-lazy b-loaded"
src="https://api.scottycameron.com/Data/Media/Catalog/1/370/c09b7470-42dd-47e5-a244-9ef3d073c742LICENSE%20PLATE%20FRAME%20-%20SCOTTY%20CAMERON%20FINE%20MILLED%20PUTTERS.jpg">
The code I am currently using is:
for ele in array:
    item = [ele.find('h4', {'class': 'title'}).text,  # title
            ele.find('span', {'data-test-selector': 'spanPrice'}).text,  # price
            ele.find('img', {'class': 'img-responsive b-lazy b-loaded'})['src']]  # image URL
But that returns:
TypeError: 'NoneType' object is not subscriptable
Anyone have any idea?
You might want to check if there's an image tag in the first place and then reach for the attribute:
from bs4 import BeautifulSoup
element = """
<div class="product-img">
<a data-test-selector="linkProductURL" href="https://www.scottycameron.com/store/product/3494">
<div class="image" style="min-height: 350px;">
<img data-test-selector="imgProductImage" id="img-3494" class="img-responsive b-lazy b-loaded"
src="https://api.scottycameron.com/Data/Media/Catalog/1/370/c09b7470-42dd-47e5-a244-9ef3d073c742LICENSE%20PLATE%20FRAME%20-%20SCOTTY%20CAMERON%20FINE%20MILLED%20PUTTERS.jpg">
</div>
</div>"""
image = BeautifulSoup(element, "html.parser").find("img", class_="img-responsive b-lazy b-loaded")
if image is not None:
    print(image["src"])
Output:
https://api.scottycameron.com/Data/Media/Catalog/1/370/c09b7470-42dd-47e5-a244-9ef3d073c742LICENSE%20PLATE%20FRAME%20-%20SCOTTY%20CAMERON%20FINE%20MILLED%20PUTTERS.jpg
EDIT:
As per your comment, try this:
item = []
for ele in array:
    title = ele.find('h4', {'class': 'title'}).text
    price = ele.find('span', {'data-test-selector': 'spanPrice'}).text
    img_src = ele.find('img', {'class': 'img-responsive b-lazy b-loaded'})
    # Append one [title, price, image] row per product, with a placeholder when no <img> matched
    if img_src is not None:
        item.append([title, price, img_src["src"]])
    else:
        item.append([title, price, "No image source"])
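Matching on the full class string only works when the class attribute is exactly that string, so it can be more robust to match on the data-test-selector attribute that the question's markup already carries. The sketch below assumes a simplified, hypothetical product layout (the div class product, the titles, prices and example.com URL are made up for illustration); only the data-test-selector values come from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second product deliberately has no <img> tag.
html = """
<div class="product">
  <h4 class="title">License Plate Frame</h4>
  <span data-test-selector="spanPrice">$25.00</span>
  <img data-test-selector="imgProductImage" class="img-responsive b-lazy" src="https://example.com/a.jpg">
</div>
<div class="product">
  <h4 class="title">No Image Item</h4>
  <span data-test-selector="spanPrice">$10.00</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
items = []
for ele in soup.find_all("div", class_="product"):
    # Look the image up by its data attribute instead of its exact class string.
    img = ele.find("img", attrs={"data-test-selector": "imgProductImage"})
    items.append([
        ele.find("h4", class_="title").get_text(strip=True),
        ele.find("span", attrs={"data-test-selector": "spanPrice"}).get_text(strip=True),
        # Guard against a missing tag before reading its src attribute.
        img.get("src", "No image source") if img is not None else "No image source",
    ])

for row in items:
    print(row)
```

Note that the first image lacks the b-loaded class and is still found, which would not be the case with the exact class-string match.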
You can use this:
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen('https://www.scottycameron.com/store/product/3494')
bs = BeautifulSoup(html, 'html.parser')
images = bs.find_all('img')
# Print the src of every <img> tag that has one
for img in images:
    if img.has_attr('src'):
        print(img['src'])
Output
/img/icon-header-user.png
/img/icon-header-cart.png
https://www.scottycameron.com/media/18299/puttertarchivenav_jan2021.jpg
https://www.scottycameron.com/media/18302/customizenav_jan2021.jpg
https://www.scottycameron.com/media/18503/showcasenav_2_2021.jpg
https://www.scottycameron.com/media/18454/2021phtmx_new_nws_thmb1.jpg
https://www.scottycameron.com/media/18301/aboutnav_jan2021_b.jpg
https://api.scottycameron.com/Data/Media/Catalog/1/1000/c09b7470-42dd-47e5-a244-9ef3d073c742LICENSE PLATE FRAME - SCOTTY CAMERON FINE MILLED PUTTERS.jpg
/store/content/images/loading.svg
This collects the src of every image on the page; from there you can filter or process the URLs as needed.
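Some of the collected values are relative paths (for example /img/icon-header-user.png), so one likely next step is to resolve them against the page URL. Here is a small sketch using urljoin from the standard library; it parses a short inline HTML string rather than fetching the page, and the third img tag without a src is added only to show the filter.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

base_url = "https://www.scottycameron.com/store/product/3494"
html = (
    '<img src="/img/icon-header-user.png">'
    '<img src="https://www.scottycameron.com/media/18299/puttertarchivenav_jan2021.jpg">'
    '<img alt="no src attribute">'
)

soup = BeautifulSoup(html, "html.parser")

# src=True keeps only <img> tags that have a src attribute;
# urljoin leaves absolute URLs unchanged and resolves relative ones.
absolute_urls = [urljoin(base_url, img["src"]) for img in soup.find_all("img", src=True)]

for url in absolute_urls:
    print(url)
```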
I'm trying to select the second div tag with the info class name, but with no success using bs4 find_next. How do you go about selecting the text inside the second div tag when both share a class name?
[<div class="info">
<a href="/clubs/12/Manchester-United/overview">
Manchester United<span class="playerClub badge-20 t1"></span>
</a>
</div>
<div class="info">Defender</div>
<div class="info">
<a href="/clubs/12/Manchester-United/overview">
Manchester United<span class="playerClub badge-20 t1"></span>
</a>
</div>
<div class="info">Defender</div>]
Here is what I have tried
from bs4 import BeautifulSoup
import requests
players_url =['http://www.premierleague.com//players/13559/Axel-Tuanzebe/stats']
# this is the dict where we store all information:
players = {}
for url in players_url:
    player_page = requests.get(url)
    cont = BeautifulSoup(player_page.content, 'lxml')
    data = dict((k.contents[0].strip(), v.get_text(strip=True)) for k, v in zip(cont.select('.topStat span.stat, .normalStat span.stat'), cont.select('.topStat span.stat > span, .normalStat span.stat > span')))
    club = {"Club" : cont.find('div', attrs={'class' : 'info'}).get_text(strip=True)}
    position = {"Position": cont.find_next('div', attrs={'class' : 'info'})}
    players[cont.select_one('.playerDetails .name').get_text(strip=True)] = data
print(position)
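Call find_next on the first matching tag rather than on the whole document, then read its text. You can try the following:
club_ele = cont.find('div', attrs={'class' : 'info'})
club = {"Club" : club_ele.get_text(strip=True)}
position = {"Position": club_ele.find_next('div', attrs={'class' : 'info'}).get_text(strip=True)}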
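To see the difference on the markup from the question, the sketch below parses the first two div tags from a string and reads the second one in two ways: by indexing the result of find_all, and by calling find_next on the first match. It is an illustration only and does not fetch the live page.

```python
from bs4 import BeautifulSoup

html = """
<div class="info">
<a href="/clubs/12/Manchester-United/overview">
Manchester United<span class="playerClub badge-20 t1"></span>
</a>
</div>
<div class="info">Defender</div>
"""

cont = BeautifulSoup(html, "html.parser")

# Option 1: collect every matching div and pick by position.
infos = cont.find_all("div", class_="info")
club = infos[0].get_text(strip=True)
position_by_index = infos[1].get_text(strip=True)

# Option 2: start from the first match and move to the next matching div.
first = cont.find("div", class_="info")
position_by_next = first.find_next("div", class_="info").get_text(strip=True)

print(club)
print(position_by_index)
print(position_by_next)
```

Indexing is simpler when the order of the blocks is fixed; find_next is handy when you already hold the first element and only need its neighbour.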
I have html:
<div class="img-holder">
<h1>Sample Image</h1>
<img src="http://sample.com/img.jpg"/>
</div>
With:
s = soup.find('div', {'class' : 'img-holder'}).h1
s = s.get_text()
This displays 'Sample Image'.
How do I get the image src using the same format?
Use img.attrs["src"]
Ex:
from bs4 import BeautifulSoup
s = """<div class="img-holder">
<h1>Sample Image</h1>
<img src="http://sample.com/img.jpg"/>
</div>"""
soup = BeautifulSoup(s, "html.parser")
s = soup.find('div', {'class' : 'img-holder'})
print( s.img.attrs["src"] )
Like this?
soup.find('img')['src']
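Both forms above raise an error if the tag or the attribute is missing. A short sketch of the difference between subscripting and get follows; the second img tag without a src is added to the question's markup purely for illustration.

```python
from bs4 import BeautifulSoup

html = """<div class="img-holder">
<h1>Sample Image</h1>
<img src="http://sample.com/img.jpg"/>
<img alt="no source"/>
</div>"""

soup = BeautifulSoup(html, "html.parser")
holder = soup.find("div", {"class": "img-holder"})

# holder.img is shorthand for the first <img> inside the div.
first_src = holder.img["src"]

# tag["src"] raises KeyError when the attribute is absent; tag.get("src") returns None instead.
all_srcs = [img.get("src") for img in holder.find_all("img")]

print(first_src)
print(all_srcs)
```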
I need to create a <img /> tag.
With the code I wrote, BeautifulSoup creates the image tag like this:
soup = BeautifulSoup(text, "html5")
tag = Tag(soup, name='img')
tag.attrs = {'src': '/some/url/here'}
text = soup.renderContents()
print text
Output: <img src="/some/url/here"></img>
How can I make it produce this instead: <img src="/some/url/here" />
It can be of course done with REGEX or similar chemistry. However I was wondering maybe there is any standard way to produce tags like this?
Don't use Tag() to create new elements. Use the soup.new_tag() method:
soup = BeautifulSoup(text, "html5")
new_tag = soup.new_tag('img', src='/some/url/here')
some_element.append(new_tag)
The soup.new_tag() method will pass along the correct builder to the Tag() object, and it is the builder that is responsible for recognising <img/> as an empty tag.
Demo:
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('<div></div>', "html5")
>>> new_tag = soup.new_tag('img', src='/some/url/here')
>>> new_tag
<img src="/some/url/here"/>
>>> soup.div.append(new_tag)
>>> print(soup.prettify())
<html>
 <head>
 </head>
 <body>
  <div>
   <img src="/some/url/here"/>
  </div>
 </body>
</html>
In BS4 you can also do this:
img = BeautifulSoup('<img src="/some/url/here" />', 'lxml').img
print(img)
print(type(img))
which will output:
<img src="/some/url/here"/>
<class 'bs4.element.Tag'>
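For completeness, here is a small Python 3 sketch of the new_tag approach using the built-in html.parser builder, so no extra parser needs to be installed. The span tag is included only as a contrast, to show that non-void elements still get a separate closing tag.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div></div>", "html.parser")

# new_tag() builds the element through the soup's own builder,
# which knows that <img> is a void (self-closing) element.
img = soup.new_tag("img", src="/some/url/here")
span = soup.new_tag("span")

soup.div.append(img)

rendered_img = str(img)
rendered_span = str(span)
rendered_div = str(soup.div)

print(rendered_img)
print(rendered_span)
print(rendered_div)
```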