I am trying to read the address of an image from a web page with BeautifulSoup. I can see the URL of the image in the page source, but when I try to extract it with find_all I only get a placeholder for the image URL.
The image tag in the page source looks like this:
<br /><img src="mangas/Young Justice (2019)/Young Justice (2019) Issue 11/cw002.jpg" alt="" width="1200" height="1846" class="picture" />
In BeautifulSoup I get this instead:
<img 0="" alt="" class="picture" height="" src="/pics/placeholder2.jpg" width=""/>
I hope somebody can give me a tip on why I get a placeholder instead of the original image URL.
My Code:
import requests
from bs4 import BeautifulSoup as BS
from requests.exceptions import ConnectionError

def getimageurl(url):
    try:
        response = requests.get(url)
        soup = BS(response.text, 'html.parser')
        data = soup.find_all('a', href=True)
        for a in data:
            t = a.find_all('img', attrs={'class': 'picture'})
            print(t)
    except ConnectionError:
        print("Can't open url: {0}".format(url))
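For what it's worth, a placeholder like /pics/placeholder2.jpg usually means the page lazy-loads its images: the server ships a stub src, and JavaScript later swaps in the real URL, which is often parked in a data-* attribute. A minimal sketch against a made-up snippet (the attribute names and paths here are illustrative; check the actual markup of your page):

```python
from bs4 import BeautifulSoup

# Hypothetical lazy-loading markup: the real URL sits in data-src,
# while src holds the placeholder the browser shows first.
html = ('<a href="/reader"><img class="picture" alt="" '
        'src="/pics/placeholder2.jpg" data-src="mangas/cw002.jpg"/></a>')

soup = BeautifulSoup(html, 'html.parser')
img = soup.find('img', class_='picture')

# Prefer the lazy-load attribute when present, fall back to src
real_src = img.get('data-src') or img['src']
print(real_src)  # mangas/cw002.jpg
```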
I am trying to get all the image URLs for all the books on this page https://www.nb.co.za/en/books/0-6-years with Beautiful Soup.
This is my code:
from bs4 import BeautifulSoup
import requests

baseurl = "https://www.nb.co.za/"
productlinks = []

r = requests.get('https://www.nb.co.za/en/books/0-6-years')
soup = BeautifulSoup(r.content, 'lxml')
productlist = soup.find_all('div', class_="book-slider-frame")

def my_filter(tag):
    return (tag.name == 'a' and
            tag.parent.name == 'div' and
            'img-container' in tag.parent['class'])

for item in productlist:
    for link in item.find_all(my_filter, href=True):
        productlinks.append(baseurl + link['href'])

cover = soup.find_all('div', class_="img-container")
print(cover)
And this is my output:
<div class="img-container">
<a href="/en/view-book/?id=9780798182539">
<img class="lazy" data-src="/en/helper/ReadImage/25929" src="/Content/images/loading5.gif"/>
</a>
</div>
What I hope to get:
https://www.nb.co.za/en/helper/ReadImage/25929.jpg
My problem is:
How do I get the data-src only?
How do I get the extension of the image?
1: How do I get the data-src only?
You can access the data-src by calling element['data-src']:
cover = baseurl+item.img['data-src'] if item.img['src'] != nocover else baseurl+nocover
2: How do I get the extension of the image?
You can get the extension of the file as diggusbickus mentioned (a good approach), but this will not help you if you try to request a file like https://www.nb.co.za/en/helper/ReadImage/25929.jpg; that request causes a 404 error.
The image is loaded dynamically / served with additional info -> https://stackoverflow.com/a/5110673/14460824
Example
baseurl = "https://www.nb.co.za/"
nocover = '/Content/images/no-cover.jpg'

data = []
for item in soup.select('.book-slider-frame'):
    data.append({
        'link': baseurl + item.a['href'],
        'cover': baseurl + item.img['data-src'] if item.img['src'] != nocover else baseurl + nocover
    })

print(data)
Output
[{'link': 'https://www.nb.co.za//en/view-book/?id=9780798182539',
'cover': 'https://www.nb.co.za//en/helper/ReadImage/25929'},
{'link': 'https://www.nb.co.za//en/view-book/?id=9780798182546',
'cover': 'https://www.nb.co.za//en/helper/ReadImage/25931'},
{'link': 'https://www.nb.co.za//en/view-book/?id=9780798182553',
'cover': 'https://www.nb.co.za//en/helper/ReadImage/25925'},...]
I'll show you how to do it for this small example and let you handle the rest. Just use the imghdr module:
import imghdr
import requests
from bs4 import BeautifulSoup

data = """<div class="img-container">
<a href="/en/view-book/?id=9780798182539">
<img class="lazy" data-src="/en/helper/ReadImage/25929" src="/Content/images/loading5.gif"/>
</a>
</div>"""

soup = BeautifulSoup(data, 'lxml')
base_url = "https://www.nb.co.za"
img_src = soup.select_one('img')['data-src']
img_name = img_src.split('/')[-1]

data = requests.get(base_url + img_src)
with open(img_name, 'wb') as f:
    f.write(data.content)

print(imghdr.what(img_name))
>>> jpeg
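A note on imghdr: it was deprecated in Python 3.11 and removed in 3.13 (PEP 594). On newer interpreters, sniffing the file's magic bytes is a dependency-free alternative; a minimal sketch (the helper name is mine, and it covers only the three formats relevant here):

```python
# Minimal stand-in for imghdr.what(): identify an image by its magic bytes.
def image_type(data):
    if data.startswith(b'\xff\xd8\xff'):       # JPEG SOI marker
        return 'jpeg'
    if data.startswith(b'\x89PNG\r\n\x1a\n'):  # PNG signature
        return 'png'
    if data[:6] in (b'GIF87a', b'GIF89a'):     # GIF header
        return 'gif'
    return None

print(image_type(b'\xff\xd8\xff\xe0' + b'\x00' * 16))  # jpeg
```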
If the page loads slowly, you can pass requests a timeout argument, or set timeout=None to make requests wait indefinitely for a response.
The reason you see a .gif in the results is that the real image has not been loaded yet; the gif is a loading placeholder, and the real URL sits in the data-src attribute.
You can access the data-src attribute the same way you would access a dictionary: tag[attribute].
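A tiny self-contained illustration of that attribute access (bracket lookup raises KeyError for a missing attribute, while .get() returns None, which is handy for optional attributes like data-src):

```python
from bs4 import BeautifulSoup

# Markup copied from the question's lazy-loaded cover image
tag = BeautifulSoup(
    '<img class="lazy" data-src="/en/helper/ReadImage/25929" '
    'src="/Content/images/loading5.gif"/>', 'html.parser').img

print(tag['data-src'])     # /en/helper/ReadImage/25929
print(tag.get('missing'))  # None; tag['missing'] would raise KeyError
```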
If you want to save an image locally, you can use urllib.request.urlretrieve:
import urllib.request
urllib.request.urlretrieve("BOOK_COVER_URL", "file_name.jpg")  # will save in the current directory
Code and example in the online IDE:
from bs4 import BeautifulSoup
import requests, lxml

response = requests.get('https://www.nb.co.za/en/books/0-6-years', timeout=None)
soup = BeautifulSoup(response.text, 'lxml')

for result in soup.select(".img-container"):
    link = f'https://www.nb.co.za{result.select_one("a")["href"]}'
    # try/except to handle the case where there's no image on the website (last 3 results)
    try:
        image = f'https://www.nb.co.za{result.select_one("a img")["data-src"]}'
    except TypeError:
        image = None
    print(link, image, sep="\n")
# part of the output:
'''
# first result (Step by Step: Counting to 50)
https://www.nb.co.za/en/view-book/?id=9780798182539
https://www.nb.co.za/en/helper/ReadImage/25929
# last result WITH image preview (Dinosourusse - Feite en geite: Daar’s ’n trikeratops op die trampoline)
https://www.nb.co.za/en/view-book/?id=9780624035480
https://www.nb.co.za/en/helper/ReadImage/10853
# last result (Uhambo lukamusa (isiZulu)) WITH NO image preview on the website as well so it returned None
https://www.nb.co.za/en/view-book/?id=9780624043003
None
'''
<div class="col-lg-2 col-md-2 vcenter no-pad-top no-pad-bot">
<img itemprop="image" src="/uploads/images/cache/20955226c5c975c230cb8e1f8cff0e6f1583249561_150_150.png" alt="SPINNEY MOBILE DEVELOPMENT" class="b-lazy pull-left center-block img-responsive b-loaded"></div>
I need to extract the image path from this particular class only:
"/uploads/images/cache/20955226c5c975c230cb8e1f8cff0e6f1583249561_150_150.png"
My code:
url = "https://www.appfutura.com/developers/spinney"
html = urlopen(url).read()
soup = BeautifulSoup(html, "lxml")
soup.prettify()

for link in soup.find_all('img'):
    print(link.get('src'))

How can I achieve this? Please help.
If I understood your question correctly:
When printing the img "src" attribute, you can check whether it contains "/uploads/images/cache/" or not.

img = soup.find_all('img')
for link in img:
    if "/uploads/images/cache/" in link.get('src'):
        print(link.get('src'))
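The same substring filter can also be written as a CSS attribute selector ([src*="..."] matches any src containing the substring), which additionally skips tags that have no src at all; a small sketch on made-up markup:

```python
from bs4 import BeautifulSoup

# Two images: only the first lives under /uploads/images/cache/
html = '''<img src="/uploads/images/cache/20955226_150_150.png"/>
<img src="/static/logo.svg"/>'''
soup = BeautifulSoup(html, 'html.parser')

# [src*="..."] keeps only images whose src contains the substring
matches = [img['src'] for img in soup.select('img[src*="/uploads/images/cache/"]')]
print(matches)  # ['/uploads/images/cache/20955226_150_150.png']
```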
There is a module called webbrowser which lets you open a URL, such as:

import webbrowser

url = "https://www.appfutura.com/developers/spinney"
webbrowser.open(url)

But to download a URL you must import requests:

import requests

url = 'https://www.appfutura.com/developers/spinney'
r = requests.get(url, allow_redirects=True)
open('yoururlname.html', 'wb').write(r.content)

If you run the program from a folder, the download will end up in that folder. The program itself will just be a black window popping up and closing, and the URL's content will be downloaded.
I am trying to retrieve the link for an image using imgur.com. It seems that the picture (if .jpg or .png) is usually stored within (div class="image post-image") on their website, like:
<div class='image post-image'>
<img alt="" src="//i.imgur.com/QSGvOm3.jpg" original-title="" style="max-width: 100%; min-height: 666px;">
</div>
so here is my code so far:
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = 'https://imgur.com/gallery/0PTPt'
html = urlopen(url)
soup = BeautifulSoup(html, 'lxml')
info = soup.find_all('div', {'class':'post-image'})
file = open('imgur-html.txt', 'w')
file.write(str(info))
file.close()
Instead of being able to get everything within these tags, this is my output:
<div class="post-image" style="min-height: 666px">
</div>
What do I need to do in order to access this further so I can get the image link? Or is this simply something where I need to only use the API? Thanks for any help.
The child img, it would appear, is added dynamically and is not present in the static HTML. You can extract the full link from the element whose rel attribute is image_src:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://imgur.com/gallery/0PTPt')
soup = bs(r.content, 'lxml')
print(soup.select_one('[rel=image_src]')['href'])
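To see why this works without running JavaScript: imgur puts the image URL in the static page head, so the selector reads it from there. A self-contained sketch on a stripped-down head (the markup here is illustrative; the og:image meta tag is another common place to look):

```python
from bs4 import BeautifulSoup

# Simplified head of an imgur gallery page: the image URL appears in
# static <link> and <meta> tags even though the <img> is injected later.
html = '''<head>
<link rel="image_src" href="https://i.imgur.com/QSGvOm3.jpg"/>
<meta property="og:image" content="https://i.imgur.com/QSGvOm3.jpg?fb"/>
</head>'''
soup = BeautifulSoup(html, 'html.parser')

img_url = soup.select_one('[rel=image_src]')['href']
og_url = soup.select_one('meta[property="og:image"]')['content']
print(img_url)  # https://i.imgur.com/QSGvOm3.jpg
print(og_url)   # https://i.imgur.com/QSGvOm3.jpg?fb
```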
I am an absolute newbie in the field of web scraping and right now I want to extract the visible text from a web page. I found a piece of code online:
import urllib2
from bs4 import BeautifulSoup
url = "http://www.espncricinfo.com/"
web_page = urllib2.urlopen(url)
soup = BeautifulSoup(url , "lxml")
print (soup.prettify())
With the above code, I get the following result:
/usr/local/lib/python2.7/site-packages/bs4/__init__.py:282: UserWarning: "http://www.espncricinfo.com/" looks like a URL. Beautiful Soup is not an HTTP client. You should probably use an HTTP client like requests to get the document behind the URL, and feed that document to Beautiful Soup.
' that document to Beautiful Soup.' % decoded_markup
<html>
<body>
<p>
http://www.espncricinfo.com/
</p>
</body>
</html>
Is there any way I could get a more concrete result? What is going wrong with the code? Sorry for being clueless.
Try passing the HTML document, not the URL, to BeautifulSoup:
import urllib2
from bs4 import BeautifulSoup
url = "http://www.espncricinfo.com/"
web_page = urllib2.urlopen(url)
soup = BeautifulSoup(web_page , 'html.parser')
print (soup.prettify().encode('utf-8'))
soup = BeautifulSoup(web_page, "lxml")
You should pass a file-like object (or the document string) to BeautifulSoup, not the URL.
The URL is fetched by urllib2.urlopen(url), and the response is stored in web_page.
I was trying to scrape the image url from a website using python urllib2.
Here is my code to get the html string:
req = urllib2.Request(url, headers = urllib2Header)
htmlStr = urllib2.urlopen(req, timeout=15).read()
When I view from the browser, the html code of the image looks like this:
<img id="main-image" src="http://abcd.com/images/41Q2VRKA2QL._SY300_.jpg" alt="" rel="" style="display: inline; cursor: pointer;">
However, when I read from the htmlStr I captured, the image was converted to base64 image, which looks like this:
<img id="main-image" src="data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAUDBAQEAwUEBAQFBQU....">
I am wondering why this happened. Is there a way to get the original image url rather than the base64 image string?
Thanks.
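A side note on the base64 case: when src is a data: URI there is no separate URL to fetch; the image bytes are embedded in the page itself and can be decoded directly. A minimal sketch with a truncated payload:

```python
import base64

# A data: URI embeds the image itself: data:<mime>;base64,<payload>
src = 'data:image/jpeg;base64,/9j/4AAQSkZJRg=='  # truncated example

header, _, payload = src.partition(',')
image_bytes = base64.b64decode(payload)

print(header)           # data:image/jpeg;base64
print(image_bytes[:3])  # b'\xff\xd8\xff' (the JPEG magic bytes)
```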
You could use BeautifulSoup.
Example:
import urllib2
from bs4 import BeautifulSoup

url = "www.theurlyouwanttoscrape.com"
html = urllib2.urlopen(url)
soup = BeautifulSoup(html, 'lxml')

# note: the id in your page source is "main-image", not "main_image"
img_src = soup.find('img', {'id': 'main-image'})['src']