Extract image url from html - python

<div class="col-lg-2 col-md-2 vcenter no-pad-top no-pad-bot">
<img itemprop="image" src="/uploads/images/cache/20955226c5c975c230cb8e1f8cff0e6f1583249561_150_150.png" alt="SPINNEY MOBILE DEVELOPMENT" class="b-lazy pull-left center-block img-responsive b-loaded"></div>
I need to extract the image URL from this particular element only:
"/uploads/images/cache/20955226c5c975c230cb8e1f8cff0e6f1583249561_150_150.png"
My code:
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = "https://www.appfutura.com/developers/spinney"
html = urlopen(url).read()
soup = BeautifulSoup(html, "lxml")
# Print the src of every <img> on the page
for link in soup.find_all('img'):
    print(link.get('src'))
How can I achieve this?
Please help.

If I understood your question correctly:
When printing each img's "src" attribute, you can check whether it contains "/uploads/images/cache/".
img = soup.find_all('img')
for link in img:
    src = link.get('src')
    # Skip <img> tags without a src, then keep only the cached uploads
    if src and "/uploads/images/cache/" in src:
        print(src)
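The question asks for the image of one particular element rather than for a URL prefix, so another option is to key on an attribute of that tag. Below is a minimal sketch that uses only the standard library's html.parser and matches the itemprop="image" attribute from the posted HTML. The second img tag and the class name ImageSrcCollector are made up for illustration. With BeautifulSoup, the comparable call would be soup.select('img[itemprop="image"]').

```python
from html.parser import HTMLParser


class ImageSrcCollector(HTMLParser):
    """Collect the src of <img> tags whose itemprop attribute is 'image'."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        # Keep only the tagged image, and only if it actually has a src
        if attributes.get("itemprop") == "image" and attributes.get("src"):
            self.sources.append(attributes["src"])


# First tag is from the question; the second one is a hypothetical extra image
html = '''<div class="col-lg-2 col-md-2 vcenter no-pad-top no-pad-bot">
<img itemprop="image" src="/uploads/images/cache/20955226c5c975c230cb8e1f8cff0e6f1583249561_150_150.png" alt="SPINNEY MOBILE DEVELOPMENT" class="b-lazy pull-left center-block img-responsive b-loaded"></div>
<img src="/other/logo.png" alt="unrelated">'''

collector = ImageSrcCollector()
collector.feed(html)
print(collector.sources)
```

Only the first src is collected, because the second tag has no itemprop attribute.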

There is a module called webbrowser that lets you open a URL in the default browser, for example:
import webbrowser
url = "https://www.appfutura.com/developers/spinney"
webbrowser.open(url)
To download the page instead, you can use the requests library:
import requests
url = 'https://www.appfutura.com/developers/spinney'
r = requests.get(url, allow_redirects=True)
# Save the raw response body to a file
with open('yoururlname.html', 'wb') as f:
    f.write(r.content)
The file is written to the current working directory, which is usually the folder you run the script from. If you start the script by double-clicking it, a console window may briefly open and close while the page is downloaded.

Related

Get image data-src with Beautiful Soup when there is no image extension

I am trying to get all the image urls for all the books on this page https://www.nb.co.za/en/books/0-6-years with beautiful soup.
This is my code:
from bs4 import BeautifulSoup
import requests
baseurl = "https://www.nb.co.za/"
productlinks = []
r = requests.get('https://www.nb.co.za/en/books/0-6-years')
soup = BeautifulSoup(r.content, 'lxml')
productlist = soup.find_all('div', class_="book-slider-frame")
def my_filter(tag):
    # Match <a> tags that sit directly inside a div with class img-container
    return (tag.name == 'a' and
            tag.parent.name == 'div' and
            'img-container' in tag.parent.get('class', []))
for item in productlist:
    for link in item.find_all(my_filter, href=True):
        productlinks.append(baseurl + link['href'])
cover = soup.find_all('div', class_="img-container")
print(cover)
And this is my output:
<div class="img-container">
<a href="/en/view-book/?id=9780798182539">
<img class="lazy" data-src="/en/helper/ReadImage/25929" src="/Content/images/loading5.gif"/>
</a>
</div>
What I hope to get:
https://www.nb.co.za/en/helper/ReadImage/25929.jpg
My questions are:
How do I get the data-src only?
How do I get the extension of the image?
1: How do I get the data-src only?
You can access the data-src by calling element['data-src']:
cover = baseurl+item.img['data-src'] if item.img['src'] != nocover else baseurl+nocover
2: How do I get the extension of the image?
You can determine the file type as diggusbickus mentioned (a good approach), but it will not help if you then request the file with the extension appended: https://www.nb.co.za/en/helper/ReadImage/25929.jpg returns a 404 error.
The image is served dynamically. For additional information see https://stackoverflow.com/a/5110673/14460824
Example
baseurl = "https://www.nb.co.za/"
nocover = '/Content/images/no-cover.jpg'
data = []
for item in soup.select('.book-slider-frame'):
    data.append({
        'link' : baseurl+item.a['href'],
        'cover' : baseurl+item.img['data-src'] if item.img['src'] != nocover else baseurl+nocover
    })
data
Output
[{'link': 'https://www.nb.co.za//en/view-book/?id=9780798182539',
  'cover': 'https://www.nb.co.za//en/helper/ReadImage/25929'},
 {'link': 'https://www.nb.co.za//en/view-book/?id=9780798182546',
  'cover': 'https://www.nb.co.za//en/helper/ReadImage/25931'},
 {'link': 'https://www.nb.co.za//en/view-book/?id=9780798182553',
  'cover': 'https://www.nb.co.za//en/helper/ReadImage/25925'},...]
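The doubled slash in these URLs comes from concatenating a base that ends in "/" with a path that starts with "/". A small sketch of one way to avoid it, using urljoin from the standard library. The helper name absolute_url is made up for this example.

```python
from urllib.parse import urljoin


def absolute_url(page_url, path):
    """Resolve a path taken from an href/src/data-src against the page URL."""
    return urljoin(page_url, path)


page_url = "https://www.nb.co.za/en/books/0-6-years"
print(absolute_url(page_url, "/en/view-book/?id=9780798182539"))
print(absolute_url(page_url, "/en/helper/ReadImage/25929"))
```

Because urljoin resolves the path the way a browser would, it also copes with relative paths and with values that are already absolute URLs.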
I'll show you how to do it for that small example and let you handle the rest. Just use the imghdr module:
import imghdr
import requests
from bs4 import BeautifulSoup
data="""<div class="img-container">
<a href="/en/view-book/?id=9780798182539">
<img class="lazy" data-src="/en/helper/ReadImage/25929" src="/Content/images/loading5.gif"/>
</a>
</div>"""
soup=BeautifulSoup(data, 'lxml')
base_url="https://www.nb.co.za"
img_src=soup.select_one('img')['data-src']
img_name=img_src.split('/')[-1]
# Download the image bytes and save them under the last path segment
response=requests.get(base_url+img_src)
with open(img_name, 'wb') as f:
    f.write(response.content)
# imghdr inspects the file contents, not the file name
print(imghdr.what(img_name))
>>> jpeg
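Note that imghdr was deprecated in Python 3.11 and removed in Python 3.13. The same idea, checking the first bytes of the downloaded content, can be sketched by hand. The function name sniff_image_type and the choice of four formats are assumptions for illustration, not a full replacement for imghdr.

```python
def sniff_image_type(data):
    """Guess an image format from the leading bytes; return None if unknown."""
    if data.startswith(b"\xff\xd8\xff"):
        return "jpeg"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    if data[:6] in (b"GIF87a", b"GIF89a"):
        return "gif"
    # WebP files are RIFF containers with "WEBP" at byte offset 8
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "webp"
    return None


# Synthetic byte strings standing in for downloaded content
print(sniff_image_type(b"\xff\xd8\xff\xe0" + b"\x00" * 16))
print(sniff_image_type(b"<html>not an image</html>"))
```

In the example above, this would be called on response.content, so the file type is known before choosing a file name.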
If the page responds slowly, you can pass a timeout argument to requests. timeout=None (which is also the default) tells requests to wait indefinitely for a response. Note that requests does not run the page's JavaScript, so waiting longer does not make lazy-loaded images appear in the HTML.
The src ends in .gif because the real image has not been loaded yet: src points to a loading animation, while the real path is stored in data-src.
You can access the data-src attribute the same way you would access a dictionary key: tag[attribute], for example img['data-src'].
If you want to save an image locally, you can use urllib.request.urlretrieve:
import urllib.request
urllib.request.urlretrieve("BOOK_COVER_URL", "file_name.jpg") # will save in the current directory
Code and example:
from bs4 import BeautifulSoup
import requests, lxml
response = requests.get('https://www.nb.co.za/en/books/0-6-years', timeout=None)
soup = BeautifulSoup(response.text, 'lxml')
for result in soup.select(".img-container"):
    link = f'https://www.nb.co.za{result.select_one("a")["href"]}'
    # try/except to handle results that have no image on the website (last 3 results)
    try:
        image = f'https://www.nb.co.za{result.select_one("a img")["data-src"]}'
    except (TypeError, KeyError):
        image = None
    print(link, image, sep="\n")
# part of the output:
'''
# first result (Step by Step: Counting to 50)
https://www.nb.co.za/en/view-book/?id=9780798182539
https://www.nb.co.za/en/helper/ReadImage/25929
# last result WITH image preview (Dinosourusse - Feite en geite: Daar’s ’n trikeratops op die trampoline)
https://www.nb.co.za/en/view-book/?id=9780624035480
https://www.nb.co.za/en/helper/ReadImage/10853
# last result (Uhambo lukamusa (isiZulu)) WITH NO image preview on the website as well so it returned None
https://www.nb.co.za/en/view-book/?id=9780624043003
None
'''

Python beautifulsoup: Get a placeholder from img src

With BeautifulSoup I am trying to read the address of an image from a web page.
In the page source I can see the URL of the image.
But when I try to read the address with BeautifulSoup's find_all, I only get a placeholder for the image URL.
In the page source, the image tag looks like this:
<br /><img src="mangas/Young Justice (2019)/Young Justice (2019) Issue 11/cw002.jpg" alt="" width="1200" height="1846" class="picture" />
With BeautifulSoup I get this:
<img 0="" alt="" class="picture" height="" src="/pics/placeholder2.jpg" width=""/>]
I hope somebody can give me a tip or explain why I get a placeholder instead of the original image URL.
My Code:
import requests
from bs4 import BeautifulSoup as BS
from requests.exceptions import ConnectionError
def getimageurl(url):
    try:
        response = requests.get(url)
        soup = BS(response.text, 'html.parser')
        data = soup.find_all('a', href=True)
        for a in data:
            t = a.find_all('img', attrs={'class': 'picture'})
            print(t)
    except ConnectionError:
        print('Cant open url: {0}'.format(url))

Retrieving Imgur Image Link via Web Scraping Python

I am trying to retrieve the link for an image using imgur.com. It seems that the picture (if .jpg or .png) is usually stored within (div class="image post-image") on their website, like:
<div class='image post-image'>
<img alt="" src="//i.imgur.com/QSGvOm3.jpg" original-title="" style="max-width: 100%; min-height: 666px;">
</div>
so here is my code so far:
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = 'https://imgur.com/gallery/0PTPt'
html = urlopen(url)
soup = BeautifulSoup(html, 'lxml')
info = soup.find_all('div', {'class':'post-image'})
file = open('imgur-html.txt', 'w')
file.write(str(info))
file.close()
Instead of being able to get everything within these tags, this is my output:
<div class="post-image" style="min-height: 666px">
</div>
What do I need to do in order to access this further so I can get the image link? Or is this simply something where I need to only use the API? Thanks for any help.
It appears the child img is added dynamically and is not present in the HTML that the request returns. You can extract the full link from the element whose rel attribute is image_src:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://imgur.com/gallery/0PTPt')
soup = bs(r.content, 'lxml')
print(soup.select_one('[rel=image_src]')['href'])

Wix doesn't work with BeautifulSoup

Why doesn't BeautifulSoup manage to download information from Wix? I'm trying to use BeautifulSoup to download images from my website. The code works on other sites (example of the code actually working), but it does not work on Wix.
Is there anything I can change in my site's settings in order for it to work?
EDIT: CODE
from bs4 import BeautifulSoup
import urllib2
import shutil
import requests
from urlparse import urljoin
import time
def make_soup(url):
    req = urllib2.Request(url, headers={'User-Agent': "Magic Browser"})
    html = urllib2.urlopen(req)
    return BeautifulSoup(html, 'html.parser')
def get_images(url):
    soup = make_soup(url)
    images = [img for img in soup.findAll('img')]
    print (str(len(images)) + " images found.")
    print 'Downloading images to current working directory.'
    image_links = [each.get('src') for each in images]
    for each in image_links:
        try:
            filename = each.strip().split('/')[-1].strip()
            src = urljoin(url, each)
            print 'Getting: ' + filename
            response = requests.get(src, stream=True)
            # delay to avoid corrupted previews
            time.sleep(1)
            with open(filename, 'wb') as out_file:
                shutil.copyfileobj(response.raw, out_file)
        except:
            print ' An error occurred. Continuing.'
    print 'Done.'
def main():
    url = HIDDEN ADDRESS
    get_images(url)
if __name__ == '__main__':
    main()
BeautifulSoup can only parse HTML. Wix sites are generated by JavaScript that runs when you load the page. When you request the page's HTML via urllib, you don't get the rendered HTML, just the base HTML with the scripts that build it. To get the rendered HTML, you need something like Selenium or a headless Chrome browser to render the site by running its JavaScript, and then feed the rendered HTML to BeautifulSoup.
Here's an example of the body of a wix site, which you can see has no content other than a single div that gets populated via javascript.
...
<body>
<div id="SITE_CONTAINER"></div>
</body>
...
For anyone trying to download images from a Wix website, I figured out a simple workaround.
Add an HTML Code frame to your page and, in its code, link the img srcs of the pictures on your site. When you run BeautifulSoup on the URL of that HTML code, all of the images linked in the code will be downloaded!

When scraping image url src, get data:image/jpeg;base64

I was trying to scrape the image url from a website using python urllib2.
Here is my code to get the html string:
req = urllib2.Request(url, headers = urllib2Header)
htmlStr = urllib2.urlopen(req, timeout=15).read()
When I view from the browser, the html code of the image looks like this:
<img id="main-image" src="http://abcd.com/images/41Q2VRKA2QL._SY300_.jpg" alt="" rel="" style="display: inline; cursor: pointer;">
However, when I read from the htmlStr I captured, the image was converted to base64 image, which looks like this:
<img id="main-image" src="....">
I am wondering why this happened. Is there a way to get the original image url rather than the base64 image string?
Thanks.
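You could use BeautifulSoup to read the src attribute. Note that this returns whatever src the server sent, so if the response contains a base64 data URI, that is what you will get.
Example:
import urllib2
from bs4 import BeautifulSoup
url = "http://www.theurlyouwanttoscrape.com"
html = urllib2.urlopen(url)
soup = BeautifulSoup(html, 'html.parser')
# The id in the question's HTML is "main-image" (hyphen, not underscore)
img_src = soup.find('img', {'id': 'main-image'})['src']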
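A src that starts with data: is the image itself embedded inline, not a URL, so no original address can be recovered from that attribute alone. If the page exposes the real URL at all, where it lives is site-specific. As a minimal sketch, the code below tells the two cases apart and decodes a base64 payload. The function name split_data_uri and the tiny sample payload are made up for illustration.

```python
import base64


def split_data_uri(src):
    """Return (mime_type, raw_bytes) for a base64 data URI, or None otherwise."""
    if not src.startswith("data:"):
        return None  # an ordinary URL, nothing to decode
    header, _, payload = src.partition(",")
    # header looks like "data:image/jpeg;base64"
    meta = header[len("data:"):].split(";")
    mime_type = meta[0] or "text/plain"
    if "base64" not in meta[1:]:
        return None  # percent-encoded data URIs are not handled in this sketch
    return mime_type, base64.b64decode(payload)


# Hypothetical payload: just the six-byte GIF signature, base64-encoded
example = "data:image/gif;base64," + base64.b64encode(b"GIF89a").decode("ascii")
print(split_data_uri(example))
print(split_data_uri("http://abcd.com/images/41Q2VRKA2QL._SY300_.jpg"))
```

The decoded bytes can be written to a file in binary mode if the inline image is all that is needed.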
