Fetching Image from URL using BeautifulSoup - python

I am trying to fetch the main images (not thumbnails or other GIFs) from the Wikipedia page using the following code. However, "imgs" comes back with a length of 0. Any suggestion on how to fix it?
Code :
import urllib
import urllib2
from bs4 import BeautifulSoup
import os
html = urllib2.urlopen("http://en.wikipedia.org/wiki/Main_Page")
soup = BeautifulSoup(html)
imgs = soup.findAll("div",{"class":"image"})
Also, if someone can explain in detail how to use findAll by looking at the "source element" in a webpage, that would be awesome.

The a tags on the page have an image class, not div:
>>> img_links = soup.findAll("a", {"class":"image"})
>>> for img_link in img_links:
...     print img_link.img['src']
...
//upload.wikimedia.org/wikipedia/commons/thumb/1/1f/Stora_Kronan.jpeg/100px-Stora_Kronan.jpeg
//upload.wikimedia.org/wikipedia/commons/thumb/4/4b/Christuss%C3%A4ule_8.jpg/77px-Christuss%C3%A4ule_8.jpg
...
Or, even better, use a.image > img CSS selector:
>>> for img in soup.select('a.image > img'):
...     print img['src']
//upload.wikimedia.org/wikipedia/commons/thumb/1/1f/Stora_Kronan.jpeg/100px-Stora_Kronan.jpeg
//upload.wikimedia.org/wikipedia/commons/thumb/4/4b/Christuss%C3%A4ule_8.jpg/77px-Christuss%C3%A4ule_8.jpg
...
UPD (downloading images using urllib.urlretrieve):
from urllib import urlretrieve
import urlparse
from bs4 import BeautifulSoup
import urllib2
url = "http://en.wikipedia.org/wiki/Main_Page"
soup = BeautifulSoup(urllib2.urlopen(url))
for img in soup.select('a.image > img'):
    img_url = urlparse.urljoin(url, img['src'])
    file_name = img['src'].split('/')[-1]
    urlretrieve(img_url, file_name)

I don't see any div tags with a class called 'image' on that page.
You could get all the image tags and throw away ones that are small.
imgs = soup.select('img')
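For example, a rough sketch of that size filter, assuming the thumbnails you want to drop declare numeric width/height attributes (the 50px cutoff is arbitrary):
large_imgs = []
for img in imgs:
    # keep only images whose declared width or height exceeds the arbitrary 50px cutoff;
    # images with missing or non-numeric size attributes are skipped here
    width = img.get('width', '0')
    height = img.get('height', '0')
    if width.isdigit() and height.isdigit() and max(int(width), int(height)) > 50:
        large_imgs.append(img.get('src'))
print large_imgs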

Related

Trying to scrape image urls but not able to get it using beautiful soup and python

I am scraping this link : https://www.americanexpress.com/in/credit-cards/all-cards/?sourcecode=A0000FCRAA&cpid=100370494&dsparms=dc_pcrid_408453063287_kword_american%20express%20credit%20card_match_e&gclid=Cj0KCQiApY6BBhCsARIsAOI_GjaRsrXTdkvQeJWvKzFy_9BhDeBe2L2N668733FSHTHm96wrPGxkv7YaAl6qEALw_wcB&gclsrc=aw.ds
and trying to get the image URLs.
from urllib.request import urlopen
from bs4 import BeautifulSoup
import json
AMEXurl = ['https://www.americanexpress.com/in/credit-cards/all-cards/?sourcecode=A0000FCRAA&cpid=100370494&dsparms=dc_pcrid_408453063287_kword_american%20express%20credit%20card_match_e&gclid=Cj0KCQiApY6BBhCsARIsAOI_GjaRsrXTdkvQeJWvKzFy_9BhDeBe2L2N668733FSHTHm96wrPGxkv7YaAl6qEALw_wcB&gclsrc=aw.ds']
identity = ['filmstrip_container']
html_1 = urlopen(AMEXurl[0])
soup_1 = BeautifulSoup(html_1,'lxml')
address = soup_1.find('div',attrs={"class" : identity[0]})
for x in address.find_all('div', class_ = 'filmstrip-imgContainer'):
    print(x.find('div').get('img'))
But I am getting the following output:
None
None
None
None
None
None
None
The following is a screenshot of the HTML for the section of the page from which I'd like to get the image URLs.
I'd like to know what changes need to be made to the code so that I get all the image URLs.
The images are loaded dynamically from a script tag. You can easily regex them out of the .text of the response. The regex below specifically matches the 7 images you say you want to retrieve, shown in your screenshot.
import requests, re
# fetch the raw page source; the image URLs live inside a <script> tag
r = requests.get('https://www.americanexpress.com/in/credit-cards/all-cards/?sourcecode=A0000FCRAA&cpid=100370494&dsparms=dc_pcrid_408453063287_kword_american%20express%20credit%20card_match_e&gclid=Cj0KCQiApY6BBhCsARIsAOI_GjaRsrXTdkvQeJWvKzFy_9BhDeBe2L2N668733FSHTHm96wrPGxkv7YaAl6qEALw_wcB&gclsrc=aw.ds').text
# lazily capture everything between imgurl":" and the next double quote
p = re.compile(r'imgurl":"(.*?)"')
links = p.findall(r)
print(links)
Regex explanation: imgurl":"(.*?)" matches the literal text imgurl":" and then lazily captures everything up to the next double quote, i.e. each image URL embedded in the page's JSON.
If you were to go with the more expensive Selenium route, you could match with:
links = [i.get_attribute('src') for i in driver.find_elements_by_css_selector('.filmstrip-imgContainer img')]
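A minimal sketch of that Selenium route (my own, not tested against this page), using the Selenium 3 style find_elements_by_css_selector and assuming Chrome with a matching chromedriver on your PATH; the implicit wait is illustrative:
from selenium import webdriver
url = 'https://www.americanexpress.com/in/credit-cards/all-cards/?sourcecode=A0000FCRAA&cpid=100370494&dsparms=dc_pcrid_408453063287_kword_american%20express%20credit%20card_match_e&gclid=Cj0KCQiApY6BBhCsARIsAOI_GjaRsrXTdkvQeJWvKzFy_9BhDeBe2L2N668733FSHTHm96wrPGxkv7YaAl6qEALw_wcB&gclsrc=aw.ds'
driver = webdriver.Chrome()   # assumes chromedriver is installed and on PATH
driver.implicitly_wait(10)    # give the dynamically loaded carousel time to render
driver.get(url)
links = [i.get_attribute('src') for i in driver.find_elements_by_css_selector('.filmstrip-imgContainer img')]
driver.quit()
print(links)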
Try this
from bs4 import BeautifulSoup
import json
import requests
import re
AMEXurl = ['https://www.americanexpress.com/in/credit-cards/all-cards/?sourcecode=A0000FCRAA&cpid=100370494&dsparms=dc_pcrid_408453063287_kword_american%20express%20credit%20card_match_e&gclid=Cj0KCQiApY6BBhCsARIsAOI_GjaRsrXTdkvQeJWvKzFy_9BhDeBe2L2N668733FSHTHm96wrPGxkv7YaAl6qEALw_wcB&gclsrc=aw.ds']
identity = ['filmstrip_container']
r = requests.get(AMEXurl[0])
soup_1 = BeautifulSoup(r.content, 'lxml')
Extracting All Images
images = soup_1.find_all('img', src=True)
for img in images:
    print(img['src'])
All image tags that display PNG files:
platinum_card_image=soup_1.find('img', src=re.compile('Platinum_Card\.png$'))
print(platinum_card_image.get('src'))
All image tags that display SVG files:
platinum_card_image=soup_1.find_all('img', src=re.compile('\.svg$'))
for img in platinum_card_image:
    print(img.get('src'))
Edit
# the card data is embedded as JSON in the page's __REDUX_STATE__ script tag
images_7 = soup_1.select('script')[8].string.split('__REDUX_STATE__ = ')
data = images_7[1]
for d in json.loads(data)["modelData"]['componentFeaturedCards']['cards']:
    print(d['imgurl'])

Downloading images with BeautifulSoup when complete image link does not appear unless hovering over the src tag

I'm trying to download images from this page. I wrote the following Python script:
import requests
import subprocess
from bs4 import BeautifulSoup
request = requests.get("http://ottofrello.dk/malerierstor.htm")
content = request.content
soup = BeautifulSoup(content, "html.parser")
element = soup.find_all("img")
for img in element:
    print(img.get('src'))
However, I get only the image names and not the complete path. On the site, I can hover over the image name when I inspect the html and the link appears. Is there any way I can parse this link using BeautifulSoup?
The image URIs in your page are marked up relative to the hostname.
You can build an absolute URL for each image using the urljoin function from the urllib.parse module.
from urllib.parse import urljoin
page_url = "http://ottofrello.dk/malerierstor.htm"
request = requests.get(page_url)
...
for img in element:
    image_url = urljoin(
        page_url,
        img.get('src')
    )
    print(image_url)
From what I understood, you are interested in the absolute image path and not the relative path, which is what you are getting right now. The only change I made is in your print statement.
import requests
import subprocess
from bs4 import BeautifulSoup
request = requests.get("http://ottofrello.dk/malerierstor.htm")
content = request.content
soup = BeautifulSoup(content, "html.parser")
element = soup.find_all("img")
for img in element:
    print('http://ottofrello.dk/' + img.get('src'))

How to extract specific URL from HTML using Beautiful Soup?

I want to extract specific URLs from an HTML page.
from urllib2 import urlopen
import re
from bs4 import BeautifulSoup
url = "http://bassrx.tumblr.com/tagged/tt"  # nsfw link
page = urlopen(url)
html = page.read() # get the html from the url
# this works without BeautifulSoup, but it is slow:
image_links = re.findall("src.\"(\S*?media.tumblr\S*?tumblr_\S*?jpg)", html)
print image_links
The output of the above is exactly the URL, nothing else: http://38.media.tumblr.com/tumblr_ln5gwxHYei1qi02clo1_500.jpg
The only downside is it is very slow.
BeautifulSoup is extremely fast at parsing HTML, so that's why I want to use it.
The URLs that I want are actually the img src values. Here's a snippet from the HTML that contains the information I want:
<div class="media"><a href="http://bassrx.tumblr.com/image/85635265422">
<img src="http://38.media.tumblr.com/tumblr_ln5gwxHYei1qi02clo1_500.jpg"/>
</a></div>
So, my question is, how can I get BeautifulSoup to extract all of those 'img src' urls cleanly without any other cruft?
I just want a list of matching urls. I've been trying to use soup.findall() function, but cannot get any useful results.
from urllib2 import urlopen
from bs4 import BeautifulSoup
url = 'http://bassrx.tumblr.com/tagged/tt'
soup = BeautifulSoup(urlopen(url).read())
for element in soup.findAll('img'):
    print(element.get('src'))
You can use the div.media > a > img CSS selector to find img tags inside an a tag, which is inside a div tag with the media class:
from urllib2 import urlopen
from bs4 import BeautifulSoup
url = "<url_here>"
soup = BeautifulSoup(urlopen(url))
images = soup.select('div.media > a > img')
print [image.get('src') for image in images]
In order to make the parsing faster you can use lxml parser:
soup = BeautifulSoup(urlopen(url), "lxml")
You need to install lxml module first, of course.
Also, you can make use of the SoupStrainer class to parse only the relevant part of the document.
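For instance, a small sketch (my own, not part of the original answer) that strains the same page down to just the img tags:
from urllib2 import urlopen
from bs4 import BeautifulSoup, SoupStrainer
url = "http://bassrx.tumblr.com/tagged/tt"
# only <img> tags are parsed into the tree; everything else is skipped
only_imgs = SoupStrainer("img")
soup = BeautifulSoup(urlopen(url), "lxml", parse_only=only_imgs)
print [img.get('src') for img in soup.find_all('img')]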
Hope that helps.
Have a look at a BeautifulSoup.find_all with re.compile mix:
from urllib2 import urlopen
import re
from bs4 import BeautifulSoup
url = "http://bassrx.tumblr.com/tagged/tt" # nsfw link
page = urlopen(url)
html = page.read()
bs = BeautifulSoup(html)
a_tumblr = [a_element for a_element in bs.find_all(href=re.compile("media\.tumblr"))]
##[<link href="http://37.media.tumblr.com/avatar_df3a9e37c757_128.png" rel="shortcut icon"/>, <link href="http://37.media.tumblr.com/avatar_df3a9e37c757_128.png" rel="apple-touch-icon"/>]

Beautifulsoup - How to open images and download them

I am looking to grab the full size product images from here
My thinking was:
Follow the image link
Download the picture
Go back
Repeat for n+1 pictures
I know how to open the image thumbnails but not how to get the full size images. Any ideas on how this could be done?
This will get you all the URLs of the images:
import urllib2
from bs4 import BeautifulSoup
url = "http://icecat.biz/p/toshiba/pscbxe-01t00een/satellite-pro-notebooks-4051528049077-Satellite+Pro+C8501GR-17732197.html"
html = urllib2.urlopen(url)
soup = BeautifulSoup(html)
imgs = soup.findAll("div", {"class":"thumb-pic"})
for img in imgs:
    print img.a['href'].split("imgurl=")[1]
Output:
http://www.toshiba.fr/contents/fr_FR/SERIES_DESCRIPTION/images/g1_satellite-pro-c850.jpg
http://www.toshiba.fr/contents/fr_FR/SERIES_DESCRIPTION/images/g4_satellite-pro-c850.jpg
http://www.toshiba.fr/contents/fr_FR/SERIES_DESCRIPTION/images/g2_satellite-pro-c850.jpg
http://www.toshiba.fr/contents/fr_FR/SERIES_DESCRIPTION/images/g5_satellite-pro-c850.jpg
http://www.toshiba.fr/contents/fr_FR/SERIES_DESCRIPTION/images/g3_satellite-pro-c850.jpg
And this code is for downloading and saving those images:
import os
import urllib
import urllib2
from bs4 import BeautifulSoup
url = "http://icecat.biz/p/toshiba/pscbxe-01t00een/satellite-pro-notebooks-4051528049077-Satellite+Pro+C8501GR-17732197.html"
html = urllib2.urlopen(url)
soup = BeautifulSoup(html)
imgs = soup.findAll("div", {"class":"thumb-pic"})
for img in imgs:
    imgUrl = img.a['href'].split("imgurl=")[1]
    urllib.urlretrieve(imgUrl, os.path.basename(imgUrl))

Python Beautifulsoup img tag parsing

I am using BeautifulSoup to parse all img tags present on 'www.youtube.com'.
The code is
import urllib2
from BeautifulSoup import BeautifulSoup
page = urllib2.urlopen('http://www.youtube.com/')
soup = BeautifulSoup(page)
tags=soup.findAll('img')
But I am not getting all the img tags, and some of the tags I do get are invalid.
The img tags I get after parsing are different from the img tags in the page source; some attributes are missing.
I need to get all the video img tags on youtube.com.
Please help
Seems to work when I try it here
import urllib2
from BeautifulSoup import BeautifulSoup
page = urllib2.urlopen('http://www.youtube.com/')
soup = BeautifulSoup(page)
tags=soup.findAll('img')
print "\n".join(set(tag['src'] for tag in tags))
Produces this, which looks OK to me:
http://i1.ytimg.com/vi/D9Zg67r9q9g/market_thumb.jpg?v=723c8e
http://s.ytimg.com/yt/img/pixel-vfl3z5WfW.gif
//s.ytimg.com/yt/img/pixel-vfl3z5WfW.gif
/gen_204?a=fvhr&v=mha7pAOfqt4&nocache=1337083207.97
http://i3.ytimg.com/vi/fNs8mf2OdkU/market_thumb.jpg?v=4f85544b
http://i4.ytimg.com/vi/CkQFjyZCq4M/market_thumb.jpg?v=4f95762c
http://i3.ytimg.com/vi/fzD5gAecqdM/market_thumb.jpg?v=b0cabf
http://i3.ytimg.com/vi/2M3pb2_R2Ng/market_thumb.jpg?v=4f0d95fa
//i2.ytimg.com/vi/mha7pAOfqt4/hqdefault.jpg
I had a similar problem: I couldn't find all the images. Here is a piece of code that will give you any attribute value of an image tag.
from BeautifulSoup import BeautifulSoup as BSHTML
import urllib2
page = urllib2.urlopen('http://www.youtube.com/')
soup = BSHTML(page)
images = soup.findAll('img')
for image in images:
    # print image source
    print image['src']
    # print alternate text
    print image['alt']
Explicitly using soup.findAll(name='img') worked for me, and I don't appear to be missing anything from the page.
def grabimagetags():
    import urllib2
    from BeautifulSoup import BeautifulSoup
    page = urllib2.urlopen('http://www.youtube.com/')
    soup = BeautifulSoup(page)
    tags = soup.findAll(name='img')
    # collect the unique src values into a list and return it
    srcs = list(set(tag['src'] for tag in tags))
    return srcs
grabimagetags()
I would only make this change so that you can pass around the list of img src values.
In my case, some images didn't contain src, so I did this to avoid a KeyError exception:
art_imgs = set(img['src'] for img in article.find_all('img') if img.has_attr('src'))
Try this.
from simplified_scrapy import SimplifiedDoc, req
url = 'https://www.youtube.com'
html = req.get(url)
doc = SimplifiedDoc(html)
imgs = doc.listImg(url = url)
print([img.url for img in imgs])
imgs = doc.selects('img')
for img in imgs:
    print(img)
    print(doc.absoluteUrl(url, img.src))
