I am looking to grab the full size product images from here
My thinking was:
Follow the image link
Download the picture
Go back
Repeat for n+1 pictures
I know how to open the image thumbnails but not how to get the full size images. Any ideas on how this could be done?
This will get you the URLs of all the images:
import urllib2
from bs4 import BeautifulSoup
url = "http://icecat.biz/p/toshiba/pscbxe-01t00een/satellite-pro-notebooks-4051528049077-Satellite+Pro+C8501GR-17732197.html"
html = urllib2.urlopen(url)
soup = BeautifulSoup(html)
imgs = soup.findAll("div", {"class":"thumb-pic"})
for img in imgs:
    print img.a['href'].split("imgurl=")[1]
Output:
http://www.toshiba.fr/contents/fr_FR/SERIES_DESCRIPTION/images/g1_satellite-pro-c850.jpg
http://www.toshiba.fr/contents/fr_FR/SERIES_DESCRIPTION/images/g4_satellite-pro-c850.jpg
http://www.toshiba.fr/contents/fr_FR/SERIES_DESCRIPTION/images/g2_satellite-pro-c850.jpg
http://www.toshiba.fr/contents/fr_FR/SERIES_DESCRIPTION/images/g5_satellite-pro-c850.jpg
http://www.toshiba.fr/contents/fr_FR/SERIES_DESCRIPTION/images/g3_satellite-pro-c850.jpg
And this code is for downloading and saving those images:
import os
import urllib
import urllib2
from bs4 import BeautifulSoup
url = "http://icecat.biz/p/toshiba/pscbxe-01t00een/satellite-pro-notebooks-4051528049077-Satellite+Pro+C8501GR-17732197.html"
html = urllib2.urlopen(url)
soup = BeautifulSoup(html)
imgs = soup.findAll("div", {"class":"thumb-pic"})
for img in imgs:
    imgUrl = img.a['href'].split("imgurl=")[1]
    urllib.urlretrieve(imgUrl, os.path.basename(imgUrl))
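The `split("imgurl=")[1]` step relies on the thumbnail link carrying the full-size URL in an `imgurl` query parameter. Here is a minimal Python 3 sketch of the same idea with `urllib.parse`, which also decodes percent-encoding and drops any parameters that follow. The sample `href` and the helper name `full_size_url` are made up for illustration; the real page's links may look different.

```python
from urllib.parse import parse_qs, urlsplit


def full_size_url(href):
    """Return the value of the imgurl query parameter, or None if absent."""
    query = urlsplit(href).query
    values = parse_qs(query).get("imgurl")
    return values[0] if values else None


# Hypothetical thumbnail link, not taken from the actual page.
href = "/viewer?imgurl=http%3A%2F%2Fexample.com%2Fimages%2Fg1.jpg&w=800"
print(full_size_url(href))
```

Unlike a plain `split`, this returns `None` instead of raising `IndexError` when a link has no `imgurl` parameter.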
I've successfully created a Python script that can print all image paths from a specified URL:
from requests_html import HTMLSession
from urllib.request import urlopen
from bs4 import BeautifulSoup
import requests
url="https://www.example.com/"
session = HTMLSession()
r = session.get(url)
b = requests.get(url)
soup = BeautifulSoup(b.text, "lxml")
images = soup.find_all('img')
for img in images:
    if img.has_attr('src'):
        print(img['src'])
What I now want to do is print the image size alongside the printed URL using PIL. I've tried this, but it raises an error:
from requests_html import HTMLSession
from urllib.request import urlopen
from bs4 import BeautifulSoup
from PIL import Image
import requests
url="https://www.example.com/"
session = HTMLSession()
r = session.get(url)
b = requests.get(url)
soup = BeautifulSoup(b.text, "lxml")
images = soup.find_all('img')
for img in images:
    if img.has_attr('src'):
        ## Get image sizes in PIL
        imgsize = Image.open(requests.get(img, stream=True).raw)
        print(img['src'], imgsize.size)
Any ideas how to get this working?
You should pass img['src'] (the URL string) to requests.get instead of img (the whole tag):
requests.get(img['src'], ...).raw
I am trying to scrape some img files from IG using selenium and bs4. The following script seems to work fine, but eventually I'd like it to print just the img src (a sample: https://scontent-lax3-2.cdninstagram.com/vp/2592f6b07f88bfc4bfdf6d73400a04b8/5BA6E998/t51.2885-15/s640x640/sh0.08/e35/28752330_1972627949433283_1816022201220988928_n.jpg) and download the images later. For now I just need some help printing that img src link without the tags and extras. Thanks for the advice.
Code:
import requests
from bs4 import BeautifulSoup
import selenium.webdriver as webdriver
url = ('https://www.instagram.com/kitties/')
driver = webdriver.Firefox()
driver.get(url)
soup = BeautifulSoup(driver.page_source, 'lxml')
img_url = soup.find_all('img', class_='_2di5p')
print img_url
Just print out the src of the found images.
imgs= soup.find_all('img', class_='_2di5p')
for img in imgs:
    img_url = img["src"]
    print img_url
I'm trying to download images from this page. I wrote the following Python script:
import requests
import subprocess
from bs4 import BeautifulSoup
request = requests.get("http://ottofrello.dk/malerierstor.htm")
content = request.content
soup = BeautifulSoup(content, "html.parser")
element = soup.find_all("img")
for img in element:
    print(img.get('src'))
However, I get only the image names and not the complete path. On the site, I can hover over the image name when I inspect the html and the link appears. Is there any way I can parse this link using BeautifulSoup?
The image URIs in your page are marked up relative to the hostname.
You can build an absolute URL for each image using the urljoin function in the urllib.parse module.
from urllib.parse import urljoin
page_url = "http://ottofrello.dk/malerierstor.htm"
request = requests.get(page_url)
...
for img in element:
    image_url = urljoin(
        page_url,
        img.get('src')
    )
    print(image_url)
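To see what `urljoin` does with the different kinds of `src` values you may run into, here is a small sketch. The `src` strings are hypothetical examples, not values taken from the page.

```python
from urllib.parse import urljoin

page_url = "http://ottofrello.dk/malerierstor.htm"

# Hypothetical src values covering the common cases.
examples = [
    "billeder/maleri1.jpg",             # relative to the page's directory
    "/billeder/maleri1.jpg",            # relative to the site root
    "//cdn.example.com/maleri1.jpg",    # protocol-relative: inherits "http"
    "https://example.com/maleri1.jpg",  # already absolute: returned unchanged
]

resolved = [urljoin(page_url, src) for src in examples]
for src, absolute in zip(examples, resolved):
    print(src, "->", absolute)
```

Prefixing the host name by hand only covers the first case, and only while the page itself sits at the site root, which is why `urljoin` is usually the safer choice.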
From what I understand, you are interested in the absolute image path, not the relative path you are getting right now. The only change I made is in your print statement.
import requests
import subprocess
from bs4 import BeautifulSoup
request = requests.get("http://ottofrello.dk/malerierstor.htm")
content = request.content
soup = BeautifulSoup(content, "html.parser")
element = soup.find_all("img")
for img in element:
    print('http://ottofrello.dk/' + img.get('src'))
I am trying to fetch the important images, not thumbnails or other GIFs, from the Wikipedia page using the following code. However, "imgs" comes back with a length of 0. Any suggestions on how to rectify this?
Code :
import urllib
import urllib2
from bs4 import BeautifulSoup
import os
html = urllib2.urlopen("http://en.wikipedia.org/wiki/Main_Page")
soup = BeautifulSoup(html)
imgs = soup.findAll("div",{"class":"image"})
Also, it would be awesome if someone could explain in detail how to work out the right findAll call by looking at an element's source in the web page.
The a tags on the page have an image class, not div:
>>> img_links = soup.findAll("a", {"class":"image"})
>>> for img_link in img_links:
...     print img_link.img['src']
...
//upload.wikimedia.org/wikipedia/commons/thumb/1/1f/Stora_Kronan.jpeg/100px-Stora_Kronan.jpeg
//upload.wikimedia.org/wikipedia/commons/thumb/4/4b/Christuss%C3%A4ule_8.jpg/77px-Christuss%C3%A4ule_8.jpg
...
Or, even better, use the a.image > img CSS selector:
>>> for img in soup.select('a.image > img'):
...     print img['src']
//upload.wikimedia.org/wikipedia/commons/thumb/1/1f/Stora_Kronan.jpeg/100px-Stora_Kronan.jpeg
//upload.wikimedia.org/wikipedia/commons/thumb/4/4b/Christuss%C3%A4ule_8.jpg/77px-Christuss%C3%A4ule_8.jpg
...
UPD (downloading images using urllib.urlretrieve):
from urllib import urlretrieve
import urlparse
from bs4 import BeautifulSoup
import urllib2
url = "http://en.wikipedia.org/wiki/Main_Page"
soup = BeautifulSoup(urllib2.urlopen(url))
for img in soup.select('a.image > img'):
    img_url = urlparse.urljoin(url, img['src'])
    file_name = img['src'].split('/')[-1]
    urlretrieve(img_url, file_name)
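For anyone on Python 3, the URL handling in that loop could look roughly like this. It is only a sketch: the helper name `download_target` is made up, and decoding the percent-escapes in the file name is my own addition rather than part of the code above.

```python
import posixpath
from urllib.parse import unquote, urljoin, urlsplit

page_url = "http://en.wikipedia.org/wiki/Main_Page"
# One of the protocol-relative src values printed above.
src = ("//upload.wikimedia.org/wikipedia/commons/thumb/4/4b/"
       "Christuss%C3%A4ule_8.jpg/77px-Christuss%C3%A4ule_8.jpg")


def download_target(page_url, src):
    """Return (absolute URL, decoded local file name) for an img src."""
    # urljoin fills in the scheme that a "//host/..." src leaves out.
    img_url = urljoin(page_url, src)
    # Use only the path so a query string never ends up in the file name.
    file_name = unquote(posixpath.basename(urlsplit(img_url).path))
    return img_url, file_name


img_url, file_name = download_target(page_url, src)
print(img_url)
print(file_name)
```

In Python 3, `urlretrieve` lives in `urllib.request`, so the download step becomes `urllib.request.urlretrieve(img_url, file_name)`.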
I don't see any div tags with a class called 'image' on that page.
You could get all the image tags and throw away ones that are small.
imgs = soup.select('img')
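One way to throw away the small ones without downloading anything is to look at the width and height attributes declared in the HTML. This is a rough sketch that assumes the page declares those attributes; the helper name `is_large_enough` and the 100-pixel threshold are arbitrary choices of mine.

```python
def is_large_enough(attrs, min_side=100):
    """Return True if the declared width and height are both >= min_side.

    Images that do not declare a usable size are kept, since their real
    size is unknown without downloading them.
    """
    try:
        width = int(attrs["width"])
        height = int(attrs["height"])
    except (KeyError, ValueError, TypeError):
        return True
    return width >= min_side and height >= min_side


# Plain dicts stand in for the attribute dicts of parsed img tags.
candidates = [
    {"src": "photo.jpg", "width": "300", "height": "200"},
    {"src": "icon.gif", "width": "20", "height": "20"},
    {"src": "unknown.jpg"},
]
kept = [attrs["src"] for attrs in candidates if is_large_enough(attrs)]
print(kept)
```

With BeautifulSoup you would pass each tag's attribute dict, for example `[img for img in soup.select('img') if is_large_enough(img.attrs)]`.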
I am using BeautifulSoup to parse all the img tags present on 'www.youtube.com'.
The code is:
import urllib2
from BeautifulSoup import BeautifulSoup
page = urllib2.urlopen('http://www.youtube.com/')
soup = BeautifulSoup(page)
tags=soup.findAll('img')
But I am not getting all the img tags, and the ones I do get are invalid.
The img tags I get after parsing differ from those in the page source; some attributes are missing.
I need to get all the video img tags on youtube.com.
Please help.
It seems to work when I try it here:
import urllib2
from BeautifulSoup import BeautifulSoup
page = urllib2.urlopen('http://www.youtube.com/')
soup = BeautifulSoup(page)
tags=soup.findAll('img')
print "\n".join(set(tag['src'] for tag in tags))
It produces this, which looks OK to me:
http://i1.ytimg.com/vi/D9Zg67r9q9g/market_thumb.jpg?v=723c8e
http://s.ytimg.com/yt/img/pixel-vfl3z5WfW.gif
//s.ytimg.com/yt/img/pixel-vfl3z5WfW.gif
/gen_204?a=fvhr&v=mha7pAOfqt4&nocache=1337083207.97
http://i3.ytimg.com/vi/fNs8mf2OdkU/market_thumb.jpg?v=4f85544b
http://i4.ytimg.com/vi/CkQFjyZCq4M/market_thumb.jpg?v=4f95762c
http://i3.ytimg.com/vi/fzD5gAecqdM/market_thumb.jpg?v=b0cabf
http://i3.ytimg.com/vi/2M3pb2_R2Ng/market_thumb.jpg?v=4f0d95fa
//i2.ytimg.com/vi/mha7pAOfqt4/hqdefault.jpg
I had a similar problem: I couldn't find all the images. Here is a piece of code that will give you any attribute value of an image tag.
from BeautifulSoup import BeautifulSoup as BSHTML
import urllib2
page = urllib2.urlopen('http://www.youtube.com/')
soup = BSHTML(page)
images = soup.findAll('img')
for image in images:
    # print image source
    print image['src']
    # print alternate text
    print image['alt']
Explicitly using soup.findAll(name='img') worked for me, and I don't appear to be missing anything from the page.
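def grabimagetags():
    import urllib2
    from BeautifulSoup import BeautifulSoup
    page = urllib2.urlopen('http://www.youtube.com/')
    soup = BeautifulSoup(page)
    tags = soup.findAll('img')
    # Collect the unique src values in a new list; calling extend() on the
    # built-in list type itself would raise a TypeError.
    srcs = []
    srcs.extend(set(tag['src'] for tag in tags))
    return srcs
grabimagetags()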
I would only make this change so that you can pass the list of img tags around.
In my case, some images didn't contain src,
so I did this to avoid a KeyError exception:
art_imgs = set(img['src'] for img in article.find_all('img') if img.has_attr('src'))
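A variation on the same idea, sketched with `.get()`, which returns `None` rather than raising `KeyError` when an attribute is missing. Falling back to a `data-src` attribute is purely an assumption on my part (some pages that load images lazily keep the real URL in an attribute like that), and `collect_sources` is a made-up helper name.

```python
def collect_sources(tags, attr_names=("src", "data-src")):
    """Return unique image sources in first-seen order.

    For each tag, the first attribute in attr_names that is present and
    non-empty is used; tags with none of them are skipped.
    """
    seen = set()
    sources = []
    for tag in tags:
        for name in attr_names:
            value = tag.get(name)
            if value:
                if value not in seen:
                    seen.add(value)
                    sources.append(value)
                break
    return sources


# Plain dicts stand in for parsed img tags here.
tags = [
    {"src": "a.jpg"},
    {"alt": "no source at all"},
    {"data-src": "b.jpg"},
    {"src": "a.jpg"},
]
print(collect_sources(tags))
```

BeautifulSoup tags support `.get()` as well, so something like `collect_sources(article.find_all('img'))` should work the same way; unlike a set, the result keeps the order in which the images appear.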
Try this.
from simplified_scrapy import SimplifiedDoc, req
url = 'https://www.youtube.com'
html = req.get(url)
doc = SimplifiedDoc(html)
imgs = doc.listImg(url = url)
print([img.url for img in imgs])
imgs = doc.selects('img')
for img in imgs:
    print(img)
    print(doc.absoluteUrl(url, img.src))