I am using BeautifulSoup to parse all the img tags present on 'www.youtube.com'.
The code is:
import urllib2
from BeautifulSoup import BeautifulSoup
page = urllib2.urlopen('http://www.youtube.com/')
soup = BeautifulSoup(page)
tags=soup.findAll('img')
But I am not getting all the img tags, and some of the ones I do get look invalid.
The img tags I get after parsing are different from the img tags in the page source; some attributes are missing.
I need to get all the video img tags on youtube.com.
Please help.
It seems to work when I try it here:
import urllib2
from BeautifulSoup import BeautifulSoup
page = urllib2.urlopen('http://www.youtube.com/')
soup = BeautifulSoup(page)
tags=soup.findAll('img')
print "\n".join(set(tag['src'] for tag in tags))
It produces this, which looks OK to me:
http://i1.ytimg.com/vi/D9Zg67r9q9g/market_thumb.jpg?v=723c8e
http://s.ytimg.com/yt/img/pixel-vfl3z5WfW.gif
//s.ytimg.com/yt/img/pixel-vfl3z5WfW.gif
/gen_204?a=fvhr&v=mha7pAOfqt4&nocache=1337083207.97
http://i3.ytimg.com/vi/fNs8mf2OdkU/market_thumb.jpg?v=4f85544b
http://i4.ytimg.com/vi/CkQFjyZCq4M/market_thumb.jpg?v=4f95762c
http://i3.ytimg.com/vi/fzD5gAecqdM/market_thumb.jpg?v=b0cabf
http://i3.ytimg.com/vi/2M3pb2_R2Ng/market_thumb.jpg?v=4f0d95fa
//i2.ytimg.com/vi/mha7pAOfqt4/hqdefault.jpg
I had a similar problem: I couldn't find all the images. Here is a piece of code that will give you any attribute value of an img tag.
from BeautifulSoup import BeautifulSoup as BSHTML
import urllib2
page = urllib2.urlopen('http://www.youtube.com/')
soup = BSHTML(page)
images = soup.findAll('img')
for image in images:
    # print the image source
    print image['src']
    # print the alternate text
    print image['alt']
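One caveat: indexing a tag directly, as in image['alt'], raises a KeyError when the attribute is missing, which it often is. A minimal variant of the same loop (just a sketch) that uses the tag's get() method so missing attributes come back as a default instead:

for image in images:
    # get() returns None (or the given default) instead of raising KeyError
    print image.get('src')
    print image.get('alt', 'no alt text')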
Explicitly using soup.findAll(name='img') worked for me, and I don't appear to be missing anything from the page.
def grabimagetags():
    import urllib2
    from BeautifulSoup import BeautifulSoup

    page = urllib2.urlopen('http://www.youtube.com/')
    soup = BeautifulSoup(page)
    tags = soup.findAll('img')
    # collect the unique src values into a list and return it
    srcs = []
    srcs.extend(set(tag['src'] for tag in tags))
    return srcs

grabimagetags()
I would only make this change so that you can pass around the list of img src values.
In my case, some images didn't contain a src attribute,
so I did this to avoid a KeyError exception:
art_imgs = set(img['src'] for img in article.find_all('img') if img.has_attr('src'))
Try this.
from simplified_scrapy import SimplifiedDoc, req
url = 'https://www.youtube.com'
html = req.get(url)
doc = SimplifiedDoc(html)
imgs = doc.listImg(url = url)
print([img.url for img in imgs])
imgs = doc.selects('img')
for img in imgs:
    print (img)
    print (doc.absoluteUrl(url, img.src))
When extracting an image from a website, I use the command below to get the URL:
image = soup.findAll('img')
image_link = image["src"]
But as the picture shows, there is not a complete link to save the image. My question is: what is the 'current source' shown in the picture, and how can I extract the link from there?
soup.findAll() returns a list of elements, so iterate over the "image" variable and access the "src" attribute on each element.
If you need to resolve relative URLs, call requests.compat.urljoin(url, src) on each image's src value.
Try something like this:
import requests
from bs4 import BeautifulSoup
# sample base url for testing
url = 'https://en.wikipedia.org/wiki/Main_Page'
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")
for img in soup.findAll('img'):
    src = img.get("src")
    if src:
        # resolve any relative URLs to absolute URLs using the base URL
        src = requests.compat.urljoin(url, src)
        print(">>", src)
Output:
...
>> https://en.wikipedia.org/wiki/Special:CentralAutoLogin/start?type=1x1
>> https://en.wikipedia.org/static/images/footer/wikimedia-button.png
Without resolving relative URLs in the example above, the second URL would instead be "/static/images/footer/wikimedia-button.png".
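For illustration, requests.compat.urljoin (a re-export of the standard library's urljoin) resolves that host-relative path against the base URL like this:

>>> import requests
>>> requests.compat.urljoin('https://en.wikipedia.org/wiki/Main_Page', '/static/images/footer/wikimedia-button.png')
'https://en.wikipedia.org/static/images/footer/wikimedia-button.png'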
In your case you can scrape the images like this:
import requests
from bs4 import BeautifulSoup
url = 'https://www.sfmta.com/getting-around/drive-park/how-avoid-parking-tickets'
soup = BeautifulSoup(requests.get(url).text, 'lxml')
for image in soup.find_all('img'):
    image_link = requests.compat.urljoin(url, image.get('src'))
    print(image_link)
OUTPUT:
https://www.sfmta.com/sites/all/themes/clients-theme/logo.png
https://www.sfmta.com/sites/default/files/imce-images/repair-40cd8d7db439deac706e161cd89ea3cc.jpg
https://www.sfmta.com/sites/default/files/imce-images/repair-a30c0bf7f9e9f1fcf5a4c6b69548c46b.jpg
https://www.sfmta.com/sites/default/files/imce-images/repair-cca5688579bf809ecb49daed5fab030a.jpg
https://www.sfmta.com/sites/default/files/imce-images/repair-8c6467ecb4673775240576524e4c5bc6.jpg
https://www.sfmta.com/sites/default/files/imce-images/repair-02709e2cecd6edde21a728562995764f.jpg
https://www.sfmta.com/sites/default/files/imce-images/repair-86826c5eeae51535f527f5a1a56a80fb.jpg
https://www.sfmta.com/sites/default/files/imce-images/repair-db285d8e0abc5e28e53f75a1a99d4a0b.jpg
https://www.sfmta.com/sites/default/files/imce-images/repair-73faffef0e5f0f36e0295e573dea1381.jpg
https://www.sfmta.com/sites/default/files/imce-images/repair-4eb40baaa405c6cb3e8379d5693c2941.jpg
https://www.sfmta.com/sites/default/files/imce-images/repair-f8fdee3388b83ec5eac01ff7c93a923e.jpg
https://www.sfmta.com/sites/all/themes/clients-theme/images/icons/application-pdf.png
https://www.sfmta.com/sites/all/themes/clients-theme/images/icons/application-pdf.png
https://www.sfmta.com/sites/all/themes/clients-theme/images/icons/application-pdf.png
https://www.sfmta.com/sites/all/themes/clients-theme/images/icons/application-pdf.png
But I recommend narrowing the soup to a body class like field-body, or checking whether the link contains imce-images.
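A rough sketch of both suggestions, reusing the soup and url from above; the field-body class and the imce-images substring are taken from this answer and may not match the page's current markup:

for image in soup.select('.field-body img'):
    src = image.get('src')
    # keep only the uploaded content images
    if src and 'imce-images' in src:
        print(requests.compat.urljoin(url, src))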
I'm trying to download images from this page. I wrote the following Python script:
import requests
import subprocess
from bs4 import BeautifulSoup
request = requests.get("http://ottofrello.dk/malerierstor.htm")
content = request.content
soup = BeautifulSoup(content, "html.parser")
element = soup.find_all("img")
for img in element:
    print (img.get('src'))
However, I get only the image names and not the complete path. On the site, I can hover over the image name when I inspect the html and the link appears. Is there any way I can parse this link using BeautifulSoup?
The image URIs on your page are relative to the hostname.
You can build an absolute URL for each image using the urljoin function from the urllib.parse module.
from urllib.parse import urljoin
page_url = "http://ottofrello.dk/malerierstor.htm"
request = requests.get(page_url)
...
for img in element:
    image_url = urljoin(
        page_url,
        img.get('src')
    )
    print(image_url)
From what I understand, you are interested in the absolute image path and not the relative path you are getting right now. The only change I made is to your print statement.
import requests
import subprocess
from bs4 import BeautifulSoup
request = requests.get("http://ottofrello.dk/malerierstor.htm")
content = request.content
soup = BeautifulSoup(content, "html.parser")
element = soup.find_all("img")
for img in element:
    print ('http://ottofrello.dk/' + img.get('src'))
I'm trying to get all the image src/hyperlinks from a webpage.
import requests
from bs4 import BeautifulSoup
image_list = []
r = requests.get('https://example.com')
soup = BeautifulSoup(r.content)
for link in soup.find_all('img'):
    image_list.append(link)
Find the attributes of an HTML tag using the get function. Pass the name of the attribute you want to extract from the HTML tag to get:
for link in soup.find_all('img'):
    image_list.append(link.get('src'))
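get() returns None for img tags that have no src attribute, so if you want only usable links you can skip those entries (a small variation, not part of the original answer):

for link in soup.find_all('img'):
    src = link.get('src')
    if src:  # skip <img> tags without a src attribute
        image_list.append(src)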
I am trying to fetch the important images, not thumbnails or other GIFs, from the Wikipedia page using the following code. However, "imgs" is coming back with a length of 0. Any suggestion on how to rectify it?
Code:
import urllib
import urllib2
from bs4 import BeautifulSoup
import os
html = urllib2.urlopen("http://en.wikipedia.org/wiki/Main_Page")
soup = BeautifulSoup(html)
imgs = soup.findAll("div",{"class":"image"})
Also, if someone can explain in detail how to use findAll by looking at the "source element" in the webpage, that will be awesome.
It is the a tags on the page that have the image class, not the div tags:
>>> img_links = soup.findAll("a", {"class":"image"})
>>> for img_link in img_links:
... print img_link.img['src']
...
//upload.wikimedia.org/wikipedia/commons/thumb/1/1f/Stora_Kronan.jpeg/100px-Stora_Kronan.jpeg
//upload.wikimedia.org/wikipedia/commons/thumb/4/4b/Christuss%C3%A4ule_8.jpg/77px-Christuss%C3%A4ule_8.jpg
...
Or, even better, use the a.image > img CSS selector:
>>> for img in soup.select('a.image > img'):
... print img['src']
//upload.wikimedia.org/wikipedia/commons/thumb/1/1f/Stora_Kronan.jpeg/100px-Stora_Kronan.jpeg
//upload.wikimedia.org/wikipedia/commons/thumb/4/4b/Christuss%C3%A4ule_8.jpg/77px-Christuss%C3%A4ule_8.jpg
...
Update (downloading images using urllib.urlretrieve):
from urllib import urlretrieve
import urlparse
from bs4 import BeautifulSoup
import urllib2
url = "http://en.wikipedia.org/wiki/Main_Page"
soup = BeautifulSoup(urllib2.urlopen(url))
for img in soup.select('a.image > img'):
    img_url = urlparse.urljoin(url, img['src'])
    file_name = img['src'].split('/')[-1]
    urlretrieve(img_url, file_name)
I don't see any div tags with a class called 'image' on that page.
You could get all the image tags and throw away the ones that are small:
imgs = soup.select('img')
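A rough sketch of the "throw away the small ones" idea, filtering on the width and height attributes where they are declared; the 100-pixel threshold and the _dim helper are arbitrary choices for illustration, and images without declared dimensions are simply skipped:

def _dim(value):
    # treat missing or non-numeric width/height attributes as 0
    try:
        return int(value)
    except (TypeError, ValueError):
        return 0

large_imgs = [img for img in soup.select('img')
              if _dim(img.get('width')) >= 100 and _dim(img.get('height')) >= 100]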
I am looking to grab the full-size product images from here.
My thinking was:
Follow the image link
Download the picture
Go back
Repeat for n+1 pictures
I know how to open the image thumbnails but not how to get the full size images. Any ideas on how this could be done?
This will get you all the image URLs:
import urllib2
from bs4 import BeautifulSoup
url = "http://icecat.biz/p/toshiba/pscbxe-01t00een/satellite-pro-notebooks-4051528049077-Satellite+Pro+C8501GR-17732197.html"
html = urllib2.urlopen(url)
soup = BeautifulSoup(html)
imgs = soup.findAll("div", {"class":"thumb-pic"})
for img in imgs:
    print img.a['href'].split("imgurl=")[1]
Output:
http://www.toshiba.fr/contents/fr_FR/SERIES_DESCRIPTION/images/g1_satellite-pro-c850.jpg
http://www.toshiba.fr/contents/fr_FR/SERIES_DESCRIPTION/images/g4_satellite-pro-c850.jpg
http://www.toshiba.fr/contents/fr_FR/SERIES_DESCRIPTION/images/g2_satellite-pro-c850.jpg
http://www.toshiba.fr/contents/fr_FR/SERIES_DESCRIPTION/images/g5_satellite-pro-c850.jpg
http://www.toshiba.fr/contents/fr_FR/SERIES_DESCRIPTION/images/g3_satellite-pro-c850.jpg
And this code is for downloading and saving those images:
import os
import urllib
import urllib2
from bs4 import BeautifulSoup
url = "http://icecat.biz/p/toshiba/pscbxe-01t00een/satellite-pro-notebooks-4051528049077-Satellite+Pro+C8501GR-17732197.html"
html = urllib2.urlopen(url)
soup = BeautifulSoup(html)
imgs = soup.findAll("div", {"class":"thumb-pic"})
for img in imgs:
    imgUrl = img.a['href'].split("imgurl=")[1]
    urllib.urlretrieve(imgUrl, os.path.basename(imgUrl))
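urllib2 and urllib.urlretrieve only exist on Python 2. A rough Python 3 equivalent of the download-and-save step using requests (assuming the page still has the same thumb-pic / imgurl= structure) would be:

import os
import requests
from bs4 import BeautifulSoup

url = "http://icecat.biz/p/toshiba/pscbxe-01t00een/satellite-pro-notebooks-4051528049077-Satellite+Pro+C8501GR-17732197.html"
soup = BeautifulSoup(requests.get(url).text, "html.parser")

for div in soup.find_all("div", {"class": "thumb-pic"}):
    img_url = div.a['href'].split("imgurl=")[1]
    # save the image under its original file name
    with open(os.path.basename(img_url), "wb") as f:
        f.write(requests.get(img_url).content)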