I'm trying to download some images from TripAdvisor using urllib, but all that I get for the URL in the src field of the HTML is this
I've done some research and found out that those are lazy-load images. Is there any way to download them?
You can extract a list of images from the JavaScript using the Beautiful Soup and json modules, then iterate over the list and retrieve the images you are interested in.
EDIT:
The problem was that the images have the same name, so they were overwritten. Fetching the first three images is trivial, but references to the other images in the carousel are not loaded until the carousel is opened, so that's trickier. For some images you can find a higher-resolution version by substituting "photo-s" in the path with "photo-w", but figuring out which ones requires diving deeper into the JavaScript logic.
import urllib, re, json
from bs4 import BeautifulSoup as bs

def img_data_filter(tag):
    # match the <script> block that defines the lazy-load image list
    if tag.name == "script" and tag.text.strip().startswith("var lazyImgs"):
        return True
    return False

response = urllib.urlopen("https://www.tripadvisor.it/Restaurant_Review-g3174493-d3164947-Reviews-Le_Ciaspole-Tret_Fondo_Province_of_Trento_Trentino_Alto_Adige.html")
soup = bs(response.read(), 'html.parser')
img_data = soup.find(img_data_filter)

# strip the JavaScript around the JSON array so it can be parsed
js = img_data.text
js = js.replace("var lazyImgs = ", '')
js = re.sub(r";\s+var lazyHtml.+", '', js, flags=re.DOTALL)
imgs = json.loads(js)

suffix = 1
for img in imgs:
    img_url = img["data"]
    if "media/photo-s" not in img_url:
        continue
    # number the files so images with identical names don't overwrite each other
    img_name = img_url[img_url.rfind('/')+1:-4]
    img_name = "%s-%03d.jpg" % (img_name, suffix)
    suffix += 1
    urllib.urlretrieve(img_url, img_name)
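As for the higher-resolution renditions mentioned in the edit, here is a minimal sketch of the "photo-s" to "photo-w" substitution, meant to replace the urlretrieve call inside the loop above. Which images actually have a "photo-w" version is an open question, so treat the fallback as speculative:

import urllib2

hires_url = img_url.replace("media/photo-s", "media/photo-w")
try:
    # urllib2 raises HTTPError on a 404, unlike urllib.urlretrieve,
    # which would quietly save the error page instead
    img_bytes = urllib2.urlopen(hires_url).read()
except urllib2.HTTPError:
    # no "photo-w" rendition for this image; fall back to the small one
    img_bytes = urllib2.urlopen(img_url).read()
with open(img_name, "wb") as f:
    f.write(img_bytes)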
I'm trying to create a program that scrapes a site for images using bs4. The site contains two types of images, low-quality ones and high-quality ones. The high-quality files are named the same as their low-quality versions, but contain the word "website" at the end, before the .png. I'd like to only download the "website" files. Here's what I tried.
from bs4 import BeautifulSoup
import requests

URL = "https://www.ssbwiki.com/Category:Head_icons_(SSBU)"
getURL = requests.get(URL, headers={"User-Agent": "Mozilla/5.0"})
soup = BeautifulSoup(getURL.text, 'html.parser')

images = soup.find_all('img')
resolvedURLs = []
for image in images:
    src = image.get('src')
    resolvedURLs.append(requests.compat.urljoin(URL, src))

for image in resolvedURLs:
    if not image.endswith("Website.png"):
        continue
        if image.endswith("Website.png"):
            webs = requests.get(image)
            open('scraped_images/' + image.split('/')[-1], 'wb').write(webs.content)
I don't get any error messages, but no files download. Any tips?
You are only checking if it ends with "Website.png" after you have already established that it doesn't. Better not to even check if it doesn't:
from bs4 import BeautifulSoup
import requests

URL = "https://www.ssbwiki.com/Category:Head_icons_(SSBU)"
getURL = requests.get(URL, headers={"User-Agent": "Mozilla/5.0"})
soup = BeautifulSoup(getURL.text, 'html.parser')

images = soup.find_all('img')
resolvedURLs = []
for image in images:
    src = image.get('src')
    resolvedURLs.append(requests.compat.urljoin(URL, src))

for image in resolvedURLs:
    if image.endswith("Website.png"):
        webs = requests.get(image)
        open('scraped_images/' + image.split('/')[-1], 'wb').write(webs.content)
Using list comprehensions, you can also make your code less procedural and prevent this sort of mistake in the future:
from bs4 import BeautifulSoup
import requests
from requests.compat import urljoin

URL = "https://www.ssbwiki.com/Category:Head_icons_(SSBU)"
getURL = requests.get(URL, headers={"User-Agent": "Mozilla/5.0"})
soup = BeautifulSoup(getURL.text, 'html.parser')

image_urls = [urljoin(URL, image.get('src')) for image in soup.find_all('img')]

# let's make this one a generator so we don't keep too many downloaded
# images in memory; keep the URL paired with each response, since the
# file name comes from the URL, not from the Response object
images = ((url, requests.get(url)) for url in image_urls if url.endswith("Website.png"))
for url, image in images:
    # use the context manager so the files are closed after write
    with open('scraped_images/' + url.split('/')[-1], 'wb') as f:
        f.write(image.content)
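One more pitfall, not raised in the original question: every version above writes into a scraped_images directory and will fail with a FileNotFoundError if it doesn't exist, so it's worth creating it before the download loop:

import os

os.makedirs('scraped_images', exist_ok=True)  # create the target folder if missing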
So I am working on a project that requires me to auto-download images from a basic web search. I've created this script which should download all the images it finds, except the script isn't working as intended. I have consulted various forums and tutorials, but none seems to have the fix.
from bs4 import BeautifulSoup
import requests

r = requests.get(url)  # url is defined elsewhere in the script
soup = BeautifulSoup(r.text, "html.parser")
image = soup.find_all("img")

for img in image:
    name = img['alt']
    link = img['src']
    with open(name.replace(" ", "-").replace("/", "") + ".jpg", "wb") as f:
        im = requests.get(link)
        f.write(im.content)
If I print the img links, it shows all the images which can be downloaded, but for some reason it downloads 1-2 images then stops. Not to mention the downloaded images are further down the list of links.
To solve this issue, you can append the links to a list, then download the images by iterating over the list of URLs:
list_of_urls = []
for img in image:
    link = img["src"]
    list_of_urls.append(link)

# enumerate gives a stable index for each URL and, unlike list.index(),
# also numbers duplicate links correctly
for index, link in enumerate(list_of_urls):
    with open(str(index) + ".jpg", "wb") as f:
        f.write(requests.get(link).content)
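If some <img> tags lack a src attribute (one plausible reason the original loop stopped early, though the question doesn't confirm it), building the list with .get() and a filter avoids the KeyError:

# keep only tags that actually carry a src attribute (hypothetical guard)
list_of_urls = [img.get("src") for img in image if img.get("src")]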
When extracting images from a website, I use the command below to get the URL:
image = soup.findAll('img')
image_link = image["src"]
But as the picture shows, there is not a complete link to save the image. Now my question is: what is this 'current source' shown in the picture, and how can I extract the link from there?
soup.findAll() returns a list of elements. Iterate over the "image" variable, then access the "src" attribute on each element.
If you need to resolve relative URLs, call requests.compat.urljoin(url, src) on the image's src value.
Try something like this:
import requests
from bs4 import BeautifulSoup

# sample base url for testing
url = 'https://en.wikipedia.org/wiki/Main_Page'
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")

for img in soup.findAll('img'):
    src = img.get("src")
    if src:
        # resolve any relative urls to absolute urls using base URL
        src = requests.compat.urljoin(url, src)
        print(">>", src)
Output:
...
>> https://en.wikipedia.org/wiki/Special:CentralAutoLogin/start?type=1x1
>> https://en.wikipedia.org/static/images/footer/wikimedia-button.png
Without resolving relative URLs in the example above, the last URL would instead be the relative path "/static/images/footer/wikimedia-button.png".
In your case you can scrape images like this:
import requests
from bs4 import BeautifulSoup

url = 'https://www.sfmta.com/getting-around/drive-park/how-avoid-parking-tickets'
soup = BeautifulSoup(requests.get(url).text, 'lxml')

for image in soup.find_all('img'):
    image_link = requests.compat.urljoin(url, image.get('src'))
    print(image_link)
OUTPUT:
https://www.sfmta.com/sites/all/themes/clients-theme/logo.png
https://www.sfmta.com/sites/default/files/imce-images/repair-40cd8d7db439deac706e161cd89ea3cc.jpg
https://www.sfmta.com/sites/default/files/imce-images/repair-a30c0bf7f9e9f1fcf5a4c6b69548c46b.jpg
https://www.sfmta.com/sites/default/files/imce-images/repair-cca5688579bf809ecb49daed5fab030a.jpg
https://www.sfmta.com/sites/default/files/imce-images/repair-8c6467ecb4673775240576524e4c5bc6.jpg
https://www.sfmta.com/sites/default/files/imce-images/repair-02709e2cecd6edde21a728562995764f.jpg
https://www.sfmta.com/sites/default/files/imce-images/repair-86826c5eeae51535f527f5a1a56a80fb.jpg
https://www.sfmta.com/sites/default/files/imce-images/repair-db285d8e0abc5e28e53f75a1a99d4a0b.jpg
https://www.sfmta.com/sites/default/files/imce-images/repair-73faffef0e5f0f36e0295e573dea1381.jpg
https://www.sfmta.com/sites/default/files/imce-images/repair-4eb40baaa405c6cb3e8379d5693c2941.jpg
https://www.sfmta.com/sites/default/files/imce-images/repair-f8fdee3388b83ec5eac01ff7c93a923e.jpg
https://www.sfmta.com/sites/all/themes/clients-theme/images/icons/application-pdf.png
https://www.sfmta.com/sites/all/themes/clients-theme/images/icons/application-pdf.png
https://www.sfmta.com/sites/all/themes/clients-theme/images/icons/application-pdf.png
https://www.sfmta.com/sites/all/themes/clients-theme/images/icons/application-pdf.png
But I recommend specifying a body class like field-body in soup, or checking whether the link contains imce-images.
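A minimal sketch of both suggestions (the field-body class name is an assumption based on this site's markup and may differ elsewhere):

# restrict the search to the article body instead of the whole page
body = soup.find(class_='field-body')  # assumes the page has such a container
for image in body.find_all('img'):
    image_link = requests.compat.urljoin(url, image.get('src'))
    # keep only the content images, not theme logos or PDF icons
    if 'imce-images' in image_link:
        print(image_link)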
I'm trying to write a program that will pull data from a URL and format it so that I can copy it into another program. I've got everything working except I can't get it to skip an item if there is no img src in the imagelink tag.
import requests, sys, webbrowser, bs4

res = requests.get('http://hzws.selco.info/prototype.php?type=new-arrivals&lib=nor&collect=Bnewnf,Bnewmys,Bnewf,Bnewsf&days=14&key=7a8adfa9aydfa999997af')
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, "lxml")

img = soup.select('imagelink')  # why won't this pull anything?!?!?!?!
link = soup.select('cataloglink')

length = min([14, len(img)])
for i in range(length):
    img1 = img[i].getText()
    link1 = link[i].getText()
    print('<div>' + link1 + img1 + '</a></div>')
Right now this prints all of the URLs regardless of whether or not there is an imagelink attached to them. I've tried numerous different things to get it to skip an item if there is no img src. Any ideas?
Looking at the BS4 docs, it looks like "lxml" is actually an HTML parser. You should replace it with "lxml-xml", since you're trying to scrape an XML page. That should work.
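A sketch of the fix, plus a guard for the empty-image case from the question. It assumes each <imagelink> lines up with a <cataloglink>, which the original code already relies on:

import bs4, requests

res = requests.get('http://hzws.selco.info/prototype.php?type=new-arrivals&lib=nor&collect=Bnewnf,Bnewmys,Bnewf,Bnewsf&days=14&key=7a8adfa9aydfa999997af')
res.raise_for_status()
# "lxml-xml" parses the feed as XML, so custom tags like <imagelink> survive
soup = bs4.BeautifulSoup(res.text, "lxml-xml")

for img, link in zip(soup.select('imagelink'), soup.select('cataloglink')):
    img1 = img.getText().strip()
    if not img1:
        continue  # skip records whose <imagelink> is empty
    print('<div>' + link.getText() + img1 + '</a></div>')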
I have a Python script that downloads the HTML and the images referenced in it so I can open the file locally.
It works fine; the only problem is that there is a certain div in which the images don't get downloaded/found by the regex. I have no idea why, though. It's not a huge problem, but I'd like to know the reason.
This is the important part of the script:
url = "http://www.somedomain.com"
urlContent = urllib2.urlopen(url).read()
#Write originalHtml to file
f = open("originalHtml",'w')
f.write(urlContent)
f.close()
# HTML image tag: some_text
imgUrls = re.findall('img .*?src="(.*?)"', urlContent)
After that I loop over the links one by one, downloading the images and replacing the links in the HTML so that each "src" points to the local path where I saved the file. The script takes care of relative links and direct links.
However, some of the images never get downloaded.
This is the HTML that doesn't get picked up:
<img src="/images/news/den-mcx80001.jpg" style="width:60px;height:36px;margin-top:12px; margin-bottom:12px; margin-left:17px; margin-right:17px;float:left; ">
This, however, does get picked up:
<img class="productimg" style="width:72px;height:74px;margin-top:15px; margin-bottom:15px; margin-left:3px; margin-right:28px " src="/images/01_prdarticledocs/ImagesSmall/jpr/jpr-prx718xlf.jpg" alt="jpr-prx718xlf">
I'm not an expert in regexes, far from it, but it does seem that it should pick up both, no?
Fixed with BeautifulSoup, as the comments suggested.
Here is a code snippet for anyone looking for a script that downloads an HTML page with its images, saves them, and relinks the images in the HTML to local relative paths.
import urllib2
from BeautifulSoup import BeautifulSoup
from os.path import basename
from urlparse import urlsplit

# get content of a url and save (not necessary) the original html
url = "http://www.someDomain.com"
urlContent = urllib2.urlopen(url).read()
page = BeautifulSoup(urlContent)
f = open("originalHtml", 'w')
f.write(urlContent)
f.close()

# find all images in the file, put them in imgUrls
imgUrls = page.findAll('img')
imagesDict = {}

# download all images
for image in imgUrls:
    try:
        # get src attribute and download the file; store the original
        # link against the local path in the dict
        imgUrl = image['src']
        imgData = urllib2.urlopen(imgUrl).read()
        fileName = basename(urlsplit(imgUrl)[2])
        location = "images/" + fileName
        imagesDict[location] = imgUrl
        print "loc=" + location
        output = open(location, 'wb')
        output.write(imgData)
        output.close()
    except:
        # not so clean solution to catch relative image links by
        # prefixing the base url ('http://somedomain.com' + '/img/image.jpg')
        try:
            imgData = urllib2.urlopen(url + imgUrl).read()
            fileName = basename(urlsplit(imgUrl)[2])
            location = "images/" + fileName
            imagesDict[location] = imgUrl
            print "loc=" + location
            output = open(location, 'wb')
            output.write(imgData)
            output.close()
        except:
            print "Double ERROR"
            print "Error " + imgUrl

# replace the old links with the new local links; str.replace avoids
# treating URL characters as regex metacharacters, unlike re.sub
for dictKey in imagesDict:
    urlContent = urlContent.replace(imagesDict[dictKey], dictKey)

# save the HTML
f = open("imagesReplaced.html", 'w')
f.write(urlContent)
f.close()
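A small caveat, not in the original snippet: the script writes into an images/ subdirectory, which has to exist before the first download. In Python 2 that is:

import os

if not os.path.isdir("images"):
    os.makedirs("images")  # create the output folder for the downloaded images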
You should not use regex to parse HTML.
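For example, the single regex line can be replaced with a parser call that only matches real <img> tags carrying a src attribute (shown here with bs4; the self-answer above uses the older BeautifulSoup 3 API):

from bs4 import BeautifulSoup

page = BeautifulSoup(urlContent, "html.parser")
imgUrls = [img["src"] for img in page.find_all("img", src=True)]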
It's really hard to debug these failures. I can't see any reason why the image tags you post should not be matched by the regex. But here are a few examples where this regex pattern will fail.
urlContent = """
single quotes <img src='/image/one.jpg' />
unexpected space <img src ="/image/two.jpg" />
not an img tag <script src="/some/javascript.js">
"""
>>> re.findall('img .*?src="(.*?)"', urlContent)
['/some/javascript.js']
Using an HTML/XML parser, as the other answer suggests, is the only sane way to solve your problem.
PS: This has already been linked in the comments, but I guess it's mandatory to include this answer every time the topic is discussed: RegEx match open tags except XHTML self-contained tags