BeautifulSoup scraper: downloaded images are corrupt - Python

I really need help with my code. I was attempting an exercise from a book and followed it exactly. The code ran and downloaded the images. However, all of the downloaded images were corrupted. I have no idea what's causing it or what I missed.
Thanks.
#! python3
# downloadXkcd.py - Downloads every single XKCD comic.

import requests, os, bs4

url = 'http://xkcd.com'
os.makedirs('xkcd', exist_ok=True)
while not url.endswith('#'):
    # Download the page.
    print('Downloading page %s...' % url)
    res = requests.get(url)
    res.raise_for_status()
    soup = bs4.BeautifulSoup(res.text, 'html.parser')

    # Find the URL of the comic image.
    comicElem = soup.select('#comic img')
    if comicElem == []:
        print('Could not find comic image')
    else:
        comicUrl = comicElem[0].get('src')
        # Download the image.
        print('Downloading image %s' % (comicUrl))
        res.raise_for_status()

        # Save the image to ./xkcd.
        imagefile = open(os.path.join('xkcd', os.path.basename(comicUrl)), 'wb')
        for chunk in res.iter_content(100000):
            imagefile.write(chunk)
        imagefile.close()

    # Get the prev button's url.
    prevlink = soup.select('a[rel="prev"]')[0]
    url = 'http://xkcd.com' + prevlink.get('href')

print('Done')

You are writing the wrong data to the file:
for chunk in res.iter_content(100000)
At that point, res still holds the data of the web page. You should be fetching the data of the image at comicUrl instead. I think you forgot this line:
print('Downloading image %s' %(comicUrl))
res = requests.get('http:' + comicUrl)
Note: I added http: before the URL because the image URLs you are extracting lack a scheme. You should define a function that checks whether it is necessary to add it.
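As a rough sketch of such a check (the helper name ensure_scheme and the use of urllib.parse are my own, not part of the original answer):

import urllib.parse

def ensure_scheme(url, default_scheme='http'):
    # Prepend a scheme only when the URL lacks one,
    # e.g. '//imgs.xkcd.com/comics/foo.png' -> 'http://imgs.xkcd.com/comics/foo.png'
    if urllib.parse.urlparse(url).scheme:
        return url
    return '{}:{}'.format(default_scheme, url if url.startswith('//') else '//' + url)

res = requests.get(ensure_scheme(comicUrl))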

Related

Web Scraping Images: Cannot find 'rel' selector

I'm following the tutorial for Automate the Boring Stuff's web-scraping section, and want to scrape the images from https://swordscomic.com/ .
The script should 1) download and parse the HTML, 2) download the comic image, 3) click the "previous comic" button, and 4) repeat steps 1-3.
The script is able to download the first comic, but gets stuck either on hitting the "previous comic" button, or downloading the next comic image.
Possible issues for this may be:
Al's tutorial says to find the "rel" selector, but I am unable to find it. I believe this site uses a slightly different format than the site Al's tutorial scrapes. I believe I'm using the correct selector, but the script still crashes.
It may also be that this site's landing page contains a comic image, while each "previous" comic lives under an additional path segment (in the form of /CCCLXVIII/ or something similar).
I have tried:
adding the comic's edition number to the initial page URL, but that only causes the script to crash earlier.
pointing the "previous button" part of the script at a different selector in the element, but that still gives me the "Index out of range" error.
Here is the script as I have it:
#! python3
# swordscraper.py - Downloads all the Swords comics.

import requests, os, bs4

os.chdir(r'C:\Users\bromp\OneDrive\Desktop\Python')
os.makedirs('swords', exist_ok=True)  # store comics in ./swords

url = 'https://swordscomic.com/'  # starting url
while not url.endswith('#'):
    # Download the page.
    print('Downloading page %s...' % url)
    res = requests.get(url)
    res.raise_for_status
    soup = bs4.BeautifulSoup(res.text, 'html.parser')

    # Find the URL of the comic image.
    comicElem = soup.select('#comic-image')
    if comicElem == []:
        print('Could not find comic image.')
    else:
        comicUrl = comicElem[0].get('src')
        comicUrl = "http://" + comicUrl
        if 'swords' not in comicUrl:
            comicUrl = comicUrl[:7] + 'swordscomic.com/' + comicUrl[7:]

        # Download the image.
        print('Downloading image %s...' % (comicUrl))
        res = requests.get(comicUrl)
        res.raise_for_status()

        # Save the image to ./swords.
        imageFile = open(os.path.join('swords', os.path.basename(comicUrl)), 'wb')
        for chunk in res.iter_content(100000):
            imageFile.write(chunk)
        imageFile.close()

    # Get the Prev button's url.
    prevLink = soup.select('a[id=navigation-previous]')[0]
    url = 'https://swordscomic.com/' + prevLink.get('href')

print('Done')
Here is the output of the script and the error message it gives:
Downloading page https://swordscomic.com/...
Downloading image http://swordscomic.com//media/Swords363bt.png...
Downloading page https://swordscomic.com//comic/CCCLXII/...
Could not find comic image.
Traceback (most recent call last):
File "C:\...\", line 39, in <module>
prevLink = soup.select('a[id=navigation-previous]')[0]
IndexError: list index out of range
The page is rendered with JavaScript. In particular, the link you extract has an onclick() event which presumably loads the next page, and the page also uses XHR. So your only option is to use a technology that renders JavaScript; try Selenium or requests-html (https://github.com/psf/requests-html).
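For illustration, here is a minimal Selenium sketch, assuming Chrome and a matching chromedriver are available; the comic-image and navigation-previous ids come from the question's selectors, and the rest is an untested outline rather than a drop-in solution:

import os, requests
from selenium import webdriver
from selenium.webdriver.common.by import By

os.makedirs('swords', exist_ok=True)
driver = webdriver.Chrome()
driver.get('https://swordscomic.com/')

# The browser executes the page's JavaScript, so we read the rendered DOM.
comic = driver.find_element(By.ID, 'comic-image')
comicUrl = comic.get_attribute('src')

res = requests.get(comicUrl)
res.raise_for_status()
with open(os.path.join('swords', os.path.basename(comicUrl)), 'wb') as f:
    f.write(res.content)

# The Prev button can be clicked directly instead of reconstructing its href.
driver.find_element(By.ID, 'navigation-previous').click()
driver.quit()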

How to check whether an (https://) link is an image or a web link

How do I check whether a hyperlink is an image link or a web page link?
image_list = []
url = 'http://www.image.jpg/'
if any(x in '.jpg .gif .png .jpeg' for x in url):
    image_list.append(url)
else:
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html5lib")
    for link in soup.find_all('img'):
        src = link.get('src')
        if src.startswith("https"):
            image_list.append(src)
The code above succeeds when the hyperlink contains an image extension; however, whenever I use a link that does not contain ".jpg", ".gif", etc., it still appends the link to image_list and skips the else branch.
Let's look at this code:
any(x in '.jpg .gif .png .jpeg' for x in url):
This checks whether any single character of the URL appears in that string. The 'p' from 'http' is in the string, so you will always get a true result.
Here's how you could check the extension of a URL:
import posixpath
import urllib.parse

IMAGE_EXTS = {'.png', '.jpg', '.jpeg', '.gif'}

url = 'http://example.com/'
if posixpath.splitext(urllib.parse.urlparse(url).path)[1] in IMAGE_EXTS:
    # Has image extension...
But that's a moot point, because the extension of a URL doesn't tell you whether it's an image. Unlike regular files, for URLs, the extension is completely irrelevant! You can have an .html URL which gives you a PNG image, or a .gif URL which is really an HTML web page. You need to check the Content-Type of the HTTP reply.
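As a minimal sketch of that check, assuming requests is acceptable here (a HEAD request inspects the headers without downloading the body):

import requests

def is_image_url(url):
    # Ask the server for the headers only and inspect the declared Content-Type.
    resp = requests.head(url, allow_redirects=True)
    return resp.headers.get('Content-Type', '').startswith('image/')

if is_image_url(url):
    image_list.append(url)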

Extract images from HTML file using python standard libraries

I'm trying to write a script that parses an HTML file, finds all the images, and saves those images into another folder. How would one accomplish this using only the libraries that come with Python 3? I currently have this script that I would like to build on.
date = datetime.date.today()
backup_path = os.path.join(str(date), language)
if not os.path.exists(backup_path):
    os.makedirs(backup_path)

log = []
endpoint = zendesk + '/api/v2/help_center/en-us/articles.json'
while endpoint:
    response = requests.get(endpoint, auth=credentials)
    if response.status_code != 200:
        print('Failed to retrieve articles with error {}'.format(response.status_code))
        exit()
    data = response.json()
    for article in data['articles']:
        if article['body'] is None:
            continue
        title = '<h1>' + article['title'] + '</h1>'
        filename = '{id}.html'.format(id=article['id'])
        with open(os.path.join(backup_path, filename), mode='w', encoding='utf-8') as f:
            f.write(title + '\n' + article['body'])
        print('{id} copied!'.format(id=article['id']))
        log.append((filename, article['title'], article['author_id']))
    endpoint = data['next_page']
This is a script I found on a Zendesk forum that backs up our Zendesk articles.
Try using Beautiful Soup to retrieve all the img nodes and, for each node, use urllib to get the picture.
from bs4 import BeautifulSoup
import urllib.request

# note: use response.text here to get the raw html
soup = BeautifulSoup(response.text, 'html.parser')
# get the src of all images
img_source = [img.get('src') for img in soup.find_all('img')]
# get the images (urlretrieve saves each one to a local file)
images = [urllib.request.urlretrieve(src) for src in img_source]
And you probably need to add some error handling and change it a bit to fit your page, but the idea remains the same.
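Since the question asks for standard-library-only tools, here is a minimal sketch using html.parser and urllib instead; html_text and the base URL are placeholders for whatever page you are actually processing:

from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlretrieve

class ImgCollector(HTMLParser):
    # Collects the src attribute of every <img> tag it encounters.
    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == 'img':
            src = dict(attrs).get('src')
            if src:
                self.sources.append(src)

parser = ImgCollector()
parser.feed(html_text)  # html_text: the HTML to scan (placeholder)
for src in parser.sources:
    # Resolve relative paths against the page's base URL (placeholder) and download.
    urlretrieve(urljoin('https://example.com/', src))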

My scraper throws error instead of downloading images

I've made a scraper to download images from a site. However, when I run it, it throws an error: [raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403]. I've used this method on other sites to scrape images and faced no issues. I can't figure out why this error shows up or what the workaround is. I hope someone will look into it.
import requests
import urllib.request
from lxml import html

def PictureScraping():
    url = "https://www.yify-torrent.org/search/1080p/"
    response = requests.get(url)
    tree = html.fromstring(response.text)
    titles = tree.xpath('//div[@class="movie-image"]')
    for title in titles:
        Pics = "https:" + title.xpath('.//img/@src')[0]
        urllib.request.urlretrieve(Pics, Pics.split('/')[-1])

PictureScraping()
You need to download images using the same web-scraping session you've used to get the initial page. Working code:
import requests
from lxml import html

def PictureScraping():
    url = "https://www.yify-torrent.org/search/1080p/"
    with requests.Session() as session:
        response = session.get(url)
        tree = html.fromstring(response.text)
        titles = tree.xpath('//div[@class="movie-image"]')
        for title in titles:
            image_url = title.xpath('.//img/@src')[0]
            image_name = image_url.split('/')[-1]
            print(image_name)
            image_url = "https:" + image_url
            # download image
            response = session.get(image_url, stream=True)
            if response.status_code == 200:
                with open(image_name, 'wb') as f:
                    for chunk in response.iter_content(1024):
                        f.write(chunk)

PictureScraping()

requests.exceptions.MissingSchema: Invalid URL (with bs4)

I am getting this error:
requests.exceptions.MissingSchema: Invalid URL 'http:/1525/bg.png': No schema supplied. Perhaps you meant http://http:/1525/bg.png?
I don't really care why the error happened; I want to be able to catch any invalid-URL errors, print a message, and proceed with the rest of the code.
Below is my code, where I'm trying to use try/except for that specific error, but it's not working...
# load xkcd page
# save comic image on that page
# follow <previous> comic link
# repeat until last comic is reached

import webbrowser, bs4, os, requests

url = 'http://xkcd.com/1526/'
os.makedirs('xkcd', exist_ok=True)
while not url.endswith('#'):  # - last page
    # download the page
    print('Dowloading page %s...' % (url))
    res = requests.get(url)
    res.raise_for_status()
    soup = bs4.BeautifulSoup(res.text, "html.parser")

    # find url of the comic image (<div id="comic"><img src="........"></div>)
    comicElem = soup.select('#comic img')
    if comicElem == []:
        print('Could not find any images')
    else:
        comicUrl = 'http:' + comicElem[0].get('src')
        # download the image
        print('Downloading image... %s' % (comicUrl))
        res = requests.get(comicUrl)
        try:
            res.raise_for_status()
        except requests.exceptions.MissingSchema as err:
            print(err)
            continue

        # save image to folder
        imageFile = open(os.path.join('xkcd', os.path.basename(comicUrl)), 'wb')
        for chunk in res.iter_content(1000000):
            imageFile.write(chunk)
        imageFile.close()

    # get <previous> button url
    prevLink = soup.select('a[rel="prev"]')[0]
    url = 'http://xkcd.com' + prevLink.get('href')

print('Done')
What am I not doing? (I'm on Python 3.5)
Thanks a lot in advance...
If you don't care about the error (which I see as bad programming), just use a bare except statement that catches all exceptions:
# download the image
print('Downloading image... %s' % (comicUrl))
try:
    res = requests.get(comicUrl)  # moved inside the try block
    res.raise_for_status()
except:
    continue
On the other hand, if your except block isn't catching the exception, that's because the exception actually happens outside your try block. Move requests.get into the try block and the exception handling should work (if you still need it).
Try this if the issue occurs because a wrong URL is used.
Solution:
import requests

correct_url = False
url = 'Ankit Gandhi'  # 'https://gmail.com'
try:
    res = requests.get(url)
    correct_url = True
except:
    print("Please enter a valid URL")

if correct_url:
    """
    Do your operation
    """
    print("Correct URL")
Hope this helps.
The reason your try/except block isn't catching the exception is that the error happens at the line

res = requests.get(comicUrl)

which sits above the try keyword. Keeping your code as it is and just moving the try block up one line will fix it.
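For reference, a minimal sketch of that rearrangement, keeping the specific MissingSchema handler; breaking out of the loop on failure is my own choice, since continuing without updating url would retry the same page:

# download the image
print('Downloading image... %s' % (comicUrl))
try:
    res = requests.get(comicUrl)  # now inside the try block
    res.raise_for_status()
except requests.exceptions.MissingSchema as err:
    print(err)
    break  # stop instead of retrying the same URL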
