requests.exceptions.MissingSchema: Invalid URL (with bs4) - python

I am getting this error:
requests.exceptions.MissingSchema: Invalid URL 'http:/1525/bg.png': No schema supplied. Perhaps you meant http://http:/1525/bg.png?
I don't really care why the error happened, I want to be able to capture any Invalid URL errors, issue a message and proceed with the rest of the code.
Below is my code, where I'm trying to use try/except for that specific error, but it's not working...
# load xkcd page
# save comic image on that page
# follow <previous> comic link
# repeat until last comic is reached
import webbrowser, bs4, os, requests
url = 'http://xkcd.com/1526/'
os.makedirs('xkcd', exist_ok=True)
while not url.endswith('#'):  # '#' marks the last page
    # download the page
    print('Downloading page %s...' % (url))
    res = requests.get(url)
    res.raise_for_status()
    soup = bs4.BeautifulSoup(res.text, "html.parser")

    # find url of the comic image (<div id="comic"><img src="..."></div>)
    comicElem = soup.select('#comic img')
    if comicElem == []:
        print('Could not find any images')
    else:
        comicUrl = 'http:' + comicElem[0].get('src')
        # download the image
        print('Downloading image... %s' % (comicUrl))
        res = requests.get(comicUrl)
        try:
            res.raise_for_status()
        except requests.exceptions.MissingSchema as err:
            print(err)
            continue

        # save image to folder
        imageFile = open(os.path.join('xkcd', os.path.basename(comicUrl)), 'wb')
        for chunk in res.iter_content(1000000):
            imageFile.write(chunk)
        imageFile.close()

    # get <previous> button url
    prevLink = soup.select('a[rel="prev"]')[0]
    url = 'http://xkcd.com' + prevLink.get('href')

print('Done')
What am I not doing? (I'm on Python 3.5)
Thanks a lot in advance...

If you don't care about the error (which I see as bad programming), just use a blank except statement that catches all exceptions.
# download the image
print('Downloading image... %s' % (comicUrl))
try:
    res = requests.get(comicUrl)  # moved inside the try block
    res.raise_for_status()
except:
    continue
On the other hand, if your except block isn't catching the exception, it's because the exception actually happens outside your try block, so move requests.get into the try block and the exception handling should work (that's if you still need it).
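If you would rather keep catching specific exceptions instead of a blank except, a minimal sketch of the same rearrangement (it assumes only requests' own exception classes; RequestException is the common base of both MissingSchema and the HTTPError that raise_for_status() raises):
# download the image
print('Downloading image... %s' % (comicUrl))
try:
    res = requests.get(comicUrl)   # the MissingSchema error is raised here
    res.raise_for_status()         # HTTP errors (4xx/5xx) are raised here
except requests.exceptions.RequestException as err:
    print('Skipping %s: %s' % (comicUrl, err))
    continue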

Try this if you run into this type of issue from using a wrong URL.
Solution:
import requests

correct_url = False
url = 'Ankit Gandhi'  # 'https://gmail.com'
try:
    res = requests.get(url)
    correct_url = True
except:
    print("Please enter a valid URL")

if correct_url:
    """
    Do your operation
    """
    print("Correct URL")
Hope this is helpful.

The reason your try/except block isn't catching the exception is that the error happens at the line
res = requests.get(comicUrl)
which is above the try keyword.
Keeping your code as is, just moving the try block up one line will fix it.
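For example, a sketch of that fix using the names from the question's code (an else clause replaces the continue here, because a bare continue would skip the prev-link lookup at the bottom of the loop and re-fetch the same page):
# download the image
print('Downloading image... %s' % (comicUrl))
try:
    res = requests.get(comicUrl)  # now inside the try block
    res.raise_for_status()
except requests.exceptions.MissingSchema as err:
    print(err)
else:
    # save image to folder (only runs when the download succeeded)
    imageFile = open(os.path.join('xkcd', os.path.basename(comicUrl)), 'wb')
    for chunk in res.iter_content(1000000):
        imageFile.write(chunk)
    imageFile.close()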

Related

Trouble web scraping streaming sites

I'm trying to make a web scraper using Python and I'm facing one issue. Every streaming site I try to inspect doesn't let me inspect its HTML code when I'm on the episodes page. It sends me back to the home page whenever I open the element inspector. Any help?
I'm new to web scraping, so I don't know another way of doing this.
import requests
from bs4 import BeautifulSoup

# THIS PROJECT IS JUST TO SCRAPE ZORO.TO AND DOWNLOAD ANIME USING IT

# Make a request and return the HTML content of a webpage
def getWebpage(url):
    r = requests.get(url)
    return BeautifulSoup(r.content, 'html5lib')

# Search the title and get its page URL
def getWatchUrl(titleName):
    keyword = ""
    # Transform the anime title into a URL as on the website
    for word in titleName.split(sep=" "):
        keyword += '+' + word
    keyword = keyword[1:]
    SearchURL = f'https://zoro.to/search?keyword={keyword}'

    # Get the HTML contents of the website
    try:
        soup = getWebpage(SearchURL)
    except Exception as e:
        print(f"Unexpected Error {e}. Please try again!")
        return None

    # Separate the useful content (anime titles and links)
    try:
        titles = soup.findAll('div', {'class': 'film-poster'}, limit=None)
    except Exception as e:
        print(e)
        print("Couldn't find title. Check spellings and please try again!")
        return None

    # Search the content for the anime title and extract its link
    for title in titles:
        for content in title:
            try:
                if titleName in content.get('title').lower():
                    path = content.get('href').replace('?ref=search', '')
                    watchURL = f'https://zoro.to/watch{path}'
                    return watchURL
            except:
                continue

    print("Couldn't find title. Check spellings and please try again!")
    return None

# Get the direct download links from the web page
def getDownloadUrl(watchUrl):
    try:
        soup = getWebpage(watchUrl)
        print(soup.prettify())
    except Exception as e:
        print(f"Unexpected Error {e}. Please try again!")
        return None

def main():
    animeName = input("Enter Anime Name: ").lower()
    watchURL = getWatchUrl(animeName)
    if watchURL is not None:
        getDownloadUrl(watchURL)

if __name__ == "__main__":
    main()

python get url from request

I get data from an API in Django.
The data comes from an order form on another website.
The data also includes a URL, for example example.com, but I can't validate the input because I don't have access to the order form.
The URL that I get can also take different forms. More examples:
example.de
http://example.de
www.example.com
https://example.de
http://www.example.de
https://www.example.de
Now I would like to open the URL to get the correct URL.
For example, if I open example.com in my browser, I get the correct URL http://example.com/, and that is what I want for all URLs.
How can I do that in Python, fast?
If you get status_code 200, you know that you have a valid address.
Regarding https://: you will get an SSL error if you don't follow the answers in this guide. Once you have that in place, the program will find the correct URL for you.
import requests
import traceback

validProtocols = ["https://www.", "http://www.", "https://", "http://"]

def removeAnyProtocol(url):
    url = url.replace("www.", "")  # remove any inputs containing just www since we aren't planning on using them regardless
    for protocol in validProtocols:
        url = url.replace(protocol, "")
    return url

def validateUrl(url):
    for protocol in validProtocols:
        if protocol not in url:
            pUrl = protocol + removeAnyProtocol(url)
            try:
                req = requests.head(pUrl, allow_redirects=True)
                if req.status_code == 200:
                    return pUrl
                else:
                    continue
            except Exception:
                print(traceback.format_exc())
                continue
        else:
            try:
                req = requests.head(url, allow_redirects=True)
                if req.status_code == 200:
                    return url
            except Exception:
                print(traceback.format_exc())
                continue
Usage:
correctUrl = validateUrl("google.com") # https://www.google.com
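If all you really need is the final, canonical address (as in the question's example.com → http://example.com/ case), a shorter sketch is possible: let requests follow redirects and read the resolved address from response.url. The resolveUrl name and the timeout value here are just illustrative choices, not part of the answer above:
import requests

def resolveUrl(url):
    # try the address as given, or with http:// / https:// prepended if it has no scheme,
    # and return whatever URL the server finally redirects us to
    candidates = [url] if '://' in url else ['http://' + url, 'https://' + url]
    for candidate in candidates:
        try:
            resp = requests.head(candidate, allow_redirects=True, timeout=5)
            if resp.status_code == 200:
                return resp.url  # the address after all redirects
        except requests.exceptions.RequestException:
            continue
    return None

print(resolveUrl('example.com'))  # e.g. http://example.com/ (depends on the server)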

how to use requests raise error in python

I am using Python to fetch content from some URLs. So I have a list of URLs, and all are fine except one of them, where I get a 404. I wanted to fetch this like:
for url in urls:
    r = requests.get(url)
    try:
        r.raise_for_status()
    except RuntimeError:
        print('error: could not get content from url because of {}'.format(r.status_code))
But now, the exception raised by raise_for_status() is not caught but just printed out. How can I print my own error code if it's raised?
You need to modify your try/except block:
try:
    r = requests.get(url)
    r.raise_for_status()
except requests.exceptions.HTTPError as error:
    print(error)
You could create your own exception class and just raise that:
class MyException(Exception):
    pass

...

for url in urls:
    r = requests.get(url)
    try:
        r.raise_for_status()
    except requests.exceptions.HTTPError as error:
        raise MyException('error: could not get content from url because of {}'.format(r.status_code))
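If the goal is simply to print your own message and carry on with the remaining URLs, the two ideas above combine into something like this sketch (the urls list below is only a placeholder):
import requests

urls = ['https://httpbin.org/status/200', 'https://httpbin.org/status/404']  # placeholder list

for url in urls:
    r = requests.get(url)
    try:
        r.raise_for_status()
    except requests.exceptions.HTTPError:
        # print our own message instead of the library's, then move on
        print('error: could not get content from {} because of {}'.format(url, r.status_code))
        continue
    print('fetched {} ({} bytes)'.format(url, len(r.content)))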

Python Http, getting response code from server

I want to get the response code from a web server, but sometimes I get code 200 even if the page doesn't exist, and I don't know how to deal with it.
I'm using this code:
def checking_url(link):
    try:
        link = urllib.request.urlopen(link)
        response = link.code
    except urllib.error.HTTPError as e:
        response = e.code
    return response
When I'm checking a website like this one:
https://www.wykop.pl/notexistlinkkk/
It still returns code 200 even if the page doesn't exist.
Is there any solution to deal with it?
I found a solution, now gonna test it with more websites.
I had to use http.client.
You are getting response code 200 because the website you are checking has automatic redirection. In the URL you gave, even if you specify a non-existent page, it automatically redirects you to the home page rather than returning a 404 status code. Your code works fine.
import urllib.request
import urllib.error

thisCode = None
try:
    i = urllib.request.urlopen('http://www.google.com')
    thisCode = i.code
except urllib.error.HTTPError as e:
    thisCode = e.code
print(thisCode)
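If you want to treat that redirect-to-the-home-page behaviour as "page does not exist", one option (a sketch, not part of the original answer, assuming the site redirects missing pages as described) is to compare the URL you requested with the URL you actually ended up at, which the response exposes via geturl():
import urllib.request
import urllib.error

def checking_url(link):
    try:
        resp = urllib.request.urlopen(link)
        # urlopen follows redirects silently; if the final URL differs from
        # the one we asked for, treat it like a missing page
        if resp.geturl().rstrip('/') != link.rstrip('/'):
            return 404
        return resp.code
    except urllib.error.HTTPError as e:
        return e.code

print(checking_url('https://www.wykop.pl/notexistlinkkk/'))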

BeautifulSoup scraper downloaded images are corrupt

I greatly need help with my code. I was attempting to do an exercise from a book and I followed it exactly. The code worked and it downloaded the images. However, all the images that were downloaded were corrupted. I have no idea what's causing it or what I missed.
Thanks.
#! python3
# downloadXkcd.py - Downloads every single XKCD comic.

import requests, os, bs4

url = 'http://xkcd.com'
os.makedirs('xkcd', exist_ok=True)

while not url.endswith('#'):
    # Download the page.
    print('Downloading page %s...' % url)
    res = requests.get(url)
    res.raise_for_status()
    soup = bs4.BeautifulSoup(res.text, 'html.parser')

    # Find the URL of the comic image.
    comicElem = soup.select('#comic img')
    if comicElem == []:
        print('Could not find comic image')
    else:
        comicUrl = comicElem[0].get('src')
        # Download the image.
        print('Downloading image %s' % (comicUrl))
        res.raise_for_status()

        # Save the image to ./xkcd.
        imagefile = open(os.path.join('xkcd', os.path.basename(comicUrl)), 'wb')
        for chunk in res.iter_content(100000):
            imagefile.write(chunk)
        imagefile.close()

    # Get the prev button's url.
    prevlink = soup.select('a[rel="prev"]')[0]
    url = 'http://xkcd.com' + prevlink.get('href')

print('Done')
You are writing the wrong data to the file:
for chunk in res.iter_content(100000):
res is the data of the webpage. You should be getting the data of the image with the url comicUrl instead. I think you forgot this line:
print('Downloading image %s' %(comicUrl))
res = requests.get('http:' + comicUrl)
Note: I added http: before the url because the image urls you are extracting lack this. You should define a function to check whether it is necessary to add this schema.
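A small sketch of such a helper (the name ensureSchema is made up here; it just prepends http: when the scraped src is protocol-relative, i.e. starts with //):
def ensureSchema(url):
    # xkcd's <img src> values are protocol-relative, e.g. //imgs.xkcd.com/comics/...
    if url.startswith('//'):
        return 'http:' + url
    return url

comicUrl = ensureSchema(comicElem[0].get('src'))
res = requests.get(comicUrl)
res.raise_for_status()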
