Freezes when accessing Mouser.com using urllib - Python

I want to access mouser.com using urllib. When I try to fetch the data from the URL, it hangs indefinitely.
Here is the code:
import urllib.error
import urllib.request

try:
    htmls = urllib.request.urlopen("https://www.mouser.com/")
except urllib.error.HTTPError as e:
    print("HTTP ERROR")
except urllib.error.URLError as e:
    print("URL ERROR")
else:
    print(htmls.read().decode("utf-8"))
This piece of code works fine for most URLs, but it hangs for some, such as Mouser or element14.
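Sites like Mouser often inspect request headers and may stall or reject clients that send urllib's default Python-urllib User-Agent. A minimal sketch of a workaround, assuming the hang is header-related: pass a browser-like User-Agent and a timeout so the call fails fast instead of hanging forever. The header value is illustrative, and heavily protected sites may still refuse non-browser clients.

import urllib.error
import urllib.request

# Browser-like User-Agent; the exact value is illustrative.
req = urllib.request.Request(
    "https://www.mouser.com/",
    headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"},
)
try:
    # timeout makes the call raise URLError instead of hanging indefinitely
    htmls = urllib.request.urlopen(req, timeout=10)
except urllib.error.HTTPError:
    print("HTTP ERROR")
except urllib.error.URLError as e:
    print("URL ERROR:", e.reason)
else:
    print(htmls.read().decode("utf-8"))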

Related

Check whether a website is valid or not through Python

I have a list of 1000 websites to check whether they exist or not, but my code reports every URL that starts with https:// as correct. Here is my code:
from urllib.request import Request, urlopen
from urllib.error import URLError, HTTPError

req = Request("http://stackoverflow.com")
try:
    response = urlopen(req)
except HTTPError as e:
    print('The server couldn\'t fulfill the request.')
    print('Error code: ', e.code)
except URLError as e:
    print('We failed to reach a server.')
    print('Reason: ', e.reason)
else:
    print('Website is working fine')
You can use the Python requests library.
If you do response = requests.get('http://stackoverflow.com') and then check response.status_code, you should get 200. For a site that is not available you should get a status_code of 404. You can use status_code in your case.
More on status codes: Link.
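For example, a minimal sketch of that approach applied to a list of URLs (the list here is illustrative). Note that requests.get() does not raise an exception for a 404 response, so the status code has to be checked explicitly:

import requests

# Illustrative list; substitute your own 1000 URLs.
urls = ['http://stackoverflow.com', 'http://stackoverflow.com/no-such-page']

for url in urls:
    try:
        response = requests.get(url)
    except requests.exceptions.RequestException as e:
        print(url, 'failed:', e)  # DNS failure, refused connection, etc.
    else:
        if response.status_code == 200:
            print(url, 'is working fine')
        else:
            print(url, 'returned status', response.status_code)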

How to apply a customized function to a list of urls?

Update
I am trying to define a function over a list of URLs (job_title_links); the function should print a message when a link is not found or its server cannot be reached:
This is what I've got so far:
from urllib.error import URLError
from urllib.error import HTTPError
from urllib.request import urlopen

job_title_links = ['https://www.salario.com.br/profissao/abacaxicultor-cbo-612510/',
                   'https://www.salario.com.br/profissao/abade-cbo-263105/',
                   'https://www.salario.com.br/profissao/abanador-na-agricultura-cbo-622020/']

def try_url_exist(links):
    for link in links:
        try:
            html = urlopen(link)
        except HTTPError as e:
            print(e)  # not found url
        except URLError as e:
            print(e)  # server not found

try_url_exist(job_title_links)
However, the function prints HTTP Error 403 even when the URLs exist.
Console output:
HTTP Error 403: Forbidden
HTTP Error 403: Forbidden
HTTP Error 403: Forbidden
The expected behavior is for the function to do nothing when a URL exists, and to print the HTTPError or URLError together with the URL when it does not.
How could I accomplish this task?
By changing urlopen() to requests.get() from the requests library and collecting the working links in a list, the code worked.
import requests
from requests.exceptions import RequestException

def try_url_exist(links):
    functional_links = []
    for link in links:
        try:
            # requests.get() raises RequestException subclasses for
            # connection problems; it does not raise for HTTP 4xx/5xx.
            html = requests.get(link)
        except RequestException as e:
            print(e)
        else:
            print(link)
            functional_links.append(link)
    return functional_links

functional_links = try_url_exist(job_title_links)
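The 403 responses most likely come from the server rejecting urllib's default User-Agent rather than from missing pages, so urlopen() can also work if a browser-like header is supplied. A minimal sketch of that alternative; the header value is illustrative and some servers may still refuse it:

from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen

def try_url_exist(links):
    for link in links:
        # Send a browser-like User-Agent instead of the default Python-urllib.
        req = Request(link, headers={'User-Agent': 'Mozilla/5.0'})
        try:
            urlopen(req)
        except HTTPError as e:
            print(e, link)  # URL not found, or still forbidden
        except URLError as e:
            print(e, link)  # server not found

try_url_exist(job_title_links)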

How to get specific tag elements from a page using the Python requests library

I am trying to get all the 'a' tags, which are used for links, and also the 'form' tags. The code that I have written fetches the whole page.
import requests
from requests.exceptions import HTTPError

for url in ['http://www.example.com', 'http://mail.example.com']:
    try:
        response = requests.get(url)
        # If the response was successful, no Exception will be raised
        response.raise_for_status()
    except HTTPError as http_err:
        print(f'HTTP error occurred: {http_err}')  # Python 3.6
    except Exception as err:
        print(f'Other error occurred: {err}')  # Python 3.6
    else:
        response.encoding = 'utf-8'  # Optional: requests infers this internally
        print(response.text)
I can use regular expressions to get a specific thing from the page, but I don't know how to get the entire contents of a particular tag.
Thanks
You can use BeautifulSoup to parse the HTML page:
import requests
from requests.exceptions import HTTPError
from bs4 import BeautifulSoup

for url in ['http://www.example.com', 'http://mail.example.com']:
    try:
        response = requests.get(url)
        # If the response was successful, no Exception will be raised
        response.raise_for_status()
    except HTTPError as http_err:
        print(f'HTTP error occurred: {http_err}')  # Python 3.6
    except Exception as err:
        print(f'Other error occurred: {err}')  # Python 3.6
    else:
        response.encoding = 'utf-8'  # Optional: requests infers this internally
        soup = BeautifulSoup(response.text, 'lxml')
        links = soup.find_all('a')
        forms = soup.find_all('form')
To install BeautifulSoup use:
pip install beautifulsoup4
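Each element returned by find_all() is a Tag whose attributes and text can be read directly, which is how you get at the contents of a particular tag rather than the whole page. For example, continuing inside the else branch above (attribute names depend on the page's markup):

        for a in links:
            print(a.get('href'), a.get_text(strip=True))
        for form in forms:
            print(form.get('action'))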

How to remove URL with error from the list?

I have a list of more than 1000 URLs (used to download reports) saved in a .csv file.
Some of the URLs return a 404 error and I want to find a way to remove them from the list.
I managed to write code that identifies which URLs are invalid (for Python 3), below. However, I don't know how to remove those URLs from the list automatically, given that there are many of them. Thank you!
from urllib.request import urlopen
from urllib.error import HTTPError

try:
    urlopen(url)
except HTTPError as err:
    if err.code == 404:
        print('invalid')
    else:
        raise
You can use a set to collect the 404 URLs (assuming there are fewer 404 URLs than valid ones), then take the set difference:
from urllib.request import urlopen
from urllib.error import HTTPError

exclude_urls = set()
for url in all_urls:
    try:
        urlopen(url)
    except HTTPError as err:
        if err.code == 404:
            exclude_urls.add(url)
valid_urls = set(all_urls) - exclude_urls
Consider that list A has all the URLs.
A.remove("invalid_url")
Note that list.remove() mutates the list in place and returns None, so its result must not be assigned back to A.
You can do something like this:
from urllib.request import urlopen
from urllib.error import HTTPError

def load_data(csv_name):
    ...

def save_data(data, csv_name):
    ...

links = load_data(csv_name)
new_links = set()
for i in links:
    try:
        urlopen(i)
    except HTTPError as err:
        if err.code == 404:
            print('invalid')
    else:
        new_links.add(i)

save_data(list(new_links), csv_name)
Looping over the URLs and writing the valid ones to a new CSV file would be the easiest solution.
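A minimal sketch of that approach, assuming a one-column CSV of URLs; the file names are illustrative:

import csv
from urllib.request import urlopen
from urllib.error import HTTPError

valid = []
with open('urls.csv', newline='') as f:
    for row in csv.reader(f):
        if not row:
            continue
        url = row[0]
        try:
            urlopen(url)
        except HTTPError as err:
            if err.code == 404:
                print('invalid:', url)
                continue  # drop 404 URLs from the output
        valid.append(url)

with open('valid_urls.csv', 'w', newline='') as f:
    csv.writer(f).writerows([u] for u in valid)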

Syntax error on urllib.httperror in python 3.4.3

I'm following the book Web Scraping with Python and I'm trying the example below.
I'm in a virtual environment with Python 3.4.3 on OS X, and the BeautifulSoup library is installed.
When I try this:
from urllib.request import urlopen
from bs4 import BeautifulSoup
from urllib.error import HTTPError

html = urlopen("http://www.pythonscraping.com/exercises/exercise1.html")
except urllib.error.HTTPError as e:
    print(e.code)
if html is none:
    print("url is not found")
else:
    bsObj = BeautifulSoup(html.read());
    print(bsObj
When I run it, I get the following error:
(scrapingEnv)Macintosh:scrapingenv nicolas$ python3 scrapetest.py
File "scrapetest.py", line 6
except urllib.error.HTTPError as e:
^
SyntaxError: invalid syntax
I also tried "except urllib.HTTPError" on line 6, without any success.
What am I doing wrong?
You are missing your try statement:
from urllib.request import urlopen
from bs4 import BeautifulSoup
from urllib.error import HTTPError

try:
    html = urlopen("http://www.pythonscraping.com/exercises/exercise1.html")
except HTTPError as e:  # HTTPError was imported directly, so no urllib.error prefix
    print(e.code)

if html is None:
    print("url is not found")
Edit: you should also change none to None
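Putting the pieces together, a corrected version of the full script; note the closing parenthesis on the final print, which the original also dropped, and the explicit 'html.parser' argument, which avoids BeautifulSoup's parser warning:

from urllib.request import urlopen
from bs4 import BeautifulSoup
from urllib.error import HTTPError

html = None
try:
    html = urlopen("http://www.pythonscraping.com/exercises/exercise1.html")
except HTTPError as e:
    print(e.code)

if html is None:
    print("url is not found")
else:
    bsObj = BeautifulSoup(html.read(), "html.parser")
    print(bsObj)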
