How can I modify my script to skip a URL if the connection times out or is invalid/404?
Python
#!/usr/bin/python
#parser.py: Downloads Bibles and parses all data within <article> tags.
__author__ = "Cody Bouche"
__copyright__ = "Copyright 2012 Digital Bible Society"
from BeautifulSoup import BeautifulSoup
import lxml.html as html
import urlparse
import os, sys
import urllib2
import re
print ("downloading and parsing Bibles...")
root = html.parse(open('links.html'))
for link in root.findall('//a'):
    url = link.get('href')
    name = urlparse.urlparse(url).path.split('/')[-1]
    dirname = urlparse.urlparse(url).path.split('.')[-1]
    f = urllib2.urlopen(url)
    s = f.read()
    if (os.path.isdir(dirname) == 0):
        os.mkdir(dirname)
    soup = BeautifulSoup(s)
    articleTag = soup.html.body.article
    converted = str(articleTag)
    full_path = os.path.join(dirname, name)
    open(full_path, 'wb').write(converted)
    print(name)
print("DOWNLOADS COMPLETE!")
To apply a timeout to your request, pass the timeout argument in your call to urlopen. From the docs:
The optional timeout parameter specifies a timeout in seconds for
blocking operations like the connection attempt (if not specified, the
global default timeout setting will be used). This actually only works
for HTTP, HTTPS and FTP connections.
Refer to this guide's section on how to handle exceptions with urllib2. Actually I found the whole guide very useful.
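Status 408 is the HTTP Request Timeout code, which is sent by the server. A client-side timeout set through the timeout argument carries no status code; urllib2 typically reports it as a URLError whose reason is a socket.timeout. Wrapping it up, to handle both kinds of failure you would:
try:
    # timeout must be passed by keyword; the second positional argument is data
    response = urlopen(req, timeout=3) # 3 seconds
except URLError, e:
    if hasattr(e, 'code'):
        if e.code==408:
            print 'Timeout ', e.code
        if e.code==404:
            print 'File Not Found ', e.code
        # etc etc
    else:
        # no HTTP status: the server was unreachable or the connection timed out
        print 'Failed to reach the server ', e.reason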
Try putting your urlopen line inside a try/except statement. Look this up:
docs.python.org/tutorial/errors.html section 8.3
Look at the different exceptions, and when you encounter one just move on to the next iteration of the loop using the continue statement.
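A minimal sketch of that pattern, written for Python 3 (where urllib2 became urllib.request and urllib.error). The helper name download_all and the opener parameter are assumptions made for illustration, so the loop can be exercised without a network; they are not part of the original script.

```python
import socket
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def download_all(urls, timeout=3, opener=urlopen):
    """Return {url: body} for the URLs that could be fetched; skip the rest.

    `opener` defaults to urllib.request.urlopen. It is a parameter only so
    that a stand-in can be passed in (hypothetical, for illustration).
    """
    pages = {}
    for url in urls:
        try:
            response = opener(url, timeout=timeout)
            pages[url] = response.read()
        except HTTPError as e:
            # The server answered, but with an error status such as 404.
            print("skipping %s: HTTP %s" % (url, e.code))
            continue
        except (URLError, socket.timeout, ValueError) as e:
            # Unreachable host, timeout, or a malformed URL string.
            print("skipping %s: %s" % (url, e))
            continue
    return pages
```

HTTPError is caught first because it is a subclass of URLError. In the original script, the parsing and file writing would go where the body is stored in pages.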
Related
I wonder if it is possible to test the URLs from my bookmarks,
so I can see whether each URL is still online or offline.
I can see that I can test a single URL with urllib2.
urllib2 code
import socket
from urllib2 import urlopen, URLError, HTTPError
socket.setdefaulttimeout( 23 ) # timeout in seconds
url = 'http://google.com/'
try :
    response = urlopen( url )
except HTTPError, e:
    print 'The server couldn\'t fulfill the request. Reason:', str(e.code)
except URLError, e:
    print 'We failed to reach a server. Reason:', str(e.reason)
else :
    html = response.read()
    print 'got response!'
    # do something, turn the light on/off or whatever
My question is: can I get the links/URLs from my bookmarks (Chrome) and then test them in a for loop to see whether each URL is offline or online?
EDIT 26/02/2019...
I have tried this code, and I get a file-not-found error:
import json
from jsonpath_rw import parse
import os
# Parse the Bookmarks file from JSON into a dict
input_filename = os.path.join(os.getenv("APPDATA"), "\\Local\\Google\\Chrome\\User Data\\Default\\Bookmarks")
if os.path.isfile(input_filename):
    with open(input_filename) as data_file:
        bookmark_data = json.load(data_file)
    # Set a JSONPath expression for all 'url' children
    expr = parse('$..url')
    # print the value of all url keys
    print([x.value for x in expr.find(bookmark_data)])
else:
    print("File not found!")
    print(input_filename)
Chrome (or at least Chromium) stores your bookmarks in a file called Bookmarks in its config area. On Linux this is usually .config/chromium/Default/Bookmarks; on Windows it is AppData\Local\Google\Chrome\User Data\Default\Bookmarks (though you may need to hunt for it if your system is different).
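A sketch of how those two default locations could be built in code. It assumes current Windows versions, where the LOCALAPPDATA variable points at AppData\Local (APPDATA usually points at AppData\Roaming instead). The function name bookmarks_path and its two parameters are made up for illustration.

```python
import ntpath
import posixpath


def bookmarks_path(platform, env):
    """Build the default-profile Bookmarks path for the given platform.

    `platform` is a sys.platform-style string and `env` a mapping such as
    os.environ. Both are parameters (an assumption of this sketch) so the
    function can be checked on any machine.
    """
    if platform.startswith("win"):
        # LOCALAPPDATA already ends in ...\AppData\Local. No component may
        # start with a backslash, or join() discards the directories before it.
        return ntpath.join(env["LOCALAPPDATA"], "Google", "Chrome",
                           "User Data", "Default", "Bookmarks")
    # Chromium on Linux, relative to the home directory.
    return posixpath.join(env["HOME"], ".config", "chromium",
                          "Default", "Bookmarks")
```

In a real script you would call it as bookmarks_path(sys.platform, os.environ) and still check os.path.isfile on the result, since the profile folder is not always named Default.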
Assuming you want to check all links, you probably want to recursively walk the tree, looking for url keys and getting their values. Since this is JSON, I would recommend using the JSONPath library (https://readthedocs.org/projects/jsonpath-rw/) rather than writing your own recursive function:
import json
from jsonpath_rw import parse
# Parse the Bookmarks file from JSON into a dict
with open('Bookmarks') as bm:
    data = json.load(bm)
# Set a JSONPath expression for all 'url' children
expr = parse('$..url')
# print the value of all url keys
print([x.value for x in expr.find(data)])
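If installing jsonpath-rw is not an option, the hand-written recursion is short as well. This is a sketch using only the standard library; the name collect_urls and the sample data are made up for illustration, and the sample only imitates the nesting of a real Bookmarks file.

```python
import json


def collect_urls(node):
    """Recursively gather every string stored under a 'url' key."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "url" and isinstance(value, str):
                found.append(value)
            else:
                found.extend(collect_urls(value))
    elif isinstance(node, list):
        for item in node:
            found.extend(collect_urls(item))
    return found


# Tiny stand-in for the real file; the layout here is illustrative only.
sample = json.loads("""
{"roots": {"bookmark_bar": {"children": [
    {"name": "Python", "type": "url", "url": "https://www.python.org/"},
    {"name": "Docs", "type": "folder", "children": [
        {"name": "2.7", "type": "url", "url": "https://docs.python.org/2.7/"}
    ]}
]}}}
""")
print(collect_urls(sample))
```

The resulting list can then be fed, one URL at a time, to the urlopen check from the question.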
I want to add a Referer header while retrieving data from the web, but this is not working in my Python 2 script: request.add_header('Referer', 'https://www.python.org').
My Url.txt content
https://www.python.org/about/
https://stackoverflow.com/questions
https://docs.python.org/2.7/
This is my code
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re
import urllib2
import threading
import time
import requests
max_thread = 5
urllist = open("Url.txt").readlines()
def url_connect(url):
    try :
        request = urllib2.Request(url)
        request.add_header('Referer', 'https://www.python.org')
        request.add_header('User-agent', 'Mozilla/5.0')
        goo = re.findall('<title>(.*?)</title>', urllib2.urlopen(url.replace(' ','')).read())[0]
        print '\n' + goo.decode("utf-8")
        with open('SaveMyDataFile.txt', 'ab') as f:
            f.write(goo + "\n")
    except Exception as Errors:
        pass
for i in urllist:
    i = i.strip()
    if i.startswith("http"):
        while threading.activeCount() >= max_thread:
            time.sleep(0.1)
        threading.Thread(target=url_connect, args=(i,)).start()
It looks to me like the problem is in your call to urlopen: you call it with the url and not with the request.
From https://docs.python.org/2/library/urllib2.html#urllib2.urlopen
Open the URL url, which can be either a string or a Request object.
You need to pass urllib2.urlopen() the Request object that you just built. Currently you build it, add the headers, and then never use it.
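A short sketch of the corrected call, written for Python 3 (urllib.request replaces urllib2 there). The helper names build_request and fetch, the 10 second timeout, and the opener parameter are assumptions made for illustration.

```python
from urllib.request import Request, urlopen


def build_request(url, referer):
    """Return a Request carrying the Referer and User-Agent headers."""
    request = Request(url.strip())
    request.add_header("Referer", referer)
    request.add_header("User-Agent", "Mozilla/5.0")
    return request


def fetch(url, referer, opener=urlopen):
    """Open the Request object (not the bare URL) so the headers are sent.

    `opener` defaults to urlopen and is injectable only for illustration.
    """
    return opener(build_request(url, referer), timeout=10).read()
```

The only change that matters for the question is that the object handed to urlopen is the Request, so the headers added to it actually travel with the HTTP request.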
I am using the json library and trying to export a Facebook page feed to a CSV file. I have tried many ways to get the result, but every time the code executes it fails with a "not JSON serializable" error. Facebook now requires an auth code, which I have and use, so the connection string will differ; however, if you use a page with public privacy you should still be able to get a result from the code below.
The code follows:
import urllib3
import json
import requests
#from pprint import pprint
import csv
from urllib.request import urlopen
page_id = "abcd" # username or id
api_endpoint = "https://graph.facebook.com"
fb_graph_url = api_endpoint+"/"+page_id
try:
    #api_request = urllib3.Requests(fb_graph_url)
    #http = urllib3.PoolManager()
    #api_response = http.request('GET', fb_graph_url)
    api_response = requests.get(fb_graph_url)
    try:
        #print (list.sort(json.loads(api_response.read())))
        obj = open('data', 'w')
        # write(json_dat)
        f = api_response.content
        obj.write(json.dumps(f))
        obj.close()
    except Exception as ee:
        print(ee)
except Exception as e:
    print( e)
I have tried many approaches without success. I hope someone can help.
api_response.content is the raw body of the response (bytes in Python 3), not a parsed Python object, so json.dumps cannot serialize it.
Try either:
f = api_response.text
obj.write(f)
Or
f = api_response.json()
obj.write(json.dumps(f))
requests.get(fb_graph_url).content
is a bytes object in Python 3. Using json.dumps on it won't work: that function expects Python objects such as a dictionary, a list or a string.
If the request already returns JSON, just write its text to the file.
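Since the stated goal is a CSV file, here is a sketch of the remaining step once the body has been parsed. The sample body and its field names (id, message) are hypothetical and only stand in for whatever the API really returns; rows_to_csv is a made-up helper name.

```python
import csv
import io
import json

# Hypothetical response body; real Graph API fields will differ.
body = b'{"data": [{"id": "1", "message": "hello"}, {"id": "2", "message": "bye"}]}'

# json.dumps(body) would raise TypeError, because bytes are not serializable.
# Parsing goes the other way: decode the bytes, then load them.
payload = json.loads(body.decode("utf-8"))


def rows_to_csv(rows, fieldnames):
    """Write a list of dicts as CSV text, ignoring keys not in fieldnames."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames,
                            extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


csv_text = rows_to_csv(payload["data"], ["id", "message"])
print(csv_text)
```

With requests, api_response.json() performs the decode-and-load step for you; to write a file instead of a string, pass an open file object (opened with newline='') to csv.DictWriter.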
I am having trouble building a basic spider program in Python. Whenever I try to run I get an error. The error occurs somewhere in the last seven lines of code.
#These modules do most of the work.
import sys
import urllib2
import urlparse
import htmllib, formatter
from cStringIO import StringIO
def log_stdout(msg):
    """Print msg to the screen."""
    print msg
def get_page(url, log):
    """Retrieve URL and return contents, log errors."""
    try:
        page = urllib2.urlopen(url)
    except urllib2.URLError:
        log("Error retrieving: " + url)
        return ''
    body = page.read()
    page.close()
    return body
def find_links(html):
    """Return a list of links in html."""
    # We're using the parser just to get the HREFs
    writer = formatter.DumbWriter(StringIO())
    f = formatter.AbstractFormatter(writer)
    parser = htmllib.HTMLParser(f)
    parser.feed(html)
    parser.close()
    return parser.anchorlist
class Spider:
    """
    The heart of this program, finds all links within a web site.
    run() contains the main loop.
    process_page() retrieves each page and finds the links.
    """
    def __init__(self, startURL, log=None):
        #This method sets initial values
        self.URLs = set()
        self.URLs.add(startURL)
        self.include = startURL
        self._links_to_process = [startURL]
        if log is None:
            # Use log_stdout function if no log provided
            self.log = log_stdout
        else:
            self.log = log
    def run(self):
        #Processes list of URLs one at a time
        while self._links_to_process:
            url = self._links_to_process.pop()
            self.log("Retrieving: " + url)
            self.process_page(url)
    def url_in_site(self, link):
        #Checks whether the link starts with the base URL
        return link.startswith(self.include)
    def process_page(self, url):
        #Retrieves page and finds links in it
        html = get_page(url, self.log)
        for link in find_links(html):
            #Handle relative links
            link = urlparse.urljoin(url, link)
            self.log("Checking: " + link)
            # Make sure this is a new URL within current site
            if link not in self.URLs and self.url_in_site(link):
                self.URLs.add(link)
                self._links_to_process.append(link)
The error message pertains to this block of code.
if __name__ == '__main__':
    #This code runs when script is started from command line
    startURL = sys.argv[1]
    spider = Spider(startURL)
    spider.run()
    for URL in sorted(spider.URLs):
        print URL
The error message:
startURL = sys.argv[1]
IndexError: list index out of range
You aren't calling your spider program with an argument. sys.argv[0] is your script file, and sys.argv[1] would be the first argument you pass it. The "list index out of range" means you didn't give it any arguments.
Try calling it as python spider.py http://www.example.com (with your actual URL).
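To get a readable message instead of the IndexError when the argument is forgotten, the script can check the length of sys.argv first. A minimal sketch (the function name start_url_from is made up for illustration):

```python
import sys


def start_url_from(argv):
    """Return the first command-line argument or exit with a usage message."""
    if len(argv) < 2:
        # sys.exit() with a string prints it to stderr and exits with status 1.
        sys.exit("usage: python %s <start-url>" % argv[0])
    return argv[1]
```

In the spider, startURL = start_url_from(sys.argv) would replace the direct sys.argv[1] lookup.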
This doesn't directly answer your question, but:
I would do something like:
import lxml.html
START_PAGE = 'http://some.url.tld'
ahrefs = lxml.html.parse(START_PAGE).xpath('//a/@href')
Then use the methods available on lxml.html objects and multiprocess the links
This handles "semi-malformed" HTML, and you can plug-in the BeautifulSoup library.
A bit of work is required if you want to even try to attempt to follow JavaScript generated links, but - that's life!
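For comparison, here is a sketch that collects the same href values with only the Python 3 standard library (html.parser rather than lxml; the htmllib module used in the question no longer exists in Python 3). The class name LinkCollector is made up, and this find_links takes an extra base_url argument that the original function does not have.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute href values from <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page they came from.
                self.links.append(urljoin(self.base_url, value))


def find_links(html_text, base_url):
    """Return the list of absolute links found in html_text."""
    parser = LinkCollector(base_url)
    parser.feed(html_text)
    parser.close()
    return parser.links
```

Like lxml, html.parser tolerates unclosed tags, but it does nothing about links generated by JavaScript.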
I have the following script, which I am using to scrape data from my university website and insert it into a GAE datastore:
from mechanize import Browser
from BeautifulSoup import BeautifulSoup
import re
import datetime
__author__ = "Nash Rafeeq"
url = "http://webspace.apiit.edu.my/schedule/timetable.jsp"
viewurl = "http://localhost:8000/timekeeper/intake/checkintake/"
inserturl = "http://localhost:8000/timekeeper/intake/addintake/"
print url
mech = Browser()
try:
    page = mech.open(url)
    html = page.read()
except Exception, err:
    print str(err)
#print html
soup = BeautifulSoup(html)
soup.prettify()
tables = soup.find('select')
for options in tables:
    intake = options.string
    #print intake
    try:
        #print viewurl+intake
        page = mech.open(viewurl+intake)
        html = page.read()
        print html
        if html=="Exist in database":
            print intake, " Exist in the database skiping"
        else:
            page = mech.open(inserturl+intake)
            html = page.read()
            print html
            if html=="Ok":
                print intake, "added to the database"
            else:
                print "Error adding ", intake, " to database"
    except Exception, err:
        print str(err)
I am wondering what would be the best way to optimize this script so I can run it on the App Engine servers. As it is, it scrapes over 300 entries and takes well over 10 minutes to insert all the data on my local machine.
The model that is being used to store the data is:
class Intake(db.Model):
    intake=db.StringProperty(multiline=False, required=True)
    ##permerlink
    def get_absolute_url(self):
        return "/timekeeper/%s/" % self.intake
    class Meta:
        db_table = "Intake"
        verbose_name_plural = "Intakes"
        ordering = ['intake']
Divide and conquer.
Make a list of tasks (e.g. urls to scrape/parse)
Add your tasks into a queue (appengine taskqueue api, amazon sqs, …)
Process your queue
The first thing you should do is rewrite your script to use the App Engine datastore directly. A large part of the time you're spending is undoubtedly because you're using HTTP requests (two per entry!) to insert data into your datastore. Using the datastore directly with batch puts ought to cut a couple of orders of magnitude off your runtime.
If your parsing code is still too slow, you can cut the work up into chunks and use the task queue API to do the work in multiple requests.
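The batching idea can be sketched independently of App Engine. In the sketch below, put_batch is a hypothetical callable standing in for a datastore batch put, existing is an assumed set of codes already stored, and the default batch size of 100 is an arbitrary illustration, not a datastore limit.

```python
def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def store_new(codes, existing, put_batch, batch_size=100):
    """Store codes not already present, one batch call per chunk.

    `existing` is a set of codes already stored and `put_batch` is a
    hypothetical stand-in for a datastore batch put.
    Returns the number of batch calls made.
    """
    new_codes = [c for c in codes if c not in existing]
    calls = 0
    for batch in chunked(new_codes, batch_size):
        put_batch(batch)
        calls += 1
    return calls
```

For 300 entries this makes 3 storage calls instead of up to 600 HTTP round trips, and each chunk is also a natural unit of work to hand to a task queue if a single request is still too slow.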
Hi, following the advice from tosh and nick I have modified the script as below:
from google.appengine.api import urlfetch
from BeautifulSoup import BeautifulSoup
from timkeeper.models import Intake
from google.appengine.ext import db
__author__ = "Nash Rafeeq"
url = "http://webspace.apiit.edu.my/schedule/timetable.jsp"
try:
    page = urlfetch.fetch(url)
    #print html
    soup = BeautifulSoup(page.content)
    soup.prettify()
    tables = soup.find('select')
    models=[]
    for options in tables:
        intake_code = options.string
        if Intake.all().filter('intake',intake_code).count()<1:
            data = Intake(intake=intake_code)
            models.append(data)
    try:
        if len(models)>0:
            db.put(models)
        else:
            pass
    except Exception,err:
        pass
except Exception, err:
    print str(err)
Am I on the right track? Also, I am not really sure how to get this to run on a schedule (once a week). What would be the best way to do it?
Thanks for the prompt answers.