I have two sets of scripts: one to download a webpage and another to download the links from that webpage. They both run, but the links script doesn't return any links. Can anyone see or tell me why?
Webpage script:
import sys, urllib
def getWebpage(url):
print '[*] getWebpage()'
url_file = urllib.urlopen(url)
page = url_file.read()
return page
def main():
sys.argv.append('http://www.bbc.co.uk')
if len(sys.argv) != 2:
print '[-] Usage: webpage_get URL'
return
else:
print getWebpage(sys.argv[1])
if __name__ == '__main__':
main()
Links script:
import sys, urllib, re
import getWebpage
def print_links(page):
print '[*] print_links()'
links = re.findall(r'\<a.*href\=.*http\:.+', page)
links.sort()
print '[+]', str(len(links)), 'HyperLinks Found:'
for link in links:
print link
def main():
sys.argv.append('http://www.bbc.co.uk')
if len(sys.argv) != 2:
print '[-] Usage: webpage_links URL'
return
page = webpage_get.getWebpage(sys.argv[1])
print_links(page)
This will fix most of your problems:
import sys, urllib, re
def getWebpage(url):
print '[*] getWebpage()'
url_file = urllib.urlopen(url)
page = url_file.read()
return page
def print_links(page):
print '[*] print_links()'
links = re.findall(r'\<a.*href\=.*http\:.+', page)
links.sort()
print '[+]', str(len(links)), 'HyperLinks Found:'
for link in links:
print link
def main():
site = 'http://www.bbc.co.uk'
page = getWebpage(site)
print_links(page)
if __name__ == '__main__':
main()
Then you can move on to fixing your regular expression.
While we are on the topic, though, I have two material recommendations:
use the Python requests library for getting web pages
use a real XML/HTML library for parsing HTML (I recommend lxml); a sketch of both follows
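A minimal sketch of that combination, assuming only the URL from the question; requests does the fetching and lxml pulls out the href attributes:

import requests
import lxml.html

def get_links(url):
    # Fetch the page; requests handles redirects and encodings for us.
    response = requests.get(url)
    response.raise_for_status()

    # Parse the HTML and collect the href attribute of every <a> tag.
    doc = lxml.html.fromstring(response.text)
    doc.make_links_absolute(url)   # turn relative links into absolute ones
    return doc.xpath('//a/@href')

for link in sorted(get_links('http://www.bbc.co.uk')):
    print link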
Your regular expression has no terminating pattern: the http\:.+ part greedily matches everything from http: to the end of the line, so each match drags in the rest of the markup on that line instead of just the link. You need to give the expression an explicit end, for example the closing quote of the href attribute.
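For example, a bounded pattern that captures just the URL inside a double-quoted href attribute (still fragile compared to a real HTML parser, but it has a clear end):

import re

# Capture only what sits between the quotes of href="...", so the match
# stops at the closing quote instead of running to the end of the line.
links = re.findall(r'<a[^>]+href="(http[^"]+)"', page)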
Hi, I am a high school student who has not used Python much, and I am having trouble writing code to check when a website has been updated. I pieced together what I have from different resources, but when I run it, it doesn't do what I expect: it should tell me whether the site has been updated or stayed the same since I last checked it. I added some print statements to try to catch the issue, but they only ever show that the website has changed, even though it doesn't look like it has.
import time
import hashlib
from urllib.request import urlopen, Request
url = Request('https://www.canada.ca/en/immigration-refugees-citizenship/services/immigrate-canada/express-entry/submit-profile/rounds-invitations.html')
res = urlopen(url).read()
current = hashlib.sha224(res).hexdigest()
print("running")
time.sleep(10)
while True:
try:
res = urlopen(url).read()
current = hashlib.sha224(res).hexdigest()
print(current)
print(res)
time.sleep(30)
res = urlopen(url).read()
newHash = hashlib.sha224(res).hexdigest()
print (newHash)
print(res)
if newHash == current:
print ("nothing changed")
continue
else:
print("there was a change")
except AttributeError as e:
print ("error")
The code below is what I have currently done, but I am struggling to get it working properly...
Hope you can help :)
#A python programme which shows the current price of bitcoin.
#(a well-known crypto-currency.)
import urllib
import urllib2
def webConnect():
aResp = urllib2.urlopen("https://www.cryptocompare.com/coins/btc/overview/GBP")
web_pg = aResp.read();
print web_pg
def main():
webConnect()
main()
g = Grab()
g.go(address)
btc_div = g.xpath('//*/div[@class="ng-binding"]')
val = btc_div.xpath(u"dl/dt[contains(text(),'%s')]/../dd/text()" % 'if only that tag contains this text')
print val[0]
One option is to use the BeautifulSoup library.
This question has example of finding tags by text : BeautifulSoup - search by text inside a tag
Tutorial : https://www.dataquest.io/blog/web-scraping-tutorial-python/
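If you go the BeautifulSoup route, a minimal sketch might look like the following. The tag name and class used to locate the price are hypothetical placeholders: inspect the page source and substitute the real element. Also note that the ng-binding class in the Grab snippet above suggests the value is filled in by JavaScript, so a plain urllib2 fetch may not contain it at all; the site's JSON API, if it offers one, is usually the easier route.

import urllib2
from bs4 import BeautifulSoup   # pip install beautifulsoup4

html = urllib2.urlopen("https://www.cryptocompare.com/coins/btc/overview/GBP").read()
soup = BeautifulSoup(html, "html.parser")

# Hypothetical selector: replace "span" / "price-value" with the element
# that actually wraps the price on the page.
price_tag = soup.find("span", class_="price-value")
if price_tag is not None:
    print price_tag.get_text(strip=True)
else:
    print "price element not found (the value may be rendered by JavaScript)"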
I have a file test.py:
import cgi, cgitb # Import modules for CGI handling
form = cgi.FieldStorage()
person_name = form.getvalue('person_name')
print ("Content-type:text/html\n\n")
print ("<html>")
print ("<head>")
print ("</head>")
print ("<body>")
print (" hello world <br/>")
print(person_name)
print ("</body>")
print ("</html>")
When I go to www.myexample.com/test.py?person_name=john, the result I get is:
hello world
None
meaning that I could not get the parameter "person_name" from the URL.
P.S. It works perfectly on my localhost server, but when I upload it to the online webserver, for some reason it can't parse the parameter from the URL.
How can I fix it?
Use this then (here qs is the raw query string, e.g. os.environ['QUERY_STRING']):
form_arguments = cgi.FieldStorage(environ={'REQUEST_METHOD':'GET', 'QUERY_STRING':qs})
for i in form_arguments.keys():
print form_arguments[i].value
In my previous answer I assumed you have webapp2. I think this will solve your purpose.
Alternatively you can try:
import urlparse
url = 'www.myexample.com/test.py?person_name=john'
par = urlparse.parse_qs(urlparse.urlparse(url).query)
person_name= par['person_name']
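One detail worth noting: parse_qs returns a list for every key, so you usually want the first element:

import urlparse

par = urlparse.parse_qs('person_name=john')
print par['person_name']      # ['john']  (a list, even for a single value)
print par['person_name'][0]   # 'john'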
And to get the current URL, use this:
url = os.environ['HTTP_HOST']
uri = os.environ['REQUEST_URI']
url = url + uri
par = urlparse.parse_qs( urlparse.urlparse(url).query )
I am having trouble building a basic spider program in Python. Whenever I try to run it I get an error. The error occurs somewhere in the last seven lines of code.
#These modules do most of the work.
import sys
import urllib2
import urlparse
import htmllib, formatter
from cStringIO import StringIO
def log_stdout(msg):
"""Print msg to the screen."""
print msg
def get_page(url, log):
"""Retrieve URL and return contents, log errors."""
try:
page = urllib2.urlopen(url)
except urllib2.URLError:
log("Error retrieving: " + url)
return ''
body = page.read()
page.close()
return body
def find_links(html):
"""Return a list links in html."""
# We're using the parser just to get the HREFs
writer = formatter.DumbWriter(StringIO())
f = formatter.AbstractFormatter(writer)
parser = htmllib.HTMLParser(f)
parser.feed(html)
parser.close()
return parser.anchorlist
class Spider:
"""
The heart of this program, finds all links within a web site.
run() contains the main loop.
process_page() retrieves each page and finds the links.
"""
def __init__(self, startURL, log=None):
#This method sets initial values
self.URLs = set()
self.URLs.add(startURL)
self.include = startURL
self._links_to_process = [startURL]
if log is None:
# Use log_stdout function if no log provided
self.log = log_stdout
else:
self.log = log
def run(self):
#Processes list of URLs one at a time
while self._links_to_process:
url = self._links_to_process.pop()
self.log("Retrieving: " + url)
self.process_page(url)
def url_in_site(self, link):
#Checks whether the link starts with the base URL
return link.startswith(self.include)
def process_page(self, url):
#Retrieves page and finds links in it
html = get_page(url, self.log)
for link in find_links(html):
#Handle relative links
link = urlparse.urljoin(url, link)
self.log("Checking: " + link)
# Make sure this is a new URL within current site
if link not in self.URLs and self.url_in_site(link):
self.URLs.add(link)
self._links_to_process.append(link)
The error message pertains to this block of code.
if __name__ == '__main__':
#This code runs when script is started from command line
startURL = sys.argv[1]
spider = Spider(startURL)
spider.run()
for URL in sorted(spider.URLs):
print URL
The error message:
startURL = sys.argv[1]
IndexError: list index out of range
You aren't calling your spider program with an argument. sys.argv[0] is your script file, and sys.argv[1] would be the first argument you pass it. The "list index out of range" means you didn't give it any arguments.
Try calling it as python spider.py http://www.example.com (with your actual URL).
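If you want a friendlier failure when the argument is missing, a small guard in the __main__ block (reusing the question's names) does it:

if __name__ == '__main__':
    # Refuse to run without a start URL instead of crashing with IndexError.
    if len(sys.argv) != 2:
        print 'Usage: python spider.py <start-url>'
        sys.exit(1)
    startURL = sys.argv[1]
    spider = Spider(startURL)
    spider.run()
    for URL in sorted(spider.URLs):
        print URL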
This doesn't directly answer your question, but:
I would go something as:
import lxml.html

START_PAGE = 'http://some.url.tld'
ahrefs = lxml.html.parse(START_PAGE).xpath('//a/@href')
Then use the available methods on lxml.html objects and multiprocess the links.
This handles "semi-malformed" HTML, and you can plug in the BeautifulSoup library as a fallback parser.
A bit of work is required if you want to try to follow JavaScript-generated links, but that's life!
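A slightly fuller sketch of that idea, resolving relative hrefs and keeping only same-site links the way Spider.url_in_site does (START_PAGE is a placeholder, as above):

import lxml.html

START_PAGE = 'http://some.url.tld'

doc = lxml.html.parse(START_PAGE).getroot()
doc.make_links_absolute(START_PAGE)   # resolve relative hrefs against the page URL

# Keep only links that stay within the starting site.
links = [href for href in doc.xpath('//a/@href') if href.startswith(START_PAGE)]
for link in sorted(set(links)):
    print link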
I have the following script that I am using to scrape data from my uni website and insert it into a GAE DB:
from mechanize import Browser
from BeautifulSoup import BeautifulSoup
import re
import datetime
__author__ = "Nash Rafeeq"
url = "http://webspace.apiit.edu.my/schedule/timetable.jsp"
viewurl = "http://localhost:8000/timekeeper/intake/checkintake/"
inserturl = "http://localhost:8000/timekeeper/intake/addintake/"
print url
mech = Browser()
try:
page = mech.open(url)
html = page.read()
except Exception, err:
print str(err)
#print html
soup = BeautifulSoup(html)
soup.prettify()
tables = soup.find('select')
for options in tables:
intake = options.string
#print intake
try:
#print viewurl+intake
page = mech.open(viewurl+intake)
html = page.read()
print html
if html=="Exist in database":
print intake, " Exist in the database skiping"
else:
page = mech.open(inserturl+intake)
html = page.read()
print html
if html=="Ok":
print intake, "added to the database"
else:
print "Error adding ", intake, " to database"
except Exception, err:
print str(err)
I am wondering what would be the best way to optimize this script so I can run it on the App Engine servers. As it is, it is scraping over 300 entries and takes well over 10 minutes to insert all the data on my local machine.
the model that is being used to store the data is
class Intake(db.Model):
intake=db.StringProperty(multiline=False, required=True)
##permerlink
def get_absolute_url(self):
return "/timekeeper/%s/" % self.intake
class Meta:
db_table = "Intake"
verbose_name_plural = "Intakes"
ordering = ['intake']
Divide and conquer.
Make a list of tasks (e.g. URLs to scrape/parse)
Add your tasks into a queue (App Engine Task Queue API, Amazon SQS, …); see the sketch after this list
Process your queue
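A minimal sketch of the App Engine Task Queue step (Python 2 runtime); the /worker URL, and the idea that each task handles one intake, are assumptions rather than part of the original script:

from google.appengine.api import taskqueue

# Enqueue one small task per scraped intake code; a handler mapped to
# /worker would then do the duplicate check and the datastore insert.
for intake_code in intake_codes:   # intake_codes: the strings scraped from the <select> element
    taskqueue.add(url='/worker', params={'intake': intake_code})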
The first thing you should do is rewrite your script to use the App Engine datastore directly. A large part of the time you're spending is undoubtedly because you're using HTTP requests (two per entry!) to insert data into your datastore. Using the datastore directly with batch puts ought to cut a couple of orders of magnitude off your runtime.
If your parsing code is still too slow, you can cut the work up into chunks and use the task queue API to do the work in multiple requests.
Hi, following tosh's and nick's suggestions I have modified the script as below:
from google.appengine.api import urlfetch
from BeautifulSoup import BeautifulSoup
from timkeeper.models import Intake
from google.appengine.ext import db
__author__ = "Nash Rafeeq"
url = "http://webspace.apiit.edu.my/schedule/timetable.jsp"
try:
page = urlfetch.fetch(url)
#print html
soup = BeautifulSoup(page.content)
soup.prettify()
tables = soup.find('select')
models=[]
for options in tables:
intake_code = options.string
if Intake.all().filter('intake',intake_code).count()<1:
data = Intake(intake=intake_code)
models.append(data)
try:
if len(models)>0:
db.put(models)
else:
pass
except Exception,err:
pass
except Exception, err:
print str(err)
Am I on the right track? Also, I am not really sure how to get this to run on a schedule (once a week); what would be the best way to do that?
And thanks for the prompt answers.
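On the scheduling question: App Engine's cron service is the usual answer. A cron.yaml entry (for example, url: /tasks/scrape_intakes with schedule: every monday 09:00) points at a plain request handler that runs the scrape. A minimal handler sketch for the old Python 2 webapp framework; the URL and the run_scrape() wrapper around the code above are assumptions:

from google.appengine.ext import webapp
from google.appengine.ext.webapp.util import run_wsgi_app

class ScrapeIntakesHandler(webapp.RequestHandler):
    def get(self):
        run_scrape()   # hypothetical wrapper around the scrape-and-db.put() code above
        self.response.out.write('done')

application = webapp.WSGIApplication([('/tasks/scrape_intakes', ScrapeIntakesHandler)])

def main():
    run_wsgi_app(application)

if __name__ == '__main__':
    main()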