I have the following script that I am using to scrape data from my uni website and insert it into a GAE database:
from mechanize import Browser
from BeautifulSoup import BeautifulSoup
import re
import datetime
__author__ = "Nash Rafeeq"
url = "http://webspace.apiit.edu.my/schedule/timetable.jsp"
viewurl = "http://localhost:8000/timekeeper/intake/checkintake/"
inserturl = "http://localhost:8000/timekeeper/intake/addintake/"
print url
mech = Browser()
try:
    page = mech.open(url)
    html = page.read()
except Exception, err:
    print str(err)
#print html
soup = BeautifulSoup(html)
soup.prettify()
tables = soup.find('select')
for options in tables:
    intake = options.string
    #print intake
    try:
        #print viewurl+intake
        page = mech.open(viewurl+intake)
        html = page.read()
        print html
        if html=="Exist in database":
            print intake, " exists in the database, skipping"
        else:
            page = mech.open(inserturl+intake)
            html = page.read()
            print html
            if html=="Ok":
                print intake, "added to the database"
            else:
                print "Error adding ", intake, " to database"
    except Exception, err:
        print str(err)
I am wondering what would be the best way to optimize this script so I can run it on the App Engine servers. As it is, it scrapes over 300 entries and takes well over 10 minutes to insert all the data on my local machine.
The model that is being used to store the data is:
class Intake(db.Model):
    intake = db.StringProperty(multiline=False, required=True)

    ## permalink
    def get_absolute_url(self):
        return "/timekeeper/%s/" % self.intake

    class Meta:
        db_table = "Intake"
        verbose_name_plural = "Intakes"
        ordering = ['intake']
Divide and conquer.
Make a list of tasks (e.g. urls to scrape/parse)
Add your tasks into a queue (appengine taskqueue api, amazon sqs, …)
Process your queue
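For example, with the App Engine task queue API that could look roughly like this (a sketch only; the /timekeeper/scrape_one/ worker URL is hypothetical and would map to a handler you write that processes a single intake code):

from google.appengine.api import taskqueue

# Assumed: intake_codes is the list of intake strings scraped from the
# timetable page; each task processes exactly one of them.
for code in intake_codes:
    taskqueue.add(url='/timekeeper/scrape_one/', params={'intake': code})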
The first thing you should do is rewrite your script to use the App Engine datastore directly. A large part of the time you're spending is undoubtedly because you're using HTTP requests (two per entry!) to insert data into your datastore. Using the datastore directly with batch puts ought to cut a couple of orders of magnitude off your runtime.
If your parsing code is still too slow, you can cut the work up into chunks and use the task queue API to do the work in multiple requests.
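A rough sketch of the chunking idea using the deferred library (not the answer's exact code; it assumes the Intake model above is importable, a list intake_codes of scraped intake strings, and that the deferred builtin is enabled in app.yaml):

from google.appengine.ext import db
from google.appengine.ext import deferred

def put_chunk(codes):
    # Batch-put one chunk of intake codes in a single datastore call.
    db.put([Intake(intake=c) for c in codes])

# Queue each chunk of 50 codes as its own task, so no single request
# has to insert all 300+ entries.
for i in range(0, len(intake_codes), 50):
    deferred.defer(put_chunk, intake_codes[i:i+50])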
Hi, according to tosh and nick I have modified the script as below:
from google.appengine.api import urlfetch
from BeautifulSoup import BeautifulSoup
from timkeeper.models import Intake
from google.appengine.ext import db
__author__ = "Nash Rafeeq"
url = "http://webspace.apiit.edu.my/schedule/timetable.jsp"
try:
    page = urlfetch.fetch(url)
    #print html
    soup = BeautifulSoup(page.content)
    soup.prettify()
    tables = soup.find('select')
    models = []
    for options in tables:
        intake_code = options.string
        if Intake.all().filter('intake', intake_code).count() < 1:
            data = Intake(intake=intake_code)
            models.append(data)
    try:
        if len(models) > 0:
            db.put(models)
        else:
            pass
    except Exception, err:
        pass
except Exception, err:
    print str(err)
Am I on the right track? Also, I am not really sure how to get this to invoke on a schedule (once a week); what would be the best way to do it?
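I was thinking of App Engine's cron service for the weekly run; a minimal cron.yaml sketch (assuming the scrape script is exposed at a handler URL such as /timekeeper/scrape/, which I would still need to add) would be:

cron:
- description: weekly timetable scrape
  url: /timekeeper/scrape/
  schedule: every monday 09:00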
And thanks for the prompt answers.
On Auction websites, there is a clock counting down the time remaining. I am trying to extract that piece of information (among others) to print to a csv file.
For example, I am trying to take the value after 'Time Left:' on this site: https://auctionofchampions.com/Michael_Jordan___Magic_Johnson_Signed_Lmt__Ed__Pho-LOT271177.aspx
I have tried 3 different options, without any success
1)
time = ''
try:
    time = soup.find(id='tzcd').text.replace('Time Left:','')
    #print("Time: ",time)
except Exception as e:
    print(e)
2)
time = ''
try:
    time = soup.find(id='tzcd').text
    #print("Time: ",time)
except:
    pass
3)
time = ''
try:
    time = soup.find('div', id="BiddingTimeSection").find_next_sibling("div").text
    #print("Time: ",time)
except:
    pass
I am a new user of Python and don't know if it's because of the date/time structure of the pull or because of something else inherently flawed with my code.
Any help would be greatly appreciated!
That information is being pulled into the page via a JavaScript XHR call. You can see that by inspecting the Network tab in your browser's dev tools. The following code will get you the time left in seconds:
import requests
s = requests.Session()
header = {'X-AjaxPro-Method': 'GetTimerText'}
payload = '{"inventoryId":271177}'
r = s.get('https://auctionofchampions.com/Michael_Jordan___Magic_Johnson_Signed_Lmt__Ed__Pho-LOT271177.aspx')
s.headers.update(header)
r = s.post('https://auctionofchampions.com/ajaxpro/LotDetail,App_Web_lotdetail.aspx.cdcab7d2.1voto_yr.ashx', data=payload)
print(r.json()['value']['timeLeft'])
Response:
792309
792309 seconds is a bit over 9 days. There are easy ways to break that down into days/hours/minutes, if you want.
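For example, a plain-arithmetic way to break that value down (nothing specific to this site):

seconds = 792309
days, rest = divmod(seconds, 86400)     # 86400 seconds in a day
hours, rest = divmod(rest, 3600)
minutes, secs = divmod(rest, 60)
print(f"{days}d {hours}h {minutes}m {secs}s")  # 9d 4h 5m 9s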
I wonder if it is possible to test the URLs from my bookmarks, so I can see if each URL is still online or offline.
I can see that I can test it with urllib2.
urllib2 code:
import socket
from urllib2 import urlopen, URLError, HTTPError
socket.setdefaulttimeout( 23 ) # timeout in seconds
url = 'http://google.com/'
try:
    response = urlopen(url)
except HTTPError, e:
    print 'The server couldn\'t fulfill the request. Reason:', str(e.code)
except URLError, e:
    print 'We failed to reach a server. Reason:', str(e.reason)
else:
    html = response.read()
    print 'got response!'
    # do something, turn the light on/off or whatever
My question is: can I get the links/URLs from my Chrome bookmarks and then test the URLs in a (for) loop to see whether each one is offline or online?
EDIT 26/02/2019
Have tried this code, and get a file-not-found error:
import json
from jsonpath_rw import parse
import os

# Parse the Bookmarks file from json into a dict
input_filename = os.path.join(os.getenv("APPDATA"), "\\Local\\Google\\Chrome\\User Data\\Default\\Bookmarks")
if os.path.isfile(input_filename):
    with open(input_filename) as data_file:
        bookmark_data = json.load(data_file)

    # Set an xpath expression for all 'url' children
    expr = parse('$..url')

    # print the value of all url keys
    print([x.value for x in expr.find(bookmark_data)])
else:
    print("File not found!")
    print(input_filename)
Chrome (or at least Chromium) stores your bookmarks in a file called Bookmarks in your Chrome config area. On Linux this is usually .config/chromium/Default/Bookmarks; on Windows it is AppData\Local\Google\Chrome\User Data\Default\Bookmarks (though you may need to hunt for it if your system is different).
Assuming you want to check all links, you probably want to recursively walk the tree, looking for url keys and getting their values. Since this is JSON, I would recommend using the JSONPath library (https://readthedocs.org/projects/jsonpath-rw/) rather than writing your own recursion function:
import json
from jsonpath_rw import parse

# Parse the Bookmarks file from json into a dict
with open('Bookmarks') as bm:
    data = json.load(bm)

# Set an xpath expression for all 'url' children
expr = parse('$..url')

# print the value of all url keys
print([x.value for x in expr.find(data)])
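To cover the online/offline check from the question, you could then loop over those URLs with the same urllib2 pattern (a sketch only; urls reuses the expression above):

from urllib2 import urlopen, URLError, HTTPError

urls = [x.value for x in expr.find(data)]
for url in urls:
    try:
        urlopen(url, timeout=10)
    except HTTPError, e:
        print url, 'reachable but returned HTTP error', e.code
    except URLError, e:
        print url, 'offline:', e.reason
    else:
        print url, 'online'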
I am trying to collect data from a webpage which has a bunch of select lists I need to fetch
data from. Here is the page: http://www.asusparts.eu/partfinder/Asus/All In One/E Series/
And this is what I have so far:
import glob, string
from bs4 import BeautifulSoup
import urllib2, csv
for file in glob.glob("http://www.asusparts.eu/partfinder/*"):
##-page to show all selections for the E-series-##
selected_list = 'http://www.asusparts.eu/partfinder/Asus/All In One/E Series/'
##-
page = urllib2.urlopen(selected_list)
soup = BeautifulSoup(page)
##-page which shows results after selecting one option-##
url = 'http://www.asusparts.eu/partfinder/Asus/All In One/E Series/ET10B'
##-identify the id of select list which contains the E-series-##
select = soup.find('select', id="myselectListModel")
option_tags = select.findAll('option')
##-omit first item in list as isn't part of the option-##
option_tags = option_tags[1:]
for option in option_tags:
    open(url + option['value'])
html = urllib2.urlopen("http://www.asusparts.eu/partfinder/")
soup = BeautifulSoup(html)
all = soup.find('div', id="accordion")
I am not sure if I am going about this the right way, as all the select menus make it confusing. Basically I need to grab all the data from the selected results, such as images, price, description, etc. They are all contained within one div tag which holds all the results, named 'accordion', so would this still gather all the data? Or would I need to dig deeper to search through the tags inside this div? Also I would have preferred to search by id rather than class, as I could fetch all the data in one go. How would I do this from what I have above? Thanks. I am also unsure whether I am using the glob function correctly or not.
EDIT
Here is my edited code; no errors are returned, however I am not sure if it returns all the models for the E Series?
import string, urllib2, urllib, csv, urlparse
from bs4 import BeautifulSoup
##-page which shows results after selecting one option-##
url = 'http://www.asusparts.eu/partfinder/Asus/All In One/E Series/ET10B'
base_url = 'http://www.asusparts.eu/' + url
print base_url
##-page to show all selections for the E-series-##
selected_list = urllib.quote(base_url + '/Asus/All In One/E Series/ET10B')
print urllib.quote(base_url + '/Asus/All In One/E Series/ET10B')
#selected_list = 'http://www.asusparts.eu/partfinder/Asus/All In One/E Series/ET10B'
##-
page = urllib2.urlopen('http://www.asusparts.eu/partfinder/Asus/All%20In%20One/E%20Series')
soup = BeautifulSoup(page)
print soup
##-identify the id of select list which contains the E-series-##
select = soup.find('select', id="myselectListModel")
option_tags = select.findAll('option')
print option_tags
##-omit first item in list as isn't part of the option-##
option_tags = option_tags[1:]
print option_tags
for option in option_tags:
    url + option['redirectvalue']
    print " " + url + option['redirectvalue']
First of all, I'd like to point out a couple of problems in the code you posted. First, the glob module is not typically used for making HTTP requests. It is useful for iterating through a subset of files on a specified path; you can read more about it in its docs.
The second issue is that in the line:
for file in glob.glob("http://www.asusparts.eu/partfinder/*"):
you have an indentation error, because there is no indented code that follows. This will raise an error and prevent the rest of the code from being executed.
Another problem is that you are using some of python's "reserved" names for your variables. You should never use words such as all or file for variable names.
Finally when you are looping through option_tags:
for option in option_tags:
open(url + option['value'])
The open statement will try and open a local file whose path is url + option['value']. This will likely raise an error, as I doubt you'll have a file at that location. In addition, you should be aware that you aren't doing anything with this open file.
Okay, so enough with the critique. I've taken a look at the asus page and I think I have an idea of what you want to accomplish. From what I understand, you want to scrape a list of parts (images, text, price, etc..) for each computer model on the asus page. Each model has its list of parts located at a unique URL (for example: http://www.asusparts.eu/partfinder/Asus/Desktop/B%20Series/BM2220). This means that you need to be able to create this unique URL for each model. To make matters more complicated, each parts category is loaded dynamically, so for example the parts for the "Cooling" section are not loaded until you click on the link for "Cooling". This means we have a two part problem: 1) Get all of the valid (brand, type, family, model) combinations and 2) Figure out how to load all the parts for a given model.
I was kind of bored and decided to write up a simple program that will take care of most of the heavy lifting. It isn't the most elegant thing out there, but it'll get the job done. Step 1) is accomplished in get_model_information(). Step 2) is taken care of in parse_models() but is a little less obvious. Taking a look at the asus website, whenever you click on a parts subsection the JavaScript function getProductsBasedOnCategoryID() is run, which makes an AJAX call to a formatted PRODUCTS_URL (see below). The response is some JSON information that is used to populate the section you clicked on.
import urllib2
import json
import urlparse
from bs4 import BeautifulSoup

BASE_URL = 'http://www.asusparts.eu/partfinder/'

PRODUCTS_URL = 'http://json.zandparts.com/api/category/GetCategories/'\
               '44/EUR/{model}/{family}/{accessory}/{brand}/null/'

ACCESSORIES = ['Cable', 'Cooling', 'Cover', 'HDD', 'Keyboard', 'Memory',
               'Miscellaneous', 'Mouse', 'ODD', 'PS', 'Screw']


def get_options(url, select_id):
    """
    Gets all the options from a select element.
    """
    r = urllib2.urlopen(url)
    soup = BeautifulSoup(r)
    select = soup.find('select', id=select_id)
    try:
        options = [option for option in select.strings]
    except AttributeError:
        print url, select_id, select
        raise
    return options[1:]  # The first option is the menu text


def get_model_information():
    """
    Finds all the models for each family, all the families and models for each
    type, and all the types, families, and models for each brand.

    These are all added as tuples (brand, type, family, model) to the list
    models.
    """
    model_info = []

    print "Getting brands"
    brand_options = get_options(BASE_URL, 'mySelectList')

    for brand in brand_options:
        print "Getting types for {0}".format(brand)
        # brand = brand.replace(' ', '%20')  # URL encode spaces
        brand_url = urlparse.urljoin(BASE_URL, brand.replace(' ', '%20'))
        types = get_options(brand_url, 'mySelectListType')

        for _type in types:
            print "Getting families for {0}->{1}".format(brand, _type)
            bt = '{0}/{1}'.format(brand, _type)
            type_url = urlparse.urljoin(BASE_URL, bt.replace(' ', '%20'))
            families = get_options(type_url, 'myselectListFamily')

            for family in families:
                print "Getting models for {0}->{1}->{2}".format(brand,
                                                                _type, family)
                btf = '{0}/{1}'.format(bt, family)
                fam_url = urlparse.urljoin(BASE_URL, btf.replace(' ', '%20'))
                models = get_options(fam_url, 'myselectListModel')

                model_info.extend((brand, _type, family, m) for m in models)

    return model_info


def parse_models(model_information):
    """
    Get all the information for each accessory type for every
    (brand, type, family, model). accessory_info will be the python formatted
    json results. You can parse, filter, and save this information or use
    it however suits your needs.
    """
    for brand, _type, family, model in model_information:
        for accessory in ACCESSORIES:
            r = urllib2.urlopen(PRODUCTS_URL.format(model=model, family=family,
                                                    accessory=accessory,
                                                    brand=brand))
            accessory_info = json.load(r)
            # Do something with accessory_info
            # ...


def main():
    models = get_model_information()
    parse_models(models)


if __name__ == '__main__':
    main()
Finally, one side note. I have dropped urllib2 in favor of the requests library. I personally think it provides much more functionality and has better semantics, but you can use whatever you would like.
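As a quick illustration (a sketch, not part of the code above), the JSON fetch inside parse_models() would look roughly like this with requests, reusing the same PRODUCTS_URL template and loop variables:

import requests

r = requests.get(PRODUCTS_URL.format(model=model, family=family,
                                     accessory=accessory, brand=brand))
accessory_info = r.json()  # requests decodes the JSON body for us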
I am having trouble building a basic spider program in Python. Whenever I try to run I get an error. The error occurs somewhere in the last seven lines of code.
#These modules do most of the work.
import sys
import urllib2
import urlparse
import htmllib, formatter
from cStringIO import StringIO
def log_stdout(msg):
    """Print msg to the screen."""
    print msg

def get_page(url, log):
    """Retrieve URL and return contents, log errors."""
    try:
        page = urllib2.urlopen(url)
    except urllib2.URLError:
        log("Error retrieving: " + url)
        return ''
    body = page.read()
    page.close()
    return body

def find_links(html):
    """Return a list of links in html."""
    # We're using the parser just to get the HREFs
    writer = formatter.DumbWriter(StringIO())
    f = formatter.AbstractFormatter(writer)
    parser = htmllib.HTMLParser(f)
    parser.feed(html)
    parser.close()
    return parser.anchorlist

class Spider:
    """
    The heart of this program, finds all links within a web site.

    run() contains the main loop.
    process_page() retrieves each page and finds the links.
    """
    def __init__(self, startURL, log=None):
        #This method sets initial values
        self.URLs = set()
        self.URLs.add(startURL)
        self.include = startURL
        self._links_to_process = [startURL]
        if log is None:
            # Use log_stdout function if no log provided
            self.log = log_stdout
        else:
            self.log = log

    def run(self):
        #Processes list of URLs one at a time
        while self._links_to_process:
            url = self._links_to_process.pop()
            self.log("Retrieving: " + url)
            self.process_page(url)

    def url_in_site(self, link):
        #Checks whether the link starts with the base URL
        return link.startswith(self.include)

    def process_page(self, url):
        #Retrieves page and finds links in it
        html = get_page(url, self.log)
        for link in find_links(html):
            #Handle relative links
            link = urlparse.urljoin(url, link)
            self.log("Checking: " + link)
            # Make sure this is a new URL within current site
            if link not in self.URLs and self.url_in_site(link):
                self.URLs.add(link)
                self._links_to_process.append(link)
The error message pertains to this block of code.
if __name__ == '__main__':
    #This code runs when script is started from command line
    startURL = sys.argv[1]
    spider = Spider(startURL)
    spider.run()
    for URL in sorted(spider.URLs):
        print URL
The error message:
startURL = sys.argv[1]
IndexError: list index out of range
You aren't calling your spider program with an argument. sys.argv[0] is your script file, and sys.argv[1] would be the first argument you pass it. The "list index out of range" means you didn't give it any arguments.
Try calling it as python spider.py http://www.example.com (with your actual URL).
This doesn't directly answer your question, but:
I would go with something like:
import lxml.html

START_PAGE = 'http://some.url.tld'
ahrefs = lxml.html.parse(START_PAGE).xpath('//a/@href')
Then use the available methods on lxml.html objects and multiprocess the links.
This handles "semi-malformed" HTML, and you can plug in the BeautifulSoup library.
A bit of work is required if you even want to attempt to follow JavaScript-generated links, but that's life!
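A rough sketch of the "multiprocess the links" part, assuming a hypothetical check_link worker that fetches one page and pulls its title (relative hrefs would need urlparse.urljoin against the start page first):

import lxml.html
from multiprocessing import Pool

def check_link(href):
    # Hypothetical worker: fetch one absolute URL and return its <title>.
    doc = lxml.html.parse(href).getroot()
    return href, doc.findtext('.//title')

if __name__ == '__main__':
    hrefs = lxml.html.parse('http://some.url.tld').xpath('//a/@href')
    pool = Pool(4)  # four worker processes
    for href, title in pool.map(check_link, hrefs):
        print href, '->', title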
How can I modify my script to skip a URL if the connection times out or is invalid/404?
Python
#!/usr/bin/python
#parser.py: Downloads Bibles and parses all data within <article> tags.
__author__ = "Cody Bouche"
__copyright__ = "Copyright 2012 Digital Bible Society"
from BeautifulSoup import BeautifulSoup
import lxml.html as html
import urlparse
import os, sys
import urllib2
import re
print ("downloading and parsing Bibles...")
root = html.parse(open('links.html'))
for link in root.findall('//a'):
    url = link.get('href')
    name = urlparse.urlparse(url).path.split('/')[-1]
    dirname = urlparse.urlparse(url).path.split('.')[-1]
    f = urllib2.urlopen(url)
    s = f.read()
    if (os.path.isdir(dirname) == 0):
        os.mkdir(dirname)
    soup = BeautifulSoup(s)
    articleTag = soup.html.body.article
    converted = str(articleTag)
    full_path = os.path.join(dirname, name)
    open(full_path, 'wb').write(converted)
    print(name)
print("DOWNLOADS COMPLETE!")
To apply the timeout to your request, add the timeout parameter to your call to urlopen. From the docs:
The optional timeout parameter specifies a timeout in seconds for
blocking operations like the connection attempt (if not specified, the
global default timeout setting will be used). This actually only works
for HTTP, HTTPS and FTP connections.
Refer to this guide's section on how to handle exceptions with urllib2. Actually I found the whole guide very useful.
The request timeout exception code is 408. Wrapping it up, if you were to handle timeout exceptions you would:
try:
    response = urlopen(req, timeout=3)  # 3 second timeout
except URLError, e:
    if hasattr(e, 'code'):
        if e.code == 408:
            print 'Timeout ', e.code
        if e.code == 404:
            print 'File Not Found ', e.code
        # etc etc
Try putting your urlopen line inside a try/except statement. Look this up:
docs.python.org/tutorial/errors.html section 8.3
Look at the different exceptions, and when you encounter one just restart the loop using the continue statement.
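Applied to the loop in the question, that would look roughly like this (a sketch; root is the lxml tree from the original script, and only the try/except/continue part is new):

import socket
import urllib2

for link in root.findall('//a'):
    url = link.get('href')
    try:
        s = urllib2.urlopen(url, timeout=10).read()  # skip slow or dead URLs
    except (urllib2.URLError, socket.timeout):
        print('Skipping %s' % url)
        continue  # move on to the next URL in the loop
    # ... parse s with BeautifulSoup and write the file as before ...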