How can I make this recursive crawl function iterative? - python

For the sake of performance (and curiosity), given this recursive web-crawling function, crawl (which crawls only within the given domain), what would be the best approach to make it run iteratively? Currently, by the time it finishes running, Python has climbed to using over 1 GB of memory, which isn't acceptable for a shared environment.
def crawl(self, url):
    "Get all URLS from which to scrape categories."
    try:
        links = BeautifulSoup(urllib2.urlopen(url)).findAll(Crawler._match_tag)
    except urllib2.HTTPError:
        return
    for link in links:
        for attr in link.attrs:
            if Crawler._match_attr(attr):
                if Crawler._is_category(attr):
                    pass
                elif attr[1] not in self._crawled:
                    self._crawled.append(attr[1])
                    self.crawl(attr[1])

Use a BFS instead of crawling recursively (DFS): http://en.wikipedia.org/wiki/Breadth_first_search
You can use an external storage solution (such as a database) for the BFS queue to free up RAM.
The algorithm is:
// pseudocode:
var urlsToVisit = new Queue(); // Could be a queue (BFS) or stack (DFS), probably with a database backing or something.
var visitedUrls = new Set();   // List of visited URLs.

// initialization:
urlsToVisit.Add( rootUrl );

while (urlsToVisit.Count > 0) {
    var nextUrl = urlsToVisit.FetchAndRemoveNextUrl();
    var page = FetchPage(nextUrl);
    ProcessPage(page);
    visitedUrls.Add(nextUrl);

    var links = ParseLinks(page);
    foreach (var link in links)
        if (!visitedUrls.Contains(link))
            urlsToVisit.Add(link);
}
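In Python, a rough sketch of that loop (my own, not from the answer) could use collections.deque as the BFS queue; get_links below is a hypothetical stand-in for the BeautifulSoup parsing in the original crawl():

from collections import deque

def crawl_bfs(root_url, get_links):
    """Breadth-first crawl starting at root_url.

    get_links(url) is assumed to return the list of URLs found on that page;
    it stands in for the OP's BeautifulSoup parsing.
    """
    urls_to_visit = deque([root_url])   # FIFO queue gives BFS; a stack would give DFS
    visited = set()
    while urls_to_visit:
        url = urls_to_visit.popleft()
        if url in visited:
            continue
        visited.add(url)
        for link in get_links(url):
            if link not in visited:
                urls_to_visit.append(link)

Swapping the deque for a database-backed table would move the pending queue out of RAM, as suggested above.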

Instead of recursing, you could put the new URLs to crawl into a queue, then run until the queue is empty, without recursion. If you put the queue into a file, this uses almost no memory at all.
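For illustration only, here is one rough way that file-backed queue could look (my own sketch; extract_links is a hypothetical stand-in for the OP's parsing, and the seen set still lives in memory, so only the pending queue is moved to disk):

def crawl_with_file_queue(root_url, extract_links, queue_path='url_queue.txt'):
    """Crawl using a plain text file as the pending-URL queue."""
    seen = set([root_url])              # visited URLs; could also live in a database
    with open(queue_path, 'w') as q:
        q.write(root_url + '\n')        # seed the queue
    with open(queue_path, 'r+') as q:
        while True:
            line = q.readline()
            if not line:                # read pointer reached the end: queue is empty
                break
            url = line.strip()
            new_links = [u for u in extract_links(url) if u not in seen]
            if new_links:
                seen.update(new_links)
                pos = q.tell()          # remember the current read position
                q.seek(0, 2)            # jump to the end of the file to append
                for u in new_links:
                    q.write(u + '\n')
                q.seek(pos)             # resume reading where we left off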

@Mehrdad - Thank you for your reply; the example you provided was concise and easy to understand.
The solution:
def crawl(self, url):
    urls = Queue(-1)
    _crawled = []

    urls.put(url)
    while not urls.empty():
        url = urls.get()
        try:
            links = BeautifulSoup(urllib2.urlopen(url)).findAll(Crawler._match_tag)
        except urllib2.HTTPError:
            continue
        for link in links:
            for attr in link.attrs:
                if Crawler._match_attr(attr):
                    if Crawler._is_category(attr):
                        continue
                    else:
                        Crawler._visit(attr[1])
                        if attr[1] not in _crawled:
                            _crawled.append(attr[1])  # record it so it isn't queued again
                            urls.put(attr[1])

You can do it pretty easily just by using links as a queue:
def get_links(url):
    "Extract all matching links from a url"
    try:
        links = BeautifulSoup(urllib2.urlopen(url)).findAll(Crawler._match_tag)
        return links
    except urllib2.HTTPError:
        return []

def crawl(self, url):
    "Get all URLS from which to scrape categories."
    links = get_links(url)
    while len(links) > 0:
        link = links.pop()
        for attr in link.attrs:
            if Crawler._match_attr(attr):
                if Crawler._is_category(attr):
                    pass
                elif attr[1] not in self._crawled:
                    self._crawled.append(attr[1])
                    # prepend the new links to the queue
                    links = get_links(attr[1]) + links
Of course, this doesn't solve the memory problem...

Related

Automatically Add Layers to a Dictionary in Python

I'm creating a program that uses a dictionary to store large trees of web links in Python. Basically, you start with the root URL, and that creates a dictionary based on the URLs found from the HTML of the root. In the next step, I want to get the pages of each of those URLs and grab the links on those URLs. Eventually, I want to have a dictionary with all the links in it and their relation to each other.
This is what I have for the first two depths
for link in soup.find_all('a'):
    url = link.get('href')
    url_tree[siteurl][url]

# Get page source
for link in soup.find_all('a'):
    url = link.get('href')
    url_tree[siteurl][secondurl][url]
This system works, but as you can tell, if I want a dictionary N layers deep, that gets to be a lot of blocks of code. Is there a way to automatically add more layers? Any help is appreciated!
This can be done using recursive functions.
Here is a basic example which will crawl all the URLs found in a page one by one, then crawl all the URLs found in each of those pages, and so on. It will also print out every URL it finds.
def recursive_fetch(url_to_fetch):
    # get page source from url_to_fetch
    # make a new soup
    for link in soup.find_all('a'):
        url = link.get('href')
        print url
        # run the function recursively
        # for the current url
        recursive_fetch(url)

# Usage
recursive_fetch(root_url)
Since you want a dict tree of all the URLs found, the above code isn't much help, but it's a start.
This is where it gets really complex, because now you'll also need to keep track of the parent of the current URL being crawled, the parent of that URL, the parent of that one, and so on.
You see what I mean? It gets very complex, very fast.
Below is the code that does all that. I've written comments in the code to explain it as best I can. But you'll need to actually understand how recursive functions work to follow it fully.
First, let's look at another function which will be very helpful in getting the parent of a URL from the tree:
def get_parent(tree, parent_list):
    """How it works:

    Let's say the `tree` looks like this:

        tree = {
            'root-url': {
                'link-1': {
                    'link-1-a': {...}
                }
            }
        }

    and `parent_list` looks like this:

        parent_list = ['root-url', 'link-1', 'link-1-a']

    this function will chain the values in the list and
    perform a dict lookup like this:

        tree['root-url']['link-1']['link-1-a']
    """
    first, rest = parent_list[0], parent_list[1:]
    try:
        if tree[first] and rest:
            # if tree[first] and rest aren't empty,
            # run the function recursively
            return get_parent(tree[first], rest)
        else:
            return tree[first]
    except KeyError:
        # this is required for creating the
        # root_url dict in the tree
        # because it doesn't exist yet
        tree[first] = {}
        return tree[first]
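A quick usage sketch (mine, using the tree shape from the docstring above):

tree = {'root-url': {'link-1': {'link-1-a': {}}}}
print get_parent(tree, ['root-url', 'link-1', 'link-1-a'])  # -> {} (the innermost dict)
print get_parent(tree, ['root-url', 'link-2'])              # -> {}, and 'link-2' is now created under 'root-url'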
And the recursive_fetch function will look like this:
url_tree = {}  # dict to store the url tree

def recursive_fetch(fetch_url, parents=None):
    """
    `parents` is a list of parents of the current url
    Example:
        parents = ['root-url', 'link-1', ... 'parent-link']
    """
    parents = parents or []
    parents.append(fetch_url)

    # get page source from fetch_url
    # make new soup object

    for link in soup.find_all('a'):
        url = link.get('href')

        if parents:
            parent = get_parent(url_tree, parents)
        else:
            parent = None

        if parent is not None:
            # this will run when parent is not None,
            # i.e. even if parent is an empty dict {}
            # create a new dict for the current url
            # inside the parent dict
            parent[url] = {}
        else:
            # this url has no parent,
            # insert it directly in the url_tree
            url_tree[url] = {}

        # now crawl the current url
        recursive_fetch(url, parents)

    # Next is the most important block of code.
    # Whenever one recursion completes,
    # it will pop the last parent from
    # the `parents` list so that in the
    # next recursion, the parents are correct.
    # Without this block, the url_tree wouldn't
    # look as expected.
    # It took me many hours to figure this out
    try:
        parents.pop(-1)
    except IndexError:
        pass
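Once the two page-fetching placeholders are filled in, a hypothetical run would look like this (root_url being whatever page you start from):

import pprint

recursive_fetch(root_url)
pprint.pprint(url_tree)   # inspect the nested tree of discovered links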

How to properly use checkpoint to cache results from asynchronous web requests in Python?

I would like to use ediblepickle's checkpoint to cache the results of parsing a list of URLs fetched asynchronously with the request-futures module.
I want to create a checkpoint to cache the results for each individual URL (i.e. one cache file per URL), so that I don't need to request and parse the same webpage again. However, I am confused about how that would be implemented when I am using request-futures to handle the HTTP requests.
Implementation 1:
I think it doesn't make sense to wrap a checkpoint cache around this whole function, because if one URL in the link_list input changes, my cached results would be useless.
from requests_futures.sessions import FuturesSession

def get_all_captions(link_list, s=None):
    # Request/parse different pages (contained in link_list) within the
    # base_url website asynchronously
    base_url = r'***BASE URL OF WEBSITE****'
    if s is None:
        s = FuturesSession()

    ## make async requests to the different pages
    future = []
    for link, _ in link_list:
        future.append(s.get(base_url + link))

    ## wait for results and parse
    caption_list = []
    for i in range(len(link_list)):
        r = future[i].result().text
        ####
        #### code to parse webpage context for some photo captions and store in caption_list
        ####
    return caption_list
Implementation 2:
If I create a sub-function like get_captions_from_one_page() below, then the whole process becomes synchronous and defeats the purpose.
def get_all_captions(link_list, s=None):
    base_url = r'***BASE URL OF WEBSITE ***'
    if s is None:
        s = FuturesSession()

    caption_list = []
    for link, _ in link_list:
        caption_list += get_captions_from_one_page(base_url + link, s)
    return caption_list

@checkpoint(key=lambda args, kargs: urllib2.quote(args[0].split('/')[-1]) + '.p', work_dir=cache_dir, refresh=False)
def get_captions_from_one_page(link, s):
    future = s.get(link)
    r = future.result().text
    ### code to parse results and return caption_list for this page
    return caption_list
What is the proper way to use checkpoint with request-futures to cache the results of each single webpage? Thanks!
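One possible direction (my own sketch, not an answer from the original thread): keep the per-URL cache by checkpointing a synchronous single-page function, and get the concurrency from a thread pool instead of FuturesSession. The cache_dir value and the use of the requests library here are assumptions, and the parsing code is still elided as in the question.

import os
import urllib2
from concurrent.futures import ThreadPoolExecutor   # 'futures' backport on Python 2
import requests
from ediblepickle import checkpoint

cache_dir = 'caption_cache'                          # assumption: any writable directory
if not os.path.exists(cache_dir):
    os.makedirs(cache_dir)

@checkpoint(key=lambda args, kwargs: urllib2.quote(args[0].split('/')[-1]) + '.p',
            work_dir=cache_dir, refresh=False)
def get_captions_from_one_page(link):
    """Fetch and parse one page; the result is cached in its own file, keyed by the URL."""
    r = requests.get(link).text
    caption_list = []
    # ... code to parse r for photo captions and fill caption_list ...
    return caption_list

def get_all_captions(link_list):
    base_url = r'***BASE URL OF WEBSITE***'
    urls = [base_url + link for link, _ in link_list]
    with ThreadPoolExecutor(max_workers=8) as pool:
        results = pool.map(get_captions_from_one_page, urls)
    return [caption for page in results for caption in page]

Since each checkpointed call is independent, previously cached URLs return immediately from disk while uncached ones are fetched in parallel by the pool.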

Scrapy Deploy Doesn't Match Debug Result

I am using Scrapy to extract some data from a site, say "myproject.com". Here is the logic:
Go to the homepage, where there are some category-list links to be used to build the second wave of links.
The second round of links are usually the first page of each category. The different pages inside a category follow the same regular-expression pattern, wholesale/something/something/request or wholesale/pagenumber, and I want to follow those patterns to keep crawling while storing the raw HTML in my item object.
I tested these two steps separately by using parse, and they both worked.
First, I tried:
scrapy parse http://www.myproject.com/categorylist/cat_a --spider myproject --rules
I can see it builds the outlinks successfully. Then I tested one of the built outlinks:
scrapy parse http://www.myproject.com/wholesale/cat_a/request/1 --spider myproject --rules
It seems like the rule is correct and it generates an item with the HTML stored in it.
However, when I tried to link those two steps together by using the depth argument, I saw it crawl the outlinks, but no items got generated:
scrapy parse http://www.myproject.com/categorylist/cat_a --spider myproject --rules --depth 2
Here is the pseudo code:
class MyprojectSpider(CrawlSpider):
    name = "Myproject"
    allowed_domains = ["Myproject.com"]
    start_urls = ["http://www.Myproject.com/"]

    rules = (
        Rule(LinkExtractor(allow=('/categorylist/\w+',)), callback='parse_category', follow=True),
        Rule(LinkExtractor(allow=('/wholesale/\w+/(?:wholesale|request)/\d+',)), callback='parse_pricing', follow=True),
    )

    def parse_category(self, response):
        try:
            soup = BeautifulSoup(response.body)
            ...
            my_request1 = Request(url=myurl1)
            yield my_request1
            my_request2 = Request(url=myurl2)
            yield my_request2
        except:
            pass

    def parse_pricing(self, response):
        item = MyprojectItem()
        try:
            item['myurl'] = response.url
            item['myhtml'] = response.body
            item['mystatus'] = 'fetched'
        except:
            item['mystatus'] = 'failed'
        return item
Thanks a lot for any suggestion!
I was assuming the new Request objects that I built would run against the rules and then be parsed by the corresponding callback function defined in the Rule. However, after reading the documentation of Request, the callback is handled in a different way.
class scrapy.http.Request(url[, callback, method='GET', headers, body, cookies, meta, encoding='utf-8', priority=0, dont_filter=False, errback])
callback (callable) – the function that will be called with the response of this request (once its downloaded) as its first parameter. For more information see Passing additional data to callback functions below. If a Request doesn’t specify a callback, the spider’s parse() method will be used. Note that if exceptions are raised during processing, errback is called instead.
...
my_request1 = Request(url=myurl1, callback=self.parse_pricing)
yield my_request1
my_request2 = Request(url=myurl2, callback=self.parse_pricing)
yield my_request2
...
In other words, even if the URLs I built match the second rule, they won't be passed to parse_pricing unless the callback is set explicitly. Hope this is helpful to other people.

How to return multiple values in python

I am currently working on a spider, but I need to be able to call the spider() function more than once to follow links. Here is my code:
import httplib, sys, re

def spider(target, link):
    try:
        conn = httplib.HTTPConnection(target)
        conn.request("GET", "/")
        r2 = conn.getresponse()
        data = r2.read().split('\n')
        for x in data[:]:
            if link in x:
                a = ''.join(re.findall("href=([^ >]+)", x))
                a = a.translate(None, '''"'"''')
                if a:
                    return a
    except:
        exit(0)

print spider("www.yahoo.com", "http://www.yahoo.com")
but I only get one link from the output; how can I make it return all the links?
Also, how can I get the sub-sites from those links so the spider can follow them?
This is probably closer to what you're looking for
import httplib, sys, re

def spider(link, depth=0):
    if depth > 2:
        return []
    try:
        conn = httplib.HTTPConnection(link)
        conn.request("GET", "/")
        r2 = conn.getresponse()
        data = r2.read().split('\n')
        links = []
        for x in data[:]:
            if link in x:
                a = ''.join(re.findall("href=([^ >]+)", x))
                a = a.translate(None, '"' + "'")
                if a:
                    links.append(a)
        # Recurse for each link (iterate over a copy, since we extend links below)
        for link in links[:]:
            links += spider(link, depth + 1)
        return links
    except:
        exit(1)

print spider("http://www.yahoo.com")
It's untested, but the basics are there. Scrape all the links, then recursively crawl them. The function returns a list of links on the page on each call. And when a page is recursively crawled, those links that are returned by the recursive call are added to this list. The code also has a max recursion depth so you don't go forever.
It has some obvious oversights, though, like no cycle detection.
A few sidenotes, there are better ways to do some of this stuff.
For example, urllib2 can fetch webpages for you a lot easier than using httplib.
And BeautifulSoup extracts links from web pages better than your regex + translate kluge.
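For instance, a minimal sketch of that suggestion (my own; the BeautifulSoup 3 import matches the other snippets on this page, bs4 users would import from bs4 instead):

import urllib2
from BeautifulSoup import BeautifulSoup   # for bs4: from bs4 import BeautifulSoup

def get_hrefs(url):
    """Return the href value of every <a> tag on the page at url."""
    html = urllib2.urlopen(url).read()
    soup = BeautifulSoup(html)
    return [tag['href'] for tag in soup.findAll('a', href=True)]

print get_hrefs("http://www.yahoo.com")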
Following doorknob's hint, if you just change the return a to yield a, your function becomes a generator. Instead of calling it and getting back a result, you call it and get back an iterator—something you can loop over.
So, change your if block to this:
if link in x:
    a = ''.join(re.findall("href=([^ >]+)", x))
    a = a.translate(None, '''"'"''')
    if a:
        yield a
Then change your print statement to this:
for a in spider("www.yahoo.com", "http://www.yahoo.com"):
    print a
And you're done.
However, I'm guessing you didn't really want to join up the findall; you wanted to loop over each "found" thing separately. How do you fix that? Easy, just loop around the re.findall, and yield once per loop:
if link in x:
    for a in re.findall("href=([^ >]+)", x):
        a = a.translate(None, '''"'"''')
        if a:
            yield a
For a more detailed explanation of how generators and iterators work, see this presentation.

Basic Spider Program will not run

I am having trouble building a basic spider program in Python. Whenever I try to run I get an error. The error occurs somewhere in the last seven lines of code.
# These modules do most of the work.
import sys
import urllib2
import urlparse
import htmllib, formatter
from cStringIO import StringIO

def log_stdout(msg):
    """Print msg to the screen."""
    print msg

def get_page(url, log):
    """Retrieve URL and return contents, log errors."""
    try:
        page = urllib2.urlopen(url)
    except urllib2.URLError:
        log("Error retrieving: " + url)
        return ''
    body = page.read()
    page.close()
    return body

def find_links(html):
    """Return a list of links in html."""
    # We're using the parser just to get the HREFs
    writer = formatter.DumbWriter(StringIO())
    f = formatter.AbstractFormatter(writer)
    parser = htmllib.HTMLParser(f)
    parser.feed(html)
    parser.close()
    return parser.anchorlist

class Spider:
    """
    The heart of this program, finds all links within a web site.

    run() contains the main loop.
    process_page() retrieves each page and finds the links.
    """
    def __init__(self, startURL, log=None):
        # This method sets initial values
        self.URLs = set()
        self.URLs.add(startURL)
        self.include = startURL
        self._links_to_process = [startURL]
        if log is None:
            # Use log_stdout function if no log provided
            self.log = log_stdout
        else:
            self.log = log

    def run(self):
        # Processes list of URLs one at a time
        while self._links_to_process:
            url = self._links_to_process.pop()
            self.log("Retrieving: " + url)
            self.process_page(url)

    def url_in_site(self, link):
        # Checks whether the link starts with the base URL
        return link.startswith(self.include)

    def process_page(self, url):
        # Retrieves page and finds links in it
        html = get_page(url, self.log)
        for link in find_links(html):
            # Handle relative links
            link = urlparse.urljoin(url, link)
            self.log("Checking: " + link)
            # Make sure this is a new URL within current site
            if link not in self.URLs and self.url_in_site(link):
                self.URLs.add(link)
                self._links_to_process.append(link)
The error message pertains to this block of code.
if __name__ == '__main__':
    # This code runs when script is started from command line
    startURL = sys.argv[1]
    spider = Spider(startURL)
    spider.run()
    for URL in sorted(spider.URLs):
        print URL
The error message:
startURL = sys.argv[1]
IndexError: list index out of range
You aren't calling your spider program with an argument. sys.argv[0] is your script file, and sys.argv[1] would be the first argument you pass it. The "list index out of range" means you didn't give it any arguments.
Try calling it as python spider.py http://www.example.com (with your actual URL).
This doesn't directly answer your question, but:
I would go with something like:

import lxml.html

START_PAGE = 'http://some.url.tld'
ahrefs = lxml.html.parse(START_PAGE).xpath('//a/@href')

Then use the available methods on lxml.html objects and multiprocess the links.
This handles "semi-malformed" HTML, and you can plug in the BeautifulSoup library.
A bit of work is required if you even want to attempt to follow JavaScript-generated links, but that's life!
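A rough sketch of that approach (mine; the fetch_hrefs helper, the pool size, and the http filter are assumptions, not part of the answer):

import lxml.html
from multiprocessing import Pool

START_PAGE = 'http://some.url.tld'

def fetch_hrefs(url):
    """Return all absolute href values found on the page at url."""
    doc = lxml.html.parse(url).getroot()
    doc.make_links_absolute(url)             # resolve relative links against the page URL
    return doc.xpath('//a/@href')

if __name__ == '__main__':
    links = [l for l in fetch_hrefs(START_PAGE) if l.startswith('http')]
    pool = Pool(4)
    results = pool.map(fetch_hrefs, links)   # fetch the linked pages in parallel
    pool.close()
    pool.join()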
