Automatically Add Layers to a Dictionary in Python

I'm creating a program that uses a dictionary to store large trees of web links in Python. Basically, you start with the root URL, and that creates a dictionary based on the URLs found in the root's HTML. In the next step, I want to fetch the pages at each of those URLs and grab the links on those pages. Eventually, I want a dictionary with all the links in it and their relation to each other.
This is what I have for the first two depths:
for link in soup.find_all('a'):
    url = link.get('href')
    url_tree[siteurl][url]

# Get page source
for link in soup.find_all('a'):
    url = link.get('href')
    url_tree[siteurl][secondurl][url]
This system works, but as you can tell, if I want a dictionary N layers deep, that gets to be a lot of blocks of code. Is there a way to automatically add more layers? Any help is appreciated!

This can be done using recursive functions.
A basic example which will crawl all the urls found on a page one by one, then crawl all the urls found on each of those pages one by one, and so on. It will also print out every url it finds.
def recursive_fetch(url_to_fetch):
    # get page source from url_to_fetch
    # make a new soup
    for link in soup.find_all('a'):
        url = link.get('href')
        print url
        # run the function recursively
        # for the current url
        recursive_fetch(url)

# Usage
recursive_fetch(root_url)
Since you want a dict tree of all the urls found, the above code isn't much help, but it's a start.
This is where it gets really complex, because now you'll also need to keep track of the parent of the current url being crawled, the parent of that url, the parent of that url, and so on...
You see what I mean? It gets very complex, very fast.
Below is the code that does all that. I've written comments in the code to explain it as best as I can, but you'll need to actually understand how recursive functions work to follow it fully.
First, let's look at another function which will be very helpful in getting the parent of a url from the tree:
def get_parent(tree, parent_list):
    """How it works:

    Let's say the `tree` looks like this:

        tree = {
            'root-url': {
                'link-1': {
                    'link-1-a': {...}
                }
            }
        }

    and `parent_list` looks like this:

        parent_list = ['root-url', 'link-1', 'link-1-a']

    this function will chain the values in the list and
    perform a dict lookup like this:

        tree['root-url']['link-1']['link-1-a']
    """
    first, rest = parent_list[0], parent_list[1:]
    try:
        if tree[first] and rest:
            # if tree[first] and rest aren't empty,
            # run the function recursively
            return get_parent(tree[first], rest)
        else:
            return tree[first]
    except KeyError:
        # this is required for creating the
        # root_url dict in the tree
        # because it doesn't exist yet
        tree[first] = {}
        return tree[first]
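To see how the lookup chains together, here's a quick, self-contained check (not part of the original code) against the example tree from the docstring:

tree = {
    'root-url': {
        'link-1': {
            'link-1-a': {}
        }
    }
}
print get_parent(tree, ['root-url', 'link-1'])              # -> {'link-1-a': {}}
print get_parent(tree, ['root-url', 'link-1', 'link-1-a'])  # -> {}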
And the recursive_fetch function will look like this:
url_tree = {}  # dict to store the url tree

def recursive_fetch(fetch_url, parents=None):
    """
    `parents` is a list of parents of the current url
    Example:
        parents = ['root-url', 'link-1', ... 'parent-link']
    """
    parents = parents or []
    parents.append(fetch_url)

    # get page source from fetch_url
    # make new soup object

    for link in soup.find_all('a'):
        url = link.get('href')

        if parents:
            parent = get_parent(url_tree, parents)
        else:
            parent = None

        if parent is not None:
            # this will run when parent is not None,
            # i.e. even when parent is an empty dict {}
            # create a new dict for the current url
            # inside the parent dict
            parent[url] = {}
        else:
            # this url has no parent,
            # insert it directly into the url_tree
            url_tree[url] = {}

        # now crawl the current url
        recursive_fetch(url, parents)

        # Next is the most important block of code.
        # Whenever one recursion completes, it pops
        # the last parent from the `parents` list so
        # that the parents are correct in the next
        # recursion. Without this block, the url_tree
        # wouldn't look as expected.
        # It took me many hours to figure this out.
        try:
            parents.pop(-1)
        except IndexError:
            pass
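With the page-fetching placeholders filled in, a crawl starting from a hypothetical root would leave url_tree shaped roughly like this (the urls here are made up purely for illustration):

url_tree = {
    'http://example.com/': {
        'http://example.com/page-1': {
            'http://example.com/page-1-a': {},
        },
        'http://example.com/page-2': {},
    },
}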

Related

AttributeError: 'set' object has no attribute 'popleft'

I am trying to create a site-map generator. In a nutshell, I feed it a link, it looks for more links on the site, and so on.
To avoid any long limbo-chains I thought I'd create a blocked_sites.txt which I can read from, compare my unprocessed_urls to, and remove all the items that CONTAIN a blocker.
My problem is that, being naive, I thought I could simply do some set/list comparing and removing and, voila, done. The problem turned out to be bigger, mainly because of the collections type *deque*.
The code
I start off by defining my starting url, which is the user input, and I add it to a queue:
# a queue of urls to be crawled
unprocessed_urls = deque([starting_url])
Now comes the part where I'll start handling my urls:
# process urls one by one from unprocessed_urls queue until queue is empty
while len(unprocessed_urls):
    # Remove unwanted items
    unprocessed_urls = {url for url in unprocessed_urls if not any(blocker in url for blocker in blockers)}  # <-- THIS IS THE PROBLEM

    # move next url from the queue to the set of processed urls
    newurl = unprocessed_urls.popleft()
    processed_urls.add(newurl)

    # extract base url to resolve relative links
    parts = urlsplit(newurl)
    base_url = "{0.scheme}://{0.netloc}".format(parts)
    if parts.scheme != 'mailto' and parts.scheme != '#':
        path = newurl[:newurl.rfind('/')+1] if '/' in parts.path else newurl
    else:
        continue

    # get url's content
    print(Fore.CYAN + "Crawling URL %s" % newurl + Fore.WHITE)
    try:
        response = requests.get(newurl, timeout=3)
So the problem is that the program shouldn't go onto big sites that I have explicitly defined to be blocked, like so:
# Blockers
blockers = set(line.strip() for line in open('blocked_sites.txt'))
And to strip the unwanted urls from unprocessed_urls, I use this line (also pointed out in the code above):
# Remove unwanted items
unprocessed_urls = {url for url in unprocessed_urls if not any(blocker in url for blocker in blockers)}
Thus we find ourselves here:
AttributeError: 'set' object has no attribute 'popleft'
What I could deduce from this is that by attempting to remove the unwanted items, I'm somehow altering the type of the collection.
I don't really know how to move forward from here.
The line unprocessed_urls = {...} creates a new set object and assigns it to unprocessed_urls. The fact that this new value is logically similar to the old value is irrelevant; assigning to a variable overwrites whatever was in there before.
However, a collections.deque can be created from any iterable, so you can instead do
unprocessed_urls = deque(url for url in unprocessed_urls if ...)
to create a new collections.deque so that all values you assign to unprocessed_urls will have the same type.
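For illustration, here's a minimal, self-contained sketch of that fix (the example urls and blockers below are made up):

from collections import deque

blockers = {'facebook.com', 'twitter.com'}
unprocessed_urls = deque([
    'http://example.com/page',
    'http://facebook.com/some-profile',
    'http://example.com/other',
])

# rebuild the deque instead of replacing it with a set
unprocessed_urls = deque(
    url for url in unprocessed_urls
    if not any(blocker in url for blocker in blockers)
)

print(unprocessed_urls.popleft())  # 'http://example.com/page' -- popleft() still works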

Created my first webcrawler, how do I get a "URL stack-trace"/history of URL's for each endpoint?

I created a web crawler that, given a base_url, will spider out and find all possible endpoints. While I am able to get all the endpoints, I need a way to figure out how I got there in the first place -- a 'url stack-trace' per se, or breadcrumbs of urls leading to each endpoint.
I first start by finding all urls given a base url. Since the sublinks I'm looking for are within a json, I thought the best way to do this would be using a variation of a recursive dictionary example I found here: http://www.saltycrane.com/blog/2011/10/some-more-python-recursion-examples/:
import requests
import pytest
import time

BASE_URL = "https://www.my-website.com/"

def get_leaf_nodes_list(base_url):
    """
    :base_url: The starting point to crawl
    :return: List of all possible endpoints
    """
    class Namespace(object):
        # A wrapper class used to create a Namespace instance
        # that holds the ns.results variable
        pass

    ns = Namespace()
    ns.results = []

    r = requests.get(BASE_URL)
    time.sleep(0.5)  # so we don't cause a DDOS?
    data = r.json()

    def dict_crawler(data):
        # Retrieve all nodes from the nested dict
        if isinstance(data, dict):
            for item in data.values():
                dict_crawler(item)
        elif isinstance(data, list) or isinstance(data, tuple):
            for item in data:
                dict_crawler(item)
        else:
            if type(data) is unicode:
                if "http" in data:  # If http in value, keep going
                    # If data is not already in ns.results, append it
                    if str(data) not in ns.results:
                        ns.results.append(data)
                        sub_r = requests.get(data)
                        time.sleep(0.5)  # so we don't cause a DDOS?
                        sub_r_data = sub_r.json()
                        dict_crawler(sub_r_data)

    dict_crawler(data)
    return ns.results
To reiterate, get_leaf_nodes_list does a GET request and looks through the json values for urls (checking whether the string 'http' is in each value), then recursively does more GET requests until there are no urls left.
So, to reiterate, here are the questions I have:
How do I get a linear history of all the urls I hit to get to each endpoint?
As a corollary, how would I store this history? As the leaf-nodes list grows, my process gets exponentially slower, and I am wondering if there's a better data type out there to store this information or a more efficient approach than the code above.
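One possible direction, sketched under assumptions (the trails dict and the path argument are new names introduced here, and this hasn't been run against the real API): pass the path taken so far into each recursive call and record it whenever a new url is found.

import requests

trails = {}  # maps each discovered url -> the list of urls that led to it

def dict_crawler(data, path):
    if isinstance(data, dict):
        for item in data.values():
            dict_crawler(item, path)
    elif isinstance(data, (list, tuple)):
        for item in data:
            dict_crawler(item, path)
    elif isinstance(data, str) and "http" in data:  # use unicode on Python 2, as in the original
        if data not in trails:
            trails[data] = path                  # the breadcrumb for this endpoint
            sub_data = requests.get(data).json()
            dict_crawler(sub_data, path + [data])

# Usage: dict_crawler(requests.get(BASE_URL).json(), [BASE_URL])
# For the slowdown: checking membership in a set of seen urls instead of a
# list avoids the linear-time `in` test as the results grow.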

Scrapy Deploy Doesn't Match Debug Result

I am using Scrapy to extract some data from a site, say "myproject.com". Here is the logic:
Go to the homepage, where there are some category-list links that will be used to build the second wave of links.
For the second round of links, they are usually the first page of each category. Also, different pages inside a category follow the same regular-expression pattern, wholesale/something/something/request or wholesale/pagenumber. I want to follow those patterns to keep crawling and meanwhile store the raw HTML in my item object.
I tested these two steps separately using the parse command, and they both worked.
First, I tried:
scrapy parse http://www.myproject.com/categorylist/cat_a --spider myproject --rules
And I can see it built the outlinks successfully. Then I tested the built outlink again.
scrapy parse http://www.myproject.com/wholesale/cat_a/request/1 --spider myproject --rules
And it seems like the rule is correct and it generates an item with the HTML stored in it.
However, when I tried to link those two steps together by using the depth argument, I saw that it crawled the outlinks but no items got generated.
scrapy parse http://www.myproject.com/categorylist/cat_a --spider myproject --rules --depth 2
Here is the pseudo code:
class MyprojectSpider(CrawlSpider):
    name = "Myproject"
    allowed_domains = ["Myproject.com"]
    start_urls = ["http://www.Myproject.com/"]

    rules = (
        Rule(LinkExtractor(allow=('/categorylist/\w+',)), callback='parse_category', follow=True),
        Rule(LinkExtractor(allow=('/wholesale/\w+/(?:wholesale|request)/\d+',)), callback='parse_pricing', follow=True),
    )

    def parse_category(self, response):
        try:
            soup = BeautifulSoup(response.body)
            ...
            my_request1 = Request(url=myurl1)
            yield my_request1
            my_request2 = Request(url=myurl2)
            yield my_request2
        except:
            pass

    def parse_pricing(self, response):
        item = MyprojectItem()
        try:
            item['myurl'] = response.url
            item['myhtml'] = response.body
            item['mystatus'] = 'fetched'
        except:
            item['mystatus'] = 'failed'
        return item
Thanks a lot for any suggestion!
I was assuming that the new Request objects I built would be run against the rules and then parsed by the corresponding callback function defined in the Rule. However, after reading the documentation of Request, I found the callback method is handled in a different way.
class scrapy.http.Request(url[, callback, method='GET', headers, body, cookies, meta, encoding='utf-8', priority=0, dont_filter=False, errback])
callback (callable) – the function that will be called with the response of this request (once its downloaded) as its first parameter. For more information see Passing additional data to callback functions below. If a Request doesn’t specify a callback, the spider’s parse() method will be used. Note that if exceptions are raised during processing, errback is called instead.
...
my_request1 = Request(url=myurl1, callback=self.parse_pricing)
yield my_request1
my_request2 = Request(url=myurl2, callback=self.parse_pricing)
yield my_request2
...
In other words, even if the URLs I build match the second rule, they won't be passed to parse_pricing automatically. Hope this is helpful to other people.

getting data from webpage with select menus with python and beautifulsoup

I am trying to collect data from a webpage which has a bunch of select lists I need to fetch data from. Here is the page: http://www.asusparts.eu/partfinder/Asus/All In One/E Series/
And this is what I have so far:
import glob, string
from bs4 import BeautifulSoup
import urllib2, csv

for file in glob.glob("http://www.asusparts.eu/partfinder/*"):

##-page to show all selections for the E-series-##
selected_list = 'http://www.asusparts.eu/partfinder/Asus/All In One/E Series/'
##-
page = urllib2.urlopen(selected_list)
soup = BeautifulSoup(page)

##-page which shows results after selecting one option-##
url = 'http://www.asusparts.eu/partfinder/Asus/All In One/E Series/ET10B'

##-identify the id of select list which contains the E-series-##
select = soup.find('select', id="myselectListModel")
option_tags = select.findAll('option')

##-omit first item in list as isn't part of the option-##
option_tags = option_tags[1:]

for option in option_tags:
    open(url + option['value'])

html = urllib2.urlopen("http://www.asusparts.eu/partfinder/")
soup = BeautifulSoup(html)

all = soup.find('div', id="accordion")
I am not sure if I am going about this the right way, as all the select menus make it confusing. Basically, I need to grab all the data from the selected results, such as images, price, description, etc. They are all contained within one div tag which holds all the results, named 'accordion', so would this still gather all the data? Or would I need to dig deeper and search through the tags inside this div? Also, I would have preferred to search by id rather than class, as I could fetch all the data in one go. How would I do this from what I have above? Thanks. Also, I am unsure whether I am using the glob function correctly or not.
EDIT
Here is my edited code. No errors are returned, however I am not sure if it returns all the models for the E-series.
import string, urllib2, urllib, csv, urlparse
from bs4 import BeautifulSoup

##-page which shows results after selecting one option-##
url = 'http://www.asusparts.eu/partfinder/Asus/All In One/E Series/ET10B'
base_url = 'http://www.asusparts.eu/' + url
print base_url

##-page to show all selections for the E-series-##
selected_list = urllib.quote(base_url + '/Asus/All In One/E Series/ET10B')
print urllib.quote(base_url + '/Asus/All In One/E Series/ET10B')
#selected_list = 'http://www.asusparts.eu/partfinder/Asus/All In One/E Series/ET10B'
##-
page = urllib2.urlopen('http://www.asusparts.eu/partfinder/Asus/All%20In%20One/E%20Series')
soup = BeautifulSoup(page)
print soup

##-identify the id of select list which contains the E-series-##
select = soup.find('select', id="myselectListModel")
option_tags = select.findAll('option')
print option_tags

##-omit first item in list as isn't part of the option-##
option_tags = option_tags[1:]
print option_tags

for option in option_tags:
    url + option['redirectvalue']
    print " " + url + option['redirectvalue']
First of all, I'd like to point out a couple of problems in the code you posted. First, the glob module is not typically used for making HTTP requests. It is useful for iterating through a subset of files on a specified path; you can read more about it in its docs.
The second issue is that in the line:
for file in glob.glob("http://www.asusparts.eu/partfinder/*"):
you have an indentation error, because there is no indented code that follows it. This will raise an error and prevent the rest of the code from being executed.
Another problem is that you are shadowing some of Python's built-in names with your variables. You should never use names such as all or file for variables.
Finally, when you are looping through option_tags:
for option in option_tags:
    open(url + option['value'])
The open statement will try to open a local file whose path is url + option['value']. This will likely raise an error, as I doubt you'll have a file at that location. In addition, you should be aware that you aren't doing anything with this open file.
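If the intent was to fetch each option's page over HTTP, something along these lines would be closer (a sketch only, reusing url and option_tags from the code above; whether url + option['value'] forms a valid address depends on the site):

for option in option_tags:
    response = urllib2.urlopen(url + option['value'])
    html = response.read()
    # ... parse `html` with BeautifulSoup here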
Okay, so enough with the critique. I've taken a look at the asus page and I think I have an idea of what you want to accomplish. From what I understand, you want to scrape a list of parts (images, text, price, etc..) for each computer model on the asus page. Each model has its list of parts located at a unique URL (for example: http://www.asusparts.eu/partfinder/Asus/Desktop/B%20Series/BM2220). This means that you need to be able to create this unique URL for each model. To make matters more complicated, each parts category is loaded dynamically, so for example the parts for the "Cooling" section are not loaded until you click on the link for "Cooling". This means we have a two part problem: 1) Get all of the valid (brand, type, family, model) combinations and 2) Figure out how to load all the parts for a given model.
I was kind of bored and decided to write up a simple program that will take care of most of the heavy lifting. It isn't the most elegant thing out there, but it'll get the job done. Step 1) is accomplished in get_model_information(). Step 2) is taken care of in parse_models() but is a little less obvious. Taking a look at the asus website, whenever you click on a parts subsection the JavaScript function getProductsBasedOnCategoryID() is run, which makes an ajax call to a formatted PRODUCT_URL (see below). The response is some JSON information that is used to populate the section you clicked on.
import urllib2
import json
import urlparse
from bs4 import BeautifulSoup

BASE_URL = 'http://www.asusparts.eu/partfinder/'

PRODUCTS_URL = 'http://json.zandparts.com/api/category/GetCategories/'\
               '44/EUR/{model}/{family}/{accessory}/{brand}/null/'

ACCESSORIES = ['Cable', 'Cooling', 'Cover', 'HDD', 'Keyboard', 'Memory',
               'Miscellaneous', 'Mouse', 'ODD', 'PS', 'Screw']


def get_options(url, select_id):
    """
    Gets all the options from a select element.
    """
    r = urllib2.urlopen(url)
    soup = BeautifulSoup(r)
    select = soup.find('select', id=select_id)
    try:
        options = [option for option in select.strings]
    except AttributeError:
        print url, select_id, select
        raise
    return options[1:]  # The first option is the menu text


def get_model_information():
    """
    Finds all the models for each family, all the families and models for each
    type, and all the types, families, and models for each brand.

    These are all added as tuples (brand, type, family, model) to the list
    model_info.
    """
    model_info = []

    print "Getting brands"
    brand_options = get_options(BASE_URL, 'mySelectList')

    for brand in brand_options:
        print "Getting types for {0}".format(brand)
        # brand = brand.replace(' ', '%20')  # URL encode spaces
        brand_url = urlparse.urljoin(BASE_URL, brand.replace(' ', '%20'))
        types = get_options(brand_url, 'mySelectListType')

        for _type in types:
            print "Getting families for {0}->{1}".format(brand, _type)
            bt = '{0}/{1}'.format(brand, _type)
            type_url = urlparse.urljoin(BASE_URL, bt.replace(' ', '%20'))
            families = get_options(type_url, 'myselectListFamily')

            for family in families:
                print "Getting models for {0}->{1}->{2}".format(brand,
                                                                _type, family)
                btf = '{0}/{1}'.format(bt, family)
                fam_url = urlparse.urljoin(BASE_URL, btf.replace(' ', '%20'))
                models = get_options(fam_url, 'myselectListModel')

                model_info.extend((brand, _type, family, m) for m in models)

    return model_info


def parse_models(model_information):
    """
    Get all the information for each accessory type for every
    (brand, type, family, model). accessory_info will be the python formatted
    json results. You can parse, filter, and save this information or use
    it however suits your needs.
    """
    for brand, _type, family, model in model_information:
        for accessory in ACCESSORIES:
            r = urllib2.urlopen(PRODUCTS_URL.format(model=model, family=family,
                                                    accessory=accessory,
                                                    brand=brand))
            accessory_info = json.load(r)
            # Do something with accessory_info
            # ...


def main():
    models = get_model_information()
    parse_models(models)


if __name__ == '__main__':
    main()
Finally, one side note: I have dropped urllib2 in favor of the requests library. I personally think it provides much more functionality and has better semantics, but you can use whatever you would like.
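For example, get_options rewritten with requests might look something like this (a sketch; only the fetching changes, the rest of the program stays the same):

import requests
from bs4 import BeautifulSoup

def get_options(url, select_id):
    """
    Gets all the options from a select element.
    """
    r = requests.get(url)
    soup = BeautifulSoup(r.text)
    select = soup.find('select', id=select_id)
    options = [option for option in select.strings]
    return options[1:]  # The first option is the menu text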

How can I make this recursive crawl function iterative?

For academic and performance's sake, given this recursive web-crawling function (which crawls only within the given domain), what would be the best approach to make it run iteratively? Currently, by the time it finishes running, Python has climbed to using over 1GB of memory, which isn't acceptable in a shared environment.
def crawl(self, url):
    "Get all URLS from which to scrape categories."
    try:
        links = BeautifulSoup(urllib2.urlopen(url)).findAll(Crawler._match_tag)
    except urllib2.HTTPError:
        return
    for link in links:
        for attr in link.attrs:
            if Crawler._match_attr(attr):
                if Crawler._is_category(attr):
                    pass
                elif attr[1] not in self._crawled:
                    self._crawled.append(attr[1])
                    self.crawl(attr[1])
Use a BFS instead of crawling recursively (DFS): http://en.wikipedia.org/wiki/Breadth_first_search
You can use an external storage solution (such as a database) for BFS queue to free up RAM.
The algorithm is:
// pseudocode:
var urlsToVisit = new Queue();  // Could be a queue (BFS) or stack (DFS),
                                // probably with a database backing or something.
var visitedUrls = new Set();    // Set of visited URLs.

// initialization:
urlsToVisit.Add( rootUrl );

while (urlsToVisit.Count > 0) {
    var nextUrl = urlsToVisit.FetchAndRemoveNextUrl();
    var page = FetchPage(nextUrl);
    ProcessPage(page);
    visitedUrls.Add(nextUrl);

    var links = ParseLinks(page);
    foreach (var link in links)
        if (!visitedUrls.Contains(link))
            urlsToVisit.Add(link);
}
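A rough Python translation of that pseudocode (a sketch; fetch_page and parse_links are placeholders for whatever fetching and link extraction you already have):

from collections import deque

def bfs_crawl(root_url, fetch_page, parse_links):
    urls_to_visit = deque([root_url])   # a queue gives BFS; a stack would give DFS
    visited_urls = set()                # set of visited URLs

    while urls_to_visit:
        next_url = urls_to_visit.popleft()
        if next_url in visited_urls:
            continue
        page = fetch_page(next_url)
        # process the page here
        visited_urls.add(next_url)
        for link in parse_links(page):
            if link not in visited_urls:
                urls_to_visit.append(link)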
Instead of recursing, you could put the new URLs to crawl into a queue. Then run until the queue is empty without recursing. If you put the queue into a file this uses almost no memory at all.
@Mehrdad - Thank you for your reply; the example you provided was concise and easy to understand.
The solution:
def crawl(self, url):
    urls = Queue(-1)
    _crawled = []

    urls.put(url)

    while not urls.empty():
        url = urls.get()
        try:
            links = BeautifulSoup(urllib2.urlopen(url)).findAll(Crawler._match_tag)
        except urllib2.HTTPError:
            continue
        for link in links:
            for attr in link.attrs:
                if Crawler._match_attr(attr):
                    if Crawler._is_category(attr):
                        continue
                    else:
                        Crawler._visit(attr[1])
                        if attr[1] not in _crawled:
                            urls.put(attr[1])
You can do it pretty easily just by using links as a queue:
def get_links(url):
    "Extract all matching links from a url"
    try:
        links = BeautifulSoup(urllib2.urlopen(url)).findAll(Crawler._match_tag)
    except urllib2.HTTPError:
        return []
    return links

def crawl(self, url):
    "Get all URLS from which to scrape categories."
    links = get_links(url)
    while len(links) > 0:
        link = links.pop()
        for attr in link.attrs:
            if Crawler._match_attr(attr):
                if Crawler._is_category(attr):
                    pass
                elif attr[1] not in self._crawled:
                    self._crawled.append(attr[1])
                    # prepend the new links to the queue
                    links = get_links(attr[1]) + links
Of course, this doesn't solve the memory problem...
