I am trying to create a site-map generator. In a nutshell, I feed it a link, it looks for more links on the site and so on.
To avoid any long limbo chains, I thought I'd create a blocked_sites.txt which I can read from, compare my unprocessed_urls against, and remove all the items that CONTAIN a blocker.
My problem is that, being naive, I thought I could simply do some set/list comparing and removing and, voila, be done. The problem turned out to be bigger, mainly with the collection *deque*.
The code
I start off by defining my starting url, which is the user input, and I add it to a queue:
# a queue of urls to be crawled
unprocessed_urls = deque([starting_url])
Now comes the part where I'll start handling my urls:
# process urls one by one from unprocessed_url queue until queue is empty
while len(unprocessed_urls):
    # Remove unwanted items
    unprocessed_urls = {url for url in unprocessed_urls if not any(blocker in url for blocker in blockers)}  # <-- THIS IS THE PROBLEM

    # move next url from the queue to the set of processed urls
    newurl = unprocessed_urls.popleft()
    processed_urls.add(newurl)

    # extract base url to resolve relative links
    parts = urlsplit(newurl)
    base_url = "{0.scheme}://{0.netloc}".format(parts)
    if parts.scheme != 'mailto' and parts.scheme != '#':
        path = newurl[:newurl.rfind('/')+1] if '/' in parts.path else newurl
    else:
        continue

    # get url's content
    print(Fore.CYAN + "Crawling URL %s" % newurl + Fore.WHITE)
    try:
        response = requests.get(newurl, timeout=3)
So the problem is that the program shouldn't go onto big sites that I have explicitly defined to be blocked, like so:
# Blockers
blockers = set(line.strip() for line in open('blocked_sites.txt'))
To strip the unwanted items out of unprocessed_urls, I use this suggested line (also pointed out in the code above):
# Remove unwanted items
unprocessed_urls = {url for url in unprocessed_urls if not any(blocker in url for blocker in blockers)}
Thus we find ourselves here:
AttributeError: 'set' object has no attribute 'popleft'
What I could deduce from this is that attempting to remove the unwanted items somehow alters the type of the collection.
I don't really know how to move forward from here.
The line unprocessed_urls = {...} creates a new set object and assigns it to unprocessed_urls. The fact that this new value is logically similar to the old value is irrelevant; assigning to a variable overwrites whatever was in there before.
However, a collections.deque can be created from any iterable, so you can instead do
unprocessed_urls = deque(url for url in unprocessed_urls if ...)
to create a new collections.deque so that all values you assign to unprocessed_urls will have the same type.
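Here is a minimal, self-contained sketch of that fix; the sample urls and blockers below are made up for illustration, and the filter condition is the same one from the question:

from collections import deque

# hypothetical sample data, for illustration only
blockers = {"facebook.com", "youtube.com"}
unprocessed_urls = deque([
    "https://example.com/page1",
    "https://www.facebook.com/some-profile",
    "https://example.com/page2",
])

# rebuild the queue as a deque instead of a set, so popleft() keeps working
unprocessed_urls = deque(
    url for url in unprocessed_urls
    if not any(blocker in url for blocker in blockers)
)

while unprocessed_urls:
    newurl = unprocessed_urls.popleft()
    print("Crawling URL %s" % newurl)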
Related
I would like to extract only the unique url items from my list in order to move on with a web scraping project. Although I have a huge list of URLs on my side, I would like to present a minimalist scenario here to explain the main issue. Assume that my list is like this:
url_list = ["https://www.ox.ac.uk/",
"http://www.ox.ac.uk/",
"https://www.ox.ac.uk",
"http://www.ox.ac.uk",
"https://www.ox.ac.uk/index.php",
"https://www.ox.ac.uk/index.html",
"http://www.ox.ac.uk/index.php",
"http://www.ox.ac.uk/index.html",
"www.ox.ac.uk/",
"ox.ac.uk",
"https://www.ox.ac.uk/research"
]
def ExtractUniqueUrls(urls):
    pass

ExtractUniqueUrls(url_list)
For this minimalist scenario, I am expecting only two unique urls: "https://www.ox.ac.uk" and "https://www.ox.ac.uk/research". Although the url elements have some differences, such as "http" vs "https", a trailing "/" or not, index.php, index.html, they all point to exactly the same web page. There might be some other possibilities which I have missed (please mention them if you catch any). Anyway, what is the proper and efficient way to handle this issue using Python 3?
I am not looking for a hard-coded solution that handles each case individually. For instance, I do not want to manually check whether the url has a "/" at the end or not. Possibly there is a much better solution with other packages such as urllib? For that reason, I looked at urllib.parse, but I could not come up with a proper solution so far.
Thanks
Edit: I added one more example to my list at the end in order to explain this better. Otherwise, you might assume that I am looking for the root url, but that is not the case at all.
Covering only the cases you've revealed:
url_list = ["https://www.ox.ac.uk/",
"http://www.ox.ac.uk/",
"https://www.ox.ac.uk",
"http://www.ox.ac.uk",
"https://www.ox.ac.uk/index.php",
"https://www.ox.ac.uk/index.html",
"http://www.ox.ac.uk/index.php",
"http://www.ox.ac.uk/index.html",
"www.ox.ac.uk/",
"ox.ac.uk",
"ox.ac.uk/research",
"ox.ac.uk/index.php?12"]
def url_strip_gen(source: list):
    replace_dict = {".php": "", ".html": "", "http://": "", "https://": ""}
    for url in source:
        for key, val in replace_dict.items():
            url = url.replace(key, val, 1)
        url = url.rstrip('/')
        yield url[4:] if url.startswith("www.") else url

print(set(url_strip_gen(url_list)))
{'ox.ac.uk/index?12', 'ox.ac.uk/index', 'ox.ac.uk/research', 'ox.ac.uk'}
This won't cover a case where the url itself contains .html, like www.htmlsomething; in that case it can be compensated for with urlparse, as it stores the host and the path separately, like below:
>>> import pprint
>>> from urllib.parse import urlparse
>>> a = urlparse("http://ox.ac.uk/index.php?12")
>>> pprint.pprint(a)
ParseResult(scheme='http', netloc='ox.ac.uk', path='/index.php', params='', query='12', fragment='')
However, without a scheme:
>>> a = urlparse("ox.ac.uk/index.php?12")
>>> pprint.pprint(a)
ParseResult(scheme='', netloc='', path='ox.ac.uk/index.php', params='', query='12', fragment='')
The whole host goes into the path attribute.
To compensate for this, we either need to strip the scheme and add one to every url, or check whether the url starts with a scheme and add one if it doesn't. The former is easier to implement.
replace_dict = {"http://": "", "https://": ""}
for url in source:
# Unify scheme to HTTP
for key, val in replace_dict.items():
url = url.replace(key, val, 1)
url = "http://" + (url[4:] if url.startswith("www.") else url)
parsed = urlparse(url)
With this you are guaranteed separate control over each section of your url via urlparse. However, as you haven't specified which parts should count towards a url being unique enough, I'll leave that task to you.
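For illustration, here is a rough sketch that puts those pieces together; which parts make a url "unique enough" is an assumption here (the netloc, the path with a trailing "/" and an index.php/index.html suffix stripped, and the query string):

from urllib.parse import urlparse

def unique_urls(source):
    seen = set()
    for url in source:
        # unify the scheme so urlparse always fills in netloc
        for prefix in ("http://", "https://"):
            url = url.replace(prefix, "", 1)
        url = "http://" + (url[4:] if url.startswith("www.") else url)
        parsed = urlparse(url)
        # assumed uniqueness key: host + cleaned path + query
        path = parsed.path.rstrip('/')
        for index_page in ("/index.php", "/index.html"):
            if path.endswith(index_page):
                path = path[:-len(index_page)]
        seen.add((parsed.netloc, path, parsed.query))
    return seen

print(unique_urls(url_list))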
Here's a quick and dirty attempt:
def extract_unique_urls(url_list):
    unique_urls = []
    for url in url_list:
        # Removing the 'https://' etc. part
        if url.find('//') > -1:
            url = url.split('//')[1]
        # Removing the 'www.' part
        url = url.replace('www.', '')
        # Removing trailing '/'
        url = url.rstrip('/')
        # If not root url then inspect the last part of the url
        if url.find('/') > -1:
            # Extracting the last part
            last_part = url.split('/')[-1]
            # Deciding if to keep the last part (no if '.' in it)
            if last_part.find('.') > -1:
                # If no to keep: Removing last part and getting rid of
                # trailing '/'
                url = '/'.join(url.split('/')[:-1]).rstrip('/')
        # Append if not already in list
        if url not in unique_urls:
            unique_urls.append(url)
    # Sorting for the fun of it
    return sorted(unique_urls)
I'm sure it doesn't cover all possible cases. But maybe you can extend it if that's not the case. I'm also not sure if you wanted to keep the 'http(s)://' parts. If yes, then just add them to the results.
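For reference, running it against the question's url_list should give:

print(extract_unique_urls(url_list))
# ['ox.ac.uk', 'ox.ac.uk/research']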
I am using the Selenium library and trying to iterate through a list of items and look them up on the web. My loop works when the items are found, but I am having a hard time handling the case when an item is not found on the web page. For this instance I know that if the item is not found, the page will show "No Results For" within a span, which I can access with:
browser.find_by_xpath('(.//span[@class = "a-size-medium a-color-base"])[1]')[0].text
Now the problem is that this span only appears when the item the loop is searching for is not found. So I tried this logic: if this span doesn't exist, then the item is found, so execute the rest of the loop; if the span does exist and equals "No Results For", then go and search for the next item. Here is my code:
data = pd.DataFrame()

for i in lookup_list:
    start_url = f"https://www.amazon.com/s?k=" + i + "&ref=nb_sb_noss_1"
    browser.visit(start_url)
    if browser.find_by_xpath('(.//span[@class = "a-size-medium a-color-base"])[1]') is not None:
        #browser.find_by_xpath("//a[@class='a-size-medium a-color-base']"):
        item = browser.find_by_xpath("//a[@class='a-link-normal']")
        item.click()
        html = browser.html
        soup = bs(html, "html.parser")
        collection_dict = {
            'PART_NUMBER': getmodel(soup),
            'DIMENSIONS': getdim(soup),
            'IMAGE_LINK': getImage(soup)
        }
    elif browser.find_by_xpath('(.//span[@class = "a-size-medium a-color-base"])[1]')[0].text != 'No results for':
        continue
    data = data.append(collection_dict, ignore_index=True)
The error I am getting is:
AttributeError: 'ElementList' object has no attribute 'click'
I do understand that I'm getting this error because I can't access the click attribute on a list that contains multiple items; I can't click on all of them. But what I'm trying to do is avoid even trying to access it if the page shows that the item is not found; I want the script to simply go on to the next item and search.
How do I modify this?
Thank you in advance.
Using a try-except with a pass is what you want in this situation, like @JammyDodger said. Using this typically isn't a good sign, though, because you don't want to simply ignore errors most of the time. pass will simply ignore the error and continue with the rest of the loop.
try:
    item.click()
except AttributeError:
    pass
In order to skip to the next iteration of the loop, you may want to use the continue keyword.
try:
    item.click()
except AttributeError:
    continue
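As a rough sketch, the continue could also be driven by the "No results for" span itself, as described in the question. This assumes the same browser object, xpaths and helpers as above, that find_by_xpath returns an empty (falsy) element list when nothing matches, and that clicking the first match is the intent:

for i in lookup_list:
    start_url = "https://www.amazon.com/s?k=" + i + "&ref=nb_sb_noss_1"
    browser.visit(start_url)

    # skip this search term if the "No results for" span is present
    no_results = browser.find_by_xpath('(.//span[@class = "a-size-medium a-color-base"])[1]')
    if no_results and 'No results for' in no_results[0].text:
        continue

    items = browser.find_by_xpath("//a[@class='a-link-normal']")
    if not items:
        continue
    items[0].click()  # click a single element, not the whole ElementList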
I'm trying to utilize RSS to get automatic notifications for specific security vulnerabilities I may be concerned with. I have gotten it functional for searching for keywords in the title and url of feed entries, but it seems to ignore the RSS description.
I've verified that the description field exists within the feed (I originally started with summary in place of description before discovering this) but don't understand why it's not working (I'm relatively new to Python). Is it possibly a sanitation issue, or am I missing something about how the search is performed?
#!/usr/bin/env python3.6
import feedparser
#Keywords to search for in the rss feed
key_words = ['Chrome','Tomcat','linux','windows']
# get the urls we have seen prior
f = open('viewed_urls.txt', 'r')
urls = f.readlines()
urls = [url.rstrip() for url in urls]
f.close()
#Returns true if keyword is in string
def contains_wanted(in_str):
    for wrd in key_words:
        if wrd.lower() in in_str:
            return True
    return False

#Returns true if url result has not been seen before
def url_is_new(urlstr):
    # returns true if the url string does not exist
    # in the list of strings extracted from the text file
    if urlstr in urls:
        return False
    else:
        return True
#actual parsing phase
feed = feedparser.parse('https://nvd.nist.gov/feeds/xml/cve/misc/nvd-rss.xml')
for key in feed["entries"]:
title = key['title']
url = key['links'][0]['href']
description = key['description']
#formats and outputs the specified rss fields
if contains_wanted(title.lower()) and contains_wanted(description.lower()) and url_is_new(url):
print('{} - {} - {}\n'.format(title, url, description))
#appends reoccurring rss feeds in the viewed_urls file
with open('viewed_urls.txt', 'a') as f:
f.write('{}\n'.format(title,url))
This helped. I was unaware of the conjunction logic, but I have resolved it. I omitted contains_wanted(title.lower()), since it was not necessary in the condition: contains_wanted(description.lower()) fulfills the title check's purpose as well as its own. I am now getting proper output.
Thank you pbn.
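For clarity, the adjusted condition described above would look roughly like this:

if contains_wanted(description.lower()) and url_is_new(url):
    print('{} - {} - {}\n'.format(title, url, description))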
I'm creating a program that uses a dictionary to store large trees of web links in Python. Basically, you start with the root URL, and that creates a dictionary based on the URLs found from the HTML of the root. In the next step, I want to get the pages of each of those URLs and grab the links on those URLs. Eventually, I want to have a dictionary with all the links in it and their relation to each other.
This is what I have for the first two depths
for link in soup.find_all('a'):
    url = link.get('href')
    url_tree[siteurl][url]

#Get page source
for link in soup.find_all('a'):
    url = link.get('href')
    url_tree[siteurl][secondurl][url]
This system works, but as you can tell, if I want a dictionary N layers deep, that gets to be a lot of blocks of code. Is there a way to automatically add more layers? Any help is appreciated!
This can be done using recursive functions.
A basic example which will crawl all the urls found in a page one by one, then crawl all the urls found in that page one by one, and so on... It will also print out every url it finds.
def recursive_fetch(url_to_fetch):
    # get page source from url_to_fetch
    # make a new soup
    for link in soup.find_all('a'):
        url = link.get('href')
        print(url)
        # run the function recursively
        # for the current url
        recursive_fetch(url)

# Usage
recursive_fetch(root_url)
Since you want to have a dict of tree of all the urls found, the above code isn't much helpful, but it's a start.
This is where it gets really complex. Because now you'll also need to keep track of the parent of the current url being crawled, the parent of that url, parent of that url, parent of that url, parent of ...
You see what I mean? It gets very complex, very fast.
Below is the code that does all that. I've written comments in the code to explain it as best as I can, but you'll need to actually understand how recursive functions work to follow it fully.
First, let's look at another function which will be very helpful in getting the parent of a url from the tree:
def get_parent(tree, parent_list):
    """How it works:

    Let's say the `tree` looks like this:

        tree = {
            'root-url': {
                'link-1': {
                    'link-1-a': {...}
                }
            }
        }

    and `parent_list` looks like this:

        parent_list = ['root-url', 'link-1', 'link-1-a']

    this function will chain the values in the list and
    perform a dict lookup like this:

        tree['root-url']['link-1']['link-1-a']
    """
    first, rest = parent_list[0], parent_list[1:]
    try:
        if tree[first] and rest:
            # if tree or rest aren't empty
            # run the function recursively
            return get_parent(tree[first], rest)
        else:
            return tree[first]
    except KeyError:
        # this is required for creating the
        # root_url dict in the tree
        # because it doesn't exist yet
        tree[first] = {}
        return tree[first]
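A quick illustrative check of get_parent, using the tree shape from its docstring (the 'leaf' entry below just stands in for the {...}):

tree = {
    'root-url': {
        'link-1': {
            'link-1-a': {'leaf': {}}
        }
    }
}
print(get_parent(tree, ['root-url', 'link-1', 'link-1-a']))
# -> {'leaf': {}}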
And the recursive_fetch function will look like this:
url_tree = {}  # dict to store the url tree

def recursive_fetch(fetch_url, parents=None):
    """
    `parents` is a list of parents of the current url
    Example:
        parents = ['root-url', 'link-1', ... 'parent-link']
    """
    parents = parents or []
    parents.append(fetch_url)

    # get page source from fetch_url
    # make new soup object

    for link in soup.find_all('a'):
        url = link.get('href')

        if parents:
            parent = get_parent(url_tree, parents)
        else:
            parent = None

        if parent is not None:
            # this will run when parent is not None
            # i.e. even if parent is empty dict {}
            # create a new dict of the current url
            # inside the parent dict
            parent[url] = {}
        else:
            # this url has no parent,
            # insert this directly in the url_tree
            url_tree[url] = {}

        # now crawl the current url
        recursive_fetch(url, parents)

        # Next is the most important block of code.
        # Whenever 1 recursion completes,
        # it will pop the last parent from
        # the `parents` list so that in the
        # next recursion, the parents are correct.
        # Without this block, the url_tree wouldn't
        # look as expected.
        # It took me many hours to figure this out
        try:
            parents.pop(-1)
        except IndexError:
            pass
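To actually run it, the two placeholder comments need real code. A minimal sketch using requests and BeautifulSoup (an assumption here, not something this answer specifies) could look like this; inside recursive_fetch you would then call soup = fetch_soup(fetch_url) before the for loop:

import requests
from bs4 import BeautifulSoup

def fetch_soup(fetch_url):
    # minimal stand-in for the "get page source / make new soup object" comments;
    # a real crawler would also need error handling, politeness delays and a
    # visited set to avoid loops
    response = requests.get(fetch_url, timeout=3)
    return BeautifulSoup(response.text, "html.parser")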
I created a web crawler that, given a base_url, will spider out and find all possible endpoints. While I am able to get all the endpoints, I need a way to figure out how I got there in the first place: a 'url stack-trace', per se, or breadcrumbs of urls leading to each endpoint.
I start by finding all urls given a base url. Since the sublinks I'm looking for are within a json, I thought the best way to do this would be to use a variation of a recursive dictionary example I found here: http://www.saltycrane.com/blog/2011/10/some-more-python-recursion-examples/:
import requests
import pytest
import time

BASE_URL = "https://www.my-website.com/"

def get_leaf_nodes_list(base_url):
    """
    :base_url: The starting point to crawl
    :return: List of all possible endpoints
    """
    class Namespace(object):
        # A wrapper function is used to create a Namespace instance to hold the ns.results variable
        pass

    ns = Namespace()
    ns.results = []

    r = requests.get(BASE_URL)
    time.sleep(0.5)  # so we don't cause a DDOS?
    data = r.json()

    def dict_crawler(data):
        # Retrieve all nodes from nested dict
        if isinstance(data, dict):
            for item in data.values():
                dict_crawler(item)
        elif isinstance(data, list) or isinstance(data, tuple):
            for item in data:
                dict_crawler(item)
        else:
            if type(data) is unicode:
                if "http" in data:  # If http in value, keep going
                    # If data is not already in ns.results, don't append it
                    if str(data) not in ns.results:
                        ns.results.append(data)
                        sub_r = requests.get(data)
                        time.sleep(0.5)  # so we don't cause a DDOS?
                        sub_r_data = sub_r.json()
                        dict_crawler(sub_r_data)

    dict_crawler(data)
    return ns.results
To reiterate, get_leaf_nodes_list does a get request and looks for any urls within the json values (checking whether the string 'http' is in the value for each key), recursively doing more get requests until there are no urls left.
So, to reiterate, here are the questions I have:
How do I get a linear history of all the urls I hit to get to each endpoint?
Corollary to that, how would I store this history? As the leaf nodes list grows, my process gets exponentially slower, and I am wondering if there's a better data type out there to store this information or a more efficient approach than the code above.
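For illustration only, one possible direction (a sketch under assumptions, not a drop-in fix): thread a path list through the recursion and store a copy of it whenever a url is recorded. The nested sample payload below is made up to stand in for a real JSON response, and the sub-request the real crawler would make is left as a comment; a dict keyed by url also makes the "already seen" check constant-time, unlike the list membership test on ns.results.

def crawl_with_breadcrumbs(data, path=None, results=None):
    # `path` is the chain of urls followed to reach `data`;
    # `results` maps each discovered url to a copy of that chain
    path = path or []
    results = {} if results is None else results
    if isinstance(data, dict):
        for item in data.values():
            crawl_with_breadcrumbs(item, path, results)
    elif isinstance(data, (list, tuple)):
        for item in data:
            crawl_with_breadcrumbs(item, path, results)
    elif isinstance(data, str) and "http" in data:
        if data not in results:
            results[data] = list(path)  # breadcrumbs for this endpoint
            # the real crawler would fetch `data` here and recurse with
            # path + [data] so children know their trail, e.g.:
            # crawl_with_breadcrumbs(sub_r.json(), path + [data], results)
    return results

# hypothetical nested payload standing in for a real JSON response
sample = {"a": ["http://example.com/x", {"b": "http://example.com/y"}]}
print(crawl_with_breadcrumbs(sample, path=["https://www.my-website.com/"]))
# -> {'http://example.com/x': ['https://www.my-website.com/'],
#     'http://example.com/y': ['https://www.my-website.com/']}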