I am trying to make a website crawler in Python that retrieves all links within a site, prints them to the console, and also writes them to a text file.
The script takes the URL of the website you want to retrieve links from and the maximum number of URLs to crawl from the main page, and then retrieves the URLs using the functions crawl(), is_valid() and get_all_website_links(). The get_all_website_links() function also separates internal links from external links.
So far I have been able to retrieve, print, and write the links to the text file, but I ran into a problem when a server refuses to connect: the script stops the link retrieval and ends execution.
What I want my script to do is retry a specified number of times and move on to the next link if the request still fails after retrying (a rough sketch of such a retry helper is shown after the script below).
I tried to implement this mechanism myself but could not come up with a working approach.
I'm including my Python script below for reference.
A detailed explanation with an implementation would be deeply appreciated!
Pardon me if my grammar is bad ;)
Thanks for your time :)
import requests
from urllib.parse import urlparse, urljoin
from bs4 import BeautifulSoup
import colorama
import sys
sys.setrecursionlimit(99999999)
print("WEBSITE CRAWLER".center(175,"_"))
print("\n","="*175)
print("\n\n\n\nThis program does not tolerate faults!\nPlease type whatever you are typing correctly!\nIf you think you have made a mistake please close the program and reopen it!\nIf you proceed with errors the program will crash/close!\nHelp can be found in the README.txt file!\n\n\n")
print("\n","="*175)
siteurl = input("Enter the address of the site (Please don't forget https:// or http://, etc. at the front!) :")
max_urls = int(input("Enter the number of urls you want to crawl through the main page : "))
filename = input("Give a name for your text file (Don't append .txt at the end!) : ")
# init the colorama module
colorama.init()
GREEN = colorama.Fore.GREEN
MAGENTA = colorama.Fore.MAGENTA
RESET = colorama.Fore.RESET
# initialize the set of links (unique links)
internal_urls = set()
external_urls = set()
def is_valid(url):
    """
    Checks whether `url` is a valid URL.
    """
    parsed = urlparse(url)
    return bool(parsed.netloc) and bool(parsed.scheme)

def get_all_website_links(url):
    """
    Returns all URLs found on `url` that belong to the same website.
    """
    # all URLs of `url`
    urls = set()
    # domain name of the URL without the protocol
    domain_name = urlparse(url).netloc
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    for a_tag in soup.findAll("a"):
        href = a_tag.attrs.get("href")
        if href == "" or href is None:
            # empty href attribute
            continue
        # join the URL if it's relative (not an absolute link)
        href = urljoin(url, href)
        parsed_href = urlparse(href)
        # remove URL GET parameters, URL fragments, etc.
        href = parsed_href.scheme + "://" + parsed_href.netloc + parsed_href.path
        if not is_valid(href):
            # not a valid URL
            continue
        if href in internal_urls:
            # already in the set
            continue
        if domain_name not in href:
            # external link
            if href not in external_urls:
                print(f"{MAGENTA}[!] External link: {href}{RESET}")
                with open(filename + ".txt", "a") as f:
                    print(f"{href}", file=f)
                external_urls.add(href)
            continue
        print(f"{GREEN}[*] Internal link: {href}{RESET}")
        with open(filename + ".txt", "a") as f:
            print(f"{href}", file=f)
        urls.add(href)
        internal_urls.add(href)
    return urls

# number of urls visited so far will be stored here
total_urls_visited = 0

def crawl(url, max_urls=50000):
    """
    Crawls a web page and extracts all links.
    You'll find all links in the `external_urls` and `internal_urls` global set variables.
    params:
        max_urls (int): maximum number of URLs to crawl, default is 50000.
    """
    global total_urls_visited
    total_urls_visited += 1
    links = get_all_website_links(url)
    for link in links:
        if total_urls_visited > max_urls:
            break
        crawl(link, max_urls=max_urls)

if __name__ == "__main__":
    crawl(siteurl, max_urls)
    print("[+] Total External links:", len(external_urls))
    print("[+] Total Internal links:", len(internal_urls))
    print("[+] Total:", len(external_urls) + len(internal_urls))
    input("Press any key to exit...")
I'm trying to write a Python script to check the status text displayed for a specific country (i.e. Ecuador)
on this website:
https://immi.homeaffairs.gov.au/what-we-do/whm-program/status-of-country-caps.
How do I keep track of that specific text and detect when a change happens?
Currently I compare hash codes taken before and after a time delay, but the hash seems to change every time even though nothing changes visually.
import urllib.request
import hashlib
import time

# send_email() is assumed to be defined elsewhere in the script
input_website = 'https://immi.homeaffairs.gov.au/what-we-do/whm-program/status-of-country-caps'
time_delay = 60

# Monitor the website
def monitor_website():
    # Run the loop to keep monitoring
    while True:
        # Visit the website to know if it is up
        status = urllib.request.urlopen(input_website).getcode()
        # If it returns 200, the website is up
        if status != 200:
            # Call email function
            send_email("The website is DOWN")
        else:
            send_email("The website is UP")
        # Open the url and create the hash code
        response = urllib.request.urlopen(input_website).read()
        current_hash = hashlib.sha224(response).hexdigest()
        # Revisit the website after the time delay
        time.sleep(time_delay)
        # Visit the website after the delay and generate the new hash
        response = urllib.request.urlopen(input_website).read()
        new_hash = hashlib.sha224(response).hexdigest()
        # Check the hash codes
        if new_hash != current_hash:
            send_email("The website CHANGED")
Can you check it using Beautiful Soup?
Crawl the page for "Ecuador" and then check the next word for "suspended**"
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl
# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
url = "https://immi.homeaffairs.gov.au/what-we-do/whm-program/status-of-country-caps"
html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, 'html.parser')
# create a list of all 'td' tags
list_name = list()
tags = soup('td')
for tag in tags:
    # strip whitespace and the \u200b unicode character
    url_grab = tag.get_text().strip(u'\u200b').strip()
    list_name.append(url_grab)

# Search the list for Ecuador and the following item in the list
country_status = {}
for i in range(len(list_name)):
    if "Ecuador" in list_name[i]:
        country_status[list_name[i]] = list_name[i+1]
        print(country_status)
    else:
        continue

# Check website
if country_status["Ecuador"] != "suspended**":
    print("Website has changed")
Basically, I'm trying to remove all the characters after the URL extension in a URL, but it's proving difficult. The application works off a list of various URLs with various extensions.
Here's my source:
import requests
from bs4 import BeautifulSoup
from time import sleep
#takes userinput for path of panels they want tested
import_file_path = input('Enter the path of the websites to be tested: ')
#takes userinput for path of exported file
export_file_path = input('Enter the path of where we should export the panels to: ')
#reads imported panels
with open(import_file_path, 'r') as panels:
    panel_list = []
    for line in panels:
        panel_list.append(line)

x = 0
for panel in panel_list:
    url = requests.get(panel)
    soup = BeautifulSoup(url.content, "html.parser")
    forms = soup.find_all("form")
    action = soup.find('form').get('action')
    values = {
        soup.find_all("input")[0].get("name"): "user",
        soup.find_all("input")[1].get("name"): "pass"
    }
    print(values)
    r = requests.post(action, data=values)
    print(r.headers)
    print(r.status_code)
    print(action)
    sleep(10)
    x += 1
What I'm trying to achieve is an application that automatically tests a username/password against a list of URLs provided in a text document. However, BeautifulSoup returns an incomplete URL when extracting the form's action attribute: instead of the full http://example.com/action.php it returns action.php, exactly as it appears in the markup. The only way I can think of to get past this would be to rebuild the 'action' variable from 'panel' with all characters after the URL extension removed, followed by 'action'.
Thanks!
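For what it's worth, a relative action value can usually be resolved against the page it was scraped from with urllib.parse.urljoin instead of trimming the panel URL by hand. A minimal sketch using the variable names from the script above (panel, soup, values); urljoin leaves already-absolute actions unchanged:

from urllib.parse import urljoin

# panel is the page the form came from; strip() drops the trailing newline read from the file
action = soup.find('form').get('action')
full_action = urljoin(panel.strip(), action)  # e.g. "action.php" -> "http://example.com/action.php"
r = requests.post(full_action, data=values)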
I wrote a simple Python crawler to fetch urls from a website. Here is the code:
from bs4 import BeautifulSoup
import requests as req
def get_soup(url):
    content = req.get(url).content
    return BeautifulSoup(content, 'lxml')

def extract_links(url):
    soup = get_soup(url)
    a_tags = soup.find_all('a', class_="kkyou true-load-invoker")
    links = set(a_tag.get('href') for a_tag in a_tags)
    return links

def set_of_links(start_url, size):
    '''
    breadth-first search for article hyperlinks
    '''
    seen = set()
    active = extract_links(start_url)
    while active:
        next_active = set()
        for item in active:
            for result in extract_links(item):
                if result not in seen:
                    if len(seen) >= size:
                        break
                    else:
                        seen.add(result)
                        next_active.add(result)
        active = next_active
    return seen
Essentially, I get a soup from a start url I specify, extract all urls within the start url that have class kkyou true-load-invoker and then I repeat the process in a breadth-first fashion for all urls I have collected. I stop this process when I have seen a certain number of urls.
Until a couple of weeks ago, I had no problem running this. I could specify any number of urls and it would fetch them for me. I just tried the exact same code today, and it only returns a maximum of 14 urls. For example, if I ask it to fetch me 50 urls, it will only fetch 10 and stop. Clearly, this cannot be a problem with the code because I changed nothing! I am wondering whether the page I am trying to crawl is using some mechanism to stop me from crawling an "excessive" number of pages. The page I am trying to crawl is this (choose any article as a start url).
Any insights on this will be greatly appreciated! I am a complete novice in web crawling.
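One thing worth checking (only a guess, since the code itself did not change) is whether the site has started serving different or reduced markup to clients that do not look like a browser. A quick sketch that compares what comes back with and without a browser-like User-Agent header; the start url here is hypothetical:

import requests as req
from bs4 import BeautifulSoup

url = "https://example.com/some-article"  # hypothetical start url
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

for label, kwargs in [("default client", {}), ("browser User-Agent", {"headers": headers})]:
    resp = req.get(url, **kwargs)
    soup = BeautifulSoup(resp.content, 'lxml')
    count = len(soup.find_all('a', class_="kkyou true-load-invoker"))
    print(f"{label}: status {resp.status_code}, {count} matching links")

If the counts differ, passing headers=headers into req.get() inside get_soup() is the obvious change to try; if they are the same, the class name in the markup may simply have changed.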
I am writing a simple script that checks whether a website is present on the first page of Google search results for a given keyword.
Now, this is the function that parses a URL and returns the host name:
def parse_url(url):
    url = urlparse(url)
    hostname = url.netloc
    return hostname
and starting from a list of tags selected by:
linkElems = soup.select('.r a') #in google first page the resulting urls have class r
I wrote this:
for link in linkElems:
    l = link.get("href")[7:]
    url = parse_url(l)
    if "www.example.com" == url:
        # do stuff (e.g. store in a list, etc.)
In this last snippet, on the second line, I have to start from the seventh index, because all the href values start with '/url?q='.
I am learning Python, so I am wondering if there is a better way to do this, or simply an alternative one (maybe with a regex, the replace method, or something from the urlparse library).
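One alternative (just a sketch, assuming Google keeps using the /url?q=... redirect format shown above) is to let urllib.parse pull the q parameter out of the query string rather than slicing a fixed offset:

from urllib.parse import urlparse, parse_qs

def target_url(href):
    # Extract the real destination from a Google '/url?q=...' redirect link.
    qs = parse_qs(urlparse(href).query)
    return qs.get("q", [href])[0]  # fall back to the raw href if there is no q parameter

print(target_url("/url?q=https://www.example.com/page&sa=U&ved=abc"))
# -> https://www.example.com/page

This also drops the &sa=... tracking parameters that the plain [7:] slice leaves attached.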
You can use the Python lxml module to do that, which is also an order of magnitude faster than BeautifulSoup.
It can be done something like this:
import requests
from lxml import html
blah_url = "https://www.google.co.in/search?q=blah&oq=blah&aqs=chrome..69i57j0l5.1677j0j4&sourceid=chrome&ie=UTF-8"
r = requests.get(blah_url).content
root = html.fromstring(r)
print(root.xpath('//h3[@class="r"]/a/@href')[0].replace('/url?q=', ''))
print([url.replace('/url?q=', '') for url in root.xpath('//h3[@class="r"]/a/@href')])
This will result in:
http://www.urbandictionary.com/define.php%3Fterm%3Dblah&sa=U&ved=0ahUKEwiyscHQ5_LSAhWFvI8KHctAC0IQFggTMAA&usg=AFQjCNFge5GFNmjpan7S_UCNjos1RP5vBA
['http://www.urbandictionary.com/define.php%3Fterm%3Dblah&sa=U&ved=0ahUKEwiyscHQ5_LSAhWFvI8KHctAC0IQFggTMAA&usg=AFQjCNFge5GFNmjpan7S_UCNjos1RP5vBA', 'http://www.dictionary.com/browse/blah&sa=U&ved=0ahUKEwiyscHQ5_LSAhWFvI8KHctAC0IQFggZMAE&usg=AFQjCNE1UVR3krIQHfEuIzHOeL0ZvB5TFQ', 'http://www.dictionary.com/browse/blah-blah-blah&sa=U&ved=0ahUKEwiyscHQ5_LSAhWFvI8KHctAC0IQFggeMAI&usg=AFQjCNFw8eiSqTzOm65PQGIFEoAz0yMUOA', 'https://en.wikipedia.org/wiki/Blah&sa=U&ved=0ahUKEwiyscHQ5_LSAhWFvI8KHctAC0IQFggjMAM&usg=AFQjCNFxEB8mEjEy6H3YFOaF4ZR1n3iusg', 'https://www.merriam-webster.com/dictionary/blah&sa=U&ved=0ahUKEwiyscHQ5_LSAhWFvI8KHctAC0IQFggpMAQ&usg=AFQjCNHYXX53LmMF-DOzo67S-XPzlg5eCQ', 'https://en.oxforddictionaries.com/definition/blah&sa=U&ved=0ahUKEwiyscHQ5_LSAhWFvI8KHctAC0IQFgguMAU&usg=AFQjCNGlgcUx-BpZe0Hb-39XvmNua2n8UA', 'https://en.wiktionary.org/wiki/blah&sa=U&ved=0ahUKEwiyscHQ5_LSAhWFvI8KHctAC0IQFggzMAY&usg=AFQjCNGc9VmmyQls_rOBOR_lMUnt1j3Flg', 'http://dictionary.cambridge.org/dictionary/english/blah&sa=U&ved=0ahUKEwiyscHQ5_LSAhWFvI8KHctAC0IQFgg5MAc&usg=AFQjCNHJgZR1c6VY_WgFa6Rm-XNbdFJGmA', 'http://www.thesaurus.com/browse/blah&sa=U&ved=0ahUKEwiyscHQ5_LSAhWFvI8KHctAC0IQFgg-MAg&usg=AFQjCNEtnpmKxVJqUR7P1ss4VHnt34f4Kg', 'https://www.youtube.com/watch%3Fv%3D3taEuL4EHAg&sa=U&ved=0ahUKEwiyscHQ5_LSAhWFvI8KHctAC0IQtwIIRTAJ&usg=AFQjCNFnKlMFxHoYAIkl1MCrc_OXjgiClg']
I am trying to create a website downloader using Python. I have the code for:
Finding all URLs from a page
Downloading a given URL
What I have to do is recursively download a page, and if there are any other links on that page, I need to download them as well. I tried combining the two functions above, but the recursion doesn't work.
The code is given below:
1)
from sgmllib import SGMLParser

class URLLister(SGMLParser):
    def reset(self):
        SGMLParser.reset(self)
        self.urls = []

    def start_a(self, attrs):
        href = [v for k, v in attrs if k == 'href']
        if href:
            self.urls.extend(href)

if __name__ == "__main__":
    import urllib
    wanted_url = raw_input("Enter the URL: ")
    usock = urllib.urlopen(wanted_url)
    parser = URLLister()
    parser.feed(usock.read())
    parser.close()
    usock.close()
    for url in parser.urls:
        download(url)
2) The download(url) function is defined as follows:
def download(url):
    import urllib
    webFile = urllib.urlopen(url)
    localFile = open(url.split('/')[-1], 'w')
    localFile.write(webFile.read())
    webFile.close()
    localFile.close()

a = raw_input("Enter the URL")
download(a)
print "Done"
Kindly help me combine these two pieces of code so that the links found on a page being downloaded are themselves downloaded "recursively".
You may want to look into the Scrapy library.
It would make a task like this pretty trivial, and allow you to download multiple pages concurrently.
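As a rough illustration only (using a current Scrapy API, which looks quite different from the urllib-based code above, and a hypothetical start page), a minimal spider that saves each page and follows every link it finds might look like this:

import scrapy

class SiteSpider(scrapy.Spider):
    name = "site"
    start_urls = ["http://example.com/"]  # hypothetical start page

    def parse(self, response):
        # save the page body to a local file
        filename = response.url.split("/")[-1] or "index.html"
        with open(filename, "wb") as f:
            f.write(response.body)
        # queue every link on the page for crawling
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)

Run it with scrapy runspider spider.py; Scrapy takes care of request deduplication and concurrency.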
done_url = []

def download(url):
    if url in done_url:
        return
    # ... download url code ...
    done_url.append(url)
    urls = some_function_to_fetch_urls_from_this_page()
    for url in urls:
        download(url)
This is very sad/bad code. For example, you will need to check whether the url is within the domain you want to crawl or not. However, you asked for a recursive approach.
Be mindful of the recursion depth.
There are just so many things wrong with my solution. :P
You should try a crawling library like Scrapy or something similar.
Generally, the idea is this:
def get_links_recursive(document, current_depth, max_depth):
    links = document.get_links()
    for link in links:
        downloaded = link.download()
        if current_depth < max_depth:
            get_links_recursive(downloaded, current_depth + 1, max_depth)
Call get_links_recursive(document, 0, 3) to get things started.
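To make the idea above concrete (just a sketch using requests and BeautifulSoup in place of the placeholder document/link objects, with a made-up save_page helper), a depth-limited recursive downloader could look like this:

import os
import requests
from urllib.parse import urljoin, urlparse
from bs4 import BeautifulSoup

visited = set()

def save_page(url, content):
    # hypothetical helper: derive a file name from the URL and write the bytes to disk
    name = os.path.basename(urlparse(url).path) or "index.html"
    with open(name, "wb") as f:
        f.write(content)

def get_links_recursive(url, current_depth, max_depth):
    if url in visited:
        return
    visited.add(url)
    try:
        response = requests.get(url, timeout=10)
    except requests.exceptions.RequestException:
        return
    save_page(url, response.content)
    if current_depth >= max_depth:
        return
    soup = BeautifulSoup(response.content, "html.parser")
    for a_tag in soup.find_all("a", href=True):
        link = urljoin(url, a_tag["href"])
        get_links_recursive(link, current_depth + 1, max_depth)

get_links_recursive("http://example.com/", 0, 3)

A real crawler would also restrict the followed links to the starting domain, as the earlier answer points out.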