This is my code:
https://pastebin.com/R11qiTF4
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as req
from urllib.parse import urljoin
import re
urls = ["https://www.helios-gesundheit.de"]
domain_list = ["https://www.helios-gesundheit.de/kliniken/schwerin/"]
prohibited = ["info", "news"]
text_keywords = ["Helios", "Helios"]
url_list = []
desired = "https://www.helios-gesundheit.de/kliniken/schwerin/unser-angebot/unsere-fachbereiche-klinikum/allgemein-und-viszeralchirurgie/team-allgemein-und-viszeralchirurgie/"
for x in range(len(domain_list)):
    url_list.append(urls[x]+domain_list[x].replace(urls[x], ""))
print(url_list)
def prohibitedChecker(prohibited_list, string):
    # return True as soon as any prohibited word appears in the string
    for x in prohibited_list:
        if x in string:
            return True
    return False
def parseHTML(url):
    requestHTML = req(url)
    htmlPage = requestHTML.read()
    requestHTML.close()
    parsedHTML = soup(htmlPage, "html.parser")
    return parsedHTML
searched_word = "Helios"
for url in url_list:
    parsedHTML = parseHTML(url)
    href_crawler = parsedHTML.find_all("a", href=True)
    for href in href_crawler:
        crawled_url = urljoin(url, href.get("href"))
        print(crawled_url)
        if "www" not in crawled_url:
            continue
        parsedHTML = parseHTML(crawled_url)
        results = parsedHTML.body.find_all(string=re.compile('.*{0}.*'.format(searched_word)), recursive=True)
        for single_result in results:
            keyword_text_check = prohibitedChecker(text_keywords, single_result.string)
            if keyword_text_check != True:
                continue
            print(single_result.string)
I'm trying to print the contents of the "desired" variable. The problem is that my code never even requests the URL in "desired", because it is not within the pages being scraped: the "desired" href is inside another href that sits on the page I'm currently scraping. I thought about fixing this by adding yet another for loop inside the results loop that requests every href found in the first pass, but that is messy and inefficient.
Is there a way to get a list of every directory of a website URL?
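One approach that avoids nesting more loops is a breadth-first crawl with a visited set, which reaches links that are nested one level (or more) deeper. This is only a sketch, not tested against the site; crawl_domain and max_pages are made-up names, and it assumes the pages load with a plain urlopen:
from collections import deque
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

from bs4 import BeautifulSoup

def crawl_domain(start_url, max_pages=50):
    """Breadth-first crawl of pages under the same domain as start_url."""
    domain = urlparse(start_url).netloc
    visited = set()
    queue = deque([start_url])
    while queue and len(visited) < max_pages:
        page_url = queue.popleft()
        if page_url in visited:
            continue
        visited.add(page_url)
        try:
            html = urlopen(page_url).read()
        except Exception:
            continue  # skip pages that fail to load
        page = BeautifulSoup(html, "html.parser")
        for a in page.find_all("a", href=True):
            link = urljoin(page_url, a["href"]).split("#")[0]
            if urlparse(link).netloc == domain and link not in visited:
                queue.append(link)
    return visited

print(crawl_domain("https://www.helios-gesundheit.de/kliniken/schwerin/"))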
I'm having a bit of trouble trying to save the links from a website into a list without repeating URLs with the same domain.
Example:
www.python.org/download and www.python.org/about
should only save the first one (www.python.org/download) and not repeat it later.
This is what I've got so far:
from bs4 import BeautifulSoup
import requests
from urllib.parse import urlparse
url = "https://docs.python.org/3/library/urllib.request.html#module-urllib.request"
result = requests.get(url)
doc = BeautifulSoup(result.text, "html.parser")
atag = doc.find_all('a', href=True)
links = []
#below should be some kind of for loop
As a one-liner:
links = {nl for a in doc.find_all('a', href=True) if (nl := urlparse(a["href"]).netloc) != ""}
Explained:
links = set()                          # define empty set
for a in doc.find_all('a', href=True): # loop over every <a> element
    nl = urlparse(a["href"]).netloc    # get netloc from url
    if nl:
        links.add(nl)                  # add to set if it exists
output:
{'www.w3.org', 'datatracker.ietf.org', 'www.python.org', 'requests.readthedocs.io', 'github.com', 'www.sphinx-doc.org'}
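If you want to keep the first full URL seen for each domain (as in the www.python.org/download example in the question) rather than just the netloc, a dict keyed by netloc is one option. A small sketch, reusing the doc object and urlparse import from the code above:
first_url_per_domain = {}  # netloc -> first full URL seen for that domain
for a in doc.find_all('a', href=True):
    nl = urlparse(a["href"]).netloc
    if nl and nl not in first_url_per_domain:
        first_url_per_domain[nl] = a["href"]

links = list(first_url_per_domain.values())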
I am trying to build a web crawler to extract all the links on a webpage. I have created two Python files (scanner.py, which contains the Scanner class, and vulnerability-scanner.py, which uses it). When I run the script, it keeps running without stopping. I am unable to find the error. Help me solve this.
Here is my source code:
scanner.py
import requests
from urllib.parse import urlparse, urljoin
from bs4 import BeautifulSoup
import colorama
class Scanner:
    colorama.init()

    def __init__(self, url):
        self.target_url = url
        self.target_links = []

    def is_valid(self, url):
        parsed = urlparse(url)
        return bool(parsed.netloc) and bool(parsed.scheme)

    def get_all_website_links(self, url):
        GREEN = colorama.Fore.GREEN
        WHITE = colorama.Fore.WHITE
        RESET = colorama.Fore.RESET
        urls = set()
        internal_urls = set()
        external_urls = set()
        domain_name = urlparse(url).netloc
        response = requests.get(url)
        soup = BeautifulSoup(response.content, "html.parser")
        for a_tag in soup.findAll("a"):
            href = a_tag.attrs.get("href")
            if href == "" or href is None:
                continue
            href = urljoin(url, href)
            parsed_href = urlparse(href)
            href = parsed_href.scheme + "://" + parsed_href.netloc + parsed_href.path
            if not self.is_valid(href):
                continue
            if href in internal_urls:
                continue
            if domain_name not in href:
                if href not in external_urls:
                    print(f"{WHITE}[*] External link: {href}{RESET}")
                    external_urls.add(href)
                continue
            print(f"{GREEN}[*] Internal link: {href}{RESET}")
            urls.add(href)
            internal_urls.add(href)
        return urls

    def crawl(self, url):
        href_links = self.get_all_website_links(url)
        for link in href_links:
            print(link)
            self.crawl(link)
vulnerability-scanner.py
import argu
target_url = "https://hack.me/"
vul_scanner = argu.Scanner(target_url)
vul_scanner.crawl(target_url)
The following part is (almost) an infinite recursion:
for link in href_links:
    print(link)
    self.crawl(link)
I believe you added this with the idea of crawling the links found on each page, but you didn't add a stopping condition. (Currently, the only thing that stops the recursion is reaching a page with no links at all.)
One stopping condition might be to set a predefined number of "max" levels to crawl.
Something like this in your init function:
def __init__(self, url):
    self.target_url = url
    self.target_links = []
    self.max_parse_levels = 5  # you can go a step further and make this an input to the constructor (i.e. the __init__ function)
    self.cur_parse_levels = 0
.
.
.
def crawl(self, url):
    if self.cur_parse_levels > self.max_parse_levels:
        return
    self.cur_parse_levels += 1  # count this level (the increment was left implicit in the original sketch)
    href_links = self.get_all_website_links(url)
    for link in href_links:
        print(link)
        self.crawl(link)
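Another option, sketched here rather than offered as a drop-in fix, is to keep a set of already-crawled URLs on the Scanner instance. That stops the recursion even without a depth limit, because every page is visited at most once:
def __init__(self, url):
    self.target_url = url
    self.target_links = []
    self.visited = set()  # URLs that have already been crawled

def crawl(self, url):
    if url in self.visited:
        return                # stop: this page was crawled before
    self.visited.add(url)
    for link in self.get_all_website_links(url):
        print(link)
        self.crawl(link)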
I need to get all the links related to a given homepage URL of a website; that means the links present on the homepage plus the new links reached by following those homepage links.
I am using the BeautifulSoup Python library. I am also thinking of using Scrapy.
The code below extracts only the links linked directly from the homepage.
from bs4 import BeautifulSoup
import requests
url = "https://www.dataquest.io"
def links(url):
    html = requests.get(url).content
    bsObj = BeautifulSoup(html, 'lxml')
    links = bsObj.findAll('a')
    finalLinks = set()
    for link in links:
        finalLinks.add(link)
    return finalLinks

print(links(url))
linklis = list(links(url))
for l in linklis:
    print(l)
    print("\n")
I need a List which include all URL/Links which can be reached via the homepage URL (may be directly or indirectly linked to homepage).
This script will print all links found on the url https://www.dataquest.io:
from bs4 import BeautifulSoup
import requests
url = "https://www.dataquest.io"
def links(url):
    html = requests.get(url).content
    bsObj = BeautifulSoup(html, 'lxml')
    links = bsObj.select('a[href]')
    final_links = set()
    for link in links:
        url_string = link['href'].rstrip('/')
        if 'javascript:' in url_string or url_string.startswith('#'):
            continue
        elif 'http' not in url_string and not url_string.startswith('//'):
            url_string = 'https://www.dataquest.io' + url_string
        elif 'dataquest.io' not in url_string:
            continue
        final_links.add(url_string)
    return final_links

for l in sorted(links(url)):
    print(l)
Prints:
http://app.dataquest.io/login
http://app.dataquest.io/signup
https://app.dataquest.io/signup
https://www.dataquest.io
https://www.dataquest.io/about-us
https://www.dataquest.io/blog
https://www.dataquest.io/blog/learn-data-science
https://www.dataquest.io/blog/learn-python-the-right-way
https://www.dataquest.io/blog/the-perfect-data-science-learning-tool
https://www.dataquest.io/blog/topics/student-stories
https://www.dataquest.io/chat
https://www.dataquest.io/course
https://www.dataquest.io/course/algorithms-and-data-structures
https://www.dataquest.io/course/apis-and-scraping
https://www.dataquest.io/course/building-a-data-pipeline
https://www.dataquest.io/course/calculus-for-machine-learning
https://www.dataquest.io/course/command-line-elements
https://www.dataquest.io/course/command-line-intermediate
https://www.dataquest.io/course/data-exploration
https://www.dataquest.io/course/data-structures-algorithms
https://www.dataquest.io/course/decision-trees
https://www.dataquest.io/course/deep-learning-fundamentals
https://www.dataquest.io/course/exploratory-data-visualization
https://www.dataquest.io/course/exploring-topics
https://www.dataquest.io/course/git-and-vcs
https://www.dataquest.io/course/improving-code-performance
https://www.dataquest.io/course/intermediate-r-programming
https://www.dataquest.io/course/intro-to-r
https://www.dataquest.io/course/kaggle-fundamentals
https://www.dataquest.io/course/linear-algebra-for-machine-learning
https://www.dataquest.io/course/linear-regression-for-machine-learning
https://www.dataquest.io/course/machine-learning-fundamentals
https://www.dataquest.io/course/machine-learning-intermediate
https://www.dataquest.io/course/machine-learning-project
https://www.dataquest.io/course/natural-language-processing
https://www.dataquest.io/course/optimizing-postgres-databases-data-engineering
https://www.dataquest.io/course/pandas-fundamentals
https://www.dataquest.io/course/pandas-large-datasets
https://www.dataquest.io/course/postgres-for-data-engineers
https://www.dataquest.io/course/probability-fundamentals
https://www.dataquest.io/course/probability-statistics-intermediate
https://www.dataquest.io/course/python-data-cleaning-advanced
https://www.dataquest.io/course/python-datacleaning
https://www.dataquest.io/course/python-for-data-science-fundamentals
https://www.dataquest.io/course/python-for-data-science-intermediate
https://www.dataquest.io/course/python-programming-advanced
https://www.dataquest.io/course/r-data-cleaning
https://www.dataquest.io/course/r-data-cleaning-advanced
https://www.dataquest.io/course/r-data-viz
https://www.dataquest.io/course/recursion-and-tree-structures
https://www.dataquest.io/course/spark-map-reduce
https://www.dataquest.io/course/sql-databases-advanced
https://www.dataquest.io/course/sql-fundamentals
https://www.dataquest.io/course/sql-fundamentals-r
https://www.dataquest.io/course/sql-intermediate-r
https://www.dataquest.io/course/sql-joins-relations
https://www.dataquest.io/course/statistics-fundamentals
https://www.dataquest.io/course/statistics-intermediate
https://www.dataquest.io/course/storytelling-data-visualization
https://www.dataquest.io/course/text-processing-cli
https://www.dataquest.io/directory
https://www.dataquest.io/forum
https://www.dataquest.io/help
https://www.dataquest.io/path/data-analyst
https://www.dataquest.io/path/data-analyst-r
https://www.dataquest.io/path/data-engineer
https://www.dataquest.io/path/data-scientist
https://www.dataquest.io/privacy
https://www.dataquest.io/subscribe
https://www.dataquest.io/terms
https://www.dataquest.io/were-hiring
https://www.dataquest.io/wp-content/uploads/2019/03/db.png
https://www.dataquest.io/wp-content/uploads/2019/03/home-code-1.jpg
https://www.dataquest.io/wp-content/uploads/2019/03/python.png
EDIT: Changed the selector to a[href]
EDIT2: A primitive recursive crawler:
def crawl(urls, seen=set()):
    for url in urls:
        if url not in seen:
            print(url)
            seen.add(url)
            new_links = links(url)
            crawl(urls.union(new_links), seen)

starting_links = links(url)
crawl(starting_links)
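The recursive version above can hit Python's recursion limit on large sites. An iterative variant with a deque, reusing the links() function defined earlier, avoids that; this is a sketch only:
from collections import deque

def crawl_iterative(start_url):
    seen = set()
    queue = deque([start_url])
    while queue:
        current = queue.popleft()
        if current in seen:
            continue
        seen.add(current)
        print(current)
        try:
            queue.extend(links(current) - seen)  # links() returns a set of URLs
        except requests.RequestException:
            continue  # skip pages that fail to download
    return seen

crawl_iterative(url)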
I have to modify this code so the scraper keeps only the links that contain a specific keyword. In my case I'm scraping a newspaper page to find news related to the term 'Brexit'.
I've tried modifying the parse_links method so it only keeps the links (or 'a' tags) that contain 'Brexit', but it doesn't seem to work.
Where should I place the condition?
import requests
from bs4 import BeautifulSoup
from queue import Queue, Empty
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urljoin, urlparse
class MultiThreadScraper:

    def __init__(self, base_url):
        self.base_url = base_url
        self.root_url = '{}://{}'.format(urlparse(self.base_url).scheme, urlparse(self.base_url).netloc)
        self.pool = ThreadPoolExecutor(max_workers=20)
        self.scraped_pages = set([])
        self.to_crawl = Queue(10)
        self.to_crawl.put(self.base_url)

    def parse_links(self, html):
        soup = BeautifulSoup(html, 'html.parser')
        links = soup.find_all('a', href=True)
        for link in links:
            url = link['href']
            if url.startswith('/') or url.startswith(self.root_url):
                url = urljoin(self.root_url, url)
                if url not in self.scraped_pages:
                    self.to_crawl.put(url)

    def scrape_info(self, html):
        return

    def post_scrape_callback(self, res):
        result = res.result()
        if result and result.status_code == 200:
            self.parse_links(result.text)
            self.scrape_info(result.text)

    def scrape_page(self, url):
        try:
            res = requests.get(url, timeout=(3, 30))
            return res
        except requests.RequestException:
            return

    def run_scraper(self):
        while True:
            try:
                target_url = self.to_crawl.get(timeout=60)
                if target_url not in self.scraped_pages:
                    print("Scraping URL: {}".format(target_url))
                    self.scraped_pages.add(target_url)
                    job = self.pool.submit(self.scrape_page, target_url)
                    job.add_done_callback(self.post_scrape_callback)
            except Empty:
                return
            except Exception as e:
                print(e)
                continue


if __name__ == '__main__':
    s = MultiThreadScraper("https://elpais.com/")
    s.run_scraper()
You need to import the re module to match on the link text. Try the code below:
import re
links = soup.find_all('a', text=re.compile("Brexit"))
This should return only the links that contain 'Brexit'.
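If you want to apply this inside the scraper's parse_links method, one way is sketched below. It assumes re has been imported as noted above, and it carries the usual caveat that the text= filter only matches anchors whose content is a single string (no nested tags):
def parse_links(self, html):
    soup = BeautifulSoup(html, 'html.parser')
    # keep only anchors whose visible text mentions Brexit
    links = soup.find_all('a', href=True, text=re.compile("Brexit"))
    for link in links:
        url = link['href']
        if url.startswith('/') or url.startswith(self.root_url):
            url = urljoin(self.root_url, url)
            if url not in self.scraped_pages:
                self.to_crawl.put(url)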
You can get the text of the element with the getText() method and check whether the string actually contains "Brexit":
if "Brexit" in link.getText().split():
url = link["href"]
I added a check in this function. See if that does the trick for you:
def parse_links(self, html):
    soup = BeautifulSoup(html, 'html.parser')
    links = soup.find_all('a', href=True)
    for link in links:
        if 'BREXIT' in link.text.upper():  # <------ new if statement
            url = link['href']
            if url.startswith('/') or url.startswith(self.root_url):
                url = urljoin(self.root_url, url)
                if url not in self.scraped_pages:
                    self.to_crawl.put(url)
I want to use Python to obtain all the links in a domain, given the 'root' URL (in a list). Suppose I am given the URL http://www.example.com; this should return all the links on that page that share the root URL's domain, then recurse on each of those links, visiting them and extracting all same-domain links, and so on. What I mean by same domain is that, given http://www.example.com, the only links I want back are http://www.example.com/something, http://www.example.com/somethingelse, and so on. Anything external, such as http://www.otherwebsite.com, should be discarded. How can I do this with Python?
EDIT: I made an attempt using lxml. I don't think this works fully, and I am not sure how to take into account links to already-processed pages (which causes an infinite loop).
import urllib
import lxml.html

# given a url, returns list of all sublinks within the same domain
def getLinks(url):
    urlList = []
    urlList.append(url)
    sublinks = getSubLinks(url)
    for link in sublinks:
        absolute = url+'/'+link
        urlList.extend(getLinks(absolute))
    return urlList

# determine whether two links are within the same domain
def sameDomain(url, dom):
    return url.startswith(dom)

# get tree of sublinks in same domain, url is root
def getSubLinks(url):
    sublinks = []
    connection = urllib.urlopen(url)
    dom = lxml.html.fromstring(connection.read())
    for link in dom.xpath('//a/@href'):
        if not (link.startswith('#') or link.startswith('http') or link.startswith('mailto:')):
            sublinks.append(link)
    return sublinks
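For what it's worth, here is a sketch of how the already-visited problem could be handled with a shared set and urljoin. It assumes Python 3's urllib, whereas the attempt above uses the Python 2 urlopen API, and it restricts the crawl to the root URL's netloc:
import lxml.html
from urllib.request import urlopen
from urllib.parse import urljoin, urlparse

def getLinks(url, visited=None):
    """Collect same-domain links reachable from url, skipping pages already seen."""
    if visited is None:
        visited = set()
    if url in visited:
        return visited
    visited.add(url)
    dom = lxml.html.fromstring(urlopen(url).read())
    for href in dom.xpath('//a/@href'):
        absolute = urljoin(url, href)  # resolves relative links against the current page
        if urlparse(absolute).netloc == urlparse(url).netloc:
            getLinks(absolute, visited)
    return visited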
import sys
import requests
import hashlib
from bs4 import BeautifulSoup
from datetime import datetime
def get_soup(link):
    """
    Return the BeautifulSoup object for input link
    """
    request_object = requests.get(link, auth=('user', 'pass'))
    soup = BeautifulSoup(request_object.content)
    return soup

def get_status_code(link):
    """
    Return the error code for any url
    param: link
    """
    try:
        error_code = requests.get(link).status_code
    except requests.exceptions.ConnectionError:
        error_code =
    return error_code

def find_internal_urls(lufthansa_url, depth=0, max_depth=2):
    all_urls_info = []
    status_dict = {}
    soup = get_soup(lufthansa_url)
    a_tags = soup.findAll("a", href=True)

    if depth > max_depth:
        return {}
    else:
        for a_tag in a_tags:
            if "http" not in a_tag["href"] and "/" in a_tag["href"]:
                url = "http://www.lufthansa.com" + a_tag['href']
            elif "http" in a_tag["href"]:
                url = a_tag["href"]
            else:
                continue
            status_dict["url"] = url
            status_dict["status_code"] = get_status_code(url)
            status_dict["timestamp"] = datetime.now()
            status_dict["depth"] = depth + 1
            all_urls_info.append(status_dict)
    return all_urls_info

if __name__ == "__main__":
    depth = 2  # suppose
    all_page_urls = find_internal_urls("someurl", 2, 2)
    if depth > 1:
        for status_dict in all_page_urls:
            find_internal_urls(status_dict['url'])
The above snippet contains the necessary modules for scraping URLs from the Lufthansa airlines website. The only additional thing here is that you can specify the depth to which you want to scrape recursively.
Here is what I've done, only following full urls like http://domain[xxx]. Quick but a bit dirty.
import requests
import re
domain = u"stackoverflow.com"
http_re = re.compile(u"(http:\/\/" + domain + "[\/\w \.-]*\/?)")
visited = set([])
def visit(url):
    visited.add(url)
    extracted_body = requests.get(url).text
    matches = re.findall(http_re, extracted_body)
    for match in matches:
        if match not in visited:
            visit(match)
visit(u"http://" + domain)
print (visited)
There are some bugs in the code of @namita. I modified it, and it works well now.
import sys
import requests
import hashlib
from bs4 import BeautifulSoup
from datetime import datetime
def get_soup(link):
    """
    Return the BeautifulSoup object for input link
    """
    request_object = requests.get(link, auth=('user', 'pass'))
    soup = BeautifulSoup(request_object.content, "lxml")
    return soup

def get_status_code(link):
    """
    Return the error code for any url
    param: link
    """
    try:
        error_code = requests.get(link).status_code
    except requests.exceptions.ConnectionError:
        error_code = -1
    return error_code

def find_internal_urls(main_url, depth=0, max_depth=2):
    all_urls_info = []
    soup = get_soup(main_url)
    a_tags = soup.findAll("a", href=True)

    if main_url.endswith("/"):
        domain = main_url
    else:
        domain = "/".join(main_url.split("/")[:-1])
    print(domain)
    if depth > max_depth:
        return {}
    else:
        for a_tag in a_tags:
            if "http://" not in a_tag["href"] and "https://" not in a_tag["href"] and "/" in a_tag["href"]:
                url = domain + a_tag['href']
            elif "http://" in a_tag["href"] or "https://" in a_tag["href"]:
                url = a_tag["href"]
            else:
                continue
            # print(url)
            status_dict = {}
            status_dict["url"] = url
            status_dict["status_code"] = get_status_code(url)
            status_dict["timestamp"] = datetime.now()
            status_dict["depth"] = depth + 1
            all_urls_info.append(status_dict)
    return all_urls_info

if __name__ == "__main__":
    url = # your domain here
    depth = 1
    all_page_urls = find_internal_urls(url, 0, 2)
    # print("\n\n",all_page_urls)
    if depth > 1:
        for status_dict in all_page_urls:
            find_internal_urls(status_dict['url'])
The code below worked for me, though I don't know if it's 100% correct. It extracts all the internal URLs of a website:
import requests
from bs4 import BeautifulSoup
def get_soup(link):
    """
    Return the BeautifulSoup object for input link
    """
    request_object = requests.get(link, auth=('user', 'pass'))
    soup = BeautifulSoup(request_object.content, "lxml")
    return soup

visited = set([])

def visit(url, domain):
    visited.add(url)
    soup = get_soup(url)
    a_tags = soup.findAll("a", href=True)
    for a_tag in a_tags:
        if "http://" not in a_tag["href"] and "https://" not in a_tag["href"] and "/" in a_tag["href"]:
            url = domain + a_tag['href']
        elif "http://" in a_tag["href"] or "https://" in a_tag["href"]:
            url = a_tag["href"]
        else:
            continue
        if url not in visited and domain in url:
            # print(url)
            visit(url, domain)

url = input("Url: ")
domain = input("domain: ")
visit(u"" + url, domain)
print(visited)
From the tags of your question, I assume you are using Beautiful Soup.
At first, you obviously need to download the webpage, for example with urllib.request. Once you have the contents in a string, you pass it to Beautiful Soup. After that, you can find all links with soup.find_all('a'), assuming soup is your Beautiful Soup object. Then you simply need to check the hrefs:
The simplest version would be to just check whether "http://www.example.com" is in the href, but that won't catch relative links. I guess some wild regular expression would do (find everything containing "www.example.com", or starting with "/", or starting with "?" (PHP)), or you might look for everything that contains a www but is not www.example.com and discard it, etc. The correct strategy may depend on the website you are scraping and its coding style.
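Instead of hand-rolling a regular expression, normalising every href with urljoin and comparing the netloc handles both absolute and relative links. A minimal sketch, with www.example.com standing in for the real site:
from urllib.request import urlopen
from urllib.parse import urljoin, urlparse

from bs4 import BeautifulSoup

base = "http://www.example.com"  # placeholder for the real site
soup = BeautifulSoup(urlopen(base).read(), "html.parser")

same_domain_links = set()
for a in soup.find_all("a", href=True):
    absolute = urljoin(base, a["href"])  # resolves relative links such as "/about" or "?page=2"
    if urlparse(absolute).netloc == urlparse(base).netloc:
        same_domain_links.add(absolute)

print(same_domain_links)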
You can use a regular expression to filter out such links.
e.g.:
<a\shref\=\"(http\:\/\/example\.com[^\"]*)\"
Take the above regex as a reference and start writing a script based on it.
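A minimal script built around a regex like the one above might look like this. It is only a sketch: example.com is a placeholder domain, the pattern assumes href is the first attribute of the tag, and it misses relative links, as the other answers note:
import re
import requests

domain = "example.com"  # placeholder domain
# naive pattern: captures absolute links on the same domain from href="..." attributes
link_re = re.compile(r'<a\s+href="(https?://' + re.escape(domain) + r'[^"]*)"')

html = requests.get("http://" + domain).text
for link in sorted(set(link_re.findall(html))):
    print(link)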