Removing all characters after URL? - python

Basically, I'm trying to remove all the characters after the URL extension in a URL, but it's proving difficult. The application works off a list of various URLs with various extensions.
Here's my source:
import requests
from bs4 import BeautifulSoup
from time import sleep
#takes userinput for path of panels they want tested
import_file_path = input('Enter the path of the websites to be tested: ')
#takes userinput for path of exported file
export_file_path = input('Enter the path of where we should export the panels to: ')
#reads imported panels
with open(import_file_path, 'r') as panels:
    panel_list = []
    for line in panels:
        panel_list.append(line)
x = 0
for panel in panel_list:
    url = requests.get(panel)
    soup = BeautifulSoup(url.content, "html.parser")
    forms = soup.find_all("form")
    action = soup.find('form').get('action')
    values = {
        soup.find_all("input")[0].get("name") : "user",
        soup.find_all("input")[1].get("name") : "pass"
    }
    print(values)
    r = requests.post(action, data=values)
    print(r.headers)
    print(r.status_code)
    print(action)
    sleep(10)
    x += 1
What I'm trying to achieve is an application that automatically tests your username/password against a list of URLs provided in a text document. However, BeautifulSoup returns the form's action attribute exactly as it is written in the HTML, i.e., instead of the full http://example.com/action.php it returns the relative action.php. The only way I can think of to get past this is to rebuild the 'action' variable as 'panel' with all characters after the URL extension removed, followed by 'action'.
Thanks!
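One possible approach, instead of trimming characters from the URL by hand, is to let the standard library resolve the relative action against the page URL. The sketch below assumes the action is a normal relative or absolute link; resolve_action is a hypothetical helper name, not something from the original script.

```python
from urllib.parse import urljoin

def resolve_action(page_url, action):
    """Return an absolute URL for a form action found on page_url."""
    # A missing or empty action means the form posts back to the page itself.
    if not action:
        return page_url
    # urljoin leaves absolute actions unchanged and resolves relative ones
    # against the page URL, replacing its last path segment.
    return urljoin(page_url, action)

print(resolve_action("http://example.com/panel/login.php?next=1", "action.php"))
```

In the loop above this would be used as resolve_action(panel.strip(), action) before calling requests.post; the strip() matters because lines read from the file still end with a newline.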

Related

list index out of range - beautiful soup

I'm new to Python. Below is the code I am using to pull a zip file from a website, but I am getting the error "list index out of range". The code was written by someone else; I only had to change the URL, and now I am getting the error. When I print(list_of_documents) it is empty.
Can someone help me with this? The URL requires access, so you won't be able to run this code directly. I am trying to understand how to use Beautiful Soup here and how I can get the list to populate correctly.
import datetime
import requests
import csv
from zipfile import ZipFile as zf
import os
import pandas as pd
import time
from bs4 import BeautifulSoup
import pyodbc
import re
#set download location
downloads_folder = r"C:\Scripts"
##### Creating outage dataframe
#Get list of download links
res = requests.get('https://www.ercot.com/mp/data-products/data-product-details?id=NP3-233-CD')
ercot_soup = BeautifulSoup(res.text, "lxml")
list_of_documents = ercot_soup.findAll('td', attrs={'class': 'labelOptional_ind'})
list_of_links = ercot_soup.select('a')
##create the url for the download
loc = str(list_of_links[0])[9:len(str(list_of_links[0]))-9]
link = 'http://www.ercot.com' + loc
link = link.replace('amp;','')
# Define file name and set download path
file_name = str(list_of_documents[0])[30:len(str(list_of_documents[0]))-5]
file_path = downloads_folder + '/' + file_name
You can't expect code tailored to scrape one website to work for a different link! You should always inspect and explore your target site, especially the parts you need to scrape, so you know the tag names [like td and a here] and identifying attributes [like name, id, class, etc.] of the elements you need to extract data from.
With this site, if you want the info from the reportTable, it gets generated after the page gets loaded with javascript, so it wouldn't show up in the request response. You could either try something like Selenium, or you could try retrieving the data from the source itself.
If you inspect the site and look at the network tab, you'll find a request (which is what actually retrieves the data for the table) that looks like this, and when you inspect the table's html, you'll find above it the scripts to generate the data.
In the suggested solution below, getReqUrl scrapes your link to get the URL for requesting the reports (and also the URL template for downloading the documents).
def getReqUrl(scrapeUrl):
    res = requests.get(scrapeUrl)
    ercot_soup = BeautifulSoup(res.text, "html.parser")
    script = [l.split('"') for l in [
        s for s in ercot_soup.select('script')
        if 'reportListUrl' in s.text
        and 'reportTypeID' in s.text
    ][0].text.split('\n') if l.count('"') == 2]
    rtID = [l[1] for l in script if 'reportTypeID' in l[0]][0]
    rlUrl = [l[1] for l in script if 'reportListUrl' in l[0]][0]
    rdUrl = [l[1] for l in script if 'reportDownloadUrl' in l[0]][0]
    return f'{rlUrl}{rtID}&_={int(time.time())}', rdUrl
(I couldn't figure out how to scrape the last query parameter [the &_=... part] from the site exactly, but {int(time.time())} seems to get close enough. The results are the same either way, and even when that last bit is omitted entirely, so it's totally optional.)
The url returned can be used to request the documents:
import json
url = 'https://www.ercot.com/mp/data-products/data-product-details?id=NP3-233-CD'
# getReqUrl returns (report list URL, download URL template)
reqUrl, ddUrl = getReqUrl(url)
reqRes = requests.get(reqUrl).text
rsJson = json.loads(reqRes)
for doc in rsJson['ListDocsByRptTypeRes']['DocumentList']:
    d = doc['Document']
    downloadLink = ddUrl+d['DocID']
    #print(f"{d['FriendlyName']} {d['PublishDate']} {downloadLink}")
    print(f"Download '{d['ConstructedName']}' at\n\t {downloadLink}")
print(len(rsJson['ListDocsByRptTypeRes']['DocumentList']))
The print results will look like

How to check if the specific text on a website changed using python script

I'm trying to write a Python script to check the status text displayed for a specific country (e.g., Ecuador)
on this website:
https://immi.homeaffairs.gov.au/what-we-do/whm-program/status-of-country-caps.
How do I keep track of that specific text and detect when it changes?
Currently, I compare hash codes taken before and after a time delay, but the hash seems to change every time even though nothing changes visually.
import hashlib
import time
import urllib.request
# send_email() is the asker's own helper and is not shown here
input_website = 'https://immi.homeaffairs.gov.au/what-we-do/whm-program/status-of-country-caps'
time_delay = 60
#Monitor the website
def monitor_website():
    # Run the loop to keep monitoring
    while True:
        # Visit the website to know if it is up
        status = urllib.request.urlopen(input_website).getcode()
        # If it returns 200, the website is up
        if status != 200:
            # Call email function
            send_email("The website is DOWN")
        else:
            send_email("The website is UP")
            # Open url and create the hash code
            response = urllib.request.urlopen(input_website).read()
            current_hash = hashlib.sha224(response).hexdigest()
            # Revisit the website after time delay
            time.sleep(time_delay)
            # Visit the website after delay, and generate the new hash
            response = urllib.request.urlopen(input_website).read()
            new_hash = hashlib.sha224(response).hexdigest()
            # Check the hash codes
            if new_hash != current_hash:
                send_email("The website CHANGED")
Can you check it using Beautiful Soup?
Crawl the page for "Ecuador" and then check the next word for "suspended**"
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl
# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
url = "https://immi.homeaffairs.gov.au/what-we-do/whm-program/status-of-country-caps"
html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, 'html.parser')
# create list of all tags 'td'
list_name = list()
tags = soup('td')
for tag in tags:
    #take out whitespace and \u200b unicode
    url_grab = tag.get_text().strip(u'\u200b').strip()
    list_name.append(url_grab)
#Search list for Ecuador and following item in list
country_status ={}
for i in range(len(list_name)):
    if "Ecuador" in list_name[i]:
        country_status[list_name[i]] = list_name[i+1]
        print(country_status)
    else:
        continue
#Check website
if country_status["Ecuador"] != "suspended**":
    print("Website has changed")
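If you still want to compare hashes, one option is to hash only the extracted status text rather than the whole response. The raw HTML may contain parts that differ between requests even when the visible text is identical, which would explain a hash that changes every time. This is a minimal sketch under that assumption; text_fingerprint is a hypothetical helper name.

```python
import hashlib
import re

def text_fingerprint(text):
    """Hash only the visible text, ignoring whitespace and zero-width characters."""
    # Remove zero-width spaces (the same \u200b character stripped above).
    cleaned = text.replace("\u200b", "")
    # Collapse runs of whitespace so layout changes do not alter the hash.
    cleaned = re.sub(r"\s+", " ", cleaned).strip()
    return hashlib.sha224(cleaned.encode("utf-8")).hexdigest()

before = text_fingerprint("  suspended**\u200b ")
after = text_fingerprint("suspended**")
print(before == after)
```

You would pass in the cell text found for the country (for example country_status["Ecuador"]) on each visit and compare the two fingerprints.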

Retry mechanism for my web crawler script

I am trying to make a website crawler, written in Python, that retrieves all links within a site, prints them to the console, and also writes them to a text file.
The script takes the URL of the website you want to retrieve links from, the number of URLs to follow from the main page, and the maximum number of URLs to retrieve. It then retrieves the URLs using the functions crawl(), is_valid() and get_all_website_links(). The get_all_website_links() function also separates external links from internal links.
So far I have been successful with retrieving the links, printing them, and writing them to the text file, but I hit a problem when the server refuses to connect: the link retrieval stops and execution ends.
What I want my script to do is retry a specified number of times and continue to the next link if it still fails after retrying.
I tried to implement this mechanism myself but could not work out how.
I'm appending my python script below for your better understanding.
An elaborate explanation with implementation would be deeply appreciated!
Pardon me if my grammar is bad ;)
Thanks for your time :)
import requests
from urllib.parse import urlparse, urljoin
from bs4 import BeautifulSoup
import colorama
import sys
sys.setrecursionlimit(99999999)
print("WEBSITE CRAWLER".center(175,"_"))
print("\n","="*175)
print("\n\n\n\nThis program does not tolerate faults!\nPlease type whatever you are typing correctly!\nIf you think you have made a mistake please close the program and reopen it!\nIf you proceed with errors the program will crash/close!\nHelp can be found in the README.txt file!\n\n\n")
print("\n","="*175)
siteurl = input("Enter the address of the site (Please don't forget https:// or http://, etc. at the front!) :")
max_urls = int(input("Enter the number of urls you want to crawl through the main page : "))
filename = input("Give a name for your text file (Don't append .txt at the end!) : ")
# init the colorama module
colorama.init()
GREEN = colorama.Fore.GREEN
MAGENTA = colorama.Fore.MAGENTA
RESET = colorama.Fore.RESET
# initialize the set of links (unique links)
internal_urls = set()
external_urls = set()
def is_valid(url):
    """
    Checks whether `url` is a valid URL.
    """
    parsed = urlparse(url)
    return bool(parsed.netloc) and bool(parsed.scheme)
def get_all_website_links(url):
    """
    Returns all URLs that are found on `url` and belong to the same website
    """
    # all URLs of `url`
    urls = set()
    # domain name of the URL without the protocol
    domain_name = urlparse(url).netloc
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    for a_tag in soup.findAll("a"):
        href = a_tag.attrs.get("href")
        if href == "" or href is None:
            # href empty tag
            continue
        # join the URL if it's relative (not absolute link)
        href = urljoin(url, href)
        parsed_href = urlparse(href)
        # remove URL GET parameters, URL fragments, etc.
        href = parsed_href.scheme + "://" + parsed_href.netloc + parsed_href.path
        if not is_valid(href):
            # not a valid URL
            continue
        if href in internal_urls:
            # already in the set
            continue
        if domain_name not in href:
            # external link
            if href not in external_urls:
                print(f"{MAGENTA} [!] External link: {href}{RESET}")
                with open(filename+".txt","a") as f:
                    print(f"{href}",file = f)
                external_urls.add(href)
            continue
        print(f"{GREEN}[*] Internal link: {href}{RESET}")
        with open(filename+".txt","a") as f:
            print(f"{href}",file = f)
        urls.add(href)
        internal_urls.add(href)
    return urls
# number of urls visited so far will be stored here
total_urls_visited = 0
def crawl(url, max_urls=50000):
    """
    Crawls a web page and extracts all links.
    You'll find all links in `external_urls` and `internal_urls` global set variables.
    params:
        max_urls (int): number of max urls to crawl, default is 50000.
    """
    global total_urls_visited
    total_urls_visited += 1
    links = get_all_website_links(url)
    for link in links:
        if total_urls_visited > max_urls:
            break
        crawl(link, max_urls=max_urls)
if __name__ == "__main__":
    crawl(siteurl,max_urls)
    print("[+] Total External links:", len(external_urls))
    print("[+] Total Internal links:", len(internal_urls))
    print("[+] Total:", len(external_urls) + len(internal_urls))
    input("Press any key to exit...")
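A retry mechanism like the one asked for could be sketched as a small wrapper around the request. This is only one possible shape, not a definitive implementation: fetch_with_retries and its parameters (fetch, retries, delay, exceptions) are hypothetical names, and the wrapper is kept generic so it runs without any third-party package.

```python
import time

def fetch_with_retries(fetch, url, retries=3, delay=1.0, exceptions=(Exception,)):
    """Call fetch(url), retrying on failure; return None if every attempt fails."""
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except exceptions as error:
            print(f"[!] Attempt {attempt}/{retries} failed for {url}: {error}")
            if attempt < retries:
                # Wait a little before trying again.
                time.sleep(delay)
    # All attempts failed: signal the caller to skip this link.
    return None

# Demonstration with a stand-in fetcher that fails twice, then succeeds.
calls = {"count": 0}

def flaky(url):
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("connection refused")
    return f"content of {url}"

print(fetch_with_retries(flaky, "http://example.com", delay=0))
```

In the crawler, the requests.get(url) call inside get_all_website_links() could be wrapped as fetch_with_retries(lambda u: requests.get(u, timeout=10), url, exceptions=(requests.exceptions.RequestException,)). If the result is None, returning an empty set from get_all_website_links() lets crawl() move on to the next link instead of ending the run.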

Looping URLs for scraping on BeautifulSoup

My script currently looks at a list of 5 URLs, once it reaches the end of the list it stops scraping. I want it to loop back to the first URL after it completes the last URL. How would I achieve that?
The reason I want it to loop is to monitor for any changes in the product such as the price etc.
I tried a few methods I found online but couldn't figure it out, as I am new to this. Hope you can help!
import requests
import lxml.html
from bs4 import BeautifulSoup
from fake_useragent import UserAgent
from dhooks import Webhook, Embed
import random
ua = UserAgent()
header = {'User-Agent':ua.chrome}
# Proxies
proxy_list = []
for line in open('proxies.txt', 'r'):
    line = line.replace('\n', '')
    proxy_list.append(line)
def get_proxy():
    proxy = random.choice(proxy_list)
    proxies = {
        "http": f'{str(proxy)}',
        "https": f'{str(proxy)}'
    }
    return proxies
# Opening URL file
with open('urls.txt','r') as file:
    for url in file.readlines():
        proxies = get_proxy()
        result = requests.get(url.strip() ,headers=header,timeout=4,proxies=proxies)
        #src = result.content
        soup = BeautifulSoup(result.content, 'lxml')
You can store the URLs in a list and iterate over it with a while loop, resetting the position when you reach the end. The basic logic is:
with open('urls.txt','r') as file:
    url_list = file.readlines()
pos = 0
while True:
    if pos >= len(url_list):
        pos = 0
    url = url_list[pos]
    pos += 1
    # ... rest of your logic ...
You can add a while True: loop outside and above your main with statement & for loop (and add one level of indent to every line inside). This way the program will keep running until terminated by user.
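Another way to express the same looping, sketched here with the standard library's itertools.cycle, avoids tracking the position by hand. endless_urls is a hypothetical helper name, and the example URLs are placeholders rather than anything from the question.

```python
from itertools import cycle, islice

def endless_urls(urls):
    """Yield the given URLs forever, restarting from the first after the last."""
    # Strip newlines and skip blank lines, since file.readlines() keeps both.
    cleaned = [u.strip() for u in urls if u.strip()]
    return cycle(cleaned)

urls = ["https://example.com/a\n", "https://example.com/b\n", "\n"]
# islice limits the demo to five items; a real monitor would loop without it.
for url in islice(endless_urls(urls), 5):
    print(url)
```

In a real monitor you would probably also want a pause (for example time.sleep) inside the loop so the same product pages are not requested back to back.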

How to load a webpage (NOT in a browser) then get the url of that webpage?

I wanted to make something I saw on Reddit that lets you get a random Wikipedia article, see its headline, and then either (A) open the article in your browser or (B) get a new random article. To get a random article you would request the URL "https://en.wikipedia.org/wiki/Special:Random", but then I would need to load that URL, see what it changed to, and figure out which article I got. How would I do this?
Breaking the task down into bite sized chunks:
get a random Wikipedia article
Cool. This is pretty straightforward. You can either use Python's built-in urllib.request (urllib2 in Python 2) or the requests package. Most people recommend requests (pip install requests) as it is a higher-level library that is a bit simpler to use, but in this case what we are doing is so simple it might be overkill. At any rate:
import requests
RANDOM_WIKI_URL = "https://en.wikipedia.org/wiki/Special:Random"
response = requests.get(RANDOM_WIKI_URL)
data = response.content
url = response.url
see its headline
For this we need to parse the HTML. It's tempting to recommend that you just use a regex to extract the text from the element containing the title but really the proper way to do this sort of thing is to use a library like BeautifulSoup (pip install beautifulsoup4):
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'html.parser')
title = soup.select('#firstHeading')[0].get_text()
print(title)
either A ([...]) or B ([...])
print("=" * 80)
print("(a): Open in new browser tab")
print("(b): Get new article")
print("(q): Quit")
user_input = input("[a|b|q]: ").lower()
if user_input == 'a':
    ...
elif user_input == 'b':
    ...
elif user_input == 'q':
    ...
open the article in your browser
import webbrowser
webbrowser.open_new_tab(url)
get a new random article
response = requests.get(RANDOM_WIKI_URL)
data = response.content
url = response.url
Putting it all together:
import webbrowser
from bs4 import BeautifulSoup
import requests
RANDOM_WIKI_URL = "https://en.wikipedia.org/wiki/Special:Random"
def get_user_input():
    # Keep asking until the user picks one of the three options
    user_input = ''
    while user_input not in ('a', 'b', 'q'):
        print('-' * 79)
        print("(a): Open in new browser tab")
        print("(b): Get new random article")
        print("(q): Quit")
        print('-' * 79)
        user_input = input("[a|b|q]: ").lower()
    return user_input
def main():
    while True:
        print("=" * 79)
        print("Retrieving random wikipedia article...")
        response = requests.get(RANDOM_WIKI_URL)
        data = response.content
        # response.url is the final URL after the redirect was followed
        url = response.url
        soup = BeautifulSoup(data, 'html.parser')
        title = soup.select('#firstHeading')[0].get_text()
        print("Random Wikipedia article: '{}'".format(title))
        user_input = get_user_input()
        if user_input == 'q':
            break
        elif user_input == 'a':
            webbrowser.open_new_tab(url)
if __name__ == '__main__':
    main()
The Special:Random page in Wikipedia returns a redirect response with the destination location:
HTTP/1.1 302 Found
...
Location: https://en.wikipedia.org/wiki/URL_redirection
...
Most libraries (and all browsers) follow that link automatically, but you can disable it, for example, in requests:
import requests
url = 'https://en.wikipedia.org/wiki/Special:Random'
response = requests.get(url, allow_redirects=False)
real_url = response.headers['location']
# then use real_url to fetch the page
Alternatively, requests provides the redirection history:
response = requests.get(url)
real_url = response.history[-1].headers['location']
In the latter case, response already contains the page you need, so that's a simpler way.
URL - you can get the URL with response.geturl() on the urllib.request response (urllib2 in Python 2)
Wiki header - you can parse the header with the BeautifulSoup package
Browser - you can open the URL in a web browser with webbrowser.open(url)
Here's a simple working example:
import urllib.request
import webbrowser
from bs4 import BeautifulSoup
while True:
    response = urllib.request.urlopen('https://en.wikipedia.org/wiki/Special:Random')
    headline = BeautifulSoup(response.read(), 'html.parser').html.title.string
    url = response.geturl()
    print("The url: " + url)
    print("The headline: " + headline)
    x = input("Press: [A - Open in browser] [B - Get a new random article] [Anything else to exit]\n>")
    if x == "A":
        webbrowser.open(url) #open in browser
    elif x == "B":
        continue # get a new random article
    else:
        break #exit
