I'm trying to make a web scraper in Python and I'm facing one issue: every streaming site I try to inspect doesn't let me view its HTML when I'm on the episodes page. It sends me back to the home page whenever I open the element inspector. Any help?
I'm new to web scraping, so I don't know another way of doing this.
import requests
from bs4 import BeautifulSoup

# THIS PROJECT IS JUST TO SCRAPE ZORO.TO AND DOWNLOAD ANIME USING IT

# Make a request and return the HTML content of a webpage
def getWebpage(url):
    r = requests.get(url)
    return BeautifulSoup(r.content, 'html5lib')

# Search the title and get its page URL
def getWatchUrl(titleName):
    # Transform the anime title into the keyword format used by the website
    keyword = '+'.join(titleName.split())
    SearchURL = f'https://zoro.to/search?keyword={keyword}'

    # Get the HTML contents of the search page
    try:
        soup = getWebpage(SearchURL)
    except Exception as e:
        print(f"Unexpected Error {e}. Please try again!")
        return None

    # Separate the useful content (anime titles and links)
    try:
        titles = soup.findAll('div', {'class': 'film-poster'}, limit=None)
    except Exception as e:
        print(e)
        print("Couldn't find title. Check spelling and please try again!")
        return None

    # Search the content for the anime title and extract its link
    for title in titles:
        for content in title:
            try:
                if titleName in content.get('title').lower():
                    path = content.get('href').replace('?ref=search', '')
                    watchURL = f'https://zoro.to/watch{path}'
                    return watchURL
            except Exception:
                continue

    print("Couldn't find title. Check spelling and please try again!")
    return None

# Get the direct download links from the watch page
def getDownloadUrl(watchUrl):
    try:
        soup = getWebpage(watchUrl)
        print(soup.prettify())
    except Exception as e:
        print(f"Unexpected Error {e}. Please try again!")
        return None

def main():
    animeName = input("Enter Anime Name: ").lower()
    watchURL = getWatchUrl(animeName)
    if watchURL is not None:
        getDownloadUrl(watchURL)

if __name__ == "__main__":
    main()
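If the goal is only to read the page's HTML without the browser's element inspector, one workaround is to fetch the page from Python and dump it to a file that can be opened in a text editor. A minimal sketch reusing the getWebpage helper above (note that if the episode page is built by JavaScript, the fetched HTML may not contain what the browser shows; the example URL is hypothetical):

def dumpWebpage(url, outfile="page_dump.html"):
    # Fetch the page and save the prettified HTML so it can be inspected offline
    soup = getWebpage(url)
    with open(outfile, "w", encoding="utf-8") as f:
        f.write(soup.prettify())
    print(f"Saved {url} to {outfile}")

# Example (hypothetical URL):
# dumpWebpage("https://zoro.to/watch/some-title-episode-1")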
I'm trying to write a Python script to check the status display text for a specific country (e.g. Ecuador)
on this website:
https://immi.homeaffairs.gov.au/what-we-do/whm-program/status-of-country-caps.
How do I keep track of that specific text so I notice when it changes?
Currently I compare hash codes of the page after a time-delay interval, but the hash code seems to change every time even though nothing changes visually.
import urllib.request
import hashlib
import time

input_website = 'https://immi.homeaffairs.gov.au/what-we-do/whm-program/status-of-country-caps'
time_delay = 60

# Monitor the website
def monitor_website():
    # Run the loop to keep monitoring
    while True:
        # Visit the website to know if it is up
        status = urllib.request.urlopen(input_website).getcode()

        # If it returns 200, the website is up
        if status != 200:
            # Call email function (defined elsewhere)
            send_email("The website is DOWN")
        else:
            send_email("The website is UP")

        # Open the URL and create the hash code
        response = urllib.request.urlopen(input_website).read()
        current_hash = hashlib.sha224(response).hexdigest()

        # Revisit the website after the time delay
        time.sleep(time_delay)

        # Visit the website after the delay and generate the new hash
        response = urllib.request.urlopen(input_website).read()
        new_hash = hashlib.sha224(response).hexdigest()

        # Compare the hash codes
        if new_hash != current_hash:
            send_email("The website CHANGED")
Can you check it using Beautiful Soup?
Crawl the page for "Ecuador" and then check the next word for "suspended**"
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = "https://immi.homeaffairs.gov.au/what-we-do/whm-program/status-of-country-caps"
html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, 'html.parser')

# Create a list of the text in all 'td' tags
list_name = list()
tags = soup('td')
for tag in tags:
    # Strip whitespace and the \u200b unicode character
    url_grab = tag.get_text().strip(u'\u200b').strip()
    list_name.append(url_grab)

# Search the list for Ecuador and the following item (its status)
country_status = {}
for i in range(len(list_name)):
    if "Ecuador" in list_name[i]:
        country_status[list_name[i]] = list_name[i + 1]
        print(country_status)
    else:
        continue

# Check the status text
if country_status["Ecuador"] != "suspended**":
    print("Website has changed")
I am trying to make a website crawler that retrieves all links within a site, prints them to the console, and also writes them to a text file, using a Python script.
The script takes the URL of the website you want to retrieve links from, the number of URLs to be followed from the main page, and the maximum number of URLs to be retrieved; then, using the functions crawl(), is_valid() and get_all_website_links(), it retrieves the URLs. It also separates external links and internal links through the get_all_website_links() function.
So far I have been successful at retrieving, printing and writing the links to the text file, but I ran into a problem when the server refuses to connect: it stops the link retrieval and ends the execution.
What I want my script to do is retry a specified number of times and, if it still fails after retrying, continue to the next link.
I tried to implement this mechanism myself but couldn't work out how.
I'm appending my Python script below for your better understanding.
An elaborate explanation with an implementation would be deeply appreciated!
Pardon me if my grammar is bad ;)
Thanks for your time :)
import requests
from urllib.parse import urlparse, urljoin
from bs4 import BeautifulSoup
import colorama
import sys
sys.setrecursionlimit(99999999)
print("WEBSITE CRAWLER".center(175,"_"))
print("\n","="*175)
print("\n\n\n\nThis program does not tolerate faults!\nPlease type whatever you are typing correctly!\nIf you think you have made a mistake please close the program and reopen it!\nIf you proceed with errors the program will crash/close!\nHelp can be found in the README.txt file!\n\n\n")
print("\n","="*175)
siteurl = input("Enter the address of the site (Please don't forget https:// or http://, etc. at the front!) :")
max_urls = int(input("Enter the number of urls you want to crawl through the main page : "))
filename = input("Give a name for your text file (Don't append .txt at the end!) : ")
# init the colorama module
colorama.init()
GREEN = colorama.Fore.GREEN
MAGENTA = colorama.Fore.MAGENTA
RESET = colorama.Fore.RESET
# initialize the set of links (unique links)
internal_urls = set()
external_urls = set()
def is_valid(url):
    """
    Checks whether `url` is a valid URL.
    """
    parsed = urlparse(url)
    return bool(parsed.netloc) and bool(parsed.scheme)
def get_all_website_links(url):
    """
    Returns all URLs found on `url` that belong to the same website.
    """
    # all URLs of `url`
    urls = set()
    # domain name of the URL without the protocol
    domain_name = urlparse(url).netloc
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    for a_tag in soup.findAll("a"):
        href = a_tag.attrs.get("href")
        if href == "" or href is None:
            # empty href attribute
            continue
        # join the URL if it's relative (not an absolute link)
        href = urljoin(url, href)
        parsed_href = urlparse(href)
        # remove URL GET parameters, URL fragments, etc.
        href = parsed_href.scheme + "://" + parsed_href.netloc + parsed_href.path
        if not is_valid(href):
            # not a valid URL
            continue
        if href in internal_urls:
            # already in the set
            continue
        if domain_name not in href:
            # external link
            if href not in external_urls:
                print(f"{MAGENTA} [!] External link: {href}{RESET}")
                with open(filename + ".txt", "a") as f:
                    print(f"{href}", file=f)
                external_urls.add(href)
            continue
        print(f"{GREEN}[*] Internal link: {href}{RESET}")
        with open(filename + ".txt", "a") as f:
            print(f"{href}", file=f)
        urls.add(href)
        internal_urls.add(href)
    return urls
# number of urls visited so far will be stored here
total_urls_visited = 0

def crawl(url, max_urls=50000):
    """
    Crawls a web page and extracts all links.
    You'll find all links in the `external_urls` and `internal_urls` global set variables.
    params:
        max_urls (int): maximum number of URLs to crawl (default 50000).
    """
    global total_urls_visited
    total_urls_visited += 1
    links = get_all_website_links(url)
    for link in links:
        if total_urls_visited > max_urls:
            break
        crawl(link, max_urls=max_urls)

if __name__ == "__main__":
    crawl(siteurl, max_urls)
    print("[+] Total External links:", len(external_urls))
    print("[+] Total Internal links:", len(internal_urls))
    print("[+] Total:", len(external_urls) + len(internal_urls))
    input("Press any key to exit...")
I wrote code to scrape four fields, but I'm only getting data for the first field, "title"; the other three fields return empty results. Could anyone please guide me on how to fix this issue? Thanks!
Here is my code:
import requests
from bs4 import BeautifulSoup
#import pandas as pd
import csv

def get_page(url):
    response = requests.get(url)
    if not response.ok:
        print('server responded:', response.status_code)
    else:
        soup = BeautifulSoup(response.text, 'html.parser')  # 1. html, 2. parser
        return soup

def get_detail_data(soup):
    try:
        title = soup.find('span', class_="text-info h4", id=False).find('strong').text
    except:
        title = 'empty'
    print(title)

    try:
        add = soup.find('div', class_="col-xs-12 col-sm-4", id=False).find('strong')
    except:
        add = 'empty add'
    print(add)

    try:
        phone = soup.find('div', class_="col-xs-12 col-sm-4", id=False).text
    except:
        phone = 'empty phone'
    print(phone)

def main():
    url = "https://www.dobsearch.com/people-finder/view.php?searchnum=287404084791&sessid=vusqgp50pm8r38lfe13la8ta1l"
    get_detail_data(get_page(url))

if __name__ == '__main__':
    main()
For the second field you are matching a class that occurs earlier in the page than the element you want, so you need to change the class or chain multiple find calls; the same thing happens with the third field. Classes like col-xs-12 are Bootstrap layout classes and are used all over the page, so in general they are poor hooks for find (or you have to build more specific, combined finds). As far as I can see this site doesn't have many unique classes, so I think you should use multiple find methods to get what you want. Another thing: don't use try...except unless you know what you expect to get from that part of the page.
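As an illustration of the "multiple finds" idea: first locate a container that wraps one record, then search inside it. The selectors below are placeholders (the real page's structure isn't shown here); they only demonstrate the pattern:

from bs4 import BeautifulSoup

def get_detail_data(soup):
    # Narrow the search to a container that wraps one record first (placeholder selector)
    record = soup.find('div', class_="panel-body")
    if record is None:
        return None

    # Then search within that container instead of the whole page
    title = 'empty'
    title_tag = record.find('span', class_="text-info h4")
    if title_tag is not None and title_tag.find('strong') is not None:
        title = title_tag.find('strong').get_text(strip=True)

    # Take the Nth matching column rather than the first one on the page
    columns = record.find_all('div', class_="col-xs-12 col-sm-4")
    add = columns[0].get_text(strip=True) if len(columns) > 0 else 'empty add'
    phone = columns[1].get_text(strip=True) if len(columns) > 1 else 'empty phone'

    return {'title': title, 'address': add, 'phone': phone}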
I am trying to extract the followers of a random web page on Instagram. I tried to use Python in combination with Beautiful Soup.
Nonetheless, I have not managed to extract any of the information that I can see on the web page.
import json
import time
import requests

def get_user_info(user_name):
    url = "https://www.instagram.com/" + user_name + "/?__a=1"
    try:
        r = requests.get(url)
    except requests.exceptions.ConnectionError:
        print('Seems like dns lookup failed..')
        time.sleep(60)
        return None

    if r.status_code != 200:
        print('User: ' + user_name + ' status code: ' + str(r.status_code))
        print(r)
        return None

    info = json.loads(r.text)
    return info['user']

get_user_info("wernergruener")
As mentioned I do not get the followers of the page. How could I do this?
Cheers,
Andi
With API/JSON:
I'm not familiar with the Instagram API, but it doesn't look like it returns detailed information about a person's followers, just the number of followers.
You should be able to get that information using info["user"]["followed_by"]["count"].
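For example, using the get_user_info function from the question (this assumes the JSON layout described above; note that the function already returns info['user'], so the outer "user" key is dropped):

user = get_user_info("wernergruener")
if user is not None:
    print(user["followed_by"]["count"])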
With raw page/Beautiful Soup:
Assuming the non-API page reveals the information you want about a person's followers, you'll want to download the raw HTML (instead of JSON) and parse it using Beautiful Soup.
import time
import requests
from bs4 import BeautifulSoup

def get_user_info(user_name):
    url = "https://www.instagram.com/" + user_name
    try:
        r = requests.get(url)
    except requests.exceptions.ConnectionError:
        print('Seems like dns lookup failed..')
        time.sleep(60)
        return None

    if r.status_code != 200:
        print('User: ' + user_name + ' status code: ' + str(r.status_code))
        print(r)
        return None

    soup = BeautifulSoup(r.text, 'html.parser')
    # find things using Beautiful Soup

get_user_info("wernergruener")
Beautiful Soup has some of the most intuitive documentation I've ever read. I'd start there:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
With API/python-instagram:
Other people have already done a lot of the heavy lifting for you. I think python-instagram should offer you easier access to the information you want.
I am getting this error:
requests.exceptions.MissingSchema: Invalid URL 'http:/1525/bg.png': No schema supplied. Perhaps you meant http://http:/1525/bg.png?
I don't really care why the error happened; I want to be able to catch any invalid-URL error, print a message, and proceed with the rest of the code.
Below is my code, where I'm trying to use try/except for that specific error, but it's not working...
# load xkcd page
# save comic image on that page
# follow <previous> comic link
# repeat until last comic is reached
import webbrowser, bs4, os, requests

url = 'http://xkcd.com/1526/'
os.makedirs('xkcd', exist_ok=True)

while not url.endswith('#'):  # '#' marks the last page
    # download the page
    print('Downloading page %s...' % (url))
    res = requests.get(url)
    res.raise_for_status()
    soup = bs4.BeautifulSoup(res.text, "html.parser")

    # find the url of the comic image (<div id="comic"><img src="..."></div>)
    comicElem = soup.select('#comic img')
    if comicElem == []:
        print('Could not find any images')
    else:
        comicUrl = 'http:' + comicElem[0].get('src')
        # download the image
        print('Downloading image... %s' % (comicUrl))
        res = requests.get(comicUrl)
        try:
            res.raise_for_status()
        except requests.exceptions.MissingSchema as err:
            print(err)
            continue

        # save the image to the folder
        imageFile = open(os.path.join('xkcd', os.path.basename(comicUrl)), 'wb')
        for chunk in res.iter_content(1000000):
            imageFile.write(chunk)
        imageFile.close()

    # get the <previous> button's url
    prevLink = soup.select('a[rel="prev"]')[0]
    url = 'http://xkcd.com' + prevLink.get('href')

print('Done')
What am I not doing? (I'm on Python 3.5.)
Thanks a lot in advance...
If you don't care about the error (which I see as bad programming), just use a bare except statement that catches all exceptions.
# download the image
print('Downloading image... %s' % (comicUrl))
try:
    res = requests.get(comicUrl)  # moved inside the try block
    res.raise_for_status()
except:
    continue
On the other hand, if your except block isn't catching the exception, that's because the exception actually happens outside your try block; move requests.get into the try block and the exception handling should work (that's if you still need it).
Try this if you have this type of issue; it occurs when a wrong URL is used.
Solution:
import requests

correct_url = False
url = 'Ankit Gandhi'  # 'https://gmail.com'
try:
    res = requests.get(url)
    correct_url = True
except:
    print("Please enter a valid URL")

if correct_url:
    """
    Do your operation
    """
    print("Correct URL")
Hope this helps.
The reason your try/except block isn't catching the exception is that the error happens at the line
res = requests.get(comicUrl)
which is above the try keyword.
Keeping your code as it is, just moving that line down into the try block will fix it.
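Put together, the relevant part of the loop would look something like the sketch below, with the request inside the try block. (Depending on the requests version, a malformed URL like the one in the traceback raises MissingSchema or InvalidURL, so both are caught here; the example URLs are purely illustrative.)

import requests

# Example values: the first mimics the malformed URL from the traceback, the second is well formed.
comic_urls = ['http:/1525/bg.png', 'https://xkcd.com/']

for comicUrl in comic_urls:
    print('Downloading image... %s' % (comicUrl))
    try:
        res = requests.get(comicUrl)  # the request is now inside the try block
        res.raise_for_status()
    except (requests.exceptions.MissingSchema, requests.exceptions.InvalidURL) as err:
        print(err)  # report the malformed URL and move on
        continue
    print('Fetched %d bytes from %s' % (len(res.content), comicUrl))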