I have just started learning Python with the Coursera course "Python for Everybody", and I have an assignment where I have to follow links using Beautiful Soup. I saw this question come up before, but when I tried the suggested approach it just didn't work. I managed to write something, but it doesn't actually follow the links; it just stays on the same page. If possible, can anyone also point me to materials that give better insight into this assignment? Thanks.
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl

ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter URL - ')
cnt = input("Enter count -")
count = int(cnt)
pn = input("Enter position -")
position = int(pn) - 1

while count > 0:
    html = urllib.request.urlopen(url, context=ctx).read()
    soup = BeautifulSoup(html, "html.parser")
    tags = soup('a')
    lst = list()
    for tag in tags:
        lst.append(tag.get('href', None))
    indxpos = lst[position]
    count = count - 1
    print("Retrieving:", indxpos)
You never set url to the new URL.
while count > 0:
    html = urllib.request.urlopen(url, context=ctx).read()  # Gets the page at url
    ...
    for tag in tags:
        lst.append(tag.get('href', None))  # Appends all the links to lst
    indxpos = lst[position]
    count = count - 1
    print("Retrieving:", indxpos)
    # What happens to lst? You build it but never use it to follow a link.
You should probably replace indxpos with url instead.
while count > 0:
    html = urllib.request.urlopen(url, context=ctx).read()  # Gets the page at url
    ...
    for tag in tags:
        lst.append(tag.get('href', None))  # Appends all the links to lst
    url = lst[position]
    count = count - 1
    print("Retrieving:", url)
This way, the next time the loop runs, it will fetch the new URL.
Also: if the page does not have pn links (e.g. pn=12 but the page has only 2 links), accessing lst[position] will raise an IndexError, because lst has fewer than pn elements.
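Putting the pieces together, here is a minimal sketch of the corrected loop (it assumes, as the assignment does, that every fetched page has at least pn links; the guard below just makes the failure explicit instead of raising an IndexError):

import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter URL - ')
count = int(input('Enter count - '))
position = int(input('Enter position - ')) - 1

while count > 0:
    html = urllib.request.urlopen(url, context=ctx).read()
    soup = BeautifulSoup(html, 'html.parser')
    links = [tag.get('href', None) for tag in soup('a')]
    if position >= len(links):
        print('Page has only', len(links), 'links; stopping.')
        break
    url = links[position]  # overwrite url so the next iteration follows the link
    count = count - 1
    print('Retrieving:', url)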
Nothing in your code actually interacts with the list of hyperlinks it builds. It only prints the contents of the lst list, but never does anything with them.
I'm having a bit of trouble trying to save the links from a website into a list without repeating URLs that share the same domain.
Example: www.python.org/download and www.python.org/about share the domain www.python.org, so only the first one (www.python.org/download) should be saved and the second skipped.
This is what I've got so far:
from bs4 import BeautifulSoup
import requests
from urllib.parse import urlparse
url = "https://docs.python.org/3/library/urllib.request.html#module-urllib.request"
result = requests.get(url)
doc = BeautifulSoup(result.text, "html.parser")
atag = doc.find_all('a', href=True)
links = []
#below should be some kind of for loop
As a one-liner (note that the := walrus operator requires Python 3.8+):
links = {nl for a in doc.find_all('a', href=True) if (nl := urlparse(a["href"]).netloc) != ""}
Explained:
links = set()                            # define empty set
for a in doc.find_all('a', href=True):   # loop over every <a> element
    nl = urlparse(a["href"]).netloc      # get netloc from url
    if nl:
        links.add(nl)                    # add to set if it exists
output:
{'www.w3.org', 'datatracker.ietf.org', 'www.python.org', 'requests.readthedocs.io', 'github.com', 'www.sphinx-doc.org'}
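Note that this collects the domain names themselves. If you instead want to keep the first full URL seen for each domain, as the example in the question suggests, a small variation works (a sketch using the same page as above):

from bs4 import BeautifulSoup
import requests
from urllib.parse import urlparse

url = "https://docs.python.org/3/library/urllib.request.html#module-urllib.request"
doc = BeautifulSoup(requests.get(url).text, "html.parser")

links = []
seen_domains = set()
for a in doc.find_all('a', href=True):
    domain = urlparse(a["href"]).netloc
    if domain and domain not in seen_domains:
        seen_domains.add(domain)   # remember the domain
        links.append(a["href"])    # keep only the first URL for it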
I'm trying to write a scraper that randomly chooses a wiki article link from a page, goes there, grabs another, and loops that. I want to exclude links with "Category:", "File:", or "List" in the href. I'm pretty sure the links I want are all inside of p tags, but when I include "p" in find_all, I get a "'int' object is not subscriptable" error.
The code below returns wiki pages but does not exclude the things I want to filter out.
This is a learning journey for me. All help is appreciated.
import requests
from bs4 import BeautifulSoup
import random
import time

def scrapeWikiArticle(url):
    response = requests.get(
        url=url,
    )
    soup = BeautifulSoup(response.content, 'html.parser')
    title = soup.find(id="firstHeading")
    print(title.text)
    print(url)

    allLinks = soup.find(id="bodyContent").find_all("a")
    random.shuffle(allLinks)
    linkToScrape = 0

    for link in allLinks:
        # Here I am trying to select hrefs with /wiki/ in them and exclude hrefs
        # with "Category:" etc. It does select for wikis but does not exclude anything.
        if link['href'].find("/wiki/") == -1:
            if link['href'].find("Category:") == 1:
                if link['href'].find("File:") == 1:
                    if link['href'].find("List") == 1:
                        continue

        # Use this link to scrape
        linkToScrape = link
        articleTitles = open("savedArticles.txt", "a+")
        articleTitles.write(title.text + ", ")
        articleTitles.close()
        time.sleep(6)
        break

    scrapeWikiArticle("https://en.wikipedia.org" + linkToScrape['href'])

scrapeWikiArticle("https://en.wikipedia.org/wiki/Anarchism")
You need to modify the for loop. .attrs is used to access the attributes of a tag. If you want to exclude a link when its href value contains a particular keyword, use a != -1 comparison.
Modified code:
import requests
from bs4 import BeautifulSoup
import random
import time

def scrapeWikiArticle(url):
    response = requests.get(
        url=url,
    )
    soup = BeautifulSoup(response.content, 'html.parser')
    title = soup.find(id="firstHeading")

    allLinks = soup.find(id="bodyContent").find_all("a")
    random.shuffle(allLinks)
    linkToScrape = 0

    for link in allLinks:
        if "href" in link.attrs:
            if link.attrs['href'].find("/wiki/") == -1 or link.attrs['href'].find("Category:") != -1 or link.attrs['href'].find("File:") != -1 or link.attrs['href'].find("List") != -1:
                continue
            linkToScrape = link
            articleTitles = open("savedArticles.txt", "a+")
            articleTitles.write(title.text + ", ")
            articleTitles.close()
            time.sleep(6)
            break

    if linkToScrape:
        scrapeWikiArticle("https://en.wikipedia.org" + linkToScrape.attrs['href'])

scrapeWikiArticle("https://en.wikipedia.org/wiki/Anarchism")
This section seems problematic.
if link['href'].find("/wiki/") == -1:
    if link['href'].find("Category:") == 1:
        if link['href'].find("File:") == 1:
            if link['href'].find("List") == 1:
                continue
find returns the index of the substring you are looking for, or -1 if it is not found; you are also using it wrong.
What you want is: if "/wiki/" is not found, or "Category:", "File:", etc. appears in the href, then continue.
if link['href'].find("/wiki/") == -1 or \
   link['href'].find("Category:") != -1 or \
   link['href'].find("File:") != -1 or \
   link['href'].find("List") != -1:
    print("skipped " + link["href"])
    continue

Sample output:
Saint Petersburg
https://en.wikipedia.org/wiki/St._Petersburg
National Diet Library
https://en.wikipedia.org/wiki/NDL_(identifier)
Template talk:Authority control files
https://en.wikipedia.org/wiki/Template_talk:Authority_control_files
skipped #searchInput
skipped /w/index.php?title=Template_talk:Authority_control_files&action=edit§ion=1
User: Tom.Reding
https://en.wikipedia.org/wiki/User:Tom.Reding
skipped http://toolserver.org/~dispenser/view/Main_Page
Iapetus (moon)
https://en.wikipedia.org/wiki/Iapetus_(moon)
87 Sylvia
https://en.wikipedia.org/wiki/87_Sylvia
skipped /wiki/List_of_adjectivals_and_demonyms_of_astronomical_bodies
Asteroid belt
https://en.wikipedia.org/wiki/Main_asteroid_belt
Detached object
https://en.wikipedia.org/wiki/Detached_object
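The same filter reads more cleanly with the in operator and any(); a minimal sketch (is_article_link is a hypothetical helper name, not part of the original code):

def is_article_link(href):
    # Keep only /wiki/ links that contain none of the unwanted substrings.
    banned = ("Category:", "File:", "List")
    return "/wiki/" in href and not any(b in href for b in banned)

Inside the loop, the whole condition then collapses to: if not is_article_link(link.get('href', '')): continue.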
Use :not() to handle the list of exclusions within the href, alongside the * (contains) operator. This filters out hrefs containing the specified substrings. Precede this with an attribute = value selector that requires the href to contain /wiki/. I have specified a case-insensitive match via i for the first two exclusions, which can be removed:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://en.wikipedia.org/wiki/2018_FIFA_World_Cup#Prize_money')
soup = bs(r.content, 'lxml') # 'html.parser'
links = [i['href'] for i in soup.select('#bodyContent a[href*="/wiki/"]:not([href*="Category:" i], [href*="File:" i], [href*="List"])')]
I am trying to pull links from a webpage at a certain position, then open that link, and then repeat that process the provided number of times. The problem is I keep getting the same URL returned, so it seems like my code is just pulling the tag and printing it, without opening it, and doing that X number of times before closing.
I have written and re-written this code a number of times, but for the life of me I just can't figure it out. Please tell me what I am doing wrong.
I tried using a list to hold the anchor tags, opening the URL at the requested position in the list, and then clearing the list before starting the loop again.
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

#url = input('Enter - ')
url = "http://py4e-data.dr-chuck.net/known_by_Fikret.html"
html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, 'html.parser')

count = 0
url_loop = int(input("Enter how many times to loop through: "))
url_pos = int(input("Enter position of URL: "))
url_pos = url_pos - 1
print(url_pos)

# Retrieve all of the anchor tags
tags = soup('a')

while True:
    if url_loop == count:
        break
    html = urllib.request.urlopen(url, context=ctx).read()
    soup = BeautifulSoup(html, 'html.parser')
    url = tags[url_pos].get('href', None)
    print("Acquiring URL: ", url)
    count = count + 1

print("final URL:", url)
It could be that the tags are only extracted once, for the initial document:
# Retrieve all of the anchor tags
tags = soup('a')
If you re-extract the tags after fetching each document, they will reflect the page just fetched.
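A minimal rearrangement of the loop above (a sketch; it assumes every fetched page has at least url_pos + 1 anchor tags):

while count < url_loop:
    html = urllib.request.urlopen(url, context=ctx).read()
    soup = BeautifulSoup(html, 'html.parser')
    tags = soup('a')  # re-extract anchors from the page just fetched
    url = tags[url_pos].get('href', None)
    print("Acquiring URL: ", url)
    count = count + 1

print("final URL:", url)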
I am learning how to use BeautifulSoup and I have run into an issue with double printing in a loop I have written.
Any insight would be greatly appreciated!
from bs4 import BeautifulSoup
import requests
import re
page = 'https://news.google.com/news/headlines?gl=US&ned=us&hl=en' #main page
#url = raw_input("Enter a website to extract the URL's from: ")
r = requests.get(page) #requests html document
data = r.text #set data = to html text
soup = BeautifulSoup(data, "html.parser") #parse data with BS
for link in soup.find_all('a'):
    # if contains /news/
    if '/news/' in link.get('href'):
        print(link.get('href'))
Examples:
for link in soup.find_all('a'):
    # if contains cointelegraph/news/
    #if 'https://cointelegraph.com/news/' in link.get('href'):
    url = link.get('href')  # local var to store url
    if '/news/' in url:
        print(url)
        print(count)
        count += 1
        if count == 5:
            break
output:
https://cointelegraph.com/news/woman-in-denmark-imprisoned-for-hiring-hitman-using-bitcoin
0
https://cointelegraph.com/news/ethereum-price-hits-all-time-high-of-750-following-speed-boost
1
https://cointelegraph.com/news/ethereum-price-hits-all-time-high-of-750-following-speed-boost
2
https://cointelegraph.com/news/senior-vp-says-ebay-seriously-considering-bitcoin-integration
3
https://cointelegraph.com/news/senior-vp-says-ebay-seriously-considering-bitcoin-integration
4
For some reason my code keeps printing out the same url twice...
Based on your code and the provided link, there seem to be duplicates in the results of BeautifulSoup's find_all search. The HTML structure would need to be checked to see why duplicates are returned (see find_all's search options in the documentation for ways to filter them). But if you want a quick fix and just want to remove the duplicates from the printed results, you can use the modified loop below with a set to keep track of seen entries.
In [78]: l = [link.get('href') for link in soup.find_all('a') if '/news/' in link.get('href')]
In [79]: any(l.count(x) > 1 for x in l)
Out[79]: True
The above output shows that duplicates exist in the list. To remove them, use something like:
seen = set()
for link in soup.find_all('a'):
    lhref = link.get('href')
    if '/news/' in lhref and lhref not in seen:
        print(lhref)
        seen.add(lhref)
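Alternatively, assuming Python 3.7+ (where dicts preserve insertion order), you can collect first and dedupe in one pass with dict.fromkeys:

hrefs = [link.get('href') for link in soup.find_all('a', href=True) if '/news/' in link['href']]
for href in dict.fromkeys(hrefs):  # keeps the first occurrence of each URL, in order
    print(href)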
I am trying to create a webcrawler that parses all the HTML on the page, grabs a specified (via raw_input) link, follows that link, and then repeats this process a specified number of times (once again via raw_input). I am able to grab the first link and print it successfully. However, I am having problems looping the whole process, and I usually grab the wrong link. This is the first link:
https://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Fikret.html
(Full disclosure: this question pertains to an assignment for a Coursera course.)
Here's my code:
import urllib
from BeautifulSoup import *

url = raw_input('Enter - ')
rpt = raw_input('Enter Position')
rpt = int(rpt)
cnt = raw_input('Enter Count')
cnt = int(cnt)
count = 0
counts = 0
tags = list()
soup = None

while x == 0:
    html = urllib.urlopen(url).read()
    soup = BeautifulSoup(html)
    # Retrieve all of the anchor tags
    tags = soup.findAll('a')
    for tag in tags:
        url = tag.get('href')
        count = count + 1
        if count == rpt:
            break
    counts = counts + 1
    if counts == cnt:
        x == 1
    else:
        continue

print url
Based on DJanssens' response, I found the solution:
url = tags[position-1].get('href')
did the trick for me!
Thanks for the assistance!
I also worked on that course, and with help from a friend, I got this worked out:
import urllib
from bs4 import BeautifulSoup

url = "http://python-data.dr-chuck.net/known_by_Happy.html"
rpt = 7
position = 18
count = 0
counts = 0
tags = list()
soup = None
x = 0

while x == 0:
    html = urllib.urlopen(url).read()
    soup = BeautifulSoup(html, "html.parser")
    tags = soup.findAll('a')
    url = tags[position-1].get('href')
    count = count + 1
    if count == rpt:
        break

print url
I believe this is what you are looking for:
import urllib
from bs4 import *

url = raw_input('Enter - ')
position = int(raw_input('Enter Position'))
count = int(raw_input('Enter Count'))

# Perform the loop "count" times.
for _ in xrange(0, count):
    html = urllib.urlopen(url).read()
    soup = BeautifulSoup(html)
    tags = soup.findAll('a')
    # If the link does not exist at that position, show an error and stop.
    if len(tags) < position:
        print "A link does not exist at that position."
        break
    # If the link at that position exists, overwrite url so the next search will use it.
    url = tags[position-1].get('href')
    print url
The code will now loop the number of times specified in the input; each time it takes the href at the given position and replaces url with it, so each iteration looks one step further down the chain of pages.
I advise you to use full names for variables, which is a lot easier to understand. In addition, you can cast and read them in a single line, which makes the beginning of your program easier to follow.
Here are my 2 cents:
import urllib
#import ssl
from bs4 import BeautifulSoup

#'http://py4e-data.dr-chuck.net/known_by_Fikret.html'
url = raw_input('Enter URL : ')
position = int(raw_input('Enter position : '))
count = int(raw_input('Enter count : '))

print('Retrieving: ' + url)
soup = BeautifulSoup(urllib.urlopen(url).read())

for x in range(1, count + 1):
    link = list()
    for tag in soup('a'):
        link.append(tag.get('href', None))
    print('Retrieving: ' + link[position - 1])
    soup = BeautifulSoup(urllib.urlopen(link[position - 1]).read())