Having problems following links with webcrawler

Having problems following links with webcrawler - python

I am trying to create a webcrawler that parses all the html on the page, grabs a specified (via raw_input) link, follows that link, and then repeats this process a specified number of times (once again via raw_input). I am able to grab the first link and successfully print it. However, I am having problems "looping" the whole process, and usually grab the wrong link. This is the first link
https://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Fikret.html
(Full disclosure, this questions pertains to an assignment for a Coursera course)
Here's my code
import urllib
from BeautifulSoup import *
url = raw_input('Enter - ')
rpt=raw_input('Enter Position')
rpt=int(rpt)
cnt=raw_input('Enter Count')
cnt=int(cnt)
count=0
counts=0
tags=list()
soup=None
while x==0:
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)
# Retrieve all of the anchor tags
tags=soup.findAll('a')
for tag in tags:
url= tag.get('href')
count=count + 1
if count== rpt:
break
counts=counts + 1
if counts==cnt:
x==1
else: continue
print url

Based on DJanssens' response, I found the solution;
url = tags[position-1].get('href')
did the trick for me!
Thanks for the assistance!

I also worked on that course, and help with a friend, I got this worked out:
import urllib
from bs4 import BeautifulSoup
url = "http://python-data.dr-chuck.net/known_by_Happy.html"
rpt=7
position=18
count=0
counts=0
tags=list()
soup=None
x=0
while x==0:
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html,"html.parser")
tags=soup.findAll('a')
url= tags[position-1].get('href')
count=count + 1
if count == rpt:
break
print url

I believe this is what you are looking for:
import urllib
from bs4 import *
url = raw_input('Enter - ')
position=int(raw_input('Enter Position'))
count=int(raw_input('Enter Count'))
#perform the loop "count" times.
for _ in xrange(0,count):
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)
tags=soup.findAll('a')
for tag in tags:
url= tag.get('href')
tags=soup.findAll('a')
# if the link does not exist at that position, show error.
if not tags[position-1]:
print "A link does not exist at that position."
# if the link at that position exist, overwrite it so the next search will use it.
url = tags[position-1].get('href')
print url
The code will now loop the amount of times as specified in the input, each time it will take the href at the given position and replace it with the url, in that way each loop will look further in the tree structure.
I advice you to use full names for variables, which is a lot easier to understand. In addition you could cast them and read them in a single line, which makes your beginning easier to follow.

Here is my 2-cents:
import urllib
#import ssl
from bs4 import BeautifulSoup
#'http://py4e-data.dr-chuck.net/known_by_Fikret.html'
url = raw_input('Enter URL : ')
position = int(raw_input('Enter position : '))
count = int(raw_input('Enter count : '))
print('Retrieving: ' + url)
soup = BeautifulSoup(urllib.urlopen(url).read())
for x in range(1, count + 1):
link = list()
for tag in soup('a'):
link.append(tag.get('href', None))
print('Retrieving: ' + link[position - 1])
soup = BeautifulSoup(urllib.urlopen(link[position - 1]).read())

Related

Following links using Beautiful Soup?

So I have just started learning about python using the Coursera online course "Python for Everybody", and I have this assignment where I have to follow links using beautiful soup. I saw this question pop up before but when I tried using it, it just didn't work. I managed to create something but the thing doesn't actually follow through the links but instead just stays on the same page. If possible can anyone provide materials that can give better insight on this assignment as well? Thanks.
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
url = input('Enter URL - ')
cnt = input("Enter count -")
count = int(cnt)
pn = input("Enter position -")
position = int(pn)-1
while count > 0:
html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, "html.parser")
tags = soup('a')
lst = list()
for tag in tags:
lst.append(tag.get('href', None))
indxpos = lst[position]
count = count - 1
print("Retrieving:", indxpos)

You never set url to the new URL.
while count > 0:
html = urllib.request.urlopen(url, context=ctx).read() # Gets the page at url
...
for tag in tags:
lst.append(tag.get('href', None)) # Appends all the links to lst
indxpos = lst[position]
count = count - 1
print("Retrieving:", indxpos)
# What happens to lst?? you never use it
You should probably replace indxpos with url instead.
while count > 0:
html = urllib.request.urlopen(url, context=ctx).read() # Gets the page at url
...
for tag in tags:
lst.append(tag.get('href', None)) # Appends all the links to lst
url = lst[position]
count = count - 1
print("Retrieving:", url)
This way, the next time the loop runs, it will fetch the new URL.
Also: If the page does not have pn links (e.g. pn=12, page has 2 links), you will get an exception if you try and access lst[position], because lst has less than pn elements.

You don't have a function that interacts with the list of hyperlinks in your code, whatsoever.
It will only print contents of "lst" list, but won't do anything with them.

A beautifulsoup loop that returns links without certain words

Im trying to write a scraper that randomly chooses a wiki article link from a page, goes there, grabs another, and loops that. I want to exclude links with "Category:", "File:", "List" in the href. Im pretty sure the links i want are all inside of p tags, but when I include "p" in find_all, i get "int object is not subscriptable" error.
The code below returns wiki pages but does not exclude the things i want to filter out.
This is a learning journey for me. All help is appreciated.
import requests
from bs4 import BeautifulSoup
import random
import time
def scrapeWikiArticle(url):
response = requests.get(
url=url,
)
soup = BeautifulSoup(response.content, 'html.parser')
title = soup.find(id="firstHeading")
print(title.text)
print(url)
allLinks = soup.find(id="bodyContent").find_all("a")
random.shuffle(allLinks)
linkToScrape = 0
for link in allLinks:
# Here i am trying to select hrefs with /wiki/ in them and exclude hrefs with "Category:" etc. It does select for wikis but does not exclude anything.
if link['href'].find("/wiki/") == -1:
if link['href'].find("Category:") == 1:
if link['href'].find("File:") == 1:
if link['href'].find("List") == 1:
continue
# Use this link to scrape
linkToScrape = link
articleTitles = open("savedArticles.txt", "a+")
articleTitles.write(title.text + ", ")
articleTitles.close()
time.sleep(6)
break
scrapeWikiArticle("https://en.wikipedia.org" + linkToScrape['href'])
scrapeWikiArticle("https://en.wikipedia.org/wiki/Anarchism")

You need to modify the for loop, .attrs is used to access the attributes of any tag. If you want to exclude links if the href value contains particular keyword then use !=-1 comparison.
Modified code:
import requests
from bs4 import BeautifulSoup
import random
import time
def scrapeWikiArticle(url):
response = requests.get(
url=url,
)
soup = BeautifulSoup(response.content, 'html.parser')
title = soup.find(id="firstHeading")
allLinks = soup.find(id="bodyContent").find_all("a")
random.shuffle(allLinks)
linkToScrape = 0
for link in allLinks:
if("href" in link.attrs):
if link.attrs['href'].find("/wiki/") == -1 or link.attrs['href'].find("Category:") != -1 or link.attrs['href'].find("File:") != -1 or link.attrs['href'].find("List") != -1:
continue
linkToScrape = link
articleTitles = open("savedArticles.txt", "a+")
articleTitles.write(title.text + ", ")
articleTitles.close()
time.sleep(6)
break
if(linkToScrape):
scrapeWikiArticle("https://en.wikipedia.org" + linkToScrape.attrs['href'])
scrapeWikiArticle("https://en.wikipedia.org/wiki/Anarchism")

This section seems problematic.
if link['href'].find("/wiki/") == -1:
if link['href'].find("Category:") == 1:
if link['href'].find("File:") == 1:
if link['href'].find("List") == 1:
continue
find returns the index of the substring you are looking for, you are also using it wrong.
So if wiki is not found or Category:, File: etc. appears in href, then continue.
if link['href'].find("/wiki/") == -1 or \
link['href'].find("Category:") != -1 or \
link['href'].find("File:") != -1 or \
link['href'].find("List")!= -1 :
print("skipped " + link["href"])
continue
Saint Petersburg
https://en.wikipedia.org/wiki/St._Petersburg
National Diet Library
https://en.wikipedia.org/wiki/NDL_(identifier)
Template talk:Authority control files
https://en.wikipedia.org/wiki/Template_talk:Authority_control_files
skipped #searchInput
skipped /w/index.php?title=Template_talk:Authority_control_files&action=edit&section=1
User: Tom.Reding
https://en.wikipedia.org/wiki/User:Tom.Reding
skipped http://toolserver.org/~dispenser/view/Main_Page
Iapetus (moon)
https://en.wikipedia.org/wiki/Iapetus_(moon)
87 Sylvia
https://en.wikipedia.org/wiki/87_Sylvia
skipped /wiki/List_of_adjectivals_and_demonyms_of_astronomical_bodies
Asteroid belt
https://en.wikipedia.org/wiki/Main_asteroid_belt
Detached object
https://en.wikipedia.org/wiki/Detached_object

Use :not() to handle the list of exclusions within href alongside * contains operator. This will filter out hrefs containing (*) specified substrings. Precede this with an attribute = value selector that contains * /wiki/. I have specified a case insensitive match via i, for the first two, which can be removed:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://en.wikipedia.org/wiki/2018_FIFA_World_Cup#Prize_money')
soup = bs(r.content, 'lxml') # 'html.parser'
links = [i['href'] for i in soup.select('#bodyContent a[href*="/wiki/"]:not([href*="Category:" i], [href*="File:" i], [href*="List"])')]

Python: Web Scraping with input from user

from bs4 import BeautifulSoup
from urllib.request import urlopen as uReq
import requests
url = 'https://en.wikisource.org/wiki/Main_Page'
r = requests.get(url)
Soup = BeautifulSoup(r.text, "html5lib")
List = Soup.find("div",class_="enws-mainpage-widget-content", id="enws-mainpage-newtexts-content").find_all('a')
ebooks=[]
i=0
for ebook in List:
x=ebook.get('title')
for ch in x:
if(ch==":"):
x=""
if x!="":
ebooks.append(x)
i=i+1
print("Please select a book: ")
inputnumber=0
while inputnumber<len(ebooks):
print(inputnumber+1, " - ", ebooks[inputnumber])
inputnumber=inputnumber+1
input=int(input())
selectedbook = Soup.find("href", title=ebooks[input-1])
print(selectedbook)
I want to get the href of whichever was selected by user but as output I get: None
Can someone please tell me where I am doing wrong

I changed the last two lines of your code, and added these
selectedbook = Soup.find("a", title=ebooks[input-1])
print(selectedbook['title'])
print("https://en.wikisource.org/"+selectedbook['href'])
This just works !.
NB: The find() method searches for the first tag with the specified name and returns an object of type bs4.element.Tag.

Stuck on this task

From an online python course:
You will be given a website with 100 names. All names are in the form of a link. Each link leads to another 100 links. You must use python to select the 18th link for 7 times, and print out the results.
my code so far:
z = 0
atags = []
listurl = []
#import modules
import urllib
from bs4 import BeautifulSoup
import re
newurl = "https://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Desmond.html"
while z < 7:
url = newurl
z = z + 1
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)
soup.find_all("url")
a = soup.find_all('a')
for x in a:
atags.append(str(x))
url_end_full = atags[19]
url_end = re.findall(r'"(.*?)"', url_end_full)
url_end = str(url_end[0])
newurl = 'https://pr4e.dr-chuck.com/tsugi/mod/python-data/data/' + url_end
str(newurl)
listurl.append(newurl)
url = newurl
print url
It does not work. It keeps giving me the same link...
this is the output:
https://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Lauchlin.html
[Finished in 2.4s]
the answer was wrong when i entered it into the answer box.

There are a couple of problems.
atags[19] is not the 18th item, it is the 20th (lst[0] is the first item in a list).
soup.find_all("url") does nothing; get rid of it.
you do not need re.
The links returned are relative; you are doing a hard-join to the base path to make them absolute. In this case it works, but that is a matter of luck; do it right with urljoin.
While str(link) does get you the url, the "proper" method is by indexing into the attributes, ie link['href'].
With some judicious cleanup,
from bs4 import BeautifulSoup
import sys
# version compatibility shim
if sys.hexversion < 0x3000000:
# Python 2.x
from urlparse import urljoin
from urllib import urlopen
else:
# Python 3.x
from urllib.parse import urljoin
from urllib.request import urlopen
START_URL = "https://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Desmond.html"
STEPS = 7
ITEM = 18
def get_soup(url):
with urlopen(url) as page:
return BeautifulSoup(page.read(), 'lxml')
def main():
url = START_URL
for step in range(STEPS):
print("\nStep {}: looking at '{}'".format(step, url))
# get the right item (Python arrays start indexing at 0)
links = get_soup(url).find_all("a")
rel_url = links[ITEM - 1]["href"]
# convert from relative to absolute url
url = urljoin(url, rel_url)
print(" go to '{}'".format(url))
if __name__=="__main__":
main()
which, if I did it right, ends with known_by_Gideon.html

How to chose a random link on the page?

i am using beautiful soup to get links from a page.
What i would like it to do is select one of the links at random and continue with the rest of the program. Currently it is using all the links and continuing with the rest of the program, however i only want it to choose 1 link.
The the rest of the program will then look at the link and decide if it was good enough for what i want. If it is not good enough it will then go back and click another link. And repeat the processes.
Any idea how you would get it to do this?
This is my current code for looking up the links.
import requests
import os.path
from bs4 import BeautifulSoup
import urllib.request
import hashlib
import random
max_page = 1
img_limit = 5
def pic_spider(max_pages):
page = random.randrange(0, max_page)
pid = page * 40
pic_good = 1
while pic_good == 1:
if page <= max_pages:
url = 'http://safebooru.org/index.php?page=post&s=list&tags=yuri&pid=' + str(pid)
source_code = requests.get(url)
plain_text = source_code.text
soup = BeautifulSoup(plain_text, "html.parser")
id_list_location = os.path.join(id_save, "ids.txt")
first_link = soup.findAll('a', id=True, limit=img_limit)
for link in first_link:
href = "http://safebooru.org/" + link.get('href')
picture_id = link.get('id')
print("Page number = " + str(page + 1))
print("pid = " + str(pid))
print("Id = " + picture_id)
print(href)
if picture_id in open(id_list_location).read():
print("Already Downloaded or Picture checked to be too long")
else:
log_id(picture_id)
if ratio_get(href) >= 1.3:
print("Picture too long")
else:
#img_download_link(href, picture_id)
print("Ok download")
im not really sure how i would do it so any ideas would help me out, if you have any questions feel free to ask!

Am I missing something? Don't you just need to replace this:
first_link = soup.findAll('a', id=True, limit=img_limit)
for link in first_link:
With:
from random import choice
first_link = soup.findAll('a', id=True, limit=img_limit)
link = choice(first_link)
This will select one random item from the list

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Having problems following links with webcrawler - python

Based on DJanssens' response, I found the solution; url = tags[position-1].get('href') did the trick for me! Thanks for the assistance!

Related

Following links using Beautiful Soup?

A beautifulsoup loop that returns links without certain words

Python: Web Scraping with input from user

Stuck on this task

How to chose a random link on the page?

Categories

Resources