How to recursively get web links without reaching maximum recursion depth - python

I've been thinking over this problem for a while now. For a personal project, I need to get every link reachable from a specified initial webpage that isn't an external link, i.e. one that doesn't leave the initial website.
I'm already using bs4 to scrape webpages for links; however, I can't find a way to keep doing this for every scraped link without eventually hitting the maximum recursion depth.
My previous attempt consisted of something like this:
import requests
from bs4 import BeautifulSoup

link_list = []
link_buffer = ["https://www.example.com"]

def get_links(current_link):
    new_links = []
    page = requests.get(current_link)
    soup = BeautifulSoup(page.text, "html.parser")
    for link in soup.find_all("a"):
        if link.has_attr("href"):
            ...  # check here if the link does not leave the website
            new_links.append(link["href"])
    return new_links

def get_all_the_links(list_of_links):
    target = list_of_links.pop()
    link_list.append(target)
    ...
    for link in get_links(target):
        ...
        if link not in link_list:
            list_of_links.append(link)
    if len(list_of_links) != 0:  # recursiveness
        get_all_the_links(list_of_links)

get_all_the_links(link_buffer)
I also looked at Scrapy; however, I found it too complicated for what I'm trying to do, since I just plan on saving these links to a text file and processing them later.

The code sample you've provided is a bit messy, but overall you just need an extra storage variable for the visited links, plus a check for whether a link is already known before processing it:
visited_links = set()
...

def get_all_the_links(list_of_links):
    target = list_of_links.pop()
    if target in visited_links:
        # Ignore the current link and move on to the next one.
        get_all_the_links(list_of_links)
    else:
        visited_links.add(target)
        ...
        # process link here...
        ...
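Note that even with the visited set, a very deep crawl can still hit Python's recursion limit, since each unprocessed link adds another stack frame. The same idea can be expressed without recursion by driving an explicit queue in a loop. Here is a minimal, untested sketch of that approach (assuming requests and bs4 are available, and using urljoin/urlparse to keep the crawl on the starting domain):
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def collect_internal_links(start_url):
    """Breadth-first crawl that never leaves the starting domain."""
    domain = urlparse(start_url).netloc
    visited = set()
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue  # skip pages that fail to load
        soup = BeautifulSoup(html, "html.parser")
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"])  # resolve relative links
            if urlparse(link).netloc == domain and link not in visited:
                queue.append(link)
    return visited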

Related

Python web-scraping trouble

I've been at this all day and I'm getting a little overwhelmed. Let me explain: for a personal project I want to scrape all the links using the acestream:// protocol from a website and turn them into a playlist for Acestream. For now I can either pull the links from the site as a whole (something like the site map) or pull the acestream links from a specific subpage. One of the problems I have is that the same acestream link appears several times on the page, so obviously I get it multiple times when I only want it once.
Besides that, I don't know how to make the script take the site from a list of links in a .csv automatically instead of hard-coding it, because I need to get an acestream link from each link that I put in the .csv (I'm very new to this). Sorry for the long post, I hope it's not a nuisance.
Hope you understand; I translated this with Google Translate.
from bs4 import BeautifulSoup
import requests

# creating empty list
urls = []

# function created
def scrape(site):
    # getting the request from url
    r = requests.get(site)
    # converting the text
    s = BeautifulSoup(r.text, "html.parser")
    for i in s.find_all("a"):
        href = i.attrs['href']
        if href.startswith("acestream://"):
            site = site + href
            if site not in urls:
                urls.append(site)
                print(site)
                # calling the scrape function itself
                # generally called recursion
                scrape(site)

# main function
if __name__ == "__main__":
    site = "https://www.websitehere.com/index.htm"
    scrape(site)
Based on your last comment and your code, you can read in a .csv using:
import pandas as pd
file_path = r'C:\<path to your csv>'
df = pd.read_csv(file_path)
csv_links = df['<your_column_name_for_links>'].to_list()
With this, you can get the URLs from the .csv. Just change the values in the <>.
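Putting the two pieces together, a rough, untested sketch might look like the following. The file name links.csv and column name link are placeholders, the scrape_acestream_links helper re-uses your requests/BeautifulSoup logic, and a set drops duplicate acestream links automatically:
from bs4 import BeautifulSoup
import pandas as pd
import requests

def scrape_acestream_links(site):
    # return every acestream:// href found on this page, without duplicates
    r = requests.get(site)
    s = BeautifulSoup(r.text, "html.parser")
    return {a["href"] for a in s.find_all("a", href=True)
            if a["href"].startswith("acestream://")}

df = pd.read_csv("links.csv")              # placeholder file name
playlist = set()
for page in df["link"].to_list():          # placeholder column name
    playlist.update(scrape_acestream_links(page))

with open("playlist.txt", "w") as f:       # one link per line
    f.write("\n".join(sorted(playlist)))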

How to get all links containing a phrase from a changing website

I want to retrieve all links from a website that contain a specific phrase.
An example on a public website would be to retrieve all videos from a large YouTube channel (for example, Linus Tech Tips):
from bs4 import BeautifulSoup as bs
import requests

url = 'https://www.youtube.com/user/LinusTechTips/videos'
html = requests.get(url)
soup = bs(html.content, "html.parser")
current_link = ''
for link in soup.find_all('a'):
    current_link = link.get('href')
    print(current_link)
Now I have 3 problems here:
How do I get only hyperlinks containing a phrase like "watch?v="
Most hyperlinks aren't shown: in the browser they only appear when you scroll down, and BeautifulSoup only finds the links that are present without scrolling. How can I retrieve all hyperlinks?
All hyperlinks appear two times. How can I only choose each hyperlink once?
Any suggestions?
How do I get only hyperlinks containing a phrase like "watch?v="
Add a single if statement above your print statement
if 'watch?v=' in current_link:
    print(current_link)
All hyperlinks appear two times. How can I only choose each hyperlink once?
Store each hyperlink in a dictionary as the key and set the value to any arbitrary number (a dictionary only allows a single entry per key, so you won't be able to add duplicates).
Something like this:
myLinks = {}  # declare a dictionary variable to hold your data
if 'watch?v=' in current_link:
    print(current_link)
    myLinks[current_link] = 1
You can iterate over the keys (links) in the dictionary like this:
for link in myLinks:
    print(link)
This will print all the links in your dictionary
Most hyperlinks aren't shown: in the browser they only appear when you scroll down, and BeautifulSoup only finds the links that are present without scrolling. How can I retrieve all hyperlinks?
I'm not sure how to get around the scripting on the page you've pointed us to directly, but you could always crawl the links you get from the initial scrape and pull new links off the side panels as you traverse them; this should give you most, if not all, of the links you want.
To do so you would want another dictionary to store the already traversed links/check if you already traversed them. You can check for a key in a dictionary like so:
if key in myDict:
    print('myDict has this key already!')
I would use the requests library. For Python 3:
import requests

SearchString = "SampleURL.com"
response = requests.get(SearchString, stream=True)
zeta = str(response.content)
with open("File.txt", "w") as l:
    l.write(zeta)
# And now open up the file with the information written to it
x = open("File.txt", "r")
jello = []
for line in x:
    jello.append(line)
t = jello[0].split('"salePrice":', 1)[1].split(",", 1)[0]
# You'll notice above that I used the keyword "salePrice"; this should be a unique identifier in the page's markup. Typically F12 in Chrome, then navigating until the item is highlighted, gives you the XPath if you right-click and copy.
# Now this will only return a single result; you'll want to use a for loop to iterate over File.txt until you find all the separate results.
I hope this helps. I'll keep an eye on this thread if you need more help.
Part One and Three:
Create a list and append links to the list:
from bs4 import BeautifulSoup as bs
import requests
url = 'https://www.youtube.com/user/LinusTechTips/videos'
html = requests.get(url)
soup = bs(html.content, "html.parser")
links = [] # see here
for link in soup.find_all('a'):
    links.append(link.get('href'))  # and here
Then create a set and convert it back to list to remove duplicates:
links = list(set(links))
Now return the items of interest:
clean_links = [i for i in links if 'watch?v=' in i]
Part Two:
In order to navigate through the site you may need more than just Beautiful Soup. Scrapy has a great API that allows you to pull down a page and explore how you want to parse parent and child elements with xpath. I highly encourage you to try Scrapy and use the interactive shell to tweak your extraction method.
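If you do decide to try Scrapy, a minimal self-contained spider might look something like the sketch below. This is illustrative only: the selector and URL come from the question above, and YouTube's dynamically loaded pages may still hide links from it, as discussed in part two.
import scrapy
from scrapy.crawler import CrawlerProcess

class WatchLinkSpider(scrapy.Spider):
    name = "watch_links"
    start_urls = ["https://www.youtube.com/user/LinusTechTips/videos"]

    def parse(self, response):
        # collect every href containing the phrase of interest
        for href in response.css("a::attr(href)").getall():
            if "watch?v=" in href:
                yield {"link": response.urljoin(href)}

if __name__ == "__main__":
    process = CrawlerProcess(settings={"LOG_LEVEL": "WARNING"})
    process.crawl(WatchLinkSpider)
    process.start()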

Collect only the 1st level of href in a webpage using Python

I need to retrieve only the 1st-level hrefs from a website. For example, http://www.example.com/ is the website that I need to open and read. I opened the page and collected the hrefs, and I obtained all the links like /company/organization, /company/globallocations, /company/newsroom, /contact, /sitemap and so on.
Below is the python code.
req = urllib2.Request(domain)
response = urllib2.urlopen(req)
soup1 = BeautifulSoup(response, 'lxml')
for link in soup1.find_all('a', href=True):
    print link['href']
My desired output is:
/company, /contact, /sitemap for the website www.example.com
Kindly help and suggest a solution.
The "first level" concept is not entirely clear. If you consider hrefs with a single / to be first level, simply count how many / characters appear in the href text and decide whether to keep or drop each one; see the sketch below.
If we take the web page's point of view, every link on the home page should be considered first level. In that case, you may need a level counter that tracks how many levels deep your crawler has gone, and stop at a certain level.
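For example, a quick sketch of the slash-counting idea, plus one way to reduce deeper paths to their first segment (which matches the /company, /contact, /sitemap output asked for):
hrefs = ["/company/organization", "/company/globallocations",
         "/company/newsroom", "/contact", "/sitemap"]

# Keep only hrefs that contain a single "/" (one path segment).
first_level = [h for h in hrefs if h.count("/") == 1]
print(first_level)        # ['/contact', '/sitemap']

# Or reduce every href to its first segment and de-duplicate.
first_segments = sorted({"/" + h.strip("/").split("/")[0] for h in hrefs})
print(first_segments)     # ['/company', '/contact', '/sitemap']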
Hope that helps.

Create a new instance of a generator in python

I am trying to scrape a page that has many links to pages containing ads. Currently, to navigate it, I go to the first page with the list of ads and get the links to the individual ads. After that, I check that I haven't already scraped any of the links by pulling data from my database. The code below basically gets all the href attributes and joins them into a list. Afterwards, I cross-check it against the list of links I have stored in my database of pages I have already scraped. So basically it returns a list of the links I haven't scraped yet.
@staticmethod
def _scrape_home_urls(driver):
    home_url_list = list(home_tab.find_element_by_tag_name('a').get_attribute('href') for home_tab in driver.find_elements_by_css_selector('div[class^="nhs_HomeResItem clearfix"]'))
    return (home_url for home_url in home_url_list if home_url not in (url[0] for url in NewHomeSource.outputDB()))
Once it scrapes all the links of that page, it goes to the next one. I tried to reuse it by calling _scrape_home_urls() again
NewHomeSource.unique_home_list = NewHomeSource._scrape_home_urls(driver)
for x in xrange(0, limit):
    try:
        home_url = NewHomeSource.unique_home_list.next()
    except StopIteration:
        page_num = int(NewHomeSource.current_url[NewHomeSource.current_url.rfind('-')+1:]) + 1  # extract page number from url and gets next page by adding 1. example: /.../.../page-3
        page_url = NewHomeSource.current_url[:NewHomeSource.current_url.rfind('-')+1] + str(page_num)
        print page_url
        driver.get(page_url)
        NewHomeSource.current_url = driver.current_url
        NewHomeSource.unique_home_list = NewHomeSource._scrape_home_urls(driver)
        home_url = NewHomeSource.unique_home_list.next()
    # and then I use the home_url to do some processing within the loop
Thanks in advance.
It looks to me like your code would be a lot simpler if you put the logic that scrapes successive pages into a generator function. This would let you use for loops rather than messing around and calling next on the generator objects directly:
def urls_gen(driver):
    while True:
        for url in NewHomeSource._scrape_home_urls(driver):
            yield url
        page_num = int(NewHomeSource.current_url[NewHomeSource.current_url.rfind('-')+1:]) + 1  # extract page number from url and gets next page by adding 1. example: /.../.../page-3
        page_url = NewHomeSource.current_url[:NewHomeSource.current_url.rfind('-')+1] + str(page_num)
        print page_url
        driver.get(page_url)
        NewHomeSource.current_url = driver.current_url
This will transparently skip over pages that don't have any unprocessed links. The generator function yields the url values indefinitely. To iterate on it with a limit like your old code did, use enumerate and break when the limit is reached:
for i, home_url in enumerate(urls_gen(driver)):
    if i == limit:
        break
    # do stuff with home_url here
I've not changed your code other than what was necessary to change the iteration. There are quite a few other things that could be improved, however. For instance, using a shorter variable than NewHomeSource.current_url would make the lines that figure out the page number and then the next page's URL much more compact and readable. It's also not clear to me where that variable is initially set. If it's not used anywhere outside of this loop, it could easily be changed to a local variable in urls_gen.
Your _scrape_home_urls function is probably also very inefficient. It looks like it does a database query for every URL it returns, rather than doing a single lookup before checking all of the URLs. Maybe that's what you want it to do, but I suspect it would be much faster done another way.
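One way to avoid the per-URL queries, sketched here under the assumption that NewHomeSource.outputDB() returns rows whose first element is the URL, is to build a set of already-scraped URLs once and do in-memory membership checks inside the generator:
def scrape_home_urls(driver, already_scraped):
    # already_scraped is built once before the crawl, e.g.:
    #     already_scraped = {row[0] for row in NewHomeSource.outputDB()}
    selector = 'div[class^="nhs_HomeResItem clearfix"]'
    for home_tab in driver.find_elements_by_css_selector(selector):
        href = home_tab.find_element_by_tag_name('a').get_attribute('href')
        if href not in already_scraped:   # in-memory check, no DB round trip
            yield href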

Using selenium webdriver, how to click on multiple random links in webpage one after another continuously to detect broken links?

I'm trying to write a test script that would essentially test all visible links on a webpage upon login, chosen randomly rather than explicitly specified. Is this possible in Selenium IDE/WebDriver, and if so, how can I do it?
links = driver.find_elements_by_tag_name("a")
random_link = links[randint(0, len(links) - 1)]
The above will fetch all links on the first page, but how do I go about testing all the links (or as many as possible) without manually adding the above code for each link/page? I suppose what I'm trying to do is find broken links that would result in 500s/404s. Is there any productive way of doing this? Thanks.
Currently, you can't legitimately get the status code from Selenium. You could use Selenium to crawl for URLs and another library like requests to check each link's status, like this (or use the title-check solution proposed by @MrTi):
import requests

def find_broken_links(root, driver):
    visited = set()
    broken = set()
    # Use queue for BFS, list / stack for DFS.
    elements = [root]
    session = requests.session()
    while len(elements):
        el = elements.pop()
        if el in visited:
            continue
        visited.add(el)
        resp = session.get(el)
        if resp.status_code in [500, 404]:
            broken.add(el)
            continue
        driver.get(el)
        links = driver.find_elements_by_tag_name("a")
        for link in links:
            elements.append(link.get_attribute('href'))
    return broken
When testing for a bad page, I usually test for the title/url.
If you are testing a self-contained site, then you should find/create a link that is bad, and see what is unique in the title/URL, and then do something like:
assert "500 Error" not in driver.title
If you don't know what the title/URL will look like, you can check whether the title contains "500"/"404"/"Error"/"Page not found", or whether the page source contains those as well.
This will probably flag a bunch of pages that aren't really bad (especially if you check the page source), and will require you to go through each of them and verify that they really are bad.
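In Python, that heuristic might look something like this rough sketch (the marker strings are just examples; tune them to the site under test):
ERROR_MARKERS = ["404", "500", "Error", "Page not found"]

def looks_broken(driver):
    # Flag pages whose title mentions a typical error marker.
    title = (driver.title or "").lower()
    return any(marker.lower() in title for marker in ERROR_MARKERS)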
