Collect only the first level of hrefs in a webpage using Python

I need to retrieve only the first-level hrefs of a website. For example, http://www.example.com/ is the website that I need to open and read. I opened the page, collected the hrefs, and obtained all the links, like /company/organization, /company/globallocations, /company/newsroom, /contact, /sitemap and so on.
Below is the Python code (Python 2):
import urllib2
from bs4 import BeautifulSoup

domain = "http://www.example.com/"
req = urllib2.Request(domain)
response = urllib2.urlopen(req)
soup1 = BeautifulSoup(response, 'lxml')
for link in soup1.find_all('a', href=True):
    print link['href']
My desired output is:
/company, /contact, /sitemap for the website www.example.com
Kindly help and suggest a solution.

The "first level" concept is not clear. If you consider an href with a single / to be first level, simply count how many / characters the href contains and decide whether to keep or drop it; a sketch of that approach follows below.
If we consider the web page's point of view instead, all links on the home page should be considered first level. In that case, you may need a level counter that tracks how many levels deep your crawler goes, and stops at a certain depth.
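For example, here is a minimal sketch of the slash-counting idea, reusing the soup1 object from the question (Python 2 to match it; the handling of absolute URLs is simplistic):

from urlparse import urlparse  # urllib.parse on Python 3

first_level = set()
for link in soup1.find_all('a', href=True):
    path = urlparse(link['href']).path
    segments = [s for s in path.split('/') if s]
    if segments:
        # /company/organization and /company/newsroom both collapse to /company
        first_level.add('/' + segments[0])

for href in sorted(first_level):
    print href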
Hope that helps.

Related

Python/Selenium web scraping: how to find the hidden src value of a link?

Scraping links should be a simple feat, usually just a matter of grabbing the src or href value of the a tag.
I recently came across this website (https://sunteccity.com.sg/promotions) where the href value of each item's a tag cannot be found, but the redirection still works. I'm trying to figure out a way to grab the items and their corresponding links. My typical Python Selenium code looks something like this:
all_items = bot.find_elements_by_class_name('thumb-img')
for promo in all_items:
    a = promo.find_elements_by_tag_name("a")
    print("a[0]: ", a[0].get_attribute("href"))
However, I can't seem to retrieve any href or onclick attributes, and I'm wondering if this is even possible. I also noticed that I couldn't right-click and open the link in a new tab.
Are there any ways around getting the links of all these items?
Edit: Are there any ways to retrieve all the links of the items on the pages?
i.e.
https://sunteccity.com.sg/promotions/724
https://sunteccity.com.sg/promotions/731
https://sunteccity.com.sg/promotions/751
https://sunteccity.com.sg/promotions/752
https://sunteccity.com.sg/promotions/754
https://sunteccity.com.sg/promotions/280
...
Edit:
Adding an image of one such anchor tag for better clarity:
Reverse-engineering the JavaScript that takes you to the promotions pages (seen in https://sunteccity.com.sg/_nuxt/d4b648f.js) gives you a way to get all the links, which are based on the HappeningID. You can verify this by running the following in the JS console, which gives you the first promotion:
window.__NUXT__.state.Promotion.promotions[0].HappeningID
Based on that, you can create a Python loop to get all the promotions:
base = "https://sunteccity.com.sg/promotions/"
items = driver.execute_script("return window.__NUXT__.state.Promotion;")
for item in items["promotions"]:
    happening_id = str(item["HappeningID"])
    print(base + happening_id)
That generated the following output:
https://sunteccity.com.sg/promotions/724
https://sunteccity.com.sg/promotions/731
https://sunteccity.com.sg/promotions/751
https://sunteccity.com.sg/promotions/752
https://sunteccity.com.sg/promotions/754
https://sunteccity.com.sg/promotions/280
https://sunteccity.com.sg/promotions/764
https://sunteccity.com.sg/promotions/766
https://sunteccity.com.sg/promotions/762
https://sunteccity.com.sg/promotions/767
https://sunteccity.com.sg/promotions/732
https://sunteccity.com.sg/promotions/733
https://sunteccity.com.sg/promotions/735
https://sunteccity.com.sg/promotions/736
https://sunteccity.com.sg/promotions/737
https://sunteccity.com.sg/promotions/738
https://sunteccity.com.sg/promotions/739
https://sunteccity.com.sg/promotions/740
https://sunteccity.com.sg/promotions/741
https://sunteccity.com.sg/promotions/742
https://sunteccity.com.sg/promotions/743
https://sunteccity.com.sg/promotions/744
https://sunteccity.com.sg/promotions/745
https://sunteccity.com.sg/promotions/746
https://sunteccity.com.sg/promotions/747
https://sunteccity.com.sg/promotions/748
https://sunteccity.com.sg/promotions/749
https://sunteccity.com.sg/promotions/750
https://sunteccity.com.sg/promotions/753
https://sunteccity.com.sg/promotions/755
https://sunteccity.com.sg/promotions/756
https://sunteccity.com.sg/promotions/757
https://sunteccity.com.sg/promotions/758
https://sunteccity.com.sg/promotions/759
https://sunteccity.com.sg/promotions/760
https://sunteccity.com.sg/promotions/761
https://sunteccity.com.sg/promotions/763
https://sunteccity.com.sg/promotions/765
https://sunteccity.com.sg/promotions/730
https://sunteccity.com.sg/promotions/734
https://sunteccity.com.sg/promotions/623
You are using the wrong locator; it brings back a lot of irrelevant elements.
Instead of find_elements_by_class_name('thumb-img'), please try find_elements_by_css_selector('.collections-page .thumb-img'), so your code will be:
all_items = bot.find_elements_by_css_selector('.collections-page .thumb-img')
for promo in all_items:
    a = promo.find_elements_by_tag_name("a")
    print("a[0]: ", a[0].get_attribute("href"))
You can also get the desired links directly with the .collections-page .thumb-img a locator, so your code could be:
links = bot.find_elements_by_css_selector('.collections-page .thumb-img a')
for link in links:
    print(link.get_attribute("href"))

Scraping a site with multiple pages that retain the same URL?

I'm experimenting with web scraping for the first time in Python, using the beautifulsoup4 package. I've seen other people say that you need a for-loop to get all the data from a site with multiple pages, but in this particular case the URL doesn't change when you go from page to page. What do I do about this? Any help would be greatly appreciated.
Here's my python code:
import requests
from bs4 import BeautifulSoup

response = requests.get("https://wpcarey.asu.edu/people/departments/finance")
soup = BeautifulSoup(response.text, "html.parser")

professors = soup.select(".view-content .views-row")
professor_names = {}
for professor in professors:
    title = professor.select_one(".views-field.views-field-nothing-1 .field-content .title").getText()
    if "Professor" in title or "Lecturer" in title:
        name = professor.select_one(".views-field.views-field-nothing-1 .field-content .name > a").getText()
        if name not in professor_names:
            professor_names[name] = professor.select_one(".views-field.views-field-nothing .field-content .email > a").getText()
print(professor_names)
Believe me, I know it's hideous but it's just a draft. The main focus here is finding a way to loop through every page to retrieve the data.
Here's the first page of the site if that helps.
https://wpcarey.asu.edu/people/departments/finance
Thanks again.
If you hover over the button that goes to the next page, you can see that the second page is also available under this link: https://wpcarey.asu.edu/people/departments/finance?page=0%2C1
The third page is: https://wpcarey.asu.edu/people/departments/finance?page=0%2C2
If you use e.g. Firefox, you can right-click on the next-page button and inspect the code of the web page.
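Based on that URL pattern, here is a minimal sketch of a page loop. It assumes (worth verifying) that the first page is also reachable as page=0,0 and that an out-of-range page returns no rows:

import requests
from bs4 import BeautifulSoup

base_url = "https://wpcarey.asu.edu/people/departments/finance"
page = 0
while True:
    # requests URL-encodes "0,1" to the 0%2C1 form seen in the browser
    response = requests.get(base_url, params={"page": "0,%d" % page})
    soup = BeautifulSoup(response.text, "html.parser")
    professors = soup.select(".view-content .views-row")
    if not professors:
        break  # no rows means we have run past the last page
    for professor in professors:
        pass  # the same per-professor extraction as in the question goes here
    page += 1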

Python scraping deep nested divs whose classes change

I'm somewhat new to Python, and working on the first part of a project where I need to get the link(s) on a FanDuel page. I've been spinning my tires trying to get the 'href'.
Here's what Inspect Element shows:
What I'm trying to get is highlighted above.
I see that one element seems to be the parent, but as you go down the tree, the lettered classes (e.g. "_a _ch _al _nr _dq _ns _nt _nu") change from day to day.
What I noticed is that the 'href' I need has a constant "data-test-id" that does not change, so I was trying to use that to find what I need, but it does not seem to be working.
I'm not sure how much farther I need to drill down to get what I need, or if my code is totally off. Thanks for your help in advance!
import requests
from bs4 import BeautifulSoup

url = "https://www.fanduel.com/contests/mlb/96"
# Authentication might not be necessary; it was a test, still getting the same results
site = requests.get(url, cookies={'X-Auth-Token': 'MY TOKEN IS HERE'})
soup = BeautifulSoup(site.content, 'lxml')
game = soup.find_all('a', {'data-test-id': "ContestCardEnterLink"})
# If I use this, I get an error
game = soup.find_all('a', {'data-test-id': "ContestCardEnterLink"})[('href')]
print(game)
The HTML is constructed by JavaScript. To check this, instead of using Inspect Element, use View Page Source and see whether the HTML is already constructed there (that is the HTML you get when you do requests.get()); I've already checked, and it is not. To resolve this, you have to use Selenium to render the JavaScript on the page; then you can get the page source from Selenium after it has constructed the elements in the DOM.
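A minimal sketch of that approach, keeping the data-test-id selector from the question (the URL is from the question; the fixed sleep is a crude stand-in for an explicit wait):

import time
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()  # assumes chromedriver is on your PATH
driver.get("https://www.fanduel.com/contests/mlb/96")
time.sleep(5)  # crude wait for the JavaScript to build the DOM; WebDriverWait is more robust

soup = BeautifulSoup(driver.page_source, 'lxml')
for game in soup.find_all('a', {'data-test-id': "ContestCardEnterLink"}):
    print(game.get('href'))

driver.quit()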

Using selenium webdriver, how to click on multiple random links in webpage one after another continuously to detect broken links?

I'm trying to write a test script that would essentially test all visible links randomly rather than explicitly specifying them, in a webpage upon login. Is this possible in Selenium IDE/Webdriver, and if so how can I do this?
from random import randint

links = driver.find_elements_by_tag_name("a")
random_link = links[randint(0, len(links) - 1)]
The above fetches all links on the first page, but how do I go about testing all (or as many as possible) links without manually adding this code for each link/page? I suppose what I'm trying to do is find broken links that would result in 500s/404s. Is there a productive way of doing this? Thanks.
Currently, you can't legitimately get the status code from Selenium. You could use Selenium to crawl for URLs and another library like requests to check each link's status, like this (or use the title-check solution proposed by @MrTi):
import requests

def find_broken_links(root, driver):
    visited = set()
    broken = set()
    # Use a queue for BFS, a list/stack for DFS.
    elements = [root]
    session = requests.session()
    while len(elements):
        el = elements.pop()
        if el in visited:
            continue
        visited.add(el)
        resp = session.get(el)
        if resp.status_code in [500, 404]:
            broken.add(el)
            continue
        driver.get(el)
        links = driver.find_elements_by_tag_name("a")
        for link in links:
            href = link.get_attribute('href')
            if href:  # anchors without an href return None
                elements.append(href)
    return broken
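A hypothetical invocation, assuming a local chromedriver and a site you are allowed to crawl (note that the function as written follows external links too, so you may want to filter by domain):

from selenium import webdriver

driver = webdriver.Chrome()  # assumes chromedriver is on your PATH
try:
    broken = find_broken_links("http://www.example.com/", driver)
    for url in sorted(broken):
        print(url)
finally:
    driver.quit()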
When testing for a bad page, I usually test the title/URL.
If you are testing a self-contained site, you should find or create a link that is bad, see what is unique in the title/URL, and then do something like:
assert(!driver.getTitle().contains("500 Error"));
If you don't know what the title/URL will look like, you can check whether the title contains "500"/"404"/"Error"/"Page not found", or whether the page source contains those as well.
This will probably flag a bunch of pages that aren't really bad (especially if you check the page source), and will require you to go through each of them and verify that they really are bad.
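A rough sketch of that heuristic in Python (the marker strings are examples, not an exhaustive list):

BAD_MARKERS = ("500", "404", "error", "page not found")

def looks_broken(driver):
    # Heuristic only: flags any page whose title mentions a common error marker
    title = driver.title.lower()
    return any(marker in title for marker in BAD_MARKERS)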

Python - cannot access a specific div [Urllib, BeautifulSoup, maybe Mechanize?]

I have been banging my head against this wall for a couple of days now, so I thought I would ask the SO community. I want a Python script that, among other things, can hit 'accept' buttons on forms on websites in order to download files. To that end, though, I need to get access to the form.
This is an example of the kind of file I want to download. I know that within it there is an unnamed form with an action that accepts the terms and downloads the file. I also know that the div the form can be found in is the main-content div.
However, whenever I parse the webpage with BeautifulSoup, I cannot get the main-content div. The closest I've managed to get is the main_content link right before it, which does not provide me any information through BeautifulSoup's object.
Here's a bit of code from my script:
web_soup = soup(urllib2.urlopen(url))
parsed = list(urlparse(url))
ext = extr[1:]
for downloadable in web_soup.findAll("a"):
    encode = unicodedata.normalize('NFKD', downloadable.text).encode('UTF-8', 'ignore')
    if ext in str.lower(encode):
        if downloadable['href'] in url:
            return ("http://%s%s" % (parsed[1], downloadable['href']))
for div in web_soup.findAll("div"):
    if div.has_key('class'):
        print(div['class'])
        if div['class'] == "main-content":
            print("Yep")
return False
url is the URL I am looking at (the one I posted earlier). extr is the type of file I am hoping to download, in the form .extension, but that is not really relevant to my question. The relevant code is the second for loop, where I am attempting to loop through the divs. The first for loop grabs download links in another case (when the URL given to the script is a 'download link' marked by a file extension such as .zip with a content type of text/html), so feel free to ignore it. I added it just for context.
I hope I provided enough detail, though I am sure I did not. Let me know if you need any more information on what I am doing and I will be happy to oblige. Thanks, Stack.
Here's the code for getting the main-content div and the form action:
import re
import urllib2
from bs4 import BeautifulSoup as soup

url = "http://www.cms.gov/apps/ama/license.asp?file=/McrPartBDrugAvgSalesPrice/downloads/Apr-13-ASP-Pricing-file.zip"
web_soup = soup(urllib2.urlopen(url))

# get main-content div
main_div = web_soup.find(name="div", attrs={'class': 'main-content'})
print main_div

# get form action
form = web_soup.find(name="form", attrs={'action': re.compile(r'.*\.zip.*')})
print form['action']
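From there, a hedged sketch of actually following the form action to download the file. This assumes the accept form submits via POST to that action URL and that carrying over its named inputs is enough, which you should verify in the page source:

import urllib
import urlparse  # urllib.parse on Python 3

action_url = urlparse.urljoin(url, form['action'])
# Carry over any named inputs the form declares, in case the server expects them
fields = dict((i['name'], i.get('value', ''))
              for i in form.find_all('input', attrs={'name': True}))
response = urllib2.urlopen(urllib2.Request(action_url, data=urllib.urlencode(fields)))
with open('downloaded_file.zip', 'wb') as f:
    f.write(response.read())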
Though, if you need, I can provide examples for lxml, mechanize or selenium.
Hope that helps.
