I'm experimenting with web scraping for the first time in Python, using the beautifulsoup4 package. I've seen people say that you need a for-loop if you want to get all the data from a site with multiple pages, but in this particular case the URL doesn't change when you go from page to page. What do I do about this? Any help would be greatly appreciated.
Here's my Python code:
import requests
from bs4 import BeautifulSoup

response = requests.get("https://wpcarey.asu.edu/people/departments/finance")
soup = BeautifulSoup(response.text, "html.parser")
professors = soup.select(".view-content .views-row")
professor_names = {}
for professor in professors:
    title = professor.select_one(".views-field.views-field-nothing-1 .field-content .title").getText()
    if "Professor" in title or "Lecturer" in title:
        name = professor.select_one(".views-field.views-field-nothing-1 .field-content .name > a").getText()
        if name not in professor_names:
            professor_names[name] = professor.select_one(".views-field.views-field-nothing .field-content .email > a").getText()
print(professor_names)
Believe me, I know it's hideous but it's just a draft. The main focus here is finding a way to loop through every page to retrieve the data.
Here's the first page of the site if that helps.
https://wpcarey.asu.edu/people/departments/finance
Thanks again.
If you hover over the button for the next page, you'll see that the second page is also available under this link: https://wpcarey.asu.edu/people/departments/finance?page=0%2C1
The third page is: https://wpcarey.asu.edu/people/departments/finance?page=0%2C2
If you use e.g. Firefox, you can right-click the next-page button and inspect the element to see the underlying link.
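Building on that observation, here's a sketch of a paging loop. The page-count cap, the stop-on-empty-page check, and the missing-element guards are my assumptions, since the site doesn't advertise how many pages it has:

```python
import requests
from bs4 import BeautifulSoup

BASE_URL = "https://wpcarey.asu.edu/people/departments/finance"

def page_url(page_index):
    # page 0 is the bare URL; later pages append ?page=0,<n>
    # (the %2C in the address bar is just a URL-encoded comma)
    if page_index == 0:
        return BASE_URL
    return f"{BASE_URL}?page=0%2C{page_index}"

def scrape_all_pages(max_pages=50):
    professor_names = {}
    for i in range(max_pages):
        response = requests.get(page_url(i))
        soup = BeautifulSoup(response.text, "html.parser")
        rows = soup.select(".view-content .views-row")
        if not rows:  # ran past the last page, stop
            break
        for row in rows:
            title = row.select_one(".views-field-nothing-1 .field-content .title")
            name = row.select_one(".views-field-nothing-1 .field-content .name > a")
            email = row.select_one(".views-field-nothing .field-content .email > a")
            if not (title and name and email):
                continue
            if "Professor" in title.getText() or "Lecturer" in title.getText():
                professor_names.setdefault(name.getText(), email.getText())
    return professor_names
```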
I am trying to parse https://www.tandfonline.com/toc/icbi20/current for the titles of all articles. The HTML is divided into Volumes and Issues. Each Volume has an Issue that corresponds to a Month. So for Volume 36 there would be 12 Issues. In the current Volume (37) there are 4 Issues and I would like to parse through each Issue and get each Article's name.
To accomplish this and automate the search I need to fetch the href links for each Issue. Initially I chose the parent's div id: id = 'tocList'.
import requests
from bs4 import BeautifulSoup, SoupStrainer
chronobiology = requests.get("https://www.tandfonline.com/toc/icbi20/current")
chrono_coverpage = chronobiology.content
issues = SoupStrainer(id='tocList')
issues_soup = BeautifulSoup(chrono_coverpage, 'html.parser', parse_only=issues)
for issue in issues_soup:
    print(issue)
This returns a bs4 object BUT only with href links from the Volume div. What's worse is that this div should encompass both Volume div and Issue div.
So, I decided to reduce my search space and make it more specific, choosing the div that contains the Issue href links (class_='issues').
This time Jupyter will think for a bit but won't return ANYTHING. Just blank. Nothing. Zippo. BUT if I ask what type of "nothing" has been returned, Jupyter informs me it is a "String"??? I just don't know what to make of this.
So, firstly I had a question, why is it that the Issue div element does not respond to the parsing?
When I try running print(BeautifulSoup(chrono_coverpage, 'html.parser').prettify()), the same occurs: the Issue div does not appear (when I Inspect Element on the HTML page, it appears immediately beneath the final Volume span):
So I suspect it must be JavaScript-generated rather than plain HTML. Or maybe the class = 'open' has something to do with it.
Any clarification would be kindly appreciated. Also, how would one parse JavaScript-generated links to get at them?
Okay, so I've "resolved" the issue, though I need to fill in some theoretical gaps:
Firstly, this snippet holds the key to the beginning of the solution:
As can be seen, the <div class = 'container'> is immediately followed by a ::before pseudo-element, and the links I am interested in are contained inside a div immediately beneath this pseudo-element. That div is then closed off with the ::after pseudo-element.
Firstly I realized that my problem was that I needed to select a pseudo-element. I found this to be quite impossible with BeautifulSoup's soup.select(), since apparently BeautifulSoup uses Soup Sieve, which "aims to allow users to target XML/HTML elements with CSS selectors. It implements many pseudo-classes [...]."
The last part of the paragraph states:
"Soup Sieve also will not match anything for pseudo classes that are only relevant in a live, browser environment, but it will gracefully handle them if they've been implemented;"
So this got me thinking that I have no idea what "pseudo classes that are only relevant in a live browser environment" means. But then I said to myself, "it also said that had they been implemented, BS4 should be able to parse them". And since I can definitely see the div elements containing my href links of interest using the Inspect tool, I thought they must be implemented.
The first part of that phrase got me thinking: "But do I need a live browser for this to work?"
So that brought me to Selenium's web driver:
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://www.tandfonline.com/toc/icbi20/current")
chronobiology_content = driver.page_source
chronobiology_soup = BeautifulSoup(chronobiology_content, 'html.parser')
chronobiology_soup.select('#tocList > div > div > div.yearContent > div.issues > div > div')
[Out]: []
Clearly this result made me sad, because I thought I had understood what was going on. But then I thought that if I 'clicked' one of the issues in the previously opened browser, it would work (for some reason; to be honest, I'm pretty sure desperation led me to that thought).
Well, surprise surprise, it worked: after clicking on "Issue 4" and re-running the script, I got what I was looking for:
UNANSWERED QUESTIONS?
1 - Apparently these pseudo-elements only "exist" when clicked upon, because otherwise the code doesn't recognize they are there. Why?
2 - What code must be run in order to make an initial click and activate these pseudo-elements, so the code can automatically open these links and parse the information I want (the titles of the articles)?
UPDATE
Question 2 is answered using Selenium's ActionChains:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains

driver = webdriver.Chrome()
driver.get("https://www.tandfonline.com/toc/icbi20/current")
action = ActionChains(driver)
action.move_to_element(driver.find_element_by_xpath('//*[@id="tocList"]/div/div/div[3]/div[2]/div')).perform()
# grab the page source *after* the hover, so the expanded issue list is in the soup
chronobiology_soup = BeautifulSoup(driver.page_source, 'html.parser')
chronobiology_soup.select('#tocList > div > div > div.yearContent > div.issues > div > div')
[Out]:
[<div class="loi-issues-scroller">
<a class="open" href="/toc/icbi20/37/4?nav=tocList">Issue<span>4</span></a>
<a class="" href="/toc/icbi20/37/3?nav=tocList">Issue<span>3</span></a>
<a class="" href="/toc/icbi20/37/2?nav=tocList">Issue<span>2</span></a>
<a class="" href="/toc/icbi20/37/1?nav=tocList">Issue<span>1</span></a>
</div>]
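With those relative hrefs in hand, turning them into absolute, fetchable URLs is mechanical; a small sketch using urljoin from the standard library (the href list is copied from the output above):

```python
from urllib.parse import urljoin

base = "https://www.tandfonline.com/toc/icbi20/current"
# relative hrefs scraped from the issue scroller above
hrefs = [
    "/toc/icbi20/37/4?nav=tocList",
    "/toc/icbi20/37/3?nav=tocList",
]
# urljoin resolves each path against the page the soup came from
issue_urls = [urljoin(base, h) for h in hrefs]
print(issue_urls[0])
```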
The only downside is that one must stay on the page so that Selenium's ActionChains.perform() can actually interact with the element, but at least I've automated this step.
If someone could answer question 1, that would be great.
I'm trying to retrieve a list of downloadable xls files on a website.
I'm a bit reluctant to provide full links to the website in question.
Hopefully I'm able to provide all necessary details all the same.
If this is useless, please let me know.
Download .xls files from a webpage using Python and BeautifulSoup is a very similar question, but the details below will show that the solution most likely has to be different, since the links on that particular site carry an href attribute:
And the ones I'm trying to get are not tagged the same way.
On the webpage, the files that are available for downloading are listed like this:
A simple mousehover gives these further details:
I'm following the setup here, with a few changes, to produce the snippet below, which returns a list of some links, but none of the xls files:
from bs4 import BeautifulSoup
import urllib.request
import re

def getLinks(url):
    with urllib.request.urlopen(url) as response:
        html = response.read()
    soup = BeautifulSoup(html, "lxml")
    links = []
    for link in soup.findAll('a', attrs={'href': re.compile("^http://")}):
        links.append(link.get('href'))
    return links

links1 = getLinks("https://SOMEWEBSITE")
A further inspection using Ctrl+Shift+I in Google Chrome reveals that those particular links do not have an href attribute, but rather an ng-href attribute:
So I tried changing that in the snippet above, but with no success.
And I've tried different combinations with re.compile("^https://"), attrs={'ng-href'} and links.append(link.get('ng-href')), but still with no success.
So I'm hoping someone has a better suggestion!
EDIT - Further details
It seems it's a bit problematic to read these links directly.
When I use Ctrl+Shift+I and "Select an element in the page to inspect it" (Ctrl+Shift+C), this is what I can see when I hover over one of the links listed above:
And what I'm looking to extract here is the information associated with the ng-href attribute. But if I right-click the page and select View Source, the same tag only appears once, along with some metadata(?):
And I guess this is why my rather basic approach is failing in the first place.
I'm hoping this makes sense to some of you.
Update:
Using Selenium:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()
driver.get('http://.....')

# wait max 15 seconds until the links appear
xls_links = WebDriverWait(driver, 15).until(lambda d: d.find_elements_by_xpath('//a[contains(@ng-href, ".xls")]'))
# or, if the attribute is a plain href:
# xls_links = WebDriverWait(driver, 15).until(lambda d: d.find_elements_by_xpath('//a[contains(@href, ".xls")]'))

links = []
for link in xls_links:
    url = "https://SOMEWEBSITE" + link.get_attribute('ng-href')
    print(url)
    links.append(url)
Assuming ng-href is not dynamically generated: from your last image I see that the URL does not start with https:// but with a slash /, so you can try a regex that matches any ng-href containing .xls and prepend the domain yourself:
for link in soup.findAll('a', attrs={'ng-href': re.compile(r"\.xls")}):
    xls_link = "https://SOMEWEBSITE" + link['ng-href']
    print(xls_link)
    links.append(xls_link)
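To see that regex filter in isolation, here's a minimal sketch run against a hypothetical inline snippet (the file names are made up; only the matching logic is the point):

```python
import re
from bs4 import BeautifulSoup

# hypothetical inline HTML standing in for the real page
html = '<a ng-href="/files/report.xls">Report</a><a ng-href="/about">About</a>'
soup = BeautifulSoup(html, "html.parser")
# attrs with a compiled regex matches the attribute's value, so only
# anchors whose ng-href contains ".xls" are returned
links = [a["ng-href"] for a in soup.find_all("a", attrs={"ng-href": re.compile(r"\.xls")})]
print(links)
```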
My guess is that the data you are trying to crawl is created dynamically: ng-href is one of AngularJS's constructs. You could try Google Chrome's network inspector, as you already did (Ctrl+Shift+I): open the Network tab, reload the page, and see if you can find the URL that is queried. The query should typically return JSON with the links to the xls files.
There is a thread about a similar problem here. Perhaps that helps you: Unable to crawl some href in a webpage using python and beautifulsoup
I'm somewhat new to Python, and working on the first part of a project where I need to get the link(s) on a FanDuel page, and I've been spinning my tires trying to get the 'href'.
Here's what the Inspect Element shows:
What i'm trying to get to is highlighted above.
I see what seems to be the parent, but as you go down the tree, the classes listed with lettering (i.e. "_a _ch _al _nr _dq _ns _nt _nu") change from day to day.
What I noticed is that the 'href' I need has a constant "data-test-id" that does not change, so I was trying to use that as my way to find what I need, but it does not seem to be working.
I'm not sure how far, or if, I need to drill down further to get what I need, or if my code is totally off. Thanks for your help in advance!
import requests
from bs4 import BeautifulSoup

url = "https://www.fanduel.com/contests/mlb/96"

# authentication might not be necessary; it was a test, still getting the same results
site = requests.get(url, cookies={'X-Auth-Token': 'MY TOKEN IS HERE'})
soup = BeautifulSoup(site.content, 'lxml')
game = soup.find_all('a', {'data-test-id': "ContestCardEnterLink"})

# if I use this, I get an error (find_all returns a list, which can't be indexed with 'href')
game = soup.find_all('a', {'data-test-id': "ContestCardEnterLink"})[('href')]
print(game)
The HTML is constructed by JavaScript. To check this, instead of using Inspect Element, use View Page Source and see if the HTML is already constructed there (this is the HTML that you get when you do requests.get()); I've already checked, and it is true here. To resolve this, you have to use Selenium to render the JavaScript on the page; then you can get the page source from Selenium after it has constructed the elements in the DOM.
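Once the page source has been rendered (by Selenium or otherwise), the lookup that raised the error can be fixed by indexing each Tag rather than the list; a minimal sketch against a hypothetical stand-in for the rendered HTML:

```python
from bs4 import BeautifulSoup

# hypothetical stand-in for the Selenium-rendered page source
html = '''
<a class="_a _ch" data-test-id="ContestCardEnterLink" href="/entry/123">Enter</a>
<a class="_b" data-test-id="SomethingElse" href="/other">Other</a>
'''
soup = BeautifulSoup(html, "html.parser")
# find_all returns a list of Tags; take 'href' from each Tag, not from the list
hrefs = [a["href"] for a in soup.find_all("a", {"data-test-id": "ContestCardEnterLink"})]
print(hrefs)
```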
I'm trying to extract data from a table which has a couple of filter options. I'm using BeautifulSoup and got to this page with Requests. An extract of the code:
from bs4 import BeautifulSoup

tt = Contact_page.content  # webpage with table
soup = BeautifulSoup(tt, 'html.parser')
R_tables = soup.find('div', {'class': 'responsive-table'})
Using find_all("tr") and find_all("th") results in empty sets. Using R_tables.findChildren only goes down to "formrow", which then has no children. From formrow down to my tr/th tags, I can't access anything through BS4.
R_tables results in table 3. The XPath for this element is:
//*[@id="kronos_body"]/div[3]/div[2]/div[3]/script/text()
How can I get each row of my data? soup.find("r") and soup.find("f") also result in empty sets.
Pardon me in advance if this post is sloppy, this is my first. I'll link what my most similar thread is in a comment, I can't link more than 2 times.
EDIT 1: Apparently BS doesn't recognize any JavaScript apart from variables (correct me if I'm wrong, I'm still relatively new). Are there any other modules that can help me out? I was pointed to Ghost and Selenium, but I won't be using Selenium.
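Since the XPath above ends in /script/text(), the table data may well be embedded in a script tag as a JavaScript variable, in which case it can sometimes be pulled out with a regex and json.loads, no browser required. A sketch with a made-up variable name and payload:

```python
import json
import re
from bs4 import BeautifulSoup

# hypothetical page where the table rows live in a <script> as a JS variable
html = '''
<div class="responsive-table">
  <script>var rows = [{"name": "A"}, {"name": "B"}];</script>
</div>
'''
soup = BeautifulSoup(html, "html.parser")
script = soup.find("div", {"class": "responsive-table"}).find("script")
# capture the JSON array assigned to the variable and parse it
match = re.search(r"var rows = (\[.*\]);", script.string)
rows = json.loads(match.group(1))
print(rows)
```

This only works when the embedded value is valid JSON; data built up imperatively in JS would still need a rendering engine.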
I have been breaking my head against this wall for a couple days now, so I thought I would ask the SO community. I want a python script that, among other things, can hit 'accept' buttons on forms on websites in order to download files. To that end, though, I need to get access to the form.
This is an example of the kind of file I want to download. I know that within it there is an unnamed form with an action to accept the terms and download the file. I also know that the form can be found in the main-content div.
However, whenever I parse the webpage with BeautifulSoup, I cannot get the main-content div. The closest I've managed to get is the main_content link right before it, which does not provide me any information through BeautifulSoup's object.
Here's a bit of code from my script:
web_soup = soup(urllib2.urlopen(url))
parsed = list(urlparse(url))
ext = extr[1:]
for downloadable in web_soup.findAll("a"):
    encode = unicodedata.normalize('NFKD', downloadable.text).encode('UTF-8', 'ignore')
    if ext in str.lower(encode):
        if downloadable['href'] in url:
            return ("http://%s%s" % (parsed[1], downloadable['href']))
for div in web_soup.findAll("div"):
    if div.has_key('class'):
        print(div['class'])
        if div['class'] == "main-content":
            print("Yep")
return False
url is the url I am looking at (so the one I posted earlier). extr is the type of file I am hoping to download, in the form .extension, but that is not really relevant to my question. The relevant code is the second for loop, where I am attempting to loop through the divs. The first for loop grabs download links in another case (when the url the script is given is a 'download link' marked by a file extension such as .zip, with a content type of text/html), so feel free to ignore it. I added it just for context.
I hope I provided enough detail, though I am sure I did not. Let me know if you need any more information on what I am doing and I will be happy to oblige. Thanks, Stack.
Here's the code for getting the main-content div and the form action:
import re
import urllib2
from bs4 import BeautifulSoup as soup
url = "http://www.cms.gov/apps/ama/license.asp?file=/McrPartBDrugAvgSalesPrice/downloads/Apr-13-ASP-Pricing-file.zip"
web_soup = soup(urllib2.urlopen(url))
# get main-content div
main_div = web_soup.find(name="div", attrs={'class': 'main-content'})
print main_div
# get form action
form = web_soup.find(name="form", attrs={'action': re.compile('.*\.zip.*')})
print form['action']
Though, if you need, I can provide examples for lxml, mechanize or selenium.
Hope that helps.
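For completeness, a sketch of actually "hitting accept" once the form is found: collect the form's action and its named inputs, then POST them. This is Python 3, and the HTML and URL below are hypothetical stand-ins, not the real CMS page:

```python
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# hypothetical stand-in for the license page's accept form
html = '''
<div class="main-content">
  <form action="/downloads/file.zip" method="post">
    <input type="hidden" name="agree" value="yes"/>
    <input type="submit" value="Accept"/>
  </form>
</div>
'''
page_url = "http://www.example.com/apps/ama/license.asp"
soup = BeautifulSoup(html, "html.parser")
form = soup.find("form")
# resolve the form's relative action against the page URL
target = urljoin(page_url, form["action"])
# collect the named inputs the server expects when 'Accept' is clicked
payload = {i["name"]: i.get("value", "") for i in form.find_all("input", attrs={"name": True})}
# requests.post(target, data=payload) would then fetch the file
print(target, payload)
```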