BeautifulSoup How to get href links from pseudo-element/Class - python

I am trying to parse https://www.tandfonline.com/toc/icbi20/current for the titles of all articles. The HTML is divided into Volumes and Issues: each Volume has one Issue per month, so Volume 36 has 12 Issues. The current Volume (37) has 4 Issues so far, and I would like to parse through each Issue and get each article's name.
To accomplish this and automate the search I need to fetch the href links for each Issue. Initially I chose the parent's div id: id = 'tocList'.
import requests
from bs4 import BeautifulSoup, SoupStrainer

chronobiology = requests.get("https://www.tandfonline.com/toc/icbi20/current")
chrono_coverpage = chronobiology.content
issues = SoupStrainer(id='tocList')
issues_soup = BeautifulSoup(chrono_coverpage, 'html.parser', parse_only=issues)
for issue in issues_soup:
    print(issue)
This returns a bs4 object, BUT only with the href links from the Volume div. What's worse, this div should encompass both the Volume div and the Issue div.
So, I decided to narrow my search space and make it more specific, choosing the div that contains the Issue href links (class_='issues').
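That attempt presumably looked something like this (a reconstruction, since the exact snippet isn't shown):
# Reconstruction of the narrower attempt; class_='issues' is the div
# observed in the page's markup
issues = SoupStrainer(class_='issues')
issues_soup = BeautifulSoup(chrono_coverpage, 'html.parser', parse_only=issues)
for issue in issues_soup:
    print(issue)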
This time Jupyter thinks for a bit but returns NOTHING. Just blank. Nothing. Zippo. BUT if I ask what type of "nothing" has been returned, Jupyter tells me it is a string??? I just don't know what to make of this.
So, firstly, I had a question: why is it that the Issue div element does not respond to the parsing?
When I try running print(BeautifulSoup(chrono_coverpage, 'html.parser').prettify()), the same thing occurs: the Issue div does not appear (when I use Inspect Element on the HTML page, it appears immediately beneath the final Volume span):
So I suspect that it must be JavaScript-driven or something, not so much plain HTML. Or maybe the class = 'open' has something to do with it.
Any clarifications would be kindly appreciated. Also, how would one parse through JavaScript-generated links to get at them?

Okay, so I've "resolved" the issue, though I need to fill in some theoretical gaps:
Firstly, this snippet holds the key to the beginning of the solution:
As can be seen, the <div class='container'> is immediately followed by a ::before pseudo-element, and the links I am interested in are contained inside a div immediately beneath this pseudo-element. That last div is then closed off with the ::after pseudo-element.
Firstly, I realized that my problem was that I needed to select a pseudo-element. I found this to be quite impossible with BeautifulSoup's soup.select(), since apparently BeautifulSoup uses Soup Sieve, which "aims to allow users to target XML/HTML elements with CSS selectors. It implements many pseudo-classes [...]."
The last part of the paragraph states:
"Soup Sieve also will not match anything for pseudo classes that are only relevant in a live, browser environment, but it will gracefully handle them if they've been implemented;"
So this got me thinking that I have no idea what "pseudo classes that are only relevant in a live browser environment" means. But then I said to myself, "it also said that if they've been implemented, BS4 should be able to parse them." And since I can definitely see the div elements containing my href links of interest using the Inspect tool, I thought that they must be implemented.
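For what it's worth, here is a tiny sketch of what I now believe that sentence means (my own illustration, based on my reading of the Soup Sieve docs, not something from the page itself):
# Soup Sieve "gracefully handles" live-only pseudo-classes such as :hover
# by simply matching nothing instead of raising an error
from bs4 import BeautifulSoup
html = '<div class="issues"><a href="/toc/icbi20/37/4">Issue 4</a></div>'
s = BeautifulSoup(html, 'html.parser')
print(s.select('a:hover'))   # [] -- :hover only means something in a live browser
print(s.select('a[href]'))   # [<a href="/toc/icbi20/37/4">Issue 4</a>]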
The first part of that phrase got me thinking: "But do I need a live browser for this to work?"
So that brought me to Selenium's web driver:
import requests
from bs4 import BeautifulSoup, SoupStrainer
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://www.tandfonline.com/toc/icbi20/current")
chronobiology_content = driver.page_source
chronobiology_soup = BeautifulSoup(chronobiology_content, 'html.parser')
chronobiology_soup.select('#tocList > div > div > div.yearContent > div.issues > div > div')
[Out]: []
Clearly this result made me sad, because I thought I had understood what was going on. But then I thought that if I "clicked" one of the Issues in the already-open browser, it would work (to be honest, I'm pretty sure desperation led me to that thought).
Well, surprise surprise, it worked: after clicking on "Issue 4" and re-running the script, I got what I was looking for:
UNANSWERED QUESTIONS?
1 - Apparently these pseudo-elements only "exist" once clicked upon, because otherwise the code doesn't recognize that they are there. Why?
2 - What code must be run in order to make that initial click and activate these pseudo-elements, so the code can automatically open these links and parse the information I want (the titles of the articles)?
UPDATE
Question 2 is answered using Selenium's ActionChains:
import requests
from bs4 import BeautifulSoup, SoupStrainer
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains

driver = webdriver.Chrome()
driver.get("https://www.tandfonline.com/toc/icbi20/current")

# Hover over the issues scroller so the Issue links get rendered
action = ActionChains(driver)
action.move_to_element(driver.find_element_by_xpath('//*[@id="tocList"]/div/div/div[3]/div[2]/div')).perform()

# Grab the page source *after* the interaction, so the soup sees the new DOM
chronobiology_content = driver.page_source
chronobiology_soup = BeautifulSoup(chronobiology_content, 'html.parser')
chronobiology_soup.select('#tocList > div > div > div.yearContent > div.issues > div > div')
[Out]:
[<div class="loi-issues-scroller">
<a class="open" href="/toc/icbi20/37/4?nav=tocList">Issue<span>4</span></a>
<a class="" href="/toc/icbi20/37/3?nav=tocList">Issue<span>3</span></a>
<a class="" href="/toc/icbi20/37/2?nav=tocList">Issue<span>2</span></a>
<a class="" href="/toc/icbi20/37/1?nav=tocList">Issue<span>1</span></a>
</div>]
The only downside is that one must stay on the page so that Selenium's ActionChains.perform() can actually "click" the element, but at least I've automated this step.
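From here, finishing the original task is just a loop over the href links collected above. A rough sketch of that last step (note that .hlFld-Title is an assumption about how tandfonline.com marks up article titles; verify it with Inspect Element before relying on it):
# Rough sketch: visit each Issue link and print the article titles.
# ASSUMPTION: '.hlFld-Title' is a guess at the article-title selector on
# tandfonline.com -- check the live markup and adjust as needed.
base = "https://www.tandfonline.com"
issue_links = [a['href'] for a in chronobiology_soup.select('div.loi-issues-scroller a')]
for link in issue_links:
    driver.get(base + link)
    issue_soup = BeautifulSoup(driver.page_source, 'html.parser')
    for title in issue_soup.select('.hlFld-Title'):
        print(title.get_text(strip=True))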
If someone could answer question 1, that would be great.

Related

Scraping a site with multiple pages that retain the same URL?

I'm experimenting with web scraping for the first time in Python, using the beautifulsoup4 package. I've seen other people say that you need a for-loop to get all the data from a site with multiple pages, but in this particular case the URL doesn't change when you go from page to page. What do I do about this? Any help would be greatly appreciated.
Here's my python code:
import requests
from bs4 import BeautifulSoup

response = requests.get("https://wpcarey.asu.edu/people/departments/finance")
soup = BeautifulSoup(response.text, "html.parser")
professors = soup.select(".view-content .views-row")
professor_names = {}
for professor in professors:
    title = professor.select_one(".views-field.views-field-nothing-1 .field-content .title").getText()
    if "Professor" in title or "Lecturer" in title:
        name = professor.select_one(".views-field.views-field-nothing-1 .field-content .name > a").getText()
        if name not in professor_names:
            professor_names[name] = professor.select_one(".views-field.views-field-nothing .field-content .email > a").getText()
print(professor_names)
Believe me, I know it's hideous but it's just a draft. The main focus here is finding a way to loop through every page to retrieve the data.
Here's the first page of the site if that helps.
https://wpcarey.asu.edu/people/departments/finance
Thanks again.
If you hover over the button that goes to the next page, you can see that the second page is also available under this link: https://wpcarey.asu.edu/people/departments/finance?page=0%2C1
The third page is: https://wpcarey.asu.edu/people/departments/finance?page=0%2C2
If you use e.g. Firefox, you can right-click the next-page button and inspect the code of the webpage.
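So a plain loop over that page parameter does the trick. A minimal sketch, assuming the pages keep following the 0%2C<n> pattern and that an empty result list means you have run past the last page:
import requests
from bs4 import BeautifulSoup

base_url = "https://wpcarey.asu.edu/people/departments/finance"
all_rows = []
page = 0
while True:
    # "0,<n>" is URL-encoded to page=0%2C<n>, matching the links above;
    # assumption: the first page is also reachable as page=0%2C0
    response = requests.get(base_url, params={"page": f"0,{page}"})
    soup = BeautifulSoup(response.text, "html.parser")
    rows = soup.select(".view-content .views-row")
    if not rows:  # assumption: an empty listing means we are past the end
        break
    all_rows.extend(rows)
    page += 1
print(len(all_rows))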

Data scraper: the contents of the div tag is empty (??)

I am data scraping a website to get a number. This number changes dynamically every split second, but upon inspection the number is shown. I just need to capture that number, but the div wrapper that contains it returns no value. What am I missing? (Please go easy on me, as I am quite new to Python and data scraping.)
I have some code that works and returns the piece of HTML that supposedly contains the data I want, but no joy: the div wrapper returns no value.
import requests
from bs4 import BeautifulSoup
r = requests.get('https://deuda-publica-espana.com')
deuda = BeautifulSoup(r.text, 'html.parser')
deuda = deuda.findAll('div', {'id': 'contador_PDEH'})
print(deuda)
I don't receive any errors, I am just getting [<div class="contador_xl contador_verde" id="contador_PDEH"></div>] with no value!
Indeed, it is easy with Selenium. I suspect there is a JS script running a counter that supplies the number, which is why you can't find it with your method (as mentioned in the comments).
from selenium import webdriver
d = webdriver.Chrome(r'C:\Users\User\Documents\chromedriver.exe')
d.get('https://deuda-publica-espana.com/')
print(d.find_element_by_id('contador_PDEH').text)
d.quit()
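If the counter takes a moment to populate, an explicit wait is safer than reading the text immediately; a variant of the same idea (same element id, just wrapped in a wait):
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait

d = webdriver.Chrome(r'C:\Users\User\Documents\chromedriver.exe')
d.get('https://deuda-publica-espana.com/')
# until() retries while the lambda returns something falsy, so this waits
# (up to 10 seconds) for the counter's text to become non-empty
number = WebDriverWait(d, 10).until(
    lambda drv: drv.find_element_by_id('contador_PDEH').text.strip()
)
print(number)
d.quit()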

Python scraping deep nested divs whose classes change

I'm somewhat new to Python, and working on the first part of a project where I need to get the link(s) on a FanDuel page, and I've been spinning my tires trying to get the 'href'.
Here's what the Inspect Element shows:
What I'm trying to get to is highlighted above.
I see what seems to be the parent, but as you go down the tree, the classes listed with lettering (i.e. "_a _ch _al _nr _dq _ns _nt _nu") change from day to day.
What I noticed is that the 'href' I need has a constant "data-test-id" that does not change, so I was trying to use that as my way to find what I need, but it does not seem to be working.
I'm not sure how far, or if, I need to drill down farther to get what I need, or if my code is totally off. Thanks for your help in advance!
import requests
from bs4 import BeautifulSoup
url = "https://www.fanduel.com/contests/mlb/96"
# authentication might not be necessary; it was a test, and I still get the same results
site = requests.get(url, cookies={'X-Auth-Token':'MY TOKEN IS HERE'})
soup = BeautifulSoup(site.content, 'lxml')
game = soup.find_all('a', {'data-test-id':"ContestCardEnterLink"})
# If I use this, I get an error: find_all returns a list, which can't be indexed with a string
game = soup.find_all('a', {'data-test-id': "ContestCardEnterLink"})[('href')]
print(game)
The HTML is constructed by JavaScript. To check this, instead of using Inspect Element, use View Page Source and see if the HTML is already constructed there (that is the HTML you get when you do requests.get()); I've already checked, and it is true. To resolve this, you have to use Selenium to render the JavaScript on the page; then you can get the page source from Selenium after it has constructed the elements in the DOM.
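A short sketch of that Selenium route (assuming Chrome, and that the contest cards render without authentication; the selector is the same data-test-id from the question):
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://www.fanduel.com/contests/mlb/96")
# parse the JavaScript-rendered DOM instead of the raw HTTP response
soup = BeautifulSoup(driver.page_source, 'lxml')
links = [a.get('href') for a in soup.find_all('a', {'data-test-id': 'ContestCardEnterLink'})]
print(links)
driver.quit()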

Python - cannot access a specific div [Urllib, BeautifulSoup, maybe Mechanize?]

I have been breaking my head against this wall for a couple days now, so I thought I would ask the SO community. I want a python script that, among other things, can hit 'accept' buttons on forms on websites in order to download files. To that end, though, I need to get access to the form.
This is an example of the kind of file I want to download. I know that within it, there is an unnamed form with an action to accept the terms and download the file. I also know that the div the form can be found in is the main-content div.
However, whenever I BeautifulSoup parse the webpage, I cannot get the main-content div. The closest I've managed to get is the main_content link right before it, which does not provide me any information through BeautifulSoup's object.
Here's a bit of code from my script:
web_soup = soup(urllib2.urlopen(url))
parsed = list(urlparse(url))
ext = extr[1:]
for downloadable in web_soup.findAll("a"):
    encode = unicodedata.normalize('NFKD', downloadable.text).encode('UTF-8', 'ignore')
    if ext in str.lower(encode):
        if downloadable['href'] in url:
            return ("http://%s%s" % (parsed[1], downloadable['href']))
for div in web_soup.findAll("div"):
    if div.has_key('class'):
        print(div['class'])
        if div['class'] == "main-content":
            print("Yep")
return False
url is the URL I am looking at (the one I posted earlier). extr is the extension of the file I am hoping to download, in the form .extension, but that is not really relevant to my question. The relevant code is the second for loop, where I am attempting to loop through the divs. The first bit of code (the first for loop) grabs download links in another case (when the URL the script is given is a 'download link' marked by a file extension such as .zip with a content type of text/html), so feel free to ignore it; I added it just for context.
I hope I provided enough detail, though I am sure I did not. Let me know if you need any more information on what I am doing and I will be happy to oblige. Thanks, Stack.
Here's the code for getting main-content div and form action:
import re
import urllib2
from bs4 import BeautifulSoup as soup
url = "http://www.cms.gov/apps/ama/license.asp?file=/McrPartBDrugAvgSalesPrice/downloads/Apr-13-ASP-Pricing-file.zip"
web_soup = soup(urllib2.urlopen(url))
# get main-content div
main_div = web_soup.find(name="div", attrs={'class': 'main-content'})
print main_div
# get form action
form = web_soup.find(name="form", attrs={'action': re.compile('.*\.zip.*')})
print form['action']
Though, if you need, I can provide examples for lxml, mechanize or selenium.
Hope that helps.

Click an image which has a specific name

How do I click an image like below using Python mechanize?
<img name="next" id="next" src="...">
I know the name and id of the image I want to click to. I need to somehow identify the parent link and click it. How can I?
Bonus Question: How can I check if there is such an image or not?
Rather than using mechanize, it's very simple to do with bs4 (BeautifulSoup 4).
from bs4 import BeautifulSoup
import urllib2

text = urllib2.urlopen("http://yourwebpage.com/").read()
soup = BeautifulSoup(text)
img = soup.find_all('img', {'id': 'next'})
if img:
    a_tag = img[0].parent
    href = a_tag.get('href')
    print href
Retrieving the parent tag is very easy with bs4: after finding the tag with find_all, it's nothing more than .parent. Since find_all returns a list, it's best to guard with if img: so the code doesn't fail when no match is found.
EDIT: I have changed the code to include the "Bonus question", which is what I described above as an alternative.
For your bonus question - I would say you can use BeautifulSoup to check whether or not the img element exists, and you can use urllib to see if the image itself is there (at least, whether or not the server will pass it to you - otherwise you'll get an error back).
You can also check out this thread that someone more intelligent than I answered - it seems to discuss a library called SpiderMonkey and the inability of mechanize to click a button.
Well, I don't know how to do it using Mechanize; however, I know how to do it using lxml:
Let's assume that our webpage has this code:
<img name="bla bla" id="next" src="Cat.jpg">. Using lxml we would write this code:
import urllib2
from lxml import html

page = urllib2.urlopen('http://example.com')
tree = html.fromstring(page.read())
link = tree.xpath('//img[@id="next"]/ancestor::a/attribute::href')
Most of the magic happens in the tree.xpath function, where you define the image you're looking for first with //img[@id="next"], then you specify that you're looking for the a tag enclosing it: /ancestor::a, and that you want specifically the href attribute: /attribute::href. The link variable will now contain a list of strings matching that query - in this case link[0] will be page2.html - which you can urlopen(), thus effectively clicking it.
For the //img[@id="next"] part, you can use another attribute, for example //img[@name="bla bla"], and it's going to work perfectly fine. You just need to think about which attribute is best for the situation.
I know this answer doesn't use Mechanize, however I hope it's a helpful pointer. Good luck!
