Python scraping deep nested divs whose classes change - python

I'm somewhat new to python, and working on this 1st part of a project where i need to get the link(s) on a FanDuel page, and i've been spinning my tires trying get the 'href'.
Here's what the Inspect Element shows:
What i'm trying to get to is highlighted above.
I see that the seems to be the parent, but as you go down the tree, the classes listed with lettering (ie - "_a _ch _al _nr _dq _ns _nt _nu") changes from day to day.
What I noticed is that the 'href' that i need has a constant "data-test-id" that does not change, so i was trying to use that as my way to find what i need, but it does not seem to be working.
I'm not sure how far, or if, I need to drill down farther to get what I need, or if my code is totally off. Thanks for your help in advance!
import requests
from bs4 import BeautifulSoup
url = "https://www.fanduel.com/contests/mlb/96"
#authentication might not be necessary, it was a test, still getting the same results
site = requests.get(url, cookies={'X-Auth-Token':'MY TOKEN IS HERE'})
soup = BeautifulSoup(site.content, 'lxml')
game = soup.find_all('a', {'data-test-id':"ContestCardEnterLink"})
#If i use this, i get an error
game = soup.find_all('a', {'data-test-id':"ContestCardEnterLink"})[('href')]
print(game)

The HTML is constructed by javascript, to check this, instead of using inspect element, use view source-page and see if the HTML is already constructed there ( this is the html that you get when you do requests.get() ) ,i've already checked this and this is true. To resolve this, you should have to use Selenium to render the javascript on the page, and then you can get the source page code by selenium after he constructed the elements from DOM.

Related

How can I scrape the data from in between these span tags?

I am attempting to scrape the figures shown on https://www.usdebtclock.org/world-debt-clock.html , however due to the numbers constantly changing i am unaware of how to collect this data.
This is an example of what i am attempting to do.
import requests
from bs4 import BeautifulSoup
url ="https://www.usdebtclock.org/world-debt-clock.html"
URL=requests.get(url)
site=BeautifulSoup(URL.text,"html.parser")
data=site.find_all("span",id="X4a79R9BW")
print(data)
The result is this:
"[ ]"
when i was expecting
"$19,987,137,284,731"
Is there something i can change in order to extract the number?
BeautifulSoup cannot do this for you, because the data you need is provided by JavaScript, and BeautifulSoup does not support JS processing.
An alternative is to use a tool such as Selenium WebDriver:
from selenium import webdriver
driver = webdriver.Firefox()
driver.get('https://www.usdebtclock.org/world-debt-clock.html')
elem2 = driver.find_element_by_xpath('//span[#id="X4a79R9BW"]')
print(elem2.text)
driver.close()
If you have not used Selenium WebDriver before, you need to follow the installation instructions here.
In particular, you will need to follow the instructions for downloading the browser driver of your choice (I use geckodriver for Firefox). And make sure the executable is on your path.
(I expect there are other Python-based alternatives, also.)
Based on the page's code, I think what you want to accomplish may not be possible with BS. Running your code returned [<span id="X4a79R9BW"> </span>]. Trying to getText() on that returned nothing. When inspecting the page, I noticed that the numerical value in the span was continuously updating as it does on the page. Viewing the page source showed that X4a79R9BW appeared at five places in the page. First to set aspects of the font, several places where an equation was being processed, and last the empty span scraped by your code. From viewing the source, it appears that the counter is an equation running inside a tag <script type="text/javascript">. Here is what I think is the equation running under the JavaScript tag:
{'leftMargin':0,'color':-16751104,:0 */var X3a34729DW = /*144,:14 */ 96.9230013 /*751104,:0 */; var R3a45G7S = /*7104,:54 */ 0.000000306947 /*43,451134,:5 */; var Y12 = /*241,:15457 */ 18442.16666 /*19601*2*2*/*21600*2*2; /*79301*2*2*/ var Class = new Date(); var Method = Class.getTime() / 1000 - Y12a4798; var Public = X3a34729DW + Method * R3a45G7S; var Assign = FormatNumber2(Public); document.getElementById ('X3a34729DW') .firstChild.nodeValue = Assign; /*'advance':4289}
This section of the page's source indicates that the text you want is being continuously updated via JavaScript. Given that, it is my understanding that BS is not the appropriate library to complete the desired task. Though I have not used it myself, I've seen Selenium as a suggested library for scraping pages dynamically updated via JavaScript. Good luck, perhaps someone else can help provide a clearer path forward.

Scraping a site with mutliple pages that retain the same url?

I'm experimenting with webscraping for the first time in python. I'm using the beautifulsoup4 package to do so. I've seen some other people saying that you need to use a for-loop if you want to get all the data from a site with multiple pages, but in this particular case, the URL doesn't change when you go from page to page. What do I do about this? Any help would be greatly appreciated
Here's my python code:
import requests
from bs4 import BeautifulSoup
response = requests.get("https://wpcarey.asu.edu/people/departments/finance")
soup = BeautifulSoup(response.text, "html.parser")
professors = soup.select(".view-content .views-row")
professor_names = {}
for professor in professors:
if "Professor" in professor.select_one(".views-field.views-field-nothing-1 .field-content .title").getText() or "Lecturer" in professor.select_one(".views-field.views-field-nothing-1 .field-content .title").getText():
if professor.select_one(".views-field.views-field-nothing-1 .field-content .name > a").getText() not in professor_names:
professor_names[professor.select_one(".views-field.views-field-nothing-1 .field-content .name > a").getText()] = professor.select_one(".views-field.views-field-nothing .field-content .email > a").getText()
print(professor_names)
Believe me, I know it's hideous but it's just a draft. The main focus here is finding a way to loop through every page to retrieve the data.
Here's the first page of the site if that helps.
https://wpcarey.asu.edu/people/departments/finance
Thanks again.
If you hover over the Button to go to the next page you see that the second page is also available under this link https://wpcarey.asu.edu/people/departments/finance?page=0%2C1 .
The third page is: https://wpcarey.asu.edu/people/departments/finance?page=0%2C2
If you use i.e. firefox you can right click on the button to go to the next page and investigate the code of the webpage.

Data scraper: the contents of the div tag is empty (??)

I am data scraping a website to get a number. This number changes dynamically every split second, but upon inspection, the number is shown. I just need to capture that number but the div wrapper that contains it, it returns no value. What am I missing? (please go easy on me as I am quite new to Python and data scraping).
I have some code that works and returns the piece of html that supposedly contains the data I want, but no joy, the div wrapper returns no value.
import requests
from bs4 import BeautifulSoup
r = requests.get('https://deuda-publica-espana.com')
deuda = BeautifulSoup(r.text, 'html.parser')
deuda = deuda.findAll('div', {'id': 'contador_PDEH'})
print(deuda)
I don't receive any errors, I am just getting [<div class="contador_xl contador_verde" id="contador_PDEH"></div>] with no value!
Indeed it is easy with selenium. I suspect there is a js script running a counter supplying the number which is why you can't find it with your method (as mentioned in comments)
from selenium import webdriver
d = webdriver.Chrome(r'C:\Users\User\Documents\chromedriver.exe')
d.get('https://deuda-publica-espana.com/')
print(d.find_element_by_id('contador_PDEH').text)
d.quit()

Python Web Scraping with lxml

I am trying to scrape column names (player, cost, sel., form, pts) from the page below:
https://fantasy.premierleague.com/a/statistics/total_points
However, I am failing to do so.
Before I go further, let me show you what I have done.
from lxml import html
import requests
page = 'https://fantasy.premierleague.com/a/statistics/total_points'
#Take site and structure html
page = requests.get(page)
tree = html.fromstring(page.content)
#Using the page's CSS classes, extract all links pointing to a team
Location = tree.cssselect('.ism-thead-bold tr .ism-table--el-stats__name')
When I do this, Location should be a list that contains a string "Player".
However, it returns an empty list which means cssselect did not capture anything.
Though each column name has a different 'th class', I used one of them (ism-table--el-stats__name) for this specific trial just to make it simple.
When this problem is fixed, I want to use regex since every class has different suffix after two underscores.
If anyone can help me on these two tasks, I would really appreciate!
thank you guys.

Python - cannot access a specific div [Urllib, BeautifulSoup, maybe Mechanize?]

I have been breaking my head against this wall for a couple days now, so I thought I would ask the SO community. I want a python script that, among other things, can hit 'accept' buttons on forms on websites in order to download files. To that end, though, I need to get access to the form.
This is an example of the kind of file I want to download. I know that within it, there is an unnamed form with an action to accept the terms and download the file. I also know that the div that form can be found in is the main-content div.
However, whenever I BeautifulSoup parse the webpage, I cannot get the main-content div. The closest I've managed to get is the main_content link right before it, which does not provide me any information through BeautifulSoup's object.
Here's a bit of code from my script:
web_soup = soup(urllib2.urlopen(url))
parsed = list(urlparse(url))
ext = extr[1:]
for downloadable in web_soup.findAll("a"):
encode = unicodedata.normalize('NFKD',downloadable.text).encode('UTF-8','ignore')
if ext in str.lower(encode):
if downloadable['href'] in url:
return ("http://%s%s" % (parsed[1],downloadable['href']))
for div in web_soup.findAll("div"):
if div.has_key('class'):
print(div['class'])
if div['class'] == "main-content":
print("Yep")
return False
Url is the name of the url I am looking at (so the url I posted earlier). extr is the type of file I am hoping to download in the form .extension, but that is not really relevant to my question. The code that is relevant is the second for loop, the one where I am attempting to loop through the divs. The first bit of code(the first for loop) is code that goes through to grab download links in another case (when the url the script is given is a 'download link' marked by a file extension such as .zip with a content type of text/html), so feel free to ignore it. I added it in just for context.
I hope I provided enough detail, though I am sure I did not. Let me know if you need any more information on what I am doing and I will be happy to oblige. Thanks, Stack.
Here's the code for getting main-content div and form action:
import re
import urllib2
from bs4 import BeautifulSoup as soup
url = "http://www.cms.gov/apps/ama/license.asp?file=/McrPartBDrugAvgSalesPrice/downloads/Apr-13-ASP-Pricing-file.zip"
web_soup = soup(urllib2.urlopen(url))
# get main-content div
main_div = web_soup.find(name="div", attrs={'class': 'main-content'})
print main_div
# get form action
form = web_soup.find(name="form", attrs={'action': re.compile('.*\.zip.*')})
print form['action']
Though, if you need, I can provide examples for lxml, mechanize or selenium.
Hope that helps.

Categories