Getting the XPath from an HTML document - python

https://next.newsimpact.com/NewsWidget/Live
I am trying to write a Python script that will grab a value from an HTML table at the link above. That link is the site I am trying to scrape, and below is the code I have written. I think my XPath may be incorrect, because it has worked fine on other elements, but this path is not returning/printing anything.
from lxml import html
import requests
page = requests.get('https://next.newsimpact.com/NewsWidget/Live')
tree = html.fromstring(page.content)
# Grab the text of the fourth cell in the table's first row:
value = tree.xpath('//*[@id="table9521"]/tr[1]/td[4]/text()')
print('Value: ', value)
What is strange is that when I open the page's source code, I can't find the table I am trying to pull from.
Thank you for your help!

The required data is absent from the initial page source: it is loaded later by an XHR request. You can get it directly, as below:
import requests
response = requests.get('https://next.newsimpact.com/NewsWidget/GetNextEvents?offset=-120').json()
first_previous = response['Items'][0]['Previous']  # current output: "2.632"
second_previous = response['Items'][1]['Previous']  # current output: "0.2"
first_forecast = response['Items'][0]['Forecast']  # current output: ""
second_forecast = response['Items'][1]['Forecast']  # current output: "0.3"
You can parse the response as a plain Python dict and pull out all the data you need.
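For example, a minimal sketch that walks the whole feed, assuming every item carries the same 'Previous' and 'Forecast' keys seen above:
import requests

response = requests.get('https://next.newsimpact.com/NewsWidget/GetNextEvents?offset=-120').json()
# Print the previous and forecast values for every event in the feed
for i, item in enumerate(response['Items']):
    print(i, item['Previous'], item['Forecast'])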

Your problem is simple: requests doesn't handle JavaScript at all. The values are generated by JS!
If you really need to evaluate this XPath against the rendered page, you need to use a module capable of running JS, like spynner.
You can test whether you need JS by first fetching the page with curl, or by disabling JS in your browser. With Firefox: type about:config in the navigation bar, search for javascript.enabled, then double-click it to toggle between true and false.
In Chrome, open the dev tools; the option to disable JavaScript is in its settings.
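You can run the same test from Python itself: fetch the raw page with requests and check whether the element you are after is present (a minimal sketch using the table id from the question):
import requests

page = requests.get('https://next.newsimpact.com/NewsWidget/Live')
# If the id never appears in the raw HTML, the table is built by JS,
# and no XPath evaluated against this response will ever match it.
print('table9521' in page.text)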
Check https://github.com/makinacorpus/spynner
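A rough sketch of using spynner for the page above; this assumes the Browser.load()/html interface shown in its README, so treat it as untested:
import spynner
from lxml import html

browser = spynner.Browser()
browser.load('https://next.newsimpact.com/NewsWidget/Live')  # runs the page's JS
tree = html.fromstring(browser.html)  # browser.html holds the rendered markup
print(tree.xpath('//*[@id="table9521"]/tr[1]/td[4]/text()'))
browser.close()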
Another possible problem: use tree = html.fromstring(page.text) rather than tree = html.fromstring(page.content), so that requests decodes the response for you.

Related

Read scripts on a site using python

I'm currently trying to write a Python script that notifies me by mail when a site updates its selection of apartments. However, when I use Beautiful Soup, the site doesn't return a list of items; it returns a script that selects all the relevant houses rather than the results of that script. Is there any way for me to retrieve the HTML of the site as I would normally see it as a user? This is the rather simple code I've written, in case that helps.
import requests
from bs4 import BeautifulSoup

html = #somesite
response = requests.get(html)
text = BeautifulSoup(response.text)
text.find_all("script")
You need to execute the (Java)Script the way a web browser does, then parse the resulting HTML. I use Selenium; there are other tools.
from selenium import webdriver
from bs4 import BeautifulSoup

html = #somesite
driver = webdriver.Firefox()
driver.get(html)  # the browser executes the page's scripts for you
text = BeautifulSoup(driver.page_source)
text.text # returns text in the entire html body, excluding script

How can I scrape the data from in between these span tags?

I am attempting to scrape the figures shown on https://www.usdebtclock.org/world-debt-clock.html; however, because the numbers are constantly changing, I am unsure how to collect this data.
This is an example of what I am attempting to do.
import requests
from bs4 import BeautifulSoup
url ="https://www.usdebtclock.org/world-debt-clock.html"
URL=requests.get(url)
site=BeautifulSoup(URL.text,"html.parser")
data=site.find_all("span",id="X4a79R9BW")
print(data)
The result is this:
"[ ]"
when I was expecting
"$19,987,137,284,731"
Is there something I can change in order to extract the number?
BeautifulSoup cannot do this for you, because the data you need is provided by JavaScript, and BeautifulSoup does not support JS processing.
An alternative is to use a tool such as Selenium WebDriver:
from selenium import webdriver
driver = webdriver.Firefox()
driver.get('https://www.usdebtclock.org/world-debt-clock.html')
elem2 = driver.find_element_by_xpath('//span[@id="X4a79R9BW"]')
print(elem2.text)
driver.close()
If you have not used Selenium WebDriver before, you will need to follow its installation instructions.
In particular, you will need to download the browser driver of your choice (I use geckodriver for Firefox) and make sure the executable is on your path.
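If you would rather not touch your path, you can also point Selenium at the executable directly; a sketch using the Selenium 3 style keyword, where the path below is only a placeholder:
from selenium import webdriver

# '/path/to/geckodriver' is a placeholder - substitute your download location
driver = webdriver.Firefox(executable_path='/path/to/geckodriver')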
(I expect there are other Python-based alternatives, also.)
Based on the page's code, I think what you want to accomplish may not be possible with BS. Running your code returned [<span id="X4a79R9BW"> </span>], and trying getText() on that returned nothing.
When inspecting the page, I noticed that the numerical value in the span was continuously updating, just as it does on the page. Viewing the page source showed that X4a79R9BW appears in five places: first to set aspects of the font, then in several places where an equation is being processed, and last in the empty span scraped by your code.
From viewing the source, it appears that the counter is an equation running inside a <script type="text/javascript"> tag. Here is what I think is the equation running under that tag:
{'leftMargin':0,'color':-16751104,:0 */var X3a34729DW = /*144,:14 */ 96.9230013 /*751104,:0 */; var R3a45G7S = /*7104,:54 */ 0.000000306947 /*43,451134,:5 */; var Y12 = /*241,:15457 */ 18442.16666 /*19601*2*2*/*21600*2*2; /*79301*2*2*/ var Class = new Date(); var Method = Class.getTime() / 1000 - Y12a4798; var Public = X3a34729DW + Method * R3a45G7S; var Assign = FormatNumber2(Public); document.getElementById ('X3a34729DW') .firstChild.nodeValue = Assign; /*'advance':4289}
This section of the page's source indicates that the text you want is being continuously updated via JavaScript. Given that, it is my understanding that BS is not the appropriate library for this task. Though I have not used it myself, I have seen Selenium suggested for scraping pages that are dynamically updated via JavaScript. Good luck; perhaps someone else can provide a clearer path forward.
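To illustrate what that script is doing, here is a rough Python translation of the counter. This is purely a sketch: the constants are copied from the obfuscated snippet above, belong to one particular counter, and may well be stale by now:
import time

base = 96.9230013                # X3a34729DW: the value at the reference time
rate = 0.000000306947            # R3a45G7S: growth per second
reference = 18442.16666 * 21600 * 2 * 2  # Y12: reference Unix timestamp, in seconds

# getTime()/1000 - Y12 in the original script
elapsed = time.time() - reference
print(base + elapsed * rate)     # the figure the page writes into the span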

Python scraping deep nested divs whose classes change

I'm somewhat new to Python, and I'm working on the first part of a project where I need to get the link(s) on a FanDuel page; I've been spinning my tires trying to get the 'href'.
Here's what Inspect Element shows (screenshot omitted); the 'href' I'm trying to get to is highlighted in it.
I can see what seems to be the parent, but as you go down the tree, the classes listed with lettering (i.e. "_a _ch _al _nr _dq _ns _nt _nu") change from day to day.
What I noticed is that the 'href' I need sits under a constant "data-test-id" that does not change, so I was trying to use that as my way to find what I need, but it does not seem to be working.
I'm not sure how much farther I need to drill down to get what I need, or if my code is totally off. Thanks for your help in advance!
import requests
from bs4 import BeautifulSoup
url = "https://www.fanduel.com/contests/mlb/96"
# Authentication might not be necessary; it was a test, and I still get the same results
site = requests.get(url, cookies={'X-Auth-Token':'MY TOKEN IS HERE'})
soup = BeautifulSoup(site.content, 'lxml')
game = soup.find_all('a', {'data-test-id':"ContestCardEnterLink"})
# If I use this, I get an error (a list cannot be indexed with 'href'):
game = soup.find_all('a', {'data-test-id':"ContestCardEnterLink"})[('href')]
print(game)
The HTML is constructed by JavaScript. To check this, instead of using Inspect Element, use View Page Source and see whether the HTML is already constructed there (this is the HTML you get when you do requests.get()); I've already checked, and it is not. To resolve this, you have to use Selenium to render the JavaScript on the page; you can then get the page source from Selenium after it has constructed the elements in the DOM.
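A minimal sketch of that approach, reusing the data-test-id selector from the question; it assumes Firefox plus geckodriver and that the attribute is still present once the page has rendered:
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Firefox()
driver.get("https://www.fanduel.com/contests/mlb/96")
# data-test-id stays stable even though the lettered classes change daily
soup = BeautifulSoup(driver.page_source, 'lxml')
links = [a['href'] for a in soup.find_all('a', {'data-test-id': 'ContestCardEnterLink'})]
print(links)
driver.quit()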

Python - cannot access a specific div [Urllib, BeautifulSoup, maybe Mechanize?]

I have been banging my head against this wall for a couple of days now, so I thought I would ask the SO community. I want a Python script that, among other things, can hit 'accept' buttons on forms on websites in order to download files. To that end, though, I need to get access to the form.
This is an example of the kind of file I want to download. I know that within it, there is an unnamed form with an action to accept the terms and download the file. I also know that the div that form can be found in is the main-content div.
However, whenever I BeautifulSoup parse the webpage, I cannot get the main-content div. The closest I've managed to get is the main_content link right before it, which does not provide me any information through BeautifulSoup's object.
Here's a bit of code from my script:
web_soup = soup(urllib2.urlopen(url))
parsed = list(urlparse(url))
ext = extr[1:]
for downloadable in web_soup.findAll("a"):
    encode = unicodedata.normalize('NFKD', downloadable.text).encode('UTF-8', 'ignore')
    if ext in str.lower(encode):
        if downloadable['href'] in url:
            return ("http://%s%s" % (parsed[1], downloadable['href']))
for div in web_soup.findAll("div"):
    if div.has_key('class'):
        print(div['class'])
        if div['class'] == "main-content":
            print("Yep")
return False
url is the URL I am looking at (the one I posted earlier), and extr is the type of file I am hoping to download, in the form '.extension'; that is not really relevant to my question. The relevant code is the second for loop, where I am attempting to loop through the divs. The first for loop grabs download links in another case (when the URL the script is given is a 'download link' marked by a file extension such as .zip but served with a content type of text/html), so feel free to ignore it. I added it just for context.
I hope I provided enough detail, though I am sure I did not. Let me know if you need any more information on what I am doing and I will be happy to oblige. Thanks, Stack.
Here's the code for getting main-content div and form action:
import re
import urllib2
from bs4 import BeautifulSoup as soup
url = "http://www.cms.gov/apps/ama/license.asp?file=/McrPartBDrugAvgSalesPrice/downloads/Apr-13-ASP-Pricing-file.zip"
web_soup = soup(urllib2.urlopen(url))
# get main-content div
main_div = web_soup.find(name="div", attrs={'class': 'main-content'})
print main_div
# get form action
form = web_soup.find(name="form", attrs={'action': re.compile('.*\.zip.*')})
print form['action']
Though, if you need, I can provide examples for lxml, mechanize or selenium.
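For instance, a rough lxml equivalent of the same two lookups; a sketch against the same page, with the markup assumptions carried over from above:
import urllib2
from lxml import html

url = "http://www.cms.gov/apps/ama/license.asp?file=/McrPartBDrugAvgSalesPrice/downloads/Apr-13-ASP-Pricing-file.zip"
tree = html.parse(urllib2.urlopen(url))
# same main-content div and zip-pointing form action as above
main_div = tree.xpath('//div[@class="main-content"]')[0]
form_action = tree.xpath('//form[contains(@action, ".zip")]/@action')[0]
print main_div
print form_action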
Hope that helps.

Screen scraping in LXML with python-- extract specific data

I've been trying to write a program for the last several hours that does what I thought would be an incredibly simple task:
Program asks for user input (let's say they type 'happiness')
Program queries the website thinkexist using this format ("http://thinkexist.com/search/searchQuotation.asp?search=USERINPUT")
Program returns first quote from the website.
I've tried using XPath with lxml, but I have no experience with it, and every single construction comes back with a blank array.
The actual meat of the quote appears to be contained in the class "sqq."
If I navigate the site via Firebug and click the DOM tab, it appears the quote is in a textNode attribute, "wholeText" or "textContent", but I don't know how to use that knowledge programmatically.
Any ideas?
import lxml.html
import urllib
site = 'http://thinkexist.com/search/searchquotation.asp'
userInput = raw_input('Search for: ').strip()
url = site + '?' + urllib.urlencode({'search':userInput})
root = lxml.html.parse(url).getroot()
quotes = root.xpath('//a[@class="sqq"]')
print quotes[0].text_content()
... and if you enter 'Shakespeare', it returns
In real life, unlike in Shakespeare, the sweetness
of the rose depends upon the name it bears. Things
are not only what they are. They are, in very important
respects, what they seem to be.
If it's not necessary for you to implement this via XPath, you may use the BeautifulSoup library like this (let the myXml variable contain the page's HTML source):
soup = BeautifulSoup(myXml)
for a in soup.findAll('a', {'class': 'sqq'}):
    # this is your quote
    print a.contents
Anyway, read the BS documentation; it may be very useful for scraping needs that don't require the power of XPath.
You could open the html source to find out the exact class you are looking for. For example, to grab the first StackOverflow username encountered on the page you could do:
#!/usr/bin/env python
from lxml import html
url = 'http://stackoverflow.com/questions/4710307'
tree = html.parse(url)
path = '//div[@class="user-details"]/a[@href]'
print tree.findtext(path)
# -> Parseltongue
# OR to print text including the text in children
a = tree.find(path)
print a.text_content()
# -> Parseltongue
