Is it possible to download the 'inspect element' data from a website? - python

I have been trying to access the 'inspect element' data from a certain website (the regular page source won't work for this). At first I tried rendering the site's JavaScript: I've tried selenium, pyppeteer, webbot, phantomjs, and requests_html + BeautifulSoup. None of these worked. Would it be possible to simply copy-paste this data using Python?
The data I need is from https://www.arcgis.com/apps/opsdashboard/index.html#/bda7594740fd40299423467b48e9ecf6 and looks like this:
<nav class="feature-list">
<span style="" id="ember683" class="flex-horizontal feature-list-item ember-view">
(all the <span> elements inside this particular nav)
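For context, the usual Selenium pattern for Ember-rendered content like this is to wait explicitly until the nav has been populated before reading it. The sketch below is not a confirmed fix; the selectors are taken from the snippet above, and whether it works depends on how the dashboard structures its page:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://www.arcgis.com/apps/opsdashboard/index.html#/bda7594740fd40299423467b48e9ecf6")

# Wait up to 30 seconds for the JS-rendered list items to appear.
wait = WebDriverWait(driver, 30)
spans = wait.until(EC.presence_of_all_elements_located(
    (By.CSS_SELECTOR, "nav.feature-list span.feature-list-item")))
for span in spans:
    print(span.text)
driver.quit()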

Related

Scraping data via scrapy from table yields nothing

I am having issues extracting data from the table below.
https://tirewheelguide.com/sizes/perodua/myvi/2019/
I want to extract the sizes; in this example it would be the 175/65 SR14.
<a style="text-decoration: underline;" href="https://tirewheelguide.com/tires/s/175-65-14/">175/65 SR14 </a>
Using scrapy shell function
response.xpath('/html/body/div[2]/table[1]/tbody/tr[1]/td[1]/a[1]/text()').get()
yields nothing.
Do you know what I am doing wrong?
There is a problem with your XPath
instead of this:
response.xpath('/html/body/div[2]/table[1]/tbody/tr[1]/td[1]/a[1]/text()').get()
use this:
response.xpath('//table[1]//td//a/text()').get()
Some websites don't build their tables properly, so in my XPath I skip the html/body/div prefix. There was also a problem with tr: the site produces multiple tr elements in the same row, which breaks the positional index. If you use the XPath I posted, it will work fine.
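To illustrate why the relative expression matches while the absolute, index-based one does not, here is a self-contained sketch of mine (not taken from the page itself) that feeds a reduced fragment of the table to a Scrapy Selector:

from scrapy.selector import Selector

html_doc = (
    '<html><body><div><table><tr><td>'
    '<a href="https://tirewheelguide.com/tires/s/175-65-14/">175/65 SR14 </a>'
    '</td></tr></table></div></body></html>'
)
sel = Selector(text=html_doc)

# The absolute path copied from browser dev tools hard-codes positions and a
# <tbody> that the downloaded HTML may not contain, so it finds nothing here.
print(sel.xpath('/html/body/div[2]/table[1]/tbody/tr[1]/td[1]/a[1]/text()').get())  # None

# The relative path anchors on the table itself and matches regardless.
print(sel.xpath('//table[1]//td//a/text()').get())  # 175/65 SR14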

Scraping with Python - XPath issue

I'm currently in the process of researching scraping and I've been following a tutorial on Youtube. The tutorial is using 'Scrapy' and I've managed to scrape data from the website previewed in the tutorial. However, now I've tried scraping another website with no success.
From my understanding, the problem is with the XPath I'm using. I've tried several XPath testing/generator websites with no success.
This is the HTML in question:
<div class="price" currentmouseover="94">
<del currentmouseover="96">
<span class="woocommerce-Price-amount amount" currentmouseover="90"><span class="woocommerce-Price-currencySymbol">€</span>3.60</span>
</del>
<ins><span class="woocommerce-Price-amount amount" currentmouseover="123"><span class="woocommerce-Price-currencySymbol" currentmouseover="92">€</span>3.09</span></ins></div>
I'm currently using the following code:
def parse(self, response):
    for title in response.xpath("//div[@class='Price']"):
        yield {
            'title_text': title.xpath(".//span[@class='woocommerce-Price-amount amount']/text()").extract_first()
        }
I've also tried using //span[@class='woocommerce-Price-amount amount'].
I want my output to be '3.09'; instead, I'm getting null when I export it to a JSON file. Can someone point me in the right direction?
Thanks in advance.
Update 1:
I've managed to fix the problem with Jack Fleeting's answer. Since I've had problems understanding XPath, I've been trying different websites to get a better understanding of how it works. Unfortunately, I'm stuck on another example.
<div class="add-product"><strong><small>€3.11</small> €3.09</strong></div>
I'm using the following snippet:
l.add_xpath('price', ".//div[@class='add-product']/strong[1]")
My expectation is to output 3.09; however, I'm getting both numbers. I've tried using a minimum function, but XPath 1.0 does not support it (I want to output only the actual, discounted price of the item).
Try this XPath expression, and see if it works:
//div[@class='price']/ins/span
Note that price is lower case, as in your HTML.
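As a sanity check of my own (not part of the original answer), both fixes can be verified against the HTML fragments quoted above using parsel, the selector library Scrapy uses under the hood. The second expression also handles the Update 1 case by selecting only the text node that sits directly under <strong>, which skips the old price nested in <small>:

from parsel import Selector

price_html = (
    '<div class="price">'
    '<del><span class="woocommerce-Price-amount amount">'
    '<span class="woocommerce-Price-currencySymbol">€</span>3.60</span></del>'
    '<ins><span class="woocommerce-Price-amount amount">'
    '<span class="woocommerce-Price-currencySymbol">€</span>3.09</span></ins>'
    '</div>'
)
sel = Selector(text=price_html)
# The discounted price sits inside <ins>; note the lower-case 'price'.
print(sel.xpath("//div[@class='price']/ins/span/text()").get())  # 3.09

update_html = '<div class="add-product"><strong><small>€3.11</small> €3.09</strong></div>'
sel = Selector(text=update_html)
# text() directly under <strong> skips the old price inside <small>.
print(sel.xpath("//div[@class='add-product']/strong/text()").get().strip())  # €3.09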

Web Scraping Javascript Using Python

I am used to using BeautifulSoup to scrape a website, however this website is different: when I call soup.prettify() I get back JavaScript code and lots of other stuff. I want to scrape this website for the data shown on the actual page (company name, telephone number etc.). Is there a way of scraping these scripts, such as Main.js, to retrieve the data that is displayed to me?
Clear version:
Code is:
<script src="/docs/Main.js" type="text/javascript" language="javascript"></script>
This script holds the text that is shown on the website. I would like to scrape this text, however it is populated using JS rather than static HTML (which is what I used BeautifulSoup for).
You're asking if you can scrape text generated at runtime by Javascript. The answer is sort-of.
You'd need to run some kind of headless browser, like PhantomJS, in order to let the Javascript execute and populate the page. You'd then need to feed the HTML that the headless browser generates to BeautifulSoup in order to parse it.
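A minimal sketch of that pipeline, assuming Selenium and a Chrome driver are installed; headless Chrome stands in for PhantomJS here since PhantomJS is no longer maintained, and the URL is a placeholder:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")      # render the page without opening a window
driver = webdriver.Chrome(options=options)
driver.get("https://example.com")       # placeholder for the JS-driven site
html = driver.page_source               # the DOM after scripts like Main.js have run
driver.quit()

soup = BeautifulSoup(html, "html.parser")
print(soup.get_text(" ", strip=True))   # then pick out company name, phone, etc.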

Can't scrape nested html using BeautifulSoup

I am interested in scraping "0.449" from the following source code from http://hdsc.nws.noaa.gov/hdsc/pfds/pfds_map_cont.html?Lat=33.146425&Lon=-87.5805543.
<td class="tblInner" id="0-0">
<div style="font-size:110%">
<b>0.449</b>
</div>
"(0.364-0.545)"
</td>
Using BeautifulSoup, I currently have written:
storm=soup.find("td",{"class":"tblInner","id":"0-0"})
which results in:
<td class="tblInner" id="0-0">-</td>
I am unsure of why everything nested within the td is not showing up. When I search the contents of the td, my result is simply "-". How can I scrape the value that I want from this code?
You are likely scraping a website that uses javascript to update the DOM after the initial load.
You have a couple choices:
Find out where the JavaScript code that fills the HTML page gets its data from and call that instead. The data most likely comes from an API that you can call directly with curl. That's the best method 99% of the time.
Use a headless browser (zombie.js, ...) to retrieve the HTML after the JavaScript changes it. Convenient and fast, but there are few tools in Python to do this (google "python headless browser").
Use selenium or splinter to remote-control a real browser (Chrome, Firefox, ...). It's convenient and works in Python, but slow as hell.
Edit:
I did not see that you posted the url you wanted to scrape.
In your particular case, the data you want comes from an AJAX call to this URL:
http://hdsc.nws.noaa.gov/cgi-bin/hdsc/new/cgi_readH5.py?lat=33.1464&lon=-87.5806&type=pf&data=depth&units=english&series=pds
You now only need to understand what each parameter does, and parse the output of that instead of writing an HTML scraper.
Please excuse the lack of error checking and modularity, but this should get you what you need, based on @Eloims's observation:
import requests
import re

url = 'http://hdsc.nws.noaa.gov/cgi-bin/hdsc/new/cgi_readH5.py?lat=33.1464&lon=-87.5806&type=pf&data=depth&units=english&series=pds'
r = requests.get(url)
response = r.text

# The endpoint returns JavaScript; pull the "quantiles = [...]" assignment out of it.
coord_list_text = re.search(r'quantiles = (.*);', response)
coord_list = eval(coord_list_text.group(1))
print(coord_list[0][0])

omegle lxml scrape not working

So I'm scraping Omegle, trying to get the number of users online.
This is the HTML code:
<div id="onlinecount">
<strong>
30,000+
</strong>
</div>
Now I would presume that using lxml it would be //div[@id="onlinecount"] to scrape any text within that div. I want to get the number from the <strong> tag, but when I try to scrape this, I just end up with an empty list.
Here's my relevant code:
print "\n Grabbing users online now from",self.website
site = requests.get(self.website)
tree = html.fromstring(site.text)
users = tree.xpath('//div[#id="onlinecount"]')
Note that the self.website variable is just http://www.omegle.com
Any ideas what I'm doing wrong? Note that I can scrape other parts of the page, just not the number of online users.
I ended up using a different set of code which I learned from a friend.
Here's my full code for anyone interested.
http://pastebin.com/u1kTLZtJ
When you send a GET request to "http://www.omegle.com" using the requests module, what I observed is that there is no "onlinecount" value in site.text. The reason is that that part gets rendered by JavaScript. You should use a library that is able to execute the JavaScript and give you the final HTML source as rendered in a browser. One such third-party library is Selenium (http://selenium-python.readthedocs.org/). The only downside is that it opens a real web browser.
Below is working code using Selenium:
from selenium import webdriver

browser = webdriver.Firefox()
browser.get("http://www.omegle.com")
# The count is rendered by JS into the "onlinecount" div after the page loads.
element = browser.find_element_by_id("onlinecount")
onlinecount = element.find_element_by_tag_name("strong")
print(onlinecount.text)
You can also send a GET request to http://front1.omegle.com/status, which will return the count of online users and other details in JSON form.
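For example (a sketch of mine; the exact response fields are an assumption based on this answer, not a documented API):

import requests

status = requests.get("http://front1.omegle.com/status").json()
print(status.get("count"))  # online-user count, if that field is present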
I have looked into this a bit, and that particular part of the page is not static markup but JavaScript.
Here is the source (this is what the requests library is returning in your program)
<div id="onlinecount"></div>
<script>
if (IS_MOBILE) {
$('sharebuttons').dispose();
$('onlinecount').dispose();
}
</script>
</div>
As you can see, in lxml's eyes there is nothing but a script in the onlinecount div.
I agree with Praveen.
If you want to avoid launching a visible browser, you could use PhantomJS, which also has a Selenium driver:
http://phantomjs.org/
PhantomJS is a headless WebKit scriptable with a JavaScript API
Instead of selenium scripts, you could also write PhantomJS js scripts (but I assume you prefer to stay in Python env ;))
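A minimal sketch of that route (hedged: PhantomJS is no longer maintained, and recent Selenium releases have dropped the PhantomJS driver, so this only applies to older versions):

from selenium import webdriver

driver = webdriver.PhantomJS()    # requires the phantomjs binary on your PATH
driver.get("http://www.omegle.com")
count = driver.find_element_by_id("onlinecount").text
driver.quit()
print(count)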
