How to scrape value from page that loads dynamicaly? - python

The homepage of the website I'm trying to scrape displays four tabs, one of which reads "[Number] Available Jobs". I'm interested in scraping the [Number] value. When I inspect the page in Chrome, I can see the value enclosed within a <span> tag.
However, there is nothing enclosed in that <span> tag when I view the page source directly. I was planning on using the Python requests module to make an HTTP GET request and then use regex to capture the value from the returned content. This is obviously not possible if the content doesn't contain the number I need.
My questions are:
What is happening here? How can a value be dynamically loaded into a
page, displayed, and then not appear within the HTML source?
If the value doesn't appear in the page source, what can I do to
reach it?

If the content doesn't appear in the page source then it is probably generated using javascript. For example the site might have a REST API that lists jobs, and the Javascript code could request the jobs from the API and use it to create the node in the DOM and attach it to the available jobs. That's just one possibility.
One way to scrap this information is to figure out how that javascript works and make your python scraper do the same thing (for example, if there is a simple REST API it is using, you just need to make a request to that same URL). Often that is not so easy, so another alternative is to do your scraping using a javascript capable browser like selenium.
One final thing I want to mention is that regular expressions are a fragile way to parse HTML, you should generally prefer to use a library like BeautifulSoup.

1.A value can be loaded dynamically with ajax, ajax loads asynchronously that means that the rest of the site does not wait for ajax to be rendered, that's why when you get the DOM the elements loaded with ajax does not appear in it.
2.For scraping dynamic content you should use selenium, here a tutorial

for data that load dynamically you should look for an xhr request in the networks and if you can make that data productive for you than voila!!
you can you phantom js, it's a headless browser and it captures the html of that page with the dynamically loaded content.

Related

Scrapy for dynamic content

Can we use Scrapy for getting content from a web page which is loaded by Javascript?
I'm trying to scrape usage examples from this page,
but since they are loaded using Javascript as a JSON object I'm not able to get them with Scrapy.
Could you suggest what is the best way to deal with such issues?
Open your browser's developer tools and look at the Network tab. If you hit the "next" button on that page enough, it'll send out a new request:
After removing the JSONP paramter, the URL is pretty straightforward:
https://corpus.vocabulary.com/api/1.0/examples.json?query=unalienable&maxResults=24&startOffset=24&filter=0
By making the minimal number of requests, your spider will be fast.
If you want to just emulate a full browser and execute the JavaScript, you can use something like Selenium or Scrapinghub's Splash (and its corresponding Scrapy plugin).

Scraping the content of a box contains infinite scrolling in Python

I am new to Python and web crawling. I intend to scrape links in the top stories of a website. I was told to look at to its Ajax requests and send similar ones. The problem is that all requests for the links are same: http://www.marketwatch.com/newsviewer/mktwheadlines
My question would be how to extract links from an infinite scrolling box like this. I am using beautiful soup, but I think it's not suitable for this task. I am also not familiar with Selenium and java scripts. I know how to scrape certain requests by Scrapy though.
It is indeed an AJAX request. If you take a look at the network tab in your browser inspector:
You can see that it's making a POST request to download the urls to the articles.
Every value is self explanatory in here except maybe for docid and timestamp. docid seems to indicate which box to pull articles for(there are multiple boxes on the page) and it seems to be the id attached to <li> element under which the article urls are stored.
Fortunately in this case POST and GET are interchangable. Also timestamp paremeter doesn't seem to be required. So in all you can actually view the results in your browser, by right clicking the url in the inspector and selecting "copy location with parameters":
http://www.marketwatch.com/newsviewer/mktwheadlines?blogs=true&commentary=true&docId=1275261016&premium=true&pullCount=100&pulse=true&rtheadlines=true&topic=All%20Topics&topstories=true&video=true
This example has timestamp parameter removed as well as increased pullCount to 100, so simply request it, it will return 100 of article urls.
You can mess around more to reverse engineer how the website does it and what the use of every keyword, but this is a good start.

How to crawl specific ASP.NET pages using Python?

I want to crawl an ASP.NET website but the urls are all the same how can I crawl specific pages using python?
here is the website I want to crawl:
http://www.fveconstruction.ch/index.htm
(I am using beautifulsoup, urllib and python 3)
What information should I get to distinguish a page from the other?
If the target website is just a single page application, it can't be crawled. As a workaround you can see the requests (GET, POST etc) that actually go when you manually navigate through the website and ask the crawler to use those. Or, teach your crawler to execute javascript at least what's on the target website.
It's the website who need to change to be easily crawlable, they need to provide a reasonable non-AJAX version of every page that needs to be indexed, or links to a page that needs to be indexed. Or use something like what pushState does in angularJs.

Python Scrapy : response object different from source code in browser

I'm working on a project using Scrapy.
All wanted fields but one get scraped perfectly. The content of the missing field simply doesn't show up in the Scrapy response (as checked in the scrapy shell), while it does show up when i use my browser to visit the page. In the scrapy response, the expected tags are there, but not the text between the tags.
There's no JavaScript involved, but it is a variable that is provided by the server (it's the current number of visits to that particular page). No iframe involved either.
Already set the user agent (in the settings-file) to match my browser.
Already set the download delay (in the settings-file) to 5.
EDIT (addition):
The page : http://www.fincaraiz.com.co/apartamento-en-venta/bogota/salitre-det-1337688.aspx
Xpath to the wanted element : //*[#id="numAdvertVisits"]
What could be the cause of this mystery ?
It's an ajax/javascript loaded value.
What steps did you take to determine there is no JS involved? I loaded the page w/o javascript, and while that area of the page had the stub content ("Visitas"), the actual data was written there with an ajax request.
You can still load that data using scrapy, it'll just take an additional request to the URL endpoint normally accessed via on-page ajax. The server returns the number of visits in XML, via the script at http://www.fincaraiz.com.co/WebServices/Statistics.asmx/GetAdvertVisits?idAdvert=1337688&idASource=40&idType=1001 (try loading that script and you'll see the # of visits for the page you provided in the original email).
There is another ajax request that returns "True" for that page, but I'm not sure what the data's actual meaning is. Still, it may be useful:
http://www.fincaraiz.com.co/WebServices/Statistics.asmx/DetailAdvert?idAdvert=1337688&idType=1001&idASource=40&strCookie=13/11/2014:19-05419&idSession=10hx5wsfbqybyxsywezx0n1r&idOrigin=44

web scraping a problem site

I'm trying to scrape some information from a web site, but am having trouble reading the relevant pages. The pages seem to first send a basic setup, then more detailed info. My download attempts only seem to capture the basic setup. I've tried urllib and mechanize so far.
Firefox and Chrome have no trouble displaying the pages, although I can't see the parts I want when I view page source.
A sample url is https://personal.vanguard.com/us/funds/snapshot?FundId=0542&FundIntExt=INT
I'd like, for example, average maturity and average duration from the lower right of the page. The problem isn't extracting that info from the page, it's downloading the page so that I can extract the info.
The page uses JavaScript to load the data. Firefox and Chrome are only working because you have JavaScript enabled - try disabling it and you'll get a mostly empty page.
Python isn't going to be able to do this by itself - your best compromise would be to control a real browser (Internet Explorer is easiest, if you're on Windows) from Python using something like Pamie.
The website loads the data via ajax. Firebug shows the ajax calls. For the given page, the data is loaded from https://personal.vanguard.com/us/JSP/Funds/VGITab/VGIFundOverviewTabContent.jsf?FundIntExt=INT&FundId=0542
See the corresponding javascript code on the original page:
<script>populator = new Populator({parentId:
"profileForm:vanguardFundTabBox:tab0",execOnLoad:true,
populatorUrl:"/us/JSP/Funds/VGITab/VGIFundOverviewTabContent.jsf?FundIntExt=INT&FundId=0542",
inline:fals e,type:"once"});
</script>
The reason why is because it's performing AJAX calls after it loads. You will need to account for searching out those URLs to scrape it's content as well.
As RichieHindle mentioned, your best bet on Windows is to use the WebBrowser class to create an instance of an IE rendering engine and then use that to browse the site.
The class gives you full access to the DOM tree, so you can do whatever you want with it.
http://msdn.microsoft.com/en-us/library/system.windows.forms.webbrowser(loband).aspx

Categories