Scrape aspx site with python

Scrape aspx site with python - python

I want to download supreme court cases. Below is the code, I am trying:
page = requests.get('http://judis.nic.in/supremecourt/Chrseq.aspx').text
I am getting below contents in page:
u'<html><p><hr></hr></p><b><center>The Problem may be due to 500 Server Error/404 Page Not Found.Please contact your system administrator.</center></b><p><hr></hr></p></html><!--0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234-->\r\n'
Is the site not scrapable or do I need to use some other method?
I checked this answer: How to scrape aspx pages with python , but the solution is in selenium.
Is it possible to do it in python and Beautiful soup?

The reason is you are hitting a url which may be no longer served by the server. I am able to get data from all pages. I checked response from scrapy shell as
scrapy shell "http://judis.nic.in/supremecourt/chejudis.asp"
and using xpath you can retrieve whatever data you want from same page.

I'm not able to open the website though my browser. I'm getting the same response from my browser. Maybe that's why you're getting that response back.

Related

Cloudflare protection error 503 - "checking your browser"

I made a script to scrape data from a webpage which is cloudflare protected. I was scraping around 25k links from this website and the script was working fine. I have been able to extract all the links from this website and now want to scrape information from these links. Earlier the script was working well but because of recent security update in website, I am getting error 503 by requests library and "checking your browser" webpage by selenium. Is there any way to bypass it?
I also have scraper api subscription to make requests using proxies and using "scraper_api" library for the same.
I am sharing some of the links that needs to be scraped but getting these errors:
https://coinatmradar.com/bitcoin_atm/31285/bitcoin-atm-general-bytes-birmingham-altadena-spirits/
https://coinatmradar.com/bitcoin_atm/23676/bitcoin-atm-general-bytes-birmingham-marathon-gas/
Already tried other approaches like cfscraper, cloud scraper, undetected chromedriver, but no luck.
Please try to scrape any other link and share any solution. Thanks

Pages not processing fully

I am trying to scrape news articles from yahoo finance and to do so, i want to use their sitemap page https://finance.yahoo.com/sitemap/
The problem i have is that after following a link https://finance.yahoo.com/sitemap/2015_04_02 for example scrapy does not process the whole page - only the header. So i cannot access the links to the different articles.
Is there some internal requests that i have to sent to the page ?
I still get the whole page by deactivating javascript in my browser and i use scrapy 1.6
Thanks.

Some sites take defensive measures against robots scraping their websites. If they detect that you are non-human, they may not serve the entire page. But more than likely what is happening is there is a bunch of client-side rendering that happens when you view the page in a web browser, which is not being executed when you request that same page in scrapy.
Yahoo! Finance has a API. Using that will probably get you more reliable results.

Python - Scrapy ecommerce website

I'm trying to scrape the price of this product
http://www.asos.com/au/fila/fila-vintage-plus-ringer-t-shirt-with-small-logo-in-green/prd/9065343?clr=green&SearchQuery=&cid=7616&gridcolumn=2&gridrow=1&gridsize=4&pge=1&pgesize=72&totalstyles=4699
With the following code but it returns an empty array
response.xpath('//*[#id="product-price"]/div/span[2]/text()').extract()
Any help is appreciated, Thanks.

Because the site is dynamic(this is what I got when I use view(response) command in scrapy shell:
As you can see, the price info doesn't come out.
Solutions:
1. splash.
2. selenium+phantomJS
It might help also by checking this answer:Empty List From Scrapy When Using Xpath to Extract Values

The price is later added by the browser which renders the page using javascript code found in the html. If you disable javascript in your browser, you would notice that the page would look a bit different. Also, take a look at the page source, usually that's unaltered, to see that the tag you're looking for doesn't exist (yet).
Scrapy doesn't execute any javascript code. It receives the plain html and that's what you have to work with.
If you want to extract data from pages which look the same as in the browser, I recommend using an headless browser like Splash (if you're already using scrapy): https://github.com/scrapinghub/splash
You can programaticaly tell it to download your page, render it and select the data points you're interested in.
The other way is to check for the request made to the Asos API which asks for the product data. In your case, for this product:
http://www.asos.com/api/product/catalogue/v2/stockprice?productIds=9065343&currency=AUD&keyStoreDataversion=0ggz8b-4.1&store=AU
I got this url by taking a look at all the XMLHttpRequest (XHR) requests sent in the Network tab found in Developers Tools (on Google Chrome).

You can try to find JSON inside HTML (using regular expression) and parse it:
json_string = response.xpath('//script[contains(., "function (view) {")]/text()').re_first( r'view\(\'([^\']+)' )
data = json.loads(json_string)
price = data["price"]["current"]

Browser Simulation and Scraping with windmill or selenium, how many http requests?

I want to use windmill or selenium to simulate a browser that visits a website, scrapes the content and after analyzing the content goes on with some action depending of the analysis.
As an example. The browser visits a website, where we can find, say 50 links. While the browser is still running, a python script for example can analyze the found links and decides on what link the browser should click.
My big question is with how many http Requests this can be done using windmill or selenium. I mean do these two programs can simulate visiting a website in a browser and scrape the content with just one http request, or would they use another internal request to the website for getting the links, while the browser is still running?
Thx alot!

Selenium uses the browser but number of HTTP request is not one. There will be multiple HTTP request to the server for JS, CSS and Images (if any) mentioned in the HTML document.
If you want to scrape the page with single HTTP request, you need to use scrapers which only gets what is present in the HTML source. If you are using Python, check out BeautifulSoup.

web scraping a problem site

I'm trying to scrape some information from a web site, but am having trouble reading the relevant pages. The pages seem to first send a basic setup, then more detailed info. My download attempts only seem to capture the basic setup. I've tried urllib and mechanize so far.
Firefox and Chrome have no trouble displaying the pages, although I can't see the parts I want when I view page source.
A sample url is https://personal.vanguard.com/us/funds/snapshot?FundId=0542&FundIntExt=INT
I'd like, for example, average maturity and average duration from the lower right of the page. The problem isn't extracting that info from the page, it's downloading the page so that I can extract the info.

The page uses JavaScript to load the data. Firefox and Chrome are only working because you have JavaScript enabled - try disabling it and you'll get a mostly empty page.
Python isn't going to be able to do this by itself - your best compromise would be to control a real browser (Internet Explorer is easiest, if you're on Windows) from Python using something like Pamie.

The website loads the data via ajax. Firebug shows the ajax calls. For the given page, the data is loaded from https://personal.vanguard.com/us/JSP/Funds/VGITab/VGIFundOverviewTabContent.jsf?FundIntExt=INT&FundId=0542
See the corresponding javascript code on the original page:
<script>populator = new Populator({parentId:
"profileForm:vanguardFundTabBox:tab0",execOnLoad:true,
populatorUrl:"/us/JSP/Funds/VGITab/VGIFundOverviewTabContent.jsf?FundIntExt=INT&FundId=0542",
inline:fals e,type:"once"});
</script>

The reason why is because it's performing AJAX calls after it loads. You will need to account for searching out those URLs to scrape it's content as well.

As RichieHindle mentioned, your best bet on Windows is to use the WebBrowser class to create an instance of an IE rendering engine and then use that to browse the site.
The class gives you full access to the DOM tree, so you can do whatever you want with it.
http://msdn.microsoft.com/en-us/library/system.windows.forms.webbrowser(loband).aspx

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Scrape aspx site with python - python

The reason is you are hitting a url which may be no longer served by the server. I am able to get data from all pages. I checked response from scrapy shell as scrapy shell "http://judis.nic.in/supremecourt/chejudis.asp" and using xpath you can retrieve whatever data you want from same page.

I'm not able to open the website though my browser. I'm getting the same response from my browser. Maybe that's why you're getting that response back.

Related

Cloudflare protection error 503 - "checking your browser"

Pages not processing fully

Python - Scrapy ecommerce website

Browser Simulation and Scraping with windmill or selenium, how many http requests?

web scraping a problem site

Categories

Resources