Can't scrape some links using Scrapy


I ran into a strange error: I can't scrape the link https://www.example.com/2/
but I can scrape the link https://www.example.com/922/
P.S. I am not using the real link since I am not allowed by my job. Sorry.

When I debug it with the command scrapy view https://www.example.com/2/, it shows the HTML I expect. When I check the URL in the Chrome inspector, it turns out the server returns the correct HTML but with a 500 status instead of 200. By default Scrapy only passes successful (2xx) responses to the spider callbacks, so the 500 response was being dropped. I solved it by setting handle_httpstatus_list = [500] in my spider.

Related

How do I obtain data, with scrapy, in a web page in which I do not see that there is the code I want to scrape

I'm trying to extract the user names and the comment text from a page.
When I test the extraction with the Chrome plugin XPath Helper, I get the user names with the expression:
//*[@id="livefyre"]/div/div/div/div/article/div/header/a/span
and the comments with:
//*[@id="livefyre"]/div/div/div/div/article/div/section/div/p
When I run this query in the Scrapy shell:
response.xpath('//*[@id="livefyre"]/div/div/div/div/article/div/section/div/p').extract()
I get [].
I've also tried:
response.xpath('//*[@id="livefyre"]/div/div/div/div/article/div/section/div/p/text()').extract()
The same thing happens with my code.
Checking the page source, I see that none of those comments exist in the HTML.
When I inspect the page in the browser I can see the comment text, but when I check the HTML source of the page I do not see anything.
Where am I making a mistake?
Thanks for help.

As you stated, there are no comments in the page source, which means the website renders them through JavaScript. There are two ways to scrape this kind of website.
First,
use scrapy-splash to render the JavaScript.
Second,
find the API/network call that fetches the comments and reproduce that request in Scrapy to get your data.

Python - Scrapy ecommerce website

I'm trying to scrape the price of this product
http://www.asos.com/au/fila/fila-vintage-plus-ringer-t-shirt-with-small-logo-in-green/prd/9065343?clr=green&SearchQuery=&cid=7616&gridcolumn=2&gridrow=1&gridsize=4&pge=1&pgesize=72&totalstyles=4699
with the following code, but it returns an empty list:
response.xpath('//*[@id="product-price"]/div/span[2]/text()').extract()
Any help is appreciated, Thanks.

That is because the site is dynamic. When I use the view(response) command in the Scrapy shell, the price info doesn't show up.
Solutions:
1. Splash
2. Selenium + PhantomJS
This answer may also help: Empty List From Scrapy When Using Xpath to Extract Values

The price is later added by the browser which renders the page using javascript code found in the html. If you disable javascript in your browser, you would notice that the page would look a bit different. Also, take a look at the page source, usually that's unaltered, to see that the tag you're looking for doesn't exist (yet).
Scrapy doesn't execute any javascript code. It receives the plain html and that's what you have to work with.
If you want to extract data from pages as they look in the browser, I recommend using a headless browser like Splash (if you're already using Scrapy): https://github.com/scrapinghub/splash
You can programmatically tell it to download your page, render it, and select the data points you're interested in.
The other way is to check for the request made to the Asos API which asks for the product data. In your case, for this product:
http://www.asos.com/api/product/catalogue/v2/stockprice?productIds=9065343&currency=AUD&keyStoreDataversion=0ggz8b-4.1&store=AU
I got this URL by looking at the XMLHttpRequest (XHR) requests listed in the Network tab of Developer Tools (in Google Chrome).
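As a rough sketch, the same URL can be rebuilt for other products with the standard library. The endpoint and parameter names are copied from the URL above; the function name and the default values are assumptions, and the keyStoreDataversion value may well change over time.

```python
from urllib.parse import urlencode

# Endpoint copied from the URL quoted above; it may have changed since.
STOCKPRICE_ENDPOINT = "http://www.asos.com/api/product/catalogue/v2/stockprice"


def build_stockprice_url(product_id, currency="AUD", store="AU",
                         data_version="0ggz8b-4.1"):
    """Build the stock/price API URL for one product ID (hypothetical helper)."""
    params = {
        "productIds": product_id,
        "currency": currency,
        "keyStoreDataversion": data_version,  # value observed in the browser
        "store": store,
    }
    return f"{STOCKPRICE_ENDPOINT}?{urlencode(params)}"


print(build_stockprice_url(9065343))
```

The resulting URL can then be requested from Scrapy like any other page, and the body parsed as JSON once its structure has been checked in the browser.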

You can try to find the JSON inside the HTML (using a regular expression) and parse it:
import json
# Capture the JSON string passed to view('...') in the inline script.
json_string = response.xpath('//script[contains(., "function (view) {")]/text()').re_first(r'view\(\'([^\']+)')
data = json.loads(json_string)
price = data["price"]["current"]
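The same idea can be tried outside Scrapy on a saved copy of the page. This is only a sketch: the sample script below is made up to match the pattern in the regular expression, and the real page's script content is an assumption.

```python
import json
import re

# Made-up page fragment; the real inline script is an assumption.
SAMPLE_HTML = """
<script>
(function (view) { /* ... */ })(render);
view('{"price": {"current": 25.0, "currency": "AUD"}}');
</script>
"""


def extract_current_price(html):
    """Return the current price from the embedded JSON, or None if absent."""
    match = re.search(r"view\('([^']+)", html)
    if match is None:
        return None  # the page layout changed or the script is missing
    data = json.loads(match.group(1))
    return data.get("price", {}).get("current")


print(extract_current_price(SAMPLE_HTML))
```

Returning None when nothing matches avoids the TypeError that json.loads raises when it is handed None.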

Scrape aspx site with python

I want to download Supreme Court cases. This is the code I am trying:
page = requests.get('http://judis.nic.in/supremecourt/Chrseq.aspx').text
I am getting the contents below in page:
u'<html><p><hr></hr></p><b><center>The Problem may be due to 500 Server Error/404 Page Not Found.Please contact your system administrator.</center></b><p><hr></hr></p></html><!--0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234-->\r\n'
Is the site not scrapable or do I need to use some other method?
I checked this answer: How to scrape aspx pages with python, but the solution uses Selenium.
Is it possible to do it with Python and Beautiful Soup?

The reason is that you are hitting a URL which may no longer be served by the server. I am able to get data from all pages. I checked the response from the Scrapy shell with
scrapy shell "http://judis.nic.in/supremecourt/chejudis.asp"
and using xpath you can retrieve whatever data you want from same page.

I'm not able to open the website through my browser either; I get the same response there. Maybe that's why you're getting that response back.

scrapy can't crawl all links in a page

I am trying to use Scrapy to crawl an AJAX website: http://play.google.com/store/apps/category/GAME/collection/topselling_new_free
I want to get all the links pointing to each game.
I inspected the elements of the page (screenshot: how the page looks),
so I want to extract all links with the pattern /store/apps/details?id=
but when I ran commands in the shell, they returned nothing (screenshot: shell command).
I've also tried //a/@href. That didn't work either, and I don't know what is going wrong.
Now I can crawl the first 120 links with the start URL modified and 'formdata' added, as someone suggested, but no more links after that.
Can someone help me with this?

It's actually an AJAX POST request that populates the data on that page. You won't see it in the Scrapy shell; instead of inspecting elements, check the Network tab, where you will find the request.
Make a POST request to the URL https://play.google.com/store/apps/category/GAME/collection/topselling_new_free?authuser=0 with
formdata={'start':'0','num':'60','numChildren':'0','ipf':'1','xhr':'1'}
Increment start by 60 on each request to get the paginated results.
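The pagination step could be sketched as a small generator. The function name and the pages parameter are hypothetical; the field names and values come from the formdata above.

```python
def paged_formdata(pages, page_size=60):
    """Yield one formdata dict per page, advancing 'start' by page_size."""
    for page in range(pages):
        yield {
            "start": str(page * page_size),  # 0, 60, 120, ...
            "num": str(page_size),
            "numChildren": "0",
            "ipf": "1",
            "xhr": "1",
        }


for formdata in paged_formdata(3):
    print(formdata["start"])
```

Each dict can be passed as the formdata argument of scrapy.FormRequest. How many pages the site will actually serve is not stated in the answer, so stopping when a response contains no new links is probably the safest end condition.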

scrapy shell returns different results and script returns different

I am trying to scrape this URL: "http://www.funkytrunks.com/715-clearance"
My XPath is as follows:
//a[@class="product_img_link"]//@href
When I use the Scrapy shell it returns 122 rows, but in the browser it returns 135 rows, which is quite strange. I checked the HTML using response.body, saved it to an HTML file, opened that in the browser, and ran the XPath there, and it worked perfectly.
Any help would be appreciated.

Well, Scrapy doesn't execute JavaScript, so that could be the reason for the mismatch; some JavaScript code could be inserting those extra hrefs.
If that's the case, and if those missing hrefs are relevant, you'll need to use Selenium, or abandon Scrapy altogether and use something like PhantomJS.
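One way to test that explanation is to count the links in the raw HTML (for example the file saved from response.body) with no browser involved, so no JavaScript can run. This is a sketch using only the standard library; the class and function names and the sample markup are made up, and it matches product_img_link as one of several classes, which is slightly looser than the exact @class comparison in the XPath.

```python
from html.parser import HTMLParser


class ProductLinkCollector(HTMLParser):
    """Collect hrefs of <a> tags that carry the product_img_link class."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        if tag == "a" and "product_img_link" in classes and attr_map.get("href"):
            self.hrefs.append(attr_map["href"])


def product_links(html):
    parser = ProductLinkCollector()
    parser.feed(html)
    return parser.hrefs


# Made-up markup standing in for the saved page.
SAMPLE = (
    '<a class="product_img_link" href="/p/1">x</a>'
    '<a class="other" href="/p/2">y</a>'
    '<a class="product_img_link" href="/p/3"></a>'
)
print(len(product_links(SAMPLE)))
```

If the count on the saved file matches the Scrapy shell rather than the browser, the extra links are most likely added by JavaScript.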
