Scrapy can't crawl all links on a page - python

I am trying to use Scrapy to crawl an AJAX website: http://play.google.com/store/apps/category/GAME/collection/topselling_new_free
I want to get all the links directing to each game.
I inspected the page elements, and it looks like this:
[screenshot: how the page looks]
so I want to extract all links with the pattern /store/apps/details?id=
but when I run commands in the shell, it returns nothing:
[screenshot: the shell command returning nothing]
I've also tried //a/@href, but that didn't work either. I don't know what is going wrong.
Now I can crawl the first 120 links with the start URL modified and formdata added, as someone suggested, but no more links load after that.
Can someone help me with this?

It's actually an AJAX POST request that populates the data on that page. You won't see this in the Scrapy shell; instead of inspecting the element, check the Network tab, where you will find the request.
Make a POST request to the https://play.google.com/store/apps/category/GAME/collection/topselling_new_free?authuser=0 URL with
formdata={'start':'0','num':'60','numChildren':'0','ipf':'1','xhr':'1'}
Increment start by 60 on each request to get the paginated results.
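Putting that together, here is a minimal sketch of a spider that pages through the endpoint; the link-extraction XPath and the stop condition are assumptions you may need to adapt:

import scrapy

class TopSellingSpider(scrapy.Spider):
    name = 'topselling_new_free'
    url = 'https://play.google.com/store/apps/category/GAME/collection/topselling_new_free?authuser=0'

    def start_requests(self):
        yield self.page_request(0)

    def page_request(self, start):
        # FormRequest with formdata sends a POST by default
        return scrapy.FormRequest(
            self.url,
            formdata={'start': str(start), 'num': '60',
                      'numChildren': '0', 'ipf': '1', 'xhr': '1'},
            meta={'start': start},
            callback=self.parse)

    def parse(self, response):
        # links matching the /store/apps/details?id= pattern
        links = response.xpath('//a[contains(@href, "/store/apps/details?id=")]/@href').getall()
        for link in set(links):
            yield {'link': response.urljoin(link)}
        if links:
            # keep paginating until a page comes back empty (assumption)
            yield self.page_request(response.meta['start'] + 60)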

Related

Unable to get all the links within a page

I am trying to scrape this page:
https://www.jny.com/collections/bottoms
It has a total of 55 products, but only 24 are listed once the page is loaded. However, the div contains the list of all 55 products. I am trying to scrape that using Scrapy like this:
def parse(self, response):
    print("in herre")
    self.product_url = response.xpath('//div[@class="collection-grid js-filter-grid"]//a/@href').getall()
    print(len(self.product_url))
    print(self.product_url)
It only gives me a list of length 25. How do I get the rest?
I would suggest scraping it through the API directly. The other option would be rendering the JavaScript using something like Splash/Selenium, which is really not ideal.
If you open up the Network panel in the Developer Tools on Chrome/Firefox, filter down to only the XHR requests, and reload the page, you should be able to see all of the requests being sent out. Some of those requests can help us figure out how the data is being loaded into the HTML. Here's a screenshot of what's going on behind the scenes:
[screenshot: XHR requests in the Network panel]
Clicking on those requests can give us more details on how the requests are being made and the request structure. At the end of the day, for your use case, you would probably want to send out a request to https://www.jny.com/collections/bottoms/products.json?limit=250&page=1 and parse the body_html attribute for each Product in the response (perhaps using scrapy.selector.Selector) and use that however you want. Good luck!
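As a rough sketch of that approach (the products.json layout follows the usual Shopify shape, so treat the field names as assumptions to verify):

import json
import scrapy
from scrapy.selector import Selector

class BottomsSpider(scrapy.Spider):
    name = 'jny_bottoms'
    start_urls = ['https://www.jny.com/collections/bottoms/products.json?limit=250&page=1']

    def parse(self, response):
        data = json.loads(response.text)
        for product in data.get('products', []):
            # body_html is an HTML snippet, so parse it with a Selector
            body = Selector(text=product.get('body_html') or '')
            yield {
                'title': product.get('title'),
                'handle': product.get('handle'),
                'description': body.xpath('string(.)').get(default='').strip(),
            }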

How do I obtain data with Scrapy from a web page in which I can't see the code I want to scrape

I'm trying to get the names of the users and the content of the comments that exist on this page:
[screenshot: user and comment text that I need to extract]
When I test the extraction with the Chrome plugin XPath Helper, I get the user names with the expression:
//*[#id="livefyre"]/div/div/div/div/article/div/header/a/span
and I get the comments with:
//*[#id="livefyre"]/div/div/div/div/article/div/section/div/p
When I do the test in the Scrapy console, with the query:
response.xpath('//*[@id="livefyre"]/div/div/div/div/article/div/section/div/p').extract()
I get an empty list [];
I've also tried with:
response.xpath('//*[@id="livefyre"]/div/div/div/div/article/div/section/div/p/text()').extract()
The same thing happens with my code.
Checking the source code of the page, I see that none of those comments exist in the HTML. When I inspect the page in the browser, I can see the comment text; but when I check the HTML source of the page, I do not see anything.
Where am I making a mistake?
Thanks for the help.
As you stated, there aren't any comments in the page source, which means the website is rendered through JavaScript. There are two ways you can scrape this kind of website:
First, use scrapy-splash to render the JavaScript (a sketch follows below).
Second, find the API/network call that brings in the comments, and mock that request in Scrapy to get your data.
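A minimal sketch of the first option, assuming a Splash instance is running and the scrapy-splash middlewares are enabled in settings.py; the page URL is a stand-in, and the selectors reuse the ones from the question:

import scrapy
from scrapy_splash import SplashRequest

class CommentsSpider(scrapy.Spider):
    name = 'livefyre_comments'
    # stand-in for the page from the question
    page_url = 'https://example.com/article-with-comments'

    def start_requests(self):
        # wait a couple of seconds so the comment widget can render
        yield SplashRequest(self.page_url, self.parse, args={'wait': 2})

    def parse(self, response):
        for article in response.xpath('//*[@id="livefyre"]//article'):
            yield {
                'user': article.xpath('.//header/a/span/text()').get(),
                'comment': article.xpath('string(.//section//p)').get(default='').strip(),
            }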

Scrapy can't see a list

I'm trying to crawl a specific page of a website (https://www.johnlewis.com/jaeger-wool-check-knit-shift-dress-navy-check/p3767291) to get used to Scrapy and its features. However, I can't get Scrapy to see the 'li' that contains the thumbnail images in the carousel. My parse function currently looks as follows:
def parse(self, response):
    for item in response.css('li.thumbnail-slide'):
        # The for loop works for li.size-small-item
        print("We have a match!")
No matter what, Scrapy isn't "seeing" the li. I've tried viewing the page in a Scrapy shell to check whether Scrapy could see the images, and they do show up in that response (so I'm assuming Scrapy can definitely see the list and the images in it). I've also tried alternative lists and got a different list to work (as per the comment in the code).
My only thought is that the carousel may be loaded with JavaScript/AJAX, but I can't be sure. I do know that the class of the selected image changes from "li.thumbnail-slide" to "li.thumbnail-slide thumbnail-slide-active"; however, I've tried the following in my script to no avail:
li.thumbnail-slide
li.thumbnail-slide-active
li.thumbnail-slide.thumbnail-slide-active
li.thumbnail-slide thumbnail-slide-active
Nothing works.
Does anyone have any suggestions on what I may be doing wrong? Or suggest any further reading that may help?
Thanks in advance!
Your assumption is correct: the elements are there, but not exactly where you think they are.
To easily check whether an element is part of the response HTML rather than loaded by JavaScript, I normally recommend using a browser plugin to disable JavaScript.
If you want the images, they are still part of the HTML response; you can get them with:
response.css('li.product-images__item')
the main image appears separately:
response.css('meta[itemprop=image]::attr(content)')
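For instance, a small parse callback putting those two selectors together (the img attribute inside each li is an assumption):

def parse(self, response):
    # thumbnails are in the static HTML, just under a different class
    thumbnails = response.css('li.product-images__item img::attr(src)').getall()
    # the main image is exposed through a meta tag
    main_image = response.css('meta[itemprop=image]::attr(content)').get()
    yield {'main_image': main_image, 'thumbnails': thumbnails}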
Hope that helps you.

Python - Scrapy ecommerce website

I'm trying to scrape the price of this product
http://www.asos.com/au/fila/fila-vintage-plus-ringer-t-shirt-with-small-logo-in-green/prd/9065343?clr=green&SearchQuery=&cid=7616&gridcolumn=2&gridrow=1&gridsize=4&pge=1&pgesize=72&totalstyles=4699
with the following code, but it returns an empty array:
response.xpath('//*[@id="product-price"]/div/span[2]/text()').extract()
Any help is appreciated, Thanks.
Because the site is dynamic. This is what I got when I used the view(response) command in the Scrapy shell:
[screenshot of the rendered response]
As you can see, the price info doesn't come out.
Solutions:
1. Splash
2. Selenium + PhantomJS
It might also help to check this answer: Empty List From Scrapy When Using Xpath to Extract Values
The price is added later by the browser, which renders the page using JavaScript code found in the HTML. If you disable JavaScript in your browser, you will notice that the page looks a bit different. Also, take a look at the page source (usually that's unaltered) to see that the tag you're looking for doesn't exist (yet).
Scrapy doesn't execute any JavaScript code. It receives the plain HTML, and that's what you have to work with.
If you want to extract data from pages which look the same as in the browser, I recommend using a headless renderer like Splash (if you're already using Scrapy): https://github.com/scrapinghub/splash
You can programmatically tell it to download your page, render it, and select the data points you're interested in.
The other way is to check for the request made to the Asos API which asks for the product data. In your case, for this product:
http://www.asos.com/api/product/catalogue/v2/stockprice?productIds=9065343&currency=AUD&keyStoreDataversion=0ggz8b-4.1&store=AU
I got this url by taking a look at all the XMLHttpRequest (XHR) requests sent in the Network tab found in Developers Tools (on Google Chrome).
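As a sketch, a spider hitting that endpoint could look like the following; the field names inside the JSON response are assumptions to verify in the Network tab:

import json
import scrapy

class AsosPriceSpider(scrapy.Spider):
    name = 'asos_price'
    start_urls = [
        'http://www.asos.com/api/product/catalogue/v2/stockprice'
        '?productIds=9065343&currency=AUD&keyStoreDataversion=0ggz8b-4.1&store=AU'
    ]

    def parse(self, response):
        # the endpoint appears to return a JSON list, one entry per product id
        for product in json.loads(response.text):
            yield {
                'id': product.get('productId'),
                # assumed layout: productPrice -> current -> value
                'price': product.get('productPrice', {}).get('current', {}).get('value'),
            }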
You can try to find the JSON inside the HTML (using a regular expression) and parse it:
import json

json_string = response.xpath(
    '//script[contains(., "function (view) {")]/text()'
).re_first(r"view\('([^']+)")
data = json.loads(json_string)
price = data["price"]["current"]

Unable to scrape data on results list using scrapy

I am currently trying to scrape the links to the cars on this page.
I have run this XPath expression in the Chrome console, and it returns the links for each car:
$x('//div[@class="vehicle-make-model"]/h3/a/@href')
However, when I try to use the same XPath in the Scrapy shell, it does not return any of the links. This is the code I run in the Scrapy shell:
response.xpath('//div[@class="vehicle-make-model"]/h3/a/@href')
Can somebody point out what I am doing wrong?
The XPaths that work in Chrome are run on top of the DOM that is built with JavaScript.
That's why sometimes one thing works in Chrome but does not work in the Scrapy shell.
This is the case in the page you linked. If you check out the source of the page (right-click and choose "View Page Source" or hit Ctrl-U), you will see the same data that Scrapy gets.
In this particular case, the data seems to be all in one JSON block, so you can extract the JSON code out and parse it using python's JSON module, with something like:
import json

raw_json = response.xpath(
    "//script[contains(., 'window.jsonData')]/text()"
).re(r'window\.jsonData\s*=\s*(.+);$')[0]
json_data = json.loads(raw_json)
Then you can use the data in json_data to build the next requests or scrape whatever you need.
In case there isn't an easily parseable JSON block, another option is the js2xml library, which parses the JavaScript into XML that you can then scrape using XPath.
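A rough sketch of that fallback, assuming the usual js2xml output structure (assignments become elements such as property and string; the "url" property name is purely illustrative):

import js2xml
from scrapy.selector import Selector

js_code = response.xpath("//script[contains(., 'window.jsonData')]/text()").get()
parsed = js2xml.parse(js_code)  # returns an lxml tree of the JavaScript
sel = Selector(root=parsed)
# e.g. collect every string assigned to a property named "url"
urls = sel.xpath('//property[@name="url"]/string/text()').getall()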
