I'm new to Scrapy and I have to crawl a webpage for a test. I run the commands below in a terminal, but they return an empty list and I don't understand why. When I use the same command on another website, like Amazon, with the right selector, it works. Can someone shed some light on this? Thank you so much.
scrapy shell "'https://www.woolworths.com.au/shop/browse/drinks/cordials-juices-iced-teas/iced-teas"
response.css('.tileList-title').extract()
First of all, when I looked at the source code of the page, you seemed interested in scraping the title "Iced Teas" inside an <h1> header tag. Am I right?
Second, I ran some scrapy shell sessions to understand the issue. It comes down to the User-Agent setting in the request headers. Look at the shell sessions below:
Without user-agent set
scrapy shell https://www.woolworths.com.au/shop/browse/drinks/cordials-juices-iced-teas/iced-teas
In [1]: response.css('.tileList-title').extract()
Out[1]: []
view(response) #open the given response in your local web browser, for inspection.
With user agent set
scrapy shell https://www.woolworths.com.au/shop/browse/drinks/cordials-juices-iced-teas/iced-teas -s USER_AGENT='Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
In [1]: response.css('.tileList-title').extract()
Out[1]: ['<h1 class="tileList-title" ng-if="$ctrl.listTitle" tabindex="-1">Iced Teas</h1>']
#now as you can see it does not return an empty list.
view(response)
So, to improve your future practice, know that you can use -s SETTING_NAME=value in your scrapy shell sessions. See the settings reference for Scrapy: https://docs.scrapy.org/en/latest/topics/settings.html
Also check with view(response) whether the request returned the expected content, even if it came back with a 200. In my experience, view(response) shows that the page content, and sometimes even the source code, is a little different in scrapy shell than in a normal browser, so it's good practice to check with this shortcut. See the shell shortcuts for Scrapy (https://docs.scrapy.org/en/latest/topics/shell.html); they are also listed at the start of each scrapy shell session.
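If you later move from the shell into a spider, the same setting can be carried over in code. A minimal sketch under that assumption follows; the spider name is hypothetical, but the selector and user-agent string are the ones from the sessions above:

import scrapy

class IcedTeasSpider(scrapy.Spider):
    # Hypothetical spider, shown only to illustrate carrying the shell's
    # -s USER_AGENT=... setting over into spider code.
    name = "iced_teas"
    start_urls = [
        "https://www.woolworths.com.au/shop/browse/drinks/cordials-juices-iced-teas/iced-teas"
    ]
    custom_settings = {
        "USER_AGENT": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    }

    def parse(self, response):
        # Same selector as in the shell sessions above.
        for title in response.css(".tileList-title::text").getall():
            yield {"title": title.strip()}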
Related
I am trying to use the Scrapy shell to figure out the selectors for zone-h.org. I run scrapy shell 'webpage' and afterwards I try to view the content to be sure it was downloaded. But all I can see is a dash icon (-); it doesn't download the page. I visited the website to check whether my connection to it is somehow blocked, but it was reachable. I tried setting the user agent to something more generic, like Chrome, but no luck there either. The website is blocking me somehow, but I don't know how I can bypass it. I dug through the website to see if they forbid crawling, and it doesn't say crawling is forbidden. Can anyone help out?
There is a cookie issue with your spider: if you send your cookies with your request, then you will get your desired data.
You can see that in the attached picture.
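For reference, here is a minimal sketch of sending cookies along with the request from a spider. The start URL, cookie names, and values are placeholders; copy the real ones from a browser session on the site:

import scrapy

class ZoneHSpider(scrapy.Spider):
    # Illustrative only: replace the placeholder URL and cookies below with
    # the ones your browser actually uses on the site.
    name = "zone_h"

    def start_requests(self):
        cookies = {
            "PHPSESSID": "value-copied-from-your-browser",
            # add any other cookies the site sets
        }
        yield scrapy.Request(
            "http://zone-h.org/archive",
            cookies=cookies,
            callback=self.parse,
        )

    def parse(self, response):
        self.logger.info("Got %d bytes", len(response.body))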
Can you use scrapy shell "webpage" on another webpage that you know works/doesn't block scraping?
Have you tried using the view(response) command to open up what scrapy sees in a web browser?
When you go to the webpage using a normal browser, are you redirected to another, final homepage?
- if so, try using the final homepage's URL in your scrapy shell command
Do you have firewalls that could prevent a Python/command-line app from connecting to the internet?
I'm trying to scrape the price of this product
http://www.asos.com/au/fila/fila-vintage-plus-ringer-t-shirt-with-small-logo-in-green/prd/9065343?clr=green&SearchQuery=&cid=7616&gridcolumn=2&gridrow=1&gridsize=4&pge=1&pgesize=72&totalstyles=4699
With the following code, but it returns an empty array:
response.xpath('//*[@id="product-price"]/div/span[2]/text()').extract()
Any help is appreciated, Thanks.
Because the site is dynamic. This is what I got when I used the view(response) command in scrapy shell:
As you can see, the price info doesn't come out.
Solutions:
1. Splash
2. Selenium + PhantomJS
It might also help to check this answer: Empty List From Scrapy When Using Xpath to Extract Values
The price is added later by the browser, which renders the page using JavaScript code found in the HTML. If you disable JavaScript in your browser, you'll notice that the page looks a bit different. Also, take a look at the page source, which is usually unaltered, to see that the tag you're looking for doesn't exist (yet).
Scrapy doesn't execute any JavaScript code. It receives the plain HTML, and that's what you have to work with.
If you want to extract data from pages that look the same as in the browser, I recommend using a headless browser like Splash (if you're already using Scrapy): https://github.com/scrapinghub/splash
You can programmatically tell it to download your page, render it, and select the data points you're interested in.
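If you go the Splash route with the scrapy-splash plugin, a request could look roughly like the sketch below. This is only an outline, assuming a Splash instance is running locally on port 8050 and the scrapy-splash middlewares from its README have been added to settings.py; the CSS selector is also an assumption based on the question's XPath:

import scrapy
from scrapy_splash import SplashRequest  # pip install scrapy-splash

class AsosPriceSpider(scrapy.Spider):
    # Sketch only: needs a running Splash instance (e.g. docker run -p 8050:8050
    # scrapinghub/splash) plus the scrapy-splash settings enabled in settings.py.
    name = "asos_price"

    def start_requests(self):
        url = ("http://www.asos.com/au/fila/fila-vintage-plus-ringer-t-shirt-"
               "with-small-logo-in-green/prd/9065343")
        # wait a moment so the JavaScript can fill in the price element
        yield SplashRequest(url, self.parse, args={"wait": 2})

    def parse(self, response):
        # The rendered HTML should now contain the price the browser shows.
        yield {"price": response.css("#product-price span::text").getall()}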
The other way is to check for the request made to the Asos API which asks for the product data. In your case, for this product:
http://www.asos.com/api/product/catalogue/v2/stockprice?productIds=9065343&currency=AUD&keyStoreDataversion=0ggz8b-4.1&store=AU
I got this url by taking a look at all the XMLHttpRequest (XHR) requests sent in the Network tab found in Developers Tools (on Google Chrome).
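A rough sketch of hitting that endpoint directly from a spider and pulling the price out of the JSON. The field names used below ("productPrice", "current", "value") are assumptions about the response shape; check the actual JSON in the Network tab before relying on them:

import json

import scrapy

class AsosApiSpider(scrapy.Spider):
    # Sketch only: verify the JSON field names against what the API really returns.
    name = "asos_api"
    start_urls = [
        "http://www.asos.com/api/product/catalogue/v2/stockprice"
        "?productIds=9065343&currency=AUD&keyStoreDataversion=0ggz8b-4.1&store=AU"
    ]

    def parse(self, response):
        data = json.loads(response.text)
        for product in data:  # assumed: a list with one entry per product id
            yield {
                "product_id": product.get("productId"),
                "price": product.get("productPrice", {}).get("current", {}).get("value"),
            }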
You can try to find the JSON embedded in the HTML (using a regular expression) and parse it:
import json

# extract the JSON string embedded in the inline <script> that calls view('...')
json_string = response.xpath('//script[contains(., "function (view) {")]/text()').re_first(r"view\('([^']+)")
data = json.loads(json_string)
price = data["price"]["current"]
I am trying to scrape some info from offerup.com and on the scrapy shell, nothing comes up.
I will type:
scrapy shell https://offerup.com/
It will go there but then if I simply try to get text of the whole webpage with:
response.xpath('//text()').extract()
it comes back with:
['Request unsuccessful. Incapsula incident ID: 623000250007296502-10946686267359632']
It comes back with nothing for any other info I try to get from the response, such as the title.
Do you know why this happens? Any help is hugely appreciated.
Take care to read the response you get when visiting offerup.
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object>
[s]   item       {}
[s]   request    <GET https://offerup.com>
[s]   response   <403 https://offerup.com>
You get a 403, a Forbidden error: the server is refusing to serve the page to your request, and there is no getting past that with this setup.
If you try a different site, such as http://buffalo.craigslist.org, an OK response of 200 is given. Using the same command will show the desired page, and using response.xpath('//text()').extract() will print all of the text elements from root.
Some sites may have anti-scraping measures set up to prevent robots from hogging their resources. Offerup is apparently such a site.
To directly answer your question: your code is functional, but the target site prevents you from using it.
I am trying to use Scrapy to crawl an AJAX website, http://play.google.com/store/apps/category/GAME/collection/topselling_new_free
I want to get all the links pointing to each game.
I inspected the elements of the page, and it looks like this:
(screenshot: how the page looks)
So I want to extract all links with the pattern /store/apps/details?id=
But when I run commands in the shell, it returns nothing:
(screenshot: the shell command)
I've also tried //a/@href. That didn't work out either, but I don't know what is going wrong...
Now I can crawl the first 120 links with the start URL modified and 'formdata' added, as someone told me, but there are no more links after that.
Can someone help me with this?
It's actually an AJAX POST request that populates the data on that page. In scrapy shell you won't get this; instead of using Inspect Element, check the Network tab, where you will find the request.
Make a POST request to the https://play.google.com/store/apps/category/GAME/collection/topselling_new_free?authuser=0 URL with
formdata={'start':'0','num':'60','numChildren':'0','ipf':'1','xhr':'1'}
Increment start by 60 on each request to get the paginated result.
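A minimal sketch of that approach with scrapy.FormRequest, reusing the formdata above and the /store/apps/details?id= pattern from the question; the link selector and stop condition are assumptions to adjust against the HTML the endpoint really returns:

import scrapy

class TopSellingNewFreeSpider(scrapy.Spider):
    # Sketch of the POST-and-paginate approach described above.
    name = "topselling_new_free"
    api_url = ("https://play.google.com/store/apps/category/GAME/"
               "collection/topselling_new_free?authuser=0")

    def start_requests(self):
        yield self.make_page_request(start=0)

    def make_page_request(self, start):
        formdata = {"start": str(start), "num": "60", "numChildren": "0", "ipf": "1", "xhr": "1"}
        return scrapy.FormRequest(self.api_url, formdata=formdata,
                                  callback=self.parse, cb_kwargs={"start": start})

    def parse(self, response, start):
        links = response.xpath('//a[contains(@href, "/store/apps/details?id=")]/@href').getall()
        for href in links:
            yield {"link": response.urljoin(href)}
        if links:  # keep paginating while the endpoint keeps returning links
            yield self.make_page_request(start + 60)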
I just got scrapy setup and running and it works great, but I have two (noob) questions. I should say first that I am totally new to scrapy and spidering sites.
Can you limit the number of links crawled? I have a site that doesn't use pagination and just lists a lot of links (which I crawl) on their home page. I feel bad crawling all of those links when I really just need to crawl the first 10 or so.
How do you run multiple spiders at once? Right now I am using the command scrapy crawl example.com, but I also have spiders for example2.com and example3.com. I would like to run all of my spiders using one command. Is this possible?
For #1: don't use the rules attribute to extract and follow links; write your own logic in the parse function and yield or return Request objects (see the sketch below).
For #2: try scrapyd.
Credit goes to Shane, here https://groups.google.com/forum/?fromgroups#!topic/scrapy-users/EyG_jcyLYmU
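For #1, here is a minimal sketch of capping the number of followed links inside parse. The start URL, link selector, and limit of 10 are placeholders:

import scrapy

class LimitedLinksSpider(scrapy.Spider):
    # Illustrative only: follows at most max_links links from the home page.
    name = "limited_links"
    start_urls = ["http://example.com/"]
    max_links = 10

    def parse(self, response):
        # Take only the first few links instead of everything on the page.
        for href in response.css("a::attr(href)").getall()[: self.max_links]:
            yield response.follow(href, callback=self.parse_item)

    def parse_item(self, response):
        yield {"url": response.url, "title": response.css("title::text").get()}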
Using the CloseSpider extension should allow you to specify limits of this sort.
http://doc.scrapy.org/en/latest/topics/extensions.html#module-scrapy.contrib.closespider
I haven't tried it yet since I didn't need it. It looks like you might also have to enable it as an extension (see the top of the same page) in your settings file.
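For what it's worth, a sketch of what that could look like in settings.py. The extension path differs by Scrapy version (scrapy.contrib.closespider.CloseSpider in old releases, scrapy.extensions.closespider.CloseSpider in newer ones, where it is enabled by default), so treat the EXTENSIONS entry as something to verify against your version:

# settings.py (sketch)

# Stop the spider after roughly 10 pages have been crawled.
CLOSESPIDER_PAGECOUNT = 10
# Or stop after 10 scraped items instead:
# CLOSESPIDER_ITEMCOUNT = 10

# Only needed on versions where the extension is not enabled by default.
EXTENSIONS = {
    "scrapy.extensions.closespider.CloseSpider": 500,
}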