I am using Scrapy, but I get the following message for some of the URLs:
[scrapy.spidermiddlewares.urllength] DEBUG: Ignoring link (url length > 2083):
When I copy and paste this long url in the browser I get the page, no problem.
Is there a way to force Scrapy not to ignore those long URLs?
Many thanks
In your settings.py file, assign a larger value to the URLLENGTH_LIMIT setting, e.g.
URLLENGTH_LIMIT=5000
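To see why the link was dropped, here is a minimal plain-Python sketch of the length check. It is a simplified model, not Scrapy's own middleware code. The helper name is_ignored and the example URL are made up for illustration, 2083 is the limit quoted in the log message, and a limit of 0 is assumed to mean "no limit".

```python
DEFAULT_LIMIT = 2083  # limit quoted in the log message


def is_ignored(url: str, limit: int = DEFAULT_LIMIT) -> bool:
    """Return True if a URL of this length would be dropped under `limit`."""
    # Assumption: a limit of 0 means "no limit".
    return bool(limit) and len(url) > limit


# Hypothetical long URL, built only to exceed the default limit.
long_url = "https://example.com/search?q=" + "a" * 3000

print(is_ignored(long_url))              # checked against the default limit
print(is_ignored(long_url, limit=5000))  # checked against the raised limit
```

If only one spider needs the higher limit, the same setting can also go into that spider's custom_settings dictionary, e.g. custom_settings = {"URLLENGTH_LIMIT": 5000}.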
I'm following this tutorial to build a crawler with the Scrapy library on this site. Considering the image below, I need to collect the text inside the span tag ("Mãe cria sozinha ...").
Using the Scrapy shell, I'm trying to collect it with response.css, but I get back an empty list:
response.css("a._b >span::text").extract()
I believe I am wrong in passing the tags, so what is the correct way to do this?
If you open the page source with Ctrl/Cmd + U, you will not find the class _b, so your selector returns an empty list. The bstn-hl-title class is not in the page source either, so all fields of the item will be empty as well. Scrapy only has access to the source that you see in the browser with Ctrl/Cmd + U.
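The same check can be done programmatically. The sketch below uses only the standard library to list every class name present in a raw HTML string. The names ClassCollector and classes_in_source, and the sample markup, are made up for illustration and are not taken from the real page.

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Collect every CSS class name that appears in a raw HTML string."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        for name, value in attrs:
            if name == "class" and value:
                self.classes.update(value.split())


def classes_in_source(html: str) -> set:
    collector = ClassCollector()
    collector.feed(html)
    return collector.classes


# Made-up markup standing in for the downloaded page source.
raw_html = '<div class="feed-post"><a class="feed-post-link" href="#">Title</a></div>'
found = classes_in_source(raw_html)

print("_b" in found)              # is the class from the selector present?
print("feed-post-link" in found)  # a class that does exist in the sample
```

In the Scrapy shell you could pass response.text to classes_in_source to see which classes the downloaded page really contains.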
I'm trying to crawl a specific page of a website (https://www.johnlewis.com/jaeger-wool-check-knit-shift-dress-navy-check/p3767291) to get used to Scrapy and its features. However, I can't get Scrapy to see the 'li' that contains the thumbnail images in the carousel. My parse function currently looks as follows:
def parse(self, response):
    for item in response.css('li.thumbnail-slide'):
        # The for loop works for li.size-small-item
        print("We have a match!")
No matter what, Scrapy isn't "seeing" the li. I've tried viewing the page in a Scrapy shell to check that Scrapy could see the images, and they do show up in the response there (so I'm assuming Scrapy can definitely see the list and the images in it). I've also tried alternative lists and got a different list to work (as per the comment in the code).
My only thought is that the carousel may be loaded with JavaScript / AJAX, but I can't be sure. I do know that the list class changes from "li.thumbnail-slide" to "li.thumbnail-slide thumbnail-slide-active" when it is the selected image; however, I've tried the following in my script to no avail:
li.thumbnail-slide
li.thumbnail-slide-active
li.thumbnail-slide.thumbnail-slide-active
li.thumbnail-slide thumbnail-slide-active
Nothing works.
Does anyone have any suggestions on what I may be doing wrong? Or suggest any further reading that may help?
Thanks in advance!
Your assumption is correct: the elements are there, but not exactly where you think they are.
To easily check whether an element is part of the response HTML rather than loaded by JavaScript, I normally recommend using a browser plugin to disable JavaScript.
If you want the images, they are still part of the HTML response; you can get them with:
response.css('li.product-images__item')
The main image appears separately:
response.css('meta[itemprop=image]::attr(content)')
Hope that helps you.
I got a weird error. I can't scrape the link https://www.example.com/2/
But I can scrape the link https://www.example.com/922/
P.S. I am not using the real link since my job does not allow it. Sorry.
When I try to debug it with the command scrapy view https://www.example.com/2/, it shows the correct HTML I am expecting. When I check the URL in the Chrome inspector, it turns out the server returns the correct HTML but with a 500 status instead of 200. I solved it by setting handle_httpstatus_list = [500] in my spider.
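By default Scrapy only passes successful (2xx) responses to spider callbacks, and handle_httpstatus_list names extra status codes to let through. The plain-Python sketch below is a simplified model of that rule, not Scrapy's actual middleware code; the function name reaches_callback is made up for illustration.

```python
def reaches_callback(status: int, handle_httpstatus_list=()) -> bool:
    """Simplified model: is a response with this status passed to the callback?"""
    # Successful responses are always passed on.
    if 200 <= status < 300:
        return True
    # Anything else is passed on only when explicitly allowed.
    return status in handle_httpstatus_list


print(reaches_callback(200))         # normal page
print(reaches_callback(500))         # filtered out before the callback
print(reaches_callback(500, [500]))  # allowed through by the spider attribute
```

Once 500 is allowed through, the callback receives the error response too, so it is worth checking response.status there before parsing.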
I'm trying to scrape the price of this product:
http://www.asos.com/au/fila/fila-vintage-plus-ringer-t-shirt-with-small-logo-in-green/prd/9065343?clr=green&SearchQuery=&cid=7616&gridcolumn=2&gridrow=1&gridsize=4&pge=1&pgesize=72&totalstyles=4699
with the following code, but it returns an empty list:
response.xpath('//*[@id="product-price"]/div/span[2]/text()').extract()
Any help is appreciated, Thanks.
Because the site is dynamic. This is what I got when I used the view(response) command in the Scrapy shell:
As you can see, the price info doesn't come out.
Solutions:
1. Splash
2. Selenium + PhantomJS
It might also help to check this answer: Empty List From Scrapy When Using Xpath to Extract Values
The price is added later by the browser, which renders the page using JavaScript code found in the HTML. If you disable JavaScript in your browser, you will notice that the page looks a bit different. Also, take a look at the page source, which is usually unaltered, to see that the tag you're looking for doesn't exist (yet).
Scrapy doesn't execute any JavaScript code. It receives the plain HTML, and that's what you have to work with.
If you want to extract data from pages as they look in the browser, I recommend using a headless browser like Splash (if you're already using Scrapy): https://github.com/scrapinghub/splash
You can programmatically tell it to download your page, render it and select the data points you're interested in.
The other way is to check for the request made to the Asos API which asks for the product data. In your case, for this product:
http://www.asos.com/api/product/catalogue/v2/stockprice?productIds=9065343&currency=AUD&keyStoreDataversion=0ggz8b-4.1&store=AU
I got this URL by looking at the XMLHttpRequest (XHR) requests listed in the Network tab of the Developer Tools (in Google Chrome).
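As a rough sketch, the API URL can be built from the product page URL with the standard library alone. The function name stockprice_url is hypothetical, the query parameters are assumed to be productIds, currency, keyStoreDataversion and store, and the keyStoreDataversion value is simply copied from the observed request, so it may stop working when the site changes.

```python
import re
from urllib.parse import urlencode

# Product page URL from the question, shortened to the parts used here.
PRODUCT_URL = (
    "http://www.asos.com/au/fila/"
    "fila-vintage-plus-ringer-t-shirt-with-small-logo-in-green/prd/9065343"
    "?clr=green"
)


def stockprice_url(product_url: str) -> str:
    """Build the stock/price API URL for the product behind `product_url`."""
    # The numeric product id follows "/prd/" in the page URL.
    match = re.search(r"/prd/(\d+)", product_url)
    if match is None:
        raise ValueError("no product id found in URL")
    query = urlencode({
        "productIds": match.group(1),
        "currency": "AUD",
        # Copied from the observed request; it may change over time.
        "keyStoreDataversion": "0ggz8b-4.1",
        "store": "AU",
    })
    return "http://www.asos.com/api/product/catalogue/v2/stockprice?" + query


print(stockprice_url(PRODUCT_URL))
```

The resulting URL can then be requested from the spider like any other page. The shape of the JSON it returns is not shown here, so inspect the response before writing the parsing code.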
You can try to find the JSON inside the HTML (using a regular expression) and parse it:
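import json

# Take the <script> that defines "function (view) {" and capture the JSON string passed to view('...')
json_string = response.xpath('//script[contains(., "function (view) {")]/text()').re_first(r'view\(\'([^\']+)')
data = json.loads(json_string)
price = data["price"]["current"]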
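To try the regular expression without a live response, here is a small offline sketch that applies the same pattern with the standard re module instead of Scrapy's re_first. The sample_html fragment is invented for illustration and only imitates the assumed shape of the real script tag.

```python
import json
import re

# Made-up page fragment; the real script content will differ.
sample_html = """
<script>
(function (view) {
    view('{"price": {"current": 30.0, "currency": "AUD"}}');
})(render);
</script>
"""

# Same pattern as above: capture everything between view(' and the next quote.
match = re.search(r"view\('([^']+)", sample_html)
if match is None:
    raise ValueError("no embedded JSON found")

data = json.loads(match.group(1))
price = data["price"]["current"]
print(price)
```

The pattern stops at the first single quote, so it only works while the embedded JSON contains no single quotes of its own.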
I am trying to scrape some info from offerup.com, but in the Scrapy shell nothing comes up.
I will type:
scrapy shell https://offerup.com/
It will go there but then if I simply try to get text of the whole webpage with:
response.xpath('//text()').extract()
it comes back with:
['Request unsuccessful. Incapsula incident ID: 623000250007296502-10946686267359632']
It comes back with nothing for any other info I try to get for the response such as the title.
Do you know why this happens? Any help is hugely appreciated.
Take care to read the response you get when visiting offerup.
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler
[s]   item       {}
[s]   request    <GET https://offerup.com>
[s]   response   <403 https://offerup.com>
You get a 403, a Forbidden error: the server refuses to serve the page, so no selector will find the content you expect.
If you try a different site, such as http://buffalo.craigslist.org, an OK response of 200 is given. Using the same command will show the desired page, and using response.xpath('//text()').extract() will print all of the text elements from root.
Some sites may have anti-scraping measures set up to prevent robots from hogging their resources. Offerup is apparently such a site.
To directly answer your question: your code is functional, but the target site prevents you from using it.
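A quick way to catch this early is to check the response before extracting anything. The helper below is a hypothetical heuristic, not part of Scrapy: it flags a response as blocked when the status is 403 or when the body contains the Incapsula marker seen in the question.

```python
def looks_blocked(status: int, body: str) -> bool:
    """Heuristic: True when the response looks like a block page, not content."""
    return status == 403 or "Incapsula incident ID" in body


# Body text quoted in the question.
blocked_body = (
    "Request unsuccessful. "
    "Incapsula incident ID: 623000250007296502-10946686267359632"
)

print(looks_blocked(403, blocked_body))
print(looks_blocked(200, "<html><title>listings</title></html>"))
```

In the Scrapy shell this could be called as looks_blocked(response.status, response.text) before running any selectors.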