Python Scrapy 301 redirects - python

I am having trouble printing the redirected URLs (the new URLs after a 301 redirect) when scraping a given website. I only want to print them, not scrape them. My current piece of code is:
import scrapy
import os
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'rust'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com']

    rules = (
        # Extract all links and parse them with the spider's method parse_item
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        #if response.status == 301:
        print response.url
However, this does not print the redirected URLs. Any help would be appreciated.
Thank you.

To have your callbacks receive responses with a non-200 status code, you need to do one of these things:
Project-wide
You can set HTTPERROR_ALLOWED_CODES = [301, 302, ...] in your settings.py file. Or, if you want to allow all codes, you can set HTTPERROR_ALLOW_ALL = True instead.
Spider-wide
Add a handle_httpstatus_list attribute to your spider. In your case, something like:
class MySpider(scrapy.Spider):
    handle_httpstatus_list = [301]
Request-wide
You can set the meta key handle_httpstatus_list = [301, 302, ...] on individual requests, or handle_httpstatus_all = True to allow every status code:
scrapy.Request('http://url.com', meta={'handle_httpstatus_list': [301]})
To learn more see HttpErrorMiddleware
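For example, a minimal sketch of the spider-wide approach applied to the spider from the question (assuming you only want to log the redirects; note that for a 301 the response is the redirect itself, so the target of the redirect is in the Location header):

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'rust'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com']

    # Let 301 responses reach the callback instead of being swallowed
    # by the redirect/httperror middlewares.
    handle_httpstatus_list = [301]

    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        if response.status == 301:
            print(response.url)                      # the URL that returned the 301
            print(response.headers.get('Location'))  # the URL it redirects to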

Related

Python: why is my Scrapy CrawlSpider not printing or doing anything?

I'm new to Scrapy and can't get it to do anything. Eventually I want to scrape all the HTML comments from a website by following internal links.
For now I'm just trying to scrape the internal links and add them to a list.
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class comment_spider(CrawlSpider):
    name = 'test'
    allowed_domains = ['https://www.andnowuknow.com/']
    start_urls = ["https://www.andnowuknow.com/"]
    rules = (Rule(LinkExtractor(), callback='parse_start_url', follow=True),)

    def parse_start_url(self, response):
        return self.parse_item(response)

    def parse_item(self, response):
        urls = []
        for link in LinkExtractor(allow=(),).extract_links(response):
            urls.append(link)
        print(urls)
I'm just trying to get it to print something at this point; nothing I've tried so far works.
It finishes with an exit code of 0, but won't print anything, so I can't tell what's happening.
What am I missing?
Your log messages should give us some hints, but I see that your allowed_domains contains a URL instead of a domain. You should set it like this:
allowed_domains = ["andnowuknow.com"]
(See allowed_domains in the official documentation.)
Hope it helps.
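As a minimal sketch, with just that line changed and the links yielded as well as printed (so they also show up in a feed export if you pass -o to scrapy crawl), the spider could look like this:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class comment_spider(CrawlSpider):
    name = 'test'
    allowed_domains = ['andnowuknow.com']        # a domain, not a URL
    start_urls = ['https://www.andnowuknow.com/']
    rules = (Rule(LinkExtractor(), callback='parse_start_url', follow=True),)

    def parse_start_url(self, response):
        for link in LinkExtractor().extract_links(response):
            print(link.url)
            yield {'url': link.url}   # yielded items end up in the -o output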

scrapy rules do not call parsing method

I am new to Scrapy and am trying to crawl a domain, following all internal links and scraping the title of URLs with the pattern /example/.*
The crawling works, but the scraping of the titles does not, since the output file is empty. Most likely I got the rules wrong. Is this the right syntax for the rules in order to achieve what I am looking for?
items.py
import scrapy

class BidItem(scrapy.Item):
    url = scrapy.Field()
    title = scrapy.Field()
spider.py
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from bid.items import BidItem

class GetbidSpider(CrawlSpider):
    name = 'getbid'
    allowed_domains = ['domain.de']
    start_urls = ['https://www.domain.de/']

    rules = (
        Rule(
            LinkExtractor(),
            follow=True
        ),
        Rule(
            LinkExtractor(allow=['example/.*']),
            callback='parse_item'
        ),
    )

    def parse_item(self, response):
        href = BidItem()
        href['url'] = response.url
        href['title'] = response.css("h1::text").extract()
        return href
crawl: scrapy crawl getbid -o 012916.csv
From the CrawlSpider docs:
If multiple rules match the same link, the first one will be used,
according to the order they’re defined in this attribute.
Since your first rule will match all links, it will always be used and all other rules will be ignored.
Fixing the problem is as simple as switching the order of the rules.
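A minimal sketch of the reordered rules (the rest of the spider stays exactly as in the question):

rules = (
    # The specific rule comes first, so matching links get parse_item ...
    Rule(
        LinkExtractor(allow=['example/.*']),
        callback='parse_item'
    ),
    # ... and the catch-all rule only handles the remaining links.
    Rule(
        LinkExtractor(),
        follow=True
    ),
)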

Scraping a domain for links recursively using Scrapy

Here is the code I'm using for scraping all the URLs of a domain:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor

class UrlsSpider(scrapy.Spider):
    name = 'urlsspider'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    rules = (Rule(LxmlLinkExtractor(allow=(), unique=True), callback='parse', follow=True))

    def parse(self, response):
        for link in LxmlLinkExtractor(allow_domains=self.allowed_domains, unique=True).extract_links(response):
            print link.url
            yield scrapy.Request(link.url, callback=self.parse)
As you can see, I've used unique=True, but it's still printing duplicate URLs in the terminal, whereas I want only the unique URLs.
Any help on this matter would be appreciated.
Since the code crawls pages recursively, you will see the same URLs printed again when they appear on other pages. unique=True only removes duplicates among the links extracted from a single response, and you create a new LxmlLinkExtractor() instance for every response, so nothing deduplicates links across pages.
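If you also want the printed output itself to be unique, one sketch is to keep a set of already seen URLs on the spider (the scheduler's duplicate filter already drops repeated requests by default, so only the printing needs the extra bookkeeping):

import scrapy
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor

class UrlsSpider(scrapy.Spider):
    name = 'urlsspider'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    def __init__(self, *args, **kwargs):
        super(UrlsSpider, self).__init__(*args, **kwargs)
        self.seen = set()  # URLs already printed

    def parse(self, response):
        for link in LxmlLinkExtractor(allow_domains=self.allowed_domains).extract_links(response):
            if link.url not in self.seen:
                self.seen.add(link.url)
                print(link.url)
                yield scrapy.Request(link.url, callback=self.parse)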

Avoid bad requests due to relative urls

I am trying to crawl a website using Scrapy, and the URL of every page I want to scrape is written as a relative path of this kind:
<!-- on page https://www.domain-name.com/en/somelist.html (no <base> in the <head>) -->
<a href="../../en/item-to-scrap.html">Link</a>
Now, in my browser, these links work, and you get to URLs like https://www.domain-name.com/en/item-to-scrap.html (despite the relative path going back up twice in the hierarchy instead of once).
But my CrawlSpider does not manage to translate these URLs into "correct" ones, and all I get are errors of this kind:
2013-10-13 09:30:41-0500 [domain-name.com] DEBUG: Retrying <GET https://www.domain-name.com/../en/item-to-scrap.html> (failed 1 times): 400 Bad Request
Is there a way to fix this, or am I missing something?
Here is my spider's code, fairly basic (based on item URLs matching "/en/item-*-scrap.html"):
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item, Field

class Product(Item):
    name = Field()

class siteSpider(CrawlSpider):
    name = "domain-name.com"
    allowed_domains = ['www.domain-name.com']
    start_urls = ["https://www.domain-name.com/en/"]

    rules = (
        Rule(SgmlLinkExtractor(allow=('\/en\/item\-[a-z0-9\-]+\-scrap\.html')), callback='parse_item', follow=True),
        Rule(SgmlLinkExtractor(allow=('')), follow=True),
    )

    def parse_item(self, response):
        x = HtmlXPathSelector(response)
        product = Product()
        product['name'] = ''
        name = x.select('//title/text()').extract()
        if type(name) is list:
            for s in name:
                if s != ' ' and s != '':
                    product['name'] = s
                    break
        return product
Basically, deep down, Scrapy uses urlparse.urljoin (http://docs.python.org/2/library/urlparse.html#urlparse.urljoin) to build the next URL by joining the current URL and the scraped link. And if you join the URLs you mentioned in the example,
<!-- on page https://www.domain-name.com/en/somelist.html -->
<a href="../../en/item-to-scrap.html">Link</a>
the returned URL is the same as the URL in the Scrapy error. Try this in a Python shell:
import urlparse
urlparse.urljoin("https://www.domain-name.com/en/somelist.html", "../../en/item-to-scrap.html")
# returns 'https://www.domain-name.com/../en/item-to-scrap.html'
The urljoin behaviour seems to be valid. See: https://www.rfc-editor.org/rfc/rfc1808.html#section-5.2
If possible, can you share the site you are crawling?
With this understanding, the possible solutions are:
Manipulate the URLs generated in the crawl spider (remove the two extra dots and slash). Basically, override parse or _requests_to_follow.
Source of the crawl spider: https://github.com/scrapy/scrapy/blob/master/scrapy/contrib/spiders/crawl.py
Manipulate the URL in a downloader middleware; this might be cleaner. You remove the ../ in the process_request method of the downloader middleware (a sketch is shown below).
Documentation for downloader middleware: http://scrapy.readthedocs.org/en/0.16/topics/downloader-middleware.html
Use a base spider and return the manipulated URL requests you want to crawl further.
Documentation for the base spider: http://scrapy.readthedocs.org/en/0.16/topics/spiders.html#basespider
Please let me know if you have any questions.
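For the downloader middleware option, a minimal sketch (the class name, the module path and the naive ../ stripping are only illustrations, not a drop-in solution):

# middlewares.py (hypothetical)
class FixRelativeUrlMiddleware(object):
    """Rewrite requests whose URL still contains unresolved '../' segments."""

    def process_request(self, request, spider):
        if '../' in request.url:
            # Naively drop the '../' segments, like the process_links
            # solution below; adapt this to the real site structure.
            return request.replace(url=request.url.replace('../', ''))
        # Returning None lets the request continue unchanged.

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.FixRelativeUrlMiddleware': 543,
}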
I finally found a solution thanks to this answer. I used process_links as follows:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item, Field

class Product(Item):
    name = Field()

class siteSpider(CrawlSpider):
    name = "domain-name.com"
    allowed_domains = ['www.domain-name.com']
    start_urls = ["https://www.domain-name.com/en/"]

    rules = (
        Rule(SgmlLinkExtractor(allow=('\/en\/item\-[a-z0-9\-]+\-scrap\.html')), process_links='process_links', callback='parse_item', follow=True),
        Rule(SgmlLinkExtractor(allow=('')), process_links='process_links', follow=True),
    )

    def parse_item(self, response):
        x = HtmlXPathSelector(response)
        product = Product()
        product['name'] = ''
        name = x.select('//title/text()').extract()
        if type(name) is list:
            for s in name:
                if s != ' ' and s != '':
                    product['name'] = s
                    break
        return product

    def process_links(self, links):
        for i, w in enumerate(links):
            w.url = w.url.replace("../", "")
            links[i] = w
        return links

crawling multiple webpages from a website

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//ul/li')
        for site in sites:
            title = site.select('a/text()').extract()
            link = site.select('a/@href').extract()
            desc = site.select('text()').extract()
            print title, link, desc
This is my code. I want to scrape plenty of URLs using a loop, so how am I supposed to do this? I did put multiple URLs in there, but I didn't get output from all of them. Some URLs stop responding. So how can I reliably get the data using this code?
Your code looks OK, but are you sure that start_urls shouldn't start with http://?
start_urls = [
    "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
    "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
]
UPD
start_urls is the list of URLs Scrapy starts with. Usually it has one or two links, rarely more.
These pages must have an identical HTML structure, because the Scrapy spider processes them the same way.
See, if I put 4-5 URLs in start_urls, it gives output OK for the first 2-3 URLs.
I don't believe this, because Scrapy doesn't care how many links are in the start_urls list.
But it stops responding, and also tell me how I can implement a GUI for this?
Scrapy has a debug shell to test your code.
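For example, you can fetch one of the start URLs and try the selectors from parse() interactively (a rough sketch; inside the shell, response is the downloaded page):

$ scrapy shell "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"
>>> from scrapy.selector import HtmlXPathSelector
>>> hxs = HtmlXPathSelector(response)
>>> hxs.select('//ul/li/a/text()').extract()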
You just posted the code from the tutorial. What you should do is actually read the whole documentation, especially the basic concepts part. What you basically want is the CrawlSpider, where you can define rules that the spider will follow and process with your given code.
To quote the docs with the example:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item

class MySpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    rules = (
        # Extract links matching 'category.php' (but not matching 'subsection.php')
        # and follow links from them (since no callback means follow=True by default).
        Rule(SgmlLinkExtractor(allow=('category\.php', ), deny=('subsection\.php', ))),

        # Extract links matching 'item.php' and parse them with the spider's method parse_item
        Rule(SgmlLinkExtractor(allow=('item\.php', )), callback='parse_item'),
    )

    def parse_item(self, response):
        self.log('Hi, this is an item page! %s' % response.url)

        hxs = HtmlXPathSelector(response)
        item = Item()
        item['id'] = hxs.select('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
        item['name'] = hxs.select('//td[@id="item_name"]/text()').extract()
        item['description'] = hxs.select('//td[@id="item_description"]/text()').extract()
        return item
