Avoid bad requests due to relative urls - python

I am trying to crawl a website using Scrapy, and the urls of every page I want to scrap are all written using a relative path of this kind:
<!-- on page https://www.domain-name.com/en/somelist.html (no <base> in the <head>) -->
Link
Now, in my browser, these links work, and you get to urls like https://www.domain-name.com/en/item-to-scrap.html (despite the relative path going back up twice in hierarchy instead of once)
But my CrawlSpider does not manage to translate these urls into a "correct" one, and all I get is errors of that kind:
2013-10-13 09:30:41-0500 [domain-name.com] DEBUG: Retrying <GET https://www.domain-name.com/../en/item-to-scrap.html> (failed 1 times): 400 Bad Request
Is there a way to fix this, or am I missing something?
Here is my spider's code, fairly basic (on the basis of item urls matching "/en/item-*-scrap.html") :
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item, Field
class Product(Item):
name = Field()
class siteSpider(CrawlSpider):
name = "domain-name.com"
allowed_domains = ['www.domain-name.com']
start_urls = ["https://www.domain-name.com/en/"]
rules = (
Rule(SgmlLinkExtractor(allow=('\/en\/item\-[a-z0-9\-]+\-scrap\.html')), callback='parse_item', follow=True),
Rule(SgmlLinkExtractor(allow=('')), follow=True),
)
def parse_item(self, response):
x = HtmlXPathSelector(response)
product = Product()
product['name'] = ''
name = x.select('//title/text()').extract()
if type(name) is list:
for s in name:
if s != ' ' and s != '':
product['name'] = s
break
return product

Basically deep down, scrapy uses http://docs.python.org/2/library/urlparse.html#urlparse.urljoin for getting the next url by joining currenturl and url link scrapped. And if you join the urls provided you mentioned as example,
<!-- on page https://www.domain-name.com/en/somelist.html -->
Link
the returned url is same as url mentioned in error scrapy error. Try this in python shell.
import urlparse
urlparse.urljoin("https://www.domain-name.com/en/somelist.html","../../en/item-to-scrap.html")
The urljoin behaviour seems to be valid. See : https://www.rfc-editor.org/rfc/rfc1808.html#section-5.2
If it is possible, can you pass the site, which you are crawling ?
With this understanding, the solutions can be,
Manipulate the urls(remove those two dots and slash). generated in crawl spider. Basically override parse or _request_to_folow.
Source of crawl spider: https://github.com/scrapy/scrapy/blob/master/scrapy/contrib/spiders/crawl.py
Manipulate the url in the downloadmiddleware, this might be cleaner. You remove the ../ in the process_request of the downloadmiddleware.
Documentation for downloadmiddleware : http://scrapy.readthedocs.org/en/0.16/topics/downloader-middleware.html
Use base spider and also return the manipulated url requests you want to crawl further
Documentation for the basespider : http://scrapy.readthedocs.org/en/0.16/topics/spiders.html#basespider
Please let me know if you have any questions.

I finally found a solution thanks to this answer. I used process_links as follows:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item, Field
class Product(Item):
name = Field()
class siteSpider(CrawlSpider):
name = "domain-name.com"
allowed_domains = ['www.domain-name.com']
start_urls = ["https://www.domain-name.com/en/"]
rules = (
Rule(SgmlLinkExtractor(allow=('\/en\/item\-[a-z0-9\-]+\-scrap\.html')), process_links='process_links', callback='parse_item', follow=True),
Rule(SgmlLinkExtractor(allow=('')), process_links='process_links', follow=True),
)
def parse_item(self, response):
x = HtmlXPathSelector(response)
product = Product()
product['name'] = ''
name = x.select('//title/text()').extract()
if type(name) is list:
for s in name:
if s != ' ' and s != '':
product['name'] = s
break
return product
def process_links(self,links):
for i, w in enumerate(links):
w.url = w.url.replace("../", "")
links[i] = w
return links

Related

CrawlSpider can't parse multipage in Scrapy

The CrawlSpider I've created is not doing it's job properly. It parses the first page and then stops without going on to the next page. Something I'm doing wrong but can't detect. Hope somebody out there gives me a hint what should I do to rectify it.
"items.py" includes:
from scrapy.item import Item, Field
class CraigslistScraperItem(Item):
Name = Field()
Link = Field()
CrawlSpider names "craigs.py" which contains :
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.selector import Selector
from craigslist_scraper.items import CraigslistScraperItem
class CraigsPySpider(CrawlSpider):
name = "craigs"
allowed_domains = ["craigslist.org"]
start_urls = (
'http://sfbay.craigslist.org/search/npo/',
)
rules=(Rule(LinkExtractor(allow = ('sfbay\.craigslist\.org\/search\/npo/.*',
),restrict_xpaths = ('//a[#class="button next"]')),callback = 'parse',follow = True),)
def parse(self, response):
page=response.xpath('//p[#class="result-info"]')
items=[]
for title in page:
item=CraigslistScraperItem()
item["Name"]=title.xpath('.//a[#class="result-title hdrlnk"]/text()').extract()
item["Link"]=title.xpath('.//a[#class="result-title hdrlnk"]/#href').extract()
items.append(item)
return items
And finally the command I'm using to get CSV output is:
scrapy crawl craigs -o items.csv -t csv
By the way, I tried to use "parse_item" in the first place but found no response that is why I used "parse" method instead. Thanks in advance.
Don't name your callback method parse when you use scrapy.CrawlSpider.
From Scrapy documentation:
When writing crawl spider rules, avoid using parse as callback, since
the CrawlSpider uses the parse method itself to implement its logic.
So if you override the parse method, the crawl spider will no longer
work.
Also, you don't need to append an item to list since you already using Scrapy Items and can simply yield item.
This code should work:
# -*- coding: utf-8 -*-
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from craigslist_scraper.items import CraigslistScraperItem
class CraigsPySpider(CrawlSpider):
name = "craigs"
allowed_domains = ["craigslist.org"]
start_urls = (
'http://sfbay.craigslist.org/search/npo/',
)
rules = (
Rule(LinkExtractor(allow=('\/search\/npo\?s=.*',)), callback='parse_item', follow=True),
)
def parse_item(self, response):
page = response.xpath('//p[#class="result-info"]')
for title in page:
item = CraigslistScraperItem()
item["Name"] = title.xpath('.//a[#class="result-title hdrlnk"]/text()').extract_first()
item["Link"] = title.xpath('.//a[#class="result-title hdrlnk"]/#href').extract_first()
yield item
Finally for output in csv format run: scrapy crawl craigs -o items.csv

Scrapy is Visiting same Url despite dont_filter=False

Problem: Scrapy keeps visiting a single url and keeps scraping it recursively. I have checked the response.url to ensure that this is a single page that it keeps scraping and there is no query string involved that may render the same page for different url.
What I have done to reolve it :
Under Scrapy/spider.py I noticed that dont_filter was set to True and changed it False. but it didn't help
I have set the unique = True also in the code, but this didn't help either.
Additional information
The Page thats given as start_url has only 1 link to a page a.html. Scrapy keeps scraping a.html again and again.
Code
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from kt.items import DmozItem
class DmozSpider(CrawlSpider):
name = "dmoz"
allowed_domains = ["datacaredubai.com"]
start_urls = ["http://www.datacaredubai.com/aj/link.html"]
rules = (
Rule(SgmlLinkExtractor(allow=('/aj'),unique=('Yes')), callback='parse_item'),
)
def parse_item(self, response):
sel = Selector(response)
sites = sel.xpath('//*')
items = []
for site in sites:
item = DmozItem()
item['title']= site.xpath('/html/head/meta[3]').extract()
item['req_url']= response.url
items.append(item)
return items
Scrapy, by default, would append into the output file if it exists. What you see in the output.csv is the results of multiple spider runs. Remove the output.csv before running the spider again.

scrapy crawlspider output

I'm having an issue running through the CrawlSpider example in the Scrapy documentation. It seems to be crawling just fine but I'm having trouble getting it to output to a CSV file (or anything really).
So, my question is can I use this:
scrapy crawl dmoz -o items.csv
or do I have to create an Item Pipeline?
UPDATED, now with code!:
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from targets.item import TargetsItem
class MySpider(CrawlSpider):
name = 'abc'
allowed_domains = ['ididntuseexample.com']
start_urls = ['http://www.ididntuseexample.com']
rules = (
# Extract links matching 'category.php' (but not matching 'subsection.php')
# and follow links from them (since no callback means follow=True by default).
Rule(LinkExtractor(allow=('ididntuseexample.com', ))),
)
def parse_item(self, response):
self.log('Hi, this is an item page! %s' % response.url)
item = TargetsItem()
item['title'] = response.xpath('//h2/a/text()').extract() #this pulled down data in scrapy shell
item['link'] = response.xpath('//h2/a/#href').extract() #this pulled down data in scrapy shell
return item
Rules are the mechanism CrawlSpider uses for following links. Those links are defined with a LinkExtractor. This element basically indicates which links to extract from the crawled page (like the ones defined in the start_urls list) to be followed. Then you can pass a callback that will be called on each extracted link, or more precise, on the pages downloaded following those links.
Your rule must call the parse_item. So, replace:
Rule(LinkExtractor(allow=('ididntuseexample.com', ))),
with:
Rule(LinkExtractor(allow=('ididntuseexample.com',)), callback='parse_item),
This rule defines that you want to call parse_item on every link whose href is ididntuseexample.com. I suspect that what you want as link extractor is not the domain, but the links you want to follow/scrape.
Here you have a basic example that crawls Hacker News to retrieve the title and the first lines of the first comment for all the news in the main page.
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
class HackerNewsItem(scrapy.Item):
title = scrapy.Field()
comment = scrapy.Field()
class HackerNewsSpider(CrawlSpider):
name = 'hackernews'
allowed_domains = ['news.ycombinator.com']
start_urls = [
'https://news.ycombinator.com/'
]
rules = (
# Follow any item link and call parse_item.
Rule(LinkExtractor(allow=('item.*', )), callback='parse_item'),
)
def parse_item(self, response):
item = HackerNewsItem()
# Get the title
item['title'] = response.xpath('//*[contains(#class, "title")]/a/text()').extract()
# Get the first words of the first comment
item['comment'] = response.xpath('(//*[contains(#class, "comment")])[1]/font/text()').extract()
return item

Scrapy XPath all the links on the page

I am trying to collect all the URLs under a domain using Scrapy. I was trying to use the CrawlSpider to start from the homepage and crawl their web. For each page, I want to use Xpath to extract all the hrefs. And store the data in a format like key-value pair.
Key: the current Url
Value: all the links on this page.
class MySpider(CrawlSpider):
name = 'abc.com'
allowed_domains = ['abc.com']
start_urls = ['http://www.abc.com']
rules = (Rule(SgmlLinkExtractor()), )
def parse_item(self, response):
hxs = HtmlXPathSelector(response)
item = AbcItem()
item['key'] = response.url
item['value'] = hxs.select('//a/#href').extract()
return item
I define my AbcItem() looks like below:
from scrapy.item import Item, Field
class AbcItem(Item):
# key: url
# value: list of links existing in the key url
key = Field()
value = Field()
pass
And when I run my code like this:
nohup scrapy crawl abc.com -o output -t csv &
The robot seems like began to crawl and I can see the nohup.out file being populated by all the configurations log but there is no information from my output file.. which is what I am trying to collect, can anyone help me with this? what might be wrong with my robot?
You should have defined a callback for a rule. Here's an example for getting all links from twitter.com main page (follow=False):
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.item import Item, Field
class MyItem(Item):
url= Field()
class MySpider(CrawlSpider):
name = 'twitter.com'
allowed_domains = ['twitter.com']
start_urls = ['http://www.twitter.com']
rules = (Rule(SgmlLinkExtractor(), callback='parse_url', follow=False), )
def parse_url(self, response):
item = MyItem()
item['url'] = response.url
return item
Then, in the output file, I see:
http://status.twitter.com/
https://twitter.com/
http://support.twitter.com/forums/26810/entries/78525
http://support.twitter.com/articles/14226-how-to-find-your-twitter-short-code-or-long-code
...
Hope that helps.
if you dont set the callback function explicitly, scrapy will use the method parse to process crawled pages. so, you should add parse_item as the callback, or change it's name to parse.

Using Scrapy to crawl the urls in the webpage

I'm using scrapy to extract data from certain websites.The problem is that my spider can only crawl the webpage of initial start_urls , it can't crawl the urls in the webpage.
I copied the same spider exactly:
from scrapy.spider import BaseSpider
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
from scrapy.utils.response import get_base_url
from scrapy.utils.url import urljoin_rfc
from nextlink.items import NextlinkItem
class Nextlink_Spider(BaseSpider):
name = "Nextlink"
allowed_domains = ["Nextlink"]
start_urls = ["http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"]
def parse(self, response):
hxs = HtmlXPathSelector(response)
sites = hxs.select('//body/div[2]/div[3]/div/ul/li[2]/a/#href')
for site in sites:
relative_url = site.extract()
url = self._urljoin(response,relative_url)
yield Request(url, callback = self.parsetext)
def parsetext(self, response):
log = open("log.txt", "a")
log.write("test if the parsetext is called")
hxs = HtmlXPathSelector(response)
items = []
texts = hxs.select('//div').extract()
for text in texts:
item = NextlinkItem()
item['text'] = text
items.append(item)
log = open("log.txt", "a")
log.write(text)
return items
def _urljoin(self, response, url):
"""Helper to convert relative urls to absolute"""
return urljoin_rfc(response.url, url, response.encoding)
I use the log.txt to test if the parsetext is called.However, after I runned my spider, there is nothing in the log.txt.
See here:
http://readthedocs.org/docs/scrapy/en/latest/topics/spiders.html?highlight=allowed_domains#scrapy.spider.BaseSpider.allowed_domains
allowed_domains
An optional list of strings containing domains that this spider is allowed to crawl. Requests for URLs not belonging to the domain names specified in this list won’t be followed if OffsiteMiddleware is enabled.
So, as long as you didn't activate the OffsiteMiddleware in your settings, it doesn't matter and you can leave allowed_domains completely out.
Check the settings.py whether the OffsiteMiddleware is activated or not. It shouldn't be activated if you want to allow your Spider to crawl on any domain.
I think the problem is, that you didn't tell Scrapy to follow each crawled URL. For my own blog I've implemented a CrawlSpider that uses LinkExtractor-based Rules to extract all relevant links from my blog pages:
# -*- coding: utf-8 -*-
'''
* This program is free software: you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with this program. If not, see <http://www.gnu.org/licenses/>.
*
* #author Marcel Lange <info#ask-sheldon.com>
* #package ScrapyCrawler
'''
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
import Crawler.settings
from Crawler.items import PageCrawlerItem
class SheldonSpider(CrawlSpider):
name = Crawler.settings.CRAWLER_NAME
allowed_domains = Crawler.settings.CRAWLER_DOMAINS
start_urls = Crawler.settings.CRAWLER_START_URLS
rules = (
Rule(
LinkExtractor(
allow_domains=Crawler.settings.CRAWLER_DOMAINS,
allow=Crawler.settings.CRAWLER_ALLOW_REGEX,
deny=Crawler.settings.CRAWLER_DENY_REGEX,
restrict_css=Crawler.settings.CSS_SELECTORS,
canonicalize=True,
unique=True
),
follow=True,
callback='parse_item',
process_links='filter_links'
),
)
# Filter links with the nofollow attribute
def filter_links(self, links):
return_links = list()
if links:
for link in links:
if not link.nofollow:
return_links.append(link)
else:
self.logger.debug('Dropped link %s because nofollow attribute was set.' % link.url)
return return_links
def parse_item(self, response):
# self.logger.info('Parsed URL: %s with STATUS %s', response.url, response.status)
item = PageCrawlerItem()
item['status'] = response.status
item['title'] = response.xpath('//title/text()')[0].extract()
item['url'] = response.url
item['headers'] = response.headers
return item
On https://www.ask-sheldon.com/build-a-website-crawler-using-scrapy-framework/ I've described detailed how I've implemented a website crawler to warm up my Wordpress fullpage cache.
My guess would be this line:
allowed_domains = ["Nextlink"]
This isn't a domain like domain.tld, so it would reject any links.
If you take the example from the documentation: allowed_domains = ["dmoz.org"]

Categories