scrapy link extractor adds equal signs to the end of links

I'm trying to parse a forum with this rule:
rules = (Rule(LinkExtractor(allow=(r'page-\d+$')), callback='parse_item', follow=True),)
I've tried several approaches (with and without r at the beginning, with and without $ at the end of the pattern, etc.), but every time Scrapy produces links ending with an equal sign, even though there is no = in the links on the page or in the pattern.
Here is an example of the extracted links (I'm also using parse_start_url, so the start URL is here too, and yes, I've tried removing it; it doesn't help):
[<GET http://www.example.com/index.php?threads/topic.0000/>,
<GET http://www.example.com/index.php?threads%2Ftopic.0000%2Fpage-2=>,
<GET http://www.example.com/index.php?threads%2Ftopic.0000%2Fpage-3=>]
If I open these links in a browser or fetch them in scrapy shell, I get the wrong pages with nothing to parse, but deleting the equal signs solves the problem.
So why is this happening and how can I handle it?
EDIT 1 (additional info):
Scrapy 1.0.3;
Other CrawlSpiders are fine.
EDIT 2:
Spider's code:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.http import Request
class BmwclubSpider(CrawlSpider):
    name = "bmwclub"
    allowed_domains = ["www.bmwclub.ru"]
    start_urls = []
    start_url_objects = []
    rules = (Rule(LinkExtractor(allow=(r'page-\d+$')), callback='parse_item'),)
    def parse_start_url(self, response):
        return Request(url=response.url, callback=self.parse_item, meta={'site_url': response.url})
    def parse_item(self, response):
        return []
Command to collect links:
scrapy parse http://www.bmwclub.ru/index.php?threads/bamper-novyj-x6-torg-umesten-150000rub.1051898/ --noitems --spider bmwclub
Output of the command:
>>> STATUS DEPTH LEVEL 1 <<<
# Requests -----------------------------------------------------------------
[<GET http://www.bmwclub.ru/index.php?threads/bamper-novyj-x6-torg-umesten-150000rub.1051898/>,
<GET http://www.bmwclub.ru/index.php?threads%2Fbamper-novyj-x6-torg-umesten-150000rub.1051898%2Fpage-2=>,
<GET http://www.bmwclub.ru/index.php?threads%2Fbamper-novyj-x6-torg-umesten-150000rub.1051898%2Fpage-3=>]

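This happens because the link extractor canonicalizes every URL it extracts. Canonicalization parses the query string into key/value pairs and writes it back out. Everything after the ? in these links is a single key with no value, so it is percent-encoded and re-serialized as key=, which is where the trailing equal sign comes from.
You can disable canonicalization on the LinkExtractor like this:
rules = (
    Rule(LinkExtractor(allow=(r'page-\d+$',), canonicalize=False), callback='parse_item'),
)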
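To see where the trailing = comes from, here is a minimal sketch that imitates the query-string part of canonicalization using only the standard library. The function name naive_canonicalize is hypothetical, and this is not Scrapy's actual implementation; it only assumes that the query is parsed into pairs (keeping blank values), sorted, and re-encoded.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def naive_canonicalize(url):
    """Hypothetical helper: parse the query into pairs, sort, re-encode."""
    parts = urlsplit(url)
    # A key without "=value" is kept as (key, '') because of keep_blank_values.
    pairs = sorted(parse_qsl(parts.query, keep_blank_values=True))
    # urlencode always writes "key=value", so an empty value becomes "key=".
    query = urlencode(pairs)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ''))

url = "http://www.example.com/index.php?threads/topic.0000/page-2"
print(naive_canonicalize(url))
# http://www.example.com/index.php?threads%2Ftopic.0000%2Fpage-2=
```

The slashes become %2F and the = is appended, which matches the shape of the links in the question.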

Related

How to use Scrapy sitemap spider on sites with text sitemaps?

I tried using a generic Scrapy.spider to follow links, but it didn't work - so I hit upon the idea of simplifying the process by accessing the sitemap.txt instead, but that didn't work either!
I wrote a simple example (to help me understand the algorithm) of a spider to follow the sitemap specified on my site: https://legion-216909.appspot.com/sitemap.txt It is meant to navigate the URLs specified on the sitemap, print them out to screen and output the results into a links.txt file. The code:
import scrapy
from scrapy.spiders import SitemapSpider
class MySpider(SitemapSpider):
    name = "spyder_PAGE"
    sitemap_urls = ['https://legion-216909.appspot.com/sitemap.txt']
    def parse(self, response):
        print(response.url)
        return response.url
I ran the above spider with scrapy crawl spyder_PAGE > links.txt, but that produced an empty text file. I have gone through the Scrapy docs multiple times, but I am still missing something. Where am I going wrong?

SitemapSpider expects an XML sitemap, so it ignores the text file and logs this warning:
[scrapy.spiders.sitemap] WARNING: Ignoring invalid sitemap: <200 https://legion-216909.appspot.com/sitemap.txt>
Since your sitemap.txt file is just a simple list of URLs, it is easier to split it with a string method.
For example:
from scrapy import Spider, Request
class MySpider(Spider):
    name = "spyder_PAGE"
    start_urls = ['https://legion-216909.appspot.com/sitemap.txt']
    def parse(self, response):
        links = response.text.split('\n')
        for link in links:
            # yield a request to get this link
            print(link)
            # https://legion-216909.appspot.com/index.html
            # https://legion-216909.appspot.com/content.htm
            # https://legion-216909.appspot.com/Dataset/module_4_literature/Unit_1/.DS_Store
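One detail worth guarding against: splitting on '\n' leaves an empty string after a trailing newline and keeps any '\r' from Windows line endings. A small sketch of a more defensive splitter follows; parse_text_sitemap and the sample text are hypothetical and use only the standard library.

```python
def parse_text_sitemap(text):
    """Return the http(s) URLs listed one per line in a plain-text sitemap."""
    urls = []
    for line in text.splitlines():  # handles both \n and \r\n line endings
        url = line.strip()
        # Skip blank lines and anything that is not an absolute http(s) URL.
        if url.startswith(('http://', 'https://')):
            urls.append(url)
    return urls

sample = "https://example.com/index.html\r\n\nhttps://example.com/content.htm\n"
print(parse_text_sitemap(sample))
```

Inside the spider's parse method, the same helper could be called on response.text, yielding a Request for each returned URL.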

Alternatively, you can keep SitemapSpider and only override _parse_sitemap(self, response). Note that _parse_sitemap and _cbs are private names, so this may break between Scrapy versions:
from scrapy import Request
from scrapy.spiders import SitemapSpider
class MySpider(SitemapSpider):
    sitemap_urls = [...]
    sitemap_rules = [...]
    def _parse_sitemap(self, response):
        # yield a request for each url in the txt file that matches your filters
        urls = response.text.splitlines()
        it = self.sitemap_filter(urls)
        for loc in it:
            for r, c in self._cbs:
                if r.search(loc):
                    yield Request(loc, callback=c)
                    break

Scrapy is Visiting same Url despite dont_filter=False

Problem: Scrapy keeps visiting a single URL and scraping it over and over. I have checked response.url to confirm that it is the same page every time, and there is no query string involved that could render the same page under different URLs.
What I have done to resolve it:
In scrapy/spider.py I noticed that dont_filter was set to True and changed it to False, but that didn't help.
I have also set unique=True in the code, but that didn't help either.
Additional information
The page given as start_url has only one link, to a page a.html. Scrapy keeps scraping a.html again and again.
Code
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from kt.items import DmozItem
class DmozSpider(CrawlSpider):
    name = "dmoz"
    allowed_domains = ["datacaredubai.com"]
    start_urls = ["http://www.datacaredubai.com/aj/link.html"]
    rules = (
        Rule(SgmlLinkExtractor(allow=('/aj'),unique=('Yes')), callback='parse_item'),
    )
    def parse_item(self, response):
        sel = Selector(response)
        sites = sel.xpath('//*')
        items = []
        for site in sites:
            item = DmozItem()
            item['title'] = site.xpath('/html/head/meta[3]').extract()
            item['req_url'] = response.url
            items.append(item)
        return items

By default, Scrapy appends to the output file if it already exists, so what you see in output.csv is the combined result of several spider runs. Delete output.csv before running the spider again.

scrapy crawlspider output

I'm having an issue running through the CrawlSpider example in the Scrapy documentation. It seems to be crawling just fine but I'm having trouble getting it to output to a CSV file (or anything really).
So, my question is can I use this:
scrapy crawl dmoz -o items.csv
or do I have to create an Item Pipeline?
UPDATED, now with code!:
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from targets.item import TargetsItem
class MySpider(CrawlSpider):
    name = 'abc'
    allowed_domains = ['ididntuseexample.com']
    start_urls = ['http://www.ididntuseexample.com']
    rules = (
        # Extract links matching 'category.php' (but not matching 'subsection.php')
        # and follow links from them (since no callback means follow=True by default).
        Rule(LinkExtractor(allow=('ididntuseexample.com', ))),
    )
    def parse_item(self, response):
        self.log('Hi, this is an item page! %s' % response.url)
        item = TargetsItem()
        item['title'] = response.xpath('//h2/a/text()').extract() #this pulled down data in scrapy shell
        item['link'] = response.xpath('//h2/a/@href').extract() #this pulled down data in scrapy shell
        return item

Rules are the mechanism CrawlSpider uses for following links. Each rule wraps a LinkExtractor, which decides which links to extract from every crawled page (starting with the pages in the start_urls list) and follow. You can also pass a callback, which is called for each page downloaded by following those links.
Your rule must reference parse_item. So, replace:
Rule(LinkExtractor(allow=('ididntuseexample.com', ))),
with:
Rule(LinkExtractor(allow=('ididntuseexample.com',)), callback='parse_item'),
This rule calls parse_item for every link whose URL matches ididntuseexample.com. I suspect that what you want the link extractor to match is not the domain, but the specific links you want to follow and scrape.
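To illustrate why a domain-only pattern is too broad: as far as I know, each allow entry is treated as a regular expression that is searched against the absolute URL. The sketch below imitates that with the standard re module; filter_links and the sample URLs are hypothetical and not part of Scrapy.

```python
import re

def filter_links(urls, allow):
    """Keep the URLs matched by at least one allow pattern (re.search)."""
    patterns = [re.compile(p) for p in allow]
    return [u for u in urls if any(p.search(u) for p in patterns)]

urls = [
    "http://www.example.com/category.php?id=1",
    "http://www.example.com/about.html",
    "http://www.example.com/threads/topic.1/page-2",
]

# A specific pattern selects only the pages you care about.
print(filter_links(urls, [r"category\.php"]))
# An anchored pattern only matches URLs that end with the page number.
print(filter_links(urls, [r"page-\d+$"]))
# A pattern that is just the domain matches every URL on the site.
print(filter_links(urls, [r"example\.com"]))
```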
Here is a basic example that crawls Hacker News to retrieve the title and the first lines of the first comment for every story on the main page.
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
class HackerNewsItem(scrapy.Item):
    title = scrapy.Field()
    comment = scrapy.Field()
class HackerNewsSpider(CrawlSpider):
    name = 'hackernews'
    allowed_domains = ['news.ycombinator.com']
    start_urls = [
        'https://news.ycombinator.com/'
    ]
    rules = (
        # Follow any item link and call parse_item.
        Rule(LinkExtractor(allow=('item.*', )), callback='parse_item'),
    )
    def parse_item(self, response):
        item = HackerNewsItem()
        # Get the title
        item['title'] = response.xpath('//*[contains(@class, "title")]/a/text()').extract()
        # Get the first words of the first comment
        item['comment'] = response.xpath('(//*[contains(@class, "comment")])[1]/font/text()').extract()
        return item

Avoid bad requests due to relative urls

I am trying to crawl a website using Scrapy, and the URLs of all the pages I want to scrape are written as relative paths of this kind:
<!-- on page https://www.domain-name.com/en/somelist.html (no <base> in the <head>) -->
<a href="../../en/item-to-scrap.html">Link</a>
Now, in my browser, these links work, and you get to URLs like https://www.domain-name.com/en/item-to-scrap.html (even though the relative path goes up two levels in the hierarchy instead of one).
But my CrawlSpider does not manage to translate these URLs into a "correct" one, and all I get is errors like this:
2013-10-13 09:30:41-0500 [domain-name.com] DEBUG: Retrying <GET https://www.domain-name.com/../en/item-to-scrap.html> (failed 1 times): 400 Bad Request
Is there a way to fix this, or am I missing something?
Here is my spider's code, fairly basic (it assumes item URLs match "/en/item-*-scrap.html"):
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item, Field
class Product(Item):
    name = Field()
class siteSpider(CrawlSpider):
    name = "domain-name.com"
    allowed_domains = ['www.domain-name.com']
    start_urls = ["https://www.domain-name.com/en/"]
    rules = (
        Rule(SgmlLinkExtractor(allow=('\/en\/item\-[a-z0-9\-]+\-scrap\.html')), callback='parse_item', follow=True),
        Rule(SgmlLinkExtractor(allow=('')), follow=True),
    )
    def parse_item(self, response):
        x = HtmlXPathSelector(response)
        product = Product()
        product['name'] = ''
        name = x.select('//title/text()').extract()
        if type(name) is list:
            for s in name:
                if s != ' ' and s != '':
                    product['name'] = s
                    break
        return product

Deep down, Scrapy uses http://docs.python.org/2/library/urlparse.html#urlparse.urljoin to build the next URL by joining the current page's URL with the scraped link. If you join the URLs from your example,
<!-- on page https://www.domain-name.com/en/somelist.html -->
<a href="../../en/item-to-scrap.html">Link</a>
the result is the same URL that appears in the Scrapy error. Try this in a Python shell:
import urlparse
urlparse.urljoin("https://www.domain-name.com/en/somelist.html","../../en/item-to-scrap.html")
The urljoin behaviour appears to be valid; see https://www.rfc-editor.org/rfc/rfc1808.html#section-5.2
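The shell snippet above targets Python 2. A rough Python 3 equivalent, using only the standard library, is sketched here. In Python 3, urljoin lives in urllib.parse, and recent versions follow the newer RFC 3986 rules, which, as far as I know, discard a surplus .. instead of keeping it, so the printed result may differ from the URL in the error message.

```python
from urllib.parse import urljoin

base = "https://www.domain-name.com/en/somelist.html"
href = "../../en/item-to-scrap.html"

# Resolve the relative link against the page it was found on.
joined = urljoin(base, href)
print(joined)
```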
If possible, could you share the site you are crawling?
With this understanding, the possible solutions are:
1. Manipulate the URLs generated in the crawl spider (remove the extra ../). Basically, override parse or _requests_to_follow.
Source of crawl spider: https://github.com/scrapy/scrapy/blob/master/scrapy/contrib/spiders/crawl.py
2. Manipulate the URL in a downloader middleware, which might be cleaner. You remove the ../ in the middleware's process_request.
Documentation for downloader middleware: http://scrapy.readthedocs.org/en/0.16/topics/downloader-middleware.html
3. Use the base spider and return the manipulated URL requests you want to crawl further.
Documentation for the base spider: http://scrapy.readthedocs.org/en/0.16/topics/spiders.html#basespider
Please let me know if you have any questions.

I finally found a solution thanks to this answer. I used process_links as follows:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item, Field
class Product(Item):
    name = Field()
class siteSpider(CrawlSpider):
    name = "domain-name.com"
    allowed_domains = ['www.domain-name.com']
    start_urls = ["https://www.domain-name.com/en/"]
    rules = (
        Rule(SgmlLinkExtractor(allow=('\/en\/item\-[a-z0-9\-]+\-scrap\.html')), process_links='process_links', callback='parse_item', follow=True),
        Rule(SgmlLinkExtractor(allow=('')), process_links='process_links', follow=True),
    )
    def parse_item(self, response):
        x = HtmlXPathSelector(response)
        product = Product()
        product['name'] = ''
        name = x.select('//title/text()').extract()
        if type(name) is list:
            for s in name:
                if s != ' ' and s != '':
                    product['name'] = s
                    break
        return product
    def process_links(self,links):
        for i, w in enumerate(links):
            w.url = w.url.replace("../", "")
            links[i] = w
        return links
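Note that replace("../", "") removes every ../ in the URL, including any in the middle of a path where it legitimately moves up a directory. A slightly more conservative sketch, using only the standard library, resolves the dot segments in the path and silently drops only those that would climb above the root. Both remove_dot_segments and normalize_url are hypothetical helpers, not part of Scrapy, and they assume an absolute URL whose path starts with a slash.

```python
from urllib.parse import urlsplit, urlunsplit

def remove_dot_segments(path):
    """Resolve '.' and '..' in an absolute URL path; '..' above the root is dropped."""
    segments = path.split('/')
    output = []
    for segment in segments:
        if segment == '..':
            # Go up one level, but never pop the leading '' that marks the root.
            if len(output) > 1:
                output.pop()
        elif segment != '.':
            output.append(segment)
    # A path ending in '.' or '..' names a directory, so keep a trailing slash.
    if segments[-1] in ('.', '..'):
        output.append('')
    return '/'.join(output)

def normalize_url(url):
    """Return the URL with its path cleaned; scheme, host and query are untouched."""
    parts = urlsplit(url)
    return urlunsplit(parts._replace(path=remove_dot_segments(parts.path)))

print(normalize_url("https://www.domain-name.com/../en/item-to-scrap.html"))
# https://www.domain-name.com/en/item-to-scrap.html
```

Inside process_links, the same idea would be w.url = normalize_url(w.url) in place of the replace call.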

Scrapy SgmlLinkExtractor

I am trying to get a scrapy spider working, but there seems to be a problem with SgmlLinkExtractor.
Here is the signature:
SgmlLinkExtractor(allow=(), deny=(), allow_domains=(), deny_domains=(), restrict_xpaths=(), tags=('a', 'area'), attrs=('href'), canonicalize=True, unique=True, process_value=None)
I am using the allow option; here is my code:
start_urls = ['http://bigbangtrans.wordpress.com']
rules = [Rule(SgmlLinkExtractor(allow=[r'series-\d{1}-episode-\d{2}.']), callback='parse_item')]
A sample url looks like http://bigbangtrans.wordpress.com/series-1-episode-11-the-pancake-batter-anomaly/
The output of scrapy crawl tbbt contains:
[tbbt] DEBUG: Crawled (200) <GET http://bigbangtrans.wordpress.com/series-3-episode-17-the-precious-fragmentation/> (referer: http://bigbangtrans.wordpress.com)
The parse_item callback, however, is not called, and I cannot figure out why.
This is the whole spider code:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
class TbbtSpider(CrawlSpider):
    #print '\n TbbtSpider \n'
    name = 'tbbt'
    start_urls = ['http://bigbangtrans.wordpress.com'] # urls from which the spider will start crawling
    rules = [Rule(SgmlLinkExtractor(allow=[r'series-\d{1}-episode-\d{2}.']), callback='parse_item')]
    def parse_item(self, response):
        print '\n parse_blogpost \n'
        hxs = HtmlXPathSelector(response)
        item = TbbtItem()
        # Extract title
        item['title'] = hxs.select('//div[@id="post-5"]/div/p/span/text()').extract() # XPath selector for title
        return item

Okay, so the reason this code is not working is that the syntax of your rule is incorrect. I fixed the syntax without making any other changes and was able to hit the parse_item callback.
rules = (
    Rule(SgmlLinkExtractor(allow=(r'series-\d{1}-episode-\d{2}.',)), callback='parse_item'),
)
However, the titles were all blank, which suggests that the hxs.select statement in parse_item is incorrect. The following XPath may be more suitable (I made an educated guess about the required title, but I could be barking up the wrong tree entirely):
item['title'] = hxs.select('//h2[@class="title"]/text()').extract()
