I am writing a script that will inventory all of a site's URLs.
I am using CrawlSpider with a rules handler to process scraped URLs. Specifically, "filter_links" checks a table for an existing URL and, if it is not found, writes a new entry.
rules = [
    Rule(SgmlLinkExtractor(unique=True), follow=True, callback="parse_item", process_links="filter_links")
]
I sense this is just a poor man's way of reinventing the wheel, and that a better method surely exists.
Is there a better way to dump the list of URLs Scrapy found than trying to parse it from the response? Thanks
I think you are using process_links the way it is intended to be used, and I see no drawbacks to that. But if you want to get rid of the additional filter_links method, you can move the URL table lookup and update logic into your parse_item method. The current URL is available in parse_item as response.url.
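A minimal sketch of such a filter_links method, assuming the "table" can be stood in for by an in-memory set. The UrlInventory class and the Link namedtuple are hypothetical stand-ins (Scrapy's own link objects also expose a url attribute), and returning only the unseen links is an assumption about the intended behaviour:

```python
from collections import namedtuple

# Stand-in for Scrapy's link objects, which also expose a ``url`` attribute.
Link = namedtuple("Link", ["url"])


class UrlInventory:
    """Hypothetical holder for the URL table; a real one might wrap a database."""

    def __init__(self):
        self.seen = set()  # assumption: an in-memory set replaces the table

    def filter_links(self, links):
        """Record URLs not seen before and return only those links."""
        new_links = []
        for link in links:
            if link.url not in self.seen:
                self.seen.add(link.url)
                new_links.append(link)
        return new_links


inventory = UrlInventory()
batch = [Link("http://example.org/a"), Link("http://example.org/b"),
         Link("http://example.org/a")]
for link in inventory.filter_links(batch):
    print(link.url)
```

In the spider, the same method body would sit on the CrawlSpider subclass so that process_links="filter_links" can find it by name.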
Related
In my Scrapy spider, when parsing the page that contains the item, I need to retrieve additional information from another site.
I can implement this by using requests and updating the item with the response inside the spider's parse method.
I'm looking for a better way to implement it, without using an additional library.
I have created a spider which is supposed to crawl multiple websites, and I need to define different rules for each URL in the start_urls list.
start_urls = [
    "http://URL1.com/foo",
    "http://URL2.com/bar",
]
rules = [
    Rule(LinkExtractor(restrict_xpaths=("//" + xpathString + "/a")), callback="parse_object", follow=True)
]
The only thing that needs to change in the rule is the XPath string for restrict_xpaths. I've already come up with a function that can get the XPath I want dynamically from any website.
I figured I can just get the current URL that the spider will be scraping and pass it through the function and then pass the resulting xpath to the rule.
Unfortunately, I've been searching and it seems that this isn't possible since scrapy utilizes a scheduler and compiles all the start_urls and rules right from the start. Is there any workaround to achieve what I'm trying to do?
I assume you are using CrawlSpider.
By default, CrawlSpider rules are applied for all pages (whatever the domain) your spider is crawling.
If you are crawling multiple domains in your start URLs and want different rules for each domain, you won't be able to tell Scrapy which rule(s) to apply to which domain. (I mean, it's not available out of the box.)
You can run your spider with one start URL at a time (with domain-specific rules, built dynamically at init time) and run multiple spiders in parallel.
Another option is to subclass CrawlSpider and customize it for your needs:
build the rules as a dict, using domains as keys and the list of rules to apply for each domain as values (see the _compile_rules method),
and apply different rules depending on the domain of the response (see _requests_to_follow).
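The per-domain lookup behind the subclassing approach can be sketched with the standard library alone. The XPATHS_BY_DOMAIN table and the xpath_for helper are hypothetical names, and the XPath strings are placeholders rather than values from the question:

```python
from urllib.parse import urlparse

# Hypothetical table: one link-restriction XPath per crawled domain.
XPATHS_BY_DOMAIN = {
    "url1.com": "//div[@id='content']//a",
    "url2.com": "//ul[@class='results']//a",
}


def xpath_for(url, default=None):
    """Return the XPath registered for the domain of ``url``, or ``default``."""
    # .hostname is lower-cased with any port stripped; it is None for relative URLs.
    host = urlparse(url).hostname or ""
    if host.startswith("www."):
        host = host[4:]
    return XPATHS_BY_DOMAIN.get(host, default)


print(xpath_for("http://URL1.com/foo"))
print(xpath_for("http://www.url2.com/bar?page=2"))
print(xpath_for("http://other.example/"))
```

Inside a CrawlSpider subclass, the same lookup would be keyed on response.url to decide which rules to apply to a given response.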
You can just override the parse method (in a plain Spider, that is; CrawlSpider uses parse internally to apply its rules). This method receives a Scrapy response object with the full HTML content, and you can run XPath queries on it. You can also retrieve the URL from the response object and, depending on the URL, run a custom XPath.
Please check out the docs here: http://doc.scrapy.org/en/latest/topics/request-response.html
I am trying to build a CrawlSpider with Scrapy with the following features.
Basically, my start URL contains various lists of URLs which are divided up into sections. I want to scrape just the URLs from a specific section and then crawl them.
In order to do this, I defined my link extractor using restrict_xpaths, in order to isolate the links I want to crawl from the rest.
However, because of the restrict_xpaths, when the spider tries to crawl a link which is not the start url, it stops, since it does not find any links.
So I tried to add another rule, which is supposed to ensure that the links outside the start URL get crawled, by applying deny_domains to the start URL. However, this solution is not working.
Can anyone suggest a possible strategy?
Right now my rules are :
rules = {Rule(LinkExtractor(restrict_xpaths=(".//*[@id='mw-content-text']/ul[19]"), ), callback='parse_items', follow=True),
         Rule(LinkExtractor(deny_domains='...start url...'), callback='parse_items', follow=True),}
You're defining a Set by using {} around the pair of rules. Try making it a tuple with ():
rules = (Rule(LinkExtractor(restrict_xpaths=(".//*[@id='mw-content-text']/ul[19]"), ), callback='parse_items', follow=True),
         Rule(LinkExtractor(deny_domains='...start url...'), callback='parse_items', follow=True),)
Beyond that, you might want to pass 'unique=True' to the LinkExtractors (it is a link extractor argument, not a Rule argument) to make sure that any links back to the "start url" are not followed. See BaseSgmlLinkExtractor
Also, the use of 'parse_items' as a callback for both LinkExtractors is a bit of a smell. Based on your explanation, I can't see that the first extractor would need a callback; it's just extracting links that should be added to the queue for the scraper to go fetch, right?
The real scraping for data that you want to use/persist generally happens in the 'parse_items' callback (at least that's the convention used in the docs).
How can I prevent Scrapy from crawling a website endlessly when only the URL changes (for example a session id or something like that) while the content behind the URLs stays the same?
Is there a way to detect that?
I've read Avoid Duplicate URL Crawling, Scrapy - how to identify already scraped urls, and how to filter duplicate requests based on url in scrapy, but sadly they are not enough to solve my problem.
There are a couple of ways to do this, both related to the questions you've linked to.
With one, you decide what URL parameters make a page unique, and tell your custom duplicate request filter to ignore the other portions of the URL. This is similar to the answer at https://stackoverflow.com/a/13605919 .
Example:
url: http://www.example.org/path/getArticle.do?art=42&sessionId=99&referrerArticle=88
important bits: protocol, host, path, query parameter "art"
implementation:
from urllib.parse import urlparse, urlunparse  # Python 2: the urlparse module

def url_fingerprint(self, url):
    pr = urlparse(url)
    # Keep only the "art" parameter. Build a new list instead of removing
    # items from the list being iterated over, which would skip elements.
    queryparts = [prt for prt in pr.query.split('&') if prt.split('=')[0] == 'art']
    # ParseResult is a namedtuple, so _replace swaps in the reduced query string.
    return urlunparse(pr._replace(query='&'.join(queryparts)))
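To see the effect of such a fingerprint, here is a standalone variant using only the Python 3 standard library. SIGNIFICANT_PARAMS, the seen set and is_duplicate are hypothetical names for illustration; in a custom duplicate filter, the fingerprint would take the place of the full URL when deciding whether a request has been seen:

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse

# Assumption: only the "art" query parameter distinguishes one page from another.
SIGNIFICANT_PARAMS = {"art"}


def url_fingerprint(url):
    """Rebuild ``url`` keeping only the query parameters that matter."""
    pr = urlparse(url)
    kept = [(key, value)
            for key, value in parse_qsl(pr.query, keep_blank_values=True)
            if key in SIGNIFICANT_PARAMS]
    return urlunparse(pr._replace(query=urlencode(kept)))


seen = set()


def is_duplicate(url):
    """Return True if a URL with the same fingerprint was already recorded."""
    fingerprint = url_fingerprint(url)
    if fingerprint in seen:
        return True
    seen.add(fingerprint)
    return False


first = "http://www.example.org/path/getArticle.do?art=42&sessionId=99&referrerArticle=88"
second = "http://www.example.org/path/getArticle.do?art=42&sessionId=100"
print(url_fingerprint(first))
print(is_duplicate(first), is_duplicate(second))
```

Both URLs reduce to the same fingerprint, so the second one is reported as a duplicate even though its session id differs.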
The other way is to determine what bit of information on the page makes it unique, and use either the IgnoreVisitedItems middleware (as per https://stackoverflow.com/a/4201553) or a dictionary/set in your spider's code. If you go the dictionary/set route, you'll have your spider extract that bit of information from the page and check the dictionary/set to see if you've seen that page before; if so, you can stop parsing and return.
What bit of information you'll need to extract depends on your target site. It could be the title of the article, an OpenGraph og:url meta tag, etc.
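A minimal sketch of the set route, assuming the og:url meta tag is the distinguishing bit of information. OgUrlParser, seen_keys and is_new_page are hypothetical names, and the standard library HTML parser stands in for the selector you would normally run on the Scrapy response:

```python
from html.parser import HTMLParser


class OgUrlParser(HTMLParser):
    """Collects the content of the first <meta property="og:url"> tag."""

    def __init__(self):
        super().__init__()
        self.og_url = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if (tag == "meta" and attributes.get("property") == "og:url"
                and self.og_url is None):
            self.og_url = attributes.get("content")


seen_keys = set()


def is_new_page(html):
    """Return False if a page with the same og:url was already processed."""
    parser = OgUrlParser()
    parser.feed(html)
    key = parser.og_url
    if key is None:
        return True  # assumption: pages without the tag are always treated as new
    if key in seen_keys:
        return False
    seen_keys.add(key)
    return True


page = ('<html><head><meta property="og:url" '
        'content="http://www.example.org/article/42"></head></html>')
print(is_new_page(page), is_new_page(page))
```

In the spider callback, the equivalent check would come first, returning early when the page has been seen before.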
My question is about CrawlSpider.
I understand that the link extractor rules are a static (class-level) variable.
Can I change the rules at runtime, say, like
@classmethod
def set_rules(cls, rules):
    cls.rules = rules
by calling
self.set_rules(rules)
Is this acceptable practice for CrawlSpider? If not, please suggest the appropriate method.
My use case:
I'm using Scrapy to crawl certain categories A, B, C, ..., Z of a particular website. Each category has 1000 links spread over 10 pages,
and when Scrapy hits a link in some category that is "too old", I'd like the crawler to stop following/crawling the remainder of the 10 pages for that category ONLY, hence my requirement for dynamic rule changes.
Please point me in the right direction.
Thanks!
The rules in a spider aren't meant to be changed dynamically. They are compiled at instantiation of the CrawlSpider. You could always change your spider.rules and re-run spider._compile_rules(), but I advise against it.
The rules create a set of instructions for the Crawler in what to queue up to crawl (ie. it queues Requests). These requests aren't revisited and re-evaluated before they are dispatched, as the rules weren't "designed" to change. So even if you did change the rules dynamically, you may still end up making a bunch of requests you didn't intend to, and still crawl a bunch of content you didn't mean to.
For instance, if your target page is set up so that the page for "Category A" contains links to pages 1 to 10 of "Category A", then Scrapy will queue up requests for all 10 of these pages. If page 2 turns out to have entries that are "too old", changing the rules will do nothing, because the requests for pages 3-10 are already queued to go.
As @imx51 said, it would be much better to write a downloader middleware. It can drop each request that you no longer want to make, since it is triggered for every request passing through it before the request is downloaded.
I would suggest writing your own custom downloader middleware. It would allow you to filter out the requests that you no longer want to make.
Further details about Scrapy's architecture can be found here: http://doc.scrapy.org/en/master/topics/architecture.html
And about downloader middleware and how to write your custom one: http://doc.scrapy.org/en/master/topics/downloader-middleware.html
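A minimal sketch of such a middleware, under stated assumptions: each request carries its category in request.meta["category"], and the spider keeps a stale_categories set that its callback fills in when it meets a "too old" entry. Both names, and the StaleCategoryMiddleware class itself, are hypothetical. The fallback exception class only exists so that the sketch runs without Scrapy installed:

```python
try:
    from scrapy.exceptions import IgnoreRequest
except ImportError:  # fallback so the sketch runs without Scrapy installed
    class IgnoreRequest(Exception):
        pass


class StaleCategoryMiddleware:
    """Drops queued requests whose category the spider has marked as too old."""

    def process_request(self, request, spider):
        # Assumption: the spider tags every request with meta["category"].
        category = request.meta.get("category")
        # Assumption: the spider adds categories to ``stale_categories``.
        stale = getattr(spider, "stale_categories", set())
        if category in stale:
            raise IgnoreRequest("category %r is too old" % (category,))
        return None  # let the request continue to the downloader


class FakeRequest:
    """Stand-in for a Scrapy request, used only for the demonstration below."""

    def __init__(self, category):
        self.meta = {"category": category}


class FakeSpider:
    stale_categories = {"A"}


middleware = StaleCategoryMiddleware()
for name in ("A", "B"):
    try:
        middleware.process_request(FakeRequest(name), FakeSpider())
        print(name, "kept")
    except IgnoreRequest:
        print(name, "dropped")
```

Because the check runs when each request is about to be downloaded, it also catches the requests for pages 3-10 that were queued before the category was marked. In a real project the class would have to be enabled through the DOWNLOADER_MIDDLEWARES setting.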