Scrapy: Dynamically generate rules for each start_url - python

I have created a spider which is supposed to crawl multiple websites and I need to define different rules for each URL in the start_url list.
start_urls = [
    "http://URL1.com/foo",
    "http://URL2.com/bar",
]
rules = [
    Rule(LinkExtractor(restrict_xpaths=("//" + xpathString + "/a")),
         callback="parse_object", follow=True)
]
The only thing that needs to change in the rule is the xpath string for restrict_xpath. I've already come up with a function that can get the xpath I want dynamically from any website.
I figured I can just get the current URL that the spider will be scraping and pass it through the function and then pass the resulting xpath to the rule.
Unfortunately, I've been searching and it seems that this isn't possible since scrapy utilizes a scheduler and compiles all the start_urls and rules right from the start. Is there any workaround to achieve what I'm trying to do?

I assume you are using CrawlSpider.
By default, CrawlSpider rules are applied to all pages (whatever the domain) your spider is crawling.
If you are crawling multiple domains in your start URLs and want different rules for each domain, you won't be able to tell Scrapy which rule(s) to apply to which domain. (I mean, it's not available out of the box.)
You can run your spider with 1 start URL at a time (and domain-specific rules, built dynamically at init time), and run multiple spiders in parallel.
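As a rough sketch of that first option (the spider name, the -a argument and the get_xpath_for helper are made up; the helper stands in for your own xpath-finding function), the rules can be built in __init__ before calling the parent constructor, since CrawlSpider compiles its rules there:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class SingleSiteSpider(CrawlSpider):
    name = "single_site"  # hypothetical name

    def __init__(self, start_url=None, *args, **kwargs):
        # start_url would be passed on the command line, e.g.
        #   scrapy crawl single_site -a start_url=http://URL1.com/foo
        self.start_urls = [start_url]
        xpath_string = get_xpath_for(start_url)  # your own xpath-finding function
        self.rules = (
            Rule(LinkExtractor(restrict_xpaths="//" + xpath_string + "/a"),
                 callback="parse_object", follow=True),
        )
        # CrawlSpider.__init__ calls _compile_rules(), so rules must exist first
        super().__init__(*args, **kwargs)

    def parse_object(self, response):
        yield {"url": response.url}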
Another option is to subclass CrawlSpider and customize it for your needs:
build rules as a dict with domains as keys and, as values, the list of rules to apply for that domain (see the _compile_rules method),
and apply different rules depending on the domain of the response (see _requests_to_follow); a rough sketch follows below.
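The internals of _compile_rules and _requests_to_follow differ between Scrapy versions, so treat this as an outline rather than a drop-in implementation (class name, xpaths and domains are placeholders):

from urllib.parse import urlparse

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class PerDomainRulesSpider(CrawlSpider):
    name = "per_domain_rules"  # hypothetical
    start_urls = ["http://URL1.com/foo", "http://URL2.com/bar"]

    # domain -> list of rules to apply for that domain (xpaths are placeholders)
    rules_by_domain = {
        "URL1.com": [Rule(LinkExtractor(restrict_xpaths="//div[@id='nav']/a"),
                          callback="parse_object", follow=True)],
        "URL2.com": [Rule(LinkExtractor(restrict_xpaths="//ul[@class='menu']/a"),
                          callback="parse_object", follow=True)],
    }

    def _compile_rules(self):
        # Compile each per-domain list the same way CrawlSpider compiles self.rules
        self._compiled_by_domain = {}
        for domain, rules in self.rules_by_domain.items():
            self.rules = rules
            super()._compile_rules()
            self._compiled_by_domain[domain] = self._rules
        self._rules = []

    def _requests_to_follow(self, response):
        # Swap in the rules registered for this response's domain, then reuse
        # the stock CrawlSpider link-following logic (consumed eagerly so the
        # swap cannot leak into other responses)
        self._rules = self._compiled_by_domain.get(urlparse(response.url).netloc, [])
        return list(super()._requests_to_follow(response))

    def parse_object(self, response):
        yield {"url": response.url}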

You can just override the parse method. This method will get a Scrapy response object with the full HTML content. You can run XPath on it. You can also retrieve the URL from the response object and, depending on the URL, run a custom XPath.
Please check out the docs here: http://doc.scrapy.org/en/latest/topics/request-response.html
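A minimal sketch of that approach (get_xpath_for stands in for the question's own xpath-finding function; the spider name is made up):

import scrapy

class MultiSiteSpider(scrapy.Spider):
    name = "multisite"  # hypothetical
    start_urls = ["http://URL1.com/foo", "http://URL2.com/bar"]

    def parse(self, response):
        xpath_string = get_xpath_for(response.url)  # your own function
        for href in response.xpath("//" + xpath_string + "/a/@href").extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse_object)

    def parse_object(self, response):
        # Extract whatever data you need from the followed pages
        yield {"url": response.url}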

Related

Crawl a website and its external links recursively to create a graph for a data analysis project in Python

I have this project that I'm trying to put together for a data analytics experiment. I have a pipeline in mind but I don't exactly know how to go on about getting the data I need.
I want to crawl a website and find all internal and external links, separate them, and crawl the external links recursively until a certain depth is reached. I want to do this to create a graph of all the connections for a website, and then use centrality algorithms to find the center node and proceed from there.
Ideally, I would like to use python 2 for this project.
I had a look at scrapy, beautiful soup and other libraries but it is all quite confusing.
Any help and/or advice would be much appreciated, especially on crawling and creating the graph.
Thank you
EDIT:
I'm trying to implement the solution you suggested and with the code below, I can see in the debug information that it is finding the links but either they are not being saved in the LinkList class or I'm extracting them wrong and they are getting filtered.
Any suggestions?
from scrapy import Field, Item
from scrapy.crawler import CrawlerProcess
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class LinkList(Item):
    url = Field()

class WebcrawlerSpider(CrawlSpider):
    name = 'webcrawler'
    allowed_domains = ['https://www.wehiweb.com']
    start_urls = ['https://www.wehiweb.com']
    rules = (
        Rule(LxmlLinkExtractor(allow=()), callback='parse_obj', follow=True),
    )

    def parse_obj(self, response):
        item = LinkList()
        item['url'] = []
        for link in LxmlLinkExtractor(allow=(), deny=self.allowed_domains).extract_links(response):
            item['url'].append(link.url)
        yield item

def main():
    links = LinkList()
    process = CrawlerProcess()
    process.crawl(WebcrawlerSpider)
    process.start()
    print(links.items())

if __name__ == "__main__":
    main()
Scrapy should work fine for this. Most people use it to extract data from websites (scraping), but it can be used for simple crawling as well.
In scrapy you have spiders that crawl websites and follow links. A scrapy project can consist of many spiders, but in the standard setup each spider will have its own queue and do its own task.
As you described your use case, I would recommend two separate scrapy spiders:
one for onsite scraping, with an allowed_domains setting restricted to this domain and a very high or even 0 (= infinite) DEPTH_LIMIT setting, so that it will crawl the whole domain
one for offsite scraping, with an empty allowed_domains (which allows all domains) and a low DEPTH_LIMIT setting, so that it stops after a certain number of hops (sketched below)
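For illustration, the class-level knobs for the second spider might look like this (names and the depth value are placeholders; DEPTH_LIMIT is the setting enforced by DepthMiddleware, with 0 meaning unlimited):

import scrapy

class OffsiteLinksSpider(scrapy.Spider):
    name = "offsite_links"  # hypothetical
    start_urls = ["https://www.wehiweb.com"]
    # no allowed_domains attribute -> OffsiteMiddleware lets every domain through
    custom_settings = {"DEPTH_LIMIT": 2}  # stop after 2 hops from the start URLs

    def parse(self, response):
        # link extraction and item yielding as shown further below
        pass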
From your parse method's perspective, Scrapy has the concepts of Request and Item. You can return both Requests and Items from the method that parses your response:
requests will trigger scrapy to visit a website and in turn call your parse method on the result
items allow you to specify the results you define for your project
So whenever you want to follow a link you will yield a Request from your parse method. And for all results of your project you will yield Item.
In your case, I'd say that your Item is something like this:
class LinkItem(scrapy.Item):
    link_source = scrapy.Field()
    link_target = scrapy.Field()
This will allow you to return the item link_source="http://example.com/", link_target="http://example.com/subsite" if you are on page http://example.com/ and find a link to /subsite:
def parse(self, response):
    # Here: code to parse the website (e.g. with scrapy selectors
    # or beautifulsoup, but I think scrapy selectors should suffice).
    # After parsing, you have a list "links".
    for link in links:
        yield Request(link)  # make scrapy continue the crawl

        item = LinkItem()
        item['link_source'] = response.url
        item['link_target'] = link
        yield item  # return the result we want (connections in link graph)
You might see that I did not do any depth checking etc. You don't have to do this manually in your parse method; Scrapy ships with middlewares. One of the middlewares is called OffsiteMiddleware and will check if your spider is allowed to visit specific domains (with the option allowed_domains, check the scrapy tutorials). Another one is DepthMiddleware (also check the tutorials).
These results can be written anywhere you want. Scrapy ships with something called feed exports which allow you to write data to files. If you need something more advanced, e.g. a database, you can look at scrapy's Pipeline.
I currently do not see the need for other libraries and projects apart from scrapy for your data collection.
Of course when you want to work with the data, you might need specialized data structures instead of plain text files.

How to set a rule according to the current URL?

I'm using Scrapy and I want to be able to have more control over the crawler. To do this I would like to set rules depending on the current URL that I am processing.
For example if I am on example.com/a I want to apply a rule with LinkExtractor(restrict_xpaths='//div[@class="1"]'). And if I'm on example.com/b I want to use another Rule with a different Link Extractor.
How do I accomplish this?
I'd just code them in separate callbacks, instead of relying in the CrawlSpider rules.
def parse(self, response):
    extractor = LinkExtractor(.. some default ..)
    if 'example.com/a' in response.url:
        extractor = LinkExtractor(restrict_xpaths='//div[@class="1"]')
    for link in extractor.extract_links(response):
        yield scrapy.Request(link.url, callback=self.whatever)
This is better than trying to change the rules at runtime, because the rules are supposed to be the same for all callbacks.
In this case I've just used link extractors, but if you want to use different rules you can do much the same thing, mirroring the loop that handles rules in CrawlSpider._requests_to_follow.
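One way to keep that per-URL dispatch declarative instead of a chain of ifs (the example.com/a xpath comes from the question; the /b xpath and the EXTRACTORS name are made up):

import scrapy
from scrapy.linkextractors import LinkExtractor

EXTRACTORS = {
    "example.com/a": LinkExtractor(restrict_xpaths='//div[@class="1"]'),
    "example.com/b": LinkExtractor(restrict_xpaths='//div[@class="2"]'),
}

def parse(self, response):
    # Fall back to a plain LinkExtractor when no fragment matches the URL
    extractor = next(
        (ex for fragment, ex in EXTRACTORS.items() if fragment in response.url),
        LinkExtractor(),
    )
    for link in extractor.extract_links(response):
        yield scrapy.Request(link.url, callback=self.whatever)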

Using Scrapy to crawl a list of urls in a section of the start url

I am trying to build a CrawlSpider with Scrapy with the following features.
Basically, my start url contains various lists of urls which are divided up into sections. I want to scrape just the urls from a specific section and then crawl them.
In order to do this, I defined my link extractor using restrict_xpaths, in order to isolate the links I want to crawl from the rest.
However, because of the restrict_xpaths, when the spider tries to crawl a link which is not the start url, it stops, since it does not find any links.
So I tried to add another rule, which is supposed to assure that the links outside the start url get crawled, through the use of deny_domains applied to the start_url. However, this solution is not working.
Can anyone suggest a possible strategy?
Right now my rules are :
rules = {
    Rule(LinkExtractor(restrict_xpaths=(".//*[@id='mw-content-text']/ul[19]"),), callback='parse_items', follow=True),
    Rule(LinkExtractor(deny_domains='...start url...'), callback='parse_items', follow=True),
}
You're defining a Set by using {} around the pair of rules. Try making it a tuple with ():
rules = (
    Rule(LinkExtractor(restrict_xpaths=(".//*[@id='mw-content-text']/ul[19]"),), callback='parse_items', follow=True),
    Rule(LinkExtractor(deny_domains='...start url...'), callback='parse_items', follow=True),
)
Beyond that, you might want to pass unique=True to the link extractors to make sure that any links back to the "start url" are not followed. See BaseSgmlLinkExtractor.
Also, the use of 'parse_items' as a callback for both LinkExtractors is a bit of a smell. Based on your explanation, I can't see that the first extractor would need a callback... it's just extracting links that should be added to the queue for the scraper to go fetch, right?
The real scraping for data that you want to use/persist generally happens in the 'parse_items' callback (at least that's the convention used in the docs).
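For instance, a parse_items callback along these lines would do the actual extraction (the fields here are placeholders, not taken from the question):

def parse_items(self, response):
    # Scrape whatever you actually want to keep from the followed pages
    yield {
        "url": response.url,
        "title": response.xpath("//h1/text()").extract_first(),
    }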

Scrapy: Access CrawlSpider url list

I am writing a script that will inventory all site urls.
I am using CrawlSpider with a rules handler to process scraped urls. Specifically, "filter_links" checks a table for an existing url and, if not found, writes a new entry.
rules = [
    Rule(SgmlLinkExtractor(unique=True), follow=True, callback="parse_item",
         process_links="filter_links")
]
I sense this is just a poor man's 'reinventing the wheel' and that a better method surely exists.
Is there a better way to dump the list of urls Scrapy found rather than trying to parse them from the response? Thanks
I think you are making use of process_links the way it is intended to be used. I see no drawbacks to that. But if you want to get rid of this additional filter_links method, then you can include the url table lookup and update logic in your parse_item method. You can access the current url in parse_item as response.url
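A sketch of what folding that lookup into parse_item could look like (seen_urls stands in for whatever table or storage the question is writing to):

def parse_item(self, response):
    if response.url not in self.seen_urls:   # hypothetical set- or DB-backed lookup
        self.seen_urls.add(response.url)     # write the new entry
        yield {"url": response.url}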

Scrapy - Change rules after spider starts crawling

My query is for the CrawlSpider
I understand the link extractor rules attribute is a static (class-level) variable.
Can I change the rules at runtime, say, like
@classmethod
def set_rules(cls, rules):
    cls.rules = rules
by calling
self.set_rules(rules)
Is this acceptable practice for the CrawlSpider? If not, please suggest the appropriate method.
My use case,
I'm using Scrapy to crawl certain categories A, B, C, ..., Z of a particular website. Each category has 1000 links spread over 10 pages,
and when Scrapy hits a link in some category which is "too old", I'd like the crawler to stop following/crawling the remainder of the 10 pages ONLY for that category, hence my requirement for dynamic rule changes.
Please point me in the right direction.
Thanks!
The rules in a spider aren't meant to be changed dynamically. They are compiled at instantiation of the CrawlSpider. You could always change your spider.rules and re-run spider._compile_rules(), but I advise against it.
The rules create a set of instructions for the crawler about what to queue up to crawl (i.e. it queues Requests). These requests aren't revisited and re-evaluated before they are dispatched, as the rules weren't "designed" to change. So even if you did change the rules dynamically, you may still end up making a bunch of requests you didn't intend to, and still crawl a bunch of content you didn't mean to.
For instance, if your target page is set up so that the page for "Category A" contains links to pages 1 to 10 of "Category A", then Scrapy will queue up requests for all 10 of these pages. If page 2 turns out to have entries that are "too old", changing the rules will do nothing because requests for pages 3-10 are already queued to go.
As @imx51 said, it would be much better to write a downloader middleware. These would be able to drop each request that you no longer want to make, as they trigger for every request going through them before it is downloaded.
I would suggest that you write your own custom downloader middleware. This would allow you to filter out those requests that you no longer want to make.
Further details about the architecture overview of Scrapy can be found here: http://doc.scrapy.org/en/master/topics/architecture.html
And about downloader middleware and how to write your custom one: http://doc.scrapy.org/en/master/topics/downloader-middleware.html
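A rough sketch of such a middleware (the category bookkeeping, the request.meta key and the is_too_old check are all assumptions, not Scrapy APIs):

from scrapy.exceptions import IgnoreRequest

class StopStaleCategoryMiddleware:
    """Drops further requests for categories already found to be 'too old'."""

    def __init__(self):
        self.stopped = set()

    def process_request(self, request, spider):
        category = request.meta.get("category")  # set this when yielding requests
        if category in self.stopped:
            raise IgnoreRequest("category %s is too old" % category)
        return None  # anything else passes through untouched

    def process_response(self, request, response, spider):
        category = request.meta.get("category")
        if category and spider.is_too_old(response):  # hypothetical spider helper
            self.stopped.add(category)
        return response

It would then be enabled through the DOWNLOADER_MIDDLEWARES setting of the project.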
