So I want to scrape articles from a site that has pagination. Basically, every page is a list of article links and the spider follows the links on the page in a parse_article method, as well as following the successive next page links. However, is there a way to make this stop after a given number of articles are scraped? For example, this is what I have so far using a crawlspider:
rules = (
#next page rule:
Rule(LinkExtractor(restrict_xpaths="//a[#class='next']"),follow=True)
#Extract all internal links which follows this regex:
Rule(LinkExtractor(allow=('REGEXHERE',),deny=()),callback='parse_article'),
)
def parse_article(self, response):
#do parsing stuff here
I want to stop following the next page once I've parsed 150 articles. It doesn't matter if I scrape a little more than 150, I just want to stop going to the next page once I've hit that number. Is there any way to do that? Something like having a counter in the parse_article method? Just new to scrapy so I'm not sure what to try.... I looked into depth_limit, but I'm not so sure that's what I am looking for.
Any help would be greatly appreciated, thanks!
You could achieve that by setting:
CLOSESPIDER_ITEMCOUNT = 150
In your project settings.
If you have multiple Spiders in your project and just want a particular one to be affected by this setting, set it in custom_settings class variable:
custom_settings = { 'CLOSESPIDER_ITEMCOUNT': 150 }
The approach I take on my spiders is to actually have a donescraping flag and I check it first thing in each of my parse_* functions and return an empty list for the results.
This adds the graceful behavior of allowing items and urls already in the download queue to finish happening while not fetching any MORE items.
I've never used CLOSESPIDER_ITEMCOUNT so I dont' know if that "gracefully" closes the spider. I expect it does not
At the beginning of every parse function:
#early exit if done scraping
if self.donescraping:
return None
Related
I have this project that I'm trying to put together for a data analytics experiment. I have a pipeline in mind but I don't exactly know how to go on about getting the data I need.
I want to crawl a website and find all internal and external link, separate them and crawl the external links recursively until it reaches a certain depth. I want to do this to create a graph of all the connections for a website, to then use centrality algorithms to find the center node and proceed from there.
Ideally, I would like to use python 2 for this project.
I had a look at scrapy, beautiful soup and other libraries but it is all quite confusing.
Any help and/or advice would be much appreciated on crawling and creating the graph especially
Thank you
EDIT:
I'm trying to implement the solution you suggested and with the code below, I can see in the debug information that it is finding the links but either they are not being saved in the LinkList class or I'm extracting them wrong and they are getting filtered.
Any suggestions?
class LinkList(Item):
url = Field()
class WebcrawlerSpider(CrawlSpider):
name = 'webcrawler'
allowed_domains = ['https://www.wehiweb.com']
start_urls = ['https://www.wehiweb.com']
rules = (
Rule(LxmlLinkExtractor(allow=()), callback='parse_obj', follow=True),
)
def parse_obj(self,response):
item = LinkList()
item['url'] = []
for link in LxmlLinkExtractor(allow=(),deny = self.allowed_domains).extract_links(response):
item['url'].append(link.url)
yield item
def main():
links = LinkList()
process = CrawlerProcess()
process.crawl(WebcrawlerSpider)
process.start()
print(links.items())
if __name__ == "__main__":
main()
Scrapy should work fine for this. Most people use it to extract data from websites (scraping), but it can be used for simple crawling as well.
In scrapy you have spiders that crawl websites and follow links. A scrapy project can consist of many spiders, but in the standard setup each spider will have its own queue and do its own task.
As you described your use case, I would recommend two separate scrapy spiders:
one for onsite scraping with a allowed_domains setting only for this domain and a very high or even 0 (=infinite) MAX_DEPTH setting, so that it will crawl the whole domain
one for offsite scraping with an empty allowed_domains (will allow all domains) and a low MAX_DEPTH setting, so that it will stop after certain number of hops
From your parse method's perspective scrapy has a concept of Request and Item. You can return both Request and Item from the method that parses your response:
requests will trigger scrapy to visit a website and in turn call your parse method on the result
items allow you to specify the results you define for your project
So whenever you want to follow a link you will yield a Request from your parse method. And for all results of your project you will yield Item.
In your case, I'd say that your Item is something like this:
class LinkItem(scrapy.Item):
link_source = scrapy.Field()
link_target = scrapy.Field()
This will allow you to return the item link_source="http://example.com/", link_target="http://example.com/subsite" if you are on page http://example.com/ and found a link to /subsite:
def parse(self, response):
# Here: Code to parse the website (e.g. with scrapy selectors
# or beautifulsoup, but I think scrapy selectors should
# suffice
# after parsing, you have a list "links"
for link in links:
yield Request(link) # make scrapy continue the crawl
item = LinkItem()
item['link_source'] = response.url
item['link_target'] = link
yield item # return the result we want (connections in link graph)
You might see that I did not do any depth checking etc. You don't have to do this manually in your parse method, scrapy ships with Middleware. One of the middlewares is called OffsiteMiddleware and will check if your spider is allowed to visit specific domains (with the option allowed_domains, check the scrapy tutorials). And other one is DepthMiddleware (also check the tutorials).
These results can be written anywhere you want. Scrapy ships with something called feed exports which allow you to write data to files. If you need something more advanced, e.g. a database, you can look at scrapy's Pipeline.
I currently do not see the need for other libraries and projects apart from scrapy for your data collection.
Of course when you want to work with the data, you might need specialized data structures instead of plain text files.
I am trying to do realize a CrawlSpider with Scrapy with the following features.
Basically, my start url contains various list of urls which are divided up in sections. I want to scrape just the urls from a specific section and then crawl them.
In order to do this, I defined my link extractor using restrict_xpaths, in order to isolate the links I want to crawl from the rest.
However, because of the restrict_xpaths, when the spider tries to crawl a link which is not the start url, it stops, since it does not find any links.
So I tried to add another rule, which is supposed to assure that the links outside the start url get crawled, through the use of deny_domains applied to the start_url. However, this solution is not working.
Can anyone suggest a possible strategy?
Right now my rules are :
rules = {Rule(LinkExtractor(restrict_xpaths=(".//*[#id='mw-content- text']/ul[19]"), ), callback='parse_items', follow=True),
Rule(LinkExtractor(deny_domains='...start url...'), callback='parse_items',follow= True),}
You're defining a Set by using {} around the pair of rules. Try making it a tuple with ():
rules = (Rule(LinkExtractor(restrict_xpaths=(".//*[#id='mw-content- text']/ul[19]"), ), callback='parse_items', follow=True),
Rule(LinkExtractor(deny_domains='...start url...'), callback='parse_items',follow= True),)
Beyond that, you might want to pass 'unique=True' to the Rules to make sure that any links back to the "start url" are not followed. See BaseSgmlLinkExtractor
Also, the use of 'parse_items' as a call back to both LinkExtractors is a bit of a smell. Based on your explanation, I can't see that the first extractor would need a callback.... it's just extracting links that should be added to the queue for the Scraper to go fetch, right?
The real scraping for data that you want to use/persist generally happens in the 'parse_items' callback (at least that's the convention used in the docs).
Please help me solve following case:
Imagine a typical classified category page. A page with list of items. When you click on items you land on internal pages.Now currently my crawler scrapes all these URLs, further scrapes these urls to get details of the item, check to see if the initial seed URL as any next page. If it has, it goes to the next page and do the same. I am storing these items in a sql database.
Let say 3 days later, there are new itmes in the Seed URL and I want to scrap only new items. Possible solutions are:
At the time of scraping each item, I check in the database to see if the URL is already scraped. If it has, I simply ask Scrapy to stop crawling further.
Problem : I don't want to query database each time. My database is going to be really large and it will eventually make crawling super slow.
I try to store last scraped URL and pass it on in the beginning, and the moment it finds this last_scraped_url it simply stops the crawler.
Not possible, given the asynchronous nature of crawling URLs are not scraped in the same order they are received from seed URLs.
( I tried all methods to make it in orderly fashion - but that's not possible at all )
Can anybody suggest any other ideas ? I have been struggling over it for past three days.
Appreciate your replies.
Before trying to give you an idea...
I must say I would try first your database option. Databases are made just for that and, even if your DB gets really big, this should not do the crawling significantly slow.
And one lesson I have learned: "First do the dumb implementation. After that, you try to optimize." Most of times when you optimize first, you just optimize the wrong part.
But, if you really want another idea...
Scrapy's default is not to crawl the same url two times. So, before start the crawling you can put the already scraped urls (3 days before) in the list that Scrapy uses to know which urls were already visited. (I don't know how to do that.)
Or, simpler, in your item parser you can just check if the url was already scraped and return None or scrape the new item accordingly.
I am trying to scrap some forums with scrapy and store the data in a database. But I don't know to do it efficiently when it comes to updating the database. This is what my spider looks like:
class ForumSpider(CrawlSpider):
name = "forum"
allowed_domains= ["forums.example.com"]
start_urls = ["forums.example.com/index.php"]
rules = (
Rule(SgmlLinkExtractor(allow=(r'/forum?id=\d+',)),
follow=True, callback='parse_index'),
)
def parse_index(self, response):
hxs = HtmlXPathSelector(response)
#parsing....looking for threads.....
#pass the data to pipeline and store in to the db....
My problem is when I scrap the same forum again, say a week later, there is no point to go through all the pages, because the new threads or any threads with new post would be on top of other inactive threads. My idea is to check the first pages of a forum(forums.example.com/forum?id=1), if it found a thread with the same URL and the same number of reply on page one. There is no point to go to the second page. So the spider should proceed to another forum(forums.example.com/forum?id=2). I tried modifying the start_urls and rules, but it seemed like they are not responding once the spider is running. Is there a way to do it in scrapy?
My second problem is how to use different pipeline for different spiders. I found something on stack overflow. But it seems like scrapy isn't built to do this, it seems like you suppose to create a new project for different sites.
Am I using the wrong tool to do this? Or I am missing something. I thought about using mechanize and lxml to do it. But I need to implement twisted and unicode handling and so on which makes me want to stick with scrapy
Thanks
What you are asking for is to create a http requests on fly.
Inside the parse_index function do this.
request = self.make_requests_from_url(http://forums.example.com/forum?id=2)
return request
If you want to submit multiple http requests return a array.
See this Request in scrapy
You are right about the second thing, you are suppose to write different spiders if you want to extract different type of data from different websites.
My query is for the CrawlSpider
I understand the link extractor rules is a static variable,
Can i change the rules in runtime say, like
#classmethod
def set_rules(cls,rules):
cls.rules = rules
by
self.set_rules(rules)
Is this the acceptable practice for the CrawlSpider ? if not please suggest the appropriate method
My use case,
I'm using scrapy to crawl certain categories A,B,C....Z of a particular website. each category has 1000 links spread over 10 pages
and when scrapy hits a link in a some category which is "too old". I'd like the crawler to stop following/crawling the remainder of the 10 pages ONLY for that category alone and thus my requirement of dynamic rule changes.
Please point me out on the right direction.
Thanks!
The rules in a spider aren't meant to be changed dynamically. They are compiled at instantiation of the CrawlSpider. You could always change your spider.rules and re-run spider._compile_rules(), but I advise against it.
The rules create a set of instructions for the Crawler in what to queue up to crawl (ie. it queues Requests). These requests aren't revisited and re-evaluated before they are dispatched, as the rules weren't "designed" to change. So even if you did change the rules dynamically, you may still end up making a bunch of requests you didn't intend to, and still crawl a bunch of content you didn't mean to.
For instance, if your target page is setup so that the page for "Category A" contains links to pages 1 to 10 of "Category A", then Scrapy will queue up requests for all 10 of these pages. If Page 2 turns out to have entries that are "too old", changing the rules will do nothing because requests for pages 3-10 are already queued to go.
As #imx51 said, it would be much better to write a Downloader Middleware. These would be able to drop each request that you not longer want to make as they trigger for every request going through it before it's downloaded.
I would suggest to you to write your own custom downloader middleware. These would allow you to filter out those requests that you not longer want to make.
Further details about the architecture overview of Scrapy can you find here: http://doc.scrapy.org/en/master/topics/architecture.html
And about downloader middleware and how to write your custom one: http://doc.scrapy.org/en/master/topics/downloader-middleware.html