I am using Scrapy to scrape the following website's posts. I have written the code that will give me the max_id or the latest post number. For example, for http://papa-gen.livejournal.com/: if I have the max_id in theory I will be able to create a for loop 1 through the max_id and I should be able to scrape all of the posts.
The problem is that there are not as many posts as the max_id suggests. For example, the max_id for the website above is 2870789 for the post of December 17th, but the post of December 16th has the number 2870614, a difference of 175. If I loop through all 2870789 IDs, I will reach each post, but the code will of course not be very efficient. My idea is to access the previous and forward buttons on the website from my Python code and loop that way.
Could someone explain how I could accomplish this using Scrapy?
Scrapy has extensive documentation. There is an example of using the CrawlSpider class to accomplish what you're describing, which you can modify to look something like this...
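from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'livejournal.com'
    allowed_domains = ['livejournal.com']
    start_urls = ['http://papa-gen.livejournal.com/']
    rules = (
        # LinkExtractor replaces the deprecated SgmlLinkExtractor.
        # Follow links whose URL contains skip= and keep following them on later pages.
        Rule(LinkExtractor(allow=('skip=', )), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # here is where the parsing happens
        pass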
The basic idea is to specify rules that match the links. Scrapy will add them to a list of URLs to visit and then call the callback function with the page data when each URL is fetched.
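To make the rule more concrete: the allow argument is a regular expression that is searched for inside each link URL, so 'skip=' keeps only the links that contain that text. Below is a rough stand-in for that filtering step using only the re module. It is not Scrapy's own code, and the helper name matching_links and the sample URLs are made up for illustration.

```python
import re


def matching_links(urls, allow_patterns):
    """Keep the URLs that match at least one pattern anywhere in the string."""
    compiled = [re.compile(pattern) for pattern in allow_patterns]
    return [url for url in urls if any(rx.search(url) for rx in compiled)]


# Hypothetical links found on a journal page.
found = [
    "http://papa-gen.livejournal.com/?skip=20",
    "http://papa-gen.livejournal.com/2870789.html",
    "http://papa-gen.livejournal.com/profile",
]

pager_links = matching_links(found, ("skip=",))
print(pager_links)
```

Only the first URL survives the filter, so only that kind of link would be queued by the rule.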
Related
I have this project that I'm trying to put together for a data analytics experiment. I have a pipeline in mind but I don't exactly know how to go on about getting the data I need.
I want to crawl a website, find all internal and external links, separate them, and crawl the external links recursively until a certain depth is reached. I want to do this to create a graph of all the connections for a website, and then use centrality algorithms to find the center node and proceed from there.
Ideally, I would like to use python 2 for this project.
I had a look at scrapy, beautiful soup and other libraries but it is all quite confusing.
Any help and/or advice would be much appreciated on crawling and creating the graph especially
Thank you
EDIT:
I'm trying to implement the solution you suggested and with the code below, I can see in the debug information that it is finding the links but either they are not being saved in the LinkList class or I'm extracting them wrong and they are getting filtered.
Any suggestions?
class LinkList(Item):
    url = Field()

class WebcrawlerSpider(CrawlSpider):
    name = 'webcrawler'
    allowed_domains = ['https://www.wehiweb.com']
    start_urls = ['https://www.wehiweb.com']
    rules = (
        Rule(LxmlLinkExtractor(allow=()), callback='parse_obj', follow=True),
    )

    def parse_obj(self,response):
        item = LinkList()
        item['url'] = []
        for link in LxmlLinkExtractor(allow=(),deny = self.allowed_domains).extract_links(response):
            item['url'].append(link.url)
        yield item

def main():
    links = LinkList()
    process = CrawlerProcess()
    process.crawl(WebcrawlerSpider)
    process.start()
    print(links.items())

if __name__ == "__main__":
    main()
Scrapy should work fine for this. Most people use it to extract data from websites (scraping), but it can be used for simple crawling as well.
In scrapy you have spiders that crawl websites and follow links. A scrapy project can consist of many spiders, but in the standard setup each spider will have its own queue and do its own task.
As you described your use case, I would recommend two separate scrapy spiders:
one for onsite scraping, with allowed_domains set to this domain only and a very high DEPTH_LIMIT setting, or 0 (no limit), so that it will crawl the whole domain
one for offsite scraping, with an empty allowed_domains (which allows all domains) and a low DEPTH_LIMIT setting, so that it stops after a certain number of hops
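The onsite/offsite split comes down to comparing each link's host name with the allowed domain. Here is a minimal sketch of that comparison using only the standard library (Python 3's urllib.parse). It approximates the idea behind allowed_domains rather than reproducing Scrapy's implementation; the helper names and the blog.wehiweb.com and example.org URLs are assumptions for illustration. Note that the domain is given as a bare name, without a scheme.

```python
from urllib.parse import urlparse


def is_onsite(url, allowed_domain):
    """True if the URL's host is the allowed domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    domain = allowed_domain.lower()
    return host == domain or host.endswith("." + domain)


def split_links(urls, allowed_domain):
    """Separate URLs into (onsite, offsite) lists, preserving order."""
    onsite, offsite = [], []
    for url in urls:
        (onsite if is_onsite(url, allowed_domain) else offsite).append(url)
    return onsite, offsite


onsite, offsite = split_links(
    [
        "https://www.wehiweb.com/about",
        "https://blog.wehiweb.com/",   # hypothetical subdomain
        "https://example.org/",        # hypothetical external site
    ],
    "wehiweb.com",
)
print(onsite)
print(offsite)
```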
From your parse method's perspective scrapy has a concept of Request and Item. You can return both Request and Item from the method that parses your response:
requests will trigger scrapy to visit a website and in turn call your parse method on the result
items allow you to specify the results you define for your project
So whenever you want to follow a link you will yield a Request from your parse method. And for all results of your project you will yield Item.
In your case, I'd say that your Item is something like this:
class LinkItem(scrapy.Item):
    link_source = scrapy.Field()
    link_target = scrapy.Field()
This will allow you to return the item link_source="http://example.com/", link_target="http://example.com/subsite" if you are on page http://example.com/ and found a link to /subsite:
def parse(self, response):
    # Here: code to parse the website (e.g. with scrapy selectors
    # or beautifulsoup, but I think scrapy selectors should
    # suffice). One way to collect absolute link URLs:
    links = [response.urljoin(href) for href in response.xpath('//a/@href').extract()]
    # after parsing, you have a list "links"
    for link in links:
        yield scrapy.Request(link)  # make scrapy continue the crawl
        item = LinkItem()
        item['link_source'] = response.url
        item['link_target'] = link
        yield item  # return the result we want (connections in link graph)
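For experimenting with link extraction outside a running spider, here is a standard-library sketch that pulls href values out of an HTML string and resolves them against the page URL. It is only an illustration: inside Scrapy you would normally use its selectors instead, and the names HrefCollector and extract_links as well as the sample HTML are assumptions.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(page_url, html):
    """Return absolute URLs for all links found in the HTML string."""
    collector = HrefCollector()
    collector.feed(html)
    return [urljoin(page_url, href) for href in collector.hrefs]


links = extract_links(
    "http://example.com/",
    '<a href="/subsite">Sub</a> <a href="http://other.example/">Other</a>',
)
print(links)
```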
You might notice that I did not do any depth checking etc. You don't have to do this manually in your parse method, because scrapy ships with middleware. One of the middlewares is called OffsiteMiddleware and checks whether your spider is allowed to visit specific domains (via the option allowed_domains, check the scrapy tutorials). Another one is DepthMiddleware (also check the tutorials).
These results can be written anywhere you want. Scrapy ships with something called feed exports which allow you to write data to files. If you need something more advanced, e.g. a database, you can look at scrapy's Pipeline.
I currently do not see the need for other libraries and projects apart from scrapy for your data collection.
Of course when you want to work with the data, you might need specialized data structures instead of plain text files.
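As one possible shape for such a data structure, the exported link_source/link_target pairs can be folded into an adjacency map and scored with a simple centrality measure. The sketch below uses plain dictionaries and in-degree centrality; the function names and sample items are assumptions, and a graph library would offer more centrality algorithms than this.

```python
def build_graph(items):
    """Map every page to the set of pages it links to."""
    graph = {}
    for item in items:
        graph.setdefault(item["link_source"], set()).add(item["link_target"])
        graph.setdefault(item["link_target"], set())  # targets are nodes too
    return graph


def in_degree_centrality(graph):
    """Fraction of the other nodes that link to each node."""
    counts = {node: 0 for node in graph}
    for targets in graph.values():
        for target in targets:
            counts[target] += 1
    denom = max(len(graph) - 1, 1)
    return {node: count / denom for node, count in counts.items()}


# Hypothetical items, shaped like the LinkItem fields above.
items = [
    {"link_source": "http://example.com/", "link_target": "http://example.com/a"},
    {"link_source": "http://example.com/", "link_target": "http://example.com/b"},
    {"link_source": "http://example.com/a", "link_target": "http://example.com/b"},
]

graph = build_graph(items)
centrality = in_degree_centrality(graph)
center = max(centrality, key=centrality.get)
print(center, centrality[center])
```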
So I want to scrape articles from a site that has pagination. Basically, every page is a list of article links and the spider follows the links on the page in a parse_article method, as well as following the successive next page links. However, is there a way to make this stop after a given number of articles are scraped? For example, this is what I have so far using a crawlspider:
rules = (
    # next page rule:
    Rule(LinkExtractor(restrict_xpaths="//a[@class='next']"), follow=True),
    # Extract all internal links which follow this regex:
    Rule(LinkExtractor(allow=('REGEXHERE',), deny=()), callback='parse_article'),
)

def parse_article(self, response):
    # do parsing stuff here
    pass
I want to stop following the next page once I've parsed 150 articles. It doesn't matter if I scrape a little more than 150, I just want to stop going to the next page once I've hit that number. Is there any way to do that? Something like having a counter in the parse_article method? Just new to scrapy so I'm not sure what to try.... I looked into depth_limit, but I'm not so sure that's what I am looking for.
Any help would be greatly appreciated, thanks!
You could achieve that by setting:
CLOSESPIDER_ITEMCOUNT = 150
In your project settings.
If you have multiple spiders in your project and want only a particular one to be affected by this setting, set it in that spider's custom_settings class variable:
custom_settings = { 'CLOSESPIDER_ITEMCOUNT': 150 }
The approach I take on my spiders is to actually have a donescraping flag and I check it first thing in each of my parse_* functions and return an empty list for the results.
This adds the graceful behavior of allowing items and urls already in the download queue to finish happening while not fetching any MORE items.
I've never used CLOSESPIDER_ITEMCOUNT, so I don't know if that "gracefully" closes the spider. I expect it does not.
At the beginning of every parse function:
# early exit if done scraping
if self.donescraping:
    return None
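The snippet above does not show where donescraping gets set. One way to do it, sketched here in plain Python so it can be run without Scrapy, is to count parsed articles and flip the flag at the limit. The class ArticleBudget and the function next_page_requests are hypothetical stand-ins for a spider's state and its listing callback; in a CrawlSpider the next-page rule would probably have to be replaced by a callback like this, since a Rule follows links without consulting the flag.

```python
class ArticleBudget:
    """Hypothetical helper holding the donescraping flag and an article counter."""

    def __init__(self, limit=150):
        self.limit = limit
        self.parsed = 0
        self.donescraping = False

    def article_parsed(self):
        """Call once per parsed article; flips the flag when the limit is reached."""
        self.parsed += 1
        if self.parsed >= self.limit:
            self.donescraping = True


def next_page_requests(budget, next_page_url):
    """Stand-in for a listing callback: schedule the next page unless done."""
    if budget.donescraping:
        return []  # early exit, nothing more is scheduled
    return [next_page_url]


budget = ArticleBudget(limit=2)
first = next_page_requests(budget, "http://example.com/page/2")
budget.article_parsed()
budget.article_parsed()
second = next_page_requests(budget, "http://example.com/page/3")
print(first, second)
```

Articles already queued when the flag flips are still parsed, which matches the "a little more than 150 is fine" requirement.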
I'm doing a web app that searches all the shoe sizes that are in stock for each model of shoe.
So for example, for a website having a list of shoes:
http://www.soccer.com/shop/footwear/?page=1&pageSize=12&query=*&facet=ads_f40502_ntk_cs%253A%2522Nike%2522
I'll need to go inside each link to scrape this information.
Is there any way I can effectively do this with Scrapy (or something else)? Or is it impossible to do it?
It is possible and it is one of Scrapy's core functionalities.
For example, for scraping every shoe on this site what you would do is:
In your spider variables start_urls = ['http://www.soccer.com/shop/footwear/?page=1&pageSize=12&query=*&facet=ads_f40502_ntk_cs%253A%2522Nike%2522']
Then on your parse(self, response) your code should look like this:
for shoe_url in response.xpath(<ENTER_THE_XPATH>).extract():
    yield scrapy.Request(response.urljoin(shoe_url), callback=self.parse_shoe)
and in the method parse_shoe which we registered as callback in the for loop, you should extract all the information you need.
Now what happens here, is that the spider starts to crawl the URL in start_urls and then for every url that meets the xpath we specified it will parse it using the parse_shoe function, where you could simply extract the shoe sizes.
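The response.urljoin call matters because product links on a listing page are often relative. Roughly the same resolution can be tried out with the standard library's urljoin; the relative path below is invented for illustration and is not taken from the actual site.

```python
from urllib.parse import urljoin

listing_url = (
    "http://www.soccer.com/shop/footwear/"
    "?page=1&pageSize=12&query=*&facet=ads_f40502_ntk_cs%253A%2522Nike%2522"
)

# Hypothetical href values as they might appear in the listing's HTML.
relative_href = "/shop/details/example-shoe"
absolute_href = "http://www.soccer.com/shop/details/other-shoe"

resolved_relative = urljoin(listing_url, relative_href)
resolved_absolute = urljoin(listing_url, absolute_href)
print(resolved_relative)
print(resolved_absolute)
```

A relative path is resolved against the listing URL's host, while an already absolute URL passes through unchanged.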
You can follow the "Follow Links" tutorial on scrapy's main site on this link too - it is very clear.
For completeness I looked for the right xpath for you on that page, it should be '*//ul[@class="medium-3 columns product-list product-grid"]//a/@href'.
I am trying to implement a CrawlSpider with Scrapy with the following features.
Basically, my start url contains various list of urls which are divided up in sections. I want to scrape just the urls from a specific section and then crawl them.
In order to do this, I defined my link extractor using restrict_xpaths, in order to isolate the links I want to crawl from the rest.
However, because of the restrict_xpaths, when the spider tries to crawl a link which is not the start url, it stops, since it does not find any links.
So I tried to add another rule, which is supposed to assure that the links outside the start url get crawled, through the use of deny_domains applied to the start_url. However, this solution is not working.
Can anyone suggest a possible strategy?
Right now my rules are :
rules = {Rule(LinkExtractor(restrict_xpaths=(".//*[@id='mw-content-text']/ul[19]"), ), callback='parse_items', follow=True),
         Rule(LinkExtractor(deny_domains='...start url...'), callback='parse_items', follow=True),}
You're defining a set by using {} around the pair of rules. Try making it a tuple with ():
rules = (Rule(LinkExtractor(restrict_xpaths=(".//*[@id='mw-content-text']/ul[19]"), ), callback='parse_items', follow=True),
         Rule(LinkExtractor(deny_domains='...start url...'), callback='parse_items', follow=True),)
Beyond that, you might want to pass unique=True to the LinkExtractors (it is a link extractor argument, not a Rule argument) to make sure that any links back to the "start url" are not followed. See BaseSgmlLinkExtractor
Also, the use of 'parse_items' as a call back to both LinkExtractors is a bit of a smell. Based on your explanation, I can't see that the first extractor would need a callback.... it's just extracting links that should be added to the queue for the Scraper to go fetch, right?
The real scraping for data that you want to use/persist generally happens in the 'parse_items' callback (at least that's the convention used in the docs).
I am trying to scrape some forums with scrapy and store the data in a database, but I don't know how to do it efficiently when it comes to updating the database. This is what my spider looks like:
class ForumSpider(CrawlSpider):
    name = "forum"
    allowed_domains = ["forums.example.com"]
    start_urls = ["http://forums.example.com/index.php"]  # scheme is required
    rules = (
        # the ? is escaped so it matches a literal question mark
        Rule(SgmlLinkExtractor(allow=(r'/forum\?id=\d+',)),
             follow=True, callback='parse_index'),
    )

    def parse_index(self, response):
        hxs = HtmlXPathSelector(response)
        # parsing....looking for threads.....
        # pass the data to pipeline and store in to the db....
My problem is that when I scrape the same forum again, say a week later, there is no point in going through all the pages, because new threads and threads with new posts will be listed above the inactive ones. My idea is to check the first page of a forum (forums.example.com/forum?id=1): if it finds the threads on page one with the same URLs and the same numbers of replies as before, there is no point in going to the second page, so the spider should proceed to another forum (forums.example.com/forum?id=2). I tried modifying start_urls and rules, but they seem to have no effect once the spider is running. Is there a way to do this in scrapy?
My second problem is how to use different pipelines for different spiders. I found something on Stack Overflow, but it seems like scrapy isn't built to do this; it seems you are supposed to create a new project for each site.
Am I using the wrong tool to do this? Or I am missing something. I thought about using mechanize and lxml to do it. But I need to implement twisted and unicode handling and so on which makes me want to stick with scrapy
Thanks
What you are asking for is to create HTTP requests on the fly.
Inside the parse_index function, do this:
request = self.make_requests_from_url("http://forums.example.com/forum?id=2")
return request
If you want to submit multiple HTTP requests, return a list.
See this Request in scrapy
You are right about the second thing: you are supposed to write different spiders if you want to extract different types of data from different websites.
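The "skip page two if nothing changed" check from the question can be kept as a small pure function that parse_index calls before returning the request for the next page. This is only a sketch under assumptions: page_has_changes is a hypothetical name, known_replies stands for whatever URL-to-reply-count mapping the database provides, and the thread URLs are invented.

```python
def page_has_changes(threads_on_page, known_replies):
    """Return True if any thread is new or its reply count differs from the stored one."""
    for url, replies in threads_on_page:
        if known_replies.get(url) != replies:
            return True
    return False


# Hypothetical state loaded from the database after last week's crawl.
known_replies = {
    "http://forums.example.com/thread?id=10": 4,
    "http://forums.example.com/thread?id=11": 0,
}

unchanged_page = [
    ("http://forums.example.com/thread?id=10", 4),
    ("http://forums.example.com/thread?id=11", 0),
]
updated_page = [
    ("http://forums.example.com/thread?id=10", 6),   # two new replies
    ("http://forums.example.com/thread?id=12", 0),   # brand-new thread
]

print(page_has_changes(unchanged_page, known_replies))
print(page_has_changes(updated_page, known_replies))
```

When the function returns False for page one, parse_index can simply return no request for page two, and the spider moves on to the next forum.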