Scrapy Case: Incremental Update of Items - python

Please help me solve the following case:
Imagine a typical classified-ads category page: a page with a list of items. When you click on an item, you land on its internal page. Currently my crawler scrapes all these URLs, then scrapes each of them to get the item's details, and checks whether the initial seed URL has a next page. If it does, it goes to the next page and does the same. I am storing these items in a SQL database.
Let's say 3 days later there are new items on the seed URL and I want to scrape only the new items. Possible solutions are:
At the time of scraping each item, I check the database to see whether the URL has already been scraped. If it has, I simply ask Scrapy to stop crawling further.
Problem: I don't want to query the database for every item. My database is going to be really large, and that will eventually make crawling super slow.
I could store the last scraped URL and pass it in at the start, so that the moment the crawler finds this last_scraped_url it simply stops.
Problem: that is not possible, because given the asynchronous nature of crawling, URLs are not scraped in the same order they are received from the seed URLs.
(I tried every method to make it happen in an orderly fashion, but that's just not possible.)
Can anybody suggest any other ideas? I have been struggling with this for the past three days.
Appreciate your replies.

Before trying to give you an idea...
I must say I would try your database option first. Databases are made for exactly this, and even if your DB gets really big, it should not slow the crawling down significantly.
And one lesson I have learned: "First do the dumb implementation. After that, you try to optimize." Most of the time, when you optimize first, you just optimize the wrong part.
But, if you really want another idea...
Scrapy's default is not to crawl the same URL twice. So, before starting the crawl, you could put the already-scraped URLs (from 3 days before) into the list Scrapy uses to know which URLs have already been visited. (I don't know how to do that.)
Or, more simply, in your item parser you can just check whether the URL has already been scraped, and return None or scrape the new item accordingly, as in the sketch below.
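A minimal sketch of that second idea, assuming you load the already-scraped URLs from your database once at startup (the seed URL, the selectors and the load_scraped_urls helper are made up for illustration):

import scrapy

class ClassifiedsSpider(scrapy.Spider):
    name = "classifieds"
    start_urls = ["http://example.com/category"]  # hypothetical seed URL

    def __init__(self, *args, **kwargs):
        super(ClassifiedsSpider, self).__init__(*args, **kwargs)
        # Load previously scraped URLs once, instead of querying the
        # database for every single item.
        self.scraped_urls = self.load_scraped_urls()  # hypothetical helper

    def parse(self, response):
        for href in response.xpath("//a[@class='item']/@href").extract():
            url = response.urljoin(href)
            if url in self.scraped_urls:
                continue  # already in the database, skip it
            yield scrapy.Request(url, callback=self.parse_item)
        # follow the next-page link here, as before

    def parse_item(self, response):
        # build and yield the item here
        yield {'url': response.url}

Keeping the known URLs in memory avoids a database round-trip per item, at the cost of loading them once when the spider starts.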

Related

Scrapy stopping pagination on condition?

So I want to scrape articles from a site that has pagination. Basically, every page is a list of article links; the spider follows those links into a parse_article method, as well as following the successive next-page links. However, is there a way to make this stop after a given number of articles have been scraped? For example, this is what I have so far using a CrawlSpider:
rules = (
    # next-page rule:
    Rule(LinkExtractor(restrict_xpaths="//a[@class='next']"), follow=True),
    # extract all internal links which match this regex:
    Rule(LinkExtractor(allow=('REGEXHERE',), deny=()), callback='parse_article'),
)

def parse_article(self, response):
    # do parsing stuff here
    pass
I want to stop following the next page once I've parsed 150 articles. It doesn't matter if I scrape a little more than 150; I just want to stop going to the next page once I've hit that number. Is there any way to do that? Something like having a counter in the parse_article method? I'm just new to Scrapy so I'm not sure what to try. I looked into depth_limit, but I'm not sure that's what I'm looking for.
Any help would be greatly appreciated, thanks!
You could achieve that by setting the following in your project settings:
CLOSESPIDER_ITEMCOUNT = 150
If you have multiple spiders in your project and only want a particular one to be affected by this setting, set it in the custom_settings class variable of that spider:
custom_settings = {'CLOSESPIDER_ITEMCOUNT': 150}
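For example, a minimal sketch of a spider carrying that setting (the spider name, start URL, XPath and regex are placeholders):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class ArticleSpider(CrawlSpider):
    name = "articles"                       # placeholder name
    start_urls = ["http://example.com/"]    # placeholder start URL
    # Close this spider (and only this spider) after roughly 150 items.
    custom_settings = {'CLOSESPIDER_ITEMCOUNT': 150}

    rules = (
        Rule(LinkExtractor(restrict_xpaths="//a[@class='next']"), follow=True),
        Rule(LinkExtractor(allow=('REGEXHERE',)), callback='parse_article', follow=False),
    )

    def parse_article(self, response):
        yield {'url': response.url}  # minimal item so the item count grows

Note that the spider may overshoot the limit slightly, since requests already in flight are still processed after the threshold is reached; the asker said that is acceptable.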
The approach I take in my spiders is to have a donescraping flag that I check first thing in each of my parse_* functions, returning an empty result when it is set.
This adds the graceful behavior of allowing items and URLs already in the download queue to finish being processed, while not fetching any MORE items.
I've never used CLOSESPIDER_ITEMCOUNT, so I don't know whether it closes the spider "gracefully"; I expect it does not.
At the beginning of every parse function:
# early exit if done scraping
if self.donescraping:
    return None
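A fuller sketch of that pattern: the 150 threshold and the donescraping flag come from the discussion above, while the counter logic, selectors and names are assumptions about how you might wire it up:

import scrapy

class ArticleSpider(scrapy.Spider):
    name = "articles"       # placeholder
    article_limit = 150

    def __init__(self, *args, **kwargs):
        super(ArticleSpider, self).__init__(*args, **kwargs)
        self.donescraping = False
        self.article_count = 0

    def parse(self, response):
        # early exit if done scraping
        if self.donescraping:
            return
        for href in response.css("a.article::attr(href)").extract():  # placeholder selector
            yield scrapy.Request(response.urljoin(href), callback=self.parse_article)
        next_page = response.css("a.next::attr(href)").extract_first()  # placeholder selector
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)

    def parse_article(self, response):
        if self.donescraping:
            return
        self.article_count += 1
        if self.article_count >= self.article_limit:
            self.donescraping = True  # stop issuing new requests from here on
        yield {'url': response.url}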

Display the number of duplicate requests filtered in the post-crawler stats

One of the Scrapy spiders (version 0.21) I'm running isn't pulling all the items I'm trying to scrape.
The stats show that 283 items were pulled, but I'm expecting well above 300 here. I suspect that some of the links on the site are duplicates, as the logs show the first duplicate request, but I'd like to know exactly how many duplicates were filtered so I'd have more conclusive proof. Preferably in the form of an additional stat at the end of the crawl.
I know that the latest version of Scrapy already does this, but I'm kinda stuck with 0.21 at the moment and I can't see any way to replicate that functionality with what I've got. There doesn't seem to be a signal emitted when a duplicate url is filtered, and DUPEFILTER_DEBUG doesn't seem to work either.
Any ideas on how I can get what I need?
You can maintain a list of the URLs you have already crawled; whenever you come across a URL that is already in that list, you can log it and increment a counter. A sketch of that idea is below.
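A minimal sketch of that suggestion inside a spider, using a set rather than a list for fast membership checks. It assumes a reasonably recent Scrapy (on 0.21, helpers such as response.urljoin do not exist and you would use urlparse.urljoin instead), and the spider name, start URL and selector are placeholders:

import logging
import scrapy

class DupeCountingSpider(scrapy.Spider):
    name = "dupecounter"                   # placeholder
    start_urls = ["http://example.com/"]   # placeholder

    def __init__(self, *args, **kwargs):
        super(DupeCountingSpider, self).__init__(*args, **kwargs)
        self.seen_urls = set()
        self.duplicate_count = 0

    def parse(self, response):
        for href in response.xpath("//a/@href").extract():
            url = response.urljoin(href)
            if url in self.seen_urls:
                # Count and log the duplicate ourselves instead of relying
                # on the dupefilter's (missing) stats.
                self.duplicate_count += 1
                logging.debug("Duplicate URL skipped: %s", url)
                continue
            self.seen_urls.add(url)
            yield scrapy.Request(url, callback=self.parse_item)

    def parse_item(self, response):
        yield {'url': response.url}

    def closed(self, reason):
        # Report the total once the crawl finishes.
        logging.info("Duplicate URLs skipped: %d", self.duplicate_count)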

Screen scraping with Python/Scrapy/Urllib2 seems to be blocked

To help me learn Python, I decided to screen-scrape the football commentaries from the ESPNFC website's 'live' pages (such as here).
It was working up until a day ago, but when I finally sorted some other things out and went to test it, the only piece of commentary I got back was [u'Commentary Not Available'].
Does anyone have any idea how they are doing this, and any easy and quick ways around it? I am using Scrapy/XPath and urllib2.
Edit//
for game_id in processQueue:
    data_text = getInformation(game_id)
    clean_events_dict = getEvents(data_text)
    break
Doesn't work the same as
i = getInformation(369186)
j = getEvents(i)
In the first sample, processQueue is a list of game_ids. The first of these is given to the script to start scraping, and the loop breaks out before it has a chance to move on to another game_id.
In the second sample I use a single game id.
The first one fails and the second one works, and I have absolutely no idea why. Any ideas?
There are a few things you can try, assuming you can still access the data from your browser. Bear in mind, however, that web site operators are generally within their rights to block you; this is why projects that rely on scraping a single site are a risky proposition. Here they are:
Delay a few seconds between each scrape
Delay a random number of seconds between each scrape
Accept cookies during your scraping session
Run JavaScript during your session (not possible with Scrapy as far as I know)
Share the scraping load between several IP ranges
There are other strategies which, I generally argue, are less ethical:
Modify your User Agent string to make your scraper look like a browser
I suggest in this answer here that scrapers should be set up to obey robots.txt. However, if you program your scraper to be well-behaved, site operators will have fewer reasons to go to the trouble of blocking you. The most frequent errors I see in this Stack Overflow tag are simply that scrapers are being run far too fast, and they are accidentally causing a (minor) denial of service. So, try slowing down your scrapes first, and see if that helps.
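A minimal sketch of the slow-down suggestion using Scrapy's own settings (the values are arbitrary examples):

# settings.py (or custom_settings on the spider)

# Wait this many seconds between requests to the same site.
DOWNLOAD_DELAY = 3

# Randomize each delay to between 0.5 * and 1.5 * DOWNLOAD_DELAY,
# so the request pattern looks less mechanical.
RANDOMIZE_DOWNLOAD_DELAY = True

# Keep cookies for the scraping session.
COOKIES_ENABLED = True

# Be polite: obey the site's robots.txt.
ROBOTSTXT_OBEY = True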

How can I group data scraped from multiple pages, using Scrapy, into one Item?

I'm trying to collect a few pieces of information about a bunch of different web sites. I want to produce one Item per site that summarizes the information I found across that site, regardless of which page(s) I found it on.
I feel like this should be an item pipeline, like the duplicates filter example, except I need the final contents of the Item, not the results from the first page the crawler examined.
So I tried using request.meta to pass a single partially-filled Item through the various Requests for a given site. To make that work, I had to have my parse callback return exactly one new Request per call until it had no more pages to visit, and then finally return the finished Item. That is a pain if I find multiple links I want to follow, and it breaks entirely if the scheduler throws away one of the requests because of a link cycle.
The only other approach I can see is to dump the spider output to json-lines and post-process it with an external tool. But I'd prefer to fold it into the spider, preferably in a middleware or item pipeline. How can I do that?
How about this ugly solution?
Define a dictionary (a defaultdict(list)) on a pipeline for storing per-site data. In process_item you can just append a dict(item) to the list for that site and raise a DropItem exception. Then, in the close_spider method, you can dump the data wherever you want (see the sketch below).
It should work in theory, but I'm not sure this solution is the best one.
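A minimal sketch of that pipeline, assuming each item carries a 'site' field to group by (the field name and the output path are made up):

import json
from collections import defaultdict

from scrapy.exceptions import DropItem

class SiteSummaryPipeline(object):

    def open_spider(self, spider):
        # One list of partial results per site.
        self.per_site = defaultdict(list)

    def process_item(self, item, spider):
        # Accumulate the item under its site and drop it, so nothing
        # half-finished reaches the feed exporters.
        self.per_site[item['site']].append(dict(item))
        raise DropItem("Buffered for per-site summary")

    def close_spider(self, spider):
        # Dump the grouped data wherever you want; JSON is just an example.
        with open('site_summaries.json', 'w') as f:
            json.dump(self.per_site, f)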
If you want a summary, Stats Collection would be another approach:
http://doc.scrapy.org/en/0.16/topics/stats.html
For example, to get the total number of pages crawled for each website, you could use something like:
stats.inc_value('pages_crawled:%s' % socket.gethostname())
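As a variation on that snippet, here is a hedged sketch of reaching the stats collector from inside a spider in more recent Scrapy versions, keying the counter on the page's hostname rather than socket.gethostname():

import scrapy
from urllib.parse import urlparse  # on Python 2: from urlparse import urlparse

class SummarySpider(scrapy.Spider):
    name = "summary"                       # placeholder
    start_urls = ["http://example.com/"]   # placeholder

    def parse(self, response):
        # Count pages per website using the page's own hostname as the key.
        host = urlparse(response.url).netloc
        self.crawler.stats.inc_value('pages_crawled:%s' % host)
        # ... extract data and follow links here ...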
I had the same issue when I was writing my crawler.
I solved it by passing a list of URLs through the request meta and chaining the requests together; a sketch of that pattern is below.
Refer to the detailed tutorial I wrote here.
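A minimal sketch of that chaining pattern, where a partially-filled item and a list of pages still to visit travel through request.meta; the field names ('emails', 'pending_urls') and the selectors are made up:

import scrapy

class SiteSpider(scrapy.Spider):
    name = "sites"                         # placeholder
    start_urls = ["http://example.com/"]   # placeholder

    def parse(self, response):
        item = {'site': response.url, 'emails': []}              # partially-filled item
        pending = response.xpath("//a/@href").extract()[:5]      # pages to visit next
        return self.follow_next(item, pending, response)

    def parse_page(self, response):
        item = response.meta['item']
        pending = response.meta['pending_urls']
        # Merge whatever this page contributes into the shared item.
        item['emails'].extend(
            response.xpath("//a[starts-with(@href, 'mailto:')]/text()").extract())
        return self.follow_next(item, pending, response)

    def follow_next(self, item, pending, response):
        if not pending:
            return item  # no more pages: the item is finished
        next_url = response.urljoin(pending.pop())
        return scrapy.Request(
            next_url,
            callback=self.parse_page,
            meta={'item': item, 'pending_urls': pending},
            dont_filter=True,  # don't let the dupefilter break the chain
        )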

How can I iterate through the pages of a website using Python?

I'm new to software development, and I'm not sure how to go about this. I want to visit every page of a website and grab a specific bit of data from each one. My problem is, I don't know how to iterate through all of the existing pages without knowing the individual urls ahead of time. For example, I want to visit every page whose url starts with
"http://stackoverflow.com/questions/"
Is there a way to compile a list and then iterate through that, or is it possible to do this without creating a giant list of urls?
Try Scrapy.
It handles all of the crawling for you and lets you focus on processing the data rather than extracting it. Instead of copy-pasting the code that is already in the tutorial, I'll leave it to you to read it; a rough sketch of the shape of such a spider is below.
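A rough sketch of a CrawlSpider that follows only URLs under /questions/, just to show the shape; the allow pattern and the title selector are assumptions, not tested against the real site:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class QuestionsSpider(CrawlSpider):
    name = "questions"
    allowed_domains = ["stackoverflow.com"]
    start_urls = ["http://stackoverflow.com/questions/"]

    rules = (
        # Follow every link matching /questions/<number>/ and parse it.
        Rule(LinkExtractor(allow=(r'/questions/\d+/',)),
             callback='parse_question', follow=True),
    )

    def parse_question(self, response):
        yield {
            'url': response.url,
            'title': response.xpath('//h1//text()').extract_first(),  # assumed selector
        }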
To grab a specific bit of data from a web site you could use a web-scraping tool such as Scrapy.
If the required data is generated by JavaScript, then you might need a browser-like tool such as Selenium WebDriver and implement the crawling of the links by hand.
For example, you can make a simple for loop, like this:
def webIterate():
    base_link = "http://stackoverflow.com/questions/"
    for i in xrange(24):
        print "%s%d" % (base_link, i)

The output will be:
http://stackoverflow.com/questions/0
http://stackoverflow.com/questions/1
http://stackoverflow.com/questions/2
...
http://stackoverflow.com/questions/23
It's just an example. You can pass in the number of questions and do whatever you want with the generated URLs.
