How to crawl and scrape data at the same time? - python

It's my first experience with web scraping and I'm not sure I'm doing it right. The thing is, I want to crawl and scrape data at the same time:
1. Get all the links that I'm going to scrape
2. Store them in MongoDB
3. Visit them one by one to scrape their content
# Crawling: get all links to be scraped later on
class LinkCrawler(Spider):
    name = "link"
    allowed_domains = ["website.com"]
    start_urls = ["https://www.website.com/offres?start=%s" % start
                  for start in xrange(0, 10000, 20)]

    def parse(self, response):
        # loop over all pages
        next_page = Selector(response).xpath(
            '//li[@class="active"]/following-sibling::li[1]/a/@href').extract()
        if not not next_page:
            yield Request("https://" + next_page[0], callback=self.parse)
        # loop over all links in a single page
        items = []
        links = Selector(response).xpath(
            '//div[@class="row-fluid job-details pointer"]'
            '/div[@class="bloc-right"]/div[@class="row-fluid"]')
        for link in links:
            item = Link()
            url = response.urljoin(link.xpath('a/@href')[0].extract())
            item['url'] = url
            items.append(item)
        for item in items:
            yield item
# Scraping: get all the stored links from MongoDB and scrape them????

What exactly is your use case? Are you primarily interested in the links or in the content of the pages they lead to? I.e., is there any reason to first store the links in MongoDB and scrape the pages later? If you really need to store links in MongoDB, it's best to use an item pipeline to store the items; the item pipeline documentation even includes an example of storing items in MongoDB. If you need something more sophisticated, look at the scrapy-mongodb package.
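If you do go the pipeline route, a minimal sketch in the spirit of the docs example might look like this (pymongo is assumed, and the connection string, database, and collection names are illustrative):

import pymongo

class MongoPipeline(object):
    # minimal item pipeline that writes every scraped item to MongoDB

    def open_spider(self, spider):
        # called once when the spider starts
        self.client = pymongo.MongoClient("mongodb://localhost:27017")
        self.db = self.client["scraping"]

    def close_spider(self, spider):
        # called once when the spider finishes
        self.client.close()

    def process_item(self, item, spider):
        # called for every item the spider yields
        self.db["links"].insert_one(dict(item))
        return item

You would then enable it via ITEM_PIPELINES in settings.py.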
Other than that, here are some comments on the actual code you posted:
Instead of Selector(response).xpath(...), use just response.xpath(...).
If you need only the first extracted element from a selector, use extract_first() instead of extract() plus indexing.
Don't use if not not next_page:, use if next_page:.
The second loop over items is not needed; yield item inside the loop over links.
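Putting all four suggestions together, a revised spider might look roughly like this (a sketch: Link is the item class from the question, and the imports are assumed):

from scrapy import Spider, Request

class LinkCrawler(Spider):
    name = "link"
    allowed_domains = ["website.com"]
    start_urls = ["https://www.website.com/offres?start=%s" % start
                  for start in xrange(0, 10000, 20)]

    def parse(self, response):
        # follow the pagination if a next page exists
        next_page = response.xpath(
            '//li[@class="active"]/following-sibling::li[1]/a/@href').extract_first()
        if next_page:
            yield Request("https://" + next_page, callback=self.parse)
        # yield one item per link on the current page; no intermediate list needed
        for link in response.xpath('//div[@class="row-fluid job-details pointer"]'
                                   '/div[@class="bloc-right"]/div[@class="row-fluid"]'):
            item = Link()
            item['url'] = response.urljoin(link.xpath('a/@href').extract_first())
            yield item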

Related

Scrapy - Visiting nested links and grabbing meta data from each level

I am relatively new to Scrapy and have been getting a lot of exceptions...
Here is what I am trying to do:
There are 4 nested links that I want to grab data from.
Let's say I have 5 items that I want to crawl in total. These items are:
Industry = scrapy.Field()
Company = scrapy.Field()
Contact_First_name = scrapy.Field()
Contact_Last_name = scrapy.Field()
Website = scrapy.Field()
Now, to begin crawling, I first have to get the Industry. The Industry element also contains the link to the listing of companies that belong to that Industry segment.
Next I want to follow that link. The listing page itself does not contain any data that I want to crawl, but it does contain href links to individual companies, each of which has its own basic info page.
Using an href link from the listing page, I arrive at a page that contains the information for one company. There I want to scrape the Company, Address, and Website.
On that page there is another href link that I need to follow in order to reach Contact_First_Name and Contact_Last_Name; following it, I arrive at another page that contains Contact_First_Name and Contact_Last_Name.
After crawling all of these pages, I should have items that look somewhat like this:
Industry    Company  Website    Contact_First_Name  Contact_Last_Name
Finance     JPMC     JP.com     Jamie               Dimon
Finance     BOA      BOA.com    Bryan               Moynihan
Technology  ADSK     ADSK.com   Carl                Bass
EDITED
Here is the code that now works. Anzel's recommendations really helped, but I realized my allowed_domains value was wrong, which stopped the nested links from being followed. Once I changed it, it works.
class PschamberSpider(scrapy.Spider):
    name = "pschamber"
    allowed_domains = ["cm.pschamber.com"]
    start_urls = ["http://cm.pschamber.com/list/"]

    def parse(self, response):
        item = PschamberItem()
        for sel in response.xpath('//*[@id="mn-ql"]/ul/li/a'):
            # xpath() returns a list of selectors;
            # extract()[0] returns the first element in the extracted list
            item['Industry'] = sel.xpath('text()').extract()
            # scrapy.Request takes a url string, not a list of hrefs,
            # and the Request object is what gets yielded
            yield scrapy.Request(sel.xpath('@href').extract()[0],
                                 callback=self.parse_2, meta={'item': item})

    def parse_2(self, response):
        for sel in response.xpath('.//*[@id="mn-members"]/div/div/div/div/div/a'):
            # the item is carried along via response.meta
            item = response.meta['item']
            item['Company'] = sel.xpath('text()').extract()
            # again, yield the Request object
            yield scrapy.Request(sel.xpath('@href').extract()[0],
                                 callback=self.parse_3, meta={'item': item})

    def parse_3(self, response):
        item = response.meta['item']
        item['Website'] = response.xpath(
            './/*[@id="mn-memberinfo-block-website"]/a/@href').extract()
        # done: return the populated item object
        return item
There are quite a few mistakes in your code, which is why it doesn't run as you expected. Please see the brief sample below for how to get the items you need and pass meta to the other callbacks. I am not copying your xpath; I just grabbed the most straightforward ones from the site, so apply your own.
I will comment as clearly as possible to show where you went wrong.
class PschamberSpider(scrapy.Spider):
    name = "pschamber"
    # start from this: since your domain is a sub-domain on its own,
    # you need to change allowed_domains to this, without http://
    allowed_domains = ["cm.pschamber.com"]
    start_urls = (
        'http://cm.pschamber.com/list/',
    )

    def parse(self, response):
        item = PschamberItem()
        for sel in response.xpath('//div[@id="mn-ql"]//a'):
            # xpath() and xpath().extract() return a list;
            # extract()[0] returns the first element in the list
            item['industry'] = sel.xpath('text()').extract()[0]
            # another mistake you made here:
            # you're trying to call scrapy.Request(LIST of hrefs), which will fail,
            # because scrapy.Request only takes a url string, not a list.
            # Another big mistake is that you're trying to yield the item,
            # whereas you should yield the Request object
            yield scrapy.Request(
                sel.xpath('@href').extract()[0],
                callback=self.parse_2,
                meta={'item': item}
            )

    # another mistake: your callback function DOESN'T take item as an argument
    def parse_2(self, response):
        for sel in response.xpath('//div[@class="mn-title"]//a'):
            # you can access your response meta like this
            item = response.meta['item']
            item['company'] = sel.xpath('text()').extract()[0]
            # again, yield the Request object
            yield scrapy.Request(
                sel.xpath('@href').extract()[0],
                callback=self.parse_3,
                meta={'item': item}
            )

    def parse_3(self, response):
        item = response.meta['item']
        item['website'] = response.xpath('//a[@class="mn-print-url"]/text()').extract()
        # OK, finally, assuming you're done, just return the item object
        return item
Hope this is self-explanatory and you now understand the basics of Scrapy. You should read the Scrapy docs thoroughly, and soon you will learn another method: setting rules to follow links with certain patterns... of course, once you get the basics right, you will understand that too.
Although everyone's journey differs, I strongly recommend you keep reading and practicing until you're confident in what you're doing before crawling an actual website. Also, there are rules protecting which web content may be scraped, and the content you scrape may be under copyright.
Keep this in mind or you may find yourself in big trouble in the future. Anyway, good luck, and I hope this answer helps you resolve the problem!

Scrapy: Item details on 3 different pages

I have to scrape something where part of the information is on one page, a link on that page leads to another page with more information, and then another url holds the third piece of information.
How do I set up my callbacks in order to have all this information together? Will I have to use a database in this case, or can it still be exported to CSV?
The first thing to say is that you have the right idea: callbacks are the solution. I have seen some use of urllib or similar to fetch dependent pages, but it's far preferable to fully leverage the Scrapy download mechanism than to employ a synchronous call from another library.
See this example from the Scrapy docs on the issue:
http://doc.scrapy.org/en/latest/topics/request-response.html#passing-additional-data-to-callback-functions
def parse_page1(self, response):
    item = MyItem()
    item['main_url'] = response.url
    # parse response and populate item as required
    request = Request("http://www.example.com/some_page.html",
                      callback=self.parse_page2)
    request.meta['item'] = item
    return request

def parse_page2(self, response):
    item = response.meta['item']
    # parse response and populate item as required
    item['other_url'] = response.url
    return item
Is your third piece of data on a page linked from the first page or the second page?
If from the second page, you can just extend the mechanism above and have parse_page2 return a request with a callback to a new parse_page3.
If from the first page, you could have parse_page1 populate a request.meta['link3_url'] property from which parse_page2 can construct the subsequent request url.
NB - these 'secondary' and 'tertiary' urls should not be discoverable from the normal crawling process (start_urls and rules), but should be constructed from the response (using XPath etc) in parse_page1/parse_page2.
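For the second-page case, a rough sketch (the xpath and the third field name here are placeholders):

def parse_page2(self, response):
    item = response.meta['item']
    item['other_url'] = response.url
    # extract the third url from this response (placeholder xpath)
    link3_url = response.xpath('//a[@class="more-details"]/@href').extract_first()
    request = Request(response.urljoin(link3_url), callback=self.parse_page3)
    request.meta['item'] = item
    return request

def parse_page3(self, response):
    item = response.meta['item']
    # parse response and populate item as required
    item['third_url'] = response.url
    return item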
The crawling, callback structures, pipelines and item construction are all independent of the export of data, so CSV will be applicable.
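For example, once the spider is written, the built-in CSV feed exporter can be invoked from the command line (myspider stands in for your spider's name):

scrapy crawl myspider -o items.csv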

How to store the URLs crawled with Scrapy?

I have a web crawler that crawls for news stories on a web page.
I know how to use the XpathSelector to scrape certain information from the elements on the page.
However, I cannot seem to figure out how to store the URL of the page that was just crawled.
class spidey(CrawlSpider):
    name = 'spidey'
    start_urls = ['http://nytimes.com']  # urls from which the spider will start crawling
    rules = [
        # r'page/\d+' : regular expression for http://nytimes.com/page/X URLs
        Rule(SgmlLinkExtractor(allow=[r'page/\d+']), follow=True),
        # r'\d{4}/\d{2}/\w+' : regular expression for http://nytimes.com/YYYY/MM/title URLs
        Rule(SgmlLinkExtractor(allow=[r'\d{4}/\d{2}/\w+']), callback='parse_articles'),
    ]
I want to store every link that matches those rules.
What would I need to add to parse_articles to store the link in my item?
def parse_articles(self, response):
    item = SpideyItem()
    item['link'] = ???
    return item
response.url is what you are looking for.
See the docs on the Response object, and check this simple example.
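Applied to the spider above:

def parse_articles(self, response):
    item = SpideyItem()
    item['link'] = response.url  # the url of the page that produced this response
    return item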

How can I make two requests simultaneously with scrapy

I am scraping job sites where the first page has the links to all the jobs.
Right now I am storing the title, job, and company from the first page.
But I also want to store the description, which is available by clicking on the job title, together with the current items.
This is my current code:
def parse(self, response):
    hxs = HtmlXPathSelector(response)
    sites = hxs.select("//div[@class='jobenteries']")
    items = []
    for site in sites[:3]:
        print "Hello"
        item = DmozItem()
        item['title'] = site.select('a/text()').extract()
        item['desc'] = ''
        items.append(item)
    return items
But the description is on the page behind the job-title link. How can I do that?
From the first page, return Requests for the second page and pass the data for each item in the request.meta dict. On the callback method for the second page you can read the data you passed and return the fully populated item.
See Passing additional data to callback functions in the scrapy docs for more details and an example.
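A sketch of what that could look like here (the detail-page xpaths are placeholders, response.xpath is the modern equivalent of HtmlXPathSelector.select, and both methods live on the spider class):

from scrapy import Request

def parse(self, response):
    for site in response.xpath("//div[@class='jobenteries']")[:3]:
        item = DmozItem()
        item['title'] = site.xpath('a/text()').extract()
        detail_url = response.urljoin(site.xpath('a/@href').extract_first())
        # pass the partially filled item along to the detail-page callback
        yield Request(detail_url, callback=self.parse_desc, meta={'item': item})

def parse_desc(self, response):
    item = response.meta['item']
    # placeholder xpath for the description element
    item['desc'] = response.xpath("//div[@class='description']//text()").extract()
    yield item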

Scrapy - parse a page to extract items - then follow and store item url contents

I have a question on how to do this in Scrapy. I have a spider that crawls listing pages of items.
Every time a listing page with items is found, the parse_item() callback is called to extract the items' data and yield the items. So far so good; everything works great.
But each item has, among other data, a url with more details on that item. I want to follow that url and store the fetched contents of that item's url in another item field (url_contents).
I'm not sure how to organize the code to achieve that, since the two links (the listings link and the individual item link) are followed differently, with callbacks called at different times, but I have to correlate them in the processing of a single item.
My code so far looks like this:
class MySpider(CrawlSpider):
    name = "example.com"
    allowed_domains = ["example.com"]
    start_urls = [
        "http://www.example.com/?q=example",
    ]

    rules = (
        Rule(SgmlLinkExtractor(allow=('example\.com', 'start='), deny=('sort='),
                               restrict_xpaths='//div[@class="pagination"]'),
             callback='parse_item'),
        Rule(SgmlLinkExtractor(allow=('item\/detail', )), follow=False),
    )

    def parse_item(self, response):
        main_selector = HtmlXPathSelector(response)
        xpath = '//h2[@class="title"]'
        sub_selectors = main_selector.select(xpath)
        for sel in sub_selectors:
            item = ExampleItem()
            l = ExampleLoader(item=item, selector=sel)
            l.add_xpath('title', 'a[@title]/@title')
            # ......
            yield l.load_item()
After some testing and thinking, I found a solution that works for me.
The idea is to use just the first rule, which gives you listings of items, and, very importantly, to add follow=True to that rule.
In parse_item() you have to yield a request instead of an item, but only after you load the item. The request is for the item's detail url. You have to send the loaded item along with that request to its callback; there you do your work with the response, and that is where you yield the item.
So the end of parse_item() will look like this:
itemloaded = l.load_item()
# fill url contents
url = sel.select(item_url_xpath).extract()[0]
request = Request(url, callback=self.parse_url_contents)
request.meta['item'] = itemloaded
yield request
And then parse_url_contents() will look like this:
def parse_url_contents(self, response):
    item = response.request.meta['item']
    item['url_contents'] = response.body
    yield item
If anyone has another (better) approach, let us know.
Stefan
I'm sitting with exactly the same problem, and from the fact that no one answered your question for 2 days I take it that the only solution is to follow that URL manually, from within your parse_item function.
I'm new to Scrapy, so I wouldn't attempt it with that (although I'm sure it's possible), but my solution will be to use urllib and BeautifulSoup to load the second page manually, extract the information myself, and save it as part of the item. Yes, it's more trouble than Scrapy's normal parsing, but it should get the job done with the least hassle.
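A rough sketch of that manual approach (Python 2, with the bs4 package assumed; note that a blocking fetch like this stalls Scrapy's asynchronous downloader, which is one reason the Request/meta approach above is preferable):

import urllib2
from bs4 import BeautifulSoup

def fetch_url_contents(url):
    # synchronously download the item's detail page
    html = urllib2.urlopen(url).read()
    # parse it and return the visible text
    return BeautifulSoup(html, 'html.parser').get_text()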
