How can I parse several differently structured websites with the same spider? - python

All the websites I want to parse are in the same domain but all look very different and contain different information I need.
My start_url is a page with a list containing all links I need. So in the parse() method I yield a request for each of these links and in parse_item_page I extract the first part of the information I need - which worked completely fine.
My problem is: I thought I could just do the same thing a second time and, for each link on my item page, call parse_entry. But I have tried so many different versions of this and I just can't get it to work. The URLs are correct, but Scrapy never seems to call a third parse function; nothing in there ever gets executed.
How can I get Scrapy to use parse_entry, or pass all these links to a new spider?
This is a simplified, shorter version of my spider class:
def parse(self, response, **kwargs):
    for href in response.xpath("//listItem/@href"):
        url = response.urljoin(href.extract())
        yield scrapy.Request(url, callback=self.parse_item_page)

def parse_item_page(self, response):
    for sel in response.xpath("//div"):
        item = items.FirstItem()
        item['attribute'] = sel.xpath("//h1/text()").get().strip()
        for href in response.xpath("//entry/@href"):
            yield response.follow(href.extract(), callback=self.parse_entry)
        yield item

def parse_entry(self, response):
    for sel in response.xpath("//textBlock"):
        item = items.SecondItem()
        item['attribute'] = sel.xpath("//h1/text()").get().strip()
        yield item
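For reference, a hedged sketch of how a three-level callback chain is typically wired in Scrapy (placeholder URLs and selectors, plain dicts instead of the item classes above). When a later callback never runs, common causes worth checking are links that fall outside allowed_domains (the offsite middleware drops them silently) or requests caught by the duplicate filter.

# Hedged sketch, not a confirmed fix: placeholder URLs/selectors,
# dicts instead of the question's item classes.
import scrapy


class ChainedSpider(scrapy.Spider):
    name = "chained_example"
    # must cover every host the entry links point to,
    # otherwise the offsite middleware drops those requests
    allowed_domains = ["example.com"]
    start_urls = ["https://example.com/list"]

    def parse(self, response, **kwargs):
        for href in response.xpath("//listItem/@href").getall():
            yield response.follow(href, callback=self.parse_item_page)

    def parse_item_page(self, response):
        yield {"attribute": response.xpath("//h1/text()").get(default="").strip()}
        for href in response.xpath("//entry/@href").getall():
            # response.follow resolves relative hrefs against the current page
            yield response.follow(href, callback=self.parse_entry)

    def parse_entry(self, response):
        for sel in response.xpath("//textBlock"):
            yield {"attribute": sel.xpath(".//h1/text()").get(default="").strip()}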

Related

How to construct URLs in the start_urls list in the Scrapy framework (Python)

I am new to scrapy and python.
In my case:
Page A:
http://www.example.com/search?keyword=city&style=1&page=1
http://www.example.com/search?keyword=city&style=1&page=2
http://www.example.com/search?keyword=city&style=1&page=3
The rule for these pages is:
for i in range(50):
    "http://www.example.com/search?keyword=city&style=1&page=%s" % i
Page B:
http://www.example.com/city_detail_0001.html
http://www.example.com/city_detail_0100.html
http://www.example.com/city_detail_0053.html
There is no rule for Page B, because its URLs only appear in the search results.
So if I want to grab information from Page B, I first have to use Page A to sift out the links to Page B.
In the past I usually did this in two steps:
1. I wrote spider A and saved the Page B links to a txt file.
2. In spider B, I read that txt file into "start_urls".
Now, can you please guide me on how I can construct "start_urls" in a single spider?
The start_requests method is what you need. After that, keep yielding requests and parse the response bodies in callback methods.
from scrapy import Request, Spider


class MySpider(Spider):
    name = 'example'

    def start_requests(self):
        for i in range(50):
            yield Request('myurl%s' % i, callback=self.parse)

    def parse(self, response):
        # get my information for page B
        yield Request('pageB', callback=self.parse_my_item)

    def parse_my_item(self, response):
        item = {}
        # real parsing method for my items
        yield item

Scrapy - Visiting nested links and grabbing meta data from each level

I am relatively new to scrapy and have been getting a lot of exceptions...
Here is what I am trying to do:
There are 4 nested links that I want to grab data from.
Let's say there are 5 fields that I want to crawl in total. These fields are:
Industry=scrapy.Field()
Company=scrapy.Field()
Contact_First_name=scrapy.Field()
Contact_Last_name=scrapy.Field()
Website=scrapy.Field()
Now to begin crawling I would first have to get the Industry.
The Industry xpath also contains the link to individual listings of companies that belong to their Industry segments.
Next I want to use the Industry xpath and go into the link. This page does not contain any data that I want to crawl. But this page contains href links to individual companies that have their own basic info page.
Using the href link from the listings page, I now arrive at one page that contains the information for one company. Now I want to scrape the Company, Address, and Website.
There is another href link that I need to follow in order to get to Contact_First_Name and Contact_Last_Name.
Using that href link, I arrive at another page that contains the Contact_First_Name and Contact_Last_Name.
After crawling all of these pages, I should have items that look somewhat like this:
Industry     Company   Website    Contact_First_Name   Contact_Last_Name
Finance      JPMC      JP.com     Jamie                Dimon
Finance      BOA       BOA.com    Bryan                Moynihan
Technology   ADSK      ADSK.com   Carl                 Bass
EDITED
Here is the code that is working. Anzel's recommendations really helped out, but I realized the allowed_domains value was wrong, which stopped the nested links from being followed. Once I changed it, it works.
class PschamberSpider(scrapy.Spider):
    name = "pschamber"
    allowed_domains = ["cm.pschamber.com"]
    start_urls = ["http://cm.pschamber.com/list/"]

    def parse(self, response):
        item = PschamberItem()
        for sel in response.xpath('//*[@id="mn-ql"]/ul/li/a'):
            # xpath() returns a list of selectors; extract()[0] gives the first match
            item['Industry'] = sel.xpath('text()').extract()
            # yield the Request object (a single href string), not the item
            yield scrapy.Request(sel.xpath('@href').extract()[0],
                                 callback=self.parse_2, meta={'item': item})

    def parse_2(self, response):
        for sel in response.xpath('.//*[@id="mn-members"]/div/div/div/div/div/a'):
            # access the response meta like this
            item = response.meta['item']
            item['Company'] = sel.xpath('text()').extract()
            yield scrapy.Request(sel.xpath('@href').extract()[0],
                                 callback=self.parse_3, meta={'item': item})

    def parse_3(self, response):
        item = response.meta['item']
        item['Website'] = response.xpath('.//*[@id="mn-memberinfo-block-website"]/a/@href').extract()
        # done with this item chain, so return the item object
        return item
There are quite a few mistakes in your code, which is why it isn't running as you expected. Please see my brief sample below showing how to get the items you need and how to pass meta to other callbacks. I am not copying your XPaths; I just grab the most straightforward ones from the site, and you can apply your own.
I will try to comment as clearly as possible to let you know where you went wrong.
class PschamberSpider(scrapy.Spider):
    name = "pschamber"
    # start from this: since your domain is a sub-domain on its own,
    # you need to change to this, without http://
    allowed_domains = ["cm.pschamber.com"]
    start_urls = (
        'http://cm.pschamber.com/list/',
    )

    def parse(self, response):
        item = PschamberItem()
        for sel in response.xpath('//div[@id="mn-ql"]//a'):
            # xpath() and xpath().extract() will return a list
            # extract()[0] will return the first element in the list
            item['industry'] = sel.xpath('text()').extract()[0]
            # another mistake you made here:
            # you're trying to call scrapy.Request(LIST of hrefs), which will fail --
            # scrapy.Request only takes a url string, not a list
            # another big mistake is you're trying to yield the item,
            # whereas you should yield the Request object
            yield scrapy.Request(
                sel.xpath('@href').extract()[0],
                callback=self.parse_2,
                meta={'item': item}
            )

    # another mistake: your callback function DOESN'T take item as an argument
    def parse_2(self, response):
        for sel in response.xpath('//div[@class="mn-title"]//a'):
            # you can access your response meta like this
            item = response.meta['item']
            item['company'] = sel.xpath('text()').extract()[0]
            # again, yield the Request object
            yield scrapy.Request(
                sel.xpath('@href').extract()[0],
                callback=self.parse_3,
                meta={'item': item}
            )

    def parse_3(self, response):
        item = response.meta['item']
        item['website'] = response.xpath('//a[@class="mn-print-url"]/text()').extract()
        # OK, finally assume you're done, just return the item object
        return item
I hope this is self-explanatory and that you come to understand the basics of Scrapy. You should read the Scrapy documentation thoroughly, and soon you will learn other methods, such as setting rules to follow links with certain patterns... of course, once you get the basics right, you will understand them.
Although everyone's journey differs, I strongly recommend you keep reading and practicing until you're confident in what you're doing before crawling an actual website. Also, there are rules protecting which web content may be scraped, and copyright applies to the content you scrape.
Keep this in mind or you may find yourself in big trouble in the future. Anyway, good luck, and I hope this answer helps you resolve the problem!
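As a side note beyond the answer above: newer Scrapy versions (1.7+) also provide cb_kwargs for passing data to the next callback, which makes it easy to build a fresh item per followed link instead of sharing one mutable item across loop iterations. A minimal hedged sketch, reusing the selectors from the answer and a plain dict in place of PschamberItem:

# Hedged sketch (Scrapy >= 1.7): cb_kwargs instead of meta,
# with a fresh dict per followed link.
import scrapy


class ChamberSpider(scrapy.Spider):
    name = "pschamber_cb_kwargs"
    allowed_domains = ["cm.pschamber.com"]
    start_urls = ["http://cm.pschamber.com/list/"]

    def parse(self, response):
        for a in response.xpath('//div[@id="mn-ql"]//a'):
            # a new dict per link, so parallel requests don't overwrite each other
            item = {"industry": a.xpath("text()").get()}
            yield response.follow(a, callback=self.parse_company, cb_kwargs={"item": item})

    def parse_company(self, response, item):
        # cb_kwargs entries arrive as keyword arguments on the callback
        item["company"] = response.xpath('//div[@class="mn-title"]//a/text()').get()
        yield item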

Writing a crawler to parse a site in scrapy using BaseSpider

I am getting confused about how to design the architecture of the crawler.
I have a search page where I have:
pagination: next page links to follow
a list of products on one page
individual links to be crawled to get the description
I have the following code:
def parse_page(self, response):
    hxs = HtmlXPathSelector(response)
    sites = hxs.select('//ol[@id=\'result-set\']/li')
    items = []
    for site in sites[:2]:
        item = MyProduct()
        item['product'] = myfilter(site.select('h2/a').select("string()").extract())
        item['product_link'] = myfilter(site.select('dd[2]/').select("string()").extract())
        if item['product_link']:
            request = Request(urljoin('http://www.example.com', item['product_link']),
                              callback=self.parseItemDescription)
            request.meta['item'] = item
            return request

    soup = BeautifulSoup(response.body)
    mylinks = soup.find_all("a", text="Next")
    nextlink = mylinks[0].get('href')
    yield Request(urljoin(response.url, nextlink), callback=self.parse_page)
The problem is that I have two return statements: one for request, and one for yield.
With a CrawlSpider I don't need the last yield, so everything was working fine, but with BaseSpider I have to follow links manually.
What should I do?
As an initial pass (and based on your comment about wanting to do this yourself), I would suggest taking a look at the CrawlSpider code to get an idea of how to implement its functionality.
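As a point of reference (separate from the CrawlSpider suggestion above), a common way to implement that functionality manually is to yield both the detail requests and the next-page request from the same callback; in Python 2 a function that contains yield cannot also return a value, which is why mixing return and yield fails. A hedged sketch with placeholder selectors and URLs:

# Hedged sketch, placeholder selectors/URLs: yield the per-product requests
# and the pagination request from the same callback instead of returning early.
from urllib.parse import urljoin

import scrapy


class ProductSpider(scrapy.Spider):
    name = "product_example"
    start_urls = ["http://www.example.com/search?q=example"]

    def parse(self, response):
        # one request per product detail page
        for href in response.xpath('//ol[@id="result-set"]/li/h2/a/@href').getall():
            yield scrapy.Request(urljoin(response.url, href), callback=self.parse_description)

        # and, from the same method, the "Next" page request
        next_href = response.xpath('//a[text()="Next"]/@href').get()
        if next_href:
            yield scrapy.Request(urljoin(response.url, next_href), callback=self.parse)

    def parse_description(self, response):
        # placeholder extraction for the product description
        yield {"url": response.url, "description": response.xpath("string(//body)").get()}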

Combining base url with resultant href in scrapy

below is my spider code,
class Blurb2Spider(BaseSpider):
    name = "blurb2"
    allowed_domains = ["www.domain.com"]

    def start_requests(self):
        yield self.make_requests_from_url("http://www.domain.com/bookstore/new")

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        urls = hxs.select('//div[@class="bookListingBookTitle"]/a/@href').extract()
        for i in urls:
            yield Request(urlparse.urljoin('www.domain.com/', i[1:]), callback=self.parse_url)

    def parse_url(self, response):
        hxs = HtmlXPathSelector(response)
        print response, '------->'
Here I am trying to combine the href link with the base link, but I am getting the following error:
exceptions.ValueError: Missing scheme in request url: www.domain.com//bookstore/detail/3271993?alt=Something+I+Had+To+Do
Can anyone let me know why I am getting this error, and how to join the base URL with the href link and yield a request?
An alternative solution, if you don't want to use urlparse:
response.urljoin(i[1:])
This solution goes even a step further: here Scrapy works out the domain base for joining. And as you can see, you don't have to provide the obvious http://www.example.com for joining.
This makes your code reusable in the future if you want to change the domain you are crawling.
It is because you didn't add the scheme, e.g. http://, in your base URL.
Try: urlparse.urljoin('http://www.domain.com/', i[1:])
Or, even easier: urlparse.urljoin(response.url, i[1:]), as urlparse.urljoin will sort out the base URL itself.
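A quick illustration of the difference, with the outputs shown as comments (in Python 3 the function lives in urllib.parse):

# urljoin only produces a fetchable URL when the base carries a scheme
from urllib.parse import urljoin  # Python 2: from urlparse import urljoin

print(urljoin('www.domain.com/', 'bookstore/new'))
# -> 'www.domain.com/bookstore/new'   (no scheme, so Scrapy raises "Missing scheme")

print(urljoin('http://www.domain.com/', 'bookstore/new'))
# -> 'http://www.domain.com/bookstore/new'   (a valid request URL)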
The best way to follow a link in Scrapy is to use response.follow(); Scrapy will handle the rest.
(See the Scrapy documentation on Response.follow for more info.)
Quote from docs:
Unlike scrapy.Request, response.follow supports relative URLs directly - no need to call urljoin.
Also, you can pass an <a> element directly as the argument.
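Applied to the spider in the question, a hedged sketch of what that looks like (XPath taken from the question):

# response.follow resolves relative hrefs and accepts <a> selectors directly
def parse(self, response):
    for a in response.xpath('//div[@class="bookListingBookTitle"]/a'):
        yield response.follow(a, callback=self.parse_url)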

Scrapy - parse a page to extract items - then follow and store item url contents

I have a question on how to do this thing in scrapy. I have a spider that crawls for listing pages of items.
Every time a listing page with items is found, the parse_item() callback is called to extract the item data and yield items. So far so good, everything works great.
But each item has, among other data, a URL with more details about that item. I want to follow that URL and store the fetched contents of the item's URL in another item field (url_contents).
I'm not sure how to organize the code to achieve that, since the two links (the listings link and the particular item link) are followed differently, with callbacks called at different times, but I have to correlate them in the processing of the same item.
My code so far looks like this:
class MySpider(CrawlSpider):
    name = "example.com"
    allowed_domains = ["example.com"]
    start_urls = [
        "http://www.example.com/?q=example",
    ]

    rules = (
        Rule(SgmlLinkExtractor(allow=('example\.com', 'start='), deny=('sort='),
                               restrict_xpaths='//div[@class="pagination"]'),
             callback='parse_item'),
        Rule(SgmlLinkExtractor(allow=('item\/detail', )), follow=False),
    )

    def parse_item(self, response):
        main_selector = HtmlXPathSelector(response)
        xpath = '//h2[@class="title"]'
        sub_selectors = main_selector.select(xpath)
        for sel in sub_selectors:
            item = ExampleItem()
            l = ExampleLoader(item=item, selector=sel)
            l.add_xpath('title', 'a[@title]/@title')
            ......
            yield l.load_item()
After some testing and thinking, I found this solution that works for me.
The idea is to use just the first rule, which gives you listings of items, and, very importantly, to add follow=True to that rule.
In parse_item() you have to yield a request instead of an item, but only after you load the item. The request is for the item detail URL, and you have to send the loaded item along to that request's callback. You do your work with the response there, and that is where you yield the item.
So the finish of parse_item() will look like this:
itemloaded = l.load_item()
# fill url contents
url = sel.select(item_url_xpath).extract()[0]
request = Request(url, callback = lambda r: self.parse_url_contents(r))
request.meta['item'] = itemloaded
yield request
And then parse_url_contents() will look like this:
def parse_url_contents(self, response):
    item = response.request.meta['item']
    item['url_contents'] = response.body
    yield item
If anyone has another (better) approach, let us know.
Stefan
I'm sitting with exactly the same problem, and from the fact that no one has answered your question for 2 days I take it that the only solution is to follow that URL manually, from within your parse_item function.
I'm new to Scrapy, so I wouldn't attempt it that way (although I'm sure it's possible), but my solution would be to use urllib and BeautifulSoup to load the second page manually, extract the information myself, and save it as part of the item. Yes, it's much more trouble than Scrapy's normal parsing, but it should get the job done with the least hassle.
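For completeness, a hedged sketch of that manual approach (assuming Python 3, BeautifulSoup's html.parser, and the XPaths and ExampleItem fields from the question); note that a blocking fetch like this bypasses Scrapy's scheduler and download delays, so the Request/meta solution above is usually the better choice:

# Hedged sketch only: fetch each detail page synchronously inside parse_item.
from urllib.request import urlopen

from bs4 import BeautifulSoup


def parse_item(self, response):
    for sel in response.xpath('//h2[@class="title"]'):
        item = ExampleItem()  # assumed to define 'title' and 'url_contents' fields
        item['title'] = sel.xpath('a/@title').get()
        detail_url = response.urljoin(sel.xpath('a/@href').get())
        # blocking request outside Scrapy's downloader
        html = urlopen(detail_url).read()
        item['url_contents'] = BeautifulSoup(html, 'html.parser').get_text()
        yield item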
