I am scraping the job sites where the first page ahs the links to all the jobs.
Now i am storing the title , job , company from the first page.
But i also want to store the description , which is available by clicking on the job title. I want to store that as well with the current items.
This is my curent code
def parse(self, response):
hxs = HtmlXPathSelector(response)
sites = hxs.select("//div[#class='jobenteries']")
items = []
for site in sites[:3]:
print "Hello"
item = DmozItem()
item['title'] = site.select('a/text()').extract()
item['desc'] = ''
items.append(item)
return items
But that description is on the next page link. how can i do that
From the first page, return Requests for the second page and pass the data for each item in the request.meta dict. On the callback method for the second page you can read the data you passed and return the fully populated item.
See Passing additional data to callback functions in the scrapy docs for more details and an example.
Related
I am new at using scrapy and python
I wanted to start scraping data from a search result, if you will load the page the default content will appear, what I need to scrape is the filtered one, while doing pagination?
Here's the URL
https://teslamotorsclub.com/tmc/post-ratings/6/posts
I need to scrape the item from Time Filter: "Today" result
I tried different approach but none is working.
What I have done is this but more on layout structure.
class TmcnfSpider(scrapy.Spider):
name = 'tmcnf'
allowed_domains = ['teslamotorsclub.com']
start_urls = ['https://teslamotorsclub.com/tmc/post-ratings/6/posts']
def start_requests(self):
#Show form from a filtered search result
def parse(self, response):
#some code scraping item
#Yield url for pagination
To get the posts of todays filter, you need to send a post request to this url https://teslamotorsclub.com/tmc/post-ratings/6/posts along with payload. The following should fetch you the results you are interested in.
import scrapy
class TmcnfSpider(scrapy.Spider):
name = "teslamotorsclub"
start_urls = ["https://teslamotorsclub.com/tmc/post-ratings/6/posts"]
def parse(self,response):
payload = {'time_chooser':'4','_xfToken':''}
yield scrapy.FormRequest(response.url,formdata=payload,callback=self.parse_results)
def parse_results(self,response):
for items in response.css("h3.title > a::text").getall():
yield {"title":items.strip()}
Hi guys I am very new in scraping data, I have tried the basic one. But my problem is I have 2 web page with same domain that I need to scrape
My Logic is,
First page www.sample.com/view-all.html
*This page open all the list of items and I need to get all the href attr of every item.
Second page www.sample.com/productpage.52689.html
*this is the link came from the first page so 52689 needs to change dynamically depending on the link provided by the first page.
I need to get all the data like title, description etc on the second page.
what I am thinking is for loop but Its not working on my end. I am searching on google but no one has the same problem as mine. please help me
import scrapy
class SalesItemSpider(scrapy.Spider):
name = 'sales_item'
allowed_domains = ['www.sample.com']
start_urls = ['www.sample.com/view-all.html', 'www.sample.com/productpage.00001.html']
def parse(self, response):
for product_item in response.css('li.product-item'):
item = {
'URL': product_item.css('a::attr(href)').extract_first(),
}
yield item`
Inside parse you can yield Request() with url and function's name to scrape this url in different function
def parse(self, response):
for product_item in response.css('li.product-item'):
url = product_item.css('a::attr(href)').extract_first()
# it will send `www.sample.com/productpage.52689.html` to `parse_subpage`
yield scrapy.Request(url=url, callback=self.parse_subpage)
def parse_subpage(self, response):
# here you parse from www.sample.com/productpage.52689.html
item = {
'title': ...,
'description': ...
}
yield item
Look for Request in Scrapy documentation and its tutorial
There is also
response.follow(url, callback=self.parse_subpage)
which will automatically add www.sample.com to urls so you don't have to do it on your own in
Request(url = "www.sample.com/" + url, callback=self.parse_subpage)
See A shortcut for creating Requests
If you interested in scraping then you should read docs.scrapy.org from first page to the last one.
I am relatively new to scrapy and have been getting a lot of exceptions...
Here is what I am trying to do:
There 4 nested links that I want to grab data from:
Let's say I have 5 items that I want to crawl in total. These items are
Industry=scrapy.Field()
Company=scrapy.Field()
Contact_First_name=scrapy.Field()
Contact_Last_name=scrapy.Field()
Website=scrapy.Field()
Now to begin crawling I would first have to get the Industry.
The Industry xpath also contains the link to individual listings of companies that belong to their Industry segments.
Next I want to use the Industry xpath and go into the link. This page does not contain any data that I want to crawl. But this page contains href links to individual companies that have their own basic info page.
Using the href link from the listings page, I now arrive at one page that contains the information for one company. Now I want to scrape the Company, Address, and Website.
There is another href link that I need to click in order to lead to Contact_First_Name, Contact_Last_Name.
Using the href link, I now arrive at another page that contains the Contact_First_Name, and Contact_Last_Name
After crawling all of these pages, I should have items that look somewhat like this:
Industry Company Website Contact_First_Name Contact_Last_Name
Finance JPMC JP.com Jamie Dimon
Finance BOA BOA.com Bryan Moynihan
Technology ADSK ADSK.com Carl Bass
EDITED
Here is the code that is working. Anzel's recommendations really helped out but i realized the subclass allowed_domains was wrong which stopped the nested links from following through. Once I changed it, it works.
class PschamberSpider(scrapy.Spider):
name="pschamber"
allowed_domains = ["cm.pschamber.com"]
start_urls = ["http://cm.pschamber.com/list/"]
def parse(self, response):
item = PschamberItem()
for sel in response.xpath('//*[#id="mn-ql"]/ul/li/a'):
# xpath and xpath().extract() will return a list
# extract()[0] will return the first element in the list
item['Industry'] = sel.xpath('text()').extract()
# another mistake you made here
# you're trying to call scrapy.Request(LIST of hrefs) which will fail
# scrapy.Request only takes a url string, not list
# another big mistake is you're trying to yield the item,
# whereas you should yield the Request object
yield scrapy.Request(sel.xpath('#href').extract()[0], callback=self.parse_2, meta={'item': item})
# another mistake, your callback function DOESNT take item as argument
def parse_2(self, response):
for sel in response.xpath('.//*[#id="mn-members"]/div/div/div/div/div/a').extract():
# you can access your response meta like this
item=response.meta['item']
item['Company'] = sel.xpath('text()').extract()
yield scrapy.Request(sel.xpath('#href').extract()[0], callback=self.parse_3, meta={'item': item})
# again, yield the Request object
def parse_3(self, response):
item=response.meta['item']
item['Website'] = response.xpath('.//[#id="mn-memberinfo-block-website"]/a/#href').extract()
# OK, finally assume you're done, just return the item object
return item
There are quite a few mistakes you've made in your code therefore it's not running as you expected. Please see my below brief sample how to get the items you need and passing the meta to other callbacks. I am not copying your xpath as I just grab the most straight forward one from the site, you can apply your own.
I will try to comment as clear as possible to let you know where you did wrong.
class PschamberSpider(scrapy.Spider):
name = "pschamber"
# start from this, since your domain is a sub-domain on its own,
# you need to change to this without http://
allowed_domains = ["cm.pschamber.com"]
start_urls = (
'http://cm.pschamber.com/list/',
)
def parse(self, response):
item = PschamberItem()
for sel in response.xpath('//div[#id="mn-ql"]//a'):
# xpath and xpath().extract() will return a list
# extract()[0] will return the first element in the list
item['industry'] = sel.xpath('text()').extract()[0]
# another mistake you made here
# you're trying to call scrapy.Request(LIST of hrefs) which will fail
# scrapy.Request only takes a url string, not list
# another big mistake is you're trying to yield the item,
# whereas you should yield the Request object
yield scrapy.Request(
sel.xpath('#href').extract()[0],
callback=self.parse_2,
meta={'item': item}
)
# another mistake, your callback function DOESNT take item as argument
def parse_2(self, response):
for sel in response.xpath('//div[#class="mn-title"]//a'):
# you can access your response meta like this
item = response.meta['item']
item['company'] = sel.xpath('text()').extract()[0]
# again, yield the Request object
yield scrapy.Request(
sel.xpath('#href').extract()[0],
callback=self.parse_3,
meta={'item': item}
)
def parse_3(self, response):
item = response.meta['item']
item['website'] = response.xpath('//a[#class="mn-print-url"]/text()').extract()
# OK, finally assume you're done, just return the item object
return item
Hope this is self-explanatory and you get to understand the basic of scrapy, you should READ thoroughly the doc from Scrapy, and sooner you will learn another method to set rules to follow links with certain patterns... well of course once you get the basic right you will understand them.
Although everyone's journey differs, I strongly recommend you keep reading and practice until you're confident in what you're doing before crawling actual website. Also, there are rules to protect web contents which can be scraped, and copyright about the content you scrape.
Keep this in mind or you may find yourself in big trouble in future. Anyway, good luck and I hope this answer helps you resolve the problem!
I have written a scrapy crawlspider to crawl a site with a structure like category page > type page > list page > item page. On the category page there are many categories of machines each of which has a type page with lots of types, each of the different types has a list of items, then finally each machine has a page with info about it.
My spider has a rule to get from the home page to the category page where I define the callback parsecatpage, this generates an item, grabs the category and yields a new request for each category on the page. I pass the item and the category name with request.meta and specify the callback is parsetype page.
Parsetypepage gets the item from response.meta then yields requests for each type and passes the item, and the concatenation of category and type along with it in the request.meta. The callback is parsemachinelist.
Parsemachinelist gets the item from response.meta then yields requests for each item on the list and passes the item, category/type, description via request.meta to the final callback, parsemachine. This gets the meta attributes and populates all the fields in the item using the info on the page and the info that was passed from the previous pages and finally yields an item.
If I limit this to a single category and type (with for example contains[#href, "filter=c:Grinders"] and contains[#href, "filter=t:Disc+-+Horizontal%2C+Single+End"]) then it works and there is a machine item for each machine on the final page. The problem is that once I allow the spider to scrapy all the categories and all the types it only returns scrapy items for the machines on the first of the final pages it gets to and once it has done that the spider is finished and doesn't get the other categories etc.
Here is the (anonymous) code
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.http import Request
from myspider.items import MachineItem
import urlparse
class MachineSpider(CrawlSpider):
name = 'myspider'
allowed_domains = ['example.com']
start_urls = ['http://www.example.com/index.php']
rules = (
Rule(SgmlLinkExtractor(allow_domains=('example.com'),allow=('12\.html'),unique=True),callback='parsecatpage'),
)
def parsecatpage(self, response):
hxs = HtmlXPathSelector(response)
#this works, next line doesn't categories = hxs.select('//a[contains(#href, "filter=c:Grinders")]')
categories = hxs.select('//a[contains(#href, "filter=c:Grinders") or contains(#href, "filter=c:Lathes")]')
for cat in categories:
item = MachineItem()
req = Request(urlparse.urljoin(response.url,''.join(cat.select("#href").extract()).strip()),callback=self.parsetypepage)
req.meta['item'] = item
req.meta['machinecategory'] = ''.join(cat.select("./text()").extract())
yield req
def parsetypepage(self, response):
hxs = HtmlXPathSelector(response)
#this works, next line doesn't types = hxs.select('//a[contains(#href, "filter=t:Disc+-+Horizontal%2C+Single+End")]')
types = hxs.select('//a[contains(#href, "filter=t:Disc+-+Horizontal%2C+Single+End") or contains(#href, "filter=t:Lathe%2C+Production")]')
for typ in types:
item = response.meta['item']
req = Request(urlparse.urljoin(response.url,''.join(typ.select("#href").extract()).strip()),callback=self.parsemachinelist)
req.meta['item'] = item
req.meta['machinecategory'] = ': '.join([response.meta['machinecategory'],''.join(typ.select("./text()").extract())])
yield req
def parsemachinelist(self, response):
hxs = HtmlXPathSelector(response)
for row in hxs.select('//tr[contains(td/a/#href, "action=searchdet")]'):
item = response.meta['item']
req = Request(urlparse.urljoin(response.url,''.join(row.select('./td/a[contains(#href,"action=searchdet")]/#href').extract()).strip()),callback=self.parsemachine)
print urlparse.urljoin(response.url,''.join(row.select('./td/a[contains(#href,"action=searchdet")]/#href').extract()).strip())
req.meta['item'] = item
req.meta['descr'] = row.select('./td/div/text()').extract()
req.meta['machinecategory'] = response.meta['machinecategory']
yield req
def parsemachine(self, response):
hxs = HtmlXPathSelector(response)
item = response.meta['item']
item['machinecategory'] = response.meta['machinecategory']
item['comp_name'] = 'Name'
item['description'] = response.meta['descr']
item['makemodel'] = ' '.join([''.join(hxs.select('//table/tr[contains(td/strong/text(), "Make")]/td/text()').extract()),''.join(hxs.select('//table/tr[contains(td/strong/text(), "Model")]/td/text()').extract())])
item['capacity'] = hxs.select('//tr[contains(td/strong/text(), "Capacity")]/td/text()').extract()
relative_image_url = hxs.select('//img[contains(#src, "custom/modules/images")]/#src')[0].extract()
abs_image_url = urlparse.urljoin(response.url, relative_image_url.strip())
item['image_urls'] = [abs_image_url]
yield item
SPIDER = MachineSpider()
So for example the spider will find Grinders on the category page and go to the Grinder type page where it will find the Disc Horizontal Single End type, then it will go to that page and find the list of machines and go to each machines page and finally there will be an item for each machine. If you try and go to Grinders and Lathes though it will run through the Grinders fine then it will crawl the Lathes and Lathes type pages and stop there without generating the requests for the Lathes list page and the final Lathes pages.
Can anyone help with this? Why isn't the spider getting to the second (or third etc.) machine list page once there is more than one category of machine?
Sorry for the epic post, just trying to explain the problem!!
Thanks!!
You should print the url of the request, to be sure it's ok. Also you can try this version:
def parsecatpage(self, response):
hxs = HtmlXPathSelector(response)
categories = hxs.select('//a[contains(#href, "filter=c:Grinders") or contains(#href, "filter=c:Lathes")]')
for cat in categories:
item = MachineItem()
cat_url = urlparse.urljoin(response.url, cat.select("./#href").extract()[0])
print 'url:', cat_url # to see what's there
cat_name = cat.select("./text()").extract()[0]
req = Request(cat_url, callback=self.parsetypepage, meta={'item': item, 'machinecategory': cat_name})
yield req
The problem was that the website is set up so that moving from the category to type page (and the following pages) occurs via filtering the results that are shown. This means that if the requests are performed depth first to the bottom of the query then it works (i.e. choose a category, then get all the types of that category, then get all the machines in each type then scrape the page of each machine) but if a request for the next type page is processed before the spider has got the urls for each machine in the first type then the urls are no longer correct and the spider reaches an incorrect page and cannot extract the info for the next step.
To solve the problem I defined a category setup callback which is called the first time only and gets a list of all categories called categories, then a category callback which is called from category setup which starts the crawl with a single category only using categories.pop(). Once the spider has got to the bottom of the nested callbacks and scraped all the machines in the list there is a callback back up to the category callback again (needed dont_follow=True in the Request) where categories.pop() starts the process again with the next category in the list until they are all done. This way each category is treated fully before the next is started and it works.
Thanks for your final comment, that got me thinking along the right lines and led me to the solution!
I have a question on how to do this thing in scrapy. I have a spider that crawls for listing pages of items.
Every time a listing page is found, with items, there's the parse_item() callback that is called for extracting items data, and yielding items. So far so good, everything works great.
But each item, has among other data, an url, with more details on that item. I want to follow that url and store in another item field (url_contents) the fetched contents of that item's url.
And I'm not sure how to organize code to achieve that, since the two links (listings link, and one particular item link) are followed differently, with callbacks called at different times, but I have to correlate them in the same item processing.
My code so far looks like this:
class MySpider(CrawlSpider):
name = "example.com"
allowed_domains = ["example.com"]
start_urls = [
"http://www.example.com/?q=example",
]
rules = (
Rule(SgmlLinkExtractor(allow=('example\.com', 'start='), deny=('sort='), restrict_xpaths = '//div[#class="pagination"]'), callback='parse_item'),
Rule(SgmlLinkExtractor(allow=('item\/detail', )), follow = False),
)
def parse_item(self, response):
main_selector = HtmlXPathSelector(response)
xpath = '//h2[#class="title"]'
sub_selectors = main_selector.select(xpath)
for sel in sub_selectors:
item = ExampleItem()
l = ExampleLoader(item = item, selector = sel)
l.add_xpath('title', 'a[#title]/#title')
......
yield l.load_item()
After some testing and thinking, I found this solution that works for me.
The idea is to use just the first rule, that gives you listings of items, and also, very important, add follow=True to that rule.
And in parse_item() you have to yield a request instead of an item, but after you load the item. The request is to item detail url. And you have to send the loaded item to that request callback. You do your job with the response, and there is where you yield the item.
So the finish of parse_item() will look like this:
itemloaded = l.load_item()
# fill url contents
url = sel.select(item_url_xpath).extract()[0]
request = Request(url, callback = lambda r: self.parse_url_contents(r))
request.meta['item'] = itemloaded
yield request
And then parse_url_contents() will look like this:
def parse_url_contents(self, response):
item = response.request.meta['item']
item['url_contents'] = response.body
yield item
If anyone has another (better) approach, let us know.
Stefan
I'm sitting with exactly the same problem, and from the fact that no-one has answered your question for 2 days I take it that the only solution is to follow that URL manually, from within your parse_item function.
I'm new to Scrapy, so I wouldn't attempt it with that (although I'm sure it's possible), but my solution will be to use urllib and BeatifulSoup to load the second page manually, extract that information myself, and save it as part of the Item. Yes, much more trouble than Scrapy makes normal parsing, but it should get the job done with the least hassle.