I have this code:
class MyImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            yield Request(image_url)
and this is the spider, subclassed from BaseSpider. This spider is giving me nightmares:
def parse(self, response):
    hxs = HtmlXPathSelector(response)
    sites = hxs.select('//strong[@class="genmed"]')
    items = []
    for site in sites[:5]:
        item = PanduItem()
        item['username'] = site.select('dl/dd/h2/a').select("string()").extract()
        item['number_posts'] = site.select('dl/dd/h2/em').select("string()").extract()
        item['profile_link'] = site.select('a/@href').extract()
        request = Request("http://www.example/profile.php?mode=viewprofile&u=5",
                          callback=self.parseUserProfile)
        request.meta['item'] = item
        return request

def parseUserProfile(self, response):
    hxs = HtmlXPathSelector(response)
    sites = hxs.select('//div[@id="current"]')
    myurl = sites[0].select('img/@src').extract()
    item = response.meta['item']
    image_absolute_url = urljoin(response.url, myurl[0].strip())
    item['image_urls'] = [image_absolute_url]
    return item
This is the error I am getting. I cannot find the cause. It looks like the pipeline is getting the item, but I am not sure:
ERROR
File "/app_crawler/crawler/pipelines.py", line 9, in get_media_requests
for image_url in item['image_urls']:
exceptions.TypeError: 'NoneType' object has no attribute '__getitem__'
You are missing a method in your pipelines.py
That file should contain three methods:
process_item
get_media_requests
item_completed
The item_completed method is the one that handles saving the images to a specified path. This path is set in settings.py as below:
ITEM_PIPELINES = ['scrapy.contrib.pipeline.images.ImagesPipeline']
IMAGES_STORE = '/your/path/here'
Also included in settings.py, as seen above, is the line that enables the images pipeline.
I've tried to explain this as well as I understand it. For further reference, have a look at the official Scrapy documentation.
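As a rough, hypothetical sketch (not the asker's exact file), a custom images pipeline along these lines covers get_media_requests and item_completed. The old scrapy.contrib import path is chosen to match the settings shown above, and the item fields assumed here (image_urls, images) would need to exist in items.py:
# Hypothetical sketch of a custom images pipeline; not the original pipelines.py.
from scrapy.contrib.pipeline.images import ImagesPipeline
from scrapy.exceptions import DropItem
from scrapy.http import Request

class MyImagesPipeline(ImagesPipeline):

    def get_media_requests(self, item, info):
        # schedule one download request per image URL found on the item
        for image_url in item.get('image_urls', []):
            yield Request(image_url)

    def item_completed(self, results, item, info):
        # results is a list of (success, result) tuples, one per download
        image_paths = [result['path'] for ok, result in results if ok]
        if not image_paths:
            raise DropItem("Item contains no downloaded images")
        # requires an 'images' field on the item (add it in items.py)
        item['images'] = image_paths
        return item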
Hmmm. At no point are you appending item to items (although the example code in the documentation doesn't do an append either, so I could be barking up the wrong tree).
Try adding it to parse(self, response) like so and see if this resolves the issue:
for site in sites:
    item = PanduItem()
    item['username'] = site.select('dl/dd/h2/a').select("string()").extract()
    item['number_posts'] = site.select('dl/dd/h2/em').select("string()").extract()
    item['profile_link'] = site.select('a/@href').extract()
    items.append(item)
And set the IMAGES_STORE setting to a valid directory that will be used for storing the downloaded images. Otherwise the pipeline will remain disabled, even if you include it in the ITEM_PIPELINES setting.
For example:
IMAGES_STORE = '/path/to/valid/dir'
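If the goal is to keep the per-profile request from the original parse(), another hedged variant is to yield one request per site so that each request carries its own item; this only rearranges the question's own code and keeps the placeholder profile URL from the question:
# Sketch only: same selectors and placeholder URL as in the question above.
def parse(self, response):
    hxs = HtmlXPathSelector(response)
    sites = hxs.select('//strong[@class="genmed"]')
    for site in sites[:5]:
        item = PanduItem()
        item['username'] = site.select('dl/dd/h2/a').select("string()").extract()
        item['number_posts'] = site.select('dl/dd/h2/em').select("string()").extract()
        item['profile_link'] = site.select('a/@href').extract()
        request = Request("http://www.example/profile.php?mode=viewprofile&u=5",
                          callback=self.parseUserProfile)
        request.meta['item'] = item
        yield request  # one request per site, instead of a single return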
I'm trying to crawl https://www.jobs.ch/de/stellenangebote/administration-hr-consulting-ceo/, where I am currently stuck because Scrapy returns None for the "Title" item, which is the job name. The CSS selector works fine in the shell, and the other items also work. I've tried altering the selector and adding delays, but nothing seems to work. Does anybody have an idea? Code below.
import scrapy
from jobscraping.items import JobscrapingItem

class GetdataSpider(scrapy.Spider):
    name = 'getdata2'
    start_urls = ['https://www.jobs.ch/de/stellenangebote/administration-hr-consulting-ceo/']

    def parse(self, response):
        for add in response.css('div.sc-AxiKw.VacancySerpItem__ShadowBox-qr45cp-0.hqhfbd'):
            item = JobscrapingItem()
            addpage = response.urljoin(add.css('div.sc-AxiKw.VacancySerpItem__ShadowBox-qr45cp-0.hqhfbd a::attr(href)').get())
            item['link'] = addpage
            request = scrapy.Request(addpage, callback=self.get_addinfos)
            request.meta['item'] = item
            yield request

    def get_addinfos(self, response):
        item = response.meta['item']
        item['Title'] = response.css('.sc-AxhUy.Text__h2-jiiyzm-1.eBKnmN.sc-fzqNJr.Text__span-jiiyzm-8.Text-jiiyzm-9.iNTZsv::text').get()
        item['Company'] = response.css('span.sc-fzqNJr.Text__span-jiiyzm-8.kGLBca.sc-fzqNJr.Text__span-jiiyzm-8.Text-jiiyzm-9.kjfvVS::text').get()
        item['Location'] = response.css('span.sc-fzqNJr.Text__span-jiiyzm-8.kGLBca.sc-fzqNJr.Text__span-jiiyzm-8.Text-jiiyzm-9.WBPTt::text').getall()
        yield item
This is the items.py file:
import scrapy

class JobscrapingItem(scrapy.Item):
    # define the fields for your item here like:
    link = scrapy.Field()
    Title = scrapy.Field()
    Company = scrapy.Field()
    Location = scrapy.Field()
You are using a much more complicated CSS selector than you need. Remember you don't always have to use classes or IDs. You can use other attributes; in this case data-cy="vacancy-title" seems to be perfect.
item['Title'] = response.css('h1[data-cy="vacancy-title"]::text').get()
should work. It is simple and easy to debug and change later if something goes wrong.
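If you want to double-check a selector like this before re-running the spider, one way (assuming the attribute is present in the raw HTML and not injected by JavaScript) is the Scrapy shell; the shortened div.sc-AxiKw class used to pick a vacancy link below is my own simplification of the question's selector:
scrapy shell "https://www.jobs.ch/de/stellenangebote/administration-hr-consulting-ceo/"
>>> # follow one vacancy link and test the title selector on the detail page
>>> url = response.urljoin(response.css('div.sc-AxiKw a::attr(href)').get())
>>> fetch(url)
>>> response.css('h1[data-cy="vacancy-title"]::text').get()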
I'm trying to scrape all the data from a website called quotes.toscrape.com, but when I run my code it only gets one quote. It should at least take all the data from that page, but it only takes one. Also, once I get the data from page 1, I want to get the data from all the pages.
So how do I solve this error (so that it takes all the data from page 1)?
How do I take all the data present on the next pages?
items.py file
import scrapy

class QuotetutorialItem(scrapy.Item):
    title = scrapy.Field()
    author = scrapy.Field()
    tag = scrapy.Field()
quotes_spider.py file
import scrapy
from ..items import QuotetutorialItem

class QuoteScrapy(scrapy.Spider):
    name = 'quotes'
    start_urls = [
        'http://quotes.toscrape.com/'
    ]

    def parse(self, response):
        items = QuotetutorialItem()
        all_div_quotes = response.css('div.quote')
        for quotes in all_div_quotes:
            title = quotes.css('span.text::text').extract()
            author = quotes.css('.author::text').extract()
            tag = quotes.css('.tag::text').extract()
            items['title'] = title
            items['author'] = author
            items['tag'] = tag
        yield items
Please tell me what change I should make.
As reported, your yield is missing an indentation level. To follow the next pages, just add a check for the next button and yield a request following it.
import scrapy

class QuoteScrapy(scrapy.Spider):
    name = 'quotes'
    start_urls = [
        'http://quotes.toscrape.com/'
    ]

    def parse(self, response):
        items = {}
        all_div_quotes = response.css('div.quote')
        for quotes in all_div_quotes:
            title = quotes.css('span.text::text').extract()
            author = quotes.css('.author::text').extract()
            tag = quotes.css('.tag::text').extract()
            items['title'] = title
            items['author'] = author
            items['tag'] = tag
            yield items
        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page:
            yield response.follow(next_page)
As #LanteDellaRovere has correctly identified in a comment, the yield statement should be executed for each iteration of the for loop - which is why you are only seeing a single (presumably the last) quote from each page.
As for reading the subsequent pages, you could extract the links from the <nav> element at the bottom of the page, but the structure is very simple - the links (when no tag is specified) are of the form
http://quotes.toscrape.com/page/N/
You will find that for N=1 you get the first page. So simply accessing the URLs for increasing values of N until an attempt returns a 404 should work as a simplistic solution.
Not knowing much about Scrapy I can't give you exact code, but the examples at https://docs.scrapy.org/en/latest/intro/tutorial.html#following-links are fairly helpful if you want a more sophisticated and Pythonic approach.
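To make that idea concrete, here is a rough sketch of a spider that walks the /page/N/ URLs. The spider name is mine, the field names follow the question, and the stop condition is simply "this page came back with no quotes" (which also covers the 404 case, since Scrapy ignores error responses by default and the spider then closes):
# Illustrative sketch of the "increment N" pagination idea; not the answer's exact code.
import scrapy

class QuotesByPageSpider(scrapy.Spider):
    name = 'quotes_by_page'
    start_urls = ['http://quotes.toscrape.com/page/1/']

    def parse(self, response):
        quotes = response.css('div.quote')
        for quote in quotes:
            yield {
                'title': quote.css('span.text::text').get(),
                'author': quote.css('.author::text').get(),
                'tag': quote.css('.tag::text').getall(),
            }
        if quotes:
            # request page N+1; stop once a page comes back empty (or 404s)
            current = int(response.url.rstrip('/').rsplit('/', 1)[-1])
            yield scrapy.Request('http://quotes.toscrape.com/page/%d/' % (current + 1))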
I want to load two fields into an item loader that is passed along through response.meta. Somehow, the standard:
loader.add_xpath('item', 'xpath')
is not working (i.e. no value is saved or written; it is as if the field was never created), but with the exact same expression the following:
response.xpath('xpath')
loader.add_value('item', value)
works. Does anyone know why? Complete code below:
Spider.py
def parse(self, response):
    for record in response.xpath('//div[@class="box list"]/div[starts-with(@class,"record")]'):
        loader = BaseItemLoader(item=BezrealitkyItems(), selector=record)
        loader.add_xpath('title', './/div[@class="details"]/h2/a[@href]/text()')
        listing_url = record.xpath('.//div[@class="details"]/p[@class="short-url"]/text()').extract_first()
        yield scrapy.Request(listing_url, meta={'loader': loader}, callback=self.parse_listing)

def parse_listing(self, response):
    loader = response.meta['loader']
    loader.add_value('url', response.url)
    loader.add_xpath('lat', '//script[contains(.,"recordGps")]', re=r'(?:"lat":)[0-9]+\.[0-9]+')
    return loader.load_item()
The above does not work, when I try this it works though:
lat_coords = response.xpath('//script[contains(.,"recordGps")]/text()').re(r'(?:"lat":)([0-9]+\.[0-9]+)')
loader.add_value('lat', lat_coords)
My item.py has nothing special:
class BezrealitkyItems(scrapy.Item):
    title = scrapy.Field()
    url = scrapy.Field()
    lat = scrapy.Field()

class BaseItemLoader(ItemLoader):
    title_in = MapCompose(lambda v: v.strip(), Join(''), unidecode)
    title_out = TakeFirst()
Just to clarify, I get no error message. It is just that the 'lat' field is not created and nothing is scraped into it. The other fields are scraped fine, including the url, which is also added through the parse_listing function.
It happens because you are carrying over the loader reference, which has its own selector object.
Here you create and assign a selector parameter with your reference:
loader = BaseItemLoader(item=BezrealitkyItems(), selector=record)
Later you put this loader into your Request.meta attribute and carry it over to the next parse method. What you aren't doing, though, is updating the selector context once you retrieve the loader from meta:
loader = response.meta['loader']
# if you check loader.selector you'll see that it still has html body
# set in previous method, i.e. selector of record in your case
loader.selector = Selector(response) # <--- this is missing
This would work; however, it should be avoided, because carrying complex objects with a lot of references in meta is a bad idea and can cause all kinds of errors, mostly related to the Twisted framework (which Scrapy uses for its concurrency).
What you should do instead is load the item and recreate the loader at every step:
def parse(self, response):
    loader = BaseItemLoader(item=BezrealitkyItems(), selector=record)
    yield scrapy.Request('some_url', meta={'item': loader.load_item()}, callback=self.parse2)

def parse2(self, response):
    loader = BaseItemLoader(item=response.meta['item'], selector=Selector(response))
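To make that concrete, here is a hedged sketch of how the question's two callbacks could look with the "recreate the loader at every step" approach. It only recombines code already shown above; the choices of passing response= to the loader constructor and of using the capture-group regex from the working add_value variant are mine:
# Sketch only: recombination of the question's code under the answer's approach.
from scrapy import Request
# BaseItemLoader and BezrealitkyItems are the classes from the question above

def parse(self, response):
    for record in response.xpath('//div[@class="box list"]/div[starts-with(@class,"record")]'):
        loader = BaseItemLoader(item=BezrealitkyItems(), selector=record)
        loader.add_xpath('title', './/div[@class="details"]/h2/a[@href]/text()')
        listing_url = record.xpath('.//div[@class="details"]/p[@class="short-url"]/text()').extract_first()
        # pass the plain item, not the loader object, through meta
        yield Request(listing_url, meta={'item': loader.load_item()}, callback=self.parse_listing)

def parse_listing(self, response):
    # rebuild the loader around the carried item, with this response as its context
    loader = BaseItemLoader(item=response.meta['item'], response=response)
    loader.add_value('url', response.url)
    # same regex as the working add_value variant above, with a capture group
    loader.add_xpath('lat', '//script[contains(.,"recordGps")]/text()', re=r'(?:"lat":)([0-9]+\.[0-9]+)')
    return loader.load_item()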
I am scraping data using Scrapy into an item.json file. The data is getting stored, but the problem is that only 25 entries are stored, while the website has more entries. This is my spider:
class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["justdial.com"]
    start_urls = ["http://www.justdial.com/Delhi-NCR/Taxi-Services/ct-57371"]

    def parse(self, response):
        hxs = Selector(response)
        sites = hxs.xpath('//section[@class="rslwrp"]/section')
        items = []
        for site in sites:
            item = DmozItem()
            item['title'] = site.xpath('section[2]/section[1]/aside[1]/p[1]/span/a/text()').extract()
            items.append(item)
        return items
The command I'm using to run the script is:
scrapy crawl myspider -o items.json -t json
Is there any setting I am not aware of? Or is the page not getting fully loaded before scraping? How do I resolve this?
Abhi, here is some code, but please note that it isn't complete and working; it is just to show you the idea. Usually you have to find the next page URL and recreate the appropriate request in your spider. In your case AJAX is used. I used FireBug to check which requests are sent by the site.
URL = "http://www.justdial.com/function/ajxsearch.php?national_search=0&...page=%s" # this isn't the complete next page URL
next_page = 2 # how to handle next_page counter is up to you
def parse(self, response):
hxs = Selector(response)
sites = hxs.xpath('//section[#class="rslwrp"]/section')
for site in sites:
item = DmozItem()
item['title'] = site.xpath('section[2]/section[1]/aside[1]/p[1]/span/a/text()').extract()
yield item
# build you pagination URL and send a request
url = self.URL % self.next_page
yield Request(url) # Request is Scrapy request object here
# increment next_page counter if required, make additional
# checks and actions etc
Hope this will help.
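As a small, hypothetical extension of the sketch above, the page number can also be carried in request.meta instead of a spider attribute, which avoids shared-state surprises when requests are handled concurrently; the stop condition here is simply "the current page returned no results":
# Illustrative variant only; URL, selectors and DmozItem are taken from the code above.
def parse(self, response):
    page = response.meta.get('page', 1)
    hxs = Selector(response)
    sites = hxs.xpath('//section[@class="rslwrp"]/section')
    for site in sites:
        item = DmozItem()
        item['title'] = site.xpath('section[2]/section[1]/aside[1]/p[1]/span/a/text()').extract()
        yield item
    if sites:
        # only keep paginating while the current page returned results
        yield Request(self.URL % (page + 1), meta={'page': page + 1})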
I have written a Scrapy CrawlSpider to crawl a site with a structure like category page > type page > list page > item page. On the category page there are many categories of machines, each of which has a type page with lots of types; each type has a list of items, and finally each machine has a page with info about it.
My spider has a rule to get from the home page to the category page, where I define the callback parsecatpage. This generates an item, grabs the category, and yields a new request for each category on the page. I pass the item and the category name with request.meta and specify that the callback is parsetypepage.
parsetypepage gets the item from response.meta, then yields requests for each type, passing along the item and the concatenation of category and type in request.meta. The callback is parsemachinelist.
parsemachinelist gets the item from response.meta, then yields requests for each item on the list, passing the item, the category/type and a description via request.meta to the final callback, parsemachine. This gets the meta attributes, populates all the fields in the item using the info on the page and the info passed from the previous pages, and finally yields an item.
If I limit this to a single category and type (with, for example, contains(@href, "filter=c:Grinders") and contains(@href, "filter=t:Disc+-+Horizontal%2C+Single+End")) then it works, and there is a machine item for each machine on the final page. The problem is that once I allow the spider to scrape all the categories and all the types, it only returns items for the machines on the first of the final pages it gets to; once it has done that, the spider is finished and doesn't get the other categories etc.
Here is the (anonymized) code:
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.http import Request
from myspider.items import MachineItem
import urlparse

class MachineSpider(CrawlSpider):
    name = 'myspider'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/index.php']
    rules = (
        Rule(SgmlLinkExtractor(allow_domains=('example.com'), allow=('12\.html'), unique=True), callback='parsecatpage'),
    )

    def parsecatpage(self, response):
        hxs = HtmlXPathSelector(response)
        # this works:
        # categories = hxs.select('//a[contains(@href, "filter=c:Grinders")]')
        # the next line doesn't:
        categories = hxs.select('//a[contains(@href, "filter=c:Grinders") or contains(@href, "filter=c:Lathes")]')
        for cat in categories:
            item = MachineItem()
            req = Request(urlparse.urljoin(response.url, ''.join(cat.select("@href").extract()).strip()), callback=self.parsetypepage)
            req.meta['item'] = item
            req.meta['machinecategory'] = ''.join(cat.select("./text()").extract())
            yield req

    def parsetypepage(self, response):
        hxs = HtmlXPathSelector(response)
        # this works:
        # types = hxs.select('//a[contains(@href, "filter=t:Disc+-+Horizontal%2C+Single+End")]')
        # the next line doesn't:
        types = hxs.select('//a[contains(@href, "filter=t:Disc+-+Horizontal%2C+Single+End") or contains(@href, "filter=t:Lathe%2C+Production")]')
        for typ in types:
            item = response.meta['item']
            req = Request(urlparse.urljoin(response.url, ''.join(typ.select("@href").extract()).strip()), callback=self.parsemachinelist)
            req.meta['item'] = item
            req.meta['machinecategory'] = ': '.join([response.meta['machinecategory'], ''.join(typ.select("./text()").extract())])
            yield req

    def parsemachinelist(self, response):
        hxs = HtmlXPathSelector(response)
        for row in hxs.select('//tr[contains(td/a/@href, "action=searchdet")]'):
            item = response.meta['item']
            req = Request(urlparse.urljoin(response.url, ''.join(row.select('./td/a[contains(@href,"action=searchdet")]/@href').extract()).strip()), callback=self.parsemachine)
            print urlparse.urljoin(response.url, ''.join(row.select('./td/a[contains(@href,"action=searchdet")]/@href').extract()).strip())
            req.meta['item'] = item
            req.meta['descr'] = row.select('./td/div/text()').extract()
            req.meta['machinecategory'] = response.meta['machinecategory']
            yield req

    def parsemachine(self, response):
        hxs = HtmlXPathSelector(response)
        item = response.meta['item']
        item['machinecategory'] = response.meta['machinecategory']
        item['comp_name'] = 'Name'
        item['description'] = response.meta['descr']
        item['makemodel'] = ' '.join([''.join(hxs.select('//table/tr[contains(td/strong/text(), "Make")]/td/text()').extract()), ''.join(hxs.select('//table/tr[contains(td/strong/text(), "Model")]/td/text()').extract())])
        item['capacity'] = hxs.select('//tr[contains(td/strong/text(), "Capacity")]/td/text()').extract()
        relative_image_url = hxs.select('//img[contains(@src, "custom/modules/images")]/@src')[0].extract()
        abs_image_url = urlparse.urljoin(response.url, relative_image_url.strip())
        item['image_urls'] = [abs_image_url]
        yield item

SPIDER = MachineSpider()
So, for example, the spider will find Grinders on the category page and go to the Grinders type page, where it will find the Disc Horizontal Single End type; then it will go to that page, find the list of machines, go to each machine's page, and finally there will be an item for each machine. If you try to go to Grinders and Lathes, though, it will run through the Grinders fine, then it will crawl the Lathes and Lathes type pages and stop there, without generating the requests for the Lathes list page and the final Lathes pages.
Can anyone help with this? Why isn't the spider getting to the second (or third etc.) machine list page once there is more than one category of machine?
Sorry for the epic post, just trying to explain the problem!!
Thanks!!
You should print the URL of the request to be sure it is OK. You can also try this version:
def parsecatpage(self, response):
    hxs = HtmlXPathSelector(response)
    categories = hxs.select('//a[contains(@href, "filter=c:Grinders") or contains(@href, "filter=c:Lathes")]')
    for cat in categories:
        item = MachineItem()
        cat_url = urlparse.urljoin(response.url, cat.select("./@href").extract()[0])
        print 'url:', cat_url  # to see what's there
        cat_name = cat.select("./text()").extract()[0]
        req = Request(cat_url, callback=self.parsetypepage, meta={'item': item, 'machinecategory': cat_name})
        yield req
The problem was that the website is set up so that moving from the category to the type page (and the following pages) happens by filtering the results that are shown. This means that if the requests are performed depth-first to the bottom of the query, it works (i.e. choose a category, then get all the types of that category, then get all the machines of each type, then scrape the page of each machine). But if a request for the next type page is processed before the spider has got the URLs for each machine of the first type, the URLs are no longer correct, and the spider reaches an incorrect page and cannot extract the info for the next step.
To solve the problem I defined a category setup callback, which is called the first time only and builds a list of all categories, called categories. Then a category callback, called from the category setup, starts the crawl with a single category only, using categories.pop(). Once the spider has got to the bottom of the nested callbacks and scraped all the machines in the list, there is a callback back up to the category callback again (this needed dont_follow=True in the Request), where categories.pop() starts the process again with the next category in the list, until they are all done. This way each category is treated fully before the next is started, and it works.
Thanks for your final comment, that got me thinking along the right lines and led me to the solution!
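For anyone reading later, here is a rough, hypothetical sketch of just the "collect the categories once, then hand them out one at a time with pop()" part of that approach. The method names, the category XPath and the use of dont_filter=True are my assumptions (the poster mentions dont_follow=True, and their exact code isn't shown), and the callback back up from the last machine page is only indicated in a comment:
# Hypothetical sketch only; not the poster's actual solution code.
import urlparse
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from myspider.items import MachineItem

def parsecatsetuppage(self, response):
    # runs once: store every category URL on the spider instance
    hxs = HtmlXPathSelector(response)
    self.categories = [urlparse.urljoin(response.url, href) for href in
                       hxs.select('//a[contains(@href, "filter=c:")]/@href').extract()]
    return self.parsecategory(response)

def parsecategory(self, response):
    # start the crawl for one category; called again from the bottom of the
    # callback chain (as described above) once that category is completely done
    # (the machinecategory meta from the original code is omitted here for brevity)
    if self.categories:
        return Request(self.categories.pop(), callback=self.parsetypepage,
                       meta={'item': MachineItem()}, dont_filter=True)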