Item Loader not working with response.meta - python

I want to load two items into an item loader that is passed along via response.meta. Somehow, the standard:
loader.add_xpath('item', 'xpath')
is not working (i.e. no value is saved or written; it is as if the field was never populated), while the following, with the exact same expression, works:
value = response.xpath('xpath')
loader.add_value('item', value)
Anyone know why? Complete code below:
Spider.py
def parse(self, response):
    for record in response.xpath('//div[@class="box list"]/div[starts-with(@class,"record")]'):
        loader = BaseItemLoader(item=BezrealitkyItems(), selector=record)
        loader.add_xpath('title', './/div[@class="details"]/h2/a[@href]/text()')
        listing_url = record.xpath('.//div[@class="details"]/p[@class="short-url"]/text()').extract_first()
        yield scrapy.Request(listing_url, meta={'loader': loader}, callback=self.parse_listing)

def parse_listing(self, response):
    loader = response.meta['loader']
    loader.add_value('url', response.url)
    loader.add_xpath('lat', '//script[contains(.,"recordGps")]', re=r'(?:"lat":)[0-9]+\.[0-9]+')
    return loader.load_item()
The above does not work; when I try this instead, it works:
lat_coords = response.xpath('//script[contains(.,"recordGps")]/text()').re(r'(?:"lat":)([0-9]+\.[0-9]+)')
loader.add_value('lat', lat_coords)
My item.py has nothing special:
class BezrealitkyItems(scrapy.Item):
    title = scrapy.Field()
    url = scrapy.Field()
    lat = scrapy.Field()

class BaseItemLoader(ItemLoader):
    title_in = MapCompose(lambda v: v.strip(), Join(''), unidecode)
    title_out = TakeFirst()
Just to clarify, I get no error message. It is just that the 'lat' field is never created and nothing is scraped into it. The other fields are scraped fine, including the url, which is also added in the parse_listing function.

It happens because you are carrying over a loader reference that has its own selector object.
Here you create the loader and assign your record selector to it:
loader = BaseItemLoader(item=BezrealitkyItems(), selector=record)
Later you put this loader into your Request.meta attribute and carry it over to the next parse method. What you aren't doing, though, is updating the selector context once you retrieve the loader from meta:
loader = response.meta['loader']
# if you check loader.selector you'll see that it still has html body
# set in previous method, i.e. selector of record in your case
loader.selector = Selector(response) # <--- this is missing
This would work; however, it should be avoided, because keeping complex objects with many references in meta is a bad idea and can cause all kinds of errors, mostly related to the Twisted framework (which Scrapy uses for its concurrency).
What you should do instead is load the item and recreate the loader at every step:
def parse(self, response):
    loader = BaseItemLoader(item=BezrealitkyItems(), selector=record)
    yield scrapy.Request('some_url', meta={'item': loader.load_item()}, callback=self.parse2)

def parse2(self, response):
    # rebuild the loader around the partially loaded item and the new response
    loader = BaseItemLoader(item=response.meta['item'], selector=Selector(response))
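Applied to the spider from the question, the recommended pattern would look roughly like this (a sketch only, not tested against the site; it assumes the same BezrealitkyItems and BaseItemLoader definitions and passes the partially loaded item through meta instead of the loader):
import scrapy
from scrapy.selector import Selector


class BezrealitkySpider(scrapy.Spider):  # hypothetical spider name, for illustration only
    name = 'bezrealitky'

    def parse(self, response):
        for record in response.xpath('//div[@class="box list"]/div[starts-with(@class,"record")]'):
            loader = BaseItemLoader(item=BezrealitkyItems(), selector=record)
            loader.add_xpath('title', './/div[@class="details"]/h2/a[@href]/text()')
            listing_url = record.xpath('.//div[@class="details"]/p[@class="short-url"]/text()').extract_first()
            # pass the partially loaded item, not the loader itself
            yield scrapy.Request(listing_url, meta={'item': loader.load_item()}, callback=self.parse_listing)

    def parse_listing(self, response):
        # recreate the loader around the item and the new response
        loader = BaseItemLoader(item=response.meta['item'], selector=Selector(response))
        loader.add_value('url', response.url)
        loader.add_xpath('lat', '//script[contains(.,"recordGps")]', re=r'(?:"lat":)[0-9]+\.[0-9]+')
        return loader.load_item()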

Related

Why is scrapy returning None for the "Title" item?

I'm trying to crawl https://www.jobs.ch/de/stellenangebote/administration-hr-consulting-ceo/, where I am currently stuck because scrapy returns None for the "Title" item, which is the job name. The CSS selector works fine in the shell, and the other items also work. I've tried altering the selector and adding delays, but nothing seems to work. Does anybody have an idea? Code below.
import scrapy
from jobscraping.items import JobscrapingItem


class GetdataSpider(scrapy.Spider):
    name = 'getdata2'
    start_urls = ['https://www.jobs.ch/de/stellenangebote/administration-hr-consulting-ceo/']

    def parse(self, response):
        for add in response.css('div.sc-AxiKw.VacancySerpItem__ShadowBox-qr45cp-0.hqhfbd'):
            item = JobscrapingItem()
            addpage = response.urljoin(add.css('div.sc-AxiKw.VacancySerpItem__ShadowBox-qr45cp-0.hqhfbd a::attr(href)').get())
            item['link'] = addpage
            request = scrapy.Request(addpage, callback=self.get_addinfos)
            request.meta['item'] = item
            yield request

    def get_addinfos(self, response):
        item = response.meta['item']
        item['Title'] = response.css('.sc-AxhUy.Text__h2-jiiyzm-1.eBKnmN.sc-fzqNJr.Text__span-jiiyzm-8.Text-jiiyzm-9.iNTZsv::text').get()
        item['Company'] = response.css('span.sc-fzqNJr.Text__span-jiiyzm-8.kGLBca.sc-fzqNJr.Text__span-jiiyzm-8.Text-jiiyzm-9.kjfvVS::text').get()
        item['Location'] = response.css('span.sc-fzqNJr.Text__span-jiiyzm-8.kGLBca.sc-fzqNJr.Text__span-jiiyzm-8.Text-jiiyzm-9.WBPTt::text').getall()
        yield item
This is the items.py file:
import scrapy


class JobscrapingItem(scrapy.Item):
    # define the fields for your item here like:
    link = scrapy.Field()
    Title = scrapy.Field()
    Company = scrapy.Field()
    Location = scrapy.Field()
You are using a much more complicated CSS selector than necessary. Remember, you don't always have to use classes or ids. You can use other attributes; in this case data-cy="vacancy-title" seems perfect.
item['Title'] = response.css('h1[data-cy="vacancy-title"]::text').get()
should work. It is simple and easy to debug and change later if something goes wrong.
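As a sketch, get_addinfos could then look roughly like this (only the Title selector is the one suggested above; the Company and Location selectors are kept unchanged from the question, and would ideally also be replaced with attribute-based selectors if the page exposes suitable attributes):
def get_addinfos(self, response):
    item = response.meta['item']
    # attribute-based selector, much less brittle than generated class names
    item['Title'] = response.css('h1[data-cy="vacancy-title"]::text').get()
    # unchanged from the question
    item['Company'] = response.css('span.sc-fzqNJr.Text__span-jiiyzm-8.kGLBca.sc-fzqNJr.Text__span-jiiyzm-8.Text-jiiyzm-9.kjfvVS::text').get()
    item['Location'] = response.css('span.sc-fzqNJr.Text__span-jiiyzm-8.kGLBca.sc-fzqNJr.Text__span-jiiyzm-8.Text-jiiyzm-9.WBPTt::text').getall()
    yield item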

Load Item fields with ItemLoader across multiple responses

This is a follow-up question to the accepted answer to Scrapy: populate items with item loaders over multiple pages. I want to use an ItemLoader to collect values from multiple requests into a single Item. The accepted answer suggests passing the result of Item.load_item() to the next request via the request's meta field.
However, I would like to apply output_processors to all collected values of a single field when returning the loaded object at the end of the crawl.
Questions
What would be the best way to achieve it?
Can I pass the ItemLoader instance over meta to the next request without loading it, and then just replace the selector or response elements in the ItemLoader when adding values or xpaths from the next response?
Example:
def parse(self, response):
    loader = TheLoader(item=TestItems(), response=response)
    loader.add_xpath('title1', '//*[@id="firstHeading"]/text()')
    request = Request(
        "https://en.wikipedia.org/wiki/2016_Rugby_Championship",
        callback=self.parsePage1,
        meta={'loader': loader},
        dont_filter=True
    )
    yield request

def parsePage1(self, response):
    loader = response.meta['loader']
    loader.response = response
    loader.add_xpath('title1', '//*[@id="firstHeading"]/text()')
    return loader.load_item()
Ignore the context of the actual websites.
Yes, you can just pass the ItemLoader instance.
If I recall correctly from an IRC or GitHub discussion long ago, there may be some potential issues with doing this, such as increased memory usage or leaks from reference handling, because by binding ItemLoader instances (and their processors?) to requests you carry those object references around, potentially for a long time, depending on the order of your download queues.
So keep that in mind and perhaps beware of using this style on large crawls, or do some memory debugging to be certain.
However, I have used this method extensively in the past (and would still do so when using ItemLoaders) and haven't seen any problems with the approach.
Here is how I do that:
import scrapy
from scrapy.http import Request
from myproject.loader import ItemLoader


class TheLoader(ItemLoader):
    pass


class SomeSpider(scrapy.Spider):
    [...]

    def parse(self, response):
        loader = TheLoader(item=TestItems(), response=response)
        loader.add_xpath('title1', '//*[@id="firstHeading"]/text()')
        request = Request(
            "https://en.wikipedia.org/wiki/2016_Rugby_Championship",
            callback=self.parsePage1,
            dont_filter=True
        )
        request.meta['loader'] = loader
        yield request

    def parsePage1(self, response):
        loader = response.meta['loader']
        # rebind the ItemLoader to a new Selector instance
        #loader.reset(selector=response.selector, response=response)
        # skipping the selector will default to response.selector, like ItemLoader
        loader.reset(response=response)
        loader.add_xpath('title1', '//*[@id="firstHeading"]/text()')
        return loader.load_item()
This requires a customized ItemLoader class, which can be found in my scrapy scrapyard; the relevant part of the class is here:
from scrapy.loader import ItemLoader as ScrapyItemLoader


class ItemLoader(ScrapyItemLoader):
    """Extended Loader for Selector resetting."""

    def reset(self, selector=None, response=None):
        if response is not None:
            if selector is None:
                selector = self.default_selector_class(response)
            self.selector = selector
            self.context.update(selector=selector, response=response)
        elif selector is not None:
            self.selector = selector
            self.context.update(selector=selector)
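As for the output-processor part of the question: output processors are only applied when load_item() (or get_output_value()) is called, so values added across several responses are processed together at the end. A minimal, self-contained sketch (item and loader names are illustrative, not from the question):
import scrapy
from scrapy.loader import ItemLoader
from scrapy.loader.processors import Join  # itemloaders.processors in newer Scrapy versions


class PageItem(scrapy.Item):
    title1 = scrapy.Field()


class TitleLoader(ItemLoader):
    default_item_class = PageItem
    title1_out = Join(' | ')  # the output processor sees all collected values at once


loader = TitleLoader()
loader.add_value('title1', 'first page title')   # e.g. collected in parse()
loader.add_value('title1', 'second page title')  # e.g. collected in parsePage1()
item = loader.load_item()
print(item['title1'])  # -> 'first page title | second page title'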

attribute error during recursive scraping with scrapy

I have a scrapy spider that works well as long as I give it a page that contains the links to the pages it should scrape.
Now I want to give it not the category pages themselves, but the page that contains links to all the categories.
I thought I could simply add another parse function to achieve this, but the console output gives me an attribute error:
"AttributeError: 'zaubersonder' object has no attribute 'parsedetails'"
This tells me that some attribute reference is not working correctly. I am new to object orientation, but I thought scrapy calls parse, which calls parse_level2, which in turn calls parse_details, and that this should work fine.
Below is my effort so far.
import scrapy


class zaubersonder(scrapy.Spider):
    name = 'zaubersonder'
    allowed_domains = ['abc.de']
    start_urls = ['http://www.abc.de/index.php/rgergegregre.html']

    def parse(self, response):
        urls = response.css('a.ulSubMenu::attr(href)').extract()  # links to categories
        for url in urls:
            url = response.urljoin(url)
            yield scrapy.Request(url=url, callback=self.parse_level2)

    def parse_level2(self, response):
        urls2 = response.css('a.ulSubMenu::attr(href)').extract()  # links to entries
        for url2 in urls2:
            url2 = response.urljoin(url2)
            yield scrapy.Request(url=url2, callback=self.parse_details)

    def parse_details(self, response):  # extract entries
        yield {
            "Titel": response.css("li.active.last::text").extract(),
            "Content": response.css('div.ce_text.first.block').extract() + response.css('div.ce_text.last.block').extract(),
        }
Edit: fixed the code above in case someone searches for it later.
There is a typo in the code: the callback in parse_level2 is self.parsedetails, but the function is named parse_details.
Just change the yield in parse_level2 to:
yield scrapy.Request(url=url2, callback=self.parse_details)
...and it should work.

Scrapy not downloading images and getting pipeline error

I have this code:
class MyImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            yield Request(image_url)
and this is the spider, subclassed from BaseSpider. This spider is giving me nightmares:
def parse(self, response):
    hxs = HtmlXPathSelector(response)
    sites = hxs.select('//strong[@class="genmed"]')
    items = []
    for site in sites[:5]:
        item = PanduItem()
        item['username'] = site.select('dl/dd/h2/a').select("string()").extract()
        item['number_posts'] = site.select('dl/dd/h2/em').select("string()").extract()
        item['profile_link'] = site.select('a/@href').extract()
        request = Request("http://www.example/profile.php?mode=viewprofile&u=5",
                          callback=self.parseUserProfile)
        request.meta['item'] = item
        return request

def parseUserProfile(self, response):
    hxs = HtmlXPathSelector(response)
    sites = hxs.select('//div[@id="current"]')
    myurl = sites[0].select('img/@src').extract()
    item = response.meta['item']
    image_absolute_url = urljoin(response.url, myurl[0].strip())
    item['image_urls'] = [image_absolute_url]
    return item
This is the error I am getting. I am not able to find the cause. It looks like it is getting the item, but I am not sure:
ERROR
File "/app_crawler/crawler/pipelines.py", line 9, in get_media_requests
for image_url in item['image_urls']:
exceptions.TypeError: 'NoneType' object has no attribute '__getitem__'
You are missing a method in your pipelines.py.
An images pipeline typically involves three methods:
process_item
get_media_requests
item_completed
The item_completed method runs once the image downloads for an item have completed; the images themselves are saved to the path specified in settings.py as below:
ITEM_PIPELINES = ['scrapy.contrib.pipeline.images.ImagesPipeline']
IMAGES_STORE = '/your/path/here'
The ITEM_PIPELINES line above is what enables the image pipeline.
I've tried to explain this as best as I understand it. For further reference, have a look at the official Scrapy documentation.
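For illustration, a pipeline with these methods might look roughly like this (a sketch against the old scrapy.contrib API used in the question; process_item is inherited from the base media pipeline, so usually only the other two are overridden, and the image_paths field is an assumption that would have to exist on the item):
from scrapy.contrib.pipeline.images import ImagesPipeline  # scrapy.pipelines.images in newer versions
from scrapy.exceptions import DropItem
from scrapy.http import Request


class MyImagesPipeline(ImagesPipeline):

    def get_media_requests(self, item, info):
        # guard against items that never got 'image_urls' set
        for image_url in item.get('image_urls', []):
            yield Request(image_url)

    def item_completed(self, results, item, info):
        # results is a list of (success, file_info_or_failure) tuples
        image_paths = [res['path'] for ok, res in results if ok]
        if not image_paths:
            raise DropItem("Item contains no downloaded images")
        item['image_paths'] = image_paths  # assumes the item defines an image_paths field
        return item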
Hmmm. At no point are you appending item to items (although the example code in the documentation doesn't do an append either, so I could be barking up the wrong tree).
Try adding it to parse(self, response) like so and see if this resolves the issue:
for site in sites:
    item = PanduItem()
    item['username'] = site.select('dl/dd/h2/a').select("string()").extract()
    item['number_posts'] = site.select('dl/dd/h2/em').select("string()").extract()
    item['profile_link'] = site.select('a/@href').extract()
    items.append(item)
And set the IMAGES_STORE setting to a valid directory that will be used for storing the downloaded images. Otherwise the pipeline will remain disabled, even if you include it in the ITEM_PIPELINES setting.
For example:
IMAGES_STORE = '/path/to/valid/dir'

Scrapy: crawlspider not generating all links in nested callbacks

I have written a scrapy CrawlSpider to crawl a site with a structure like category page > type page > list page > item page. On the category page there are many categories of machines, each of which has a type page listing lots of types; each type has a list of items, and finally each machine has a page with info about it.
My spider has a rule to get from the home page to the category page, where I define the callback parsecatpage. This generates an item, grabs the category, and yields a new request for each category on the page. I pass the item and the category name via request.meta and specify that the callback is parsetypepage.
parsetypepage gets the item from response.meta, then yields a request for each type and passes the item, plus the concatenation of category and type, along with it in request.meta. The callback is parsemachinelist.
parsemachinelist gets the item from response.meta, then yields a request for each item on the list and passes the item, the category/type, and the description via request.meta to the final callback, parsemachine. This gets the meta attributes, populates all the fields of the item using the info on the page and the info passed from the previous pages, and finally yields an item.
If I limit this to a single category and type (with, for example, contains(@href, "filter=c:Grinders") and contains(@href, "filter=t:Disc+-+Horizontal%2C+Single+End")) then it works, and there is a machine item for each machine on the final pages. The problem is that once I allow the spider to scrape all the categories and all the types, it only returns items for the machines on the first of the final pages it reaches, and once it has done that the spider finishes and doesn't get the other categories etc.
Here is the (anonymized) code:
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.http import Request
from myspider.items import MachineItem
import urlparse


class MachineSpider(CrawlSpider):
    name = 'myspider'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/index.php']

    rules = (
        Rule(SgmlLinkExtractor(allow_domains=('example.com'), allow=('12\.html'), unique=True), callback='parsecatpage'),
    )

    def parsecatpage(self, response):
        hxs = HtmlXPathSelector(response)
        # this works:  categories = hxs.select('//a[contains(@href, "filter=c:Grinders")]')
        # the next line doesn't:
        categories = hxs.select('//a[contains(@href, "filter=c:Grinders") or contains(@href, "filter=c:Lathes")]')
        for cat in categories:
            item = MachineItem()
            req = Request(urlparse.urljoin(response.url, ''.join(cat.select("@href").extract()).strip()), callback=self.parsetypepage)
            req.meta['item'] = item
            req.meta['machinecategory'] = ''.join(cat.select("./text()").extract())
            yield req

    def parsetypepage(self, response):
        hxs = HtmlXPathSelector(response)
        # this works:  types = hxs.select('//a[contains(@href, "filter=t:Disc+-+Horizontal%2C+Single+End")]')
        # the next line doesn't:
        types = hxs.select('//a[contains(@href, "filter=t:Disc+-+Horizontal%2C+Single+End") or contains(@href, "filter=t:Lathe%2C+Production")]')
        for typ in types:
            item = response.meta['item']
            req = Request(urlparse.urljoin(response.url, ''.join(typ.select("@href").extract()).strip()), callback=self.parsemachinelist)
            req.meta['item'] = item
            req.meta['machinecategory'] = ': '.join([response.meta['machinecategory'], ''.join(typ.select("./text()").extract())])
            yield req

    def parsemachinelist(self, response):
        hxs = HtmlXPathSelector(response)
        for row in hxs.select('//tr[contains(td/a/@href, "action=searchdet")]'):
            item = response.meta['item']
            req = Request(urlparse.urljoin(response.url, ''.join(row.select('./td/a[contains(@href,"action=searchdet")]/@href').extract()).strip()), callback=self.parsemachine)
            print urlparse.urljoin(response.url, ''.join(row.select('./td/a[contains(@href,"action=searchdet")]/@href').extract()).strip())
            req.meta['item'] = item
            req.meta['descr'] = row.select('./td/div/text()').extract()
            req.meta['machinecategory'] = response.meta['machinecategory']
            yield req

    def parsemachine(self, response):
        hxs = HtmlXPathSelector(response)
        item = response.meta['item']
        item['machinecategory'] = response.meta['machinecategory']
        item['comp_name'] = 'Name'
        item['description'] = response.meta['descr']
        item['makemodel'] = ' '.join([''.join(hxs.select('//table/tr[contains(td/strong/text(), "Make")]/td/text()').extract()), ''.join(hxs.select('//table/tr[contains(td/strong/text(), "Model")]/td/text()').extract())])
        item['capacity'] = hxs.select('//tr[contains(td/strong/text(), "Capacity")]/td/text()').extract()
        relative_image_url = hxs.select('//img[contains(@src, "custom/modules/images")]/@src')[0].extract()
        abs_image_url = urlparse.urljoin(response.url, relative_image_url.strip())
        item['image_urls'] = [abs_image_url]
        yield item


SPIDER = MachineSpider()
So, for example, the spider will find Grinders on the category page and go to the Grinders type page, where it will find the Disc Horizontal Single End type; it will then go to that page, find the list of machines, go to each machine's page, and finally yield an item for each machine. If I try to crawl both Grinders and Lathes, though, it runs through the Grinders fine, then crawls the Lathes and Lathes type pages and stops there, without generating the requests for the Lathes list pages and the final Lathes pages.
Can anyone help with this? Why isn't the spider getting to the second (or third, etc.) machine list page once there is more than one category of machine?
Sorry for the epic post, just trying to explain the problem!
Thanks!
You should print the URL of the request to be sure it's OK. You can also try this version:
def parsecatpage(self, response):
    hxs = HtmlXPathSelector(response)
    categories = hxs.select('//a[contains(@href, "filter=c:Grinders") or contains(@href, "filter=c:Lathes")]')
    for cat in categories:
        item = MachineItem()
        cat_url = urlparse.urljoin(response.url, cat.select("./@href").extract()[0])
        print 'url:', cat_url  # to see what's there
        cat_name = cat.select("./text()").extract()[0]
        req = Request(cat_url, callback=self.parsetypepage, meta={'item': item, 'machinecategory': cat_name})
        yield req
The problem was that the website is set up so that moving from the category page to the type page (and the following pages) happens by filtering the results that are shown. This means that if the requests are performed depth first to the bottom of the query, it works (i.e. choose a category, get all the types of that category, get all the machines of each type, then scrape the page of each machine). But if a request for the next type page is processed before the spider has collected the URLs of every machine of the first type, those URLs are no longer correct, and the spider reaches a wrong page and cannot extract the info for the next step.
To solve the problem I defined a category setup callback, which is called only once and builds a list of all the categories, called categories; then a category callback, which is called from the category setup and starts the crawl with a single category only, using categories.pop(). Once the spider has reached the bottom of the nested callbacks and scraped all the machines in the list, there is a callback back up to the category callback again (this needed dont_follow=True in the Request), where categories.pop() starts the process again with the next category in the list, until they are all done. This way each category is handled completely before the next one is started, and it works.
Thanks for your final comment, that got me thinking along the right lines and led me to the solution!
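A rough sketch of the pattern described above, with hypothetical callback names (the real spider would keep the parsetypepage/parsemachinelist/parsemachine chain from the question; dont_filter=True is used here on the assumption that revisited category pages must not be filtered as duplicates):
def parse_setup(self, response):
    # called once: collect all category links, then start with the first one
    hxs = HtmlXPathSelector(response)
    self.categories = hxs.select('//a[contains(@href, "filter=c:")]/@href').extract()
    return self.next_category(response)

def next_category(self, response):
    # pop one category and crawl it completely before touching the next one
    if not self.categories:
        return
    cat_url = urlparse.urljoin(response.url, self.categories.pop())
    return Request(cat_url, callback=self.parsetypepage, dont_filter=True)

def parsemachine_done(self, response):
    # once the last machine of a category has been scraped,
    # call back up and continue with the next category
    return self.next_category(response)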
