scrapy custom output processor

scrapy custom output processor - python

I'm using the scrapy framework for a web scraping project but I can't seem to figure out how to get a custom output processor to work.
I have an item class like so:
class Item(scrapy.Item)
ad_type = scrapy.Field()
then my parse function looks something like this. I have 2 scraped strings which I am adding to the ad_type. I want my output processor function to assign tags based on what is scraped from these 2 xpaths.
def parse(self, response):
l = ItemLoader(item=Item(), selector=listing)
l.add_xpath('ad_type', '(.//div/#class)[1]')
l.add_xpath('ad_type', '(.//div[contains(#class, "brand")]/#class)[1]')
yield l.load_item()
How do I get my output processor function to access the 2 xpath scraped strings that I have added to ad_type? The scrapy docs give this example but I can't get it to work.
def lowercase_processor(self, values):
for v in values:
yield v.lower()
class MyItemLoader(ItemLoader):
name_in = lowercase_processor

You have named your loader MyItemLoader, but your spider uses ItemLoader (probably scrapy's).
If you update your code to use the custom loader, you should get the result you want.
I would also recommend not naming your item class Item, since that could be confusing.

Related

Item is losing name while yielding in Python / Scrapy

In Scrapy 2.4.x on Python 3.8.x I am yielding an item with the purpose to save some stats to a DB. The scraper has another Item that gets yielded as well.
While the name of the item is present in the main script "StatsItem", it is lost within the other class. I am using the name of the item to decide which method to call:
in scraper.py:
import scrapy
from crawler.items import StatsItem, OtherItem
class demo(scrapy.Spider):
def parse_item(self, response):
stats = StatsItem()
stats['results'] = 10
yield stats
print(type(stats).__name__)
# Output: StatsItem
print(stats)
# Output: {'results': 10}
in pipeline.py
import scrapy
from crawler.items import StatsItem, OtherItem
class mysql_pipeline(object):
def process_item(self, item, spider):
print(type(item).__name__)
# Output: NoneType
if isinstance(item, StatsItem):
self.save_stats(item, spider)
elif isinstance(item, OtherItem):
# call other method
return item
The output of print in the first class is "StatsItem", while it is "NoneType" within the pipeline, therefore the method save_stats() gets never called.
I am pretty new to Python, so there might be a better way of doing this. There is no error message or exception I am aware of. Any help is greatly appreciated.

You can't use yield outside of a function imo.

I was finaly able to locate the problem. The particular crawler was nearly identical to all other ones that did not have this issue but with one exception, I was custom setting the item pipeline:
custom_settings.update({
'ITEM_PIPELINES' : {
'crawler.pipelines.mysql_pipeline': 301,
}
})
Removing this, fixed the issue.

Item serializers don't work. Function never gets called

I'm trying to use the serializer attribute in an Item, just like the example in the documentation:
https://docs.scrapy.org/en/latest/topics/exporters.html#declaring-a-serializer-in-the-field
The spider works without any errors, but the serialization doesn't happens, the print in the function doesn't print too. It's like the function remove_pound is never called.
import scrapy
def remove_pound(value):
print('Am I a joke to you?')
return value.replace('£', '')
class BookItem(scrapy.Item):
title = scrapy.Field()
price = scrapy.Field(serializer=remove_pound)
class BookSpider(scrapy.Spider):
name = 'bookspider'
start_urls = ['https://books.toscrape.com/']
def parse(self, response):
books = response.xpath('//ol/li')
for i in books:
yield BookItem(
title=i.xpath('article/h3/a/text()').get(),
price=i.xpath('article/div/p[#class="price_color"]/text()').get(),
)
Am I using it wrong?
PS.: I know there are other ways to do it, I just want to learn to use this way.

The only reason it doesn't work is because your XPath expression is not right. You need to use relative XPath:
price=i.xpath('./article/div/p[#class="price_color"]/text()').get()
Update It's not XPath. The serialization works only for item exporters:
you can customize how each field value is serialized before it is
passed to the serialization library.
So if you run this command scrapy crawl bookspider -o BookSpider.csv you'll get a correct (serialized) output.

Scrapy Data Flow and Items and Item Loaders

I am looking at the Architecture Overview page in the Scrapy documentation, but I still have a few questions regarding data and or control flow.
Scrapy Architecture
Default File Structure of Scrapy Projects
scrapy.cfg
myproject/
__init__.py
items.py
middlewares.py
pipelines.py
settings.py
spiders/
__init__.py
spider1.py
spider2.py
...
item.py
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html
import scrapy
class MyprojectItem(scrapy.Item):
# define the fields for your item here like:
# name = scrapy.Field()
pass
which, I'm assuming, becomes
import scrapy
class Product(scrapy.Item):
name = scrapy.Field()
price = scrapy.Field()
stock = scrapy.Field()
last_updated = scrapy.Field(serializer=str)
so that errors are thrown when trying to populate undeclared fields of Product instances
>>> product = Product(name='Desktop PC', price=1000)
>>> product['lala'] = 'test'
Traceback (most recent call last):
...
KeyError: 'Product does not support field: lala'
Question #1
Where, when, and how does our crawler become aware of items.py if we have created class CrowdfundingItem in items.py?
Is this done in...
__init__.py?
my_crawler.py?
def __init__() of mycrawler.py?
settings.py?
pipelines.py?
def __init__(self, dbpool) of pipelines.py?
somewhere else?
Question #2
Once I have declared an item such as Product, how do I then store the data by creating instances of Product in a context similar to the one below?
import scrapy
class MycrawlerSpider(CrawlSpider):
name = 'mycrawler'
allowed_domains = ['google.com']
start_urls = ['https://www.google.com/']
def parse(self, response):
options = Options()
options.add_argument('-headless')
browser = webdriver.Firefox(firefox_options=options)
browser.get(self.start_urls[0])
elements = browser.find_elements_by_xpath('//section')
count = 0
for ele in elements:
name = browser.find_element_by_xpath('./div[#id="name"]').text
price = browser.find_element_by_xpath('./div[#id="price"]').text
# If I am not sure how many items there will be,
# and hence I cannot declare them explicitly,
# how I would go about creating named instances of Product?
# Obviously the code below will not work, but how can you accomplish this?
count += 1
varName + count = Product(name=name, price=price)
...
Lastly, say we forego naming the Product instances altogether, and instead simply create unnamed instances.
for ele in elements:
name = browser.find_element_by_xpath('./div[#id="name"]').text
price = browser.find_element_by_xpath('./div[#id="price"]').text
Product(name=name, price=price)
If such instances are indeed stored somewhere, where are they stored? By creating instances this way, would it be impossible to access them?

Using an Item is optional; they're just a convenient way to declare your data model and apply validation. You can also use a plain dict instead.
If you do choose to use Item, you will need to import it for use in the spider. It's not discovered automatically. In your case:
from items import CrowdfundingItem
As a spider runs the parse method on each page, you can load the extracted data into your Item or dict. Once it's loaded, yield it, which passes it back to the scrapy engine for processing downstream, in pipelines or exporters. This is how scrapy enables "storage" of the data you scrape.
For example:
yield Product(name='Desktop PC', price=1000) # uses Item
yield {'name':'Desktop PC', 'price':1000} # plain dict

Scrapy : Program organization when interacting with secondary website

I'm working with Scrapy 1.1 and I have a project where I have spider '1' scrape site A (where I aquire 90% of the information to fill my items). However depending on the results of the Site A scrape, I may need to scrape additional information from site B. As far as developing the program, does it make more sense to scrape site B within spider '1' or would it be possible to interact site B from within a pipeline object. I prefer the latter, thinking that it decouples the scraping of 2 sites, but I'm not sure if this is possible or the best way to handle this use case. Another approach might be to use a second spider (spider '2') for site B, but then I would assume that I would have to let spider '1' run, save to db then run spider '2' . Anyway any advice would be appreciated.

Both approaches are very common and this just a question of preference. For your case containing everything in one spider sounds like a straight-forward solution.
You can add url field to your item and schedule and parse it later in the pipeline:
class MyPipeline(object):
def __init__(self, crawler):
self.crawler = crawler
#classmethod
def from_crawler(cls, crawler):
return cls(crawler)
def process_item(self, item, spider):
extra_url = item.get('extra_url', None)
if not extra_url:
return item
req = Request(url=extra_url
callback=self.custom_callback,
meta={'item': item},)
self.crawler.engine.crawl(req, spider)
# you have to drop the item here since you will return it later anyway
raise DropItem()
def custom_callback(self, response):
# retrieve your item
item = response.mete['item']
# do something to add to item
item['some_extra_stuff'] = ...
del item['extra_url']
yield item
What the above code does is checks whether item has some url field, if it does it drops the item and schedules a new request. That requests fills up the item with some extra data and sends it back to the pipeline.

Scrapy Deploy Doesn't Match Debug Result

I am using Scrapy to extract some data from a site, say "myproject.com". Here is the logic:
Go to the homepage, and there are some categorylist that to be used to build the second wave of links.
For the second round of links, they are usually the first page from each category. Also, for different pages inside that category, they follow the same regular expression pattern wholesale/something/something/request or wholesale/pagenumber. And I want to follow those patterns to keep crawling and meanwhile store the raw HTML in my item object.
I tested these two steps separately by using the parse and they both worked.
First, I tried:
scrapy parse http://www.myproject.com/categorylist/cat_a --spider myproject --rules
And I can see it built the outlinks successfully. Then I tested the built outlink again.
scrapy parse http://www.myproject.com/wholesale/cat_a/request/1 --spider myproject --rules
And seems like the rule is correct and it generate a item with the HTML stored in there.
However, when I tried to link those two steps together by using the depth argument. I saw it crawled the outlinks but no items got generated.
scrapy parse http://www.myproject.com/categorylist/cat_a --spider myproject --rules --depth 2
Here is the pseudo code:
class MyprojectSpider(CrawlSpider):
name = "Myproject"
allowed_domains = ["Myproject.com"]
start_urls = ["http://www.Myproject.com/"]
rules = (
Rule(LinkExtractor(allow=('/categorylist/\w+',)), callback='parse_category', follow=True),
Rule(LinkExtractor(allow=('/wholesale/\w+/(?:wholesale|request)/\d+',)), callback='parse_pricing', follow=True),
)
def parse_category(self, response):
try:
soup = BeautifulSoup(response.body)
...
my_request1 = Request(url=myurl1)
yield my_request1
my_request2 = Request(url=myurl2)
yield my_request2
except:
pass
def parse_pricing(self, response):
item = MyprojectItem()
try:
item['myurl'] = response.url
item['myhtml'] = response.body
item['mystatus'] = 'fetched'
except:
item['mystatus'] = 'failed'
return item
Thanks a lot for any suggestion!

I was assuming the new Request objects that I built will run against the rules and then be parsed by the corresponding callback function define in the Rule, however, after reading the documentation of Request, the callback method is handled in a different way.
class scrapy.http.Request(url[, callback, method='GET', headers, body, cookies, meta, encoding='utf-8', priority=0, dont_filter=False, errback])
callback (callable) – the function that will be called with the response of this request (once its downloaded) as its first parameter. For more information see Passing additional data to callback functions below. If a Request doesn’t specify a callback, the spider’s parse() method will be used. Note that if exceptions are raised during processing, errback is called instead.
...
my_request1 = Request(url=myurl1, callback=self.parse_pricing)
yield my_request1
my_request2 = Request(url=myurl2, callback=self.parse_pricing)
yield my_request2
...
In another way, even if the URLs I built matches the second rule, it won't be passed to parse_pricing. Hope this is helpful to other people.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

scrapy custom output processor - python

You have named your loader MyItemLoader, but your spider uses ItemLoader (probably scrapy's). If you update your code to use the custom loader, you should get the result you want. I would also recommend not naming your item class Item, since that could be confusing.

Related

Item is losing name while yielding in Python / Scrapy

Item serializers don't work. Function never gets called

Scrapy Data Flow and Items and Item Loaders

Scrapy : Program organization when interacting with secondary website

Scrapy Deploy Doesn't Match Debug Result

Categories

Resources