Scrapy: Program organization when interacting with a secondary website - python

I'm working with Scrapy 1.1 on a project where spider '1' scrapes site A (where I acquire 90% of the information needed to fill my items). Depending on the results of the site A scrape, however, I may need to scrape additional information from site B. As far as structuring the program, does it make more sense to scrape site B within spider '1', or is it possible to interact with site B from within a pipeline object? I prefer the latter, thinking that it decouples the scraping of the two sites, but I'm not sure whether that is possible or the best way to handle this use case. Another approach might be to use a second spider (spider '2') for site B, but then I assume I would have to let spider '1' run, save to the db, then run spider '2'. Any advice would be appreciated.

Both approaches are very common and this is just a question of preference. For your case, containing everything in one spider sounds like a straightforward solution.
You can add a url field to your item and schedule and parse it later in the pipeline:
from scrapy import Request
from scrapy.exceptions import DropItem

class MyPipeline(object):
    def __init__(self, crawler):
        self.crawler = crawler

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def process_item(self, item, spider):
        extra_url = item.get('extra_url', None)
        if not extra_url:
            return item
        req = Request(url=extra_url,
                      callback=self.custom_callback,
                      meta={'item': item})
        self.crawler.engine.crawl(req, spider)
        # you have to drop the item here since you will return it later anyway
        raise DropItem()

    def custom_callback(self, response):
        # retrieve your item
        item = response.meta['item']
        # do something to add to item
        item['some_extra_stuff'] = ...
        del item['extra_url']
        yield item
What the above code does is check whether the item has an extra_url field; if it does, it drops the item and schedules a new request. That request fills up the item with some extra data and sends it back through the pipeline.
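For completeness, a pipeline like this only runs once it is activated in the project settings; a minimal sketch, assuming the class lives in myproject/pipelines.py (the module path and priority value are placeholders):

# settings.py -- hypothetical module path and priority
ITEM_PIPELINES = {
    'myproject.pipelines.MyPipeline': 300,
}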

Related

Item is losing name while yielding in Python / Scrapy

In Scrapy 2.4.x on Python 3.8.x I am yielding an item in order to save some stats to a DB. The scraper has another Item that gets yielded as well.
While the name of the item ("StatsItem") is present in the main script, it is lost within the other class. I am using the name of the item to decide which method to call:
in scraper.py:
import scrapy
from crawler.items import StatsItem, OtherItem

class demo(scrapy.Spider):
    def parse_item(self, response):
        stats = StatsItem()
        stats['results'] = 10
        yield stats
        print(type(stats).__name__)
        # Output: StatsItem
        print(stats)
        # Output: {'results': 10}
in pipeline.py:
import scrapy
from crawler.items import StatsItem, OtherItem

class mysql_pipeline(object):
    def process_item(self, item, spider):
        print(type(item).__name__)
        # Output: NoneType
        if isinstance(item, StatsItem):
            self.save_stats(item, spider)
        elif isinstance(item, OtherItem):
            # call other method
            pass
        return item
The output of print in the first class is "StatsItem", while it is "NoneType" within the pipeline, so the method save_stats() never gets called.
I am pretty new to Python, so there might be a better way of doing this. There is no error message or exception I am aware of. Any help is greatly appreciated.
You can't use yield outside of a function imo.
I was finally able to locate the problem. The particular crawler was nearly identical to all the other ones that did not have this issue, with one exception: I was custom-setting the item pipeline:
custom_settings.update({
    'ITEM_PIPELINES': {
        'crawler.pipelines.mysql_pipeline': 301,
    }
})
Removing this fixed the issue.
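If a per-spider pipeline override is still wanted later, a minimal sketch of the more usual pattern is to define custom_settings as its own class attribute rather than calling .update() on a dict that may be shared with a base class (whether that sharing caused the NoneType symptom here is only an assumption):

import scrapy

class demo(scrapy.Spider):
    name = 'demo'
    # a fresh dict owned by this spider class, instead of mutating a shared one
    custom_settings = {
        'ITEM_PIPELINES': {
            'crawler.pipelines.mysql_pipeline': 301,
        },
    }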

Scrapy - decorator for responses, can it access data 'inside' the method?

Using Scrapy. We have a decorator for logging Scrapy responses in utils/__init__.py and it prints what it finds; this is OK. Only we would also like to know "how many links it found on the page". As a result we have 2 log statements, resulting in 2 lines:
200: page found XXX
Found 23 products on category page XXX
Instead we would like to have 1 log statement, preferably somewhere central and not in every crawler (we have a lot!), that prints
200: Page found, with # products - XXX
I don't think log_response is able to access data 'inside' the method, because that occurs later? Or is there a way to achieve this where we have 1 central method like log_response that can also access the number of links found, so we can remove all the "Found 23 products on category page XXX" lines in individual crawlers?
Question: how can we centralize this and make it more generic, so there is no logging logic in the crawler class but somewhere else / more central?
# decorator for logging Scrapy responses
from functools import wraps
from urllib.parse import urlparse

def log_response(title, with_meta=False):
    def real_decorator(f):
        @wraps(f)
        def wrap(self, response):
            if not with_meta:
                path = urlparse(response.url).path.strip('/')
                self.logger.info(f'200 {title}: {path}')
            return f(self, response)
        return wrap
    return real_decorator
This is how we currently report the number of links found:
@log_response('category')
def parse_category(self, response):
    product_links = response.xpath('//a[@class="mainLink"]/@href').getall()
    self.logger.info(f'Found {len(product_links)} products on category page (url {response.url})')
The simplest way is probably doing your logging in a SpiderMiddleware's process_spider_output method, since it will be called every time a spider callback finishes.
Simply iterate over result, count the items, and make a logging call once your loop is over.
import scrapy

class LoggingMiddleware:
    def process_spider_output(self, response, result, spider):
        count = 0
        for x in result:
            yield x
            # I think this is a sufficient check?
            if not isinstance(x, scrapy.Request):
                count += 1
        spider.logger.info(f'{response.status}: Page found, with {count} products - {response.url}')
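To run at all, the middleware also needs to be registered in the project settings; a minimal sketch, assuming it lives in myproject/middlewares.py (module path and priority value are placeholders):

# settings.py -- hypothetical module path and priority
SPIDER_MIDDLEWARES = {
    'myproject.middlewares.LoggingMiddleware': 543,
}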

scrapy custom output processor

I'm using the scrapy framework for a web scraping project but I can't seem to figure out how to get a custom output processor to work.
I have an item class like so:
class Item(scrapy.Item):
    ad_type = scrapy.Field()
Then my parse function looks something like this. I have 2 scraped strings which I am adding to ad_type. I want my output processor function to assign tags based on what is scraped from these 2 XPaths.
def parse(self, response):
    l = ItemLoader(item=Item(), selector=listing)
    l.add_xpath('ad_type', '(.//div/@class)[1]')
    l.add_xpath('ad_type', '(.//div[contains(@class, "brand")]/@class)[1]')
    yield l.load_item()
How do I get my output processor function to access the 2 xpath scraped strings that I have added to ad_type? The scrapy docs give this example but I can't get it to work.
def lowercase_processor(self, values):
    for v in values:
        yield v.lower()

class MyItemLoader(ItemLoader):
    name_in = lowercase_processor
You have named your loader MyItemLoader, but your spider uses ItemLoader (probably scrapy's).
If you update your code to use the custom loader, you should get the result you want.
I would also recommend not naming your item class Item, since that could be confusing.
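A minimal sketch of what the corrected setup could look like, assuming the goal is to collapse the two scraped class strings into a single tag (the tag_processor logic and the AdItem / AdItemLoader names are illustrative, not from the question):

import scrapy
from scrapy.loader import ItemLoader

def tag_processor(self, values):
    # illustrative: turn the collected class strings into one tag
    return 'brand' if any('brand' in v for v in values) else 'generic'

class AdItem(scrapy.Item):
    ad_type = scrapy.Field()

class AdItemLoader(ItemLoader):
    ad_type_out = tag_processor

The spider callback would then instantiate AdItemLoader instead of the base ItemLoader, and ad_type would come out of load_item() as the single computed tag.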

Keeping streams of data separate using one Scrapy spider

I want to scrape data from three different categories of contracts --- goods, services, construction.
Because each type of contract can be parsed with the same method, my goal is to use a single spider, start the spider on three different urls, and then extract data in three distinct streams that can be saved to different places.
My understanding is that just listing all three urls as start_urls will lead to one combined output of data.
My spider inherits from Scrapy's CrawlSpider class.
Let me know if you need further information.
I would suggest that you tackle this problem from another angle. In Scrapy it is possible to pass arguments to the spider from the command line using the -a option, like so:
scrapy crawl CanCrawler -a contract=goods
You just need to include the variables you reference in your class initializer:
class CanCrawler(scrapy.Spider):
    name = 'CanCrawler'

    def __init__(self, contract='', *args, **kwargs):
        super(CanCrawler, self).__init__(*args, **kwargs)
        self.start_urls = ['https://buyandsell.gc.ca/procurement-data/search/site']
        # ...
Something else you might consider is adding multiple arguments, so that you can start on the homepage of a website and, using the arguments, get to whatever data you need. For this website, https://buyandsell.gc.ca/procurement-data/search/site, you could for example have two command-line arguments:
scrapy crawl CanCrawler -a procure=ContractHistory -a contract=goods
so you'd get
class CanCrawler(scrapy.Spider):
    name = 'CanCrawler'

    def __init__(self, procure='', contract='', *args, **kwargs):
        super(CanCrawler, self).__init__(*args, **kwargs)
        self.start_urls = ['https://buyandsell.gc.ca/procurement-data/search/site']
        # ...
and then depending on what arguments you passed, you could make your crawler click on those options on the website to get to the data that you want to crawl.
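As a concrete illustration of the argument-driven approach, here is a minimal sketch that maps the -a contract= value onto a start URL; the query-string format is a placeholder, not taken from the actual site:

import scrapy

class CanCrawler(scrapy.Spider):
    name = 'CanCrawler'

    def __init__(self, contract='goods', *args, **kwargs):
        super(CanCrawler, self).__init__(*args, **kwargs)
        self.contract = contract
        # hypothetical URL scheme: one search page per contract category
        self.start_urls = [
            'https://buyandsell.gc.ca/procurement-data/search/site?f=' + contract
        ]

Run as scrapy crawl CanCrawler -a contract=services, each invocation then produces one clean stream of data per category.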
Please also see here.
I hope this helps!
In your Spider, yield your item like this.
data = {'categories': {}, 'contracts': {}, 'goods': {}, 'services': {}, 'construction': {}}
where each of these keys maps to a Python dictionary.
Then create a pipeline, and inside the pipeline do this:
if 'categories' in item:
    categories = item['categories']
    # and then process categories, save into DB maybe
if 'contracts' in item:
    contracts = item['contracts']
    # and then process contracts, save into DB maybe
# ... and so on for the other keys
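Wrapped into a full pipeline class, that approach could look like the sketch below; the save_* helpers are placeholders for whatever DB code ends up being used:

class ContractPipeline(object):
    def process_item(self, item, spider):
        if 'goods' in item:
            self.save_goods(item['goods'], spider)            # placeholder DB helper
        if 'services' in item:
            self.save_services(item['services'], spider)      # placeholder DB helper
        if 'construction' in item:
            self.save_construction(item['construction'], spider)  # placeholder DB helper
        return item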

Scrapy Deploy Doesn't Match Debug Result

I am using Scrapy to extract some data from a site, say "myproject.com". Here is the logic:
Go to the homepage, where there is a category list to be used to build the second wave of links.
For the second round of links, they are usually the first page of each category. The different pages inside a category follow the same regular expression pattern, wholesale/something/something/request or wholesale/pagenumber, and I want to follow those patterns to keep crawling while storing the raw HTML in my item object.
I tested these two steps separately by using the parse command and they both worked.
First, I tried:
scrapy parse http://www.myproject.com/categorylist/cat_a --spider myproject --rules
And I can see it built the outlinks successfully. Then I tested one of the built outlinks:
scrapy parse http://www.myproject.com/wholesale/cat_a/request/1 --spider myproject --rules
It seems the rule is correct and it generates an item with the HTML stored in it.
However, when I tried to link the two steps together by using the depth argument, I saw that it crawled the outlinks but no items got generated.
scrapy parse http://www.myproject.com/categorylist/cat_a --spider myproject --rules --depth 2
Here is the pseudo code:
from bs4 import BeautifulSoup
from scrapy.http import Request
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MyprojectSpider(CrawlSpider):
    name = "Myproject"
    allowed_domains = ["Myproject.com"]
    start_urls = ["http://www.Myproject.com/"]

    rules = (
        Rule(LinkExtractor(allow=(r'/categorylist/\w+',)), callback='parse_category', follow=True),
        Rule(LinkExtractor(allow=(r'/wholesale/\w+/(?:wholesale|request)/\d+',)), callback='parse_pricing', follow=True),
    )

    def parse_category(self, response):
        try:
            soup = BeautifulSoup(response.body)
            ...
            my_request1 = Request(url=myurl1)
            yield my_request1
            my_request2 = Request(url=myurl2)
            yield my_request2
        except:
            pass

    def parse_pricing(self, response):
        item = MyprojectItem()
        try:
            item['myurl'] = response.url
            item['myhtml'] = response.body
            item['mystatus'] = 'fetched'
        except:
            item['mystatus'] = 'failed'
        return item
Thanks a lot for any suggestion!
I was assuming the new Request objects that I built would run against the rules and then be parsed by the corresponding callback function defined in the Rule. However, after reading the documentation of Request, I learned that the callback is handled in a different way.
class scrapy.http.Request(url[, callback, method='GET', headers, body, cookies, meta, encoding='utf-8', priority=0, dont_filter=False, errback])
callback (callable) – the function that will be called with the response of this request (once its downloaded) as its first parameter. For more information see Passing additional data to callback functions below. If a Request doesn’t specify a callback, the spider’s parse() method will be used. Note that if exceptions are raised during processing, errback is called instead.
...
my_request1 = Request(url=myurl1, callback=self.parse_pricing)
yield my_request1
my_request2 = Request(url=myurl2, callback=self.parse_pricing)
yield my_request2
...
In other words, even though the URLs I built match the second rule, they won't be passed to parse_pricing unless the callback is set explicitly. Hope this is helpful to other people.
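A side note, not from the original answer: because both Rules are declared with follow=True, the CrawlSpider will itself extract and schedule any /wholesale/... links that appear as plain <a> elements in the category pages and route them to parse_pricing; manually building those Requests is only necessary when the URLs have to be constructed rather than extracted. A sketch under that assumption:

def parse_category(self, response):
    # nothing to do here: with follow=True, the second Rule's LinkExtractor
    # picks up matching /wholesale/... links from this response and sends
    # them to parse_pricing automatically.
    pass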
