I want to scrape data from three different categories of contracts: goods, services, and construction.
Because each type of contract can be parsed with the same method, my goal is to use a single spider, start the spider on three different urls, and then extract data in three distinct streams that can be saved to different places.
My understanding is that just listing all three urls as start_urls will lead to one combined output of data.
My spider inherits from Scrapy's CrawlSpider class.
Let me know if you need further information.
I would suggest that you tackle this problem from another angle. In Scrapy it is possible to pass arguments to the spider from the command line using the -a option, like so:
scrapy crawl CanCrawler -a contract=goods
You just need to accept the arguments you pass as parameters of your class initializer:
import scrapy
class CanCrawler(scrapy.Spider):
    name = 'CanCrawler'
    def __init__(self, contract='', *args, **kwargs):
        super(CanCrawler, self).__init__(*args, **kwargs)
        self.contract = contract  # value passed with -a contract=...
        self.start_urls = ['https://buyandsell.gc.ca/procurement-data/search/site']
        # ...
Something else you might consider is adding multiple arguments, so that the spider starts on the homepage of a website and uses the arguments to navigate to whatever data you need. For this website, https://buyandsell.gc.ca/procurement-data/search/site, you could for example have two command-line arguments:
scrapy crawl CanCrawler -a procure=ContractHistory -a contract=goods
so you'd get
class CanCrawler(scrapy.Spider):
    name = 'CanCrawler'
    def __init__(self, procure='', contract='', *args, **kwargs):
        super(CanCrawler, self).__init__(*args, **kwargs)
        # keep both values so the callbacks can decide which options to follow
        self.procure = procure
        self.contract = contract
        self.start_urls = ['https://buyandsell.gc.ca/procurement-data/search/site']
        # ...
and then depending on what arguments you passed, you could make your crawler click on those options on the website to get to the data that you want to crawl.
Please also see here.
I hope this helps!
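One way to get three separate outputs with this approach is to run the spider once per contract type and give each run its own feed file with -o. Below is a minimal sketch that only builds (and optionally runs) the command lines. The spider name and the contract argument come from the answer above; the output file names and the helper names are illustrative assumptions.

```python
import subprocess

CONTRACT_TYPES = ["goods", "services", "construction"]


def build_crawl_command(contract, spider="CanCrawler"):
    """Return the argv list for one crawl of a single contract type."""
    # The output file name is an assumption; any path accepted by -o would do.
    return [
        "scrapy", "crawl", spider,
        "-a", f"contract={contract}",
        "-o", f"{contract}.json",
    ]


def run_all(contract_types=CONTRACT_TYPES):
    """Run the crawls one after another.

    Needs scrapy on PATH and must be started from the project directory.
    """
    for contract in contract_types:
        subprocess.run(build_crawl_command(contract), check=True)


if __name__ == "__main__":
    # Only print the commands here; call run_all() to actually execute them.
    for contract in CONTRACT_TYPES:
        print(" ".join(build_crawl_command(contract)))
```

Each run then writes its own file, so the three streams never get mixed.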
In your Spider, yield your item like this.
data = {'categories': {}, 'contracts':{}, 'goods':{}, 'services':{}, 'construction':{} }
Here each value is a Python dictionary.
Then create a pipeline, and inside its process_item method do this:
if 'categories' in item:
    categories = item['categories']
    # and then process categories, save into DB maybe
if 'contracts' in item:
    contracts = item['contracts']
    # and then process contracts, save into DB maybe
# ...
# and likewise for the other keys
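To make the pipeline idea concrete, here is a minimal sketch of a pipeline that writes each top-level key of the item to its own JSON-lines file. The key names come from the example item above; the class name and the file naming scheme are assumptions for illustration, and the pipeline would still have to be enabled in ITEM_PIPELINES.

```python
import json


class SplitByKeyPipeline:
    """Write each top-level key of an item to its own JSON-lines file."""

    # Keys taken from the example item; the <key>.jl file names are an assumption.
    keys = ("categories", "contracts", "goods", "services", "construction")

    def open_spider(self, spider):
        # One output file per key, opened when the spider starts.
        self.files = {
            key: open(f"{key}.jl", "w", encoding="utf-8") for key in self.keys
        }

    def close_spider(self, spider):
        for handle in self.files.values():
            handle.close()

    def process_item(self, item, spider):
        for key in self.keys:
            # Skip keys that are missing or empty in this item.
            if item.get(key):
                self.files[key].write(json.dumps(item[key]) + "\n")
        return item
```

Returning the item at the end keeps it available to any later pipelines.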
I'm using the scrapy framework for a web scraping project but I can't seem to figure out how to get a custom output processor to work.
I have an item class like so:
class Item(scrapy.Item):
    ad_type = scrapy.Field()
then my parse function looks something like this. I have 2 scraped strings which I am adding to the ad_type. I want my output processor function to assign tags based on what is scraped from these 2 xpaths.
def parse(self, response):
    l = ItemLoader(item=Item(), selector=listing)
    l.add_xpath('ad_type', '(.//div/@class)[1]')
    l.add_xpath('ad_type', '(.//div[contains(@class, "brand")]/@class)[1]')
    yield l.load_item()
How do I get my output processor function to access the 2 xpath scraped strings that I have added to ad_type? The scrapy docs give this example but I can't get it to work.
def lowercase_processor(self, values):
    for v in values:
        yield v.lower()
class MyItemLoader(ItemLoader):
    name_in = lowercase_processor
You have named your loader MyItemLoader, but your spider uses ItemLoader (probably scrapy's).
If you update your code to use the custom loader, you should get the result you want.
I would also recommend not naming your item class Item, since that could be confusing.
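An output processor receives the list of all values collected for the field, so both XPath results arrive together. Here is a minimal sketch of such a function in plain Python. The tag names and matching rules are made-up placeholders, and attaching it as ad_type_out on the custom loader is an assumption about how you would wire it in.

```python
def tag_ad_type(values):
    """Map the collected class strings to a single tag.

    `values` is the list of everything added to the field, in insertion order.
    """
    joined = " ".join(values).lower()
    # The tags below are hypothetical; replace them with your own rules.
    if "brand" in joined:
        return "brand"
    if "featured" in joined:
        return "featured"
    return "standard"
```

With a custom loader this could be used as, for example, ad_type_out = tag_ad_type.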
I am new to Python programming and I am having a hard time getting a Python crawling script to work. I need some tips to fix it.
I have a working Scrapy script that crawls a given URL and extracts the links. I want it to work on any dynamically given URL, so I started passing the start URLs and domains to Scrapy through the command line, like below.
scrapy crawl myCrawler -o test.json -t json -a allowedDomains="xxx" -a startUrls="xxx" -a allowedPaths="xxx"
However, it does not work. It looks like the rules are not getting the values from the arguments. Due to my lack of Python skills, I am not able to figure out how to fix this. Could someone please help me here?
Here is the code snippet.
class DmozSpider(CrawlSpider):
    name = "myCrawler"
    def __init__(self, allowedDomains='', startUrls='',allowedPaths='', *args, **kwargs):
        super(DmozSpider, self).__init__(*args, **kwargs)
        self.allowedDomains = allowedDomains
        self.startUrls = startUrls
        self.allowedPaths = allowedPaths
        self.allowed_domains = [allowedDomains]
        self.start_urls = [startUrls]
        rules = (Rule(LinkExtractor(allow=(allowedPaths), allow_domains=allowedDomains), callback="parse_items",
                      follow=True),)
Luckily got it working, found answer at How to dynamically set Scrapy rules?
Here is the working code
class DmozSpider(CrawlSpider):
    name = "myCrawler"
    def __init__(self, allowedDomains='', startUrls='',allowedPaths='', *args, **kwargs):
        super(DmozSpider, self).__init__(*args, **kwargs)
        self.allowedDomains = allowedDomains
        self.startUrls = startUrls
        self.allowedPaths = allowedPaths
        self.allowed_domains = [allowedDomains]
        self.start_urls = [startUrls]
        DmozSpider.rules = (Rule(LinkExtractor(allow=(allowedPaths), allow_domains=allowedDomains), callback="parse_items",
                                 follow=True),)
        super(DmozSpider, self)._compile_rules()
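Note that values passed with -a arrive in the spider as plain strings. If you ever need several domains or start URLs, one option is to pass them comma-separated and split them in __init__. The helper below is a small sketch of that idea; its name and the comma convention are assumptions, not something Scrapy provides.

```python
def split_arg(value, sep=","):
    """Turn a separator-delimited -a value into a list of non-empty strings."""
    return [part.strip() for part in value.split(sep) if part.strip()]
```

In the spider above this could be used as, for example, self.allowed_domains = split_arg(allowedDomains) and self.start_urls = split_arg(startUrls).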
I'm working with Scrapy 1.1 and I have a project where spider '1' scrapes site A (where I acquire 90% of the information to fill my items). However, depending on the results of the site A scrape, I may need to scrape additional information from site B. In terms of program design, does it make more sense to scrape site B within spider '1', or would it be possible to interact with site B from within a pipeline object? I prefer the latter, thinking that it decouples the scraping of the two sites, but I'm not sure whether this is possible or the best way to handle this use case. Another approach might be to use a second spider (spider '2') for site B, but then I assume I would have to let spider '1' run, save to a database, and then run spider '2'. Any advice would be appreciated.
Both approaches are very common and this is just a question of preference. For your case, containing everything in one spider sounds like a straightforward solution.
You can add a url field to your item and schedule and parse it later in the pipeline:
from scrapy import Request
from scrapy.exceptions import DropItem
class MyPipeline(object):
    def __init__(self, crawler):
        self.crawler = crawler
    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)
    def process_item(self, item, spider):
        extra_url = item.get('extra_url', None)
        if not extra_url:
            return item
        req = Request(url=extra_url,
                      callback=self.custom_callback,
                      meta={'item': item},)
        self.crawler.engine.crawl(req, spider)
        # you have to drop the item here since you will return it later anyway
        raise DropItem()
    def custom_callback(self, response):
        # retrieve your item
        item = response.meta['item']
        # do something to add to item
        item['some_extra_stuff'] = ...
        del item['extra_url']
        yield item
The code above checks whether the item has an extra_url field; if it does, it drops the item and schedules a new request. That request fills the item with some extra data and sends it back to the pipeline.
I am using Scrapy to extract some data from a site, say "myproject.com". Here is the logic:
Go to the homepage, where there is a category list that is used to build the second wave of links.
The second round of links is usually the first page of each category. The different pages inside a category follow the same regular expression pattern, wholesale/something/something/request or wholesale/pagenumber. I want to follow those patterns to keep crawling and meanwhile store the raw HTML in my item object.
I tested these two steps separately by using the parse command and they both worked.
First, I tried:
scrapy parse http://www.myproject.com/categorylist/cat_a --spider myproject --rules
And I can see it built the outlinks successfully. Then I tested the built outlink again.
scrapy parse http://www.myproject.com/wholesale/cat_a/request/1 --spider myproject --rules
It seems the rule is correct: it generates an item with the HTML stored in it.
However, when I tried to link those two steps together by using the depth argument, I saw that it crawled the outlinks but no items were generated.
scrapy parse http://www.myproject.com/categorylist/cat_a --spider myproject --rules --depth 2
Here is the pseudo code:
class MyprojectSpider(CrawlSpider):
    name = "Myproject"
    allowed_domains = ["Myproject.com"]
    start_urls = ["http://www.Myproject.com/"]
    rules = (
        Rule(LinkExtractor(allow=(r'/categorylist/\w+',)), callback='parse_category', follow=True),
        Rule(LinkExtractor(allow=(r'/wholesale/\w+/(?:wholesale|request)/\d+',)), callback='parse_pricing', follow=True),
    )
    def parse_category(self, response):
        try:
            soup = BeautifulSoup(response.body)
            ...
            my_request1 = Request(url=myurl1)
            yield my_request1
            my_request2 = Request(url=myurl2)
            yield my_request2
        except:
            pass
    def parse_pricing(self, response):
        item = MyprojectItem()
        try:
            item['myurl'] = response.url
            item['myhtml'] = response.body
            item['mystatus'] = 'fetched'
        except:
            item['mystatus'] = 'failed'
        return item
Thanks a lot for any suggestion!
I was assuming that the new Request objects I built would run against the rules and then be parsed by the corresponding callback function defined in the Rule. However, after reading the documentation of Request, I found that the callback is handled in a different way.
class scrapy.http.Request(url[, callback, method='GET', headers, body, cookies, meta, encoding='utf-8', priority=0, dont_filter=False, errback])
callback (callable) – the function that will be called with the response of this request (once its downloaded) as its first parameter. For more information see Passing additional data to callback functions below. If a Request doesn’t specify a callback, the spider’s parse() method will be used. Note that if exceptions are raised during processing, errback is called instead.
...
my_request1 = Request(url=myurl1, callback=self.parse_pricing)
yield my_request1
my_request2 = Request(url=myurl2, callback=self.parse_pricing)
yield my_request2
...
In other words, even if the URLs I built match the second rule, their responses won't be passed to parse_pricing unless the callback is set explicitly. I hope this is helpful to other people.
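To rule out the regular expressions as the cause, the patterns can be checked directly with the re module, independent of Scrapy. This is a small sketch: the patterns and URLs are copied from the question, while the helper name and the idea of returning the callback name are only illustrative.

```python
import re

# Patterns copied from the two rules in the question.
CATEGORY_PATTERN = re.compile(r"/categorylist/\w+")
PRICING_PATTERN = re.compile(r"/wholesale/\w+/(?:wholesale|request)/\d+")


def callback_for(url):
    """Return the name of the rule callback whose pattern matches the URL."""
    # search() is used because the pattern may match anywhere in the URL.
    if PRICING_PATTERN.search(url):
        return "parse_pricing"
    if CATEGORY_PATTERN.search(url):
        return "parse_category"
    return None


if __name__ == "__main__":
    print(callback_for("http://www.myproject.com/wholesale/cat_a/request/1"))
    print(callback_for("http://www.myproject.com/categorylist/cat_a"))
```

If the hand-built URL matches here but no items appear, the missing callback on the Request is the likely culprit rather than the pattern.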
I have a scrapy project where the item that ultimately enters my pipeline is relatively large and stores lots of metadata and content. Everything is working properly in my spider and pipelines. The logs, however, are printing out the entire scrapy Item as it leaves the pipeline (I believe):
2013-01-17 18:42:17-0600 [tutorial] DEBUG: processing Pipeline pipeline module
2013-01-17 18:42:17-0600 [tutorial] DEBUG: Scraped from <200 http://www.example.com>
{'attr1': 'value1',
'attr2': 'value2',
'attr3': 'value3',
...
snip
...
'attrN': 'valueN'}
2013-01-17 18:42:18-0600 [tutorial] INFO: Closing spider (finished)
I would rather not have all this data puked into log files if I can avoid it. Any suggestions about how to suppress this output?
Another approach is to override the __repr__ method of the Item subclasses to selectively choose which attributes (if any) to print at the end of the pipeline:
from scrapy.item import Item, Field
class MyItem(Item):
    attr1 = Field()
    attr2 = Field()
    # ...
    attrN = Field()
    def __repr__(self):
        """only print out attr1 after exiting the Pipeline"""
        # fields are read like dictionary keys, not as attributes
        return repr({"attr1": self.get("attr1")})
This way, you can keep the log level at DEBUG and show only the attributes that you want to see coming out of the pipeline (to check attr1, for example).
Having read through the documentation and conducted a (brief) search through the source code, I can't see a straightforward way of achieving this aim.
The hammer approach is to set the logging level in the settings to INFO (ie add the following line to settings.py):
LOG_LEVEL='INFO'
This will strip out a lot of other information about the URLs/page that are being crawled, but it will definitely suppress data about processed items.
I tried the __repr__ approach mentioned by @dino, but it doesn't work well. Building on his idea, I tried the __str__ method instead, and it works.
Here's how I do it, very simple:
def __str__(self):
    return ""
If you want to exclude only some attributes from the output, you can extend the answer given by @dino:
from scrapy.item import Item, Field
import json
class MyItem(Item):
    attr1 = Field()
    attr2 = Field()
    attr1ToExclude = Field()
    attr2ToExclude = Field()
    # ...
    attrN = Field()
    def __repr__(self):
        r = {}
        # iterate over the fields that have been set on this item
        for attr, value in self.items():
            if attr not in ['attr1ToExclude', 'attr2ToExclude']:
                r[attr] = value
        return json.dumps(r, sort_keys=True, indent=4, separators=(',', ': '))
If you found your way here because you had the same question years later, the easiest way to do this is with a LogFormatter:
import scrapy.logformatter
class QuietLogFormatter(scrapy.logformatter.LogFormatter):
    def scraped(self, item, response, spider):
        # returning None suppresses the "Scraped from ..." log entry
        return (
            super().scraped(item, response, spider)
            if spider.settings.getbool("LOG_SCRAPED_ITEMS")
            else None
        )
Just add LOG_FORMATTER = "path.to.QuietLogFormatter" to your settings.py and you will see all your DEBUG messages except for the scraped items. With LOG_SCRAPED_ITEMS = True you can restore the previous behaviour without having to change your LOG_FORMATTER.
Similarly you can customise the logging behaviour for crawled pages and dropped items.
Edit: I wrapped up this formatter and some other Scrapy stuff in this library.
Or, if you know that the spider is working correctly, you can disable logging entirely:
LOG_ENABLED = False
I disable logging once my crawler runs fine.
I think the cleanest way to do this is to add a filter to the scrapy.core.scraper logger that changes the message in question. This allows you to keep your Item's __repr__ intact and to not have to change scrapy's logging level:
import logging
import re
class ItemMessageFilter(logging.Filter):
    def filter(self, record):
        # The message that logs the item actually has raw % operators in it,
        # which Scrapy presumably formats later on
        match = re.search(r'(Scraped from %\(src\)s)\n%\(item\)s', record.msg)
        if match:
            # Make the message everything but the item itself
            record.msg = match.group(1)
        # Don't actually want to filter out this record, so always return 1
        return 1
logging.getLogger('scrapy.core.scraper').addFilter(ItemMessageFilter())
We use the following sample in production:
import logging
logging.getLogger('scrapy.core.scraper').addFilter(
    lambda x: not x.getMessage().startswith('Scraped from'))
This is very simple, working code. We put it in the __init__.py of the module that contains the spiders, so it runs automatically for all spiders with a command like scrapy crawl <spider_name>.
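The effect of such a filter can be tried out without running a crawl, using only the standard logging module. In the sketch below the logger name and the message prefix are the ones used above; the filter class, the list-collecting handler and the sample messages are assumptions made for the demonstration.

```python
import logging


class DropScrapedItems(logging.Filter):
    """Drop records whose formatted message starts with 'Scraped from'."""

    def filter(self, record):
        return not record.getMessage().startswith("Scraped from")


class ListHandler(logging.Handler):
    """Collect formatted messages in a list so they can be inspected."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


logger = logging.getLogger("scrapy.core.scraper")
logger.setLevel(logging.DEBUG)
handler = ListHandler()
logger.addHandler(handler)
logger.addFilter(DropScrapedItems())

# Sample records: the first mimics an item dump, the second is an ordinary message.
logger.debug(
    "Scraped from %(src)s\n%(item)s",
    {"src": "<200 http://www.example.com>", "item": {"attr1": "value1"}},
)
logger.debug("Crawled (200) <GET http://www.example.com>")

print(handler.messages)
```

Only the second message reaches the handler, while everything else logged at DEBUG level is left untouched.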
Create a filter:
import logging
class ItemFilter(logging.Filter):
    def filter(self, record):
        # keep every record except the "Scraped from ..." item dumps
        is_not_item_log = not record.getMessage().startswith('Scraped from')
        return is_not_item_log
Then add it in __init__ of your spider.
class YourSpider(scrapy.Spider):
    name = "your_spider"
    def __init__(self, *args, **kwargs):
        super(YourSpider, self).__init__(*args, **kwargs)
        if int(getattr(self, "no_items_output", 0)):
            for handler in logging.root.handlers:
                handler.addFilter(ItemFilter())
Then you can run it with scrapy crawl your_spider -a no_items_output=1