Passing list as arguments in Scrapy - python

I am trying to build an application using Flask and Scrapy. I have to pass a list of URLs to the spider. I tried the following syntax:
In the spider's __init__:
self.start_urls = ["http://www.google.com/patents/" + x for x in u]
In the Flask method:
u = ["US6249832", "US20120095946"]
os.system("rm static/s.json; scrapy crawl patents -d u=%s -o static/s.json" % u)
I know a similar thing can be done by reading a file containing the required URLs, but can I pass the list of URLs for crawling directly?

Override the spider's __init__() method:
from scrapy import Spider


class MySpider(Spider):
    name = 'my_spider'

    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        # the comma-separated string passed with -a arrives in kwargs
        endpoints = kwargs.get('start_urls').split(',')
        self.start_urls = ["http://www.google.com/patents/" + x for x in endpoints]
And pass the list of endpoints through the -a command line argument:
scrapy crawl patents -a start_urls="US6249832,US20120095946" -o static/s.json
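For the Flask side, a minimal sketch of building that command from the Python list, assuming the MySpider class above; subprocess is used here instead of os.system so the arguments don't need shell quoting:
import subprocess

u = ["US6249832", "US20120095946"]
subprocess.call([
    "scrapy", "crawl", "patents",
    "-a", "start_urls=" + ",".join(u),   # comma-joined list for the spider
    "-o", "static/s.json",
])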
See also:
How to give URL to scrapy for crawling?
Note that you can also run Scrapy from a script (a short sketch follows the links below):
How to run Scrapy from within a Python script
Scrapy Very Basic Example
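A hedged sketch of that scripted route, reusing the MySpider class above and assuming the script runs where the Scrapy project settings can be found:
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
# keyword arguments to crawl() reach the spider's __init__ as kwargs
process.crawl(MySpider, start_urls="US6249832,US20120095946")
process.start()  # blocks until the crawl is finished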

Related

CrawlSpider / Scrapy - CLOSESPIDER settings are not working

I created a CrawlSpider that should follow all "internal" links up to a certain number of items / pages / time.
I am using multiprocessing.Pool to process a few pages at the same time (e.g. 6 workers).
I call the CrawlSpider with an os.system command from a separate Python script:
import os
...
cmd = "scrapy crawl FullPageCrawler -t jsonlines -o "{0}" -a URL={1} -s DOWNLOAD_MAXSIZE=0 -s CLOSESPIDER_TIMEOUT=180 -s CLOSESPIDER_PAGECOUNT=150 -s CLOSESPIDER_ITEMCOUNT=100 -s DEPTH_LIMIT=5 -s DEPTH_PRIORITY=0 --nolog'.format(OUTPUT_FILE, url.strip())"
os.system(cmd)
It works pretty well for some of my pages, but for specific pages the crawler is not following any of the settings I set.
I tried to define the following (with what I think each one does):
CLOSESPIDER_PAGECOUNT: the total number of pages it will follow?
CLOSESPIDER_ITEMCOUNT: not sure about this one. What is the difference from PAGECOUNT?
CLOSESPIDER_TIMEOUT: the maximum time the crawler should be working.
Right now I face an example that has already crawled more than 4000 pages (or items?!) and is up for more than 1 hour.
Do I run into this because I defined everything at the same time?
Do I also need to define the same settings in the settings.py?
Can one of them be enough for me? (e.g. maximum uptime = 10minutes)
I tried using subprocess.Popen instead of os.system because it has a wait function, but that did not work as expected either.
After all, using os.system is the most stable approach I have tried and I want to stick with it. The only problem is Scrapy ignoring these settings.
I tried searching for an answer on SO but couldn't find any help!
EDIT:
The above example ended up with 16,009 scraped subpages and over 333 MB of output.
After continuing to search for an answer, I came up with the following solution.
Within my CrawlSpider I defined a maximum number of pages (self.max_cnt) at which the scraper should stop, and a counter (self.max_counter) that is checked and increased for each page my scraper visits.
If the maximum number of pages is exceeded, the spider is closed by raising scrapy.exceptions.CloseSpider.
from urlparse import urlparse  # on Python 3: from urllib.parse import urlparse

from scrapy.exceptions import CloseSpider
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class FullPageSpider(CrawlSpider):
    name = "FullPageCrawler"
    rules = (Rule(LinkExtractor(allow=()), callback="parse_all", follow=True),)

    def __init__(self, URL=None, *args, **kwargs):
        super(FullPageSpider, self).__init__(*args, **kwargs)
        self.start_urls = [URL]
        self.allowed_domains = ['{uri.netloc}'.format(uri=urlparse(URL))]
        self.max_cnt = 250      # maximum number of pages to visit
        self.max_counter = 0    # pages visited so far

    def parse_all(self, response):
        if self.max_counter < self.max_cnt:
            self.max_counter += 1   # count the visited page
            ...
        else:
            raise CloseSpider('Exceeded the number of maximum pages!')
This works fine for me now but I would still be interested in the reason why the crawler settings are not working as expected.
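As a side note, the same limits can also be declared on the spider itself via the custom_settings attribute; these take precedence over settings.py but are still overridden by -s options on the command line. A sketch with placeholder values:
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class FullPageSpider(CrawlSpider):
    name = "FullPageCrawler"
    custom_settings = {
        'CLOSESPIDER_TIMEOUT': 600,     # stop after 10 minutes ...
        'CLOSESPIDER_PAGECOUNT': 150,   # ... or 150 responses ...
        'CLOSESPIDER_ITEMCOUNT': 100,   # ... or 100 scraped items
        'DEPTH_LIMIT': 5,
    }
    rules = (Rule(LinkExtractor(allow=()), callback="parse_all", follow=True),)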

pass arguments to scrapy rules from command line or dynamically modify rules

I am new to Python programming and I am having a hard time getting a Python crawling script to work. I need tips from you to fix it.
Actually, I have a working Scrapy script that crawls through a given URL and extracts the links. I want to make it work on any dynamically given URL, so I started passing the start URLs and domains to Scrapy through the command line, like below:
scrapy crawl myCrawler -o test.json -t json -a allowedDomains="xxx" -a startUrls="xxx" -a allowedPaths="xxx"
However, it does not work. It looks like the rules are not getting the values from the arguments. Due to my lack of Python skills, I am not able to figure out how to fix this. Could someone please help me here?
Here is the code snippet.
class DmozSpider(CrawlSpider):
    name = "myCrawler"

    def __init__(self, allowedDomains='', startUrls='', allowedPaths='', *args, **kwargs):
        super(DmozSpider, self).__init__(*args, **kwargs)
        self.allowedDomains = allowedDomains
        self.startUrls = startUrls
        self.allowedPaths = allowedPaths
        self.allowed_domains = [allowedDomains]
        self.start_urls = [startUrls]
        rules = (Rule(LinkExtractor(allow=(allowedPaths), allow_domains=allowedDomains),
                      callback="parse_items", follow=True),)
Luckily I got it working; I found the answer at How to dynamically set Scrapy rules?
Here is the working code:
class DmozSpider(CrawlSpider):
    name = "myCrawler"

    def __init__(self, allowedDomains='', startUrls='', allowedPaths='', *args, **kwargs):
        super(DmozSpider, self).__init__(*args, **kwargs)
        self.allowedDomains = allowedDomains
        self.startUrls = startUrls
        self.allowedPaths = allowedPaths
        self.allowed_domains = [allowedDomains]
        self.start_urls = [startUrls]
        DmozSpider.rules = (Rule(LinkExtractor(allow=(allowedPaths), allow_domains=allowedDomains),
                                 callback="parse_items", follow=True),)
        super(DmozSpider, self)._compile_rules()
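For completeness, a hypothetical parse_items callback that the rules above point to; the field names and selector are placeholders, not from the original question:
    # inside DmozSpider
    def parse_items(self, response):
        yield {
            'url': response.url,                          # page that matched the rule
            'title': response.css('title::text').get(),   # placeholder field
        }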

Keeping streams of data separate using one Scrapy spider

I want to scrape data from three different categories of contracts --- goods, services, construction.
Because each type of contract can be parsed with the same method, my goal is to use a single spider, start the spider on three different urls, and then extract data in three distinct streams that can be saved to different places.
My understanding is that just listing all three urls as start_urls will lead to one combined output of data.
My spider inherits from Scrapy's CrawlSpider class.
Let me know if you need further information.
I would suggest that you tackle this problem from another angle. In Scrapy it is possible to pass arguments to the spider from the command line using the -a option, like so:
scrapy crawl CanCrawler -a contract=goods
You just need to include the variables you reference in your class initializer
import scrapy


class CanCrawler(scrapy.Spider):
    name = 'CanCrawler'

    def __init__(self, contract='', *args, **kwargs):
        super(CanCrawler, self).__init__(*args, **kwargs)
        self.start_urls = ['https://buyandsell.gc.ca/procurement-data/search/site']
        # ...
Something else you might consider is adding multiple arguments, so that you can start on the homepage of a website and use the arguments to get to whatever data you need. For this website, https://buyandsell.gc.ca/procurement-data/search/site, you could for example have two command line arguments:
scrapy crawl CanCrawler -a procure=ContractHistory -a contract=goods
so you'd get
class CanCrawler(scrapy.Spider):
    name = 'CanCrawler'

    def __init__(self, procure='', contract='', *args, **kwargs):
        super(CanCrawler, self).__init__(*args, **kwargs)
        self.start_urls = ['https://buyandsell.gc.ca/procurement-data/search/site']
        # ...
and then, depending on which arguments you passed, you could make your crawler click through those options on the website to get to the data you want to crawl.
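To make that concrete, here is a sketch of how the contract argument might drive both the start URL and the tagging of scraped items; the query parameter below is a hypothetical placeholder, and the real site may require navigating its search form instead:
import scrapy


class CanCrawler(scrapy.Spider):
    name = 'CanCrawler'

    def __init__(self, contract='goods', *args, **kwargs):
        super(CanCrawler, self).__init__(*args, **kwargs)
        self.contract = contract
        # hypothetical filter parameter; adjust to the site's real search form
        self.start_urls = [
            'https://buyandsell.gc.ca/procurement-data/search/site?contract=%s' % contract
        ]

    def parse(self, response):
        # tagging items with the contract type keeps the streams separable
        # in a pipeline, or in separate runs with different -o files
        yield {'contract': self.contract, 'url': response.url}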
Please also see here.
I hope this helps!
In your spider, yield your item like this:
data = {'categories': {}, 'contracts': {}, 'goods': {}, 'services': {}, 'construction': {}}
where each of these keys holds a Python dictionary.
Then create a pipeline, and inside the pipeline do this:
if 'categories' in item:
    categories = item['categories']
    # and then process categories, save into DB maybe
if 'contracts' in item:
    contracts = item['contracts']
    # and then process contracts, save into DB maybe
...
# and so on for the other keys
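Wrapped up, a minimal sketch of such a pipeline (the save_* helpers are placeholders for whatever storage is used); it still needs to be enabled under ITEM_PIPELINES in settings.py:
class ContractStreamsPipeline(object):
    def process_item(self, item, spider):
        if 'categories' in item:
            self.save_categories(item['categories'])   # placeholder helper
        if 'contracts' in item:
            self.save_contracts(item['contracts'])     # placeholder helper
        # ... and likewise for goods, services and construction
        return item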

(Python, Scrapy) Taking data from txt file into Scrapy spider

I am new to Python and Scrapy. I have a project. In the spider there is code like this:
class MySpider(BaseSpider):
    name = "project"
    allowed_domains = ["domain.com"]
    start_urls = ["https://domain.com/%d" % i for i in range(12308128, 12308148)]
I want to take the range numbers between 12308128 and 12308148 from a txt file (or a csv file).
Let's say it's numbers.txt, containing two lines:
12308128
12308148
How can I import these numbers into my spider? Another process will change the numbers in the txt file periodically, and my spider should pick up the new numbers and run.
Thank you.
You can override the start_urls logic in the spider's start_requests() method:
import scrapy


class MySpider(scrapy.Spider):
    name = 'myspider'

    def start_requests(self):
        # read the two numbers from the file
        with open('numbers.txt', 'r') as f:
            start, end = f.read().split('\n', 1)
        # build the urls from your numbers
        range_ = (int(start.strip()), int(end.strip()))
        start_urls = ["https://domain.com/%d" % i for i in range(*range_)]
        for url in start_urls:
            yield scrapy.Request(url)
This spider will open the file, read the numbers, build the starting URLs, iterate through them, and schedule a request for each one.
The default start_requests() method looks something like this:
def start_requests(self):
    for url in self.start_urls:
        yield scrapy.Request(url)
So you can see what we're doing here by overriding it.
You can pass any parameter to the spider's constructor from the command line using the -a option of the scrapy crawl command, for example:
scrapy crawl spider -a inputfile=filename.txt
then use it like this:
import scrapy
from scrapy.exceptions import CloseSpider


class MySpider(scrapy.Spider):
    name = 'spider'

    def __init__(self, *args, **kwargs):
        self.infile = kwargs.pop('inputfile', None)
        super(MySpider, self).__init__(*args, **kwargs)

    def start_requests(self):
        if self.infile is None:
            raise CloseSpider('No filename')
        # process the file; its name is in self.infile
Or you can just pass the start and end values in a similar way, like this:
scrapy crawl spider -a start=10000 -a end=20000
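A short sketch of that start/end variant, with the numbers from the question used as defaults:
import scrapy


class MySpider(scrapy.Spider):
    name = 'spider'

    def __init__(self, start=12308128, end=12308148, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.start = int(start)   # -a values arrive as strings
        self.end = int(end)

    def start_requests(self):
        for i in range(self.start, self.end):
            yield scrapy.Request("https://domain.com/%d" % i)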
I believe you need to read the file and pass the values into your URL string:
with open('numbers.txt') as datacont:
    Start_Range = datacont.readline()
    End_Range = datacont.readline()
print(Start_Range)
print(End_Range)

How to pass two user-defined arguments to a scrapy spider

Following How to pass a user defined argument in scrapy spider, I wrote the following simple spider:
import scrapy


class Funda1Spider(scrapy.Spider):
    name = "funda1"
    allowed_domains = ["funda.nl"]

    def __init__(self, place='amsterdam'):
        self.start_urls = ["http://www.funda.nl/koop/%s/" % place]

    def parse(self, response):
        filename = response.url.split("/")[-2] + '.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
This seems to work; for example, if I run it from the command line with
scrapy crawl funda1 -a place=rotterdam
It generates a rotterdam.html which looks similar to http://www.funda.nl/koop/rotterdam/. I would next like to extend this so that one can specify a subpage, for instance, http://www.funda.nl/koop/rotterdam/p2/. I've tried the following:
import scrapy


class Funda1Spider(scrapy.Spider):
    name = "funda1"
    allowed_domains = ["funda.nl"]

    def __init__(self, place='amsterdam', page=''):
        self.start_urls = ["http://www.funda.nl/koop/%s/p%s/" % (place, page)]

    def parse(self, response):
        filename = response.url.split("/")[-2] + '.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
However, if I try to run this with
scrapy crawl funda1 -a place=rotterdam page=2
I get the following error:
crawl: error: running 'scrapy crawl' with more than one spider is no longer supported
I don't really understand this error message, as I'm not trying to crawl two spiders, but simply trying to pass two keyword arguments to modify the start_urls. How could I make this work?
When providing multiple arguments, you need to prefix every argument with -a.
The correct line for your case would be:
scrapy crawl funda1 -a place=rotterdam -a page=2
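As a side note, the values passed with -a are also set as attributes on the spider by Scrapy's default __init__, so the same thing can be sketched without a custom constructor (the fallback defaults below are assumptions):
import scrapy


class Funda1Spider(scrapy.Spider):
    name = "funda1"
    allowed_domains = ["funda.nl"]

    def start_requests(self):
        place = getattr(self, 'place', 'amsterdam')   # set by -a place=...
        page = getattr(self, 'page', '1')             # set by -a page=...
        yield scrapy.Request("http://www.funda.nl/koop/%s/p%s/" % (place, page))

    def parse(self, response):
        filename = response.url.split("/")[-2] + '.html'
        with open(filename, 'wb') as f:
            f.write(response.body)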
