How to make Scrapy general purpose? - python

I am using Scrapy for scraping text from websites.
I would like Scrapy to scrape text from various URLs with different structures, without having to change the code for each website.
The following example works in my Jupyter Notebook for the given URL ( http://quotes.toscrape.com/tag/humor/ ), but it does not work for other pages (for example https://en.wikipedia.org/wiki/Web_scraping ).
My question is: how do I make it work for (most) other websites without manually inspecting every site and changing the code all the time? I guess I need to make a change under def parse(self, response), but so far I could not find a good example of how to do that.
Modules:
import scrapy
import scrapy.crawler as crawler
from multiprocessing import Process, Queue
from twisted.internet import reactor
Spider:
class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['http://quotes.toscrape.com/tag/humor/']

    def parse(self, response):
        for quote in response.css('div.quote'):
            print(quote.css('span.text::text').extract_first())
A wrapper to make it run multiple times in Jupyter:
def run_spider(spider):
    def f(q):
        try:
            runner = crawler.CrawlerRunner()
            deferred = runner.crawl(spider)
            deferred.addBoth(lambda _: reactor.stop())
            reactor.run()
            q.put(None)
        except Exception as e:
            q.put(e)

    q = Queue()
    p = Process(target=f, args=(q,))
    p.start()
    result = q.get()
    p.join()
    if result is not None:
        raise result
Get the result:
print('Extracted text:')
run_spider(QuotesSpider)
Extracted text:
“The person, be it gentleman or lady, who has not pleasure in a good novel, ..."
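For a rough idea of what a site-agnostic parse could look like, here is a sketch I am adding (the class name and XPath are illustrative, not from the original post): instead of the quote-specific CSS selector, it pulls every visible text node, which also works on pages like the Wikipedia article above, and it can be run through the same run_spider wrapper.

import scrapy

class GenericTextSpider(scrapy.Spider):
    name = "generic_text"

    def __init__(self, urls=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # any list of URLs can be passed in; no per-site selectors are hard-coded
        self.start_urls = urls or ['http://quotes.toscrape.com/tag/humor/']

    def parse(self, response):
        # every text node that is not inside <script> or <style>
        texts = response.xpath(
            '//body//text()[not(ancestor::script) and not(ancestor::style)]'
        ).getall()
        yield {
            'url': response.url,
            'text': ' '.join(t.strip() for t in texts if t.strip()),
        }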

Related

Get Scraped data using scrapy to a variable instead of file/database

I am trying to run scrapy as a python script and want to process the data scraped instead of storing in a file/database. The code looks like
import scrapy
import scrapy.crawler as crawler
from scrapy.utils.log import configure_logging
from multiprocessing import Process, Queue
from twisted.internet import reactor
# spider
class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['http://quotes.toscrape.com/tag/humor/']

    def parse(self, response):
        yield {"html_data": response.text}
# the wrapper to make it run more times
def run_spider(spider):
    def f(q):
        try:
            runner = crawler.CrawlerRunner()
            deferred = runner.crawl(spider)
            deferred.addBoth(lambda _: reactor.stop())
            reactor.run()
            q.put(None)
        except Exception as e:
            q.put(e)

    q = Queue()
    p = Process(target=f, args=(q,))
    p.start()
    result = q.get()
    p.join()
    if result is not None:
        raise result

configure_logging()
x = run_spider(QuotesSpider)
I want to run the spider when it is called. How can this be done?
As I understand it, you want to crawl the links using scrapy, and instead of storing the results in a file, you want to return them to the script.
Using your method
import scrapy
import scrapy.crawler as crawler
from scrapy.utils.log import configure_logging
from multiprocessing import Process, Queue
from twisted.internet import reactor

all_html_data = []

# spider
class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['http://quotes.toscrape.com/tag/humor/']
    html_data = []

    def parse(self, response):
        self.html_data.append(response.text)

    def close(self, reason):
        global all_html_data
        all_html_data = self.html_data

# the wrapper to make it run more times
def run_spider(spider):
    def f(q):
        try:
            runner = crawler.CrawlerRunner()
            deferred = runner.crawl(spider)
            deferred.addBoth(lambda _: reactor.stop())
            reactor.run()
            q.put(None)
        except Exception as e:
            q.put(e)

    q = Queue()
    p = Process(target=f, args=(q,))
    p.start()
    result = q.get()
    p.join()
    if result:
        raise result

configure_logging()
run_spider(QuotesSpider)
data = all_html_data  # this is the data
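One caveat worth flagging (my note, not part of the original answer): run_spider executes the crawl in a separate process, so a global that the spider's close() assigns is only updated in that child process, not in the parent where data = all_html_data is read. A minimal sketch of one way around that, sending the child's result back through the Queue that is already there:

def run_spider(spider):
    def f(q):
        try:
            runner = crawler.CrawlerRunner()
            deferred = runner.crawl(spider)
            deferred.addBoth(lambda _: reactor.stop())
            reactor.run()
            # the crawl has finished at this point, so close() has already
            # filled all_html_data in this child process; ship it to the parent
            q.put(all_html_data)
        except Exception as e:
            q.put(e)

    q = Queue()
    p = Process(target=f, args=(q,))
    p.start()
    result = q.get()
    p.join()
    if isinstance(result, Exception):
        raise result
    return result

data = run_spider(QuotesSpider)  # list of HTML strings, received from the child process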
But if you want to crawl the links and use the HTML response, I think it will be better to use another library like Requests to download the page and Scrapy's HtmlResponse to parse it, for example:

import requests
from scrapy.http import HtmlResponse

def parse(url):
    # download with requests, then wrap the body so it behaves like a Scrapy response
    body = requests.get(url).text
    response = HtmlResponse(url=url, body=body, encoding='utf-8')
    return response.text

data = parse('http://example.com')
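A side note I am adding: the wrapped HtmlResponse also exposes Scrapy's selector API, so the same CSS/XPath queries used inside a spider work on it:

import requests
from scrapy.http import HtmlResponse

url = 'http://example.com'
response = HtmlResponse(url=url, body=requests.get(url).text, encoding='utf-8')
titles = response.css('title::text').getall()  # selectors work outside a spider as well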

Running spider multiple times from script - ReactorNotRestartable [duplicate]

I get a twisted.internet.error.ReactorNotRestartable error when I execute the following code:

from time import sleep
from scrapy import signals
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from scrapy.xlib.pydispatch import dispatcher

result = None

def set_result(item):
    result = item

while True:
    process = CrawlerProcess(get_project_settings())
    dispatcher.connect(set_result, signals.item_scraped)

    process.crawl('my_spider')
    process.start()

    if result:
        break
    sleep(3)

It works the first time, then I get the error. I create the process variable each time, so what's the problem?
By default, CrawlerProcess's .start() will stop the Twisted reactor it creates when all crawlers have finished.
You should call process.start(stop_after_crawl=False) if you create the process in each iteration.
Another option is to handle the Twisted reactor yourself and use CrawlerRunner. The docs have an example of doing that.
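For reference, the pattern from the docs looks roughly like this (a sketch; MySpider and the loop count are placeholders, not from the original answer): a single reactor is started once, and the crawls are chained on it with CrawlerRunner:

from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

configure_logging()
runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl():
    for _ in range(3):                # run the spider three times, one after another
        yield runner.crawl(MySpider)  # MySpider is a placeholder spider class
    reactor.stop()

crawl()
reactor.run()  # blocks here; the reactor is started (and stopped) only once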
I was able to solve this problem like this. process.start() should be called only once.
from time import sleep
from scrapy import signals
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from scrapy.xlib.pydispatch import dispatcher

result = None

def set_result(item):
    result = item

while True:
    process = CrawlerProcess(get_project_settings())
    dispatcher.connect(set_result, signals.item_scraped)
    process.crawl('my_spider')

process.start()
For a particular process, once you call reactor.run() or process.start(), you cannot rerun those commands, because the reactor cannot be restarted; it stops once the script completes execution.
So the best option is to use different subprocesses if you need to run the reactor multiple times.
You can move the contents of the while loop into a function (say execute_crawling) and then simply run it in different subprocesses; Python's multiprocessing Process can be used for this.
The code is given below.
from multiprocessing import Process

def execute_crawling():
    process = CrawlerProcess(get_project_settings())  # the same can be done with CrawlerRunner
    dispatcher.connect(set_result, signals.item_scraped)
    process.crawl('my_spider')
    process.start()

if __name__ == '__main__':
    for k in range(Number_of_times_you_want):
        p = Process(target=execute_crawling)
        p.start()
        p.join()  # this blocks until the process terminates
Ref http://crawl.blog/scrapy-loop/

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from twisted.internet import reactor
from twisted.internet.task import deferLater

def sleep(self, *args, seconds):
    """Non blocking sleep callback"""
    return deferLater(reactor, seconds, lambda: None)

process = CrawlerProcess(get_project_settings())

def _crawl(result, spider):
    deferred = process.crawl(spider)
    deferred.addCallback(lambda results: print('waiting 100 seconds before restart...'))
    deferred.addCallback(sleep, seconds=100)
    deferred.addCallback(_crawl, spider)
    return deferred

_crawl(None, MySpider)
process.start()
I faced the ReactorNotRestartable error on AWS Lambda, and then I came to this solution.
By default, the asynchronous nature of scrapy is not going to work well with Cloud Functions, as we'd need a way to block on the crawl to prevent the function from returning early and the instance being killed before the process terminates.
Instead, we can use scrapydo to run your existing spider in a blocking fashion:

import scrapy
import scrapy.crawler as crawler
from scrapy.spiders import CrawlSpider
import scrapydo

scrapydo.setup()

# your spider
class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['http://quotes.toscrape.com/tag/humor/']

    def parse(self, response):
        for quote in response.css('div.quote'):
            print(quote.css('span.text::text').extract_first())

scrapydo.run_spider(QuotesSpider)
I was able to mitigate this problem using the crochet package, via this simple code based on Christian Aichinger's answer to the duplicate of this question, Scrapy - Reactor not Restartable.
The initialization of the spiders is done in the main thread, whereas the actual crawling is done in a different thread. I'm using Anaconda (Windows).
import time
import scrapy
from scrapy.crawler import CrawlerRunner
from crochet import setup

class MySpider(scrapy.Spider):
    name = "MySpider"
    allowed_domains = ['httpbin.org']
    start_urls = ['http://httpbin.org/ip']

    def parse(self, response):
        print(response.text)
        for i in range(1, 6):
            time.sleep(1)
            print("Spider " + str(self.name) + " waited " + str(i) + " seconds.")

def run_spider(number):
    crawler = CrawlerRunner()
    crawler.crawl(MySpider, name=str(number))

setup()
for i in range(1, 6):
    time.sleep(1)
    print("Initialization of Spider #" + str(i))
    run_spider(i)
I had a similar issue using Spyder. Running the file from the command line instead fixed it for me.
Spyder seems to work the first time but after that it doesn't. Maybe the reactor stays open and doesn't close?
I would advise you to run scrapers using the subprocess module:
from subprocess import Popen, PIPE
spider = Popen(["scrapy", "crawl", "spider_name", "-a", "argument=value"], stdout=PIPE)
spider.wait()
If you're trying to get a Flask, Django, or FastAPI service working and you're running into this: you've tried all the things people suggest about forking a new process to run the reactor, and none of it seems to work.
Stop what you're doing and go read this: https://github.com/notoriousno/scrapy-flask
Crochet is your best opportunity to get this working within gunicorn without writing your own crawler from scratch.
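For what it's worth, the crochet pattern that repository describes looks roughly like this (a sketch with illustrative names such as run_crawl and QuotesSpider; see the linked repo for the full Flask wiring):

from crochet import setup, wait_for
from scrapy.crawler import CrawlerRunner

setup()  # start crochet's reactor thread once, before the web framework spins up workers

@wait_for(timeout=60.0)
def run_crawl(spider_cls):
    # wait_for blocks the calling (WSGI) thread until the Deferred returned
    # by runner.crawl() fires inside crochet's reactor thread
    runner = CrawlerRunner()
    return runner.crawl(spider_cls)

# inside a Flask/Django/FastAPI view you could then simply call:
#   run_crawl(QuotesSpider)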
My way is to use multiprocessing's Process.

# create spider
class PricesSpider(scrapy.Spider):
    name = 'prices'
    allowed_domains = ['index.minfin.com.ua']
    start_urls = ['https://index.minfin.com.ua/ua/markets/fuel/tm/']

    def parse(self, response):
        pass

Then I create a function which runs my spider:

# run spider
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from twisted.internet import reactor

def parser():
    configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
    runner = CrawlerRunner()
    d = runner.crawl(PricesSpider)
    d.addBoth(lambda _: reactor.stop())
    reactor.run()

Then I create a new Python file, import the parser function there, and create a schedule for my spider:

# create schedule for spider
import schedule
from parser_module import parser  # "parser_module" is a placeholder; the original post leaves the module name blank
from multiprocessing import Process

def worker(pars):
    print('Worker starting')
    pr = Process(target=parser)
    pr.start()
    pr.join()

def main():
    schedule.every().day.at("15:00").do(worker, parser)
    # schedule.every().day.at("20:21").do(worker, parser)
    # schedule.every().day.at("20:23").do(worker, parser)
    # schedule.every(1).minutes.do(worker, parser)
    print('Spider working now')
    while True:
        schedule.run_pending()

if __name__ == '__main__':
    main()

Broad crawling - different xpaths - Scrapy

I'm new to Scrapy. I have thousands of url,xpath tuples and values in a database.
These urls are from different domains (not always; there can be 100 urls from the same domain).

x.com/a    //h1
y.com/a    //div[@class='1']
z.com/a    //div[@href='...']
x.com/b    //h1
x.com/c    //h1
...

Now I want to get these values every 2 hours as fast as possible, while being sure that I don't overload any of these sites.
Can't figure out how to do that.
My thoughts:
I could create one Spider for every different domain, set its parsing rules and run them all at once.
Is that good practice?
EDIT:
I'm not sure how that would work with writing data into the database with respect to concurrency.
EDIT2:
I can do something like this - for every domain there is a new spider. But this is impossible to do with thousands of different urls and their xpaths.
import scrapy
from scrapy.selector import HtmlXPathSelector

class WikiScraper(scrapy.Spider):
    name = "wiki_headers"

    def start_requests(self):
        urls = [
            'https://en.wikipedia.org/wiki/Spider',
            'https://en.wikipedia.org/wiki/Data_scraping',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        header = hxs.select('//h1/text()').extract()
        print(header)
        filename = 'result.txt'
        with open(filename, 'a') as f:
            f.write(header[0])
        self.log('Saved file %s' % filename)

class CraigslistScraper(scrapy.Spider):
    name = "craigslist_headers"

    def start_requests(self):
        urls = [
            'https://columbusga.craigslist.org/act/6062657418.html',
            'https://columbusga.craigslist.org/acc/6060297390.html',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        header = hxs.select('//span[@id="titletextonly"]/text()').extract()
        filename = 'result.txt'
        with open(filename, 'a') as f:
            f.write(header[0])
        self.log('Saved file %s' % filename)
From the example you posted in edit2, it looks like all your classes are easily abstractable by one more level. How about this?
from urllib.parse import urlparse

class GenericScraper(scrapy.Spider):
    def __init__(self, urls, xpath):
        super().__init__()
        self.name = self._create_scraper_name_from_url(urls[0])
        self.urls = urls
        self.xpath = xpath

    def _create_scraper_name_from_url(self, url):
        '''Generate scraper name from url:
        www.example.com/foobar/bar -> www_example_com'''
        netloc = urlparse(url).netloc
        return netloc.replace('.', '_')

    def start_requests(self):
        for url in self.urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        header = hxs.select(self.xpath).extract()
        filename = 'result.txt'
        with open(filename, 'a') as f:
            f.write(header[0])
        self.log('Saved file %s' % filename)
Next, you could group the data from the database by xpath:

for urls, xpath in grouped_data:
    scraper = GenericScraper(urls, xpath)
    # do whatever you need with scraper
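For illustration (my addition), grouped_data could be built from the (url, xpath) rows described in the question by grouping the urls that share an xpath; the row format here is an assumption:

from collections import defaultdict

rows = [
    ('http://x.com/a', '//h1'),
    ('http://x.com/b', '//h1'),
    ('http://y.com/a', "//div[@class='1']"),
]

groups = defaultdict(list)
for url, xpath in rows:
    groups[xpath].append(url)

grouped_data = [(urls, xpath) for xpath, urls in groups.items()]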
Regarding concurrency: your database should handle concurrent writes, so I do not see a problem there.
Edit:
Related to the timeouts: I do not know how scrapy works under the hood, i.e. whether it uses some sort of parallelization and whether it runs asynchronously in the background. But from what you wrote I guess it does, and when you fire up 1k scrapers each firing multiple requests at a time, your hardware can't handle that much traffic (disclaimer: this is just a guess!).
There might be a native way to do this, but a possible workaround is to use multiprocessing + Queue:
from multiprocessing import JoinableQueue, Process

NUMBER_OF_CPU = 4  # change this to your number.
SENTINEL = None

class Worker(Process):
    def __init__(self, queue):
        super().__init__()
        self.queue = queue

    def run(self):
        while True:
            # blocking wait! You have to use sentinels if you use blocking waits!
            item = self.queue.get()
            if item is SENTINEL:
                # we got the sentinel, there are no more scrapers to process
                self.queue.task_done()
                return
            else:
                # item is a scraper, run it
                item.run_spider()  # or however you run your scrapers
                # This assumes that each scraper is **not** running in the background!
                # Tell the JoinableQueue we have processed one more item.
                # In the main thread, queue.join() waits until a queue.task_done()
                # has been called for each item taken from the queue.
                self.queue.task_done()

def run():
    queue = JoinableQueue()
    # If putting that many things in the queue gets slow (I imagine it can),
    # you can fire up a separate Thread/Process to fill the queue in the
    # background while workers are already consuming it.
    for urls, xpath in grouped_data:
        scraper = GenericScraper(urls, xpath)
        queue.put(scraper)
    for sentinel in range(NUMBER_OF_CPU):
        # None or a sentinel of your choice to tell the workers there are
        # no more scrapers to process
        queue.put(SENTINEL)

    workers = []
    for _ in range(NUMBER_OF_CPU):
        worker = Worker(queue)
        workers.append(worker)
        worker.start()

    # We have to wait until the queue is processed
    queue.join()
But please bear in mind that this is a vanilla approach to parallel execution that completely ignores Scrapy's abilities. I have found this blog post which uses Twisted to achieve (what I think is) the same thing, but since I've never used Twisted I can't comment on that.
If you are thinking that scrapy can't handle multiple domains at once because of the allowed_domains parameter, remember that it is optional.
If no allowed_domains parameter is set in the spider, it can work with every domain it gets.
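As a small illustration (my sketch, with an invented class name): a spider with no allowed_domains can take start URLs from completely unrelated domains:

import scrapy

class AnyDomainSpider(scrapy.Spider):
    name = "any_domain"
    # note: no allowed_domains attribute, so offsite filtering is not applied
    start_urls = [
        'http://quotes.toscrape.com/tag/humor/',
        'https://en.wikipedia.org/wiki/Web_scraping',
    ]

    def parse(self, response):
        yield {'url': response.url, 'title': response.xpath('//title/text()').get()}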
If I understand correctly, you have a map of domain to xpath values and you want to pick the xpath depending on what domain you crawl?
Try something like:

import logging

DOMAIN_DATA = [('domain.com', '//div')]

def get_domain(url):
    for domain, xpath in DOMAIN_DATA:
        if domain in url:
            return xpath

def parse(self, response):
    xpath = get_domain(response.url)
    if not xpath:
        logging.error('no xpath for url: {}; unknown domain'.format(response.url))
        return
    item = dict()
    item['some_field'] = response.xpath(xpath).extract()
    yield item

Calling Scrapy Spider from script -- cannot turn reactor off?

Very similar to this thread: Scrapy crawl from script always blocks script execution after scraping, I cannot get anything to work after the reactor.run() line. I've read nearly every SO post on the topic and as you can see from the commented code, I've tried several things including what's recommended in the documentation. Is there something I'm not catching? Maybe something wrong with the parse_item method? It's driving me crazy!
class EmailSpider(CrawlSpider):
    name = "email_scraper"
    allowed_domains = ["somedomain.com"]
    start_urls = ["http://www.somedomain.com"]
    rules = [Rule(SgmlLinkExtractor(allow=()), callback='parse_items')]

    def parse_items(self, response):
        sel = Selector(response)
        results = []
        item = EmailScraperItems()
        item['title'] = sel.xpath('//title/text()').extract()
        item['url'] = response.url
        item['email'] = sel.re(r"\b[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}\b")
        if item['email'] != []:
            print item['email']
            print item['url']
            if any('info' in email for email in item['email']):
                results.append(item)
                raise CloseSpider('info email found')
            else:
                results.append(item)
        print results

def stop_reactor():
    reactor.stop()

dispatcher.connect(stop_reactor, signal=signals.spider_closed)

spider = EmailSpider(domain='knechtproperties.com')

#settings = get_project_settings()
crawler = Crawler(Settings())
#crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()

reactor.run()
print "this will not print"
Found the answer in this thread: Scrapy run from script not working. Apparently log.start() masks printing. I'll need to look up more details on how that works, but for now commenting it out solved the issue.
