Run multiple spiders from script in scrapy in loop - python

I have more than 100 spiders and I want to run 5 spiders at a time using a script. For this I have created a table in the database to track the status of each spider, i.e. whether it has finished running, is running, or is waiting to run.
I know how to run multiple spiders inside a script
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
for i in range(10):  # this range is just for demo; instead of this I
                     # find the spiders that are waiting to run from the database
    process.crawl(spider1)  # spider name changes based on the spider to run
    process.crawl(spider2)
    print('-------------this is the-----{}--iteration'.format(i))
    process.start()
But this is not allowed as the following error occurs:
Traceback (most recent call last):
File "test.py", line 24, in <module>
process.start()
File "/home/g/projects/venv/lib/python3.4/site-packages/scrapy/crawler.py", line 285, in start
reactor.run(installSignalHandlers=False) # blocking call
File "/home/g/projects/venv/lib/python3.4/site-packages/twisted/internet/base.py", line 1242, in run
self.startRunning(installSignalHandlers=installSignalHandlers)
File "/home/g/projects/venv/lib/python3.4/site-packages/twisted/internet/base.py", line 1222, in startRunning
ReactorBase.startRunning(self)
File "/home/g/projects/venv/lib/python3.4/site-packages/twisted/internet/base.py", line 730, in startRunning
raise error.ReactorNotRestartable()
twisted.internet.error.ReactorNotRestartable
I have searched for the above error but have not been able to resolve it. Managing spiders could be done via ScrapyD, but we do not want to use ScrapyD because many spiders are still in the development phase.
Any workaround for the above scenario is appreciated.
Thanks

To run multiple spiders simultaneously you can use this:
import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

process = CrawlerProcess()
process.crawl(MySpider1)
process.crawl(MySpider2)
process.start()  # the script will block here until all crawling jobs are finished
The answers to this question can help you too.
For more information:
Running multiple spiders in the same process
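The same docs page also shows how to run spiders sequentially without restarting the reactor, by chaining the crawl deferreds through CrawlerRunner. A minimal sketch along those lines, reusing the MySpider1/MySpider2 placeholders from above:
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

configure_logging()
runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl():
    # each crawl starts only after the previous one has finished
    yield runner.crawl(MySpider1)
    yield runner.crawl(MySpider2)
    reactor.stop()

crawl()
reactor.run()  # the script blocks here; the reactor is started exactly once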

You need ScrapyD for this purpose
You can run as many spiders as you want at the same time, and you can constantly check whether a spider is running or not using the listjobs API.
You can set max_proc=5 in the config file, which will run a maximum of 5 spiders at a time.
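As a rough sketch of that approach (assuming the default scrapyd endpoint on localhost:6800 and a hypothetical project name), you could poll listjobs.json and top the pool up with schedule.json, while max_proc = 5 in the [scrapyd] section of the config caps how many run at once:
import requests

SCRAPYD = "http://localhost:6800"   # assumption: default scrapyd host/port
PROJECT = "my_project"              # hypothetical project name

def busy_count():
    # listjobs.json returns "pending", "running" and "finished" job lists per project
    jobs = requests.get(SCRAPYD + "/listjobs.json", params={"project": PROJECT}).json()
    return len(jobs["running"]) + len(jobs["pending"])

def schedule(spider_name):
    # schedule.json queues one spider run; scrapyd itself respects max_proc
    requests.post(SCRAPYD + "/schedule.json", data={"project": PROJECT, "spider": spider_name})

for spider in waiting_spiders_from_db():    # hypothetical helper reading your status table
    if busy_count() >= 5:
        break
    schedule(spider)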
Anyway, talking about your code: your code should work if you do this
process = CrawlerProcess(get_project_settings())
for i in range(10):  # this range is just for demo; instead of this I
                     # find the spiders that are waiting to run from the database
    process.crawl(spider1)  # spider name changes based on the spider to run
    process.crawl(spider2)
    print('-------------this is the-----{}--iteration'.format(i))
process.start()
You need to place process.start() outside of the loop.

I was able to implement similar functionality by removing the loop from the script and setting up a scheduler that runs every 3 minutes.
The looping functionality was achieved by keeping a record of how many spiders are currently running and checking whether more spiders need to be run. Thus, at any time, only 5 spiders (this number can be changed) run concurrently.
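A rough sketch of that idea, assuming each scheduler run launches this script as a fresh process (so the Twisted reactor is started only once per process); the database helpers and the spider-status values are hypothetical placeholders for whatever your table actually stores:
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

MAX_CONCURRENT = 5

def run_waiting_spiders():
    # hypothetical helpers around the status table described in the question
    running = count_spiders_with_status("running")
    if running >= MAX_CONCURRENT:
        return
    waiting = get_spiders_with_status("waiting", limit=MAX_CONCURRENT - running)
    if not waiting:
        return
    process = CrawlerProcess(get_project_settings())
    for spider_name in waiting:
        mark_spider_status(spider_name, "running")
        process.crawl(spider_name)   # names are resolved through the project's spider loader
    process.start()                  # blocks until these spiders have finished

if __name__ == "__main__":
    run_waiting_spiders()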

Related

apscheduler: returned more than one DjangoJobExecution -- it returned 2

In my project the scheduler returns this error when executing the job; please help me.
This is the error in my console when I run the program:
Error notifying listener
Traceback (most recent call last):
File "C:\Users\angel\project\venv\lib\site-packages\apscheduler\schedulers\base.py", line 836, in _dispatch_event
cb(event)
File "C:\Users\angel\project\venv\lib\site-packages\django_apscheduler\jobstores.py", line 53, in handle_submission_event
DjangoJobExecution.SENT,
File "C:\Users\angel\project\venv\lib\site-packages\django_apscheduler\models.py", line 157, in atomic_update_or_create
job_id=job_id, run_time=run_time
File "C:\Users\angel\project\venv\lib\site-packages\django\db\models\query.py", line 412, in get
(self.model._meta.object_name, num)
django_apscheduler.models.DjangoJobExecution.MultipleObjectsReturned: get() returned more than one DjangoJobExecution -- it returned 2!
This is my code
class Command(BaseCommand):
    help = "Runs apscheduler."
    scheduler = BackgroundScheduler(timezone=settings.TIME_ZONE, daemon=True)
    scheduler.add_jobstore(DjangoJobStore(), "default")

    def handle(self, *args, **options):
        self.scheduler.add_job(
            delete_old_job_executions,
            'interval', seconds=5,
            id="delete_old_job_executions",
            max_instances=1,
            replace_existing=True
        )
        try:
            logger.info("Starting scheduler...")
            self.scheduler.start()
        except KeyboardInterrupt:
            logger.info("Stopping scheduler...")
            self.scheduler.shutdown()
            logger.info("Scheduler shut down successfully!")
Not sure if you're still having this issue. I had the same error and found your question. It turned out this happens only in the dev environment.
Because python3 manage.py runserver starts two processes by default, the code seems to register two job records and then find two entries at the next run time.
With the --noreload option it starts only one scheduler thread and works well. As the name implies, it won't automatically reload changes you make, though.
python3 manage.py runserver --noreload
Not sure if you're still having this issue. I think you can use a socket as a lock so that only one process starts the scheduler.
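The image from the original answer isn't available here, but the usual socket trick is to bind a local port as a cross-process lock, so that only one of the runserver processes actually starts the scheduler. A sketch of that idea (the port number is arbitrary):
import socket

def acquire_scheduler_lock(port=47200):
    # only one process on the machine can bind this port
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind(("127.0.0.1", port))
        return sock          # keep the reference so the port stays bound
    except socket.error:
        return None

lock = acquire_scheduler_lock()
if lock is not None:
    scheduler.start()        # only the lock-holding process runs the jobs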

Using PyTorch with Celery

I'm trying to run a PyTorch model in a Django app. As it is not recommended to execute models (or any long-running task) in the views, I decided to run it in a Celery task. My model is quite big: it takes about 12 seconds to load and about 3 seconds to infer. That's why I decided I couldn't afford to load it on every request. So I tried to load it in settings and keep it there for the app to use. My final scheme is:
When the Django app starts, in the settings the PyTorch model is loaded and it's accessible from the app.
When views.py receives a request, it delays a celery task
The Celery task uses settings.model to infer the result.
The problem here is that the Celery task throws the following error when trying to use the model:
[2020-08-29 09:03:04,015: ERROR/ForkPoolWorker-1] Task app.tasks.task[458934d4-ea03-4bc9-8dcd-77e4c3a9caec] raised unexpected: RuntimeError("Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method")
Traceback (most recent call last):
File "/home/ubuntu/anaconda3/envs/tensor/lib/python3.7/site-packages/celery/app/trace.py", line 412, in trace_task
R = retval = fun(*args, **kwargs)
File "/home/ubuntu/anaconda3/envs/tensor/lib/python3.7/site-packages/celery/app/trace.py", line 704, in __protected_call__
return self.run(*args, **kwargs)
/*...*/
File "/home/ubuntu/anaconda3/envs/tensor/lib/python3.7/site-packages/torch/cuda/__init__.py", line 191, in _lazy_init
"Cannot re-initialize CUDA in forked subprocess. " + msg)
RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
Here's the code in my settings.py loading the model:
if sys.argv and sys.argv[0].endswith('celery') and 'worker' in sys.argv:  # in order to load only for the celery worker
    import torch
    torch.cuda.init()
    torch.backends.cudnn.benchmark = True
    load_model_file()
And the task code
@task
def getResult(name):
    print("Executing on GPU:", torch.cuda.is_available())
    if os.path.isfile(name):
        try:
            outpath = model_inference(name)
            os.remove(name)
            return outpath
        except OSError as e:
            print("Error", name, "doesn't exist")
            return ""
The print in the task shows "Executing on GPU: true"
I've tried setting torch.multiprocessing.set_start_method('spawn') in the settings.py before and after the torch.cuda.init() but it gives the same error.
Setting this start method works as long as you're also using Process from the same library:
from torch.multiprocessing import Pool, Process
Celery uses the "regular" multiprocessing library, hence this error.
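For reference, a minimal sketch (not the Celery setup itself) of the pattern that does work: the worker is spawned through torch.multiprocessing, so the child initializes its own CUDA context. model_inference stands in for your own function:
import torch
from torch.multiprocessing import Process, set_start_method

def worker(name):
    # the spawned child sets up CUDA itself instead of inheriting a forked context
    print("Executing on GPU:", torch.cuda.is_available())
    # outpath = model_inference(name)   # your inference call would go here

if __name__ == "__main__":
    set_start_method("spawn", force=True)
    p = Process(target=worker, args=("some_file",))
    p.start()
    p.join()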
If I were you I'd try one of the following:
run it single-threaded to see if that helps
run it with eventlet to see if that helps
read this
A quick fix is to make things single-threaded. To do that, set the worker pool type of Celery to solo when starting the Celery worker:
celery -A your_proj worker -P solo -l info
This is due to the fact that the Celery worker itself uses forking. This appears to be a currently known issue with Celery >= 4.0.
You used to be able to configure Celery to spawn rather than fork, but that feature (CELERYD_FORCE_EXECV) was removed in 4.0.
There are no built-in options to get around this. Some custom monkeypatching to do this is probably possible, but YMMV.
Some potentially viable options might be:
Use Celery < 4.0 with CELERYD_FORCE_EXECV enabled.
Launch Celery workers on Windows (where forking is not possible anyhow).
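If you go the Celery < 4.0 route, the option is just a flag in your settings; a sketch, assuming Celery 3.x where the setting still exists:
# Django settings.py / celeryconfig (Celery 3.x only; removed in 4.0)
CELERYD_FORCE_EXECV = True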

Scrapy - Can't call scraper from a script in parent folder to scrapy project

I've got a bit of a strange one that I can't get my head around here:
I've set up a web scraper using Scrapy and it performs the scrape fine when I run the following file from the CLI ($ python journal_scraper.py):
journal_scraper.py:
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

def checkForUpdates():
    process = CrawlerProcess(get_project_settings())
    process.crawl('journal')
    process.crawl('article')
    process.start()

if __name__ == '__main__':
    checkForUpdates()
The process is able to find the two spiders journal and article without a problem.
Now, I'd like to call this scrape as one of many steps within an application that I'm developing, so from the parent folder of the Scrapy project I import journal_scraper.py into my main.py file and try to run the checkForUpdates() function:
main.py:
from scripts.journal_scraper import checkForUpdates
checkForUpdates()
and I get the following:
2016-01-10 20:30:56 [scrapy] INFO: Scrapy 1.0.4 started (bot: scrapybot)
2016-01-10 20:30:56 [scrapy] INFO: Optional features available: ssl, http11
2016-01-10 20:30:56 [scrapy] INFO: Overridden settings: {}
Traceback (most recent call last):
File "main.py", line 13, in <module>
checkForUpdates()
File "/Users/oldo/Python/projects/AMS-Journal-Scraping/AMS_Journals/scripts/journal_scraper.py", line 8, in checkForUpdates
process.crawl('journal')
File "/Users/oldo/Python/virtual-environments/AMS-Journal/lib/python2.7/site-packages/scrapy/crawler.py", line 150, in crawl
crawler = self._create_crawler(crawler_or_spidercls)
File "/Users/oldo/Python/virtual-environments/AMS-Journal/lib/python2.7/site-packages/scrapy/crawler.py", line 165, in _create_crawler
spidercls = self.spider_loader.load(spidercls)
File "/Users/oldo/Python/virtual-environments/AMS-Journal/lib/python2.7/site-packages/scrapy/spiderloader.py", line 40, in load
raise KeyError("Spider not found: {}".format(spider_name))
KeyError: 'Spider not found: journal'
I've also tried changing main.py to:
import subprocess
subprocess.call('python ./scripts/scraper.py', shell=True)
Which yields the same error.
I'm pretty sure it has something to do with the fact that I am calling this function from the parent folder, because if I make a little test script in the same folder as journal_scraper.py that does the same thing as main.py, the scraper runs as expected.
Is there some sort of restriction on calling scrapers from a script external to the Scrapy project?
Please ask for further details if my situation is not clear.
Although it is very late, if you are still looking for a solution, try importing the class of your spider:
from parent1.parent1.spiders.spider_file_name import spider_class_name
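Concretely, in journal_scraper.py that means passing the spider classes to process.crawl instead of the name strings; a sketch with hypothetical module and class names that you would adjust to your project layout:
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# hypothetical import paths; point these at your actual spider modules
from AMS_Journals.spiders.journal_spider import JournalSpider
from AMS_Journals.spiders.article_spider import ArticleSpider

def checkForUpdates():
    process = CrawlerProcess(get_project_settings())
    process.crawl(JournalSpider)   # the class itself, not the string 'journal'
    process.crawl(ArticleSpider)
    process.start()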

Trying to run a scrapy crawler from another location within script

All,
I'm trying to fully automate my scraping, which consists of 3 steps:
1- Get the list of index pages for advertisements (Non-scrapy work, for various reasons)
2- Get the list of advertisement URLs from the index pages obtained in step one (Scrapy work)
My scrapy project is in the usual directory:
C:\Python27\Scripts\GetAdUrlsFromIndex_project\GetAdUrlsFromIndex\spiders\GetAdUrls_spider.py
(name of the spider inside the "GetAdUrls_spider" file is (name = "getadurls"))
My script to automate the step 1 and 2 is in this directory:
C:\Website_DATA\SCRIPTS\StepByStepLauncher.py
I have tried using the Scrapy documentation to import the crawler and run from inside the script using the following code:
from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from scrapy import log
from GetAdUrlsFromIndex.spiders.GetAdUrls_spider import getadurls
spider = getadurls(domain='website.com')
crawler = Crawler(Settings())
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
reactor.run() # the script will block here
I keep getting the error "No module named GetAdUrlsFromIndex.spiders.GetAdUrls_spider" when I try to run this script, unfortunately. I tried changing the working directory to several different locations and played around with names; nothing seemed to work.
Would appreciate any help. Thanks!
If you do have __init__.py in C:\Python27\Scripts\GetAdUrlsFromIndex_project\GetAdUrlsFromIndex and C:\Python27\Scripts\GetAdUrlsFromIndex_project\GetAdUrlsFromIndex\spiders then try modifying your script this way
import sys
from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from scrapy import log
sys.path.append('C:/Python27/Scripts/GetAdUrlsFromIndex_project')
from GetAdUrlsFromIndex.spiders.GetAdUrls_spider import getadurls
spider = getadurls(domain='website.com')
crawler = Crawler(Settings())
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
reactor.run() # the script will block here

Sequentially running an external independent process using Tkinter and python

BACKGROUND:
*I'm creating a batch simulation job chooser + scheduler using Tkinter (Portable PYscripter, python v2.7.3)
*This program will function as a front end, to a commercial solver program
*The program needs to allow the user to choose a bunch of files to simulate, sequentially, one after the other.
*It also needs to have the facility to modify (Add/delete) jobs from an existing/running job list.
*Each simulation will definitely run for several hours.
*The output of the simulation will be viewed on separate programs and I do not need any pipe to the output. The external viewer will be called from the GUI, when desired.
***I have a main GUI window, which allows the user to:
choose job files, submit jobs, view the submission log, stop running jobs (one by one)
The above works well.
PROBLEMS:
*If I use subprocess.Popen("command"): all the simulation input files are launched at the same time. It MUST be sequential (due to license and memory limitations)
*If I use subprocess.call(" ") or the wait() method, then the GUI hangs and there is no way to stop/add/modify the job list. Even if the "job submit" command is in an independent window, both parent windows hang until the job completes.
QUESTION 1:
*How do I launch the simulation jobs sequentially (like subprocess.call) AND keep the main GUI window responsive so the job list can be modified or a job stopped?
The jobs are in a list, taken using "askopenfilenames", and then run using a for loop.
Relevant parts of the Code :
cfx5solvepath = r"c:\XXXX"

def file_chooser_default():
    global flist1
    flist1 = askopenfilename(parent=root2, filetypes=[('.def', '*.def'), ('All', '*.*'), ('.res', '*.res')], title="Select Simulation run files...", multiple=True)[1:-1].split('} {')

def ext_process():
    o = list(flist1)
    p = list(flist1)
    q = list(flist1)
    i = 0
    while i < len(flist1):
        p[i] = '"%s" -def "%s"' % (cfx5solvepath, flist1[i])
        i += 1
    i = 0
    while i < len(p):
        q[i] = subprocess.call(p[i])
        i += 1

root2 = Tk()
root2.minsize(300, 300)
root2.geometry("500x300")
root2.title("NEW WINDOW")
frame21 = Frame(root2, borderwidth=3, relief="solid").pack()
w21 = Button(root2, fg="blue", text="Choose files to submit", command=file_chooser_default).pack()
w2a1 = Button(root2, fg="white", text='Display chosen file names and order', command=lambda: print_var(flist1)).pack()
w2b1 = Button(root2, fg="white", bg="red", text="S U B M I T", command=ext_process).pack()
root2.mainloop()
Please let me know if you require anything else. Look forward to your help.
*EDIT *
On incorporating the changes suggested by @Tim, the GUI is left free. Since there is a specific sub-program associated with the main solver program to stop a job, I am able to stop the job using the right command.
Once the currently running job is stopped, the next job on the list starts up, automatically, as I was hoping.
This is the code used for stopping the job :
def stop_select(): #Choose the currently running files which are to be stopped
global flist3
flist3=askdirectory().split('} {')
def sim_stop(): #STOP the chosen simulation
st=list(flist3)
os.chdir("%s"%flist3[0])
st= subprocess.call('"%s" -directory "%s"'%(defcfx5stoppath,flist3[0]))
ret1=tkMessageBox.showinfo("INFO","Chosen simulation stopped successfully")
os.chdir("%s" %currentwd)
QUESTION 2:
*Once the above jobs are completed using start_new_thread, the GUI doesn't respond. The GUI works while the jobs are running in the background, but the start_new_thread documentation says that the thread is supposed to exit silently when the function returns.
*Additionally, I have an HTML log file that is written to/updated as each job completes. When I use start_new_thread, the log file content is visible only AFTER all the jobs complete. The contents, along with the time stamps, are however correct. Without using start_new_thread, I was able to refresh the HTML file to get the updated submission log.
***On exiting the GUI program using the Task Manager several times, I am suddenly unable to use the start_new_thread function! I have tried reinstalling PYscripter and restarting the computer as well. I can't figure out anything sensible from the traceback, which is:
Traceback (most recent call last):
File "<string>", line 532, in write
File "C:\Portable Python 2.7.3.1\App\lib\site-packages\rpyc\core\protocol.py", line 439, in _async_request
seq = self._send_request(handler, args)
File "C:\Portable Python 2.7.3.1\App\lib\site-packages\rpyc\core\protocol.py", line 229, in _send_request
self._send(consts.MSG_REQUEST, seq, (handler, self._box(args)))
File "C:\Portable Python 2.7.3.1\App\lib\site-packages\rpyc\core\protocol.py", line 244, in _box
if brine.dumpable(obj):
File "C:\Portable Python 2.7.3.1\App\lib\site-packages\rpyc\core\brine.py", line 369, in dumpable
return all(dumpable(item) for item in obj)
File "C:\Portable Python 2.7.3.1\App\lib\site-packages\rpyc\core\brine.py", line 369, in <genexpr>
return all(dumpable(item) for item in obj)
File "C:\Portable Python 2.7.3.1\App\lib\site-packages\rpyc\core\brine.py", line 369, in dumpable
return all(dumpable(item) for item in obj)
File "C:\Portable Python 2.7.3.1\App\lib\site-packages\rpyc\core\brine.py", line 369, in <genexpr>
return all(dumpable(item) for item in obj)
File "C:\Portable Python 2.7.3.1\App\Python_Working_folder\v350.py", line 138, in ext_process
q[i]=subprocess.call(p[i])
File "C:\Portable Python 2.7.3.1\App\lib\subprocess.py", line 493, in call
return Popen(*popenargs, **kwargs).wait()
File "C:\Portable Python 2.7.3.1\App\lib\subprocess.py", line 679, in __init__
errread, errwrite)
File "C:\Portable Python 2.7.3.1\App\lib\subprocess.py", line 896, in _execute_child
startupinfo)
WindowsError: [Error 2] The system cannot find the file specified
I'd suggest using a separate thread for the job launching. The simplest way would be to use the start_new_thread method from the thread module.
Change the submit button's command to command=lambda:thread.start_new_thread(ext_process, ())
You will probably want to disable the button when it's clicked and enable it when the launching is complete. This can be done inside ext_process.
It becomes more complicated if you want to allow the user to cancel jobs. This solution won't handle that.
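A minimal sketch of that change (Python 2.7, as in the question); note the Button has to be kept in its own variable, with pack() called separately, so it can be disabled and re-enabled:
import thread

def submit_jobs():
    w2b1.config(state="disabled")                  # grey out SUBMIT from the GUI thread
    thread.start_new_thread(run_jobs, ())

def run_jobs():
    try:
        ext_process()                              # launches the jobs sequentially via subprocess.call
    finally:
        w2b1.config(state="normal")                # simple case; strictly, Tk calls belong in the GUI thread

w2b1 = Button(root2, fg="white", bg="red", text="S U B M I T", command=submit_jobs)
w2b1.pack()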
