I'm new to Python and web scraping, so please excuse my ignorance. I want to run my spider on a schedule. I use Python 3.7 and macOS.
I wrote a cron job using crontab that calls a shell script to run the Scrapy spider. However, it executed only once, ending with the line "INFO: Closing spider (finished)", and did not repeat according to the schedule. I ran a simple Python script on the same schedule to test cron, and that worked, so the issue seems to be only with the spider. Please help me understand how to fix this. Any help would be appreciated. Thank you.
import csv
import os
import random
from time import sleep

import scrapy


class spider1(scrapy.Spider):
    name = "amspider"

    # empty data.csv at class-definition time so every crawl starts with a fresh file
    with open("data.csv", "a") as filee:
        if os.stat("data.csv").st_size != 0:
            filee.truncate(0)

    def start_requests(self):
        list = ["https://www.example.com/item1",
                "https://www.example.com/item2",
                "https://www.example.com/item3",
                "https://www.example.com/item4",
                "https://www.example.com/item5"
                ]
        for i in list:
            yield scrapy.Request(i, callback=self.parse)
            sleep(random.randint(0, 5))

    def parse(self, response):
        product_name = response.css('#pd-h1-cartridge::text')[0].extract()
        product_price = response.css(
            '.product-price .is-current, .product-price_total .is-current, .product-price_total ins, .product-price ins').css(
            '::text')[3].extract()
        print(product_name)
        print(product_price)
        with open('data.csv', 'a') as file:
            itemwriter = csv.writer(file, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
            itemwriter.writerow([str(product_name).strip(), str(product_price).strip()])
amsp.sh
#!/bin/sh
cd /Users/amal/PycharmProjects/AmProj2/amazonspider
PATH=$PATH:/usr/local/bin/
export PATH
scrapy crawl amspider
crontab
Tried both ways, but the spider executed only once.
*/2 * * * * /Users/amal/Documents/amsp.sh
*/2 * * * * cd /Users/amal/PycharmProjects/AmProj2/amazonspider && scrapy crawl amspider
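(Not part of the original question, just a hedged alternative sketch: if cron keeps misbehaving, the schedule can also live in a small Python loop that re-launches the spider in a fresh process each time. It assumes scrapy is on the PATH of the interpreter that runs it; the project path is the one from the question.)
import subprocess
import time

PROJECT_DIR = "/Users/amal/PycharmProjects/AmProj2/amazonspider"  # path from the question

while True:
    # each crawl runs in its own process, so Scrapy's reactor starts fresh every time
    subprocess.run(["scrapy", "crawl", "amspider"], cwd=PROJECT_DIR, check=False)
    time.sleep(120)  # roughly the */2-minute cadence of the crontab entries above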
Related
I have a Scrapy spider that imports a dictionary from another module. My main contentspider.py, which contains the spider, also contains the import statement from spider_project.spider_project.updated_kw import translated_kw_dicts.
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

from spider_project.spider_project.updated_kw import translated_kw_dicts


class ContentSpider(CrawlSpider):
    name = 'content_spider'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['http://books.toscrape.com']

    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        print(translated_kw_dicts)
This import statement works fine when the spider is run as a script, but when I run it via conda I get an error:
from spider_project.spider_project.updated_kw import translated_kw_dicts
ModuleNotFoundError: No module named 'spider_project.spider_project'
From conda I am running the spider as it should be run: from the directory containing the .cfg file: c:/.../.../spider_project
If I change the import statement to from spider_project.updated_kw import translated_kw_dicts (notice I'm taking out the first spider_project directory), then via conda the spider runs fine, but I get the error Cannot find reference 'updated_kw' in 'imported module spider_project' in my script.
Can someone advise why this is happening?
Here is my project structure:
When you click on the long line up in the file explorer, it should become active. You can copy it and use the file-reading technique.
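(A hedged sketch, not from the question: the reason the single-level import works under scrapy crawl is that the directory containing scrapy.cfg is put on sys.path, so the inner package is importable as spider_project, not spider_project.spider_project. A quick check, using the same names as above:)
import sys

# under `scrapy crawl`, the project root holding scrapy.cfg should appear here,
# which is why the one-level package name resolves
print([p for p in sys.path if p.endswith('spider_project')])

from spider_project.updated_kw import translated_kw_dicts  # the import that resolves under `scrapy crawl`
print(translated_kw_dicts)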
I have code that uses the Scrapy framework; here it is:
import scrapy
from scrapy.crawler import CrawlerProcess


class DemoSpider(scrapy.Spider):
    name = "DemoSpider"

    def start_requests(self):
        urls = ['http://quotes.toscrape.com/page/1/', 'http://quotes.toscrape.com/page/2/']
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split('/')[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved File %s' % filename)


process = CrawlerProcess()
process.crawl(DemoSpider)
process.start()
The code works well when run from the terminal (Windows 10 PowerShell) with python demo.py.
But I need to run the code using the Spyder IDE. When I try, I get an error like this:
ReactorBase.startRunning(self)
File "C:\ProgramData\Anaconda3\lib\site-packages\twisted\internet\base.py", line 765, in startRunning
raise error.ReactorNotRestartable()
ReactorNotRestartable
(Spyder maintainer here) Please go to the menu Run > Configuration per file and activate the option Execute in an external system terminal.
That will run your code in a regular Python interpreter, which avoids the problems you're having when starting the server that runs the scraper in our IPython console.
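(If changing the run configuration is not an option, a hedged workaround in the same spirit, assuming the file is named demo.py as in the question, is to launch the script in a fresh interpreter from inside the IDE, so the Twisted reactor starts and stops with that child process.)
# run_demo.py - hypothetical launcher; runs the scraper in its own interpreter
import subprocess
import sys

subprocess.run([sys.executable, "demo.py"], check=True)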
I want to use cron to execute my Python script every hour of the day. Therefore I created a cron job that looks like: @hourly /home/pi/Desktop/repository/auslastung_download/auslastung.py
The cron job should execute the following script:
from bs4 import BeautifulSoup
from selenium.webdriver.firefox.options import Options as FirefoxOptions
from selenium import webdriver
from datetime import datetime, date


def get_auslastung_lichtenberg():
    try:
        url = "https://www.mcfit.com/de/fitnessstudios/studiosuche/studiodetails/studio/berlin-lichtenberg/"

        options = FirefoxOptions()
        options.add_argument("--headless")
        driver = webdriver.Firefox(options=options)
        driver.get(url)

        html_content = driver.page_source
        soup = BeautifulSoup(html_content, 'html.parser')

        elems = soup.find_all('div', {'class': 'sc-iJCRLp eDJvQP'})
        #print(elems)

        auslastung = str(elems).split("<span>")[1]
        #print(auslastung)
        auslastung = auslastung[:auslastung.rfind('</span>')]
        #print(auslastung)
        auslastung = str(auslastung).split("Auslastung ")[1]
        #print(auslastung)
        auslastung = auslastung[:auslastung.rfind('%')]
        print(auslastung)

        now = datetime.now()
        current_time = now.strftime("%H:%M:%S")
        #print("Current Time =", current_time)

        today = date.today()
        print(today)

        ergebnis = {'date': today, 'time': current_time, 'studio': "Berlin Lichtenberg", 'auslastung': auslastung}
        return ergebnis

    finally:
        try:
            driver.close()
        except:
            pass


"""
import json
with open('database.json', 'w') as f:
    json.dump(get_auslastung_lichtenberg(), f)
"""

import csv

with open('/home/pi/Desktop/repository/auslastung_download/data.csv', mode='a') as file:
    fieldnames = ['date', 'time', 'studio', 'auslastung']
    writer = csv.DictWriter(file, fieldnames=fieldnames)
    writer.writerow(get_auslastung_lichtenberg())
When executed via python3 auslastung.py everything works fine and the script writes into the data.csv file.
Maybe someone can help me :)
First of all, you must ensure that your script runs.
If you run python3 auslastung.py interactively, why do you invoke your Python script differently in your cron entry?
Have you tried to run just /home/pi/Desktop/repository/auslastung_download/auslastung.py interactively, without the initial python3? Does it run?
If your script runs with python3 auslastung.py, then in your crontab you should include the full path to both the interpreter and the script:
@hourly /path/to/python3 /full/path/to/script.py
If you made your script run directly, without needing to indicate the interpreter, as just /full/path/to/script.py, then in your crontab you should include the full path to the script:
@hourly /full/path/to/script.py
You may include a shebang: the very first line of your script indicates which interpreter is used to execute it, so your first line should be #!/path/to/your/interpreter
And you have to ensure that the script has execute permission, with chmod +x auslastung.py.
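(A small, hedged helper, not part of the answer above: the exact paths to paste into the crontab line can be printed from the interactive session that already works.)
# print the interpreter and script paths to use in the crontab entry
import os
import sys

print(sys.executable)                    # full path of the python3 that works interactively
print(os.path.abspath("auslastung.py"))  # full path of the script (run this from its directory)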
Just for fun, I am collecting weather data with my Raspberry Pi.
If I execute my Python script in the console, everything works fine. But if I add the Python file to crontab to start it after rebooting, it isn't working. (crontab entry: @reboot python3 /home/pi/Documents/PythonProgramme/WeatherData/weatherdata.py &)
#! /usr/bin/python3
from pyowm import OWM
import csv
import schedule
from datetime import datetime
import time

key = 'XXXXXX'


def weather_request(text):
    owm = OWM(key)
    mgr = owm.weather_manager()
    karlsruhe = mgr.weather_at_place('Karlsruhe, DE').weather
    hamburg = mgr.weather_at_place('Hamburg, DE').weather
    cities = (karlsruhe, hamburg)

    with open('weatherdata.csv', 'a') as file:
        writer = csv.writer(file)
        row = [datetime.now().strftime("%Y-%m-%d %H:%M:%S")]
        for city in cities:
            row.append(city.temperature('celsius')['temp'])
        row.append(round(row[1] - row[2], 2))
        row.append(text)
        writer.writerow(row)


schedule.every().day.at("08:00").do(weather_request, 'morgens')
schedule.every().day.at("13:00").do(weather_request, 'mittags')
schedule.every().day.at("18:00").do(weather_request, 'abends')

while 1:
    schedule.run_pending()
    time.sleep(1)
If I run ps -aef | grep python, it shows that my script is running: pi 337 1 21 10:32 ? 00:00:10 python3 /home/pi/Documents/PythonProgramme/WeatherData/weatherdata.py
But I never get any data. What am I missing?
Thanks in advance!
Where are you checking the output file?
Have you tried to open the file with the full path?
with open('<fullPath>weatherdata.csv', 'a') as
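(Building on that comment, a hedged sketch of the fix: resolve the CSV path relative to the script file itself rather than the process's working directory, since cron starts the job from somewhere else. The constant names below are my own.)
import os

BASE_DIR = os.path.dirname(os.path.abspath(__file__))  # directory of weatherdata.py
CSV_PATH = os.path.join(BASE_DIR, 'weatherdata.csv')

with open(CSV_PATH, 'a') as file:
    ...  # same csv.writer code as in the question, now independent of cron's working directory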
All,
I'm trying to fully automate my scraping, which consists of 3 steps:
1- Get the list of index pages for advertisements (Non-scrapy work, for various reasons)
2- Get the list of advertisement URLs from the index pages obtained in step one (Scrapy work)
My scrapy project is in the usual directory:
C:\Python27\Scripts\GetAdUrlsFromIndex_project\GetAdUrlsFromIndex\spiders\GetAdUrls_spider.py
(the name of the spider inside the "GetAdUrls_spider" file is name = "getadurls")
My script to automate steps 1 and 2 is in this directory:
C:\Website_DATA\SCRIPTS\StepByStepLauncher.py
I have tried following the Scrapy documentation to import the crawler and run it from inside the script using the following code:
from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from scrapy import log
from GetAdUrlsFromIndex.spiders.GetAdUrls_spider import getadurls
spider = getadurls(domain='website.com')
crawler = Crawler(Settings())
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
reactor.run() # the script will block here
Unfortunately, I keep getting the error "No module named GetAdUrlsFromIndex.spiders.GetAdUrls_spider" when I try to run this script. I tried changing the working directory to several different locations and played around with names, but nothing seemed to work.
I would appreciate any help. Thanks!
If you do have __init__.py in C:\Python27\Scripts\GetAdUrlsFromIndex_project\GetAdUrlsFromIndex and C:\Python27\Scripts\GetAdUrlsFromIndex_project\GetAdUrlsFromIndex\spiders, then try modifying your script this way:
import sys
from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from scrapy import log
sys.path.append('C:/Python27/Scripts/GetAdUrlsFromIndex_project')
from GetAdUrlsFromIndex.spiders.GetAdUrls_spider import getadurls
spider = getadurls(domain='website.com')
crawler = Crawler(Settings())
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
reactor.run() # the script will block here
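(Side note: Crawler, crawler.configure() and scrapy.log in the snippet above belong to old Scrapy releases. On current Scrapy, the usual standalone runner is CrawlerProcess; a hedged sketch with the same sys.path fix and the same spider class might look like this.)
import sys

sys.path.append('C:/Python27/Scripts/GetAdUrlsFromIndex_project')

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from GetAdUrlsFromIndex.spiders.GetAdUrls_spider import getadurls

# get_project_settings() expects to find the project's settings via scrapy.cfg
# or the SCRAPY_SETTINGS_MODULE environment variable
process = CrawlerProcess(get_project_settings())
process.crawl(getadurls, domain='website.com')  # keyword args are passed to the spider
process.start()  # blocks until the crawl finishes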