I created a CrawlSpider that should follow all "internal" links up to a certain number of items / pages / time.
I am using multiprocessing.Pool to process a few pages at the same time (e.g. 6 workers).
I call the CrawlSpider with os.system from a separate Python script:
import os
...
cmd = 'scrapy crawl FullPageCrawler -t jsonlines -o "{0}" -a URL={1} -s DOWNLOAD_MAXSIZE=0 -s CLOSESPIDER_TIMEOUT=180 -s CLOSESPIDER_PAGECOUNT=150 -s CLOSESPIDER_ITEMCOUNT=100 -s DEPTH_LIMIT=5 -s DEPTH_PRIORITY=0 --nolog'.format(OUTPUT_FILE, url.strip())
os.system(cmd)
It works pretty well for some of my pages, but for specific pages the crawler does not respect any of the settings I set.
I tried to define the following (with what I think each does):
CLOSESPIDER_PAGECOUNT: The total number of pages it will follow?
CLOSESPIDER_ITEMCOUNT: Not sure about this one. How does it differ from PAGECOUNT?
CLOSESPIDER_TIMEOUT: Maximum time a crawler should be working.
Right now I am facing an example that has already crawled more than 4000 pages (or items?!) and has been running for more than 1 hour.
Do I run into this because I defined everything at the same time?
Do I also need to define the same settings in the settings.py?
Can one of them be enough for me? (e.g. maximum uptime = 10 minutes)
I tried using subprocess.Popen instead of os.system because it has a wait function, but it did not work as expected either.
In the end, using os.system is the most stable thing I tried and I want to stick with it. The only problem is Scrapy.
I tried searching for an answer on SO but couldn't find any help!
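For what it's worth, a hard wall-clock limit can also be enforced from the calling script, independent of Scrapy's settings. Here is a minimal sketch using subprocess.run with a timeout (the sleeping child below is just a stand-in for the scrapy command):

```python
import subprocess
import sys

def run_with_timeout(cmd, seconds):
    """Run cmd and kill it after `seconds`; return True if it finished in time."""
    try:
        subprocess.run(cmd, timeout=seconds)
        return True
    except subprocess.TimeoutExpired:
        return False

# Stand-in for the scrapy command: a child that would run for 5 seconds.
finished = run_with_timeout([sys.executable, "-c", "import time; time.sleep(5)"], 1)
print(finished)  # the child is killed after ~1 second, so this prints False
```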
EDIT:
The above example ended up with 16,009 scraped subpages and over 333 MB.
After continuing to search for an answer, I came up with the following solution.
Within my CrawlSpider I defined a maximum number of pages (self.max_cnt) at which the scraper should stop, and a counter (self.max_counter) that is checked and incremented for each page my scraper visits.
If the maximum number of pages is exceeded, the spider is closed by raising scrapy.exceptions.CloseSpider.
from scrapy.exceptions import CloseSpider

class FullPageSpider(CrawlSpider):
    name = "FullPageCrawler"
    rules = (Rule(LinkExtractor(allow=()), callback="parse_all", follow=True),)

    def __init__(self, URL=None, *args, **kwargs):
        super(FullPageSpider, self).__init__(*args, **kwargs)
        self.start_urls = [URL]
        self.allowed_domains = ['{uri.netloc}'.format(uri=urlparse(URL))]
        self.max_cnt = 250      # maximum number of pages to visit
        self.max_counter = 0    # pages visited so far

    def parse_all(self, response):
        if self.max_counter < self.max_cnt:
            self.max_counter += 1  # count this visited page
            ...
        else:
            raise CloseSpider('Exceeded the number of maximum pages!')
This works fine for me now but I would still be interested in the reason why the crawler settings are not working as expected.
Related
I took the Data Camp Web Scraping with Python course and am trying to run the 'capstone' web scraper in my own environment (the course takes place in a special in-browser environment). The code is intended to scrape the titles and descriptions of courses from the Data Camp webpage.
I've spent a good deal of time tinkering here and there, and at this point am hoping that the community can help me out.
The code I am trying to run is:
# Import scrapy
import scrapy

# Import the CrawlerProcess
from scrapy.crawler import CrawlerProcess

# Create the Spider class
class YourSpider(scrapy.Spider):
    name = 'yourspider'

    # start_requests method
    def start_requests(self):
        yield scrapy.Request(url=https://www.datacamp.com, callback=self.parse)

    def parse(self, response):
        # Parser; maybe this is where my issue lies
        crs_titles = response.xpath('//h4[contains(@class,"block__title")]/text()').extract()
        crs_descrs = response.xpath('//p[contains(@class,"block__description")]/text()').extract()
        for crs_title, crs_descr in zip(crs_titles, crs_descrs):
            dc_dict[crs_title] = crs_descr

# Initialize the dictionary **outside** of the Spider class
dc_dict = dict()

# Run the Spider
process = CrawlerProcess()
process.crawl(YourSpider)
process.start()

# Print a preview of courses
previewCourses(dc_dict)
I get the following output:
C:\Users*\PycharmProjects\TestScrape\venv\Scripts\python.exe C:/Users/*/PycharmProjects/TestScrape/main.py
File "C:\Users******\PycharmProjects\TestScrape\main.py", line 20
yield scrapy.Request(url=https://www.datacamp.com, callback=self.parse1)
^
SyntaxError: invalid syntax
Process finished with exit code 1
I notice that the parse method in line 20 remains grey in my PyCharm window. Maybe I am missing something important in the parse method?
Any help in getting the code to run would be greatly appreciated!
Thank you,
-WolfHawk
The error message is triggered in the following line:
yield scrapy.Request(url=https://www.datacamp.com, callback = self.parse)
As input to url you should pass a string, and string literals are written with ' or " at the beginning and the end.
Try this:
yield scrapy.Request(url='https://www.datacamp.com', callback = self.parse)
If this is your full code, you are also missing the function previewCourses. Check whether it is provided to you, or write it yourself with something like this:

def previewCourses(dict_to_print):
    for key, value in dict_to_print.items():
        print(key, value)
I am new to Scrapy, and I built a simple spider that scrapes my local news site for titles and amount of comments. It scrapes well, but I have a problem with my language encoding.
I have created a Scrapy project that I then run through anaconda prompt to save the output to a file like so (from the project directory):
scrapy crawl MySpider -o test.csv
When I then open the resulting file with the following code:

with open('test.csv', 'r', encoding="L2") as f:
    file = f.read()
I also tried saving it to JSON, opening it in Excel, and changing to different encodings from there ... it is always unreadable, but the characters differ. I am Czech, if that is relevant. I need characters like ěščřžýáíé, but what I get instead looks like Latin mojibake.
What I get: Varuje pĹ\x99ed
What I want: Varuje před
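The garbled text is consistent with UTF-8 bytes being decoded as ISO-8859-2 (which is what the "L2" codec alias means); a quick round-trip reproduces it:

```python
# "Varuje před" encoded as UTF-8 but decoded as ISO-8859-2 ("L2")
# yields exactly the mojibake shown above.
original = "Varuje před"
garbled = original.encode("utf-8").decode("L2")
print(repr(garbled))  # 'Varuje pĹ\x99ed'
# Decoding the same bytes as UTF-8 recovers the Czech characters.
```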
Here is my spider code. I did not change anything in settings or the pipeline, though I tried multiple tips from other threads that do. I have spent two hours on this already, browsing Stack Overflow and the documentation, and I can't find the solution; it's becoming a headache for me. I'm not a programmer, so that may be the reason... anyway:
urls = []
for number in range(1, 101):
    urls.append('https://www.idnes.cz/zpravy/domaci/' + str(number))

class MySpider(scrapy.Spider):
    name = "MySpider"

    def start_requests(self):
        urls = ['https://www.idnes.cz/zpravy/domaci/']
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse_main)

    def parse_main(self, response):
        articleBlocks = response.xpath('//div[contains(@class,"art")]')
        articleLinks = articleBlocks.xpath('.//a[@class="art-link"]/@href')
        linksToFollow = articleLinks.extract()
        for url in linksToFollow:
            yield response.follow(url=url, callback=self.parse_arts)
            print(url)

    def parse_arts(self, response):
        for article in response.css('div#content'):
            yield {
                'title': article.css('h1::text').get(),
                'comments': article.css('li.community-discusion > a > span::text').get(),
            }
Scrapy saves feed exports with utf-8 encoding by default.
Opening the file with the correct encoding displays the characters fine.
If you want to change the encoding used, you can do so by using the FEED_EXPORT_ENCODING setting (or using FEEDS instead).
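For example, the encoding can be set per feed in settings.py (a sketch; the keys follow Scrapy's feed-exports documentation):

```python
# settings.py sketch: export to CSV with an explicit encoding.
FEEDS = {
    'test.csv': {
        'format': 'csv',
        'encoding': 'utf-8',
    },
}

# Or set it globally for all feeds:
FEED_EXPORT_ENCODING = 'utf-8'
```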
After one more hour of trial and error, I solved this. The problem was not in Scrapy (it was correctly saving in utf-8); the problem was the command:
scrapy crawl idnes_spider -o test.csv
that I ran to save it. When I run the command:
scrapy crawl idnes_spider -s FEED_URI=test.csv -s FEED_FORMAT=csv
It works.
I want to scrape data from three different categories of contracts --- goods, services, construction.
Because each type of contract can be parsed with the same method, my goal is to use a single spider, start the spider on three different urls, and then extract data in three distinct streams that can be saved to different places.
My understanding is that just listing all three urls as start_urls will lead to one combined output of data.
My spider inherits from Scrapy's CrawlSpider class.
Let me know if you need further information.
I would suggest that you tackle this problem from another angle. In Scrapy it is possible to pass arguments to the spider from the command line using the -a option, like so:
scrapy crawl CanCrawler -a contract=goods
You just need to include the variables you reference in your class initializer
class CanCrawler(scrapy.Spider):
    name = 'CanCrawler'

    def __init__(self, contract='', *args, **kwargs):
        super(CanCrawler, self).__init__(*args, **kwargs)
        self.start_urls = ['https://buyandsell.gc.ca/procurement-data/search/site']
        # ...
Something else you might consider is adding multiple arguments, so that you can start on the homepage of a website and use the arguments to get to whatever data you need. For this website, https://buyandsell.gc.ca/procurement-data/search/site, you could for example have two command line arguments.
scrapy crawl CanCrawler -a procure=ContractHistory -a contract=goods
so you'd get
class CanCrawler(scrapy.Spider):
    name = 'CanCrawler'

    def __init__(self, procure='', contract='', *args, **kwargs):
        super(CanCrawler, self).__init__(*args, **kwargs)
        self.start_urls = ['https://buyandsell.gc.ca/procurement-data/search/site']
        # ...
and then, depending on which arguments you passed, you could make your crawler click through those options on the website to reach the data you want to crawl.
Please also see here.
I hope this helps!
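To sketch the idea, the -a arguments could be mapped to a start URL inside __init__. Note the query-parameter names below are made up for illustration, not the site's real API:

```python
# Hypothetical helper: build the start URL from the spider arguments.
# The 'procure' and 'contract' query parameters are assumptions.
def build_start_url(procure='', contract=''):
    base = 'https://buyandsell.gc.ca/procurement-data/search/site'
    params = []
    if procure:
        params.append('procure=' + procure)
    if contract:
        params.append('contract=' + contract)
    return base + ('?' + '&'.join(params) if params else '')

print(build_start_url(procure='ContractHistory', contract='goods'))
```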
In your spider, yield your item like this:

data = {'categories': {}, 'contracts': {}, 'goods': {}, 'services': {}, 'construction': {}}

where each value is a Python dictionary.
Then create a pipeline, and inside the pipeline do this:

if 'categories' in item:
    categories = item['categories']
    # process categories, save into a DB maybe
if 'contracts' in item:
    contracts = item['contracts']
    # process contracts, save into a DB maybe
# ... and so on for the other keys
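The routing in that pipeline can be sketched as a plain function (no Scrapy dependency; the list-based sinks stand in for whatever DB tables or files you use):

```python
# Minimal stand-in for the pipeline logic: route each populated top-level
# key of a yielded item to its own sink (a DB table, file, etc.).
KEYS = ('categories', 'contracts', 'goods', 'services', 'construction')

def route_item(item, sinks):
    for key in KEYS:
        if item.get(key):  # skip keys that are absent or empty
            sinks.setdefault(key, []).append(item[key])
    return item

sinks = {}
route_item({'goods': {'id': 1}, 'contracts': {}}, sinks)
print(sinks)  # only the non-empty 'goods' entry is routed
```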
I am using scrapy to crawl several websites. My spider isn't allowed to jump across domains. In this scenario, redirects make the crawler stop immediately. In most cases I know how to handle it, but this is a weird one.
The culprit is: http://www.cantonsd.org/
I checked its redirect pattern with http://www.wheregoes.com/ and it tells me it redirects to "/". This prevents the spider from entering its parse function. How can I handle this?
EDIT:
The code.
I invoke the spider using the APIs provided by scrapy here: http://doc.scrapy.org/en/latest/topics/practices.html#run-scrapy-from-a-script
The only difference is that my spider is custom. It is created as follows:
spider = DomainSimpleSpider(
    start_urls=[start_url],
    allowed_domains=[allowed_domain],
    url_id=url_id,
    cur_state=cur_state,
    state_id_url_map=id_url,
    allow=re.compile(r".*%s.*" % re.escape(allowed_path), re.IGNORECASE),
    tags=('a', 'area', 'frame'),
    attrs=('href', 'src'),
    response_type_whitelist=[r"text/html", r"application/xhtml+xml", r"application/xml"],
    state_abbr=state_abbrs[cur_state]
)
I think the problem is that the allowed_domains check sees that / is not part of the list (which contains only cantonsd.org) and shuts everything down.
I'm not reporting the full spider code because it is not invoked at all, so it can't be the problem.
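One thing that might be worth trying (a sketch; these are standard Request.meta keys from the Scrapy docs, not tested against this particular site): stop the redirect middleware from following the redirect, so the original response reaches the callback instead of the offsite filter dropping the target.

```python
# Per-request flags: RedirectMiddleware honors 'dont_redirect', and
# 'handle_httpstatus_list' lets the 3xx response reach the callback.
request_meta = {
    'dont_redirect': True,
    'handle_httpstatus_list': [301, 302],
}
# Usage (inside a spider):
#   yield scrapy.Request(url, callback=self.parse, meta=request_meta)
```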
I'm working on a Scrapy app, where I'm trying to login to a site with a form that uses a captcha (It's not spam). I am using ImagesPipeline to download the captcha, and I am printing it to the screen for the user to solve. So far so good.
My question is how can I restart the spider, to submit the captcha/form information? Right now my spider requests the captcha page, then returns an Item containing the image_url of the captcha. This is then processed/downloaded by the ImagesPipeline, and displayed to the user. I'm unclear how I can resume the spider's progress, and pass the solved captcha and same session to the spider, as I believe the spider has to return the item (e.g. quit) before the ImagesPipeline goes to work.
I've looked through the docs and examples but I haven't found any ones that make it clear how to make this happen.
This is how you might get it to work inside the spider.
self.crawler.engine.pause()
process_my_captcha()
self.crawler.engine.unpause()
Once you get the request, pause the engine, display the image, read the info from the user, and resume the crawl by submitting a POST request for login.
I'd be interested to know if the approach works for your case.
I would not create an Item and use the ImagePipeline.
import urllib
import os
import subprocess

...

def start_requests(self):
    request = Request("http://webpagewithcaptchalogin.com/", callback=self.fill_login_form)
    return [request]

def fill_login_form(self, response):
    x = HtmlXPathSelector(response)
    img_src = x.select("//img/@src").extract()

    # delete the previous captcha file and use urllib to write the new one to disk
    os.remove(r"c:\captcha.jpg")
    urllib.urlretrieve(img_src[0], r"c:\captcha.jpg")

    # I use a program here to show the jpg (actually send it somewhere)
    captcha = subprocess.check_output(r".\external_utility_solving_captcha.exe")

    # OR just get the input from the user from stdin
    captcha = raw_input("put captcha in manually>")

    # this performs the request and calls process_home_page with the response
    # (this way you can chain pages from start_requests() to parse())
    return [FormRequest.from_response(response, formnumber=0,
                                      formdata={'user': 'xxx', 'pass': 'xxx', 'captcha': captcha},
                                      callback=self.process_home_page)]

def process_home_page(self, response):
    # check if you logged in etc. etc.
    ...
What I do here is use urllib.urlretrieve(url) (to store the image), os.remove(file) (to delete the previous image), and subprocess.check_output (to call an external command-line utility to solve the captcha). The whole Scrapy infrastructure is not used in this "hack", because solving a captcha like this is always a hack.
That whole external-subprocess call could have been done more nicely, but it works.
On some sites it's not possible to save the captcha image and you have to call the page in a browser and call a screen_capture utility and crop on an exact location to "cut out" the captcha. Now that is screenscraping.