Below is a spider I wrote to crawl an RSS feed and extract the first link and image title, and save them to a text file:
import scrapy


class artSpider(scrapy.Spider):
    name = "metart"
    start_urls = ['https://www.metmuseum.org/art/artwork-of-the-day?rss=1']

    def parse(self, response):
        title = response.xpath('//item/title/text()').extract_first().replace(" ", "_")
        description = response.xpath('//item/description/text()').extract_first()
        lnkstart = description.find("https://image")
        lnkcut1 = description.find("web-highlight")
        lnkcut2 = lnkcut1 + 13
        lnkend = description.find(".jpg") + 4
        link = description[lnkstart:lnkcut1] + "original" + description[lnkcut2:lnkend]
        ttlstart = description.find("who=") + 4
        ttlend = description.find("&rss=1")
        filename = "/path/to/save/folder/" + description[ttlstart:ttlend].replace("+", "_") + "-" + title + ".jpg"
        print(filename)
        print(link)
        filename_file = open('filename_metart.txt', 'w')
        filename_file.write(filename)
        filename_file.close
        link_file = open('link_metart.txt', 'w')
        link_file.write(link)
        link_file.close
It goes over the Met Museum "Artwork of the Day" RSS feed and finds the newest artwork, then parses the title and the link to the original image (the link in the RSS feed is only for a thumbnail) and saves them to individual text files. The title is used to generate a filename when downloading the image.
I know the parse function is a mess and that the way I am saving the links and filenames is ugly. I am just starting out with Scrapy and just wanted to get the spider working because it is only a part of the broader project I am working on.
The project itself draws from artwork/image/photo of the day feeds and websites, downloads the images and sets them as a desktop background slideshow.
My issue is that when I set the spider off it comes back with a "NoneType" error. BUT if I go to the RSS feed in a browser (https://metmuseum.org/art/artwork-of-the-day?rss=1) AND THEN run the spider, it works correctly.
Process that fails:
1. Call scrapy crawl metart
2. The output shows that the variable title has the type NoneType
3. Repeating step 1 results in the same output

Process that works:
1. Open https://metmuseum.org/art/artwork-of-the-day?rss=1 in a web browser
2. Call scrapy crawl metart
3. The spider successfully saves the link and filename to the relevant text files
Some troubleshooting I have already done
I used the Scrapy shell to replicate exactly the process the spider goes through, and this worked fine. I went through each step, starting with:
fetch("https://metmuseum.org/art/artwork-of-the-day?rss=1")
and then typed in each line of the parse function. This works fine and results in the correct link and filename being saved WITHOUT having to open the URL in a web browser first.
Just for completeness below is my Scrapy project settings.py file:
BOT_NAME = 'wallpaper_scraper'
SPIDER_MODULES = ['wallpaper_scraper.spiders']
NEWSPIDER_MODULE = 'wallpaper_scraper.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'cchowgule for a wallpaper slideshow'
# Obey robots.txt rules
ROBOTSTXT_OBEY = True
I am very confused as to how the act of opening the URL in a web browser could possibly affect the response of the spider.
Any help cleaning up my parse function would also be lovely.
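For reference, a tidier parse might look roughly like the sketch below. It is untested; it swaps the index arithmetic for regular expressions and yields an item instead of writing the text files by hand, while the XPath expressions and the "web-highlight" → "original" swap are carried over from the code above:

import re
import scrapy


class ArtSpider(scrapy.Spider):
    name = "metart"
    start_urls = ['https://www.metmuseum.org/art/artwork-of-the-day?rss=1']

    def parse(self, response):
        title = response.xpath('//item/title/text()').extract_first()
        description = response.xpath('//item/description/text()').extract_first()
        if not title or not description:
            self.logger.error("RSS item was missing a title or description")
            return
        # the feed links to a thumbnail; swapping "web-highlight" for "original"
        # gives the full-size image, as in the string-slicing version above
        match = re.search(r'https://image\S+?\.jpg', description)
        link = match.group(0).replace("web-highlight", "original", 1) if match else None
        # the artist name sits between "who=" and "&rss=1" in the description
        who = re.search(r'who=(.*?)&rss=1', description)
        artist = who.group(1).replace("+", "_") if who else "unknown"
        filename = "/path/to/save/folder/" + artist + "-" + title.replace(" ", "_") + ".jpg"
        yield {'filename': filename, 'link': link}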
Thanks
Update
At 0535 UTC I ran scrapy crawl metart and got the following response:
2018-06-23 11:02:54 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: wallpaper_scraper)
2018-06-23 11:02:54 [scrapy.utils.log] INFO: Versions: lxml 4.2.1.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.4.0, w3lib 1.19.0, Twisted 17.9.0, Python 2.7.15 (default, May 1 2018, 05:55:50) - [GCC 7.3.0], pyOpenSSL 17.5.0 (OpenSSL 1.1.0h 27 Mar 2018), cryptography 2.2.2, Platform Linux-4.16.0-2-amd64-x86_64-with-debian-buster-sid
2018-06-23 11:02:54 [scrapy.crawler] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'wallpaper_scraper.spiders', 'SPIDER_MODULES': ['wallpaper_scraper.spiders'], 'ROBOTSTXT_OBEY': True, 'USER_AGENT': 'cchowgule for a wallpaper slideshow', 'BOT_NAME': 'wallpaper_scraper'}
2018-06-23 11:02:54 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats']
2018-06-23 11:02:54 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-06-23 11:02:54 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-06-23 11:02:54 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-06-23 11:02:54 [scrapy.core.engine] INFO: Spider opened
2018-06-23 11:02:54 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-06-23 11:02:54 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-06-23 11:02:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.metmuseum.org/robots.txt> (referer: None)
2018-06-23 11:02:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.metmuseum.org/art/artwork-of-the-day?rss=1> (referer: None)
2018-06-23 11:02:56 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.metmuseum.org/art/artwork-of-the-day?rss=1> (referer: None)
Traceback (most recent call last):
File "/home/cchowgule/.local/lib/python2.7/site-packages/twisted/internet/defer.py", line 653, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "/home/cchowgule/WD/pyenvscrapy/wallpaper_scraper/wallpaper_scraper/spiders/metart.py", line 9, in parse
title = response.xpath('//item/title/text()').extract_first().replace(" ", "_")
AttributeError: 'NoneType' object has no attribute 'replace'
2018-06-23 11:02:57 [scrapy.core.engine] INFO: Closing spider (finished)
2018-06-23 11:02:57 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 646,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 1725,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2018, 6, 23, 5, 32, 57, 65104),
'log_count/DEBUG': 3,
'log_count/ERROR': 1,
'log_count/INFO': 7,
'memusage/max': 52793344,
'memusage/startup': 52793344,
'response_received_count': 2,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'spider_exceptions/AttributeError': 1,
'start_time': datetime.datetime(2018, 6, 23, 5, 32, 54, 464466)}
2018-06-23 11:02:57 [scrapy.core.engine] INFO: Spider closed (finished)
I ran scrapy crawl metart 5 times with the same result.
Then I opened a browser, went to https://metmuseum.org/art/artwork-of-the-day?rss=1 and ran scrapy crawl metart again. This time it worked correctly... I just don't get it.
Related
I'm trying to fetch the redirected URL from a URL using Scrapy.
The response status changes from 302 to 200, but the URL still isn't changing.
from scrapy import Spider
from scrapy.crawler import CrawlerProcess


class MySpider(Spider):
    name = 'test'
    start_urls = ['https://news.google.com/rss/articles/CBMilwFodHRwczovL3d3dy5waW5rdmlsbGEuY29tL2VudGVydGFpbm1lbnQvYnRzLWppbi1zaGFyZXMtYmVoaW5kLXRoZS1zY2VuZXMtb2YtdGhlLWFzdHJvbmF1dC1zdGFnZS13aXRoLWNvbGRwbGF5LWJhbmQtcGVyZm9ybXMtb24tc25sLXdpdGgtd29vdHRlby0xMjA4MjQx0gGbAWh0dHBzOi8vd3d3LnBpbmt2aWxsYS5jb20vZW50ZXJ0YWlubWVudC9idHMtamluLXNoYXJlcy1iZWhpbmQtdGhlLXNjZW5lcy1vZi10aGUtYXN0cm9uYXV0LXN0YWdlLXdpdGgtY29sZHBsYXktYmFuZC1wZXJmb3Jtcy1vbi1zbmwtd2l0aC13b290dGVvLTEyMDgyNDE_YW1w']

    def parse(self, response):
        yield {
            'url': response.url,
        }


process = CrawlerProcess(settings={
    "FEEDS": {
        "items.json": {
            "format": "json",
            "overwrite": True
        }},
    'ROBOTSTXT_OBEY': False,
    'FEED_EXPORT_ENCODING': 'utf-8',
    'REDIRECT_ENABLED': True,
    'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7'
})
process.crawl(MySpider)
process.start()
Console Output
2023-02-08 01:44:25 [scrapy.utils.log] INFO: Scrapy 2.8.0 started (bot: scrapybot)
2023-02-08 01:44:25 [scrapy.utils.log] INFO: Versions: lxml 4.9.2.0, libxml2 2.9.12, cssselect 1.2.0, parsel 1.7.0, w3lib 2.1.1, Twisted 22.10.0, Python 3.8.10 (tags/v3.8.10:3d8993a, May 3 2021, 11:48:03) [MSC v.1928 64 bit (AMD64)], pyOpenSSL 23.0.0 (OpenSSL 3.0.8 7 Feb 2023), cryptography 39.0.1, Platform Windows-10-10.0.22621-SP0
2023-02-08 01:44:25 [scrapy.crawler] INFO: Overridden settings:
{'FEED_EXPORT_ENCODING': 'utf-8', 'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7'}
2023-02-08 01:44:25 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2023-02-08 01:44:25 [scrapy.extensions.telnet] INFO: Telnet Password: 013f29d178b8cbb6
2023-02-08 01:44:25 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats']
2023-02-08 01:44:26 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2023-02-08 01:44:26 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2023-02-08 01:44:26 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2023-02-08 01:44:26 [scrapy.core.engine] INFO: Spider opened
2023-02-08 01:44:26 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2023-02-08 01:44:26 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2023-02-08 01:44:26 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://news.google.com/rss/articles/CBMilwFodHRwczovL3d3dy5waW5rdmlsbGEuY29tL2VudGVydGFpbm1lbnQvYnRzLWppbi1zaGFyZXMtYmVoaW5kLXRoZS1zY2VuZXMtb2YtdGhlLWFzdHJvbmF1dC1zdGFnZS13aXRoLWNvbGRwbGF5LWJhbmQtcGVyZm9ybXMtb24tc25sLXdpdGgtd29vdHRlby0xMjA4MjQx0gGbAWh0dHBzOi8vd3d3LnBpbmt2aWxsYS5jb20vZW50ZXJ0YWlubWVudC9idHMtamluLXNoYXJlcy1iZWhpbmQtdGhlLXNjZW5lcy1vZi10aGUtYXN0cm9uYXV0LXN0YWdlLXdpdGgtY29sZHBsYXktYmFuZC1wZXJmb3Jtcy1vbi1zbmwtd2l0aC13b290dGVvLTEyMDgyNDE_YW1w?hl=en-IN&gl=IN&ceid=IN:en> from <GET https://news.google.com/rss/articles/CBMilwFodHRwczovL3d3dy5waW5rdmlsbGEuY29tL2VudGVydGFpbm1lbnQvYnRzLWppbi1zaGFyZXMtYmVoaW5kLXRoZS1zY2VuZXMtb2YtdGhlLWFzdHJvbmF1dC1zdGFnZS13aXRoLWNvbGRwbGF5LWJhbmQtcGVyZm9ybXMtb24tc25sLXdpdGgtd29vdHRlby0xMjA4MjQx0gGbAWh0dHBzOi8vd3d3LnBpbmt2aWxsYS5jb20vZW50ZXJ0YWlubWVudC9idHMtamluLXNoYXJlcy1iZWhpbmQtdGhlLXNjZW5lcy1vZi10aGUtYXN0cm9uYXV0LXN0YWdlLXdpdGgtY29sZHBsYXktYmFuZC1wZXJmb3Jtcy1vbi1zbmwtd2l0aC13b290dGVvLTEyMDgyNDE_YW1w>
2023-02-08 01:44:27 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://news.google.com/rss/articles/CBMilwFodHRwczovL3d3dy5waW5rdmlsbGEuY29tL2VudGVydGFpbm1lbnQvYnRzLWppbi1zaGFyZXMtYmVoaW5kLXRoZS1zY2VuZXMtb2YtdGhlLWFzdHJvbmF1dC1zdGFnZS13aXRoLWNvbGRwbGF5LWJhbmQtcGVyZm9ybXMtb24tc25sLXdpdGgtd29vdHRlby0xMjA4MjQx0gGbAWh0dHBzOi8vd3d3LnBpbmt2aWxsYS5jb20vZW50ZXJ0YWlubWVudC9idHMtamluLXNoYXJlcy1iZWhpbmQtdGhlLXNjZW5lcy1vZi10aGUtYXN0cm9uYXV0LXN0YWdlLXdpdGgtY29sZHBsYXktYmFuZC1wZXJmb3Jtcy1vbi1zbmwtd2l0aC13b290dGVvLTEyMDgyNDE_YW1w?hl=en-IN&gl=IN&ceid=IN:en> (referer: None)
2023-02-08 01:44:27 [scrapy.core.scraper] DEBUG: Scraped from <200 https://news.google.com/rss/articles/CBMilwFodHRwczovL3d3dy5waW5rdmlsbGEuY29tL2VudGVydGFpbm1lbnQvYnRzLWppbi1zaGFyZXMtYmVoaW5kLXRoZS1zY2VuZXMtb2YtdGhlLWFzdHJvbmF1dC1zdGFnZS13aXRoLWNvbGRwbGF5LWJhbmQtcGVyZm9ybXMtb24tc25sLXdpdGgtd29vdHRlby0xMjA4MjQx0gGbAWh0dHBzOi8vd3d3LnBpbmt2aWxsYS5jb20vZW50ZXJ0YWlubWVudC9idHMtamluLXNoYXJlcy1iZWhpbmQtdGhlLXNjZW5lcy1vZi10aGUtYXN0cm9uYXV0LXN0YWdlLXdpdGgtY29sZHBsYXktYmFuZC1wZXJmb3Jtcy1vbi1zbmwtd2l0aC13b290dGVvLTEyMDgyNDE_YW1w?hl=en-IN&gl=IN&ceid=IN:en>
{'url': 'https://news.google.com/rss/articles/CBMilwFodHRwczovL3d3dy5waW5rdmlsbGEuY29tL2VudGVydGFpbm1lbnQvYnRzLWppbi1zaGFyZXMtYmVoaW5kLXRoZS1zY2VuZXMtb2YtdGhlLWFzdHJvbmF1dC1zdGFnZS13aXRoLWNvbGRwbGF5LWJhbmQtcGVyZm9ybXMtb24tc25sLXdpdGgtd29vdHRlby0xMjA4MjQx0gGbAWh0dHBzOi8vd3d3LnBpbmt2aWxsYS5jb20vZW50ZXJ0YWlubWVudC9idHMtamluLXNoYXJlcy1iZWhpbmQtdGhlLXNjZW5lcy1vZi10aGUtYXN0cm9uYXV0LXN0YWdlLXdpdGgtY29sZHBsYXktYmFuZC1wZXJmb3Jtcy1vbi1zbmwtd2l0aC13b290dGVvLTEyMDgyNDE_YW1w?hl=en-IN&gl=IN&ceid=IN:en'}
2023-02-08 01:44:27 [scrapy.core.engine] INFO: Closing spider (finished)
2023-02-08 01:44:27 [scrapy.extensions.feedexport] INFO: Stored json feed (1 items) in: items.json
2023-02-08 01:44:27 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1561,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 104182,
'downloader/response_count': 2,
'downloader/response_status_count/200': 1,
'downloader/response_status_count/302': 1,
'elapsed_time_seconds': 1.162381,
'feedexport/success_count/FileFeedStorage': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2023, 2, 7, 20, 14, 27, 736096),
'httpcompression/response_bytes': 302666,
'httpcompression/response_count': 1,
'item_scraped_count': 1,
'log_count/DEBUG': 4,
'log_count/INFO': 11,
'response_received_count': 1,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2023, 2, 7, 20, 14, 26, 573715)}
2023-02-08 01:44:27 [scrapy.core.engine] INFO: Spider closed (finished)
items.json
[
{"url": "https://news.google.com/rss/articles/CBMilwFodHRwczovL3d3dy5waW5rdmlsbGEuY29tL2VudGVydGFpbm1lbnQvYnRzLWppbi1zaGFyZXMtYmVoaW5kLXRoZS1zY2VuZXMtb2YtdGhlLWFzdHJvbmF1dC1zdGFnZS13aXRoLWNvbGRwbGF5LWJhbmQtcGVyZm9ybXMtb24tc25sLXdpdGgtd29vdHRlby0xMjA4MjQx0gGbAWh0dHBzOi8vd3d3LnBpbmt2aWxsYS5jb20vZW50ZXJ0YWlubWVudC9idHMtamluLXNoYXJlcy1iZWhpbmQtdGhlLXNjZW5lcy1vZi10aGUtYXN0cm9uYXV0LXN0YWdlLXdpdGgtY29sZHBsYXktYmFuZC1wZXJmb3Jtcy1vbi1zbmwtd2l0aC13b290dGVvLTEyMDgyNDE_YW1w?hl=en-IN&gl=IN&ceid=IN:en"}
]
I expect the URL to be https://www.pinkvilla.com/entertainment/bts-jin-shares-behind-the-scenes-of-the-astronaut-stage-with-coldplay-band-performs-on-snl-with-wootteo-1208241
I've tried setting params such as dont_redirect, handle_httpstatus_list, etc., but nothing has worked out.
What am I missing?
Any guidance would be helpful
What you're missing is that the 302 redirect you see in your logs is not redirecting to the page you are expecting. The redirect in your logs simply takes you from ...news.google.com/rss/articles/CBM...yNDE_YW1w to ...news.google.com/rss/articles/CBM...yNDE_YW1w?hl=en-IN&gl=IN&ceid=IN:en.
The URL of the page you are expecting it to redirect to can actually be found in the HTML of the page it is redirected to. In fact it is the only link on that page, alongside a whole bunch of JavaScript that performs, automatically in your browser, the redirect you are expecting.
For example:
from scrapy import Spider
from scrapy.crawler import CrawlerProcess


class MySpider(Spider):
    name = 'test'
    start_urls = ['https://news.google.com/rss/articles/CBMilwFodHRwczovL3d3dy5waW5rdmlsbGEuY29tL2VudGVydGFpbm1lbnQvYnRzLWppbi1zaGFyZXMtYmVoaW5kLXRoZS1zY2VuZXMtb2YtdGhlLWFzdHJvbmF1dC1zdGFnZS13aXRoLWNvbGRwbGF5LWJhbmQtcGVyZm9ybXMtb24tc25sLXdpdGgtd29vdHRlby0xMjA4MjQx0gGbAWh0dHBzOi8vd3d3LnBpbmt2aWxsYS5jb20vZW50ZXJ0YWlubWVudC9idHMtamluLXNoYXJlcy1iZWhpbmQtdGhlLXNjZW5lcy1vZi10aGUtYXN0cm9uYXV0LXN0YWdlLXdpdGgtY29sZHBsYXktYmFuZC1wZXJmb3Jtcy1vbi1zbmwtd2l0aC13b290dGVvLTEyMDgyNDE_YW1w']

    def parse(self, response):
        m = response.xpath("//a/@href").get()  # grab the href for the only link on the page
        yield {"links": m}
OUTPUT:
2023-02-07 18:39:55 [scrapy.core.scraper] DEBUG: Scraped from <200 https://news.google.com/rss/articles/CBMilwFodHRwczovL3d3dy5waW5rdmlsbGEuY29tL2VudGVydGFpbm1lbnQvYnRzLWppbi1zaGFyZXMtYmVoaW5kLXRoZS1zY2VuZXMt
b2YtdGhlLWFzdHJvbmF1dC1zdGFnZS13aXRoLWNvbGRwbGF5LWJhbmQtcGVyZm9ybXMtb24tc25sLXdpdGgtd29vdHRlby0xMjA4MjQx0gGbAWh0dHBzOi8vd3d3LnBpbmt2aWxsYS5jb20vZW50ZXJ0YWlubWVudC9idHMtamluLXNoYXJlcy1iZWhpbmQtdGhlLXNjZW5lcy1v
Zi10aGUtYXN0cm9uYXV0LXN0YWdlLXdpdGgtY29sZHBsYXktYmFuZC1wZXJmb3Jtcy1vbi1zbmwtd2l0aC13b290dGVvLTEyMDgyNDE_YW1w?hl=en-US&gl=US&ceid=US:en>
{'links': 'https://www.pinkvilla.com/entertainment/bts-jin-shares-behind-the-scenes-of-the-astronaut-stage-with-coldplay-band-performs-on-snl-with-wootteo-1208241'}
2023-02-07 18:39:55 [scrapy.core.engine] INFO: Closing spider (finished)
2023-02-07 18:39:55 [scrapy.extensions.feedexport] INFO: Stored json feed (1 items) in: items.json
2023-02-07 18:39:55 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
Since Scrapy doesn't execute any of the JavaScript in the response to the news.google.com/rss request, the redirect that happens in your browser never gets triggered by Scrapy.
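If the end goal is the article itself rather than just its URL, one possible follow-up (a minimal sketch, not tested against Google News' current markup) is to replace the parse method above with one that requests the extracted href:

    def parse(self, response):
        # the interstitial page's only link points at the real article
        target = response.xpath("//a/@href").get()
        if target:
            yield response.follow(target, callback=self.parse_article)

    def parse_article(self, response):
        # response.url is now the pinkvilla.com address the browser ends up on
        yield {"url": response.url}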
I have a small scrapy project that I'm trying to work on and whilst I've gotten scrapy to work, I'm a bit stumped by the storage options.
So I have Ubuntu 20 headless with the latest Python installed, and Scrapy is installed; everything is running nicely.
My script is thus:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "github"

    def start_requests(self):
        urls = [
            'https://osint.digitalside.it/Threat-Intel/lists/latesturls.txt',
            'https://raw.githubusercontent.com/davidonzo/Threat-Intel/master/lists/latesthashes.txt',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = f'github-{page}.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log(f'Saved file {filename}')
I execute the script like this:
sudo scrapy crawl github -o results.json
and get this result:
barsa#ubuntu20~/scrape/scrape/spiders$ sudo scrapy crawl github -o results.json
2020-10-14 09:36:44 [scrapy.utils.log] INFO: Scrapy 2.4.0 started (bot: scrape)
2020-10-14 09:36:44 [scrapy.utils.log] INFO: Versions: lxml 4.5.2.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 18.9.0, Python 3.8.5 (default, Jul 28 2020, 12:59:40) - [GCC 9.3.0], pyOpenSSL 19.0.0 (OpenSSL 1.1.1f 31 Mar 2020), cryptography 2.8, Platform Linux-5.4.0-48-generic-x86_64-with-glibc2.29
2020-10-14 09:36:44 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2020-10-14 09:36:44 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'scrape',
'NEWSPIDER_MODULE': 'scrape.spiders',
'ROBOTSTXT_OBEY': True,
'SPIDER_MODULES': ['scrape.spiders']}
2020-10-14 09:36:44 [scrapy.extensions.telnet] INFO: Telnet Password: xxxxx
2020-10-14 09:36:44 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats']
2020-10-14 09:36:44 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-10-14 09:36:44 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-10-14 09:36:44 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-10-14 09:36:44 [scrapy.core.engine] INFO: Spider opened
2020-10-14 09:36:44 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-10-14 09:36:44 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-10-14 09:36:44 [scrapy.core.engine] DEBUG: Crawled (400) <GET https://raw.githubusercontent.com/robots.txt> (referer: None)
2020-10-14 09:36:44 [protego] DEBUG: Rule at line 1 without any user agent to enforce it on.
2020-10-14 09:36:44 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://osint.digitalside.it/robots.txt> (referer: None)
2020-10-14 09:36:44 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://raw.githubusercontent.com/davidonzo/Threat-Intel/master/lists/latesthashes.txt> (referer: None)
2020-10-14 09:36:44 [github] DEBUG: Saved file github-lists.html
2020-10-14 09:36:44 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://osint.digitalside.it/Threat-Intel/lists/latesturls.txt> (referer: None)
2020-10-14 09:36:45 [github] DEBUG: Saved file github-lists.html
2020-10-14 09:36:45 [scrapy.core.engine] INFO: Closing spider (finished)
2020-10-14 09:36:45 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 995,
'downloader/request_count': 4,
'downloader/request_method_count/GET': 4,
'downloader/response_bytes': 1444016,
'downloader/response_count': 4,
'downloader/response_status_count/200': 2,
'downloader/response_status_count/400': 1,
'downloader/response_status_count/404': 1,
'elapsed_time_seconds': 0.727767,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 10, 14, 9, 36, 45, 27397),
'log_count/DEBUG': 7,
'log_count/INFO': 10,
'memusage/max': 52645888,
'memusage/startup': 52645888,
'response_received_count': 4,
'robotstxt/request_count': 2,
'robotstxt/response_count': 2,
'robotstxt/response_status_count/400': 1,
'robotstxt/response_status_count/404': 1,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2020, 10, 14, 9, 36, 44, 299630)}
2020-10-14 09:36:45 [scrapy.core.engine] INFO: Spider closed (finished)
Now I check the json file and it's empty, but the github-lists.html contains both lists with no separator between them, so it looks like one large long list.
What I don't understand is how I can do one of the following:
1. Split the lists into their own separate files (github-list1.html and github-list2.html)
2. Add a separator into github-lists.html, so I can run some logic to extract this into two separate CSV files perhaps
I can't find any examples on the Scrapy site that show how the file storage works.
filename = f'github-{page}.html'
with open(filename, 'wb') as f:
    f.write(response.body)
self.log(f'Saved file {filename}')
What would be the best way to tackle this? As I see it, the function above only seems to deal with a single file at a time... so I was thinking maybe I need to use an item pipeline?
Many thanks
scrapy crawl github -o results.json
The -o parameter tells Scrapy to use the feed exports (docs); however, your spider never yields any items to the engine, so nothing is exported. That's why your JSON file is empty.
Just so you can see it working, you can add the following line at the bottom of your parse method, execute the spider again (using -o results.json), and you will see the URLs in the JSON file.
def parse(self, response):
    ...
    yield {'url': response.url}  # Add this
def start_requests(self):
    urls = [
        'https://osint.digitalside.it/Threat-Intel/lists/latesturls.txt',
        'https://raw.githubusercontent.com/davidonzo/Threat-Intel/master/lists/latesthashes.txt',
    ]
    ...

def parse(self, response):
    page = response.url.split("/")[-2]
    filename = f'github-{page}.html'
    with open(filename, 'wb') as f:
        f.write(response.body)
Here your code splits the response URL into a list and gets the last-but-one element of that list to name the file. If you check, for both URLs that element is "lists" (a coincidence), so both times the parse method is called it references the same file, github-lists.html (where "lists" came from the page variable).
You can use any logic you want here to name your files.
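For example, a minimal tweak (just a sketch; the resulting filenames are only a suggestion) is to use the last path segment, which is unique for these two URLs:

def parse(self, response):
    # 'latesturls.txt' and 'latesthashes.txt' -- unique per URL
    page = response.url.split("/")[-1]
    filename = f'github-{page}'
    with open(filename, 'wb') as f:
        f.write(response.body)
    self.log(f'Saved file {filename}')
    yield {'url': response.url}  # also keeps the -o feed export non-empty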
I suggest you keep reading the Scrapy tutorial; you will understand better how you can leverage the framework to extract and store the data. Especially these three sections:
https://docs.scrapy.org/en/latest/intro/tutorial.html#extracting-data
https://docs.scrapy.org/en/latest/intro/tutorial.html#extracting-data-in-our-spider
https://docs.scrapy.org/en/latest/intro/tutorial.html#storing-the-scraped-data
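If the end goal really is two separate CSV files, another option (a rough sketch, assuming the .txt responses are plain line-separated lists) is to yield one item per line with a field recording which source it came from, and let the feed export do the writing:

def parse(self, response):
    source = response.url.split("/")[-1]  # e.g. 'latesturls.txt' or 'latesthashes.txt'
    for line in response.text.splitlines():
        line = line.strip()
        if line:
            yield {'source': source, 'value': line}

Running scrapy crawl github -o results.csv then produces a single CSV that can be split on the source column afterwards.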
Hello and thanks in advance for any assistance with this issue I'm having. I have never posted for coding help before and I'm very new to programming. Self taught old guy who is trying to learn something new and maybe build something to save the world (Or just build something. :))
I have Scrapy fired up, and when I run my terminal command scrapy crawl coops I always get the DEBUG: Crawled (200) line and don't see any "Found details:" entries. I'm able to run scrapy shell "http://coopdirectory.org/directory.htm" and get results manually with the shell. When I try to yield to a .jl or .js file, those come out empty as well. (I have made this work great with the Scrapy tutorial quotes.) Below is my code, and below that are the results. Any direction on this would be great.
import scrapy


class CoopsSpider(scrapy.Spider):
    name = "coops"
    start_urls = [
        'http://coopdirectory.org/directory.htm',
    ]

    def parse(self, response):
        for coop in response.css('div.coop'):
            yield {
                'name': coop.css('h3 a::text').getall(),
                'address': coop.css('.address::text').getall(),
                'website': coop.css('.web-address::text').getall(),
                'phone': coop.css('.phone::text').getall(),
                'org_type': coop.css('.org-type::text').getall(),
                'inactive': coop.css('.inactive-alert::text').getall(),
                'notes': coop.css('.note::text').getall(),
            }
RESULTS
PS C:\Users\scott\Documents\WebScrapeProjects\coops\coops> scrapy crawl coops
2020-04-28 19:08:31 [scrapy.utils.log] INFO: Scrapy 2.1.0 started (bot: coops)
2020-04-28 19:08:31 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.9, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 20.3.0, Python 3.7.6 (default, Jan 8 2020, 20:23:39) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1d 10 Sep 2019), cryptography 2.8, Platform Windows-10-10.0.18362-SP0
2020-04-28 19:08:31 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2020-04-28 19:08:31 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'coops',
'COOKIES_ENABLED': False,
'DOWNLOAD_DELAY': 7,
'NEWSPIDER_MODULE': 'coops.spiders',
'SPIDER_MODULES': ['coops.spiders'],
'USER_AGENT': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) '
'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95 '
'Safari/537.36'}
2020-04-28 19:08:31 [scrapy.extensions.telnet] INFO: Telnet Password: 3f2a17b73c1f55b3
2020-04-28 19:08:31 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2020-04-28 19:08:32 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-04-28 19:08:32 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-04-28 19:08:32 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-04-28 19:08:32 [scrapy.core.engine] INFO: Spider opened
2020-04-28 19:08:32 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-04-28 19:08:32 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-04-28 19:08:32 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://coopdirectory.org/directory.htm> (referer: None)
2020-04-28 19:08:32 [scrapy.core.engine] INFO: Closing spider (finished)
2020-04-28 19:08:32 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 316,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 303338,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'elapsed_time_seconds': 0.808906,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 4, 29, 0, 8, 32, 936826),
'log_count/DEBUG': 1,
'log_count/INFO': 10,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2020, 4, 29, 0, 8, 32, 127920)}
2020-04-28 19:08:32 [scrapy.core.engine] INFO: Spider closed (finished)
Your CSS selector ('div.coop') is not selecting anything, so nothing can be yielded inside your loop. You can test this by opening a Scrapy shell (scrapy shell "http://coopdirectory.org/directory.htm") and then typing response.css('div.coop'). You will see that an empty selection ([]) is returned.
It is not clear what you are trying to achieve exactly, but here is an example that extracts all of the information that you are trying to extract and just returns it:
import scrapy


class CoopsSpider(scrapy.Spider):
    name = "coops"
    start_urls = ['http://coopdirectory.org/directory.htm']

    def parse(self, response):
        yield {
            'name': response.css('h3 a::text').getall(),
            'address': response.css('.address::text').getall(),
            'website': response.css('.web-address a::text').getall(),
            'phone': response.css('.phone::text').getall(),
            'org_type': response.css('.org-type::text').getall(),
            'inactive': response.css('.inactive-alert::text').getall(),
            'notes': response.css('.note::text').getall(),
        }
Just a hint: you might have a hard time scraping this website because the individual co-ops are not inside individual <div> containers. This makes it somewhat difficult to separate the individual pieces of information. Maybe you can find another website that contains the information you need and is better suited for scraping, or skip scraping and use open data sets right away.
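If you do want to attempt it anyway, one possible pattern (purely a speculative sketch: it assumes the h3 headings and detail elements such as .address are flat siblings under a common parent, which may not match the real page) is to slice the flat markup between consecutive headings:

import scrapy


class CoopsSpider(scrapy.Spider):
    name = "coops"
    start_urls = ['http://coopdirectory.org/directory.htm']

    def parse(self, response):
        for h3 in response.xpath('//h3[a]'):
            item = {
                'name': h3.xpath('a/text()').get(),
                'address': [],
                'phone': [],
            }
            # walk the elements that follow this heading until the next heading
            for sib in h3.xpath('following-sibling::*'):
                if sib.xpath('self::h3'):
                    break
                item['address'] += sib.css('.address::text').getall()
                item['phone'] += sib.css('.phone::text').getall()
            yield item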
I am trying to set up a Scrapy spider inside Django app, that reads info from a page and posts it in Django's SQLite database using DjangoItems.
Right now it seems that the scraper itself is working; however, it is not adding anything to the database. My guess is that this happens because Scrapy is not enabling any item pipelines. Here is the log:
2019-10-05 15:23:07 [scrapy.utils.log] INFO: Scrapy 1.7.3 started (bot: scrapybot)
2019-10-05 15:23:07 [scrapy.utils.log] INFO: Versions: lxml 4.4.1.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 19.7.0, Python 3.7.4 (tags/v3.7.4:e09359112e, Jul 8 2019, 19:29:22) [MSC v.1916 32 bit (Intel)], pyOpenSSL 19.0.0 (OpenSSL 1.1.1c 28 May 2019), cryptography 2.7, Platform Windows-10-10.0.18362-SP0
2019-10-05 15:23:07 [scrapy.crawler] INFO: Overridden settings: {}
2019-10-05 15:23:07 [scrapy.extensions.telnet] INFO: Telnet Password: 6e614667b3cf5a1a
2019-10-05 15:23:07 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2019-10-05 15:23:07 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-10-05 15:23:07 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-10-05 15:23:07 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-10-05 15:23:07 [scrapy.core.engine] INFO: Spider opened
2019-10-05 15:23:07 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-10-05 15:23:08 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2019-10-05 15:23:08 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.barbora.lv/produkti/biezpiena-sierins-karums-vanilas-45-g> (referer: None)
2019-10-05 15:23:08 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.barbora.lv/produkti/biezpiena-sierins-karums-vanilas-45-g>
{'product_title': ['Biezpiena sieriņš KĀRUMS vaniļas 45g']}
2019-10-05 15:23:08 [scrapy.core.engine] INFO: Closing spider (finished)
2019-10-05 15:23:08 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 259,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 15402,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'elapsed_time_seconds': 0.418066,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2019, 10, 5, 12, 23, 8, 417204),
'item_scraped_count': 1,
'log_count/DEBUG': 2,
'log_count/INFO': 10,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2019, 10, 5, 12, 23, 7, 999138)}
2019-10-05 15:23:08 [scrapy.core.engine] INFO: Spider closed (finished)
As far as I can see, the scraper returns the expected value, {'product_title': ['Biezpiena sieriņš KĀRUMS vaniļas 45g']}, but it seems like it is not passed into the pipeline because no pipelines are loaded.
I have spent several hours looking at different tutorials and trying to fix the issue, but have had no luck so far. Is there anything else I might have forgotten when setting up the scraper? Maybe it has something to do with the file structure of the project.
Here are relevant files.
items.py
from scrapy_djangoitem import DjangoItem
from product_scraper.models import Scrapelog


class ScrapelogItem(DjangoItem):
    django_model = Scrapelog
pipelines.py
class ProductInfoPipeline(object):
    def process_item(self, item, spider):
        item.save()
        yield item
settings.py
BOT_NAME = 'scraper'
SPIDER_MODULES = ['scraper.spiders']
NEWSPIDER_MODULE = 'scraper.spiders'
ROBOTSTXT_OBEY = True
ITEM_PIPELINES = {
    'scraper.pipelines.ProductInfoPipeline': 300,
}
spider product_info.py:
import scrapy
from product_scraper.scraper.scraper.items import ScrapelogItem


class ProductInfoSpider(scrapy.Spider):
    name = 'product_info'
    allowed_domains = ['www.barbora.lv']
    start_urls = ['https://www.barbora.lv/produkti/biezpiena-sierins-karums-vanilas-45-g']

    def parse(self, response):
        item = ScrapelogItem()
        item['product_title'] = response.xpath('//h1[@itemprop="name"]/text()').extract()
        return ScrapelogItem(product_title=item["product_title"])
Project file structure:
After further tinkering and research I found out that my settings file was not properly configured (however, that was only part of the problem). I added these lines to the code, based on other resources (they link the Django project settings with the scraper's settings):
import os
import sys

import django

sys.path.append(os.path.join(os.path.dirname(os.path.dirname(os.path.abspath(__file__))), ".."))
os.environ['DJANGO_SETTINGS_MODULE'] = 'broccoli.settings'
django.setup()
After that, the spider did not run, but this time it at least gave an error message about not finding the Django settings module. I don't remember the exact syntax, but it was something like: "broccoli.settings MODULE NOT FOUND".
After some experiments I found out that moving the scraper directory "scraper" from inside the app "Product_Scraper" to the same level as the other apps dealt with this issue, and everything worked.
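For anyone hitting the same problem, the scraper's settings.py might end up looking roughly like this (a sketch assembled from the snippets above; placing the django.setup() lines in settings.py is an assumption, since the answer only says they were added "to the code"):

import os
import sys

import django

# make the Django project importable and initialise it before the spiders load
sys.path.append(os.path.join(os.path.dirname(os.path.dirname(os.path.abspath(__file__))), ".."))
os.environ['DJANGO_SETTINGS_MODULE'] = 'broccoli.settings'
django.setup()

BOT_NAME = 'scraper'

SPIDER_MODULES = ['scraper.spiders']
NEWSPIDER_MODULE = 'scraper.spiders'

ROBOTSTXT_OBEY = True

ITEM_PIPELINES = {
    'scraper.pipelines.ProductInfoPipeline': 300,
}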
Using Python 3.7.2 on Windows 10, I'm struggling to get Scrapy v1.5.1 to download some PDF files. I followed the docs but seem to be missing something. Scrapy gets me the desired PDF URLs but downloads nothing. Also, no errors are thrown (at least none that I can see).
The relevant code is:
scrapy.cfg:
[settings]
default = pranger.settings
[deploy]
project = pranger
settings.py:
BOT_NAME = 'pranger'
SPIDER_MODULES = ['pranger.spiders']
NEWSPIDER_MODULE = 'pranger.spiders'
ROBOTSTXT_OBEY = True
ITEM_PIPELINES = {
    'pranger.pipelines.PrangerPipeline': 300,
    'scrapy.pipelines.files.FilesPipeline': 1,
}
FILES_STORE = r'C:\pranger_downloaded'
FILES_URLS_FIELD = 'PDF_urls'
FILES_RESULT_FIELD = 'processed_PDFs'
pranger_spider.py:
import scrapy


class IndexSpider(scrapy.Spider):
    name = "index"
    url_liste = []

    def start_requests(self):
        urls = [
            'http://verbraucherinfo.ua-bw.de/lmk.asp?ref=3',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        for menupunkt in response.css('div#aufklappmenue'):
            yield {
                'file_urls': menupunkt.css('div.aussen a.innen::attr(href)').getall()
            }
items.py:
import scrapy


class PrangerItem(scrapy.Item):
    file_urls = scrapy.Field()
    files = scrapy.Field()
All other files are as they were created by the scrapy startproject command.
The output of scrapy crawl index is:
(pranger) C:\pranger>scrapy crawl index
2019-02-20 15:45:18 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: pranger)
2019-02-20 15:45:18 [scrapy.utils.log] INFO: Versions: lxml 4.3.1.0, libxml2 2.9.9, cssselect 1.0.3, parsel 1.5.1, w3lib 1.20.0, Twisted 18.9.0, Python 3.7.2 (default, Feb 11 2019, 14:11:50) [MSC v.1915 64 bit (AMD64)], pyOpenSSL 19.0.0 (OpenSSL 1.1.1a 20 Nov 2018), cryptography 2.5, Platform Windows-10-10.0.17763-SP0
2019-02-20 15:45:18 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'pranger', 'NEWSPIDER_MODULE': 'pranger.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['pranger.spiders']}
2019-02-20 15:45:18 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2019-02-20 15:45:18 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-02-20 15:45:18 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-02-20 15:45:18 [scrapy.middleware] INFO: Enabled item pipelines:
['scrapy.pipelines.files.FilesPipeline', 'pranger.pipelines.PrangerPipeline']
2019-02-20 15:45:18 [scrapy.core.engine] INFO: Spider opened
2019-02-20 15:45:18 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-02-20 15:45:18 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2019-02-20 15:45:19 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://verbraucherinfo.ua-bw.de/robots.txt> (referer: None)
2019-02-20 15:45:19 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://verbraucherinfo.ua-bw.de/lmk.asp?ref=3> (referer: None)
2019-02-20 15:45:19 [scrapy.core.scraper] DEBUG: Scraped from <200 http://verbraucherinfo.ua-bw.de/lmk.asp?ref=3>
{'file_urls': ['https://www.lrabb.de/site/LRA-BB-Desktop/get/params_E-428807985/3287025/Ergebnisse_amtlicher_Kontrollen_nach_LFGB_Landkreis_Boeblingen.pdf', <<...and dozens more URLs...>>], 'processed_PDFs': []}
2019-02-20 15:45:19 [scrapy.core.engine] INFO: Closing spider (finished)
2019-02-20 15:45:19 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 469,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 13268,
'downloader/response_count': 2,
'downloader/response_status_count/200': 1,
'downloader/response_status_count/404': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2019, 2, 20, 14, 45, 19, 166646),
'item_scraped_count': 1,
'log_count/DEBUG': 4,
'log_count/INFO': 7,
'response_received_count': 2,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2019, 2, 20, 14, 45, 18, 864509)}
2019-02-20 15:45:19 [scrapy.core.engine] INFO: Spider closed (finished)
Oh BTW I published the code, just in case: https://github.com/R0byn/pranger/tree/5bfa0df92f21cecee18cc618e9a8e7ceea192403
The FILES_URLS_FIELD setting tells the pipeline which field of the item contains the URLs you want to download.
By default this is file_urls, but if you change the setting, you also need to change the field name (key) that you store the URLs in.
So you have two options: either drop the setting and use the default, or rename the field your spider yields to PDF_urls as well.
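For example, the first option (falling back to the defaults so the pipeline reads file_urls, which is what the spider already yields, and writes its results to files) would make settings.py look roughly like this:

BOT_NAME = 'pranger'

SPIDER_MODULES = ['pranger.spiders']
NEWSPIDER_MODULE = 'pranger.spiders'

ROBOTSTXT_OBEY = True

ITEM_PIPELINES = {
    'pranger.pipelines.PrangerPipeline': 300,
    'scrapy.pipelines.files.FilesPipeline': 1,
}

FILES_STORE = r'C:\pranger_downloaded'
# FILES_URLS_FIELD and FILES_RESULT_FIELD removed: the defaults
# ('file_urls' and 'files') now match what the spider yields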