I have a small Scrapy project that I'm working on, and whilst I've gotten Scrapy to work, I'm a bit stumped by the storage options.
I'm on headless Ubuntu 20 with the latest Python and Scrapy installed, and everything is running nicely.
My script is thus:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "github"

    def start_requests(self):
        urls = [
            'https://osint.digitalside.it/Threat-Intel/lists/latesturls.txt',
            'https://raw.githubusercontent.com/davidonzo/Threat-Intel/master/lists/latesthashes.txt',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = f'github-{page}.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log(f'Saved file {filename}')
I execute the script like this :
sudo scrapy crawl github -o results.json
and get this result:
barsa@ubuntu20~/scrape/scrape/spiders$ sudo scrapy crawl github -o results.json
2020-10-14 09:36:44 [scrapy.utils.log] INFO: Scrapy 2.4.0 started (bot: scrape)
2020-10-14 09:36:44 [scrapy.utils.log] INFO: Versions: lxml 4.5.2.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 18.9.0, Python 3.8.5 (default, Jul 28 2020, 12:59:40) - [GCC 9.3.0], pyOpenSSL 19.0.0 (OpenSSL 1.1.1f 31 Mar 2020), cryptography 2.8, Platform Linux-5.4.0-48-generic-x86_64-with-glibc2.29
2020-10-14 09:36:44 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2020-10-14 09:36:44 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'scrape',
'NEWSPIDER_MODULE': 'scrape.spiders',
'ROBOTSTXT_OBEY': True,
'SPIDER_MODULES': ['scrape.spiders']}
2020-10-14 09:36:44 [scrapy.extensions.telnet] INFO: Telnet Password: xxxxx
2020-10-14 09:36:44 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats']
2020-10-14 09:36:44 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-10-14 09:36:44 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-10-14 09:36:44 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-10-14 09:36:44 [scrapy.core.engine] INFO: Spider opened
2020-10-14 09:36:44 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-10-14 09:36:44 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-10-14 09:36:44 [scrapy.core.engine] DEBUG: Crawled (400) <GET https://raw.githubusercontent.com/robots.txt> (referer: None)
2020-10-14 09:36:44 [protego] DEBUG: Rule at line 1 without any user agent to enforce it on.
2020-10-14 09:36:44 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://osint.digitalside.it/robots.txt> (referer: None)
2020-10-14 09:36:44 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://raw.githubusercontent.com/davidonzo/Threat-Intel/master/lists/latesthashes.txt> (referer: None)
2020-10-14 09:36:44 [github] DEBUG: Saved file github-lists.html
2020-10-14 09:36:44 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://osint.digitalside.it/Threat-Intel/lists/latesturls.txt> (referer: None)
2020-10-14 09:36:45 [github] DEBUG: Saved file github-lists.html
2020-10-14 09:36:45 [scrapy.core.engine] INFO: Closing spider (finished)
2020-10-14 09:36:45 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 995,
'downloader/request_count': 4,
'downloader/request_method_count/GET': 4,
'downloader/response_bytes': 1444016,
'downloader/response_count': 4,
'downloader/response_status_count/200': 2,
'downloader/response_status_count/400': 1,
'downloader/response_status_count/404': 1,
'elapsed_time_seconds': 0.727767,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 10, 14, 9, 36, 45, 27397),
'log_count/DEBUG': 7,
'log_count/INFO': 10,
'memusage/max': 52645888,
'memusage/startup': 52645888,
'response_received_count': 4,
'robotstxt/request_count': 2,
'robotstxt/response_count': 2,
'robotstxt/response_status_count/400': 1,
'robotstxt/response_status_count/404': 1,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2020, 10, 14, 9, 36, 44, 299630)}
2020-10-14 09:36:45 [scrapy.core.engine] INFO: Spider closed (finished)
Now I check the json file and it's empty, but the github-lists.html contains both lists with no separator between them, so it looks like one large long list.
What I don't understand is how I can do one of the following:
Split the lists into their own separate files (github-list1.html and github-list2.html)
Add a separator to github-lists.html, so I can run some logic to extract it into two separate CSV files perhaps
I can't find any examples on the Scrapy site that show how the file storage works:
filename = f'github-{page}.html'
with open(filename, 'wb') as f:
    f.write(response.body)
self.log(f'Saved file {filename}')
What would be the best way to tackle this? Because as I see it, this function above only seems to be dealing with a single file instance... so I was thinking maybe I need to use the pipelines function?
Many thanks
scrapy crawl github -o results.json
The -o option tells Scrapy to use its feed exports. However, your spider never yields any items to the engine, so nothing is exported; that's why your JSON file is empty.
Just so you can see it working, add the following line at the bottom of your parse method and run the spider again (with -o results.json); you will see the URLs in the JSON file.
def parse(self, response):
    ...
    yield {'url': response.url}  # Add this
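To get one exported entry per list entry rather than one per URL, parse could yield an item for every line of the response. The following is only a minimal sketch: it assumes each list is plain text with one entry per line and that blank lines and lines starting with # carry no data (an assumption about the feed format, not something confirmed above). The helper name iter_entries and the item keys are hypothetical.

```python
def iter_entries(text, source_url):
    """Yield one dict per non-empty, non-comment line of a plain-text list."""
    for line in text.splitlines():
        entry = line.strip()
        # Assumption: blank lines and lines starting with '#' carry no data.
        if not entry or entry.startswith("#"):
            continue
        yield {"source": source_url, "entry": entry}


# Inside the spider this could be used as:
#     def parse(self, response):
#         yield from iter_entries(response.text, response.url)

# Made-up sample text, only to show the behaviour of the helper.
sample = "# header comment\nhttp://example.com/a\n\nhttp://example.com/b\n"
items = list(iter_entries(sample, "https://example.com/list.txt"))
print(items)
```

Because every item records its source URL, the exported file can later be split per list without needing a separator.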
def start_requests(self):
    urls = [
        'https://osint.digitalside.it/Threat-Intel/lists/latesturls.txt',
        'https://raw.githubusercontent.com/davidonzo/Threat-Intel/master/lists/latesthashes.txt',
    ]
    ...

def parse(self, response):
    page = response.url.split("/")[-2]
    filename = f'github-{page}.html'
    with open(filename, 'wb') as f:
        f.write(response.body)
Here your code splits the response URL into a list and takes the second-to-last element to name the file. For both URLs that element happens to be "lists", so both calls to parse write to the same file, github-lists.html (where "lists" comes from the page variable). Note that mode 'wb' truncates the file on open, so the second write replaces the first rather than appending to it.
You can use any logic you like here to name your files.
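As one possible naming scheme, the file name could be taken from the last path segment of the URL instead of its parent directory. This is a sketch using only the standard library; filename_for is a hypothetical helper name, and keeping the "github-" prefix is just a choice carried over from the question.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit

urls = [
    "https://osint.digitalside.it/Threat-Intel/lists/latesturls.txt",
    "https://raw.githubusercontent.com/davidonzo/Threat-Intel/master/lists/latesthashes.txt",
]


def filename_for(url):
    """Build a file name from the last path segment of the URL."""
    name = PurePosixPath(urlsplit(url).path).name
    return f"github-{name}"


for url in urls:
    # The original expression picks the parent directory, which is
    # 'lists' for both URLs; the last segment differs between them.
    print(url.split("/")[-2], "->", filename_for(url))
```

In the spider, `filename = filename_for(response.url)` would then give each response its own file.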
I suggest you keep reading the Scrapy tutorial; it will help you understand how to leverage the framework to extract and store the data.
Especially these three sections:
https://docs.scrapy.org/en/latest/intro/tutorial.html#extracting-data
https://docs.scrapy.org/en/latest/intro/tutorial.html#extracting-data-in-our-spider
https://docs.scrapy.org/en/latest/intro/tutorial.html#storing-the-scraped-data
I'm trying to fetch the redirected URL for a URL using Scrapy.
The response status changes from 302 to 200, but the URL still isn't changing.
from scrapy import Spider
from scrapy.crawler import CrawlerProcess


class MySpider(Spider):
    name = 'test'
    start_urls = ['https://news.google.com/rss/articles/CBMilwFodHRwczovL3d3dy5waW5rdmlsbGEuY29tL2VudGVydGFpbm1lbnQvYnRzLWppbi1zaGFyZXMtYmVoaW5kLXRoZS1zY2VuZXMtb2YtdGhlLWFzdHJvbmF1dC1zdGFnZS13aXRoLWNvbGRwbGF5LWJhbmQtcGVyZm9ybXMtb24tc25sLXdpdGgtd29vdHRlby0xMjA4MjQx0gGbAWh0dHBzOi8vd3d3LnBpbmt2aWxsYS5jb20vZW50ZXJ0YWlubWVudC9idHMtamluLXNoYXJlcy1iZWhpbmQtdGhlLXNjZW5lcy1vZi10aGUtYXN0cm9uYXV0LXN0YWdlLXdpdGgtY29sZHBsYXktYmFuZC1wZXJmb3Jtcy1vbi1zbmwtd2l0aC13b290dGVvLTEyMDgyNDE_YW1w']

    def parse(self, response):
        yield {
            'url': response.url,
        }


process = CrawlerProcess(settings={
    "FEEDS": {
        "items.json": {
            "format": "json",
            "overwrite": True
        }},
    'ROBOTSTXT_OBEY': False,
    'FEED_EXPORT_ENCODING': 'utf-8',
    'REDIRECT_ENABLED': True,
    'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7'
})
process.crawl(MySpider)
process.start()
Console Output
2023-02-08 01:44:25 [scrapy.utils.log] INFO: Scrapy 2.8.0 started (bot: scrapybot)
2023-02-08 01:44:25 [scrapy.utils.log] INFO: Versions: lxml 4.9.2.0, libxml2 2.9.12, cssselect 1.2.0, parsel 1.7.0, w3lib 2.1.1, Twisted 22.10.0, Python 3.8.10 (tags/v3.8.10:3d8993a, May 3 2021, 11:48:03) [MSC v.1928 64 bit (AMD64)], pyOpenSSL 23.0.0 (OpenSSL 3.0.8 7 Feb 2023), cryptography 39.0.1, Platform Windows-10-10.0.22621-SP0
2023-02-08 01:44:25 [scrapy.crawler] INFO: Overridden settings:
{'FEED_EXPORT_ENCODING': 'utf-8', 'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7'}
2023-02-08 01:44:25 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2023-02-08 01:44:25 [scrapy.extensions.telnet] INFO: Telnet Password: 013f29d178b8cbb6
2023-02-08 01:44:25 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats']
2023-02-08 01:44:26 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2023-02-08 01:44:26 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2023-02-08 01:44:26 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2023-02-08 01:44:26 [scrapy.core.engine] INFO: Spider opened
2023-02-08 01:44:26 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2023-02-08 01:44:26 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2023-02-08 01:44:26 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://news.google.com/rss/articles/CBMilwFodHRwczovL3d3dy5waW5rdmlsbGEuY29tL2VudGVydGFpbm1lbnQvYnRzLWppbi1zaGFyZXMtYmVoaW5kLXRoZS1zY2VuZXMtb2YtdGhlLWFzdHJvbmF1dC1zdGFnZS13aXRoLWNvbGRwbGF5LWJhbmQtcGVyZm9ybXMtb24tc25sLXdpdGgtd29vdHRlby0xMjA4MjQx0gGbAWh0dHBzOi8vd3d3LnBpbmt2aWxsYS5jb20vZW50ZXJ0YWlubWVudC9idHMtamluLXNoYXJlcy1iZWhpbmQtdGhlLXNjZW5lcy1vZi10aGUtYXN0cm9uYXV0LXN0YWdlLXdpdGgtY29sZHBsYXktYmFuZC1wZXJmb3Jtcy1vbi1zbmwtd2l0aC13b290dGVvLTEyMDgyNDE_YW1w?hl=en-IN&gl=IN&ceid=IN:en> from <GET https://news.google.com/rss/articles/CBMilwFodHRwczovL3d3dy5waW5rdmlsbGEuY29tL2VudGVydGFpbm1lbnQvYnRzLWppbi1zaGFyZXMtYmVoaW5kLXRoZS1zY2VuZXMtb2YtdGhlLWFzdHJvbmF1dC1zdGFnZS13aXRoLWNvbGRwbGF5LWJhbmQtcGVyZm9ybXMtb24tc25sLXdpdGgtd29vdHRlby0xMjA4MjQx0gGbAWh0dHBzOi8vd3d3LnBpbmt2aWxsYS5jb20vZW50ZXJ0YWlubWVudC9idHMtamluLXNoYXJlcy1iZWhpbmQtdGhlLXNjZW5lcy1vZi10aGUtYXN0cm9uYXV0LXN0YWdlLXdpdGgtY29sZHBsYXktYmFuZC1wZXJmb3Jtcy1vbi1zbmwtd2l0aC13b290dGVvLTEyMDgyNDE_YW1w>
2023-02-08 01:44:27 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://news.google.com/rss/articles/CBMilwFodHRwczovL3d3dy5waW5rdmlsbGEuY29tL2VudGVydGFpbm1lbnQvYnRzLWppbi1zaGFyZXMtYmVoaW5kLXRoZS1zY2VuZXMtb2YtdGhlLWFzdHJvbmF1dC1zdGFnZS13aXRoLWNvbGRwbGF5LWJhbmQtcGVyZm9ybXMtb24tc25sLXdpdGgtd29vdHRlby0xMjA4MjQx0gGbAWh0dHBzOi8vd3d3LnBpbmt2aWxsYS5jb20vZW50ZXJ0YWlubWVudC9idHMtamluLXNoYXJlcy1iZWhpbmQtdGhlLXNjZW5lcy1vZi10aGUtYXN0cm9uYXV0LXN0YWdlLXdpdGgtY29sZHBsYXktYmFuZC1wZXJmb3Jtcy1vbi1zbmwtd2l0aC13b290dGVvLTEyMDgyNDE_YW1w?hl=en-IN&gl=IN&ceid=IN:en> (referer: None)
2023-02-08 01:44:27 [scrapy.core.scraper] DEBUG: Scraped from <200 https://news.google.com/rss/articles/CBMilwFodHRwczovL3d3dy5waW5rdmlsbGEuY29tL2VudGVydGFpbm1lbnQvYnRzLWppbi1zaGFyZXMtYmVoaW5kLXRoZS1zY2VuZXMtb2YtdGhlLWFzdHJvbmF1dC1zdGFnZS13aXRoLWNvbGRwbGF5LWJhbmQtcGVyZm9ybXMtb24tc25sLXdpdGgtd29vdHRlby0xMjA4MjQx0gGbAWh0dHBzOi8vd3d3LnBpbmt2aWxsYS5jb20vZW50ZXJ0YWlubWVudC9idHMtamluLXNoYXJlcy1iZWhpbmQtdGhlLXNjZW5lcy1vZi10aGUtYXN0cm9uYXV0LXN0YWdlLXdpdGgtY29sZHBsYXktYmFuZC1wZXJmb3Jtcy1vbi1zbmwtd2l0aC13b290dGVvLTEyMDgyNDE_YW1w?hl=en-IN&gl=IN&ceid=IN:en>
{'url': 'https://news.google.com/rss/articles/CBMilwFodHRwczovL3d3dy5waW5rdmlsbGEuY29tL2VudGVydGFpbm1lbnQvYnRzLWppbi1zaGFyZXMtYmVoaW5kLXRoZS1zY2VuZXMtb2YtdGhlLWFzdHJvbmF1dC1zdGFnZS13aXRoLWNvbGRwbGF5LWJhbmQtcGVyZm9ybXMtb24tc25sLXdpdGgtd29vdHRlby0xMjA4MjQx0gGbAWh0dHBzOi8vd3d3LnBpbmt2aWxsYS5jb20vZW50ZXJ0YWlubWVudC9idHMtamluLXNoYXJlcy1iZWhpbmQtdGhlLXNjZW5lcy1vZi10aGUtYXN0cm9uYXV0LXN0YWdlLXdpdGgtY29sZHBsYXktYmFuZC1wZXJmb3Jtcy1vbi1zbmwtd2l0aC13b290dGVvLTEyMDgyNDE_YW1w?hl=en-IN&gl=IN&ceid=IN:en'}
2023-02-08 01:44:27 [scrapy.core.engine] INFO: Closing spider (finished)
2023-02-08 01:44:27 [scrapy.extensions.feedexport] INFO: Stored json feed (1 items) in: items.json
2023-02-08 01:44:27 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1561,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 104182,
'downloader/response_count': 2,
'downloader/response_status_count/200': 1,
'downloader/response_status_count/302': 1,
'elapsed_time_seconds': 1.162381,
'feedexport/success_count/FileFeedStorage': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2023, 2, 7, 20, 14, 27, 736096),
'httpcompression/response_bytes': 302666,
'httpcompression/response_count': 1,
'item_scraped_count': 1,
'log_count/DEBUG': 4,
'log_count/INFO': 11,
'response_received_count': 1,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2023, 2, 7, 20, 14, 26, 573715)}
2023-02-08 01:44:27 [scrapy.core.engine] INFO: Spider closed (finished)
items.json
[
{"url": "https://news.google.com/rss/articles/CBMilwFodHRwczovL3d3dy5waW5rdmlsbGEuY29tL2VudGVydGFpbm1lbnQvYnRzLWppbi1zaGFyZXMtYmVoaW5kLXRoZS1zY2VuZXMtb2YtdGhlLWFzdHJvbmF1dC1zdGFnZS13aXRoLWNvbGRwbGF5LWJhbmQtcGVyZm9ybXMtb24tc25sLXdpdGgtd29vdHRlby0xMjA4MjQx0gGbAWh0dHBzOi8vd3d3LnBpbmt2aWxsYS5jb20vZW50ZXJ0YWlubWVudC9idHMtamluLXNoYXJlcy1iZWhpbmQtdGhlLXNjZW5lcy1vZi10aGUtYXN0cm9uYXV0LXN0YWdlLXdpdGgtY29sZHBsYXktYmFuZC1wZXJmb3Jtcy1vbi1zbmwtd2l0aC13b290dGVvLTEyMDgyNDE_YW1w?hl=en-IN&gl=IN&ceid=IN:en"}
]
I expect the URL to be https://www.pinkvilla.com/entertainment/bts-jin-shares-behind-the-scenes-of-the-astronaut-stage-with-coldplay-band-performs-on-snl-with-wootteo-1208241
I've tried setting params such as dont_redirect, handle_httpstatus_list, etc., but nothing has worked.
What am I missing?
Any guidance would be helpful
What you're missing is that the 302 redirect you see in your logs does not redirect to the page you are expecting. The redirect in your logs simply takes you from ...news.google.com/rss/articles/CBM...yNDE_YW1w to ...news.google.com/rss/articles/CBM...yNDE_YW1w?hl=en-IN&gl=IN&ceid=IN:en.
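The difference between the two URLs in the log can be checked with the standard library. In this sketch ARTICLE_ID is a placeholder standing in for the long article token; the query string is the one shown in the log above.

```python
from urllib.parse import parse_qs, urlsplit

# ARTICLE_ID is a placeholder for the long token in the real URLs.
before = "https://news.google.com/rss/articles/ARTICLE_ID"
after = "https://news.google.com/rss/articles/ARTICLE_ID?hl=en-IN&gl=IN&ceid=IN:en"

a, b = urlsplit(before), urlsplit(after)

# Same scheme, host and path: the 302 stays on the same Google News page.
same_page = (a.scheme, a.netloc, a.path) == (b.scheme, b.netloc, b.path)

# The only thing the redirect adds is a set of locale query parameters.
added_params = parse_qs(b.query)

print(same_page)
print(added_params)
```

So response.url after the redirect still points at news.google.com, only with locale parameters appended.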
The URL of the page you are expecting can actually be found in the HTML of the page you are redirected to. In fact it is the only link there, alongside a whole bunch of JavaScript that performs the redirect automatically in your browser.
For example:
from scrapy import Spider
from scrapy.crawler import CrawlerProcess


class MySpider(Spider):
    name = 'test'
    start_urls = ['https://news.google.com/rss/articles/CBMilwFodHRwczovL3d3dy5waW5rdmlsbGEuY29tL2VudGVydGFpbm1lbnQvYnRzLWppbi1zaGFyZXMtYmVoaW5kLXRoZS1zY2VuZXMtb2YtdGhlLWFzdHJvbmF1dC1zdGFnZS13aXRoLWNvbGRwbGF5LWJhbmQtcGVyZm9ybXMtb24tc25sLXdpdGgtd29vdHRlby0xMjA4MjQx0gGbAWh0dHBzOi8vd3d3LnBpbmt2aWxsYS5jb20vZW50ZXJ0YWlubWVudC9idHMtamluLXNoYXJlcy1iZWhpbmQtdGhlLXNjZW5lcy1vZi10aGUtYXN0cm9uYXV0LXN0YWdlLXdpdGgtY29sZHBsYXktYmFuZC1wZXJmb3Jtcy1vbi1zbmwtd2l0aC13b290dGVvLTEyMDgyNDE_YW1w']

    def parse(self, response):
        m = response.xpath("//a/@href").get()  # grab the href for the only link on the page
        yield {"links": m}
OUTPUT:
2023-02-07 18:39:55 [scrapy.core.scraper] DEBUG: Scraped from <200 https://news.google.com/rss/articles/CBMilwFodHRwczovL3d3dy5waW5rdmlsbGEuY29tL2VudGVydGFpbm1lbnQvYnRzLWppbi1zaGFyZXMtYmVoaW5kLXRoZS1zY2VuZXMtb2YtdGhlLWFzdHJvbmF1dC1zdGFnZS13aXRoLWNvbGRwbGF5LWJhbmQtcGVyZm9ybXMtb24tc25sLXdpdGgtd29vdHRlby0xMjA4MjQx0gGbAWh0dHBzOi8vd3d3LnBpbmt2aWxsYS5jb20vZW50ZXJ0YWlubWVudC9idHMtamluLXNoYXJlcy1iZWhpbmQtdGhlLXNjZW5lcy1vZi10aGUtYXN0cm9uYXV0LXN0YWdlLXdpdGgtY29sZHBsYXktYmFuZC1wZXJmb3Jtcy1vbi1zbmwtd2l0aC13b290dGVvLTEyMDgyNDE_YW1w?hl=en-US&gl=US&ceid=US:en>
{'links': 'https://www.pinkvilla.com/entertainment/bts-jin-shares-behind-the-scenes-of-the-astronaut-stage-with-coldplay-band-performs-on-snl-with-wootteo-1208241'}
2023-02-07 18:39:55 [scrapy.core.engine] INFO: Closing spider (finished)
2023-02-07 18:39:55 [scrapy.extensions.feedexport] INFO: Stored json feed (1 items) in: items.json
2023-02-07 18:39:55 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
Since Scrapy doesn't execute any of the JavaScript in the response to the news.google.com/rss request, the redirect that happens in your browser is never triggered in Scrapy.
I am trying in vain to retrieve data from here: https://www.etoro.com/discover/people/results. Let's say I want to get the nickname element first. It appears in the following format in the HTML source code: <div _ngcontent-bqd-c27="" automation-id="trade-item-name" class="symbol">markaungier</div>
I tried the following three approaches:
Using a CSS selector
nickname = response.css("[automation-id=trade-item-name]")
Using an XPATH relative path
nickname = response.xpath("//div[@automation-id='trade-item-name']")
Using a full XPATH
response.xpath("/html/body/ui-layout/div/div/div[2]/et-discovery-people-results/div/div/et-discovery-people-results-grid/div/div/div/et-user-card[1]/div/header/et-card-avatar/a/div[2]/div[1]")
Strangely, none of them returned anything. What's going on here? Does the issue arise because of this, i.e. "Some webpages show the desired data when you load them in a web browser. However, when you download them using Scrapy, you cannot reach the desired data using selectors" ?
My full code is as follows:
import scrapy
import requests
from lxml import html
from scrapy.crawler import CrawlerProcess


class EtoroSpider(scrapy.Spider):
    name = "traders"
    start_urls = [
        "https://www.etoro.com/discover/people/results",
    ]

    def parse(self, response):
        nickname = response.xpath("//div[@automation-id='trade-item-name']")
        print(nickname)


process = CrawlerProcess(settings={
    "FEEDS": {
        "items.json": {"format": "json"},
    },
})
process.crawl(EtoroSpider)
process.start()
And here is the scrapy output:
2020-10-14 16:29:08 [scrapy.utils.log] INFO: Scrapy 2.3.0 started (bot: scrapybot)
2020-10-14 16:29:08 [scrapy.utils.log] INFO: Versions: lxml 4.5.2.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.7.0 (v3.7.0:1bf9cc5093, Jun 27 2018, 04:59:51) [MSC v.1914 64 bit (AMD64)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1g 21 Apr 2020), cryptography 3.1, Platform Windows-10-10.0.18362-SP0
2020-10-14 16:29:08 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2020-10-14 16:29:08 [scrapy.crawler] INFO: Overridden settings:
{}
2020-10-14 16:29:08 [scrapy.extensions.telnet] INFO: Telnet Password: adf8b7868ee25c32
2020-10-14 16:29:08 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats']
2020-10-14 16:29:08 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-10-14 16:29:08 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-10-14 16:29:08 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-10-14 16:29:08 [scrapy.core.engine] INFO: Spider opened
2020-10-14 16:29:08 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-10-14 16:29:08 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-10-14 16:29:09 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.etoro.com/discover/people/results> (referer: None)
[]
2020-10-14 16:29:09 [scrapy.core.engine] INFO: Closing spider (finished)
2020-10-14 16:29:09 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 236,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 23288,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'elapsed_time_seconds': 0.353381,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 10, 14, 14, 29, 9, 150136),
'log_count/DEBUG': 1,
'log_count/INFO': 10,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2020, 10, 14, 14, 29, 8, 796755)}
2020-10-14 16:29:09 [scrapy.core.engine] INFO: Spider closed (finished)
EDIT
I fetched the source code seen by Scrapy using scrapy fetch --nolog https://www.etoro.com/discover/people/results > response.html and found that it contains an injected JavaScript and has no trace of the above <div> tags.
You can check how the data is fetched via AJAX using the Network tab of your browser's developer tools. There are a couple of quite heavy responses in this case; most probably they contain the data you need.
So the data can be fetched via the API without even parsing the primary page.
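Once the relevant request has been identified in the Network tab, its JSON body can be parsed directly. The sketch below is only an illustration: the actual endpoint and payload structure are not shown in this thread, so the keys "Items" and "UserName" and the helper name extract_nicknames are hypothetical and must be replaced with whatever the real response contains.

```python
import json


def extract_nicknames(payload_text):
    """Return the nickname of every entry in a JSON payload.

    Assumption: the payload looks like {"Items": [{"UserName": ...}, ...]}.
    The real response may use different keys; check it in the Network tab.
    """
    payload = json.loads(payload_text)
    return [entry["UserName"] for entry in payload.get("Items", [])]


# Made-up sample in the assumed shape, not real data from the site
# (the nickname is the one quoted in the question).
sample = '{"Items": [{"UserName": "markaungier"}]}'
nicknames = extract_nicknames(sample)
print(nicknames)

# In a spider, the callback for the API request could then do:
#     yield from ({"nickname": n} for n in extract_nicknames(response.text))
```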
Using Python 3.7.2 on Windows 10, I'm struggling to get Scrapy v1.5.1 to download some PDF files. I followed the docs but I seem to be missing something. Scrapy gets me the desired PDF URLs but downloads nothing. No errors are thrown either.
The relevant code is:
scrapy.cfg:
[settings]
default = pranger.settings
[deploy]
project = pranger
settings.py:
BOT_NAME = 'pranger'
SPIDER_MODULES = ['pranger.spiders']
NEWSPIDER_MODULE = 'pranger.spiders'
ROBOTSTXT_OBEY = True
ITEM_PIPELINES = {
    'pranger.pipelines.PrangerPipeline': 300,
    'scrapy.pipelines.files.FilesPipeline': 1,
}
FILES_STORE = r'C:\pranger_downloaded'
FILES_URLS_FIELD = 'PDF_urls'
FILES_RESULT_FIELD = 'processed_PDFs'
pranger_spider.py:
import scrapy


class IndexSpider(scrapy.Spider):
    name = "index"
    url_liste = []

    def start_requests(self):
        urls = [
            'http://verbraucherinfo.ua-bw.de/lmk.asp?ref=3',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        for menupunkt in response.css('div#aufklappmenue'):
            yield {
                'file_urls': menupunkt.css('div.aussen a.innen::attr(href)').getall()
            }
items.py:
import scrapy


class PrangerItem(scrapy.Item):
    file_urls = scrapy.Field()
    files = scrapy.Field()
All other files are as they were created by the scrapy startproject command.
The output of scrapy crawl index is:
(pranger) C:\pranger>scrapy crawl index
2019-02-20 15:45:18 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: pranger)
2019-02-20 15:45:18 [scrapy.utils.log] INFO: Versions: lxml 4.3.1.0, libxml2 2.9.9, cssselect 1.0.3, parsel 1.5.1, w3lib 1.20.0, Twisted 18.9.0, Python 3.7.2 (default, Feb 11 2019, 14:11:50) [MSC v.1915 64 bit (AMD64)], pyOpenSSL 19.0.0 (OpenSSL 1.1.1a 20 Nov 2018), cryptography 2.5, Platform Windows-10-10.0.17763-SP0
2019-02-20 15:45:18 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'pranger', 'NEWSPIDER_MODULE': 'pranger.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['pranger.spiders']}
2019-02-20 15:45:18 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2019-02-20 15:45:18 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-02-20 15:45:18 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-02-20 15:45:18 [scrapy.middleware] INFO: Enabled item pipelines:
['scrapy.pipelines.files.FilesPipeline', 'pranger.pipelines.PrangerPipeline']
2019-02-20 15:45:18 [scrapy.core.engine] INFO: Spider opened
2019-02-20 15:45:18 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-02-20 15:45:18 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2019-02-20 15:45:19 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://verbraucherinfo.ua-bw.de/robots.txt> (referer: None)
2019-02-20 15:45:19 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://verbraucherinfo.ua-bw.de/lmk.asp?ref=3> (referer: None)
2019-02-20 15:45:19 [scrapy.core.scraper] DEBUG: Scraped from <200 http://verbraucherinfo.ua-bw.de/lmk.asp?ref=3>
{'file_urls': ['https://www.lrabb.de/site/LRA-BB-Desktop/get/params_E-428807985/3287025/Ergebnisse_amtlicher_Kontrollen_nach_LFGB_Landkreis_Boeblingen.pdf', <<...and dozens more URLs...>>], 'processed_PDFs': []}
2019-02-20 15:45:19 [scrapy.core.engine] INFO: Closing spider (finished)
2019-02-20 15:45:19 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 469,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 13268,
'downloader/response_count': 2,
'downloader/response_status_count/200': 1,
'downloader/response_status_count/404': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2019, 2, 20, 14, 45, 19, 166646),
'item_scraped_count': 1,
'log_count/DEBUG': 4,
'log_count/INFO': 7,
'response_received_count': 2,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2019, 2, 20, 14, 45, 18, 864509)}
2019-02-20 15:45:19 [scrapy.core.engine] INFO: Spider closed (finished)
Oh BTW I published the code, just in case: https://github.com/R0byn/pranger/tree/5bfa0df92f21cecee18cc618e9a8e7ceea192403
The FILES_URLS_FIELD setting tells the pipeline which field of the item contains the URLs you want to download.
By default, this is file_urls, but if you change the setting, you also need to change the field name (key) you're storing the URLs in.
So you have two options: either use the default setting, or rename your item's field to PDF_urls as well.
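The effect of the mismatch can be illustrated with a simplified imitation of the lookup. This is not Scrapy's actual code: urls_to_download is a hypothetical helper that only mirrors the idea of reading the URL list from the configured field, and the example.com URL is a placeholder.

```python
def urls_to_download(item, settings):
    """Simplified imitation of the lookup: read URLs from the configured field."""
    field = settings.get("FILES_URLS_FIELD", "file_urls")
    return item.get(field, [])


settings = {"FILES_URLS_FIELD": "PDF_urls"}            # as in settings.py above
item = {"file_urls": ["https://example.com/a.pdf"]}    # key the spider yields

# The setting points at 'PDF_urls', but the item only has 'file_urls'.
mismatched = urls_to_download(item, settings)

# Renaming the item's key to match the setting makes the URLs visible.
matched = urls_to_download({"PDF_urls": item["file_urls"]}, settings)

print(mismatched, matched)
```

With the mismatched key the lookup finds nothing, which is consistent with the empty 'processed_PDFs' list in the log above.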
I am using Scrapy + Splash to scrape some financial data from a dynamic website. However, the page contains dynamically generated markup (using 'data-reactid'), so I don't know how to extract the data.
Here is my spider:
import scrapy
from scrapy_splash import SplashRequest


class StocksSpider(scrapy.Spider):
    name = 'stocks'
    allowed_domains = ['gu.qq.com']
    start_urls = ['http://gu.qq.com/hk00700/gp/income/']

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url=url, callback=self.parse,
                                args={
                                    'wait': 0.5,
                                },
                                endpoint='render.html',
                                )

    def parse(self, response):
        for data in response.css("div.mod-detail write gb_con submodule finance-report"):
            yield {
                'table': data.css("table.fin-table.tbody.tr.td::text").extract()
            }
I tried to export the result to CSV using the command below, but nothing was stored in the CSV file:
scrapy crawl stocks -o stocks.csv
Here is the log after running this command:
root@localhost:~/finance/finance/spiders# scrapy crawl stocks -o stocks.csv
2018-06-09 10:09:59 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: finance)
2018-06-09 10:09:59 [scrapy.utils.log] INFO: Versions: lxml 4.2.1.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.4.0, w3lib 1.19.0, Twisted 18.4.0, Python 2.7.12 (default, Dec 4 2017, 14:50:18) - [GCC 5.4.0 20160609], pyOpenSSL 18.0.0 (OpenSSL 1.1.0h 27 Mar 2018), cryptography 2.2.2, Platform Linux-4.15.13-x86_64-linode106-x86_64-with-Ubuntu-16.04-xenial
2018-06-09 10:09:59 [scrapy.crawler] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'finance.spiders', 'FEED_URI': 'stocks.csv', 'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter', 'SPIDER_MODULES': ['finance.spiders'], 'BOT_NAME': 'finance', 'ROBOTSTXT_OBEY': True, 'FEED_FORMAT': 'csv'}
2018-06-09 10:09:59 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats']
2018-06-09 10:09:59 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy_splash.SplashCookiesMiddleware',
'scrapy_splash.SplashMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-06-09 10:09:59 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy_splash.SplashDeduplicateArgsMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-06-09 10:09:59 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-06-09 10:09:59 [scrapy.core.engine] INFO: Spider opened
2018-06-09 10:10:00 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-06-09 10:10:00 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-06-09 10:10:00 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://gu.qq.com/robots.txt> (referer: None)
2018-06-09 10:10:00 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://localhost:8050/robots.txt> (referer: None)
2018-06-09 10:10:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://gu.qq.com/hk00700/gp/income/ via http://localhost:8050/render.html> (referer: None)
2018-06-09 10:10:17 [scrapy.core.engine] INFO: Closing spider (finished)
2018-06-09 10:10:17 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 962,
'downloader/request_count': 3,
'downloader/request_method_count/GET': 2,
'downloader/request_method_count/POST': 1,
'downloader/response_bytes': 184825,
'downloader/response_count': 3,
'downloader/response_status_count/200': 2,
'downloader/response_status_count/404': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2018, 6, 9, 10, 10, 17, 510745),
'log_count/DEBUG': 4,
'log_count/INFO': 7,
'memusage/max': 51392512,
'memusage/startup': 51392512,
'response_received_count': 3,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'splash/render.html/request_count': 1,
'splash/render.html/response_count/200': 1,
'start_time': datetime.datetime(2018, 6, 9, 10, 10, 0, 4160)}
2018-06-09 10:10:17 [scrapy.core.engine] INFO: Spider closed (finished)
And below is the link and the web structure that I want to scrape:
http://gu.qq.com/hk00700/gp/income
I am quite new to web scraping. Could anyone explain how I should extract the data?
Here is your data:
http://web.ifzq.gtimg.cn/appstock/hk/HkInfo/getFinReport?type=3&reporttime_type=-1&code=00700&_callback=jQuery112405223614913821484_1528544465322&_=1528544465323
Splash is not required anywhere here. Take a look at that URL: change the query parameters and you will get a JSON response. Remove Splash; it is not useful for this page and will only increase your response time.
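Because the URL above carries a `_callback` parameter, the body most likely comes back as JSONP (the JSON wrapped in a function call) rather than bare JSON. A minimal sketch of unwrapping it, assuming that shape; `jsonp_to_dict` is a hypothetical helper name, and the layout of the returned report is not shown in this thread, so inspect the parsed result before picking fields:

```python
import json
import re


def jsonp_to_dict(text):
    """Strip a JSONP wrapper such as `callback({...});` and parse the payload.

    Falls back to parsing the text as plain JSON when no wrapper is found.
    """
    # Capture everything between the first "(" and the final ")".
    match = re.search(r"^[^(]*\((.*)\)\s*;?\s*$", text.strip(), re.DOTALL)
    payload = match.group(1) if match else text
    return json.loads(payload)
```

Inside a spider whose start_urls points at the endpoint, parse could then simply yield jsonp_to_dict(response.text), and `scrapy crawl stocks -o stocks.csv` would have items to export.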
Below is a spider I wrote to crawl an RSS feed and extract the first link and image title, and save them to a text file:
import scrapy


class artSpider(scrapy.Spider):
    name = "metart"
    start_urls = ['https://www.metmuseum.org/art/artwork-of-the-day?rss=1']

    def parse(self, response):
        title = response.xpath('//item/title/text()').extract_first().replace(" ", "_")
        description = response.xpath('//item/description/text()').extract_first()
        lnkstart = description.find("https://image")
        lnkcut1 = description.find("web-highlight")
        lnkcut2 = lnkcut1 + 13
        lnkend = description.find(".jpg") + 4
        link = description[lnkstart:lnkcut1] + "original" + description[lnkcut2:lnkend]
        ttlstart = description.find("who=") + 4
        ttlend = description.find("&rss=1")
        filename = "/path/to/save/folder/" + description[ttlstart:ttlend].replace("+", "_") + "-" + title + ".jpg"
        print(filename)
        print(link)
        filename_file = open('filename_metart.txt', 'w')
        filename_file.write(filename)
        filename_file.close()  # close() must be called; a bare .close does nothing
        link_file = open('link_metart.txt', 'w')
        link_file.write(link)
        link_file.close()
It goes over the Met Museum "Artwork of the Day" RSS feed and finds the newest artwork. It then parses the title and the link to the original image (the link in the RSS feed is for a thumbnail) and saves them to individual text files. The title is used to generate a filename when downloading the image.
I know the parse function is a mess and that the way I am saving the links and filenames is ugly. I am just starting out with Scrapy and just wanted to get the spider working because it is only a part of the broader project I am working on.
The project itself draws from artwork/image/photo of the day feeds and websites, downloads the images and sets them as a desktop background slideshow.
My issue is that when I set the spider off it comes back with a response of "NoneType". BUT if I go to the RSS feed in a browser (https://metmuseum.org/art/artwork-of-the-day?rss=1) AND THEN run the spider it works correctly.
Process that fails
1. Call scrapy crawl metart
2. Output shows that the variable title has the type NoneType
3. Repeating step 1 results in the same output
Process that works
1. Open https://metmuseum.org/art/artwork-of-the-day?rss=1 in a web browser
2. Call scrapy crawl metart
3. Successfully saves the link and filename to the relevant text files
Some troubleshooting I have already done
I have used a Scrapy shell to replicate exactly the process that the spider goes through and this worked fine. I went through each step that the spider goes through starting with:
fetch("https://metmuseum.org/art/artwork-of-the-day?rss=1")
Then typing in each line of the parse function. This works fine and results in the correct link and filename being saved WITHOUT having to open the URL in a web browser first.
Just for completeness below is my Scrapy project settings.py file:
BOT_NAME = 'wallpaper_scraper'
SPIDER_MODULES = ['wallpaper_scraper.spiders']
NEWSPIDER_MODULE = 'wallpaper_scraper.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'cchowgule for a wallpaper slideshow'
# Obey robots.txt rules
ROBOTSTXT_OBEY = True
I am very confused as to how the act of opening the URL in a web browser could possibly affect the response of the spider.
Any help cleaning up my parse function would also be lovely.
Thanks
Update
At 0535 UTC I ran scrapy crawl metart and got the following response:
2018-06-23 11:02:54 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: wallpaper_scraper)
2018-06-23 11:02:54 [scrapy.utils.log] INFO: Versions: lxml 4.2.1.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.4.0, w3lib 1.19.0, Twisted 17.9.0, Python 2.7.15 (default, May 1 2018, 05:55:50) - [GCC 7.3.0], pyOpenSSL 17.5.0 (OpenSSL 1.1.0h 27 Mar 2018), cryptography 2.2.2, Platform Linux-4.16.0-2-amd64-x86_64-with-debian-buster-sid
2018-06-23 11:02:54 [scrapy.crawler] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'wallpaper_scraper.spiders', 'SPIDER_MODULES': ['wallpaper_scraper.spiders'], 'ROBOTSTXT_OBEY': True, 'USER_AGENT': 'cchowgule for a wallpaper slideshow', 'BOT_NAME': 'wallpaper_scraper'}
2018-06-23 11:02:54 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats']
2018-06-23 11:02:54 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-06-23 11:02:54 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-06-23 11:02:54 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-06-23 11:02:54 [scrapy.core.engine] INFO: Spider opened
2018-06-23 11:02:54 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-06-23 11:02:54 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-06-23 11:02:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.metmuseum.org/robots.txt> (referer: None)
2018-06-23 11:02:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.metmuseum.org/art/artwork-of-the-day?rss=1> (referer: None)
2018-06-23 11:02:56 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.metmuseum.org/art/artwork-of-the-day?rss=1> (referer: None)
Traceback (most recent call last):
File "/home/cchowgule/.local/lib/python2.7/site-packages/twisted/internet/defer.py", line 653, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "/home/cchowgule/WD/pyenvscrapy/wallpaper_scraper/wallpaper_scraper/spiders/metart.py", line 9, in parse
title = response.xpath('//item/title/text()').extract_first().replace(" ", "_")
AttributeError: 'NoneType' object has no attribute 'replace'
2018-06-23 11:02:57 [scrapy.core.engine] INFO: Closing spider (finished)
2018-06-23 11:02:57 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 646,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 1725,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2018, 6, 23, 5, 32, 57, 65104),
'log_count/DEBUG': 3,
'log_count/ERROR': 1,
'log_count/INFO': 7,
'memusage/max': 52793344,
'memusage/startup': 52793344,
'response_received_count': 2,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'spider_exceptions/AttributeError': 1,
'start_time': datetime.datetime(2018, 6, 23, 5, 32, 54, 464466)}
2018-06-23 11:02:57 [scrapy.core.engine] INFO: Spider closed (finished)
I ran scrapy crawl metart 5 times with the same result.
Then I opened a browser, went to https://metmuseum.org/art/artwork-of-the-day?rss=1 and ran scrapy crawl metart again. This time it worked correctly. I just don't get it.
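On cleaning up the parse function: the stats above show only 1725 response bytes across both requests, which suggests the feed came back without any item elements, so extract_first() returned None. Below is a rough sketch of the same string handling pulled into a plain function that returns None instead of raising when something is missing. `artwork_info` is a hypothetical name, and the function assumes the description has the layout the original spider relies on (a thumbnail URL containing "web-highlight" and a "who=...&rss=1" query string):

```python
def artwork_info(title, description, folder="/path/to/save/folder/"):
    """Return (filename, link) for the newest artwork, or None if data is missing."""
    if title is None or description is None:
        return None  # empty feed: nothing to parse

    # Thumbnail URL runs from "https://image" to the first ".jpg" after it.
    start = description.find("https://image")
    end = description.find(".jpg", start)
    if start == -1 or end == -1:
        return None
    # Swap the thumbnail path segment for the full-size one.
    link = description[start:end + 4].replace("web-highlight", "original", 1)

    # Artist name sits between "who=" and "&rss=1" in the description.
    who_start = description.find("who=")
    who_end = description.find("&rss=1", who_start)
    if who_start == -1 or who_end == -1:
        return None
    artist = description[who_start + 4:who_end].replace("+", "_")

    filename = folder + artist + "-" + title.replace(" ", "_") + ".jpg"
    return filename, link
```

In the spider, parse would pass in the two extract_first() results without calling .replace on them first, and when the function returns None it could log response.body to show what the server actually sent on the failing runs.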