Using Python 3.7.2 on Windows 10, I'm struggling to get Scrapy v1.5.1 to download some PDF files. I followed the docs, but I seem to be missing something: Scrapy finds the desired PDF URLs but downloads nothing. No errors are thrown either (at least none that I can see).
The relevant code is:
scrapy.cfg:
[settings]
default = pranger.settings
[deploy]
project = pranger
settings.py:
BOT_NAME = 'pranger'
SPIDER_MODULES = ['pranger.spiders']
NEWSPIDER_MODULE = 'pranger.spiders'
ROBOTSTXT_OBEY = True
ITEM_PIPELINES = {
    'pranger.pipelines.PrangerPipeline': 300,
    'scrapy.pipelines.files.FilesPipeline': 1,
}
FILES_STORE = r'C:\pranger_downloaded'
FILES_URLS_FIELD = 'PDF_urls'
FILES_RESULT_FIELD = 'processed_PDFs'
pranger_spider.py:
import scrapy


class IndexSpider(scrapy.Spider):
    name = "index"
    url_liste = []

    def start_requests(self):
        urls = [
            'http://verbraucherinfo.ua-bw.de/lmk.asp?ref=3',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        for menupunkt in response.css('div#aufklappmenue'):
            yield {
                'file_urls': menupunkt.css('div.aussen a.innen::attr(href)').getall()
            }
items.py:
import scrapy


class PrangerItem(scrapy.Item):
    file_urls = scrapy.Field()
    files = scrapy.Field()
All other files are as they were created by the scrapy startproject command.
The output of scrapy crawl index is:
(pranger) C:\pranger>scrapy crawl index
2019-02-20 15:45:18 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: pranger)
2019-02-20 15:45:18 [scrapy.utils.log] INFO: Versions: lxml 4.3.1.0, libxml2 2.9.9, cssselect 1.0.3, parsel 1.5.1, w3lib 1.20.0, Twisted 18.9.0, Python 3.7.2 (default, Feb 11 2019, 14:11:50) [MSC v.1915 64 bit (AMD64)], pyOpenSSL 19.0.0 (OpenSSL 1.1.1a 20 Nov 2018), cryptography 2.5, Platform Windows-10-10.0.17763-SP0
2019-02-20 15:45:18 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'pranger', 'NEWSPIDER_MODULE': 'pranger.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['pranger.spiders']}
2019-02-20 15:45:18 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2019-02-20 15:45:18 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-02-20 15:45:18 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-02-20 15:45:18 [scrapy.middleware] INFO: Enabled item pipelines:
['scrapy.pipelines.files.FilesPipeline', 'pranger.pipelines.PrangerPipeline']
2019-02-20 15:45:18 [scrapy.core.engine] INFO: Spider opened
2019-02-20 15:45:18 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-02-20 15:45:18 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2019-02-20 15:45:19 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://verbraucherinfo.ua-bw.de/robots.txt> (referer: None)
2019-02-20 15:45:19 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://verbraucherinfo.ua-bw.de/lmk.asp?ref=3> (referer: None)
2019-02-20 15:45:19 [scrapy.core.scraper] DEBUG: Scraped from <200 http://verbraucherinfo.ua-bw.de/lmk.asp?ref=3>
{'file_urls': ['https://www.lrabb.de/site/LRA-BB-Desktop/get/params_E-428807985/3287025/Ergebnisse_amtlicher_Kontrollen_nach_LFGB_Landkreis_Boeblingen.pdf', <<...and dozens more URLs...>>], 'processed_PDFs': []}
2019-02-20 15:45:19 [scrapy.core.engine] INFO: Closing spider (finished)
2019-02-20 15:45:19 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 469,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 13268,
'downloader/response_count': 2,
'downloader/response_status_count/200': 1,
'downloader/response_status_count/404': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2019, 2, 20, 14, 45, 19, 166646),
'item_scraped_count': 1,
'log_count/DEBUG': 4,
'log_count/INFO': 7,
'response_received_count': 2,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2019, 2, 20, 14, 45, 18, 864509)}
2019-02-20 15:45:19 [scrapy.core.engine] INFO: Spider closed (finished)
Oh BTW I published the code, just in case: https://github.com/R0byn/pranger/tree/5bfa0df92f21cecee18cc618e9a8e7ceea192403
The FILES_URLS_FIELD setting tells the pipeline which field of the item contains the URLs you want to download.
By default this is file_urls, but since you changed the setting, you also need to change the field name (key) you store the URLs under.
So you have two options: either remove the FILES_URLS_FIELD / FILES_RESULT_FIELD overrides and use the defaults, or rename your item's field to PDF_urls as well.
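Concretely, either change keeps the setting and the item key aligned. A sketch against the settings shown above (not tested):

```python
# Option 1: drop the field overrides in settings.py so FilesPipeline
# falls back to its defaults ('file_urls' / 'files'), matching the spider:
ITEM_PIPELINES = {
    'scrapy.pipelines.files.FilesPipeline': 1,
    'pranger.pipelines.PrangerPipeline': 300,
}
FILES_STORE = r'C:\pranger_downloaded'
# (no FILES_URLS_FIELD / FILES_RESULT_FIELD lines)

# Option 2: keep the overrides and rename the key the spider yields:
#     yield {'PDF_urls': menupunkt.css('div.aussen a.innen::attr(href)').getall()}
```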
I'm trying to fetch the redirected URL from a URL using Scrapy.
The response status changes from 302 to 200, but the URL still isn't changing.
from scrapy import Spider
from scrapy.crawler import CrawlerProcess


class MySpider(Spider):
    name = 'test'
    start_urls = ['https://news.google.com/rss/articles/CBMilwFodHRwczovL3d3dy5waW5rdmlsbGEuY29tL2VudGVydGFpbm1lbnQvYnRzLWppbi1zaGFyZXMtYmVoaW5kLXRoZS1zY2VuZXMtb2YtdGhlLWFzdHJvbmF1dC1zdGFnZS13aXRoLWNvbGRwbGF5LWJhbmQtcGVyZm9ybXMtb24tc25sLXdpdGgtd29vdHRlby0xMjA4MjQx0gGbAWh0dHBzOi8vd3d3LnBpbmt2aWxsYS5jb20vZW50ZXJ0YWlubWVudC9idHMtamluLXNoYXJlcy1iZWhpbmQtdGhlLXNjZW5lcy1vZi10aGUtYXN0cm9uYXV0LXN0YWdlLXdpdGgtY29sZHBsYXktYmFuZC1wZXJmb3Jtcy1vbi1zbmwtd2l0aC13b290dGVvLTEyMDgyNDE_YW1w']

    def parse(self, response):
        yield {
            'url': response.url,
        }


process = CrawlerProcess(settings={
    "FEEDS": {
        "items.json": {
            "format": "json",
            "overwrite": True
        }},
    'ROBOTSTXT_OBEY': False,
    'FEED_EXPORT_ENCODING': 'utf-8',
    'REDIRECT_ENABLED': True,
    'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7'
})

process.crawl(MySpider)
process.start()
Console Output
2023-02-08 01:44:25 [scrapy.utils.log] INFO: Scrapy 2.8.0 started (bot: scrapybot)
2023-02-08 01:44:25 [scrapy.utils.log] INFO: Versions: lxml 4.9.2.0, libxml2 2.9.12, cssselect 1.2.0, parsel 1.7.0, w3lib 2.1.1, Twisted 22.10.0, Python 3.8.10 (tags/v3.8.10:3d8993a, May 3 2021, 11:48:03) [MSC v.1928 64 bit (AMD64)], pyOpenSSL 23.0.0 (OpenSSL 3.0.8 7 Feb 2023), cryptography 39.0.1, Platform Windows-10-10.0.22621-SP0
2023-02-08 01:44:25 [scrapy.crawler] INFO: Overridden settings:
{'FEED_EXPORT_ENCODING': 'utf-8', 'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7'}
2023-02-08 01:44:25 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2023-02-08 01:44:25 [scrapy.extensions.telnet] INFO: Telnet Password: 013f29d178b8cbb6
2023-02-08 01:44:25 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats']
2023-02-08 01:44:26 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2023-02-08 01:44:26 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2023-02-08 01:44:26 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2023-02-08 01:44:26 [scrapy.core.engine] INFO: Spider opened
2023-02-08 01:44:26 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2023-02-08 01:44:26 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2023-02-08 01:44:26 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://news.google.com/rss/articles/CBMilwFodHRwczovL3d3dy5waW5rdmlsbGEuY29tL2VudGVydGFpbm1lbnQvYnRzLWppbi1zaGFyZXMtYmVoaW5kLXRoZS1zY2VuZXMtb2YtdGhlLWFzdHJvbmF1dC1zdGFnZS13aXRoLWNvbGRwbGF5LWJhbmQtcGVyZm9ybXMtb24tc25sLXdpdGgtd29vdHRlby0xMjA4MjQx0gGbAWh0dHBzOi8vd3d3LnBpbmt2aWxsYS5jb20vZW50ZXJ0YWlubWVudC9idHMtamluLXNoYXJlcy1iZWhpbmQtdGhlLXNjZW5lcy1vZi10aGUtYXN0cm9uYXV0LXN0YWdlLXdpdGgtY29sZHBsYXktYmFuZC1wZXJmb3Jtcy1vbi1zbmwtd2l0aC13b290dGVvLTEyMDgyNDE_YW1w?hl=en-IN&gl=IN&ceid=IN:en> from <GET https://news.google.com/rss/articles/CBMilwFodHRwczovL3d3dy5waW5rdmlsbGEuY29tL2VudGVydGFpbm1lbnQvYnRzLWppbi1zaGFyZXMtYmVoaW5kLXRoZS1zY2VuZXMtb2YtdGhlLWFzdHJvbmF1dC1zdGFnZS13aXRoLWNvbGRwbGF5LWJhbmQtcGVyZm9ybXMtb24tc25sLXdpdGgtd29vdHRlby0xMjA4MjQx0gGbAWh0dHBzOi8vd3d3LnBpbmt2aWxsYS5jb20vZW50ZXJ0YWlubWVudC9idHMtamluLXNoYXJlcy1iZWhpbmQtdGhlLXNjZW5lcy1vZi10aGUtYXN0cm9uYXV0LXN0YWdlLXdpdGgtY29sZHBsYXktYmFuZC1wZXJmb3Jtcy1vbi1zbmwtd2l0aC13b290dGVvLTEyMDgyNDE_YW1w>
2023-02-08 01:44:27 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://news.google.com/rss/articles/CBMilwFodHRwczovL3d3dy5waW5rdmlsbGEuY29tL2VudGVydGFpbm1lbnQvYnRzLWppbi1zaGFyZXMtYmVoaW5kLXRoZS1zY2VuZXMtb2YtdGhlLWFzdHJvbmF1dC1zdGFnZS13aXRoLWNvbGRwbGF5LWJhbmQtcGVyZm9ybXMtb24tc25sLXdpdGgtd29vdHRlby0xMjA4MjQx0gGbAWh0dHBzOi8vd3d3LnBpbmt2aWxsYS5jb20vZW50ZXJ0YWlubWVudC9idHMtamluLXNoYXJlcy1iZWhpbmQtdGhlLXNjZW5lcy1vZi10aGUtYXN0cm9uYXV0LXN0YWdlLXdpdGgtY29sZHBsYXktYmFuZC1wZXJmb3Jtcy1vbi1zbmwtd2l0aC13b290dGVvLTEyMDgyNDE_YW1w?hl=en-IN&gl=IN&ceid=IN:en> (referer: None)
2023-02-08 01:44:27 [scrapy.core.scraper] DEBUG: Scraped from <200 https://news.google.com/rss/articles/CBMilwFodHRwczovL3d3dy5waW5rdmlsbGEuY29tL2VudGVydGFpbm1lbnQvYnRzLWppbi1zaGFyZXMtYmVoaW5kLXRoZS1zY2VuZXMtb2YtdGhlLWFzdHJvbmF1dC1zdGFnZS13aXRoLWNvbGRwbGF5LWJhbmQtcGVyZm9ybXMtb24tc25sLXdpdGgtd29vdHRlby0xMjA4MjQx0gGbAWh0dHBzOi8vd3d3LnBpbmt2aWxsYS5jb20vZW50ZXJ0YWlubWVudC9idHMtamluLXNoYXJlcy1iZWhpbmQtdGhlLXNjZW5lcy1vZi10aGUtYXN0cm9uYXV0LXN0YWdlLXdpdGgtY29sZHBsYXktYmFuZC1wZXJmb3Jtcy1vbi1zbmwtd2l0aC13b290dGVvLTEyMDgyNDE_YW1w?hl=en-IN&gl=IN&ceid=IN:en>
{'url': 'https://news.google.com/rss/articles/CBMilwFodHRwczovL3d3dy5waW5rdmlsbGEuY29tL2VudGVydGFpbm1lbnQvYnRzLWppbi1zaGFyZXMtYmVoaW5kLXRoZS1zY2VuZXMtb2YtdGhlLWFzdHJvbmF1dC1zdGFnZS13aXRoLWNvbGRwbGF5LWJhbmQtcGVyZm9ybXMtb24tc25sLXdpdGgtd29vdHRlby0xMjA4MjQx0gGbAWh0dHBzOi8vd3d3LnBpbmt2aWxsYS5jb20vZW50ZXJ0YWlubWVudC9idHMtamluLXNoYXJlcy1iZWhpbmQtdGhlLXNjZW5lcy1vZi10aGUtYXN0cm9uYXV0LXN0YWdlLXdpdGgtY29sZHBsYXktYmFuZC1wZXJmb3Jtcy1vbi1zbmwtd2l0aC13b290dGVvLTEyMDgyNDE_YW1w?hl=en-IN&gl=IN&ceid=IN:en'}
2023-02-08 01:44:27 [scrapy.core.engine] INFO: Closing spider (finished)
2023-02-08 01:44:27 [scrapy.extensions.feedexport] INFO: Stored json feed (1 items) in: items.json
2023-02-08 01:44:27 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1561,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 104182,
'downloader/response_count': 2,
'downloader/response_status_count/200': 1,
'downloader/response_status_count/302': 1,
'elapsed_time_seconds': 1.162381,
'feedexport/success_count/FileFeedStorage': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2023, 2, 7, 20, 14, 27, 736096),
'httpcompression/response_bytes': 302666,
'httpcompression/response_count': 1,
'item_scraped_count': 1,
'log_count/DEBUG': 4,
'log_count/INFO': 11,
'response_received_count': 1,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2023, 2, 7, 20, 14, 26, 573715)}
2023-02-08 01:44:27 [scrapy.core.engine] INFO: Spider closed (finished)
items.json
[
{"url": "https://news.google.com/rss/articles/CBMilwFodHRwczovL3d3dy5waW5rdmlsbGEuY29tL2VudGVydGFpbm1lbnQvYnRzLWppbi1zaGFyZXMtYmVoaW5kLXRoZS1zY2VuZXMtb2YtdGhlLWFzdHJvbmF1dC1zdGFnZS13aXRoLWNvbGRwbGF5LWJhbmQtcGVyZm9ybXMtb24tc25sLXdpdGgtd29vdHRlby0xMjA4MjQx0gGbAWh0dHBzOi8vd3d3LnBpbmt2aWxsYS5jb20vZW50ZXJ0YWlubWVudC9idHMtamluLXNoYXJlcy1iZWhpbmQtdGhlLXNjZW5lcy1vZi10aGUtYXN0cm9uYXV0LXN0YWdlLXdpdGgtY29sZHBsYXktYmFuZC1wZXJmb3Jtcy1vbi1zbmwtd2l0aC13b290dGVvLTEyMDgyNDE_YW1w?hl=en-IN&gl=IN&ceid=IN:en"}
]
I expect the URL to be https://www.pinkvilla.com/entertainment/bts-jin-shares-behind-the-scenes-of-the-astronaut-stage-with-coldplay-band-performs-on-snl-with-wootteo-1208241
I've tried setting params such as dont_redirect, handle_httpstatus_list, etc., but nothing is working out.
What am I missing?
Any guidance would be helpful.
What you're missing is that the 302 redirect you see in your logs is not redirecting to the page you are expecting. It simply takes you from ...news.google.com/rss/articles/CBM...yNDE_YW1w to ...news.google.com/rss/articles/CBM...yNDE_YW1w?hl=en-IN&gl=IN&ceid=IN:en.
The URL of the page you expect it to redirect to can actually be found in the HTML of the page it is redirected to. In fact, it is the only link there, alongside a whole bunch of JavaScript that performs the redirect automatically in your browser.
for example:
from scrapy import Spider
from scrapy.crawler import CrawlerProcess


class MySpider(Spider):
    name = 'test'
    start_urls = ['https://news.google.com/rss/articles/CBMilwFodHRwczovL3d3dy5waW5rdmlsbGEuY29tL2VudGVydGFpbm1lbnQvYnRzLWppbi1zaGFyZXMtYmVoaW5kLXRoZS1zY2VuZXMtb2YtdGhlLWFzdHJvbmF1dC1zdGFnZS13aXRoLWNvbGRwbGF5LWJhbmQtcGVyZm9ybXMtb24tc25sLXdpdGgtd29vdHRlby0xMjA4MjQx0gGbAWh0dHBzOi8vd3d3LnBpbmt2aWxsYS5jb20vZW50ZXJ0YWlubWVudC9idHMtamluLXNoYXJlcy1iZWhpbmQtdGhlLXNjZW5lcy1vZi10aGUtYXN0cm9uYXV0LXN0YWdlLXdpdGgtY29sZHBsYXktYmFuZC1wZXJmb3Jtcy1vbi1zbmwtd2l0aC13b290dGVvLTEyMDgyNDE_YW1w']

    def parse(self, response):
        m = response.xpath("//a/@href").get()  # grab the href for the only link on the page
        yield {"links": m}
OUTPUT:
2023-02-07 18:39:55 [scrapy.core.scraper] DEBUG: Scraped from <200 https://news.google.com/rss/articles/CBMilwFodHRwczovL3d3dy5waW5rdmlsbGEuY29tL2VudGVydGFpbm1lbnQvYnRzLWppbi1zaGFyZXMtYmVoaW5kLXRoZS1zY2VuZXMt
b2YtdGhlLWFzdHJvbmF1dC1zdGFnZS13aXRoLWNvbGRwbGF5LWJhbmQtcGVyZm9ybXMtb24tc25sLXdpdGgtd29vdHRlby0xMjA4MjQx0gGbAWh0dHBzOi8vd3d3LnBpbmt2aWxsYS5jb20vZW50ZXJ0YWlubWVudC9idHMtamluLXNoYXJlcy1iZWhpbmQtdGhlLXNjZW5lcy1v
Zi10aGUtYXN0cm9uYXV0LXN0YWdlLXdpdGgtY29sZHBsYXktYmFuZC1wZXJmb3Jtcy1vbi1zbmwtd2l0aC13b290dGVvLTEyMDgyNDE_YW1w?hl=en-US&gl=US&ceid=US:en>
{'links': 'https://www.pinkvilla.com/entertainment/bts-jin-shares-behind-the-scenes-of-the-astronaut-stage-with-coldplay-band-performs-on-snl-with-wootteo-1208241'}
2023-02-07 18:39:55 [scrapy.core.engine] INFO: Closing spider (finished)
2023-02-07 18:39:55 [scrapy.extensions.feedexport] INFO: Stored json feed (1 items) in: items.json
2023-02-07 18:39:55 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
Since Scrapy doesn't execute any of the JavaScript in the response to the news.google.com/rss request, the redirect that happens in your browser never gets triggered by Scrapy.
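The spider can then follow that extracted href with a second scrapy.Request to reach the article itself. The extraction step can also be sketched with the standard library alone; the HTML below is a made-up miniature of the interstitial page, not Google's actual markup:

```python
import re

# Made-up stand-in for the interstitial page: one <a> pointing at the real
# article, plus the JavaScript that would redirect a browser automatically.
page = '''<html><body>
<a href="https://www.pinkvilla.com/entertainment/bts-jin-shares-behind-the-scenes-of-the-astronaut-stage-with-coldplay-band-performs-on-snl-with-wootteo-1208241">link</a>
<script>/* redirect logic runs here in a real browser */</script>
</body></html>'''

# Same idea as the spider's //a/@href: grab the href of the only link.
match = re.search(r'<a\s[^>]*href="([^"]+)"', page)
article_url = match.group(1) if match else None
print(article_url)
```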
When running scrapy crawl newegg in my console, I'm hit with 'Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)'. I have tried looking up many fixes, none of which have worked. Any help is appreciated, thank you.
# web scrapes newegg page for product price
import scrapy
class NeweggSpider(scrapy.Spider):
name = 'newegg'
start_urls = [
'https://www.newegg.com/team-32gb-288-pin-ddr4-sdram/p/N82E16820331426?Item=N82E16820331426&cm_sp=Homepage_SS-_-P0_20-331-426-_-08282022'
]
def parse(self, response):
for product in response.css('div.page-section-innder'):
yield {
'name': product.css('h1.prodcut-title::text').get(),
'price': product.css('li.price-current strong::text').get()
}
This is the console log after running scrapy crawl newegg. As you can see, it crawls 0 pages and scrapes 0 items, and I can't tell from the log what is wrong.
2022-08-28 22:12:38 [scrapy.utils.log] INFO: Scrapy 2.6.2 started (bot: newegg)
2022-08-28 22:12:38 [scrapy.utils.log] INFO: Versions: lxml 4.9.1.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 2.0.1, Twisted 22.4.0, Python 3.10.6 (tags/v3.10.6:9c7b4bd, Aug 1 2022, 21:53:49) [MSC v.1932 64 bit (AMD64)], pyOpenSSL 22.0.0 (OpenSSL 3.0.5 5 Jul 2022), cryptography 37.0.4, Platform Windows-10-10.0.19044-SP0
2022-08-28 22:12:38 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'newegg',
'NEWSPIDER_MODULE': 'newegg.spiders',
'ROBOTSTXT_OBEY': True,
'SPIDER_MODULES': ['newegg.spiders']}
2022-08-28 22:12:38 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2022-08-28 22:12:38 [scrapy.extensions.telnet] INFO: Telnet Password: 5d3a6b25365f91b1
2022-08-28 22:12:38 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2022-08-28 22:12:38 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-08-28 22:12:38 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-08-28 22:12:38 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-08-28 22:12:38 [scrapy.core.engine] INFO: Spider opened
2022-08-28 22:12:38 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-08-28 22:12:38 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-08-28 22:12:38 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.newegg.com/robots.txt> (referer: None)
2022-08-28 22:12:38 [filelock] DEBUG: Attempting to acquire lock 2120593134576 on C:\Users\Casey\AppData\Local\Programs\Python\Python310\lib\site-packages\tldextract\.suffix_cache/publicsuffix.org-tlds\de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2022-08-28 22:12:38 [filelock] DEBUG: Lock 2120593134576 acquired on C:\Users\Casey\AppData\Local\Programs\Python\Python310\lib\site-packages\tldextract\.suffix_cache/publicsuffix.org-tlds\de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2022-08-28 22:12:38 [filelock] DEBUG: Attempting to release lock 2120593134576 on C:\Users\Casey\AppData\Local\Programs\Python\Python310\lib\site-packages\tldextract\.suffix_cache/publicsuffix.org-tlds\de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2022-08-28 22:12:38 [filelock] DEBUG: Lock 2120593134576 released on C:\Users\Casey\AppData\Local\Programs\Python\Python310\lib\site-packages\tldextract\.suffix_cache/publicsuffix.org-tlds\de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2022-08-28 22:12:38 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.newegg.com/team-32gb-288-pin-ddr4-sdram/p/N82E16820331426?Item=N82E16820331426&cm_sp=Homepage_SS-_-P0_20-331-426-_-08282022> (referer: None)
2022-08-28 22:12:38 [scrapy.core.engine] INFO: Closing spider (finished)
2022-08-28 22:12:38 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 550,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 42150,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'elapsed_time_seconds': 0.25123,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2022, 8, 29, 3, 12, 38, 921346),
'httpcompression/response_bytes': 181561,
'httpcompression/response_count': 2,
'log_count/DEBUG': 7,
'log_count/INFO': 10,
'response_received_count': 2,
'robotstxt/request_count': 1,
'robotstxt/response_count': 1,
'robotstxt/response_status_count/200': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2022, 8, 29, 3, 12, 38, 670116)}
2022-08-28 22:12:38 [scrapy.core.engine] INFO: Spider closed (finished)
You have typos in your CSS selectors (page-section-innder should be page-section-inner, and prodcut-title should be product-title). But even after fixing those, your selectors don't seem to work: they successfully grab half of the price, but not the product name or the other half of the price field.
A fix for the name selector is to apply it directly to the response instead of chaining it from product; a better solution that grabs the whole price text is to use an XPath expression.
For example:
# web scrapes newegg page for product price
import scrapy
class NeweggSpider(scrapy.Spider):
name = 'newegg'
start_urls = [
'https://www.newegg.com/team-32gb-288-pin-ddr4-sdram/p/N82E16820331426?Item=N82E16820331426&cm_sp=Homepage_SS-_-P0_20-331-426-_-08282022'
]
def parse(self, response):
price = response.xpath("//li[#class='price-current']//*/text()").getall()
price_string = ''.join(price)
yield {
"name": response.css("h1.product-title::text").get(),
"price": price_string
}
OUTPUT
{'name': 'Team T-FORCE DARK Za 32GB (2 x 16GB) 288-Pin PC RAM DDR4 3600 (PC4 28800) Desktop Memory (FOR AMD) Model TDZAD432G3600HC18JDC01',
'price': '89.99'}
I am trying in vain to retrieve data from here: https://www.etoro.com/discover/people/results. Let's say I want to get the nickname element first. It appears in the following format in the HTML source code: <div _ngcontent-bqd-c27="" automation-id="trade-item-name" class="symbol">markaungier</div>
I tried the following three approaches:
Using a CSS selector
nickname = response.css("[automation-id=trade-item-name]")
Using an XPATH relative path
nickname = response.xpath("//div[@automation-id='trade-item-name']")
Using a full XPATH
response.xpath("/html/body/ui-layout/div/div/div[2]/et-discovery-people-results/div/div/et-discovery-people-results-grid/div/div/div/et-user-card[1]/div/header/et-card-avatar/a/div[2]/div[1]")
Strangely, none of them returned anything. What's going on here? Does the issue arise because of this, i.e. "Some webpages show the desired data when you load them in a web browser. However, when you download them using Scrapy, you cannot reach the desired data using selectors" ?
My full code is as follows:
import scrapy
import requests
from lxml import html
from scrapy.crawler import CrawlerProcess


class EtoroSpider(scrapy.Spider):
    name = "traders"
    start_urls = [
        "https://www.etoro.com/discover/people/results",
    ]

    def parse(self, response):
        nickname = response.xpath("//div[@automation-id='trade-item-name']")
        print(nickname)


process = CrawlerProcess(settings={
    "FEEDS": {
        "items.json": {"format": "json"},
    },
})

process.crawl(EtoroSpider)
process.start()
And here is the scrapy output:
2020-10-14 16:29:08 [scrapy.utils.log] INFO: Scrapy 2.3.0 started (bot: scrapybot)
2020-10-14 16:29:08 [scrapy.utils.log] INFO: Versions: lxml 4.5.2.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.7.0 (v3.7.0:1bf9cc5093, Jun 27 2018, 04:59:51) [MSC v.1914 64 bit (AMD64)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1g 21 Apr 2020), cryptography 3.1, Platform Windows-10-10.0.18362-SP0
2020-10-14 16:29:08 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2020-10-14 16:29:08 [scrapy.crawler] INFO: Overridden settings:
{}
2020-10-14 16:29:08 [scrapy.extensions.telnet] INFO: Telnet Password: adf8b7868ee25c32
2020-10-14 16:29:08 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats']
2020-10-14 16:29:08 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-10-14 16:29:08 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-10-14 16:29:08 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-10-14 16:29:08 [scrapy.core.engine] INFO: Spider opened
2020-10-14 16:29:08 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-10-14 16:29:08 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-10-14 16:29:09 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.etoro.com/discover/people/results> (referer: None)
[]
2020-10-14 16:29:09 [scrapy.core.engine] INFO: Closing spider (finished)
2020-10-14 16:29:09 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 236,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 23288,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'elapsed_time_seconds': 0.353381,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 10, 14, 14, 29, 9, 150136),
'log_count/DEBUG': 1,
'log_count/INFO': 10,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2020, 10, 14, 14, 29, 8, 796755)}
2020-10-14 16:29:09 [scrapy.core.engine] INFO: Spider closed (finished)
EDIT
I fetched the source code seen by Scrapy using scrapy fetch --nolog https://www.etoro.com/discover/people/results > response.html and found that it contains an injected JavaScript and has no trace of the above <div> tags.
You can inspect the AJAX data fetching in the Network tab of your browser's developer tools. There are a couple of quite heavy responses in this case; most probably they contain the data you need.
So the data can be fetched via the API without parsing the primary page at all.
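For example, after copying one of those request URLs from the Network tab, you can rebuild it with your own query parameters and fetch it directly. The endpoint and parameter names below are placeholders for illustration, not eToro's real API:

```python
import json
import urllib.request
from urllib.parse import urlencode

# Hypothetical endpoint copied from the Network tab -- not a real eToro URL.
BASE = "https://www.example.com/api/people/results"

def build_api_url(base, params):
    """Rebuild the copied XHR URL with our own query parameters."""
    return base + "?" + urlencode(params)

url = build_api_url(BASE, {"period": "OneYearAgo", "pagesize": 50})
# A real run would then fetch and decode the JSON payload:
#     data = json.loads(urllib.request.urlopen(url).read())
```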
I am using Scrapy + Splash to scrape some financial data from a dynamic website, but the page is rendered dynamically (the elements carry data-reactid attributes), so I don't know how to extract the data.
Here is my spider:
import scrapy
from scrapy_splash import SplashRequest


class StocksSpider(scrapy.Spider):
    name = 'stocks'
    allowed_domains = ['gu.qq.com']
    start_urls = ['http://gu.qq.com/hk00700/gp/income/']

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url=url, callback=self.parse,
                args={
                    'wait': 0.5,
                },
                endpoint='render.html',
            )

    def parse(self, response):
        for data in response.css("div.mod-detail write gb_con submodule finance-report"):
            yield {
                'table': data.css("table.fin-table.tbody.tr.td::text").extract()
            }
I tried to export the result to CSV using the command below, but nothing was stored in the CSV:
scrapy crawl stocks -o stocks.csv
Here is the log after running this command:
root#localhost:~/finance/finance/spiders# scrapy crawl stocks -o stocks.csv
2018-06-09 10:09:59 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: finance)
2018-06-09 10:09:59 [scrapy.utils.log] INFO: Versions: lxml 4.2.1.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.4.0, w3lib 1.19.0, Twisted 18.4.0, Python 2.7.12 (default, Dec 4 2017, 14:50:18) - [GCC 5.4.0 20160609], pyOpenSSL 18.0.0 (OpenSSL 1.1.0h 27 Mar 2018), cryptography 2.2.2, Platform Linux-4.15.13-x86_64-linode106-x86_64-with-Ubuntu-16.04-xenial
2018-06-09 10:09:59 [scrapy.crawler] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'finance.spiders', 'FEED_URI': 'stocks.csv', 'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter', 'SPIDER_MODULES': ['finance.spiders'], 'BOT_NAME': 'finance', 'ROBOTSTXT_OBEY': True, 'FEED_FORMAT': 'csv'}
2018-06-09 10:09:59 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats']
2018-06-09 10:09:59 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy_splash.SplashCookiesMiddleware',
'scrapy_splash.SplashMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-06-09 10:09:59 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy_splash.SplashDeduplicateArgsMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-06-09 10:09:59 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-06-09 10:09:59 [scrapy.core.engine] INFO: Spider opened
2018-06-09 10:10:00 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-06-09 10:10:00 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-06-09 10:10:00 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://gu.qq.com/robots.txt> (referer: None)
2018-06-09 10:10:00 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://localhost:8050/robots.txt> (referer: None)
2018-06-09 10:10:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://gu.qq.com/hk00700/gp/income/ via http://localhost:8050/render.html> (referer: None)
2018-06-09 10:10:17 [scrapy.core.engine] INFO: Closing spider (finished)
2018-06-09 10:10:17 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 962,
'downloader/request_count': 3,
'downloader/request_method_count/GET': 2,
'downloader/request_method_count/POST': 1,
'downloader/response_bytes': 184825,
'downloader/response_count': 3,
'downloader/response_status_count/200': 2,
'downloader/response_status_count/404': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2018, 6, 9, 10, 10, 17, 510745),
'log_count/DEBUG': 4,
'log_count/INFO': 7,
'memusage/max': 51392512,
'memusage/startup': 51392512,
'response_received_count': 3,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'splash/render.html/request_count': 1,
'splash/render.html/response_count/200': 1,
'start_time': datetime.datetime(2018, 6, 9, 10, 10, 0, 4160)}
2018-06-09 10:10:17 [scrapy.core.engine] INFO: Spider closed (finished)
Below is the link to the page whose data I want to scrape:
http://gu.qq.com/hk00700/gp/income
I am quite new to web scraping; could anyone explain how I should extract the data?
Here is your data:
http://web.ifzq.gtimg.cn/appstock/hk/HkInfo/getFinReport?type=3&reporttime_type=-1&code=00700&_callback=jQuery112405223614913821484_1528544465322&_=1528544465323
Splash is not required here at all. Just take a look: change the query parameters in the URL and you will get a JSON response. Remove the Splash browser; it isn't useful for this page and will only increase your response time.
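To illustrate the point about that endpoint: the response is wrapped in the JSONP callback named by the `_callback` parameter, so the body has to be unwrapped before parsing. Here is a minimal sketch in plain Python; the helper name `strip_jsonp` and the sample payload are made up for illustration:

```python
import json
import re

def strip_jsonp(text):
    """Strip a JSONP wrapper such as jQuery112405...(...) and return the raw JSON payload."""
    match = re.match(r'^[^(]*\((.*)\)\s*;?\s*$', text, re.DOTALL)
    return match.group(1) if match else text

# Example: unwrap a JSONP-wrapped body like the one the endpoint above returns.
payload = strip_jsonp('jQuery112405_123({"code": 0, "data": {"report": []}})')
data = json.loads(payload)
```

Whether the body is fetched with Scrapy or `requests`, it can be passed through `strip_jsonp` before `json.loads`. With many JSONP endpoints, omitting the `_callback` parameter entirely returns plain JSON, though that depends on the server.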
I am following this tutorial in an attempt to learn how to use Scrapy. I am currently on the second tutorial, Writing Custom Spiders. I created the project and wrote redditbot.py just as the tutorial specifies.
import scrapy

class RedditbotSpider(scrapy.Spider):
    name = 'redditbot'
    allowed_domains = ['reddit.com/r/gameofthrones/']
    start_urls = ['http://www.reddit.com/r/gameofthrones//']

    def parse(self, response):
        # Extract the content using CSS selectors
        titles = response.css('.title.may_blank::text').extract()
        votes = response.css('.score.unvoted::text').extract()
        times = response.css('time::attr(title)').extract()
        comments = response.css('.comments::text').extract()

        # Display the extracted content in row fashion
        for item in zip(titles, votes, times, comments):
            # Create a dictionary to store the scraped info
            scraped_info = {
                'title': item[0],
                'vote': item[1],
                'created_at': item[2],
                'comments': item[3],
            }
            # Yield the scraped info to Scrapy
            yield scraped_info
However, when I run the program using
scrapy crawl redditbot
the program runs but does not output any scraped data, as the tutorial says it should. This is the output I receive in Terminal:
evans-mbp:ourfirstscraper evanyamaguchi$ scrapy crawl redditbot
2018-01-04 13:24:53 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: ourfirstscraper)
2018-01-04 13:24:53 [scrapy.utils.log] INFO: Versions: lxml 4.1.1.0, libxml2 2.9.7, cssselect 1.0.3, parsel 1.3.1, w3lib 1.18.0, Twisted 17.9.0, Python 3.6.4 (v3.6.4:d48ecebad5, Dec 18 2017, 21:07:28) - [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)], pyOpenSSL 17.5.0 (OpenSSL 1.1.0g 2 Nov 2017), cryptography 2.1.4, Platform Darwin-17.3.0-x86_64-i386-64bit
2018-01-04 13:24:53 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'ourfirstscraper', 'FEED_FORMAT': 'csv', 'FEED_URI': 'reddit.csv', 'NEWSPIDER_MODULE': 'ourfirstscraper.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['ourfirstscraper.spiders']}
2018-01-04 13:24:53 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats']
2018-01-04 13:24:53 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-01-04 13:24:53 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-01-04 13:24:53 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-01-04 13:24:53 [scrapy.core.engine] INFO: Spider opened
2018-01-04 13:24:53 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-01-04 13:24:53 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-01-04 13:24:53 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.reddit.com/robots.txt> from <GET http://www.reddit.com/robots.txt>
2018-01-04 13:24:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.reddit.com/robots.txt> (referer: None)
2018-01-04 13:24:53 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.reddit.com/r/gameofthrones//> from <GET http://www.reddit.com/r/gameofthrones//>
2018-01-04 13:24:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.reddit.com/r/gameofthrones//> (referer: None)
2018-01-04 13:24:53 [scrapy.core.engine] INFO: Closing spider (finished)
2018-01-04 13:24:53 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 945,
'downloader/request_count': 4,
'downloader/request_method_count/GET': 4,
'downloader/response_bytes': 36092,
'downloader/response_count': 4,
'downloader/response_status_count/200': 2,
'downloader/response_status_count/301': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2018, 1, 4, 18, 24, 53, 755172),
'log_count/DEBUG': 5,
'log_count/INFO': 7,
'memusage/max': 66359296,
'memusage/startup': 66359296,
'response_received_count': 2,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2018, 1, 4, 18, 24, 53, 205879)}
2018-01-04 13:24:53 [scrapy.core.engine] INFO: Spider closed (finished)
I cannot figure out why the spider seems to run but does not scrape any data from the website.
Thanks in advance,
Evan
Looks like you just have a typo: the class name is may-blank, not may_blank.
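A quick, self-contained way to see why the original selector matched nothing (the HTML line below is a made-up stand-in for reddit's real markup):

```python
import re

# A stand-in for one reddit post title as it appears in the page source.
html = '<a class="title may-blank" href="#">Example post title</a>'

# The tutorial code selected ".title.may_blank" -- that class does not exist:
wrong = re.search(r'class="[^"]*\bmay_blank\b', html)

# The real class is "may-blank", so this one matches:
right = re.search(r'class="[^"]*\bmay-blank\b', html)
```

In the spider itself, the fix is simply changing `.title.may_blank::text` to `.title.may-blank::text`.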