scrapy script runs without (apparent) error but doesn't scrape data - python

I am following this tutorial in an attempt to learn how to use Scrapy. I am currently on the second tutorial, Writing Custom Spiders. I created the project and wrote redditbot.py just as the tutorial specifies:
import scrapy


class RedditbotSpider(scrapy.Spider):
    name = 'redditbot'
    allowed_domains = ['reddit.com/r/gameofthrones/']
    start_urls = ['http://www.reddit.com/r/gameofthrones//']

    def parse(self, response):
        # Extracting the content using css selectors
        titles = response.css('.title.may_blank::text').extract()
        votes = response.css('.score.unvoted::text').extract()
        times = response.css('time::attr(title)').extract()
        comments = response.css('.comments::text').extract()
        # Display the extracted content in row fashion
        for item in zip(titles, votes, times, comments):
            # Create a dictionary to store the scraped info
            scraped_info = {
                'title': item[0],
                'vote': item[1],
                'created_at': item[2],
                'comments': item[3],
            }
            # Yield the scraped info to scrapy
            yield scraped_info
However, when I run the program using
scrapy crawl redditbot
it runs but does not output any scraped data, as the tutorial says it should. This is the output I receive in Terminal:
evans-mbp:ourfirstscraper evanyamaguchi$ scrapy crawl redditbot
2018-01-04 13:24:53 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: ourfirstscraper)
2018-01-04 13:24:53 [scrapy.utils.log] INFO: Versions: lxml 4.1.1.0, libxml2 2.9.7, cssselect 1.0.3, parsel 1.3.1, w3lib 1.18.0, Twisted 17.9.0, Python 3.6.4 (v3.6.4:d48ecebad5, Dec 18 2017, 21:07:28) - [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)], pyOpenSSL 17.5.0 (OpenSSL 1.1.0g 2 Nov 2017), cryptography 2.1.4, Platform Darwin-17.3.0-x86_64-i386-64bit
2018-01-04 13:24:53 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'ourfirstscraper', 'FEED_FORMAT': 'csv', 'FEED_URI': 'reddit.csv', 'NEWSPIDER_MODULE': 'ourfirstscraper.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['ourfirstscraper.spiders']}
2018-01-04 13:24:53 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats']
2018-01-04 13:24:53 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-01-04 13:24:53 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-01-04 13:24:53 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-01-04 13:24:53 [scrapy.core.engine] INFO: Spider opened
2018-01-04 13:24:53 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-01-04 13:24:53 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-01-04 13:24:53 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.reddit.com/robots.txt> from <GET http://www.reddit.com/robots.txt>
2018-01-04 13:24:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.reddit.com/robots.txt> (referer: None)
2018-01-04 13:24:53 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.reddit.com/r/gameofthrones//> from <GET http://www.reddit.com/r/gameofthrones//>
2018-01-04 13:24:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.reddit.com/r/gameofthrones//> (referer: None)
2018-01-04 13:24:53 [scrapy.core.engine] INFO: Closing spider (finished)
2018-01-04 13:24:53 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 945,
'downloader/request_count': 4,
'downloader/request_method_count/GET': 4,
'downloader/response_bytes': 36092,
'downloader/response_count': 4,
'downloader/response_status_count/200': 2,
'downloader/response_status_count/301': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2018, 1, 4, 18, 24, 53, 755172),
'log_count/DEBUG': 5,
'log_count/INFO': 7,
'memusage/max': 66359296,
'memusage/startup': 66359296,
'response_received_count': 2,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2018, 1, 4, 18, 24, 53, 205879)}
2018-01-04 13:24:53 [scrapy.core.engine] INFO: Spider closed (finished)
I cannot figure out why the spider seems to run but does not scrape any data from the website.
Thanks in advance,
Evan

Looks like you just have a typo: the class name is may-blank, not may_blank.
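One way to see why may_blank matches nothing: CSS class selectors match whitespace-separated tokens of the class attribute, so .title.may-blank matches class="title may-blank", while .title.may_blank requires a may_blank token that never appears. A stdlib-only sketch of that token-matching rule (no Scrapy involved; the sample HTML is invented to mirror Reddit's markup):

```python
# CSS class selectors match whitespace-separated tokens of the class
# attribute: '.title.may-blank' matches class="title may-blank",
# while '.title.may_blank' matches nothing on the page.
from html.parser import HTMLParser

class ClassMatcher(HTMLParser):
    """Counts start tags whose class attribute contains all required tokens."""
    def __init__(self, required):
        super().__init__()
        self.required = set(required)
        self.matches = 0

    def handle_starttag(self, tag, attrs):
        classes = set((dict(attrs).get("class") or "").split())
        if self.required <= classes:
            self.matches += 1

html = '<a class="title may-blank">Post title</a>'

good = ClassMatcher(["title", "may-blank"]); good.feed(html)
bad = ClassMatcher(["title", "may_blank"]); bad.feed(html)
print(good.matches, bad.matches)  # 1 0
```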

Related

how do i solve Enabled Item Pipeline: []?

I have a scrapy script to scrape a website. The code block is correct, but when I run it, it returns nothing. I checked the logs and found 2023-02-02 08:32:54 [scrapy.middleware] INFO: Enabled item pipelines: []. I don't know if that's why my script is not returning any results.
Also, I recently formatted my PC, reinstalled Python and its libraries, and reinstalled VS Code. I don't know if I am missing something in the settings or elsewhere.
Here's my script
import scrapy


class TruckspiderSpider(scrapy.Spider):
    name = 'truckspider'
    allowed_domains = ['www.quicktransportsolutions.com']
    start_urls = ['https://www.quicktransportsolutions.com/carrier/usa-trucking-companies.php']

    def parse(self, response):
        containers = response.css('[class="col-md-4 column"]')
        for container in containers:
            yield {
                'name': container.css('a::text').get(),
            }
Here is my log
2023-02-02 08:32:53 [scrapy.utils.log] INFO: Scrapy 2.7.1 started (bot: truckscraper2)
2023-02-02 08:32:53 [scrapy.utils.log] INFO: Versions: lxml 4.9.2.0, libxml2 2.9.12, cssselect 1.2.0, parsel 1.7.0, w3lib 2.1.1, Twisted 22.10.0,
Python 3.10.7 (tags/v3.10.7:6cc6b13, Sep 5 2022, 14:08:36) [MSC v.1933 64 bit (AMD64)], pyOpenSSL 23.0.0 (OpenSSL 3.0.7 1 Nov 2022), cryptography 39.0.0, Platform Windows-10-10.0.19045-SP0
2023-02-02 08:32:53 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'truckscraper2',
'NEWSPIDER_MODULE': 'truckscraper2.spiders',
'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7',
'ROBOTSTXT_OBEY': True,
'SPIDER_MODULES': ['truckscraper2.spiders'],
'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'}
2023-02-02 08:32:53 [asyncio] DEBUG: Using selector: SelectSelector
2023-02-02 08:32:53 [scrapy.utils.log] DEBUG: Using reactor:
twisted.internet.asyncioreactor.AsyncioSelectorReactor
2023-02-02 08:32:53 [scrapy.utils.log] DEBUG: Using asyncio event loop:
asyncio.windows_events._WindowsSelectorEventLoop
2023-02-02 08:32:53 [scrapy.extensions.telnet] INFO: Telnet Password: 27ed26758870e117
2023-02-02 08:32:53 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats']
2023-02-02 08:32:54 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2023-02-02 08:32:54 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2023-02-02 08:32:54 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2023-02-02 08:32:54 [scrapy.core.engine] INFO: Spider opened
2023-02-02 08:32:54 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2023-02-02 08:32:54 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2023-02-02 08:32:55 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.quicktransportsolutions.com/robots.txt> from <GET http://www.quicktransportsolutions.com/robots.txt>
2023-02-02 08:32:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.quicktransportsolutions.com/robots.txt> (referer: None)
2023-02-02 08:32:56 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.quicktransportsolutions.com/> from <GET http://www.quicktransportsolutions.com/>
2023-02-02 08:32:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.quicktransportsolutions.com/> (referer: None)
2023-02-02 08:32:57 [scrapy.core.engine] INFO: Closing spider (finished)
2023-02-02 08:32:57 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 944,
'downloader/request_count': 4,
'downloader/request_method_count/GET': 4,
'downloader/response_bytes': 8482,
'downloader/response_count': 4,
'downloader/response_status_count/200': 2,
'downloader/response_status_count/301': 2,
'elapsed_time_seconds': 2.496689,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2023, 2, 2, 7, 32, 57, 227109),
'httpcompression/response_bytes': 21965,
'httpcompression/response_count': 2,
'log_count/DEBUG': 7,
'log_count/INFO': 10,
'response_received_count': 2,
'robotstxt/request_count': 1,
'robotstxt/response_count': 1,
'robotstxt/response_status_count/200': 1,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2023, 2, 2, 7, 32, 54, 730420)}
2023-02-02 08:32:57 [scrapy.core.engine] INFO: Spider closed (finished)
Please help me get my Scrapy working again and extracting information as usual. Thanks.
So I finally got the answer to this. I gave my PC access to this file from my AppData folder:
2023-02-02 10:54:17 [filelock] DEBUG: Attempting to acquire lock 2733195998640 on C:\Users\ChiNedu\AppData\Local\Programs\Python\Python310\lib\site-packages\tldextract\.suffix_cache/publicsuffix.org-tlds\de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
After that, everything worked perfectly, like magic.
Hope this helps someone out there someday.

My scrapy Spider isn't crawling pages, 'Crawled 0 pages, scraped 0 items. Can't seem to find what's wrong

When running 'scrapy crawl newegg' in my console, I'm hit with 'Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)'. I have tried many fixes, none of which have worked. Any help is appreciated, thank you.
# web scrapes newegg page for product price
import scrapy


class NeweggSpider(scrapy.Spider):
    name = 'newegg'
    start_urls = [
        'https://www.newegg.com/team-32gb-288-pin-ddr4-sdram/p/N82E16820331426?Item=N82E16820331426&cm_sp=Homepage_SS-_-P0_20-331-426-_-08282022'
    ]

    def parse(self, response):
        for product in response.css('div.page-section-innder'):
            yield {
                'name': product.css('h1.prodcut-title::text').get(),
                'price': product.css('li.price-current strong::text').get()
            }
This is the console log after running 'scrapy crawl newegg'. As you can see, it crawls 0 pages and scrapes 0 items, and I cannot tell from the log what is wrong.
2022-08-28 22:12:38 [scrapy.utils.log] INFO: Scrapy 2.6.2 started (bot: newegg)
2022-08-28 22:12:38 [scrapy.utils.log] INFO: Versions: lxml 4.9.1.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 2.0.1, Twisted 22.4.0, Python 3.10.6 (tags/v3.10.6:9c7b4bd, Aug 1 2022, 21:53:49) [MSC v.1932 64 bit (AMD64)], pyOpenSSL 22.0.0 (OpenSSL 3.0.5 5 Jul 2022), cryptography 37.0.4, Platform Windows-10-10.0.19044-SP0
2022-08-28 22:12:38 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'newegg',
'NEWSPIDER_MODULE': 'newegg.spiders',
'ROBOTSTXT_OBEY': True,
'SPIDER_MODULES': ['newegg.spiders']}
2022-08-28 22:12:38 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2022-08-28 22:12:38 [scrapy.extensions.telnet] INFO: Telnet Password: 5d3a6b25365f91b1
2022-08-28 22:12:38 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2022-08-28 22:12:38 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-08-28 22:12:38 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-08-28 22:12:38 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-08-28 22:12:38 [scrapy.core.engine] INFO: Spider opened
2022-08-28 22:12:38 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-08-28 22:12:38 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-08-28 22:12:38 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.newegg.com/robots.txt> (referer: None)
2022-08-28 22:12:38 [filelock] DEBUG: Attempting to acquire lock 2120593134576 on C:\Users\Casey\AppData\Local\Programs\Python\Python310\lib\site-packages\tldextract\.suffix_cache/publicsuffix.org-tlds\de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2022-08-28 22:12:38 [filelock] DEBUG: Lock 2120593134576 acquired on C:\Users\Casey\AppData\Local\Programs\Python\Python310\lib\site-packages\tldextract\.suffix_cache/publicsuffix.org-tlds\de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2022-08-28 22:12:38 [filelock] DEBUG: Attempting to release lock 2120593134576 on C:\Users\Casey\AppData\Local\Programs\Python\Python310\lib\site-packages\tldextract\.suffix_cache/publicsuffix.org-tlds\de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2022-08-28 22:12:38 [filelock] DEBUG: Lock 2120593134576 released on C:\Users\Casey\AppData\Local\Programs\Python\Python310\lib\site-packages\tldextract\.suffix_cache/publicsuffix.org-tlds\de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2022-08-28 22:12:38 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.newegg.com/team-32gb-288-pin-ddr4-sdram/p/N82E16820331426?Item=N82E16820331426&cm_sp=Homepage_SS-_-P0_20-331-426-_-08282022> (referer: None)
2022-08-28 22:12:38 [scrapy.core.engine] INFO: Closing spider (finished)
2022-08-28 22:12:38 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 550,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 42150,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'elapsed_time_seconds': 0.25123,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2022, 8, 29, 3, 12, 38, 921346),
'httpcompression/response_bytes': 181561,
'httpcompression/response_count': 2,
'log_count/DEBUG': 7,
'log_count/INFO': 10,
'response_received_count': 2,
'robotstxt/request_count': 1,
'robotstxt/response_count': 1,
'robotstxt/response_status_count/200': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2022, 8, 29, 3, 12, 38, 670116)}
2022-08-28 22:12:38 [scrapy.core.engine] INFO: Spider closed (finished)
You have a typo in your first CSS selector (div.page-section-innder). But even after fixing that, your selectors don't seem to work: they successfully grab half of the price, but neither the product name (prodcut-title should be product-title) nor the other half of the price field.
A fix for the name selector is to apply it directly to the response instead of chaining it from product, and a better solution that can grab the whole price text is to use an XPath expression.
For example:
# web scrapes newegg page for product price
import scrapy


class NeweggSpider(scrapy.Spider):
    name = 'newegg'
    start_urls = [
        'https://www.newegg.com/team-32gb-288-pin-ddr4-sdram/p/N82E16820331426?Item=N82E16820331426&cm_sp=Homepage_SS-_-P0_20-331-426-_-08282022'
    ]

    def parse(self, response):
        # grab every text node under the price element and join the pieces
        price = response.xpath("//li[@class='price-current']//*/text()").getall()
        price_string = ''.join(price)
        yield {
            "name": response.css("h1.product-title::text").get(),
            "price": price_string
        }
OUTPUT
{'name': 'Team T-FORCE DARK Za 32GB (2 x 16GB) 288-Pin PC RAM DDR4 3600 (PC4 28800) Desktop Memory (FOR AMD) Model TDZAD432G3600HC18JDC01',
'price': '89.99'}

How to download (PDF) files with Python/Scrapy using the Files Pipeline?

Using Python 3.7.2 on Windows 10, I'm struggling to get Scrapy v1.5.1 to download some PDF files. I followed the docs but seem to be missing something. Scrapy finds the desired PDF URLs but downloads nothing, and no errors are thrown (at least none visible).
The relevant code is:
scrapy.cfg:
[settings]
default = pranger.settings

[deploy]
project = pranger
settings.py:
BOT_NAME = 'pranger'

SPIDER_MODULES = ['pranger.spiders']
NEWSPIDER_MODULE = 'pranger.spiders'

ROBOTSTXT_OBEY = True

ITEM_PIPELINES = {
    'pranger.pipelines.PrangerPipeline': 300,
    'scrapy.pipelines.files.FilesPipeline': 1,
}
FILES_STORE = r'C:\pranger_downloaded'
FILES_URLS_FIELD = 'PDF_urls'
FILES_RESULT_FIELD = 'processed_PDFs'
pranger_spider.py:
import scrapy


class IndexSpider(scrapy.Spider):
    name = "index"
    url_liste = []

    def start_requests(self):
        urls = [
            'http://verbraucherinfo.ua-bw.de/lmk.asp?ref=3',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        for menupunkt in response.css('div#aufklappmenue'):
            yield {
                'file_urls': menupunkt.css('div.aussen a.innen::attr(href)').getall()
            }
items.py:
import scrapy


class PrangerItem(scrapy.Item):
    file_urls = scrapy.Field()
    files = scrapy.Field()
All other files are as they were created by the scrapy startproject command.
The output of scrapy crawl index is:
(pranger) C:\pranger>scrapy crawl index
2019-02-20 15:45:18 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: pranger)
2019-02-20 15:45:18 [scrapy.utils.log] INFO: Versions: lxml 4.3.1.0, libxml2 2.9.9, cssselect 1.0.3, parsel 1.5.1, w3lib 1.20.0, Twisted 18.9.0, Python 3.7.2 (default, Feb 11 2019, 14:11:50) [MSC v.1915 64 bit (AMD64)], pyOpenSSL 19.0.0 (OpenSSL 1.1.1a 20 Nov 2018), cryptography 2.5, Platform Windows-10-10.0.17763-SP0
2019-02-20 15:45:18 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'pranger', 'NEWSPIDER_MODULE': 'pranger.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['pranger.spiders']}
2019-02-20 15:45:18 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2019-02-20 15:45:18 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-02-20 15:45:18 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-02-20 15:45:18 [scrapy.middleware] INFO: Enabled item pipelines:
['scrapy.pipelines.files.FilesPipeline', 'pranger.pipelines.PrangerPipeline']
2019-02-20 15:45:18 [scrapy.core.engine] INFO: Spider opened
2019-02-20 15:45:18 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-02-20 15:45:18 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2019-02-20 15:45:19 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://verbraucherinfo.ua-bw.de/robots.txt> (referer: None)
2019-02-20 15:45:19 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://verbraucherinfo.ua-bw.de/lmk.asp?ref=3> (referer: None)
2019-02-20 15:45:19 [scrapy.core.scraper] DEBUG: Scraped from <200 http://verbraucherinfo.ua-bw.de/lmk.asp?ref=3>
{'file_urls': ['https://www.lrabb.de/site/LRA-BB-Desktop/get/params_E-428807985/3287025/Ergebnisse_amtlicher_Kontrollen_nach_LFGB_Landkreis_Boeblingen.pdf', <<...and dozens more URLs...>>], 'processed_PDFs': []}
2019-02-20 15:45:19 [scrapy.core.engine] INFO: Closing spider (finished)
2019-02-20 15:45:19 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 469,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 13268,
'downloader/response_count': 2,
'downloader/response_status_count/200': 1,
'downloader/response_status_count/404': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2019, 2, 20, 14, 45, 19, 166646),
'item_scraped_count': 1,
'log_count/DEBUG': 4,
'log_count/INFO': 7,
'response_received_count': 2,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2019, 2, 20, 14, 45, 18, 864509)}
2019-02-20 15:45:19 [scrapy.core.engine] INFO: Spider closed (finished)
Oh BTW I published the code, just in case: https://github.com/R0byn/pranger/tree/5bfa0df92f21cecee18cc618e9a8e7ceea192403
The FILES_URLS_FIELD setting tells the pipeline what field of the item contains the urls you want to download.
By default, this is file_urls, but if you change the setting, you also need to change the field name (key) you're storing the urls in.
So you have two options - either use the default setting, or rename your item's field to PDF_urls as well.
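For example, the first option would mean a settings.py along these lines (a sketch based on the settings posted above; FILES_URLS_FIELD and FILES_RESULT_FIELD are simply dropped so the pipeline falls back to its defaults, file_urls and files, which are exactly the keys the spider already yields):

```python
# settings.py: sketch of option 1. With the two field-name overrides
# removed, the FilesPipeline reads urls from 'file_urls' and writes
# results to 'files', matching the item the spider yields.
BOT_NAME = 'pranger'

SPIDER_MODULES = ['pranger.spiders']
NEWSPIDER_MODULE = 'pranger.spiders'

ROBOTSTXT_OBEY = True

ITEM_PIPELINES = {
    'pranger.pipelines.PrangerPipeline': 300,
    'scrapy.pipelines.files.FilesPipeline': 1,
}
FILES_STORE = r'C:\pranger_downloaded'
# FILES_URLS_FIELD and FILES_RESULT_FIELD removed: defaults now apply
```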

Scraping javascript with 'data-reactid' content using Scrapy and Splash

I am using Scrapy + Splash to scrape some financial data from a dynamic website. However, the content is rendered dynamically (the elements carry 'data-reactid' attributes), so I don't know how to extract it.
Here is my spider:
import scrapy
from scrapy_splash import SplashRequest


class StocksSpider(scrapy.Spider):
    name = 'stocks'
    allowed_domains = ['gu.qq.com']
    start_urls = ['http://gu.qq.com/hk00700/gp/income/']

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(
                url=url,
                callback=self.parse,
                args={'wait': 0.5},
                endpoint='render.html',
            )

    def parse(self, response):
        for data in response.css("div.mod-detail write gb_con submodule finance-report"):
            yield {
                'table': data.css("table.fin-table.tbody.tr.td::text").extract()
            }
I tried to export the result to CSV using the command below, but nothing was stored in the CSV:
scrapy crawl stocks -o stocks.csv
Here is the log after running this command:
root@localhost:~/finance/finance/spiders# scrapy crawl stocks -o stocks.csv
2018-06-09 10:09:59 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: finance)
2018-06-09 10:09:59 [scrapy.utils.log] INFO: Versions: lxml 4.2.1.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.4.0, w3lib 1.19.0, Twisted 18.4.0, Python 2.7.12 (default, Dec 4 2017, 14:50:18) - [GCC 5.4.0 20160609], pyOpenSSL 18.0.0 (OpenSSL 1.1.0h 27 Mar 2018), cryptography 2.2.2, Platform Linux-4.15.13-x86_64-linode106-x86_64-with-Ubuntu-16.04-xenial
2018-06-09 10:09:59 [scrapy.crawler] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'finance.spiders', 'FEED_URI': 'stocks.csv', 'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter', 'SPIDER_MODULES': ['finance.spiders'], 'BOT_NAME': 'finance', 'ROBOTSTXT_OBEY': True, 'FEED_FORMAT': 'csv'}
2018-06-09 10:09:59 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats']
2018-06-09 10:09:59 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy_splash.SplashCookiesMiddleware',
'scrapy_splash.SplashMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-06-09 10:09:59 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy_splash.SplashDeduplicateArgsMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-06-09 10:09:59 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-06-09 10:09:59 [scrapy.core.engine] INFO: Spider opened
2018-06-09 10:10:00 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-06-09 10:10:00 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-06-09 10:10:00 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://gu.qq.com/robots.txt> (referer: None)
2018-06-09 10:10:00 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://localhost:8050/robots.txt> (referer: None)
2018-06-09 10:10:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://gu.qq.com/hk00700/gp/income/ via http://localhost:8050/render.html> (referer: None)
2018-06-09 10:10:17 [scrapy.core.engine] INFO: Closing spider (finished)
2018-06-09 10:10:17 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 962,
'downloader/request_count': 3,
'downloader/request_method_count/GET': 2,
'downloader/request_method_count/POST': 1,
'downloader/response_bytes': 184825,
'downloader/response_count': 3,
'downloader/response_status_count/200': 2,
'downloader/response_status_count/404': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2018, 6, 9, 10, 10, 17, 510745),
'log_count/DEBUG': 4,
'log_count/INFO': 7,
'memusage/max': 51392512,
'memusage/startup': 51392512,
'response_received_count': 3,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'splash/render.html/request_count': 1,
'splash/render.html/response_count/200': 1,
'start_time': datetime.datetime(2018, 6, 9, 10, 10, 0, 4160)}
2018-06-09 10:10:17 [scrapy.core.engine] INFO: Spider closed (finished)
Below are the link and the structure of the page I want to scrape:
http://gu.qq.com/hk00700/gp/income
I am quite new to web scraping; could anyone explain how I should extract the data?
Here is your data:
http://web.ifzq.gtimg.cn/appstock/hk/HkInfo/getFinReport?type=3&reporttime_type=-1&code=00700&_callback=jQuery112405223614913821484_1528544465322&_=1528544465323
Splash is not required at all: the page loads these figures from the endpoint above. Take a look, change the query parameters in the URL, and you will get the JSON response directly. Remove Splash from your setup; it isn't useful here and only increases your response time.
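For what it's worth, that endpoint returns JSONP rather than plain JSON: the payload is wrapped in the callback named by the _callback query parameter, so it has to be unwrapped before parsing. A stdlib-only sketch (the sample payload below is invented for illustration, not the endpoint's real schema):

```python
import json
import re

def unwrap_jsonp(body: str) -> dict:
    """Strip a JSONP wrapper like callback({...}); and parse the JSON inside."""
    match = re.search(r'^[\w.$]+\((.*)\)\s*;?\s*$', body, re.S)
    if not match:
        raise ValueError("not a JSONP response")
    return json.loads(match.group(1))

# Invented payload shaped like a JSONP response:
sample = 'jQuery112405223614913821484_1528544465322({"code": 0, "data": {"rows": []}});'
print(unwrap_jsonp(sample))  # {'code': 0, 'data': {'rows': []}}
```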

webpage returns 405 status code error when accessed with scrapy

I am trying to scrape the URL below with Scrapy:
https://www.realtor.ca/Residential/Single-Family/18279532/78-80-BURNDEAN-Court-Richmond-Hill-Ontario-L4C0K1-Westbrook#v=n
but it always ends up with a 405 status error. I have searched this topic, and the usual explanation is that the request method is incorrect (e.g., POST where it should be GET). But that is surely not the case here.
Here is my code for the spider:
import scrapy


class sampleSpider(scrapy.Spider):
    AUTOTHROTTLE_ENABLED = True
    name = 'test'
    start_urls = ['https://www.realtor.ca/Residential/Single-Family/18279532/78-80-BURNDEAN-Court-Richmond-Hill-Ontario-L4C0K1-Westbrook#v=n']

    def parse(self, response):
        yield {
            'response': response.body_as_unicode(),
        }
And here is the log I get when I run the scraper:
PS D:\> scrapy runspider tst.py -o tst.csv
2017-06-26 19:20:49 [scrapy.utils.log] INFO: Scrapy 1.3.0 started (bot: scrapybot)
2017-06-26 19:20:49 [scrapy.utils.log] INFO: Overridden settings: {'FEED_FORMAT': 'csv', 'FEED_URI': 'tst.csv'}
2017-06-26 19:20:49 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats']
2017-06-26 19:20:50 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-06-26 19:20:50 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-06-26 19:20:50 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-06-26 19:20:50 [scrapy.core.engine] INFO: Spider opened
2017-06-26 19:20:50 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min
)
2017-06-26 19:20:50 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-06-26 19:20:51 [scrapy.core.engine] DEBUG: Crawled (405) <GET https://www.realtor.ca/Residential/Single-Family/1827
9532/78-80-BURNDEAN-Court-Richmond-Hill-Ontario-L4C0K1-Westbrook#v=n> (referer: None)
2017-06-26 19:20:51 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <405 https://www.realtor.ca/Residential
/Single-Family/18279532/78-80-BURNDEAN-Court-Richmond-Hill-Ontario-L4C0K1-Westbrook>: HTTP status code is not handled or
not allowed
2017-06-26 19:20:51 [scrapy.core.engine] INFO: Closing spider (finished)
2017-06-26 19:20:51 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 306,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 9360,
'downloader/response_count': 1,
'downloader/response_status_count/405': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 6, 26, 13, 50, 51, 432000),
'log_count/DEBUG': 2,
'log_count/INFO': 8,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2017, 6, 26, 13, 50, 50, 104000)}
2017-06-26 19:20:51 [scrapy.core.engine] INFO: Spider closed (finished)
Any help will be very much appreciated. Thank you in advance.
I encountered a similar problem trying to scrape www.funda.nl and solved it by
1. changing the user agent (using https://pypi.org/project/scrapy-random-useragent/), and
2. using Scrapy Splash.
This may work for the website you're trying to scrape as well (although I haven't tested this).
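If you want to try the user-agent route without installing an extra package first, you can hard-code a browser-like string in settings.py. A sketch; the exact UA string below is only an example, not a string known to work for this particular site:

```python
# settings.py: sketch of the user-agent route. Scrapy's default
# User-Agent ("Scrapy/x.y (+https://scrapy.org)") is rejected by many
# sites; a browser-like string often gets past a 403/405 on the first
# request. The exact string below is only an example.
USER_AGENT = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
              'AppleWebKit/537.36 (KHTML, like Gecko) '
              'Chrome/114.0.0.0 Safari/537.36')
```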
