Scrapy not returning elements - python

I am trying in vain to retrieve data from here: https://www.etoro.com/discover/people/results. Let's say I want to get the nickname element first. It appears in the following format in the HTML source code: <div _ngcontent-bqd-c27="" automation-id="trade-item-name" class="symbol">markaungier</div>
I tried the following three approaches:
Using a CSS selector
nickname = response.css("[automation-id=trade-item-name]")
Using an XPATH relative path
nickname = response.xpath("//div[@automation-id='trade-item-name']")
Using a full XPATH
response.xpath("/html/body/ui-layout/div/div/div[2]/et-discovery-people-results/div/div/et-discovery-people-results-grid/div/div/div/et-user-card[1]/div/header/et-card-avatar/a/div[2]/div[1]")
Strangely, none of them returned anything. What's going on here? Is this a case of "Some webpages show the desired data when you load them in a web browser. However, when you download them using Scrapy, you cannot reach the desired data using selectors"?
My full code is as follows:
import scrapy
import requests
from lxml import html
from scrapy.crawler import CrawlerProcess

class EtoroSpider(scrapy.Spider):
    name = "traders"
    start_urls = [
        "https://www.etoro.com/discover/people/results",
    ]

    def parse(self, response):
        nickname = response.xpath("//div[@automation-id='trade-item-name']")
        print(nickname)

process = CrawlerProcess(settings={
    "FEEDS": {
        "items.json": {"format": "json"},
    },
})

process.crawl(EtoroSpider)
process.start()
And here is the scrapy output:
2020-10-14 16:29:08 [scrapy.utils.log] INFO: Scrapy 2.3.0 started (bot: scrapybot)
2020-10-14 16:29:08 [scrapy.utils.log] INFO: Versions: lxml 4.5.2.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.7.0 (v3.7.0:1bf9cc5093, Jun 27 2018, 04:59:51) [MSC v.1914 64 bit (AMD64)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1g 21 Apr 2020), cryptography 3.1, Platform Windows-10-10.0.18362-SP0
2020-10-14 16:29:08 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2020-10-14 16:29:08 [scrapy.crawler] INFO: Overridden settings:
{}
2020-10-14 16:29:08 [scrapy.extensions.telnet] INFO: Telnet Password: adf8b7868ee25c32
2020-10-14 16:29:08 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats']
2020-10-14 16:29:08 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-10-14 16:29:08 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-10-14 16:29:08 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-10-14 16:29:08 [scrapy.core.engine] INFO: Spider opened
2020-10-14 16:29:08 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-10-14 16:29:08 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-10-14 16:29:09 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.etoro.com/discover/people/results> (referer: None)
[]
2020-10-14 16:29:09 [scrapy.core.engine] INFO: Closing spider (finished)
2020-10-14 16:29:09 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 236,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 23288,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'elapsed_time_seconds': 0.353381,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 10, 14, 14, 29, 9, 150136),
'log_count/DEBUG': 1,
'log_count/INFO': 10,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2020, 10, 14, 14, 29, 8, 796755)}
2020-10-14 16:29:09 [scrapy.core.engine] INFO: Spider closed (finished)
EDIT
I fetched the source code seen by Scrapy using scrapy fetch --nolog https://www.etoro.com/discover/people/results > response.html and found that it contains injected JavaScript and no trace of the <div> tags above.

You can see the AJAX data being fetched in the network tab of your browser's developer tools. There are a couple of quite heavy responses in this case; most probably they contain the data you need.
So the data can be fetched from the API directly, without parsing the primary page at all.
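A minimal sketch of that approach with a plain Scrapy spider (the endpoint URL and the JSON key names below are placeholders, not eToro's actual API; copy the real request URL from the network tab and inspect its payload first):

import scrapy

class EtoroApiSpider(scrapy.Spider):
    name = "traders_api"
    # Placeholder URL -- replace it with the XHR/JSON request URL copied from
    # the browser's network tab.
    start_urls = ["https://www.etoro.com/some/json/endpoint"]

    def parse(self, response):
        data = response.json()  # Scrapy 2.2+; otherwise use json.loads(response.text)
        # The key names here are assumptions -- adjust them to the real payload.
        for user in data.get("Items", []):
            yield {"nickname": user.get("UserName")}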

Related

My Scrapy spider isn't crawling pages ('Crawled 0 pages, scraped 0 items'). Can't seem to find what's wrong

When running 'scrapy crawl newegg' in my console I'm hit with 'Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)'. I have tried looking up many fixes, none of which have worked. Any help is appreciated, thank you.
# web scrapes newegg page for product price
import scrapy

class NeweggSpider(scrapy.Spider):
    name = 'newegg'
    start_urls = [
        'https://www.newegg.com/team-32gb-288-pin-ddr4-sdram/p/N82E16820331426?Item=N82E16820331426&cm_sp=Homepage_SS-_-P0_20-331-426-_-08282022'
    ]

    def parse(self, response):
        for product in response.css('div.page-section-innder'):
            yield {
                'name': product.css('h1.prodcut-title::text').get(),
                'price': product.css('li.price-current strong::text').get()
            }
This is the console log after running 'scrapy crawl newegg'. As you can see, it crawls 0 pages and scrapes 0 items, and I can't tell from the console log what is wrong.
2022-08-28 22:12:38 [scrapy.utils.log] INFO: Scrapy 2.6.2 started (bot: newegg)
2022-08-28 22:12:38 [scrapy.utils.log] INFO: Versions: lxml 4.9.1.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 2.0.1, Twisted 22.4.0, Python 3.10.6 (tags/v3.10.6:9c7b4bd, Aug 1 2022, 21:53:49) [MSC v.1932 64 bit (AMD64)], pyOpenSSL 22.0.0 (OpenSSL 3.0.5 5 Jul 2022), cryptography 37.0.4, Platform Windows-10-10.0.19044-SP0
2022-08-28 22:12:38 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'newegg',
'NEWSPIDER_MODULE': 'newegg.spiders',
'ROBOTSTXT_OBEY': True,
'SPIDER_MODULES': ['newegg.spiders']}
2022-08-28 22:12:38 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2022-08-28 22:12:38 [scrapy.extensions.telnet] INFO: Telnet Password: 5d3a6b25365f91b1
2022-08-28 22:12:38 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2022-08-28 22:12:38 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-08-28 22:12:38 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-08-28 22:12:38 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-08-28 22:12:38 [scrapy.core.engine] INFO: Spider opened
2022-08-28 22:12:38 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-08-28 22:12:38 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-08-28 22:12:38 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.newegg.com/robots.txt> (referer: None)
2022-08-28 22:12:38 [filelock] DEBUG: Attempting to acquire lock 2120593134576 on C:\Users\Casey\AppData\Local\Programs\Python\Python310\lib\site-packages\tldextract\.suffix_cache/publicsuffix.org-tlds\de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2022-08-28 22:12:38 [filelock] DEBUG: Lock 2120593134576 acquired on C:\Users\Casey\AppData\Local\Programs\Python\Python310\lib\site-packages\tldextract\.suffix_cache/publicsuffix.org-tlds\de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2022-08-28 22:12:38 [filelock] DEBUG: Attempting to release lock 2120593134576 on C:\Users\Casey\AppData\Local\Programs\Python\Python310\lib\site-packages\tldextract\.suffix_cache/publicsuffix.org-tlds\de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2022-08-28 22:12:38 [filelock] DEBUG: Lock 2120593134576 released on C:\Users\Casey\AppData\Local\Programs\Python\Python310\lib\site-packages\tldextract\.suffix_cache/publicsuffix.org-tlds\de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2022-08-28 22:12:38 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.newegg.com/team-32gb-288-pin-ddr4-sdram/p/N82E16820331426?Item=N82E16820331426&cm_sp=Homepage_SS-_-P0_20-331-426-_-08282022> (referer: None)
2022-08-28 22:12:38 [scrapy.core.engine] INFO: Closing spider (finished)
2022-08-28 22:12:38 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 550,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 42150,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'elapsed_time_seconds': 0.25123,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2022, 8, 29, 3, 12, 38, 921346),
'httpcompression/response_bytes': 181561,
'httpcompression/response_count': 2,
'log_count/DEBUG': 7,
'log_count/INFO': 10,
'response_received_count': 2,
'robotstxt/request_count': 1,
'robotstxt/response_count': 1,
'robotstxt/response_status_count/200': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2022, 8, 29, 3, 12, 38, 670116)}
2022-08-28 22:12:38 [scrapy.core.engine] INFO: Spider closed (finished)
You have a typo in your first CSS selector. But even after fixing that, your selectors don't seem to work: they successfully grab half of the price, but neither the product name nor the other half of the price field.
A fix for the name selector is to apply it directly to the response instead of chaining it from product, and a better solution, one that grabs the whole price text, is to use an XPath expression.
For example:
# web scrapes newegg page for product price
import scrapy

class NeweggSpider(scrapy.Spider):
    name = 'newegg'
    start_urls = [
        'https://www.newegg.com/team-32gb-288-pin-ddr4-sdram/p/N82E16820331426?Item=N82E16820331426&cm_sp=Homepage_SS-_-P0_20-331-426-_-08282022'
    ]

    def parse(self, response):
        price = response.xpath("//li[@class='price-current']//*/text()").getall()
        price_string = ''.join(price)
        yield {
            "name": response.css("h1.product-title::text").get(),
            "price": price_string
        }
OUTPUT
{'name': 'Team T-FORCE DARK Za 32GB (2 x 16GB) 288-Pin PC RAM DDR4 3600 (PC4 28800) Desktop Memory (FOR AMD) Model TDZAD432G3600HC18JDC01',
'price': '89.99'}

Python and Scrapy - Scraper does not return results

Hello, and thanks in advance for any assistance with this issue I'm having. I have never posted for coding help before and I'm very new to programming. Self-taught old guy who is trying to learn something new and maybe build something to save the world (or just build something :)).
I have Scrapy fired up, and when I run my terminal command "scrapy crawl coops" I always get the DEBUG: Crawled (200) line and don't see any "Found details:" entries. I'm able to run scrapy shell "http://coopdirectory.org/directory.htm" and get results manually with the shell. When I try to yield to a .jl or .js file, they are empty as well. (I have made this work great with the Scrapy tutorial quotes.) Below is my code, and below that are the results. Any direction on this would be great.
import scrapy

class CoopsSpider(scrapy.Spider):
    name = "coops"
    start_urls = [
        'http://coopdirectory.org/directory.htm',
    ]

    def parse(self, response):
        for coop in response.css('div.coop'):
            yield {
                'name': coop.css('h3 a::text').getall(),
                'address': coop.css('.address::text').getall(),
                'website': coop.css('.web-address::text').getall(),
                'phone': coop.css('.phone::text').getall(),
                'org_type': coop.css('.org-type::text').getall(),
                'inactive': coop.css('.inactive-alert::text').getall(),
                'notes': coop.css('.note::text').getall(),
            }
RESULTS
PS C:\Users\scott\Documents\WebScrapeProjects\coops\coops> scrapy crawl coops
2020-04-28 19:08:31 [scrapy.utils.log] INFO: Scrapy 2.1.0 started (bot: coops)
2020-04-28 19:08:31 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.9, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 20.3.0, Python 3.7.6 (default, Jan 8 2020, 20:23:39) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1d 10 Sep 2019), cryptography 2.8, Platform Windows-10-10.0.18362-SP0
2020-04-28 19:08:31 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2020-04-28 19:08:31 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'coops',
'COOKIES_ENABLED': False,
'DOWNLOAD_DELAY': 7,
'NEWSPIDER_MODULE': 'coops.spiders',
'SPIDER_MODULES': ['coops.spiders'],
'USER_AGENT': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) '
'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95 '
'Safari/537.36'}
2020-04-28 19:08:31 [scrapy.extensions.telnet] INFO: Telnet Password: 3f2a17b73c1f55b3
2020-04-28 19:08:31 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2020-04-28 19:08:32 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-04-28 19:08:32 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-04-28 19:08:32 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-04-28 19:08:32 [scrapy.core.engine] INFO: Spider opened
2020-04-28 19:08:32 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-04-28 19:08:32 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-04-28 19:08:32 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://coopdirectory.org/directory.htm> (referer: None)
2020-04-28 19:08:32 [scrapy.core.engine] INFO: Closing spider (finished)
2020-04-28 19:08:32 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 316,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 303338,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'elapsed_time_seconds': 0.808906,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 4, 29, 0, 8, 32, 936826),
'log_count/DEBUG': 1,
'log_count/INFO': 10,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2020, 4, 29, 0, 8, 32, 127920)}
2020-04-28 19:08:32 [scrapy.core.engine] INFO: Spider closed (finished)
Your CSS selector ('div.coop') is not selecting anything, so nothing can be yielded inside your loop. You can test this by opening a Scrapy shell (scrapy shell "http://coopdirectory.org/directory.htm") and typing response.css('div.coop'). You will see that an empty selection ([]) is returned.
It is not clear exactly what you are trying to achieve, but here is an example that extracts all of the information you are after and simply returns it:
import scrapy

class CoopsSpider(scrapy.Spider):
    name = "coops"
    start_urls = ['http://coopdirectory.org/directory.htm']

    def parse(self, response):
        yield {
            'name': response.css('h3 a::text').getall(),
            'address': response.css('.address::text').getall(),
            'website': response.css('.web-address a::text').getall(),
            'phone': response.css('.phone::text').getall(),
            'org_type': response.css('.org-type::text').getall(),
            'inactive': response.css('.inactive-alert::text').getall(),
            'notes': response.css('.note::text').getall(),
        }
Just a hint: you might have a hard time scraping this website, because the individual co-ops are not inside individual <div> containers. This makes it somewhat difficult to separate the individual pieces of information. Maybe you can find another website that contains the information you need and is better suited for scraping, or skip scraping and use open data sets right away.
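If you do want to try it anyway, here is a rough sketch of one way to group the data, under the assumption that each co-op entry starts at an <h3> heading and its details are the sibling elements that follow it before the next heading (the real page structure may differ, so treat this only as a starting point):

import scrapy

class CoopsSpider(scrapy.Spider):
    name = "coops"
    start_urls = ['http://coopdirectory.org/directory.htm']

    def parse(self, response):
        for i, heading in enumerate(response.xpath('//h3'), start=1):
            yield {
                'name': heading.xpath('./a/text()').get(),
                # Sibling elements sitting between the i-th <h3> and the next
                # one have exactly i preceding <h3> siblings.
                'details': response.xpath(
                    f'//*[count(preceding-sibling::h3) = {i}][not(self::h3)]//text()'
                ).getall(),
            }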

Scrapy Item pipelines not enabling

I am trying to set up a Scrapy spider inside a Django app that reads info from a page and posts it to Django's SQLite database using DjangoItems.
Right now it seems that the scraper itself is working; however, it is not adding anything to the database. My guess is that this is because Scrapy is not enabling any item pipelines. Here is the log:
2019-10-05 15:23:07 [scrapy.utils.log] INFO: Scrapy 1.7.3 started (bot: scrapybot)
2019-10-05 15:23:07 [scrapy.utils.log] INFO: Versions: lxml 4.4.1.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 19.7.0, Python 3.7.4 (tags/v3.7.4:e09359112e, Jul 8 2019, 19:29:22) [MSC v.1916 32 bit (Intel)], pyOpenSSL 19.0.0 (OpenSSL 1.1.1c 28 May 2019), cryptography 2.7, Platform Windows-10-10.0.18362-SP0
2019-10-05 15:23:07 [scrapy.crawler] INFO: Overridden settings: {}
2019-10-05 15:23:07 [scrapy.extensions.telnet] INFO: Telnet Password: 6e614667b3cf5a1a
2019-10-05 15:23:07 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2019-10-05 15:23:07 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-10-05 15:23:07 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-10-05 15:23:07 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-10-05 15:23:07 [scrapy.core.engine] INFO: Spider opened
2019-10-05 15:23:07 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-10-05 15:23:08 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2019-10-05 15:23:08 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.barbora.lv/produkti/biezpiena-sierins-karums-vanilas-45-g> (referer: None)
2019-10-05 15:23:08 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.barbora.lv/produkti/biezpiena-sierins-karums-vanilas-45-g>
{'product_title': ['Biezpiena sieriņš KĀRUMS vaniļas 45g']}
2019-10-05 15:23:08 [scrapy.core.engine] INFO: Closing spider (finished)
2019-10-05 15:23:08 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 259,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 15402,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'elapsed_time_seconds': 0.418066,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2019, 10, 5, 12, 23, 8, 417204),
'item_scraped_count': 1,
'log_count/DEBUG': 2,
'log_count/INFO': 10,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2019, 10, 5, 12, 23, 7, 999138)}
2019-10-05 15:23:08 [scrapy.core.engine] INFO: Spider closed (finished)
As you can see, the scraper returns the expected value, "{'product_title': ['Biezpiena sieriņš KĀRUMS vaniļas 45g']}", but it seems it is not passed into the pipeline, because no pipelines are loaded.
I have spent several hours looking at different tutorials and trying to fix the issue, but have had no luck so far. Is there anything else I might have forgotten when setting up the scraper? Maybe it has something to do with the file structure of the project.
Here are relevant files.
items.py
from scrapy_djangoitem import DjangoItem
from product_scraper.models import Scrapelog

class ScrapelogItem(DjangoItem):
    django_model = Scrapelog
pipelines.py
class ProductInfoPipeline(object):
    def process_item(self, item, spider):
        item.save()
        yield item
settings.py
BOT_NAME = 'scraper'

SPIDER_MODULES = ['scraper.spiders']
NEWSPIDER_MODULE = 'scraper.spiders'

ROBOTSTXT_OBEY = True

ITEM_PIPELINES = {
    'scraper.pipelines.ProductInfoPipeline': 300,
}
spider product_info.py:
import scrapy
from product_scraper.scraper.scraper.items import ScrapelogItem

class ProductInfoSpider(scrapy.Spider):
    name = 'product_info'
    allowed_domains = ['www.barbora.lv']
    start_urls = ['https://www.barbora.lv/produkti/biezpiena-sierins-karums-vanilas-45-g']

    def parse(self, response):
        item = ScrapelogItem()
        item['product_title'] = response.xpath('//h1[@itemprop="name"]/text()').extract()
        return ScrapelogItem(product_title=item["product_title"])
Project file structure: (screenshot omitted)
After further tinkering and research I found out that my settings file was not properly configured (though that turned out to be only part of the problem). Based on other resources, I added these lines, which link the Django project settings with the scraper's settings:
import os
import sys

import django

sys.path.append(os.path.join(os.path.dirname(os.path.dirname(os.path.abspath(__file__))), ".."))
os.environ['DJANGO_SETTINGS_MODULE'] = 'broccoli.settings'
django.setup()
After that, the spider did not run, but this time it at least gave an error message about not finding the Django settings module. I don't remember the exact wording, but it was something like: "broccoli.settings MODULE NOT FOUND".
After some experimenting I found that moving the scraper directory "scraper" from inside the "Product_Scraper" app to the same level as the other apps dealt with this issue, and everything worked.
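For completeness, here is a sketch of launching the spider from a script so that the Scrapy project settings (and therefore ITEM_PIPELINES) are actually loaded; the empty "Overridden settings: {}" line in the log above indicates they were not:

import os

import django
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Initialise Django first so DjangoItem can import the models.
os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'broccoli.settings')
django.setup()

# get_project_settings() reads scrapy.cfg / the scraper's settings.py, so
# ITEM_PIPELINES from settings.py is applied when the crawl runs.
process = CrawlerProcess(get_project_settings())
process.crawl('product_info')  # the spider's name attribute
process.start()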

How to download (PDF) files with Python/Scrapy using the Files Pipeline?

Using Python 3.7.2 on Windows 10, I'm struggling to get Scrapy v1.5.1 to download some PDF files. I followed the docs but seem to be missing something. Scrapy gets me the desired PDF URLs but downloads nothing, and no errors are thrown (at least).
The relevant code is:
scrapy.cfg:
[settings]
default = pranger.settings
[deploy]
project = pranger
settings.py:
BOT_NAME = 'pranger'

SPIDER_MODULES = ['pranger.spiders']
NEWSPIDER_MODULE = 'pranger.spiders'

ROBOTSTXT_OBEY = True

ITEM_PIPELINES = {
    'pranger.pipelines.PrangerPipeline': 300,
    'scrapy.pipelines.files.FilesPipeline': 1,
}
FILES_STORE = r'C:\pranger_downloaded'
FILES_URLS_FIELD = 'PDF_urls'
FILES_RESULT_FIELD = 'processed_PDFs'
pranger_spider.py:
import scrapy

class IndexSpider(scrapy.Spider):
    name = "index"
    url_liste = []

    def start_requests(self):
        urls = [
            'http://verbraucherinfo.ua-bw.de/lmk.asp?ref=3',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        for menupunkt in response.css('div#aufklappmenue'):
            yield {
                'file_urls': menupunkt.css('div.aussen a.innen::attr(href)').getall()
            }
items.py:
import scrapy

class PrangerItem(scrapy.Item):
    file_urls = scrapy.Field()
    files = scrapy.Field()
All other files are as they were created by the scrapy startproject command.
The output of scrapy crawl index is:
(pranger) C:\pranger>scrapy crawl index
2019-02-20 15:45:18 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: pranger)
2019-02-20 15:45:18 [scrapy.utils.log] INFO: Versions: lxml 4.3.1.0, libxml2 2.9.9, cssselect 1.0.3, parsel 1.5.1, w3lib 1.20.0, Twisted 18.9.0, Python 3.7.2 (default, Feb 11 2019, 14:11:50) [MSC v.1915 64 bit (AMD64)], pyOpenSSL 19.0.0 (OpenSSL 1.1.1a 20 Nov 2018), cryptography 2.5, Platform Windows-10-10.0.17763-SP0
2019-02-20 15:45:18 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'pranger', 'NEWSPIDER_MODULE': 'pranger.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['pranger.spiders']}
2019-02-20 15:45:18 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2019-02-20 15:45:18 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-02-20 15:45:18 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-02-20 15:45:18 [scrapy.middleware] INFO: Enabled item pipelines:
['scrapy.pipelines.files.FilesPipeline', 'pranger.pipelines.PrangerPipeline']
2019-02-20 15:45:18 [scrapy.core.engine] INFO: Spider opened
2019-02-20 15:45:18 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-02-20 15:45:18 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2019-02-20 15:45:19 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://verbraucherinfo.ua-bw.de/robots.txt> (referer: None)
2019-02-20 15:45:19 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://verbraucherinfo.ua-bw.de/lmk.asp?ref=3> (referer: None)
2019-02-20 15:45:19 [scrapy.core.scraper] DEBUG: Scraped from <200 http://verbraucherinfo.ua-bw.de/lmk.asp?ref=3>
{'file_urls': ['https://www.lrabb.de/site/LRA-BB-Desktop/get/params_E-428807985/3287025/Ergebnisse_amtlicher_Kontrollen_nach_LFGB_Landkreis_Boeblingen.pdf', <<...and dozens more URLs...>>], 'processed_PDFs': []}
2019-02-20 15:45:19 [scrapy.core.engine] INFO: Closing spider (finished)
2019-02-20 15:45:19 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 469,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 13268,
'downloader/response_count': 2,
'downloader/response_status_count/200': 1,
'downloader/response_status_count/404': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2019, 2, 20, 14, 45, 19, 166646),
'item_scraped_count': 1,
'log_count/DEBUG': 4,
'log_count/INFO': 7,
'response_received_count': 2,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2019, 2, 20, 14, 45, 18, 864509)}
2019-02-20 15:45:19 [scrapy.core.engine] INFO: Spider closed (finished)
Oh BTW I published the code, just in case: https://github.com/R0byn/pranger/tree/5bfa0df92f21cecee18cc618e9a8e7ceea192403
The FILES_URLS_FIELD setting tells the pipeline which field of the item contains the URLs you want to download.
By default this is file_urls, but if you change the setting, you also need to change the field name (key) you store the URLs in.
So you have two options: either use the default setting, or rename your item's field to PDF_urls as well.
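For example (a sketch of both options; the parse method shown is the one from pranger_spider.py with only the key renamed):

# Option 1: in settings.py, drop the custom field names so the pipeline falls
# back to its defaults ('file_urls' in, 'files' out), which matches what the
# spider already yields:
#   FILES_URLS_FIELD = 'PDF_urls'          <- remove
#   FILES_RESULT_FIELD = 'processed_PDFs'  <- remove

# Option 2: keep those settings and rename the key the spider yields instead:
def parse(self, response):
    for menupunkt in response.css('div#aufklappmenue'):
        yield {
            'PDF_urls': menupunkt.css('div.aussen a.innen::attr(href)').getall()
        }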

Scraping javascript with 'data-reactid' content using Scrapy and Splash

I am using Scrapy + Splash to scrape some financial data from a dynamic website; however, the content is rendered dynamically (the elements carry 'data-reactid' attributes), so I don't know how to extract it.
Here is my spider:
import scrapy
from scrapy_splash import SplashRequest

class StocksSpider(scrapy.Spider):
    name = 'stocks'
    allowed_domains = ['gu.qq.com']
    start_urls = ['http://gu.qq.com/hk00700/gp/income/']

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url=url, callback=self.parse,
                args={
                    'wait': 0.5,
                },
                endpoint='render.html',
            )

    def parse(self, response):
        for data in response.css("div.mod-detail write gb_con submodule finance-report"):
            yield {
                'table': data.css("table.fin-table.tbody.tr.td::text").extract()
            }
I tried to export the result to CSV using the command below, but nothing was stored in the CSV file:
scrapy crawl stocks -o stocks.csv
Here is the log after running this command:
root#localhost:~/finance/finance/spiders# scrapy crawl stocks -o stocks.csv
2018-06-09 10:09:59 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: finance)
2018-06-09 10:09:59 [scrapy.utils.log] INFO: Versions: lxml 4.2.1.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.4.0, w3lib 1.19.0, Twisted 18.4.0, Python 2.7.12 (default, Dec 4 2017, 14:50:18) - [GCC 5.4.0 20160609], pyOpenSSL 18.0.0 (OpenSSL 1.1.0h 27 Mar 2018), cryptography 2.2.2, Platform Linux-4.15.13-x86_64-linode106-x86_64-with-Ubuntu-16.04-xenial
2018-06-09 10:09:59 [scrapy.crawler] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'finance.spiders', 'FEED_URI': 'stocks.csv', 'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter', 'SPIDER_MODULES': ['finance.spiders'], 'BOT_NAME': 'finance', 'ROBOTSTXT_OBEY': True, 'FEED_FORMAT': 'csv'}
2018-06-09 10:09:59 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats']
2018-06-09 10:09:59 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy_splash.SplashCookiesMiddleware',
'scrapy_splash.SplashMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-06-09 10:09:59 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy_splash.SplashDeduplicateArgsMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-06-09 10:09:59 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-06-09 10:09:59 [scrapy.core.engine] INFO: Spider opened
2018-06-09 10:10:00 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-06-09 10:10:00 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-06-09 10:10:00 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://gu.qq.com/robots.txt> (referer: None)
2018-06-09 10:10:00 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://localhost:8050/robots.txt> (referer: None)
2018-06-09 10:10:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://gu.qq.com/hk00700/gp/income/ via http://localhost:8050/render.html> (referer: None)
2018-06-09 10:10:17 [scrapy.core.engine] INFO: Closing spider (finished)
2018-06-09 10:10:17 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 962,
'downloader/request_count': 3,
'downloader/request_method_count/GET': 2,
'downloader/request_method_count/POST': 1,
'downloader/response_bytes': 184825,
'downloader/response_count': 3,
'downloader/response_status_count/200': 2,
'downloader/response_status_count/404': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2018, 6, 9, 10, 10, 17, 510745),
'log_count/DEBUG': 4,
'log_count/INFO': 7,
'memusage/max': 51392512,
'memusage/startup': 51392512,
'response_received_count': 3,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'splash/render.html/request_count': 1,
'splash/render.html/response_count/200': 1,
'start_time': datetime.datetime(2018, 6, 9, 10, 10, 0, 4160)}
2018-06-09 10:10:17 [scrapy.core.engine] INFO: Spider closed (finished)
And below is the link and the web structure that I want to scrape:
http://gu.qq.com/hk00700/gp/income
I am quite new to web scraping. Could anyone explain how I should extract the data?
Here is your data:
http://web.ifzq.gtimg.cn/appstock/hk/HkInfo/getFinReport?type=3&reporttime_type=-1&code=00700&_callback=jQuery112405223614913821484_1528544465322&_=1528544465323
Splash is not required anywhere here. Just take a look: change the query parameters in the URL and you will get a JSON response. Remove the Splash browser; it isn't useful at all in this case and will only increase your response time.
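A rough sketch of that approach with plain Scrapy and no Splash (the response is JSONP, wrapped in the callback named by the _callback parameter, so the wrapper is stripped before parsing; the exact shape of the JSON payload is an assumption, so inspect it once in the browser and adjust):

import json
import re

import scrapy

class StocksApiSpider(scrapy.Spider):
    name = 'stocks_api'
    start_urls = [
        'http://web.ifzq.gtimg.cn/appstock/hk/HkInfo/getFinReport'
        '?type=3&reporttime_type=-1&code=00700'
        '&_callback=jQuery112405223614913821484_1528544465322&_=1528544465323'
    ]

    def parse(self, response):
        # Strip the "callbackName(...)" JSONP wrapper and parse the JSON inside.
        body = re.sub(r'^[^(]*\(|\)\s*;?\s*$', '', response.text)
        data = json.loads(body)
        yield {'report': data}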
