Here is my spider:
import scrapy

class PhonesCDSpider(scrapy.Spider):
    name = "phones_CD"
    custom_settings = {
        "FEEDS": {
            "Spiders/spiders/cd.json": {"format": "json"},
        },
    }
    start_urls = [
        'https://www.cdiscount.com/telephonie/telephone-mobile/smartphones/tous-nos-smartphones/l-144040211.html'
    ]

    def parse(self, response):
        for phone in response.css('div.prdtBlocInline.jsPrdtBlocInline'):
            phone_url = phone.css('div.prdtBlocInline.jsPrdtBlocInline a::attr(href)').get()
            # go to the phone page
            yield response.follow(phone_url, callback=self.parse_phone)

    def parse_phone(self, response):
        yield {
            'title': response.css('h1::text').get(),
            'price': response.css('span.fpPrice.price.jsMainPrice.jsProductPrice.hideFromPro::attr(content)').get(),
            'EAN': response.css('script').getall(),
            'image_url': response.css('div.fpMainImg a::attr(href)').get(),
            'url': response.url
        }
If I start it in the terminal with scrapy crawl phones_CD -O test.json, it works fine. But if I run it from my Python script (where the other crawlers work and are configured the same way):
def all_crawlers():
    process = CrawlerProcess()
    process.crawl(PhonesCBSpider)
    process.crawl(PhonesKFSpider)
    process.crawl(PhonesMMSpider)
    process.crawl(PhonesCDSpider)
    process.start()

all_crawlers()
I get an error; here is the traceback:
2021-01-05 18:16:06 [scrapy.core.engine] INFO: Spider opened
2021-01-05 18:16:06 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2021-01-05 18:16:06 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6026
2021-01-05 18:16:06 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.cdiscount.com/telephonie/telephone-mobile/smartphones/tous-nos-smartphones/l-144040211.html> (referer: None)
2021-01-05 18:16:07 [scrapy.core.engine] INFO: Closing spider (finished)
Thanks in advance for your time!
According to the Scrapy feed-exports docs, the FEEDS setting does not support a relative path like your "Spiders/spiders/cd.json".
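One way to follow that advice is to build an absolute path in the spider itself, for example with pathlib. A minimal sketch modelled on the spider above; the output filename is just an illustration:

# Sketch: point the feed at an absolute path so the output location does not
# depend on the directory the script is started from (filename is illustrative).
import pathlib

import scrapy

OUTPUT_FILE = pathlib.Path(__file__).resolve().parent / "cd.json"

class PhonesCDSpider(scrapy.Spider):
    name = "phones_CD"
    custom_settings = {
        "FEEDS": {
            str(OUTPUT_FILE): {"format": "json"},
        },
    }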
I have tried several settings, such as delaying the download time; the console does not show an error, and the selectors return the correct data in the Scrapy shell.
The site uses a different prefix on the domain (slist.amiami.jp); could this be the cause?
I tried several variations of domains and URLs, but they all result in the same response with no data returned.
Any idea why it is not collecting any data for the -o CSV file? Thank you for any advice.
The expected output is the JAN code and category text from the product page.
2021-05-13 23:59:35 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2021-05-13 23:59:35 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6026
2021-05-13 23:59:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://example.jp/top/search/list?s_keywords=4967834601246> (referer: None)
2021-05-13 23:59:46 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.example.jp/top/detail/detail?gcode=TOY-SCL-05454> (referer: https://example.jp/top/search/list?s_keywords=4967834601246)
2021-05-13 23:59:50 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://example.jp/top/search/list?s_keywords=4543736302216> (referer: None)
2021-05-14 00:00:04 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://example.jp/top/search/list?s_keywords=44536318620013> (referer: None)
2021-05-14 00:00:04 [scrapy.core.engine] INFO: Closing spider (finished)
2021-05-14 00:00:04 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1115,
'downloader/request_count': 4,
'downloader/request_method_count/GET': 4,
'elapsed_time_seconds': 29.128242,
'finish_reason': 'finished',
import scrapy

class exampledataSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['example.jp']
    start_urls = [
        'https://example.jp/top/search/list?s_keywords=4967834601246',
        'https://example.jp/top/search/list?s_keywords=4543736302216',
        'https://example.jp/top/search/list?s_keywords=44536318620013',
    ]

    def parse(self, response):
        for link in response.css('div.product_box a::attr(href)'):
            yield response.follow(link.get(), callback=self.item)

    def item(self, response):
        products = response.css('div.maincontents')
        for product in products:
            yield {
                'JAN': product.css('dd.jancode::text').getall(),
                'title': product.css('div.pankuzu a::text').getall()
            }
It seems the products = response.css('div.maincontents') selector was incorrect, and I had to do two separate parent/child requests for the data.
It also turns out you can simply yield the elements as lists:
'''
def output(self, response):
    yield {
        'firstitem': response.css('example td:nth-of-type(2)::text').getall(),
        'seconditem': response.css('example td:nth-of-type(2)::text').getall(),
        'thrditem': response.css('example td:nth-of-type(2)::text').getall()
    }
'''
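For reference, the parent/child idea mentioned above usually looks like the sketch below: select the parent block once, then query each field relative to it. The child selectors are the ones from the question; the parent selector 'div.product_detail' is a placeholder and would need to match the real page:

# Minimal sketch of the parent/child pattern: iterate over a parent block and
# query each field relative to it ('div.product_detail' is a hypothetical parent).
def item(self, response):
    for product in response.css('div.product_detail'):
        yield {
            'JAN': product.css('dd.jancode::text').get(),
            'title': product.css('div.pankuzu a::text').getall(),
        }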
2021-05-07 10:07:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://tampa.craigslist.org/robots.txt> (referer: None)
2021-05-07 10:07:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://tampa.craigslist.org/d/cell-phones/search/moa/> (referer: None)
2021-05-07 10:07:19 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://tampa.craigslist.org/d/cell-phones/search/moa?s=120> (referer: https://tampa.craigslist.org/d/cell-phones/search/moa/)
2021-05-07 10:07:21 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://tampa.craigslist.org/d/cell-phones/search/moa?s=240> (referer: https://tampa.craigslist.org/d/cell-phones/search/moa?s=120)
This is the output I get; it seems like it just moves to the next page of results, which is done by selecting the next button and issuing a request in line 27.
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule, Request
from craig.items import CraigItem
from scrapy.selector import Selector

class PhonesSpider(scrapy.Spider):
    name = 'phones'
    allowed_domains = ['tampa.craigslist.org']
    start_urls = ['https://tampa.craigslist.org/d/cell-phones/search/moa/']

    def parse(self, response):
        phones = response.xpath('//p[@class="result-info"]')
        for phone in phones:
            relative_url = phone.xpath('a/@href').extract_first()
            absolute_url = response.urljoin(relative_url)
            title = phone.xpath('a/text()').extract_first()
            price = phone.xpath('//*[@id="sortable-results"]/ul/li[3]/a/span').extract_first()
            yield Request(absolute_url, callback=self.parse_item, meta={'URL': absolute_url, 'Title': title, 'price': price})
        relative_next_url = response.xpath('//a[@class="button next"]/@href').extract_first()
        absolute_next_url = "https://tampa.craigslist.org" + relative_next_url
        yield Request(absolute_next_url, callback=self.parse)

    def parse_item(self, response):
        item = CraigItem()
        item["cl_id"] = response.meta.get('Title')
        item["price"] = response.meta.get
        absolute_url = response.meta.get('URL')
        yield {'URL': absolute_url, 'Title': title, 'price': price}
It seems that in my code the for phone in phones loop doesn't run, which results in parse_item never running and the spider just continuing to request the next URL. I am following some tutorials and reading the documentation, but I'm still having trouble grasping what I am doing wrong. I have experience coding Arduinos as a hobby when I was young, but no professional coding experience; this is my first foray into a project like this, and I have an OK grasp of the basics of loops, functions, callbacks, etc.
Any help is greatly appreciated.
UPDATE
Current output:
2021-05-07 15:29:32 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://tampa.craigslist.org/robots.txt> (referer: None)
2021-05-07 15:29:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://tampa.craigslist.org/d/cell-phones/search/moa/> (referer: None)
2021-05-07 15:29:33 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET https://tampa.craigslist.org/hil/mob/d/tampa-cut-that-high-cable-bill-switch/7309734640.html> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2021-05-07 15:29:36 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://tampa.craigslist.org/hil/mob/d/tampa-cut-that-high-cable-bill-switch/7309734640.html> (referer: https://tampa.craigslist.org/d/cell-phones/search/moa/)
2021-05-07 15:29:36 [scrapy.core.scraper] DEBUG: Scraped from <200 https://tampa.craigslist.org/hil/mob/d/tampa-cut-that-high-cable-bill-switch/7309734640.html>
{'cl_id': 'postid_7309734640',
'price': '$35',
'title': 'Cut that high cable bill, switch to SPC TV and save. 1400 hd '
'channels',
'url': 'https://tampa.craigslist.org/hil/mob/d/tampa-cut-that-high-cable-bill-switch/7309734640.html'}
2021-05-07 15:29:36 [scrapy.core.engine] INFO: Closing spider (finished)
CURRENT CODE
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule, Request
from craig.items import CraigItem
from scrapy.selector import Selector

class PhonesSpider(scrapy.Spider):
    name = 'phones'
    allowed_domains = ['tampa.craigslist.org']
    start_urls = ['https://tampa.craigslist.org/d/cell-phones/search/moa/']
    base_url = 'https://tampa.craigslist.org'

    def parse(self, response):
        phones = response.xpath('//div[@class="result-info"]')
        for phone in phones:
            x = response.meta.get('x')
            n = -1
            url = response.xpath('//a[@class="result-title hdrlnk"]/@href').getall()
            relative_url = phone.xpath('//a[@class="result-title hdrlnk"]/@href').get()
            absolute_url = response.urljoin(relative_url)
            title = phone.xpath('//a[@class="result-title hdrlnk"]/text()').getall()
            price = phone.xpath('//span[@class="result-price"]/text()').getall()
            cl_id = phone.xpath('//a[@class="result-title hdrlnk"]/@id').getall()
            yield Request(absolute_url, callback=self.parse_item, meta={'absolute_url': absolute_url, 'url': url, 'title': title, 'price': price, 'cl_id': cl_id, 'n': n})

    def parse_item(self, response):
        n = response.meta.get('n')
        x = n + 1
        item = CraigItem()
        item["title"] = response.meta.get('title')[x]
        item["cl_id"] = response.meta.get('cl_id')[x]
        item["price"] = response.meta.get('price')[x]
        item["url"] = response.meta.get('url')[x]
        yield item
        absolute_next_url = response.meta.get('url')[x]
        absolute_url = response.meta.get('absolute_url')
        yield Request(absolute_next_url, callback=self.parse, meta={'x': x})
I am now able to retrieve the desired content for a posting: URL, price, title and craigslist ID. However, my spider now automatically closes after pulling just one result. I am having trouble understanding how to share the variables (x and n) between the two functions. Logically, after pulling one listing's data, as above, in the format
cl_id
Price
title
url
I would like to go back to the initial parse function and move on to the next item in the list of URLs retrieved by
response.xpath('//a[@class="result-title hdrlnk"]/@href').getall()
which (when run in the Scrapy shell) successfully pulls all the URLs.
How do I implement this logic: start with [0] in the list, run parse, then parse_item, output the item, then update a variable (n, which starts at 0 and needs to increase by 1 after each item), then call n in parse_item with its updated value and use, for example, item["title"] = response.meta.get('title')[x] to refer to the list of URLs and pick which position to select, then run parse_item again, outputting one result at a time, until every value in the URL list has been output with its related price, cl_id and title?
I know the code is messy as hell and I don't fully understand the basics yet, but I'm committed to getting this to work and learning it the hard way rather than starting from the ground up with Python.
The result-info class is used on the div block, so you should write:
phones = response.xpath('//div[@class="result-info"]')
That being said, I didn't check/fix your spider further (it seems there are only parsing errors, not functional ones).
As a suggestion for the future, you can use the Scrapy shell to quickly debug such issues:
scrapy shell "your-url-here"
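For the "one item per listing" flow asked about in the update, a common pattern is to keep the selectors relative to each result (note the leading dot in the XPaths) and pass that result's fields through meta, instead of indexing into lists. A rough sketch built from the selectors in the question, not a tested spider:

# Rough sketch: one request and one item per listing, using XPaths relative to
# each result block; selectors are taken from the question and may need tweaking.
def parse(self, response):
    for phone in response.xpath('//div[@class="result-info"]'):
        link = phone.xpath('.//a[@class="result-title hdrlnk"]')
        yield response.follow(
            link.xpath('./@href').get(),
            callback=self.parse_item,
            meta={
                'title': link.xpath('./text()').get(),
                'price': phone.xpath('.//span[@class="result-price"]/text()').get(),
                'cl_id': link.xpath('./@id').get(),
            },
        )
    next_page = response.xpath('//a[@class="button next"]/@href').get()
    if next_page:
        yield response.follow(next_page, callback=self.parse)

def parse_item(self, response):
    yield {
        'url': response.url,
        'title': response.meta.get('title'),
        'price': response.meta.get('price'),
        'cl_id': response.meta.get('cl_id'),
    }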
I was wondering why Scrapy is not extracting data from the Best Buy website. Is there anything wrong with my code?
import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'bestbuy'
    start_url = ['https://www.bestbuy.com/site/promo/newly-discounted-outlet-products']

    def parse(self, response):
        title = response.css('div.sku-title a::text').extract()
        yield title
These are my results when using scrapy crawl bestbuy -o bestbuy.csv:
2020-02-10 06:04:22 [scrapy.core.engine] INFO: Spider opened
2020-02-10 06:04:22 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-02-10 06:04:22 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-02-10 06:04:22 [scrapy.core.engine] INFO: Closing spider (finished)
2020-02-10 06:04:22 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'elapsed_time_seconds': 0.017988,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 2, 10, 12, 4, 22, 251711),
'log_count/INFO': 10,
'start_time': datetime.datetime(2020, 2, 10, 12, 4, 22, 233723)}
2020-02-10 06:04:22 [scrapy.core.engine] INFO: Spider closed (finished)
The reason it was working in the shell but not in your code is that you forgot the 's' at the end of 'start_urls'.
This should work:
import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'bestbuy'
    start_urls = [
        'https://www.bestbuy.com/site/promo/newly-discounted-outlet-products']

    def parse(self, response):
        for title in response.css('h4 > a::text').getall():
            yield {"title": title}
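For context on why the typo fails silently: the base Spider only reads the start_urls attribute when generating the first requests, so an attribute named start_url is simply never looked at. Roughly (a simplified paraphrase of the default behaviour, not Scrapy's exact source):

# Roughly what scrapy.Spider does by default (simplified): only `start_urls`
# is consulted, so a misspelled `start_url` attribute is never read and the
# spider closes immediately with nothing crawled.
def start_requests(self):
    for url in self.start_urls:
        yield scrapy.Request(url, dont_filter=True)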
I have a Crawler like this:
from scrapy.spiders import CrawlSpider, Rule

# INFO_LINKS_EXTRACTOR is defined in helpers.py (shown below);
# SportsScraperItem is the project's item class.

class SkySpider(CrawlSpider):
    name = "spider_v1"
    allowed_domains = [
        "atsu.edu",
    ]
    start_urls = [
        "http://www.atsu.edu",
    ]
    rules = (
        Rule(
            INFO_LINKS_EXTRACTOR,
            follow=True,
            callback='parse_item',
        ),
    )

    def parse_item(self, response):
        print("ENTERED!")
        item = SportsScraperItem()
        item["contact"] = self._parse_contact(response)
        return item
In my helpers.py I have:
from scrapy.linkextractors import LinkExtractor


def _r(string):
    return f"(.*?)(\b{string}\b)(.*)"


INFO_LINKS_EXTRACTOR = LinkExtractor(
    allow=(
        _r('about'),
    ),
    unique=True,
)
I know that atsu.edu has a link https://www.atsu.edu/about-atsu/, but my extractor does not seem to see it, and the parse_item() method is never run. What am I doing wrong here?
EDIT 1:
Logs:
2019-10-01 15:40:58 [scrapy.core.engine] INFO: Spider opened
2019-10-01 15:40:58 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-10-01 15:40:58 [steppersspider_v1] INFO: Spider opened: steppersspider_v1
2019-10-01 15:40:58 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2019-10-01 15:40:59 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.atsu.edu/robots.txt> from <GET http://WWW.ATSU.EDU/robots.txt>
2019-10-01 15:41:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.atsu.edu/robots.txt> (referer: None)
2019-10-01 15:41:11 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.atsu.edu/> from <GET http://WWW.ATSU.EDU>
2019-10-01 15:41:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.atsu.edu/robots.txt> (referer: None)
2019-10-01 15:41:19 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.atsu.edu/> (referer: None)
2019-10-01 15:41:19 [steppersspider_v1] DEBUG: Saved file steppers-www.atsu.edu.html
2019-10-01 15:41:20 [scrapy.core.engine] INFO: Closing spider (finished)
2019-10-01 15:41:20 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
EDIT 2
Here is how I tested this regexp on regex101.com.
EDIT 3
Working function for regexp:
def _r(string):
    return r"^(.*?)(\b{string}\b)(.*)$".format(string=string)
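The difference is that in a normal (non-raw) f-string, \b is the backspace character rather than the regex word-boundary escape, so the original pattern could never match. A quick check:

# Quick check: the non-raw f-string embeds a literal backspace (\x08) where the
# regex word boundary \b was intended; the raw string keeps the backslash.
string = "about"
broken = f"(.*?)(\b{string}\b)(.*)"
fixed = r"^(.*?)(\b{string}\b)(.*)$".format(string=string)
print(repr(broken))  # '(.*?)(\x08about\x08)(.*)'
print(repr(fixed))   # '^(.*?)(\\babout\\b)(.*)$'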
By default, link extractors only search for a and area tags. The links you are looking for seem to be in li tags.
You need to pass the tags parameter to the constructor of your link extractor with the desired tags. For example:
tags=('a', 'area', 'li')
See https://doc.scrapy.org/en/latest/topics/link-extractors.html#module-scrapy.linkextractors.lxmlhtml
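Applied to the extractor from the question, that suggestion would look something like this (a sketch, not tested against the site):

# Sketch: same extractor as in helpers.py, but telling the LinkExtractor to
# also look inside <li> tags, as suggested above.
from scrapy.linkextractors import LinkExtractor

INFO_LINKS_EXTRACTOR = LinkExtractor(
    allow=(
        _r('about'),
    ),
    tags=('a', 'area', 'li'),
    attrs=('href',),
    unique=True,
)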
I am trying to scrape MichaelKors.com. I had success until now; my script just stopped working. Callback functions are not being fired. I have removed everything from my functions and even then they are not being called. Here is my code:
class MichaelKorsClass(CrawlSpider):
    name = 'michaelkors'
    allowed_domains = ['www.michaelkors.com']
    start_urls = ['https://www.michaelkors.com/women/clothing/dresses/_/N-28ei']
    rules = (
        # Rule(LinkExtractor(allow=('(.*\/_\/R-\w\w_)([\-a-zA-Z0-9]*)$', ), deny=('((.*investors.*)|(/info/)|(contact\-us)|(checkout))', )), callback='parse_product'),
        Rule(LinkExtractor(allow=('(.*\/_\/)(N-[\-a-zA-Z0-9]*)$',),
                           deny=('((.*investors.*)|(/info/)|(contact\-us)|(checkout) | (gifts))',),), callback='parse_list'),
    )

    def parse_product(self, response):
        self.log("HIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII")

    def parse_list(self, response):
        hxs = HtmlXPathSelector(response)
        url = response.url
        self.log("Helloww")
        is_listing_page = False
        product_count = hxs.select('//span[@class="product-count"]/text()').get()
        # print(re.findall('\d+', pc))
        try:
            product_count = int(product_count)
            is_listing_page = True
        except:
            is_listing_page = False
        if is_listing_page:
            for product_url in response.xpath('//ul[@class="product-wrapper product-wrapper-four-tile"]//li[@class="product-name-container"]/a/@href').getall():
                yield scrapy.Request(response.urljoin(product_url), callback=self.parse_product)
And here is the log:
2019-07-29 11:25:50 [scrapy.core.engine] INFO: Spider opened
2019-07-29 11:25:50 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-07-29 11:25:50 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2019-07-29 11:25:52 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.michaelkors.com/women/clothing/dresses/_/N-28ei> (referer: None)
2019-07-29 11:25:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.michaelkors.com/sale/view-all-sale/_/N-28zn> (referer: https://www.michaelkors.com/women/clothing/dresses/_/N-28ei)
2019-07-29 11:25:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.michaelkors.com/women/clothing/jumpsuits/_/N-18bkjwa> (referer: https://www.michaelkors.com/women/clothing/dresses/_/N-28ei)
2019-07-29 11:25:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.michaelkors.com/women/clothing/t-shirts-sweatshirts/_/N-10dkew5> (referer: https://www.michaelkors.com/women/clothing/dresses/_/N-28ei)
....
Neither "Helloww" nor the "HIIII..." message is printed.
Edit 1: I copied my script to another project and it works fine. I still don't know what the problem was.
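One way to narrow this kind of problem down is to run the rule's LinkExtractor by hand in the Scrapy shell and check whether it matches anything on the start page. A quick sketch, with the regexes copied from the question:

# In `scrapy shell 'https://www.michaelkors.com/women/clothing/dresses/_/N-28ei'`:
from scrapy.linkextractors import LinkExtractor

le = LinkExtractor(
    allow=(r'(.*\/_\/)(N-[\-a-zA-Z0-9]*)$',),
    deny=(r'((.*investors.*)|(/info/)|(contact\-us)|(checkout) | (gifts))',),
)
links = le.extract_links(response)  # `response` is provided by the shell
print(len(links))
for link in links[:10]:
    print(link.url)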