Why does Scrapy say that I'm iterating over an 'ItemMeta' object? - python

I am trying to scrape a website to learn a little more about how Scrapy works. I have a little experience with the packages requests and bs4 (BeautifulSoup). I am working in a miniconda3 environment on my Ubuntu 20.04.1 LTS machine, and I use Python 3.7.
I have created an item named 'PostscrapeItem' which has only one attribute: full_text = scrapy.Field(). I have not touched the structure of the project that was automatically created by Scrapy.
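For reference, the items.py in the generated project presumably looks like this minimal sketch (only the full_text field described above is assumed):

import scrapy

class PostscrapeItem(scrapy.Item):
    # single field that will hold the concatenated text of the <em> tags
    full_text = scrapy.Field()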
I have made a spider which is only supposed to find occurrences of an HTML tag ('em') on this webpage: https://blog.scrapinghub.com/page/1/
Here is the code of my spider:
import scrapy
from bs4 import BeautifulSoup
from postscrape.items import PostscrapeItem

class PostSpider(scrapy.Spider):
    name = "posts"
    start_urls = [
        'https://blog.scrapinghub.com/page/1/'
    ]

    def parse(self, response):
        so = BeautifulSoup(response.text, 'html.parser')
        item = PostscrapeItem()
        if so.find('em'):
            concatenated = ""
            text_samples = so.find_all('em')
            for t_s in text_samples:
                concatenated += t_s.text
            item['full_text'] = concatenated
        return PostscrapeItem
The problem I have is that I get an error when I run this code with 'scrapy crawl posts' in my terminal, and it says: "TypeError: 'ItemMeta' object is not iterable". With the little I think I know, the only ItemMeta present in my program is the object PostscrapeItem. It seems to me that I am not iterating over this object in my code. That's why I am asking you.
Here is the complete error message:
Traceback (most recent call last):
File "/home/luc/.local/lib/python3.7/site-packages/scrapy/utils/defer.py",
line 117, in iter_errback
yield next(it)
File "/home/luc/.local/lib/python3.7/site-packages/scrapy/utils/python.py", line 345, in __next__
return next(self.data)
File "/home/luc/.local/lib/python3.7/site-packages/scrapy/utils/python.py", line 345, in __next__
return next(self.data)
File "/home/luc/.local/lib/python3.7/site-packages/scrapy/core/spidermw.py", line 64, in _evaluate_iterable
for r in iterable:
File "/home/luc/.local/lib/python3.7/site-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
for x in result:
File "/home/luc/.local/lib/python3.7/site-packages/scrapy/core/spidermw.py", line 64, in _evaluate_iterable
for r in iterable:
File "/home/luc/.local/lib/python3.7/site-packages/scrapy/spidermiddlewares/referer.py", line 338, in <genexpr>
return (_set_referer(r) for r in result or ())
File "/home/luc/.local/lib/python3.7/site-packages/scrapy/core/spidermw.py", line 64, in _evaluate_iterable
for r in iterable:
File "/home/luc/.local/lib/python3.7/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
return (r for r in result or () if _filter(r))
File "/home/luc/.local/lib/python3.7/site-packages/scrapy/core/spidermw.py", line 64, in _evaluate_iterable
for r in iterable:
File "/home/luc/.local/lib/python3.7/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
return (r for r in result or () if _filter(r))
File "/home/luc/.local/lib/python3.7/site-packages/scrapy/core/spidermw.py", line 64, in _evaluate_iterable
for r in iterable:
TypeError: 'ItemMeta' object is not iterable
Thank you in advance and let me know how to improve the clarity and the quality of my questions.
Luc

You're not returning an item; you're returning the item class itself.
Scrapy tries to iterate it when it's returned from the spider, so you get your TypeError.
Simply correcting the last line to return item should fix your code.
As a side note, Scrapy has its own parsing utilities, so there's no need to import and use BeautifulSoup.
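For example, a minimal sketch of the same spider using only Scrapy's built-in selectors (untested against the page; '::text' and getall() are standard Scrapy/parsel APIs):

import scrapy
from postscrape.items import PostscrapeItem

class PostSpider(scrapy.Spider):
    name = "posts"
    start_urls = ['https://blog.scrapinghub.com/page/1/']

    def parse(self, response):
        # response.css() returns a SelectorList; '::text' selects the text nodes
        em_texts = response.css('em::text').getall()
        if em_texts:
            item = PostscrapeItem()
            item['full_text'] = "".join(em_texts)
            yield item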

As per @stranac's answer, I have corrected the full code and it works.
import scrapy
from bs4 import BeautifulSoup

class PostscrapeItem(scrapy.Item):
    full_text = scrapy.Field()

class PostSpider(scrapy.Spider):
    name = "posts"
    start_urls = [
        'https://blog.scrapinghub.com/page/1/'
    ]

    def parse(self, response):
        so = BeautifulSoup(response.text, 'html.parser')
        item = PostscrapeItem()
        if so.find('em'):
            concatenated = ""
            text_samples = so.find_all('em')
            for t_s in text_samples:
                concatenated += t_s.text
            item['full_text'] = concatenated
        return item

Related

Scraper not getting total data

I have a .py scraper, and when it runs it works fine, but it is not getting 100% of the data. I'm getting a lot of errors like this:
2022-05-05 20:53:39 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.justforsport.com.ar/buzo-hombre-361-degrees-y2201my002a-urban-1-gris/p> (referer: https://www.justforsport.com.ar/hombre?page=3)
Traceback (most recent call last):
File "C:\Users\User\Desktop\Personal\DABRA\Scraper_jfs\venv\lib\site-packages\scrapy\utils\defer.py", line 120, in iter_errback
yield next(it)
File "C:\Users\User\Desktop\Personal\DABRA\Scraper_jfs\venv\lib\site-packages\scrapy\utils\python.py", line 353, in __next__
return next(self.data)
File "C:\Users\User\Desktop\Personal\DABRA\Scraper_jfs\venv\lib\site-packages\scrapy\utils\python.py", line 353, in __next__
return next(self.data)
File "C:\Users\User\Desktop\Personal\DABRA\Scraper_jfs\venv\lib\site-packages\scrapy\core\spidermw.py", line 56, in _evaluate_iterable
for r in iterable:
File "C:\Users\User\Desktop\Personal\DABRA\Scraper_jfs\venv\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 29, in process_spider_output
for x in result:
File "C:\Users\User\Desktop\Personal\DABRA\Scraper_jfs\venv\lib\site-packages\scrapy\core\spidermw.py", line 56, in _evaluate_iterable
for r in iterable:
File "C:\Users\User\Desktop\Personal\DABRA\Scraper_jfs\venv\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 342, in <genexpr>
return (_set_referer(r) for r in result or ())
File "C:\Users\User\Desktop\Personal\DABRA\Scraper_jfs\venv\lib\site-packages\scrapy\core\spidermw.py", line 56, in _evaluate_iterable
for r in iterable:
File "C:\Users\User\Desktop\Personal\DABRA\Scraper_jfs\venv\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 40, in <genexpr>
return (r for r in result or () if _filter(r))
File "C:\Users\User\Desktop\Personal\DABRA\Scraper_jfs\venv\lib\site-packages\scrapy\core\spidermw.py", line 56, in _evaluate_iterable
for r in iterable:
File "C:\Users\User\Desktop\Personal\DABRA\Scraper_jfs\venv\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
return (r for r in result or () if _filter(r))
File "C:\Users\User\Desktop\Personal\DABRA\Scraper_jfs\venv\lib\site-packages\scrapy\core\spidermw.py", line 56, in _evaluate_iterable
for r in iterable:
File "c:\Users\User\Desktop\Personal\DABRA\Scraper_jfs\just_for_sport\just_for_sport\spiders\jfs_hombre.py", line 41, in parse_article_detail
precio0=response.css('span.vtex-product-price-1-x-currencyContainer.vtex-product-price-1-x-currencyContainer--product')[0]
File "C:\Users\User\Desktop\Personal\DABRA\Scraper_jfs\venv\lib\site-packages\parsel\selector.py", line 70, in __getitem__
o = super(SelectorList, self).__getitem__(pos)
IndexError: list index out of range
This is my script:
import scrapy
from scrapy_splash import SplashRequest
from concurrent.futures import process
from scrapy.crawler import CrawlerProcess
from datetime import datetime
import os

if os.path.exists('jfs_hombre.csv'):
    os.remove('jfs_hombre.csv')
    print("The file has been deleted successfully")
else:
    print("The file does not exist!")

class JfsSpider_hombre(scrapy.Spider):
    name = 'jfs_hombre'
    start_urls = ["https://www.justforsport.com.ar/hombre?page=1"]

    def parse(self, response):
        total_products = int(int(response.css('div.vtex-search-result-3-x-totalProducts--layout.pv5.ph9.bn-ns.bt-s.b--muted-5.tc-s.tl.t-action--small span::text').get()) / 27) + 1
        for count in range(1, total_products):
            yield SplashRequest(url=f'https://www.justforsport.com.ar/hombre?page={count}',
                                callback=self.parse_links)

    def parse_links(self, response):
        links = response.css('a.vtex-product-summary-2-x-clearLink.vtex-product-summary-2-x-clearLink--shelf-product.h-100.flex.flex-column::attr(href)').getall()
        for link in links:
            yield SplashRequest(response.urljoin('https://www.justforsport.com.ar' + link), self.parse_article_detail)

    def parse_article_detail(self, response):
        precio0 = response.css('span.vtex-product-price-1-x-currencyContainer.vtex-product-price-1-x-currencyContainer--product')[0]
        yield {
            'Casa': 'Just_For_Sports',
            'Sku': response.css('span.vtex-product-identifier-0-x-product-identifier__value::text').get(),
            'Name': response.css('span.vtex-store-components-3-x-productBrand::text').get(),
            'precio': ''.join(precio0.css('span.vtex-product-price-1-x-currencyInteger.vtex-product-price-1-x-currencyInteger--product::text').getall()),
            'Link': response.url,
            'Date': datetime.today().strftime('%Y-%m-%d')
        }

process = CrawlerProcess(
    settings={
        'FEED_URI': 'jfs_hombre.csv',
        'FEED_FORMAT': 'csv',
        'FEED_EXPORT_ENCODING': 'utf-8',
        'CONCURRENT_REQUESTS': 16,
        'AUTOTHROTTLE_ENABLED': True,
        'AUTOTHROTTLE_START_DELAY': 1,
        'AUTOTHROTTLE_MAX_DELAY': 2,
        'USER_AGENT': 'Googlebot/2.1 (+http://www.google.com/bot.html)'
    })

process.crawl(JfsSpider_hombre)
process.start()
I don't understand what the error is about... Why do I sometimes get 100% of the info and sometimes get these messages? Is it something related to the script, the user agent, or the moment when the process runs?
Thanks in advance!
The data can also be obtained from the site's API, which returns a JSON response to a GET request, so you can grab whatever data points you want in the easiest and fastest way. Below is an example of a working solution.
import scrapy
from scrapy.crawler import CrawlerProcess

class JfsSpider_hombre(scrapy.Spider):
    name = 'jfs_hombre'
    #start_urls = ["https://www.justforsport.com.ar/hombre?page=1"]

    def start_requests(self):
        yield scrapy.Request(
            url='https://www.justforsport.com.ar/_v/segment/graphql/v1?workspace=master&maxAge=short&appsEtag=remove&domain=store&locale=es-AR&__bindingId=e841e6ce-1216-4569-a2ad-0188ba5a92fc&operationName=productSearchV3&variables=%7B%7D&extensions=%7B%22persistedQuery%22%3A%7B%22version%22%3A1%2C%22sha256Hash%22%3A%226869499be99f20964918e2fe0d1166fdf6c006b1766085db9e5a6bc7c4b957e5%22%2C%22sender%22%3A%22vtex.store-resources%400.x%22%2C%22provider%22%3A%22vtex.search-graphql%400.x%22%7D%2C%22variables%22%3A%22eyJoaWRlVW5hdmFpbGFibGVJdGVtcyI6ZmFsc2UsInNrdXNGaWx0ZXIiOiJGSVJTVF9BVkFJTEFCTEUiLCJzaW11bGF0aW9uQmVoYXZpb3IiOiJkZWZhdWx0IiwiaW5zdGFsbG1lbnRDcml0ZXJpYSI6Ik1BWF9XSVRIT1VUX0lOVEVSRVNUIiwicHJvZHVjdE9yaWdpblZ0ZXgiOmZhbHNlLCJtYXAiOiJjIiwicXVlcnkiOiJob21icmUiLCJvcmRlckJ5IjoiT3JkZXJCeVJlbGVhc2VEYXRlREVTQyIsImZyb20iOjY0LCJ0byI6OTUsInNlbGVjdGVkRmFjZXRzIjpbeyJrZXkiOiJjIiwidmFsdWUiOiJob21icmUifV0sIm9wZXJhdG9yIjoiYW5kIiwiZnV6enkiOiIwIiwic2VhcmNoU3RhdGUiOm51bGwsImZhY2V0c0JlaGF2aW9yIjoiU3RhdGljIiwiY2F0ZWdvcnlUcmVlQmVoYXZpb3IiOiJkZWZhdWx0Iiwid2l0aEZhY2V0cyI6ZmFsc2V9%22%7D',
            callback=self.parse,
            method="GET"
        )

    def parse(self, response):
        resp = response.json()
        #print(resp)
        for item in range(0, 576, 32):
            resp['recordsFiltered'] = item
            for result in resp['data']['productSearch']['products']:
                yield {
                    'productName': result['productName']
                }

if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(JfsSpider_hombre)
    process.start()
Output:
'downloader/response_status_count/200': 1,
'item_scraped_count': 576,

Scrapy Python script raises TypeError("Cannot mix str and non-str arguments")

Hi, I am new to programming and am running into this seemingly extremely common problem, but honestly none of the answers I have seen helped me in my case.
My code is:
import json
import scrapy

class MoreKeysSpider(scrapy.Spider):
    name = 'getoffers'

    def __init__(self):
        with open(r'C:\Users\magnu\brickset-scraper\postscrape\postscrape\prod.json', encoding='utf-8') as data_file:
            self.data = json.load(data_file)

    def start_requests(self):
        for item in self.data:
            request = scrapy.Request(item['url'], callback=self.parse)
            request.meta['item'] = item
            yield request

    def parse(self, response):
        item = response.meta['item']
        item['details'] = []
        item['details'].append({
            "Name": response.css('span[itemprop=name]::text').extract_first(),
            "Release": response.xpath('//*[@id="info"]/div[2]/div[1]/div[1]/div[2]/text()').extract_first(),
            "Website": response.xpath('//*[@id="info"]/div[2]/div[1]/div[2]/div[2]/a/@href').extract_first(),
            "Entwickler": response.xpath('//*[@id="info"]/div[2]/div[1]/div[3]/div[2]/text()').extract_first(),
            "Publisher": response.xpath('//*[@id="info"]/div[2]/div[1]/div[4]/div[2]/text()').extract_first(),
            "Tags": response.xpath('//*[@id="info"]/div[2]/div[2]/div[3]/div[2]/descendant').getall(),
            "Systemanforderungenmin": response.xpath('//*[@id="config"]/ul[1]/descendant').getall(),
            "Systemanforderungenmax": response.xpath('//*[@id="config"]/ul[2]/descendant').getall(),
        })
        yield item

        item['offer'] = []
        for div in response.css('#offers_table'):
            for offer_row in div.css('div.offers-table-row'):
                url = response.urljoin(offer_row.css('div.buy-btn-cell a::attr(href)')).get(),
                url_str = ''.join(map(str, url))  # converts list to str
                item['offer'].append({
                    "offer:"
                    "Shop": offer_row.css('div[itemprop ~= seller] div.offers-merchant::attr(title)').extract_first(),
                    "Typ": offer_row.css('div.offers-edition-region::text').extract_first(),
                    "Edition": offer_row.css("div[data-toggle=tooltip]::attr(data-content)"),
                    "Link": response.follow(url_str, self.parse_topics),
                })
        yield item
As a response I get
DEBUG: Scraped from <200 https://www.keyforsteam.de/kaufen-crusader-kings-2-cd-key-preisvergleich/>
{'url': 'https://www.keyforsteam.de/kaufen-crusader-kings-2-cd-key-preisvergleich/', 'details': [{'Name': '\n\t\t\t\t\tCrusader Kings 2\n\t\t\t\t', 'Release': '\n 14. Februar 2012\n ', 'Website': 'https://www.paradoxplaza.com/crusader-kings-ii/CKCK02GSK-MASTER.html', 'Entwickler': '\n Paradox Development Studio\n ', 'Publisher': '\n Paradox Interactive\n ', 'Tags': [], 'Systemanforderungenmin': [], 'Systemanforderungenmax': []}]}
2021-03-22 21:47:22 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.keyforsteam.de/kaufen-crusader-kings-2-cd-key-preisvergleich/> (referer: None)
Traceback (most recent call last):
File "c:\users\magnu\appdata\local\programs\python\python39\lib\site-packages\scrapy\utils\defer.py", line 120, in iter_errback
yield next(it)
File "c:\users\magnu\appdata\local\programs\python\python39\lib\site-packages\scrapy\utils\python.py", line 353, in __next__
return next(self.data)
File "c:\users\magnu\appdata\local\programs\python\python39\lib\site-packages\scrapy\utils\python.py", line 353, in __next__
return next(self.data)
File "c:\users\magnu\appdata\local\programs\python\python39\lib\site-packages\scrapy\core\spidermw.py", line 62, in _evaluate_iterable
for r in iterable:
File "c:\users\magnu\appdata\local\programs\python\python39\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 29, in process_spider_output
for x in result:
File "c:\users\magnu\appdata\local\programs\python\python39\lib\site-packages\scrapy\core\spidermw.py", line 62, in _evaluate_iterable
for r in iterable:
File "c:\users\magnu\appdata\local\programs\python\python39\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 340, in <genexpr>
return (_set_referer(r) for r in result or ())
File "c:\users\magnu\appdata\local\programs\python\python39\lib\site-packages\scrapy\core\spidermw.py", line 62, in _evaluate_iterable
for r in iterable:
File "c:\users\magnu\appdata\local\programs\python\python39\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
return (r for r in result or () if _filter(r))
File "c:\users\magnu\appdata\local\programs\python\python39\lib\site-packages\scrapy\core\spidermw.py", line 62, in _evaluate_iterable
for r in iterable:
File "c:\users\magnu\appdata\local\programs\python\python39\lib\site-packages\scrapy\spidermiddlewares\depth.py",
line 58, in <genexpr>
return (r for r in result or () if _filter(r))
File "c:\users\magnu\appdata\local\programs\python\python39\lib\site-packages\scrapy\core\spidermw.py", line 62, in _evaluate_iterable
for r in iterable:
File "C:\Users\magnu\brickset-scraper\postscrape\postscrape\spiders\keysint.py", line 40, in parse
url = response.urljoin(offer_row.css('div.buy-btn-cell a::attr(href)')).get(),
File "c:\users\magnu\appdata\local\programs\python\python39\lib\site-packages\scrapy\http\response\text.py", line
102, in urljoin
return urljoin(get_base_url(self), url)
File "c:\users\magnu\appdata\local\programs\python\python39\lib\urllib\parse.py", line 524, in urljoin
base, url, _coerce_result = _coerce_args(base, url)
File "c:\users\magnu\appdata\local\programs\python\python39\lib\urllib\parse.py", line 122, in _coerce_args
raise TypeError("Cannot mix str and non-str arguments")
TypeError: Cannot mix str and non-str arguments
So the first part seemingly works, and I am pretty sure the mistake is somewhere in the second part, but I can't seem to find it:
item['offer'] = []
for div in response.css('#offers_table'):
    for offer_row in div.css('div.offers-table-row'):
        url = response.urljoin(offer_row.css('div.buy-btn-cell a::attr(href)')).get(),
        url_str = ''.join(map(str, url))  # converts list to str
        item['offer'].append({
            "offer:"
            "Shop": offer_row.css('div[itemprop ~= seller] div.offers-merchant::attr(title)').extract_first(),
            "Typ": offer_row.css('div.offers-edition-region::text').extract_first(),
            "Edition": offer_row.css("div[data-toggle=tooltip]::attr(data-content)"),
            "Link": response.follow(url_str, self.parse_topics),
        })
yield item
Had kind of a circular route to get this one, but I think the debugging process would be instructive.
It's tougher to diagnose this without the json file the program is calling, but it looks like your problem is on this line: url = response.urljoin(offer_row.css('div.buy-btn-cell a::attr(href)')).get(),
From How Can I Fix "TypeError: Cannot mix str and non-str arguments"?
According to the Scrapy documentation, the .css(selector) method that you're using returns a SelectorList instance. If you want the actual (unicode) string version of the url, call the extract() method:
So I tried:
url = response.urljoin(offer_row.css('div.buy-btn-cell a::attr(href)').extract()).get(),
But I still get the same error. Strange!
To diagnose, I dropped a breakpoint() into the spider here:
for div in response.css('#offers_table'):
    for offer_row in div.css('div.offers-table-row'):
        breakpoint()
        url = response.urljoin(offer_row.css('div.buy-btn-cell a::attr(href)').extract()).get(),
Running the spider again, I can test pieces of the next line:
(Pdb) offer_row.css('div.buy-btn-cell a::attr(href)').extract()
['https://www.keyforsteam.de/outgoinglink/keyforsteam/37370?merchant=1', 'https://www.keyforsteam.de/outgoinglink/keyforsteam/37370?merchant=1']
Ah, so extract() is giving back a list of strings rather than a single string. There must be two elements matching. However, they are identical, so we don't care which one we get. Looking at the Scrapy docs at https://docs.scrapy.org/en/latest/topics/selectors.html, we see there's also an extract_first() function.
url = response.urljoin(offer_row.css('div.buy-btn-cell a::attr(href)').extract_first()).get(),
Although, looking at the Scrapy docs, you probably want to use get() instead of extract_first().
Which is when I finally noticed your only mistake was putting the get() outside the wrong set of parentheses.
url = response.urljoin(offer_row.css('div.buy-btn-cell a::attr(href)').get())
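To summarize the selector API involved, here is a short illustrative sketch (using offer_row and response as in the question):

# offer_row.css(...) returns a SelectorList
sel = offer_row.css('div.buy-btn-cell a::attr(href)')
sel.getall()    # same as sel.extract():       list of all matching strings
sel.get()       # same as sel.extract_first(): first string, or None if no match
url = response.urljoin(sel.get())   # join *after* extracting the plain string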

AttributeError: Response object does not have an attribute 'text'

I am trying to use the Python Scrapy tool to extract information from the bitcointalk.org website about the users and the public keys that they post in the forum for donations.
I found this piece of code online and made changes to it so that it runs on my desired website, but I am running into an error: AttributeError: Response object has no attribute 'text'.
Below is the code for reference
class BitcointalkSpider(CrawlSpider):
    name = "bitcointalk"
    allowed_domains = ["bitcointalk.org"]
    start_urls = ["https://bitcointalk.org/index.php"]

    rules = (
        Rule(SgmlLinkExtractor(deny=[
                 'https://bitcointalk\.org/index\.php\?action=ignore',
                 'https://bitcointalk\.org/index\.php\?action=profile',
             ],
             allow_domains='bitcointalk.org'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        sel = Selector(response)
        sites = sel.xpath('//tr[contains(@class, "td_headerandpost")]')
        items = []
        for site in sites:
            item = BitcoinItem()
            item["membername"] = site.xpath('.//td[@class="poster_info"]/b/a/text()').extract()
            addresses = site.xpath('.//div[contains(@class, "signature")]/text()').re(r'(1[1-9A-HJ-NP-Za-km-z]{26,33})')
            if item["membername"] and addresses:
                addr_list = set()
                for addr in addresses:
                    if bcv.check_bc(addr):
                        addr_list.add(addr)
                item["address"] = addr_list
                if len(addr_list) > 0:
                    items.append(item)
        return items
and the error that I am receiving is:
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/scrapy/utils/defer.py", line 102, in iter_errback
yield next(it)
File "/usr/local/lib/python2.7/dist-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
for x in result:
File "/usr/local/lib/python2.7/dist-packages/scrapy/spidermiddlewares/referer.py", line 22, in <genexpr>
return (_set_referer(r) for r in result or ())
File "/usr/local/lib/python2.7/dist-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
return (r for r in result or () if _filter(r))
File "/usr/local/lib/python2.7/dist-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
return (r for r in result or () if _filter(r))
File "/usr/local/lib/python2.7/dist-packages/scrapy/spiders/crawl.py", line 72, in _parse_response
cb_res = callback(response, **cb_kwargs) or ()
File "/home/sunil/Desktop/Nikhil/Thesis/mit_bitcoin/bitcoin/spiders/bitcointalk_spider.py", line 24, in parse_item
sel = Selector(response)
File "/usr/local/lib/python2.7/dist-packages/scrapy/selector/unified.py", line 63, in __init__
text = response.text
AttributeError: 'Response' object has no attribute 'text'
Something is likely wrong with one of your requests, since it seems like the response from at least one URL you're crawling is not properly formatted. Either the request itself failed, or you're not making the requests appropriately.
See here for the source of your error.
And see here for a clue as to why your request may be poorly formatted. It looks like Selector expects an HtmlResponse object, or a similar type.
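As one possible defensive workaround (a sketch only, assuming the rest of the spider stays as posted), the callback can skip responses that are not text/HTML before building a Selector, since a plain Response object has no text attribute:

from scrapy.http import TextResponse
from scrapy.selector import Selector

def parse_item(self, response):
    # Non-HTML responses (e.g. binary downloads) arrive as plain Response objects
    # without .text, which is exactly what makes Selector(response) blow up.
    if not isinstance(response, TextResponse):
        return []
    sel = Selector(response)
    # ... rest of the original parsing ...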

Scrapy callback str issue

I am trying to run a scraper using Scrapy. I was able to do this in the past using this code, but now I get a strange error.
_rules = (Rule(LinkExtractor(restrict_xpaths=(xpath_str)), follow=True,
               callback='parse_url'),)

def parse_url(self, response):
    print response.url
    ...
Basically what I get back when I run it is:
Traceback (most recent call last):
File "/usr/lib/pymodules/python2.7/scrapy/utils/defer.py", line 102, in iter_errback
yield next(it)
File "/usr/lib/pymodules/python2.7/scrapy/spidermiddlewares/offsite.py", line 28, in process_spider_output
for x in result:
File "/usr/lib/pymodules/python2.7/scrapy/spidermiddlewares/referer.py", line 22, in <genexpr>
return (_set_referer(r) for r in result or ())
File "/usr/lib/pymodules/python2.7/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
return (r for r in result or () if _filter(r))
File "/usr/lib/pymodules/python2.7/scrapy/spidermiddlewares/depth.py", line 54, in <genexpr>
return (r for r in result or () if _filter(r))
File "/usr/lib/pymodules/python2.7/scrapy/spiders/crawl.py", line 67, in _parse_response
cb_res = callback(response, **cb_kwargs) or ()
TypeError: 'str' object is not callable
Any ideas why this happens? I have really similar code in another scraper, and it works?!
Here is the full code
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from ..model import Properties

class TestScraper(CrawlSpider):
    name = "test"
    start_urls = [Properties.start_url]

    _rules = (Rule(LinkExtractor(restrict_xpaths=(Properties.xpath)), follow=True, callback='parse_url'),)

    def parse_url(self, response):
        print response.url
Change callback='parse_url' to callback=self.parse_url.
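Note that self is not available at class-definition time, so the bound method has to be assigned somewhere self exists (for example in __init__). An equivalent fix, sketched below with the same Properties helper assumed, is to store the rules in the standard rules attribute, where CrawlSpider resolves the string callback into the method by itself:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from ..model import Properties

class TestScraper(CrawlSpider):
    name = "test"
    start_urls = [Properties.start_url]

    # CrawlSpider only resolves string callbacks for rules kept in `rules`;
    # assigning `_rules` directly bypasses that step and leaves the callback a str.
    rules = (
        Rule(LinkExtractor(restrict_xpaths=Properties.xpath),
             follow=True, callback='parse_url'),
    )

    def parse_url(self, response):
        print(response.url)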

Got runUntilCurrent Error Message when running Scrapy Spider with Selenium

I'm a beginner in Scrapy. I want to collect the links to items on the index page and get the information from the item pages. Because I need to deal with the JavaScript on the index page, I use the Selenium webdriver with Scrapy. Here's my code in progress.py.
from scrapy.spider import Spider
from scrapy.http import Request
from selenium import selenium
from selenium import webdriver
from mustdo.items import MustdoItem
import time

class ProgressSpider(Spider):
    name = 'progress'  # spider's name
    allowed_domains = ['example.com']  # crawling domain
    start_urls = ['http://www.example.com']

    def __init__(self):
        Spider.__init__(self)
        self.log('----------in __init__----------')
        self.driver = webdriver.Firefox()

    def parse(self, response):
        self.log('----------in parse----------')
        self.driver.get(response.url)
        # Here're some operations of self.driver with javascript.
        elements = []
        elements = self.driver.find_elements_by_xpath('//table/tbody/tr/td/a[1]')
        # get the number of the items
        self.log('----------Link number is----------' + str(len(elements)))
        for element in elements:
            # get the url of the item
            href = element.get_attribute('href')
            print href
            self.log('----------next href is ----------' + href)
            yield Request(href, callback=self.parse_item)
        self.driver.close()

    def parse_item(self, response):
        self.log('----------in parse_item----------')
        self.driver.get(response.url)
        # build item
        item = MustdoItem()
        item['title'] = self.driver.find_element_by_xpath('//h2').text
        self.log('----------item created----------' + self.driver.find_element_by_xpath('//h2').text)
        time.sleep(10)
        return item
Also, I have items.py defining the MustdoItem used here. Here's the code.
from scrapy.item import Item, Field

class MustdoItem(Item):
    title = Field()
When I run the spider, I can get several items (probably 6 to 7 out of 20). But after a while, I get error messages as below.
Traceback (most recent call last):
File "F:\Python27\lib\site-packages\twisted\internet\base.py", line 824, in runUntilCurrent
call.func(*call.args, **call.kw)
File "F:\Python27\lib\site-packages\twisted\internet\task.py", line 638, in _tick
taskObj._oneWorkUnit()
File "F:\Python27\lib\site-packages\twisted\internet\task.py", line 484, in _oneWorkUnit
result = next(self._iterator)
File "F:\Python27\lib\site-packages\scrapy-0.22.2-py2.7.egg\scrapy\utils\defer.py", line 57, in <genexpr>
work = (callable(elem, *args, **named) for elem in iterable)
--- <exception caught here> ---
File "F:\Python27\lib\site-packages\scrapy-0.22.2-py2.7.egg\scrapy\utils\defer.py", line 96, in iter_errback
yield next(it)
File "F:\Python27\lib\site-packages\scrapy-0.22.2-py2.7.egg\scrapy\contrib\spidermiddleware\offsite.py", line 23, in process_spider_output
for x in result:
File "F:\Python27\lib\site-packages\scrapy-0.22.2-py2.7.egg\scrapy\contrib\spidermiddleware\referer.py", line 22, in <genexpr>
return (_set_referer(r) for r in result or ())
File "F:\Python27\lib\site-packages\scrapy-0.22.2-py2.7.egg\scrapy\contrib\spidermiddleware\urllength.py", line 33, in <genexpr>
return (r for r in result or () if _filter(r))
File "F:\Python27\lib\site-packages\scrapy-0.22.2-py2.7.egg\scrapy\contrib\spidermiddleware\depth.py", line 50, in <genexpr>
return (r for r in result or () if _filter(r))
File "mustdo\spiders\progress.py", line 32, in parse
print element.tag_name
File "F:\Python27\lib\site-packages\selenium\webdriver\remote\webelement.py", line 50, in tag_name
return self._execute(Command.GET_ELEMENT_TAG_NAME)['value']
File "F:\Python27\lib\site-packages\selenium\webdriver\remote\webelement.py", line 369, in _execute
return self._parent.execute(command, params)
File "F:\Python27\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 164, in execute
self.error_handler.check_response(response)
File "F:\Python27\lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 164, in check_response
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.StaleElementReferenceException: Message: u'Element not found in the cache - perhaps the page has changed since it was looked up'; Stacktrace:
at fxdriver.cache.getElementAt (resource://fxdriver/modules/web_element_cache.js:7610)
at Utils.getElementAt (file:///c:/users/marian/appdata/local/temp/tmpmgnqid/extensions/fxdriver#googlecode.com/components/command_processor.js:7210)
at WebElement.getElementTagName (file:///c:/users/marian/appdata/local/temp/tmpmgnqid/extensions/fxdriver#googlecode.com/components/command_processor.js:10353)
at DelayedCommand.prototype.executeInternal_/h (file:///c:/users/marian/appdata/local/temp/tmpmgnqid/extensions/fxdriver#googlecode.com/components/command_processor.js:10878)
at DelayedCommand.prototype.executeInternal_ (file:///c:/users/marian/appdata/local/temp/tmpmgnqid/extensions/fxdriver#googlecode.com/components/command_processor.js:10883)
at DelayedCommand.prototype.execute/< (file:///c:/users/marian/appdata/local/temp/tmpmgnqid/extensions/fxdriver#googlecode.com/components/command_processor.js:10825)
I've tested my code and found that if I remove "yield Request(href,callback=self.parse_item)" from the parse function, I can get all the links of the items. And while "progress.py" was running, I observed that the error messages came out right after the first "----------in parse_item----------" log line. My inference is that the yield sequence results in the error, but I don't know how to deal with this problem.
Any insight is appreciated!
Best regards! :)
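One way to act on that inference (a sketch only, under the assumption that the shared Firefox driver is the only thing fetching pages): copy the plain href strings out of the WebElements before yielding any Request, so that nothing dereferences the index page's elements after parse_item has navigated the driver to an item page.

def parse(self, response):
    self.log('----------in parse----------')
    self.driver.get(response.url)
    # ... javascript handling ...
    elements = self.driver.find_elements_by_xpath('//table/tbody/tr/td/a[1]')
    # Extract the href strings first; once parse_item calls self.driver.get(),
    # the WebElements found above become stale.
    hrefs = [element.get_attribute('href') for element in elements]
    for href in hrefs:
        yield Request(href, callback=self.parse_item)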
