I'm facing some issues with the parse_node method in Scrapy:
class s1(scrapy.spiders.XMLFeedSpider):
    name = "s1"
    handle_httpstatus_list = [400, 401, 403, 404, 408, 410, 500, 502, 503, 504]
    allowed_domains = ["xxx"]
    start_urls = ["xxx"]
    main_url = start_urls[0]
    jobs_list = []
    tracker = SummaryTracker()
    itertag = "miojob"
    counter = 0

    def parse_node(self, response, node):
        if response.status in [400, 401, 403, 404, 408, 410, 500, 502, 503, 504]:
            time.sleep(60)
            yield scrapy.Request(self.main_url, callback=self.parse_node, errback=self.err1, dont_filter=True)
        else:
            # Some code #
            yield scrapy.Request(self.main_url, callback=self.parse_node, errback=self.err1, dont_filter=True)
This is part of a Scrapy bot that recursively scrapes the same page to extract the last ten items. Everything works except for the last scrapy.Request, because it gives me this error:
"parse_node() takes exactly 3 arguments (2 given)"
If instead I use a simple Request(self.main_url) it works, but then I can't use the errback, because it requires a callback. I tried to pass additional arguments to parse_node like this:
yield scrapy.Request(self.main_url, callback=self.parse_node(arg1, arg2), errback=self.err1, dont_filter=True)
but it gives me an AssertionError, probably because the arguments are wrong?
Do you have any idea how to solve this, i.e. how to pass the correct args to parse_node while still being able to use the errback callable?
Try:
def parse_node(self, response):
    <your code>
I've resolved the issue by reading the source code here:
https://github.com/scrapy/scrapy/blob/master/scrapy/spiders/feed.py
The old Request now is:
yield scrapy.Request(self.main_url, callback=self.parse, errback=self.err1, dont_filter=True)
The tweak here is calling the parse method instead of parse_node, because parse will pass the Selector(node) on to parse_node.
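For reference, a minimal sketch of the corrected spider (main_url, err1, and the "xxx" placeholders are the question's own names; the item extraction is omitted). Pointing the callback at self.parse lets XMLFeedSpider iterate the feed again, and the framework itself then calls parse_node(response, node) with both arguments:

import time
import scrapy

class s1(scrapy.spiders.XMLFeedSpider):
    name = "s1"
    handle_httpstatus_list = [400, 401, 403, 404, 408, 410, 500, 502, 503, 504]
    allowed_domains = ["xxx"]
    start_urls = ["xxx"]
    main_url = start_urls[0]
    itertag = "miojob"

    def parse_node(self, response, node):
        if response.status in self.handle_httpstatus_list:
            time.sleep(60)  # back off before retrying the same page
        else:
            pass  # extract the last ten items from `node` here
        # callback=self.parse (not self.parse_node): XMLFeedSpider.parse
        # re-runs the node iteration and calls parse_node itself.
        yield scrapy.Request(self.main_url, callback=self.parse,
                             errback=self.err1, dont_filter=True)

    def err1(self, failure):
        self.logger.error(repr(failure))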
I'm creating a web scraper that will be used to value stocks. The problem is that my code returns an object reference (not sure what it should be called) instead of the value.
import requests

class Guru():
    MedianPE = 0.0

    def __init__(self, ticket):
        self.ticket = ticket
        try:
            url = ("https://www.gurufocus.com/term/pettm/" + ticket + "/PE-Ratio-TTM/")
            response = requests.get(url)
            htmlText = response.text
            firstSplit = htmlText
            secondSplit = firstSplit.split("And the <strong>median</strong> was <strong>")[1]
            thirdSplit = secondSplit.split("</strong>")[0]
            lastSplit = float(thirdSplit)
            try:
                Guru.MedianPE = lastSplit
            except:
                print(ticket + ": Median PE N/A")
        except:
            print(ticket + ": Median PE N/A")

    def getMedianPE(self):
        return float(Guru.getMedianPE)

g1 = Guru("AAPL")
g1.getMedianPE
print("Median + " + str(g1))
If I print lastSplit inside __init__, I get the value I want, 15.53, but when I try to get it via the getMedianPE function I just get Median + <__main__.Guru object at 0x0000016B0760D288>.
Thanks a lot for your time!
Looks like you are trying to cast a function object to a float. Simply change return float(Guru.getMedianPE) to return float(Guru.MedianPE).
getMedianPE is a function (a method, when it is part of a class), so you need to call it with parentheses. If you call it without parentheses, you get the method/function object itself rather than the result of calling it.
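For example, a quick hypothetical session (assuming getMedianPE has been fixed to return Guru.MedianPE, and reusing the 15.53 value from the question):

g1 = Guru("AAPL")
print(g1.getMedianPE)    # <bound method Guru.getMedianPE of ...> -- the method object itself
print(g1.getMedianPE())  # 15.53 -- the result of actually calling the method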
The other problem is that getMedianPE returns the function Guru.getMedianPE rather than the value Guru.MedianPE. I don't think you want MedianPE to be a class variable; you probably just want to set it as a default of 0 in __init__ so that each object has its own median_PE value.
Also, it is not a good idea to include all of the scraping code in your __init__ method. That should be moved to a scrape() method (or some other name) that you call after instantiating the object.
Finally, if you are going to print an object, it is useful to have a __str__ method, so I added a basic one here.
So putting all of those comments together, here is a recommended refactor of your code.
import requests

class Guru():
    def __init__(self, ticket, median_PE=0):
        self.ticket = ticket
        self.median_PE = median_PE

    def __str__(self):
        return f'{self.ticket} {self.median_PE}'

    def scrape(self):
        try:
            url = f"https://www.gurufocus.com/term/pettm/{self.ticket}/PE-Ratio-TTM/"
            response = requests.get(url)
            htmlText = response.text
            firstSplit = htmlText
            secondSplit = firstSplit.split("And the <strong>median</strong> was <strong>")[1]
            thirdSplit = secondSplit.split("</strong>")[0]
            lastSplit = float(thirdSplit)
            self.median_PE = lastSplit
        except ValueError:
            print(f"{self.ticket}: Median PE N/A")
Then you run the code:
>>> g1 = Guru("AAPL")
>>> g1.scrape()
>>> print(g1)
AAPL 15.53
After running my script I noticed that my parse_doc function throws an error whenever it gets a URL that is None. It turns out that my process_doc function was supposed to produce 25 links but produces only 19, because a few pages don't have any link leading to another page. However, when my second function receives such a link with a None value, it raises an error indicating "MissingSchema". How do I get around this so that when it finds a link with a None value it moves on to the next one? Here is the relevant portion of my script, which should give you an idea of what I mean:
def process_doc(medium_link):
    page = requests.get(medium_link).text
    tree = html.fromstring(page)
    try:
        name = tree.xpath('//span[@id="titletextonly"]/text()')[0]
    except IndexError:
        name = ""
    try:
        link = base + tree.xpath('//section[@id="postingbody"]//a[@class="showcontact"]/@href')[0]
    except IndexError:
        link = ""
    parse_doc(name, link)  # All links get to this function, though some links have a None value

def parse_doc(title, target_link):
    page = requests.get(target_link).text  # Error thrown here when it finds a link with a None value
    tel = re.findall(r'\d{10}', page)[0] if re.findall(r'\d{10}', page) else ""
    print(title, tel)
The error I'm getting:
raise MissingSchema(error)
requests.exceptions.MissingSchema: Invalid URL '': No schema supplied. Perhaps you meant http://?
Btw, in my first function there is a variable named "base", which is concatenated with the extracted result to make a full-fledged link.
If you want to avoid cases where your target_link is None, then try:
def parse_doc(title, target_link):
    if target_link:
        page = requests.get(target_link).text
        tel = re.findall(r'\d{10}', page)[0] if re.findall(r'\d{10}', page) else ""
        print(tel)
        print(title)
This should allow you to handle only non-empty links and do nothing otherwise.
First of all, make sure that your schema, meaning the URL, is correct. Sometimes you are just missing a character, or have one too many, in https://.
If you do have to handle the exception, though, you can do it like this:
import requests
from requests.exceptions import MissingSchema
...
try:
    res = requests.get(linkUrl)
    print(res)
except MissingSchema:
    print('URL is not complete')
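Putting both suggestions together, a rough combined sketch (reusing the question's parse_doc names) that skips empty or None links up front and still guards against malformed URLs:

import re
import requests
from requests.exceptions import MissingSchema

def parse_doc(title, target_link):
    if not target_link:                  # covers both None and ""
        print(title, "no contact link")
        return
    try:
        page = requests.get(target_link).text
    except MissingSchema:
        print(title, "URL is not complete")
        return
    numbers = re.findall(r'\d{10}', page)
    tel = numbers[0] if numbers else ""
    print(title, tel)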
I'm having problems with Scrapy pipelines.
EnricherPipeline never starts. I put a debugger on the first line of process_item and it never gets control.
JsonPipeline does start, but the first argument it receives is of type generator object process_item rather than the MatchItem instance it should receive (when I disable EnricherPipeline, JsonPipeline works as expected).
class MatchSpider(CrawlSpider):

    def parse(self, response):
        browser = Browser(browser='Chrome')
        browser.get(response.url)
        browser.find_element_by_xpath('//a[contains(text(), "{l}") and @title="{c}"]'.format(l=self.league, c=self.country)).click()
        browser.find_element_by_xpath('//select[@id="seasons"]/option[text()="{s}"]'.format(s=self.season.replace('-', '/'))).click()
        browser.find_element_by_xpath('//a[contains(text(), "Fixture")]').click()
        page_matches = browser.find_elements_by_xpath('//*[contains(@class, "result-1 rc")]')
        matches = []  # added here so the snippet runs; the original presumably defines it earlier
        matches.extend([m.get_attribute('href') for m in page_matches])
        for m in matches[:1]:
            yield Request(m, callback=self.process_match, dont_filter=True)

    def process_match(self, response):
        match_item = MatchItem()
        match_item['url'] = response.url
        match_item['project'] = self.settings.get('BOT_NAME')
        match_item['spider'] = self.name
        match_item['server'] = socket.gethostname()
        match_item['date'] = datetime.datetime.now()
        return match_item
class EnricherPipeline:
    def process_item(self, item, spider):
        self.match = defaultdict(dict)
        self.match['date'] = item['match']['startTime']
        self.match['referee'] = item['match']['refereeName']
        self.match['stadium'] = item['match']['venueName']
        self.match['exp_mins'] = item['match']['expandedMinutes']
        yield self.match
class JsonPipeline:
    def process_item(self, item, scraper):
        output_dir = 'data/matches/{league}/{season}'.format(league=scraper.league, season=scraper.season)
        if not os.path.exists(output_dir):
            os.makedirs(output_dir)
        file_name = "-".join([str(datetime.strptime(item['date'], '%Y-%m-%dT%H:%M:%S').date()),
                              item['home']['name'], item['away']['name']]) + '.json'
        item_path = os.sep.join((output_dir, file_name))
        with open(item_path, 'w') as f:
            f.write(json.dumps(item))
ITEM_PIPELINES = {
    'scrapers.whoscored.whoscored.pipelines.EnricherPipeline': 300,
    'scrapers.whoscored.whoscored.pipelines.JsonPipeline': 800,
}
OK, so the problem was that EnricherPipeline was yielding rather than returning a result. After that it worked as expected, although I still don't understand why the debugger was not working in that first pipeline.
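For completeness, a minimal sketch of the fix just described, keeping the question's field names: process_item returns the enriched dict instead of yielding it, so the next pipeline receives the item rather than a generator.

from collections import defaultdict

class EnricherPipeline:
    def process_item(self, item, spider):
        match = defaultdict(dict)
        match['date'] = item['match']['startTime']
        match['referee'] = item['match']['refereeName']
        match['stadium'] = item['match']['venueName']
        match['exp_mins'] = item['match']['expandedMinutes']
        return match  # return (not yield) so Scrapy passes the item on to JsonPipeline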
I wrote Scrapy code like @FranGoitia's, returning the item as a dict, and it reached the pipeline fine.
The real reason is: if you yield anything that is not based on a dict, the Scrapy engine will not call the pipeline.
Oddly, I spent three days finding this...
The code is below. Every time it returns only the first loop iteration; the last 9 iterations disappear. So what should I do to get all of them?
I have tried to add an "m = []" and m.append(l), but got the error "ERROR: Spider must return Request, BaseItem, dict or None, got 'ItemLoader'".
The link is http://ajax.lianjia.com/ajax/housesell/area/district?ids=23008619&limit_offset=0&limit_count=100&sort=&&city_id=110000
def parse(self, response):
    jsonresponse = json.loads(response.body_as_unicode())
    for i in range(0, len(jsonresponse['data']['list'])):
        l = ItemLoader(item=ItjuziItem(), response=response)
        house_code = jsonresponse['data']['list'][i]['house_code']
        price_total = jsonresponse['data']['list'][i]['price_total']
        ctime = jsonresponse['data']['list'][i]['ctime']
        title = jsonresponse['data']['list'][i]['title']
        frame_hall_num = jsonresponse['data']['list'][i]['frame_hall_num']
        tags = jsonresponse['data']['list'][i]['tags']
        house_area = jsonresponse['data']['list'][i]['house_area']
        community_id = jsonresponse['data']['list'][i]['community_id']
        community_name = jsonresponse['data']['list'][i]['community_name']
        is_two_five = jsonresponse['data']['list'][i]['is_two_five']
        frame_bedroom_num = jsonresponse['data']['list'][i]['frame_bedroom_num']
        l.add_value('house_code', house_code)
        l.add_value('price_total', price_total)
        l.add_value('ctime', ctime)
        l.add_value('title', title)
        l.add_value('frame_hall_num', frame_hall_num)
        l.add_value('tags', tags)
        l.add_value('house_area', house_area)
        l.add_value('community_id', community_id)
        l.add_value('community_name', community_name)
        l.add_value('is_two_five', is_two_five)
        l.add_value('frame_bedroom_num', frame_bedroom_num)
        print l
        return l.load_item()
The error:
ERROR: Spider must return Request, BaseItem, dict or None, got 'ItemLoader'
is slightly misleading, since you can also return a generator! What is happening here is that return exits the loop and the whole function on the first iteration. You can turn this function into a generator to avoid this.
Simply replace return with yield in your last line, i.e. change
return l.load_item()
to:
yield l.load_item()
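As a sketch, the whole loop then looks like this (field list shortened; entry is just a local alias introduced here for the question's jsonresponse['data']['list'][i]):

def parse(self, response):
    jsonresponse = json.loads(response.body_as_unicode())
    for entry in jsonresponse['data']['list']:
        l = ItemLoader(item=ItjuziItem(), response=response)
        l.add_value('house_code', entry['house_code'])
        l.add_value('price_total', entry['price_total'])
        l.add_value('title', entry['title'])
        # ... remaining add_value calls as in the question ...
        yield l.load_item()  # emits one item per listing instead of stopping at the first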
I'm trying to deploy a crawler with four spiders. One of the spiders uses XMLFeedSpider and runs fine from the shell and from scrapyd, but the others use BaseSpider and all give this error when run in scrapyd, even though they run fine from the shell:
TypeError: __init__() got an unexpected keyword argument '_job'
From what I've read this points to a problem with the __init__ function in my spiders, but I cannot seem to solve the problem. I don't need an __init__ function, and if I remove it completely I still get the error!
My spider looks like this:
from scrapy import log
from scrapy.spider import BaseSpider
from scrapy.selector import XmlXPathSelector
from betfeeds_master.items import Odds

# Parameters
MYGLOBAL = 39

class homeSpider(BaseSpider):
    name = "home"
    #con = None
    allowed_domains = ["www.myhome.com"]
    start_urls = [
        "http://www.myhome.com/oddxml.aspx?lang=en&subscriber=mysubscriber",
    ]

    def parse(self, response):
        items = []
        traceCompetition = ""
        xxs = XmlXPathSelector(response)
        oddsobjects = xxs.select("//OO[OddsType='3W' and Sport='Football']")
        for oddsobject in oddsobjects:
            item = Odds()
            item['competition'] = ''.join(oddsobject.select('Tournament/text()').extract())
            if traceCompetition != item['competition']:
                log.msg('Processing %s' % (item['competition']))  # print item['competition']
                traceCompetition = item['competition']
            item['matchDate'] = ''.join(oddsobject.select('Date/text()').extract())
            item['homeTeam'] = ''.join(oddsobject.select('OddsData/HomeTeam/text()').extract())
            item['awayTeam'] = ''.join(oddsobject.select('OddsData/AwayTeam/text()').extract())
            item['lastUpdated'] = ''
            item['bookie'] = MYGLOBAL
            item['home'] = ''.join(oddsobject.select('OddsData/HomeOdds/text()').extract())
            item['draw'] = ''.join(oddsobject.select('OddsData/DrawOdds/text()').extract())
            item['away'] = ''.join(oddsobject.select('OddsData/AwayOdds/text()').extract())
            items.append(item)
        return items
I can put an __init__ function into the spider, but I get exactly the same error:
def __init__(self, *args, **kwargs):
    super(homeSpider, self).__init__(*args, **kwargs)
    pass
Why is this happening and how do I solve it?
The good answer was given by alecx:
My __init__ function was:
def __init__(self, domain_name):
In order to work within an egg for scrapyd, it should be:
def __init__(self, domain_name, **kwargs):
considering you pass domain_name as a mandatory argument.
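Applied to the homeSpider above, a minimal sketch (assuming domain_name really is the mandatory spider argument in your setup): **kwargs absorbs the extra _job keyword that scrapyd passes when it schedules a run, and hands it to the base class.

class homeSpider(BaseSpider):
    name = "home"

    def __init__(self, domain_name, **kwargs):
        # scrapyd adds _job (and possibly other keywords); let the base class deal with them
        super(homeSpider, self).__init__(**kwargs)
        self.domain_name = domain_name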