I do not understand why my spider won't run. I tested the CSS selector separately, so I do not think the parsing method is the problem.
Traceback message:
ReactorNotRestartable:
class espn_spider(scrapy.Spider):
name = "fsu2021_spider"
def start_requests(self):
urls = "https://www.espn.com/college-football/team/_/id/52"
for url in urls:
yield scrapy.Request(url = url, callback = self.parse_front)
def parse(self, response):
schedule_link = response.css('div.global-nav-container li > a::attr(href)')
process = CrawlerProcess()
process.crawl(espn_spider)
process.start()
urls = "https://www.espn.com/college-football/team/_/id/52"
for url in urls:
You're iterating over the characters of the "urls" string; change it to a list:
urls = ["https://www.espn.com/college-football/team/_/id/52"]
...
...
Also, you don't have a "parse_front" method. If you just didn't include it in the snippet, ignore this; if it was a mistake, change the callback to:
yield scrapy.Request(url=url, callback=self.parse)
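Putting both fixes together, a minimal corrected sketch of the spider might look like this (the dict yielded in parse is only illustrative, since the full parsing code wasn't shown):
import scrapy
from scrapy.crawler import CrawlerProcess

class espn_spider(scrapy.Spider):
    name = "fsu2021_spider"

    def start_requests(self):
        # a list, so we iterate over URLs instead of characters
        urls = ["https://www.espn.com/college-football/team/_/id/52"]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        schedule_link = response.css('div.global-nav-container li > a::attr(href)')
        # illustrative only; yield whatever your real parsing produces
        yield {"schedule_link": schedule_link.get()}

process = CrawlerProcess()
process.crawl(espn_spider)
process.start()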
I'm within reach of a personal milestone with Scrapy. The aim is to properly understand callback and cb_kwargs; I've read the documentation countless times, but I learn best with visual code, practice, and an explanation.
I have an example scraper; the aim is to grab the book name and price, then go into each book page and extract a single piece of information. I'm also trying to understand how to properly get information from the next few pages, which I know depends on understanding how callbacks work.
When I run my script, it returns results only for the first page. How do I get the additional pages?
Here's my scraper:
class BooksItem(scrapy.Item):
    items = Field(output_processor=TakeFirst())
    price = Field(output_processor=TakeFirst())
    availability = Field(output_processor=TakeFirst())

class BookSpider(scrapy.Spider):
    name = "books"
    start_urls = ['https://books.toscrape.com']

    def start_request(self):
        for url in self.start_url:
            yield scrapy.Request(
                url,
                callback=self.parse)

    def parse(self, response):
        data = response.xpath('//div[@class = "col-sm-8 col-md-9"]')
        for books in data:
            loader = ItemLoader(BooksItem(), selector=books)
            loader.add_xpath('items', './/article[@class="product_pod"]/h3/a//text()')
            loader.add_xpath('price', './/p[@class="price_color"]//text()')

            for url in [books.xpath('.//a//@href').get()]:
                yield scrapy.Request(
                    response.urljoin(url),
                    callback=self.parse_book,
                    cb_kwargs={'loader': loader})

            for next_page in [response.xpath('.//div/ul[@class="pager"]/li[@class="next"]/a//@href').get()]:
                if next_page is not None:
                    yield response.follow(next_page, callback=self.parse)

    def parse_book(self, response, loader):
        book_quote = response.xpath('//p[@class="instock availability"]//text()').get()
        loader.add_value('availability', book_quote)
        yield loader.load_item()
I believe the issue is with the part where I try to grab the next few pages. I have tried an alternative approach using the following:
def start_request(self):
    for url in self.start_url:
        yield scrapy.Request(
            url,
            callback=self.parse,
            cb_kwargs={'page_count': 0}
        )

def parse(self, response, next_page):
    if page_count > 3:
        return
    ...
    ...
    page_count += 1
    for next_page in [response.xpath('.//div/ul[@class="pager"]/li[@class="next"]/a//@href').get()]:
        yield response.follow(next_page, callback=self.parse, cb_kwargs={'page_count': page_count})
However, I get the following error with this approach:
TypeError: parse() missing 1 required positional argument: 'page_cntr'
It should be start_requests, and self.start_urls (inside the function).
get() will return only the first result; what you want is getall(), which returns a list.
There is no need for a for loop in the "next_page" part; it's not a mistake, just unnecessary.
In the line for url in books.xpath you're getting every URL twice; again, not a mistake, but still unnecessary.
With data = response.xpath('//div[@class = "col-sm-8 col-md-9"]') you don't select the books one by one, you select the whole books container; you can check that len(data.getall()) == 1.
book_quote = response.xpath('//p[@class="instock availability"]//text()').get() will return '\n'; look at the source and try to find out why (hint: the 'i' tag).
Compare your code to this and see what I changed:
import scrapy
from scrapy import Field
from scrapy.loader import ItemLoader
from scrapy.loader.processors import TakeFirst

class BooksItem(scrapy.Item):
    items = Field(output_processor=TakeFirst())
    price = Field(output_processor=TakeFirst())
    availability = Field(output_processor=TakeFirst())

class BookSpider(scrapy.Spider):
    name = "books"
    start_urls = ['https://books.toscrape.com']

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url,
                callback=self.parse)

    def parse(self, response):
        data = response.xpath('//div[@class = "col-sm-8 col-md-9"]//li')
        for books in data:
            loader = ItemLoader(BooksItem(), selector=books)
            loader.add_xpath('items', './/article[@class="product_pod"]/h3/a//text()')
            loader.add_xpath('price', './/p[@class="price_color"]//text()')

            for url in books.xpath('.//h3/a//@href').getall():
                yield scrapy.Request(
                    response.urljoin(url),
                    callback=self.parse_book,
                    cb_kwargs={'loader': loader})

        next_page = response.xpath('.//div/ul[@class="pager"]/li[@class="next"]/a//@href').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

    def parse_book(self, response, loader):
        # option 1:
        book_quote = response.xpath('//p[@class="instock availability"]/i/following-sibling::text()').get().strip()
        # option 2:
        # book_quote = ''.join(response.xpath('//div[contains(@class, "product_main")]//p[@class="instock availability"]//text()').getall()).strip()
        loader.add_value('availability', book_quote)
        yield loader.load_item()
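As for the second approach in the question: the TypeError happens because the key used in cb_kwargs must match a parameter name in the callback's signature exactly. A minimal sketch of passing a page counter through cb_kwargs (the page_count name and the limit of three pages are taken from the question's attempt) could look like this:
def start_requests(self):
    for url in self.start_urls:
        yield scrapy.Request(
            url,
            callback=self.parse,
            cb_kwargs={'page_count': 1})  # key must match the parameter below

def parse(self, response, page_count):
    if page_count > 3:  # stop after three pages
        return
    # ... extract the items as above ...
    next_page = response.xpath('.//div/ul[@class="pager"]/li[@class="next"]/a//@href').get()
    if next_page:
        yield response.follow(
            next_page,
            callback=self.parse,
            cb_kwargs={'page_count': page_count + 1})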
I am trying to crawl a defined list of URLs with Scrapy 2.4 where each of those URLs can have up to 5 paginated URLs that I want to follow.
While the system works, I do have one extra request I want to get rid of:
Those pages are exactly the same but have a different URL:
example.html
example.html?pn=1
Somewhere in my code I make this extra request, and I cannot figure out how to suppress it.
This is the working code:
Define a bunch of URLs to scrape:
start_urls = [
'https://example...',
'https://example2...',
]
Start requesting all start URLs:
def start_requests(self):
for url in self.start_urls:
yield scrapy.Request(
url = url,
callback=self.parse,
)
Parse the start URL:
def parse(self, response):
url = response.url + '&pn='+str(1)
yield scrapy.Request(url, self.parse_item, cb_kwargs=dict(pn=1, base_url=response.url))
Get all paginated URLs from the start URLs:
def parse_item(self, response, pn, base_url):
self.logger.info('Parsing %s', response.url)
if pn < 6: # maximum level 5
url = base_url + '&pn='+str(pn+1)
yield scrapy.Request(url, self.parse_item, cb_kwargs=dict(base_url=base_url,pn=pn+1))
If I understand your question correctly, you just need to change it to start at ?pn=1 and ignore the URL without a pn parameter. Here's an option for how I would do it, which also only requires one parse method.
start_urls = [
'https://example...',
'https://example2...',
]
def start_requests(self):
for url in self.start_urls:
#how many pages to crawl
for i in range(1,6):
yield scrapy.Request(
url=url + f'&pn={str(i)}'
)
def parse(self, response):
self.logger.info('Parsing %s', response.url)
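Note that when a scrapy.Request is created without an explicit callback, Scrapy routes the response to the spider's parse method by default, which is why a single parse method is enough here.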
With Python 3.x, I am generating pagination links which I suspect exist:
start_urls = [
'https://...',
'https://...' # list full of URLs
]
def start_requests(self):
for url in self.start_urls:
yield scrapy.Request(
url = url,
meta={'handle_httpstatus_list': [301]},
callback=self.parse,
)
def parse(self, response):
for i in range(1, 6):
url = response.url + '&pn='+str(i)
yield scrapy.Request(url, self.parse_item)
def parse_item(self, response):
    # check if no results page
    if response.xpath('//*[@id="searchList"]/div[1]').extract_first():
        self.logger.info('No results found on %s', response.url)
        return None
    ...
Those URLs will be processed by scrapy in parse_item. Now there are 2 problems:
The order is reversed and I do not understand why. It will request page numbers 5, 4, 3, 2, 1 instead of 1, 2, 3, 4, 5.
If no results are found on page 1, the entire series could be stopped. parse_item already returns None, but I guess I need to adapt the parse method to exit the for loop and continue. How?
The scrapy.Request objects you generate run in parallel; in other words, there is no guarantee about the order in which you get the responses, as it depends on the server.
If some of the requests depend on the response of another request, you should yield those requests from its parse callback.
For example:
def parse(self, response):
url = response.url + '&pn='+str(1)
yield scrapy.Request(url, self.parse_item, cb_kwargs=dict(page=1, base_url=response.url))
def parse_item(self, response, page, base_url):
    # stop the chain on a "no results" page
    if response.xpath('//*[@id="searchList"]/div[1]').extract_first():
        return
    # your code
    yield ...
    # only request the next page after this one returned results
    if page < 6:
        url = base_url + '&pn=' + str(page + 1)
        yield scrapy.Request(url, self.parse_item, cb_kwargs=dict(base_url=base_url, page=page + 1))
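Because the request for page N+1 is only yielded from the callback that handled page N, this chaining keeps the pages in order and stops the series as soon as a page without results comes back.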
I'm new to Scrapy, and this is my second spider:
class SitenameScrapy(scrapy.Spider):
name = "sitename"
allowed_domains = ['www.sitename.com', 'sitename.com']
rules = [Rule(LinkExtractor(unique=True), follow=True)]
def start_requests(self):
urls = ['http://www.sitename.com/']
for url in urls:
yield scrapy.Request(url=url, callback=self.parse_cat)
def parse_cat(self, response):
links = LinkExtractor().extract_links(response)
for link in links:
if ('/category/' in link.url):
yield response.follow(link, self.parse_cat)
if ('/product/' in link.url):
yield response.follow(link, self.parse_prod)
def parse_prod(self, response):
pass
My problem is that sometimes I have links like http://sitename.com/path1/path2/?param1=value1&param2=value2, and for me param1 is not important; I want to remove it from the URL before response.follow. I think I can do it with regex, but I'm not sure that's the 'right way' for Scrapy. Maybe I should use some kind of rule for this?
I think you could use the url_query_cleaner function from the w3lib library. Something like:
from w3lib.url import url_query_cleaner
...
....
def parse_cat(self, response):
links = LinkExtractor().extract_links(response)
for link in links:
url = url_query_cleaner(link.url, ('param2',))
if '/category/' in url:
yield response.follow(url, self.parse_cat)
if '/product/' in url:
yield response.follow(url, self.parse_prod)
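For reference, by default url_query_cleaner keeps only the parameters you list and drops everything else (pass remove=True to invert that). A quick sketch using the hypothetical URL from the question:
from w3lib.url import url_query_cleaner

url = 'http://sitename.com/path1/path2/?param1=value1&param2=value2'
print(url_query_cleaner(url, ('param2',)))
# http://sitename.com/path1/path2/?param2=value2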
I am using Splash 2.0.2 + Scrapy 1.0.5 + Scrapyjs 0.1.1 and I'm still not able to render JavaScript with a click. Here is an example URL: https://olx.pt/anuncio/loja-nova-com-250m2-garagem-em-box-fechada-para-arrumos-IDyTzAT.html#c49d3d94cf
I am still getting the page without the phone number rendered:
class OlxSpider(scrapy.Spider):
name = "olx"
rotate_user_agent = True
allowed_domains = ["olx.pt"]
start_urls = [
"https://olx.pt/imoveis/"
]
def parse(self, response):
script = """
function main(splash)
splash:go(splash.args.url)
splash:runjs('document.getElementById("contact_methods").getElementsByTagName("span")[1].click();')
splash:wait(0.5)
return splash:html()
end
"""
for href in response.css('.link.linkWithHash.detailsLink::attr(href)'):
url = response.urljoin(href.extract())
yield scrapy.Request(url, callback=self.parse_house_contents, meta={
'splash': {
'args': {'lua_source': script},
'endpoint': 'execute',
}
})
for next_page in response.css('.pager .br3.brc8::attr(href)'):
url = response.urljoin(next_page.extract())
yield scrapy.Request(url, self.parse)
def parse_house_contents(self, response):
import ipdb;ipdb.set_trace()
How can I get this to work?
Add
splash:autoload("https://code.jquery.com/jquery-2.1.3.min.js")
to the Lua script and it will work.
function main(splash)
splash:go(splash.args.url)
splash:autoload("https://code.jquery.com/jquery-2.1.3.min.js")
splash:runjs('document.getElementById("contact_methods").getElementsByTagName("span")[1].click();')
splash:wait(0.5)
return splash:html()
end
.click() is a jQuery function: https://api.jquery.com/click/
You can avoid having to use Splash in the first place and make the appropriate GET request to get the phone number yourself. Working spider:
import json
import re
import scrapy
class OlxSpider(scrapy.Spider):
name = "olx"
rotate_user_agent = True
allowed_domains = ["olx.pt"]
start_urls = [
"https://olx.pt/imoveis/"
]
def parse(self, response):
for href in response.css('.link.linkWithHash.detailsLink::attr(href)'):
url = response.urljoin(href.extract())
yield scrapy.Request(url, callback=self.parse_house_contents)
for next_page in response.css('.pager .br3.brc8::attr(href)'):
url = response.urljoin(next_page.extract())
yield scrapy.Request(url, self.parse)
def parse_house_contents(self, response):
property_id = re.search(r"ID(\w+)\.", response.url).group(1)
phone_url = "https://olx.pt/ajax/misc/contact/phone/%s/" % property_id
yield scrapy.Request(phone_url, callback=self.parse_phone)
def parse_phone(self, response):
phone_number = json.loads(response.body)["value"]
print(phone_number)
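(The regex ID(\w+)\. pulls the property ID out of the listing URL; for the example URL above it captures yTzAT, and that ID is what the AJAX phone endpoint expects.)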
If there are more things to extract from this "dynamic" website, see if Splash is really enough and, if not, look into browser automation and selenium.