How to yield in Scrapy without a request? - python

I am trying to crawl a defined list of URLs with Scrapy 2.4, where each of those URLs can have up to 5 paginated URLs that I want to follow.
The system works, but it makes one extra request that I want to get rid of:
Those pages are exactly the same but have a different URL:
example.html
example.html?pn=1
Somewhere in my code I make this extra request and I cannot figure out how to suppress it.
This is the working code:
Define a bunch of URLs to scrape:
start_urls = [
    'https://example...',
    'https://example2...',
]
Start requesting all start URLs:
def start_requests(self):
    for url in self.start_urls:
        yield scrapy.Request(
            url=url,
            callback=self.parse,
        )
Parse the start URL:
def parse(self, response):
    url = response.url + '&pn=' + str(1)
    yield scrapy.Request(url, self.parse_item, cb_kwargs=dict(pn=1, base_url=response.url))
Go get all paginated URLs from the start URLs:
def parse_item(self, response, pn, base_url):
    self.logger.info('Parsing %s', response.url)
    if pn < 6:  # maximum level 5
        url = base_url + '&pn=' + str(pn + 1)
        yield scrapy.Request(url, self.parse_item, cb_kwargs=dict(base_url=base_url, pn=pn + 1))

If I understand your question correctly, you just need to start at ?pn=1 and skip the URL without the pn parameter. Here's one way I would do it, which also only requires one parse method.
start_urls = [
    'https://example...',
    'https://example2...',
]

def start_requests(self):
    for url in self.start_urls:
        # how many pages to crawl
        for i in range(1, 6):
            yield scrapy.Request(
                url=url + f'&pn={i}'
            )

def parse(self, response):
    self.logger.info('Parsing %s', response.url)
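If you would rather keep the sequential crawl from the original code (for example, to stop as soon as a page turns out to be empty), a minimal sketch is to request pn=1 directly from start_requests and chain the next page from the callback. This assumes the same &pn= query parameter used in the question; the spider name is a placeholder:

import scrapy


class PaginatedSpider(scrapy.Spider):
    name = 'paginated'  # hypothetical name, not from the original post
    start_urls = [
        'https://example...',
        'https://example2...',
    ]

    def start_requests(self):
        for url in self.start_urls:
            # go straight to page 1, so the bare URL is never requested
            yield scrapy.Request(
                url=url + '&pn=1',
                callback=self.parse_item,
                cb_kwargs=dict(pn=1, base_url=url),
            )

    def parse_item(self, response, pn, base_url):
        self.logger.info('Parsing %s', response.url)
        # ... extract items here ...
        if pn < 5:  # follow pages 1 to 5
            yield scrapy.Request(
                url=base_url + '&pn=' + str(pn + 1),
                callback=self.parse_item,
                cb_kwargs=dict(pn=pn + 1, base_url=base_url),
            )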

Related

What is this Scrapy error: ReactorNotRestartable?

I do not understand why my spider won't run. I tested the CSS selector separately, so I do not think the problem is in the parsing method.
Traceback message:
ReactorNotRestartable:
class espn_spider(scrapy.Spider):
    name = "fsu2021_spider"

    def start_requests(self):
        urls = "https://www.espn.com/college-football/team/_/id/52"
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse_front)

    def parse(self, response):
        schedule_link = response.css('div.global-nav-container li > a::attr(href)')


process = CrawlerProcess()
process.crawl(espn_spider)
process.start()
urls = "https://www.espn.com/college-football/team/_/id/52"
for url in urls:
You're going through the characters of "urls"; change it to a list:
urls = ["https://www.espn.com/college-football/team/_/id/52"]
...
...
Also, you don't have a "parse_front" function. If you just didn't add it to the snippet, ignore this; if it was a mistake, change it to:
yield scrapy.Request(url=url, callback=self.parse)
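Putting both fixes together, a minimal corrected version of the snippet could look like this (a sketch; the selector, spider name and URL are taken from the question as-is). Note that ReactorNotRestartable itself usually means the Twisted reactor was started more than once in the same process, for example by re-running process.start() in a notebook or interactive session, so run the script in a fresh Python process:

import scrapy
from scrapy.crawler import CrawlerProcess


class espn_spider(scrapy.Spider):
    name = "fsu2021_spider"

    def start_requests(self):
        # a list, so we iterate over URLs rather than characters
        urls = ["https://www.espn.com/college-football/team/_/id/52"]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        schedule_link = response.css('div.global-nav-container li > a::attr(href)')
        self.logger.info('Found %d schedule links', len(schedule_link))


process = CrawlerProcess()
process.crawl(espn_spider)
process.start()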

How to reverse list order in Python and stop yield upon return of None?

Using Python 3.x, I am generating pagination links which I only suspect exist:
start_urls = [
    'https://...',
    'https://...'  # list full of URLs
]

def start_requests(self):
    for url in self.start_urls:
        yield scrapy.Request(
            url=url,
            meta={'handle_httpstatus_list': [301]},
            callback=self.parse,
        )

def parse(self, response):
    for i in range(1, 6):
        url = response.url + '&pn=' + str(i)
        yield scrapy.Request(url, self.parse_item)

def parse_item(self, response):
    # check if no results page
    if response.xpath('//*[@id="searchList"]/div[1]').extract_first():
        self.logger.info('No results found on %s', response.url)
        return None
    ...
Those URLs will be processed by Scrapy in parse_item. Now there are 2 problems:
The order is reversed and I do not understand why. It requests page numbers 5, 4, 3, 2, 1 instead of 1, 2, 3, 4, 5.
If no results are found on page 1, the entire series could be stopped. parse_item already returns None, but I guess I need to adapt the parse method to exit the for loop and continue. How?
The scrapy.Request objects you generate run in parallel. In other words, there is no guarantee about the order in which you get the responses, as it depends on the server.
If some requests depend on the response to another request, you should yield those requests from its parse callback.
For example:
def parse(self, response):
    url = response.url + '&pn=' + str(1)
    yield scrapy.Request(url, self.parse_item, cb_kwargs=dict(page=1, base_url=response.url))

def parse_item(self, response, page, base_url):
    # check if no results page
    if response.xpath('//*[@id="searchList"]/div[1]').extract_first():
        if page < 6:
            url = base_url + '&pn=' + str(page + 1)
            yield scrapy.Request(url, self.parse_item, cb_kwargs=dict(base_url=base_url, page=page + 1))
    else:
        # your code
        yield ...
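For the second problem (stopping the series as soon as a page has no results), a minimal sketch of the chained version could look like the following. It assumes, as in the question, that the searchList div only appears on empty result pages; the page limit is an illustrative choice:

def parse(self, response):
    # request page 1 only; further pages are chained from parse_item
    url = response.url + '&pn=1'
    yield scrapy.Request(url, self.parse_item, cb_kwargs=dict(page=1, base_url=response.url))

def parse_item(self, response, page, base_url):
    # empty results page: log it and stop the chain for this start URL
    if response.xpath('//*[@id="searchList"]/div[1]').extract_first():
        self.logger.info('No results found on %s', response.url)
        return
    # ... extract and yield items here ...
    if page < 5:  # follow at most pages 1 to 5
        url = base_url + '&pn=' + str(page + 1)
        yield scrapy.Request(url, self.parse_item, cb_kwargs=dict(base_url=base_url, page=page + 1))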

Scrapy - scrape all of the items instead of 1 item

I need to scrape all of the items, but only 1 item is scraped.
My code was working fine before, but when I transferred it to another project (the same code), this happens and I don't know why.
I need to get all of the items according to the page size in start_urls.
Here's my code:
class HmSalesitemSpider(scrapy.Spider):
    name = 'HM_salesitem'
    allowed_domains = ['www2.hm.com']
    start_urls = ['https://www2.hm.com/en_us/sale/shopbyproductladies/view-all.html?sort=stock&image-size=small&image=stillLife&offset=0&page-size=3002']

    def parse(self, response):
        for product_item in response.css('li.product-item'):
            url = "https://www2.hm.com/" + product_item.css('a::attr(href)').extract_first()
        yield scrapy.Request(url=url, callback=self.parse_subpage)

    def parse_subpage(self, response):
        item = {
            'title': response.xpath("normalize-space(.//h1[contains(@class, 'primary') and contains(@class, 'product-item-headline')]/text())").extract_first(),
            'sale-price': response.xpath("normalize-space(.//span[@class='price-value']/text())").extract_first(),
            'regular-price': response.xpath('//script[contains(text(), "whitePrice")]/text()').re_first("'whitePrice'\s?:\s?'([^']+)'"),
            'photo-url': response.css('div.product-detail-main-image-container img::attr(src)').extract_first(),
            'description': response.css('p.pdp-description-text::text').extract_first()
        }
        yield item
Please Help. Thank you
It seems you have a problem with indentation. Move the yield of the request into the for loop:
def parse(self, response):
    for product_item in response.css('li.product-item'):
        url = "https://www2.hm.com/" + product_item.css('a::attr(href)').get()
        yield scrapy.Request(url=url, callback=self.parse_subpage)
Or here is a slightly cleaner version:
def parse(self, response):
    for link in response.css('li.product-item a::attr(href)').extract():
        yield response.follow(link, self.parse_subpage)
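If the project is on Scrapy 2.0 or newer, response.follow_all can shorten this even further; a sketch, assuming the same selector and callback as above:

def parse(self, response):
    # builds one request per matched link and resolves relative URLs
    yield from response.follow_all(css='li.product-item a::attr(href)', callback=self.parse_subpage)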

Scrapy + Splash + ScrapyJS

I am using Splash 2.0.2 + Scrapy 1.0.5 + Scrapyjs 0.1.1 and I am still not able to render JavaScript with a click. Here is an example URL: https://olx.pt/anuncio/loja-nova-com-250m2-garagem-em-box-fechada-para-arrumos-IDyTzAT.html#c49d3d94cf
I am still getting the page without the phone number rendered:
class OlxSpider(scrapy.Spider):
    name = "olx"
    rotate_user_agent = True
    allowed_domains = ["olx.pt"]
    start_urls = [
        "https://olx.pt/imoveis/"
    ]

    def parse(self, response):
        script = """
        function main(splash)
            splash:go(splash.args.url)
            splash:runjs('document.getElementById("contact_methods").getElementsByTagName("span")[1].click();')
            splash:wait(0.5)
            return splash:html()
        end
        """
        for href in response.css('.link.linkWithHash.detailsLink::attr(href)'):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_house_contents, meta={
                'splash': {
                    'args': {'lua_source': script},
                    'endpoint': 'execute',
                }
            })
        for next_page in response.css('.pager .br3.brc8::attr(href)'):
            url = response.urljoin(next_page.extract())
            yield scrapy.Request(url, self.parse)

    def parse_house_contents(self, response):
        import ipdb; ipdb.set_trace()
How can I get this to work?
Add
splash:autoload("https://code.jquery.com/jquery-2.1.3.min.js")
to the Lua script and it will work:
function main(splash)
    splash:go(splash.args.url)
    splash:autoload("https://code.jquery.com/jquery-2.1.3.min.js")
    splash:runjs('document.getElementById("contact_methods").getElementsByTagName("span")[1].click();')
    splash:wait(0.5)
    return splash:html()
end
.click() is a jQuery function: https://api.jquery.com/click/
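For reference, if you move from the old scrapyjs meta-based setup to the scrapy-splash package, the same 'execute' call can be sent with SplashRequest. A sketch, assuming scrapy-splash is installed and its middlewares plus SPLASH_URL are configured in settings.py; the spider name here is a placeholder:

import scrapy
from scrapy_splash import SplashRequest

LUA_SCRIPT = """
function main(splash)
    splash:go(splash.args.url)
    splash:autoload("https://code.jquery.com/jquery-2.1.3.min.js")
    splash:runjs('document.getElementById("contact_methods").getElementsByTagName("span")[1].click();')
    splash:wait(0.5)
    return splash:html()
end
"""


class OlxSplashSpider(scrapy.Spider):
    name = "olx_splash"  # hypothetical name for this sketch
    allowed_domains = ["olx.pt"]
    start_urls = ["https://olx.pt/imoveis/"]

    def parse(self, response):
        for href in response.css('.link.linkWithHash.detailsLink::attr(href)'):
            # the 'execute' endpoint runs the Lua script above inside Splash
            yield SplashRequest(
                response.urljoin(href.extract()),
                callback=self.parse_house_contents,
                endpoint='execute',
                args={'lua_source': LUA_SCRIPT},
            )

    def parse_house_contents(self, response):
        # response.body is the HTML rendered after the click
        pass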
You can avoid having to use Splash in the first place and make the appropriate GET request to get the phone number yourself. Working spider:
import json
import re

import scrapy


class OlxSpider(scrapy.Spider):
    name = "olx"
    rotate_user_agent = True
    allowed_domains = ["olx.pt"]
    start_urls = [
        "https://olx.pt/imoveis/"
    ]

    def parse(self, response):
        for href in response.css('.link.linkWithHash.detailsLink::attr(href)'):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_house_contents)
        for next_page in response.css('.pager .br3.brc8::attr(href)'):
            url = response.urljoin(next_page.extract())
            yield scrapy.Request(url, self.parse)

    def parse_house_contents(self, response):
        property_id = re.search(r"ID(\w+)\.", response.url).group(1)
        phone_url = "https://olx.pt/ajax/misc/contact/phone/%s/" % property_id
        yield scrapy.Request(phone_url, callback=self.parse_phone)

    def parse_phone(self, response):
        phone_number = json.loads(response.body)["value"]
        print(phone_number)
If there are more things to extract from this "dynamic" website, see if Splash is really enough and, if not, look into browser automation and selenium.

Scrapy with multiple URLs in a list object

I'm a junior in Python and I have some questions about spiders.
I have caught some URLs and put them in a list object, and I would like to use those URLs to run Scrapy again. Is it possible to dynamically change the URL and keep scraping? Or can someone give me an idea of how to approach this with Scrapy? Thanks a lot.
def parse(self, response):
    sel = Selector(response)
    sites = sel.xpath('//tr/td/span[@class="artist-lists"]')
    items = []
    for site in sites:
        item = Website()
        title = site.xpath('a/text()').extract()
        link = site.xpath('a/@href').extract()
        desc = site.xpath('text()').extract()
        item['title'] = title[0].encode('big5')
        item['link'] = link[0]
        self.get_userLink(item['link'])
        item['desc'] = desc
        # items.append(item)
    # return items

def get_userLink(self, link):
    # start_urls = [link]
    self.parse(link)
    sel = Selector(link)
    sites = sel.xpath('//table/tr/td/b')
    print sites
    # for site in sites:
    #     print site.xpath('a/@href').extract() + "\n"
    #     print site.xpath('a/text()').extract() + "\n"
You can yield a Request that passes the URL to another callback function for parsing.
def parse(self, response):
    sel = Selector(response)
    url = sel.xpath('//..../@href').extract_first()
    # only yield the request if the URL was actually found
    if url:
        yield Request(url, callback=self.parse_sub)

def parse_sub(self, response):
    # ** Some scrape operation here **
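Applied to the spider in the question, a minimal sketch could replace the direct get_userLink call with a chained request that carries the partially-filled item along in meta. The spider name, start URL and parse_user callback are placeholders for illustration, and a plain dict stands in for the Website item class:

import scrapy


class ArtistSpider(scrapy.Spider):
    name = 'artists'  # hypothetical name
    start_urls = ['https://example.com/artists']  # placeholder start URL

    def parse(self, response):
        for site in response.xpath('//tr/td/span[@class="artist-lists"]'):
            item = {
                'title': site.xpath('a/text()').get(),
                'link': response.urljoin(site.xpath('a/@href').get()),
                'desc': site.xpath('text()').get(),
            }
            # chain a second request instead of calling a parse method directly;
            # the partially-filled item travels along in meta
            yield scrapy.Request(item['link'], callback=self.parse_user, meta={'item': item})

    def parse_user(self, response):
        item = response.meta['item']
        item['user_links'] = response.xpath('//table/tr/td/b/a/@href').getall()
        yield item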
