I'm a junior Python developer and I have a question about spiders.
I have collected some URLs and put them in a list; now I would like to use those URLs to run Scrapy again. Is it possible to dynamically change the URL and keep crawling? Or can someone give me an idea of how to approach this in Scrapy? Thanks a lot.
def parse(self, response):
    sel = Selector(response)
    sites = sel.xpath('//tr/td/span[@class="artist-lists"]')
    items = []
    for site in sites:
        item = Website()
        title = site.xpath('a/text()').extract()
        link = site.xpath('a/@href').extract()
        desc = site.xpath('text()').extract()
        item['title'] = title[0].encode('big5')
        item['link'] = link[0]
        self.get_userLink(item['link'])
        item['desc'] = desc
        # items.append(item)
    # return items

def get_userLink(self, link):
    # start_urls = [link]
    self.parse(link)
    sel = Selector(link)
    sites = sel.xpath('//table/tr/td/b')
    print sites
    # for site in sites:
    #     print site.xpath('a/@href').extract() + "\n"
    #     print site.xpath('a/text()').extract() + "\n"
You can yield a Request for the URL you extracted, with a callback so that it is parsed in another function.
def parse(self, response):
    sel = Selector(response)
    url = sel.xpath('//..../@href').extract_first()
    if url:
        # check whether a URL was actually extracted
        yield Request(url, callback=self.parse_sub)

def parse_sub(self, response):
    # ** some scrape operation here **
    pass
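Applied to the spider in the question, the whole chain might look like the sketch below. This is only a sketch: the start URL, the import path for the Website item and the extra user_links field are assumptions, not part of the original code. The key change is that the extracted link is followed with a new Request and parsed in a second callback, instead of calling parse() by hand:

import scrapy
from myproject.items import Website  # hypothetical import path for the item from the question

class ArtistSpider(scrapy.Spider):
    name = "artists"
    start_urls = ["http://example.com/artists"]  # placeholder

    def parse(self, response):
        # one item per artist row, reusing the XPaths from the question
        for site in response.xpath('//tr/td/span[@class="artist-lists"]'):
            item = Website()
            item['title'] = site.xpath('a/text()').extract_first()
            item['link'] = response.urljoin(site.xpath('a/@href').extract_first())
            item['desc'] = site.xpath('text()').extract()
            # follow the extracted URL instead of calling self.parse(link) directly
            yield scrapy.Request(item['link'], callback=self.parse_user, meta={'item': item})

    def parse_user(self, response):
        # Scrapy calls this with a real Response object for the followed URL
        item = response.meta['item']
        item['user_links'] = response.xpath('//table/tr/td/b/a/@href').extract()  # assumed extra field
        yield item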
I am trying to crawl a defined list of URLs with Scrapy 2.4 where each of those URLs can have up to 5 paginated URLs that I want to follow.
The system basically works, but I have one extra request I want to get rid of:
Those pages are exactly the same but have a different URL:
example.html
example.html?pn=1
Somewhere in my code I make this extra request and I cannot figure out how to suppress it.
This is the working code:
Define a bunch of URLs to scrape:
start_urls = [
    'https://example...',
    'https://example2...',
]
Start requesting all start URLs:
def start_requests(self):
    for url in self.start_urls:
        yield scrapy.Request(
            url=url,
            callback=self.parse,
        )
Parse the start URL:
def parse(self, response):
    url = response.url + '&pn=' + str(1)
    yield scrapy.Request(url, self.parse_item, cb_kwargs=dict(pn=1, base_url=response.url))
Go get all paginated URLs from the start URLs:
def parse_item(self, response, pn, base_url):
    self.logger.info('Parsing %s', response.url)
    if pn < 6:  # maximum level 5
        url = base_url + '&pn=' + str(pn + 1)
        yield scrapy.Request(url, self.parse_item, cb_kwargs=dict(base_url=base_url, pn=pn + 1))
If I understand your question correctly, you just need to start at ?pn=1 and skip the request without the pn parameter. Here's one way I would do it, which also only requires a single parse method; since every generated request already carries &pn=, the bare URL is never fetched.
start_urls = [
    'https://example...',
    'https://example2...',
]

def start_requests(self):
    for url in self.start_urls:
        # how many pages to crawl
        for i in range(1, 6):
            yield scrapy.Request(
                url=url + f'&pn={str(i)}'
            )

def parse(self, response):
    self.logger.info('Parsing %s', response.url)
I am generating pagination links which I suspect exist, using Python 3.x:
start_urls = [
    'https://...',
    'https://...'  # list full of URLs
]

def start_requests(self):
    for url in self.start_urls:
        yield scrapy.Request(
            url=url,
            meta={'handle_httpstatus_list': [301]},
            callback=self.parse,
        )

def parse(self, response):
    for i in range(1, 6):
        url = response.url + '&pn=' + str(i)
        yield scrapy.Request(url, self.parse_item)

def parse_item(self, response):
    # check if no results page
    if response.xpath('//*[@id="searchList"]/div[1]').extract_first():
        self.logger.info('No results found on %s', response.url)
        return None
    ...
Those URLs will be processed by Scrapy in parse_item. Now there are 2 problems:
The order is reversed and I do not understand why: it requests page numbers 5, 4, 3, 2, 1 instead of 1, 2, 3, 4, 5.
If no results are found on page 1, the whole series could be stopped. parse_item already returns None, but I guess I need to adapt the parse method to exit the for loop and continue. How?
The scrapy.Request objects you generate run in parallel - in other words, there is no guarantee about the order in which the responses arrive, as that depends on the server.
If some of the requests depend on the response to another request, you should yield those requests from that response's parse callback. Chaining them this way also fetches the pages in order and lets you stop as soon as a page has no results.
For example:
def parse(self, response):
    url = response.url + '&pn=' + str(1)
    yield scrapy.Request(url, self.parse_item, cb_kwargs=dict(page=1, base_url=response.url))

def parse_item(self, response, page, base_url):
    # check if this is a "no results" page - if so, stop the chain here
    if response.xpath('//*[@id="searchList"]/div[1]').extract_first():
        self.logger.info('No results found on %s', response.url)
        return
    # your code
    yield ...
    # only request the next page after this one returned results
    if page < 6:
        url = base_url + '&pn=' + str(page + 1)
        yield scrapy.Request(url, self.parse_item, cb_kwargs=dict(base_url=base_url, page=page + 1))
CODE:
import scrapy
from scrapy.spiders import CrawlSpider
from scrapy import Request

class TestSpider(CrawlSpider):
    name = "test_spyder"
    allowed_domains = ["stackoverflow.com"]
    start_urls = ['https://stackoverflow.com/tags']

    def parse(self, response):
        title_1 = response.xpath('//h1/text()').extract_first()
        next_url = 'https://stackoverflow.com/users'
        title_2 = Request(url=next_url, callback=self.parse_some)
        yield {'title_1': title_1, 'title_2': title_2}

    def parse_some(self, response):
        return response.xpath('//h1/text()').extract_first()
I don't understand why, instead of the second page title (Users), I get a different value (<Request GET https://stackoverflow.com/users>).
Scrapy should return the values Tags + Users, but it returns Tags + <Request GET htt..., at least I think so.
Where is the error and how do I fix it?
To crawl a url you need to yield a Request object. So your parse callbacks should either:
Yield a dictionary/Item - this is the end of the crawl chain. The item is generated, sent through the pipelines and finally saved somewhere if you have that set up.
Yield a Request object - this continues the crawl chain with another callback.
Example of this process:
crawl url1 (option 2)
crawl url2 (option 2)
yield item (option 1)
Your spider in this case should look like this:
def parse(self, response):
    title = response.xpath('//h1/text()').extract_first()
    yield {'title': title}
    next_url = 'https://stackoverflow.com/users'
    yield Request(url=next_url, callback=self.parse_some)
And your end result when crawling with scrapy crawl test_spyder -o output.json:
# output.json
[
    {'title': 'title1'},
    {'title': 'title2'}
]
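One detail to watch: parse_some as written in the question returns a bare string, which Scrapy will not export as an item. For the second title to end up in output.json like the first one, parse_some should also yield a dict - a minimal sketch:

def parse_some(self, response):
    # yield a dict so this title is exported just like the first one
    yield {'title': response.xpath('//h1/text()').extract_first()}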
I am trying to scrape a website using Scrapy. Example link is: Here.
I am able to get some data using CSS selectors, but I also need to fetch all image URLs of each item. An item can have multiple colours, and when we click on another colour the browser actually fetches the images from another URL. So I need to generate manual requests (one per colour) and attach "meta" to collect the image URLs from those other URLs into a SINGLE ITEM FIELD.
Here is my Scrapy code:
def get_image_urls(self, response):
    item = response.meta['item']
    if 'image_urls' in item:
        urls = item['image_urls']
    else:
        urls = []
    urls.extend(response.css('.product-image-link::attr(href)').extract())
    item['image_urls'] = urls
    next_url = response.css('.va-color .emptyswatch a::attr(href)').extract()
    # print(item['image_urls'])
    yield Request(next_url[0], callback=self.get_image_urls, meta={'item': item})

def parse(self, response):
    output = JulesProduct()
    output['name'] = self.get_name(response)
    # Now get the recursive img urls
    response.meta['item'] = output
    self.get_image_urls(response)
    return output
Ideally, I should be returning an output object that contains all of the required data. My question is: why am I not getting output['image_urls']? When I uncomment the print statement in get_image_urls, I see 3 crawled URLs and 3 print statements with the URLs appended after each other, but I need them back in the parse function. I'm not sure I'm describing my issue clearly - can anybody help?
Your parse method returns the output item before the get_image_urls requests are done.
You should only yield or return your final item at the end of your recursive logic. Something like this should work:
def parse(self, response):
    output = JulesProduct()
    output['name'] = self.get_name(response)
    yield Request(response.url, callback=self.get_image_urls, meta={'item': output}, dont_filter=True)

def get_image_urls(self, response):
    item = response.meta['item']
    if 'image_urls' in item:
        urls = item['image_urls']
    else:
        urls = []
    urls.extend(response.css('.product-image-link::attr(href)').extract())
    item['image_urls'] = urls
    next_url = response.css('.va-color .emptyswatch a::attr(href)').extract()
    if len(next_url) > 0:
        yield Request(next_url[0], callback=self.get_image_urls, meta={'item': item})
    else:
        yield item
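As a side note (not part of the original answer), on Scrapy 1.7+ the same item hand-off can be done with cb_kwargs instead of meta, just like the pagination examples earlier in this thread. A sketch with the rest of the logic unchanged:

def parse(self, response):
    output = JulesProduct()
    output['name'] = self.get_name(response)
    # pass the partially filled item to the callback as a keyword argument
    yield Request(response.url, callback=self.get_image_urls,
                  cb_kwargs={'item': output}, dont_filter=True)

def get_image_urls(self, response, item):
    # 'item' now arrives as a parameter instead of via response.meta
    urls = item.get('image_urls', [])
    urls.extend(response.css('.product-image-link::attr(href)').extract())
    item['image_urls'] = urls
    next_url = response.css('.va-color .emptyswatch a::attr(href)').extract()
    if next_url:
        yield Request(next_url[0], callback=self.get_image_urls, cb_kwargs={'item': item})
    else:
        yield item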
I am trying to parse a site, an e-store. I parse a page of products, which are loaded with AJAX, get the URLs of these products, and then parse additional info about each product by following those parsed URLs.
My script gets the list of the first 4 items on the page and their URLs, makes the request and parses the additional info, but then it does not return to the loop, and so the spider closes.
Could somebody help me solve this? I'm pretty new to this kind of stuff and only ask here when I'm totally stuck.
Here is my code:
from scrapy import Spider
from scrapy.selector import Selector
from scrapy.http.request import Request
from scrapy_sokos.items import SokosItem

class SokosSpider(Spider):
    name = "sokos"
    allowed_domains = ["sokos.fi"]
    base_url = "http://www.sokos.fi/fi/SearchDisplay?searchTermScope=&searchType=&filterTerm=&orderBy=8&maxPrice=&showResultsPage=true&beginIndex=%s&langId=-11&sType=SimpleSearch&metaData=&pageSize=4&manufacturer=&resultCatEntryType=&catalogId=10051&pageView=image&searchTerm=&minPrice=&urlLangId=-11&categoryId=295401&storeId=10151"
    start_urls = [
        "http://www.sokos.fi/fi/SearchDisplay?searchTermScope=&searchType=&filterTerm=&orderBy=8&maxPrice=&showResultsPage=true&beginIndex=0&langId=-11&sType=SimpleSearch&metaData=&pageSize=4&manufacturer=&resultCatEntryType=&catalogId=10051&pageView=image&searchTerm=&minPrice=&urlLangId=-11&categoryId=295401&storeId=10151",
    ]
    for i in range(0, 8, 4):
        start_urls.append((base_url) % str(i))

    def parse(self, response):
        products = Selector(response).xpath('//div[@class="product-listing product-grid"]/article[@class="product product-thumbnail"]')
        for product in products:
            item = SokosItem()
            item['url'] = product.xpath('//div[@class="content"]/a[@class="image"]/@href').extract()[0]
            yield Request(url=item['url'], meta={'item': item}, callback=self.parse_additional_info)

    def parse_additional_info(self, response):
        item = response.meta['item']
        item['name'] = Selector(response).xpath('//h1[@class="productTitle"]/text()').extract()[0].strip()
        item['description'] = Selector(response).xpath('//div[@id="kuvaus"]/p/text()').extract()[0]
        euro = Selector(response).xpath('//strong[@class="special-price"]/span[@class="euros"]/text()').extract()[0]
        cent = Selector(response).xpath('//strong[@class="special-price"]/span[@class="cents"]/text()').extract()[0]
        item['price'] = '.'.join(euro + cent)
        item['number'] = Selector(response).xpath('//@data-productid').extract()[0]
        yield item
The AJAX requests you are simulating are caught by the Scrapy "duplicate url filter".
Set dont_filter to True when yielding a Request:
yield Request(url=item['url'],
              meta={'item': item},
              callback=self.parse_additional_info,
              dont_filter=True)
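One more thing worth checking (an observation about the code above, not part of the accepted fix): the XPath inside the product loop starts with //, which searches the whole document rather than the current article, so every iteration may extract the same first href and produce the duplicates in the first place. If the .content block is nested inside each article (an assumption about the page markup), making the expression relative keeps each product's own link:

def parse(self, response):
    products = Selector(response).xpath('//div[@class="product-listing product-grid"]/article[@class="product product-thumbnail"]')
    for product in products:
        item = SokosItem()
        # './/' restricts the search to the current <article> node
        item['url'] = product.xpath('.//div[@class="content"]/a[@class="image"]/@href').extract()[0]
        yield Request(url=item['url'], meta={'item': item}, callback=self.parse_additional_info)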