I'm building a Scrapy spider that crawls two pages (e.g. PageDucky, PageHorse), and I pass both pages in the start_urls field.
But for pagination I need to take my URL and concatenate it with "?page=", so I can't just pass the entire list.
I already tried a for loop, but without success.
Does anyone know how I can make the pagination work for both pages?
Here is my code for now:
class QuotesSpider(scrapy.Spider):
    name = 'QuotesSpider'
    start_urls = ['https://PageDucky.com', 'https://PageHorse.com']
    categories = []
    count = 1

    def parse(self, response):
        # Get categories
        urli = response.url
        QuotesSpider.categories = urli[urli.find('/browse') + 7:].split('/')
        QuotesSpider.categories.pop(0)

        # Get items per page and calculate the pagination
        items = int(response.xpath(
            '*//div[@id="body"]/div/label[@class="item-count"]/text()').get().replace(' items', ''))
        pages = items / 10

        # Call the other callback to read the page itself
        for i in response.css('div#body div a::attr(href)').getall():
            if i[:5] == '/item':
                yield scrapy.Request('http://mainpage' + i, callback=self.parseobj)

        # HERE IS THE PROBLEM: I tested it, and without the for loop it works for one URL only
        for y in QuotesSpider.start_urls:
            if pages >= QuotesSpider.count:
                next_page = y + '?page=' + str(QuotesSpider.count)
                QuotesSpider.count = QuotesSpider.count + 1
                yield scrapy.Request(next_page, callback=self.parse)
Whatever website you're scraping, find the XPath/CSS location of the 'next page' button, get its href, and yield your next request to that link.
Alternatively, you don't need to use start_urls if you write your own start_requests method, where you can put custom logic, like looping through your desired URLs and appending the correct page number to each. See: https://docs.scrapy.org/en/latest/topics/spiders.html#scrapy.spiders.Spider.start_requests
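A minimal sketch of that start_requests approach, assuming the two sites from the question and a hard-coded page count of 5 (both are placeholders, not values confirmed by the question):
import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'QuotesSpider'

    def start_requests(self):
        base_urls = ['https://PageDucky.com', 'https://PageHorse.com']  # placeholder URLs
        for base in base_urls:
            for page in range(1, 6):  # placeholder page count; derive it per site if you can
                yield scrapy.Request(base + '?page=' + str(page), callback=self.parse)

    def parse(self, response):
        # extract the items of each paginated page here
        ...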
UPDATE WITH SOLUTION
I can't use the "href" because it isn't the same link; for example, page 1 was 'https:pageducky.com' and page 2 was 'https:duckyducky.com?page=2'.
So I use response.url and manipulate the string, taking the ?page=... into account, something like this:
resp1 = response.url[:response.url.find('?page=')]
resp = resp1 + '?page=' + str(QuotesSpider.count)
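Put into the parse method, that string handling might look roughly like this (a sketch based on the snippet above; the check for a URL that does not yet contain '?page=' is my assumption):
base = response.url
if '?page=' in base:
    base = base[:base.find('?page=')]  # drop the existing ?page=... suffix
if pages >= QuotesSpider.count:
    next_page = base + '?page=' + str(QuotesSpider.count)
    QuotesSpider.count += 1
    yield scrapy.Request(next_page, callback=self.parse)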
I'm trying to scrape a real-estate website: https://www.nepremicnine.net/oglasi-prodaja/slovenija/hisa/. I would like to get the href that is hidden in the anchor tag (a.slika) wrapping the house images.
I would like to get this for the whole page (and the other pages). Here is the code I wrote, which returns nothing (i.e. an empty dictionary):
import scrapy
from ..items import RealEstateSloItem
import time

# first get all the URLs that have more info on the houses
# next crawl those URLs to get the desired information
class RealestateSpider(scrapy.Spider):
    # allowed_domains = ['nepremicnine.net']
    name = 'realestate'
    page_number = 2
    # page 1 url
    start_urls = ['https://www.nepremicnine.net/oglasi-prodaja/slovenija/hisa/1/']

    def parse(self, response):
        items = RealEstateSloItem()  # create it from the items class --> need to store it down
        all_links = response.css('a.slika a::attr(href)').extract()
        items['house_links'] = all_links
        yield items

        next_page = 'https://www.nepremicnine.net/oglasi-prodaja/slovenija/hisa/' + str(RealestateSpider.page_number) + '/'
        # print(next_page)
        # if next_page is not None:  # for buttons
        if RealestateSpider.page_number < 180:  # then only make sure to go to the next page
            # if yes then increase it --> for paginations
            time.sleep(1)
            RealestateSpider.page_number += 1
            # parse automatically checks for response.follow if it's there when it's done with this page
            # this is a recursive function
            # follow the next page and decide where it should go after following
            yield response.follow(next_page, self.parse)  # want it to go back to parse
Could you tell me what I am doing wrong here with css selectors?
Your selector is looking for an a element inside the a.slika element. This should solve your issue:
all_links = response.css('a.slika ::attr(href)').extract()
Those will be relative URLs; you can use response.urljoin() to build the absolute URL, using your response URL as the base.
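For example, a small sketch of following those links from parse, assuming the corrected selector above and a hypothetical parse_house callback for the detail pages:
for link in response.css('a.slika ::attr(href)').extract():
    # response.urljoin() turns the relative href into an absolute URL based on response.url
    yield scrapy.Request(response.urljoin(link), callback=self.parse_house)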
I'm scraping the content of articles from a site like this one where there is no 'Next' button to follow. The ItemLoader is passed from parse_issue in the response.meta object, along with some additional data like section_name. Here is the function:
def parse_article(self, response):
    self.logger.info('Parse function called parse_article on {}'.format(response.url))
    acrobat = response.xpath('//div[@class="txt__lead"]/p[contains(text(), "Plik do pobrania w wersji (pdf) - wymagany Acrobat Reader")]')
    limiter = response.xpath('//p[@class="limiter"]')
    if not acrobat and not limiter:
        loader = ItemLoader(item=response.meta['periodical_item'].copy(), response=response)
        loader.add_value('section_name', response.meta['section_name'])
        loader.add_value('article_url', response.url)
        loader.add_xpath('article_authors', './/p[@class="l doc-author"]/b')
        loader.add_xpath('article_title', '//div[@class="cf txt "]//h1')
        loader.add_xpath('article_intro', '//div[@class="txt__lead"]//p')
        article_content = response.xpath('.//div[@class=" txt__rich-area"]//p').getall()
        # check for pagination
        next_page_url = response.xpath('//span[@class="pgr_nrs"]/span[contains(text(), 1)]/following-sibling::a[1]/@href').get()
        if next_page_url:
            # I'm not sure what should be here... Something like this: (???)
            yield response.follow(next_page_url, callback=self.parse_article, meta={
                'periodical_item': loader.load_item(),
                'article_content': article_content
            })
        else:
            loader.add_xpath('article_content', article_content)
            yield loader.load_item()
The problem is in the parse_article function: I don't know how to combine the content of the paragraphs from all pages into one item. Does anybody know how to solve this?
Your parse_article looks good. If the issue is just adding the article_content to the loader, you just need to fetch it from response.meta.
I would update this line:
article_content = response.meta.get('article_content', []) + response.xpath('.//div[@class=" txt__rich-area"]//p').getall()
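In context, the tail of parse_article might then look roughly like this (a sketch; carrying section_name in meta and using add_value for the already-extracted strings are my additions):
article_content = response.meta.get('article_content', []) + \
    response.xpath('.//div[@class=" txt__rich-area"]//p').getall()
if next_page_url:
    yield response.follow(next_page_url, callback=self.parse_article, meta={
        'periodical_item': loader.load_item(),
        'section_name': response.meta['section_name'],
        'article_content': article_content,
    })
else:
    loader.add_value('article_content', article_content)
    yield loader.load_item()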
Just build the next-page URL and iterate over the pages.
I noticed that the article had 4 pages, but some could have more.
They are simply distinguished by adding /2 or /3 to the end of the URL, e.g.:
https://www.gosc.pl/doc/791526.Zaloz-zbroje/
https://www.gosc.pl/doc/791526.Zaloz-zbroje/2
https://www.gosc.pl/doc/791526.Zaloz-zbroje/3
I don't use Scrapy, but when I need multiple pages I would normally just iterate.
When you first scrape the page, find the maximum number of pages for that article. On that site, for example, it says 1/4, so you know you will need 4 pages in total.
url = "https://www.gosc.pl/doc/791526.Zaloz-zbroje/"
data_store = ""
for i in range(1, 5):
actual_url = "{}{}".format(url, I)
scrape_stuff = content_you_want
data_store += scrape_stuff
# format the collected data
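Since the thread is about Scrapy, a rough translation of that loop into requests might look like this (a sketch; the hard-coded page count of 4 and the parse_page callback are assumptions):
def parse(self, response):
    # ideally read the real "1/4" page counter from the first page instead of hard-coding 4
    for page in range(1, 5):
        page_url = 'https://www.gosc.pl/doc/791526.Zaloz-zbroje/{}'.format(page)
        yield scrapy.Request(page_url, callback=self.parse_page)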
I'm setting up a Scrapy project. In my project there is a for loop that should be controlled by the crawl results, but the yield Request() keyword does not return a value. So how do I control the for loop in Scrapy? See the code below for more details:
def parse_area_detail(self, response):
    for page in range(100):
        page_url = parse.urljoin(response.url, 'pg' + str(page + 1))
        yield Request(page_url, callback=self.parse_detail)

# the parse_detail function will get a title list. If the title list is
# empty, the for loop should be stopped.
def parse_detail(self, response):
    title_list = response.xpath("//div[@class='title']/a/text()").extract()
The parse_detail function gets a title list. I expect that if the title list is empty, the for loop will stop. But I know my code doesn't work like that. How do I change my code to make it work?
You could request the next page only after parsing the current one. That way you can decide to continue only if the list is not empty, e.g.:
start_urls = ['http://example.com/?p=1']
base_url = 'http://example.com/?p={}'

def parse(self, response):
    title_list = response.xpath("//div[@class='title']/a/text()").extract()
    # ... do what you want to do with the list, then ...
    if title_list:
        next_page = response.meta.get('page', 1) + 1
        yield Request(
            self.base_url.format(next_page),
            meta={'page': next_page},
            callback=self.parse
        )
I am trying to write some code to scrape the website of a UK housebuilder and record a list of houses for sale.
I am starting on the page http://www.persimmonhomes.com/sitemap. I have written one part of the code to list all the URLs of the housebuilder's developments, and a second part to scrape prices etc. from each of those URLs.
I know the second part works and I know that the first part lists all the URLs. But for some reason the URLs listed by the first part don't seem to trigger the second part of the code to scrape from them.
The code of this first part is:
def parse(self, response):
    for href in response.xpath('//*[@class="contacts-item"]/ul/li/a/@href'):
        url = urlparse.urljoin('http://www.persimmonhomes.com/', href.extract())
        yield scrapy.Request(url, callback=self.parse_dir_contents)
Now, I know this lists the URLs I want (if I add the line "print url", they all get listed), and I could manually add them to the code to run the second part, which works fine. However, even though the URLs are created, they do not seem to let the second part of the code scrape from them.
and the entire code is below:
import scrapy
import urlparse
from Persimmon.items import PersimmonItem

class persimmonSpider(scrapy.Spider):
    name = "persimmon"
    allowed_domains = ["http://www.persimmonhomes.com/"]
    start_urls = [
        "http://www.persimmonhomes.com/sitemap",
    ]

    def parse(self, response):
        for href in response.xpath('//*[@class="contacts-item"]/ul/li/a/@href'):
            url = urlparse.urljoin('http://www.persimmonhomes.com/', href.extract())
            yield scrapy.Request(url, callback=self.parse_dir_contents)

    def parse_dir_contents(self, response):
        for sel in response.xpath('//*[@id="aspnetForm"]/div[4]'):
            item = PersimmonItem()
            item['name'] = sel.xpath('//*[@id="aspnetForm"]/div[4]/div[1]/div[1]/div/div[2]/span/text()').extract()
            item['address'] = sel.xpath('//*[@id="XplodePage_ctl12_dsDetailsSnippet_pDetailsContainer"]/div/*[@itemprop="postalCode"]/text()').extract()
            plotnames = sel.xpath('//div[@class="housetype js-filter-housetype"]/div[@class="housetype__col-2"]/div[@class="housetype__plots"]/div[not(contains(@data-status,"Sold"))]/div[@class="plot__name"]/a/text()').extract()
            plotnames = [plotname.strip() for plotname in plotnames]
            plotids = sel.xpath('//div[@class="housetype js-filter-housetype"]/div[@class="housetype__col-2"]/div[@class="housetype__plots"]/div[not(contains(@data-status,"Sold"))]/div[@class="plot__name"]/a/@href').extract()
            plotids = [plotid.strip() for plotid in plotids]
            plotprices = sel.xpath('//div[@class="housetype js-filter-housetype"]/div[@class="housetype__col-2"]/div[@class="housetype__plots"]/div[not(contains(@data-status,"Sold"))]/div[@class="plot__price"]/text()').extract()
            plotprices = [plotprice.strip() for plotprice in plotprices]
            result = zip(plotnames, plotids, plotprices)
            for plotname, plotid, plotprice in result:
                item['plotname'] = plotname
                item['plotid'] = plotid
                item['plotprice'] = plotprice
                yield item
Any views as to why the first part of the code creates the URLs but the second part does not loop through them?
You just need to fix your allowed_domains property: it should contain the bare domain, not a full URL with a scheme, otherwise Scrapy's offsite filtering drops the requests yielded from parse:
allowed_domains = ["persimmonhomes.com"]
(tested - worked for me).
I have written a spider that can take the information from this page and follow the "Next page" links. Right now, the spider only takes the information that I'm showing in the following structure.
The structure of the page is something like this:
Title 1
URL 1 ---------> If you click you go to one page with more information
Location 1
Title 2
URL 2 ---------> If you click you go to one page with more information
Location 2
Next page
What I want is for the spider to go into each URL link and get the full information. I suppose I must generate another rule specifying that I want to do something like this.
The behaviour of the spider should be:
Go to URL1 (get info)
Go to URL2 (get info)
...
Next page
But I don't know how I can implement it. Can someone guide me?
Code of my Spider:
class BcnSpider(CrawlSpider):
    name = 'bcn'
    allowed_domains = ['guia.bcn.cat']
    start_urls = ['http://guia.bcn.cat/index.php?pg=search&q=*:*']

    rules = (
        Rule(
            SgmlLinkExtractor(
                allow=(re.escape("index.php")),
                restrict_xpaths=("//div[@class='paginador']")),
            callback="parse_item",
            follow=True),
    )

    def parse_item(self, response):
        self.log("parse_item")
        sel = Selector(response)
        sites = sel.xpath("//div[@id='llista-resultats']/div")
        items = []
        cont = 0
        for site in sites:
            item = BcnItem()
            item['id'] = cont
            item['title'] = u''.join(site.xpath('h3/a/text()').extract())
            item['url'] = u''.join(site.xpath('h3/a/@href').extract())
            item['when'] = u''.join(site.xpath('div[@class="dades"]/dl/dd[1]/text()').extract())
            item['where'] = u''.join(site.xpath('div[@class="dades"]/dl/dd[2]/span/a/text()').extract())
            item['street'] = u''.join(site.xpath('div[@class="dades"]/dl/dd[3]/span/text()').extract())
            item['phone'] = u''.join(site.xpath('div[@class="dades"]/dl/dd[4]/text()').extract())
            items.append(item)
            cont = cont + 1
        return items
EDIT: After searching on the internet I found code with which I can do this.
First of all, I have to get all the links, then I have to call another parse method.
def parse(self, response):
    # Get all URLs
    yield Request(url=_url, callback=self.parse_details)

def parse_details(self, response):
    # Detailed information of each page
If you want to use Rules because the page has a paginator, you should change def parse to def parse_start_url and then call this method through the Rule. With these changes you make sure that parsing begins at parse_start_url, and the code would be something like this:
rules = (
    Rule(
        SgmlLinkExtractor(
            allow=(re.escape("index.php")),
            restrict_xpaths=("//div[@class='paginador']")),
        callback="parse_start_url",
        follow=True),
)

def parse_start_url(self, response):
    # Get all URLs
    yield Request(url=_url, callback=self.parse_details)

def parse_details(self, response):
    # Detailed information of each page
That's all, folks.
There is an easier way of achieving this. Click "Next" on your link, and read the new URL carefully:
http://guia.bcn.cat/index.php?pg=search&from=10&q=*:*&nr=10
By looking at the GET data in the URL (everything after the question mark), and with a bit of testing, we find that these mean:
from=10 - Starting index
q=*:* - Search query
nr=10 - Number of items to display
This is how I would have done it:
Set nr=100 or higher. (1000 may work as well; just make sure there is no timeout.)
Loop from from=0 to 34300. This is above the current number of entries. You may want to extract this value first.
Example code:
entries = 34246
step = 100
stop = entries - entries % step + step

for x in xrange(0, stop, step):
    url = 'http://guia.bcn.cat/index.php?pg=search&from={}&q=*:*&nr={}'.format(x, step)
    # Loop over all entries, and open links if needed
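To plug those generated URLs back into the spider from the question, a minimal sketch could yield them as start requests (reusing the question's parse_item callback; everything else here is an assumption):
def start_requests(self):
    entries = 34246
    step = 100
    stop = entries - entries % step + step
    for x in range(0, stop, step):
        url = 'http://guia.bcn.cat/index.php?pg=search&from={}&q=*:*&nr={}'.format(x, step)
        yield Request(url, callback=self.parse_item)  # parse_item extracts the listing data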