I'm getting started with Scrapy and there is a website I'm trying to get data from, specifically the phone number element, which is inside a div element that has an id. I noticed that if I send a request to this page I can get it:
https://www.otomoto.pl/ajax/misc/contact/multi_phone/6CLxXv/0
So basically the base URL would be https://www.otomoto.pl/ajax/misc/contact/multi_phone/ID/0/
and 6CLxXv is the ID for this example.
How do I scrape all the div elements, concatenate the IDs with the base URL, and then retrieve the phone number element?
Here is the code used:
import scrapy
from scrapy.loader import ItemLoader
from scrapy.loader.processors import TakeFirst, MapCompose, Compose
from otomoto.items import OtomotoItem


def filter_out_array(x):
    x = x.strip()
    return None if x == '' else x


def remove_spaces(x):
    return x.replace(' ', '')


def convert_to_integer(x):
    return int(x)


class OtomotoCarLoader(ItemLoader):
    default_output_processor = TakeFirst()
    features_out = MapCompose(filter_out_array)
    price_out = Compose(TakeFirst(), remove_spaces, convert_to_integer)
class OtomotoSpider(scrapy.Spider):
    name = 'otomoto'
    start_urls = ['https://www.otomoto.pl/osobowe/']

    def parse(self, response):
        for car_page in response.css('.offer-title__link::attr(href)'):
            yield response.follow(car_page, self.parse_car_page)
        for next_page in response.css('.next.abs a::attr(href)'):
            yield response.follow(next_page, self.parse)

    # inline_requests
    def parse_car_page(self, response):
        property_list_map = {
            'Marka pojazdu': 'brand',
            'Model pojazdu': 'model',
            'Rok produkcji': 'year',
        }
        contact_response = yield scrapy.Request(url_number)  # how do I get the specific phone number URL?
        number = ...  # parse the response here? then load it into the loader
        loader = OtomotoCarLoader(OtomotoItem(), response=response)
        for params in response.css('.offer-params__item'):
            property_name = params.css('.offer-params__label::text').extract_first().strip()
            if property_name in property_list_map:
                css = params.css('.offer-params__value::text').extract_first().strip()
                if css == '':
                    css = params.css('a::text').extract_first().strip()
                loader.add_value(property_list_map[property_name], css)
        loader.add_css('price', '.offer-price__number::text')
        loader.add_css('price_currency', '.offer-price__currency::text')
        loader.add_css('features', '.offer-features__item::text')
        loader.add_value('url', response.url)
        loader.add('phone number', number)  # here I want to add the phone number to the rest of the elements
        yield loader.load_item()
Note: I was able to find the link "https://www.otomoto.pl/ajax/misc/contact/multi_phone/6CLxXv/0" by checking the page's XHR requests.
Take a look at XPath: https://docs.scrapy.org/en/0.9/topics/selectors.html. There you should find feasible solutions for selecting the distinct elements you need, e.g. selecting all the child divs of a parent div with a given id attribute: //div[@id='a']/div
This way you can put your results into a list. The rest, extracting the IDs from the list and building the request string, is simple string concatenation.
The same applies to scraping the IDs. Find unique indicators, so you can make sure that those are the elements you need, e.g. their content. Is the id you need different from the others on the page which you don't need?
for idx in collected_list:
    url = 'https.com/a/b/' + idx + '/0'
EDIT:
I see. Your code is quite advanced. I could get more into it if I had the full code, but from what I can see, you use this HTML element:
<a href="" class="spoiler seller-phones__button" data-path="multi_phone" data-id="6D5zmw" data-id_raw="6074401671" title="Kontakt Rafał" data-test="view-seller-phone-1-button" data-index="0" data-type="bottom">
<span class="icon-phone2 seller-phones__icon"></span>
<span data-test="seller-phone-2" class="phone-number seller-phones__number">694 *** ***</span>
<span class="separator">-</span>
<span class="spoilerAction">Wyświetl numer</span>
</a>
The data-id is what you need to extract, because it's the ID you are looking for, and you can simply plug it into:
new_request_url = "https://www.otomoto.pl/ajax/misc/contact/multi_phone/"+id+"/0/"
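Putting the pieces together, here is a minimal, untested sketch of how the spider above could chain the requests: it pulls data-id from the contact button (selector based on the HTML snippet above), builds the multi_phone URL, and attaches the number in a second callback. The field name phone_number and the exact format of the ajax response are assumptions; inspect the real XHR response and adjust the parsing.

def parse_car_page(self, response):
    loader = OtomotoCarLoader(OtomotoItem(), response=response)
    # ... add_css / add_value calls as in the question ...
    car_id = response.css('a.seller-phones__button::attr(data-id)').get()
    phone_url = f'https://www.otomoto.pl/ajax/misc/contact/multi_phone/{car_id}/0/'
    # carry the loader over to the next callback instead of using inline_requests
    yield scrapy.Request(phone_url, callback=self.parse_phone, meta={'loader': loader})

def parse_phone(self, response):
    loader = response.meta['loader']
    # assumption: the endpoint returns the number as plain text/JSON -- adapt as needed
    loader.add_value('phone_number', response.text.strip())
    yield loader.load_item()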
Related
I can't seem to figure out how to construct this XPath selector. I have even tried using nextsibling::text, but to no avail. I have also browsed Stack Overflow questions about scraping listed values but could not implement them correctly. I keep getting blank results. Any and all help would be appreciated. Thank you.
The website is https://www.unegui.mn/adv/5737502_10-r-khoroolold-1-oroo/.
Expected Results:
Wood
2015
Current Results:
blank
Current XPath Scrapy code:
list_li = response.xpath(".//ul[contains(@class, 'chars-column')]/li/text()").extract()
list_li = response.xpath("./ul[contains(@class,'value-chars')]//text()").extract()
floor_type = list_li[0].strip()
commission_year = list_li[1].strip()
HTML Snippet:
<div class="announcement-characteristics clearfix">
<ul class="chars-column">
<li class="">
<span class="key-chars">Flooring:</span>
<span class="value-chars">Wood</span></li>
<li class="">
<span class="key-chars">Commission year:</span>
<a class="value-chars">2015</a>
</li>
</ul>
</div>
FURTHER CLARIFICATION:
I previously used two selectors (one for the span list, one for the a-tag list), but the problem was that some pages on the website don't follow the same span-list/a-list order (i.e. on one page a table value would be in the span list, but on some other page it would be in the a-tag list). That is why I have been trying to use only one selector to get all the values.
This misaligns the results: instead of the number of windows (an integer) being scraped, the address gets scraped, because on some pages the table value is under the a-tag list, not under the span list.
Previous 2 selectors:
list_span = response.xpath(".//span[contains(@class,'value-chars')]//text()").extract()
list_a = response.xpath(".//a[contains(@class,'value-chars')]//text()").extract()
Whole code (if someone needs it for testing):
# -*- coding: utf-8 -*-
import scrapy
from scrapy import Request
from datetime import datetime
from scrapy.crawler import CrawlerProcess
from selenium import webdriver

dt_today = datetime.now().strftime('%Y%m%d')
filename = dt_today + ' UB HPI Buying Data'


# create Spider class
class UneguiApartmentsSpider(scrapy.Spider):
    name = "unegui_apts"
    allowed_domains = ["www.unegui.mn"]
    custom_settings = {
        "FEEDS": {
            f'{filename}.csv': {
                'format': 'csv',
                'overwrite': True}}
    }

    # function used for start url
    def start_requests(self):
        urls = ['https://www.unegui.mn/l-hdlh/l-hdlh-zarna/oron-suuts-zarna/ulan-bator/']
        for url in urls:
            yield Request(url, self.parse)

    def parse(self, response, **kwargs):
        cards = response.xpath("//li[contains(@class,'announcement-container')]")

        # parse details
        for card in cards:
            name = card.xpath(".//a[@itemprop='name']/@content").extract_first().strip()
            price = card.xpath(".//*[@itemprop='price']/@content").extract_first().strip()
            rooms = card.xpath("normalize-space(.//div[contains(@class,'announcement-block__breadcrumbs')]/span[2]/text())").extract_first().strip()
            link = card.xpath(".//a[@itemprop='url']/@href").extract_first().strip()
            date_block = card.xpath("normalize-space(.//div[contains(@class,'announcement-block__date')]/text())").extract_first().split(',')
            date = date_block[0].strip()
            city = date_block[1].strip()
            item = {'name': name,
                    'date': date,
                    'rooms': rooms,
                    'price': price,
                    'city': city,
                    }
            # follow absolute link to scrape deeper level
            yield response.follow(link, callback=self.parse_item, meta={'item': item})

        # handling pagination
        next_page = response.xpath("//a[contains(@class,'number-list-next js-page-filter number-list-line')]/@href").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
            print(f'Scraped {next_page}')

    def parse_item(self, response):
        # retrieve previously scraped item between callbacks
        item = response.meta['item']

        # parse additional details
        list_li = response.xpath(".//*[contains(@class, 'value-chars')]/text()").extract()

        # get additional details from list of <span> tags, element by element
        floor_type = list_li[0].strip()
        num_balcony = list_li[1].strip()
        commission_year = list_li[2].strip()
        garage = list_li[3].strip()
        window_type = list_li[4].strip()
        num_floors = list_li[5].strip()
        door_type = list_li[6].strip()
        area_sqm = list_li[7].strip()
        floor = list_li[8].strip()
        leasing = list_li[9].strip()
        district = list_li[10].strip()
        num_window = list_li[11].strip()
        address = list_li[12].strip()

        # list_span = response.xpath(".//span[contains(@class,'value-chars')]//text()").extract()
        # list_a = response.xpath(".//a[contains(@class,'value-chars')]//text()").extract()

        # get additional details from list of <span> tags, element by element
        # floor_type = list_span[0].strip()
        # num_balcony = list_span[1].strip()
        # garage = list_span[2].strip()
        # window_type = list_span[3].strip()
        # door_type = list_span[4].strip()
        # num_window = list_span[5].strip()

        # get additional details from list of <a> tags, element by element
        # commission_year = list_a[0].strip()
        # num_floors = list_a[1].strip()
        # area_sqm = list_a[2].strip()
        # floor = list_a[3].strip()
        # leasing = list_a[4].strip()
        # district = list_a[5].strip()
        # address = list_a[6].strip()

        # update item with newly parsed data
        item.update({
            'district': district,
            'address': address,
            'area_sqm': area_sqm,
            'floor': floor,
            'commission_year': commission_year,
            'num_floors': num_floors,
            'num_windows': num_window,
            'num_balcony': num_balcony,
            'floor_type': floor_type,
            'window_type': window_type,
            'door_type': door_type,
            'garage': garage,
            'leasing': leasing
        })
        yield item

    def __init__(self):
        self.driver = webdriver.Firefox()

    def parse_item2(self, response):
        self.driver.get(response.url)
        while True:
            next = self.driver.find_element_by_xpath(".//span[contains(@class,'phone-author__title')]//text()")
            try:
                next.click()
                # get the data and write it to scrapy items
            except:
                break
        self.driver.close()


# main driver
if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(UneguiApartmentsSpider)
    process.start()
You need two selectors: one will parse keys and the other will parse values. This results in two lists that can be zipped together to give you the results you are looking for.
CSS Selectors could be like:
Keys Selector --> .chars-column li .key-chars
Values Selector --> .chars-column li .value-chars
Once you extract both lists, you can zip them and consume them as key value.
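A small sketch of that zipping step inside a Scrapy callback (assuming keys and values stay aligned in document order; if a key has no matching .value-chars element, the lists drift, which is exactly the misalignment described in the question):

# pair each key with its value using the two CSS selectors above
keys = [k.strip().rstrip(':') for k in
        response.css('.chars-column li .key-chars::text').getall()]
values = [v.strip() for v in
          response.css('.chars-column li .value-chars::text').getall()]

details = dict(zip(keys, values))
# e.g. details.get('Flooring') -> 'Wood', details.get('Commission year') -> '2015'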
I suppose that because of invalid HTML (some span elements are not closed), normal XPaths are not possible.
This did give me results:
".//*[contains(@class,'value-chars')]"
The * means any element, so it will select both
<span class="value-chars">Wood</span>
and
<a class="value-chars">2015</a>
Use this XPath to get Wood
//*[@class="chars-column"]//span[2]//text()
Use this XPath to get 2015
//*[@class="chars-column"]//a[text()="2015"]
I'm building a Scrapy spider that crawls two pages (e.g. PageDucky, PageHorse), and I pass those two pages in the start_urls field.
But for pagination, I need to take my URL and concatenate it with "?page=", so I can't pass the entire list.
I already tried to make a for loop, but without success.
Does anyone know how I can make the pagination work for both pages?
Here is my code for now:
class QuotesSpider(scrapy.Spider):
    name = 'QuotesSpider'
    start_urls = ['https://PageDucky.com', 'https://PageHorse.com']
    categories = []
    count = 1

    def parse(self, response):
        # get categories
        urli = response.url
        QuotesSpider.categories = urli[urli.find('/browse') + 7:].split('/')
        QuotesSpider.categories.pop(0)

        # get items per page and calculate the pagination
        items = int(response.xpath(
            '*//div[@id="body"]/div/label[@class="item-count"]/text()').get().replace(' items', ''))
        pages = items / 10

        # call the other def to read the page itself
        for i in response.css('div#body div a::attr(href)').getall():
            if i[:5] == '/item':
                yield scrapy.Request('http://mainpage' + i, callback=self.parseobj)

        # HERE IS THE PROBLEM: I tested it, and without the for loop it works for one URL only
        for y in QuotesSpider.start_urls:
            if pages >= QuotesSpider.count:
                next_page = y + '?page=' + str(QuotesSpider.count)
                QuotesSpider.count = QuotesSpider.count + 1
                yield scrapy.Request(next_page, callback=self.parse)
Whatever website you're scraping, find the XPath/CSS location of the 'next page' button. Get its href and yield your next request to that link.
Alternatively, you don't need to use start_urls if you write your own start_requests function, where you can put custom logic, like looping through your desired URLs and appending the correct page number to each. See: https://docs.scrapy.org/en/latest/topics/spiders.html#scrapy.spiders.Spider.start_requests
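A rough sketch of that start_requests approach, using the question's own URLs; MAX_PAGES is a placeholder assumption, since the real page count has to come from the site:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'QuotesSpider'
    base_urls = ['https://PageDucky.com', 'https://PageHorse.com']
    MAX_PAGES = 5  # placeholder; compute or configure the real page count

    def start_requests(self):
        # loop over the base URLs and append the page number to each
        for base in self.base_urls:
            for page in range(1, self.MAX_PAGES + 1):
                yield scrapy.Request(f'{base}?page={page}', callback=self.parse)

    def parse(self, response):
        ...  # parse items as before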
UPDATE WITH SOLUTION
I can't use "href" because it isn't the same link; for example, page 01 was 'https:pageducky.com' and page 02 was 'https:duckyducky.com?page=2'.
So I use response.url and manipulate the string around the ?page= part, something like this:
resp1 = response.url[:response.url.find('?page=')]
resp = resp1 + '?page=' + str(QuotesSpider.count)
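Note that str.find() returns -1 when '?page=' is absent, which would clip the last character of a first-page URL. A slightly more defensive sketch of the same idea:

# split() leaves the URL intact when there is no '?page=' yet
resp1 = response.url.split('?page=')[0]
resp = resp1 + '?page=' + str(QuotesSpider.count)
yield scrapy.Request(resp, callback=self.parse)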
I'm trying to scrape a real-estate website: https://www.nepremicnine.net/oglasi-prodaja/slovenija/hisa/. I would like to get the href that is hidden in the tag of the house images.
I would like to get this for the whole page (and other pages). Here is the code I wrote, which returns nothing (i.e. an empty dictionary):
import scrapy
from ..items import RealEstateSloItem
import time


# first get all the URLs that have more info on the houses
# next crawl those URLs to get the desired information
class RealestateSpider(scrapy.Spider):
    # allowed_domains = ['nepremicnine.net']
    name = 'realestate'
    page_number = 2
    # page 1 url
    start_urls = ['https://www.nepremicnine.net/oglasi-prodaja/slovenija/hisa/1/']

    def parse(self, response):
        items = RealEstateSloItem()  # create it from the items class --> need to store it down
        all_links = response.css('a.slika a::attr(href)').extract()
        items['house_links'] = all_links
        yield items

        next_page = 'https://www.nepremicnine.net/oglasi-prodaja/slovenija/hisa/' + str(RealestateSpider.page_number) + '/'
        # print(next_page)
        # if next_page is not None:  # for buttons
        if RealestateSpider.page_number < 180:  # then only make sure to go to the next page
            # if yes then increase it --> for pagination
            time.sleep(1)
            RealestateSpider.page_number += 1
            # parse automatically checks for response.follow if it's there when it's done with this page
            # this is a recursive function
            # follow the next page and decide where it should go after following
            yield response.follow(next_page, self.parse)  # want it to go back to parse
Could you tell me what I am doing wrong here with the CSS selectors?
Your selector is looking for an a element inside the a.slika. This should solve your issue:
all_links = response.css('a.slika ::attr(href)').extract()
Those will be relative URLs; you can use response.urljoin() to build the absolute URL, with your response URL as the base domain.
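For instance, a small sketch of how that could look inside parse (parse_house is a hypothetical callback name):

for link in response.css('a.slika ::attr(href)').extract():
    # resolve the relative href against the current page URL and follow it
    yield scrapy.Request(response.urljoin(link), callback=self.parse_house)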
Hello, I'm trying to build a crawler using Scrapy.
My crawler code is:
import scrapy
from shop.items import ShopItem


class ShopspiderSpider(scrapy.Spider):
    name = 'shopspider'
    allowed_domains = ['www.organics.com']
    start_urls = ['https://www.organics.com/product-tag/special-offers/']

    def parse(self, response):
        items = ShopItem()
        title = response.xpath('//*[@id="content"]/div[2]/div[1]/ul/li[1]/a/h3').extract()
        sale_price = response.xpath('//*[@id="content"]/div[2]/div[1]/ul/li[1]/a/span[2]/del/span').extract()
        product_original_price = response.xpath('//*[@id="content"]/div[2]/div[1]/ul/li[1]/a/span[2]/ins/span').extract()
        category = response.xpath('//*[@id="content"]/div[2]/div[1]/ul/li[1]/a/span[2]/ins/span').extract()
        items['product_name'] = ''.join(title).strip()
        items['product_sale_price'] = ''.join(sale_price).strip()
        items['product_original_price'] = ''.join(product_original_price).strip()
        items['product_category'] = ','.join(map(lambda x: x.strip(), category)).strip()
        yield items
But when I run the command scrapy crawl shopspider -o info.csv to see the output, I find just the information about the first product, not all the products on this page.
So I removed the numbers between [ ] in the XPath, for example the XPath of the title: //*[@id="content"]/div/div/ul/li/a/h3
but I still get the same result.
The result is: <span class="amount">£40.00</span>,<h3>Halo Skincare Organic Gift Set</h3>,"<span class=""amount"">£40.00</span>","<span class=""amount"">£58.00</span>"
Kindly help, please.
If you remove the indexes on your XPaths, they will find all the items on the page:
response.xpath('//*[@id="content"]/div/div/ul/li/a/h3').extract()  # returns 7 items
However, you should observe that this returns a list of strings of the selected HTML elements. You should add /text() to the XPath if you want the text inside the element (which it looks like you do).
Also, the reason you only get one result is that you are concatenating all the items into a single string when assigning them to the item:
items['product_name'] = ''.join(title).strip()
Here title is a list of elements and you concatenate them all into a single string. The same logic applies to the other vars.
If that's really what you want, you can disregard the following, but I believe a better approach would be to execute a for loop and yield the items separately.
My suggestion would be:
def parse(self, response):
    products = response.xpath('//*[@id="content"]/div/div/ul/li')
    for product in products:
        items = ShopItem()
        items['product_name'] = product.xpath('a/h3/text()').get()
        items['product_sale_price'] = product.xpath('a/span/del/span/text()').get()
        items['product_original_price'] = product.xpath('a/span/ins/span/text()').get()
        items['product_category'] = product.xpath('a/span/ins/span/text()').get()
        yield items
Notice that in your original code your category var has the same XPath as your product_original_price. I kept that logic in the code, but it's probably a mistake.
I am trying to scrape the list of countries that are members of the UN, along with their details. Here is my approach without using Item Loaders.
Here, I am getting a parent tag that contains the details of all the UN members, like name, date of joining, website, phone number and UN headquarters address. Not all countries have a website, a phone number, and the other child details.
I am running a loop through the parent tag, extracting the details one by one and storing them in variables, then assigning those variables to the item.
import scrapy
from learn_scrapy.items import UNMemberItem


class UNMemberDetails(scrapy.Spider):
    name = 'UN_details'
    start_urls = ['http://www.un.org/en/member-states/index.html']

    def parse(self, response):
        """
        Get the details of the UN members
        """
        members_tag = response.css('div.member-state.col-md-12')
        # item_list = []

        for member in members_tag:
            member_name = member.css('span.member-state-name::text').extract()
            member_join_date = member.css('span.date-display-single::text').extract()
            member_website = member.css('div.site > a::text').extract()
            member_phone = member.css('div.phone > ul > li::text').extract()
            member_address = member.css('div.mail > a::text').extract()
            member_national_holiday = member.css('div.national-holiday::text').extract()

            UN_member = UNMemberItem()
            UN_member['country_name'] = member_name
            UN_member['join_date'] = member_join_date
            if len(member_website) == 0:
                member_website = 'NA'
            UN_member['website'] = member_website
            if len(member_phone) == 0:
                member_phone = 'NA'
            UN_member['phone'] = member_phone
            if len(member_address) == 0:
                member_address = 'NA'
            UN_member['mail_address'] = member_address
            UN_member['national_holiday'] = member_national_holiday

            print(UN_member)
            UN_member = str(UN_member)
            # item_list.append(UN_members)
            with open('un_members_list.txt', 'a') as f:
                f.write(UN_member + "\n")
And this is my progress. I get the whole list of countries in one item, but I want a single country per item. What should my approach be in this case?
import scrapy
from learn_scrapy.items import UNMemberItem
from scrapy.loader import ItemLoader


class UNMemberDetails(scrapy.Spider):
    name = 'UN_details_loader'
    start_urls = ['http://www.un.org/en/member-states/index.html']

    def parse(self, response):
        item_loader_object = ItemLoader(UNMemberItem(), response=response)
        nested_loader = item_loader_object.nested_css('div.member-state.col-md-12')
        nested_loader.add_css('country_name', 'span.member-state-name::text')
        nested_loader.add_css('join_date', 'span.date-display-single::text')
        nested_loader.add_css('website', 'div.site > a::text')
        nested_loader.add_css('phone', 'div.phone > ul > li::text')
        nested_loader.add_css('mail_address', 'div.mail > a::text')
        nested_loader.add_css('national_holiday', 'div.national-holiday::text')
After some research, I found the solution.
Instead of this:
def parse(self, response):
    item_loader_object = ItemLoader(UNMemberItem(), response=response)
you will have to specify the selector parameter. That means the ItemLoader will extract the items from the specified selector instead of the whole response (the whole web page).
It is like selecting a part of the page from the whole response, then selecting your items from that part while iterating through it.
def parse(self, response):
    item_loader_object = ItemLoader(UNMemberItem(), selector=member)
And the new code would look something like this:
members_tag = response.css('div.member-state.col-md-12')
for member in members_tag:
    item_loader = ItemLoader(UNMemberItem(), selector=member)
    item_loader.add_css('country_name', 'span.member-state-name::text')
    item_loader.add_css('join_date', 'span.date-display-single::text')
    item_loader.add_css('website', 'div.site > a::text')
    item_loader.add_css('phone', 'div.phone > ul > li::text')
    item_loader.add_css('mail_address', 'div.mail > a::text')
    item_loader.add_css('national_holiday', 'div.national-holiday::text')
    yield item_loader.load_item()
The code is much cleaner than the very first code snippet in the question and gets the job done.
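If each field should come out as a single string rather than a one-element list, a TakeFirst output processor can flatten it. A minimal sketch, assuming UNMemberItem keeps the fields declared above:

from scrapy.loader import ItemLoader
from scrapy.loader.processors import TakeFirst

class UNMemberLoader(ItemLoader):
    # take the first extracted value for every field instead of a list
    default_output_processor = TakeFirst()

# usage inside the loop above:
# item_loader = UNMemberLoader(UNMemberItem(), selector=member)
# ...same add_css calls...
# yield item_loader.load_item()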