Scrapy Load More Issue - CSS Selector - python

I am attempting to scrape a website which has a "Show More" link at the bottom of the page that leads to more data to scrape. Here is a link to the website page: https://untappd.com/v/total-wine-more/47792. Here is my full code:
import scrapy


class Untap(scrapy.Spider):
    name = "Untappd"
    allowed_domains = ["untappd.com"]
    start_urls = [
        'https://untappd.com/v/total-wine-more/47792'  # URL: Major liquor store chain with Towson location.
    ]

    def parse(self, response):
        for beer_details in response.css('div.beer-details'):
            yield {
                'name': beer_details.css('h5 a::text').getall(),         # Name of Beer
                'type': beer_details.css('h5 em::text').getall(),        # Style of Beer
                'ABVIBUs': beer_details.css('h6 span::text').getall(),   # ABV and IBU of Beer
                'Brewery': beer_details.css('h6 span a::text').getall()  # Brewery that produced Beer
            }

        load_more = response.css('a.yellow button more show-more-section track-click::attr(href)').get()
        if load_more is not None:
            load_more = response.urljoin(load_more)
            yield scrapy.Request(load_more, callback=self.parse)
I've attempted to use the "load_more" block at the bottom to keep loading more data for scraping, but none of the selectors I've tried against the site's HTML have worked.
Here is the text of the "Show More" button from the website's HTML:
Show More Beers
I want the spider to scrape what is shown on the website, then follow the link and continue scraping the page. Any help would be greatly appreciated.

Short answer:
curl 'https://untappd.com/venue/more_menu/47792/15?section_id=140248357' -H 'x-requested-with: XMLHttpRequest'
Clicking that button executes javascript, so you'd need Selenium to automate the click, but fortunately you won't need it :).
You can see, using Developer Tools, that when you click the button it requests data following the pattern below, with the number after /47792/ increasing by 15 each time, so the first time:
https://untappd.com/venue/more_menu/47792/15?section_id=140248357
second time:
https://untappd.com/venue/more_menu/47792/30?section_id=140248357
then:
https://untappd.com/venue/more_menu/47792/45?section_id=140248357
and so on.
But if you request those URLs directly from the browser you get no content, because the server expects the 'x-requested-with: XMLHttpRequest' header, indicating it is an AJAX request.
Thus you have the URL pattern and the required header you need for coding your scraper.
All that's left is to parse each response. :)
PS: the section_id parameter will probably differ (mine is different from yours), but you already have it in the data-section-id="140248357" attribute in the button's HTML.
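For what it's worth, a minimal Scrapy sketch of that approach might look like the following. The 15-per-request step, the header, and the section_id come from the observations above; the spider name, the assumption that the endpoint returns an HTML fragment, and the div.beer-details selectors (borrowed from your main-page code) are mine, so adjust them to whatever the endpoint actually returns.

import scrapy


class UntappdMenuSpider(scrapy.Spider):
    name = "untappd_menu"
    allowed_domains = ["untappd.com"]
    venue_id = "47792"
    section_id = "140248357"  # read this from the button's data-section-id

    def start_requests(self):
        yield self.more_menu_request(offset=15)

    def more_menu_request(self, offset):
        url = (
            f"https://untappd.com/venue/more_menu/{self.venue_id}/{offset}"
            f"?section_id={self.section_id}"
        )
        # The server only answers requests that look like AJAX calls.
        return scrapy.Request(
            url,
            headers={"x-requested-with": "XMLHttpRequest"},
            callback=self.parse,
            cb_kwargs={"offset": offset},
        )

    def parse(self, response, offset):
        beers = response.css("div.beer-details")
        for beer_details in beers:
            yield {
                "name": beer_details.css("h5 a::text").getall(),
                "type": beer_details.css("h5 em::text").getall(),
            }
        # Keep paging in steps of 15 until a response comes back empty.
        if beers:
            yield self.more_menu_request(offset + 15)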

Related

Scraping images in a dynamic, JavaScript webpage using Scrapy and Splash

I am trying to scrape the link to a hi-res image from this page, but the high-res version of the image can only be inspected after clicking the mid-sized link on the page, i.e. after clicking "Click here to enlarge the image" (on the page it's in Turkish).
Then I can inspect it with Chrome's "Developer Tools" and get the xpath/css selector. Everything is fine up to this point.
However, you know that on a JS page you can't just type response.xpath("//blah/blah/@src") and get some data. I installed Splash (with Docker pull) and configured my Scrapy settings.py etc. to make it work (a YouTube tutorial helped; no need to visit it unless you want to learn how to do it). ...and it worked on other JS webpages!
The only thing is, I cannot get past this "Click here to enlarge the image!" step and get the response. It gives me a null response.
This is my code:
import scrapy
#import json
from scrapy_splash import SplashRequest


class TryMe(scrapy.Spider):
    name = 'try_me'
    allowed_domains = ['arabam.com']

    def start_requests(self):
        start_urls = [
            "https://www.arabam.com/ilan/sahibinden-satilik-hyundai-accent/bayramda-arabasiz-kalmaa/17753653",
        ]
        for url in start_urls:
            yield scrapy.Request(
                url=url,
                callback=self.parse,
                meta={'splash': {'endpoint': 'render.html', 'args': {'wait': 0.5}}})
            # yield SplashRequest(url=url, callback=self.parse)  # this works too

    def parse(self, response):
        ## I can get this one's link successfully since it's not between js codes:
        # IMG_LINKS = response.xpath('//*[@id="js-hook-for-ing-credit"]/div/div/a/img/@src').get()
        ## but this one just doesn't work:
        IMG_LINKS = response.xpath("/html/body/div[7]/div/div[1]/div[1]/div/img/@src").get()
        print(IMG_LINKS)  # prints null :(
        yield {"img_links": IMG_LINKS}  # gives the items: img_links:null
Shell command which I'm using:
scrapy crawl try_me -O random_filename.jl
Xpath of the link I'm trying to scrape:
/html/body/div[7]/div/div[1]/div[1]/div/img
I actually can see the link I want in the Network tab of my Developer Tools window when I click to enlarge it, but I don't know how to scrape that link from that tab.
Possible solution: I could also grab the whole garbled body of my response, i.e. response.text, and apply a regular expression to it (e.g. starts with https://... and ends with .jpg). That would definitely be looking for a needle in a haystack, but it sounds quite practical as well.
Thanks!
As far as I understand, you want to find the main image link. I checked out the page; it is inside one of the meta elements:
<meta itemprop="image" content="https://arbstorage.mncdn.com/ilanfotograflari/2021/06/23/17753653/3c57b95d-9e76-42fd-b418-f81d85389529_image_for_silan_17753653_1920x1080.jpg">
Which you can get with
>>> response.css('meta[itemprop=image]::attr(content)').get()
'https://arbstorage.mncdn.com/ilanfotograflari/2021/06/23/17753653/3c57b95d-9e76-42fd-b418-f81d85389529_image_for_silan_17753653_1920x1080.jpg'
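For instance, a minimal spider built around that selector might look like this (the spider name and the output field name are placeholders of mine):

import scrapy


class ArabamImageSpider(scrapy.Spider):
    name = "arabam_image"
    allowed_domains = ["arabam.com"]
    start_urls = [
        "https://www.arabam.com/ilan/sahibinden-satilik-hyundai-accent/bayramda-arabasiz-kalmaa/17753653",
    ]

    def parse(self, response):
        # The main image URL sits in the itemprop="image" meta tag,
        # so no JavaScript rendering is needed.
        yield {
            "img_link": response.css('meta[itemprop=image]::attr(content)').get(),
        }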
You don't need to use splash for this. When I check the website with splash, arabam.com gives a permission denied error, so I recommend not using splash for this website.
For a better solution covering all the images, you can parse the javascript: the images array is loaded with js right there in the source.
To reach that javascript, try:
response.css('script::text').getall()[14]
This will give you the whole javascript string containing the images array. You can parse it with libraries like js2xml.
Check out how you can use it here: https://github.com/scrapinghub/js2xml. If you still have questions, you can ask. Good luck
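A rough sketch of that idea follows; the script index 14 comes from the snippet above, and the final filter is only a placeholder, since the real variable/array name holding the images is site-specific and not shown here:

import js2xml

# `response` is assumed to be the Scrapy response for the ad page.
script_text = response.css('script::text').getall()[14]

# js2xml turns the JavaScript source into an lxml document you can query.
parsed = js2xml.parse(script_text)
print(js2xml.pretty_print(parsed))  # inspect the structure first

# Then pull values out with ordinary XPath, e.g. every string literal that
# looks like an image URL (adjust to the real variable/array name).
image_urls = [s for s in parsed.xpath('//string/text()') if s.endswith('.jpg')]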

scrapy+selenium: how to crawl a different page list once I'm done with one?

I'm trying to scrape data from an "auction/user trading" website. It is in Italian, so I'll try to be as clear as possible.
I'm also really new to Python and Scrapy; this is my first project.
The website doesn't offer an easy way to follow links, so I had to come up with a few things.
First I go to the general list, where all the pages are listed. This part is pretty easy: the first page is "https://www.subito.it/annunci-italia/vendita/usato/?o=1" and the list goes on up to "/?o=218776". I pick the first link of the page and open it with Selenium; once there I get the data I need and then click the "next page" button, but here's the tricky part.
If I go to the same item page using its URL directly, there is no "next page" button; it only works if you start from the list page and then click through to the item page. From there you can follow the other links.
I thought that would be it, but I was wrong. The general list is divided into pages (.../?o=1, .../?o=2, etc.), and each page has some number of links (I haven't counted them). When you are on one of the item pages (having come from the list page, so the "next page" button works) and you click "next page", you follow the order of the links in the general list.
To be clearer: if the general list has 200k pages and each page has 50 links, when you click the first link of a page you can then click "next page" 49 times. After that the "next page" button is inactive and you can't go to older links; you must go back to the list, move to its next page, and repeat the process.
import scrapy
from scrapy.http import HtmlResponse
from scrapy.selector import Selector
from selenium import webdriver


class NumeriSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'https://www.subito.it/annunci-italia/vendita/usato/?o=41',
    ]

    def __init__(self):
        self.driver = webdriver.Firefox()

    def parse(self, response):
        self.driver.get(response.url)
        self.driver.find_element_by_xpath(
            '/html/body/div[5]/div/main/div[2]/div[5]/div[2]/div[1]/div[3]/div[1]/a').click()
        while True:
            sel = Selector(text=self.driver.page_source)
            item = {
                'titolo': sel.xpath('//h1[@class="classes_sbt-text-atom__2GBat classes_token-h4__3_Swu size-normal classes_weight-semibold__1RkLc ad-info__title"]/text()').get(),
                'nome': sel.xpath("//p[@class='classes_sbt-text-atom__2GBat classes_token-subheading__3yij_ size-normal classes_weight-semibold__1RkLc user-name jsx-1261801758']/text()").get(),
                'luogo': sel.xpath("//span[@class='classes_sbt-text-atom__2GBat classes_token-overline__2P5H8 size-normal ad-info__location__text']/text()").get()
            }
            yield item
            next = self.driver.find_element_by_xpath(
                '/html/body/div[5]/div/main/div[2]/div[1]/section[1]/nav/div/div/button[2]')
            try:
                next.click()
            except:
                self.driver.quit()
This is the code I wrote with the help of the Scrapy docs and many websites/Stack Overflow pages.
I give it a page of the general list to scrape, in this case https://www.subito.it/annunci-italia/vendita/usato/?o=41. It finds the first link of the page (self.driver.find_element_by_xpath('/html/body/div[5]/div/main/div[2]/div[5]/div[2]/div[1]/div[3]/div[1]/a').click()) and then starts getting the data I want. Once it is done, it clicks the "next page" button (next = self.driver.find_element_by_xpath('/html/body/div[5]/div/main/div[2]/div[1]/section[1]/nav/div/div/button[2]')) and repeats the "get data, click next page" process.
The last item page has an inactive "next page" button, so at the moment the crawler gets stuck there. I manually close the browser, edit the "start_urls" link in Notepad++ to point to the page after the one I've just scraped, and run the crawler again.
I'd like it to be fully automatic, so I can leave it doing its thing for hours (I'm saving the data in a JSON file at the moment).
The "inactive" next-page button differs from the active one only by a disabled="" attribute. How do I detect that? And once detected, how do I tell the crawler to go back to the list page, add 1, and do the scraping process again?
My issue is only detecting that inactive button and making a loop that adds 1 to the list page I gave (if I start with the link "https://www.subito.it/annunci-italia/vendita/usato/?o=1", it should then go to "https://www.subito.it/annunci-italia/vendita/usato/?o=2" and do the same thing).
It's possible to iterate over the pages by overriding the start_requests method. To do that you need a loop that requests all (in this case 219xxx) list pages and extracts the hrefs of the second-layer pages.

def start_requests(self):
    pages_count = 1  # in this method you need to hard-code your pages quantity
    for i in range(pages_count):
        url = 'https://www.subito.it/annunci-italia/vendita/usato/?o=%s' % str(i + 1)
        yield scrapy.Request(url, callback=self.parse)

Or, better, find out how many pages there are in the first layer; the count is always in the last class="unselected-page" element, so you can find it with response.xpath('//*[@class="unselected-page"]//text()').getall()[-1]. In this case you'll need to make the requests for the first-layer pages in the first parse method.
def start_requests(self):
    base_url = 'https://www.subito.it/annunci-italia/vendita/usato'
    yield scrapy.Request(base_url, callback=self.parse_first_layer)

def parse_first_layer(self, response):
    pages_count = int(response.xpath('//*[@class="unselected-page"]//text()').getall()[-1])
    for i in range(pages_count):
        url = 'https://www.subito.it/annunci-italia/vendita/usato/?o=%s' % str(i + 1)
        yield scrapy.Request(url, callback=self.parse_second_layer)

After reaching the first-layer links you can iterate over the ~50 links on each page as before.
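For the part about detecting the inactive button, here is a rough sketch of how the two pieces could fit together, assuming (as the question says) that the disabled state shows up as a disabled attribute on the button. The XPaths are the ones from the question, the title XPath is shortened to the ad-info__title class, and the page count is hard-coded:

import scrapy
from scrapy.selector import Selector
from selenium import webdriver


class NumeriSpider(scrapy.Spider):
    name = "quotes"
    # One shared browser, so keep Scrapy from firing requests concurrently.
    custom_settings = {'CONCURRENT_REQUESTS': 1}

    def __init__(self):
        self.driver = webdriver.Firefox()

    def start_requests(self):
        pages_count = 10  # number of list pages to walk, hard-coded for the sketch
        for i in range(pages_count):
            url = 'https://www.subito.it/annunci-italia/vendita/usato/?o=%s' % str(i + 1)
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        self.driver.get(response.url)
        # Open the first ad of this list page so the "next page" button works.
        self.driver.find_element_by_xpath(
            '/html/body/div[5]/div/main/div[2]/div[5]/div[2]/div[1]/div[3]/div[1]/a').click()
        while True:
            sel = Selector(text=self.driver.page_source)
            yield {
                'titolo': sel.xpath('//h1[contains(@class, "ad-info__title")]/text()').get(),
            }
            next_button = self.driver.find_element_by_xpath(
                '/html/body/div[5]/div/main/div[2]/div[1]/section[1]/nav/div/div/button[2]')
            # The inactive button carries a disabled attribute: stop here and
            # let Scrapy move on to the next list page from start_requests.
            if next_button.get_attribute('disabled') is not None:
                break
            next_button.click()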

Click display button in Scrapy-Splash

I am scraping the following webpage using scrapy-splash, http://www.starcitygames.com/buylist/, which I have to log in to in order to get the data I need. That part works fine, but to get the data I have to click the display button; the data is not accessible until the button is clicked. I already got an answer telling me I cannot simply click the display button and scrape the data that shows up, and that I need to scrape the JSON page behind that information. However, I am concerned that scraping the JSON instead will be a red flag to the owners of the site, since most people do not open the JSON data page and it would take a human several minutes to find it, versus a computer, which would be much faster. So I guess my question is: is there any way to scrape the webpage by clicking display and going from there, or do I have no choice but to scrape the JSON page? This is what I have got so far... but it is not clicking the button.
import scrapy
from ..items import NameItem


class LoginSpider(scrapy.Spider):
    name = "LoginSpider"
    start_urls = ["http://www.starcitygames.com/buylist/"]

    def parse(self, response):
        return scrapy.FormRequest.from_response(
            response,
            formcss='#existing_users form',
            formdata={'ex_usr_email': 'abc@example.com', 'ex_usr_pass': 'password'},
            callback=self.after_login
        )

    def after_login(self, response):
        item = NameItem()
        display_button = response.xpath('//a[contains(., "Display>>")]/@href').get()
        yield response.follow(display_button, self.parse)
        item["Name"] = response.css("div.bl-result-title::text").get()
        return item
You can use your browser's developer tools to track the request behind that click event; the response is in a nice JSON format, and there's no need for a cookie (login) either:
http://www.starcitygames.com/buylist/search?search-type=category&id=5061
The only thing you need to fill in is the category_id related to this request; it can be extracted from the HTML and declared in your code.
Category name:
//*[@id="bl-category-options"]/option/text()
Category id:
//*[@id="bl-category-options"]/option/@value
Working with JSON is much simpler than parsing HTML.
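As an illustration, a sketch combining the login flow from the question with this JSON endpoint could look roughly like the code below. The search URL and the option XPaths come from this answer; the exact structure of the JSON payload isn't shown here, so the parsed data is yielded as-is.

import json

import scrapy


class BuylistSpider(scrapy.Spider):
    name = "buylist_json"
    start_urls = ["http://www.starcitygames.com/buylist/"]

    def parse(self, response):
        # Same login form as in the question.
        return scrapy.FormRequest.from_response(
            response,
            formcss='#existing_users form',
            formdata={'ex_usr_email': 'abc@example.com', 'ex_usr_pass': 'password'},
            callback=self.after_login,
        )

    def after_login(self, response):
        names = response.xpath('//*[@id="bl-category-options"]/option/text()').getall()
        ids = response.xpath('//*[@id="bl-category-options"]/option/@value').getall()
        for name, category_id in zip(names, ids):
            url = ('http://www.starcitygames.com/buylist/search'
                   '?search-type=category&id=%s' % category_id)
            yield scrapy.Request(url, callback=self.parse_category,
                                 cb_kwargs={'category': name})

    def parse_category(self, response, category):
        # Attach the raw parsed JSON to the category name for now.
        yield {'category': category, 'data': json.loads(response.text)}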
I have tried to emulate the click with scrapy-splash, using a Lua script. It works; you just have to integrate it with Scrapy and process the content.
Here is the script; I still have to finish integrating it with Scrapy.
function main(splash)
    local url = 'https://www.starcitygames.com/login'
    assert(splash:go(url))
    assert(splash:wait(0.5))
    assert(splash:runjs('document.querySelector("#ex_usr_email_input").value = "your@email.com"'))
    assert(splash:runjs('document.querySelector("#ex_usr_pass_input").value = "your_password"'))
    splash:wait(0.5)
    assert(splash:runjs('document.querySelector("#ex_usr_button_div button").click()'))
    splash:wait(3)
    splash:go('https://www.starcitygames.com/buylist/')
    splash:wait(2)
    assert(splash:runjs('document.querySelectorAll(".bl-specific-name")[1].click()'))
    splash:wait(1)
    assert(splash:runjs('document.querySelector("#bl-search-category").click()'))
    splash:wait(3)
    splash:set_viewport_size(1200, 2000)
    return {
        html = splash:html(),
        png = splash:png(),
        har = splash:har(),
    }
end
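If it helps, the usual way to run a script like this from Scrapy is the scrapy-splash execute endpoint. A minimal, hedged sketch follows (the lua_script string stands for the Lua above, the spider name is mine, and the parse logic is left out):

import scrapy
from scrapy_splash import SplashRequest

lua_script = """
-- the Lua script above goes here
"""


class BuylistSplashSpider(scrapy.Spider):
    name = "buylist_splash"

    def start_requests(self):
        # The URL is mostly a label here; the Lua script drives the navigation.
        yield SplashRequest(
            "https://www.starcitygames.com/buylist/",
            callback=self.parse,
            endpoint="execute",
            args={"lua_source": lua_script},
        )

    def parse(self, response):
        # With the execute endpoint returning a table, scrapy-splash exposes
        # the decoded JSON on response.data (magic responses on by default).
        html = response.data.get("html", "")
        self.logger.info("Got %d bytes of rendered HTML", len(html))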

How to do Next Page if it's using Javascript in Scrapy

I am having a problem crawling with the next button. I tried the basic approach, but after checking the HTML code I saw that it uses javascript; I've tried different rules but nothing works. Here's the link for the website:
https://www2.hm.com/en_us/sale/shopbyproductladies/view-all.html
The next button's label is "Load More Products".
Here's my working code:
def parse(self, response):
    for product_item in response.css('li.product-item'):
        url = "https://www2.hm.com/" + product_item.css('a::attr(href)').extract_first()
        yield scrapy.Request(url=url, callback=self.parse_subpage)

def parse_subpage(self, response):
    item = {
        'title': response.xpath("normalize-space(.//h1[contains(@class, 'primary') and contains(@class, 'product-item-headline')]/text())").extract_first(),
        'sale-price': response.xpath("normalize-space(.//span[@class='price-value']/text())").extract_first(),
        'regular-price': response.xpath('//script[contains(text(), "whitePrice")]/text()').re_first("'whitePrice'\s?:\s?'([^']+)'"),
        'photo-url': response.css('div.product-detail-main-image-container img::attr(src)').extract_first(),
        'description': response.css('p.pdp-description-text::text').extract_first()
    }
    yield item
As already hinted in the comments, there's no need to involve JavaScript at all. If you visit the page and open up your browser's developer tools, you'll see there are XHR requests like this taking place:
https://www2.hm.com/en_us/sale/women/view-all/_jcr_content/main/productlisting_b48c.display.json?sort=stock&image-size=small&image=stillLife&offset=36&page-size=36
These requests return JSON data that is then rendered on the page by JavaScript. So you can just scrape these URLs and parse each response with something like json.loads(response.text). Control which products are returned with the offset and page-size parameters. I assume you are done when you receive an empty JSON response. Alternatively, you can set offset=0 and page-size=9999 to get the data in one go (9999 is just an arbitrary number that is large enough in this particular case).
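A minimal sketch of that idea (the listing URL and parameters are the ones above; the spider name and the "products" key are assumptions about the JSON structure, so check them against a real response):

import json

import scrapy


class HmSaleSpider(scrapy.Spider):
    name = "hm_sale"
    page_size = 36
    base_url = (
        "https://www2.hm.com/en_us/sale/women/view-all/_jcr_content/main/"
        "productlisting_b48c.display.json"
        "?sort=stock&image-size=small&image=stillLife&offset={offset}&page-size={page_size}"
    )

    def start_requests(self):
        yield scrapy.Request(
            self.base_url.format(offset=0, page_size=self.page_size),
            callback=self.parse, cb_kwargs={"offset": 0})

    def parse(self, response, offset):
        data = json.loads(response.text)
        # "products" is a guess at the key holding the listing entries.
        products = data.get("products", [])
        for product in products:
            yield product
        # Stop once a page comes back empty; otherwise request the next slice.
        if products:
            next_offset = offset + self.page_size
            yield scrapy.Request(
                self.base_url.format(offset=next_offset, page_size=self.page_size),
                callback=self.parse, cb_kwargs={"offset": next_offset})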

Scrape displayed data through onclick using Selenium and Scrapy

I'm writing a Python script using Scrapy in order to scrape data from a website that requires authentication.
The page I'm scraping is really painful because it is mainly built with javascript and AJAX requests. The entire body of the page is inside a <form> that allows changing pages using a submit button. The URL doesn't change (and it's a .aspx).
I have successfully made it scrape all the data I need from page one, then change page by clicking on this input button using this code:
yield FormRequest.from_response(response,
                                formname="Form",
                                clickdata={"class": "PageNext"},
                                callback=self.after_login)
The after_login method scrapes the data.
However, I need data that appears in another div after clicking on a container with an onclick attribute. I need to loop over the containers, click each one to display the data, scrape it, and only then go to the next page and repeat the whole process.
The thing is, I can't figure out how to make "the script" click on the container using Selenium (while staying logged in, otherwise I cannot reach this page) and then have Scrapy scrape the data once the XHR request has been made.
I did a lot of research on the internet but couldn't get any solution to work.
Thanks!
OK, so I've almost got what I want, following @malberts' advice.
I've used this kind of code in order to get the Ajax response:
yield scrapy.FormRequest.from_response(
    response=response,
    formdata={
        'param1': param1value,
        'param2': param2value,
        '__VIEWSTATE': __VIEWSTATE,
        '__ASYNCPOST': 'true',
        'DetailsId': '123'},
    callback=self.parse_item)

def parse_item(self, response):
    ajax_response = response.body
    yield {'Response': ajax_response}
The response is supposed to be HTML. The thing is, it is not exactly the same as the response I see in Chrome Dev Tools. I haven't taken all the form data into account yet (~10 of 25 fields); could it be that the server needs all the form fields, even the ones that don't change depending on the id?
Thanks !
