I am scraping http://www.starcitygames.com/buylist/ with scrapy-splash. I have to log in to get the data I need, and that part works fine, but the data I'm after is not accessible until a "Display" button is clicked. I already got an answer telling me I cannot simply click the display button and scrape the data that appears, and that I need to scrape the JSON endpoint behind it instead. My concern is that scraping the JSON will be a red flag to the site owners, since most people never open the JSON page and it would take a human several minutes to find it, versus a computer doing it much faster. So my question is: is there any way to scrape the page by clicking Display and going from there, or do I have no choice but to scrape the JSON? This is what I have so far, but it is not clicking the button.
import scrapy
from ..items import NameItem

class LoginSpider(scrapy.Spider):
    name = "LoginSpider"
    start_urls = ["http://www.starcitygames.com/buylist/"]

    def parse(self, response):
        return scrapy.FormRequest.from_response(
            response,
            formcss='#existing_users form',
            formdata={'ex_usr_email': 'abc@example.com', 'ex_usr_pass': 'password'},
            callback=self.after_login
        )

    def after_login(self, response):
        item = NameItem()
        display_button = response.xpath('//a[contains(., "Display>>")]/@href').get()
        yield response.follow(display_button, self.parse)
        item["Name"] = response.css("div.bl-result-title::text").get()
        return item
You can use your browser's developer tools to track the request triggered by that click event. It returns nice JSON, and it needs no cookie (no login):
http://www.starcitygames.com/buylist/search?search-type=category&id=5061
The only thing you need to fill in is the category id for this request; it can be extracted from the HTML and declared in your code.
Category name:
//*[@id="bl-category-options"]/option/text()
Category id:
//*[@id="bl-category-options"]/option/@value
Working with JSON is much simpler than parsing HTML.
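A minimal sketch of that approach, assuming the endpoint and parameters shown above. The commented spider wiring is illustrative, and `parse_json` is a hypothetical callback name:

```python
# Build the JSON search URL for a scraped category id. The endpoint and
# parameter names come from the answer above; the shape of the JSON
# response itself is an assumption and may differ on the live site.
BASE = "http://www.starcitygames.com/buylist/search"

def category_search_url(category_id):
    """Return the JSON search URL for one buylist category."""
    return "{}?search-type=category&id={}".format(BASE, category_id)

# In a spider callback you would pair the two XPaths above, e.g.:
# names = response.xpath('//*[@id="bl-category-options"]/option/text()').getall()
# ids = response.xpath('//*[@id="bl-category-options"]/option/@value').getall()
# for cid in ids:
#     yield scrapy.Request(category_search_url(cid), callback=self.parse_json)

print(category_search_url(5061))
```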
I have tried to emulate the click with scrapy-splash, using a Lua script. It works; you just have to integrate it with Scrapy and process the content.
I leave you the script; I'll let you finish integrating it with Scrapy.
function main(splash)
    local url = 'https://www.starcitygames.com/login'
    assert(splash:go(url))
    assert(splash:wait(0.5))
    assert(splash:runjs('document.querySelector("#ex_usr_email_input").value = "your@email.com"'))
    assert(splash:runjs('document.querySelector("#ex_usr_pass_input").value = "your_password"'))
    splash:wait(0.5)
    assert(splash:runjs('document.querySelector("#ex_usr_button_div button").click()'))
    splash:wait(3)
    splash:go('https://www.starcitygames.com/buylist/')
    splash:wait(2)
    assert(splash:runjs('document.querySelectorAll(".bl-specific-name")[1].click()'))
    splash:wait(1)
    assert(splash:runjs('document.querySelector("#bl-search-category").click()'))
    splash:wait(3)
    splash:set_viewport_size(1200, 2000)
    return {
        html = splash:html(),
        png = splash:png(),
        har = splash:har(),
    }
end
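For the integration step, here is one way the script could be handed to Scrapy: a sketch assuming scrapy-splash is installed and configured, with the credentials injected by plain string formatting. The SplashRequest call is commented out because it needs a running Splash instance, and the truncated template stands in for the full script above:

```python
# Template of the Lua script above, with the hard-coded credentials
# replaced by placeholders so they can be filled in at runtime.
LUA_TEMPLATE = """
function main(splash)
    assert(splash:go('https://www.starcitygames.com/login'))
    assert(splash:wait(0.5))
    assert(splash:runjs('document.querySelector("#ex_usr_email_input").value = "%(email)s"'))
    assert(splash:runjs('document.querySelector("#ex_usr_pass_input").value = "%(password)s"'))
    -- ... rest of the script above, unchanged ...
end
"""

def fill_credentials(template, email, password):
    """Inject login credentials into the Lua template."""
    return template % {"email": email, "password": password}

lua_source = fill_credentials(LUA_TEMPLATE, "you@example.com", "secret")

# In the spider (requires scrapy-splash middleware in settings.py):
# from scrapy_splash import SplashRequest
# yield SplashRequest('http://www.starcitygames.com/buylist/',
#                     self.parse_result,
#                     endpoint='execute', args={'lua_source': lua_source})
```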
Related
I am attempting to scrape a website which has a "Show More" link at the bottom of the page that leads to more data to scrape. Here is a link to the website page: https://untappd.com/v/total-wine-more/47792. Here is my full code:
import scrapy

class Untap(scrapy.Spider):
    name = "Untappd"
    allowed_domains = ["untappd.com"]
    start_urls = [
        'https://untappd.com/v/total-wine-more/47792'  # Major liquor store chain with Towson location.
    ]

    def parse(self, response):
        for beer_details in response.css('div.beer-details'):
            yield {
                'name': beer_details.css('h5 a::text').getall(),        # Name of beer
                'type': beer_details.css('h5 em::text').getall(),       # Style of beer
                'ABVIBUs': beer_details.css('h6 span::text').getall(),  # ABV and IBU of beer
                'Brewery': beer_details.css('h6 span a::text').getall() # Brewery that produced beer
            }
        load_more = response.css('a.yellow button more show-more-section track-click::attr(href)').get()
        if load_more is not None:
            load_more = response.urljoin(load_more)
            yield scrapy.Request(load_more, callback=self.parse)
I've attempted to use the "load_more" block at the bottom to keep loading more data for scraping, but none of the selectors I've tried against the site's HTML have worked.
Here is the HTML from the website.
Show More Beers
I want the spider to scrape what is shown on the website, then click the link and continue scraping the page. Any help would be greatly appreciated.
Short answer:
curl 'https://untappd.com/venue/more_menu/47792/15?section_id=140248357' -H 'x-requested-with: XMLHttpRequest'
Clicking on that button executes javascript, so you'd need to use selenium to automate that, but fortunately, you won't :).
You can see, using Developer Tools, when you click that button it requests data following the pattern shown, increasing 15 each time (after /47792/), so first time:
https://untappd.com/venue/more_menu/47792/15?section_id=140248357
second time:
https://untappd.com/venue/more_menu/47792/30?section_id=140248357
then:
https://untappd.com/venue/more_menu/47792/45?section_id=140248357
and so on.
But if you try to open it directly in the browser, you get no content, because the server expects the 'x-requested-with: XMLHttpRequest' header, which indicates an AJAX request.
Thus you have the URL pattern and the required header you need for coding your scraper.
The rest is to parse each response. :)
PS: the section_id parameter will probably differ (mine is different from yours), but you already have it in the data-section-id="140248357" attribute of the button's HTML.
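A sketch of that URL pattern as code, using the venue and section ids from this thread (yours will differ):

```python
# The offset after /47792/ grows by 15 per "Show More" click, as
# described above. The venue id (47792) and section id (140248357) are
# the ones seen in this thread.
def more_menu_url(venue_id, section_id, page):
    """URL for the page-th 'Show More' AJAX request (page starts at 1)."""
    offset = 15 * page
    return ("https://untappd.com/venue/more_menu/{}/{}?section_id={}"
            .format(venue_id, offset, section_id))

# Each request must carry the AJAX header, e.g. in Scrapy:
# yield scrapy.Request(more_menu_url(47792, 140248357, 1),
#                      headers={'x-requested-with': 'XMLHttpRequest'},
#                      callback=self.parse)
print(more_menu_url(47792, 140248357, 1))
```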
I am scraping a webpage, http://www.starcitygames.com/buylist/, and I need to click a button in order to access some data, so I am trying to simulate a mouse click, but I am confused about exactly how to do that. I have had suggestions to just scrape the JSON instead because it would be a lot easier, but I really do not want to do that; I would rather scrape the regular website. Here is what I have so far. I do not know exactly how to get it to click that display button, but this is my best attempt.
import scrapy
from scrapy.spiders import Spider
from scrapy_splash import SplashRequest
from ..items import NameItem

class LoginSpider(scrapy.Spider):
    name = "LoginSpider"
    start_urls = ["http://www.starcitygames.com/buylist/"]

    def parse(self, response):
        return scrapy.FormRequest.from_response(
            response,
            formcss='#existing_users form',
            formdata={'ex_usr_email': 'email@example.com', 'ex_usr_pass': 'password'},
            callback=self.after_login
        )

    def after_login(self, response):
        item = NameItem()
        element = splash:select('#bl-search-category')  # CSS selector
        splash:mouse_click(x, y)  # Confused about how to find x and y
        item["Name"] = response.css("div.bl-result-title::text").get()
        return item
Splash is a lightweight option for rendering JS. If you have extensive clicking and navigation to do in menus that can't be reverse engineered, then you probably don't want Splash, unless you don't mind writing a Lua script. You may want to see this answer in that regard.
You will write a Lua script and pass it to Splash's execute endpoint. Depending on how complex your task is, Selenium may be a better choice for your project. However, first thoroughly examine the target site and be SURE that you need to render JavaScript: rendering JS is always the worst thing you can do for speed and resources if you don't have to.
PS: We can't access this site without the login credentials, but I would suspect that you don't need to render the JavaScript. That is the case 90%+ of the time.
I am having a problem crawling with the next button. I tried the basic approach, but after checking the HTML code I found that it uses JavaScript. I've tried different rules but nothing works. Here's the link to the website:
https://www2.hm.com/en_us/sale/shopbyproductladies/view-all.html
The next button's label is "Load More Products".
Here's my working code:
def parse(self, response):
    for product_item in response.css('li.product-item'):
        url = "https://www2.hm.com/" + product_item.css('a::attr(href)').extract_first()
        yield scrapy.Request(url=url, callback=self.parse_subpage)

def parse_subpage(self, response):
    item = {
        'title': response.xpath("normalize-space(.//h1[contains(@class, 'primary') and contains(@class, 'product-item-headline')]/text())").extract_first(),
        'sale-price': response.xpath("normalize-space(.//span[@class='price-value']/text())").extract_first(),
        'regular-price': response.xpath('//script[contains(text(), "whitePrice")]/text()').re_first("'whitePrice'\s?:\s?'([^']+)'"),
        'photo-url': response.css('div.product-detail-main-image-container img::attr(src)').extract_first(),
        'description': response.css('p.pdp-description-text::text').extract_first()
    }
    yield item
As already hinted in the comments, there's no need to involve JavaScript at all. If you visit the page and open up your browser's developer tools, you'll see there are XHR requests like this taking place:
https://www2.hm.com/en_us/sale/women/view-all/_jcr_content/main/productlisting_b48c.display.json?sort=stock&image-size=small&image=stillLife&offset=36&page-size=36
These requests return JSON data that is then rendered on the page using JavaScript. So you can just scrape these URLs and parse the payload with something like json.loads(response.text). Control the products being returned via the offset and page-size parameters. I assume you are done when you receive an empty JSON response. Alternatively, you can set offset=0 and page-size=9999 to get the data in one go (9999 is just an arbitrary number that happens to be enough in this particular case).
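A sketch of that paging scheme, assuming the parameter names from the captured XHR URL; the stop condition in the comments is an assumption about the JSON shape:

```python
from urllib.parse import urlencode

# Base URL as captured from the browser's XHR tab above.
BASE = ("https://www2.hm.com/en_us/sale/women/view-all/"
        "_jcr_content/main/productlisting_b48c.display.json")

def listing_url(offset, page_size=36):
    """Build the JSON listing URL for one page of products."""
    query = urlencode({
        "sort": "stock",
        "image-size": "small",
        "image": "stillLife",
        "offset": offset,
        "page-size": page_size,
    })
    return "{}?{}".format(BASE, query)

# In a spider callback:
# data = json.loads(response.text)
# if not data.get("products"):   # assumption: an empty page means done
#     return
print(listing_url(36))
```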
I'm doing a script in python using Scrapy in order to scrape data from a website using an authentication.
The page I'm scraping is really painful because it is mainly built with JavaScript and AJAX requests. The whole body of the page is put inside a <form> that allows changing the page using a submit button; the URL doesn't change (and it's a .aspx).
I have successfully made it scrape all the data I need from page one, then change page by clicking on this input button using this code:
yield FormRequest.from_response(response,
                                formname="Form",
                                clickdata={"class": "PageNext"},
                                callback=self.after_login)
The after_login method is scraping the data.
However, I need data that appears in another div after clicking on a container with an onclick attribute. I need to loop over each container, click it to display the data, scrape it, and only then move to the next page and repeat the process.
The thing is, I can't figure out how to have the script click on each container using Selenium (while staying logged in, otherwise I cannot reach this page) and then have Scrapy scrape the data that appears after the XHR request has been made.
I did a lot of research on the internet but could not try any solution.
Thanks !
OK, so I've almost got what I want, following @malberts' advice.
I've used this kind of code in order to get the AJAX response:
yield scrapy.FormRequest.from_response(
    response=response,
    formdata={
        'param1': param1value,
        'param2': param2value,
        '__VIEWSTATE': __VIEWSTATE,
        '__ASYNCPOST': 'true',
        'DetailsId': '123'},
    callback=self.parse_item)

def parse_item(self, response):
    ajax_response = response.body
    yield {'Response': ajax_response}
The response is supposed to be HTML. The thing is, the response is not exactly the same as the one I see in the Response tab of Chrome Dev Tools. I haven't taken all the form data into account yet (~10 of 25 fields); could it be that the server needs all the form fields, even the ones that don't change depending on the id?
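That is likely the case: ASP.NET postbacks usually need every hidden field (__VIEWSTATE, __EVENTVALIDATION, and friends) echoed back, not just the ones that change. A sketch of collecting them from the raw HTML with a regex; it assumes the common attribute order (`type` before `name` before `value`), and in Scrapy you would feed it `response.text`:

```python
import re

# Match hidden <input> tags and capture their name/value pairs.
# Assumes attributes appear in the order: type, name, value.
HIDDEN_INPUT = re.compile(
    r'<input[^>]*type=["\']hidden["\'][^>]*name=["\']([^"\']+)["\']'
    r'[^>]*value=["\']([^"\']*)["\']',
    re.I)

def hidden_fields(html):
    """Return all hidden form fields as a name -> value dict."""
    return dict(HIDDEN_INPUT.findall(html))

# Merge the result into formdata before building the FormRequest, e.g.:
# formdata = hidden_fields(response.text)
# formdata.update({'__ASYNCPOST': 'true', 'DetailsId': '123'})

sample = ('<input type="hidden" name="__VIEWSTATE" value="abc" />'
          '<input type="hidden" name="__EVENTVALIDATION" value="xyz" />')
print(hidden_fields(sample))
```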
Thanks !
I'm writing a scrapy application that crawls a website's main page, saves its url, and also checks its menu items, recursively applying the same process to them.
class NeatSpider(scrapy.Spider):
    name = "example"
    start_urls = ['https://example.com/main/']

    def parse(self, response):
        url = response.url
        yield url

        # check for in-menu article links
        menu_items = response.css(MENU_BAR_LINK_ITEM).extract()
        if menu_items is not None:
            for menu_item in menu_items:
                yield scrapy.Request(response.urljoin(menu_item), callback=self.parse)
In the example website, each menu item leads to another page with another menu items.
Some page responses reach the 'parse' method, so their url gets saved, while others do not.
The ones that don't still return a 200 status (when I enter their address manually in the browser), don't throw any exceptions, and otherwise behave the same as the pages that do reach the parse method.
Additional information: ALL of the menu items reach the last line of the code (without any errors), and if I provide an 'errback' callback method, no request ever gets there.
EDIT: here is the log: http://pastebin.com/2j5HMkqN
There is a chance that the website you are scraping is showing captchas.
You can debug your scraper like this; it will open the scraped webpage in your OS's default browser:
from scrapy.utils.response import open_in_browser

def parse_details(self, response):
    if "item name" not in response.text:
        open_in_browser(response)