I am having a problem crawling the "next" button. I tried the basic approach, but after checking the HTML I found that it uses JavaScript. I've tried different rules but nothing works. Here's the link to the website:
https://www2.hm.com/en_us/sale/shopbyproductladies/view-all.html
The "next" button is labeled "Load More Products".
Here's my working code:
def parse(self, response):
    for product_item in response.css('li.product-item'):
        url = "https://www2.hm.com/" + product_item.css('a::attr(href)').extract_first()
        yield scrapy.Request(url=url, callback=self.parse_subpage)

def parse_subpage(self, response):
    item = {
        'title': response.xpath("normalize-space(.//h1[contains(@class, 'primary') and contains(@class, 'product-item-headline')]/text())").extract_first(),
        'sale-price': response.xpath("normalize-space(.//span[@class='price-value']/text())").extract_first(),
        'regular-price': response.xpath('//script[contains(text(), "whitePrice")]/text()').re_first(r"'whitePrice'\s?:\s?'([^']+)'"),
        'photo-url': response.css('div.product-detail-main-image-container img::attr(src)').extract_first(),
        'description': response.css('p.pdp-description-text::text').extract_first()
    }
    yield item
As already hinted in the comments, there's no need to involve JavaScript at all. If you visit the page and open up your browser's developer tools, you'll see there are XHR requests like this taking place:
https://www2.hm.com/en_us/sale/women/view-all/_jcr_content/main/productlisting_b48c.display.json?sort=stock&image-size=small&image=stillLife&offset=36&page-size=36
These requests return JSON data that is then rendered on the page with JavaScript, so you can scrape these URLs directly and parse the body with something like json.loads(response.text). Control which products are returned with the offset and page-size parameters; you are done when you receive an empty JSON response. Alternatively, you can set offset=0 and page-size=9999 to get all the data in one go (9999 is just an arbitrary number that happens to be large enough in this particular case).
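A minimal sketch of that approach (assuming Scrapy 1.7+ for cb_kwargs; the "products" and "link" keys inside the JSON are assumptions, so verify them against a real response in your browser's developer tools):

import json

import scrapy


class HMSaleSpider(scrapy.Spider):
    name = "hm_sale"
    # Endpoint and query parameters observed in the browser's XHR traffic.
    page_size = 36
    base_url = (
        "https://www2.hm.com/en_us/sale/women/view-all/_jcr_content/main/"
        "productlisting_b48c.display.json"
        "?sort=stock&image-size=small&image=stillLife"
        "&offset={offset}&page-size={page_size}"
    )

    def start_requests(self):
        yield scrapy.Request(
            self.base_url.format(offset=0, page_size=self.page_size),
            cb_kwargs={"offset": 0},
        )

    def parse(self, response, offset):
        data = json.loads(response.text)
        # "products" and "link" are assumed key names -- check the real JSON.
        products = data.get("products", [])
        if not products:
            return  # empty response -> no more pages
        for product in products:
            # Either yield fields straight from the JSON, or follow
            # product["link"] into the parse_subpage from your spider above.
            yield product
        next_offset = offset + self.page_size
        yield scrapy.Request(
            self.base_url.format(offset=next_offset, page_size=self.page_size),
            cb_kwargs={"offset": next_offset},
        )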
I am parsing emails from many websites.
1) I take them from the front page and from the contacts section ('kont' or 'cont' in the hrefs). There can be many links with 'kont' or 'cont' on the front page, and I don't want to visit all of them in the for loop. I would like the program to go to another website when the data is found in one of those links (email_list_2 != []). How do I do that?
2) There is some redundancy in the code: I yield data at the front page because I am afraid the request from the for loop would be unsuccessful, in which case I would lose the data from the front page. Can I just yield {'site': site, 'email_list_1': email_list_1, 'email_list_2': []} if the data is not found, or {'site': site, 'email_list_1': email_list_1, 'email_list_2': ['xyz']} if it is found, without yielding twice?
Please help.
Regards,
import re

import scrapy
from bs4 import BeautifulSoup

# emailRegex (a compiled regex) and website_list are defined elsewhere.


class QuotesSpider(scrapy.Spider):
    name = 'enrichment'
    start_urls = website_list

    def parse(self, response):
        site = response.url
        data = response.text
        email_list_1 = emailRegex.findall(data)
        yield {'lvl': '1',
               'site': site,
               'email_list_1': email_list_1,
               'email_list_2': [],
               }
        soup = BeautifulSoup(data, 'lxml')
        for link in soup.find_all('a'):
            raw_url = link.get('href')
            full_url = str(site) + str(raw_url)
            if (re.search('cont', full_url) is not None or
                    re.search('kont', full_url) is not None):
                yield scrapy.Request(url=full_url,
                                     callback=self.parse_2d_level,
                                     meta={'site': site, 'email_list_1': email_list_1}
                                     )

    def parse_2d_level(self, response):
        site = response.meta['site']
        email_list_1 = response.meta['email_list_1']
        data_2 = response.text
        email_list_2 = emailRegex.findall(data_2)
        yield {'lvl': '2',
               'site': site,
               'email_list_1': email_list_1,
               'email_list_2': email_list_2,
               }
I'm not sure I fully understand your question, but here goes:
1 - You want to scrape PAGE1, look for 'cont' or 'kont' links and, if they exist, make a new request for PAGE2. On PAGE2 you search for email_list_2 and yield the results. You asked:
I would like the program to go to another website when the data is found in one of those links (email_list_2 != []). How do I do that?
Which website do you want it to go to? Is it a link followed from the page you are already scraping, or another website in your start_urls?
As it stands, after parsing PAGE2 (in the parse_2d_level method) your spider will yield results whether it found values for email_list_2 or not. If there are other requests in the queue, Scrapy will go on to execute those; if there aren't, the spider will finish.
2 - You want to make sure the data you found before the loop is still yielded in case the request from inside the loop fails. Since you said
the request from the for loop would be unsuccessful
I'll assume you are only worried about the request failing; there are other ways your parsing could fail.
For a failed request you can catch and handle the issue with a Scrapy signal called spider_error; take a look at the signals documentation.
3 - You should take a look at Scrapy's selectors; they are a very powerful tool. You don't need Beautiful Soup for the parsing, and the selectors will help a lot with precision.
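Pulling points 2 and 3 together, here is a hedged sketch of what the first-level callback could look like: Scrapy selectors replace the BeautifulSoup loop, and an errback on the contact-page request (a per-request alternative to the spider_error signal mentioned above; the handler name is illustrative) emits the fallback item only when that request actually fails, so nothing has to be yielded twice:

    def parse(self, response):
        site = response.url
        email_list_1 = emailRegex.findall(response.text)
        # Scrapy's selectors replace soup.find_all('a'); response.urljoin
        # also resolves relative hrefs correctly.
        for raw_url in response.css('a::attr(href)').getall():
            full_url = response.urljoin(raw_url)
            if 'cont' in full_url or 'kont' in full_url:
                yield scrapy.Request(
                    url=full_url,
                    callback=self.parse_2d_level,
                    errback=self.handle_failed_request,  # fires on download errors
                    meta={'site': site, 'email_list_1': email_list_1},
                )

    def handle_failed_request(self, failure):
        # The failed request (and the meta attached to it) is available here,
        # so the front-page data is only emitted when the follow-up failed.
        meta = failure.request.meta
        yield {'lvl': '1',
               'site': meta['site'],
               'email_list_1': meta['email_list_1'],
               'email_list_2': []}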
I am attempting to scrape a website which has a "Show More" link at the bottom of the page that leads to more data to scrape. Here is a link to the website page: https://untappd.com/v/total-wine-more/47792. Here is my full code:
import scrapy


class Untap(scrapy.Spider):
    name = "Untappd"
    allowed_domains = ["untappd.com"]
    start_urls = [
        'https://untappd.com/v/total-wine-more/47792'  # URL: major liquor store chain with Towson location
    ]

    def parse(self, response):
        for beer_details in response.css('div.beer-details'):
            yield {
                'name': beer_details.css('h5 a::text').getall(),          # name of beer
                'type': beer_details.css('h5 em::text').getall(),         # style of beer
                'ABVIBUs': beer_details.css('h6 span::text').getall(),    # ABV and IBU of beer
                'Brewery': beer_details.css('h6 span a::text').getall()   # brewery that produced beer
            }
        load_more = response.css('a.yellow button more show-more-section track-click::attr(href)').get()
        if load_more is not None:
            load_more = response.urljoin(load_more)
            yield scrapy.Request(load_more, callback=self.parse)
I've attempted to use the "load_more" block at the bottom to keep loading more data for scraping, but none of the selectors I've tried against the site's HTML have worked.
Here is the HTML from the website.
Show More Beers
I want the spider to scrape what is shown on the website, then click the link and continue scraping the page. Any help would be greatly appreciated.
Short answer:
curl 'https://untappd.com/venue/more_menu/47792/15?section_id=140248357' -H 'x-requested-with: XMLHttpRequest'
Clicking that button executes JavaScript, so you'd need Selenium to automate it, but fortunately you won't have to :).
Using your browser's Developer Tools you can see that when you click the button it requests data following the pattern shown below, with the number after /47792/ increasing by 15 each time. So the first time:
https://untappd.com/venue/more_menu/47792/15?section_id=140248357
second time:
https://untappd.com/venue/more_menu/47792/30?section_id=140248357
then:
https://untappd.com/venue/more_menu/47792/45?section_id=140248357
and so on.
But if you try to fetch it directly from the browser you get no content, because the server expects the 'x-requested-with: XMLHttpRequest' header, which indicates an AJAX request.
Thus you have the URL pattern and the required header you need for coding your scraper.
The rest is to parse each response. :)
PS: the section_id parameter will probably change (mine is different from yours), but you already have it in the data-section-id="140248357" attribute in the button's HTML.
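A hedged sketch of how that could look in the spider. I'm assuming the endpoint returns the same HTML fragments the page injects, so the CSS selectors from your spider still apply (if it actually returns JSON, parse it with json.loads instead); the venue id, section id and step of 15 come from the URLs above, and your existing parse of the venue page still handles the first batch of beers:

import scrapy


class UntappdMore(scrapy.Spider):
    name = "UntappdMore"
    allowed_domains = ["untappd.com"]
    venue_id = 47792
    section_id = 140248357   # from the button's data-section-id attribute
    step = 15                # the offset grows by 15 per "Show More" click

    def start_requests(self):
        # The first 15 beers come from the normal venue page; this spider
        # only fetches the extra "Show More" batches.
        yield self.more_menu_request(offset=self.step)

    def more_menu_request(self, offset):
        url = (f"https://untappd.com/venue/more_menu/{self.venue_id}/{offset}"
               f"?section_id={self.section_id}")
        # Without this header the endpoint returns an empty body.
        return scrapy.Request(url,
                              headers={"X-Requested-With": "XMLHttpRequest"},
                              callback=self.parse_more,
                              cb_kwargs={"offset": offset})

    def parse_more(self, response, offset):
        beers = response.css("div.beer-details")
        if not beers:
            return                      # nothing left to load -> stop
        for beer_details in beers:
            yield {
                "name": beer_details.css("h5 a::text").getall(),
                "type": beer_details.css("h5 em::text").getall(),
                "ABVIBUs": beer_details.css("h6 span::text").getall(),
                "Brewery": beer_details.css("h6 span a::text").getall(),
            }
        # Ask for the next batch.
        yield self.more_menu_request(offset + self.step)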
I am scraping the following webpage with scrapy-splash: http://www.starcitygames.com/buylist/. I have to log in to get the data I need, and that part works fine, but the data I need is not accessible until the display button is clicked. I already got an answer telling me I cannot simply click the display button and scrape what shows up, and that I should scrape the JSON page behind it instead. However, I am concerned that scraping the JSON will be a red flag to the site owners, since most people never open the JSON data page and it would take a human several minutes to find it, versus a computer which would be much faster. So my question is: is there any way to scrape the page by clicking display and going from there, or do I have no choice but to scrape the JSON page? This is what I have so far, but it is not clicking the button.
import scrapy
from ..items import NameItem


class LoginSpider(scrapy.Spider):
    name = "LoginSpider"
    start_urls = ["http://www.starcitygames.com/buylist/"]

    def parse(self, response):
        return scrapy.FormRequest.from_response(
            response,
            formcss='#existing_users form',
            formdata={'ex_usr_email': 'abc@example.com', 'ex_usr_pass': 'password'},
            callback=self.after_login
        )

    def after_login(self, response):
        item = NameItem()
        display_button = response.xpath('//a[contains(., "Display>>")]/@href').get()
        yield response.follow(display_button, self.parse)
        item["Name"] = response.css("div.bl-result-title::text").get()
        return item
You can use your browser's developer tools to track the request made by that click event. The response is in a nice JSON format, and there is no need for a cookie (login):
http://www.starcitygames.com/buylist/search?search-type=category&id=5061
The only thing you need to fill in is the category id for this request; it can be extracted from the HTML and declared in your code.
Category name:
//*[#id="bl-category-options"]/option/text()
Category id:
//*[#id="bl-category-options"]/option/#value
Working with JSON is much simpler than parsing HTML.
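A hedged sketch tying those pieces together. The XPaths are the ones above and the search URL is the one quoted earlier; the shape of the returned JSON is an assumption, so inspect a real response before relying on specific keys, and I'm also assuming the category dropdown is reachable without logging in (if not, chain this after your login FormRequest):

import json

import scrapy


class BuylistSpider(scrapy.Spider):
    name = "buylist"
    start_urls = ["http://www.starcitygames.com/buylist/"]

    def parse(self, response):
        # Pair each category name with its id, as described above.
        names = response.xpath('//*[@id="bl-category-options"]/option/text()').getall()
        ids = response.xpath('//*[@id="bl-category-options"]/option/@value').getall()
        for name, category_id in zip(names, ids):
            url = (f"http://www.starcitygames.com/buylist/search"
                   f"?search-type=category&id={category_id}")
            yield scrapy.Request(url, callback=self.parse_category,
                                 cb_kwargs={"category": name})

    def parse_category(self, response, category):
        # The structure of the JSON is an assumption -- adjust the keys
        # once you have looked at a real response.
        data = json.loads(response.text)
        yield {"category": category, "results": data}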
I have tried to emulate the click with scrapy-splash, making use of a Lua script. It works; you just have to integrate it with Scrapy and process the content.
Here is the script; I leave integrating it with Scrapy to you.
function main(splash)
    local url = 'https://www.starcitygames.com/login'
    assert(splash:go(url))
    assert(splash:wait(0.5))
    assert(splash:runjs('document.querySelector("#ex_usr_email_input").value = "your@email.com"'))
    assert(splash:runjs('document.querySelector("#ex_usr_pass_input").value = "your_password"'))
    splash:wait(0.5)
    assert(splash:runjs('document.querySelector("#ex_usr_button_div button").click()'))
    splash:wait(3)
    splash:go('https://www.starcitygames.com/buylist/')
    splash:wait(2)
    assert(splash:runjs('document.querySelectorAll(".bl-specific-name")[1].click()'))
    splash:wait(1)
    assert(splash:runjs('document.querySelector("#bl-search-category").click()'))
    splash:wait(3)
    splash:set_viewport_size(1200, 2000)
    return {
        html = splash:html(),
        png = splash:png(),
        har = splash:har(),
    }
end
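For completeness, a hedged sketch of wiring that script into a spider via scrapy-splash's execute endpoint. This assumes scrapy-splash is installed and enabled in settings.py as described in its README; since the script returns a Lua table, the rendered page should be available under response.data['html']:

import scrapy
from scrapy import Selector
from scrapy_splash import SplashRequest

# Paste the Lua function from the answer above into this string.
LUA_SCRIPT = """
function main(splash)
    -- ... the script shown above ...
end
"""


class BuylistSplashSpider(scrapy.Spider):
    name = "buylist_splash"

    def start_requests(self):
        yield SplashRequest(
            url="https://www.starcitygames.com/buylist/",
            callback=self.parse_result,
            endpoint="execute",                 # run the Lua script in Splash
            args={"lua_source": LUA_SCRIPT},
        )

    def parse_result(self, response):
        # The script returns {html=..., png=..., har=...}, which scrapy-splash
        # exposes as response.data; build a Selector from the rendered HTML.
        sel = Selector(text=response.data["html"])
        yield {"Name": sel.css("div.bl-result-title::text").get()}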
Although I've seen several similar questions here regarding this, none seem to precisely define the process for achieving this task. I borrowed largely from the Scrapy script located here but since it is over a year old I had to make adjustments to the xpath references.
My current code looks like this:
import scrapy
from tripadvisor.items import TripadvisorItem


class TrSpider(scrapy.Spider):
    name = 'trspider'
    start_urls = [
        'https://www.tripadvisor.com/Hotels-g29217-Island_of_Hawaii_Hawaii-Hotels.html'
    ]

    def parse(self, response):
        for href in response.xpath('//div[@class="listing_title"]/a/@href'):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_hotel)
        next_page = response.xpath('//div[@class="unified pagination standard_pagination"]/child::*[2][self::a]/@href')
        if next_page:
            url = response.urljoin(next_page[0].extract())
            yield scrapy.Request(url, self.parse)

    def parse_hotel(self, response):
        for href in response.xpath('//div[starts-with(@class,"quote")]/a/@href'):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_review)
        next_page = response.xpath('//div[@class="unified pagination "]/child::*[2][self::a]/@href')
        if next_page:
            url = response.urljoin(next_page[0].extract())
            yield scrapy.Request(url, self.parse_hotel)

    def parse_review(self, response):
        item = TripadvisorItem()
        item['headline'] = response.xpath('translate(//div[@class="quote"]/text(),"!"," ")').extract()[0][1:-1]
        item['review'] = response.xpath('translate(//div[@class="entry"]/p,"\n"," ")').extract()[0]
        item['bubbles'] = response.xpath('//span[contains(@class,"ui_bubble_rating")]/@alt').extract()[0]
        item['date'] = response.xpath('normalize-space(//span[contains(@class,"ratingDate")]/@content)').extract()[0]
        item['hotel'] = response.xpath('normalize-space(//span[@class="altHeadInline"]/a/text())').extract()[0]
        return item
When running the spider in its current form, I scrape the first page of reviews for each hotel listed on the start_urls page, but the pagination doesn't flip to the next page of reviews. I suspect this is because of this line:
next_page = response.xpath('//div[@class="unified pagination "]/child::*[2][self::a]/@href')
Since these pages load dynamically, there is no existing href for the next page on the current page. Investigating further, I've read that these requests send a POST request using XHR. By exploring the "Network" tab in Firefox's Inspector, I can see both a Request URL and Form Data that might be needed to flip the page, according to other posts on SO about the same topic.
However, it seems that the other posts refer to a static starting URL when trying to pass a FormRequest using Scrapy. With TripAdvisor, the URL always changes based on the name of the hotel we're looking at, so I'm not sure how to choose a URL when using FormRequest to submit the form data: reqNum=1&changeSet=REVIEW_LIST (this form data also never seems to change from page to page).
Alternatively, there doesn't appear to be a way to extract the URL shown in the "Network" tab's "Request URL". These pages do have URLs that change from page to page but the way TripAdvisor is set up, I cannot seem to extract them from the source code. The review pages change by incrementing the part of the URL that is -orXX- where "XX" is a number. For example:
https://www.tripadvisor.com/Hotel_Review-g2312116-d113123-Reviews-Fairmont_Orchid_Hawaii-Puako_Kohala_Coast_Island_of_Hawaii_Hawaii.html
https://www.tripadvisor.com/Hotel_Review-g2312116-d113123-Reviews-or5-Fairmont_Orchid_Hawaii-Puako_Kohala_Coast_Island_of_Hawaii_Hawaii.html
https://www.tripadvisor.com/Hotel_Review-g2312116-d113123-Reviews-or10-Fairmont_Orchid_Hawaii-Puako_Kohala_Coast_Island_of_Hawaii_Hawaii.html
https://www.tripadvisor.com/Hotel_Review-g2312116-d113123-Reviews-or15-Fairmont_Orchid_Hawaii-Puako_Kohala_Coast_Island_of_Hawaii_Hawaii.html
So, my question is whether it is possible to paginate using the XHR request/form data, or whether I need to manually build a list of URLs for each hotel that adds the -orXX- part?
Well I ended up discovering an xpath that apparently allowed pagination of the reviews, but it's funny because every time I checked the underlying HTML the href link never changed from referring to /Hotel_Review-g2312116-d113123-Reviews-or5-Fairmont_Orchid_Hawaii-Puako_Kohala_Coast_Island_of_Hawaii_Hawaii.html even if I was on page 10 for example. It seems the "-orXX-" part of the link always increments the XX by 5 so I'm not sure why this works.
All I did was change the line:
next_page = response.xpath('//div[@class="unified pagination "]/child::*[2][self::a]/@href')
to:
next_page = response.xpath('//link[@rel="next"]/@href')
and ended up with more than 41K extracted reviews. I'd love to hear others' opinions on handling this problem in other situations.
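For reference, a hedged sketch of what the listing-level parse looks like with that change (the same swap applies in parse_hotel):

    def parse(self, response):
        for href in response.xpath('//div[@class="listing_title"]/a/@href'):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_hotel)
        # <link rel="next"> sits in the page head even though the visible
        # pager is built dynamically, so it works as a pagination handle.
        next_page = response.xpath('//link[@rel="next"]/@href').extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)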
I have to scrape something where part of the information is on one page, there's a link on that page to a second page with more information, and then another URL where the third piece of information is available.
How do I go about setting up my callbacks in order to have all this information together? Will I have to use a database in this case or can it still be exported to CSV?
The first thing to say is that you have the right idea - callbacks are the solution. I have seen some use of urllib or similar to fetch dependent pages, but it's far preferable to fully leverage the Scrapy download mechanism than employ some synchronous call from another library.
See this example from the Scrapy docs on the issue:
http://doc.scrapy.org/en/latest/topics/request-response.html#passing-additional-data-to-callback-functions
def parse_page1(self, response):
    item = MyItem()
    item['main_url'] = response.url
    # parse response and populate item as required
    request = Request("http://www.example.com/some_page.html",
                      callback=self.parse_page2)
    request.meta['item'] = item
    return request

def parse_page2(self, response):
    item = response.meta['item']
    # parse response and populate item as required
    item['other_url'] = response.url
    return item
Is your third piece of data on a page linked from the first page or the second page?
If from the second page, you can just extend the mechanism above and have parse_page2 return a request with a callback to a new parse_page3.
If from the first page, you could have parse_page1 populate a request.meta['link3_url'] property from which parse_page2 can construct the subsequent request url.
NB - these 'secondary' and 'tertiary' urls should not be discoverable from the normal crawling process (start_urls and rules), but should be constructed from the response (using XPath etc) in parse_page1/parse_page2.
The crawling, callback structures, pipelines and item construction are all independent of the export of data, so CSV will be applicable.
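To make the three-page case concrete, here is a hedged sketch following the pattern above; the URLs come from placeholder selectors (a.details, a.extra) standing in for whatever your pages actually expose:

def parse_page1(self, response):
    item = MyItem()
    item['main_url'] = response.url
    # Placeholder selectors: build both follow-up URLs from this response.
    page2_url = response.urljoin(response.css('a.details::attr(href)').get())
    page3_url = response.urljoin(response.css('a.extra::attr(href)').get())
    request = Request(page2_url, callback=self.parse_page2)
    request.meta['item'] = item
    request.meta['link3_url'] = page3_url   # carried along for parse_page2
    return request

def parse_page2(self, response):
    item = response.meta['item']
    item['other_url'] = response.url
    request = Request(response.meta['link3_url'], callback=self.parse_page3)
    request.meta['item'] = item
    return request

def parse_page3(self, response):
    item = response.meta['item']
    item['third_url'] = response.url
    return item   # the complete item is only exported now, so CSV works fine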