Using Python Scrapy to scrape next-page comments - python

I am using Python Scrapy to get user review comments, which may span multiple pages; I need to click "See more" to load the additional ones.
This is the link to the page I want to crawl:
https://en.drivy.com/car-rental/berlin/dacia-dokker-218119
I noticed that if there are more than 10 review comments, I need to click the "See more" link in order to get the subsequent comments.
I also noticed that the "See more" link points to https://en.drivy.com/cars/218119/reviews?page=2&rel=next.
However, if I use Scrapy to request https://en.drivy.com/cars/218119/reviews?page=2&rel=next directly, the website redirects me back to https://en.drivy.com/car-rental/berlin/dacia-dokker-218119, so I can't get the next ten comments. (I wonder if the website uses a cookie or session ID and treats my Scrapy request as a new visit.)
I know I could use Python Selenium to open the web page and click "See more" to get the comments, but Selenium is very slow and I would like to use Scrapy instead.
Could anyone help me with this, or at least point me in a direction to proceed? Thanks in advance.

You should set the "Accept: */*;q=0.5, text/javascript, application/javascript, application/ecmascript, application/x-ecmascript" header. You'll then get back the JS object containing the comment text:
yield Request("https://en.drivy.com/cars/218119/reviews?page=2&rel=next",
              callback=...,  # your parsing callback goes here
              headers={'Accept': "*/*;q=0.5, text/javascript, application/javascript, "
                                 "application/ecmascript, application/x-ecmascript"})
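For context, a minimal spider sketch built around that request could look like the following; the parse_reviews callback name and the div.car_review selector are illustrative assumptions, not taken from the site:

import scrapy

class DrivyReviewsSpider(scrapy.Spider):
    name = "drivy_reviews"
    start_urls = ["https://en.drivy.com/car-rental/berlin/dacia-dokker-218119"]

    # This Accept header tells the server we expect a JavaScript response,
    # which is what keeps it from redirecting back to the car page.
    JS_ACCEPT = ("*/*;q=0.5, text/javascript, application/javascript, "
                 "application/ecmascript, application/x-ecmascript")

    def parse(self, response):
        # The first ten comments are in the initial HTML (selector assumed).
        for comment in response.css("div.car_review p::text").getall():
            yield {"comment": comment}
        # Ask for the next batch the same way the "See more" link does.
        yield scrapy.Request(
            "https://en.drivy.com/cars/218119/reviews?page=2&rel=next",
            callback=self.parse_reviews,
            headers={"Accept": self.JS_ACCEPT},
        )

    def parse_reviews(self, response):
        # The body is JavaScript that embeds the next comments' HTML;
        # how to unwrap it depends on the exact payload the site returns.
        self.logger.info("Got %d bytes of review JS", len(response.body))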

Related

Scrapy Load More Issue - CSS Selector

I am attempting to scrape a website that has a "Show More" link at the bottom of the page which loads more data to scrape. Here is a link to the page: https://untappd.com/v/total-wine-more/47792. Here is my full code:
class Untap(scrapy.Spider):
    name = "Untappd"
    allowed_domains = ["untappd.com"]
    start_urls = [
        'https://untappd.com/v/total-wine-more/47792'  # URL: Major liquor store chain with Towson location.
    ]

    def parse(self, response):
        for beer_details in response.css('div.beer-details'):
            yield {
                'name': beer_details.css('h5 a::text').getall(),        # Name of Beer
                'type': beer_details.css('h5 em::text').getall(),       # Style of Beer
                'ABVIBUs': beer_details.css('h6 span::text').getall(),  # ABV and IBU of Beer
                'Brewery': beer_details.css('h6 span a::text').getall() # Brewery that produced Beer
            }
        load_more = response.css('a.yellow button more show-more-section track-click::attr(href)').get()
        if load_more is not None:
            load_more = response.urljoin(load_more)
            yield scrapy.Request(load_more, callback=self.parse)
I've attempted to use the bottom "load_more" block to keep loading more data for scraping, but none of the selectors I've tried against the site's HTML have worked.
Here is the HTML for the button:
<a class="yellow button more show-more-section track-click" data-section-id="140248357" ...>Show More Beers</a>
I want the spider to scrape what is shown on the website, then follow the link and continue scraping the page. Any help would be greatly appreciated.
Short answer:
curl 'https://untappd.com/venue/more_menu/47792/15?section_id=140248357' -H 'x-requested-with: XMLHttpRequest'
Clicking on that button executes JavaScript, so normally you'd need Selenium to automate it, but fortunately, you won't :).
You can see, using Developer Tools, that when you click that button it requests data following a pattern, increasing the offset (the number after /47792/) by 15 each time. So the first time:
https://untappd.com/venue/more_menu/47792/15?section_id=140248357
the second time:
https://untappd.com/venue/more_menu/47792/30?section_id=140248357
then:
https://untappd.com/venue/more_menu/47792/45?section_id=140248357
and so on.
But if you try to fetch it directly in the browser you get no content, because the server expects the 'x-requested-with: XMLHttpRequest' header, which marks the request as an AJAX request.
Thus you have the URL pattern and the required header you need for coding your scraper.
The rest is to parse each response. :)
PS: the section_id parameter will probably differ (mine is different from yours), but you already have it in the button's data-section-id="140248357" attribute.
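Putting it together, a Scrapy sketch might look like this. It assumes the endpoint returns an HTML fragment the question's selectors can parse (if it returns JSON you would unwrap that first), and the empty-response stop condition is a guess:

import scrapy

class UntappdMenuSpider(scrapy.Spider):
    name = "untappd_menu"
    venue_id = 47792
    section_id = "140248357"  # from the button's data-section-id attribute

    def start_requests(self):
        yield self.menu_request(offset=15)

    def menu_request(self, offset):
        url = (f"https://untappd.com/venue/more_menu/{self.venue_id}/"
               f"{offset}?section_id={self.section_id}")
        # Without this header the endpoint returns no content.
        return scrapy.Request(url,
                              headers={"x-requested-with": "XMLHttpRequest"},
                              callback=self.parse,
                              cb_kwargs={"offset": offset})

    def parse(self, response, offset):
        beers = response.css("div.beer-details")
        for beer_details in beers:
            yield {"name": beer_details.css("h5 a::text").getall()}
        if beers:  # assumed: an empty batch means there are no more pages
            yield self.menu_request(offset + 15)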

Scrape displayed data through onclick using Selenium and Scrapy

I'm writing a Python script using Scrapy to scrape data from a website that requires authentication.
The page I'm scraping is really painful to work with because it is mostly built with JavaScript and AJAX requests. The whole body of the page sits inside a <form> that lets you change pages with a submit button; the URL doesn't change (and it's a .aspx).
I have successfully scraped all the data I need from page one; I then change pages by clicking this input button, using this code:
yield FormRequest.from_response(response,
                                formname="Form",
                                clickdata={"class": "PageNext"},
                                callback=self.after_login)
The after_login method scrapes the data.
However, I need data that appears in another div after clicking a container that has an onclick attribute. I need to loop over each container: click it, let the data display, scrape it, and only then go to the next page and repeat the process.
The thing is, I can't figure out how to build a flow where the script clicks a container using Selenium (while staying logged in, otherwise I can't reach this page) and Scrapy then scrapes the data once the XHR request has completed.
I did a lot of research on the internet but couldn't get any solution to work.
Thanks !
OK, so I've almost got what I want, following @malberts' advice.
I've used this kind of code to get the AJAX response:
yield scrapy.FormRequest.from_response(
    response=response,
    formdata={
        'param1': param1value,
        'param2': param2value,
        '__VIEWSTATE': __VIEWSTATE,
        '__ASYNCPOST': 'true',
        'DetailsId': '123'},
    callback=self.parse_item)

def parse_item(self, response):
    ajax_response = response.body
    yield {'Response': ajax_response}
The response is supposed to be HTML. The thing is, the response is not quite the same as the one I see for this request in Chrome Dev Tools. I haven't taken all the form data into account yet (~10 of the 25 fields); could it be that the server needs all the form data, even the fields that don't change with the id?
Thanks !
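One thing worth knowing here: FormRequest.from_response reads the <form> on the page and pre-populates every field it finds, so the remaining hidden ASP.NET fields are sent automatically and formdata only needs the values you want to add or override. A sketch under that assumption:

# from_response copies all existing form fields (including __VIEWSTATE
# and the other hidden inputs) from the page; formdata just overrides.
yield scrapy.FormRequest.from_response(
    response,
    formdata={
        '__ASYNCPOST': 'true',
        'DetailsId': '123',  # assumed: the id of the container being "clicked"
    },
    callback=self.parse_item)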

Scrapy response is in a different language from the request and response URL

I'm trying to scrape search results from this page
http://eur-lex.europa.eu/search.html?qid=1437402891621&DB_TYPE_OF_ACT=advGeneral&CASE_LAW_SUMMARY=false&DTS_DOM=EU_LAW&typeOfActStatus=ADV_GENERAL&type=advanced&lang=fr&SUBDOM_INIT=EU_CASE_LAW&DTS_SUBDOM=EU_CASE_LAW
The language according to the URL is French, and that is what I see in the Scrapy shell following 'Crawled (200)'.
If I try response.url I also get a URL with lang=fr.
Viewing the page in a browser shows me french results.
However, the body of the response is English.
I've tried disabling cookies in my scrapy settings.py file.
I've also set DEFAULT_REQUEST_HEADERS to include 'Accept-Language': 'fr'.
Any ideas?
In the upper right corner of the webpage there's a drop-down field to choose the language of the website. Selecting French there adds another parameter to the URL: &locale=fr.
So add that parameter to your start URL.
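In Scrapy terms, that just means appending &locale=fr to the URL you start from, e.g.:

start_urls = [
    'http://eur-lex.europa.eu/search.html?qid=1437402891621'
    '&DB_TYPE_OF_ACT=advGeneral&CASE_LAW_SUMMARY=false&DTS_DOM=EU_LAW'
    '&typeOfActStatus=ADV_GENERAL&type=advanced&lang=fr'
    '&SUBDOM_INIT=EU_CASE_LAW&DTS_SUBDOM=EU_CASE_LAW&locale=fr'
]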

Python Scrape with requests and beautifulsoup

I am trying to do a scraping exercise using Python requests and BeautifulSoup.
Basically I am crawling an Amazon web page.
I am able to crawl the first page without any issues:
r = requests.get("http://www.amazon.in/gp/bestsellers/books/ref=nav_shopall_books_bestsellers")
#do some thing
But when I try to crawl the 2nd page, with "#2" in the URL:
r = requests.get("http://www.amazon.in/gp/bestsellers/books/ref=nav_shopall_books_bestsellers#2")
I see that r still has the same value as for page 1:
r = requests.get("http://www.amazon.in/gp/bestsellers/books/ref=nav_shopall_books_bestsellers")
I don't know if the #2 is causing trouble when requesting the second page. I also googled the issue but could not find a fix.
What is the right way to request a URL containing a # value? How do I address this issue? Please advise.
"#2" is an fragment identifier, it's not visible on the server-side. Html content that you get, opening "http://someurl.com/page#123" is same as content for "http://someurl.com/page".
In browser you see second page because page's javascript see fragment identifier, create ajax request and inject new content into page. You should find ajax request's url and use it:
Looks like our url is:
http://www.amazon.in/gp/bestsellers/books/ref=zg_bs_books_pg_2?ie=UTF8&pg=2&aj
Easily we can understand that all we need is to change "pg" param value to get another pages.
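For example, a small requests/BeautifulSoup sketch along those lines (the parsing step is left as a placeholder):

import requests
from bs4 import BeautifulSoup

base = ("http://www.amazon.in/gp/bestsellers/books/"
        "ref=zg_bs_books_pg_{pg}?ie=UTF8&pg={pg}")

for pg in range(1, 6):  # however many pages you need
    r = requests.get(base.format(pg=pg))
    soup = BeautifulSoup(r.text, "html.parser")
    # ...extract the bestseller entries for this page from soup...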
You need to request the URL given in the href attribute of the anchor tags describing the pagination, at the bottom of the page. If I inspect the page in the developer console in Google Chrome, I find the first page's URL is:
http://www.amazon.in/gp/bestsellers/books/ref=zg_bs_books_pg_1?ie=UTF8&pg=1
and the second page's URL is:
http://www.amazon.in/gp/bestsellers/books/ref=zg_bs_books_pg_2?ie=UTF8&pg=2
The <a> tag for the second page looks like this:
<a page="2" ajaxUrl="http://www.amazon.in/gp/bestsellers/books/ref=zg_bs_books_pg_2?ie=UTF8&pg=2&ajax=1" href="http://www.amazon.in/gp/bestsellers/books/ref=zg_bs_books_pg_2?ie=UTF8&pg=2">21-40</a>
So you need to change the request URL accordingly.
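A sketch of that approach, discovering the page URLs from the pagination anchors instead of hard-coding them (the page attribute is assumed from the snippet above):

import requests
from bs4 import BeautifulSoup

start = "http://www.amazon.in/gp/bestsellers/books/ref=nav_shopall_books_bestsellers"
soup = BeautifulSoup(requests.get(start).text, "html.parser")

# Pagination anchors carry a `page` attribute, as in the snippet above.
page_urls = {a["href"] for a in soup.find_all("a", attrs={"page": True}, href=True)}
for url in sorted(page_urls):
    page_soup = BeautifulSoup(requests.get(url).text, "html.parser")
    # ...parse the 21-40, 41-60, ... entries from page_soup...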

Why does the href change after downloading the page

I'm making a web parser and some hrefs are driving me crazy:
import urllib.request

resp = urllib.request.urlopen("http://portogruaro.trasparenza-valutazione-merito.it/storico-atti")
page = resp.read().decode('utf-8')
print(page)
I found this in the downloaded page:
<a.. href="http://portogruaro.trasparenza-valutazione-merito.it/storico-atti;jsessionid=BE0A764D125947680F3DC6F85760302A?p_p_id=ConsultazioneAtti_WAR_maggioliportalmasterdetailportlet&p_p_lifecycle=2&p_p_state=normal&p_p_mode=view&p_p_resource_id=downloadAllegato&p_p_cacheability=cacheLevelPage&p_p_col_id=column-1&p_p_col_count=1&_ConsultazioneAtti_WAR_maggioliportalmasterdetailportlet_downloadTicket=oMrkWCwhyKWGcD67RyUPTMNzDbwk8ufAwUFVQ2_3Z4045lXXp1gcrKnaH7my84lD0jmgn_na5l1a5KnBtXxYtJYH7rbRP4GRdD53nB0MaBJSV6Ub1JDNoMnspbc2nmqr7a3ucdsOOBOUc4q0uTPd1Dg5ba1VE8DJ1kpf6C0eliencVxLYM8jPqxcSVokmrAjHqkHg4K3CFGZP9tGpCBTPQ"><i class="icon-download"></i> Allegato</a>
The href in the same anchor, as seen when retrieving the same URL with a browser, is:
"http://portogruaro.trasparenza-valutazione-merito.it/storico-atti?p_p_id=ConsultazioneAtti_WAR_maggioliportalmasterdetailportlet&p_p_lifecycle=2&p_p_state=normal&p_p_mode=view&p_p_resource_id=downloadAllegato&p_p_cacheability=cacheLevelPage&p_p_col_id=column-1&p_p_col_count=1&_ConsultazioneAtti_WAR_maggioliportalmasterdetailportlet_downloadTicket=HAxoH6d7h0JNRoKoi9sl4R-tsWdtMVoLeeZ8dU5rUQL74MQNMpCnqmBwxX4uNCXuMk4Clb6EzvrIaUXNY0G4q9YGlmebpMDTrR3255v6bLGOiIWVwvbnKiaOoapsGBqwP4JPIUN1R9G8ajAnurCaqTknyMJkVLiKaw0Z4wI61pgAzqjSGHatViGIGIXkrV7IN6EduMl29vAARMvaHhEJ5g"
;jsessionid is added because the bot doesn't manage cookies, but it's not the only change... why?
EDIT: Maybe a particular session number triggers a specific action?
If you download the web page, the downloaded href won't work if you click on it, but clicking the href that you see in the browser's page (view-source:link) will work.
;jsessionid is added because the bot doesn't manage cookies, but it's not the only change... why?
Hum ... apart from the ticket number and the jsessionid token, those are the same URL.
The parameters are not in the same order. But as far as I can tell, that doesn't change anything. Compare:
_ConsultazioneAtti_WAR_maggioliportalmasterdetailportlet_downloadTicket=oMrkWCwhyKWGcD67RyUPTMNzDbwk8ufAwUFVQ2_3Z4045lXXp1gcrKnaH7my84lD0jmgn_na5l1a5KnBtXxYtJYH7rbRP4GRdD53nB0MaBJSV6Ub1JDNoMnspbc2nmqr7a3ucdsOOBOUc4q0uTPd1Dg5ba1VE8DJ1kpf6C0eliencVxLYM8jPqxcSVokmrAjHqkHg4K3CFGZP9tGpCBTPQ
p_p_cacheability=cacheLevelPage
p_p_col_count=1
p_p_col_id=column-1
p_p_id=ConsultazioneAtti_WAR_maggioliportalmasterdetailportlet
p_p_lifecycle=2
p_p_mode=view
p_p_resource_id=downloadAllegato
p_p_state=normal
and
_ConsultazioneAtti_WAR_maggioliportalmasterdetailportlet_downloadTicket=HAxoH6d7h0JNRoKoi9sl4R-tsWdtMVoLeeZ8dU5rUQL74MQNMpCnqmBwxX4uNCXuMk4Clb6EzvrIaUXNY0G4q9YGlmebpMDTrR3255v6bLGOiIWVwvbnKiaOoapsGBqwP4JPIUN1R9G8ajAnurCaqTknyMJkVLiKaw0Z4wI61pgAzqjSGHatViGIGIXkrV7IN6EduMl29vAARMvaHhEJ5g
p_p_cacheability=cacheLevelPage
p_p_col_count=1
p_p_col_id=column-1
p_p_id=ConsultazioneAtti_WAR_maggioliportalmasterdetailportlet
p_p_lifecycle=2
p_p_mode=view
p_p_resource_id=downloadAllegato
p_p_state=normal
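As for the ;jsessionid part: Java web servers fall back to rewriting the session id into the URL when the client doesn't accept cookies, so keeping a cookie jar across requests should make it disappear. A sketch with urllib:

import http.cookiejar
import urllib.request

# Keep cookies between requests so the server doesn't fall back to
# URL rewriting (the ;jsessionid=... path parameter).
jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

resp = opener.open("http://portogruaro.trasparenza-valutazione-merito.it/storico-atti")
page = resp.read().decode('utf-8')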
