Scrapy response is a different language from request and response url - python

I'm trying to scrape search results from this page
http://eur-lex.europa.eu/search.html?qid=1437402891621&DB_TYPE_OF_ACT=advGeneral&CASE_LAW_SUMMARY=false&DTS_DOM=EU_LAW&typeOfActStatus=ADV_GENERAL&type=advanced&lang=fr&SUBDOM_INIT=EU_CASE_LAW&DTS_SUBDOM=EU_CASE_LAW
The language according to the URL is French, and that is the URL I see in the Scrapy shell following 'Crawled (200)'.
If I try response.url I also get a URL with lang=fr.
Viewing the page in a browser shows me French results.
However, the body of the response is English.
I've tried disabling cookies in my scrapy settings.py file.
I've also set DEFAULT_REQUEST_HEADERS to include 'Accept-Language': 'fr'.
Any ideas?

In the upper right corner of the webpage there's a drop-down field to choose the language of the website. Selecting French there adds another parameter to the URL: &locale=fr.
So - add that parameter to your start_urls, as in the sketch below.
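A minimal sketch of what that could look like (the spider name and the CSS selector are placeholders of mine; only the &locale=fr parameter comes from the answer above):

import scrapy

class EurLexSpider(scrapy.Spider):
    name = "eurlex"  # hypothetical name
    start_urls = [
        # the search URL from the question, with &locale=fr appended
        "http://eur-lex.europa.eu/search.html?qid=1437402891621"
        "&DB_TYPE_OF_ACT=advGeneral&CASE_LAW_SUMMARY=false"
        "&DTS_DOM=EU_LAW&typeOfActStatus=ADV_GENERAL&type=advanced"
        "&lang=fr&SUBDOM_INIT=EU_CASE_LAW&DTS_SUBDOM=EU_CASE_LAW"
        "&locale=fr",
    ]

    def parse(self, response):
        # the body should now come back in French
        self.logger.info("First result: %s",
                         response.css("h2 a::text").get())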

Related

Get request returns unfriendly response Python

I am trying to perform a GET request on TCG Player via Requests in Python. I checked the site's robots.txt, which specifies:
User-agent: *
Crawl-Delay: 10
Allow: /
Sitemap: https://www.tcgplayer.com/sitemap/index.xml
This is my first time seeing a robots.txt file.
My code is as follows:
import requests
url = "http://www.tcgplayer.com"
r = requests.get(url)
print(r.text)
I cannot include r.text in my post because the character limit would be exceeded.
I would have expected to receive the HTML content of the webpage, but I got an 'unfriendly' response instead. What is the meaning of the text above? Is there a way to get the HTML so I can scrape the site?
By 'unfriendly' I mean:
The HTML that is returned does not match the HTML that is produced by typing the URL into my web browser.
This is probably due to client-side rendering of the web content, as indicated by the empty <div id="app"></div> block in the scraped result: the server sends a mostly empty shell and JavaScript fills it in afterwards. To properly handle such content, you will need a web scraping tool that executes JavaScript, like Selenium. I'd recommend this tutorial to get started.
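If it helps, here is a minimal Selenium sketch along those lines (the wait condition and the ChromeDriver setup are assumptions of mine, not something from the question):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()  # assumes chromedriver is on PATH
try:
    driver.get("http://www.tcgplayer.com")
    # wait until the JavaScript app has rendered something into #app
    WebDriverWait(driver, 15).until(
        lambda d: d.find_element(By.ID, "app")
                   .get_attribute("innerHTML").strip())
    html = driver.page_source  # now contains the rendered markup
    print(html[:500])
finally:
    driver.quit()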

Scrape displayed data through onclick using Selenium and Scrapy

I'm writing a script in Python using Scrapy in order to scrape data from a website that requires authentication.
The page I'm scraping is really painful because it is mainly built with JavaScript and AJAX requests. The whole body of the page is inside a <form> that allows changing the page using a submit button. The URL doesn't change (and it's a .aspx).
I have successfully scraped all the data I need from page one, then changed page by clicking this input button, using this code:
yield FormRequest.from_response(
    response,
    formname="Form",
    clickdata={"class": "PageNext"},
    callback=self.after_login)
The after_login method scrapes the data.
However, I need data that appears in another div after clicking on a container with an onclick attribute. I need to loop over each container: click it to display the data, scrape the data, and only then go to the next page and repeat the process.
The thing is, I can't work out a process where the script clicks on the container using Selenium (while staying logged in, otherwise I cannot reach this page) and Scrapy then scrapes the data once the XHR request has been made.
I did a lot of research on the internet but could not find a working solution.
Thanks !
OK, so I've almost got what I want, following @malberts' advice.
I've used this kind of code in order to get the AJAX response:
yield scrapy.FormRequest.from_response(
    response=response,
    formdata={
        'param1': param1value,
        'param2': param2value,
        '__VIEWSTATE': __VIEWSTATE,
        '__ASYNCPOST': 'true',
        'DetailsId': '123'},
    callback=self.parse_item)

def parse_item(self, response):
    ajax_response = response.body
    yield {'Response': ajax_response}
The response is supposed to be in HTML. The thing is, the response is not exactly the same as the one I see for this request in Chrome Dev Tools. I've not taken all the form data into account yet (~10 of 25 fields): could it be that the server needs all the form data, even the fields that don't change with the id?
Thanks !
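Quite possibly: ASP.NET pages usually validate the full postback, including hidden state fields like __EVENTVALIDATION and __VIEWSTATEGENERATOR. Note that FormRequest.from_response already pre-populates every field it finds in the <form>, and formdata only overrides specific keys, so you may not need to list the unchanging fields at all. A hedged sketch (the DetailsId value is the placeholder from your own code):

# from_response reads every <input> in the form and submits it, so the
# ASP.NET state fields (__VIEWSTATE, __EVENTVALIDATION, ...) are sent
# automatically; formdata only needs the values you want to override.
yield scrapy.FormRequest.from_response(
    response,
    formdata={
        '__ASYNCPOST': 'true',
        'DetailsId': '123',  # the container you "clicked" (placeholder)
    },
    callback=self.parse_item)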

Using python scrapy to scrape the next page comments

I am using Python Scrapy to get user review comments, which may span multiple pages; I need to click "See more" to load further comments.
This is the link to the page I want to crawl:
https://en.drivy.com/car-rental/berlin/dacia-dokker-218119
I notice that if there are more than 10 review comments, I need to click the "See more" link in order to get the subsequent comments.
I also notice the "See more" link URL is https://en.drivy.com/cars/218119/reviews?page=2&rel=next
However, if I use Scrapy to go to https://en.drivy.com/cars/218119/reviews?page=2&rel=next, the website redirects me back to https://en.drivy.com/car-rental/berlin/dacia-dokker-218119, so I can't get the next ten comments. (I wonder if the website uses a cookie or session ID and identifies my Scrapy request as a new visit.)
I know I can use Python Selenium to open the web page and click "See more" to get the comments; however, Selenium is very slow, and I wish to use Scrapy instead.
Could anyone help me with this, or at least give me a direction to proceed? Thanks in advance.
You should set the "Accept: */*;q=0.5, text/javascript, application/javascript, application/ecmascript, application/x-ecmascript" header. You'll then get back the JavaScript response containing the texts of the comments.
yield Request(
    "https://en.drivy.com/cars/218119/reviews?page=2&rel=next",
    callback=...,  # your parse method
    headers={'Accept': "*/*;q=0.5, text/javascript, application/javascript, "
                       "application/ecmascript, application/x-ecmascript"})
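A fuller sketch of how that might look in a spider (the spider and callback names are placeholders; I'm also assuming the endpoint returns JavaScript that embeds the comment HTML, so the body is only logged here, to be inspected in scrapy shell):

import scrapy

class DrivyReviewsSpider(scrapy.Spider):
    name = "drivy_reviews"  # hypothetical name
    js_accept = ("*/*;q=0.5, text/javascript, application/javascript, "
                 "application/ecmascript, application/x-ecmascript")
    start_urls = ["https://en.drivy.com/car-rental/berlin/dacia-dokker-218119"]

    def parse(self, response):
        # the first 10 comments are in the initial HTML; then request
        # page 2 of the reviews endpoint with the JavaScript Accept header
        yield scrapy.Request(
            "https://en.drivy.com/cars/218119/reviews?page=2&rel=next",
            callback=self.parse_reviews,
            headers={"Accept": self.js_accept})

    def parse_reviews(self, response):
        # the body is JavaScript that injects the comment HTML;
        # inspect it to pick out the comment texts
        self.logger.info("Got %d bytes of JS", len(response.body))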

Python Scrape with requests and beautifulsoup

I am trying to do a scraping exercise using Python requests and BeautifulSoup.
Basically I am crawling an Amazon web page.
I am able to crawl the first page without any issues.
r = requests.get("http://www.amazon.in/gp/bestsellers/books/ref=nav_shopall_books_bestsellers")
# do something
But when I try to crawl the 2nd page, with "#2" in the URL:
r = requests.get("http://www.amazon.in/gp/bestsellers/books/ref=nav_shopall_books_bestsellers#2")
I see that r still has the same value as for page 1:
r = requests.get("http://www.amazon.in/gp/bestsellers/books/ref=nav_shopall_books_bestsellers")
I don't know if the #2 is causing trouble when making the request for the second page.
I also googled the issue, but I could not find a fix.
What is the right way to make a request to a URL with # values? How do I address this issue? Please advise.
"#2" is an fragment identifier, it's not visible on the server-side. Html content that you get, opening "http://someurl.com/page#123" is same as content for "http://someurl.com/page".
In browser you see second page because page's javascript see fragment identifier, create ajax request and inject new content into page. You should find ajax request's url and use it:
It looks like the URL we need is:
http://www.amazon.in/gp/bestsellers/books/ref=zg_bs_books_pg_2?ie=UTF8&pg=2&ajax=1
We can easily see that all we need to do is change the "pg" param value to get the other pages.
You need to request the URL in the href attribute of the anchor tags describing the pagination. It's at the bottom of the page. If I inspect the page in the developer console in Google Chrome, I find the first page's URL is like:
http://www.amazon.in/gp/bestsellers/books/ref=zg_bs_books_pg_1?ie=UTF8&pg=1
and the second page's URL is like this:
http://www.amazon.in/gp/bestsellers/books/ref=zg_bs_books_pg_2?ie=UTF8&pg=2
The a tag for the second page is like this:
<a page="2" ajaxUrl="http://www.amazon.in/gp/bestsellers/books/ref=zg_bs_books_pg_2?ie=UTF8&pg=2&ajax=1" href="http://www.amazon.in/gp/bestsellers/books/ref=zg_bs_books_pg_2?ie=UTF8&pg=2">21-40</a>
So you need to change the request URL accordingly, as in the sketch below.
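A short sketch putting both answers together with requests and BeautifulSoup (the pg-parameter URL pattern comes from the answers above; printing the page title is just a placeholder for whatever selectors you actually need):

import requests
from bs4 import BeautifulSoup

base = ("http://www.amazon.in/gp/bestsellers/books/"
        "ref=zg_bs_books_pg_{page}?ie=UTF8&pg={page}")

for page in (1, 2):
    r = requests.get(base.format(page=page))
    soup = BeautifulSoup(r.text, "html.parser")
    # confirm each request returns a different page, then swap this
    # for the selectors of the data you want to scrape
    print(page, soup.title.string if soup.title else "no title")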

Crawling redirected url in scrapy

I am working in Scrapy.
I am fetching a site which consists of a list of URLs.
I requested the main URL in start_urls and got all the href links in a list; I then requested each and every URL in the list to fetch data, but some of the URLs redirect, like below:
Redirecting (301) to <GET example.com/sch/mobile-68745.php> from <GET example.com/sch/mobile-8974.php>
I came to understand that Scrapy skips these redirected links, but I want to catch the redirected URL and scrape it just like the URLs with a 200 status.
Is there any way to catch the redirected URL and scrape the data from it? Do we need to disable the redirect middleware, or do we need to pass some meta key in the Request? Can you provide an example of that?
I've got no experience with Scrapy, but apparently you can define middlewares that change the way Scrapy works when resolving content.
There is the RedirectMiddleware that supports and handles redirects out of the box, so all you'd need to do is enable it:
DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.redirect.RedirectMiddleware': 123,
}
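Worth noting: RedirectMiddleware is enabled by default, so redirected pages normally arrive at your callback already; the middleware records the hops in the redirect_urls meta key. A small sketch of reading it (the callback name is a placeholder):

def parse_item(self, response):
    # final URL after any 301/302 hops
    final_url = response.url
    # URLs the request passed through, set by RedirectMiddleware
    original_urls = response.meta.get('redirect_urls', [])
    self.logger.info("scraping %s (redirected from %s)",
                     final_url, original_urls)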
