I want to scrape the second page of these user reviews.
However, the next button fires an XHR request, and while I can see it using the Chrome developer tools, I cannot replicate it.
It's not an easy task. First of all, you should install this extension.
It lets you capture requests and then replay your own modified versions of them, which is exactly what you need here.
As far as I can see, they send a token in this XHR request, so you first need to extract it from the HTML page body (it is stored in the page source, in the js variable "taSecureToken").
Next you need to do four steps:

1. Catch the POST request with the plugin
2. Replace the token with the one you saved earlier
3. Set the limit and offset variables in the POST request data
4. Send the request with the resulting body
Note: for this request the server returns JSON data (not the HTML of the next page) describing the objects loaded on the next page. A sketch of the whole flow follows below.
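Here is a minimal sketch of that flow using the requests module; the URLs, the form-field names, and the token regex are assumptions that you must replace with what you actually see in the captured request:

import re
import requests

s = requests.Session()

# Step 0: load the HTML page and pull taSecureToken out of the source.
page = s.get("https://example.com/user-reviews")  # placeholder URL
match = re.search(r'taSecureToken["\']?\s*[:=]\s*["\']([^"\']+)', page.text)
token = match.group(1)

# Steps 1-4: replay the captured POST with the fresh token and paging values.
payload = {
    "token": token,  # field name is an assumption; copy it from the captured body
    "limit": 10,     # page size observed in the captured request
    "offset": 10,    # offset = limit * (page - 1), so 10 for the second page
}
r = s.post("https://example.com/ajax/reviews", data=payload)  # placeholder URL

# The server answers with JSON describing the next page's objects.
print(r.json())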
I am new to Scrapy and trying to scrape https://www.sakan.co/result?srv=1&prov=&cty=&maintyp=1&typ=5&minpr=&maxpr=&bdrm=&blk=
This web page uses an href like the following:
href="javascript:WebForm_DoPostBackWithOptions(new WebForm_PostBackOptions('ctl00$Content$rptPaging$ctl02$lbPaging', '', true, '', '', false, true))"
The data is loaded dynamically. I have tried to find the source of the data (an API call, if any) but could not find one. How can I navigate to the next page and scrape the data using Scrapy?
What this js effectively does is trigger a POST request; you can check the details of the request in the browser's developer tools, Network tab. (F12 in Firefox, then open the tab and click the link.)
Your Scrapy spider needs to reproduce that same POST request. All the information in the body is available in the page. Just keep in mind that the fields whose names start with __, like __VIEWSTATE, are instance dependent, so you need to retrieve their values from the page your spider loads; copying and pasting them will usually fail.
The easiest way to do this is with the FormRequest.from_response() method. However, it's important to check that the method produces a request body that is the same as your browser's: quite often the method skips a required field or adds an extra one. (It relies on the page's <form>.)
You can read more about scraping this kind of page at this link from the Scrapy FAQ.
Finally, one last tip: if your request body is just like the browser's but the request still fails, you might need to reproduce the request headers as well.
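Putting this together, a minimal sketch might look like the following; the __EVENTTARGET value comes from the link's href shown in the question, and the resulting body should be verified against the one captured in the Network tab:

import scrapy

class SakanSpider(scrapy.Spider):
    name = "sakan"
    start_urls = ["https://www.sakan.co/result?srv=1&prov=&cty=&maintyp=1&typ=5&minpr=&maxpr=&bdrm=&blk="]

    def parse(self, response):
        # ... scrape the current page here ...

        # from_response() copies __VIEWSTATE and the other instance-dependent
        # fields from the page it was given, so their values stay valid.
        yield scrapy.FormRequest.from_response(
            response,
            formdata={
                # These are the two fields WebForm_DoPostBackWithOptions sets;
                # the target name is taken from the href in the question.
                "__EVENTTARGET": "ctl00$Content$rptPaging$ctl02$lbPaging",
                "__EVENTARGUMENT": "",
            },
            callback=self.parse,
        )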
I am trying to write a Python script to log in to the following site in order to automatically keep an eye on some account details: https://gateway.usps.com/eAdmin/view/signin
I have the right credentials, but something isn't quite working correctly. I don't know if it is because of the hidden inputs that exist on the form.
import requests
from bs4 import BeautifulSoup

user = 'myusername'
passwd = 'mypassword'

s = requests.Session()
r = s.get("https://gateway.usps.com/eAdmin/view/signin")
soup = BeautifulSoup(r.content, "html.parser")

# Hidden form fields that must be echoed back on login
sp = soup.find("input", {"name": "_sourcePage"})['value']
fp = soup.find("input", {"name": "__fp"})['value']
si = soup.find("input", {"name": "securityId"})['value']

data = {
    "securityId": si,
    "username": user,
    "password": passwd,
    "_sourcePage": sp,
    "__fp": fp,
}
headers = {
    "Content-Type": "application/x-www-form-urlencoded",
    "Host": "gateway.usps.com",
    "Origin": "https://gateway.usps.com",
    "Referer": "https://gateway.usps.com/eAdmin/view/signin",
}
login_url = "https://gateway.usps.com/eAdmin/view/signin"
# cookies=r.cookies is redundant here, since the Session tracks cookies itself
r = s.post(login_url, headers=headers, data=data, cookies=r.cookies)
print(r.content)
_sourcePage, securityId and __fp are all hidden input values from the page source. I am scraping them from the page, but obviously by the time I make the POST request I have opened the URL again, so these values have changed and are no longer valid. However, I'm unsure how to rewrite the POST line to ensure that I submit the hidden values I extracted.
I don't think this is relevant only to this site, but to any site with hidden, randomly generated values.
You can't do that.
You are trying to authenticate using an HTTP POST request outside the application's scope, i.e. outside the login page and its own web form.
For security reasons the page implements different techniques, one of which is an anti-CSRF token (which is probably _sourcePage), to ensure that the login request comes exclusively from the web page itself.
For this reason, every time you scrape the page and grab the content of those hidden security inputs, the web application generates them anew. Thus, when you reuse them to craft the final request, they are of course no longer valid.
See also: https://www.owasp.org/index.php/Cross-Site_Request_Forgery_(CSRF)
I am trying to crawl the olx.in site http://www.olx.in/newdelhi/bmw/, and I have set this URL as my start_url.
Now, to go to the next page: since the pagination is not plain HTML but dynamic, in the Network tab I saw that the next button fires an XHR request with the POST method. Now I have to simulate it in a Request (I guess...), but I can't figure out what its parameters should be.
I am new to Python and web scraping, so sorry if this is too general, but any help would be appreciated.
You should take a look at FormRequest, which enables you to send data via HTTP POST. As you can see, the next button sends a request to http://www.olx.in/ajax/newdelhi/search/list/ with some form data. Just populate the formdata parameter with the needed values from the current Response object, as sketched below. As you are trying to build pagination, you should also check this page on how to do it properly.
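A minimal sketch, assuming the form fields you observed in the Network tab (the formdata keys below are placeholders to be replaced with the real ones from the captured request):

import scrapy

class OlxSpider(scrapy.Spider):
    name = "olx"
    start_urls = ["http://www.olx.in/newdelhi/bmw/"]

    def parse(self, response):
        # ... scrape the items on the current page here ...

        # Replay the XHR that the next button makes; copy the real keys
        # and values from the captured request, these are placeholders.
        yield scrapy.FormRequest(
            "http://www.olx.in/ajax/newdelhi/search/list/",
            formdata={"page": "2"},  # placeholder form data
            callback=self.parse,
        )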
I'm trying to make a simple POST request with the requests module, like this:
s = requests.Session()
s.post(link, data=payload)
To do it properly, the payload must include an id taken from the page itself, and it is regenerated on every access to the page.
So I need to get the data from the page and then proceed with the request.
The problem is that every time you access the page, a new id is generated.
So if we do this:
s = requests.Session()
payload = get_payload(s.get(link).text)
s.post(link, data=payload)
It will not work, because when you accessed the page with s.get the right id was generated, but by the time you make the POST request a new id has been generated, so you'll be using an old one.
Is there any way to get the data from the page right before the post request?
Something like:
s.post(link, data=get_data(s.get(link)))
When you make a POST (or GET) request, the page will generate another id and send it back to you. There is no way of sending data to the page while it is being generated, because you need to receive a response first in order to process the data on the page; and once you have received the response, the server will create a new id for you the next time you view the page.
See https://www3.ntu.edu.sg/home/ehchua/programming/webprogramming/images/HTTP.png for a simple example image of an HTTP request.
In general, there is no way to do this. The server's response is potentially affected by the data you send, so it can't be available before you have sent the data.
To persist this kind of information across requests, the server would usually set a cookie for you to send with each subsequent request, and using a requests.Session will handle that for you automatically. It is possible that you need to set the cookie yourself based on the first response, but cookies are a key/value pair, and you only appear to have the value.
To find the key, and more generally to find out if this is what the server expects you to do, requires specific knowledge of the site you are working with. If this is a documented API, the documentation would be a good place to start. Otherwise you might need to look at what the website itself does: most browsers allow you to look at the cookies that are set for that site, and some (possibly via extensions) will let you look through the HTTP headers that are sent and received.
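For illustration, a small sketch showing that a Session keeps the server's cookies for you across requests (example.com is a placeholder):

import requests

s = requests.Session()
first = s.get("https://example.com/form")  # placeholder URL

# Whatever Set-Cookie headers the server sent are now stored here and
# will be attached automatically to every later request in this session:
print(s.cookies.get_dict())

# No manual cookie plumbing is needed for the follow-up POST.
second = s.post("https://example.com/form", data={"field": "value"})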
I am a bit new to the Scrapy framework. I want to scrape a web page that takes me to the results page via a redirect.
The search form is on my start URL.
In my parse method I take the response and use
FormRequest.from_response(response, formdata=values,
                          callback=self.handle_redirect)
to generate a request with all the values needed for the POST.
This request ends up at the 302 (Object Moved) page, and there is nothing there I want to scrape.
I want to follow the redirect to the original page that holds the search results.
How should I approach this?
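For reference, a minimal sketch of the setup described above (the URL and form values are placeholders). Note that Scrapy's built-in RedirectMiddleware follows 302 responses by default, so the callback already receives the final, redirected-to page:

import scrapy

class SearchSpider(scrapy.Spider):
    name = "search"
    start_urls = ["https://example.com/search"]  # page that holds the search form

    def parse(self, response):
        values = {"query": "term"}  # placeholder form data
        yield scrapy.FormRequest.from_response(
            response, formdata=values, callback=self.handle_redirect
        )

    def handle_redirect(self, response):
        # Thanks to RedirectMiddleware, this response is the results page
        # that the 302 pointed to, not the 302 itself.
        self.logger.info("Landed on %s", response.url)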