How do I get the request headers in Python Scrapy for dynamic table data?
The request is made when a button is clicked.
Is there a way to get the data without simulating the click?
You can extract the request from the website using the browser network tab. Observe the network tab to check which request is made when the button is clicked.
After that you can make the same request to the API in scrapy using the Request class.
Something like this:
yield scrapy.Request(
    "http://www.example.com/some_page.html",
    headers={'header': 'value'},  # headers copied from the network tab
    callback=self.parse_table,
)
More info on Request here
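To reproduce the headers seen in the network tab, one option is to copy the raw request headers from dev tools and convert them into a dict. The following is a minimal stdlib-only sketch; the function name `parse_raw_headers` and the sample header values are assumptions made for illustration and are not part of Scrapy.

```python
def parse_raw_headers(raw):
    """Turn 'Name: value' lines copied from the network tab into a dict."""
    headers = {}
    for line in raw.splitlines():
        line = line.strip()
        # Skip blank lines and HTTP/2 pseudo-headers such as ':authority'.
        if not line or line.startswith(":"):
            continue
        # Split on the first colon only, so URLs in values stay intact.
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip()] = value.strip()
    return headers


# Hypothetical headers, as they might be copied from dev tools.
RAW = """
:authority: www.example.com
Accept: application/json
X-Requested-With: XMLHttpRequest
Referer: http://www.example.com/some_page.html
"""

headers = parse_raw_headers(RAW)
print(headers)
```

The resulting dict can then be passed as `headers=headers` to `scrapy.Request`.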
I'm working with Playwright. I would like to get the response body (HTML) from network events instead of waiting for the DOM to load the data in the browser and then parsing the elements. My current workflow looks something like this:
Playwright opens headless chromium
Opens first page with captcha (no data)
Solves captcha and redirects to the page with data
Sometimes a lot of data is returned and the page takes quite a while to load in the browser, even though all the data has already been received on the client side in network events. My question is: is it possible to get the network events in Playwright instead of waiting for all the elements to load?
I found the Network Events documentation and was able to get the HTML, but it returns all the requests instead of a single request.
I'm using Playwright simply for navigation, form submitting, and to get website HTML.
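Use a condition in the handler instead of just calling print. For example, you could check whether the response contains some key in its JSON. Starting from the Network Events example, which prints every request and response: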
from playwright.sync_api import sync_playwright

def run(playwright):
    chromium = playwright.chromium
    browser = chromium.launch()
    page = browser.new_page()
    # Subscribe to "request" and "response" events.
    page.on("request", lambda request: print(">>", request.method, request.url))
    page.on("response", lambda response: print("<<", response.status, response.url))
    page.goto("https://example.com")
    browser.close()

with sync_playwright() as playwright:
    run(playwright)
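For example, with key being the string you are looking for:
page.on("response", lambda response: print(response.url) if key in response.text() else None)
Python also has an equivalent of waitForResponse, page.expect_response, which you could use to wait for a single matching response.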
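Waiting for one specific response could look roughly like the sketch below. It assumes the data response can be recognized by a fragment of its URL; the helper names `is_data_response` and `fetch_single_response` and the `url_fragment` parameter are hypothetical. The Playwright import is kept inside the function so the pure predicate can be used without a browser.

```python
def is_data_response(url, status, url_fragment):
    """True for a successful response whose URL contains url_fragment."""
    return status == 200 and url_fragment in url


def fetch_single_response(start_url, url_fragment):
    """Open start_url and return the body of the first matching response."""
    # Third-party import kept local; Playwright must be installed to call this.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        # expect_response blocks on exit until the predicate matches a response.
        with page.expect_response(
            lambda r: is_data_response(r.url, r.status, url_fragment)
        ) as response_info:
            # "commit" returns as soon as navigation is committed,
            # without waiting for the load event.
            page.goto(start_url, wait_until="commit")
        html = response_info.value.text()
        browser.close()
        return html
```

Keeping the predicate on the URL and status avoids reading every response body just to decide whether it is the one of interest.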
I am trying to scrape the following URL:
Website
The page has some hidden text that is revealed after a click.
The corresponding HTML is also hidden and only appears after the button is clicked.
Before click:
After click:
How can I scrape this text?
BeautifulSoup doesn't work on this text.
If you open dev tools and click those buttons, you can see that you make a post request to https://en.indonetwork.co.id/ajax.
So you can either try to replicate that - see if you can capture the payload sent in the post request from a scrape of the home page and send that.
Or you could use selenium to load the page, click the button, and then capture the data.
It does not work with BeautifulSoup because this is not a static site. When you click the phone button, the page sends a request to an API endpoint and then renders the response. You can check this in the network tab in dev tools (I confirmed this).
BeautifulSoup only sees the initial static HTML returned by the request. It does not take into account requests triggered by user interaction.
One solution is Selenium.
Here are the exact steps you can follow to get this done.
Start Selenium with a headful (non-headless) browser, which makes it easier to see and debug the interaction with the page.
Find the phone button and click on it.
Wait until the request has been processed and the result has been rendered on screen.
Then grab the content of the element as required.
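The steps above can be sketched as follows. This is only an outline under stated assumptions: Selenium and a Chrome driver are installed, the button can be found with the CSS selector `.btn-contact-phone`, and the revealed text replaces the button's own text. The function names are hypothetical.

```python
import re


def extract_numbers(text):
    """Return every run of six or more digits found in text."""
    return re.findall(r"\d{6,}", text)


def click_and_read(url, selector=".btn-contact-phone", timeout=10):
    """Load url, click the button, wait for its text to change, return digits."""
    # Third-party imports kept local; Selenium must be installed to call this.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()  # headful by default
    try:
        driver.get(url)
        wait = WebDriverWait(driver, timeout)
        button = wait.until(
            EC.element_to_be_clickable((By.CSS_SELECTOR, selector))
        )
        before = button.text
        button.click()
        # Assumption: the AJAX response replaces the button's text.
        wait.until(
            lambda d: d.find_element(By.CSS_SELECTOR, selector).text != before
        )
        return extract_numbers(driver.find_element(By.CSS_SELECTOR, selector).text)
    finally:
        driver.quit()
```

An explicit wait on a visible change is usually more reliable than sleeping for a fixed time, since the request may take longer or shorter than expected.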
A less robust solution
You can send the request directly to that same API endpoint. It may have protections you need to get around, such as required tokens or headers (CORS itself is enforced only by browsers, so it does not block a script).
This is less robust because the API endpoint might change, and since this API call returns phone numbers, the site may secure it further in the future. The interaction on the web page, by contrast, tends to stay nearly the same.
You don't need to simulate the click; there is an AJAX call happening under the hood:
import re

import requests
from bs4 import BeautifulSoup

# Fetch the static page and locate the phone button.
page = requests.get('https://en.indonetwork.co.id/company/surrama-groups')
soup = BeautifulSoup(page.text, 'html.parser')
v = soup.find(class_='btn btn-contact btn-contact-phone btn-contact-ajax').attrs
data_id = v['data-id']
data_text = v['data-text']
data_type = v['data-type']
# Replay the AJAX call the button makes, using the button's data attributes.
data = requests.post('https://en.indonetwork.co.id/ajax', json={
    'id': data_id,
    'text': data_text,
    'type': data_type,
    'url': "leads/ajax"
}).json()
mobile_no = re.findall(r'(\d+)', data['text'])
print(mobile_no)  # ['622122520556', '6287885720277']
I'm writing a Python script using Scrapy to scrape data from a website that requires authentication.
The page I'm scraping is really painful because it is mainly built with JavaScript and AJAX requests. The whole body of the page is inside a <form> that lets you change pages using a submit button. The URL doesn't change (and it's an .aspx page).
I have successfully scraped all the data I need from page one, and I move to the next page by clicking on this input button using this code:
yield FormRequest.from_response(response,
                                formname="Form",
                                clickdata={"class": "PageNext"},
                                callback=self.after_login)
The after_login method is scraping the data.
However, I also need data that appears in another div after clicking on a container with an onclick attribute. I need a loop that clicks on each container, displays the data, and scrapes it, and only after that goes to the next page and repeats the same process.
The thing is, I can't figure out how to have the script click on the container using Selenium (while logged in, otherwise I cannot reach this page) and then have Scrapy scrape the data after the XHR request has been made.
I did a lot of research on the internet but could not find a solution that works.
Thanks !
OK, so I've almost got what I want, following @malberts' advice.
I've used this kind of code to get the response to the AJAX request:
yield scrapy.FormRequest.from_response(
    response=response,
    formdata={
        'param1': param1value,
        'param2': param2value,
        '__VIEWSTATE': __VIEWSTATE,
        '__ASYNCPOST': 'true',
        'DetailsId': '123'},
    callback=self.parse_item)

def parse_item(self, response):
    ajax_response = response.body
    yield {'Response': ajax_response}
The response is supposed to be HTML. The thing is, the response is not quite the same as the one I see when I look at the request's Response in Chrome Dev Tools. I haven't taken all the form data into account yet (~10 of 25 fields). Could it be that it needs all the form data, even the fields that don't change depending on the id?
Thanks !
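One way to check whether the missing fields explain the difference is to list every input the form contains and compare that with the Form Data shown in dev tools. The following is a minimal stdlib-only sketch; the class and function names, the sample HTML, and the `__EVENTVALIDATION` field in it are assumptions for illustration. Note that `FormRequest.from_response` already pre-populates fields found in the HTML form, so fields that are added by JavaScript are the ones most likely to be missing.

```python
from html.parser import HTMLParser


class InputCollector(HTMLParser):
    """Collect name/value pairs of every <input> that has a name."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        if attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def collect_inputs(html):
    parser = InputCollector()
    parser.feed(html)
    return parser.fields


# Hypothetical page fragment for illustration.
SAMPLE = """
<form name="Form">
  <input type="hidden" name="__VIEWSTATE" value="abc123" />
  <input type="hidden" name="__EVENTVALIDATION" value="xyz789" />
  <input type="text" name="param1" value="" />
</form>
"""

fields = collect_inputs(SAMPLE)
# Fields present in the form but not in the formdata sent so far.
missing = set(fields) - {"__VIEWSTATE", "param1"}
print(sorted(missing))
```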
I'm not sure if such a thing is possible, but I am trying to submit to a form such as https://lambdaschool.com/contact using a POST request.
I currently have the following:
import requests
payload = {"name":"MyName","lastname":"MyLast","email":"someemail#gmail.com","message":"My message"}
r = requests.post('http://lambdaschool.com/contact',params=payload)
print(r.text)
But I get the following error:
<title>405 Method Not Allowed</title>
etc.
Is it possible to submit this form using a POST request?
If it were that simple, you'd see a lot of bots attacking every login form ever.
That URL obviously doesn't accept POST requests. That doesn't mean the submit button is POST-ing to that page (though clicking the button also gives that same error...)
You need to open the Chrome or Firefox dev tools, watch the request to see what happens on form submit, and replicate that data in Python.
Another option would be the mechanize or Selenium WebDriver libraries, which simulate a browser and fill out the form.
params is for query parameters. You either want data, for a form encoded body, or json, for a JSON body.
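The difference between the three keyword arguments can be illustrated without sending anything. The sketch below uses only the standard library to show roughly where the payload ends up in each case; the URL is a placeholder and the payload is shortened for illustration.

```python
import json
from urllib.parse import urlencode

payload = {"name": "MyName", "message": "My message"}
url = "http://example.com/contact"  # placeholder URL

# params=payload: the fields go into the URL's query string; the body is empty.
as_params = url + "?" + urlencode(payload)

# data=payload: the fields are form-encoded into the request body.
as_data = urlencode(payload)

# json=payload: the fields are serialized as a JSON document in the body.
as_json = json.dumps(payload)

print(as_params)
print(as_data)
print(as_json)
```

Which of `data` or `json` is right depends on what the browser sends when the form is submitted, which the network tab shows as either Form Data or a JSON request payload.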
I think the url should be 'http://lambdaschool.com/contact-form'.
At this link, when you hover over any row, an image box labelled "i" appears that you can click to get extra data. From there you can navigate to Lines History. Where is that information coming from? I can't find the URL that is connected with it.
I used dev tools in chrome, and found out that there's an ajax post being made:
Request URL:http://www.sbrforum.com/ajax/?a=[SBR.Odds.Modules]OddsEvent_GetLinesHistory
Form Data: UserId=0&Sport=basketball&League=NBA&EventId=259672&View=LH&SportsbookId=238&DefaultBookId=238&ConsensusBookId=19&PeriodTypeId=&StartDate=2014-03-24&MatchupLink=http%3A%2F%2Fwww.sbrforum.com%2Fnba-basketball%2Fmatchups%2F20140324-602%2F&Key=de2f9e1485ba96a69201680d1f7bace4&theme=default
But when I try to visit this URL in the browser, I get Invalid Ajax Call -- from host:
Any idea?
Like you say, it's probably an HTTP POST request.
When you navigate to the URL with the browser, the browser issues a GET request, without all the form data.
Try curl, wget, or the JavaScript console in your browser to do a POST.
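The same POST can also be built from Python with only the standard library. This is a sketch, not a confirmed working call: the form below is a subset of the fields from the question, and the server may well reject the request without the remaining ones (such as Key). The helper names `build_post` and `send` are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_post(url, form):
    """Build a form-encoded POST request without sending it."""
    body = urlencode(form).encode("ascii")
    return Request(
        url,
        data=body,  # a non-empty data argument makes this a POST
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


def send(request):
    """Send the request and return the decoded response body."""
    with urlopen(request) as resp:
        return resp.read().decode("utf-8", errors="replace")


# Subset of the Form Data from the question.
form = {
    "UserId": "0",
    "Sport": "basketball",
    "League": "NBA",
    "EventId": "259672",
    "View": "LH",
}
request = build_post(
    "http://www.sbrforum.com/ajax/?a=[SBR.Odds.Modules]OddsEvent_GetLinesHistory",
    form,
)
print(request.get_method())
```

Calling `send(request)` would actually perform the POST; it is left out of the example so the request can be inspected first.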