web scraping a webpage which has dynamic contents loaded via ajax - python

Say I wish to scrape the products on this page (http://shop.coles.com.au/online/national/bread-bakery/fresh/bread#pageNumber=2&currentPageSize=20).
But the products are loaded by a POST request. Many posts here suggest simulating that request to get the dynamic content, but in my case the form data (e.g. catalogId, categoryId) is unknown to me.
Is it possible to get the response after the AJAX call has finished?

You can get the catalogId and other parameter values needed to make the POST request from the form with id="search":
<form id="search" name="search" action="http://shop.coles.com.au/online/SearchDisplay?pageView=image&catalogId=10576&beginIndex=0&langId=-1&storeId=10601" method="get" role="search">
<input type="hidden" name="storeId" value="10601" id="WC_CachedHeaderDisplay_FormInput_storeId_In_CatalogSearchForm_1">
<input type="hidden" name="catalogId" value="10576" id="WC_CachedHeaderDisplay_FormInput_catalogId_In_CatalogSearchForm_1">
<input type="hidden" name="langId" value="-1" id="WC_CachedHeaderDisplay_FormInput_langId_In_CatalogSearchForm_1">
<input type="hidden" name="beginIndex" value="0" id="WC_CachedHeaderDisplay_FormInput_beginIndex_In_CatalogSearchForm_1">
<input type="hidden" name="browseView" value="false" id="WC_CachedHeaderDisplay_FormInput_browseView_In_CatalogSearchForm_1">
<input type="hidden" name="searchSource" value="Q" id="WC_CachedHeaderDisplay_FormInput_searchSource_In_CatalogSearchForm_1">
...
</form>
Use the FormRequest to submit this form.
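As a rough illustration of where those values come from, here is a minimal standard-library sketch that collects the hidden inputs of one form into a dict. The names HiddenInputs, hidden_fields and SAMPLE_HTML are made up for this example, and the sample markup is a trimmed-down stand-in for the real page.

```python
from html.parser import HTMLParser


class HiddenInputs(HTMLParser):
    """Collect name/value pairs of hidden inputs inside one form (by id)."""

    def __init__(self, form_id):
        super().__init__()
        self.form_id = form_id
        self.inside = False
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            # Only inputs of the form with the requested id are collected.
            self.inside = attrs.get("id") == self.form_id
        elif tag == "input" and self.inside and attrs.get("type") == "hidden":
            if attrs.get("name"):
                self.fields[attrs["name"]] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form":
            self.inside = False


def hidden_fields(html, form_id):
    """Return {name: value} for the hidden inputs of the given form."""
    parser = HiddenInputs(form_id)
    parser.feed(html)
    return parser.fields


# Trimmed sample (assumption: a simplified copy of the markup shown above).
SAMPLE_HTML = """
<form id="other"><input type="hidden" name="other" value="x"></form>
<form id="search" name="search" method="get" role="search">
<input type="hidden" name="storeId" value="10601">
<input type="hidden" name="catalogId" value="10576">
<input type="hidden" name="langId" value="-1">
</form>
"""

form_data = hidden_fields(SAMPLE_HTML, "search")
```

Inside a Scrapy callback, FormRequest.from_response(response, formid="search") pre-fills the request from the same form fields, so the manual parsing is only needed outside Scrapy.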
As for "is it possible to get the response after the ajax call is finished?":
Scrapy is not a browser. It does not make additional AJAX requests to load the page, and there is nothing built in to execute JavaScript. You could instead use a real browser and solve the problem at a higher level, for example with the selenium package. There is also the related scrapy-splash project.
See also:
selenium with scrapy for dynamic page

Related

Redirect user in Telegram Bot to an external link with POST request

Since I'm new to this POST/GET HTTP stuff, I might be getting things wrong, that's why I'll put my question in 2 ways. Maybe one way will be better than the other :)
I'm developing a Telegram Bot using PyTelegramBotAPI, and it needs to include an online payment.
For the online payment I need the user to follow a link with POST method (it's an external link + I need to pass form data), but that's what causes difficulties for me.
I.
In my code I perform the following:
req = requests.post(url=url, data=data)
Where url is the URL of the website to which the client must be redirected, and data is the data that it needs to pass with the POST request when redirecting.
It works fine as a request in Python, but obviously it can't redirect the client to the website needed.
I tried to generate a URL and pass it to the client using
url = url + "?" + urlencode(data)
Where url is again the URL of the website. But in this case the website tells me that the method used is incorrect. I guess the link becomes a GET request, instead of a POST request.
How can I redirect the client to that link with POST method?
II.
Another way of putting this question is this:
The company which processes the online payments requires them to be performed using the following HTML form:
<form action="https://securesandbox.webpay.by/" method="post">
<input type="hidden" name="*scart">
<input type="hidden" name="wsb_storeid" value="11111111">
<input type="hidden" name="wsb_order_num" value="ORDER-12345678">
<input type="hidden" name="wsb_currency_id" value="BYN">
<input type="hidden" name="wsb_version" value="2">
<input type="hidden" name="wsb_seed" value="1242649174">
<input type="hidden" name="wsb_signature" value="124264917411111111ORDER-123456781BYN10123456">
<input type="hidden" name="wsb_test" value="1">
<input type="hidden" name="wsb_invoice_item_name[0]" value="Товар 1">
<input type="hidden" name="wsb_invoice_item_quantity[0]" value="2">
<input type="hidden" name="wsb_invoice_item_price[0]" value="10">
<input type="hidden" name="wsb_total" value="10">
<input type="submit" value="Купить">
</form>
This would work well if I used HTML pages, but my app is a Telegram Bot, so it doesn't work as is. Therefore I need to generate this HTML form automatically in Python (namely, I need to change the "value" fields for every payment).
How can I imitate this HTML form in my Telegram Bot and redirect the client after some trigger?
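One possible direction, sketched under assumptions: a link opened from a chat is followed with GET, so the POST would have to come from a page that contains the form. The sketch below only shows generating that form in Python with per-payment values. The function name build_payment_form is hypothetical, and hosting the resulting page somewhere the user can open it is assumed rather than shown.

```python
from html import escape


def build_payment_form(action_url, fields, submit_label="Pay"):
    """Render an HTML POST form with one hidden input per field."""
    inputs = "\n".join(
        # escape() also escapes quotes, so values are safe inside attributes.
        f'<input type="hidden" name="{escape(name)}" value="{escape(str(value))}">'
        for name, value in fields.items()
    )
    return (
        f'<form action="{escape(action_url)}" method="post">\n'
        f"{inputs}\n"
        f'<input type="submit" value="{escape(submit_label)}">\n'
        "</form>"
    )


# Example values copied from the form in the question.
form_html = build_payment_form(
    "https://securesandbox.webpay.by/",
    {"wsb_storeid": "11111111", "wsb_order_num": "ORDER-12345678", "wsb_total": 10},
)
```

The returned string could be served from a small web endpoint of your own, and the bot would then send the user a link to that endpoint.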

Python requests module not posting to certain input fields

I'm trying to scrape data from a website behind a login screen, and I've run into a problem with posting parts of the login info with the post() method from Python's requests module.
I've gotten the names of each HTML input field that needs to be filled in and placed them in a dictionary along with their required value, and then passed that dictionary to the post() method.
The HTML from the login page:
<input name="ctl00$ContentPlaceHolder1$TextBox1" type="text" value="" id="ContentPlaceHolder1_TextBox1" tabindex="1" class="form-control " placeholder="username" required="">
<input name="ctl00$ContentPlaceHolder1$TextBox2" type="password" id="ContentPlaceHolder1_TextBox2" tabindex="2" class="form-control" placeholder="password" required="" value="">
Then, using the name value to create the dictionary that's passed to post()
formData = {
"ctl00$ContentPlaceHolder1$TextBox1": "FakeUsername",
"ctl00$ContentPlaceHolder1$TextBox2": "FakePassword"
}
r = session.get(loginUrl) # get cookies necessary for login
r = session.post(loginUrl, data=formData)
This works properly for the username field, but it does not post the password in the password field. If I read the HTML from the login page after posting the data, I get:
<input name="ctl00$ContentPlaceHolder1$TextBox1" type="text" value="FakeUsername" id="ContentPlaceHolder1_TextBox1" tabindex="1" class="form-control " placeholder="username" required="" />
<input name="ctl00$ContentPlaceHolder1$TextBox2" type="password" id="ContentPlaceHolder1_TextBox2" tabindex="2" class="form-control" placeholder="password" required="" />
The "value" parameter of the password input field is no longer listed, not even as an empty parameter. Attempting a login after this of course does not work.
I have been unable to figure out why this is happening. I've made sure to fill in any hidden input fields (EVENTVALIDATION, VIEWSTATE, etc.) and have also looked at the webpage headers, but have still had no luck.
The website I'm trying to log in to is:
https://panel.forcad.org/Default.aspx
I would really appreciate help figuring out what is going wrong.
You said you looked at the headers, but you should be able to replicate the browser's behavior by sending the same request headers and cookies. Try copying the exact parameters and cookies from a known successful login; that lets you check whether requests can send the data the site expects at all. If you cannot log in even with valid cookies, the site may rely on JavaScript tricks or on something else requests cannot do. In that case you need more reverse engineering, or you can try selenium. pyvirtualdisplay can hide the browser, and you can use JavaScript to stop() the page loading.
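To narrow it down, it can help to compare the form body the browser actually sent (copied from the developer tools) with the dictionary passed to post(). A minimal sketch follows; diff_payload is a made-up helper, and the captured body, including the Button1 field, is an invented example rather than what this particular site sends.

```python
from urllib.parse import parse_qsl


def diff_payload(browser_body, script_data):
    """Compare a captured urlencoded body with the script's form dict.

    Returns (missing, extra, different): keys only the browser sent,
    keys only the script sends, and keys whose values differ.
    """
    browser = dict(parse_qsl(browser_body, keep_blank_values=True))
    missing = sorted(set(browser) - set(script_data))
    extra = sorted(set(script_data) - set(browser))
    different = sorted(
        key for key in set(browser) & set(script_data)
        if browser[key] != script_data[key]
    )
    return missing, extra, different


# Hypothetical body captured from a successful browser login.
captured = (
    "__VIEWSTATE=abc"
    "&ctl00%24ContentPlaceHolder1%24TextBox1=FakeUsername"
    "&ctl00%24ContentPlaceHolder1%24TextBox2=FakePassword"
    "&ctl00%24ContentPlaceHolder1%24Button1=Login"
)
form_data = {
    "ctl00$ContentPlaceHolder1$TextBox1": "FakeUsername",
    "ctl00$ContentPlaceHolder1$TextBox2": "FakePassword",
}
missing, extra, different = diff_payload(captured, form_data)
```

Any key reported as missing is a field the browser sent and the script did not, which is a reasonable first place to look.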

Strange PHP form post

So I'm writing a web crawler to batch download PDFs from my university's website, as I don't fancy downloading them one by one.
I've got most of the code working, using the 'requests' module. The issue is that you have to be signed in to a university account to access the PDFs, so I've set up requests to use cookies to sign in to my university account before downloading them. However, the HTML form for signing in on the university page is rather peculiar.
I've abstracted the HTML which can be found here:
<form action="/login" method="post">
<fieldset>
<div>
<label for="username">Username:</label>
<input id="username" name="username" type="text" value="" />
<label for="password">Password:</label>
<input id="password" name="password" type="password" value=""/>
<input type="hidden" name="lt" value="" />
<input type="hidden" name="execution" value="*very_long_encrypted_code*" />
<input type="hidden" name="_eventId" value="submit" />
<input type="submit" name="submit" value="Login" />
</div>
</fieldset>
</form>
Firstly, the action parameter in the form does not reference a PHP file, which I don't understand. Is action="/login" referencing the page itself, or http://www.blahblah/login/login? (The HTML is taken from the page http://www.blahblah/login.)
Secondly, what's with all the 'hidden' inputs? I'm not sure how this page is taking the given login data and passing it to a PHP script.
This has led to the failure of the requests sign on in my python script:
import requests
user = input("User: ")
passw = input("Password: ")
payload = {"username" : user, "password" : passw}
s = requests.Session()
s.post(loginURL, data = payload)
r = s.get(url)
I would have thought this would take the login data and sign me in to the page, but r just contains the original login page. I'm assuming it has to do with the strange PHP interaction in the HTML. Any ideas what I need to change?
EDIT: Thought I'd also mention there is no javascript on the page at all. Purely HTML & CSS
What you are looking at is likely a CSRF token.
In short, these tokens are used to make sure that malicious requests can't be sent to a site from another page in your web browser. In this case it is a bit silly, because logging in has no consequences. It was likely added automatically by the framework your university website uses.
You will have to extract this token from the login page before doing your login POST, and then include it with your data.
The full steps would be the following:
1. Fetch the login page.
2. Extract the token with e.g. BeautifulSoup or requests-html.
3. Send the login request:
payload = {"username" : user, "password" : passw, "execution": token}
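The steps above can be sketched roughly as follows. This version swaps BeautifulSoup for the standard library's HTMLParser so that it has no third-party dependency, and it copies every hidden input (lt, execution, _eventId) rather than only the token. LoginFormParser and login are made-up names, and session is assumed to be a requests.Session or any object with compatible get() and post() methods. It also resolves the form's action against the page URL, which shows that action="/login" on http://www.blahblah/login points back at the same URL.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LoginFormParser(HTMLParser):
    """Record the first form's action and all hidden input values."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.hidden = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and self.action is None:
            self.action = attrs.get("action") or ""
        elif tag == "input" and attrs.get("type") == "hidden" and attrs.get("name"):
            self.hidden[attrs["name"]] = attrs.get("value") or ""


def login(session, login_url, user, passw):
    """Fetch the login page, copy its hidden fields, then POST the credentials."""
    page = session.get(login_url)          # step 1: fetch the login page
    parser = LoginFormParser()
    parser.feed(page.text)                 # step 2: extract the hidden fields
    payload = dict(parser.hidden)
    payload.update({"username": user, "password": passw})
    # A relative action such as "/login" is resolved against the page URL.
    post_url = urljoin(login_url, parser.action or "")
    return session.post(post_url, data=payload)  # step 3: send the login request
```

With requests installed, this would be called as login(requests.Session(), loginURL, user, passw), after which the same session object can be used to fetch the PDFs.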

Creating an object with an external form

I have an upload form that is going to S3 --
<form action="https://test.s3.amazonaws.com/" method="post" enctype="multipart/form-data">
<input type="file" name="file" id="id_file" />
<input type="submit" value="Upload to Amazon S3" name="upload">
</form>
When the form is submitted to S3, I also need to create an object in my db and get the id of the object. How would I do this (without redirecting to my view)? Thank you.
You could use an iframe to send the form, and JavaScript to search the retrieved page for the id you are looking for.
You could also perform the POST from another script, for example in PHP with the curl library. You could then take the data it returns, search it for what you are looking for, and add what you need to your database.

Using Urllib instead of action in post form

I need to allow users to upload content directly to Amazon S3. This form works:
<form action="https://me.s3.amazonaws.com/" method="post" enctype='multipart/form-data' class="upload-form">{% csrf_token %}
<input type="hidden" name="key" value="videos/test.jpg">
<input type="hidden" name="AWSAccessKeyId" value="<access_key>">
<input type="hidden" name="acl" value="public-read">
<input type="hidden" name="policy" value="{{policy}}">
<input type="hidden" name="signature" value="{{signature}}">
<input type="hidden" name="Content-Type" value="image/jpeg">
<input type="submit" value="Upload" name="upload">
</form>
And in the function, I define policy and signature. However, I need to pass two variables to the form -- Content-Type and Key, which will only be known when the user presses the upload button. Thus, I need to pass these two variables to the template after the POST request but before the re-direction to Amazon.
It was suggested that I use urllib to do this. I have tried doing so the following way, but I keep getting an inscrutable HTTPError. This is what I currently have:
if request.method == 'POST':
    # define the variables
    urllib2.urlopen("https://me.amazonaws.com/",
                    urllib.urlencode([('key', 'videos/test3.jpg'),
                                      ('AWSAccessKeyId', '<access_key>'),
                                      ('acl', 'public-read'),
                                      ('policy', policy),
                                      ('signature', signature),
                                      ('Content-Type', content_type),
                                      ('file', file)]))
I have also tried hardcoding all the values instead of using variables but still get the same error. What am I doing incorrectly and what do I need to change to be able to redirect the form to Amazon, so the content can be uploaded directly to Amazon?
I recommend watching the form do its work with Firebug, enabled and set to the Net tab.
After completing the POST, click its [+] icon to expand, study the Headers, POST, Response tabs to see what you are missing and/or doing wrong.
Next, separate this script from Django and put it into a standalone file. Add one thing at a time to it and retest until it works. The lines below should increase visibility into what your script sends.
import httplib
httplib.HTTPConnection.debuglevel = 1
I tried poking around with urllib myself, but as I don't have an account on AWS I didn't get farther than getting a 400 Bad Request response. Seems like a good sign, probably I just need valid host and key params etc.
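Two differences that such a comparison would probably show: the working HTML form posts to https://me.s3.amazonaws.com/ with enctype multipart/form-data, while the script posts to https://me.amazonaws.com/ and urlencode produces an application/x-www-form-urlencoded body. Below is a minimal Python 3 sketch of building a multipart body by hand. encode_multipart is a made-up helper, not part of urllib, and whether this alone satisfies S3 is an assumption I could not test without valid credentials.

```python
import uuid


def encode_multipart(fields, file_name, file_bytes, file_type):
    """Build a multipart/form-data body.

    fields is a list of (name, value) pairs. The file part is written
    last under the field name "file". Returns (content_type, body).
    """
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields:
        parts.append((
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f"{value}\r\n"
        ).encode("utf-8"))
    # The file part carries its own filename and Content-Type header.
    parts.append((
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{file_name}"\r\n'
        f"Content-Type: {file_type}\r\n\r\n"
    ).encode("utf-8") + file_bytes + b"\r\n")
    # The closing delimiter ends with two extra dashes.
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return f"multipart/form-data; boundary={boundary}", b"".join(parts)


content_type, body = encode_multipart(
    [("key", "videos/test3.jpg"), ("acl", "public-read")],
    "test3.jpg", b"\xff\xd8", "image/jpeg",
)
```

The result can be sent with urllib.request.Request(url, data=body, headers={"Content-Type": content_type}) and urllib.request.urlopen.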
