Python urllib2 automatic form filling and retrieval of results

I'm looking to be able to query a site for warranty information on the machine this script would be running on. It should be able to fill out a form if needed (as in the case of, say, HP's service site) and then retrieve the resulting web page.
I already have the bits in place to parse the resulting HTML that is reported back; I'm just having trouble with what needs to be done to POST the data that goes in the form fields and then retrieve the resulting page.

If you absolutely need to use urllib2, the basic gist is this:
import urllib
import urllib2
url = 'http://whatever.foo/form.html'
# Map each form field's name attribute to the value you want to submit.
form_data = {'field1': 'value1', 'field2': 'value2'}
params = urllib.urlencode(form_data)
# Passing a data argument makes urlopen() issue a POST request.
response = urllib2.urlopen(url, params)
data = response.read()
If you send along POST data (the 2nd argument to urlopen()), the request method is automatically set to POST.
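For contrast, the same encoded parameters can be sent as a GET request by appending them to the URL instead of passing them as the second argument; a minimal sketch reusing the names above:
response = urllib2.urlopen(url + '?' + params)  # no data argument, so this is a GET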
I suggest you do yourself a favor and use mechanize, a full-blown urllib2 replacement that acts like a real browser. A lot of sites rely on hidden fields, cookies, and redirects, none of which urllib2 handles for you by default, whereas mechanize does.
Check out Emulating a browser in Python with mechanize for a good example.

Using urllib and urllib2 together:
import urllib
import urllib2
data = urllib.urlencode([('field1', val1), ('field2', val2)])  # a list of two-element tuples also works
response = urllib2.urlopen('post-url', data)
content = response.read()
content will then hold the page source.

I’ve only done a little bit of this, but:
You’ve got the HTML of the form page. Extract the name attribute for each form field you need to fill in.
Create a dictionary mapping the name of each form field to the value you want to submit.
Use urllib.urlencode to turn the dictionary into the body of your post request.
Include this encoded data as the second argument to urllib2.Request(), after the URL that the form should be submitted to.
The server will either return a resulting web page, or return a redirect to a resulting web page. If it does the latter, you’ll need to issue a GET request to the URL specified in the redirect response.
I hope that makes some sort of sense?
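A minimal sketch of steps 2 through 4, assuming you've already pulled the field names out of the form's HTML (the URL and field names below are hypothetical):
import urllib
import urllib2
# Step 2: map the field names (taken from the form's HTML) to your values.
fields = {'serial_number': 'ABC123', 'country': 'US'}
# Step 3: encode the dictionary as the POST body.
body = urllib.urlencode(fields)
# Step 4: POST to the URL named in the form's action attribute.
req = urllib2.Request('http://example.com/warranty/lookup', body)
response = urllib2.urlopen(req)  # urllib2 follows any redirect automatically
html = response.read()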

Related

Post method with Requests

I'm trying to make a simple POST request with the requests module, like this:
s=requests.Session()
s.post(link,data=payload)
For it to work properly, the payload must include an id that comes from the page itself, and a new id is generated on every access to the page.
So I need to get the data from the page and then proceed with the POST request.
The problem is that every access to the page generates a new id.
So if we do this:
s=requests.Session()
payload=get_payload(s.get(link).text)
s.post(link,data=payload)
It will not work, because when you access the page with s.get the right id is generated, but by the time you make the POST request a new id has been generated, so you'll be sending an old one.
Is there any way to get the data from the page right before the post request?
Something like:
s.post(link, data=get_data(s.get(link)))
When you make a POST (or GET) request, the page will generate another id and send it back to you. There is no way of sending data to the page while it is being generated, because you need to receive a response first in order to process the data on the page, and once you have received the response, the server will create a new id for you the next time you view the page.
See https://www3.ntu.edu.sg/home/ehchua/programming/webprogramming/images/HTTP.png for a simple diagram of an HTTP request/response cycle.
In general, there is no way to do this. The server's response is potentially affected by the data you send, so it can't be available before you have sent the data.
To persist this kind of information across requests, the server would usually set a cookie for you to send with each subsequent request, and using a requests.Session will handle that for you automatically. It is possible that you need to set the cookie yourself based on the first response, but cookies are a key/value pair, and you only appear to have the value.
To find the key, and more generally to find out whether this is what the server expects you to do, requires specific knowledge of the site you are working with. If this is a documented API, the documentation would be a good place to start. Otherwise you might need to look at what the website itself does: most browsers let you inspect the cookies set for a site, and some (possibly via extensions) will let you look through the HTTP headers that are sent and received.
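If the id turns out to be embedded in the page itself (a hidden form field is the usual pattern), a single Session can fetch it and post it straight back in one flow. A minimal sketch, where the URL and the field name 'token' are hypothetical:
import requests
from bs4 import BeautifulSoup

link = 'http://example.com/form'  # hypothetical URL
s = requests.Session()            # keeps cookies across requests

# Fetch the page and pull the freshly generated id out of the HTML.
soup = BeautifulSoup(s.get(link).text, 'html.parser')
token = soup.find('input', {'name': 'token'})['value']  # hypothetical field name

# Post it back on the same session, without requesting the page again in between.
r = s.post(link, data={'token': token, 'answer': '42'})
print(r.status_code)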

How to Send Form Data with Python Requests

I'm using requests to access this webpage and subsequently parsing and inspecting the HTML with Beautiful Soup.
This page allows a user to specify the number of days in the past for which results should be returned. This is accomplished via a form on the page.
When I submit the request in the browser with my selection of 365 days and examine the response, I can see the form data that was sent with the request.
Of note is the form field "dnf_class_values[procurement_notice][_posted_date]: 365", as this is the only element that corresponds to my selection of 365 days.
When this request is returned in the browser, I get n results, where n is the maximum possible count given that this is the largest available time period. n is visible in the markup as <span class="lst-cnt">.
I can't seem to duplicate the sending of that form data with requests. Here is the relevant portion of my code:
import requests
from bs4 import BeautifulSoup as bs
# The field name comes straight from the form data observed in the browser.
formData = {'dnf_class_values[procurement_notice][_posted_date]': '365'}
r = requests.post("https://www.fbo.gov/index?s=opportunity&mode=list&tab=list&tabmode=list&pp=20&pageID=1", data=formData)
s = bs(r.content, 'html.parser')  # name a parser explicitly to avoid bs4's warning
s.find('span', {'class': 'lst-cnt'})
This is returning the same number of results as when the form is submitted with the default value for number of days.
I've tried URL encoding the key in data, as well as using requests.get, and specifying params as opposed to data. Additionally, I've attempted to append the form data field as a query string parameter:
url...?s=opportunity&mode=list&tab=list&tabmode=list&pp=20&pageID=1&dnf_class_values%5Bprocurement_notice%5D%5B_posted_date%5D=365
What is the appropriate syntax for that request?
You cannot send only the sections you care about, you need to send everything. Duplicate the POST request that Chrome made exactly.
Note that some of the POSTed values may be CSRF tokens. The Base64-encoded strings are particularly likely (dnf_opt_template, dnf_opt_template_dir, dnf_opt_subform_template and dnf_class_values[procurement_notice][notice_id]), and should probably be pulled out of the HTML for the original page using BeautifulSoup. The rest can be hardcoded.
Otherwise, your original syntax was correct.
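A sketch of that approach: fetch the form page first, collect the current value of every named input (which picks up hidden fields and any per-session tokens), then override just the field you care about. Any field names beyond the one quoted in the question are whatever the page actually contains:
import requests
from bs4 import BeautifulSoup

url = ("https://www.fbo.gov/index?s=opportunity&mode=list&tab=list"
       "&tabmode=list&pp=20&pageID=1")
session = requests.Session()

# Collect every named input's current value from the form page.
soup = BeautifulSoup(session.get(url).content, 'html.parser')
form_data = {inp['name']: inp.get('value', '')
             for inp in soup.find_all('input', attrs={'name': True})}

# Override only the field we actually want to change.
form_data['dnf_class_values[procurement_notice][_posted_date]'] = '365'

r = session.post(url, data=form_data)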

Python urllib2 sending a post

I want to open a page, find a number, multiply it by another random number, and then submit the result to the page. So what I'm doing is saving the page as HTML, finding the two numbers, and multiplying them, then sending the answer as a POST, but:
post = urllib.urlencode({'answer': goal, 'submit': 'Submit+Answer'})
req = urllib2.Request("example", None, headers)
response = urllib2.urlopen(req, post)  # this doesn't work: it opens the page a second time
This makes it connect a second time, and thus the number sent is wrong, since the server generates a new random number on each request. So how can I send a POST request to a page I already have open, without reopening it?
You might want to use something like mechanize, which enables stateful web browsing in Python. You could use it to load a URL, read a value from a page, perform the multiplication, place that number into a form on the page, and then submit it.
Does that sound like what you're trying to do? This page gives some information on how to fill in forms using mechanize.
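A rough sketch of that flow with mechanize (the URL, form field name, and page format below are made up for illustration):
import random
import re
import mechanize

br = mechanize.Browser()
page = br.open('http://example.com/challenge').read()  # hypothetical URL

# Pull the challenge number out of the page; the pattern is made up.
number = int(re.search(r'number is (\d+)', page).group(1))

# Fill in and submit the form on the same browser session, so the answer
# matches the number the server just generated.
br.select_form(nr=0)  # first form on the page
br['answer'] = str(number * random.randint(1, 10))
response = br.submit()
print response.read()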
I don't believe urllib supports keeping the connection open, as described here.
Looks like you'll have to send a reference to the original calculation back with your post. Or send the data back at the same time as the answer, so the server has some way of matching question with answer.

Python Online Form Submission

I am using Python 2.7.1 to access an online website. I need to load a URL, then submit a POST request to that URL that causes the website to redirect to a new URL. I would then like to POST some data to the new URL. This would be easy to do, except that the website in question does not allow the user to use browser navigation. (As in, you cannot just type in the URL of the new page or press the back button; you must arrive there by clicking the "Next" button on the website.) Therefore, when I try this:
import urllib, urllib2, cookielib
url = "http://www.example.com/"
jar = cookielib.CookieJar()
# An opener with a cookie processor keeps session cookies between requests.
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))
form_data_login = "POSTDATA"
form_data_try = "POSTDATA2"
resp = opener.open(url, form_data_login)  # POST the login form
resp2 = opener.open(resp.geturl(), form_data_try)  # POST to wherever we ended up
print resp2.read()
I get a "Do not use the back button on your browser" message from the website in resp2. Is there any way to POST data to the website resp gives me? Thanks in advance!
EDIT: I'll look into Mechanize, so thanks for that pointer. For now, though, is there a way to do it with just Python?
Have you taken a look at mechanize? I believe it has the functionality you need.
You're probably getting to that page by posting something via that Next button. You'll have to take a look at the POST parameters sent when pressing that button and add all of those parameters to your call.
The website could, though, be set up in such a way that it only accepts a particular POST parameter that ensures you have to go through the website itself (e.g. by hashing a timestamp in a certain way or something like that), but it's not very likely.
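A sketch of that suggestion: replay the Next button's parameters, properly encoded, through a cookie-aware opener like the one in the question. The parameter names below are placeholders for whatever the browser's network tab actually shows:
import urllib, urllib2, cookielib

jar = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))
url = "http://www.example.com/"

# Parameters observed in the browser when clicking "Next"; names are placeholders.
next_params = urllib.urlencode({
    'next': 'Next',
    'some_hidden_field': 'value-from-the-page',
})
resp = opener.open(url, next_params)  # replays the Next click, cookies intact
resp2 = opener.open(resp.geturl(), urllib.urlencode({'field': 'value'}))
print resp2.read()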

How do I get the URL of an HTTP redirect's target?

I am writing client-side Python unit tests to verify whether the HTTP 302 redirects on my Google App Engine site are pointing to the right pages. So far, I have been calling urllib2.urlopen(my_url).geturl(). However, I have encountered 2 issues:
the URL returned by geturl() does not appear to include URL query strings like ?k1=v1&k2=v2; how can I see these? (I need to check whether I correctly passed along the visitor's original URL query string to the redirect page.)
geturl() shows the final URL after any additional redirects. I just care about the first redirect (the one from my site); I am agnostic to anything after that. For example, let's assume my site is example.com. If a user requests http://www.example.com/somepath/?q=foo, I might want to redirect them to http://www.anothersite.com?q=foo. That other site might do another redirect to http://subdomain.anothersite.com?q=foo, which I can't control or predict. How can I make sure my redirect is correct?
Supply follow_redirects=False to the fetch function, then retrieve the location of the first redirect from the 'Location' header in the response, like so:
from google.appengine.api import urlfetch
response = urlfetch.fetch(your_url, follow_redirects=False)
location = response.headers['Location']
Use httplib (and look at the return status and Location header of the response) to avoid the "auto-follow redirects" that's impeding your testing. There's a good example here.
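A minimal sketch of the httplib approach, using the question's example hostnames:
import httplib

conn = httplib.HTTPConnection('www.example.com')
conn.request('GET', '/somepath/?q=foo')
resp = conn.getresponse()  # httplib never follows redirects on its own

print resp.status                 # expect 302 for the redirect
print resp.getheader('Location')  # the first redirect's target, query string included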
