I'm trying to make a simple POST request with the requests module, like this:
s=requests.Session()
s.post(link,data=payload)
To do it properly, the payload has to be an id that comes from the page itself, and a new id is generated on every access to the page.
So I need to get the data from the page and then proceed with the request.
The problem is that every time you access the page, a new id is generated.
So if we do this:
s=requests.Session()
payload=get_payload(s.get(link).text)
s.post(link,data=payload)
It will not work, because when you access the page with s.get the right id is generated, but when you then make the POST request a new id is generated, so you'll be using an old one.
Is there any way to get the data from the page right before the post request?
Something like:
s.post(link, data=get_data(s.get(link)))
When you make a POST (or GET) request, the page will generate another id and send it back to you. There is no way of sending data to the page while it is being generated, because you need to receive a response first in order to process the data on the page, and once you have received the response, the server will create a new id for you the next time you view the page.
See https://www3.ntu.edu.sg/home/ehchua/programming/webprogramming/images/HTTP.png for a simple example image of an HTTP request.
In general, there is no way to do this. The server's response is potentially affected by the data you send, so it can't be available before you have sent the data.

To persist this kind of information across requests, the server would usually set a cookie for you to send with each subsequent request, and using a requests.Session will handle that for you automatically. It is possible that you need to set the cookie yourself based on the first response, but cookies are a key/value pair, and you only appear to have the value.

To find the key, and more generally to find out whether this is what the server expects you to do, requires specific knowledge of the site you are working with. If this is a documented API, the documentation would be a good place to start. Otherwise you might need to look at what the website itself does: most browsers allow you to look at the cookies that are set for that site, and some (possibly via extensions) will let you look through the HTTP headers that are sent and received.
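To make the cookie handling concrete, here is a minimal sketch of the pattern, assuming the id really is embedded in the page body; the URL, the extraction logic, and the cookie name are placeholders, not details from the question:

import requests

link = "https://example.com/form"      # placeholder URL

def get_payload(html):
    # Hypothetical extraction: pull the generated id out of the page source,
    # however the real site actually exposes it.
    value = html.split('name="id" value="')[1].split('"')[0]
    return {"id": value}

with requests.Session() as s:
    first = s.get(link)                # any Set-Cookie headers are stored on the session
    payload = get_payload(first.text)  # id taken from this same response
    # The session re-sends the stored cookies automatically on the POST.
    # If the site instead expects the id in a cookie you set yourself, the
    # cookie name is site-specific ("token" below is only an illustration):
    # s.cookies.set("token", payload["id"])
    r = s.post(link, data=payload)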
I'm trying to automate the data-retrieval procedure for a specific form at my company with the Python requests library. Here's my issue:
There is a form in which a user can choose different filters and send a POST request to the server in order to retrieve an Excel-like table as a result. I want to know how I can issue a sequence of POST and GET requests to retrieve the data provided by the server.
More info: I have checked the network behavior of the page in Chrome. The sequence is:
1. a POST request containing form data
2. another POST request to a different URL, without form data
3. two GET requests for PNGs
4. a POST request the same as #1
5. finally, a GET request, after which the page is loaded completely
The documentation of the requests library is pretty self-explanatory in this regard. The following code sample can help you with your task.
import requests
r = requests.post('https://httpbin.org/post', data={'key': 'value'})
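Since the form involves a whole sequence of requests, a requests.Session is usually the right tool, because it keeps cookies between calls. A rough sketch of replaying the captured sequence; all URLs and field names below are placeholders that you would take from the Chrome network tab:

import requests

BASE = "https://example.company.internal"        # placeholder base URL

with requests.Session() as s:
    # 1. POST the form data captured from the network tab
    r1 = s.post(BASE + "/form", data={"filter1": "value1", "filter2": "value2"})
    # 2. Replay any follow-up requests the page makes, in the same order
    r2 = s.post(BASE + "/prepare")               # placeholder second POST, no form data
    # 3. Finally, fetch the resource that actually contains the table
    result = s.get(BASE + "/result")
    print(result.status_code, len(result.content))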
I'm trying to send data to Wikipedia (basically trying to input the word 'python' into the Wikipedia search bar and print the content).
Here's what I tried:
import requests
payload = {'family': 'wikipedia',
           'language': 'en',
           'search': 'python',
           'go': 'Go'}

with requests.Session() as s:
    url = 'https://www.wikipedia.org/'
    r = s.get(url)
    r = s.post(url, data=payload)
    print(r.content)
But it doesn't seem to work
This is the website I'm trying to send data to: https://www.wikipedia.org/
If all you want to do is get the content from submitting 'python' into the Wikipedia search bar, you don't need to create a POST request. A simple GET request will work fine:
import requests

search_term = "python"
response = requests.get(f'https://en.wikipedia.org/wiki/{search_term}')
print(response.content)
So to answer your remaining questions:
I'll be using POST requests for logins etc., so I want to learn via POST requests
GET, POST, PUT, DELETE, and the other HTTP request methods are server-side implementations. They don't magically exist for everything. So if Wikipedia decides not to offer a POST request for searching in the search bar, then that's too bad: you can't use POST to make a search. You will have to search some other way, whatever it is that they support (which, from my tests, appears to be via a GET request).
So even though they might implement POST for logins (as they should), not everything necessarily has an associated POST request.
Can't I use POST to automate things like logging in and pressing buttons, like what Selenium does?
Sort of. You can use HTTP requests to make the same HTTP calls that a button would when clicked. It's not exactly the same as clicking on a button, though, since clicking on a button can still do many other things behind the scenes in your web browser. And not every button's HTTP call is necessarily a POST request.
But that aside, even if you search in Wikipedia using Selenium, it would still end up being a GET request, because Wikipedia changed the way that searches work (at least based on what you have posted). They made searches require a GET request, so you have to make a GET request.
TLDR: It may have been possible with POST in the past, but it isn't anymore because that was a decision that Wikipedia made.
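For the login case specifically, a POST endpoint usually does exist. A rough sketch with requests, assuming a site that accepts a simple form login; the URL and field names are placeholders and have to match the real login form:

import requests

LOGIN_URL = "https://example.com/login"          # placeholder; check the form's action URL

with requests.Session() as s:
    # The keys must match the name= attributes of the login form's fields.
    r = s.post(LOGIN_URL, data={"username": "me", "password": "secret"})
    r.raise_for_status()
    # The session now carries the login cookies for any follow-up requests.
    profile = s.get("https://example.com/profile")   # placeholder follow-up request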
I want to scrape the second page of this user's reviews.
However, the next button executes an XHR request, and while I can see it using the Chrome developer tools, I cannot replicate it.
It's not such an easy task. First of all, you should install this extension. It helps you test your own requests based on captured data, i.e. catch and simulate requests with the captured data.
As far as I can see, they send a token in this XHR request, so you need to get it from the HTML page body (it is stored in the source code, in the JS variable "taSecureToken").
Next you need to do four steps:
1. Catch the POST request with the plugin.
2. Change the token to the one you saved before.
3. Set the limit and offset variables in the POST request data.
4. Generate the request with the resulting body.
Note: for this request the server returns JSON data (not the HTML of the next page) containing info about the objects loaded on the next page.
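Putting those steps together, here is a rough sketch with the requests library; apart from the taSecureToken variable name, every URL and field name is a placeholder you would replace with what the captured request actually contains:

import re
import requests

PAGE_URL = "https://example.com/user-reviews"    # placeholder page URL
XHR_URL = "https://example.com/ajax/reviews"     # placeholder; copy from the captured request

with requests.Session() as s:
    html = s.get(PAGE_URL).text
    # The token is stored in the page source in the JS variable "taSecureToken".
    token = re.search(r'taSecureToken["\']?\s*[:=]\s*["\']([^"\']+)', html).group(1)

    # Replay the captured POST, substituting the fresh token and the paging values.
    resp = s.post(XHR_URL, data={
        "token": token,    # field name is a placeholder; use the one from the captured body
        "limit": 10,
        "offset": 10,      # second page
    })
    data = resp.json()     # the server answers with JSON, not HTML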
I'm trying to read in info that is constantly changing from a website.
For example, say I wanted to read in the artist name that is playing on an online radio site.
I can grab the current artist's name, but when the song changes the HTML updates itself, and I've already opened the file via:
f = urllib.urlopen("SITE")
So I can't see the updated artist name for the new song.
Can I keep closing and opening the URL in a while(1) loop to get the updated HTML code or is there a better way to do this? Thanks!
You'll have to periodically re-download the website. Don't do it constantly because that will be too hard on the server.
This is because HTTP, by nature, is not a streaming protocol. Once you connect to the server, it expects you to throw an HTTP request at it, then it will throw an HTTP response back at you containing the page. If your initial request is keep-alive (the default as of HTTP/1.1), you can throw the same request again over the same connection and get the page up to date.
What would I recommend? Depending on your needs, fetch the page every n seconds and extract the data you need. If the site provides an API, you can possibly capitalize on that. Also, if it's your own site, you might be able to implement Comet-style Ajax over HTTP and get a true stream.
Also note that if it's someone else's page, the site may use Ajax via JavaScript to keep itself up to date; this means there are other requests causing the update, and you may need to dissect the website to figure out which requests you need to make to get the data.
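As a minimal sketch of the fetch-every-n-seconds approach, using requests (urllib works the same way); the URL and the parsing pattern are placeholders:

import re
import time
import requests

URL = "http://example-radio.example/now-playing"   # placeholder URL
POLL_INTERVAL = 30                                  # seconds; be gentle on the server

def extract_artist(html):
    # Hypothetical parsing step: adapt the pattern to the real page markup.
    match = re.search(r'class="artist">([^<]+)<', html)
    return match.group(1) if match else None

last_artist = None
while True:
    html = requests.get(URL).text        # re-download the page on every poll
    artist = extract_artist(html)
    if artist and artist != last_artist: # only report when the value changes
        print("Now playing:", artist)
        last_artist = artist
    time.sleep(POLL_INTERVAL)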
If you use urllib2 you can read the response headers when you make the request. If you send a conditional request (for example with an If-Modified-Since header) and the server sends back a 304 Not Modified status, then the content hasn't changed.
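A minimal sketch of that conditional-request idea, shown here with the requests library for brevity (the same headers work with urllib2); the URL is a placeholder, and not every server supports Last-Modified:

import time
import requests

URL = "http://example-radio.example/now-playing"   # placeholder URL
last_modified = None

while True:
    headers = {"If-Modified-Since": last_modified} if last_modified else {}
    response = requests.get(URL, headers=headers)
    if response.status_code == 304:
        print("content unchanged")
    else:
        last_modified = response.headers.get("Last-Modified", last_modified)
        print("new content:", len(response.text), "characters")
    time.sleep(30)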
Yes, this is the correct approach. To get changes from the web, you have to send a new query each time. Live AJAX sites do exactly the same internally.
Some sites provide an additional API, including long polling. Look for documentation on the site, or ask their developers whether one exists.
I'm looking to be able to query a site for warranty information on the machine that this script would be running on. It should be able to fill out a form if needed (as in the case of, say, HP's service site) and would then retrieve the resulting web page.
I already have the bits in place to parse the resulting HTML that is reported back; I'm just having trouble with what needs to be done in order to POST the data that needs to go in the fields and then retrieve the resulting page.
If you absolutely need to use urllib2, the basic gist is this:
import urllib
import urllib2
url = 'http://whatever.foo/form.html'
form_data = {'field1': 'value1', 'field2': 'value2'}
params = urllib.urlencode(form_data)
response = urllib2.urlopen(url, params)
data = response.read()
If you send along POST data (the 2nd argument to urlopen()), the request method is automatically set to POST.
I suggest you do yourself a favor and use mechanize, a full-blown urllib2 replacement that acts exactly like a real browser. A lot of sites use hidden fields, cookies, and redirects, none of which urllib2 handles for you by default, whereas mechanize does.
Check out Emulating a browser in Python with mechanize for a good example.
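As a minimal mechanize sketch, assuming the target form is the first one on the page; the URL and field names are the same placeholders used in the example above:

import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)            # skip robots.txt checks; use responsibly
br.open("http://whatever.foo/form.html")

br.select_form(nr=0)                   # pick the first form on the page
br["field1"] = "value1"                # keys must match the form's field names
br["field2"] = "value2"
response = br.submit()                 # POSTs the form and follows any redirect

print(response.read())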
Using urllib and urllib2 together,
data = urllib.urlencode([('field1',val1), ('field2',val2)]) # list of two-element tuples
content = urllib2.urlopen('post-url', data).read()
content will give you the page source.
I’ve only done a little bit of this, but:
You’ve got the HTML of the form page. Extract the name attribute for each form field you need to fill in.
Create a dictionary mapping the name of each form field to the value you want to submit.
Use urllib.urlencode to turn the dictionary into the body of your post request.
Include this encoded data as the second argument to urllib2.Request(), after the URL that the form should be submitted to.
The server will either return a resulting web page, or return a redirect to a resulting web page. If it does the latter, you’ll need to issue a GET request to the URL specified in the redirect response.
I hope that makes some sort of sense?
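For reference, a rough sketch of those steps with urllib2 (Python 2); the URL and field names are placeholders:

import urllib
import urllib2

form_url = 'http://whatever.foo/submit'                   # the form's action URL (placeholder)
form_data = {'field1': 'value1', 'field2': 'value2'}      # names taken from the form's HTML

request = urllib2.Request(form_url, urllib.urlencode(form_data))  # supplying data makes it a POST
response = urllib2.urlopen(request)                       # sends the POST
print(response.read())                                    # the resulting page (or the redirect target)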