I want to open a page, find a number on it, multiply that by another (random) number, and then submit the result back to the page. So what I'm doing is saving the page as HTML, finding the two numbers, multiplying them, and then sending the result as a POST. But
post = urllib.urlencode({'answer': goal, 'submit': 'Submit+Answer'})
req2 = urllib2.Request("example", None, headers)
response = urllib2.urlopen(req2, post)  # this doesn't work: it opens the page a second time
This makes it connect a second time, so the number I send back is wrong, because the server generates a new random number on each request. How can I send a POST request to a page I already have open without reopening it?
You might want to use something like mechanize, which enables stateful web browsing in Python. You could use it to load a URL, read a value from a page, perform the multiplication, place that number into a form on the page, and then submit it.
Does that sound like what you're trying to do? This page gives some information on how to fill in forms using mechanize.
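For illustration, here is a minimal sketch of that flow, assuming the two numbers appear as plain digits in the page text and that the answer form is the first form on the page (the URL and the 'answer' field name are placeholders):

import re
import mechanize

br = mechanize.Browser()
html = br.open("http://example.com/challenge").read()  # assumed URL

# Assumes the two numbers appear as plain digits in the page text
a, b = [int(n) for n in re.findall(r"\d+", html)[:2]]

br.select_form(nr=0)       # assumes the answer form is the first form on the page
br["answer"] = str(a * b)  # 'answer' is an assumed field name
response = br.submit()
print response.read()

Because mechanize keeps the same session (and cookies) between the open and the submit, the server sees one continuous visit rather than two separate page loads.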
I don't believe urllib supports keeping the connection open, as described here.
Looks like you'll have to send a reference to the original calculation back with your POST, or send the original data back along with the answer, so the server has some way of matching question with answer.
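If the server supports that, it could look something like this (the extra field names here are pure guesses; the real ones depend on what the site accepts):

import urllib
import urllib2

a, b = 6, 7  # the numbers previously read from the page

# Echo the original numbers back with the answer so the server can
# match question and answer; 'original_a'/'original_b' are assumed names
post = urllib.urlencode({
    'answer': a * b,
    'original_a': a,
    'original_b': b,
    'submit': 'Submit Answer',
})
response = urllib2.urlopen("http://example.com/challenge", post)  # assumed URL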
Related
Use case: I need to check whether the JSON data at a URL has been updated, by checking its created_date field, which lies in the first few lines. The page's JSON data is huge, and I don't want to retrieve the entire page just to check those first few lines.
Currently, for both
x = feedparser.parse(url)
y = requests.get(url).text
# y.split("\n") etc.
the entire response is retrieved and then parsed.
I want something like next(url), or to read only the first 10 lines (chunks), without requesting the entire page's data; i.e. just peek at the created_date field and exit.
What can I use to solve this? Thanks for your knowledge, and apologies for the noob question.
Example URL: https://www.w3schools.com/xml/plant_catalog.xml
I want to stop reading the URL's data if the first PLANT object's LIGHT tag hasn't changed from 'Mostly Shady', without needing to read or fetch the data below it.
The original poster stated that the solution below worked:
Instead of a GET request, you can try a HEAD request:
"The GET method requests a representation of the specified resource. Requests using GET should only retrieve data. The HEAD method asks for a response identical to a GET request, but without the response body."
This way you don't need to request the entire JSON document, which speeds things up on your side and is also friendlier to the hosting server!
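A minimal sketch of that idea, assuming the server exposes a freshness header such as Last-Modified (or an ETag) for this resource:

import requests

url = "https://www.w3schools.com/xml/plant_catalog.xml"
previously_seen = None  # the Last-Modified value stored from the last check

# HEAD returns only the headers, never the body
head = requests.head(url)
last_modified = head.headers.get("Last-Modified")

if last_modified != previously_seen:
    body = requests.get(url).text  # only now fetch the full document
    previously_seen = last_modified

If the server doesn't send such headers, this approach won't help, and you'd have to fall back to fetching at least part of the body.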
I'm trying to make a simple POST request with the requests module, like this:
s = requests.Session()
s.post(link, data=payload)
To do it properly, the payload must include an id taken from the page itself, and a new id is generated on every access to the page.
So I need to get the data from the page and then proceed with the request.
The problem is that each time you access the page, a new id is generated.
So if we do this:
s = requests.Session()
payload = get_payload(s.get(link).text)
s.post(link, data=payload)
It will not work: when you access the page with s.get, the right id is generated, but by the time you make the POST request a new id has been generated, so you'd be submitting a stale one.
Is there any way to get the data from the page right before the post request?
Something like:
s.post(link, data=get_data(s.get(link)))
When you make a POST (or GET) request, the page generates another id and sends it back to you. There is no way to send data to the page while it is being generated, because you need to receive a response first in order to process the data on the page; and once you have received that response, the server will create a new id for you the next time you view the page.
See https://www3.ntu.edu.sg/home/ehchua/programming/webprogramming/images/HTTP.png for a simple diagram of an HTTP request.
In general, there is no way to do this. The server's response is potentially affected by the data you send, so it can't be available before you have sent the data.
To persist this kind of information across requests, the server would usually set a cookie for you to send with each subsequent request; using a requests.Session will handle that for you automatically. It is possible that you need to set the cookie yourself based on the first response, but cookies are a key/value pair, and you only appear to have the value.
Finding the key, and more generally finding out whether this is what the server expects you to do, requires specific knowledge of the site you are working with. If this is a documented API, the documentation is a good place to start. Otherwise, look at what the website itself does: most browsers let you inspect the cookies set for a site, and some (possibly via extensions) will let you look through the HTTP headers that are sent and received.
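As a sketch of that cookie-based case, assuming the id you extract really is a cookie value the server expects back ('link' and get_payload are from the question; the cookie name 'token' is a placeholder you would have to discover for the real site):

import requests

s = requests.Session()

# The Session stores any cookies the server sets on this GET automatically
first = s.get(link)
payload = get_payload(first.text)

# Only needed if the server expects the id as a cookie rather than form data;
# 'token' is a placeholder name
s.cookies.set('token', payload['id'])

resp = s.post(link, data=payload)  # reuses the cookies from the GET above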
I'm trying to get some data from a webpage, but I've run into a problem: whenever I try to go to the next page (i.e. page 2) to keep retrieving data, I keep receiving the data from page 1. Apparently something goes wrong when switching to the next page.
The thing is, I haven't had problems with urls like this:
'http://www.webpage.com/index.php?page=' + str(pageno)
I can just use a while loop and jump to page 2 by adding 1 to pageno.
My problem comes in when I try to open a URL in this format:
'http://www.webpage.com/search/?show_all=1#sort_order=ASC&page=' + str(pageno)
As
urllib2.urlopen('http://www.webpage.com/search/?show_all=1#sort_order=ASC&page=4').read()
will retrieve the source code from http://www.webpage.com/search/?show_all=1
As far as I know, there is no other way to retrieve the other pages without using the hash.
I guess urllib2 is just ignoring the hash, since it is normally used to specify a starting point for the browser within the page.
The fragment of the URL after the hash (#) symbol is for client-side handling and isn't actually sent to the web server. My guess is that some JavaScript on the page requests the correct data from the server using AJAX, and you need to figure out which URL it uses for that.
If you use Chrome, you can watch the Network tab of the developer tools and see which URLs are requested when you click the link to go to page two in your browser.
That's because the hash is not part of the URL that is sent to the server; it's a fragment identifier, used to identify elements inside the page. Some websites misuse the hash fragment as a JavaScript hook for identifying pages, though. You'll either need to execute the JavaScript on the page, or reverse-engineer the JavaScript and emulate the true search request that is being made, presumably through AJAX. Firebug's Net tab will be really useful for this.
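Once you've found the real request in the network tools, you can call it directly. A hypothetical sketch, where the AJAX endpoint and its parameters are stand-ins for whatever the site actually uses:

import urllib2

pageno = 4

# Hypothetical endpoint discovered via the browser's network tools; note
# that the page number travels as a real query parameter, not a fragment
url = ('http://www.webpage.com/search/ajax?show_all=1'
       '&sort_order=ASC&page=%d' % pageno)

req = urllib2.Request(url, headers={'X-Requested-With': 'XMLHttpRequest'})
html = urllib2.urlopen(req).read()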
I'm trying to read in info that is constantly changing from a website.
For example, say I wanted to read in the artist name that is playing on an online radio site.
I can grab the current artist's name, but when the song changes, the HTML updates itself, and I've already opened the page via:
f = urllib.urlopen("SITE")
So I can't see the updated artist name for the new song.
Can I keep closing and reopening the URL in a while(1) loop to get the updated HTML, or is there a better way to do this? Thanks!
You'll have to periodically re-download the website. Don't do it constantly because that will be too hard on the server.
This is because HTTP, by its nature, is not a streaming protocol. Once you connect to the server, it expects you to send an HTTP request, and it sends back an HTTP response containing the page. If your initial request uses keep-alive (the default as of HTTP/1.1), you can send the same request again over the open connection and get the page's current state.
What I'd recommend: depending on your needs, fetch the page every n seconds and extract the data you need. If the site provides an API, you can take advantage of that. And if it's your own site, you could implement Comet-style Ajax over HTTP and get a true stream.
Also note that if it's someone else's page, the site may use Ajax via JavaScript to keep itself up to date; in that case other requests cause the update, and you may need to dissect the website to figure out which requests you need to make to get the data.
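A minimal polling sketch along those lines; the URL and the parse_artist helper are placeholders for the real site and your own parsing code:

import time
import urllib2

POLL_SECONDS = 30  # keep this reasonably high to be polite to the server

last_artist = None
while True:
    html = urllib2.urlopen("http://example-radio.com/nowplaying").read()
    artist = parse_artist(html)  # hypothetical helper that extracts the name
    if artist != last_artist:
        print "Now playing:", artist
        last_artist = artist
    time.sleep(POLL_SECONDS)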
If you use urllib2, you can check the response status when you make the request. If the server sends back "304 Not Modified", the content hasn't changed.
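A sketch of such a conditional GET, assuming the server honours If-Modified-Since (urllib2 surfaces the 304 as an HTTPError; the URL is a placeholder):

import urllib2

# Value saved from a previous response's Last-Modified header
last_modified = "Thu, 01 Jan 2015 00:00:00 GMT"

req = urllib2.Request("http://example-radio.com/nowplaying")  # assumed URL
req.add_header("If-Modified-Since", last_modified)

try:
    resp = urllib2.urlopen(req)
    last_modified = resp.headers.get("Last-Modified")
    html = resp.read()  # content changed; re-parse it
except urllib2.HTTPError as e:
    if e.code != 304:   # 304 means nothing changed; anything else is a real error
        raise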
Yes, this is the correct approach. To see changes on the web you have to send a new request each time; live AJAX sites do exactly the same internally.
Some sites provide an additional API, including long polling. Look for documentation on the site, or ask its developers whether one exists.
I am using Python 2.7.1 to access an online website. I need to load a URL, then submit a POST request to that URL that causes the website to redirect to a new URL. I would then like to POST some data to the new URL. This would be easy to do, except that the website in question does not allow the user to use browser navigation. (As in, you cannot just type in the URL of the new page or press the back button, you must arrive there by clicking the "Next" button on the website). Therefore, when I try this:
import urllib, urllib2, cookielib

url = "http://www.example.com/"
jar = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))

form_data_login = "POSTDATA"   # urlencoded login fields
form_data_try = "POSTDATA2"    # urlencoded fields for the second step

resp = opener.open(url, form_data_login)           # POST the login form
resp2 = opener.open(resp.geturl(), form_data_try)  # POST to wherever we were redirected
print resp2.read()
I get a "Do not use the back button on your browser" message from the website in resp2. Is there any way to POST data to the website resp gives me? Thanks in advance!
EDIT: I'll look into mechanize, so thanks for that pointer. For now, though, is there a way to do it with just the standard library?
Have you taken a look at mechanize? I believe it has the functionality you need.
You're probably getting to that page by POSTing something via that Next button. Take a look at the POST parameters sent when you press that button, and add all of those parameters to your call.
The website could, though, be set up so that it only accepts a particular POST parameter that forces you to go through the website itself (e.g. by hashing a timestamp in a certain way), but that's not very likely.
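For illustration, replicating the button's parameters might look like this; the field names below are placeholders you would read out of the form's HTML or the browser's network tools:

import urllib

# Include everything the Next button sends, not just your own fields;
# 'csrf_token' and 'step' are placeholder names for the site's hidden inputs
form_data_try = urllib.urlencode({
    'answer': 'my answer',
    'csrf_token': token_from_page,  # hypothetical value scraped from resp.read()
    'step': '2',
})
resp2 = opener.open(resp.geturl(), form_data_try)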