HTTP DELETE with HTML - python

In webapp2, I've not been able to make an HTTP DELETE request work correctly. As a workaround, I'm using a different URI with GET, but I'd prefer a more RESTful approach.
With this code, the server log shows the DELETE request as a GET request. What am I missing here?
class TeamPages(webapp2.RequestHandler):
    def get(self, team_name):
        ...
    def post(self, team_name):
        ...
    def delete(self, team_name):
        key_name = team_name.upper()
        Team.delete(key_name)
        self.redirect('/teams')
A GET request to /teams/{{ team_name }} responds with a page that includes the following html, but when I submit, it requests the GET method instead of the DELETE method.
<form action="/teams{{ team.team_name }}" method="delete">
<button type="submit">Delete</button>
</form>
Update
Additional info...
I'm developing under Google App Engine and I'm using Chrome on a Mac. Here is the request header that shows GET instead of DELETE...
GET /teams/hornets? HTTP/1.1
Host: localhost:9080
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Accept-Encoding: gzip,deflate,sdch
Accept-Language: en-US,en;q=0.8
Cookie: dev_appserver_login="test@example.com:True:185804764220139124118"
Referer: http://localhost:9080/teams/hornets
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.114 Safari/537.36
And here is the response header...
HTTP/1.1 405 Method Not Allowed
allow: DELETE
Cache-Control: no-cache
Content-Length: 187
content-type: text/html; charset=UTF-8
Date: Tue, 10 Jun 2014 21:24:36 GMT
Expires: Fri, 01 Jan 1990 00:00:00 GMT
Server: Development/2.0

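Short answer: The issue is not with the web framework or the browser. This is an HTML limitation. An HTML form can only be submitted with GET or POST; any other method value, such as delete, is treated as invalid and the browser falls back to GET, which is why the server log shows a GET request. I've adjusted my title to reflect the subject matter more accurately.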
More details:
Based on the HTML clue from @Greg, I searched some more. In the comments of a related post, @Guandalino added some great links:
*Bug report 16071
*Stack Exchange
This helps to add some color and arguments for and against the addition of PUT and DELETE to the HTML specification.
To me (and I'm a newbie), this incompatibility with HTTP does not seem to be congruent with REST principles. I'd think that the mere recognition that so many web frameworks have created workarounds is justification enough for making this standard in a future HTML version.
Thanks to everyone for the feedback!
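The framework workarounds mentioned above usually tunnel the intended method through a POST. Below is a minimal sketch of that idea as WSGI middleware. It assumes a hidden form field named `_method` (a common convention, but the field name and the `MethodOverride` class are illustrative and not part of webapp2).

```python
import io
from urllib.parse import parse_qs

# Only these methods may be tunnelled through a POST.
ALLOWED_OVERRIDES = {"PUT", "PATCH", "DELETE"}


class MethodOverride:
    """WSGI middleware: treat a POST carrying _method=delete as a DELETE."""

    def __init__(self, app, field="_method"):
        self.app = app
        self.field = field

    def __call__(self, environ, start_response):
        content_type = environ.get("CONTENT_TYPE", "")
        if (environ.get("REQUEST_METHOD") == "POST"
                and content_type.startswith("application/x-www-form-urlencoded")):
            length = int(environ.get("CONTENT_LENGTH") or 0)
            body = environ["wsgi.input"].read(length)
            # Put the body back so the wrapped app can still read it.
            environ["wsgi.input"] = io.BytesIO(body)
            fields = parse_qs(body.decode("utf-8", "replace"))
            override = fields.get(self.field, [""])[0].upper()
            if override in ALLOWED_OVERRIDES:
                environ["REQUEST_METHOD"] = override
        return self.app(environ, start_response)
```

With something like this wrapped around the application, the form would use method="post" and include a hidden input named _method with the value delete, and the request should then reach the handler's delete method. Sending the DELETE from JavaScript is the other common route.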

Related

Is this HTTP request valid?

I've made a Python server with swagger-codegen. It has one endpoint that receives a file as multipart/form-data.
I also created a client with go-swagger for testing.
I created a file to upload: $ echo "123file content321" > data
and used the client to upload the file to the server. The resulting HTTP request looks like this:
POST /api/order/1/attachment HTTP/1.1
Host: 127.0.0.1:8080
User-Agent: Go-http-client/1.1
Transfer-Encoding: chunked
Accept: application/json
Content-Type: multipart/form-data; boundary=5f3f0ad86e6345b77c869cbe0a5e608f038354cf9ceab74ec2533d7555c0
Accept-Encoding: gzip
ff
--5f3f0ad86e6345b77c869cbe0a5e608f038354cf9ceab74ec2533d7555c0
Content-Disposition: form-data; name="file"; filename="data"
Content-Type: application/octet-stream
123file content321
--5f3f0ad86e6345b77c869cbe0a5e608f038354cf9ceab74ec2533d7555c0--
but the server doesn't accept it and responds:
HTTP/1.0 400 BAD REQUEST
Connection: close
Content-Length: 120
Content-Type: application/problem+json
Date: Fri, 19 May 2017 15:15:44 GMT
Server: Werkzeug/0.12.1 Python/3.6.1
{
"type": "about:blank",
"title": "Bad Request",
"detail": "Missing formdata parameter 'file'",
"status": 400
}
So the request isn't parsed properly. But when I use the swagger-ui, the file is uploaded correctly. Is the problem in the client's request or in the server?
EDIT: I think the Content-Length header is missing, or the ff at the beginning of the body should not be there.
EDIT2: the swagger-ui request:
POST /api/order/1/attachment HTTP/1.1
Host: localhost:8080
Connection: keep-alive
Content-Length: 211
Origin: http://localhost:8080
User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36
Content-Type: multipart/form-data; boundary=----WebKitFormBoundarypzmNwrDR7zzpZ7SJ
Accept: application/json
X-Requested-With: XMLHttpRequest
DNT: 1
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.8
------WebKitFormBoundarypzmNwrDR7zzpZ7SJ
Content-Disposition: form-data; name="file"; filename="data"
Content-Type: application/octet-stream
123file content321
------WebKitFormBoundarypzmNwrDR7zzpZ7SJ--
The first request you send is an HTTP/1.1 request using chunked transfer encoding. This means the body consists of multiple chunks, where each chunk is prefixed by its size in hex followed by \r\n, then the data, and again \r\n. I'm not sure if the ff at the beginning of the body you show really specifies the size of the following data (i.e. 255 bytes). In any case, the last chunk with a size of 0 is missing, so this request is incomplete. But maybe you just omitted that part from the question.
Apart from that, the server is responding with version HTTP/1.0. Chunked transfer encoding is only defined for HTTP/1.1, which means that this request will not be understood by an HTTP/1.0 server. And not all HTTP/1.1 servers understand chunked transfer encoding in the request, even though they should.
The second request you show (created by Chrome) does not use chunked transfer encoding but instead specifies the length of the body using Content-Length in the HTTP header. That's the way you should go, since this works with all web servers, including HTTP/1.0 servers.
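To make the chunk framing concrete, here is a small sketch that encodes and decodes a chunked body by hand. The function names are illustrative, and the decoder ignores trailers; it is meant for inspecting a captured body, not as a full HTTP parser.

```python
def encode_chunked(body: bytes, chunk_size: int = 255) -> bytes:
    """Frame body as HTTP/1.1 chunks: <hex size>CRLF<data>CRLF ... 0 CRLF CRLF."""
    out = bytearray()
    for start in range(0, len(body), chunk_size):
        chunk = body[start:start + chunk_size]
        out += f"{len(chunk):x}".encode("ascii") + b"\r\n" + chunk + b"\r\n"
    # The zero-size chunk marks the end of the body; without it the
    # request is incomplete.
    out += b"0\r\n\r\n"
    return bytes(out)


def decode_chunked(data: bytes) -> bytes:
    """Reassemble a chunked body; raise ValueError if the final chunk is missing."""
    body = bytearray()
    pos = 0
    while True:
        line_end = data.find(b"\r\n", pos)
        if line_end == -1:
            raise ValueError("incomplete body: terminating 0-size chunk missing")
        # The size line may carry ";extension" parts, which are skipped here.
        size_field = data[pos:line_end].split(b";")[0].decode("ascii").strip()
        size = int(size_field, 16)
        pos = line_end + 2
        if size == 0:
            return bytes(body)
        body += data[pos:pos + size]
        pos += size + 2  # skip the chunk data and its trailing CRLF
```

Running decode_chunked over the raw body of the Go client's request would show whether ff really matches the amount of data that follows and whether the terminating chunk was sent.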
Based on the two requests you have posted, I would first set the Content-Length on your Go request and test that. I've run into issues before with the ArangoDB HTTP API not accepting requests without a correct Content-Length value.
If that succeeds, you're done.
Otherwise, that ff in your request is the next thing I'd look at getting rid of. But I'd focus on the Content-Length header first.

Python library requests open the wrong page

I'm trying to open an HTML page with the Python requests library, but my code opens the site's root page instead, and I don't understand how to solve the problem.
import requests
scraping = requests.request("POST", url = "http://www.pollnet.it/WeeklyReport_it.aspx?ID=69")
print scraping.content
Thank you for any suggestions!
You can see easily that the server is redirecting to the main page.
➜ ~ http -v http://www.pollnet.it/WeeklyReport_it.aspx\?ID\=69
GET /WeeklyReport_it.aspx?ID=69 HTTP/1.1
Accept: */*
Accept-Encoding: gzip, deflate
Connection: keep-alive
Host: www.pollnet.it
User-Agent: HTTPie/0.9.3
HTTP/1.1 302 Found
Content-Length: 131
Content-Type: text/html; charset=utf-8
Date: Sun, 07 Feb 2016 11:24:52 GMT
Location: /default.asp
Server: Microsoft-IIS/7.5
X-Powered-By: ASP.NET
<html><head><title>Object moved</title></head><body>
<h2>Object moved to here.</h2>
</body></html>
On further checking, it can be seen that the web server uses session cookies.
➜ ~ http -v http://www.pollnet.it/default_it.asp
HTTP/1.1 200 OK
Cache-Control: private
Content-Encoding: gzip
Content-Length: 9471
Content-Type: text/html; Charset=utf-8
Date: Sun, 07 Feb 2016 13:21:41 GMT
Server: Microsoft-IIS/7.5
Set-Cookie: ASPSESSIONIDSQTSTAST=PBHDLEIDFCNMPKIGANFDNMLK; path=/
Vary: Accept-Encoding
X-Powered-By: ASP.NET
It means that every time the main page is visited, the server sends a "Set-Cookie" header, which instructs the browser to set certain cookies. Then every time the browser asks for a Weekly Report, the server validates the session cookie.
Normally, the requests package does not save cookies between requests, but for scraping we can use a Session object, which keeps the cookies across page requests.
import requests
# create a Session object
s= requests.Session()
# first visit the main page
s.get("http://www.pollnet.it/default_it.asp")
# then we can visit the weekly report pages
r = s.get("http://www.pollnet.it/WeeklyReport_it.aspx?ID=69")
print(r.text)
# another page
r = s.get("http://www.pollnet.it/WeeklyReport_it.aspx?ID=89")
print(r.text)
But here is some advice: the web server may only allow a fixed number of pages (maybe 10, maybe 15) to be opened with a given Session object. Either validate r.text immediately each time (for example, check that the response body is not suspiciously short), or create a new Session object every 5 or 6 pages.
More info on Session objects here.
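The advice about validating each page and renewing the session can be sketched as a small helper. The function name, the 5-page limit and the 500-character threshold are assumptions for illustration; make_session is any zero-argument callable that returns an object with a requests-style get(url) method, for example requests.Session.

```python
def fetch_pages(urls, make_session, start_url, pages_per_session=5, min_length=500):
    """Fetch urls, priming a fresh session at start_url every few pages.

    Returns a dict mapping each URL to its text, or to None when the body
    is shorter than min_length.
    """
    results = {}
    session = None
    for index, url in enumerate(urls):
        if index % pages_per_session == 0:
            session = make_session()
            session.get(start_url)  # visit the main page to get the session cookie
        text = session.get(url).text
        # A very short body may mean the server redirected us to the main page.
        results[url] = text if len(text) >= min_length else None
    return results
```

With requests installed, a call could look like fetch_pages(report_urls, requests.Session, "http://www.pollnet.it/default_it.asp"), where report_urls is the list of weekly report URLs.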

download files with python (REST URL)

I am trying to write a script that will download a bunch files from a website that has REST URLs.
Here is the GET request:
GET /test/download/id/5774/format/testTitle HTTP/1.1
Host: testServer.com
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Cookie: __utma=11863783.1459862770.1379789243.1379789243.1379789243.1; __utmb=11863783.28.9.1379790533699; __utmc=11863783; __utmz=11863783.1379789243.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); PHPSESSID=fa844952890e9091d968c541caa6965f; loginremember=Qraoz3j%2BoWXxwqcJkgW9%2BfGFR0SDFLi1FLS7YVAfvbcd9GhX8zjw4u6plYFTACsRruZM4n%2FpX50%2BsjXW5v8vykKw2XNL0Vqo5syZKSDFSSX9mTFNd5KLpJV%2FFlYkCY4oi7Qyw%3D%3D; ma-refresh-storage=1; ma-pref=KLSFKJSJSD897897; skipPostLogin=0; pp-sid=hlh6hs1pnvuh571arl59t5pao0; __utmv=11863783.|1=MemberType=Yearly=1; nats_cookie=http%253A%252F%252Fwww.testServer.com%252F; nats=NDc1NzAzOjQ5MzoyNA%2C74%2C0%2C0%2C0; nats_sess=fe3f77e6e326eb8d18ef0111ab6f322e; __utma=163815075.1459708390.1379790355.1379790355.1379790355.1; __utmb=163815075.1.9.1379790485255; __utmc=163815075; __utmz=163815075.1379790355.1.1.utmcsr=ppp.contentdef.com|utmccn=(referral)|utmcmd=referral|utmcct=/postlogin; unlockedNetworks=%5B%22rk%22%2C%22bz%22%2C%22wkd%22%5D
Connection: close
If the request is good, it will return a 302 response such as this one:
HTTP/1.1 302 Found
Date: Sat, 21 Sep 2013 19:32:37 GMT
Server: Apache
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
location: http://downloads.test.stuff.com/5774/stuff/picture.jpg?wed=20130921152237&wer=20130922153237&hash=0f20f4a6d0c9f1720b0b6
Vary: User-Agent,Accept-Encoding
Content-Length: 0
Connection: close
Content-Type: text/html; charset=UTF-8
What I need the script to do is check to see if it was a 302 response. If it is not, it will "pass", if it is, it will need to parse out the location parameter shown here:
location: http://downloads.test.stuff.com/5774/stuff/picture.jpg?wed=20130921152237&wer=20130922153237&hash=0f20f4a6d0c9f1720b0b6
Once I have the location parameter, I will have to make another GET request to download that file. I will also have to maintain the cookie for my session in order to download the file.
Can someone point me in the right direction for what library is best to use for this? I am having trouble finding out how to parse the 302 response and adding a cookie value like the one shown in my GET request above. I am sure there must be some library that can do all of this.
Any help would be much appreciated.
import urllib.request as ur
import urllib.error as ue
# Note: HTTPResponse.read([amt]) reads and returns the response body as bytes,
# or up to the next amt bytes. urlopen() does not decode it, because it cannot
# automatically determine the encoding of the byte stream it receives.
url = "http://www.example.org/images/{}.jpg"
dst = ""  # destination prefix; empty means the current directory
arr = ["01", "02", "03", "04", "05", "06", "07", "08", "09"]
# arr = range(10, 20)
try:
    for x in arr:
        print(str(x) + "). ".ljust(4), end="")
        # urlopen() follows redirects by default; read() returns the body only.
        with ur.urlopen(url.format(x)) as hrio:
            data = hrio.read()
        with open(dst + str(x) + ".jpg", "wb") as fh:
            fh.write(data)
        print("\t[REQUEST COMPLETE]\t\t<Error ~ [None]>")
except ue.URLError as e:
    # The first failed request ends the loop.
    print("\t[REQUEST INCOMPLETE]\t", end="")
    print("<Error ~ [{}]>".format(e))
Retrieve ALL header data with urllib

I've scraped many websites and have often wondered why the response headers displayed in Firebug and the response headers returned by urllib.urlopen(url).info() are often different in that Firebug reports MORE headers.
I encountered an interesting one today. I'm scraping a website by following a "search url" that fully loads (returns a 200 status code) before redirecting to a final page. The easiest way to perform the scrape would be to read the Location response header and make another request. However, that particular header is absent when I run urllib.urlopen(url).info().
Here is the difference:
Firebug headers:
Cache-Control : no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Connection : keep-alive
Content-Encoding : gzip
Content-Length : 2433
Content-Type : text/html
Date : Fri, 05 Oct 2012 15:59:31 GMT
Expires : Thu, 19 Nov 1981 08:52:00 GMT
Location : /catalog/display/1292/index.html
Pragma : no-cache
Server : Apache/2.0.55
Set-Cookie : PHPSESSID=9b99dd9a4afb0ef0ca267b853265b540; path=/
Vary : Accept-Encoding,User-Agent
X-Powered-By : PHP/4.4.0
Headers returned by my code:
Date: Fri, 05 Oct 2012 17:16:23 GMT
Server: Apache/2.0.55
X-Powered-By: PHP/4.4.0
Set-Cookie: PHPSESSID=39ccc547fc407daab21d3c83451d9a04; path=/
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
Vary: Accept-Encoding,User-Agent
Content-Type: text/html
Connection: close
Here's my code:
from BeautifulSoup import BeautifulSoup
import urllib
import psycopg2
import psycopg2.extras
import scrape_tools
tools = scrape_tools.tool_box()
db = tools.db_connect()
cursor = db.cursor(cursor_factory = psycopg2.extras.RealDictCursor)
cursor.execute("SELECT data FROM table WHERE variable = 'Constant' ORDER BY data")
for row in cursor:
    url = 'http://www.website.com/search/' + row['data']
    headers = {
        'Accept' : 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Encoding' : 'gzip, deflate',
        'Accept-Language' : 'en-us,en;q=0.5',
        'Connection' : 'keep-alive',
        'Host' : 'www.website.com',
        'User-Agent' : 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:15.0) Gecko/20100101 Firefox/15.0.1'
    }
    post_params = {
        'query' : row['data'],
        'searchtype' : 'products'
    }
    post_args = urllib.urlencode(post_params)
    soup = tools.request(url, post_args, headers)
    print tools.get_headers(url, post_args, headers)
Please note: scrape_tools is a module I wrote myself. The code contained in the module to retrieve headers is (basically) as follows:
class tool_box:
    def get_headers(self, url, post, headers):
        file_pointer = urllib.urlopen(url, post, headers)
        return file_pointer.info()
Is there a reason for the discrepancy? Am I making a silly mistake in my code? How can I retrieve the missing header data? I'm fairly new to Python, so please forgive any dumb errors.
Thanks in advance. Any advice is much appreciated!
Also...Sorry about the wall of code =\
You're not getting the same kind of response for the two requests. For example, the response to the Firefox request contains a Location: header, so it's probably a 302 Moved temporarily or a 301. Those usually don't contain any meaningful body data, but instead cause your Firefox to issue a second request to the URL in the Location: header (urllib.urlopen follows such redirects automatically, so info() reports the headers of the final response rather than those of the redirect).
The Firefox response also uses Connection : keep-alive while the urllib request got answered with Connection: close.
Also, the Firefox response is gzipped (Content-Encoding : gzip), while the urllib one is not. That's probably because your Firefox sends an Accept-Encoding: gzip, deflate header with its request.
Don't rely on Firebug to tell you HTTP headers (even though it does so truthfully most of the time), but use a sniffer like Wireshark to inspect what's actually going over the wire.
You're obviously dealing with two different responses.
There could be several reasons for this. For one, web servers are expected to respond differently depending on the Accept-Language, Accept-Encoding and similar headers the client sends in its request. Then there's also the possibility that the server does some kind of User-Agent sniffing.
Either way, capture your requests with urllib as well as the ones with Firefox using Wireshark and first compare the request lines (not the headers, but the actual GET / HTTP/1.0 part). Are they really the same? If yes, move on to comparing the request headers, and start manually setting the same headers for the urllib request until you figure out which headers make a difference.
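The header comparison can be done with a few lines of code once both sets of headers have been captured. The sketch below uses Python 3's urllib.request rather than the Python 2 urllib from the question, and diff_headers and build_request are illustrative helpers, not library functions.

```python
import urllib.request


def diff_headers(ours, theirs):
    """Compare two header mappings case-insensitively.

    Returns (missing, different): names present only in theirs, and names
    present in both whose values differ.
    """
    ours = {name.lower(): value for name, value in ours.items()}
    theirs = {name.lower(): value for name, value in theirs.items()}
    missing = sorted(set(theirs) - set(ours))
    different = sorted(
        name for name in set(ours) & set(theirs) if ours[name] != theirs[name]
    )
    return missing, different


def build_request(url, headers, data=None):
    """Create a request that carries explicit headers (nothing is sent yet)."""
    return urllib.request.Request(url, data=data, headers=headers)
```

Passing the result of build_request to urllib.request.urlopen sends exactly the headers given. One thing worth checking in the question's code: in Python 2, the third positional argument of urllib.urlopen is proxies, not headers, so the headers dictionary passed in get_headers is probably never sent as request headers at all.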

urllib2.urlopen throws 404 exception for urls that browser opens

The following url (and others like it) can be opened in a browser but causes urllib2.urlopen to throw a 404 exception: http://store.ovi.com/#/applications?categoryId=20&fragment=1&page=1
geturl() returns the same url (no redirect). The headers are copied and pasted from firebug. I tried passing in the headers as a dictionary to Request, but got the same result. wget opens the url in the console but not from the script.
the code:
source_url = 'http://store.ovi.com/#/applications?categoryId=20&fragment=1&page=2'
try:
    socket.setdefaulttimeout(10)
    hdrs = [('Host','store.ovi.com'),('User-Agent','Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US;rv:1.9.0.13) Gecko/2009073021 Firefox/3.0.13 AppEngine-Google;(+http://code.google.com/appengine)'),('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'),('Accept-Language','en-us,en;q=0.5'),('Accept-Encoding','gzip,deflate'),('Accept-Charset','ISO-8859-1,utf-8;q=0.7,*;q=0.7'),('Keep-Alive','115'),('Connection','keep-alive'),('Cookie','JNPRSESSID=4u4devdrt7eb6e0qem3gin47i2; s_cc=true; undefined_s=First%20Visit; s_nr=1282817443274; s_sq=%5B%5BB%5D%5D; view=Grid; menu=menuOpen; OVI_DEVICE=b5130'),('Cache-Control','max-age=0')]
    ree = urllib2.Request(source_url)
    ree.addheaders = hdrs
    opener = urllib2.build_opener()
    htmlSource = opener.open(ree).read()
except urllib2.HTTPError, e:
    print e.code
    print e.msg
    print e.headers
The error output:
404
Not Found
Date: Sat, 28 Aug 2010 00:36:57 GMT
Server: Apache/2.2.3 (Red Hat)
X-Powered-By: PHP/5.2.2
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
Keep-Alive: timeout=7, max=333
Connection: Keep-Alive
Transfer-Encoding: chunked
Content-Type: text/html; charset=UTF-8
What, if anything, am I doing incorrectly? Is this a bug? And if so, is there a workaround? Thanks!
Given a URL like:
http://store.ovi.com/#/applications?categoryId=20&fragment=1&page=2
The bit that browsers fetch is just:
http://store.ovi.com/
Everything to the right of that is a ‘fragment identifier’, which is not passed to the server at all (evidently, if you try, it will get confused). Instead, the HTML returned for the / URL will include a load of JavaScript that reads the #... data at the client side and fills in the page content using a bunch of XMLHttpRequests.
Webapps implemented like this are a big old pain to scrape, because you can't just take the HTML content of the main page. Instead you have to either analyse the script to find out where it gets the actual data from, or you have to hook up a real browser in order to execute all the scripts and see what document objects you're left with. They're also typically bad for accessibility and SEO.
Luckily for you this site appears to be putting something in the fragment that's also a valid path. So it looks like you can get the dynamic page data from the URL:
http://store.ovi.com/applications?categoryId=20&fragment=1&page=1
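The split between what goes to the server and what stays in the browser can be checked with the standard library. The sketch below is written for Python 3's urllib.parse; fragment_as_path is an illustrative helper that only applies when, as on this site, the fragment happens to look like a path.

```python
from urllib.parse import urldefrag, urljoin


def fragment_as_path(url):
    """Guess the data URL when the #fragment itself looks like a path.

    Returns None if the fragment does not start with "/".
    """
    base, fragment = urldefrag(url)  # base is the part a browser requests
    if not fragment.startswith("/"):
        return None
    return urljoin(base, fragment)
```

Whether the rewritten URL actually serves the data depends entirely on the site, so the result should be treated as a guess to try rather than a guaranteed endpoint.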
