I am making GET/POST requests to a URL and getting an HTML page in response. I only want the response headers, not the response body.
I have already used the HEAD method, but it does not work in all situations.
Receiving the complete HTML page in the response increases bandwidth usage.
I also need a solution that works for both HTTP and HTTPS requests.
For Example
import urllib2
urllib2.urlopen('http://www.google.com')
If I send a request to this URL using urllib2 or requests, I get both the response body and the headers from the server. The request transfers 14.08 KB; broken down, the response headers take 775 bytes and the response body takes 13.32 KB. I only need the response headers, so I would save 13.32 KB.
What you want to do is a so-called HEAD request. See this question on how to do it.
Is this what you are looking for:
import urllib2
l = urllib2.urlopen('http://www.google.com')
print(l.headers)
#Date: Thu, 11 Oct 2018 09:07:20 GMT
#Expires: -1
#...
EDIT
This seems to do what you are looking for:
import requests
a = requests.head('https://www.google.com')
a.headers
#{'X-XSS-Protection': '1; mode=block', 'Content-Encoding':...
a.text
#u''
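For completeness, the same HEAD request can be made with only the standard library (Python 3's urllib.request, the successor of urllib2); this is a sketch assuming the URL is reachable, and it works the same way for both http:// and https://:

```python
import urllib.request

# Python 3 standard-library equivalent: the method argument makes
# urlopen issue a HEAD request, so only the headers come back.
req = urllib.request.Request('https://www.google.com', method='HEAD')
with urllib.request.urlopen(req) as resp:
    headers = dict(resp.headers)
    body = resp.read()

print(headers.get('Content-Type'))
print(body)  # b'' -- a HEAD response has no body
```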
Related
I am trying to read JSON-formatted data from the following public URL: http://ws-old.parlament.ch/factions?format=json. Unfortunately, I am not able to convert the response to JSON, as I always get the HTML-formatted content back. The request seems to completely ignore the format parameter passed in the URL:
import urllib.request
response = urllib.request.urlopen('http://ws-old.parlament.ch/factions?format=json')
response_text = response.read()
print(response_text) #why is this HTML?
Does somebody know how I am able to get the JSON formatted content as displayed in the web browser?
You need to add "Accept": "text/json" to the request headers.
For example using requests package:
import requests

r = requests.get('http://ws-old.parlament.ch/factions?format=json',
                 headers={'Accept': 'text/json'})
print(r.json())
Result:
[{'id': 3, 'updated': '2022-02-22T14:59:17Z', 'abbreviation': ...
Sorry for you, but this web service has a misleading implementation: the format query parameter is useless. As pointed out by @maciek97x, only the Accept: <format> header is considered for the formatting.
So you can call the endpoint directly without ?format=json, but with the header Accept: text/json.
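The same fix applied to the urllib.request code from the question (a sketch; the network call is left commented out in case the old web service is no longer reachable):

```python
import urllib.request

# Attach the Accept header to the Request object; the format= query
# parameter can be dropped entirely, since the server ignores it.
req = urllib.request.Request('http://ws-old.parlament.ch/factions',
                             headers={'Accept': 'text/json'})
print(req.get_header('Accept'))  # text/json

# If the service is reachable, the body is now JSON:
# import json
# with urllib.request.urlopen(req) as response:
#     factions = json.loads(response.read().decode('utf-8'))
```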
I am trying to log in to a website by passing a username and password. It says the session cookie is missing. I am a beginner with APIs and don't know if I have missed something here. The website is http://testing-ground.scraping.pro/login
import urllib3
http = urllib3.PoolManager()
url = 'http://testing-ground.scraping.pro/login?mode=login'
req = http.request('POST', url, fields={'usr':'admin','pwd':'12345'})
print(req.data.decode('utf-8'))
There are two issues in your code that make you unable to log in successfully.
The content-type issue
In the code you are using urllib3 to send data of content-type multipart/form-data. The website, however, seems to only accept the content-type application/x-www-form-urlencoded.
Try the following cURL commands:
curl -v -d "usr=admin&pwd=12345" http://testing-ground.scraping.pro/login?mode=login
curl -v -F "usr=admin&pwd=12345" http://testing-ground.scraping.pro/login?mode=login
For the first one, the content-type in your request header is application/x-www-form-urlencoded, so the website takes it and logs you in (with a 302 Found response).
The second one, however, sends data with content-type multipart/form-data. The website doesn't take it and therefore rejects your login request (with a 200 OK response).
The cookie issue
Another issue is that urllib3 follows redirect by default. More importantly, the cookie is not handled (i.e. stored and sent in the following requests) by default by urllib3. Thus, the second request won't contain the cookie tdsess=TEST_DRIVE_SESSION, and therefore the website returns the message that you're not logged in.
If you only care about the login request, you can try the following code:
import urllib3
http = urllib3.PoolManager()
url = 'http://testing-ground.scraping.pro/login?mode=login'
req = http.request('POST', url, fields={'usr':'admin','pwd':'12345'}, encode_multipart=False, redirect=False)
print(req.data.decode('utf-8'))
The encode_multipart=False instructs urllib3 to send data with content-type application/x-www-form-urlencoded; the redirect=False tells it not to follow the redirect, so that you can see the response of your initial request.
If you do want to complete the whole login process, however, you need to save the cookie from the first response and send it in the second request. You can do it with urllib3, or
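Here is a self-contained sketch of that manual cookie round-trip with urllib3. A tiny local server stands in for the real site (its /login and /welcome behaviour is a simplification for illustration, not the site's actual logic; the tdsess cookie name comes from the site's response):

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urljoin

import urllib3

class Handler(BaseHTTPRequestHandler):
    def do_POST(self):
        # "Log in": set a session cookie and redirect to the welcome page.
        self.send_response(302)
        self.send_header('Set-Cookie', 'tdsess=TEST_DRIVE_SESSION')
        self.send_header('Location', '/welcome')
        self.end_headers()

    def do_GET(self):
        # Welcome page: only greet clients that sent the cookie back.
        logged_in = 'tdsess=' in self.headers.get('Cookie', '')
        body = b'WELCOME' if logged_in else b'NOT LOGGED IN'
        self.send_response(200)
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # silence request logging
        pass

server = HTTPServer(('127.0.0.1', 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()
base = 'http://127.0.0.1:%d/' % server.server_port

http = urllib3.PoolManager()
# First request: don't follow the redirect, so we can read Set-Cookie
# from the 302 response ourselves.
login = http.request('POST', urljoin(base, 'login'),
                     fields={'usr': 'admin', 'pwd': '12345'},
                     encode_multipart=False, redirect=False)
cookie = login.headers['Set-Cookie']

# Second request: follow the redirect target by hand, sending the
# cookie back so the server recognises the session.
welcome = http.request('GET', urljoin(base, login.headers['Location']),
                       headers={'Cookie': cookie})
print(welcome.data.decode('utf-8'))  # WELCOME
server.shutdown()
```

Note that a real client should parse the Set-Cookie value (it may carry attributes like Path or Expires) rather than echoing the whole header back, which is exactly the bookkeeping the Requests session below does for you.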
Use the Requests library
I'm not sure if you have any particular reason to use urllib3. It will definitely work if you implement it well, but I would suggest trying the Requests library, which is much easier to use. For your case, the following code with Requests will work and get you to the welcome page:
import requests
url = 'http://testing-ground.scraping.pro/login?mode=login'
req = requests.post(url, data={'usr':'admin','pwd':'12345'})
print(req.text)
import requests
auth_credentials = ("admin", "12345")
url = "http://testing-ground.scraping.pro/login?mode=login"
response = requests.post(url=url, auth=auth_credentials)
print(response.text)
I am trying to post multipart/form data using the requests library. According to the website, on submitting the form you are redirected to a page where your data is created, but when I try it using the requests library it gives 200 as the response when it should give 302. Could anyone help me with this? I don't know what I am doing wrong.
By default requests will follow "302" redirection responses. You can disable this as follows:
r = requests.get('http://github.com/', allow_redirects=False)
See https://requests.kennethreitz.org/en/master/user/quickstart/#redirection-and-history
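A short sketch of both sides of this (using github.com only because it reliably redirects HTTP to HTTPS):

```python
import requests

# With redirects disabled, you see the redirect response itself.
r = requests.get('http://github.com/', allow_redirects=False)
print(r.status_code)          # 301
print(r.headers['Location'])  # https://github.com/

# With redirects enabled (the default), the intermediate hops are
# recorded in response.history and status_code is the final 200.
followed = requests.get('http://github.com/')
print(followed.status_code)                       # 200
print([h.status_code for h in followed.history])  # [301]
```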
I've used requests with good results, but with this particular URL I get a redirect-loop error.
import requests
from lxml import html

s = requests.Session()
page = s.get('http://pe.usps.gov/text/pub28/28apc_002.htm')
tree = html.fromstring(page.content)
street_type = tree.xpath("//*[@id='ep533076']/tbody/tr[2]/td[1]/p/a")
print(street_type)
I'm wondering specifically if there is a way to assign headers for the request so as to avoid the redirect. I've tested the actual url and it looks valid.
Thanks
The redirect is a response sent by the server, typically an HTTP 301 or 302 response, which says "hey, I know what you are looking for, it is over here..." and gives you a new place to look. Yes, these can be chained together, and yes, you can end up in loops. That is what the max redirect limit is for.
You can set the number of allowable redirects in requests using:
s.max_redirects = 50 # the default is 30
But this will not solve the issue. In this particular case the server is looking for what kind of browser you are using and is redirecting you when it doesn't find what it is looking for. You can imitate a browser by adding a user-agent field to the header.
Recommended usage: sets the header to a generic browser for the single request
session.get(url, headers={'user-agent': 'My app'})
# returns:
<Response [200]>
Original posting: sets the header for the entire session, which is not necessarily what you want.
s.headers = {'user-agent': 'some app'}
s.get('http://pe.usps.gov/text/pub28/28apc_002.htm')
# returns:
<Response [200]>
While using requests to download a webpage, we store the result of that operation in a response object. What I could not understand is: exactly what is stored in the response object? Is it the HTML source code of the page, or the entire string displayed on the page?
It is an instance of the lower-level Response class of the Python requests library. The literal description from the documentation is:
The Response object, which contains a server's response to an HTTP request.
Every HTTP request sent returns a response from the server (the Response object) which includes quite a bit of information.
You can find all the info you need here, and also here is the github link.
The server and client use the HTTP protocol to send and receive information.
The response stores all the information from the server: the status code, the HTTP headers (for example, cookies) and the HTTP body (mostly HTML, but it can be JSON, a file, or other data).
wikipedia: HTTP Protocol
BTW: the request also carries HTTP headers and an HTTP body (sometimes the HTTP body is empty).
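To make this concrete, here is a quick tour of what a Response object holds (google.com is just a stable example URL):

```python
import requests

r = requests.get('https://www.google.com')

print(r.status_code)              # 200
print(r.headers['Content-Type'])  # headers live in a case-insensitive dict
print(type(r.content))            # <class 'bytes'> -- the raw body
print(type(r.text))               # <class 'str'>   -- the decoded body (HTML here)
print(r.request.method)           # GET -- the request that produced it is stored too
```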