http.client.HTTPException: got more than 100 headers - python

Since Google did not turn up anything for the error "http.client.HTTPException: got more than 100 headers", I am asking this question.
>>> import http.client as h
>>> conn = h.HTTPConnection("www.coursefinders.com")
>>> conn.request("HEAD","/")
>>> conn.getresponse();
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python3.4/http/client.py", line 1148, in getresponse
response.begin()
File "/usr/lib/python3.4/http/client.py", line 376, in begin
self.headers = self.msg = parse_headers(self.fp)
File "/usr/lib/python3.4/http/client.py", line 267, in parse_headers
raise HTTPException("got more than %d headers" % _MAXHEADERS)
http.client.HTTPException: got more than 100 headers
What does this exception mean, and how should I properly handle this type of error? The site works fine in a browser.

Here is a solution that doesn't involve changing the library's .py files:
import httplib # or http.client if you're on Python 3
httplib._MAXHEADERS = 1000
Put those two lines at the top of your code, before any request is made.
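For Python 3, the same idea can be scoped so the limit is only raised around the requests that need it. This is a minimal sketch, not an official API: `_MAXHEADERS` is a private module attribute of `http.client` and could change between versions, and the helper names `raised_header_limit` and `build_raw_headers` are made up for illustration.

```python
import contextlib
import http.client
import io


@contextlib.contextmanager
def raised_header_limit(limit=1000):
    """Temporarily raise http.client's private header limit, then restore it."""
    original = http.client._MAXHEADERS
    http.client._MAXHEADERS = limit
    try:
        yield
    finally:
        http.client._MAXHEADERS = original


def build_raw_headers(count):
    """Build an in-memory header block with `count` Set-Cookie lines (test data)."""
    lines = b"Set-Cookie: login=-1; path=/\r\n" * count
    return io.BytesIO(lines + b"\r\n")


# Parsing 150 headers only succeeds while the limit is raised.
with raised_header_limit(1000):
    message = http.client.parse_headers(build_raw_headers(150))
cookie_count = len(message.get_all("Set-Cookie"))
```

In real code, the `conn.request(...)` and `conn.getresponse()` calls would go inside the `with` block instead of the in-memory test data.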

Change the value of _MAXHEADERS to 1000 or 10000 in C:\Python27\Lib\httplib.py.

I was going to suggest using requests, but it's implemented using http.client and fails for the same reason. To verify whether the problem was in the library or the server, I tried a telnet session, and the results resembled:
Trying 91.250.81.121...
Connected to www.coursefinders.com.
Escape character is '^]'.
HEAD / HTTP\1.1
HTTP/1.1 200 OK
Date: Mon, 14 Apr 2014 08:35:54 GMT
Server: Apache/2.2.16 (Debian)
X-Powered-By: PHP/5.3.3-7+squeeze19
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
Set-Cookie: PHPSESSID=2bnr4dpa4e90r2lmbv01smu1b6; path=/
Set-Cookie: login=-1; path=/
Set-Cookie: login=-1; path=/
Set-Cookie: login=-1; path=/
Set-Cookie: login=-1; path=/
Set-Cookie: login=-1; path=/
Set-Cookie: login=-1; path=/
Set-Cookie: login=-1; path=/
Set-Cookie: login=-1; path=/
Set-Cookie: login=-1; path=/
Set-Cookie: login=-1; path=/
Set-Cookie: login=-1; path=/
Set-Cookie: c_id=496cc5d32486ac8d944e971ad6ec9eb3649ab23cs%3A3%3A%22235%22%3B; expires=Tue, 15-Apr-2014 08:35:54 GMT; path=/
Set-Cookie: login=-1; path=/
Set-Cookie: wc=1; expires=Thu, 09-Apr-2015 08:35:54 GMT
Set-Cookie: login=-1; path=/
Set-Cookie: login=-1; path=/
[... Many Set-Cookie commands omitted ...]
Set-Cookie: login=-1; path=/
Cache-Control: max-age=1, private, must-revalidate
Vary: Accept-Encoding
Connection: close
Content-Type: text/html; charset=utf-8
Connection closed by foreign host.
So it looks like their server is misconfigured and is spewing out lots of superfluous Set-Cookie headers.
There doesn't seem to be any way to configure httplib to accept large numbers of headers. I've tried searching for alternative HTTP libraries that aren't implemented using httplib but haven't had any luck.
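To confirm that a response really carries more header lines than the limit, the raw bytes (for example, saved from a telnet or socket session like the one above) can be tallied by header name. The following is a rough sketch; the function name `count_header_names` and the sample bytes are invented for illustration and are not the actual server output.

```python
from collections import Counter


def count_header_names(raw_response: bytes) -> Counter:
    """Count how often each header name occurs in a raw HTTP response head."""
    head = raw_response.split(b"\r\n\r\n", 1)[0]
    lines = head.split(b"\r\n")[1:]  # skip the status line
    counts = Counter()
    for line in lines:
        name, sep, _value = line.partition(b":")
        if sep:
            counts[name.strip().decode("iso-8859-1").lower()] += 1
    return counts


# Made-up sample resembling the session above: one status line,
# two ordinary headers and 120 repeated Set-Cookie headers.
sample = (
    b"HTTP/1.1 200 OK\r\n"
    b"Server: Apache/2.2.16 (Debian)\r\n"
    + b"Set-Cookie: login=-1; path=/\r\n" * 120
    + b"Connection: close\r\n\r\n"
)
counts = count_header_names(sample)
total_headers = sum(counts.values())
```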

On OS X, I added this to my code:
import httplib as http_client
Then I debugged the script to find where the library was being loaded from. In my case it was:
/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py
I then edited the limit as per Felix's post:
sudo vim /System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py

Related

Difference between Python "requests" and Linux "curl"

I have searched in several places but have not found a satisfactory answer to this:
What are the differences between the Python "requests" module and the Linux "curl" command? Does "requests" use "curl" under the hood, or is it a totally different way of dealing with HTTP requests and responses?
For most requests they behave the same way (as they should), but sometimes I see a difference in the response, and it is really hard to figure out why.
For example, using curl for a HEAD request:
curl --head https://historia.sherpadesk.com
HTTP/2 302
content-type: text/html; charset=utf-8
date: Mon, 28 Feb 2022 20:31:30 GMT
access-control-expose-headers: Request-Context
cache-control: private
location: /login/?ref=portal
set-cookie: ASP.NET_SessionId=nghpw4qp5cw2ntwmwfuxw3oi; path=/; HttpOnly; SameSite=Lax
content-length: 135
request-context: appId=cid-v1:d5f9900e-ecd4-442f-9e92-e11b4cdbc0c9
x-frame-options: SAMEORIGIN
x-xss-protection: 1
x-content-type-options: nosniff
strict-transport-security: max-age=31536000
and if I use -L to follow redirects,
curl --head https://historia.sherpadesk.com -L
HTTP/2 302
content-type: text/html; charset=utf-8
date: Mon, 28 Feb 2022 20:31:37 GMT
access-control-expose-headers: Request-Context
cache-control: private
location: /login/?ref=portal
set-cookie: ASP.NET_SessionId=trzp0bql4nibswux5z5wfayy; path=/; HttpOnly; SameSite=Lax
content-length: 135
request-context: appId=cid-v1:d5f9900e-ecd4-442f-9e92-e11b4cdbc0c9
x-frame-options: SAMEORIGIN
x-xss-protection: 1
x-content-type-options: nosniff
strict-transport-security: max-age=31536000
HTTP/2 302
content-type: text/html; charset=utf-8
date: Mon, 28 Feb 2022 20:31:38 GMT
access-control-expose-headers: Request-Context
location: https://app.sherpadesk.com/login/?ref=portal
content-length: 161
request-context: appId=cid-v1:d5f9900e-ecd4-442f-9e92-e11b4cdbc0c9
x-frame-options: SAMEORIGIN
x-xss-protection: 1
x-content-type-options: nosniff
strict-transport-security: max-age=31536000
HTTP/2 200
content-type: text/html; charset=utf-8
date: Mon, 28 Feb 2022 20:31:39 GMT
access-control-expose-headers: Request-Context
cache-control: no-store, no-cache
expires: -1
pragma: no-cache
set-cookie: ASP.NET_SessionId=aqmnxu2s3qkri3sravsrs1cq; path=/; HttpOnly; SameSite=Lax
content-length: 8935
request-context: appId=cid-v1:d5f9900e-ecd4-442f-9e92-e11b4cdbc0c9
x-frame-options: SAMEORIGIN
x-xss-protection: 1
x-content-type-options: nosniff
strict-transport-security: max-age=31536000
and here is the (debug) output when I use Python's requests module requests.head(url):
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): historia.sherpadesk.com:443
send: b'HEAD / HTTP/1.1\r\nHost: historia.sherpadesk.com\r\nUser-Agent: python-requests/2.26.0\r\nAccept-Encoding: gzip, deflate, br\r\nAccept: */*\r\nConnection: keep-alive\r\n\r\n'
reply: 'HTTP/1.1 403 Forbidden: Access is denied.\r\n'
header: Content-Length: 58
header: Content-Type: text/html
header: Date: Mon, 28 Feb 2022 20:36:18 GMT
header: X-Frame-Options: SAMEORIGIN
header: X-XSS-Protection: 1
header: X-Content-Type-Options: nosniff
header: Strict-Transport-Security: max-age=31536000
DEBUG:urllib3.connectionpool:https://historia.sherpadesk.com:443 "HEAD / HTTP/1.1" 403 0
INFO:root:URL: https://historia.sherpadesk.com/
INFO:root:<Response [403]>
which just results in a 403 response code. The response is the same whether allow_redirects is True or False. I have also tried using a proxy with the Python code, as I thought the site might be recognising Python's request as a bot and blocking it, but that also fails. Also, if that were the case, why does curl succeed?
So, my main question is: what are the major differences between curl and requests that might cause different responses in certain cases? If possible, I would really like a thorough explanation that could help me debug and resolve these issues.
The two are different (requests does not use curl; it is built on urllib3, which in turn uses Python's http.client), but the problem here is related to the User-Agent header.
When I try with curl, specifying the python-requests user agent:
$ curl --head -A "python-requests/2.26.0" https://historia.sherpadesk.com/
HTTP/2 403
content-type: text/html
date: Mon, 28 Feb 2022 22:30:02 GMT
content-length: 58
x-frame-options: SAMEORIGIN
x-xss-protection: 1
x-content-type-options: nosniff
strict-transport-security: max-age=31536000
With curl default user agent:
$ curl --head https://historia.sherpadesk.com/
HTTP/2 302
...
Apparently, they have some type of website security that is blocking HTTP clients like python-requests, but not curl for some reason.
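If the block really is keyed on the User-Agent, sending a different one from Python should change the result. Below is a small sketch using only the standard library; the user-agent string "curl/7.81.0" is an assumption chosen for illustration, and whether the site accepts it is not something this example can guarantee. With requests, the equivalent would be passing `headers={"User-Agent": ...}` to `requests.head()`.

```python
import urllib.request


def build_head_request(url, user_agent):
    """Create a HEAD request that carries an explicit User-Agent header."""
    return urllib.request.Request(
        url,
        headers={"User-Agent": user_agent},
        method="HEAD",
    )


# "curl/7.81.0" is an assumed value; substitute whatever agent the site accepts.
request = build_head_request("https://historia.sherpadesk.com/", "curl/7.81.0")

# Sending it would look like this (left commented out, since it needs network access):
# with urllib.request.urlopen(request) as response:
#     print(response.status, dict(response.headers))
```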

How to get missing content length for a file from url?

I am trying to write a simple download manager in Python with concurrency. The aim is to use the Content-Length header from the response for a file URL to split the file into chunks, then download those chunks concurrently. The idea works for all URLs that have a Content-Length header, but recently I came across some URLs that don't serve one.
https://filesamples.com/samples/audio/mp3/sample3.mp3
HTTP/1.1 200 OK
Date: Sat, 08 Aug 2020 11:53:15 GMT
Content-Type: audio/mpeg
Transfer-Encoding: chunked
Connection: close
Set-Cookie: __cfduid=d2a4be3535695af67cb7a7efe5add19bf1596887595; expires=Mon, 07-Sep-20 11:53:15 GMT; path=/; domain=.filesamples.com; HttpOnly; SameSite=Lax
Cache-Control: public, max-age=86400
Display: staticcontent_sol, staticcontent_sol
Etag: W/"5def04f1-19d6dd-gzip"
Last-Modified: Fri, 31 Jul 2020 21:52:34 GMT
Response: 200
Vary: Accept-Encoding
Vary: User-Agent,Origin,Accept-Encoding
X-Ezoic-Cdn: Miss
X-Middleton-Display: staticcontent_sol, staticcontent_sol
X-Middleton-Response: 200
CF-Cache-Status: HIT
Age: 24
cf-request-id: 046f8413ab0000e047449da200000001
Expect-CT: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
Server: cloudflare
CF-RAY: 5bf90932ae53e047-SEA
How can I get the content-length of the file without downloading the whole file?
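One approach that sometimes works, assuming the server honours range requests, is to ask for a single byte with `Range: bytes=0-0` and read the total size from the `Content-Range` header of the 206 response. A server using chunked transfer encoding may not support this at all, so treat the following as a sketch; the function name and the sample values are made up for illustration.

```python
import re

# Matches values such as "bytes 0-0/123456" or "bytes 0-0/*".
_CONTENT_RANGE = re.compile(r"^bytes (\d+)-(\d+)/(\d+|\*)$")


def total_size_from_content_range(value):
    """Return the total size from a Content-Range value, or None if unknown."""
    match = _CONTENT_RANGE.match(value.strip())
    if not match or match.group(3) == "*":
        return None
    return int(match.group(3))


# Illustrative values only, not taken from the server in the question.
known_size = total_size_from_content_range("bytes 0-0/123456")
unknown_size = total_size_from_content_range("bytes 0-0/*")
```

If the server answers with 200 instead of 206, or the total is `*`, the size is not available up front and the file has to be streamed sequentially.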

Cannot retrieve CSP header with python

I want to retrieve all headers from a certain site, in this example "https://www.facebook.com", as follows:
import urllib2
req = urllib2.Request('https://www.facebook.com/')
res = urllib2.urlopen(req)
print res.info()
res.close()
that results in this response:
X-XSS-Protection: 0
Pragma: no-cache
Cache-Control: private, no-cache, no-store, must-revalidate
X-Frame-Options: DENY
Strict-Transport-Security: max-age=15552000; preload
X-Content-Type-Options: nosniff
Expires: Sat, 01 Jan 2000 00:00:00 GMT
Set-Cookie: sb=1GyeWkJzGbmX-VUyBi26; expires=Thu, 05-Mar-2020 10:26:28 GMT; Max-Age=63071999; path=/; domain=.facebook.com; secure; httponly
Vary: Accept-Encoding
Content-Type: text/html; charset=UTF-8
X-FB-Debug: X9aSOOKs6/aER1yuY4iUUIZrj4yTKtKSUAZ/AFE37IieCe8O4MSsFc5xlQ0LoQyHnbrSL4DaYiTVUUkFZeDrsqqg==
Date: Tue, 06 Mar 2018 10:26:29 GMT
Connection: close
I can retrieve all headers except Content-Security-Policy (CSP).
But when I test the same site with the Geekflare CSP test,
it successfully retrieves all headers, including the CSP one.
It seems I forgot to set the User-Agent in the request.

How to successfully download range of bytes instead of complete file using python?

I am trying to download only a range of bytes of a file, using the following:
r = requests.get('https://stackoverflow.com', headers={'Range':'bytes=0-999'})
But it gives status code 200 instead of 206, and I get the entire file.
I tried following "Python: How to download file using range of bytes?", but it also gave me status code 200. What is the reason, and how do I download files partially using Python?
Headers for stackoverflow:
HTTP/1.1 200 OK
Content-Type: text/html; charset=utf-8
Last-Modified: Fri, 04 Aug 2017 05:28:29 GMT
X-Frame-Options: SAMEORIGIN
X-Request-Guid: 86fd186e-b5ac-472f-9d79-c45149343776
Strict-Transport-Security: max-age=15552000
Content-Length: 107699
Accept-Ranges: bytes
Date: Wed, 06 Sep 2017 11:48:16 GMT
Via: 1.1 varnish
Age: 0
Connection: keep-alive
X-Served-By: cache-sin18023-SIN
X-Cache: MISS
X-Cache-Hits: 0
X-Timer: S1504698496.659820,VS0,VE404
Vary: Fastly-SSL
X-DNS-Prefetch-Control: off
Set-Cookie: prov=6df82a45-2405-b1dd-1430-99e27827b360; domain=.stackoverflow.com; expires=Fri, 01-Jan-2055 00:00:00 GMT; path=/; HttpOnly
Cache-Control: private
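Range requests require server-side support, and the server probably does not support them for this resource. A server is allowed to ignore the Range header and reply with 200 and the full body.
To find out, make a HEAD request and check whether the response has an Accept-Ranges header.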
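The check can be wrapped in two small helpers: one builds the Range header, the other decides from the status code and headers whether the server actually returned a partial response. This is a sketch with hypothetical helper names; the commented-out lines show how it might be used with requests.

```python
def range_header(start, end):
    """Build a Range header for the inclusive byte range start..end."""
    if start < 0 or end < start:
        raise ValueError("expected 0 <= start <= end")
    return {"Range": "bytes={}-{}".format(start, end)}


def range_was_honoured(status_code, headers):
    """True only for a 206 response that also carries a Content-Range header."""
    names = {name.lower() for name in headers}
    return status_code == 206 and "content-range" in names


headers_to_send = range_header(0, 999)

# Usage with requests (needs network access, so it is left commented out):
# r = requests.get(url, headers=headers_to_send)
# if not range_was_honoured(r.status_code, r.headers):
#     print("Server ignored the Range header; got the whole body.")
```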

Unable to parse the output from HTTPConnection.debuglevel

I am trying to programmatically check the output of a TCP stream. I am able to get the output by turning on debugging in HTTPConnection, but how do I read the data and evaluate it with, say, a regular expression? I keep getting "TypeError: expected string or buffer". Is there a way to convert the result to a string?
thanks!
SCRIPT:
from urllib2 import Request, urlopen, URLError, HTTPError
import urllib2
import cookielib
import httplib
import re
httplib.HTTPConnection.debuglevel = 1
p = re.compile('abc=..........')
cj = cookielib.CookieJar()
proxy_address = '192.168.232.134:8083' # change the IP:PORT, this one is for example
proxy_handler = urllib2.ProxyHandler({'http': proxy_address})
opener = urllib2.build_opener(proxy_handler, urllib2.HTTPCookieProcessor(cj), urllib2.HTTPHandler(debuglevel=1))
urllib2.install_opener(opener)
url = "http://www.google.com/" # change the url
req=urllib2.Request(url)
data=urllib2.urlopen(req)
m=p.match(data)
if m:
    print "Match found."
else:
    print "Match not found."
RESULTS:
send: 'GET hyperlink/ HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: www.google.com\r\nConnection: close\r\nUser-Agent: Python-urllib/2.6\r\n\r\n'
reply: 'HTTP/1.1 303 See Other\r\n'
header: Location: hyperlink:8083/3240951276
header: Set-Cookie: abc=3240951276; path=/; domain=.google.com; expires=Thu, 31-Dec-2020 23:59:59 GMT
header: Content-Length: 0
send: 'GET hyperlink/3240951276 HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: hyperlink\r\nConnection: close\r\nUser-Agent: Python-urllib/2.6\r\n\r\n'
reply: 'HTTP/1.1 303 See Other\r\n'
header: Location: hyperlink
header: Set-Cookie: abc=3240951276; path=/; expires=Thu, 31-Dec-2020 23:59:59 GMT
header: Content-Length: 0
send: 'GET http://www.google.com/ HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: www.google.com\r\nCookie: abc=3240951276\r\nConnection: close\r\nUser-Agent: Python-urllib/2.6\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: Date: Mon, 18 Oct 2010 21:09:32 GMT
header: Expires: -1
header: cache-control: max-age=0, private, private
header: Content-Type: text/html; charset=ISO-8859-1
header: Set-Cookie: PREF=ID=066bc785a2b15ef6:FF=0:TM=1287436172:LM=1287436172:S=mNiXaRhshpf8nLji; expires=Wed, 17-Oct-2012 21:09:32 GMT; path=/; domain=.google.com
header: Set-Cookie: NID=39=ur3gnXL80kEy4shKAh8_-XV8PhmS4G83slPcX9OD3L6uthQZw-wq7RUnB0WKGYR3F_QGoyZAyEPCvjdi69EXXq23dzvpuZSl_KU2o7pqcTB7Vym4co1LOXmi9YQGpbkb; expires=Tue, 19-Apr-2011 21:09:32 GMT; path=/; domain=.google.com; HttpOnly
header: Server: gws
header: X-XSS-Protection: 1; mode=block
header: Connection: close
header: Content-Length: 4676
header: X-Con-Reuse: 1
header: Content-Encoding: gzip
header: via: 1.1 HermesPrefetch (CID2627003316.AID3240951276.TID1)
header: X-Trace-Timing: Start=1287436172845, Sched=0, Dns=2, Con=11, RxS=28, RxD=35
Traceback (most recent call last):
File "C:\Documents and Settings\asdf\workspace\PythonScripts2\src\Test1.py", line 18, in <module>
m=p.match(data)
TypeError: expected string or buffer
The debug information httplib provides, which you see in your terminal, is not actually part of the object returned by urllib2.urlopen(). Instead, it is printed directly to your process's sys.stdout. There's no way to change this behaviour in httplib, unfortunately. It's not entirely clear to me what you're trying to achieve by "capturing" this output and running a regular expression over it, but if that's really what you want to do, you would need to replace sys.stdout with something else, such as a suitable StringIO object, and then somehow work out which part of the output is the part you care about.
However, keep in mind that all the information that httplib produces in its debug output is available directly in your program. It's either based on what you pass to httplib (through urllib2) or it's part of the server's response, and thus available in the object returned by urllib2.urlopen(). For example, it looks like you're trying to extract the cookie information, which you can get simply by reading the cookie from the CookieJar you're already providing. There doesn't seem to be any sensible reason to capture the output and parse it.
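For completeness, here is a rough Python 3 sketch of the stdout-capturing approach, using `contextlib.redirect_stdout` with an `io.StringIO` buffer. The function `fake_debug_output` is a stand-in that prints one line shaped like the debug output in the question; in real code, the call that triggers the HTTP request would go there instead. The pattern uses `search()` rather than `match()`, because `match()` only matches at the start of the string. Note also that the TypeError in the question arises because `urlopen()` returns a response object, not a string.

```python
import contextlib
import io
import re

COOKIE_PATTERN = re.compile(r"abc=(\d+)")


def capture_stdout(func, *args, **kwargs):
    """Run func and return everything it printed to sys.stdout as a string."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        func(*args, **kwargs)
    return buffer.getvalue()


def fake_debug_output():
    # Stand-in for a request made with debuglevel=1 (hypothetical sample line).
    print("header: Set-Cookie: abc=3240951276; path=/; domain=.google.com")


captured = capture_stdout(fake_debug_output)
found = COOKIE_PATTERN.search(captured)
cookie_value = found.group(1) if found else None
```

As the answer says, reading the cookie from the CookieJar is simpler and less fragile than parsing debug output.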
