Python urllib open issue - python

I'm trying to fetch data from http://book.libertorrent.com/, but at the moment I'm failing badly because some additional data (headers) present in response. My code is very simple:
response = urllib.urlopen('http://book.libertorrent.com/login.php')
f = open('someFile.html', 'w')
f.write(response.read())
read() returns:
Date: Fri, 09 Nov 2012 07:36:54 GMT
Content-Type: text/html; charset=utf-8
Transfer-Encoding: chunked
Connection: close
Cache-Control: no-cache, pre-check=0, post-check=0
Expires: 0
Pragma: no-cache
Set-Cookie: bb_test=973132321; path=/; domain=book.libertorrent.com
Content-Language: ru
1ec0
...Html...
0
And response.info() is empty.
is there any way to correct response?

Let's try this:
$ echo -ne "GET /index.php HTTP/1.1\r\nHost: book.libertorrent.com\r\n\r\n" | nc book.libertorrent.com 80 | head -n 10
HTTP/1.1 200 OK
WWW
Date: Sat, 10 Nov 2012 17:41:57 GMT
Content-Type: text/html; charset=utf-8
Transfer-Encoding: chunked
Connection: keep-alive
Content-Language: ru
1f57
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"><html dir="ltr">
See that "WWW" in the second line? That's no valid HTTP header, I'm guessing that's what's throwing off the response parser here.
By the way, python2 and python3 behave differently here:
python2 seems to immediately interpret anything after this invalid header as content
python3 ignores all headers and continues reading the content after the double newline. Because the headers are ignored, so is the transfer encoding, and therfore the content lengths are interpreted as part of the body.
So in the end the problem is that the server is sending an invalid response, which should be fixed at the server's end.

Related

Report Portal - How to avoid unwanted logs in console?

I get the below messages for every test step, which is bit annoying. I need to process the console logs in a different way.
send: b'PUT /api/v2/superadmin_personal/item/14278b98-4430-4d2e-8301-1e30501da3b3 HTTP/1.1\r\nHost: abc.lab.com:8080\r\nUser-Agent: python-requests/2.27.1\r\nAccept-Encoding: gzip, deflate\r\nAccept: */*\r\nConnection: keep-alive\r\nAuthorization: Bearer 2c0717a7-b477-4e02-b1b5-df2a2757db70\r\nContent-Length: 137\r\nContent-Type: application/json\r\n\r\n'
send: b'{"endTime": "1646987482101", "status": "PASSED", "issue": null, "launchUuid": "f380b026-d7c9-4596-b80a-dcaec6fa82f2", "attributes": null}'
reply: 'HTTP/1.1 200 OK\r\n'
header: Cache-Control: no-cache, no-store, max-age=0, must-revalidate
header: Content-Type: application/json
header: Date: Fri, 11 Mar 2022 08:30:58 GMT
header: Expires: 0
header: Pragma: no-cache
header: X-Content-Type-Options: nosniff
header: X-Frame-Options: DENY
header: X-Xss-Protection: 1; mode=block
header: Content-Length: 93
You can set rp.http.logging=false in the reportportal.prop file or as a JVM parameter.
There is a common switch for all HTTP requests/responses Python sends:
from http.client import HTTPConnection
HTTPConnection.debuglevel = 0
Unfortunately Python uses just print to log HTTP (as here), ignoring his own logging framework. That's really silly, but here where Python is. Therefore there is no any straight way to configure what you want log and what you would like to skip. You can just turn on or off console printing for all HTTP requests.

bad request socket python

I'm using socket to build a simple "web browser" but I'm getting stuck at the start, whit a bad request result, here is my code:
import socket
mysocket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
URI = 'data.pr4e.org'
mysocket.connect((URI, 80))
cmd = "GET http://{0}/romeo.txt HTTP/1.0\n\n".format(URI).encode()
mysocket.send(cmd) # send a request
while True:
data = mysocket.recv(512) # recieve 512 bites at time
# if there is no more information to recive, then, close the loop
if (len(data) < 1):
break
print(data.decode())
pass
mysocket.close() # close connection
here is the output
HTTP/1.1 400 Bad Request
Date: Mon, 15 Feb 2021 14:36:06 GMT
Server: Apache/2.4.18 (Ubuntu)
Content-Length: 308
Connection: close
Content-Type: text/html; charset=iso-8859-1
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>400 Bad Request</title>
</head><body>
<h1>Bad Request</
h1>
<p>Your browser sent a request that this server could not understand.<br />
</p>
<hr>
<address>Apache/2.4.18 (Ubuntu) Server at do1.dr-chuck.com Port 80</address>
what I'm doing wrong? also, I tryed replacing data.pr4e.org by facebook.com and youtube.com and I get this output:
HTTP/1.1 301 Moved Permanently
Vary: Accept-Encoding
Location: https://facebook.com/
Content-Type: text/html; charset="utf-8"
X-FB-Debug: LPmWQm0VVptVpi8QX8/SxymrJg9ZoL/mL+W+G4pZA4HGj5WI5YIG1s8sgqwp6TIleGvUg3U1eDNEhGoCsaJG5g==
Date: Mon, 15 Feb 2021 14:52:43 GMT
Alt-Svc: h3-29=":443"; ma=3600,h3-27=":443"; ma=3600
Connection: close
Content-Length: 0
thank you
Here the problem is just that you used \n when the server expected \r\n for end of line.
Anyway, as you directly connect to the HTTP host, you should not put the full URI in the request line. This would be better on a HTTP 1.0 conformance point:
cmd = "GET /romeo.txt HTTP/1.0\r\n\r\n".encode()
But if the server could accept more that one virtual server, you should pass the name in a Host header:
cmd = "GET /romeo.txt HTTP/1.0\r\nHost: {}\r\n\r\n".format(URI).encode()

How to get missing content length for a file from url?

I am trying to write a simple download manager using python with concurrency. The aim is to use the Content-Length header from the response of a file url and splits the file into chunks then download those chunks concurrently. The idea works for all the urls which have Content-Length header but recently I came across some urls which doesn't serve a Content-Length header.
https://filesamples.com/samples/audio/mp3/sample3.mp3
HTTP/1.1 200 OK
Date: Sat, 08 Aug 2020 11:53:15 GMT
Content-Type: audio/mpeg
Transfer-Encoding: chunked
Connection: close
Set-Cookie: __cfduid=d2a4be3535695af67cb7a7efe5add19bf1596887595; expires=Mon, 07-Sep-20 11:53:15 GMT; path=/; domain=.filesamples.com; HttpOnly; SameSite=Lax
Cache-Control: public, max-age=86400
Display: staticcontent_sol, staticcontent_sol
Etag: W/"5def04f1-19d6dd-gzip"
Last-Modified: Fri, 31 Jul 2020 21:52:34 GMT
Response: 200
Vary: Accept-Encoding
Vary: User-Agent,Origin,Accept-Encoding
X-Ezoic-Cdn: Miss
X-Middleton-Display: staticcontent_sol, staticcontent_sol
X-Middleton-Response: 200
CF-Cache-Status: HIT
Age: 24
cf-request-id: 046f8413ab0000e047449da200000001
Expect-CT: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
Server: cloudflare
CF-RAY: 5bf90932ae53e047-SEA
How can I get the content-length of the file without downloading the whole file?

How to successfully download range of bytes instead of complete file using python?

I am trying to download only range of bytes of a file and I am trying the following process:
r = requests.get('https://stackoverflow.com', headers={'Range':'bytes=0-999'})
But it give status code 200 as opposed to 206 and I am getting the entire file.
I tried following this Python: How to download file using range of bytes? but it also gave me status code 200. What is the reason and how do I download files partially using python?
Headers for stackoverflow:
HTTP/1.1 200 OK
Content-Type: text/html; charset=utf-8
Last-Modified: Fri, 04 Aug 2017 05:28:29 GMT
X-Frame-Options: SAMEORIGIN
X-Request-Guid: 86fd186e-b5ac-472f-9d79-c45149343776
Strict-Transport-Security: max-age=15552000
Content-Length: 107699
Accept-Ranges: bytes
Date: Wed, 06 Sep 2017 11:48:16 GMT
Via: 1.1 varnish
Age: 0
Connection: keep-alive
X-Served-By: cache-sin18023-SIN
X-Cache: MISS
X-Cache-Hits: 0
X-Timer: S1504698496.659820,VS0,VE404
Vary: Fastly-SSL
X-DNS-Prefetch-Control: off
Set-Cookie: prov=6df82a45-2405-b1dd-1430-99e27827b360; domain=.stackoverflow.com; expires=Fri, 01-Jan-2055 00:00:00 GMT; path=/; HttpOnly
Cache-Control: private
This requires server-side support. Probably the server does not support it.
To find out, make a HEAD request and see if the response has Accept-Ranges.

Python Extract JSON from HTTP Response

Say I have the following HTTP request:
GET /4 HTTP/1.1
Host: graph.facebook.com
And the server returns the following response:
HTTP/1.1 200 OK
Access-Control-Allow-Origin: *
Cache-Control: private, no-cache, no-store, must-revalidate
Content-Type: text/javascript; charset=UTF-8
ETag: "539feb8aee5c3d20a2ebacd02db380b27243b255"
Expires: Sat, 01 Jan 2000 00:00:00 GMT
Pragma: no-cache
X-FB-Rev: 1070755
X-FB-Debug: pC4b0ONpdhLwBn6jcabovcZf44bkfKSEguNsVKuSI1I=
Date: Wed, 08 Jan 2014 01:22:36 GMT
Connection: keep-alive
Content-Length: 172
{"id":"4","name":"Mark Zuckerberg","first_name":"Mark","last_name":"Zuckerberg","link":"http:\/\/www.facebook.com\/zuck","username":"zuck","gender":"male","locale":"en_US"}
Since the Content-Lengh header depends on the length of the content, I cannot simply split by the Content-Length: 172 string. How can I extract the JSON and headers separately? They are both important to my program.
I am using this code to get the response:
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("graph.facebook.com", 80))
s.send("GET /"+str(id)+"/picture HTTP/1.1\r\nHost: graph.facebook.com\r\n\r\n")
data = s.recv(1024)
s.close()
json_string = (somehow extract this)
userdata = json.loads(json_string)
The easy way to do this is to use an HTTP library. For example:
import json
import urllib2
r = urllib2.urlopen("http://graph.facebook.com/{}/picture".format(id))
json_string = r.read()
userdata = json.loads(json_string)
If you really want to parse it yourself, the HTTP protocol guarantees that headers and body are separated by an empty line, and that this will be the first empty line anywhere in the response, so it's not that hard:
data = s.recv(1024)
header, _, json_string = data.partition('\r\n\r\n')
userdata = json.loads(json_string)
There are some obvious down sides to this—as written, your code won't work if the response is longer than 1K, or if the kernel doesn't give you the whole response in a single recv (which it's never guaranteed to do), or if the server redirects you or gives you a 100 CONTINUE before the real response, or if the server decides to send back a chunked or MIME-multipart or other response instead of a flat body, or…

Categories