I wrote a simple Python program to get some information from a website.
But when I run the code below, it always returns the following 301 response, even though my browser can visit the website without any problem.
Please tell me why this happens and how to change my code to avoid the problem.
HTTP/1.1 301 Moved Permanently
Date: Tue, 28 Aug 2018 14:26:20 GMT
Server: Apache
Referrer-Policy: origin-when-cross-origin
Location: https://www.ncbi.nlm.nih.gov/
Content-Length: 237
Connection: close
Content-Type: text/html; charset=iso-8859-1
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>301 Moved Permanently</title>
</head><body>
<h1>Moved Permanently</h1>
<p>The document has moved here.</p>
</body></html>
import socket
searcher = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
searcher.connect(("www.ncbi.nlm.nih.gov", 80))
cmd = "GET https://www.ncbi.nlm.nih.gov/ HTTP/1.0\r\n\r\n".encode()
searcher.send(cmd)
while True:
    data = searcher.recv(512)
    if len(data)<1: break
    print(data.decode())
searcher.close()
You receive a 301 because the site is redirecting you to its HTTPS version.
I don't know whether using sockets is mandatory, but if not, you can use requests, an easy-to-use library for making HTTP requests:
import requests
req = requests.get("http://www.ncbi.nlm.nih.gov")
html = req.text
With this, the 301 redirect still happens, but requests follows it transparently.
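To make the redirect visible, here is a minimal sketch that pulls the status code and the Location header out of a raw response like the one in the question. The helper name `parse_redirect` is made up for illustration, and the parsing is deliberately simplistic (it assumes CRLF line endings and ignores repeated headers).

```python
def parse_redirect(raw_response: bytes):
    """Return (status_code, location) from a raw HTTP response.

    location is None when the response carries no Location header.
    """
    # Headers end at the first blank line; everything after it is the body.
    head, _, _body = raw_response.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")

    # The status line looks like: HTTP/1.1 301 Moved Permanently
    status_code = int(lines[0].split(" ", 2)[1])

    location = None
    for line in lines[1:]:
        name, _, value = line.partition(":")
        if name.strip().lower() == "location":
            location = value.strip()
            break
    return status_code, location


# Shortened copy of the response shown in the question.
sample = (
    b"HTTP/1.1 301 Moved Permanently\r\n"
    b"Server: Apache\r\n"
    b"Location: https://www.ncbi.nlm.nih.gov/\r\n"
    b"Content-Length: 237\r\n"
    b"Connection: close\r\n"
    b"\r\n"
    b"<html>...</html>"
)
print(parse_redirect(sample))
```

A status in the 3xx range together with a Location value means the client is expected to repeat the request at that URL, which is the step requests performs for you.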
If you want to do it with sockets, you should add the "ssl layer" manually:
import socket
import ssl

host = "www.ncbi.nlm.nih.gov"
# ssl.wrap_socket() was removed in Python 3.12; an SSLContext replaces it.
# create_default_context() also verifies the server certificate.
context = ssl.create_default_context()
raw = socket.create_connection((host, 443))
searcher = context.wrap_socket(raw, server_hostname=host)
# Request only the path and name the host in a Host header.
cmd = "GET / HTTP/1.0\r\nHost: {}\r\n\r\n".format(host).encode()
searcher.sendall(cmd)
while True:
    data = searcher.recv(512)
    if len(data) < 1: break
    print(data.decode(errors="replace"))
searcher.close()
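One caveat with the receive loops in this thread: decoding each 512-byte chunk on its own can fail if a multi-byte character straddles a chunk boundary. A possible variant, sketched here with a made-up helper name `recv_all`, collects the raw bytes first and decodes once at the end. The demonstration uses a local `socket.socketpair()` instead of a real server, so the payload is invented.

```python
import socket


def recv_all(sock, chunk_size=512):
    """Read from sock until the peer closes the connection; return all bytes."""
    chunks = []
    while True:
        data = sock.recv(chunk_size)
        if not data:          # b"" means the other side closed the connection
            break
        chunks.append(data)
    return b"".join(chunks)


# Local demonstration with a connected socket pair instead of a real server.
left, right = socket.socketpair()
payload = "HTTP/1.0 200 OK\r\n\r\nnaïve café".encode("utf-8") * 50
left.sendall(payload)
left.close()                  # closing signals end-of-stream to the reader
text = recv_all(right).decode("utf-8")
right.close()
print(len(text))
```

The same function should work unchanged on a TLS-wrapped socket, since that exposes the same `recv` method.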
How do I get an HTML page into Python code using a socket? I was able to implement this with the requests library, but it needs to be rewritten to use sockets, and I don't understand how. The implementation using requests is below, followed by my attempt at a socket version, pieced together from Google. That attempt is not correct at all. Please help me implement this using sockets.
import requests
reg_get = requests.get("https://stackoverflow.blog/")
text = reg_get.text
print(text)
import socket
request = b"GET / HTTP/1.1\nHost: https://stackoverflow.blog/\n\n"
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("https://stackoverflow.blog/", 80))
s.send(request)
result = s.recv(10000)
while (len(result) > 0):
    print(result)
    result = s.recv(10000)
After reading the comments, I rewrote the code as follows. However, I still did not get the HTML; I only received header information about the site. How do I get the HTML in Python?
import socket
import ssl
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
request = "GET /r/AccidentalRenaissance/comments/8ciibe/mr_fluffies_betrayal/ HTTP/1.1\r\nHost: www.reddit.com\r\n\r\n"
context = ssl.SSLContext(ssl.PROTOCOL_TLSv1_2)
s = context.wrap_socket(sock, server_hostname = "www.reddit.com")
s.connect(("www.reddit.com", 443))
s.sendall(request.encode())
contest = s.recv(1024).decode()
s.close()
print(contest)
Result:
HTTP/1.1 200 OK
Connection: keep-alive
Cache-control: private, s-maxage=0, max-age=0, must-revalidate, no-store
Content-Type: text/html; charset=utf-8
X-Frame-Options: SAMEORIGIN
Accept-Ranges: bytes
Date: Sun, 03 Oct 2021 03:34:25 GMT
Via: 1.1 varnish
Vary: Accept-Encoding, Accept-Encoding
A URL is composed of a protocol, a hostname, an optional port, and an optional path. In the URL http://stackoverflow.blog/, http is the protocol, stackoverflow.blog is the hostname, no port is given, and the path is /. For http, the port defaults to 80 and the path defaults to /. When using sockets, first establish a connection to the host at that port using connect, then send an HTTP command to retrieve the page at the path. The HTTP command to retrieve the page is "GET /"; after sending it, receive the response from the server.
Note that I used http instead of https because https adds security setup and negotiation that happens after the connect completes but before the "GET /" is sent. It is quite complicated, which is a good reason to use Requests instead of trying to implement it yourself. If you don't want to use Requests but also don't want to go down to the level of sockets, take a look at urllib3.
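The URL breakdown described above can be sketched with the standard library's `urllib.parse`. The helper name `split_url` and the `DEFAULT_PORTS` table are illustrative choices, and the sketch ignores query strings and fragments.

```python
from urllib.parse import urlsplit

# Well-known default ports for the two schemes discussed here.
DEFAULT_PORTS = {"http": 80, "https": 443}


def split_url(url):
    """Break a URL into (scheme, host, port, path) with defaults filled in."""
    parts = urlsplit(url)
    port = parts.port or DEFAULT_PORTS[parts.scheme]
    path = parts.path or "/"
    return parts.scheme, parts.hostname, port, path


print(split_url("http://stackoverflow.blog/"))
print(split_url("https://www.reddit.com:8443/r/python"))
```

The host and port go to `connect()`, only the path belongs in the request line, and the bare hostname goes in the Host header. That is why passing a full URL such as "https://stackoverflow.blog/" to `connect()` cannot work.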
I'm using a socket to build a simple "web browser", but I'm stuck at the start with a Bad Request result. Here is my code:
import socket
mysocket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
URI = 'data.pr4e.org'
mysocket.connect((URI, 80))
cmd = "GET http://{0}/romeo.txt HTTP/1.0\n\n".format(URI).encode()
mysocket.send(cmd) # send a request
while True:
    data = mysocket.recv(512) # receive 512 bytes at a time
    # if there is no more data to receive, leave the loop
    if (len(data) < 1):
        break
    print(data.decode())
    pass
mysocket.close() # close connection
Here is the output:
HTTP/1.1 400 Bad Request
Date: Mon, 15 Feb 2021 14:36:06 GMT
Server: Apache/2.4.18 (Ubuntu)
Content-Length: 308
Connection: close
Content-Type: text/html; charset=iso-8859-1
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>400 Bad Request</title>
</head><body>
<h1>Bad Request</
h1>
<p>Your browser sent a request that this server could not understand.<br />
</p>
<hr>
<address>Apache/2.4.18 (Ubuntu) Server at do1.dr-chuck.com Port 80</address>
What am I doing wrong? Also, I tried replacing data.pr4e.org with facebook.com and youtube.com, and I get this output:
HTTP/1.1 301 Moved Permanently
Vary: Accept-Encoding
Location: https://facebook.com/
Content-Type: text/html; charset="utf-8"
X-FB-Debug: LPmWQm0VVptVpi8QX8/SxymrJg9ZoL/mL+W+G4pZA4HGj5WI5YIG1s8sgqwp6TIleGvUg3U1eDNEhGoCsaJG5g==
Date: Mon, 15 Feb 2021 14:52:43 GMT
Alt-Svc: h3-29=":443"; ma=3600,h3-27=":443"; ma=3600
Connection: close
Content-Length: 0
thank you
Here the problem is just that you used \n when the server expected \r\n for end of line.
Also, since you connect directly to the HTTP host, you should not put the full URI in the request line. This would conform better to HTTP 1.0:
cmd = "GET /romeo.txt HTTP/1.0\r\n\r\n".encode()
But if the server could host more than one virtual server, you should pass the name in a Host header:
cmd = "GET /romeo.txt HTTP/1.0\r\nHost: {}\r\n\r\n".format(URI).encode()
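Putting both points together, a small request builder might look like the sketch below. The function name `build_get_request` is hypothetical, and the `Connection: close` header is an extra assumption so that a receive loop ends when the server is done.

```python
def build_get_request(host, path="/", version="1.0"):
    """Return the bytes of a minimal GET request with CRLF line endings."""
    lines = [
        "GET {} HTTP/{}".format(path, version),
        "Host: {}".format(host),
        "Connection: close",   # ask the server to close after the response
    ]
    # Each header line ends with CRLF; an empty line terminates the header block.
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")


request = build_get_request("data.pr4e.org", "/romeo.txt")
print(request)
```

The result can be passed straight to `mysocket.sendall(request)` in place of the hand-written `cmd`.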
I just want to write the following simple program, which tries to create an HTTPS tunnel to www.google.com on port 443. I first tried the following code:
import socket
def main():
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.connect(("www.google.com", 80))
    request = "CONNECT www.google.com:443 HTTP/1.1\n\n"
    s.send(request.encode())
    print(s.recv(4096).decode())
main()
The result of that was the following:
HTTP/1.1 405 Method Not Allowed
Content-Type: text/html; charset=UTF-8
Referrer-Policy: no-referrer
Content-Length: 1592
Date: Wed, 16 Aug 2017 07:56:14 GMT
Connection: close
<!DOCTYPE html>
<html lang=en>
<meta charset=utf-8>
<meta name=viewport content="initial-scale=1, minimum-scale=1, width=device-width">
<title>Error 405 (Method Not Allowed)!!1</title>
<style>
*{margin:0;padding:0}html,code{font:15px/22px arial,sans-serif}html{background:#fff;color:#222;padding:15px}body{margin:7% auto 0;max-width:390px;min-height:180px;padding:30px 0 15px}* > body{background:url(//www.google.com/images/errors/robot.png) 100% 5px no-repeat;padding-right:205px}p{margin:11px 0 22px;overflow:hidden}ins{color:#777;text-decoration:none}a img{border:0}@media screen and (max-width:772px){body{background:none;margin-top:0;max-width:none;padding-right:0}}#logo{background:url(//www.google.com/images/branding/googlelogo/1x/googlelogo_color_150x54dp.png) no-repeat;margin-left:-5px}@media only screen and (min-resolution:192dpi){#logo{background:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) no-repeat 0% 0%/100% 100%;-moz-border-image:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) 0}}@media only screen and (-webkit-min-device-pixel-ratio:2){#logo{background:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) no-repeat;-webkit-background-size:100% 100%}}#logo{display:inline-block;height:54px;width:150px}
</style>
<a href=//www.google.com/><span id=logo aria-label=Google></span></a>
<p><b>405.</b> <ins>That’s an error.</ins>
<p>The request method <code>CONNECT</code> is inappropriate for the URL <code>/</code>. <ins>That’s all we know.</ins>
That means that the server does not allow this request to be executed. So I thought that the problem was the port number, and I changed it to 443 (which is the port for HTTPS connections). The code is:
def main():
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.connect(("www.google.com", 443))
    request = "CONNECT www.google.com:443 HTTP/1.1\n\n"
    s.send(request.encode())
    print(s.recv(4096).decode())
main()
But it does not print a valid response as it should. It gives me an empty response.
The question to that is: "Why is that happening? How can I make it work properly?"
Note: I don't want to use built-in urllib or urllib2 libraries. I want to do that with sockets.
HTTP
In your original connection to port 80, you are just using the wrong host:
import socket
def main():
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.connect(('google.com', 80))
    request = b'CONNECT google.com HTTP/1.1\n\n'
    s.send(request)
    print(s.recv(4096).decode())
main()
Response:
HTTP/1.0 200 Connection established
Or use the GET method right away:
request = b'GET http://google.com HTTP/1.1\n\n'
Response is the same as to HTTPS request, google.com host doesn't work for some reason.
HTTPS
To connect using HTTPS, you should wrap your socket in an SSL/TLS layer (not sure if that is the correct term); the GET method is then ready to use right after connecting:
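import socket
import ssl
def main():
    # ssl.wrap_socket() was removed in Python 3.12; use an SSLContext instead
    context = ssl.create_default_context()
    s = socket.create_connection(('google.com', 443))
    s = context.wrap_socket(s, server_hostname='google.com')
    # request line takes the path; the host goes in the Host header
    request = b'GET / HTTP/1.1\r\nHost: google.com\r\nConnection: close\r\n\r\n'
    s.sendall(request)
    print(s.recv(4096).decode())
    s.close()
main()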
Response:
HTTP/1.1 302 Found
Cache-Control: private
Content-Type: text/html; charset=UTF-8
Referrer-Policy: no-referrer
Location: https://www.google.ru/?gfe_rd=cr&ei=WwCUWc66L6qB3APs7ZPABA
Content-Length: 259
Date: Wed, 16 Aug 2017 08:20:43 GMT
Alt-Svc: quic=":443"; ma=2592000; v="39,38,37,35"
<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
<TITLE>302 Moved</TITLE></HEAD><BODY>
<H1>302 Moved</H1>
The document has moved
here.
</BODY></HTML>
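For reference, on current Python versions the TLS wrapping step is usually done through an `ssl.SSLContext`. The sketch below is one way to set that up; the helper names `make_client_context` and `open_tls` are made up, and the TLS 1.2 minimum is an assumption rather than something the thread requires.

```python
import socket
import ssl


def make_client_context():
    """Build a TLS client context that verifies certificates and hostnames."""
    context = ssl.create_default_context()
    # Refuse anything older than TLS 1.2 (an assumption; adjust if needed).
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context


def open_tls(host, port=443, timeout=10):
    """Connect to host:port and return a TLS-wrapped socket."""
    raw = socket.create_connection((host, port), timeout=timeout)
    # server_hostname enables SNI and hostname verification.
    return make_client_context().wrap_socket(raw, server_hostname=host)


context = make_client_context()
print(context.verify_mode, context.check_hostname)
```

Calling `open_tls('google.com')` would then return a socket on which the GET request can be sent; that call needs network access, so it is left out here.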
I am trying to go to http://www.py4inf.com/code/romeo.txt, read the contents of romeo.txt, and print them back out. I am using Python 3.6.1.
import socket
mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect(('www.py4inf.com', 80))
mysock.send('GET http://www.py4inf.com/code/romeo.txt HTTP/1.0\n\n'.encode("utf8"))
while True:
    data = mysock.recv(512)
    if ( len(data) < 1 ) :
        break
    print (data.decode("utf8"))
mysock.close()
Instead of the contents of the page, it prints out:
HTTP/1.1 404 Not Found
Server: nginx
Date: Wed, 21 Jun 2017 03:00:15 GMT
Content-Type: text/html
Content-Length: 162
Connection: close
<html>
<head><title>404 Not Found</title></head>
<body bgcolor="white">
<center><h1>404 Not Found</h1></center>
<hr><center>nginx</center>
</body>
</html>
Why is this? Thanks in advance
In theory, the Host header is only mandatory from HTTP 1.1 onwards, but it appears that particular server requires the Host header to be present, even for HTTP 1.0. I'm not sure if that's the default behaviour of Nginx, or whether the server admin's explicitly configured it that way.
In any case, try changing your request to the following:
mysock.send('GET http://www.py4inf.com/code/romeo.txt HTTP/1.0\r\nHost: www.py4inf.com\r\n\r\n'.encode("utf8"))
I can understand your confusion - IMHO, it should be returning 400 not 404 if it is insisting on the Host header being provided (since it's a client request issue, not a matter of the resource not existing).
I am trying to simulate Network Address Translation for some test code. I am mapping virtual users to high port numbers, then, when I want to test a request from user 1, I send it from port 6000 (user 2, from 6001, etc).
However, I can't see the port number in the response.
connection = httplib.HTTPConnection("the.company.lan", port=80, strict=False,
timeout=10, source_address=("10.129.38.51", 6000))
connection.request("GET", "/portal/index.html")
httpResponse = connection.getresponse()
connection.close()
httpResponse.status is 200, but I don't see the port number anywhere in the response headers.
Maybe I should be using some lower level socket functionality? If so, which is simplest and supports both HTTP and FTP? Btw, I only want to use built-in modules, nothing which I have to install.
[Update] I should have made it clearer; I really do need to get the actual port number received in the response, not just remember it.
To complete @TimSpence's answer, you can use a socket object as the interface for your connection and then use some API to treat the received data as an HTTP object.
import socket

MAX_CONNECTIONS = 5  # listen backlog; pick a value that suits your test
host = 'xxx.xxx.xxx.xxx'
port = 80
address = (host, port)
## socket object interface for a TCP connection
listener_socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM,
                                socket.IPPROTO_TCP)
listener_socket.bind(address)
listener_socket.listen(MAX_CONNECTIONS)
## new_connection is the connection object you use to handle the data exchange
## source_address (source_host, source_port) is the address object
## source_port is what you're looking for
new_connection, source_address = listener_socket.accept()
data = new_connection.recv(65536)
## handle data as an HTTP object (for example as an HTTP request)
new_connection.close()
HTTP messages do not contain anything about ports, so the httpResponse will not have that information.
However, you will need a different connection object (which maps to a different underlying socket) for each request anyway, so you can get that information from the HTTPConnection object.
_, port = connection.source_address
Does that help?
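To illustrate where the source port actually lives, here is a small loopback sketch using only the built-in socket module. The port is visible at the socket level on both ends: the server gets it from `accept()`, and the client can read it back with `getsockname()`. Both endpoints here are local stand-ins, not the setup from the question.

```python
import socket

# Stand-in for the server: listen on an ephemeral loopback port.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(1)

# Stand-in for the client: port 0 lets the OS pick a free source port.
# A fixed port such as 6000 from the question would go here instead.
client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.bind(("127.0.0.1", 0))
client.connect(server.getsockname())

# accept() reports the peer address, including the client's source port.
connection, (peer_host, peer_port) = server.accept()
client_port = client.getsockname()[1]
print(peer_port == client_port)

connection.close()
client.close()
server.close()
```

So the port never appears in the HTTP response itself; if the client needs to see it echoed back, the server has to put it into the response explicitly.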
Considering your comments, I had to provide a new answer.
I thought you could also put a non-standard Host header in your HTTP response, 'Host: domain/IP:port', so that your client can read it when it receives the response.
Server Response:
HTTP/1.1 200 OK
Date: Day, DD Month YYYY HH:MM:SS GMT
Content-Type: text/html; charset=UTF-8
Content-Encoding: UTF-8
Content-Length: LENGTH
Last-Modified: Day, DD Month YYYY HH:MM:SS GMT
Server: Name/Version (Platform)
Accept-Ranges: bytes
Connection: close
Host: domain/IP:port #example: the.company.lan:80
<html>
<head>
<title>Example Response</title>
</head>
<body>
Hello World!
</body>
</html>
Client:
connection = httplib.HTTPConnection("the.company.lan", port=80,
                                    strict=False, timeout=10,
                                    source_address=("10.129.38.51", 6000))
connection.request("GET", "/portal/index.html")
httpResponse = connection.getresponse()
## store a dict with the response headers
## extract your custom header 'host'
res_headers = dict(httpResponse.getheaders())
server_address = tuple(res_headers['host'].split(':'))
## read the response content
HTMLData = httpResponse.read()
connection.close()
This way you get server_address as a tuple (domain, port).
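Note that a plain `split(':')` leaves the port as a string. If a numeric port is wanted, a slightly more careful parse could look like this sketch; the helper name `parse_host_port` is hypothetical, and IPv6 literals are not handled.

```python
def parse_host_port(value, default_port=80):
    """Split 'host:port' into (host, port) with the port as an int."""
    host, sep, port = value.strip().rpartition(":")
    if not sep:                      # no colon: the whole value is the host
        return value.strip(), default_port
    return host, int(port)


print(parse_host_port("the.company.lan:80"))
print(parse_host_port("the.company.lan"))
```

The first call gives the host together with the integer 80; the second falls back to the default port because the header value carries none.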