python http status code - python

I'm writing my own directory buster in python, and I'm testing it against a web server of mine in a safe and secure environment. This script basically tries to retrieve common directories from a given website and, looking at the HTTP status code of the response, it is able to determine if a page is accessible or not.
As a start, the script reads a file containing all the interesting directories to be looked up, and then requests are made, in the following way:
import fileinput
import httplib

# url is the target host, defined earlier in the script
for dir in fileinput.input('utils/Directories_Common.wordlist'):
    try:
        conn = httplib.HTTPConnection(url)
        conn.request("GET", "/" + str(dir))
        toturl = 'http://' + url + '/' + str(dir)[:-1]
        print ' Trying to get: ' + toturl
        r1 = conn.getresponse()
        response = r1.read()
        print ' ', r1.status, r1.reason
        conn.close()
    except httplib.HTTPException:
        pass  # the except clause was truncated in the original post
Then the response is parsed and, if a status code equal to 200 is returned, the page is considered accessible. I've implemented all this in the following way:
if r1.status == 200:
    print '\n[!] Got it! The subdirectory ' + str(dir) + ' could be interesting..\n\n\n'
All seems fine to me, except that the script marks as accessible pages that actually aren't. In fact, the algorithm collects only the pages that return a "200 OK", but when I manually browse to those pages I find out they have been moved permanently or have restricted access. Something went wrong, but I cannot spot exactly where I should fix the code. Any help is appreciated.

I did not find any problems with your code, except that it is almost unreadable. I have rewritten it into this working snippet:
import httplib

host = 'www.google.com'
directories = ['aosicdjqwe0cd9qwe0d9q2we', 'reader', 'news']

for directory in directories:
    conn = httplib.HTTPConnection(host)
    conn.request('HEAD', '/' + directory)
    url = 'http://{0}/{1}'.format(host, directory)
    print ' Trying: {0}'.format(url)
    response = conn.getresponse()
    print ' Got: ', response.status, response.reason
    conn.close()
    if response.status == 200:
        print ("[!] The subdirectory '{0}' "
               "could be interesting.").format(directory)
Outputs:
$ python snippet.py
Trying: http://www.google.com/aosicdjqwe0cd9qwe0d9q2we
Got: 404 Not Found
Trying: http://www.google.com/reader
Got: 302 Moved Temporarily
Trying: http://www.google.com/news
Got: 200 OK
[!] The subdirectory 'news' could be interesting.
Also, I used a HEAD HTTP request instead of GET, as it is more efficient when you do not need the contents and are interested only in the status code.

I would advise you to use requests (http://docs.python-requests.org/en/latest/) for HTTP.
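For example, the same kind of check with requests might look like this (a minimal sketch; the host and the directory list are placeholders, not taken from the original question):

import requests

host = 'www.example.com'  # placeholder host
for directory in ['reader', 'news']:  # placeholder wordlist
    url = 'http://{0}/{1}'.format(host, directory)
    response = requests.head(url)
    print("{0} -> {1}".format(url, response.status_code))
    if response.status_code == 200:
        print("[!] The subdirectory '{0}' could be interesting.".format(directory))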

Related

How to make Python go through URLs in a text file, check their status codes, and exclude all ones with 404 error?

I tried the following script, but unfortunately the output file is identical to the input file. I'm not sure what's wrong with it.
import requests

url_lines = open('banana1.txt').read().splitlines()
remove_from_urls = []

for url in url_lines:
    remove_url = requests.get(url)
    print(remove_url.status_code)
    if remove_url.status_code == 404:
        remove_from_urls.append(url)
        continue

url_lines = [url for url in url_lines if url not in remove_from_urls]
print(url_lines)

# Save urls example
with open('banana2.txt', 'w+') as file:
    for item in url_lines:
        file.write(item + '\n')
There seems to be no error in your code, but there are a few things that would help make it more readable and consistent. The first course of action should be to make sure there is at least one URL that would actually return a 404 status code.
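For a quick sanity check, a service like httpbin can produce a known 404 (a one-off test, not part of the original script):

import requests

# httpbin.org/status/404 always responds with 404, which is handy for testing
print(requests.get("https://httpbin.org/status/404").status_code)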
Edit: after the actual URL was provided.
The 404 problem
In your case, the problem is that Twitter does not actually return a 404 error for your "Not found" URL. You can test it using curl:
$ curl -o /dev/null -w "%{http_code}" "https://twitter.com/davemeltzerWON/status/1321279214365016064"
200
Or using Python:
import requests
response = requests.get("https://twitter.com/davemeltzerWON/status/1321279214365016064")
print(response.status_code)
The output for both should be 200.
Since Twitter is a JavaScript application that loads its content after it has been processed in the browser, you cannot find the information you are looking for in the HTML response. You would need to use something like Selenium to actually process the JavaScript for you, and then you would be able to look for the actual text, like "not found", on the web page.
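A rough sketch of that approach (it assumes Selenium and a Chrome driver are installed; the sleep duration and the exact "not found" text to search for are assumptions):

import time
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://twitter.com/davemeltzerWON/status/1321279214365016064")
time.sleep(5)  # crude wait so the JavaScript can render the page
page_not_found = "not found" in driver.page_source.lower()
driver.quit()
print(page_not_found)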
Code review
Please make sure to close the file properly. Also, a file object is an iterator over its lines, so you can collect it into a set very easily. Another trick to make the code more readable is to use a Python set. So you may read the file like this:
with open("banana1.txt") as fid:
    # strip the trailing newlines so the URLs can be used as-is
    url_lines = {line.strip() for line in fid}
Then you simply remove all the links that do not work:
not_working = set()
for url in url_lines:
    if requests.get(url).status_code == 404:
        not_working.add(url)

working = url_lines - not_working
with open("banana2.txt", "w") as fid:
    fid.write("\n".join(working))
Also, if some of the links point to the same server, you should make use of requests.Session class:
from requests import Session
session = Session()
Then replace requests.get with session.get; you should get a performance boost, since a Session uses keep-alive connections, among other things.
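Putting it together, the filtering loop with a session might look like this (a sketch reusing the names from the snippets above):

from requests import Session

session = Session()
not_working = set()
for url in url_lines:
    # keep-alive: repeated requests to the same host reuse the connection
    if session.get(url).status_code == 404:
        not_working.add(url)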

Python - Codec Issue with the video file downloaded

I am trying to download a video that has been uploaded to the cloud, and I am using APIs to extract the data.
The Python script seems to download the file fine, but when I open the video it throws an error.
I have tried using different players (VLC, Windows Media Player, etc.) to play the video, but do not have any luck. Can someone please help? Here is the code:
import json
import os
import requests
from pathlib import Path

# res, root, token and sess_headers are set up earlier in the script
if res.status_code == 200:
    body = res.json()
    for meeting in body["meetings"]:
        try:
            password = requests.get(
                f"{root}meetings/{meeting['uuid']}/recordings/settings?access_token={token}").json()["password"]
            url = f"https://api.zoom.us/v2/meetings/{meeting['uuid']}/recordings/settings?access_token={token}"
            res = requests.patch(
                url,
                data=json.dumps({"password": ""}),
                headers=sess_headers)
        except:
            pass
        topic = meeting["topic"]
        try:
            os.makedirs("downloads")
        except:
            pass
        for i, recording in enumerate(meeting["recording_files"]):
            # os.makedirs(topic)
            download_url = recording["download_url"]
            name = recording["recording_start"] + "-" + meeting["topic"]
            ext = recording["file_type"]
            filename = f"{name}.{ext}"
            path = f'./downloads/{filename}'.replace(":", ".")
            res = requests.get(download_url, headers=sess_headers)
            with open(Path(path), 'wb') as f:
                f.write(res.content)
else:
    print(res.text)
One possible problem is this:
After each res = requests.get(...), you need to insert the line res.raise_for_status().
This is needed to check that the status code was 200.
By default, requests doesn't throw anything if the status code is not 200, hence your res.content may be an invalid response body in case of a bad status code.
If you call res.raise_for_status(), then requests will throw an error if the status code is not 200, saving you from possible problems.
Note, though, that a status code of 200 doesn't necessarily mean there was no error: some servers respond with an HTML error description and status code 200.
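In code, the check is a one-liner after each request (shown here with a placeholder URL):

import requests

res = requests.get("https://example.com/recording.mp4")  # placeholder URL
res.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses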
Another possible problem could be that the download URL is missing the authorization token, in which case you need to provide it through headers. So instead of the last requests.get(...), put this code:
res = requests.get(download_url, headers={
    **sess_headers, 'Authorization': 'Bearer ' + token})
Also, you need to check what content type the resulting response has, so after the last res = requests.get(...), do this:
print('headers:', res.headers)
and check what is inside. Specifically, look at the Content-Type field: it should be some binary type like application/octet-stream or video/mp4, and definitely not a text format like application/json or text/html; a text-format file is definitely not a video file. If it is text/html, then try renaming the file to test.html and opening it in a browser to see what's there; the server probably responded with some error inside that HTML.
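A small sketch of that check (the list of acceptable types here is an assumption):

import requests

res = requests.get("https://example.com/recording.mp4")  # placeholder URL
content_type = res.headers.get("Content-Type", "")
if not content_type.startswith(("video/", "application/octet-stream")):
    print("Suspicious content type:", content_type)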
Also, visually compare the content of the two files in some viewer: the one downloaded by the script and one downloaded by some other downloader (e.g. a browser). Maybe there is some obvious problem visible by eye.
Also, the file size should be quite big for a video. If it is something like 50 KB, then it probably contains bad data.
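A quick way to eyeball the size (the path is a placeholder; substitute the file the script just wrote):

import os

size = os.path.getsize("./downloads/recording.mp4")  # placeholder path
print("Downloaded {} bytes".format(size))  # a few KB is suspicious for a video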
UPDATE:
The solution that finally worked was replacing the last requests.get(...) with this line:
res = requests.get(download_url + '?access_token=' + token, headers=sess_headers)

Why do I get two different status code from conn.getresponse().status in python?

So I want to check whether a URL is reachable from Python, and I got this code from googling:
from urllib.parse import urlparse
import http.client

def checkUrl(url):
    p = urlparse(url)
    conn = http.client.HTTPConnection(p.netloc)
    conn.request('HEAD', p.path)
    resp = conn.getresponse()
    return resp.status < 400
Here is my URL: https://eurotableau.nomisonline.com.
It works fine if I just pass that in to the function; the resp.status is 302. However, if I add port 443 at the end of it, https://eurotableau.nomisonline.com:443, it returns False, and the resp.status is 400. I tried both URLs in Google Chrome and both of them work. So my question is, why is this happening? Is there any way I can include the port value and still get a valid resp.status value (< 400)? Thanks.
Use http.client.HTTPSConnection instead. The plain old HTTPConnection ignores the protocol that is part of the URL.
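A sketch of the function with that fix applied (the or '/' fallback for an empty path is an addition, not part of the original code):

from urllib.parse import urlparse
import http.client

def checkUrl(url):
    p = urlparse(url)
    # HTTPSConnection speaks TLS and defaults to port 443;
    # an explicit ":443" in the netloc is handled correctly
    conn = http.client.HTTPSConnection(p.netloc)
    conn.request('HEAD', p.path or '/')
    resp = conn.getresponse()
    return resp.status < 400

print(checkUrl("https://eurotableau.nomisonline.com:443"))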
If you do not require the HEAD method but just wish to check if the host is available, then why not do:
from urllib2 import urlopen

try:
    u = urlopen("https://eurotableau.nomisonline.com")
    u.close()
    print "Everything fine!"
except Exception, e:
    if hasattr(e, "code"):
        print "Server is there but something is wrong with rest of URL"
    else:
        print "Server is on vacations or was never there!"
    print e
This will establish a connection with the server, but it won't download any data unless you read it. It'll only read a few KB to get the headers (as when using the HEAD method) and wait for you to request more, but you close the connection right there.
So, you can catch an exception and see what the problem is, or if there is no exception, just close the connection.
urllib2 will handle HTTPS and protocol://user#URL:PORT for you neatly.
No worries about anything.

HTTping in python

So my Python program needs to be able to ping a website to see if it is up or not. I have made the ping program, and then found out that this site only responds to httping (HTTP ping). After doing some googling about it, I found almost nothing on the subject. Has anybody done httping in Python before? If so, how did you do it? Thanks for the help.
Here is my code for the normal ping (which works, but not for the site I need it to work for):
import os

hostname = "sitename"
response = os.system("ping -c 1 " + hostname)
if response == 0:
    print "good"
else:
    print "bad"
Use requests to make an HTTP HEAD request.
import requests

response = requests.head("http://www.example.com/")
if response.status_code == 200:
    print(response.elapsed)
else:
    print("did not return 200 OK")
Output:
0:00:00.238418
To check the accessibility of an HTTP server, you can make a GET request to the URL, which is as easy as:
import urllib2
response = urllib2.urlopen("http://example.com/foo/bar")
print response.getcode()

How do I get HTTP header info without authentication using python?

I'm trying to write a small program that will simply display the header information of a website. Here is the code:
import urllib2

url = 'http://some.ip.add.ress/'
request = urllib2.Request(url)
try:
    html = urllib2.urlopen(request)
except urllib2.URLError, e:
    print e.code
else:
    print html.info()
If 'some.ip.add.ress' is google.com, then the header information is returned without a problem. However, if it's an IP address that requires basic authentication before access, then it returns a 401. Is there a way to get the header (or any other) information without authentication?
I've worked it out.
After the try has failed due to unauthorized access, the following modification will print the header information:
print e.info()
instead of:
print e.code
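In context, the try/except becomes something like this (a sketch; .info() is available because a 401 raises urllib2.HTTPError, the URLError subclass that carries the response headers):

import urllib2

request = urllib2.Request('http://some.ip.add.ress/')
try:
    html = urllib2.urlopen(request)
except urllib2.HTTPError, e:
    # even for a 401, the error response still carries the server's headers
    print e.info()
else:
    print html.info()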
Thanks for looking :)
If you want just the headers, then instead of using urllib2 you should go lower level and use httplib:
import httplib

# host and path are the two halves of the URL, e.g. 'some.ip.add.ress' and '/'
conn = httplib.HTTPConnection(host)
conn.request("HEAD", path)
print conn.getresponse().getheaders()
If all you want are the HTTP headers, then you should make a HEAD request, not a GET request. You can see how to do this by reading Python - HEAD request with urllib2.
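The usual pattern from that question is to subclass urllib2.Request and override get_method (a sketch, not quoted verbatim from the linked answer):

import urllib2

class HeadRequest(urllib2.Request):
    # urlopen calls get_method() to pick the HTTP verb, so this
    # turns an ordinary Request into a HEAD request
    def get_method(self):
        return "HEAD"

response = urllib2.urlopen(HeadRequest('http://some.ip.add.ress/'))
print response.info()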
