How to make my bot skip over urls that don't exist - python

Hey guys, I was wondering if there is a way to make my bot skip invalid URLs after one try and continue with the for loop, but continue doesn't seem to work:
import requests

def check_valid(stripped_results):
    global vstripped_results
    vstripped_results = []
    for tag in stripped_results:
        conn = requests.head("https://" + tag)
        conn2 = requests.head("http://" + tag)
        status_code = conn.status_code
        website_is_up = status_code == 200
        if website_is_up:
            vstripped_results.append(tag)
        else:
            continue
stripped_results is a list of an unknown number of domains and subdomains, which is why I have the 'https://' part, and tbh I'm not even sure whether my if statement is effective or not.
Any help would be greatly appreciated; I don't want to get rate limited by Discord anymore from sending so many invalid domains through. :(

This is easy. To check the validity of a URL there exists a Python library, namely validators. This library can be used to check whether a URL is valid or not. Let's take it step by step.
Firstly,
Here is the documentation link for validators:
https://validators.readthedocs.io/en/latest/
How do you validate a link using validators?
It is simple. Let's work on command line for a moment.
This module gives out a boolean-style result on whether it is a valid link or not. For the link of this question it gave out True, and when the link is not valid it gives you back the error object instead.
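Roughly, that interactive session looks like this (the URLs here are just placeholders, and the exact failure object depends on the installed validators version):

    >>> import validators
    >>> validators.url("https://stackoverflow.com")
    True
    >>> validators.url("not a real url")
    ValidationFailure(func=url, ...)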
You can validate it using this syntax:
validators.url('Add your URL variable here')
Remember that this gives a boolean-style value, so code for it that way.
So you can use it that way in your own loop.
I won't implement it in your code, as I want you to try it yourself once. I'm happy to help if you get stuck.
Thank You! :)

Try this?
import requests

def check_valid(stripped_results):
    global vstripped_results
    vstripped_results = []
    for tag in stripped_results:
        conn = requests.head("https://" + tag)
        conn2 = requests.head("http://" + tag)
        status_code = conn.status_code
        website_is_up = status_code == 200
        if website_is_up:
            vstripped_results.append(tag)
        else:
            # Do the thing here
            continue
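Even with the code above, a domain that does not resolve at all will make requests.head raise an exception before any status code comes back, which is what breaks the for loop. A minimal sketch of skipping such domains after one try (the timeout value is just an assumption, and only the https attempt is kept here):

    import requests

    def check_valid(stripped_results):
        global vstripped_results
        vstripped_results = []
        for tag in stripped_results:
            try:
                # One attempt per domain; unreachable hosts raise instead of returning a status code
                conn = requests.head("https://" + tag, timeout=5)
            except requests.exceptions.RequestException:
                # DNS failure, timeout, refused connection, etc. - skip and move on
                continue
            if conn.status_code == 200:
                vstripped_results.append(tag)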


Trying to check if the webdirectory is showing the same thing as index.html

I'm doing a blackbox penetration training. Last time I asked a question about SQL injection, and so far I'm making progress on it; I was able to retrieve the database and the column.
This time I need to find the admin login, so I used dirsearch for that. I checked each web directory from dirsearch, and sometimes it would show the same page as index.html.
So I'm trying to fix this by automating the process with a script:
import requests

url = "http://depedqc.ph"
webdirectory_path = "C:/PentestingLabs/Dirsearch/reports/depedqc.ph/scanned_webdirectory9-3-2022.txt"

index = requests.get(url)
same = index.content

for webdirectory in open(webdirectory_path, "r").readlines():
    webdirectory_split = webdirectory.split()
    result = [i for i in webdirectory_split if i.startswith(url)]
    result = ''.join(result)
    print(result)
    response = requests.get(result)
    if response.content == same:
        print("same content")
Only problem is, I get this error:
Invalid URL '': No scheme supplied. Perhaps you meant http://?
Even though the printed result is: http://depedqc.ph/html
What am I doing wrong here? I appreciate any feedback.
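The error message is the clue: some lines in the dirsearch report evidently contain no token starting with the url, so result ends up as an empty string and requests.get('') fails with "No scheme supplied" before the comparison ever runs. A small sketch of one way to guard against that (same logic as the loop above, just skipping empty results):

    for webdirectory in open(webdirectory_path, "r").readlines():
        webdirectory_split = webdirectory.split()
        result = ''.join(i for i in webdirectory_split if i.startswith(url))
        if not result:
            # Nothing on this line looked like a URL; skip it instead of calling requests.get("")
            continue
        print(result)
        response = requests.get(result)
        if response.content == same:
            print("same content")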

if <var> is None: doesn't seem to work

Disclaimer: I am very new to Python and have no idea what I am doing; I am teaching myself from the web.
I have some code that looks like this:
Code:
from requests import get  # Ed: added for clarity; the module is requests
import sys

myurl = URLBASE + args.key
response = get(myurl)
# check key is valid
json = response.text  # Ed: response is a requests.Response object; .text is its body as a str
print(json)
if json is None:
    sys.exit("Problem getting API data, check your key")
print("how did i get here")
Output:
null
how did i get here
But I have no idea how that is possible ... it literally says it is null in the print, but then doesn't match in the 'if'. Any help would be appreciated.
thx
So I am sure I still don't fully understand, but this "fixes" my problem.
The requests.Response object has a method json() - so I should have been using that instead of text, thanks wim. Changing the code to this (below), as suggested, makes the code work.
from requests import get
import sys

myurl = URLBASE + args.key
response = get(myurl)
# check key is valid
json = response.json()
if json is None:
    sys.exit("Problem getting API data, check your key")
print("how did i get here")
The question (for my own curiosity) remains: how would I do an if statement to determine if a string is null?
Thanks to Ry and wim, for their help.

python check if list items are in string

link = 'http://dedegood.com'
wrongdomain = ['google', 'facebook', 'twitter']

if any(link.find(i) for i in wrongdomain):
    print 'pass this url'
else:
    print 'good'
I want to check if link contains the words in wrongdomain.
Why does this always print 'pass this url'?
link has no google or facebook or twitter in it.
I tried it separately, like link.find('google'),
and it returns -1. So what's the problem?
Please help me check my logic. Thank you.
bool(-1) is True in Python. Instead of find, you can just do:
if any(domain in link for domain in wrongdomain):
Just remember that will also match the rest of the url, not just the domain.
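To spell out why the original always takes the 'pass' branch: str.find returns -1 when the substring is missing, and bool(-1) is True, so any() sees a truthy value for every domain that is not found. A quick sketch of the in-operator version in context:

    link = 'http://dedegood.com'
    wrongdomain = ['google', 'facebook', 'twitter']

    print(bool(-1))             # True, which is why the find() version misfires
    print(link.find('google'))  # -1: not found, yet still truthy

    if any(domain in link for domain in wrongdomain):
        print('pass this url')
    else:
        print('good')           # this branch runs, since none of the words appear in link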
Your method will not work correctly for a URL like http://dedegood.com/google. So you can use something like:
link = 'http://dedegood.com'
wrongdomain = ['google', 'facebook', 'twitter']

a = link.split("//")
b = a[1].split(".")
if any(domain in b[0] for domain in wrongdomain):
    print('pass this url')
else:
    print('good')
Since you just want to check the URL, you can use this one. Instead of checking the whole link, it checks only the name of the website, so a URL like http://dedegood.com/google will not be a problem.
Do you want to know whether the URL's domain is in wrongdomain or not? I would suggest doing this instead, since it works on the parsed domain rather than the raw string:
import tldextract

link = 'http://dedegood.com'
wrongdomain = ['google', 'facebook', 'twitter']

parsed = tldextract.extract(link)
if parsed.domain in wrongdomain:
    print 'pass this url'
else:
    print 'good'
You could check out tldextract, a library designed to extract the domain from a URL.
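For reference, tldextract splits a URL into subdomain, domain and suffix, which is what makes the parsed.domain check above ignore anything in the path (the exact repr can vary between tldextract versions):

    import tldextract

    print(tldextract.extract('http://dedegood.com'))
    # ExtractResult(subdomain='', domain='dedegood', suffix='com')
    print(tldextract.extract('http://dedegood.com/google').domain)
    # dedegood - the path does not leak into the domain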

google custom search api return is different from google.com

I am using the Google API via Python and it works, but the results I get from the API are totally different from google.com. The top results given by Custom Search are Google Calendar, Google Earth and Patents. I wonder if there is a way to get the same results from the Custom Search API. Thank you.
def googleAPICall(self, userInput):
    try:
        userInput = urllib.quote(userInput)
        for i in range(0, 1):
            index = i * 10 + 1
            url = ('https://www.googleapis.com/customsearch/v1?'
                   'key=%s'
                   '&cx=%s'
                   '&alt=json'
                   '&num=10'
                   '&start=%d'
                   '&q=%s') % (self.KEY, self.CX, index, userInput)
            print(url)
            request = urllib2.Request(url)
            response = urllib2.urlopen(request)
            returnResults = simplejson.load(response)
            webs = returnResults['items']
            for web in webs:
                self.result.append(web["link"])
    except:
        print("search error")
        self.result.append("http://en.wikipedia.org/wiki/Climate_change")
    return self.result
There is a 'search outside of Google' checkbox in the dashboard; you will get the same results after you check it. It took me a while to find it. The default setting only returns search results from inside Google's own websites.
After some searching, the answer is: it is impossible to get the same results as google.com.
Google states it clearly:
https://support.google.com/customsearch/answer/141877?hl=en
Hope this is the definitive answer.
Just to add to galaxyan's answer: you can still do that by changing "Sites to search" from "Search only included sites" to "Search the entire web".
I think you need to experiment with the four parameters cr, gl, hl and lr.
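As a rough sketch of that experiment, reusing the variables from the question's code (cr, gl, hl and lr are real Custom Search query parameters for country restrict, geolocation, interface language and language restrict, but the values below are only example guesses):

    url = ('https://www.googleapis.com/customsearch/v1?'
           'key=%s'
           '&cx=%s'
           '&alt=json'
           '&num=10'
           '&start=%d'
           '&q=%s'
           '&cr=countryUS'    # restrict results to documents from a country
           '&gl=us'           # geolocation of the end user
           '&hl=en'           # interface language
           '&lr=lang_en'      # restrict results to a language
           ) % (self.KEY, self.CX, index, userInput)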

Dictionary / JSON issue using Python 2.7

I'm looking at scraping some data from Facebook using Python 2.7. My code basically increments the Facebook profile ID by 1 and then captures the details returned by the page.
An example of the page I'm looking to capture the data from is graph.facebook.com/4.
Here's my code below:
import scraperwiki
import urlparse
import simplejson

source_url = "http://graph.facebook.com/"
profile_id = 1

while True:
    try:
        profile_id += 1
        profile_url = urlparse.urljoin(source_url, str(profile_id))
        results_json = simplejson.loads(scraperwiki.scrape(profile_url))
        for result in results_json['results']:
            print result
            data = {}
            data['id'] = result['id']
            data['name'] = result['name']
            data['first_name'] = result['first_name']
            data['last_name'] = result['last_name']
            data['link'] = result['link']
            data['username'] = result['username']
            data['gender'] = result['gender']
            data['locale'] = result['locale']
            print data['id'], data['name']
            scraperwiki.sqlite.save(unique_keys=['id'], data=data)
            #time.sleep(3)
    except:
        continue
    profile_id += 1
I am using the ScraperWiki site to run this, but no data is printed back to the console despite the line print data['id'], data['name'] being there just to check the code is working.
Any suggestions on what is wrong with this code? As said, for each returned profile the unique data should be captured and printed to screen, as well as saved into the SQLite database.
Thanks
Any suggestions on what is wrong with this code?
Yes. You are swallowing all of your errors. There could be a huge number of things going wrong in the block under try. If anything goes wrong in that block, you move on without printing anything.
You should only ever use a try / except block when you are looking to handle a specific error.
Modify your code so that it looks like this:
while True:
    profile_id += 1
    profile_url = urlparse.urljoin(source_url, str(profile_id))
    results_json = simplejson.loads(scraperwiki.scrape(profile_url))
    for result in results_json['results']:
        print result
        data = {}
        # ... more ...
and then you will get detailed error messages when specific things go wrong.
As for your concern in the comments:
The reason I have the error handling is because, if you look for
example at graph.facebook.com/3, this page contains no user data and
so I don't want to collate this info and skip to the next user, ie. no
4 etc
If you want to handle the case where there is no data, then find a way to handle that case specifically. It is bad practice to swallow all errors.
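As a rough sketch of what "handle that case specifically" could look like (this assumes the empty profiles simply lack the keys you read, so a KeyError is the error worth catching; that assumption may not match what the Graph API actually returns):

    while True:
        profile_id += 1
        profile_url = urlparse.urljoin(source_url, str(profile_id))
        try:
            results_json = simplejson.loads(scraperwiki.scrape(profile_url))
            for result in results_json['results']:
                data = {'id': result['id'], 'name': result['name']}
                # ... copy the remaining fields as in the original code ...
                print data['id'], data['name']
                scraperwiki.sqlite.save(unique_keys=['id'], data=data)
        except KeyError:
            # This profile has no user data to read; skip it and try the next ID
            print 'no data for profile %d, skipping' % profile_id
            continue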
