How do i understand whether i am parsing the websites acurately?

How do i understand whether i am parsing the websites acurately? - python

I built this function to tell me whether there have been changes to the website. I'm not sure if it works as I have tried it on a few websites that have not changed and it has given me the wrong output. Where is the issue and is there an issue at all?
This is the code:
I put the code into a function so that I could allow the user to input any site
userurl=input("Please enter a valid url")
def checksite(userurl):
change=False
import time
import urllib.request
import io
u = urllib.request.urlopen(userurl)
webContent1 = u.read()
time.sleep(60)
u = urllib.request.urlopen(userurl)
webContent2 = u.read()
if webContent1 == webContent2:
print("Everything is normal")
elif webContent1 !=webContent2:
print("Warning, there has been a change to the webite!")
change=True
return change
checksite(userurl)

Try making a small HTML Hello World page. Given that many websites have dynamic content that changes each time you access it (and might not necessarily be visible), that could lead to your "incorrect" results.

I have tested your code and it works perfectly fine in a Python webserver.
I have started one with
python -m http.server
and placed an index.html in the same directory with some content before starting the server.
and your code
import time
import urllib.request
import io
userurl='http://localhost:8000/index.html'
def checksite(userurl):
change=False
u = urllib.request.urlopen(userurl)
webContent1 = u.read()
print(webContent1)
time.sleep(15)
u = urllib.request.urlopen(userurl)
webContent2 = u.read()
print(webContent2)
if webContent1 == webContent2:
print("Everything is normal")
elif webContent1 !=webContent2:
print("Warning, there has been a change to the webite!")
change=True
return change
checksite(userurl)
and output
b'<html>\n\t<title> Hello </title>\n\t<body>\n\t\tTesting, Webcontent1 \n\t</body>\n\t</html>\n\n'
b'<html>\n\t<title> Hello </title>\n\t<body>\n\t\tTesting, Webcontent2\n\t</body>\n\t</html>\n\n'
Warning, there has been a change to the webite!
[Finished in 17.5s]
Your code is perfectly fine.

to know if a website or a page has changed you need to have a backup of it somewhere, in your code it was like you were comparing the site to itself... anyways. i recomend using the requests library in addition to BS4 and try parsing it line by line comparing to the backup you have.
So while the code is working (aka: the site you have as backup is showing the same lines as the site on the web) it will have a variable true. if it has changed it breaks the loop and simply shows the line where the site has changed.

Related

Trying to check if the webdirectory is showing the same thing as index.html

I'm on a blackbox penetration training, last time i asked a question about sql injection which so far im making a progress on it i was able to retrieve the database and the column.
This time i need to find the admin login, so i used dirsearch for that, i checked each webdirectories from dirsearch and sometimes it would show the same page as index.html.
So i'm trying to fix this by automating the process with a script:
import requests
url = "http://depedqc.ph";
webdirectory_path = "C:/PentestingLabs/Dirsearch/reports/depedqc.ph/scanned_webdirectory9-3-2022.txt";
index = requests.get(url);
same = index.content
for webdirectory in open(webdirectory_path, "r").readlines():
webdirectory_split = webdirectory.split();
result = result = [i for i in webdirectory_split if i.startswith(url)];
result = ''.join(result);
print(result);
response = requests.get(result);
if response.content == same:
print("same content");
Only problem is, i get this error:
Invalid URL '': No scheme supplied. Perhaps you meant http://?
Even though the printed result is: http://depedqc.ph/html
What am i doing wrong here? i appreciate a feedback

Incomplete HAR list using Python: Browsermobproxy, selenium, phantomJS

Fairly new to python, I learn by doing, so I thought I'd give this project a shot. Trying to create a script which finds the google analytics request for a certain website parses the request payload and does something with it.
Here are the requirements:
Ask user for 2 urls ( for comparing the payloads from 2 diff. HAR payloads)
Use selenium to open the two urls, use browsermobproxy/phantomJS to
get all HAR
Store the HAR as a list
From the list of all HAR files, find the google analytics request, including the payload
If Google Analytics tag found, then do things....like parse the payload, etc. compare the payload, etc.
Issue: Sometimes for a website that I know has google analytics, i.e. nytimes.com - the HAR that I get is incomplete, i.e. my prog. will say "GA Not found" but that's only because the complete HAR was not captured so when the regex ran to find the matching HAR it wasn't there. This issue in intermittent and does not happen all the time. Any ideas?
I'm thinking that due to some dependency or latency, the script moved on and that the complete HAR didn't get captured. I tried the "wait for traffic to stop" but maybe I didn't do something right.
Also, as a bonus, I would appreciate any help you can provide on how to make this script run fast, its fairly slow. As I mentioned, I'm new to python so go easy :)
This is what I've got thus far.
import browsermobproxy as mob
from selenium import webdriver
import re
import sys
import urlparse
import time
from datetime import datetime
def cleanup():
s.stop()
driver.quit()
proxy_path = '/Users/bob/Downloads/browsermob-proxy-2.1.4-bin/browsermob-proxy-2.1.4/bin/browsermob-proxy'
s = mob.Server(proxy_path)
s.start()
proxy = s.create_proxy()
proxy_address = "--proxy=127.0.0.1:%s" % proxy.port
service_args = [proxy_address, '--ignore-ssl-errors=yes', '--ssl-protocol=any'] # so that i can do https connections
driver = webdriver.PhantomJS(executable_path='/Users/bob/Downloads/phantomjs-2.1.1-windows/phantomjs-2.1.1-windows/bin/phantomjs', service_args=service_args)
driver.set_window_size(1400, 1050)
urlLists = []
collectTags = []
gaCollect = 0
varList = []
for x in range(0,2): # I want to ask the user for 2 inputs
url = raw_input("Enter a website to find GA on: ")
time.sleep(2.0)
urlLists.append(url)
if not url:
print "You need to type something in...here"
sys.exit()
#gets the two user url and stores in list
for urlList in urlLists:
print urlList, 'start 2nd loop' #printing for debug purpose, no need for this
if not urlList:
print 'Your Url list is empty'
sys.exit()
proxy.new_har()
driver.get(urlList)
#proxy.wait_for_traffic_to_stop(15, 30) #<-- tried this but did not do anything
for ent in proxy.har['log']['entries']:
gaCollect = (ent['request']['url'])
print gaCollect
if re.search(r'google-analytics.com/r\b', gaCollect):
print 'Found GA'
collectTags.append(gaCollect)
time.sleep(2.0)
break
else:
print 'No GA Found - Ending Prog.'
cleanup()
sys.exit()
cleanup()

This might be a stale question, but I found an answer that worked for me.
You need to change two things:
1 - Remove sys.exit() -- this causes your programme to stop after the first iteration through the ent list, so if what you want is not the first thing, it won't be found
2 - call new_har with the captureContent option enabled to get the payload of requests:
proxy.new_har(options={'captureHeaders':True, 'captureContent': True})
See if that helps.

Module urllib.request not getting data

I am trying to test this demo program from lynda using Python 3. I am using Pycharm as my IDE. I already added and installed the request package, but when I run the program, it runs cleanly and shows a message "Process finished with exit code 0", but does not show any output from print statement. Where am I going wrong ?
import urllib.request # instead of urllib2 like in Python 2.7
import json
def printResults(data):
# Use the json module to load the string data into a dictionary
theJSON = json.loads(data)
# now we can access the contents of the JSON like any other Python object
if "title" in theJSON["metadata"]:
print(theJSON["metadata"]["title"])
# output the number of events, plus the magnitude and each event name
count = theJSON["metadata"]["count"];
print(str(count) + " events recorded")
# for each event, print the place where it occurred
for i in theJSON["features"]:
print(i["properties"]["place"])
# print the events that only have a magnitude greater than 4
for i in theJSON["features"]:
if i["properties"]["mag"] >= 4.0:
print("%2.1f" % i["properties"]["mag"], i["properties"]["place"])
# print only the events where at least 1 person reported feeling something
print("Events that were felt:")
for i in theJSON["features"]:
feltReports = i["properties"]["felt"]
if feltReports != None:
if feltReports > 0:
print("%2.1f" % i["properties"]["mag"], i["properties"]["place"], " reported " + str(feltReports) + " times")
# Open the URL and read the data
urlData = "http://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/2.5_day.geojson"
webUrl = urllib.request.urlopen(urlData)
print(webUrl.getcode())
if webUrl.getcode() == 200:
data = webUrl.read()
data = data.decode("utf-8") # in Python 3.x we need to explicitly decode the response to a string
# print out our customized results
printResults(data)
else:
print("Received an error from server, cannot retrieve results " + str(webUrl.getcode()))

Not sure if you left this out on purpose, but this script isn't actually executing any code beyond the imports and function definition. Assuming you didn't leave it out on purpose, you would need the following at the end of your file.
if __name__ == '__main__':
data = "" # your data
printResults(data)
The check on __name__ equaling "__main__" is just so your code is only executing when the file is explicitly run. To always run your printResults(data) function when the file is accessed (like, say, if its imported into another module) you could just call it at the bottom of your file like so:
data = "" # your data
printResults(data)

I had to restart the IDE after installing the module. I just realized and tried it now with "Run as Admin". Strangely seems to work now.But not sure if it was a temp error, since even without restart, it was able to detect the module and its methods.

Your comments re: having to restart your IDE makes me think that pycharm might not automatically detect newly installed python packages. This SO answer seems to offer a solution.
SO answer

Weird json value urllib python

I'm trying to manipulate a dynamic JSON from this site:
http://esaj.tjsc.jus.br/cposgtj/imagemCaptcha.do
It has 3 elements, imagem, a base64, labelValorCaptcha, just a message, and uuidCaptcha, a value to pass by parameter to play a sound in this link bellow:
http://esaj.tjsc.jus.br/cposgtj/somCaptcha.do?timestamp=1455996420264&uuidCaptcha=sajcaptcha_e7b072e1fce5493cbdc46c9e4738ab8a
When I enter in the first site through a browser and put in the second link the uuidCaptha after the equal ("..uuidCaptcha="), the sound plays normally. I wrote a simple code to catch this elements.
import urllib, json
url = "http://esaj.tjsc.jus.br/cposgtj/imagemCaptcha.do"
response = urllib.urlopen(url)
data = json.loads(response.read())
urlSound = "http://esaj.tjsc.jus.br/cposgtj/somCaptcha.do?timestamp=1455996420264&uuidCaptcha="
print urlSound + data['uuidCaptcha']
But I dont know what's happening, the caught value of the uuidCaptcha doesn't work. Open a error web page.
Someone knows?
Thanks!

It works for me.
$ cat a.py
#!/usr/bin/env python
# encoding: utf-8
import urllib, json
url = "http://esaj.tjsc.jus.br/cposgtj/imagemCaptcha.do"
response = urllib.urlopen(url)
data = json.loads(response.read())
urlSound = "http://esaj.tjsc.jus.br/cposgtj/somCaptcha.do?timestamp=1455996420264&uuidCaptcha="
print urlSound + data['uuidCaptcha']
$ python a.py
http://esaj.tjsc.jus.br/cposgtj/somCaptcha.do?timestamp=1455996420264&uuidCaptcha=sajcaptcha_efc8d4bc3bdb428eab8370c4e04ab42c

As I said #Charlie Harding, the best way is download the page and get the JSON values, because this JSON is dynamic and need an opened web link to exist.
More info here.

cgi URL forwarding with "Location" header - only partial forwarding?

Greetings all,
I have a python CGI script which using
print "Location: [nextfilename]"
print
After "forwarding" (the reason for the quotes there is apparent in a second), I see the HTML of the page to which it has forwarded fine, but all images, etc. are not showing. The address bar still shows the cgi script as the current location, not the HTML file itself. If I go to the HTML file directly, it displays fine.
Basically, the CGI script, which is stored in the cgi-bin, whereas the HTML files are not, is trying to render images with relational links that are broken.
How do I actually forward to the next page, not just render the next page through the cgi script?
I have gone through the script with a fine-toothed comb to make sure that i wasn't actually using a print htmlDoc command anywhere that would be interrupting and screwing this up.
Sections of Code that are Applicable:
def get_nextStepName():
"""Generates a random filename."""
nextStepBuilder = ["../htdocs/bcc/"]
fileLength = random.randrange(10)+5
for i in range(fileLength):
j = random.choice(varLists.ALPHANUM)
nextStepBuilder.append(j)
nextStepName = ""
for char in nextStepBuilder:
nextStepName += char
nextStepName += ".html"
return nextStepName
def make_step2(user, password, email, headerContent, mainContent, sideSetup, sideContent, footerContent):
"""Creates the next step of user registration and logs valid data to a pickle for later confirmation."""
nextStepName = get_nextStepName()
mainContent = "<h1>Step Two: The Nitty Gritty</h1>"
mainContent += "<p>User Name: %s </p>" % (user)
mainContent += """\
[HTML CODE GOES HERE]
"""
htmlDoc = htmlGlue.glue(headerContent, mainContent, sideSetup, sideContent, footerContent)
f = open(nextStepName, "w")
f.write(htmlDoc)
f.close()
nextStepName = nextStepName[9:] #truncates the ../htdocs part of the filename to fix a relational link issue when redirecting
gotoNext(nextStepName)
def gotoNext(filename):
nextLocation = "Location:"
nextLocation += filename
print(nextLocation)
print
Any thoughts? Thanks a ton. CGI is new to me.

You need to send a 30X Status header as well. See RFC 2616 for details.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

How do i understand whether i am parsing the websites acurately? - python

Try making a small HTML Hello World page. Given that many websites have dynamic content that changes each time you access it (and might not necessarily be visible), that could lead to your "incorrect" results.

Related

Trying to check if the webdirectory is showing the same thing as index.html

Incomplete HAR list using Python: Browsermobproxy, selenium, phantomJS

Module urllib.request not getting data

Weird json value urllib python

cgi URL forwarding with "Location" header - only partial forwarding?

Categories

Resources