Python: using a regular expression to match one line of HTML

Python: using a regular expression to match one line of HTML - python

This simple Python method I put together just checks to see if Tomcat is running on one of our servers.
import urllib2
import re
import sys
def tomcat_check():
tomcat_status = urllib2.urlopen('http://10.1.1.20:7880')
results = tomcat_status.read()
pattern = re.compile('<body>Tomcat is running...</body>',re.M|re.DOTALL)
q = pattern.search(results)
if q == []:
notify_us()
else:
print ("Tomcat appears to be running")
sys.exit()
If this line is not found :
<body>Tomcat is running...</body>
It calls :
notify_us()
Which uses SMTP to send an email message to myself and another admin that Tomcat is no longer runnning on the server...
I have not used the re module in Python before...so I am assuming there is a better way to do this... I am also open to a more graceful solution with Beautiful Soup ... but haven't used that either..
Just trying to keep this as simple as possible...

Why use regex here at all? Why not just a simple string search?:
if not '<body>Tomcat is running...</body>' in results:
notify_us()

if not 'Tomcat is running' in results:
notify_us()

There are lots of different methods:
str.find()
if results.find("Tomcat is running...") != -1:
print "Tomcat appears to be running"
else:
notify_us()
Using X in Y
if "Tomcat is running..." in result:
print "Tomcat appears to be running"
else:
notify_us()
Using Regular Expressions
if re.search(r"Tomcat is running\.\.\.", result):
print "Tomcat appears to be running"
else:
notify_us()
Personally, I prefer the membership operator to test if the string is in another string.

Since you appear to be looking for a fixed string (not a regexp) that you have some control over and can be expected to be consistent, str.find() should do just fine. Or what Daniel said.

As you have mentioned, regular expressions aren't suited for parsing XML like structures (at least, for more complex queries). I would do something like that:
from lxml import etree
import urllib2
def tomcat_check(host='127.0.0.1', port=7880):
response = urllib2.urlopen('http://%s:%d' % (host, port))
html = etree.HTML(response.read())
return html.findtext('.//body') == 'Tomcat is running...'
if tomcat_check('10.1.1.20'):
print 'Tomcat is running...'
else:
# notify someone

Related

Incomplete HAR list using Python: Browsermobproxy, selenium, phantomJS

Fairly new to python, I learn by doing, so I thought I'd give this project a shot. Trying to create a script which finds the google analytics request for a certain website parses the request payload and does something with it.
Here are the requirements:
Ask user for 2 urls ( for comparing the payloads from 2 diff. HAR payloads)
Use selenium to open the two urls, use browsermobproxy/phantomJS to
get all HAR
Store the HAR as a list
From the list of all HAR files, find the google analytics request, including the payload
If Google Analytics tag found, then do things....like parse the payload, etc. compare the payload, etc.
Issue: Sometimes for a website that I know has google analytics, i.e. nytimes.com - the HAR that I get is incomplete, i.e. my prog. will say "GA Not found" but that's only because the complete HAR was not captured so when the regex ran to find the matching HAR it wasn't there. This issue in intermittent and does not happen all the time. Any ideas?
I'm thinking that due to some dependency or latency, the script moved on and that the complete HAR didn't get captured. I tried the "wait for traffic to stop" but maybe I didn't do something right.
Also, as a bonus, I would appreciate any help you can provide on how to make this script run fast, its fairly slow. As I mentioned, I'm new to python so go easy :)
This is what I've got thus far.
import browsermobproxy as mob
from selenium import webdriver
import re
import sys
import urlparse
import time
from datetime import datetime
def cleanup():
s.stop()
driver.quit()
proxy_path = '/Users/bob/Downloads/browsermob-proxy-2.1.4-bin/browsermob-proxy-2.1.4/bin/browsermob-proxy'
s = mob.Server(proxy_path)
s.start()
proxy = s.create_proxy()
proxy_address = "--proxy=127.0.0.1:%s" % proxy.port
service_args = [proxy_address, '--ignore-ssl-errors=yes', '--ssl-protocol=any'] # so that i can do https connections
driver = webdriver.PhantomJS(executable_path='/Users/bob/Downloads/phantomjs-2.1.1-windows/phantomjs-2.1.1-windows/bin/phantomjs', service_args=service_args)
driver.set_window_size(1400, 1050)
urlLists = []
collectTags = []
gaCollect = 0
varList = []
for x in range(0,2): # I want to ask the user for 2 inputs
url = raw_input("Enter a website to find GA on: ")
time.sleep(2.0)
urlLists.append(url)
if not url:
print "You need to type something in...here"
sys.exit()
#gets the two user url and stores in list
for urlList in urlLists:
print urlList, 'start 2nd loop' #printing for debug purpose, no need for this
if not urlList:
print 'Your Url list is empty'
sys.exit()
proxy.new_har()
driver.get(urlList)
#proxy.wait_for_traffic_to_stop(15, 30) #<-- tried this but did not do anything
for ent in proxy.har['log']['entries']:
gaCollect = (ent['request']['url'])
print gaCollect
if re.search(r'google-analytics.com/r\b', gaCollect):
print 'Found GA'
collectTags.append(gaCollect)
time.sleep(2.0)
break
else:
print 'No GA Found - Ending Prog.'
cleanup()
sys.exit()
cleanup()

This might be a stale question, but I found an answer that worked for me.
You need to change two things:
1 - Remove sys.exit() -- this causes your programme to stop after the first iteration through the ent list, so if what you want is not the first thing, it won't be found
2 - call new_har with the captureContent option enabled to get the payload of requests:
proxy.new_har(options={'captureHeaders':True, 'captureContent': True})
See if that helps.

Loop to check if a variable has changed in Python

I have just learned the basics of Python, and I am trying to make a few projects so that I can increase my knowledge of the programming language.
Since I am rather paranoid, I created a script that uses PycURL to fetch my current IP address every x seconds, for VPN security. Here is my code[EDITED]:
import requests
enterIP = str(input("What is your current IP address?"))
def getIP():
while True:
try:
result = requests.get("http://ipinfo.io/ip")
print(result.text)
except KeyboardInterrupt:
print("\nProccess terminated by user")
return result.text
def checkIP():
while True:
if enterIP == result.text:
pass
else:
print("IP has changed!")
getIP()
checkIP()
Now I would like to expand the idea, so that the script asks the user to enter their current IP, saves that octet as a string, then uses a loop to keep running it against the PycURL function to make sure that their IP hasn't changed? The only problem is that I am completely stumped, I cannot come up with a function that would take the output of PycURL and compare it to a string. How could I achieve that?

As #holdenweb explained, you do not need pycurl for such a simple task, but nevertheless, here is a working example:
import pycurl
import time
from StringIO import StringIO
def get_ip():
buffer = StringIO()
c = pycurl.Curl()
c.setopt(pycurl.URL, "http://ipinfo.io/ip")
c.setopt(c.WRITEDATA, buffer)
c.perform()
c.close()
return buffer.getvalue()
def main():
initial = get_ip()
print 'Initial IP: %s' % initial
try:
while True:
current = get_ip()
if current != initial:
print 'IP has changed to: %s' % current
time.sleep(300)
except KeyboardInterrupt:
print("\nProccess terminated by user")
if __name__ == '__main__':
main()
As you can see I moved the logic of getting the IP to separate function: get_ip and added few missing things, like catching the buffer to a string and returning it. Otherwise it is pretty much the same as the first example in pycurl quickstart
The main function is called below, when the script is accessed directly (not by import).
First off it calls the get_ip to get initial IP and then runs the while loop which checks if the IP has changed and lets you know if so.
EDIT:
Since you changed your question, here is your new code in a working example:
import requests
def getIP():
result = requests.get("http://ipinfo.io/ip")
return result.text
def checkIP():
initial = getIP()
print("Initial IP: {}".format(initial))
while True:
current = getIP()
if initial == current:
pass
else:
print("IP has changed!")
checkIP()
As I mentioned in the comments above, you do not need two loops. One is enough. You don't even need two functions, but better do. One for getting the data and one for the loop. In the later, first get initial value and then run the loop, inside which you check if value has changed or not.

It seems, from reading the pycurl documentation, like you would find it easier to solve this problem using the requests library. Curl is more to do with file transfer, so the library expects you to provide a file-like object into which it writes the contents. This would greatly complicate your logic.
requests allows you to access the text of the server's response directly:
>>> import requests
>>> result = requests.get("http://ipinfo.io/ip")
>>> result.text
'151.231.192.8\n'
As #PeterWood suggested, a function would be more appropriate than a class for this - or if the script is going to run continuously, just a simple loop as the body of the program.

Module urllib.request not getting data

I am trying to test this demo program from lynda using Python 3. I am using Pycharm as my IDE. I already added and installed the request package, but when I run the program, it runs cleanly and shows a message "Process finished with exit code 0", but does not show any output from print statement. Where am I going wrong ?
import urllib.request # instead of urllib2 like in Python 2.7
import json
def printResults(data):
# Use the json module to load the string data into a dictionary
theJSON = json.loads(data)
# now we can access the contents of the JSON like any other Python object
if "title" in theJSON["metadata"]:
print(theJSON["metadata"]["title"])
# output the number of events, plus the magnitude and each event name
count = theJSON["metadata"]["count"];
print(str(count) + " events recorded")
# for each event, print the place where it occurred
for i in theJSON["features"]:
print(i["properties"]["place"])
# print the events that only have a magnitude greater than 4
for i in theJSON["features"]:
if i["properties"]["mag"] >= 4.0:
print("%2.1f" % i["properties"]["mag"], i["properties"]["place"])
# print only the events where at least 1 person reported feeling something
print("Events that were felt:")
for i in theJSON["features"]:
feltReports = i["properties"]["felt"]
if feltReports != None:
if feltReports > 0:
print("%2.1f" % i["properties"]["mag"], i["properties"]["place"], " reported " + str(feltReports) + " times")
# Open the URL and read the data
urlData = "http://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/2.5_day.geojson"
webUrl = urllib.request.urlopen(urlData)
print(webUrl.getcode())
if webUrl.getcode() == 200:
data = webUrl.read()
data = data.decode("utf-8") # in Python 3.x we need to explicitly decode the response to a string
# print out our customized results
printResults(data)
else:
print("Received an error from server, cannot retrieve results " + str(webUrl.getcode()))

Not sure if you left this out on purpose, but this script isn't actually executing any code beyond the imports and function definition. Assuming you didn't leave it out on purpose, you would need the following at the end of your file.
if __name__ == '__main__':
data = "" # your data
printResults(data)
The check on __name__ equaling "__main__" is just so your code is only executing when the file is explicitly run. To always run your printResults(data) function when the file is accessed (like, say, if its imported into another module) you could just call it at the bottom of your file like so:
data = "" # your data
printResults(data)

I had to restart the IDE after installing the module. I just realized and tried it now with "Run as Admin". Strangely seems to work now.But not sure if it was a temp error, since even without restart, it was able to detect the module and its methods.

Your comments re: having to restart your IDE makes me think that pycharm might not automatically detect newly installed python packages. This SO answer seems to offer a solution.
SO answer

How to decode python string

I have some code which I would like to be decoded but not having much luck in guessing what the codepage is, if any is being used. Any help would be much appreciated.
i am using python command line in windows 7 pc,if any python guru guide me how to decrypt and see the code thaat would be appreciated.
exec("import re;import base64");exec((lambda p,y:(lambda o,b,f:re.sub(o,b,f))(r"([0-9a-f]+)",lambda m:p(m,y),base64.b64decode("NTQgYgo1NCA3CjU0IDMKNTQgMWUKNTQgOQo1NCAxOAozZiAgICAgICA9IGIuMTAoKQoxNiAgID0gIjQzOi8vMTIuM2QvNGMvMWQuMjUuZi00ZC4zZSIKYSA9ICIxZC4yNS5mIgoyYSA4KDYpOgoJMzMgMy5jKCczYy4yZSglNTIpJyAlIDYpID09IDEKCjJhIDE1KDM1KToKCTUgPSAzLjQoMWUuNS4xYygnMTM6Ly8yZC8xZicsJzMwJykpCgkyMyA1CgkyMSA9IDcuMTQoKQoJMjEuMzgoIjEwIDI4IiwiMjAgMTAuLiIsJycsICczNiA0MCcpCgkxMT0xZS41LjFjKDUsICdlLjNlJykKCTM5OgoJCTFlLjFhKDExKQoJMWI6CgkJMmMKCQk5LmUoMzUsIDExLCAyMSkKCQkyID0gMy40KDFlLjUuMWMoJzEzOi8vMmQnLCcxZicpKQoJCTIzIDIKCQkyMS4zNCgwLCIiLCAiM2IgNDciKQoJCTE4LjQ4KDExLDIsMjEpCgkJCgkJMy41MygnMjIoKScpOyAKCQkzLjUzKCcyNigpJyk7CgkJMy41MygiNDUuZCgpIik7IAoJCTE5PTcuMzcoKTsgMTkuNTAoIjMyISIsIjJmIDNhIDQ5IDQxIDI5IiwiICAgWzI0IDQ2XTMxIDUxIDRhIDRlIDE3LjNkWy8yNF0iKQoJCSIiIgoJCTM5OgoJCQkxZS4xYSgxMSkKCQkxYjoKCQkJMmMKCQkJIzI3KCkKCQk0Mjo0NCgpCgkJIiIiCgoyYSAyYigpOgoJNGYgNGIgOChhKToKCQkxNSgxNikKCQoKCjJiKCk=")))(lambda a,b:b[int("0x"+a.group(1),16)],"0|1|addonfolder|xbmc|translatePath|path|script_name|xbmcgui|script_chk|downloader|scriptname|xbmcaddon|getCondVisibility|UpdateLocalAddons|download|supermax|Addon|lib|supermaxwizard|special|DialogProgress|INSTALL|website|SuperMaxWizard|extract|dialog|remove|except|join|plugin|os|addons|Installing|dp|UnloadSkin|print|COLOR|video|ReloadSkin|FORCECLOSE|Installer|Installed|def|Main|pass|home|HasAddon|SuperMax|packages|Brought|Success|return|update|url|Please|Dialog|create|try|Wizard|Nearly|System|com|zip|addon|Wait|been|else|http|quit|XBMC|gold|Done|all|has|You|not|sm|MP|By|if|ok|To|s|executebuiltin|import".split("|")))

The code is uglified. You can unobfuscate it yourself by executing the contents of exec(...) in your Python shell.
import re
import base64
print ((lambda p,y.....split("|")))
EDIT: As snakecharmerb says, it is generally not safe to execute unknown code. I analysed the code to find that running the insides of exec will only decrypt, and leaving off the exec itself will just result in a string. This procedure ("execute stuff inside exec") is by no means a generally safe method to decrypt uglified code, and you need to actually analyse what it does. But, at this point, I was asking you to trust my judgement, which, if it is wrong, theoretically could expose you to an attack. In addition, it seems you have problems getting it to run on your Python; so here's what I'm getting from the above:
import xbmcaddon
import xbmcgui
import xbmc
import os
import downloader
import extract
addon = xbmcaddon.Addon()
website = "http://supermaxwizard.com/sm/plugin.video.supermax-MP.zip"
scriptname = "plugin.video.supermax"
def script_chk(script_name):
return xbmc.getCondVisibility('System.HasAddon(%s)' % script_name) == 1
def INSTALL(url):
path = xbmc.translatePath(os.path.join('special://home/addons','packages'))
print path
dp = xbmcgui.DialogProgress()
dp.create("Addon Installer","Installing Addon..",'', 'Please Wait')
lib=os.path.join(path, 'download.zip')
try:
os.remove(lib)
except:
pass
downloader.download(url, lib, dp)
addonfolder = xbmc.translatePath(os.path.join('special://home','addons'))
print addonfolder
dp.update(0,"", "Nearly Done")
extract.all(lib,addonfolder,dp)
xbmc.executebuiltin('UnloadSkin()');
xbmc.executebuiltin('ReloadSkin()');
xbmc.executebuiltin("XBMC.UpdateLocalAddons()");
dialog=xbmcgui.Dialog(); dialog.ok("Success!","SuperMax Wizard has been Installed"," [COLOR gold]Brought To You By SuperMaxWizard.com[/COLOR]")
"""
try:
os.remove(lib)
except:
pass
#FORCECLOSE()
else:quit()
"""
def Main():
if not script_chk(scriptname):
INSTALL(website)
Main()

urllib.read() result return null value

I'm trying to read & print the result from google's URL in GAE. When i run the first program, output was blank. then i have added a print statement before printing the url result and run it. Now i got the result.
Why the Program 1 doesn't give any output ?
Program 1
import urllib
class MainHandler(webapp.RequestHandler):
def get(self):
url = urllib.urlopen("http://www.google.com/ig/calculator?hl=en&q=100EUR%3D%3FAUD")
result = url.read()
print result
Program 2
import urllib
class MainHandler(webapp.RequestHandler):
def get(self):
# Print something before print urllib result
print "Result -"
url = urllib.urlopen("http://www.google.com/ig/calculator?hl=en&q=100EUR%3D%3FAUD")
result = url.read()
print result

You're using print from inside a WSGI application. Never, ever use print from inside a WSGI application.
What's happening is that your text is being output in the place where the webserver expects to see headers, so your output is not displayed as you expect.
Instead, you should use self.response.out.write() to send output to the user, and logging.info etc for debugging data.

I met this issue before. But cannot find an exactly answer about it yet.
maybe the cache mechanism cause this issue, not sure.
You need do flush output to print the data:
import sys
sys.stdout.flush()
or just do like the way you did:
print "*" * 10
print data
I think you'll like logging when you are debugging:
logging.debug('A debug message here')
or
logging.info('The result is: %s', yourResultData)

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Python: using a regular expression to match one line of HTML - python

Why use regex here at all? Why not just a simple string search?: if not '<body>Tomcat is running...</body>' in results: notify_us()

if not 'Tomcat is running' in results: notify_us()

Since you appear to be looking for a fixed string (not a regexp) that you have some control over and can be expected to be consistent, str.find() should do just fine. Or what Daniel said.

Related

Incomplete HAR list using Python: Browsermobproxy, selenium, phantomJS

Loop to check if a variable has changed in Python

Module urllib.request not getting data

How to decode python string

urllib.read() result return null value

Categories

Resources