I have a very simple Pyramid application that serves a static page. Let's say its name is mypyramid and it uses port 9999.
If I launch mypyramid manually in another Linux console, I can use the following code to print the HTML string:
if __name__ == "__main__":
    import urllib2
    print 'trying to download url'
    response = urllib2.urlopen('http://localhost:9999/index.html')
    html = response.read()
    print html
But I want to launch mypyramid automatically from another application.
So in that application I used pexpect to launch mypyramid, and then tried to get the HTML string from http://localhost:9999/index.html:
import pexpect

def _start_mypyramid():
    p = pexpect.spawn(command='./mypyramid')
    return p

if __name__ == "__main__":
    p = _start_mypyramid()
    print p
    print 'mypyramid started'

    import urllib2
    print 'trying to download url'
    response = urllib2.urlopen('http://localhost:9999/index.html')
    html = response.read()
    print html
It seems mypyramid has been successfully launched with pexpect, as I can see the printed process object, and the mypyramid started line is reached.
However, the application just hangs after printing trying to download url, and I never get anything back.
What is the solution? I thought pexpect would create a separate process. If that's true, why is it blocking the retrieval of the HTML?
My guess would be that the child spawned by pexpect needs to communicate: it attempts to write, but nobody reads its output, so the application blocks. (I am only guessing, though.)
If you have no particular reason to use pexpect (which you probably don't, since you never communicate with the child process), why not just go with the standard subprocess module?
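For example, here is a minimal sketch of the subprocess approach (assuming ./mypyramid serves on port 9999 and needs a moment to start; the log file name and the two-second wait are illustrative):

import subprocess
import time
import urllib2

if __name__ == "__main__":
    # Start the server detached from our stdio; sending its output to a log
    # file means no pipe can ever fill up and block the child.
    log = open('mypyramid.log', 'w')   # illustrative log file name
    p = subprocess.Popen(['./mypyramid'], stdout=log, stderr=log)
    time.sleep(2)                      # crude wait for the port to be bound
    print 'trying to download url'
    response = urllib2.urlopen('http://localhost:9999/index.html')
    print response.read()
    p.terminate()                      # requires Python 2.6+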
I built this function to tell me whether a website has changed. I'm not sure it works, because I have tried it on a few websites that have not changed and it has given me the wrong output. Where is the issue, and is there an issue at all?
This is the code:
I put the code into a function so that the user can enter any site.
import time
import urllib.request

userurl = input("Please enter a valid url: ")

def checksite(userurl):
    change = False
    u = urllib.request.urlopen(userurl)
    webContent1 = u.read()
    time.sleep(60)
    u = urllib.request.urlopen(userurl)
    webContent2 = u.read()
    if webContent1 == webContent2:
        print("Everything is normal")
    elif webContent1 != webContent2:
        print("Warning, there has been a change to the website!")
        change = True
    return change

checksite(userurl)
Try testing against a small HTML Hello World page. Many websites have dynamic content that changes on every request (and is not necessarily visible), which could explain your "incorrect" results.
I have tested your code and it works perfectly fine against a local Python web server.
I started one with
python -m http.server
having placed an index.html with some content in the same directory before starting the server.
With your code
import time
import urllib.request

userurl = 'http://localhost:8000/index.html'

def checksite(userurl):
    change = False
    u = urllib.request.urlopen(userurl)
    webContent1 = u.read()
    print(webContent1)
    time.sleep(15)
    u = urllib.request.urlopen(userurl)
    webContent2 = u.read()
    print(webContent2)
    if webContent1 == webContent2:
        print("Everything is normal")
    elif webContent1 != webContent2:
        print("Warning, there has been a change to the website!")
        change = True
    return change

checksite(userurl)

the output (after editing index.html during the sleep to trigger a change) is
b'<html>\n\t<title> Hello </title>\n\t<body>\n\t\tTesting, Webcontent1 \n\t</body>\n\t</html>\n\n'
b'<html>\n\t<title> Hello </title>\n\t<body>\n\t\tTesting, Webcontent2\n\t</body>\n\t</html>\n\n'
Warning, there has been a change to the website!
[Finished in 17.5s]
Your code is perfectly fine.
To know whether a website or a page has changed, you need to keep a backup of it somewhere; in your code you were effectively comparing the site to itself. I recommend using the requests library together with bs4 (BeautifulSoup), parsing the page line by line and comparing it to the backup you have.
While the live site still matches the backup line for line, the code keeps a variable set to True; once something has changed, it breaks the loop and reports the line where the site changed.
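As a hedged sketch of that idea, here is one way to diff the live page against a saved copy using requests and the standard difflib (the file name backup.html and the URL are placeholders, and requests must be installed):

import difflib
import requests

def changed_lines(userurl, backup_path='backup.html'):
    # Fetch the live page and load the saved copy for comparison.
    live = requests.get(userurl).text.splitlines()
    with open(backup_path, encoding='utf-8') as f:
        saved = f.read().splitlines()
    # Keep only added/removed lines from the diff, skipping the file headers.
    return [line for line in difflib.unified_diff(saved, live, lineterm='')
            if line.startswith(('+', '-')) and not line.startswith(('+++', '---'))]

diff = changed_lines('http://localhost:8000/index.html')
if diff:
    print("Warning, there has been a change to the website!")
    print('\n'.join(diff))
else:
    print("Everything is normal")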
I'm trying to read and print the result from a Google URL in GAE. When I ran the first program, the output was blank. Then I added a print statement before printing the URL result and ran it again; now I get the result.
Why doesn't Program 1 give any output?
Program 1
import urllib
from google.appengine.ext import webapp

class MainHandler(webapp.RequestHandler):
    def get(self):
        url = urllib.urlopen("http://www.google.com/ig/calculator?hl=en&q=100EUR%3D%3FAUD")
        result = url.read()
        print result
Program 2
import urllib
from google.appengine.ext import webapp

class MainHandler(webapp.RequestHandler):
    def get(self):
        # Print something before printing the urllib result
        print "Result -"
        url = urllib.urlopen("http://www.google.com/ig/calculator?hl=en&q=100EUR%3D%3FAUD")
        result = url.read()
        print result
You're using print from inside a WSGI application. Never, ever use print from inside a WSGI application.
What's happening is that your text is being output in the place where the webserver expects to see headers, so your output is not displayed as you expect.
Instead, you should use self.response.out.write() to send output to the user, and logging.info etc. for debugging data.
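For illustration, here is a hedged sketch of Program 1 rewritten along those lines (same URL as in the question; the logging call is just an example of where debug output belongs):

import urllib
import logging
from google.appengine.ext import webapp

class MainHandler(webapp.RequestHandler):
    def get(self):
        url = urllib.urlopen("http://www.google.com/ig/calculator?hl=en&q=100EUR%3D%3FAUD")
        result = url.read()
        logging.info('fetched %d bytes', len(result))  # debug info goes to the logs
        self.response.out.write(result)                # user-visible output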
I have met this issue before but cannot find an exact explanation for it yet.
Maybe an output-buffering mechanism causes it; I am not sure.
You need to flush the output to print the data:
import sys
sys.stdout.flush()
or just do it the way you did:
print "*" * 10
print data
I think you'll like logging when you are debugging:
logging.debug('A debug message here')
or
logging.info('The result is: %s', yourResultData)
I have web.py configured for my Apache server by installing flup. However, when I go to my application, the raw HTML code is printed instead of the rendered HTML page (see below).
Content-Type: text/html
<HTML><HEAD><TITLE>Login Details</TITLE></HEAD><BODY>.......</BODY></HTML>
I created another file Test.py in the same directory with the following code:
#!/usr/bin/python
print "Content-Type: text/html\n\n"
print "<html><head></head><body>Present</body></html>"
This prints out the page fine. Both files have the same executable permissions (chmod 755).
Any ideas why this is happening?
Update: I just found out that if I change the return statement to a print inside the GET method of my app, it prints the form fine, but also prints the cookie, session ID, etc. at the end. What do I need to configure to make the return work as expected?
Here is sample code that causes the issue:
#!/usr/bin/python
import web

urls = ("/CodeAnalyzer", "CodeAnalyzer")
app = web.application(urls, globals())

class CodeAnalyzer:
    def GET(self):
        init = "Content-Type: text/html\n\n"
        form = "<html><head></head><body>Hello World</body></html>"
        return init + form

if __name__ == "__main__":
    app.run()
The issue was in the line
init="Content-Type: text/html\n\n"
That is not the correct way to set a header in web.py. The issue was resolved by replacing it with
web.header('Content-Type','text/html; charset=utf-8', unique=True)
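For reference, the sample handler with that fix applied:

#!/usr/bin/python
import web

urls = ("/CodeAnalyzer", "CodeAnalyzer")
app = web.application(urls, globals())

class CodeAnalyzer:
    def GET(self):
        # Set the header through web.py instead of embedding it in the body.
        web.header('Content-Type', 'text/html; charset=utf-8', unique=True)
        return "<html><head></head><body>Hello World</body></html>"

if __name__ == "__main__":
    app.run()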
When I run this code on my computer with the help of "Google App Engine SDK", it displays (in my browser) the HTML code of the Google home page:
from google.appengine.api import urlfetch
url = "http://www.google.com/"
result = urlfetch.fetch(url)
print result.content
How can I make it display the page itself? I mean I want to see that page in my browser the way it would normally be seen by any user of the internet.
Update 1:
I see I have received a few answers that look a bit complicated to me, although I definitely remember I was able to do this, and it was very simple; I just don't remember what exactly I changed in this code.
Perhaps I didn't give you all enough details on how I run this code and where I found it, so let me tell you what I did. I installed Python 2.5 on my computer and then downloaded the Google App Engine SDK and installed it too. Following the instructions on the GAE getting-started page (http://code.google.com/appengine/docs/python/gettingstarted/helloworld.html) I created a directory named My_test, then created a my_test.py in it containing the small piece of code mentioned in my question.
Then, still following those instructions, I created an app.yaml file in the directory, in which my my_test.py file was mentioned. After that, in Google App Engine Launcher, I found the My_test directory, clicked Run, and then Browse. Visiting http://localhost:8080/ in my web browser, I saw the results.
I definitely remember I was able to display any page in my browser this way, and it was very simple, but I don't remember exactly what I changed in the code (it was a slight change). Now all I can see is the raw HTML code of a page, not the page itself.
Update 2:
(this update is my response to wescpy)
Hello, wescpy! I've tried your updated code and something didn't work well. Perhaps it's because I am not using some framework that I am supposed to use with this code. Please take a look at the screenshot I posted (hosted on narod.ru).
It's not that easy: you have to parse the content and rewrite relative paths into absolute ones for images and JavaScript (a sketch of that rewriting step follows the snippet below).
Anyway, give it a try, adding the correct Content-Type:
from google.appengine.api import urlfetch
url = "http://www.google.com/"
result = urlfetch.fetch(url)
print 'Content-Type: text/html'
print ''
print result.content
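If you do want to serve the fetched page rather than redirect, here is a hedged sketch of that relative-to-absolute rewriting using lxml.html (an extra dependency, and an illustration rather than a complete solution; scripts that build URLs at runtime still won't be caught):

import lxml.html
from google.appengine.api import urlfetch

url = "http://www.google.com/"
result = urlfetch.fetch(url)

# Parse the page and rewrite every relative href/src against the source URL,
# so images, CSS and scripts resolve even though we serve from another host.
doc = lxml.html.fromstring(result.content)
doc.make_links_absolute(url)

print 'Content-Type: text/html'
print ''
print lxml.html.tostring(doc)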
a more complete example would look something like this:
from google.appengine.ext import webapp
from google.appengine.ext.webapp.util import run_wsgi_app
from google.appengine.api import urlfetch

class MainHandler(webapp.RequestHandler):
    def get(self):
        url = "http://www.google.com/"
        result = urlfetch.fetch(url)
        self.response.out.write(result.content)

application = webapp.WSGIApplication([
    ('/', MainHandler),
], debug=True)

def main():
    run_wsgi_app(application)

if __name__ == '__main__':
    main()
but as others have said, it's not that easy to do because you're not in the server's domain, meaning the pages will likely not look correct due to missing static content (JS, CSS, and/or images)... unless full pathnames are used or everything that's needed is embedded into the page itself.
UPDATE 1:
as mentioned before, you cannot just download the HTML source and expect things to render correctly because you don't necessarily have access to the static data. if you really want to render it as it was meant to be seen, you have to just redirect... here's the modified piece of code:
from google.appengine.ext import webapp
from google.appengine.ext.webapp.util import run_wsgi_app

class MainHandler(webapp.RequestHandler):
    def get(self):
        url = "http://www.google.com/"
        self.redirect(url)

application = webapp.WSGIApplication([
    ('/', MainHandler),
], debug=True)

def main():
    run_wsgi_app(application)

if __name__ == '__main__':
    main()
UPDATE 2:
Sorry! It was a cut-and-paste error; now try it.
Special characters such as < and > are likely entity-encoded; you'd have to decode them again for the browser to interpret them as markup.
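For instance, a small sketch of that decoding step using the standard library (assuming only the basic named entities are involved):

import xml.sax.saxutils

encoded = '&lt;html&gt;&lt;body&gt;hello&lt;/body&gt;&lt;/html&gt;'
# Turn &lt;, &gt; and &amp; back into literal <, > and & characters.
print xml.sax.saxutils.unescape(encoded)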
I have some JavaScript from a 3rd party vendor that is initiating an image request. I would like to figure out the URI of this image request.
I can load the page in my browser, and then monitor "Live HTTP Headers" or "Tamper Data" in order to figure out the image request URI, but I would prefer to create a command line process to do this.
My intuition is that it might be possible using Python + QtWebKit, but perhaps there is a better way.
To clarify: I might have this (overly simplified code).
<script>
    suffix = magicNumberFunctionIDontHaveAccessTo();
    url = "http://foobar.com/function?parameter=" + suffix;
    img = document.createElement('img');
    img.src = url;
    document.body.appendChild(img);
</script>
Then once the page is loaded, I can go figure out the url by sniffing the packets. But I can't just figure it out from the source, because I can't predict the outcome of magicNumberFunction...().
Any help would be much appreciated!
Thank you.
The simplest thing to do might be to use something like HtmlUnit and skip a real browser entirely. Because it uses Rhino, it can evaluate JavaScript and can likely be used to extract that URL.
That said, if you can't get that working, try Selenium RC and use the captureNetworkTraffic command (which requires the Selenium instance be started with the option captureNetworkTraffic=true). This will launch Firefox with a proxy configured and then let you pull the request info back out as JSON/XML/plain text. You can then parse that content and get what you want.
Try out the instant test tool that my company offers. If the data you're looking for is in our results (after you click View Details), you'll be able to get it from Selenium. I know, since I wrote the captureNetworkTraffic API for Selenium for my company, BrowserMob.
I would pick any one of the many HTTP proxy servers written in Python, probably one of the simplest ones at the very top of the list, and tweak it to record all URLs requested (as well as proxy-serve them), e.g. by appending them to a text file; without loss of generality, call that text file XXX.txt.
Now all you need is a script that: starts the proxy server in question; starts Firefox (or whatever) on your main desired URL with that proxy configured (see e.g. this SO question for how), though I'm sure other browsers would work just as well; waits a bit (e.g. until the proxy's XXX.txt file has not been altered for more than N seconds); reads XXX.txt to extract only the URLs you care about and records them wherever you wish; then shuts down the proxy and Firefox processes.
I think this will be much faster to put in place and make work correctly, for your specific requirements, than any more general solution based on QtWebKit, Selenium, or other "automation kits".
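As a rough sketch of that idea (not one of the existing proxy servers, and GET-only, so HTTPS and POST are not handled), here is a tiny proxy that appends every requested URL to XXX.txt before fetching it:

import urllib2
from BaseHTTPServer import HTTPServer, BaseHTTPRequestHandler

class LoggingProxy(BaseHTTPRequestHandler):
    def do_GET(self):
        # In proxy mode the request line carries the absolute URL,
        # so self.path is the full http://... address.
        with open('XXX.txt', 'a') as log:
            log.write(self.path + '\n')
        try:
            upstream = urllib2.urlopen(self.path)
            self.send_response(200)
            self.send_header('Content-Type',
                             upstream.info().getheader('Content-Type', 'text/html'))
            self.end_headers()
            self.wfile.write(upstream.read())
        except Exception:
            self.send_error(502)

if __name__ == '__main__':
    # Point the browser's HTTP proxy setting at localhost:8888.
    HTTPServer(('localhost', 8888), LoggingProxy).serve_forever()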
Use the Firebug Firefox plugin. It will show you all requests in real time, and you can even debug the JS in your browser or run it step by step.
Ultimately, I did it in Python using Selenium RC. This solution requires the Python client files for Selenium RC, and you need to start the Java server first (java -jar selenium-server.jar):
from selenium import selenium
import unittest
import lxml.html

class TestMyDomain(unittest.TestCase):
    def setUp(self):
        self.selenium = selenium("localhost",
                                 4444, "*firefox", "http://www.MyDomain.com")
        self.selenium.start()

    def test_mydomain(self):
        htmldoc = open('site-list.html').read()
        url_list = [link for (element, attribute, link, pos)
                    in lxml.html.iterlinks(htmldoc)]
        for url in url_list:
            try:
                sel = self.selenium
                sel.open(url)
                sel.select_window("null")
                js_code = '''
                    myDomainWindow = this.browserbot.getUserWindow();
                    for (obj in myDomainWindow) {
                        /* This code grabs the OMNITURE tracking pixel img */
                        if ((obj.substring(0, 4) == 's_i_') && (myDomainWindow[obj].src)) {
                            var ret = myDomainWindow[obj].src;
                        }
                    }
                    ret;
                '''
                omniture_url = sel.get_eval(js_code)  # parse & process this however you want
            except Exception, e:
                print 'We ran into an error: %s' % (e,)
        self.assertEqual("expectedValue", observedValue)  # placeholder; replace with a real assertion

    def tearDown(self):
        self.selenium.stop()

if __name__ == "__main__":
    unittest.main()
Why can't you just read suffix, or url for that matter? Is the image loaded in an iframe or in your own page?
If it is loaded in your page, then this may be a dirty hack (substitute document.body with whichever element the image is appended to):
var ac = document.body.appendChild;
var sources = [];
document.body.appendChild = function(child) {
    if (/^img$/i.test(child.tagName)) {
        sources.push(child.getAttribute('src'));
    }
    ac.call(document.body, child);  // invoke the original appendChild with the right `this`
};