I'm trying to write a server process that will allow you to enter a URL, then every 30 min ping that URL and capture it as an image. Is this possible with a combination of something like CURL, urllib2 and PIL?
Curl, urllib2, etc., grab the HTML code for a web page. But a page doesn't look like anything on its own. Instead, a browser uses that code and renders a web page according to its own internal rules of how that code should be used. And, of course, each browser renders the page slightly differently.
In other words, you can't take a snapshot of a page without having a web browser to generate the page to take the snapshot of.
If you're feeling very ambitious, you can create your own custom, scriptable page renderer by using the rendering engine from the browser of your choice -- they all make the rendering engine a separate component that you can work with separately. IE's is called "Trident", Firefox's is called "Gecko", Chrome's is "WebKit" (Chrome has since moved to Blink, its own fork of WebKit), etc.
Otherwise you'll want to just do some sort of UI scripting, like you might do with iOpus or Selenium. Selenium is scriptable with python, so that's one for you right there.
EDIT
Here you go. That looks pretty simple.
PIL's ImageGrab module can be used to take a screenshot on Windows. However, you can't do this purely with CURL, urllib2 and PIL, because something has to render the web site first. The easiest approach is probably to open the website in a browser and grab a screenshot of it.
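As a rough sketch of the "open it in a browser and grab a screenshot" approach, here is how it might look with Selenium, which the answer above suggests. The function names and the file naming scheme are made up for illustration; the only Selenium calls used are driver.get and driver.save_screenshot, and the driver is passed in so any browser Selenium supports will do.

```python
import time
from urllib.parse import urlparse


def snapshot_name(url, timestamp):
    """Build a file name from the host and a UTC timestamp (hypothetical scheme)."""
    host = urlparse(url).netloc.replace(":", "_")
    stamp = time.strftime("%Y%m%d-%H%M%S", time.gmtime(timestamp))
    return f"{host}-{stamp}.png"


def capture(driver, url, timestamp=None):
    """Load url in the given Selenium driver and save a PNG of what it rendered."""
    if timestamp is None:
        timestamp = time.time()
    path = snapshot_name(url, timestamp)
    driver.get(url)               # the browser fetches and renders the page
    driver.save_screenshot(path)  # writes the current window contents as PNG
    return path


def capture_forever(driver, url, interval_seconds=30 * 60):
    """Take one snapshot every interval_seconds until the process is stopped."""
    while True:
        capture(driver, url)
        time.sleep(interval_seconds)
```

To try it, assuming Selenium and Firefox are installed, create a driver with "from selenium import webdriver" and "driver = webdriver.Firefox()", then call capture_forever(driver, "http://example.com/"). A real server process would probably want error handling around each capture so one failed page load does not end the loop.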
Related
I am trying to scrape a web site using Python and Beautiful Soup. On some sites, the image links are visible in the browser but cannot be found in the source code. However, using Chrome Inspect or Fiddler, I can see the corresponding code.
What I see in the source code is:
<div id="cntnt"></div>
But in Chrome Inspect, I can see a whole bunch of HTML/CSS code generated within this div. Is there a way to load the generated content in Python as well? I am using the regular urllib in Python and I am able to get the source, but without the generated part.
I am not a web developer hence I am not able to express the behaviour in better terms. Please feel free to clarify if my question seems vague !
You need a JavaScript engine to parse and run the JavaScript code inside the page.
There are a bunch of headless browsers that can help you:
http://code.google.com/p/spynner/
http://phantomjs.org/
http://zombie.labnotes.org/
http://github.com/ryanpetrello/python-zombie
http://jeanphix.me/Ghost.py/
http://webscraping.com/blog/Scraping-JavaScript-webpages-with-webkit/
The content of the website may be generated after load via JavaScript. To obtain the generated content via Python, refer to this answer
A regular scraper gets just the HTML document. To get any content generated by JavaScript logic, you need a headless browser that also builds the DOM and loads and runs the scripts like a regular browser would. The Wikipedia article and some other pages on the Net have lists of those and their capabilities.
Keep in mind when choosing that some of the formerly major headless browsers are abandoned now.
TRY THIS FIRST!
Perhaps the data really is in the JavaScript itself and all this JavaScript engine business is needed. (Some GREAT links here!)
But from experience, my first guess is that the JS is pulling the data in via an ajax request. If you can get your program simulate that, you'll probably get everything you need handed right to you without any tedious parsing/executing/scraping involved!
It will take a little detective work though. I suggest turning on your network traffic logger (such as "Web Developer Toolbar" in Firefox) and then visiting the site. Focus your attention on any/all XMLHttpRequests. The data you need should be found somewhere in one of these responses, probably in the middle of some JSON text.
Now, see if you can re-create that request and get the data directly. (NOTE: You may have to set the User-Agent of your request so the server thinks you're a "real" web browser.)
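A minimal sketch of re-creating such a request with the standard library might look like this. It uses Python 3's urllib.request rather than the urllib2 mentioned in the question; the URL and the User-Agent string are placeholders, not values from any real site, and it assumes the response body is JSON.

```python
import json
import urllib.request

# Placeholder values: substitute the request URL found in the network log
# and whatever User-Agent string your own browser sends.
DATA_URL = "http://example.com/ajax/data.json"
BROWSER_UA = "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"


def build_request(url, user_agent=BROWSER_UA):
    """Prepare a GET request that identifies itself as a regular browser."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_json(url):
    """Send the request and decode the JSON body the page's script would receive."""
    with urllib.request.urlopen(build_request(url), timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling fetch_json(DATA_URL) would then return the decoded data as ordinary Python dicts and lists. Some sites also check cookies or other headers, in which case those need to be copied from the logged request too.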
I've looked at urllib(2), mechanize, and Beautiful Soup in the hope of finding something that captures network calls, such as pixel/beacon fires, from a page. Unfortunately I'm not very familiar with any of them, and I'm also not very clear on how to go about my search.
I'd like to use Python to run through a series of web URLs and capture each one's network calls, aka pixel fires. Would anyone know of a means or library I can start from in order to accomplish this?
I looked into web scraping, but I don't want the HTML; instead I believe I'm looking for the GET requests the site makes.
If I understand what you want, you want to log what requests a browser makes when displaying a page, across many pages.
Your options are to script a browser using python (See: http://wiki.python.org/moin/WebBrowserProgramming), or script the browser using javascript, and output your results in some way (I suggest JSON, over a request or to a file), and analyse them in python.
You'll probably find it easier to do the scripting in javascript, honestly.
Another possibility if you have access to the Firefox web browser is to install Firebug, a powerful debugging tool that gives you the option to display all network traffic from a web page in the browser console. In order to transfer the output from the console to a file you will need to install the ConsoleExport plugin for Firebug.
You will now be able to capture all the traffic from a web page to a file which you can then parse with Python.
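The parsing step depends on the format of the exported file. As an illustration only, the sketch below assumes the traffic was saved as a HAR file (the JSON layout exported by browser network panels and by Firebug's NetExport extension); ConsoleExport output may look different, so the field names here are an assumption about your export, and the function names are made up.

```python
import json
from urllib.parse import urlparse


def load_har(path):
    """Read a HAR file (JSON) from disk into a dict."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def requests_in_har(har):
    """Yield (method, url) for every request recorded in a HAR dict."""
    for entry in har["log"]["entries"]:
        request = entry["request"]
        yield request["method"], request["url"]


def third_party_requests(har, page_host):
    """Return the URLs requested from any host other than page_host.

    Tracking pixels and beacons usually go to a different host than the
    page itself, so this is a crude first filter, not a reliable detector.
    """
    return [url for _method, url in requests_in_har(har)
            if urlparse(url).hostname != page_host]
```

For example, third_party_requests(load_har("traffic.har"), "example.com") would list every logged request that did not go to example.com; the file name and host are placeholders.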
Hello, how can I make changes in my web browser with Python? Like filling in forms and pressing Submit?
What libs should I use? And maybe one of you has some examples?
Using urllib does not make any changes in the opened browser for me.
Urllib is not intended to do anything with your browser, but rather to get content from URLs.
To fill in forms and that kind of thing, have a look at mechanize; to scrape web pages, consider using pyquery.
Selenium is great for this. It's a browser automation tool that you can use to launch a browser (any major browser or a 'headless' one), navigate to a url, and interact with the page.
It's used primarily for testing web code against multiple browsers, but is also very useful for 'scraping' pages and automating mundane tasks.
Here are the python docs: http://selenium-python.readthedocs.org/en/latest/index.html
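As a small sketch of what filling in a form with Selenium can look like: the function below takes an already-created driver, types values into fields located by their HTML name attribute, and clicks the submit control. The function name, the field names and the URL in the usage note are invented for illustration; a real page will need its own locators.

```python
def fill_and_submit(driver, url, fields, submit_name):
    """Open url, type each value into the field with that name, then submit.

    fields maps the HTML name attribute of each input to the text to enter;
    submit_name is the name attribute of the submit button.
    """
    driver.get(url)
    for name, value in fields.items():
        # "name" is the locator strategy string that selenium's By.NAME stands for
        box = driver.find_element("name", name)
        box.clear()
        box.send_keys(value)
    driver.find_element("name", submit_name).click()
    return driver.current_url  # where the browser ended up after submitting
```

Assuming Selenium and Firefox are installed, usage would be along the lines of "from selenium import webdriver", "driver = webdriver.Firefox()", then fill_and_submit(driver, "http://example.com/login", {"user": "alice", "password": "secret"}, "login"). Unlike urllib, this drives a visible browser window, so you can watch the form being filled in.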
link text
This is a link from a digital book library. There are forward and backward buttons to see the next and previous page. I want to download these pictures automatically. I once used urllib in Python, but the website soon banned it. I just want to download this book for study purposes, so can anyone recommend some programming tools, such as web spiders, which can simulate the process of turning pages and get the pictures automatically? Thanks!
That site uses Javascript, so you can't easily scrape it with Python. Two suggestions:
Work out what requests are being made when clicking the next button. You can do this with a tool like firebug. You might then find you can scrape it without processing any JS.
Use a tool such as Selenium which allows for browser scripting which lets you "execute" the JS.
As for the site blocking you, there are two ways to reduce the chance of being blocked:
Change your user-agent to that of a common browser, e.g. Firefox.
Add random delays between accessing the next image, so that you appear more human-like.
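The two precautions above could be combined roughly as follows. This is only a sketch: the User-Agent string, the delay bounds and the function names are arbitrary choices for illustration, and it uses Python 3's urllib.request.

```python
import random
import time
import urllib.request

# Example User-Agent of a common browser; copy the one your own browser sends.
BROWSER_UA = "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"


def fetch(url, user_agent=BROWSER_UA):
    """Download one URL while presenting a browser-like User-Agent."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read()


def download_all(urls, fetch_page=fetch, min_delay=2.0, max_delay=8.0,
                 sleep=time.sleep):
    """Fetch each URL in turn, pausing a random time between requests."""
    pages = []
    for index, url in enumerate(urls):
        if index:  # no need to wait before the very first request
            sleep(random.uniform(min_delay, max_delay))
        pages.append(fetch_page(url))
    return pages
```

download_all(list_of_image_urls) returns the raw bytes of each image in order, which can then be written to files. The fetch_page and sleep parameters exist only so the loop can be tried out without touching the network.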
wget is an excellent web spider
http://linux.die.net/man/1/wget
You need a real browser to work with this (kind of) site. Selenium is one option, but it is more geared towards web testing. For web scraping iMacros is really nice. I had a quick test and it works well with iMacros for Firefox/IE.
Chris
I'm trying to scrape some information from a web site, but am having trouble reading the relevant pages. The pages seem to first send a basic setup, then more detailed info. My download attempts only seem to capture the basic setup. I've tried urllib and mechanize so far.
Firefox and Chrome have no trouble displaying the pages, although I can't see the parts I want when I view page source.
A sample url is https://personal.vanguard.com/us/funds/snapshot?FundId=0542&FundIntExt=INT
I'd like, for example, average maturity and average duration from the lower right of the page. The problem isn't extracting that info from the page, it's downloading the page so that I can extract the info.
The page uses JavaScript to load the data. Firefox and Chrome are only working because you have JavaScript enabled - try disabling it and you'll get a mostly empty page.
Python isn't going to be able to do this by itself - your best compromise would be to control a real browser (Internet Explorer is easiest, if you're on Windows) from Python using something like Pamie.
The website loads the data via ajax. Firebug shows the ajax calls. For the given page, the data is loaded from https://personal.vanguard.com/us/JSP/Funds/VGITab/VGIFundOverviewTabContent.jsf?FundIntExt=INT&FundId=0542
See the corresponding javascript code on the original page:
<script>populator = new Populator({parentId:
"profileForm:vanguardFundTabBox:tab0",execOnLoad:true,
populatorUrl:"/us/JSP/Funds/VGITab/VGIFundOverviewTabContent.jsf?FundIntExt=INT&FundId=0542",
inline:false,type:"once"});
</script>
The reason is that the page performs AJAX calls after it loads. You will need to find those URLs and scrape their content as well.
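One way to find those URLs automatically, sketched under the assumption that other pages on the site embed a populatorUrl entry in the same form as the script quoted above, is to pull the path out with a regular expression and resolve it against the page address. The function name and the pattern are illustrative, not part of the site's API.

```python
import re
from urllib.parse import urljoin

# Matches populatorUrl:"..." as it appears in the quoted script; pages that
# format the script differently would need a different pattern.
POPULATOR_RE = re.compile(r'populatorUrl\s*:\s*"([^"]+)"')


def populator_urls(page_url, html):
    """Return the absolute URLs of every populatorUrl found in the page source."""
    return [urljoin(page_url, path) for path in POPULATOR_RE.findall(html)]
```

Given the sample page address and its source, this yields the VGIFundOverviewTabContent.jsf address mentioned above, which can then be downloaded and parsed like any ordinary page.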
As RichieHindle mentioned, your best bet on Windows is to use the WebBrowser class to create an instance of an IE rendering engine and then use that to browse the site.
The class gives you full access to the DOM tree, so you can do whatever you want with it.
http://msdn.microsoft.com/en-us/library/system.windows.forms.webbrowser(loband).aspx