How you can realize a minimized view of a html page in a div (like google preview)?
http://img228.imageshack.us/i/minimized.png/
edit: ok.. i see its a picture on google, probably a minimized screenshot.
This is more or less a duplicate of the question: Create thumbnails from URLs using PHP
However, just to add my 2ยข, my strong preference would be to use an existing web service, e.g. websnapr, as mentioned by thirtydot in the comments on your question. Generating the snapshots yourself will be difficult to scale well, and just the kind of thing I'd think is worth using an established service for.
If you really do want to do this yourself, I've had success using CutyCapt to generate snapshots of webpages - there are various other similar options (i.e. external programs you can call to do the rendering) mentioned in that other question.
google displays an image thumbnail, so you would need to generate an image using GD or ImageMagic.
The general flow would be
Fetch page content, including stylesheets and all images via curl (potentially tricky to capture all the embedded files but shouldn't be beyond a competent PHP programmer
Construct a rendering of the page inside PHP itself (EXTREMELY tricky! Wouldn't even know where to start with that, though there might be some kind of third party extension available)
Use GD/Imagemagic/whatever to generate a thumbnail image in an appropriate format (shouldn't be too hard).
Clearly, it's the rendering the page from the HTML, CSS, images etc you downloaded that is going to be the difficult part.
Personally I'd be wondering if the effort involved is worth it.
Related
This question already has answers here:
Web-scraping JavaScript page with Python
(18 answers)
Closed 4 hours ago.
What is the best method to scrape a dynamic website where most of the content is generated by what appears to be ajax requests? I have previous experience with a Mechanize, BeautifulSoup, and python combo, but I am up for something new.
--Edit--
For more detail: I'm trying to scrape the CNN primary database. There is a wealth of information there, but there doesn't appear to be an api.
The best solution that I found was to use Firebug to monitor XmlHttpRequests, and then to use a script to resend them.
This is a difficult problem because you either have to reverse engineer the JavaScript on a per-site basis, or implement a JavaScript engine and run the scripts (which has its own difficulties and pitfalls).
It's a heavy weight solution, but I've seen people doing this with GreaseMonkey scripts - allow Firefox to render everything and run the JavaScript, and then scrape the elements. You can even initiate user actions on the page if needed.
Selenium IDE, a tool for testing, is something I've used for a lot of screen-scraping. There are a few things it doesn't handle well (Javascript window.alert() and popup windows in general), but it does its work on a page by actually triggering the click events and typing into the text boxes. Because the IDE portion runs in Firefox, you don't have to do all of the management of sessions, etc. as Firefox takes care of it. The IDE records and plays tests back.
It also exports C#, PHP, Java, etc. code to build compiled tests/scrapers that are executed on the Selenium server. I've done that for more than a few of my Selenium scripts, which makes things like storing the scraped data in a database much easier.
Scripts are fairly simple to write and alter, being made up of things like ("clickAndWait","submitButton"). Worth a look given what you're describing.
Adam Davis's advice is solid.
I would additionally suggest that you try to "reverse-engineer" what the JavaScript is doing, and instead of trying to scrape the page, you issue the HTTP requests that the JavaScript is issuing and interpret the results yourself (most likely in JSON format, nice and easy to parse). This strategy could be anything from trivial to a total nightmare, depending on the complexity of the JavaScript.
The best possibility, of course, would be to convince the website's maintainers to implement a developer-friendly API. All the cool kids are doing it these days 8-) Of course, they might not want their data scraped in an automated fashion... in which case you can expect a cat-and-mouse game of making their page increasingly difficult to scrape :-(
There is a bit of a learning curve, but tools like Pamie (Python) or Watir (Ruby) will let you latch into the IE web browser and get at the elements. This turns out to be easier than Mechanize and other HTTP level tools since you don't have to emulate the browser, you just ask the browser for the html elements. And it's going to be way easier than reverse engineering the Javascript/Ajax calls. If needed you can also use tools like beatiful soup in conjunction with Pamie.
Probably the easiest way is to use IE webbrowser control in C# (or any other language). You have access to all the stuff inside browser out of the box + you dont need to care about cookies, SSL and so on.
i found the IE Webbrowser control have all kinds of quirks and workarounds that would justify some high quality software to take care of all those inconsistencies, layered around the shvwdoc.dll api and mshtml and provide a framework.
This seems like it's a pretty common problem. I wonder why someone hasn't anyone developed a programmatic browser? I'm envisioning a Firefox you can call from the command line with a URL as an argument and it will load the page, run all of the initial page load JS events and save the resulting file.
I mean Firefox, and other browsers already do this, why can't we simply strip off the UI stuff?
I am wondering if and how it is possible to 'listen to' the text that is in my browser window.
I am specifically NOT looking to scrape websites in the sense that I want to crawl them for information, I am just interested in interacting with an arbitrary page that makes my browser output text.
Example
Suppose I am asking a question on Stack Overflow.
By the time I type a title of 'Listen to text in browser' the suggestions appear, one of them contains plain text 'Listen to browser request'
As soon as the word 'request' is on my screen, the browser will get shut down and I order a pizza
What would a good solution look like
I want to be able to do this for practically any website that somehow makes my computer show simple text. Ideally without having knowledge of how the text is generated.
I want this to be somewhat fast, subsecond should be possible
I do not want to hit the website or its api's, just want to use the information that is already on my screen.
I am not too picky about the OS and browser requirements.
I can also imagine there may be corner cases wher it is hard (perhaps text is shown as a picture, or perhaps parts of a sentence are actually spread across multiple textboxes that are just displayed to eachother). For now I just wonder how this can be done for a simple page.
Bonus points if it could even capture text from the field in which I am typing myself, so I can scald myself when I am about to say something stupid.
What have I come up with so far
I am in general confident that I can process the text once I get it into a tool, however the main challenge is on how one can listen to the browser.
I tried looking at the source code, but that does not appear to contain this dynamic text
Perhaps there is a steraming API on the browser itself, that can stream out changes?
Perhaps there is a way to grab all text from a browser, perhaps 10x per second or so
Using the normal scraping solutions are completely not what I want, so I do not want to fire a request to the webserver 10x per second.
In the worst case, I suppose we could use screen capture software, followed by text recognition, but I really hope there is something more elegant
I suppose there may be automation/testing software that can do this. That would be an answer but something lightweight (e.g. a python library) would be nicest.
I have tried searching but did not find any solution, or even the question. Presumably I am using the wrong words.
Problem
Downloading a complete working offline copy of a website that loads links/images dynamically
Research
There are questions (e.g. [1], [2], [3]) on Stackoverflow addressing this issue, most of which have the top answers using wget or httrack, both of which fail miserably (please do correct me if I am wrong) on pages that dyanmically load links or uses srcset instead of src for img tag -or anything loaded via JS-. A rather obvious solution was Selenium, however, if you ever used Selenium in production, you quickly start seeing the issues that arise from such a decision (resource heavy, quite complex to use head-full driver, the fact that is it not built for that), that being said, there are people claiming to have been using it easily in production for years
Expected Solution
A script (preferably in python), that parses the page for links and loads them separately. I cannot seem to find any existing scripts that do that. If your solution is "so implement your own", then it is pointless to be asking the question in the first place, I am seeking an existing implementation.
Examples
Shopify.com
Websites built using Wix
Now there are head-less versions of Selenium and alternatives such as PhantomJS, either can be used with a small script to scrape any dynamically loaded website.
I had implemented a generic scraper here, and explained more about the topic here
Python noobie.
I'm trying to make Python select a portion of my screen. In this case, it is a small window within a Firefox window -- it's Firebug source code. And then, once it has selected the right area, control-A to select all and then control-C to copy. If I could figure this out then I would just do the same thing and paste all of the copies into a .txt file.
I don't really know where to begin -- are there libraries for this kind of thing? Is it even possible?
I would look into PyQt or PySide which are Python wrapper on TOp of Qt.
Qt is a big monster but it's very well documented and i'm sure it will help you further in your project once you grabbed your screen section.
As you've mentioned in the comments, the data is all in the HTML to start (I'm guessing it's greyed out in your Firebug screenshot since it's a hidden element). This approach avoids the complexity of trying to automate a browser. Here's a rough outline of how I would get the data:
Download the HTML for the whole page - I'd do this manually at first (i.e. File > Save from a browser), and if there are a bunch of pages you want to process, figure out how to download all the pages you want later. If you want to use python for this part, I'd recommend urllib2. The URLs for each page are probably pretty structured, so you could easily store them in a list, and download each one and save it locally. .
Write a script to parse the HTML - don't use regex. Since you're using Python, use something like Beautiful Soup, which will create a nice object representation of the page, and then you can get the elements you want.
You mention you're new to python, so there's definitely going to be a learning curve around this, but this actually sounds like a pretty doable project to use to learn some more python.
If you run into specific obstacles with each step, start a new question with a bit of sample code, showing what you're trying to accomplish, and people will be more than willing to help out.
Recently I needed to generate a huge HTML page containing a report with several thousand row table. And, obviously, I did not want to build the whole HTML (or the underlying tree) in memory. As result, I built the page with the old good string interpolation, but I do not like the solution.
Thus, I wonder whether there are Python templating engines that can yield resulting page content by parts.
UPD 1: I am not interested in listing all available frameworks and templating engines.
I am interested in templating solutions that I can use separately from any framework and which can yield content by portions instead of building the whole result in memory.
I understand the usability enhancements from partial content loading with client scripting, but that is out of the scope of my current question. Say, I want to generate a huge HTML/XML and stream it into a local file.
Most popular template engines have a way to generate or write rendered result to file objects with chunks. For example:
Template.generate() in Jinja2
Template.render_context() in Mako
Stream.serialize() in Genshi
It'd be more user-friendly (assuming they have javascript enabled) to build the table via javascript by using e.g. a jQuery plugin which allows automatical loading of contents as soon as you scroll down. Then only few rows are loaded initially and when the user scrolls down more rows are loaded on demand.
If that's not a solution, you could use three templates: one for everything before the rows, one for everything after the rows and a third one for the rows.
Then you first send the before-rows template, then generate the rows and send them immediately, then the after-rows template. Then you will have only one block/row in memory instead of the whole table.
There is no problem with building something like this in memory. Several thousand rows is by no means big.
For your templating needs you can use any of the:
rst http://docutils.sourceforge.net/rst.html (just needs core docutils)
markdown http://daringfireball.net/projects/markdown/ (python lib)
sphinx https://www.sphinx-doc.org (python based)
json http://www.json.org/ (python lib)
There are some tools that allow generation of HTML from these markup languages.
You don't need a streaming templating engine - I do this all the time, and long before you run into anything vaguely heavy server-side, the browser will start to choke. Rendering a 10000 row table will peg the CPU for several seconds in pretty much any browser; scrolling it will be bothersomely choppy in chrome, and the browser mem usage will rise regardless of browser.
What you can do (and I've previously implemented, even though in retrospect it turns out not to be necessary) is use client-side xslt. Printing the xslt processing instruction and the opening and closing tag using strings is easy and fairly safe; then you can stream each individual row as a standalone xml element using whatever xml writer technique you prefer.
However - you really don't need this, and likely never will - if ever your html generator gets too slow, the browser will be an order of magnitude more problematic.
So, unless you benchmarked this and have determined you really have a problem, don't waste your time. If you do have a problem, you can solve it without fundamentally changing the method - in memory generation can work just fine.
Are you using a web framework for this?
http://www.pylonshq.com includes compatibility with several templating engines.
http://www.djangoproject.com/ Django has its own templating language.
I think an answer that included lazy loading of the rows with javascript would work for web view, but I presume the report is going to need to be printed, in which case you'll have to build the whole thing at some point, right?