How can I make HTML from an email safe to display in a web browser with Python?
Any external references shouldn't be followed when displayed. In other words, all displayed content should come from the email and nothing from internet.
Other than spam, emails should be displayed as closely as possible to what the writer intended.
I would like to avoid coding this myself.
Solutions requiring latest browser (firefox) version are also acceptable.
html5lib contains an HTML+CSS sanitizer. It allows too much currently, but it shouldn't be too hard to modify it to match the use case.
Found it from here.
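For a rough idea of what that could look like in practice, here is a minimal sketch using bleach, a sanitizer built on top of html5lib (bleach itself and the allowlist below are my assumptions, not part of the original answer):

    import bleach

    # Illustrative allowlist only: no <a>, <img>, <script>, <link> etc., so nothing
    # clickable survives and nothing is fetched from the internet when displayed.
    ALLOWED_TAGS = {"p", "br", "b", "i", "em", "strong", "ul", "ol", "li",
                    "table", "tr", "td", "th", "span", "div", "blockquote"}
    ALLOWED_ATTRS = {"td": ["colspan", "rowspan"]}

    def sanitize_email_html(html):
        # strip=True removes disallowed tags entirely instead of escaping them
        return bleach.clean(html, tags=ALLOWED_TAGS, attributes=ALLOWED_ATTRS, strip=True)

    safe_html = sanitize_email_html(open("mail.html").read())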
I'm not quite clear on what exactly you mean by "safe". It's a pretty big topic... but, for what it's worth:
In my opinion, the stripping parser from the ActiveState Cookbook is one of the easiest solutions. You can pretty much copy/paste the class and start using it.
Have a look at the comments as well. The last one states that it doesn't work anymore, but I also have this running in an application somewhere and it works fine. From work, I don't have access to that box, so I'll have to look it up over the weekend.
Use the HTMLParser module, or install BeautifulSoup, and use one of them to parse the HTML and disable or remove the tags. This will leave whatever link text was there, but it will not be highlighted and it will not be clickable, since you are displaying it with a web browser component.
You could make it clearer what was done by replacing the <A></A> with a <SPAN></SPAN> and changing the text decoration to show where the link used to be. Maybe a different shade of blue than normal and a dashed underscore to indicate brokenness. That way you are a little closer to displaying it as intended without actually misleading people into clicking on something that is not clickable. You could even add a hover in Javascript or pure CSS that pops up a tooltip explaining that links have been disabled for security reasons.
Similar things could be done with <IMG></IMG> tags including replacing them with a blank rectangle to ensure that the page layout is close to the original.
I've done stuff like this with Beautiful Soup, but HTMLParser is included with Python. In older Python distributions, there was an htmllib module, which is now deprecated. Since the HTML in an email message might not be fully correct, use Beautiful Soup 3.0.7a, which is better at making sense of broken HTML.
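A rough sketch of that replacement idea, using the modern bs4 package rather than the 3.0.7a release mentioned above (the class names are just placeholders for whatever CSS you add):

    from bs4 import BeautifulSoup

    def defang(html):
        soup = BeautifulSoup(html, "html.parser")

        # Replace links with <span>s so the text stays visible but nothing is clickable.
        for a in soup.find_all("a"):
            span = soup.new_tag("span")
            span.string = a.get_text()
            span["class"] = "disabled-link"   # e.g. dashed underline + tooltip via CSS
            span["title"] = "Link disabled for security reasons"
            a.replace_with(span)

        # Replace images with empty placeholders so nothing is fetched from the network.
        for img in soup.find_all("img"):
            box = soup.new_tag("span")
            box["class"] = "removed-image"    # style as a blank rectangle to keep the layout
            img.replace_with(box)

        return str(soup)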
Related
I'm looking for a way to extract the heading and content from raw HTML. There are a couple of Python packages out there which do this (Newspaper3k, python-readability, python-goose), but I'm looking to do something more like how the human eye sees. My idea is to use the visual placement of a div on a page to determine if it's part of the main content of a page or not. How can I extract the placement of a div using Python? Any other ideas on how to approach this problem?
To the best of my understanding, you want to locate and extract HTML from certain divs on a website, but on screen, with a cursor and a keyboard (like a human would). For that purpose, you could go with PyAutoGUI.
You can use pyautogui.locateOnScreen() with a parameter of your choice, and then carry on with scraping tools.
With PyAutoGui, you can automate click events as well.
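A minimal sketch of that flow (the reference screenshot file name is an assumption; you would capture it yourself, and behaviour on a miss differs slightly between PyAutoGUI versions):

    import pyautogui

    # "element.png" is a saved screenshot of the element to find (a placeholder here).
    try:
        box = pyautogui.locateOnScreen("element.png")
    except pyautogui.ImageNotFoundException:
        box = None

    if box is None:
        print("Element not visible on screen")
    else:
        x, y = pyautogui.center(box)
        pyautogui.click(x, y)            # click it like a user would
        pyautogui.hotkey("ctrl", "a")    # then drive the keyboard, e.g. select all...
        pyautogui.hotkey("ctrl", "c")    # ...and copy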
For further research, you can check the docs.
Hope this answers your question; if you have doubts, please feel free to ask!
As you mentioned, the worst part of the Python packages you listed is the required knowledge of HTML and DOM structure. Nevertheless, that knowledge is necessary for scraping. I can share a hybrid approach.
First step: I use WebScraper.io Chrome extension to visually select items on the page (like on the image) and save them.
Second step: Once I have DOM selectors like p a.cta (on the image), I use them with the Python scraping package.
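For example, with requests and BeautifulSoup (the URL is a placeholder; p a.cta is the selector from the screenshot):

    import requests
    from bs4 import BeautifulSoup

    html = requests.get("https://example.com/some-page").text   # placeholder URL
    soup = BeautifulSoup(html, "html.parser")

    # The CSS selector found visually with WebScraper.io, reused verbatim in code.
    for link in soup.select("p a.cta"):
        print(link.get_text(strip=True), link.get("href"))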
I use this approach almost for any scraping project. I hope it helps.
What is the best method to scrape a dynamic website where most of the content is generated by what appears to be ajax requests? I have previous experience with a Mechanize, BeautifulSoup, and python combo, but I am up for something new.
--Edit--
For more detail: I'm trying to scrape the CNN primary database. There is a wealth of information there, but there doesn't appear to be an api.
The best solution that I found was to use Firebug to monitor XmlHttpRequests, and then to use a script to resend them.
This is a difficult problem because you either have to reverse engineer the JavaScript on a per-site basis, or implement a JavaScript engine and run the scripts (which has its own difficulties and pitfalls).
It's a heavyweight solution, but I've seen people doing this with GreaseMonkey scripts - allow Firefox to render everything and run the JavaScript, and then scrape the elements. You can even initiate user actions on the page if needed.
Selenium IDE, a tool for testing, is something I've used for a lot of screen-scraping. There are a few things it doesn't handle well (Javascript window.alert() and popup windows in general), but it does its work on a page by actually triggering the click events and typing into the text boxes. Because the IDE portion runs in Firefox, you don't have to do all of the management of sessions, etc. as Firefox takes care of it. The IDE records and plays tests back.
It also exports C#, PHP, Java, etc. code to build compiled tests/scrapers that are executed on the Selenium server. I've done that for more than a few of my Selenium scripts, which makes things like storing the scraped data in a database much easier.
Scripts are fairly simple to write and alter, being made up of things like ("clickAndWait","submitButton"). Worth a look given what you're describing.
Adam Davis's advice is solid.
I would additionally suggest that you try to "reverse-engineer" what the JavaScript is doing, and instead of trying to scrape the page, you issue the HTTP requests that the JavaScript is issuing and interpret the results yourself (most likely in JSON format, nice and easy to parse). This strategy could be anything from trivial to a total nightmare, depending on the complexity of the JavaScript.
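A sketch of that strategy using the requests library (not mentioned in the original answers; the endpoint, parameters, and JSON layout below are hypothetical and would come from watching the page's XmlHttpRequests in Firebug or the browser's network panel):

    import requests

    # Hypothetical endpoint discovered by watching the page's XHR traffic.
    url = "http://example.com/election/results.json"
    resp = requests.get(url,
                        params={"state": "IA"},
                        headers={"X-Requested-With": "XMLHttpRequest"})
    data = resp.json()

    # Inspect a real response to learn the actual structure; this layout is assumed.
    for row in data.get("results", []):
        print(row)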
The best possibility, of course, would be to convince the website's maintainers to implement a developer-friendly API. All the cool kids are doing it these days 8-) Of course, they might not want their data scraped in an automated fashion... in which case you can expect a cat-and-mouse game of making their page increasingly difficult to scrape :-(
There is a bit of a learning curve, but tools like Pamie (Python) or Watir (Ruby) will let you latch into the IE web browser and get at the elements. This turns out to be easier than Mechanize and other HTTP-level tools since you don't have to emulate the browser; you just ask the browser for the HTML elements. And it's going to be way easier than reverse engineering the JavaScript/Ajax calls. If needed you can also use tools like Beautiful Soup in conjunction with Pamie.
Probably the easiest way is to use the IE WebBrowser control in C# (or any other language). You have access to all the stuff inside the browser out of the box, plus you don't need to care about cookies, SSL and so on.
I found the IE WebBrowser control has all kinds of quirks and workarounds that would justify some high-quality software to take care of all those inconsistencies, layered around the shdocvw.dll API and MSHTML, and provide a framework.
This seems like it's a pretty common problem. I wonder why no one has developed a programmatic browser. I'm envisioning a Firefox you can call from the command line with a URL as an argument, and it will load the page, run all of the initial page-load JS events and save the resulting file.
I mean, Firefox and other browsers already do this, so why can't we simply strip off the UI stuff?
Python noobie.
I'm trying to make Python select a portion of my screen. In this case, it is a small window within a Firefox window -- it's Firebug source code. And then, once it has selected the right area, control-A to select all and then control-C to copy. If I could figure this out then I would just do the same thing and paste all of the copies into a .txt file.
I don't really know where to begin -- are there libraries for this kind of thing? Is it even possible?
I would look into PyQt or PySide, which are Python wrappers on top of Qt.
Qt is a big monster, but it's very well documented and I'm sure it will help you further in your project once you've grabbed your screen section.
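As a rough sketch of the grabbing part with PyQt5 (the coordinates are placeholders; double-check the exact call against the Qt docs for your version):

    import sys
    from PyQt5.QtWidgets import QApplication

    app = QApplication(sys.argv)
    screen = app.primaryScreen()

    # Grab a rectangle of the desktop: window id 0 means the whole screen,
    # followed by x, y, width, height of the region you care about.
    pixmap = screen.grabWindow(0, 100, 200, 640, 480)
    pixmap.save("region.png", "png")

Note that this captures pixels rather than text, so for copying source code the approach in the next answer (working with the HTML directly) may be a better fit.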
As you've mentioned in the comments, the data is all in the HTML to start (I'm guessing it's greyed out in your Firebug screenshot since it's a hidden element). This approach avoids the complexity of trying to automate a browser. Here's a rough outline of how I would get the data:
Download the HTML for the whole page - I'd do this manually at first (i.e. File > Save from a browser), and if there are a bunch of pages you want to process, figure out how to download all the pages you want later. If you want to use Python for this part, I'd recommend urllib2. The URLs for each page are probably pretty structured, so you could easily store them in a list, and download each one and save it locally.
Write a script to parse the HTML - don't use regex. Since you're using Python, use something like Beautiful Soup, which will create a nice object representation of the page, and then you can get the elements you want.
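A sketch of both steps together (URLs, file names, and the tag/class being extracted are placeholders; the answer suggests urllib2, which is Python 2, so this uses its Python 3 counterpart urllib.request):

    import urllib.request
    from bs4 import BeautifulSoup

    # Step 1: download each page and save it locally.
    urls = ["http://example.com/page1.html", "http://example.com/page2.html"]
    for i, url in enumerate(urls):
        with urllib.request.urlopen(url) as resp:
            with open("page_%d.html" % i, "wb") as out:
                out.write(resp.read())

    # Step 2: parse a saved page and pull out the elements you want.
    with open("page_0.html", "rb") as f:
        soup = BeautifulSoup(f, "html.parser")

    # The tag and class here are assumptions; use whatever the real page contains.
    for cell in soup.find_all("td", class_="data"):
        print(cell.get_text(strip=True))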
You mention you're new to python, so there's definitely going to be a learning curve around this, but this actually sounds like a pretty doable project to use to learn some more python.
If you run into specific obstacles with each step, start a new question with a bit of sample code, showing what you're trying to accomplish, and people will be more than willing to help out.
I've just started using Python and Selenium today, so I'm in at the deep end a little.
So far I've used the documentation to get a python script to load google, search for something and then take a screenshot of the results.
What I want is to be able to load a website, navigate to certain elements and take screenshots of various pages. I'm struggling to find documentation for navigation however.
Could someone point me towards (or post an answer with) examples/explanation of find_element and what you can actually find, and also how to open elements once found. The documentation for lots of what I wanted is still under development :(
I've been looking through the WebDriver docs on googlecode at the kind of methods I thought I needed but it seems they are all part of the private API so what alternatives are there?
I keep seeing this on everything:
This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.
Found a great example of Action_Chains here: https://stackoverflow.com/a/8261754/1199464
While the selenium documentation is not in a particularly good order, I feel like everything is there.
You could e.g. start here: http://code.google.com/p/selenium/wiki/FurtherResources
xpath seems a good choice for finding elements.
Also this page seems to contain what you need: http://seleniumhq.org/docs/03_webdriver.html#commands-and-operation
edit: I found this and it should contain what you need: http://selenium.googlecode.com/svn/trunk/docs/api/py/api.html
(Sorry p0deje, I didn't see that you already posted that last link...)
You can take a look at the basics there:
http://code.google.com/p/selenium/wiki/PythonBindings
http://pypi.python.org/pypi/selenium
The full and up-to-date documentation:
http://selenium.googlecode.com/svn/trunk/docs/api/py/index.html
Good links, but using XPath for locators is strongly discouraged (too brittle). Use ID or name, or CSS if you cannot.
A few links to best practices:
Selenium Best Practices (page objects; preferred selector order: id > name > css > xpath)
Slideshow - more advanced
Compare locators pro/con - XPath is slow and brittle, especially in IE.
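For instance, a minimal sketch of that preferred order in the Python bindings (the page URL, IDs, and names are made up):

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Firefox()
    driver.get("https://example.com/login")   # placeholder page

    # Prefer id, then name, then CSS; fall back to XPath only when nothing else works.
    driver.find_element(By.ID, "username").send_keys("alice")
    driver.find_element(By.NAME, "password").send_keys("secret")
    driver.find_element(By.CSS_SELECTOR, "form button[type='submit']").click()

    driver.save_screenshot("after_login.png")
    driver.quit()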
I want to allow users to create tiny templates that I then render in Django with a predefined context. I am assuming the Django rendering is safe (I asked a question about this before), but there is still the risk of cross-site scripting, and I'd like to prevent this. One of the main requirements of these templates is that the user should have some control over the layout of the page, not just its semantics. I see a couple of solutions:
Allow the user to use HTML, but filter out dangerous tags manually in the final step (things like <script> and <a onclick='..'>). I'm not so enthusiastic about this option, because I'm afraid I might overlook some tags. Even then, the user could still use absolute positioning on <div>s to mess up a thing or two on the rest of the page.
Use a markup language that produces safe HTML. From what I can see, in most markup languages, I could strip any html, and then process the result. The problem with this is that most markup languages are not very powerful layout-wise. As far as I could see there is no way to center elements in Markdown, not even in ReST. The pro here is that some markup languages are well-documented, and users might already know how to use them.
Come up with some proprietary markup. The cons I see here are pretty much all implied by the word proprietary.
So, to summarize: is there some safe and easy way to "purify" HTML, preventing XSS, or is there a reasonably ubiquitous markup language that gives some control over layout and styling?
Resources:
My earlier question about Django templates
Class names in markdown.
Seeing Pekka's answer, I tried to quickly Google an HTML Purifier equivalent in Python. Here's what I came up with: Python HTML Sanitizer. At first glance, it looks pretty good to me.
There's the PHP-based HTML Purifier; I have not used it myself yet, but I've heard very good things about it. They promise a lot:
HTML Purifier is a standards-compliant HTML filter library written in PHP. HTML Purifier will not only remove all malicious code (better known as XSS) with a thoroughly audited, secure yet permissive whitelist, it will also make sure your documents are standards compliant, something only achievable with a comprehensive knowledge of W3C's specifications.
Maybe it's worth a try even though it's not Python based. Update: #Matchu has found a Python based alternative that looks good too.
You'll have a lot of very difficult edge cases, though; just think about Flash embeds. Plus, malicious uses of position: absolute are extremely difficult to track down (there's position: relative that could achieve the same effect, but also be a completely legitimate layout tool). Maybe take a look at what eBay, for example, allows and doesn't allow? If anybody has the necessary experience to know what's dangerous and what isn't from millions of examples, they do.
Related resources on EBay:
HTML & JavaScript with examples
Site Interference (it's unclear, though, what is just forbidden and what gets filtered)
From what I found, they don't seem to publish their internal HTML blacklists, but output an error message if forbidden code is found. (Probably a wise move on their part, but unfortunate for the purposes of this question.)
"Use a markup language that produces safe HTML."
Clearly, the only sensible approach.
"The problem with this is that most markup languages are not very powerful layout-wise."
False.
"no way to center elements in ReST."
False.
Centering is a style -- a CSS feature -- not a markup feature.
The way to center is to assign a CSS class to a piece of text. The .. class:: directive does this.
You can also define your own interpreted text role, if that's necessary for specifying an inline class on a piece of <span> markup.
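To illustrate, a small sketch rendering ReST with docutils (the class name center, and the CSS rule that actually centers it, are assumptions your stylesheet would have to provide):

    from docutils.core import publish_parts

    source = """\
    .. class:: center

    This paragraph comes out with class="center" in the generated HTML,
    so a stylesheet rule like .center { text-align: center; } centers it.
    """

    # publish_parts returns the rendered fragments; html_body is the piece to embed.
    html = publish_parts(source, writer_name="html")["html_body"]
    print(html)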
You are overlooking server-side security issues. You need to be very careful that users can't use the template's import or include mechanism to access files they don't have permission to.
The bigger challenge is preventing infinite loops and recursion in the template system. This is an obvious threat to system performance, but depending on the implementation and deployment setup, the server may never time out. With a finite number of Python threads at your disposal, repeated calls to a misbehaving template could quickly bring your site down.