Making user-made HTML templates safe

Making user-made HTML templates safe - python

I want to allow users to create tiny templates that I then render in Django with a predefined context. I am assuming the Django rendering is safe (I asked a question about this before), but there is still the risk of cross-site-scripting, and I'd like to prevent this. One of the main requirements of these templates is that the user should have some control over the layout of the page, not just it's semantics. I see a couple of solutions:
Allow the user to use HTML, but filter out dangerous tags manually in the final step (things like <script> and <a onclick='..'>. I'm not so enthusiastic about this option, because I'm afraid I might overlook some tags. Even then, the user could still use absolute positioning on <divs> to mess up a thing or two on the rest of the page.
Use a markup language that produces safe HTML. From what I can see, in most markup languages, I could strip any html, and then process the result. The problem with this is that most markup languages are not very powerful layout-wise. As far as I could see there is no way to center elements in Markdown, not even in ReST. The pro here is that some markup languages are well-documented, and users might already know how to use them.
Come up with some proprietary markup. The cons I see here are pretty much all implied by the word proprietary.
So, to summarize: Is there some safe and easy way to "purify" HTML — preventing xss — or is there a reasonably ubiquitous markup language that gives some control over layout and styling.
Resources:
My earlier question about Django templates
Class names in markdown.

Seeing Pekka's answer, I tried to quickly Google an HTML Purifier equivalent in Python. Here's what I came up with: Python HTML Sanitizer. At first glance, it looks pretty good to me.

There's PHP-Based HTML purifier, I have not used it myself yet but heard very good things about it. They promise a lot:
HTML Purifier is a standards-compliant
HTML filter library written in
PHP. HTML Purifier will not only remove all malicious
code (better known as XSS) with a thoroughly audited,
secure yet permissive whitelist,
it will also make sure your documents are
standards compliant, something only achievable with a
comprehensive knowledge of W3C's specifications.
Maybe it's worth a try even though it's not Python based. Update: #Matchu has found a Python based alternative that looks good too.
You'll have a lot of very difficult edge cases, though, just think about Flash embeds. Plus, malicious uses of position: absolute are extremely difficult to track down (there's position: relative that could achieve the same effect, but also be a completely legitimate layout tool.) Maybe take a look at what - for example - EBay allow, and don't allow? If anybody has the necessary experience to know what's dangerous and what isn't from millions of examples, they do.
Related resources on EBay:
HTML & JavaScript with examples
Site Interference it's unclear, though, what is just forbidden, and what gets filtered
From what I found, they don't seem to publish their internal HTML blacklists, but output an error message if forbidden code is found. (Probably a wise move on their part, but unfortunate for the purposes of this question.)

"Use a markup language that produces safe HTML."
Clearly, the only sensible approach.
"The problem with this is that most markup languages are not very powerful layout-wise."
False.
"no way to center elements in ReST."
False.
Centering is a style -- a CSS feature -- not a markup feature.
The want to center is to assign an CSS Class to a piece of text. The .. class:: directive does this.
You can also define your own interpreted text role, if that's necessary for specifying an inline class on a piece of <span> markup.

You are overlooking server side security issues. You need to be very careful that users can't use the templates import or include mechanism to access files they don't have permission to.
The bigger challenge is to prevent the template system from infinite loops and recursion. This is an obvious threat to system performance, but depending on the implementation and deployment setup, the server may never timeout. With a finite number of python threads at your disposal, repeated calls to a misbehaving template could quickly bring your site down.

Related

What is the best practice for storing UI messaging strings in Python/Django?

I'm wondering what the "best practice" is for storing medium length strings to be used in a UI in Python/Django.
Example:
I have an error.html template that takes an error_description field. This is a few sentences explaining to the user what went wrong, and what they might do to address it. It may be different for different error pages, but remains fairly stable in the code(there's no reason that someone who can't push source code should be able to modify it), and can easily be held in memory, so I don't think it's the sort of thing that should be kept in the database.
My current idea is that I should just create some kind of messages.py file that has a bunch of string constants like this:
ERROR_UNAUTHENTICATED_AJAX_EXPLANATION = "Hello, so the only way that this error should occur is if someone attempts to directly call our AJAX endpoint without having the verification code. Please don't do this, it's against the principles of this art projects."
In general, is there some canonical way to store strings that are "too flexible to be hard coded in", but "too small and static for databases"(and don't scale with your usage)? I'm thinking of the sort of thing that would be in a strings.xml file in an Android project.
Other possibilities I'm juggling include a text file that views.py reads and stores as constants, actually just hardcoding them, and sticking them in template files.
There's a lot of ways to do this, and it's not a very complicated thing, I just want to know which one is the most 'right'.
Thanks! And let me know if you need more info!

If you are absolutely sure, these strings never need to become dynamic just create a strings.py module and drop the strings there as variables ("constants")
However, as the messages are user visible, you will most likely need to localize them at some point in your application's lifetime. Consequently, please make use of Django's excellent gettext support:
https://docs.djangoproject.com/en/1.7/topics/i18n/

Why don't django templates just use python code?

I mean I understand that these templates are aimed at designers and other less code-savvy people, but for developers I feel the template language is just a hassle. I need to re-learn how to do very simple things like iterate through dictionaries or lists that I pass into the template, and it doesn't even seem to work very well. I'm still having trouble getting the whole "dot" notation working as I would expect (for example, {{mydict.dictkey}} inside a for loop doesn't work :S -- I might ask this as a separate question), and I don't see why it wouldn't be possible to just use python code in a template system. In particular, I feel that if templates are meant to be simple, then the level of python code that would need to be employed in these templates would be of a caliber not more complicated than the current templating language. So these designer peeps wouldn't have more trouble learning that much python than they would learning the Django template language (and there's more places you can go with this knowledge of basic python as opposed to DTL) And the added advantage would be that people who already know python would be in familiar territory with all the usual syntax and power available to them and can just get going.
Am I missing something? If so I plead django noob and would love for you to enlighten me on the many merits of the current system. But otherwise, any recommendations on other template systems that may be more what I'm looking for?

The reason that most people give for limited template languages is that they don't want to mix the business logic of their application with its presentation (that wouldn't work well with the MVC philosophy; using Django I'm sure you understand the benefits of this).
Daniel Greenfeld wrote an article a few days ago explaining why he likes "stupid template languages", and many people wrote responses (see the past few days on Planet Python). If you read what Daniel wrote and how others responded to it, you'll get an idea of some of the arguments for and against allowing template languages to use Python.

Don't forget that you aren't limited to Django's template language. You're free to use whatever templating system you like in your view functions. However you want to create the HTML to return from your view function is fine. There are many templating implementations in the Python world: choose one that suits you better, and use it.

Seperation of concerns.
Designer does design. Developer does development. Templates are written by the designers.
Design and development are independent and different areas of work typically handled by different people.
I guess having template code in python would work very well if one is a developer and their spouse is a designer. Otherwise, let each do his job, with least interference.

Django templates don’t just use Python code for the same reason Django uses the MVC paradigm:
No particular reason.
(ie: The same reason anyone utilizes MVC at all, and that reason is just that certain people prefer this rigid philosophical pattern.)
In general I’d suggest you avoid Django if you don’t like things like this, because the Django people won’t be changing this approach. You could also, however, do something silly (in that it’d be contradictory to the philosophy of the chosen software), like put all the markup and anything else you can into the "view" files, turning Django from its "MVC" (or "MTV" ) paradigm into roughly what everything else is (a boring but straightforward lump).

Templates vs. coded HTML

I have a web-app consisting of some html forms for maintaining some tables (SQlite, with CherryPy for web-server stuff). First I did it entirely 'the Python way', and generated html strings via. code, with common headers, footers, etc. defined as functions in a separate module.
I also like the idea of templates, so I tried Jinja2, which I find quite developer-friendly. In the beginning I thought templates were the way to go, but that was when pages were simple. Once .css and .js files were introduced (not necessarily in the same folder as the .html files), and an ever-increasing number of {{...}} variables and {%...%} commands were introduced, things started getting messy at design-time, even though they looked great at run-time. Things got even more difficult when I needed additional javascript in the or sections.
As far as I can see, the main advantages of using templates are:
Non-dynamic elements of page can easily be viewed in browser during design.
Except for {} placeholders, html is kept separate from python code.
If your company has a web-page designer, they can still design without knowing Python.
while some disadvantages are:
{{}} delimiters visible when viewed at design-time in browser
Associated .css and .js files have to be in same folder to see effects in browser at design-time.
Data, variables, lists, etc., must be prepared in advanced and either declared globally or passed as parameters to render() function.
So - when to use 'hard-coded' HTML, and when to use templates? I am not sure of the best way to go, so I would be interested to hear other developers' views.
TIA, Alan

Although I'm not a Python developer, I'll answer here - I believe the idea of using templates is common for PHP and Python.
Using templates has many advantages, like:
keeping the code clean. Separating the "logic" (controller) code from the presentation (view) is very important. Working on projects that mix HTML / CSS / JS / Python is really hard.
keeping HTML in separate files doesn't require you to modify the code itself. For example, placing in the controller code (Python code) might require you to put a slash before each " character.
you can ask your web designer to learn the basics of templates syntax so he's able to help you much without destroying your work on the controller code (which is quite common when a person with no experience in a given language modifies something)
in fact, there are many more advantages, but those are most important for me.
The only disadvantage is that you have to pass parameters to the rendering function ... which doesn't require much of work. Anyway it's much easier than maintaining any project that mixes controller code with view code.
Generally, you should have a look at >MVC pros and cons question< :
What is MVC and what are the advantages of it?

The simplest way to solve your static file problem is to use relative paths when referring to them in your html. For example: <img src="static/image.jpg" />
If you're willing to put in a little more work, you can solve all the design-time problems you mentioned by writing a mini-server to display your templates.
Maintain a file full of simple data structures containing example values for all your templates.
Use a microframework like Werkzeug to serve http on your local machine.
Write a root request handler that scans your data structure list or templates directory to produce an index page with links to all your templates.
Write a secondary request handler for non-root requests, which renders the requested template with the data structure of the same name.
You can write this tool in a few hours, and it makes template design very convenient. One nice feature of Werkzeug's built-in wsgi server is that it can automatically reload itself when it detects that a file has changed. You can leave your mini-server running while you edit templates and click links on your index page all day.

I would highly recommend using templates. Templates help to encourage a good MVC structure to your application. Python code that emits HTML, IMHO, is wrong. The reason I say that is because Python code should be responsible for doing logic and not have to worry about presentation. Template syntax is usually restrictive enough that you can't really do much logic within the template, but you can do any presentation specific type logic that you may need.
ymmv.

I think that templates are still the best way to separate presentation from business logic. The key is a good templating engine, in particular, one that can make decisions in the template itself. For Python, I've found the Genshi templating engine to be quite good. It's used by the Trac Wiki/Issue tracking system and is quite powerful, while still leaving the templates easy to work with.
For other tasks, I tend to use the BeautifulSoup module. I'll create a simple HTML page, parse it with BeautifulSoup, use the resulting object to add the necessary data, and then write the output to its destination (typically a file for me).

As a Seaside developer, I see no use for templates, if your designer can do css. In practice, I find it impossible to keep templates DRY. Take a look at this question to see the advantages of using code (a DSL) to generate your page. You might be bound by legacy, of course.
Separation of presentation (html) and business logic in web apps does not lead to good modularisation, i.e. low coupling and high cohesion. That is why separate template systems don't work well.
I just read Jeff Atwoods "What's Wrong With CSS". This again is a problem long solved in the smalltalk world with the Phantasia DSL

How can I make HTML safe for web browser with python?

How can I make HTML from email safe to display in web browser with python?
Any external references shouldn't be followed when displayed. In other words, all displayed content should come from the email and nothing from internet.
Other than spam emails should be displayed as closely as possible like intended by the writer.
I would like to avoid coding this myself.
Solutions requiring latest browser (firefox) version are also acceptable.

html5lib contains an HTML+CSS sanitizer. It allows too much currently, but it shouldn't be too hard to modify it to match the use case.
Found it from here.

I'm not quite clear with what exactly you mean with "safe". It's a pretty big topic... but, for what it's worth:
In my opinion, the stripping parser from the ActiveState Cookbook is one of the easiest solutions. You can pretty much copy/paste the class and start using it.
Have a look at the comments as well. The last one states that it doesn't work anymore, but I also have this running in an application somewhere and it works fine. From work, I don't have access to that box, so I'll have to look it up over the weekend.

Use the HTMLparser module, or install BeautifulSoup, and use those to parse the HTML and disable or remove the tags. This will leave whatever link text was there, but it will not be highlighted and it will not be clickable, since you are displaying it with a web browser component.
You could make it clearer what was done by replacing the <A></A> with a <SPAN></SPAN> and changing the text decoration to show where the link used to be. Maybe a different shade of blue than normal and a dashed underscore to indicate brokenness. That way you are a little closer to displaying it as intended without actually misleading people into clicking on something that is not clickable. You could even add a hover in Javascript or pure CSS that pops up a tooltip explaining that links have been disabled for security reasons.
Similar things could be done with <IMG></IMG> tags including replacing them with a blank rectangle to ensure that the page layout is close to the original.
I've done stuff like this with Beautiful Soup, but HTMLparser is included with Python. In older Python distribs, there was an htmllib which is now deprecated. Since the HTML in an email message might not be fully correct, use Beautiful Soup 3.0.7a which is better at making sense of broken HTML.

Passing Variables to Django Comment Views

Alright, I know I've asked similar questions, but I feel this is hopefully a bit different. I'm integrating django.comments into my application, and the more I play with it, the more I realize it may not even be worth my while at the end of the day. That aside, I've managed to add Captcha to my comments, and I've learned that customizing the form is a terrible idea (hiding that honeypot is stupidly difficult, and from what I can tell requires JS to hide. Pity.). That's alright though, I've managed to work with it. However, the templates for the comments (preview and posted) are frustrating.
When a user is sent to the preview or posted templates, I'd like my sidebar's that have dynamic data to still be functional, however they're not. Do I have to override/rewrite the comments views to push data to these views? At that point it seems like I'm rewriting a major chunk of the comment system anyway, and it'd almost be beneficial to just write my own in that case. I'm more than willing to do that, and totally understand that I'm not entitled to a perfect comments system from Django. I just want to make sure I'm thinking right, and that if I want more than what I get from the comment views, that rewriting them is my only path.
Surely someone's found a healthier way though, so I thought I'd poll the audience. Any thoughts? If you need more info, just lemme know!

Dynamic data in sidebars is what template tags are for.
There's absolutely no need to muck around with the built-in views - just define the tags add them to your templates.

I user template tags as well. Templates in Django are truly for displaying data only.
I think Django believes in separation between the Designers and the Developers. So, they are enforcing the idea of templates should be simple enough for web designer to work with. (the photoshop guys)
So, as long as you don't need complected functionality, just pass the info to a filter and have it do the data manipulation and return the final string that you need.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.