I would like to remove XSS / JavaScript-injection vulnerabilities in a web application where users write content in an editor like CKEditor, which allows arbitrary HTML (and regardless of whether my particular editor allows arbitrary HTML, blackhats will be able to submit arbitrary HTML anyway). So no JavaScript: no SCRIPT tags, no ONCLICK and family, nothing else of that kind. The target platform is Python and Django.
What are my best options here? I am open to an implementation that whitelists tags and attributes; that is, I don't consider it necessary to let users submit everything you can build in HTML with only the JavaScript removed. I am happy with rich text built from a supported set of tags, as long as it allows fairly expressive formatting. I would also be open to an editor that produces Markdown, combined with stripping all HTML tags before the data is saved. (HTML manipulation seems simpler, but I would also consider Markdown-based solutions.)
I also don't consider it necessary to produce a sanitized text; throwing an exception that says the submission has failed testing would be fine instead. (Ergo, lowercasing the string and searching for '<script', 'onclick', etc. might be sufficient.)
Probably my first choice in a solution, if I have the choice, would be a whitelist of tag and attribute names.
What are the best solutions, if any, that are out there?
If you choose to use a WYSIWYG editor that produces HTML, using bleach on the server to sanitize your HTML (via whitelisting) is probably enough.
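For example, a minimal sketch with bleach, where the whitelist below is just an illustration to adapt, not a recommendation:

```python
import bleach

# Illustrative whitelist; extend it to whatever rich text you want to allow.
ALLOWED_TAGS = ["a", "b", "blockquote", "em", "i", "li", "ol", "p", "strong", "ul"]
ALLOWED_ATTRIBUTES = {"a": ["href", "title"]}

def sanitize(html):
    # strip=True drops disallowed tags instead of escaping them;
    # disallowed attributes such as onclick are always removed.
    return bleach.clean(html, tags=ALLOWED_TAGS,
                        attributes=ALLOWED_ATTRIBUTES, strip=True)
```

Run this on the submitted HTML before saving, or compare its output with the input and reject the submission if they differ, which matches the "throw an exception" approach described above.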
If you choose a markdown (or other non-HTML markup) editor, you will probably want to save the markdown source, then generate and sanitize the HTML (after generation!) on the server side. This lets you keep the markdown as-is (inline HTML and all), since the HTML is sanitized after rendering. However, if your client-side editor supports preview, you also need to be very careful about in-browser rendering when markdown is loaded back from the server! Most markdown editors include client-side sanitizers for exactly this purpose.
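A sketch of that pipeline, assuming the markdown and bleach packages (the whitelist is again purely illustrative):

```python
import bleach
import markdown

def render_markdown(source):
    # Render first, sanitize second, so inline HTML in the markdown
    # source also passes through the whitelist.
    html = markdown.markdown(source)
    return bleach.clean(html, tags=["a", "em", "li", "ol", "p", "strong", "ul"],
                        attributes={"a": ["href"]}, strip=True)
```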
I am trying to scrape a web site using Python and Beautiful Soup. I have found that on some sites, image links that are visible in the browser cannot be seen in the source code. However, using Chrome Inspect or Fiddler, we can see the corresponding code.
What I see in the source code is:
<div id="cntnt"></div>
But in Chrome Inspect, I can see a whole bunch of HTML/CSS code generated within this div. Is there a way to load the generated content in Python as well? I am using the regular urllib in Python and I am able to get the source, but without the generated part.
I am not a web developer, hence I am not able to express the behaviour in better terms. Please feel free to ask for clarification if my question seems vague!
You need a JavaScript engine to parse and run the JavaScript code inside the page.
There are a bunch of headless browsers that can help you (a minimal Selenium sketch follows the list):
http://code.google.com/p/spynner/
http://phantomjs.org/
http://zombie.labnotes.org/
http://github.com/ryanpetrello/python-zombie
http://jeanphix.me/Ghost.py/
http://webscraping.com/blog/Scraping-JavaScript-webpages-with-webkit/
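Note that several of the projects above are no longer maintained. As one current illustration (my own suggestion, not one of the links above), Selenium can drive a headless Chrome, assuming Chrome and a matching chromedriver are installed:

```python
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--headless")   # run without opening a window
driver = webdriver.Chrome(options=options)
driver.get("http://example.com/")    # placeholder URL
html = driver.page_source            # the DOM after the page's JavaScript has run
driver.quit()
```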
The content of the website may be generated after load via JavaScript. To obtain the generated content in Python, refer to this answer.
A regular scraper gets just the HTML document. To get content generated by JavaScript logic, you instead need a headless browser, which will also build the DOM and load and run the scripts like a regular browser would. The Wikipedia article and some other pages on the net have lists of these and their capabilities.
Keep in mind when choosing that some formerly major products in this space are now abandoned.
TRY THIS FIRST!
Perhaps the data really is buried in the JavaScript itself and all this JavaScript-engine business is needed. (Some GREAT links here!)
But from experience, my first guess is that the JS is pulling the data in via an AJAX request. If you can get your program to simulate that request, you'll probably get everything you need handed right to you without any tedious parsing/executing/scraping involved!
It will take a little detective work, though. I suggest turning on your network traffic logger (such as the "Web Developer Toolbar" in Firefox) and then visiting the site. Focus your attention on any and all XmlHTTPRequests. The data you need should be found somewhere in one of those responses, probably in the middle of some JSON text.
Now, see if you can re-create that request and get the data directly. (NOTE: You may have to set the User-Agent of your request so the server thinks you're a "real" web browser.)
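A rough sketch of that approach with the requests library; every URL, parameter, and header below is a placeholder you would copy from your own traffic log:

```python
import requests

# Hypothetical endpoint and parameters, taken from the XHR you spotted.
headers = {"User-Agent": "Mozilla/5.0"}  # some servers reject unknown agents
resp = requests.get("http://example.com/api/items",
                    params={"page": 1}, headers=headers)
resp.raise_for_status()
data = resp.json()  # many XHR endpoints return JSON you can use directly
```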
Can I have any kind of highlighting using Python 2.7? Say, when my script clicks the submit button, feeds data into a text field, or selects a value from a drop-down, I would like to highlight that element so the person running the script can see it is doing what they want.
EDIT
I am using selenium-webdriver with python to automate some web based work on a third party application.
Thanks
This is something you need to do with JavaScript, not Python.
[NOTE: I'm leaving this answer for historical purposes but readers should note that the original question has changed from concerning itself with Python to concerning itself with Selenium]
Assuming you're talking about a browser based application being served from a Python back-end server (and it's just a guess since there's no information in your post):
If you are constructing a response in your Python back-end, wrap the stuff that you want to highlight in a <span> tag and set a class on the span tag. Then, in your CSS define that class with whatever highlighting properties you want to use.
However, if you want to accomplish this highlighting in an already-loaded browser page without generating new HTML on the back end and returning it to the browser, then Python (on the server) has no knowledge of, or ability to affect, the web page in the browser. You must accomplish this using JavaScript, or a JavaScript library or framework, in the browser.
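That said, since the edit mentions selenium-webdriver, Selenium can inject that JavaScript for you from Python. A sketch, where the URL and element id are placeholders:

```python
from selenium import webdriver

driver = webdriver.Firefox()
driver.get("http://example.com/form")  # placeholder URL

element = driver.find_element_by_id("submit")  # hypothetical element id
# Outline the element via injected JavaScript so the viewer can see
# which element the script is about to act on.
driver.execute_script("arguments[0].style.outline = '3px solid red';", element)
element.click()
```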
I'm trying to use Python to navigate through a website that has auth forms on its landing page, rendered by ASP scripts.
But when I use Python (with mechanize, requests, or urllib) to fetch the HTML of that site, I always end up with a semi-blank HTML file, due to those ASP scripts.
Would anyone know any method that I can use to get the final (as displayed on a browser) version of an ASP site?
Your target page is a frameset. There is nothing fancy going on from the server side that I can tell. When I use requests or urllib to download it, even sending no headers at all, I get exactly the same HTML that I see in Chrome or Firefox. There is some embedded JS, but it doesn't do anything. Basically, all there is here is a frameset with a single frame in it.
The frame target is also a perfectly normal page with nothing fancy going on from the server side that I can tell. Again, if I fetch it with no headers, I get the exact same contents as in Chrome or Firefox. There is plenty of embedded JS here, but it's not building the DOM from scratch or anything; the static contents that I get from the server have the whole page contents in them. I can strip out all the JS and render it, and it looks exactly the same.
There is a minor problem that neither the server nor the HTML specifies a charset anywhere, and yet the contents aren't ASCII, which means you need to guess what charset to decode if you want to process it as Unicode. But if you're in Python 2.x, and just planning to grab things out of the DOM by ID or something, that won't matter.
I suspect your real problem is just that you don't know how HTML framesets work. You're downloading the frameset, not downloading the referenced frame, and wondering why the resulting page looks like an empty frameset.
Frames are an obsolete feature that nobody uses anymore for anything but a common trick for letting the user pop up a new window even in ancient browsers, and some obscure tricks for fooling popup blockers. In HTML 5 they're finally gone. But as long as ancient websites are out there and need to be scraped, you need to know how they work.
This isn't a substitute for the full documentation, but here's the short version of what a web browser does with a frameset: For each frame tag, it follows the src attribute, then it replaces the contents of the frame tag with a #document tag with no attributes, with the results of reading the src URL as its contents. Beyond that, of course, frames affect layout, but that probably doesn't affect you.
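In scraping terms, that means following the frame's src yourself. A sketch with requests and BeautifulSoup, where the URL is a placeholder:

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin  # on Python 2: from urlparse import urljoin

base_url = "http://example.com/login.asp"  # placeholder URL
frameset_html = requests.get(base_url).text
soup = BeautifulSoup(frameset_html, "html.parser")

# The frameset page itself is nearly empty; the real content is in the frame.
frame = soup.find("frame")
frame_url = urljoin(base_url, frame["src"])
content = requests.get(frame_url).text  # the page you actually see in the browser
```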
Meanwhile, if you're trying to learn web scraping, you really want to install your browser's "Web Developer Tools" (different browsers have different names), or a full-on debugger like Firebug. That way, you can inspect the live tree that your browser is rendering, and compare it to what you get from your script (or, more simply, from wget). So, next time you can say "In Chrome's Inspect Page, I see a #document under the frame, with a whole bunch of stuff underneath that, but when I try to read the same page myself, the frame has no children".
When someone writes a post and copies and pastes a url in it, can Django detect it and render it as a hyperlink rather than plain text?
Django has the urlize template filter which will automatically detect both URLs and email addresses and turn them into the appropriate hyperlinks.
The docs there are actually a little thin, so I recommend also reading the docstring in the source for the urlize function for more information.
urlize:
http://docs.djangoproject.com/en/dev/ref/templates/builtins/?from=olddocs#urlize
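A minimal illustration of both the filter and the underlying function (the variable names are just examples):

```python
# In a template you would write: {{ post.body|urlize }}
from django.utils.html import urlize

text = "See https://example.com or mail me@example.com"
html = urlize(text, nofollow=True)  # wraps the URL and the email in <a> tags
```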
Another option is to parse the plain text in some way, for example as reStructuredText (my favourite) or Markdown (Stack Overflow uses a slightly modified variant of Markdown). Both will turn valid plain-text link targets into hyperlinks. This also gives you more power over what you can do; you won't need to resort to HTML to achieve some basic formatting. Note, as stated for urlize, that you should only use these on plain text; they're not designed to be mixed with HTML.
Very newbie question, but please be gentle with me. Our site uses Django CMS and we're trying to insert some javascript into particular stories, but it appears Django is stripping out any javascript or iframes we put in there as soon as we save the story. How do we allow javascript to be used in stories? Is it being deliberately excluded, or do we need to code this function into the site?
Any help would be incredibly appreciated.
Django is probably automatically escaping the JavaScript/HTML as the template renders the content. It does this for security purposes.
The solution depends on which version of django you're running, whether you'll be rendering any content from untrusted sources, how the templates are put together and perhaps the view that prepares the content for the template.
Django doesn't strip out JavaScript as such, because it is client-side agnostic.
How are you inserting the JavaScript into your website? If you are putting it into the database, it will be escaped when the template renders it.
Read through the docs on automatic HTML escaping:
http://docs.djangoproject.com/en/1.1/topics/templates/#id2
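If (and only if) the story content is fully trusted, escaping is typically disabled explicitly. A hedged sketch, where story is a placeholder model instance:

```python
from django.utils.safestring import mark_safe

# Equivalent in a template: {{ story.body|safe }}
# Only do this for content from trusted authors; it reopens the XSS
# issues discussed at the top of this page.
html = mark_safe(story.body)
```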