I am using CKEditor to manage rich text. I need to search this field, but words with accented characters are saved as HTML entities. For example, the word 'està' on my front page is saved in the database as 'est&agrave;', so the search never matches.
Any advice? I am thinking of using html2text to convert the HTML into plain text.
Thanks for your answers.
If you want to save your content in Unicode instead of using HTML entities, add this to your config:
config.removePlugins = 'entities';
config.entities = false;
I do this because I need 100% XML compatibility and so far it works well enough, but I also remove certain XML-breaking characters that clever users paste into the editor.
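If changing the editor config is not an option (or old rows already contain entities), the html2text idea from the question can be approximated on the server. Below is a minimal sketch using only the Python standard library; the function name html_to_search_text is made up for illustration, and it assumes you store its output in a separate plain-text column that the search runs against.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &agrave; in text nodes.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def html_to_search_text(html_fragment):
    """Return entity-decoded, tag-free, whitespace-normalised text."""
    parser = _TextExtractor()
    parser.feed(html_fragment)
    parser.close()
    # Join fragments with a space so words in adjacent blocks do not merge,
    # then collapse runs of whitespace.
    return " ".join(" ".join(parser.parts).split())


print(html_to_search_text("<p>est&agrave; <b>b&eacute;</b></p>"))
```

A word that is split by an inline tag in the middle (for example es<b>t</b>à) would come out as separate words with this approach, so treat it as a starting point only.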
Related
I would like to remove vulnerabilities to XSS / JavaScript injection in a web application where users are allowed to use an editor like CKEditor which allows arbitrary HTML (and whether my specific choice of editor allows arbitrary HTML or not, blackhats will be able to submit arbitrary HTML anyway). So no JavaScript, whether SCRIPT tags, ONCLICK and family, or whatever else. The target platform is Python and Django.
What are my best options here? I am open to an implementation that would whitelist tags and attributes; that is to say I don't see it as necessary to allow a user to submit everything that you can build in HTML while only JavaScript gets removed. I am happy to have rich text with supported tag availability that can allow fairly expressive rich text. I would also be open to an editor that produces Markdown, and strip all HTML tags before the data is saved. (HTML manipulation seems simpler, but I would also consider Markdown-implemented solutions.)
I also don't consider it necessary to produce a sanitized text if instead an exception is thrown that says that a submission has failed testing. (Ergo, lowercasing the string, and searching for '<script', 'onclick', etc. might be sufficient.)
Probably my first choice in a solution, if I have the choice, would be a whitelist of tag and attribute names.
What are the best solutions, if any, that are out there?
If you choose to use a WYSIWYG editor that produces HTML, using bleach on the server to sanitize your HTML (via whitelisting) is probably enough.
If you choose to use a markdown (or another non-html markup) editor, you will also probably save the markdown source and generate and sanitize the html (after generation!) on the server side. This allows you to keep markdown as is (with inline html etc.) as html is sanitized post rendering. However, if your client-side editor supports preview, you would also need to be very careful regarding in browser rendering when markdown is loaded from the server! Most markdown editors include client side sanitizers for this purpose.
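To make "sanitize via whitelisting" concrete, here is a dependency-free sketch of the idea using the standard library's html.parser. It is not bleach's API and not a replacement for a maintained sanitizer; the allowed tags, attributes and schemes below are a hypothetical policy chosen for illustration. The output is rebuilt from scratch, so anything not explicitly allowed is dropped and all text and attribute values are re-escaped.

```python
from html import escape
from html.parser import HTMLParser
from urllib.parse import urlsplit

# Hypothetical policy for illustration; adjust to what your editor should allow.
ALLOWED_TAGS = {"p", "br", "b", "i", "em", "strong", "ul", "ol", "li", "a"}
ALLOWED_ATTRS = {"a": {"href", "title"}}
ALLOWED_SCHEMES = {"http", "https", "mailto", ""}  # "" means a relative URL
VOID_TAGS = {"br"}
DROP_CONTENT = {"script", "style"}


def _safe_href(value):
    # Browsers ignore tabs/newlines inside a scheme, so strip control
    # characters and spaces before checking it.
    cleaned = "".join(ch for ch in value if ch > " ")
    try:
        return urlsplit(cleaned).scheme.lower() in ALLOWED_SCHEMES
    except ValueError:
        return False


class WhitelistSanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in ALLOWED_TAGS:
            return
        kept = []
        for name, value in attrs:
            if value is None or name not in ALLOWED_ATTRS.get(tag, ()):
                continue
            if name == "href" and not _safe_href(value):
                continue
            kept.append(' %s="%s"' % (name, escape(value, quote=True)))
        self.out.append("<%s%s>" % (tag, "".join(kept)))

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif not self.skip_depth and tag in ALLOWED_TAGS and tag not in VOID_TAGS:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))


def sanitize_html(fragment):
    parser = WhitelistSanitizer()
    parser.feed(fragment)
    parser.close()
    return "".join(parser.out)


print(sanitize_html('<p onclick="x()">Hi <script>alert(1)</script></p>'))
```

This sketch does not balance unclosed tags, so the result may still be malformed (though not executable) HTML. In a Django project you would call something like this, or preferably bleach, in the form's or model's clean step before saving.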
I have a model with a TextField that users can populate with HTML. Now I want Django to render a dynamic "Table Of Contents", so that whenever a heading tag (<h1> to <h6>) is used, Django automatically adds it to a list. Bonus points if a nested list is also possible.
I've thought about using inclusion tags, but I'm not sure on the exact details. Any help would be much appreciated.
I have never done this exactly...but here is how I would start off.
Get the HTML
First things first - get the user to add the HTML. Possibly install tinyMCE and use tinyMCE's HTML field. The user can copy and paste HTML or add HTML content using tinyMCE's WYSIWYG editor.
Generate a DOM object
You might be thinking - I will write some regex to find the relevant h1-6 tags. The problem is that parsing HTML with regex is a nightmare and should be avoided; use a proper HTML parser instead.
Next you will probably need to render the html. There are probably quite a few ways to do this.
Table of Contents
Then go through the rendered html and pull out all the h1-6 tags or whatever you are after.
You might want to edit the html and add the TOC to the top of the list - either that or save it as a separate html snippet that you can insert into the main html document when you render it.
If generating the TOC sounds like too much work... it probably is. I'm sure you could find an existing solution that automatically pulls out the h1 headers and links to the content. Here is a jquery plugin that does just that. A bit of googling could be in order to find the easiest way to generate the TOC; it seems like it would be quite a common thing.
Other Thoughts
Be aware - you need to ensure you aren't parsing javascript when rendering the page due to security implications!
Bonus marks for modifying the html to hyperlink the TOC to the actual heading.
Make sure you re-generate the TOC if the user modifies the HTML content.
Good luck.
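For what it's worth, the "pull out the h1-6 tags" step, including the nested list from the question, can be sketched with the standard library's html.parser. The names HeadingCollector and build_toc are made up for this example, and it assumes headings are not nested inside each other.

```python
import re
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Records (level, text) for every h1-h6 element, in document order."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.headings = []
        self._level = None   # level of the heading currently open, if any
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if re.fullmatch(r"h[1-6]", tag):
            self._level = int(tag[1])
            self._buffer = []

    def handle_data(self, data):
        if self._level is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if self._level is not None and tag == "h%d" % self._level:
            text = " ".join("".join(self._buffer).split())
            self.headings.append((self._level, text))
            self._level = None


def build_toc(html_fragment):
    """Return a nested list of {'title', 'children'} dicts."""
    collector = HeadingCollector()
    collector.feed(html_fragment)
    collector.close()
    root = []
    stack = [(0, root)]  # (heading level, list that receives its children)
    for level, title in collector.headings:
        # Close every open heading at the same or a deeper level.
        while stack[-1][0] >= level:
            stack.pop()
        node = {"title": title, "children": []}
        stack[-1][1].append(node)
        stack.append((level, node["children"]))
    return root


print(build_toc("<h1>Intro</h1><h2>Scope</h2><h1>Usage</h1>"))
```

The nested structure could then be passed to a template (an inclusion tag would work) and rendered as nested lists. Linking each entry to its heading would additionally require adding id attributes to the headings, which this sketch does not do.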
I know that with urllib you can parse a string and check if it's a valid URL. But how would one go about checking whether a sentence contains a URL, and then extracting that URL? I've seen some huge regular expressions out there, but I would rather not use something that I really can't comprehend.
So basically I have an input string, and I need to find and extract all the URLs within that string.
What's a clean way of going about this?
You can search for "words" containing : and then pass them to urlparse (renamed to urllib.parse in Python 3.0 and newer) to check if they are valid URLs.
Example:
possible_urls = re.findall(r'\S+:\S+', text)
If you want to restrict yourself only to URLs starting with http:// or https:// (or anything else you want to allow) you can also do that with regular expressions, for example:
possible_urls = re.findall(r'https?://\S+', text)
You may also want to use some heuristics to determine where the URL starts and stops because sometimes people add punctuation to the URLs, giving new valid but unintentionally incorrect URLs, for example:
Have you seen the new look for http://example.com/? It's a total ripoff of http://example.org/!
Here the punctuation after the URL is not intended to be part of the URL. You can see from the automatically added links in the above text that Stack Overflow implements such heuristics.
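Putting the pieces above together, a minimal sketch might look like the following. The set of trailing punctuation to strip is my own guess at a reasonable heuristic, not what Stack Overflow actually uses, and extract_urls is just an illustrative name.

```python
import re
from urllib.parse import urlparse

# Heuristic: characters that are far more likely to end a sentence
# than to end a URL.
TRAILING_PUNCTUATION = ".,;:!?\"')]}>"


def extract_urls(text):
    """Return http(s) URLs found in text, minus trailing punctuation."""
    urls = []
    for candidate in re.findall(r"https?://\S+", text):
        candidate = candidate.rstrip(TRAILING_PUNCTUATION)
        try:
            parsed = urlparse(candidate)
        except ValueError:
            continue  # e.g. malformed IPv6 literal
        if parsed.scheme in ("http", "https") and parsed.netloc:
            urls.append(candidate)
    return urls


sample = ("Have you seen the new look for http://example.com/? "
          "It's a total ripoff of http://example.org/!")
print(extract_urls(sample))
```

Note that stripping a trailing ")" will break URLs that legitimately end in a parenthesis, so adjust the heuristic to your data.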
Plucking a URL out of "the wild" is a tricky endeavor (to do correctly). Jeff Atwood wrote a blog post on this subject: The Problem With URLs. John Gruber has addressed this issue as well: An Improved Liberal, Accurate Regex Pattern for Matching URLs. I have also written some code which attempts to tackle this problem: URL Linkification (HTTP/FTP) (for PHP/JavaScript). (Note that my regex is particularly complex because it is designed to be applied to HTML markup, and attempts to skip URLs which are already linkified, i.e. Link!)
Second, when it comes to validating a URI/URL, the document you want to look at is RFC 3986. I've been working on an article dealing with this very subject: Regular Expression URI Validation. You may want to take a look at this as well.
But when you get down to it, this is not a trivial task!
When someone writes a post and pastes a URL into it, can Django detect it and render it as a hyperlink rather than plain text?
Django has the urlize template filter which will automatically detect both URLs and email addresses and turn them into the appropriate hyperlinks.
The docs there are actually a little thin, so I recommend also reading the docstring in the source for the urlize function for more information.
urlize:
http://docs.djangoproject.com/en/dev/ref/templates/builtins/?from=olddocs#urlize
Another option is to parse plain text in some way, for example as reStructuredText (my favourite) or Markdown (Stack Overflow uses a slightly modified variant of Markdown). Both will turn valid plain-text link targets into hyperlinks. This also gives you more power over what you can do; you won't need to resort to HTML to achieve some basic formatting. Note also, as stated for urlize, that you should only use it on plain text; it's not designed to be mixed with HTML.
Of course an HTML page can be parsed using any number of python parsers, but I'm surprised that there don't seem to be any public parsing scripts to extract meaningful content (excluding sidebars, navigation, etc.) from a given HTML doc.
I'm guessing it's something like collecting DIV and P elements and then checking them for a minimum amount of text content, but I'm sure a solid implementation would include plenty of things that I haven't thought of.
Try the Beautiful Soup library for Python. It has very simple methods to extract information from an html file.
Trying to generically extract data from webpages would require people to write their pages in a similar way... but there's an almost infinite number of ways to build a page that looks identical, let alone all the combinations you can use to convey the same information.
Was there a particular type of information you were trying to extract or some other end goal?
You could try extracting any content in 'div' and 'p' markers and compare the relative sizes of all the information in the page. The problem then is that people probably group information into collections of 'div's and 'p's (or at least they do if they're writing well formed html!).
Maybe if you formed a tree of how the information is related (nodes would be the 'p' or 'div' or whatever, and each node would contain the associated text) you could do some sort of analysis to identify the smallest 'p' or 'div' that encompasses what appears to be the majority of the information?
[EDIT] Maybe if you can get it into the tree structure I suggested, you could then use a points system similar to SpamAssassin's. Define some rules that attempt to classify the information. Some examples:
+1 point for every 100 words
+1 point for every child element that has > 100 words
-1 points if the section name contains the word 'nav'
-2 points if the section name contains the word 'advert'
If you have lots of low-scoring rules which add up when you find more relevant-looking sections, I think that could evolve into a fairly powerful and robust technique.
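Here is a rough sketch of the tree plus points idea, using only the standard library. The four rules are the ones listed above; everything else (which tags count as containers, using id/class as the "section name", breaking ties in favour of the smaller node) is my own assumption for the sake of a runnable example.

```python
from html.parser import HTMLParser

CONTAINERS = {"div", "p", "td", "article", "section"}  # assumed container tags


class Node:
    def __init__(self, tag, name):
        self.tag, self.name = tag, name
        self.text = []       # text fragments anywhere inside this node
        self.children = []   # directly nested container nodes

    def word_count(self):
        return len(" ".join(self.text).split())

    def score(self):
        points = self.word_count() // 100                       # +1 per 100 words
        points += sum(1 for c in self.children if c.word_count() > 100)
        if "nav" in self.name:
            points -= 1
        if "advert" in self.name:
            points -= 2
        return points


class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = [Node("root", "")]
        self.nodes = []
        self.ignored = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.ignored += 1
        elif tag in CONTAINERS:
            attrs = dict(attrs)
            name = "%s %s" % (attrs.get("id") or "", attrs.get("class") or "")
            node = Node(tag, name.strip().lower())
            self.stack[-1].children.append(node)
            self.stack.append(node)
            self.nodes.append(node)

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self.ignored = max(0, self.ignored - 1)
        elif tag in CONTAINERS:
            # Close up to the nearest open element with this tag, if any.
            for i in range(len(self.stack) - 1, 0, -1):
                if self.stack[i].tag == tag:
                    del self.stack[i:]
                    break

    def handle_data(self, data):
        if not self.ignored:
            for node in self.stack:
                node.text.append(data)


def best_section(html_doc):
    """Return the highest-scoring container; ties go to the smaller one."""
    builder = TreeBuilder()
    builder.feed(html_doc)
    builder.close()
    if not builder.nodes:
        return None
    return max(builder.nodes, key=lambda n: (n.score(), -n.word_count()))
```

Calling best_section(page_html) gives back a node whose text fragments can be joined to get the candidate content. With only four rules this will be easily fooled; the point is that more rules can be added to score() without changing the rest.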
[EDIT2] Looking at Readability, it seems to be doing pretty much exactly what I just suggested! Maybe it could be improved to try and understand tables better?
Have a look at templatemaker: http://www.holovaty.com/writing/templatemaker/
It's written by one of the founders of Django. Basically you feed it a few example html files and it will generate a "template" that you can then use to extract just the bits that are different (which is usually the meaningful content).
Here's an example from the google code page:
# Import the Template class.
>>> from templatemaker import Template
# Create a Template instance.
>>> t = Template()
# Learn a Sample String.
>>> t.learn('<b>this and that</b>')
# Output the template so far, using the "!" character to mark holes.
# We've only learned a single string, so the template has no holes.
>>> t.as_text('!')
'<b>this and that</b>'
# Learn another string. The True return value means the template gained
# at least one hole.
>>> t.learn('<b>alex and sue</b>')
True
# Sure enough, the template now has some holes.
>>> t.as_text('!')
'<b>! and !</b>'
You might use the boilerpipe Web application to fetch and extract content on the fly.
(This is not specific to Python, as you only need to issue a HTTP GET request to a page on Google AppEngine).
Cheers,
Christian
What is meaningful and what is not depends on the semantics of the page. If the semantics are crappy, your code won't be able to "guess" what is meaningful. I use Readability, which you linked in the comment, and on many pages I try to read it does not provide any result at all, let alone a decent one.
If someone puts the content in a table, you're doomed. Try Readability on a phpBB forum and you'll see what I mean.
If you want to do it, go with a regexp on <p></p>, or parse the DOM.
Goose is just the library for this task. To quote their README:
Goose will try to extract the following information:
Main text of an article
Main image of article
Any Youtube/Vimeo movies embedded in article
Meta Description
Meta tags