cgi.escape seems like one possible choice. Does it work well? Is there something that is considered better?
html.escape is the correct answer now; it used to be cgi.escape before Python 3.2. It escapes:
< to &lt;
> to &gt;
& to &amp;
That is enough for all HTML.
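A quick illustration:
>>> import html
>>> html.escape('x > 1 & y < 2')
'x &gt; 1 &amp; y &lt; 2'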
EDIT: If you have non-ascii chars you also want to escape, for inclusion in another encoded document that uses a different encoding, like Craig says, just use:
data.encode('ascii', 'xmlcharrefreplace')
Don't forget to decode data to unicode first, using whatever encoding it was encoded in.
However, in my experience that kind of encoding is useless if you just work with unicode all the time from the start. Just encode at the end to the encoding specified in the document header (utf-8 for maximum compatibility).
Example:
>>> cgi.escape(u'<a>bá</a>').encode('ascii', 'xmlcharrefreplace')
'&lt;a&gt;b&#225;&lt;/a&gt;'
Also worth noting (thanks Greg) is the extra quote parameter cgi.escape takes. With it set to True, cgi.escape also escapes double-quote characters (") so you can use the resulting value in an XML/HTML attribute.
EDIT: Note that cgi.escape has been deprecated in Python 3.2 in favor of html.escape, which does the same except that quote defaults to True.
In Python 3.2 a new html module was introduced, which is used for escaping reserved characters from HTML markup.
It has one function escape():
>>> import html
>>> html.escape('x > 2 && x < 7 single quote: \' double quote: "')
'x &gt; 2 &amp;&amp; x &lt; 7 single quote: &#x27; double quote: &quot;'
If you wish to escape HTML in a URL:
This is probably NOT what the OP wanted (the question doesn't clearly indicate in which context the escaping is meant to be used), but Python's native library urllib has a method to escape HTML entities that need to be included in a URL safely.
The following is an example:
#!/usr/bin/python
from urllib import quote
x = '+<>^&'
print quote(x) # prints '%2B%3C%3E%5E%26'
Find docs here
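In Python 3, the same function lives in urllib.parse; a minimal sketch of the equivalent:
from urllib.parse import quote

x = '+<>^&'
print(quote(x))  # prints '%2B%3C%3E%5E%26'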
There is also the excellent markupsafe package.
>>> from markupsafe import Markup, escape
>>> escape("<script>alert(document.cookie);</script>")
Markup(u'&lt;script&gt;alert(document.cookie);&lt;/script&gt;')
The markupsafe package is well engineered, and probably the most versatile and Pythonic way to go about escaping, IMHO, because:
the return value (Markup) is an instance of a class derived from unicode (i.e. isinstance(escape('str'), unicode) == True)
it properly handles unicode input
it works in Python (2.6, 2.7, 3.3, and pypy)
it respects custom methods of objects (i.e. objects with an __html__ method) and template overloads (__html_format__), as sketched below.
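For instance, escape() calls an object's __html__() instead of escaping it. A small sketch; the Widget class here is made up for illustration:
from markupsafe import escape

class Widget:
    # escape() trusts objects that define __html__ and returns their output as-is
    def __html__(self):
        return '<em>already safe</em>'

escape(Widget())             # Markup('<em>already safe</em>'), left untouched
escape('<em>not safe</em>')  # Markup('&lt;em&gt;not safe&lt;/em&gt;')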
cgi.escape should be good to escape HTML in the limited sense of escaping the HTML tags and character entities.
But you might have to also consider encoding issues: if the HTML you want to quote has non-ASCII characters in a particular encoding, then you would also have to take care that you represent those sensibly when quoting. Perhaps you could convert them to entities. Otherwise you should ensure that the correct encoding translations are done between the "source" HTML and the page it's embedded in, to avoid corrupting the non-ASCII characters.
No libraries, pure Python; safely escapes text into HTML text:
text.replace('&', '&amp;').replace('>', '&gt;').replace('<', '&lt;'
    ).replace('\'', '&#39;').replace('"', '&#34;').encode('ascii', 'xmlcharrefreplace')
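For example, run on a sample string in Python 3 (where the result is a bytes object):
>>> text = u'<a href="x">bá & b</a>'
>>> text.replace('&', '&amp;').replace('>', '&gt;').replace('<', '&lt;'
... ).replace('\'', '&#39;').replace('"', '&#34;').encode('ascii', 'xmlcharrefreplace')
b'&lt;a href=&#34;x&#34;&gt;b&#225; &amp; b&lt;/a&gt;'
Note that the ampersand must be replaced first, or the other entities would get double-escaped.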
cgi.escape extended
This version improves on cgi.escape. It also preserves whitespace and newlines, and returns a unicode string.
import cgi

def escape_html(text):
    """escape strings for display in HTML"""
    return cgi.escape(text, quote=True).\
        replace(u'\n', u'<br />').\
        replace(u'\t', u'&emsp;').\
        replace(u'  ', u' &nbsp;')
For example:
>>> escape_html('<foo>\nfoo\t"bar"')
u'&lt;foo&gt;<br />foo&emsp;&quot;bar&quot;'
Not the easiest way, but still straightforward. The main difference from the cgi.escape module is that this version will still work properly if you already have &amp; in your text. As you can see from the comments:
cgi.escape version
def escape(s, quote=None):
    '''Replace special characters "&", "<" and ">" to HTML-safe sequences.
    If the optional flag quote is true, the quotation mark character (")
    is also translated.'''
    s = s.replace("&", "&amp;")  # Must be done first!
    s = s.replace("<", "&lt;")
    s = s.replace(">", "&gt;")
    if quote:
        s = s.replace('"', "&quot;")
    return s
regex version
import re

QUOTE_PATTERN = r"""([&<>"'])(?!(amp|lt|gt|quot|#39);)"""

def escape(word):
    """
    Replaces special characters <>&"' with HTML-safe sequences,
    with attention to already escaped characters.
    """
    replace_with = {
        '<': '&lt;',
        '>': '&gt;',
        '&': '&amp;',
        '"': '&quot;',  # should be escaped in attributes
        "'": '&#39;'    # should be escaped in attributes
    }
    quote_pattern = re.compile(QUOTE_PATTERN)
    return re.sub(quote_pattern, lambda x: replace_with[x.group(0)], word)
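For example, entities that are already escaped are left alone while bare special characters are converted:
>>> escape('AT&amp;T says x < 7 & "y"')
'AT&amp;T says x &lt; 7 &amp; &quot;y&quot;'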
For legacy code in Python 2.7, you can do it via BeautifulSoup4:
>>> from bs4.dammit import EntitySubstitution
>>> esub = EntitySubstitution()
>>> esub.substitute_html("r&d")
'r&amp;d'
Related
I'm writing a little Python script that parses word docs and writes to a csv file. However, some of the docs have some utf-8 characters that my script can't process correctly.
Fancy quotes show up quite often (u'\u201c'). Is there a quick and easy (and smart) way of replacing those with the neutral ascii-supported quotes, so I can just write line.encode('ascii') to the csv file?
I have tried to find the left quote and replace it:
val = line.find(u'\u201c')
if val >= 0: line[val] = '"'
But to no avail:
TypeError: 'unicode' object does not support item assignment
Is what I've described a good strategy? Or should I just set up the csv to support utf-8 (though I'm not sure if the application that will be reading the CSV wants utf-8)?
Thank you
You can use the Unidecode package to automatically convert all Unicode characters to their nearest pure ASCII equivalent.
from unidecode import unidecode
line = unidecode(line)
This will handle both directions of double quotes as well as single quotes, em dashes, and other things that you probably haven't discovered yet.
Edit: a comment points out that if your language isn't English, you may find ASCII too restrictive. Here's an adaptation of the above code that uses a whitelist to indicate characters that shouldn't be converted.
>>> from unidecode import unidecode
>>> whitelist = set('µÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ')
>>> line = '\u201cRésumé\u201d'
>>> print(line)
“Résumé”
>>> line = ''.join(c if c in whitelist else unidecode(c) for c in line)
>>> print(line)
"Résumé"
You can't assign to individual characters of a string, as strings are immutable.
You can, however, use the re module, which might be the most flexible way to do this:
import re
newline = re.sub(u'\u201c','"',line)
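To catch both the left and right curly quotes (and their single-quote counterparts) in one pass, character classes work; a small sketch:
import re
# \u201c/\u201d are the curly double quotes, \u2018/\u2019 the curly single quotes
newline = re.sub(u'[\u201c\u201d]', '"', line)
newline = re.sub(u"[\u2018\u2019]", "'", newline)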
I have a document which contains the character <0x0c>.
I am splitting on it using re.split.
The problem is that it looks like this:
import re
re.split('',text)
Although it works, you CAN'T see the character, and short of leaving a nice comment it is a great candidate to become one of those pieces of legacy code that only I would understand.
How can I write it in a different, readable way?
You can express any character using escape codes. The 0x0C Form Feed ASCII codepoint can be expressed as \f or as \x0c:
re.split('\f', text)
See the Python string and byte literals syntax for more details on what escape sequences Python supports when defining a string literal value.
Note: you don't need to use the regex module to split on straight-up character sequences, you can just as well use str.split() here:
text.split('\f')
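A quick check that both spellings behave the same (the sample text is made up):
>>> text = 'page one\x0cpage two'
>>> text.split('\f')
['page one', 'page two']
>>> import re
>>> re.split('\f', text)
['page one', 'page two']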
When I create a string containing backslashes, they get duplicated:
>>> my_string = "why\does\it\happen?"
>>> my_string
'why\\does\\it\\happen?'
Why?
What you are seeing is the representation of my_string created by its __repr__() method. If you print it, you can see that you've actually got single backslashes, just as you intended:
>>> print(my_string)
why\does\it\happen?
The string below has three characters in it, not four:
>>> 'a\\b'
'a\\b'
>>> len('a\\b')
3
You can get the standard representation of a string (or any other object) with the repr() built-in function:
>>> print(repr(my_string))
'why\\does\\it\\happen?'
Python represents backslashes in strings as \\ because the backslash is an escape character - for instance, \n represents a newline, and \t represents a tab.
This can sometimes get you into trouble:
>>> print("this\text\is\not\what\it\seems")
this ext\is
ot\what\it\seems
Because of this, there needs to be a way to tell Python you really want the two characters \n rather than a newline, and you do that by escaping the backslash itself, with another one:
>>> print("this\\text\is\what\you\\need")
this\text\is\what\you\need
When Python returns the representation of a string, it plays safe, escaping all backslashes (even if they wouldn't otherwise be part of an escape sequence), and that's what you're seeing. However, the string itself contains only single backslashes.
More information about Python's string literals can be found at: String and Bytes literals in the Python documentation.
As Zero Piraeus's answer explains, using single backslashes like this (outside of raw string literals) is a bad idea.
But there's an additional problem: in the future, it will be an error to use an undefined escape sequence like \d, instead of meaning a literal backslash followed by a d. So, instead of just getting lucky that your string happened to use \d instead of \t so it did what you probably wanted, it will definitely not do what you want.
As of 3.6, it already raises a DeprecationWarning, although most people don't see those. It will become a SyntaxError in some future version.
In many other languages, including C, using a backslash that doesn't start an escape sequence means the backslash is ignored.
In a few languages, including Python, a backslash that doesn't start an escape sequence is a literal backslash.
In some languages, to avoid confusion about whether the language is C-like or Python-like, and to avoid the problem with \Foo working but \foo not working, a backslash that doesn't start an escape sequence is illegal.
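Raw string literals sidestep the problem entirely, which is why they are the usual idiom for regex patterns and Windows paths; a short illustration:
>>> path = r"why\does\it\happen?"   # raw string: every backslash stays literal
>>> print(path)
why\does\it\happen?
>>> len(r"\d")                      # backslash plus 'd', no escape processing
2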
How can I encode a string (ASCII) to HTML?
For example, string = "encoding html"
The result after encoding should be string_encoded = "encoding&nbsp;html"
I think you just need to use cgi.escape to replace the characters <, > and &. For most cases that will be all you need. Example:
>>> import cgi
>>> cgi.escape("<Foo & Bar>")
'&lt;Foo &amp; Bar&gt;'
The &nbsp; symbol isn't really needed unless you are forcefully adding a space to the markup, which no library will naturally do for you.