How I can encode a string(ascii) to html code?
For example, string = "encoding html"
The result should be after encoding string_encoded = "encoding html"
I think you just need to use cgi.escape to replace the characters <, > and &. For most cases that will be all you need. Example:
>>> import cgi
>>> cgi.escape("<Foo & Bar>")
'&alt;Foo & Bar>'
The symbol isn't really needed unless you are forcefully adding a space to the markup, which no library will naturally do for you.
Related
I'm writing a little Python script that parses word docs and writes to a csv file. However, some of the docs have some utf-8 characters that my script can't process correctly.
Fancy quotes show up quite often (u'\u201c'). Is there a quick and easy (and smart) way of replacing those with the neutral ascii-supported quotes, so I can just write line.encode('ascii') to the csv file?
I have tried to find the left quote and replace it:
val = line.find(u'\u201c')
if val >= 0: line[val] = '"'
But to no avail:
TypeError: 'unicode' object does not support item assignment
Is what I've described a good strategy? Or should I just set up the csv to support utf-8 (though I'm not sure if the application that will be reading the CSV wants utf-8)?
Thank you
You can use the Unidecode package to automatically convert all Unicode characters to their nearest pure ASCII equivalent.
from unidecode import unidecode
line = unidecode(line)
This will handle both directions of double quotes as well as single quotes, em dashes, and other things that you probably haven't discovered yet.
Edit: a comment points out if your language isn't English, you may find ASCII to be too restrictive. Here's an adaptation of the above code that uses a whitelist to indicate characters that shouldn't be converted.
>>> from unidecode import unidecode
>>> whitelist = set('µÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ')
>>> line = '\u201cRésumé\u201d'
>>> print(line)
“Résumé”
>>> line = ''.join(c if c in whitelist else unidecode(c) for c in line)
>>> print(line)
"Résumé"
You can't assign to a string, as they are immutable, and can't be changed.
You can, however, just use the regex library, which might be the most flexible way to do this:
import re
newline = re.sub(u'\u201c','"',line)
I have in my HTML file (which is a right curly quote) and I want to convert it to text (if possible).
I tried using HTMLParser and BeautifulSoup but to no success.
>>> h = HTMLParser.HTMLParser()
>>> h.unescape("'")
u"'"
>>> h.unescape("")
u'\x92' # I was hoping for a right curly quote here.
My goal is very simple: Take the html input and output all the text (without any html codes).
"right curly quote" is not an ascii character. u'\x92' is the python representation of the unicode character representing it and not some "html code".
To display it properly in your terminal, use print h.unescape("").encode('utf-8') (or whatever you terminal's charset is).
cgi.escape seems like one possible choice. Does it work well? Is there something that is considered better?
html.escape is the correct answer now, it used to be cgi.escape in python before 3.2. It escapes:
< to <
> to >
& to &
That is enough for all HTML.
EDIT: If you have non-ascii chars you also want to escape, for inclusion in another encoded document that uses a different encoding, like Craig says, just use:
data.encode('ascii', 'xmlcharrefreplace')
Don't forget to decode data to unicode first, using whatever encoding it was encoded.
However in my experience that kind of encoding is useless if you just work with unicode all the time from start. Just encode at the end to the encoding specified in the document header (utf-8 for maximum compatibility).
Example:
>>> cgi.escape(u'<a>bá</a>').encode('ascii', 'xmlcharrefreplace')
'<a>bá</a>
Also worth of note (thanks Greg) is the extra quote parameter cgi.escape takes. With it set to True, cgi.escape also escapes double quote chars (") so you can use the resulting value in a XML/HTML attribute.
EDIT: Note that cgi.escape has been deprecated in Python 3.2 in favor of html.escape, which does the same except that quote defaults to True.
In Python 3.2 a new html module was introduced, which is used for escaping reserved characters from HTML markup.
It has one function escape():
>>> import html
>>> html.escape('x > 2 && x < 7 single quote: \' double quote: "')
'x > 2 && x < 7 single quote: ' double quote: "'
If you wish to escape HTML in a URL:
This is probably NOT what the OP wanted (the question doesn't clearly indicate in which context the escaping is meant to be used), but Python's native library urllib has a method to escape HTML entities that need to be included in a URL safely.
The following is an example:
#!/usr/bin/python
from urllib import quote
x = '+<>^&'
print quote(x) # prints '%2B%3C%3E%5E%26'
Find docs here
There is also the excellent markupsafe package.
>>> from markupsafe import Markup, escape
>>> escape("<script>alert(document.cookie);</script>")
Markup(u'<script>alert(document.cookie);</script>')
The markupsafe package is well engineered, and probably the most versatile and Pythonic way to go about escaping, IMHO, because:
the return (Markup) is a class derived from unicode (i.e. isinstance(escape('str'), unicode) == True
it properly handles unicode input
it works in Python (2.6, 2.7, 3.3, and pypy)
it respects custom methods of objects (i.e. objects with a __html__ property) and template overloads (__html_format__).
cgi.escape should be good to escape HTML in the limited sense of escaping the HTML tags and character entities.
But you might have to also consider encoding issues: if the HTML you want to quote has non-ASCII characters in a particular encoding, then you would also have to take care that you represent those sensibly when quoting. Perhaps you could convert them to entities. Otherwise you should ensure that the correct encoding translations are done between the "source" HTML and the page it's embedded in, to avoid corrupting the non-ASCII characters.
No libraries, pure python, safely escapes text into html text:
text.replace('&', '&').replace('>', '>').replace('<', '<'
).replace('\'',''').replace('"','"').encode('ascii', 'xmlcharrefreplace')
cgi.escape extended
This version improves cgi.escape. It also preserves whitespace and newlines. Returns a unicode string.
def escape_html(text):
"""escape strings for display in HTML"""
return cgi.escape(text, quote=True).\
replace(u'\n', u'<br />').\
replace(u'\t', u' ').\
replace(u' ', u' ')
for example
>>> escape_html('<foo>\nfoo\t"bar"')
u'<foo><br />foo "bar"'
Not the easiest way, but still straightforward. The main difference from cgi.escape module - it still will work properly if you already have & in your text. As you see from comments to it:
cgi.escape version
def escape(s, quote=None):
'''Replace special characters "&", "<" and ">" to HTML-safe sequences.
If the optional flag quote is true, the quotation mark character (")
is also translated.'''
s = s.replace("&", "&") # Must be done first!
s = s.replace("<", "<")
s = s.replace(">", ">")
if quote:
s = s.replace('"', """)
return s
regex version
QUOTE_PATTERN = r"""([&<>"'])(?!(amp|lt|gt|quot|#39);)"""
def escape(word):
"""
Replaces special characters <>&"' to HTML-safe sequences.
With attention to already escaped characters.
"""
replace_with = {
'<': '>',
'>': '<',
'&': '&',
'"': '"', # should be escaped in attributes
"'": ''' # should be escaped in attributes
}
quote_pattern = re.compile(QUOTE_PATTERN)
return re.sub(quote_pattern, lambda x: replace_with[x.group(0)], word)
For legacy code in Python 2.7, can do it via BeautifulSoup4:
>>> bs4.dammit import EntitySubstitution
>>> esub = EntitySubstitution()
>>> esub.substitute_html("r&d")
'r&d'
I've got a string from an HTTP header, but it's been escaped.. what function can I use to unescape it?
myemail%40gmail.com -> myemail#gmail.com
Would urllib.unquote() be the way to go?
I am pretty sure that urllib's unquote is the common way of doing this.
>>> import urllib
>>> urllib.unquote("myemail%40gmail.com")
'myemail#gmail.com'
There's also unquote_plus:
Like unquote(), but also replaces plus signs by spaces, as required for unquoting HTML form values.
In Python 3, these functions are urllib.parse.unquote and urllib.parse.unquote_plus.
The latter is used for example for query strings in the HTTP URLs, where the space characters () are traditionally encoded as plus character (+), and the + is percent-encoded to %2B.
In addition to these there is the unquote_to_bytes that converts the given encoded string to bytes, which can be used when the encoding is not known or the encoded data is binary data. However there is no unquote_plus_to_bytes, if you need it, you can do:
def unquote_plus_to_bytes(s):
if isinstance(s, bytes):
s = s.replace(b'+', b' ')
else:
s = s.replace('+', ' ')
return unquote_to_bytes(s)
More information on whether to use unquote or unquote_plus is available at URL encoding the space character: + or %20.
Yes, it appears that urllib.unquote() accomplishes that task. (I tested it against your example on codepad.)
I am trying to clean all of the HTML out of a string so the final output is a text file. I have some some research on the various 'converters' and am starting to lean towards creating my own dictionary for the entities and symbols and running a replace on the string. I am considering this because I want to automate the process and there is a lot of variability in the quality of the underlying html. To begin comparing the speed of my solution and one of the alternatives for example pyparsing I decided to test replace of \xa0 using the string method replace. I get a
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 0: ordinal not in range(128)
The actual line of code was
s=unicodestring.replace('\xa0','')
Anyway-I decided that I needed to preface it with an r so I ran this line of code:
s=unicodestring.replace(r'\xa0','')
It runs without error but I when I look at a slice of s I see that the \xaO is still there
may be you should be doing
s=unicodestring.replace(u'\xa0',u'')
s=unicodestring.replace('\xa0','')
..is trying to create the unicode character \xa0, which is not valid in an ASCII sctring (the default string type in Python until version 3.x)
The reason r'\xa0' did not error is because in a raw string, escape sequences have no effect. Rather than trying to encode \xa0 into the unicode character, it saw the string as a "literal backslash", "literal x" and so on..
The following are the same:
>>> r'\xa0'
'\\xa0'
>>> '\\xa0'
'\\xa0'
This is something resolved in Python v3, as the default string type is unicode, so you can just do..
>>> '\xa0'
'\xa0'
I am trying to clean all of the HTML out of a string so the final output is a text file
I would strongly recommend BeautifulSoup for this. Writing an HTML cleaning tool is difficult (given how horrible most HTML is), and BeautifulSoup does a great job at both parsing HTML, and dealing with Unicode..
>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup("<html><body><h1>Hi</h1></body></html>")
>>> print soup.prettify()
<html>
<body>
<h1>
Hi
</h1>
</body>
</html>
Look at the codecs standard library, specifically the encode and decode methods provided in the Codec base class.
There's also a good article here that puts it all together.
Instead of this, it's better to use standard python features.
For example:
string = unicode('Hello, \xa0World', 'utf-8', 'replace')
or
string = unicode('Hello, \xa0World', 'utf-8', 'ignore')
where replace will replace \xa0 to \\xa0.
But if \xa0 is really not meaningful for you and you want to remove it then use ignore.
Just a note regarding HTML cleaning. It is very very hard, since
<
body
>
Is a valid way to write HTML. Just an fyi.
You can convert it to unicode in this way:
print u'Hello, \xa0World' # print Hello, World