How to convert utf-8 fancy quotes to neutral quotes - python

I'm writing a little Python script that parses word docs and writes to a csv file. However, some of the docs have some utf-8 characters that my script can't process correctly.
Fancy quotes show up quite often (u'\u201c'). Is there a quick and easy (and smart) way of replacing those with the neutral ascii-supported quotes, so I can just write line.encode('ascii') to the csv file?
I have tried to find the left quote and replace it:
val = line.find(u'\u201c')
if val >= 0: line[val] = '"'
But to no avail:
TypeError: 'unicode' object does not support item assignment
Is what I've described a good strategy? Or should I just set up the csv to support utf-8 (though I'm not sure if the application that will be reading the CSV wants utf-8)?
Thank you

You can use the Unidecode package to automatically convert all Unicode characters to their nearest pure ASCII equivalent.
from unidecode import unidecode
line = unidecode(line)
This will handle both directions of double quotes as well as single quotes, em dashes, and other things that you probably haven't discovered yet.
Edit: a comment points out if your language isn't English, you may find ASCII to be too restrictive. Here's an adaptation of the above code that uses a whitelist to indicate characters that shouldn't be converted.
>>> from unidecode import unidecode
>>> whitelist = set('µÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ')
>>> line = '\u201cRésumé\u201d'
>>> print(line)
“Résumé”
>>> line = ''.join(c if c in whitelist else unidecode(c) for c in line)
>>> print(line)
"Résumé"

You can't assign to an index of a string, because strings are immutable.
You can, however, build a new string with the regex library, which might be the most flexible way to do this:
import re
newline = re.sub(u'\u201c','"',line)
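Word documents usually contain all four curly quotes, so a translation table may be simpler than one substitution per character. A minimal sketch for Python 3, assuming only these four characters need mapping; the names QUOTE_MAP and to_neutral_quotes are hypothetical:

```python
# Map each curly quote to its neutral ASCII counterpart.
QUOTE_MAP = str.maketrans({
    '\u2018': "'",  # left single quotation mark
    '\u2019': "'",  # right single quotation mark
    '\u201c': '"',  # left double quotation mark
    '\u201d': '"',  # right double quotation mark
})

def to_neutral_quotes(text):
    """Return text with curly quotes replaced by ASCII quotes."""
    return text.translate(QUOTE_MAP)

print(to_neutral_quotes('\u201cIt\u2019s fine\u201d'))
```

Any other non-ASCII character is left untouched, so line.encode('ascii') can still fail on, say, an em dash.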

Related

How to properly print a list of unicode characters in python?

I am trying to search for emoticons in python strings.
So I have, for example,
em_test = ['\U0001f680']
print(em_test)
['🚀']
test = 'This is a test string 💰💰🚀'
if any(x in test for x in em_test):
    print ("yes, the emoticon is there")
else:
    print ("no, the emoticon is not there")
yes, the emoticon is there
and if I search for em_test in
'This is a test string 💰💰🚀'
I can actually find it.
So I have made a csv file with all the emoticons I want defined by their unicode.
The CSV looks like this:
\U0001F600
\U0001F601
\U0001F602
\U0001F923
and when I import it and print it I actually do not get the emoticons but rather just the text representation:
['\\U0001F600',
'\\U0001F601',
'\\U0001F602',
'\\U0001F923',
...
]
and hence I cannot use this to search for these emoticons in another string...
I somehow know that the double backslash \\ is only the representation of a single backslash, but somehow the unicode reader does not get it... I do not know what I'm missing.
Any suggestions?
You can decode those Unicode escape sequences with .decode('unicode-escape'). However, .decode is a bytes method, so if those sequences are text rather than bytes you first need to encode them into bytes. Alternatively, you can (probably) open your CSV file in binary mode in order to read those sequences as bytes rather than as text strings.
Just for fun, I'll also use unicodedata to get the names of those emojis.
import unicodedata as ud
emojis = [
'\\U0001F600',
'\\U0001F601',
'\\U0001F602',
'\\U0001F923',
]
for u in emojis:
    s = u.encode('ASCII').decode('unicode-escape')
    print(u, ud.name(s), s)
output
\U0001F600 GRINNING FACE 😀
\U0001F601 GRINNING FACE WITH SMILING EYES 😁
\U0001F602 FACE WITH TEARS OF JOY 😂
\U0001F923 ROLLING ON THE FLOOR LAUGHING 🤣
This should be much faster than using ast.literal_eval. And if you read the data in binary mode it will be even faster since it avoids the initial decoding step while reading the file, as well as allowing you to eliminate the .encode('ASCII') call.
You can make the decoding a little more robust by using
u.encode('Latin1').decode('unicode-escape')
but that shouldn't be necessary for your emoji data. And as I said earlier, it would be even better if you open the file in binary mode to avoid the need to encode it.
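The binary-mode variant mentioned above could look roughly like this in Python 3. It is only a sketch: load_emojis is a hypothetical helper, and a temporary file stands in for the real CSV, which is assumed to hold one escape sequence per line.

```python
import os
import tempfile

def load_emojis(path):
    """Read one Unicode escape per line and return the decoded characters."""
    with open(path, 'rb') as f:  # binary mode: each line is already bytes
        return [line.strip().decode('unicode-escape') for line in f if line.strip()]

# Demo with a throwaway file standing in for the real CSV.
with tempfile.NamedTemporaryFile('wb', suffix='.csv', delete=False) as tmp:
    tmp.write(b'\\U0001F600\n\\U0001F923\n')
emojis = load_emojis(tmp.name)
os.remove(tmp.name)

test = 'This is a test string \U0001F923'
print(any(e in test for e in emojis))
```

Because the lines are read as bytes, no .encode('ASCII') call is needed before decoding.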
1. keeping your csv as-is:
it's a bloated solution, but using ast.literal_eval works:
import ast
s = '\\U0001F600'
x = ast.literal_eval('"{}"'.format(s))
print(hex(ord(x)))
print(x)
I get 0x1f600 (which is the correct character code) and the emoticon character (😀). (I had to copy/paste a strange char from my console into this answer, but that's a console issue on my end; otherwise it works.)
Just surround the value with quotes so that ast takes the input as a string.
2. using character codes directly
maybe you'd be better off by storing the character codes themselves instead of the \U format:
print(chr(0x1F600))
does exactly the same (so ast is slightly overkill)
your csv could contain:
0x1F600
0x1F601
0x1F602
0x1F923
then chr(int(row[0],16)) would do the trick when reading it. For example, if the code is in the first column of each row:
import csv
with open("codes.csv") as f:
    cr = csv.reader(f)
    codes = [int(row[0],16) for row in cr]
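Putting the hex-code approach together with the search from the question might look like the following sketch. An io.StringIO object stands in for the opened codes.csv file, and the variable names are illustrative only.

```python
import csv
import io

# io.StringIO stands in for open("codes.csv"); one hex code per row is assumed.
f = io.StringIO('0x1F600\n0x1F601\n0x1F602\n0x1F923\n')
emojis = [chr(int(row[0], 16)) for row in csv.reader(f) if row]

test = 'This is a test string \U0001F602'
found = [e for e in emojis if e in test]
print(found)
```

int() accepts the 0x prefix when the base is 16, so the codes can be stored with or without it.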

Finding the index of a UTF-8 character in a Python string

I'm trying to find the index (or indices) of a certain character in a UTF-8 encoded string in a foreign language (for example the character: ش).
I have tried unicode.find('ش'), word.find(u'ش'), word.find(u'\\uش') and also regular expressions: re.compile(u'\\uش') to no avail. The funny thing is that in Visual Studio (my IDE using IronPython) in debug mode, word.find(u'\\uش') returns the correct index in the variable watch window but it doesn't in the actual code (returns index=-1).
I'm reading the strings from a file using the following command:
file= codecs.open(file,'r','utf-8')
Is there something I'm missing? Or is there another way to approach this?
Once you use codecs to read the file, it's no longer UTF-8, it's an internal Unicode string representation. This should be completely compatible with Unicode literals in your program.
>>> line=u'abcش'
>>> line.find(u'ش')
3
Edit: My previous test may have been misleading because both strings were entered through the IDE. Here's a better example:
>>> f = codecs.open(r'c:\temp\temp.txt', 'r', 'utf-8-sig')
>>> line = f.readline()
>>> print line
This is a test.ش
>>> line.find(u'\u0634')
15
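The question also asks for all indices, which find alone does not give. A small Python 3 sketch, where find_all is a hypothetical helper and the sample line extends the one above with a second occurrence:

```python
def find_all(text, char):
    """Return every index at which char occurs in text."""
    return [i for i, c in enumerate(text) if c == char]

line = 'This is a test.\u0634 \u0634'
print(find_all(line, '\u0634'))
```

The indices count characters of the decoded string, not bytes of the UTF-8 file.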

Approximately converting unicode string to ascii string in python

I don't know whether this is trivial or not, but I need to convert a unicode string to an ascii string, and I wouldn't like to have all those escape chars around. I mean, is it possible to have an "approximate" conversion to some quite similar ascii character?
For example: Gavin O’Connor gets converted to Gavin O\x92Connor, but I'd really like it to be just converted to Gavin O'Connor. Is this possible? Did anyone write some util to do it, or do I have to manually replace all chars?
Thank you very much!
Marco
Use the Unidecode package to transliterate the string.
>>> import unidecode
>>> unidecode.unidecode(u'Gavin O’Connor')
"Gavin O'Connor"
import unicodedata
unicode_string = u"Gavin O’Connor"
print unicodedata.normalize('NFKD', unicode_string).encode('ascii','ignore')
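Output:
Gavin OConnor
Note that NFKD only splits characters that have a decomposition, such as accented letters (é becomes e plus a combining accent). The curly apostrophe has none, so 'ignore' drops it rather than turning it into a straight one.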
Here's the document that describes the normalization forms: http://unicode.org/reports/tr15/
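To see what the normalization approach does and does not handle, here is a short Python 3 sketch; strip_accents is a hypothetical name, not part of unicodedata.

```python
import unicodedata

def strip_accents(text):
    """Decompose characters, then drop whatever is still non-ASCII."""
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', 'ignore').decode('ascii')

print(strip_accents('R\u00e9sum\u00e9'))     # accented letters lose their accents
print(strip_accents('Gavin O\u2019Connor'))  # the curly apostrophe is dropped
```

Characters without a decomposition, such as curly quotes, need an explicit replacement before this step if they should survive.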
b = str(a.encode('utf-8').decode('ascii', 'ignore'))
should work if you only need to drop the non-ASCII characters; it does not replace them with similar ASCII ones.
There is a technique to strip accents from characters, but other characters need to be directly replaced. Check this article: http://effbot.org/zone/unicode-convert.htm
Try simple character replacement
str1 = "“I am the greatest”, said Gavin O’Connor"
print(str1)
print(str1.replace("’", "'").replace("“","\"").replace("”","\""))
PS: add # -*- coding: utf-8 -*- to the top of your .py file if you get error

What's the easiest way to escape HTML in Python?

cgi.escape seems like one possible choice. Does it work well? Is there something that is considered better?
html.escape is the correct answer now; it used to be cgi.escape before Python 3.2. It escapes:
< to &lt;
> to &gt;
& to &amp;
That is enough for all HTML text outside of attribute values.
EDIT: If you have non-ascii chars you also want to escape, for inclusion in another encoded document that uses a different encoding, like Craig says, just use:
data.encode('ascii', 'xmlcharrefreplace')
Don't forget to decode data to unicode first, using whatever encoding it was encoded.
However in my experience that kind of encoding is useless if you just work with unicode all the time from start. Just encode at the end to the encoding specified in the document header (utf-8 for maximum compatibility).
Example:
>>> cgi.escape(u'<a>bá</a>').encode('ascii', 'xmlcharrefreplace')
'&lt;a&gt;b&#225;&lt;/a&gt;'
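On Python 3 the same idea can be written with html.escape. This is a sketch of the equivalent call, not the answer's original code; note that the result of encode is a bytes object.

```python
import html

escaped = html.escape('<a>b\u00e1</a>')
# xmlcharrefreplace turns each non-ASCII character into a numeric reference.
as_ascii = escaped.encode('ascii', 'xmlcharrefreplace')
print(as_ascii)
```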
Also worth noting (thanks Greg) is the extra quote parameter cgi.escape takes. With it set to True, cgi.escape also escapes double quote chars (") so you can use the resulting value in an XML/HTML attribute.
EDIT: Note that cgi.escape has been deprecated in Python 3.2 (and removed in Python 3.8) in favor of html.escape, which does the same except that quote defaults to True.
In Python 3.2 a new html module was introduced, which is used for escaping reserved characters from HTML markup.
It has one function escape():
>>> import html
>>> html.escape('x > 2 && x < 7 single quote: \' double quote: "')
'x &gt; 2 &amp;&amp; x &lt; 7 single quote: &#x27; double quote: &quot;'
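The effect of the quote parameter is easiest to see side by side. A short sketch with an arbitrary sample string:

```python
import html

attr = 'say "hi" & \'bye\''
with_quotes = html.escape(attr)                  # quote=True is the default
without_quotes = html.escape(attr, quote=False)  # only &, < and > are escaped
print(with_quotes)
print(without_quotes)
```

The default is the safer choice when the value may end up inside an attribute.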
If you wish to escape HTML in a URL:
This is probably NOT what the OP wanted (the question doesn't clearly indicate in which context the escaping is meant to be used), but Python's native library urllib has a method to escape HTML entities that need to be included in a URL safely.
The following is an example:
#!/usr/bin/python
from urllib import quote
x = '+<>^&'
print quote(x) # prints '%2B%3C%3E%5E%26'
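The snippet above is Python 2. In Python 3 the function lives in urllib.parse; a sketch of the same call:

```python
from urllib.parse import quote

x = '+<>^&'
quoted = quote(x)
print(quoted)
```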
There is also the excellent markupsafe package.
>>> from markupsafe import Markup, escape
>>> escape("<script>alert(document.cookie);</script>")
Markup(u'&lt;script&gt;alert(document.cookie);&lt;/script&gt;')
The markupsafe package is well engineered, and probably the most versatile and Pythonic way to go about escaping, IMHO, because:
the return (Markup) is a class derived from unicode (i.e. isinstance(escape('str'), unicode) == True)
it properly handles unicode input
it works in Python (2.6, 2.7, 3.3, and pypy)
it respects custom methods of objects (i.e. objects with a __html__ property) and template overloads (__html_format__).
cgi.escape should be good to escape HTML in the limited sense of escaping the HTML tags and character entities.
But you might have to also consider encoding issues: if the HTML you want to quote has non-ASCII characters in a particular encoding, then you would also have to take care that you represent those sensibly when quoting. Perhaps you could convert them to entities. Otherwise you should ensure that the correct encoding translations are done between the "source" HTML and the page it's embedded in, to avoid corrupting the non-ASCII characters.
No libraries, pure python, safely escapes text into html text:
text.replace('&', '&amp;').replace('>', '&gt;').replace('<', '&lt;'
).replace('\'','&#39;').replace('"','&#34;').encode('ascii', 'xmlcharrefreplace')
cgi.escape extended
This version improves cgi.escape. It also preserves whitespace and newlines. Returns a unicode string.
def escape_html(text):
    """escape strings for display in HTML"""
    return cgi.escape(text, quote=True).\
        replace(u'\n', u'<br />').\
        replace(u'\t', u'&emsp;').\
        replace(u'  ', u' &nbsp;')
for example
>>> escape_html('<foo>\nfoo\t"bar"')
u'&lt;foo&gt;<br />foo&emsp;&quot;bar&quot;'
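Since cgi.escape no longer exists in current Python 3 releases, the same helper can be sketched on top of html.escape. The &emsp; and &nbsp; replacements are assumptions about what the answer intended for tabs and double spaces.

```python
import html

def escape_html(text):
    """Escape text for display in HTML, keeping newlines, tabs and runs of spaces visible."""
    return (html.escape(text, quote=True)
            .replace('\n', '<br />')     # newline becomes an explicit line break
            .replace('\t', '&emsp;')     # assumed entity for a tab
            .replace('  ', ' &nbsp;'))   # assumed: keep double spaces from collapsing

print(escape_html('<foo>\nfoo\t"bar"'))
```

Unlike cgi.escape, html.escape with quote=True also escapes single quotes.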
Not the easiest way, but still straightforward. The main difference from the cgi.escape function: it still works properly if you already have &amp; in your text. As you can see from the comments in it:
cgi.escape version
def escape(s, quote=None):
    '''Replace special characters "&", "<" and ">" to HTML-safe sequences.
    If the optional flag quote is true, the quotation mark character (")
    is also translated.'''
    s = s.replace("&", "&amp;") # Must be done first!
    s = s.replace("<", "&lt;")
    s = s.replace(">", "&gt;")
    if quote:
        s = s.replace('"', "&quot;")
    return s
regex version
import re
QUOTE_PATTERN = r"""([&<>"'])(?!(amp|lt|gt|quot|#39);)"""
def escape(word):
    """
    Replaces special characters <>&"' to HTML-safe sequences.
    With attention to already escaped characters.
    """
    replace_with = {
        '<': '&lt;',
        '>': '&gt;',
        '&': '&amp;',
        '"': '&quot;', # should be escaped in attributes
        "'": '&#39;' # should be escaped in attributes
    }
    quote_pattern = re.compile(QUOTE_PATTERN)
    return re.sub(quote_pattern, lambda x: replace_with[x.group(0)], word)
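A different way to avoid double-escaping in Python 3, swapped in here for the regex, is to decode any existing entities first and then escape once. This is only a sketch with a hypothetical escape_once helper, and it assumes that entity-like sequences in the input are meant as entities rather than as literal text.

```python
import html

def escape_once(text):
    """Escape text for HTML without double-escaping existing entities."""
    # Decode any entities already present, then escape everything exactly once.
    return html.escape(html.unescape(text), quote=True)

print(escape_once('a &amp; b < c'))
```

Applying the function twice gives the same result as applying it once, which the plain replace chain does not.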
For legacy code in Python 2.7, you can do it via BeautifulSoup4:
>>> from bs4.dammit import EntitySubstitution
>>> esub = EntitySubstitution()
>>> esub.substitute_html("r&d")
'r&amp;d'
