How to completely sanitize a string of illegal characters in python?

How to completely sanitize a string of illegal characters in python? - python

I have a feature of my program where the user can upload a csv file, which my program goes through and uses as input. I have one user complaining about a problem where his input is throwing up an error. The error is caused by there being an illegal character that is encoded wrong. The characters is below:
�
Sometimes it appears as a diamond with a "?" in the middle, sometimes it appears as a double diamond with "?" in the middle, sometimes it appears as "\xa0", and sometimes it appears as "\xa0\xa0".
In my program if I do:
print str_with_weird_char
The string will show up in my terminal with the diamond "?" in place of the weird character. If I copy+paste that string into ipython, it will exit with this message:
In [1]: g="blah��blah"
WARNING:
********
You or a %run:ed script called sys.stdin.close() or sys.stdout.close()!
Exiting IPython!
notice how the diamond "?" is double now. For some reason copy+paste makes it double...
In the django traceback page, it looks like this:
UnicodeDecodeError at /chris/import.html
('ascii', 'blah \xa0 BLAH', 14, 15, 'ordinal not in range(128)')
The thing that messes me up is that I can't do anything with this string without it throwing an exception. I tried unicode(), I tried str(), I tried .encode(), I tried .encode("utf-8"), no matter what it throws up an error.
What can I do it get this thing to be a working string?

You can pass, "ignore" to skip invalid characters in .encode/.decode
like "ILLEGAL".decode("utf8","ignore")
>>> "ILLEGA\xa0L".decode("utf8")
...
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa0 in position 6: unexpected code byte
>>> "ILLEGA\xa0L".decode("utf8","ignore")
u'ILLEGAL'
>>>

Declare the coding on the second line of your script. It really has to be second. Like
#!/usr/bin/python
# coding=utf-8
This might be enough to solve your problem all by itself. If not, see str.encode('utf-8') and str.decode('utf-8').

you can also use:
python3 -c "import urllib, sys ; print urllib.quote_plus(sys.stdin.read())";
taken from https://wiki.python.org/moin/Powerful%20Python%20One-Liners
** ps, in the website it's pointed to use python, but I tested in python3 and it works just fine

The only way to do it (at least in python2) is to use unicodedata.normalize:
unicodedata.normalize('NFKD', text).encode('utf-8', 'ignore')
decode('utf-8', 'ignore') will just raise exception.

Related

UnicodeEncodeError, can't seem to set errors='ignore'

I'm fairly new to Python, so I'm hoping this is something simple that I'm just missing.
I'm running Python 2.7 on Windows 7
I'm trying to run a basic twitter scraping program through the command line. However I keep getting the following error:
File "C:\Python27\lib\encodings\cp437.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u2019' in position 79: character maps to (undefined)
I understand basically what's happening here, that it's trying to print to the console in cp437 and it's getting confused by the unicode characters in the tweets that it's grabbing.
All I'm trying to do is either get it to replace those characters with "?" or just get it to drop those characters altogether. I have read a bunch of posts about this and I can't figure out how to do it.
I opened the cp437.py file that's referenced in the error and I changed all the errors='strict' to errors='ignore' but that didn't solve the problem.
I then tried to go into the C:\Python27\Lib\codecs.py file and change all the errors='strict' to errors='ignore' but that didn't solve the problem either.
Any ideas? Like I said, hopefully I'm just missing something basic but I've read a bunch of posts on this and I can't seem to puzzle it out.
Thanks a lot.
Seth

I would not suggest changing the built in libraries - they are designed to allow handling encoding errors without needing to be fiddled with (and if you have change, not longer clear that any solution that would work for everyone else, would work for you).
You may just want to be passing errors='ignore' into whatever encoding function you are using to just skip the error character, or errors='replace' to replace that character with the character \ufff to signify there was a problem. [ error='strict' is the default if you don't pass any value. ]
However, if you are printing to the command line, you probably don't want to be encoding as unicode anyway, but ASCII instead - since unicode includes characters that the command line can't print. (and i suspect that that which is causing errors to be thrown up, rather than there being non-standard unicode characters in the response you are getting from Twitter).
Try e.g.
print original_data.encode('ascii', 'ignore')

UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 20: ordinal not in range(128)

I'm having problems dealing with unicode characters from text fetched from different web pages (on different sites). I am using BeautifulSoup.
The problem is that the error is not always reproducible; it sometimes works with some pages, and sometimes, it barfs by throwing a UnicodeEncodeError. I have tried just about everything I can think of, and yet I have not found anything that works consistently without throwing some kind of Unicode-related error.
One of the sections of code that is causing problems is shown below:
agent_telno = agent.find('div', 'agent_contact_number')
agent_telno = '' if agent_telno is None else agent_telno.contents[0]
p.agent_info = str(agent_contact + ' ' + agent_telno).strip()
Here is a stack trace produced on SOME strings when the snippet above is run:
Traceback (most recent call last):
File "foobar.py", line 792, in <module>
p.agent_info = str(agent_contact + ' ' + agent_telno).strip()
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 20: ordinal not in range(128)
I suspect that this is because some pages (or more specifically, pages from some of the sites) may be encoded, whilst others may be unencoded. All the sites are based in the UK and provide data meant for UK consumption - so there are no issues relating to internalization or dealing with text written in anything other than English.
Does anyone have any ideas as to how to solve this so that I can CONSISTENTLY fix this problem?

Read the Python Unicode HOWTO. This error is the very first example.
Do not use str() to convert from unicode to encoded text / bytes.
Instead, use .encode() to encode the string:
p.agent_info = u' '.join((agent_contact, agent_telno)).encode('utf-8').strip()
or work entirely in unicode.

This is a classic python unicode pain point! Consider the following:
a = u'bats\u00E0'
print a
=> batsà
All good so far, but if we call str(a), let's see what happens:
str(a)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe0' in position 4: ordinal not in range(128)
Oh dip, that's not gonna do anyone any good! To fix the error, encode the bytes explicitly with .encode and tell python what codec to use:
a.encode('utf-8')
=> 'bats\xc3\xa0'
print a.encode('utf-8')
=> batsà
Voil\u00E0!
The issue is that when you call str(), python uses the default character encoding to try and encode the bytes you gave it, which in your case are sometimes representations of unicode characters. To fix the problem, you have to tell python how to deal with the string you give it by using .encode('whatever_unicode'). Most of the time, you should be fine using utf-8.
For an excellent exposition on this topic, see Ned Batchelder's PyCon talk here: http://nedbatchelder.com/text/unipain.html

I found elegant work around for me to remove symbols and continue to keep string as string in follows:
yourstring = yourstring.encode('ascii', 'ignore').decode('ascii')
It's important to notice that using the ignore option is dangerous because it silently drops any unicode(and internationalization) support from the code that uses it, as seen here (convert unicode):
>>> u'City: Malmö'.encode('ascii', 'ignore').decode('ascii')
'City: Malm'

well i tried everything but it did not help, after googling around i figured the following and it helped.
python 2.7 is in use.
# encoding=utf8
import sys
reload(sys)
sys.setdefaultencoding('utf8')

A subtle problem causing even print to fail is having your environment variables set wrong, eg. here LC_ALL set to "C". In Debian they discourage setting it: Debian wiki on Locale
$ echo $LANG
en_US.utf8
$ echo $LC_ALL
C
$ python -c "print (u'voil\u00e0')"
Traceback (most recent call last):
File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe0' in position 4: ordinal not in range(128)
$ export LC_ALL='en_US.utf8'
$ python -c "print (u'voil\u00e0')"
voilà
$ unset LC_ALL
$ python -c "print (u'voil\u00e0')"
voilà

The problem is that you're trying to print a unicode character, but your terminal doesn't support it.
You can try installing language-pack-en package to fix that:
sudo apt-get install language-pack-en
which provides English translation data updates for all supported packages (including Python). Install different language package if necessary (depending which characters you're trying to print).
On some Linux distributions it's required in order to make sure that the default English locales are set-up properly (so unicode characters can be handled by shell/terminal). Sometimes it's easier to install it, than configuring it manually.
Then when writing the code, make sure you use the right encoding in your code.
For example:
open(foo, encoding='utf-8')
If you've still a problem, double check your system configuration, such as:
Your locale file (/etc/default/locale), which should have e.g.
LANG="en_US.UTF-8"
LC_ALL="en_US.UTF-8"
or:
LC_ALL=C.UTF-8
LANG=C.UTF-8
Value of LANG/LC_CTYPE in shell.
Check which locale your shell supports by:
locale -a | grep "UTF-8"
Demonstrating the problem and solution in fresh VM.
Initialize and provision the VM (e.g. using vagrant):
vagrant init ubuntu/trusty64; vagrant up; vagrant ssh
See: available Ubuntu boxes..
Printing unicode characters (such as trade mark sign like ™):
$ python -c 'print(u"\u2122");'
Traceback (most recent call last):
File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2122' in position 0: ordinal not in range(128)
Now installing language-pack-en:
$ sudo apt-get -y install language-pack-en
The following extra packages will be installed:
language-pack-en-base
Generating locales...
en_GB.UTF-8... /usr/sbin/locale-gen: done
Generation complete.
Now problem should be solved:
$ python -c 'print(u"\u2122");'
™
Otherwise, try the following command:
$ LC_ALL=C.UTF-8 python -c 'print(u"\u2122");'
™

In shell:
Find supported UTF-8 locale by the following command:
locale -a | grep "UTF-8"
Export it, before running the script, e.g.:
export LC_ALL=$(locale -a | grep UTF-8)
or manually like:
export LC_ALL=C.UTF-8
Test it by printing special character, e.g. ™:
python -c 'print(u"\u2122");'
Above tested in Ubuntu.

I've actually found that in most of my cases, just stripping out those characters is much simpler:
s = mystring.decode('ascii', 'ignore')

For me, what worked was:
BeautifulSoup(html_text,from_encoding="utf-8")
Hope this helps someone.

Here's a rehashing of some other so-called "cop out" answers. There are situations in which simply throwing away the troublesome characters/strings is a good solution, despite the protests voiced here.
def safeStr(obj):
try: return str(obj)
except UnicodeEncodeError:
return obj.encode('ascii', 'ignore').decode('ascii')
except: return ""
Testing it:
if __name__ == '__main__':
print safeStr( 1 )
print safeStr( "test" )
print u'98\xb0'
print safeStr( u'98\xb0' )
Results:
1
test
98°
98
UPDATE: My original answer was written for Python 2. For Python 3:
def safeStr(obj):
try: return str(obj).encode('ascii', 'ignore').decode('ascii')
except: return ""
Note: if you'd prefer to leave a ? indicator where the "unsafe" unicode characters are, specify replace instead of ignore in the call to encode for the error handler.
Suggestion: you might want to name this function toAscii instead? That's a matter of preference...
Finally, here's a more robust PY2/3 version using six, where I opted to use replace, and peppered in some character swaps to replace fancy unicode quotes and apostrophes which curl left or right with the simple vertical ones that are part of the ascii set. You might expand on such swaps yourself:
from six import PY2, iteritems
CHAR_SWAP = { u'\u201c': u'"'
, u'\u201D': u'"'
, u'\u2018': u"'"
, u'\u2019': u"'"
}
def toAscii( text ) :
try:
for k,v in iteritems( CHAR_SWAP ):
text = text.replace(k,v)
except: pass
try: return str( text ) if PY2 else bytes( text, 'replace' ).decode('ascii')
except UnicodeEncodeError:
return text.encode('ascii', 'replace').decode('ascii')
except: return ""
if __name__ == '__main__':
print( toAscii( u'testin\u2019' ) )

Add line below at the beginning of your script ( or as second line):
# -*- coding: utf-8 -*-
That's definition of python source code encoding. More info in PEP 263.

I always put the code below in the first two lines of the python files:
# -*- coding: utf-8 -*-
from __future__ import unicode_literals

It works for me:
export LC_CTYPE="en_US.UTF-8"

Alas this works in Python 3 at least...
Python 3
Sometimes the error is in the enviroment variables and enconding so
import os
import locale
os.environ["PYTHONIOENCODING"] = "utf-8"
myLocale=locale.setlocale(category=locale.LC_ALL, locale="en_GB.UTF-8")
...
print(myText.encode('utf-8', errors='ignore'))
where errors are ignored in encoding.

Simple helper functions found here.
def safe_unicode(obj, *args):
""" return the unicode representation of obj """
try:
return unicode(obj, *args)
except UnicodeDecodeError:
# obj is byte string
ascii_text = str(obj).encode('string_escape')
return unicode(ascii_text)
def safe_str(obj):
""" return the byte string representation of obj """
try:
return str(obj)
except UnicodeEncodeError:
# obj is unicode
return unicode(obj).encode('unicode_escape')

Just add to a variable encode('utf-8')
agent_contact.encode('utf-8')

Please open terminal and fire the below command:
export LC_ALL="en_US.UTF-8"

In case its an issue with a print statement, a lot fo times its just an issue with the terminal printing. This helped me :
export PYTHONIOENCODING=UTF-8

I just used the following:
import unicodedata
message = unicodedata.normalize("NFKD", message)
Check what documentation says about it:
unicodedata.normalize(form, unistr) Return the normal form form for
the Unicode string unistr. Valid values for form are ‘NFC’, ‘NFKC’,
‘NFD’, and ‘NFKD’.
The Unicode standard defines various normalization forms of a Unicode
string, based on the definition of canonical equivalence and
compatibility equivalence. In Unicode, several characters can be
expressed in various way. For example, the character U+00C7 (LATIN
CAPITAL LETTER C WITH CEDILLA) can also be expressed as the sequence
U+0043 (LATIN CAPITAL LETTER C) U+0327 (COMBINING CEDILLA).
For each character, there are two normal forms: normal form C and
normal form D. Normal form D (NFD) is also known as canonical
decomposition, and translates each character into its decomposed form.
Normal form C (NFC) first applies a canonical decomposition, then
composes pre-combined characters again.
In addition to these two forms, there are two additional normal forms
based on compatibility equivalence. In Unicode, certain characters are
supported which normally would be unified with other characters. For
example, U+2160 (ROMAN NUMERAL ONE) is really the same thing as U+0049
(LATIN CAPITAL LETTER I). However, it is supported in Unicode for
compatibility with existing character sets (e.g. gb2312).
The normal form KD (NFKD) will apply the compatibility decomposition,
i.e. replace all compatibility characters with their equivalents. The
normal form KC (NFKC) first applies the compatibility decomposition,
followed by the canonical composition.
Even if two unicode strings are normalized and look the same to a
human reader, if one has combining characters and the other doesn’t,
they may not compare equal.
Solves it for me. Simple and easy.

Late answer, but this error is related to your terminal's encoding not supporting certain characters.
I fixed it on python3 using:
import sys
import io
sys.stdout = io.open(sys.stdout.fileno(), 'w', encoding='utf8')
print("é, à, ...")

Below solution worked for me, Just added
u "String"
(representing the string as unicode) before my string.
result_html = result.to_html(col_space=1, index=False, justify={'right'})
text = u"""
<html>
<body>
<p>
Hello all, <br>
<br>
Here's weekly summary report. Let me know if you have any questions. <br>
<br>
Data Summary <br>
<br>
<br>
{0}
</p>
<p>Thanks,</p>
<p>Data Team</p>
</body></html>
""".format(result_html)

In general case of writing this unsupported encoding string (let's say data_that_causes_this_error) to some file (for e.g. results.txt), this works
f = open("results.txt", "w")
f.write(data_that_causes_this_error.encode('utf-8'))
f.close()

I just had this problem, and Google led me here, so just to add to the general solutions here, this is what worked for me:
# 'value' contains the problematic data
unic = u''
unic += value
value = unic
I had this idea after reading Ned's presentation.
I don't claim to fully understand why this works, though. So if anyone can edit this answer or put in a comment to explain, I'll appreciate it.

We struck this error when running manage.py migrate in Django with localized fixtures.
Our source contained the # -*- coding: utf-8 -*- declaration, MySQL was correctly configured for utf8 and Ubuntu had the appropriate language pack and values in /etc/default/locale.
The issue was simply that the Django container (we use docker) was missing the LANG env var.
Setting LANG to en_US.UTF-8 and restarting the container before re-running migrations fixed the problem.

Update for python 3.0 and later. Try the following in the python editor:
locale-gen en_US.UTF-8
export LANG=en_US.UTF-8 LANGUAGE=en_US.en
LC_ALL=en_US.UTF-8
This sets the system`s default locale encoding to the UTF-8 format.
More can be read here at PEP 538 -- Coercing the legacy C locale to a UTF-8 based locale.

The recommended solution did not work for me, and I could live with dumping all non ascii characters, so
s = s.encode('ascii',errors='ignore')
which left me with something stripped that doesn't throw errors.

Many answers here (#agf and #Andbdrew for example) have already addressed the most immediate aspects of the OP question.
However, I think there is one subtle but important aspect that has been largely ignored and that matters dearly for everyone who like me ended up here while trying to make sense of encodings in Python: Python 2 vs Python 3 management of character representation is wildly different. I feel like a big chunk of confusion out there has to do with people reading about encodings in Python without being version aware.
I suggest anyone interested in understanding the root cause of OP problem to begin by reading Spolsky's introduction to character representations and Unicode and then move to Batchelder on Unicode in Python 2 and Python 3.

Try to avoid conversion of variable to str(variable). Sometimes, It may cause the issue.
Simple tip to avoid :
try:
data=str(data)
except:
data = data #Don't convert to String
The above example will solve Encode error also.

If you have something like packet_data = "This is data" then do this on the next line, right after initializing packet_data:
unic = u''
packet_data = unic

You can set the character encoding to UTF-8 before running your script:
export LC_CTYPE="en_US.UTF-8"
This should generally resolve the issue.

Python Unicode in and out of IDE

when I run my programs from within Eclipse IDE the following piece of code works perfectly:
address_name = self.text_ctrl_address.GetValue().encode('utf-8')
self.address_list = [i for i in data if address_name.upper() in i[5].upper().encode('utf-8')]
but when running the same piece of code directly with python, I get an "UnicodeDecodeError".
What does the IDE does differently that it doesn't fall on this error ?
ps: I encode both unicode strings because it is the only way to test one string against another containing letters like ñ or ç.
Edit:
Sorry, I should have given more details: This piece of code belongs to a dialog built with WxPython. The GetValue() functions gets texts from a line edit widget and try to match this piece of text against a database. The program runs on Windows (and because of this, maybe michael Shopsin above might be right("Win-1252 to UTF-8 is a serious nuisance"). I've read many times that I should always work with unicode, avoid encoding, but if I don't encode, certain string methods don't seem to work very well depending on the characters in a word (I am in Spain, so lots of non ascii characters). By directly I meant "double clicking" the file it self, and not running from within the IDE.

UnicodeDecodeError indicates that the error happens during decoding of a bytestring into Unicode.
In particular, it may happen if you try to encode a bytestring instead of Unicode string on Python 2:
>>> u"\N{EM DASH}".encode('utf-8').encode('utf-8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)
u"\N{EM DASH}".encode('utf-8') is a bytestring and invoking .encode('utf-8') the 2nd time leads to implicit .decode(sys.getdefaultencoding()) that leads to the UnicodeDecodeError.
What does the IDE does differently that it doesn't fall on this error ?
It probably works in IDE because it changes sys.getdefaultencoding() to utf-8 that you should not do. It may hide bugs as your question demonstrates. In general, it may also break 3rd-party libraries that do not expect non-ascii sys.getdefaultencoding() on Python 2.
I encode both unicode strings because it is the only way to test one string against another containing letters like ñ or ç.
You should use unicodedata.normalize() instead:
>>> import unicodedata
>>> a, b = u'\xf1', u'n\u0303'
>>> print(a)
ñ
>>> print(b)
ñ
>>> a == unicodedata.normalize('NFC', b)
True
Note: the code in your question may produce surprising results:
#XXX BROKEN, DON'T DO IT
...address_name.upper() in i[5].upper().encode('utf-8')...
address_name.upper() calls bytes.upper method while i[5].upper() calls unicode.upper method. The former does not support Unicode and it may depend on the current locale, the latter is better but to perform case-insensitive comparison, use .casefold() method instead:
key = unicode_address_name.casefold()
... if key == i[5].casefold()...
In general, If you need to sort unicode strings then you could use icu.Collator. Compare the default lexicographical sort:
>>> L = [u'sandwiches', u'angel delight', u'custard', u'éclairs', u'glühwein']
>>> sorted(L)
[u'angel delight', u'custard', u'gl\xfchwein', u'sandwiches', u'\xe9clairs']
with the order in en_GB locale:
>>> import icu # PyICU
>>> collator = icu.Collator.createInstance(icu.Locale('en_GB'))
>>> sorted(L, key=collator.getSortKey)
[u'angel delight', u'custard', u'\xe9clairs', u'gl\xfchwein', u'sandwiches']

I could solve the problem changing the encoding from UTF-8 to cp1252 (Windows western europe). Apparently UTF-8 could not encode some Windows characters. Thanks to Michael Shopsin above for the insight.
The program runs on windows and uses WxPython dialog , getting values from a line edit widget and matching the string against a database.
Thank you all for the attention, and I hope this post can help people in the future with a similar problem.

SQLite, python, unicode, and non-utf data

I started by trying to store strings in sqlite using python, and got the message:
sqlite3.ProgrammingError: You must
not use 8-bit bytestrings unless you
use a text_factory that can interpret
8-bit bytestrings (like text_factory =
str). It is highly recommended that
you instead just switch your
application to Unicode strings.
Ok, I switched to Unicode strings. Then I started getting the message:
sqlite3.OperationalError: Could not
decode to UTF-8 column 'tag_artist'
with text 'Sigur Rós'
when trying to retrieve data from the db. More research and I started encoding it in utf8, but then 'Sigur Rós' starts looking like 'Sigur RÃ³s'
note: My console was set to display in 'latin_1' as #John Machin pointed out.
What gives? After reading this, describing exactly the same situation I'm in, it seems as if the advice is to ignore the other advice and use 8-bit bytestrings after all.
I didn't know much about unicode and utf before I started this process. I've learned quite a bit in the last couple hours, but I'm still ignorant of whether there is a way to correctly convert 'ó' from latin-1 to utf-8 and not mangle it. If there isn't, why would sqlite 'highly recommend' I switch my application to unicode strings?
I'm going to update this question with a summary and some example code of everything I've learned in the last 24 hours so that someone in my shoes can have an easy(er) guide. If the information I post is wrong or misleading in any way please tell me and I'll update, or one of you senior guys can update.
Summary of answers
Let me first state the goal as I understand it. The goal in processing various encodings, if you are trying to convert between them, is to understand what your source encoding is, then convert it to unicode using that source encoding, then convert it to your desired encoding. Unicode is a base and encodings are mappings of subsets of that base. utf_8 has room for every character in unicode, but because they aren't in the same place as, for instance, latin_1, a string encoded in utf_8 and sent to a latin_1 console will not look the way you expect. In python the process of getting to unicode and into another encoding looks like:
str.decode('source_encoding').encode('desired_encoding')
or if the str is already in unicode
str.encode('desired_encoding')
For sqlite I didn't actually want to encode it again, I wanted to decode it and leave it in unicode format. Here are four things you might need to be aware of as you try to work with unicode and encodings in python.
The encoding of the string you want to work with, and the encoding you want to get it to.
The system encoding.
The console encoding.
The encoding of the source file
Elaboration:
(1) When you read a string from a source, it must have some encoding, like latin_1 or utf_8. In my case, I'm getting strings from filenames, so unfortunately, I could be getting any kind of encoding. Windows XP uses UCS-2 (a Unicode system) as its native string type, which seems like cheating to me. Fortunately for me, the characters in most filenames are not going to be made up of more than one source encoding type, and I think all of mine were either completely latin_1, completely utf_8, or just plain ascii (which is a subset of both of those). So I just read them and decoded them as if they were still in latin_1 or utf_8. It's possible, though, that you could have latin_1 and utf_8 and whatever other characters mixed together in a filename on Windows. Sometimes those characters can show up as boxes, other times they just look mangled, and other times they look correct (accented characters and whatnot). Moving on.
(2) Python has a default system encoding that gets set when python starts and can't be changed during runtime. See here for details. Dirty summary ... well here's the file I added:
\# sitecustomize.py
\# this file can be anywhere in your Python path,
\# but it usually goes in ${pythondir}/lib/site-packages/
import sys
sys.setdefaultencoding('utf_8')
This system encoding is the one that gets used when you use the unicode("str") function without any other encoding parameters. To say that another way, python tries to decode "str" to unicode based on the default system encoding.
(3) If you're using IDLE or the command-line python, I think that your console will display according to the default system encoding. I am using pydev with eclipse for some reason, so I had to go into my project settings, edit the launch configuration properties of my test script, go to the Common tab, and change the console from latin-1 to utf-8 so that I could visually confirm what I was doing was working.
(4) If you want to have some test strings, eg
test_str = "ó"
in your source code, then you will have to tell python what kind of encoding you are using in that file. (FYI: when I mistyped an encoding I had to ctrl-Z because my file became unreadable.) This is easily accomplished by putting a line like so at the top of your source code file:
# -*- coding: utf_8 -*-
If you don't have this information, python attempts to parse your code as ascii by default, and so:
SyntaxError: Non-ASCII character '\xf3' in file _redacted_ on line 81, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details
Once your program is working correctly, or, if you aren't using python's console or any other console to look at output, then you will probably really only care about #1 on the list. System default and console encoding are not that important unless you need to look at output and/or you are using the builtin unicode() function (without any encoding parameters) instead of the string.decode() function. I wrote a demo function I will paste into the bottom of this gigantic mess that I hope correctly demonstrates the items in my list. Here is some of the output when I run the character 'ó' through the demo function, showing how various methods react to the character as input. My system encoding and console output are both set to utf_8 for this run:
'�' = original char <type 'str'> repr(char)='\xf3'
'?' = unicode(char) ERROR: 'utf8' codec can't decode byte 0xf3 in position 0: unexpected end of data
'ó' = char.decode('latin_1') <type 'unicode'> repr(char.decode('latin_1'))=u'\xf3'
'?' = char.decode('utf_8') ERROR: 'utf8' codec can't decode byte 0xf3 in position 0: unexpected end of data
Now I will change the system and console encoding to latin_1, and I get this output for the same input:
'ó' = original char <type 'str'> repr(char)='\xf3'
'ó' = unicode(char) <type 'unicode'> repr(unicode(char))=u'\xf3'
'ó' = char.decode('latin_1') <type 'unicode'> repr(char.decode('latin_1'))=u'\xf3'
'?' = char.decode('utf_8') ERROR: 'utf8' codec can't decode byte 0xf3 in position 0: unexpected end of data
Notice that the 'original' character displays correctly and the builtin unicode() function works now.
Now I change my console output back to utf_8.
'�' = original char <type 'str'> repr(char)='\xf3'
'�' = unicode(char) <type 'unicode'> repr(unicode(char))=u'\xf3'
'�' = char.decode('latin_1') <type 'unicode'> repr(char.decode('latin_1'))=u'\xf3'
'?' = char.decode('utf_8') ERROR: 'utf8' codec can't decode byte 0xf3 in position 0: unexpected end of data
Here everything still works the same as last time but the console can't display the output correctly. Etc. The function below also displays more information that this and hopefully would help someone figure out where the gap in their understanding is. I know all this information is in other places and more thoroughly dealt with there, but I hope that this would be a good kickoff point for someone trying to get coding with python and/or sqlite. Ideas are great but sometimes source code can save you a day or two of trying to figure out what functions do what.
Disclaimers: I'm no encoding expert, I put this together to help my own understanding. I kept building on it when I should have probably started passing functions as arguments to avoid so much redundant code, so if I can I'll make it more concise. Also, utf_8 and latin_1 are by no means the only encoding schemes, they are just the two I was playing around with because I think they handle everything I need. Add your own encoding schemes to the demo function and test your own input.
One more thing: there are apparently crazy application developers making life difficult in Windows.
#!/usr/bin/env python
# -*- coding: utf_8 -*-
import os
import sys
def encodingDemo(str):
validStrings = ()
try:
print "str =",str,"{0} repr(str) = {1}".format(type(str), repr(str))
validStrings += ((str,""),)
except UnicodeEncodeError as ude:
print "Couldn't print the str itself because the console is set to an encoding that doesn't understand some character in the string. See error:\n\t",
print ude
try:
x = unicode(str)
print "unicode(str) = ",x
validStrings+= ((x, " decoded into unicode by the default system encoding"),)
except UnicodeDecodeError as ude:
print "ERROR. unicode(str) couldn't decode the string because the system encoding is set to an encoding that doesn't understand some character in the string."
print "\tThe system encoding is set to {0}. See error:\n\t".format(sys.getdefaultencoding()),
print ude
except UnicodeEncodeError as uee:
print "ERROR. Couldn't print the unicode(str) because the console is set to an encoding that doesn't understand some character in the string. See error:\n\t",
print uee
try:
x = str.decode('latin_1')
print "str.decode('latin_1') =",x
validStrings+= ((x, " decoded with latin_1 into unicode"),)
try:
print "str.decode('latin_1').encode('utf_8') =",str.decode('latin_1').encode('utf_8')
validStrings+= ((x, " decoded with latin_1 into unicode and encoded into utf_8"),)
except UnicodeDecodeError as ude:
print "The string was decoded into unicode using the latin_1 encoding, but couldn't be encoded into utf_8. See error:\n\t",
print ude
except UnicodeDecodeError as ude:
print "Something didn't work, probably because the string wasn't latin_1 encoded. See error:\n\t",
print ude
except UnicodeEncodeError as uee:
print "ERROR. Couldn't print the str.decode('latin_1') because the console is set to an encoding that doesn't understand some character in the string. See error:\n\t",
print uee
try:
x = str.decode('utf_8')
print "str.decode('utf_8') =",x
validStrings+= ((x, " decoded with utf_8 into unicode"),)
try:
print "str.decode('utf_8').encode('latin_1') =",str.decode('utf_8').encode('latin_1')
except UnicodeDecodeError as ude:
print "str.decode('utf_8').encode('latin_1') didn't work. The string was decoded into unicode using the utf_8 encoding, but couldn't be encoded into latin_1. See error:\n\t",
validStrings+= ((x, " decoded with utf_8 into unicode and encoded into latin_1"),)
print ude
except UnicodeDecodeError as ude:
print "str.decode('utf_8') didn't work, probably because the string wasn't utf_8 encoded. See error:\n\t",
print ude
except UnicodeEncodeError as uee:
print "ERROR. Couldn't print the str.decode('utf_8') because the console is set to an encoding that doesn't understand some character in the string. See error:\n\t",uee
print
print "Printing information about each character in the original string."
for char in str:
try:
print "\t'" + char + "' = original char {0} repr(char)={1}".format(type(char), repr(char))
except UnicodeDecodeError as ude:
print "\t'?' = original char {0} repr(char)={1} ERROR PRINTING: {2}".format(type(char), repr(char), ude)
except UnicodeEncodeError as uee:
print "\t'?' = original char {0} repr(char)={1} ERROR PRINTING: {2}".format(type(char), repr(char), uee)
print uee
try:
x = unicode(char)
print "\t'" + x + "' = unicode(char) {1} repr(unicode(char))={2}".format(x, type(x), repr(x))
except UnicodeDecodeError as ude:
print "\t'?' = unicode(char) ERROR: {0}".format(ude)
except UnicodeEncodeError as uee:
print "\t'?' = unicode(char) {0} repr(char)={1} ERROR PRINTING: {2}".format(type(x), repr(x), uee)
try:
x = char.decode('latin_1')
print "\t'" + x + "' = char.decode('latin_1') {1} repr(char.decode('latin_1'))={2}".format(x, type(x), repr(x))
except UnicodeDecodeError as ude:
print "\t'?' = char.decode('latin_1') ERROR: {0}".format(ude)
except UnicodeEncodeError as uee:
print "\t'?' = char.decode('latin_1') {0} repr(char)={1} ERROR PRINTING: {2}".format(type(x), repr(x), uee)
try:
x = char.decode('utf_8')
print "\t'" + x + "' = char.decode('utf_8') {1} repr(char.decode('utf_8'))={2}".format(x, type(x), repr(x))
except UnicodeDecodeError as ude:
print "\t'?' = char.decode('utf_8') ERROR: {0}".format(ude)
except UnicodeEncodeError as uee:
print "\t'?' = char.decode('utf_8') {0} repr(char)={1} ERROR PRINTING: {2}".format(type(x), repr(x), uee)
print
x = 'ó'
encodingDemo(x)
Much thanks for the answers below and especially to #John Machin for answering so thoroughly.

I'm still ignorant of whether there is a way to correctly convert 'ó' from latin-1 to utf-8 and not mangle it
repr() and unicodedata.name() are your friends when it comes to debugging such problems:
>>> oacute_latin1 = "\xF3"
>>> oacute_unicode = oacute_latin1.decode('latin1')
>>> oacute_utf8 = oacute_unicode.encode('utf8')
>>> print repr(oacute_latin1)
'\xf3'
>>> print repr(oacute_unicode)
u'\xf3'
>>> import unicodedata
>>> unicodedata.name(oacute_unicode)
'LATIN SMALL LETTER O WITH ACUTE'
>>> print repr(oacute_utf8)
'\xc3\xb3'
>>>
If you send oacute_utf8 to a terminal that is set up for latin1, you will get A-tilde followed by superscript-3.
I switched to Unicode strings.
What are you calling Unicode strings? UTF-16?
What gives? After reading this, describing exactly the same situation I'm in, it seems as if the advice is to ignore the other advice and use 8-bit bytestrings after all.
I can't imagine how it seems so to you. The story that was being conveyed was that unicode objects in Python and UTF-8 encoding in the database were the way to go. However Martin answered the original question, giving a method ("text factory") for the OP to be able to use latin1 -- this did NOT constitute a recommendation!
Update in response to these further questions raised in a comment:
I didn't understand that the unicode characters still contained an implicit encoding. Am I saying that right?
No. An encoding is a mapping between Unicode and something else, and vice versa. A Unicode character doesn't have an encoding, implicit or otherwise.
It looks to me like unicode("\xF3") and "\xF3".decode('latin1') are the same when evaluated with repr().
Say what? It doesn't look like it to me:
>>> unicode("\xF3")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf3 in position 0: ordinal
not in range(128)
>>> "\xF3".decode('latin1')
u'\xf3'
>>>
Perhaps you meant: u'\xf3' == '\xF3'.decode('latin1') ... this is certainly true.
It is also true that unicode(str_object, encoding) does the same as str_object.decode(encoding) ... including blowing up when an inappropriate encoding is supplied.
Is that a happy circumstance
That the first 256 characters in Unicode are the same, code for code, as the 256 characters in latin1 is a good idea. Because all 256 possible latin1 characters are mapped to Unicode, it means that ANY 8-bit byte, ANY Python str object can be decoded into unicode without an exception being raised. This is as it should be.
However there exist certain persons who confuse two quite separate concepts: "my script runs to completion without any exceptions being raised" and "my script is error-free". To them, latin1 is "a snare and a delusion".
In other words, if you have a file that's actually encoded in cp1252 or gbk or koi8-u or whatever and you decode it using latin1, the resulting Unicode will be utter rubbish and Python (or any other language) will not flag an error -- it has no way of knowing that you have commited a silliness.
or is unicode("str") going to always return the correct decoding?
Just like that, with the default encoding being ascii, it will return the correct unicode if the file is actually encoded in ASCII. Otherwise, it'll blow up.
Similarly, if you specify the correct encoding, or one that's a superset of the correct encoding, you'll get the correct result. Otherwise you'll get gibberish or an exception.
In short: the answer is no.
If not, when I receive a python str that has any possible character set in it, how do I know how to decode it?
If the str object is a valid XML document, it will be specified up front. Default is UTF-8.
If it's a properly constructed web page, it should be specified up front (look for "charset"). Unfortunately many writers of web pages lie through their teeth (ISO-8859-1 aka latin1, should be Windows-1252 aka cp1252; don't waste resources trying to decode gb2312, use gbk instead). You can get clues from the nationality/language of the website.
UTF-8 is always worth trying. If the data is ascii, it'll work fine, because ascii is a subset of utf8. A string of text that has been written using non-ascii characters and has been encoded in an encoding other than utf8 will almost certainly fail with an exception if you try to decode it as utf8.
All of the above heuristics and more and a lot of statistics are encapsulated in chardet, a module for guessing the encoding of arbitrary files. It usually works well. However you can't make software idiot-proof. For example, if you concatenate data files written some with encoding A and some with encoding B, and feed the result to chardet, the answer is likely to be encoding C with a reduced level of confidence e.g. 0.8. Always check the confidence part of the answer.
If all else fails:
(1) Try asking here, with a small sample from the front of your data ... print repr(your_data[:400]) ... and whatever collateral info about its provenance that you have.
(2) Recent Russian research into techniques for recovering forgotten passwords appears to be quite applicable to deducing unknown encodings.
Update 2 BTW, isn't it about time you opened up another question ?-)
One more thing: there are apparently characters that Windows uses as Unicode for certain characters that aren't the correct Unicode for that character, so you may have to map those characters to the correct ones if you want to use them in other programs that are expecting those characters in the right spot.
It's not Windows that's doing it; it's a bunch of crazy application developers. You might have more understandably not paraphrased but quoted the opening paragraph of the effbot article that you referred to:
Some applications add CP1252 (Windows, Western Europe) characters to documents marked up as ISO 8859-1 (Latin 1) or other encodings. These characters are not valid ISO-8859-1 characters, and may cause all sorts of problems in processing and display applications.
Background:
The range U+0000 to U+001F inclusive is designated in Unicode as "C0 Control Characters". These exist also in ASCII and latin1, with the same meanings. They include such familar things as carriage return, line feed, bell, backspace, tab, and others that are used rarely.
The range U+0080 to U+009F inclusive is designated in Unicode as "C1 Control Characters". These exist also in latin1, and include 32 characters that nobody outside unicode.org can imagine any possible use for.
Consequently, if you run a character frequency count on your unicode or latin1 data, and you find any characters in that range, your data is corrupt. There is no universal solution; it depends on how it became corrupted. The characters may have the same meaning as the cp1252 characters at the same positions, and thus the effbot's solution will work. In another case that I've been looking at recently, the dodgy characters appear to have been caused by concatenating text files encoded in UTF-8 and another encoding which needed to be deduced based on letter frequencies in the (human) language the files were written in.

UTF-8 is the default encoding of SQLite databases. This shows up in situations like "SELECT CAST(x'52C3B373' AS TEXT);". However, the SQLite C library doesn't actually check whether a string inserted into a DB is valid UTF-8.
If you insert a Python unicode object (or str object in 3.x), the Python sqlite3 library will automatically convert it to UTF-8. But if you insert a str object, it will just assume the string is UTF-8, because Python 2.x "str" doesn't know its encoding. This is one reason to prefer Unicode strings.
However, it doesn't help you if your data is broken to begin with.
To fix your data, do
db.create_function('FIXENCODING', 1, lambda s: str(s).decode('latin-1'))
db.execute("UPDATE TheTable SET TextColumn=FIXENCODING(CAST(TextColumn AS BLOB))")
for every text column in your database.

I fixed this pysqlite problem by setting:
conn.text_factory = lambda x: unicode(x, 'utf-8', 'ignore')
By default text_factory is set to unicode(), which will use the current default encoding (ascii on my machine)

Of course there is. But your data is already broken in the database, so you'll need to fix it:
>>> print u'Sigur RÃ³s'.encode('latin-1').decode('utf-8')
Sigur Rós

My unicode problems with Python 2.x (Python 2.7.6 to be specific) fixed this:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
It also solved the error you are mentioning right at the beginning of the post:
sqlite3.ProgrammingError: You must not use 8-bit bytestrings unless
...
EDIT
sys.setdefaultencoding is a dirty hack. Yes, it can solve UTF-8 issues, but everything comes with a price. For more details refer to following links:
Why sys.setdefaultencoding() will break code
Why we need sys.setdefaultencoding(“utf-8”) in a py script?

How do I regex search for weird non-ASCII characters in Python?

I'm using the following regular expression basically to search for and delete these characters.
invalid_unicode = re.compile(ur'(Û|²|°|±|É|¹|Í)')
My source code in ASCII encoded, and whenever I try to run the script it spits out:
SyntaxError: Non-ASCII character '\xdb' in file ./release.py on line 273, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details
If I follow the instructions at the given website, and place utf-8 on the second line encoding, my script doesn't run. Instead it gives me this error:
SyntaxError: (unicode error) 'utf8' codec can't decode byte 0xdb in position 0: unexpected end of data
How do I get this one regular expression running in an ASCII written script that'd be great.

You need to find out what encoding your editor is using, and set that per PEP263; or, make things more stable and portable (though alas perhaps a bit less readable) and use escape sequences in your string literal, i.e., use u'(\xdb|\xb2|\xb0|\xb1|\xc9|\xb9|\xcd)' as the parameter to the re.compile call.

After telling Python that your source file uses UTF-8 encoding, did you actually make sure that your editor is saving the file using UTF-8 encoding? The error you get indicates that your editor is probably not using UTF-8.
What text editor are you using?

\x{c0de}
In a regex will match the Unicode character at code point c0de.
Python uses PCRE, right? (If it doesn't, it's probably \uC0DE instead...)

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to completely sanitize a string of illegal characters in python? - python

You can pass, "ignore" to skip invalid characters in .encode/.decode like "ILLEGAL".decode("utf8","ignore") >>> "ILLEGA\xa0L".decode("utf8") ... UnicodeDecodeError: 'utf8' codec can't decode byte 0xa0 in position 6: unexpected code byte >>> "ILLEGA\xa0L".decode("utf8","ignore") u'ILLEGAL' >>>

Declare the coding on the second line of your script. It really has to be second. Like #!/usr/bin/python # coding=utf-8 This might be enough to solve your problem all by itself. If not, see str.encode('utf-8') and str.decode('utf-8').

you can also use: python3 -c "import urllib, sys ; print urllib.quote_plus(sys.stdin.read())"; taken from https://wiki.python.org/moin/Powerful%20Python%20One-Liners ** ps, in the website it's pointed to use python, but I tested in python3 and it works just fine

The only way to do it (at least in python2) is to use unicodedata.normalize: unicodedata.normalize('NFKD', text).encode('utf-8', 'ignore') decode('utf-8', 'ignore') will just raise exception.

Related

UnicodeEncodeError, can't seem to set errors='ignore'

UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 20: ordinal not in range(128)

Python Unicode in and out of IDE

SQLite, python, unicode, and non-utf data

How do I regex search for weird non-ASCII characters in Python?

Categories

Resources