Python decoding of back quotations - python

I am receiving this issue
" UnicodeEncodeError: 'latin-1' codec can't encode character u'\u201d' "
I'm quite new to working with databases as a whole. Previously, I had been using SQLite3; however, now transitioning/migrating to MySQL, I noticed u'\u201d' and u'\u201c' characters were within some of my text data.
I'm currently writing a Python script to handle the migration; however, I'm stuck on this codec issue, which I didn't foresee.
So my question is, how do I replace/decode these values so that I can actually store them in MySQL DB?

You don't have a problem decoding these characters; wherever they're coming from, if they're showing up as \u201d (”) and \u201c (“), they're already being properly decoded.
The problem is encoding these characters. If you want to store your strings in Latin-1 columns, they can only contain the 256 characters that exist in Latin-1, and these two are not among them.
The obvious solution is to use UTF-8 columns instead of Latin-1 in MySQL. Then this problem wouldn't even exist; any Unicode string can be encoded as UTF-8.
But assuming you can't do that for some reason…
Python comes with built-in support for different error handlers that can help you do something with these characters while encoding them. You just have to decide what "something" that is.
Let's say your string looks like hey “hey” hey. Here's what each error handler would do with it:
s.encode('latin-1', 'ignore'): hey hey hey
s.encode('latin-1', 'replace'): hey ?hey? hey
s.encode('latin-1', 'xmlcharrefreplace'): hey &#8220;hey&#8221; hey
s.encode('latin-1', 'backslashreplace'): hey \u201chey\u201d hey
The first two have the advantage of being somewhat readable, but the disadvantage that you can never recover the original string. If you want that, but want something even more readable, you may want to consider a third-party library like unidecode:
unidecode('hey “hey” hey').encode('latin-1'): hey "hey" hey
The last two are lossless, but kind of ugly, although in some contexts they'll look pretty nice: if you're building an XML document, for example, xmlcharrefreplace (maybe even with 'ascii' instead of 'latin-1') will give you exactly what you want in an XML viewer. There are special-purpose translators for various other use cases (like HTML references, or XML named entities instead of numbered, etc.) if you know what you want.
But in general, you have to make the choice between throwing away information, or "hiding" it in some ugly but recoverable form.
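As a rough illustration of the handlers listed above, the following sketch encodes the same sample string with each one. It is written in Python 3 syntax (an assumption; the question uses Python 2), where `encode` returns a `bytes` object, and the variable names are only for illustration.

```python
# Sample string containing U+201C and U+201D, which Latin-1 cannot represent.
s = "hey \u201chey\u201d hey"

handlers = ["ignore", "replace", "xmlcharrefreplace", "backslashreplace"]
results = {name: s.encode("latin-1", name) for name in handlers}

for name in handlers:
    print(name, results[name])

# Without a handler, the default ('strict') raises UnicodeEncodeError.
try:
    s.encode("latin-1")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# UTF-8 can represent every character, so no handler is needed there.
utf8_roundtrip = s.encode("utf-8").decode("utf-8")
```

Only the UTF-8 round trip gives back the original string unchanged, which is why switching the column encoding is the cleaner fix when it is available.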

Related

Why do Unicode characters trigger an EncodeError in some environments but not in others?

I am making a Python project that needs to print, edit, and return strings containing Greek characters.
On my main PC, which has the Greek language installed, everything runs fine, but when I run the same program with the same version of Python on my English laptop, an encode error is triggered. Specifically this one:
EncodeError: 'ascii' codec can't encode characters in position 0-2:
ordinal not in range(128)
The error happens due to this code
my_string = "Δίας"
print(my_string)
Why is this happening and what do I need to do to fix it?
Why is this happening? You are using Python 2 and although it supports Unicode, it makes you jump through a few more hoops explicitly than Python 3 does. The string you provide contains characters that fall outside the normal first 128 ASCII characters, which is what is causing the problem.
The print statement tries to encode the string as standard ascii, but it runs into characters it doesn't understand and by that point, it does not know what encoding the characters are supposed to be in. You might think this is obvious: "the same encoding the file is in!" or "always UTF-8!", but Python 2 wants you to make it explicit.
What do you need to do to fix it? One solution would be to use Python 3 and not worry about it, if all you need is a quick solution. Python 3 really is the way forward at this point and using Python 2 makes you solve problems that many Python programmers today don't have to solve (although they should be able to, in the end).
If you want to keep using Python 2, you should change your code to this:
# coding=utf-8
my_string = u"Δίας"
print(my_string.encode('utf-8'))
The first line tells the interpreter explicitly what encoding the source file was written in. This helps your IDE as well, to make sure it is showing you the code correctly. The second line has the u in front of the string, telling Python my_string is in fact a unicode string. And the third line explicitly tells Python that you want the output to be utf-8 encoded as well.
A more complete explanation of all this is here https://docs.python.org/2/howto/unicode.html
If you're wondering why it works on your Greek computer but not on your English one: the default encoding on the Greek computer includes the characters you're using, while the default encoding on the English one does not. Python knows the string is a series of Unicode code points, but by the time it needs to encode them for output it falls back on the default (English) encoding, which doesn't have the characters in the string.
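The same failure can be reproduced without involving the terminal at all. This is a minimal sketch in Python 3 syntax (an assumption, since the answer targets Python 2): encoding the Greek string as ASCII fails, while UTF-8 succeeds. Which codec `print` actually uses depends on the machine, and `sys.stdout.encoding` is one way to inspect it.

```python
import sys

my_string = "Δίας"

# ASCII has no Greek letters, so a strict encode fails.
try:
    my_string.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# UTF-8 can encode any Unicode string.
utf8_bytes = my_string.encode("utf-8")

# The codec used by print(); this varies between machines and locales.
print("stdout encoding:", sys.stdout.encoding)
print("ascii ok:", ascii_ok)
```

Comparing the reported stdout encoding on the two machines should show where they differ.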

Converting a weird data type to Str

I apologize in advance as I am not sure how to ask this! Okay so I am attempting to use a twitter API within Python. Here is the snippet of code giving me issues:
trends = twitter.Api.GetTrendsCurrent(api)
print str(trends)
This returns:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 1-5: ordinal not in range(128)
When I attempt to .encode, the interpreter tells me I cannot encode a Trend object. How do I get around this?
Simple answer:
Use repr, not str. It should always, always work (unless the API itself is broken and that is where the error is being thrown from).
Long answer:
When you cast a Unicode string to a byte str (and vice versa) in Python 2, it uses the ascii encoding by default for the conversion. This works most of the time, but not always, so nasty edge cases like this are a pain. One of the big reasons for the break in backwards compatibility in Python 3 was to change this behavior.
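Use latin1 for testing. It may not be the correct encoding, but decoding with it always (always, always, always) works, because every byte value maps to a character; encoding with it works for any character up to U+00FF. That gives you a jumping-off point for debugging this properly, so you can at least print something.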
trends = twitter.Api.GetTrendsCurrent(api)
print type(trends)
print unicode(trends)
print unicode(trends).encode('latin1')
Or, better yet, when encoding force it to ignore or replace errors:
trends = twitter.Api.GetTrendsCurrent(api)
print type(trends)
print unicode(trends)
print unicode(trends).encode('utf8', 'xmlcharrefreplace')
Chances are, since you are dealing with a web based API, you are dealing with UTF-8 data anyway; it is pretty much the default encoding across the board on the web.
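To make the Latin-1 point concrete, here is a small sketch in Python 3 syntax (an assumption; the snippets above are Python 2). Decoding arbitrary bytes as Latin-1 never fails, but encoding does fail for characters above U+00FF unless an error handler is supplied. The sample text is made up for illustration.

```python
# Every possible byte value decodes under Latin-1, one character per byte.
all_bytes = bytes(range(256))
decoded = all_bytes.decode("latin-1")

# Encoding is a different story: U+201C (a curly quote) is not in Latin-1.
text = "caf\u00e9 \u201ctrend\u201d"
try:
    text.encode("latin-1")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# An error handler makes the encode succeed at the cost of fidelity.
safe = text.encode("latin-1", "replace")
print(safe)
```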

Python Unicode Handling Errors - How To Simply Remove Unicode

There are literally dozens, maybe even hundreds of questions on this site about unicode handling errors with python. Here is an example of what I am talking about:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 2310: ordinal not in range(128)
A great many of the questions indicate that the OP just wants the offending content to GO AWAY. The responses they receive are uniformly full of mumbo jumbo about codecs, character sets, and all sorts of things that do not address this one basic question:
"I am trying to process a text file with some unicode in it, I could not possibly care any less what this stuff is, it is just noise in the context of the problem I am trying to solve."
So, I have a file with a zillion JSON encoded tweets, I have no interest at all in these special characters, I want them REMOVED from the line.
fh = open('file-full-of-unicode.txt')
for line in fh:
    print zap_unicode(line)
Given a variable called 'line', how do I simply print it minus any unicode it might contain?
There, I have repeated the question in several different fashions so it can not be misconstrued - unicode is junk in the context of what I am trying to do, I want to convert it to something innocuous or simply remove it entirely. How is this most easily accomplished?
You can do line.decode('ascii', 'ignore'). This will decode as ASCII everything that it can, ignoring any errors.
However, if you do this, prepare for pain. Unicode exists for a reason. Throwing away parts of your data without even knowing what you're throwing away will almost always cause problems down the road. It's like saying "I don't care if some of the gizmos inside my car engine don't work, I just want to drive it. Just take out anything that doesn't work!" That's all well and good until it explodes.
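For reference, a minimal sketch of the `zap_unicode` helper named in the question, written in Python 3 syntax and assuming each line arrives as raw bytes (for example from a file opened in binary mode). The sample bytes are invented.

```python
def zap_unicode(line: bytes) -> str:
    """Decode as ASCII, silently dropping every non-ASCII byte."""
    return line.decode("ascii", "ignore")


# b"\xe2\x80\x9c" and b"\xe2\x80\x9d" are the UTF-8 encodings of curly quotes.
raw = b"he said \xe2\x80\x9chello\xe2\x80\x9d"
cleaned = zap_unicode(raw)
print(cleaned)
```

Dropped characters leave no trace, so `b"na\xc3\xafve"` comes out as `nave`, which is the kind of silent damage the warning above is about.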
The problem can't "go away", the code that can't handle unicodes properly must be fixed. There is no trivial way to do this, but the simplest way is to decode the bytes on input to make sure that the application is using text everywhere.
As @BrenBarn suggested, the result will be spurious English words.
My suggestion: why not remove the entire word that contains a Unicode character?
That is, treat a word as space + characters + space; if the characters include a Unicode character, remove everything after the previous space and before the next space. The result will be much better than removing just the Unicode characters.
If you just want to remove names (@username) and tags (#tags), you can filter on the leading character ([@, #, ...]) instead of brute-force searching for all Unicode characters.
I have low reputation so can't comment, so made it as an answer.. :-(
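The word-level filtering suggested in this answer could look roughly like the sketch below. It is in Python 3 syntax (`str.isascii` needs Python 3.7 or later), it assumes the line has already been decoded to text, and both function names are hypothetical.

```python
def drop_non_ascii_words(line: str) -> str:
    """Keep only the whitespace-separated words made entirely of ASCII."""
    kept = [word for word in line.split() if word.isascii()]
    return " ".join(kept)


def drop_prefixed_words(line: str, prefixes=("@", "#")) -> str:
    """Remove words that start with one of the given prefixes."""
    kept = [word for word in line.split()
            if not word.startswith(tuple(prefixes))]
    return " ".join(kept)


print(drop_non_ascii_words("good caf\u00e9 food"))
print(drop_prefixed_words("hi @bob see #tag now"))
```

Both helpers collapse runs of whitespace into single spaces as a side effect of `split` and `join`.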

django + unicode constant errors

I built a django site last year that utilises both a dashboard and an API for a client.
They are, on occasion, putting unicode information (usually via a Microsoft keyboard and a single quote character!) into the database.
It's fine to change this one instance for everything, but what I constantly get is something like this error when a new character is added that I haven't "converted":
UnicodeDecodeError at /xx/xxxxx/api/xxx.json
'ascii' codec can't decode byte 0xeb in position 0: ordinal not in range(128)
The issue is actually that I need to be able to convert this unicode (from the model) into HTML.
# if a char breaks the system, replace it here (duplicate line)
text = unicode(str(text).replace('\xa3', '&pound;'))
I duplicate this line here, but it just breaks otherwise.
Tearing my hair out because I know this is straight forward and I'm doing something remarkably silly somewhere.
Have searched elsewhere and realised that while my issue is not new, I can't find the answer elsewhere.
I assume that text is unicode (which seems a safe assumption, as \xa3 is the unicode for the £ character).
I'm not sure why you need to encode it at all, seeing as the text will be converted to utf-8 on output in the template, and all browsers are perfectly capable of displaying that. There is likely another point further down the line where something (probably your code, unfortunately) is assuming ASCII, and the implicit conversion is breaking things.
In that case, you could just do this:
text = text.encode('ascii', 'xmlcharrefreplace')
which converts the non-ASCII characters into HTML/XML entities like &#163;.
Tell the JSON-decoder that it shall decode the json-file as unicode. When using the json module directly, this can be done using this code:
json.JSONDecoder(encoding='utf8').decode(
    json.JSONEncoder(encoding='utf8').encode('blä'))
If the JSON decoding takes place via some other modules (django, ...) maybe you can pass the information through this other module into the json stuff.
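The snippet above is Python 2. In Python 3, where `str` is always Unicode, the `json` classes do not take an `encoding` argument. A rough Python 3 equivalent of the round trip, as a sketch with illustrative variable names:

```python
import json

text = "bl\u00e4"  # 'blä'

# By default non-ASCII characters are escaped, so the JSON text is pure ASCII.
escaped = json.dumps(text)

# ensure_ascii=False keeps the character itself in the output.
literal = json.dumps(text, ensure_ascii=False)

# Bytes read from a file or socket are decoded explicitly before parsing.
raw = literal.encode("utf-8")
decoded = json.loads(raw.decode("utf-8"))
print(escaped, decoded == text)
```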

how do I write a custom encoding in python to clean up my data?

I know I've done this before at another job, but I can't remember what I did.
I have a database that is full of varchar and memo fields that were cut and pasted from Office, webpages, and who knows where else. This is starting to cause encoding errors for me. Since Python has a very nice "decode" function to take a byte stream and translate it into Unicode, I thought I would just write my own encoding to fix this up. (For example, to take "smart quotes" and turn them into "standard quotes".)
But I can't remember how to get started. I think I copied one of the encodings that was close (cp1252.py) and then updated it.
Can anyone put me on the right path? Or suggest a better path?
I've expanded this with a bit more detail.
If you are reasonably sure of the encoding of the text in the database, you can do text.decode('cp1252') to get a Unicode string. If the guess is wrong this will likely blow up with an exception, or the decoder will 'disappear' some characters.
Creating a decoder along the lines you describe (modifying cp1252.py) is easy. You just need to define the translation table from bytes to Unicode characters.
However, if not all of the text in the database has the same encoding, your decoder will need some rules to decide which is the correct mapping. In this case you may want to punt and use the chardet module, which can scan the text and guess the encoding.
Maybe the best approach would be to try to decode using the most likely encoding (cp1252) and, if that fails, fall back to using chardet to guess the correct encoding.
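The try-then-fall-back idea can be sketched as follows, in Python 3 syntax. To keep the sketch dependent only on the standard library, a fixed `latin-1` fallback stands in for the chardet guess; that substitution is an assumption, and `decode_with_fallback` and its parameters are hypothetical names. With chardet installed, the encoding reported by `chardet.detect(raw)['encoding']` could be tried in the `except` branch instead.

```python
def decode_with_fallback(raw: bytes, primary: str = "cp1252",
                         fallback: str = "latin-1") -> str:
    """Try the most likely encoding first, then a fallback that cannot fail."""
    try:
        return raw.decode(primary)
    except UnicodeDecodeError:
        # Stand-in for a chardet-based guess; latin-1 accepts every byte.
        return raw.decode(fallback)


# 0x93 and 0x94 are the curly double quotes in cp1252.
print(decode_with_fallback(b"\x93hi\x94"))

# 0x81 is undefined in cp1252, so this input takes the fallback path.
print(repr(decode_with_fallback(b"\x81")))
```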
If you use text.decode() and/or chardet, you'll end up with a Unicode string. Below is a simple routine which can translate characters in a Unicode string, e.g. "convert curly quotes to ASCII":
CHARMAP = [
    (u'\u201c\u201d', '"'),
    (u'\u2018\u2019', "'")
]
# replace with text.decode('cp1252') or chardet
text = u'\u201cit\u2019s probably going to work\u201d, he said'
_map = dict((c, r) for chars, r in CHARMAP for c in list(chars))
fixed = ''.join(_map.get(c, c) for c in text)
print fixed
Output:
"it's probably going to work", he said

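The routine above is Python 2. In Python 3 the same character mapping can be expressed with `str.maketrans` and `str.translate`; this is a sketch of one equivalent, not the original author's code.

```python
CHARMAP = [
    ("\u201c\u201d", '"'),
    ("\u2018\u2019", "'"),
]

# Build a translation table with one entry per source character.
table = str.maketrans({c: r for chars, r in CHARMAP for c in chars})

text = "\u201cit\u2019s probably going to work\u201d, he said"
fixed = text.translate(table)
print(fixed)
```

Characters that are not in the table pass through unchanged, matching the `_map.get(c, c)` behaviour of the original routine.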