sqlalchemy unicode issue - python

Hi, I am developing an application in Python with SQLAlchemy and MySQL 5.1.58-1ubuntu1. I can read data from the database without problems, except that I cannot read non-ASCII characters like è, ò or the euro symbol. Instead of the euro sign I get
\u20ac
This is how I create the engine for SQLAlchemy:
dbfile="root:#########localhost/parafarmacie"
engine = create_engine("mysql+mysqldb://"+dbfile+"?charset=utf8&use_unicode=0")
All my text columns are declared as Unicode. I googled for days without any luck. Could someone tell me where my mistake is?
thanks in advance

When you get your unicode objects from the database, before you output them, you need to encode them:
my_unicode_object.encode("utf-8")
What you are seeing now is the raw repr of the unicode object which shows you the code point (since it hasn't been converted to bytes yet) :-)
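A minimal sketch of the difference between the escaped representation and the encoded bytes. It assumes Python 3, where ascii() produces the escaped form that Python 2's repr() showed for unicode objects. The variable names are illustrative only:

```python
# The euro sign as a text (unicode) string.
euro = "\u20ac"

# ascii() shows the escaped code point, much like Python 2's repr()
# of a unicode object did.
escaped = ascii(euro)

# Encoding turns the code point into the bytes that a UTF-8 terminal,
# file, or socket expects.
encoded = euro.encode("utf-8")

print(escaped)
print(encoded)
```

The escaped form is only a display of the string; the data itself is intact and becomes readable once it is encoded for the output device.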

Related

Python 2.7 convert special characters into utf-8 bytes

I have strings that I need to substitute into a URL for accessing different JSON files. My problem is that some strings have special characters, and I need these percent-encoded as UTF-8 bytes so I can properly find the JSON tables.
An example:
# I have this string
a = 'code - Brasilândia'
#in the JSON url it appears as
'code%20-%20Brasil%C3%A2ndia'
I managed to get the spaces converted right using urllib.quote(), but it does not convert the special characters as I need them.
print(urllib.quote('code - Brasilândia'))
'code%20-%20Brasil%83ndia'
When I substitute this in the URL, I cannot reach the JSON table.
I managed to make this work using u before the string, u'code - Brasilândia', but this did not solve my issue, because the string will ultimately be a user input, and will need to be constantly changed.
I have tried several methods, but I could not get the result I need.
I'm specifically using python 2.7 for this project, and I cannot change it.
Any ideas?
You could try decoding the string as UTF-8, and if it fails, assume that it's Latin-1, or whichever 8-bit encoding you expect.
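import urllib

try:
    # Already valid UTF-8: leave the byte string unchanged.
    yourstring.decode('utf-8')
except UnicodeDecodeError:
    # Otherwise assume Latin-1 and re-encode as UTF-8.
    yourstring = yourstring.decode('latin-1').encode('utf-8')
print(urllib.quote(yourstring))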
... provided you can establish the correct encoding; 0x83 seems to correspond to â only in some fairly obscure legacy encodings like code pages 437 and 850 (and those are the least obscure). See also https://tripleee.github.io/8bit/#83
(disclosure: the linked site is mine).
Demo: https://ideone.com/fjX15c
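For comparison, here is a rough sketch of the same fallback idea on Python 3, where quote lives in urllib.parse. The question is restricted to Python 2.7, so this is an assumption for illustration only. The function name quote_user_input is hypothetical, and the cp850 fallback is a guess based on the 0x83 byte discussed above:

```python
from urllib.parse import quote


def quote_user_input(raw: bytes, fallback: str = "cp850") -> str:
    """Percent-encode raw bytes as UTF-8, guessing a legacy encoding if needed."""
    try:
        # Input that is already valid UTF-8 is used as-is.
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: anything else is in the given 8-bit code page.
        text = raw.decode(fallback)
    return quote(text.encode("utf-8"))


# 0x83 is "â" in code page 850, so both inputs produce the same URL fragment.
print(quote_user_input(b"code - Brasil\x83ndia"))
print(quote_user_input("code - Brasilândia".encode("utf-8")))
```

The fallback only gives the right answer if the guessed code page matches what the terminal or form actually sent, so it is worth confirming the real input encoding where possible.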

How to escape symbols entered in django admin to html representation

I have a model that has a regular text field, that needs to be able to accept user pasted text data into it which may contain scientific symbols, specifically lowercase delta δ. The users will be entering the data through the model admin.
I'm using a mysql backend, and the encoding is set to Latin-1. Changing the DB encoding is not an option for me.
What I would like to do, for simplicity's sake, is have the admin form scrub the input text, much like sanitization or validation, but escape characters such as δ to their HTML representation, so that I can store them in the DB without having to convert to Unicode and then back again.
What utilities are available to do this? I've looked at escape() and conditional_escape(), but they do not seem to do what I want (they do not escape the special characters), and django.utils.encoding.force_text() will encode everything, but my data will render as its Unicode representation if I do that.
The site runs on Django 1.10 and Python 2.7.x.
Any help or thoughts are much appreciated.
As part of the save method or view that receives the request.POST data, you can escape it, encode it to ascii with xmlcharrefreplace, and then decode it back from bytes to a string:
import html

raw_str = "this is a string with δ problematic chars"
result = html.escape(raw_str).encode("ascii", "xmlcharrefreplace").decode()
print(result)  # this is a string with &#948; problematic chars
Gets the job done since you can't change the encoding, though not nearly as clean as just getting to live in UTF-8. Good luck!
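A small sketch of the round trip, assuming Python 3's standard html module (the question mentions Python 2.7, where html.escape is not available, so treat this as illustrative). The helper name to_latin1_safe and the sample text are hypothetical:

```python
import html


def to_latin1_safe(text: str) -> str:
    """Escape HTML metacharacters, then turn non-ASCII characters into
    numeric character references so the result fits a Latin-1 column."""
    escaped = html.escape(text)
    return escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")


original = "δ13C < 5 & rising"
stored = to_latin1_safe(original)

# html.unescape reverses both the entity escaping and the numeric references.
restored = html.unescape(stored)

print(stored)
print(restored)
```

Because the stored value is pure ASCII, it can be written to the Latin-1 database unchanged, and a browser will render the references as the original symbols.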

Python decoding of back quotations

I am receiving this issue
" UnicodeEncodeError: 'latin-1' codec can't encode character u'\u201d' "
I'm quite new to working with databases as a whole. Previously, I had been using SQLite3; however, now transitioning/migrating to MySQL, I noticed u'\u201d' and u'\u201c' characters were within some of my text data.
I'm currently writing a Python script to handle the migration; however, I'm stuck on this codec issue, which I didn't foresee.
So my question is, how do I replace/decode these values so that I can actually store them in MySQL DB?
You don't have a problem decoding these characters; wherever they're coming from, if they're showing up as \u201d (”) and \u201c (“), they're already being properly decoded.
The problem is encoding these characters. If you want to store your strings in Latin-1 columns, they can only contain the 256 characters that exist in Latin-1, and these two are not among them.
So my question is, how do I replace/decode these values so that I can actually store them in MySQL DB?
The obvious solution is to use UTF-8 columns instead of Latin-1 in MySQL. Then this problem wouldn't even exist; any Unicode string can be encoded as UTF-8.
But assuming you can't do that for some reason…
Python comes with built-in support for different error handlers that can help you do something with these characters while encoding them. You just have to decide what "something" that is.
Let's say your string looks like hey “hey” hey. Here's what each error handler would do with it:
s.encode('latin-1', 'ignore'): hey hey hey
s.encode('latin-1', 'replace'): hey ?hey? hey
s.encode('latin-1', 'xmlcharrefreplace'): hey &#8220;hey&#8221; hey
s.encode('latin-1', 'backslashreplace'): hey \u201chey\u201d hey
The first two have the advantage of being somewhat readable, but the disadvantage that you can never recover the original string. If you want that, but want something even more readable, you may want to consider a third-party library like unidecode:
unidecode('hey “hey” hey').encode('latin-1'): hey "hey" hey
The last two are lossless, but kind of ugly. Although in some contexts they'll look pretty nice—e.g., if you're building an XML document, xmlcharrefreplace (maybe even with 'ascii' instead of 'latin-1') will give you exactly what you want in an XML viewer. There are special-purpose translators for various other use cases (like HTML references, or XML named entities instead of numbered, etc.) if you know what you want.
But in general, you have to make the choice between throwing away information, or "hiding" it in some ugly but recoverable form.
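The four built-in handlers can be compared side by side with a short script. This is a sketch assuming Python 3; the final step uses html.unescape to show that the character-reference form can be turned back into the original text:

```python
import html

# A string containing curly quotes, which do not exist in Latin-1.
s = "hey \u201chey\u201d hey"

# Encode the same string with each error handler.
results = {
    handler: s.encode("latin-1", handler)
    for handler in ("ignore", "replace", "xmlcharrefreplace", "backslashreplace")
}
for handler, encoded in results.items():
    print(handler, encoded)

# The xmlcharrefreplace output is lossless: decoding the bytes and
# unescaping the references gives back the original string.
recovered = html.unescape(results["xmlcharrefreplace"].decode("latin-1"))
print(recovered == s)
```

The 'ignore' and 'replace' results cannot be reversed this way, which is the trade-off described above.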

Python Encoding - Could not decode to utf8

I have an sqlite database that was populated by an external program. I'm trying to read the data with Python. When I attempt to read the data I get the following error:
OperationalError: Could not decode to UTF-8
If I open the database in sqlite manager and look at the data in the offending record(s) using the inbuilt browse and search it looks fine, however if I export the table as csv, I notice the character £ in the offending records has become £
If I read the csv in python, the £ in the offending records is still read as £ but its not a problem I can parse this manually. However I need to be able to read the data direct from the database, without the intermediate step of converting to csv.
I have looked at answers online for similar questions. So far I have tried setting "text_factory = str", and I have also tried changing the column's datatype from TEXT to BLOB using sqlite manager, but I still get the error.
My code below results in the OperationalError: Could not decode to UTF-8
conn = sqlite3.connect('test.db')
conn.text_factory = str
curr = conn.cursor()
curr.execute('''SELECT xml_dump FROM hands_1 LIMIT 5000 , 5001''')
row = curr.fetchone()
All the records above 5000 in the database have this character problem and hence produce the error.
Any help appreciated.
Python is trying to be helpful by converting pieces of text (stored as bytes in a database) into a python str object for you. In order to do this conversion, python has to guess what letter each byte (or group of bytes) returned by your query represents. The default guess is an encoding called utf-8. Obviously, this guess is wrong in your case.
The solution is to give python a little hint as to how to do the mapping from bytes to letters (i.e., unicode characters). You've already come close with the line
conn.text_factory = str
However (based on your response in the comments above), since you are using python 3, str is the default text factory, so that line will do nothing new for you (see the docs).
What happens behind the scenes with this line is that python tries to convert the bytes returned by the query using the str function, kind of like:
your_string = str(the_bytes, 'utf-8') # actually uses `conn.text_factory`, not `str`
...but you want a different encoding where 'utf-8' is. Since you can't change the default encoding of the str function, you will have to mimic it some other way. You can use a one-off nameless function called a lambda for this:
conn.text_factory = lambda x: str(x, 'latin1')
Now when the database is handing the bytes to python, python will try to map them to letters using the 'latin1' scheme instead of the 'utf-8' scheme. Of course, I don't know if latin1 is the correct encoding of your data. Realistically, you will have to try a handful of encodings to find the right one. I would try the following first:
'iso-8859-1'
'utf-16'
'utf-32'
'latin1'
You can find a more complete list here.
Another option is to simply let the bytes coming out of the database remain as bytes. Whether this is a good idea for you depends on your application. You can do it by setting:
conn.text_factory = bytes
If the text in the database is actually mostly encoded in UTF-8, but you're still seeing this error (Could not decode to UTF-8), then the problem may be that one or more rows have bogus data that is not valid UTF-8. By default, Python's decode() function throws an exception when it sees text like that. If you are in this situation and want to simply ignore these errors, you can set up a text_factory like this:
conn = sqlite3.connect('my-database.db')
conn.text_factory = lambda b: b.decode(errors = 'ignore')
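The three text_factory options can be tried against a throwaway in-memory database. This is only a sketch under assumptions: the table name hands and the sample value are made up, and the CAST trick is used here to simulate an external program that wrote Latin-1 bytes into a TEXT column. Note also that in Python 'latin1' and 'iso-8859-1' are aliases for the same codec, so only one of them needs to be tried:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hands (xml_dump TEXT)")

# CAST stores the raw bytes as TEXT without validating them, which mimics
# an external program writing Latin-1 data (0xA3 is the pound sign there).
conn.execute("INSERT INTO hands VALUES (CAST(? AS TEXT))", (b"\xa3100",))


def fetch_with(factory):
    """Read the single stored value using the given text_factory."""
    conn.text_factory = factory
    return conn.execute("SELECT xml_dump FROM hands").fetchone()[0]


# Decode with an explicit 8-bit encoding.
as_latin1 = fetch_with(lambda b: b.decode("latin-1"))

# Leave the value as raw bytes and decide later.
as_bytes = fetch_with(bytes)

# Decode as UTF-8 but silently drop invalid bytes (the pound sign is lost).
ignoring = fetch_with(lambda b: b.decode("utf-8", errors="ignore"))

print(as_latin1, as_bytes, ignoring)
```

Comparing the three results on a known bad row is a quick way to see which encoding guess preserves the data and which one quietly drops characters.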

Bytes string in Python

Would you know by any chance how to get rid of the bytes identifier in front of a string in a Python list? Perhaps there is some global setting that can be amended?
I retrieve a query result from Postgres 9.3 and create a list from it. It looks like Python 3.3 interprets records in columns of type char(4) as if they were byte strings, for example:
Funds[1][1]
b'FND3'
Funds[1][1].__class__
<class 'bytes'>
So the implication is:
Funds[1][1]=='FND3'
False
I have some control over that database so I could change the column type to varchar(4), and it works well:
Funds[1][1]=='FND3'
True
But this is only a temporary solution.
The little b has made my life a nightmare for the last two days ;), and I would appreciate your help with this problem.
Thanks and Regards
Peter
You have to either manually implement __str__/__repr__ or, if you're willing to take the risk, do some sort of Regex-replace over the string.
Example __repr__:
def stringify(lst):
    return "[{}]".format(", ".join(repr(x)[1:] if isinstance(x, bytes) else repr(x) for x in lst))
The b isn't part of the string, any more than the quotes around it are; they're just part of the representation when you print the string out. So, you're chasing the wrong problem, one that doesn't exist.
The problem is that the byte string b'FND3' is not the same thing as the string 'FND3'. In this particular example, that may seem silly, but if you might ever have any non-ASCII characters anywhere, it stops being silly.
For example, the string 'é' is the same as the byte string b'\xe9' in Latin-1, and it's also the same as the byte string b'\xc3\xa9' in UTF-8. And of course b'\xc3\xa9' is the same as the string 'Ã©' in Latin-1.
So, you have to be explicit about what encoding you're using:
Funds[1][1].decode('utf-8')=='FND3'
But why is PostgreSQL returning you byte strings? Well, that's what a char column is. It's up to the Python bindings to decide what to do with them. And without knowing which of the multiple PostgreSQL bindings you're using, and which version, it's impossible to tell you what to do. But, for example, in recent-ish psycopg, you just have to set an encoding in the connection (e.g., conn.set_client_encoding('UTF-8')); in older versions you had to register a standard typecaster and do some more stuff; in py-postgresql you have to register lambda s: s.decode('utf-8'); etc.
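The bytes-versus-text distinction can be checked without a database. In this sketch the value b"FND3" stands in for what the driver returned, and the variable names are illustrative:

```python
# Stand-in for the value the driver returned for the char(4) column.
raw = b"FND3"

# In Python 3 a bytes object never compares equal to a str.
same_before = raw == "FND3"

# Decoding with an explicit encoding gives a real text string.
text = raw.decode("utf-8")
same_after = text == "FND3"

# The same character maps to different bytes in different encodings...
e_latin1 = "é".encode("latin-1")
e_utf8 = "é".encode("utf-8")

# ...and decoding with the wrong encoding yields mojibake, not an error.
mojibake = e_utf8.decode("latin-1")

print(same_before, same_after, e_latin1, e_utf8, mojibake)
```

For pure-ASCII values such as fund codes any common encoding gives the same result, but naming the encoding explicitly keeps the code correct once non-ASCII data appears.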
