I have a problem with a string changing after I add it to the database.
I have the string "space 222 m²". If I write it to MySQL via the MySQLdb module I get "space 222 m²" in the table, which is fine. But when I read that value back from the table, after decoding I get something like "space 222 m\eb000\b1111", which is not "space 222 m²".
Before being added to the database, the string as unicode looks like "space 222 m\xcb", and it prints correctly; the string coming back from the database is displayed with escape codes instead and consequently causes an error.
MySQL charset - utf-8
Database collation - utf8_general_ci
Source string - utf-8
I also have a problem combining a string that contains special characters with one that does not:
# the db is MongoDB
st = db.objects.find()[0]['value']
string = st.encode('utf-8')  # may or may not contain m²; the encoding is identical either way
some_string = u"some"
x = "%s %s" % (string, some_string)
If the string contains no special symbols, everything is fine; but if it does contain special symbols, I get a UnicodeDecodeError.
Python version:
Python 2.7.10 (default, May 23 2015, 09:40:32) [MSC v.1500 32 bit (Intel)] on win32
a note on encodings: UTF-8 and the various ISO-8859 character sets are not interchangeable, so keep that in mind when sending data from your UI to the DB. Have a look at localization and character encodings/sets; this will help you a lot in understanding Unicode vs. ASCII.
I don't know the exact mappings of your strings, but to answer your question, try get_string().encode('utf-8')
and get_string().decode('utf-8')
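In the MongoDB snippet above, the underlying issue is that Python 2 implicitly decodes byte strings as ASCII the moment they are mixed with unicode objects. A minimal sketch of the failure and the fix, assuming the stored value is UTF-8 (the variable names mirror the question):
# -*- coding: utf-8 -*-
string = u"space 222 m²".encode('utf-8')  # a UTF-8 byte string, as in the question
some_string = u"some"
# x = "%s %s" % (string, some_string)     # implicit ASCII decode of `string` -> UnicodeDecodeError
x = u"%s %s" % (string.decode('utf-8'), some_string)  # decode to unicode first, then combine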
Related
What is the use case of encode/decode?
My understanding was that encode is used to convert a string into a byte string in order to pass non-ASCII data around a program, and decode converts that byte string back into a string.
But the following example shows non-ASCII characters getting printed successfully even without being encoded/decoded. Example:
val1="À È Ì Ò Ù Ỳ Ǹ Ẁ"
val2 = val1
print('val1 is: ',val2)
encoded_val1=val1.encode()
print('encoded_val1 is: ',encoded_val1)
decoded_encoded_val1=encoded_val1.decode()
print('decoded_encoded_val1 is: ',decoded_encoded_val1)
Output:
val1 is:  À È Ì Ò Ù Ỳ Ǹ Ẁ
encoded_val1 is:  b'\xc3\x80 \xc3\x88 \xc3\x8c \xc3\x92 \xc3\x99 \xe1\xbb\xb2 \xc7\xb8 \xe1\xba\x80'
decoded_encoded_val1 is:  À È Ì Ò Ù Ỳ Ǹ Ẁ
So what is the use case of encode and decode in python?
The environment you are working in may support those characters, and your terminal (or whatever you use to view output) may support displaying them. Some terminals/command lines or text editors do not. Apart from display issues, here are some actual reasons and examples:
1- When you transfer data over the internet/network (e.g. with a socket), information is transferred as raw bytes. Non-ASCII characters cannot be represented by a single byte, so we need a special representation for them (UTF-16, or UTF-8 with more than one byte per character). This is the most common reason I have encountered.
2- Some text editors only support UTF-8. For example, you need to represent your Ẁ character in UTF-8 format in order to work with them. The reason is that when dealing with text, people mostly used ASCII characters, which are just one byte each. When some systems needed to handle non-ASCII characters, people converted them to UTF-8. Someone with more in-depth knowledge of text editors may give a better explanation of this point.
3- You may have a text with some Chinese/Russian letters in it that, for some reason, you store on a remote Linux server that does not support letters from those languages. You need to convert your text to a strict format (UTF-8 or UTF-16) and store it on the server so you can recover it later.
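As a concrete illustration of point 1, here is a small Python 3 sketch; the file stands in for a socket, and its name is made up. Raw bytes are what actually travel, so the text is encoded on the way out and decoded on the way back in:
text = "À È Ì Ò Ù Ỳ Ǹ Ẁ"
with open("payload.bin", "wb") as f:      # binary mode accepts only bytes, like a socket
    f.write(text.encode("utf-8"))         # str -> bytes
with open("payload.bin", "rb") as f:
    restored = f.read().decode("utf-8")   # bytes -> str
assert restored == text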
Here is a little explanation of the UTF-8 format. There are also other articles about the topic if you are interested.
Use utf-8 encoding because it's universal.
Set your code editor to UTF-8 encoding and put this at the top of all your Python files: # coding: utf8
When you get input (a file, a string...), it can have a different encoding, so you have to find out its encoding and decode it. In an HTML file, for example, the encoding is declared in a meta tag.
If you change something in the HTML file and want to save it or send it over the network, you have to encode it back into the encoding it had just before.
Always use unicode for your strings in Python. (Automatic in Python 3, but in Python 2.7 use the u prefix, like u'Hi'.)
$ python2.7
Python 2.7.3 (default, Aug 1 2012, 05:14:39)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> type('this is a string') # bits => encoded
<type 'str'>
>>> type(u'this is a string') # unicode => decoded
<type 'unicode'>
$ python3
Python 3.2.3 (default, Oct 19 2012, 20:10:41)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> type("this is a string") # unicode => decoded
<class 'str'>
>>> type(b"this is a string") # bits => encoded
<class 'bytes'>
1 Use UTF-8. Now. Everywhere.
2 In your code, specify the file encoding and declare your strings as unicode.
3 On input, know the encoding of your data, and decode it with decode().
4 On output, encode into the encoding expected by the system that will receive the data, or, if you cannot know it, into UTF-8, with encode().
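Put together, these four rules form the usual "unicode sandwich": decode at the edges, work in unicode inside. A minimal Python 2 sketch, with placeholder file names and the input file assumed to be UTF-8:
# coding: utf8
import codecs
# 3. on input: decode
with codecs.open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()                       # a unicode object
# 2. inside the program: unicode everywhere
report = u'prices in \u20ac\n' + text
# 4. on output: encode
with open('output.txt', 'wb') as f:
    f.write(report.encode('utf-8'))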
I'm getting results I didn't expect from decoding b'\x92' with the latin1 codec. See the session below:
Python 3.5.2 (v3.5.2:4def2a2901a5, Jun 25 2016, 22:01:18) [MSC v.1900 32 bit (Intel)] on win32
>>> b'\xa3'.decode('latin1').encode('ascii', 'namereplace')
b'\\N{POUND SIGN}'
>>> b'\x92'.decode('latin1').encode('ascii', 'namereplace')
b'\\x92'
>>> ord(b'\x92'.decode('latin1'))
146
The result of decoding b'\xa3' gave me exactly what I was expecting, but the two results for b'\x92' did not. I was expecting b'\x92'.decode('latin1') to result in U+2018, but it seems to return U+0092.
What am I missing?
The error I made was to expect that the character 0x92 decodes to "RIGHT SINGLE QUOTATION MARK" in latin-1; it doesn't. The confusion was caused by the character appearing in a file that was specified as being in latin1 encoding. It now appears that the file was actually encoded in windows-1252. This is apparently a common source of confusion:
http://www.i18nqa.com/debug/table-iso8859-1-vs-windows-1252.html
If the character is decoded with the correct encoding, then the expected result is obtained.
>>> b'\x92'.decode('windows-1252').encode('ascii', 'namereplace')
b'\\N{RIGHT SINGLE QUOTATION MARK}'
I was expecting b'\x92'.decode('latin1') to result in U+2018
latin1 is an alias for ISO-8859-1. In that encoding, byte 0x92 maps to character U+0092, an unprintable control character.
The encoding you might have really meant is windows-1252, the Microsoft Western code page based on it. In that encoding, 0x92 is U+2019 which is close...
(Further befuddlement arises because for historical reasons web browsers are also confused between the two. When a web page is served as charset=iso-8859-1, web browsers actually use windows-1252.)
I just want to make clear that you're not encoding anything here.
'\xa3' has an ordinal value of 163 (0xa3 in hexadecimal). Since that ordinal does not fit in seven bits, it can't be encoded into ASCII. Your error handler just replaces the Unicode character with the name of the character. The Unicode character at code point 163 is £ (POUND SIGN).
'\x92', on the other hand, has an ordinal value of 146. According to this Wikipedia article, the character isn't printable: it's a privately used control code in the C1 controls block. That is why namereplace falls back to the literal '\\x92' escape: the character has no name.
As an aside, if you need the name of a character, it's much better to do it like this:
import unicodedata
print unicodedata.name(u'\xa3')  # POUND SIGN
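For a character that has no name, such as the U+0092 control code above, unicodedata.name() raises ValueError instead, which is consistent with namereplace falling back to the bare escape:
import unicodedata
try:
    unicodedata.name(u'\x92')
except ValueError as e:
    print e  # no such name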
So. This issue is almost exactly the same as the one discussed here -- but the fix (such as it is) discussed in that post doesn't fix things for me.
I'm trying to use Python 2.7.5 and pyodbc 3.0.7 to connect from an Ubuntu 12.04 64bit machine to an IBM Netezza Database. I'm using unixODBC to handle specifying a DSN. This DSN works beautifully from the isql CLI -- so I know it's configured correctly, and unixODBC is ticking right along.
The code is currently dead simple, and easy to reproduce in a REPL:
In [1]: import pyodbc
In [2]: conn = pyodbc.connect(dsn='NZSQL')
In [3]: curs = conn.cursor()
In [4]: curs.execute("SELECT * FROM DB..FOO ORDER BY created_on DESC LIMIT 10")
Out[4]: <pyodbc.Cursor at 0x1a70ab0>
In [5]: curs.fetchall()
---------------------------------------------------------------------------
InvalidOperation Traceback (most recent call last)
<ipython-input-5-ad813e4432e9> in <module>()
----> 1 curs.fetchall()
/usr/lib/python2.7/decimal.pyc in __new__(cls, value, context)
546 context = getcontext()
547 return context._raise_error(ConversionSyntax,
--> 548 "Invalid literal for Decimal: %r" % value)
549
550 if m.group('sign') == "-":
/usr/lib/python2.7/decimal.pyc in _raise_error(self, condition, explanation, *args)
3864 # Errors should only be risked on copies of the context
3865 # self._ignored_flags = []
-> 3866 raise error(explanation)
3867
3868 def _ignore_all_flags(self):
InvalidOperation: Invalid literal for Decimal: u''
So I get a connection, the query returns correctly, and then when I try to get a row... asplode.
Anybody ever managed to do this?
Turns out pyodbc can't gracefully convert all of Netezza's types. The table I was working with had two that are problematic:
A column of type NUMERIC(7,2)
A column of type NVARCHAR(255)
The NUMERIC column causes a decimal conversion error on NULLs. The NVARCHAR column comes back as a UTF-16-LE encoded string, which is a pain in the ass.
I haven't found a good driver-or-wrapper-level solution yet. This can be hacked around by casting types in the SQL statement:
SELECT
foo::FLOAT AS was_numeric
, bar::VARCHAR(255) AS was_nvarchar
I'll post here if I find a lower-level answer.
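Applied to the session above, the hack looks something like this (the table and column names follow the earlier snippets; that the driver copes cleanly with the resulting FLOAT and VARCHAR values, NULLs included, is the assumption being made):
import pyodbc
conn = pyodbc.connect(dsn='NZSQL')
curs = conn.cursor()
# casting in SQL sidesteps the driver's NUMERIC and NVARCHAR handling
curs.execute("""
    SELECT foo::FLOAT AS was_numeric
         , bar::VARCHAR(255) AS was_nvarchar
    FROM DB..FOO
    ORDER BY created_on DESC
    LIMIT 10
""")
rows = curs.fetchall()  # no Decimal conversion error this time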
I've just encountered the same problem and found a different solution.
I managed to solve the issue by:
Making sure the following attributes are part of my driver options in the odbc ini file:
UnicodeTranslationOption = utf16
CharacterTranslationOption = all
Adding the following environment variables:
LD_LIBRARY_PATH=$LD_LIBRARY_PATH:[NETEZZA_LIB_FILES_PATH]
ODBCINI=[ODBC_INI_FULL_PATH]
NZ_ODBC_INI_PATH=[ODBC_INI_FOLDER_PATH]
In my case the values are:
LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/nz/lib
ODBCINI=/etc/odbc.ini
NZ_ODBC_INI_PATH=/etc
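For reference, the resulting DSN entry in /etc/odbc.ini might look roughly like this (the driver path, server, port and database are illustrative defaults, not values taken from the question):
[NZSQL]
Driver                     = /usr/local/nz/lib/libnzodbc.so
Servername                 = myserver
Port                       = 5480
Database                   = mydb
UnicodeTranslationOption   = utf16
CharacterTranslationOption = all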
I'm using CentOS 6 and also installed both the 'unixODBC' and 'unixODBC-devel' packages.
Hope it helps someone.
I'm not sure what your error is, but the code below is allowing me to connect to Netezza via ODBC:
# Connect via pyodbc to listed data sources on machine
import pyodbc
print pyodbc.dataSources()
print "Connecting via ODBC"
# get a connection, if a connect cannot be made an exception will be raised here
conn = pyodbc.connect("DRIVER={NetezzaSQL};SERVER=<myserver>;PORT=<myport>;DATABASE=<mydbschema>;UID=<user>;PWD=<password>;")
print "Connected!\n"
# you can then use conn cursor to perform queries
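For example (the table name is a placeholder):
curs = conn.cursor()
curs.execute("SELECT COUNT(*) FROM some_table")
print curs.fetchone()[0]
curs.close()
conn.close()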
The Netezza Linux client package includes /usr/local/nz/lib/ODBC_README, which lists all the values for those two attributes:
UnicodeTranslationOption:
Specify translation option for Unicode.
Possible value strings are:
utf8 : unicode data is in utf-8 encoding
utf16 : unicode data is in utf-16 encoding
utf32 : unicode data is in utf-32 encoding
Do not add '-' in the value strings listed above, e.g. "utf-8" is not
a valid string value. These value strings are case insensitive.
On windows this option is not available as windows DM always passes
unicode data in utf16.
CharacterTranslationOption ("Optimize for ASCII character set" on Windows):
Specify translation option for character encodings.
Possible value strings are:
all : Support all character encodings
latin9 : Support Latin9 character encoding only
Do not add '-' in the value strings listed above, e.g. "latin-9"
is not a valid string value. These value strings are case
insensitive.
NPS uses the Latin9 character encoding for char and varchar
datatypes. The character encoding on many Windows systems
is similar, but not identical to this. For the ASCII subset
(letters a-z, A-Z, numbers 0-9 and punctuation marks) they are
identical. If your character data in CHAR or VARCHAR datatypes is
only in this ASCII subset then it will be faster if this box is
checked. If your data has special characters such as the Euro
sign (€) then keep the box unchecked to accurately convert
the encoding. Characters in the NCHAR or NVARCHAR data types
will always be converted appropriately.
Hi, I am developing an application in Python with SQLAlchemy and MySQL 5.1.58-1ubuntu1. I can get data from the db without problems, except that I cannot read non-ASCII characters like è, ò or the euro symbol; instead of the euro I get
\u20ac
This is how I create the engine for SQLAlchemy:
dbfile = "root:#########localhost/parafarmacie"
engine = create_engine("mysql+mysqldb://" + dbfile + "?charset=utf8&use_unicode=0")
All my columns that work with text are declared as Unicode. I googled for days but without any luck; could someone tell me where my mistake is?
Thanks in advance
When you get your unicode objects from the database, before you output them, you need to encode them:
my_unicode_object.encode("utf-8")
What you are seeing now is the raw repr of the unicode object, which shows you the code points (since the object hasn't been converted to bytes yet) :-)
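A minimal sketch of the difference:
# -*- coding: utf-8 -*-
euro = u'\u20ac'             # what the database hands back: a unicode object
print repr(euro)             # shows u'\u20ac', the escaped form you are seeing
print euro.encode('utf-8')   # prints the actual € on a UTF-8 terminal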
I am adding UTF-8 data to a database in Django.
As the data goes into the database, everything looks fine: the characters (for example) “Hello” are UTF-8 encoded.
My MySQL database is UTF-8 encoded. When I examine the data from the DB by doing a select, my example string looks like this: ?Hello?. I assume this is showing the characters as UTF-8 encoded.
When I select the data from the database in the terminal, or export it as a web service, however, my string looks like this: \u201cHello World\u201d.
Does anyone know how I can display my characters correctly?
Do I need to perform some additional UTF-8 encoding somewhere?
Thanks,
Nick.
u'\u201cHello World\u201d'
is the correct Python representation of the Unicode text “Hello World”. The smart-quote characters are displayed using \uXXXX hex escapes rather than verbatim because there are often problems writing Unicode characters to the terminal, particularly on Windows. (It looks like MySQL tried to write them to the terminal but failed, resulting in the ? placeholders.)
On a terminal that does manage to correctly input and output Unicode characters, you can confirm that they're the same thing:
Python 2.6.5 (r265:79063, Apr 16 2010, 13:57:41) [GCC 4.4.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> u'\u201cHello World\u201d'==u'“Hello World”'
True
Just as for byte strings, \x sequences are the same as the characters they denote:
>>> '\x61'=='a'
True
Now if you've got \u or \x sequences escaping Python and making their way into an exported file, then you've done something wrong with the export. Perhaps you used repr() somewhere by mistake.
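To make the export point concrete, compare the two ways of writing the text out (the file names are made up):
import codecs
s = u'\u201cHello World\u201d'
# right: encode the text (here codecs does it) when writing it out
with codecs.open('export.txt', 'w', encoding='utf-8') as f:
    f.write(s)        # the file contains the actual smart-quoted text
# wrong: repr() bakes the escapes into the file as literal characters
with open('export_bad.txt', 'w') as f:
    f.write(repr(s))  # the file contains: u'\u201cHello World\u201d'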