pandas turns MySQL UTF-8 into ASCII - Python

I have a MySQL table with utf8_general_ci collation, but when I load it into a pandas DataFrame I get the error:
UnicodeEncodeError: 'ascii' codec can't encode character '\xe4' in position 66: ordinal not in range(128)
args = ('ascii', ' t...obile Android 1.0 0.0 0.0 ', 66, 67, 'ordinal not in range(128)')
encoding = 'ascii'
This happens in a row where the character ä appears in a varchar(255) field.
Why is the data being converted to ASCII, and how can I fix this?

If you are using Python 2.7, put this at the beginning of your code:
import sys
reload(sys)
sys.setdefaultencoding('utf8')
That should do it.
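A usage sketch of that hack in the pandas context (Python 2.7 only; the connection details and table name below are placeholders, not from the question):
import sys
reload(sys)                      # re-expose setdefaultencoding, which is removed at startup
sys.setdefaultencoding('utf8')

import MySQLdb
import pandas as pd

# The charset/use_unicode connection arguments are also worth checking,
# so that the driver hands pandas unicode objects instead of raw bytes.
conn = MySQLdb.connect(host='localhost', user='root', passwd='secret',
                       db='mydb', charset='utf8', use_unicode=True)
df = pd.read_sql('SELECT * FROM mytable', conn)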

Related

Syntax Error: 'charmap' codec can't decode byte

I configured the encoding as cp1252 to handle sequences like "~ '^". When I put coding: utf-8 instead, the strings with accents come out wrong, and I have already tested "latin-1" with the same error.
If I change the encoding back to "cp1252" I can open the file normally, but after I close it and open it again later, the error message comes back.
Does anyone know how to solve the following error? It appears every time I run the main script file:
SyntaxError: 'charmap' codec can't decode byte 0x81 in position 45: character maps to <undefined>
# coding: cp1252
import sys
import os
from PyQt4.QtCore import *
from PyQt4.QtGui import *
from ytconsole import *

self.msgt2 = QMessageBox()
self.msgt2.setIcon(QMessageBox.Information)
self.msgt2.setWindowTitle('Python Youtube Downloader')  # line 353: this is where the error points
self.msgt2.setText("{} arquivos MP4 foram baixados, salvos na Área de Trabalho".format(int(l)))

Python 2.7: 'ascii' codec can't encode character u'\xe9' error while writing to a file

I know this question has been asked various times, but somehow I am not getting results.
I am fetching data from the web which contains the string Elzéar. When writing it to a CSV file, it gives the error mentioned in the question title.
While producing the data I did the following:
address = str(address).strip()
address = address.encode('utf8')
return name+','+address+','+city+','+state+','+phone+','+fax+','+pumps+','+parking+','+general+','+entertainment+','+fuel+','+resturants+','+services+','+technology+','+fuel_cards+','+credit_cards+','+permits+','+money_services+','+security+','+medical+','+longit+','+latit
and I am writing it as:
with open('records.csv', 'a') as csv_file:
    print(type(data))  # prints <type 'unicode'>
    data = data.encode('utf8')
    csv_file.write(id+','+data+'\n')
status = 'OK'
the_file.write(ts+'\t'+url+'\t'+status+'\n')
This generates the error:
'ascii' codec can't encode character u'\xe9' in position 55: ordinal not in range(128)
You could try something like this (Python 2.7):
#! /usr/bin/env python
# -*- coding: utf-8 -*-
import codecs
...
with codecs.open('records.csv', 'a', encoding="utf8") as csv_file:
    print(type(data))  # prints <type 'unicode'>
    # no need to encode here, because data is unicode
    csv_file.write(unicode(id)+u','+data+u'\n')
status = u'OK'
the_file.write(unicode(ts, encoding="utf8")+u'\t'+unicode(url, encoding="utf8")+u'\t'+status+u'\n')
The main idea is to work with unicode as much as possible and only produce str when outputting (it is better not to operate on str directly).
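A minimal sketch of that idea (the field names here are hypothetical, not taken from the question's code): decode bytes as soon as they enter the program, work only with unicode internally, and encode once at the output boundary.
def build_row(name, address):
    # decode byte strings coming from the scraper as soon as they arrive
    if isinstance(address, str):
        address = address.decode('utf-8')
    if isinstance(name, str):
        name = name.decode('utf-8')
    # operate only on unicode objects inside the program
    row = u','.join([name, address.strip()])
    # encode exactly once, when writing out
    return row.encode('utf-8')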

Python output replaces non ASCII characters with �

I am using Python 2.7 to read data from a MySQL table. In MySQL the name looks like this:
Garasa, Ángel.
But when I print it in Python the output is
Garasa, �ngel
The character set name in MySQL is utf8.
This is my Python code:
# coding: utf-8
import MySQLdb

connection = MySQLdb.connect(host="localhost", user="root", passwd="root", db="jmdb")
cursor = connection.cursor()
cursor.execute("select * from actors where actorid=672462;")
data = cursor.fetchall()
for row in data:
    print "IMDB Name=", row[4]
    wiki = "".join(row[4])
    print wiki
I have tried decoding it, but I get errors such as:
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc1 in position 8: invalid start byte
I have read about decoding and UTF-8 but couldn't find a solution.
Get the MySQL driver to return Unicode strings instead. This means you don't have to deal with decoding in your own code.
Simply set use_unicode=True in the connection parameters. If the table has been set up with a specific encoding, set the charset argument accordingly.
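A minimal sketch of that connection, reusing the parameters from the question; the charset value is an assumption and should match the table's real encoding:
import MySQLdb

connection = MySQLdb.connect(host="localhost", user="root", passwd="root",
                             db="jmdb",
                             use_unicode=True,  # driver returns unicode objects
                             charset="utf8")    # decode with the table's character set
cursor = connection.cursor()
cursor.execute("select * from actors where actorid=672462;")
for row in cursor.fetchall():
    print row[4]  # already unicode, no manual decoding needed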
I think the right character encoding in your case is cp1252:
>>> s = 'Garasa, Ángel.'
>>> s.decode('utf-8')
Traceback (most recent call last):
  File "<pyshell#63>", line 1, in <module>
    s.decode('utf-8')
  File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc1 in position 8: invalid start byte
>>> s.decode('cp1252')
u'Garasa, \xc1ngel.'
>>> print s.decode('cp1252')
Garasa, Ángel.
EDIT: It could also be latin-1:
>>> s.decode('latin-1')
u'Garasa, \xc1ngel.'
>>> print s.decode('latin-1')
Garasa, Ángel.
The cp1252 and latin-1 code pages coincide for all codes except the range 128 to 159.
Quoting from this source (latin-1):
The Windows-1252 codepage coincides with ISO-8859-1 for all codes
except the range 128 to 159 (hex 80 to 9F), where the little-used C1
controls are replaced with additional characters including all the
missing characters provided by ISO-8859-15
And this one (cp1252):
This character encoding is a superset of ISO 8859-1, but differs from
the IANA's ISO-8859-1 by using displayable characters rather than
control characters in the 80 to 9F (hex) range.
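If that diagnosis is right, a short sketch of how to apply it in the question's loop (assuming row[4] is the mis-encoded column) would be:
for row in data:
    raw = row[4]                   # str (bytes) as returned by MySQLdb
    name = raw.decode('cp1252')    # or 'latin-1', per the analysis above
    print name.encode('utf-8')     # encode for a UTF-8 terminal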

UnicodeDecodeError: 'ascii' codec can't decode byte

I have the following code:
# -*- coding: utf-8 -*-
import splinter
import urllib
browser = splinter.Browser('firefox')
miss = ("rúin",)
for i in miss:
    browser.visit(link)
    browser.fill('word', i)
Which gives me the error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)
How can I resolve this issue?
Use an actual unicode value:
miss = (u"rúin",)
Note the u before the string literal.
Otherwise, Python will try to coerce the bytestring to unicode implicitly, using the default codec (ASCII), which fails as soon as it meets a non-ASCII byte.
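A quick demonstration of that implicit coercion (Python 2; the bytes are the UTF-8 encoding of "rúin" from the question):
>>> 'r\xc3\xbain' + u''      # bytestring mixed with unicode -> implicit ASCII decode
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)
>>> u'r\xfain' + u''         # a real unicode literal is fine
u'r\xfain'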

Spanish text in .py files

This is the code:
A = "Diga sí por cualquier número de otro cuidador.".encode("utf-8")
I get this error:
'ascii' codec can't decode byte 0xed in position 6: ordinal not in range(128)
I tried numerous encodings unsuccessfully.
Edit:
I already have this at the beginning
# -*- coding: utf-8 -*-
Changing to
A = u"Diga sí por cualquier número de otro cuidador.".encode("utf-8")
doesn't help
Are you using Python 2?
In Python 2, that string literal is a bytestring. You're trying to encode it, but you can encode only a Unicode string, so Python will first try to decode the bytestring to a Unicode string using the default "ascii" encoding.
Unfortunately, your string contains non-ASCII characters, so it can't be decoded to Unicode.
The best solution is to use a Unicode string literal, like this:
A = u"Diga sí por cualquier número de otro cuidador.".encode("utf-8")
The error message
'ascii' codec can't decode byte 0xed in position 6: ordinal not in range(128)
says that the 7th byte is 0xed. That is either the first byte of a UTF-8 sequence for some high-ordinal (perhaps CJK) Unicode character, which is not consistent with the facts reported, or it is your i-acute encoded in latin1 or cp1252. I'm betting on cp1252.
If your file were encoded in UTF-8, the offending byte would not be 0xed but 0xc3:
Preliminaries:
>>> import unicodedata
>>> unicodedata.name(u'\xed')
'LATIN SMALL LETTER I WITH ACUTE'
>>> uc = u'Diga s\xed por'
What happens if file is encoded in UTF-8:
>>> infile = uc.encode('utf8')
>>> infile
'Diga s\xc3\xad por'
>>> infile.encode('utf8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 6: ordinal not in range(128)
#### NOT the message reported in the question ####
What happens if file is encoded in cp1252 or latin1 or similar:
>>> infile = uc.encode('cp1252')
>>> infile
'Diga s\xed por'
>>> infile.encode('utf8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xed in position 6: ordinal not in range(128)
#### As reported in the question ####
Having # -*- coding: utf-8 -*- at the start of your code does not magically ensure that your file is encoded in UTF-8 -- that's up to you and your text editor.
Actions:
1. Save your file as UTF-8.
2. As suggested by others, use unicode literals: u'blah blah'.
3. Put this on the first line of your code:
# -*- coding: utf-8 -*-
You should specify your source file's encoding by adding the following line to the very beginning of your code (assuming that your file is encoded in UTF-8):
# Encoding: UTF-8
Otherwise, Python will assume an ASCII encoding and fail during parsing.
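For example (a minimal sketch; the file name in the error text is hypothetical), with the declaration on the first line the script parses and runs, and without it Python 2 stops before executing anything:
# Encoding: UTF-8
# Remove the line above and Python 2 aborts at parse time with an error like:
#   SyntaxError: Non-ASCII character '\xc3' in file example.py on line 4, but no
#   encoding declared; see http://python.org/dev/peps/pep-0263/ for details
A = u"Diga sí por cualquier número de otro cuidador."
print A.encode("utf-8")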
You are probably operating on a normal string, not a unicode string:
>>> type(u"zażółć gęślą jaźń")
<type 'unicode'>
>>> type("zażółć gęślą jaźń")
<type 'str'>
so
u"Diga sí por cualquier número de otro cuidador.".encode("utf-8")
should work.
If you want to use non-ASCII characters in your string literals, put
# -*- coding: utf-8 -*-
on the first line of your script. (To make string literals unicode by default you would additionally need from __future__ import unicode_literals.)
See also the docs.
P.S. The text in the examples above is Polish :)
In the first or second line of your code, type the comment:
# -*- coding: latin-1 -*-
For a list of symbols supported see:
http://en.wikipedia.org/wiki/Latin-1_Supplement_%28Unicode_block%29
And the languages covered: http://en.wikipedia.org/wiki/ISO_8859-1
Maybe this is what you want to do:
A = 'Diga sí por cualquier número de otro cuidador'.decode('latin-1')
And don't forget to add # -*- coding: latin-1 -*- at the beginning of your code.
