Importing file with unknown encoding from Python into MongoDB

Working on importing a tab-delimited file over HTTP in Python.
Before inserting a row's data into MongoDB, I'm removing slashes, ticks and quotes from the string.
Whatever the encoding of the data is, MongoDB is throwing me the exception:
bson.errors.InvalidStringData: strings in documents must be valid UTF-8
From what I've read, I should convert the row's data to Unicode as early as possible, using the unicode() function. I have also tried calling decode() with "unicode" as the first parameter, but I get the error:
LookupError: unknown encoding: unicode
From there, I can make my string manipulations such as replacing the slashes, ticks, and quotes. Then before inserting the data into MongoDB, convert it to UTF-8 using the str.encode('utf-8') function.
Problem: When converting to Unicode, I am receiving the error
UnicodeDecodeError: 'ascii' codec can't decode byte 0x93 in position 1258: ordinal not in range(128)
With this error, I'm not exactly sure where to continue.
My question is this: How do I successfully import the data from a file without knowing its encoding and successfully insert it into MongoDB, which requires UTF-8?
Thanks Much!

Try these in order:
(0) Check that your removal of the slashes/ticks/etc is not butchering the data. What's a tick? Please show your code. Please show a sample of the raw data ... use print repr(sample_raw_data) and copy/paste the output into an edit of your question.
(1) There's an old maxim: "If the encoding of a file is unknown, or stated to be ISO-8859-1, it is cp1252" ... where are you getting the file from? If it's coming from Western Europe, the Americas, or any English/French/Spanish-speaking country/territory elsewhere, and it's not valid UTF-8, then it's likely to be cp1252.
[Edit 2] Your error byte 0x93 decodes to U+201C LEFT DOUBLE QUOTATION MARK for all encodings cp1250 to cp1258 inclusive ... what language is the text written in? [/Edit 2]
(2) Save the file (before tick removal), then open the file in your browser: Does it look sensible? What do you see when you click on View / Character Encoding?
(3) Try chardet
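The maxim in (1) can be turned into a small fallback routine. This is only a sketch, written for Python 3 (where bytes and str take the roles of Python 2's str and unicode); the function name guess_decode and the sample bytes are made up for illustration, and a detector such as chardet would replace the hard-coded fallback if cp1252 is not a safe guess for your data.

```python
def guess_decode(raw_bytes):
    """Decode as UTF-8 when the bytes are valid UTF-8, else fall back to cp1252."""
    try:
        return raw_bytes.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # cp1252 leaves a few bytes undefined (0x81, for example);
        # errors="replace" maps those to U+FFFD instead of raising.
        return raw_bytes.decode("cp1252", errors="replace"), "cp1252"


# Hypothetical sample: 0x93 and 0x94 are the cp1252 curly double quotes.
sample = b"He said \x93hello\x94"
text, encoding = guess_decode(sample)
print(encoding, repr(text))
```

Trying UTF-8 first is reasonable because text in a legacy 8-bit encoding rarely happens to be valid UTF-8, while the reverse test would accept almost anything.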
Edit with some more advice:
Once you know what the encoding is (let's assume it's cp1252):
(1) convert your input data to unicode: uc = raw_data.decode('cp1252')
(2) process the data (remove slashes/ticks/etc) as unicode: clean_uc = manipulate(uc)
(3) you need to output your data encoded as utf8: to_mongo = clean_uc.encode('utf8')
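Put together, the three steps might look like the sketch below. It is written for Python 3 and assumes the input really is cp1252; manipulate() is a hypothetical stand-in for the question's cleanup code, and the sample bytes are invented.

```python
def manipulate(text):
    """Hypothetical cleanup: drop slashes, backticks, and straight or curly quotes."""
    for ch in ("/", "\\", "`", "'", '"', "\u201c", "\u201d"):
        text = text.replace(ch, "")
    return text


raw_data = b"caf\xe9/bar \x93quoted\x94"  # invented cp1252 input

uc = raw_data.decode("cp1252")       # (1) bytes -> text
clean_uc = manipulate(uc)            # (2) work on text only
to_mongo = clean_uc.encode("utf8")   # (3) text -> UTF-8 bytes

print(repr(clean_uc), to_mongo)
```

Note that the é, stored as the single byte 0xE9 in cp1252, comes out as the two bytes 0xC3 0xA9 in the UTF-8 result.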
Note 1: Your error message says "can't decode byte 0x93 in position 1258" ... 1258 bytes is a rather long chunk of text; is this reasonable? Have you had a look at the data that it is complaining about? How? what did you see?
Note 2: Please consider reading the Python Unicode HOWTO and this article

Related

Python UnicodeEncodeError when Outputting Parsed Data from a Webpage

I have a program that parses webpages and then writes the data out somewhere else. When I am writing the data, I get
"UnicodeEncodeError: 'ascii' codec can't encode characters in position
19-21: ordinal not in range(128)"
I am gathering the data using lxml.
name = apiTree.xpath("//boardgames/boardgame/name[@primary='true']")[0].text
worksheet.goog["Name"].append(name)
After reading http://effbot.org/pyfaq/what-does-unicodeerror-ascii-decoding-encoding-error-ordinal-not-in-range-128-mean.htm, I understand that I should keep all of my variables as unicode. This means I need to know what encoding the site is using.
My final line that actually writes the data out somewhere is:
wks.update_cell(row + 1, worksheet.goog[value + "_col"], (str(worksheet.goog[value][row])).encode('ascii', 'ignore'))
How would I incorporate using unicode assuming the encoding is UTF-8 on the way in and I want it to be ASCII on the way out?

Your error is caused by:
str(worksheet.goog[value][row])
Calling str on a unicode object implicitly encodes it with the ascii codec. What you should be doing is encoding to utf-8 explicitly:
worksheet.goog[value][row].encode("utf-8")
As for "How would I incorporate using unicode assuming the encoding is UTF-8 on the way in and I want it to be ASCII on the way out?": you can't. There is no ASCII representation of Latin characters such as ă, unless you are happy with the closest ASCII equivalent from something like Unidecode.
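If installing Unidecode is not an option, a rougher approximation is possible with the standard library alone. The sketch below (Python 3) is not Unidecode: it decomposes accented letters with unicodedata.normalize and then drops whatever is not ASCII. The helper name to_ascii is made up for illustration.

```python
import unicodedata


def to_ascii(text):
    """Strip accents via NFKD decomposition, then drop any remaining non-ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii("Br\u0103ila"))      # ă decomposes to a + combining breve
print(to_ascii("S\u00e3o Paulo"))   # ã decomposes to a + combining tilde
```

Characters with no decomposition, such as ß, are removed outright rather than transliterated, which is the main thing a dedicated library handles better.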

I think I may have figured my own problem out.
apiTree.xpath("//boardgames/boardgame/name[@primary='true']")[0].text
Actually defaults to unicode. So what I did was change this line to:
name = (apiTree.xpath("//boardgames/boardgame/name[@primary='true']")[0].text).encode('ascii', errors='ignore')
And I just output without changing anything:
wks.update_cell(row + 1, worksheet.goog[value + "_col"], worksheet.goog[value][row])
Due to the nature of the data, ASCII only is mostly fine. Although, I may be able to use UTF-8 and catch some extra characters...but this is not relevant to the question.
:)

UnicodeDecodeError: 'ascii' codec can't decode byte 0x92 in position 47: ordinal not in range(128)

I am trying to write data in a StringIO object using Python and then ultimately load this data into a postgres database using psycopg2's copy_from() function.
First when I did this, the copy_from() was throwing an error: ERROR: invalid byte sequence for encoding "UTF8": 0xc92 So I followed this question.
I figured out that my Postgres database has UTF8 encoding.
The file/StringIO object I am writing my data into shows its encoding as the following:
setgid Non-ISO extended-ASCII English text, with very long lines, with CRLF line terminators
I tried to encode every string that I am writing to the intermediate file/StringIO object into UTF8 format. To do this, I used .encode(encoding='UTF-8',errors='strict') on every string.
This is the error I got now:
UnicodeDecodeError: 'ascii' codec can't decode byte 0x92 in position 47: ordinal not in range(128)
What does it mean? How do I fix it?
EDIT:
I am using Python 2.7
Some pieces of my code:
I read from a MySQL database that has data encoded in UTF-8 as per MySQL Workbench.
These are a few lines of the code that writes my data (obtained from the MySQL db) to the StringIO object:
# Populate the table_data variable with rows delimited by \n and columns delimited by \t
row_num=0
for row in cursor.fetchall() :
    # Separate rows in a table by new line delimiter
    if(row_num!=0):
        table_data.write("\n")
    col_num=0
    for cell in row:
        # Separate cells in a row by tab delimiter
        if(col_num!=0):
            table_data.write("\t")
        table_data.write(cell.encode(encoding='UTF-8',errors='strict'))
        col_num = col_num+1
    row_num = row_num+1
This is the code that writes to Postgres database from my StringIO object table_data:
cursor = db_connection.cursor()
cursor.copy_from(table_data, <postgres_table_name>)

The problem is that you're calling encode on a str object.
A str is a byte string, usually representing text encoded in some way like UTF-8. When you call encode on that, it first has to be decoded back to text, so the text can be re-encoded. By default, Python does that by calling s.decode(sys.getdefaultencoding()), and getdefaultencoding() usually returns 'ascii'.
So, you're taking UTF-8 encoded text, decoding it as if it were ASCII, then re-encoding it in UTF-8.
The general solution is to explicitly call decode with the right encoding, instead of letting Python use the default, and then encode the result.
But when the right encoding is already the one you want, the easier solution is to just skip the .decode('utf-8').encode('utf-8') and just use the UTF-8 str as the UTF-8 str that it already is.
Or, alternatively, if your MySQL wrapper has a feature to let you specify an encoding and get back unicode values for CHAR/VARCHAR/TEXT columns instead of str values (e.g., in MySQLdb, you pass use_unicode=True to the connect call, or charset='UTF-8' if your database is too old to auto-detect it), just do that. Then you'll have unicode objects, and you can call .encode('utf-8') on them.
In general, the best way to deal with Unicode problems is the last one—decode everything as early as possible, do all the processing in Unicode, and then encode as late as possible. But either way, you have to be consistent. Don't call str on something that might be a unicode; don't concatenate a str literal to a unicode or pass one to its replace method; etc. Any time you mix and match, Python is going to implicitly convert for you, using your default encoding, which is almost never what you want.
As a side note, this is one of the many things that Python 3.x's Unicode changes help with. First, str is now Unicode text, not encoded bytes. More importantly, if you have encoded bytes, e.g., in a bytes object, calling encode will give you an AttributeError instead of trying to silently decode so it can re-encode. And, similarly, trying to mix and match Unicode and bytes will give you an obvious TypeError, instead of an implicit conversion that succeeds in some cases and gives a cryptic message about an encode or decode you didn't ask for in others.
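The Python 3 behaviour described in that side note can be checked directly. The sketch below assumes Python 3; the last step calls decode("ascii") explicitly to reproduce the hidden decode that Python 2 performed before re-encoding.

```python
data = "caf\u00e9".encode("utf-8")  # a bytes object holding UTF-8

encode_error = concat_error = implicit_error = None

try:
    data.encode("utf-8")            # bytes has no encode method in Python 3
except AttributeError as exc:
    encode_error = type(exc).__name__

try:
    "abc" + data                    # mixing text and bytes is rejected
except TypeError as exc:
    concat_error = type(exc).__name__

try:
    data.decode("ascii")            # what Python 2 did silently before encoding
except UnicodeDecodeError as exc:
    implicit_error = str(exc)

print(encode_error, concat_error)
print(implicit_error)
```

The final message is the same "ordinal not in range(128)" complaint from the question, only this time it comes from a decode call that is visible in the code.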

python byte string encode and decode

I am trying to convert an incoming byte string that contains non-ascii characters into a valid utf-8 string so that I can dump it as json.
b = '\x80'
u8 = b.encode('utf-8')
j = json.dumps(u8)
I expected j to be '\xc2\x80' but instead I get:
UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 0: ordinal not in range(128)
In my situation, 'b' is coming from mysql via google protocol buffers and is filled out with some blob data.
Any ideas?
EDIT:
I have ethernet frames that are stored in a mysql table as a blob (please, everyone, stay on topic and keep from discussing why there are packets in a table). The table collation is utf-8 and the db layer (sqlalchemy, non-orm) is grabbing the data and creating structs (google protocol buffers) which store the blob as a python 'str'. In some cases I use the protocol buffers directly without any issue. In other cases, I need to expose the same data via json. What I noticed is that when json.dumps() does its thing, '\x80' can be replaced with the invalid unicode char (\ufffd iirc).

You need to examine the documentation for the software API that you are using. BLOB is an acronym: BINARY Large Object.
If your data is in fact binary, the idea of decoding it to Unicode is of course a nonsense.
If it is in fact text, you need to know what encoding to use to decode it to Unicode.
Then you use json.dumps(a_Python_object) ... if you encode it to UTF-8 yourself, json will decode it back again:
>>> import json
>>> json.dumps(u"\u0100\u0404")
'"\\u0100\\u0404"'
>>> json.dumps(u"\u0100\u0404".encode('utf8'))
'"\\u0100\\u0404"'
>>>
UPDATE about latin1:
u'\x80' is a useless meaningless C1 control character -- the encoding is extremely unlikely to be Latin-1. Latin-1 is "a snare and a delusion" -- all 8-bit bytes are decoded to Unicode without raising an exception. Don't confuse "works" and "doesn't raise an exception".
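The point about Latin-1 never raising is easy to verify. This sketch (Python 3) decodes every possible byte value as Latin-1, then looks at what byte 0x80 becomes under Latin-1 and under cp1252.

```python
import unicodedata

all_bytes = bytes(range(256))
as_latin1 = all_bytes.decode("latin-1")   # succeeds for every byte value

# Under Latin-1, 0x80 becomes U+0080, a C1 control character (category Cc).
category = unicodedata.category(b"\x80".decode("latin-1"))

# Under cp1252 the same byte is the euro sign.
as_cp1252 = b"\x80".decode("cp1252")

print(len(as_latin1), category, as_cp1252)
```

So a successful Latin-1 decode says nothing about whether Latin-1 was the right choice.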

Use b.decode('name of source encoding') to get a unicode version. This was surprising to me when I learned it. eg:
In [123]: 'foo'.decode('latin-1')
Out[123]: u'foo'

I think what you are trying to do is decode a string object that is in some encoding, to get a unicode object. Do you know what that encoding is?
unicode_b = b.decode('some_encoding')
Then re-encode the unicode object with the utf_8 encoding to get a string object back:
b = unicode_b.encode('utf_8')
This uses the unicode object as a translator. Without knowing the original encoding of the string I can't be certain, but it is possible that the conversion will not go as expected. The unicode object is not meant for converting strings from one encoding to another. If you know the encoding, I would work with the unicode object and convert back to an encoded string only when you want a string object. If you don't know the encoding, there really isn't a way to find out other than trial and error.

Python string encoding issue

I am using the Amazon MWS API to get the sales report for my store and then save that report in a table in the database. Unfortunately I am getting an encoding error when I try to encode the information as Unicode. After looking through the report (exactly as amazon sent it) I saw this string which is the location of the buyer:
'S�o Paulo'
so I tried to encode it like so:
encodeme = 'S�o Paulo'
encodeme.encode('utf-8)
but got the following error
UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 1: ordinal not in range(128)
The whole reason why I am trying to encode it is because as soon as Django sees the � character it throws a warning and cuts off the string, meaning that the location is saved as S instead of
São Paulo
Any help is appreciated.

It looks like you are having some kind of encoding problem.
First, you should be very certain what encoding Amazon is using in the report body they send you. Is it UTF-8? Is it ISO 8859-1? Something else?
Unfortunately the Amazon MWS Reports API documentation, especially their API Reference, is not very forthcoming about what encoding they use. The only encoding I see them mention is UTF-8, so that should be your first guess. The GetReport API documentation (p.36-37) describes the response element Report as being type xs:string, but I don't see where they define that data type. Maybe they mean XML Schema's string datatype.
So, I suggest you save the byte sequence you are receiving as your report body from Amazon in a file, with zero transformations. Be aware that your code which calls AWS might be modifying the report body string inadvertently. Examine the non-ASCII bytes in that file with a binary editor. Is the "ã" of "São" stored as S\xC3\xA3o, indicating UTF-8 encoding? Or is it stored as S\xE3o, indicating ISO 8859-1 encoding?
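To see what those two byte patterns look like without a binary editor, here is a short sketch in Python 3. It encodes the same text both ways and shows that the ISO 8859-1 bytes are not valid UTF-8.

```python
city = "S\u00e3o Paulo"

utf8_bytes = city.encode("utf-8")         # ã becomes two bytes: C3 A3
latin1_bytes = city.encode("iso-8859-1")  # ã becomes one byte: E3

print(repr(utf8_bytes))
print(repr(latin1_bytes))

# The single-byte form cannot be read back as UTF-8.
try:
    latin1_bytes.decode("utf-8")
    latin1_is_valid_utf8 = True
except UnicodeDecodeError:
    latin1_is_valid_utf8 = False

print(latin1_is_valid_utf8)
```

Printing repr() of the saved report body and looking for one of these two patterns is a quick way to tell the encodings apart.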
I'm guessing that you receive your report as a flat file. The Amazon AWS documentation says that you can request reports be delivered to you as XML. This would have the advantage of giving you a reply with an explicit encoding declaration.
Once you know the encoding of the report body, you now need to handle it properly. You imply that you are using the Django framework and Python language code to receive the report from Amazon AWS.
One thing to get very clear (as Skirmantas also explains):
Unicode strings hold characters. Byte strings hold bytes (octets).
Encoding converts a Unicode string into a byte string.
Decoding converts a byte string into a Unicode string.
The string you get from Amazon AWS is a byte string. You need to decode it to get a Unicode string. But your code fragment, encodeme = 'São Paulo', gives you a byte string. encodeme.encode('utf-8) performs an encode() on the byte string, which isn't what you want. (The missing closing quote on 'utf-8 doesn't help.)
Try this example code:
>>> reportbody = 'S\xc3\xa3o Paulo' # UTF-8 encoded byte string
>>> reportbody.decode('utf-8') # returns a Unicode string, u'...'
u'S\xe3o Paulo'
You might find some background reading helpful. I agree with Hoxieboy that you should take the time to read Python's Unicode HOWTO. Also check out the top answers to What do I need to know about Unicode?.

I think you have to decode it using a correct encoding rather than encode it to utf-8. Try
s = s.decode('utf-8')
However, you need to know which encoding to use. Input can come in encodings other than utf-8.
The UnicodeDecodeError you received means that your object is not unicode; it is a bytestring. When you call bytestring.encode, the string is first decoded into a unicode object with the default encoding (ascii), and only then is it encoded with utf-8.
I'll try to explain the difference between a unicode string and a utf-8 bytestring in Python.
unicode is a Python datatype which represents a unicode string. You use unicode for most string operations in your program. How Python stores it internally does not matter to you.
A bytestring is a binary-safe string. It can be in any encoding. When you receive data, for example when you open a file, you get a bytestring, and in most cases you will want to decode it to unicode. When you write to a file, you have to encode unicode objects into bytestrings. Sometimes decoding/encoding is done for you by a framework or library. A framework cannot always do this, however, because it cannot always know which encoding to use.
utf-8 is an encoding which can correctly represent any unicode string as a bytestring. However, you can't decode any kind of bytestring with utf-8 into unicode. You need to know what encoding is used in the bytestring to decode it.
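That asymmetry can be shown in a few lines. The sketch below is Python 3, with sample strings chosen only for illustration: each one survives an encode/decode round trip through utf-8, while an arbitrary byte does not decode.

```python
samples = ["S\u00e3o Paulo", "\u4e00", "\U0001F600"]

# Encoding text to utf-8 and decoding it again gives back the same text.
round_trips = all(s.encode("utf-8").decode("utf-8") == s for s in samples)

# The reverse is not guaranteed: a lone 0x92 byte is not valid utf-8.
try:
    b"\x92".decode("utf-8")
    arbitrary_bytes_decode = True
except UnicodeDecodeError:
    arbitrary_bytes_decode = False

print(round_trips, arbitrary_bytes_decode)
```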

Official Python unicode documentation
You might try that webpage if you haven't already and see if you can get the answer you're looking for ;)

Python Encoding issue

Why am I getting this issue? and how do I resolve it?
UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 24: unexpected code byte
Thank you

Somewhere, perhaps subtly, you are asking Python to turn a stream of bytes into a "string" of characters.
Don't think of a string as "bytes". A string is a list of numbers, each number having an agreed meaning in Unicode (#65 = Latin Capital A, #19968 = Chinese Character "One"/"First").
There are many methods of encoding a list of Unicode entities into a stream of bytes. Python is assuming your stream of bytes is the result of a particular such method, called "UTF-8".
However, your stream of bytes has data that does not correspond to that method. Thus the error is raised.
You need to figure out the encoding of the stream of bytes, and tell Python that encoding.
It's important to know if you're using Python 2 or 3, and the code leading up to this exception to see where your bytes came from and what the appropriate way to deal with them is.
If it's from reading a file, you can explicitly deal with the bytes read. But you must be sure of the file encoding.
If it's from a string that is part of your source code, then Python is assuming the "wrong thing" about your source files... perhaps $LC_ALL or $LANG needs to be set. This is a good time to firmly understand the concept of encoding, and how text editors choose an encoding to write, and what is standard for your language and operating system.
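For the file case, "dealing with the bytes explicitly" can mean either reading raw bytes and decoding them yourself, or telling open() which encoding to use. The sketch below is Python 3 and assumes, purely for illustration, that the file is cp1252; it writes a small temporary file so that it runs on its own.

```python
import os
import tempfile

# Create a small cp1252 file; 0x92 is the curly apostrophe in that encoding.
fd, path = tempfile.mkstemp()
try:
    with os.fdopen(fd, "wb") as handle:
        handle.write("It\u2019s here".encode("cp1252"))

    # Option 1: read raw bytes, decode later once the encoding is known.
    with open(path, "rb") as handle:
        raw_read = handle.read()

    # Option 2: let open() decode using an explicitly stated encoding.
    with open(path, encoding="cp1252") as handle:
        text = handle.read()
finally:
    os.remove(path)

print(repr(raw_read), repr(text))
```

Without the encoding argument, open() in text mode falls back to a platform-dependent default, which is how the wrong codec can end up being applied without anyone asking for it.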

In addition to what Joe said, chardet is a useful tool to detect encoding of the source data.

Somewhere you have a plain string encoded as "Windows-1252" (or "cp1252") containing a "RIGHT SINGLE QUOTATION MARK" (’) instead of an APOSTROPHE ('). This could come from a file you read, or even in a Python source file of yours; you could be running Python 2.x and have a # -*- coding: utf8 -*- line somewhere near the script's beginning, or you could be running Python 3.x.
You don't give enough data; however, somewhere you have a cp1252-encoded string, which you try (explicitly or implicitly) to decode to unicode as utf-8. This won't work.
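The diagnosis can be checked on a made-up byte string (Python 3 sketch; the sample text is an assumption, only the 0x92 byte comes from the error message):

```python
import unicodedata

raw = b"It\x92s here"  # invented sample containing the reported byte

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

as_cp1252 = raw.decode("cp1252")
char_name = unicodedata.name(as_cp1252[2])  # the character that came from 0x92

print(utf8_ok, repr(as_cp1252), char_name)
```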
Give us more info, and we'll try again to help you.
Joe Koberg's answer reminded me of an older answer of mine, which some people have found helpful: Python UnicodeDecodeError - Am I misunderstanding encode?
