UnicodeDecodeError while iterating over MongoDB collection - python

I am trying to query a MongoDB database with Python 2.7 and pymongo-2.3, using something like this:
from pymongo import Connection
connection = Connection()
db = connection['db-name']
collections = db.subName
entries = collections['collection-name']
print entries
# > Collection(Database(Connection('localhost', 27017), u'db-name'), u'subName.collection-name')
for entry in entries.find():
    pass
The iterator fails even though I don't do anything with the entry objects:
Traceback (most recent call last):
File "/Users/../mongo.py", line 27, in <module>
for entry in entries.find():
File "/Library/Python/2.7/site-packages/pymongo-2.3-py2.7-macosx-10.8-intel.egg/pymongo/cursor.py", line 778, in next
File "/Library/Python/2.7/site-packages/pymongo-2.3-py2.7-macosx-10.8-intel.egg/pymongo/cursor.py", line 742, in _refresh
File "/Library/Python/2.7/site-packages/pymongo-2.3-py2.7-macosx-10.8-intel.egg/pymongo/cursor.py", line 686, in __send_message
File "/Library/Python/2.7/site-packages/pymongo-2.3-py2.7-macosx-10.8-intel.egg/pymongo/helpers.py", line 111, in _unpack_response
UnicodeDecodeError: 'utf8' codec can't decode byte 0xfc in position 744: invalid start byte
I'm not the creator of the database I'm trying to query.
Does anybody know what I'm doing wrong and how I can fix it? Thanks.
Update: I managed to get past the offending line in pymongo/helpers.py by wrapping it in a try-except, but I would prefer a solution that does not involve data loss.
try:
    result["data"] = bson.decode_all(response[20:], as_class, tz_aware, uuid_subtype)
except:
    result["data"] = []

Can you try the same operation using the mongo shell? I want to figure out if it's something Python-specific or if it's corruption in the database:
$ mongo db-name
> var collection = db.getCollection('subName.collection-name')
> collection.find().forEach(function(doc) { printjson(doc); })

Related

When using FTP on Apache Airflow, how do I specify the encoding?

I have to access an FTP server that does not use UTF-8 encoding, so when Airflow tries to connect to it, it crashes.
I was able to reproduce the problem using the underlying ftplib that Airflow uses, as follows:
ftp = FTP('myserver', user='xxxx', passwd='yyyy', encoding='utf-8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python3.11/ftplib.py", line 121, in __init__
self.connect(host)
File "/usr/lib/python3.11/ftplib.py", line 162, in connect
self.welcome = self.getresp()
^^^^^^^^^^^^^^
File "/usr/lib/python3.11/ftplib.py", line 244, in getresp
resp = self.getmultiline()
^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/ftplib.py", line 230, in getmultiline
line = self.getline()
^^^^^^^^^^^^^^
File "/usr/lib/python3.11/ftplib.py", line 212, in getline
line = self.file.readline(self.maxline + 1)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "<frozen codecs>", line 322, in decode
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe1 in position 60: invalid continuation byte
And when using latin-1, there is no problem:
ftp = FTP('myserver', user='xxxx', passwd='yyyyy', encoding='latin-1')
print(ftp.welcome)
220-Microsoft FTP Service
220 FTP XXXXX, utilizado pelos usuários do orgão e Editoras.
But I don't see any option for changing the encoding when using an Airflow operator or sensor. Setting {"encoding": "latin-1"} in the connection's extras is ignored.
Looking at the FTPHook source (https://github.com/apache/airflow/blob/6ec97dc6491c3f7d7cee3da2e6d2acb4e7bddba3/airflow/providers/ftp/hooks/ftp.py#L62), there's indeed nothing that sets the encoding.
For your own project, you could subclass the FTPHook and FTPFileTransmitOperator, for example:
import ftplib
from airflow.compat.functools import cached_property
from airflow.providers.ftp.hooks.ftp import FTPHook
from airflow.providers.ftp.operators.ftp import FTPFileTransmitOperator
class FTPHookWithEncoding(FTPHook):
    def get_conn(self) -> ftplib.FTP:
        if self.conn is None:
            params = self.get_connection(self.ftp_conn_id)
            pasv = params.extra_dejson.get("passive", True)
            encoding = params.extra_dejson.get("encoding")
            self.conn = ftplib.FTP(params.host, params.login, params.password, encoding=encoding)
            self.conn.set_pasv(pasv)
        return self.conn

class FTPFileTransmitOperatorWithEncoding(FTPFileTransmitOperator):
    @cached_property
    def hook(self) -> FTPHookWithEncoding:
        return FTPHookWithEncoding(ftp_conn_id=self.ftp_conn_id)
Since the code above is subclassing Airflow's FTPHook and FTPFileTransmitOperator, you're re-using all the methods from those classes. The additions in this code are:
Fetch the encoding from the extras with encoding = params.extra_dejson.get("encoding"). Note this example makes it a required extras field, adjust to your needs.
Set the encoding with ftplib.FTP(params.host, params.login, params.password, encoding=encoding).
Initialize and return the custom FTPHookWithEncoding in the operator's hook property.
You might need to adjust this for the specific version of Airflow that you're running.
And in your DAG, you'll need to swap in your custom operator. For example:
from my_package.operators import FTPFileTransmitOperatorWithEncoding
FTPFileTransmitOperatorWithEncoding(task_id="...", ...)
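For the hook to pick up the encoding, the connection's extras need an "encoding" key. As a quick check outside a DAG, you could instantiate the custom hook directly; the connection id my_ftp_latin1 and the module path my_package.hooks below are placeholders, not existing names:
from my_package.hooks import FTPHookWithEncoding

# assumes an Airflow connection "my_ftp_latin1" whose extras contain
# {"encoding": "latin-1", "passive": true}
hook = FTPHookWithEncoding(ftp_conn_id="my_ftp_latin1")
print(hook.get_conn().welcome)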

How to properly serialize and deserialize paging_state in Python?

In my Python application, I query a Cassandra database and am trying to implement pagination through the cassandra-driver package. As you can see from the code below, paging_state is of the bytes data type. I can convert this value to a string and send the value of the str_paging_state variable to the client. If the client sends str_paging_state back to me, I want to use it in my query.
This part of the code works:
query = "select * from users where user_type = 'clients';"
statement = SimpleStatement(query, fetch_size=10)
results = session.execute(statement)
paging_state = results.paging_state
print(type(paging_state)) # <class 'bytes'>
str_paging_state = str(paging_state)
print(str_paging_state) # "b'\\x00C\\x00\\x00\\x00\\x02\\x00\\x00\\x00\\x03_hk\\x00\\x00\\x00\\x11P]5C#\\x8bGD~\\x8b\\xc7g\\xda\\xe5rH\\xb0\\x00\\x00\\x00\\x03_rk\\x00\\x00\\x00\\x18\\xee\\x14\\xf7\\x83\\x84\\x00tTmw[\\x00\\xec\\xdb\\x9b\\xa9\\xfd\\x00\\xb9\\xff\\xff\\xff\\xff\\xfe\\x01\\x00'"
This part of the code raises an error:
results = session.execute(
    statement,
    paging_state=bytes(str_paging_state.encode())
)
Error:
[ERROR] NoHostAvailable: ('Unable to complete the operation against any hosts')
Traceback (most recent call last):
  File "/var/task/lambda_function.py", line 51, in lambda_handler
    results = cassandra_connection.execute(statement, paging_state=bytes(paging_state.encode()))
  File "/opt/python/lib/python3.8/site-packages/cassandra/cluster.py", line 2618, in execute
    return self.execute_async(query, parameters, trace, custom_payload, timeout, execution_profile, paging_state, host, execute_as).result()
  File "/opt/python/lib/python3.8/site-packages/cassandra/cluster.py", line 4877, in result
    raise self._final_exception
In the Java driver documentation I found the PagingState.fromString method, which creates a PagingState object from a string previously generated with toString(). Unfortunately, I didn't find an equivalent of this method in Python.
I also tried to use the codecs package to decode and encode the paging_state.
str_paging_state = codecs.decode(paging_state, encoding='utf-8', errors='ignore')
# "\u0000C\u0000\u0000\u0000\u0002\u0000\u0000\u0000\u0003_hk\u0000\u0000\u0000\u0011P]5C#GD~grH\u0000\u0000\u0000\u0003_rk\u0000\u0000\u0000\u0018\u0014\u0000tTmw[\u0000ۛ\u0000\u0001\u0000"
# The following raises an error
results = session.execute(statement, paging_state=codecs.encode(str_paging_state, encoding='utf-8', errors='ignore'))
In this case I see the following error:
[ERROR] ProtocolException: <Error from server: code=000a [Protocol error] message="Invalid value for the paging state">
Traceback (most recent call last):
  File "/var/task/lambda_function.py", line 50, in lambda_handler
    results = cassandra_connection.execute(
  File "/opt/python/lib/python3.8/site-packages/cassandra/cluster.py", line 2618, in execute
    return self.execute_async(query, parameters, trace, custom_payload, timeout, execution_profile, paging_state, host, execute_as).result()
  File "/opt/python/lib/python3.8/site-packages/cassandra/cluster.py", line 4877, in result
    raise self._final_exception
In my case, the protocol version is 4.
cluster = Cluster(..., protocol_version=4)
I would appreciate any help!
Just convert the binary data into a hex string or base64 - use the binascii module for that. For the hex case, use the hexlify/unhexlify functions (or, in Python 3, the .hex() method on bytes and bytes.fromhex()); for base64, use the b2a_base64/a2b_base64 functions.
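The reason the original round trip fails is that str() on a bytes object gives you its repr ("b'...'"), not the original data, so encoding that string back produces different bytes. A minimal sketch of the hex and base64 round trips, reusing the statement and session from the question:
import binascii

paging_state = results.paging_state                                  # bytes
# hex round trip
str_paging_state = binascii.hexlify(paging_state).decode('ascii')   # or paging_state.hex()
raw = binascii.unhexlify(str_paging_state)                          # or bytes.fromhex(str_paging_state)

# base64 round trip
b64_paging_state = binascii.b2a_base64(paging_state).decode('ascii')
raw = binascii.a2b_base64(b64_paging_state)

results = session.execute(statement, paging_state=raw)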

Compress in Java, decompress in Python - snappy/redis-py-cluster

I am writing a cron script in Python for a Redis cluster, using redis-py-cluster only to read data from a prod server. A separate Java application writes to the Redis cluster with snappy compression and the Java string codec UTF-8.
I am able to read data but not able to decode it.
from rediscluster import RedisCluster
import snappy
host, port ="127.0.0.1", "30001"
startup_nodes = [{"host": host, "port": port}]
print("Trying connecting to redis cluster host=" + host + ", port=" + str(port))
rc = RedisCluster(startup_nodes=startup_nodes, max_connections=32, decode_responses=True)
print("Connected", rc)
print("Reading all keys, value ...\n\n")
for key in rc.scan_iter("uidx:*"):
    value = rc.get(key)
    # uncompress = snappy.uncompress(value, decoding="utf-8")
    print(key, value)
    print('\n')
print("Done. exit()")
exit()
decode_responses=False works fine with that line commented out; however, changing to decode_responses=True throws an error. My guess is that it's not able to pick the correct decoder.
Traceback (most recent call last):
File "splooks_cron.py", line 22, in <module>
print(key, rc.get(key))
File "/Library/Python/2.7/site-packages/redis/client.py", line 1207, in get
return self.execute_command('GET', name)
File "/Library/Python/2.7/site-packages/rediscluster/utils.py", line 101, in inner
return func(*args, **kwargs)
File "/Library/Python/2.7/site-packages/rediscluster/client.py", line 410, in execute_command
return self.parse_response(r, command, **kwargs)
File "/Library/Python/2.7/site-packages/redis/client.py", line 768, in parse_response
response = connection.read_response()
File "/Library/Python/2.7/site-packages/redis/connection.py", line 636, in read_response
raise e
UnicodeDecodeError: 'utf8' codec can't decode byte 0x82 in position 0: invalid start byte
PS: Uncommenting the line uncompress = snappy.uncompress(value, decoding="utf-8") breaks with this error:
Traceback (most recent call last):
File "splooks_cron.py", line 27, in <module>
uncompress = snappy.uncompress(value, decoding="utf-8")
File "/Library/Python/2.7/site-packages/snappy/snappy.py", line 91, in uncompress
return _uncompress(data).decode(decoding)
snappy.UncompressError: Error while decompressing: invalid input
After hours of debugging, I was finally able to solve this.
I am using the xerial/snappy-java compressor in my Java code that writes to the Redis cluster. The interesting thing is that during compression, xerial's SnappyOutputStream adds a header at the beginning of the compressed data. In my case it looks something like this:
"\x82SNAPPY\x00\x00\x00\x00\x01\x00\x00\x00\x01\x00\x00\x01\xb6\x8b\x06\\******actual data here*****
Because of this, the decompressor was not able to make sense of the data. I modified the code as below to strip the offset from the value, and it's working fine now.
for key in rc.scan_iter("uidx:*"):
    value = rc.get(key)
    # in my case the offset was 20; utf-8 is the default encoder/decoder for snappy
    # https://github.com/andrix/python-snappy/blob/master/snappy/snappy.py
    uncompress_value = snappy.decompress(value[20:])
    print(key, uncompress_value)
    print('\n')
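If you'd rather not hardcode the offset of 20, a more defensive sketch (based on my understanding of the xerial stream layout: an 8-byte magic "\x82SNAPPY\x00", two 4-byte version fields, then blocks each prefixed with a 4-byte big-endian compressed length) could look like this:
import struct
import snappy

XERIAL_MAGIC = b"\x82SNAPPY\x00"

def decompress_xerial(value):
    # plain snappy data: no xerial stream header present
    if not value.startswith(XERIAL_MAGIC):
        return snappy.decompress(value)
    out = []
    pos = 16  # skip magic (8 bytes) + version and compat-version (4 bytes each)
    while pos < len(value):
        (block_len,) = struct.unpack(">i", value[pos:pos + 4])
        pos += 4
        out.append(snappy.decompress(value[pos:pos + block_len]))
        pos += block_len
    return b"".join(out)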

tf.gfile.Glob gives me a UnicodeDecodeError - any way to fix this?

I was trying to get the list of names of the txt files written in Korean in the specified directory, with the code below:
dir_list = tf.gfile.Glob(engine.TXT_DIR+"/*.txt")
However, this gives me the following error:
Traceback (most recent call last):
File "D:/Prj_mayDay/Prj_FrankenShtine/shakespear_reborn/main.py", line 108, in <module>
dir_list = tf.gfile.Glob(engine.TXT_DIR+"/*.txt")
File "D:\KimKanna's Class\python35\lib\site-packages\tensorflow\python\lib\io\file_io.py", line 326, in get_matching_files
compat.as_bytes(filename), status)
File "D:\KimKanna's Class\python35\lib\site-packages\tensorflow\python\lib\io\file_io.py", line 325, in <listcomp>
for matching_filename in pywrap_tensorflow.GetMatchingFiles(
File "D:\KimKanna's Class\python35\lib\site-packages\tensorflow\python\util\compat.py", line 106, in as_str_any
return as_str(value)
File "D:\KimKanna's Class\python35\lib\site-packages\tensorflow\python\util\compat.py", line 84, in as_text
return bytes_or_text.decode(encoding)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbb in position 19: invalid start byte
Now, through some research, I found out the reason:
The error is because there is some non-ascii character in the dictionary and it can't be encoded/decoded
However, I do not see any way to apply that solution to my code. Or is there?
If there is alternative code for this, it should be applicable to both a cloud storage bucket and my personal hard drive, as the code above is.
I'm using Python 3 and TensorFlow 1.2.0-rc2.
So after a few hours of fiddling around with my code, I finally found the solution.
After all, one of the files inside the directory I specified had a Korean name. After I took it out of the directory, the problem was gone.
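If removing or renaming the file isn't an option, one workaround is to build the list with Python's own glob module, which handles non-ASCII file names natively in Python 3. Note this only covers local paths, not a Cloud Storage bucket, so it doesn't fully match the original requirement; it assumes engine.TXT_DIR points at a local directory:
import glob
import os

# local-only alternative to tf.gfile.Glob
dir_list = glob.glob(os.path.join(engine.TXT_DIR, "*.txt"))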

Python ldif3 parser and exception in for loop

From the site https://pypi.python.org/pypi/ldif3/3.2.0, I have this code:
from ldif3 import LDIFParser
from pprint import pprint

parser = LDIFParser(open('data.ldif', 'rb'))
for dn, entry in parser.parse():
    print('got entry record: %s' % dn)
    pprint(entry)
And now, reading my file data.ldif, I get an exception in parser.parse().
The question is: how do I catch this exception and allow the for loop to go on to the next record (continue)?
Traceback:
Traceback (most recent call last):
File "ldif.py", line 16, in <module>
for dn, entry in parser.parse():
File "/home/dlubom/anaconda2/lib/python2.7/site-packages/ldif3.py", line 373, in parse
yield self._parse_entry_record(block)
File "/home/dlubom/anaconda2/lib/python2.7/site-packages/ldif3.py", line 346, in _parse_entry_record
attr_type, attr_value = self._parse_attr(line)
File "/home/dlubom/anaconda2/lib/python2.7/site-packages/ldif3.py", line 309, in _parse_attr
return attr_type, attr_value.decode('utf8')
File "/home/dlubom/anaconda2/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xb3 in position 6: invalid start byte
I think it's not possible to handle the exception in this case because it happens before/during the variable assignment.
BTW, you probably want to use the attribute:
strict (boolean) – If set to False, recoverable parse errors will
produce log warnings rather than exceptions.
Example:
parser = LDIFParser(ldif_file, strict=False)
https://ldif3.readthedocs.io/en/latest/
That helped me parse an invalid LDIF file containing commas inside CN attributes.
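Putting it together with the original loop, a sketch would be:
from ldif3 import LDIFParser
from pprint import pprint

# strict=False turns recoverable parse errors into log warnings,
# so the loop keeps going over the remaining records
with open('data.ldif', 'rb') as ldif_file:
    parser = LDIFParser(ldif_file, strict=False)
    for dn, entry in parser.parse():
        print('got entry record: %s' % dn)
        pprint(entry)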
