Compress in Java, decompress in Python - snappy/redis-py-cluster - python

I am writing a cron script in Python for a Redis cluster and using redis-py-cluster only for reading data from a prod server. A separate Java application writes to the Redis cluster with snappy compression and the Java string codec UTF-8.
I am able to read data but not able to decode it.
from rediscluster import RedisCluster
import snappy

host, port = "127.0.0.1", "30001"
startup_nodes = [{"host": host, "port": port}]
print("Trying connecting to redis cluster host=" + host + ", port=" + str(port))
rc = RedisCluster(startup_nodes=startup_nodes, max_connections=32, decode_responses=True)
print("Connected", rc)
print("Reading all keys, value ...\n\n")
for key in rc.scan_iter("uidx:*"):
    value = rc.get(key)
    # uncompress = snappy.uncompress(value, decoding="utf-8")
    print(key, value)
    print('\n')
print("Done. exit()")
exit()
With the snappy line commented out, decode_responses=False works fine; however, changing it to decode_responses=True throws the error below. My guess is that it's not picking up the correct decoder.
Traceback (most recent call last):
File "splooks_cron.py", line 22, in <module>
print(key, rc.get(key))
File "/Library/Python/2.7/site-packages/redis/client.py", line 1207, in get
return self.execute_command('GET', name)
File "/Library/Python/2.7/site-packages/rediscluster/utils.py", line 101, in inner
return func(*args, **kwargs)
File "/Library/Python/2.7/site-packages/rediscluster/client.py", line 410, in execute_command
return self.parse_response(r, command, **kwargs)
File "/Library/Python/2.7/site-packages/redis/client.py", line 768, in parse_response
response = connection.read_response()
File "/Library/Python/2.7/site-packages/redis/connection.py", line 636, in read_response
raise e
UnicodeDecodeError: 'utf8' codec can't decode byte 0x82 in position 0: invalid start byte
PS: Uncommenting the line uncompress = snappy.uncompress(value, decoding="utf-8") breaks with this error:
Traceback (most recent call last):
File "splooks_cron.py", line 27, in <module>
uncompress = snappy.uncompress(value, decoding="utf-8")
File "/Library/Python/2.7/site-packages/snappy/snappy.py", line 91, in uncompress
return _uncompress(data).decode(decoding)
snappy.UncompressError: Error while decompressing: invalid input

After hours of debugging, I was finally able to solve this.
I am using the xerial/snappy-java compressor in the Java code that writes to the Redis cluster. The interesting thing is that during compression xerial's SnappyOutputStream adds a header at the beginning of the compressed data. In my case it looks something like this:
"\x82SNAPPY\x00\x00\x00\x00\x01\x00\x00\x00\x01\x00\x00\x01\xb6\x8b\x06\\******actual data here*****
Because of this header, the decompressor was not able to parse the data. I modified the code as below to strip the header from the value, and it's working fine now.
for key in rc.scan_iter("uidx:*"):
    value = rc.get(key)
    # in my case the offset was 20, and utf-8 is the default encoder/decoder for snappy
    # https://github.com/andrix/python-snappy/blob/master/snappy/snappy.py
    uncompress_value = snappy.decompress(value[20:])
    print(key, uncompress_value)
    print('\n')
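Note that hard-coding the 20-byte offset only works while the value fits in a single block. As far as I can tell, xerial's SnappyOutputStream writes an 8-byte magic plus two 4-byte version fields, followed by one or more length-prefixed compressed blocks, so a more general (untested) sketch would walk the blocks instead of slicing at a fixed offset:

import struct
import snappy

XERIAL_HEADER_LEN = 16  # magic "\x82SNAPPY\x00" (8) + version (4) + compatible version (4)

def decompress_xerial(data):
    # Assumes the standard SnappyOutputStream layout: a 16-byte header,
    # then repeated [4-byte big-endian block length][snappy-compressed block].
    out = []
    pos = XERIAL_HEADER_LEN
    while pos + 4 <= len(data):
        (block_len,) = struct.unpack(">i", data[pos:pos + 4])
        pos += 4
        out.append(snappy.decompress(data[pos:pos + block_len]))
        pos += block_len
    return b"".join(out)

# usage, with decode_responses=False so values stay raw bytes:
# print(key, decompress_xerial(value).decode("utf-8"))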

Related

When using FTP on Apache Airflow, how do I specify the encoding?

I have to access an FTP server that does not use UTF-8 encoding, so when Airflow tries to connect to it, it crashes.
I was able to reproduce the problem using the underlying ftplib that Airflow uses, as follows:
ftp = FTP('myserver', user='xxxx', passwd='yyyy', encoding='utf-8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python3.11/ftplib.py", line 121, in __init__
self.connect(host)
File "/usr/lib/python3.11/ftplib.py", line 162, in connect
self.welcome = self.getresp()
^^^^^^^^^^^^^^
File "/usr/lib/python3.11/ftplib.py", line 244, in getresp
resp = self.getmultiline()
^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/ftplib.py", line 230, in getmultiline
line = self.getline()
^^^^^^^^^^^^^^
File "/usr/lib/python3.11/ftplib.py", line 212, in getline
line = self.file.readline(self.maxline + 1)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "<frozen codecs>", line 322, in decode
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe1 in position 60: invalid continuation byte
And when using latin-1, there is no problem:
ftp = FTP('myserver', user='xxxx', passwd='yyyyy', encoding='latin-1')
print(ftp.welcome)
220-Microsoft FTP Service
220 FTP XXXXX, utilizado pelos usuários do orgão e Editoras.
But I don't see any option to change the encoding when using an Airflow operator or sensor. In the connection extras it ignores {"encoding": "latin-1"}.
Looking at the FTPHook source at https://github.com/apache/airflow/blob/6ec97dc6491c3f7d7cee3da2e6d2acb4e7bddba3/airflow/providers/ftp/hooks/ftp.py#L62, there's indeed nothing that sets the encoding.
For your own project, you could subclass the FTPHook and FTPFileTransmitOperator, for example:
import ftplib
from airflow.compat.functools import cached_property
from airflow.providers.ftp.hooks.ftp import FTPHook
from airflow.providers.ftp.operators.ftp import FTPFileTransmitOperator


class FTPHookWithEncoding(FTPHook):
    def get_conn(self) -> ftplib.FTP:
        if self.conn is None:
            params = self.get_connection(self.ftp_conn_id)
            pasv = params.extra_dejson.get("passive", True)
            encoding = params.extra_dejson.get("encoding")
            self.conn = ftplib.FTP(params.host, params.login, params.password, encoding=encoding)
            self.conn.set_pasv(pasv)
        return self.conn


class FTPFileTransmitOperatorWithEncoding(FTPFileTransmitOperator):
    @cached_property
    def hook(self) -> FTPHookWithEncoding:
        return FTPHookWithEncoding(ftp_conn_id=self.ftp_conn_id)
Since the code above subclasses Airflow's FTPHook and FTPFileTransmitOperator, you're re-using all the methods from those classes. The additions in this code are:
Fetch the encoding from the extras with encoding = params.extra_dejson.get("encoding"). Note this example makes it a required extras field, adjust to your needs.
Set the encoding with ftplib.FTP(params.host, params.login, params.password, encoding=encoding).
Initialize and return the custom FTPHookWithEncoding in the operator's hook property.
You might need to adjust this for the specific version of Airflow that you're running.
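If you only need the hook directly (for example from a PythonOperator), a hypothetical usage, assuming a connection id of my_ftp_conn, might look like:

# hypothetical: use the custom hook outside of the transfer operator
hook = FTPHookWithEncoding(ftp_conn_id="my_ftp_conn")
files = hook.list_directory("/some/remote/path")
print(files)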
And in your DAG, you'll need to update the operator to use your custom operator. For example:
from my_package.operators import FTPFileTransmitOperatorWithEncoding
FTPFileTransmitOperatorWithEncoding(task_id="...", ...)

Rasa App breaks in Pycharm but works fine in terminal

Whenever I try to run my Rasa app using the run button in PyCharm, or try to use the debugger, I get the following error:
Traceback (most recent call last):
File "/Users/matthewspeck/anaconda3/envs/proj_env/lib/python3.6/site-packages/pykwalify/core.py", line 76, in __init__
self.source = yaml.load(stream)
File "/Users/matthewspeck/anaconda3/envs/proj_env/lib/python3.6/site-packages/ruamel/yaml/main.py", line 933, in load
loader = Loader(stream, version, preserve_quotes=preserve_quotes)
File "/Users/matthewspeck/anaconda3/envs/proj_env/lib/python3.6/site-packages/ruamel/yaml/loader.py", line 50, in __init__
Reader.__init__(self, stream, loader=self)
File "/Users/matthewspeck/anaconda3/envs/proj_env/lib/python3.6/site-packages/ruamel/yaml/reader.py", line 85, in __init__
self.stream = stream # type: Any # as .read is called
File "/Users/matthewspeck/anaconda3/envs/proj_env/lib/python3.6/site-packages/ruamel/yaml/reader.py", line 130, in stream
self.determine_encoding()
File "/Users/matthewspeck/anaconda3/envs/proj_env/lib/python3.6/site-packages/ruamel/yaml/reader.py", line 190, in determine_encoding
self.update_raw()
File "/Users/matthewspeck/anaconda3/envs/proj_env/lib/python3.6/site-packages/ruamel/yaml/reader.py", line 297, in update_raw
data = self.stream.read(size)
File "/Users/matthewspeck/anaconda3/envs/proj_env/lib/python3.6/encodings/ascii.py", line 26, in decode
return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 473: ordinal not in range(128)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/matthewspeck/project/trainer_app/app.py", line 25, in <module>
parser=False, core=True)
File "/Users/matthewspeck/project/trainer_app/rasa_model.py", line 165, in make_rasa_model
rasa_config=rasa_config
File "/Users/matthewspeck/project/trainer_app/rasa_model.py", line 66, in __init__
self._parser = create_agent(use_rasa_nlu=True, load_models=True)
File "/Users/matthewspeck/project/trainer_app/rasa.py", line 32, in create_agent
domain = create_domain()
File "/Users/matthewspeck/project/trainer_app/rasa.py", line 83, in create_domain
domain = ClarifyDomain.load(domain_path)
File "/Users/project/clarification/domain.py", line 39, in load
domain = TemplateDomain.load(filename)
File "/Users/matthewspeck/anaconda3/envs/proj_env/lib/python3.6/site-packages/rasa_core/domain.py", line 404, in load
cls.validate_domain_yaml(filename)
File "/Users/matthewspeck/anaconda3/envs/proj_env/lib/python3.6/site-packages/rasa_core/domain.py", line 438, in validate_domain_yaml
schema_files=[schema_file])
File "/Users/matthewspeck/anaconda3/envs/proj_env/lib/python3.6/site-packages/pykwalify/core.py", line 78, in __init__
raise CoreError(u"Unable to load any data from source yaml file")
pykwalify.errors.CoreError: <CoreError: error code 3: Unable to load any data from source yaml file: Path: '/'>
Process finished with exit code 1
However, when I run the app from my terminal, or from my text editor (I use VSCode), it runs with no problems whatsoever. I've looked online and every answer I see has something to do with Rasa, but nothing mentions problems with PyCharm.
I've also checked that the yaml for the domain is properly formatted, and it is. Anyone have any idea why I would be getting this error in PyCharm, but not in any other environment, and how I could fix it?
I believe your problem was fixed with Rasa version 0.12 (changelog: https://github.com/RasaHQ/rasa_core/blob/master/CHANGELOG.rst#0120---2018-11-11).
I recommend upgrading to a newer version of Rasa Core, which parses the training data correctly.

UnicodeError when try to send a file with greek filename

I have created and populated Greek names in a set() and I then pass this set of values to a view function.
When I try to print this set, the Greek names appear as gibberish. I believe this has something to do with Apache mod_wsgi or Bottle not starting with UTF-8 support.
How can I tell Apache/Bottle to use LANG=el_GR.utf-8 so I can display Unicode properly, because I believe that's the issue here?
I looked for AddDefaultCharset utf-8 in httpd.conf but it is already enabled, so I have to ask why the Greek chars appear as gibberish.
This happens when I try to download a file with a Greek filename.
Error: 500 Internal Server Error
Sorry, the requested URL 'http://superhost.gr/downloads/file' caused an error:
Internal Server Error
Exception:
UnicodeEncodeError('ascii', '/static/files/Î\x92ιογÏ\x81αÏ\x86ικÏ\x8c - Î\x9dίκοÏ\x82.docx', 14, 34, 'ordinal not in range(128)')
Traceback:
Traceback (most recent call last):
File "/usr/lib/python3.6/site-packages/bottle.py", line 862, in _handle
return route.call(**args)
File "/usr/lib/python3.6/site-packages/bottle.py", line 1740, in wrapper
rv = callback(*a, **ka)
File "/usr/lib/python3.6/site-packages/bottle.py", line 2690, in wrapper
return func(*a, **ka)
File "/home/nikos/public_html/downloads.py", line 148, in file
return static_file(filename, root='/static/files', download=True)
File "/usr/lib/python3.6/site-packages/bottle.py", line 2471, in static_file
if not os.path.exists(filename) or not os.path.isfile(filename):
File "/usr/lib64/python3.6/genericpath.py", line 19, in exists
os.stat(path)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 14-33: ordinal not in range(128)
The code use to download the file is:
return static_file(filename, root='/static/files', download=True)
My system is set to UTF-8:
[root@superhost public_html]# echo $LANG
en_US.UTF-8
Perhaps it's something with Apache, or is it a problem with Python 3?
You can't use Bottle's static_file() with a Unicode filename and download=True. See the accepted answer to this question for two alternative solutions to this limitation.
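One workaround along those lines, roughly adapted from the Flask answer below to Bottle (untested sketch; the /static/files root comes from the question), is to serve the file without download=True and set an RFC 5987 encoded Content-Disposition header yourself. Note this only addresses the header encoding; the os.path.exists() failure in the traceback still requires the process to run with a UTF-8 locale.

from urllib.parse import quote
from bottle import static_file

def download_file(filename):
    # Serve without download=True, then add the attachment header with
    # an RFC 5987 (UTF-8) encoded filename ourselves.
    response = static_file(filename, root='/static/files')
    response.set_header(
        'Content-Disposition',
        "attachment; filename*=UTF-8''{}".format(quote(filename))
    )
    return response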

Flask raises UnicodeEncodeError (latin-1) when send attachment with UTF-8 characters

I'm creating a file server with Flask. While testing the download feature, I found that it raises UnicodeEncodeError if I try to download files whose names contain UTF-8 (non-ASCII) characters.
Create a file at upload/1512026299/%E6%97%A0%E6%A0%87%E9%A2%98.png, then run the code below:
@app.route('/getfile/<timestamp>/<filename>')
def download(timestamp, filename):
    dirpath = os.path.join(os.path.join(os.path.abspath(os.path.dirname(__file__)), 'upload'), timestamp)
    return send_from_directory(dirpath, filename, as_attachment=True)
You will get an exception, which should be like this:
127.0.0.1 - - [30/Nov/2017 21:39:05] "GET /getfile/1512026299/%E6%97%A0%E6%A0%87%E9%A2%98.png HTTP/1.1" 200 -
Error on request:
Traceback (most recent call last):
File "C:\Program Files\Python36\lib\site-packages\werkzeug\serving.py", line 209, in run_wsgi
execute(self.server.app)
File "C:\Program Files\Python36\lib\site-packages\werkzeug\serving.py", line 200, in execute
write(data)
File "C:\Program Files\Python36\lib\site-packages\werkzeug\serving.py", line 168, in write
self.send_header(key, value)
File "C:\Program Files\Python36\lib\http\server.py", line 508, in send_header
("%s: %s\r\n" % (keyword, value)).encode('latin-1', 'strict'))
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 43-45: ordinal not in range(256)
The problem is that when using as_attachment=True the filename is sent in the headers. Unfortunately, it seems that Flask does not yet support RFC 5987, which specifies how to encode attachment filenames in an encoding other than latin-1.
The easiest solution in this case would be to drop as_attachment=True; then the file won't be sent with a Content-Disposition header, which avoids this problem.
If you really have to send the Content-Disposition header, you could try the code posted in the related issue:
from urllib.parse import quote  # needed for the filename encoding

response = make_response(send_file(out_file))
basename = os.path.basename(out_file)
response.headers["Content-Disposition"] = \
    "attachment;" \
    "filename*=UTF-8''{utf_filename}".format(
        utf_filename=quote(basename.encode('utf-8'))
    )
return response
This should be fixed in the next release (>0.12)
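Applied to the send_from_directory route from the question, a minimal sketch of the same idea (same route and upload layout as above, untested) could look like this:

import os
from urllib.parse import quote
from flask import Flask, send_from_directory

app = Flask(__name__)

@app.route('/getfile/<timestamp>/<filename>')
def download(timestamp, filename):
    dirpath = os.path.join(os.path.abspath(os.path.dirname(__file__)), 'upload', timestamp)
    # Send without as_attachment=True, then set the RFC 5987 encoded
    # Content-Disposition header manually so non-latin-1 names survive.
    response = send_from_directory(dirpath, filename)
    response.headers['Content-Disposition'] = (
        "attachment; filename*=UTF-8''{}".format(quote(filename))
    )
    return response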

UnicodeDecodeError while iterating over MongoDB collection

I am trying to query a MongoDB database with Python 2.7 and pymongo 2.3, using something like this:
from pymongo import Connection

connection = Connection()
db = connection['db-name']
collections = db.subName
entries = collections['collection-name']
print entries
# > Collection(Database(Connection('localhost', 27017), u'db-name'), u'subName.collection-name')

for entry in entries.find():
    pass
The iterator fails even though I don't do anything with the entry objects:
Traceback (most recent call last):
File "/Users/../mongo.py", line 27, in <module>
for entry in entries.find():
File "/Library/Python/2.7/site-packages/pymongo-2.3-py2.7-macosx-10.8-intel.egg/pymongo/cursor.py", line 778, in next
File "/Library/Python/2.7/site-packages/pymongo-2.3-py2.7-macosx-10.8-intel.egg/pymongo/cursor.py", line 742, in _refresh
File "/Library/Python/2.7/site-packages/pymongo-2.3-py2.7-macosx-10.8-intel.egg/pymongo/cursor.py", line 686, in __send_message
File "/Library/Python/2.7/site-packages/pymongo-2.3-py2.7-macosx-10.8-intel.egg/pymongo/helpers.py", line 111, in _unpack_response
UnicodeDecodeError: 'utf8' codec can't decode byte 0xfc in position 744: invalid start byte
I'm not the creator of the database I'm trying to query.
Does anybody know what I'm doing wrong and how I can fix it? Thanks.
Update: I managed to skip over the failing decode by wrapping the offending line in pymongo/helpers.py in a try-except, but I would prefer a solution that does not involve data loss.
try:
    result["data"] = bson.decode_all(response[20:], as_class, tz_aware, uuid_subtype)
except:
    result["data"] = []
Can you try the same operation using the mongo shell? I want to figure out if it's something Python-specific or if it's corruption in the database:
$ mongo db-name
> var collection = db.getCollection('subName.collection-name')
> collection.find().forEach(function(doc) { printjson(doc); })
