latency with group in pymongo in tests

Good Day.
I ran into the following issue using pymongo==2.1.1 on Python 2.7 with MongoDB 2.4.8.
I searched Google and Stack Overflow for a solution without success.
What's the issue?
I have the following function:
from bson.code import Code
from pymongo import Connection  # needed for Connection below
def read(groupped_by=None):
    # JavaScript reducer: increment the counter of the current group
    reducer = Code("""
        function(obj, prev){
            prev.count++;
        }
    """)
    client = Connection('localhost', 27017)
    db = client.urlstats_database
    results = db.http_requests.group(key={k: 1 for k in groupped_by},
                                     condition={},
                                     initial={"count": 0},
                                     reduce=reducer)
    groupped_by = list(groupped_by) + ['count']
    result = [tuple(res[col] for col in groupped_by) for res in results]
    return sorted(result)
Then I tried to write a test for this function:
class UrlstatsViewsTestCase(TestCase):
    test_data = {'data%s' % i : 'data%s' % i for i in range(6)}
    def test_one_criterium(self):
        client = Connection('localhost', 27017)
        db = client.urlstats_database
        for column in self.test_data:
            db.http_requests.remove()
            db.http_requests.insert(self.test_data)
            response = read([column])
            self.assertEqual(response, [(self.test_data[column], 1)])
This test sometimes fails, as far as I understand because of latency: the response still contains data that should already have been removed.
If I add a delay after remove(), the test passes every time.
Is there a proper way to test such functionality?
Thanks in advance.

A few questions regarding your environment / code:
What version of pymongo are you using?
If you are using any of the newer versions that have MongoClient, is there any specific reason you are using Connection instead of MongoClient?
The reason I ask the second question is that Connection provides fire-and-forget behaviour for the operations you are doing, while MongoClient works in safe mode by default and has been the preferred approach since mongodb 2.2+.
The behaviour you see is characteristic of using Connection instead of MongoClient. With Connection, your remove is sent to the server, and as soon as it has left the client, your program moves on to the next step, which is inserting new entries. Depending on latency and on how long the remove takes to complete, these operations can conflict, as you have already noticed in your test case.
Can you change to use MongoClient and see if that helps you with your test code?
Additional Ref: pymongo: MongoClient or Connection
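To illustrate the difference, here is a minimal pure-Python sketch. It does not use pymongo at all: ToyServer, its methods and the 50 ms delay are hypothetical stand-ins. It assumes that writes are applied by the server some time after they are sent, and that the read arrives over a separate connection (as in read() above, which opens its own Connection), so it does not wait for pending writes.

```python
import queue
import threading
import time


class ToyServer:
    """In-memory stand-in for a database server (not a real MongoDB API)."""

    def __init__(self, apply_delay=0.05):
        self.docs = []
        self._ops = queue.Queue()
        self._delay = apply_delay
        threading.Thread(target=self._worker, daemon=True).start()

    def _worker(self):
        # Apply queued writes in order, each after a simulated delay.
        while True:
            op, done = self._ops.get()
            time.sleep(self._delay)
            op(self.docs)
            done.set()

    def write(self, op, acknowledged):
        done = threading.Event()
        self._ops.put((op, done))
        if acknowledged:
            done.wait()  # "safe mode": block until the write is applied

    def read(self):
        # Models a read on another connection: it does not wait for queued writes.
        return list(self.docs)


def read_after_remove_and_insert(acknowledged):
    server = ToyServer()
    # Leftover document from a previous test iteration.
    server.write(lambda docs: docs.append({"data0": "data0"}), acknowledged=True)
    server.write(lambda docs: docs.clear(), acknowledged)                      # remove()
    server.write(lambda docs: docs.append({"data1": "data1"}), acknowledged)   # insert()
    return server.read()


fire_and_forget = read_after_remove_and_insert(acknowledged=False)
safe_mode = read_after_remove_and_insert(acknowledged=True)
```

With unacknowledged writes the read still sees the leftover document; with acknowledged writes it sees only the freshly inserted one. This is only a model of the timing, not of how the server actually schedules operations.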

Thanks All.
There is no MongoClient class in the version of pymongo I use, so I was forced to find out what exactly differs.
As soon as I upgrade to 2.2+ I will test whether everything is OK with MongoClient. But with the Connection class one can use a write concern to control this latency.
In older versions one should create the connection with the corresponding arguments.
I have tried these two: journal=True, safe=True (the journal write concern can't be used in non-safe mode).
j or journal: Block until write operations have been committed to the journal. Ignored if the server is running without journaling. Implies safe=True.
I think this makes performance worse, but for automated tests this should be OK.

Related

Datastore delay on creating entities with put()

I am developing an application using the Cloud Datastore Emulator (2.1.0) and the google-cloud-ndb Python library (1.6).
I find that there is an intermittent delay on entities being retrievable via a query.
For example, if I create an entity like this:
my_entity = MyEntity(foo='bar')
my_entity.put()
get_my_entity = MyEntity.query().filter(MyEntity.foo == 'bar').get()
print(get_my_entity.foo)
it will fail intermittently because the get() method returns None.
This only happens on about 1 in 10 calls.
To demonstrate, I've created this script (also available with ready to run docker-compose setup on GitHub):
import random
from google.cloud import ndb
from google.auth.credentials import AnonymousCredentials
client = ndb.Client(
    credentials=AnonymousCredentials(),
    project='local-dev',
)
class SampleModel(ndb.Model):
    """Sample model."""
    some_val = ndb.StringProperty()
for x in range(1, 1000):
    print(f'Attempt {x}')
    with client.context():
        random_text = str(random.randint(0, 9999999999))
        new_model = SampleModel(some_val=random_text)
        new_model.put()
        retrieved_model = SampleModel.query().filter(
            SampleModel.some_val == random_text
        ).get()
        print(f'Model Text: {retrieved_model.some_val}')
What would be the correct way to avoid this intermittent failure? Is there a way to ensure the entity is always available after the put() call?
Update
I can confirm that this is only an issue with the Datastore emulator. When testing on App Engine with Firestore in Datastore mode, entities are available immediately after calling put().

The issue turned out to be related to the emulator trying to replicate eventual consistency.
Unlike relational databases, Datastore does not guarantee that the data will be available to queries immediately after it is written. This is because there are often replication and indexing delays.
For things like unit tests, this can be resolved by passing --consistency=1.0 to the datastore start command as documented here.

What's the proper way of moving documents between databases starting from PyMongo 3.6?

I used to use pymongo.bulk.BulkOperationBuilder but the docs say that it's deprecated.
The official MongoDB shell has db.cloneCollection(), but I can't find anything similar in PyMongo except copydb, which is not what I need.
So I found two ways to bulk-insert docs from one collection into another and remove them afterwards. I haven't tested them yet; I wanted to ask for advice first, because there might be a better way.
Solution #1.
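from pymongo import InsertOne
db_from = mongo['db_1']  # db_from was used below but never defined
coll_from = db_from['coll_name']
coll_to = mongo['db_2']['coll_name']
# bulk_write expects a list of operations, not a generator
requests = [InsertOne(doc) for doc in coll_from.find()]
result = coll_to.bulk_write(requests, ordered=False)
db_from.drop_collection('coll_name')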
Solution #2.
db_from = mongo['db_1']  # db_from was used below but never defined
coll_from = db_from['coll_name']
coll_to = mongo['db_2']['coll_name']
coll_to.insert_many(coll_from.find())
db_from.drop_collection('coll_name')
Is there any better way to bulk-move docs between dbs?

cloneCollection as documented is a command.
The PyMongo API exposes a command method on instances of pymongo.database.Database.
It can be used as follows to clone a collection of the same name from a remote host:
from pymongo import MongoClient
client = MongoClient()
clone_cmd = {
    'cloneCollection': 'db_1.coll_name',
    'from': '<hostname>'
}
client.db_2.command(clone_cmd)
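If the command route is not available, the insert_many approach from the question can be done in batches so that the whole collection is never held in memory at once. The following is only a sketch: copy_in_batches and FakeCollection are hypothetical names, and FakeCollection is an in-memory stand-in that mimics just the find() and insert_many() calls so the example runs without a server.

```python
from itertools import islice


def copy_in_batches(source, target, batch_size=1000):
    """Copy every document from source to target; return the number copied."""
    cursor = iter(source.find())
    copied = 0
    while True:
        # Pull at most batch_size documents from the cursor.
        batch = list(islice(cursor, batch_size))
        if not batch:
            break
        target.insert_many(batch, ordered=False)
        copied += len(batch)
    return copied


class FakeCollection:
    """Minimal in-memory stand-in for a collection (assumption, not PyMongo)."""

    def __init__(self, docs=()):
        self.docs = list(docs)

    def find(self):
        return iter(self.docs)

    def insert_many(self, docs, ordered=True):
        self.docs.extend(docs)


src = FakeCollection({"_id": i} for i in range(2500))
dst = FakeCollection()
copied = copy_in_batches(src, dst, batch_size=1000)
```

With real collections, it is probably safer to compare the copied count with the source count before calling drop_collection on the source.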

Does Psycopg2 allow udf create queries to run on redshift using Python?

I am able to connect to aws-redshift with psycopg2 using python, I can query tables and get data back, etc...
However, when I try to run a create UDF function through psycopg2, nothing happens: no error is returned, but nothing gets created.
Here's my code:
def _applyFunctionToDB():
    con = psycopg2.connect(dbname=redhsiftDatabase, host=redshiftHost, port='5439', user=redshiftUser, password=redshiftPwd)
    cur = con.cursor()
    udf = _fileOpenWrite(udfFile)
    size = os.stat(udfFile).st_size
    udfCode = udf.read(size)
    cur.execute(udfCode)
    con.close()
I have run it through the debugger and all the pieces are there, but nothing happens when the "execute" method is invoked on the cursor.
If anyone has any advice and/or ideas on what might be going on here, please advise.
Thanks!

Found the answer just after posting here, in: Copying data from S3 to AWS redshift using python and psycopg2
I need to invoke a commit.
So, add con.commit() to the code above, after execute and before close.
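The effect can be reproduced without Redshift. The sketch below uses the standard-library sqlite3 module as a stand-in for psycopg2 (an assumption made so the example is runnable anywhere); both follow the DB-API convention that changes made in an open transaction are lost if the connection is closed without a commit. The table name udfs and the helper rows_after_reopen are made up for illustration.

```python
import os
import sqlite3
import tempfile


def rows_after_reopen(commit):
    """Insert one row, optionally commit, close, reopen, and count the rows."""
    fd, path = tempfile.mkstemp(suffix=".db")
    os.close(fd)
    try:
        con = sqlite3.connect(path)
        con.execute("CREATE TABLE udfs (name TEXT)")
        con.commit()
        con.execute("INSERT INTO udfs VALUES ('f_example')")
        if commit:
            con.commit()
        con.close()  # closing without commit discards the pending insert

        con = sqlite3.connect(path)
        count = con.execute("SELECT COUNT(*) FROM udfs").fetchone()[0]
        con.close()
        return count
    finally:
        os.remove(path)


without_commit = rows_after_reopen(commit=False)
with_commit = rows_after_reopen(commit=True)
```

Only the committed run leaves a row behind. With psycopg2, setting con.autocommit = True before executing the statement should be an alternative to calling con.commit() explicitly.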

ipython-cypher in Python: cypher.run.Connection object parameter

I'm trying to use ipython-cypher to run Neo4j Cypher queries (and return a Pandas dataframe) in a Python program. I have no trouble forming a connection and running a query when using IPython Notebook, but when I try to run the same query outside of IPython, as per the documentation:
http://ipython-cypher.readthedocs.org/en/latest/introduction.html#usage-out-of-ipython
import cypher
results = cypher.run("MATCH (n)--(m) RETURN n.username, count(m) as neighbors",
"http://XXX.XXX.X.XXX:xxxx")
I get the following error:
neo4jrestclient.exceptions.StatusException: Code [401]: Unauthorized. No permission -- see authorization schemes.
Authorization Required
and
Format: (http|https)://username:password@hostname:port/db/name, or one of dict_keys([])
Now, I was just guessing that that was how I should enter a Connection object as the last parameter, because I couldn't find any additional documentation explaining how to connect to a remote host using Python, and in IPython, I am able to do:
%load_ext cypher
results = %cypher http://XXX.XXX.X.XXX:xxxx MATCH (n)--(m) RETURN n.username,
count(m) as neighbors
Any insight would be greatly appreciated. Thank you.

The documentation has a section for the API. When you use it outside of IPython and need to connect to a different host, passing the connection string via the conn parameter should work.
import cypher
results = cypher.run("MATCH (n)--(m) RETURN n.username, count(m) as neighbors",
conn="http://XXX.XXX.X.XXX:xxxx")
But also consider that with the new authentication support in Neo4j 2.2, you need to set the new password before connecting from ipython-cypher. I will fix this as soon as I implement the forced password change mechanism in neo4jrestclient, the underlying library.
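Once a password is set, the credentials would go into the connection string in the usual user:password@host URL form. A small sketch for building such a string follows; the helper name neo4j_conn_url, the default db/data path, and the example host, port, user and password are all assumptions, and whether the library decodes percent-encoded credentials is worth checking before relying on it.

```python
from urllib.parse import quote


def neo4j_conn_url(host, port, user, password, scheme="http", db_path="db/data"):
    """Build a scheme://user:password@host:port/db_path connection string."""
    # Percent-encode credentials so characters like '@' or ':' stay unambiguous.
    user_part = quote(user, safe="")
    password_part = quote(password, safe="")
    return f"{scheme}://{user_part}:{password_part}@{host}:{port}/{db_path}"


plain_url = neo4j_conn_url("localhost", 7474, "neo4j", "s3cret")
escaped_url = neo4j_conn_url("localhost", 7474, "neo4j", "p@ss:word")
```

The resulting string would then be passed as conn=... to cypher.run, as in the example above.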

Aerospike Python Client. UDF module to count records. Cannot register module

I am currently implementing the Aerospike Python Client in order to benchmark it along with our Redis implementation, to see which is faster and/or more stable.
I'm still taking baby steps, currently unit-testing basic functionality, for example whether I correctly add records to my set. For that reason, I want to create a function to count them.
I saw in Aerospike's documentation that:
"to perform an aggregation on query, you first need to register a UDF with the database".
It seems that this is the suggested way that aggregations, counting and other custom functionality should be run in Aerospike.
Therefore, to count the records in a set I have, I created the following module:
# "counter.lua"
function count(s)
    return s : map(function() return 1 end) : reduce(function(a, b) return a + b end)
end
I'm trying to use aerospike python client's function to register a UDF(User Defined Function) module:
udf_put(filename, udf_type, policy)
My code is as follows:
# aerospike_client.py:
# "udf_put" parameters
policy = {'timeout': 1000}
lua_module = os.path.join(os.path.dirname(os.path.realpath(__file__)), "counter.lua") #same folder
udf_type = aerospike.UDF_TYPE_LUA # equals to "0", which is for "Lua"
self.client.udf_put(lua_module, udf_type, policy) # Exception is thrown here
query = self.client.query(self.aero_namespace, self.aero_set)
query.select()
result = query.apply('counter', 'count')
an exception is thrown:
exceptions.Exception: (-2L, 'Filename should be a string', 'src/main/client/udf.c', 82)
Is there anything I'm missing or doing wrong?
Is there a way to "debug" it without compiling C code?
Is there any other suggested way to count the records in my set? Or is the Lua module fine?

First, I'm not seeing that exception, but I am seeing a bug with udf_put where the module is registered but the python process hangs. I can see the module appear on the server using AQL's show modules.
I opened a bug with the Python client's repo on Github, aerospike/aerospike-client-python.
There's a best practices document regarding UDF development here: https://www.aerospike.com/docs/udf/best_practices.html
In general using the stream-UDF to aggregate the records through the count function is the correct way to go about it. There are examples here and here.
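For readers less familiar with stream UDFs, this is roughly what the Lua count function from the question computes, written as a plain Python analogue. It is not Aerospike client code; the records list is made-up sample data.

```python
from functools import reduce


def count(stream):
    """Map every record to 1, then sum the ones (same idea as counter.lua)."""
    ones = map(lambda record: 1, stream)
    # Like the Lua version, this has no initial value, so it needs at least one record.
    return reduce(lambda a, b: a + b, ones)


records = [{"id": 1}, {"id": 2}, {"id": 3}]
total = count(records)
```

In Aerospike the map step would run on the server nodes, with a final reduce on the client combining the partial results.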
