Datastore delay on creating entities with put() - python

I am developing an application with the Cloud Datastore Emulator (2.1.0) and the google-cloud-ndb Python library (1.6).
I find that there is an intermittent delay on entities being retrievable via a query.
For example, if I create an entity like this:
my_entity = MyEntity(foo='bar')
my_entity.put()
get_my_entity = MyEntity.query().filter(MyEntity.foo == 'bar').get()
print(get_my_entity.foo)
it will fail intermittently because the get() method returns None.
This only happens on about 1 in 10 calls.
To demonstrate, I've created this script (also available with a ready-to-run docker-compose setup on GitHub):
import random
from google.cloud import ndb
from google.auth.credentials import AnonymousCredentials

client = ndb.Client(
    credentials=AnonymousCredentials(),
    project='local-dev',
)


class SampleModel(ndb.Model):
    """Sample model."""

    some_val = ndb.StringProperty()


for x in range(1, 1000):
    print(f'Attempt {x}')
    with client.context():
        random_text = str(random.randint(0, 9999999999))
        new_model = SampleModel(some_val=random_text)
        new_model.put()
        retrieved_model = SampleModel.query().filter(
            SampleModel.some_val == random_text
        ).get()
        print(f'Model Text: {retrieved_model.some_val}')
What would be the correct way to avoid this intermittent failure? Is there a way to ensure the entity is always available after the put() call?
Update
I can confirm that this is only an issue with the Datastore emulator. When testing on App Engine with Firestore in Datastore mode, entities are available immediately after calling put().

The issue turned out to be related to the emulator trying to replicate eventual consistency.
Unlike relational databases, Datastore does not guarantee that data will be available immediately after it is written, because there are often replication and indexing delays.
For things like unit tests, this can be resolved by passing --consistency=1.0 to the datastore start command as documented here.
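If you start the emulator directly with gcloud rather than through the docker-compose setup, the flag is passed on the start command; the exact invocation below is an example and may need adjusting for your environment:
gcloud beta emulators datastore start --consistency=1.0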

Related

Conflict - Python insert-update into Azure table storage

Working with Python, I have many processes that need to update/insert data into Azure Table Storage at the same time, using:
table_service.update_entity(table_name, task)
table_service.insert_entity(table_name, task)
However, the following error occurs:
AzureConflictHttpError: Conflict
{"odata.error":{"code":"EntityAlreadyExists","message":{"lang":"en-US","value":"The specified entity already exists.\nRequestId:57d9b721-6002-012d-3d0c-b88bef000000\nTime:2019-01-29T19:55:53.5984026Z"}}}
Maybe I need to use a global lock to avoid operating on the same table entity concurrently, but I don't know how to use one.
Azure Tables has a new SDK for Python that is available in preview on pip; here's an update for the newest library.
On a create method you can use a try/except block to catch the expected error:
from azure.data.tables import TableClient
from azure.core.exceptions import ResourceExistsError

table_client = TableClient.from_connection_string(conn_str, table_name="myTableName")

try:
    table_client.create_entity(entity=my_entity)
except ResourceExistsError:
    print("Entity already exists")
You can use ETag to update entities conditionally after creation.
from azure.data.tables import UpdateMode
from azure.core import MatchConditions

received_entity = table_client.get_entity(
    partition_key="my_partition_key",
    row_key="my_row_key",
)
etag = received_entity._metadata["etag"]

resp = table_client.update_entity(
    entity=my_new_entity,
    etag=etag,
    mode=UpdateMode.REPLACE,
    match_condition=MatchConditions.IfNotModified,
)
On update, you can choose to replace or merge; more information here.
(FYI, I am a Microsoft employee on the Azure SDK for Python team)
There isn't a global "lock" in Azure Table Storage, since it uses optimistic concurrency via ETags (i.e. the If-Match header in raw HTTP requests).
If your thread A is performing insert_entity, it should catch the 409 Conflict error.
If your threads B and C are performing update_entity, they should catch the 412 Precondition Failed error, then use a loop to retrieve the latest entity and try the update again.
For more details, please check the "Managing Concurrency in the Table Service" section at https://azure.microsoft.com/en-us/blog/managing-concurrency-in-microsoft-azure-storage-2/
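A minimal sketch of that retry loop, assuming the newer azure-data-tables SDK shown above; update_with_retry and apply_changes are hypothetical names, and exception/metadata details may differ between SDK versions:
from azure.core import MatchConditions
from azure.core.exceptions import ResourceModifiedError
from azure.data.tables import UpdateMode

def update_with_retry(table_client, partition_key, row_key, apply_changes, attempts=5):
    """Re-read the entity and retry the conditional update until it succeeds."""
    for _ in range(attempts):
        entity = table_client.get_entity(partition_key=partition_key, row_key=row_key)
        apply_changes(entity)  # mutate the entity in place
        try:
            table_client.update_entity(
                entity=entity,
                etag=entity._metadata["etag"],
                mode=UpdateMode.REPLACE,
                match_condition=MatchConditions.IfNotModified,
            )
            return True
        except ResourceModifiedError:
            continue  # another writer won the race; fetch the latest and retry
    return False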

Aerospike Python Client. UDF module to count records. Cannot register module

I am currently implementing the Aerospike Python Client in order to benchmark it along with our Redis implementation, to see which is faster and/or more stable.
I'm still taking baby steps, currently unit-testing basic functionality, for example whether I correctly add records to my set. For that reason, I want to create a function to count them.
I saw in Aerospike's documentation that:
"to perform an aggregation on query, you first need to register a UDF
with the database".
It seems that this is the suggested way that aggregations, counting and other custom functionality should be run in Aerospike.
Therefore, to count the records in a set I have, I created the following module:
# "counter.lua"
function count(s)
return s : map(function() return 1 end) : reduce (function(a,b) return a+b end)
end
I'm trying to use the Aerospike Python client's function for registering a UDF (User Defined Function) module:
udf_put(filename, udf_type, policy)
My code is as follows:
# aerospike_client.py:
# "udf_put" parameters
policy = {'timeout': 1000}
lua_module = os.path.join(os.path.dirname(os.path.realpath(__file__)), "counter.lua") #same folder
udf_type = aerospike.UDF_TYPE_LUA # equals to "0", which is for "Lua"
self.client.udf_put(lua_module, udf_type, policy) # Exception is thrown here
query = self.client.query(self.aero_namespace, self.aero_set)
query.select()
result = query.apply('counter', 'count')
an exception is thrown:
exceptions.Exception: (-2L, 'Filename should be a string', 'src/main/client/udf.c', 82)
Is there anything I'm missing or doing wrong?
Is there a way to "debug" it without compiling C code?
Is there any other suggested way to count the records in my set? Or am I fine with the Lua module?
First, I'm not seeing that exception, but I am seeing a bug with udf_put where the module is registered but the Python process hangs. I can see the module appear on the server using AQL's show modules.
I opened a bug with the Python client's repo on Github, aerospike/aerospike-client-python.
There's a best practices document regarding UDF development here: https://www.aerospike.com/docs/udf/best_practices.html
In general, using a stream UDF to aggregate the records through the count function is the correct way to go about it. There are examples here and here.
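For reference, a minimal end-to-end sketch of that approach (my own, not from the answer), assuming a local server and placeholder namespace/set names; in recent versions of the client the aggregated value is read back with results() after apply() configures the stream UDF:
import os

import aerospike

# Connect to a local Aerospike node (host/port are placeholders).
client = aerospike.client({'hosts': [('127.0.0.1', 3000)]}).connect()

# Register the Lua module from the same directory as this script.
lua_module = os.path.join(os.path.dirname(os.path.realpath(__file__)), 'counter.lua')
client.udf_put(lua_module, aerospike.UDF_TYPE_LUA, {'timeout': 1000})

# Apply the stream UDF to the query and read the aggregated count back.
query = client.query('test', 'demo')
query.apply('counter', 'count')
print(query.results())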

latency with group in pymongo in tests

Good day.
I have faced the following issue using pymongo==2.1.1 in Python 2.7 with Mongo 2.4.8.
I have tried to find a solution using Google and Stack Overflow but failed.
What's the issue?
I have the following function:
from bson.code import Code
from pymongo import Connection


def read(groupped_by=None):
    reducer = Code("""
        function(obj, prev){
            prev.count++;
        }
    """)
    client = Connection('localhost', 27017)
    db = client.urlstats_database
    results = db.http_requests.group(key={k: 1 for k in groupped_by},
                                     condition={},
                                     initial={"count": 0},
                                     reduce=reducer)
    groupped_by = list(groupped_by) + ['count']
    result = [tuple(res[col] for col in groupped_by) for res in results]
    return sorted(result)
Then I am trying to write a test for this function:
class UrlstatsViewsTestCase(TestCase):

    test_data = {'data%s' % i: 'data%s' % i for i in range(6)}

    def test_one_criterium(self):
        client = Connection('localhost', 27017)
        db = client.urlstats_database
        for column in self.test_data:
            db.http_requests.remove()
            db.http_requests.insert(self.test_data)
            response = read([column])
            self.assertEqual(response, [(self.test_data[column], 1)])
This test sometimes fails, as I understand it, because of latency: as far as I can see, the response still contains data that should have been removed.
If I add a delay after the remove, the test passes every time.
Is there any proper way to test such functionality?
Thanks in Advance.
A few questions regarding your environment / code:
What version of pymongo are you using?
If you are using any of the newer versions that have MongoClient, is there any specific reason you are using Connection instead of MongoClient?
The reason I ask the second question is that Connection provides fire-and-forget behaviour for the operations you are doing, while MongoClient works in safe mode by default and has been the preferred approach since MongoDB 2.2+.
The behaviour you see is very much what you would expect from using Connection instead of MongoClient. With Connection, your remove is sent to the server, and the moment it leaves the client side, your program execution moves on to the next step, which is to add new entries. Depending on latency and how long the remove operation takes to complete, these operations will conflict, as you have already noticed in your test case.
Can you change to use MongoClient and see if that helps you with your test code?
Additional Ref: pymongo: MongoClient or Connection
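A minimal sketch of the suggested change, assuming pymongo 2.4+ is installed; only the connection class changes, the rest of the code stays the same:
from pymongo import MongoClient

# MongoClient uses acknowledged writes by default, so remove()/insert()
# do not return until the server has applied them.
client = MongoClient('localhost', 27017)
db = client.urlstats_database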
Thanks, all.
There is no MongoClient class in the version of pymongo I use, so I was forced to find out what exactly differs.
As soon as I upgrade to 2.2+ I will test whether everything is OK with MongoClient. But as for the Connection class, one can use a write concern to control this latency.
In older versions, one should create the connection with the corresponding arguments.
I have tried these two: journal=True and safe=True (the journal write concern can't be used in non-safe mode):
j or journal: Block until write operations have been committed to the journal. Ignored if the server is running without journaling. Implies safe=True.
I think this makes performance worse, but for automated tests it should be fine.
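A minimal sketch of that older-style setup, assuming pymongo 2.1.x; the safe/journal keyword arguments make remove() and insert() block until the server acknowledges them:
from pymongo import Connection

# safe=True makes writes acknowledged; journal=True additionally waits for the
# journal commit (and implies safe=True, per the docs quoted above).
client = Connection('localhost', 27017, safe=True, journal=True)
db = client.urlstats_database
db.http_requests.remove()  # now blocks until the remove is acknowledged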

Use cache under mod_wsgi for Python web framework

The code is like this (I'm using Flask and Flask-Cache, but this might be a general problem):
@cache.memoize(500000)
def big_foo(a, b):
    return a + b + random.randrange(0, 1000)
If I run it in a Python interpreter, I can always get the same result by calling big_foo(1, 2).
But if I add this function to the application, serve it with mod_wsgi in daemon mode, and then make requests from a browser (big_foo is called within the view function for that request), I find the result is not the same each time.
I think the results are different each time because mod_wsgi uses multiple processes to launch the app. Each process might have its own cache, and the cache can't be shared between processes.
Is my guess right? If so, how can I assign one and only one cache for global access? If not, what is wrong with my code?
Following is the config used for Flask-Cache:
UPLOADS_FOLDER = "/mnt/Storage/software/x/temp/"


class RadarConfig(object):
    UPLOADS_FOLDER = UPLOADS_FOLDER
    ALLOWED_EXTENSIONS = set(['bed'])
    SECRET_KEY = "tiananmen"
    DEBUG = True
    CACHE_TYPE = 'simple'
    CACHE_DEFAULT_TIMEOUT = 5000000
    BASIS_PATH = "/mnt/Storage/software/x/NMF_RESULT//p_NMF_Nimfa_NMF_Run_30632__metasites_all"
    COEF_PATH = "/mnt/Storage/software/x/NMF_RESULT/MCF7/p_NMF_Nimfa_NMF_Run_30632__metasample_all"
    MASK_PATH = "/mnt/Storage/software/x/NMF_RESULT/dhsHG19.bed"
Here's your problem: CACHE_TYPE = 'simple'. From the SimpleCache documentation:
Simple memory cache for single process environments. This class exists mainly for the development server and is not 100% thread safe.
For production, better-suited backends are memcached, redis, and the filesystem, since they're designed to work in concurrent environments.
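For example, a minimal sketch (my own, not from the answer) of pointing Flask-Cache at a shared Redis instance so every mod_wsgi process sees the same cache; the host, port, and key prefix are placeholders:
class RadarConfig(object):
    # ... existing settings unchanged ...
    CACHE_TYPE = 'redis'            # shared backend instead of per-process 'simple'
    CACHE_REDIS_HOST = 'localhost'  # placeholder host
    CACHE_REDIS_PORT = 6379
    CACHE_KEY_PREFIX = 'radar_'
    CACHE_DEFAULT_TIMEOUT = 5000000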

Can I make test dummies for every kind of service for an integrated app?

I have a fairly complex app that uses celery, mongo, redis and pyramid. I use nose for testing. I'm not doing TDD (not test-first, at least), but I am trying very hard to get a decent amount of coverage. I'm stuck on the parts that are integrated with some of the above services. For example, I'm using redis for shared memory between celery tasks, but I'd like to be able to switch to memcache without too much trouble, so I've abstracted out the following functions:
from pickle import dumps, loads

from redis import StrictRedis

import settings

db = StrictRedis(host=settings.db_uri, db=settings.db_name)


def has_task(correlation_id):
    """Return True if a task exists in db."""
    return db.exists(str(correlation_id))


def pop_task(correlation_id):
    """Get a task from db by correlation_id."""
    correlation_id = str(correlation_id)  # no unicode allowed
    task_pickle = db.get(correlation_id)
    task = loads(task_pickle)
    if task:
        db.delete(correlation_id)
    return task


def add_task(correlation_id, task_id, model):
    """Add a task to db."""
    return db.set(str(correlation_id), dumps((task_id, model)))
I'm also doing similar things to abstract Mongo, which I'm using for persistent storage.
I've seen test suites for integrated web apps that run dummy http servers, create dummy requests and even dummy databases. I'm OK for celery and pyramid, but I haven't been able to find dummies for mongo and redis, so I'm only able to run tests for the above when those services are actually running. Is there any way to provide dummy services for the above so I don't have:
to have external services installed and running, and
to manually create and destroy entire databases (in-memory dummies can be counted on to clean up after themselves)?
I would suggest you use the mock library for such tasks. It allows you to replace your production objects (for example, the database connection) with pseudo objects that can be given whatever functionality is needed for testing.
Example:
>>> from mock import Mock
>>> db = Mock()
>>> db.exists.return_value = True
>>> db.exists()
True
You can make assertions about how your code interacts with the mock, for example:
>>> db.delete(1)
<Mock name='mock.delete()' id='37588880'>
>>> db.delete.assert_called_with(1)
>>> db.delete.assert_called_with(2)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\site-packages\mock.py", line 863, in assert_called_with
    raise AssertionError(msg)
AssertionError: Expected call: delete(2)
Actual call: delete(1)
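Building on that, a minimal sketch (my own, not part of the answer) of testing the question's pop_task() helper without a running Redis, by patching the module-level db object; the module name tasks_store is a hypothetical placeholder for wherever those helpers live:
from pickle import dumps

from mock import Mock, patch

import tasks_store  # hypothetical module containing has_task/pop_task/add_task


def test_pop_task_deletes_after_get():
    fake_db = Mock()
    fake_db.get.return_value = dumps(('task-id', {'some': 'model'}))

    with patch.object(tasks_store, 'db', fake_db):
        task = tasks_store.pop_task('abc123')

    assert task == ('task-id', {'some': 'model'})
    fake_db.delete.assert_called_with('abc123')  # the entry was cleaned up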
