In this document it is mentioned that the default read_policy setting is ndb.EVENTUAL_CONSISTENCY.
After I did a bulk delete of entity items from the Datastore, versions of the app I pulled up continued to read the old data, so I've tried to figure out how to change this to STRONG_CONSISTENCY, with no success, including:
entity.query().fetch(read_policy=ndb.STRONG_CONSISTENCY) and
...fetch(options=ndb.ContextOptions(read_policy=ndb.STRONG_CONSISTENCY))
The error I get is
BadArgumentError: read_policy argument invalid ('STRONG_CONSISTENCY')
How does one change this default? More to the point, how can I ensure that NDB will go to the Datastore to load a result rather than relying on an old cached value? (Note that after the bulk delete the datastore browser tells me the entity is gone.)
You cannot change that default; it is also the only option available. From the very doc you referenced (no other options are mentioned):
Description
Set this to ndb.EVENTUAL_CONSISTENCY if, instead of waiting for the
Datastore to finish applying changes to all returned results, you wish
to get possibly-not-current results faster.
The same is confirmed by inspecting the google/appengine/ext/ndb/context.py file (there is no STRONG_CONSISTENCY definition in it):
# Constant for read_policy.
EVENTUAL_CONSISTENCY = datastore_rpc.Configuration.EVENTUAL_CONSISTENCY
EVENTUAL_CONSISTENCY ends up in ndb via google/appengine/ext/ndb/__init__.py:
from context import *
__all__ += context.__all__
You might be able to avoid the error using a hack like this:
from google.appengine.datastore.datastore_rpc import Configuration
...fetch(options=ndb.ContextOptions(read_policy=Configuration.STRONG_CONSISTENCY))
However, I think that only applies to reading the entities for the keys obtained through the query, not to obtaining the list of keys itself. That list comes from the index the query uses, which is always eventually consistent; this is the root cause of your deleted entities still appearing in the results (for a while, until the index is updated). From Keys-only Global Query Followed by Lookup by Key:
But it should be noted that a keys-only global query can not exclude
the possibility of an index not yet being consistent at the time of
the query, which may result in an entity not being retrieved at all.
The result of the query could potentially be generated based on
filtering out old index values. In summary, a developer may use a
keys-only global query followed by lookup by key only when an
application requirement allows the index value not yet being
consistent at the time of a query.
Potentially of interest: Bulk delete datastore entity older than 2 days
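If you need strongly consistent listings of this kind, the usual approach is an ancestor query, which requires designing the entities with a common parent key; a minimal sketch (model and key names are made up for illustration):
from google.appengine.ext import ndb

# Hypothetical parent key used only to form an entity group.
root_key = ndb.Key('EntityRoot', 'default')

class MyEntity(ndb.Model):
    name = ndb.StringProperty()

# Written as a child of root_key...
MyEntity(parent=root_key, name='example').put()

# ...so this ancestor query is strongly consistent: it immediately reflects
# puts and deletes within the entity group, unlike a global (non-ancestor) query.
results = MyEntity.query(ancestor=root_key).fetch()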
Is there a performance advantage (or disadvantage) when using default instead of server_default for mapping table column default values when using SQLAlchemy with PostgreSQL?
My understanding is that default renders the expression in the INSERT (usually) and that server_default places the expression in the CREATE TABLE statement. Seems like server_default is analogous to typical handling of defaults directly in the db such as:
CREATE TABLE example (
id serial PRIMARY KEY,
updated timestamptz DEFAULT now()
);
...but it is not clear to me if it is more efficient to handle defaults on INSERT or via table creation.
Would there be any performance improvement or degradation for row inserts if each of the default parameters in the example below were changed to server_default?
from uuid import uuid4
from sqlalchemy import Column, Boolean, DateTime, Integer
from sqlalchemy.dialects.postgresql import UUID
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.sql import func
Base = declarative_base()
class Item(Base):
__tablename__ = 'item'
id = Column(UUID(as_uuid=True), primary_key=True, default=uuid4)
count = Column(Integer, nullable=False, default=0)
flag = Column(Boolean, nullable=False, default=False)
updated = Column(DateTime(timezone=True), nullable=False, default=func.now())
NOTE: The best explanation I found so far for when to use default instead of server_default does not address performance (see Mike Bayer's SO answer on the subject). My oversimplified summary of that explanation is that default is preferred over server_default when...
The db can't handle the expression you need or want to use for the default value.
You can't or don't want to modify the schema directly.
...so the question remains as to whether performance should be considered when choosing between default and server_default?
It is impossible to give you a 'this is faster' answer, because performance per default value expression can vary widely, both on the server and in Python. A function to retrieve the current time behaves differently from a scalar default value.
Next, you must realise that defaults can be provided in five different ways:
Client-side scalar defaults. A fixed value, such as 0 or True. The value is used in an INSERT statement.
Client-side Python function. Called each time a default is needed, produces the value to insert, used the same way as a scalar default from there on out. These can be context sensitive (have access to the current execution context with values to be inserted).
Client-side SQL expression; this generates an extra piece of SQL expression that is then used in the query and executed on the server to produce a value.
Server-side DDL expressions are SQL expressions that are stored in the table definition, so they are part of the schema. The server uses these to fill in a value for any columns omitted from INSERT statements, or when a column value is set to DEFAULT in an INSERT or UPDATE statement.
Server-side implicit defaults or triggers, where other DDL such as triggers or specific database features provide a default value for columns.
Note that when it comes to a SQL expression determining the default value, be that a client-side SQL expression, a server-side DDL expression, or a trigger, it makes very little difference to the database where the default value expression is coming from. The query executor needs to know how to produce values for a given column; whether that is parsed out of the DML statement or out of the schema definition, the server still has to execute the expression for each row.
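A minimal sketch of what the first four of those variants look like as column definitions (model and column names here are made up for illustration):
from datetime import datetime

from sqlalchemy import Column, DateTime, Integer
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.sql import func

Base = declarative_base()

class Example(Base):
    __tablename__ = 'example'
    id = Column(Integer, primary_key=True)
    # 1. client-side scalar default: the literal 0 goes into the INSERT
    count = Column(Integer, nullable=False, default=0)
    # 2. client-side Python function: called per row, result goes into the INSERT
    created = Column(DateTime(timezone=True), default=datetime.utcnow)
    # 3. client-side SQL expression: now() is rendered inside the INSERT statement
    updated = Column(DateTime(timezone=True), default=func.now())
    # 4. server-side DDL expression: rendered as DEFAULT now() in CREATE TABLE
    seen = Column(DateTime(timezone=True), server_default=func.now())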
Choosing between these options is rarely going to be based on performance alone; performance should be only one of several aspects you consider. There are many factors involved here:
default with a scalar or Python function directly produces a Python default value, then sends the new value to the server when inserting. Python code can access the default value before the data is inserted into the database.
A client-side SQL expression, a server_default value, and server-side implicit defaults and triggers all have the server generate the default, which then must be fetched by the client if you want to be able to access it in the same SQLAlchemy session. You can't access the value until the object has been inserted into the database.
Depending on the exact query and database support, SQLAlchemy may have to make extra SQL queries to either generate a default before the INSERT statement or run a separate SELECT afterwards to fetch the defaults that have been inserted. You can control when this happens (directly when inserting or on first access after flushing, with the eager_defaults mapper configuration).
If you have multiple clients on different platforms accessing the same database, a server_default or other default attached to the schema (such as a trigger) ensures that all clients use the same defaults regardless of platform, while defaults implemented in Python can't be accessed by those other platforms.
When using PostgreSQL, SQLAlchemy can make use of the RETURNING clause for DML statements, which gives a client access to server-side generated defaults in a single step.
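If you do need those server-generated values back immediately, the eager_defaults mapper option tells SQLAlchemy to fetch them as part of the flush (on PostgreSQL via RETURNING); a minimal sketch, with a made-up model:
from sqlalchemy import Column, DateTime, Integer
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.sql import func

Base = declarative_base()

class Widget(Base):
    __tablename__ = 'widget'
    # Fetch server-generated defaults as part of the INSERT (RETURNING on
    # PostgreSQL) instead of with a separate SELECT on first access.
    __mapper_args__ = {'eager_defaults': True}
    id = Column(Integer, primary_key=True)
    updated = Column(DateTime(timezone=True), server_default=func.now())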
So when using a server_default column default that calculates a new value for each row (not a scalar value), you save a small amount of Python-side time, and save a small amount of network bandwidth as you are not sending data for that column over to the database. The database could be faster creating that same value, or it could be slower; it largely depends on the type of operation. If you need to have access to the generated default value from Python, in the same transaction, you do then have to wait for a return stream of data, parsed out by SQLAlchemy. All these details can become insignificant compared to everything else that happens around inserting or updating rows, however.
Do understand that an ORM is not suitable for high-performance bulk row inserts or updates; quoting from the SQLAlchemy Performance FAQ entry:
The SQLAlchemy ORM uses the unit of work pattern when synchronizing changes to the database. This pattern goes far beyond simple “inserts” of data. It includes that attributes which are assigned on objects are received using an attribute instrumentation system which tracks changes on objects as they are made, includes that all rows inserted are tracked in an identity map which has the effect that for each row SQLAlchemy must retrieve its “last inserted id” if not already given, and also involves that rows to be inserted are scanned and sorted for dependencies as needed. Objects are also subject to a fair degree of bookkeeping in order to keep all of this running, which for a very large number of rows at once can create an inordinate amount of time spent with large data structures, hence it’s best to chunk these.
Basically, unit of work is a large degree of automation in order to automate the task of persisting a complex object graph into a relational database with no explicit persistence code, and this automation has a price.
ORMs are basically not intended for high-performance bulk inserts - this is the whole reason SQLAlchemy offers the Core in addition to the ORM as a first-class component.
Because an ORM like SQLAlchemy comes with a hefty overhead price, any performance difference between a server-side and a Python-side default quickly disappears in the noise of ORM operations.
So if you are concerned about performance for large-quantity insert or update operations, you would want to use bulk operations for those, and enable the psycopg2 batch execution helpers to really get a speed boost. When using these bulk operations, I'd expect server-side defaults to improve performance just by saving bandwidth moving row data from Python to the server, but how much depends on the exact nature of the default values.
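A rough sketch of such a bulk path, reusing the Item model from the question (the connection URL is a placeholder, and executemany_mode='values' assumes SQLAlchemy 1.3+ with the psycopg2 dialect):
from sqlalchemy import create_engine

# The psycopg2 batch/values helpers are enabled via the dialect option below.
engine = create_engine('postgresql+psycopg2://user:pass@localhost/mydb',
                       executemany_mode='values')

# Core-level bulk insert: a single executemany, no per-object ORM bookkeeping.
# Columns with server_default would simply be omitted from the dictionaries
# and filled in by the database.
with engine.begin() as conn:
    conn.execute(
        Item.__table__.insert(),
        [{'count': i, 'flag': False} for i in range(10000)]
    )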
If ORM insert and update performance outside of bulk operations is a big issue for you, you need to test your specific options. I'd start with the SQLAlchemy examples.performance package and add your own test suite using two models that differ only in a single server_default and default configuration.
There's something else important to consider beyond just comparing the performance of the two.
If you need to add a new NOT NULL column create_at to an existing table user that already contains data, default will not work.
With default, the migration fails with an error saying a NULL value cannot be inserted for the rows already in the table. This causes significant trouble if you want to keep your data, even just for testing.
With server_default, the database fills in the current datetime for all pre-existing rows while upgrading the DB.
So in this case, only server_default will work.
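For example, in an Alembic migration (hypothetical names matching the scenario above), the server_default lets the database backfill the existing rows so the NOT NULL constraint can be applied in one step:
import sqlalchemy as sa
from alembic import op

def upgrade():
    # server_default backfills existing rows, so NOT NULL is satisfiable
    # in a single step even on a populated table.
    op.add_column(
        'user',
        sa.Column('create_at', sa.DateTime(timezone=True),
                  nullable=False, server_default=sa.func.now())
    )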
Language: Python.
I'm using the Datastore Python library and everything works fine. When using a Datastore query, I can add query.order = ['name'] to sort the results. But when I query a kind with an ancestor, like:
ancestor = client.key('user', name, namespace = NAMESPACE)
query = client.query(kind='project', ancestor=ancestor, namespace = NAMESPACE)
Then when I set query.order = ['name'], it doesn't work. I want to sort on the kind project, whose ancestor is the kind user.
The error message is "400 no matching index found. recommended index is:" followed by a YAML index definition (kind: project, ancestor: yes, properties: - name: name). But I'm not using YAML here. I think there must be a way to sort the results even though there's an ancestor.
All datastore queries are index-based. From Indexes:
Every Google Cloud Datastore query computes its results using one
or more indexes which contain entity keys in a sequence specified by
the index's properties and, optionally, the entity's ancestors. The
indexes are updated to reflect any changes the application makes to
its entities, so that the correct results of all queries are available
with no further computation needed.
The error you're getting comes from the Datastore side, indicating that the index required for your specific query is missing in the Datastore. It is unrelated to doing an ancestor query and/or to using yaml in your application code.
The yaml reference is simply coming from the way the indexes are being deployed to the Datastore (which is something you need to do for your query to work), see Deploying or deleting indexes.
So you need to create an index.yaml file containing at least the index specification indicated in the error message (and any other indexes you may need for other queries, if any), deploy it to the Datastore, wait for the index to get into the Serving state on the Indexes page (which may take a while), after which your query should work.
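Concretely, the index.yaml entry for this query would look roughly like the one the error message suggests (add further entries for any other composite indexes you need):
indexes:
- kind: project
  ancestor: yes
  properties:
  - name: name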
From what I have been reading in the Google Docs and other SO questions, keys_only queries should return strongly consistent results (here and here, for example).
My code looks something like this:
class ClientsPage(SomeHandler):
def get(self):
query = Client.query()
clients = query.fetch(keys_only=True)
self.write(len(clients))
Even though I am fetching the results with the keys_only=True parameter I am getting stale results right after the creation of a new Client object (which is a root entity). If there were 2 client objects before the insertion, it keeps showing 2 after inserting and redirecting. I have to manually refresh the page in order to see the number change to 3.
I understand I could use ancestor queries, but I am testing some things first and I was surprised to see that a keys_only query returned stale results. Can anyone please explain to me what's going on?
EDIT 1:
This happened in the development server, I have not tested it in production.
Eventual consistency exists because the Datastore needs time to update all indexes. A keys-only query is the same as any other query, except that it tells the Datastore: I don't need the entire entity, just return the key. The query still looks at the indexes to get the list of results.
In contrast, getting an entity by key does not need to look at the indexes, so it is always strongly consistent.
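A quick way to see the difference with your Client model (sketch only): a lookup by key reflects the write immediately, while the keys-only query reads an index that may lag:
# Write a new entity; put() returns its key.
key = Client().put()

# Strongly consistent: fetching by key bypasses the indexes entirely.
fresh = key.get()

# Eventually consistent: this reads from an index that may not yet
# contain the key of the entity written above.
keys = Client.query().fetch(keys_only=True)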
I use the Python SDK to create a new BigQuery table:
tableInfo = {
'tableReference':{
'datasetId':datasetId,
'projectId':projectId,
'tableId':targetTableId
},
'schema':schema
}
result = bigquery_service.tables().insert(projectId=projectId,
datasetId=datasetId,
body=tableInfo).execute()
The result variable contains the created table information with etag,id,kind,schema,selfLink,tableReference,type - therefore I assume the table is created correctly.
Afterwards I even see the table listed when I call bigquery_service.tables().list(...)
The problem is:
When inserting right after that, I still (often) get an error: Not found: MY_TABLE_NAME
My insert function call looks like this:
response = bigquery_service.tabledata().insertAll(
projectId=projectId,
datasetId=datasetId,
tableId=targetTableId,
body=body).execute()
I even retried the insert multiple times with 3 seconds of sleep between retries. Any ideas?
My projectId is stylight-bi-testing.
There were a lot of failures between 10:00 and 12:00 (times given in UTC).
Per your answers to my question regarding using NOT_FOUND as an indicator to create the table, this is intended (though admittedly somewhat frustrating) behavior.
The streaming insertion path caches information about tables (and the authorization of a user to insert into the table). This is because of the intended high-QPS nature of the API. We also cache certain negative responses in order to protect against buggy or abusive clients. One of those cached negative responses is the non-existence of a destination table. We've always done this on a per-machine basis, but recently added an additional centralized cache, such that all machines will see the negative cache result almost immediately after the first NOT_FOUND response is returned.
In general, we recommend that table creation not occur inline with insert requests, because in a system that is issuing thousands of QPS of inserts, a table miss could result in thousands of table creation operations which can be taxing on our system. Instead, if you know the possible set of tables beforehand, we recommend some periodic process that performs table creations in advance of their usage as a streaming destination. If your destination tables are more dynamic in nature, you may need to implement a delay after table creation has been performed.
Apologies for the difficulty. We do hope to address this issue, but we don't have any timeframe yet for doing so.
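Until that changes, one client-side workaround is to treat a NOT_FOUND (HTTP 404) from the streaming call as retryable for a short window after creating the table; a sketch along those lines (the function name and retry numbers are made up, and the exact error handling depends on your client library version):
import time
from googleapiclient.errors import HttpError  # apiclient.errors in older versions

def insert_with_retry(bigquery_service, projectId, datasetId, tableId, body,
                      attempts=10, delay=5):
    """Retry streaming inserts while the freshly created table is still
    reported as NOT_FOUND by the (cached) streaming backend."""
    for attempt in range(attempts):
        try:
            return bigquery_service.tabledata().insertAll(
                projectId=projectId, datasetId=datasetId,
                tableId=tableId, body=body).execute()
        except HttpError as e:
            if e.resp.status == 404 and attempt < attempts - 1:
                time.sleep(delay)
                continue
            raise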
I am interested in adding a spell checker to my app -- I'm planning on using difflib with a custom word list that's ~147kB large (13,025 words).
When testing user queries against this list, would it make more sense to:
load the dictionary into memcache (I guess from the datastore?) and keep it in memcache or
build an index for the Search API and pass in the user query against it there
I guess what I'm asking is which is faster: memcache or a search index?
Thanks.
Memcache is definitely faster.
Another important consideration is cost. Memcache API calls are free, while Search API calls have their own quota and pricing.
By the way, you may store your library as a static file, because it's small and it does not change. There is no need to store it in the Datastore.
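As a rough sketch of that static-file approach combined with difflib (the file name is made up):
import difflib

# Load the word list once per instance; ~13k words fit comfortably in memory.
with open('wordlist.txt') as f:
    WORDS = [line.strip() for line in f if line.strip()]

def suggest(query, n=3, cutoff=0.8):
    """Return up to n close matches for a possibly misspelled query."""
    return difflib.get_close_matches(query.lower(), WORDS, n=n, cutoff=cutoff)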
Memcache is faster; however, you need to consider the following.
It is not reliable: at any moment entries can be purged, so your code needs a fallback for non-cached data.
You can only fetch by key, so as you said you would need to store the whole dictionary in a memcache object.
Each memcache entry can only store 1 MB. If your dictionary is larger you would have to span multiple entries. OK, not relevant in your case.
There are some other alternatives. How often will the dictionary be updated ?
Here is one alternate strategy.
You could store it in the filesystem (requires app updates), or in GCS if you want to update the dictionary outside of app updates. Then you can load the dictionary into memory in each instance, at startup or on first request, and cache it at the running-instance level, so you won't have any round trips to services adding latency. This will also be simpler code-wise (i.e. no fallbacks if the data is not in memcache, etc.).
Here is an example. In this case the code lives in a module, which is imported as required. I am using a YAML file for additional configuration, but it could just as easily load a JSON dictionary, or you could define a Python dictionary in the module.
import yaml  # PyYAML; the original snippet omitted this import

_modsettings = {}

def loadSettings(settings='settings.yaml'):
    x = _modsettings
    if not x:
        try:
            # Parse the YAML config once; later calls return the cached dict.
            _modsettings.update(yaml.safe_load(open(settings, 'r').read()))
        except IOError:
            pass
    return _modsettings
settings = loadSettings()
Then whenever I want the settings dictionary my code just refers to mymodule.settings.
By importing this module during a warmup request you won't get a race condition, or have to import/parse the dictionary during a user facing request. You can put in more error traps as appropriate ;-)
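For instance, a warmup handler (assuming webapp2 and that warmup is enabled via inbound_services: - warmup in app.yaml) only needs to import the module, so the parsing happens before user-facing traffic arrives:
import webapp2

import mymodule  # importing triggers loadSettings() at module load time

class WarmupHandler(webapp2.RequestHandler):
    def get(self):
        # Touching the cached dict is enough; the real work happened on import.
        self.response.write('warmed up: %d settings' % len(mymodule.settings))

app = webapp2.WSGIApplication([('/_ah/warmup', WarmupHandler)])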