I use the python sdk to create a new bigquery table:
tableInfo = {
'tableReference':{
'datasetId':datasetId,
'projectId':projectId,
'tableId':targetTableId
},
'schema':schema
}
result = bigquery_service.tables().insert(projectId=projectId,
datasetId=datasetId,
body=tableInfo).execute()
The result variable contains the created table information with etag,id,kind,schema,selfLink,tableReference,type - therefore I assume the table is created correctly.
Afterwards I even get the table, when I call bigquery_service.tables().list(...)
The problem is:
When inserting right after that, I still (often) get an error: Not found: MY_TABLE_NAME
My insert function call looks like this:
response = bigquery_service.tabledata().insertAll(
projectId=projectId,
datasetId=datasetId,
tableId=targetTableId,
body=body).execute()
I even retried the insert multiple times with 3 seconds of sleep between retries. Any ideas?
My projectId is stylight-bi-testing
There were a lot failures between 10:00 and 12:00 (time given in UTC)
Per your answers to my question regarding using NOT_FOUND as an indicator to create the table, this is intended (though admittedly somewhat frustrating) behavior.
The streaming insertion path caches information about tables (and the authorization of a user to insert into the table). This is because of the intended high QPS nature of the API. We also cache certain negative responses in order to protect again buggy or abusive clients. One of those cached negative responses is the non-existence of a destination table. We've always done this on a per-machine basis, but recently added an additional centralized cache, such that all machines will see the negative cache result almost immediately after the first NOT_FOUND response is returned.
In general, we recommend that table creation not occur inline with insert requests, because in a system that is issuing thousands of QPS of inserts, a table miss could result in thousands of table creation operations which can be taxing on our system. Instead, if you know the possible set of tables beforehand, we recommend some periodic process that performs table creations in advance of their usage as a streaming destination. If your destination tables are more dynamic in nature, you may need to implement a delay after table creation has been performed.
Apologies for the difficulty. We do hope to address this issue, but we don't have any timeframe yet for doing so.
Related
Is there any caching library for Python (or general technique) that can cache a database query result until the underlying tables of the query have been updated?
The cache should never output stale values. At the same time, the application should only need to query the database once for each change in the data.
I want to optimize a Flask app. I am facing this issue a lot with pages that have a list of objects that change infrequently. It is detrimental to present stale data, so a time-based cache cannot be used.
Right now there are hundreds of queries per hour due to multiple users accessing these pages. I would like to reduce that to the absolute minimum (i.e. only when there is an update to the data), and keep the data cached in-memory.
A possible approach would be to maintain last_updated timestamps for each table somewhere (possibly Redis) and check these before querying the database.
I am using dynamodb with python API and denormalize my data in order to keep the reads fast. The think is that I am worried about keeping the consistency when updating my data say i have a table of users, each has a key and a name, and a table of purchases each has a key and a data containing buyer key (user) and the buyer's name.
I would like to update the user's name and update all his purchases using an atomic operation, like available in firebase (multi path update) explained here
How can I do that?
Thanks
Here is a nice documentation of dynamodb transaction.
Here are few highlights of the blog post.
Dynamodb supports transaction capability across multiple table where you can also have pre-condition on every insert (i.e. insert into order table only if prev_snapshot=1223232, this will make sure you are modifying the last read data only.)
There are 2 types of gets supported TransactGetItems and Eventual/Strongly consistent GetItem. In TransactGetItems, if a transaction is in progress the request is rejected. while in the other 2 cases last committed data is returned based on your consistency requirements.
Transactions are not locks if some other thread is writing to a table without transaction, and if write succeeds before transaction is completed, and exception will be thrown on transaction.
No extra steps/permissions are required to enable transaction on a single region table.
Cost will double for every read and write whiles using transactional capabilities.
Here are the features which are not supported
Transactional capabilities in global table. but this can be avoided by request stickiness and should not be a big issue IMO.
Some devices are asynchronously storing values on a common remote MySQL database server.
I would like to write a supervisor app in Python (and possibly SQLAlchemy) to recognize the external INSERT events on the database and act upon the last rows' data. This is to avoid a long manual test to see if every table is being updated regularly or a logger crashed.
Can somebody just tell me where to search online this kind of info and, even better, an example?
EDIT
I already read all tables periodically using a datetime primary key ({date_time}), loading the last row of each table, and comparing to the previous values:
SELECT * FROM table ORDER BY date_time DESC LIMIT 1
but it looks very cumbersome and doesn't guarantee that I don't lose some rows between successive database checks.
The engine is an old version of INNODB that I cannot upgrade: I cannot use the UPDATE field in schema because it simply doesn't work.
To reword my question:
How to listen any database event with a daemon-like Python application (sleeping thread) and wake up only when something happens?
I want also to avoid SQL triggers because this would be just too heavy to manage: tables are in hundreds and they are added/removed very often according to the active loggers.
I gave a look to SQLAlchemy but all reference I could find, if I don't misunderstood it, are decorators to act on INSERTs made by SQLAlchemy's itself. I didn't find anything about external changes to the database.
About the example request: I am not interested in a copy-and-paste, because first I want to understand how stuff works. I prefer (even incomplete) examples because SQLAlchemy documentation is far too deep for my knowledge and I simply cannot put the pieces together.
My python web application uses DynamoDB as its datastore, but this is probably applicable to other NoSQL tables where index consistency is done at the application layer. I'm de-normalizing data and creating indicies in several tables to facilitate lookups.
For example, for my users table:
* Table 1: (user_id) email, employee_id, first name, last name, etc ...
Table 2: (email) user_id
Table 3: (employee_id) user_id
Table 1 is my "primary table" where user info is stored. If the user_id is known, all info about a user can be retrieved in a single GET query.
Table 2 and 3 enable lookups by email or employee_id, requiring a query to those tables first to get the user_id, then a second query to Table 1 to retrieve the rest of the information.
My concern is with the de-normalized data -- what is the best way to handle deletions from Table 1 to ensure the matching data gets deleted from Tables 2 + 3? Also ensuring inserts?
Right now my chain of events is something like:
1. Insert row in table 1
2. Insert row in table 2
3. Insert row in table 3
Does it make sense to add "checks" at the end? Some thing like:
4. Check that all 3 rows have been inserted.
5. If a row is missing, remove rows from all tables and raise an error.
Any other techniques?
Short answer is: There is no way to ensure consistency. This is the price you agreed to pay when moving to NoSQL in trade of performances and scalability.
DynamoDB-mapper has a "transaction engine". Transaction objects are plain DynamoDB Items and may be persisted. This way, If a logical group of actions aka transaction has succeeded, we can be sure of it by looking at the persisted status. But we have no mean to be sure it has not...
To do a bit of advertisment :) , dynamodb-mapper transaction engine supports
single/multiple targets
sub transactions
transaction creating objects (not released yet)
If you are rolling your own mapper (which is an enjoyable task), feel free to have a look at our source code: https://bitbucket.org/Ludia/dynamodb-mapper/src/52c75c5df921/dynamodb_mapper/transactions.py
Disclaimer: I am one of the main dynamodb-mapper project. Feel free to contribute :)
Disclaimer: I haven't actually used DynamoDB, just looked through the data model and API, so take this for what it's worth.
The use case you're giving is one primary table for the data, with other tables for hand-rolled indices. This really sounds like work for an RDBMS (maybe with some sharding for growth). But, if that won't cut it, here a couple of ideas which may or may not work for you.
A. Leave it as it is. If you'll never serve data from your index tables, then maybe you can afford to have lazy deletion and insertion as long as you handle the primary table first. Say this happens:
1) Delete JDoe from Main table
xxxxxxxxxx Process running code crashes xxxxxxx
2) Delete from email index // Never gets here
3) Delete from employee_id index // Never gets here
Well, if an "email" query comes in, you'll resolve the corresponding user_id from the index (now stale), but it won't show up on the main table. You know that something is wrong, so you can return a failure/error and clean up the indexes. In other words, you just live with some stale data and save yourself the trouble, cleaning it up as necessary. You'll have to figure out how much stale data to expect, and maybe write a script that does some housekeeping daily.
B. If you really want to simulate locks and transactions, you could consider using something like Apache Zookeeper, which is a distributed system for managing shared resources like locks. It'd be more work and overhead, but you could probably set it up to do what you want.
I have several CouchDB databases. The largest is about 600k documents, and I am finding that queries are prohibitively long (several hours or more). The DB is updated infrequently (once a month or so), and only involves adding new documents, never updating existing documents.
Queries are of the type: Find all documents where key1='a' or multiple keys: key1='a', key2='b'...
I don't see that permanent views are practical here, so have been using the CouchDB-Python 'query' method.
I have tried several approaches, and I am unsure what is most efficient, or why.
Method 1:
map function is:
map_fun = '''function(doc){
if(doc.key1=='a'){
emit(doc.A, [doc.B, doc.C,doc.D,doc.E]);
}
}'''
The Python query is:
results = ui.db.query(map_fun, key2=user)
Then some operation with results.rows. This takes up the most time.
It takes about an hour for 'results.rows' to come back. If I change key2 to something else, it comes back in about 5 seconds. If I repeat the original user, it's also fast.
But sometimes I need to query on more keys, so I try:
map_fun = '''function(doc){
if(doc.key1=='a' && doc.key2=user && doc.key3='something else' && etc.){
emit(doc.A, [doc.B, doc.C,doc.D,doc.E]);
}
}'''
and use the python query:
results = ui.db.query(map_fun)
Then some operation with results.rows
Takes a long time for the first query. When I change key2, takes a long time again. If
I change key2 back to the original data, takes the same amount of time. (That is, nothing seems to be getting cached, B-tree'ed or whatever).
So my question is: What's the most efficient way to do queries in couchdb-python, where the queries are ad hoc and involve multiple keys for search criteria?
The UI is QT-based, using PyQt underneath.
There are two caveats for couchdb-python db.query() method:
It executes temporary view. This means that code flow processing would be blocked until this all documents would be proceeded by this view. And this would happened again and again for each call. Try to save view and use db.view() method instead to get results on demand and have incremental index updates.
It's reads whole result no matter how bigger it is. db.query() nor db.view() methods aren't lazy so if view result is 100 MB JSON object, you have to fetch all this data before use them somehow. To query data in more memory-optimized way, try to apply patch to have db.iterview() method - it allows you to fetch data in pagination style.
I think that the fix to your problem is to create an index for the keys you are searching. It is what you called permanent view.
Note the difference between map/reduce and SQL queries in a B-tree based table:
simple SQL query searching for a key (if you have an index for it) traverses single path in the B+-tree from root to leaf,
map function reads all the elements, event if it emits small result.
What you are doing is for each query
reading every document (most of the cost) and
searching for a key in the emitted result (quick search in the B-tree).
and I think your solution has to be slow by the design.
If you redesign database structure to make permanent views practical, (1.) will be executed once and only (2.) will be executed for each query. Each document will be read by a view after addition to DB and a query will search in B-tree storing emitted result. If emitted set is smaller than the total documents number, then the query searches smaller structure and you have the benefit over SQL databases.
Temporary views are far less efficient, then the permanent ones and are meant to be used only for development. CouchDB was designed to work with permanent views. To make map/reduce efficient one has to implement caching or make the view permanent. I am not familiar with the details of the CouchDB implementation, perhaps second query with different key is faster because of some caching. If for some reason you have to use temporary view then perhaps CouchDB is a mistake and you should consider DBMS created and optimized for online queries like MongoDB.