Intro
I'm writing an application in Python using a Cassandra 1.2 cluster (7 nodes, replication factor 3) and I'm accessing Cassandra from Python using the cql library (CQL 3.0).
The problem
The application is built so that, when a CQL statement is run against an unconfigured column family, it automatically creates the table and retries the statement. For example, if I try to run this:
SELECT * FROM table1
And table1 doesn't exist, then the application runs the corresponding CREATE TABLE for table1 and retries the previous SELECT. The problem is that, after the creation of the table, the SELECT (the retry) fails with this error:
Request did not complete within rpc_timeout
The question
I assume the cluster needs some time to propagate the creation of the table, or something like that? If I wait a few seconds between the creation of the table and the retry of the SELECT statement, everything works, but I want to know exactly why and whether there is a better way of doing it. Perhaps making the CREATE TABLE wait for the changes to propagate before returning? Is there a way of doing that?
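For reference, the create-and-retry logic currently looks roughly like this (a simplified sketch: connection details, the table schema and the fixed sleep are placeholders, and the exact exception raised for an unconfigured column family may differ):

import time
import cql

# Placeholder connection details; the real app connects to the 7-node cluster.
conn = cql.connect('node1', 9160, 'my_keyspace', cql_version='3.0.0')
cursor = conn.cursor()

CREATE_TABLE1 = """
    CREATE TABLE table1 (
        id text PRIMARY KEY,
        value text
    )
"""

def run_with_auto_create(query, create_statement):
    try:
        cursor.execute(query)
    except cql.ProgrammingError:
        # Column family not configured yet: create it and retry the query.
        cursor.execute(create_statement)
        time.sleep(5)          # the workaround I'd like to get rid of
        cursor.execute(query)  # without the sleep this retry hits rpc_timeout
    return cursor.fetchall()

rows = run_with_auto_create("SELECT * FROM table1", CREATE_TABLE1)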
Thanks in advance
I am assuming you are using cqlsh. The default consistency level for cqlsh is ONE, meaning a statement returns after the first node completes it, but not necessarily before all nodes do. When you then read, you aren't guaranteed to read from a node that already has the completed table. You can check this by turning on tracing, but that will affect performance.
You can enforce a stronger consistency level, which should make the CREATE wait until the table is created on all nodes:
CREATE TABLE ... USING CONSISTENCY ALL
Related
I have a table with 30k clients, with the ClientID as primary key.
I'm getting data from API calls and inserting it into the table using Python.
I'd like to find a way to insert rows for new clients and, if the ClientID that comes with the API call already exists in the table, update the existing record with the updated information for that client.
Thanks!!
A snippet of code would be nice to show us what exactly you are doing right now. I presume you are using an ORM like SQLAlchemy? If so, then you are looking at doing an UPSERT type of operation.
That is already answered HERE
Alternatively, if you are executing raw queries without an ORM, you could write a custom procedure and pass the required parameters. HERE is a good write-up on how that is done in MSSQL under high concurrency. You could use it as a starting point for understanding and then rewrite it for PostgreSQL.
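If the target is PostgreSQL (9.5+) and you are on SQLAlchemy 1.1+, the usual shape is an INSERT ... ON CONFLICT DO UPDATE. A minimal sketch, assuming a hypothetical clients table keyed on client_id; the column names and connection string below are made up:

from sqlalchemy import create_engine, MetaData, Table, Column, Integer, String
from sqlalchemy.dialects.postgresql import insert

# Hypothetical table definition; adjust names and types to match the real schema.
engine = create_engine("postgresql://user:password@localhost/mydb")
metadata = MetaData()
clients = Table(
    "clients", metadata,
    Column("client_id", Integer, primary_key=True),
    Column("name", String),
    Column("email", String),
)

def upsert_client(conn, row):
    """Insert a new client, or update the existing row if client_id already exists."""
    stmt = insert(clients).values(**row)
    stmt = stmt.on_conflict_do_update(
        index_elements=["client_id"],
        set_={"name": stmt.excluded.name, "email": stmt.excluded.email},
    )
    conn.execute(stmt)

with engine.begin() as conn:
    upsert_client(conn, {"client_id": 42, "name": "Acme", "email": "ops@acme.example"})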
Good day to All,
I need to work with several databases from Python using peewee; the flow is roughly as follows.
First I need to put an entry in the types table of the MainDb. I only have the Name & Type; the ID is auto-incremented when the row is inserted. Then, using that auto-incremented id (MainID), I need to put a data entry in the respective type table of the respective type DB.
Then I need to do some operations using SELECT queries with MainID in the Python code, and finally I need to delete the entries for that MainID in the MainDb as well as in the respective type DB, whether the operations in the Python code execute successfully or not (i.e. even if they throw an error or exception).
Currently I am using a plain MySQLdb connection (connect, run the queries through a cursor, close), and to get the last auto-incremented id I am using the cursor's lastrowid (cursor.lastrowid).
I also referred to a similar question on Stack Overflow; as far as I understand, that approach is fine if I have a single DB (like the main DB alone), but what do I need to do for my situation?
OS: Windows 7 64-bit,
DB: MySQL v5.7,
Python: v2.7,
peewee: v2.10
Thanks in advance
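A rough sketch of that flow using peewee models instead of raw MySQLdb cursors; every database name, credential, table and column below is illustrative, and the cleanup at the end removes the rows from both databases whether the work in between succeeded or raised:

import peewee

# Two separate databases: the main one holding the types table, and a per-type DB.
main_db = peewee.MySQLDatabase('MainDb', host='localhost', user='root', password='secret')
type_db = peewee.MySQLDatabase('TypeDb', host='localhost', user='root', password='secret')

class TypeEntry(peewee.Model):        # row in MainDb.types
    name = peewee.CharField()
    type = peewee.CharField()
    class Meta:
        database = main_db
        db_table = 'types'

class TypeData(peewee.Model):         # row in the respective type DB
    main_id = peewee.IntegerField()
    payload = peewee.TextField()
    class Meta:
        database = type_db
        db_table = 'type_data'

entry = TypeEntry.create(name='sensor01', type='temperature')
main_id = entry.id                    # auto-incremented PK; no cursor.lastrowid needed
data = TypeData.create(main_id=main_id, payload='...')

try:
    # ... SELECT queries and other work using main_id ...
    pass
finally:
    # Clean up in both databases whether the work above succeeded or raised.
    data.delete_instance()
    entry.delete_instance()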
Some devices are asynchronously storing values on a common remote MySQL database server.
I would like to write a supervisor app in Python (and possibly SQLAlchemy) that recognizes external INSERT events on the database and acts on the data in the last rows. This is to avoid a long manual check of whether every table is being updated regularly or a logger has crashed.
Can somebody tell me where to look online for this kind of info and, even better, point me to an example?
EDIT
I already read all the tables periodically using a datetime primary key (date_time), loading the last row of each table and comparing it to the previous values:
SELECT * FROM table ORDER BY date_time DESC LIMIT 1
but it looks very cumbersome and doesn't guarantee that I don't lose some rows between successive database checks.
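For completeness, the periodic check looks roughly like this (a simplified sketch with SQLAlchemy Core; the connection string and table names are placeholders, and remembering the last seen date_time per table at least avoids skipping rows, though it is still the polling I'd like to replace):

import time
from sqlalchemy import create_engine, text

engine = create_engine("mysql://user:password@dbserver/loggers")
last_seen = {}  # table name -> last date_time already processed

def check_tables(tables):
    with engine.connect() as conn:
        for table in tables:
            since = last_seen.get(table)
            if since is None:
                # First pass: just remember the newest row.
                rows = conn.execute(
                    text("SELECT * FROM `%s` ORDER BY date_time DESC LIMIT 1" % table)
                ).fetchall()
            else:
                # Everything newer than the last row processed, so nothing is lost
                # between checks; this is still polling, though.
                rows = conn.execute(
                    text("SELECT * FROM `%s` WHERE date_time > :since ORDER BY date_time" % table),
                    {"since": since},
                ).fetchall()
            if rows:
                last_seen[table] = rows[-1].date_time
                # ... act on the new rows here ...

while True:
    check_tables(["logger_a", "logger_b"])
    time.sleep(60)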
The engine is an old version of InnoDB that I cannot upgrade: I cannot use an UPDATE timestamp field in the schema because it simply doesn't work.
To reword my question:
How can I listen for any database event from a daemon-like Python application (a sleeping thread) and wake up only when something happens?
I also want to avoid SQL triggers, because they would be just too heavy to manage: there are hundreds of tables, and they are added/removed very often according to the active loggers.
I had a look at SQLAlchemy, but all the references I could find, if I haven't misunderstood them, are decorators that act on INSERTs made by SQLAlchemy itself. I didn't find anything about external changes to the database.
About the example request: I am not interested in copy-and-paste code, because first I want to understand how things work. I prefer (even incomplete) examples, because the SQLAlchemy documentation is far too deep for my knowledge and I simply cannot put the pieces together.
I use the Python SDK to create a new BigQuery table:
tableInfo = {
    'tableReference': {
        'datasetId': datasetId,
        'projectId': projectId,
        'tableId': targetTableId
    },
    'schema': schema
}
result = bigquery_service.tables().insert(projectId=projectId,
                                          datasetId=datasetId,
                                          body=tableInfo).execute()
The result variable contains the created table's information (etag, id, kind, schema, selfLink, tableReference, type), so I assume the table was created correctly.
Afterwards the table even shows up when I call bigquery_service.tables().list(...).
The problem is:
When inserting right after that, I still (often) get an error: Not found: MY_TABLE_NAME
My insert function call looks like this:
response = bigquery_service.tabledata().insertAll(
    projectId=projectId,
    datasetId=datasetId,
    tableId=targetTableId,
    body=body).execute()
I even retried the insert multiple times with 3 seconds of sleep between retries. Any ideas?
My projectId is stylight-bi-testing
There were a lot of failures between 10:00 and 12:00 (times given in UTC).
Per your answers to my question regarding using NOT_FOUND as an indicator to create the table, this is intended (though admittedly somewhat frustrating) behavior.
The streaming insertion path caches information about tables (and the authorization of a user to insert into the table). This is because of the intended high-QPS nature of the API. We also cache certain negative responses in order to protect against buggy or abusive clients. One of those cached negative responses is the non-existence of a destination table. We've always done this on a per-machine basis, but recently added an additional centralized cache, such that all machines will see the negative cache result almost immediately after the first NOT_FOUND response is returned.
In general, we recommend that table creation not occur inline with insert requests, because in a system that is issuing thousands of QPS of inserts, a table miss could result in thousands of table creation operations which can be taxing on our system. Instead, if you know the possible set of tables beforehand, we recommend some periodic process that performs table creations in advance of their usage as a streaming destination. If your destination tables are more dynamic in nature, you may need to implement a delay after table creation has been performed.
Apologies for the difficulty. We do hope to address this issue, but we don't have any timeframe yet for doing so.
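If the destination tables really do have to be created on the fly, a rough sketch of the "delay after table creation" workaround is below, reusing the bigquery_service handle from the question. The retry counts, delays, and the assumption about how the NOT_FOUND surfaces (an HttpError or row-level insertErrors) are guesses to adjust:

import time
from googleapiclient.errors import HttpError  # apiclient.errors in older client versions

def insert_with_retry(bigquery_service, projectId, datasetId, tableId, body,
                      max_attempts=8, initial_delay=1.0):
    """Retry a streaming insert while a freshly created table is still reported as not found."""
    delay = initial_delay
    for attempt in range(max_attempts):
        try:
            response = bigquery_service.tabledata().insertAll(
                projectId=projectId,
                datasetId=datasetId,
                tableId=tableId,
                body=body).execute()
        except HttpError as err:
            # Only keep retrying on "not found"; give up on the last attempt.
            if err.resp.status != 404 or attempt == max_attempts - 1:
                raise
        else:
            if not response.get('insertErrors'):
                return response
        # Back off; the negative NOT_FOUND cache can take a while to clear.
        time.sleep(delay)
        delay = min(delay * 2, 30)
    return response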
I'm using python and psycopg2 to remotely query some psql databases, and I'm trying to figure out the best way to select the data I need from the remote table, and insert it into a table on a separate DB (local application server).
Most of the stuff I've read has directed me to avoid executemany and look toward COPY operations, but I'm unsure how to implement this on a specific select statement as opposed to the entire table. Should I be headed this way or am I completely off?
but I'm unsure how to implement this on a specific select statement as opposed to the entire table
COPY isn't limited to tables; you can use a query as the source as well. Check out the examples in the manual, which show how to use COPY to create a text file based on a query:
http://www.postgresql.org/docs/current/static/sql-copy.html#AEN59055
(3rd example)
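With psycopg2 specifically, copy_expert lets you feed a COPY (SELECT ...) TO STDOUT on the source connection straight into a COPY ... FROM STDIN on the target connection. A minimal sketch, assuming both sides are PostgreSQL and the column lists match; connection strings, table and column names are placeholders:

import io
import psycopg2

src = psycopg2.connect("host=remotehost dbname=sourcedb user=me password=secret")
dst = psycopg2.connect("host=localhost dbname=appdb user=me password=secret")

buf = io.StringIO()  # depending on psycopg2/Python version, io.BytesIO may be needed

with src.cursor() as src_cur, dst.cursor() as dst_cur:
    # Export only the rows/columns needed, not the whole table.
    src_cur.copy_expert(
        "COPY (SELECT id, name, created_at FROM events "
        "WHERE created_at >= now() - interval '1 day') TO STDOUT",
        buf)
    buf.seek(0)
    # Load the same column layout into the local table.
    dst_cur.copy_expert("COPY local_events (id, name, created_at) FROM STDIN", buf)

dst.commit()
src.close()
dst.close()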
Take a look at http://ryrobes.com/featured-articles/using-a-simple-python-script-for-end-to-end-data-transformation-and-etl-part-1/
Granted, this is pulling from Oracle and inserting into SQL Server, but the concepts should be the same.