I have a program that handles sales and inventory for businesses of all sizes, and I've built it so that every time it interacts with the data stored in the database, a function is called for that specific query. I believe this is inefficient since accessing a database is often slow, and as the software works right now, these queries for small I/O operations on the db are made very often. So I've been thinking of a way to improve this.
I thought about dumping the data of the different tables into different lists at the start of execution and using those lists while the program runs. Then, at close, write back to the database whatever changes were made to the lists. This seems like a better solution, since all the data would be in memory and the slow work of handling large amounts of data (which db engines are optimized for) would happen only at startup and shutdown.
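A minimal sketch of the idea (assuming a SQLite database and made-up table names; the real program is larger) could look like this:
import sqlite3
import atexit

conn = sqlite3.connect('business.db')   # hypothetical database file
cache = {}                              # table name -> list of rows kept in memory

def load_tables(tables):
    # Read every table into memory once, at program start.
    for table in tables:
        cache[table] = conn.execute(f'SELECT * FROM {table}').fetchall()

def flush_tables():
    # Crude write-back at exit: replace each table's contents with the cached rows.
    with conn:
        for table, rows in cache.items():
            conn.execute(f'DELETE FROM {table}')
            if rows:
                placeholders = ','.join('?' * len(rows[0]))
                conn.executemany(f'INSERT INTO {table} VALUES ({placeholders})', rows)

load_tables(['products', 'sales'])      # hypothetical table names
atexit.register(flush_tables)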
As an example, here is the function I call to retrieve an entire table or a single column of a table:
def sqGenericSelect(table, fetchCol=None):
    '''
    table: table retrieved. If this is the only argument provided, the entire table gets retrieved.
    fetchCol: fetched column. If provided, the function returns only the column passed here.
    '''
    try:
        with conn:
            if not fetchCol: fetchCol = '*'
            c.execute(f'SELECT {fetchCol} FROM {table}')
            return c.fetchall()
    except Exception as exc:
        autolog(f'Problem in table: {table}', exc)  # This function is for logs
(And it gets called 8 times!)
Is this the right approach? If not, how should I improve this?
I have the function below to pull the required columns from DynamoDB, and it is working fine.
The problem is that it is pulling only a few rows from the table.
For example: the table has 26,000+ rows but I'm only able to get 3,000 rows here.
Did I miss anything?
def get_columns_dynamodb():
    try:
        response = table.query(
            ProjectionExpression="id, name, date",
            KeyConditionExpression=Key('opco_type').eq('cwc') and Key('opco_type').eq('cwp')
        )
        return response['Items']
    except Exception as error:
        logger.error(error)
In DynamoDB, there's no such thing as "select only these columns". Or, there sort of is, but it happens after the data is fetched from storage. The entire item is always fetched, and the entire item counts towards the various limits in DynamoDB, such as the 1 MB maximum for each response.
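For completeness: that per-response cap is also why a single query() call stops early. The usual way to read everything with boto3 is to keep following LastEvaluatedKey. A rough sketch, assuming the same table resource as in your question and a single key condition:
from boto3.dynamodb.conditions import Key

def get_all_items():
    items = []
    kwargs = {
        # 'name' and 'date' can clash with DynamoDB reserved words, so alias them.
        'ProjectionExpression': '#i, #n, #d',
        'ExpressionAttributeNames': {'#i': 'id', '#n': 'name', '#d': 'date'},
        'KeyConditionExpression': Key('opco_type').eq('cwc'),
    }
    while True:
        response = table.query(**kwargs)
        items.extend(response['Items'])
        if 'LastEvaluatedKey' not in response:
            break  # no more pages
        kwargs['ExclusiveStartKey'] = response['LastEvaluatedKey']
    return items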
One way to solve this is to write your data in a way that's more optimized for this query. Generally speaking, in DynamoDB you optimize "queries" (in quotes, since they're more of a key/value read than a dynamic query with joins, selects, etc.) by writing optimized data.
So, when you write data to your table, you can either use a transaction to write companion items to the same or a separate table, or you can use DynamoDB streams to write the same data in a similar fashion, except async (i.e. eventually consistent).
Let's say you roll with two tables: one table, my_things, which contains the full items, and another table, my_things_for_query_x, that holds only the exact data you need for that query. That lets you read more data in each chunk, since the items in storage contain only the data you actually need in your situation.
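A rough sketch of that write path (the item shape and attribute names here are assumed, not taken from your table):
import boto3

dynamodb = boto3.resource('dynamodb')
full_table = dynamodb.Table('my_things')
query_table = dynamodb.Table('my_things_for_query_x')

def write_thing(item):
    # Write the full item.
    full_table.put_item(Item=item)
    # Write a slimmed-down companion item with only what query X needs.
    query_table.put_item(Item={
        'opco_type': item['opco_type'],
        'id': item['id'],
        'name': item['name'],
        'date': item['date'],
    })
    # In practice you'd wrap both writes in transact_write_items, or drive
    # the second write from a DynamoDB stream, so the two tables stay in sync.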
Just a logic question really... I have a script that takes rows of data from a CSV, parses the cell values to normalize the data, and checks the database that a key/primary value does not already exist, so as to prevent duplicates. At the moment the first 10-15k entries commit to the DB fairly quickly, but then it really starts slowing down as there are more entries in the DB to check against for duplicates; by the time there are 100k rows in the DB the commit speed is about 1/sec, argh...
So my question: is it (Pythonically) more efficient to extract and parse the data separately from the DB commit procedure (maybe in a class-based script, or could I add multiprocessing to the CSV parsing or the DB commit)? And is there a quicker method to check the database for duplicates if I am only cross-referencing 1 table and 1 value?
Much appreciated
Kuda
If the first 10-15k entries worked fine, probably the issue is with the database query. Do you have a suitable index, and is that index used by the database? You can use an EXPLAIN statement to see what the database is doing, whether it's actually using the index for the particular query used by Django.
If the table starts empty, it might also help to run ANALYZE TABLE after the first few thousand rows; the query optimiser might have stale statistics from when the table was empty. To test this hypothesis, you can connect to the database while the script is running, when it starts to slow down, and run ANALYZE TABLE manually. If it immediately speeds up, the problem was indeed stale statistics.
As for optimisation of database commits themselves, it probably isn't an issue in your case (since the first 10k rows perform fine), but one aspect is the round-trips; for every query, it has to go to the database and get the results back. This is especially noticeable if the database is across a network. If you need to speed that up, Django has a bulk_create() method to insert many rows at once. However, if you do that, you'll only get an error for the whole batch of rows if you try to insert duplicates forbidden by the database indexes; you'll then have to find the particular row causing the error using other code.
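A rough sketch of that batched approach, assuming a Django model named Record with a unique key field (the model, field names, and parse step are all hypothetical); with ignore_conflicts (Django 2.2+) the database's unique index rejects duplicates for you instead of a per-row existence check:
import csv
from myapp.models import Record   # hypothetical app/model

def parse(row):
    # Placeholder for your existing cell cleanup / normalization step.
    return {k: v.strip() for k, v in row.items()}

def load_csv(path, batch_size=1000):
    with open(path, newline='') as f:
        rows = [parse(row) for row in csv.DictReader(f)]
    objs = [Record(key=r['key'], value=r['value']) for r in rows]
    # One round-trip per batch; duplicate keys are silently skipped by the DB.
    Record.objects.bulk_create(objs, batch_size=batch_size, ignore_conflicts=True)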
I use the Python SDK to create a new BigQuery table:
tableInfo = {
    'tableReference': {
        'datasetId': datasetId,
        'projectId': projectId,
        'tableId': targetTableId
    },
    'schema': schema
}
result = bigquery_service.tables().insert(projectId=projectId,
                                          datasetId=datasetId,
                                          body=tableInfo).execute()
The result variable contains the created table information with etag, id, kind, schema, selfLink, tableReference and type - therefore I assume the table is created correctly.
Afterwards I even see the table when I call bigquery_service.tables().list(...)
The problem is:
When inserting right after that, I still (often) get an error: Not found: MY_TABLE_NAME
My insert function call looks like this:
response = bigquery_service.tabledata().insertAll(
    projectId=projectId,
    datasetId=datasetId,
    tableId=targetTableId,
    body=body).execute()
I even retried the insert multiple times with 3 seconds of sleep between retries. Any ideas?
My projectId is stylight-bi-testing
There were a lot of failures between 10:00 and 12:00 (times given in UTC).
Per your answers to my question regarding using NOT_FOUND as an indicator to create the table, this is intended (though admittedly somewhat frustrating) behavior.
The streaming insertion path caches information about tables (and the authorization of a user to insert into the table). This is because of the intended high-QPS nature of the API. We also cache certain negative responses in order to protect against buggy or abusive clients. One of those cached negative responses is the non-existence of a destination table. We've always done this on a per-machine basis, but recently added an additional centralized cache, such that all machines will see the negative cache result almost immediately after the first NOT_FOUND response is returned.
In general, we recommend that table creation not occur inline with insert requests, because in a system that is issuing thousands of QPS of inserts, a table miss could result in thousands of table creation operations which can be taxing on our system. Instead, if you know the possible set of tables beforehand, we recommend some periodic process that performs table creations in advance of their usage as a streaming destination. If your destination tables are more dynamic in nature, you may need to implement a delay after table creation has been performed.
Apologies for the difficulty. We do hope to address this issue, but we don't have any timeframe yet for doing so.
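If creating tables just before streaming is unavoidable, a simple mitigation is to retry the insert with a delay while the cached NOT_FOUND expires. A rough sketch, assuming the googleapiclient-based bigquery_service from the question and that the miss surfaces as a 404 HttpError:
import time
from googleapiclient.errors import HttpError

def insert_with_retry(projectId, datasetId, tableId, body, retries=5, delay=5):
    # Retry streaming inserts while the newly created table propagates.
    for attempt in range(retries):
        try:
            return bigquery_service.tabledata().insertAll(
                projectId=projectId,
                datasetId=datasetId,
                tableId=tableId,
                body=body).execute()
        except HttpError as err:
            if err.resp.status == 404 and attempt < retries - 1:
                time.sleep(delay)   # wait for the table to become visible
                continue
            raise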
I am using Python 3.4 and CQLEngine. In my code, I am saving an object through an overridden save method, as follows:
class Foo(Model, ...):
    id = columns.Integer(primary_key=True)
    bar = columns.Text()
    ...
    def save(self):
        super(Foo, self).save()
and I would like to know, from the return value of save(), whether it performed an insert or an update.
INSERT and UPDATE are synonyms in Cassandra, with very few exceptions. Here is a description of INSERT where it briefly touches on a difference:
An INSERT writes one or more columns to a record in a Cassandra table
atomically and in isolation. No results are returned. You do not have
to define all columns, except those that make up the key. Missing
columns occupy no space on disk.
If the column exists, it is updated. You can qualify table names by
keyspace. INSERT does not support counters, but UPDATE does.
Internally, the insert and update operation are identical.
You don't know whether it will be an insert or an update; you can think of it as a data save request, and the coordinator determines what it is.
This answers your original question - you can't know based on the return of the save function whether it was an insert or update.
To answer your comment below, which explained why you wanted that output: you can't reliably get this information out of Cassandra, but you can use lightweight transactions to a certain extent and run 2 statements sequentially with the same rows of data:
INSERT ... IF NOT EXISTS followed by UPDATE ... IF EXISTS
In the target table you will need a column where each of these statements writes a value that is unique for each call. Then you can select data based on the primary keys of your dataset and see how many rows have each value. This will roughly tell you how many updates and how many inserts there were. However, if there were any concurrent processes, they may have overwritten your data with their own tokens, so this method will not be very accurate and will only work (as with any other method on databases like Cassandra) when there are no concurrent processes.
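With cqlengine specifically, a related option is its lightweight-transaction support. A rough sketch, assuming the Foo model from the question (and with the same caveat about concurrent writers racing between the two statements):
from cassandra.cqlengine.query import LWTException

def save_and_report(row_id, bar):
    try:
        # INSERT ... IF NOT EXISTS: applied only when the row is new.
        Foo.if_not_exists().create(id=row_id, bar=bar)
        return 'insert'
    except LWTException:
        # The row already existed, so overwrite it; effectively an update.
        Foo.objects(id=row_id).update(bar=bar)
        return 'update'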
I have several CouchDB databases. The largest is about 600k documents, and I am finding that queries are prohibitively long (several hours or more). The DB is updated infrequently (once a month or so), and only involves adding new documents, never updating existing documents.
Queries are of the type: Find all documents where key1='a' or multiple keys: key1='a', key2='b'...
I don't see that permanent views are practical here, so have been using the CouchDB-Python 'query' method.
I have tried several approaches, and I am unsure what is most efficient, or why.
Method 1:
map function is:
map_fun = '''function(doc){
    if(doc.key1=='a'){
        emit(doc.A, [doc.B, doc.C, doc.D, doc.E]);
    }
}'''
The Python query is:
results = ui.db.query(map_fun, key2=user)
Then some operation with results.rows. This takes up the most time.
It takes about an hour for 'results.rows' to come back. If I change key2 to something else, it comes back in about 5 seconds. If I repeat the original user, it's also fast.
But sometimes I need to query on more keys, so I try:
map_fun = '''function(doc){
    if(doc.key1=='a' && doc.key2==user && doc.key3=='something else' && etc.){
        emit(doc.A, [doc.B, doc.C, doc.D, doc.E]);
    }
}'''
and use the python query:
results = ui.db.query(map_fun)
Then some operation with results.rows
It takes a long time for the first query. When I change key2, it takes a long time again. If I change key2 back to the original data, it takes the same amount of time. (That is, nothing seems to be getting cached, B-tree'd or whatever.)
So my question is: What's the most efficient way to do queries in couchdb-python, where the queries are ad hoc and involve multiple keys for search criteria?
The UI is QT-based, using PyQt underneath.
There are two caveats for the couchdb-python db.query() method:
It executes a temporary view. This means that your code is blocked until all documents have been processed by that view, and this happens again and again for each call. Try saving the view and using the db.view() method instead, to get results on demand and benefit from incremental index updates (a sketch follows this list).
It reads the whole result, no matter how big it is. Neither db.query() nor db.view() is lazy, so if the view result is a 100 MB JSON object, you have to fetch all of that data before you can use it. To query data in a more memory-efficient way, try applying the patch that adds a db.iterview() method; it lets you fetch data in a paginated style.
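A minimal sketch of the permanent-view approach with couchdb-python (the design document and view names are made up, and the view emits key2 so it can be looked up server-side):
import couchdb

couch = couchdb.Server('http://localhost:5984/')
db = couch['mydb']   # hypothetical database name

# Save the map function once as a permanent view; CouchDB keeps its B-tree
# index up to date incrementally as documents are added. (Create it once;
# updating an existing design doc requires its current _rev.)
db['_design/queries'] = {
    'language': 'javascript',
    'views': {
        'by_key2': {
            'map': '''function(doc) {
                if (doc.key1 == 'a') {
                    emit(doc.key2, [doc.B, doc.C, doc.D, doc.E]);
                }
            }'''
        }
    }
}

# Query the saved view instead of posting a temporary one each time.
user = 'some_user'   # placeholder for the value you filter on
for row in db.view('queries/by_key2', key=user):
    print(row.key, row.value)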
I think the fix for your problem is to create an index for the keys you are searching on. That is what you called a permanent view.
Note the difference between map/reduce and SQL queries in a B-tree based table:
a simple SQL query searching for a key (if you have an index for it) traverses a single path in the B+-tree from root to leaf,
a map function reads every element, even if it emits a small result.
What you are doing, for each query, is
1. reading every document (most of the cost) and
2. searching for a key in the emitted result (a quick search in the B-tree),
so I think your solution is bound to be slow by design.
If you redesign the database structure to make permanent views practical, (1) will be executed only once and only (2) will be executed for each query. Each document will be read by the view once, after it is added to the DB, and a query will search the B-tree storing the emitted results. If the emitted set is smaller than the total number of documents, then the query searches a smaller structure and you have an advantage over SQL databases.
Temporary views are far less efficient than permanent ones and are meant to be used only during development. CouchDB was designed to work with permanent views. To make map/reduce efficient, one has to implement caching or make the view permanent. I am not familiar with the details of the CouchDB implementation; perhaps the second query with a different key is faster because of some caching. If for some reason you have to use temporary views, then perhaps CouchDB is the wrong choice and you should consider a DBMS designed and optimized for online queries, like MongoDB.