So I am expecting to have around 2000 collections with 10,000-100,000 documents in the near future and I am trying to figure out how to build the indexes. It seems very simple how to do it on a basic level but then when to run the re-indexing is tripping me up. So assume that I have this function and this creates all the indexes I need:
def ensure_indexes(self):
collections = get_collections()
for coll in collections:
coll.ensure_index([('time_stamp', pymongo.DESCENDING])
coll.ensure_index([('raw_value', pymongo.DESCENDING])
coll.ensure_index([('time_stamp', pymongo.DESCENDING, ('raw_value', pymongo.DESCENDING])
There will be a lot of updates to the database during the day and a few people querying it. Should I make a cron job to run the above function during the night while not many people will be inserting new documents in the collections? If people query the database and the collection has been updated but not the index will that query response not include the recently added documents? Or will newly added documents be included in an index?
You don't need to rebuild the indexes under normal circumstances, you only need to create the indexes once, read this from MongoDB FAQ:
Should you run ensureIndex() after every insert?¶
No. You only need to create an index once for a single collection.
After initial creation, MongoDB automatically updates the index as
data changes.
While running ensureIndex() is usually ok, if an index doesn’t exist
because of ongoing administrative work, a call to ensureIndex() may
disrupt database availability. Running ensureIndex() can render a
replica set inaccessible as the index creation is happening. See Build
Indexes on Replica Sets.
In case of corruption and you need to build the indexes again, use db.collection.reIndex(), you can read more from HERE
Related
Just a logic question really...I have a script that takes rows of data from a CSV, parses the cell values to uniform the data and makes a check on the database that a key/primary value does not exist so as to prevent duplicates! At the moment, the 1st 10-15k entries commit to the DB fairly quick but then it starts really slowing as there are more entries in the DB to check against for duplicates....by the time there are 100k rows in the DB the commit speed is about 1/sec argh...
So my question, is it (pythonically) more efficient to extract and parse the data separately to the DB commit procedure (maybe in a class based script or?? Could I add multiprocessing to the csv parsing or DB commit) and is there a quicker method to check the database for duplicates if i am only cross-referencing 1 table and 1 value??
Much appreciated
Kuda
If the first 10-15k entries worked fine, probably the issue is with the database query. Do you have a suitable index, and is that index used by the database? You can use an EXPLAIN statement to see what the database is doing, whether it's actually using the index for the particular query used by Django.
If the table starts empty, it might also help to run ANALYZE TABLE after the first few thousand rows; the query optimiser might have stale statistics from when the table was empty. To test this hypothesis, you can connect to the database while the script is running, when it starts to slow down, and run ANALYSE TABLE manually. If it immediately speeds up, the problem was indeed stale statistics.
As for optimisation of database commits themselves, it probably isn't an issue in your case (since the first 10k rows perform fine), but one aspect is the round-trips; for every query, it has to go to the database and get the results back. This is especially noticeable if the database is across a network. If you need to speed that up, Django has a bulk_create() method to insert many rows at once. However, if you do that, you'll only get an error for the whole batch of rows if you try to insert duplicates forbidden by the database indexes; you'll then have to find the particular row causing the error using other code.
Some devices are asynchronously storing values on a common remote MySQL database server.
I would like to write a supervisor app in Python (and possibly SQLAlchemy) to recognize the external INSERT events on the database and act upon the last rows' data. This is to avoid a long manual test to see if every table is being updated regularly or a logger crashed.
Can somebody just tell me where to search online this kind of info and, even better, an example?
EDIT
I already read all tables periodically using a datetime primary key ({date_time}), loading the last row of each table, and comparing to the previous values:
SELECT * FROM table ORDER BY date_time DESC LIMIT 1
but it looks very cumbersome and doesn't guarantee that I don't lose some rows between successive database checks.
The engine is an old version of INNODB that I cannot upgrade: I cannot use the UPDATE field in schema because it simply doesn't work.
To reword my question:
How to listen any database event with a daemon-like Python application (sleeping thread) and wake up only when something happens?
I want also to avoid SQL triggers because this would be just too heavy to manage: tables are in hundreds and they are added/removed very often according to the active loggers.
I gave a look to SQLAlchemy but all reference I could find, if I don't misunderstood it, are decorators to act on INSERTs made by SQLAlchemy's itself. I didn't find anything about external changes to the database.
About the example request: I am not interested in a copy-and-paste, because first I want to understand how stuff works. I prefer (even incomplete) examples because SQLAlchemy documentation is far too deep for my knowledge and I simply cannot put the pieces together.
I use the python sdk to create a new bigquery table:
tableInfo = {
'tableReference':{
'datasetId':datasetId,
'projectId':projectId,
'tableId':targetTableId
},
'schema':schema
}
result = bigquery_service.tables().insert(projectId=projectId,
datasetId=datasetId,
body=tableInfo).execute()
The result variable contains the created table information with etag,id,kind,schema,selfLink,tableReference,type - therefore I assume the table is created correctly.
Afterwards I even get the table, when I call bigquery_service.tables().list(...)
The problem is:
When inserting right after that, I still (often) get an error: Not found: MY_TABLE_NAME
My insert function call looks like this:
response = bigquery_service.tabledata().insertAll(
projectId=projectId,
datasetId=datasetId,
tableId=targetTableId,
body=body).execute()
I even retried the insert multiple times with 3 seconds of sleep between retries. Any ideas?
My projectId is stylight-bi-testing
There were a lot failures between 10:00 and 12:00 (time given in UTC)
Per your answers to my question regarding using NOT_FOUND as an indicator to create the table, this is intended (though admittedly somewhat frustrating) behavior.
The streaming insertion path caches information about tables (and the authorization of a user to insert into the table). This is because of the intended high QPS nature of the API. We also cache certain negative responses in order to protect again buggy or abusive clients. One of those cached negative responses is the non-existence of a destination table. We've always done this on a per-machine basis, but recently added an additional centralized cache, such that all machines will see the negative cache result almost immediately after the first NOT_FOUND response is returned.
In general, we recommend that table creation not occur inline with insert requests, because in a system that is issuing thousands of QPS of inserts, a table miss could result in thousands of table creation operations which can be taxing on our system. Instead, if you know the possible set of tables beforehand, we recommend some periodic process that performs table creations in advance of their usage as a streaming destination. If your destination tables are more dynamic in nature, you may need to implement a delay after table creation has been performed.
Apologies for the difficulty. We do hope to address this issue, but we don't have any timeframe yet for doing so.
I have several CouchDB databases. The largest is about 600k documents, and I am finding that queries are prohibitively long (several hours or more). The DB is updated infrequently (once a month or so), and only involves adding new documents, never updating existing documents.
Queries are of the type: Find all documents where key1='a' or multiple keys: key1='a', key2='b'...
I don't see that permanent views are practical here, so have been using the CouchDB-Python 'query' method.
I have tried several approaches, and I am unsure what is most efficient, or why.
Method 1:
map function is:
map_fun = '''function(doc){
if(doc.key1=='a'){
emit(doc.A, [doc.B, doc.C,doc.D,doc.E]);
}
}'''
The Python query is:
results = ui.db.query(map_fun, key2=user)
Then some operation with results.rows. This takes up the most time.
It takes about an hour for 'results.rows' to come back. If I change key2 to something else, it comes back in about 5 seconds. If I repeat the original user, it's also fast.
But sometimes I need to query on more keys, so I try:
map_fun = '''function(doc){
if(doc.key1=='a' && doc.key2=user && doc.key3='something else' && etc.){
emit(doc.A, [doc.B, doc.C,doc.D,doc.E]);
}
}'''
and use the python query:
results = ui.db.query(map_fun)
Then some operation with results.rows
Takes a long time for the first query. When I change key2, takes a long time again. If
I change key2 back to the original data, takes the same amount of time. (That is, nothing seems to be getting cached, B-tree'ed or whatever).
So my question is: What's the most efficient way to do queries in couchdb-python, where the queries are ad hoc and involve multiple keys for search criteria?
The UI is QT-based, using PyQt underneath.
There are two caveats for couchdb-python db.query() method:
It executes temporary view. This means that code flow processing would be blocked until this all documents would be proceeded by this view. And this would happened again and again for each call. Try to save view and use db.view() method instead to get results on demand and have incremental index updates.
It's reads whole result no matter how bigger it is. db.query() nor db.view() methods aren't lazy so if view result is 100 MB JSON object, you have to fetch all this data before use them somehow. To query data in more memory-optimized way, try to apply patch to have db.iterview() method - it allows you to fetch data in pagination style.
I think that the fix to your problem is to create an index for the keys you are searching. It is what you called permanent view.
Note the difference between map/reduce and SQL queries in a B-tree based table:
simple SQL query searching for a key (if you have an index for it) traverses single path in the B+-tree from root to leaf,
map function reads all the elements, event if it emits small result.
What you are doing is for each query
reading every document (most of the cost) and
searching for a key in the emitted result (quick search in the B-tree).
and I think your solution has to be slow by the design.
If you redesign database structure to make permanent views practical, (1.) will be executed once and only (2.) will be executed for each query. Each document will be read by a view after addition to DB and a query will search in B-tree storing emitted result. If emitted set is smaller than the total documents number, then the query searches smaller structure and you have the benefit over SQL databases.
Temporary views are far less efficient, then the permanent ones and are meant to be used only for development. CouchDB was designed to work with permanent views. To make map/reduce efficient one has to implement caching or make the view permanent. I am not familiar with the details of the CouchDB implementation, perhaps second query with different key is faster because of some caching. If for some reason you have to use temporary view then perhaps CouchDB is a mistake and you should consider DBMS created and optimized for online queries like MongoDB.
I have two Google App Engine Models. I ran my cron's a few times and now there are duplicate entries in my datastore. If it was easy to just delete my entire datastore and upload my data again I would. BUT it took 4 hours to upload last time so I am wondering is there a quick way of deleting entries with duplicate names in the "title" field within the model?
Quick? Probably not.
If you did want to delete dupes, my approach would be to write a remote_api script. Query the model for all entities, sort by title, and fetch batches of 100. Keep a local Python dictionary of titles. If you encounter a new title, add it to the dictionary. If you encounter a known title, add the entity to a delete batch, and flush the deletes before moving on to the next query batch.
Probably an excessive amount of work when you can just wipe out your datastore and re-import instead.