Delete duplicates from Google App Engine Model? - python

I have two Google App Engine Models. I ran my cron jobs a few times and now there are duplicate entries in my datastore. If it were easy to just delete my entire datastore and upload my data again, I would. But it took 4 hours to upload last time, so I am wondering: is there a quick way of deleting entries with duplicate values in the "title" field within the model?

Quick? Probably not.
If you did want to delete dupes, my approach would be to write a remote_api script. Query the model for all entities, sort by title, and fetch batches of 100. Keep a local Python dictionary of titles. If you encounter a new title, add it to the dictionary. If you encounter a known title, add the entity to a delete batch, and flush the deletes before moving on to the next query batch.
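For illustration, a minimal sketch of that remote_api approach, assuming remote_api is already configured (e.g. through remote_api_shell.py) and a db.Model subclass with a StringProperty called title; the model and property names are placeholders for your own:

from google.appengine.ext import db

class MyModel(db.Model):              # stand-in for your actual model
    title = db.StringProperty()

def delete_duplicate_titles():
    seen_titles = set()               # titles we have already kept
    query = MyModel.all().order('title')
    cursor = None
    while True:
        if cursor:
            query.with_cursor(cursor)
        batch = query.fetch(100)
        if not batch:
            break
        to_delete = []                # duplicates found in this batch
        for entity in batch:
            if entity.title in seen_titles:
                to_delete.append(entity)
            else:
                seen_titles.add(entity.title)
        if to_delete:
            db.delete(to_delete)      # flush deletes before the next batch
        cursor = query.cursor()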
Probably an excessive amount of work when you can just wipe out your datastore and re-import instead.

Python/Django/MySQL optimisation

Just a logic question really... I have a script that takes rows of data from a CSV, parses the cell values to normalise the data, and checks the database that a key/primary value does not already exist, so as to prevent duplicates! At the moment the first 10-15k entries commit to the DB fairly quickly, but then it starts really slowing down as there are more entries in the DB to check against for duplicates... by the time there are 100k rows in the DB the commit speed is about 1/sec, argh...
So my question: is it (pythonically) more efficient to separate the extraction and parsing of the data from the DB commit procedure (maybe in a class-based script, or could I add multiprocessing to the CSV parsing or the DB commits)? And is there a quicker method to check the database for duplicates if I am only cross-referencing one table and one value?
Much appreciated
Kuda
If the first 10-15k entries worked fine, the issue is probably with the database query. Do you have a suitable index, and is that index actually used by the database? You can use an EXPLAIN statement to see what the database is doing, and whether it is really using the index for the particular query Django issues.
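As a quick illustration (one way to do this, not the only one), you can print the SQL Django generates and run EXPLAIN on it in the mysql client; MyModel and key_value below are placeholder names:

# In a Django shell (python manage.py shell):
qs = MyModel.objects.filter(key_value='abc123')
print(qs.query)    # roughly the SQL Django will send

# Then, in the mysql client, prefix that SQL with EXPLAIN and check that the
# "key" column of the output names your index instead of showing a full scan.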
If the table starts out empty, it might also help to run ANALYZE TABLE after the first few thousand rows; the query optimiser might have stale statistics from when the table was empty. To test this hypothesis, connect to the database while the script is running, once it starts to slow down, and run ANALYZE TABLE manually. If things immediately speed up, the problem was indeed stale statistics.
As for optimising the database commits themselves, that probably isn't an issue in your case (since the first 10k rows perform fine), but one aspect is round-trips: every query has to travel to the database and the results have to come back. This is especially noticeable if the database is across a network. If you need to speed that up, Django has a bulk_create() method to insert many rows at once. However, if you do that, you'll only get a single error for the whole batch if you try to insert duplicates forbidden by the database indexes; you'll then have to find the particular offending row with other code.
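A hedged sketch of how that batching could look, with the duplicate check done in Python against a set of already known keys so the batch insert itself never trips the unique index; MyModel, key_value, payload and parsed_rows are placeholder names:

# One SELECT to load the existing keys, then a set lookup per CSV row.
known_keys = set(MyModel.objects.values_list('key_value', flat=True))

batch = []
for row in parsed_rows:                       # output of your CSV parsing step
    if row['key_value'] in known_keys:
        continue                              # duplicate, skip it
    known_keys.add(row['key_value'])
    batch.append(MyModel(key_value=row['key_value'], payload=row['payload']))
    if len(batch) >= 1000:
        MyModel.objects.bulk_create(batch)    # one INSERT per 1000 rows
        batch = []
if batch:
    MyModel.objects.bulk_create(batch)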

How can I improve django mysql copy performance?

I have a Django app with a model (Person) defined, and I also have some DBs (each containing an Appointment table) that do not have any models defined (they are not meant to be connected to the Django app).
I need to move some data from the Appointment table over to Person so that the Person table mirrors the information in the Appointment table. It is set up this way because there are multiple independent DBs like Appointment that need to be copied into the Person table (so I do not want to make any architectural changes to this setup).
Here is what I do now:
res = sourcedb.fetchall()  # rows from the Appointment table
for myrecord in res:
    try:
        # look up an existing Person for this vendor id and office
        existingrecord = Person.objects.filter(vendorid=myrecord[12], office=myoffice)[0]
    except IndexError:
        # no match found, so create a new one
        existingrecord = Person(vendorid=myrecord[12], office=myoffice)
    existingrecord.firstname = myrecord[0]
    existingrecord.midname = myrecord[1]
    existingrecord.lastname = myrecord[2]
    existingrecord.address1 = myrecord[3]
    existingrecord.address2 = myrecord[4]
    existingrecord.save()
The problem is that this is way too slow (takes about 8 minutes for 20K records). What can I do to speed this up?
I have considered the following approach:
1. bulk_create: cannot use this because I sometimes have to update existing records.
2. Delete everything and then bulk_create: other things depend on the Person model, so I cannot delete records in the Person model.
3. INSERT ... ON DUPLICATE KEY UPDATE: cannot do this because the Person table's primary key is different from the Appointment table's primary key. The Appointment PK is copied into the Person table. If there were a way to check against two duplicate keys, I think this approach would work.
A few ideas (a rough sketch combining the first three follows the list):
EDIT: See Trewq's comment on this: first of all, create indexes on your tables…
Wrap it all in a transaction using with transaction.atomic():, since by default Django creates a new transaction per save() call, which can become very expensive. With 20K records one giant transaction might also be a problem, so you might have to write some code to split your transactions into multiple batches. Try it out and measure!
If RAM is not an issue (and it should not be with 20K records), fetch all the data from the Appointment table first, and then fetch all existing Person objects with a single SELECT query instead of one query per record.
Use bulk_create even if some of the rows are updates. This will still issue UPDATE queries for your updates, but it reduces all your INSERT queries to just one (or a few), which is still an improvement. You can distinguish inserts from updates by the fact that inserts won't have a primary key set before save() is called; collect the inserts in a Python list for a later bulk_create instead of saving them directly.
As a last resort: write raw SQL to make use of MySQL's INSERT … ON DUPLICATE KEY UPDATE syntax. You don't need the same primary key for this; a UNIQUE key suffices, and keys can span multiple columns (see Django's Meta.unique_together model option).
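A hedged sketch of how the first three ideas could fit together, reusing the field names from the question (vendorid, office, firstname and so on); batch splitting and error handling are left out for brevity:

from django.db import transaction

res = sourcedb.fetchall()                      # rows from the Appointment table

# Idea 2: one SELECT instead of one query per record.
existing = {p.vendorid: p for p in Person.objects.filter(office=myoffice)}

new_people = []                                # unsaved Persons for bulk_create
with transaction.atomic():                     # idea 1: one transaction (or split into batches)
    for myrecord in res:
        person = existing.get(myrecord[12])
        if person is None:
            person = Person(vendorid=myrecord[12], office=myoffice)
        person.firstname = myrecord[0]
        person.midname = myrecord[1]
        person.lastname = myrecord[2]
        person.address1 = myrecord[3]
        person.address2 = myrecord[4]
        if person.pk is None:
            new_people.append(person)          # idea 3: defer inserts
        else:
            person.save()                      # updates still go one by one
    Person.objects.bulk_create(new_people)     # one (or a few) INSERTs for all new rows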

When to update mongo db indexes

So I am expecting to have around 2000 collections with 10,000-100,000 documents each in the near future, and I am trying to figure out how to build the indexes. It seems very simple at a basic level, but figuring out when to run the re-indexing is tripping me up. So assume I have this function that creates all the indexes I need:
import pymongo

def ensure_indexes(self):
    collections = get_collections()
    for coll in collections:
        coll.ensure_index([('time_stamp', pymongo.DESCENDING)])
        coll.ensure_index([('raw_value', pymongo.DESCENDING)])
        coll.ensure_index([('time_stamp', pymongo.DESCENDING),
                           ('raw_value', pymongo.DESCENDING)])
There will be a lot of updates to the database during the day and a few people querying it. Should I make a cron job that runs the above function during the night, while not many people are inserting new documents into the collections? If someone queries the database after the collection has been updated but before the index has, will that query miss the recently added documents? Or are newly added documents included in the index automatically?
You don't need to rebuild the indexes under normal circumstances; you only need to create them once. From the MongoDB FAQ:
Should you run ensureIndex() after every insert?
No. You only need to create an index once for a single collection.
After initial creation, MongoDB automatically updates the index as
data changes.
While running ensureIndex() is usually ok, if an index doesn’t exist
because of ongoing administrative work, a call to ensureIndex() may
disrupt database availability. Running ensureIndex() can render a
replica set inaccessible as the index creation is happening. See Build
Indexes on Replica Sets.
If an index gets corrupted and you need to build it again, use db.collection.reIndex(); you can read more about it HERE.
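For completeness, a small sketch of the pymongo equivalent of db.collection.reIndex(); the connection details and collection name are placeholders:

import pymongo

db = pymongo.MongoClient('localhost', 27017)['mydb']
db['my_collection'].reindex()      # rebuilds all indexes on that collection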

How to check if data has already been previously used

I have a python script that retrieves the newest 5 records from a mysql database and sends email notification to a user containing this information.
I would like the user to receive only new records and not old ones.
I can retrieve data from mysql without problems...
I've tried to store it in text files and compare the files but, of course, the text files containing freshly retrieved data will always have 5 records more than the old one.
So I have a logic problem here that, being a newbie, I can't tackle easily.
Using lists is also an idea but I am stuck in the same kind of problem.
The infamous 5 records can stay the same for one week and then we can have a new record or maybe 3 new records a day.
It's quite unpredictable but more or less that should be the behaviour.
Thank you so much for your time and patience.
Are you assigning a unique incrementing ID to each record? If you are, you can create a separate table that holds just the ID of the last record fetched, that way you can only retrieve records with IDs greater than this ID. Each time you fetch, you could update this table with the new latest ID.
Let me know if I misunderstood your issue, but saving the last fetched ID in the database could be a solution.
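A minimal sketch of that idea using a plain DB-API cursor (MySQLdb/pymysql-style placeholders); the table and column names (records, last_sent, last_id, payload) are placeholders for your own schema:

def fetch_new_records(cursor):
    # Where did we stop last time?
    cursor.execute("SELECT last_id FROM last_sent LIMIT 1")
    row = cursor.fetchone()
    last_id = row[0] if row else 0

    # Only rows with a higher auto-increment ID than last time are new.
    cursor.execute(
        "SELECT id, payload FROM records WHERE id > %s ORDER BY id", (last_id,)
    )
    new_records = cursor.fetchall()

    if new_records:
        newest_id = new_records[-1][0]
        cursor.execute("UPDATE last_sent SET last_id = %s", (newest_id,))
    return new_records    # email only these; an empty list means nothing new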

Does django with mongodb make migrations a thing of the past?

Since mongo doesn't have a schema, does that mean that we won't have to do migrations when we change the models?
What does the migration process look like with a non-relational db?
I think this is a really good question, but the answers are going to be a little scattered based on the libs you're using and your expectations for a "migration".
Let's take a look at some common migration actions:
Add a field: Mongo makes this very easy. Just add a field and you're done.
Delete a field: In theory, you're not actually tied to your schema, so "deletion" here is relative. If you remove the "property" and no longer load the field, then it doesn't really matter if that field is in the data. So if you don't care about "cleaning up" the database, then removing a field doesn't affect the database. If you do care about cleaning the DB, you'll basically need to run a giant for loop against the DB.
Modify a field name: This is also a difficult problem. When you rename a field, "where" are you renaming it? If you want the DB to reflect the new field name, then you basically have to run a giant for loop against the DB. To be safe, you probably have to "add" the data under the new name, then push code, then "unset" the old field (a sketch follows this list).
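A hedged pymongo sketch of that kind of whole-collection rewrite; the multi=True flag makes the server apply the update to every matching document, so the "for loop" effectively runs server-side. Database, collection, and field names are placeholders:

import pymongo

db = pymongo.MongoClient()['mydb']

# Rename a field on every document:
db.items.update({}, {'$rename': {'old_name': 'new_name'}}, multi=True)

# Or unset a field you no longer load:
db.items.update({}, {'$unset': {'obsolete_field': 1}}, multi=True)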
Some Wrinkles
However, the concept of a field name in tandem with an ActiveRecord object is just a little skewed. An ActiveRecord object is effectively providing mappings of object properties to actual database fields.
In a typical RDBMS the "size" of a field name is not really relevant. However, in Mongo, the field name actually occupies data space and this makes a big difference in terms of performance.
Now, if you're using some form of "data object" like ActiveRecord, why would you attempt to store full field names in the data? The DB should probably be storing all fields in alphabetical order with a map on the Object side. So a Document could have 8 fields/properties whose DB names are "a", "b" ... "h", while the Object names are readable stuff like "Name", "Price", "Quantity".
The reason I bring this up is that it adds yet another wrinkle to Modify a field name. If you're implementing a mapping then modifying a field name doesn't really cause a migration at all.
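As one concrete way to get such a mapping in Python (an assumption, not something the answer prescribes), MongoEngine's db_field option stores a short name on disk while the object keeps the readable one:

from mongoengine import Document, StringField, DecimalField, IntField

class Product(Document):
    name = StringField(db_field='a')          # stored as "a" in the documents
    price = DecimalField(db_field='b')
    quantity = IntField(db_field='c')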
Some more Wrinkles
If you do want to implement a migration on a deletion, then you'll have to do so after a deploy. You'll also have to recognize that you won't save any current disk space when you do so.
Mongo pre-allocates space and it doesn't really "give it back" unless you do a DB repair. So if you delete a bunch of fields on documents, those documents still occupy the same space on disk. If the documents are later moved, then you may reclaim space, however documents only move when they grow.
If you remove a large field from lots of documents, you'll want to do a repair or check out the new in-place compact command.
There is no silver bullet. Adding or removing fields is easier with a non-relational DB (just stop using the unneeded fields or start using the new ones); renaming a field is easier with a traditional DB (in a schemaless DB you'll usually have to change a lot of data when a field is renamed); and data migration is on par, depending on the task.
What does the migration process look like with a non-relational db?
Depends on if you need to update all the existing data or not.
In many cases you may not need to touch the old data at all, such as when adding a new optional field. If that field also has a default value, you may not need to update the old documents either, provided your application can handle a missing field correctly. However, if you want to build an index on the new field to be able to search/filter/sort, you need to write the default value into the old documents.
Something like field renaming (trivial in a relational db, because you only need to update the catalog and not touch any data) is a major undertaking in MongoDB (you need to rewrite all documents).
If you need to update the existing data, you usually have to write a migration function that iterates over all the documents and updates them one by one (although this work can be split up and run in parallel). For large data sets this can take a lot of time (and space), and you will miss having transactions (if a migration crashes half-way through, you are left with partially migrated data).
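A hedged sketch of such a migration function in pymongo, backfilling a default value for a newly indexed field; the database, collection, and field names are placeholders:

import pymongo

db = pymongo.MongoClient()['mydb']

def backfill_status_field():
    # Only touch documents that are actually missing the new field.
    for doc in db.orders.find({'status': {'$exists': False}}):
        db.orders.update({'_id': doc['_id']}, {'$set': {'status': 'unknown'}})

backfill_status_field()
db.orders.ensure_index('status')    # the index now covers every document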
