If I use bulk_create to insert objects:
objList = [a, b, c]  # none are saved
model.objects.bulk_create(objList)
The ids of the objects will not be set (see https://docs.djangoproject.com/en/dev/ref/models/querysets/#bulk-create).
So I can't use these objects as foreign key targets. I thought of querying them back from the database after they're bulk created and then using them as foreign key objects, but I don't have their ids to query them with. How do I query these objects from the database (given that there can be duplicate values in columns other than the id)? Or is there a better way to use bulk created items as foreign keys?
If you have only three objects, as in your example, you might want to call save on each individually, wrapping the calls within a transaction, if it needs to be atomic.
If there are many more, which is likely the reason for using bulk_create, you could loop through them and call save on each, again wrapped in a transaction if required. One might not like this option, though, as running tonnes of individual insert queries can be a problem for some database setups.
Alternatively, a hack would be to add some known unique identifier to each object before saving, so you can re-query them afterwards.
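For example, a minimal sketch of that hack, assuming the model has a spare CharField to tag (the model name MyModel and the field name batch_uuid are hypothetical):

import uuid

from myapp.models import MyModel  # hypothetical model with a batch_uuid CharField

def bulk_create_and_refetch(objs):
    batch_id = uuid.uuid4().hex
    for obj in objs:
        obj.batch_uuid = batch_id          # tag the whole batch before inserting

    MyModel.objects.bulk_create(objs)

    # Query the freshly inserted rows back by the tag; these instances do have
    # their ids set and can be used as foreign key targets.
    return list(MyModel.objects.filter(batch_uuid=batch_id))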
I am currently working on a list implementation in python that stores a persistent list as a database:
https://github.com/DarkShroom/sqlitelist
I am tackling a design consideration: it seems that SQLite allows me to store the data without a primary key?
self.c.execute('SELECT * FROM unnamed LIMIT 1 OFFSET {}'.format(key))
This line of code retrieves an entry by its absolute row position.
Is this bad practice? Will I lose the data order at any point? Perhaps it's okay with SQLite, but my design will not translate to other database engines? Any thoughts from people more familiar with databases would be helpful. I am writing this so I don't have to deal with databases!
The documentation says:
If a SELECT statement that returns more than one row does not have an ORDER BY clause, the order in which the rows are returned is undefined.
So you cannot simply use OFFSET to identify rows.
A PRIMARY KEY constraint just tells the database that it must enforce UNIQUE and NOT NULL constraints on the PK columns. If you do not declare a PRIMARY KEY, these constraints are not automatically enforced, but this does not change the fact that you have to identify your rows somehow when you want to access them.
The easiest way to store list entries is to have the position in the list as a separate column. (If your program takes up most of its time inserting or deleting list entries, it might be a better idea to store the list not as an array but as a linked list, i.e., the database does not store the position but a pointer to the next entry.)
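For example, a minimal sqlite3 sketch with an explicit position column (the table and column names are just examples):

import sqlite3

# Store list entries keyed by an explicit position instead of relying on row order.
conn = sqlite3.connect('list.db')
conn.execute('CREATE TABLE IF NOT EXISTS items (position INTEGER PRIMARY KEY, value TEXT)')

def set_item(position, value):
    # Upsert the entry at the given position.
    conn.execute('INSERT OR REPLACE INTO items (position, value) VALUES (?, ?)',
                 (position, value))
    conn.commit()

def get_item(position):
    row = conn.execute('SELECT value FROM items WHERE position = ?',
                       (position,)).fetchone()
    return row[0] if row is not None else None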
In the DB I have a table called register that has mail-id as its primary key. I submit records in bulk using session.add_all(), but sometimes some records already exist; in that case I want to separate the already existing records from the non-existing ones.
http://docs.sqlalchemy.org/en/latest/orm/session_api.html#sqlalchemy.orm.session.Session.merge
If all the objects you are adding to the database are complete (e.g. the new object contains at least all the information that existed for the record in the database) you can use Session.merge(). Effectively merge() will either create or update the existing row (by finding the primary key if it exists in the session/database and copying the state across from the object you merge). The crucial thing to take note of is that the attribute values of the object passed to merge will overwrite that which already existed in the database.
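For example, a minimal sketch of the merge() approach, assuming a mapping along these lines (the mail_id attribute name and the extra column are assumptions, not taken from your code):

from sqlalchemy import Column, String, create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

Base = declarative_base()

class Register(Base):
    # Assumed mapping: table 'register' with mail_id as the primary key.
    __tablename__ = 'register'
    mail_id = Column(String, primary_key=True)
    name = Column(String)  # example extra column

engine = create_engine('sqlite:///example.db')
Base.metadata.create_all(engine)
Session = sessionmaker(bind=engine)
session = Session()

records = [Register(mail_id='a@example.com', name='Alice'),
           Register(mail_id='b@example.com', name='Bob')]
for obj in records:                 # instead of session.add_all(records)
    session.merge(obj)              # inserts new rows, updates existing ones
session.commit()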
I think this is not so great in terms of performance, so if that is important, SQLAlchemy has some bulk operations. You would need to check existence for the set of primary keys that will be added/updated, and do one bulk insert for the objects which didn't exist and one bulk update for the ones that did. The documentation has some info on the bulk operations if it needs to be a high-performance approach.
http://docs.sqlalchemy.org/en/latest/orm/persistence_techniques.html#bulk-operations
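A sketch of the higher-performance variant, continuing with the Register mapping and session from the sketch above: check which primary keys already exist, then split the incoming rows between one bulk insert and one bulk update.

# Incoming rows as plain dicts; bulk_update_mappings needs the primary key in each dict.
incoming = [{'mail_id': 'a@example.com', 'name': 'Alice'},
            {'mail_id': 'c@example.com', 'name': 'Carol'}]

existing_ids = {mail_id for (mail_id,) in
                session.query(Register.mail_id)
                       .filter(Register.mail_id.in_([r['mail_id'] for r in incoming]))}

to_update = [r for r in incoming if r['mail_id'] in existing_ids]
to_insert = [r for r in incoming if r['mail_id'] not in existing_ids]

session.bulk_update_mappings(Register, to_update)
session.bulk_insert_mappings(Register, to_insert)
session.commit()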
Use SQLAlchemy's inspector for this:
from sqlalchemy import inspect

inspector = inspect(engine)
inspector.get_primary_keys(table, schema)
The inspector "reflects" all primary keys and you can check against the returned list.
I am new to both python and SQLite.
I have used python to extract data from xlsx workbooks. Each workbook is one series of several sheets and is its own database, but I would also like a merged database of every series together. The structure is the same for all.
The structure of my database is:
* Table A with autoincrement primary key id, a logical variable, and 1 other variable.
* Table B with autoincrement primary key id, a logical variable, and 4 other variables.
* Table C joined by the Table A id and Table B id, which together form the primary key; it also has 4 other variables specific to this instance of intersection between Table A and Table B.
I tried using the answer at
Sqlite merging databases into one, with unique values, preserving foregin key relation
along with various other ATTACH solutions, but each time I got an error message ("cannot ATTACH database within transaction").
Can anyone suggest why I can't get ATTACH to work?
I also tried a ToMerge like the one at How can I merge many SQLite databases?
and it couldn't do ToMerge in the transaction either.
I also initially tried connecting to the existing SQLite db, making dictionaries from the existing tables in python, then adding the information in the dictionaries into a new 'merged' db, but this actually seemed to be far slower than the original process of extracting everything from the xlsx files.
I realize I can easily just run my xlsx to SQL python script again and again for each series directing it all into the one big SQL database and that is my backup plan, but I want to learn how to do it the best, fastest way.
So, what is the best way for me to merge identically structured SQLite databases into one, maintaining my foreign keys?
TIA for any suggestions
:-)
You cannot execute the ATTACH statement from inside a transaction.
You did not start a transaction, but Python tried to be clever, got the type of your statement wrong, and automatically started a transaction for you.
Set connection.isolation_level = None.
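For example, a minimal sketch (the database and table names are placeholders):

import sqlite3

conn = sqlite3.connect('merged.db')
conn.isolation_level = None          # autocommit: no implicit BEGIN before ATTACH

conn.execute("ATTACH DATABASE 'series1.db' AS src")
conn.execute("INSERT INTO tableA SELECT * FROM src.tableA")   # example copy
conn.execute("DETACH DATABASE src")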
I have a Django app with a model (Person) defined, and I also have some DBs (each containing a table Appointment) that do not have any models defined (they are not meant to be connected to the Django app).
I need to move some data from the Appointment table over to Person so that the Person table mirrors the Appointment table. It is this way because there are multiple independent DBs like Appointment that need to be copied into the Person table (so I do not want to make any architectural changes to how this is set up).
Here is what I do now:
res = sourcedb.fetchall()  # from Appointment Table
for myrecord in res:
    try:
        existingrecord = Person.objects.filter(vendorid=myrecord[12], office=myoffice)[0]
    except IndexError:  # no matching Person yet, create one
        existingrecord = Person(vendorid=myrecord[12], office=myoffice)
    existingrecord.firstname = myrecord[0]
    existingrecord.midname = myrecord[1]
    existingrecord.lastname = myrecord[2]
    existingrecord.address1 = myrecord[3]
    existingrecord.address2 = myrecord[4]
    existingrecord.save()
The problem is that this is way too slow (takes about 8 minutes for 20K records). What can I do to speed this up?
I have considered the following approach:
1. bulk_create: Cannot use this because I have to update sometimes.
2. Delete all and then bulk_create: there are dependencies on the Person model from other things, so I cannot delete records in the Person model.
3. INSERT ... ON DUPLICATE KEY UPDATE: cannot do this because the Person table's PK (primary key) is different from the Appointment table's PK. The Appointment PK is copied into the Person table. If there were a way to check for duplicates on two keys, this approach would work, I think.
A few ideas:
EDIT: See Trewq's comment on this and, first of all, create indexes on your tables…
Wrap it all in a transaction using with transaction.atomic():, as by default Django will create a new transaction per save() call which can become very expensive. With 20K records, one giant transaction might also be a problem, so you might have to write some code to split your transactions into multiple batches. Try it out and measure!
If RAM is not an issue (should not be one with 20k records), fetch all data first from the appointment table and then fetch all existing Person objects using a single SELECT query instead of one per record
Use bulk_create even if some of them are updates. This will still issue UPDATE queries for your updates, but will reduce all your INSERT queries to just one or a few, which is still an improvement. You can distinguish inserts from updates by the fact that inserts won't have a primary key set before calling save(); collect the inserts in a Python list for a later bulk_create instead of saving them directly (see the sketch after this list).
As a last resort: write raw SQL to make use of MySQL's INSERT … ON DUPLICATE KEY UPDATE syntax. You don't need the same primary key for this; a UNIQUE key would suffice. Keys can span multiple columns, see Django's Meta.unique_together model option.
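Putting the first three ideas together, a rough, untested sketch might look like this (it reuses sourcedb, myoffice and the column positions from the question's code, which are assumptions here):

from django.db import transaction

res = sourcedb.fetchall()  # from Appointment Table

# One query instead of one per record: map vendorid -> existing Person.
existing = {p.vendorid: p for p in Person.objects.filter(office=myoffice)}

to_create = []
with transaction.atomic():
    for myrecord in res:
        person = existing.get(myrecord[12])
        if person is None:
            person = Person(vendorid=myrecord[12], office=myoffice)
        person.firstname = myrecord[0]
        person.midname = myrecord[1]
        person.lastname = myrecord[2]
        person.address1 = myrecord[3]
        person.address2 = myrecord[4]
        if person.pk is None:
            to_create.append(person)      # new rows: defer to bulk_create
        else:
            person.save()                 # existing rows: still one UPDATE each
    Person.objects.bulk_create(to_create)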
I'm using a SQlite database and Django's QuerySet API to access this database. I wrote data sequentially into the database and each entry has a simple ID as the primary key (this is the Django default). This means that there is a continuous sequence of IDs now in the database (entry 1 has ID 1, entry 2 has ID 2, and so on). Now I needed to delete some entries again. This means that the sequence of IDs is discontinuous now (entry 1 has ID 1, but entry 2 might have ID 3, 8, 1432 or anything else, but not 2).
How can I restore this continuous sequence of IDs again and associate them with the remaining entries in the database? Is there a way to do this with Django's QuerySet API or do I need to use plain SQL? I have no experience with plain SQL, so some working code would be very helpful in this case. Thank you!
I cannot think of any situation in which doing this would be desirable. The best primary keys are immutable (although that's not a technical requirement) and the very purpose of using non-meaningful integer primary keys is to avoid having to update them.
I would even go so far as to say that if you require a meaningful, unbroken sequence of integers, create a separate column in your table, keep the primary key with its sequence breaks, and renumber the new "sequence" column when needed.
However, you may have requirements that I can't think of. If you really need to change the values in those keys, make sure that all the references to that column in your database are protected by FOREIGN KEY constraints, and check out the ON UPDATE CASCADE option when you declare a foreign key. It will instruct the database to do the updating for you.
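For illustration only, a small sqlite3 sketch of ON UPDATE CASCADE (the table names are made up):

import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('PRAGMA foreign_keys = ON')    # SQLite enforces FKs only when enabled
conn.executescript('''
    CREATE TABLE person (
        id INTEGER PRIMARY KEY,
        name TEXT
    );
    CREATE TABLE appointment (
        id INTEGER PRIMARY KEY,
        person_id INTEGER REFERENCES person(id) ON UPDATE CASCADE
    );
''')
conn.execute("INSERT INTO person (id, name) VALUES (5, 'Alice')")
conn.execute("INSERT INTO appointment (person_id) VALUES (5)")

conn.execute("UPDATE person SET id = 1 WHERE id = 5")   # renumber the parent key
print(conn.execute("SELECT person_id FROM appointment").fetchone())  # (1,): child follows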
But if you don't have to do this, don't.