I currently populate a database from a third party API that involves downloading files containing multiple SQL INSERT/DELETE/UPDATE statements and then parsing them into SQLAlchemy ORM objects to load into my database.
These files often contain errors, so I've tried to build in some integrity checks. The particular one I'm currently struggling with is duplicate records: receiving a file with a statement that inserts a record that already exists. To guard against this I put a unique index on the fields that form a composite primary key. However, this means I get an error when processing a file whose SQL tries to duplicate a record and a flush or commit is subsequently issued.
I don't want to commit records to the database until all the SQL statements for a given file have been processed, so I can keep track of what's been processed. I was thinking I could issue a flush at the end of processing each statement and add some error handling for when it fails because of a duplicate record, bypassing the offending statement. However, as I understand the docs, issuing a rollback would cancel all the statements processed up to that point, when I only want to skip the duplicate one.
Is there an option to partially roll back in some way, or do I need to build an up-front check that queries the database to see whether executing a statement would create a duplicate record?
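To illustrate, this is roughly the per-statement pattern I have in mind, sketched with SAVEPOINTs via Session.begin_nested() (assuming the backend supports SAVEPOINTs; objects_from_file and session are placeholders):

from sqlalchemy.exc import IntegrityError

for obj in objects_from_file:            # placeholder: ORM objects parsed from one file
    try:
        with session.begin_nested():     # emits a SAVEPOINT
            session.add(obj)
            session.flush()              # duplicate-key violations surface here
    except IntegrityError:
        pass                             # only this SAVEPOINT is rolled back; earlier work is kept

session.commit()                         # single commit once the whole file is processed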
Related
Just a logic question really... I have a script that takes rows of data from a CSV, parses the cell values to make the data uniform, and checks the database that a key/primary value doesn't already exist, so as to prevent duplicates! At the moment the first 10-15k entries commit to the DB fairly quickly, but then it really starts slowing down as there are more entries in the DB to check against for duplicates... by the time there are 100k rows in the DB the commit speed is about 1/sec, argh...
So my question: is it (Pythonically) more efficient to extract and parse the data separately from the DB commit procedure (maybe in a class-based script, or could I add multiprocessing to the CSV parsing or the DB commits)? And is there a quicker method to check the database for duplicates if I am only cross-referencing one table and one value?
Much appreciated
Kuda
If the first 10-15k entries worked fine, the issue is probably with the database query. Do you have a suitable index, and is that index actually used by the database? You can use an EXPLAIN statement to see what the database is doing and whether it actually uses the index for the particular query Django issues.
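For example, a quick check from Django itself (Entry and its key field are placeholders for your model; QuerySet.explain() needs Django 2.1+):

print(Entry.objects.filter(key="some-value").explain())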
If the table starts empty, it might also help to run ANALYZE TABLE after the first few thousand rows; the query optimiser might have stale statistics from when the table was empty. To test this hypothesis, you can connect to the database while the script is running, when it starts to slow down, and run ANALYZE TABLE manually. If it immediately speeds up, the problem was indeed stale statistics.
As for optimisation of the database commits themselves, it probably isn't an issue in your case (since the first 10k rows perform fine), but one aspect is round-trips: every query has to go to the database and get the results back, which is especially noticeable if the database is across a network. If you need to speed that up, Django has a bulk_create() method to insert many rows at once. However, if you do that, you'll only get an error for the whole batch of rows if you try to insert duplicates forbidden by the database indexes; you'll then have to find the particular row causing the error with other code.
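As a rough sketch of that batching idea (the Entry model, the row layout, and ignore_conflicts, which needs Django 2.2+ and silently skips duplicate rows instead of raising, are all assumptions):

from itertools import islice

BATCH_SIZE = 1000

def load(rows):
    # rows: an iterable of parsed CSV dicts (hypothetical layout)
    objs = (Entry(key=r["key"], value=r["value"]) for r in rows)
    while True:
        batch = list(islice(objs, BATCH_SIZE))
        if not batch:
            break
        # ignore_conflicts=True skips rows that hit a unique constraint
        # instead of failing the whole batch
        Entry.objects.bulk_create(batch, ignore_conflicts=True)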
I am trying to do a bulk insert of documents into a MongoDB collection in Python, using pymongo. This is what the code looks like:
collection_name.insert_many([logs[i] for i in range(len(logs))])
where logs is a list of dictionaries of variable length.
This works fine when there are no issues with any of the logs. However, if any one of the logs has some kind of issue and pymongo refuses to save it (say, the issue is something like the document fails to match the validation schema set for that collection), the entire bulk insert is rolled back and no documents are inserted in the database.
Is there any way I can retry the bulk insert by ignoring only the defective log?
You can ignore those types of errors by passing ordered=False as an option: collection.insert_many(logs, ordered=False). All operations are attempted before an exception is raised, which you can then catch.
See https://api.mongodb.com/python/current/api/pymongo/collection.html#pymongo.collection.Collection.insert_many
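For example (a sketch; what you do with the rejected documents beyond printing them is up to you):

from pymongo.errors import BulkWriteError

try:
    collection_name.insert_many(logs, ordered=False)
except BulkWriteError as exc:
    # every valid document has already been inserted at this point; the
    # rejected ones are listed under "writeErrors"
    for err in exc.details["writeErrors"]:
        print(err["index"], err["errmsg"])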
Some devices are asynchronously storing values on a common remote MySQL database server.
I would like to write a supervisor app in Python (and possibly SQLAlchemy) that recognises external INSERT events on the database and acts on the data in the latest rows. This is to avoid a long manual check of whether every table is being updated regularly or a logger has crashed.
Can somebody point me to where to look for this kind of information online and, even better, an example?
EDIT
I already read all the tables periodically using a datetime primary key ({date_time}), loading the last row of each table and comparing it to the previous values:
SELECT * FROM table ORDER BY date_time DESC LIMIT 1
but it looks very cumbersome and doesn't guarantee that I won't lose some rows between successive database checks.
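For reference, this is roughly what that polling loop looks like with SQLAlchemy Core (the connection string, the list of logger tables, the polling interval, and the handle_new_row callback are all placeholders):

import time
from sqlalchemy import create_engine, text

engine = create_engine("mysql+pymysql://user:password@host/db")  # hypothetical DSN
last_seen = {}   # table name -> last date_time value observed

while True:
    with engine.connect() as conn:
        for table in logger_tables:      # hypothetical list of logger table names
            row = conn.execute(
                text(f"SELECT * FROM {table} ORDER BY date_time DESC LIMIT 1")
            ).fetchone()
            if row is not None and row.date_time != last_seen.get(table):
                last_seen[table] = row.date_time
                handle_new_row(table, row)   # hypothetical callback
    time.sleep(60)   # rows arriving faster than the polling interval can be missed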
The engine is an old version of InnoDB that I cannot upgrade: I cannot use the UPDATE field in the schema because it simply doesn't work.
To reword my question:
How can I listen for any database event with a daemon-like Python application (a sleeping thread) and wake up only when something happens?
I also want to avoid SQL triggers because they would be far too heavy to manage: there are hundreds of tables, and they are added/removed very often depending on the active loggers.
I had a look at SQLAlchemy, but all the references I could find, if I haven't misunderstood them, are decorators that act on INSERTs made by SQLAlchemy itself. I didn't find anything about external changes to the database.
About the example request: I am not interested in copy-and-paste, because first I want to understand how things work. I prefer (even incomplete) examples because the SQLAlchemy documentation is far too deep for my knowledge and I simply cannot put the pieces together.
I have an sqlite3 database file with multiple tables, each one with different values. What I want to do is check, when inserting a value into a table, whether it already exists in any table, and if it does, return an error or something.
This is because I'm writing a program to help nurses keep a database of their patients and check whether a patient has already been entered into the database. I'm not posting any code because I'm gathering all the information needed before programming anything, to avoid spaghetti code.
Try adding a constraint to one or more of your columns so it doesn't allow duplicates to be added.
Like this:
CONSTRAINT <Constraint Name> UNIQUE (<column1>,<column2>)
Then in your code you could catch the SQL exception and return a custom message
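For example, with Python's built-in sqlite3 module (the patients table and its columns are just placeholders):

import sqlite3

conn = sqlite3.connect("nurses.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS patients ("
    " first_name TEXT, last_name TEXT, dob TEXT,"
    " CONSTRAINT uq_patient UNIQUE (first_name, last_name, dob))"
)

try:
    conn.execute(
        "INSERT INTO patients (first_name, last_name, dob) VALUES (?, ?, ?)",
        ("Jane", "Doe", "1980-01-01"),
    )
    conn.commit()
except sqlite3.IntegrityError:
    print("This patient is already in the database.")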
I ran an ALTER TABLE query to add some columns to a table and then db.commit(). That didn't raise any error or warning, but in Oracle SQL Developer the new columns don't show up on SELECT *....
So I tried to rerun the ALTER TABLE, but it raised
cx_Oracle.DatabaseError: ORA-14411: The DDL cannot be run concurrently with other DDLs
That kinda makes sense (I can't create columns that already exist) but when I try to fill the new column with values, I get a message
SQL Error: ORA-00904: "M0010": invalid ID
00904. 00000 - "%s: invalid identifier"
which suggests that the new column has not been created yet.
Does anybody understand what may be going on?
UPDATE/SOLVED: I kept trying to run the queries another couple of times and at some point things suddenly started working (for no apparent reason). Maybe processing time? That would be weird, because the queries are ultra light. I'll get back to this if it happens again.
First, you don't need commit(); DDL implicitly commits any open transaction.
ORA-14411 means
Another conflicting DDL was already running.
so it seems that your first ALTER TABLE statement hadn't finished yet (probably the table is too big, or there is some other issue).
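For reference, a minimal sketch of issuing the DDL from cx_Oracle (the connection string, table name, and column type are assumptions):

import cx_Oracle

conn = cx_Oracle.connect("user/password@host/service_name")  # hypothetical credentials
cur = conn.cursor()
# DDL commits implicitly, so no conn.commit() is needed afterwards
cur.execute("ALTER TABLE my_table ADD (M0010 NUMBER)")  # hypothetical table and column type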