DB Table:
id int(6)
message char(5)
I have to add a record (a message) to the DB table. If the message is a duplicate (the same message already exists with a different id), I want to delete (or somehow inactivate) both messages and get their IDs back in reply.
Is it possible to do this with only one query? Any performance tips?
P.S.
I use PostgreSQL.
The main problem I'm worried about is the need to use locks when performing this with two or more queries...
Many thanks!
If you really want to worry about locking, do this.
UPDATE table SET status='INACTIVE' WHERE id = 'key';
If this succeeds, there was a duplicate.
INSERT the additional inactive record. Do whatever else you want with your duplicates.
If this fails, there was no duplicate.
INSERT the new active record.
Commit.
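A minimal sketch of that flow with psycopg2, assuming a messages table with the status column introduced above and matching duplicates on the message column (the table name, connection string, and msg variable are placeholders):
import psycopg2

conn = psycopg2.connect("dbname=mydb")  # placeholder connection string
msg = "hello"                           # the incoming message
with conn, conn.cursor() as cur:        # the outer 'with conn' commits on success
    cur.execute(
        "UPDATE messages SET status = 'INACTIVE' WHERE message = %s RETURNING id",
        (msg,),
    )
    dup_ids = [row[0] for row in cur.fetchall()]
    if dup_ids:
        # A duplicate existed: store the new message as inactive too,
        # and collect every id involved, as the question asks
        cur.execute(
            "INSERT INTO messages (message, status) VALUES (%s, 'INACTIVE') RETURNING id",
            (msg,),
        )
        dup_ids.append(cur.fetchone()[0])
    else:
        # No duplicate: insert the new active record
        cur.execute(
            "INSERT INTO messages (message, status) VALUES (%s, 'ACTIVE')",
            (msg,),
        )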
This seizes an exclusive lock right away. The alternatives aren't quite as nice.
Starting with an INSERT and then checking for duplicates doesn't seize a lock until you start updating. It's not clear whether this is a problem or not.
Starting with a SELECT would require adding a LOCK TABLE to ensure the SELECT holds the row it finds so it can be updated. If no row is found, the insert will work fine.
If you have multiple concurrent writers and two writers could attempt access at the same time, you may not be able to tolerate row-level locking.
Consider this.
Process A does a LOCK ROW and a SELECT but finds no row.
Process B does a LOCK ROW and a SELECT but finds no row.
Process A does an INSERT and a COMMIT.
Process B does an INSERT and a COMMIT. You now have duplicate active records.
Multiple concurrent insert/update transactions will only work with table-level locking. Yes, it's a potential slow-down. Three rules: (1) Keep your transactions as short as possible, (2) release the locks as quickly as possible, (3) handle deadlocks by retrying.
You could write a procedure with both of those commands in it, but it may make more sense to use an insert trigger to check for duplicates (or a nightly job, if it's not time-sensitive).
It is a little difficult to understand your exact requirement. Let me rephrase it two ways:
You want to keep both entries with the same message in the table (with different IDs), and want to know their IDs for some further processing (marking them as inactive, etc.). For this, you could write a procedure with the separate queries. I don't think you can achieve this with one query.
You do not want either of the entries in the table (I got this from 'I want to delete'). For this, you only have to check whether the message already exists, delete the row if it does, and insert it otherwise. I don't think this can be achieved with one query either.
If performance is a constraint during insert, you could insert without any checks and then periodically sanitize the database.
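If you go that route, the periodic clean-up could be a single statement run from a scheduled job. A sketch, reusing the status column from the other answer (the table name, column names, and connection string are assumptions):
import psycopg2

conn = psycopg2.connect("dbname=mydb")  # placeholder connection string
with conn, conn.cursor() as cur:
    # Inactivate every message that occurs more than once and report the ids
    cur.execute(
        """
        UPDATE messages
           SET status = 'INACTIVE'
         WHERE message IN (SELECT message
                             FROM messages
                         GROUP BY message
                           HAVING COUNT(*) > 1)
        RETURNING id, message
        """
    )
    duplicates = cur.fetchall()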
Related
Short version: Need a faster/better way to update many column comments at once in spark/databricks. I have a pyspark notebook that can do this sequentially across many tables, but if I call it from multiple tasks they take so long waiting on a hive connection that I get timeout failures.
Command used: ALTER TABLE my_db_name.my_table_name CHANGE my_column COMMENT "new comment" (docs)
Long version: I have a data dictionary notebook where I maintain column descriptions that are reused across multiple tables. If I run the notebook directly it successfully populates all my database table and column comments by issuing the above command sequentially for every column across all tables (and the corresponding table description command once).
I'm trying to move this to a by-table call. In the Databricks tasks that populate the tables I have a check to see if the output table exists. If not, it's created, and at the end I call the dictionary notebook (using dbutils.notebook.run("Data Dictionary Creation", 600, {"db": output_db, "update_table": output_table})) to populate the comments for that particular table. If this happens simultaneously for multiple tables, however, the notebook calls now time out, as most of the tasks spend a lot of time waiting for a client connection with Hive. This is true even though there's only one call of the notebook per table.
Solution Attempts:
I tried many variations of the above command to update all column comments in one call per table, but it's either impossible or my syntax is wrong.
It's unclear to me how to avoid the timeout issues (I've doubled the timeout to 10 minutes and it still fails, while the original notebook takes much less time than that to run across all tables!). I need to wait for completion before continuing to the next task (or I'd spawn it as a process).
Update: I think what's happening here is that the above Alter command is being called in a loop, and when I schedule a job this loop is being distributed and called in parallel. What I may actually need is a way to call it, or a function in it, without letting the loop be distributed. Is there a way to force sequential execution for a single function?
In the end I found a solution for this issue.
First, the problem seems to have been that the loop with the ALTER command was getting parallelized by spark, and thus firing multiple (conflicting) commands simultaneously on the same table.
The answer to this was two-fold:
Add a .coalesce(1) to the end of the function I was calling with the ALTER line. This limits the function to sequential execution.
Return a newly created dummy DataFrame from the function to avoid coalesce-based errors.
Part 2 seems to have been necessary because the command is, I think, meant to return a result for aggregation. I couldn't find a way to make it work without that (.repartition(1) had the same issue), so in the end I returned spark.createDataFrame([(1, "foo")],["id", "label"]) from the function and things then worked.
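For reference, a rough sketch of the shape described above; the function name, its arguments, and the comments dict are illustrative, not the actual notebook code:
def update_column_comments(spark, table_name, comments):
    # comments: dict of column name -> comment text
    for col_name, comment in comments.items():
        # Driver-side loop, one ALTER at a time against the same table
        spark.sql(
            'ALTER TABLE {} CHANGE {} COMMENT "{}"'.format(table_name, col_name, comment)
        )
    # Hand back a tiny dummy DataFrame so the caller has something to coalesce
    return spark.createDataFrame([(1, "foo")], ["id", "label"])

result = update_column_comments(
    spark, "my_db_name.my_table_name", {"my_column": "new comment"}
).coalesce(1)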
This gets me to my desired end goal of working through all the alter commands without conflict errors.
It's clunky as hell though; I'd still love improvements or alternative approaches if anyone has one.
If you want to change multiple columns at once, why not recreate the table? (This trick will work only if table 'B' is an external table. Here table 'B' is the 'B'ad table with outdated comments. Table 'A' is the good table with good comments.)
drop table ('B')
create table with required comments ( 'A' )
If this table is NOT external, then you might want to create a view and start using that. This would enable you to add updated comments without altering the original table's data.
Have you considered using table properties instead of comments?
I have a large database of elements each of which has unique key. Every so often (once a minute) I get a load more items which need to be added to the database but if they are duplicates of something already in the database they are discarded.
My question is - is it better to...:
Get Django to give me a list (or set) of all of the unique keys and then, before trying to add each new item, check if its key is in the list or,
have a try/except statement around the save call on the new item and rely on Django catching duplicates?
Cheers,
Jack
If you're using MySQL, you have the power of INSERT IGNORE at your fingertips, and that would be the most performant solution. You can execute custom SQL queries using the cursor API directly. (https://docs.djangoproject.com/en/1.9/topics/db/sql/#executing-custom-sql-directly)
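With Django's cursor it could look roughly like this (the table and column names are placeholders for whatever your model maps to):
from django.db import connection

def insert_ignore(key, payload):
    with connection.cursor() as cursor:
        # Duplicates of the unique key are silently discarded by MySQL
        cursor.execute(
            "INSERT IGNORE INTO myapp_item (unique_key, payload) VALUES (%s, %s)",
            [key, payload],
        )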
If you are using Postgres or some other data-store that does not support INSERT IGNORE then things are going to be a bit more complicated.
In the case of Postgres, you can use rules to essentially make your own version of INSERT IGNORE.
It would look something like this:
CREATE RULE "insert_ignore" AS ON INSERT TO "some_table"
WHERE EXISTS (SELECT 1 FROM some_table WHERE pk=NEW.pk) DO INSTEAD NOTHING;
Whatever you do, avoid the "select all rows and check first" approach: its worst-case performance is O(n) in Python, and it essentially throws away any performance advantage afforded by your database, since the check is performed on the app machine (and it is eventually memory-bound as well).
The try/except approach is marginally better than the "select all rows" approach but it still requires constant hand-off to the app server to deal with each conflict, albeit much quicker. Better to make the database do the work.
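For comparison, the try/except variant from the question would be something like this (Item and its fields are placeholders):
from django.db import IntegrityError

def save_if_new(key, payload):
    try:
        Item.objects.create(unique_key=key, payload=payload)
    except IntegrityError:
        # The unique key already exists: discard the incoming item
        pass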
Have a question about what sort of approach to take on a process I am trying to structure. Working with PostgreSQL and Python.
Scenario:
I have two databases A and B.
B is a processed version of A.
Data continually streams into A, which needs to be processed in a certain way (using multi-processing) and is then stored in B.
Each new row in A needs to be processed only once.
So:
streamofdata ===> [database A] ----> process ----> [database B]
Database A is fairly large (40 GB) and growing. My question is about determining which data is new and has not yet been processed into B. What is the best way to determine which rows still have to be processed?
Matching primary keys each time against what has not yet been processed is not the way to go, I am guessing.
So let's say new rows 120 to 130 come into database A over some time period, and my last processed row was 119. Is it a correct approach to look at the id (the primary key) of the last processed row, 119, and say that anything beyond it should now be processed?
I'm also wondering whether anyone has further resources on this sort of 'realtime' data processing. I'm not exactly sure what I'm looking for, technically speaking.
Well, there are a few ways you could handle this problem. As a reminder, the process you are describing is basically re-implementing a form of database replication, so you may want to familiarize yourself with the various popular replication options out there for Postgres and how they work, particularly Slony might be of interest to you. You didn't specify what sort of database "database B" is, so I'll assume it's a separate PostgreSQL instance, though that assumption won't change a whole lot about the decisions below other than ruling out some canned solutions like Slony.
Set up a FOR EACH ROW trigger on the important table(s) you have in database A which need to be replicated. Your trigger would take each new row INSERTed (and/or UPDATEd, DELETEd, if you need to catch those) in those tables and send them off to database B appropriately. You mentioned using Python, so just a reminder you can certainly write these trigger functions in PL/python if that makes life easy for you, i.e. you should hopefully be able to more-or-less easily tweak your existing code so that it runs inside the database as a PL/Python trigger function.
If you read up on Slony, you might have noticed that proposal #1 is very similar to how Slony works -- consider whether it would be easy or helpful for you to have Slony take over the replication of the necessary tables from database A to database B, then if you need to further move/transform the data into other tables inside database B, you might do that with triggers on those tables in database B.
Set up a trigger or RULE which will send out a NOTIFY with a payload indicating the row which has changed. Your code will LISTEN for these notifications and know immediately which rows have changed. The psycopg2 adapter has good support for LISTEN and NOTIFY. N.B. you will need to exercise some care to handle the case where your listener code has crashed, gets disconnected from the database, or otherwise misses some notifications.
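On the Python side, the LISTEN loop is roughly the standard psycopg2 pattern below; the channel name new_rows and the id-only payload are assumptions that your trigger or rule on database A would define:
import select
import psycopg2
import psycopg2.extensions

conn = psycopg2.connect("dbname=A")  # placeholder DSN for database A
conn.set_isolation_level(psycopg2.extensions.ISOLATION_LEVEL_AUTOCOMMIT)
cur = conn.cursor()
cur.execute("LISTEN new_rows;")

while True:
    # Wait up to 5 seconds for the connection's socket to become readable
    if select.select([conn], [], [], 5) == ([], [], []):
        continue  # timed out, no notifications yet
    conn.poll()
    while conn.notifies:
        notify = conn.notifies.pop(0)
        row_id = notify.payload  # the trigger put the changed row's id here
        # fetch row_id from database A, process it, write the result to database B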
In case you have control over the code streaming data into database A, you could have that code take over the job of replicating its new data into database B.
We are writing an inventory system and I have some questions about SQLAlchemy (PostgreSQL) and transactions/sessions. This is a web app using TG2; not sure this matters, but too much info is never bad.
How can I make sure that when changing inventory quantities I don't run into race conditions? If I understand it correctly: if user one is going to decrement an item's inventory to, say, 0, and user two is also trying to decrement the inventory to 0, then if user one's session hasn't been committed yet, user two's starting inventory number will be the same as user one's, resulting in a race condition when both commit, with one overwriting the other instead of having a compound effect.
If I wanted to use a PostgreSQL sequence for things like order/invoice numbers, how can I get/set the next values from SQLAlchemy without running into race conditions?
EDIT: I think I found the solution: I need to use with_lockmode, using FOR UPDATE or FOR SHARE. I am going to leave this open for more answers or for others to correct me if I am mistaken.
TIA
If two transactions try to set the same value at the same time one of them will fail. The one that loses will need error handling. For your particular example you will want to query for the number of parts and update the number of parts in the same transaction.
There is no race condition on sequence numbers. Save a record that uses a sequence number and the DB will automatically assign it.
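A sketch of wiring a PostgreSQL sequence into a model so the database hands out the numbers itself (the model, column, and sequence names are placeholders):
from sqlalchemy import Column, Integer, Sequence

class Invoice(Base):  # Base and the rest of the model are assumed to exist already
    __tablename__ = "invoice"
    id = Column(Integer, primary_key=True)
    # The sequence fires during INSERT, so concurrent sessions never collide
    number = Column(Integer, Sequence("invoice_number_seq"), unique=True)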
Edit:
Note as Limscoder points out you need to set the isolation level to Repeatable Read.
Set up the scenario you are talking about and see how your configuration handles it. Just open up two separate connections to test it.
Also read up on FOR UPDATE and on transaction isolation levels.
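A sketch of the row-lock approach mentioned in the question's edit; with_for_update() is the current spelling of with_lockmode('update'), and Item, engine, and item_id are placeholders:
from sqlalchemy.orm import Session

with Session(engine) as session, session.begin():
    item = (
        session.query(Item)
        .filter(Item.id == item_id)
        .with_for_update()   # emits SELECT ... FOR UPDATE, locking the row
        .one()
    )
    item.quantity -= 1       # safe: other transactions wait on the FOR UPDATE above
# session.begin() commits on exit, releasing the row lock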
I am looking for a simple way to query an update or insert based on if the row exists in the first place. I am trying to use Python's MySQLdb right now.
This is how I execute my query:
self.cursor.execute("""UPDATE `inventory`
SET `quantity` = `quantity`+{1}
WHERE `item_number` = {0}
""".format(item_number,quantity));
I have seen four ways to accomplish this:
DUPLICATE KEY. Unfortunately the primary key is already taken up as a unique ID so I can't use this.
REPLACE. Same as above, I believe it relies on a primary key to work properly.
mysql_affected_rows(). Usually you can use this after updating the row to see if anything was affected. I don't believe MySQLdb in Python supports this feature.
Of course the last ditch effort: Make a SELECT query, fetchall, then update or insert based on the result. Basically I am just trying to keep the queries to a minimum, so 2 queries instead of 1 is less than ideal right now.
Basically I am wondering if I missed any other way to accomplish this before going with option 4. Thanks for your time.
MySQL DOES allow you to have unique indexes, and INSERT ... ON DUPLICATE KEY UPDATE will do the update if any unique index has a duplicate, not just the PK.
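With a unique index on item_number (as suggested at the end of this answer), the single-statement form would be roughly this, using the question's column names:
self.cursor.execute(
    """INSERT INTO `inventory` (`item_number`, `quantity`)
       VALUES (%s, %s)
       ON DUPLICATE KEY UPDATE `quantity` = `quantity` + VALUES(`quantity`)""",
    (item_number, quantity),
)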
However, I'd probably still go for the "two queries" approach. You are doing this in a transaction, right?
Do the update
Check the rows affected; if it's 0, then do the insert
OR
Attempt the insert
If it failed because of a unique index violation, do the update (NB: You'll want to check the error code to make sure it didn't fail for some OTHER reason)
The former is good if the row will usually exist already, but it can cause a race condition (or deadlock) if you do it outside a transaction or your isolation level is not high enough.
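A sketch of the first variant with MySQLdb; cursor.rowcount is the DB-API counterpart of mysql_affected_rows(), and note that with MySQL's default flags an UPDATE that leaves the value unchanged also reports 0 affected rows (self.connection is assumed to exist alongside self.cursor):
self.cursor.execute(
    "UPDATE `inventory` SET `quantity` = `quantity` + %s WHERE `item_number` = %s",
    (quantity, item_number),
)
if self.cursor.rowcount == 0:
    # Nothing matched (or nothing changed), so insert the row instead
    self.cursor.execute(
        "INSERT INTO `inventory` (`item_number`, `quantity`) VALUES (%s, %s)",
        (item_number, quantity),
    )
self.connection.commit()  # keep both statements in one transaction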
Creating a unique index on item_number in your inventory table sounds like a good idea to me, because I imagine (without knowing the details of your schema) that one item should only have a single stock level (assuming your system doesn't allow multiple stock locations etc).