Is there a way to improve a MERGE query? - python

I am using this query to insert new entries into my table:
MERGE INTO CLEAN clean USING DUAL ON (clean.id = :id)
WHEN NOT MATCHED THEN INSERT (ID, COUNT) VALUES (:id, :xcount)
WHEN MATCHED THEN UPDATE SET clean.COUNT = clean.count + :xcount
It seems that I do more inserts than updates; is there a way to improve my current performance?
I am using cx_Oracle with Python 3 and OracleDB 19c.

If you were having massive problems with your approach, you would very probably be missing an index on the column clean.id, which is required because the MERGE uses DUAL as a source for each single row.
This is unlikely, though, since you say the id is a primary key.
So basically you are doing the right thing, and you should see an execution plan similar to the one below:
----------------------------------------------------------------------------------------------------
| Id  | Operation                         | Name            | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------------------
|   0 | MERGE STATEMENT                   |                 |       |       |     2 (100)|          |
|   1 |  MERGE                            | CLEAN           |       |       |            |          |
|   2 |   VIEW                            |                 |       |       |            |          |
|   3 |    NESTED LOOPS OUTER             |                 |     1 |    40 |     2   (0)| 00:00:01 |
|   4 |     TABLE ACCESS FULL             | DUAL            |     1 |     2 |     2   (0)| 00:00:01 |
|   5 |     VIEW                          | VW_LAT_A18161FF |     1 |    38 |     0   (0)|          |
|   6 |      TABLE ACCESS BY INDEX ROWID  | CLEAN           |     1 |    38 |     0   (0)|          |
|*  7 |       INDEX UNIQUE SCAN           | CLEAN_UX1       |     1 |       |     0   (0)|          |
----------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
7 - access("CLEAN"."ID"=:ID)
So the execution plan is fine and works effectively, but it has one problem.
Remember: whenever you rely on an index, you will be happy while processing a few rows, but it will not scale.
If you are processing millions of records, you may fall back to two-step processing:
insert all rows into a temporary table
perform a single MERGE statement using the temporary table
The big advantage is that Oracle can use a hash join and get rid of the index access for each of the millions of rows.
Here is an example test: the clean table is initialized with 1M ids (not shown), and the script then performs 1M inserts and 1M updates:
n = 1000000
# ids 0 .. n-1 already exist in clean (they become updates); ids n .. 2n-1 are new (inserts)
data2 = [{"id": i, "xcount": 1} for i in range(2 * n)]
sql3 = """
insert into tmp (id, count)
values (:id, :xcount)"""
sql4 = """MERGE into clean USING tmp on (clean.id = tmp.id)
when not matched then insert (id, count) values (tmp.id, tmp.count)
when matched then update set clean.count = clean.count + tmp.count"""
cursor.executemany(sql3, data2)   # bulk-load the staging table
cursor.execute(sql4)              # one set-based MERGE instead of 2M single-row MERGEs
The test runs in approx. 10 seconds, which is less than half the time of your approach with MERGE using DUAL.
If this is still not enough, you'll have to use the parallel option.
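The snippet above assumes the staging table tmp already exists. A minimal sketch of a one-time setup as a global temporary table, reusing the cursor from above (the connection variable con, the ON COMMIT behaviour and the column types are assumptions to adapt to your schema):
cursor.execute("""
    create global temporary table tmp (
        id    number,
        count number
    ) on commit delete rows""")

# Per batch: bulk-load the staging table, run the single MERGE, then commit;
# the commit also empties tmp because of ON COMMIT DELETE ROWS.
cursor.executemany(sql3, data2)
cursor.execute(sql4)
con.commit()
With a global temporary table the staged rows stay private to the session, so concurrent loaders do not interfere with each other.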

MERGE is quite fast. Inserts are usually faster than updates, I'd say.
So, if you're asking how to make inserts faster, then it depends.
If you're inserting one row at a time, there shouldn't be any bottleneck.
If you're inserting millions of rows, see whether there are triggers enabled on the table that fire for each row and do something (slowing the process down).
As for updates, is there an index on the clean.id column? If not, it would probably help.
Otherwise, see what the explain plan says; collect statistics regularly.
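If you want to do those last two checks straight from Python, here is a minimal cx_Oracle sketch (the schema name "APP" and the sample statement being explained are assumptions; DBMS_STATS and DBMS_XPLAN are standard Oracle packages):
# Refresh optimizer statistics on the table.
cursor.callproc("dbms_stats.gather_table_stats", ["APP", "CLEAN"])

# Explain a representative statement and print the plan the optimizer chose.
cursor.execute("explain plan for select * from clean where id = 42")
cursor.execute("select plan_table_output from table(dbms_xplan.display())")
for line, in cursor:
    print(line)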

Related

How to create a table from another table with GridDB?

I have a GridDB container where I have stored my database. I want to copy the table, but excluding a few columns. The function I need should extract all columns matching a given keyword and then create a new table from them. It must always include the first column *id because it is needed in every table.
For example, in the table given below:
'''
-- | employee_id | department_id | employee_first_name | employee_last_name | employee_gender |
-- |-------------|---------------|---------------------|--------------------|-----------------|
-- | 1           | 1             | John                | Matthew            | M               |
-- | 2           | 1             | Alexandra           | Philips            | F               |
-- | 3           | 2             | Hen                 | Lotte              | M               |
'''
Suppose I need to get the first column and every other column starting with "employee". How can I do this through a Python function?
I am using the GridDB Python client on my Ubuntu machine, and I have already stored the database.csv file in the container. Thanks in advance for your help!
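A minimal sketch of the column-selection logic with pandas, assuming the data from database.csv has been loaded into a DataFrame (the helper name columns_matching is an assumption, and writing the result back into a GridDB container is not shown):
import pandas as pd

def columns_matching(df, keyword, id_col="employee_id"):
    # Always keep the id column, plus every column whose name starts with the keyword.
    keep = [id_col] + [c for c in df.columns if c.startswith(keyword) and c != id_col]
    return df[keep]

employees = pd.read_csv("database.csv")           # file name taken from the question
subset = columns_matching(employees, "employee")  # employee_id plus all employee_* columns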

SQL - Conditionally join and replace values between two tables

I have two tables, one holding "raw" data and the other holding "updated" data. The updated data just contains corrections of rows from the first table, but is essentially the same. It is a functional requirement for this data to be stored separately.
I want a query with the following conditions:
Select all rows from the first table
If there is a matching row in the second table (i.e. when raw_d.primary_key_col_1 = edit_d.primary_key_col_1 and raw_d.primary_key_col_2 = edit_d.primary_key_col_2), we use the most recent row from the second table rather than the first (where "most recent" is determined by the primary_key_col_3 value)
Otherwise we use the values from the first table.
Note: I have many more "value" columns in the actual data. Consider the following toy example with two tables, raw_d and edit_d:
 primary_key_col_1 | primary_key_col_2 | value_col_1 | value_col_2
-------------------+-------------------+-------------+-------------
 src_1             | dest_1            | 0           | 1
 src_2             | dest_2            | 5           | 4
 src_3             | dest_3            | 2           | 2
 src_4             | dest_4            | 6           | 3
 src_5             | dest_5            | 9           | 9
 primary_key_col_1 | primary_key_col_2 | primary_key_col_3 | value_col_1 | value_col_2
-------------------+-------------------+-------------------+-------------+-------------
 src_1             | dest_1            | 2020-05-09        | 7           | 0
 src_2             | dest_2            | 2020-05-08        | 6           | 1
 src_3             | dest_3            | 2020-05-07        | 5           | 2
 src_1             | dest_1            | 2020-05-08        | 3           | 4
 src_2             | dest_2            | 2020-05-09        | 2           | 5
The expected result is as given:
 primary_key_col_1 | primary_key_col_2 | value_col_1 | value_col_2
-------------------+-------------------+-------------+-------------
 src_1             | dest_1            | 7           | 0
 src_2             | dest_2            | 2           | 5
 src_3             | dest_3            | 5           | 2
 src_4             | dest_4            | 6           | 3
 src_5             | dest_5            | 9           | 9
My proposed solution is to query the "greatest n per group" from the second table and then "overwrite" rows in a query of the first table, using Pandas.
The first query would just grab data from the first table:
SELECT * FROM raw_d
The second query to select "the greatest n per group" would be as follows:
SELECT DISTINCT ON (primary_key_col_1, primary_key_col_2) * FROM edit_d
ORDER BY primary_key_col_1, primary_key_col_2, primary_key_col_3 DESC;
I planned on merging the data like in Replace column values based on another dataframe python pandas - better way?.
Does anyone know a better solution, preferably using SQL only? For reference, I am using PostgreSQL and Pandas as part of my data stack.
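For reference, a minimal pandas sketch of that overwrite step, assuming the two queries above have been loaded into DataFrames named raw_df and edit_latest (the names are assumptions; edit_latest is the DISTINCT ON result, one row per key pair):
import pandas as pd

keys = ["primary_key_col_1", "primary_key_col_2"]

# Drop the date column so both frames have the same shape, then prefer
# edited rows over raw ones when the key pair appears in both.
combined = pd.concat([edit_latest.drop(columns=["primary_key_col_3"]), raw_df])
result = combined.drop_duplicates(subset=keys, keep="first").reset_index(drop=True)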
I would suggest phrasing the requirements as:
select the most recent row from the second table
bring in additional rows from the first table that don't match
This is a union all with distinct on:
(select distinct on (primary_key_col_1, primary_key_col_2) u.primary_key_col_1, u.primary_key_col_2, u.value_col_1, u.value_col_2
from updated u
order by primary_key_col_1, primary_key_col_2, primary_key_col_3 desc
) union all
select r.primary_key_col_1, r.primary_key_col_2, r.value_col_1, r.value_col_2
from raw r
where not exists (select 1
from updated u
where u.primary_key_col_1 = r.primary_key_col_1 and
u.primary_key_col_2 = r.primary_key_col_2
);
As I understood from your question, there are two ways to solve this:
1. Using FULL OUTER JOIN
with cte as (
select distinct on (primary_key_col_1,primary_key_col_2) * from edit_d
order by primary_key_col_1, primary_key_col_2, primary_key_col_3 desc
)
select
coalesce(t1.primary_key_col_1,t2.primary_key_col_1),
coalesce(t1.primary_key_col_2,t2.primary_key_col_2),
coalesce(t1.value_col_1,t2.value_col_1),
coalesce(t1.value_col_2,t2.value_col_2)
from cte t1
full outer join raw_d t2
on t1.primary_key_col_1 = t2.primary_key_col_1
and t1.primary_key_col_2 = t2.primary_key_col_2
2. Using Union
select
distinct on (primary_key_col_1, primary_key_col_2)
primary_key_col_1, primary_key_col_2, value_col_1, value_col_2
from (
select * from edit_d
union all
select primary_key_col_1,primary_key_col_2, null as "primary_key_col_3",
value_col_1,value_col_2 from raw_d
order by primary_key_col_1, primary_key_col_2, primary_key_col_3 desc nulls last
)tab

Filter SQL elements with adjacent ID

I don't really know how to properly state this question in the title.
Suppose I have a table Word like the following:
| id | text |
| --- | --- |
| 0 | Hello |
| 1 | Adam |
| 2 | Hello |
| 3 | Max |
| 4 | foo |
| 5 | bar |
Is it possible to query this table based on text and receive the objects whose primary key (id) is exactly one off?
So, if I do
Word.objects.filter(text='Hello')
I get a QuerySet containing the rows
| id | text |
| --- | --- |
| 0 | Hello |
| 2 | Hello |
but I want the rows
| id | text |
| --- | --- |
| 1 | Adam |
| 3 | Max |
I guess I could do
word_ids = Word.objects.filter(text='Hello').values_list('id', flat=True)
word_ids = [w_id + 1 for w_id in word_ids] # or use a numpy array for this
Word.objects.filter(id__in=word_ids)
but that doesn't seem overly efficient. Is there a straight SQL way to do this in one call? Preferably directly using Django's QuerySets?
EDIT: The idea is that in fact I want to filter those words that are in the second QuerySet. Something like:
Word.objects.filter(text__of__previous__word='Hello', text='Max')
In plain Postgres you could use the lag window function (https://www.postgresql.org/docs/current/static/functions-window.html)
SELECT
    id,
    name
FROM (
    SELECT
        *,
        lag(name) OVER (ORDER BY id) AS prev_name
    FROM test
) s
WHERE prev_name = 'Hello'
The lag function adds a column with the text of the previous row, so you can filter on that column in the outer query.
I am not really into Django, but according to the documentation, support for window functions was added in version 2.0.
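A minimal ORM sketch of the same lag() idea, assuming the model from the question is called Word with fields id and text; note that filtering directly on a window annotation only became possible in Django 4.2, so on older versions you would post-filter or drop to raw SQL:
from django.db.models import F, Window
from django.db.models.functions import Lag

qs = (
    Word.objects
    .annotate(prev_text=Window(expression=Lag("text"), order_by=F("id").asc()))
    .filter(prev_text="Hello")  # needs Django 4.2+; window functions themselves need 2.0+
)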
If by "1 off" you mean that the difference is exactly 1, then you can do:
select w.*
from words w
where w.id in (select w2.id + 1 from words w2 where w2.text = 'Hello');
lag() is also a very reasonable solution. This seems like a direct interpretation of your question. If you have gaps (and the intention is + 1), then lag() is a bit trickier.
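The id + 1 idea can also be expressed as a single ORM query, so the incremented ids never have to be materialized in a Python list (Word is the model from the question; the annotation name next_id is an assumption):
from django.db.models import F

next_ids = (
    Word.objects
    .filter(text="Hello")
    .annotate(next_id=F("id") + 1)
    .values_list("next_id", flat=True)
)
# Evaluates as one SQL query with an IN (subquery), matching "Adam" and "Max" above.
following_words = Word.objects.filter(id__in=next_ids)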

Proper way to store ordered set of strings in database

First of all, I have an XML file I need to save in a MySQL database. I have child elements that can occur from one to unbounded times. Are there any constraints I can use in the SQLAlchemy ORM, or do I have to maintain the order from the application?
The table should look like:
+------+------+------+-----------+
| id   | name | part | parent_id |
+------+------+------+-----------+
| 1    | foo  | 1    | 123       |
+------+------+------+-----------+
| 2    | bar  | 2    | 123       |
+------+------+------+-----------+
| 3    | baz  | 1    | 345       |
+------+------+------+-----------+
In other words, what is the proper way to add explicit ordering to a many-to-many relationship?
Any ordering needs to be done in code. Once rows are inserted into a table and selected from it again, their order is not guaranteed, so on retrieval you have to apply an order yourself; adding ORDER BY to the SQL is the handiest way to do that.
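If you do want the ORM to maintain the position for you, one common pattern is an explicit part column kept in sync by SQLAlchemy's ordering_list; a minimal sketch, with model and column names following the table above (everything else is an assumption):
from sqlalchemy import Column, ForeignKey, Integer, String
from sqlalchemy.ext.orderinglist import ordering_list
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()

class Parent(Base):
    __tablename__ = "parent"
    id = Column(Integer, primary_key=True)
    # Children are loaded ordered by part, and part is renumbered automatically
    # when items are appended to, inserted into or removed from the list.
    children = relationship(
        "Child",
        order_by="Child.part",
        collection_class=ordering_list("part", count_from=1),
    )

class Child(Base):
    __tablename__ = "child"
    id = Column(Integer, primary_key=True)
    name = Column(String(64))
    part = Column(Integer)
    parent_id = Column(Integer, ForeignKey("parent.id"))
Appending to parent.children then keeps part contiguous without manual bookkeeping, and an ORDER BY on part at retrieval time gives the original document order back.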

Improving MySQL read time, MySQLdb

I have a table with more than a million records and the following structure:
mysql> SELECT * FROM Measurement;
+----------------+---------+-------+------+------+
| Time_stamp     | Channel | SSID  | CQI  | SNR  |
+----------------+---------+-------+------+------+
| 03_14_14_30_14 |       7 | open  |   40 |  -70 |
| 03_14_14_30_14 |       7 | roam  |   31 |  -79 |
| 03_14_14_30_14 |       8 | open2 |   28 |  -82 |
| 03_14_14_30_15 |       8 | roam2 |   29 |  -81 | ...
I am reading data from this table into Python for plotting. The problem is that the MySQL reads are too slow and it is taking me hours to get the plots, even after using MySQLdb.cursors.SSCursor (as suggested by a few people on this forum) to speed up the task.
con = mdb.connect('localhost', 'testuser', 'conti', 'My_Freqs', cursorclass=MySQLdb.cursors.SSCursor)
cursor = con.cursor()
cursor.execute("Select Time_stamp FROM Measurement")
for row in cursor:
    # ... do processing ...
Will normalizing the table help me speed up the task? If so, how should I normalize it?
P.S.: Here is the result of EXPLAIN (i.e. DESCRIBE) on the table:
+------------+--------------+------+-----+---------+-------+
| Field      | Type         | Null | Key | Default | Extra |
+------------+--------------+------+-----+---------+-------+
| Time_stamp | varchar(128) | YES  |     | NULL    |       |
| Channel    | int(11)      | YES  |     | NULL    |       |
| SSID       | varchar(128) | YES  |     | NULL    |       |
| CQI        | int(11)      | YES  |     | NULL    |       |
| SNR        | float        | YES  |     | NULL    |       |
+------------+--------------+------+-----+---------+-------+
The problem is probably that you are looping over the cursor instead of dumping out all the data at once and then processing it. You should be able to dump a couple of million rows in a few seconds. Try something like
cursor.execute("select Time_stamp FROM Measurement")
data = cursor.fetchall()
for row in data:
    # do some stuff...
Well, since you're saying the whole table has to be read, I guess you can't do much about it. It has more than 1 million records... you're not going to optimize much on the database side.
How much time does it take you to process just one record? Maybe you could try optimizing that part. But even if you got down to 1 millisecond per record, it would still take you about half an hour to process the full table. You're dealing with a lot of data.
Maybe run multiple plotting jobs in parallel? With the same numbers as above, dividing your data into 6 equal-sized jobs would (theoretically) give you the plots in about 5 minutes.
Do your plots have to be fine-grained? You could look for ways to ignore certain values in the data and generate a complete plot only when the user needs it (wild speculation here; I really have no idea what your plots look like).
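As one purely illustrative way of coarsening the data before it ever reaches Python, you could let MySQL aggregate per timestamp and channel, so far fewer rows are transferred and plotted (column names and connection details follow the question; averaging is an assumed choice):
import MySQLdb as mdb

con = mdb.connect('localhost', 'testuser', 'conti', 'My_Freqs')
cursor = con.cursor()
# One averaged point per (Time_stamp, Channel) instead of one row per SSID sample.
cursor.execute("""
    SELECT Time_stamp, Channel, AVG(CQI), AVG(SNR)
    FROM Measurement
    GROUP BY Time_stamp, Channel""")
points = cursor.fetchall()  # far fewer rows to hand to the plotting code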
