Proper way to store ordered set of strings in database - python

First of all, I have an XML file that I need to save in a MySQL database. It has child elements that can occur from one to an unbounded number of times. Are there any constraints I can use in the SQLAlchemy ORM, or do I have to maintain the order from the application?
The table should look like:
+------+------+------+-----------+
| id   | name | part | parent_id |
+------+------+------+-----------+
| 1    | foo  | 1    | 123       |
| 2    | bar  | 2    | 123       |
| 3    | baz  | 1    | 345       |
+------+------+------+-----------+
In other words, what is the proper way to add explicit ordering to a many-to-many relationship?

Any ordering needs to be done in code. Once rows are inserted into a table, the order in which a SELECT returns them is not guaranteed, so you also have to apply an order on retrieval; adding ORDER BY to the SQL is the handiest way to do that.
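For the table layout shown above, a minimal sketch of how the ordering could be handled with the SQLAlchemy ORM (1.4+ style); the part column and table shape come from the question, everything else is an assumption:

from sqlalchemy import Column, ForeignKey, Integer, String
from sqlalchemy.orm import declarative_base, relationship
from sqlalchemy.ext.orderinglist import ordering_list

Base = declarative_base()

class Parent(Base):  # hypothetical parent entity referenced by parent_id
    __tablename__ = "parent"
    id = Column(Integer, primary_key=True)
    # children come back sorted by part; ordering_list renumbers part
    # automatically when children are appended, inserted or removed
    children = relationship(
        "Child",
        order_by="Child.part",
        collection_class=ordering_list("part", count_from=1),
    )

class Child(Base):  # one row per child element, as in the table above
    __tablename__ = "child"
    id = Column(Integer, primary_key=True)
    name = Column(String(255))
    part = Column(Integer, nullable=False)
    parent_id = Column(Integer, ForeignKey("parent.id"))

There is no constraint that makes the database remember insertion order by itself; an explicit position column like part plus ORDER BY (here via order_by on the relationship) is what preserves it.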

Related

How to create a table from another table with GridDB?

I have a GridDB container where I have stored my database. I want to copy the table, but excluding a few columns. The function I need should extract all columns matching a given keyword and then create a new table from them. It must always include the first column, *id, because it is needed in every table.
For example, in the table given below:
'''
-- | employee_id | department_id | employee_first_name | employee_last_name | employee_gender |
-- |-------------|---------------|---------------------|--------------------|-----------------|
-- | 1           | 1             | John                | Matthew            | M               |
-- | 2           | 1             | Alexandra           | Philips            | F               |
-- | 3           | 2             | Hen                 | Lotte              | M               |
'''
Suppose I need to get the first column and every other column starting with "employee". How can I do this through a Python function?
I am using the GridDB Python client on my Ubuntu machine, and I have already stored the database.csv file in the container. Thanks in advance for your help!
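No answer is recorded here, but the column-filtering part is plain Python. A minimal sketch, assuming the container's rows have already been read into a pandas DataFrame (the GridDB read/write itself is omitted; the "employee" prefix and the leading id column follow the question):

import pandas as pd

def columns_matching(df: pd.DataFrame, keyword: str) -> pd.DataFrame:
    """Return a copy of df with the first (id) column plus every column
    whose name starts with the given keyword."""
    id_col = df.columns[0]  # the first column must always be kept
    wanted = [c for c in df.columns if c.startswith(keyword) and c != id_col]
    return df[[id_col] + wanted]

# Example with the data from the question
df = pd.DataFrame({
    "employee_id": [1, 2, 3],
    "department_id": [1, 1, 2],
    "employee_first_name": ["John", "Alexandra", "Hen"],
    "employee_last_name": ["Matthew", "Philips", "Lotte"],
    "employee_gender": ["M", "F", "M"],
})
print(columns_matching(df, "employee"))  # drops department_id, keeps employee_id

The resulting DataFrame can then be written back to a new GridDB container.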

Is there a way to improve a MERGE query?

I am using this query to insert new entries into my table:
MERGE INTO CLEAN clean USING DUAL ON (clean.id = :id)
WHEN NOT MATCHED THEN INSERT (ID, COUNT) VALUES (:id, :xcount)
WHEN MATCHED THEN UPDATE SET clean.COUNT = clean.count + :xcount
It seems that I do more inserts than updates; is there a way to improve my current performance?
I am using cx_Oracle with Python 3 and Oracle Database 19c.
If you had massive problems with this approach, you would most probably be missing an index on the column clean.id, which is required when the MERGE uses dual as a source for each row.
That is unlikely here, since you say id is a primary key.
So basically you are doing the right thing, and you should see an execution plan similar to the one below:
---------------------------------------------------------------------------------------------------
| Id  | Operation                        | Name            | Rows | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------------------
|   0 | MERGE STATEMENT                  |                 |      |       |     2 (100)|          |
|   1 |  MERGE                           | CLEAN           |      |       |            |          |
|   2 |   VIEW                           |                 |      |       |            |          |
|   3 |    NESTED LOOPS OUTER            |                 |    1 |    40 |     2   (0)| 00:00:01 |
|   4 |     TABLE ACCESS FULL            | DUAL            |    1 |     2 |     2   (0)| 00:00:01 |
|   5 |     VIEW                         | VW_LAT_A18161FF |    1 |    38 |     0   (0)|          |
|   6 |      TABLE ACCESS BY INDEX ROWID | CLEAN           |    1 |    38 |     0   (0)|          |
|*  7 |       INDEX UNIQUE SCAN          | CLEAN_UX1       |    1 |       |     0   (0)|          |
---------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
7 - access("CLEAN"."ID"=:ID)
So the execution plan is fine and works effectively, but it has one problem.
Remember: whenever you rely on an index access per row, you will be happy while processing a few rows, but it will not scale.
If you are processing millions of records, you may fall back to two-step processing:
- insert all rows into a temporary table
- perform a single MERGE statement using the temporary table
The big advantage is that Oracle can use a hash join and get rid of the index access for each of the million rows.
Here is an example of a test against the clean table, initialized with 1M ids (not shown), performing 1M inserts and 1M updates:
# "cursor" is an open cx_Oracle cursor; the staging table tmp(id, count) already exists
n = 1000000
data2 = [{"id": i, "xcount": 1} for i in range(2 * n)]  # half the ids exist -> updates, half are new -> inserts
sql3 = """
insert into tmp (id, count)
values (:id, :xcount)"""
sql4 = """MERGE into clean USING tmp on (clean.id = tmp.id)
when not matched then insert (id, count) values (tmp.id, tmp.count)
when matched then update set clean.count = clean.count + tmp.count"""
cursor.executemany(sql3, data2)  # bulk load the staging table
cursor.execute(sql4)             # one set-based MERGE instead of a MERGE per row
The test runs in approximately 10 seconds, which is less than half of your approach with MERGE using dual.
If this is still not enough, you will have to use the parallel option.
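The tmp staging table used above is not defined in the answer; one way to set it up is as a global temporary table (a sketch, with names matching the snippet above):

# Create the staging table once; ON COMMIT PRESERVE ROWS keeps the loaded rows
# available for the MERGE within the same session.
cursor.execute("""
    CREATE GLOBAL TEMPORARY TABLE tmp (
        id    NUMBER PRIMARY KEY,
        count NUMBER
    ) ON COMMIT PRESERVE ROWS
""")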
MERGE is quite fast, and inserts are usually faster than updates, I'd say.
So, if you're asking how to make inserts faster, it depends.
If you're inserting one row at a time, there shouldn't be any bottleneck.
If you're inserting millions of rows, check whether there are triggers enabled on the table which fire for each row and do something (slowing the process down).
As for updates, is there an index on the clean.id column? If not, one would probably help.
Otherwise, look at what the explain plan says, and collect statistics regularly.
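A quick way to check both of those points from the same cx_Oracle session, assuming the table lives in the connected schema (a sketch using the standard USER_TRIGGERS and USER_IND_COLUMNS dictionary views):

# List any triggers on CLEAN and the columns covered by its indexes.
cursor.execute("""
    SELECT trigger_name, status
    FROM   user_triggers
    WHERE  table_name = 'CLEAN'
""")
print(cursor.fetchall())

cursor.execute("""
    SELECT index_name, column_name
    FROM   user_ind_columns
    WHERE  table_name = 'CLEAN'
""")
print(cursor.fetchall())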

Filter SQL elements with adjacent ID

I don't really know how to properly state this question in the title.
Suppose I have a table Word like the following:
| id | text |
| --- | --- |
| 0 | Hello |
| 1 | Adam |
| 2 | Hello |
| 3 | Max |
| 4 | foo |
| 5 | bar |
Is it possible to query this table based on text and receive the objects whose primary key (id) is exactly one off?
So, if I do
Word.objects.filter(text='Hello')
I get a QuerySet containing the rows
| id | text |
| --- | --- |
| 0 | Hello |
| 2 | Hello |
but I want the rows
| id | text |
| --- | --- |
| 1 | Adam |
| 3 | Max |
I guess I could do
word_ids = Word.objects.filter(text='Hello').values_list('id', flat=True)
word_ids = [w_id + 1 for w_id in word_ids] # or use a numpy array for this
Word.objects.filter(id__in=word_ids)
but that doesn't seem overly efficient. Is there a straight SQL way to do this in one call? Preferably directly using Django's QuerySets?
EDIT: The idea is that in fact I want to filter those words that are in the second QuerySet. Something like:
Word.objects.filter(text__of__previous__word='Hello', text='Max')
In plain Postgres you could use the lag window function (https://www.postgresql.org/docs/current/static/functions-window.html)
SELECT
    id,
    text
FROM (
    SELECT
        *,
        lag(text) OVER (ORDER BY id) AS prev_text
    FROM word
) s
WHERE prev_text = 'Hello'
The lag function adds a column containing the text of the previous row, computed in a subquery, so you can filter on it in the outer query.
I am not really into Django, but according to the documentation, window function support was added in version 2.0.
If by "1 off" you mean that the difference is exactly 1, then you can do:
select w.*
from words w
where w.id in (select w2.id + 1 from words w2 where w2.text = 'Hello');
lag() is also a very reasonable solution; this one is just a direct interpretation of your question. If there are gaps in the ids (and the intention really is + 1), then lag() is a bit trickier.
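If you want to stay in the ORM, one way to express that same id + 1 subquery with Django QuerySets (a sketch, assuming the Word model from the question):

from django.db.models import F

# ids that are exactly one greater than the id of a 'Hello' row
next_ids = (
    Word.objects.filter(text='Hello')
    .annotate(next_id=F('id') + 1)
    .values('next_id')
)

# One SQL statement: the queryset above becomes the subquery of the IN clause.
following_words = Word.objects.filter(id__in=next_ids)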

SQL/SQLAlchemy: Querying all objects in a dependency tree

I have a table with a self-referencing, asymmetric many-to-many relationship of dependencies between objects. I use that relationship to build a dependency tree between objects.
Given a set of object IDs, I would like to fetch all objects that are anywhere in that dependency tree.
Here's an example objects table:
+----+------+
| ID | Name |
+----+------+
| 1 | A |
| 2 | B |
| 3 | C |
| 4 | D |
| 5 | E |
+----+------+
And a table of relationships:
+------------+-----------+
| Dependency | Dependent |
+------------+-----------+
| 2          | 1         |
| 3          | 2         |
| 4          | 1         |
+------------+-----------+
This shows that A (ID 1) depends on both B (2) and D (4), and that B (2) depends on C (3).
Now, I would like to construct a single SQL query that, given {1} as a set with a single ID, will return the four objects in A's dependency tree: A, B, D and C.
Alternatively, using one query to fetch all needed object IDs and another to fetch their actual data is also acceptable.
This should work regardless of the number of levels in the dependency/hierarchy tree.
I'll be happy with either an SQLAlchemy example or plain SQL for the PostgreSQL 10 database (I'll then work out how to implement it with SQLAlchemy).
Thanks!
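No answer is recorded here, but the standard tool for this in PostgreSQL is a recursive CTE, which SQLAlchemy can simply execute as text. A minimal sketch, assuming table names objects(id, name) and dependencies(dependency, dependent) matching the layout above, and a psycopg2 connection (psycopg2 adapts the Python list to a SQL array for ANY):

from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://user:pass@localhost/mydb")  # assumed connection URL

sql = text("""
    WITH RECURSIVE tree AS (
        -- anchor: the objects whose IDs we start from
        SELECT o.id, o.name
        FROM objects o
        WHERE o.id = ANY(:roots)
        UNION
        -- step: everything the objects found so far depend on
        SELECT o.id, o.name
        FROM tree t
        JOIN dependencies d ON d.dependent = t.id
        JOIN objects o ON o.id = d.dependency
    )
    SELECT id, name FROM tree
""")

with engine.connect() as conn:
    rows = conn.execute(sql, {"roots": [1]}).fetchall()  # -> A, B, C, D (in some order)

Using UNION rather than UNION ALL deduplicates rows, which also keeps the recursion from looping if the graph ever contains a cycle.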

flask-sqlalchemy count function

Consider a table named result with the following schema
+----+-----+---------+
| id | tag | user_id |
+----+-----+---------+
| 0 | A | 0 |
| 1 | A | 0 |
| 2 | B | 0 |
| 3 | B | 0 |
+----+-----+---------+
For the user with id=0, I would like to count the number of times a result with tag=A has appeared. For now I have implemented it using a raw SQL statement:
db.session.execute('select tag, count(tag) from result where user_id = :id group by tag', {'id':user.id})
How can I write it using the flask-sqlalchemy API?
Most of the results I find mention the SQLAlchemy function db.func.count(), which does not seem to be available in flask-sqlalchemy, or lives under a different path that I am not aware of.
I was using PyCharm as my IDE and it was not showing module members correctly, hence I thought count() was missing. Here is my solution to the above:
user.results.add_columns(Result.tag, db.func.count(Result.tag)).group_by(Result.tag).all()
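An equivalent that mirrors the raw SQL more directly, as a sketch assuming the model is named Result and db is the flask-sqlalchemy instance:

# tag and its count, grouped by tag, restricted to the given user
(
    db.session.query(Result.tag, db.func.count(Result.tag))
    .filter(Result.user_id == user.id)
    .group_by(Result.tag)
    .all()
)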
