Grouping DISTINCT SQL results into objects - python

I'm trying to combine data from two tables (the first being a shipment details table with data for a single shipment per row, the other a table that tracks transactions made against shipments) to build a view that shows me the product data for a given shipment.
My current attempt to get the data looks like this:
SELECT DISTINCT
sd.erp_order AS shipment_id,
th.product_code,
th.lot
FROM transaction_history th
JOIN shipment_detail sd
ON sd.shipment_id = th.reference_id
AND sd.item = th.item
WHERE th.transaction_type = '1'
AND sd.erp_order in ('1111', '1112')
which returns my data in the following format:
| shipment_id | product_code | lot |
| 1111        | PRODUCT_A    | 1A  |
| 1111        | PRODUCT_B    | 2B  |
| 1112        | PRODUCT_A    | 1A  |
| 1112        | PRODUCT_B    | 3B  |
This is great, but now I need to organize it so that when it goes through my API (I'm using Django), the lot code and the product code are grouped together in their own object, and then all the products are listed under the relevant shipment:
[
    {
        "shipment_id": '1111',
        "products": [
            {
                "product_code": "PRODUCT_A",
                "lot": "1A",
            },
            {
                "product_code": "PRODUCT_B",
                "lot": "2B",
            }
        ]
    }
]
and I'm not quite sure how to do it. Is this something that can be done with SQL, or will I have to do it with Python?
I also recognize that I should be able to get this kind of data from existing tables, but this is a siloed database that I cannot modify, and I'm told by the supporting team that this is the best place to get the data I need.

Try something like:
from django.core import serializers

data = serializers.serialize('json', SomeModel.objects.raw(query), fields=('id', 'name', 'parent'))
See the Django documentation on serialization for details.
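If you end up doing the grouping on the Python side instead, here is a minimal sketch (my own, not tied to Django models) using itertools.groupby over the flat rows, assuming each row is available as a dict with shipment_id, product_code and lot keys:
from itertools import groupby
from operator import itemgetter

def group_shipments(rows):
    # groupby only merges adjacent rows, so sort by the grouping key first
    rows = sorted(rows, key=itemgetter("shipment_id"))
    return [
        {
            "shipment_id": shipment_id,
            "products": [
                {"product_code": r["product_code"], "lot": r["lot"]}
                for r in group
            ],
        }
        for shipment_id, group in groupby(rows, key=itemgetter("shipment_id"))
    ]
With the four sample rows above, this returns the nested list shown in the question.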

You can use this query to generate the JSON in the database and then parse it into Python objects:
SELECT DISTINCT
sd.erp_order AS shipment_id,
products.product_code,
products.lot
FROM shipment_detail sd
JOIN transaction_history products
ON sd.shipment_id = products.reference_id
AND sd.item = products.item
WHERE products.transaction_type = '1'
AND sd.erp_order in ('1111', '1112')
FOR JSON AUTO
For more information, read the SQL Server documentation on FOR JSON.
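If you go this route, a minimal sketch of the Python side could look like this (assuming pyodbc; the connection string and the for_json_query variable holding the SELECT ... FOR JSON AUTO statement above are placeholders):
import json
import pyodbc

conn = pyodbc.connect("<connection-string>")  # placeholder connection string
cursor = conn.cursor()
cursor.execute(for_json_query)  # the SELECT ... FOR JSON AUTO query above

# SQL Server may split long JSON output across several result rows,
# so concatenate the chunks before parsing
json_text = "".join(row[0] for row in cursor.fetchall())
shipments = json.loads(json_text) if json_text else []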

Related

Using SQLAlchemy ORM for Python in my REST API, how can I aggregate resources by hour within each day?

I have a MySQL db table that looks like:
time_slot            | sales
2022-08-26T01:00:00  | 100
2022-08-26T01:06:40  | 103
...
I am serving the data via api to a client. The FE engineer wants the data aggregated by hour for each day within the query period (atm it's a week). So he gives from and to and wants the sum of sales within each hour for each day as a nested array. Because it's a week, it's a 7 element array, where each element is an array containing all the hourly slots where we have data.
[
    [
        "07:00": 567,
        "08:00": 657,
        ....
    ],
    [], [], ...
]
The API is built in Python. There is an ORM (SQLAlchemy) model for the data that looks like:
class HourlyData(Base):
    hour = Column(DateTime)
    sales = Column(Float)
I can query the hourly data and then aggregate it into a list of lists in Python, in memory. But to save compute time (and conceptual complexity), I would like to run the aggregation through ORM queries.
What is the SQLAlchemy syntax to achieve this?
The below should get you started, where the solution is a mix of SQL and Python using existing tools, and it should work with any RDBMS.
Assumed model definition and imports:
from itertools import groupby
import json

from sqlalchemy import Column, DateTime, Float, Integer, func

# Base and session are assumed to come from the usual declarative_base()/sessionmaker() setup
class TimelyData(Base):
    __tablename__ = "timely_data"
    id = Column(Integer, primary_key=True)
    time_slot = Column(DateTime)
    sales = Column(Float)
We get the data from the DB aggregated enough for us to group properly
# below works for PostgreSQL (tested), and should work for MySQL as well
# see: https://mode.com/blog/date-trunc-sql-timestamp-function-count-on
col_hour = func.date_trunc("hour", TimelyData.time_slot)
q = (
    session.query(
        col_hour.label("hour"),
        func.sum(TimelyData.sales).label("total_sales"),
    )
    .group_by(col_hour)
    .order_by(col_hour)  # this is important for the `groupby` function later on
)
Now group the results by date using Python's groupby:
groups = groupby(q.all(), key=lambda row: row.hour.date())

# truncate and format the final list as required
data = [
    [(f"{row.hour:%H}:00", int(row.total_sales)) for row in rows]
    for _, rows in groups
]
Example result:
[[["01:00", 201], ["02:00", 102]], [["01:00", 103]], [["08:00", 104]]]
I am not familiar with MySQL, but with PostgreSQL one could implement all of this at the DB level thanks to its extensive JSON support. However, I would argue that the readability of such an implementation would not improve, and neither would the speed, assuming we get at most 168 rows (7 days x 24 hours) from the database.
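For reference, here is a rough sketch (my own assumption, not part of the answer above, and PostgreSQL-only) of what pushing the whole grouping into the database with json_agg/json_build_object could look like, reusing the TimelyData model and the hourly query from above:
from sqlalchemy import Date, cast, func

# hourly totals, as before, but wrapped in a subquery
hourly = (
    session.query(
        func.date_trunc("hour", TimelyData.time_slot).label("hour"),
        func.sum(TimelyData.sales).label("total_sales"),
    )
    .group_by(func.date_trunc("hour", TimelyData.time_slot))
    .subquery()
)

day = cast(hourly.c.hour, Date)

# one row per day, each carrying a JSON array of {"hour": ..., "total_sales": ...}
per_day = (
    session.query(
        day.label("day"),
        func.json_agg(
            func.json_build_object(
                "hour", func.to_char(hourly.c.hour, "HH24:00"),
                "total_sales", hourly.c.total_sales,
            )
        ).label("hours"),
    )
    .group_by(day)
    .order_by(day)
)
Each result row then already contains one day's array, so the Python side only needs to read per_day.all() and drop the day key if the API really wants plain nested lists.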

Usage of postgres jsonb

I'm trying to figure out how to work better with JSON in Postgres.
I have a file that stores information about many tables (structure and values). The file is periodically updated, which may mean changes in the data as well as in the table structures. So these are effectively dynamic tables.
As a result, for each table I have a JSON structure definition (the key is the column name, the value is the field type, string or number only) and a list of JSON records.
Something like this (the actual structure does not matter):
{
    'table_name': 'table1',
    'columns': {
        'id': 'int',
        'data1': 'string',
        'data2': 'string'
    },
    'values': [
        [1, 'aaa', 'bbb'],
        [2, 'ccc', 'ddd']
    ]
}
At first I wanted to make a real table for each table in the file, truncate it when updating the data, and drop it if the table structure changes. The second option I'm testing now is a single table with JSON data:
CREATE TABLE IF NOT EXISTS public.data_tables
(
    id integer NOT NULL,
    table_name character varying(50),
    row_data jsonb,
    CONSTRAINT data_tables_pkey PRIMARY KEY (id)
)
And now there is the question of how to properly work with the JSON:
- directly query row_data, e.g. row_data->>'id' = '1', with a hash index on the 'id' key
- use jsonb_populate_record with custom types for each table (yes, I need to recreate them each time the table structure changes)
- probably some other way to work with it?
The first option is the easiest and fast because of the indexes, but there is no data type control and you have to repeat it in every query.
The second option is more difficult to implement but easier to use in queries. I can even create views for each table with jsonb_populate_record. But as far as I can see, indexes won't work with a JSON function?
Perhaps there is a better way? Or is recreating tables not such a bad option?
Firstly, your JSON string is not in the correct format. Here is the corrected sample JSON string:
{
    "table_name": "table1",
    "columns": {
        "id": "integer",
        "data1": "text",
        "data2": "text"
    },
    "values": [
        {
            "id": 1,
            "data1": "aaa",
            "data2": "bbb"
        },
        {
            "id": 2,
            "data1": "ccc",
            "data2": "ddd"
        }
    ]
}
I wrote a sample function for you, but only for creating tables from the JSON. You can write SQL for the insert process too; it's not difficult.
Sample Function:
CREATE OR REPLACE FUNCTION dynamic_create_table()
RETURNS boolean
LANGUAGE plpgsql
AS $function$
declare
rec record;
begin
FOR rec IN
select
t1.table_name,
string_agg(t2.pkey || ' ' || t2.pval || ' NULL', ', ') as sql_columns
from data_tables t1
cross join jsonb_each_text(t1.row_data->'columns') t2(pkey, pval)
group by t1.table_name
loop
execute 'create table ' || rec.table_name || ' (' || rec.sql_columns || ')';
END loop;
return true;
END;
$function$;
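The insert step is not covered by the answer above, but here is a minimal sketch of one way to do it from Python with psycopg2 (my own assumption), given the corrected JSON shape where every entry in "values" is an object keyed by column name:
import psycopg2
from psycopg2 import sql

def insert_values(conn, table_json):
    table = table_json["table_name"]
    columns = list(table_json["columns"].keys())
    rows = [tuple(rec[col] for col in columns) for rec in table_json["values"]]

    # build "INSERT INTO table1 (id, data1, data2) VALUES (%s, %s, %s)" safely
    query = sql.SQL("INSERT INTO {} ({}) VALUES ({})").format(
        sql.Identifier(table),
        sql.SQL(", ").join(map(sql.Identifier, columns)),
        sql.SQL(", ").join(sql.Placeholder() * len(columns)),
    )
    with conn.cursor() as cur:
        cur.executemany(query, rows)
    conn.commit()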

Nested JSON Output of SQLAlchemy Query with Join

I have two tables Orders and OrderItems. It's a common setup whereby OrderItems has a foreign key linking it to Orders. So we have a one-to-many join from Orders to OrderItems.
Note: Tables would have many more fields in real life.
Orders        OrderItems
+---------+   +-------------+---------+
| orderId |   | orderItemId | orderId |
+---------+   +-------------+---------+
| 1       |   | 5           | 1       |
| 2       |   | 6           | 1       |
|         |   | 7           | 2       |
+---------+   +-------------+---------+
I'm using SQLAlchemy to reflect an existing database. So to query this data I do something like
ordersTable = db.Model.metadata.tables['Orders']
orderItemsTable = db.Model.metadata.tables['OrderItems']
statement = ordersTable.join(orderItemsTable, ordersTable.c.orderId==orderItemsTable.c.orderId).select()
result = db.engine.execute(statement)
rlist = [dict(row) for row in result.fetchall()]
return flask.jsonify(rlist)
But the problem with this output is that I get duplicates of information from the Orders table due to the join. E.g. you can see that because orderId 1 has two items, I'll get everything in the Orders table for that order twice.
What I'm after is a way to obtain nested JSON output from the select query above, such as:
[
    {
        "orderId": 1,
        "orderItems": [
            { "orderItemId": 5 },
            { "orderItemId": 6 }
        ]
    },
    {
        "orderId": 2,
        "orderItems": [
            { "orderItemId": 7 }
        ]
    }
]
This question has been raised before: How do I produce nested JSON from database query with joins? Using Python / SQLAlchemy
I've spent quite a bit of time looking over the Marshmallow documentation, but I cannot find how to implement this using the type of query that I outlined above.
I didn't like how cluttered marshmallow is, so I wrote this. I also like that I can keep all of the data manipulation in the SQL statement instead of also instructing marshmallow what to do.
import json
from flask.json import JSONEncoder


def join_to_nested_dict(join_result):
    """
    Takes a sqlalchemy result and converts it to a dictionary.
    The models must use the dataclass decorator.
    Adds results to the right in a key named after the table the right item is contained in.
    :param List[Tuple[dataclass]] join_result:
    :return dict:
    """
    if len(join_result) == 0:
        return join_result
    # couldn't be the result of a join without two entries on each row
    assert len(join_result[0]) >= 2
    right_name = join_result[0][1].__tablename__
    # if there are multiple joins recurse on sub joins
    if len(join_result[0]) > 2:
        right = join_to_nested_dict([res[1:] for res in join_result])
    elif len(join_result[0]) == 2:
        right = [
            json.loads(json.dumps(row[1], cls=JSONEncoder))
            for row in join_result if row[1] is not None
        ]
    right_items = {item['id']: item for item in right}

    items = {}
    for row in join_result:
        # in the case of a right outer join
        if row[0] is None:
            continue
        if row[0].id not in items:
            items[row[0].id] = json.loads(json.dumps(row[0], cls=JSONEncoder))
        # in the case of a left outer join
        if row[1] is None:
            continue
        if right_name not in items[row[0].id]:
            items[row[0].id][right_name] = []
        items[row[0].id][right_name].append(right_items[row[1].id])
    return list(items.values())
And you should be able to just plug the result into this function. However you will need to add the dataclass decorator to your models for this code to work.
statement = ordersTable.join(orderItemsTable, ordersTable.c.orderId==orderItemsTable.c.orderId).select()
result = db.engine.execute(statement)
join_to_nested_dict(result)
Also, if you don't want to use the flask json encoder you can delete the import and cls arguments.
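For completeness, here is a hypothetical sketch of what dataclass-decorated models could look like for the Orders/OrderItems example (the attribute names are my assumption, not from the answer). Note that join_to_nested_dict above looks up .id and __tablename__ on each entity, so the primary keys are exposed under the attribute name id:
from dataclasses import dataclass

@dataclass
class Order(db.Model):
    __tablename__ = "Orders"
    id: int
    id = db.Column("orderId", db.Integer, primary_key=True)

@dataclass
class OrderItem(db.Model):
    __tablename__ = "OrderItems"
    id: int
    orderId: int
    id = db.Column("orderItemId", db.Integer, primary_key=True)
    orderId = db.Column(db.Integer, db.ForeignKey("Orders.orderId"))
A joined ORM query such as db.session.query(Order, OrderItem).outerjoin(OrderItem, Order.id == OrderItem.orderId).all() then yields the list of entity tuples that the function expects.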

Targeting specific values from JSON API and inserting into Postgresql, using Python

Right now I am able to connect to the URL API and my database. I am trying to insert data from the URL into the PostgreSQL database using psycopg2. I don't fully understand how to do this, and this is all I could come up with:
import urllib3
import json
import certifi
import psycopg2
from psycopg2.extras import Json

http = urllib3.PoolManager(
    cert_reqs='CERT_REQUIRED',
    ca_certs=certifi.where())

url = '<API-URL>'
headers = urllib3.util.make_headers(basic_auth='<user>:<password>')
r = http.request('GET', url, headers=headers)
data = json.loads(r.data.decode('utf-8'))


def insert_into_table(data):

    for item in data['issues']:
        item['id'] = Json(item['id'])

    with psycopg2.connect(database='test3', user='<username>', password='<password>', host='localhost') as conn:
        with conn.cursor() as cursor:
            query = """
                INSERT into
                    Countries
                    (revenue)
                VALUES
                    (%(id)s);
                """
            cursor.executemany(query, data)
            conn.commit()


insert_into_table(data)
This code gives me a TypeError: string indices must be integers on cursor.executemany(query, data).
I know that json.loads brings back an object and that json.dumps brings back a string; I wasn't sure which one I should be using. And I know I am completely missing something in how I'm targeting the 'id' value and inserting it into the query.
Also, a little about the API: it is very large and complex, and eventually I'll have to go down multiple trees to grab certain values. Here is an example of what I'm pulling from.
I am trying to grab "id" under "issues", and not the one under "issuetype":
{
    "expand": "<>",
    "startAt": 0,
    "maxResults": 50,
    "total": 13372,
    "issues": [
        {
            "expand": "<>",
            "id": "41508",
            "self": "<>",
            "key": "<>",
            "fields": {
                "issuetype": {
                    "self": "<>",
                    "id": "1",
                    "description": "<>",
                    "iconUrl": "<>",
                    "name": "<>",
                    "subtask": <>,
                    "avatarId": <>
                },
First, extract ids into a list of tuples:
ids = list((item['id'],) for item in data['issues'])
# example ids: [('41508',), ('41509',)]
Next use the function extras.execute_values():
from psycopg2 import extras
query = """
    INSERT into Countries (revenue)
    VALUES %s;
"""
extras.execute_values(cursor, query, ids)
Why was I getting type errors?
The second argument of executemany(query, vars_list) should be a sequence, while data is an object whose elements cannot be accessed by integer indexes.
Why use execute_values() instead of executemany()?
Because of performance: the first function executes a single query with multiple argument sets, while the second one executes as many queries as there are argument sets.
Note that by default the third argument of execute_values() is a list of tuples, which is why we extracted the ids in exactly that form.
If you have to insert values into more than one column, each tuple in the list should contain all the values for a single inserted row, for example:
values = list((item['id'], item['key']) for item in data['issues'])
query = """
    INSERT into Countries (id, revenue)
    VALUES %s;
"""
extras.execute_values(cur, query, values)
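Putting the pieces together, a minimal end-to-end sketch (connection parameters copied from the question as placeholders):
import psycopg2
from psycopg2 import extras

ids = [(item['id'],) for item in data['issues']]

with psycopg2.connect(database='test3', user='<username>', password='<password>', host='localhost') as conn:
    with conn.cursor() as cursor:
        extras.execute_values(
            cursor,
            "INSERT into Countries (revenue) VALUES %s;",
            ids,
        )
    conn.commit()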
If you're trying to get just the id and insert it into your table, you should try
ids = []
for i in data['issues']:
    ids.append(i['id'])
Then you can pass your ids list to your cursor.executemany function.
The issue you have is not in the way you are parsing your JSON; it occurs when you try to insert it into your table using cursor.executemany().
data is a single object. Are you attempting to insert all of the data your fetch returns into your table at once, or are you trying to insert a specific part of the data (a list of issue IDs)?
You are passing data into your cursor.executemany call, but data is an object. I believe you wish to pass data['issues'], which is the list of issues that you modified.
If you only wish to insert the ids into the table, try this:
def insert_into_table(data):

    with psycopg2.connect(database='test3', user='<username>', password='<password>', host='localhost') as conn:
        with conn.cursor() as cursor:
            query = """
                INSERT into
                    Countries
                    (revenue)
                VALUES
                    (%(id)s);
                """
            for item in data['issues']:
                item['id'] = Json(item['id'])
                cursor.execute(query, item)
            conn.commit()


insert_into_table(data)
If you wish to keep the efficiency of using cursor.executemany(), you need to create an array of the IDs, as the current object structure doesn't arrange them the way cursor.executemany() requires.

Efficiently querying a graph structure

I have a database which consists of a graph. The table I need to access looks like this:
Sno  Source  Dest
1    'jack'  'bob'
2    'jack'  'Jill'
3    'bob'   'Jim'
Here Sno is the primary key. Source and Dest are two non-unique values that represent an edge between nodes in my graph. My Source and Dest may also be strings, not necessarily a numeric data type. I have around 5 million entries in my database, which I have built using PostgreSQL with psycopg2 for Python.
It is very easy and quick to query for the primary key. However, I need to frequently query this database for all the dest a particular source is connected to. Right now I achieve this by calling the query:
SELECT * FROM name_table WHERE Source = 'jack'
This turns out to be quite inefficient (up to 2 seconds per query), and there is no way I can make this the primary key as it is not unique. Is there any way that I can build an index based on these repeated values and query it quickly?
This should make your query much faster.
CREATE INDEX table_name_index_source ON table_name (Source);
However, there are many options you can use; see the PostgreSQL documentation:
CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ name ] ON table [ USING method ]
( { column | ( expression ) } [ COLLATE collation ] [ opclass ] [ ASC | DESC ] [ NULLS { FIRST | LAST } ] [, ...] )
[ WITH ( storage_parameter = value [, ... ] ) ]
[ TABLESPACE tablespace ]
[ WHERE predicate ]
Read more about indexing with PostgreSQL in their Documentation.
Update
With a table as small as yours, this will certainly help. However, if your dataset is growing, you should probably consider a schema change so that you have unique values which can be indexed more efficiently.
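Not part of the original answer, but a quick way to check that the planner actually uses the new index is EXPLAIN. A minimal psycopg2 sketch, reusing the table and column names from the question (the connection string and index name are placeholders):
import psycopg2

with psycopg2.connect("<connection-string>") as conn:
    with conn.cursor() as cur:
        # create the index (same statement as above, guarded with IF NOT EXISTS)
        cur.execute("CREATE INDEX IF NOT EXISTS name_table_source_idx ON name_table (Source);")
        # ask the planner how it will run the hot query
        cur.execute("EXPLAIN ANALYZE SELECT * FROM name_table WHERE Source = %s;", ('jack',))
        for (line,) in cur.fetchall():
            print(line)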
