Automatically Normalizing a Postgres JSON Column into a New Table - python

I have a very large Postgres table with millions of rows. One of the columns is called data and is of type JSONB with nested JSON (but thankfully no sub-arrays). The "schema" of the JSON is mostly consistent, but it has evolved a bit over time, gaining and losing keys and nested keys.
I'd like a process to normalize the column into a new table, and I'd like that process to be as simple as possible.
For example, if the table looked like:
id | data
---+--------------------------
 1 | {"hi": "mom", "age": 43}
 2 | {"bye": "dad", "age": 41}
it should create and populate a new table such as
id | data.hi | data.age | data.bye
---+---------+----------+---------
 1 | mom     | 43       | NULL
 2 | NULL    | 41       | dad
(Note: the column names aren't crucial.)
In theory, I could do the following:
1. Select the column into a Pandas DataFrame and run json_normalize on it
2. Infer the schema as the superset of the derived columns from step 1
3. Create a Postgres table with the schema from step 2 and insert the data (to_sql is an easy way to achieve this)
This doesn't seem too bad, but recall that the table is very large, so we should assume this cannot be done with a single DataFrame. If we try the next best thing, which is to batch the above steps, we run into the problem that the schema has changed slightly between batches.
Is there a better way to solve this problem than my approach? A "perfect" solution would be "pure SQL" and not involve any Python at all, but I'm not looking for perfection here, just an automatic and robust process that doesn't require human intervention.
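To make the problem concrete, here is roughly what the batched version of that approach would look like (connection string, chunk size and table names are made up):

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:pass@localhost/mydb")  # hypothetical

for chunk in pd.read_sql_query("SELECT id, data FROM mytable", engine, chunksize=50_000):
    flat = pd.json_normalize(chunk["data"].tolist(), sep="_")
    flat.insert(0, "id", chunk["id"].values)
    # This is where it breaks down: if this chunk contains a key the first chunk
    # didn't have, the appended frame no longer matches the table that the first
    # to_sql call created.
    flat.to_sql("newtable", engine, if_exists="append", index=False)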

You can try to create a new table via the CREATE TABLE AS statement.
CREATE TABLE newtable AS
SELECT
    id,
    (data ->> 'hi')::text AS data_hi,
    (data ->> 'bye')::text AS data_bye,
    (data -> 'age')::int AS data_age
FROM mytable;
If the JSON structure is unknown, all keys and data types can be selected like this:
SELECT DISTINCT
    jsonb_object_keys(data) AS col_name,
    jsonb_typeof(data -> jsonb_object_keys(data)) AS col_type
FROM mytable;
Output:
col_name  col_type
------------------
bye       string
hi        string
age       number
For a nested structure such as
id | data
---+-----------------------------------
 3 | {"age": 33, "foo": {"bar": true}}
you can use a recursive query:
WITH RECURSIVE cte AS (
    SELECT
        jsonb_object_keys(data) AS col_name,
        jsonb_object_keys(data) AS col_path,
        jsonb_typeof(data -> jsonb_object_keys(data)) AS col_type,
        data
    FROM mytable
    UNION ALL
    SELECT
        jsonb_object_keys(data -> col_name) AS col_name,
        col_path || '_' || jsonb_object_keys(data -> col_name) AS col_path,
        jsonb_typeof(data -> col_name -> jsonb_object_keys(data -> col_name)) AS col_type,
        data -> cte.col_name AS data
    FROM cte
    WHERE col_type = 'object'
)
SELECT DISTINCT col_path AS col_name, col_type
FROM cte
WHERE col_type <> 'object';
Output:
col_name  col_type
------------------
age       number
foo_bar   boolean
Next, you need to build a list of columns for the SELECT clause based on this data for use in the CREATE TABLE AS statement, as shown above.
The following fiddle has a helper that generates the entire SQL:
db<>fiddle
Note that jsonb_typeof reports every numeric value, including fractional ones, simply as number, so the generated numeric column types may need manual correction.
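If you would rather generate that SQL yourself than use the fiddle's helper, here is a rough Python sketch for the flat (non-nested) case, using psycopg2; the connection string and the type mapping are assumptions to adapt:

import psycopg2

# Assumed mapping from jsonb_typeof output to Postgres column types; every
# "number" becomes numeric so fractional values survive.
TYPE_MAP = {"string": "text", "number": "numeric", "boolean": "boolean"}

conn = psycopg2.connect("dbname=mydb")  # hypothetical connection string

with conn, conn.cursor() as cur:
    # Discover the top-level keys and their JSON types, as in the query above.
    cur.execute("""
        SELECT DISTINCT
            jsonb_object_keys(data) AS col_name,
            jsonb_typeof(data -> jsonb_object_keys(data)) AS col_type
        FROM mytable
    """)
    columns = ",\n    ".join(
        f"(data ->> '{name}')::{TYPE_MAP.get(jtype, 'text')} AS data_{name}"
        for name, jtype in cur.fetchall()
    )
    cur.execute(f"CREATE TABLE newtable AS\nSELECT\n    id,\n    {columns}\nFROM mytable")

For nested keys you would feed the recursive query above into the same kind of string building, extracting each value with a path expression such as data #>> '{foo,bar}' and aliasing it with the col_path value (foo_bar).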

Related

SQL database with a column being a list or a set

With an SQL database (in my case SQLite, used from Python), what is a standard way to have a column that is a set of elements?
id | name | items_set
---+------+-------------------------------
 1 | Foo  | apples,oranges,tomatoes,ananas
 2 | Bar  | tomatoes,bananas
...
A simple implementation is to use
CREATE TABLE data(id int, name text, items_set text);
but there are a few drawbacks:
- to query all rows that have ananas, we have to use items_set LIKE '%ananas%' plus some tricks with separators so that a query for "ananas" doesn't also return rows with "bananas", etc.
- when we insert a new item into a row, we have to load the whole items_set and check whether the item is already in the list before concatenating ,newitem at the end
- etc.
Surely there is something better. What is a standard SQL solution for a column that is a list or set?
Note: I don't know in advance all the possible values for the set/list.
I can see a solution with a few additional tables, but in my tests it multiplies the size on disk by a factor of 2 or 3, which is a problem with many gigabytes of data.
Is there a better solution?
To have a well-structured SQL database, you should extract the items into their own table and use a join table between the main table and the items table.
I'm not familiar with the SQLite syntax, but you should be able to create the tables with
-- id integer primary key makes SQLite auto-assign ids on insert
CREATE TABLE entities(id integer primary key, name text);
CREATE TABLE entity_items(entity_id int, item_id int);
CREATE TABLE items(id integer primary key, name text);
add data
INSERT INTO entities (name) VALUES ('Foo'), ('Bar');
INSERT INTO items (name) VALUES ('tomatoes'), ('ananas'), ('bananas');
INSERT INTO entity_items (entity_id, item_id) VALUES (
    (SELECT id FROM entities WHERE name = 'Foo'),
    (SELECT id FROM items WHERE name = 'bananas')
);
query data
SELECT * FROM entities
LEFT JOIN entity_items
ON entities.id = entity_items.entity_id
LEFT JOIN items
ON items.id = entity_items.item_id
WHERE items.name = 'bananas';
You probably have two options. The standard, more conventional approach is a many-to-many relationship: three tables, for example Employees, Projects, and ProjectEmployees, where the last one describes the many-to-many relationship (each employee can work on multiple projects, and each project has a team).
Keeping a set in a single value denormalizes the table and will complicate things either way. But if you want a single column anyway, use the JSON format and the JSON functionality provided by SQLite. If your SQLite version is not recent, it may not have the JSON extension built in; you would need to either update it (the best option) or load the JSON extension dynamically. I'm not sure whether you can do that with the SQLite copy supplied with Python.
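For example, if your SQLite build does include the JSON1 functions, something along these lines should work (a rough sketch; the file name and table layout are made up):

import json
import sqlite3

conn = sqlite3.connect("example.db")  # hypothetical database file
conn.execute("CREATE TABLE IF NOT EXISTS data(id INTEGER PRIMARY KEY, name TEXT, items_json TEXT)")
conn.execute(
    "INSERT INTO data(name, items_json) VALUES (?, ?)",
    ("Foo", json.dumps(["apples", "oranges", "tomatoes", "ananas"])),
)
conn.commit()

# json_each() expands the JSON array into rows, so membership is an exact
# comparison instead of LIKE '%ananas%' tricks.
rows = conn.execute(
    """
    SELECT data.id, data.name
    FROM data, json_each(data.items_json)
    WHERE json_each.value = 'ananas'
    """
).fetchall()
print(rows)  # [(1, 'Foo')]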
To elaborate on what @ussu said, ideally your table would have one row per thing-and-item pair, using IDs instead of names:

id  thing_id  item_id
 1         1        1
 2         1        2
 3         1        3
 4         1        4
 5         2        3
 6         2        5

Then look-up tables for the thing and item names:

id  name
 1  Foo
 2  Bar

id  name
 1  apples
 2  oranges
 3  tomatoes
 4  ananas
 5  bananas
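If you still want the original comma-separated presentation back for display, you can reassemble it with a join and group_concat. A sketch in Python/SQLite, assuming hypothetical table names things, items and thing_items matching the layout above:

import sqlite3

conn = sqlite3.connect("example.db")  # hypothetical file containing the three tables above

rows = conn.execute(
    """
    SELECT things.name,
           group_concat(items.name, ',') AS items_set
    FROM things
    JOIN thing_items ON thing_items.thing_id = things.id
    JOIN items ON items.id = thing_items.item_id
    GROUP BY things.id, things.name
    """
).fetchall()
print(rows)  # e.g. [('Foo', 'apples,oranges,tomatoes,ananas'), ('Bar', 'tomatoes,bananas')]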
In MySQL, you have the SET type.
Creation:
CREATE TABLE myset (col SET('a', 'b', 'c', 'd'));
Select:
mysql> SELECT * FROM tbl_name WHERE FIND_IN_SET('value',set_col)>0;
mysql> SELECT * FROM tbl_name WHERE set_col LIKE '%value%';
Insertion:
INSERT INTO myset (col) VALUES ('a,d'), ('d,a'), ('a,d,a'), ('a,d,d'), ('d,a,d');

How to update certain group of values in single column using SQL

I'm brand new to Python and to updating tables using SQL. I would like to ask how to update a certain group of values in a single column using SQL. Please see the example below:
id
123
999991234
235
789
200
999993456
I need to add the missing prefix '99999' to the records that don't have it. The id column has an integer data type by default. I tried the SQL statement below, but I get a conflict between data types, which is why I tried it with a CAST:
update tablename
set id = concat('99999', cast(id as string))
where id not like '99999%';
To be able to use the LIKE operator and the CONCAT() function, the column data type should be STRING or BYTES. In this case, you need to cast in the WHERE clause condition as well as in the value assigned in the SET clause.
Using your sample data, I ran this update script:
UPDATE mydataset.my_table
SET id = CAST(CONCAT('99999', CAST(id AS STRING)) AS INTEGER)
WHERE CAST(id as STRING) NOT LIKE '99999%'
Result: the rows were updated successfully and the table ended up with this data:
id
---------
99999123
999991234
99999235
99999789
99999200
999993456

How to avoid inserting duplicate rows when primary key (ID) is randomly generated in BigQuery

I have a table with a random auto-generated id (primary key). I am trying to avoid the insertion of duplicate rows.
Example of a duplicate row:
id | field a | field b | field c
---+---------+---------+--------
 1 |       4 |       6 |       7
 2 |       4 |       6 |       7
The key (id) is not a duplicate since it is generated with a UUID, but all other fields are identical.
I guess I'm looking for something like this, but in BigQuery's dialect: Avoiding inserting duplicate rows in MySQL
You can use an insert into ... where not exists ... query, and that is fine if you do it rarely, but it is something of an anti-pattern if you do it often.
This query needs to scan the table the row is inserted into, so it might get slow and expensive as this table becomes larger. Partitioning and clustering might help, but still if you insert a lot of rows one at a time, this might get costly.
A more common approach is to insert anything, and periodically do deduplication.
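For example, the periodic de-duplication step could simply rebuild the table, keeping one row per set of payload columns. A rough sketch with the Python BigQuery client, assuming a table mydataset.t with columns id, a, b, c:

from google.cloud import bigquery

client = bigquery.Client()

# Keep one arbitrary row per (a, b, c) combination and rebuild the table in place.
dedup_sql = """
CREATE OR REPLACE TABLE mydataset.t AS
SELECT * EXCEPT(rn)
FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY a, b, c) AS rn
  FROM mydataset.t
)
WHERE rn = 1
"""

client.query(dedup_sql).result()  # run this e.g. from a scheduled query or cron job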
You can use not exists if you want to avoid inserting duplicate rows into the table:
insert into t (id, a, b, c)
select i.*
from (select 2 as id, 4 as a, 6 as b, 7 as c) i
where not exists (select 1
                  from t
                  where t.a = i.a and t.b = i.b and t.c = i.c);
To help protect your table against duplication, set the insertId property when sending your streaming insert request. BigQuery uses the insertId property for best-effort de-duplication.
new BigQueryInsertRow(insertId: "row1"){
...
},
new BigQueryInsertRow(insertId: "row2") {
...
}
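That snippet uses the C# / .NET client; with the Python client, the same idea goes through the row_ids argument of insert_rows_json (the table name and rows below are made up):

from google.cloud import bigquery

client = bigquery.Client()

rows = [
    {"a": 4, "b": 6, "c": 7},
    {"a": 5, "b": 8, "c": 9},
]

# row_ids plays the role of insertId: BigQuery uses it for best-effort
# de-duplication of streaming inserts.
errors = client.insert_rows_json(
    "mydataset.t",  # hypothetical table
    rows,
    row_ids=["row1", "row2"],
)
print(errors)  # [] if every row was inserted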

Access all column values of joined tables with SqlAlchemy

Imagine one has two SQL tables
objects_stock
id | number
and
objects_prop
id | obj_id | color | weight
that should be joined on objects_stock.id=objects_prop.obj_id, hence the plain SQL-query reads
select * from objects_prop join objects_stock on objects_stock.id = objects_prop.obj_id;
How can this query be performed with SqlAlchemy such that all returned columns of this join are accessible?
When I execute
query = session.query(ObjectsStock).join(ObjectsProp, ObjectsStock.id == ObjectsProp.obj_id)
results = query.all()
with ObjectsStock and ObjectsProp the appropriate mapped classes, the list results contains objects of type ObjectsStock - why is that? What would be the correct SqlAlchemy-query to get access to all fields corresponding to the columns of both tables?
Just in case someone encounters a similar problem: the best way I have found so far is listing the columns to fetch explicitly,
query = session.query(ObjectsStock.id, ObjectsStock.number, ObjectsProp.color, ObjectsProp.weight).\
select_from(ObjectsStock).join(ObjectsProp, ObjectsStock.id == ObjectsProp.obj_id)
results = query.all()
Then one can iterate over the results and access the properties by their original column names, e.g.
for r in results:
print(r.id, r.color, r.number)
A shorter way of achieving the result of @ctenar's answer is by unpacking the columns using the star operator:
query = (
session
.query(*ObjectsStock.__table__.columns, *ObjectsProp.__table__.columns)
.select_from(ObjectsStock)
.join(ObjectsProp, ObjectsStock.id == ObjectsProp.obj_id)
)
results = query.all()
This is useful if your tables have many columns.

Use value from dictionary as column header in Sqlite query

I have an SQLite database that I need to query from Python.
The database only has two columns, "key" and "Value", and the value column contains a dictionary with multiple values. What I want to do is create a query to use some of those known dictionary keys as column headers, and the corresponding data under that column.
Is it possible to do that all in a query, or will I have to process the dictionary in python afterwards?
Example data (values obviously have been changed) that I want to query.
key                        | Value
/auth/user_data/fb_me_user | {"uid":"100008112345597","first_name":"Tim","last_name":"Robins","name":"Tim Robins","emails":["t.robins@gmail.com"]}
There are lots of other key / value combinations, but this is one of the ones I am interested in.
I would like to query this to produce the following;
UID             | Name       | Email
100008112345597 | Tim Robins | t.robins@gmail.com
Is that possible just in a query?
Thanks
After querying, you get a value like the one below, and from it you can extract the fields you need:
import ast

value = '''{"uid":"100008112345597","first_name":"Tim","last_name":"Robins","name":"Tim Robins","emails":["t.robins@gmail.com"]}'''

details = ast.literal_eval(value)
print(details['uid'], details['name'], ','.join(details['emails']))
